ETL pipeline with PySpark on Dataproc for data lake on Google Cloud Storage
Updated Mar 17, 2021 · Python
Checking the scalability of a data pipeline involving MySQL, Spark, and machine learning models, using latency as the metric.
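The scalability check described above can be sketched as a small per-stage latency harness. The stage names and workloads below are illustrative stand-ins for the repository's MySQL query, Spark transform, and model-scoring steps, not its actual code.

```python
import time
from contextlib import contextmanager


@contextmanager
def latency_timer(results, stage):
    """Record wall-clock latency (seconds) of one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[stage] = time.perf_counter() - start


def run_pipeline(n_rows):
    """Run dummy extract/transform/score stages and return per-stage latency."""
    results = {}
    with latency_timer(results, "extract"):    # stand-in for a MySQL query
        rows = [(i, i * 2.0) for i in range(n_rows)]
    with latency_timer(results, "transform"):  # stand-in for a Spark job
        features = [x * 0.5 + 1.0 for _, x in rows]
    with latency_timer(results, "score"):      # stand-in for model inference
        _ = [1.0 if f > 10.0 else 0.0 for f in features]
    return results


# Compare latency as the input size grows to gauge scalability.
for size in (1_000, 10_000):
    timings = run_pipeline(size)
```

Plotting latency against input size then shows whether any stage grows faster than linearly.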
This repository contains code for comparing the performance of three different ELT (Extract, Load, Transform) methods on CSV files of different sizes. The three methods are implemented in Python using different approaches and libraries, and their execution times are compared and plotted for analysis.
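The benchmark idea described there, timing several CSV-loading approaches on inputs of different sizes, can be sketched as follows. The three methods here are generic stand-ins rather than the repository's exact implementations, and the plotting step is left out.

```python
import csv
import io
import time


def make_csv(n_rows):
    """Build an in-memory CSV with a header and n_rows data rows."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "value"])
    for i in range(n_rows):
        writer.writerow([i, i * 1.5])
    return buf.getvalue()


def load_csv_reader(text):
    """Method 1: the csv module's plain reader, skipping the header."""
    return list(csv.reader(io.StringIO(text)))[1:]


def load_dict_reader(text):
    """Method 2: csv.DictReader, yielding one dict per row."""
    return list(csv.DictReader(io.StringIO(text)))


def load_manual_split(text):
    """Method 3: naive string splitting (no quoting support)."""
    return [line.split(",") for line in text.splitlines()[1:]]


def compare_methods(sizes):
    """Time each loading method on CSVs of each size."""
    methods = {
        "csv.reader": load_csv_reader,
        "csv.DictReader": load_dict_reader,
        "str.split": load_manual_split,
    }
    timings = {}
    for n in sizes:
        text = make_csv(n)
        for name, fn in methods.items():
            start = time.perf_counter()
            fn(text)
            timings[(name, n)] = time.perf_counter() - start
    return timings
```

The resulting `timings` dict maps (method, size) pairs to seconds, ready to feed into a plot.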
A custom Airbyte connector to fetch football data from the Football-Data.org API. It allows users to retrieve match results, league tables, and player statistics for specific leagues, making it a versatile tool for football data analysis.
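A fetch against the Football-Data.org API typically sends an `X-Auth-Token` header and receives JSON containing a `matches` array. The sketch below assumes that response shape and the v4 endpoint path; `token` is a placeholder for a real API key, and the flattening helper is a hypothetical name, not part of the connector.

```python
import json
from urllib.request import Request, urlopen

API_BASE = "https://api.football-data.org/v4"  # API version assumed


def fetch_matches(competition, token):
    """Fetch match results for a competition (network call; not exercised here)."""
    req = Request(
        f"{API_BASE}/competitions/{competition}/matches",
        headers={"X-Auth-Token": token},
    )
    with urlopen(req) as resp:
        return json.load(resp)


def summarise_matches(payload):
    """Flatten the assumed response shape into (home, away, home_goals, away_goals)."""
    out = []
    for m in payload.get("matches", []):
        score = m.get("score", {}).get("fullTime", {})
        out.append(
            (
                m["homeTeam"]["name"],
                m["awayTeam"]["name"],
                score.get("home"),
                score.get("away"),
            )
        )
    return out
```

An Airbyte connector would wrap such a fetch in a stream's `read_records` method and emit one record per match.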
DAGs adapting the MRI preprocessing pipeline to Airflow
Kedro tests
An extension enabling the monitoring of Apache Airflow DAGs directly from Jupyter notebooks. Tailored for developers and data scientists, it simplifies tracking specific DAGs, reduces unnecessary friction, and allows severity levels setup for failed DAGs.
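Monitoring DAG runs from a notebook can go through Airflow's stable REST API. The sketch below assumes Airflow 2's `/api/v1` convention for listing DAG runs; the severity thresholds and function names are illustrative, not the extension's actual interface.

```python
import json
from urllib.request import Request, urlopen


def fetch_dag_runs(base_url, dag_id, auth_header):
    """Query Airflow's REST API for recent runs of one DAG (network call)."""
    req = Request(
        f"{base_url}/api/v1/dags/{dag_id}/dagRuns?limit=25",
        headers={"Authorization": auth_header},
    )
    with urlopen(req) as resp:
        return json.load(resp)["dag_runs"]


def classify_severity(dag_runs, warn_at=1, critical_at=3):
    """Map a list of run dicts to a severity level by counting failed runs."""
    failed = sum(1 for r in dag_runs if r.get("state") == "failed")
    if failed >= critical_at:
        return "critical"
    if failed >= warn_at:
        return "warning"
    return "ok"
```

A notebook cell could then poll `fetch_dag_runs` on a timer and surface `classify_severity`'s result inline.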
Data Pipelines with Airflow
A repository for the Methods of Advanced Data Engineering course at FAU
ETL pipeline with AWS Redshift orchestrated with Airflow
Data pipeline to gather data from chess website APIs using Airflow.
An end-to-end data pipeline deployed on GCP that extracts cryptocurrency data for analytics.
The mini project for the Database Technologies course. The task is to ingest data via a pipeline built with Spark Streaming and Kafka, and to store the processed data in a SQLite database for further manipulation.
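The storage end of such a pipeline, writing processed records into SQLite, can be sketched with the standard `sqlite3` module. The Kafka/Spark Streaming side is only indicated in comments, and the `events` table schema here is an assumption for illustration.

```python
import sqlite3


def init_db(conn):
    """Create an assumed table for processed events."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS events (
               event_id TEXT PRIMARY KEY,
               category TEXT,
               amount   REAL
           )"""
    )


def store_batch(conn, records):
    """Upsert one micro-batch of (event_id, category, amount) tuples,
    as a Spark Structured Streaming foreachBatch sink might emit them."""
    conn.executemany(
        "INSERT OR REPLACE INTO events (event_id, category, amount) VALUES (?, ?, ?)",
        records,
    )
    conn.commit()


# In the full pipeline, records would arrive on a Kafka topic, be processed
# by Spark Streaming, and be handed to store_batch batch by batch.
conn = sqlite3.connect(":memory:")
init_db(conn)
store_batch(conn, [("e1", "order", 9.5), ("e2", "refund", -3.0)])
```

Using the primary key with `INSERT OR REPLACE` makes batch replays idempotent, which matters when a streaming job restarts and re-delivers a batch.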
Deployable AWS data platform to process powerlifting data extracted from openpowerlifting.org.
A PySpark-based data cleaning pipeline for Glassdoor reviews.
Package for simple interaction with the Kaggle API, for Brane data pipelines.
Data Pipeline with Airflow Project from Udacity Data Engineer Nanodegree
A toolset for data pipelines in Genomics
The goal of this project is to build a data pipeline for gathering real-time carpark-lot availability and weather datasets from Data.gov.sg. The data are extracted via API and stored in an S3 bucket before being ingested into the data warehouse.
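The extract-and-flatten step against Data.gov.sg's real-time carpark-availability endpoint can be sketched as below. The JSON shape (an `items` list holding `carpark_data` entries) is an assumption based on the public API, and the S3 upload is only indicated in a comment to keep the sketch self-contained.

```python
import json
from urllib.request import urlopen

# Endpoint path assumed from Data.gov.sg's real-time transport APIs.
CARPARK_URL = "https://api.data.gov.sg/v1/transport/carpark-availability"


def fetch_availability():
    """Fetch the live payload (network call; not exercised here)."""
    with urlopen(CARPARK_URL) as resp:
        return json.load(resp)


def to_rows(payload):
    """Flatten the assumed payload into (carpark_number, lot_type, lots_available)."""
    rows = []
    for item in payload.get("items", []):
        for cp in item.get("carpark_data", []):
            for info in cp.get("carpark_info", []):
                rows.append(
                    (
                        cp["carpark_number"],
                        info.get("lot_type"),
                        int(info.get("lots_available", 0)),
                    )
                )
    return rows


# The flattened rows would then be serialised (e.g. to CSV or JSON) and
# uploaded to the S3 bucket before ingestion into the warehouse.
```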