This repo contains my experimental projects in Data Engineering.
Automate Apache Spark in Hadoop with Airflow in Cloud
We build an ETL pipeline using Airflow that accomplishes the following: downloads data from an AWS S3 bucket, runs a Spark/Spark SQL job on the downloaded data to produce a cleaned-up dataset of orders that missed their delivery deadline, and uploads the cleaned-up dataset back to the same S3 bucket in a folder primed for higher-level analytics.
Keywords: Python, Airflow, AWS, S3, Redshift, ETL
Udacity project within the Data Engineer Nanodegree
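A minimal plain-Python sketch of the cleaning step that pipeline describes, assuming hypothetical column names (`order_id`, `promised`, `delivered`); the actual repo expresses this logic as a Spark/Spark SQL job over the S3 data.

```python
from datetime import date

# Hypothetical order records; the real pipeline reads these from S3.
orders = [
    {"order_id": 1, "promised": date(2023, 3, 1), "delivered": date(2023, 3, 3)},
    {"order_id": 2, "promised": date(2023, 3, 2), "delivered": date(2023, 3, 2)},
]

def missed_deadline(order):
    """An order misses its deadline when delivery falls after the promised date."""
    return order["delivered"] > order["promised"]

# Keep only the orders that missed their delivery deadline.
# Roughly equivalent Spark SQL:
#   SELECT * FROM orders WHERE delivered > promised
late_orders = [o for o in orders if missed_deadline(o)]
print([o["order_id"] for o in late_orders])  # → [1]
```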
PySpark Analysis from log files
Constructing a protein fragment database in the context of Lyme disease.
Create an animation from the NYC Taxi dataset
A template to quickly set up the prefect workflow orchestration engine.
ETL pipeline with PySpark on Dataproc for data lake on Google Cloud Storage
This is a data engineering-focused project that used Python, SQL, and Airflow to perform an ETL job of my Spotify listening data and send me an automated email of my weekly listening habits.
Deep feed forward neural network predicting taxi fare prices. Project features data & feature engineering.
Scrape data from websites with Python
Data pipeline that fetches songs played in the past 24 hours using the Spotify API and saves the data in a SQLite database. Scheduled to run daily using Apache Airflow.
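A stdlib sketch of the save step in such a pipeline, assuming the JSON shape of Spotify's recently-played response (`items[].played_at`, `items[].track.name`); the API call itself and the daily Airflow schedule are omitted.

```python
import sqlite3

def save_recently_played(payload: dict, conn: sqlite3.Connection) -> int:
    """Insert tracks from a Spotify recently-played response into SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS plays (played_at TEXT PRIMARY KEY, track TEXT)"
    )
    rows = [
        (item["played_at"], item["track"]["name"])
        for item in payload.get("items", [])
    ]
    # INSERT OR IGNORE keeps the daily run idempotent if a play is re-fetched.
    conn.executemany("INSERT OR IGNORE INTO plays VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)

# Example with a fabricated two-track payload (real data comes from the API).
payload = {"items": [
    {"played_at": "2023-03-06T10:00:00Z", "track": {"name": "Song A"}},
    {"played_at": "2023-03-06T10:04:00Z", "track": {"name": "Song B"}},
]}
conn = sqlite3.connect(":memory:")
print(save_recently_played(payload, conn))  # → 2
```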
Using the Spotify API to create a minimal and basic ETL, while learning in the process.
Creating a data pipeline to extract data from Spotify and save the songs listened to each day into a local SQLite database.
End-to-end ML system - Batch Ingestion