- Airflow
- Data Generator (Producers)
- Spark Streaming
- Web App
Increments is a project that uses a streaming data pipeline to process data incrementally. The goal is to give analysts near-real-time processed data instead of making them wait hours for batch jobs to finish.
The airflow folder contains the DAGs and the Redshift upload script. The Redshift job is scheduled to run every 5 minutes against partitioned data in S3.
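A minimal sketch of what the scheduled upload could look like, assuming Airflow 2.x with the Postgres provider; the DAG id, connection id, table, bucket, and IAM role below are hypothetical placeholders, not the project's actual names:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def upload_partition_to_redshift(data_interval_start=None, **context):
    """COPY the 5-minute S3 partition for this run into Redshift."""
    # Derive the partition key from the run's interval start
    # (partition layout is an assumption for illustration).
    partition = data_interval_start.strftime("%Y%m%d%H%M")
    hook = PostgresHook(postgres_conn_id="redshift_default")  # hypothetical conn id
    hook.run(f"""
        COPY web_events
        FROM 's3://increments-bucket/events/dt={partition}/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        FORMAT AS PARQUET;
    """)


with DAG(
    dag_id="redshift_incremental_upload",        # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(minutes=5),      # run every 5 minutes
    catchup=False,
) as dag:
    PythonOperator(
        task_id="upload_partition",
        python_callable=upload_partition_to_redshift,
    )
```

Running the COPY from Airflow rather than from Spark keeps the load step idempotent per 5-minute partition: a failed run can simply be retried against the same S3 prefix.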
The data generator folder contains producer scripts that generate web event data. There are two versions of these scripts: one for Kinesis and one for Kafka.
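A minimal sketch of the Kinesis variant, assuming boto3; the stream name and event schema are hypothetical. The Kafka variant would swap the boto3 client for kafka-python's `KafkaProducer` and `send()`:

```python
import json
import random
import time
import uuid
from datetime import datetime, timezone

import boto3

kinesis = boto3.client("kinesis")

PAGES = ["/home", "/product", "/cart", "/checkout"]


def fake_web_event():
    """Generate one synthetic web event (schema is illustrative)."""
    return {
        "event_id": str(uuid.uuid4()),
        "user_id": random.randint(1, 10_000),
        "page": random.choice(PAGES),
        "ts": datetime.now(timezone.utc).isoformat(),
    }


while True:
    event = fake_web_event()
    kinesis.put_record(
        StreamName="web-events",                 # hypothetical stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["user_id"]),      # spread load across shards
    )
    time.sleep(0.1)  # ~10 events/sec
```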
The Spark streaming job uses PySpark and sets up a stream that processes data every minute and uploads it to S3 in partitions. The job uses Spark SQL to process the data. The two current jobs, sketched after this list, are:
- Pre-aggregate data
- Partition and sort the raw logs for Redshift
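Both jobs follow the same shape: read the stream, transform with Spark SQL, and write partitioned Parquet to S3 on a one-minute trigger. A minimal sketch of the pre-aggregation job using Structured Streaming (the original may use the DStream API instead); the Kafka topic, schema, and S3 paths are hypothetical, and the Kafka source assumes the spark-sql-kafka package is on the classpath:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("increments-streaming").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("page", StringType()),
    StructField("ts", TimestampType()),
])

# Read raw JSON events from Kafka and parse them into columns.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical
    .option("subscribe", "web-events")                    # hypothetical topic
    .load()
)
events = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .withWatermark("ts", "2 minutes")  # needed for append-mode aggregation
)

# Pre-aggregate with Spark SQL: page views per page per minute.
events.createOrReplaceTempView("events")
agg = spark.sql("""
    SELECT window(ts, '1 minute') AS win, page, count(*) AS views
    FROM events
    GROUP BY window(ts, '1 minute'), page
""")

# Write to S3, partitioned by date, triggering a micro-batch every minute.
query = (
    agg.withColumn("dt", F.to_date(F.col("win.start")))
    .writeStream
    .outputMode("append")
    .format("parquet")
    .option("path", "s3a://increments-bucket/aggregates/")            # hypothetical
    .option("checkpointLocation", "s3a://increments-bucket/chk/agg/")  # hypothetical
    .partitionBy("dt")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```

The raw-log job would follow the same pattern but skip the aggregation, instead repartitioning and sorting the parsed events before writing, so the Redshift COPY reads clean, partitioned files.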
The web app initially used Airbnb's Superset; due to performance issues, it is being rebuilt as a Plotly Dash application.