SONGSTREAMS

Data Engineering on a simulated song streaming application with Kafka, PySpark, dbt, S3, Redshift.


Assumption:

Streamify is a music streaming company that prides itself on the satisfaction of its users. The intelligence of the application comes from its team of data techies, who track, monitor, and curate playlists uniquely for each user, making it less likely that a user skips through a track, because this app knows them so well.

Some common questions asked by the Business Intelligence team are (a query sketch follows this list):

  • What song/genre does user A play the most?
  • What artists are listened to the most by each user?
  • At what time are these particular artists listened to?
  • What are the most played songs per location?
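As a rough illustration, a question like "most played songs per location" could be answered with a Spark aggregation over the events once they land in the lake. This is a minimal sketch, not code from this repo; the column names (song, city) and the lake path are assumptions.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("bi-questions-sketch").getOrCreate()

    # Hypothetical schema: one row per listen event, with at least `song` and
    # `city` columns. The bucket/path is a placeholder for the lake's CSV files.
    events = spark.read.csv(
        "s3a://songstreams-lake/listen_events/", header=True, inferSchema=True
    )

    # Most played songs per location.
    most_played_per_location = (
        events.groupBy("city", "song")
        .agg(F.count("*").alias("play_count"))
        .orderBy("city", F.desc("play_count"))
    )

    most_played_per_location.show(truncate=False)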

Tools

  • Terraform (Infrastructure as Code)
  • Apache Kafka (Streaming)
  • Apache Spark (Streamed data processor)
  • Apache Airflow (Workflow management)
  • AWS S3 (Data lake)
  • dbt (Data Transformation)
  • Amazon Redshift (Data Warehouse)

Streamed Data Samples

Data Representation

All data in the lake (AWS S3) is stored in CSV format.
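As a rough sketch of what writing a batch of events to the lake looks like (the bucket name is a placeholder; s3a access additionally needs the hadoop-aws package and AWS credentials configured):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lake-write-sketch").getOrCreate()

    # Tiny stand-in DataFrame; in the pipeline this would be a batch of streamed events.
    events = spark.createDataFrame(
        [("user_1", "Song A", "Lagos"), ("user_2", "Song B", "Abuja")],
        ["user_id", "song", "city"],
    )

    # Append the batch to the lake as headered CSV files.
    events.write.mode("append").option("header", True).csv(
        "s3a://songstreams-lake/listen_events/"
    )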

Data Transformation - dbt

Analytics

ERD

How to run this project

Get Docker installed and running

  • Visit the Docker page to install Docker on your Mac, Windows, or Linux OS. Test the installation by running the hello-world image (docker run hello-world). If this works, you're good to go.
  • Don't forget to log in from your terminal with docker login or sudo docker login, entering your username and password.

Get ZooKeeper, Kafka, and its brokers up and running

  • Run the following command to power up Kafka: cd kafka && docker compose build && docker compose up
  • If everything builds well, you'll be able to view the Confluent Control Center UI in your browser at localhost:9021.
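If you'd rather verify the broker from the command line, the sketch below lists the topics the cluster knows about. It assumes a broker is exposed on localhost:9092 and that the kafka-python package is installed; the actual port depends on the compose file.

    from kafka import KafkaConsumer  # pip install kafka-python

    # Connect to the broker and print the set of topics it currently knows about.
    consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
    print(consumer.topics())
    consumer.close()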

Start streaming Eventsim data

  • cd scripts && bash eventsim_startup.sh
  • (optional) Run docker logs --follow million_events to follow the container's logs.
  • It may take a while for these topics to show up in your UI, but once they do, you should have about four topics altogether (a sketch for peeking at the events from the command line follows this list).
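To peek at the streamed events without the UI, something like the sketch below works. The topic name listen_events and the broker address are assumptions; adjust them to match your setup.

    import json

    from kafka import KafkaConsumer  # pip install kafka-python

    # Read a handful of eventsim messages from one topic and print them.
    consumer = KafkaConsumer(
        "listen_events",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        consumer_timeout_ms=10_000,  # give up if nothing arrives for 10 seconds
    )

    for i, message in enumerate(consumer):
        print(json.loads(message.value))
        if i >= 4:
            break

    consumer.close()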

Listen Via Spark

  • Run cd lake && python extraction.py. Spark reads data from the broker(s) every 120 seconds, and each read is saved to new CSV files using Spark's default partition naming. Watch Spark perform its magic ;) (an illustrative sketch of this kind of job is shown below).
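For orientation, the sketch below shows the general shape of such a job: a Spark Structured Streaming read from Kafka, written out as CSV micro-batches every 120 seconds. It is illustrative only, not the repo's extraction.py; the topic name, broker address, and output paths are assumptions, and the spark-sql-kafka connector must be on the classpath.

    from pyspark.sql import SparkSession

    # Requires the Kafka connector for Structured Streaming, e.g. launched with
    #   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<spark version> extraction_sketch.py
    spark = SparkSession.builder.appName("songstreams-extraction-sketch").getOrCreate()

    # Subscribe to a topic on the local broker (names are placeholders).
    raw_events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "listen_events")
        .option("startingOffsets", "earliest")
        .load()
    )

    # Kafka delivers the payload as bytes; cast it to a string before writing.
    decoded = raw_events.selectExpr("CAST(value AS STRING) AS value")

    # Emit one micro-batch of CSV files every 120 seconds.
    query = (
        decoded.writeStream
        .format("csv")
        .option("path", "output/listen_events")
        .option("checkpointLocation", "output/checkpoints/listen_events")
        .trigger(processingTime="120 seconds")
        .start()
    )

    query.awaitTermination()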
