A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.
They'd like a data engineer to create a Postgres database with tables designed to optimize queries on song play analysis, and bring you on the project. Your role is to create a database schema and ETL pipeline for this analysis. You'll be able to test your database and ETL pipeline by running queries given to you by the analytics team from Sparkify and compare your results with their expected results.
- Python
- web development (server-side),
- software development,
- mathematics,
- system scripting.
- PostgreSQL
- Pandas
Python is a popular programming language. It was created by Guido van Rossum, and released in 1991. It is used for:
(Retrieved from : python.org)
PostgreSQL is a powerful, open source object-relational database system with over 30 years of active development that has earned it a strong reputation for reliability, feature robustness, and performance.
(Retrieved from : postgresql.org)
pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
(Retrieved from : pydata.org)
- 1. Establish a python environment or Jupyter Notebook
- 2. Make sure the following python libraries are available/installed: glob, psycopg2, numpy, and os
- 3. Download Files
- 4. Instantiate database by running create_tables first
- 5. Run etl.py
- 6. Run queries
As required by Sparkify, the focus of the database is faster and simpler queries of a large data set. Star schema satisfies this requirement and has been implemented as part of the solution in order to avoid complex JOIN queries.
STAR SCHEMA (1st image)Data is stored in a dataframe ----> Data is extracted into lists ----> Lists are inserted into their respective tables ----> Process is repeated until all files are processed
Terminal / Jupyter Notebook Outputs