POSTGRES AND PANDAS ETL

Context Problem

For a start-up with collections of data on songs and user activity on a music streaming app. An analytics team is required to perform operations leading to discovery, interpretation, and communication of meaningful patterns in data, vto do this various queries has to be carried out on their data, which resides in a directory of JSON logs, as well as a directory with JSON metadata on the songs. Currently, there isn't an easy way to carry out this query.

Solution

The solution proposed in this repo is to create a Postgres database with tables designed to optimize queries on song play analysis also create a database schema and ETL pipeline for this analysis. The following concepts of data engineering are brought into practice:

Relational data modeling using Postgresql.

Queries and joins are particular in the tools of the analytics team, with the data being structured and not voluminous enough to require big data analytics tools such as Nosql, Spark. Hence, the use of relational data modeling.

Fact and dimensional table modeling.

Since analytics is the major use case for this database and not trasanctional processes a star schema is used, with fact and dimension tables.

Extarct Transform and Load pipeline developed with Python,SQL and Pandas.

Data is extracted and transformed from the json log using python and pandas and loaded into the fact and dimension tables with sql queries.

File Structure

sql_queries.py contains the sql queries used.
create_tables.py resets the database(drops and create table).
connection_manger.py contains context manager to help manage cursors and connections to the database, including decorators for error handling.
etl.py extracts, transforms and loads data into tables.
data folder, contains the data used.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
data		data
README.md		README.md
connection_manger.py		connection_manger.py
create_table.py		create_table.py
etl.py		etl.py
sql_queries.py		sql_queries.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

POSTGRES AND PANDAS ETL

Context Problem

Solution

Relational data modeling using Postgresql.

Fact and dimensional table modeling.

Extarct Transform and Load pipeline developed with Python,SQL and Pandas.

File Structure

About

Releases

Packages

Languages

theKoladeAkande/etl-pipeline

Folders and files

Latest commit

History

Repository files navigation

POSTGRES AND PANDAS ETL

Context Problem

Solution

Relational data modeling using Postgresql.

Fact and dimensional table modeling.

Extarct Transform and Load pipeline developed with Python,SQL and Pandas.

File Structure

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages