GitHub - renzsupnet/Apache-Cassandra: This is a project for Udacity's Data Engineering nanodegree program. It's goal is to satisfy a mock start-up -Sparkify's requirement of querying a large amount of data.

Project: Data Modeling with Postgres

Problem Statement:

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.

They'd like a data engineer to create a Postgres database with tables designed to optimize queries on song play analysis, and bring you on the project. Your role is to create a database schema and ETL pipeline for this analysis. You'll be able to test your database and ETL pipeline by running queries given to you by the analytics team from Sparkify and compare your results with their expected results.

Technology Used:

Python

Python is a popular programming language. It was created by Guido van Rossum, and released in 1991. It is used for:

web development (server-side),
software development,
mathematics,
system scripting.

(Retrieved from : python.org)

PostgreSQL

PostgreSQL is a powerful, open source object-relational database system with over 30 years of active development that has earned it a strong reputation for reliability, feature robustness, and performance.

(Retrieved from : postgresql.org)

Pandas

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

(Retrieved from : pydata.org)

Installation

1. Establish a python environment or Jupyter Notebook
- python installation
- jupyter installation
2. Make sure the following python libraries are available/installed: glob, psycopg2, numpy, and os
3. Download Files
4. Instantiate database by running create_tables first
5. Run etl.py
6. Run queries

Database Scheme

As required by Sparkify, the focus of the database is faster and simpler queries of a large data set. Star schema satisfies this requirement and has been implemented as part of the solution in order to avoid complex JOIN queries.

STAR SCHEMA (1st image)

ETL PROCESS

Data is stored in a dataframe ----> Data is extracted into lists ----> Lists are inserted into their respective tables ----> Process is repeated until all files are processed

QUERY SAMPLES

Terminal / Jupyter Notebook Outputs

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
data		data
README.md		README.md
create_tables.py		create_tables.py
etl.ipynb		etl.ipynb
etl.py		etl.py
sql_queries.py		sql_queries.py
test.ipynb		test.ipynb
testing.pem		testing.pem

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Project: Data Modeling with Postgres

Problem Statement:

Technology Used:

Installation

Database Scheme

ETL PROCESS

QUERY SAMPLES

About

Releases

Packages

Languages

renzsupnet/Apache-Cassandra

Folders and files

Latest commit

History

Repository files navigation

Project: Data Modeling with Postgres

Problem Statement:

Technology Used:

Installation

Database Scheme

ETL PROCESS

QUERY SAMPLES

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages