Skip to content

This is a project for Udacity's Data Engineering nanodegree program. It's goal is to satisfy a mock start-up -Sparkify's requirement of querying a large amount of data.

Notifications You must be signed in to change notification settings

renzsupnet/Apache-Cassandra

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Project: Data Modeling with Postgres


Problem Statement:


A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.

They'd like a data engineer to create a Postgres database with tables designed to optimize queries on song play analysis, and bring you on the project. Your role is to create a database schema and ETL pipeline for this analysis. You'll be able to test your database and ETL pipeline by running queries given to you by the analytics team from Sparkify and compare your results with their expected results.



Technology Used:


  • Python
  • Python is a popular programming language. It was created by Guido van Rossum, and released in 1991. It is used for:

    • web development (server-side),
    • software development,
    • mathematics,
    • system scripting.

    (Retrieved from : python.org)


  • PostgreSQL
  • PostgreSQL is a powerful, open source object-relational database system with over 30 years of active development that has earned it a strong reputation for reliability, feature robustness, and performance.

    (Retrieved from : postgresql.org)


  • Pandas
  • pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

    (Retrieved from : pydata.org)

Installation

  • 1. Establish a python environment or Jupyter Notebook
  • 2. Make sure the following python libraries are available/installed: glob, psycopg2, numpy, and os
  • 3. Download Files
  • 4. Instantiate database by running create_tables first
  • 5. Run etl.py
  • 6. Run queries

Database Scheme

As required by Sparkify, the focus of the database is faster and simpler queries of a large data set. Star schema satisfies this requirement and has been implemented as part of the solution in order to avoid complex JOIN queries.

STAR SCHEMA (1st image)

ETL PROCESS

Data is stored in a dataframe ----> Data is extracted into lists ----> Lists are inserted into their respective tables ----> Process is repeated until all files are processed

QUERY SAMPLES

Terminal / Jupyter Notebook Outputs


About

This is a project for Udacity's Data Engineering nanodegree program. It's goal is to satisfy a mock start-up -Sparkify's requirement of querying a large amount of data.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published