Data Modeling With Postgres

Project repo for the Udacity Data Engineering Program Project 1.

This README file includes a summary of the project, how to run the Python scripts, and an explanation of the files in the repository.

Getting Started

You will need to first connect and build the tables for the Sparkify database. To do this run the following command:

python create_tables.py

To load data into the Sparkify database run the following command:

python etl.py

Prerequisites

Baseline configured PostgreSQL database.
The following Python libraries need to be installed in the environment.
- psycopg
- pandas

Purpose

The purpose of this database is to conduct ETL operations and store data from user activity from the Sparkify app.
This data will be used by the Sparkify analytics team will use this data gain a greater understanding of user activity and songs being listened to.

Database Schema

There are 5 tables in the database. This design focuses on the songplay table which houses the most important information for the analytics team. The supporting tables of time, users, songs, and artists help to provide context and additional details for the songplay table. The time table which contains:

Field	Data Type	Key
start_time	INT	Primary
hour	INT
day	INT
week	INT
month	INT
year	INT
weekday	INT

The users table which contains:

Field	Data Type	Key
user_id	VARCHAR	Primary
first_name	VARCHAR
last_name	VARCHAR
gender	VARCHAR
level	VARCHAR

The songs table which contains:

Field	Data Type	Key
song_id	VARCHAR	Primary
title	VARCHAR
artist_id	VARCHAR	Foreign Key
year	INT
duration	NUMERIC

The artists table which contains:

Field	Data Type	Key
artist_id	VARCHAR	Primary
name	VARCHAR
location	VARCHAR	Foreign Key
latitude	real
longitude	real

The songplay table which contains:

Field	Data Type	Key
songplay_id	INT	Primary
start_time	INT	Foreign Key
user_id	VARCHAR	Foreign Key
song_id	VARCHAR	Foreign Key
artist_id	VARCHAR	Foreign Key
session_id	INT
location	VARCHAR
user_agent	VARCHAR

Example Queries

To get the top 5 users by session.

SELECT users.last_name, users.first_name, COUNT(songplay.user_id) FROM songplay INNER JOIN users ON songplay.user_id=users.user_id GROUP BY users.user_id ORDER BY count DESC LIMIT 5;

To get the 10 locations with the least amount of user sessions

SELECT songplay.location, count(songplay.location) FROM songplay where location is not null GROUP BY songplay.location ORDER BY count limit 10;

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
assets/images		assets/images
data		data
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
create_tables.py		create_tables.py
etl.py		etl.py
sql_queries.py		sql_queries.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Data Modeling With Postgres

Getting Started

Prerequisites

Purpose

Database Schema

Example Queries

About

Uh oh!

Releases

Packages

Languages

tvanpat/postgresql-data-modeling

Folders and files

Latest commit

History

Repository files navigation

Data Modeling With Postgres

Getting Started

Prerequisites

Purpose

Database Schema

Example Queries

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages