# Assignment #6 - Data Gathering and Warehousing - DSSA-5102

Instructor: Melissa Laurino<br>
Spring 2025<br>

Name: Melissa Laurino/Instructor Guide
<br>
Date: 2/23/25
<br>
<br>
**At this time in the semester:** <br>
- We have explored a dataset. <br>
- We have cleaned our dataset. <br>
- We created a GitHub account with a repository for this class and included a metadata README file about our data. <br>
- We introduced general SQL syntax, queries, and applications in Python.<br>
<br>
Now we will start the process of uploading our dataset into a database. There are many different ways to load your .csv data into a database (.db file). Databases can be created in many open source applications, in MySQL Workbench, and even on some websites that will load your .csv data into a database...for a small fee. Instead of using an application, we are going to first create the database for our dataset from scratch in Python. On a much larger scale, data may be automatically uploaded to a database as soon as it is acquired.<br>

#### Assignment #6 Objectives

We will use the Python packages SQLAlchemy and sqlite3 to create three separate databases for practice. 
- Create one database on our MySQL server (10)
  - Create and populate our first table with appropriate data types
  - View the MySQL Workbench schema to see the table you created
- Create one test database locally that we can still use with MySQL (3)
- Create one test database locally as a .db file. (2) <br>
<br>
Follow the instructions below to complete the assignment. For submission, please include your .ipynb file with output cells (or a link to GitHub) and the screenshot of your first database table in MySQL Workbench. Answer any questions in markdown cells. Be sure to comment all code.


### Creating our database from scratch to integrate with MySQL Workbench in Python<br>

**BEFORE YOU BEGIN!**<br>
Is your MySQL Server running on your local machine?<br>
**Start the server** if it is not running already.

SQLAlchemy needs a MySQL driver to talk to the MySQL server from Python, so let's install the MySQL connector. Run the following command in a terminal window: <br>
pip install mysql-connector-python

#### Creating a database from scratch in Python using SQLAlchemy<br>
Additional sources: <br>
-- https://medium.com/@sandyjtech/creating-a-database-using-python-and-sqlalchemy-422b7ba39d7e <br>
-- https://www.youtube.com/watch?v=xr7vDSFXjW0 <br>
-- https://www.geeksforgeeks.org/how-to-design-a-database-for-spotify/ (My specific inspiration for understanding a Spotify schema)

In [185]:
# Load necessary packages:
from sqlalchemy import create_engine, Column, String, Integer, Boolean, BigInteger, Float, text # Database navigation
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
import mysql.connector
import sqlite3 # A second option for working with databases
import pandas as pd # Python data manipulation

Open MySQL Workbench.
- Click on Local Instance (its tile shows your host and port number, for example localhost:3306, if needed)

In [189]:
# Connect to the MySQL server 
# Define our variables. We set these during our first class in our technology set up. 
# If you are unsure of these variables, do not guess. 
# Visit MySQL Workbench for the host, port number, and user.

conn = mysql.connector.connect(
        host="localhost", # The hostname shown for your Local Instance in MySQL Workbench
        user="root", # This is your username for MySQL Workbench
        password="TippyTt0006!") # We wrote this password down in our first class!

# In order to connect to the server, we must include all of the above.

cursor = conn.cursor()

# CREATE DATABASE (SQL command) if it does not already exist
cursor.execute("CREATE DATABASE IF NOT EXISTS MySQL_SpotifyDatabase")
# MySQL_SpotifyDatabase will be the name when the database is created.

print("Database created successfully in MySQL Workbench! Go check it out.")

Database created successfully in MySQL Workbench! Go check it out.


**STOP**<br><br>
Confirm your database was created before continuing. <br> <br>
Open MySQL Workbench.<br>
Under MySQL Connections, click Local Instance<br>
Click the Schemas tab<br>
**You should now see a new (empty) database that you created**<br>
If it does not show up right away, hit refresh (The circular arrows)

In [193]:
# Time to connect to the database using SQLAlchemy:
DATABASE_URL = "mysql+mysqlconnector://root:TippyTt0006!@localhost/MySQL_SpotifyDatabase" # Use MySQL Connector to connect to the database
engine = create_engine(DATABASE_URL) # Creates the engine; the connection itself is opened the first time the engine is used

print("Connected to MySQL database successfully!")

Connected to MySQL database successfully!
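
The connection URL above puts the password directly into the string. As a minimal sketch of one alternative, the helper below builds the same kind of URL while percent-encoding the user and password with `urllib.parse.quote_plus`, so characters such as `@` or `/` cannot be mistaken for part of the URL. The function name `build_mysql_url`, the environment variable `MYSQL_PASSWORD`, and the fallback password are hypothetical and only here for illustration.

```python
import os
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, database):
    """Return a SQLAlchemy URL for the mysqlconnector driver.

    quote_plus() percent-encodes special characters in the user name and
    password so they are not read as URL separators.
    """
    return (
        f"mysql+mysqlconnector://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/{database}"
    )


# MYSQL_PASSWORD is a hypothetical environment variable; the fallback value
# is a made-up password used only so the sketch runs on its own.
password = os.environ.get("MYSQL_PASSWORD", "example-p@ss!")
DATABASE_URL = build_mysql_url("root", password, "localhost", "MySQL_SpotifyDatabase")
print(DATABASE_URL)
```

The resulting string can be passed to `create_engine()` exactly like the hard-coded URL. Reading the password from the environment also keeps it out of a notebook that is pushed to GitHub.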


In [195]:
# Read in the CLEAN .csv file (Using pandas) we will use to populate our database. This is the same dataset that you cleaned for Assignment #2!
# A subset of my personal Spotify data from 2012-2024

spotify = pd.read_csv("spotify_subset_2012_2024_cleaned.csv")



In [197]:
# Preview the dataframe by looking at the first five rows.
spotify.head()

| | platform | ms_played | conn_country | ip_addr | master_metadata_track_name | master_metadata_album_artist_name | master_metadata_album_album_name | spotify_track_uri | episode_name | episode_show_name | ... | reason_start | reason_end | shuffle | skipped | offline | offline_timestamp | incognito_mode | year | date | time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OS X 10.7.4 [x86 4] | 137760 | US | 134.210.225.27 | Some Nights - Intro | fun. | Some Nights | spotify:track:1JAI5Ia020mdGH2wMQEacy | | | ... | uriopen | trackdone | False | False | False | | False | 2012 | 2012-08-03 | 15:43:50 |
| 1 | OS X 10.7.4 [x86 4] | 277040 | US | 134.210.225.27 | Some Nights | fun. | Some Nights | spotify:track:6t6oULCRS6hnI7rm0h5gwl | | | ... | trackdone | trackdone | False | False | False | | False | 2012 | 2012-08-03 | 15:48:28 |
| 2 | OS X 10.7.4 [x86 4] | 108244 | US | 134.210.225.27 | We Are Young (feat. Janelle Monáe) | fun. | Some Nights | spotify:track:7a86XRg84qjasly9f6bPSD | | | ... | trackdone | uriopen | False | True | False | | False | 2012 | 2012-08-03 | 15:50:16 |
| 3 | OS X 10.7.4 [x86 4] | 16015 | US | 134.210.225.27 | Trip to Your Heart | Britney Spears | Femme Fatale (Deluxe Version) | spotify:track:2qbhijQG7phGVHkPt22fTP | | | ... | uriopen | uriopen | False | True | False | | False | 2012 | 2012-08-03 | 15:50:31 |
| 4 | OS X 10.7.4 [x86 4] | 73786 | US | 134.210.225.27 | Stan | Eminem | The Marshall Mathers LP | spotify:track:3UmaczJpikHgJFyBTAJVoz | | | ... | uriopen | popup | False | True | False | | False | 2012 | 2012-08-03 | 15:53:53 |


In [199]:
# What are all of the column names and data types for our dataset? 
# It is important to know the column names from the .csv because these are the field names we will want to use for our first table.
# Remember, the field names represent the column names of the csv/table.
spotify.dtypes

platform                              object
ms_played                              int64
conn_country                          object
ip_addr                               object
master_metadata_track_name            object
master_metadata_album_artist_name     object
master_metadata_album_album_name      object
spotify_track_uri                     object
episode_name                         float64
episode_show_name                    float64
spotify_episode_uri                  float64
audiobook_title                      float64
audiobook_uri                        float64
audiobook_chapter_uri                float64
audiobook_chapter_title              float64
reason_start                          object
reason_end                            object
shuffle                                 bool
skipped                                 bool
offline                               object
offline_timestamp                    float64
incognito_mode                          bool
year      

If you are an experienced Python user, you can create a base Python class for all of your tables before populating them and use built-in SQLAlchemy features. <br>
To practice SQL, we will create our database from scratch using SQL commands in Python instead.

We can use a new SQL statement CREATE TABLE to create our first table in our new database by writing a query.<br>
Everyone's data is different! Choose the SQL data types that fit YOUR data needs!<br>
SQL Data Types: https://www.w3schools.com/sql/sql_datatypes.asp
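
As a starting point for choosing types, the sketch below turns a dictionary of pandas dtype names into a draft `CREATE TABLE` statement. The mapping `DTYPE_TO_SQL`, the function `build_create_table`, and the table name `songs_draft` are hypothetical, and the type choices are assumptions to review by hand, not a rule.

```python
# Hypothetical mapping from pandas dtype names to MySQL column types.
DTYPE_TO_SQL = {
    "object": "VARCHAR(255)",
    "int64": "BIGINT",
    "float64": "DOUBLE",
    "bool": "BOOLEAN",
}


def build_create_table(table_name, column_dtypes):
    """Build a draft CREATE TABLE statement from {column name: dtype name}."""
    lines = ["id INT AUTO_INCREMENT PRIMARY KEY"]
    for column, dtype in column_dtypes.items():
        sql_type = DTYPE_TO_SQL.get(dtype, "TEXT")  # unknown dtypes fall back to TEXT
        lines.append(f"{column} {sql_type}")
    body = ",\n    ".join(lines)
    return f"CREATE TABLE IF NOT EXISTS {table_name} (\n    {body}\n);"


# With pandas, a dictionary like this could be built from the dataframe, e.g.
# {name: str(dtype) for name, dtype in spotify.dtypes.items()}
example_dtypes = {
    "master_metadata_track_name": "object",
    "ms_played": "int64",
    "shuffle": "bool",
}
print(build_create_table("songs_draft", example_dtypes))
```

A draft like this still needs editing. Dates and times read from a .csv usually arrive as `object` columns, so they would be mapped to `VARCHAR(255)` here even though `DATE` and `TIME` are the better fit.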

In [176]:
# Create our first table in the database file using SQL statements:
# We want our table column names to match what is in the .csv file
first_table_query = """CREATE TABLE IF NOT EXISTS songs (
                        id INT AUTO_INCREMENT PRIMARY KEY,
                        master_metadata_track_name VARCHAR(255),
                        master_metadata_album_artist_name VARCHAR(255),
                        master_metadata_album_album_name VARCHAR(255),
                        spotify_track_uri VARCHAR(255),
                        ms_played BIGINT,
                        date DATE,
                        time TIME,
                        year YEAR
                    );"""
# Note that the primary key for this table is a column/field "id"
# This is not a field that existed previously. AUTO_INCREMENT automatically generates a unique value for each new row added to the table. 
# Each new value is one greater than the previous value. We cannot make the Date column/field our primary key, because it is not unique.

In [201]:
#Execute the query:
with engine.connect() as connection:
    connection.execute(text(first_table_query))

print("First table created successfully!")

First table created successfully!


Define your SQL data types for your first table: <br><br>
**My SQL data types for my first table, songs:**<br>
INT - A whole number; with AUTO_INCREMENT it supplies the id primary key.<br>
VARCHAR(255) - A VARIABLE length string that can contain letters, numbers, and special characters, with a maximum length of 255 characters<br>
BIGINT - A large (64-bit) integer, roomy enough for any ms_played value. <br>
DATE - SQL Format: YYYY-MM-DD<br>
TIME - SQL Format: hh:mm:ss  <br>
YEAR - A year in four-digit format (YYYY)<br>

Why did you choose these values to make up your first database table? What did you choose for your primary key and why?

In [204]:
# There are multiple ways to populate the fields of the table. 
# Another option is to add a subset of the data into a data table, and then populate the database table.
# Please feel free to change or alter the code below.
# This example uses the MySQL connector (the conn and cursor we opened earlier), not the SQLAlchemy engine:

# Make sure MySQL is using the correct database
cursor.execute("USE MySQL_SpotifyDatabase;")

# Populate the songs table, one INSERT per row of the dataframe
for _, row in spotify.iterrows():
    cursor.execute("""INSERT INTO songs (master_metadata_track_name, master_metadata_album_artist_name, 
                                          master_metadata_album_album_name, spotify_track_uri, ms_played, date, time, year)
                      VALUES (%s, %s, %s, %s, %s, %s, %s, %s) 
                    """, (row['master_metadata_track_name'], # %s acts as a placeholder for values that will be inserted into the table
                          row['master_metadata_album_artist_name'],
                          row['master_metadata_album_album_name'], 
                          row['spotify_track_uri'], 
                          row['ms_played'], 
                          row['date'], 
                          row['time'], 
                          row['year']))
conn.commit() # Save the inserted rows to the database

# Another option can use executemany() to take a list of tuples and substitute %s with the actual data from the Spotify data frame
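
The `executemany()` option mentioned above can be sketched as follows. To keep the example runnable without a MySQL server it uses Python's built-in `sqlite3` module with an in-memory database, so the placeholders are `?` rather than the `%s` used by the MySQL connector. The shortened column names and the two sample rows (taken from the dataframe preview) are assumptions for illustration.

```python
import sqlite3

# Two sample rows standing in for the Spotify dataframe. With pandas, a list
# like this could come from spotify[columns].itertuples(index=False, name=None).
rows = [
    ("Some Nights", "fun.", 277040),
    ("Stan", "Eminem", 73786),
]

demo_conn = sqlite3.connect(":memory:")  # throwaway in-memory database
demo_cursor = demo_conn.cursor()
demo_cursor.execute(
    "CREATE TABLE songs ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "track_name TEXT, artist_name TEXT, ms_played INTEGER)"
)

# executemany() runs the same INSERT once for every tuple in the list.
demo_cursor.executemany(
    "INSERT INTO songs (track_name, artist_name, ms_played) VALUES (?, ?, ?)",
    rows,
)
demo_conn.commit()

inserted = demo_cursor.execute("SELECT COUNT(*) FROM songs").fetchone()[0]
print(inserted)
demo_conn.close()
```

With the MySQL connector the call has the same shape: `cursor.executemany(insert_statement, list_of_tuples)` with `%s` placeholders, followed by `conn.commit()`.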

**STOP**<br><br>
In MySQL Workbench, you should see your new table that you have created and populated.<br>
You can now run a SQL query directly in MySQL Workbench!<br>
You can also run a query below to test it:

In [205]:
# Now that we have populated our table, let's try out a query.
# SELECT the COUNT of the artist_name FROM table songs
# and GROUP them BY artist_name in DESCending order.

with engine.connect() as connection:  # Establish a connection
    practice_query = text("""SELECT master_metadata_album_artist_name, COUNT(*) as count
                                 FROM songs
                                 GROUP BY master_metadata_album_artist_name
                                 ORDER BY count DESC
                                 LIMIT 10;
                                 """) # Define the query - text() ensures that the query string is read as a SQL expression 
    practice_query = pd.read_sql(practice_query, connection) #Use pandas to read the sql query with the connection to the database
    
# Print the results
practice_query

| | master_metadata_album_artist_name | count |
|---|---|---|
| 0 | Miley Cyrus | 17790 |
| 1 | Ariana Grande | 9072 |
| 2 | Marian Hill | 8040 |
| 3 | Noah Cyrus | 7346 |
| 4 | Hozier | 6438 |
| 5 | Lana Del Rey | 6338 |
| 6 | Lady Gaga | 5976 |
| 7 | Nashville Cast | 5846 |
| 8 | Meghan Trainor | 5620 |
| 9 | Billie Eilish | 5602 |


**STOP**<br>
Next, create a schema diagram for your new database (Even though it only has one table...it's good practice!):<br>
Open MySQL Workbench again<br>
Click Home<br>
Click the Models icon<br>
Click the > icon to the right of "Models"<br>
Choose “Create EER Model from Database” <br>
The Reverse Engineer Database Wizard starts and will walk you through your first database schema diagram.<br>
Save your model. <br>
You can now add relationships and/or modify tables...but for this assignment, all we need is that first table. <br>

**Add a screenshot of your first schema diagram (The table) to your repository/Blackboard submission.**

In [208]:
#Close the database connection :)
cursor.close()
conn.close()

Now what if we wanted to explore the number of songs that were skipped or shuffled? We'll begin JOINing tables in our next assignment.

In [183]:
second_table_query = """CREATE TABLE IF NOT EXISTS listen_analytics (
                        listen_id INT,
                        ts DATETIME,
                        platform TEXT,
                        conn_country TEXT,
                        ip_addr TEXT,
                        reason_start TEXT,
                        reason_end TEXT,
                        shuffle BOOLEAN,
                        skipped BOOLEAN,
                        offline BOOLEAN,
                        offline_timestamp FLOAT,
                        incognito_mode BOOLEAN,
                        FOREIGN KEY (listen_id) REFERENCES songs(id)
                    );"""

#Execute the query:
with engine.connect() as connection:
    connection.execute(text(second_table_query))

print("Second table created successfully!")

Second table created successfully!
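
The `FOREIGN KEY` line in the second table ties every `listen_id` to an existing `id` in `songs`. The sketch below illustrates that behavior with Python's built-in `sqlite3` module and an in-memory database instead of MySQL, so it runs without a server. The tables are cut down to a few columns, and the `PRAGMA` line is specific to SQLite, which only checks foreign keys when asked to.

```python
import sqlite3

demo = sqlite3.connect(":memory:")  # throwaway in-memory database
demo.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
demo.execute("CREATE TABLE songs (id INTEGER PRIMARY KEY, track_name TEXT)")
demo.execute(
    "CREATE TABLE listen_analytics ("
    "listen_id INTEGER, skipped BOOLEAN, "
    "FOREIGN KEY (listen_id) REFERENCES songs(id))"
)
demo.execute("INSERT INTO songs (id, track_name) VALUES (1, 'Some Nights')")

# listen_id 1 exists in songs, so this row is accepted.
demo.execute("INSERT INTO listen_analytics VALUES (1, 0)")

# listen_id 99 has no matching songs.id, so the database rejects the row.
try:
    demo.execute("INSERT INTO listen_analytics VALUES (99, 1)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = demo.execute("SELECT COUNT(*) FROM listen_analytics").fetchone()[0]
print(rejected, row_count)
demo.close()
```

This link between the two tables is what makes it possible to JOIN them later.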


### Creating a local database from scratch

#### Creating a local database from scratch in Python using SQLAlchemy for MySQL Workbench:<br>
Another example: https://blog.sqlitecloud.io/sqlite-python-sqlalchemy

In [127]:
# BEFORE YOU BEGIN!
# Is your MySQL Server running on your local machine?
# The SQLite engine below works without it, but the MySQL steps that follow still need the server. Please continue! :)
from sqlalchemy import create_engine

In [129]:
engine = create_engine("sqlite:///Local_SpotifyDatabase.db")  # Points to a local database file in the current working directory; the file is created on first connection.

In [131]:
# To reach a database on the MySQL server we still need a user, password, host, and database name.
# NOTE: 127.0.0.1 is the IP address form of localhost, so this URL still points at the MySQL server on our own machine.
DATABASE_URL = "mysql+mysqlconnector://root:TippyTt0006!@127.0.0.1/Local_SpotifyDatabase"

In [134]:
# This uses the MySQL connector cursor from earlier; reconnect first if you already closed it.
cursor.execute("CREATE DATABASE IF NOT EXISTS Local_SpotifyDatabase")

In [145]:
# Close your connection :)
conn.close()

**STOP HERE**<br>
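Before moving on, it is **important** to understand the difference in what we have just completed. The SQLAlchemy engine with the sqlite:/// URL points at a database file on our own machine, which needs no server, user, or password. The MySQL URL and the CREATE DATABASE statement, by contrast, create a database on the MySQL server running LOCALLY: the host 127.0.0.1 is simply the IP address form of localhost, and we still had to specify a user and password! This means we can access that database in MySQL Workbench if we choose.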

#### Creating a local database (.db file) from scratch in Python using SQLite:<br>


In [149]:
# Load necessary packages:
from sqlalchemy import create_engine, inspect, text # Database navigation
import sqlite3 # A second option for working with databases
import pandas as pd # Python data manipulation

In [147]:
# Load the .csv subset again if you are starting over
csv_file = "spotify_subset_2012_2024_cleaned.csv"  # Replace with your actual file path
df = pd.read_csv(csv_file)

# Create a SQLite database and engine
db_file = "spotify_data.db"
engine = create_engine(f"sqlite:///{db_file}")

# Store the dataframe in the database as a single table for quick practice (Never recommended, especially for large data sets) 
df.to_sql("spotify_history", con=engine, if_exists="replace", index=False)



202678

**STOP HERE**<br>
This method creates a database as a file on our local machine. The .db file is created in the same location or working directory you are currently in (Go check!). If you did not specify a working directory, the .db file is created where this .ipynb is located. 
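
To check what a .db file contains, Python's built-in `sqlite3` module can open it directly. The sketch below is a minimal, self-contained illustration: it creates a small database file in a temporary folder (the file name `demo_spotify.db` and the sample row are made up), then confirms that the file exists and lists its tables.

```python
import os
import sqlite3
import tempfile

# Work in a temporary folder so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as folder:
    db_path = os.path.join(folder, "demo_spotify.db")  # hypothetical file name

    demo = sqlite3.connect(db_path)  # opens the file, creating it if needed
    demo.execute("CREATE TABLE spotify_history (track_name TEXT, ms_played INTEGER)")
    demo.execute("INSERT INTO spotify_history VALUES ('Stan', 73786)")
    demo.commit()

    file_exists = os.path.exists(db_path)
    # sqlite_master is SQLite's built-in catalog of tables and indexes.
    tables = [
        name
        for (name,) in demo.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    ]
    demo.close()

print(file_exists, tables)
```

To inspect the file from this assignment instead, connect to `spotify_data.db`. Running `os.getcwd()` shows the working directory where that file was written.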

In [152]:
#Close the database connection :)
engine.dispose()  # Release the SQLAlchemy engine's connection to the .db file; the MySQL cursor and conn were already closed above