# Final Project: How Does Musical Key Vary Across Genres?
Author: Toby Draper

DS 2023

# Notebook 1: Establish Data
In this notebook, we gather and clean the data needed for the final infographic. The dataset I used for this project can be accessed at https://www.kaggle.com/datasets/byomokeshsenapati/spotify-song-attributes. It comes from a Spotify user's listening history in 2022 and contains 10,080 rows of track data.

Before running this notebook, you should create a dedicated virtual environment and install the project’s required Python packages. Open a terminal, cd into the final_project folder, and run the command below as a single line:

%%bash
python3 -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt 

## Import Packages: 
The first step is to import the necessary Python packages for data manipulation and visualization. See the cell below:

In [1]:
import pandas as pd
import re
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
import math
import os

## Load Data:
Next, we load the data into the notebook and create a new dataframe with only the columns we need for our analysis. We then convert the genre column to strings and map Spotify’s numeric key values (0–11) into readable note names.

In [2]:
# =========================================================
# 1) LOAD DATA 
# =========================================================
df = pd.read_csv("../Spotify_Song_Attributes.csv")

# Keep only what we need for this infographic
df_small = df[["genre", "key"]].copy()
df_small["genre"] = df_small["genre"].astype(str)

# Map Spotify "key" integers (0–11) to note names
key_map = {
    0: "C", 1: "C#", 2: "D", 3: "D#", 4: "E", 5: "F",
    6: "F#", 7: "G", 8: "G#", 9: "A", 10: "A#", 11: "B"
}
df_small["key_note"] = df_small["key"].map(key_map)
key_order = list(key_map.values())

df_small = df_small[["genre", "key_note"]].copy()
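
To illustrate the mapping step in isolation, here is a minimal sketch on a made-up frame. The `toy` data and the `unmapped` counter are hypothetical and not part of the project code. It assumes keys are stored as floats, as in the CSV, and shows that any value outside 0–11 (for example -1, which Spotify documents as "no key detected") or a missing value ends up as NaN in `key_note`.

```python
import pandas as pd

# Hypothetical toy data: float keys as in the real CSV,
# plus one out-of-range key (-1) and one missing key
toy = pd.DataFrame({
    "genre": ["pop", "rock", "jazz", "folk", "indie"],
    "key": [0.0, 7.0, 11.0, -1.0, None],
})

# Same mapping as in the cell above
key_map = {
    0: "C", 1: "C#", 2: "D", 3: "D#", 4: "E", 5: "F",
    6: "F#", 7: "G", 8: "G#", 9: "A", 10: "A#", 11: "B"
}

# Values not present in key_map become NaN rather than raising an error
toy["key_note"] = toy["key"].map(key_map)

# Count rows whose key could not be translated into a note name
unmapped = int(toy["key_note"].isna().sum())

print(toy)
print("Rows without a note name:", unmapped)
```

Running the same count on `df_small["key_note"]` would be a quick way to confirm that every track in the real data received a note name.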

## Examine Data Features
With the data fully loaded, we can examine the features of the original dataset as well as the adapted dataset to understand their structure and contents. Note that this code is not included in the original code file but has been added to this notebook because it is useful for understanding the data.

In [3]:
COLS1 = pd.DataFrame(index=df.columns)
COLS1.index.name = "col_id"

COLS1["dtypes"] = df.dtypes.astype(str)
COLS1["n_unique"] = df.nunique(dropna=True)
COLS1["tot_observations"] = len(df)

COLS1


col_id,dtypes,n_unique,tot_observations
trackName,object,4815,10080
artistName,object,2312,10080
msPlayed,int64,4979,10080
genre,object,523,10080
danceability,float64,762,10080
energy,float64,1066,10080
key,float64,12,10080
loudness,float64,3965,10080
mode,float64,2,10080
speechiness,float64,1001,10080


In [4]:
COLS2 = pd.DataFrame(index=df_small.columns)
COLS2.index.name = "col_id"

COLS2["dtypes"] = df_small.dtypes.astype(str)
COLS2["n_unique"] = df_small.nunique(dropna=True)
COLS2["tot_observations"] = len(df_small)

COLS2

col_id,dtypes,n_unique,tot_observations
genre,object,524,10080
key_note,object,12,10080
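
Since cells 3 and 4 build the same summary for two dataframes, the logic could be wrapped in a small helper. The sketch below is one possible refactor, not part of the original code; the name `summarize_columns` and the `demo` frame are hypothetical.

```python
import pandas as pd

def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, unique count, and row count."""
    summary = pd.DataFrame(index=frame.columns)
    summary.index.name = "col_id"
    summary["dtypes"] = frame.dtypes.astype(str)
    summary["n_unique"] = frame.nunique(dropna=True)  # missing values are not counted
    summary["tot_observations"] = len(frame)
    return summary

# Hypothetical demo frame standing in for df or df_small
demo = pd.DataFrame({
    "genre": ["pop", "rock", "pop"],
    "key_note": ["C", "G", "C"],
})

demo_summary = summarize_columns(demo)
print(demo_summary)
```

With this helper, `summarize_columns(df)` and `summarize_columns(df_small)` would produce the same kind of tables as `COLS1` and `COLS2`.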


The original dataset contains 22 features, including identification features such as track name and artist name, as well as musical features such as key, tempo, and danceability. The adapted dataset contains only the genre and key note columns, which are the only features needed for our analysis.
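
The tables above report 523 unique genres in the original data but 524 in the adapted data. A likely explanation (an assumption, not verified against the CSV) is that some rows have no genre, and `astype(str)` turned those missing values into a literal text label that `nunique` then counts as a category. The sketch below uses a hypothetical `toy` frame to show one way to check for missing genres and drop them before converting to strings.

```python
import pandas as pd

# Hypothetical toy data with one missing genre
toy = pd.DataFrame({
    "genre": ["pop", None, "rock", "pop"],
    "key_note": ["C", "G", "D", "A"],
})

# Count missing genres before any string conversion
n_missing = int(toy["genre"].isna().sum())

# Drop rows without a genre, then convert the remainder to strings
cleaned = toy.dropna(subset=["genre"]).copy()
cleaned["genre"] = cleaned["genre"].astype(str)

n_genres = cleaned["genre"].nunique()

print("Missing genres:", n_missing)
print("Unique genres after cleaning:", n_genres)
```

Whether to drop such rows or keep them under an explicit label such as "unknown" is a design choice that depends on how the infographic should treat tracks without a genre.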

We now have a clean, well-structured dataset ready for analysis in the next notebook.