# Notebook 2: Feature Engineering

In this notebook we start from the cleaned dataset `spotify_clean.csv` and build a few simple but useful features for modeling. In particular:

- we extract a release year (`release_year`) from the release date column; 
- we convert duration from milliseconds to minutes (`duration_min`);
- we select a clear subset of columns to use for modeling;
- we impute missing values so that the final dataset `spotify_model_df.csv` has no NaNs (this is important for methods like PCA that do not accept missing values).

## 1. Load the cleaned dataset

We load `spotify_clean.csv` produced in Notebook 1 and inspect the available columns.

In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', 100)

spotify_clean = pd.read_csv('spotify_clean.csv')
spotify_clean.head()

Unnamed: 0,energy,tempo,danceability,playlist_genre,liveness,valence,track_artist,time_signature,speechiness,track_popularity,track_album_name,playlist_name,track_name,track_album_release_date,instrumentalness,mode,key,duration_ms,playlist_subgenre,type,label_kaggle
0,0.592,157.969,0.521,pop,0.122,0.535,"Lady Gaga, Bruno Mars",3.0,0.0304,100,Die With A Smile,Today's Top Hits,Die With A Smile,2024-08-16,0.0,0.0,6.0,251668.0,mainstream,audio_features,1
1,0.507,104.978,0.747,pop,0.117,0.438,Billie Eilish,4.0,0.0358,97,HIT ME HARD AND SOFT,Today's Top Hits,BIRDS OF A FEATHER,2024-05-17,0.0608,1.0,2.0,210373.0,mainstream,audio_features,1
2,0.808,108.548,0.554,pop,0.159,0.372,Gracie Abrams,4.0,0.0368,93,The Secret of Us (Deluxe),Today's Top Hits,That’s So True,2024-10-18,0.0,1.0,1.0,166300.0,mainstream,audio_features,1
3,0.91,112.966,0.67,pop,0.304,0.786,Sabrina Carpenter,4.0,0.0634,81,Short n' Sweet,Today's Top Hits,Taste,2024-08-23,0.0,0.0,0.0,157280.0,mainstream,audio_features,1
4,0.783,149.027,0.777,pop,0.355,0.939,"ROSÉ, Bruno Mars",4.0,0.26,98,APT.,Today's Top Hits,APT.,2024-10-18,0.0,0.0,0.0,169917.0,mainstream,audio_features,1


In [2]:
# Show the list of columns
spotify_clean.columns.tolist()

['energy',
 'tempo',
 'danceability',
 'playlist_genre',
 'liveness',
 'valence',
 'track_artist',
 'time_signature',
 'speechiness',
 'track_popularity',
 'track_album_name',
 'playlist_name',
 'track_name',
 'track_album_release_date',
 'instrumentalness',
 'mode',
 'key',
 'duration_ms',
 'playlist_subgenre',
 'type',
 'label_kaggle']

## 2. Create `release_year` from release date

The cleaned dataset stores the album release date in `track_album_release_date`. We want a numerical variable `release_year` that represents the year in which the track was released. We parse the date column (if present) with `pd.to_datetime` and extract the year; dates that cannot be parsed are coerced to missing values, which we impute in Section 5.

In [3]:
date_cols_candidate = [
    'track_album_release_date'
]

date_col = None
for c in date_cols_candidate:
    if c in spotify_clean.columns:
        date_col = c
        break


if date_col is not None:
    spotify_clean['release_year'] = pd.to_datetime(
        spotify_clean[date_col], errors='coerce'
    ).dt.year
else:
    # If there is no date column, fill release_year with NaN; the missing values are imputed later
    spotify_clean['release_year'] = np.nan

spotify_clean[['release_year']].describe()

Unnamed: 0,release_year
count,4692.0
mean,2017.354859
std,10.099106
min,1954.0
25%,2016.0
50%,2022.0
75%,2024.0
max,2024.0
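
The `describe()` output above counts 4692 parsed years, while the final dataset has 4830 rows, so 138 release dates were coerced to missing. The notebook does not show why they failed to parse. One plausible cause, stated here as an assumption, is partial dates such as a bare year or a year and month. Under that assumption, a minimal sketch of an alternative is to read the leading four digits instead of parsing the full date (the sample values below are made up for illustration):

```python
import pandas as pd

# Hypothetical sample of release-date strings, including partial dates
dates = pd.Series(['2024-08-16', '1998', '2005-07', None, 'unknown'])

# Take the leading four digits as the year; rows without them stay missing
year_text = dates.str.extract(r'^(\d{4})', expand=False)
release_year = pd.to_numeric(year_text, errors='coerce')

print(release_year)
```

Whether this recovers any of the 138 rows depends on what the unparsed values actually look like, so it is worth inspecting them before switching approach.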


## 3. Create `duration_min` from milliseconds

Track duration is stored in milliseconds (`duration_ms`). For interpretability we create a version in minutes (`duration_min`) by dividing by 60,000, the number of milliseconds in a minute.

In [4]:
if 'duration_ms' in spotify_clean.columns:
    spotify_clean['duration_min'] = spotify_clean['duration_ms'] / 60000.0
else:
    spotify_clean['duration_min'] = np.nan

spotify_clean[['duration_ms', 'duration_min']].head()

Unnamed: 0,duration_ms,duration_min
0,251668.0,4.194467
1,210373.0,3.506217
2,166300.0,2.771667
3,157280.0,2.621333
4,169917.0,2.83195
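
As a quick sanity check on the conversion, the sketch below recomputes the first row by hand. The helper names `ms_to_min` and `ms_to_mmss` are hypothetical and not part of the notebook; the second one is only a convenience for reading durations as minutes and seconds.

```python
def ms_to_min(duration_ms: float) -> float:
    """Convert milliseconds to decimal minutes (60,000 ms per minute)."""
    return duration_ms / 60_000


def ms_to_mmss(duration_ms: float) -> str:
    """Format milliseconds as 'm:ss', rounded to the nearest second."""
    total_seconds = int(round(duration_ms / 1000))
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"


first_track_ms = 251668.0  # duration_ms of the first row shown above
print(round(ms_to_min(first_track_ms), 6))  # 4.194467, matching duration_min
print(ms_to_mmss(first_track_ms))           # 4:12
```

Note that decimal minutes are not minutes and seconds: 4.19 minutes is about 4 minutes 12 seconds, not 4 minutes 19 seconds.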


## 4. Explicit selection of columns for modeling

We now explicitly decide which columns we will use in the models. We want to include:

- core audio numerical features (danceability, energy, loudness, speechiness, acousticness, etc.), keeping only those that are present in the cleaned dataset;
- the new numerical features `duration_min` and `release_year`;
- some musical categorical variables (`key`, `mode`, `time_signature`, `playlist_genre`, if available);
- the main target variable `track_popularity` and, for exploration, the original `label_kaggle`.

We collect these into a new DataFrame `spotify_model_df`. The remaining columns were deliberately left out for two main reasons: first, to keep the data as close as possible to the format scikit-learn expects; second, because they were too complicated to handle at this stage and not necessary for the purpose of the project.

In [5]:
# List of possible numerical audio features
audio_numeric_features = [
    'danceability', 'energy', 'loudness', 'speechiness', 'acousticness',
    'instrumentalness', 'liveness', 'valence', 'tempo'
]

numeric_features = [f for f in audio_numeric_features if f in spotify_clean.columns]

# Add newly engineered numerical features
for extra in ['duration_min', 'release_year']:
    if extra in spotify_clean.columns:
        numeric_features.append(extra)

print('Selected numerical features:', numeric_features)

# Musical/categorical features
categorical_features = []
for col in ['key', 'mode', 'time_signature', 'playlist_genre']:
    if col in spotify_clean.columns:
        categorical_features.append(col)

print('Selected categorical features:', categorical_features)

# Target columns
target_cols = []
if 'track_popularity' in spotify_clean.columns:
    target_cols.append('track_popularity')
if 'label_kaggle' in spotify_clean.columns:
    target_cols.append('label_kaggle')

print('Target columns included:', target_cols)

# Final modeling DataFrame
model_cols = numeric_features + categorical_features + target_cols
spotify_model_df = spotify_clean[model_cols].copy()
spotify_model_df.head()

Selected numerical features: ['danceability', 'energy', 'speechiness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_min', 'release_year']
Selected categorical features: ['key', 'mode', 'time_signature', 'playlist_genre']
Target columns included: ['track_popularity', 'label_kaggle']


Unnamed: 0,danceability,energy,speechiness,instrumentalness,liveness,valence,tempo,duration_min,release_year,key,mode,time_signature,playlist_genre,track_popularity,label_kaggle
0,0.521,0.592,0.0304,0.0,0.122,0.535,157.969,4.194467,2024.0,6.0,0.0,3.0,pop,100,1
1,0.747,0.507,0.0358,0.0608,0.117,0.438,104.978,3.506217,2024.0,2.0,1.0,4.0,pop,97,1
2,0.554,0.808,0.0368,0.0,0.159,0.372,108.548,2.771667,2024.0,1.0,1.0,4.0,pop,93,1
3,0.67,0.91,0.0634,0.0,0.304,0.786,112.966,2.621333,2024.0,0.0,0.0,4.0,pop,81,1
4,0.777,0.783,0.26,0.0,0.355,0.939,149.027,2.83195,2024.0,0.0,0.0,4.0,pop,98,1
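
Because the selection silently skips columns that are absent, it can help to report which requested features were not found. The sketch below does this with plain lists, using the requested audio features and the column list printed in Section 1; the variable names are illustrative only.

```python
# Audio features requested in the selection cell
requested = [
    'danceability', 'energy', 'loudness', 'speechiness', 'acousticness',
    'instrumentalness', 'liveness', 'valence', 'tempo'
]

# Columns of spotify_clean as printed in Section 1
available = [
    'energy', 'tempo', 'danceability', 'playlist_genre', 'liveness',
    'valence', 'track_artist', 'time_signature', 'speechiness',
    'track_popularity', 'track_album_name', 'playlist_name', 'track_name',
    'track_album_release_date', 'instrumentalness', 'mode', 'key',
    'duration_ms', 'playlist_subgenre', 'type', 'label_kaggle'
]

# Requested features that the selection skipped
missing = [col for col in requested if col not in available]
print('Requested but not available:', missing)
# Requested but not available: ['loudness', 'acousticness']
```

This is consistent with the printed list of selected numerical features, which contains neither `loudness` nor `acousticness`.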


## 5. Impute missing values (no NaNs in `spotify_model_df`)

Before saving `spotify_model_df`, we want to make sure there are no missing values. This is important because some methods we will use later (for example PCA in Notebook 3) do not accept NaN values.

Here we adopt a simple, standard strategy:

- for numerical features, we replace NaNs with the median of that column;
- for categorical features, we replace NaNs with the string `'Unknown'`.


In [6]:
# Check total number of missing values before imputation
print('Total missing values BEFORE imputation:', spotify_model_df.isna().sum().sum())

# Impute numerical features with the median
if numeric_features:
    medians = spotify_model_df[numeric_features].median()
    spotify_model_df[numeric_features] = spotify_model_df[numeric_features].fillna(medians)

# Impute categorical features with the string 'Unknown'
if categorical_features:
    spotify_model_df[categorical_features] = spotify_model_df[categorical_features].fillna('Unknown')

# Check that there are no missing values left
print('Total missing values AFTER imputation:', spotify_model_df.isna().sum().sum())

Total missing values BEFORE imputation: 138
Total missing values AFTER imputation: 0
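
To make the two imputation rules concrete, here is a minimal sketch on a toy DataFrame. The values are invented for illustration and the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data with one missing numerical and one missing categorical value
toy = pd.DataFrame({
    'tempo': [120.0, np.nan, 90.0, 100.0],
    'playlist_genre': ['pop', None, 'rock', 'pop'],
})
numeric_cols = ['tempo']
categorical_cols = ['playlist_genre']

# Numerical rule: fill with the column median (NaNs are ignored when computing it)
medians = toy[numeric_cols].median()
toy[numeric_cols] = toy[numeric_cols].fillna(medians)

# Categorical rule: fill with a placeholder string
toy[categorical_cols] = toy[categorical_cols].fillna('Unknown')

print(toy)
```

In this notebook's run, the 138 missing values reported before imputation equal the number of unparsed release dates (4830 rows minus 4692 parsed years), so only the median rule for `release_year` is actually exercised. If numerically coded columns such as `key`, `mode`, or `time_signature` did contain NaNs, filling them with a string would mix text and numbers in one column, which may need separate handling before modeling.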


## 6. Final checks and save

We check basic statistics and missing values in the modeling dataset, and then save it as `spotify_model_df.csv`. This file will be the starting point for Notebook 3 (unsupervised learning) and Notebook 4 (supervised learning).

In [7]:
spotify_model_df.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
danceability,4830.0,,,,0.622311,0.187706,0.0589,0.525,0.653,0.758,0.979
energy,4830.0,,,,0.586691,0.246263,0.000202,0.44225,0.633,0.777,0.998
speechiness,4830.0,,,,0.101738,0.101032,0.0219,0.0386,0.0561,0.118,0.927
instrumentalness,4830.0,,,,0.201053,0.351918,0.0,0.0,9.1e-05,0.2005,0.991
liveness,4830.0,,,,0.167613,0.124429,0.021,0.0954,0.118,0.195,0.979
valence,4830.0,,,,0.48193,0.258036,0.0296,0.275,0.483,0.69,0.987
tempo,4830.0,,,,118.269293,28.512615,48.232,96.063,118.0595,136.7235,241.426
duration_min,4830.0,,,,3.435847,1.362426,0.589583,2.65,3.247775,3.8913,22.587667
release_year,4830.0,,,,2017.487578,9.983801,1954.0,2016.0,2022.0,2024.0,2024.0
key,4830.0,,,,5.233333,3.580857,0.0,2.0,5.0,8.0,11.0


In [8]:
spotify_model_df.isna().sum().sort_values(ascending=False)

danceability        0
energy              0
speechiness         0
instrumentalness    0
liveness            0
valence             0
tempo               0
duration_min        0
release_year        0
key                 0
mode                0
time_signature      0
playlist_genre      0
track_popularity    0
label_kaggle        0
dtype: int64

In [9]:
spotify_model_df.to_csv('spotify_model_df.csv', index=False)
spotify_model_df.head()

Unnamed: 0,danceability,energy,speechiness,instrumentalness,liveness,valence,tempo,duration_min,release_year,key,mode,time_signature,playlist_genre,track_popularity,label_kaggle
0,0.521,0.592,0.0304,0.0,0.122,0.535,157.969,4.194467,2024.0,6.0,0.0,3.0,pop,100,1
1,0.747,0.507,0.0358,0.0608,0.117,0.438,104.978,3.506217,2024.0,2.0,1.0,4.0,pop,97,1
2,0.554,0.808,0.0368,0.0,0.159,0.372,108.548,2.771667,2024.0,1.0,1.0,4.0,pop,93,1
3,0.67,0.91,0.0634,0.0,0.304,0.786,112.966,2.621333,2024.0,0.0,0.0,4.0,pop,81,1
4,0.777,0.783,0.26,0.0,0.355,0.939,149.027,2.83195,2024.0,0.0,0.0,4.0,pop,98,1
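
Since later notebooks reload the saved file, a round-trip check can confirm that writing and reading the CSV does not reintroduce missing values. The sketch below uses a toy DataFrame and an in-memory buffer instead of a file on disk; both are assumptions made to keep the example self-contained.

```python
import io

import pandas as pd

# Toy stand-in for spotify_model_df (values invented for illustration)
toy = pd.DataFrame({
    'tempo': [120.0, 100.0],
    'playlist_genre': ['pop', 'Unknown'],
    'track_popularity': [100, 97],
})

# Write to an in-memory CSV, then read it back as Notebook 3 would
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The reloaded data should have the same shape and columns, and no NaNs
assert reloaded.shape == toy.shape
assert list(reloaded.columns) == list(toy.columns)
assert reloaded.isna().sum().sum() == 0
```

The placeholder `'Unknown'` survives the round trip because it is not one of the strings `read_csv` treats as missing by default; a placeholder such as `'NA'` would be read back as NaN.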
