# Notebook 4: Supervised Learning (Popularity Regression)

In this notebook we build supervised learning models to predict the Spotify popularity score (`track_popularity`, typically between 0 and 100) from the audio and descriptive features. We consider three simple but representative models:

- Linear Regression;
- k-Nearest Neighbors (KNN) Regressor;
- Random Forest Regressor.

All models share a common preprocessing pipeline that handles missing values, scales numerical variables, and one‑hot encodes categorical variables.

## 1. Load data and define X and y

We load `spotify_model_df.csv` from Notebook 2 and separate the feature matrix `X` from the target `y`, which in this notebook is the continuous popularity score `track_popularity`.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV
pd.set_option('display.max_columns', 100)

spotify_model_df = pd.read_csv('spotify_model_df.csv')
spotify_model_df.head()

| | danceability | energy | speechiness | instrumentalness | liveness | valence | tempo | duration_min | release_year | key | mode | time_signature | playlist_genre | track_popularity | label_kaggle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.521 | 0.592 | 0.0304 | 0.0 | 0.122 | 0.535 | 157.969 | 4.194467 | 2024.0 | 6.0 | 0.0 | 3.0 | pop | 100 | 1 |
| 1 | 0.747 | 0.507 | 0.0358 | 0.0608 | 0.117 | 0.438 | 104.978 | 3.506217 | 2024.0 | 2.0 | 1.0 | 4.0 | pop | 97 | 1 |
| 2 | 0.554 | 0.808 | 0.0368 | 0.0 | 0.159 | 0.372 | 108.548 | 2.771667 | 2024.0 | 1.0 | 1.0 | 4.0 | pop | 93 | 1 |
| 3 | 0.67 | 0.91 | 0.0634 | 0.0 | 0.304 | 0.786 | 112.966 | 2.621333 | 2024.0 | 0.0 | 0.0 | 4.0 | pop | 81 | 1 |
| 4 | 0.777 | 0.783 | 0.26 | 0.0 | 0.355 | 0.939 | 149.027 | 2.83195 | 2024.0 | 0.0 | 0.0 | 4.0 | pop | 98 | 1 |


In [2]:
# Define target y
if 'track_popularity' not in spotify_model_df.columns:
    raise ValueError('Column track_popularity is missing from the dataset.')

y = spotify_model_df['track_popularity']

# X contains all features except the target and label_kaggle (if present)
X = spotify_model_df.drop(columns=[c for c in ['track_popularity', 'label_kaggle']
                                   if c in spotify_model_df.columns])
X.head()

| | danceability | energy | speechiness | instrumentalness | liveness | valence | tempo | duration_min | release_year | key | mode | time_signature | playlist_genre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.521 | 0.592 | 0.0304 | 0.0 | 0.122 | 0.535 | 157.969 | 4.194467 | 2024.0 | 6.0 | 0.0 | 3.0 | pop |
| 1 | 0.747 | 0.507 | 0.0358 | 0.0608 | 0.117 | 0.438 | 104.978 | 3.506217 | 2024.0 | 2.0 | 1.0 | 4.0 | pop |
| 2 | 0.554 | 0.808 | 0.0368 | 0.0 | 0.159 | 0.372 | 108.548 | 2.771667 | 2024.0 | 1.0 | 1.0 | 4.0 | pop |
| 3 | 0.67 | 0.91 | 0.0634 | 0.0 | 0.304 | 0.786 | 112.966 | 2.621333 | 2024.0 | 0.0 | 0.0 | 4.0 | pop |
| 4 | 0.777 | 0.783 | 0.26 | 0.0 | 0.355 | 0.939 | 149.027 | 2.83195 | 2024.0 | 0.0 | 0.0 | 4.0 | pop |


## 2. Identify numerical and categorical feature columns

To build a `ColumnTransformer` we need lists of names for **numerical features** and **categorical features**. We follow the same logic as in Notebook 2: we treat audio variables and engineered features `duration_min` and `release_year` as numerical, and columns like `key`, `mode`, `time_signature`, `playlist_genre` as categorical.

In [3]:
# List of possible numerical audio features
audio_numeric_features = [
    'danceability', 'energy', 'loudness', 'speechiness', 'acousticness',
    'instrumentalness', 'liveness', 'valence', 'tempo'
]

numeric_features = [f for f in audio_numeric_features if f in X.columns]

# Add engineered numerical features
for extra in ['duration_min', 'release_year']:
    if extra in X.columns:
        numeric_features.append(extra)

categorical_features = []
for col in ['key', 'mode', 'time_signature', 'playlist_genre']:
    if col in X.columns:
        categorical_features.append(col)

print('Numerical features in ColumnTransformer:', numeric_features)
print('Categorical features in ColumnTransformer:', categorical_features)

Numerical features in ColumnTransformer: ['danceability', 'energy', 'speechiness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_min', 'release_year']
Categorical features in ColumnTransformer: ['key', 'mode', 'time_signature', 'playlist_genre']
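
As a quick sanity check, the sketch below lists any columns that end up in neither feature list. Such columns never reach the models, because the `ColumnTransformer` built in Section 3 uses `remainder='drop'`. The helper name `find_unassigned_columns` and the toy column `track_id` are hypothetical and serve only as an illustration.

```python
import pandas as pd


def find_unassigned_columns(df, numeric_cols, categorical_cols):
    """Return the columns of df that appear in neither feature list.

    Hypothetical helper, not part of the original notebook.
    """
    assigned = set(numeric_cols) | set(categorical_cols)
    return [col for col in df.columns if col not in assigned]


# Toy frame with one column ('track_id') that is in neither list.
toy_X = pd.DataFrame({
    'danceability': [0.5, 0.7],
    'key': [6.0, 2.0],
    'track_id': ['a1', 'b2'],
})
toy_numeric = ['danceability']
toy_categorical = ['key']

unassigned = find_unassigned_columns(toy_X, toy_numeric, toy_categorical)
overlap = set(toy_numeric) & set(toy_categorical)

print('Columns that would be dropped:', unassigned)
print('Columns listed twice:', overlap)
```

Running the same helper on the real `X` with `numeric_features` and `categorical_features` should show whether anything is dropped unintentionally.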


## 3. Build the preprocessing pipeline

The preprocessing pipeline does the following:

1. For numerical features: imputes missing values with the **median** and applies **StandardScaler**;
2. For categorical features: imputes missing values with the **most frequent** value and applies **OneHotEncoder**;
3. Combines everything using a `ColumnTransformer`.

This preprocessing will be reused by all regression models.

In [4]:
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='drop'
)

## 4. Train/test split

We split the dataset into a **training set** (80%) and a **test set** (20%). Models are trained on the training data and evaluated on the test data, which simulates new songs the model has not seen before.

In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

X_train.shape, X_test.shape

((3864, 13), (966, 13))

## 5. Define models and full pipelines

For each model we build a pipeline that first applies the common preprocessing and then fits the regression model.

In [6]:
# 1) Linear Regression
linreg_model = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', LinearRegression())
])

# 2) KNN Regressor with k=5, a common default value, as instructed.
# k could be chosen more carefully through hyperparameter tuning; we keep it fixed for simplicity.
knn_model = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', KNeighborsRegressor(n_neighbors=5))
])

# 3) Random Forest Regressor
rf_model = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', RandomForestRegressor(n_estimators=100, random_state=0))
])
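
One practical benefit of wrapping preprocessing and model in a single pipeline is that the preprocessing statistics are learned from the training data only. The sketch below illustrates this on synthetic data with a simplified pipeline (a scaler plus linear regression). The data and the names `X_demo`, `y_demo`, and `demo_pipe` are made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption: purely illustrative).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

demo_pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', LinearRegression()),
])
demo_pipe.fit(X_tr, y_tr)

# The fitted scaler inside the pipeline only ever saw the training rows.
scaler_means = demo_pipe.named_steps['scaler'].mean_
uses_train_only = np.allclose(scaler_means, X_tr.mean(axis=0))

# predict/score reuse the training statistics to transform the test rows.
test_r2 = demo_pipe.score(X_te, y_te)

print('Scaler means equal training means:', uses_train_only)
print('Test R²:', round(test_r2, 4))
```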

## 6. Evaluation function (MSE, RMSE, MAE, R²)

We define a helper function that, given a model, fits it on the training data and computes standard regression metrics on both training and test sets:

- **MSE** (Mean Squared Error): average squared error; lower is better;
- **RMSE** (Root Mean Squared Error): square root of MSE, in the same units as the target (popularity 0–100);
- **MAE** (Mean Absolute Error): average absolute error;
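- **R²**: proportion of the variance of the target explained by the model (1 = perfect predictions, 0 = no better than always predicting the mean of the target; negative values are possible when the model does worse than that).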
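
As a small worked example, the four metrics can be computed by hand with NumPy. The target and prediction values below are made up for illustration.

```python
import numpy as np

# Hypothetical popularity scores and predictions.
y_true = np.array([50.0, 60.0, 70.0, 80.0])
y_pred = np.array([55.0, 58.0, 75.0, 70.0])

errors = y_true - y_pred                 # [-5, 2, -5, 10]

mse = np.mean(errors ** 2)               # (25 + 4 + 25 + 100) / 4
rmse = np.sqrt(mse)                      # back in popularity points
mae = np.mean(np.abs(errors))            # (5 + 2 + 5 + 10) / 4

# R² compares the model's squared error with that of predicting the mean.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Always predicting the mean of y_true gives R² = 0 by construction.
mean_pred = np.full_like(y_true, y_true.mean())
r2_mean = 1.0 - np.sum((y_true - mean_pred) ** 2) / ss_tot

print(mse, round(rmse, 3), mae, round(r2, 3), r2_mean)
```

Here RMSE (about 6.2) is larger than MAE (5.5) because the single error of 10 points is squared before averaging. RMSE is never smaller than MAE, and the gap widens as the errors become more uneven.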

In [7]:
def evaluate_regression_model(name, model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    mse_train = mean_squared_error(y_train, y_train_pred)
    mse_test = mean_squared_error(y_test, y_test_pred)
    
    rmse_train = np.sqrt(mse_train)
    rmse_test = np.sqrt(mse_test)
    
    mae_train = mean_absolute_error(y_train, y_train_pred)
    mae_test = mean_absolute_error(y_test, y_test_pred)
    
    r2_train = r2_score(y_train, y_train_pred)
    r2_test = r2_score(y_test, y_test_pred)
    
    return {
        'model': name,
        'mse_train': mse_train,
        'mse_test': mse_test,
        'rmse_train': rmse_train,
        'rmse_test': rmse_test,
        'mae_train': mae_train,
        'mae_test': mae_test,
        'r2_train': r2_train,
        'r2_test': r2_test
    }

## 7. Train and compare models

We apply the evaluation function to the three models and collect the results in a table.

In [8]:
results = []
results.append(evaluate_regression_model('LinearRegression', linreg_model,
                                         X_train, X_test, y_train, y_test))
results.append(evaluate_regression_model('KNNRegressor', knn_model,
                                         X_train, X_test, y_train, y_test))
results.append(evaluate_regression_model('RandomForestRegressor', rf_model,
                                         X_train, X_test, y_train, y_test))

results_df = pd.DataFrame(results)
results_df.sort_values(by='rmse_test')

| | model | mse_train | mse_test | rmse_train | rmse_test | mae_train | mae_test | r2_train | r2_test |
|---|---|---|---|---|---|---|---|---|---|
| 2 | RandomForestRegressor | 29.290002 | 193.69054 | 5.412024 | 13.917275 | 3.99333 | 10.263686 | 0.926282 | 0.485101 |
| 0 | LinearRegression | 268.725073 | 256.318596 | 16.392836 | 16.009953 | 12.640921 | 12.420396 | 0.323668 | 0.318613 |
| 1 | KNNRegressor | 199.387236 | 275.177226 | 14.120455 | 16.588467 | 10.561439 | 12.223602 | 0.498179 | 0.268479 |


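The column `rmse_test` is particularly intuitive: it is expressed in popularity points (on the 0–100 scale) and summarizes how far off we typically are when predicting the popularity of a previously unseen song. For example, a test RMSE of 10 corresponds to a typical error of roughly 10 popularity points. Because errors are squared before averaging, RMSE weights large errors more heavily than `mae_test`, which is the plain average absolute error.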
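
To put these RMSE values in context, we can back out the error of a trivial baseline from the numbers in the results table. Since R² = 1 − MSE / Var(y), each row implies a test-set variance of MSE / (1 − R²), and its square root is the RMSE of always predicting the mean of the test targets. The sketch below uses only values copied from the table; the variable names are illustrative.

```python
import math

# Test-set metrics copied from the results table: model -> (mse_test, r2_test).
reported = {
    'RandomForestRegressor': (193.69054, 0.485101),
    'LinearRegression': (256.318596, 0.318613),
    'KNNRegressor': (275.177226, 0.268479),
}

# R² = 1 - MSE / Var(y_test)  =>  Var(y_test) = MSE / (1 - R²)
implied_variance = {
    name: mse / (1.0 - r2) for name, (mse, r2) in reported.items()
}

# All three rows should imply (almost) the same variance; average them.
mean_variance = sum(implied_variance.values()) / len(implied_variance)
baseline_rmse = math.sqrt(mean_variance)

for name, var in implied_variance.items():
    print(f'{name}: implied test variance = {var:.2f}')
print(f'Baseline RMSE (predicting the test mean): {baseline_rmse:.2f}')
```

All three rows imply a variance of about 376, that is, a baseline RMSE of roughly 19.4 popularity points, so the Random Forest's test RMSE of about 13.9 is a clear improvement over predicting a constant.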

### Hyperparameter tuning (fine-tuning) of the best model

From the previous comparison, the Random Forest regressor achieved the lowest test RMSE,
so we select it as our **best base model**.

In this section we perform a simple **hyperparameter tuning** (fine-tuning) step:
we keep the same model family (Random Forest) but try different choices for its
hyperparameters, such as the number of trees and the maximum depth of each tree.

We use a small grid of hyperparameters and `GridSearchCV` with cross-validation.
The goal is to see whether a slightly different configuration of the Random Forest
can reduce the prediction error (RMSE) on the validation folds.
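
Conceptually, the grid search is a loop over every hyperparameter combination, with a cross-validated score computed for each one. The sketch below spells this out with `ParameterGrid` and `cross_val_score` on synthetic data. The data, the smaller grid, and the variable names are assumptions chosen to keep the example fast, and it covers only the search step (`GridSearchCV` additionally refits the best configuration on the full training data by default).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ParameterGrid, cross_val_score

# Synthetic data standing in for the training set (illustrative only).
X_toy, y_toy = make_regression(
    n_samples=120, n_features=5, noise=10.0, random_state=0
)

toy_grid = {'n_estimators': [10, 30], 'max_depth': [None, 3]}

cv_rmse_by_params = []
for params in ParameterGrid(toy_grid):
    model = RandomForestRegressor(random_state=0, **params)
    # The scorer returns negative RMSE, so that "higher is better" holds.
    scores = cross_val_score(
        model, X_toy, y_toy, cv=3, scoring='neg_root_mean_squared_error'
    )
    cv_rmse_by_params.append((-scores.mean(), params))

# Keep the combination with the lowest mean RMSE across the folds.
best_rmse, best_params = min(cv_rmse_by_params, key=lambda item: item[0])

print('Best parameters:', best_params)
print('Best CV RMSE:', best_rmse)
```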


In [9]:
# Rebuild the Random Forest pipeline (same structure as before)
rf_pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', RandomForestRegressor(random_state=0))
])

# Define a small grid of hyperparameters to try
param_grid = {
    'model__n_estimators': [50, 100, 200],    # number of trees
    'model__max_depth': [None, 10, 20]        # maximum depth of each tree
}

# GridSearchCV: 3-fold cross-validation, using negative RMSE as the score
grid_search = GridSearchCV(
    rf_pipeline,
    param_grid=param_grid,
    cv=3,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

print("Best parameters found:", grid_search.best_params_)
print("Best CV RMSE:", -grid_search.best_score_)


Best parameters found: {'model__max_depth': None, 'model__n_estimators': 200}
Best CV RMSE: 14.88597497547255
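
Beyond the single best configuration, it can be useful to look at how all candidates scored. A fitted `GridSearchCV` exposes this in its `cv_results_` attribute. The sketch below shows one way to turn it into a table, using a small search on synthetic data; `toy_search` and the data are assumptions for illustration, and the same lines should apply to `grid_search` above.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small illustrative search on synthetic data.
X_toy, y_toy = make_regression(
    n_samples=120, n_features=5, noise=10.0, random_state=0
)
toy_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={'n_estimators': [10, 30], 'max_depth': [None, 3]},
    cv=3,
    scoring='neg_root_mean_squared_error',
)
toy_search.fit(X_toy, y_toy)

# One row per hyperparameter combination.
cv_table = pd.DataFrame(toy_search.cv_results_)[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
].copy()

# Scores are negative RMSE; flip the sign to read them as errors.
cv_table['mean_cv_rmse'] = -cv_table['mean_test_score']
cv_table = cv_table.sort_values('rank_test_score')

print(cv_table[['params', 'mean_cv_rmse', 'std_test_score']])
```

If the top few rows have nearly identical `mean_cv_rmse` relative to `std_test_score`, the choice between them matters little.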


In [10]:
# Use the best found Random Forest model
best_rf_tuned = grid_search.best_estimator_

# Evaluate tuned model on the test set
y_test_pred_tuned = best_rf_tuned.predict(X_test)

mse_test_tuned = mean_squared_error(y_test, y_test_pred_tuned)
rmse_test_tuned = np.sqrt(mse_test_tuned)
mae_test_tuned = mean_absolute_error(y_test, y_test_pred_tuned)
r2_test_tuned = r2_score(y_test, y_test_pred_tuned)

print("Tuned Random Forest – test RMSE:", rmse_test_tuned)
print("Tuned Random Forest – test MAE:", mae_test_tuned)
print("Tuned Random Forest – test R²:", r2_test_tuned)


Tuned Random Forest – test RMSE: 13.863292235763735
Tuned Random Forest – test MAE: 10.207189095928227
Tuned Random Forest – test R²: 0.4890871893402142
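
To quantify the effect of tuning, the short calculation below compares these test metrics with the Random Forest row of the Section 7 results table. All numbers are copied from the outputs above; only the variable names are new.

```python
# Test metrics of the original Random Forest (Section 7 results table).
base_test = {'rmse': 13.917275, 'mae': 10.263686, 'r2': 0.485101}

# Test metrics of the tuned Random Forest (printed above).
tuned_test = {
    'rmse': 13.863292235763735,
    'mae': 10.207189095928227,
    'r2': 0.4890871893402142,
}

# Absolute change per metric (negative is better for RMSE and MAE).
change = {metric: tuned_test[metric] - base_test[metric] for metric in base_test}

# Relative change in test RMSE, in percent.
rmse_change_pct = 100.0 * change['rmse'] / base_test['rmse']

for metric, delta in change.items():
    print(f'{metric}: {base_test[metric]:.3f} -> {tuned_test[metric]:.3f} ({delta:+.3f})')
print(f'Relative change in test RMSE: {rmse_change_pct:+.2f}%')
```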


### Interpretation of the fine-tuning results

The grid search selects the combination of hyperparameters that gives the lowest
cross-validated RMSE on the training data. The best parameters found are stored in
`grid_search.best_params_`.

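Comparing the tuned Random Forest with the original one (Section 7), we see that:

- the **test RMSE** goes from about 13.92 to about 13.86, the test MAE from about 10.26 to about 10.21, and the test R² from about 0.485 to about 0.489;
- the selected configuration differs from the original only in the number of trees (200 instead of 100; `max_depth` stays at `None`), so the gain is small: roughly 0.05 popularity points of RMSE;
- any reduction in RMSE means that, on average, our predictions are closer to the true popularity scores on unseen songs, although a difference this small may be within the variation we would see with a different train/test split.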

This fine-tuning step shows how, once we have chosen a model family (Random Forest),
we can **optimize its hyperparameters** to squeeze out a bit more predictive performance.
In the context of our project, this is a simple example of **model fine-tuning**:
we do not change the type of model, only how it is configured.
