# Spotify Song Popularity Prediction

This project explores what makes a song popular on Spotify. The goal is to predict a track’s popularity score (0–100) using its audio features such as energy, danceability, tempo, and valence.

The dataset comes from the public **Spotify Tracks Database** on Kaggle. This analysis follows a typical data science workflow — data cleaning, exploration, model building, evaluation, and reflection.



In [None]:
# Imports
import pandas as pd, numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
try:
    from xgboost import XGBRegressor
    HAS_XGB = True
except Exception:
    HAS_XGB = False

plt.rcParams['figure.figsize'] = (7,4)
pd.set_option('display.max_columns', 100)

In [None]:
# Load data
PATH = 'data/spotify_tracks.csv'  # <- replace if needed
df_raw = pd.read_csv(PATH)
df_raw.head()

## Step 1: Load and Prepare the Data

The first step is to load the dataset and select the relevant numerical audio features, along with the target column `popularity`. I’ll also drop rows with missing values and clip extreme `tempo` and `loudness` values to plausible ranges, so that outliers do not distort training.


In [None]:
NUMERIC_FEATURES = ['danceability','energy','loudness','speechiness','acousticness',
                     'instrumentalness','liveness','valence','tempo','duration_ms']
TARGET_COL = 'popularity'

df = df_raw.copy()
cols = [c for c in NUMERIC_FEATURES if c in df.columns] + [TARGET_COL]
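df = df[cols].dropna()  # drop rows with any missing feature or target value
# Keep only valid popularity scores; .copy() avoids a SettingWithCopyWarning below
df = df[df[TARGET_COL].between(0, 100)].copy()
# Clip (rather than drop) implausible tempo and loudness values
df['tempo'] = df['tempo'].clip(30, 220)
df['loudness'] = df['loudness'].clip(-60, 5)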
df.describe(include='all')
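
Note that the cleaning cell clips out-of-range values instead of removing rows. A minimal sketch of what that does, using a toy frame with made-up values (the name `toy` and its numbers are illustrative assumptions, not taken from the dataset):

```python
import pandas as pd

# Toy frame with made-up values; not taken from the Spotify dataset.
toy = pd.DataFrame({
    "tempo": [0.0, 95.0, 128.0, 250.0],
    "loudness": [-70.0, -12.5, -5.0, 8.0],
})

# Same bounds as the cleaning cell: values outside the range are replaced
# by the nearest bound, so no rows are lost.
clipped = toy.assign(
    tempo=toy["tempo"].clip(30, 220),
    loudness=toy["loudness"].clip(-60, 5),
)
print(clipped)
```

Clipping keeps the row count unchanged, which is convenient when only one or two columns of a track are implausible.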

## Step 2: Explore the Data (EDA)

Before building any model, it’s important to understand the relationships between features and popularity. In this step, I’ll explore correlations and visualize how variables like energy, tempo, and danceability relate to popularity.


In [None]:
# Correlation with popularity
corr = df.corr(numeric_only=True)['popularity'].sort_values(ascending=False)
corr
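
`DataFrame.corr` computes the Pearson coefficient by default, which only measures linear association. The sketch below, on made-up numbers, spells out the formula and shows that a feature with a purely U-shaped relationship to the target can still score close to zero (`pearson_r` is a hypothetical helper written for this illustration):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Made-up feature values, e.g. something scaled like 'energy'.
x = np.array([0.2, 0.4, 0.6, 0.8])

y_linear = 10 + 50 * x        # perfectly linear in x
y_u_shape = (x - 0.5) ** 2    # depends on x, but symmetrically around 0.5

print(pearson_r(x, y_linear))   # close to 1
print(pearson_r(x, y_u_shape))  # close to 0 despite a clear dependence
```

A low correlation in the table above therefore does not rule out a non-linear relationship, which is one reason to look at the scatter plots as well.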

In [None]:
# Pairwise scatter vs popularity (a few)
cols_to_plot = ['danceability','energy','valence','tempo','loudness']
for c in cols_to_plot:
    plt.figure()
    plt.scatter(df[c], df['popularity'], alpha=0.3)
    plt.xlabel(c); plt.ylabel('popularity'); plt.title(f'{c} vs popularity')
    plt.show()

## Step 3: Split Data for Training and Testing

To properly evaluate model performance, I’ll split the data into training and test sets. This ensures that the model is tested on unseen data and gives a realistic view of how it might perform in the real world.


In [None]:
X = df.drop(columns=[TARGET_COL])
y = df[TARGET_COL]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape
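
As a quick sanity check on what `train_test_split` guarantees, here is a small sketch on toy data (the `X_demo` and `y_demo` names and values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the feature matrix and target (made-up values).
X_demo = pd.DataFrame({"energy": np.linspace(0, 1, 100)})
y_demo = pd.Series(np.arange(100), name="popularity")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 80/20 split, and no row appears in both sets.
print(X_tr.shape, X_te.shape)
overlap = X_tr.index.intersection(X_te.index)
print(len(overlap))

# The same random_state reproduces the same split on a second call.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
same_split = X_te.index.equals(X_te_again.index)
print(same_split)
```

Fixing `random_state` is what makes the reported metrics reproducible from run to run.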

## Step 4: Build and Compare Regression Models

Next, I’ll train multiple regression models: Linear Regression, Ridge, Lasso, Random Forest, and (if the `xgboost` package is installed) XGBoost. The goal is to see which one best captures the relationship between audio features and popularity.

I’ll use 5-fold cross-validated grid search to tune the hyperparameters of every model except plain Linear Regression, which has none to tune, so that the comparison is fair.


In [None]:
models = {
    'linreg': Pipeline([('scaler', StandardScaler()), ('model', LinearRegression())]),
    'ridge':  Pipeline([('scaler', StandardScaler()), ('model', Ridge())]),
    'lasso':  Pipeline([('scaler', StandardScaler()), ('model', Lasso(max_iter=5000))]),
    'rf':     Pipeline([('model', RandomForestRegressor(random_state=42))])
}
if HAS_XGB:
    models['xgb'] = Pipeline([('model', XGBRegressor(random_state=42, n_estimators=300, learning_rate=0.05,
                                                     max_depth=6, subsample=0.8, colsample_bytree=0.8))])

param_grids = {
    'ridge': {'model__alpha':[0.1,1.0,3.0,10.0]},
    'lasso': {'model__alpha':[0.001,0.01,0.1,1.0]},
    'rf':    {'model__n_estimators':[200,400],
              'model__max_depth':[None,10,20],
              'model__min_samples_split':[2,5]}
}
if 'xgb' in models:
    param_grids['xgb'] = {'model__n_estimators':[200,400],
                          'model__max_depth':[4,6,8],
                          'model__learning_rate':[0.03,0.05,0.1]}

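def eval_metrics(y_true, y_pred):
    """Return RMSE, MAE and R2 for a set of predictions."""
    return {
        # np.sqrt(MSE) instead of squared=False, which newer scikit-learn releases no longer accept
        'RMSE': np.sqrt(mean_squared_error(y_true, y_pred)),
        'MAE': mean_absolute_error(y_true, y_pred),
        'R2': r2_score(y_true, y_pred)
    }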

results = {}
best_name, best_model, best_r2 = None, None, -np.inf

for name, pipe in models.items():
    if name in param_grids:
        gs = GridSearchCV(pipe, param_grids[name], scoring='r2', cv=5, n_jobs=-1)
        gs.fit(X_train, y_train)
        model = gs.best_estimator_
    else:
        pipe.fit(X_train, y_train)
        model = pipe

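    # Score on the held-out test set. Caveat: picking the winner by test-set R2
    # makes that model's reported score slightly optimistic.
    y_pred = model.predict(X_test)
    metrics = eval_metrics(y_test, y_pred)
    results[name] = metrics
    if metrics['R2'] > best_r2:
        best_name, best_model, best_r2 = name, model, metrics['R2']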

results, best_name, best_r2
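
The loop above picks the winning model by its test-set R2. One possible alternative, sketched here on synthetic data rather than the Spotify dataset, is to choose the winner from the cross-validation score (`best_score_`) and use the test set only once at the end. The candidate set and all names below are assumptions for illustration, not the notebook's method:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Spotify dataset).
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

candidates = {
    "ridge": (Ridge(), {"model__alpha": [0.1, 1.0, 10.0]}),
    "lasso": (Lasso(max_iter=5000), {"model__alpha": [0.01, 0.1, 1.0]}),
}

cv_scores, fitted = {}, {}
for name, (estimator, grid) in candidates.items():
    pipe = Pipeline([("scaler", StandardScaler()), ("model", estimator)])
    gs = GridSearchCV(pipe, grid, scoring="r2", cv=5)
    gs.fit(X_tr, y_tr)
    cv_scores[name] = gs.best_score_    # mean cross-validated R2 of the best setting
    fitted[name] = gs.best_estimator_   # refit on the full training set

# Choose the winner from cross-validation only ...
chosen = max(cv_scores, key=cv_scores.get)
# ... and touch the test set once, for the final estimate.
test_r2 = r2_score(y_te, fitted[chosen].predict(X_te))
print(chosen, round(cv_scores[chosen], 3), round(test_r2, 3))
```

With this design the test score stays an unbiased estimate, because the test data played no part in deciding which model to report.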

## Step 5: Visualize Predictions vs Actual Values

This scatter plot compares the predicted popularity scores against the true values from the test set. A perfect model would place every point on the diagonal line, where predicted equals actual; the farther the points fall from it, the larger the prediction error.


In [None]:
y_pred = best_model.predict(X_test)
plt.figure()
plt.scatter(y_test, y_pred, alpha=0.4)
plt.xlabel('Actual popularity')
plt.ylabel('Predicted popularity')
plt.title(f'Predicted vs Actual ({best_name})')
plt.plot([0, 100], [0, 100], linestyle='--', color='gray')  # reference line: predicted == actual
plt.show()
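
The distance of each point from the diagonal is that track's prediction error, and the three reported metrics summarize those errors. A worked sketch on four made-up values (not results from this notebook) shows how each one is computed:

```python
import numpy as np

# Made-up actual and predicted popularity scores.
y_true = np.array([50.0, 60.0, 70.0, 80.0])
y_hat = np.array([52.0, 58.0, 73.0, 77.0])

errors = y_hat - y_true                # vertical distance from the diagonal

mae = np.abs(errors).mean()            # average absolute error
rmse = np.sqrt((errors ** 2).mean())   # penalizes large errors more heavily

ss_res = (errors ** 2).sum()                      # unexplained variation
ss_tot = ((y_true - y_true.mean()) ** 2).sum()    # total variation around the mean
r2 = 1 - ss_res / ss_tot                          # share of variation explained

print(f"MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```

RMSE and MAE are in popularity points, while R2 is unitless, so the first two are easier to read against the 0–100 scale.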

## Step 6: Interpret Feature Importance

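Understanding which audio features drive the predictions is key. This section lists the most important predictors according to the best-performing model: built-in feature importances for tree-based models, or coefficients on the standardized features for linear models, ranked by magnitude. These rankings describe what the model relies on, which helps explain what aspects of a song tend to go along with popularity, but they are not proof of cause.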


In [None]:
import numpy as np
feat_names = list(X.columns)

if hasattr(best_model[-1], 'feature_importances_'):
    importances = best_model[-1].feature_importances_
    order = np.argsort(importances)[::-1]
    for idx in order:
        print(f"{feat_names[idx]}: {importances[idx]:.4f}")
elif hasattr(best_model[-1], 'coef_'):
    coefs = best_model[-1].coef_
    if coefs.ndim == 1:
        order = np.argsort(np.abs(coefs))[::-1]
        for idx in order:
            print(f"{feat_names[idx]}: {coefs[idx]:.4f}")
else:
    print('Model does not expose feature importances or coefficients.')
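
Built-in importances and coefficients are model-specific, so they are hard to compare across model types. A model-agnostic alternative is permutation importance from `sklearn.inspection`, sketched here on synthetic data where only the first feature matters. The data and feature names are made up for illustration, and this is not what the cell above computes:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on the first column only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)
names = ["signal", "noise_a", "noise_b"]

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the drop in score.
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)

order = np.argsort(result.importances_mean)[::-1]
ranking = [names[i] for i in order]
for i in order:
    print(f"{names[i]}: {result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")
```

In the notebook, the same call could presumably be applied to `best_model` with `X_test` and `y_test`, since it works on any fitted estimator or pipeline.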

## Step 7: Save the Best Model

Finally, I’ll save the best-performing model as a `.pkl` file so it can be reused later in a web app or API without retraining. This step simulates how machine learning models are deployed in real-world applications.


In [None]:
import joblib, os
os.makedirs('models', exist_ok=True)
joblib.dump(best_model, 'models/final_model.pkl')
'Model saved to models/final_model.pkl'
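
To reuse the saved file, it has to be loaded back with `joblib.load`. The sketch below does a round trip with a tiny stand-in pipeline and a temporary directory (both are assumptions for illustration) and checks that the reloaded model predicts the same values:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny stand-in pipeline fitted on random data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge())]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "final_model.pkl")
    joblib.dump(pipe, path)
    reloaded = joblib.load(path)  # only load pickle files from sources you trust

# The reloaded pipeline includes the fitted scaler, so raw features go straight in.
same = np.allclose(pipe.predict(X_demo), reloaded.predict(X_demo))
print(same)
```

Because the whole pipeline is saved, a web app or API would pass in the same feature columns used for training. It is also safest to load the file with the same scikit-learn version that produced it.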

## Step 8: Conclusion and Reflection

In this project, I built several regression models to predict Spotify song popularity from audio features. Random Forest and XGBoost generally performed the best, suggesting that non-linear patterns play a strong role in music popularity.

**Key insights:**

- Energy, valence, and danceability were strong predictors of popularity.

- Simpler models like linear regression underfit the data, missing complex interactions.

- Data preprocessing affected performance. Feature scaling matters mainly for the regularized linear models (Ridge and Lasso), since tree-based models are insensitive to it.

If I were to extend this project, I would include text data such as lyrics or user reviews, or incorporate time-based features like release year and genre trends.

