# **CS M148 Final Project Appendix**

**Group Name**: Samson and Friends

**Group Members**: Trevor Cai, Samantha Chou, Samson Huynh, Shreyas Kamath, Hannah Jin, Joshua Li

[**Link to Colab File**](https://colab.research.google.com/drive/1gJQEs9-5FkhU0ETsk-EeGJXFMRZ2XqgL?usp=sharing)

## **Project Check-In Note**
Our submitted project check-ins show our progress over the quarter, though some of that code was not used in the final project. We revised our methods to better serve the overall goal outlined in our readme file, and the revised methods are reflected below.

## **Importing Libraries and Data**

In [None]:
# Install the scikit-lego and mlxtend packages
%pip install scikit-lego
%pip install mlxtend

Collecting scikit-lego
  Downloading scikit_lego-0.9.3-py2.py3-none-any.whl.metadata (12 kB)
Downloading scikit_lego-0.9.3-py2.py3-none-any.whl (219 kB)
Installing collected packages: scikit-lego
Successfully installed scikit-lego-0.9.3
Collecting mlxtend
  Downloading mlxtend-0.23.3-py3-none-any.whl.metadata (7.3 kB)
Downloading mlxtend-0.23.3-py3-none-any.whl (1.4 MB)
Installing collected packages: mlxtend
Successfully installed mlxtend-0.23.3


In [None]:
# Import all necessary libraries
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from transformers import AutoModel
from sklearn.model_selection import train_test_split, cross_val_score, KFold
# from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

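# Set up importing the data
# Read the Hugging Face token from the environment rather than hard-coding a secret
# (optional here: the dataset is public)
import os
access_token = os.environ.get("HF_TOKEN")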
pd.set_option('display.max_columns', None)
pd.options.mode.copy_on_write = True



In [None]:
# Import data as a csv file
tracks = pd.read_csv("hf://datasets/maharshipandya/spotify-tracks-dataset/dataset.csv")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


## **Data Cleaning**

The data cleaning steps involved first identifying tracks with null values. Since only one row with a null value was found, we removed it by dropping the row from the dataset. Next, duplicate rows were eliminated by checking for identical combinations of track name, artists, and track genre. We further simplified the data by dropping the columns for artists, track name, and album name.

The steps also include dropping the rows where `popularity == 0` and creating a new feature column, `popularity-rating`, which splits the data into three popularity bins: low, medium, and high.
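
The cells in this section do not show the binning itself. A minimal sketch of how such a three-way split could be done with `pd.cut` follows; the bin edges (33 and 66), the toy data, and the column name `popularity_rating` are assumptions for illustration, not the values used in the project.

```python
import pandas as pd

# Toy popularity scores standing in for tracks['popularity'] (hypothetical values)
toy = pd.DataFrame({"popularity": [1, 33, 34, 66, 67, 100]})

# Assumed bin edges: (0, 33] -> low, (33, 66] -> medium, (66, 100] -> high
rating_bins = [0, 33, 66, 100]
rating_labels = ["low", "medium", "high"]

# right=True makes each bin include its upper edge
toy["popularity_rating"] = pd.cut(
    toy["popularity"], bins=rating_bins, labels=rating_labels, right=True
)
print(toy)
```

Because rows with a popularity of 0 are dropped during cleaning, a lowest edge of 0 with `right=True` leaves no score unassigned.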

In [None]:
# Display dataset info
tracks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114000 entries, 0 to 113999
Data columns (total 21 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Unnamed: 0        114000 non-null  int64  
 1   track_id          114000 non-null  object 
 2   artists           113999 non-null  object 
 3   album_name        113999 non-null  object 
 4   track_name        113999 non-null  object 
 5   popularity        114000 non-null  int64  
 6   duration_ms       114000 non-null  int64  
 7   explicit          114000 non-null  bool   
 8   danceability      114000 non-null  float64
 9   energy            114000 non-null  float64
 10  key               114000 non-null  int64  
 11  loudness          114000 non-null  float64
 12  mode              114000 non-null  int64  
 13  speechiness       114000 non-null  float64
 14  acousticness      114000 non-null  float64
 15  instrumentalness  114000 non-null  float64
 16  liveness          11

In [None]:
# View null values -- total of 1
tracks_with_nan = tracks[tracks.isnull().any(axis=1)]
display(tracks_with_nan)

Unnamed: 0.1,Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
65900,65900,1kR4gIb7nGxHPI3D2ifs59,,,,0,0,False,0.501,0.583,7,-9.46,0,0.0605,0.69,0.00396,0.0747,0.734,138.391,4,k-pop


In [None]:
# Drop tracks with null values
tracks = tracks.dropna()

In [None]:
# View dataset shape to confirm dropped track
tracks.shape

(113999, 21)

In [None]:
# Drop duplicate tracks if the combination of track name, artists, and track genre is the same
tracks = tracks.drop_duplicates(subset=['track_name', 'artists', 'track_genre'])

# Dropping artists, track name, and album name columns from the dataframe
tracks = tracks.drop(['artists', 'track_name', 'album_name'], axis=1)

# Drop tracks that are not popular (have popularity score of 0)
tracks = tracks[tracks['popularity'] != 0]

In [None]:
# recheck shape
tracks.shape

(91479, 18)

## **Linear Regression**
One of the first methodologies we tried was linear regression. We began by drawing a sample of the data and regressing popularity on loudness, the variable with the strongest positive correlation with popularity in the correlation heatmap from our exploratory data analysis. After reviewing those results, we built a linear regression model using all of the numeric predictor variables, with popularity as the response variable.


In [None]:
# Importing necessary modules
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge

#### **Linear Regression of Loudness and Popularity on 2500 Sample Tracks**
The predictor variable we examined was loudness because it had the highest magnitude positive correlation with popularity, with a correlation coefficient of 0.079.

In [None]:
# Step 1: Sampling 2500 tracks from the dataset to perform linear regression
sample_tracks = tracks.sample(n=2500, random_state=42)

In [None]:
# Step 2: Define predictor variable loudness and response variable popularity
X = np.array(sample_tracks['loudness']).reshape(-1, 1)
y = np.array(sample_tracks['popularity'])

In [None]:
# Step 3: Performing linear regression
ls_fit = LinearRegression()
ls_fit.fit(X, y)
ls_fit.intercept_, ls_fit.coef_[0] # Extracting b0 - intercept and b1 - slope

(41.375298688320285, 0.25283211677897577)

Fitting the model with loudness as the predictor gives a slope of 0.253: each one-unit increase in loudness (measured in decibels) is associated with an increase of roughly 0.25 popularity points. The slope is not a correlation coefficient, so on its own it does not tell us how strong the relationship is. To better understand the relationship, we created a scatterplot and overlaid the least squares regression line.
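
For a single-predictor least squares fit, the slope and the Pearson correlation are related by slope = r · sd(y) / sd(x), so a slope near 0.25 can coexist with a much smaller correlation. The sketch below illustrates this on synthetic data; the generated values and variable names are assumptions, not the project's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for loudness (x) and popularity (y)
x = rng.normal(loc=-8.0, scale=5.0, size=500)
y = 40 + 0.25 * x + rng.normal(scale=18.0, size=500)

# Least squares slope from scikit-learn
fit = LinearRegression().fit(x.reshape(-1, 1), y)
slope = fit.coef_[0]

# Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]

# Identity for simple linear regression: slope = r * sd(y) / sd(x)
slope_from_r = r * y.std(ddof=1) / x.std(ddof=1)

print(f"slope={slope:.3f}, r={r:.3f}, r*sd(y)/sd(x)={slope_from_r:.3f}")
```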

In [None]:
# Step 4: Scatter plot of linear regression with least squares line of fit
fig = px.scatter(sample_tracks, x='loudness', y='popularity')

fig.add_trace(
    go.Scatter(x=np.array(sample_tracks['loudness']),
                y=ls_fit.intercept_ + np.array(sample_tracks['loudness']) * ls_fit.coef_[0],
                mode='lines',
                name='LS Regression',
                line={'dash': 'solid',
                      'color': 'red'})
)

fig.update_layout(
    title='Linear Regression of Track Popularity with Loudness',
    xaxis_title='Loudness',
    yaxis_title='Popularity'
)

We also wanted to look at how well our model did at predicting the true popularity. In order to do this, we created a dataframe where each track has a true popularity value and a popularity value predicted by the model.

In [None]:
# Step 5: Evaluate the model's predictions using metrics
# Creating a dataframe with true response variable versus least squares prediction
pred_train_df = pd.DataFrame(
    {'true': y,
     'ls_pred': ls_fit.predict(X)}
    )

We used these evaluation metrics to provide insights into the performance of the linear regression model:
- rMSE and MAE to quantify the average error in the predictions.
- MAD (the median absolute deviation of the errors) to give a measure of typical error that is robust to outliers.
- Correlation and R² to assess how well the model explains the variability in the data.
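
As a small worked example of how these metrics are defined, the sketch below computes them by hand on four made-up values (the arrays are hypothetical and chosen only to keep the arithmetic easy to follow).

```python
import numpy as np

# Hypothetical true and predicted values
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 35.0])
residuals = y_true - y_pred  # [-2, 2, -3, 5]

# Root mean squared error: penalizes large errors more heavily
rmse = np.sqrt(np.mean(residuals ** 2))

# Mean absolute error: average size of the errors
mae = np.mean(np.abs(residuals))

# Median absolute deviation of the errors: robust to a few large misses
mad = np.median(np.abs(residuals))

# R^2: share of the variance in y_true explained by the predictions
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(rmse, mae, mad, r2)
```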


In [None]:
# Evaluation metrics for least squares model
print('LS rMSE:', np.sqrt(mean_squared_error(pred_train_df['true'], pred_train_df['ls_pred']))) # Root Mean Squared Error
print('LS MAE:', mean_absolute_error(pred_train_df['true'], pred_train_df['ls_pred'])) # Mean Absolute Error
print('LS MAD:', np.median(np.abs(pred_train_df['true'] - pred_train_df['ls_pred']))) # Median Absolute Deviation
print('LS correlation:', np.corrcoef(pred_train_df['true'], pred_train_df['ls_pred'])[0, 1]) # Correlation
print('LS R2:', r2_score(pred_train_df['true'], pred_train_df['ls_pred'])) # R2 Score

LS rMSE: 18.491571582468158
LS MAE: 15.318903667841287
LS MAD: 14.301798009546332
LS correlation: 0.06734801166669604
LS R2: 0.004535754675457415


The results suggest that the linear regression model with loudness as the sole predictor is not effective at explaining or predicting popularity. This is consistent with the low correlation coefficient and weak R² value. Additional predictors or a more complex model might be needed to capture the variability in track popularity.

#### **Modeling and Evaluating the Regression**
Here, we proceeded with the least squares linear regression as our model, but with more predictor variables to see if that would improve our model's performance.


We begin by selecting the predictor variables (`X`) and the response variable (`y`). The predictors exclude columns that are not useful for modeling, such as `track_id`, `Unnamed: 0`, and the categorical `track_genre`, as well as the target variable `popularity`.


In [None]:
# Step 1: Defining our predictor and response variables for the regression model
X = tracks.drop(['popularity', 'track_id', "Unnamed: 0", 'track_genre'], axis=1)
y = tracks['popularity']

In [None]:
# Step 2: Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

We trained the linear regression model, and evaluated the model's performance using a 5-fold cross-validation.

In [None]:
# Step 3: Create and evaluate linear regression model using cross-validation
# Creating a linear regression model
model = LinearRegression()

# Performing cross-validation
cv = KFold(n_splits=5, random_state=42, shuffle=True)  # 5-fold cross-validation
cv_mse = cross_val_score(model, X_train, y_train, cv=cv, scoring='neg_mean_squared_error')
cv_mae = cross_val_score(model, X_train, y_train, cv=cv, scoring='neg_mean_absolute_error')
cv_r2 = cross_val_score(model, X_train, y_train, cv=cv, scoring='r2')

print("Cross-Validation (Linear Regression):")
print(f"Average MSE: {-np.mean(cv_mse)}")
print(f"Average MAE: {-np.mean(cv_mae)}")
print(f"Average R-squared: {np.mean(cv_r2)}")

Cross-Validation (Linear Regression):
Average MSE: 315.2366665909499
Average MAE: 14.489307963611102
Average R-squared: 0.08018302051899481


After cross-validation, we fit the model to the training data and make predictions on both training and validation sets. These predictions are used to evaluate the model's performance.

In [None]:
# Step 4: Fit the model and predict on training and validation datasets
# Train model on the training data
model.fit(X_train, y_train)

# Make predictions on the training and validation sets
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)

To evaluate training and validation performance, we used the evaluation metrics MSE, MAE, and R2 score.

In [None]:
# Step 5: Calculate evaluation metrics for the training and validation sets to evaluate model performance
train_mse = mean_squared_error(y_train, y_train_pred)
train_mae = mean_absolute_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)

val_mse = mean_squared_error(y_val, y_val_pred)
val_mae = mean_absolute_error(y_val, y_val_pred)
val_r2 = r2_score(y_val, y_val_pred)

print("\nTraining Set:")
print(f"MSE: {train_mse}")
print(f"MAE: {train_mae}")
print(f"R-squared: {train_r2}")

print("\nValidation Set:")
print(f"MSE: {val_mse}")
print(f"MAE: {val_mae}")
print(f"R-squared: {val_r2}")


Training Set:
MSE: 315.0769208551451
MAE: 14.485824106262903
R-squared: 0.08070329028352008

Validation Set:
MSE: 315.24100404778756
MAE: 14.500162463896068
R-squared: 0.07943284335298306


To improve model performance, we perform Ridge regularization. This involves testing different alpha values to penalize large coefficients and prevent overfitting. The best alpha value is chosen based on the lowest cross-validated MSE.

In [None]:
# Step 6: Optimize with Ridge regularization
# Loop to find the best alpha value for Ridge regularization
alphas = np.logspace(-4, 0, 50)  # Try alpha values between 0.0001 and 1
best_alpha = None
best_mse = float('inf')  # Initialize with a large number to minimize

for alpha in alphas:
    ridge_model = Ridge(alpha=alpha)

    # Performing cross-validation for the current alpha
    cv_mse_ridge = cross_val_score(ridge_model, X_train, y_train, cv=cv, scoring='neg_mean_squared_error')
    mean_mse = -np.mean(cv_mse_ridge)  # Get the average MSE for the current alpha

    # Track the alpha with the lowest MSE
    if mean_mse < best_mse:
        best_mse = mean_mse
        best_alpha = alpha

print(f"\nBest alpha value: {best_alpha}")
print(f"Best cross-validated MSE: {best_mse}")


Best alpha value: 1.0
Best cross-validated MSE: 315.2365867482645


The Ridge regression model is retrained with the optimal alpha value, and predictions are made on the validation set. Evaluation metrics are recalculated to assess the model's performance with regularization.

In [None]:
# Step 7: Evaluate the ridge regression model with best alpha
# Use the best alpha to train the final Ridge model
ridge_model = Ridge(alpha=best_alpha)
ridge_model.fit(X_train, y_train)

# Make predictions on the validation set with Ridge
y_val_pred_ridge = ridge_model.predict(X_val)

# Calculate evaluation metrics for Ridge
val_mse_ridge = mean_squared_error(y_val, y_val_pred_ridge)
val_mae_ridge = mean_absolute_error(y_val, y_val_pred_ridge)
val_r2_ridge = r2_score(y_val, y_val_pred_ridge)

print("\nRidge Regression (Validation Set with Best Alpha):")
print(f"MSE: {val_mse_ridge}")
print(f"MAE: {val_mae_ridge}")
print(f"R-squared: {val_r2_ridge}")


Ridge Regression (Validation Set with Best Alpha):
MSE: 315.2415117775171
MAE: 14.500338183692945
R-squared: 0.07943136068002032


#### **Conclusion**

Across these three linear regression models, the results are consistent: all show very low R² scores, indicating underfitting. Adding more predictors raised R² only from 0.0045 to 0.079. Ridge regression did not improve performance (validation R² stayed at 0.0794), which is expected because regularization addresses overfitting, while these models underfit.

These results suggest that linear models are too simple to capture the relationship between the audio features and popularity, and that we need to explore other methodologies to tackle our problem.

## **Logistic Regression**

For our next methodology, we tried utilizing a logistic regression model to classify the music tracks into popularity categories. To do this, we first created a new feature column, `popularity_class`, where tracks would be classified as not popular or popular, so we would have a binary classification problem. This was done by splitting based on the `popularity` feature value, where scores of 1-50 were labeled not popular and scores of 51-100 were labeled popular.

We then ran a logistic regression model with each individual numerical column as a predictor against `popularity_class` and calculated the Area Under the Curve (AUC) to measure model performance. The results and methodologies are further discussed in the appendix file.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import roc_curve, auc

In [None]:
# Defining popularity bins
#   0 - Not Popular (1-50)
#   1 - Popular (51-100)
bins = [0, 50, 101]
labels = [0, 1]

# Using Pandas cut to split popularity into 2 bins and assign corresponding labels
tracks['popularity_class'] = pd.cut(tracks['popularity'], bins=bins, labels=labels, right=True)

In [None]:
# All numerical columns
numerical_cols = tracks.select_dtypes(include=np.number).columns

# drop less relevant columns
excluded_cols = ['key', 'track_id', 'explicit', 'time_signature', 'Unnamed: 0', 'popularity']
numerical_cols = [col for col in numerical_cols if col not in excluded_cols]
numerical_cols

['duration_ms',
 'danceability',
 'energy',
 'loudness',
 'mode',
 'speechiness',
 'acousticness',
 'instrumentalness',
 'liveness',
 'valence',
 'tempo']

In [None]:
# numerical_cols = tracks.select_dtypes(include=np.number).columns
# excluded_cols = ['key', 'track_id', 'explicit', 'time_signature']
# numerical_cols = [col for col in numerical_cols if col not in excluded_cols]

auc_results = {}

for col in numerical_cols:
  X = tracks[[col]]
  y = tracks['popularity_class']

  X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

  lr = LogisticRegression(solver='liblinear')
  lr.fit(X_train, y_train)

  y_pred_proba = lr.predict_proba(X_val)[:, 1]
  fpr, tpr, _ = roc_curve(y_val, y_pred_proba)
  roc_auc = auc(fpr, tpr)

  auc_results[col] = roc_auc

print("AUC for each numerical variable against popularity_class:")
for col, auc_score in auc_results.items():
  print(f"{col}: {auc_score}")

AUC for each numerical variable against popularity_class:
duration_ms: 0.5315299515116478
danceability: 0.5428728561521772
energy: 0.533033647629419
loudness: 0.5352849437430116
mode: 0.5120871652893856
speechiness: 0.5304492959836538
acousticness: 0.49994067487581584
instrumentalness: 0.5479616958581001
liveness: 0.5473044650925428
valence: 0.5198064056446903
tempo: 0.5217899408023916


In [None]:
# Use all remaining feature columns together as predictors of the popularity_class response variable
X = tracks.drop(['popularity', 'track_id', "Unnamed: 0", 'track_genre', 'popularity_class'], axis=1)
y = tracks['popularity_class'].astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
y_train.value_counts()

Unnamed: 0_level_0,count
popularity_class,Unnamed: 1_level_1
0,52684
1,20499


In [None]:
# Running logistic regression with threshold set to 0.5
lr = LogisticRegression(solver='liblinear')
lr.fit(X=np.array(X_train),
       y=y_train)
lr.intercept_, lr.coef_

(array([-8.63207041e-06]),
 array([[-3.31169421e-06,  3.60885735e-06, -2.31860529e-06,
         -7.60213727e-06, -4.52506264e-05,  1.32306234e-04,
         -9.55687245e-06, -3.23515490e-06, -7.35738578e-06,
         -9.96417336e-06, -8.29456438e-06, -8.68407604e-06,
         -1.21653077e-03, -2.99058975e-05]]))
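
The coefficients above are all very small, which is what one would expect when unscaled features such as `duration_ms` (hundreds of thousands) sit alongside features in the 0 to 1 range. A hedged sketch of one common remedy, standardizing inside a pipeline before fitting, is shown below on synthetic data. The data and the names `X_demo`, `y_demo`, and `clf` are assumptions, and whether this helps on the real dataset is untested here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Synthetic features: one on a millisecond scale, one on a 0-1 scale
duration = rng.normal(200_000, 60_000, n)
dance = rng.uniform(0, 1, n)
X_demo = np.column_stack([duration, dance])

# Synthetic imbalanced label that depends weakly on the 0-1 feature
logit = -1.5 + 1.5 * dance
y_demo = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fit on the training split only, then applied to validation data
clf = make_pipeline(StandardScaler(), LogisticRegression(solver="liblinear"))
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_va)[:, 1]
auc_demo = roc_auc_score(y_va, proba)
coefs = clf.named_steps["logisticregression"].coef_
print("Coefficients on standardized features:", coefs)
print("Validation AUC:", auc_demo)
```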

In [None]:
# Observed and predicted popularity class for the validation data set
pred_val = pd.DataFrame(dict(
    popularity_class = y_val,
    lr_predict = lr.predict_proba(X_val)[:,1],
    lr_predict_binary = lr.predict(X_val)))

pred_val.head(5)


X has feature names, but LogisticRegression was fitted without feature names


X has feature names, but LogisticRegression was fitted without feature names



Unnamed: 0,popularity_class,lr_predict,lr_predict_binary
5373,0,0.309187,0
107340,1,0.252576,0
37168,0,0.277974,0
14001,1,0.30135,0
37106,0,0.253256,0


In [None]:
# Computing confusion matrix where top left is true negatives and bottom right is true positives.
conf_lr = metrics.confusion_matrix(y_true=pred_val['popularity_class'],
                                   y_pred=pred_val['lr_predict_binary'])
conf_lr

array([[13171,     0],
       [ 5125,     0]])

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

# Calculate prediction accuracy
accuracy = accuracy_score(pred_val['popularity_class'], pred_val['lr_predict_binary'])
print(f"\nPrediction Accuracy: {accuracy}")

# Calculate prediction error
error_rate = 1 - accuracy
print(f"Prediction Error: {error_rate}")

# Calculate true positive rate (TPR) and true negative rate (TNR)
tp = metrics.recall_score(y_true=pred_val['popularity_class'],
                     y_pred=pred_val['lr_predict_binary'])
tn = metrics.recall_score(y_true=pred_val['popularity_class'],
                     y_pred=pred_val['lr_predict_binary'],
                     pos_label=0)
print(f"True Positive Rate (TPR): {tp}")
print(f"True Negative Rate (TNR): {tn}")

# Precision and recall
precision = precision_score(pred_val['popularity_class'], pred_val['lr_predict_binary'])
recall = recall_score(pred_val['popularity_class'], pred_val['lr_predict_binary'])

print(f"Precision: {precision}")
print(f"Recall: {recall}")

# F1 Score
f1 = f1_score(pred_val['popularity_class'], pred_val['lr_predict_binary'])
print(f"F1 Score: {f1}")


Prediction Accuracy: 0.719884127678181
Prediction Error: 0.280115872321819
True Positive Rate (TPR): 0.0
True Negative Rate (TNR): 1.0
Precision: 0.0
Recall: 0.0
F1 Score: 0.0



Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.



In [None]:
# Computing the ROC curve variables
lr_fpr, lr_tpr, lr_thresholds = metrics.roc_curve(pred_val['popularity_class'], pred_val['lr_predict'])
lr_fpr, lr_tpr, lr_thresholds

(array([0.00000000e+00, 7.59243793e-05, 1.51848759e-04, ...,
        9.98861134e-01, 9.99012983e-01, 1.00000000e+00]),
 array([0., 0., 0., ..., 1., 1., 1.]),
 array([           inf, 4.52089716e-01, 4.51046094e-01, ...,
        1.48340960e-02, 1.48018454e-02, 4.89367001e-07]))

In [None]:
# Plotting the ROC Curve
roc_lr = pd.DataFrame({
    'False Positive Rate': lr_fpr,
    'True Positive Rate': lr_tpr,
    'Model': 'Logistic Regression'
}, index=lr_thresholds)


roc_df = pd.concat([roc_lr])


px.line(roc_df, y='True Positive Rate', x='False Positive Rate',
        color='Model',
        width=700, height=500
)

In [None]:
# Calculating Area Under Curve (AUC)
lr_auc = metrics.roc_auc_score(pred_val['popularity_class'], pred_val['lr_predict'])
print('Logistic regression AUC:', lr_auc.round(3))

Logistic regression AUC: 0.54


In [None]:
# 5-fold CV on the validation split, reporting the AUC of each fold
from sklearn.model_selection import cross_val_score

cross_val_score(lr, X_val, y_val, cv=5, scoring='roc_auc')

array([0.54603638, 0.53486157, 0.54374021, 0.53469304, 0.540738  ])

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

# Stratified 5-fold CV reporting AUC and accuracy per fold.
# Note: the fold indices are generated from X_val/y_val but are applied
# positionally to the full X/y, so each fold uses the first 18,296 rows of X
# rather than the validation rows. AUC is computed from hard class predictions
# (not probabilities), so constant predictions give exactly 0.5.
skfolds = StratifiedKFold(n_splits=5)
i = 1
for train_index, test_index in skfolds.split(X_val, y_val):
    clone_lr = clone(lr)
    X_train_folds = X.iloc[train_index]
    y_train_folds = y.iloc[train_index]
    X_test_fold = X.iloc[test_index]
    print(test_index)

    # Fit on the training folds and predict on the held-out fold
    clone_lr.fit(X_train_folds, y_train_folds)
    y_pred = clone_lr.predict(X_test_fold)
    auc_sample = metrics.roc_auc_score(y.iloc[test_index], y_pred)
    print('Fold: ', i)
    print('AUC: ', auc_sample)
    print('Accuracy: ', metrics.accuracy_score(y.iloc[test_index], y_pred))
    i += 1

[   0    1    2 ... 3681 3682 3683]
Fold:  1
AUC:  0.5
Accuracy:  0.6090163934426229
[3608 3610 3614 ... 7337 7338 7341]
Fold:  2
AUC:  0.5
Accuracy:  0.7633233123804318
[ 7267  7269  7271 ... 11023 11026 11029]
Fold:  3
AUC:  0.5
Accuracy:  0.8133369773162066
[10954 10958 10959 ... 14634 14636 14639]
Fold:  4
AUC:  0.5
Accuracy:  0.7152227384531292
[14635 14637 14638 ... 18293 18294 18295]
Fold:  5
AUC:  0.5
Accuracy:  0.7767149494397376
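
The fold loop above takes its indices from `X_val` but applies them to `X`, and it scores AUC on hard class predictions, which is why every fold reports exactly 0.5. A sketch of one way to write the loop so that the split and the indexing use the same data, and AUC uses predicted probabilities, is given below. It runs on synthetic data; `X_demo`, `y_demo`, and the other names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(1)
n = 500

# Synthetic features and a noisy binary label
X_demo = pd.DataFrame({"f1": rng.normal(size=n), "f2": rng.normal(size=n)})
y_demo = pd.Series((X_demo["f1"] + rng.normal(scale=2.0, size=n) > 0.8).astype(int))

base = LogisticRegression(solver="liblinear")
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_auc, fold_acc = [], []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    # Index the same objects that were passed to split()
    X_tr, y_tr = X_demo.iloc[train_idx], y_demo.iloc[train_idx]
    X_te, y_te = X_demo.iloc[test_idx], y_demo.iloc[test_idx]

    fold_model = clone(base)
    fold_model.fit(X_tr, y_tr)

    # AUC needs scores or probabilities, not hard 0/1 predictions
    proba = fold_model.predict_proba(X_te)[:, 1]
    fold_auc.append(roc_auc_score(y_te, proba))
    fold_acc.append(accuracy_score(y_te, fold_model.predict(X_te)))

print("AUC per fold:", np.round(fold_auc, 3))
print("Accuracy per fold:", np.round(fold_acc, 3))
```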


In [None]:
# Computing the optimal threshold (maximizing tpr and minimizing fpr)
optimal_idx = np.argmax(lr_tpr - lr_fpr)
optimal_threshold = lr_thresholds[optimal_idx]
pred_val['optimal_predict'] = (pred_val['lr_predict'] >= optimal_threshold).astype(int)

In [None]:
# Optimal threshold from calculations above
optimal_threshold

0.2607197141400966

In [None]:
# Comparing original confusion matrix with optimal threshold matrix
conf_lr_optimal = metrics.confusion_matrix(y_true=pred_val['popularity_class'],
                                   y_pred=pred_val['optimal_predict'])
print('Original Confusion Matrix')
print(conf_lr)
print('Optimal Confusion Matrix')
print(conf_lr_optimal)

Original Confusion Matrix
[[13171     0]
 [ 5125     0]]
Optimal Confusion Matrix
[[3601 9570]
 [ 957 4168]]


In [None]:
# Calculate prediction accuracy
accuracy_optimal = accuracy_score(pred_val['popularity_class'], pred_val['optimal_predict'])
print(f"\nPrediction Accuracy: {accuracy_optimal}")

# Calculate prediction error
error_rate_optimal = 1 - accuracy_optimal
print(f"Prediction Error: {error_rate_optimal}")

# Calculate true positive rate (TPR) and true negative rate (TNR)
tp_optimal = metrics.recall_score(y_true=pred_val['popularity_class'],
                     y_pred=pred_val['optimal_predict'])
tn_optimal = metrics.recall_score(y_true=pred_val['popularity_class'],
                     y_pred=pred_val['optimal_predict'],
                     pos_label=0)
print(f"True Positive Rate (TPR): {tp_optimal}")
print(f"True Negative Rate (TNR): {tn_optimal}")

# Precision and recall
precision_optimal = precision_score(pred_val['popularity_class'], pred_val['optimal_predict'])
recall_optimal = recall_score(pred_val['popularity_class'], pred_val['optimal_predict'])

print(f"Precision: {precision_optimal}")
print(f"Recall: {recall_optimal}")

# F1 Score
f1 = f1_score(pred_val['popularity_class'], pred_val['optimal_predict'])
print(f"F1 Score: {f1}")


Prediction Accuracy: 0.42462833406209005
Prediction Error: 0.57537166593791
True Positive Rate (TPR): 0.8132682926829268
True Negative Rate (TNR): 0.2734036899248349
Precision: 0.3033920512447227
Recall: 0.8132682926829268
F1 Score: 0.44192334199225997


## **Neural Network**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# # Prepare the data
# # Assuming 'tracks' is your DataFrame
# X = tracks.drop(["popularity","popularity", "track_id", "track_genre", "Unnamed: 0"], axis=1)
# y = tracks["popularity"]


# Encode categorical features
categorical_cols = ['explicit']

for col in categorical_cols:
  le = LabelEncoder()
  X[col] = le.fit_transform(X[col])

# Convert boolean to numerical if necessary
X['explicit'] = X['explicit'].astype(int)


# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

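# Convert y_train and y_test to integer codes if they are still categorical
# (y is already an integer Series if the logistic regression cells were run first)
if isinstance(y_train.dtype, pd.CategoricalDtype):
    y_train = y_train.cat.codes
    y_test = y_test.cat.codes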

# Scale numerical features
numerical_cols = ['duration_ms', 'danceability', 'energy', 'key', 'loudness', 'mode',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence',
       'tempo', 'time_signature']
scaler = StandardScaler()
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])

# Build the Neural Network
model = keras.Sequential([
    layers.Input(shape=(X_train.shape[1],)), # Input layer
    layers.Dense(64, activation='relu'),   # Hidden layer 1
    layers.Dense(32, activation='relu'),   # Hidden layer 2
    layers.Dense(3, activation='softmax')  # Output layer (3 classes)
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',  # Use sparse for integer labels
              metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)


# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print("Test Loss:", loss)
print("Test Accuracy:", accuracy)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Get model predictions
y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)  # Convert probabilities to class labels

# Generate confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred_classes)

# Display confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix, display_labels=np.unique(y_test))
disp.plot(cmap='viridis', values_format='d')
plt.title("Confusion Matrix")
plt.show()


In [None]:
from sklearn.metrics import confusion_matrix

# Assuming y_test and y_pred_classes are available
conf_matrix = confusion_matrix(y_test, y_pred_classes)

# Calculate class-wise accuracies
class_accuracies = conf_matrix.diagonal() / conf_matrix.sum(axis=1)

# Print class accuracies
class_labels = np.unique(y_test)
for label, accuracy in zip(class_labels, class_accuracies):
    print(f"Accuracy for Class {label}: {accuracy:.2f}")


**Conceptual Explanation of Feature Permutation Importance for Neural Networks**

The feature permutation test is a method for quantifying the importance of each feature in a predictive model. Conceptually, we evaluate how much the model's accuracy drops when the values of a specific feature are randomly shuffled (or permuted). By doing so, we break the relationship between that feature and the target variable, effectively removing its predictive power.



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

def feature_permutation_test(model, X, y, feature_names, n_permutations=3):
    # Store original accuracy
    original_pred = model.predict(X)
    original_accuracy = accuracy_score(y, np.argmax(original_pred, axis=1))

    # Dictionary to store accuracy drops for each feature
    feature_importance = {}

    # Test each feature
    for feature in feature_names:
        accuracy_drops = []

        # Perform n permutations for this feature
        for _ in range(n_permutations):
            # Make a copy of X
            X_permuted = X.copy()

            # Permute only this feature
            X_permuted[feature] = np.random.permutation(X_permuted[feature])

            # Get new predictions and accuracy
            pred = model.predict(X_permuted)
            acc = accuracy_score(y, np.argmax(pred, axis=1))

            # Store the drop in accuracy
            accuracy_drops.append(original_accuracy - acc)

        # Store mean accuracy drop for this feature
        feature_importance[feature] = np.mean(accuracy_drops)

    return feature_importance

# Run permutation importance test
importance_scores = feature_permutation_test(model, X_test, y_test, X_test.columns)

# Convert to DataFrame and sort
importance_df = pd.DataFrame({
    'Feature': list(importance_scores.keys()),
    'Importance': list(importance_scores.values())
}).sort_values('Importance', ascending=False)

# Plot results
plt.figure(figsize=(12, 6))
plt.bar(importance_df['Feature'], importance_df['Importance'])
plt.xticks(rotation=45, ha='right')
plt.title('Feature Importance by Permutation Testing')
plt.xlabel('Features')
plt.ylabel('Decrease in Accuracy When Feature is Permuted')
plt.tight_layout()
plt.show()

# Print numerical results
print("\nFeature Importance Rankings:")
for idx, row in importance_df.iterrows():
    print(f"{row['Feature']}: {row['Importance']:.4f}")


**Neural Network Feature Importance and Comparison**

The neural network's feature importance rankings differ somewhat from those of the decision tree and random forest models. The most influential features for the neural network are **energy** (0.0413), **danceability** (0.0396), and **valence** (0.0315). Interestingly, features like **acousticness** and **duration_ms**, which were dominant in the decision tree and random forest, rank lower in the neural network's analysis. This suggests that the neural network captures different patterns and interactions between features.

The random forest and decision tree models emphasized **acousticness**, **duration_ms**, and **speechiness** as top predictors, which differs from the neural network's rankings. One possible reason is the small number of permutations per feature (three) used for the neural network, which makes its importance estimates noisier. With limited time and compute resources, we were unable to run more permutations.

