# SC1015 Mini-Project: Spotify Music Recommender System

### By FDDA Group 5
Syed Ali Redha Alsagoff, Huang Yongjian, Ma Jinlin


## Our problem statement

Our simple question of what determines a user's music preference actually has important implications in real life! With our project, we aim to find out **how companies can maximise the use of machine learning techniques to recommend songs**. To streamline the efficiency aspect, we will also look into **what features make a good recommender system**, as well as **what model to use in a genre classification model**.
<br>
<br>
A good genre prediction model is important due to the **lack of robust datasets with labelled genres**. Even in some research literatures on music recommender systems, the teams made use of genre classification models in their approaches.
<br>
<br>
We believe that this is an important issue to tackle because of **the vast amount of features used in music recommender systems**. Novel approaches to recommender systems make use of Deep learning that engineer even more features for their system[1]. Advanced recommendation systems can make use of many features[2], in the realm of hundreds to even thousands of features to make their recommendations. Hence, to improve the efficiency and resource management, and possibly the quality of recommendations as well, **good feature selection** is paramount to any music streaming service company, such as Spotify.
<br>
<br>
We will be using different feature selection techniques and will be evaluating our model trained using different feature selection techniques.
<br>
<br>
Due to time and hardware and other resource constraints, we will be doing our project using a smaller set of features than what is typically used. Larger models make use of more features such as lyric data (making use of tokenisers), audio data (making use of RNNs and LSTMs) and user-item correlation features, and more. However, the techniques that we use here would be **easily extendable** to datasets that make use of features on the scale of hundreds or even thousands.

## Installing relevant libraries

In [1]:
pip install --upgrade numpy scipy scikit-learn threadpoolctl

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install xgboost

Note: you may need to restart the kernel to use updated packages.


* Note you may need to restart your kernel after running the cell below, we need to get the latest versions of the following libraries for our code to work

In [3]:
pip install --upgrade numpy matplotlib seaborn pandas

Note: you may need to restart the kernel to use updated packages.


In [4]:
pip install spotipy

Note: you may need to restart the kernel to use updated packages.


In [5]:
pip install sklearn-pandas

Note: you may need to restart the kernel to use updated packages.


In [6]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

# Libraries for extracting dataset
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import time

# Models use for EDA and light ML
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBRegressor, XGBClassifier

# Library and functions for getting pretrained models and mapper
import pickle

# Libraries for Feature Engineering (Genre Classification), and data preprocessing
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.preprocessing import LabelEncoder, LabelBinarizer, StandardScaler, MinMaxScaler

# Libraries for our autoencoder model
import random
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.regularizers import l1_l2

# For our Recommender System
from sklearn.neighbors import NearestNeighbors




  from pandas.core import (


# Our dataset for recommender system

In this section, we will be detailing how we made our dataset to recommend songs to users. Spotify does not give access to all of its songs through its API, so we made our own dataset, using popular songs from a variety of genres for maximum coverage. To provide some simplicity, we constrained to system to recommend only English songs.

### Function to extract playlist information


In [7]:
def fetch_playlist_tracks(playlist_id, sp):
    track_uris = []
    artist_uris = []
    track_names = []
    artist_names = []
    track_popularities = []
    release_years = []
    explicit_statuses = []
    results = sp.playlist_tracks(playlist_id)
    tracks = results['items']
    while results['next']:
        results = sp.next(results)
        tracks.extend(results['items'])
    for item in tracks:
        track = item['track']
        if track:
            track_uris.append(track['uri'])
            track_names.append(track['name'])
            artist_uris.extend([artist['uri'] for artist in track['artists']])
            artist_names.append(track['artists'][0]['name'] if track['artists'] else 'Unknown')
            track_popularities.append(track['popularity'])
            explicit_statuses.append(track['explicit'])
            release_date = track['album']['release_date']
            release_year = release_date.split("-")[0] if release_date else 'Unknown'
            release_years.append(release_year)
    return track_uris, track_names, artist_names, track_popularities, release_years, explicit_statuses

def get_audio_info_from_playlist(playlist_id, sp):
    tracks, track_names, artist_names, popularities, release_years, explicit_statuses = fetch_playlist_tracks(playlist_id, sp)
    combined_data = []

    for i in range(0, len(tracks), 50):
        track_interval = tracks[i:i+50]
        track_name_interval = track_names[i:i+50]
        artist_name_interval = artist_names[i:i+50]
        popularity_interval = popularities[i:i+50]
        release_year_interval = release_years[i:i+50]
        explicit_status_interval = explicit_statuses[i:i+50]

        if not track_interval:
            continue

        try:
            audio_features = sp.audio_features(track_interval)
        except spotipy.SpotifyException as e:
            if e.http_status == 429:
                print(f"Rate limit exceeded, sleeping for {retry_after} seconds")
                time.sleep(retry_after)
                # Exponential Backoff to comply with Spotify rate limits
                retry_after = min(retry_after * 2, 64)
            elif e.http_status == 400:
                print(f"Error with batch {i} to {i + len(track_interval) - 1}: {e.msg}, skipping...")
                continue
            else:
                print(f"Unhandled error: {e}")
                continue

        audio_features_batch = [
            [track_name_interval[j], artist_name_interval[j], popularity_interval[j], release_year_interval[j], explicit_status_interval[j]] +
            [v for k, v in d.items() if k not in ['type', 'id', 'uri', 'track_href', 'analysis_url']]
            for j, d in enumerate(audio_features) if d is not None]

        for features in audio_features_batch:
            combined_data.append(features)

    columns = ['Track Name', 'artists', 'popularity', 'release_year', 'explicit', 'danceability', 'energy', 'key', 'loudness',
               'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature']
    return pd.DataFrame(combined_data, columns=columns)

# Please use the developer ID and secret attached in the NTULearn submission
# Feel free to running the code but sometimes there might be error due too too many requests, this may be due to
# Spotify API temporarily banning the account from using its data
client_id = ''  
client_secret = ''
auth_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(auth_manager=auth_manager, requests_timeout=10, retries=5, status_forcelist=(429, 500, 502, 503, 504), backoff_factor=0.3)

Above are the functions to extract the dataset from our playlists. We make use of the get_audio_info_from_playlist() function that makes use of the fetch_playlist_tracks() helper function. We then stored the resulting dataframe as a CSV file to import to this notebook.
<br>
<br>
To save you time, we have already saved the dataset into 2 separate CSVs (Since spotify's limit on playlist is 10000 songs) but feel free to try and extract it using these uris:
1. spotify:playlist:3Om5x2SLDI2QRnXOXNuTPo
2. spotify:playlist:2YHkCdEuQ9hKoZZjJXsAS2

In [8]:
# test_df = get_audio_info_from_playlist("(insert the playlist uri here)",sp)

If you'd like, you could uncomment the code above and try adding in different playlist uris to extract their information into the test_df DataFrame. You can get playlist uris by simply going to your playlist on desktop and hovering share. If you press the option(for mac) or control(for windows) button, you can see the copy link button change to copy playlist uri.

### Extracting the dataset

In [9]:
full_dataset_df1 = pd.read_csv('spotify_playlist_dataset.csv')
full_dataset_df2 = pd.read_csv('spotify_playlist_dataset2.csv')
full_dataset_df = pd.concat([full_dataset_df1,full_dataset_df2], ignore_index=True)
full_dataset_df.drop_duplicates(subset=['Track Name'], keep='first',inplace=True)
print(full_dataset_df.shape)

(11794, 19)


## Data Cleaning

In [10]:
# Check the data in the playlist
full_dataset_df.info()
full_dataset_df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
Index: 11794 entries, 0 to 17579
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        11794 non-null  int64  
 1   Track Name        11793 non-null  object 
 2   Artist Name       11793 non-null  object 
 3   Popularity        11794 non-null  int64  
 4   Release Year      11794 non-null  int64  
 5   Explicit          11794 non-null  bool   
 6   Danceability      11794 non-null  float64
 7   Energy            11794 non-null  float64
 8   Key               11794 non-null  int64  
 9   Loudness          11794 non-null  float64
 10  Mode              11794 non-null  int64  
 11  Speechiness       11794 non-null  float64
 12  Acousticness      11794 non-null  float64
 13  Instrumentalness  11794 non-null  float64
 14  Liveness          11794 non-null  float64
 15  Valence           11794 non-null  float64
 16  Tempo             11794 non-null  float64
 17

Unnamed: 0          0
Track Name          1
Artist Name         1
Popularity          0
Release Year        0
Explicit            0
Danceability        0
Energy              0
Key                 0
Loudness            0
Mode                0
Speechiness         0
Acousticness        0
Instrumentalness    0
Liveness            0
Valence             0
Tempo               0
Duration_ms         0
Time Signature      0
dtype: int64

As seen above, we still have some NaN values and an irrelevant column, we will now remove them.

In [11]:
# Remove irrelevant columns and drop any rows with NaN values
full_dataset_df.drop("Unnamed: 0", axis=1, inplace=True)
full_dataset_df.dropna(inplace=True)
full_dataset_df.info()
full_dataset_df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
Index: 11793 entries, 0 to 17579
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Track Name        11793 non-null  object 
 1   Artist Name       11793 non-null  object 
 2   Popularity        11793 non-null  int64  
 3   Release Year      11793 non-null  int64  
 4   Explicit          11793 non-null  bool   
 5   Danceability      11793 non-null  float64
 6   Energy            11793 non-null  float64
 7   Key               11793 non-null  int64  
 8   Loudness          11793 non-null  float64
 9   Mode              11793 non-null  int64  
 10  Speechiness       11793 non-null  float64
 11  Acousticness      11793 non-null  float64
 12  Instrumentalness  11793 non-null  float64
 13  Liveness          11793 non-null  float64
 14  Valence           11793 non-null  float64
 15  Tempo             11793 non-null  float64
 16  Duration_ms       11793 non-null  int64  
 17

Track Name          0
Artist Name         0
Popularity          0
Release Year        0
Explicit            0
Danceability        0
Energy              0
Key                 0
Loudness            0
Mode                0
Speechiness         0
Acousticness        0
Instrumentalness    0
Liveness            0
Valence             0
Tempo               0
Duration_ms         0
Time Signature      0
dtype: int64

In [12]:
full_dataset_df = full_dataset_df.rename(columns={"Acousticness": "acousticness", "Danceability": "danceability","Energy":"energy","Instrumentalness":"instrumentalness","Liveness":"liveness","Loudness":"loudness","Popularity":"popularity","Speechiness":"speechiness","Tempo":"tempo","Valence":"valence","Duration_ms":"duration_ms","Time Signature":"time_signature","Key":"key","Mode":"mode", "Artist Name": "artists","Release Year":"release_year","Explicit":"explicit"})
full_dataset_df.columns

Index(['Track Name', 'artists', 'popularity', 'release_year', 'explicit',
       'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'duration_ms', 'time_signature'],
      dtype='object')

## Data Preparation - Feature Engineering
Firstly, we will be doing feature engineering, which will be the genre feature. As explained in the first notebook (Part 1: Genre Classification Model), We believe that this will be an essential feature in recommending tracks. We will be making use of the model and mapper from the Genre Classification Notebook to get this feature.

In [13]:
class CustomMultiLabelBinarizer:
    def __init__(self):
        self.mlb = MultiLabelBinarizer()

    def fit(self, X, y=None):
        # Fit the MultiLabelBinarizer to the data
        self.mlb.fit(X)
        return self

    def transform(self, X):
        # Transform the data, ignoring unseen labels
        try:
          return self.mlb.transform(X)
        except ValueError as e:  
          # Handle the case where the label set in X is not a subset of the label set fitted
          unseen = set(x for sublist in X for x in sublist) - set(self.mlb.classes_)
          X_filtered = [[x for x in sublist if x not in unseen] for sublist in X]
          return self.mlb.transform(X_filtered)

    def fit_transform(self, X, y=None):
        # Fit to data, then transform it
        return self.fit(X).transform(X)

    def get_feature_names(self):
        # Return feature names for output columns
        return self.mlb.classes_

In [14]:
# Import pretrained models and mapper from other notebook
with open('multinomial_logistic_regression_model.pkl', 'rb') as file:
    multinomial_logistic_regression_model = pickle.load(file)
with open('mapper.pkl', 'rb') as file:
    mapper = pickle.load(file)

# Function to extract the values from the DataFrames
def get_genres(df):
    df = df.copy()
    df['artists'] = df['artists'].str.split(';')
    df.drop(['Track Name', 'release_year','tempo','duration_ms'], axis=1, inplace=True)
    X = mapper.transform(df)
    X.columns = [col.replace('artists_', '').replace('[', '').replace(']', '').replace('<', '') for col in X.columns]
    X = X.loc[:, ~X.columns.duplicated()]
    genres = multinomial_logistic_regression_model.predict(X)
    return genres

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


In [15]:
# Add genres to the dataset
full_dataset_df["genre"] = get_genres(full_dataset_df)



In [16]:
# Check for any rows where genres are null
full_dataset_df["genre"].isnull().sum()

0

As you can see we successfully got the genres for all rows.

## Data Preparation - Data Preprocessing

We will now be preprocessing our data, by making all categorical features numeric. We will be making use of one-hot encoding to do so. <br>

Then, we will make use of an autoencoder model to further reduce the dimensionality and improve the richness of the features of our model. <br>

Then, we will be scaling and normalising our features before finally making use of them in our model.

In [17]:
scaler = StandardScaler()
categorical_features = ['key', 'mode', 'genre','explicit']
numerical_features = ['duration_ms', 'danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo','popularity']
full_dataset_df = pd.get_dummies(full_dataset_df, columns=categorical_features)

# Listout one-Hot encoded features
final_categorical_features = ['key_0','key_1','key_2','key_3','key_4','key_5','key_6','key_7','key_8','key_9','key_10','key_11','mode_0','mode_1', 'genre_0', 'genre_1', 'genre_2', 'genre_3', 'genre_4', 'genre_5', 'genre_6','explicit_False','explicit_True']
for column in final_categorical_features:
    full_dataset_df[column] = full_dataset_df[column].astype(int)

full_dataset_df.info()

scaled_full_dataset_df = full_dataset_df.copy()
scaled_full_dataset_df[numerical_features] = scaler.fit_transform(scaled_full_dataset_df[numerical_features])

<class 'pandas.core.frame.DataFrame'>
Index: 11793 entries, 0 to 17579
Data columns (total 38 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Track Name        11793 non-null  object 
 1   artists           11793 non-null  object 
 2   popularity        11793 non-null  int64  
 3   release_year      11793 non-null  int64  
 4   danceability      11793 non-null  float64
 5   energy            11793 non-null  float64
 6   loudness          11793 non-null  float64
 7   speechiness       11793 non-null  float64
 8   acousticness      11793 non-null  float64
 9   instrumentalness  11793 non-null  float64
 10  liveness          11793 non-null  float64
 11  valence           11793 non-null  float64
 12  tempo             11793 non-null  float64
 13  duration_ms       11793 non-null  int64  
 14  time_signature    11793 non-null  int64  
 15  key_0             11793 non-null  int64  
 16  key_1             11793 non-null  int64  
 17

## Autoencoder Model

To achieve dimensionality reduction, one of the methods that we will be using is an **Autoencoder neural network**. An autoencoder is made out of 2 parts, an encoder and a decoder. The encoder section applies transformations onto the input data, **reducing the dimensionality of data** in each layer. This is done until it reaches a low-dimensional bottleneck. This bottleneck captures the essential features of the data. The decoder then tries to **reconstruct the original data** from this compressed representation. How different the resulting output is from the original is taken as the **loss function**.

### Unweighted autoencoder model
The Unweighted Autoencoder **does not assign different importances to different features** when calculating the loss during training. It treats all errors between the reconstructed output and the original input equally across the dataset. This is useful for dataset with features that have around the same importances and is generally simpler than the weighted variant.

In [18]:
# Exclude non-numeric features like 'Track Name', 'Artist Name' for input
input_dim = full_dataset_df.shape[1] - 2

# Set seed for reproducibility
seed_value = 20
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Encoder/Decoder definition (Hyperparameters were tuned such that optimal results were given)
input_layer = Input(shape=(input_dim,))
encoded = Dense(128, activation='relu', activity_regularizer=l1_l2(l1=1e-5, l2=1e-4))(input_layer)
encoded = Dropout(0.1)(encoded)
encoded = Dense(68, activation='relu', activity_regularizer=l1_l2(l1=1e-5, l2=1e-4))(encoded)
encoded = Dropout(0.1)(encoded)
encoded = Dense(40, activation='relu', activity_regularizer=l1_l2(l1=1e-5, l2=1e-4))(encoded)
decoded = Dense(68, activation='relu')(encoded)
decoded = Dropout(0.1)(decoded)
decoded = Dense(128, activation='relu')(decoded)
decoded = Dropout(0.1)(decoded)
decoded = Dense(input_dim, activation='sigmoid')(decoded)

# Compile the autoencoder
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer='adam', loss='mean_squared_error')
# Compile the autoencoder
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer='adam', loss='mean_squared_error')

# Prepare data for training

# Use numeric data only
X_train = full_dataset_df.drop(['Track Name', 'artists'], axis=1)

# Convert to float32 to ensure compatibility with TensorFlow
X_train = X_train.astype('float32')

# Train the autoencoder
autoencoder.fit(X_train, X_train, epochs=50, batch_size=256, validation_split=0.2)

# Create a separate encoder model from the full autoencoder
unweighted_encoder = Model(input_layer, encoded)

Epoch 1/50


2024-04-24 23:50:57.765613: W tensorflow/tsl/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz


Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


### Weighted autoencoder model

The Weighted Autoencoder can **assign different weights to features**, allowing the model to prioritise minimising the error cost of specific features over the others during training. In this autoencoder model, we will be adjusting the weights based on the feature importance in deciding a song's popularity that we found earlier in the second notebook (Part 2: Music Popularity Model). 

The weights of **music genre, explicit lyrics, release year, instrumentalness** will be adjusted upwards while the weights of **key, mode and danceability** are adjusted downwards. Artist names are not included in our recommender system as it requires multilabel binarizer and will create too many features for our small-scaled recommender system. 

From experience, we found that, specifically for the autoencoder, improving the weight of speechiness was also important in improving the quality of recommendations. Given the black box nature of Neural Networks and AI, we are not able to give a reason why at this moment.

We will be simulating this in the model by simply multiplying the values by a number greater than one for the more important features, and a number less than one for a less important feature.

In [19]:
# Exclude non-numeric features: 'Track Name', 'Artist Name' for input
input_dim = full_dataset_df.shape[1] - 2

# Set seed for reproducibility
seed_value = 21
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Encoder/Decoder definition (Hyperparameters were tuned such that optimal results were given)
input_layer = Input(shape=(input_dim,))
encoded = Dense(128, activation='relu', activity_regularizer=l1_l2(l1=1e-5, l2=1e-4))(input_layer)
encoded = Dropout(0.1)(encoded)
encoded = Dense(68, activation='relu', activity_regularizer=l1_l2(l1=1e-5, l2=1e-4))(encoded)
encoded = Dropout(0.1)(encoded)
encoded = Dense(40, activation='relu', activity_regularizer=l1_l2(l1=1e-5, l2=1e-4))(encoded)
decoded = Dense(68, activation='relu')(encoded)
decoded = Dropout(0.1)(decoded)
decoded = Dense(128, activation='relu')(decoded)
decoded = Dropout(0.1)(decoded)
decoded = Dense(input_dim, activation='sigmoid')(decoded)

# Compile the autoencoder
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer='adam', loss='mean_squared_error')

# Prepare data for training
# Use numeric data only
X_train = full_dataset_df.drop(['Track Name', 'artists'], axis=1)

# Convert to float32 to ensure compatibility with TensorFlow
X_train = X_train.astype('float32')

# music genre, explicit lyrics, release year, instrumentalness
# Adjust the weights of different values for the model
# key, mode and danceability are adjusted downwards as 
X_train[['key_0','key_1','key_2','key_3','key_4','key_5','key_6','key_7','key_8','key_9','key_10','key_11']] *= 0.90
X_train[['mode_0','mode_1']] *= 0.95
X_train['danceability'] *= 0.94
X_train['instrumentalness'] *= 1.5
X_train['release_year'] *= 1.7
X_train['speechiness'] *= 1.5
X_train[['explicit_False','explicit_True']] *= 1.1
X_train['popularity'] *= 1.7
X_train[['genre_0','genre_1','genre_3','genre_4','genre_5','genre_6']] *= 1.4
X_train['genre_2'] *= 1.3

# Train the autoencoder
autoencoder.fit(X_train, X_train, epochs=50, batch_size=256, validation_split=0.2)

# Create a separate encoder model from the full autoencoder
weighted_encoder = Model(input_layer, encoded)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


Now, we have done all the relevant preprocessing and preparation with our main dataset. We will now preprocess and prepare our user playlist dataset.

## Preprocessing and preparing user playlist data

In [20]:
# Extract the features of the user playlist
playlist_features_df = get_audio_info_from_playlist('spotify:playlist:5ZbOfWEU7hok5xNqN2vuc0', sp)
playlist_features_df['genre'] = get_genres(playlist_features_df)
playlist_features_df.head()



Unnamed: 0,Track Name,artists,popularity,release_year,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,genre
0,SLOW DANCING IN THE DARK,Joji,49,2018,True,0.515,0.479,3,-7.458,1,0.0261,0.544,0.00598,0.191,0.284,88.964,209274,4,2
1,505,Arctic Monkeys,81,2007,False,0.52,0.852,0,-5.866,1,0.0543,0.00237,5.8e-05,0.0733,0.234,140.267,253587,4,5
2,Sweater Weather,The Neighbourhood,90,2013,False,0.612,0.807,10,-2.81,1,0.0336,0.0495,0.0177,0.101,0.398,124.053,240400,4,6
3,Are You Bored Yet? (feat. Clairo),Wallows,83,2019,False,0.682,0.683,8,-6.444,0,0.0287,0.156,2.3e-05,0.273,0.64,120.023,178000,4,5
4,Young Dumb & Broke,Khalid,82,2017,False,0.799,0.539,1,-6.351,1,0.0421,0.199,1.7e-05,0.165,0.394,136.948,202547,4,0


In [21]:
playlist_features_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Track Name        24 non-null     object 
 1   artists           24 non-null     object 
 2   popularity        24 non-null     int64  
 3   release_year      24 non-null     object 
 4   explicit          24 non-null     bool   
 5   danceability      24 non-null     float64
 6   energy            24 non-null     float64
 7   key               24 non-null     int64  
 8   loudness          24 non-null     float64
 9   mode              24 non-null     int64  
 10  speechiness       24 non-null     float64
 11  acousticness      24 non-null     float64
 12  instrumentalness  24 non-null     float64
 13  liveness          24 non-null     float64
 14  valence           24 non-null     float64
 15  tempo             24 non-null     float64
 16  duration_ms       24 non-null     int64  
 17 

In [22]:
# One-hot encoding the categorical features
playlist_features_df[final_categorical_features] = 0
categorical_features = ['key', 'mode', 'genre','explicit']
playlist_features_df = pd.get_dummies(playlist_features_df, columns=categorical_features)
# Not every value in the categorical features will be present, so to circumvent this
# We will be manually putting in every necessary column, and merge them
for col in final_categorical_features:
    cols_to_merge = [c for c in playlist_features_df.columns if c.startswith(col)]
    if len(cols_to_merge) > 1:
        # Make use of max function to effectively perform an OR operation across duplicates
        playlist_features_df[col] = playlist_features_df[cols_to_merge].max(axis=1)
playlist_features_df = playlist_features_df.loc[:,~playlist_features_df.columns.duplicated()].copy()

# ensure all correct values are numeric and not of type object
for column in (final_categorical_features + ["release_year"]):
    playlist_features_df[column] = playlist_features_df[column].astype(int)
playlist_features_df.info()

# Scale the playlist features for the non-autoencoder model
scaled_playlist_features_df = playlist_features_df.copy()
scaled_playlist_features_df[numerical_features] = scaler.fit_transform(scaled_playlist_features_df[numerical_features])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 38 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Track Name        24 non-null     object 
 1   artists           24 non-null     object 
 2   popularity        24 non-null     int64  
 3   release_year      24 non-null     int64  
 4   danceability      24 non-null     float64
 5   energy            24 non-null     float64
 6   loudness          24 non-null     float64
 7   speechiness       24 non-null     float64
 8   acousticness      24 non-null     float64
 9   instrumentalness  24 non-null     float64
 10  liveness          24 non-null     float64
 11  valence           24 non-null     float64
 12  tempo             24 non-null     float64
 13  duration_ms       24 non-null     int64  
 14  time_signature    24 non-null     int64  
 15  key_0             24 non-null     int64  
 16  key_1             24 non-null     int64  
 17 

In [23]:
playlist_features_df.head()

Unnamed: 0,Track Name,artists,popularity,release_year,danceability,energy,loudness,speechiness,acousticness,instrumentalness,...,mode_1,genre_0,genre_1,genre_2,genre_3,genre_4,genre_5,genre_6,explicit_False,explicit_True
0,SLOW DANCING IN THE DARK,Joji,49,2018,0.515,0.479,-7.458,0.0261,0.544,0.00598,...,1,0,0,1,0,0,0,0,0,1
1,505,Arctic Monkeys,81,2007,0.52,0.852,-5.866,0.0543,0.00237,5.8e-05,...,1,0,0,0,0,0,1,0,1,0
2,Sweater Weather,The Neighbourhood,90,2013,0.612,0.807,-2.81,0.0336,0.0495,0.0177,...,1,0,0,0,0,0,0,1,1,0
3,Are You Bored Yet? (feat. Clairo),Wallows,83,2019,0.682,0.683,-6.444,0.0287,0.156,2.3e-05,...,0,0,0,0,0,0,1,0,1,0
4,Young Dumb & Broke,Khalid,82,2017,0.799,0.539,-6.351,0.0421,0.199,1.7e-05,...,1,1,0,0,0,0,0,0,1,0


Now all that we need to do is to apply the Autoencoder and scale them.

In [24]:
unweighted_full_encoded = unweighted_encoder.predict(X_train)
weighted_full_encoded = weighted_encoder.predict(X_train)

unweighted_playlist_encoded = unweighted_encoder.predict(playlist_features_df.drop(['Track Name', 'artists'], axis=1).astype('float32'))
weighted_playlist_encoded = weighted_encoder.predict(playlist_features_df.drop(['Track Name', 'artists'], axis=1).astype('float32'))

weighted_full_encoded_scaled = scaler.fit_transform(weighted_full_encoded)
unweighted_full_encoded_scaled = scaler.fit_transform(unweighted_full_encoded)

unweighted_playlist_encoded_scaled = scaler.transform(unweighted_playlist_encoded)
weighted_playlist_encoded_scaled = scaler.transform(weighted_playlist_encoded)



# Our Models

In this section, we will be making different models using the different features and techniques we found earlier. Our baseline model will be a simple **K Nearest Neighbours algorithm** to recommend songs to users. While the novel approach is a combination of collaborative, content-based and deep learning approaches, the lack of available good datasets and hardware constraints made us stick to a **simpler, content-based filtering algorithm** as our recommender system.
<br>
<br>
Our 5 models are as follows:
1. Baseline model with using all features and default feature weights
2. Baseline model with using all features and adjusted feature weights (based on insights from Part 2: Music Popularity Model)
3. Baseline model with the worst feature weights dropped (based on insights from Part 2: Music Popularity Model)
4. Baseline model with the Unweighted Autoencoder
5. Baseline model with the Weighted Autoencoder

We will not only be testing how well the models do with the dimensionality reduction, but also how well the model does if we add weights to the features, and seeing if it helps to improve the quality of our predictions.

### 1. Our baseline model

In [25]:
# Initialize and Train the NearestNeighbors Model
similarity_metric = 'cosine'

# Fit and Train model
nn_model = NearestNeighbors(n_neighbors=5, algorithm='auto', metric=similarity_metric)
nn_model.fit(scaled_full_dataset_df.drop(['Track Name','artists'], axis=1).values)

# Calculate the average features for the playlist
average_features = scaled_playlist_features_df.drop(['Track Name','artists'], axis=1).mean(axis=0).to_numpy().reshape(1, -1)

# Query the nearest neighbors model using the average feature vector
distances, indices = nn_model.kneighbors(average_features)
recommended_tracks = full_dataset_df.iloc[indices[0]]
for i in range(5):
    print(f"{recommended_tracks['Track Name'].iloc[i]} by {recommended_tracks['artists'].iloc[i]}")

Two of Us by Louis Tomlinson
Let’s Go by Matt and Kim
Sagittarius Superstar by COIN
The Best Day (Taylor’s Version) by Taylor Swift
Welcome To My Island by Caroline Polachek


### 2. Baseline model with weights

Here we will be making use of our baseline model, but adding feature importances to our features through multiplying a feature that is more important with a value greater than one and a feature that is less important with a value less than one.

In [26]:
# Set the similarity metric
similarity_metric = 'cosine'

weighted_scaled_full_dataset_df = scaled_full_dataset_df.copy()
weighted_scaled_playlist_features_df = scaled_playlist_features_df.copy()
    
# Set weights
weights = {
    'genre_0': 3, 'genre_1': 3, 'genre_2': 3, 'genre_3': 3, 'genre_4': 3, 'genre_5': 3, 'genre_6': 3,
    'explicit_False': 3, 'explicit_True': 3, 'release_year': 5, 'instrumentalness': 4,
    'key_0': 0.3, 'key_1': 0.3, 'key_2': 0.3, 'key_3': 0.3, 'key_4': 0.3, 'key_5': 0.3, 'key_6': 0.3,
    'key_7': 0.3, 'key_8': 0.3, 'key_9': 0.3, 'key_10': 0.3, 'key_11': 3, 'mode_0': 0.4, 'mode_1': 0.4,'danceability':0.3,'liveness':0.3
}

# Apply weights
for feature, weight in weights.items():
    weighted_scaled_full_dataset_df[feature] *= weight
    weighted_scaled_playlist_features_df[feature] *= weight
    

# Initialize and Train the NearestNeighbors Model with the chosen metric
nn_model = NearestNeighbors(n_neighbors=5, algorithm='auto', metric=similarity_metric)
nn_model.fit(weighted_scaled_full_dataset_df.drop(['Track Name', 'artists'], axis=1).values)

average_features = weighted_scaled_playlist_features_df.drop(['Track Name', 'artists'], axis=1).values.mean(axis=0).reshape(1, -1)
# Query the nearest neighbors model using the average feature vector
distances, indices = nn_model.kneighbors(average_features)

recommended_tracks = full_dataset_df.iloc[indices[0]]
for i in range(5):
    print(f"{recommended_tracks['Track Name'].iloc[i]} by {recommended_tracks['artists'].iloc[i]}")

Tessellate by alt-J
Shot in the Dark by John Mayer
Harness Your Hopes - B-side by Pavement
The Night Begins to Shine by B.E.R.
Kenny by Still Woozy


### 3. Baseline models with features dropped

Here, we will only be using the top 5 features (excluding artists) that were found in the Music Popularity Model, namely genre, explicit, instrumentalness, loudness and release year.

In [27]:
# Set the similarity metric
similarity_metric = 'cosine'

# Initialize and Train the NearestNeighbors Model with the chosen metric
nn_model = NearestNeighbors(n_neighbors=5, algorithm='auto', metric=similarity_metric)
nn_model.fit(scaled_full_dataset_df[['genre_0', 'genre_1', 'genre_2', 'genre_3','genre_4','genre_5','genre_6','explicit_True','explicit_False','instrumentalness','loudness','release_year']].values)
average_features = scaled_playlist_features_df[['genre_0', 'genre_1', 'genre_2', 'genre_3','genre_4','genre_5','genre_6','explicit_True','explicit_False','instrumentalness','loudness','release_year']].values.mean(axis=0).reshape(1, -1)
# Query the nearest neighbors model using the average feature vector
distances, indices = nn_model.kneighbors(average_features)
recommended_tracks = full_dataset_df.iloc[indices[0]]
for i in range(5):
  print(f"{recommended_tracks['Track Name'].iloc[i]} by {recommended_tracks['artists'].iloc[i]}")

Why Don't You by Cleo Sol
The Night Begins to Shine by B.E.R.
Call Me Up by daydreamers
Afterthought by Joji
Every Other Freckle by alt-J


### 4. Baseline model with the Autoencoder

Our model, with the unweighted autoencoder used to reduce dimensionality.

In [28]:
# Set the similarity metric
similarity_metric = 'cosine'

# Initialize and Train the NearestNeighbors Model with the chosen metric
nn_model = NearestNeighbors(n_neighbors=5, algorithm='auto', metric=similarity_metric)
nn_model.fit(unweighted_full_encoded_scaled)


average_features = unweighted_playlist_encoded_scaled.mean(axis=0).reshape(1, -1)
# Query the nearest neighbors model using the average feature vector
distances, indices = nn_model.kneighbors(average_features)
recommended_tracks = full_dataset_df.iloc[indices[0]]
for i in range(5):
  print(f"{recommended_tracks['Track Name'].iloc[i]} by {recommended_tracks['artists'].iloc[i]}")

Who's Lovin' You by The Jackson 5
Dirt Off Your Shoulder by JAY-Z
Keep Holding On by Avril Lavigne
Swayed by How We Burn
Pony by Ginuwine


### 5. Baseline model with the autoencoder and weights

Our model, with the weighted autoencoder used to reduce dimensionality and to try and improve the quality of recommendations.

In [29]:
# Set the similarity metric, we chose cosine as seemed to give the best results
similarity_metric = 'cosine' 

# Initialize and Train the NearestNeighbors Model with the chosen metric
nn_model = NearestNeighbors(n_neighbors=5, algorithm='auto', metric=similarity_metric)
nn_model.fit(weighted_full_encoded_scaled)

# Calculate the average features for the playlist
average_features = weighted_playlist_encoded_scaled.mean(axis=0).reshape(1, -1)

# Query the nearest neighbors model using the average feature vector
distances, indices = nn_model.kneighbors(average_features)
recommended_tracks = full_dataset_df.iloc[indices[0]]
for i in range(5):
  print(f"{recommended_tracks['Track Name'].iloc[i]} by {recommended_tracks['artists'].iloc[i]}")

III. Urn by Childish Gambino
AirplaneMode by BONES
White by Frank Ocean
From the Subway Train by Vansire
Attracted to You by PinkPantheress


# Results

After making our models, we will be evaluating them based on user feedback. We got 10 users to try the recommender system, and got them to rate the quality of the top 5 recommendations. Here are our results:
<br>
1. Baseline model with adjusted weights: **7.04**
2. Baseline model with the weighted autoencoder: 6.9
3. Baseline model: 6.74
4. Baseline model with the unweighted Autoencoder: 6.06
5. Baseline models with worst features dropped: 4.7

![title](img/survey.png)

# Data-driven Insights and Recommendations

As you can see, while there are still areas for improvement for the quality of predictions across the board, the resultant recommendations were overall satisfactory. From our data, we can make a number of insights and recommendations to small-scale music streaming companies.

1. **The baseline model with adjusted weights** based on features that we found to be important for song popularity **performed the best** amongst the five models. However, we also note that the use of **Weighted Autoencoders** can help **improve efficiency without significant performance cost** when we are making use of numerical metadata. It seems that for a small-scaled music recommender system, dimensionality reduction techniques like Weighted Autoencoder can be paired with feature enhancement/ model enhancement strategies to reach compromise between cost and performance. 

2. **Make use of a more robust recommender system**. While our results are not too bad, a simple content-based filtering algorithm is not as effective in dealing with **subjective preferences** compared to a model that uses **both content-based and collaborative filtering**. Implementation of collaborative filtering, which seeks to filter out music suggestions to a user based on how similar users react, can help to enhance our model further to personalise recommendations. 

3. **Broaden the types of data used**. We noticed that our system only uses **structured data** for training the recommender system. Novel approaches will make use of **lyric and audio data** to enhance feature selection and potentially improve the quality of recommendations to users. However, this will come at the expense of **more compute**. To balance this, for systems which take in unstructured data such as unstructured text-based lyric data and audio data, we recommend similar dimensionality reduction techniques such as the use of **Mel-Frequency Cepstral Coefficients (MFCCs) for audio data** [3], which is a method of encoding audio data and the use of **Singular Value Decomposition (SVD) for textual data**.[4]
<br>
<br>
We believe that with these recommendations, a smaller company with less resources can make make a and a good and computationally efficient model to recommend songs to users. Some of our insights could also possibly help larger companies as well, to make their recommendations more robust.


### From Part 1: Genre Classification Model
4. **Simpler models sometimes achieve better results**. The Multinomial Logistic Model beat the likes of sophisticated models such as XGBoost and Random Forest to give us the best classification metrics (e.g. accuracy, AOC score etc)

### From Part 2: Music Popularity Model
5. **Popularity of a song** is more significantly influenced by the artists due to their accumulated fame and genre is shaped by one's preference in music. It is more interesting to note that **newer songs tend to be more popular** since they leave a stronger impression to current listeners. This is also boosted by the **digitalisation of music industry** as newer songs could be distributed more easily online. Presence of explicit lyrics could attract **eyeball attention** and influence the song popularity. Instrumentalness which measures the presence of vocal (higher implies lesser vocal content) is also important in deciding song popularity as most popular songs **generally contain vocals**.

# Citations

[1] M. Schedl, ‘Deep Learning in Music Recommendation Systems’, Front. Appl. Math. Stat., vol. 5, doi: 10.3389/fams.2019.00044.
<br>
[2] F. Ricci, L. Rokach, and B. Shapira, Eds., Recommender Systems Handbook. New York, NY: Springer US, 2022. doi: 10.1007/978-1-0716-2197-4.
<br>
[3] V. Tiwari, "MFCC and its applications in speaker recognition," International Journal on Emerging Technologies, vol. 1, no. 1, pp. 19-22, 2010.
<br>
[4] P. C. Hansen, "The truncatedSVD as a method for regularization," BIT Numerical Mathematics, vol. 27, no. 4, pp. 534-553, 1987. doi: 10.1007/BF01937276.