## 03 Recommender

Project: Building a Personalised Playlist Generator for Spotify
<br>
Name: Syahiran Rafi

---

### Description

This notebook outlines the entire process of building the song recommender system.

The user journey may be broken down into two main stages:
1. Choosing your favourite genre(s) to generate a list of random tracks (`generate_random_tracks_balanced`)
2. Choosing at least 3 songs from the randomly generated list to create your own personalised playlist of 15 songs (`generate_personalised_playlist`)

The following functions were created to support the above user journey:
| Function                      | Description |
|-------------------------------|-------------|
| `top_subgenres`               |Generates a list of top sub-genres for a list of genres|
| `generate_random_tracks`      |Generates a list of `n` random songs based on genre(s)|
| `generate_random_tracks_balanced` |Generates a list of `n` random songs based on genre(s), with a balanced sample of genres if more than one genre is input|
| `track_viewer`                |Formats the random list of songs into a more readable format for user studies|
| `extract_genres`              |Reformats the genre strings `'pop,rock,dance'` into a list of genres `['pop', 'rock', 'dance']`|
| `one_hot_encode_genres`       |Performs one hot encoding on `artist_genres` before building the `cosine_similarity` matrix|
| `recommend_songs`             |Recommends top `n` songs for a single `track_name` and `artist_name` pair|
| `generate_personalised_playlist` |Generates a personalised playlist of 15 songs, using the tracks selected from the randomly generated list|
| `df_to_tuples`                |Reformats the result from `generate_personalised_playlist` into a list of `('track_name','artist_name')` tuples to input in the `generate_spotify_playlist` function|
| `generate_spotify_playlist`   |Creates an actual Spotify playlist using Spotify API|

---

### 1. Import libraries

In [1]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
import random

# Recommender
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# Spotify API
import spotipy
from spotipy.oauth2 import SpotifyOAuth
import socket
from http.server import HTTPServer, BaseHTTPRequestHandler

### 2. Read relevant data sets into a data frame

Three main CSV files, processed and exported from the earlier notebooks, are read into data frames for use within this notebook:
1. spotify-top-10k-processed.csv
2. spotify-40k-processed.csv
3. genres-10k.csv

The 10k data set is mainly used to generate random tracks for users based on genre(s) selected (`generate_random_tracks_balanced`).

In [2]:
tracks_10k_df = pd.read_csv("../data/spotify-top-10k-processed.csv")

In [3]:
tracks_10k_df

Unnamed: 0,track_uri,track_name,artist_uri,artist_name,album_uri,album_name,album_artist_uri,album_artist_name,album_release_date,album_image_url,...,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,album_genres,label,copyrights
0,spotify:track:1xazlnvtthcdzt2ni1dtxo,justified & ancient - stand by the jams,spotify:artist:6dyrdrlnzskavxyg5irvch,the klf,spotify:album:4mc0zjntvp1ndd5lslxfjc,songs collection,spotify:artist:6dyrdrlnzskavxyg5irvch,the klf,1992-08-03,https://i.scdn.co/image/ab67616d0000b27355346b...,...,0.0480,0.015800,0.112000,0.4080,0.504,111.458,4.0,,jams communications,"c 1992 copyright control, p 1992 jams communic..."
1,spotify:track:6a8gbqilv8hbuw3c6uk9ph,i know you want me (calle ocho),spotify:artist:0tnoyisbd1xyrbk9myaseg,pitbull,spotify:album:5xlacbvbsalrtpxnkkggxa,pitbull starring in rebelution,spotify:artist:0tnoyisbd1xyrbk9myaseg,pitbull,2009-10-23,https://i.scdn.co/image/ab67616d0000b27326d73a...,...,0.1490,0.014200,0.000021,0.2370,0.800,127.045,4.0,,mr.305/polo grounds music/j records,"p (p) 2009 rca/jive label group, a unit of son..."
2,spotify:track:70xtwbcvzcpaoddjftmcvi,from the bottom of my broken heart,spotify:artist:26dsoyclwsylmakd3tpor4,britney spears,spotify:album:3wnxdumksmgmjrhegk80qx,...baby one more time (digital deluxe version),spotify:artist:26dsoyclwsylmakd3tpor4,britney spears,1999-01-12,https://i.scdn.co/image/ab67616d0000b2738e4986...,...,0.0305,0.560000,0.000001,0.3380,0.706,74.981,4.0,,jive,p (p) 1999 zomba recording llc
3,spotify:track:1nxuwypjk5ko6dqj5t7bdu,apeman - 2014 remastered version,spotify:artist:1sqrv42e4pjeyfphs0tk9e,the kinks,spotify:album:6ll6hugnen4vlc8sj0zcse,"lola vs. powerman and the moneygoround, pt. on...",spotify:artist:1sqrv42e4pjeyfphs0tk9e,the kinks,2014-10-20,https://i.scdn.co/image/ab67616d0000b2731e7c53...,...,0.2590,0.568000,0.000051,0.0384,0.833,75.311,4.0,,sanctuary records,"c © 2014 sanctuary records group ltd., a bmg c..."
4,spotify:track:72wztws6v7uu3amgmmekye,you can't always get what you want,spotify:artist:22be4uq6banwshpvcdxlce,the rolling stones,spotify:album:0c78nsgqx6vfnisnwixwod,let it bleed,spotify:artist:22be4uq6banwshpvcdxlce,the rolling stones,1969-12-05,https://i.scdn.co/image/ab67616d0000b27373d927...,...,0.0687,0.675000,0.000073,0.2890,0.497,85.818,4.0,,universal music group,"c © 2002 abkco music & records inc., p ℗ 2002 ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8870,spotify:track:6puzxtihkv346yp89nzp9x,kernkraft 400,spotify:artist:7vfpnlbcxbbfs4kfbulksl,zombie nation,spotify:album:2qmrrouzqemrkfr9pbmdhd,kernkraft 400 single mixes,spotify:artist:7vfpnlbcxbbfs4kfbulksl,zombie nation,2006-03-07,https://i.scdn.co/image/ab67616d0000b273916e34...,...,0.0868,0.005500,0.901000,0.1460,0.487,140.064,4.0,,ukw records,"c 2006 copyright control, p 2006 copyright con..."
8871,spotify:track:3kcklokqqepvwxwljbgj5p,kernkraft 400 (a better day),"spotify:artist:0u6gtibw46tfx7koq6unjz, spotify...","topic, a7s",spotify:album:2nichqkijgw4r4dqfmg0a3,kernkraft 400 (a better day),"spotify:artist:0u6gtibw46tfx7koq6unjz, spotify...","topic, a7s",2022-06-17,https://i.scdn.co/image/ab67616d0000b273e1cafe...,...,0.0562,0.184000,0.000020,0.3090,0.400,125.975,4.0,,virgin,"c © 2022 topic, under exclusive license to uni..."
8872,spotify:track:5k9qrzjfdap5cxvdzai02f,never say never - radio edit,spotify:artist:1sczsjoyaihnnm9qlhzdnl,vandalism,spotify:album:2n506u3hkn3caedvajv5ct,never say never,spotify:artist:1sczsjoyaihnnm9qlhzdnl,vandalism,2005-10-24,https://i.scdn.co/image/ab67616d0000b273b65ad4...,...,0.0340,0.000354,0.011200,0.3380,0.767,130.978,4.0,,vicious,"c 2005 vicious, a division of vicious recordin..."
8873,spotify:track:5ydecnawdmfbu4zl0ropah,groovejet (if this ain't love) [feat. sophie e...,"spotify:artist:4bmymfwdu9zlcitrumrewb, spotify...","spiller, sophie ellis-bextor",spotify:album:20q3pgpyiyicf32x5l8pph,groovejet (if this ain't love) [feat. sophie e...,spotify:artist:4bmymfwdu9zlcitrumrewb,spiller,2000-08-14,https://i.scdn.co/image/ab67616d0000b27342781a...,...,0.0389,0.000132,0.088900,0.3610,0.626,123.037,4.0,,defected records,"c © 2021 defected records limited, p ℗ 2021 de..."


In [4]:
tracks_10k_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8875 entries, 0 to 8874
Data columns (total 35 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   track_uri            8875 non-null   object 
 1   track_name           8874 non-null   object 
 2   artist_uri           8873 non-null   object 
 3   artist_name          8874 non-null   object 
 4   album_uri            8873 non-null   object 
 5   album_name           8874 non-null   object 
 6   album_artist_uri     8873 non-null   object 
 7   album_artist_name    8873 non-null   object 
 8   album_release_date   8873 non-null   object 
 9   album_image_url      8871 non-null   object 
 10  disc_number          8875 non-null   int64  
 11  track_number         8875 non-null   int64  
 12  track_duration_(ms)  8875 non-null   int64  
 13  track_preview_url    6415 non-null   object 
 14  explicit             8875 non-null   bool   
 15  popularity           8875 non-null   i

The 40k data set is mainly used to build the similarity matrix and the `recommend_songs` function.

In [5]:
tracks_40k_df = pd.read_csv("../data/spotify-40k-processed.csv")

In [6]:
tracks_40k_df

Unnamed: 0,track_uri,track_name,artist_name,artist_genres,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,spotify:track:1xazlnvtthcdzt2ni1dtxo,justified & ancient - stand by the jams,the klf,"acid house,ambient house,big beat,hip house",0.617,0.872,8.0,-12.305,1.0,0.0480,0.0158,0.112000,0.4080,0.504,111.458
1,spotify:track:6a8gbqilv8hbuw3c6uk9ph,i know you want me (calle ocho),pitbull,"dance pop,miami hip hop,pop",0.825,0.743,2.0,-5.995,1.0,0.1490,0.0142,0.000021,0.2370,0.800,127.045
2,spotify:track:70xtwbcvzcpaoddjftmcvi,from the bottom of my broken heart,britney spears,"dance pop,pop",0.677,0.665,7.0,-5.171,1.0,0.0305,0.5600,0.000001,0.3380,0.706,74.981
3,spotify:track:1nxuwypjk5ko6dqj5t7bdu,apeman - 2014 remastered version,the kinks,"album rock,art rock,british invasion,classic r...",0.683,0.728,9.0,-8.920,1.0,0.2590,0.5680,0.000051,0.0384,0.833,75.311
4,spotify:track:72wztws6v7uu3amgmmekye,you can't always get what you want,the rolling stones,"album rock,british invasion,classic rock,rock",0.319,0.627,0.0,-9.611,1.0,0.0687,0.6750,0.000073,0.2890,0.497,85.818
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40542,spotify:track:3uchi1gfoul5j5sweh0tch,i don't know,jon d,unknown,0.669,0.228,2.0,-12.119,1.0,0.0690,0.7920,0.065000,0.0944,0.402,83.024
40543,spotify:track:0p1oo2gremyucookzyayfu,the answer,big words,australian r&b,0.493,0.727,1.0,-5.031,1.0,0.2170,0.0873,0.000000,0.1290,0.289,73.259
40544,spotify:track:2om4burudnevk59ivixcwn,25.22,allan rayman,"canadian contemporary r&b,modern alternative rock",0.702,0.524,7.0,-10.710,1.0,0.0793,0.3320,0.055300,0.2980,0.265,140.089
40545,spotify:track:4ri5ttugjm96tbqzd5ua7v,good feeling,jon jason,unknown,0.509,0.286,8.0,-14.722,1.0,0.1230,0.4020,0.000012,0.1310,0.259,121.633


In [7]:
tracks_40k_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40547 entries, 0 to 40546
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   track_uri         40547 non-null  object 
 1   track_name        40546 non-null  object 
 2   artist_name       40546 non-null  object 
 3   artist_genres     40012 non-null  object 
 4   danceability      40545 non-null  float64
 5   energy            40545 non-null  float64
 6   key               40545 non-null  float64
 7   loudness          40545 non-null  float64
 8   mode              40545 non-null  float64
 9   speechiness       40545 non-null  float64
 10  acousticness      40545 non-null  float64
 11  instrumentalness  40545 non-null  float64
 12  liveness          40545 non-null  float64
 13  valence           40545 non-null  float64
 14  tempo             40545 non-null  float64
dtypes: float64(11), object(4)
memory usage: 4.6+ MB


The consolidated list of genres from the 10k data set is mainly used to support the `generate_random_tracks_balanced` function.

In [8]:
genres_10k_df = pd.read_csv("../data/genres-10k.csv")

In [9]:
genres_10k_df

Unnamed: 0,genre,count
0,acid house,6
1,ambient house,7
2,big beat,105
3,hip house,123
4,dance pop,4362
...,...,...
2286,chinese hip hop,1
2287,chinese idol pop,1
2288,musica potosina,1
2289,weightless,1


In [10]:
genres_10k_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2291 entries, 0 to 2290
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   genre   2291 non-null   object
 1   count   2291 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 35.9+ KB


### 3. Create a function for generating sub-genres

The following list of common genres was consolidated based on the EDA performed in the 2nd notebook.

In [11]:
# Common genres to choose from
common_genres = ['pop', 'rock', 'hip hop', 'rap', 'r&b', 'soul', 'dance', 'electronic', 'house', 'metal', 'punk', 'country']

The `common_genres` list was created initially to support the `top_subgenres` function. This function takes a list of genres (as strings) and prints the top 5 sub-genres for each overarching genre.

The idea was to incorporate this function into the user studies so that users could choose specific sub-genres when generating random tracks. However, this idea was eventually dropped to reduce friction in the user journey. I also found that letting users be as generic or as specific as they want when selecting genres generally works better for them.
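
The notebook's function special-cases `'rap'` with a word-boundary pattern, because a plain substring search for `'rap'` would also match a genre such as `'trap'`. The following is a minimal sketch of applying the same word-boundary idea to any genre and returning the results instead of printing them. `top_subgenres_sketch` and the toy `genre_counts` dictionary are hypothetical and not part of the notebook.

```python
import re

def top_subgenres_sketch(genres, genre_counts, k=5):
    """Return {genre: top-k sub-genres by count} using whole-word matching."""
    results = {}
    for genre in genres:
        # \b anchors the match on word boundaries, so 'rap' does not match 'trap'
        pattern = re.compile(rf"\b{re.escape(genre)}\b", flags=re.IGNORECASE)
        matches = [
            (name, count)
            for name, count in genre_counts.items()
            if name != genre and pattern.search(name)
        ]
        # Highest counts first; keep only the sub-genre names
        matches.sort(key=lambda pair: pair[1], reverse=True)
        results[genre] = [name for name, _ in matches[:k]]
    return results

# Toy counts, made up for illustration only
genre_counts = {"rap": 50, "pop rap": 30, "trap": 40, "southern rap": 5, "pop": 90}
subgenres = top_subgenres_sketch(["rap"], genre_counts)
print(subgenres)
```

Note the trade-off: whole-word matching would also exclude compound names such as `'electropop'` from the `'pop'` results, so whether it is preferable depends on how broad the sub-genre list is meant to be.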

In [12]:
def top_subgenres(genres_list):
    for genre in genres_list:
        if genre == 'rap':
            search_term = r'\brap\b'
            result = list(genres_10k_df[genres_10k_df['genre'].str.contains(search_term, case=False)].sort_values(by='count', ascending=False).head(6)['genre'])
        else:
            result = list(genres_10k_df[genres_10k_df['genre'].str.contains(genre, case=False)].sort_values(by='count', ascending=False).head(6)['genre'])
        
        if genre in result:
            result.remove(genre)
        else:
            result = result[:-1]

        print(f"'{genre}' sub-genres:")
        for i in result:
            print(i)
        print("\n")

Sample of how the `top_subgenres` function may be used.

In [13]:
top_subgenres(['pop','rock','dance'])

'pop' sub-genres:
dance pop
pop rap
pop rock
post-teen pop
electropop


'rock' sub-genres:
modern rock
pop rock
classic rock
soft rock
indie rock


'dance' sub-genres:
dance pop
pop dance
alternative dance
australian dance
dance rock




### 4. Create `generate_random_tracks` function based on genres selected

Check all columns in `tracks_10k_df`.

In [14]:
tracks_10k_df.columns

Index(['track_uri', 'track_name', 'artist_uri', 'artist_name', 'album_uri',
       'album_name', 'album_artist_uri', 'album_artist_name',
       'album_release_date', 'album_image_url', 'disc_number', 'track_number',
       'track_duration_(ms)', 'track_preview_url', 'explicit', 'popularity',
       'isrc', 'added_by', 'added_at', 'artist_genres', 'danceability',
       'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature',
       'album_genres', 'label', 'copyrights'],
      dtype='object')

Rows with null cells in the `'artist_genres'` column are removed.

In [15]:
tracks_10k_df['artist_genres'].isna().sum()

535

In [16]:
tracks_10k_df.dropna(subset=['artist_genres'], inplace=True)

In [17]:
tracks_10k_df['artist_genres'].isna().sum()

0

The cleaned up `tracks_10k_df` with no rows containing null `'artist_genres'` cells is exported as a new CSV file to develop the Streamlit app.

In [18]:
# tracks_10k_df.to_csv('../data/tracks-10k-processed.csv', index=False)

The `generate_random_tracks` function takes in a list of genres (`genres`) and an integer `n`, and outputs a randomly generated list of `n` songs. The function searches through the `'artist_genres'` column of the 10k data set and outputs `n` random rows where the genres match the input genres.
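
Because the filter tests `genre in x` against the raw comma-separated string, it performs substring matching: `'pop'` also matches `'electropop'` or `'new wave pop'`. That may well be intended, since it broadens the pool of candidate tracks. If exact genre matches were wanted instead, a sketch along these lines could be used. Both helper names are hypothetical.

```python
def matches_substring(artist_genres, genres):
    # Mirrors the notebook's filter: substring test on the raw string
    return any(genre in artist_genres for genre in genres)

def matches_exact(artist_genres, genres):
    # Compare against whole genre names after splitting on commas
    tokens = {token.strip() for token in artist_genres.split(",")}
    return any(genre in tokens for genre in genres)

row = "australian electropop,indietronica"
substring_hit = matches_substring(row, ["pop"])
exact_hit = matches_exact(row, ["pop"])
print(substring_hit, exact_hit)
```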

For the first 20 user studies, this function was used to generate random tracks for users. It seemed to work well for most users: the list needed to be refreshed only 1-3 times before the user recognised at least 3 songs that they liked.

However, the function did not work well if the genres chosen by the user had a mix of overrepresented and underrepresented genres. For example, if a user chooses 'pop', 'rock' and 'alternative r&b', the popular genres ('pop' and 'rock') would be overly represented in the random list of tracks such that 'alternative r&b' tracks are rarely shown, if at all. This proved to be an issue because the user's personalised playlist would then end up being representative of only 1-2 of their favourite genres, instead of all 3.
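
The imbalance follows directly from uniform sampling: the expected number of tracks from a genre in a sample of `n` is `n` times that genre's share of the filtered pool. A rough sketch with made-up pool sizes illustrates this. The counts are hypothetical, not taken from the data set, and overlap between genres is ignored.

```python
def expected_tracks(pool_sizes, n):
    """Expected tracks per genre when n tracks are drawn uniformly from the pooled set."""
    total = sum(pool_sizes.values())
    return {genre: n * size / total for genre, size in pool_sizes.items()}

# Hypothetical pool sizes, for illustration only
pool_sizes = {"pop": 4000, "rock": 2000, "alternative r&b": 60}
expected = expected_tracks(pool_sizes, n=15)
for genre, value in expected.items():
    print(f"{genre}: {value:.2f}")
```

With pool sizes this lopsided, the smallest genre is expected to contribute well under one track per list of 15.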

The `generate_random_tracks` function works by generating `n` random songs for the user. Whether the user selects 1, 2 or 3 genres, `n` songs will still be displayed in the list. I found that the 'sweet spot' for the number of songs in each randomly generated list is 15.

In [19]:
def generate_random_tracks(genres, n):
    # Load data from 'tracks_10k_df'
    data = tracks_10k_df

    # Filter DataFrame based on specified genres
    filtered_data = data[data['artist_genres'].apply(lambda x: any(genre in x for genre in genres))]

    # Get the number of available tracks
    num_tracks_available = len(filtered_data)

    # Check if number of requested tracks (n) exceeds available tracks
    if n > num_tracks_available:
        raise ValueError(f"Requested number of tracks ({n}) exceeds available tracks ({num_tracks_available}).")

    # Randomly select n tracks
    random_indices = np.random.choice(num_tracks_available, n, replace=False)
    random_tracks = filtered_data.iloc[random_indices]

    # Extract track_name, artist_name pairs and artist_genres
    track_artist_genres = list(zip(random_tracks['track_name'], random_tracks['artist_name'], random_tracks['artist_genres']))
    random.shuffle(track_artist_genres)

    result = {i + 1: tpl for i, tpl in enumerate(track_artist_genres)}
    return result

To solve the issue of less popular genres being underrepresented in the random track generation, the `generate_random_tracks` function was tweaked to create `generate_random_tracks_balanced`. This new function displays a more even sample of tracks across all genres, regardless of their popularity or mainstream status.

The `generate_random_tracks_balanced` function works by taking 5 random tracks from each genre to generate the list. This "forces" the function to display an even sample of songs for each genre. However, a limitation is that a user would be shown only 5 songs if they chose 1 genre, which is too few. To overcome this, I enforced a rule that users must select at least 3 genres, so that the random list is sufficiently long (at least 15 songs). If fewer than 3 genres are given, the function returns the message "Choose at least 3 genres!".

Note: Most users surveyed selected at most 5 genres.

In [20]:
# Convert df to dictionary
# To be used in the `generate_random_tracks_balanced` function
genres_10k_dict = dict(zip(genres_10k_df['genre'], genres_10k_df['count']))

In [21]:
def generate_random_tracks_balanced(genres):
    # Load data from 'tracks_10k_df'
    track_data = tracks_10k_df

    # Initialise empty data frame to store results
    songs = pd.DataFrame()

    # If fewer than 3 genres are input, an error message is returned
    if len(genres) < 3:
        return "Choose at least 3 genres!"
    else:
        # For each genre, load 5 random tracks
        for g in genres:
            # Error statement if user chooses a genre that does not exist
            if g not in genres_10k_dict:
                print(f"The '{g}' genre does not exist.")
                return
            # Error statement if user chooses a genre with fewer than 20 tracks in the data set
            elif genres_10k_dict[g] < 20:
                print(f"Choose a genre other than '{g}' for better results.")
                return
            else:
                try:
                    # Append to 'songs' df using pd.concat()
                    songs_by_genre = track_data[track_data['artist_genres'].apply(lambda x: g in x)].sample(5)
                    songs = pd.concat([songs, songs_by_genre])
                except ValueError:
                    # Raised by .sample() if a genre slips past the "elif" check and has fewer than 5 songs
                    print(f"Choose a genre other than '{g}' for better results.")
                    return

    # Extract track_name, artist_name and artist_genres
    track_artist_genres = list(zip(songs['track_name'], songs['artist_name'], songs['artist_genres']))
    random.shuffle(track_artist_genres)

    result = {i + 1: tpl for i, tpl in enumerate(track_artist_genres)}
    return result

To further make the `generate_random_tracks_balanced` function more robust, error messages for the following edge cases are included:
1. A user chooses an obscure genre that does not exist in the data set
2. A user chooses an underrepresented genre with fewer than 20 songs in the data set
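
The function above stops at the first problem it finds and prints a message. As an alternative, the checks could be collected into a list so that a front end can show every problem at once. The sketch below assumes this design. `validate_genres` and the toy `genre_counts` dictionary are hypothetical, while the messages and thresholds mirror those used in the notebook.

```python
def validate_genres(genres, genre_counts, min_genres=3, min_tracks=20):
    """Return a list of problems; an empty list means the selection is usable."""
    problems = []
    if len(genres) < min_genres:
        problems.append(f"Choose at least {min_genres} genres!")
    for g in genres:
        if g not in genre_counts:
            # Genre is absent from the data set
            problems.append(f"The '{g}' genre does not exist.")
        elif genre_counts[g] < min_tracks:
            # Genre exists but has too few tracks to sample from reliably
            problems.append(f"Choose a genre other than '{g}' for better results.")
    return problems

# Toy counts, for illustration only
genre_counts = {"pop": 1000, "r&b": 500, "vogue": 3}
problems = validate_genres(["pop", "vogue", "nonsense"], genre_counts)
print(problems)
```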

`generate_random_tracks_balanced` returns a dictionary that maps each track number to a `('track_name', 'artist_name', 'artist_genres')` tuple. To make the randomly generated list more readable for users during the user studies, the `track_viewer` function is created to convert the dictionary into a data frame.

In [22]:
def track_viewer(tracks_dict):
    if not tracks_dict:
        return

    try:
        # Create lists to store track details
        track_numbers = []
        track_names = []
        artist_names = []
        artist_genres = []
        
        # Extract information from the dictionary
        for key, value in tracks_dict.items():
            track_numbers.append(key)
            track_names.append(value[0])
            artist_names.append(value[1])
            artist_genres.append(value[2].replace(',', ', '))
        
        # Create a DataFrame with the extracted information
        df = pd.DataFrame({
            "track_number": track_numbers,
            "track_name": track_names,
            "artist_name": artist_names,
            "artist_genres": artist_genres,
        })

        # Set 'track_number' as the index
        df.set_index('track_number', inplace=True)
        
        # Display the DataFrame (table format)
        return df
    
    except AttributeError:
        # A string (the error message) was passed in instead of a dictionary
        return "Choose at least 3 genres!"

Sample: `generate_random_tracks` returns a list of 12 'pop' songs when `genres = ['pop']`  and `n = 12`.

In [23]:
track_viewer(generate_random_tracks(['pop'], 12))

track_number,track_name,artist_name,artist_genres
1,you were right,rüfüs du sol,"australian electropop, indietronica"
2,pleasure and pain,divinyls,"australian rock, new wave pop"
3,here you come again,dolly parton,"classic country pop, country, country dawn"
4,thank you,mkto,"pop, post-teen pop"
5,make me (cry),"noah cyrus, labrinth","alt z, pop, indie poptimism, pop"
6,wild things,san cisco,"australian indie, fremantle indie, metropopoli..."
7,i love this life,kim cesarion,"swedish pop, swedish soul"
8,don't let me down (feat. daya),"the chainsmokers, daya","electropop, pop, pop"
9,ghost,justin bieber,"canadian pop, pop"
10,the man,taylor swift,pop


Sample: `generate_random_tracks_balanced` returns a list of 20 songs when `genres = ['pop', 'rock', 'hip hop', 'edm']` (5 songs for each genre).

In [24]:
track_viewer(generate_random_tracks_balanced(['pop', 'rock', 'hip hop', 'edm']))

track_number,track_name,artist_name,artist_genres
1,get lucky (radio edit) [feat. pharrell william...,"daft punk, pharrell williams, nile rodgers","electro, filter house, rock, dance pop, pop, d..."
2,you don't know me - radio edit,"jax jones, raye","dance pop, edm, house, pop dance, uk dance, uk..."
3,troublemaker (feat. flo rida),"olly murs, flo rida","dance pop, pop, talent show, dance pop, miami ..."
4,can't take my eyes off you,lady a,"contemporary country, country, country dawn, c..."
5,satisfied (feat. max),"galantis, max","dance pop, edm, pop, pop dance, singer-songwri..."
6,nothing from nothing,billy preston,"funk, psychedelic soul, rock keyboard, soul"
7,beautiful people (feat. benny benassi),"chris brown, benny benassi","r&b, rap, edm, electro house, pop dance"
8,ready or not,"fugees, ms. lauryn hill, wyclef jean, pras","east coast hip hop, hip hop, neo soul, new jer..."
9,dream a little dream of me - album version wit...,the mamas & the papas,"classic rock, folk, folk rock, mellow gold, ps..."
10,antmusic - remastered,adam & the ants,"new romantic, new wave, new wave pop"


Sample: `generate_random_tracks_balanced` returns an error message when `genres = ['pop']`. User needs to select at least 3 genres.

In [25]:
track_viewer(generate_random_tracks_balanced(['pop']))

'Choose at least 3 genres!'

Sample: `generate_random_tracks_balanced` returns an error message when a genre has fewer than 20 songs in the data set.

In [26]:
track_viewer(generate_random_tracks_balanced(['pop','r&b','classical']))

Choose a genre other than 'classical' for better results.


In [27]:
track_viewer(generate_random_tracks_balanced(['pop','french rock','vogue']))

Choose a genre other than 'french rock' for better results.


Sample: `generate_random_tracks_balanced` returns an error message when a genre that does not exist in the data set is input.

In [28]:
track_viewer(generate_random_tracks_balanced(['pop','r&b','nonsense']))

The 'nonsense' genre does not exist.


In the Streamlit app, I tweaked the if-else statements in the `generate_random_tracks_balanced` function to make the output more visually consistent for the presentation: the same number of random tracks is shown regardless of the number of genres selected. The following block of code shows how the sampling is done when 1-2 genres are selected vs. 3 or more genres. This ensures that there are always 10 random tracks shown and that all genres have an equal chance of being represented.

```
    # If 1 genre is chosen, 10 random songs are shown
    if len(genres_list) == 1:
        for g in genres_list:
            # Append to 'songs' df using pd.concat()
            songs_by_genre = track_data[track_data['artist_genres'].apply(lambda x: g in x)].sample(10)
            songs = pd.concat([songs, songs_by_genre])
    # If 2 genres are chosen, 5 random songs of each genre are selected and shown
    elif len(genres_list) == 2:
        for g in genres_list:
            # Append to 'songs' df using pd.concat()
            songs_by_genre = track_data[track_data['artist_genres'].apply(lambda x: g in x)].sample(5)
            songs = pd.concat([songs, songs_by_genre])
    # If 3 or more genres are chosen, 4 random songs of each genre are selected,
    # but a random sample of only 10 songs is shown in the end
    else:
        for g in genres_list:
            # Append to 'songs' df using pd.concat()
            songs_by_genre = track_data[track_data['artist_genres'].apply(lambda x: g in x)].sample(4)
            songs = pd.concat([songs, songs_by_genre])
        songs = songs.sample(n=10, random_state=42)
```
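
The branching above boils down to choosing a per-genre sample size and then capping the list at 10 tracks. Here is a compact sketch of that logic on plain Python lists. `balanced_sample`, `tracks_by_genre` and the toy track names are hypothetical, and unlike the Streamlit snippet it uses a seeded `random.Random` instead of pandas sampling.

```python
import random

def balanced_sample(tracks_by_genre, genres, total=10, seed=None):
    """Pick tracks evenly across genres, then trim to `total` tracks."""
    rng = random.Random(seed)
    # 1 genre -> 10 tracks, 2 genres -> 5 each, 3 or more -> 4 each
    per_genre = {1: 10, 2: 5}.get(len(genres), 4)
    picked = []
    for g in genres:
        picked.extend(rng.sample(tracks_by_genre[g], per_genre))
    # With 3+ genres more than `total` tracks are drawn, so sample down
    if len(picked) > total:
        picked = rng.sample(picked, total)
    rng.shuffle(picked)
    return picked

# Toy pools of 12 made-up track names per genre
tracks_by_genre = {g: [f"{g} track {i}" for i in range(12)] for g in ["pop", "rock", "folk"]}
for selection in (["pop"], ["pop", "rock"], ["pop", "rock", "folk"]):
    print(len(balanced_sample(tracks_by_genre, selection, seed=0)))
```

One caveat applies to the original snippet as well: with 3 or more genres, the final down-sampling is random, so an individual genre can end up with fewer tracks than the others in a given list.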

##### Sample output for `generate_random_tracks_balanced`

In [29]:
# Specify genres and number of random track pairs to generate
selected_genres = ['dance pop', 'singer-songwriter', 'electro house', 'folk']

# Generate random track pairs based on the specified genres
random_tracks_by_genre = generate_random_tracks_balanced(selected_genres)
random_tracks_by_genre

{1: ('2step (feat. budjerah)',
  'ed sheeran, budjerah',
  'pop,singer-songwriter pop,uk pop,australian indigenous,australian r&b'),
 2: ("acceptable in the 80's",
  'calvin harris',
  'dance pop,edm,electro house,house,pop,progressive house,uk dance'),
 3: ('back home again',
  'john denver',
  'classic country pop,folk,folk rock,mellow gold,singer-songwriter,soft rock'),
 4: ('all for you', 'janet jackson', 'dance pop,r&b,urban contemporary'),
 5: ('beautiful noise',
  'neil diamond',
  'adult standards,brill building pop,folk rock,heartland rock,mellow gold,singer-songwriter,soft rock,yacht rock'),
 6: ('time of our lives - radio edit',
  'pitbull, ne-yo',
  'dance pop,miami hip hop,pop,dance pop,pop,r&b,urban contemporary'),
 7: ("cathy's clown",
  'the everly brothers',
  'adult standards,folk rock,mellow gold,rock-and-roll,rockabilly,sunshine pop'),
 8: ('my way',
  'calvin harris',
  'dance pop,edm,electro house,house,pop,progressive house,uk dance'),
 9: ('dear mr. president',


##### Sample output for `track_viewer`

In [30]:
track_viewer(random_tracks_by_genre)

track_number,track_name,artist_name,artist_genres
1,2step (feat. budjerah),"ed sheeran, budjerah","pop, singer-songwriter pop, uk pop, australian..."
2,acceptable in the 80's,calvin harris,"dance pop, edm, electro house, house, pop, pro..."
3,back home again,john denver,"classic country pop, folk, folk rock, mellow g..."
4,all for you,janet jackson,"dance pop, r&b, urban contemporary"
5,beautiful noise,neil diamond,"adult standards, brill building pop, folk rock..."
6,time of our lives - radio edit,"pitbull, ne-yo","dance pop, miami hip hop, pop, dance pop, pop,..."
7,cathy's clown,the everly brothers,"adult standards, folk rock, mellow gold, rock-..."
8,my way,calvin harris,"dance pop, edm, electro house, house, pop, pro..."
9,dear mr. president,"p!nk, indigo girls","dance pop, pop, ectofolk, folk, lilith, singer..."
10,lana,roy orbison,"adult standards, classic rock, folk rock, mell..."


### 5. Generate cosine similarity matrix for all songs in the `tracks` data set

The 40k track data set will be used to build the similarity matrix.

##### Data pre-processing before one-hot encoding `artist_genres` column

In [31]:
tracks_40k_df.columns

Index(['track_uri', 'track_name', 'artist_name', 'artist_genres',
       'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo'],
      dtype='object')

The cells in `'artist_genres'` column are currently strings `'acid house,ambient house,big beat,hip house'`. I'd like to convert each string into a list of strings `['acid house','ambient house','big beat','hip house']` with each representing a genre.

In [32]:
tracks_40k_df['artist_genres']

0              acid house,ambient house,big beat,hip house
1                              dance pop,miami hip hop,pop
2                                            dance pop,pop
3        album rock,art rock,british invasion,classic r...
4            album rock,british invasion,classic rock,rock
                               ...                        
40542                                              unknown
40543                                       australian r&b
40544    canadian contemporary r&b,modern alternative rock
40545                                              unknown
40546    indie poptimism,indiecoustica,modern alternati...
Name: artist_genres, Length: 40547, dtype: object

The `extract_genres` function splits the genre string based on the comma, then stores each genre into a list as a string.
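
For reference, here is a library-free sketch of the same splitting step, with whitespace stripping added so that `'pop, rock'` and `'pop,rock'` give the same result. The stripping is an assumption about possible input, not something the data set is known to need, and `split_genres` is a hypothetical name.

```python
def split_genres(genre_string):
    """Split 'pop,rock,dance' into ['pop', 'rock', 'dance']; non-strings give []."""
    if not isinstance(genre_string, str):
        # Covers None and float('nan'), which is how pandas represents missing text
        return []
    # Strip stray whitespace and drop empty pieces such as those left by a trailing comma
    return [piece.strip() for piece in genre_string.split(",") if piece.strip()]

print(split_genres("acid house,ambient house, big beat"))
print(split_genres(float("nan")))
```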

In [33]:
def extract_genres(input_string):
    # Check for null or missing values
    if pd.isna(input_string):
        return []
    # Split the input string based on the comma delimiter
    else:
        genre_list = input_string.split(',')
        return genre_list

Test case to see if the `extract_genres` function works as intended. A new column called `'genres_list'` is created to store the reformatted genres.

In [34]:
# Create a new column called "genres_list" which converts each "artist_genres" string into a list
# (extract_genres already returns an empty list for null values)
tracks_40k_df['genres_list'] = tracks_40k_df['artist_genres'].apply(extract_genres)
tracks_40k_df['genres_list']

0         [acid house, ambient house, big beat, hip house]
1                          [dance pop, miami hip hop, pop]
2                                         [dance pop, pop]
3        [album rock, art rock, british invasion, class...
4        [album rock, british invasion, classic rock, r...
                               ...                        
40542                                            [unknown]
40543                                     [australian r&b]
40544    [canadian contemporary r&b, modern alternative...
40545                                            [unknown]
40546    [indie poptimism, indiecoustica, modern altern...
Name: genres_list, Length: 40547, dtype: object

A copy of `tracks_40k_df`, without the `artist_genres` and `track_uri` columns, is used to build the cosine similarity matrix. Working on a copy prevents the original `tracks_40k_df` from being modified, as we might still need to use it.

In [35]:
# Drop 'artist_genres' and 'track_uri' column
similarity_data = tracks_40k_df.drop(['artist_genres', 'track_uri'], axis=1)
similarity_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40547 entries, 0 to 40546
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   track_name        40546 non-null  object 
 1   artist_name       40546 non-null  object 
 2   danceability      40545 non-null  float64
 3   energy            40545 non-null  float64
 4   key               40545 non-null  float64
 5   loudness          40545 non-null  float64
 6   mode              40545 non-null  float64
 7   speechiness       40545 non-null  float64
 8   acousticness      40545 non-null  float64
 9   instrumentalness  40545 non-null  float64
 10  liveness          40545 non-null  float64
 11  valence           40545 non-null  float64
 12  tempo             40545 non-null  float64
 13  genres_list       40547 non-null  object 
dtypes: float64(11), object(3)
memory usage: 4.3+ MB


Double check if there are any null cells in the `genres_list` column.

In [36]:
similarity_data['genres_list'].isna().sum()

0

Set `'track_name'` and `'artist_name'` as indices.

In [37]:
similarity_data.set_index(['track_name', 'artist_name'], inplace=True)

##### One-hot encode `genres_list` column

Apart from the indices `track_name` and `artist_name`, the only other non-numeric feature is `genres_list`.

We will one-hot encode the `genres_list` column. Cosine similarity operates on numeric vectors, so each genre is converted into its own 0/1 column. This lets the similarity matrix capture both the numerical audio features and the categorical genre information, with every genre represented on the same 0/1 scale.

In [38]:
import pandas as pd

def one_hot_encode_genres(df, column_name='genres_list'):
    """
    One-hot encodes the genres in the specified column of the DataFrame.

    Args:
        df (pandas.DataFrame): The input DataFrame.
        column_name (str, optional): The name of the column to be encoded. Default is 'genres_list'.

    Returns:
        pandas.DataFrame: A copy of the input DataFrame with one 0/1 column per genre added and the original genre column dropped.
    """
    # Create a duplicate copy of the input df
    df = df.copy()

    # Explode the genres column into separate rows
    exploded = df[column_name].explode()

    # Get a list of all unique genres
    all_genres = exploded.unique()

    # Use list comprehension to filter out null genres
    filtered_genres = [genre for genre in all_genres if genre and not pd.isna(genre)]

    # Create a list to hold the one-hot encoded genre DataFrames
    genre_dfs = []

    # Create one-hot encoded DataFrames for each genre
    for genre in filtered_genres:
        new_column_name = f'genre_{genre.replace(" ", "_")}'
        new_column = df[column_name].apply(lambda genres: int(genre in genres) if isinstance(genres, list) else 0)
        genre_dfs.append(new_column.rename(new_column_name))

    # Concatenate all the one-hot encoded genre DataFrames along the columns axis
    if genre_dfs:
        df = pd.concat([df] + genre_dfs, axis=1)

    # Drop the original genres column
    df = df.drop(column_name, axis=1)

    return df
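
To see what this encoding produces, the following minimal sketch applies the same idea to a three-row toy data frame. The toy rows and the names `toy`, `exploded`, `dummies` and `encoded` are made up for illustration. It uses `explode` with `pd.get_dummies` instead of the per-genre loop above, so it is an alternative formulation under those assumptions, not the function used in this notebook.

```python
import pandas as pd

# Toy data (illustrative only, not taken from tracks_40k_df)
toy = pd.DataFrame({
    "track_name": ["song a", "song b", "song c"],
    "genres_list": [["pop", "dance pop"], ["rock"], []],
})

# One row per (track, genre); an empty list becomes a single NaN row
exploded = toy["genres_list"].explode()

# Indicator columns per genre, then collapse back to one row per track
dummies = pd.get_dummies(exploded, prefix="genre", dtype=int)
encoded = dummies.groupby(level=0).max()

# Match the naming used above: spaces become underscores
encoded.columns = [col.replace(" ", "_") for col in encoded.columns]

print(encoded)
```

A track with an empty genre list ends up with 0 in every genre column, which mirrors how `one_hot_encode_genres` treats tracks without genres.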

In [39]:
encoded_similarity_data = one_hot_encode_genres(similarity_data)
encoded_similarity_data

track_name,artist_name,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,...,genre_danish_metal,genre_little_rock_indie,genre_african_percussion,genre_igbo_pop,genre_steel_guitar,genre_chinese_hip_hop,genre_chinese_idol_pop,genre_musica_potosina,genre_weightless,genre_jazz_guitar_trio
justified & ancient - stand by the jams,the klf,0.617,0.872,8.0,-12.305,1.0,0.0480,0.0158,0.112000,0.4080,0.504,...,0,0,0,0,0,0,0,0,0,0
i know you want me (calle ocho),pitbull,0.825,0.743,2.0,-5.995,1.0,0.1490,0.0142,0.000021,0.2370,0.800,...,0,0,0,0,0,0,0,0,0,0
from the bottom of my broken heart,britney spears,0.677,0.665,7.0,-5.171,1.0,0.0305,0.5600,0.000001,0.3380,0.706,...,0,0,0,0,0,0,0,0,0,0
apeman - 2014 remastered version,the kinks,0.683,0.728,9.0,-8.920,1.0,0.2590,0.5680,0.000051,0.0384,0.833,...,0,0,0,0,0,0,0,0,0,0
you can't always get what you want,the rolling stones,0.319,0.627,0.0,-9.611,1.0,0.0687,0.6750,0.000073,0.2890,0.497,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
i don't know,jon d,0.669,0.228,2.0,-12.119,1.0,0.0690,0.7920,0.065000,0.0944,0.402,...,0,0,0,0,0,0,0,0,0,0
the answer,big words,0.493,0.727,1.0,-5.031,1.0,0.2170,0.0873,0.000000,0.1290,0.289,...,0,0,0,0,0,0,0,0,0,0
25.22,allan rayman,0.702,0.524,7.0,-10.710,1.0,0.0793,0.3320,0.055300,0.2980,0.265,...,0,0,0,0,0,0,0,0,0,0
good feeling,jon jason,0.509,0.286,8.0,-14.722,1.0,0.1230,0.4020,0.000012,0.1310,0.259,...,0,0,0,0,0,0,0,0,0,0


There are 2291 genres + 11 remaining non-index features.
Hence, total number of columns after one hot encoding = 2302.

The following three cells serve to check if the one hot encoding function worked as intended on three sample genres: pop rock, metal and country.

In [40]:
encoded_similarity_data['genre_pop_rock'].value_counts()

genre_pop_rock
0    38433
1     2114
Name: count, dtype: int64

In [41]:
encoded_similarity_data['genre_metal'].value_counts()

genre_metal
0    40353
1      194
Name: count, dtype: int64

In [42]:
encoded_similarity_data['genre_country'].value_counts()

genre_country
0    38785
1     1762
Name: count, dtype: int64

In [43]:
encoded_similarity_data.shape

(40547, 2302)

##### Normalise values

Before performing cosine similarity, we need to put the numerical features on a common scale.

Cosine similarity is sensitive to the scale of the feature values. Without rescaling, features with larger ranges (such as `tempo` and `loudness`) would dominate the similarity calculation, leading to inaccurate results. `StandardScaler` standardises each numerical column to zero mean and unit variance so that these features contribute comparably. The one-hot encoded genre columns are left as 0/1.

In [44]:
# Create a StandardScaler object
scaler = StandardScaler()

# Define the numerical columns to scale
numerical_columns = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness',
                     'instrumentalness', 'liveness', 'valence', 'tempo']

# Apply standardization to the numerical columns
encoded_similarity_data[numerical_columns] = scaler.fit_transform(encoded_similarity_data[numerical_columns])
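
The effect of scale on cosine similarity can be sketched with a few toy rows. The values below are made up for illustration, and the helper `cosine` is a hypothetical name. The z-scoring is done by hand with NumPy, using the population standard deviation, which is what `StandardScaler` uses by default.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy rows: [danceability, tempo] (made-up values for illustration)
X = np.array([
    [0.90, 120.0],
    [0.10, 121.0],
    [0.85, 80.0],
])

# Raw features: tempo dwarfs danceability, so rows 0 and 1 look almost identical
raw_sim = cosine(X[0], X[1])

# Z-score each column (zero mean, unit variance, population std)
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_sim = cosine(Z[0], Z[1])

print(f"raw: {raw_sim:.3f}, standardised: {scaled_sim:.3f}")
```

On the raw values the two tracks score close to 1 despite very different danceability. After standardisation the difference in danceability is reflected in a much lower score.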

##### Fill null values

Before performing cosine similarity, we also need to fill all null values, as the similarity cannot be computed for rows with missing entries.

Filling null values with appropriate values (e.g., mean, median, or a constant) ensures that all instances have complete information, allowing for a more accurate similarity computation.

Since only a handful of cells are null (2 per numerical column), I decided to fill the null values with `0`. Because the numerical columns were standardised in the previous step, `0` corresponds to each column's mean.

In [45]:
encoded_similarity_data.isna().sum()

danceability              2
energy                    2
key                       2
loudness                  2
mode                      2
                         ..
genre_chinese_hip_hop     0
genre_chinese_idol_pop    0
genre_musica_potosina     0
genre_weightless          0
genre_jazz_guitar_trio    0
Length: 2302, dtype: int64

In [46]:
encoded_similarity_data = encoded_similarity_data.fillna(0)

In [47]:
encoded_similarity_data.isna().sum()

danceability              0
energy                    0
key                       0
loudness                  0
mode                      0
                         ..
genre_chinese_hip_hop     0
genre_chinese_idol_pop    0
genre_musica_potosina     0
genre_weightless          0
genre_jazz_guitar_trio    0
Length: 2302, dtype: int64

##### Generate similarity matrix using `cosine_similarity`

`cosine_similarity` computes the pairwise cosine similarity between all rows (instances) in the `encoded_similarity_data`. This square matrix is stored in a new data frame `track_sim`, where both rows and columns represent all the songs in the data set.

In [48]:
track_sim = pd.DataFrame(cosine_similarity(encoded_similarity_data), columns=encoded_similarity_data.index, index=encoded_similarity_data.index)
print(track_sim.shape)
track_sim.head(20)

(40547, 40547)


Unnamed: 0_level_0,track_name,justified & ancient - stand by the jams,i know you want me (calle ocho),from the bottom of my broken heart,apeman - 2014 remastered version,you can't always get what you want,don't stop - 2004 remaster,eastside (with halsey & khalid),something about the way you look tonight - edit version,juke box hero,mercy,...,asking,u make me feel good,oh my love,fragile,diamond child,i don't know,the answer,25.22,good feeling,cosmic angel - acoustic from capitol studios
Unnamed: 0_level_1,artist_name,the klf,pitbull,britney spears,the kinks,the rolling stones,fleetwood mac,"benny blanco, halsey, khalid",elton john,foreigner,shawn mendes,...,anwai,astronomyy,layla,rozes,aayushi,jon d,big words,allan rayman,jon jason,grizfolk
track_name,artist_name,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2
justified & ancient - stand by the jams,the klf,1.0,0.063919,0.138761,-0.004569,-0.031267,0.127262,-0.214441,-0.036018,-0.043722,-0.107261,...,0.014658,-0.099745,-0.128333,0.009586,-0.00358,-0.204146,-0.100321,0.229376,0.043675,-0.202693
i know you want me (calle ocho),pitbull,0.063919,1.0,0.318152,0.132411,-0.190212,0.21528,-0.061195,-0.078176,-0.124597,-0.151043,...,-0.41355,0.001094,-0.547755,-0.202331,-0.573887,-0.160449,0.116614,-0.082598,-0.384017,-0.277699
from the bottom of my broken heart,britney spears,0.138761,0.318152,1.0,0.29955,0.241086,0.146539,0.043655,-0.169054,-0.27931,-0.174026,...,0.319092,-0.326875,-0.11622,0.198784,0.185787,0.247314,0.044071,-0.054428,-0.15399,-0.12088
apeman - 2014 remastered version,the kinks,-0.004569,0.132411,0.29955,1.0,0.278516,0.412918,0.316828,0.057642,0.020297,-0.116126,...,0.182431,-0.248372,-0.071425,0.068069,0.162062,0.194762,0.181522,-0.110481,0.093909,-0.104046
you can't always get what you want,the rolling stones,-0.031267,-0.190212,0.241086,0.278516,1.0,-0.015129,0.079169,0.079884,0.040025,-0.467329,...,0.062638,-0.209617,0.226762,0.300161,0.233105,0.422188,0.321965,-0.10856,0.152307,-0.08955
don't stop - 2004 remaster,fleetwood mac,0.127262,0.21528,0.146539,0.412918,-0.015129,1.0,-0.240193,0.337696,0.459914,0.071861,...,0.071414,-0.161969,-0.19267,0.059109,-0.035786,-0.151387,-0.17101,-0.104056,-0.074398,-0.142797
eastside (with halsey & khalid),"benny blanco, halsey, khalid",-0.214441,-0.061195,0.043655,0.316828,0.079169,-0.240193,1.0,-0.302533,-0.208548,0.214721,...,0.121783,0.061992,0.011752,0.187,0.193995,0.131206,0.351378,-0.131969,0.115064,0.258834
something about the way you look tonight - edit version,elton john,-0.036018,-0.078176,-0.169054,0.057642,0.079884,0.337696,-0.302533,1.0,0.572633,0.082096,...,0.02653,-0.183297,0.205101,0.108874,0.089362,-0.097224,-0.134084,0.009002,0.121389,-0.082695
juke box hero,foreigner,-0.043722,-0.124597,-0.27931,0.020297,0.040025,0.459914,-0.208548,0.572633,1.0,0.292515,...,-0.052646,-0.251254,0.192587,-0.033188,0.05222,-0.344657,-0.188117,0.018107,0.069857,-0.097825
mercy,shawn mendes,-0.107261,-0.151043,-0.174026,-0.116126,-0.467329,0.071861,0.214721,0.082096,0.292515,1.0,...,0.096127,0.014935,0.028822,-0.133621,0.083471,-0.454605,-0.273064,-0.027936,-0.053727,0.255888


The following cell is for calculating cosine similarity score between two songs.

For my presentation, I plan to briefly explain how cosine similarity works in the context of my project. Cosine similarity scores for the following songs are computed in pairwise order:
- "diamonds" by rihanna
- "mercy" by shawn mendes
- "i gotta feeling" by black eyed peas
- "hard times" by paramore

In [49]:
# Calculate cosine similarity score for two songs
track_sim[('diamonds', 'rihanna')][('hard times', 'paramore')]

0.3539780760471589
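
As a minimal sketch of how the pairwise scores can be collected, the example below builds a small similarity matrix for the four songs and looks up every unordered pair once. The feature vectors are made up for illustration, so the printed scores are not the real values from `track_sim`. The names `toy_sim` and `pair_scores` are hypothetical. Cosine similarity is computed as the dot product of two vectors divided by the product of their lengths.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Made-up feature vectors for illustration; the real scores come from track_sim
songs = [
    ("diamonds", "rihanna"),
    ("mercy", "shawn mendes"),
    ("i gotta feeling", "black eyed peas"),
    ("hard times", "paramore"),
]
features = np.array([
    [0.5, -0.2, 1.0],
    [-0.4, 0.9, 0.0],
    [1.2, 0.3, 1.0],
    [0.8, 0.1, 0.0],
])

# Cosine similarity: dot product of the rows after scaling each to unit length
unit = features / np.linalg.norm(features, axis=1, keepdims=True)
index = pd.MultiIndex.from_tuples(songs, names=["track_name", "artist_name"])
toy_sim = pd.DataFrame(unit @ unit.T, index=index, columns=index)

# Look up every unordered pair once, using the same indexing pattern as above
pair_scores = {(a, b): toy_sim[a][b] for a, b in combinations(songs, 2)}

for (a, b), score in pair_scores.items():
    print(f"{a[0]} vs {b[0]}: {score:.3f}")
```

Four songs give six unordered pairs. Because the matrix is symmetric, the order of the two lookups does not matter.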

### 6. Create `recommend_songs` function given `track_name` and `artist_name`

The `recommend_songs` function is created to generate the top `n` song recommendations given a `track_name` and `artist_name`.

In [50]:
def recommend_songs(track_name, artist_name, n):
    """
    Get recommended songs based on track_name and artist_name using the global similarity DataFrame `track_sim`.

    Parameters:
    - track_name (str): Name of the track to search for.
    - artist_name (str): Name of the artist associated with the track.
    - n (int): Number of recommendations to return.

    Returns:
    - pd.Series: Top n similarity scores, indexed by (track_name, artist_name). Empty if the track is not found.
    """

    sim_df = track_sim

    try:
        # Use loc to retrieve data based on MultiIndex values
        result = sim_df.loc[(track_name, artist_name)]

        # Sort the MultiIndex levels lexically
        result_sorted = result.sort_index()

        # Drop every entry that shares the seed's track_name (including the seed itself)
        result_dropped = result_sorted.drop(track_name, level="track_name")

        # Sort by similarity score in descending order and keep the top n
        top_values = result_dropped.sort_values(ascending=False).head(n)

        return top_values

    except KeyError:
        print(f"No recommendation found for '{track_name}' by '{artist_name}'")
        return pd.Series(dtype=float)  # Return an empty Series if no recommendation is found
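
The ranking logic inside `recommend_songs` can be sketched on a single toy row of similarity scores. The track names and scores below are made up for illustration, and `row`, `candidates` and `top_2` are hypothetical names.

```python
import pandas as pd

# Toy similarity scores for one seed track (made-up values)
index = pd.MultiIndex.from_tuples(
    [
        ("seed", "artist a"),
        ("seed - remix", "artist a"),
        ("other", "artist b"),
        ("another", "artist c"),
    ],
    names=["track_name", "artist_name"],
)
row = pd.Series([1.0, 0.95, 0.40, 0.70], index=index)

# Drop the seed itself (its self-similarity is 1.0), then rank the rest
candidates = row.sort_index().drop("seed", level="track_name")
top_2 = candidates.sort_values(ascending=False).head(2)

print(top_2)
```

The drop matches on the exact track name only, so a remix or remaster with a different title stays in the candidate list and can rank first.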

##### Sample outputs for `recommend_songs`

In [51]:
# Call the function with specific track_name and artist_name
recommend_songs('levitating', 'dua lipa', 10)

track_name                               artist_name       
dreams                                   dua lipa              0.930893
never gonna not dance again              p!nk                  0.923524
tv in the morning                        dnce                  0.892390
give it 2 me                             madonna               0.878239
the way                                  ariana grande         0.862144
finesse                                  bruno mars            0.861334
don't hold your breath                   nicole scherzinger    0.860264
dance the night (from barbie the album)  dua lipa              0.860172
like i love you                          justin timberlake     0.856917
bang bang bang - russ chimes remix       mark ronson           0.856188
Name: (levitating, dua lipa), dtype: float64

In [52]:
# Call the function with specific track_name and artist_name
recommend_songs('sorry', 'justin bieber', 10)

track_name                                artist_name     
sorry - latino remix                      justin bieber       0.986271
what's hatnin'                            justin bieber       0.890373
we can't stop                             miley cyrus         0.855894
love myself                               hailee steinfeld    0.829388
last time                                 labrinth            0.824754
roller coaster                            justin bieber       0.820546
hymn for the weekend - seeb remix         coldplay            0.807104
company                                   justin bieber       0.805850
2002                                      anne-marie          0.803936
liquor store blues (feat. damian marley)  bruno mars          0.799498
Name: (sorry, justin bieber), dtype: float64

In [53]:
# Call the function with specific track_name and artist_name
recommend_songs('a thousand miles', 'vanessa carlton', 10)

track_name                       artist_name             
something to sleep to            michelle branch             0.818109
where are you now?               michelle branch             0.761893
don't speak liar                 we the kings                0.757283
shattered [turn the car around]  o.a.r.                      0.754293
how to save a life               the fray                    0.749018
goodbye to you                   michelle branch             0.749015
hurricane                        the fray                    0.732766
she's so high                    tal bachman                 0.723493
there she goes                   sixpence none the richer    0.722743
breathe your name                sixpence none the richer    0.718964
Name: (a thousand miles, vanessa carlton), dtype: float64

In [54]:
# Call the function with specific track_name and artist_name
recommend_songs('hello', 'adele', 10)

track_name                                artist_name               
pray                                      sam smith                     0.848638
turning tables                            adele                         0.808146
love in the dark                          adele                         0.799982
run                                       leona lewis                   0.756665
half the man                              rozzi crane                   0.751191
one and only                              adele                         0.744689
send my love (to your new lover)          adele                         0.743433
met him last night (feat. ariana grande)  demi lovato, ariana grande    0.734106
water and a flame (feat. adele)           daniel merriweather, adele    0.725639
hometown glory                            adele                         0.724806
Name: (hello, adele), dtype: float64

In [55]:
# Call the function with specific track_name and artist_name
recommend_songs('fuel', 'metallica', 10)

track_name                                 artist_name
for whom the bell tolls - remastered       metallica      0.885263
from the pinnacle to the pit               ghost b.c.     0.857714
shout at the devil                         mötley crüe    0.805559
sad but true                               metallica      0.779839
jigolo har megiddo                         ghost b.c.     0.779290
looks that kill                            mötley crüe    0.767522
aces high - 1998 remastered version        iron maiden    0.756359
unskinny bop                               poison         0.751943
hero of the day                            metallica      0.744439
hot for teacher - 2015 remastered version  van halen      0.736229
Name: (fuel, metallica), dtype: float64

In [56]:
# Call the function with specific track_name and artist_name
recommend_songs('run to the hills', 'iron maiden', 10)

track_name                                  artist_name                
run to the hills - 1998 remastered version  iron maiden                    0.998652
aces high - 1998 remastered version         iron maiden                    0.928692
beautiful girls                             van halen                      0.874457
this is how i disappear                     my chemical romance            0.816112
young lovers go pop!                        this many boyfriends           0.815596
i don't love you                            my chemical romance            0.815417
crimson and clover                          joan jett & the blackhearts    0.813686
this time around                            hanson                         0.812910
solutions                                   the sundance kids              0.811164
i remember                                  bully                          0.807057
Name: (run to the hills, iron maiden), dtype: float64

In [57]:
# Call the function with specific track_name and artist_name
recommend_songs('supermodel', 'sza', 10)

track_name                                   artist_name                     
the need to know (feat. sza)                 wale                                0.821436
love yourself                                justin bieber                       0.809226
wake me up                                   ed sheeran                          0.803931
contacts                                     brockhampton                        0.800422
heaven - a cappella feat. pusha t of clipse  john legend                         0.799732
cold blood                                   bruno major                         0.788056
20 something                                 sza                                 0.782156
i'm not that smart                           original broadway cast recording    0.774694
lil tokyo                                    gnash                               0.774673
runnin' - interlude                          kehlani                             0.769629
Name: (supermodel, sza), dtype: float64

### 7. Create `generate_personalised_playlist` function using the `recommend_songs` function

The `generate_personalised_playlist` function creates a 15-track playlist based on the songs selected in the randomly generated tracklist (from `generate_random_tracks_balanced`).

A 15-song playlist is a good length for the first playlist as it is neither too short nor too long. It usually takes about 50 to 55 minutes to play from start to finish, which is suitable for the users who volunteered to take part in the study.

In [58]:
def generate_personalised_playlist(input_indices, tracks_dict):
    # Instantiate an empty data frame to store the tracks
    playlist = pd.DataFrame()

    # For each song selected from the randomly generated list,
    # the top 5 recommended songs will be added into the `playlist` df
    for i in input_indices:
        track = tracks_dict[i][0]
        artist = tracks_dict[i][1]
        top5songs = recommend_songs(track, artist, 5)
        playlist = pd.concat([playlist, top5songs], ignore_index=False)
    
    # Convert (track_name, artist_name) from the df to a list of tuples
    tuple_list = list(playlist.index)

    # Each selected song contributes 5 recommendations, so 3 selections give the 15 tracks needed
    # If fewer than 15 tracks were collected, prompt the user to choose more songs and return None
    # If more than 15 tracks were collected, the final 15 songs are randomly sampled
    if len(tuple_list) < 15:
        print("Choose at least 3 songs!")
        return None
    else:
        tuple_list_subset = random.sample(tuple_list, 15)
        df = pd.DataFrame(tuple_list_subset, columns=['track_name', 'artist_name'])
        return df
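
The pooling and sampling step can be sketched without the similarity matrix. In the example below, `fake_top5` is a hypothetical stand-in for `recommend_songs`, and the seed names are made up. The de-duplication line is an optional addition that is not part of `generate_personalised_playlist`: two selected songs can share a recommendation, in which case the same track could otherwise be sampled twice.

```python
import random

def fake_top5(seed):
    """Hypothetical stand-in for recommend_songs: 5 recommendations per seed."""
    return [(f"{seed} rec {k}", "some artist") for k in range(1, 6)]

seeds = ["song a", "song b", "song c", "song d"]

# Pool the recommendations for every selected seed
pool = [pair for seed in seeds for pair in fake_top5(seed)]

# Optional: remove repeats while keeping order
unique_pool = list(dict.fromkeys(pool))

random.seed(0)  # only to make the sketch reproducible
if len(unique_pool) < 15:
    playlist = None  # not enough tracks; the notebook prompts for more songs
else:
    playlist = random.sample(unique_pool, 15)

print(playlist)
```

With four seeds the pool holds 20 tracks, so 15 are sampled without replacement. With exactly three seeds the sample is simply a shuffle of all 15.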

##### Create `df_to_tuples` function for data pre-processing

The function `df_to_tuples` returns a list of tuples as output for easier pre-processing when creating the actual Spotify playlist.

In [59]:
def df_to_tuples(personalised_playlist):
    # Convert to list of tuples
    track_artist_tuples = [(row['track_name'], row['artist_name']) for _, row in personalised_playlist.iterrows()]
    return track_artist_tuples

### 8. Set up Spotify API credentials for playlist generation

The following cell creates a temporary HTTP server on localhost with a random port number (port 0 means that the operating system will assign a free port). It then retrieves the assigned port number and closes the temporary server. This is a workaround to get a free port number, which is required for the Spotify authentication flow.

I created a Spotify Developer account to obtain my personal `CLIENT_ID` and `CLIENT_SECRET`. These Spotify application credentials are required for authentication with Spotify Web API.

Finally, the redirect URI (http://example.com/callback) is the one that Spotify will use to redirect the user after the authentication process.

Note: Replace `CLIENT_ID` and `CLIENT_SECRET` with your own Spotify Developer credentials.

In [60]:
# Create a temporary HTTP server to get the assigned port
temp_server = HTTPServer(('localhost', 0), BaseHTTPRequestHandler)
port = temp_server.server_address[1]
temp_server.server_close()

# Replace these with your own Spotify credentials
CLIENT_ID = 'replace'
CLIENT_SECRET = 'replace'

# Redirect URI that Spotify uses after authentication
# Note: the port obtained above is not included in this URI
REDIRECT_URI = 'http://example.com/callback'

The following cell uses the `spotipy` library, which is a lightweight Python library for the Spotify Web API. 

The cell initialises the `spotipy.Spotify` client with an authentication manager (`SpotifyOAuth`). The `SpotifyOAuth` object is initialised with my Spotify application credentials, the redirect URI, and the desired scope (`playlist-modify-public` in this case, which allows the application to modify public playlists).

After executing this code, the `sp` object will be authenticated with Spotify, and I can use it to interact with the Spotify Web API, such as fetching data, creating playlists, or adding tracks to playlists.

In [61]:
# Authenticate with Spotify
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(client_id=CLIENT_ID,
                                               client_secret=CLIENT_SECRET,
                                               redirect_uri=REDIRECT_URI,
                                               scope="playlist-modify-public"))

##### Create `generate_spotify_playlist` function to generate actual playlist

The `generate_spotify_playlist` function is the final and most important function in this notebook: it creates the actual personalised playlist on Spotify, using the Spotify Web API.

The entire process, from selecting at least 3 genres to creating the actual playlist on Spotify, could take less than 10 minutes.

This makes it practical to conduct my user studies, which allow me to gather data to evaluate the performance of the recommender, as well as useful observations and feedback for adjusting and improving it.

In [62]:
def generate_spotify_playlist(tracklist, creator):
    # Convert the playlist DataFrame into a list of (track_name, artist_name) tuples
    track_artist_tuples = df_to_tuples(tracklist)

    # Create an empty list to store the track URIs
    track_uris = []

    # Search for each track and add the URI of the first match to the list
    for track_name, artist_name in track_artist_tuples:
        query = f'track:{track_name} artist:{artist_name}'
        results = sp.search(q=query, type='track', limit=1)

        if results['tracks']['items']:
            track_uri = results['tracks']['items'][0]['uri']
            track_uris.append(track_uri)
        else:
            print(f"No track found for '{track_name}' by '{artist_name}'")

    # Create a new playlist
    user_id = sp.current_user()['id']
    playlist_name = f"{creator}'s Playlist"
    playlist_description = f"A custom playlist for {creator}"
    playlist = sp.user_playlist_create(user=user_id,
                                    name=playlist_name,
                                    public=True,
                                    description=playlist_description)

    # Add the tracks to the new playlist
    sp.playlist_add_items(playlist_id=playlist['id'], items=track_uris)
    print(f"{creator}'s Playlist created successfully with {len(track_uris)} tracks.")

---

### User Studies

This section is used primarily to conduct the user studies.

The following three steps are performed for each user:
1. Choose at least 3 genres to generate a random list of 15 tracks
2. Choose at least 3 songs to generate your own personalised playlist
3. Generate the playlist on Spotify

Each user is then requested to listen to their personalised playlist and rate each song on a scale of 0 to 10, where:
- 0 - I do not like the song at all
- 5 - neutral
- 10 - I like the song a lot

A percentage score is then computed for each playlist.
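
The notebook does not spell out the exact formula, so the sketch below assumes the percentage score is the average rating expressed as a share of the maximum rating of 10. The function name `playlist_percentage` and the ratings are hypothetical.

```python
def playlist_percentage(ratings, max_rating=10):
    """Average rating expressed as a percentage of the maximum rating."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    return 100 * sum(ratings) / (max_rating * len(ratings))

# Hypothetical ratings for a 15-track playlist
example_ratings = [7, 8, 5, 6, 9, 4, 7, 6, 8, 5, 7, 6, 3, 8, 7]
score = playlist_percentage(example_ratings)

print(f"{score:.1f}%")  # prints "64.0%"
```

Under this assumption, a playlist rated 5 (neutral) on every track scores 50%.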

##### Step 1: Choose at least 3 genres to generate a random list of 15 tracks

In [63]:
# Common genres to choose from
# This list is not fed into any functions and is purely for display purposes
common_genres = ['pop',
                 'rock',
                 'hip hop',
                 'rap',
                 'r&b',
                 'soul',
                 'disco',
                 'edm',
                 'house',
                 'metal',
                 'punk',
                 'country',
                 'folk',
                 'singer-songwriter',
                 'jazz']

In [64]:
# This step was eventually removed
# Choose sub-genres (optional)
# top_subgenres(common_genres)

In [65]:
# Specify genres and number of random track pairs to generate
selected_genres = ['dance pop', 'disco', 'pop']

# Generate random track pairs based on the specified genres
random_tracks_by_genre = generate_random_tracks_balanced(selected_genres)
track_viewer(random_tracks_by_genre)

| track_number | track_name | artist_name | artist_genres |
|--------------|------------|-------------|---------------|
| 1 | jump (for my love) | the pointer sisters | disco, girl group, hi-nrg, motown, new wave po... |
| 2 | chain reaction | diana ross | adult standards, disco, motown, quiet storm, s... |
| 3 | larger than life | backstreet boys | boy band, dance pop, pop |
| 4 | hear my name - radio edit | armand van helden | big beat, deep house, disco house, house, spee... |
| 5 | tellin' everybody | human nature | australian pop, australian rock, boy band |
| 6 | high on me | guy sebastian | australian pop, australian talent show |
| 7 | the creeps (radio edit) | camille jones, fedde le grand | dutch house, edm, electro house, house, pop da... |
| 8 | drop it like it's hot | snoop dogg, pharrell williams | g funk, gangster rap, hip hop, pop rap, rap, w... |
| 9 | have you never been mellow | olivia newton-john | adult standards, australian dance, disco, mell... |
| 10 | jolene | olivia newton-john | adult standards, australian dance, disco, mell... |


##### Step 2: Choose at least 3 songs to generate your own personalised playlist

In [66]:
# With reference to the random song list generated in Step 1
# Input the track_number of at least 3 songs (that you like) from the list above
preferred_tracks_index = [3,4,14]
preferred_tracks = generate_personalised_playlist(preferred_tracks_index, random_tracks_by_genre)
preferred_tracks

|   | track_name | artist_name |
|---|------------|-------------|
| 0 | hearts on fire | randy meisner |
| 1 | here comes santa claus (right down santa claus... | mariah carey |
| 2 | crying in the chapel | peter blakeley |
| 3 | hot in the city | billy idol |
| 4 | same old girl | darryl cotton |
| 5 | come undone | duran duran |
| 6 | you don't know me (feat. duane harden) - radio... | armand van helden, duane harden |
| 7 | piano | ariana grande |
| 8 | christmas time | christina aguilera |
| 9 | put your hand in the hand | ocean |


##### Step 3: Generate the playlist on Spotify

In [67]:
# Replace the string with the user's name
# generate_spotify_playlist(preferred_tracks, "Syahiran")

---

### Summary of Results

As I interviewed more users, I made incremental changes to the recommender system in a bid to improve the overall enjoyability score (calculated as the average percentage score of the user's song ratings). 

Altogether, there were 5 different versions and each version was tested on a different group of users. The following table describes the base model (V1) and the incremental changes that were made for subsequent versions.

| Version | Description |
|-------------------------------|-------------|
| V1 (base) | 1. User selects at least 3 preferred songs from the random list <br> 2. The recommender gives the top 10 similar songs for each song <br> 3. 15 songs are sampled to form the playlist |
| V2 | The recommender now gives the top 5 similar songs for each song, instead of 10 |
| V3 | The random song generator and song recommender now pull from the 40k songs data set, instead of the 10k songs data set |
| V4 | The random song generator now pulls from the 10k songs data set (as in V1 and V2), while the song recommender pulls from the 40k songs data set (as in V3) |
| V5 | The random song generator shows a balanced sample of songs for each genre that the user selects |
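
The V1 and V2 flow in the table can be sketched roughly as follows. This is an illustration under assumptions, not the notebook's `recommend_songs` or `generate_personalised_playlist`: tracks are represented as plain numeric feature vectors, the selected songs themselves are excluded from the playlist, and duplicate candidates are dropped. The helper names `top_k_similar` and `build_playlist` are hypothetical.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(features, seed_idx, k):
    """Return the indices of the k tracks most similar to the seed track."""
    scored = [
        (cosine_similarity(features[seed_idx], row), idx)
        for idx, row in enumerate(features)
        if idx != seed_idx  # never recommend the seed itself
    ]
    scored.sort(key=lambda pair: -pair[0])  # stable sort: ties keep index order
    return [idx for _, idx in scored[:k]]


def build_playlist(features, seed_indices, k=10, playlist_size=15, rng=None):
    """Pool the top-k neighbours of every seed, then sample the playlist."""
    if rng is None:
        rng = random.Random()
    candidates = []
    for seed in seed_indices:
        for idx in top_k_similar(features, seed, k):
            # Assumption: seeds and duplicates are left out of the pool
            if idx not in seed_indices and idx not in candidates:
                candidates.append(idx)
    return rng.sample(candidates, min(playlist_size, len(candidates)))
```

Under these assumptions, 3 selected songs with `k=5` give at most 15 candidates, so the sampling step has little or nothing left to choose from, whereas `k=10` gives a pool of up to 30.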

The following table displays the summary of results from the user studies. The percentage scores shown represent the average user ratings for each version of the recommender.

| Version | No. of Participants | Average Score |
|-------------|-------------|-------------|
| V1 (base) | 7 | 62.0% |
| V2 | 7 |61.9% |
| V3 | 2 |72.0% |
| V4 | 6 |68.9% |
| V5 | 8 |66.0% |

### Evaluation

The base recommender (V1) started off strong with an average score of 62.0%. For V2, I made a slight tweak: only the top 5 songs (instead of the top 10) by similarity score were considered for each selected song before randomly sampling the final 15 songs. With this change, the average score remained largely unchanged (61.9%). However, across V1 and V2, two participants had average scores below 45% (specifically, 37.3% and 44.7%). One common point I observed for these two users is that their selected genres were less mainstream, for example "indie folk", "electronica" and "metal". To investigate, I tested the song recommender on two metal songs and found that the top 5 recommended songs for each of them included pop songs by Lady Gaga and Sia.

With V3, my goal was to improve the recommendations for less popular/mainstream genres. To do this, I found a larger data set (with a wider range of songs) to supplement the 10k data set (which contained songs that had charted and hence were generally more popular). While V3 has the highest average score amongst the different versions, I only used it on two participants, so that score is based on a very small sample. The reason is that drawing random songs by genre from the larger 40k data set resulted in users not recognising most of the songs generated. The random list had to be refreshed more than 5 times before users saw songs that they liked or recognised, which noticeably degraded the user experience. This prompted me to revert to the earlier version of the random song generator, which draws from the smaller 10k data set, for V4 (the song recommender still draws from the 40k data set, as in V3). With the smaller data set, users typically needed fewer than 3 refreshes to recognise at least 3 songs from the random list, resulting in a better user experience.

After conducting studies on six participants for V4, I found that the results were mainly positive, with only one participant having an average score below 60% (specifically, 59.3%). One possible reason why this user did not enjoy her playlist as much is that the songs in it were mainly "pop", even though she had also chosen two other (less popular) genres, "contemporary r&b" and "indie pop". This happened because the random list generated was dominated by "pop" songs. To improve the user experience further, I tweaked the random song generator for V5 so that each of the user's selected genres is represented evenly. This means that popular genres such as "pop", "rock" and "hip hop" no longer "overpower" less popular genres in the random song list when a user selects a mix of both. The goal is for the playlist to be representative of the user's overall music taste, instead of being skewed towards one popular genre.
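
The balancing idea can be sketched as follows. This is a simplified illustration and not the notebook's `generate_random_tracks_balanced`: it assumes each track is a `(track_name, set_of_genres)` pair, gives every selected genre an equal quota of `n // len(genres)` songs, and tops up from any remaining matching tracks if a genre has too few. The function name `sample_balanced` and the toy catalogue are hypothetical.

```python
import random


def sample_balanced(tracks, genres, n, rng=None):
    """Sample up to n tracks with an equal quota for each selected genre.

    tracks: list of (track_name, set_of_genres) pairs.
    """
    if rng is None:
        rng = random.Random()
    quota = n // len(genres)
    chosen = []
    for genre in genres:
        # Only consider tracks of this genre that have not been picked yet
        pool = [t for t in tracks if genre in t[1] and t not in chosen]
        chosen.extend(rng.sample(pool, min(quota, len(pool))))
    # Top up if n is not divisible by the genre count or a pool was too small
    leftovers = [
        t for t in tracks
        if t not in chosen and any(g in t[1] for g in genres)
    ]
    shortfall = n - len(chosen)
    chosen.extend(rng.sample(leftovers, min(shortfall, len(leftovers))))
    rng.shuffle(chosen)  # avoid showing the list grouped by genre
    return chosen


# Toy catalogue in which "pop" heavily outnumbers the other two genres
catalogue = (
    [(f"pop song {i}", {"pop"}) for i in range(20)]
    + [(f"indie pop song {i}", {"indie pop"}) for i in range(3)]
    + [(f"r&b song {i}", {"contemporary r&b"}) for i in range(3)]
)
picked = sample_balanced(catalogue, ["pop", "indie pop", "contemporary r&b"], 6)
```

With plain random sampling from this catalogue, most of the 6 songs would be expected to be "pop". With per-genre quotas, each genre contributes 2.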

### Recommendations

For further development of this recommender system, I suggest exploring the following two approaches:

1. **A hybrid approach consisting of collaborative filtering and content-based filtering techniques** <br>
- Content-based filtering (current approach)
    - Currently, the song recommender employs the content-based filtering technique using cosine similarity.
    - Uses characteristics such as genre, artist, energy, danceability, acousticness etc. to measure similarity between songs.
    - Recommends songs that are similar to the ones the user has previously liked or consumed, based on their content or metadata.
    - Does not consider preferences or behavior of other users; recommendations are based solely on the user's own preferences and the item's content.
- Collaborative filtering (supplementary approach)
    - Since each user was asked to rate each song on their personalised playlist from 0 to 10 as part of the user studies, we may have sufficient preference data to start implementing collaborative filtering techniques.
        1. Item-based collaborative filtering:
            - This approach recommends songs that are similar to the ones the user has previously liked, based on the preferences and behavior of other users.
            - It analyses the patterns of songs that users have listened to, and recommends songs that are similar to the ones a user has enjoyed.
            - Measures similarity between items based on the preferences and behavior of other users, rather than the content or attributes of the items themselves.
        2. User-based collaborative filtering:
            - This approach recommends songs based on the preferences of other users with similar tastes.
            - It analyses the behavior and preferences of users to find similarities between them and recommend songs that similar users have liked.
            - Does not consider the content or attributes of the items themselves; recommendations are based solely on the preferences of similar users.

- In practice, hybrid approaches that combine user-based collaborative filtering, item-based collaborative filtering, and content-based filtering are often used to leverage the strengths of each technique and mitigate their weaknesses.
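
As a rough illustration of item-based collaborative filtering on 0 to 10 song ratings, the sketch below scores a song a user has not rated from the songs they have rated. It is a minimal example under assumptions, not part of the current recommender: ratings are stored as a nested dictionary, similarity is plain cosine over users who rated both songs, and all names and ratings are hypothetical.

```python
import math


def item_similarity(ratings, item_a, item_b):
    """Cosine similarity between two songs over users who rated both."""
    common = [u for u in ratings if item_a in ratings[u] and item_b in ratings[u]]
    if not common:
        return 0.0
    dot = sum(ratings[u][item_a] * ratings[u][item_b] for u in common)
    norm_a = math.sqrt(sum(ratings[u][item_a] ** 2 for u in common))
    norm_b = math.sqrt(sum(ratings[u][item_b] ** 2 for u in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_rating(ratings, user, target):
    """Similarity-weighted average of the user's own ratings.

    Returns None when no rated song has any overlap with the target.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for item, rating in ratings[user].items():
        if item == target:
            continue
        sim = item_similarity(ratings, target, item)
        weighted_sum += sim * rating
        weight_total += abs(sim)
    return weighted_sum / weight_total if weight_total else None


# Made-up ratings: user -> {song: rating from 0 to 10}
toy_ratings = {
    "user_1": {"song_a": 9, "song_b": 8, "song_c": 2},
    "user_2": {"song_a": 8, "song_b": 9, "song_c": 1},
    "user_3": {"song_a": 9, "song_c": 2},
}
predicted = predict_rating(toy_ratings, "user_3", "song_b")
```

Plain cosine on non-negative ratings tends to make every pair of songs look similar, so a fuller implementation would probably subtract each user's mean rating first. With only a few dozen participants the rating matrix is also very sparse, which limits how reliable such predictions can be.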

2. **Deep learning or matrix factorisation to determine track genres** <br>
- One major drawback of the Spotify API is that we are unable to extract track genres, only the artist genres and album genres.
- Upon further reading online, it seems like many tracks on Spotify don't even have genres assigned to them, unlike on Apple Music.
- Let's consider the following edge case from one of my user studies:
    - A user selects the genres "pop, punk, singer-songwriter" to generate random tracks, and the song "Side to Side" by "Ariana Grande and Nicki Minaj" is shown as one of the songs.
    - While the song is largely considered a "pop" song as Ariana Grande is the main artist, the "artist genres" metadata attached to the song reads "pop, hip pop, pop, queens hip hop, rap."
    - The user selects "Side to Side" alongside two other songs to feed into the recommender.
    - The user's personalised playlist eventually included rap songs even though she had not selected "rap" as one of her preferred genres, and did not select any "rap" songs from the random list.
    - My inference is that this happened as a result of the "rap" genre (assigned to Nicki Minaj) being attached to the song "Side to Side" (each track's 'artist_genres' was one hot encoded before performing cosine similarity).
- While the above case is rare, it could happen whenever multiple different genres are assigned to the same song, especially when it comes to artist collaborations. If artists X and Y were to collaborate on the same track, a user may like the song because of artist X and not necessarily artist Y. However, the current recommender would still recommend songs similar to those by artist Y.
- A possible workaround is to experiment with deep learning and/or matrix factorisation.
    - Deep learning:
        - Neural networks and deep learning techniques, such as autoencoders, recurrent neural networks (RNNs), and convolutional neural networks (CNNs), can be used to learn complex patterns and representations from user data and song features to determine the genre(s) of each song and generate accurate, personalised recommendations.
    - Matrix factorisation:
        - This technique is often used in collaborative filtering systems. It decomposes the user-item rating matrix into lower-dimensional matrices, capturing latent factors that represent user preferences and item characteristics. These latent factors can be used to predict missing ratings (in this case, track genres) and generate recommendations.
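
The "Side to Side" edge case described above can be reproduced in miniature. The sketch below one hot encodes artist genres and compares tracks with cosine similarity. The genre set for "Side to Side" is taken from the metadata quoted above (with the duplicate "pop" collapsed), while the three comparison tracks and their genres are hypothetical, as are the helper names. It is not the notebook's `one_hot_encode_genres` implementation.

```python
import math


def one_hot(genres, vocabulary):
    """Encode a set of genres as a 0/1 vector over a fixed vocabulary."""
    return [1 if genre in genres else 0 for genre in vocabulary]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Artist genres attached to "Side to Side" (both artists combined)
side_to_side = {"pop", "hip pop", "queens hip hop", "rap"}

# Hypothetical comparison tracks
pop_track = {"pop", "dance pop"}
rap_track = {"rap", "hip hop", "queens hip hop"}
punk_track = {"punk"}

vocabulary = sorted(side_to_side | pop_track | rap_track | punk_track)
seed = one_hot(side_to_side, vocabulary)

similarities = {
    name: cosine_similarity(seed, one_hot(genres, vocabulary))
    for name, genres in [
        ("pop_track", pop_track),
        ("rap_track", rap_track),
        ("punk_track", punk_track),
    ]
}
for name, score in sorted(similarities.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

In this toy setup the rap track shares two genre tags with the seed and the pop track shares only one, so the rap track ranks higher even though the listener may think of "Side to Side" as a pop song. Track-level genres, or down-weighting the genres of featured artists, are possible ways to reduce this effect.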