# Recommender Playlists

Exploration implementing recommender system techniques using Spotify playlist data:

- Popularity Recommender: Recommend popular songs regardless of user's preferences
- Content-based Recommender: Use song attributes (e.g. genre) to recommend similar songs
- Collaborative Recommender: Predict what songs a user might be interested in based on a collection of preference information from multiple users
- Hybrid Recommender: A hybrid approach generally outperforms a single model and can be used to overcome some of the common problems in recommender systems such as the cold start problem and the sparsity problem
- HybridPopularity Recommender: An extension of the hybrid approach which applies weighting/mixes in songs based on popularity

For the purposes of this dataset we will focus on the playlist_tracks_df dataset and treat different playlists as different users. This is a strong approach for exploring recommendation techniques but leads to a problem the best way to recommend songs for our 'users' is based on genre (as my playlists tend to be genre-based).

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import yaml

In [6]:
from nltk.corpus import stopwords
from scipy.sparse import csr_matrix, vstack
from scipy.sparse.linalg import svds
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, normalize

In [7]:
# To create a playlist and add tracks
import spotipy 
from spotipy.oauth2 import SpotifyOAuth

## Import Data

- Artist and track data was pulled using the Spotify API via the spotipy package
- Data was saved in pickle format using music_data.py and data_functions.py modules
- The data can now be quickly read by multiple workflows

In [8]:
top_artist_df = pd.read_pickle("spotify/top_artists.pkl")
followed_artists_df = pd.read_pickle("spotify/followed_artists.pkl")
top_tracks_df = pd.read_pickle("spotify/top_tracks.pkl")
saved_tracks_df = pd.read_pickle("spotify/saved_tracks.pkl")
playlist_tracks_df = pd.read_pickle("spotify/playlist_tracks.pkl")
recommendation_tracks_df = pd.read_pickle("spotify/recommendation_tracks.pkl")

In [9]:
playlist_tracks_df['popularity'] = playlist_tracks_df['popularity'] / 100  # normalise popularity feature between 0 and 1
playlist_tracks_df.head()

Unnamed: 0,id,name,popularity,type,is_local,explicit,duration_ms,disc_number,track_number,artist_id,...,speechiness,acousticness,instrumentalness,liveness,valence,tempo,uri,track_href,analysis_url,time_signature
0,6glsMWIMIxQ4BedzLqGVi4,"So Fresh, So Clean",0.73,audio_features,False,True,240027,1,4,1G9G7WwrXka3Z1r7aIDjI7,...,0.332,0.0281,0.0,0.099,0.915,166.028,spotify:track:6glsMWIMIxQ4BedzLqGVi4,https://api.spotify.com/v1/tracks/6glsMWIMIxQ4...,https://api.spotify.com/v1/audio-analysis/6gls...,3
1,0VaeksJaXy5R1nvcTMh3Xk,"Darling, I (feat. Teezo Touchdown)",0.87,audio_features,False,True,253834,1,4,4V8LLVI7PbaPR0K2TGSxFF,...,0.102,0.36,0.0,0.434,0.361,97.571,spotify:track:0VaeksJaXy5R1nvcTMh3Xk,https://api.spotify.com/v1/tracks/0VaeksJaXy5R...,https://api.spotify.com/v1/audio-analysis/0Vae...,4
2,3ZaEs1O8BG581qYPHpQ8d6,I Smoked Away My Brain (I'm God x Demons Mashu...,0.83,audio_features,False,True,190286,1,1,13ubrt8QOOCPljQ2FL1Kca,...,0.0561,0.0831,4.1e-05,0.175,0.104,141.981,spotify:track:3ZaEs1O8BG581qYPHpQ8d6,https://api.spotify.com/v1/tracks/3ZaEs1O8BG58...,https://api.spotify.com/v1/audio-analysis/3ZaE...,4
3,2JqkpMe2eJToJNHEqkJeCu,Forever Yours,0.78,audio_features,False,True,96369,1,3,3tlXnStJ1fFhdScmQeLpuG,...,0.0312,0.00426,0.000279,0.652,0.789,137.007,spotify:track:2JqkpMe2eJToJNHEqkJeCu,https://api.spotify.com/v1/tracks/2JqkpMe2eJTo...,https://api.spotify.com/v1/audio-analysis/2Jqk...,4
4,6FBzhcfgGacfXF3AmtfEaX,C U Girl,0.79,audio_features,False,False,129698,1,1,57vWImR43h4CaDao012Ofp,...,0.116,0.663,0.0523,0.128,0.409,100.0,spotify:track:6FBzhcfgGacfXF3AmtfEaX,https://api.spotify.com/v1/tracks/6FBzhcfgGacf...,https://api.spotify.com/v1/audio-analysis/6FBz...,4


In [10]:
with open("spotify/playlists.yml", 'r') as stream:
    playlist_ids = yaml.safe_load(stream)

## Evaluation Metric

Here we use the Top-N accuracy metric, which evaluates the accuracy of the top recommendations provided to a user by comparing to the items the user has actually interacted in test set. This evaluation method works as follows:

- For each user
    - For each item the user has interacted in test set
        - Sample n other items the user has never interacted with (assume these are not relevant, but the user may just have not been aware of them)
        - Ask the recommender model to produce a ranked list of recommended items, from a set composed one interacted item and the 100 non-interacted ("non-relevant") items
        - Compute the Top-N accuracy metrics for this user and interacted item from the recommendations ranked list (is the item along the Top-N ranked items)
- Aggregate the global Top-N accuracy metrics

In [54]:
class ModelEvaluator:
    
    def __init__(self, tracks):
        self.tracks = tracks
    
    def evaluate_model_for_playlist(self, model, playlist_id, n=100, seed=42):
        # Get the interacted and non-interacted tracks
        tracks_interacted, tracks_not_interacted = get_interacted_tracks(self.tracks, playlist_id)
        train, test = train_test_split(tracks_interacted, test_size=0.2, random_state=seed)
        
        # Get recommendations
        ranked_recommendations_df = model.recommend_tracks(playlist_id)
        
        hits_at_5_count, hits_at_10_count = 0, 0
        
        for _, row in test.iterrows():
            # Ensure n does not exceed the size of tracks_not_interacted
            sample_size = min(n, len(tracks_not_interacted))
            if sample_size == 0:  # Handle edge case when there are no non-interacted tracks
                non_interacted_sample = pd.DataFrame(columns=['id'])
            else:
                non_interacted_sample = tracks_not_interacted.sample(sample_size, random_state=seed)
            
            # Prepare evaluation set
            evaluation_ids = [row['id']] + non_interacted_sample['id'].tolist()
            evaluation_recommendations_df = ranked_recommendations_df[
                ranked_recommendations_df['id'].isin(evaluation_ids)
            ]
            
            # Check for hits in the top 5 and top 10 recommendations
            hits_at_5_count += 1 if row['id'] in evaluation_recommendations_df['id'][:5].tolist() else 0
            hits_at_10_count += 1 if row['id'] in evaluation_recommendations_df['id'][:10].tolist() else 0
        
        # Calculate metrics
        playlist_metrics = {
            'n': n,
            'evaluation_count': len(test),
            'hits@5': hits_at_5_count,
            'hits@10': hits_at_10_count,
            'recall@5': hits_at_5_count / len(test) if len(test) > 0 else 0,
            'recall@10': hits_at_10_count / len(test) if len(test) > 0 else 0,
        }
        
        return playlist_metrics

    def evaluate_model(self, model, n=100, seed=42):
        playlists = []
        for playlist_id in self.tracks['playlist_id'].unique():
            playlist_metrics = self.evaluate_model_for_playlist(model, playlist_id, n=n, seed=seed)  
            playlist_metrics['playlist_id'] = playlist_id
            playlists.append(playlist_metrics)

        detailed_playlists_metrics = pd.DataFrame(playlists).sort_values('evaluation_count', ascending=False)
        
        global_recall_at_5 = detailed_playlists_metrics['hits@5'].sum() / detailed_playlists_metrics['evaluation_count'].sum()
        global_recall_at_10 = detailed_playlists_metrics['hits@10'].sum() / detailed_playlists_metrics['evaluation_count'].sum()
        
        global_metrics = {'model_name': model.model_name,
                          'recall@5': global_recall_at_5,
                          'recall@10': global_recall_at_10,
                         }  
                            
        return global_metrics, detailed_playlists_metrics
    
model_evaluator = ModelEvaluator(playlist_tracks_df)

### Interacted tracks

Now to evaluate a model for a playlist (and overall), we need to get both iteracted and non-interacted tracks for a playlist.

In [None]:
def get_interacted_tracks(tracks, playlist_id, drop_duplicates=True):
    interacted_track_ids = set(tracks[tracks['playlist_id'] == playlist_id]['id'])
    tracks_interacted = tracks[tracks['id'].isin(interacted_track_ids)]
    tracks_not_interacted = tracks[~tracks['id'].isin(interacted_track_ids)]

    if drop_duplicates is True:
        tracks_interacted = tracks_interacted.drop_duplicates(subset='id', keep="first").reset_index()
        tracks_not_interacted = tracks_not_interacted.drop_duplicates(subset='id', keep="first").reset_index()

    return tracks_interacted, tracks_not_interacted

In [13]:
interacted_tracks, non_interacted_tracks = get_interacted_tracks(playlist_tracks_df, playlist_ids['Hello'])

## Popularity Recommender

A popularity based recommender recommends songs in order of overall popularity, regardless of what the user has listened to. Spotify's 'audio features' API call automatically comes with a 'popularity' feature. Although it is 0 in ~10% of cases (higher than expectrf - these are probably default null values), this is perfect for creating a Popularity recommender.

As song popularity generally accounts for the "wisdom of the crowds", it usually provides good recommendations overall. However this isn't tailored to the user in particular, as a good recommender system should be.

In [14]:
class PopularityRecommender:
    
    def __init__(self, tracks):
        self.tracks = tracks
        self.model_name = 'Popularity Recommender'
    
    def recommend_tracks(self, playlist_id, ignore_ids=[]):
        recommendations_df = self.tracks[~self.tracks['id'].isin(ignore_ids)] \
                                .drop_duplicates(subset='id', keep="first").reset_index() \
                                .sort_values('popularity', ascending=False)

        return recommendations_df
    
popularity_model = PopularityRecommender(playlist_tracks_df)

In [16]:
# You can see this is essentially sorted by popularity
popularity_model_recommendations = popularity_model.recommend_tracks(playlist_ids['Hello'], interacted_tracks['id'].tolist())
popularity_model_recommendations[['id', 'name', 'artist_name', 'album_name', 'popularity']].head()

Unnamed: 0,id,name,artist_name,album_name,popularity
12,1Es7AUAhQvapIcoh3qMKDL,Timeless (with Playboi Carti),The Weeknd,Timeless,0.91
5,7ne4VBA60CxGM75vw0EYad,That’s So True,Gracie Abrams,The Secret of Us (Deluxe),0.9
16,42VsgItocQwOQC3XWZ8JNA,FE!N (feat. Playboi Carti),Travis Scott,UTOPIA,0.89
13,7CyPwkp0oE8Ro9Dd5CUDjW,"One Of The Girls (with JENNIE, Lily Rose Depp)",The Weeknd,The Idol Episode 4 (Music from the HBO Origina...,0.89
6,51rfRCiUSvxXlCSCfIztBy,"I Love You, I'm Sorry",Gracie Abrams,The Secret of Us,0.89


In [52]:
popularity_model_metrics, popularity_model_details = model_evaluator.evaluate_model(popularity_model)

print(popularity_model_metrics)
popularity_model_details[[x for x in popularity_model_details.columns if x != 'playlist_id']] \
    .sort_values('recall@5', ascending=False) \
    .head()

ValueError: Cannot take a larger sample than population when 'replace=False'

As one would expect, solely recommending by popularity is a poor way to recommend tracks. As we will see later it is however a good method to mix in for variety and to avoid the cold-start problem.

## Content-based Recommender

A content-based recommender leverages attributes from items the user has interacted with to recommend similar items. As it depends only on the past this method avoids the cold-start problem for implementation.

For text items we can use a popular information retrieval method used in search engines named TF-IDF. This technique converts unstructured text into a vector structure, where each word is represented by a position in the vector, and the value measures how relevant a given word is for an article. We can then compute the cosine similarity between the user vector and the initial sparse matrix (all users).

### TF-IDF

First we need to apply the TF-IDF technique, and use it to build playlist profiles

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

def get_tfidf(tracks, ngram_range=(1, 2), min_df=0.003, max_df=0.5, max_features=5000):
    # Transform list cols to string, we use bigrams later so no need to remove spaces
    tracks['genres_str'] = tracks['genres'].apply(lambda x: ' '.join(x))

    # Initialize TfidfVectorizer
    vectorizer = TfidfVectorizer(analyzer='word',
                                 ngram_range=ngram_range,
                                 min_df=min_df,
                                 max_df=max_df,
                                 max_features=max_features,
                                 stop_words=stopwords.words('english'))

    # Create TF-IDF matrix
    tfidf_matrix = vectorizer.fit_transform(tracks['name'] + ' ' +
                                            tracks['artist_name'] + ' ' +
                                            tracks['album_name'] + ' ' +
                                            tracks['playlist_name'] + ' ' +
                                            tracks['genres_str']
                                           )

    # Use get_feature_names_out instead of get_feature_names
    tfidf_feature_names = vectorizer.get_feature_names_out()

    return tfidf_matrix, tfidf_feature_names


In [31]:
import nltk
nltk.download('stopwords')
tfidf_matrix, tfidf_feature_names = get_tfidf(playlist_tracks_df)
tfidf_matrix

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dmhsc\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 618 stored elements and shape (35, 370)>

In [43]:
import numpy as np
from sklearn.preprocessing import normalize

def get_track_profile(tracks, track_id, tfidf_matrix):
    idx = tracks['id'].tolist().index(track_id)
    track_profile = tfidf_matrix[idx:idx+1]
    return track_profile

def get_track_profiles(tracks, track_ids, tfidf_matrix):
    track_profiles_list = [get_track_profile(tracks, x, tfidf_matrix) for x in track_ids]
    track_profiles = vstack(track_profiles_list)
    return track_profiles

def build_playlists_profile(tracks, playlist_id, interactions_indexed_df, tfidf_matrix):
    # There isn't any weighting we want to do in this case, 
    # but a common approach is weighting by interaction strength (liking, commenting, etc.)
    interaction_tracks_df = interactions_indexed_df.loc[playlist_id]  # duplicate interacted tracks, filter to current playlist
    playlist_track_profiles = get_track_profiles(tracks, interaction_tracks_df['id'], tfidf_matrix)
    # Weighted average for event strengths
#     playlist_track_strengths = np.array(interaction_tracks_df['event_strength']).reshape(-1,1)
#     playlist_track_profiles_array = np.sum(playlist_track_profiles.multiply(playlist_track_strengths), axis=0) / np.sum(playlist_track_strengths)
    playlist_track_profiles_array = np.asarray(np.sum(playlist_track_profiles, axis=0))  # Flattens no_tracksx948 matrix to 1x948 array
    playlist_track_profiles_norm = normalize(playlist_track_profiles_array)
    return playlist_track_profiles_norm

def build_playlists_profiles(tracks, tfidf_matrix): 
    playlist_profiles = {}
    for playlist_id in tracks['playlist_id'].unique():
        interacted_tracks, non_interacted_tracks = get_interacted_tracks(tracks, playlist_id, drop_duplicates=False)
        playlist_profiles[playlist_id] = build_playlists_profile(tracks, playlist_id, interacted_tracks.set_index('playlist_id'), tfidf_matrix)
    return playlist_profiles

In [44]:
playlist_profiles = build_playlists_profiles(playlist_tracks_df, tfidf_matrix)
len(playlist_profiles), len(playlist_tracks_df['playlist_id'].unique())  # all playlists accounted for

(3, 3)

In [45]:
# Get the keywords for my "Chill" playlist
chill_profile = playlist_profiles[playlist_ids['Hello']]
print(chill_profile.shape)  # 95 songs of vector length 948
pd.DataFrame(sorted(zip(tfidf_feature_names, chill_profile.flatten().tolist()), key=lambda x: -x[1])[:10],  # sort by value desc
             columns=['token', 'relevance'])

(1, 370)


Unnamed: 0,token,relevance
0,hip,0.306571
1,hip hop,0.306571
2,hop,0.306571
3,hello,0.305489
4,rap,0.254931
5,hello rap,0.17375
6,hop rap,0.171406
7,chromakopia,0.125325
8,chromakopia hello,0.125325
9,creator,0.125325


### Apply Content-based Recommender

Now with our playlist profiles setup, we can apply a content-based recommender.

In [46]:
class ContentRecommender:
    
    def __init__(self, tracks, tfidf_matrix, playlist_profiles):
        self.tracks = tracks
        self.tfidf_matrix = tfidf_matrix
        self.playlist_profiles = playlist_profiles
        self.model_name = 'Content-based Recommender'

    def _get_similar_tracks(self, playlist_id):
        #Computes the cosine similarity between the playlist profile and all profiles
        cosine_similarities = cosine_similarity(self.playlist_profiles[playlist_id], self.tfidf_matrix)
        similar_indices = cosine_similarities.argsort().flatten()
        #Sort the similar tracks by similarity
        similar_tracks = sorted([(self.tracks['id'].tolist()[i], cosine_similarities[0,i]) for i in similar_indices], key=lambda x: -x[1])
        return similar_tracks
        
    def recommend_tracks(self, playlist_id, ignore_ids=[]):
        similar_tracks = self._get_similar_tracks(playlist_id)
        similar_tracks_non_interacted = list(filter(lambda x: x[0] not in ignore_ids, similar_tracks))
        recommendations_df = pd.DataFrame(similar_tracks_non_interacted, columns=['id', 'recStrength']) \
                                .drop_duplicates(subset='id', keep="first").reset_index() \
                                .sort_values('recStrength', ascending=False)

        return recommendations_df
    
content_model = ContentRecommender(playlist_tracks_df, tfidf_matrix, playlist_profiles)

In [48]:
content_model_recommendations = content_model.recommend_tracks(playlist_ids['Hello'], interacted_tracks['id'].tolist())
# Get track details from original track data
content_model_recommendations_name = pd.merge(content_model_recommendations, playlist_tracks_df.drop_duplicates(subset='id', keep="first"), how='left', on='id')
content_model_recommendations_name[['id', 'name', 'artist_name', 'album_name', 'recStrength']].head()

Unnamed: 0,id,name,artist_name,album_name,recStrength
0,28drn6tQo95MRvO0jQEo5C,Type Shit,Future,WE DON'T TRUST YOU,0.454809
1,42VsgItocQwOQC3XWZ8JNA,FE!N (feat. Playboi Carti),Travis Scott,UTOPIA,0.044265
2,3xby7fOyqmeON8jsnom0AT,Nightcrawler (feat. Swae Lee & Chief Keef),Travis Scott,Rodeo,0.038634
3,6NMtzpDQBTOfJwMzgMX0zl,SKELETONS,Travis Scott,ASTROWORLD,0.035325
4,6gBFPUFcJLzWGx4lenP6h2,goosebumps,Travis Scott,Birds In The Trap Sing McKnight,0.034553


In [55]:
content_model_metrics, content_model_details = model_evaluator.evaluate_model(content_model)

print(content_model_metrics)
content_model_details[[x for x in content_model_details.columns if x != 'playlist_id']] \
    .sort_values('recall@5', ascending=False) \
    .head()

{'model_name': 'Content-based Recommender', 'recall@5': np.float64(1.0), 'recall@10': np.float64(1.0)}


Unnamed: 0,n,evaluation_count,hits@5,hits@10,recall@5,recall@10
0,100,3,3,3,1.0,1.0
1,100,3,3,3,1.0,1.0
2,100,3,3,3,1.0,1.0


From the token relevance and exceptionally high recall it like the genre of a song is a very powerful recommender for me personally. It is likely that the Spotify breakdown of genres e.g. modern pop and chill pop helps a lot with this, alongside the bias of the dataset being playlist based (which tend to be genre/mood based).

## Collaborative Recommender

There are two main implementation strategies for a collaborative recommender:

- **Memory-based:** Computes user (user-based) or item (item-based) similarities based on past user interactions with items
- **Model-based:** Like in the accompanying recommender_playlists.ipynb file, models are developed using different ML algorithms such as neural networks, bayesian networks, and clustering models, as well as latent factor models such as Singular Value Decomposition (SVD) and, probabilistic latent semantic analysis

### Matrix Factorisation

Latent factor models compress a user-item matrix into a low-dimensional representation in terms of latent factors. This has several advantages including fewer missing values, better scalability during similarity comparison, and that we will be dealing with a much smaller matrix in lower-dimensional space. Note that model-based approaches handle the sparsity of the original matrix better than memory-based ones. An important number is the number of selected factors to factor the user-item matrix. The higher the number of factors, the more precise is the factorisation, and the more details of the original matrix which are memorized. Reducing the number of factors increases the model generalisation whilst too many factors can lead to overfitting (high variance).

Here we a use popular latent factor model named Singular Value Decomposition (SVD) which is available in SciPy. Other choices might have been surprise, mrec, or python-recsys.

In [56]:
#Creating a sparse pivot table with users in rows and items in columns
playlist_tracks_df['event_strength'] = 1  ## create dummy column for pivot value
playlist_tracks_matrix_df = playlist_tracks_df.pivot_table(index='playlist_id',
                                                           columns='id',
                                                           values='event_strength',
                                                           aggfunc='sum',
                                                          ).fillna(0)

playlist_tracks_matrix_df.values

array([[1., 1., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 1., 0., 1., 1.,
        1., 1., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1.,
        0., 0., 0.],
       [0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.,
        0., 0., 0., 0., 1., 1., 1., 1., 1., 0., 0., 0., 0., 1., 0., 0.,
        0., 1., 0.],
       [0., 0., 0., 1., 0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0.,
        1., 0., 1.]])

In [57]:
playlist_tracks_matrix = playlist_tracks_matrix_df.values
playlist_tracks_sparse = csr_matrix(playlist_tracks_matrix)
playlist_tracks_sparse

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 35 stored elements and shape (3, 35)>

In [59]:
# Performs matrix factorisation of the original user item matrix
u, s, vt = svds(playlist_tracks_sparse, k = 2)  # k is number of factors
s = np.diag(s)
print(u.shape, vt.shape, s.shape)

(3, 2) (2, 35) (2, 2)


In [60]:
playlist_predicted_ratings = np.dot(np.dot(u, s), vt) 
playlist_predicted_ratings
# MinMaxScaler doesn't work for global min(), max()
playlist_predicted_ratings_norm = (playlist_predicted_ratings - playlist_predicted_ratings.min()
                                  ) / (playlist_predicted_ratings.max() - playlist_predicted_ratings.min())
playlist_predicted_ratings_norm

array([[1.00000000e+00, 1.00000000e+00, 2.12578117e-16, 1.92971121e-16,
        2.12578117e-16, 1.92971121e-16, 1.92971121e-16, 1.92971121e-16,
        1.92971121e-16, 1.00000000e+00, 1.00000000e+00, 2.12578117e-16,
        1.00000000e+00, 2.12578117e-16, 1.00000000e+00, 1.00000000e+00,
        1.00000000e+00, 1.00000000e+00, 1.92971121e-16, 1.92971121e-16,
        2.12578117e-16, 2.12578117e-16, 2.12578117e-16, 2.12578117e-16,
        2.12578117e-16, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
        1.92971121e-16, 2.12578117e-16, 1.92971121e-16, 1.00000000e+00,
        1.98074819e-16, 2.08145428e-16, 1.98074819e-16],
       [1.56707589e-16, 3.06658462e-16, 8.49779301e-02, 2.78848851e-01,
        8.49779301e-02, 2.78848851e-01, 2.78848851e-01, 2.78848851e-01,
        2.78848851e-01, 3.06658462e-16, 3.06658462e-16, 8.49779301e-02,
        3.06658462e-16, 8.49779301e-02, 3.06658462e-16, 3.06658462e-16,
        3.06658462e-16, 3.06658462e-16, 2.78848851e-01, 2.78848851e-01,
       

In [61]:
#Converting the reconstructed matrix back to a Pandas dataframe
matrix_preds_df = pd.DataFrame(playlist_predicted_ratings_norm, columns = playlist_tracks_matrix_df.columns, 
                               index=playlist_tracks_df['playlist_id'].unique()).transpose()
matrix_preds_df.shape

(35, 3)

### Apply Collaborative Recommender

Now that we have completed our matrix factorisation, we can apply a collaborative recommender.

In [62]:
class CollaborativeRecommender:
    
    def __init__(self, tracks, matrix_preds_df):
        self.tracks = tracks
        self.matrix_preds_df = matrix_preds_df
        self.model_name = 'Collaborative Recommender'
    
    def recommend_tracks(self, playlist_id, ignore_ids=[]):
        sorted_playlist_predictions = self.matrix_preds_df[playlist_id].sort_values(ascending=False) \
                                        .reset_index().rename(columns={playlist_id: 'recStrength'})        
        recommendations_df = sorted_playlist_predictions[~sorted_playlist_predictions['id'].isin(ignore_ids)] \
                                .drop_duplicates(subset='id', keep="first").reset_index() \
                                .sort_values('recStrength', ascending = False)
        
        return recommendations_df

collaborative_model = CollaborativeRecommender(playlist_tracks_df, matrix_preds_df)

In [64]:
collaborative_model_recommendations = collaborative_model.recommend_tracks(playlist_ids['Hello'], interacted_tracks['id'].tolist())
# Get track details from original track data
collaborative_model_recommendations_name = pd.merge(collaborative_model_recommendations, playlist_tracks_df.drop_duplicates(subset='id', keep="first"), how='left', on='id')
collaborative_model_recommendations_name[['id', 'name', 'artist_name', 'album_name', 'recStrength']].head()

Unnamed: 0,id,name,artist_name,album_name,recStrength
0,0hhzNPE68LWLfgZwdpxVdR,us. (feat. Taylor Swift),Gracie Abrams,The Secret of Us,2.125781e-16
1,5N3hjp1WNayUPZrA8kJmJP,Please Please Please,Sabrina Carpenter,Please Please Please,2.125781e-16
2,2bl81llf715VEEbAx03yvB,Close To You,Gracie Abrams,The Secret of Us,2.125781e-16
3,2SPbioo65CuUB3H0aW1ID5,Bored,Laufey,Bewitched: The Goddess Edition,2.125781e-16
4,4KGGeE7RJsgLNZmnxGFlOj,Falling Behind,Laufey,Everything I Know About Love,2.125781e-16


In [65]:
collaborative_model_metrics, collaborative_model_details = model_evaluator.evaluate_model(collaborative_model)

print(collaborative_model_metrics)
collaborative_model_details[[x for x in collaborative_model_details.columns if x != 'playlist_id']] \
    .sort_values('recall@5', ascending=False) \
    .head()

{'model_name': 'Collaborative Recommender', 'recall@5': np.float64(0.6666666666666666), 'recall@10': np.float64(0.6666666666666666)}


Unnamed: 0,n,evaluation_count,hits@5,hits@10,recall@5,recall@10
0,100,3,3,3,1.0,1.0
2,100,3,3,3,1.0,1.0
1,100,3,0,0,0.0,0.0


The collaborative recommender performs better than the popularity approach but is not as good as the content-based approach as genre appears to be the most powerful feature.

## Hybrid Recommender

Combining the content-based and collaborative approaches into a hybrid method have been shown to perform better in many studies. Here we build an ensemble which takes the weighted average of the normalized recommendation scores. As the content-based approach has a higher recall we will apply a weighting.

In [66]:
class HybridRecommender:

    def __init__(self, tracks, content_model, collaborative_model, content_weight=2, collaborative_weight=1):
        self.tracks = tracks
        self.model_name = 'Hybrid Recommender'
        self.content_model = content_model
        self.collaborative_model = collaborative_model
        # Relative weights
        self.content_weight = content_weight
        self.collaborative_weight = collaborative_weight
        
    def recommend_tracks(self, playlist_id, ignore_ids=[]): 
        
        content_recs_df = self.content_model.recommend_tracks(
            playlist_id, ignore_ids).rename(columns={'recStrength': 'recStrengthContent'})
        collaborative_recs_df = self.collaborative_model.recommend_tracks(
            playlist_id, ignore_ids).rename(columns={'recStrength': 'recStrengthCollaborative'})
        combined_recs_df = content_recs_df.merge(collaborative_recs_df,
                                                 how = 'outer', 
                                                 on = 'id',
                                                ).fillna(0)
        # Compute hybrid score based on weights
        combined_recs_df['recStrengthHybrid'] = (combined_recs_df['recStrengthContent'] * self.content_weight) \
                                                + (combined_recs_df['recStrengthCollaborative'] * self.collaborative_weight)
        recommendations_df = combined_recs_df \
                                .drop_duplicates(subset='id', keep="first").reset_index() \
                                .sort_values('recStrengthHybrid', ascending=False)

        return recommendations_df
    
hybrid_model = HybridRecommender(playlist_tracks_df, content_model, collaborative_model)

In [69]:
hybrid_model_recommendations = hybrid_model.recommend_tracks(playlist_ids['Hello'], interacted_tracks['id'].tolist())
# Get track details from original track data
hybrid_model_recommendations_name = pd.merge(hybrid_model_recommendations, playlist_tracks_df.drop_duplicates(subset='id', keep="first"), how='left', on='id')
hybrid_model_recommendations_name[['id', 'name', 'artist_name', 'album_name', 'recStrengthHybrid']].head()

Unnamed: 0,id,name,artist_name,album_name,recStrengthHybrid
0,28drn6tQo95MRvO0jQEo5C,Type Shit,Future,WE DON'T TRUST YOU,0.909618
1,42VsgItocQwOQC3XWZ8JNA,FE!N (feat. Playboi Carti),Travis Scott,UTOPIA,0.088529
2,3xby7fOyqmeON8jsnom0AT,Nightcrawler (feat. Swae Lee & Chief Keef),Travis Scott,Rodeo,0.077268
3,6NMtzpDQBTOfJwMzgMX0zl,SKELETONS,Travis Scott,ASTROWORLD,0.07065
4,6gBFPUFcJLzWGx4lenP6h2,goosebumps,Travis Scott,Birds In The Trap Sing McKnight,0.069106


In [70]:
hybrid_model_metrics, hybrid_model_details = model_evaluator.evaluate_model(hybrid_model)

print(hybrid_model_metrics)
hybrid_model_details[[x for x in hybrid_model_details.columns if x != 'playlist_id']] \
    .sort_values('recall@5', ascending=False) \
    .head()

{'model_name': 'Hybrid Recommender', 'recall@5': np.float64(1.0), 'recall@10': np.float64(1.0)}


Unnamed: 0,n,evaluation_count,hits@5,hits@10,recall@5,recall@10
0,100,3,3,3,1.0,1.0
1,100,3,3,3,1.0,1.0
2,100,3,3,3,1.0,1.0


In this case solely applying the content-based recommender actually performs better. As discussed previously, this is because this is a playlist based dataset where each playlist tends to be genre/mood based.

### HybridPopularity Recommender

In practice we don't want to overfit on any one approach and as the content-based approach has shown a strong preference for genre for this dataset it's better to incorporate variety. Things that are popular are liked by a lot of people so we should also recommend a few things that are popular. This helps to avoid a recommendation loop where a user is consistently recommended songs from a certain genre (as similar tracks via collaborative content-based recommendations will likely be of a few set genres).

In [71]:
class HybridPopularityRecommender:

    def __init__(self, tracks, content_model, collaborative_model, popularity_model, 
                 content_weight=1, collaborative_weight=2, popularity_weight=1):
        self.tracks = tracks
        self.model_name = 'HybridPopularity Recommender'
        self.content_model = content_model
        self.collaborative_model = collaborative_model
        self.popularity_model = popularity_model
        # Relative weights
        self.content_weight = content_weight
        self.collaborative_weight = collaborative_weight
        self.popularity_weight = popularity_weight
        
    def recommend_tracks(self, playlist_id, ignore_ids=[]): 
        
        content_recs_df = self.content_model.recommend_tracks(
            playlist_id, ignore_ids).rename(columns={'recStrength': 'recStrengthContent'})
        collaborative_recs_df = self.collaborative_model.recommend_tracks(
            playlist_id, ignore_ids).rename(columns={'recStrength': 'recStrengthCollaborative'})
        popularity_recs_df = self.popularity_model.recommend_tracks(
            playlist_id, ignore_ids).rename(columns={'popularity': 'recStrengthPopularity'})
        combined_recs_df = content_recs_df.merge(collaborative_recs_df,
                                                 how = 'outer', 
                                                 on = 'id',
                                                ).merge(popularity_recs_df,
                                                        how = 'outer', 
                                                        on = 'id',
                                                       ).fillna(0)
        # Compute hybrid score based on weights
        combined_recs_df['recStrengthHybridPopularity'] = (combined_recs_df['recStrengthContent'] * self.content_weight) \
                                                            + (combined_recs_df['recStrengthCollaborative'] * self.collaborative_weight) \
                                                            + (popularity_recs_df['recStrengthPopularity'] * self.popularity_weight)
        recommendations_df = combined_recs_df \
                                .drop_duplicates(subset='id', keep="first").reset_index() \
                                .sort_values('recStrengthHybridPopularity', ascending=False)

        return recommendations_df
    
hybridpopularity_model = HybridPopularityRecommender(playlist_tracks_df, content_model, collaborative_model, popularity_model)

In [73]:
hybridpopularity_model_recommendations = hybridpopularity_model.recommend_tracks(playlist_ids['Hello'], interacted_tracks['id'].tolist())
hybridpopularity_model_recommendations[['id', 'name', 'artist_name', 'album_name', 'recStrengthHybridPopularity']].head()

Unnamed: 0,id,name,artist_name,album_name,recStrengthHybridPopularity
4,28drn6tQo95MRvO0jQEo5C,Type Shit,Future,WE DON'T TRUST YOU,1.214809
16,6NMtzpDQBTOfJwMzgMX0zl,SKELETONS,Travis Scott,ASTROWORLD,0.925325
10,42VsgItocQwOQC3XWZ8JNA,FE!N (feat. Playboi Carti),Travis Scott,UTOPIA,0.924265
12,4FF0Te5R85sLW8MNvehHKK,That’s So True,Gracie Abrams,The Secret of Us (Deluxe),0.91
5,2Ch7LmS7r2Gy2kc64wv3Bz,Die For You,The Weeknd,Starboy,0.906452


In [74]:
hybridpopularity_model_metrics, hybridpopularity_model_details = model_evaluator.evaluate_model(hybridpopularity_model)

print(hybridpopularity_model_metrics)
hybridpopularity_model_details[[x for x in hybridpopularity_model_details.columns if x != 'playlist_id']] \
    .sort_values('recall@5', ascending=False) \
    .head()

{'model_name': 'HybridPopularity Recommender', 'recall@5': np.float64(0.8888888888888888), 'recall@10': np.float64(1.0)}


Unnamed: 0,n,evaluation_count,hits@5,hits@10,recall@5,recall@10
0,100,3,3,3,1.0,1.0
2,100,3,3,3,1.0,1.0
1,100,3,2,3,0.666667,1.0


The model weighting were set ad-hoc and should be adjusted freely. Personally I believe this HybridPopularity recommender gives the recommendations so far in practice, even though it has a lower recall than the hybrid and content-based recommender. It may therefore be better to incorporate other evaluation metrics such as one that measures variety, or increase the scope of this dataset as its current playlist based approach means that pure genre-based recommendations perform the best.

## Recommendations
- Let's see what songs the HybridPopularity recommender suggests to add to the 'Chill' playlist

In [75]:
# 243 tracks where prob_ratings >= 1.15 which is a good number
tracks_to_add = hybridpopularity_model_recommendations[hybridpopularity_model_recommendations['recStrengthHybridPopularity'] >= 1.15]['id']
len(tracks_to_add)

1

In [77]:
# Spotify API
with open("spotify_details.yml", 'r') as stream:
    spotify_details = yaml.safe_load(stream)

scope = "playlist-modify-private"

sp = spotipy.Spotify(auth_manager=SpotifyOAuth(
    client_id=spotify_details['client_id'],
    client_secret=spotify_details['client_secret'],
    redirect_uri=spotify_details['redirect_uri'],
    scope=scope,
))

# Create a new playlist for tracks to add - you may also add these tracks to your source playlist and proceed
new_playlist = sp.user_playlist_create(user=spotify_details['user'], 
                                       name="spotify-recommender-systems",
                                       public=False, 
                                       collaborative=False, 
                                       description="Created by https://github.com/anthonyli358/spotify-recommender-systems",
                                      )

# Add tracks to the new playlist
for id in tracks_to_add:
    sp.user_playlist_add_tracks(user=spotify_details['user'], 
                                playlist_id=new_playlist['id'], 
                                tracks=[id],
                               );

KeyError: 'user'

This worked very well, the playlist is lit.