# 4. Training and Modelling Item Based Cosine Similarity<a id='4.A_Training_and_Modelling_Item_Based_Cosine_Similarity'></a>

## 4.1 Contents<a id='4.1_Contents'></a>
* [4. Training and Modelling Item Based Cosine Similarity](#4.A_Training_and_Modelling_Item_Based_Cosine_Similarity)
  * [4.1 Contents](#4.1_Contents)
  * [4.2 Imports](#4.2_Imports)
  * [4.3 Load The Data](#4.3_Load_The_Data)
  * [4.4 Extract The Data](#4.4_Extract_The_Data)
    * [4.4.1 Train/Test Split For Playlist](#4.4.1_Train_Test_Split_For_Playlist)
    * [4.4.2 Manipulating The Data](#4.4.2_Manipulating_The_Data)
  * [4.5 Modeling Using Cosine Similarity](#4.5_Modeling_Using_Cosine_Similarity)
    * [4.5.1 The Compressed Dot Product Algorithm](#4.5.1_The_Compressed_Dot_Product_Algorithm)
    * [4.5.2 Cosine Similarity Between 2 Songs](#4.5.2_Cosine_Similarity_Between_Two_Songs)
      * [4.5.2.1 Example](#4.5.2.1_Example)
    * [4.5.3 Recommendation Score For A Single Track](#4.5.3_Recommendation_Score_For_A_Single_Track)
    * [4.5.4 Example: Top 20 Recommended Songs For One Playlist](#4.5.4_Example_Top_20_Recommendation_Songs_For_One_Playlist)
  * [4.6 Metrics & Evaluation](#4.6_Metrics_and_Evaluation)
    * [4.6.1 R-Precision](#4.6.1_R_Precision)
    * [4.6.2 Song Clicks](#4.6.2_Song_Clicks)
    * [4.6.3 Normalized Discounted Cumulative Gain (NDCG)](#4.6.3_Normalized_Discounted_Cumulative_Gain)

## 4.2 Imports<a id='4.2_Imports'></a>

In [1]:
# Install import_ipynb so that other Jupyter notebooks can be imported as modules
# !pip install import-ipynb
import import_ipynb


import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import glob, collections
from sklearn.model_selection import train_test_split

from help_functions import *

importing Jupyter notebook from help_functions.ipynb


## 4.3 Load The Data<a id='4.3_Load_The_Data'></a>

In [2]:
# path = "spotify_million_playlist_dataset/data/"
# file_name = "mpd.slice.0-999.json"

# # Load the first single json file 
# data = json.load(open(path + file_name))

# # Read json as DataFrame
# df = pd.DataFrame(data['playlists'])

In [3]:
read_files = glob.glob("spotify_million_playlist_dataset/data/*.json")
frames = []

for f in read_files:
    with open(f, "rb") as infile:
        data = json.load(infile)
        frames.append(pd.DataFrame(data['playlists']))

# DataFrame.append was removed in pandas 2.0; pd.concat keeps each file's own index, as append did
df = pd.concat(frames)

In [4]:
df.head(3)

| | name | collaborative | pid | modified_at | num_tracks | num_albums | num_followers | tracks | num_edits | duration_ms | num_artists | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NewNew | False | 7000 | 1509321600 | 83 | 78 | 2 | [{'pos': 0, 'artist_name': 'WILDES', 'track_ur... | 49 | 18461552 | 72 | |
| 1 | chilllll | False | 7001 | 1506902400 | 18 | 15 | 1 | [{'pos': 0, 'artist_name': 'Angus & Julia Ston... | 8 | 4031475 | 10 | |
| 2 | offline | False | 7002 | 1505433600 | 63 | 48 | 1 | [{'pos': 0, 'artist_name': 'Keith Urban', 'tra... | 16 | 15021695 | 45 | |


## 4.4 Extract The Data<a id='4.4_Extract_The_Data'></a>

In [5]:
# Only include useful columns
df = df.loc[:, ['name', 'pid', 'num_tracks', 'tracks', 'num_albums', 'num_artists']]
df.head()

| | name | pid | num_tracks | tracks | num_albums | num_artists |
|---|---|---|---|---|---|---|
| 0 | NewNew | 7000 | 83 | [{'pos': 0, 'artist_name': 'WILDES', 'track_ur... | 78 | 72 |
| 1 | chilllll | 7001 | 18 | [{'pos': 0, 'artist_name': 'Angus & Julia Ston... | 15 | 10 |
| 2 | offline | 7002 | 63 | [{'pos': 0, 'artist_name': 'Keith Urban', 'tra... | 48 | 45 |
| 3 | feels | 7003 | 97 | [{'pos': 0, 'artist_name': 'Chance The Rapper'... | 51 | 27 |
| 4 | Latin Dance | 7004 | 23 | [{'pos': 0, 'artist_name': 'Merengue Latin Ban... | 20 | 11 |


### 4.4.1 Train/Test Split For Playlist<a id='4.4.1_Train_Test_Split_For_Playlist'></a>

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(df, test_size=0.2, random_state=42)

print(f"Number of playlists in train set: {len(X_train)}")
print(f"Number of playlists in test set: {len(X_test)}")

Number of playlists in train set: 16000
Number of playlists in test set: 4000


In [7]:
X_train.head()

| | name | pid | num_tracks | tracks | num_albums | num_artists |
|---|---|---|---|---|---|---|
| 894 | Shower songs | 13894 | 129 | [{'pos': 0, 'artist_name': 'Justin Timberlake'... | 115 | 93 |
| 728 | chill | 19728 | 66 | [{'pos': 0, 'artist_name': 'Nohidea', 'track_u... | 53 | 43 |
| 958 | warriors | 9958 | 23 | [{'pos': 0, 'artist_name': 'Florida State Univ... | 20 | 18 |
| 671 | June 2015 | 16671 | 13 | [{'pos': 0, 'artist_name': 'Kopecky', 'track_u... | 13 | 13 |
| 999 | Worship | 13999 | 129 | [{'pos': 0, 'artist_name': 'Hillsong United', ... | 53 | 30 |


In [8]:
X_test.head()

| | name | pid | num_tracks | tracks | num_albums | num_artists |
|---|---|---|---|---|---|---|
| 650 | Australia Day | 650 | 41 | [{'pos': 0, 'artist_name': 'Mariah Carey', 'tr... | 40 | 39 |
| 41 | 2k17 | 6041 | 27 | [{'pos': 0, 'artist_name': 'Demi Lovato', 'tra... | 27 | 27 |
| 668 | squad | 9668 | 181 | [{'pos': 0, 'artist_name': 'Migos', 'track_uri... | 113 | 75 |
| 114 | Chillin | 3114 | 157 | [{'pos': 0, 'artist_name': 'William Singe', 't... | 136 | 75 |
| 902 | feel good | 14902 | 102 | [{'pos': 0, 'artist_name': 'Ben Rector', 'trac... | 88 | 77 |


### 4.4.2 Manipulating The Data<a id='4.4.2_Manipulating_The_Data'></a>

#### Create all tracks for all playlists in the train and test sets

In [9]:
def create_tracks(X, fst_playlist_df):
    # Start from the DataFrame of the first playlist, then collect one frame per remaining playlist
    frames = [fst_playlist_df]

    for i in range(1, len(X)):
        # create df for next playlist
        row = X.iloc[i]
        playlist = pd.DataFrame(row['tracks'])
        playlist['playlist_name'] = row['name']
        playlist['playlist_pid'] = row['pid']
        frames.append(playlist)

    # DataFrame.append was removed in pandas 2.0; concatenate all subset frames once instead
    return pd.concat(frames)

#### Create the tracks dictionary with key = track_uri and value = {name, playlist_vector}

In [10]:
track_dict = {}

# X_train is a DataFrame
for i in range(len(X_train)):
    row = X_train.iloc[i]
    playlist_id = row['pid']
    
    for track in row['tracks']:
        song_id = track['track_uri']
        song_name = track['track_name']
        
        if song_id not in track_dict:
            track_dict[song_id] = {'name': song_name, 'playlist_vector': []}

        track_dict[song_id]['playlist_vector'].append(playlist_id)

In [11]:
print(f"Number of unique tracks across all playlists in the train set: {len(track_dict)}")

Number of unique tracks across all playlists in the train set: 229351
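The dictionary built above is effectively an inverted index from each track to the playlists that contain it. A minimal sketch on toy data may make the shape clearer; the playlists, URIs, and the name `toy_track_dict` are made up for illustration and are not part of the dataset.

```python
# Toy playlists in the same shape as the rows of X_train (hypothetical data)
toy_playlists = [
    {"pid": 0, "tracks": [{"track_uri": "t:a", "track_name": "A"},
                          {"track_uri": "t:b", "track_name": "B"}]},
    {"pid": 1, "tracks": [{"track_uri": "t:b", "track_name": "B"},
                          {"track_uri": "t:c", "track_name": "C"}]},
]

toy_track_dict = {}
for playlist in toy_playlists:
    for track in playlist["tracks"]:
        # Create the entry on first sight, then record the playlist id
        entry = toy_track_dict.setdefault(
            track["track_uri"],
            {"name": track["track_name"], "playlist_vector": []},
        )
        entry["playlist_vector"].append(playlist["pid"])

for uri, entry in toy_track_dict.items():
    print(uri, entry["name"], entry["playlist_vector"])
```

Track `t:b` appears in both toy playlists, so its playlist vector holds both pids, while the other two tracks hold one pid each.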


## 4.5 Modeling Using Cosine Similarity<a id='4.5_Modeling_Using_Cosine_Similarity'></a>

### 4.5.1 The Compressed Dot Product Algorithm <a id='4.5.1_The_Compressed_Dot_Product_Algorithm'></a>

In [12]:
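# The compressed dot product algorithm only works for BINARY vectors stored in compressed form,
# i.e. as the ascending list of the indices whose value is 1.
# Ex: compressed vector = [0, 3, 5, 6, 7] <=> Sparse vector = [1, 0, 0, 1, 0, 1, 1, 1]
# or compressed([1, 0, 0, 1, 0, 1, 1, 1]) = [0, 3, 5, 6, 7]


def compressed_dot_product(v1, v2):
    '''
    Calculate the dot product of two binary sparse vectors (with the same dimensionality).
    @param: 
        v1, v2: compressed vectors, i.e. lists of the pids of the playlists that include each track.
                Both lists must be sorted in ascending order without duplicates,
                otherwise the two-pointer scan below can miss matches.
        
    @return: the dot product of the 2 compressed vectors (the number of shared playlists)
    '''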
    matched = 0
    n, m = len(v1), len(v2)
    #print(f"N = {n}, M = {m}")
    i , j = 0, 0
    
    while i < n and j < m: # create two pointers
        
        if v1[i] == v2[j]:
            matched += 1
            i += 1
            j += 1
        elif v1[i] < v2[j]:
            i += 1
        else:
            j += 1
        
        #print(f"matched = {matched}")
    
    return matched

v1 = [0,1,3,5, 10, 100, 999]
v2 = [0,5, 13, 207]
print(compressed_dot_product(v1, v2))
    

2
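The two-pointer scan relies on both lists being sorted in ascending order. Because `train_test_split` shuffles rows by default, playlist vectors built by iterating over `X_train` are not guaranteed to be sorted. The sketch below illustrates the effect on toy lists; `two_pointer_matches` mirrors the scan above, while `normalize_vector` and `set_matches` are hypothetical helpers introduced only for this illustration.

```python
def two_pointer_matches(v1, v2):
    """Same scan as compressed_dot_product; valid only for sorted, duplicate-free lists."""
    matched, i, j = 0, 0, 0
    while i < len(v1) and j < len(v2):
        if v1[i] == v2[j]:
            matched += 1
            i += 1
            j += 1
        elif v1[i] < v2[j]:
            i += 1
        else:
            j += 1
    return matched


def normalize_vector(playlist_ids):
    """Hypothetical helper: sorted list of unique playlist ids."""
    return sorted(set(playlist_ids))


def set_matches(v1, v2):
    """Order-independent alternative: size of the set intersection."""
    return len(set(v1) & set(v2))


sorted_a = [0, 1, 3, 5, 10, 100, 999]
shuffled_a = [999, 5, 0, 3, 100, 1, 10]   # same ids, different order
sorted_b = [0, 5, 13, 207]

print(two_pointer_matches(sorted_a, sorted_b))                      # 2
print(two_pointer_matches(shuffled_a, sorted_b))                    # 0, matches are missed
print(two_pointer_matches(normalize_vector(shuffled_a), sorted_b))  # 2
print(set_matches(shuffled_a, sorted_b))                            # 2
```

One option is to normalize every `playlist_vector` once after `track_dict` is built; another is to use the set-based count, which trades the linear scan for hashing.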


### 4.5.2 Cosine Similarity Between 2 Songs<a id='4.5.2_Cosine_Similarity_Between_Two_Songs'></a>

In [13]:
from sklearn.metrics.pairwise import cosine_similarity
from math import sqrt


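# v1, v2 are compressed lists that represent binary sparse vectors,
# so len(compressed(x)) = ||x||^2

def compressed_cos_similarity(v1, v2):
    # An empty vector has zero norm; define its similarity as 0 instead of dividing by zero
    if not v1 or not v2:
        return 0.0
    return compressed_dot_product(v1, v2) / sqrt(len(v1) * len(v2))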

compressed_cos_similarity([0,2,4], [0,1,3])

0.3333333333333333
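As a sanity check, the compressed formula can be compared with the textbook cosine similarity on the expanded (dense) binary vectors. This is only a sketch: `to_dense`, `dense_cosine`, and `compressed_cosine` are illustrative names, and the compressed version here counts matches with a set intersection rather than the two-pointer scan.

```python
from math import sqrt


def to_dense(compressed, size):
    """Expand a compressed binary vector into a dense 0/1 list of the given size."""
    dense = [0] * size
    for idx in compressed:
        dense[idx] = 1
    return dense


def dense_cosine(a, b):
    """Textbook cosine similarity: dot(a, b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def compressed_cosine(v1, v2):
    """For binary vectors, ||x||^2 equals the number of ones, i.e. len(compressed(x))."""
    return len(set(v1) & set(v2)) / sqrt(len(v1) * len(v2))


v1, v2 = [0, 2, 4], [0, 1, 3]
dense_score = dense_cosine(to_dense(v1, 5), to_dense(v2, 5))
compressed_score = compressed_cosine(v1, v2)
print(dense_score, compressed_score)
```

Both values agree up to floating-point rounding: the vectors share one index and each has three ones, so the score is 1 / 3.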

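Unpopular songs tend to get higher cosine similarity scores: the denominator `sqrt(len(v1) * len(v2))` is small for tracks that appear in few playlists, so even a single shared playlist produces a high score. For example, two tracks that each appear only in the same single playlist have similarity 1.0.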

("Using Cosine similarity or Pearson correlation helps to mitigate the bias towards popular items, but can also end up recommending very unpopular, niche items." https://medium.com/bag-of-words/what-similarity-metric-should-you-use-for-your-recommendation-system-b45eb7e6ebd0)

#### 4.5.2.1 Example<a id='4.5.2.1_Example'></a>
Comparing the songs "Crazy In Love" and "Toxic".

In [14]:
# Create two vectors of the two songs
crazy_in_love = track_dict['spotify:track:0WqIKmW4BTrj3eJFmnCKMv']['playlist_vector']
toxic = track_dict['spotify:track:6I9VzXrHxO9rA9A5euc8Ak']['playlist_vector']

similarity = compressed_cos_similarity(crazy_in_love, toxic)
print(f"The cosine similarity of 'crazy in love' and 'toxic' tracks is: {similarity}")

The cosine similarity of 'crazy in love' and 'toxic' tracks is: 0.012416852662165207


For these binary vectors, cosine similarity is a real number in the range [0, 1] that indicates how similar two songs are. 0.0 means the two songs share no playlist; 1.0 means they appear in exactly the same set of playlists.
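The range of the score, and the tendency of niche tracks to score highly, can be seen on toy vectors. The playlist ids below are invented for illustration, and `cos_sim` is a small stand-in for `compressed_cos_similarity` that counts shared ids with a set intersection.

```python
from math import sqrt


def cos_sim(v1, v2):
    """Cosine similarity of two compressed binary vectors (stand-in helper)."""
    return len(set(v1) & set(v2)) / sqrt(len(v1) * len(v2))


# Two niche tracks that appear only in the same single playlist
niche_a, niche_b = [7], [7]

# Two popular tracks, each in 100 playlists, 50 of them shared
popular_a = list(range(0, 100))
popular_b = list(range(50, 150))

# Two tracks with no playlist in common
disjoint_a, disjoint_b = [1, 2], [3, 4]

print(cos_sim(niche_a, niche_b))        # 1.0
print(cos_sim(popular_a, popular_b))    # 0.5
print(cos_sim(disjoint_a, disjoint_b))  # 0.0
```

The niche pair reaches the maximum score from one shared playlist, while the popular pair shares 50 playlists and still scores only 0.5.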

### 4.5.3 Recommendation Score For A Single Track<a id='4.5.3_Recommendation_Score_For_A_Single_Track'></a>

In [15]:
def get_cosines_for_single_track(track_id):
    '''
    Compute the cosine score between one track and every track that shares at least one train playlist with it.
    
    @param: track_id: a string that contains one SINGLE track_id (ex: "spotify:track:6I9VzXrHxO9rA9A5euc8Ak")
    @return: dictionary with keys = track_ids, values = cosine scores (empty if the track is not in the train set)
    '''
    
    if track_id not in track_dict:
        return {}
    
    # track_dict only holds playlist vectors for the train data
    track_vector = track_dict[track_id]['playlist_vector']
    
    track_score_dict = {}
    
    for playlist_pid in track_vector:
        playlist_row = list(X_train.loc[X_train['pid'] == playlist_pid]['tracks'])[0]
        
        track_list = [track['track_uri'] for track in playlist_row]

        # Compare the song to ALL the tracks on each playlist
        for item_id in track_list:
            # The score of a pair does not depend on the playlist, so compute it only once
            if item_id not in track_score_dict:
                track_score_dict[item_id] = compressed_cos_similarity(track_vector, track_dict[item_id]['playlist_vector'])
            
    return track_score_dict

Helper functions to build a set of recommendations and to look up track names.

In [16]:
def get_track_ids(tracks_in_playlist):
    '''
    @param: 
        tracks_in_playlist: a "list" that only contains track IDs
    @return:
        track_ids: a list containing all track ids in ONE playlist
    '''
    return list(tracks_in_playlist)


def get_recs_for_playlist(tracks_in_playlist):
    '''
    @param: 
        tracks_in_playlist: a "list" that only contains track IDs
    @return:
        track_scores_dict: a "dictionary" mapping every candidate track (one that is not already
            in the playlist) to its highest cosine score against any track in the playlist
    '''
    track_scores_dict = {}
    # A set makes the "already in the playlist" check constant time
    track_ids = set(get_track_ids(tracks_in_playlist))

    for track_id in tracks_in_playlist:
        single_track_score = get_cosines_for_single_track(track_id)
        
        for k, v in single_track_score.items():
            if k in track_ids:
                continue
            
            # Keep the best score seen so far for this candidate
            track_scores_dict[k] = max(track_scores_dict.get(k, 0.0), v)

    # Debug output: number of candidate tracks
    print(len(track_scores_dict))
    return track_scores_dict


def sorted_cosine_tracks(rec_tracks_dict):
    '''
    Sort the recommended tracks for one playlist by cosine similarity score.
    
    @param:
        rec_tracks_dict: a dict of all candidate songs for the playlist, key = track_id (from other playlists),
            value = cosine similarity score
    @return:
        recs_list: a list of (track_id, score) tuples sorted by descending cosine score (max = 1, min = 0)
    '''
    return sorted(rec_tracks_dict.items(), key=lambda item: item[1], reverse=True)

def get_n_recommendation_songs_id(tracks_in_playlist, n):
    '''
    @param: 
        tracks_in_playlist: a "list" that only contains track IDs
        n: the maximum number of recommendations to return
    @return:
        rec_songs_id: a list of up to "n" (track_id, score) tuples for a single playlist, best first
    '''
    
    rec_tracks_dict = get_recs_for_playlist(tracks_in_playlist)
    sorted_tracks_list = sorted_cosine_tracks(rec_tracks_dict)
    
    # Slicing stops at the end of the list, so fewer than n candidates simply yields all of them
    rec_songs_id = sorted_tracks_list[:n]

    return rec_songs_id
    
def get_track_name(track_id, track_dict):
    return track_dict[track_id]['name']

def train_test_split_tracks(playlist_pid, X_test):
    tracks = flatten_playlist(playlist_pid, X_test)
    
    train, test = train_test_split(tracks, test_size=0.3, random_state=42)

    return train, test
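The helpers above combine per-track scores by keeping, for each candidate, the highest score against any track already in the playlist, and then rank the candidates. A toy sketch of that aggregation follows; the score dictionaries and the names `merge_scores_by_max` and `top_n` are hypothetical, and `heapq.nlargest` is used here as one alternative to a full sort.

```python
import heapq


def merge_scores_by_max(per_track_scores, seed_tracks):
    """Keep each candidate's best score; skip tracks that are already in the playlist."""
    seeds = set(seed_tracks)
    merged = {}
    for scores in per_track_scores:
        for track_id, score in scores.items():
            if track_id in seeds:
                continue
            merged[track_id] = max(merged.get(track_id, 0.0), score)
    return merged


def top_n(merged, n):
    """Return the n best (track_id, score) pairs, highest score first."""
    return heapq.nlargest(n, merged.items(), key=lambda item: item[1])


# Hypothetical scores for a playlist that contains tracks "a" and "b"
scores_for_a = {"a": 1.0, "x": 0.2, "y": 0.5}
scores_for_b = {"b": 1.0, "x": 0.7, "z": 0.1}

merged = merge_scores_by_max([scores_for_a, scores_for_b], ["a", "b"])
print(merged)
print(top_n(merged, 2))
```

Taking the maximum means one strongly related seed track is enough to surface a candidate; summing or averaging the scores would be alternative design choices that this notebook does not evaluate.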

### 4.5.4 Example: Top 20 Recommended Songs For One Playlist<a id='4.5.4_Example_Top_20_Recommendation_Songs_For_One_Playlist'></a>

In [17]:
def flatten_playlist(playlist_pid, df):
    # The selection is a one-element Series; list(...)[0] unwraps the track list it holds
    row = list(df.loc[df['pid'] == playlist_pid]['tracks'])[0]   
    
    tracks = []
    
    for item in row:
        track_id = item['track_uri']
        if track_id not in tracks:
            tracks.append(track_id)
            
    return tracks

In [19]:
n = 20
throwback_lst_pid = 0
country_pid = 456

# playlist = df[df['pid'] == country_pid].iloc[0]['tracks']
tracks_in_playlist = flatten_playlist(country_pid, df)
playlist_name = df[df['pid'] == country_pid].iloc[0]['name']

rec_tracks_dict = get_recs_for_playlist(tracks_in_playlist)
rec_songs = [r[0] for r in get_n_recommendation_songs_id(tracks_in_playlist, n)]
rec_song_names = [get_track_name(rec_song, track_dict) for rec_song in rec_songs]

print(f"There are {len(rec_tracks_dict)} candidate tracks for \"{playlist_name}\"")
print(f"The top {n} recommended songs for {playlist_name} are:")
rec_song_names

32576
32576
There are 32576 candidate tracks for "Country"
The top 20 recommended songs for Country are:


["I've Been Known",
 'Front Row Seat',
 'Kisses We Steal',
 "Hurtin'",
 'Guilty As Can Be',
 'Level',
 'Good Ole Boys',
 'Neon Moon',
 'White Trash Story - II (The Deuce)',
 "That's My Story",
 'My Kind of Girl',
 "This Romeo Ain't Got Julie Yet",
 'Friends In Low Places',
 "Crazy Eddie's Last Hurrah (Live)",
 'The Race Is On',
 'Cadillac Style',
 'Vidalia',
 "Let's Get Drunk And Make Friends",
 'Over My Head',
 "Drinkin' With Me"]

## 4.6 Metrics & Evaluation<a id='4.6_Metrics_and_Evaluation'></a>

### 4.6.1 R-Precision<a id='4.6.1_R_Precision'></a>

In [20]:
# R-precision for one playlist (model: cosine similarity)
def get_r_precision_for_pid(playlist_pid):
    # Hide 30% of the playlist's tracks and recommend from the remaining 70%
    shown, hidden = train_test_split_tracks(playlist_pid, X_test)
    
    # Number of recommended songs: n = 15 * number of hidden tracks
    n = len(hidden) * 15
    
    recs_for_playlist = [r[0] for r in get_n_recommendation_songs_id(shown, n)]
    
    return r_precision(set(recs_for_playlist), set(hidden))

In [21]:
get_r_precision_for_pid(5897)

31847


0.0
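The `r_precision` helper is imported from `help_functions` and its body is not shown in this notebook, so its exact definition is an assumption here. A common formulation of R-precision, sketched below under the name `r_precision_sketch`, takes the top |G| recommendations (where G is the set of hidden tracks) and reports the fraction of them that are in G.

```python
def r_precision_sketch(recommended, hidden):
    """Common R-precision: hits among the top-|G| recommendations, divided by |G|.

    recommended: ranked list of track ids, best first
    hidden: collection of held-out track ids (the ground truth G)
    """
    hidden = set(hidden)
    if not hidden:
        return 0.0
    top_r = recommended[:len(hidden)]
    return len(set(top_r) & hidden) / len(hidden)


# Toy example with made-up track ids
ranked = ["a", "b", "c", "d"]
held_out = {"b", "z"}
print(r_precision_sketch(ranked, held_out))  # 0.5
```

Note that this formulation needs the ranked order, whereas the cell above passes an unordered set of `15 * len(hidden)` recommendations; whether the helper truncates or not cannot be told from this notebook.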

In [22]:
'''
# Average R-precision over all playlists in the test set (this is the figure to report)
total_precision = 0
n = 0
for pid in X_test['pid'].unique():
    
    total_precision += get_r_precision_for_pid(pid)
    n += 1
    
    print(f"n: {n}, pid: {pid}, average_precision: {total_precision / n}")
    
avg = total_precision / n
'''


<img src="image/cosine-similarity-precision.png" style="height=200, width=200">

### 4.6.2 Song Clicks<a id='4.6.2_Song_Clicks'></a>

In [23]:
def recs_songs_click(R, G):
    '''
     Recommendation Songs Clicks: the number of refreshes (pages of 10 tracks) needed before the first correct recommendation appears.
        @param: 
            R: a 'list' of recommended songs for the playlist, given some seen tracks
            G: a 'set' of the hidden tracks of the playlist (in the test set)
        @return: the recommendation songs click value for the given playlist
    '''
    # If none of the recommended songs is among the hidden tracks of the playlist, return clicks = 51
    clicks = 51
    
    for i in range(len(R)):
        if R[i] in G:
            # Integer division: positions 0-9 need 0 clicks, 10-19 need 1 click, and so on
            return i // 10
        
    return clicks

In [24]:
# Song clicks for one playlist
def get_recs_songs_clicks_for_pid(test_pid):
    # Hide 30% of the playlist's tracks and recommend from the remaining 70%
    shown, hidden = train_test_split_tracks(test_pid, X_test)
    
    # Number of recommended songs: n = 15 * number of hidden tracks
    n = len(hidden) * 15
    
    recs_for_playlist = [r[0] for r in get_n_recommendation_songs_id(shown, n)]
    
    return recs_songs_click(recs_for_playlist, set(hidden))
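To make the metric concrete: recommendations are treated as pages of 10 tracks, and the value is the number of page refreshes needed before the first relevant track shows up, with 51 as the penalty when there is no hit. The sketch below uses a hypothetical function name and toy ids, and assumes 0-based positions with floor division.

```python
def song_clicks_sketch(recommended, hidden, page_size=10, no_hit_value=51):
    """Number of page refreshes before the first hidden track appears (0-based positions)."""
    hidden = set(hidden)
    for position, track_id in enumerate(recommended):
        if track_id in hidden:
            return position // page_size
    return no_hit_value


# 30 made-up recommendations: "t0", "t1", ..., "t29"
ranked = [f"t{i}" for i in range(30)]

print(song_clicks_sketch(ranked, {"t0"}))    # 0, hit on the first page
print(song_clicks_sketch(ranked, {"t9"}))    # 0, still on the first page
print(song_clicks_sketch(ranked, {"t10"}))   # 1, first track of the second page
print(song_clicks_sketch(ranked, {"t25"}))   # 2
print(song_clicks_sketch(ranked, {"zzz"}))   # 51, no hit at all
```

Lower is better for this metric, and the value is always an integer.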

In [25]:
'''
# Average recommendation songs clicks over all playlists in the test set (this is the figure to report)
total_clicks = 0
n = 0
for pid in X_test['pid'].unique():
    
    total_clicks += get_recs_songs_clicks_for_pid(pid)
    n += 1
    
    print(f"n: {n}, avg_clicks: {total_clicks / n}")
    
avg_recs_songs_clicks = total_clicks / n
'''


In [26]:
# print(f'The average recommendation Songs Clicks for all playlist in test set is: {avg_recs_songs_clicks}')

<img src="image/cosine-similarity-recommendation_song_clicks.png" style="height=200, width=200">

### 4.6.3 Normalized Discounted Cumulative Gain (NDCG)<a id='4.6.3_Normalized_Discounted_Cumulative_Gain'></a>

In [28]:
# NDCG for one playlist
def get_NDCG_for_pid(test_pid):
    # Hide 30% of the playlist's tracks and recommend from the remaining 70%
    shown, hidden = train_test_split_tracks(test_pid, X_test)
    
    # Number of recommended songs: n = 15 * number of hidden tracks
    n = len(hidden) * 15
    
    recs_for_playlist = get_n_recommendation_songs_id(shown, n)
#     print(recs_for_playlist)
    return NDCG(recs_for_playlist, set(hidden))
get_NDCG_for_pid(5897)

31847


0.010594662831891212
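The `NDCG` helper also comes from `help_functions` and is not shown, so the following is only a sketch of one common formulation with binary relevance: each hit at 1-based rank i contributes 1 / log2(i + 1), and the total is divided by the score of an ideal ranking. The function name and toy ids are assumptions, and the helper used above may differ (for instance, it receives `(track_id, score)` tuples rather than bare ids).

```python
from math import log2


def ndcg_sketch(recommended, hidden):
    """Binary-relevance NDCG with the common 1 / log2(rank + 1) discount."""
    hidden = set(hidden)
    # Discounted gain of the actual ranking
    dcg = sum(
        1.0 / log2(rank + 1)
        for rank, track_id in enumerate(recommended, start=1)
        if track_id in hidden
    )
    # Ideal ranking: every hidden track placed at the top of the list
    ideal_hits = min(len(hidden), len(recommended))
    idcg = sum(1.0 / log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


print(ndcg_sketch(["a", "b", "c"], {"a", "b"}))  # perfect ranking
print(ndcg_sketch(["x", "a"], {"a"}))            # hit at rank 2 is discounted
print(ndcg_sketch(["x", "y"], {"a"}))            # no hit
```

Unlike R-precision, this metric rewards placing correct tracks near the top of the list, not only including them.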

In [2]:
# Calculate the average Normalized Discounted Cumulative Gain for the test set
total_NDCG = 0.0
n = 0
for pid in X_test['pid'].unique():
    # Accumulate NDCG (not song clicks) for each test playlist
    total_NDCG += get_NDCG_for_pid(pid)
    n += 1
    
    print(f"n: {n}, avg_NDCG: {total_NDCG / n}")
    
avg_recs_NDCG = total_NDCG / n


<img src="image/cosine-similarity-NDCG.png">