# Capstone 3 
## Recommendation system
### July 2022

##### Recommendation systems have become popular in the entertainment industry in recent years: services such as YouTube, Netflix and Spotify use different approaches to recommend videos or music to users. A good recommendation system delivers high-quality, personalized recommendations that match each user's taste.

##### The goal of The Million Song Dataset Challenge is to predict the songs that a user will listen to, given both the user's listening history and full information (including meta-data and content analysis) for all songs.

##### Data : https://www.kaggle.com/competitions/msdchallenge , http://millionsongdataset.com/challenge/

##### project proposal
##### https://docs.google.com/document/d/1bybbHktaNtRWpOuZq8H2j6U6X1n7m4VKFeroqADNNYM/edit?usp=sharing
##### project idea
##### https://docs.google.com/document/d/17e3VP4f4s_cnssqMTAmYqmXfuDhb5LBZtTl_VWck-fo/edit?usp=sharing

In [27]:
# import lib

import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns



# Load dataset

In [28]:
# MSDChallengeGettingstarted.pdf
# kaggle_visible_evaluation_triplets.txt
# kaggle_songs.txt
# kaggle_users.txt
# taste_profile_song_to_tracks.txt

In [29]:
basedir="/Users/yuenyeelo/Documents/springboard/projects/Capstone3/msdchallenge/"
songfile=basedir+"kaggle_songs.txt"
usersfile=basedir+"kaggle_users.txt"
profilefile=basedir+"taste_profile_song_to_tracks.txt"
evalfile=basedir+"kaggle_visible_evaluation_triplets.txt"
# song info, artistname , songtitle
songinfo=basedir+"unique_tracks.csv"


df_song=pd.read_csv(songfile,names=['song_id', 'id'],sep=' ')
df_users=pd.read_csv(usersfile, names=["user_id"],sep=' ')
df_profile=pd.read_csv(profilefile, names=["song_id", "user_id"],sep=' ')
df_eval=pd.read_csv(evalfile, names=["user_id","song_id","listen_count"],sep='\t')
df_songinfo=pd.read_csv(songinfo)

# EDA 

In [30]:
# songs
print(df_song.isnull().sum())
print(df_song.dtypes)
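# value_counts is referenced here without parentheses, so this prints the bound method
# (its repr contains the whole Series) rather than per-ID counts; call value_counts() to get counts
print(df_song['song_id'].value_counts)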
df_song.describe()

song_id    0
id         0
dtype: int64
song_id    object
id          int64
dtype: object
<bound method IndexOpsMixin.value_counts of 0         SOAAADD12AB018A9DD
1         SOAAADE12A6D4F80CC
2         SOAAADF12A8C13DF62
3         SOAAADZ12A8C1334FB
4         SOAAAFI12A6D4F9C66
                 ...        
386208    SOZZZRJ12AB0187A75
386209    SOZZZRV12A8C1361F1
386210    SOZZZSR12AB01854CD
386211    SOZZZWD12A6D4F6624
386212    SOZZZWN12AF72A1E29
Name: song_id, Length: 386213, dtype: object>


Unnamed: 0,id
count,386213.0
mean,193107.0
std,111490.234095
min,1.0
25%,96554.0
50%,193107.0
75%,289660.0
max,386213.0


In [31]:
# users
print(df_users.isnull().sum())
print(df_users.dtypes)

df_users.describe()

user_id    0
dtype: int64
user_id    object
dtype: object


Unnamed: 0,user_id
count,110000
unique,110000
top,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d
freq,1


## There are 110,000 users and 386,213 songs in total, and neither file contains NaN values.

In [32]:
# df_eval 
print(df_eval.isnull().sum())
print(df_eval.dtypes)

df_eval.describe()

user_id         0
song_id         0
listen_count    0
dtype: int64
user_id         object
song_id         object
listen_count     int64
dtype: object


Unnamed: 0,listen_count
count,1450933.0
mean,3.187149
std,7.051664
min,1.0
25%,1.0
50%,1.0
75%,3.0
max,923.0


In [33]:
#### There are no NaN values in the eval file; count the distinct songs it contains.

print(df_eval['song_id'].nunique())


163206


In [34]:
#### Let's see which songs are most popular: count how many users (rows) listened to each song
df_eval['total_listenByUser'] = df_eval.groupby('song_id')['listen_count'].transform('count')
#df_eval.groupby('song_id')
df_eval.head()

Unnamed: 0,user_id,song_id,listen_count,total_listenByUser
0,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOBONKR12A58A7A7E0,1,4136
1,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOEGIYH12A6D4FC0E3,1,3272
2,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOFLJQZ12A6D4FADA6,1,2668
3,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOHTKMO12AB01843B0,1,2097
4,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SODQZCY12A6D4F9D11,1,177
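
The `total_listenByUser` column above comes from `transform('count')`, so it holds the number of (user, song) rows per song, that is, how many users played the song, rather than the total number of plays. A minimal sketch on made-up data (the `toy` frame and its values are hypothetical) showing the difference between the two aggregations:

```python
import pandas as pd

# Hypothetical toy triplets: (user, song, listen_count)
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u1"],
    "song_id": ["s1", "s1", "s1", "s2"],
    "listen_count": [1, 5, 2, 10],
})

# Number of users (rows) per song, broadcast back to every row
toy["n_listeners"] = toy.groupby("song_id")["listen_count"].transform("count")
# Total plays per song, broadcast back to every row
toy["n_plays"] = toy.groupby("song_id")["listen_count"].transform("sum")

print(toy)
```

Ranking by listeners favors songs with broad reach, while ranking by plays can be dominated by a few heavy listeners; the notebook uses the former.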


In [35]:
df_songinfo.head()

Unnamed: 0,track_id,song_id,artist_name,song_title
0,TRMMMYQ128F932D901,SOQMMHC12AB0180CB8,Faster Pussy cat,Silent Night
1,TRMMMKD128F425225D,SOVFVAK12A8C1350D9,Karkkiautomaatti,Tanssi vaan
2,TRMMMRX128F93187D9,SOGTUKN12AB017F4F1,Hudson Mohawke,No One Could Ever
3,TRMMMCH128F425532C,SOBNYVR12A8C13558C,Yerba Brava,Si Vos Querés
4,TRMMMWA128F426B589,SOHSBXH12A8C13B0DF,Der Mystic,Tangle Of Aspens


In [36]:
# combine both data
song_df = pd.merge(df_eval, df_songinfo.drop_duplicates(['song_id']), on='song_id', how='left')
song_df.head()

Unnamed: 0,user_id,song_id,listen_count,total_listenByUser,track_id,artist_name,song_title
0,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOBONKR12A58A7A7E0,1,4136,TRAEHHJ12903CF492F,Dwight Yoakam,You're The One
1,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOEGIYH12A6D4FC0E3,1,3272,TRLGMFJ128F4217DBE,Barry Tuckwell/Academy of St Martin-in-the-Fie...,Horn Concerto No. 4 in E flat K495: II. Romanc...
2,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOFLJQZ12A6D4FADA6,1,2668,TRTNDNE128F1486812,Cartola,Tive Sim
3,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOHTKMO12AB01843B0,1,2097,TRASTUE128F930D488,Lonnie Gordon,Catch You Baby (Steve Pitron & Max Sanna Radio...
4,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SODQZCY12A6D4F9D11,1,177,TRFPLWO128F1486B9E,Miguel Calo,El Cuatrero


In [37]:
# creating new feature combining title and artist name
song_df['song'] = song_df['song_title']+' - '+song_df['artist_name']
song_df.head()

Unnamed: 0,user_id,song_id,listen_count,total_listenByUser,track_id,artist_name,song_title,song
0,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOBONKR12A58A7A7E0,1,4136,TRAEHHJ12903CF492F,Dwight Yoakam,You're The One,You're The One - Dwight Yoakam
1,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOEGIYH12A6D4FC0E3,1,3272,TRLGMFJ128F4217DBE,Barry Tuckwell/Academy of St Martin-in-the-Fie...,Horn Concerto No. 4 in E flat K495: II. Romanc...,Horn Concerto No. 4 in E flat K495: II. Romanc...
2,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOFLJQZ12A6D4FADA6,1,2668,TRTNDNE128F1486812,Cartola,Tive Sim,Tive Sim - Cartola
3,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOHTKMO12AB01843B0,1,2097,TRASTUE128F930D488,Lonnie Gordon,Catch You Baby (Steve Pitron & Max Sanna Radio...,Catch You Baby (Steve Pitron & Max Sanna Radio...
4,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SODQZCY12A6D4F9D11,1,177,TRFPLWO128F1486B9E,Miguel Calo,El Cuatrero,El Cuatrero - Miguel Calo


In [179]:
# number of listening records (users) per song; this counts rows, it does not sum listen_count
song_grouped = song_df.groupby(['song']).agg({'listen_count':'count'}).reset_index()
print(song_grouped.shape)
song_grouped.head()

(162047, 2)


Unnamed: 0,song,listen_count
0,Ef Ég Hefði Aldrei... - Johann Johannsson,1
1,Light Mass Prayers - Porcupine Tree,4
2,"The Arsonist Story"": Evil Craves Attention/O...",1
3,Ég Átti Gráa Æsku - Johann Johannsson,4
4,(Jack The Stripper) - Nekromantix,2


In [180]:
# check popular songs
song_grouped.sort_values(by=['listen_count'],ascending=False)

Unnamed: 0,song,listen_count
116491,Sehr kosmisch - Harmonia,5043
147834,Undo - Björk,4483
160389,You're The One - Dwight Yoakam,4136
33504,Dog Days Are Over (Radio Edit) - Florence + Th...,3780
110728,Revelry - Kings Of Leon,3672
...,...,...
77924,Life Indoors - 1 Mile North,1
77922,Life In The Tropics - The Rippingtons,1
77921,Life In The Real World - Firehouse,1
77918,Life In The Gladhouse - Modern English,1


In [181]:
grouped_sum = song_grouped['listen_count'].sum()
print(grouped_sum)
song_grouped['percentage'] = (song_grouped['listen_count'] / grouped_sum ) * 100
song_grouped.sort_values(['listen_count', 'song'], ascending=[0,1])

1450932


Unnamed: 0,song,listen_count,percentage
116491,Sehr kosmisch - Harmonia,5043,0.347570
147834,Undo - Björk,4483,0.308974
160389,You're The One - Dwight Yoakam,4136,0.285058
33504,Dog Days Are Over (Radio Edit) - Florence + Th...,3780,0.260522
110728,Revelry - Kings Of Leon,3672,0.253079
...,...,...,...
162033,Último Desejo - Nana Caymmi,1,0.000069
162035,Über Grenzen Geh'n - Drafi Deutscher,1,0.000069
162037,Übers Geld (Skit) - Samy Deluxe,1,0.000069
162039,Üdvözöl A Pokol - Tankcsapda,1,0.000069


# Method 1

### Cold start: Recommend most popular songs
##### Recommend the most popular songs if we don't have any user preference information

In [182]:
## Popular-items class
## It is user-independent; this method is for new users (cold start)
class popularity_recommender():
    def __init__(self):
        self.train_data = None
        self.user_id = None
        self.item_id = None
        self.popularity_recommendations = None
        
    #Create the popularity based recommender system model
    def create(self, train_data, user_id, item_id):
        self.train_data = train_data
        self.user_id = user_id
        self.item_id = item_id

        #Get a count of user_ids for each unique song as recommendation score
        train_data_grouped = train_data.groupby([self.item_id]).agg({self.user_id: 'count'}).reset_index()
        #Rename the count column using the configured user column, not a hard-coded 'user_id'
        train_data_grouped.rename(columns = {self.user_id: 'Total_countListened'},inplace=True)
    
        #Sort the songs based upon recommendation score
        train_data_sort = train_data_grouped.sort_values(['Total_countListened', self.item_id], ascending = [0,1])
    
        #Generate a recommendation rank based upon score
        train_data_sort['Rank'] = train_data_sort['Total_countListened'].rank(ascending=0, method='first')
        
        #Get the top 10 recommendations
        self.popularity_recommendations = train_data_sort.head(10)

    #Use the popularity based recommender system model to
    #make recommendations
    def recommend(self, user_id):    
        #Work on a copy so the stored top-10 table is not modified by each call
        user_recommendations = self.popularity_recommendations.copy()
        
        #Add user_id column for which the recommendations are being generated
        user_recommendations['user_id'] = user_id
    
        #Bring user_id column to the front
        cols = user_recommendations.columns.tolist()
        cols = cols[-1:] + cols[:-1]
        user_recommendations = user_recommendations[cols]
        
        return user_recommendations

In [183]:
pr = popularity_recommender()

In [184]:
pr.create(song_df, 'user_id', 'song')

In [185]:
# display the top 10 popular songs
# it is independent of the user
pr.recommend(song_df['user_id'][0])

Unnamed: 0,user_id,song,Total_countListened,Rank
116491,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,Sehr kosmisch - Harmonia,5043,1.0
147834,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,Undo - Björk,4483,2.0
160389,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,You're The One - Dwight Yoakam,4136,3.0
33504,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,Dog Days Are Over (Radio Edit) - Florence + Th...,3780,4.0
110728,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,Revelry - Kings Of Leon,3672,5.0
116240,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,Secrets - OneRepublic,3430,6.0
57994,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,Horn Concerto No. 4 in E flat K495: II. Romanc...,3272,7.0
56238,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,Hey_ Soul Sister - Train,2791,8.0
44083,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,Fireflies - Charttraxx Karaoke,2725,9.0
143094,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,Tive Sim - Cartola,2668,10.0
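
The same popularity ranking can also be sketched without a class. The snippet below is an illustration on made-up data, not the notebook's implementation; `top_n_popular` is a hypothetical helper that, like the class above, scores each item by its number of listener rows and breaks ties alphabetically.

```python
import pandas as pd

def top_n_popular(df, user_col, item_col, n=10):
    """Rank items by how many rows (listeners) they have; hypothetical helper."""
    counts = (
        df.groupby(item_col)[user_col]
        .count()
        .rename("score")
        .reset_index()
        .sort_values(["score", item_col], ascending=[False, True])
        .head(n)
        .reset_index(drop=True)
    )
    # Rank 1 is the most popular item
    counts["rank"] = range(1, len(counts) + 1)
    return counts

# Hypothetical toy listening records
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u1", "u2", "u3"],
    "song": ["A", "A", "A", "B", "B", "C"],
})
top = top_n_popular(toy, "user_id", "song", n=2)
print(top)
```

Because the result does not depend on who is asking, it can be computed once and served to every new user.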


# Method 2

### Collaborative Filtering
### Item-Based Top-N Recommendation Algorithms (by Mukund Deshpande, George Karypis)
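
The class that follows scores a pair of songs by the Jaccard index of their listener sets: the number of users who played both songs divided by the number who played either one. A minimal sketch with made-up listener sets (the song names and user IDs are hypothetical):

```python
def jaccard(users_a, users_b):
    """Jaccard index |A & B| / |A | B| of two sets; 0.0 when both are empty."""
    union = users_a | users_b
    if not union:
        return 0.0
    return len(users_a & users_b) / len(union)

# Hypothetical listener sets per song
listeners = {
    "song_x": {"u1", "u2", "u3"},
    "song_y": {"u2", "u3", "u4"},
    "song_z": {"u5"},
}

sim_xy = jaccard(listeners["song_x"], listeners["song_y"])  # 2 shared out of 4 users
sim_xz = jaccard(listeners["song_x"], listeners["song_z"])  # no shared users
print(sim_xy, sim_xz)
```

A candidate song's score for a user is then the mean of its Jaccard index against each song in that user's listening history.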

In [269]:
#Class for Item similarity based Recommender System model
class item_similarity_recommender():
    def __init__(self):
        self.train_data = None
        self.user_id = None
        self.item_id = None
        self.cooccurence_matrix = None
        self.songs_dict = None
        self.rev_songs_dict = None
        self.item_similarity_recommendations = None
        
    #Get unique items (songs) corresponding to a given user
    def get_user_items(self, user):
        user_data = self.train_data[self.train_data[self.user_id] == user]
        user_items = list(user_data[self.item_id].unique())
        
        return user_items
        
    #Get unique users for a given item (song)
    def get_item_users(self, item):
        item_data = self.train_data[self.train_data[self.item_id] == item]
        item_users = list(set(item_data[self.user_id].unique()))
            
        return item_users
        
    #Get unique items (songs) in the training data
    def get_all_items_train_data(self):
        all_items = list(self.train_data[self.item_id].unique())
            
        return all_items
        
    #Construct cooccurence matrix
    def construct_cooccurence_matrix(self, user_songs, all_songs):
            
        ####################################
        #Get users for all songs in user_songs.
        ####################################
        user_songs_users = []        
        for i in range(0, len(user_songs)):
            user_songs_users.append(self.get_item_users(user_songs[i]))
        #print(user_songs_users)
        ###############################################
        #Initialize the item cooccurence matrix of size 
        #len(user_songs) X len(songs)
        ###############################################
        print("get cooccurence_matrix for number of user_songs :", len(user_songs))
        cooccurence_matrix = np.matrix(np.zeros(shape=(len(user_songs), len(all_songs))), float)
           
        #############################################################
        #Calculate similarity between user songs and all unique songs
        #in the training data
        #############################################################
        for i in range(0,len(all_songs)):
            #Calculate unique listeners (users) of song (item) i
            
            songs_i_data = self.train_data[self.train_data[self.item_id] == all_songs[i]]
            users_i = set(songs_i_data[self.user_id].unique())
            # print("Numer of user listened ", len(users_i))
            for j in range(0,len(user_songs)): 
                # print("cooccurence for user song:",user_songs_users[j])
                    
                #Get unique listeners (users) of song (item) j
                users_j = user_songs_users[j]
                    
                #Calculate intersection of listeners of songs i and j
                
                users_intersection = users_i.intersection(users_j)
                # print("Intersection",users_intersection)
                # print(len(users_i))
                # print(len(users_j))
                #Calculate cooccurence_matrix[j,i] as Jaccard Index
                
                
                if len(users_intersection) != 0:
                    #Calculate union of listeners of songs i and j
                    
                    users_union = users_i.union(users_j)
                    # print("Union", len(users_union))
                    cooccurence_matrix[j,i] = float(len(users_intersection))/float(len(users_union))
                    #if cooccurence_matrix[j,i] ==1:
                     #   print(users_intersection)
                      #  print(users_union)
                    #print(cooccurence_matrix[j,i])
                else:
                    cooccurence_matrix[j,i] = 0
           
        # print(cooccurence_matrix)
        return cooccurence_matrix

    
    #Use the cooccurence matrix to make top recommendations
    def generate_top_recommendations(self, user, cooccurence_matrix, all_songs, user_songs):
        print("Generation Top Recommendataion for user : ", user)
        print("Non zero values in cooccurence_matrix :%d" % np.count_nonzero(cooccurence_matrix))
        
        #Calculate the (unweighted) mean of the scores in cooccurence matrix over all user songs.
        user_sim_scores = cooccurence_matrix.sum(axis=0)/float(cooccurence_matrix.shape[0])
        user_sim_scores = np.array(user_sim_scores)[0].tolist()
 
        #Sort the indices of user_sim_scores based upon their value
        #Also maintain the corresponding score
        sort_index = sorted(((e,i) for i,e in enumerate(list(user_sim_scores))), reverse=True)
        #print(sort_index[:10])
        #Create a dataframe from the following
        columns = ['user_id', 'song', 'score', 'rank']
        #index = np.arange(1) # array of numbers for the number of samples
        df = pd.DataFrame(columns=columns)
         
        #Fill the dataframe with top 10 item based recommendations,
        #skipping NaN scores and songs the user has already listened to
        rank = 1 
        for i in range(0,len(sort_index)):
            if rank > 10:
                break
            if not np.isnan(sort_index[i][0]) and all_songs[sort_index[i][1]] not in user_songs:
                df.loc[len(df)]=[user,all_songs[sort_index[i][1]],sort_index[i][0],rank]
                rank +=1
        
        #Handle the case where there are no recommendations
        if df.shape[0] == 0:
            print("The current user has no songs for training the item similarity based recommendation model.")
            return -1
        else:
            return df
 
    #Create the item similarity based recommender system model
    def create(self, train_data, user_id, item_id):
        self.train_data = train_data
        self.user_id = user_id
        self.item_id = item_id

    #Use the item similarity based recommender system model to
    #make recommendations
    def recommend(self, user):
        
        ########################################
        #A. Get all unique songs for this user
        ########################################
        user_songs = self.get_user_items(user)    
            
        print("No. of unique songs for the user: %d" % len(user_songs))
        
        ######################################################
        #B. Get all unique items (songs) in the training data
        ######################################################
        all_songs = self.get_related_songs(user_songs)
        
        print("No. of unique songs in the training set: %d" % len(all_songs))
         
        ###############################################
        #C. Construct item cooccurence matrix of size 
        #len(user_songs) X len(songs)
        ###############################################
        print("Start coocurence_matrix")
        cooccurence_matrix = self.construct_cooccurence_matrix(user_songs, all_songs)
        print("Finish coocurence_matrix")
        #######################################################
        #D. Use the cooccurence matrix to make recommendations
        #######################################################
        print("Generate top recommendations")
        df_recommendations = self.generate_top_recommendations(user, cooccurence_matrix, all_songs, user_songs)
                
        return df_recommendations
    
    def get_related_songs(self, item_list):
        ######################################################
        #Get the candidate songs: every song played by any user
        #who shares at least one song with item_list
        ######################################################
        
        #Users who listened to any of the given songs
        user_songs = item_list
        users_lst=[]
        for i in range(0, len(user_songs)):
            users_lst+=self.get_item_users(user_songs[i])
        
        users=list(set(users_lst))
        
        print("Number of user listened same songs:", len(users))
        #Use the configured column names rather than hard-coded 'user_id' and 'song'
        train_songs_df=self.train_data[self.train_data[self.user_id].isin(users)]
        all_songs=list(train_songs_df[self.item_id].unique())
        print("Total unique songs that all related users listened: ", len(all_songs))
        
        return all_songs
    
    #Get similar items to given items
    def get_similar_songs(self, item_list):
        ## get related songs, songs that other users listened 
        user_songs = item_list
        all_songs = self.get_related_songs(user_songs)
        print("No. of unique songs in the training set: %d" % len(all_songs))
         
        ###############################################
        #C. Construct item cooccurence matrix of size 
        #len(user_songs) X len(songs)
        ###############################################
        cooccurence_matrix = self.construct_cooccurence_matrix(user_songs, all_songs)
        
        #######################################################
        #D. Use the cooccurence matrix to make recommendations
        #######################################################
        user = ""
        df_recommendations = self.generate_top_recommendations(user, cooccurence_matrix, all_songs, user_songs)
         
        return df_recommendations

In [270]:
ir = item_similarity_recommender()
ir.create(song_df, 'user_id', 'song')

In [271]:
# list of songs that the user has listened to
user_items = ir.get_user_items(song_df['user_id'][100])

In [272]:
# display user songs history
print("User id : ",song_df['user_id'][100])
for user_item in user_items:
    print(user_item)

User id :  fdf6afb5daefb42774617cf223475c6013969724
Ballad Of Big Nothing - Elliott Smith
My Paper Heart - The All-American Rejects
We're All Gonna Die (featuring Iggy Pop) - Slash
Saint Is A Sinner (featuring Rocco DeLuca) - Slash
I Hold On (featuring Kid Rock) - Slash
Alphabet Town - Elliott Smith
Nothing To Say (featuring M. Shadows) - Slash
Gotten (featuring Adam Levine) - Slash
The Golden Rose (Album Version) - Tom Petty
Genius - Kings Of Leon


In [276]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [277]:
# give song recommendation for that user
df_top10_user100=ir.recommend(song_df['user_id'][100])
df_top10_user100

No. of unique songs for the user: 10
Number of user listened same songs: 294
Total unique songs that all related users listened:  3625
No. of unique songs in the training set: 3625
Start coocurence_matrix
get cooccurence_matrix for number of user_songs : 10
Finish coocurence_matrix
Generate top recommendations
Generation Top Recommendataion for user :  fdf6afb5daefb42774617cf223475c6013969724
Non zero values in cooccurence_matrix :4669


Unnamed: 0,user_id,song,score,rank
0,fdf6afb5daefb42774617cf223475c6013969724,Watch This (featuring Dave Grohl and Duff McKa...,0.075644,1
1,fdf6afb5daefb42774617cf223475c6013969724,Crucify The Dead (featuring Ozzy Osbourne) - S...,0.07225,2
2,fdf6afb5daefb42774617cf223475c6013969724,Starlight (featuring Myles Kennedy) - Slash,0.057586,3
3,fdf6afb5daefb42774617cf223475c6013969724,Promise (featuring Chris Cornell) - Slash,0.049832,4
4,fdf6afb5daefb42774617cf223475c6013969724,By The Sword (featuring Andrew Stockdale or Wo...,0.044446,5
5,fdf6afb5daefb42774617cf223475c6013969724,Doctor Alibi (featuring Lemmy Kilmister) - Slash,0.036253,6
6,fdf6afb5daefb42774617cf223475c6013969724,The White Lady Loves You More - Elliott Smith,0.032323,7
7,fdf6afb5daefb42774617cf223475c6013969724,Beautiful Dangerous (featuring Fergie) - Slash,0.030842,8
8,fdf6afb5daefb42774617cf223475c6013969724,Single File - Elliott Smith,0.030664,9
9,fdf6afb5daefb42774617cf223475c6013969724,Alameda - Elliott Smith,0.028571,10


In [283]:
pd.set_option("display.width", None)
pd.set_option('display.max_colwidth', 200) 
df_top10_user100

Unnamed: 0,user_id,song,score,rank
0,fdf6afb5daefb42774617cf223475c6013969724,Watch This (featuring Dave Grohl and Duff McKagan) - Slash,0.075644,1
1,fdf6afb5daefb42774617cf223475c6013969724,Crucify The Dead (featuring Ozzy Osbourne) - Slash,0.07225,2
2,fdf6afb5daefb42774617cf223475c6013969724,Starlight (featuring Myles Kennedy) - Slash,0.057586,3
3,fdf6afb5daefb42774617cf223475c6013969724,Promise (featuring Chris Cornell) - Slash,0.049832,4
4,fdf6afb5daefb42774617cf223475c6013969724,By The Sword (featuring Andrew Stockdale or Wolfmother) - Slash,0.044446,5
5,fdf6afb5daefb42774617cf223475c6013969724,Doctor Alibi (featuring Lemmy Kilmister) - Slash,0.036253,6
6,fdf6afb5daefb42774617cf223475c6013969724,The White Lady Loves You More - Elliott Smith,0.032323,7
7,fdf6afb5daefb42774617cf223475c6013969724,Beautiful Dangerous (featuring Fergie) - Slash,0.030842,8
8,fdf6afb5daefb42774617cf223475c6013969724,Single File - Elliott Smith,0.030664,9
9,fdf6afb5daefb42774617cf223475c6013969724,Alameda - Elliott Smith,0.028571,10


In [284]:
df_top10_user100['song']

0         Watch This (featuring Dave Grohl and Duff McKagan) - Slash
1                 Crucify The Dead (featuring Ozzy Osbourne) - Slash
2                        Starlight (featuring Myles Kennedy) - Slash
3                          Promise (featuring Chris Cornell) - Slash
4    By The Sword (featuring Andrew Stockdale or Wolfmother) - Slash
5                   Doctor Alibi (featuring Lemmy Kilmister) - Slash
6                      The White Lady Loves You More - Elliott Smith
7                     Beautiful Dangerous (featuring Fergie) - Slash
8                                        Single File - Elliott Smith
9                                            Alameda - Elliott Smith
Name: song, dtype: object

## Get similar songs given a song

In [274]:
# find songs similar to a given song, based on shared listeners (Jaccard index)
ir.get_similar_songs(['Oliver James - Fleet Foxes'])

Number of user listened same songs: 71
Total unique songs that all related users listened:  1087
No. of unique songs in the training set: 1087
get cooccurence_matrix for number of user_songs : 1
Generation Top Recommendataion for user :  
Non zero values in cooccurence_matrix :1087


Unnamed: 0,user_id,song,score,rank
0,,Quiet Houses - Fleet Foxes,0.161017,1
1,,Sun It Rises - Fleet Foxes,0.137725,2
2,,Meadowlarks - Fleet Foxes,0.133333,3
3,,He Doesn't Know Why - Fleet Foxes,0.11465,4
4,,Your Protector - Fleet Foxes,0.114504,5
5,,Drops In The River - Fleet Foxes,0.1,6
6,,Heard Them Stirring - Fleet Foxes,0.084906,7
7,,Tiger Mountain Peasant Song - Fleet Foxes,0.073171,8
8,,Ragged Wood - Fleet Foxes,0.056738,9
9,,White Winter Hymnal - Fleet Foxes,0.053279,10


## The method above is quite slow.
### The scikit-surprise library provides ready-made recommender-system algorithms; the cells below instead try a nearest-neighbors search with scikit-learn.
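
Much of the cost comes from filtering the DataFrame once per song inside nested Python loops. One possible speed-up, sketched here on made-up data and not part of the original notebook, is to build a binary item-by-user sparse matrix once and obtain all intersection sizes for a song with a single sparse matrix product. The function name `jaccard_to_all` and the toy matrix are assumptions for illustration.

```python
import numpy as np
from scipy import sparse

def jaccard_to_all(item_user, item_idx):
    """Jaccard index between one item and every item in a binary item x user CSR matrix."""
    # Listener count per item (row sums of the 0/1 matrix)
    sizes = np.asarray(item_user.sum(axis=1)).ravel()
    # Shared-listener counts between the chosen item and all items
    inter = (item_user[item_idx] @ item_user.T).toarray().ravel()
    union = sizes[item_idx] + sizes - inter
    # Leave the score at 0 where the union is empty to avoid division by zero
    return np.divide(inter, union, out=np.zeros_like(inter, dtype=float), where=union > 0)

# Hypothetical toy data: 3 songs (rows) x 5 users (columns)
item_user = sparse.csr_matrix(np.array([
    [1, 1, 1, 0, 0],   # song 0 played by u0, u1, u2
    [0, 1, 1, 1, 0],   # song 1 played by u1, u2, u3
    [0, 0, 0, 0, 1],   # song 2 played by u4
]))

scores = jaccard_to_all(item_user, 0)
print(scores)
```

The product touches only non-zero entries, so songs with no listeners in common cost nothing, which is where the loop-based version spends most of its time.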

In [None]:
## scikit-surprise  

In [None]:
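# Dense one-hot user x song indicator table (1 if the user played the song).
# With about 1.45M rows and 163K distinct songs this is very large, so this cell was left unexecuted;
# the sparse matrix built further down avoids materializing it.
df_ms=pd.get_dummies(df_eval.song_id).groupby(df_eval.user_id).max()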

In [14]:
from scipy import sparse
f="/Users/yuenyeelo/Documents/springboard/projects/Capstone3/msdchallenge/kaggle_visible_evaluation_triplets.txt"
df=pd.read_csv(f,names=["user_id","song_id","listen_count"],sep='\t')
print(df.shape)
df.head()


(1450933, 3)


Unnamed: 0,user_id,song_id,listen_count
0,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOBONKR12A58A7A7E0,1
1,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOEGIYH12A6D4FC0E3,1
2,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOFLJQZ12A6D4FADA6,1
3,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOHTKMO12AB01843B0,1
4,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SODQZCY12A6D4F9D11,1


In [18]:
df['user_id'] = df['user_id'].astype("category")
df['song_id'] = df['song_id'].astype("category")
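# rows are songs and columns are users (despite the variable name user_items)
user_items = sparse.coo_matrix((df.listen_count.astype(float),(df.song_id.cat.codes,df.user_id.cat.codes)))
# densify only the first 10 rows; converting the full song x user matrix needs far more memory
my_data=user_items.tocsr()[:10].toarray()
print(my_data)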

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
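
Because rows are indexed by `song_id` category codes and columns by `user_id` codes, the matrix built above is song-by-user, and `cat.categories` maps a row number back to its song ID. A small sketch on made-up triplets (all values hypothetical) that also passes an explicit `shape` and converts to CSR so single rows can be sliced:

```python
import pandas as pd
from scipy import sparse

# Hypothetical toy triplets
toy = pd.DataFrame({
    "user_id": ["u2", "u1", "u2", "u3"],
    "song_id": ["sB", "sA", "sA", "sC"],
    "listen_count": [4, 1, 2, 7],
})
toy["user_id"] = toy["user_id"].astype("category")
toy["song_id"] = toy["song_id"].astype("category")

n_songs = len(toy["song_id"].cat.categories)
n_users = len(toy["user_id"].cat.categories)

# Rows = songs, columns = users, values = listen counts
song_user = sparse.coo_matrix(
    (toy["listen_count"].to_numpy(dtype=float),
     (toy["song_id"].cat.codes.to_numpy(), toy["user_id"].cat.codes.to_numpy())),
    shape=(n_songs, n_users),
).tocsr()

# Map a row index back to its song ID and show that song's listen counts
row = 0
print(toy["song_id"].cat.categories[row], song_user[row].toarray())
```

Category codes follow the sorted order of the unique IDs, so row `k` of the matrix corresponds to the `k`-th song ID in sorted order.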


In [None]:
df_ms.head()

In [19]:
from sklearn.neighbors import NearestNeighbors
knn = NearestNeighbors(metric='cosine', algorithm='brute')
knn.fit(user_items)
distances, indices = knn.kneighbors(user_items, n_neighbors=3)

In [20]:
print(distances[:10])

[[2.22044605e-16 2.92893219e-01 2.92893219e-01]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 6.31345636e-01 6.74877411e-01]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [2.22044605e-16 2.92893219e-01 2.92893219e-01]
 [0.00000000e+00 2.25988989e-03 6.59648353e-01]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 2.98574999e-02 5.42675125e-02]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 6.98488655e-01 8.25922344e-01]]


In [26]:
# row index (song_id category code) of the query song; 3 is picked as an example
index_for_song=3
# indices of the nearest songs (the query song itself is among them)
sim_songs = indices[index_for_song].tolist()
# distances between the query song and its nearest songs
song_distances = distances[index_for_song].tolist()
# the position of the query song in the list sim_songs
id_song = sim_songs.index(index_for_song)
# remove the query song from the list sim_songs
sim_songs.remove(index_for_song)
# remove the matching entry from the list song_distances
song_distances.pop(id_song)
print('The Nearest songs to song_3:', sim_songs)
print('The Distance from song_3:', song_distances)

The Nearest songs to song_3: [91387, 99470]
The Distance from song_3: [0.0, 0.0]
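
A cosine distance of 0.0 means the two songs' listener vectors point in the same direction, which happens, for example, when both songs were played only by the same single user. In that case the query song ties with its neighbors and is not guaranteed to come first in `indices`. A toy sketch (made-up matrix, not the notebook's data) that removes the query index with a mask instead of relying on its position:

```python
import numpy as np
from scipy import sparse
from sklearn.neighbors import NearestNeighbors

# Hypothetical song x user listen counts; songs 0 and 1 share the same single listener
song_user = sparse.csr_matrix(np.array([
    [3.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 5.0],
    [0.0, 1.0, 1.0],
]))

knn = NearestNeighbors(metric="cosine", algorithm="brute")
knn.fit(song_user)
distances, indices = knn.kneighbors(song_user, n_neighbors=2)

query = 0
keep = indices[query] != query            # drop the query song wherever it appears
neighbors = indices[query][keep]
neighbor_distances = distances[query][keep]
print(neighbors, neighbor_distances)
```

Cosine distance ignores vector length, so a song played three times and one played once by the same lone user are treated as identical.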


In [43]:
print(song_df.shape)
print(song_df.columns)
# .copy() avoids pandas' SettingWithCopyWarning when modifying the column subset
df_songid_songtitle=song_df[['song_id','song']].copy()
df_songid_songtitle=df_songid_songtitle.drop_duplicates().sort_values(by=['song_id'])
print(df_songid_songtitle.shape)
df_songid_songtitle.head()

(1450933, 8)
Index(['user_id', 'song_id', 'listen_count', 'total_listenByUser', 'track_id',
       'artist_name', 'song_title', 'song'],
      dtype='object')
(163206, 2)


Unnamed: 0,song_id,song
1026279,SOAAAFI12A6D4F9C66,The Less You See - I Love You But I've Chosen ...
466551,SOAAAGK12AB0189572,Grateful - Au Revoir Simone
75080,SOAAAGQ12A8C1420C8,Orgelblut - Bohren & Der Club Of Gore
237736,SOAAAMT12AB018C9C4,Flirted With You All My Life - Vic Chesnutt
1104630,SOAAAQN12AB01856D3,Campeones De La Vida - Alejandro Lerner


In [48]:
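# kNN indices are song_id category codes (sorted order), so they match row positions in the song_id-sorted table
print(df_songid_songtitle.iloc[[3]])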
print(df_songid_songtitle.iloc[[91387]])
print(df_songid_songtitle.iloc[[99470]])

                   song_id                                         song
237736  SOAAAMT12AB018C9C4  Flirted With You All My Life - Vic Chesnutt
                   song_id                                        song
237732  SONYIWJ12AB0190329  We Hovered With Short Wings - Vic Chesnutt
                   song_id                   song
237727  SOPFWNQ12AB018C9DE  Granny - Vic Chesnutt
