# Build a Song Recommender System
## Due: Wednesday, May 29th, by 11:59 pm on Canvas

In this assignment you will

- Explore the song data set.
- Build two different song recommendation models and make comparisons between the two.
- Investigate the song recommendations of the item similarity model.
- For a given song, find the most similar songs.

Copyright ©2018 Emily Fox. All rights reserved. Permission is hereby granted to students registered for University of Washington CSE/STAT 416 for use solely during Spring Quarter 2019 for purposes of the course. No other use, copying, distribution, or modification is permitted without prior written consent. Copyrights for third-party components of this work must be honored. Instructors interested in reusing these course materials should contact the author.

In [1]:
import pandas as pd
import numpy as np
import operator
from sklearn.model_selection import train_test_split
from sklearn.decomposition import NMF

In [4]:
song_data = pd.read_csv(filepath_or_buffer = '/data/song_data.csv')
song_data = song_data.iloc[0:50000]
song_data

Unnamed: 0,user_id,song_id,listen_count,title,artist,song
0,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAKIMP12A8C130995,1,The Cove,Jack Johnson,The Cove - Jack Johnson
1,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBBMDR12A8C13253B,2,Entre Dos Aguas,Paco De Lucia,Entre Dos Aguas - Paco De Lucia
2,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBXHDL12A81C204C0,1,Stronger,Kanye West,Stronger - Kanye West
3,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBYHAJ12A6701BF1D,1,Constellations,Jack Johnson,Constellations - Jack Johnson
4,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODACBL12A8C13C273,1,Learn To Fly,Foo Fighters,Learn To Fly - Foo Fighters
5,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODDNQT12A6D4F5F7E,5,Apuesta Por El Rock 'N' Roll,Héroes del Silencio,Apuesta Por El Rock 'N' Roll - Héroes del Sile...
6,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODXRTY12AB0180F3B,1,Paper Gangsta,Lady GaGa,Paper Gangsta - Lady GaGa
7,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOFGUAY12AB017B0A8,1,Stacked Actors,Foo Fighters,Stacked Actors - Foo Fighters
8,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOFRQTD12A81C233C0,1,Sehr kosmisch,Harmonia,Sehr kosmisch - Harmonia
9,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOHQWYZ12A6D4FA701,1,Heaven's gonna burn your eyes,Thievery Corporation feat. Emiliana Torrini,Heaven's gonna burn your eyes - Thievery Corpo...


## Q1) Of the following artists, which has the largest number of unique users who have listened to them in this data set?

- Kanye West  
- Foo Fighters  
- Taylor Swift (o)
- Lady GaGa

In [17]:
# Student writes code here
print(song_data.columns)
song_data.groupby(['artist'])['user_id'].count().nlargest(11)

Index(['user_id', 'song_id', 'listen_count', 'title', 'artist', 'song'], dtype='object')


artist
Coldplay                  676
Florence + The Machine    515
Kings Of Leon             459
OneRepublic               365
Taylor Swift              361
The Black Keys            361
Eminem                    358
Justin Bieber             356
Daft Punk                 354
Train                     340
Radiohead                 315
Name: user_id, dtype: int64
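
Note that `count()` tallies rows (one per user-song pair), so a user who played several songs by the same artist is counted more than once. A minimal sketch of counting distinct users instead, on a made-up toy frame (the rows below are illustrative and not taken from `song_data`):

```python
import pandas as pd

# Toy listening log; the values are invented for illustration only.
toy = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u3', 'u3'],
    'artist':  ['A',  'A',  'A',  'B',  'B'],
})

# count() tallies rows per artist; nunique() tallies distinct users per artist.
rows_per_artist = toy.groupby('artist')['user_id'].count()
users_per_artist = toy.groupby('artist')['user_id'].nunique()

print(rows_per_artist.to_dict())   # {'A': 3, 'B': 2}
print(users_per_artist.to_dict())  # {'A': 2, 'B': 1}
```

Whether the two counts differ on the real data depends on how many users listened to more than one song by the same artist.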

## Q2) Of the users that have listened to "Where Is My Mind?" by The Pixies (song_id: SOBBKGF12A8C1311EE), what is the average number of times they listened to that particular song? State answer in decimal format to nearest 0.01.

In [18]:
# Student writes code here
song_data[song_data['song_id'] == 'SOBBKGF12A8C1311EE']['listen_count'].mean()

4.5

## Q3) Identify the song that has the largest total listen count across all users. Who is the artist of this song?

Hint: Consider using pandas.DataFrame.groupby(), .sum(), and pandas.Series.idxmax().

- Kings of Leon
- Dwight Yoakam (o)
- Coldplay
- Righteous Pigs


In [33]:
# Student writes code here
max_song = song_data.groupby(['song_id'])['listen_count'].sum().idxmax()
song_data[song_data['song_id'] == max_song]['artist'].head()

91     Dwight Yoakam
182    Dwight Yoakam
348    Dwight Yoakam
535    Dwight Yoakam
646    Dwight Yoakam
Name: artist, dtype: object

## Relevance of top 10 song recommendations

Randomly partition the data into 80% training and 20% test using seed = 1 (the code below sets random_state = 1). Build two song recommendation models.
- The first model is a popularity model that recommends the 10 most popular songs (in order of popularity) to every user, regardless of their listening history. Popularity is measured by the number of users who listened to the song.
- The second model is an item similarity model that uses item-item similarities based on users in common.
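
One common way to define item-item similarity from users in common is the Jaccard index: the number of users who listened to both songs divided by the number who listened to either. The sketch below assumes that definition (the assignment text does not name a specific measure) and uses invented toy data; `jaccard` and `most_similar` are hypothetical helper names.

```python
from collections import defaultdict

# Toy (user, song) listening pairs; invented for illustration only.
listens = [
    ('u1', 'Song A'), ('u1', 'Song B'),
    ('u2', 'Song A'), ('u2', 'Song B'), ('u2', 'Song C'),
    ('u3', 'Song C'),
]

# Map each song to the set of users who listened to it.
listeners = defaultdict(set)
for user, song in listens:
    listeners[song].add(user)

def jaccard(song_x, song_y):
    """Users in common divided by users who heard either song."""
    both = listeners[song_x] & listeners[song_y]
    either = listeners[song_x] | listeners[song_y]
    return len(both) / len(either) if either else 0.0

def most_similar(song, k):
    """Return the k other songs with the highest Jaccard similarity."""
    others = [s for s in listeners if s != song]
    return sorted(others, key=lambda s: jaccard(song, s), reverse=True)[:k]

print(jaccard('Song A', 'Song B'))  # 1.0
print(jaccard('Song A', 'Song C'))  # 0.3333333333333333
print(most_similar('Song A', 1))    # ['Song B']
```

Songs A and B share both of their listeners, so they score 1.0; songs A and C share one listener out of three.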

### Create a bag of songs for each user.

This is a matrix where rows represent users and columns represent songs. Matrix(i,j) = number of times user i heard song j.
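
As an illustration of this layout, the sketch below builds such a matrix from a made-up three-row log with `pivot_table`. It is an alternative to an explicit loop over users, not necessarily the construction intended for the assignment, and the toy values are invented.

```python
import pandas as pd

# Toy listening log; values are invented for illustration only.
toy = pd.DataFrame({
    'user_id':      ['u1', 'u1', 'u2'],
    'song':         ['Song A', 'Song B', 'Song A'],
    'listen_count': [3, 1, 5],
})

# Rows are users, columns are songs, entries are listen counts (0 if unheard).
bag_of_songs = toy.pivot_table(index='user_id', columns='song',
                               values='listen_count', aggfunc='sum',
                               fill_value=0)

print(bag_of_songs.loc['u1', 'Song A'])  # 3
print(bag_of_songs.loc['u2', 'Song B'])  # 0
print(bag_of_songs.shape)                # (2, 2)
```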

In [34]:
#make sure random_state = 1
#use this for popularity model
train, test = train_test_split(song_data, test_size = 0.2, random_state = 1)

In [35]:
#Making the list of users and songs
songs = list(song_data['song'].unique())
users_train = list(train['user_id'].unique())
users_test = set(test['user_id'].unique())

#Making the user-by-song listen-count matrix function
def similarity_matrix(data):
    users = list(data['user_id'].unique())
    #rows are users, columns follow the order of the global songs list
    matrix = np.zeros((len(users), len(songs)))
    for i, user in enumerate(users):
        user_rows = data[data['user_id'] == user]
        for song, times in zip(user_rows['song'], user_rows['listen_count']):
            matrix[i][songs.index(song)] = times
    return matrix

In [36]:
#use these for similarity model
similarity_train = similarity_matrix(train)
similarity_test = similarity_matrix(test)

In [37]:
#Converting to low rank using NMF.
#The inverse transform approximates the listen counts of each song for each user.
#Thus, the prediction is the k songs with the highest approximated listen counts that the user has not heard.
model = NMF(n_components = 10)
out = model.fit_transform(similarity_train)
similarity_train = model.inverse_transform(out)
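
The low-rank idea can be seen on a tiny matrix. The following sketch uses the same scikit-learn `NMF` calls on invented counts; the `init`, `random_state`, and `max_iter` settings and the helper `top_unheard` are assumptions chosen for this illustration, not settings from the assignment.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy user-by-song listen counts (3 users, 4 songs); invented for illustration.
counts = np.array([
    [5.0, 3.0, 0.0, 0.0],
    [4.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 4.0, 5.0],
])

# Factor counts ~= W @ H with rank 2, then rebuild the dense approximation.
model = NMF(n_components=2, init='random', random_state=0, max_iter=1000)
W = model.fit_transform(counts)        # user factors, shape (3, 2)
approx = model.inverse_transform(W)    # equals W @ model.components_

def top_unheard(user_index, k):
    """Indices of the k unheard songs with the highest approximated counts."""
    scores = approx[user_index].copy()
    scores[counts[user_index] > 0] = -np.inf  # exclude songs already heard
    return [int(j) for j in np.argsort(-scores)[:k]]

print(approx.shape)       # (3, 4)
print(top_unheard(1, 1))  # index of one song that user 1 has not heard
```

The zeros in `counts` are replaced by small non-negative estimates in `approx`, which is what lets unheard songs be ranked.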

In [38]:
def popularity_model(number_of_songs):
    out = train.groupby(by = ['song'], as_index=False)['user_id'].count().\
    nlargest(number_of_songs,'user_id', keep = 'first').iloc[0:number_of_songs]
    out = out.rename(columns = {'user_id': 'total_users'})
    return list(out['song'])

In [39]:
def similarity_model(user_id, number_of_songs):
    if number_of_songs < 0:
        print('number_of_songs must be non negative')
        return None
    if (user_id in users_train):
        user_heard_songs = set(train[train['user_id'] == user_id]['song'])
        index = users_train.index(user_id)
        user_vector = similarity_train[index].copy()
        
        songs_added = 0
        to_return = []
        while songs_added < number_of_songs:
            best_song_index = np.argmax(user_vector)
            user_vector[best_song_index] = -float('inf')
            best_song = songs[best_song_index]
            if best_song not in user_heard_songs:
                to_return.append(best_song)
                songs_added += 1
        return to_return
    #If the user is not in the training data, we return the most popular songs.
    #This is called the cold start problem.
    #Imagine that you make a new YouTube account. What does YouTube recommend to you?
    #It has no data on you, so it will recommend the most popular videos.
    #That is what is happening here.
    else:
        return popularity_model(number_of_songs)

## Check that these two models are giving different recommendations for a given user

In [40]:
popularity_model(10)

['Sehr kosmisch - Harmonia',
 'Undo - Björk',
 "You're The One - Dwight Yoakam",
 'Dog Days Are Over (Radio Edit) - Florence + The Machine',
 'Revelry - Kings Of Leon',
 'Secrets - OneRepublic',
 'Horn Concerto No. 4 in E flat K495: II. Romance (Andante cantabile) - Barry Tuckwell/Academy of St Martin-in-the-Fields/Sir Neville Marriner',
 'Fireflies - Charttraxx Karaoke',
 'Tive Sim - Cartola',
 'Hey_ Soul Sister - Train']

In [41]:
similarity_model(users_train[0], 10)

['Dancing Shoes - Arctic Monkeys',
 "Ain't Misbehavin - Sam Cooke",
 'Tive Sim - Cartola',
 'Somebody To Love - Justin Bieber',
 'One Time - Justin Bieber',
 'Représente - Alliance Ethnik',
 "Everything You Touch We Touch First (America'S Mexican) - George Lopez",
 'Alright - Mystikal',
 'Catch You Baby (Steve Pitron & Max Sanna Radio Edit) - Lonnie Gordon',
 'Moog Island - Morcheeba']

In [42]:
def precision_recall(user_list, model_name, number_of_songs):
    true_positive = 0
    false_positive = 0
    true_negative = 9200
    false_negative = 0
    if type(model_name) != str:
        print('model_name parameter must be a string')
        return None
    if number_of_songs < 0:
        print('number_of_songs should be non negative')
        return None
    if number_of_songs > 80:
        print('number_of_songs is too high')
    if model_name.lower() != 'popularity' and model_name.lower() != 'similarity' :
        print('model_name must be \'popularity\' or \'similarity\'')
        return None
    for user in user_list:
        if user not in users_test:
            true_negative -= len(train[train['user_id'] == user]['song'])
            continue
        if model_name.lower() == 'popularity':
            predictions = popularity_model(number_of_songs)
        elif model_name.lower() == 'similarity':
            predictions = similarity_model(user, number_of_songs)
        truth = set(test[test['user_id'] == user]['song'])
        for prediction in predictions:
            if prediction in truth:
                true_positive += 1
                true_negative -= 1
            else:
                false_positive += 1
                true_negative -= 1
        for song in truth:
            if song not in predictions:
                false_negative += 1
                true_negative -= 1
    precision = true_positive / (true_positive + false_positive)
    recall = true_positive / (true_positive + false_negative)
    print('Precision is: ' + str(precision))
    print('Recall is: ' + str(recall))
    return (precision, recall)
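
For a single user, the bookkeeping in `precision_recall` reduces to set arithmetic: precision is the share of recommended songs that appear among the user's test-set songs, and recall is the share of the user's test-set songs that were recommended. A minimal sketch on invented song names (`precision_recall_one_user` is a hypothetical helper, not part of the assignment code):

```python
def precision_recall_one_user(predictions, truth):
    """Precision and recall of a recommendation list against a set of relevant songs."""
    predicted = set(predictions)
    relevant = set(truth)
    true_positive = len(predicted & relevant)
    false_positive = len(predicted - relevant)
    false_negative = len(relevant - predicted)
    # Guard against empty denominators.
    precision = true_positive / (true_positive + false_positive) if predicted else 0.0
    recall = true_positive / (true_positive + false_negative) if relevant else 0.0
    return precision, recall

# 4 recommendations, 1 of which the user actually played; 2 relevant songs in total.
p, r = precision_recall_one_user(
    predictions=['Song A', 'Song B', 'Song C', 'Song D'],
    truth={'Song B', 'Song E'},
)
print(p, r)  # 0.25 0.5
```

The function above sums the true positive, false positive, and false negative counts over all users before dividing, so its result is a pooled figure rather than an average of per-user values.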

## Q4) If the popularity model were used to show 10 results to users, what fraction of the results would be relevant on average? State answer in decimal format to nearest 0.01.

In [43]:
# students write code here
precision_recall(users_train, 'popularity', 10)
#

Precision is: 0.013205374280230326
Recall is: 0.034472392023248824


(0.013205374280230326, 0.034472392023248824)

## Q5) If the item similarity model were used to show 10 results to users, what fraction of the results would be relevant on average? State answer in decimal format to nearest 0.01.

In [45]:
# students write code here
precision_recall(users_train, 'similarity', 10)

Precision is: 0.009366602687140116
Recall is: 0.02445134783044393


(0.009366602687140116, 0.02445134783044393)

## Top song recommendation for a subset of users

We will use only the similarity model for the remainder of the assignment. For Q6, take a subset of users **(the first 1000 users)** from the test set. Using the item similarity model, get the top song recommendation for each of these users; refer to the item_similarity documentation on how to do this.

In [46]:
users_test_subset = []

i = 0
for val in iter(users_test):
    if i == 1000:
        break
    users_test_subset.append(val)
    i += 1

## Q6) Which song is the most frequent top song recommendation from the item similarity model for this subset of users?

### Set the number of songs to 10 when retrieving recommendations

- "Secrets" by OneRepublic
- "Undo" by Bjork (o)
- "The Scientist" by Coldplay
- "Hey Soul Sister" by Train

In [48]:
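# Students write code here
# Note: similarity_model expects a single user_id. Passing the whole list fails the
# `user_id in users_train` check, so this call returns the popularity fallback.
similarity_model(users_test_subset, 10)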

['Sehr kosmisch - Harmonia',
 'Undo - Björk',
 "You're The One - Dwight Yoakam",
 'Dog Days Are Over (Radio Edit) - Florence + The Machine',
 'Revelry - Kings Of Leon',
 'Secrets - OneRepublic',
 'Horn Concerto No. 4 in E flat K495: II. Romance (Andante cantabile) - Barry Tuckwell/Academy of St Martin-in-the-Fields/Sir Neville Marriner',
 'Fireflies - Charttraxx Karaoke',
 'Tive Sim - Cartola',
 'Hey_ Soul Sister - Train']
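
Because `similarity_model` takes one `user_id` at a time, answering this question means calling it once per user and tallying the first recommendation. A sketch of that tally, using a hypothetical stand-in recommender (`toy_recommend`) and invented data so the block runs on its own; in the notebook the per-user call would presumably be `similarity_model(user, 10)`:

```python
from collections import Counter

# Hypothetical ranked recommendations per user; invented for illustration only.
toy_recommendations = {
    'u1': ['Song B', 'Song A'],
    'u2': ['Song B', 'Song C'],
    'u3': ['Song A', 'Song B'],
}

def toy_recommend(user_id, number_of_songs):
    """Stand-in for a per-user recommender returning a ranked song list."""
    return toy_recommendations[user_id][:number_of_songs]

# Take the top-ranked song for each user and count how often each one appears.
top_songs = Counter(toy_recommend(user, 10)[0] for user in toy_recommendations)

most_common_song, frequency = top_songs.most_common(1)[0]
print(most_common_song, frequency)  # Song B 2
```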

## Q7) What percentage of rows in the test set have listen count 1? State answer in decimal format to nearest 0.01. 

Note: Unique is not required here

In [49]:
# Students write code here
len(test[test['listen_count'] == 1])/len(test)

0.5671

## Q8) Compared to the entire test set, how do the precision and recall change if we were to evaluate our item similarity model only on rows (of the test set) with listen count greater than 100? Assume 10 songs are recommended for both populations.

- both precision and recall go down
- precision goes up, recall goes down
- precision goes down, recall goes up
- both precision and recall go up (o)

In [50]:
users_test_subset = []

for user in users_test:
    sum_of_listen_counts = sum(test[test['user_id'] == user]['listen_count'])
    if sum_of_listen_counts > 100:
        users_test_subset.append(user)

In [51]:
# Students write solution here
precision_recall(users_test, 'similarity', 10)


Precision is: 0.009316533027873233
Recall is: 0.024402440244024402


(0.009316533027873233, 0.024402440244024402)

In [52]:
precision_recall(users_test_subset, 'similarity', 10)

Precision is: 0.04090909090909091
Recall is: 0.05806451612903226


(0.04090909090909091, 0.05806451612903226)