## Assignment 6
Author - Shashank Thakre

In [1]:
import numpy as np
import pandas as pd
from scipy.spatial.distance import cosine
import warnings
warnings.filterwarnings("ignore")

## 1. Collaborative Filtering

#### Read the data

In [2]:
data = pd.read_csv('radio_songs.csv')
data.head()

Unnamed: 0,user,abba,ac/dc,adam green,aerosmith,afi,air,alanis morissette,alexisonfire,alicia keys,...,timbaland,tom waits,tool,tori amos,travis,trivium,u2,underoath,volbeat,yann tiersen
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,33,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,42,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,51,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,62,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Collaborative Filtering

##### Use this user-item matrix to:

A. Recommend 10 songs to users who have listened to 'u2' and 'pink floyd'. Use item-item collaborative filtering to find similar songs, using the cosine distance from `scipy.spatial.distance`. Because `cosine` returns a distance, subtract it from 1 to get the similarity, as shown below.
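
To make the distance/similarity relationship concrete, here is a minimal sketch that computes cosine similarity by hand for two toy listening columns. The function name `cosine_similarity` and the vectors `song_x` and `song_y` are illustrative assumptions, not part of the assignment data.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical listening flags for two songs across four users
song_x = [1, 0, 1, 1]
song_y = [1, 1, 0, 1]

similarity = cosine_similarity(song_x, song_y)  # 2 / (sqrt(3) * sqrt(3)), about 0.667
distance = 1 - similarity                        # what a cosine *distance* reports
```

Note that this sketch raises `ZeroDivisionError` for an all-zero vector (a song nobody listened to), so such columns would need separate handling.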

In [3]:
# Create a new data frame for only items (songs)
data_items = data.drop(['user'], axis = 1)
data_items.head()

Unnamed: 0,abba,ac/dc,adam green,aerosmith,afi,air,alanis morissette,alexisonfire,alicia keys,all that remains,...,timbaland,tom waits,tool,tori amos,travis,trivium,u2,underoath,volbeat,yann tiersen
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
# Create an empty data frame to store the item-item collaborative filtering data
data_item_based = pd.DataFrame(index = data_items.columns, columns = data_items.columns)

In [5]:
# Fill the item-item data frame with cosine similarities between items.
# The matrix is square and symmetric, with one row and one column per item.

# Outer loop: item i (row of the similarity matrix)
for i in range(0, len(data_item_based.columns)):
    
    # Inner loop: item j (column of the similarity matrix)
    for j in range(0, len(data_item_based.columns)):
        # cosine() returns a distance, so 1 - distance gives the similarity
        data_item_based.iloc[i,j] = 1 - cosine(data_items.iloc[:,i], data_items.iloc[:,j])

data_item_based.head()

Unnamed: 0,abba,ac/dc,adam green,aerosmith,afi,air,alanis morissette,alexisonfire,alicia keys,all that remains,...,timbaland,tom waits,tool,tori amos,travis,trivium,u2,underoath,volbeat,yann tiersen
abba,1,0.0,0,0,0.0,0,0,0,0,0,...,0,0,0,0,,0,0.0,0,0.0,0
ac/dc,0,1.0,0,0,0.223607,0,0,0,0,0,...,0,0,0,0,,0,0.223607,0,0.2,0
adam green,0,0.0,1,0,0.0,0,0,0,0,0,...,0,0,0,0,,0,0.0,0,0.0,0
aerosmith,0,0.0,0,1,0.0,0,0,0,0,0,...,0,0,0,0,,0,0.0,0,0.0,0
afi,0,0.223607,0,0,1.0,0,0,0,0,0,...,0,0,0,0,,0,0.0,0,0.0,0
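
The double loop above calls `cosine` once per pair of items, which can get slow as the number of items grows. As a hedged alternative, the same matrix can be computed in one step with matrix algebra. This is a sketch on a toy frame; `cosine_similarity_matrix` and `toy` are illustrative names, and in this sketch an all-zero column produces NaN.

```python
import numpy as np
import pandas as pd

def cosine_similarity_matrix(frame: pd.DataFrame) -> pd.DataFrame:
    """Column-by-column cosine similarity of a numeric data frame."""
    values = frame.to_numpy(dtype=float)
    norms = np.linalg.norm(values, axis=0)   # one norm per column
    dots = values.T @ values                 # all pairwise dot products
    # A zero-norm column gives 0/0, which we let become NaN without a warning
    with np.errstate(divide="ignore", invalid="ignore"):
        sims = dots / np.outer(norms, norms)
    return pd.DataFrame(sims, index=frame.columns, columns=frame.columns)

# Hypothetical user-item frame: three users, three songs, song "c" has no listeners
toy = pd.DataFrame({"a": [1, 0, 1], "b": [1, 1, 0], "c": [0, 0, 0]})
item_sims = cosine_similarity_matrix(toy)
```

The result has a float dtype, unlike the object-dtype frame filled cell by cell, which makes later sorting and arithmetic more predictable.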


In [6]:
# Create a placeholder data frame for the 10 closest neighbours of each item
data_neighbours = pd.DataFrame(index=data_items.columns,columns=range(1,11))
data_neighbours

Unnamed: 0,1,2,3,4,5,6,7,8,9,10
abba,,,,,,,,,,
ac/dc,,,,,,,,,,
adam green,,,,,,,,,,
aerosmith,,,,,,,,,,
afi,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...
trivium,,,,,,,,,,
u2,,,,,,,,,,
underoath,,,,,,,,,,
volbeat,,,,,,,,,,


In [7]:
# Loop through the similarity data frame and fill in the names of the neighbouring items.
# For each item, sort its column of data_item_based in descending order to get the 10 most similar items.
# The slice [1:11] skips position 0, which is the item itself (its similarity with itself is 1),
# because we don't want to recommend a song the user has already listened to.

for i in range(0,len(data_items.columns)):
    data_neighbours.iloc[i,:10] = data_item_based.iloc[0:,i].sort_values(ascending=False)[1:11].index

In [8]:
# Show the first rows of the neighbours data frame
data_neighbours.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10
abba,hans zimmer,frank sinatra,howard shore,elvis presley,groove coverage,faithless,papa roach,limp bizkit,scooter,nightwish
ac/dc,hammerfall,in extremo,metallica,dream theater,blind guardian,bloodhound gang,nightwish,marilyn manson,frank sinatra,apocalyptica
adam green,nouvelle vague,three days grace,the fray,keane,tegan and sara,belle and sebastian,the strokes,razorlight,farin urlaub,the kooks
aerosmith,staind,maria mena,flogging molly,bad religion,morcheeba,eric clapton,papa roach,audioslave,manu chao,in extremo
afi,paramore,sum 41,breaking benjamin,nofx,anti-flag,blink-182,good charlotte,peter fox,clueso,ramones
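
The `[1:11]` slice used above assumes the item itself always sorts into position 0. If another item ties with it at similarity 1.0, that assumption may not hold. A minimal sketch of a label-based alternative follows; `top_neighbours` and the toy series are hypothetical names and values.

```python
import pandas as pd

def top_neighbours(similarities: pd.Series, item: str, k: int = 10) -> list:
    """Return the k labels most similar to `item`, excluding `item` itself."""
    # Drop the item by label, then coerce to numbers (handles object-dtype columns)
    others = pd.to_numeric(similarities.drop(labels=item), errors="coerce")
    # nlargest ignores NaN entries
    return others.nlargest(k).index.tolist()

# Hypothetical similarity column for song "a"; "b" ties with "a" at 1.0
sims_to_a = pd.Series({"a": 1.0, "b": 1.0, "c": 0.4, "d": 0.0, "e": float("nan")})
neighbours = top_neighbours(sims_to_a, "a", k=2)
```

Here `neighbours` contains "b" and "c": the item is removed by name, so the tie with "b" cannot push "a" into the result.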


In [9]:
# Print 10 songs to recommend to users who have listened to 'u2' and 'pink floyd'.
rows = ['u2', 'pink floyd']
data_neighbours.loc[rows]

Unnamed: 0,1,2,3,4,5,6,7,8,9,10
u2,robbie williams,misfits,green day,depeche mode,peter fox,kelly clarkson,dire straits,enter shikari,madonna,johnny cash
pink floyd,genesis,queen,led zeppelin,sonic syndicate,hans zimmer,funeral for a friend,david bowie,coldplay,howard shore,the rolling stones
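
The output above gives one list of 10 per song. One possible way to get a single list for a user who listened to both songs is to sum each candidate's similarity to the two seed songs and rank by the total. This combination rule is an assumption, not something the assignment prescribes; `recommend_for_seeds` and the toy matrix are hypothetical.

```python
import pandas as pd

def recommend_for_seeds(item_sims: pd.DataFrame, seeds: list, k: int = 10) -> list:
    """Rank items by their summed similarity to all seed items."""
    # Rows of the seed items, summed column-wise (NaN entries are skipped)
    scores = item_sims.loc[seeds].astype(float).sum(axis=0)
    # Never recommend the seeds themselves
    scores = scores.drop(labels=seeds)
    return scores.nlargest(k).index.tolist()

# Hypothetical 4x4 item-item similarity matrix
names = ["s1", "s2", "s3", "s4"]
toy_sims = pd.DataFrame(
    [[1.0, 0.2, 0.5, 0.1],
     [0.2, 1.0, 0.4, 0.3],
     [0.5, 0.4, 1.0, 0.0],
     [0.1, 0.3, 0.0, 1.0]],
    index=names, columns=names,
)
combined = recommend_for_seeds(toy_sims, ["s1", "s2"], k=2)
```

With the real data, the call would presumably be `recommend_for_seeds(data_item_based, ['u2', 'pink floyd'])`.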


### B. Find the user most similar to user 1606. Use user-user collaborative filtering with cosine similarity. List the recommended songs for user 1606 (Hint: find the songs listened to by the most similar user).

In [10]:
# Create an empty data frame to store the user-user collaborative filtering data
data_user_based = pd.DataFrame(index = data.user, columns = data.user)
data_user_based.head()

user,1,33,42,51,62,75,130,141,144,150,...,1521,1530,1536,1545,1549,1566,1586,1589,1601,1606
1,,,,,,,,,,,...,,,,,,,,,,
33,,,,,,,,,,,...,,,,,,,,,,
42,,,,,,,,,,,...,,,,,,,,,,
51,,,,,,,,,,,...,,,,,,,,,,
62,,,,,,,,,,,...,,,,,,,,,,


In [11]:
# Fill the user-user data frame with cosine similarities between users.
# The matrix is square and symmetric, with one row and one column per user.

# Outer loop: user i (row of the similarity matrix)
for i in range(0, len(data_user_based.columns)):
    
    # Inner loop: user j (column of the similarity matrix)
    for j in range(0, len(data_user_based.columns)):
        data_user_based.iloc[i,j] = 1 - cosine(data_items.iloc[i,:], data_items.iloc[j,:])

data_user_based.head()

user,1,33,42,51,62,75,130,141,144,150,...,1521,1530,1536,1545,1549,1566,1586,1589,1601,1606
1,1.0,0.0615457,0.0,0.0,0.0836242,0.0,0,0.0,0.0,0.150756,...,0.1066,0,0.0,0.190693,0.0,0,0.0,0.0,0.0,0
33,0.0615457,1.0,0.0771517,0.247537,0.226455,0.176777,0,0.0,0.0,0.102062,...,0.0,0,0.0645497,0.193649,0.0,0,0.0456435,0.0,0.0912871,0
42,0.0,0.0771517,1.0,0.0,0.0,0.0,0,0.0916698,0.0,0.0,...,0.0,0,0.0,0.0,0.0944911,0,0.0,0.125988,0.0,0
51,0.0,0.247537,0.0,1.0,0.336336,0.140028,0,0.0,0.108465,0.121268,...,0.0,0,0.0766965,0.0,0.0,0,0.0,0.0,0.0,0
62,0.0836242,0.226455,0.0,0.336336,1.0,0.160128,0,0.0672673,0.124035,0.138675,...,0.0,0,0.175412,0.0877058,0.0,0,0.0620174,0.0,0.0,0
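
If only one user's nearest neighbour is needed, the full user-user matrix is not strictly required: a single row of similarities is enough. The sketch below assumes a numeric user-item frame indexed by user id (for example, something like `data.set_index('user')`); `most_similar_user` and `toy_users` are hypothetical names.

```python
import numpy as np
import pandas as pd

def most_similar_user(matrix: pd.DataFrame, user_id):
    """Return the id of the user whose row is closest (by cosine) to user_id."""
    values = matrix.to_numpy(dtype=float)
    target = matrix.loc[user_id].to_numpy(dtype=float)
    norms = np.linalg.norm(values, axis=1) * np.linalg.norm(target)
    # Users with no listens give 0/0 -> NaN, which idxmax skips
    with np.errstate(divide="ignore", invalid="ignore"):
        sims = pd.Series(values @ target / norms, index=matrix.index)
    # Remove the user itself by label before picking the maximum
    return sims.drop(labels=user_id).idxmax()

# Hypothetical user-item frame with three users
toy_users = pd.DataFrame(
    {"song_a": [1, 1, 0], "song_b": [1, 1, 0], "song_c": [0, 1, 1]},
    index=pd.Index([10, 20, 30], name="user"),
)
closest = most_similar_user(toy_users, 10)
```

In this toy frame, user 20 shares both of user 10's songs, so user 20 is returned.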


In [12]:
# Create a placeholder data frame for the 10 closest neighbours of each user
data_user_neighbours = pd.DataFrame(index=data.user,columns=range(1,11))
data_user_neighbours

user,1,2,3,4,5,6,7,8,9,10
1,,,,,,,,,,
33,,,,,,,,,,
42,,,,,,,,,,
51,,,,,,,,,,
62,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...
1566,,,,,,,,,,
1586,,,,,,,,,,
1589,,,,,,,,,,
1601,,,,,,,,,,


In [13]:
# Loop through the user similarity data frame and fill in the IDs of the neighbouring users.
# The slice [1:11] skips position 0, which is the user itself.
for i in range(0,len(data_user_based.columns)):
    data_user_neighbours.iloc[i,:10] = data_user_based.iloc[0:,i].sort_values(ascending=False)[1:11].index

data_user_neighbours.head()

user,1,2,3,4,5,6,7,8,9,10
1,205,1259,1545,1121,1479,975,150,648,1201,504
33,1253,978,477,917,51,1233,62,1376,951,1444
42,1037,972,890,1487,1135,504,917,584,472,1589
51,458,62,1253,319,33,477,422,1022,948,1361
62,458,51,1253,319,477,1444,1201,1487,1135,33


In [14]:
#Find the user most similar to user 1606
data_user_neighbours.loc[1606]

1     1144
2      144
3     1334
4     1509
5      890
6     1259
7      648
8     1174
9      504
10     477
Name: 1606, dtype: object

### Observation 
Based on the output above, the user most similar to user 1606 is user 1144.  
Next, find the songs user 1144 has listened to.

In [15]:
data[data.user == 1144]

Unnamed: 0,user,abba,ac/dc,adam green,aerosmith,afi,air,alanis morissette,alexisonfire,alicia keys,...,timbaland,tom waits,tool,tori amos,travis,trivium,u2,underoath,volbeat,yann tiersen
65,1144,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


In [16]:
# The songs listened to by user 1144

row = data_items[data.user == 1144].iloc[0] # select the row by user id instead of hard-coding its position (65)
res = row.index[row == 1]
res

Index(['beastie boys', 'bob dylan', 'bob marley & the wailers', 'david bowie',
       'elvis presley', 'eric clapton', 'johnny cash', 'pearl jam',
       'pink floyd', 'the beatles', 'the doors', 'the rolling stones',
       'tom waits'],
      dtype='object')

**Answer** -
The songs recommended for user 1606 based on the most similar user are 
'beastie boys', 'bob dylan', 'bob marley & the wailers', 'david bowie', 'elvis presley', 'eric clapton', 'johnny cash', 'pearl jam', 'pink floyd', 'the beatles', 'the doors', 'the rolling stones', 'tom waits'

### C. How many of the recommended songs have already been listened to by user 1606?

In [17]:
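# Songs listened to by the most similar user (user 1144)
recommended = ['beastie boys', 'bob dylan', 'bob marley & the wailers', 'david bowie',
       'elvis presley', 'eric clapton', 'johnny cash', 'pearl jam',
       'pink floyd', 'the beatles', 'the doors', 'the rolling stones',
       'tom waits']
# Select the row for user 1606 and only the recommended columns in a single .loc call
data.loc[data.user == 1606, recommended]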

Unnamed: 0,beastie boys,bob dylan,bob marley & the wailers,david bowie,elvis presley,eric clapton,johnny cash,pearl jam,pink floyd,the beatles,the doors,the rolling stones,tom waits
99,0,0,0,0,1,0,0,0,0,1,0,0,0


**Answer**
Based on the above, user 1606 has already listened to 2 of the 13 recommended songs: Elvis Presley and The Beatles.
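
The count can also be derived programmatically rather than read off the table. The sketch below copies user 1606's flags for the 13 recommended songs from the output above and splits them into songs already heard and songs that are new to the user; the variable names are illustrative.

```python
# Listening flags for user 1606 on the 13 recommended songs, copied from the output above
flags_1606 = {
    'beastie boys': 0, 'bob dylan': 0, 'bob marley & the wailers': 0,
    'david bowie': 0, 'elvis presley': 1, 'eric clapton': 0,
    'johnny cash': 0, 'pearl jam': 0, 'pink floyd': 0, 'the beatles': 1,
    'the doors': 0, 'the rolling stones': 0, 'tom waits': 0,
}

# Songs the user has already listened to, and songs that would be new
already_heard = [song for song, flag in flags_1606.items() if flag == 1]
new_to_user = [song for song, flag in flags_1606.items() if flag == 0]

print(len(already_heard), already_heard)  # 2 ['elvis presley', 'the beatles']
print(len(new_to_user))                   # 11
```

Only the 11 songs in `new_to_user` are genuinely new recommendations for user 1606.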

### D. Use a combined user-item approach to build a recommendation score for each song for each user, using the following steps:

For each song in the user's row, get the top 10 similar songs and their similarity scores.  
For each of the top 10 similar songs, get the user's purchase history.  
Calculate a recommendation score as follows: $\frac{\sum(\text{purchaseHistory} \cdot \text{similarityScore})}{\sum \text{similarityScore}}$  
What are the top 5 song recommendations for user 1606?  

In [18]:
# Helper function to compute the recommendation score (similarity-weighted average of the purchase history)
def getScore(history, similarities):
    return sum(history*similarities)/sum(similarities)
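
A small worked example may help show what the score measures. The sketch below uses plain lists matched by position (the notebook's helper multiplies pandas Series, which align by label) and adds a guard for the case where all similarities are zero. The guard and the name `weighted_score` are assumptions, not part of the assignment.

```python
def weighted_score(history, similarities):
    """Similarity-weighted average of 0/1 purchase flags; 0.0 if all weights are zero."""
    total = sum(similarities)
    if total == 0:
        return 0.0  # assumed fallback: no similar songs means no evidence
    return sum(h * s for h, s in zip(history, similarities)) / total

# Hypothetical values: the user has heard the 1st and 3rd of three similar songs
history = [1, 0, 1]
similarities = [0.5, 0.3, 0.2]

# (1*0.5 + 0*0.3 + 1*0.2) / (0.5 + 0.3 + 0.2), about 0.7
score = weighted_score(history, similarities)
```

A score near 1 means the user has already listened to most of the songs that are highly similar to the candidate; a score near 0 means the user has listened to few of them.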

In [19]:
# Create a placeholder matrix for the recommendation scores, and fill in the user column
data_sims = pd.DataFrame(index=data.index,columns=data.columns)
data_sims.iloc[:,:1] = data.iloc[:,:1]
data_sims.head()

Unnamed: 0,user,abba,ac/dc,adam green,aerosmith,afi,air,alanis morissette,alexisonfire,alicia keys,...,timbaland,tom waits,tool,tori amos,travis,trivium,u2,underoath,volbeat,yann tiersen
0,1,,,,,,,,,,...,,,,,,,,,,
1,33,,,,,,,,,,...,,,,,,,,,,
2,42,,,,,,,,,,...,,,,,,,,,,
3,51,,,,,,,,,,...,,,,,,,,,,
4,62,,,,,,,,,,...,,,,,,,,,,


In [20]:
# Fill the data_sims matrix

for i in range(0, len(data_sims.index)): # loop through all the rows
    for j in range(1, len(data_sims.columns)): # loop through all the columns except the first column (user)
        row_label = data_sims.index[i] # row label of the current user (not the user id)
        product = data_sims.columns[j] # song name
        
        if data.iloc[i, j] == 1: # the user has already listened to this song
            data_sims.iloc[i, j] = 0 # don't recommend a song the user has already listened to
        
        else:
            product_top_names = data_neighbours.loc[product][0:10] # the 10 songs most similar to this song
            
            # Skip position 0 (the song itself) so the similarity scores
            # line up with the songs in product_top_names above
            product_top_sims = data_item_based.loc[product].sort_values(ascending=False)[1:11]
            
            user_purchases = data_items.loc[row_label, product_top_names]
            
            # Use .iloc[i, j] rather than chained .iloc[i][j] so the assignment writes to data_sims itself
            data_sims.iloc[i, j] = getScore(user_purchases, product_top_sims)

In [21]:
data_sims.head()

Unnamed: 0,user,abba,ac/dc,adam green,aerosmith,afi,air,alanis morissette,alexisonfire,alicia keys,...,timbaland,tom waits,tool,tori amos,travis,trivium,u2,underoath,volbeat,yann tiersen
0,1,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0,...,0.0769574,0,0,0,,0.0,0,0.0882315,0.0944105,0
1,33,0.0,0.0,0.0,0.0,0.20807,0.0,0,0.0943774,0,...,0.0,0,0,0,,0.0,0,0.0882315,0.0,0
2,42,0.173849,0.206181,0.0,0.0720733,0.0,0.0,0,0.0,0,...,0.0,0,0,0,,0.0899993,0,0.0,0.0,0
3,51,0.0,0.0,0.188449,0.0,0.0813287,0.0955478,0,0.0,0,...,0.0,0,0,0,,0.0,0,0.0,0.0,0
4,62,0.0,0.0,0.134715,0.0,0.178129,0.0,0,0.0,0,...,0.217462,0,0,0,,0.0,0,0.101881,0.0944105,0


In [22]:
# Get the top songs
data_recommend = pd.DataFrame(index=data_sims.index, columns=['user','1','2','3','4','5','6','7','8','9','10'])
data_recommend.iloc[0:,0] = data_sims.iloc[:,0]

In [23]:
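# Instead of the top scores, we want to see the song names.
# The slice [1:11] skips position 0, which is expected to be the 'user' column:
# this relies on the user id sorting ahead of every score.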
for i in range(0,len(data_sims.index)):
    data_recommend.iloc[i,1:] = data_sims.iloc[i,:].sort_values(ascending=False).iloc[1:11,].index.transpose()

In [24]:
# Print a sample
print (data_recommend.iloc[:10,:4])

  user                1                  2               3
0    1      korpiklaani     kelly clarkson         nirvana
1   33          placebo          gentleman      bloc party
2   42  subway to sally     marilyn manson       rammstein
3   51      the subways            justice   kaiser chiefs
4   62          incubus        the strokes       green day
5   75              afi          blink-182  good charlotte
6  130            bjork  alanis morissette             air
7  141           slayer        amon amarth      arch enemy
8  144        the kooks  bruce springsteen     the streets
9  150      evanescence            placebo    judas priest


In [25]:
# List the recommended songs for user 1606
data_recommend[data_recommend.user == 1606]

Unnamed: 0,user,1,2,3,4,5,6,7,8,9,10
99,1606,eric clapton,howard shore,david bowie,dream theater,apocalyptica,hans zimmer,manu chao,kings of leon,bob marley & the wailers,porcupine tree


**Answer** - The top 5 songs recommended for user 1606 are Eric Clapton, Howard Shore, David Bowie, Dream Theater, and Apocalyptica.
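
The ranking step above skips the `user` column by position after sorting. A hedged alternative is to drop that column by name before ranking, so the result does not depend on how the user id compares with the scores. `top_n_songs` and `toy_row` are hypothetical.

```python
import pandas as pd

def top_n_songs(score_row: pd.Series, n: int = 5) -> list:
    """Return the n song names with the highest scores in one user's row."""
    # Remove the id column by label, then coerce scores to numbers (NaN if not numeric)
    scores = pd.to_numeric(score_row.drop(labels="user"), errors="coerce")
    # nlargest ignores NaN scores
    return scores.nlargest(n).index.tolist()

# Hypothetical score row for one user
toy_row = pd.Series(
    {"user": 7, "song_a": 0.10, "song_b": 0.35, "song_c": float("nan"), "song_d": 0.20}
)
top_two = top_n_songs(toy_row, n=2)
```

Here `top_two` holds "song_b" and "song_d", and the user id can never appear among the recommendations.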

## 2. Conceptual questions:

#### 1. Name 2 other similarity measures that you can use instead of cosine similarity above.

**Answer** - Pearson correlation and Jaccard similarity are two other similarity measures that could be used.
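
As an illustration, both measures can be written in a few lines for binary listening vectors. This is a minimal sketch with hypothetical function names and toy vectors; the zero-union fallback in `jaccard_similarity` is an assumption.

```python
import math

def jaccard_similarity(u, v):
    """Shared listens divided by listens in either vector (binary data)."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return both / either if either else 0.0  # assumed fallback for two empty vectors

def pearson_correlation(u, v):
    """Covariance of u and v divided by the product of their standard deviations."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

# Hypothetical listening flags for two songs across four users
song_x = [1, 0, 1, 1]
song_y = [1, 1, 0, 1]

jaccard = jaccard_similarity(song_x, song_y)   # 2 shared / 4 in either = 0.5
pearson = pearson_correlation(song_x, song_y)  # -0.25 / 0.75, about -0.333
```

Unlike cosine similarity, Pearson correlation centres each vector on its mean, so it can be negative even for 0/1 data. It is undefined for a constant vector, where this sketch would divide by zero.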

#### 2. What is needed to build a Content-Based Recommender system?  

**Answer** - A content-based recommender system uses the features of the items themselves to determine how similar they are. This can give better recommendations because it relies on features of the item rather than on who purchased it. However, it requires upfront work to define and collect the features of every item. We then compute the similarity between items based on those features and gather each user's purchase history. Finally, we use the similarity scores to find the items most similar to the ones a user has bought and recommend the top-scoring ones to that user.
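
The steps described above can be sketched as follows. The item names, the three feature columns, and the function `recommend_by_content` are all hypothetical; real features would come from item metadata such as genre or tempo.

```python
import numpy as np

# Hypothetical item features: columns are [rock, electronic, classical]
item_names = ["song_a", "song_b", "song_c", "song_d"]
features = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.2, 0.0, 1.0],
])

def recommend_by_content(liked, item_names, features, top_n=2):
    """Rank unseen items by cosine similarity to the mean feature vector of liked items."""
    liked_idx = [item_names.index(name) for name in liked]
    profile = features[liked_idx].mean(axis=0)  # the user's taste profile
    norms = np.linalg.norm(features, axis=1) * np.linalg.norm(profile)
    sims = features @ profile / norms           # cosine similarity per item
    order = np.argsort(-sims)                   # most similar first
    ranked = [item_names[i] for i in order if item_names[i] not in liked]
    return ranked[:top_n]

recommendations = recommend_by_content(["song_a"], item_names, features)
```

For a user who liked only "song_a", this ranks "song_b" first and "song_d" second, since those share the most feature weight with "song_a". No other user's history is involved, which is the key difference from collaborative filtering.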

#### 3. Name 2 methods to evaluate your recommender system.

**Answer** - With decision-support metrics, precision and recall evaluate the performance. With accuracy- and error-based metrics, we can use MAE (Mean Absolute Error) or MSE (Mean Squared Error).
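
For reference, a minimal sketch of these four metrics is shown below. The function names, the held-out `relevant` set, and the toy predictions are assumptions for illustration only.

```python
def precision_recall(recommended, relevant):
    """Precision: share of recommendations that are relevant. Recall: share of relevant items recommended."""
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def mae(predicted, actual):
    """Mean Absolute Error between predicted scores and true values."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def mse(predicted, actual):
    """Mean Squared Error between predicted scores and true values."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical evaluation data: 4 recommendations, 3 songs the user actually liked
precision, recall = precision_recall(["a", "b", "c", "d"], {"a", "c", "e"})  # 2/4 and 2/3

# Hypothetical predicted scores versus true 0/1 listens
abs_error = mae([0.5, 0.0, 1.0], [1, 0, 1])  # 0.5 / 3
sq_error = mse([0.5, 0.0, 1.0], [1, 0, 1])   # 0.25 / 3
```

Precision and recall judge the recommended list as a set of hits and misses, while MAE and MSE judge how close the predicted scores are to the true values.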