<center><h2> Collaborative Memory Based Recommendation system </center></h2>

In the memory-based approach, we try to predict a user’s preference based on the ratings given by other similar users or received by other similar items.

Memory-based approaches include: 

* user-based collaborative filtering,
* item-based collaborative filtering

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
import string
import warnings
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.metrics.pairwise import pairwise_distances 
from sklearn.model_selection import train_test_split
warnings.filterwarnings('ignore')

In [6]:
cd drive/MyDrive/Colab Notebooks/rs

/content/drive/MyDrive/Colab Notebooks/rs


In [7]:
song_data = pd.read_csv("song_data.txt", sep = ',')
song_data.head()

Unnamed: 0,user,song,play_count,track_id,artist,title
0,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOBONKR12A58A7A7E0,1,TRAEHHJ12903CF492F,Dwight Yoakam,You're The One
1,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOEGIYH12A6D4FC0E3,1,TRLGMFJ128F4217DBE,Barry Tuckwell/Academy of St Martin-in-the-Fie...,Horn Concerto No. 4 in E flat K495: II. Romanc...
2,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOFLJQZ12A6D4FADA6,1,TRTNDNE128F1486812,Cartola,Tive Sim
3,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOHTKMO12AB01843B0,1,TRASTUE128F930D488,Lonnie Gordon,Catch You Baby (Steve Pitron & Max Sanna Radio...
4,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SODQZCY12A6D4F9D11,1,TRFPLWO128F1486B9E,Miguel Calo,El Cuatrero


In [8]:
song_data1 = song_data[['user', 'song', 'play_count', 'title']]

In [9]:
song_data1

Unnamed: 0,user,song,play_count,title
0,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOBONKR12A58A7A7E0,1,You're The One
1,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOEGIYH12A6D4FC0E3,1,Horn Concerto No. 4 in E flat K495: II. Romanc...
2,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOFLJQZ12A6D4FADA6,1,Tive Sim
3,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOHTKMO12AB01843B0,1,Catch You Baby (Steve Pitron & Max Sanna Radio...
4,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SODQZCY12A6D4F9D11,1,El Cuatrero
...,...,...,...,...
1450928,5e650759ebf89012044c6d52121eeada8b0ec814,SOVLNXV12A6D4F706E,1,Ms. Fat Booty
1450929,5e650759ebf89012044c6d52121eeada8b0ec814,SOVDSJC12A58A7A271,2,Ain't Misbehavin
1450930,5e650759ebf89012044c6d52121eeada8b0ec814,SOBRHVR12A8C133F35,2,Luvstruck
1450931,5e650759ebf89012044c6d52121eeada8b0ec814,SOMGVYU12A8C1314FF,2,Sinisten tähtien alla


In [10]:
song_data1 = song_data1[:200000]

In [11]:
encoding_user_song = preprocessing.OrdinalEncoder()
song_data1['user_id'] = encoding_user_song.fit_transform(song_data1[['user']])
song_data1['song_id'] = encoding_user_song.fit_transform(song_data1[['song']])

In [12]:
song_data1['user_id'] = song_data1['user_id'].astype('int')
song_data1['song_id'] = song_data1['song_id'].astype('int')

In [13]:
train, test = train_test_split(song_data1, test_size=0.30, random_state=31)

In [14]:
train.head()

Unnamed: 0,user,song,play_count,title,user_id,song_id
134777,75832aeeaad3c9292b670b736548b02c3c5ce7ba,SOMESIV12A6D4FC6F2,5,The Kindness Of Strangers,6894,30892
38143,7acceea3d238b5b92fb1d8ae89b7e96d4fd31435,SOVFBUL12A58A7B498,19,Naughty Girl,7222,52096
123781,28e746dacaa2255ddb4f1e474172970a3f515875,SOBADEB12AB018275F,1,Imma Be,2357,2694
95253,b904c9c1f9b27d90c84a3fbeebb0437ab09be586,SOZLWNB12AB0183C48,1,Memorial,10972,61684
142001,72306aabe072a8a716e375ef9d6aee1f412bc7ab,SOYKNOH12A6D4FA9C1,2,Scorpio (LP Version),6691,59378


In [15]:
song_data1.shape

(200000, 6)

In [16]:
print(train.shape, test.shape)

(140000, 6) (60000, 6)


In [17]:
n_users = song_data1.user.nunique()
n_items = song_data1.song.nunique()

In [18]:
print(n_users, n_items)

15101 62887


<h3> User based Recommendation system </h3>

<h4> Create empty data matrix: user*song </h4>

In [19]:
data_matrix = np.zeros((n_users, n_items))

<h4> Fill user*song matrix with rating values </h4> 

In [20]:
for line in train.itertuples():
    data_matrix[line[5]-1, line[6]-1] = line[3]

In [21]:
data_matrix.shape

(15101, 62887)

<h3> Pairwise distance with cosine metric </h3>

In [22]:
user_similarity = 1 - pairwise_distances(data_matrix, metric='cosine')

In [23]:

np.unique(user_similarity)

array([0.00000000e+00, 8.29812631e-06, 1.22425762e-05, ...,
       9.99478524e-01, 9.99839808e-01, 1.00000000e+00])

In [24]:
user_similarity.shape

(15101, 15101)

<h3> Dot product of Data Matrix with User similarity </h3>

In [25]:
item_prediction = np.dot(user_similarity, data_matrix)

In [26]:
item_prediction.shape

(15101, 62887)

In [27]:
prediction_df = pd.DataFrame(item_prediction)

In [28]:
prediction_df.shape

(15101, 62887)

<h3> Song recommendations for any user id </h3>

In [29]:
prediction_df[10].value_counts()

0.000000    14569
0.143223        5
0.138013        5
0.230940        4
0.298142        4
            ...  
0.057574        1
0.093048        1
0.046422        1
0.050369        1
0.081514        1
Name: 10, Length: 456, dtype: int64

In [30]:
user = 'fd50c4007b68a3737fe052d5a4f78ce8aa117f3d'
user_id = song_data1['user_id'][song_data['user']==user]
user = user_id[0]

In [31]:
user

14951

In [32]:
prediction_df.iloc[user].sort_values(ascending=False)[:10]

25413    4.000000
12683    2.847184
43086    2.232495
19929    2.000000
40501    2.000000
60105    1.492405
37234    1.000000
6977     1.000000
49109    1.000000
12763    0.683742
Name: 14951, dtype: float64

In [33]:
recommended_songs_df = pd.DataFrame(prediction_df.iloc[user].sort_values(ascending=False)[:10])

In [34]:
recommended_songs_df.reset_index(inplace=True)
recommended_songs_df.columns = ['song_id', 'score']

In [35]:
song_data2 = song_data1[['song', 'song_id', 'title']].copy()

In [36]:
merged = pd.merge(recommended_songs_df, song_data2, how='left', on='song_id')

In [37]:
merged.drop_duplicates(inplace=True)

In [38]:
merged.reset_index(drop=True)

Unnamed: 0,song_id,score,song,title
0,25413,4.0,SOJYRWX12AF729D9FD,Out Ta Get Me
1,12683,2.847184,SOEXGTH12A58A77D3F,Happy Talk
2,43086,2.232495,SORGWLX12A8C138A6C,So Harlem
3,19929,2.0,SOHTTAW12AF72A5C10,4X4
4,40501,2.0,SOQDHGP12AB0180BB6,Paradox 5
5,60105,1.492405,SOYTERC12A58A7BE66,Dark Clouds
6,37234,1.0,SOOTOBX12A6D4F5E10,She'll Go On You
7,6977,1.0,SOCRPFT12A8C139B1D,Numb Numb
8,49109,1.0,SOTWQKR12A8C13CF82,Oh So Insistent
9,12763,0.683742,SOEXXDY12A8C13F37A,Bohemian Rhapsody


<h3> Normalize the score </h3>

In [39]:

merged['score_normalized'] = (merged['score'] - min(merged['score'])) / (max(merged['score']) - min(merged['score']))

In [40]:
merged

Unnamed: 0,song_id,score,song,title,score_normalized
0,25413,4.0,SOJYRWX12AF729D9FD,Out Ta Get Me,1.0
10,12683,2.847184,SOEXGTH12A58A77D3F,Happy Talk,0.652374
11,43086,2.232495,SORGWLX12A8C138A6C,So Harlem,0.467018
12,19929,2.0,SOHTTAW12AF72A5C10,4X4,0.396911
13,40501,2.0,SOQDHGP12AB0180BB6,Paradox 5,0.396911
14,60105,1.492405,SOYTERC12A58A7BE66,Dark Clouds,0.243848
15,37234,1.0,SOOTOBX12A6D4F5E10,She'll Go On You,0.095366
16,6977,1.0,SOCRPFT12A8C139B1D,Numb Numb,0.095366
18,49109,1.0,SOTWQKR12A8C13CF82,Oh So Insistent,0.095366
22,12763,0.683742,SOEXXDY12A8C13F37A,Bohemian Rhapsody,0.0


<h3> Evaluation </h3>

In [41]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import accuracy_score
from math import sqrt

<h4> MAE </h4>

In [42]:
def mae(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()]
    ground_truth = ground_truth[ground_truth.nonzero()]
    return mean_absolute_error(prediction, ground_truth)

In [43]:
mae(item_prediction,data_matrix)

18.89844895704256

<h4> RMSE </h4>

In [44]:
def rmse(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()]
    ground_truth = ground_truth[ground_truth.nonzero()]
    return sqrt(mean_squared_error(prediction, ground_truth))

In [45]:
rmse(item_prediction,data_matrix)

111.71919245787

<h4> Precision@K </h4>

In [46]:
user_index=550

In [47]:
song_data1.head()

Unnamed: 0,user,song,play_count,title,user_id,song_id
0,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOBONKR12A58A7A7E0,1,You're The One,14951,4050
1,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOEGIYH12A6D4FC0E3,1,Horn Concerto No. 4 in E flat K495: II. Romanc...,14951,11018
2,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOFLJQZ12A6D4FADA6,1,Tive Sim,14951,14039
3,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOHTKMO12AB01843B0,1,Catch You Baby (Steve Pitron & Max Sanna Radio...,14951,19891
4,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SODQZCY12A6D4F9D11,1,El Cuatrero,14951,9496


In [48]:
song_data1[song_data1['user_id']==user_index]

Unnamed: 0,user,song,play_count,title,user_id,song_id
78587,0a0e9d79d901c432ab4280a71bbb9d041c7add0e,SOGPBAW12A6D4F9F22,1,Livin' On A Prayer,550,16936
78588,0a0e9d79d901c432ab4280a71bbb9d041c7add0e,SOHEDRI12A8C13862B,2,Amazing Grace,550,18387
78589,0a0e9d79d901c432ab4280a71bbb9d041c7add0e,SOPKKRD12A6D4F806C,4,The Glory Of Your Name,550,38747
78590,0a0e9d79d901c432ab4280a71bbb9d041c7add0e,SOFGWUJ12AF72A6915,1,Choose,550,13624
78591,0a0e9d79d901c432ab4280a71bbb9d041c7add0e,SOHXRIT12A8C13860F,8,Blessed Assurance,550,20305
78592,0a0e9d79d901c432ab4280a71bbb9d041c7add0e,SOFKIUD12AB017F29C,2,By Our Love,550,13935
78593,0a0e9d79d901c432ab4280a71bbb9d041c7add0e,SOWXOAO12A8C1367E2,2,Hosanna,550,56040
78594,0a0e9d79d901c432ab4280a71bbb9d041c7add0e,SOMAFHP12A6D4FA910,2,Drug Store Truck Drivin' Man (1973 Live Version),550,30477
78595,0a0e9d79d901c432ab4280a71bbb9d041c7add0e,SOYTOLN12AB0183AEB,3,Marvelous Light,550,60134
78596,0a0e9d79d901c432ab4280a71bbb9d041c7add0e,SOYPYHA12AB0180346,2,You Are Able,550,59849


In [49]:
user_relevant_set = song_data1[song_data1['user_id']==user_index].sort_values(by='play_count', ascending=False)['song_id'].tolist()

In [50]:
user_relevant_set

[56685,
 22231,
 20305,
 60707,
 52451,
 38747,
 60134,
 41648,
 17743,
 13935,
 30477,
 59849,
 18387,
 56040,
 13624,
 4627,
 40978,
 51248,
 55281,
 58815,
 7646,
 25786,
 4827,
 16936]

In [51]:
len(user_relevant_set)

24

In [52]:
user_predicted_set = prediction_df.iloc[user_index].sort_values(ascending=False)[:10].index.tolist()

In [53]:
user_predicted_set

[18725, 13987, 51441, 31501, 11805, 13400, 27406, 3335, 1080, 19339]

In [54]:
len(user_predicted_set)

10

In [55]:
len(list(set(user_relevant_set) & set(user_predicted_set)))

0

In [56]:
precision_at_10 = len(list(set(user_relevant_set) & set(user_predicted_set)))/10

In [57]:
precision_at_10

0.0

<h4> Global Average Precision@K </h4>

In [58]:
global_precision = 0

In [59]:
n = song_data1['user_id'].nunique()

<h4> Modularize the code </h4>

In [60]:
for user_index in range(0, n):
    user_relevant_set = song_data1[song_data1['user_id']==user_index].sort_values(by='play_count', ascending=False)['song_id'].tolist()
    user_predicted_set = prediction_df.iloc[user_index].sort_values(ascending=False)[:10].index.tolist()
    precision_at_10 = (len(list(set(user_relevant_set) & set(user_predicted_set))))/10
    
    global_precision = global_precision + precision_at_10

In [61]:
global_precision

2.800000000000001

In [62]:
global_average_precision = global_precision/song_data1['user_id'].nunique()

In [63]:
global_average_precision

0.00018541818422621026

<h3> Don't recommend song which user has already listened </h3> 

In [64]:
df_user = song_data1[song_data1['user_id']==550]

In [65]:
df_user.shape

(24, 6)

<h3> Filter already listened songs from the global result list  </h3> 

In [66]:
pf_user = pd.DataFrame(prediction_df.iloc[550].sort_values(ascending=False))

In [67]:
pf_user.reset_index(inplace=True)
pf_user.columns = ['song_id', 'score']

In [68]:
output_df = pd.merge(pf_user, song_data2, how='left', on='song_id')

In [69]:
output_df.drop_duplicates(inplace=True)

In [70]:
output_df.reset_index(drop=True)

Unnamed: 0,song_id,score,song,title
0,18725,8.345894,SOHHSGK12A8AE49262,Like You'll Never See Me Again
1,13987,3.178924,SOFKXVU12A8C140CD7,Jeunesse de l'Occident
2,51441,2.265276,SOUXKXE12A58A7A1E7,Concrete And Clay
3,31501,1.964186,SOMLDZL12A6D4FA66F,Charade
4,11805,1.752155,SOEOJDP12A8C145B4E,Say It Again
...,...,...,...,...
62882,41820,0.000000,SOQSKWS12A8C143178,That Girl featuring Pied Piper (Remix) (Bonus ...
62883,41819,0.000000,SOQSKOC12A8C13CB92,My Arms Stay Open
62884,41818,0.000000,SOQSKKA12A58A78E3D,Make Your Own Kind Of Music
62885,41817,0.000000,SOQSJPL12AF729C98D,Just Right


In [71]:
output_df['score_normalized'] = (output_df['score'] - min(output_df['score'])) / (max(output_df['score']) - min(output_df['score']))

In [72]:
merged_collab = pd.merge(output_df, df_user, on='song_id', how='left')

In [73]:
merged_collab.shape

(62887, 10)

In [74]:
merged_collab = merged_collab.drop(merged_collab[merged_collab['play_count']>0].index)

In [75]:
merged_collab.head()

Unnamed: 0,song_id,score,song_x,title_x,score_normalized,user,song_y,play_count,title_y,user_id
0,18725,8.345894,SOHHSGK12A8AE49262,Like You'll Never See Me Again,1.0,,,,,
1,13987,3.178924,SOFKXVU12A8C140CD7,Jeunesse de l'Occident,0.380897,,,,,
2,51441,2.265276,SOUXKXE12A58A7A1E7,Concrete And Clay,0.271424,,,,,
3,31501,1.964186,SOMLDZL12A6D4FA66F,Charade,0.235348,,,,,
4,11805,1.752155,SOEOJDP12A8C145B4E,Say It Again,0.209942,,,,,


In [76]:
merged_collab.shape

(62863, 10)

In [77]:
merged_collab['title_x'][:10]

0                Like You'll Never See Me Again
1                        Jeunesse de l'Occident
2                             Concrete And Clay
3                                       Charade
4                                  Say It Again
5    Return Of The Rising Moon (Electronic Mix)
6                               La rueda mágica
7                               Under Your Skin
8                               Er steht im Tor
9                          Trans-Atlantic Drawl
Name: title_x, dtype: object

Steps of user-based collaborative filtering:

* Firstly built a data matrix: (number of users in rows) x (number of items in column) and filled the values corresponding to the item rating given by a certain user.
* Secondly, from the above matrix, built the distance-wise pair matrix (user similarity) by using cosine similarity whose values indicate the similarities between the users. 
* Thirdly, applied the dot product of the data matrix with the user similarity matrix, which produces the ‘item_prediction’ matrix. It contains the user-ids in columns and item ids in rows.
* Finally, to predict the items to the user, the ‘item_prediction’ matrix was used by finding the column that corresponds to the user id.

The following matrices were used for evaluation purposes:

* Mean absolute error (MAE)
* Root mean squared error (RMSE)
* Precision@k: Here, ‘k’ denotes the number of recommended items that can be checked for their precision.
* Global precision@k: The average of precision@k for all the users.