## Recommender System and NN with numpy

Yun Xing 2023.5.2

references: https://github.com/lppier/Recommender_Systems 

### Data preparation

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('radio_songs.csv')
df.head()

Unnamed: 0,user,abba,ac/dc,adam green,aerosmith,afi,air,alanis morissette,alexisonfire,alicia keys,...,timbaland,tom waits,tool,tori amos,travis,trivium,u2,underoath,volbeat,yann tiersen
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,33,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,42,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,51,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,62,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [2]:
df.describe()

Unnamed: 0,user,abba,ac/dc,adam green,aerosmith,afi,air,alanis morissette,alexisonfire,alicia keys,...,timbaland,tom waits,tool,tori amos,travis,trivium,u2,underoath,volbeat,yann tiersen
count,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,...,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0
mean,876.87,0.02,0.05,0.01,0.01,0.04,0.05,0.02,0.03,0.04,...,0.05,0.06,0.01,0.02,0.0,0.04,0.04,0.04,0.05,0.01
std,472.055909,0.140705,0.219043,0.1,0.1,0.196946,0.219043,0.140705,0.171447,0.196946,...,0.219043,0.238683,0.1,0.140705,0.0,0.196946,0.196946,0.196946,0.219043,0.1
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,468.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,921.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1270.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1606.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0


In [3]:
df.dtypes

user            int64
abba            int64
ac/dc           int64
adam green      int64
aerosmith       int64
                ...  
trivium         int64
u2              int64
underoath       int64
volbeat         int64
yann tiersen    int64
Length: 285, dtype: object

### Collaborative Filtering 
### A. item-item

Use this user-item matrix to recommend 10 songs to users who have listened to 'u2' and 'pink floyd'. Use item-item collaborative filtering to find similar songs, measuring closeness with the cosine spatial distance. Because `scipy.spatial.distance.cosine` returns a distance, subtract it from 1 to get the similarity, as shown below.
References:
1. https://github.com/SwathyMM/Top-10-song-recommendation-using-collaborative-filtering-and-KNN/blob/master/Song%20recommender.ipynb
2. https://github.com/ugis22/music_recommender/blob/master/collaborative_recommender_system/CF_knn_music_recommender.ipynb
3. https://www.kaggle.com/code/ecemboluk/recommendation-system-with-cf-using-knn
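
To make the distance-versus-similarity point concrete, here is a minimal sketch on two made-up listening vectors (the vectors and the helper name `cosine_distance` are assumptions for illustration, not part of the dataset). It reproduces with plain numpy the quantity that a cosine distance function returns, then converts it to a similarity.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine of the angle between u and v."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical listening vectors for two songs across five users (1 = listened).
song_a = np.array([1, 1, 0, 0, 1], dtype=float)
song_b = np.array([1, 0, 0, 1, 1], dtype=float)

distance = cosine_distance(song_a, song_b)
similarity = 1.0 - distance  # subtract the distance from 1 to get the similarity

print(f"distance={distance:.4f}, similarity={similarity:.4f}")
```

The two songs share two of their three listeners each, so the similarity is 2/3 and the distance is 1/3.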

In [4]:
df['user'] = df['user'].astype('object')

In [5]:
df.dtypes

user            object
abba             int64
ac/dc            int64
adam green       int64
aerosmith        int64
                 ...  
trivium          int64
u2               int64
underoath        int64
volbeat          int64
yann tiersen     int64
Length: 285, dtype: object

In [6]:
df.sum()

user            87687
abba                2
ac/dc               5
adam green          1
aerosmith           1
                ...  
trivium             4
u2                  4
underoath           4
volbeat             5
yann tiersen        1
Length: 285, dtype: object

In [7]:
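# Row totals. The `user` id column is still part of df here, so each value is
# the user id plus that user's listen count (e.g. 1610 = 1606 + 4).
df.sum(axis=1)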

0       12.0
1       57.0
2       49.0
3       68.0
4       75.0
       ...  
95    1567.0
96    1606.0
97    1598.0
98    1606.0
99    1610.0
Length: 100, dtype: float64

In [8]:
df_pivot = df.set_index('user')
df_pivot.columns.name = 'song'
df_pivot

song,abba,ac/dc,adam green,aerosmith,afi,air,alanis morissette,alexisonfire,alicia keys,all that remains,...,timbaland,tom waits,tool,tori amos,travis,trivium,u2,underoath,volbeat,yann tiersen
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
33,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
42,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
51,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
62,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1566,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1586,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1589,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1601,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
df_clean = df_pivot.loc[:, (df_pivot != 0).any(axis=0)]
df_clean.shape

(100, 270)

#### U2

In [10]:
from scipy.spatial.distance import cosine

# Song-by-song similarity matrix, filled in one pair at a time.
similar_songs = pd.DataFrame(index=df_clean.columns, columns=df_clean.columns)

for i in range(len(similar_songs.columns)):
    for j in range(len(similar_songs.columns)):
        # cosine() returns a distance, so 1 - distance is the similarity
        similar_songs.iloc[i, j] = 1 - cosine(df_clean.iloc[:, i], df_clean.iloc[:, j])
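
The nested loop calls the distance function once per pair of songs. As a possible alternative, the sketch below computes all pairwise cosine similarities in one matrix product. The small `toy` frame is a hypothetical stand-in for `df_clean`, and the approach assumes no column is all zeros (which holds for `df_clean`, since those columns were dropped).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_clean: rows are users, columns are songs.
toy = pd.DataFrame(
    [[1, 0, 1], [1, 1, 0], [0, 1, 0], [1, 0, 1]],
    index=[1, 2, 3, 4],
    columns=["song_a", "song_b", "song_c"],
)

X = toy.to_numpy(dtype=float)
norms = np.linalg.norm(X, axis=0)           # one norm per song (column)
sim = (X.T @ X) / np.outer(norms, norms)    # all pairwise cosine similarities
similar_toy = pd.DataFrame(sim, index=toy.columns, columns=toy.columns)

print(similar_toy.round(3))
```

Unlike the loop version, the result has a float dtype rather than `object`, which keeps later sorting and arithmetic simpler.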

In [11]:
similar_u2 = similar_songs['u2'].sort_values(ascending=False)
similar_u2.head(11)


song
u2                      1.0
misfits                 0.5
robbie williams         0.5
green day          0.433013
depeche mode       0.408248
peter fox          0.377964
dire straits       0.353553
kelly clarkson     0.353553
madonna            0.353553
johnny cash        0.353553
enter shikari      0.353553
Name: u2, dtype: object

#### Pink Floyd

In [12]:
# Songs most similar to 'pink floyd'
similar_pf = similar_songs['pink floyd'].sort_values(ascending=False)
similar_pf.head(11)

song
pink floyd                   1.0
genesis                  0.57735
sonic syndicate         0.408248
led zeppelin            0.408248
queen                   0.408248
david bowie             0.408248
funeral for a friend    0.408248
hans zimmer             0.408248
coldplay                0.348155
maria mena              0.333333
howard shore            0.333333
Name: pink floyd, dtype: object

### B. user-user
Find the user most similar to user 1606. Use user-user collaborative filtering with cosine similarity. List the recommended songs for user 1606 (Hint: find the songs listened to by the most similar user).
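
A minimal sketch of the user-user step on made-up data, assuming a tiny hypothetical user-by-song matrix in which one user has no listens at all. A zero row has no defined cosine similarity, so the sketch leaves those entries as NaN and drops them before picking the nearest user.

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-song matrix; user 30 has no listens at all.
toy = pd.DataFrame(
    [[1, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 0], [1, 1, 1, 0]],
    index=[10, 20, 30, 40],
    columns=["a", "b", "c", "d"],
)

X = toy.to_numpy(dtype=float)
norms = np.linalg.norm(X, axis=1)            # one norm per user (row)
denom = np.outer(norms, norms)

# Divide only where both norms are non-zero; everything else stays NaN.
sim = np.full(denom.shape, np.nan)
np.divide(X @ X.T, denom, out=sim, where=denom > 0)
user_sim = pd.DataFrame(sim, index=toy.index, columns=toy.index)

target = 10
# Drop the target itself and any undefined similarities, then take the maximum.
candidates = user_sim[target].drop(target).dropna()
most_similar = candidates.idxmax()

print(most_similar, round(candidates[most_similar], 3))
```

Dropping the target by label avoids relying on it being sorted into the first position.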

In [13]:
similar_users = pd.DataFrame(index=df_clean.index, columns=df_clean.index)
similar_users.shape

(100, 100)

In [14]:
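for i in range(len(similar_users.columns)):
    for j in range(len(similar_users.columns)):
        # Users with no listens in df_clean have a zero-norm row, so cosine()
        # divides by zero for them (scipy emits a RuntimeWarning) and the
        # similarity comes out as NaN.
        similar_users.iloc[i, j] = 1 - cosine(df_clean.iloc[i, :], df_clean.iloc[j, :])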


In [15]:
similar_users

user,1,33,42,51,62,75,130,141,144,150,...,1521,1530,1536,1545,1549,1566,1586,1589,1601,1606
1,1.0,0.061546,0.0,0.0,0.083624,0.0,0.0,0.0,0.0,0.150756,...,0.1066,0.0,0.0,0.190693,0.0,0.0,0.0,0.0,0.0,0.0
33,0.061546,1.0,0.077152,0.247537,0.226455,0.176777,0.0,0.0,0.0,0.102062,...,0.0,0.0,0.06455,0.193649,0.0,0.0,0.045644,0.0,0.091287,0.0
42,0.0,0.077152,1.0,0.0,0.0,0.0,0.0,0.09167,0.0,0.0,...,0.0,0.0,0.0,0.0,0.094491,0.0,0.0,0.125988,0.0,0.0
51,0.0,0.247537,0.0,1.0,0.336336,0.140028,0.0,0.0,0.108465,0.121268,...,0.0,0.0,0.076696,0.0,0.0,0.0,0.0,0.0,0.0,0.0
62,0.083624,0.226455,0.0,0.336336,1.0,0.160128,0.0,0.067267,0.124035,0.138675,...,0.0,0.0,0.175412,0.087706,0.0,0.0,0.062017,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1566,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1586,0.0,0.045644,0.0,0.0,0.062017,0.129099,0.0,0.0,0.0,0.0,...,0.0,0.0,0.212132,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1589,0.0,0.0,0.125988,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,...,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,1.0,0.0,0.0
1601,0.0,0.091287,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [16]:
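# Note: sorted() cannot order NaN values, so the list below is only sorted
# within the runs between NaNs, not globally.
sorted(similar_users[1606])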

[0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.10206207261596578,
 0.10425720702853736,
 0.10660035817780522,
 0.12126781251816654,
 0.1290994448735806,
 0.22360679774997894,
 nan,
 nan,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.09805806756909208,
 0.10660035817780522,
 0.14433756729740643,
 0.15075567228888187,
 0.27735009811261446,
 nan,
 0.125,
 nan,
 1.0]

In [54]:
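# Most similar user to user 1606. Position 0 is user 1606 itself (similarity 1.0),
# so take position 1; sort_values() places NaN last by default.
most_similar_user = similar_users[1606].sort_values(ascending=False).index[1]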
most_similar_user

1144

In [55]:
# Songs listened to by user 1606
song1606 = df_clean.loc[1606]
a = song1606[song1606>0]
a

song
abba             1
elvis presley    1
frank sinatra    1
the beatles      1
Name: 1606, dtype: int64

#### Song recommendation based on most similar user

In [56]:
rec2 = df_clean.loc[most_similar_user]

b = rec2[rec2>0]
b

song
beastie boys                1
bob dylan                   1
bob marley & the wailers    1
david bowie                 1
elvis presley               1
eric clapton                1
johnny cash                 1
pearl jam                   1
pink floyd                  1
the beatles                 1
the doors                   1
the rolling stones          1
tom waits                   1
Name: 1144, dtype: int64

### How many of the recommended songs have already been listened to by user 1606?

In [57]:
np.intersect1d(list(a.index),list(b.index))

array(['elvis presley', 'the beatles'], dtype='<U24')

#### Two songs, elvis presley and the beatles, have already been listened to.

### C. user-item
Use a combined user-item approach to build a recommendation score for each song and each user, following these steps:
1. For each song in the user's row, get the top 10 similar songs and their similarity scores.
2. For each of the top 10 similar songs, get the user's purchase (listening) history.
3. Calculate a recommendation score as follows: sum(purchaseHistory * similarityScore) / sum(similarityScore)
4. What are the top 5 song recommendations for user 1606?
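
To make step 3 concrete, here is a small worked example that reads the score as a similarity-weighted average of the listening history, which is what the `rec_score` function below computes. The song names and similarity values are made up, and only 3 neighbours are used instead of 10 to keep the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical similarity scores of the 3 nearest neighbours of a candidate song.
neighbour_sim = pd.Series({"song_x": 0.5, "song_y": 0.3, "song_z": 0.2})

# Whether the user has listened to each neighbour (1) or not (0).
history = pd.Series({"song_x": 1, "song_y": 0, "song_z": 1})

# Similarity-weighted average: (0.5*1 + 0.3*0 + 0.2*1) / (0.5 + 0.3 + 0.2)
score = (history * neighbour_sim).sum() / neighbour_sim.sum()

print(round(score, 3))
```

The score lies between 0 and 1 for binary histories: it is 1 when the user has listened to all neighbours and 0 when the user has listened to none.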

In [58]:
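def rec_score(user, song):
    # Top 10 most similar songs, skipping position 0 (the song itself)
    ten_similar = similar_songs[song].sort_values(ascending=False).iloc[1:11]

    # The user's listening history (0/1) for those 10 songs only
    purchase = df_clean.loc[user, ten_similar.index]

    # Similarity-weighted average of the listening history
    score = (ten_similar * purchase).sum() / ten_similar.sum()

    return score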

In [59]:
rec_df = pd.DataFrame(columns = df_clean.columns,index = df_clean.index)

for user in df_clean.index:
    for song in df_clean.columns:
        rec_df.loc[user,song] = rec_score(user,song)

In [60]:
# Top 5 song recommendations for user 1606

rec_df.loc[1606].sort_values(ascending=False).head(5)

song
elvis presley    0.289328
abba             0.239023
eric clapton      0.20274
frank sinatra    0.201139
howard shore     0.171749
Name: 1606, dtype: object
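
Three of the five songs above (elvis presley, abba, frank sinatra) are ones user 1606 has already listened to. If the goal is to suggest only new songs, one option is to filter the scores by the listening history before ranking. The sketch below shows the idea on made-up scores; the names `scores` and `listened` are hypothetical stand-ins for a row of `rec_df` and the matching row of `df_clean`.

```python
import pandas as pd

# Hypothetical recommendation scores and listening history for one user.
scores = pd.Series({"song_a": 0.29, "song_b": 0.24, "song_c": 0.20, "song_d": 0.17})
listened = pd.Series({"song_a": 1, "song_b": 0, "song_c": 1, "song_d": 0})

# Keep only songs the user has not listened to, then rank by score.
unseen = scores[listened == 0]
top_new = unseen.sort_values(ascending=False).head(2)

print(top_new)
```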

### Discussions:

#### 1. Two other similarity measures can be used instead of the cosine similarity above:
1. Euclidean distance: the straight-line distance between two points in a high-dimensional space (a smaller distance means more similar). It is sensitive to the scale of the data and may not work well with high-dimensional data.
2. Pearson correlation coefficient: the linear correlation between two variables. It is invariant to scaling and can be used with continuous data.
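
A short sketch comparing the two measures with cosine similarity on made-up vectors (the vectors `u` and `v` are assumptions for illustration). It also rescales one vector to show which measures react to scale.

```python
import numpy as np

# Hypothetical listening vectors for two songs across five users.
u = np.array([1, 1, 0, 0, 1], dtype=float)
v = np.array([1, 0, 0, 1, 1], dtype=float)

euclidean = np.linalg.norm(u - v)                        # straight-line distance
pearson = np.corrcoef(u, v)[0, 1]                        # linear correlation
cosine_sim = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Scaling one vector changes the Euclidean distance but not the Pearson correlation.
euclidean_scaled = np.linalg.norm(10 * u - v)
pearson_scaled = np.corrcoef(10 * u, v)[0, 1]

print(f"euclidean={euclidean:.3f} -> {euclidean_scaled:.3f}")
print(f"pearson={pearson:.3f} -> {pearson_scaled:.3f}")
print(f"cosine={cosine_sim:.3f}")
```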

#### 2. Things needed to build a content-based recommender system:
1. Data:
   1. User profile: vectors that describe the user’s preferences. The user profile is built from the utility matrix, which describes the relationship between users and items.
   2. Item profile: a profile for each item that represents the important characteristics of that item.
   3. Utility matrix: describes each user’s preference for certain items. From the data gathered from the user, we look for relations between the items the user likes and those the user dislikes.
2. Similarity measures: we can use cosine distance, or classification algorithms such as Bayesian classifiers or decision tree models.
3. Recommendations: generate recommendations for the user, then evaluate and fine-tune further.
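
The pieces above can be sketched end to end on made-up data. Everything here is an assumption for illustration: the song names, the three genre-style features, and the choice to build the user profile as the average of liked item profiles and to score with cosine similarity.

```python
import numpy as np

# Hypothetical item profiles: one row per song, one column per feature
# (say rock, pop, electronic). The values are made up.
songs = ["song_a", "song_b", "song_c", "song_d"]
item_profiles = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.3, 0.7],
])

# One row of the utility matrix: 1 = the user liked the song, 0 = no signal.
user_likes = np.array([1, 0, 0, 0])

# User profile: the average of the profiles of the liked items.
user_profile = item_profiles[user_likes == 1].mean(axis=0)

# Cosine similarity between the user profile and every item profile.
scores = item_profiles @ user_profile / (
    np.linalg.norm(item_profiles, axis=1) * np.linalg.norm(user_profile)
)

# Recommend the highest-scoring song the user has not already liked.
scores[user_likes == 1] = -np.inf
recommended = songs[int(np.argmax(scores))]

print(recommended)
```

Because the user only liked a pure "rock" song, the item whose profile is closest to rock among the remaining songs ranks first.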


#### 3. There are 2 methods to evaluate recommender systems:
1. Traditional: statistical accuracy metrics (precision, recall, F1 score). For example, average precision indicates effectiveness across the full range of recall values.
2. Non-traditional:
   1. Normalized Discounted Cumulative Gain (nDCG) measures the graded relevance of recommendations by comparing the relevance values of a recommendation set with the “ideal” result.
   2. Decision support accuracy metrics evaluate how effectively the recommendations help users make decisions. Examples include hit rate and mean reciprocal rank.
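
As a rough illustration of these metrics, the sketch below computes precision and recall at k, the reciprocal rank, and nDCG for a single made-up recommendation list. It assumes binary relevance (a song is either relevant or not); the function names and the data are hypothetical.

```python
import numpy as np

def precision_recall_at_k(recommended, relevant, k):
    """Fraction of the top k that is relevant, and fraction of relevant items found."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / k, hits / len(relevant)

def reciprocal_rank(recommended, relevant):
    """1 / rank of the first relevant item, or 0 if none is recommended."""
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(recommended, relevant, k):
    """DCG of the top k divided by the DCG of an ideal ordering (binary gains)."""
    gains = np.array([1.0 if item in relevant else 0.0 for item in recommended[:k]])
    discounts = 1.0 / np.log2(np.arange(2, gains.size + 2))
    dcg = float((gains * discounts).sum())
    ideal_hits = min(len(relevant), k)
    idcg = float(discounts[:ideal_hits].sum())
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical ranked recommendations and ground-truth relevant songs.
recommended = ["song_a", "song_b", "song_c", "song_d", "song_e"]
relevant = {"song_b", "song_e", "song_z"}

precision, recall = precision_recall_at_k(recommended, relevant, k=5)
rr = reciprocal_rank(recommended, relevant)
ndcg = ndcg_at_k(recommended, relevant, k=5)

print(f"precision@5={precision:.2f} recall@5={recall:.2f} RR={rr:.2f} nDCG@5={ndcg:.3f}")
```

Averaging the reciprocal rank over many users gives the mean reciprocal rank; the hit rate is the share of users with at least one relevant song in their top k.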
