Collaborative filtering techniques make recommendations for a user based on the ratings and preferences of many users. The underlying idea is that if two users have rated the same items similarly, then items that one of them liked might interest the other. The steps are as follows:

- Separate users who have rated all jokes from those who have not.
- Select a random user who has not rated all jokes as the "object of interest" and note how many jokes they have not rated; call it N.
- Find all users who have rated all jokes and have positive similarity with the "object of interest". Similarity is computed on the jokes both users have rated.
- Narrow these down to some number of neighbours, say P = 30 or 50.
- Score the N unrated jokes using the P similar users and sort them by score in descending order, keeping the joke labels.
- Suggest the first few labels as jokes which the "object of interest" will like.
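
The steps above can be sketched end to end on a toy matrix. This is a minimal illustration under stated assumptions, not the notebook's implementation: the ratings are made up, `np.nan` marks an unrated joke, and `recommend`, `n_neighbours` and `top_k` are hypothetical names.

```python
import numpy as np

# Toy ratings: rows are users, columns are jokes; np.nan marks "not rated".
# All values are made up for illustration.
ratings = np.array([
    [ 5.0,  3.0, -2.0,  4.0,  1.0],      # complete user
    [ 4.0,  2.0, -3.0,  5.0,  0.0],      # complete user
    [-4.0, -1.0,  3.0, -5.0,  2.0],      # complete user
    [ 5.0,  2.0, -1.0, np.nan, np.nan],  # active user (two jokes unrated)
])

def recommend(ratings, active_row, n_neighbours=2, top_k=1):
    """Rank the active user's unrated jokes using similar complete users."""
    active = ratings[active_row]
    rated = ~np.isnan(active)
    complete = np.where(~np.isnan(ratings).any(axis=1))[0]

    # Mean-center each user over the jokes they rated.
    active_mean = active[rated].mean()
    a = active[rated] - active_mean
    centered = ratings[complete] - ratings[complete].mean(axis=1, keepdims=True)

    # Cosine similarity on the jokes the active user has rated.
    c = centered[:, rated]
    sims = (c @ a) / (np.linalg.norm(c, axis=1) * np.linalg.norm(a))

    # Keep the most similar neighbours that have positive similarity.
    order = np.argsort(-sims)[:n_neighbours]
    order = order[sims[order] > 0]

    # Similarity-weighted average of the neighbours' centered ratings.
    w = sims[order]
    scores = active_mean + (w @ centered[order][:, ~rated]) / np.abs(w).sum()
    unrated_items = np.where(~rated)[0]
    ranked = unrated_items[np.argsort(-scores)]
    return ranked[:top_k], scores

top_items, scores = recommend(ratings, active_row=3)
print(top_items, scores)
```

Here the two users who agree with the active user on the first three jokes drive the prediction, so the joke they both liked (column 3) is ranked ahead of the one they disliked (column 4).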


This is memory-based collaborative filtering: it uses all the data in the database to generate a prediction. Model-based collaborative filtering instead uses the data in the database to build a model, which is then used to generate predictions.

There are 2 main types of memory-based collaborative filtering algorithms:

- User-User Collaborative Filtering: Here we find look-alike users based on similarity and recommend items that the first user's look-alikes have chosen in the past. This algorithm is very effective but takes a lot of time and resources, because it requires computing similarity for every pair of users. For platforms with a large user base, it is therefore hard to implement without a strongly parallelized system. Remember it as 'Liked by Alice, suggest to Bob'.

- Item-Item Collaborative Filtering: This is quite similar to the previous algorithm, but instead of finding a user's look-alikes, we find an item's look-alikes. Once we have the item look-alike matrix, we can easily recommend similar items to a user who has bought or rated any item in the dataset. This algorithm is far less resource-intensive than user-user collaborative filtering: for a new user it takes far less time, because we do not need the similarity scores between all users. It is similar to Amazon's 'users who bought this also bought that'.
Remember it as 'Alice likes Prague, suggest Kyoto to her'.
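
The difference between the two variants comes down to which axis of the ratings matrix is compared. A minimal sketch with made-up, already mean-centered ratings (`R` and `cosine_matrix` are illustrative names, not part of the notebook):

```python
import numpy as np

# Made-up, already mean-centered ratings: 4 users (rows) x 3 items (columns).
R = np.array([
    [ 2.0, -1.0, -1.0],
    [ 1.5, -0.5, -1.0],
    [-2.0,  1.0,  1.0],
    [ 0.5,  0.5, -1.0],
])

def cosine_matrix(M):
    """Pairwise cosine similarity between the rows of M."""
    unit = M / np.linalg.norm(M, axis=1, keepdims=True)
    return unit @ unit.T

user_sim = cosine_matrix(R)    # shape (4, 4): user-user similarities
item_sim = cosine_matrix(R.T)  # shape (3, 3): item-item similarities
print(user_sim.shape, item_sim.shape)
```

With far more users than jokes, as in this dataset (100 jokes, tens of thousands of users), the item-item matrix is much smaller than the user-user one.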



In [168]:
import numpy as np
import pandas as pd
import sqlite3 as db

In [144]:
# Connect to a database (or create one if it doesn't exist)
sql_db = 'jester_jokes'

# database location and creating sql connection!
db_loc = 'data/{}.db'.format(sql_db)
conn = db.connect(db_loc)
# Create a 'cursor' for executing commands
c = conn.cursor()

In [145]:
# Selecting rating dataframe
query = 'SELECT * FROM ratings'
ratings_df = pd.read_sql(query, conn)
ratings_df.head()

# Not rated = 99.00 

Unnamed: 0,user_id,number_of_jokes_rated,joke_1,joke_2,joke_3,joke_4,joke_5,joke_6,joke_7,joke_8,...,joke_91,joke_92,joke_93,joke_94,joke_95,joke_96,joke_97,joke_98,joke_99,joke_100
0,1,74,-7.82,8.79,-9.66,-8.16,-7.52,-8.5,-9.85,4.17,...,2.82,99.0,99.0,99.0,99.0,99.0,-5.63,99.0,99.0,99.0
1,2,100,4.08,-0.29,6.36,4.37,-2.38,-9.66,-0.73,-5.34,...,2.82,-4.95,-0.29,7.86,-0.19,-2.14,3.06,0.34,-4.32,1.07
2,3,49,99.0,99.0,99.0,99.0,9.03,9.27,9.03,9.27,...,99.0,99.0,99.0,9.08,99.0,99.0,99.0,99.0,99.0,99.0
3,4,48,99.0,8.35,99.0,99.0,1.8,8.16,-2.82,6.21,...,99.0,99.0,99.0,0.53,99.0,99.0,99.0,99.0,99.0,99.0
4,5,91,8.5,4.61,-4.17,-5.39,1.36,1.6,7.04,4.61,...,5.19,5.58,4.27,5.19,5.73,1.55,3.11,6.55,1.8,1.6


In [146]:
# Work on a small subset of the data to check each step before wrapping it in a function.
# r_df is a subset of ratings_df: the first 10 users and the first 8 jokes.
# .copy() avoids modifying a view of ratings_df when r_df is changed in place later.
r_df = ratings_df.iloc[0:10, 0:10].copy()
r_df.head(10)

Unnamed: 0,user_id,number_of_jokes_rated,joke_1,joke_2,joke_3,joke_4,joke_5,joke_6,joke_7,joke_8
0,1,74,-7.82,8.79,-9.66,-8.16,-7.52,-8.5,-9.85,4.17
1,2,100,4.08,-0.29,6.36,4.37,-2.38,-9.66,-0.73,-5.34
2,3,49,99.0,99.0,99.0,99.0,9.03,9.27,9.03,9.27
3,4,48,99.0,8.35,99.0,99.0,1.8,8.16,-2.82,6.21
4,5,91,8.5,4.61,-4.17,-5.39,1.36,1.6,7.04,4.61
5,6,100,-6.17,-3.54,0.44,-8.5,-7.09,-4.32,-8.69,-0.87
6,7,47,99.0,99.0,99.0,99.0,8.59,-9.85,7.72,8.79
7,8,100,6.84,3.16,9.17,-6.21,-8.16,-1.7,9.27,1.41
8,9,100,-3.79,-3.54,-9.42,-6.89,-8.74,-0.29,-5.29,-8.93
9,10,72,3.01,5.15,5.15,3.01,6.41,5.15,8.93,2.52


In [147]:
joke_columns = r_df.columns[2:]
joke_columns

Index(['joke_1', 'joke_2', 'joke_3', 'joke_4', 'joke_5', 'joke_6', 'joke_7',
       'joke_8'],
      dtype='object')

In [148]:
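def normalization(ratings):
    '''
    Subtract each user's mean rating from that user's ratings (row-wise).
    Modifies `ratings` in place and returns it.
    '''
    total_users = ratings.shape[0]
    # NOTE: range(2) centers only the first two rows, which is what the output
    # of the next cell reflects. Use range(total_users) to center every user;
    # in that case the 99 placeholders should be excluded from the mean first.
    for i in range(2):
        ratings.iloc[i, 2:] -= np.mean(ratings.iloc[i, 2:])
    return ratings

def replace_0(ratings):
    ''' Replace the "not rated" placeholder 99 with 0 in every joke column. '''
    joke_ids = ratings.columns[2:]
    for joke_id in joke_ids:
        ratings[joke_id] = ratings[joke_id].replace([99], 0)
    return ratings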
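
The two helpers above can also be expressed as a single vectorized step that centers every user and leaves the 99 placeholders out of the mean. This is a minimal sketch, assuming the same column layout as `r_df` (two id columns followed by joke columns); `center_ratings`, `NOT_RATED` and `demo` are hypothetical names with made-up data.

```python
import numpy as np
import pandas as pd

NOT_RATED = 99.0  # placeholder used for unrated jokes in the ratings table

def center_ratings(ratings, first_joke_col=2):
    """Return a copy where each user's rated jokes are mean-centered
    and unrated jokes (99) are set to 0."""
    joke_cols = ratings.columns[first_joke_col:]
    jokes = ratings[joke_cols].astype(float)
    rated = jokes != NOT_RATED
    # where() turns the 99s into NaN, which mean() skips by default.
    user_means = jokes.where(rated).mean(axis=1)
    centered = jokes.sub(user_means, axis=0).where(rated, 0.0)
    out = ratings.copy()
    out[joke_cols] = centered
    return out

# Made-up example: user 1 has not rated joke_2.
demo = pd.DataFrame({
    'user_id': [1, 2],
    'number_of_jokes_rated': [2, 3],
    'joke_1': [4.0, 1.0],
    'joke_2': [99.0, 2.0],
    'joke_3': [2.0, 3.0],
})
centered_demo = center_ratings(demo)
print(centered_demo)
```

Because the function works on a copy, the original frame keeps its 99 markers and the step can be rerun safely.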

In [149]:
temp2 = normalization(r_df)
temp3 = replace_0(temp2)
temp3.head(10)

Unnamed: 0,user_id,number_of_jokes_rated,joke_1,joke_2,joke_3,joke_4,joke_5,joke_6,joke_7,joke_8
0,1,74,-3.00125,13.60875,-4.84125,-3.34125,-2.70125,-3.68125,-5.03125,8.98875
1,2,100,4.52875,0.15875,6.80875,4.81875,-1.93125,-9.21125,-0.28125,-4.89125
2,3,49,0.0,0.0,0.0,0.0,9.03,9.27,9.03,9.27
3,4,48,0.0,8.35,0.0,0.0,1.8,8.16,-2.82,6.21
4,5,91,8.5,4.61,-4.17,-5.39,1.36,1.6,7.04,4.61
5,6,100,-6.17,-3.54,0.44,-8.5,-7.09,-4.32,-8.69,-0.87
6,7,47,0.0,0.0,0.0,0.0,8.59,-9.85,7.72,8.79
7,8,100,6.84,3.16,9.17,-6.21,-8.16,-1.7,9.27,1.41
8,9,100,-3.79,-3.54,-9.42,-6.89,-8.74,-0.29,-5.29,-8.93
9,10,72,3.01,5.15,5.15,3.01,6.41,5.15,8.93,2.52


In [150]:
ratings_of_active_user = temp3.iloc[6, 2:] # row 6 (user_id 7)
ratings_of_other_user = temp3.iloc[7, 2:] # row 7 (user_id 8)
ratings_of_active_user, ratings_of_other_user

(joke_1    0.00
 joke_2    0.00
 joke_3    0.00
 joke_4    0.00
 joke_5    8.59
 joke_6   -9.85
 joke_7    7.72
 joke_8    8.79
 Name: 6, dtype: float64,
 joke_1    6.84
 joke_2    3.16
 joke_3    9.17
 joke_4   -6.21
 joke_5   -8.16
 joke_6   -1.70
 joke_7    9.27
 joke_8    1.41
 Name: 7, dtype: float64)

In [151]:
# x and y are expected to be equal-length sequences of ratings.
# The data is assumed to be mean-centered already, so the mean is not subtracted again.
def PC(x, y):
    ''' 
    Similarity between user x and user y.
    Computes the cosine of the two rating vectors, which equals the
    Pearson correlation coefficient when both vectors are mean-centered.
    '''    
    t1, t2, t3 = 0, 0, 0 
    for i, j in zip(x, y):
        t1 += i*j   # dot product
        t2 += i*i   # squared norm of x
        t3 += j*j   # squared norm of y
    return t1/(np.sqrt(t2) * np.sqrt(t3))
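
The relation between the cosine formula in `PC` and the Pearson coefficient can be checked numerically. A minimal sketch with made-up ratings; `cosine_similarity` is a hypothetical vectorized stand-in for the loop, with an added guard for all-zero vectors that `PC` does not have.

```python
import numpy as np

def cosine_similarity(x, y):
    """Vectorized cosine similarity; returns 0.0 if either vector is all zeros."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom > 0 else 0.0

# Made-up ratings for two users over the same four jokes.
u = np.array([4.0, -2.0, 1.0, 5.0])
v = np.array([3.0, -1.0, 0.0, 6.0])

# Cosine similarity of mean-centered vectors equals the Pearson coefficient.
centered_cos = cosine_similarity(u - u.mean(), v - v.mean())
pearson = np.corrcoef(u, v)[0, 1]
print(centered_cos, pearson)
```

The equality holds only when both vectors are centered over the same set of jokes; when unrated jokes are filled with 0, the result is an approximation of the Pearson coefficient rather than the exact value.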




In [152]:
x =  temp3.iloc[0, 2:]
similarity = [PC(x, temp3.iloc[i, 2:]) for i in range(temp3.shape[0])]
similarity

[1.0,
 -0.23803270911260682,
 -0.05974489083217877,
 0.5781139901185975,
 0.25979039870395565,
 0.21872562219087632,
 0.1608792361703059,
 -0.019745617666472735,
 0.00810164662892576,
 -0.11486535366335289]

We will now do Part 2

In [153]:
# Connect to a database (or create one if it doesn't exist)
sql_db = 'jester_jokes'

# database location and creating sql connection!
db_loc = 'data/{}.db'.format(sql_db)
conn = db.connect(db_loc)
# Create a 'cursor' for executing commands
c = conn.cursor()

# Selecting normalized ratings 
query_normalized = 'SELECT * FROM normalized_ratings'
normalized_ratings_df = pd.read_sql(query_normalized, conn)

# Selecting ratings
query_ratings = 'SELECT * FROM ratings'
ratings_df = pd.read_sql(query_ratings, conn)

In [154]:
normalized_ratings_df.head(4)

Unnamed: 0,user_id,number_of_jokes_rated,joke_1,joke_2,joke_3,joke_4,joke_5,joke_6,joke_7,joke_8,...,joke_91,joke_92,joke_93,joke_94,joke_95,joke_96,joke_97,joke_98,joke_99,joke_100
0,1,74,-4.388108,12.221892,-6.228108,-4.728108,-4.088108,-5.068108,-6.418108,7.601892,...,6.251892,0.0,0.0,0.0,0.0,0.0,-2.198108,0.0,0.0,0.0
1,2,100,1.3337,-3.0363,3.6137,1.6237,-5.1263,-12.4063,-3.4763,-8.0863,...,0.0737,-7.6963,-3.0363,5.1137,-2.9363,-4.8863,0.3137,-2.4063,-7.0663,-1.6763
2,3,49,0.0,0.0,0.0,0.0,1.930612,2.170612,1.930612,2.170612,...,0.0,0.0,0.0,1.980612,0.0,0.0,0.0,0.0,0.0,0.0
3,4,48,0.0,5.691875,0.0,0.0,-0.858125,5.501875,-5.478125,3.551875,...,0.0,0.0,0.0,-2.128125,0.0,0.0,0.0,0.0,0.0,0.0


In [155]:
# Split users into those who have rated all 100 jokes and those who have not.

# Users who have rated all 100 jokes serve as the pool of potential neighbours.
complete_ratings = normalized_ratings_df[normalized_ratings_df['number_of_jokes_rated'] == 100]


print('Total user count who have rated all the jokes: ', np.shape(complete_ratings)[0])
# One user from sparse_ratings is picked as the active user, and their
# similarity with every user in complete_ratings is computed.
sparse_ratings = normalized_ratings_df[normalized_ratings_df['number_of_jokes_rated'] != 100]
print('Total user count who have not rated all the jokes: ', np.shape(sparse_ratings)[0])


Total user count who have rated all the jokes:  14116
Total user count who have not rated all the jokes:  59305


In [156]:
# Pick an arbitrary user, here the one at position 1000 in sparse_ratings
n = 1000
active_user_id = sparse_ratings.iloc[n, 0]
print("Let's select a random user with user id {} as active user for whom we will recommend the joke".format(str(active_user_id)))

Let's select a random user with user id 1352 as active user for whom we will recommend the joke


In [157]:
print('Ratings given by active user {} for jokes he has rated'.format(str(active_user_id)))
active_user = sparse_ratings[sparse_ratings['user_id'] == active_user_id]
active_user_rating = active_user.iloc[:, 2:]
active_user_rating

Ratings given by active user 1352 for jokes he has rated


Unnamed: 0,joke_1,joke_2,joke_3,joke_4,joke_5,joke_6,joke_7,joke_8,joke_9,joke_10,...,joke_91,joke_92,joke_93,joke_94,joke_95,joke_96,joke_97,joke_98,joke_99,joke_100
1351,7.871096,-1.888904,-2.568904,-1.118904,2.821096,-4.078904,-0.188904,1.891096,-2.568904,3.931096,...,0.0,-3.448904,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [158]:
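active_user_rating_list = active_user_rating.values.ravel()
# Named n_complete rather than `len`, which would shadow the built-in len()
n_complete = complete_ratings.shape[0] # 14116
similarity = np.array([(complete_ratings.iloc[i, 0],PC(active_user_rating_list, complete_ratings.iloc[i, 2:])) for i in range(n_complete)])

# Sort the (user_id, similarity) pairs by ascending similarity
ind = np.argsort(similarity[:,1])
similarity = similarity[ind]
similarity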

array([[ 2.76640000e+04, -3.52644866e-01],
       [ 3.26670000e+04, -3.34265224e-01],
       [ 2.53620000e+04, -3.30902199e-01],
       ...,
       [ 1.92170000e+04,  3.99469178e-01],
       [ 1.06150000e+04,  4.06355890e-01],
       [ 7.49200000e+03,  4.59458050e-01]])
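
Calling `PC` once per row through `.iloc` is slow for 14116 users. The same similarities can be computed in one matrix operation. A minimal sketch, assuming the ratings are available as a NumPy array with no all-zero rows; `cosine_to_all` and the demo arrays are hypothetical.

```python
import numpy as np

def cosine_to_all(active, others):
    """Cosine similarity between one vector and every row of a matrix."""
    active = np.asarray(active, dtype=float)
    others = np.asarray(others, dtype=float)
    norms = np.linalg.norm(others, axis=1) * np.linalg.norm(active)
    return (others @ active) / norms

# Made-up centered ratings: one active user against three other users.
active_demo = np.array([1.0, 0.0, -1.0])
others_demo = np.array([
    [ 2.0, 0.0, -2.0],   # same direction as the active user
    [-1.0, 0.0,  1.0],   # opposite direction
    [ 1.0, 1.0,  1.0],   # orthogonal
])
sims_demo = cosine_to_all(active_demo, others_demo)
print(sims_demo)
```

On the notebook's data the call would look something like `cosine_to_all(active_user_rating_list, complete_ratings.iloc[:, 2:].values)`, with the user ids taken separately from `complete_ratings['user_id']`.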

In [159]:
neighbours = similarity[similarity[:,1] > 0.2]
total = neighbours.shape[0]
print('We have {} potential neighbours! Now we will be randomly selecting 5 samples out of them'.format(total))
# This number will be large, we will take 5 for simplicity 

We have 932 potential neighbours! Now we will be randomly selecting 5 samples out of them


In [160]:
# replace=False ensures that no neighbour is selected twice
index_neighbour = np.random.choice(range(total), 5, replace=False)
selec_neigh = neighbours[index_neighbour]
selec_neigh

array([[1.54820000e+04, 2.03156411e-01],
       [4.05470000e+04, 2.13235348e-01],
       [4.61290000e+04, 2.06849657e-01],
       [1.94770000e+04, 2.14827018e-01],
       [2.27500000e+04, 2.20878846e-01]])
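
The cell above samples 5 neighbours at random from those above the 0.2 threshold. An alternative that is closer to the "narrow it down to P" step is to keep the P most similar users. A minimal sketch with made-up data; `sim_demo` and `top_neighbours` are hypothetical names, and the array uses the same (user_id, similarity) layout as `similarity`.

```python
import numpy as np

# Hypothetical (user_id, similarity) pairs.
sim_demo = np.array([
    [101,  0.35],
    [102, -0.10],
    [103,  0.42],
    [104,  0.21],
    [105,  0.05],
])

def top_neighbours(sim, k, min_sim=0.0):
    """Return the k rows with the highest similarity above min_sim, best first."""
    candidates = sim[sim[:, 1] > min_sim]
    order = np.argsort(-candidates[:, 1])
    return candidates[order[:k]]

top2 = top_neighbours(sim_demo, k=2)
print(top2)
```

Taking the top P makes the result reproducible, whereas random sampling gives a different neighbourhood on each run unless a seed is fixed.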

In [161]:
recommen_columns = [column for column in active_user_rating.columns if active_user_rating[column].values[0] == 0]
num_rec = np.shape(recommen_columns)[0]
num_rec

27

In [162]:
active_user_raw_ratings = ratings_df[ratings_df['user_id'] == active_user_id].iloc[:, 2:]
active_user_mean_rating = np.mean(active_user_raw_ratings.drop(recommen_columns, axis = 1).values)
active_user_mean_rating

0.6289041095890411

In [163]:
# Selecting neighbours user_id & similarity 

neigh_id = selec_neigh[:, 0]
neigh_user_sim = selec_neigh[:, 1]

# Viewing the selected neighbours' user ids and similarities (only 5 here)
print('Neighbours user id: ', neigh_id[:10])
print('Neighbours user similarity: ', neigh_user_sim[:10])

Neighbours user id:  [15482. 40547. 46129. 19477. 22750.]
Neighbours user similarity:  [0.20315641 0.21323535 0.20684966 0.21482702 0.22087885]


In [164]:
neighbours_df = complete_ratings[complete_ratings['user_id'].isin(neigh_id)]
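
One detail to watch here: `isin` returns rows in the order they appear in `complete_ratings`, not in the order of `neigh_id`. In this run `neigh_id` is [15482, 40547, 46129, 19477, 22750] while the selected rows come back sorted by position, so the similarity array and the rating rows no longer line up one to one. A minimal sketch of the effect and one way to keep the order, using made-up stand-ins (`complete_demo`, `neigh_id_demo`):

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for complete_ratings and neigh_id.
complete_demo = pd.DataFrame({
    'user_id': [10, 20, 30, 40],
    'joke_1': [1.0, 2.0, 3.0, 4.0],
})
neigh_id_demo = np.array([30.0, 10.0])

# isin() keeps the DataFrame's own row order: user 10 comes before user 30.
by_isin = complete_demo[complete_demo['user_id'].isin(neigh_id_demo)]

# Indexing by user_id and selecting with .loc follows the order of neigh_id_demo.
aligned = complete_demo.set_index('user_id').loc[neigh_id_demo.astype(int)]
print(by_isin['user_id'].tolist(), aligned.index.tolist())
```

With the aligned frame, the i-th similarity belongs to the i-th row, which is what a row-wise weighted average relies on.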

In [165]:
# selecting only recommendation columns
print('We will be suggesting one out of {} jokes to the active user \n\n'.format(num_rec))
neighbours_df = neighbours_df[recommen_columns]
neighbours_df.head()

We will be suggesting one out of 27 jokes to the active user 




Unnamed: 0,joke_71,joke_72,joke_73,joke_74,joke_75,joke_76,joke_77,joke_78,joke_80,joke_82,...,joke_90,joke_91,joke_93,joke_94,joke_95,joke_96,joke_97,joke_98,joke_99,joke_100
15481,8.4673,2.4073,8.0873,5.5073,-5.0227,5.3673,-3.4227,-3.8627,-9.0027,8.6173,...,6.6773,5.1673,2.0173,8.9073,-5.8027,-5.3127,9.0073,1.6273,0.8973,-2.6927
19476,-2.8519,-1.8419,-4.0219,-4.0719,2.9181,3.2081,-2.8119,-4.0719,-2.9019,-4.4119,...,-2.1319,-2.3219,-3.0519,6.3181,3.8881,3.9381,-2.7119,-2.7619,-2.9519,-3.0519
22749,7.9119,1.4519,-3.2581,-10.3881,2.3319,2.6719,-3.0081,-2.1881,0.9219,1.8419,...,8.0519,7.5719,4.0719,8.0519,1.5019,7.7619,8.2019,-3.0081,-9.6181,-9.0781
40546,-6.6443,6.1657,-9.5643,-4.2743,2.4757,4.4157,4.5657,4.5657,-9.1743,-5.0443,...,-4.8043,3.3957,2.6257,-4.7543,3.9857,-3.5943,1.1657,5.7257,3.2557,-7.3243
46128,12.2609,-6.8091,-6.8091,-6.6691,-6.6691,-6.6191,-6.6691,-6.6191,-6.7591,-6.7191,...,-6.8091,12.3609,-1.3291,12.1709,12.3609,-1.9591,-6.6191,2.2609,-6.2291,-4.8691


In [166]:
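def score_user_item(item_id, neighbours_df, neighbour_user_similarity, active_user_mean_rating):
    '''
    Predicted score of one item for the active user: the active user's mean
    rating plus the similarity-weighted average of the neighbours' centered ratings.
    neighbour_user_similarity must be in the same row order as neighbours_df.
    '''
    item_rating = neighbours_df[item_id]

    t1, t2 = 0, 0
    for sim, rat in zip(neighbour_user_similarity, item_rating):
        t1 += rat * sim
        t2 += abs(sim)
    # Normalize the weighted sum by the total similarity, then shift by the active user's mean.
    score = active_user_mean_rating + t1 / t2
    return score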
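
A common form of this neighbourhood prediction is the active user's mean plus the similarity-weighted average of the neighbours' centered ratings. It can be computed for all unseen jokes at once. A minimal sketch with made-up numbers; `predict_scores` and the demo values are hypothetical.

```python
import numpy as np

def predict_scores(neigh_ratings, neigh_sims, active_mean):
    """Score every item in one step.

    neigh_ratings: (n_neighbours, n_items) centered ratings
    neigh_sims:    (n_neighbours,) similarities, in the same row order
    """
    neigh_ratings = np.asarray(neigh_ratings, dtype=float)
    neigh_sims = np.asarray(neigh_sims, dtype=float)
    weighted = neigh_sims @ neigh_ratings
    return active_mean + weighted / np.abs(neigh_sims).sum()

# Made-up example: two neighbours, three unseen jokes.
demo_ratings = [[2.0, -1.0,  0.0],
                [4.0,  1.0, -2.0]]
demo_sims = [0.25, 0.75]
demo_scores = predict_scores(demo_ratings, demo_sims, active_mean=1.0)
best_first = np.argsort(-demo_scores)
print(demo_scores, best_first)
```

The more similar neighbour (weight 0.75) dominates the average, so the first joke, which that neighbour rated highest, gets the top score.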

In [167]:
# Computing user item scores !
how_many = 5
list_of_scores = []
index = [] 

for column in neighbours_df.columns:
    score = score_user_item(column, neighbours_df,neigh_user_sim, active_user_mean_rating)
    list_of_scores.append(score)
    index.append(column)


predictions = np.array(list_of_scores)
# Positions of the `how_many` highest scores, best first
ids = (-predictions).argsort()[:how_many]
# Map the positions back to joke labels
pred = [index[a] for a in ids]
pred

#print('Highest score is', max(list_of_scores))
#print('The highest score obtained by the joke among all the unseen jokes is', joke_to_suggest, 'so we recommend this') 

['joke_94', 'joke_91', 'joke_86', 'joke_71', 'joke_95']