# Collaborative filtering - memory based using cosine distance and kNN

Collaborative filtering (CF) is the process of filtering for information or patterns using techniques that involve collaboration among multiple agents, viewpoints, or data sources. In practice, it is a method of making automatic predictions (filtering) about the interests of a user by collecting preference or taste information from many users.
There are 2 approaches to CF:
1) Memory-Based CF - This approach finds similarity between users or between items in order to recommend similar items. Examples include neighbourhood-based CF and item-based/user-based top-N recommendations.

2) Model-Based CF - This approach uses data mining and machine learning algorithms to predict users' ratings of unrated items. Examples include Singular Value Decomposition (SVD) and Principal Component Analysis (PCA).

In this notebook, I walk through the implementation of a recommender system using memory-based collaborative filtering.
There are 2 approaches to memory-based collaborative filtering:
1) User-User Collaborative Filtering - Here we calculate the similarity of every user to the active user (the user the prediction is for), then sort and keep the top-N most similar users to make predictions for the active user. This is usually very effective but takes a lot of time and resources because of the large number of dimensions of the user-item matrix. For example, if Dennis and Davis like similar items and a new item comes out that Davis likes, we can recommend that item to Dennis on the grounds that the two appear to have similar tastes.

2) Item-Item Collaborative Filtering - This is similar to User-User CF, except that we compute similarity between items in order to recommend similar items. For example, when you buy a product on Amazon you will see a line such as "Users who bought this item also bought...", which suggests that Amazon uses item-item CF widely. That is not to say they use only item-item CF; they use hybrid techniques to better serve users, including those with unusual interests.
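
To make the two views concrete, here is a minimal sketch in plain Python on a made-up ratings table. The users, items, and ratings are hypothetical and are not taken from the dataset used below, and `cosine` is a small helper written for this example. The same table gives user-user similarities when rows are compared and item-item similarities when columns are compared.

```python
import math

# Hypothetical toy ratings: rows are users, columns are items (0 = not rated).
items = ["shampoo", "lotion", "serum"]
ratings = {
    "Dennis": [5, 4, 0],
    "Davis":  [5, 5, 4],
    "Erin":   [1, 0, 5],
}

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# User-user view: compare Dennis's row with every other row.
user_sim = {u: cosine(ratings["Dennis"], v) for u, v in ratings.items() if u != "Dennis"}

# Item-item view: compare the shampoo column with every other column.
columns = {item: [row[j] for row in ratings.values()] for j, item in enumerate(items)}
item_sim = {i: cosine(columns["shampoo"], c) for i, c in columns.items() if i != "shampoo"}

nearest_user = max(user_sim, key=user_sim.get)
print(nearest_user, user_sim)
print(item_sim)
```

In this toy table Davis is Dennis's nearest neighbour, so the serum that Davis rated and Dennis has not is the natural candidate to recommend.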

Importing necessary libraries

In [13]:
# import required libraries
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import operator

In [2]:
# reading the dataset
df = pd.read_csv('../ratings_Beauty.csv')
df

,UserId,ProductId,Rating,Timestamp
0,A39HTATAQ9V7YF,0205616461,5.0,1369699200
1,A3JM6GV9MNOF9X,0558925278,3.0,1355443200
2,A1Z513UWSAAO0F,0558925278,5.0,1404691200
3,A1WMRR494NWEWV,0733001998,4.0,1382572800
4,A3IAAVS479H7M7,0737104473,1.0,1274227200
...,...,...,...,...
2023065,A3DEHKPFANB8VA,B00LORWRJA,5.0,1405296000
2023066,A3DEHKPFANB8VA,B00LOS7MEE,5.0,1405296000
2023067,AG9TJLJUN5OM3,B00LP2YB8E,5.0,1405382400
2023068,AYBIB14QOI9PC,B00LPVG6V0,5.0,1405555200


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2023070 entries, 0 to 2023069
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   UserId     object 
 1   ProductId  object 
 2   Rating     float64
 3   Timestamp  int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 61.7+ MB


The columns UserId and ProductId have dtype object (they hold strings). To use them in the steps below, we convert these categorical values to numeric codes, so we need to encode them.

In [4]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()

In [5]:
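#converting the alphanumeric IDs to integer codes
dataset = df.copy()  # work on a copy so that df keeps its original columns
# fit_transform refits the encoder on every call, so the same object can be reused;
# afterwards label_encoder only holds the ProductId mapping
dataset['user'] = label_encoder.fit_transform(dataset['UserId'])
dataset['product'] = label_encoder.fit_transform(dataset['ProductId'])
dataset.head()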

,UserId,ProductId,Rating,Timestamp,user,product
0,A39HTATAQ9V7YF,0205616461,5.0,1369699200,725046,0
1,A3JM6GV9MNOF9X,0558925278,3.0,1355443200,814606,1
2,A1Z513UWSAAO0F,0558925278,5.0,1404691200,313101,1
3,A1WMRR494NWEWV,0733001998,4.0,1382572800,291075,2
4,A3IAAVS479H7M7,0737104473,1.0,1274227200,802842,3
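
As a rough illustration of what the encoding step does, the sketch below reproduces the idea in plain Python: each distinct value gets an integer code according to its sorted position. The helper `encode_labels` is hypothetical and is not part of scikit-learn; the product IDs are the ones from the first rows of the dataset.

```python
def encode_labels(values):
    """Map each distinct value to an integer code by sorted position (hypothetical helper)."""
    classes = sorted(set(values))
    code_of = {value: code for code, value in enumerate(classes)}
    return [code_of[v] for v in values], classes

product_ids = ["0205616461", "0558925278", "0558925278", "0733001998", "0737104473"]
codes, classes = encode_labels(product_ids)
print(codes)  # [0, 1, 1, 2, 3]
```

Because the codes depend only on sort order, they carry no numeric meaning. They are identifiers, not quantities to do arithmetic on.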


In [6]:
#Calculate each user's average rating and add it as a new column to the original dataset
average_rating = dataset.groupby(by="user", as_index=False)['Rating'].mean()
dataset = pd.merge(dataset, average_rating, on="user")
dataset = dataset.rename(columns={"Rating_x": "real_rating", "Rating_y": "average_rating"})
dataset

,UserId,ProductId,real_rating,Timestamp,user,product,average_rating
0,A39HTATAQ9V7YF,0205616461,5.0,1369699200,725046,0,4.25
1,A39HTATAQ9V7YF,B002OVV7F0,3.0,1369699200,725046,81854,4.25
2,A39HTATAQ9V7YF,B0031IH5FQ,5.0,1369699200,725046,89013,4.25
3,A39HTATAQ9V7YF,B006GQPZ8E,4.0,1369699200,725046,154092,4.25
4,A3JM6GV9MNOF9X,0558925278,3.0,1355443200,814606,1,3.50
...,...,...,...,...,...,...,...
2023065,ADQ41IJPQW2TN,B00LNOKBYW,5.0,1405728000,1012502,249268,5.00
2023066,A1SJD7QDROVPCC,B00LNOKBYW,5.0,1405296000,254698,249268,5.00
2023067,AFPRQT3V8C1U1,B00LNOKBYW,5.0,1405468800,1030275,249268,5.00
2023068,A1RYQPQ01T5D5R,B00LNOKBYW,5.0,1406073600,249628,249268,5.00


Certain users tend to give higher ratings while others tend to give lower ratings. To remove this bias, we normalise each rating by subtracting the user's average rating from it.

In [7]:
dataset['normalized_rating'] = dataset['real_rating'] - dataset['average_rating']
dataset

,UserId,ProductId,real_rating,Timestamp,user,product,average_rating,normalized_rating
0,A39HTATAQ9V7YF,0205616461,5.0,1369699200,725046,0,4.25,0.75
1,A39HTATAQ9V7YF,B002OVV7F0,3.0,1369699200,725046,81854,4.25,-1.25
2,A39HTATAQ9V7YF,B0031IH5FQ,5.0,1369699200,725046,89013,4.25,0.75
3,A39HTATAQ9V7YF,B006GQPZ8E,4.0,1369699200,725046,154092,4.25,-0.25
4,A3JM6GV9MNOF9X,0558925278,3.0,1355443200,814606,1,3.50,-0.50
...,...,...,...,...,...,...,...,...
2023065,ADQ41IJPQW2TN,B00LNOKBYW,5.0,1405728000,1012502,249268,5.00,0.00
2023066,A1SJD7QDROVPCC,B00LNOKBYW,5.0,1405296000,254698,249268,5.00,0.00
2023067,AFPRQT3V8C1U1,B00LNOKBYW,5.0,1405468800,1030275,249268,5.00,0.00
2023068,A1RYQPQ01T5D5R,B00LNOKBYW,5.0,1406073600,249628,249268,5.00,0.00
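
The normalisation can be checked by hand for a single user. This short sketch uses the four ratings of user A39HTATAQ9V7YF from the table above and plain Python instead of pandas.

```python
# Ratings of user A39HTATAQ9V7YF, taken from the table above.
real_ratings = [5.0, 3.0, 5.0, 4.0]

average_rating = sum(real_ratings) / len(real_ratings)
normalized = [r - average_rating for r in real_ratings]

print(average_rating)  # 4.25
print(normalized)      # [0.75, -1.25, 0.75, -0.25]
```

A user's normalized ratings always sum to zero. A user who gives every product the same rating, like the 5.00 users in the last rows of the table, ends up with a normalized rating of 0.00 everywhere.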


To improve the robustness of the model and to save computation time, we only consider products that have at least 200 ratings.
If a product has few ratings, it is difficult to find similar users for a given user based on it.

# Filter based on number of ratings available

In [8]:
#getting the number of ratings for each product and dropping products with fewer than 200 ratings.
rating_of_product = dataset.groupby('product')['real_rating'].count() 
ratings_of_products_df = pd.DataFrame(rating_of_product)
filtered_ratings_per_product = ratings_of_products_df[ratings_of_products_df.real_rating >= 200]
filtered_ratings_per_product

product,real_rating
704,558
719,377
754,288
834,412
843,313
...,...
244448,239
245600,260
247603,233
249109,338


In [9]:
# build a list of products to keep
popular_products = filtered_ratings_per_product.index.tolist()
print("Number of products with at least 200 ratings: ",len(popular_products))
filtered_ratings_data = dataset[dataset["product"].isin(popular_products)]
print("The size of dataset has changed from ", len(dataset), " to ", len(filtered_ratings_data))


Number of products with at least 200 ratings:  934
The size of dataset has changed from  2023070  to  370511
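
The same count-then-filter pattern can be sketched without pandas. The rows below are made up for illustration, and the threshold is lowered to 2 so that the effect is visible on six rows; the notebook itself uses 200.

```python
from collections import Counter

# Hypothetical (user, product, rating) triples; not taken from the dataset.
rows = [
    ("u1", "p1", 5.0), ("u2", "p1", 4.0), ("u3", "p1", 3.0),
    ("u1", "p2", 2.0),
    ("u2", "p3", 5.0), ("u3", "p3", 1.0),
]
min_ratings = 2  # the notebook uses 200

# Count ratings per product, then keep only rows of sufficiently rated products.
counts = Counter(product for _, product, _ in rows)
popular = {product for product, n in counts.items() if n >= min_ratings}
filtered = [row for row in rows if row[1] in popular]

print(sorted(popular))                 # ['p1', 'p3']
print(len(rows), "->", len(filtered))  # 6 -> 5
```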


# Creating the User-item matrix

In [10]:
#Creating the user-item matrix (rows: users, columns: products); unrated entries are filled with 0
similarity = pd.pivot_table(filtered_ratings_data,values='normalized_rating',index='UserId',columns='product')
similarity = similarity.fillna(0)
similarity.head()

UserId,704,719,754,834,843,858,861,873,944,981,...,241604,242018,242048,243416,244376,244448,245600,247603,249109,249211
A0010876CNE3ILIM9HV0,,,,,,,,,,,...,,,,,,,,,,
A0011102257KBXODKL24I,,,,,,,,,,,...,,,,,,,,,,
A00120381FL204MYH7G3B,,,,,,,,,,,...,,,,,,,,,,
A00126503SUWI86KZBMIN,,,,,,,,,,,...,,,,,,,,,,
A001573229XK5T8PI0OKA,,,,,,,,,,,...,,,,,,,,,,


This is a sparse matrix: each user has rated only a few of the products, so most of the entries are 0.
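
To see where the sparsity comes from, the following sketch pivots a handful of made-up long-format rows into a user-item table and measures how many cells are actually filled. It uses plain dictionaries rather than `pd.pivot_table`, and all names and values in it are hypothetical.

```python
# Hypothetical long-format rows: (user, product, normalized_rating).
rows = [
    ("u1", "p1", 0.5), ("u1", "p2", -0.5),
    ("u2", "p1", 1.0),
    ("u3", "p3", -1.0),
]

users = sorted({u for u, _, _ in rows})
products = sorted({p for _, p, _ in rows})

# Start from an all-zero table, then fill in the observed ratings.
matrix = {u: {p: 0.0 for p in products} for u in users}
for user, product, value in rows:
    matrix[user][product] = value

total = len(users) * len(products)
filled = sum(1 for u in users for p in products if matrix[u][p] != 0.0)
density = filled / total
print(f"{filled} of {total} cells filled ({density:.0%})")
```

Even in this tiny example fewer than half of the cells hold a rating. With many users and hundreds of products the filled fraction becomes far smaller.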

There are many measures for computing the similarity matrix. Some of them are:
1) Jaccard Similarity - A statistic used for comparing the similarity and diversity of sample sets. It is defined as the size of the intersection divided by the size of the union of the sample sets.

2) Cosine Similarity - It measures the cosine of the angle between two rating vectors. If the angle is 0° (cosine 1), the vectors have the same orientation; if the angle is 180° (cosine -1), they point in opposite directions and are highly dissimilar.

3) Pearson Similarity - This is centred cosine similarity. We subtract the mean rating from the user's ratings so that the mean is centred at 0, and then calculate the cosine similarity.

Here I use cosine similarity to find similar users.
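
The three measures can be compared on a small example. This is a minimal sketch in plain Python; the helper functions and the rating vectors are made up for illustration, and `pearson` centres each vector on the mean of all its entries, which is a simplification that assumes both users rated every product in the vector.

```python
import math

def jaccard(items_a, items_b):
    """Size of the intersection divided by the size of the union of two sets."""
    union = items_a | items_b
    return len(items_a & items_b) / len(union) if union else 0.0

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def pearson(a, b):
    """Centred cosine: subtract each vector's mean, then take the cosine."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])

# Jaccard only looks at which products were rated, not at the ratings.
print(jaccard({"p1", "p2", "p3"}, {"p2", "p3", "p4"}))  # 0.5

# Two hypothetical users who rated the same three products with opposite preferences.
u = [5, 4, 5]
w = [4, 5, 4]
print(round(cosine(u, w), 3))   # 0.978
print(round(pearson(u, w), 3))  # -1.0
```

On raw ratings the two users look almost identical to cosine similarity, while the centred version shows that their preferences are opposite. This is the reason the ratings were normalised by each user's average before the user-item matrix was built.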

In [14]:
#function to find the k (default 5) users most similar to the given user. It takes a user id and the user-item matrix and returns the ids of the most similar users by cosine similarity.
def getting_top_5_similar_users(user_id, similarity_table, k=5):
    # row of the active user, and the rows of all other users
    user = similarity_table[similarity_table.index == user_id]
    other_users = similarity_table[similarity_table.index != user_id]
    # cosine similarity between the active user and every other user
    similarities = cosine_similarity(user, other_users)[0].tolist()
    indices = other_users.index.tolist()
    index_similarity = dict(zip(indices, similarities))
    # sort by similarity, highest first, and keep the k nearest neighbours
    index_similarity_sorted = sorted(index_similarity.items(), key=operator.itemgetter(1))
    index_similarity_sorted.reverse()
    top_users_similarities = index_similarity_sorted[:k]
    # keep only the user ids
    users = [neighbour[0] for neighbour in top_users_similarities]
    return users
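
The neighbour search only needs the k largest similarities, not a full ordering of all users. As an alternative to sorting everything, here is a minimal sketch using the standard-library `heapq.nlargest`; the similarity scores are made up, and this is a swapped-in technique rather than what the function above does.

```python
import heapq

# Hypothetical similarity scores between the active user and five other users.
similarities = {"u1": 0.10, "u2": 0.85, "u3": -0.20, "u4": 0.40, "u5": 0.60}

k = 3
# Pick the k (user, score) pairs with the highest score.
top_k = heapq.nlargest(k, similarities.items(), key=lambda pair: pair[1])
neighbours = [user for user, _ in top_k]
print(neighbours)  # ['u2', 'u5', 'u4']
```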



In [15]:
# This function takes the similar users and predicts the top product recommendations (5 by default).
def getting_top_5_recommendations_based_on_users(user_id, similar_users, similarity_table, top_recommendations=5):
    # mean normalized rating of each product across the similar users
    similar_users = similarity_table[similarity_table.index.isin(similar_users)]
    similar_users = similar_users.mean(axis=0)
    similar_users_df = pd.DataFrame(similar_users, columns=['mean'])
    # products with a 0 entry for the active user are treated as not rated
    # (this also catches products the user rated exactly at their own average)
    user_df = similarity_table[similarity_table.index == user_id]
    user_df_transposed = user_df.transpose()
    user_df_transposed.columns = ['rating']
    user_df_transposed = user_df_transposed[user_df_transposed['rating'] == 0]
    products_not_rated = user_df_transposed.index.tolist()
    # rank the unrated products by the neighbours' mean rating and keep the best ones
    similar_users_df_filtered = similar_users_df[similar_users_df.index.isin(products_not_rated)]
    similar_users_df_ordered = similar_users_df_filtered.sort_values(by=['mean'], ascending=False)
    top_products = similar_users_df_ordered.head(top_recommendations)
    top_products_indices = top_products.index.tolist()
    return top_products_indices
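
The scoring step can be traced by hand on a tiny example. The sketch below follows the same three steps in plain Python: average the neighbours' normalized ratings per product, keep only the products with a 0 entry for the active user, and rank what remains. The products, ratings, and variable names are hypothetical.

```python
# Hypothetical normalized ratings (0.0 = not rated) over four products.
products = ["p1", "p2", "p3", "p4"]
active_user = [0.5, 0.0, 0.0, 0.0]
neighbours = [
    [0.5, 1.0, 0.0, -0.5],
    [1.0, 0.5, 0.0, 1.0],
]

# Column-wise mean of the neighbours' ratings.
mean_scores = [sum(col) / len(neighbours) for col in zip(*neighbours)]

# Keep only products with a 0.0 entry for the active user, then rank by mean score.
candidates = [(p, s) for p, s, own in zip(products, mean_scores, active_user) if own == 0.0]
candidates.sort(key=lambda pair: pair[1], reverse=True)
top_n = [p for p, _ in candidates[:2]]
print(top_n)  # ['p2', 'p4']
```

Note that a neighbour's 0 entries count toward the mean, so a product rated by only one neighbour is pulled towards zero.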




In [20]:
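def recommend_products_for_user(userId, similarity_matrix):
    # find the nearest neighbours, then rank the products the user has not rated yet
    similar_users = getting_top_5_similar_users(userId, similarity_matrix)
    # use the matrix passed in as an argument rather than the global `similarity`
    product_list = getting_top_5_recommendations_based_on_users(userId, similar_users, similarity_matrix)
    return product_list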

In [21]:
recommend_products_for_user("A2XVNI270N97GL", similarity)

[30773, 27327, 149282, 122707, 122630]
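
The returned values are the integer product codes produced by the label encoder, not the original ProductId strings. A reverse lookup maps them back. The sketch below shows the idea with a plain dictionary: the four code/ID pairs are the ones visible in the encoded table earlier, while the recommendation list is made up for illustration.

```python
# Hypothetical lookup from product code to ProductId;
# the four pairs are the ones shown in the encoded table earlier.
code_to_product_id = {
    0: "0205616461",
    1: "0558925278",
    2: "0733001998",
    3: "0737104473",
}

recommended_codes = [3, 1]  # made-up recommendation list for illustration
recommended_ids = [code_to_product_id[code] for code in recommended_codes]
print(recommended_ids)  # ['0737104473', '0558925278']
```

In the notebook, one way to build such a lookup (an assumption, not something the original code does) would be to take the distinct `product` and `ProductId` pairs from `dataset`.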