# Lab 8: Recommender System

In this assignment, we will study user-based and item-based collaborative filtering.

## 1. Dataset

In this assignment, we will use the MovieLens-100K dataset. It includes 100,000 ratings from 943 users on 1,682 movies.

In [51]:
from math import sqrt
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.metrics.pairwise import linear_kernel
from sklearn.neighbors import NearestNeighbors


# 1. load data
user_ratings_train = pd.read_csv('./ml-100k/u1.base',
                            sep='\t',names=['user_id','movie_id','rating'], usecols=[0,1,2])

user_ratings_test = pd.read_csv('./ml-100k/u1.test',
                            sep='\t',names=['user_id','movie_id','rating'], usecols=[0,1,2])

movie_info =  pd.read_csv('./ml-100k/u.item', 
                          sep='|', names=['movie_id','title'], usecols=[0,1],
                          encoding="ISO-8859-1")

user_ratings_train = pd.merge(movie_info, user_ratings_train)
user_ratings_test = pd.merge(movie_info, user_ratings_test)

# 2. get the rating matrix. Each row is a user, and each column is a movie title.
#    pivot_table averages the ratings when a user rated several movies that share one title.
user_ratings_train = user_ratings_train.pivot_table(index=['user_id'],
                                        columns=['title'],
                                        values='rating')

user_ratings_test = user_ratings_test.pivot_table(index=['user_id'],
                                        columns=['title'],
                                        values='rating')




# 3. align train and test on the same users (rows) and titles (columns)
all_users = user_ratings_train.index.union(user_ratings_test.index)
all_titles = user_ratings_train.columns.union(user_ratings_test.columns)

user_ratings_train = user_ratings_train.reindex(index=all_users, columns=all_titles)
user_ratings_test = user_ratings_test.reindex(index=all_users, columns=all_titles)

print(user_ratings_train.shape)
print(user_ratings_test.shape)

(943, 1664)
(943, 1664)
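
To make the pivot-and-align step easier to follow, here is a minimal sketch on made-up data. The frames `train_long` and `test_long` and their values are hypothetical and not taken from MovieLens; only the `pivot_table` and `reindex` calls mirror the cell above.

```python
import pandas as pd

# Toy long-format ratings (hypothetical values, not from MovieLens).
train_long = pd.DataFrame({
    "user_id": [1, 1, 2],
    "title": ["A", "B", "A"],
    "rating": [5, 3, 4],
})
test_long = pd.DataFrame({
    "user_id": [2, 3],
    "title": ["B", "C"],
    "rating": [2, 1],
})

# One row per user, one column per title; unrated cells become NaN.
train = train_long.pivot_table(index="user_id", columns="title", values="rating")
test = test_long.pivot_table(index="user_id", columns="title", values="rating")

# Reindex both matrices on the union of users and titles so they share one shape.
all_users = train.index.union(test.index)
all_titles = train.columns.union(test.columns)
train = train.reindex(index=all_users, columns=all_titles)
test = test.reindex(index=all_users, columns=all_titles)

print(train.shape, test.shape)
```

User 3 appears only in the toy test set, so its row in `train` is entirely NaN after the reindex. The same effect explains why both real matrices end up with identical shapes.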


## Task 1. User-based CF

* Use Pearson correlation to get the similarity between different users.
* Based on the obtained similarity score, predict the ratings. You can use 5 nearest neighbors or 10 nearest neighbors.
* Compute MAE for the testing set.
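
The steps listed above can be sketched on a toy matrix as follows. This is one possible reading of the task, not the reference solution: the helper `predict_user_based`, the mean-centering of ratings, and the choice to keep only positively correlated neighbors are all assumptions made for illustration.

```python
import pandas as pd

def predict_user_based(ratings: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    """Hypothetical helper: predict each cell from the k most similar users."""
    # Pearson correlation between users, using only movies both users rated.
    sim = ratings.T.corr(method="pearson", min_periods=2)
    means = ratings.mean(axis=1)
    centered = ratings.sub(means, axis=0)  # remove each user's rating bias
    preds = pd.DataFrame(index=ratings.index, columns=ratings.columns, dtype=float)
    for u in ratings.index:
        s = sim.loc[u].drop(u).dropna()
        for m in ratings.columns:
            # Only users who actually rated movie m can act as neighbors.
            raters = centered[m].dropna().index.intersection(s.index)
            top = s.loc[raters].sort_values(ascending=False).head(k)
            top = top[top > 0]  # assumption: keep positively correlated neighbors only
            if top.empty:
                preds.loc[u, m] = means[u]  # fall back to the user's mean
            else:
                offset = (top * centered.loc[top.index, m]).sum() / top.sum()
                preds.loc[u, m] = means[u] + offset
    return preds

# Toy matrix (made-up ratings): NaN marks an unrated movie.
toy = pd.DataFrame(
    {"A": [5, 4, 1], "B": [3, 2, 5], "C": [4, 3, 2], "D": [float("nan"), 4, 1]},
    index=["u1", "u2", "u3"],
)
user_preds = predict_user_based(toy, k=2)
print(user_preds.loc["u1", "D"])
```

Here `u2` rates the shared movies exactly one point below `u1`, so it is the only positive neighbor, and the prediction for `u1` on `D` is `u1`'s mean (4.0) plus `u2`'s offset on `D` (0.75). The double loop is written for readability and would be slow on the full 943 x 1664 matrix.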

In [52]:
from sklearn.metrics import mean_absolute_error

# Preprocess: fill each user's missing ratings with that user's mean training rating.
user_means = user_ratings_train.mean(axis=1)
user_ratings_train_filled = user_ratings_train.apply(lambda x: x.fillna(user_means[x.name]), axis=1)
user_ratings_test_filled = user_ratings_test.apply(lambda x: x.fillna(user_means[x.name]), axis=1)
# Any user without training ratings has a NaN mean, so remaining gaps become 0.
ratings_filled = user_ratings_train_filled.fillna(0)

# Pearson correlation matrix.
# np.corrcoef treats each row of its input as one variable, so passing the
# transposed (movie x user) matrix yields movie-to-movie correlations.
# User-to-user correlations would be np.corrcoef(ratings_filled).
Pearson_matrix = np.corrcoef(ratings_filled.T)

# Predict: similarity-weighted average over all entries (no k-nearest-neighbor cut-off).
weighted_sum = np.dot(Pearson_matrix, ratings_filled.T)
sum_of_weights = np.abs(Pearson_matrix).sum(axis=1)
predictions = weighted_sum / sum_of_weights[:, np.newaxis]

predictions = predictions.T  # back to (user x movie) to match the DataFrame index and columns
predictions_df = pd.DataFrame(predictions, index=user_ratings_train_filled.index, columns=user_ratings_train_filled.columns)

# MAE over every cell of the matrix, including test cells filled with the user mean.
mae = mean_absolute_error(user_ratings_test_filled.values.flatten(), predictions_df.values.flatten())
print(f"Mean Absolute Error: {mae}")


Mean Absolute Error: 0.012179534897074821
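
The MAE above is averaged over every cell of the matrix, and most of those cells hold a filled-in user mean rather than a real test rating, which tends to pull the score toward zero. A common alternative is to score only the cells that contain an actual test rating. The helper `masked_mae` below is a hypothetical sketch of that idea on made-up numbers, not part of the lab template.

```python
import numpy as np
import pandas as pd

def masked_mae(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
    """Hypothetical helper: MAE over the cells where y_true holds a real rating."""
    mask = y_true.notna().to_numpy()
    errors = np.abs(y_true.to_numpy() - y_pred.to_numpy())
    return float(errors[mask].mean())

# Toy example: only two cells of `truth` are observed.
truth = pd.DataFrame([[5.0, np.nan], [np.nan, 2.0]], columns=["A", "B"])
preds = pd.DataFrame([[4.0, 3.0], [3.0, 3.5]], columns=["A", "B"])
print(masked_mae(truth, preds))  # (|5 - 4| + |2 - 3.5|) / 2 = 1.25
```

Applied to the lab data, this would take the unfilled `user_ratings_test` as `y_true`, so that NaN cells are skipped instead of being compared against the user mean.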


## Task 2. Item-based CF

* Use cosine similarity to get the similarity between different items.
* Based on the obtained similarity score, predict the ratings. You can use 5 nearest neighbors or 10 nearest neighbors.
* Compute MAE for the testing set.
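
A minimal sketch of these steps on a toy matrix is given here. It assumes that unrated cells are treated as 0 when computing cosine similarity and that a prediction uses only items the user has actually rated; the helper `predict_item_based` is hypothetical and is one of several reasonable designs.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def predict_item_based(ratings: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    """Hypothetical helper: predict each cell from the user's k most similar rated items."""
    R = ratings.fillna(0).to_numpy(dtype=float)  # assumption: unrated -> 0
    rated = ratings.notna().to_numpy()
    sim = cosine_similarity(R.T)                 # item-by-item similarity
    np.fill_diagonal(sim, 0.0)                   # an item is not its own neighbor
    preds = np.full(R.shape, np.nan)
    for u in range(R.shape[0]):
        for i in range(R.shape[1]):
            # Candidate neighbors: items rated by user u with positive similarity to i.
            cand = np.flatnonzero(rated[u] & (sim[i] > 0))
            if cand.size == 0:
                continue  # no usable neighbor: leave the prediction as NaN
            top = cand[np.argsort(sim[i, cand])[::-1][:k]]
            preds[u, i] = sim[i, top] @ R[u, top] / sim[i, top].sum()
    return pd.DataFrame(preds, index=ratings.index, columns=ratings.columns)

# Toy matrix (made-up ratings): movies A and B are rated alike, C differently.
toy = pd.DataFrame(
    {"A": [5, 1, 5, 4], "B": [5, 1, np.nan, 4], "C": [1, 5, 1, np.nan]},
    index=["u1", "u2", "u3", "u4"],
)
item_preds = predict_item_based(toy, k=1)
print(item_preds.loc["u3", "B"])
```

With `k=1`, the prediction for `u3` on `B` copies `u3`'s rating of `A`, the item most similar to `B` among those `u3` has rated. As with any double loop over users and items, this form is meant for reading and would need vectorizing for the full dataset.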

In [53]:
from sklearn.metrics.pairwise import cosine_similarity

# Cosine similarity between movies: transpose so that each row is a movie.
item_similarity = cosine_similarity(ratings_filled.T)

# Predict: similarity-weighted average of each user's mean-filled ratings over all movies
# (no k-nearest-neighbor cut-off).
weighted_sum = np.dot(ratings_filled, item_similarity)
sum_of_weights = np.abs(item_similarity).sum(axis=0)
predictions = weighted_sum / sum_of_weights[np.newaxis, :]

# predictions is already (user x movie), so no transpose is needed.
predictions_df = pd.DataFrame(predictions, index=user_ratings_train_filled.index, columns=user_ratings_train_filled.columns)

# MAE over every cell of the matrix, including test cells filled with the user mean.
mae = mean_absolute_error(user_ratings_test_filled.values.flatten(), predictions_df.values.flatten())
print(f"Item-based CF Mean Absolute Error: {mae}")


Item-based CF Mean Absolute Error: 0.010858552415203431
