# Recommender Systems

This notebook presents collaborative filtering (CF), the more advanced and widely used family of recommender systems. CF systems can be further divided into memory-based CF and model-based CF.

Singular Value Decomposition (SVD) is used to present model-based CF on movie data ([MovieLens dataset](https://files.grouplens.org/datasets/movielens/ml-100k-README.txt)). Memory-based CF is presented using cosine similarity.

In [1]:
import numpy as np
import pandas as pd

In [2]:
column_names = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('u.data', sep='\t', names=column_names)

In [3]:
df.head()

Unnamed: 0,user_id,item_id,rating,timestamp
0,0,50,5,881250949
1,0,172,5,881250949
2,0,133,1,881250949
3,196,242,3,881250949
4,186,302,3,891717742


In [4]:
# again merge the movie titles with the user data
movie_titles = pd.read_csv("movie_id_titles.txt")
df = pd.merge(df,movie_titles,on='item_id')
df.head()

Unnamed: 0,user_id,item_id,rating,timestamp,title
0,0,50,5,881250949,Star Wars (1977)
1,0,172,5,881250949,"Empire Strikes Back, The (1980)"
2,0,133,1,881250949,Gone with the Wind (1939)
3,196,242,3,881250949,Kolya (1996)
4,186,302,3,891717742,L.A. Confidential (1997)


In [5]:
unique_users = df['user_id'].nunique()
unique_movies = df['item_id'].nunique()

print("Number of unique users: ", unique_users)
print("Number of unique movies: ", unique_movies)

Number of unique users:  944
Number of unique movies:  1682


In [7]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(df, test_size=0.25)

# Memory-based CF

Memory-based CF can be further divided into user-item filtering and item-item filtering.

- User-item filtering: Take a user and find other users whose movie ratings are very similar. Then recommend items that those similar users liked.
    - Users that are similar to you also liked: ...
- Item-item filtering: Take an item and find the users that liked it. Then look for other items these users liked.
    - Users that liked this also liked: ...

Both cases require a user-item matrix.
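
As a small illustration of what such a matrix looks like, the sketch below pivots a hypothetical toy rating table (the `toy` data is made up for this example and is not part of MovieLens) from the long `user_id`/`item_id`/`rating` format into a user-item matrix.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as u.data
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "item_id": [10, 20, 10, 20, 30],
    "rating":  [5, 3, 4, 2, 1],
})

# Rows become users, columns become items; unrated pairs are filled with 0
toy_matrix = toy.pivot_table(index="user_id", columns="item_id", values="rating").fillna(0)
print(toy_matrix)
```

Filling missing ratings with 0 keeps the matrix numeric, but it also means that "not rated" and "rated very low" look similar to any computation that uses the raw values.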

In [9]:
# Create user-item matrix for training data
train_data_matrix = train_data.pivot_table(index='user_id', columns='item_id', values='rating').fillna(0)

# Create user-item matrix for test data
test_data_matrix = test_data.pivot_table(index='user_id', columns='item_id', values='rating').fillna(0)

# Display the matrices
print("Training Data Matrix:")
print(train_data_matrix.head())

print("\nTest Data Matrix:")
print(test_data_matrix.head())

Training Data Matrix:
item_id  1     2     3     4     5     6     7     8     9     10    ...  \
user_id                                                              ...   
0         0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   
1         0.0   3.0   4.0   3.0   3.0   5.0   4.0   0.0   0.0   3.0  ...   
2         4.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   2.0  ...   
3         0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   
4         0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   

item_id  1672  1673  1674  1675  1676  1677  1678  1679  1681  1682  
user_id                                                              
0         0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  
1         0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  
2         0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  
3         0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  
4         0.0   0.0   0.0

In [10]:
from sklearn.metrics.pairwise import pairwise_distances
# Note: metric='cosine' returns cosine distances (1 - cosine similarity), not similarities
user_similarity = pairwise_distances(train_data_matrix, metric='cosine')
item_similarity = pairwise_distances(train_data_matrix.T, metric='cosine') # transpose the matrix to compare items instead of users
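
One detail worth checking here: in scikit-learn, `pairwise_distances(..., metric='cosine')` returns cosine distances, that is, 1 minus the cosine similarity. The sketch below compares it with `cosine_similarity` on a hypothetical toy matrix (`toy_ratings` is made up for illustration), in case actual similarities are wanted as weights.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

# Hypothetical toy user-item matrix: 3 users, 3 items, 0 means "not rated"
toy_ratings = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 0.0],
    [0.0, 2.0, 1.0],
])

toy_distance = pairwise_distances(toy_ratings, metric="cosine")  # 0 on the diagonal
toy_similarity = cosine_similarity(toy_ratings)                  # 1 on the diagonal

# The two are complements of each other
print(np.allclose(toy_similarity, 1 - toy_distance))
```

Users 2 and 3 in this toy matrix share no rated item, so their similarity is 0 and their distance is 1. Using distances as weights would therefore give the least similar users the largest influence.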

In [None]:

def predict(ratings, similarity, type='user'):
    if type == 'user':
        # Users use the rating scale very differently, so work with deviations from each user's mean
        mean_user_rating = ratings.mean(axis=1).to_numpy()
        # np.newaxis turns the means into a column vector that broadcasts across the item columns
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis])
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        # Weighted sum of each user's ratings, normalized by the total weight per item
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred
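
Read as formulas, the two branches of `predict` appear to compute the following. The symbols are notation chosen here for illustration: $r_{v,i}$ is the rating of user $v$ for item $i$ (0 if unrated), $\bar{r}_u$ is the row mean of user $u$, and $s$ denotes entries of the `similarity` argument.

```latex
% type == 'user': mean-centered, weighted average over all users v
\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v} s_{u,v}\,\bigl(r_{v,i} - \bar{r}_v\bigr)}{\sum_{v} \lvert s_{u,v} \rvert}

% type == 'item': weighted average over all items j, without mean-centering
\hat{r}_{u,i} = \frac{\sum_{j} r_{u,j}\, s_{j,i}}{\sum_{j} \lvert s_{i,j} \rvert}
```

Because unrated entries were filled with 0, the row mean $\bar{r}_u$ averages over all items, not only over the items the user actually rated.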

In [14]:
item_prediction = predict(train_data_matrix, item_similarity, type='item')
user_prediction = predict(train_data_matrix, user_similarity, type='user')

In [None]:
# evaluate the memory-based CF model
from sklearn.metrics import mean_squared_error
from math import sqrt

def rmse(prediction, ground_truth):
    # Score only the positions where the ground truth holds an actual (nonzero) rating
    prediction = np.array(prediction)[np.array(ground_truth).nonzero()].flatten()
    ground_truth = np.array(ground_truth)[np.array(ground_truth).nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))

print('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))
print('Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix)))

User-based CF RMSE: 3.270967421492594
Item-based CF RMSE: 3.445976156261265
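
One thing to keep in mind when reading these scores: `rmse` selects entries by position, so it implicitly assumes that the prediction and the test matrix have the same users and items in the same order. Pivoting the two splits separately does not guarantee this (item 1680, for example, is missing from the training columns printed earlier). A minimal sketch of one way to align them, using hypothetical toy matrices (`toy_train`, `toy_test`) rather than the notebook's data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy matrices: the test split lacks user 3 and item 20
toy_train = pd.DataFrame(
    [[5.0, 3.0, 0.0], [4.0, 0.0, 0.0], [0.0, 2.0, 1.0]],
    index=[1, 2, 3], columns=[10, 20, 30],
)
toy_test = pd.DataFrame(
    [[0.0, 4.0], [2.0, 0.0]],
    index=[1, 2], columns=[10, 30],
)

# Reindex the test matrix onto the training labels so both share shape and order
toy_test_aligned = toy_test.reindex(
    index=toy_train.index, columns=toy_train.columns, fill_value=0.0
)

# Positions of the held-out ratings, now valid for any prediction shaped like toy_train
rows, cols = np.nonzero(toy_test_aligned.to_numpy())
print(toy_test_aligned)
```

Note that `reindex` silently drops users or items that occur only in the test split, so those ratings would not be scored in this sketch.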


Memory-based CF algorithms do not scale well. They also don't solve the so-called "cold-start" problem, where a completely new user in the system has no ratings that could be compared with those of other users.

Model-based CF algorithms, on the other hand, scale well and can work with less data. However, they don't solve the "cold-start" problem either.

# Model-based CF

Model-based CF is mostly based on matrix factorization (MF), which is widely used as an unsupervised learning method for dimensionality reduction and copes well with scalability and sparsity. MF aims to learn latent user preferences and latent item attributes from the known ratings.

Singular Value Decomposition (SVD) is often used in model-based CF and is presented here.
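
To make the idea of latent factors concrete, here is a minimal sketch of a truncated SVD on a hypothetical 4x4 toy rating matrix (`toy_ratings` and the choice `k = 2` are assumptions for illustration). It keeps only the two largest singular values and rebuilds a rank-2 approximation of the matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Hypothetical toy matrix: users 1-2 prefer the first two items, users 3-4 the last two
toy_ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])

k = 2  # number of latent factors to keep; must be smaller than min(toy_ratings.shape)
u, s, vt = svds(csr_matrix(toy_ratings), k=k)

# Rank-k reconstruction: user factors (u), factor strengths (s), item factors (vt)
toy_approx = u @ np.diag(s) @ vt
print(np.round(toy_approx, 2))
```

Each row of `u` describes a user in terms of the `k` latent factors and each column of `vt` describes an item; the reconstruction fills in every cell, including those that were 0 in the input.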

In [None]:
# calc sparsity of MovieLens dataset
sparsity = round(1.0 - len(df) / float(unique_users * unique_movies), 3)
print('The sparsity level of MovieLens is ' +  str(sparsity * 100) + '%')

The sparsity level of MovieLens is 93.7%


In [24]:
# SVD
import scipy.sparse as sp  # to create sparse matrix
from scipy.sparse.linalg import svds  # to perform SVD


# Convert train_data_matrix to a sparse matrix
train_data_matrix_sparse = sp.csr_matrix(train_data_matrix)

u, s, vt = svds(train_data_matrix_sparse, k=20) # k is the number of singular values and vectors to compute
s_diag_matrix = np.diag(s)
X_pred = np.dot(np.dot(u, s_diag_matrix), vt)
print('SVD-based CF RMSE: ' + str(rmse(X_pred, test_data_matrix)))

SVD-based CF RMSE: 3.1262281587891105
