# Collaborative Filtering

We transpose the user rating matrix to get movie vectors. We can then look for similar movies or apply clustering.

Cosine similarity serves as the similarity measure. To mitigate the curse of dimensionality, we first reduce the dimensionality of the movie vectors with PCA.

* **Disciplines:** unsupervised learning, recommender systems, collaborative filtering
* **Data:** Movies rated by users (https://grouplens.org/datasets/movielens/)

> F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. <https://doi.org/10.1145/2827872>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os.path

In [41]:
from sklearn.impute import KNNImputer
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
from fuzzywuzzy import process

In [4]:
import warnings

## Load, clean and wrangle data

In [5]:
DATA_SET_ROOT = '../data/ml-latest-small/'
WEB_APP_DATA_ROOT = './recommender/data'

In [6]:
df_movies = pd.read_csv(os.path.join(DATA_SET_ROOT,'movies.csv'), index_col='movieId')

In [7]:
df_ratings = pd.read_csv(os.path.join(DATA_SET_ROOT,'ratings.csv'))

In [8]:
df_ratings = df_ratings.merge(df_movies['title'], on='movieId')

In [9]:
# keep only movies with more than N ratings (note the strict '>' below)
min_rating_count = 10
# https://stackoverflow.com/a/29791952
df_ratings['raiting_count_per_movie'] = df_ratings.groupby('movieId')['movieId'].transform('count')
df_ratings = df_ratings[df_ratings.raiting_count_per_movie > min_rating_count]
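
The `transform('count')` filter can be illustrated on a small made-up frame. This is only a sketch: the toy data, the column name `count_per_movie` and the threshold of 2 are assumptions for illustration. `transform` broadcasts each group's size back onto every row, so the resulting column can be used directly as a row filter.

```python
import pandas as pd

# Toy ratings: movie 1 has three ratings, movie 2 has one, movie 3 has two.
toy = pd.DataFrame({
    "userId":  [1, 2, 3, 1, 2, 3],
    "movieId": [1, 1, 1, 2, 3, 3],
    "rating":  [4.0, 3.5, 5.0, 2.0, 3.0, 4.5],
})

toy_min_count = 2  # hypothetical threshold for this toy example

# transform('count') returns one value per row: the size of that row's group.
toy["count_per_movie"] = toy.groupby("movieId")["movieId"].transform("count")

# The strict '>' keeps only movies with MORE than toy_min_count ratings.
toy_filtered = toy[toy["count_per_movie"] > toy_min_count]
print(toy_filtered)
```

With a strict comparison, a movie that has exactly `toy_min_count` ratings (movie 3 here) is dropped as well.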

In [10]:
df_ratings.head()

,userId,movieId,rating,timestamp,title,raiting_count_per_movie
0,1,1,4.0,964982703,Toy Story (1995),215
1,5,1,4.0,847434962,Toy Story (1995),215
2,7,1,4.5,1106635946,Toy Story (1995),215
3,15,1,2.5,1510577970,Toy Story (1995),215
4,17,1,4.5,1305696483,Toy Story (1995),215


* *https://stackoverflow.com/a/39358924*
* *https://stackoverflow.com/q/45312377*

In [11]:
M_movie_genres = df_movies.genres.str.get_dummies().drop('(no genres listed)', axis=1)

In [12]:
M_ratings = df_ratings.pivot(columns='title', values='rating', index='userId').dropna(how='all')
M_ratings.head()

userId,"'burbs, The (1989)",(500) Days of Summer (2009),10 Cloverfield Lane (2016),10 Things I Hate About You (1999),"10,000 BC (2008)",101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),12 Angry Men (1957),12 Years a Slave (2013),127 Hours (2010),...,Zack and Miri Make a Porno (2008),Zero Dark Thirty (2012),Zero Effect (1998),Zodiac (2007),Zombieland (2009),Zoolander (2001),Zootopia (2016),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
1,,,,,,,,,,,...,,,,,,,,,,4.0
2,,,,,,,,,,,...,,,,,3.0,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,5.0,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,


## Imputation

We have to apply imputation on the user rating matrix, because PCA cannot deal with missing values (NaN).

In [13]:
imputer = KNNImputer(n_neighbors=5)

In [14]:
R_true = imputer.fit_transform(M_ratings)

In [15]:
R_true = pd.DataFrame(data=R_true, columns=M_ratings.columns)
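
As a rough illustration of what `KNNImputer` does: with its default settings, each missing rating is filled with the mean of that movie's ratings from the `n_neighbors` closest users, where closeness is measured only on the movies both users have rated (a NaN-aware Euclidean distance). The tiny matrix below is made up for this sketch.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Rows are users, columns are movies. User 0 has not rated the third movie.
toy_ratings = np.array([
    [5.0, 4.0, np.nan],
    [5.0, 4.0, 1.0],   # identical taste to user 0 on the first two movies
    [4.0, 5.0, 2.0],   # close to user 0
    [1.0, 1.0, 5.0],   # far away from user 0
])

toy_imputer = KNNImputer(n_neighbors=2)
toy_filled = toy_imputer.fit_transform(toy_ratings)

# The two nearest users (rows 1 and 2) rated the third movie 1.0 and 2.0,
# so the gap is filled with their mean.
print(toy_filled[0, 2])
```

Rows without missing values pass through unchanged, so only the gaps are replaced.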

For the recommendation web service we will impute the user vector with the mean movie ratings.

In [16]:
generic_user_vector = M_ratings.mean(skipna=True, axis=0)
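
A minimal sketch of how such a generic vector could be used to complete a new user's ratings. The toy data and the names `toy_generic`, `new_user` and `user_vector` are assumptions for illustration; the actual web service code is not shown here.

```python
import numpy as np
import pandas as pd

# Toy user-by-movie matrix with gaps.
toy_matrix = pd.DataFrame({
    "Movie A": [5.0, np.nan, 3.0],
    "Movie B": [np.nan, 2.0, 4.0],
    "Movie C": [1.0, 1.0, np.nan],
})

# Per-movie mean rating, ignoring missing values.
toy_generic = toy_matrix.mean(skipna=True, axis=0)

# Hypothetical input: a new user who has rated only one movie.
new_user = pd.Series({"Movie B": 5.0})

# Align to the full movie list, then fill the gaps with the per-movie means.
user_vector = new_user.reindex(toy_generic.index).fillna(toy_generic)
print(user_vector)
```

The user's own ratings are kept, and every unrated movie falls back to its average rating.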

Save preprocessed data for web service.

In [17]:
R_true.to_json(os.path.join(WEB_APP_DATA_ROOT,'user_rating_matrix.json'))

In [18]:
generic_user_vector.to_json(os.path.join(WEB_APP_DATA_ROOT,'generic_user_vector.json'))
# read with pd.read_json(..., typ='series')
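
The round trip hinted at in the comment can be sketched in memory, without touching the file system. The toy vector is made up; `convert_axes=False` is an optional choice made here so that the titles stay plain strings.

```python
from io import StringIO

import pandas as pd

toy_vector = pd.Series({"Toy Story (1995)": 3.5, "Heat (1995)": 4.0})

# Without a path, to_json() returns the JSON text instead of writing a file.
payload = toy_vector.to_json()

# typ='series' makes read_json build a Series instead of a DataFrame.
restored = pd.read_json(StringIO(payload), typ="series", convert_axes=False)
print(restored)
```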

## Transposition

Transpose the matrix so that each row is a movie vector instead of a user vector.

In [19]:
R_true_user = R_true
R_true = R_true_user.T.copy()
R_true.head()

title,0,1,2,3,4,5,6,7,8,9,...,600,601,602,603,604,605,606,607,608,609
"'burbs, The (1989)",3.7,3.0,2.3,3.2,2.2,4.3,3.3,3.0,3.6,3.2,...,3.4,4.2,2.6,3.3,3.1,2.9,3.7,2.3,3.3,3.0
(500) Days of Summer (2009),4.3,3.8,2.6,3.7,3.7,3.3,3.6,4.0,3.3,4.3,...,4.4,4.0,3.1,4.1,3.9,4.2,3.9,4.5,3.4,3.5
10 Cloverfield Lane (2016),4.0,3.7,3.6,3.4,3.8,4.1,3.2,3.6,3.7,3.2,...,4.0,3.5,3.5,3.5,3.7,3.7,3.6,3.8,3.5,4.0
10 Things I Hate About You (1999),4.5,3.3,3.1,3.7,3.2,3.5,3.6,3.9,3.3,4.2,...,4.0,3.8,3.0,3.3,5.0,3.7,3.8,3.8,4.0,3.6
"10,000 BC (2008)",2.7,2.3,2.8,2.1,2.3,2.9,3.1,2.8,2.6,3.1,...,2.7,2.7,2.5,2.5,2.9,2.5,2.7,2.8,2.1,2.8


## Dimensionality reduction with PCA

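We reduce the number of dimensions to mitigate the curse of dimensionality. Each movie vector has one dimension per user, and in such a high-dimensional space the similarities between movies become hard to tell apart and are no longer meaningful.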

> *[Wikipedia:](https://en.wikipedia.org/wiki/Curse_of_dimensionality) "When a measure such as a Euclidean distance is defined using many coordinates, there is little difference in the distances between different pairs of samples."*
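
The effect described in the quote can be reproduced with random data. The sketch below measures the relative contrast, `(max - min) / min`, of the Euclidean distances from one random point to all others. The point count, the dimensions and the helper name `relative_contrast` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(max - min) / min of the distances from one random point to the rest."""
    points = rng.random((n_points, n_dims))
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrast_low = relative_contrast(2)
contrast_high = relative_contrast(600)

print(f"2 dimensions:   {contrast_low:.2f}")
print(f"600 dimensions: {contrast_high:.2f}")
```

In few dimensions the nearest and farthest points differ by a large factor; in many dimensions all points end up at nearly the same distance.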

In [20]:
new_dimension_size = 10

In [21]:
pca = PCA()
pca.fit(R_true)
W = pca.transform(R_true)
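
How much variance a given number of components retains can be checked with `explained_variance_ratio_`. The sketch below uses synthetic data as a stand-in for the movie matrix (shapes, seed and noise level are assumptions). It also shows that keeping the first `k` columns of a full PCA gives the same projection as fitting with `n_components=k`, provided both use the same solver.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 "movies" x 50 "users", driven by 5 latent factors.
latent = rng.normal(size=(200, 5))
loadings = rng.normal(size=(5, 50))
X = latent @ loadings + 0.1 * rng.normal(size=(200, 50))

k = 10

# Variant 1: fit all components, then keep the first k columns.
pca_full = PCA(svd_solver="full").fit(X)
W_full = pca_full.transform(X)[:, :k]

# Variant 2: let PCA truncate to k components directly.
pca_k = PCA(n_components=k, svd_solver="full").fit(X)
W_k = pca_k.transform(X)

retained = pca_full.explained_variance_ratio_[:k].sum()
print(f"variance retained by the first {k} components: {retained:.3f}")
print("same projection:", np.allclose(W_full, W_k))
```

Fitting all components first is convenient when the explained variance is inspected before choosing the number of dimensions.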

In [22]:
R_reduced = pd.DataFrame(data=W[:,:new_dimension_size], index=R_true.index)
R_reduced.head()

title,0,1,2,3,4,5,6,7,8,9
"'burbs, The (1989)",4.393776,-2.560851,-1.584095,3.04677,-1.294251,0.25323,2.011061,1.909395,-1.117368,2.392422
(500) Days of Summer (2009),-8.236289,-0.142313,3.035142,-3.212539,-1.998334,-0.035503,-1.207209,-1.212478,-0.460058,0.011536
10 Cloverfield Lane (2016),-5.835672,-1.231815,0.352325,-0.681045,0.680107,-1.132061,0.051418,0.789037,-0.985281,-0.064272
10 Things I Hate About You (1999),-3.336827,0.487274,-3.093922,-1.13958,-2.341061,1.652057,-0.490742,-1.755479,-2.594218,-0.401042
"10,000 BC (2008)",19.51086,-0.406645,-0.821071,-1.0041,-0.20477,0.304663,0.12835,-0.282667,0.410991,0.499342


## Cosine similarity approach

### Select a movie vector

In [80]:
selection = process.extractOne("forrest gump", R_reduced.index)[0]
selection

'Forrest Gump (1994)'

In [81]:
# R_reduced.loc[selection]

### Calculate similarities to all other movies
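
For reference, the cosine similarity of two vectors `u` and `v` is `u·v / (‖u‖ ‖v‖)`: it depends only on the angle between them, not on their lengths. A small sketch with made-up vectors, compared against scikit-learn's `cosine_similarity` (the helper name `cosine` is hypothetical):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the length
c = np.array([-3.0, 0.0, 1.0])  # orthogonal to a: 1*(-3) + 2*0 + 3*1 = 0

def cosine(u, v):
    """Dot product divided by the product of the vector norms."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

manual = [cosine(a, b), cosine(a, c)]

# cosine_similarity expects 2-D inputs and returns a matrix of pairwise values.
library = cosine_similarity([a], [b, c])[0]

print(manual)
print(library)
```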

In [78]:
# similarity of every movie vector to the selected one (a single-row frame)
sim = pd.DataFrame(
    data=cosine_similarity(R_reduced, R_reduced.loc[[selection]]),
    columns=['similarity'],
    index=R_reduced.index,
)

In [79]:
sim.sort_values(ascending=False, by='similarity')[:20]

title,similarity
Forrest Gump (1994),1.0
Good Will Hunting (1997),0.976642
Cast Away (2000),0.976377
Emma (1996),0.975894
Mary Poppins (1964),0.972237
"River Runs Through It, A (1992)",0.968305
Remember the Titans (2000),0.964427
"Pursuit of Happyness, The (2006)",0.964115
"Beautiful Mind, A (2001)",0.964098
Say Anything... (1989),0.960328


Looks good! Clustering could improve this further. When the selected movie lies at the edge of a cluster, some of its nearest neighbours may belong to a different cluster, which leads to unappealing recommendations. Clustering would prevent that.
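
One possible way to combine the two ideas is sketched below under stated assumptions: k-means is just one choice of clustering algorithm, the number of clusters is arbitrary, and the data is a synthetic stand-in for `R_reduced`. The helper `recommend_within_cluster` is hypothetical. Candidates are restricted to the cluster of the selected movie before ranking by cosine similarity.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)

# Synthetic stand-in: 3 well-separated groups of 20 "movies" in 10 dimensions.
centers = rng.normal(scale=5.0, size=(3, 10))
vectors = np.vstack([c + rng.normal(size=(20, 10)) for c in centers])
toy_reduced = pd.DataFrame(vectors, index=[f"movie_{i}" for i in range(60)])

# Assign every movie to a cluster.
labels = pd.Series(
    KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(toy_reduced),
    index=toy_reduced.index,
)

def recommend_within_cluster(title, n=5):
    """Rank only movies from the same cluster as `title` by cosine similarity."""
    same_cluster = toy_reduced[labels == labels[title]].drop(index=title)
    scores = cosine_similarity(same_cluster, toy_reduced.loc[[title]]).ravel()
    return pd.Series(scores, index=same_cluster.index).nlargest(n)

recs = recommend_within_cluster("movie_0")
print(recs)
```

The trade-off is that a good match just across a cluster border is never recommended, so the number of clusters needs some tuning.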

### API'fy

Make this approach reusable for the web service.
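
As a rough idea of what a reusable entry point could look like, here is a minimal sketch. The function name `most_similar` and its signature are assumptions, not the notebook's actual API. It expects an exact title, so fuzzy matching would happen before the call.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def most_similar(reduced, title, n=10):
    """Return the n rows of `reduced` most cosine-similar to `title`, excluding itself."""
    if title not in reduced.index:
        raise KeyError(f"unknown title: {title!r}")
    scores = cosine_similarity(reduced, reduced.loc[[title]]).ravel()
    ranked = pd.Series(scores, index=reduced.index, name="similarity")
    return ranked.drop(index=title).nlargest(n)

# Toy stand-in for the reduced movie matrix.
toy_movies = pd.DataFrame(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]],
    index=["A", "B", "C", "D"],
)

top = most_similar(toy_movies, "A", n=2)
print(top)
```

Computing similarities on demand like this keeps memory use low; precomputing them for all movie pairs trades memory for faster lookups.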

#### Similarity Matrix