# Collaborative Filtering

We transpose the user rating matrix so that each movie is represented by a vector of user ratings. We can then look for similar movies or apply clustering.

Cosine similarity is used as the similarity measure. To mitigate the curse of dimensionality, we reduce the movie vectors with PCA beforehand.

* **Disciplines:** Unsupervised Learning, recommender systems, collaborative filtering.
* **Data:** Movies rated by users (https://grouplens.org/datasets/movielens/)

> F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. <https://doi.org/10.1145/2827872>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os.path

In [2]:
from sklearn.impute import KNNImputer
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
from fuzzywuzzy import process

In [4]:
import warnings

## Load, clean and wrangle data

In [5]:
DATA_SET_ROOT = '../data/ml-latest-small/'
WEB_APP_DATA_ROOT = './recommender/data'

In [6]:
df_movies = pd.read_csv(os.path.join(DATA_SET_ROOT,'movies.csv'), index_col='movieId')

In [7]:
df_ratings = pd.read_csv(os.path.join(DATA_SET_ROOT,'ratings.csv'))

In [8]:
df_ratings = df_ratings.merge(df_movies['title'], on='movieId')

In [9]:
# keep only movies that have more than min_rating_count ratings
min_rating_count = 10
# https://stackoverflow.com/a/29791952
df_ratings['raiting_count_per_movie'] = df_ratings.groupby('movieId')['movieId'].transform('count')
df_ratings = df_ratings[df_ratings.raiting_count_per_movie > min_rating_count]

In [10]:
df_ratings.head()

,userId,movieId,rating,timestamp,title,raiting_count_per_movie
0,1,1,4.0,964982703,Toy Story (1995),215
1,5,1,4.0,847434962,Toy Story (1995),215
2,7,1,4.5,1106635946,Toy Story (1995),215
3,15,1,2.5,1510577970,Toy Story (1995),215
4,17,1,4.5,1305696483,Toy Story (1995),215


* *https://stackoverflow.com/a/39358924*
* *https://stackoverflow.com/q/45312377*

In [11]:
M_movie_genres = df_movies.genres.str.get_dummies().drop('(no genres listed)', axis=1)

In [12]:
M_ratings = df_ratings.pivot(columns='title', values='rating', index='userId').dropna(how='all')
M_ratings.head()

userId,"'burbs, The (1989)",(500) Days of Summer (2009),10 Cloverfield Lane (2016),10 Things I Hate About You (1999),"10,000 BC (2008)",101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),12 Angry Men (1957),12 Years a Slave (2013),127 Hours (2010),...,Zack and Miri Make a Porno (2008),Zero Dark Thirty (2012),Zero Effect (1998),Zodiac (2007),Zombieland (2009),Zoolander (2001),Zootopia (2016),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
1,,,,,,,,,,,...,,,,,,,,,,4.0
2,,,,,,,,,,,...,,,,,3.0,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,5.0,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,


## Imputation

We have to apply imputation on the user rating matrix, because PCA cannot deal with missing values (NaN).

In [13]:
imputer = KNNImputer(n_neighbors=5)

In [14]:
R_true = imputer.fit_transform(M_ratings)

In [15]:
R_true = pd.DataFrame(data=R_true, columns=M_ratings.columns)
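
To make the imputation step less of a black box, here is a minimal pure-Python sketch of the idea behind k-nearest-neighbour imputation: each missing entry is filled with the mean of that column over the nearest rows that have it observed, where distance is measured only on shared coordinates. The function names `nan_euclidean` and `knn_impute` and the toy matrix are assumptions made for illustration; this is not scikit-learn's implementation, only a simplified version with uniform weights.

```python
import math

NAN = float("nan")

def nan_euclidean(a, b):
    """Euclidean distance over coordinates present in both rows,
    rescaled by (total coordinates / shared coordinates)."""
    shared = [(x, y) for x, y in zip(a, b)
              if not (math.isnan(x) or math.isnan(y))]
    if not shared:
        return math.inf  # no common coordinate: treat as infinitely far
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(len(a) / len(shared) * squared)

def knn_impute(rows, n_neighbors=2):
    """Fill each NaN with the mean of that column over the n_neighbors
    nearest rows in which the column is observed."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if not math.isnan(value):
                continue
            donors = sorted(
                (nan_euclidean(row, other), other[j])
                for k, other in enumerate(rows)
                if k != i and not math.isnan(other[j])
            )[:n_neighbors]
            if donors:  # leave the NaN in place if nobody observed column j
                filled[i][j] = sum(v for _, v in donors) / len(donors)
    return filled

# Toy matrix (illustrative only): rows are users, columns are movies.
toy = [[1.0, 2.0, NAN],
       [3.0, 4.0, 3.0],
       [NAN, 6.0, 5.0],
       [8.0, 8.0, 7.0]]
imputed = knn_impute(toy)
# imputed[0][2] is 4.0 and imputed[2][0] is 5.5
```

Observed entries are never altered; only the gaps are filled, which is what PCA needs further down.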

## Transposition

Transpose the matrix so that each row is a movie vector (one entry per user) instead of a user vector.

In [16]:
R_true_user = R_true
R_true = R_true_user.T.copy()
R_true.head()

title,0,1,2,3,4,5,6,7,8,9,...,600,601,602,603,604,605,606,607,608,609
"'burbs, The (1989)",3.7,3.0,2.3,3.2,2.2,4.3,3.3,3.0,3.6,3.2,...,3.4,4.2,2.6,3.3,3.1,2.9,3.7,2.3,3.3,3.0
(500) Days of Summer (2009),4.3,3.8,2.6,3.7,3.7,3.3,3.6,4.0,3.3,4.3,...,4.4,4.0,3.1,4.1,3.9,4.2,3.9,4.5,3.4,3.5
10 Cloverfield Lane (2016),4.0,3.7,3.6,3.4,3.8,4.1,3.2,3.6,3.7,3.2,...,4.0,3.5,3.5,3.5,3.7,3.7,3.6,3.8,3.5,4.0
10 Things I Hate About You (1999),4.5,3.3,3.1,3.7,3.2,3.5,3.6,3.9,3.3,4.2,...,4.0,3.8,3.0,3.3,5.0,3.7,3.8,3.8,4.0,3.6
"10,000 BC (2008)",2.7,2.3,2.8,2.1,2.3,2.9,3.1,2.8,2.6,3.1,...,2.7,2.7,2.5,2.5,2.9,2.5,2.7,2.8,2.1,2.8


## Dimensionality reduction with PCA

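With one dimension per user, the movie vectors are subject to the curse of dimensionality: distances between them become nearly uniform, so similarity is no longer meaningful. We therefore project the vectors onto a few principal components first.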

> *[Wikipedia:](https://en.wikipedia.org/wiki/Curse_of_dimensionality) "When a measure such as a Euclidean distance is defined using many coordinates, there is little difference in the distances between different pairs of samples."*
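
The effect described in the quote can be observed with a small experiment. The sketch below is only an illustration on uniformly random points, not on the rating data. It measures how much the farthest and the nearest neighbour of a query point differ, relative to the nearest distance. The helper name `relative_contrast`, the point count and the seed are arbitrary choices.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest Euclidean distance from one query
    point to n_points uniform random points in the unit cube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

# The contrast shrinks as the number of dimensions grows.
contrast = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrast.items():
    print(f"dim={d:5d}  relative contrast={c:.3f}")
```

In low dimensions the nearest point is much closer than the farthest one; in high dimensions all points end up at roughly the same distance, which is the reason for reducing the dimensionality before comparing movies.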

In [17]:
new_dimension_size = 5

In [18]:
pca = PCA()
pca.fit(R_true)
W = pca.transform(R_true)

In [19]:
R_reduced = pd.DataFrame(data=W[:,:new_dimension_size], index=R_true.index)
R_reduced.head()

title,0,1,2,3,4
"'burbs, The (1989)",4.393776,-2.560851,-1.584095,3.04677,-1.294251
(500) Days of Summer (2009),-8.236289,-0.142313,3.035142,-3.212539,-1.998334
10 Cloverfield Lane (2016),-5.835672,-1.231815,0.352325,-0.681045,0.680107
10 Things I Hate About You (1999),-3.336827,0.487274,-3.093922,-1.13958,-2.341061
"10,000 BC (2008)",19.51086,-0.406645,-0.821071,-1.0041,-0.20477
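
For intuition about what the projection above does, here is a minimal NumPy sketch of PCA via the singular value decomposition: centre the data, take the leading right singular vectors as principal axes, and project onto them. The helper name `pca_project` and the random toy data are assumptions for illustration. scikit-learn's `PCA` may flip the sign of individual components, so the numbers need not match its output exactly.

```python
import numpy as np

def pca_project(X, n_components):
    """Project the rows of X onto the top n_components principal axes."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA operates on centred data
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Toy data (illustrative only): 50 samples, 8 features, two of them dominant.
rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 8)) * np.array([5, 3, 1, 1, 1, 1, 1, 1])
reduced = pca_project(toy, n_components=2)
print(reduced.shape)        # (50, 2)
print(reduced.var(axis=0))  # variance per component, in decreasing order
```

Because the components are ordered by explained variance, fitting all components and keeping the first few columns, as done in the cells above, yields the same coordinates as keeping only the leading components.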


## Cosine similarity approach

### Select a movie vector

In [20]:
selection = process.extractOne("forrest gump", R_reduced.index)[0]
selection

'Forrest Gump (1994)'

In [21]:
# R_reduced.loc[selection]

### Calculate similarities to all other movies

In [22]:
query = R_reduced.loc[[selection]]  # one-row DataFrame keeps the input 2-D
sim = pd.DataFrame(data=cosine_similarity(R_reduced, query), columns=['similarity'], index=R_reduced.index)
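
As a sanity check on what `cosine_similarity` computes, the measure is the dot product of two vectors divided by the product of their lengths. The sketch below applies this by hand to the first two rows of the `R_reduced.head()` output shown earlier. The values are copied from that output and therefore rounded to six decimals, and the helper name `cosine` is an assumption.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Rounded values copied from the R_reduced.head() output.
burbs = [4.393776, -2.560851, -1.584095, 3.04677, -1.294251]
summer = [-8.236289, -0.142313, 3.035142, -3.212539, -1.998334]

print(round(cosine(burbs, summer), 4))  # -0.7979
```

The result agrees with the entry for this pair in the full similarity matrix computed later in the notebook (-0.797876). Only the direction of the vectors matters, so a movie that is rated consistently higher or lower along the same components still counts as similar.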

In [23]:
sim.sort_values(ascending=False, by='similarity')[:20]

title,similarity
Forrest Gump (1994),1.0
We're the Millers (2013),0.994606
Emma (1996),0.993815
"Lion King, The (1994)",0.992916
"Sound of Music, The (1965)",0.990602
Mary Poppins (1964),0.989862
Mulan (1998),0.989144
Cast Away (2000),0.988073
50 First Dates (2004),0.986272
Say Anything... (1989),0.984337


Looks good! One could improve this with clustering. This would prevent unappealing recommendations when the selected movie lies at the edge of a cluster. In that case its nearest neighbours may belong to another cluster, leading to an uninteresting recommendation.
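
One way the clustering idea could be wired in is sketched below, assuming cluster labels have already been computed by some clustering step that is not part of this notebook. The function name `same_cluster_neighbours`, the placeholder titles and the labels are all made up for illustration.

```python
def same_cluster_neighbours(ranked_titles, cluster_of, selected, k=3):
    """Keep the k best-ranked titles that share the selected title's cluster.

    ranked_titles : titles ordered from most to least similar
    cluster_of    : dict mapping title -> cluster label (assumed precomputed)
    """
    target = cluster_of[selected]
    same = [t for t in ranked_titles
            if t != selected and cluster_of.get(t) == target]
    return same[:k]

# Made-up ranking and cluster labels, purely for illustration.
ranked = ["movie_a", "movie_b", "movie_c", "movie_d", "movie_e"]
labels = {"movie_a": 0, "movie_b": 1, "movie_c": 0, "movie_d": 1, "movie_e": 0}

picks = same_cluster_neighbours(ranked, labels, selected="movie_a", k=2)
print(picks)  # ['movie_c', 'movie_e']
```

Here `movie_b` is the closest neighbour but belongs to a different cluster, so it is skipped in favour of slightly less similar movies from the same cluster.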

### API'fy

Make this approach reusable for the web service.

#### Similarity Matrix

In [24]:
M_sim = pd.DataFrame(data=cosine_similarity(R_reduced), columns=R_reduced.index, index=R_reduced.index)

In [25]:
M_sim.head()

title,"'burbs, The (1989)",(500) Days of Summer (2009),10 Cloverfield Lane (2016),10 Things I Hate About You (1999),"10,000 BC (2008)",101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),12 Angry Men (1957),12 Years a Slave (2013),127 Hours (2010),...,Zack and Miri Make a Porno (2008),Zero Dark Thirty (2012),Zero Effect (1998),Zodiac (2007),Zombieland (2009),Zoolander (2001),Zootopia (2016),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
"'burbs, The (1989)",1.0,-0.797876,-0.685052,-0.346756,0.695186,0.80152,0.345173,-0.633271,-0.734686,-0.723037,...,-0.300983,-0.671683,-0.604245,-0.771678,-0.90554,-0.298943,-0.644253,-0.762035,0.569394,0.819944
(500) Days of Summer (2009),-0.797876,1.0,0.866676,0.523757,-0.853036,-0.863128,-0.666903,0.799974,0.92264,0.9526,...,0.73018,0.838297,0.835419,0.978929,0.550655,-0.302285,0.880965,0.748865,-0.748861,-0.873192
10 Cloverfield Lane (2016),-0.685052,0.866676,1.0,0.532459,-0.955532,-0.884184,-0.820137,0.91883,0.889949,0.949298,...,0.461716,0.974646,0.951575,0.927044,0.580998,-0.429887,0.994407,0.734276,-0.951842,-0.947
10 Things I Hate About You (1999),-0.346756,0.523757,0.532459,1.0,-0.593663,-0.785061,-0.076248,0.553759,0.645618,0.638468,...,0.67159,0.64545,0.638597,0.444303,0.072454,-0.338543,0.536282,0.823407,-0.523914,-0.357565
"10,000 BC (2008)",0.695186,-0.853036,-0.955532,-0.593663,1.0,0.909162,0.689389,-0.990922,-0.962873,-0.962734,...,-0.507072,-0.989731,-0.990436,-0.908075,-0.595574,0.394405,-0.960196,-0.839163,0.975375,0.939574


#### Persist Matrix

In [26]:
M_sim.to_json(os.path.join(WEB_APP_DATA_ROOT,'cosine_similarity_matrix.json'))

#### Interface implementation

In [27]:
class RecommenderCossim:
    def __init__(self, movie_sim_path=os.path.join(WEB_APP_DATA_ROOT,'cosine_similarity_matrix.json')):
        self.movie_similarities = pd.read_json(movie_sim_path)
        
    def recommend(self, raw_title, k=20, random_n=None):
        """
        Recommendation based on cosine similarity neighbours.

        Parameters
        ----------
        raw_title : str
            Movie title as raw user input. Matched with fuzzywuzzy.
        k : int
            How many of the most similar movies to consider.
        random_n : int, optional
            If given, return a random sample of this many movies drawn
            from the k most similar ones instead of all k.

        Returns
        -------
        matched_title, recommendations : str, list
        """
        matched_title = process.extractOne(raw_title, self.movie_similarities.index)[0]
        # The matched movie itself normally ranks first (similarity 1.0), so skip it.
        recommendations = self.movie_similarities[matched_title].sort_values(ascending=False)[1:k+1]
        if random_n is not None:
            # Never request more samples than there are candidates.
            recommendations = recommendations.sample(n=min(random_n, len(recommendations)))
        return matched_title, list(recommendations.index)
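
The `[1:k+1]` slice relies on the selected movie sorting to the very first position. If another movie ties with it at similarity 1.0, the selected title could slip into the recommendations. A possible alternative, sketched here on a plain dict with made-up similarity values, is to exclude the selected title by label before ranking. The function name `top_k_neighbours` is an assumption and not part of the class above.

```python
def top_k_neighbours(similarities, title, k=20):
    """Return the k titles most similar to `title`, excluding `title` itself.

    similarities : dict mapping title -> {other_title: similarity}
    """
    row = similarities[title]
    ranked = sorted(
        ((score, other) for other, score in row.items() if other != title),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [other for _, other in ranked[:k]]

# Made-up similarities; movie_b ties with the selected movie at 1.0.
toy_sim = {
    "movie_a": {"movie_a": 1.0, "movie_b": 1.0, "movie_c": 0.2, "movie_d": -0.5},
}

neighbours = top_k_neighbours(toy_sim, "movie_a", k=2)
print(neighbours)  # ['movie_b', 'movie_c']
```

Filtering by label keeps the result independent of how ties are ordered by the sort.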

In [28]:
r = RecommenderCossim()

In [29]:
r.recommend('terminator', random_n=3, k=50)[1]

['Dark Knight Rises, The (2012)',
 'Toy Story 2 (1999)',
 'Ghostbusters (a.k.a. Ghost Busters) (1984)']

In [30]:
r.recommend('terminator')

('Terminator 2: Judgment Day (1991)',
 ["Bill & Ted's Excellent Adventure (1989)",
  'Ghostbusters (a.k.a. Ghost Busters) (1984)',
  'Batman: Mask of the Phantasm (1993)',
  'Femme Nikita, La (Nikita) (1990)',
  'Evil Dead, The (1981)',
  'Pineapple Express (2008)',
  'Braveheart (1995)',
  'Dirty Harry (1971)',
  'Die Hard (1988)',
  'Star Wars: Episode IV - A New Hope (1977)',
  'Toy Story 2 (1999)',
  'Cloud Atlas (2012)',
  'Terminator, The (1984)',
  'Aliens (1986)',
  "There's Something About Mary (1998)",
  'Day the Earth Stood Still, The (1951)',
  'Ghost in the Shell (Kôkaku kidôtai) (1995)',
  'Gattaca (1997)',
  'Saving Private Ryan (1998)',
  'Collateral (2004)'])

In [31]:
r.recommend('forrest gump')

('Forrest Gump (1994)',
 ["We're the Millers (2013)",
  'Emma (1996)',
  'Lion King, The (1994)',
  'Sound of Music, The (1965)',
  'Mary Poppins (1964)',
  'Mulan (1998)',
  'Cast Away (2000)',
  '50 First Dates (2004)',
  'Say Anything... (1989)',
  'Dead Zone, The (1983)',
  'Four Rooms (1995)',
  'Boyz N the Hood (1991)',
  'Few Good Men, A (1992)',
  'As Good as It Gets (1997)',
  'Tombstone (1993)',
  'Toy Story (1995)',
  'Dances with Wolves (1990)',
  'Amistad (1997)',
  'Apollo 13 (1995)',
  'Untouchables, The (1987)'])

In [32]:
r.recommend('big short')

('Big Short, The (2015)',
 ['True Grit (2010)',
  '127 Hours (2010)',
  'About Time (2013)',
  'Walk the Line (2005)',
  'Girl with the Dragon Tattoo, The (Män som hatar kvinnor) (2009)',
  'Eastern Promises (2007)',
  'Moon (2009)',
  'Hero (Ying xiong) (2002)',
  'Interstellar (2014)',
  'Shutter Island (2010)',
  'The Nice Guys (2016)',
  'Hot Fuzz (2007)',
  'Secret Life of Walter Mitty, The (2013)',
  'Harry Potter and the Deathly Hallows: Part 1 (2010)',
  'Dogville (2003)',
  'Fracture (2007)',
  '22 Jump Street (2014)',
  'Boyhood (2014)',
  'Crazy, Stupid, Love. (2011)',
  'Finding Nemo (2003)'])