# Collaborative Filtering

Movie Recommendation based on Non-negative Matrix Factorization.

* **Disciplines:** Unsupervised Learning, recommender systems, collaborative filtering.
* **Data:** Movies rated by users (https://grouplens.org/datasets/movielens/)

> F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. <https://doi.org/10.1145/2827872>

In [233]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm
import seaborn as sns
import os.path

In [234]:
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.decomposition import NMF

In [235]:
from fuzzywuzzy import process

## Load, clean and wrangle data

In [236]:
DATA_SET_ROOT = '../data/ml-latest-small/'
WEB_APP_DATA_ROOT = './recommender/data'

In [237]:
df_movies = pd.read_csv(os.path.join(DATA_SET_ROOT,'movies.csv'), index_col='movieId')

In [238]:
df_ratings = pd.read_csv(os.path.join(DATA_SET_ROOT,'ratings.csv'))

In [239]:
df_ratings = df_ratings.merge(df_movies['title'], on='movieId')

In [240]:
# keep only movies that have more than min_rating_count ratings
min_rating_count = 10
# https://stackoverflow.com/a/29791952
df_ratings['raiting_count_per_movie'] = df_ratings.groupby('movieId')['movieId'].transform('count')
df_ratings = df_ratings[df_ratings.raiting_count_per_movie > min_rating_count]

In [241]:
df_ratings.head()

| | userId | movieId | rating | timestamp | title | raiting_count_per_movie |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 | Toy Story (1995) | 215 |
| 1 | 5 | 1 | 4.0 | 847434962 | Toy Story (1995) | 215 |
| 2 | 7 | 1 | 4.5 | 1106635946 | Toy Story (1995) | 215 |
| 3 | 15 | 1 | 2.5 | 1510577970 | Toy Story (1995) | 215 |
| 4 | 17 | 1 | 4.5 | 1305696483 | Toy Story (1995) | 215 |


* *https://stackoverflow.com/a/39358924*
* *https://stackoverflow.com/q/45312377*

In [242]:
M_movie_genres = df_movies.genres.str.get_dummies().drop('(no genres listed)', axis=1)

In [243]:
M_ratings = df_ratings.pivot(columns='title', values='rating', index='userId').dropna(how='all')
M_ratings.head()

| userId | 'burbs, The (1989) | (500) Days of Summer (2009) | 10 Cloverfield Lane (2016) | 10 Things I Hate About You (1999) | 10,000 BC (2008) | 101 Dalmatians (1996) | 101 Dalmatians (One Hundred and One Dalmatians) (1961) | 12 Angry Men (1957) | 12 Years a Slave (2013) | 127 Hours (2010) | ... | Zack and Miri Make a Porno (2008) | Zero Dark Thirty (2012) | Zero Effect (1998) | Zodiac (2007) | Zombieland (2009) | Zoolander (2001) | Zootopia (2016) | eXistenZ (1999) | xXx (2002) | ¡Three Amigos! (1986) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 3.0 | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


## Imputation

NMF cannot handle missing values (NaN), so the user rating matrix has to be imputed before it is factorized. Here each missing rating is filled in from the 10 most similar users.

In [244]:
imputer = KNNImputer(n_neighbors=10)

In [245]:
R_true = imputer.fit_transform(M_ratings)

In [246]:
# keep the original userId index so that rows stay attributable to users
R_true = pd.DataFrame(data=R_true, columns=M_ratings.columns, index=M_ratings.index)
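
The imputation step can be illustrated on a small matrix. This is only a sketch: the ratings and the names `toy_ratings`, `toy_imputer` and `toy_filled` are made up for the example, and `n_neighbors=2` is chosen to suit the tiny matrix rather than taken from the notebook.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy user x movie matrix (hypothetical values); NaN marks an unrated movie.
toy_ratings = np.array([
    [5.0, 4.0, np.nan],
    [4.0, np.nan, 2.0],
    [1.0, 2.0, 5.0],
    [np.nan, 1.0, 4.0],
])

# Each missing entry is replaced by the mean of that column over the
# n_neighbors most similar rows (users), using a NaN-aware distance.
toy_imputer = KNNImputer(n_neighbors=2)
toy_filled = toy_imputer.fit_transform(toy_ratings)

# Observed ratings are left untouched; only the NaN entries are filled.
print(toy_filled)
```

Because the filled values are averages of observed ratings, they stay inside the range of the original ratings.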

For the recommendation web service we will fill the unrated entries of a new user's vector with the mean rating of each movie.

In [247]:
generic_user_vector = M_ratings.mean(skipna=True, axis=0)
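
As a rough sketch of how such a mean vector could be combined with the ratings a user supplies: the helper `build_user_vector` and the toy titles below are hypothetical and not part of the notebook or the web service.

```python
import pandas as pd

# Hypothetical per-movie mean ratings, standing in for generic_user_vector.
toy_means = pd.Series({
    "Movie A (2000)": 3.2,
    "Movie B (2001)": 4.1,
    "Movie C (2002)": 2.7,
})

def build_user_vector(mean_ratings, known_ratings):
    """Start from the per-movie means and overwrite the movies the user rated."""
    user_vec = mean_ratings.copy()  # leave the shared mean vector unchanged
    for title, rating in known_ratings.items():
        user_vec[title] = rating
    return user_vec

toy_user_vec = build_user_vector(toy_means, {"Movie B (2001)": 5.0})
print(toy_user_vec)
```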

Save preprocessed data for web service.

In [248]:
R_true.to_json(os.path.join(WEB_APP_DATA_ROOT,'user_rating_matrix.json'))

In [250]:
generic_user_vector.to_json(os.path.join(WEB_APP_DATA_ROOT,'generic_user_vector.json'))
# read with pd.read_json(..., typ='series')

## NMF Recommendation

In [251]:
generic_user_vector

title
'burbs, The (1989)                   3.176471
(500) Days of Summer (2009)          3.666667
10 Cloverfield Lane (2016)           3.678571
10 Things I Hate About You (1999)    3.527778
10,000 BC (2008)                     2.705882
                                       ...   
Zoolander (2001)                     3.509259
Zootopia (2016)                      3.890625
eXistenZ (1999)                      3.863636
xXx (2002)                           2.770833
¡Three Amigos! (1986)                3.134615
Length: 2121, dtype: float64

In [252]:
generic_user_vector.index

Index([''burbs, The (1989)', '(500) Days of Summer (2009)',
       '10 Cloverfield Lane (2016)', '10 Things I Hate About You (1999)',
       '10,000 BC (2008)', '101 Dalmatians (1996)',
       '101 Dalmatians (One Hundred and One Dalmatians) (1961)',
       '12 Angry Men (1957)', '12 Years a Slave (2013)', '127 Hours (2010)',
       ...
       'Zack and Miri Make a Porno (2008)', 'Zero Dark Thirty (2012)',
       'Zero Effect (1998)', 'Zodiac (2007)', 'Zombieland (2009)',
       'Zoolander (2001)', 'Zootopia (2016)', 'eXistenZ (1999)', 'xXx (2002)',
       '¡Three Amigos! (1986)'],
      dtype='object', name='title', length=2121)

In [253]:
process.extractOne("Star Wars", generic_user_vector.index)

('Rogue One: A Star Wars Story (2016)', 90)

In [266]:
def clamp_rating(rating):
    """Clamp a rating to the range [1, 5]."""
    return min(max(1, rating), 5)

class MovieRecommenderNMF:
    def __init__(self,
                 rating_matrix_path=os.path.join(WEB_APP_DATA_ROOT, 'user_rating_matrix.json'),
                 generic_user_vec_path=os.path.join(WEB_APP_DATA_ROOT, 'generic_user_vector.json')):
        self.rating_matrix = pd.read_json(rating_matrix_path)
        self.generic_user_vec = pd.read_json(generic_user_vec_path, typ='series')

        self.model = NMF(n_components=30)

    def recommend(self, user_input):
        """
        Parameters
        ----------
        user_input : dict
            Dictionary with raw movie titles as keys and ratings in [1, 5] as values.

        Returns
        -------
        dict
            'recommendations': the 20 highest predicted ratings (Series indexed by title);
            'matches': list of (matched title, rating as given) tuples.
        """
        input_matches = []

        # create the user vector: a neutral 3.0 for every movie, then overwrite
        # the movies the user actually rated
        uvec = self.generic_user_vec.copy()
        uvec[:] = 3.0  # alternative fill value: self.rating_matrix.mean().mean()

        for raw_movie_title, rating in user_input.items():
            # fuzzy-match the raw title against the known movie titles
            matched_title = process.extractOne(raw_movie_title, uvec.index)[0]
            input_matches.append((matched_title, rating))
            uvec[matched_title] = clamp_rating(rating)
            print("Matched", matched_title)

        # embed the user vector in the rating matrix (overwrites the first row)
        self.rating_matrix.iloc[0] = uvec

        # factorize the rating matrix and reconstruct the new user's row
        W = self.model.fit_transform(self.rating_matrix)
        H = self.model.components_
        transformed_uvec = pd.Series(data=W[0].dot(H), index=self.generic_user_vec.index)

        return {'recommendations': transformed_uvec.sort_values(ascending=False)[:20],
                'matches': input_matches}
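
To make the factorization step in `recommend` more concrete, here is a minimal sketch on a toy matrix. The ratings are invented, and `n_components=2`, `init`, `random_state` and `max_iter` are assumptions chosen only to keep the toy run small and repeatable; the class above uses `n_components=30` with scikit-learn's defaults.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy dense rating matrix (hypothetical values): 4 users x 3 movies.
toy_R = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 2.0],
    [1.0, 2.0, 5.0],
    [2.0, 1.0, 4.0],
])

# Factorize R into user factors W (4 x 2) and movie factors H (2 x 3).
toy_model = NMF(n_components=2, init="random", random_state=0, max_iter=1000)
toy_W = toy_model.fit_transform(toy_R)
toy_H = toy_model.components_

# The predicted ratings for the first user are the first row of W times H.
toy_pred_user0 = toy_W[0] @ toy_H
print(toy_pred_user0)
```

Only the first row of the product is needed, so `toy_W[0] @ toy_H` avoids computing the full reconstructed matrix.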

In [267]:
a = MovieRecommenderNMF()

In [268]:
#b=a.recommend({'Terminator 2':5, 'Taken':10, 'xxx':5, 'star wars':0})
b=a.recommend({'pretty woman':5, 'forest gump':5, 'american beauty':5})['recommendations']

Matched Pretty Woman (1990)
Matched Forrest Gump (1994)
Matched American Beauty (1999)

In [269]:
b[:20]

Taken (2008)                                 3.895786
Pretty Woman (1990)                          3.868243
Secrets & Lies (1996)                        3.713686
Forrest Gump (1994)                          3.693256
Dawn of the Dead (1978)                      3.681013
Guess Who's Coming to Dinner (1967)          3.674461
Celebration, The (Festen) (1998)             3.674119
It Happened One Night (1934)                 3.668205
Dances with Wolves (1990)                    3.653890
Con Air (1997)                               3.641414
Mulholland Drive (2001)                      3.632133
Paths of Glory (1957)                        3.617138
Dallas Buyers Club (2013)                    3.615663
Akira (1988)                                 3.598953
Ran (1985)                                   3.596335
Rebecca (1940)                               3.596191
Hustler, The (1961)                          3.588517
Thor: Ragnarok (2017)                        3.584069
Waterworld (1995)           

In [272]:
b=a.recommend({'Terminator 2':3, 'Taken':3, 'xxx':3, 'star wars':3})['recommendations']
b[:20]

Matched Terminator 2: Judgment Day (1991)
Matched Taken (2008)
Matched xXx (2002)
Matched Rogue One: A Star Wars Story (2016)

Star Trek: First Contact (1996)        3.753296
Taken (2008)                           3.746639
Secrets & Lies (1996)                  3.732919
Celebration, The (Festen) (1998)       3.710497
It Happened One Night (1934)           3.697135
Guess Who's Coming to Dinner (1967)    3.690231
Dawn of the Dead (1978)                3.685338
Rebecca (1940)                         3.667016
Big Short, The (2015)                  3.637836
Ran (1985)                             3.629793
Evil Dead II (Dead by Dawn) (1987)     3.611133
Mulholland Drive (2001)                3.605740
Paths of Glory (1957)                  3.598015
Waterworld (1995)                      3.597871
Pretty Woman (1990)                    3.589761
Double Indemnity (1944)                3.588068
Thor: Ragnarok (2017)                  3.578587
Ransom (1996)                          3.575660
Inside Job (2010)                      3.575389
Tombstone (1993)                       3.573561
dtype: float64
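
Both result lists still contain titles that were part of the input (Pretty Woman and Forrest Gump in the first run, Taken in the second). One possible way to filter them out is sketched below; `drop_rated` is a hypothetical helper, not part of `MovieRecommenderNMF`, and the toy titles are placeholders.

```python
import pandas as pd

def drop_rated(recommendations, matches):
    """Remove titles the user already rated from a Series of predicted ratings.

    `recommendations` is indexed by movie title; `matches` is a list of
    (title, rating) tuples in the same shape as `input_matches`.
    """
    rated_titles = [title for title, _ in matches]
    # errors="ignore" skips rated titles that are not in the Series
    return recommendations.drop(labels=rated_titles, errors="ignore")

toy_scores = pd.Series({"Movie A": 3.9, "Movie B": 3.8, "Movie C": 3.7})
toy_matches = [("Movie B", 5)]
toy_filtered = drop_rated(toy_scores, toy_matches)
print(toy_filtered)
```

Applied to the full sorted vector before taking the top 20, this would keep the list at 20 entries.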

In [271]:
M_ratings.mean().mean()

3.4438105717465817

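Problem:

* The recommendations are very similar for very different input sets: 13 of the 19 titles visible in the first result also appear in the second. A plausible cause is that the user vector holds only three or four real ratings among 2121 entries filled with 3.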
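
The similarity between two result lists can be quantified instead of judged by eye. The sketch below uses a hypothetical helper `overlap_fraction` and placeholder titles; with the notebook's objects one would pass the `.index` of two recommendation Series.

```python
def overlap_fraction(titles_a, titles_b):
    """Fraction of titles in the shorter list that also occur in the other."""
    set_a, set_b = set(titles_a), set(titles_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / min(len(set_a), len(set_b))

toy_run_1 = ["Movie A", "Movie B", "Movie C", "Movie D"]
toy_run_2 = ["Movie B", "Movie C", "Movie D", "Movie E"]
toy_overlap = overlap_fraction(toy_run_1, toy_run_2)
print(toy_overlap)  # 3 shared titles out of 4
```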

Ideas:

* Ask for more user input, i.e. more rated movies.
* Preprocess the user rating matrix better, e.g. shift each user's mean rating to 3 and scale the deviations so that they span the range [1,5].
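
The second idea could look roughly like the sketch below. It is one possible reading of the idea, not a tested improvement: `rescale_user_ratings` is a hypothetical helper that maps each user's mean to 3 and the largest deviation from that mean to 1 or 5, and the toy ratings are invented.

```python
import numpy as np
import pandas as pd

def rescale_user_ratings(ratings):
    """Shift each user's mean rating to 3 and stretch deviations into [1, 5]."""
    # Deviation of every rating from that user's own mean (NaN stays NaN).
    centered = ratings.sub(ratings.mean(axis=1), axis=0)
    # Largest absolute deviation per user.
    max_dev = centered.abs().max(axis=1)
    # Users who gave the same rating everywhere have max_dev == 0;
    # divide by 1 instead so they end up at exactly 3.
    safe_dev = max_dev.where(max_dev > 0, 1.0)
    return 3 + 2 * centered.div(safe_dev, axis=0)

toy_user_ratings = pd.DataFrame(
    {
        "Movie A": [5.0, 2.0, 4.0],
        "Movie B": [4.0, 1.0, 4.0],
        "Movie C": [np.nan, 3.0, 4.0],
    },
    index=[1, 2, 3],
)
toy_rescaled = rescale_user_ratings(toy_user_ratings)
print(toy_rescaled)
```

All rescaled values stay non-negative, which NMF requires, and unrated entries remain NaN so imputation can still follow.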