# Collaborative Filtering

Movie recommendation based on Non-negative Matrix Factorization (NMF).

* **Disciplines:** Unsupervised Learning, recommender systems, collaborative filtering.
* **Data:** Movies rated by users (https://grouplens.org/datasets/movielens/)

> F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. <https://doi.org/10.1145/2827872>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm
import seaborn as sns
import os.path

In [2]:
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.decomposition import NMF

In [3]:
from fuzzywuzzy import process

In [4]:
import warnings

## Load, clean and wrangle data

In [5]:
DATA_SET_ROOT = '../data/ml-latest-small/'
WEB_APP_DATA_ROOT = './recommender/data'

In [6]:
df_movies = pd.read_csv(os.path.join(DATA_SET_ROOT,'movies.csv'), index_col='movieId')

In [7]:
df_ratings = pd.read_csv(os.path.join(DATA_SET_ROOT,'ratings.csv'))

In [8]:
df_ratings = df_ratings.merge(df_movies['title'], on='movieId')

In [9]:
# keep only movies with more than N ratings
min_rating_count = 10
# https://stackoverflow.com/a/29791952
df_ratings['raiting_count_per_movie'] = df_ratings.groupby('movieId')['movieId'].transform('count')
df_ratings = df_ratings[df_ratings.raiting_count_per_movie > min_rating_count]
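
The `groupby(...).transform('count')` pattern attaches each movie's rating count to every one of its rows, so the filter becomes a plain boolean mask. A minimal sketch on made-up data (the `toy` frame, the `n_ratings` column and the threshold of 2 are illustrative, not part of the data set):

```python
import pandas as pd

# Hypothetical toy ratings: movie 10 has 3 ratings, movie 20 has 2, movie 30 has 1.
toy = pd.DataFrame({
    'userId':  [1, 2, 3, 1, 2, 1],
    'movieId': [10, 10, 10, 20, 20, 30],
})

# transform('count') returns one value per row, aligned with the original index.
toy['n_ratings'] = toy.groupby('movieId')['movieId'].transform('count')

# Strictly greater than the threshold, as in the cell above.
kept = toy[toy['n_ratings'] > 2]
print(kept['movieId'].unique())
```

Note that the comparison is strict: a movie with exactly as many ratings as the threshold is dropped.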

In [10]:
df_ratings.head()

| | userId | movieId | rating | timestamp | title | raiting_count_per_movie |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 | Toy Story (1995) | 215 |
| 1 | 5 | 1 | 4.0 | 847434962 | Toy Story (1995) | 215 |
| 2 | 7 | 1 | 4.5 | 1106635946 | Toy Story (1995) | 215 |
| 3 | 15 | 1 | 2.5 | 1510577970 | Toy Story (1995) | 215 |
| 4 | 17 | 1 | 4.5 | 1305696483 | Toy Story (1995) | 215 |


* *https://stackoverflow.com/a/39358924*
* *https://stackoverflow.com/q/45312377*

In [11]:
M_movie_genres = df_movies.genres.str.get_dummies().drop('(no genres listed)', axis=1)

In [12]:
M_ratings = df_ratings.pivot(columns='title', values='rating', index='userId').dropna(how='all')
M_ratings.head()

| userId | 'burbs, The (1989) | (500) Days of Summer (2009) | 10 Cloverfield Lane (2016) | 10 Things I Hate About You (1999) | 10,000 BC (2008) | 101 Dalmatians (1996) | 101 Dalmatians (One Hundred and One Dalmatians) (1961) | 12 Angry Men (1957) | 12 Years a Slave (2013) | 127 Hours (2010) | ... | Zack and Miri Make a Porno (2008) | Zero Dark Thirty (2012) | Zero Effect (1998) | Zodiac (2007) | Zombieland (2009) | Zoolander (2001) | Zootopia (2016) | eXistenZ (1999) | xXx (2002) | ¡Three Amigos! (1986) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 3.0 | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


## Imputation

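NMF cannot handle missing values (NaN), so the user rating matrix has to be imputed first. Here each missing rating is filled with the mean rating of the 10 most similar users who rated that movie (`KNNImputer`).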

In [13]:
imputer = KNNImputer(n_neighbors=10)

In [14]:
R_true = imputer.fit_transform(M_ratings)

In [15]:
# keep userId as the row index instead of a fresh 0..n-1 range
R_true = pd.DataFrame(data=R_true, index=M_ratings.index, columns=M_ratings.columns)
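
A minimal sketch of this imputation step on a made-up 4×3 matrix (the values, the column names and `n_neighbors=2` are illustrative). Each NaN is replaced by the mean of that movie's ratings among the nearest users who rated it, while observed ratings are left unchanged:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical user x movie matrix with missing ratings.
toy = pd.DataFrame(
    [[5.0, np.nan, 1.0],
     [4.0, 2.0, np.nan],
     [np.nan, 1.0, 5.0],
     [5.0, 2.0, 1.0]],
    index=[1, 2, 3, 4], columns=['A', 'B', 'C'])

imputer = KNNImputer(n_neighbors=2)
# fit_transform returns a plain NumPy array, so index and columns are restored by hand.
filled = pd.DataFrame(imputer.fit_transform(toy),
                      index=toy.index, columns=toy.columns)
print(filled)
```

Because every imputed value is an average of observed ratings, the result stays inside the original rating range and is non-negative, which is what NMF requires.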

For the recommendation web service, a new user's unrated movies are filled with each movie's mean rating.

In [16]:
generic_user_vector = M_ratings.mean(skipna=True, axis=0)
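
As a small illustration of how such a vector is meant to be used, the sketch below builds column means on a made-up matrix and then overrides one entry with an explicit rating. The `toy` frame and the movie labels are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings; NaN means "not rated".
toy = pd.DataFrame({
    'A': [5.0, np.nan, 4.0],
    'B': [np.nan, 2.0, 3.0],
    'C': [1.0, 1.0, np.nan],
})

# Per-movie mean over the users who actually rated the movie.
generic = toy.mean(skipna=True, axis=0)

# A new user who only rated movie 'B' keeps the means everywhere else.
new_user = generic.copy()
new_user['B'] = 5.0
print(new_user)
```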

Save preprocessed data for web service.

In [17]:
R_true.to_json(os.path.join(WEB_APP_DATA_ROOT,'user_rating_matrix.json'))

In [18]:
generic_user_vector.to_json(os.path.join(WEB_APP_DATA_ROOT,'generic_user_vector.json'))
# read with pd.read_json(..., typ='series')

## NMF Recommendation

In [19]:
def clamp_rating(rating):
    """ clamp rating to range of [1,5] """
    return min(max(1,rating),5)

class MovieRecommenderNMF:
    def __init__(self,
                 rating_matrix_path=os.path.join(WEB_APP_DATA_ROOT,'user_rating_matrix.json'),
                 generic_user_vec_path=os.path.join(WEB_APP_DATA_ROOT,'generic_user_vector.json')):
        self.rating_matrix = pd.read_json(rating_matrix_path)
        self.generic_user_vec = pd.read_json(generic_user_vec_path, typ='series')
    
        self.model = NMF(n_components=50, init='nndsvd', max_iter=1500)
        with warnings.catch_warnings(record=True): # suppress convergence warning
            W = self.model.fit_transform(self.rating_matrix)
            
    def recommend(self, user_input):
        """
        Parameters
        ----------
        user_input : dict
            Raw movie titles as keys and ratings in [1,5] as values.

        Returns
        -------
        dict
            'recommendations': the 20 highest predicted ratings (pd.Series);
            'matches': list of (matched_title, rating) tuples.
        """
        input_matches = []
        
        # create user vector
        uvec = self.generic_user_vec.copy()
        #uvec[:] = 3.#self.rating_matrix.mean().mean() # 3.
        
        for raw_movie_title, rating in user_input.items():
            matched_title = process.extractOne(raw_movie_title, uvec.index)[0]
            input_matches.append((matched_title, rating))
            uvec[matched_title] = clamp_rating(rating)
            print("Fuzzywuzzy matched:",matched_title)
        
        # NMF
        with warnings.catch_warnings(record=True): # suppress convergence warning
            W = self.model.transform((uvec,))
        H = self.model.components_
        transformed_uvec = pd.Series(data=W.dot(H)[0], index=self.generic_user_vec.index)
        
        return {'recommendations':transformed_uvec.sort_values(ascending=False)[:20],
                'matches':input_matches}
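
The reconstruction step in `recommend` can be sketched on a toy matrix: NMF factorizes the rating matrix as R ≈ W·H, `transform` projects a new user vector onto the learned components H, and the product of the two gives a predicted rating for every movie. The matrix values and `n_components=2` below are made up for illustration:

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical dense rating matrix: 5 users x 4 movies, two obvious taste groups.
R = np.array([[5, 4, 1, 1],
              [4, 5, 1, 2],
              [1, 1, 5, 4],
              [2, 1, 4, 5],
              [5, 5, 2, 1]], dtype=float)

model = NMF(n_components=2, init='nndsvd', max_iter=1000)
W = model.fit_transform(R)   # user x component weights
H = model.components_        # component x movie weights

# Project an unseen user onto the fixed components, then reconstruct.
new_user = np.array([[5, 4, 2, 1]], dtype=float)
w_new = model.transform(new_user)
predicted = w_new @ H
print(predicted.round(2))
```

The reconstruction can only express patterns that the components capture, so the choice of `n_components` bounds how much an individual user's input can shift the predictions.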

In [20]:
recommender = MovieRecommenderNMF()

In [21]:
r = recommender.recommend({'pretty woman':5, 'forest gump':5, 'american beauty':5})['recommendations']
r[:20]

Fuzzywuzzy matched: Pretty Woman (1990)
Fuzzywuzzy matched: Forrest Gump (1994)
Fuzzywuzzy matched: American Beauty (1999)


Secrets & Lies (1996)                                                          4.579753
Guess Who's Coming to Dinner (1967)                                            4.571742
Paths of Glory (1957)                                                          4.530853
Streetcar Named Desire, A (1951)                                               4.453782
Celebration, The (Festen) (1998)                                               4.449396
Ran (1985)                                                                     4.411455
It Happened One Night (1934)                                                   4.384573
His Girl Friday (1940)                                                         4.368254
Philadelphia Story, The (1940)                                                 4.342205
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)                                  4.320853
Godfather: Part II, The (1974)                                                 4.319354
Shawshank Redemption, The (1994)

In [22]:
r = recommender.recommend({'terminator 2':5, 'taken':5, 'xxx':5, 'star wars':5})['recommendations']
r[:20]

Fuzzywuzzy matched: Terminator 2: Judgment Day (1991)
Fuzzywuzzy matched: Taken (2008)
Fuzzywuzzy matched: xXx (2002)
Fuzzywuzzy matched: Rogue One: A Star Wars Story (2016)


Secrets & Lies (1996)                                                         4.588351
Guess Who's Coming to Dinner (1967)                                           4.569521
Paths of Glory (1957)                                                         4.530301
Celebration, The (Festen) (1998)                                              4.461077
Streetcar Named Desire, A (1951)                                              4.451403
Ran (1985)                                                                    4.419434
It Happened One Night (1934)                                                  4.380766
His Girl Friday (1940)                                                        4.369757
Philadelphia Story, The (1940)                                                4.333069
Shawshank Redemption, The (1994)                                              4.325141
Dark Knight, The (2008)                                                       4.324028
Sunset Blvd. (a.k.a. Sunset Boulevard) (195

Problem:

* The two very different input sets yield nearly identical recommendations.
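
One plausible explanation, given how `recommend` builds its input: the user vector starts as a copy of `generic_user_vec` and only the three or four rated entries are overwritten, so any two user vectors are almost identical before NMF is applied. A rough numeric sketch with synthetic values (the vector length of 2000 and the uniform baseline are assumptions, not MovieLens data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the mean movie ratings; purely synthetic.
generic = rng.uniform(2.5, 4.5, size=2000)

user_a = generic.copy()
user_a[[0, 1, 2]] = 5.0       # three "liked" movies
user_b = generic.copy()
user_b[[3, 4, 5, 6]] = 5.0    # four different "liked" movies

# Cosine similarity between the two input vectors.
cosine = user_a @ user_b / (np.linalg.norm(user_a) * np.linalg.norm(user_b))
print(f"cosine similarity: {cosine:.4f}")
```

In this sketch the similarity stays above 0.99, which would be consistent with the near-identical top lists above. The recommendations also do not exclude movies that are highly rated on average, so the generic vector likely dominates the ranking.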

Ideas:

* Require more user input, i.e. more rated movies.
* Preprocess the user rating matrix better, e.g.:
    * Shift each user's mean rating to 3 and scale the spread to the range [1,5].
    * Transform ratings into binary values: Hot (1) or Not (0).
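
The two preprocessing ideas could look roughly like the sketch below. It is one possible reading, not a tested improvement: `rescale_per_user` and `binarize` are hypothetical helper names, "scale to [1,5]" is interpreted as mapping each user's largest deviation from their mean to ±2 around 3, and the Hot threshold of 4 is an assumption.

```python
import numpy as np
import pandas as pd

def rescale_per_user(ratings):
    """Shift each user's mean to 3 and scale deviations so the largest maps to +/-2."""
    deviation = ratings.sub(ratings.mean(axis=1), axis=0)
    max_dev = deviation.abs().max(axis=1).replace(0, np.nan)
    scale = (2.0 / max_dev).fillna(0.0)  # users with constant ratings end up at 3
    return deviation.mul(scale, axis=0) + 3.0

def binarize(ratings, threshold=4.0):
    """Hot (1) if rating >= threshold, else Not (0); unrated entries stay NaN."""
    return ratings.ge(threshold).astype(float).where(ratings.notna())

# Hypothetical user x movie matrix with missing ratings.
toy = pd.DataFrame(
    [[4.0, 5.0, np.nan],
     [1.0, 2.0, 3.0],
     [4.0, 4.0, np.nan]],
    index=['a', 'b', 'c'], columns=['m1', 'm2', 'm3'])

scaled = rescale_per_user(toy)
binary = binarize(toy)
print(scaled)
print(binary)
```

Both helpers leave unrated entries as NaN, so the imputation step would still be needed before fitting NMF.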