## Project 7: Non-Negative Matrix Factorization
JarGarGot

The goal is to implement a basic recommendation system. Many recommendation datasets are already bundled with the Surprise package, which makes this easy. You should tune and cross-validate your system to select the best values for the number of latent dimensions, the regularization parameter, and any other hyperparameters.

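NMF is a collaborative filtering algorithm based on Non-negative Matrix Factorization. It is very similar to SVD: both learn latent user and item factors from the observed ratings, but NMF keeps those factors non-negative.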
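As a rough illustration of what the factorization produces, the sketch below computes one predicted rating from a user factor vector and an item factor vector. It follows the prediction rule described in the Surprise documentation (dot product of the factors, plus the global mean and the user and item biases when `biased=True`). The function name `predict_rating` and the toy numbers are assumptions made for this example and are not part of Surprise.

```python
def predict_rating(p_u, q_i, mu=0.0, b_u=0.0, b_i=0.0):
    """Estimate a rating as mu + b_u + b_i + dot(q_i, p_u).

    p_u and q_i are the latent factors of one user and one item.
    With mu, b_u and b_i left at 0 this is the unbiased NMF estimate.
    """
    if len(p_u) != len(q_i):
        raise ValueError("user and item factors must have the same length")
    # NMF keeps every factor non-negative; reject inputs that break that.
    if any(x < 0 for x in p_u) or any(x < 0 for x in q_i):
        raise ValueError("NMF factors must be non-negative")
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))


# Toy example with two latent dimensions (hypothetical values).
unbiased = predict_rating([1.0, 2.0], [0.5, 1.0])
biased = predict_rating([0.5, 0.2], [1.0, 0.5], mu=3.5, b_u=0.2, b_i=-0.1)
print(unbiased, round(biased, 2))
```

The number of entries in each vector corresponds to the `n_factors` hyperparameter tuned later in the notebook.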


In [None]:
Resources:

https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Building%20Recommender%20System%20with%20Surprise.ipynb

https://github.com/alorber/Frequentist-Machine-Learning-Projects/blob/master/Project%206%20-%20Nonnegative%20Matrix%20Factorization/NMF.py

https://github.com/rosegebhardt/Frequentist-ML/blob/master/nmf.ipynb


In [None]:
#!pip install scikit-surprise

import numpy as np
from surprise import Dataset
from surprise import NMF
import pandas as pd
from surprise.model_selection import cross_validate
from surprise import Reader
from surprise.model_selection import train_test_split
from surprise import accuracy
from surprise.model_selection import GridSearchCV
from collections import defaultdict

In [None]:
# Loading in movie dataset
data = Dataset.load_builtin('ml-100k')

In [None]:
# Splitting Data https://github.com/NicolasHug/Surprise/issues/277

all_data = data.raw_ratings

threshold = int(0.2*len(all_data))
training_data = all_data[threshold:]
testing_data = all_data[:threshold]

# Only want to fit model with training data
data.raw_ratings = training_data
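The split above slices the rating list in its stored order. If the stored order is not already random, a shuffled hold-out split may be safer. The following is a minimal sketch, assuming the ratings are a plain list of tuples as in `data.raw_ratings`; the helper name `holdout_split`, the seed, and the toy ratings are made up for illustration.

```python
import random


def holdout_split(raw_ratings, test_fraction=0.2, seed=0):
    """Shuffle a copy of the ratings and split it into (train, test)."""
    shuffled = list(raw_ratings)           # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    threshold = int(test_fraction * len(shuffled))
    return shuffled[threshold:], shuffled[:threshold]


# Toy ratings in the (user, item, rating, timestamp) layout.
toy_ratings = [(str(u), str(i), 3.0, None) for u in range(5) for i in range(4)]
train, test = holdout_split(toy_ratings)
print(len(train), len(test))
```

With the real dataset, the two returned lists would play the roles of `training_data` and `testing_data`.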

In [None]:
# Tuning latent dimensions and regularization parameters

#params = {"biased": [True], "n_factors": np.arange(2,12,2)}
params = {"biased": [True], 'n_factors': [5,10,15], 'reg_pu': [0.02,0.1], 'reg_qi': [0.02,0.1]}
nmf = GridSearchCV(NMF, params, measures=["mse"], cv=3)
nmf.fit(data)
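To get a feel for the cost of this search, the sketch below expands the same parameter grid with `itertools.product` and counts how many models a 3-fold grid search has to fit. It only illustrates the bookkeeping and does not call Surprise; `expand_grid` is a hypothetical helper name.

```python
from itertools import product

params = {
    "biased": [True],
    "n_factors": [5, 10, 15],
    "reg_pu": [0.02, 0.1],
    "reg_qi": [0.02, 0.1],
}


def expand_grid(grid):
    """Return one dict per combination of the listed parameter values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


combos = expand_grid(params)
n_folds = 3
print(len(combos), "combinations,", len(combos) * n_folds, "model fits")
```

Adding another value to any list multiplies the number of fits, which is worth keeping in mind before widening the grid.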


In [None]:
print("Best Parameters:\n")
print("Number of latent dimensions:", nmf.best_params['mse']['n_factors'])
print("Regularization term pu:", nmf.best_params['mse']['reg_pu'])
print("Regularization term qi:", nmf.best_params['mse']['reg_qi'])


Best Parameters:

Number of latent dimensions: 10
Regularization term pu: 0.1
Regularization term qi: 0.1


In [None]:
# Fit the model with the tuned parameters
best = nmf.best_params['mse']
best_nmf = NMF(
    biased=True,
    n_factors=best['n_factors'],
    reg_pu=best['reg_pu'],
    reg_qi=best['reg_qi'],
)
best_nmf.fit(data.build_full_trainset())

In [None]:
# Predict ratings on training data 
predictions = best_nmf.test(data.build_full_trainset().build_testset())
mse = accuracy.mse(predictions, verbose=False)
print('Mean squared error of training data:')
print(mse)

Mean squared error of training data:
0.8006820710494104


In [None]:
# Predict ratings on testing data
predictions = best_nmf.test(data.construct_testset(testing_data))
mse = accuracy.mse(predictions, verbose=False)
print('Mean squared error of testing data:')
print(mse)

Mean squared error of testing data:
0.900671194085953
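The mean squared error reported above is the average of the squared differences between the true and the estimated ratings. The sketch below computes it by hand for a few made-up (true rating, estimate) pairs and also takes the square root, which expresses the error in rating units. The function name and the sample pairs are assumptions for illustration only.

```python
import math


def mean_squared_error(pairs):
    """Average squared difference over (true_rating, estimate) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one prediction")
    return sum((true_r - est) ** 2 for true_r, est in pairs) / len(pairs)


sample_pairs = [(5.0, 4.5), (3.0, 3.5), (4.0, 4.0)]  # hypothetical values
mse_by_hand = mean_squared_error(sample_pairs)
rmse_by_hand = math.sqrt(mse_by_hand)
print(mse_by_hand, rmse_by_hand)
```

By the same square root, the test MSE of about 0.90 corresponds to an RMSE of roughly 0.95 on the 1 to 5 rating scale.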


In [None]:
# Convert the list of predictions to a DataFrame for readability
predict = pd.DataFrame(predictions)

# Keep only predictions with an estimated rating above 4.5
recommendations = predict[predict['est'] > 4.5]
recommendations = recommendations.drop(['details'], axis=1)

display(recommendations)

Unnamed: 0,uid,iid,r_ui,est
102,462,313,5.0,4.790023
104,462,315,4.0,4.542540
106,462,22,5.0,4.676112
110,462,272,5.0,4.808524
114,462,100,4.0,4.557632
...,...,...,...,...
79791,936,1368,5.0,4.686228
79817,936,272,4.0,4.608073
79856,936,127,5.0,4.556547
79886,936,251,4.0,4.515870


In [None]:
recs = defaultdict(list)  # List of recommendations for each user
num_recs = 5    # Number of recommendations to get for each user
for uid, iid, true_r, est, _ in predictions:
    recs[uid].append((iid, est))

# Sort the predictions for each user and keep the num_recs highest ones.
for uid, user_ratings in recs.items():
    user_ratings.sort(key=lambda x: x[1], reverse=True)
    recs[uid] = user_ratings[:num_recs]

# Results

The first few results are printed below. For each user ID, the output lists the top 5 movie IDs with their predicted ratings. The full list is long and hard to interpret from movie ID numbers alone, so we then convert the movie IDs to movie titles, which lets each user look up their user ID to see their top 5 recommendations.

In [None]:
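recommended_movie_ids = []
# Print each user ID with its top 5 (movie ID, predicted rating) pairs
for k, v in recs.items():
  print(f"{k} - {v}")
  # Collect every recommended movie ID; a user may have fewer than num_recs entries
  for iid, _ in v:
    # Raw IDs are strings, while movies.csv stores movieId as integers
    recommended_movie_ids.append(int(iid))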

391 - [('318', 4.435236723229418), ('64', 4.392607143801648), ('483', 4.3906408273467195), ('12', 4.3443052514097475), ('603', 4.308232243944192)]
462 - [('272', 4.808524067787269), ('313', 4.79002287250365), ('22', 4.676112047185291), ('136', 4.591910195825649), ('100', 4.557632433335283)]
268 - [('408', 4.172189745446), ('12', 3.9830215326267426), ('480', 3.953230450481128), ('178', 3.9247264317532076), ('50', 3.913779718278849)]
21 - [('853', 4.409163906141477), ('408', 4.349286830815149), ('127', 4.141265991264257), ('656', 4.09977956269731), ('854', 4.032199807182083)]
207 - [('64', 4.054173210332224), ('483', 4.033913178688169), ('12', 3.99276095910507), ('515', 3.914500806302), ('357', 3.9042769563025823)]
291 - [('64', 4.853064311488998), ('12', 4.795144870662613), ('50', 4.704230686934064), ('98', 4.670343043704603), ('285', 4.665162094853658)]
60 - [('483', 4.794263895806649), ('12', 4.754045954955859), ('603', 4.695892241366537), ('513', 4.690640055576443), ('357', 4.6798204

In [None]:
# List of movie titles with ID number and genre
# Source: https://github.com/divensambhwani/MovieLens-100K_Recommender-System/blob/master/MovieLens-100K-Recommeder%20System-SVD.ipynb
movies = pd.read_csv('movies.csv')
display(movies)
recommended_movies = movies[movies['movieId'].isin(recommended_movie_ids)]
display(recommended_movies)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller
...,...,...,...
1201,1599,Steel (1997),Action
1202,1600,She's So Lovely (1997),Drama|Romance
1233,1642,Indian Summer (a.k.a. Alive & Kicking) (1996),Comedy|Drama
1237,1646,RocketMan (a.k.a. Rocket Man) (1997),Children|Comedy|Romance|Sci-Fi


In [None]:
# Looks up the movies.csv row (ID, title, genres) for a movie ID
def movie_id_to_title(movieId):
  matches = movies[movies['movieId'] == movieId]
  # Return the first match as a one-row DataFrame (empty if the ID is missing)
  return matches.head(1)

In [None]:
# Examples of searching for title name based on movie ID number
title = movie_id_to_title(2)
print(title,'\n\n')

title = movie_id_to_title(64)
print(title)

   movieId           title                      genres
1        2  Jumanji (1995)  Adventure|Children|Fantasy 


    movieId                 title          genres
57       64  Two if by Sea (1996)  Comedy|Romance


In [None]:
# Lists up to top 5 movie recommendations for a user based on ID
def user_recommendations(userId):
  if userId in recs:
    for iid, _ in recs[userId]:
      movie_id = int(iid)  # raw IDs are strings
      # Some of the movie IDs are not listed in movies.csv
      if (recommended_movies['movieId'] == movie_id).any():
        print(movie_id_to_title(movie_id))
  else:
    print("Not present")

Below is an example of using the user recommendation function to get the top 5 recommendations for a given user ID.
Some movie IDs are missing from movies.csv, so running the code sometimes shows fewer than 5 recommendations.
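One likely reason for the missing IDs is that movies.csv lists IDs up to 193587, whereas ml-100k contains 1682 items, which suggests the file comes from a different MovieLens release whose IDs need not match. A safer lookup would read titles from the `u.item` file that ships with ml-100k. The sketch below assumes that file is pipe-separated with the movie ID and title in the first two fields; the function names and sample lines are hypothetical and only mimic that layout.

```python
def parse_u_item(lines):
    """Map raw movie ID (string) to title from u.item-style lines."""
    titles = {}
    for line in lines:
        fields = line.rstrip("\n").split("|")
        # Skip blank or malformed lines rather than guessing at their content.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        titles[fields[0]] = fields[1]
    return titles


def top_titles(user_recs, titles):
    """Turn (movie ID, estimate) pairs into titles, flagging unknown IDs."""
    return [titles.get(iid, f"<unknown id {iid}>") for iid, _ in user_recs]


# Made-up lines in the assumed u.item layout (not real dataset rows).
sample_lines = [
    "101|Hypothetical Movie A (1995)|01-Jan-1995||http://example.com/a|0|1|0",
    "102|Hypothetical Movie B (1996)|01-Jan-1996||http://example.com/b|1|0|0",
]
titles = parse_u_item(sample_lines)
print(top_titles([("102", 4.8), ("101", 4.5), ("999", 4.1)], titles))
```

With the real file, the lines would come from something like `open(path, encoding="latin-1")`, where both the path and the encoding are assumptions to verify against the local copy of the dataset. Keeping the IDs as strings matches the raw IDs stored in `recs`.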

In [None]:
# Example of getting top 5 movies for a user (with ID number)
user_recommendations('1')

     movieId                                    title                    genres
141      169  Free Willy 2: The Adventure Home (1995)  Adventure|Children|Drama
     movieId                           title  genres
104      119  Steal Big, Steal Little (1995)  Comedy
     movieId                        title        genres
150      178  Love & Human Remains (1993)  Comedy|Drama
    movieId                       title                  genres
46       50  Usual Suspects, The (1995)  Crime|Mystery|Thriller
