# Recommender Systems with NMF Mini-Project

### 1. Load the movie ratings data (as in the HW3-recommender-system) and use matrix factorization technique(s) and predict the missing ratings from the test data. Measure the RMSE. You should use sklearn library.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics import mean_squared_error
from scipy.sparse import coo_matrix

In [2]:
# Load data
MV_users = pd.read_csv('users.csv')
MV_movies = pd.read_csv('movies.csv')
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Create dataset
from collections import namedtuple
Data = namedtuple('Data', ['users','movies','train','test'])
data = Data(MV_users, MV_movies, train, test)

In [3]:
# Create matrix
allusers = list(data.users['uID'])
allmovies = list(data.movies['mID'])
mid2idx = dict(zip(data.movies.mID,list(range(len(data.movies)))))
uid2idx = dict(zip(data.users.uID,list(range(len(data.users)))))

ind_movie = [mid2idx[x] for x in data.train.mID]
ind_user = [uid2idx[x] for x in data.train.uID]
rating_train = list(data.train.rating)

# Build a dense users x movies matrix; user-movie pairs without a rating are stored as 0
rating_matrix = coo_matrix((rating_train, (ind_user, ind_movie)), shape=(len(allusers), len(allmovies))).toarray()
rating_df = pd.DataFrame(rating_matrix, columns=allmovies, index=allusers)

In [4]:
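# Train model and calculate RMSE
nmf_model = NMF(n_components=6, random_state=10)
nmf_model.fit(rating_df)

# W: user-by-component weights, H: component-by-movie weights
W = nmf_model.transform(rating_df)
H = nmf_model.components_

# Reconstruct the full matrix; this RMSE covers every cell, including the 0s stored for missing ratings
V = np.dot(W, H)
np.sqrt(mean_squared_error(rating_df, V))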

0.5644786056624103
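
The RMSE above is averaged over every cell of the ratings matrix, so the zeros standing in for missing ratings count as targets. As a rough illustration of how much that choice can matter, the sketch below compares an all-cells RMSE with one restricted to observed ratings. The matrices `R` and `V_toy` and the helpers `rmse_all` and `rmse_observed` are hypothetical and are not part of the notebook.

```python
import numpy as np

# Toy ratings matrix (hypothetical data); 0 marks a missing rating.
R = np.array([[5.0, 0.0, 3.0],
              [0.0, 4.0, 0.0],
              [1.0, 0.0, 0.0]])

# Hypothetical reconstruction, standing in for V = W @ H.
V_toy = np.array([[4.5, 0.2, 2.5],
                  [0.1, 3.0, 0.3],
                  [1.5, 0.1, 0.2]])

def rmse_all(actual, predicted):
    """RMSE over every cell, treating the 0 placeholders as real targets."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def rmse_observed(actual, predicted):
    """RMSE over cells that hold an actual rating (non-zero entries) only."""
    mask = actual > 0
    return float(np.sqrt(np.mean((actual[mask] - predicted[mask]) ** 2)))

print("all cells:    ", rmse_all(R, V_toy))
print("observed only:", rmse_observed(R, V_toy))
```

In this toy case the observed-only error is the larger of the two, because the easy-to-fit zeros no longer dilute the average.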

In [5]:
# Predict test data and calculate RMSE
# Map the test set's user and movie IDs to row and column positions in V
test_user_idx = data.test.uID
test_movie_idx = data.test.mID
mtx_user_idx = [uid2idx[i] for i in test_user_idx]
mtx_movie_idx = [mid2idx[i] for i in test_movie_idx]

# Look up the reconstructed rating for each (user, movie) pair in the test set
predictions = [V[user, movie] for user, movie in zip(mtx_user_idx, mtx_movie_idx)]
np.sqrt(mean_squared_error(data.test.rating, predictions))

2.973667312351122
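
For context on the test RMSE above, it can help to have a trivial reference point. The sketch below shows one simple baseline, predicting the mean training rating for every test pair. It is an assumption about what a "simple baseline" might look like, not necessarily the one used in Module 3, and `train_toy` and `test_toy` are hypothetical frames that only borrow the notebook's column names.

```python
import numpy as np
import pandas as pd

# Toy train/test frames (hypothetical data) using the notebook's column names.
train_toy = pd.DataFrame({"uID": [1, 1, 2, 3, 3],
                          "mID": [10, 20, 10, 20, 30],
                          "rating": [5, 3, 4, 1, 2]})
test_toy = pd.DataFrame({"uID": [2, 1, 3],
                         "mID": [20, 30, 10],
                         "rating": [4, 2, 5]})

# Global-mean baseline: every test pair gets the average training rating.
global_mean = train_toy["rating"].mean()
baseline_pred = np.full(len(test_toy), global_mean)

errors = test_toy["rating"].to_numpy() - baseline_pred
baseline_rmse = float(np.sqrt(np.mean(errors ** 2)))
print(baseline_rmse)
```

Running the same calculation on the real `train` and `test` frames would give a number to set against the NMF result.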

### 2. Discuss the results and why they did not work well compared to simple baseline or similarity-based methods we’ve done in Module 3. Can you suggest a way(s) to fix it?

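The reconstruction RMSE on the training matrix is fairly low at 0.564, but the RMSE on the test ratings jumps to 2.974. A gap like this looks like overfitting, but much of it likely comes from how the sparse ratings matrix was built. Missing ratings were stored as 0, so the training RMSE is averaged over every cell of the matrix, most of which are zeros, while the test RMSE covers only real ratings. The two numbers are therefore not directly comparable. Because the matrix is so sparse, non-negative matrix factorization (NMF) is rewarded mainly for reproducing those zeros, which likely pulls its predictions for unseen user-movie pairs toward 0 and away from the actual ratings. Possible fixes include filling the missing entries with a user, movie, or global mean before factorizing, or switching to a factorization that fits only the observed ratings. A second problem is cost: converting a matrix this large and sparse to a dense array makes model fitting slow, so a method that works on the observed ratings directly would likely be both faster and more accurate.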
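
One way to reduce the effect of the zero placeholders is to fill missing cells with something more plausible before factorizing. The sketch below is a minimal illustration on a toy matrix, assuming a 1 to 5 rating scale and using each user's mean as the fill value. The names `R`, `R_filled`, `W_toy`, `H_toy`, and `V_filled`, and the NMF settings, are illustrative choices rather than the notebook's.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy ratings matrix (hypothetical data); 0 marks a missing rating.
R = np.array([[5.0, 4.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0],
              [0.0, 1.0, 5.0, 4.0]])
observed = R > 0

# Per-user mean over observed ratings only (guards against users with no ratings).
counts = np.maximum(observed.sum(axis=1), 1)
user_mean = R.sum(axis=1) / counts

# Replace the missing cells with each user's mean instead of 0.
R_filled = np.where(observed, R, user_mean[:, None])

# Factorize the filled matrix; n_components=2 is arbitrary for this toy example.
model = NMF(n_components=2, init="nndsvda", random_state=10, max_iter=1000)
W_toy = model.fit_transform(R_filled)
H_toy = model.components_

# Keep predictions inside the assumed 1-5 rating scale.
V_filled = np.clip(W_toy @ H_toy, 1.0, 5.0)
print(np.round(V_filled, 2))
```

Whether this helps on the real data would need to be checked by recomputing the test RMSE. Mean filling still treats imputed values as if they were observed, so it is a mitigation rather than a full fix.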