# Using NMF on Movie Rating Data

Importing all necessary libraries

In [17]:
import math
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
import seaborn as sns
import re
import unicodedata
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.decomposition import NMF
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, mean_squared_error
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression

Reading in data and doing a brief cleaning procedure

In [18]:
MV_users = pd.read_csv('data/users.csv')
MV_movies = pd.read_csv('data/movies.csv')
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

In [22]:
for df in [MV_users, train, test]:
    df['uID'] = df['uID'].astype(str).str.strip()
for df in [MV_movies, train, test]:
    df['mID'] = df['mID'].astype(str).str.strip()

In [23]:
u_id = list(MV_users['uID'])
m_id = list(MV_movies['mID'])

utoidx = dict(zip(u_id, range(len(u_id))))
mtoidx = dict(zip(m_id, range(len(m_id))))

n_user, n_mov = len(u_id), len(m_id)
R = np.zeros((n_user, n_mov))
for _, row in train.iterrows():
    R[utoidx[row.uID], mtoidx[row.mID]] = row.rating


Now implementing the NMF on the data

In [25]:
nmf = NMF(n_components = 20, init = 'nndsvda', random_state = 42, max_iter = 300)
W = nmf.fit_transform(R)
H = nmf.components_
pred_ = np.dot(W,H)

In [26]:
tpred = []

for _, row in test.iterrows():
    uid, mid = row.uID, row.mID
    u_idx, m_idx = utoidx[uid], mtoidx[mid]
    pred = np.clip(pred_[u_idx, m_idx], 1, 5)
    tpred.append(pred)

rmse = np.sqrt(mean_squared_error(test.rating, tpred))
print('NMF RMSE is', rmse)

NMF RMSE is 2.537119135178176


Discussion for Part 2: Discuss the results and why sklearn's non-negative matrix factorization library did not work well compared to simple baseline or similarity-based methods we’ve done in Module 3. Can you suggest a way(s) to fix it?

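Sklearn's non-negative matrix factorization did not work well compared to the simple baseline methods we used in Module 3 because it is not designed for recommender systems. First, it does not handle missing entries well. Every unrated user-movie pair is stored as 0 in R, and the model fits those zeros as if they were real ratings rather than ignoring them as unknown. When most of the matrix is unrated, the reconstruction is pulled toward 0, and any test prediction below 1 is simply clipped up to 1. Second, NMF has no mechanism for correcting systematic over- or under-prediction, so users or movies that are rated consistently high or low are not accounted for. Finally, this method is inefficient for recommender systems compared to what we used in Module 3, because most of the fitting effort goes into reproducing the zero-filled entries rather than the known ratings.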
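To make the first point concrete, here is a minimal sketch on a hypothetical toy matrix (`R_toy` and its values are made up for illustration and are not taken from the dataset). It shows how zero-filling changes the statistics that a dense factorization sees: the mean over all cells is much lower than the mean over the ratings that were actually given.

```python
import numpy as np

# Hypothetical toy rating matrix; 0 means "not rated", as in R above.
R_toy = np.array([
    [5, 0, 0, 1],
    [0, 4, 0, 0],
    [0, 0, 3, 0],
], dtype=float)

observed = R_toy > 0                     # boolean mask of known ratings
density = observed.mean()                # fraction of cells that hold a real rating
mean_all = R_toy.mean()                  # what a dense fit "sees", zeros included
mean_observed = R_toy[observed].mean()   # mean of the ratings users actually gave

print(f"density={density:.3f}, mean_all={mean_all:.3f}, mean_observed={mean_observed:.3f}")
```

Running the same three lines on the real `R` (with `observed = R > 0`) would show how sparse the training matrix is, assuming no genuine rating equals 0.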

One way to improve on sklearn's NMF for this dataset would be to base the fit only on known user ratings, for example by normalizing with statistics computed from the known ratings alone. Masking the missing entries so that they do not contribute to the reconstruction loss could improve the overall RMSE and result in a better model.
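The masking idea can be sketched in plain NumPy. This is not sklearn's NMF and not the method used in the cells above: it is a hand-written masked variant of the multiplicative-update rule, run on synthetic data, in which only observed cells enter the loss. The names `masked_nmf` and `masked_rmse`, the synthetic matrix, and all parameter values are assumptions made for illustration.

```python
import numpy as np

def masked_nmf(R, mask, n_components=3, n_iter=200, eps=1e-9, seed=0):
    """Factor R ~ W @ H using only cells where mask == 1 (multiplicative updates)."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = R.shape
    # Strictly positive initialisation keeps the updates well defined.
    W = rng.random((n_rows, n_components)) + 0.1
    H = rng.random((n_components, n_cols)) + 0.1
    MR = mask * R  # observed ratings; unobserved cells contribute nothing
    for _ in range(n_iter):
        WH = mask * (W @ H)
        W *= (MR @ H.T) / (WH @ H.T + eps)
        WH = mask * (W @ H)
        H *= (W.T @ MR) / (W.T @ WH + eps)
    return W, H

def masked_rmse(R, mask, W, H):
    """RMSE computed over observed cells only."""
    diff = mask * (R - W @ H)
    return np.sqrt((diff ** 2).sum() / mask.sum())

# Synthetic low-rank "ratings" scaled into [1, 5]; about half the cells are observed.
rng = np.random.default_rng(42)
base = rng.random((30, 3)) @ rng.random((3, 20))
R_full = 1 + 4 * base / base.max()
mask = (rng.random(R_full.shape) < 0.5).astype(float)
R_obs = mask * R_full

W0, H0 = masked_nmf(R_obs, mask, n_iter=1)    # same initialisation, one update
W1, H1 = masked_nmf(R_obs, mask, n_iter=200)  # same initialisation, 200 updates
rmse_start = masked_rmse(R_obs, mask, W0, H0)
rmse_end = masked_rmse(R_obs, mask, W1, H1)
print("masked RMSE after 1 iteration:", rmse_start)
print("masked RMSE after 200 iterations:", rmse_end)
```

To try this on the notebook's data, one would pass `R` together with `mask = (R > 0).astype(float)` and then score the test pairs exactly as in the loop above. Whether it beats the Module 3 baselines would still need to be measured.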

