## Part 2: Non-Negative Matrix Factorization for Recommender Systems
Data Science Student, University of Colorado, Boulder

In [12]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import NMF
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, ConfusionMatrixDisplay
from scipy.sparse import coo_matrix, csr_matrix
from collections import namedtuple

In [5]:
MV_users = pd.read_csv('users.csv')
MV_movies = pd.read_csv('movies.csv')
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [6]:
Data = namedtuple('Data', ['users','movies','train','test'])
data = Data(MV_users, MV_movies, train, test)

In [27]:
class RecSys():
    def __init__(self,data):
        self.data=data
        self.allusers = list(self.data.users['uID'])
        self.allmovies = list(self.data.movies['mID'])
        self.genres = list(self.data.movies.columns.drop(['mID', 'title', 'year']))
        self.mid2idx = dict(zip(self.data.movies.mID,list(range(len(self.data.movies)))))
        self.uid2idx = dict(zip(self.data.users.uID,list(range(len(self.data.users)))))
        self.Mr=self.rating_matrix()
        self.Mm=None 
        self.sim=np.zeros((len(self.allmovies),len(self.allmovies)))
        
    def rating_matrix(self):
        """
        Convert the rating matrix to numpy array of shape (#allusers,#allmovies)
        """
        ind_movie = [self.mid2idx[x] for x in self.data.train.mID] 
        ind_user = [self.uid2idx[x] for x in self.data.train.uID]
        rating_train = list(self.data.train.rating)
        
        return np.array(coo_matrix((rating_train, (ind_user, ind_movie)), shape=(len(self.allusers), len(self.allmovies))).toarray())

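    def rmse(self,yp):
        """
        Root mean squared error of the predictions yp against the test ratings.
        """
        yp = np.array(yp, dtype=float)  # copy so the caller's array is not modified
        yp[np.isnan(yp)] = 3  # impute any NaN predictions with 3
        yt=np.array(self.data.test.rating)
        return np.sqrt(((yt-yp)**2).mean())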
    
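    def factorization(self):
        """
        Predict the test ratings with sklearn's NMF on the dense rating matrix.
        Unrated entries (stored as 0) are imputed with 3 before factorizing.
        """
        ratings = self.Mr.copy()
        ratings[ratings==0] = 3  # treat missing ratings as a neutral 3

        nmf_model = NMF(n_components=25, random_state=0, init='random')
        W = nmf_model.fit_transform(ratings)  # user factors, shape (#users, 25)
        H = nmf_model.components_             # movie factors, shape (25, #movies)

        predicted_ratings = np.dot(W, H)

        # Rescale the reconstruction into the range [1, 5.5].
        # Note: MinMaxScaler scales each column (movie) independently.
        scaler = MinMaxScaler(feature_range=(1, 5.5))
        predicted_ratings = scaler.fit_transform(predicted_ratings)

        # Look up the prediction for every (user, movie) pair in the test set.
        ind_user = [self.uid2idx[x] for x in self.data.test.uID]
        ind_movie = [self.mid2idx[x] for x in self.data.test.mID]
        return predicted_ratings[ind_user, ind_movie]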

In [19]:
np.random.seed(42)
# Use the first 30,000 rows of each split (a slice, not a random sample).
sample_train = train[:30000]
sample_test = test[:30000]

sample_MV_users = MV_users[(MV_users.uID.isin(sample_train.uID)) | (MV_users.uID.isin(sample_test.uID))]
sample_MV_movies = MV_movies[(MV_movies.mID.isin(sample_train.mID)) | (MV_movies.mID.isin(sample_test.mID))]

sample_data = Data(sample_MV_users, sample_MV_movies, sample_train, sample_test)

In [28]:
rs = RecSys(sample_data)

In [29]:
predictions = rs.factorization()



In [30]:
print(rs.rmse(predictions))

1.356462346134449


1. Load the movie ratings data (as in the HW3-recommender-system) and use matrix factorization technique(s) and predict the missing ratings from the test data. Measure the RMSE. You should use sklearn library. [10 pts]
> My baseline RMSE from predicting every rating as 3 is 1.259. Unfortunately, my score using NMF is 1.356. RMSE is the square root of the mean squared difference between the predicted and true ratings, so a higher value means worse performance. NMF as configured here therefore does worse than the constant baseline.
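
As a small illustration of how the constant baseline is scored, the sketch below computes the RMSE of predicting 3 for every rating. The `toy_ratings` array and the standalone `rmse` helper are hypothetical stand-ins for illustration only; they are not the values in `test.csv`, so the result will not match the 1.259 reported above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical toy ratings (assumption: NOT taken from test.csv).
toy_ratings = np.array([5, 4, 3, 1, 2, 5])

# Constant baseline: predict 3 for every rating.
baseline_pred = np.full(toy_ratings.shape, 3.0)

baseline_rmse = rmse(toy_ratings, baseline_pred)
print(baseline_rmse)
```

On the real data the same calculation would use `test.rating` in place of `toy_ratings`.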

2. Discuss the results and why sklearn's non-negative matrix factorization library did not work well compared to simple baseline or similarity-based methods we’ve done in Module 3. Can you suggest a way(s) to fix it? [10 pts]
> I am not certain why NMF performed so poorly. My impression is that NMF treats every unrated entry as if it were a real rating, and since the rating matrix is mostly unrated entries, those filler values strongly influence the predictions. For instance, if I leave the missing entries as 0, I obtain an RMSE of 2.6; when I impute them with 3, the result improves to 1.356. This suggests that the filler values are skewing the results. One way to improve further would be to impute each missing entry with that user's average rating, which will likely help. Interpretation is also worth considering: even when the exact rating is hard to predict, the relative ordering of the predictions can still be useful. The highest predicted values for a given user are likely the films that should be recommended, even if the predictions do not fall on the preferred 1-5 scale.
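
The two suggestions above (imputing each user's average, then ranking by predicted value) can be sketched as follows. This is a minimal illustration under stated assumptions, not a tested fix: the 4x4 `toy` matrix, the `impute_user_means` helper, and the choices `n_components=2` and `max_iter=1000` are all hypothetical, and the sketch makes no claim about the RMSE it would reach on the real data.

```python
import numpy as np
from sklearn.decomposition import NMF

def impute_user_means(ratings):
    """Replace 0 (unrated) entries with each user's mean observed rating.

    Hypothetical helper: users with no ratings fall back to the global mean.
    """
    ratings = np.asarray(ratings, dtype=float)
    observed = ratings > 0
    counts = observed.sum(axis=1)
    sums = ratings.sum(axis=1)  # unrated zeros contribute nothing to the sum
    global_mean = ratings[observed].mean()
    user_means = np.where(counts > 0, sums / np.maximum(counts, 1), global_mean)
    filled = np.where(observed, ratings, user_means[:, None])
    return filled, observed

# Hypothetical toy rating matrix (rows: users, columns: movies, 0 = unrated).
toy = np.array([
    [5, 4, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

filled, observed = impute_user_means(toy)

# Factorize the imputed matrix; parameter values are illustrative assumptions.
model = NMF(n_components=2, init="random", random_state=0, max_iter=1000)
W = model.fit_transform(filled)
H = model.components_
pred = np.clip(W @ H, 1, 5)  # keep predictions on the 1-5 rating scale

# Rank the movies user 1 has not rated, highest predicted rating first.
user = 1
unrated = np.flatnonzero(~observed[user])
ranked = unrated[np.argsort(-pred[user, unrated])]
print(ranked)
```

Clipping to the range 1-5 is used here instead of rescaling, so each prediction stays tied to its own reconstructed value. For recommendation only the ordering in `ranked` matters, not the absolute scale.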