# Module 4 - matrix factorization technique(s) to predict ratings

Limitation(s) of sklearn’s non-negative matrix factorization library.

Please mark the sections of your notebook as 1 and 2 so that graders can follow along.

1. Load the movie ratings data (as in the HW3-recommender-system) and use matrix factorization technique(s) and predict the missing ratings from the test data. Measure the RMSE. You should use sklearn library. [10 pts] Make sure that your notebook includes the following:
* uses sklearn's non-negative matrix factorization
* notebook shows the RMSE with an analysis of what that RMSE means

2. Discuss the results and why they did not work well compared to simple baseline or similarity-based methods we’ve done in Module 3. Can you suggest a way(s) to fix it? [10 pts]

In [1]:
from dataclasses import asdict
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import time
from sklearn.model_selection import train_test_split
from scipy.sparse import coo_matrix, csr_matrix, diags
from scipy.spatial.distance import jaccard, cosine, pdist, squareform
from collections import namedtuple

from sklearn.decomposition import NMF
from sklearn.pipeline import Pipeline
from pytest import approx

## Setup - From Module 3

In [2]:
MV_users = pd.read_csv('data/users_m3.csv')
MV_movies = pd.read_csv('data/movies_m3.csv')
train = pd.read_csv('data/train_m3.csv')
test = pd.read_csv('data/test_m3.csv')

In [3]:

Data = namedtuple('Data', ['users','movies','train','test'])

data = Data(MV_users, MV_movies, train, test)
# Creating Sample test data
np.random.seed(42)
sample_train = train[:30000]
sample_test = test[:30000]

sample_MV_users = MV_users[(MV_users.uID.isin(sample_train.uID)) | (MV_users.uID.isin(sample_test.uID))]
sample_MV_movies = MV_movies[(MV_movies.mID.isin(sample_train.mID)) | (MV_movies.mID.isin(sample_test.mID))]

sample_data = Data(sample_MV_users, sample_MV_movies, sample_train, sample_test)

In [4]:
class RecSys():
    def __init__(self, data):
        self.data = data
        self.allusers = list(self.data.users['uID'])
        self.allmovies = list(self.data.movies['mID'])
        self.genres = list(self.data.movies.columns.drop(['mID', 'title', 'year']))
        self.mid2idx = dict(zip(self.data.movies.mID, list(range(len(self.data.movies)))))
        self.uid2idx = dict(zip(self.data.users.uID, list(range(len(self.data.users)))))
        self.Mr = self.rating_matrix()
        self.Mm = None
        self.sim = np.zeros((len(self.allmovies), len(self.allmovies)))

    def rating_matrix(self):
        """
        Convert the rating matrix to numpy array of shape (#allusers,#allmovies)
        """
        ind_movie = [self.mid2idx[x] for x in self.data.train.mID]
        ind_user = [self.uid2idx[x] for x in self.data.train.uID]
        rating_train = list(self.data.train.rating)

        return np.array(coo_matrix((rating_train, (ind_user, ind_movie)),
                                   shape=(len(self.allusers), len(self.allmovies))).toarray())

    def predict_everything_to_3(self):
        """
        Predict everything to 3 for the test data
        """
        return np.ones(len(self.data.test)) * 3

    def predict_to_user_average(self):
        """
        Predict to average rating for the user.
        Returns numpy array of shape (# of rows in test data,)
        """
        user_averages = np.zeros(len(self.data.users))
        for i, uid in enumerate(self.data.users.uID):
            user_ratings = self.Mr[self.uid2idx[uid], :]
            num_rated = user_ratings.sum()
            if num_rated > 0:
                user_movieset = self.Mr[self.uid2idx[uid], :] > 0
                user_averages[i] = user_ratings.sum() / user_movieset.sum()
            else:
                user_averages[i] = 3

        # Now map test users to their averages
        test_predictions = np.zeros(len(self.data.test))
        for i, uid in enumerate(self.data.test.uID):
            user_idx = self.uid2idx[uid]
            test_predictions[i] = user_averages[user_idx]

        return test_predictions

    def predict_from_sim(self, uid, mid):
        """
        Predict a user rating on a movie given userID and movieID
        """
        user_idx = self.uid2idx[uid]
        movie_idx = self.mid2idx[mid]
        user_ratings = self.Mr[user_idx]
        movie_sim = self.sim[movie_idx]

        user_average = 3

        unweighted_ratings = user_ratings @ movie_sim
        rated_movies = user_ratings > 0
        weighting = np.dot(movie_sim, rated_movies)
        if weighting == 0:
            num_rated = user_ratings.sum()
            if num_rated > 0:
                user_movieset = user_ratings > 0
                user_average = num_rated / user_movieset.sum()

            user_pred = user_average
        else:
            user_pred = unweighted_ratings / weighting

        return user_pred

    def predict(self):
        """
        Predict ratings in the test data. Returns predicted rating in a numpy array of size (# of rows in testdata,)
        """
        predicted_ratings = np.zeros(len(self.data.test))
        for i in range(len(self.data.test)):
            uid = self.data.test.uID[i]
            mid = self.data.test.mID[i]
            predicted_ratings[i] = self.predict_from_sim(uid, mid)
        return predicted_ratings

    def rmse(self, yp):
        yp[np.isnan(yp)] = 3  # Impute any NaN predictions with 3 (note: this modifies yp in place)
        yt = np.array(self.data.test.rating)
        return np.sqrt(((yt - yp) ** 2).mean())


class ContentBased(RecSys):
    def __init__(self, data):
        super().__init__(data)
        self.data = data
        self.Mm = self.calc_movie_feature_matrix()

    def calc_movie_feature_matrix(self):
        """
        Create movie feature matrix in a numpy array of shape (#allmovies, #genres)
        """
        # Genre indicator columns: everything after the first three columns (mID, title, year)

        return np.asarray(self.data.movies[self.data.movies.columns[3:]])

    def calc_item_item_similarity(self):
        # item-item similarity using Jaccard similarity
        # Method 2: Sparse implementation
        Mm_sparse = csr_matrix(self.Mm, dtype=np.float32)
        # Calculate intersection matrix
        intersection = Mm_sparse @ Mm_sparse.T
        # Get movie sizes (number of genres)
        movie_sizes = np.array(Mm_sparse.sum(axis=1)).flatten()
        # Broadcast to create union matrix
        union = movie_sizes[:, None] + movie_sizes[None, :] - intersection.toarray()
        # Avoid division by zero
        union = np.maximum(union, 1)
        # Calculate Jaccard similarity
        sparse_sim = intersection.toarray() / union
        # Ensure diagonal is 1
        np.fill_diagonal(sparse_sim, 1.0)

        self.sim = sparse_sim


class Collaborative(RecSys):
    def __init__(self, data):
        super().__init__(data)

    def calc_item_item_similarity(self, simfunction, *X):
        # General function that calculates item-item similarity from the given sim function and optional input data
        if len(X) == 0:
            self.sim = simfunction()
        else:
            self.sim = simfunction(X[0])  # *X arrives as a tuple (X,), so X[0] is the actual transformed matrix

    def cossim(self):
        """
        Calculates item-item similarity for all pairs of items using cosine similarity (values from 0 to 1) on utility matrix
        Returns a cosine similarity matrix of size (#all movies, #all movies)
        """
        # Return a sim matrix by calculating item-item similarity for all pairs of items using cosine similarity
        # Cosine Similarity: C(A, B) = (A.B) / (||A||.||B||)

        #####
        # Method 1 - using dense matrix (including centering and normalisation)
        Xd = self.Mr.copy().astype(np.float64)     #(users, movies)
        # calculate user averages
        user_means = np.zeros(Xd.shape[0])
        for i in range(Xd.shape[0]):
            rated_movies = Xd[i,:] > 0
            if rated_movies.sum() > 0:
                user_means[i] = Xd[i,rated_movies].mean()
            else:
                user_means[i] = 3.0
        # replace 0s with user averages
        for i in range(Xd.shape[0]):
            zero_mask = Xd[i,:] == 0
            Xd[i,zero_mask] = user_means[i]
        # Centre the data
        Xd = Xd - user_means[:,np.newaxis]  # converts user means to column vector of shape (users, 1) before centering
        # Transpose to get movies as rows and users as columns
        Xd = Xd.T     #(movies, users)
        # Normalise the data
        norms = np.linalg.norm(Xd, axis=1, keepdims=True)
        norms = np.maximum(norms, 1e-10)  # Avoid divide by zero
        Xd_norm = Xd / norms
        # Calc similarity matrix
        sim_dense = Xd_norm @ Xd_norm.T     #Cosine similarity = (A.B) / (||A||.||B||)
        # Rescale to be 0~1
        sim_dense = (sim_dense + 1) / 2
        # Handle NaNs and ensure diagonal is 1
        sim_dense = np.nan_to_num(sim_dense, nan=0.5)
        np.fill_diagonal(sim_dense, 1.0)

        return sim_dense


    def jacsim(self, Xr):
        """
        Calculates item-item similarity for all pairs of items using jaccard similarity (values from 0 to 1)
        Xr is the transformed rating matrix. Shape: (users, movies)
        """
        n = Xr.shape[1]  # Number of movies
        maxr = int(Xr.max())  # Maximum rating value

        if maxr > 1: # Multi-category case
            intersection = np.zeros((n, n)).astype(int)
            union = np.zeros((n, n)).astype(int)

            for i in range(1, maxr + 1): # intersections and unions for each rating level
                rating_level = (Xr == i).astype(int)
                csr = csr_matrix(rating_level)
                # Intersection for this rating level
                level_intersection = np.array(csr.T.dot(csr).toarray()).astype(int)
                intersection += level_intersection
                # Union for this rating level (users who gave rating i to either movie)
                rowsum = rating_level.sum(axis=0)  # Movies with rating i
                level_union = rowsum[:, None] + rowsum[None, :] - level_intersection
                union += level_union

        else: # Binary case
            csr0 = csr_matrix((Xr > 0).astype(int))
            intersection = np.array(csr0.T.dot(csr0).toarray()).astype(int)
            A = (Xr > 0).astype(bool)
            rowsum = A.sum(axis=0)
            union = rowsum[:, None] + rowsum[None, :] - intersection

        # Calculate Jaccard similarity
        union = np.maximum(union, 1)
        similarity = intersection.astype(float) / union.astype(float)

        # Boundary checks
        similarity = np.nan_to_num(similarity, nan=0.0, posinf=0.0, neginf=0.0)
        np.fill_diagonal(similarity, 1.0)

        return similarity


In [5]:
# Setup rs and sample_rs
sample_rs = RecSys(sample_data)
rs = RecSys(data)

# predict_everything_to_3 in class RecSys
sample_yp = sample_rs.predict_everything_to_3()
yp = rs.predict_everything_to_3()

# build dictionary for comparison of RMSE values of all approaches
dict_compare = {}
rmse_val_sample = sample_rs.rmse(sample_yp)
rmse_val_data = rs.rmse(yp)
dict_compare['Baseline 3'] = {'sample rmse':rmse_val_sample, 'data rmse':rmse_val_data}

# Sample tests predict_to_user_average in the class RecSys
sample_yp = sample_rs.predict_to_user_average()
yp = rs.predict_to_user_average()
rmse_val_sample = sample_rs.rmse(sample_yp)
rmse_val_data = rs.rmse(yp)
dict_compare['Baseline User Average'] = {'sample rmse':rmse_val_sample, 'data rmse':rmse_val_data}

### Content-Based model

In [6]:
cb = ContentBased(data)
cb.calc_item_item_similarity()
sample_cb = ContentBased(sample_data)
sample_cb.calc_item_item_similarity()
# Sample tests method predict in the RecSys class
sample_yp = sample_cb.predict()
sample_rmse = sample_cb.rmse(sample_yp)
# predict data
yp = cb.predict()
rmse = cb.rmse(yp)
rmse_val_sample = sample_rs.rmse(sample_yp)
rmse_val_data = rs.rmse(yp)
dict_compare['Content Based Item-Item'] = {'sample rmse':rmse_val_sample, 'data rmse':rmse_val_data}

### Collaborative Filtering

In [7]:
# Sample tests cossim method in the Collaborative class
sample_cf = Collaborative(sample_data)
sample_cf.calc_item_item_similarity(sample_cf.cossim)
sample_yp = sample_cf.predict()
sample_rmse = sample_cf.rmse(sample_yp)

cf = Collaborative(data)
cf.calc_item_item_similarity(cf.cossim)
yp = cf.predict()
rmse = cf.rmse(yp)

rmse_val_sample = sample_rs.rmse(sample_yp)
rmse_val_data = rs.rmse(yp)
dict_compare['Collaborative Cosine Similarity'] = {'sample rmse':rmse_val_sample, 'data rmse':rmse_val_data}
#Mr >= 3
Xr = cf.Mr>=3
cf.calc_item_item_similarity(cf.jacsim,Xr)
yp = cf.predict()
rmse_val_sample = None
rmse_val_data = rs.rmse(yp)
dict_compare['Jacsim - Rating > 3'] = {'sample rmse':rmse_val_sample, 'data rmse':rmse_val_data}

# Mr >= 1
Xr = cf.Mr>=1
cf.calc_item_item_similarity(cf.jacsim,Xr)
yp = cf.predict()
rmse_val_sample = None
rmse_val_data = rs.rmse(yp)
dict_compare['Jacsim - Rating > 1'] = {'sample rmse':rmse_val_sample, 'data rmse':rmse_val_data}

# Jacsim Multiclass
Xr = cf.Mr.astype(int)
cf.calc_item_item_similarity(cf.jacsim,Xr)
yp = cf.predict()
rmse_val_sample = None
rmse_val_data = rs.rmse(yp)
dict_compare['Jacsim - Multiclass'] = {'sample rmse':rmse_val_sample, 'data rmse':rmse_val_data}


## SECTION 1 - matrix factorization technique(s) to predict the missing ratings from the test data

Load the movie ratings data (as in the HW3-recommender-system) and use matrix factorization technique(s) and predict the missing ratings from the test data. Measure the RMSE. You should use sklearn library. Make sure that your notebook includes the following:
* uses sklearn's non-negative matrix factorization
* notebook shows the RMSE with an analysis of what that RMSE means
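
As a reference point for reading the RMSE values reported in this notebook, the sketch below computes RMSE on a handful of made-up ratings. The function name `toy_rmse_fn` and the toy values are assumptions for illustration only; the notebook's own metric is the `rmse` method of `RecSys`.

```python
import numpy as np

def toy_rmse_fn(y_true, y_pred):
    """Root mean squared error: the square root of the average squared miss."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical ratings and a constant "predict 3" guess
toy_true = [4, 3, 5, 1]
toy_pred = [3, 3, 3, 3]

# Errors are 1, 0, 2, -2; squared errors are 1, 0, 4, 4; their mean is 2.25
toy_rmse = toy_rmse_fn(toy_true, toy_pred)
print(toy_rmse)  # 1.5
```

RMSE is expressed in the same units as the ratings, so a value of 1.5 means predictions miss by roughly one and a half stars, with large misses weighted more heavily than small ones.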

In [8]:
class RatingsNMF(RecSys):
    def __init__(self, data):
        super().__init__(data)
        self.H = None
        self.W = None
        self.predicted_ratings = None

    # This was the original fit_nmf function that I tried, but the results were truly awful.
    # I found a mix of issues: the sparsity - NMF not handling zeros well, the loss function may not be stable
    def fit_nmf_v1(self,genres=16,max_iter=1000,alpha_W=0.01,alpha_H=0.01, beta_loss="kullback-leibler"):

        # 'mu' solver required when using kullback-leibler beta_loss
        if beta_loss == 'kullback-leibler':
            solver = 'mu'
            init = 'nndsvda' # changed from 'nndsvd' due to warning message received
        else:
            solver = 'cd'
            init = 'nndsvd'

        nmf_model = NMF(
            n_components=genres,
            random_state=42,
            max_iter=max_iter,
            alpha_W=alpha_W,  # Regularization strength on W (0.01 by default in this function)
            alpha_H=alpha_H,
            init=init,
            solver=solver,
            beta_loss=beta_loss, # from lectures kullback-leibler good where we have many zeros
        )
        # self.Mr rating matrix is a numpy array of shape (#allusers,#allmovies)
        # NMF is fit on it directly, so W is users x genres and H is genres x movies
        self.W = nmf_model.fit_transform(self.Mr)  # users x genres
        self.H = nmf_model.components_   # genres x movies
        self.predicted_ratings = self.W @ self.H # users x movies
        return

    # New version of fit_nmf function that uses the frobenius loss function.
    # Plus replacing zeros with user-movie bias
    def fit_nmf_v2(self,genres=16,max_iter=1000,alpha_W=0.01,alpha_H=0.01):

        matrix_for_nmf = self.Mr.copy().astype(np.float64)

        # Calculate user and movie means for non-zero ratings
        user_means = np.zeros(matrix_for_nmf.shape[0])
        movie_means = np.zeros(matrix_for_nmf.shape[1])

        for i in range(matrix_for_nmf.shape[0]):
            rated = matrix_for_nmf[i, :] > 0
            if rated.sum() > 0:
                user_means[i] = matrix_for_nmf[i, rated].mean()
            else:
                user_means[i] = 3.0

        for j in range(matrix_for_nmf.shape[1]):
            rated = matrix_for_nmf[:, j] > 0
            if rated.sum() > 0:
                movie_means[j] = matrix_for_nmf[rated, j].mean()
            else:
                movie_means[j] = 3.0

        # Fill zeros with combined user-movie bias (vectorized; same result as looping over every cell)
        unrated = matrix_for_nmf == 0
        fill_values = (user_means[:, None] + movie_means[None, :]) / 2
        matrix_for_nmf[unrated] = fill_values[unrated]

        # Apply NMF
        nmf_model = NMF(
            n_components=genres,
            random_state=42,
            max_iter=max_iter,
            alpha_W=alpha_W,
            alpha_H=alpha_H,
            init='nndsvd',
            solver='cd',
            beta_loss='frobenius'
        )

        self.W = nmf_model.fit_transform(matrix_for_nmf)
        self.H = nmf_model.components_
        self.predicted_ratings = self.W @ self.H
        return

    def predict(self, test_data=None):
        """
        Predict ratings using the fitted NMF model
        """
        if self.predicted_ratings is None:
            raise ValueError("Model must be fitted first. Call fit_nmf_v1() or fit_nmf_v2() before predict()")
        if test_data is None:
            test_data = self.data.test

        test_pairs = test_data[['uID', 'mID']].values
        user_indices = [self.uid2idx[uid] for uid in test_pairs[:, 0].astype(int)]
        movie_indices = [self.mid2idx[mid] for mid in test_pairs[:, 1].astype(int)]
        predictions = self.predicted_ratings[user_indices, movie_indices]
        predictions = np.clip(predictions, 1.0, 5.0)
        return predictions


In [9]:
# instantiate class for NMF exploration
nmf_r = RatingsNMF(data)
# Latent Dimension = genres
genres = 16
# fit the model and add the results to self.W, self.H and self.predicted_ratings
nmf_r.fit_nmf_v1(genres=genres)
# predict and find the rmse
yp = nmf_r.predict(data.test)
rmse_val_sample = None
rmse_val_data = rs.rmse(yp)
dict_compare['NMF v1'] = {'sample rmse':rmse_val_sample, 'data rmse':rmse_val_data}

In [12]:
# fit the model and add the results to self.W, self.H and self.predicted_ratings
nmf_r.fit_nmf_v2(genres=genres, alpha_W = 0.05, alpha_H = 0.05, max_iter=2000)
# predict and find the rmse
yp = nmf_r.predict(data.test)
rmse_val_sample = None
rmse_val_data = rs.rmse(yp)
dict_compare['NMF v2'] = {'sample rmse':rmse_val_sample, 'data rmse':rmse_val_data}


In [13]:
df_compare = pd.DataFrame(dict_compare).T.reset_index()
df_compare.columns.name = None  # Remove the column name 'index'
df_compare = df_compare.rename(columns={'index': 'Method'})
display(df_compare)

| | Method | sample rmse | data rmse |
|---|---|---|---|
| 0 | Baseline 3 | 1.264278 | 1.258551 |
| 1 | Baseline User Average | 1.14296 | 1.035291 |
| 2 | Content Based Item-Item | 1.196609 | 1.012503 |
| 3 | Collaborative Cosine Similarity | 1.142693 | 1.026308 |
| 4 | Jacsim - Rating > 3 | | 0.98195 |
| 5 | Jacsim - Rating > 1 | | 0.991354 |
| 6 | Jacsim - Multiclass | | 0.951656 |
| 7 | NMF v1 | | 2.594395 |
| 8 | NMF v2 | | 0.953741 |


## SECTION 2 - Discuss the Results

Discuss the results and why they did not work well compared to simple baseline or similarity-based methods we’ve done in Module 3. Can you suggest a way(s) to fix it?

#### Original attempt - fit_nmf_v1 in class RatingsNMF
This first attempt yielded an RMSE of 2.59, about twice that of the constant "predict 3" baseline (1.26) and far worse than every method tried previously. I identified two issues:
1) The sparsity of the data - the zeros in the rating matrix only mark unrated movies, and NMF does not handle them well
2) The loss function may not be stable (I introduced Kullback-Leibler only because the lectures suggested it for data with many zeros, but it did not help)

NMF struggles with highly sparse data because:
- sklearn's NMF has no notion of a missing entry, so it fits every cell of the matrix, including the zeros that only mean "not rated"
- The comparatively few observed ratings provide too little signal for a meaningful factorization
- The algorithm may converge to poor local minima
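
One way to see why the zeros matter: ignoring the regularization terms, a Frobenius-loss factorization fitted to the full matrix minimizes the first objective below, whereas a recommender would ideally minimize the second, which sums only over observed ratings. The symbols $R$ (rating matrix), $W$, $H$ (factors) and $\Omega$ (the set of rated user-movie pairs) are notation introduced here for illustration.

$$
\min_{W \ge 0,\, H \ge 0} \; \frac{1}{2} \sum_{u,m} \left( R_{um} - (WH)_{um} \right)^2
\qquad \text{versus} \qquad
\min_{W \ge 0,\, H \ge 0} \; \frac{1}{2} \sum_{(u,m) \in \Omega} \left( R_{um} - (WH)_{um} \right)^2
$$

In the first form every unrated cell contributes a squared error against a target of 0, so the more unrated cells there are, the harder the reconstruction is pulled toward zero.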

I tried replacing all the zeros with minimal values (0.01), but this did not improve the RMSE at all.

#### Final version Fixing the Issues - fit_nmf_v2 in class RatingsNMF
Recalling the lecture discussion and the user-average computation used for item-item similarity, I replaced each zero with the mean of that user's average rating and that movie's average rating (falling back to 3 for a user or movie with no ratings). This combined user-movie bias fills every unrated cell before factorization.
I also adjusted the parameters:
1) Frobenius loss function
2) Regularization parameters (alpha_W and alpha_H raised to 0.05) - to reduce the iterations required
3) Max iterations (raised to 2000) - to limit the warnings received
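
The mean-based fill can be checked on a tiny matrix. The sketch below is a standalone NumPy version under the same conventions as the notebook (0 means unrated, 3 is the fallback mean); the helper name `fill_unrated_with_means` and the toy ratings are hypothetical and not part of the `RatingsNMF` class.

```python
import numpy as np

def fill_unrated_with_means(R, default=3.0):
    """Replace each 0 with the average of its user's mean and its movie's mean."""
    R = np.asarray(R, dtype=float)
    rated = R > 0
    user_counts = rated.sum(axis=1)
    movie_counts = rated.sum(axis=0)
    # Means over rated cells only; unrated cells are 0 and do not affect the sums
    user_means = np.where(user_counts > 0,
                          R.sum(axis=1) / np.maximum(user_counts, 1), default)
    movie_means = np.where(movie_counts > 0,
                           R.sum(axis=0) / np.maximum(movie_counts, 1), default)
    fill = (user_means[:, None] + movie_means[None, :]) / 2
    return np.where(rated, R, fill)

# Toy matrix: 3 users x 3 movies, 0 = unrated
toy = np.array([[5, 0, 3],
                [0, 4, 0],
                [1, 0, 0]])
filled = fill_unrated_with_means(toy)
print(filled)
```

Here the user means are 4, 4 and 1 and the movie means are 3, 4 and 3, so for example the unrated cell for user 2 and movie 1 becomes (1 + 4) / 2 = 2.5. The observed ratings are left untouched.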

This second attempt yielded a much better RMSE of 0.954, almost identical to the 0.952 of multiclass Jaccard similarity, the best of the Module 3 methods.

Essentially this is a hybrid solution, imputing averages to address the problem of sparsity. It seems to be a good fix!
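
Another possible fix, not tried in this notebook, is to keep the factorization non-negative but fit only the observed ratings. sklearn's NMF has no option for this, so the sketch below swaps in a small hand-written weighted multiplicative-update routine in NumPy. The function `masked_nmf`, the synthetic low-rank data and all parameter values are assumptions made for illustration; nothing here was run on the movie ratings.

```python
import numpy as np

def masked_nmf(R, mask, k, n_iter=500, eps=1e-9, seed=0):
    """Weighted multiplicative updates: only cells where mask == 1 enter the loss."""
    rng = np.random.default_rng(seed)
    W = rng.random((R.shape[0], k)) + 0.1  # positive start keeps the factors non-negative
    H = rng.random((k, R.shape[1])) + 0.1
    MR = mask * R
    for _ in range(n_iter):
        W *= (MR @ H.T) / ((mask * (W @ H)) @ H.T + eps)
        H *= (W.T @ MR) / (W.T @ (mask * (W @ H)) + eps)
    return W, H

def heldout_rmse(truth, W, H, heldout):
    """RMSE of the reconstruction W @ H on the cells selected by heldout."""
    return float(np.sqrt(np.mean((truth - W @ H)[heldout] ** 2)))

# Synthetic rank-2 "ratings" with about half of the cells hidden (stored as 0)
rng = np.random.default_rng(42)
truth = rng.uniform(0.5, 1.5, (40, 2)) @ rng.uniform(0.5, 1.5, (2, 30))
observed = rng.random(truth.shape) < 0.5
R = np.where(observed, truth, 0.0)

# Zero-filled fit: every cell counts, so hidden cells are fitted toward 0
W0, H0 = masked_nmf(R, np.ones_like(R), k=2)
# Masked fit: only the observed cells count
W1, H1 = masked_nmf(R, observed.astype(float), k=2)

rmse_zero_filled = heldout_rmse(truth, W0, H0, ~observed)
rmse_masked = heldout_rmse(truth, W1, H1, ~observed)
print(f"zero-filled: {rmse_zero_filled:.3f}  masked: {rmse_masked:.3f}")
```

On this synthetic data the masked fit should recover the hidden cells much better than the zero-filled fit, which is the same failure mode as fit_nmf_v1. Whether it would beat the mean-imputation approach of fit_nmf_v2 on the real ratings is an open question that would need to be tested.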
