In [2]:
from sklearn.decomposition import NMF
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np

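def preprocess_data(train, test):
    # Convert each DataFrame to a NumPy array as-is. This assumes the CSVs
    # already hold a users x movies rating matrix rather than ID columns.
    train_arr = train.to_numpy()
    test_arr = test.to_numpy()
    return train_arr, test_arr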
    
def train_nmf_model(train_arr):
    nmf_model = NMF(n_components=15, init='random', random_state=42, max_iter=500)
    W = nmf_model.fit_transform(train_arr)
    H = nmf_model.components_
    return nmf_model, W, H

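def predict_ratings(nmf_model, W, H, test_arr):
    # Reconstruct the training matrix and keep its first test_arr.shape[0] rows.
    # This assumes those rows line up with the rows of test_arr.
    test_predicted_ratings = np.dot(W, H)[:test_arr.shape[0], :]
    return test_predicted_ratings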

def evaluate_model(test_arr, test_predicted_ratings):
    test_mask = (test_arr > 0)
    rmse = np.sqrt(mean_squared_error(test_arr[test_mask], test_predicted_ratings[test_mask]))
    print("RMSE:", rmse)


In [4]:
users = pd.read_csv('data/users.csv')
movies = pd.read_csv('data/movies.csv')
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

train_arr, test_arr = preprocess_data(train, test)
nmf_model, W, H = train_nmf_model(train_arr)
test_predicted_ratings = predict_ratings(nmf_model, W, H, test_arr)
evaluate_model(test_arr, test_predicted_ratings)

RMSE: 1672.7374298158854


We use scikit-learn's NMF with 15 components to recommend movies. The resulting RMSE is about 1672.7, which is extremely high (far beyond the range of any typical rating scale), so we can probably improve this model.

The high error is probably due to several limitations of scikit-learn's NMF. One is that it treats missing ratings as observed values (here, zeros) rather than as missing, so the factorization is fit to the empty entries as well as the real ratings, which can lead to suboptimal results. A potential solution is to modify the NMF algorithm so that it fits only the observed entries, or to use an alternative library that handles missing data.
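One way to fit only the observed entries is a masked (weighted) variant of the multiplicative-update NMF rules. The sketch below is a minimal illustration, not part of scikit-learn: the function name `masked_nmf`, the toy matrix `R`, and all parameter values are assumptions made for the example, and zeros are assumed to mean "not rated".

```python
import numpy as np

def masked_nmf(R, mask, n_components=2, n_iter=200, eps=1e-9, seed=0):
    """Factorize R ~ W @ H using only the entries where mask is True."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    # Strictly positive random start so the multiplicative updates can move.
    W = rng.random((n_users, n_components)) + 0.1
    H = rng.random((n_components, n_items)) + 0.1
    M = mask.astype(float)
    MR = M * R  # observed ratings, zero elsewhere
    for _ in range(n_iter):
        # Multiplicative updates in which unobserved entries get zero weight.
        W *= (MR @ H.T) / ((M * (W @ H)) @ H.T + eps)
        H *= (W.T @ MR) / (W.T @ (M * (W @ H)) + eps)
    return W, H

# Hypothetical toy rating matrix: rows are users, columns are movies.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0

W, H = masked_nmf(R, mask)
observed_rmse = np.sqrt(np.mean((R[mask] - (W @ H)[mask]) ** 2))
print("RMSE on observed entries:", observed_rmse)
```

Because the mask removes the empty cells from the objective, the values of `W @ H` at those cells act as predictions instead of being pulled toward zero.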

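Another issue is that the model above is fit without regularization, which increases its susceptibility to overfitting. scikit-learn's NMF does support regularization through the `alpha_W`, `alpha_H`, and `l1_ratio` parameters, but the penalty strength defaults to zero. This can be addressed by setting those parameters or by using other regularized matrix factorization techniques.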

scikit-learn's NMF also does not explicitly model biases in user or item ratings, such as a user who rates everything high or a movie that is rated low by most users. This can be solved either by extending the NMF to incorporate bias terms or by using an alternative factorization model that includes them.
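To make the idea of bias terms concrete, here is a minimal sketch of a baseline predictor of the form global mean + user bias + item bias, estimated from observed ratings only. The function names, the damping parameter `reg`, and the toy matrix are assumptions for illustration; a factorization model could then be fit to the residuals this baseline leaves behind.

```python
import numpy as np

def fit_baseline(R, mask, reg=1.0):
    """Estimate global mean, item biases, and user biases from observed entries."""
    mu = R[mask].mean()
    # Item bias: damped mean of (rating - mu) over the users who rated the item.
    item_resid = np.where(mask, R - mu, 0.0)
    item_counts = mask.sum(axis=0)
    b_item = item_resid.sum(axis=0) / np.maximum(item_counts + reg, 1e-12)
    # User bias: damped mean of what is left after removing mu and the item bias.
    user_resid = np.where(mask, R - mu - b_item[None, :], 0.0)
    user_counts = mask.sum(axis=1)
    b_user = user_resid.sum(axis=1) / np.maximum(user_counts + reg, 1e-12)
    return mu, b_user, b_item

def predict_baseline(mu, b_user, b_item):
    """Predicted rating for every (user, item) pair."""
    return mu + b_user[:, None] + b_item[None, :]

# Hypothetical toy rating matrix: zeros are assumed to mean "not rated".
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0

mu, b_user, b_item = fit_baseline(R, mask)
pred = predict_baseline(mu, b_user, b_item)

mean_rmse = np.sqrt(np.mean((R[mask] - mu) ** 2))
baseline_rmse = np.sqrt(np.mean((R[mask] - pred[mask]) ** 2))
print("global-mean RMSE:", mean_rmse)
print("baseline RMSE:   ", baseline_rmse)
```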

Finally, scikit-learn's NMF as used here faces limitations for recommender systems compared to similarity-based methods such as user-based or item-based collaborative filtering. Closing that gap will probably involve alternative libraries, regularization techniques, and explicit modeling of biases.
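For comparison, a similarity-based method can be sketched in a few lines. The example below is one simple item-based variant using cosine similarity. It is an illustrative assumption, not the method used in this notebook: the function names and the toy matrix are hypothetical, and computing cosine similarity on zero-filled columns is itself a simplification.

```python
import numpy as np

def item_cosine_similarity(R):
    """Cosine similarity between item columns of a zero-filled rating matrix."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0  # avoid dividing by zero for unrated items
    Rn = R / norms
    return Rn.T @ Rn

def predict_item_based(R, mask, sim):
    """Predict each rating as a similarity-weighted average of the user's own ratings."""
    num = (R * mask) @ sim                   # sum of rating * similarity
    den = mask.astype(float) @ np.abs(sim)   # sum of similarities over rated items
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

# Hypothetical toy rating matrix: zeros are assumed to mean "not rated".
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0

sim = item_cosine_similarity(R)
pred = predict_item_based(R, mask, sim)
print(np.round(pred, 2))
```

Each prediction is a weighted average of ratings the user actually gave, so it stays within that user's observed rating range.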