# Limitations of sklearn’s non-negative matrix factorization library

### Contents
1. Calculating the RMSE of a recommender system based on NMF
2. The limitations of sklearn’s NMF and conclusion
---

### 1. Calculating the RMSE of a recommender system based on NMF
Previously, we built a recommendation system based on Jaccard similarity.
In this project, we will build a recommendation system based on matrix factorization.
How does it perform? Let's build it and see.

In [1]:
import pandas as pd
import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer

In [2]:
# Load the data
MV_users = pd.read_csv('data/users.csv')
MV_movies = pd.read_csv('data/movies.csv')
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

In [3]:
# Make a user ID x movie ID pivot table (rating matrix)
rating_matrix = train.pivot(index='uID', columns='mID', values='rating')

In [4]:
# Print how many users and movies there are
print('Number of users:', rating_matrix.shape[0])
print('Number of movies:', rating_matrix.shape[1])
# Print the number of missing values
print('Number of missing values:', rating_matrix.isnull().sum().sum())
# Print the share of missing values in the whole matrix, as a percentage
print('Number of missing values:', round(rating_matrix.isnull().sum().sum() / (rating_matrix.shape[0] * rating_matrix.shape[1]) * 100, 1), '%')

Number of users: 6040
Number of movies: 3664
Number of missing values: 21430414
Number of missing values: 96.8 %


In [5]:
# Fill missing values with each movie's mean rating (SimpleImputer works column by column)
imputer = SimpleImputer(strategy='mean')
rating_matrix_imputed = imputer.fit_transform(rating_matrix)
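
To make the imputation step concrete, here is a minimal sketch on a toy matrix. The values are made up for illustration; the point is only that `SimpleImputer(strategy='mean')` replaces each missing entry with the mean of the observed values in the same column (here, the same movie).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy user x movie matrix (hypothetical values); NaN marks an unrated movie.
toy = np.array([
    [5.0, np.nan, 1.0],
    [np.nan, 4.0, np.nan],
    [3.0, 2.0, np.nan],
])

# Each NaN is replaced by the mean of the observed ratings in its column.
toy_imputed = SimpleImputer(strategy='mean').fit_transform(toy)
print(toy_imputed)
```

In the third column only one rating is observed, so every user ends up with that same value for that movie.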

In [6]:
# NMF transformation with 50 components
nmf_model = NMF(n_components=50, init='random', random_state=42, max_iter=500)
W = nmf_model.fit_transform(rating_matrix_imputed)
H = nmf_model.components_



In [7]:
# generate predicted ratings
predicted_ratings = np.dot(W, H)

In [8]:
# Calculate RMSE of test data from the predicted ratings
test_ratings = []
predicted_test_ratings = []

# Caveat: `ID - 1` is used as a positional index. The pivot table orders rows and
# columns by sorted ID, so this only matches when IDs are contiguous and start at 1.
for _, row in test.iterrows():
    user_idx = int(row['uID']) - 1
    movie_idx = int(row['mID']) - 1
    actual_rating = row['rating']
    # Check if the indices are within the valid range
    if user_idx < predicted_ratings.shape[0] and movie_idx < predicted_ratings.shape[1]:
        predicted_rating = predicted_ratings[user_idx, movie_idx]
    else:
        predicted_rating = 0  # fallback for out-of-bounds indices
    
    test_ratings.append(actual_rating)
    predicted_test_ratings.append(predicted_rating)

# Calculate RMSE as the square root of the MSE (avoids the deprecated `squared=False` argument)
rmse = np.sqrt(mean_squared_error(test_ratings, predicted_test_ratings))
print(f"RMSE: {rmse:.4f}")

RMSE: 1.5218
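
The lookup above treats `uID - 1` and `mID - 1` as array positions. If some movie IDs never appear in the training data, the pivot table has fewer columns than the largest ID and the positions shift. The following is a minimal sketch of a label-based lookup on toy data; the toy frames, the stand-in prediction array, and the global-mean fallback for unseen IDs are assumptions for illustration, not part of the original experiment.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: movie IDs 10, 20, 40 are not contiguous.
train_toy = pd.DataFrame({
    'uID': [1, 1, 2, 3],
    'mID': [10, 40, 20, 40],
    'rating': [5, 3, 4, 2],
})
test_toy = pd.DataFrame({
    'uID': [1, 2, 3],
    'mID': [20, 40, 99],  # movie 99 never appears in train_toy
    'rating': [4, 3, 5],
})

matrix_toy = train_toy.pivot(index='uID', columns='mID', values='rating')
# Stand-in for np.dot(W, H): any array with the same shape as the pivot table.
pred_toy = np.arange(matrix_toy.size, dtype=float).reshape(matrix_toy.shape)

# Translate IDs to row/column positions; IDs missing from the pivot map to -1.
rows = matrix_toy.index.get_indexer(test_toy['uID'].to_numpy())
cols = matrix_toy.columns.get_indexer(test_toy['mID'].to_numpy())
known = (rows >= 0) & (cols >= 0)

# Assumed fallback for unseen users or movies: the global mean training rating.
fallback = train_toy['rating'].mean()
pred = np.full(len(test_toy), fallback)
pred[known] = pred_toy[rows[known], cols[known]]

rmse_toy = np.sqrt(np.mean((test_toy['rating'].to_numpy() - pred) ** 2))
print(pred, rmse_toy)
```

With the real data, `matrix_toy` would correspond to `rating_matrix` and `pred_toy` to `predicted_ratings`. Whether this changes the RMSE reported above depends on how the IDs in the dataset are distributed.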


---
### 2. The limitations of sklearn’s NMF and conclusion
The RMSE of the NMF model is 1.52.  

The RMSE values of the models from week 3 are as follows:

|Method|RMSE|
|:----|:--------:|
|Baseline, $Y_p$=3| 1.26 |
|Baseline, $Y_p=\mu_u$| 1.04 |
|Content based, item-item| 1.01 |
|Collaborative, cosine| 1.03 |
|Collaborative, jaccard, $M_r\geq 3$|0.98 |
|Collaborative, jaccard, $M_r\geq 1$| 0.99 |
|Collaborative, jaccard, $M_r$| 0.95 |

The NMF model has the worst RMSE of all. Why?

Before I give the reason, I will show the characteristics of the data.

Number of users: 6040
Number of movies: 3664
Number of missing values: 21430414
Number of missing values: 96.8 %

In the users vs. movies matrix, 96.8% of the values are missing.
sklearn's NMF does not accept missing values, so we filled them with each movie's mean rating (`SimpleImputer` with `strategy='mean'`) before factorizing.
Because so many values are missing, this filling dominates the result.
After imputation, almost every entry is a filler value shared by all users, so the factorization mostly learns to reproduce those means rather than individual preferences, and the RMSE suffers.
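
A quick back-of-the-envelope check, using only the counts printed above, shows how lopsided the imputed matrix is:

```python
# Counts taken from the notebook output above.
n_users, n_movies = 6040, 3664
n_missing = 21_430_414

n_cells = n_users * n_movies        # total entries in the rating matrix
n_observed = n_cells - n_missing    # entries that are real ratings

print('Total cells:', n_cells)
print('Observed ratings:', n_observed)
print(f'Observed share: {n_observed / n_cells:.1%}')
print(f'Imputed cells per observed rating: {n_missing / n_observed:.1f}')
```

NMF minimizes the reconstruction error over all cells, so with about 30 imputed cells for every real rating, the objective is driven almost entirely by the filler values.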
    
To improve the RMSE, we need to fill the missing values with better values, or
use another algorithm such as collaborative filtering (for example, the Jaccard similarity approach based on user-user similarity).
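
A third option, not tried in this notebook, is to skip imputation and fit the factorization on the observed ratings only. sklearn's `NMF` does not offer this, so the sketch below implements a small weighted NMF with multiplicative updates in NumPy. The function names `masked_nmf` and `masked_rmse`, the toy matrix, and all hyperparameters are assumptions for illustration; it is a sketch of the idea, not a tuned recommender.

```python
import numpy as np

def masked_nmf(X, mask, n_components=2, n_iter=200, seed=0, eps=1e-9):
    """Weighted NMF sketch: fit W @ H to X only where mask == 1."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], n_components)) + 0.1
    H = rng.random((n_components, X.shape[1])) + 0.1
    Xm = mask * X  # unobserved entries contribute nothing to the updates
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        W *= (Xm @ H.T) / ((mask * (W @ H)) @ H.T + eps)
        H *= (W.T @ Xm) / (W.T @ (mask * (W @ H)) + eps)
    return W, H

def masked_rmse(X, mask, W, H):
    """RMSE computed over the observed entries only."""
    diff = mask * (X - W @ H)
    return np.sqrt((diff ** 2).sum() / mask.sum())

# Hypothetical toy ratings; 0 stands for a missing rating here.
X = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = (X > 0).astype(float)

W0, H0 = masked_nmf(X, mask, n_iter=0)   # random initialization only
W, H = masked_nmf(X, mask, n_iter=200)   # after fitting
rmse_before = masked_rmse(X, mask, W0, H0)
rmse_after = masked_rmse(X, mask, W, H)
print(rmse_before, rmse_after)
```

If the rating matrix stores missing values as NaN, as the pivot table above does, they would need to be replaced first (for example with `np.nan_to_num`) and the mask built from `notnull()` instead.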

If the matrix is dense, NMF is a good choice: it reduces the dimensionality and finds the latent features of the data.
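
As a small illustration of the dense case, the sketch below builds a fully observed non-negative matrix with a known low-rank structure and factorizes it with sklearn's `NMF`. The matrix sizes, the rank of 3, and the solver settings are arbitrary choices for the example.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Hypothetical dense, non-negative matrix with an exact rank-3 structure.
W_true = rng.random((30, 3))
H_true = rng.random((3, 20))
X_dense = W_true @ H_true

model = NMF(n_components=3, init='nndsvda', random_state=0, max_iter=2000)
W_dense = model.fit_transform(X_dense)  # 30 rows described by 3 latent features
H_dense = model.components_             # 3 latent features over 20 columns

# Relative reconstruction error of the low-rank approximation.
rel_err = np.linalg.norm(X_dense - W_dense @ H_dense) / np.linalg.norm(X_dense)
print(W_dense.shape, H_dense.shape, rel_err)
```

Here every entry is a real observation, so the reconstruction error measures how well the latent features describe the data rather than how well they reproduce filler values.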


### References

1. [scikit-learn NMF reference](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html)
2. [Introduction to recommendation algorithms: the knowledge needed for implementation, from basics to applications (in Japanese)](https://qiita.com/birdwatcher/items/b60822bdf9be267e1328)
3. [Non-negative matrix factorization for recommendation systems](https://medium.com/logicai/non-negative-matrix-factorization-for-recommendation-systems-985ca8d5c16c#:~:text=Let%20me%20introduce%20you%20to%20Non-negative)