# Part 2: Limitations of sklearn’s NMF with MovieLens (.dat files)

In this section, we will:
1. Read the MovieLens data from `.dat` files (particularly `ratings.dat`).
2. Perform a train-test split.
3. Create a user-item matrix from the training portion.
4. Use `sklearn.decomposition.NMF` to predict missing ratings.
5. Compare RMSE with a simple baseline (global mean).
6. Discuss why `sklearn`'s NMF might underperform compared to specialized or simpler methods.

# Section 1: Load the movie ratings data


## 1. Reading MovieLens `.dat` Files
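We'll load `ratings.dat` into a DataFrame using `pandas.read_csv` with `sep='::'` and `engine='python'`; the Python engine is required because the default C parser does not support multi-character separators.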


In [25]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import NMF
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Path to ratings.dat (adjust if needed)
ratings_path = './movielens/ratings.dat'

# Define column names
ratings_cols = ['userId', 'movieId', 'rating', 'timestamp']

# Load ratings.dat
ratings_df = pd.read_csv(
    ratings_path,
    sep='::',
    engine='python',  # needed for '::' delimiter
    header=None,
    names=ratings_cols
)

print("Ratings shape:", ratings_df.shape)
ratings_df.head(10)

Ratings shape: (1000209, 4)


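| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |
| 5 | 1 | 1197 | 3 | 978302268 |
| 6 | 1 | 1287 | 5 | 978302039 |
| 7 | 1 | 2804 | 5 | 978300719 |
| 8 | 1 | 594 | 4 | 978302268 |
| 9 | 1 | 919 | 4 | 978301368 |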


## 2. Split into Train & Test

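We make a random 80/20 split over individual ratings (rows of `ratings_df`) using `train_test_split(..., test_size=0.2, random_state=42)`. Because the split is by rating rather than by user, a user or movie can occasionally appear in the test set without appearing in the training set.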

In [26]:
# Random 80/20 split
train_df, test_df = train_test_split(
    ratings_df,
    test_size=0.2,
    random_state=42
)

print("Train set:", train_df.shape)
print("Test set:", test_df.shape)
train_df.head()

Train set: (800167, 4)
Test set: (200042, 4)


| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 416292 | 2507 | 3035 | 2 | 974076680 |
| 683230 | 4087 | 2840 | 4 | 965431652 |
| 2434 | 19 | 457 | 3 | 978146863 |
| 688533 | 4118 | 2804 | 4 | 965804599 |
| 472584 | 2907 | 805 | 4 | 971838472 |


## 3. Create User-Item Matrix (Training)

We'll pivot the training set so that:
- Rows = `userId`
- Columns = `movieId`
- Values = `rating`

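A cell is `NaN` whenever the training set holds no rating for that user-movie pair. This covers movies the user never rated as well as ratings that were held out for the test set.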

In [27]:
# Pivot the training data
train_matrix = train_df.pivot_table(
    index='userId',
    columns='movieId',
    values='rating'
)  # shape: (#users x #movies)

print("Training matrix shape:", train_matrix.shape)
train_matrix.head(5)

Training matrix shape: (6040, 3683)


| userId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 3943 | 3944 | 3945 | 3946 | 3947 | 3948 | 3949 | 3950 | 3951 | 3952 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

(Columns are `movieId` values; rows are `userId` values.)
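
To put a number on how sparse such a matrix is, the fraction of non-`NaN` cells can be computed directly. The sketch below uses a small hypothetical matrix (`toy_matrix` and its values are invented for illustration). For the real data, assuming each (user, movie) pair occurs at most once, the same calculation gives 800167 / (6040 × 3683) ≈ 3.6% of cells observed.

```python
import numpy as np

# Hypothetical 3 users x 3 movies matrix; NaN marks "no rating in training"
toy_matrix = np.array([
    [5.0, 3.0, np.nan],
    [4.0, np.nan, np.nan],
    [np.nan, np.nan, 2.0],
])

n_observed = int(np.count_nonzero(~np.isnan(toy_matrix)))  # cells with a rating
n_cells = toy_matrix.size                                  # users * movies
density = n_observed / n_cells

print(f"{n_observed} of {n_cells} cells observed ({density:.1%})")
```

On the notebook's `train_matrix`, `train_matrix.notna().to_numpy().mean()` should give the same fraction.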


## 4. Apply NMF to Predict Missing Ratings

### 4.1 Fill `NaN` with 0
`sklearn`'s NMF rejects input that contains `NaN`, so we replace missing entries with `0`. This is a crude assumption: a missing rating is unknown, not a rating of zero, and 0 even lies below the 1-5 star scale used by MovieLens.
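
As a small illustration of why "missing" and "zero" are not interchangeable, the sketch below compares one user's average rating with and without zero-filling. The ratings in `user_ratings` are hypothetical.

```python
import numpy as np

# Hypothetical ratings of one user over four movies; NaN = not rated
user_ratings = np.array([5.0, np.nan, 4.0, np.nan])

# Average over the movies the user actually rated
mean_observed = np.nanmean(user_ratings)

# Average after replacing missing entries with 0, as fillna(0) does
mean_zero_filled = np.nan_to_num(user_ratings, nan=0.0).mean()

print("Mean of observed ratings:", mean_observed)
print("Mean after zero-filling: ", mean_zero_filled)
```

The zero-filled average is halved here because half of the entries are missing; in a matrix where most cells are missing, the distortion is correspondingly larger.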

### 4.2 Fit NMF
We choose the number of latent factors (`n_factors = 15` here) and raise `max_iter` from the default of 200 to 500 to reduce the risk of a `ConvergenceWarning`.

In [28]:
# Fill missing with 0
train_matrix_filled = train_matrix.fillna(0)
R = train_matrix_filled.values  # (#users x #movies)

print("Shape of training matrix R:", R.shape)

# Number of latent factors
n_factors = 15

nmf_model = NMF(
    n_components=n_factors,
    init='random',
    random_state=42,
    max_iter=500,  # Increased from 200
    tol=1e-4       # Default is 1e-4, but could adjust if needed
)

# Fit
U = nmf_model.fit_transform(R)  # shape: (#users, n_factors)
V = nmf_model.components_       # shape: (n_factors, #movies)

print("U shape:", U.shape)
print("V shape:", V.shape)

# Reconstruct rating matrix
R_pred = np.dot(U, V)
print("R_pred shape:", R_pred.shape)

Shape of training matrix R: (6040, 3683)
U shape: (6040, 15)
V shape: (15, 3683)
R_pred shape: (6040, 3683)


## 5. Evaluate on Test Set

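For each `(userId, movieId, rating)` in `test_df`:
1. If both the `userId` and the `movieId` occur in the training matrix (as a row and a column, respectively), use the corresponding `R_pred[userIndex, movieIndex]`.
2. Otherwise, fall back to the global mean rating of the training set (instead of 0).
3. Collect all predictions and compute the RMSE over the whole test set.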


In [29]:
# Build user->index and movie->index mappings
user_index_map = {uid: idx for idx, uid in enumerate(train_matrix.index)}
movie_index_map = {mid: idx for idx, mid in enumerate(train_matrix.columns)}

# Simple global mean fallback
global_mean = train_df['rating'].mean()
print("Global mean rating:", global_mean)

y_true = []
y_pred = []

for row in test_df.itertuples():
    uid = row.userId
    mid = row.movieId
    actual = row.rating

    if (uid in user_index_map) and (mid in movie_index_map):
        u_idx = user_index_map[uid]
        m_idx = movie_index_map[mid]
        pred_rating = R_pred[u_idx, m_idx]
    else:
        # Fallback: unknown user or movie => global mean
        pred_rating = global_mean

    y_true.append(actual)
    y_pred.append(pred_rating)

rmse_nmf = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"Test RMSE (NMF, n_factors={n_factors}): {rmse_nmf:.4f}")

Global mean rating: 3.5813473937315585
Test RMSE (NMF, n_factors=15): 2.7668
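
The per-row loop above can also be written with array lookups. The following is a sketch on hypothetical toy stand-ins (`toy_train_matrix`, `toy_R_pred`, `toy_test`, and `toy_global_mean` are invented for illustration). It relies on `pandas.Index.get_indexer`, which returns `-1` for labels that are not in the index.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for train_matrix, R_pred, test_df and global_mean
toy_train_matrix = pd.DataFrame(
    [[5.0, np.nan], [np.nan, 3.0]],
    index=pd.Index([1, 2], name="userId"),
    columns=pd.Index([10, 20], name="movieId"),
)
toy_R_pred = np.array([[4.8, 2.0], [3.5, 3.1]])
toy_test = pd.DataFrame(
    {"userId": [1, 2, 3], "movieId": [20, 10, 10], "rating": [2, 4, 5]}
)
toy_global_mean = 4.0

# Positions of each test user/movie in the training matrix (-1 if unseen)
u_idx = toy_train_matrix.index.get_indexer(toy_test["userId"].to_numpy())
m_idx = toy_train_matrix.columns.get_indexer(toy_test["movieId"].to_numpy())
known = (u_idx >= 0) & (m_idx >= 0)

# Start from the fallback, then overwrite the pairs the model can score
toy_pred = np.full(len(toy_test), toy_global_mean)
toy_pred[known] = toy_R_pred[u_idx[known], m_idx[known]]

toy_rmse = np.sqrt(np.mean((toy_test["rating"].to_numpy() - toy_pred) ** 2))
print("Predictions:", toy_pred, "RMSE:", round(float(toy_rmse), 4))
```

User 3 does not occur in the toy training matrix, so that row receives the fallback value.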


### 5.1 Compare with a Global Mean Baseline
We can see if NMF outperforms simply predicting the global mean for every user-item pair. If NMF can't beat this baseline, it indicates room for improvement (e.g., better hyperparameters, masked factorization, etc.).

In [30]:
# Evaluate a simple global mean baseline: predict global_mean for every test rating
y_true_base = test_df['rating'].to_numpy()
y_pred_base = np.full(len(test_df), global_mean)

rmse_base = np.sqrt(mean_squared_error(y_true_base, y_pred_base))
print(f"Test RMSE (Global Mean Baseline): {rmse_base:.4f}")

# Quick comparison
print("\n--- Comparison ---")
print(f"NMF RMSE: {rmse_nmf:.4f}")
print(f"Baseline RMSE: {rmse_base:.4f}")

Test RMSE (Global Mean Baseline): 1.1197

--- Comparison ---
NMF RMSE: 2.7668
Baseline RMSE: 1.1197


# Section 2: Discuss the results and why they did not work well
## 6. Discussion

### 6.1 Results
- On this split, **NMF** reaches a test RMSE of 2.7668, while the **Global Mean Baseline** reaches 1.1197.
- A lower `rmse_nmf` would mean NMF beats the naive approach. Here it is more than twice the baseline error, which points to a flaw in the setup itself rather than to a tuning problem.
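
For scale, it helps to estimate the RMSE of a model that predicts exactly 0 for every test rating. The sketch below uses the identity mean(r²) = var(r) + mean(r)² with the numbers printed above. It assumes the test ratings have roughly the same mean as the training ratings, so that the baseline RMSE approximates their standard deviation; the result is a back-of-the-envelope figure, not a measured one.

```python
import math

global_mean = 3.5813  # training global mean printed above
rmse_base = 1.1197    # global-mean baseline RMSE printed above
rmse_nmf = 2.7668     # NMF RMSE printed above

# RMSE of predicting 0 everywhere: sqrt(variance + mean^2)
rmse_all_zero = math.sqrt(rmse_base ** 2 + global_mean ** 2)

print(f"Estimated RMSE when predicting 0 everywhere: {rmse_all_zero:.2f}")
print(f"Baseline < NMF < all-zero: {rmse_base < rmse_nmf < rmse_all_zero}")
```

The NMF error lands between the baseline and the all-zero estimate, which is consistent with predictions that are pulled strongly toward 0.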

### 6.2 Why sklearn's NMF might not perform well
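1. **NaN → 0**: We treat missing ratings as 0, although a missing rating is unknown rather than zero. Every held-out test rating is one of those zeros in `R`, so the model is trained to predict values near 0 for exactly the cells we evaluate.
2. **No user/item biases**: Many CF methods model separate user-level and item-level offsets (some users rate generously, some movies are widely liked); plain NMF does not.
3. **Sparse data**: Real-world rating matrices are very sparse, so the filled-in zeros vastly outnumber the real ratings and dominate the reconstruction loss.
4. **No built-in masking**: `sklearn`'s NMF offers no way to exclude entries from the loss, so it fits all the 0's as if they were actual data.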
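
The effect of points 1 and 4 can be reproduced on synthetic data. In the sketch below, every true rating is 4, but only about 10% of the cells are observed and the rest are zero-filled. The data, the 10% rate, and all `toy` names are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Hypothetical data: the true rating is 4 everywhere, ~10% of cells observed
true_rating = 4.0
observed = rng.random((200, 100)) < 0.10
R_toy = np.where(observed, true_rating, 0.0)  # zero-filled, as in section 4.1

# One latent factor is enough to represent a constant matrix exactly
toy_model = NMF(n_components=1, init="random", random_state=0, max_iter=500)
U_toy = toy_model.fit_transform(R_toy)
R_toy_pred = U_toy @ toy_model.components_

# Average prediction on cells the model never saw a rating for
mean_pred_unobserved = R_toy_pred[~observed].mean()
print(f"True rating: {true_rating}, "
      f"mean prediction on unobserved cells: {mean_pred_unobserved:.2f}")
```

Although a rank-1 model could represent the true matrix perfectly, the fit is dominated by the zeros, so predictions for unobserved cells end up far below 4.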

### 6.3 Possible Fixes
1. **Specialized CF libraries**: e.g., `surprise`, `lightfm`, or other frameworks that properly handle missing data.
2. **Masked factorization**: Only include known ratings in the cost function.
3. **Add user/item biases**: A typical improvement in CF.
4. **Hyperparameter tuning**: Vary `n_factors`, `max_iter`, or introduce regularization.
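
Fixes 2 and 3 can be combined in a few lines of NumPy. The following is a minimal sketch, not a tuned implementation: a biased matrix factorization trained by stochastic gradient descent over the observed ratings only. Note that it is not non-negative, so it is a different technique from NMF. The helper names (`fit_masked_mf`, `predict`), the hyperparameter values, and the toy ratings are all hypothetical.

```python
import numpy as np

def fit_masked_mf(rows, cols, vals, n_users, n_items,
                  n_factors=2, lr=0.01, reg=0.02, n_epochs=300, seed=0):
    """Biased matrix factorization via SGD, using observed ratings only."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, n_factors))  # user factors
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))  # item factors
    b_u = np.zeros(n_users)                               # user biases
    b_i = np.zeros(n_items)                               # item biases
    mu = vals.mean()                                      # global mean
    for _ in range(n_epochs):
        # Visit only the observed (user, item, rating) triples, in random order
        for k in rng.permutation(len(vals)):
            u, i = rows[k], cols[k]
            err = vals[k] - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            p_u = P[u].copy()  # keep the old user factors for the item update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_u - reg * Q[i])
    return mu, b_u, b_i, P, Q

def predict(model, u, i):
    """Predicted rating for user index u and item index i."""
    mu, b_u, b_i, P, Q = model
    return mu + b_u[u] + b_i[i] + P[u] @ Q[i]

# Hypothetical toy data: 7 observed ratings among 4 users x 3 movies
rows = np.array([0, 0, 1, 1, 2, 2, 3])
cols = np.array([0, 1, 0, 2, 1, 2, 0])
vals = np.array([5.0, 4.0, 4.0, 2.0, 5.0, 3.0, 3.0])

model = fit_masked_mf(rows, cols, vals, n_users=4, n_items=3)
train_pred = np.array([predict(model, u, i) for u, i in zip(rows, cols)])
train_rmse = np.sqrt(np.mean((vals - train_pred) ** 2))
mean_rmse = np.sqrt(np.mean((vals - vals.mean()) ** 2))
unseen_pred = predict(model, 3, 1)  # user 3 never rated movie 1

print(f"Train RMSE: {train_rmse:.3f} (global mean: {mean_rmse:.3f})")
print(f"Prediction for an unseen pair: {unseen_pred:.2f}")
```

Because unobserved cells never enter the loss, nothing pushes their predictions toward 0; they start from the global mean and are adjusted by the learned biases and factors. Applying this to the MovieLens split would mean passing the index-mapped `userId`, `movieId`, and `rating` columns of `train_df` as `rows`, `cols`, and `vals`.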

### 6.4 Conclusion
While `sklearn`'s NMF is a convenient demonstration of matrix factorization, it treats every zero-filled cell as a real rating. On large, sparse rating data it can therefore underperform specialized techniques and even simple baselines, as it does here. For real-world collaborative filtering, prefer a method that fits only the observed ratings, whether through a masked factorization or a dedicated library.