# Part 2: Matrix Factorization HW3-Recommender-System

### GitHub Repository

https://github.com/tseidel0509/Unsupervised_ML/blob/main/Matrix_Factorization_RecommenderSystem.ipynb

In [15]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import NMF
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from numpy import sqrt

## Load the Movie Ratings Data as in the Homework

In [2]:
MV_users = pd.read_csv('/Users/timseidel/Documents/Graduate_Certificate/MovieLens/MV_users.csv')
MV_movies = pd.read_csv('/Users/timseidel/Documents/Graduate_Certificate/MovieLens/MV_movies.csv')
train = pd.read_csv('/Users/timseidel/Documents/Graduate_Certificate/MovieLens/train.csv')
test = pd.read_csv('/Users/timseidel/Documents/Graduate_Certificate/MovieLens/test.csv')

In [3]:
from collections import namedtuple
Data = namedtuple('Data', ['users','movies','train','test'])
data = Data(MV_users, MV_movies, train, test)

## 1. Matrix Factorization to Predict Missing Ratings

#### Step 1: Merge Datasets on uID and mID to Create Feature-Rich Train and Test Sets

In [4]:
train_merged = train.merge(MV_users, on='uID', how='left')
test_merged = test.merge(MV_users, on='uID', how='left')

train_merged = train_merged.merge(MV_movies, on='mID', how='left')
test_merged = test_merged.merge(MV_movies, on='mID', how='left')

train_merged

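| | uID | mID | rating | gender | age | accupation | zip | title | year | Doc | ... | Chi | Cri | Thr | Sci | Mys | Rom | Fil | Fan | Act | Mus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 744 | 1210 | 5 | M | 25 | 17 | 77007 | Star Wars: Episode VI - Return of the Jedi | 1983 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 3040 | 1584 | 4 | M | 25 | 8 | 22046 | Contact | 1997 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1451 | 1293 | 5 | M | 35 | 20 | 90012 | Gandhi | 1982 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 5455 | 3176 | 2 | F | 18 | 17 | 55449 | Talented Mr. Ripley, The | 1999 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2507 | 3074 | 5 | M | 25 | 4 | 94107 | Jeremiah Johnson | 1972 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 700141 | 1184 | 2916 | 3 | F | 25 | 4 | 92354 | Total Recall | 1990 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 700142 | 137 | 1372 | 5 | F | 45 | 6 | 78758 | Star Trek VI: The Undiscovered Country | 1991 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 700143 | 195 | 2514 | 3 | M | 25 | 12 | 10458 | Pet Sematary II | 1992 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 700144 | 1676 | 2566 | 3 | M | 18 | 4 | 91301 | Doug's 1st Movie | 1999 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |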
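
A left merge keeps every rating row even when no matching user or movie exists, in which case the added columns are NaN. The sketch below shows one way to check for that on toy data. The frames `ratings` and `users` and all their values are made up for illustration; `validate` and `indicator` are standard `DataFrame.merge` options.

```python
import pandas as pd

# Toy stand-ins for the ratings and user tables (values are made up).
ratings = pd.DataFrame({"uID": [1, 2, 3], "mID": [10, 10, 20], "rating": [5, 3, 4]})
users = pd.DataFrame({"uID": [1, 2], "gender": ["M", "F"], "age": [25, 35]})

# validate="many_to_one" raises an error if uID is duplicated in `users`;
# indicator=True adds a `_merge` column showing where each row found a match.
merged = ratings.merge(
    users, on="uID", how="left", validate="many_to_one", indicator=True
)

# Rows that exist only on the left side had no matching user.
unmatched = merged[merged["_merge"] == "left_only"]
print(f"Ratings without a matching user: {len(unmatched)}")
print(f"Rows with NaN user features: {merged['gender'].isna().sum()}")
```

If the unmatched count is zero for both merges, the later `np.nan_to_num` call has nothing to fill in from this step.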


In [5]:
print(train_merged.dtypes)

uID            int64
mID            int64
rating         int64
gender        object
age            int64
accupation     int64
zip           object
title         object
year           int64
Doc            int64
Com            int64
Hor            int64
Adv            int64
Wes            int64
Dra            int64
Ani            int64
War            int64
Chi            int64
Cri            int64
Thr            int64
Sci            int64
Mys            int64
Rom            int64
Fil            int64
Fan            int64
Act            int64
Mus            int64
dtype: object


In [6]:
missing_ratings = test['rating'].isna().sum()
print(f"Number of rows in test dataset missing a rating: {missing_ratings}")

Number of rows in test dataset missing a rating: 0


The test dataset has no rows with a missing rating. I therefore assume that Part 2 asks us to predict the ratings in the test data and compare them against the ground truth, which is what I do in the following section.

#### Step 2: Encode Categorical Values

In [7]:
categorical_features = ["gender"]
encoder = OneHotEncoder(sparse_output = False, handle_unknown="ignore")

encoded_train = encoder.fit_transform(train_merged[categorical_features])
encoded_test = encoder.transform(test_merged[categorical_features])

encoded_train_df = pd.DataFrame(encoded_train, columns=encoder.get_feature_names_out(categorical_features))
encoded_test_df = pd.DataFrame(encoded_test, columns=encoder.get_feature_names_out(categorical_features))
                                      

#### Step 3: Create Feature Training/Testing Matrices

In [8]:
train_merged = train_merged.drop(columns = categorical_features)
test_merged = test_merged.drop(columns = categorical_features)

X_train = pd.concat([train_merged.drop(columns=['uID', 'mID', 'rating', 'zip', 'title']).reset_index(drop=True), encoded_train_df], axis=1)
X_test = pd.concat([test_merged.drop(columns=['uID', 'mID', 'rating', 'zip', 'title']).reset_index(drop=True), encoded_test_df], axis=1)

y_train = train_merged["rating"].values
y_test = test_merged["rating"].values


#### Step 4: Apply Matrix Factorization and Predict on Test

In [16]:
X_train_filled = np.nan_to_num(X_train, nan=0.0)
X_test_filled = np.nan_to_num(X_test, nan=0.0)

# Apply NMF
nmf = NMF(n_components=23, random_state=42, init='nndsvda', max_iter=200)
X_train_nmf = nmf.fit_transform(X_train_filled)
X_test_nmf = nmf.transform(X_test_filled)

# Train and predict with linear regression
regressor = LinearRegression()
regressor.fit(X_train_nmf, y_train)

y_pred = regressor.predict(X_test_nmf)
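
Note that `n_components=23` appears to equal the number of columns in `X_train` (21 numeric columns plus two gender indicators), so NMF re-expresses the features here rather than compressing them. The sketch below shows how one might compare several values of `n_components` with the same NMF-plus-regression pipeline. It runs on synthetic data, not on MovieLens, and the feature matrix, target, and split are all assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic non-negative features and a noisy linear target (not MovieLens data).
X = rng.random((300, 6))
y = X @ np.array([1.0, 0.5, 0.0, 2.0, 0.0, 1.5]) + rng.normal(0, 0.1, 300)
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

results = {}
for k in (2, 4, 6):
    # Factorize the features, then regress the target on the latent factors.
    nmf_k = NMF(n_components=k, init="nndsvda", random_state=0, max_iter=500)
    Z_tr = nmf_k.fit_transform(X_tr)
    Z_te = nmf_k.transform(X_te)
    reg = LinearRegression().fit(Z_tr, y_tr)
    results[k] = float(np.sqrt(mean_squared_error(y_te, reg.predict(Z_te))))

for k, rmse_k in results.items():
    print(f"n_components={k}: test RMSE = {rmse_k:.4f}")
```

On the real data the comparison should use a validation split taken from `train`, so that the test set is only used once.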

#### Step 5: Evaluate RMSE

In [17]:
rmse = sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE on the Test Data: {rmse:.4f}")

RMSE on the Test Data: 1.2292
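
For reference, RMSE is the square root of the mean squared difference between true and predicted ratings. The sketch below computes it by hand on made-up numbers and also shows the effect of clipping predictions to the rating scale. It assumes ratings lie between 1 and 5; a linear regression can predict values outside that range, and clipping to a range that contains every true rating can never increase the squared error.

```python
import numpy as np

# Made-up ratings and raw regression outputs, for illustration only.
y_true = np.array([5, 4, 3, 2, 1], dtype=float)
y_raw = np.array([5.8, 4.1, 2.5, 2.2, 0.4])


def rmse_manual(actual, predicted):
    """Root mean squared error, written out explicitly."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Assumption: the valid rating scale is 1 to 5.
y_clipped = np.clip(y_raw, 1.0, 5.0)

rmse_raw = rmse_manual(y_true, y_raw)
rmse_clipped = rmse_manual(y_true, y_clipped)
print(f"RMSE raw:     {rmse_raw:.4f}")
print(f"RMSE clipped: {rmse_clipped:.4f}")
```

Applied to the notebook, this would correspond to evaluating `np.clip(y_pred, 1, 5)` instead of `y_pred`.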


## 2. Discussion
My HW3 results (RMSE) are the following:
- Baseline, 𝑌𝑝=3: 1.2585510334053043
- Baseline, 𝑌𝑝=𝜇𝑢: 1.0352910334228647
- Content based, item-item: 1.0128116783754684
- Collaborative, cosine: 1.0263081874204125
- Collaborative, jaccard, 𝑀𝑟≥3: 0.9819058692126349
- Collaborative, jaccard, 𝑀𝑟≥1: 0.991363571262366
- Collaborative, jaccard, 𝑀: 0.991363571262366

The NMF-based model therefore has a higher test RMSE (1.2292) than every HW3 model except the constant baseline 𝑌𝑝=3 (1.2586). One likely reason is that the HW3 recommenders use cosine or Jaccard similarity computed from user-item interactions, and these interactions are more predictive of ratings than user demographics (age, gender, occupation) and movie attributes (year, genre) alone.
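
To make the contrast concrete, the sketch below factorizes a user-by-movie rating matrix instead of a feature matrix, which is the form of matrix factorization that uses user-item interactions directly. The `toy` frame and its values are made up, and filling unrated pairs with 0 is a simplifying assumption of this sketch, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import NMF

# Toy ratings in the same long format as train.csv (values are made up).
toy = pd.DataFrame({
    "uID":    [1, 1, 2, 2, 3, 3, 4],
    "mID":    [10, 20, 10, 30, 20, 30, 10],
    "rating": [5, 3, 4, 2, 3, 1, 5],
})

# Rows are users, columns are movies; unrated pairs become 0 (an assumption).
R = toy.pivot_table(index="uID", columns="mID", values="rating").fillna(0.0)

nmf_toy = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
W = nmf_toy.fit_transform(R.to_numpy())  # user factors, shape (n_users, 2)
H = nmf_toy.components_                  # movie factors, shape (2, n_movies)

# The product W @ H gives a predicted rating for every user-movie pair.
R_hat = pd.DataFrame(W @ H, index=R.index, columns=R.columns)
print(R_hat.round(2))

# Predicted rating for a pair absent from the toy data: user 4, movie 20.
print(f"Prediction for user 4, movie 20: {R_hat.loc[4, 20]:.2f}")
```

On the full dataset the pivoted matrix is large and mostly empty, so a sparse representation would likely be needed.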

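Unknown values are replaced with zeros (via `np.nan_to_num`) so that the matrix stays non-negative, as NMF requires. Because NMF then treats these zeros as real observations rather than as missing entries, this may be another reason for the poorer result.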
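
One common alternative to zero-filling is to fit the factors on the observed entries only. The sketch below does this with a plain (not non-negative) matrix factorization trained by gradient descent, which is a different technique from scikit-learn's NMF. The matrix `R`, the rank `k`, the learning rate `lr`, and the regularization strength `reg` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy rating matrix; 0 marks "unknown", not a real rating.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)
mask = R > 0  # True where a rating was actually observed

k, lr, reg = 2, 0.005, 0.02
P = rng.random((R.shape[0], k))  # user factors
Q = rng.random((R.shape[1], k))  # movie factors


def observed_rmse(P, Q):
    """RMSE computed over observed entries only."""
    err = (R - P @ Q.T)[mask]
    return float(np.sqrt(np.mean(err ** 2)))


rmse_before = observed_rmse(P, Q)
for _ in range(5000):
    # Errors on unknown entries are set to 0 so they do not affect the update.
    E = np.where(mask, R - P @ Q.T, 0.0)
    P_step = lr * (E @ Q - reg * P)
    Q_step = lr * (E.T @ P - reg * Q)
    P += P_step
    Q += Q_step
rmse_after = observed_rmse(P, Q)

print(f"Observed-entry RMSE before: {rmse_before:.4f}, after: {rmse_after:.4f}")
print(np.round(P @ Q.T, 2))  # includes predictions for the unknown entries
```

Because unknown entries never enter the loss, the predictions for them are not pulled toward zero.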

#### Ways to improve the model
The following steps could improve the performance of the model:
- Feature Engineering: extract decades from years or merge similar zip codes
- Better Preprocessing: normalize age and year with StandardScaler (or MinMaxScaler if NMF is kept, since NMF requires non-negative input), drop or encode rare values such as uncommon zip codes, and check for and handle outliers
- Dimensionality Reduction: use PCA to reduce the number of dimensions
- Use a more expressive model such as RandomForestRegressor
- Hyperparameter Tuning: apply GridSearchCV or RandomizedSearchCV to find better hyperparameters
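
Several of these ideas can be combined in a few lines. The sketch below derives a decade feature, fits a `RandomForestRegressor`, and tunes it with `RandomizedSearchCV`. It runs on a small synthetic frame, not on MovieLens, so the column values, the target, and the search space are assumptions chosen only to keep the example fast.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix (not MovieLens data).
n = 300
X = pd.DataFrame({
    "age": rng.choice([18, 25, 35, 45, 50, 56], size=n),
    "year": rng.integers(1930, 2001, size=n),
    "Com": rng.integers(0, 2, size=n),
})
# Feature engineering: map each release year to its decade, e.g. 1983 -> 1980.
X["decade"] = (X["year"] // 10) * 10

# Synthetic target kept inside the 1 to 5 rating range.
y = np.clip(3 + 0.5 * X["Com"] + rng.normal(0, 1, n), 1, 5)

# Hypothetical search space; real ranges would depend on the data size.
param_distributions = {
    "n_estimators": [20, 50],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 20],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(f"Cross-validated RMSE: {-search.best_score_:.4f}")
```

The scorer returns the negated RMSE, so the sign is flipped before printing.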