# Improved SVD

This notebook applies the improved SVD method for matrix completion to the CIL dataset, using the `SVD` implementation from the Surprise package.

Run the sections "Preprocess", "Run Improved SVD model" and "Create submission file" in full and in that order to produce the results. Run the "Download data from Kaggle" code block only when working on Colab.

## Download data from Kaggle

In [None]:
!pip install kaggle

!mkdir -p ~/.kaggle

import json

# enter your own Kaggle credentials here; never commit a real API key to a shared notebook
kaggle_username = "" #@param {type:"string"}
kaggle_api_key = "" #@param {type:"string"}

assert len(kaggle_username) > 0 and len(kaggle_api_key) > 0

api_token = {"username": kaggle_username,"key": kaggle_api_key}

with open('kaggle.json', 'w') as file:
    json.dump(api_token, file)

!mv kaggle.json ~/.kaggle/kaggle.json

!chmod 600 ~/.kaggle/kaggle.json
!kaggle competitions download -c cil-collaborative-filtering-2022

!unzip -n cil-collaborative-filtering-2022.zip
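
To keep credentials out of the notebook itself, the token file can be built from environment variables instead. The following is a minimal sketch under assumptions: `write_kaggle_token` is a hypothetical helper, and `KAGGLE_USERNAME` and `KAGGLE_KEY` (the variable names the Kaggle CLI also reads) are assumed to be set in the session already.

```python
import json
import os
import stat
from pathlib import Path


def write_kaggle_token(config_dir: Path) -> Path:
    """Write kaggle.json into config_dir from environment variables (hypothetical helper)."""
    username = os.environ.get("KAGGLE_USERNAME", "")
    api_key = os.environ.get("KAGGLE_KEY", "")
    if not username or not api_key:
        raise RuntimeError("Set KAGGLE_USERNAME and KAGGLE_KEY before running this cell")
    config_dir.mkdir(parents=True, exist_ok=True)
    token_path = config_dir / "kaggle.json"
    token_path.write_text(json.dumps({"username": username, "key": api_key}))
    # restrict the file to its owner, matching `chmod 600`
    token_path.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

Calling `write_kaggle_token(Path.home() / ".kaggle")` would then stand in for the `json.dump`, `mv` and `chmod` steps of the cell above.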

## Preprocess

### Install Surprise package

In [None]:
!pip install surprise

### Imports

In [None]:
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
from typing import Tuple
from IPython.display import display

### Data parsing helper function declarations

In [None]:
def parse_csv(csv_path: str) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
  """
  Extract the arrays of user indices, item indices and ratings listed in a .csv file.

  :param csv_path: path to the .csv file to read from
  :return: 3 arrays containing the zero-based user indices, the zero-based item indices and the observed ratings, in that order
  """
  df = pd.read_csv(csv_path)
  # extract user and item indices from the Id label in the dataframe
  df = df.join(df.Id.str.extract(r"r(?P<User>\d+)_c(?P<Item>\d+)").astype(int) - 1)
  # extract user, item and prediction triplets from dataframe
  users = df.User.values
  items = df.Item.values
  preds = df.Prediction.values
  return users, items, preds
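
As a quick check of the Id convention that `parse_csv` relies on, the sketch below applies the same regular expression to single strings using only the standard library. `parse_id` is a hypothetical helper name, not part of the notebook.

```python
import re
from typing import Tuple

# same pattern as in parse_csv: "r<row>_c<column>", both one-based in the file
ID_PATTERN = re.compile(r"r(?P<User>\d+)_c(?P<Item>\d+)")


def parse_id(entry_id: str) -> Tuple[int, int]:
    """Convert an Id such as 'r44_c1' to zero-based (user, item) indices."""
    match = ID_PATTERN.fullmatch(entry_id)
    if match is None:
        raise ValueError(f"unexpected Id format: {entry_id!r}")
    return int(match["User"]) - 1, int(match["Item"]) - 1


print(parse_id("r44_c1"))        # (43, 0)
print(parse_id("r10000_c1000"))  # (9999, 999)
```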

## Run Improved SVD model

Create the Dataset object with the data from data_train.csv.

In [None]:
# construct data in correct format
users, items, preds = parse_csv("data_train.csv")
ratings_dict = {'itemID': items, 'userID': users, 'rating': preds}
df = pd.DataFrame(ratings_dict)
# Defining a rating scale of [0, 5.5] gives a better result than a rating scale of [1, 5]
reader = Reader(rating_scale=(0, 5.5))
data = Dataset.load_from_df(df[['userID', 'itemID', 'rating']], reader)
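
The rating scale matters at prediction time because Surprise clips each estimate into the scale by default (`predict(..., clip=True)`). The sketch below imitates that clipping in plain Python to show what each scale does to out-of-range estimates. `clip_to_scale` is a hypothetical helper and the estimates are made-up numbers, not model output, so this illustrates the mechanism only and does not explain the score difference noted in the comment above.

```python
from typing import Tuple


def clip_to_scale(estimate: float, scale: Tuple[float, float]) -> float:
    """Clamp an estimate into [lower, upper] (hypothetical helper)."""
    lower, upper = scale
    return min(max(estimate, lower), upper)


raw_estimates = [0.4, 3.2, 5.3]  # made-up values for illustration
for est in raw_estimates:
    narrow = clip_to_scale(est, (1.0, 5.0))
    wide = clip_to_scale(est, (0.0, 5.5))
    print(f"raw={est}  scale [1, 5] -> {narrow}  scale [0, 5.5] -> {wide}")
```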

Declare the parameter grid, run a grid search over it, and save the parameters that give the lowest RMSE. With `cv=10`, the grid search scores every parameter combination by 10-fold cross-validation: the data is split into ten folds, and each fold in turn is held out for validation while the model is fit on the other nine.
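
Because the grid search below is called with `cv=10`, each parameter combination is scored by k-fold cross-validation. The following sketch shows the splitting idea in plain Python. `k_fold_indices` is a hypothetical helper, and unlike Surprise's own splitter it does not shuffle the data first.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, n_splits: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, validation) index lists; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # spread the remainder over the first folds so fold sizes differ by at most one
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation
        start = stop


for train, validation in k_fold_indices(n_samples=25, n_splits=10):
    print(len(train), len(validation))
```

The reported RMSE for a parameter combination is then the mean of the validation scores over the folds.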

In [None]:
# declare all parameters and their values to test in the grid search
param_grid = {
    'n_factors': list(range(2, 21, 2)),
    'n_epochs': list(range(25, 101, 25)),
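    # np.linspace fixes the number of grid points; np.arange with a float step can gain or lose the endpoint
    'lr_all': list(np.round(np.linspace(0.001, 0.010, 10), 3)),
    'reg_all': list(np.round(np.linspace(0.0, 1.0, 11), 1))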
}
# run grid search
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=10, n_jobs=-1, joblib_verbose=2)
gs.fit(data)
# save grid search results to dataframe for future reference
results_df = pd.DataFrame.from_dict(gs.cv_results)
results_df.to_csv("surprise_svd_grid_search_results.csv", index=False)
# report and save the best parameters
best_params = gs.best_params['rmse']
print(f"The minimum RMSE score is {gs.best_score['rmse']}")
print(f"The parameters which give the best RMSE score are: {best_params}")
n_factors = best_params['n_factors']
n_epochs = best_params['n_epochs']
lr_all = best_params['lr_all']
reg_all = best_params['reg_all']
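
Since the grid search fits one model per parameter combination and fold, its cost can be estimated before launching it. The sketch below counts the fits for the grid above using only the standard library. It assumes the learning-rate and regularization grids contain the intended 10 and 11 evenly spaced values.

```python
from itertools import product

# value lists mirror param_grid; the float grids are rebuilt from integers (an assumption)
n_factors_values = list(range(2, 21, 2))          # 2, 4, ..., 20
n_epochs_values = list(range(25, 101, 25))        # 25, 50, 75, 100
lr_all_values = [k / 1000 for k in range(1, 11)]  # 0.001, ..., 0.010
reg_all_values = [k / 10 for k in range(0, 11)]   # 0.0, ..., 1.0

combinations = list(product(n_factors_values, n_epochs_values, lr_all_values, reg_all_values))
n_folds = 10

print(len(combinations))            # 4400
print(len(combinations) * n_folds)  # 44000
```

With tens of thousands of model fits, a coarser grid or a randomized search may be worth considering when compute time is limited.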

In [None]:
print("Table with the results of the parameter tuning:")
display(results_df)

Now fit the model on the entire training set, using the best parameters found by the grid search.

In [None]:
# for the submission the full set is set as the trainset
trainset = data.build_full_trainset()
# init SVD with the best params found in the param tuning
algo = SVD(n_factors=n_factors, n_epochs=n_epochs, lr_all=lr_all, reg_all=reg_all, random_state=1234)
# Train the algorithm on the trainset
algo.fit(trainset)
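
For reference, the Surprise documentation describes the `SVD` prediction as `r_hat = mu + b_u + b_i + dot(q_i, p_u)`, trained by stochastic gradient descent with learning rate `lr_all` and regularization `reg_all`. The sketch below performs one such update for a single rating in plain Python. All numbers are made up and the function names are hypothetical, so it illustrates the update rule rather than reproducing the library's code.

```python
from typing import List, Tuple


def predict(mu: float, b_u: float, b_i: float, p_u: List[float], q_i: List[float]) -> float:
    """Global mean + user bias + item bias + dot product of the latent factors."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))


def sgd_step(r_ui: float, mu: float, b_u: float, b_i: float,
             p_u: List[float], q_i: List[float],
             lr: float, reg: float) -> Tuple[float, float, List[float], List[float]]:
    """One regularized SGD update for a single observed rating r_ui."""
    err = r_ui - predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    # both factor updates use the old values of p_u and q_i
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i


# toy values, chosen only for illustration
mu, b_u, b_i = 3.5, 0.0, 0.0
p_u, q_i = [0.1, -0.2], [0.3, 0.1]
before = predict(mu, b_u, b_i, p_u, q_i)
b_u, b_i, p_u, q_i = sgd_step(5.0, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.02)
after = predict(mu, b_u, b_i, p_u, q_i)
print(before, after)
```

After the step, the prediction moves towards the observed rating of 5.0; `n_epochs` controls how many passes of such updates are made over all observed ratings.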

## Create submission file

Extract the user and item IDs needed for the submission file from sampleSubmission.csv, and use the fitted model to predict a rating for each pair. Save the IDs and predictions to a .csv file for submission.

In [None]:
# extract the needed users and items for submission
pred_users, pred_items, _ = parse_csv('sampleSubmission.csv')
pred_ratings = list()
df_ids = list()
# use the trained model to extract the predictions for submission
for user, item in zip(pred_users, pred_items):
  df_ids.append(f"r{user + 1}_c{item + 1}")
  pred_ratings.append(algo.predict(user, item, verbose=False).est)
# save the prediction into a file in the agreed format
df = pd.DataFrame({"Id": df_ids, "Prediction": pred_ratings})
df.to_csv(f"surprise_svd_n_factors-{n_factors}_n_epochs-{n_epochs}_lr_all-{lr_all:.3f}_reg_all-{reg_all:.1f}_submission.csv", index=False)
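
The expected file layout can be checked on a toy example before writing the real submission. The sketch below uses only the standard library and made-up predictions. `build_submission_csv` is a hypothetical helper and is not part of the notebook.

```python
import csv
import io
from typing import Iterable, Tuple


def build_submission_csv(rows: Iterable[Tuple[int, int, float]]) -> str:
    """Render (zero-based user, zero-based item, prediction) rows as submission CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Id", "Prediction"])
    for user, item, rating in rows:
        # the file uses one-based indices, so undo the shift applied in parse_csv
        writer.writerow([f"r{user + 1}_c{item + 1}", rating])
    return buffer.getvalue()


print(build_submission_csv([(36, 0, 3.25), (72, 0, 4.0)]))
```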