The experiments showed that the fine-tuned SVD was the best-performing model. This notebook therefore trains SVD as the final model, using the best-performing hyperparameters.

# Packages & Working Directory

In [None]:
!pip install surprise     #install surprise (takes a while, around 1 min)

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Import Necessary Packages

In [None]:
# Dataframe 
import pandas as pd
import numpy as np

# Surprise models and functions 
from surprise import Dataset, Reader 
from surprise import SVD 
from surprise.accuracy import mse

# Time
import time 

Set the Google Drive working directory

In [None]:
from google.colab import drive                    # mount google drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# change this to your own local directory path
%cd drive/MyDrive/DSA4212/DSA4212_Assignment2     

/content/drive/MyDrive/DSA4212/DSA4212_Assignment2


# Prepare Data

The training and testing datasets are read into the environment as pandas dataframes

In [None]:
train_ds = pd.read_csv('assignment_2_ratings_train.csv')
test_ds = pd.read_csv('assignment_2_ratings_test.csv')

Take a look at the size of the datasets, the column names, and the range of ratings

In [None]:
train_ds.shape        #approx 4.4 million rows 

(4436068, 3)

In [None]:
train_ds.columns

Index(['user_id', 'anime_id', 'rating'], dtype='object')

In [None]:
test_ds.shape        #approx 1.9 million rows 

(1901173, 3)

In [None]:
test_ds.columns

Index(['user_id', 'anime_id', 'rating'], dtype='object')

In [None]:
np.min(train_ds['rating']), np.max(train_ds['rating'])     #all ratings between 1-10

(1, 10)

The training dataset needs to be converted to a surprise trainset in order to train the model

In [None]:
#Instantiate a reader class to parse the ratings 
reader = Reader(rating_scale=(1, 10))

In [None]:
#Shuffle the dataset 
train_shuffled = train_ds.sample(frac = 1)

In [None]:
# The datasets are first loaded into a Surprise Dataset format
train_data = Dataset.load_from_df(train_shuffled[['user_id', 'anime_id', 'rating']], reader)
test_data = Dataset.load_from_df(test_ds[['user_id', 'anime_id', 'rating']], reader)

Note that the entire training dataset is used for the training of the final model. 

A train-validation split is not required, since the purpose of this notebook is not to perform experiments and compare the performance of different models, but to train the final model chosen in the previous experiments. Using the entire training dataset lets the final model learn from more data, which should help it perform better on unseen test cases.

In [None]:
# The train dataset is built as a full Surprise trainset
train = train_data.build_full_trainset()

The testing dataset is also prepared as a Surprise testset for the model to predict ratings on the provided test data

In [None]:
#first build the test data as a trainset, then use it to build a testset
test = test_data.build_full_trainset().build_testset()
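For reference, a Surprise testset is a plain list of (user id, item id, rating) tuples that use the raw ids. The following is a minimal sketch of building the same structure by hand; `rows_to_testset` is a hypothetical helper name and the sample rows are made up for illustration, they are not taken from the assignment data.

```python
def rows_to_testset(rows):
    """Convert (user_id, anime_id, rating) rows into Surprise-style testset tuples.

    Hypothetical helper: each output tuple is (raw user id, raw item id, rating as float).
    """
    return [(uid, iid, float(rating)) for uid, iid, rating in rows]


# Made-up rows shaped like the three columns of the ratings files
sample_rows = [(1, 10, 7), (1, 20, 9), (2, 10, 4)]
testset = rows_to_testset(sample_rows)
print(testset)
```

With the notebook's dataframe, the rows could presumably come from `test_ds[['user_id', 'anime_id', 'rating']].itertuples(index=False, name=None)`. The tuples would then follow the row order of the file rather than the order produced by `build_testset()`.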

# Training

The SVD model is defined with the best-performing hyperparameters as follows: 
<br> Number of factors: 50 
<br> Regularization parameter: 0.03 (for all regularization terms) 
<br> Epochs: 15 
<br> Learning Rate: 0.005 (default)
<br> Usage of baseline averages: True (default)

In [None]:
model = SVD(n_factors = 50, 
            reg_all = 0.03,
            n_epochs = 15, 
            verbose = True)         #verbose allows the printing of epoch number during the training process 
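To make the hyperparameters above more concrete, here is a minimal pure-Python sketch of the biased matrix factorization that Surprise's SVD is based on: the estimate is the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors, and each rating triggers one regularized SGD update. This is an illustration only, not the library's implementation; the function names, the global mean of 7.0, and the single rating of 9.0 are assumptions made up for the demo.

```python
import random


def dot(p_u, q_i):
    """Dot product of the user and item factor vectors."""
    return sum(p * q for p, q in zip(p_u, q_i))


def predict(mu, b_u, b_i, p_u, q_i, low=1.0, high=10.0):
    """Estimate a rating and clip it to the rating scale (1 to 10 here)."""
    est = mu + b_u + b_i + dot(p_u, q_i)
    return min(high, max(low, est))


def sgd_step(r, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.03):
    """One regularized SGD update for a single observed rating r."""
    err = r - (mu + b_u + b_i + dot(p_u, q_i))
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    # Both factor updates use the old values of p_u and q_i
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i


rng = random.Random(0)
n_factors = 50
mu, r = 7.0, 9.0        # assumed global mean and one made-up rating
b_u, b_i = 0.0, 0.0     # biases start at zero
p_u = [rng.gauss(0, 0.1) for _ in range(n_factors)]
q_i = [rng.gauss(0, 0.1) for _ in range(n_factors)]

before = predict(mu, b_u, b_i, p_u, q_i)
for _ in range(100):    # repeat the update on the same rating
    b_u, b_i, p_u, q_i = sgd_step(r, mu, b_u, b_i, p_u, q_i)
after = predict(mu, b_u, b_i, p_u, q_i)

print("estimate before:", round(before, 3), "after:", round(after, 3))
```

In the real model each epoch makes one pass over all training ratings, so `n_epochs = 15` means 15 such passes; the loop above only repeats the update on one rating to show the estimate moving toward it.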

The model is then fitted on the training data. The training time is also tracked using start and end timestamps

In [None]:
start = time.time()                 #starting time 
model_fit = model.fit(train)        #fit model on training data
end = time.time()                   #ending time 

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14


# Prediction & Evaluation

The trained model is now used to get the predictions on both the training and testing set. The time taken for the model to predict is also tracked

In [None]:
#Training prediction 
start_train_pred = time.time()
train_predictions = model.test(train.build_testset())        #the trainset has to be converted to a testset in order to make predictions  
end_train_pred = time.time()

In [None]:
#View the first few rows of predictions 
train_predictions[:5]              #the prediction contains the user id, the anime id, the actual rating, and the estimated rating

[Prediction(uid=20170, iid=10794, r_ui=6.0, est=5.863842927322248, details={'was_impossible': False}),
 Prediction(uid=20170, iid=72, r_ui=7.0, est=6.2926306993173435, details={'was_impossible': False}),
 Prediction(uid=20170, iid=2993, r_ui=6.0, est=5.997494507450494, details={'was_impossible': False}),
 Prediction(uid=20170, iid=1691, r_ui=7.0, est=6.439331094160516, details={'was_impossible': False}),
 Prediction(uid=20170, iid=11757, r_ui=8.0, est=8.847326276546688, details={'was_impossible': False})]

In [None]:
#Testing prediction 
start_test_pred = time.time()
test_predictions = model.test(test)
end_test_pred = time.time()

In [None]:
#View the first few rows of predictions 
test_predictions[:5]              #the prediction contains the user id, the anime id, the actual rating, and the estimated rating

[Prediction(uid=44017, iid=13161, r_ui=4.0, est=6.966595816442626, details={'was_impossible': False}),
 Prediction(uid=44017, iid=11235, r_ui=7.0, est=7.4253928083998, details={'was_impossible': False}),
 Prediction(uid=44017, iid=22199, r_ui=7.0, est=8.235426484010707, details={'was_impossible': False}),
 Prediction(uid=44017, iid=6166, r_ui=6.0, est=6.715043443178633, details={'was_impossible': False}),
 Prediction(uid=44017, iid=30206, r_ui=6.0, est=7.563165456075529, details={'was_impossible': False})]

In [None]:
#estimated predictions as a numpy array 
np.array(list(map(lambda row: row.est, test_predictions)))

array([6.96659582, 7.42539281, 8.23542648, ..., 8.22623551, 8.95945247,
       8.42878708])

The training and prediction time of the model are as follows 

In [None]:
print("Training Time: ", round(end-start, 2), "seconds")
print("Train Prediction Time (4.4M rows): ", round(end_train_pred-start_train_pred, 2), "seconds")      #some time also taken to convert the training data into a Surprise testset
print("Test Prediction Time (1.9M rows): ", round(end_test_pred-start_test_pred,2), "seconds")

Training Time:  53.15 seconds
Train Prediction Time (4.4M rows):  61.12 seconds
Test Prediction Time (1.9M rows):  19.99 seconds
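The start and end timestamp pattern used for training and prediction could also be wrapped in a small helper. The sketch below is one possible way to do this with the standard library; `timed` and `timings` are hypothetical names, and the summed loop is a stand-in for a call such as `model.fit(train)`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("demo", timings):
    # Stand-in workload; in the notebook this would be the fit or test call
    total = sum(i * i for i in range(100_000))

print("demo:", round(timings["demo"], 2), "seconds")
```

`time.perf_counter()` is used here instead of `time.time()` because it is a monotonic clock intended for measuring short durations.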


The model performance is evaluated by the Mean Squared Error (MSE) between the true ratings and the model's predictions

In [None]:
#surprise's accuracy module's mse function is used to calculate the MSE 
print("Train MSE: ", mse(train_predictions, verbose = False))     
print("Test MSE: ", mse(test_predictions, verbose = False))

Train MSE:  0.9728367777395038
Test MSE:  1.2954865742329127
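As a cross-check on what the metric measures, the MSE is the average of the squared differences between the true rating (`r_ui`) and the estimate (`est`). The sketch below computes it by hand on made-up predictions; `Pred` is a simplified stand-in for Surprise's `Prediction` tuple (without the `details` field) and `mean_squared_error` is a hypothetical helper name.

```python
from collections import namedtuple
from math import sqrt

# Simplified stand-in for Surprise's Prediction tuple (no `details` field)
Pred = namedtuple("Pred", ["uid", "iid", "r_ui", "est"])


def mean_squared_error(predictions):
    """Average squared difference between true and estimated ratings."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    return sum((p.r_ui - p.est) ** 2 for p in predictions) / len(predictions)


# Made-up predictions for illustration only
toy = [
    Pred(1, 10, 6.0, 5.5),
    Pred(1, 20, 7.0, 8.0),
    Pred(2, 10, 4.0, 4.5),
]
toy_mse = mean_squared_error(toy)
toy_rmse = sqrt(toy_mse)   # RMSE is on the same scale as the ratings
print("MSE:", toy_mse, "RMSE:", toy_rmse)
```

Taking the square root gives the RMSE, which is expressed in rating points and can be easier to interpret than the MSE.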


The final test MSE is 1.2955, compared with a training MSE of 0.9728.