# Gridsearching BaselineOnly

In [0]:
import pandas as pd
import numpy as np
import surprise as sur
import random

Read in the review dataset and prepare it for the Surprise library using the `Reader` and `Dataset` classes.

In [0]:
df = pd.read_csv('df_sub.csv.gz', compression='gzip').astype(
    {'rating': 'int8', 'total_votes': 'int32'})

In [0]:
# load_from_df expects the columns in the order: user id, item id, rating
reader = sur.Reader(rating_scale=(1, 5))
data = sur.Dataset.load_from_df(df[['reviewerId', 'asin', 'rating']], reader)

In [33]:
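# Splitting the data into train and test sets

raw_ratings = data.raw_ratings
# Seed both generators: random.shuffle uses Python's random module, not NumPy's
random.seed(1)
np.random.seed(1)
# Shuffle the ratings in place before splitting
random.shuffle(raw_ratings)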

threshold = int(.9 * len(raw_ratings))
A_raw_ratings = raw_ratings[:threshold]
B_raw_ratings = raw_ratings[threshold:]

print(len(A_raw_ratings))
print(len(B_raw_ratings))

data.raw_ratings = A_raw_ratings

246294
27367
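
The 90/10 hold-out split above can be written as a small reusable helper. This is a minimal sketch, not part of the original notebook: the name `holdout_split` and its parameters are hypothetical, and it uses a private `random.Random` instance so the shuffle is reproducible without touching global state.

```python
import random


def holdout_split(items, train_frac=0.9, seed=1):
    """Shuffle a copy of `items` and split it into (train, test) lists.

    Hypothetical helper: mirrors the threshold logic used above,
    int(train_frac * len(items)), but leaves the input untouched.
    """
    rng = random.Random(seed)      # private, seeded generator
    shuffled = list(items)         # copy so the caller's data is not mutated
    rng.shuffle(shuffled)
    threshold = int(train_frac * len(shuffled))
    return shuffled[:threshold], shuffled[threshold:]


# Toy usage with ten placeholder "ratings"
train, test = holdout_split(range(10))
print(len(train), len(test))
```

With the 273,661 ratings in this dataset, the same threshold gives the 246,294 / 27,367 split printed above.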


Grid search over the `bsl_options` of BaselineOnly. I tested several ranges for each option over a number of iterations; see the [Surprise documentation](https://surprise.readthedocs.io/en/stable/index.html) for details on what each option does.
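
To make the options in the grid concrete, here is a minimal pure-Python sketch of alternating least squares for baseline estimates, where each prediction is the global mean plus a user bias and an item bias. It follows the ALS scheme described in the Surprise documentation, but it is not the library's implementation; the names `als_baselines` and `predict` and the toy ratings are illustrative assumptions.

```python
from collections import defaultdict


def als_baselines(ratings, reg_u=4, reg_i=4, n_epochs=20):
    """Estimate (mu, user_bias, item_bias) from (user, item, rating) triples.

    Illustrative sketch only: alternately solve for item biases and
    user biases, shrinking each toward zero via reg_i / reg_u.
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))

    bu = {u: 0.0 for u in by_user}
    bi = {i: 0.0 for i in by_item}
    for _ in range(n_epochs):
        # Item step: average residual per item, damped by reg_i
        for i, rs in by_item.items():
            bi[i] = sum(r - mu - bu[u] for u, r in rs) / (reg_i + len(rs))
        # User step: average residual per user, damped by reg_u
        for u, rs in by_user.items():
            bu[u] = sum(r - mu - bi[i] for i, r in rs) / (reg_u + len(rs))
    return mu, bu, bi


def predict(mu, bu, bi, user, item, low=1, high=5):
    """Baseline estimate mu + b_u + b_i, clipped to the rating scale."""
    est = mu + bu.get(user, 0.0) + bi.get(item, 0.0)
    return min(high, max(low, est))


# Toy data (made up for illustration)
toy = [('a', 'x', 5), ('a', 'y', 3), ('b', 'x', 3), ('b', 'y', 1)]
mu, bu, bi = als_baselines(toy)
print(predict(mu, bu, bi, 'a', 'x'))
```

Larger `reg_u` and `reg_i` pull the biases toward zero, which matters most for users and items with few ratings; `n_epochs` controls how many alternating passes are made.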

In [0]:
baseline = sur.BaselineOnly()

In [0]:
# reg_i and reg_u are used only by 'als'; reg is used only by 'sgd'
param_grid = {'bsl_options': {'method': ['als', 'sgd'],
                              'reg_i': [4, 5, 6],
                              'reg_u': [4, 6, 7],
                              'n_epochs': [10, 15, 20],
                              'reg': [0.01, 0.02, 0.05, 0.1]
                              },
              }
grid_search = sur.model_selection.GridSearchCV(sur.BaselineOnly,
                                               param_grid,
                                               measures=['rmse'],
                                               cv=3,
                                               refit=True)
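
Before running the search it helps to know how many fits it will trigger. The sketch below counts the combinations in the grid above, assuming the nested `bsl_options` lists are expanded as a Cartesian product and each combination is fitted once per fold. The variable names and the two method-specific grids are illustrative, not part of the original notebook.

```python
from itertools import product

# Same option lists as the grid above
bsl_grid = {
    'method': ['als', 'sgd'],
    'reg_i': [4, 5, 6],
    'reg_u': [4, 6, 7],
    'n_epochs': [10, 15, 20],
    'reg': [0.01, 0.02, 0.05, 0.1],
}
cv_folds = 3


def count_combos(grid):
    """Number of parameter combinations in a dict of option lists."""
    return len(list(product(*grid.values())))


n_combos = count_combos(bsl_grid)
n_fits = n_combos * cv_folds

# Hypothetical alternative: one grid per method, keeping only the
# options that method actually uses
als_grid = {k: bsl_grid[k] for k in ('reg_i', 'reg_u', 'n_epochs')}
sgd_grid = {k: bsl_grid[k] for k in ('reg', 'n_epochs')}
n_split = count_combos(als_grid) + count_combos(sgd_grid)

print(n_combos, n_fits, n_split)
```

Because each method ignores the other method's regularization options, many of the combined combinations are duplicates in effect; running two separate searches would cover the same distinct settings with far fewer fits.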

In [50]:
grid_search.fit(data)

algo = grid_search.best_estimator['rmse']

print('Best score ', end='   ')
print(grid_search.best_score)

Estimating biases using als...

In [52]:
#Obtaining the optimal parameters
grid_search.best_params

{'rmse': {'bsl_options': {'method': 'als',
   'n_epochs': 20,
   'reg_i': 4,
   'reg_u': 4}}}

In [0]:
algo = grid_search.best_estimator['rmse']

Refitting the optimal algorithm on the full training set (A), then computing RMSE on both the training set and the held-out test set (B).

In [44]:
# retrain on the whole set A
trainset = data.build_full_trainset()
algo.fit(trainset)

# Compute score on training set
trainset_build = trainset.build_testset()
predictions_train = algo.test(trainset_build)
print('Training score ', end='   ')
print(sur.accuracy.rmse(predictions_train))

# Compute score on rated test set
testset = data.construct_testset(B_raw_ratings)  # testset is now the set B
predictions_test = algo.test(testset)
print('Test score (rated items) ', end=' ')
print(sur.accuracy.rmse(predictions_test))


Estimating biases using als...
Training score    RMSE: 0.8097
0.8096642650404079
Test score (rated items)  RMSE: 0.8549
0.8549069265508885
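
For reference, the RMSE reported above is the square root of the mean squared difference between true and estimated ratings. The sketch below computes it by hand on made-up numbers; the function name `rmse` and its pair-based input are illustrative and not the Surprise API.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true_rating, estimate) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (true, estimate) pair")
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(pairs))


# Toy values (made up): every estimate is off by exactly 2 stars
print(rmse([(5, 3), (1, 3)]))
```

The training score (0.8097) is lower than the test score (0.8549), as expected, since the biases were fitted on the training ratings.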
