In [2]:
import pandas as pd 
import numpy as np 
import random
import surprise as sur

# My own helper script containing the functions needed to compute the recommendations
import recommender_functions

%config InlineBackend.figure_format = 'retina'

After grid-searching BaselineOnly, KNNBaseline and SVD, BaselineOnly had the best (lowest) CV score at 0.8528 (vs. SVD at 0.8579 and KNNBaseline at 0.8920). Here I implement a full recommender system for the BaselineOnly algorithm (as I did with my basic recommender system). The same steps can be replicated to build recommender systems with any other algorithm provided by Surprise.

The functions needed for this are in a separate Python script (recommender_functions.py), and a separate notebook shows how to build the prediction matrices for both SVD and KNNBaseline.

# Building a recommender system using BaselineOnly

The BaselineOnly approach works by estimating the baseline rating $b_{ui}$, which is defined as follows:

$$b_{ui} = \mu + b_u + b_i$$

where $\mu$ is the overall rating mean, $b_u$ is the user bias (e.g. whether they are usually a more critical rater) and $b_i$ is the item bias after adjusting for the overall mean. The difference from the basic recommender system is that the algorithm tries to find the optimal $b_u$ and $b_i$ by minimising the following regularised squared error over the observed ratings:

$$\min_{b_u, b_i} \sum_{u,i} (r_{ui} - \mu - b_u - b_i)^2 + \lambda \left(\sum_u b_u^2 + \sum_i b_i^2\right)$$

where $\lambda$ is the regularisation strength, which penalises large biases to avoid overfitting.
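To make the optimisation concrete, here is a minimal NumPy sketch of alternating least squares for the biases. It is my own illustration under stated assumptions, not Surprise's actual code: the function name `als_baselines`, the 0-based integer ids and the toy ratings are all hypothetical, and it uses separate `reg_u` and `reg_i` values in place of the single $\lambda$ in the formula.

```python
import numpy as np

def als_baselines(ratings, n_users, n_items, n_epochs=20, reg_u=4.0, reg_i=4.0):
    """Estimate mu, b_u and b_i by alternating least squares.

    ratings: list of (user_index, item_index, rating) with 0-based indices.
    """
    users = np.array([u for u, _, _ in ratings])
    items = np.array([i for _, i, _ in ratings])
    r = np.array([x for _, _, x in ratings], dtype=float)

    mu = r.mean()
    bu = np.zeros(n_users)
    bi = np.zeros(n_items)
    n_u = np.bincount(users, minlength=n_users)  # ratings per user
    n_i = np.bincount(items, minlength=n_items)  # ratings per item

    for _ in range(n_epochs):
        # Item step: with b_u fixed, b_i is the shrunken mean residual per item
        resid_i = np.bincount(items, weights=r - mu - bu[users], minlength=n_items)
        bi = resid_i / (reg_i + n_i)
        # User step: with b_i fixed, b_u is the shrunken mean residual per user
        resid_u = np.bincount(users, weights=r - mu - bi[items], minlength=n_users)
        bu = resid_u / (reg_u + n_u)
    return mu, bu, bi

# Toy data: 2 users x 2 items, fully observed
toy_ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 1, 2)]
mu0, bu0, bi0 = als_baselines(toy_ratings, 2, 2, reg_u=0.0, reg_i=0.0)
mu4, bu4, bi4 = als_baselines(toy_ratings, 2, 2, reg_u=4.0, reg_i=4.0)
print(mu0, bu0, bi0)
print(mu4, bu4, bi4)
```

With no regularisation the toy ratings are reproduced exactly. With `reg_u = reg_i = 4` the biases are pulled towards zero, which is the intended effect of the penalty term.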

## Load in the data

In [3]:
df = pd.read_csv('/Volumes/external/Sangeetha-Project/df_sub.csv.gz', 
                       compression='gzip').astype({'rating':'int8', 'total_votes':'int32'})

In [4]:
metadata = pd.read_csv('/Volumes/EXTERNAL/Sangeetha-Project/meta_df_sub.csv.gz', compression='gzip', 
                      names = ['asin', 'title', 'description', 'price', 'categories'])

In [7]:
#load in the metadata and book review merged dataframe
merged = pd.read_csv('/Users/Sangeetha/Documents/data-science/side-projects/merged.csv.gz', compression='gzip')

## Read in the data as a Surprise Dataset

In [9]:
reader = sur.Reader(rating_scale=(1,5))
data = sur.Dataset.load_from_df(df[['reviewerId', 'asin','rating']], reader)

## Fitting the model (best CV score: 0.8527922039787805)

In [10]:
bsl_options = {'method': 'als',
               'n_epochs': 20,
               'reg_i': 4,
               'reg_u': 4,
               }

algo = sur.BaselineOnly(bsl_options=bsl_options)

In [11]:
raw_ratings = data.raw_ratings

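# Shuffle the ratings reproducibly. random.shuffle uses Python's random module,
# so that is the module to seed (np.random.seed has no effect on it).
random.seed(1)
random.shuffle(raw_ratings)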


#section the data into training set and test set
threshold = int(.9 * len(raw_ratings))
A_raw_ratings = raw_ratings[:threshold]
B_raw_ratings = raw_ratings[threshold:]

print(len(A_raw_ratings))
print(len(B_raw_ratings))

#make the raw ratings contain only the training set
data.raw_ratings = A_raw_ratings

246294
27367


In [12]:
# Build a trainset from the training split
trainset = data.build_full_trainset()
algo.fit(trainset)

Estimating biases using als...


<surprise.prediction_algorithms.baseline_only.BaselineOnly at 0x106fbecd0>

In [13]:
# Compute score on training set
trainset_build = trainset.build_testset()
predictions_train = algo.test(trainset_build)

print('Training score ', end='   ')
print(sur.accuracy.rmse(predictions_train))

Training score    RMSE: 0.8089
0.8088931782771778


In [14]:
# Compute score on rated test set
testset = data.construct_testset(B_raw_ratings)  # testset is now the set B
predictions_test = algo.test(testset)
print('Test score (rated items) ', end=' ')
print(sur.accuracy.rmse(predictions_test))

Test score (rated items)  RMSE: 0.8421
0.8420917910018215
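For reference, the RMSE reported above is the square root of the mean squared difference between the true rating and the estimate. Here is a small sketch of that calculation on made-up numbers. The `Pred` tuple is a hypothetical stand-in that only mimics the `r_ui` and `est` fields of Surprise's `Prediction` objects.

```python
import math
from collections import namedtuple

# Stand-in for Surprise's Prediction tuples; only r_ui and est are used here
Pred = namedtuple("Pred", ["uid", "iid", "r_ui", "est"])

def rmse(predictions):
    """Root mean squared error between true ratings and estimates."""
    squared_errors = [(p.r_ui - p.est) ** 2 for p in predictions]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up predictions, purely for illustration
toy_predictions = [
    Pred("u1", "i1", 5.0, 4.0),
    Pred("u1", "i2", 3.0, 3.0),
    Pred("u2", "i1", 4.0, 5.0),
    Pred("u2", "i2", 2.0, 4.0),
]
print(rmse(toy_predictions))
```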


## Calculating the user item matrix

In [15]:
data.raw_ratings = raw_ratings

# Build a trainset using the full data
trainset_full = data.build_full_trainset()
algo.fit(trainset_full)

# Compute score on training set
trainset_full_build = trainset_full.build_testset()
predictions_full_train = algo.test(trainset_full_build)

print('Training score ', end='   ')
print(sur.accuracy.rmse(predictions_full_train))

Estimating biases using als...
Training score    RMSE: 0.8100
0.8100039279363411


In [17]:
mu = algo.default_prediction()
print(mu)
full_pred = mu + algo.bu.reshape(-1, 1) + algo.bi.reshape(1, -1)

4.089157753571024
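The `full_pred` line relies on NumPy broadcasting: a column of user biases plus a row of item biases expands into a full users x items grid. A toy sketch with made-up biases (3 users, 2 items) shows the shapes involved and why values can fall outside the 1 to 5 rating scale before clipping.

```python
import numpy as np

# Made-up values, purely for illustration
mu = 4.0
bu = np.array([0.5, -0.5, 0.0])   # one bias per user
bi = np.array([1.0, -1.0])        # one bias per item

# (3, 1) column + (1, 2) row broadcasts to a (3, 2) users x items matrix
full = mu + bu.reshape(-1, 1) + bi.reshape(1, -1)
print(full.shape)
print(full)

# Raw sums can exceed the rating scale, so clip them back to [1, 5]
clipped = np.clip(full, 1, 5)
print(clipped)
```

Entry `[u, i]` of the result is $\mu + b_u + b_i$ for inner user id `u` and inner item id `i`, so the rows and columns follow Surprise's inner ids rather than the raw ids.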


In [18]:
user_baselines=[]

for user in np.unique(df.reviewerId):
    user_baselines.append((user, trainset_full.to_inner_uid(user), algo.bu[trainset_full.to_inner_uid(user)]))

user_baselines[:5]


[('A100NGGXRQF0AQ', 1699, -0.028663511478524202),
 ('A102Z3T7NSM5KC', 134, 0.06436157233892911),
 ('A106016KSI0YQ', 1937, -0.47443014958425983),
 ('A106E1N0ZQ4D9W', 918, 0.16208264542733108),
 ('A10BZSGALQPS0V', 2112, -0.30592575596839156)]

In [19]:
len(user_baselines)

2647

In [20]:
item_baselines=[]

for item in np.unique(df.asin):
    item_baselines.append((item, trainset_full.to_inner_iid(item), algo.bi[trainset_full.to_inner_iid(item)]))

item_baselines[:5]

[('000100039X', 9051, 0.29767752282567267),
 ('0002007770', 1372, 0.3285301439426249),
 ('0002051850', 2041, 0.3421724601050048),
 ('0002219417', 1484, 0.49449610293689394),
 ('000222383X', 6799, 0.419044973967076)]

In [21]:
len(item_baselines)

10982

In [22]:
full_pred_df = pd.DataFrame(full_pred, index = [x for x,y,z in sorted(user_baselines, key=lambda x:x[1])], 
                         columns = [x for x,y,z in sorted(item_baselines, key=lambda x:x[1])])

In [None]:
# Clip the estimates to the 1-5 rating scale
full_pred_df = full_pred_df.clip(lower=1, upper=5)

In [23]:
full_pred_df.to_csv('/Volumes/external/Sangeetha-Project/baselineOnly_est.csv.gz', 
                    index = True, header=True, compression='gzip')

In [24]:
sur.dump.dump('/Volumes/external/Sangeetha-Project/baselineOnly_dump_file', algo=algo)

### Checking the user-item matrix

In [25]:
item = '0060515198'
user = 'A2NHD7LUXVGTD3'

In [26]:
algo.predict('A2NHD7LUXVGTD3', '0060515198')

Prediction(uid='A2NHD7LUXVGTD3', iid='0060515198', r_ui=None, est=4.331345005869336, details={'was_impossible': False})

In [27]:
full_pred[trainset_full.to_inner_uid(user), trainset_full.to_inner_iid(item)]

4.331345005869336

In [28]:
full_pred_df.loc[user,item]

4.331345005869336

## Precision@K and Recall@K
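
As a starting point for this section, here is a rough sketch of how precision@K and recall@K could be computed per user from a list of predictions. The function name, the relevance threshold of 4.0 and the toy data are my own assumptions. The function expects tuples that start with `(uid, iid, true_rating, estimate)`, which matches the field order of Surprise's `Prediction` objects.

```python
from collections import defaultdict

def precision_recall_at_k(predictions, k=10, threshold=4.0):
    """Return per-user precision@k and recall@k dictionaries.

    An item counts as relevant if its true rating >= threshold and as
    recommended if it is in the user's top k and its estimate >= threshold.
    """
    per_user = defaultdict(list)
    for uid, _iid, true_r, est, *_ in predictions:
        per_user[uid].append((est, true_r))

    precisions, recalls = {}, {}
    for uid, ratings in per_user.items():
        # Rank this user's items by estimated rating, best first
        ratings.sort(key=lambda pair: pair[0], reverse=True)
        top_k = ratings[:k]

        n_relevant = sum(true_r >= threshold for _, true_r in ratings)
        n_recommended = sum(est >= threshold for est, _ in top_k)
        n_hits = sum(est >= threshold and true_r >= threshold
                     for est, true_r in top_k)

        # Define the metric as 0 when its denominator is 0
        precisions[uid] = n_hits / n_recommended if n_recommended else 0.0
        recalls[uid] = n_hits / n_relevant if n_relevant else 0.0
    return precisions, recalls

# Made-up predictions for a single user, purely for illustration
toy = [
    ("u1", "a", 5.0, 4.8),
    ("u1", "b", 3.0, 4.5),
    ("u1", "c", 4.0, 4.1),
    ("u1", "d", 5.0, 3.0),
]
precisions, recalls = precision_recall_at_k(toy, k=2, threshold=4.0)
print(precisions, recalls)
```

Averaging the per-user values gives a single precision@K and recall@K for the model.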

## Getting top N recommendations
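
One possible way to get the top N items for a user from a users x items prediction matrix is sketched below. This is not the code in recommender_functions.py; the function name `top_n_for_user` and the toy frames are hypothetical. It assumes the ratings frame has `reviewerId` and `asin` columns, as `df` does in this notebook.

```python
import pandas as pd

def top_n_for_user(pred_df, ratings_df, user, n=5):
    """Return the n highest estimates for `user`, skipping items already rated.

    pred_df: users x items DataFrame of estimated ratings.
    ratings_df: DataFrame with 'reviewerId' and 'asin' columns.
    """
    already_rated = ratings_df.loc[ratings_df["reviewerId"] == user, "asin"]
    candidates = pred_df.loc[user].drop(labels=already_rated.unique(),
                                        errors="ignore")
    return candidates.nlargest(n)

# Made-up data, purely for illustration
toy_pred = pd.DataFrame(
    [[4.5, 3.0, 4.9], [2.0, 4.0, 3.5]],
    index=["u1", "u2"],
    columns=["a", "b", "c"],
)
toy_ratings = pd.DataFrame({"reviewerId": ["u1", "u2"], "asin": ["c", "a"]})

print(top_n_for_user(toy_pred, toy_ratings, "u1", n=2))
print(top_n_for_user(toy_pred, toy_ratings, "u2", n=1))
```

Note that with a pure baseline model $b_u$ shifts all of a user's estimates by the same amount, so before clipping the item ranking is the same for every user and the recommendations differ only through the items each user has already rated.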

## Getting an example working
