# Collaborative Filtering Recommender System

## Table of Contents
1. Loading in Reviews
2. Model-Based Collaborative Filtering
    - 2.a. Simple SVD
    - 2.b. Grid Searching SVD
3. Memory-Based Collaborative Filtering
    - 3.a. User-User Similarity

In [7]:
import pandas as pd
from surprise import Reader, SVD, Dataset, accuracy
from surprise.model_selection import GridSearchCV, train_test_split, cross_validate
from surprise.prediction_algorithms import knns  # provides KNNBasic for the memory-based section

# Loading in Reviews

In [1]:
ratings = pd.read_csv('data/gr_reviews_clean.csv')
ratings.head()


In [3]:
ratings.drop(columns=['Unnamed: 0'], inplace=True)

In [4]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1373966 entries, 0 to 1373965
Data columns (total 4 columns):
 #   Column            Non-Null Count    Dtype 
---  ------            --------------    ----- 
 0   user_id           1373966 non-null  object
 1   rating            1373966 non-null  int64 
 2   book_id           1373966 non-null  int64 
 3   review_sentences  1373966 non-null  object
dtypes: int64(2), object(2)
memory usage: 41.9+ MB


# Model-Based Collaborative Filtering

In [10]:
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[['user_id','book_id','rating']], reader)

In [12]:
trainset, testset = train_test_split(data, test_size=0.25, random_state=42)

## Simple SVD

In [14]:
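svd_simple = SVD(random_state=42)
# 5-fold cross-validation on the full dataset
results = cross_validate(svd_simple, data, measures=['RMSE'], cv=5, n_jobs=-1, verbose=True)
# Refit on the training split and score on the held-out test split
svd_simple.fit(trainset)
predictions = svd_simple.test(testset)
svd_simple_acc = accuracy.rmse(predictions)  # prints the RMSE and returns it
print(results)
print(svd_simple_acc)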

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.1167  1.1109  1.1146  1.1168  1.1105  1.1139  0.0027  
Fit time          20.49   20.10   19.84   19.40   18.76   19.72   0.60    
Test time         3.78    3.70    2.94    2.63    1.82    2.97    0.73    
RMSE: 1.1157
{'test_rmse': array([1.11673995, 1.11091317, 1.11459762, 1.11680213, 1.11054797]), 'fit_time': (20.49123787879944, 20.09840202331543, 19.836138010025024, 19.40072011947632, 18.75564980506897), 'test_time': (3.77850604057312, 3.6963131427764893, 2.938753843307495, 2.6298141479492188, 1.8161752223968506)}
1.115713059999836
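
The RMSE values above are root mean squared errors between true and predicted ratings. As a minimal sketch of that metric, the snippet below computes it by hand; the `rmse` helper and the `toy_pairs` ratings are hypothetical and are not taken from the review data.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, predicted_rating) pairs."""
    squared_errors = [(true - pred) ** 2 for true, pred in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical (true, predicted) ratings, made up for illustration only.
toy_pairs = [(5, 4.0), (3, 3.0), (1, 3.0)]
toy_rmse = rmse(toy_pairs)
print(toy_rmse)
```

Because errors are squared before averaging, a single prediction that is off by two stars contributes four times as much as one that is off by one star.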


## Grid Searching SVD

In [16]:
svd_grid = {'n_factors': [20, 100],
            'n_epochs': [5, 10],
            'lr_all': [0.002, 0.005],
            'reg_all': [0.4, 0.6],
            'biased': [True, False]}
gs1_svd = GridSearchCV(SVD, param_grid=svd_grid, n_jobs=-1, joblib_verbose=5, cv=3)
gs1_svd.fit(data)


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:   16.3s
[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done  96 out of  96 | elapsed:  4.5min finished
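
The 96 tasks in the log above follow from the size of the grid: five hyperparameters with two candidate values each, evaluated on three folds. The sketch below reproduces that count with `itertools.product`; the names `grid`, `combinations`, and `n_fits` are illustrative only, and the dictionary is a copy of the search space used above.

```python
from itertools import product

# Copy of the search space from the grid search cell.
grid = {
    'n_factors': [20, 100],
    'n_epochs': [5, 10],
    'lr_all': [0.002, 0.005],
    'reg_all': [0.4, 0.6],
    'biased': [True, False],
}

# Every combination of one value per hyperparameter.
combinations = list(product(*grid.values()))
n_folds = 3  # matches cv=3
n_fits = len(combinations) * n_folds
print(len(combinations), n_fits)
```

Adding a third candidate value to any one hyperparameter would raise the count from 32 to 48 combinations, so the grid grows quickly.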



In [17]:
gs1_svd.best_params['rmse']


{'n_factors': 20,
 'n_epochs': 10,
 'lr_all': 0.005,
 'reg_all': 0.4,
 'biased': True}

In [20]:
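# Refit an SVD with the best parameters found by the grid search
gs1_svd_best = SVD(n_factors=20, n_epochs=10, lr_all=0.005, reg_all=0.4, biased=True)
gs1_svd_best.fit(trainset)
predictions = gs1_svd_best.test(testset)
print(accuracy.rmse(predictions))  # accuracy.rmse prints the score itself, so it appears twice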

RMSE: 1.1056
1.1056224350578843
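
With `biased=True`, an SVD model of this kind predicts a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors, then clips the result to the rating scale. The snippet below is a minimal sketch of that rule only; `predict_rating` and all the numbers passed to it are hypothetical and are not values learned by the fitted model.

```python
def predict_rating(global_mean, user_bias, item_bias,
                   user_factors, item_factors, scale=(1, 5)):
    """Biased matrix-factorization estimate, clipped to the rating scale."""
    dot = sum(p * q for p, q in zip(user_factors, item_factors))
    raw = global_mean + user_bias + item_bias + dot
    low, high = scale
    return min(high, max(low, raw))

# Hypothetical parameters for one (user, book) pair.
toy_prediction = predict_rating(
    global_mean=3.8, user_bias=0.3, item_bias=-0.2,
    user_factors=[0.5, -1.0], item_factors=[0.4, 0.1],
)
print(toy_prediction)
```

With `biased=False` the mean and bias terms are dropped and only the dot product remains.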


# Memory-Based Collaborative Filtering

## User-User Similarity

In [None]:
sim_cos = {"name": "cosine", "user_based": True}
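
As a rough illustration of what user-based cosine similarity measures, the sketch below compares two users on the books they have both rated. The `cosine_sim` helper and the users `alice` and `bob` are hypothetical, and the library may handle edge cases such as a minimum number of common items differently.

```python
import math

def cosine_sim(ratings_u, ratings_v):
    """Cosine similarity between two users over their co-rated items."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0  # no shared items, so no evidence of similarity
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = math.sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings_v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

# Hypothetical users mapping book IDs to ratings.
alice = {"book_a": 5, "book_b": 3, "book_c": 4}
bob = {"book_a": 4, "book_b": 3, "book_d": 1}
toy_sim = cosine_sim(alice, bob)
print(toy_sim)
```

Since ratings are all positive, this similarity stays between 0 and 1 and tends to sit close to 1 when users share only a few books.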


In [None]:
# Basic user-based KNN using cosine similarity
knn_simple = knns.KNNBasic(sim_options=sim_cos)
knn_simple.fit(trainset)
knn_simple.sim  # user-user similarity matrix computed during fit

In [None]:
knn_predictions = knn_simple.test(testset)
knn_accuracy = accuracy.rmse(knn_predictions)
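
A basic KNN recommender estimates a rating as the similarity-weighted average of the ratings given by the most similar users who rated the same item. The snippet below is a minimal sketch of that idea under stated assumptions: `knn_estimate` and `toy_neighbors` are hypothetical, neighbors with non-positive similarity are ignored, and `None` stands in for whatever fallback the library applies when no usable neighbor exists.

```python
import heapq

def knn_estimate(neighbors, k=40):
    """Weighted average over the k most similar neighbors.

    neighbors: list of (similarity, rating) pairs for users who rated the item.
    """
    top_k = heapq.nlargest(k, neighbors, key=lambda pair: pair[0])
    weighted_sum = sum(sim * rating for sim, rating in top_k if sim > 0)
    sim_total = sum(sim for sim, _ in top_k if sim > 0)
    if sim_total == 0:
        return None  # no neighbor with positive similarity
    return weighted_sum / sim_total

# Hypothetical (similarity, rating) pairs for one target user and book.
toy_neighbors = [(0.9, 5), (0.5, 3), (0.1, 1)]
toy_estimate = knn_estimate(toy_neighbors, k=2)
print(toy_estimate)
```

With `k=2` the least similar neighbor is left out, so the estimate is pulled toward the 5-star rating from the closest user.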