# Capstone 3 - Book Recommendation System

# Pre-processing and Training Data Development

Pre-processing and Training Data Development is the fourth step in the Data Science Method. The following will be performed in this step:

1. Create dummy or indicator features for categorical variables
2. Standardize the magnitude of numeric features
3. Split into testing and training datasets
4. Apply scaler to the testing set
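Steps 2 to 4 above can be sketched with the standard library alone. This is only an illustration under my own assumptions: the `ages` values are hypothetical, the helper names are made up for this sketch, and the scaler is fitted on the training rows only so that nothing from the test rows leaks into it.

```python
import random
import statistics

def train_test_split_rows(rows, test_fraction=0.25, seed=1):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def fit_standard_scaler(values):
    """Return the mean and population standard deviation of the values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # avoid dividing by zero
    return mean, std

def apply_scaler(values, mean, std):
    """Standardize values with a previously fitted mean and std."""
    return [(v - mean) / std for v in values]

# Hypothetical numeric feature (not taken from the dataset)
ages = [36.0, 37.0, 25.0, 52.0, 41.0, 29.0, 60.0, 33.0]

train, test = train_test_split_rows(ages)
mean, std = fit_standard_scaler(train)         # fit on training data only
train_scaled = apply_scaler(train, mean, std)
test_scaled = apply_scaler(test, mean, std)    # reuse the training scaler
```

The test set is scaled with the training mean and standard deviation, which is what "apply scaler to the testing set" refers to.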

# Modeling

Modeling is the fifth step in the Data Science Method.  The following will be performed in this step:

1. Fit Models with Training Data Set
2. Review Model Outcomes: iterate over additional models as needed.
3. Identify the Final Model

In [29]:
# Load Python packages
import datetime
import os
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas.api.types as ptypes
import seaborn as sns

%matplotlib inline

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

In [30]:
df = pd.read_csv("../Data_Wrangle_EDA/data/Cap3_step23_output.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,User-ID,Book-Rating,Location,Age
0,13715,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,85526.0,0.0,"victoria, british columbia, canada",36.0
1,13716,804106304,The Joy Luck Club,Amy Tan,1994,Prentice Hall (K-12),85526.0,0.0,"victoria, british columbia, canada",36.0
2,13717,786868716,The Five People You Meet in Heaven,Mitch Albom,2003,Hyperion,85526.0,0.0,"victoria, british columbia, canada",36.0
3,13718,60929790,One Hundred Years of Solitude,Gabriel Garcia Marquez,1998,Perennial,85526.0,0.0,"victoria, british columbia, canada",36.0
4,13719,452282152,Girl with a Pearl Earring,Tracy Chevalier,2001,Plume Books,85526.0,7.0,"victoria, british columbia, canada",36.0


In [31]:
df = df.drop(["Unnamed: 0"], axis=1)
df.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,User-ID,Book-Rating,Location,Age
0,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,85526.0,0.0,"victoria, british columbia, canada",36.0
1,804106304,The Joy Luck Club,Amy Tan,1994,Prentice Hall (K-12),85526.0,0.0,"victoria, british columbia, canada",36.0
2,786868716,The Five People You Meet in Heaven,Mitch Albom,2003,Hyperion,85526.0,0.0,"victoria, british columbia, canada",36.0
3,60929790,One Hundred Years of Solitude,Gabriel Garcia Marquez,1998,Perennial,85526.0,0.0,"victoria, british columbia, canada",36.0
4,452282152,Girl with a Pearl Earring,Tracy Chevalier,2001,Plume Books,85526.0,7.0,"victoria, british columbia, canada",36.0


In [32]:
df.shape

(429486, 9)

In [33]:
df.tail()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,User-ID,Book-Rating,Location,Age
429481,263827461,A Poor Relation (Historical Romance: Regency),Joanna Maitland,2001,Harlequin Mills &amp; Boon Ltd,163759.0,5.0,"abertillery, wales, united kingdom",37.0
429482,263816575,Mistress of Madderlea (Historical Romance: Reg...,Mary Nichols,1999,Harlequin Mills &amp; Boon Ltd,163759.0,5.0,"abertillery, wales, united kingdom",37.0
429483,440222974,A Fire in Heaven,Annee Carter,1998,Dell Publishing Company,163759.0,5.0,"abertillery, wales, united kingdom",37.0
429484,373059191,Mr. Easy (Man Of The Month) (Silhouette Desir...,Cait London,1995,Silhouette,163759.0,4.0,"abertillery, wales, united kingdom",37.0
429485,373760930,Groom Candidate (Man Of The Month/The Tallchi...,Cait London,1997,Silhouette,163759.0,4.0,"abertillery, wales, united kingdom",37.0


In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 429486 entries, 0 to 429485
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   ISBN                 429486 non-null  object 
 1   Book-Title           429486 non-null  object 
 2   Book-Author          429486 non-null  object 
 3   Year-Of-Publication  429486 non-null  int64  
 4   Publisher            429486 non-null  object 
 5   User-ID              429486 non-null  float64
 6   Book-Rating          429486 non-null  float64
 7   Location             429486 non-null  object 
 8   Age                  429486 non-null  float64
dtypes: float64(3), int64(1), object(5)
memory usage: 29.5+ MB


In [35]:
df_reviews = df[['User-ID','ISBN','Book-Rating']]
df_reviews.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,85526.0,2005018,0.0
1,85526.0,804106304,0.0
2,85526.0,786868716,0.0
3,85526.0,60929790,0.0
4,85526.0,452282152,7.0


In [36]:
df_reviews.shape

(429486, 3)

# The full dataset is too big for my computer's memory.  Using a random sample of 100,000 rows.

In [37]:
df_sample = df_reviews.sample(n=100000, random_state=1)
df_sample.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
124756,13093.0,345441109,0.0
261251,105374.0,345404114,0.0
11937,230522.0,590407201,10.0
271158,259829.0,886771528,0.0
220988,222050.0,486272842,10.0


In [38]:
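# Rename to the column names used below; reassign instead of inplace=True
# so the rename works on its own copy of the sampled rows
df_sample = df_sample.rename(columns={'User-ID': 'userID', 'ISBN': 'itemID', 'Book-Rating': 'rating'})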
df_sample.head()

Unnamed: 0,userID,itemID,rating
124756,13093.0,345441109,0.0
261251,105374.0,345404114,0.0
11937,230522.0,590407201,10.0
271158,259829.0,886771528,0.0
220988,222050.0,486272842,10.0


In [39]:
df_sample.nunique()

userID     1279
itemID    59338
rating       11
dtype: int64

In [45]:
df_sample['rating'].unique()

array([ 0., 10.,  9.,  8.,  7.,  5.,  6.,  4.,  3.,  2.,  1.])

# Using scikit-surprise

## 1. NormalPredictor

In [46]:
from surprise import NormalPredictor
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate

# A reader is still needed, but only the rating_scale param is required.
reader = Reader(rating_scale=(0, 10))

# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(df_sample[['userID', 'itemID', 'rating']], reader)

# We can now use this dataset as we please, e.g. calling cross_validate
cross_validate(NormalPredictor(), data, cv=2)

{'test_rmse': array([4.48750559, 4.50302891]),
 'test_mae': array([3.36372737, 3.37337035]),
 'fit_time': (0.15366339683532715, 0.1310873031616211),
 'test_time': (0.5519328117370605, 0.5445795059204102)}
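For reference, the two error measures reported above can be written out by hand. This is a minimal sketch of the usual RMSE and MAE definitions, not surprise's own code; the `example` pairs are hypothetical ratings on the 0 to 10 scale.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true_rating, estimated_rating) pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)

# Hypothetical (true, estimated) ratings, not taken from the dataset
example = [(0.0, 3.0), (10.0, 6.0)]
print(rmse(example), mae(example))
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE does. That is consistent with the RMSE of about 4.5 sitting above the MAE of about 3.4 in the output above.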

## 2. SVD with 3 fold cross validation

In [47]:
from surprise import SVD
from surprise import accuracy
from surprise.model_selection import KFold

# define a cross-validation iterator
kf = KFold(n_splits=3)

algo_SVD = SVD()

for trainset, testset in kf.split(data):

    # train and test algorithm.
    algo_SVD.fit(trainset)
    predictions = algo_SVD.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)

RMSE: 3.2085
RMSE: 3.2133
RMSE: 3.2166


## 3. SVD with 3 fold cross validation and GridSearchCV

In [48]:
from surprise import SVD
from surprise.model_selection import GridSearchCV

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

3.224945374136199
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}
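To get a feel for how much work this grid search does, the parameter combinations can be enumerated by hand. This is only a sketch of the idea of an exhaustive grid, assuming every combination is fitted once per fold; it does not call surprise.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}

# Build one dict per combination of parameter values
names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]

cv = 3  # number of folds used above
print(len(combos), "combinations,", len(combos) * cv, "model fits")
```

With two values for each of three parameters there are 8 combinations, so 3-fold cross-validation means 24 fits in total.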


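# Of the models tried so far, the lowest RMSE is achieved with 'SVD with 3 fold cross validation' (about 3.21, versus 3.22 for the grid search and about 4.50 for NormalPredictor).

The grid above does not include SVD's default settings (n_epochs=20, lr_all=0.005, reg_all=0.02), which is probably why the grid search does not beat the default model.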
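The comparison can be checked with a few lines of arithmetic on the numbers printed above. The values are copied from the cell outputs; the labels are my own.

```python
from statistics import fmean

# RMSE values copied from the outputs above
results = {
    "NormalPredictor (2-fold)": fmean([4.48750559, 4.50302891]),
    "SVD defaults (3-fold)": fmean([3.2085, 3.2133, 3.2166]),
    "SVD grid search best (3-fold)": 3.224945374136199,
}

# Print the models from lowest to highest mean RMSE
for name, score in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: {score:.4f}")

best_model = min(results, key=results.get)
```

The comparison is not strictly like for like, since NormalPredictor was run with 2 folds and the SVD variants with 3.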

## 4. K Nearest Neighbor with 3 fold cross validation and Cosine Similarity

## Note:  This cannot run on my computer with the full 'data' of length 100,000
The item-item similarity matrix does not fit in memory. The error is:
Unable to allocate 14.1 GiB for an array with shape (43528, 43528) and data type float64

Using a smaller sample of length 40,000 instead.
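The 14.1 GiB in the error message follows directly from the matrix shape: a dense item-by-item matrix of float64 values needs 8 bytes per entry. A quick check (the function name is my own):

```python
def dense_matrix_gib(n_items, bytes_per_value=8):
    """Memory in GiB for a dense n_items x n_items matrix."""
    return n_items * n_items * bytes_per_value / 2**30

print(round(dense_matrix_gib(43528), 1))  # 14.1
```

Because the cost grows with the square of the number of distinct items, cutting the number of items in the training set reduces the memory needed much faster than linearly.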

In [52]:
df_small_sample = df_sample.sample(n=40000, random_state=1)
df_small_sample.head()

Unnamed: 0,userID,itemID,rating
100291,148744.0,0446329185,0.0
43638,78973.0,0380711532,7.0
313209,163804.0,0517082381,8.0
290694,235105.0,0425135020,7.0
46239,101209.0,051512219X,0.0


In [53]:
df_small_sample.nunique()

userID     1279
itemID    28948
rating       11
dtype: int64

In [54]:
from surprise import KNNBasic
from surprise.model_selection import KFold

sim_options = {'name': 'cosine',
               'user_based': False  # compute  similarities between items
               }
algo_KNN = KNNBasic(sim_options=sim_options)

# define a cross-validation iterator
kf = KFold(n_splits=3)

# The columns must correspond to user id, item id and ratings (in that order).
data_small = Dataset.load_from_df(df_small_sample[['userID', 'itemID', 'rating']], reader)

for trainset, testset in kf.split(data_small):

    # train and test algorithm.
    algo_KNN.fit(trainset)
    predictions = algo_KNN.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 3.5448
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 3.5271
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 3.5669
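To make the similarity step less of a black box, here is a small hand-written cosine similarity between two items. It reflects my understanding of the 'cosine' option (the sums run only over users who rated both items) and is a simplified sketch, not the library's implementation. The rating dicts are hypothetical.

```python
import math

def item_cosine(ratings_a, ratings_b):
    """Cosine similarity between two items.

    Each argument maps userID -> rating for one item. Only users who
    rated both items contribute to the sums.
    """
    common = ratings_a.keys() & ratings_b.keys()
    dot = sum(ratings_a[u] * ratings_b[u] for u in common)
    norm_a = math.sqrt(sum(ratings_a[u] ** 2 for u in common))
    norm_b = math.sqrt(sum(ratings_b[u] ** 2 for u in common))
    if norm_a == 0 or norm_b == 0:
        # No common users, or only zero ratings: treat as not similar
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical ratings for two books
book_a = {13093.0: 3.0, 105374.0: 4.0}
book_b = {13093.0: 3.0, 105374.0: 4.0, 230522.0: 10.0}
print(item_cosine(book_a, book_b))
```

Since many ratings in this dataset are 0, the zero-norm case is likely to come up often, which may be one reason the KNN model does worse here than SVD.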


# Questions:
1.  Next step: how do I recommend books for a certain user?

In [55]:
algo_SVD.predict(13093, 345441109)

Prediction(uid=13093, iid=345441109, r_ui=None, est=0.8096818346479493, details={'was_impossible': False})
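One possible way to turn single predictions into recommendations is to score every book the user has not rated yet and keep the highest estimates. The sketch below assumes only that the model has a `predict(uid, iid)` method returning an object with an `est` attribute, as in the output above. The function name and its arguments are my own.

One caveat: the call above passes plain integers, while the ratings frame stores user IDs as floats and ISBNs as strings (Dtype object). As far as I can tell, surprise treats an ID it has not seen as unknown and SVD then falls back on the global mean plus whatever biases it does know, so it seems safer to pass IDs exactly as they are stored.

```python
import heapq

def top_n_for_user(algo, user_id, candidate_items, already_rated, n=10):
    """Return the n unrated items with the highest estimated rating.

    algo            -- fitted model exposing predict(uid, iid) -> object with .est
    candidate_items -- iterable of raw item IDs to consider
    already_rated   -- set of raw item IDs the user has rated (skipped)
    """
    unseen = [iid for iid in candidate_items if iid not in already_rated]
    scored = [(algo.predict(user_id, iid).est, iid) for iid in unseen]
    # nlargest sorts by estimate first, then by item ID to break ties
    return [(iid, est) for est, iid in heapq.nlargest(n, scored)]

# Possible use in this notebook (assumes algo_SVD and df_sample from above):
# rated = set(df_sample.loc[df_sample['userID'] == 13093.0, 'itemID'])
# top_n_for_user(algo_SVD, 13093.0, df_sample['itemID'].unique(), rated, n=10)
```

Scoring every candidate is linear in the number of books, which should be fine for a sample of this size; for the full catalog it might be worth restricting the candidates first.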