# Capstone 3 - Book Recommendation System

# Pre-processing and Training Data Development

Pre-processing and training data development is the fourth step in the Data Science Method. The following will be performed in this step:

1. Create dummy or indicator features for categorical variables
2. Standardize the magnitude of numeric features
3. Split into testing and training datasets
4. Apply scaler to the testing set
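
The four steps listed above can be sketched on a toy table. This is a minimal illustration using only the standard library; the column names (`publisher`, `age`), the values, and the `scale` helper are hypothetical and are not taken from the notebook's cells. The split is done before the scaler is fitted so that the test rows do not influence the scaling parameters.

```python
import random
import statistics

# Toy rows; column names and values are made up for illustration.
rows = [
    {"publisher": "Plume", "age": 36.0},
    {"publisher": "Dell", "age": 37.0},
    {"publisher": "Plume", "age": 25.0},
    {"publisher": "Hyperion", "age": 52.0},
    {"publisher": "Dell", "age": 41.0},
]

# Step 1: indicator (dummy) features for the categorical column.
categories = sorted({r["publisher"] for r in rows})
encoded = [
    {"age": r["age"], **{f"publisher_{c}": int(r["publisher"] == c) for c in categories}}
    for r in rows
]

# Step 3: split into training and testing sets (seeded for repeatability).
shuffled = encoded[:]
random.Random(1).shuffle(shuffled)
train, test = shuffled[:4], shuffled[4:]

# Step 2: fit the scaler (mean and standard deviation) on the training set only.
train_ages = [r["age"] for r in train]
mean_age = statistics.mean(train_ages)
std_age = statistics.pstdev(train_ages)

def scale(row):
    """Return a copy of row with age standardized by the training statistics."""
    return {**row, "age": (row["age"] - mean_age) / std_age}

train_scaled = [scale(r) for r in train]
# Step 4: apply the same training-fitted scaler to the testing set.
test_scaled = [scale(r) for r in test]
```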

# Modeling

Modeling is the fifth step in the Data Science Method. The following will be performed in this step:

1. Fit Models with Training Data Set
2. Review Model Outcomes — Iterate over additional models as needed.
3. Identify the Final Model

In [1]:
#load python packages
import os
import pandas as pd
import pandas.api.types as ptypes
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
import warnings 
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv("../Data_Wrangle_EDA/data/Cap3_step23_output.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,User-ID,Book-Rating,Location,Age
0,13715,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,85526.0,0.0,"victoria, british columbia, canada",36.0
1,13716,804106304,The Joy Luck Club,Amy Tan,1994,Prentice Hall (K-12),85526.0,0.0,"victoria, british columbia, canada",36.0
2,13717,786868716,The Five People You Meet in Heaven,Mitch Albom,2003,Hyperion,85526.0,0.0,"victoria, british columbia, canada",36.0
3,13718,60929790,One Hundred Years of Solitude,Gabriel Garcia Marquez,1998,Perennial,85526.0,0.0,"victoria, british columbia, canada",36.0
4,13719,452282152,Girl with a Pearl Earring,Tracy Chevalier,2001,Plume Books,85526.0,7.0,"victoria, british columbia, canada",36.0


In [3]:
df = df.drop(["Unnamed: 0"], axis=1)
df.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,User-ID,Book-Rating,Location,Age
0,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,85526.0,0.0,"victoria, british columbia, canada",36.0
1,804106304,The Joy Luck Club,Amy Tan,1994,Prentice Hall (K-12),85526.0,0.0,"victoria, british columbia, canada",36.0
2,786868716,The Five People You Meet in Heaven,Mitch Albom,2003,Hyperion,85526.0,0.0,"victoria, british columbia, canada",36.0
3,60929790,One Hundred Years of Solitude,Gabriel Garcia Marquez,1998,Perennial,85526.0,0.0,"victoria, british columbia, canada",36.0
4,452282152,Girl with a Pearl Earring,Tracy Chevalier,2001,Plume Books,85526.0,7.0,"victoria, british columbia, canada",36.0


In [4]:
df.shape

(429486, 9)

In [5]:
df.tail()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,User-ID,Book-Rating,Location,Age
429481,263827461,A Poor Relation (Historical Romance: Regency),Joanna Maitland,2001,Harlequin Mills &amp; Boon Ltd,163759.0,5.0,"abertillery, wales, united kingdom",37.0
429482,263816575,Mistress of Madderlea (Historical Romance: Reg...,Mary Nichols,1999,Harlequin Mills &amp; Boon Ltd,163759.0,5.0,"abertillery, wales, united kingdom",37.0
429483,440222974,A Fire in Heaven,Annee Carter,1998,Dell Publishing Company,163759.0,5.0,"abertillery, wales, united kingdom",37.0
429484,373059191,Mr. Easy (Man Of The Month) (Silhouette Desir...,Cait London,1995,Silhouette,163759.0,4.0,"abertillery, wales, united kingdom",37.0
429485,373760930,Groom Candidate (Man Of The Month/The Tallchi...,Cait London,1997,Silhouette,163759.0,4.0,"abertillery, wales, united kingdom",37.0


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 429486 entries, 0 to 429485
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   ISBN                 429486 non-null  object 
 1   Book-Title           429486 non-null  object 
 2   Book-Author          429486 non-null  object 
 3   Year-Of-Publication  429486 non-null  int64  
 4   Publisher            429486 non-null  object 
 5   User-ID              429486 non-null  float64
 6   Book-Rating          429486 non-null  float64
 7   Location             429486 non-null  object 
 8   Age                  429486 non-null  float64
dtypes: float64(3), int64(1), object(5)
memory usage: 29.5+ MB


In [7]:
df_reviews = df[['User-ID','ISBN','Book-Rating']]
df_reviews.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,85526.0,2005018,0.0
1,85526.0,804106304,0.0
2,85526.0,786868716,0.0
3,85526.0,60929790,0.0
4,85526.0,452282152,7.0


In [8]:
df_reviews.shape

(429486, 3)

# The full set of 429,486 ratings is too large for this machine's memory, so a random sample of 100,000 ratings is used.

In [9]:
df_sample = df_reviews.sample(n=100000, random_state=1)
df_sample.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
124756,13093.0,345441109,0.0
261251,105374.0,345404114,0.0
11937,230522.0,590407201,10.0
271158,259829.0,886771528,0.0
220988,222050.0,486272842,10.0


In [10]:
df_sample.rename(columns = {'User-ID' : 'userID', 'ISBN' : 'itemID', 'Book-Rating' : 'rating'}, inplace=True)
df_sample.head()

Unnamed: 0,userID,itemID,rating
124756,13093.0,345441109,0.0
261251,105374.0,345404114,0.0
11937,230522.0,590407201,10.0
271158,259829.0,886771528,0.0
220988,222050.0,486272842,10.0


In [11]:
df_sample['bookId']=pd.factorize(df_sample['itemID'].tolist())[0]

In [12]:
df_sample.nunique()

userID     1279
itemID    59338
rating       11
bookId    59338
dtype: int64
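
The `bookId` column comes from `pd.factorize`, which assigns each distinct value an integer code in order of first appearance. That is why `itemID` and `bookId` have the same number of unique values. A standard-library sketch of the same idea follows; the `factorize` helper and the repeated ISBN list are illustrative, not the pandas implementation.

```python
def factorize(values):
    """Map each distinct value to an integer code in order of first appearance."""
    code_of = {}
    codes = []
    for value in values:
        if value not in code_of:
            code_of[value] = len(code_of)  # next unused code
        codes.append(code_of[value])
    return codes, list(code_of)

# Illustrative ISBN strings; the repeats share the code of their first occurrence.
isbns = ["0590407201", "0486272842", "0449223604", "0590407201", "0486272842"]
codes, uniques = factorize(isbns)
```

Keeping `uniques` around makes it possible to map a code back to its ISBN with `uniques[code]`.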

In [13]:
df_sample['rating'].unique()

array([ 0., 10.,  9.,  8.,  7.,  5.,  6.,  4.,  3.,  2.,  1.])

# Using scikit-surprise

## 1. NormalPredictor

In [14]:
from surprise import NormalPredictor
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate

# A reader is still needed, but only the rating_scale param is required.
reader = Reader(rating_scale=(0, 10))

# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(df_sample[['userID', 'bookId', 'rating']], reader)

# We can now use this dataset as we please, e.g. calling cross_validate
cross_validate(NormalPredictor(), data, cv=2)

{'test_rmse': array([4.5022369, 4.482062 ]),
 'test_mae': array([3.37273763, 3.35256007]),
 'fit_time': (0.1093745231628418, 0.13977956771850586),
 'test_time': (0.5185461044311523, 0.7812104225158691)}
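
The `test_rmse` and `test_mae` arrays hold one value per fold. As a reminder of what the two metrics measure, here is a small sketch that computes them from (true rating, estimated rating) pairs; the pairs are made up for illustration.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return math.sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)

# Hypothetical (true, estimated) ratings on the 0-10 scale.
pairs = [(0.0, 3.0), (10.0, 6.0), (7.0, 7.0), (0.0, 5.0)]
rmse_value = rmse(pairs)
mae_value = mae(pairs)
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE does.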

## 2. SVD with 3-fold cross-validation

In [15]:
from surprise import SVD
from surprise import accuracy
from surprise.model_selection import KFold

# define a cross-validation iterator
kf = KFold(n_splits=3)

algo_SVD = SVD()

for trainset, testset in kf.split(data):

    # train and test algorithm.
    algo_SVD.fit(trainset)
    predictions = algo_SVD.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)

RMSE: 3.2226
RMSE: 3.1964
RMSE: 3.2169
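
Each RMSE line above comes from one fold: the model is fit on two thirds of the ratings and scored on the remaining third. The generator below is a rough sketch of how such folds can be built from row indices. It is an assumption-level illustration (the `k_fold_indices` name and the round-robin assignment are mine), not Surprise's `KFold` implementation.

```python
import random

def k_fold_indices(n_items, n_splits, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(n_items=10, n_splits=3))
```

Every row lands in exactly one test fold, so each rating is used for evaluation once. Surprise's `KFold` also accepts a `random_state` argument, which would make the fold RMSEs repeatable across runs.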


## 3. SVD with 3-fold cross-validation and GridSearchCV

In [16]:
from surprise import SVD
from surprise.model_selection import GridSearchCV

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

3.22384952699546
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}
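
A grid search tries every combination of the listed values. The sketch below enumerates the combinations in `param_grid` with `itertools.product` to show how many fits the search above implies; it only counts configurations and does not train anything.

```python
from itertools import product

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}

# Cartesian product of the value lists, one dict per configuration.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 3
n_fits = len(combinations) * n_folds  # each configuration is fit once per fold
```

With two values for each of three parameters there are 8 configurations and 24 fits. The best grid score (3.2238) is no lower than the default-parameter fold scores reported earlier (3.1964 to 3.2226), so on this sample the grid does not appear to improve on the defaults.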


## 4. User-based K-Nearest Neighbors with cosine similarity and 3-fold cross-validation

In [17]:
from surprise import KNNBasic
from surprise.model_selection import KFold

sim_options = {'name': 'cosine',
               'min_support' : 1,
               'user_based': True    # compute similarities between users
               #'user_based': False  # compute  similarities between items
               }
algo_KNN = KNNBasic(sim_options=sim_options)

# define a cross-validation iterator
kf = KFold(n_splits=3)

# The columns must correspond to user id, item id and ratings (in that order).
data_small = Dataset.load_from_df(df_sample[['userID', 'bookId', 'rating']], reader)

for trainset, testset in kf.split(data_small):

    # train and test algorithm.
    algo_KNN.fit(trainset)
    predictions = algo_KNN.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 3.6401
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 3.6448
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 3.6126
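
With `'name': 'cosine'` and `'user_based': True`, the similarity between two users is the cosine of their rating vectors restricted to the books both have rated. The function below is a minimal sketch of that calculation. The `cosine_sim` helper is hypothetical, and returning 0.0 when a vector is all zeros is a choice made here for the sketch, not documented library behavior.

```python
import math

def cosine_sim(ratings_u, ratings_v, min_support=1):
    """Cosine similarity of two users over their co-rated items.

    ratings_u and ratings_v map item id -> rating.
    """
    common = ratings_u.keys() & ratings_v.keys()
    if len(common) < min_support:
        return 0.0  # too few co-rated items to compare
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = math.sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings_v[i] ** 2 for i in common))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat all-zero vectors as dissimilar
    return dot / (norm_u * norm_v)

# Hypothetical users: u rates books 1 and 2; v rates books 1, 2 and 3.
user_u = {1: 8.0, 2: 6.0}
user_v = {1: 4.0, 2: 3.0, 3: 10.0}
sim_uv = cosine_sim(user_u, user_v)
sim_none = cosine_sim(user_u, {3: 10.0})
```

In the example, `user_v` rates the two shared books at exactly half of `user_u`'s values, so the vectors point in the same direction and the similarity is 1. This shows that cosine similarity ignores differences in rating magnitude.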


# Model Comparison

| Model | RMSE (last fold) |
| --- | --- |
| Normal Predictor | 4.482 |
| SVD | 3.2169 |
| K Nearest Neighbor | 3.6126 |
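
The values in the table match the final fold of each run. As a cross-check, the sketch below averages the per-fold RMSEs printed earlier in the notebook and picks the model with the lowest mean; the ranking is the same.

```python
from statistics import mean

# Per-fold RMSEs copied from the outputs above.
fold_rmse = {
    "Normal Predictor": [4.5022369, 4.482062],
    "SVD": [3.2226, 3.1964, 3.2169],
    "K Nearest Neighbor": [3.6401, 3.6448, 3.6126],
}

mean_rmse = {model: mean(scores) for model, scores in fold_rmse.items()}
best_model = min(mean_rmse, key=mean_rmse.get)  # lowest mean RMSE wins

for model, score in sorted(mean_rmse.items(), key=lambda item: item[1]):
    print(f"{model}: {score:.4f}")
```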

# Predictions

SVD achieved the lowest RMSE of the three models, so it is used to generate predictions.

In [14]:
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate
from surprise import SVD
from surprise import accuracy
from surprise.model_selection import KFold

# A reader is still needed, but only the rating_scale param is required.
reader = Reader(rating_scale=(0, 10))

# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(df_sample[['userID', 'bookId', 'rating']], reader)

#algo_SVD = SVD(n_epochs = 10, lr_all = 0.005, reg_all = 0.4)
algo_SVD = SVD()

trainset = data.build_full_trainset()
algo_SVD.fit(trainset)


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x2246e9ad588>

In [15]:
df_sample[df_sample['rating'] > 4.0]

Unnamed: 0,userID,itemID,rating,bookId
11937,230522.0,0590407201,10.0,2
220988,222050.0,0486272842,10.0,4
253147,78783.0,0449223604,9.0,11
28167,31556.0,0061099155,8.0,14
277297,46443.0,0836270045,7.0,17
...,...,...,...,...
379317,81977.0,0374525641,7.0,1732
43664,78973.0,0394709306,7.0,920
285004,158254.0,0156997789,8.0,2443
242305,257204.0,0886779804,10.0,59330


In [16]:
algo_SVD.predict(uid = 230522.0, iid = 2)

Prediction(uid=230522.0, iid=2, r_ui=None, est=9.292148801445823, details={'was_impossible': False})

In [17]:
algo_SVD.predict(uid = 81977.0, iid = 1732)

Prediction(uid=81977.0, iid=1732, r_ui=None, est=4.626253442228012, details={'was_impossible': False})

In [18]:
algo_SVD.predict(uid = 158254.0, iid = 2443)

Prediction(uid=158254.0, iid=2443, r_ui=None, est=5.970619555023497, details={'was_impossible': False})

In [19]:
algo_SVD.predict(uid = 7346.0, iid = 19409)

Prediction(uid=7346.0, iid=19409, r_ui=None, est=7.8785364605271315, details={'was_impossible': False})
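
Each `est` value above comes from the SVD prediction rule described in the Surprise documentation: the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors. `predict` then clips the result to the rating scale by default. The sketch below restates that rule with made-up numbers; `svd_estimate` and all parameter values are hypothetical, not values from the fitted `algo_SVD`.

```python
def svd_estimate(mu, b_u, b_i, p_u, q_i, rating_scale=(0, 10)):
    """Estimate a rating as mu + b_u + b_i + q_i . p_u, clipped to the scale."""
    est = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    low, high = rating_scale
    return min(high, max(low, est))

# Hypothetical parameters for one user/book pair.
typical = svd_estimate(mu=2.5, b_u=1.0, b_i=0.5, p_u=[0.5, -1.0], q_i=[1.0, 0.5])

# A raw estimate above the scale is clipped to the upper bound.
clipped = svd_estimate(mu=9.5, b_u=1.0, b_i=0.5, p_u=[0.0, 0.0], q_i=[0.0, 0.0])
```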

In [None]:
#testset = trainset.build_anti_testset()
#predictions = algo_SVD.test(testset)
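
The commented-out `build_anti_testset` call would produce every (user, book) pair that does not appear in the training data, which is what is needed to recommend books a user has not rated yet. Below is a small standard-library sketch of the idea, filling the unknown rating with the global mean. The helper is my own illustration and mirrors the Surprise method only in name.

```python
def build_anti_testset(ratings, fill=None):
    """Return (user, item, fill) for every user/item pair absent from ratings."""
    if fill is None:
        fill = sum(r for _, _, r in ratings) / len(ratings)  # global mean
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    seen = {(u, i) for u, i, _ in ratings}
    return [(u, i, fill) for u in users for i in items if (u, i) not in seen]

# Hypothetical ratings: user "a" rated books 1 and 2, user "b" rated book 2.
ratings = [("a", 1, 8.0), ("a", 2, 0.0), ("b", 2, 10.0)]
anti_testset = build_anti_testset(ratings)
```

The size of this set grows with users times items. For the sample here, 1,279 users and 59,338 books give about 75.9 million candidate pairs, which may be why the two lines are left commented out.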

In [20]:
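from surprise.model_selection import train_test_split

# Note: algo_SVD was already fit on the full dataset above, so these test rows
# were seen during training and the RMSE below is an in-sample figure.
trainset_new, testset = train_test_split(data, test_size=.2)
predictions = algo_SVD.test(testset)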

In [21]:
# Compute and print Root Mean Squared Error
accuracy.rmse(predictions, verbose=True)

RMSE: 0.7669


0.7668843311773347
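
Because `algo_SVD` was fit on `data.build_full_trainset()` before the split, the 20% test rows were part of its training data. The 0.7669 figure is therefore best read as an in-sample error, not a held-out one, and it sits far below the cross-validated RMSE of about 3.2 reported earlier. The toy sketch below shows the effect with a deliberately extreme "model" that memorizes its training ratings; the predictor and data are hypothetical and stand in for SVD only to make the leak visible.

```python
import math

def fit_memorizer(train):
    """'Train' by memorizing each (user, item) rating; fall back to the train mean."""
    table = {(u, i): r for u, i, r in train}
    fallback = sum(r for _, _, r in train) / len(train)
    return lambda u, i: table.get((u, i), fallback)

def rmse(predict, rows):
    """Root mean squared error of predict over (user, item, rating) rows."""
    return math.sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in rows) / len(rows))

ratings = [("a", 1, 8.0), ("a", 2, 0.0), ("b", 1, 10.0), ("b", 3, 6.0), ("c", 2, 4.0)]
train, test = ratings[:4], ratings[4:]

leaky = fit_memorizer(ratings)   # fit on everything, including the test row
honest = fit_memorizer(train)    # fit on the training split only

leaky_rmse = rmse(leaky, test)    # test row was memorized during fitting
honest_rmse = rmse(honest, test)  # test row is genuinely unseen
```

In the notebook, calling `algo_SVD.fit(trainset_new)` before `algo_SVD.test(testset)` would give a held-out estimate that is comparable to the cross-validation results.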

# Top 5 predictions for users in testset

In [22]:
from collections import defaultdict

n = 5

# First map the predictions to each user.
top_n = defaultdict(list)
for uid, iid, true_r, est, _ in predictions:
    top_n[uid].append((iid, est))

# Then sort the predictions for each user and retrieve the n highest ones.
for uid, user_ratings in top_n.items():
    user_ratings.sort(key=lambda x: x[1], reverse=True)
    top_n[uid] = user_ratings[:n]

# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])
    

236283.0 [55979, 40678, 33799, 23370, 40353]
172742.0 [42190, 9828, 2819, 24345, 30735]
157273.0 [6111, 1361, 1541, 37821, 32732]
141901.0 [14945, 24168, 46574, 5717, 2290]
130474.0 [35755, 4377, 294, 13709, 25685]
81492.0 [9606, 11301, 39154, 27084, 50150]
224349.0 [7020, 32205, 1362, 7117, 6775]
56856.0 [17870, 946, 6053, 6618, 22520]
94951.0 [40293, 3391, 58782, 28074, 39167]
65258.0 [58946, 38033, 46530, 45234, 43763]
184299.0 [17805, 50910, 50135, 26268, 39102]
105979.0 [1592, 3702, 7846, 7178, 19654]
52614.0 [37292, 29396, 2861, 32974, 43972]
231210.0 [3347, 33293, 13578, 8305, 11]
147141.0 [3481, 9307, 21597, 3885, 3442]
102275.0 [8648, 1771, 6289, 1059, 2195]
36606.0 [23725, 16135, 47350, 45551, 51362]
170415.0 [34346, 43481, 1707, 15760, 5287]
271705.0 [12583, 42660, 8211, 40106, 12609]
98686.0 [37948, 9044, 44759, 37182, 58092]
69697.0 [4929, 35523, 47021, 30892, 29647]
155027.0 [44043, 29546, 57232, 23151, 52012]
254241.0 [56903, 35056, 16890, 50357, 1272]
108352.0 [42611, 1

78553.0 [28101, 7396, 45280, 50801, 47942]
238864.0 [55273, 56327, 49492, 55839, 1419]
58911.0 [55309, 57381, 11198, 46547, 12083]
2766.0 [41789, 14138, 33627, 5364, 33937]
257700.0 [20114, 47261, 49140, 4593]
182993.0 [7144, 4258, 8272, 48368, 11053]
169682.0 [13182, 27957, 44650, 48978, 5463]
151098.0 [13166, 26993, 28486, 26210, 5193]
32721.0 [24509, 40138, 41013, 56057, 12536]
267409.0 [15141, 1587, 14443, 5616, 20895]
164533.0 [44827, 4687, 1312, 7706, 52086]
154992.0 [6557, 16964, 13422, 19135, 18589]
227520.0 [43633, 23964, 11907, 3425, 5789]
145165.0 [11113, 22781, 11241, 54187, 7783]
113983.0 [29427, 53539, 19911, 41020, 38783]
114178.0 [54697, 44392, 20289, 301, 30450]
250405.0 [49260, 278, 23407, 10226, 21933]
249924.0 [562, 30918, 3906, 17826, 7086]
32122.0 [724, 3744]
162311.0 [2776, 46964, 16527, 39936, 33633]
198621.0 [13665, 8175, 354, 58395, 17703]
134837.0 [17359, 25417, 47739, 31478, 12029]
35433.0 [27232, 41977, 15400, 37991, 19384]
223154.0 [50198, 6086, 3492, 1521

In [28]:
isbn = df_sample[df_sample['bookId'] == 23725]['itemID'].unique()
isbn[0]

'0740711660'

In [25]:
df.columns

Index(['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher',
       'User-ID', 'Book-Rating', 'Location', 'Age'],
      dtype='object')

In [30]:
df[df['ISBN'] == isbn[0]]['Book-Title'].unique()

array(["Don't Roll Your Eyes At Me, Young Man!  A Zits Sketchbook 3"],
      dtype=object)

## Recommendations for user 36606.0 are [23725, 16135, 47350, 45551, 51362]

In [39]:
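def recommended_books(id_list):
    """Print the title and a rating for each bookId in id_list."""
    titles = []
    ratings = []
    for bookid in id_list:
        # Map the factorized bookId back to its ISBN.
        isbn = df_sample[df_sample['bookId'] == bookid]['itemID'].unique()
        matches = df[df['ISBN'] == isbn[0]]
        titles.append(matches['Book-Title'].unique()[0])
        # Note: this is the first rating recorded for the ISBN across all users,
        # not necessarily the rating given by the user being recommended to.
        ratings.append(matches['Book-Rating'].unique()[0])
    print(titles)
    print(ratings)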

In [45]:
df[df['User-ID'] == 36606.0]['Book-Rating'].describe()

count    1547.000000
mean        0.892696
std         2.491666
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max        10.000000
Name: Book-Rating, dtype: float64

In [40]:
print("Top 5 Books recommended for user 36606.0 are:\n")
recommended_books([23725, 16135, 47350, 45551, 51362])

Top 5 Books recommended for user 36606.0 are:

["Don't Roll Your Eyes At Me, Young Man!  A Zits Sketchbook 3", 'Little Mermaid', 'Tatterhood and Other Tales: Stories of Magic and Adventure', 'A, My Name Is Alice', 'Seeing Is Believing']
[10.0, 10.0, 10.0, 8.0, 5.0]


## Recommendations for user 98686.0 are [37948, 9044, 44759, 37182, 58092]

In [44]:
df[df['User-ID'] == 98686.0]['Book-Rating'].describe()

count    129.000000
mean       4.449612
std        3.832348
min        0.000000
25%        0.000000
50%        6.000000
75%        8.000000
max       10.000000
Name: Book-Rating, dtype: float64

In [41]:
print("Top 5 Books recommended for user 98686.0 are:\n")
recommended_books([37948, 9044, 44759, 37182, 58092])

Top 5 Books recommended for user 98686.0 are:

['Killing Mr. Griffin (Laurel Leaf Books)', 'A Thin Dark Line', "How to Write Horror Fiction (Writer's Digest Genre Writing Series)", 'The Psychoanalytic Study of the Child (Psychoanalytic Study of the Child)', 'At the Heart of Darkness: Witchcraft, Black Magic and Satanism Today']
[0.0, 0.0, 8.0, 7.0, 5.0]


## Recommendations for user 236283.0 are [55979, 40678, 33799, 23370, 40353]

In [43]:
df[df['User-ID'] == 236283.0]['Book-Rating'].describe()

count    1212.000000
mean        1.844884
std         3.761999
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max        10.000000
Name: Book-Rating, dtype: float64

In [42]:
print("Top 5 Books recommended for user 236283.0 are:\n")
recommended_books([55979, 40678, 33799, 23370, 40353])

Top 5 Books recommended for user 236283.0 are:

["Toot &amp; Puddle: I'll Be Home for Christmas : Picture Book #5 (Toot &amp; Puddle (Hardcover))", 'Four Past Midnight', 'Snow White and the Seven Dwarfs', "Mia's Sun Hat (A Start to Read Book)", 'Kin Dread']
[10.0, 0.0, 0.0, 10.0, 0.0]
