-----

# **BrainStation Bootcamp:**
### **Amazon Book Recommendation System**
* Author: Rurick Alejandro Granados Figueredo
* Contact: rurickgrfi@gmail.com
* Date: July 31, 2023
-----------

## **Introduction**
The core of this project revolves around creating a comprehensive book recommendation system, encompassing user-independent modeling, content-based recommendations, collaborative filtering, and matrix factorization techniques. These methodologies collectively aim to provide personalized suggestions, even for new users. Through an in-depth analysis of the entire book database, these models identify prevalent patterns, enabling them to propose book recommendations that align with users' preferences, whether based on book content or historical reading habits. 

---


## **Table of Contents** 
---
- [1. Loading Data and Libraries Setup](#_1)  
- [2. Data for the Recommendation System](#_2)
    - [2.1 Data Transformation](#_2.1)
- [3. Modeling Recommender Systems](#_3)
    - [3.1 User Independent System](#_3.1)
    - [3.2 Content Based Recommendations](#_3.2)
    - [3.3 Collaborative Based Recommendations](#_3.3)
    - [3.4 Matrix Factorization Methods and Latent Features](#_3.4)
- [4. Recommender Systems Evaluation ](#_4)    


-----

## 1. Loading Data and Libraries Setup <a class="anchor" id="_1"></a>

In [1]:
# Import libraries needed for the project
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Libraries for Content Based Filtering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Libraries for FunkSVD
from surprise import Dataset
from surprise.reader import Reader
from surprise.prediction_algorithms.matrix_factorization import SVD as FunkSVD


In [2]:
# Load the pickled DataFrame
book_rating = pd.read_pickle('/Users/rurickgranados/Desktop/Capstone.nosync/book_rating.pkl')

## 2. Data for the Recommendation System <a class="anchor" id="_2"></a>

The full dataset proved too large to process for the recommendation system, so it had to be reduced. After several iterations, a 10% sample was chosen as a workable size for the project.

In [3]:
# Take a 10% sample of the merged DataFrame
# Note: replace=True samples with replacement, so the same row can be drawn more than once
book_rating_s = book_rating.sample(frac=0.1, replace=True, random_state=1)

# Check the shape of the sample
book_rating_s.shape

(178835, 12)
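
The effect of the `replace` argument can be illustrated with a minimal sketch on a hypothetical toy frame (`toy` is invented here and is not part of the project data). With `replace=True` the same row may appear several times in the sample; with `replace=False` every sampled row is distinct. Whether the project should switch is a judgment call, since it would change the numbers reported downstream.

```python
import pandas as pd

# Hypothetical toy frame standing in for book_rating (assumption, not project data)
toy = pd.DataFrame({"RatingID": range(100)})

# Sampling without replacement: every sampled row is distinct
without = toy.sample(frac=0.1, replace=False, random_state=1)

# Sampling with replacement: the same row may be drawn more than once
with_repl = toy.sample(frac=0.1, replace=True, random_state=1)

print(len(without), without.index.is_unique)
print(len(with_repl), with_repl.index.is_unique)
```

Both samples have the same number of rows; only the uniqueness of the sampled rows can differ.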

In [4]:
book_rating_s.info()

<class 'pandas.core.frame.DataFrame'>
Index: 178835 entries, 128763 to 970894
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   BookTitle         178835 non-null  object 
 1   Book_Description  178835 non-null  object 
 2   Book_Category     178835 non-null  object 
 3   Published_Date    178835 non-null  object 
 4   Main_Author       178835 non-null  object 
 5   RatingID          178835 non-null  object 
 6   UserID            178835 non-null  object 
 7   Review_Score      178835 non-null  float64
 8   Review_Summary    178835 non-null  object 
 9   Review_Text       178835 non-null  object 
 10  Review_Counts     178835 non-null  int64  
 11  Average_Rating    178835 non-null  float64
dtypes: float64(2), int64(1), object(9)
memory usage: 17.7+ MB


### 2.1 Data Transformation <a class="anchor" id="_2.1"></a>


Creating numerical values for the `UserID` and `BookTitle` columns

In [5]:
# assign a numeric code to each UserID
book_rating_s["UserID_"] = book_rating_s["UserID"].astype("category").cat.codes

# assign a numeric code to each BookTitle
book_rating_s["BookTitle_"] = book_rating_s["BookTitle"].astype("category").cat.codes

book_rating_s.sample(3)

Unnamed: 0,BookTitle,Book_Description,Book_Category,Published_Date,Main_Author,RatingID,UserID,Review_Score,Review_Summary,Review_Text,Review_Counts,Average_Rating,UserID_,BookTitle_
1072805,"Punk Rock Aerobics: 75 Killer Moves, 50 Punk C...",* Would you flee in terror if confronted with ...,Health & Fitness,2004-01-08,Maura Jasper,B000685KM4,A2GAOEOROMLZAI,5.0,Fun moves + DIY attitude,For the aerobic phobic: I checked out the clas...,18,4.722222,49106,28144
432109,Leaves of grass,Leaves of Grass is a poetry collection by the ...,Poetry,2014-09-09,Walt Whitman,B00087NWG4,A170IBIG8ONZNL,2.0,MISREPRESENTATION: This is 1892 Deathbed Edition!,Although the poems are beautiful... and I cert...,552,4.270073,6559,20424
980557,Codependent No More - How To Stop Controlling ...,"The healing touchstone of millions, this moder...",Self-Help,2009-06-10,Melody Beattie,B000NUJ80A,A15ADZKNG3MNDK,4.0,Buy the New Codepency by the same author,"It is a great book, but the updated version is...",393,4.443878,4969,8054


## 3. Modeling Recommender Systems <a class="anchor" id="_3"></a>


Recommender systems analyze user preferences and historical behavior to suggest items a user is likely to be interested in. In this project, we explore four approaches: user-independent systems, content-based recommendations, collaborative recommendations, and matrix factorization methods.

### 3.1 User Independent System <a class="anchor" id="_3.1"></a>

In situations where we lack any user-specific information or knowledge of their preferences, we can opt to showcase the most popular books based on a defined review threshold. However, when it comes to newly released books, there might be instances where we choose to present them to users without strictly adhering to the review threshold.

In [6]:
# create a new df top_rated with the BookTitle_, BookTitle, Average_Rating, and Review_Counts columns
top_rated = book_rating_s[["BookTitle_","BookTitle","Average_Rating","Review_Counts"]]
top_rated.head(3)


Unnamed: 0,BookTitle_,BookTitle,Average_Rating,Review_Counts
128763,40686,The Princess Bride,4.410299,1210
494478,40308,The Pdr Family Guide to Prescription Drugs 8th Ed,4.0,4
473572,19392,Joy in the Morning,4.228571,178


In [7]:
top_rated.shape

(178835, 4)

In [8]:
# keep one row per book by dropping duplicate BookTitle_ values
top_rated = top_rated.drop_duplicates(subset=['BookTitle_'], keep='first')
top_rated.shape

(48544, 4)

In [9]:
top_rated

Unnamed: 0,BookTitle_,BookTitle,Average_Rating,Review_Counts
128763,40686,The Princess Bride,4.410299,1210
494478,40308,The Pdr Family Guide to Prescription Drugs 8th Ed,4.000000,4
473572,19392,Joy in the Morning,4.228571,178
493986,29161,Return to Promise (Heart of Texas),3.750000,16
840772,40614,The Power of a Praying Parent,4.711538,104
...,...,...,...,...
290473,19906,Knopf Guide: Ireland (Knopf Guides),5.000000,1
1783011,23513,Momma Why?,4.166667,6
1442686,1726,"A history of the Republican party,",5.000000,3
462524,7824,Classic Four Block Applique Quilts,4.600000,5


In [10]:
# Finding the top rated books
top_rated = top_rated.sort_values(by=['Average_Rating'], ascending=False)
top_rated['BookTitle'].head(10)

557540             Criminology: A Sociological Understanding
633806     Williams-Sonoma Essentials Of Roasting Recipes...
1209008    The Why, What, and How of Management Innovatio...
1210614               The Music Box: The Story of Cristofori
1310918    The Translator's Handbook: With Special Refere...
620129           How Firm a Foundation in Scripture and Song
1063508    Awakened to a Calling: Reflections on the Voca...
1175101    Fools Are Everywhere: The Court Jester Around ...
666002                Carmina Gadelica: Hymns & Incantations
490700     The Problem of a Chinese Aesthetic (Meridian: ...
Name: BookTitle, dtype: object

In [11]:
# These are the top rated books. However, they are not considering the number of reviews per book
top_rated.head(5)

Unnamed: 0,BookTitle_,BookTitle,Average_Rating,Review_Counts
557540,9138,Criminology: A Sociological Understanding,5.0,3
633806,47510,Williams-Sonoma Essentials Of Roasting Recipes...,5.0,6
1209008,42920,"The Why, What, and How of Management Innovatio...",5.0,1
1210614,39631,The Music Box: The Story of Cristofori,5.0,5
1310918,42350,The Translator's Handbook: With Special Refere...,5.0,2


- ##### **Thresholding**

In [12]:
threshold = 1000  # Set the threshold for the minimum number of reviews

# Filter the 'top_rated' DataFrame to include only books with review counts equal to or above the threshold
book_w_many_reviews_df = top_rated[top_rated['Review_Counts'] >= threshold]

# Sort the filtered DataFrame by 'Average_Rating' in descending order to get the top-rated books
top_rated_v2 = book_w_many_reviews_df.sort_values(by=['Average_Rating'], ascending=False)

# Display the top 5 books from the sorted DataFrame
top_rated_v2.head(5)


Unnamed: 0,BookTitle_,BookTitle,Average_Rating,Review_Counts
955697,15865,Harry Potter & the Prisoner of Azkaban,4.769741,1513
1187851,28891,Redeeming Love,4.745455,1210
112582,22110,Man's Search for Meaning,4.735404,2424
37349,1557,A Tree Grows in Brooklyn,4.714389,2776
467957,15867,Harry Potter and The Sorcerer's Stone,4.687938,3729


----

### 3.2 Content Based Recommendations <a class="anchor" id="_3.2"></a>

These systems aim to provide recommendations for new items using the concept that if you enjoy a specific book, you are likely to appreciate books with similar descriptions.
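
As a minimal sketch of this idea, the example below turns three hypothetical descriptions (invented for illustration, not taken from the dataset) into TF-IDF vectors and compares them with cosine similarity. The two descriptions that share vocabulary score higher than the unrelated one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical descriptions, invented for illustration only
descriptions = [
    "A young wizard attends a school of magic",
    "A wizard learns magic spells at school",
    "A practical guide to prescription drugs",
]

# Each description becomes a TF-IDF weighted term vector
tfidf = TfidfVectorizer(stop_words="english").fit_transform(descriptions)

# Pairwise cosine similarity between all descriptions
sims = cosine_similarity(tfidf)

print(sims.round(2))
```

The diagonal is 1 because every description is identical to itself, and the third description shares no non-stop-word terms with the first, so their similarity is 0.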

In [13]:
# create a new df for the content based recommendation system
content_df = book_rating_s[["BookTitle_","BookTitle","Book_Description","Average_Rating","Review_Counts"]]
content_df = content_df.drop_duplicates(subset=['BookTitle_'], keep='first')

In [14]:
# update the index of the content_df
content_df = content_df.reset_index(drop=True)
content_df.head(3)

Unnamed: 0,BookTitle_,BookTitle,Book_Description,Average_Rating,Review_Counts
0,40686,The Princess Bride,"In a twenty-fifth anniversary, behind-the-scen...",4.410299,1210
1,40308,The Pdr Family Guide to Prescription Drugs 8th Ed,"Based on the Physicians' Desk Reference, the n...",4.0,4
2,19392,Joy in the Morning,A timeless classic is reborn! From Betty Smith...,4.228571,178


In [15]:
# create a dataframe with book title and description
df_descriptions = content_df[['BookTitle', 'Book_Description']]
df_descriptions.head(5)

Unnamed: 0,BookTitle,Book_Description
0,The Princess Bride,"In a twenty-fifth anniversary, behind-the-scen..."
1,The Pdr Family Guide to Prescription Drugs 8th Ed,"Based on the Physicians' Desk Reference, the n..."
2,Joy in the Morning,A timeless classic is reborn! From Betty Smith...
3,Return to Promise (Heart of Texas),"Come back again to Promise, Texas, in this cla..."
4,The Power of a Praying Parent,Offers insight and sample prayers for parents ...


In [16]:
# create TF-IDF matrix
vectorizer = TfidfVectorizer(stop_words = "english", min_df=2)
content_df['Book_Description'] = content_df['Book_Description'].fillna("")

TF_IDF_matrix = vectorizer.fit_transform(content_df['Book_Description'])

In [17]:
TF_IDF_matrix.shape

(48544, 56327)

In [18]:
# List the books whose title contains 'Harry Potter'
content_df[content_df['BookTitle'].str.contains('Harry Potter', na=False)].head(8)

Unnamed: 0,BookTitle_,BookTitle,Book_Description,Average_Rating,Review_Counts
537,15868,Harry Potter and the Chamber of Secrets,"Witchcraft, wizardry - fiction.",4.672855,1572
688,41796,The Sorcerer's Companion: A Guide to the Magic...,"In a revised edition of a best-selling book, f...",4.552239,67
1229,15865,Harry Potter & the Prisoner of Azkaban,"Through classroom activities, wizard rock conc...",4.769741,1513
1908,15867,Harry Potter and The Sorcerer's Stone,Celebrate 20 years of Harry Potter magic! Harr...,4.687938,3729
8195,30271,Science of Harry Potter: How Magic Really Works,Behind the magic of Harry Potter—a witty and i...,3.52381,21
11891,15869,Harry Potter and the Prisoner of Azkaban 2005 ...,"Now in its twentieth edition, a concise guide ...",4.333333,3
18806,15866,Harry Potter and Philosophy: If Aristotle Ran ...,Urging readers of the Harry Potter series to d...,4.192308,26
19587,41350,The Science of Harry Potter: How Magic Really ...,A look at the scientific principles underpinni...,3.52381,21


In [19]:
# Display the similarity between two books
book_1 = TF_IDF_matrix[(content_df['BookTitle'] == 'Harry Potter & the Prisoner of Azkaban').values,]
book_2 = TF_IDF_matrix[(content_df['BookTitle'] == "Harry Potter and The Sorcerer's Stone").values,]

print("Similarity:", cosine_similarity(book_1, book_2)) 

Similarity: [[0.33990507]]


In [20]:
# Display the similarity between a book and its French edition
book_1 = TF_IDF_matrix[(content_df['BookTitle'] == 'Harry Potter & the Prisoner of Azkaban').values,]
book_3 = TF_IDF_matrix[(content_df['BookTitle'] == "Harry Potter et le prisonnier d'Azkaban").values,]

print("Similarity:", cosine_similarity(book_1, book_3)) 

Similarity: [[0.16893991]]


In [21]:
# Define similarities 
similarities = cosine_similarity(TF_IDF_matrix, dense_output=False)

In [22]:
similarities.shape

(48544, 48544)

In [23]:
# Test with a sample book
content_df[content_df['BookTitle'] == 'Harry Potter & the Prisoner of Azkaban']

Unnamed: 0,BookTitle_,BookTitle,Book_Description,Average_Rating,Review_Counts
1229,15865,Harry Potter & the Prisoner of Azkaban,"Through classroom activities, wizard rock conc...",4.769741,1513


In [24]:
content_df.shape

(48544, 5)

In [25]:
# Get the column based upon the index
book_index = content_df[content_df['BookTitle'] == 'Harry Potter & the Prisoner of Azkaban'].index

# Create a dataframe with the book titles
sim_df = pd.DataFrame({'book':content_df['BookTitle'], 
                       'similarity': np.array(similarities[book_index, :].todense()).squeeze()})

In [26]:
# Return the top 10 most similar books
sim_df.sort_values(by='similarity', ascending=False).head(10)

Unnamed: 0,book,similarity
1229,Harry Potter & the Prisoner of Azkaban,1.0
26126,Harry Potter und der Gefangene von Azkaban,0.383321
44721,Harry Potter and the Sorcerer's Stone Movie Po...,0.369084
1908,Harry Potter and The Sorcerer's Stone,0.339905
688,The Sorcerer's Companion: A Guide to the Magic...,0.338103
30442,Critical Perspectives on Harry Potter,0.327859
41779,There's Something About Harry: A Catholic Anal...,0.283842
19587,The Science of Harry Potter: How Magic Really ...,0.259385
43978,An Unofficial Muggle's Guide to the Wizarding ...,0.253538
5023,Over the Hills & Far Away,0.229769


In [27]:
# Define the content-based recommender function
def content_recommender(title, books, similarities, vote_threshold=10):

    # Get the index of the book by its title (from the DataFrame passed in, not a global)
    book_index = books[books['BookTitle'] == title].index

    # Create a dataframe with the book titles, similarities, and review counts
    sim_df = pd.DataFrame(
        {'books': books['BookTitle'],
         'similarity': np.array(similarities[book_index, :].todense()).squeeze(),
         'Review_Counts': books['Review_Counts']
        })

    # Get the 10 most similar books with more than vote_threshold reviews
    top_books = sim_df[sim_df['Review_Counts'] > vote_threshold].sort_values(by='similarity', ascending=False).head(10)

    return top_books

In [28]:
# Test the recommender
similar_books = content_recommender("Harry Potter & the Prisoner of Azkaban", content_df, similarities, vote_threshold=1000)
similar_books.head(10)

Unnamed: 0,books,similarity,Review_Counts
1229,Harry Potter & the Prisoner of Azkaban,1.0,1513
1908,Harry Potter and The Sorcerer's Stone,0.339905,3729
628,Jane Eyre (New Windmill),0.052504,1189
0,The Princess Bride,0.03811,1210
718,The Hobbit There and Back Again,0.024913,3675
1203,Alice's Adventures in Wonderland,0.022018,2817
155,The Hitchhiker's Guide to the Galaxy,0.021032,2698
219,"The Hobbitt, or there and back again; illustra...",0.01937,3658
1841,Catch-22,0.017821,1603
216,Dracula (G. K. Hall (Large Print)),0.016899,1034


### 3.3 Collaborative Based Recommendations <a class="anchor" id="_3.3"></a>
Collaborative filtering represents a robust set of techniques that harness the collective choices made by user groups to generate predictions. 

In [29]:
# create a new df for the collaborative based recommendation system
collab_df = book_rating_s[["UserID","BookTitle","Review_Score"]]

# filter the collab_df to include only the users with at least 10 reviews
collab_df = collab_df.groupby("UserID").filter(lambda x: len(x) >= 10)

collab_df.head(5)

Unnamed: 0,UserID,BookTitle,Review_Score
493986,A1S3IN6CSTBPY6,Return to Promise (Heart of Texas),4.0
73745,A3J0OXB9KIC5SS,"Post Captain: Aubrey/Maturin Series, Book 2",3.0
149079,A18IZDWUIHLUC9,Brave New World,5.0
720044,AEKWD5WIFP6,The Tale of Benjamin Bunny,4.0
340522,AU6DIIDZK2OQM,Pride and Prejudice,3.0


In [30]:
# reset the index of the collab_df (the new index starts from 0)
collab_df = collab_df.reset_index(drop=True)

# map the user_id to a numeric value starting from 1 to the number of unique users
collab_df["UserID_"] = collab_df["UserID"].astype("category").cat.codes + 1

# map the book title to a numeric value starting from 1 to the number of unique books
collab_df["BookTitle_"] = collab_df["BookTitle"].astype("category").cat.codes + 1

collab_df.head(5)

Unnamed: 0,UserID,BookTitle,Review_Score,UserID_,BookTitle_
0,A1S3IN6CSTBPY6,Return to Promise (Heart of Texas),4.0,170,4813
1,A3J0OXB9KIC5SS,"Post Captain: Aubrey/Maturin Series, Book 2",3.0,501,4568
2,A18IZDWUIHLUC9,Brave New World,5.0,37,1007
3,AEKWD5WIFP6,The Tale of Benjamin Bunny,4.0,639,7303
4,AU6DIIDZK2OQM,Pride and Prejudice,3.0,734,4610


In [31]:
# keep only the numeric ID and rating columns for the utility matrix
collab_df_ = collab_df[["UserID_","BookTitle_","Review_Score"]]
collab_df_.head(5)

Unnamed: 0,UserID_,BookTitle_,Review_Score
0,170,4813,4.0
1,501,4568,3.0
2,37,1007,5.0
3,639,7303,4.0
4,734,4610,3.0


In [32]:
collab_df_.shape    

(14659, 3)

#### Utility Matrix

In [33]:
users = collab_df_['UserID_'].unique()
books = collab_df_['BookTitle_'].unique()

num_users = len(users)
num_books = len(books)

R = np.full((num_users, num_books), np.nan)

for row in collab_df_.itertuples():
    user = row[1]
    book = row[2]
    rating = row[3]
    R[user-1, book-1] = rating

    
R_df = pd.DataFrame(data=R, index=range(1, num_users+1), columns=range(1, num_books+1))

In [34]:
R_df

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,8515,8516,8517,8518,8519,8520,8521,8522,8523,8524
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
753,,,,4.0,,,,,,,...,,,,,,,,,,
754,,,,,,,,,,,...,,,,,,,,,,
755,,,,,,,,,,,...,,,,,,,,,,
756,,,,,,,,,,,...,,,,,,,,,,


This matrix comprises 757 rows, representing distinct users, and 8524 columns, corresponding to various books. As anticipated, the majority of the cells remain vacant, indicating that not all users have read or rated each book.
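
A minimal sketch of the same construction on hypothetical toy ratings (invented for illustration) uses `pivot_table` and measures how sparse the result is. One difference from the loop above is worth noting: the loop overwrites a cell when the same user and book appear more than once, whereas `aggfunc="mean"` here averages the repeated ratings. That choice is an assumption of this sketch, not the project's method.

```python
import pandas as pd

# Hypothetical ratings, invented for illustration only
toy_ratings = pd.DataFrame({
    "UserID_":      [1, 1, 2, 3, 3],
    "BookTitle_":   [1, 2, 2, 1, 3],
    "Review_Score": [5.0, 4.0, 3.0, 4.0, 2.0],
})

# Rows are users, columns are books, unrated cells stay NaN.
# aggfunc="mean" averages repeated (user, book) pairs.
toy_R = toy_ratings.pivot_table(index="UserID_", columns="BookTitle_",
                                values="Review_Score", aggfunc="mean")

# Fraction of cells that actually hold a rating
density = toy_R.notna().to_numpy().mean()

print(toy_R)
print(f"Density: {density:.2f}")
```

On the toy data, 5 of the 9 cells are filled. In a real utility matrix such as `R_df` the density is far lower, which is why most displayed cells are empty.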

##### Identify users with the most reviews

In [35]:
# list users with most reviews
user_most_reviews = collab_df_['UserID_'].value_counts()
user_most_reviews.head(6)

UserID_
21     408
647    292
664    130
72     128
197    125
121    105
Name: count, dtype: int64

##### Identify books with the most reviews

In [36]:
# calculate the book with most reviews by different users
book_most_reviews = collab_df_['BookTitle_'].value_counts()
book_most_reviews.head(6)

BookTitle_
4610    104
6433     64
1007     55
2543     50
6365     41
7868     41
Name: count, dtype: int64

#### Similar Users

After iterating over the users who have rated the most books, users 647 and 121 were found to have rated exactly four books in common.

A larger overlap in rated books generally makes the similarity between two users more reliable. For demonstration and continuity, however, we proceed with users 647 and 121 despite the limited overlap in their rated books.

In [37]:
# Find the books rated by both users
books_1st_user_rated = ~R_df.loc[647, :].isna()
books_2nd_user_rated = ~R_df.loc[121, :].isna()

books_both_users_rated = books_1st_user_rated & books_2nd_user_rated
print(books_both_users_rated.sum())

4


In [38]:
print("Reviewers' scores:")
R_df.loc[[647, 121], books_both_users_rated]

Reviewers' scores:


Unnamed: 0,1242,1793,6582,7145
647,5.0,4.0,4.0,5.0
121,4.0,4.0,3.0,5.0


These selected items collectively form a vector for each user, with dimensions corresponding to the items. Consequently, we can utilize a measure of vector similarity to determine the similarity between users. For this purpose, we will employ the cosine similarity once again to gauge the closeness between two users.

In [39]:
print("Similarity:", cosine_similarity(R_df.loc[647, books_both_users_rated].values.reshape(1,-1), 
                                       R_df.loc[121, books_both_users_rated].values.reshape(1,-1)))

Similarity: [[0.99230223]]
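
The value above can be reproduced by hand from the definition of cosine similarity, the dot product of the two rating vectors divided by the product of their norms. The sketch below uses the four shared ratings of users 647 and 121 shown in the table.

```python
import numpy as np

# Ratings of users 647 and 121 on the four books both rated (from the table above)
u = np.array([5.0, 4.0, 4.0, 5.0])
v = np.array([4.0, 4.0, 3.0, 5.0])

# Cosine similarity: dot product divided by the product of the vector norms
cos_sim = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(round(cos_sim, 6))  # 0.992302
```

Because raw ratings are all positive and are not mean-centered here, cosine similarity tends to be close to 1 for most pairs of users, so small differences in this score carry most of the information.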


#### Making Predictions

In [40]:
R_df.sample(3)

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,8515,8516,8517,8518,8519,8520,8521,8522,8523,8524
216,,,,,,,,,,,...,,,,,,,,,,
186,,,,,,,,,,,...,,,,,,,,,,
580,,,,,,,,,,,...,,,,,,,,,,


With the method for directly comparing users in place, we can now determine the similarity between a specific user and every other user who reviewed an item. Let's demonstrate this for user 100 and book 2, resulting in the following comparisons:

In [41]:
def find_user_similarity(user_1, user_2, R_df):
    
    # Define the mask which finds all books they rated together
    books_1st_user_rated = ~R_df.loc[user_1, :].isna()
    books_2nd_user_rated = ~R_df.loc[user_2, :].isna()

    books_both_users_rated = books_1st_user_rated & books_2nd_user_rated

    # Sum boolean to get the counts
    number_of_books_rated_together = books_both_users_rated.sum()

    # Find the ratings of both users for books they both rated
    ratings_of_user1 = R_df.loc[user_1, books_both_users_rated].values.reshape(1, -1)
    ratings_of_user2 = R_df.loc[user_2, books_both_users_rated].values.reshape(1, -1)

    # Finally, calculate the similarity between them
    similarity = cosine_similarity(ratings_of_user1, ratings_of_user2)[0][0]
    
    return similarity, number_of_books_rated_together

In [42]:
current_user = 100 
current_book = 2
similarities_to_user_100 = []
ratings_given_to_book_2 = []

# Find only the users who rated the current book (rows), selecting the column by label
R_df2 = R_df[~R_df.loc[:, current_book].isna()].copy()

for other_user in R_df2.index:
    
    similarity, number_of_books_rated_together = find_user_similarity(current_user, other_user, R_df)
    similarities_to_user_100.append(similarity)
    ratings_given_to_book_2.append(R_df.loc[other_user, current_book])
            
# Finally, let's turn these into numpy arrays so life is easier
similarities_to_user_100 = np.array(similarities_to_user_100)
ratings_given_to_book_2 = np.array(ratings_given_to_book_2)

In [43]:
predicted_rating = np.dot(ratings_given_to_book_2, similarities_to_user_100)/np.sum(similarities_to_user_100)

print(f'Predicted rating for book 2 by user 100 is {round(predicted_rating, 2)}')

Predicted rating for book 2 by user 100 is 4.0
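
The prediction above is a similarity-weighted average: each neighbor's rating counts in proportion to how similar that neighbor is to the target user. A minimal sketch with hypothetical numbers (invented for illustration, not taken from `R_df`):

```python
import numpy as np

# Hypothetical values, invented for illustration only
neighbor_ratings = np.array([5.0, 3.0, 4.0])       # ratings the neighbors gave the book
neighbor_similarities = np.array([0.9, 0.1, 0.5])  # similarity of each neighbor to the target user

# Similarity-weighted average of the neighbors' ratings
predicted = np.dot(neighbor_ratings, neighbor_similarities) / np.sum(neighbor_similarities)

print(round(predicted, 2))  # 4.53
```

The most similar neighbor (0.9) pulls the prediction toward their rating of 5, while the least similar neighbor (0.1) has little influence.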


### 3.4 Matrix Factorization Methods and Latent Features <a class="anchor" id="_3.4"></a>

#### Funk Singular Value Decomposition (FunkSVD)

In [44]:
# Import the surprise packages
from surprise import Dataset
from surprise.reader import Reader
from surprise.prediction_algorithms.matrix_factorization import SVD as FunkSVD

In [45]:
# create a new df for the FunkSVD recommendation system
funk_df = book_rating_s[["UserID_","BookTitle_","Review_Score","BookTitle","Book_Category"]]

# order the df by UserID_ and BookTitle_
funk_df = funk_df.sort_values(by=['UserID_','BookTitle_'], ascending=True)

# filter funk_df to include only the users with at least 5 reviews
funk_df = funk_df.groupby("UserID_").filter(lambda x: len(x) >= 5)

In [46]:
# Check the book dataset
display(funk_df.head(3))
print('Shape: ', funk_df.shape)
print('Number of Unique Users:',len(funk_df['UserID_'].unique()))
print('Number of Unique Books:',len(funk_df['BookTitle_'].unique()))

Unnamed: 0,UserID_,BookTitle_,Review_Score,BookTitle,Book_Category
422548,179,399,5.0,5001 Nights At the Movies Signed,Performing Arts
132639,179,2459,4.0,Alfred & Guinevere,History
1715377,179,5436,4.0,Black Sun,Fiction


Shape:  (25960, 5)
Number of Unique Users: 2574
Number of Unique Books: 12333


In [47]:
# Check the min and max value of Review_Score
print(f'Min: {funk_df["Review_Score"].min()}')
print(f'Max: {funk_df["Review_Score"].max()}')

Min: 1.0
Max: 5.0


In [48]:
# check whether there are null values in Review_Score
funk_df['Review_Score'].isna().sum()

0

Using user 179 as an example:

In [49]:
# Check number of books rated by user 179
user_179 = funk_df[funk_df['UserID_'] == 179]
user_179.shape

(20, 5)

In [50]:
# find the number of books user 179 rated as 5/5
sum(user_179['Review_Score'] == 5.0)

5

In [51]:
# Set the reader with accurate rating scale
my_reader = Reader(rating_scale=(1, 5))

# Set the dataset
my_dataset = Dataset.load_from_df(funk_df[["UserID_", "BookTitle_", "Review_Score"]], my_reader)
my_dataset

<surprise.dataset.DatasetAutoFolds at 0x7fa05162dc30>

Algorithm Tuning:

In [52]:
# Import GridSearchCV for algorithm tuning
from surprise.model_selection import GridSearchCV

# Set the parameter grid
param_grid = {
    'n_factors': [100, 150], 
    'n_epochs': [10, 20],
    'lr_all': [0.005, 0.1],
    'biased': [False] }

# Set GridSearchCV with 3-fold cross-validation
GS = GridSearchCV(FunkSVD, param_grid, measures=['fcp'], cv=3)

# Fit the model
GS.fit(my_dataset)

In [53]:
# Check the FCP accuracy score (1.0 is ideal and 0 is worst)
GS.best_score['fcp']

0.4267479678995385

In [54]:
# Check the best parameters
GS.best_params['fcp']

{'n_factors': 100, 'n_epochs': 10, 'lr_all': 0.1, 'biased': False}

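Fitting the model on a train/test split. Note that `n_epochs=20` and `lr_all=0.01` below differ from the best parameters reported by the grid search above (`n_epochs=10`, `lr_all=0.1`):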

In [55]:
# Import train_test_split
from surprise.model_selection import train_test_split

# Split train test set
trainset, testset = train_test_split(my_dataset, test_size=0.25)

# Set the algorithm
my_svd = FunkSVD(n_factors=100, 
                 n_epochs=20, 
                 lr_all=0.01, 
                 biased=False,
                 verbose=0)
# Fit train set
my_svd.fit(trainset)

# Test the algorithm using test set
my_pred = my_svd.test(testset)

In [56]:
# Put my_pred result in a dataframe
df_prediction = pd.DataFrame(my_pred, columns=['UserID_',
                                                     'BookTitle_',
                                                     'actual',
                                                     'prediction',
                                                     'details'])

# Calculate the difference of actual and prediction into diff column
df_prediction['diff'] = abs(df_prediction['prediction'] - 
                            df_prediction['actual'])

In [57]:
# Check the df_prediction
df_prediction.head()

Unnamed: 0,UserID_,BookTitle_,actual,prediction,details,diff
0,95199,18892,4.0,1.0,{'was_impossible': False},3.0
1,106681,8237,5.0,4.265126,"{'was_impossible': True, 'reason': 'User and i...",0.734874
2,49130,5657,4.0,1.0,{'was_impossible': False},3.0
3,47887,44829,5.0,4.302942,{'was_impossible': False},0.697058
4,66747,5975,5.0,1.0,{'was_impossible': False},4.0


In [58]:
# See the best 10 predictions
df_prediction.sort_values(by='diff')[:10]

Unnamed: 0,UserID_,BookTitle_,actual,prediction,details,diff
5232,36117,20840,1.0,1.0,{'was_impossible': False},0.0
898,79306,33173,1.0,1.0,{'was_impossible': False},0.0
2882,93097,13189,1.0,1.0,{'was_impossible': False},0.0
5525,56520,31951,5.0,5.0,{'was_impossible': False},0.0
657,32860,40686,5.0,5.0,{'was_impossible': False},0.0
4056,58067,24859,1.0,1.0,{'was_impossible': False},0.0
3414,82286,27770,5.0,5.0,{'was_impossible': False},0.0
5536,33571,19378,1.0,1.0,{'was_impossible': False},0.0
5520,55433,37686,5.0,5.0,{'was_impossible': False},0.0
5246,3788,5540,5.0,5.0,{'was_impossible': False},0.0


In [59]:
# See the worst 10 predictions
df_prediction.sort_values(by='diff')[-10:]

Unnamed: 0,UserID_,BookTitle_,actual,prediction,details,diff
2166,68093,38920,5.0,1.0,{'was_impossible': False},4.0
3882,73840,2491,5.0,1.0,{'was_impossible': False},4.0
5375,119228,16225,5.0,1.0,{'was_impossible': False},4.0
2170,17862,20653,5.0,1.0,{'was_impossible': False},4.0
5373,93687,12071,5.0,1.0,{'was_impossible': False},4.0
3877,106905,45791,5.0,1.0,{'was_impossible': False},4.0
2180,69941,586,5.0,1.0,{'was_impossible': False},4.0
4549,31196,44589,5.0,1.0,{'was_impossible': False},4.0
2246,88295,31144,5.0,1.0,{'was_impossible': False},4.0
3244,93687,47988,5.0,1.0,{'was_impossible': False},4.0


In [60]:
# Check total rows with same actual and prediction ratings
df_prediction[df_prediction['diff'] <= 0]

Unnamed: 0,UserID_,BookTitle_,actual,prediction,details,diff
108,53738,39048,5.0,5.0,{'was_impossible': False},0.0
153,83014,22110,5.0,5.0,{'was_impossible': False},0.0
170,88151,9864,1.0,1.0,{'was_impossible': False},0.0
196,9314,41813,1.0,1.0,{'was_impossible': False},0.0
204,47315,21055,1.0,1.0,{'was_impossible': False},0.0
...,...,...,...,...,...,...
6373,61237,27770,5.0,5.0,{'was_impossible': False},0.0
6421,128554,25669,1.0,1.0,{'was_impossible': False},0.0
6424,1363,33871,5.0,5.0,{'was_impossible': False},0.0
6434,95076,38546,1.0,1.0,{'was_impossible': False},0.0


In [61]:
(df_prediction['diff'] == 0).mean()

0.03020030816640986

In [62]:
(df_prediction["diff"] <= 1).mean()

0.4604006163328197

In [63]:
# Build full trainset
full_trainset = my_dataset.build_full_trainset()

# Build the SVD algorithm
my_svd = FunkSVD(n_factors=100, 
                 n_epochs=20, 
                 lr_all=0.005,    
                 biased=False, 
                 verbose=0)

# Fit with full trainset
my_svd.fit(full_trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fa0838390f0>

In [64]:
# Define the full test set
full_testset = full_trainset.build_anti_testset(fill=-1)  # placeholder rating for unrated pairs; SVD doesn't accept null values

In [65]:
# Set the prediction
my_prediction = my_svd.test(full_testset)

In [66]:
# Put into a dataframe
df_prediction = pd.DataFrame(my_prediction, columns=['UserID_',
                                                     'BookTitle_',
                                                     'actual',
                                                     'prediction',
                                                     'details'])

In [67]:
# Check user id `179` predictions
df = df_prediction[df_prediction['UserID_'] == 179]\
    .sort_values(by=['prediction'], ascending=False)

display(df)

Unnamed: 0,UserID_,BookTitle_,actual,prediction,details
586,179,194,-1.0,2.212530,{'was_impossible': False}
1559,179,41813,-1.0,2.087860,{'was_impossible': False}
4272,179,31300,-1.0,2.023814,{'was_impossible': False}
186,179,40166,-1.0,1.982377,{'was_impossible': False}
1245,179,35651,-1.0,1.924700,{'was_impossible': False}
...,...,...,...,...,...
4154,179,29358,-1.0,1.000000,{'was_impossible': False}
4155,179,32992,-1.0,1.000000,{'was_impossible': False}
4156,179,33383,-1.0,1.000000,{'was_impossible': False}
4157,179,35624,-1.0,1.000000,{'was_impossible': False}


In [68]:
# Merge with the book data
merge_df = df.merge(funk_df[["BookTitle_", "BookTitle", "Book_Category"]].drop_duplicates(), how='left', 
                    left_on=['BookTitle_'], right_on=['BookTitle_'])

# Check the recommended books for user 179
merge_df.head(10)

Unnamed: 0,UserID_,BookTitle_,actual,prediction,details,BookTitle,Book_Category
0,179,194,-1.0,2.21253,{'was_impossible': False},1491: New Revelations of the Americas Before C...,History
1,179,41813,-1.0,2.08786,{'was_impossible': False},The Sound and the Fury,Fiction
2,179,31300,-1.0,2.023814,{'was_impossible': False},Sister Carrie: A novel,Chicago (Ill.)
3,179,40166,-1.0,1.982377,{'was_impossible': False},The Other Side of the Bridge,Fiction
4,179,35651,-1.0,1.9247,{'was_impossible': False},The City of Falling Angels,Travel
5,179,45599,-1.0,1.900855,{'was_impossible': False},"Undaunted Courage: Meriwether Lewis, Thomas Je...",History
6,179,37126,-1.0,1.893529,{'was_impossible': False},The Fire Next Time,Political Science
7,179,30073,-1.0,1.859757,{'was_impossible': False},Satanic Verses,Fiction
8,179,44693,-1.0,1.851516,{'was_impossible': False},To Kill A Mockingbird,Fiction
9,179,42460,-1.0,1.817391,{'was_impossible': False},The Two Towers,Fiction


In [69]:
# sort the books of user 179 by the rating in descending order (5->1)
user_179.sort_values("Review_Score", ascending=False).head(10)

Unnamed: 0,UserID_,BookTitle_,Review_Score,BookTitle,Book_Category
422548,179,399,5.0,5001 Nights At the Movies Signed,Performing Arts
949268,179,13028,5.0,Fidelity,Family & Relationships
518940,179,42776,5.0,The War of the Worlds,Fiction
3864,179,37259,5.0,The Forsythe Saga,Fiction
601668,179,33352,5.0,THE WORLD OF ODYSSEUS,History
132639,179,2459,4.0,Alfred & Guinevere,History
1510379,179,46621,4.0,We think the world of you,Fiction
833804,179,46608,4.0,We Think the World of You,Fiction
1287007,179,44091,4.0,"The woodlanders,",Drama
1778547,179,38036,4.0,The History of Pendennis; Volume I,History


## 4. Recommender Systems Evaluation <a class="anchor" id="_4"></a>


In [70]:
from surprise import accuracy
from surprise.model_selection import train_test_split

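# Surprise's test() expects a testset rather than a trainset, so we split the dataset again.
# Note: my_svd was fit on the full trainset, so these test ratings were already seen during training.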
my_train_dataset, my_test_dataset = train_test_split(my_dataset, test_size=0.5)

predictions = my_svd.test(my_test_dataset)

RMSE (Root Mean Squared Error): 
This metric is the square root of the average squared difference between the predicted ratings and the actual ratings in the test set. Because the errors are squared before averaging, large errors are penalized more heavily. The lower the RMSE, the better the performance of the recommender system. 

In [71]:
# RMSE
RMSE = accuracy.rmse(predictions, verbose=False)
print(RMSE)

1.8908401631257876


Our RMSE value of 1.8908 indicates that the predicted ratings typically deviate from the actual ratings by roughly 1.89 points on the 1 to 5 scale.

----

In [72]:
# MSE
MSE = accuracy.mse(predictions, verbose=False)
print(MSE)

3.5752765224895553


MSE (Mean Squared Error):
It measures the prediction accuracy of a recommender system as the average of the squared differences between predicted and actual ratings. Lower MSE values are desirable. The MSE is simply the square of the RMSE. In our case, the MSE is 3.5753 (1.8908^2).

-----

In [73]:
# MAE
MAE = accuracy.mae(predictions, verbose=False)
print(MAE)

1.538003985563139


MAE (Mean Absolute Error):
This metric  represents the average absolute difference between predicted and actual ratings. It gives a measure of how close the predictions are to the true ratings. Lower MAE values are preferred as they indicate better performance. 
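
The three error metrics in this section can be computed directly from their definitions. The sketch below uses hypothetical actual and predicted ratings (invented for illustration) rather than the project's predictions.

```python
import numpy as np

# Hypothetical ratings, invented for illustration only
actual = np.array([5.0, 3.0, 4.0, 1.0])
predicted = np.array([4.0, 3.0, 2.0, 2.0])

errors = predicted - actual

mse = np.mean(errors ** 2)     # mean of the squared errors
rmse = np.sqrt(mse)            # square root of the MSE
mae = np.mean(np.abs(errors))  # mean of the absolute errors

print(mse, rmse, mae)
```

Here the MSE is 1.5 and the MAE is 1.0. The RMSE (about 1.22) is larger than the MAE because squaring gives the single 2-point error more weight.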

-----

In [74]:
# FCP - Fraction of Concordant Pairs, the fraction of pairs whose relative ranking order is correct

FCP = accuracy.fcp(predictions, verbose=False)
print(FCP) 

0.844141059411072


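FCP (Fraction of Concordant Pairs):
It measures the ability of the system to rank items accurately: for pairs of items rated by the same user, it checks whether the predicted ratings put them in the same order as the actual ratings. FCP ranges from 0 to 1, where 0 indicates the worst performance and 1 indicates a perfect ranking. An FCP value of 0.8441 suggests that the recommender system ranks items relatively well on this split. Because the model was fit on the full dataset before this evaluation, this figure is likely optimistic compared with the cross-validated FCP of 0.4267 from the grid search.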
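
The idea behind FCP can be sketched for a single hypothetical user (ratings invented for illustration). This is a simplified version that skips ties and handles one user only; it is not Surprise's exact implementation, which aggregates concordant and discordant counts across users.

```python
from itertools import combinations

# Hypothetical (actual, predicted) ratings for one user, invented for illustration only
pairs = [(5.0, 4.2), (3.0, 3.9), (1.0, 2.0), (4.0, 4.5)]

concordant = 0
discordant = 0
for (r_i, est_i), (r_j, est_j) in combinations(pairs, 2):
    if r_i == r_j or est_i == est_j:
        continue  # ties are skipped in this simplified sketch
    # A pair is concordant when the predictions order the two books like the true ratings do
    if (r_i > r_j) == (est_i > est_j):
        concordant += 1
    else:
        discordant += 1

fcp = concordant / (concordant + discordant)
print(round(fcp, 3))  # 0.833
```

Five of the six pairs are ordered correctly. The only discordant pair is the book rated 5 but predicted 4.2 against the book rated 4 but predicted 4.5.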