# Capstone 1: Collaborative Filtering Based Book Recommendation Engine

# Project Summary

## Introduction

Recommendation engines underpin most major tech companies that provide retail, video-on-demand, or music streaming services, and they have redefined the way we shop, search for an old friend, and find new music or places to go. A recommender system filters the vast amount of information in the user and item databases down to an individual's preferences. For example, Amazon uses one to suggest products to customers, and Spotify uses one to decide which song to play next for a user.
Book reading apps like Goodreads have personally helped me find books I couldn't put down and get back into the habit of reading regularly. While many datasets for movies (Netflix, MovieLens) or songs have been explored to understand how recommendation engines work for those applications and where they can be improved, book recommendation engines have been explored relatively little.
The primary goal of this project is to develop a collaborative filtering book recommendation model, using the Goodreads dataset, that can suggest to readers which books to read next. Additionally, data wrangling and exploratory data analysis are used to draw insights about users' reading preferences (e.g. how they like to tag books, what ratings they usually give) and current trends in the book market (e.g. book categories in demand, successful authors).

## Key Business Insights

> **Understanding User Behavior**

- When the tag counts of the generalized tag_names were ranked, the top 10 tag_names showed that users prefer separate shelves for books they marked as favorite, read in a particular year (e.g. read in 1990, Childhood Books), owned or borrowed from a library, or read in a different format (e.g. ebook, audiobook). The other shelving preferences among the top 10 tag_names were book categories such as 'Fiction' and 'Young-Adult'.

- The count plot of user-provided ratings shows that users are more likely to rate a book 4 or higher. As the tag counts for books marked as favorite are also higher (shown previously), it seems that users are more likely to rate and shelve a book when they like it.

- Users use a wide variety of tag names even when tagging books in the same category, for example Science Fiction and Fantasy.

> **Factors to Consider for a Book's Rating**

- EDA shows that the top 15 books by tag count as readers' favorites are not the same as the top 15 books ranked by user ratings. Also, while the average rating count for the top 15 favorite books (2191465) is significantly higher than the average rating count across all books (23833), the average rating count for the top 15 rated books (18198) is below that average. Both the top 15 favorite and the top 15 rated books have higher average ratings (4.26 and 4.74, respectively) than the average rating of all books (4.01). These statistics suggest that the average rating alone is not enough to rank books for recommendation. An ideal metric should also consider how many times the book has been marked as favorite and the total number of ratings it received, in addition to its average rating.

- 9 of the top 15 favorite books are most frequently tagged in the Young-Adult category. The other popular categories in the top 15 favorite books are science fiction and fantasy, romance, historical fiction, or fiction in general. The Harry Potter books (ranked 2, 3, 4, 6, 7) have also been frequently tagged as children/childhood books. A quick look at the publication dates reveals that most of the books under the Young-Adult and Childhood categories were published at least 10 years ago. Therefore, they were probably the favorite books of many adult readers when they were young. This highlights that the year of publication and the dates of ratings can also affect a book's ranking and should be factored into the performance metric. To determine whether the books are equally liked by the current generation of young readers, one can check whether the average number of positive ratings a book receives per year has decreased or increased since its year of publication. As the datasets used in this project do not provide the dates when the books were rated, it was not possible to implement this scheme in the recommendation framework.
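
One way to fold the number of ratings into a ranking score, as the first point above suggests, is a damped average that pulls books with few ratings toward the global mean. The following is a minimal sketch and not the metric used in this project: the function name `weighted_rating` and the damping constant `m` are assumptions chosen for illustration, and the two books in the usage lines are made up.

```python
def weighted_rating(avg_rating, rating_count, global_mean, m=1000):
    """Damped average: shrink a book's mean rating toward the global mean.

    m is a hypothetical damping constant: a book needs roughly m ratings
    before its own average outweighs the global mean.
    """
    v = rating_count
    return (v / (v + m)) * avg_rating + (m / (v + m)) * global_mean


# Made-up books: a niche title with a high average but few ratings,
# and a popular title with a slightly lower average.
niche = weighted_rating(avg_rating=4.8, rating_count=50, global_mean=4.0)
popular = weighted_rating(avg_rating=4.4, rating_count=20000, global_mean=4.0)
print(round(niche, 3), round(popular, 3))
```

With these inputs the popular title ranks above the niche one even though its raw average is lower. A favorite count could be blended in as an additional term in a similar way.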

> **Book Categories**

- Based on the tag_counts of different book categories, 'Fiction' dominates as the most popular category for users of all age groups (i.e. Adult and Young-Adult readers). Besides fiction in general, tags related to 'Science Fiction and Fantasy' seem to be used more frequently than other categories in both the adult and young-adult sections. Other popular categories include Crime & Mystery and Historical Fiction. Based on these findings, the demand for different kinds of fiction seems to be higher than the demand for books based on actual events or facts (i.e. History or Science). The composition of the dataset agrees with this conclusion: about 43% of the books are Fiction, with Non-Fiction (20.5%), Young-Adult (8.3%) and Science Fiction and Fantasy (5.73%) as the other prevailing categories.

- Does this finding indicate that a new fiction book has a higher chance of getting a good rating than a new history book? Probably not. When the average ratings of different book categories were compared, readers showed no bias towards rating a particular category higher than the others. The average rating in every category is close to the average rating of all books (4.01), and ratings mostly range from 3.25 to 4.75. Ratings vary more in categories that have more books in the market.

> **Authors in Demand**

- J.K. Rowling seems to be the readers' favorite author, with 4 of her books among the top 15 favorite books. However, when authors were ranked by the number of books they wrote and the average ratings their books received, J.K. Rowling did not make the top 10. Stephen King seems to be the most successful author, with 44 books in the market and an average rating of 3.9. Other successful authors, considering both ratings and number of books, are Dean Koontz, John Grisham, Nora Roberts and Jodi Picoult. This suggests that an ideal metric for an author's demand in the market should include the number of books the author wrote, the ratings those books received, the number of books marked as favorite, and the favorite tag counts for each book.
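
Ranking authors on both the number of books and the average rating can be sketched with a pandas `groupby`. This is an illustrative sketch only: the small table is made up, the column names `authors` and `Avg Rating` are borrowed from the `book_info` table loaded later in this notebook, and the sort order (book count first, then mean rating) is an assumption rather than the ranking used in the analysis.

```python
import pandas as pd

# Made-up sample that reuses the column names of book_info
books = pd.DataFrame({
    "authors": ["Author A", "Author A", "Author A",
                "Author B", "Author C", "Author C"],
    "Avg Rating": [3.9, 4.1, 3.7, 4.6, 4.0, 4.2],
})

# One row per author: how many books they have and their mean rating
author_stats = (
    books.groupby("authors")["Avg Rating"]
    .agg(n_books="count", mean_rating="mean")
    .sort_values(["n_books", "mean_rating"], ascending=False)
)
print(author_stats)
```

In this toy data the author with the highest mean rating has only one book and therefore ranks last, which mirrors the gap between "favorite" and "most successful" authors described above.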

> **Rating Counts per Book and Per User**

- Every user in the dataset has rated at least 19 books, and the most active users rated 200 books. 80% of the users rated at least 100 books.
- Every book in the dataset received at least 8 ratings. When books were ranked by rating_counts, the top 10 books each received more than 10,000 ratings. A CDF plot of the ratings per book showed that only ~20% of the books received more than 5000 ratings.
- As the number of books in the dataset (10,000) is smaller than the number of users (53,424), sparsity is less likely to be an issue for ML modeling with this dataset.
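
Sparsity can also be checked directly as the share of empty cells in the user-book matrix. The sketch below is illustrative: `matrix_sparsity` is a hypothetical helper, and because the total number of ratings is not stated here, the usage lines only bound it using the per-user figures quoted above (between 19 and 200 ratings per user).

```python
def matrix_sparsity(n_ratings, n_users, n_books):
    """Fraction of user-book pairs that have no rating."""
    return 1.0 - n_ratings / (n_users * n_books)


n_users, n_books = 53_424, 10_000

# Bounds implied by the summary: each user rated between 19 and 200 books
fewest = matrix_sparsity(19 * n_users, n_users, n_books)
most = matrix_sparsity(200 * n_users, n_users, n_books)
print(f"sparsity between {most:.4f} and {fewest:.4f}")
```

Computing the exact value from the ratings table would replace these bounds with a single number.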

# Non-Personalized and Personalized Recommendation System

- This step implements the book recommendation engine based on the cleaned datasets and the ML modeling results. The non-personalized system explores the book-related datasets and recommends books to a new user based on the reading preferences they provide. The personalized recommendation system ranks books by the rating the ML model predicts for a user and recommends books that match the user's search preferences.

- The data wrangling step showed how users like to tag or shelve their books, and this information was used to group books into dominant categories based on the user-provided tag names. The tag names can also serve as a repository for keyword (KW) search. For example, users often used words like ya, YA, juvenile or teen to tag young-adult or children's books. The recommendation system keeps a collection of such frequently used words, gathered from the tag name records, so that when users type those words into the search engine, it can find books in the relevant categories.

- The search results can also provide built-in suggestions for tagging the books or further refining the search.
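
The keyword lookup described above can be sketched as follows. The preview of `Category_Repos` later in this notebook suggests that `Possible Search KW` stores each keyword list as a string, so this sketch parses it with `ast.literal_eval` and matches whole keywords case-insensitively. The helper name `categories_for` and the miniature keyword table are assumptions for illustration; the notebook's own functions use a substring match instead.

```python
import ast

# Hypothetical miniature of the keyword repository:
# category name -> keyword list stored as a string
keyword_repo = {
    "Young-Adult": "['ya', 'young-adult', 'teen', 'juvenile']",
    "Children Books": "['kids', 'kids-books', 'kid-lit']",
}


def categories_for(term, repo):
    """Return the categories whose keyword list contains `term` exactly."""
    term = term.strip().lower()
    matches = []
    for category, raw_keywords in repo.items():
        keywords = ast.literal_eval(raw_keywords)  # "['a', 'b']" -> ['a', 'b']
        if term in (kw.lower() for kw in keywords):
            matches.append(category)
    return matches


print(categories_for("YA", keyword_repo))
```

Matching whole keywords avoids accidental hits on short fragments, at the cost of no longer supporting partial inputs such as `sci`.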

# Import Packages

In [2]:
import pandas as pd
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise import Reader
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
import re  

# Import Data

In [3]:
Category_Repos = pd.read_csv ('Category_KW_Respository.csv')
user_rating_pred = pd.read_csv('Rating_Prediction.csv')
book_info = pd.read_csv('Additional dataset_for_tag&KW_recommendation.csv')

book_info = book_info.drop( ['Unnamed: 0','count'],axis=1).sort_values(by = 'Avg Rating', ascending = False)
Category_Repos = Category_Repos.drop( ['Unnamed: 0'],axis = 1)

In [4]:
# Let's check every imported dataset

In [5]:
# Display full (non-truncated) dataframe columns
#pd.set_option('display.max_colwidth', None)

In [6]:
book_info.head()

Unnamed: 0,title,book_id,authors,Avg Rating,tag_name,People Often Tagged as
1554,The Complete Calvin and Hobbes,3628,Bill Watterson,4.818306,Children Books,"['Non-Fiction', 'Fiction', 'Children Books', '..."
3572,Mark of the Lion Trilogy,8854,Francine Rivers,4.759087,Historical Fiction,"['Historical Fiction', 'History', 'romance', '..."
1890,It's a Magical World: A Calvin and Hobbes Coll...,4483,Bill Watterson,4.754492,Children Books,"['Non-Fiction', 'Fiction', 'Children Books', '..."
2624,There's Treasure Everywhere: A Calvin and Hobb...,6361,Bill Watterson,4.740989,Children Books,"['Non-Fiction', 'Fiction', 'Children Books', '..."
1601,"Harry Potter Collection (Harry Potter, #1-6)",3753,J.K. Rowling,4.727754,Young-Adult,"['Science Fiction & Fantasy', 'Crime & Mystery..."


In [7]:
Category_Repos.head(2)

Unnamed: 0,Category,Possible Search KW
0,Favorite,"['favorites', 'favourites', 'all-time-favorite..."
1,Children Books,"['kids', 'kids-books', 'kid-lit', 'kid-books',..."


In [8]:
user_rating_pred.head()

Unnamed: 0.1,Unnamed: 0,user_id,book_id,Predicted Rating
0,0,9503,42,3.669583
1,1,7246,71,3.381209
2,2,8440,14,3.902109
3,3,10948,49,2.090546
4,4,5335,52,3.488853


# Non-Personalized Recommendations for New Users

In [9]:
def non_personalized_recommend(cat=None, auth=None, n=5):
    # Start from every book; each filter below narrows the selection
    Books = book_info

    # Filter for category
    if cat is not None:
        if cat in list(Category_Repos.Category):
            # Exact category name: compare literally instead of as a regex
            Categories = [cat]
        else:
            # Otherwise look the term up in the keyword repository (substring
            # match, so partial inputs such as 'sci' still find categories)
            kw_mask = Category_Repos['Possible Search KW'].str.contains(cat, regex=False, na=False)
            Categories = list(Category_Repos.Category[kw_mask])
        Books = Books[Books.tag_name.isin(Categories)]

    # Filter for authors (case-sensitive substring match, with or without cat)
    if auth is not None:
        Books = Books[Books.authors.str.contains(auth, regex=False, na=False)]

    # Rank by average rating; an empty selection returns an empty frame
    result = Books.sort_values(by='Avg Rating', ascending=False)[:n]
    return result.drop('book_id', axis=1)

## Demonstration

In [10]:
# Top 5 books to show any user 
non_personalized_recommend(n = 5)

Unnamed: 0,title,authors,Avg Rating,tag_name,People Often Tagged as
1554,The Complete Calvin and Hobbes,Bill Watterson,4.818306,Children Books,"['Non-Fiction', 'Fiction', 'Children Books', '..."
3572,Mark of the Lion Trilogy,Francine Rivers,4.759087,Historical Fiction,"['Historical Fiction', 'History', 'romance', '..."
1890,It's a Magical World: A Calvin and Hobbes Coll...,Bill Watterson,4.754492,Children Books,"['Non-Fiction', 'Fiction', 'Children Books', '..."
2624,There's Treasure Everywhere: A Calvin and Hobb...,Bill Watterson,4.740989,Children Books,"['Non-Fiction', 'Fiction', 'Children Books', '..."
1601,"Harry Potter Collection (Harry Potter, #1-6)",J.K. Rowling,4.727754,Young-Adult,"['Science Fiction & Fantasy', 'Crime & Mystery..."


In [32]:
# If a new user searches by category but types only part of the keyword
non_personalized_recommend( cat = 'sci')

Unnamed: 0,title,authors,Avg Rating,tag_name,People Often Tagged as
2233,The Constitution of the United States of America,Founding Fathers,4.543922,History,"['Non-Fiction', 'History']"
612,"Band of Brothers: E Company, 506th Regiment, 1...",Stephen E. Ambrose,4.42486,History,"['Historical Fiction', 'Non-Fiction', 'History..."
2010,Battle Cry of Freedom,James M. McPherson,4.323674,History,"['Historical Fiction', 'Non-Fiction', 'History..."
2652,The Making of the Atomic Bomb,Richard Rhodes,4.320956,History,"['Science', 'Non-Fiction', 'History']"
2814,"The Autobiography of Martin Luther King, Jr.","Martin Luther King Jr., Clayborne Carson",4.314657,History,"['Science', 'Non-Fiction', 'History']"


In [27]:
# If a new user searches by both category and author
non_personalized_recommend( cat = 'ya', auth ='Rowling')

Unnamed: 0,title,authors,Avg Rating,tag_name,People Often Tagged as
1601,"Harry Potter Collection (Harry Potter, #1-6)",J.K. Rowling,4.727754,Young-Adult,"['Science Fiction & Fantasy', 'Crime & Mystery..."
1953,Fantastic Beasts and Where to Find Them: The O...,J.K. Rowling,4.407889,Young-Adult,"['Science Fiction & Fantasy', 'Fiction', 'Chil..."
196,The Tales of Beedle the Bard,J.K. Rowling,4.062274,Children Books,"['Science Fiction & Fantasy', 'Fiction', 'Youn..."
235,Fantastic Beasts and Where to Find Them,"Newt Scamander, J.K. Rowling, Albus Dumbledore",3.949581,Young-Adult,"['Science Fiction & Fantasy', 'Fiction', 'Youn..."
599,Quidditch Through the Ages,"Kennilworthy Whisp, J.K. Rowling",3.853885,Children Books,"['Science Fiction & Fantasy', 'Non-Fiction', '..."


# Personalized Recommendation for Existing Users

In [13]:
def recommend(user=None, cat=None, auth=None, n=5):
    # Without a user id, fall back to the non-personalized recommendations
    if user is None:
        return non_personalized_recommend(cat, auth, n)

    # Predicted ratings for this user, joined with the book details
    Books_for_user = user_rating_pred[user_rating_pred.user_id == user][['user_id', 'book_id', 'Predicted Rating']]
    require_columns = book_info[['book_id', 'title', 'tag_name', 'authors', 'People Often Tagged as']]
    Books = Books_for_user.merge(require_columns, on='book_id')

    # Filter for category
    if cat is not None:
        if cat in list(Category_Repos.Category):
            # Exact category name: compare literally instead of as a regex
            Categories = [cat]
        else:
            # Otherwise look the term up in the keyword repository (substring match)
            kw_mask = Category_Repos['Possible Search KW'].str.contains(cat, regex=False, na=False)
            Categories = list(Category_Repos.Category[kw_mask])
        Books = Books[Books.tag_name.isin(Categories)]

    # Filter for authors (case-sensitive substring match, with or without cat)
    if auth is not None:
        Books = Books[Books.authors.str.contains(auth, regex=False, na=False)]

    # Rank by the rating predicted for this user
    return Books.sort_values(by='Predicted Rating', ascending=False)[:n]

## Demonstration

In [14]:
# Let's do a search for user 9503

# Top books to show user 9503 (up to n = 30)
recommend(user =9503, n = 30)

Unnamed: 0,user_id,book_id,Predicted Rating,title,tag_name,authors,People Often Tagged as
0,9503,102,4.368945,Where the Wild Things Are,Children Books,Maurice Sendak,"['Fiction', 'Children Books', 'Young-Adult']"
10,9503,31,4.215812,The Help,Fiction,Kathryn Stockett,"['Fiction', 'History', 'Young-Adult', 'Histori..."
1,9503,49,3.953784,"New Moon (Twilight, #2)",Young-Adult,Stephenie Meyer,"['Science Fiction & Fantasy', 'Fiction', 'Youn..."
2,9503,59,3.892518,Charlotte's Web,Children Books,"E.B. White, Garth Williams, Rosemary Wells","['Science Fiction & Fantasy', 'Fiction', 'Youn..."
13,9503,119,3.819488,The Handmaid's Tale,Fiction,Margaret Atwood,"['Science Fiction & Fantasy', 'Women Book List..."
12,9503,4,3.722053,To Kill a Mockingbird,Historical Fiction,Harper Lee,"['Crime & Mystery', 'Fiction', 'History', 'Chi..."
5,9503,83,3.670293,A Tale of Two Cities,Fiction,"Charles Dickens, Richard Maxwell, Hablot Knigh...","['Fiction', 'History', 'Young-Adult', 'Histori..."
4,9503,43,3.619458,Jane Eyre,Women Book List,"Charlotte Brontë, Michael Mason","['Crime & Mystery', 'Fiction', 'History', 'You..."
9,9503,131,3.470799,The Grapes of Wrath,Fiction,John Steinbeck,"['Historical Fiction', 'History', 'Fiction', '..."
8,9503,178,3.385116,The Bell Jar,Fiction,Sylvia Plath,"['Historical Fiction', 'Non-Fiction', 'Women B..."


In [15]:
# if user 9503 likes to read Fiction
recommend(user =9503, cat = 'Fiction')

Unnamed: 0,user_id,book_id,Predicted Rating,title,tag_name,authors,People Often Tagged as
10,9503,31,4.215812,The Help,Fiction,Kathryn Stockett,"['Fiction', 'History', 'Young-Adult', 'Histori..."
13,9503,119,3.819488,The Handmaid's Tale,Fiction,Margaret Atwood,"['Science Fiction & Fantasy', 'Women Book List..."
5,9503,83,3.670293,A Tale of Two Cities,Fiction,"Charles Dickens, Richard Maxwell, Hablot Knigh...","['Fiction', 'History', 'Young-Adult', 'Histori..."
9,9503,131,3.470799,The Grapes of Wrath,Fiction,John Steinbeck,"['Historical Fiction', 'History', 'Fiction', '..."
8,9503,178,3.385116,The Bell Jar,Fiction,Sylvia Plath,"['Historical Fiction', 'Non-Fiction', 'Women B..."


In [16]:
# if user 9503 likes to read books from Mark Twain
recommend(user =9503, auth = 'Mark Twain')

Unnamed: 0,user_id,book_id,Predicted Rating,title,tag_name,authors,People Often Tagged as
3,9503,116,3.208147,The Adventures of Tom Sawyer,Fiction,"Mark Twain, Guy Cardwell, John Seelye","['Fiction', 'History', 'Children Books', 'Youn..."
14,9503,58,3.180214,The Adventures of Huckleberry Finn,Fiction,"Mark Twain, John Seelye, Guy Cardwell","['Fiction', 'History', 'Children Books', 'Youn..."


## Smart Filtering of Keywords: Example

In [26]:
# If a new user only remembers that the surname of the author a friend recommended starts with 'Row'
non_personalized_recommend( cat = 'ya', auth ='Row')

Unnamed: 0,title,authors,Avg Rating,tag_name,People Often Tagged as
1601,"Harry Potter Collection (Harry Potter, #1-6)",J.K. Rowling,4.727754,Young-Adult,"['Science Fiction & Fantasy', 'Crime & Mystery..."
1953,Fantastic Beasts and Where to Find Them: The O...,J.K. Rowling,4.407889,Young-Adult,"['Science Fiction & Fantasy', 'Fiction', 'Chil..."
756,Carry On,Rainbow Rowell,4.190929,Young-Adult,"['Science Fiction & Fantasy', 'Crime & Mystery..."
77,Eleanor & Park,Rainbow Rowell,4.11015,Young-Adult,"['Fiction', 'Young-Adult-Fiction', 'Young-Adul..."
196,The Tales of Beedle the Bard,J.K. Rowling,4.062274,Children Books,"['Science Fiction & Fantasy', 'Fiction', 'Youn..."
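
The author search shown above is case-sensitive, so a query such as `row` would not find `Rowling`. A hedged sketch of a case-insensitive variant using `Series.str.contains` with `case=False` follows. The helper `find_by_author` is hypothetical and not part of the notebook's functions, and the sample frame is a small hand-made excerpt of titles and authors that appear in the outputs above.

```python
import pandas as pd

# Hand-made excerpt, not the full book_info table
sample = pd.DataFrame({
    "title": ["Carry On", "The Tales of Beedle the Bard", "The Help"],
    "authors": ["Rainbow Rowell", "J.K. Rowling", "Kathryn Stockett"],
})


def find_by_author(frame, fragment):
    """Rows whose authors field contains `fragment`, ignoring case."""
    mask = frame["authors"].str.contains(
        fragment, case=False, regex=False, na=False
    )
    return frame[mask]


print(find_by_author(sample, "row")["authors"].tolist())
```

Passing `regex=False` treats the fragment as plain text, so characters such as `.` in author names are not interpreted as regular expression syntax.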
