<h1 align=center><font size = 5>BOOK RECOMMENDATION SYSTEM</font></h1>

In [1]:
import numpy as np 
import pandas as pd

In [2]:
books = pd.read_csv("books.csv")
book_tags = pd.read_csv("book_tags.csv")
tags = pd.read_csv("tags.csv")
ratings = pd.read_csv("ratings.csv")

<a id="ref1"></a>
# Preprocessing Data

Reviewing the data in ***tags*** and ***book_tags***

In [3]:
tags.head()

Unnamed: 0,tag_id,tag_name
0,0,-
1,1,--1-
2,2,--10-
3,3,--12-
4,4,--122-


In [4]:
book_tags.head()

Unnamed: 0,goodreads_book_id,tag_id,count
0,1,30574,167697
1,1,11305,37174
2,1,11557,34173
3,1,8717,12986
4,1,33114,12716


Both of these can be merged as one using the column ***'tag_id'***

In [5]:
#Left join between book_tags and tags dataframe
book_tags = pd.merge(book_tags,tags,on='tag_id',how='left')
book_tags

Unnamed: 0,goodreads_book_id,tag_id,count,tag_name
0,1,30574,167697,to-read
1,1,11305,37174,fantasy
2,1,11557,34173,favorites
3,1,8717,12986,currently-reading
4,1,33114,12716,young-adult
...,...,...,...,...
999907,33288638,21303,7,neighbors
999908,33288638,17271,7,kindleunlimited
999909,33288638,1126,7,5-star-reads
999910,33288638,11478,7,fave-author
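
To see what the left join does on a small scale, here is a minimal sketch on made-up rows. The toy frames and their values are illustrative assumptions, not data from the CSV files. Passing `indicator=True` adds a `_merge` column that flags any `tag_id` with no match in the tag table.

```python
import pandas as pd

# Toy stand-ins for book_tags and tags (values are illustrative only)
toy_book_tags = pd.DataFrame({
    "goodreads_book_id": [1, 1, 2],
    "tag_id": [10, 11, 99],   # 99 has no entry in toy_tags
    "count": [5, 3, 1],
})
toy_tags = pd.DataFrame({"tag_id": [10, 11], "tag_name": ["fantasy", "favorites"]})

# A left join keeps every row of toy_book_tags; unmatched tag_ids get a NaN tag_name
merged = pd.merge(toy_book_tags, toy_tags, on="tag_id", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
print("rows without a tag name:", len(unmatched))
```

If the count of unmatched rows is zero on the real data, every `tag_id` in ***book_tags*** has a name in ***tags***.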


Removing duplicated rows, if any.

In [6]:
book_tags.drop_duplicates(inplace=True)

In [7]:
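# Preview only: rename() returns a new DataFrame and the result is not assigned,
# so book_tags itself keeps the 'goodreads_book_id' column used in later cells.
book_tags.rename(columns={'goodreads_book_id':'book_id'})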

Unnamed: 0,book_id,tag_id,count,tag_name
0,1,30574,167697,to-read
1,1,11305,37174,fantasy
2,1,11557,34173,favorites
3,1,8717,12986,currently-reading
4,1,33114,12716,young-adult
...,...,...,...,...
999907,33288638,21303,7,neighbors
999908,33288638,17271,7,kindleunlimited
999909,33288638,1126,7,5-star-reads
999910,33288638,11478,7,fave-author


**FINAL *book_tags*:**

In [8]:
book_tags

Unnamed: 0,goodreads_book_id,tag_id,count,tag_name
0,1,30574,167697,to-read
1,1,11305,37174,fantasy
2,1,11557,34173,favorites
3,1,8717,12986,currently-reading
4,1,33114,12716,young-adult
...,...,...,...,...
999907,33288638,21303,7,neighbors
999908,33288638,17271,7,kindleunlimited
999909,33288638,1126,7,5-star-reads
999910,33288638,11478,7,fave-author


Reviewing the data in ***books***

In [9]:
books = books.sort_values(by=['book_id'])

In [10]:
books.head()

Unnamed: 0,id,book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
26,27,1,1,41335427,275,439785960,9780440000000.0,"J.K. Rowling, Mary GrandPré",2005.0,Harry Potter and the Half-Blood Prince,...,1678823,1785676,27520,7308,21516,136333,459028,1161491,https://images.gr-assets.com/books/1361039191m...,https://images.gr-assets.com/books/1361039191s...
20,21,2,2,2809203,307,439358078,9780439000000.0,"J.K. Rowling, Mary GrandPré",2003.0,Harry Potter and the Order of the Phoenix,...,1735368,1840548,28685,9528,31577,180210,494427,1124806,https://images.gr-assets.com/books/1387141547m...,https://images.gr-assets.com/books/1387141547s...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
17,18,5,5,2402163,376,043965548X,9780440000000.0,"J.K. Rowling, Mary GrandPré, Rufus Beck",1999.0,Harry Potter and the Prisoner of Azkaban,...,1832823,1969375,36099,6716,20413,166129,509447,1266670,https://images.gr-assets.com/books/1499277281m...,https://images.gr-assets.com/books/1499277281s...
23,24,6,6,3046572,332,439139600,9780439000000.0,"J.K. Rowling, Mary GrandPré",2000.0,Harry Potter and the Goblet of Fire,...,1753043,1868642,31084,6676,20210,151785,494926,1195045,https://images.gr-assets.com/books/1361482611m...,https://images.gr-assets.com/books/1361482611s...


Removing columns that aren't needed for a content-based recommendation system and renaming some of them for better understanding.

In [11]:
#Drop unnecessary columns
books.drop(
    columns=['id', 'best_book_id', 'work_id', 'isbn', 'isbn13', 'title',
             'work_ratings_count', 'ratings_count', 'work_text_reviews_count',
             'ratings_1', 'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5',
             'image_url', 'small_image_url'],
    inplace=True,
)

#Rename columns
books.rename(columns={'original_publication_year':'pub_year', 'original_title':'title', 'language_code':'language', 'average_rating':'rating'}, inplace=True)

Splitting the values in the ***authors*** column into a ***list of authors*** to simplify future use.

In [12]:
books['authors'] = books.authors.str.split(',')

**FINAL *books*:**

In [13]:
books

Unnamed: 0,book_id,books_count,authors,pub_year,title,language,rating
26,1,275,"[J.K. Rowling, Mary GrandPré]",2005.0,Harry Potter and the Half-Blood Prince,eng,4.54
20,2,307,"[J.K. Rowling, Mary GrandPré]",2003.0,Harry Potter and the Order of the Phoenix,eng,4.46
1,3,491,"[J.K. Rowling, Mary GrandPré]",1997.0,Harry Potter and the Philosopher's Stone,eng,4.44
17,5,376,"[J.K. Rowling, Mary GrandPré, Rufus Beck]",1999.0,Harry Potter and the Prisoner of Azkaban,eng,4.53
23,6,332,"[J.K. Rowling, Mary GrandPré]",2000.0,Harry Potter and the Goblet of Fire,eng,4.53
...,...,...,...,...,...,...,...
7522,31538647,18,[J.K. Rowling],2016.0,Hogwarts: An Incomplete and Unreliable Guide,eng,4.21
4593,31845516,20,[Glennon Doyle Melton],2016.0,Love Warrior,en-US,4.10
9568,32075671,36,[Angie Thomas],2017.0,The Hate U Give,eng,4.62
9579,32848471,8,[Vi Keeland],2017.0,,,4.34


**Pivot table for tag:**

In [14]:
bt_matrix = book_tags.pivot_table(values='count', index='goodreads_book_id', columns='tag_name')
bt_matrix

tag_name,-,--1-,--10-,--12-,--122-,--166-,--17-,--19-,--2-,--258-,...,漫画,골든,﹏moonplus-reader﹏,ﺭﺿﻮﻯ-عاشور,ﻳﻮﺳﻒ-زيدان,Ｃhildrens,Ｆａｖｏｒｉｔｅｓ,Ｍａｎｇａ,ＳＥＲＩＥＳ,ｆａｖｏｕｒｉｔｅｓ
goodreads_book_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31538647,,,,,,,,,,,...,,,,,,,,,,
31845516,,,,,,,,,,,...,,,,,,,,,,
32075671,,,,,,,,,,,...,,,,,,,,,,
32848471,,,,,,,,,,,...,,,,,,,,,,


Fill NaN with 0

In [15]:
bt_matrix = bt_matrix.fillna(0)

**FINAL *bt_matrix*:**

In [16]:
bt_matrix

tag_name,-,--1-,--10-,--12-,--122-,--166-,--17-,--19-,--2-,--258-,...,漫画,골든,﹏moonplus-reader﹏,ﺭﺿﻮﻯ-عاشور,ﻳﻮﺳﻒ-زيدان,Ｃhildrens,Ｆａｖｏｒｉｔｅｓ,Ｍａｎｇａ,ＳＥＲＩＥＳ,ｆａｖｏｕｒｉｔｅｓ
goodreads_book_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31538647,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
31845516,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
32075671,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
32848471,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
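
The pivot and fill steps can be illustrated on a toy frame. The rows below are made up for illustration; only the column names mirror ***book_tags***. Note that `pivot_table` applies its default aggregation (the mean) when the same book/tag pair appears more than once.

```python
import pandas as pd

# Toy rows, for illustration only
toy = pd.DataFrame({
    "goodreads_book_id": [1, 1, 2],
    "tag_name": ["fantasy", "favorites", "fantasy"],
    "count": [5, 3, 7],
})

# One row per book, one column per tag; missing book/tag pairs become NaN
matrix = toy.pivot_table(values="count", index="goodreads_book_id", columns="tag_name")

# Treat "tag never applied to this book" as a count of zero
matrix = matrix.fillna(0)
print(matrix)
```

`pivot_table` also accepts `fill_value=0`, which would combine both steps into one call.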


Use CountVectorizer to generate author_vector

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

In [18]:
cv = CountVectorizer(lowercase = False)

author = cv.fit_transform(books["authors"].astype(str))
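# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (available since 1.0)
author_vector_df = pd.DataFrame(author.todense(), columns = cv.get_feature_names_out())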

author_vector_df.set_index(books.index, inplace = True)

author_vector_df

Unnamed: 0,Aardema,Aaron,Aaronovitch,Ab,Abagnale,Abarbanell,Abbas,Abbey,Abbi,Abbott,...,村上,桜坂洋,樋口,武内,田中メカ,直子,石田,神尾葉子,義博,신경숙
26,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
20,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
17,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
23,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7522,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4593,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9568,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9579,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
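
A small sketch of how `CountVectorizer` tokenizes author strings may help when reading the columns above. The two author strings are illustrative. With the default token pattern, a token needs at least two word characters, so initials such as "J" and "K" are dropped and each remaining name part becomes its own column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two illustrative author strings, formatted like str() of a Python list
authors = ["['J.K. Rowling', ' Mary GrandPré']", "['Angie Thomas']"]

cv_demo = CountVectorizer(lowercase=False)
counts = cv_demo.fit_transform(authors)

# vocabulary_ maps each token to its column index
vocab = sorted(cv_demo.vocabulary_)
print(vocab)
print(counts.toarray())
```

Because names are split into parts, two different authors who share a surname also share a column under this tokenization.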


Add author_vector and tag_vector to the ***books*** dataframe

In [19]:
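# Both assignments are positional: they rely on author_vector_df and bt_matrix
# having exactly one row per book, in the same order as books (sorted by book_id).
books['author_vector'] = [list(row) for index, row in author_vector_df.iterrows()]
books['tag_vector'] = [list(row) for index, row in bt_matrix.iterrows()]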

Dropping the null values

In [20]:
books.dropna(inplace= True)

**FINAL *books*:**

In [21]:
books

Unnamed: 0,book_id,books_count,authors,pub_year,title,language,rating,author_vector,tag_vector
26,1,275,"[J.K. Rowling, Mary GrandPré]",2005.0,Harry Potter and the Half-Blood Prince,eng,4.54,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
20,2,307,"[J.K. Rowling, Mary GrandPré]",2003.0,Harry Potter and the Order of the Phoenix,eng,4.46,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,3,491,"[J.K. Rowling, Mary GrandPré]",1997.0,Harry Potter and the Philosopher's Stone,eng,4.44,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
17,5,376,"[J.K. Rowling, Mary GrandPré, Rufus Beck]",1999.0,Harry Potter and the Prisoner of Azkaban,eng,4.53,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
23,6,332,"[J.K. Rowling, Mary GrandPré]",2000.0,Harry Potter and the Goblet of Fire,eng,4.53,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
...,...,...,...,...,...,...,...,...,...
6427,31538635,18,"[J.K. Rowling, MinaLima]",2016.0,"Short Stories from Hogwarts of Heroism, Hardsh...",eng,4.22,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
7522,31538647,18,[J.K. Rowling],2016.0,Hogwarts: An Incomplete and Unreliable Guide,eng,4.21,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4593,31845516,20,[Glennon Doyle Melton],2016.0,Love Warrior,en-US,4.10,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
9568,32075671,36,[Angie Thomas],2017.0,The Hate U Give,eng,4.62,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


<a id="ref2"></a>
# Content-based Recommendation System

To learn a user's preferences, we look up the books the user has already rated in ***books***, which now carries an author count vector and a tag count vector for each book. Candidate books are then ranked by how close their vectors are to those of the user's books.


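**get_rated_books**: This function takes a user_id as input and returns a list of book_id values for books that the user has rated. Note that, as written, the `>= 4` filter is applied to the `rating` column of ***books*** (the book's average rating), not to the user's own rating, and the user's rated book_id values are matched against the row index of ***books***.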

In [22]:
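def get_rated_books(user_id):
    # Ratings given by this user
    user = ratings[['book_id','rating']][ratings['user_id']==user_id]
    # NOTE: matches the user's rated book_id values against the row index of books, not books['book_id']
    user_authors = books[books.index.isin(user['book_id'].tolist())].reset_index(drop=True)
    # NOTE: 'rating' here is the book's average rating from books, not the user's own rating
    user_authors_higher_than_4 = user_authors[user_authors['rating'] >= 4]
    return list(user_authors_higher_than_4['book_id'])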

In [23]:
get_rated_books(9)

[95608, 152662, 241823, 1027760]
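
If the intent is to keep only books that the user personally rated 4 or higher, the filter would be applied to the user's rows in ***ratings*** instead. A minimal sketch of that variant follows, on toy frames. The function name `get_liked_books`, the `min_rating` parameter, and the assumption that both frames share a `book_id` key are illustrative and are not taken from the notebook.

```python
import pandas as pd

def get_liked_books(ratings, books, user_id, min_rating=4):
    """Return book_ids the user rated at least min_rating and that exist in books.

    Assumes ratings and books share the same `book_id` key (an assumption of
    this sketch, not something verified against the CSV files).
    """
    user = ratings[ratings["user_id"] == user_id]
    liked = user.loc[user["rating"] >= min_rating, "book_id"]
    # Keep only ids that are present in the books table
    return liked[liked.isin(books["book_id"])].tolist()

# Toy data, for illustration only
toy_ratings = pd.DataFrame({
    "user_id": [9, 9, 9, 7],
    "book_id": [1, 2, 3, 1],
    "rating":  [5, 3, 4, 2],
})
toy_books = pd.DataFrame({"book_id": [1, 2, 5], "rating": [4.5, 4.4, 3.9]})

print(get_liked_books(toy_ratings, toy_books, user_id=9))
```

Whether the two files actually share a key should be checked on the real data before using a variant like this.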

**cal_distance**: This function takes two book_id values and the dataframe as input. It returns the sum of the cosine distances between the two books' author vectors and between their tag vectors.

In [24]:
from scipy import spatial

def cal_distance(book1, book2, df):
    row1 = df[df['book_id'] == book1]
    row2 = df[df['book_id'] == book2]
    
    authors1 = row1['author_vector'].values[0]
    authors2 = row2['author_vector'].values[0]
    author_distance = spatial.distance.cosine(authors1, authors2)
    
    tags1 = row1['tag_vector'].values[0]
    tags2 = row2['tag_vector'].values[0]
    tag_distance = spatial.distance.cosine(tags1, tags2)
    
    return author_distance + tag_distance

In [25]:
cal_distance(95608,2, books)

1.0413580082219798

In [26]:
cal_distance(1,3, books)

0.014652648586020889
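
For reference, the cosine distance between two vectors `u` and `v` is `1 - (u · v) / (|u| |v|)`: vectors pointing in the same direction give 0 and orthogonal vectors give 1. Since `cal_distance` adds two such terms, its result lies between 0 and 2 when all vector entries are non-negative. A small check on made-up vectors (the helper `cosine_by_hand` is introduced here for illustration only):

```python
import numpy as np
from scipy.spatial import distance

u = np.array([1.0, 0.0, 1.0])
v = np.array([1.0, 0.0, 1.0])   # same direction as u
w = np.array([0.0, 1.0, 0.0])   # shares no non-zero component with u

def cosine_by_hand(a, b):
    # 1 - (a . b) / (|a| |b|)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(distance.cosine(u, v), cosine_by_hand(u, v))
print(distance.cosine(u, w), cosine_by_hand(u, w))
```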

**top_k_neighbor**: This function takes a list of book_id values, the dataframe, and an integer k as input. It returns the k books closest to the user's preferred books as (book_id, distance) pairs, where a candidate's distance is its smallest distance to any book in the list.

In [27]:
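def top_k_neighbor(book, df, k):
    # S maps each candidate book_id to its smallest distance from any book in the input list
    S = {}
    for b in book:
        for i in list(df['book_id']):
            if i in book:
                continue  # skip books that are already in the input list
            a = round(cal_distance(b, i, df), 3)
            S[i] = min(S.get(i, a), a)
    # Sort candidates by ascending distance and keep the k closest
    sortedS = sorted(S.items(), key=lambda kv: kv[1])
    return sortedS[:k]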
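
The nested loop calls `cal_distance` once per (input book, candidate) pair, and each call filters the dataframe twice. One possible alternative, sketched here on toy vectors, computes all pairwise cosine distances at once with `scipy.spatial.distance.cdist` and then takes the minimum over the input books. The function name `top_k_by_min_distance` and the toy feature matrix are assumptions made for illustration. Unlike `cal_distance`, this sketch uses a single feature matrix rather than summing separate author and tag distances.

```python
import numpy as np
from scipy.spatial.distance import cdist

def top_k_by_min_distance(features, ids, liked_ids, k):
    """Rank candidates by their smallest cosine distance to any liked item.

    features: 2-D array, one row per item; ids: item ids in the same row order.
    """
    ids = np.asarray(ids)
    liked_mask = np.isin(ids, liked_ids)
    # Rows = liked items, columns = candidate items
    dist = cdist(features[liked_mask], features[~liked_mask], metric="cosine")
    best = dist.min(axis=0)            # closest liked item per candidate
    order = np.argsort(best)[:k]       # indices of the k smallest distances
    cand_ids = ids[~liked_mask]
    return [(int(cand_ids[j]), round(float(best[j]), 3)) for j in order]

# Toy feature vectors, for illustration only
toy_features = np.array([
    [1.0, 0.0],   # id 10 (liked)
    [1.0, 0.1],   # id 11, nearly the same direction as id 10
    [0.0, 1.0],   # id 12, orthogonal to id 10
    [1.0, 1.0],   # id 13, 45 degrees from id 10
])
result = top_k_by_min_distance(toy_features, [10, 11, 12, 13], [10], k=2)
print(result)
```

The sketch does not handle an empty `liked_ids` list; a caller would need to check for that case first.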

**recommend_book**: This function takes a user_id and an integer num_of_book as input. It prints the titles of the books that the user may like.

In [28]:
def recommend_book(user_id, num_of_book):
    top_k_book = top_k_neighbor(get_rated_books(user_id), books, num_of_book)
    for i in top_k_book:
        print(books['title'][books['book_id']==i[0]].values[0])
    return

<a id="ref3"></a>
# The final recommendation result:

In [29]:
# Recommend 8 new books for user_id 9
recommend_book(9,8)

And the Shofar Blew
An Echo in the Darkness (Mark of the Lion, #2)
As Sure as the Dawn (Mark of the Lion, #3)
The Atonement Child
Her Mother's Hope (Marta's Legacy, #1)
Mad About Madeline
Madeline's Rescue
The Scarlet Thread 


<a id="ref3"></a>
# Evaluation

The cells below fit Surprise's SVD, a collaborative-filtering model, on ***ratings*** and report its RMSE on a held-out 25% test split. This measures the SVD model, not the content-based recommender above.

In [30]:
from surprise import Reader, Dataset
reader = Reader()
# Surprise expects the columns in this order: user ids, item ids, ratings
data = Dataset.load_from_df(ratings[['user_id', 'book_id', 'rating']], reader)

In [31]:
from surprise.model_selection import train_test_split
trainset, testset = train_test_split(data, test_size=0.25)

In [32]:
from surprise import SVD, accuracy
algo = SVD()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x155e1555688>

In [33]:
predictions = algo.test(testset)

In [34]:
accuracy.rmse(predictions)

RMSE: 0.8451


0.8450886747413946
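
RMSE is the square root of the mean squared difference between predicted and actual ratings. A minimal sketch with made-up numbers (the arrays below are illustrative and unrelated to the result above):

```python
import numpy as np

actual = np.array([4.0, 3.0, 5.0, 2.0])      # illustrative true ratings
predicted = np.array([3.5, 3.0, 4.0, 3.0])   # illustrative model estimates

# Root of the mean squared error
rmse = np.sqrt(np.mean((predicted - actual) ** 2))
print(rmse)
```

On a 1 to 5 rating scale, a lower RMSE means the predicted ratings are closer to the actual ones on average.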