## Book Recommendation

Book recommendation systems are a technology that provides book recommendations based on users' interests and previous reading habits. With the rapid growth of digital publishing and online book sales platforms, the need for such systems has increased to make it easier for users to find the right books. Book recommendation systems play an important role in enriching users' discovery experience while also increasing book sales.

The success of recommendation systems not only improves user experience, but also provides valuable data for publishers and authors by analyzing reading habits. Users' trust in book recommendation systems is directly related to the accuracy of the systems and the quality of the recommendations. Therefore, developing an effective book recommendation system has become an important goal in terms of both technical challenges and user satisfaction.

Here, we will develop a recommendation system that recommends similar books to the user based on the book title and description.

<img src='book.jpg' width=450 >

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction import text
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics.pairwise import cosine_similarity
import plotly.express as px
import plotly.graph_objects as go

In [2]:
df=pd.read_csv('book_data.csv')
#Read the file.

## EDA - Exploratory Data Analysis

In [3]:
df.head()

Unnamed: 0,book_authors,book_desc,book_edition,book_format,book_isbn,book_pages,book_rating,book_rating_count,book_review_count,book_title,genres,image_url
0,Suzanne Collins,Winning will make you famous. Losing means cer...,,Hardcover,9780440000000.0,374 pages,4.33,5519135,160706,The Hunger Games,Young Adult|Fiction|Science Fiction|Dystopia|F...,https://images.gr-assets.com/books/1447303603l...
1,J.K. Rowling|Mary GrandPré,There is a door at the end of a silent corrido...,US Edition,Paperback,9780440000000.0,870 pages,4.48,2041594,33264,Harry Potter and the Order of the Phoenix,Fantasy|Young Adult|Fiction,https://images.gr-assets.com/books/1255614970l...
2,Harper Lee,The unforgettable novel of a childhood in a sl...,50th Anniversary,Paperback,9780060000000.0,324 pages,4.27,3745197,79450,To Kill a Mockingbird,Classics|Fiction|Historical|Historical Fiction...,https://images.gr-assets.com/books/1361975680l...
3,Jane Austen|Anna Quindlen|Mrs. Oliphant|George...,«È cosa ormai risaputa che a uno scapolo in po...,"Modern Library Classics, USA / CAN",Paperback,9780680000000.0,279 pages,4.25,2453620,54322,Pride and Prejudice,Classics|Fiction|Romance,https://images.gr-assets.com/books/1320399351l...
4,Stephenie Meyer,About three things I was absolutely positive.F...,,Paperback,9780320000000.0,498 pages,3.58,4281268,97991,Twilight,Young Adult|Fantasy|Romance|Paranormal|Vampire...,https://images.gr-assets.com/books/1361039443l...


In [4]:
df.shape

(54301, 12)

In [5]:
#df=df[['book_title', 'book_desc','book_rating', 'book_rating_count']]

In [6]:
df.drop(['book_edition','image_url','book_isbn','genres'], axis=1,inplace=True)

In [7]:
df.head()

Unnamed: 0,book_authors,book_desc,book_format,book_pages,book_rating,book_rating_count,book_review_count,book_title
0,Suzanne Collins,Winning will make you famous. Losing means cer...,Hardcover,374 pages,4.33,5519135,160706,The Hunger Games
1,J.K. Rowling|Mary GrandPré,There is a door at the end of a silent corrido...,Paperback,870 pages,4.48,2041594,33264,Harry Potter and the Order of the Phoenix
2,Harper Lee,The unforgettable novel of a childhood in a sl...,Paperback,324 pages,4.27,3745197,79450,To Kill a Mockingbird
3,Jane Austen|Anna Quindlen|Mrs. Oliphant|George...,«È cosa ormai risaputa che a uno scapolo in po...,Paperback,279 pages,4.25,2453620,54322,Pride and Prejudice
4,Stephenie Meyer,About three things I was absolutely positive.F...,Paperback,498 pages,3.58,4281268,97991,Twilight


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54301 entries, 0 to 54300
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   book_authors       54301 non-null  object 
 1   book_desc          52970 non-null  object 
 2   book_format        52645 non-null  object 
 3   book_pages         51779 non-null  object 
 4   book_rating        54301 non-null  float64
 5   book_rating_count  54301 non-null  int64  
 6   book_review_count  54301 non-null  int64  
 7   book_title         54301 non-null  object 
dtypes: float64(1), int64(2), object(5)
memory usage: 3.3+ MB


In [9]:
df.isnull().sum()

Unnamed: 0,0
book_authors,0
book_desc,1331
book_format,1656
book_pages,2522
book_rating,0
book_rating_count,0
book_review_count,0
book_title,0


In [10]:
df=df.dropna()

In [11]:
df.isnull().sum()

Unnamed: 0,0
book_authors,0
book_desc,0
book_format,0
book_pages,0
book_rating,0
book_rating_count,0
book_review_count,0
book_title,0


In [12]:
df=df[df['book_format'].str.lower().isin(['paperback'])]

In [13]:
df['book_format'].value_counts()

Unnamed: 0_level_0,count
book_format,Unnamed: 1_level_1
Paperback,27682
paperback,16


In [14]:
#df['book_desc']=df['book_desc'].str.lower()
#df['book_desc']=df['book_desc'].str.replace("[^\w\s]" , "",regex=True)
#df['book_desc']=df['book_desc'].str.replace('[\n]', '',regex=True)
#df['book_desc']=df['book_desc'].str.replace('\d+','',regex=True)
#df['book_desc']=df['book_desc'].str.replace('\r',' ')

In [15]:
#df['book_authors']=df['book_authors'].str.lower()
#df['book_authors']=df['book_authors'].str.replace("[^\w\s]" , "",regex=True)
#df['book_authors']=df['book_authors'].str.replace('[\n]', '',regex=True)
#df['book_authors']=df['book_authors'].str.replace('\d+','',regex=True)
#df['book_authors']=df['book_authors'].str.replace('\r',' ')

In [16]:
def clean_text(series):
    series = series.str.lower()
    series = series.str.replace("[^\w\s]", " ", regex=True)
    series = series.str.replace('[\n]', ' ', regex=True)
    series = series.str.replace('\d+', '', regex=True)
    series = series.str.replace('\r', ' ', regex=True)
    return series

df['book_desc'] = clean_text(df['book_desc'])
df['book_authors'] = clean_text(df['book_authors'])

In [17]:
df['book_pages']=df['book_pages'].str.replace('pages','')

### Separating Non-English Values

In [42]:
#pip install langid

In [19]:
import langid

In [20]:
df['language'] = df['book_desc'].apply(lambda text: langid.classify(text)[0])
#I used a different library because the langdetect library found the texts too short and gave an error.

In [21]:
df = df[df['language'] == 'en']
#Some books looked like different books even though they were the same, because they were in different languages.
#That's why I chose only English.

In [22]:
df['language'].value_counts()

Unnamed: 0_level_0,count
language,Unnamed: 1_level_1
en,23549


In [23]:
df['book_key']=df['book_title'] + "_" + df['book_authors']
#Some books looked like different books because their pages were different. We do this to prevent this.

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['book_key']=df['book_title'] + "_" + df['book_authors']


In [24]:
df.duplicated().sum()

4

In [25]:
df=df.drop_duplicates()

In [26]:
df['book_title'].duplicated().sum()

1529

In [27]:
df=df.drop_duplicates(subset=['book_title'])

In [28]:
df=df.drop_duplicates(subset=['book_key'])
#Remove duplicate records (by key field)

In [29]:
most_popular=df.sort_values(by='book_rating_count', ascending=False)
most_popular.head(5)

Unnamed: 0,book_authors,book_desc,book_format,book_pages,book_rating,book_rating_count,book_review_count,book_title,language,book_key
14936,suzanne collins,winning will make you famous losing means cert...,Paperback,454,4.33,5521568,160750,The Hunger Games,en,The Hunger Games_suzanne collins
4,stephenie meyer,about three things i was absolutely positive f...,Paperback,498,3.58,4281268,97991,Twilight,en,Twilight_stephenie meyer
2,harper lee,the unforgettable novel of a childhood in a sl...,Paperback,324,4.27,3745197,79450,To Kill a Mockingbird,en,To Kill a Mockingbird_harper lee
29,f scott fitzgerald,alternate cover edition isbn isbn a true ...,Paperback,180,3.9,3141842,56953,The Great Gatsby,en,The Great Gatsby_f scott fitzgerald
21598,john green,despite the tumor shrinking medical miracle th...,Paperback,318,4.24,2883603,147296,The Fault in Our Stars,en,The Fault in Our Stars_john green


In [30]:
df = df.sort_values(by="book_rating_count", ascending=False)
most_popular = most_popular.head(5)

labels = most_popular["book_title"]
values = most_popular["book_rating_count"]

fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.update_layout(title_text="Top 5 Rated Books")

fig.show()

In [31]:
feature = df["book_desc"].tolist()
tfidf = text.TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(feature)
similarity = linear_kernel(tfidf_matrix, tfidf_matrix)

In [32]:
indices = pd.Series(df.index,
                    index=df['book_title']).drop_duplicates()

In [40]:
def book_recommendation(title, similarity = similarity):
    index = indices[title]
    similarity_scores = list(enumerate(similarity[index]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    similarity_scores = similarity_scores[0:5]
    bookindices = [i[0] for i in similarity_scores]
    return df['book_title'].iloc[bookindices]

print(book_recommendation("The Hunger Games"))

53903                 Manituana
2622     The Jewel in the Crown
2100        Into the Wilderness
44786         Ministry of Space
42911                The Escape
Name: book_title, dtype: object


In [43]:
book_recommendation("Pride and Prejudice")

Unnamed: 0,book_title
2120,"Ronia, the Robber's Daughter"
51507,The Three Robbers
36963,The Last Unicorn #3
15833,Castle
34809,Dreams of Sex and Stage Diving


## Summary

In this project, we had a book dataset. In order to develop a user-based recommendation system, we first organized our data. We selected the columns we needed. We removed our empty data. Although some books were the same, they looked like different books because of the number of pages, language and cover. To prevent this, we only used the English language, adjusted the number of pages and cover. We removed our repetitive data. We found the 5 most voted books and created a recommendation system accordingly and completed our project.