Importing Project dependencies

In [None]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn import neighbors
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

We will start by importing our data set. It contains information about books, their authors, and other relevant details. Let's take a look at what the different columns mean:
* bookID - The unique ID for each book/series.
* title - The title of the book.
* authors - The author(s) of the book.
* average_rating - The average rating of the book, as given by users.
* ISBN - Another unique number to identify the book, the International Standard Book Number.
* ISBN 13 - A 13-digit ISBN to identify the book, instead of the older 10-digit ISBN.
* language_code - The primary language of the book. For instance, eng stands for English.
* num_pages - Number of pages the book contains. The code below refers to this column as "  num_pages", with two leading spaces.
* ratings_count - Total number of ratings the book received.
* text_reviews_count - Total number of written text reviews the book received.
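
The plotting code later in this notebook refers to the page-count column as "  num_pages", which suggests the header in the file carries leading spaces. As a minimal sketch, assuming that is the case, padded headers can be normalized once after loading. The `toy` frame below is a hypothetical stand-in for the real file, not the actual data.

```python
import pandas as pd

# Hypothetical stand-in for books.csv; the padded "  num_pages" header is an
# assumption based on how the column is referenced later in the notebook.
toy = pd.DataFrame({
    "bookID": [1, 2],
    "  num_pages": [652, 870],
    "ratings_count": [10, 20],
})

# Strip surrounding whitespace from every column name.
toy.columns = toy.columns.str.strip()
print(list(toy.columns))
```

If the real data frame were cleaned this way, every later reference to "  num_pages" would have to change to "num_pages" as well.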

In [None]:
# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
df = pd.read_csv('/kaggle/input/goodreadsbooks/books.csv', on_bad_lines='skip')

In [None]:
df.head() #checking the head of our data

Now that we know what our data looks like, let's go ahead and look for any null values present in our data.

In [None]:
df.isnull().sum() #checking for any null values present in the data

In [None]:
df.dtypes #checking the data types of each column

In [None]:
df.describe() #checking for hidden values such as the maximum rating of our books, the average number of ratings

From the above results, we can see that our ratings all lie between 0 and 5. We also learn more about the other columns, such as the mean of the average ratings, which might help us in later steps. We also checked the data type of each column and saw that there are no null values in our data.

In [None]:
top_ten = df[df['ratings_count'] > 1000000]
top_ten.sort_values(by='average_rating', ascending=False).head(10)

The above results show us the top 10 books in our data. We saw that the maximum rating in our data was 5.0, but we don't see any book with a 5.0 rating in the above result. This is because we kept only books with more than 1,000,000 ratings, so every book listed has a substantial number of ratings. A book with only 1 or 2 ratings can easily have an average of 5.0, and we want to avoid such books, hence this filtering. Let's go ahead and visualize this outcome as a graph.

In [None]:
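# White-grid theme via seaborn; the 'seaborn-whitegrid' style name was renamed in newer matplotlib releases
sns.set_style('whitegrid')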
plt.figure(figsize=(10, 10))
data = top_ten.sort_values(by='average_rating', ascending=False).head(10)
sns.barplot(x="average_rating", y="title", data=data, palette='inferno')
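
To see why the ratings_count filter matters, here is a minimal sketch on made-up data. The titles, ratings, and the `min_ratings` name are illustrative assumptions, not values from the data set.

```python
import pandas as pd

# Made-up books: "A" and "D" have high averages but almost no ratings.
books = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "average_rating": [5.0, 4.6, 4.2, 4.8],
    "ratings_count": [2, 1_500_000, 2_000_000, 30],
})

min_ratings = 1_000_000  # same threshold as the filter used above

# Without the filter, the barely rated books come out on top.
unfiltered = books.sort_values("average_rating", ascending=False)
print(unfiltered["title"].tolist())

# With the filter, only widely rated books are ranked.
popular = books[books["ratings_count"] > min_ratings]
top = popular.sort_values("average_rating", ascending=False)
print(top["title"].tolist())
```

The threshold is a judgment call: a lower value keeps more books but lets averages based on very few ratings back into the ranking.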

Let's go ahead and take a look at some top authors present in our data. We will rank them according to the number of books they have written provided these books are present in the data.

In [None]:
# Count titles per author and keep the ten authors with the most books
most_books = df.groupby('authors')['title'].count().reset_index().sort_values('title', ascending=False).head(10).set_index('authors')
plt.figure(figsize=(15,10))
# x and y are passed by keyword; recent seaborn versions no longer accept them positionally
ax = sns.barplot(x=most_books['title'], y=most_books.index, palette='inferno')
ax.set_title("Top 10 authors with most books")
ax.set_xlabel("Total number of books")
# Label each bar with its count
for i in ax.patches:
    ax.text(i.get_width()+.2, i.get_y()+.2,str(round(i.get_width())), fontsize=15,color='black')
plt.show()

According to the graph, Stephen King and P.G. Wodehouse have the most books in the data. Both authors have 40 books in our data set, followed by Rumiko Takahashi and Orson Scott Card.

In [None]:
#df.head()

Next we will take a look at the books that have been rated the most. Our data has the average rating of each book as well as the number of times it has been rated (ratings_count). We will use the latter column to find the most rated books in our data.

In [None]:
# Ten books with the highest number of ratings
most_rated = df.sort_values('ratings_count', ascending = False).head(10).set_index('title')
plt.figure(figsize=(15,10))
# x and y are passed by keyword; recent seaborn versions no longer accept them positionally
ax = sns.barplot(x=most_rated['ratings_count'], y=most_rated.index, palette = 'inferno')
# Label each bar with its ratings count
for i in ax.patches:
    ax.text(i.get_width()+.2, i.get_y()+.2,str(round(i.get_width())), fontsize=15,color='black')
plt.show()

We can see that Twilight has been rated more times than any other book! Also, these counts are all in the millions: Twilight was rated more than 4 million times, followed by The Hobbit or There and Back Again and The Catcher in the Rye, which have each been rated more than 2 million times!

Now we know that these books can be written in many different languages. We will use the language code to check how many books were written in each language.

In [None]:
plt.figure(1, figsize=(25,10))
plt.title("Languages")
sns.countplot(x = "language_code", order=df['language_code'].value_counts().index[0:10] ,data=df,palette='inferno')

We can see that most of the books are written in English, be it US or UK English. This was expected, but the plot lets us confirm it. We also have languages such as Japanese and German, but these aren't very prominent.

There are tons of great and famous authors present in our data. Our next goal is to figure out the top 10 authors based on the average ratings of their books. We will rank the authors by how many of their books have an average rating above 4.4.

In [None]:
# Keep only books rated above 4.4, then count them per author
highly_rated_author = df[df['average_rating'] > 4.4]
highly_rated_author = highly_rated_author.groupby('authors')['title'].count().reset_index().sort_values('title',ascending=False).head(10).set_index('authors')
plt.subplots(figsize=(15,10))
ax = highly_rated_author['title'].sort_values().plot.barh(width=0.9,color=sns.color_palette('inferno',12))
ax.set_xlabel("Total books ", fontsize=15)
ax.set_ylabel("Authors", fontsize=15)
ax.set_title("Top 10 highly rated authors",fontsize=20,color='black')
# Label each bar with its count
for i in ax.patches:
    ax.text(i.get_width()+.2, i.get_y()+.2,str(round(i.get_width())), fontsize=15,color='black')
plt.show()

From the above graph we can see that Hiromu Arakawa has the most books rated above 4.4 in our data set, followed by J.K. Rowling and some other big names. Now that we know about our authors, we will take a look at our top publishers as well.

In [None]:
# Count titles per publisher and keep the ten publishers with the most books
top_publishers = df.groupby('publisher')['title'].count().reset_index().sort_values('title',ascending=False).head(10).set_index('publisher')
plt.subplots(figsize=(15,10))
ax = top_publishers['title'].sort_values().plot.barh(width=0.9,color=sns.color_palette('inferno',12))
ax.set_xlabel("Total books ", fontsize=15)
ax.set_ylabel("Publishers", fontsize=15)
ax.set_title("Top 10 Publishers Present in our data",fontsize=20,color='black')
# Label each bar with its count
for i in ax.patches:
    ax.text(i.get_width()+.2, i.get_y()+.2,str(round(i.get_width())), fontsize=15,color='black')
plt.show()

Vintage is the publisher with the most books in our data, followed by Penguin Books and Penguin Classics.

Next up, we will have a look at the distribution of average ratings. This will be very important for us when we go on to build our recommender.

In [None]:
df.average_rating = df.average_rating.astype(float)
fig, ax = plt.subplots(figsize=[15,10])
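# histplot with a KDE overlay replaces the deprecated distplot; stat='density' keeps the same y-axis scale
sns.histplot(df['average_rating'], kde=True, stat='density', ax=ax)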
ax.set_title('Average rating distribution for all books',fontsize=20)
ax.set_xlabel('Average rating',fontsize=13)

As we can see, the majority of our ratings fall between 3 and 4.5. Hardly any books have been rated a 1 or a 2, and the same goes for 5.

In [None]:
df.info()

Let's try to find some relation between our average rating and the ratings count. We are doing this to see how we can use these columns in our recommender. We will also plot the average rating against the number of pages of a book, the language of the book, and the number of text reviews.

In [None]:
ax = sns.relplot(data=df, x="average_rating", y="ratings_count", color = 'red', sizes=(100, 200), height=7, marker='o')
plt.title("Relation between Rating counts and Average Ratings",fontsize = 15)
ax.set_axis_labels("Average Rating", "Ratings Count")

In [None]:
# relplot is figure-level and creates its own figure, so no plt.figure() call is needed
ax = sns.relplot(x="average_rating", y="  num_pages", data = df, color = 'red',sizes=(100, 200), height=7, marker='o')
ax.set_axis_labels("Average Rating", "Number of Pages")

In [None]:
# relplot is figure-level and creates its own figure, so no plt.figure() call is needed
ax = sns.relplot(x="average_rating", y="language_code", data = df, color = 'red',sizes=(100, 200), height=7, marker='o')
ax.set_axis_labels("Average Rating", "Languages")

In [None]:
# relplot is figure-level and creates its own figure, so no plt.figure() call is needed
ax = sns.relplot(x="average_rating", y="text_reviews_count", data = df, color = 'red',sizes=(100, 200), height=7, marker='o')
ax.set_axis_labels("Average Rating", "Text Reviews Count")

After comparing the average rating with the different columns, we will go ahead and use the language and the ratings count for our recommender system. The other columns did not show a useful relationship with the average rating, so we omit them.

We will create a copy of our original data so that we are safe in case we mess something up.

In [None]:
df2 = df.copy()

We will now create a new column called 'rating_between'. We will divide our average rating column into various categories such as rating between 0 and 1, 1 and 2 and so on. This will work as one of the features that we will feed to our model so that it can make better predictions.

In [None]:
df2.loc[ (df2['average_rating'] >= 0) & (df2['average_rating'] <= 1), 'rating_between'] = "between 0 and 1"
df2.loc[ (df2['average_rating'] > 1) & (df2['average_rating'] <= 2), 'rating_between'] = "between 1 and 2"
df2.loc[ (df2['average_rating'] > 2) & (df2['average_rating'] <= 3), 'rating_between'] = "between 2 and 3"
df2.loc[ (df2['average_rating'] > 3) & (df2['average_rating'] <= 4), 'rating_between'] = "between 3 and 4"
df2.loc[ (df2['average_rating'] > 4) & (df2['average_rating'] <= 5), 'rating_between'] = "between 4 and 5"

In [None]:
df2.head()
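
The same binning can also be expressed with `pd.cut`. This is a sketch of an alternative on made-up ratings, not the approach used in this notebook. The `bins` and `labels` names are illustrative, and `include_lowest=True` is what makes a rating of exactly 0 land in the first group, matching the `>= 0` condition above.

```python
import pandas as pd

# Made-up ratings, including the boundary values 0.0, 1.0, 3.0 and 5.0.
ratings = pd.Series([0.0, 1.0, 1.5, 3.0, 3.94, 4.57, 5.0], name="average_rating")

bins = [0, 1, 2, 3, 4, 5]
labels = [
    "between 0 and 1",
    "between 1 and 2",
    "between 2 and 3",
    "between 3 and 4",
    "between 4 and 5",
]

# Intervals are closed on the right: (0, 1], (1, 2], ... with 0 included in the first.
rating_between = pd.cut(ratings, bins=bins, labels=labels, include_lowest=True)
print(rating_between.tolist())
```

Boundary values go to the lower group, so a rating of exactly 3.0 is labelled "between 2 and 3", just as in the `df2.loc` assignments.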

We will now create two new data frames. The first one-hot encodes the rating_between column we just made: each rating group becomes its own column, holding 1 if the book's rating falls in that group (say, between 4 and 5) and 0 otherwise. We apply the same approach to the language_code column, so that each language gets its own column with a value of 1 if the book is written in that language (for example, English) and 0 if it is not.

In [None]:
rating_df = pd.get_dummies(df2['rating_between'])
rating_df.head()

In [None]:
language_df = pd.get_dummies(df2['language_code'])
language_df.head()
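
As a small illustration of what `pd.get_dummies` produces, here is a sketch on four made-up language codes. The values are illustrative and not taken from the data set.

```python
import pandas as pd

# Four made-up books with three distinct language codes.
codes = pd.Series(["eng", "en-US", "eng", "spa"], name="language_code")

# One column per distinct code; each row has exactly one active column.
dummies = pd.get_dummies(codes)
print(dummies.astype(int))
```

Depending on the pandas version the indicator columns hold booleans or 0/1 integers. Both behave the same once they are passed through the scaler.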

We will now concatenate these two data frames into one and name it features. This data frame holds the features that we will feed to the model. It will contain the values of rating_df and language_df and will also have the values of average rating and ratings count.

In [None]:
features = pd.concat([rating_df, language_df, df2['average_rating'], df2['ratings_count']], axis=1)
features.head()

Now that we have our features ready, we will use the Min-Max scaler to rescale them. Each column is mapped to the range 0 to 1 using that column's minimum and maximum. This keeps columns with very large values, such as ratings_count, from dominating the distance calculation over columns such as average_rating.

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
min_max_scaler = MinMaxScaler()
features = min_max_scaler.fit_transform(features)
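
Min-max scaling maps each column with x' = (x - min) / (max - min). The sketch below checks this formula against `MinMaxScaler` on a tiny made-up matrix. The numbers are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two made-up columns on very different scales:
# an average rating and a ratings count.
X = np.array([
    [3.5, 10.0],
    [4.0, 1000.0],
    [4.5, 2_000_000.0],
])

scaled = MinMaxScaler().fit_transform(X)

# The same result computed by hand, column by column.
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(np.allclose(scaled, manual))
print(scaled)
```

After scaling, both columns span 0 to 1, so neither dominates a Euclidean distance purely because of its units.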

We have scaled down our features and now we will use KNN to create our Recommender system.

In [None]:
model = neighbors.NearestNeighbors(n_neighbors=6, algorithm='ball_tree')
model.fit(features)
dist, idlist = model.kneighbors(features)
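
To make the shape of `dist` and `idlist` concrete, here is a sketch on four made-up 2-D points. The names `points`, `nn`, `toy_dist`, and `toy_idx` are illustrative. When the model is queried with the same points it was fit on, each point's nearest neighbour is itself, at distance 0.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four made-up points forming two tight pairs.
points = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [1.0, 1.0],
    [0.9, 1.0],
])

nn = NearestNeighbors(n_neighbors=2, algorithm="ball_tree").fit(points)

# Row i holds the distances to, and indices of, the 2 nearest neighbours of point i.
toy_dist, toy_idx = nn.kneighbors(points)
print(toy_idx)
print(toy_dist)
```

This is why n_neighbors=6 in the notebook yields the query book plus five other books. If several rows have identical features, the row itself is not guaranteed to be listed first.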

We have fit the model on all the features, and now we will create a helper function. It takes the name of a book, looks up that book's row, and uses the neighbour indices computed above to find the books whose features are closest. We store the titles of these books in a list and return it at the end.

In [None]:
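def BookRecommender(book_name):
    """Return the titles of the books closest to book_name in feature space."""
    matches = df2[df2['title'] == book_name].index
    if len(matches) == 0:
        # Unknown title: return an empty list instead of raising an IndexError
        return []
    book_id = matches[0]
    # idlist holds, for every book, the row indices of its nearest neighbours
    return [df2.loc[newid].title for newid in idlist[book_id]]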

In [None]:
BookNames = BookRecommender('Harry Potter and the Half-Blood Prince (Harry Potter  #6)')
BookNames

In [None]:
BookNames = BookRecommender('The Lord of the Rings: Weapons and Warfare')
BookNames
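
Because the neighbour lists were computed from the same rows the model was fit on, the queried book usually shows up in its own recommendations. The sketch below is one possible variant that drops it, shown on made-up data. `toy_books`, `toy_model`, `toy_idlist`, and `recommend` are hypothetical names and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Made-up catalogue with a single feature, the average rating.
toy_books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "average_rating": [4.5, 4.4, 3.0, 3.1],
})
toy_features = toy_books[["average_rating"]].to_numpy()
toy_model = NearestNeighbors(n_neighbors=3).fit(toy_features)
_, toy_idlist = toy_model.kneighbors(toy_features)


def recommend(book_name, books=toy_books, idlist=toy_idlist):
    """Return neighbour titles for book_name, excluding the book itself."""
    matches = np.flatnonzero(books["title"].to_numpy() == book_name)
    if matches.size == 0:
        return []  # title not present in the catalogue
    position = matches[0]
    # Drop the query row from its own neighbour list.
    neighbours = [i for i in idlist[position] if i != position]
    return books["title"].iloc[neighbours].tolist()


print(recommend("Book A"))
print(recommend("Not in the data"))
```

Positions are looked up with `iloc` here, which keeps the lookup correct even if the data frame's index is not the default 0..n-1 range.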

With this we come to the end of our recommender system. As we can see, our model is showing some pretty decent results. We passed in the name of one of the Harry Potter books and our system quickly recommended books based upon the average ratings. The books that we received have almost the same ratings, and they include titles such as The Fellowship of the Ring, which is again a fantasy story somewhat similar to the Harry Potter books. So we can say that our model is giving decent results. I would like to thank [Shivam Ralli](https://www.kaggle.com/hoshi7), whose notebook I referenced. It is a very well written kernel and everyone should have a look at it.

In [None]:
import pickle

In [None]:
# Use a context manager so the file is closed once the model has been written
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
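
A pickled model can be restored with `pickle.load` (or `pickle.loads` for bytes). The sketch below does an in-memory round trip on a made-up model so that it does not touch `model.pkl`. The names `toy_model`, `blob`, and `restored` are illustrative.

```python
import pickle

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up one-dimensional training data.
toy_model = NearestNeighbors(n_neighbors=2).fit(np.array([[0.0], [1.0], [5.0]]))

# Serialize to bytes and restore, mirroring pickle.dump / pickle.load on a file.
blob = pickle.dumps(toy_model)
restored = pickle.loads(blob)

# The restored model answers queries like the original one.
_, idx = restored.kneighbors(np.array([[0.9]]))
print(idx[0].tolist())
```

The pickle holds only the fitted neighbours model. To produce recommendations elsewhere, the fitted scaler and the table of titles would need to be saved as well, and pickle files should only be loaded from trusted sources.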