# BOOKS EDA AND RECOMMENDER SYSTEM

![Good books image](https://images.unsplash.com/photo-1481627834876-b7833e8f5570?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1441&q=80)

Importing the required libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# on_bad_lines='skip' (pandas >= 1.3) replaces the deprecated error_bad_lines=False
books = pd.read_csv('../input/goodreadsbooks/books.csv', on_bad_lines='skip')

## Sneak Peek of Dataset

In [None]:
books.head()

In [None]:
books.shape

In [None]:
#List of columns
list(books.columns)

# Data Cleaning

From the dataset we can see that the following needs to be done:

1. Remove the extra spaces before ```num_pages```
2. Keep bookID as the index
3. Replace "J.K. Rowling-Mary GrandPré" with "J.K. Rowling"
4. Check for null values
5. Check for duplicates
6. Check for outliers

The rename function is used to rename columns

In [None]:
books.rename(columns={'  num_pages':'num_pages'}, inplace=True)
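
Hard-coding the exact number of leading spaces is brittle. A minimal sketch, assuming a small stand-in DataFrame (`demo` is hypothetical, not the Goodreads data), of stripping whitespace from every column name at once:

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: one column name has leading spaces
demo = pd.DataFrame({'bookID': [1, 2], '  num_pages': [652, 870]})

# Strip leading/trailing whitespace from every column name in one step
demo.columns = demo.columns.str.strip()

print(list(demo.columns))
```

This approach also catches stray spaces in columns that were not noticed by eye.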

We can assign to ```.index``` to set the index column explicitly

In [None]:
books.index = books['bookID']

Printing the length and breadth of the dataset

In [None]:
print("Dataset contains {} rows and {} columns".format(books.shape[0], books.shape[1]))

Replacing the author. Although I respect Mary GrandPré for her illustrations, here I am keeping only J.K. Rowling, for the sake of simplicity

In [None]:
books.replace(to_replace='J.K. Rowling-Mary GrandPré', value = 'J.K. Rowling', inplace=True)
books.replace(to_replace='J.K. Rowling/Mary GrandPré', value = 'J.K. Rowling', inplace=True)

Let us check whether the changes are reflected

In [None]:
books.head(5)

Cool

Let's see if we have null values. The ```isnull``` method is used for this

In [None]:
books.isnull().values.any()

We can see that there are no null values

Let's see if we have any duplicate values

In [None]:
books.duplicated()
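
`duplicated()` returns one boolean per row, which is hard to read for a large table. A short sketch, using a hypothetical toy frame rather than the real data, of collapsing that Series into a single answer:

```python
import pandas as pd

# Hypothetical toy frame: the last row repeats the first one
demo = pd.DataFrame({'title': ['A', 'B', 'A'], 'authors': ['x', 'y', 'x']})

dup_mask = demo.duplicated()        # True for each row that repeats an earlier row
has_duplicates = dup_mask.any()     # single True/False answer
n_duplicates = int(dup_mask.sum())  # number of repeated rows

print(has_duplicates, n_duplicates)
```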

In [None]:
plt.figure(figsize=(30,5))
sns.boxplot(x=books['average_rating'],palette = 'colorblind')

We can see that there are no outliers in ```average_rating```

In [None]:
plt.figure(figsize=(30,10))
sns.boxplot(x=books['ratings_count'],palette = 'colorblind')

Here too, we cannot see any abnormal values
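
A boxplot draws its whiskers at 1.5 times the interquartile range (IQR) by default, so the same rule can be applied numerically as a cross-check on the visual reading. A minimal sketch on made-up numbers (the `counts` values are illustrative, not taken from the dataset):

```python
import numpy as np

# Illustrative values only; the last one is deliberately extreme
counts = np.array([10, 12, 15, 18, 20, 22, 25, 5000])

q1, q3 = np.percentile(counts, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the fences are what a boxplot would draw as individual markers
outliers = counts[(counts < lower) | (counts > upper)]
print(outliers)
```

Running the same calculation on a real column would give a count of flagged rows to compare against the plot.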

#### Columns Description: 

- **bookID** contains the unique ID for each book/series
- **title** contains the titles of the books
- **authors** contains the author of the particular book
- **average_rating** the average rating of the book, as given by the users
- **ISBN** the ISBN-10 number, which carries information about a book such as its edition and publisher
- **ISBN 13** the newer 13-digit ISBN format, introduced in 2007
- **language_code** the language of the book
- **num_pages** the number of pages in the book
- **ratings_count** the number of ratings given for the book
- **text_reviews_count** the number of text reviews left by users

# Exploratory Data Analysis

Let us see how many unique values each column contains. <br> For this, NumPy's ```unique``` function is used.

In [None]:
for feature in books.columns:
    uniq = np.unique(books[feature])
    print('{}: {} distinct values\n'.format(feature,len(uniq)))
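
pandas also offers `DataFrame.nunique`, which counts distinct values per column without an explicit loop. A small sketch on a hypothetical toy frame (the column values are made up):

```python
import pandas as pd

# Hypothetical toy frame with repeated values in both columns
demo = pd.DataFrame({
    'language_code': ['eng', 'eng', 'spa'],
    'num_pages': [100, 200, 100],
})

# One distinct-value count per column, returned as a Series
distinct = demo.nunique()
print(distinct)
```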

### Which are the books with most occurrences in the list?<a id="4"></a> <br>

In [None]:
#Taking the first 20:

sns.set_context('poster')
plt.figure(figsize=(20,15))
book = books['title'].value_counts()[:20]
sns.barplot(x = book, y = book.index, palette='deep')
plt.title("Most Occurring Books")
plt.xlabel("Number of occurrences")
plt.ylabel("Books")
plt.show()

We can see that The Iliad and The Brothers Karamazov are the books with the most occurrences. This suggests that these books have aged well.

### Which is the most frequent language?

In [None]:
sns.set_context('paper')
plt.figure(figsize=(15,10))
ax = books.groupby('language_code')['title'].count().plot.bar()
plt.title('Language Code')
plt.xticks(fontsize = 15)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x()-0.3, p.get_height()+100))

Clearly we can see that English is the most frequent language.

### Which book has the most ratings?

In [None]:
most_rated = books.sort_values('ratings_count', ascending = False).head(10).set_index('title')
plt.figure(figsize=(15,10))
# Recent seaborn releases require x and y to be passed as keyword arguments
sns.barplot(x=most_rated['ratings_count'], y=most_rated.index, palette='rocket')

Twilight (Twilight #1) is the most rated book, but no other book from the Twilight series appears in the list.

### Which authors in the dataset have the most books?

In [None]:
sns.set_context('talk')
most_books = books.groupby('authors')['title'].count().reset_index().sort_values('title', ascending=False).head(10).set_index('authors')
plt.figure(figsize=(15,10))
ax = sns.barplot(x=most_books['title'], y=most_books.index, palette='icefire_r')
ax.set_title("Top 10 authors with most books")
ax.set_xlabel("Total number of books")
for i in ax.patches:
    ax.text(i.get_width()+.3, i.get_y()+0.5, str(round(i.get_width())), fontsize = 10, color = 'k')

### Which publishing house is the most frequent in the dataset?

In [None]:
sns.set_context('talk')
most_books = books.groupby('publisher')['title'].count().reset_index().sort_values('title', ascending=False).head(10).set_index('publisher')
plt.figure(figsize=(15,10))
ax = sns.barplot(x=most_books['title'], y=most_books.index, palette='icefire_r')
ax.set_title("Top 10 publishers with most books")
ax.set_xlabel("Total number of books")
for i in ax.patches:
    ax.text(i.get_width()+.3, i.get_y()+0.5, str(round(i.get_width())), fontsize = 10, color = 'k')

### What is the average rating in the dataset?

In [None]:
plt.figure(figsize=(10,10))
rating= books.average_rating.astype(float)
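# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
sns.histplot(rating, bins=20, kde=True, stat='density')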

### What is the percentage of books lying between various points?

To see how many books fall into each rating band, we classify them. The function ```segregation``` below does this. The bands are:

- 0 to 1 (Under Average books)
- 1 to 2 (Average books)
- 2 to 3 (Good books)
- 3 to 4 (Very Good books)
- 4 to 5 (Excellent books)


In [None]:
def segregation(data):
    values = []
    for val in data.average_rating:
        if val>=0 and val<=1:
            values.append("Between 0 and 1")
        elif val>1 and val<=2:
            values.append("Between 1 and 2")
        elif val>2 and val<=3:
            values.append("Between 2 and 3")
        elif val>3 and val<=4:
            values.append("Between 3 and 4")
        elif val>4 and val<=5:
            values.append("Between 4 and 5")
        else:
            values.append("NaN")
    print(len(values))
    return values
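
The same binning can be expressed with `pd.cut`. The following is a hedged sketch on illustrative ratings (the `ratings` values are made up), using the same right-inclusive boundaries as the loop above:

```python
import pandas as pd

# Illustrative ratings, not taken from the dataset
ratings = pd.Series([0.0, 1.0, 1.5, 3.94, 5.0])

labels = ['Between 0 and 1', 'Between 1 and 2', 'Between 2 and 3',
          'Between 3 and 4', 'Between 4 and 5']

# Bins are right-inclusive: (0, 1], (1, 2], ...; include_lowest=True also keeps 0
binned = pd.cut(ratings, bins=[0, 1, 2, 3, 4, 5], labels=labels, include_lowest=True)
print(binned.tolist())
```

Values outside 0 to 5 come back as missing rather than as the string "NaN", which is one difference from the hand-written function.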

In [None]:
books['Ratings_Dist'] = segregation(books)
# Use the value_counts Series directly; the column names produced by
# reset_index() differ between pandas versions
ratings_pie = books['Ratings_Dist'].value_counts()
labels = ratings_pie.index
colors = ['lightblue','darkmagenta','coral','bisque', 'black']
percent = 100.*ratings_pie/ratings_pie.sum()
fig, ax1 = plt.subplots()
# One explode offset per wedge, so the call works for any number of bands
ax1.pie(ratings_pie, colors = colors,
        pctdistance=0.85, startangle=90, explode=[0.05]*len(ratings_pie))
#Draw a white circle in the centre to turn the pie into a donut chart:
centre_circle = plt.Circle((0,0), 0.70, fc ='white')
ax1.add_artist(centre_circle)
#Equal aspect ratio ensures that the pie is drawn as a circle
ax1.axis('equal')
plt.tight_layout()
labels = ['{0} - {1:1.2f} %'.format(i,j) for i,j in zip(labels, percent)]
plt.legend(labels, loc = 'best', bbox_to_anchor=(-0.1, 1.))

### Is there any relationship between average rating and number of reviews?

In [None]:
#Checking for any relation between them.
plt.figure(figsize=(15,10))
books.dropna(axis=0, inplace=True)  # axis must be passed by keyword in pandas 2.x
sns.set_context('paper')
ax =sns.jointplot(x="average_rating",y='text_reviews_count', kind='scatter',  data= books[['text_reviews_count', 'average_rating']])
ax.set_axis_labels("Average Rating", "Text Review Count")
plt.show()

We can see that most of the text reviews are for books with an average rating around 4. Very few books have an average rating close to either 0 or 5.
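
A scatter plot shows the shape of a relationship; a rank correlation puts a number on it. A hedged sketch on a hypothetical toy frame (the values are made up), computing Spearman's coefficient as the Pearson correlation of the ranks:

```python
import pandas as pd

# Hypothetical values; review counts are heavily skewed, as counts often are
demo = pd.DataFrame({
    'average_rating': [3.2, 3.9, 4.0, 4.1, 4.6],
    'text_reviews_count': [5, 900, 40, 1200, 3],
})

# Spearman correlation = Pearson correlation computed on the ranks
ranks = demo.rank()
rho = ranks['average_rating'].corr(ranks['text_reviews_count'])
print(round(rho, 3))
```

A value near zero would indicate no monotonic relationship, which a dense cloud around one rating value often produces.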

### Is there any relationship between number of pages and average rating?

In [None]:
plt.figure(figsize=(15,10))
sns.set_context('paper')
ax = sns.jointplot(x="average_rating", y="num_pages", data = books, color = 'crimson')
ax.set_axis_labels("Average Rating", "Number of Pages")

Here, because of the outliers, the whole plot is positively skewed. To address this, we plot only the books with no more than 1000 pages.

In [None]:
trial = books[~(books['num_pages']>1000)]

In [None]:
ax = sns.jointplot(x="average_rating", y="num_pages", data = trial, color = 'darkcyan')
ax.set_axis_labels("Average Rating", "Number of Pages")

Even now, we can see that most of the 5-star ratings are for books with fewer than 400 pages.

### Is there any relationship between average rating and the number of ratings received?

In [None]:
sns.set_context('paper')
ax = sns.jointplot(x="average_rating", y="ratings_count", data = books, color = 'blueviolet')
ax.set_axis_labels("Average Rating", "Ratings Count")

Here too we can see that the outliers are affecting the plot. Hence we will use a temporary dataframe that excludes books with more than 2,000,000 ratings.

In [None]:
trial = books[~(books.ratings_count>2000000)]

In [None]:
sns.set_context('paper')
ax = sns.jointplot(x="average_rating", y="ratings_count", data = trial, color = 'brown')
ax.set_axis_labels("Average Rating", "Ratings Count")

Even now, we can see that most of the books have an average rating around 4.

### Which books have the most text reviews, i.e. written comments?

In [None]:
most_text = books.sort_values('text_reviews_count', ascending = False).head(10).set_index('title')
plt.figure(figsize=(15,10))
sns.set_context('poster')
ax = sns.barplot(x=most_text['text_reviews_count'], y=most_text.index, palette='magma')
for i in ax.patches:
    ax.text(i.get_width()+2, i.get_y()+0.5,str(round(i.get_width())), fontsize=10,color='black')
plt.show()

Clearly we can see that Twilight (Twilight #1) has the most text reviews.

# The Model

I attempt to find a relationship or groups between the rating count and average rating value.

Importing libraries

In [None]:
from sklearn.cluster import KMeans
from sklearn import neighbors
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from scipy.cluster.vq import kmeans, vq
from matplotlib.lines import Line2D

# KMeans Clustering

KMeans clustering is a type of unsupervised learning which groups unlabelled data. The goal is to find groups in data.

With this, I attempt to find a relationship or groups between the rating count and average rating value.

In [None]:
trial = books[['average_rating', 'ratings_count']]
data = np.asarray([np.asarray(trial['average_rating']), np.asarray(trial['ratings_count'])]).T

Since KMeans clustering is fairly basic and forms clusters even when the specified k is wrong, we use the Elbow Curve method to find a suitable number of clusters for the data

In [None]:
X = data
distortions = []
for k in range(2,30):
    k_means = KMeans(n_clusters = k)
    k_means.fit(X)
    distortions.append(k_means.inertia_)

fig = plt.figure(figsize=(15,10))
plt.plot(range(2,30), distortions, 'bx-')
plt.title("Elbow Curve")


From the above plot, we can see that the elbow lies around K=5, so that is the value we will use
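
To make the elbow idea concrete, here is a minimal sketch on synthetic data where the number of groups is known in advance. The blob centres, the spread, and the `random_state` are assumptions chosen for illustration and have nothing to do with the books data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs, so the "true" k is 3 by construction
rng = np.random.default_rng(0)
centres = np.array([[0, 0], [10, 10], [20, 0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centres])

# Inertia = sum of squared distances from each point to its assigned centroid
inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(model.inertia_)

# For this data the inertia falls steeply up to k = 3 and only slightly afterwards
for k, value in zip(range(1, 7), inertias):
    print(k, round(value, 1))
```

The elbow is the k after which adding another cluster buys only a small reduction in inertia.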

In [None]:
#Computing K means with K = 5, thus, taking it as 5 clusters
centroids, _ = kmeans(data, 5)

#assigning each sample to a cluster
#Vector Quantisation:

idx, _ = vq(data, centroids)
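
The assignment step performed by vector quantisation can be written out by hand. A minimal NumPy sketch, assuming illustrative points and hand-picked centroids (not the fitted values from the cell above):

```python
import numpy as np

# Illustrative 2-D points and two hand-picked centroids (not fitted values)
points = np.array([[0.0, 0.0], [0.5, 0.2], [9.0, 9.5], [10.0, 10.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

# Pairwise Euclidean distances, shape (n_points, n_centroids)
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Each point is labelled with the index of its closest centroid
labels = distances.argmin(axis=1)
print(labels)
```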

In [None]:
# some plotting using numpy's logical indexing
sns.set_context('paper')
plt.figure(figsize=(15,10))
plt.plot(data[idx==0,0],data[idx==0,1],'or', #red circles
     data[idx==1,0],data[idx==1,1],'ob', #blue circles
     data[idx==2,0],data[idx==2,1],'oy', #yellow circles
     data[idx==3,0],data[idx==3,1],'om', #magenta circles
     data[idx==4,0],data[idx==4,1],'ok', #black circles
     )
plt.plot(centroids[:,0],centroids[:,1],'sg',markersize=8) #green squares mark the centroids

#Proxy handles so that the legend shows one marker per cluster
circle1 = Line2D(range(1), range(1), color = 'red', linewidth = 0, marker= 'o', markerfacecolor='red')
circle2 = Line2D(range(1), range(1), color = 'blue', linewidth = 0,marker= 'o', markerfacecolor='blue')
circle3 = Line2D(range(1), range(1), color = 'yellow',linewidth=0,  marker= 'o', markerfacecolor='yellow')
circle4 = Line2D(range(1), range(1), color = 'magenta', linewidth=0,marker= 'o', markerfacecolor='magenta')
circle5 = Line2D(range(1), range(1), color = 'black', linewidth = 0,marker= 'o', markerfacecolor='black')

plt.legend((circle1, circle2, circle3, circle4, circle5),
           ('Cluster 1','Cluster 2', 'Cluster 3', 'Cluster 4', 'Cluster 5'), numpoints = 1, loc = 0)

plt.show()

We can see from the above plot that, because of two outliers, the whole clustering algorithm is skewed. Let's remove them and draw inferences

# KMeans with Optimisation

Finding the outliers and then removing them.

In [None]:
trial.idxmax()
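
`idxmax` reports index labels rather than row positions, and those labels are what `drop` expects. A short sketch on a hypothetical toy frame (the IDs and values are made up) showing the two steps together:

```python
import pandas as pd

# Hypothetical toy frame indexed by made-up book IDs
demo = pd.DataFrame(
    {'average_rating': [3.9, 4.5, 4.1], 'ratings_count': [120, 80, 999999]},
    index=[11, 22, 33],
)

# idxmax returns, per column, the index label of the row holding the maximum
print(demo.idxmax())

# Drop the row with the extreme ratings_count by label, without mutating demo
outlier_label = demo['ratings_count'].idxmax()
trimmed = demo.drop(outlier_label)
print(trimmed.index.tolist())
```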

In [None]:
# drop() returns a new frame, which avoids modifying a slice of books in place
trial = trial.drop(41865)

In [None]:
data = np.asarray([np.asarray(trial['average_rating']), np.asarray(trial['ratings_count'])]).T

In [None]:
#Computing K means with K = 5, thus, taking it as 5 clusters
centroids, _ = kmeans(data, 5)

#assigning each sample to a cluster
#Vector Quantisation:

idx, _ = vq(data, centroids)

In [None]:
# some plotting using numpy's logical indexing
sns.set_context('paper')
plt.figure(figsize=(15,10))
plt.plot(data[idx==0,0],data[idx==0,1],'or', #red circles
     data[idx==1,0],data[idx==1,1],'ob', #blue circles
     data[idx==2,0],data[idx==2,1],'oy', #yellow circles
     data[idx==3,0],data[idx==3,1],'om', #magenta circles
     data[idx==4,0],data[idx==4,1],'ok', #black circles
     )
plt.plot(centroids[:,0],centroids[:,1],'sg',markersize=8) #green squares mark the centroids

#Proxy handles so that the legend shows one marker per cluster
circle1 = Line2D(range(1), range(1), color = 'red', linewidth = 0, marker= 'o', markerfacecolor='red')
circle2 = Line2D(range(1), range(1), color = 'blue', linewidth = 0,marker= 'o', markerfacecolor='blue')
circle3 = Line2D(range(1), range(1), color = 'yellow',linewidth=0,  marker= 'o', markerfacecolor='yellow')
circle4 = Line2D(range(1), range(1), color = 'magenta', linewidth=0,marker= 'o', markerfacecolor='magenta')
circle5 = Line2D(range(1), range(1), color = 'black', linewidth = 0,marker= 'o', markerfacecolor='black')

plt.legend((circle1, circle2, circle3, circle4, circle5),
           ('Cluster 1','Cluster 2', 'Cluster 3', 'Cluster 4', 'Cluster 5'), numpoints = 1, loc = 0)

plt.show()

From the above plot, we can now see that the data can be separated into clusters. As the ratings count increases, the average rating settles near the cluster centroids. The green squares are the centroids of the clusters.

As the ratings count decreases, the average ratings become more spread out, with higher volatility and less reliability.

# Recommendation Engine

In a setting such as this, unsupervised learning is used, and the nearest neighbors of a book are recommended. For example, if I ask for recommendations for "The Catcher in the Rye", five books related to it would appear.

Creating a books features table, based on the Ratings Distribution, which classifies the books into ratings scale such as: 
- Between 0 and 1
- Between 1 and 2
- Between 2 and 3
- Between 3 and 4
- Between 4 and 5

Broadly, the recommendations then consider the average rating and ratings count for the query entered.

In [None]:
books_features = pd.concat([books['Ratings_Dist'].str.get_dummies(sep=","), books['average_rating'], books['ratings_count']], axis=1)

In [None]:
books_features.head()

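The min-max scaler is used to reduce the bias that would otherwise arise because the features are on very different scales: the ratings count can run into the millions, while the average rating lies between 0 and 5. The min-max scaler rescales each feature to the range 0 to 1 using that feature's minimum and maximum, so that no single feature dominates the distance calculation.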
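
As a worked illustration of what min-max scaling computes, here is a minimal NumPy sketch on a made-up feature matrix (the numbers are illustrative only). Each column is mapped through (x - min) / (max - min):

```python
import numpy as np

# Illustrative feature matrix: column 0 is a rating, column 1 a ratings count
features = np.array([[3.0, 100.0], [4.0, 1100.0], [5.0, 2100.0]])

# Min-max scaling per column: (x - min) / (max - min)
col_min = features.min(axis=0)
col_max = features.max(axis=0)
scaled = (features - col_min) / (col_max - col_min)
print(scaled)
```

After scaling, both columns span 0 to 1, so a difference in ratings count no longer outweighs a difference in average rating simply because of its units.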


In [None]:
min_max_scaler = MinMaxScaler()
books_features = min_max_scaler.fit_transform(books_features)

In [None]:
np.round(books_features, 2)

In [None]:
model = neighbors.NearestNeighbors(n_neighbors=6, algorithm='ball_tree')
model.fit(books_features)
distance, indices = model.kneighbors(books_features)
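
When a nearest neighbors model is queried with the same rows it was fitted on, each row's closest match is itself. That is why six neighbors are requested to obtain five recommendations. A small sketch on illustrative feature rows (the values are made up) shows the shape of the result:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Illustrative feature rows; each row stands for one hypothetical book
features = np.array([[0.0, 0.0], [0.1, 0.0], [0.9, 1.0], [1.0, 1.0]])

model = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(features)
distances, indices = model.kneighbors(features)

# Column 0 is each row itself (distance 0); column 1 is its closest other row
print(indices)
```

The entries of `indices` are row positions in the feature matrix, not index labels of the original DataFrame.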

Creating helper functions to look up books:
- Get the index from the title
- Get the ID from a partial name (because not everyone can remember the full title)
- Print the similar books from the feature dataset.
 *(This uses the ```indices``` array returned by the nearest neighbors model to pick the books.)*

In [None]:
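def get_index_from_name(name):
    # Return the positional row number rather than the bookID label,
    # because the `indices` array from kneighbors is positional
    return np.flatnonzero((books["title"] == name).to_numpy())[0]

all_books_names = list(books.title.values)

def get_id_from_partial_name(partial):
    # Print every title containing `partial` together with its row position
    for position, name in enumerate(all_books_names):
        if partial in name:
            print(name, position)

def print_similar_books(query=None, id=None):
    # `id` is a row position; compare with None so that row 0 is accepted
    if id is not None:
        for neighbour in indices[id][1:]:
            print(books.iloc[neighbour]["title"])
    if query:
        found_id = get_index_from_name(query)
        for neighbour in indices[found_id][1:]:
            print(books.iloc[neighbour]["title"])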

To check how the system works, let's try the following examples.

- System by name: The Catcher in the Rye
- System by name: The Hobbit
- System by name: The Iliad


In [None]:
print_similar_books("The Catcher in the Rye")

In [None]:
print_similar_books("The Hobbit")


In [None]:
print_similar_books("The Iliad")
