# Amazon 2009-2019 Best Selling Book EDA

## Introduction
As one of the world's largest retailers, Amazon offers its customers a wide selection of books. A book should not be judged by its cover, and rating books should not be left to journalists and editors alone. Amazon lets customers rate a book on a scale of 1 to 5 and write a detailed review to help others decide whether to buy it.

In this notebook, we explore a dataset of books that appeared on Amazon's best-seller lists between 2009 and 2019. We will analyze the authors, the genres and, lastly, which books are most worth buying.

## Importing Resources
We will start by importing relevant libraries and the dataset.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df=pd.read_csv('/kaggle/input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv')
df.head(10)

In [None]:
df.describe()

## Visible Issues

### Duplicates
It's very common for bestselling books to get reissued/republished. It is necessary to filter out this kind of noise by either eliminating or merging.

In [None]:
dup = pd.concat(g for _, g in df.groupby("Name") if len(g) > 1)
print(dup.head(5))
print(dup.shape)

We see above that 295 rows belong to books that share a name but differ in year. The ratings and review counts are identical across these rows, so we can simply keep the entry with the latest year. The years are already sorted, so we can just pick the **last entry** of each duplicate.

Ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html
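The de-duplication step described above rests on two assumptions: that repeated titles carry the same rating and review count, and that rows are ordered by year. A minimal sketch on made-up toy data (the rows below are illustrative, not taken from the dataset) shows one way to check the first assumption and to avoid relying on the second by sorting explicitly:

```python
import pandas as pd

# Toy data (made up for illustration): one title listed in three different years.
toy = pd.DataFrame({
    "Name": ["Book A", "Book A", "Book B", "Book A"],
    "User Rating": [4.7, 4.7, 4.5, 4.7],
    "Reviews": [1200, 1200, 800, 1200],
    "Year": [2016, 2017, 2017, 2018],
})

# Count distinct values per title; a count of 1 means all copies agree.
agreement = toy.groupby("Name")[["User Rating", "Reviews"]].nunique()
duplicates_agree = bool((agreement == 1).all().all())

# Sort by year explicitly so that keep="last" really means "most recent year".
latest = (
    toy.sort_values("Year")
       .drop_duplicates(subset="Name", keep="last")
       .reset_index(drop=True)
)

print(duplicates_agree)
print(latest)
```

If `duplicates_agree` came out `False` on the real data, merging (for example, averaging) might be a better choice than dropping.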



In [None]:
df_filtered=df.drop_duplicates(subset='Name',keep = 'last',inplace = False)
df_filtered.shape

Now we confirm that no more duplicates are found.

In [None]:
df_filtered.duplicated('Name').sum()

### Freebies

We can see in the dataframe description that some items are priced at 0. These could distort the results later, when we analyze price and value.

In [None]:
df_filtered.loc[df_filtered['Price'] == 0]

## Exploratory Data Analysis

### The Authors

Let's explore the data by examining the authors. How many authors got into the list? How many books did they write? How many reviews did they get? 

In [None]:
df_filtered.nunique() #Count unique values of each column from the dataframe

So, there are 248 authors in this list, and they dished out 351 best selling books between 2009 and 2019. *Not bad at all!*

In [None]:
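# Count books per author and keep the 10 most frequent
top10 = df_filtered['Author'].value_counts().head(10)
plt.figure(figsize=(15,5))
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.barplot(x=top10.index, y=top10.values, alpha=0.8)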
plt.xticks(rotation=45)
plt.title('Top 10 Authors w. Most Books In Amazon Best Seller 2009-2019')
plt.ylabel('# Of Books')
plt.show()

The chart shows the 10 authors with the most best-selling books in this period, led by Jeff Kinney and followed by Rick Riordan. Next we check whether quantity goes hand in hand with quality. For this step, we define four quantities:

1. a = total number of reviews received by the author
2. b = the smallest number of reviews received by any book in the dataset
3. S = mean rating received by the author
4. C = mean rating across the whole dataset

Then we calculate the weighted rating using Bayesian Average:

$$W_{\text{Rating}} = \frac{S \cdot a}{a + b} + \frac{C \cdot b}{a + b}$$
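To see how the Bayesian average behaves, here is a small sketch with made-up numbers. The function name `weighted_rating` is a hypothetical helper, not part of the notebook. With few reviews, the result stays close to the dataset mean C; with many reviews, it approaches the author's own mean S.

```python
def weighted_rating(S, a, b, C):
    """Bayesian average: shrink the mean rating S toward the dataset mean C.

    a is the number of reviews behind S, and b is the prior weight
    (in this notebook, the smallest review count in the dataset).
    """
    return (S * a + C * b) / (a + b)


# Made-up values for illustration only.
few = weighted_rating(S=4.9, a=10, b=100, C=4.6)       # few reviews
many = weighted_rating(S=4.9, a=10_000, b=100, C=4.6)  # many reviews

print(round(few, 3), round(many, 3))
```

Both terms of the formula share the denominator a + b, so they are combined into a single fraction here.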

In [None]:
b = df_filtered['Reviews'].min()
C = df_filtered['User Rating'].mean()

In [None]:
all_author=df_filtered.value_counts('Author')
author_index=all_author.index
all_qty=all_author.values

In [None]:
sum_rating=np.zeros(len(all_author))
a=np.zeros(len(all_author))
S=np.zeros(len(all_author))
w_rating=np.zeros(len(all_author))

for i in range(0,len(all_author)):
    sum_rating[i]=df_filtered.loc[df_filtered['Author'] == author_index[i], 'User Rating'].sum()
    S[i]=sum_rating[i]/all_qty[i]
    a[i]=df_filtered.loc[df_filtered['Author'] == author_index[i], 'Reviews'].sum()
    
    w_rating[i]=(S[i] * a[i]/(a[i]+b))+(C * b/(a[i]+b))

In [None]:
author_rating=pd.DataFrame({'Author': author_index,'Books Written': all_qty,'Reviews': a,'Average Rating': S, 'Weighted Rating': w_rating})
author_rating['Average Rating']=author_rating['Average Rating'].round(decimals=2)
author_rating

We can now rank the authors based on the weighted ratings.

In [None]:
top10_rating=author_rating.nlargest(10,['Weighted Rating'])
plt.figure(figsize=(15,5))
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.barplot(x=top10_rating['Author'], y=top10_rating['Weighted Rating'], alpha=0.8)
plt.title('Top 10 Authors w. Best Ratings In Amazon Best Seller 2009-2019')
plt.xticks(rotation=45)
plt.ylim(top10_rating['Weighted Rating'].min()-0.001,top10_rating['Weighted Rating'].max()+0.001)
plt.ylabel('Weighted Ratings')
plt.show()

As seen above, the ranking based on rating looks nothing like the one based on the number of books published. Dav Pilkey tops the chart with a score of 4.899757.
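The per-author quantities behind this ranking (books written, total reviews, mean rating, weighted rating) can also be computed without an explicit loop. The sketch below uses `groupby` with named aggregation on made-up toy data; the snake_case column names are hypothetical and chosen only for this example.

```python
import pandas as pd

# Toy data (made up for illustration).
toy = pd.DataFrame({
    "Author": ["Ann", "Ann", "Bob"],
    "User Rating": [4.8, 4.6, 4.2],
    "Reviews": [1000, 3000, 500],
})

b = toy["Reviews"].min()       # smallest review count of any book
C = toy["User Rating"].mean()  # mean rating of the whole dataset

# One row per author: number of books, total reviews, mean rating.
per_author = toy.groupby("Author").agg(
    books_written=("User Rating", "count"),
    reviews=("Reviews", "sum"),
    average_rating=("User Rating", "mean"),
).reset_index()

# Apply the Bayesian average element-wise to whole columns.
S, a = per_author["average_rating"], per_author["reviews"]
per_author["weighted_rating"] = (S * a + C * b) / (a + b)

print(per_author.sort_values("weighted_rating", ascending=False))
```

This should give the same numbers as the loop-based approach, while letting pandas handle the grouping.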

Now let's look at the distribution of the authors by scattering them in quadrants.

In [None]:
top=author_rating.loc[(author_rating['Books Written'] >= 6) & (author_rating['Weighted Rating'] >= 4.4), 'Author']

fig=plt.figure(figsize=(20,10))
ax=fig.add_subplot(1,1,1)
plt.scatter(author_rating['Weighted Rating'],author_rating['Books Written'])
plt.plot([4.4,4.4],[0,12], linewidth=0.2, color='red')
plt.plot([3.8,5.0],[6,6], linewidth=0.2, color='red')
plt.xlim(3.8,5.0)
plt.ylim(0,12)

# Iterate over the author names directly; top keeps the row labels of author_rating,
# so positional lookups like top[i] can fail when those labels are not 0..n-1
for author in top:
    row=author_rating.loc[author_rating['Author'] == author].iloc[0]
    plt.text(row['Weighted Rating'], row['Books Written']-0.3,author,fontsize=8,ha='center')

plt.text(3.9, 3,'Regular Best Sellers',fontsize=12,ha='left')
plt.text(3.9, 9,'The Gamblers',fontsize=12,ha='left')
plt.text(4.9, 3,'The Planners',fontsize=12,ha='right')
plt.text(4.9, 9,'The Special Ones',fontsize=12,ha='right')

ax.set(title='Amazon Best Selling Author Distribution',xlabel='Weighted Ratings',ylabel='Books Written')
plt.show()

We see that the majority of authors fall in the bottom right quadrant. This group represents authors who take their time to perfect their books, which earns them high ratings with fewer books published.

On the other hand, we also see some special ones, who published plenty of books and consistently received top ratings.
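The quadrant labels can also be attached to each author as a column, which makes it possible to count how many authors fall into each group. This is a minimal sketch on made-up toy data, reusing the thresholds (4.4 for weighted rating, 6 for books written) and the label names from the plot; the `Quadrant` column is a hypothetical addition.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration), one author per quadrant.
toy = pd.DataFrame({
    "Author": ["A", "B", "C", "D"],
    "Books Written": [2, 8, 3, 9],
    "Weighted Rating": [4.1, 4.2, 4.7, 4.8],
})

high_rating = toy["Weighted Rating"] >= 4.4
many_books = toy["Books Written"] >= 6

# np.select picks the label of the first condition that is True for each row.
toy["Quadrant"] = np.select(
    [high_rating & many_books, high_rating & ~many_books, ~high_rating & many_books],
    ["The Special Ones", "The Planners", "The Gamblers"],
    default="Regular Best Sellers",
)

print(toy["Quadrant"].value_counts())
```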

### The Genre

Since we have only two genres, Fiction and Non Fiction, let's compare the two.

In [None]:
genre_reviews = df_filtered.groupby("Genre")["Reviews"].sum()
genre_ratings = df_filtered.groupby("Genre")["User Rating"].sum()
genre_reviews_avg = df_filtered.groupby("Genre")["Reviews"].mean()
genre_ratings_avg = df_filtered.groupby("Genre")["User Rating"].mean()

In [None]:
genre_table=pd.DataFrame({'Genre': ['Fiction', 'Non Fiction'],
                          'Total Reviews': genre_reviews.values[:,],
                          'Total Ratings': genre_ratings.values[:,],
                          'Average Reviews': genre_reviews_avg.values[:,],
                          'Average Ratings': genre_ratings_avg.values[:,]})
genre_table

In [None]:
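a = genre_reviews                        # total reviews per genre
b = df_filtered['Reviews'].min()         # smallest review count of any book
C = df_filtered['User Rating'].mean()    # mean rating of the dataset
S = genre_ratings_avg                    # mean rating per genre

# a and S share the same Genre index, so the formula can be applied element-wise.
# This avoids integer lookups such as S[i] on a string-labelled Series.
w_rating = ((S * a + C * b) / (a + b)).to_numpy()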

In [None]:
fig=plt.figure(figsize=(20,5))
fig.add_subplot(1,2,1)
plt.pie(genre_reviews,labels=genre_reviews.index, autopct='%1.2f%%')
plt.title('Amazon Best Selling Books 2009-2019 Genre Distribution')
fig.add_subplot(1,2,2)
sns.barplot(x=genre_reviews.index,y=w_rating,alpha=0.8)
plt.ylim(w_rating.min()-0.01,w_rating.max()+0.01)
plt.title('Amazon Best Selling Books 2009-2019 Genre Weighted Rating Comparison')
plt.show()

The figures above show that Fiction beats Non Fiction in both popularity (share of reviews) and rating. Although the difference in the number of reviews is large, the weighted ratings are very close (less than 0.1 apart).

### The Price and Value of Books

Now let's analyze the books themselves. Using the same approach, we compute their weighted ratings and also take the price into account. However, we treat books priced at 0 USD as abnormal data points and exclude them.
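One thing to watch for when removing rows: plain assignment does not copy a DataFrame, so an in-place `drop` on the new name would also change the original. A minimal sketch on made-up toy data shows the difference, and how filtering with a boolean mask returns a separate object instead:

```python
import pandas as pd

# Toy data (made up for illustration).
toy = pd.DataFrame({"Name": ["Free Book", "Paid Book"], "Price": [0, 12]})

alias = toy          # no copy: both names point to the same object
print(alias is toy)

# Boolean filtering builds a new DataFrame and leaves toy untouched.
paid = toy[toy["Price"] != 0].reset_index(drop=True)
print(len(toy), len(paid))
```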

In [None]:
print('Before: ',df_filtered.shape)
# Build a new DataFrame instead of aliasing df_filtered, so the original is left untouched
df_filtered2=df_filtered.loc[df_filtered['Price'] != 0].reset_index(drop=True)
print('After: ',df_filtered2.shape)

In [None]:
book_reviews = df_filtered2["Reviews"]
book_ratings = df_filtered2["User Rating"]
book_reviews_avg = df_filtered2["Reviews"].mean()

b = df_filtered2['Reviews'].min()
C = df_filtered2['User Rating'].mean()
a = book_reviews
S = book_ratings

w_rating=np.zeros(len(df_filtered2['Name']))

for i in range(0,len(df_filtered2['Name'])):
    w_rating[i]=(S[i] * a[i]/(a[i]+b))+(C * b/(a[i]+b))

df_filtered2['Weighted Rating']=w_rating
df_filtered2

In [None]:
fig=plt.figure(figsize=(20,10))
ax=fig.add_subplot(1,1,1)
plt.scatter(df_filtered2['Weighted Rating'],df_filtered2['Price'])
plt.plot([4.1,4.1],[0,110], linewidth=0.2, color='red')
plt.plot([3.2,5.0],[55,55], linewidth=0.2, color='red')
plt.xlim(3.2,5.0)
plt.ylim(0,110)


plt.text(3.6, 30,'The Lucky Ones That Made It',fontsize=12,ha='center')
plt.text(3.6, 80,'Not Worth It',fontsize=12,ha='center')
plt.text(4.6, 30,'Worth It',fontsize=12,ha='center')
plt.text(4.6, 80,'Deservedly Expensive',fontsize=12,ha='center')

ax.set(title='Amazon Best Selling Book Price - Value Distribution',xlabel='Weighted Ratings',ylabel='Price')
plt.show()

### Cream of The Crop

Looking at the figure above, the majority of best sellers are reasonably priced for their rating. Now we'll dive deeper and find the cream of the crop: books priced at 30 USD or less with a weighted rating above 4.85.

In [None]:
# Both conditions must come from df_filtered2 so the boolean masks align row by row
topbooks=df_filtered2.loc[(df_filtered2['Price'] <= 30) & (df_filtered2['Weighted Rating'] > 4.85), :]
topbooks.sort_values(['Weighted Rating','Price'],ascending=False)

As seen above, we have successfully found the best books in terms of price-value performance.
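A price-and-rating filter of this kind can be sketched on made-up toy data as follows. Both conditions are built from the same DataFrame so that the row labels line up; the thresholds match the ones stated in this section.

```python
import pandas as pd

# Toy data (made up for illustration).
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Price": [8, 45, 12, 20],
    "Weighted Rating": [4.90, 4.95, 4.60, 4.87],
})

# Keep books at or below 30 USD with a weighted rating above 4.85.
mask = (toy["Price"] <= 30) & (toy["Weighted Rating"] > 4.85)
best = toy.loc[mask].sort_values(["Weighted Rating", "Price"], ascending=False)

print(best)
```

Book B has the highest rating here but is excluded by the price limit, which is the trade-off this filter is meant to capture.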