# Amazon's bestsellers 2009-19

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
books=pd.read_csv('/kaggle/input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv')
books.head()

# Data Wrangling

**Checking whether the data contains any NULL values.**

In [None]:
books.info()
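
`info()` lists the non-null count of each column, so missing values have to be inferred by comparing those counts with the number of rows. A more direct check is to count the nulls per column. The sketch below uses a small made-up frame (`toy` is hypothetical, not the Kaggle file) so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    'Name': ['Book A', 'Book B', 'Book C'],
    'Price': [8, np.nan, 12],
    'Genre': ['Fiction', 'Non Fiction', None],
})

# isna() flags missing cells; sum() counts them column by column.
null_counts = toy.isna().sum()
print(null_counts)
```

On the real data the same call would be `books.isna().sum()`; a column of zeros means there is nothing to fill or drop.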

**Getting basic statistics about the data.**

In [None]:
books.describe()

Here we can extract several pieces of information about the data.
1. The average price of a book on Amazon's bestsellers list is 13.2 dollars. Some books are priced below 1 dollar, and the most expensive costs 105 dollars.
2. The average rating of a book is quite high, at 4.6 out of 5 (assuming a 5-point scale). Ratings range from 3.3 to 4.9.

**Cleaning the data to extract insights.**

In [None]:
#Checking whether any author's name is spelled inconsistently or incorrectly.
#(Honestly, I did look at other notebooks to see whether the data has any such string mistakes.)
books['Author'].sort_values().unique()

In [None]:
#Replacing variant spellings of an author's name with a single consistent spelling.
books=books.replace('George R. R. Martin','George R.R. Martin')
books=books.replace('J. K. Rowling','J.K. Rowling')
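
Calling `replace` on the whole frame rewrites any cell in any column that exactly equals the old string. A narrower alternative, sketched here under the assumption that only the `Author` column needs fixing, is to apply a mapping to that column alone. The `toy` frame and the `author_fixes` name are illustrative:

```python
import pandas as pd

# Hypothetical toy frame with the two spelling variants found above.
toy = pd.DataFrame({
    'Name': ['Book A', 'Book B', 'Book C'],
    'Author': ['George R. R. Martin', 'George R.R. Martin', 'J. K. Rowling'],
})

# Map each variant spelling to the spelling we want to keep.
author_fixes = {
    'George R. R. Martin': 'George R.R. Martin',
    'J. K. Rowling': 'J.K. Rowling',
}

# Only the Author column is touched; other columns stay as they are.
toy['Author'] = toy['Author'].replace(author_fixes)
print(toy['Author'].unique())
```

Keeping the fixes in one dictionary also makes it easy to add further variants later.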

**Checking if data has any duplicate values**

In [None]:
#duplicates in Column Name
books.duplicated(subset=['Name']).sum()

In [None]:
#duplicates in whole data
books.duplicated().sum()

In [None]:
#Sorting values by year to check whether any book appears as a bestseller in two or more years.
books.sort_values('Year',ascending=False).head()

In [None]:
#Dropping duplicate book names (the first occurrence is kept), then sorting by year.
book=books.drop_duplicates('Name').sort_values('Year',ascending=False)

In [None]:
book.shape

Here we can see that there are only 351 entries now.
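
Note that `drop_duplicates('Name')` keeps the first row it meets for each title, so the year that survives depends on the row order of the file. The following sketch, on a made-up frame, shows how the `keep` parameter changes which row is retained:

```python
import pandas as pd

# Hypothetical toy frame: 'Book A' is a bestseller in two different years.
toy = pd.DataFrame({
    'Name': ['Book A', 'Book A', 'Book B'],
    'Year': [2015, 2016, 2016],
    'Reviews': [100, 100, 50],
})

first_seen = toy.drop_duplicates('Name')               # keep='first' is the default
last_seen = toy.drop_duplicates('Name', keep='last')   # keep the last row instead

print(first_seen[['Name', 'Year']])
print(last_seen[['Name', 'Year']])
```

Either choice leaves one row per title; it only matters for analyses that use the `Year` column afterwards.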

# Getting Some insights

**Top 5 books by price, with their authors and genres.**

In [None]:
book[['Price','Name','Author','Genre']].sort_values(by='Price',ascending=False).head()

**Books priced at less than 1 dollar**

In [None]:
book[book['Price']<1][['Name','Author','Genre','Price']]

We found that the book of the Constitution of the United States is priced at less than 1 dollar.

**Top 5 Books with highest User Rating.**

In [None]:
book[['User Rating','Name','Author','Genre']].sort_values(by='User Rating',ascending=False).head()

**Top 5 Books with lowest User Rating.**

In [None]:
book[['User Rating','Name','Author','Genre']].sort_values(by='User Rating').head()

We can see that 'Fifty Shades Of Grey', which was adapted into movies, is among the books with the lowest ratings.
And author J.K. Rowling, famous for her Harry Potter fiction series, is at the top.

# Some Insights With Visualisation

**Distribution of Genres in Amazon's Bestsellers.**

In [None]:
#Counting books per genre once so the slices and the legend labels stay in the same order.
genre_counts=book['Genre'].value_counts()
fig,ax=plt.subplots(figsize=(5,5))
ax.pie(x=genre_counts,autopct='%0.1f%%',textprops=dict(color='w',fontweight='bold',fontsize=20))
ax.legend(genre_counts.index,title='Genres',loc='upper right')
plt.title("Distribution of Genre",fontsize=20,fontweight='bold')
plt.tight_layout()

**Distribution of Reviews among Fiction and Non Fiction.**

In [None]:
#Mean number of reviews for each genre, drawn as dashed vertical lines on the plot.
mean1=book[book['Genre']=='Non Fiction']['Reviews'].mean()
mean2=book[book['Genre']=='Fiction']['Reviews'].mean()

sns.displot(aspect=2,x='Reviews',data=book,hue='Genre',multiple='stack')

plt.title("Review Distribution Plot",fontsize=20,fontweight='bold')
plt.xlabel("Reviews",fontsize=16)
plt.ylabel("Count",fontsize=16)
plt.axvline(mean1,color='b',linestyle='--',label='Mean Non Fiction')
plt.axvline(mean2,color='black',linestyle='--',label='Mean Fiction')
#Showing the labels of the two mean lines.
plt.legend()

We can see that the mean number of reviews is much higher for fiction than for non-fiction.
This suggests that people read fiction more often than non-fiction.
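
The two means can also be computed in one step with `groupby`, and adding the median is a cheap way to see how much a few heavily reviewed titles pull the mean upwards. This is a minimal sketch on made-up numbers; `toy` and `review_stats` are illustrative names:

```python
import pandas as pd

# Hypothetical toy frame; the review counts are invented for illustration.
toy = pd.DataFrame({
    'Genre': ['Fiction', 'Fiction', 'Non Fiction', 'Non Fiction'],
    'Reviews': [20000, 10000, 6000, 4000],
})

# One row per genre, with the mean and the median side by side.
review_stats = toy.groupby('Genre')['Reviews'].agg(['mean', 'median'])
print(review_stats)
```

On the real data the equivalent call would be `book.groupby('Genre')['Reviews'].agg(['mean','median'])`.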

**Distribution of Price among Fiction and Non Fiction.**

In [None]:
mean1=book[book['Genre']=='Non Fiction']['Price'].mean()
mean2=book[book['Genre']=='Fiction']['Price'].mean()


sns.displot(aspect=2,x='Price',data=book,hue='Genre',multiple='stack',palette='RdPu')
plt.title("Price Distribution Plot",fontsize=20,fontweight='bold')
plt.xlabel("Price",fontsize=16)
plt.ylabel("Count",fontsize=16)

plt.axvline(mean1, color='g',linestyle='--',label='Mean Non Fiction')
plt.axvline(mean2, color='r',linestyle='--',label='Mean Fiction')
plt.legend()

The prices of the two genres are distributed almost identically, since their means differ only slightly. There are also some outliers.

**User Rating for Fiction and Non-Fiction.**

In [None]:
sns.catplot(x='User Rating',kind='count',data=book,col='Genre',aspect=1.5)

We can see that ratings for fiction go as low as 3.3, whereas the lowest non-fiction ratings are considerably higher.

In [None]:
mean1=book[book['Genre']=='Fiction']['User Rating'].mean()
mean2=book[book['Genre']=='Non Fiction']['User Rating'].mean()
print("Mean User Rating for Fictional books is =",mean1)
print("Mean User Rating for Non Fictional books is =",mean2)

But the mean ratings of the two genres are almost equal.
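
A lower minimum and a nearly equal mean can coexist, because a few low ratings are offset by many high ones. The sketch below illustrates this on invented ratings (the `toy` values are hypothetical and chosen only to make the point):

```python
import pandas as pd

# Hypothetical ratings: Fiction has one low outlier but also higher top ratings.
toy = pd.DataFrame({
    'Genre': ['Fiction', 'Fiction', 'Fiction',
              'Non Fiction', 'Non Fiction', 'Non Fiction'],
    'User Rating': [3.5, 4.9, 4.8, 4.3, 4.4, 4.5],
})

# Minimum, mean and maximum rating per genre in a single table.
rating_summary = toy.groupby('Genre')['User Rating'].agg(['min', 'mean', 'max'])
print(rating_summary.round(2))
```

The same `agg(['min','mean','max'])` call on `book` would put the numbers behind both observations in one table.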

**Top 10 authors whose books appear most often in Amazon's Bestsellers.**

In [None]:
#Using the full `books` frame here, so every yearly appearance of a book is counted.
top_authors=books['Author'].value_counts().sort_values(ascending=False).head(10)
top_authors.plot(kind='barh',color='purple',figsize=(6,6))

plt.xlabel("Number of occurrences",fontsize=15,fontweight='bold')
plt.ylabel("Authors",fontsize=15,fontweight='bold')
plt.title("Top 10 Authors",fontsize=20,fontweight='bold')

**Top 10 authors with the most reviews on Amazon.**

In [None]:
max_rev=book.sort_values('Reviews',ascending=False).head(10)
sns.barplot(x='Reviews',y='Author',data=max_rev,palette='RdPu')

plt.xlabel("Reviews",fontsize=15,fontweight='bold')
plt.ylabel("Authors",fontsize=15,fontweight='bold')
plt.title("Top 10 Authors With Maximum Reviews",fontsize=15,fontweight='bold')
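
The cell above picks the ten most-reviewed books, so an author with two such books fills two of the ten slots and `barplot` merges them into a single averaged bar. If one entry per author is wanted, a possible alternative is to aggregate per author first. This sketch uses invented data, and taking each author's most-reviewed book is an assumption about what "max reviews" should mean:

```python
import pandas as pd

# Hypothetical toy frame: 'Author X' has two heavily reviewed books.
toy = pd.DataFrame({
    'Name': ['Book A', 'Book B', 'Book C', 'Book D'],
    'Author': ['Author X', 'Author X', 'Author Y', 'Author Z'],
    'Reviews': [900, 700, 800, 100],
})

# One row per author: the review count of that author's most-reviewed book.
top_by_author = (
    toy.groupby('Author')['Reviews']
       .max()
       .sort_values(ascending=False)
       .head(2)
)
print(top_by_author)
```

Replacing `.max()` with `.sum()` would rank authors by their total reviews across all books instead.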


**Reviews over the years for the fiction and non-fiction categories.**

In [None]:
sns.lineplot(x='Year',y='Reviews',data=book,hue='Genre')

In 2018 non-fiction books have more reviews; in the remaining years fiction is more popular.

**Price for fiction and non-fiction over the years.**

In [None]:
sns.lineplot(x='Year',y='Price',data=books,hue='Genre')

There is a declining trend here. As the years pass, book prices decrease as well, especially for non-fiction books.
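
By default `sns.lineplot` draws the mean of `y` for each `x` value (with a confidence band), so the lines are yearly mean prices per genre. The numbers behind them can be tabulated with `pivot_table`, as in this sketch on invented prices (`toy` and `yearly_price` are illustrative names):

```python
import pandas as pd

# Hypothetical toy frame with two years and two genres.
toy = pd.DataFrame({
    'Year': [2009, 2009, 2009, 2019, 2019, 2019],
    'Genre': ['Fiction', 'Non Fiction', 'Non Fiction',
              'Fiction', 'Non Fiction', 'Non Fiction'],
    'Price': [12, 20, 16, 9, 11, 13],
})

# Rows are years, columns are genres, cells are mean prices.
yearly_price = toy.pivot_table(index='Year', columns='Genre',
                               values='Price', aggfunc='mean')
print(yearly_price)
```

Reading the table row by row makes it easier to say in which years the gap between the genres narrows or widens.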

**User Rating for Fictional and Non Fictional books throughout the years.**

In [None]:
sns.lineplot(x='Year',y='User Rating',data=book,hue='Genre')

The mean ratings of the two genres are almost the same in 2009. But as the years pass, with many ups and downs, the gap between the mean ratings widens. Fiction books have a slightly higher User Rating.

# Conclusion

**Overall, people tend to read more fiction than non-fiction. Non-technical reasons could be the plot, the storyline, or the character building, as well as the rate at which non-fiction books are released. A technical reason could be the price: in the early years the price difference between the two genres is quite large, and in the later years non-fiction is still more expensive than fiction. Fiction books also receive more reviews, and more reviews suggest more customers.**