# Amazon's Top 50 Bestselling Novels 2009-2020

<div style="text-align:center">
    <img src="https://storage.googleapis.com/kaggle-datasets-images/1168231/1957177/d09f32ffcdd264272172414204ae7268/dataset-cover.jpg?t=2021-02-18-19-44-24">
</div>

<br>

> This file contains data on the top 50 bestselling novels on Amazon for each year from 2009 to 2020. The data was collected from the amazon.com website and Kaggle. The inspiration behind it was the Amazon top-selling books 2009-2019 dataset, which I wanted to bring up to date. The dataset is provided in both CSV and Excel formats for convenience.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Load

In [None]:
data = pd.read_csv('../input/amazons-top-50-bestselling-novels-20092020/AmazonBooks - Sheet1.csv')

In [None]:
data

# Data Visualization

## Genre

In [None]:
plt.figure(figsize=(20, 5))
sns.set_theme(style="darkgrid")
sns.countplot(x='Genre', data=data)
plt.show()

## User Rating

In [None]:
plt.figure(figsize=(20, 5))
sns.histplot(x=data['User Rating'], kde=True, fill=True)
plt.title("User Rating distribution")
plt.show()

In [None]:
g = sns.displot(x=data['User Rating'], hue=data['Genre'], kind="kde", fill=True)
plt.title("User Rating distribution by Genre")
g.fig.set_figwidth(20)
g.fig.set_figheight(5)
plt.show()

> It seems that the *Fiction* genre has a wider spread of user ratings than *Non Fiction*.
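
> The visual impression above can be checked numerically. The following is a minimal sketch, not part of the original analysis: the helper `rating_spread_by_genre` is a hypothetical name, and the small `toy_ratings` frame contains made-up values that only demonstrate the call. In the notebook, pass `data` instead.

```python
import pandas as pd

def rating_spread_by_genre(df: pd.DataFrame) -> pd.DataFrame:
    """Return std, min, max and range of 'User Rating' for each 'Genre'."""
    stats = df.groupby("Genre")["User Rating"].agg(["std", "min", "max"])
    stats["range"] = stats["max"] - stats["min"]
    return stats

# Made-up values for illustration only (NOT taken from the dataset).
toy_ratings = pd.DataFrame({
    "Genre": ["Fiction", "Fiction", "Fiction",
              "Non Fiction", "Non Fiction", "Non Fiction"],
    "User Rating": [3.5, 4.5, 4.9, 4.5, 4.6, 4.7],
})

spread = rating_spread_by_genre(toy_ratings)
print(spread)
```

> A larger `std` or `range` for one genre would support the reading of the KDE plot; similar values would suggest the difference is mostly visual.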

In [None]:
plt.figure(figsize=(20, 8))
sns.boxenplot(x="Year", y="User Rating", data=data)
plt.title('User rating distribution by year')
plt.show()

In [None]:
plt.figure(figsize=(20, 8))
sns.pointplot(x="Year", y="User Rating", hue="Genre", data=data, alpha=.3)
plt.title('User rating distribution by year and genre')
plt.show()

## Price

In [None]:
plt.figure(figsize=(20, 8))
sns.histplot(x="Price", data=data, fill=True, kde=True)
plt.title("Price distribution")
plt.show()

> We can see that a few books are priced around **80** and **100**!

In [None]:
g = sns.displot(x=data['Price'], hue=data['Genre'], kind="kde", fill=True)
plt.title("Price distribution by Genre")
g.fig.set_figwidth(20)
g.fig.set_figheight(5)
plt.show()

> So the books priced around **80** are *Fiction*, while those priced around **100** are *Non Fiction*.

In [None]:
data.query('Price > 75')

In [None]:
plt.figure(figsize=(20, 8))
sns.pointplot(x="Year", y="Price", hue="Genre", data=data, alpha=.3)
plt.title('Price distribution by year and genre')
plt.show()

> The mean price peaks sharply in **2013** and **2014**, and we can assume that this comes from the extreme prices of the books listed just above!
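
> One way to test this assumption is to compare the mean with the median for each year and genre, since the median is much less sensitive to a few extreme prices. This is only a sketch: `price_mean_vs_median` is a hypothetical helper, and `toy_prices` holds made-up numbers for illustration. In the notebook, pass `data` instead.

```python
import pandas as pd

def price_mean_vs_median(df: pd.DataFrame) -> pd.DataFrame:
    """Compare mean and median 'Price' per ('Year', 'Genre') group."""
    summary = df.groupby(["Year", "Genre"])["Price"].agg(["mean", "median", "max"])
    # A large positive gap hints that a few expensive books pull the mean up.
    summary["mean_minus_median"] = summary["mean"] - summary["median"]
    return summary

# Made-up values for illustration only (NOT taken from the dataset).
toy_prices = pd.DataFrame({
    "Year": [2012, 2012, 2012, 2013, 2013, 2013],
    "Genre": ["Fiction"] * 6,
    "Price": [10, 12, 14, 10, 12, 80],
})

price_summary = price_mean_vs_median(toy_prices)
print(price_summary)
```

> If the gap is large only in the peak years, the peak is probably driven by a handful of books rather than by a general price increase.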

In [None]:
plt.figure(figsize=(20, 8))
sns.boxenplot(x="Year", y="Price", data=data)
plt.title('Price distribution by year')
plt.show()

> In this boxen plot we can clearly see the outliers!
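
> The outliers can also be listed explicitly. The sketch below uses the common 1.5 × IQR rule, which is an assumption on my part and not the letter-value method that `sns.boxenplot` draws with. `iqr_outliers` is a hypothetical helper and `toy_books` contains made-up prices; in the notebook, pass `data` instead.

```python
import pandas as pd

def iqr_outliers(df: pd.DataFrame, column: str = "Price", k: float = 1.5) -> pd.DataFrame:
    """Return the rows whose `column` value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    mask = (df[column] < q1 - k * iqr) | (df[column] > q3 + k * iqr)
    return df[mask]

# Made-up values for illustration only (NOT taken from the dataset).
toy_books = pd.DataFrame({"Price": [8, 9, 10, 11, 12, 13, 14, 15, 82, 105]})

outliers = iqr_outliers(toy_books)
print(outliers)
```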

## Reviews

In [None]:
plt.figure(figsize=(20, 8))
sns.histplot(x="Reviews", data=data, fill=True, kde=True)
plt.title("Reviews distribution")
plt.show()

> Some books received more than **100,000** reviews!

In [None]:
g = sns.displot(x=data['Reviews'], hue=data['Genre'], kind="kde", fill=True)
plt.title("Reviews distribution by Genre")
g.fig.set_figwidth(20)
g.fig.set_figheight(5)
plt.show()

> We can see that the books with around **100,000** reviews are **Non Fiction**, while those with around **120,000** are **Fiction**.

In [None]:
data.query("Reviews > 90000")

> One of these books was written by a member of the Trump family and gives a critical insider's view of the family, which can explain its popularity. Another was written by the former First Lady of the United States of America about her own life, which can also explain its popularity and its number of reviews.

In [None]:
plt.figure(figsize=(20, 8))
sns.pointplot(x="Year", y="Reviews", hue="Genre", data=data, alpha=.3)
plt.title('Reviews distribution by year and genre')
plt.show()

> As expected, 2020 includes the books queried above. The sharp increase in the number of reviews may also be explained by the lockdown, with people spending more time on their hobbies while at home!
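
> To separate the effect of the few heavily reviewed books from a broader increase, the yearly mean can be computed with and without the rows above the threshold used in the query. This is a rough sketch under stated assumptions: `yearly_mean_with_and_without` is a hypothetical helper, the 90,000 cutoff simply mirrors the query above, and `toy_reviews` holds made-up numbers. In the notebook, pass `data` instead.

```python
import pandas as pd

def yearly_mean_with_and_without(df: pd.DataFrame, column: str = "Reviews",
                                 threshold: int = 90000) -> pd.DataFrame:
    """Mean of `column` per 'Year', for all rows and for rows at or below `threshold`."""
    mean_all = df.groupby("Year")[column].mean().rename("mean_all")
    trimmed = df[df[column] <= threshold]
    mean_trimmed = trimmed.groupby("Year")[column].mean().rename("mean_trimmed")
    return pd.concat([mean_all, mean_trimmed], axis=1)

# Made-up values for illustration only (NOT taken from the dataset).
toy_reviews = pd.DataFrame({
    "Year": [2019, 2019, 2019, 2020, 2020, 2020],
    "Reviews": [10000, 20000, 30000, 20000, 40000, 120000],
})

review_means = yearly_mean_with_and_without(toy_reviews)
print(review_means)
```

> If `mean_trimmed` for 2020 is still clearly above earlier years, the increase is not only due to the few books with the most reviews.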

In [None]:
plt.figure(figsize=(20, 8))
sns.boxenplot(x="Year", y="Reviews", data=data)
plt.title('Reviews distribution by year')
plt.show()

## Author

In [None]:
from wordcloud import WordCloud

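# Generate a word cloud image from the author names.
# Joining the column avoids the brackets and quotes that str() of a NumPy array adds.
wordcloud = WordCloud(background_color='white').generate(' '.join(data['Author'].astype(str)))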

plt.figure(figsize=(40, 6))
plt.imshow(wordcloud)
plt.axis("off")
plt.title('Main authors')
plt.show()

## Combination

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(22, 10))

sns.scatterplot(x="Price", y="Reviews", hue='Genre', data=data, ax=ax[0])
sns.scatterplot(x="Price", y="User Rating", hue='Genre', data=data, ax=ax[1])
plt.show()

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(22, 10))

sns.regplot(x="Year", y="Reviews", data=data, ax=ax[0])
sns.regplot(x="Year", y="Price", data=data, ax=ax[1])
plt.show()