<img style='margin: 0 auto' src="https://img.icons8.com/color/144/000000/amazon.png" />
<h1 style='text-align: center'>Amazon Bestselling Books</h1>
<h2 style='text-align: center'>An Exploratory Data Analysis using Matplotlib and Seaborn</h2>


In this notebook we take a look at the top 50 bestselling books on Amazon for each year from 2009 to 2019. The dataset is available [here](https://www.kaggle.com/sootersaalu/amazon-top-50-bestselling-books-2009-2019).

# Getting Started

Let's first take a look at the data we are dealing with.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
%matplotlib inline
import seaborn as sns
plt.style.use('fivethirtyeight')

In [None]:
books = pd.read_csv('../input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv', parse_dates=['Year'])
books.head(10)

In [None]:
books.describe()

In [None]:
books.describe(include='O')

# What were the best books over the decade?

In [None]:
best_books = books['Name'].value_counts()[:10]

fig, ax = plt.subplots(figsize=(12, 8))
sns.barplot(y=best_books.index, x=best_books, ax=ax)
ax.set_xlabel(None)
ax.set_title('Top 10 Bestselling Books from 2009-2019', fontdict={'fontsize':22})
fig.show()

# ...and the best Authors?

In [None]:
best_author = books[['Name', 'Author']].groupby('Author').agg('nunique').nlargest(10, columns='Name')

fig, ax = plt.subplots(figsize=(12,8))
sns.barplot(x=best_author.Name, y=best_author.index, ax=ax)  
ax.set_xlabel(None)
ax.set_ylabel(None)
ax.set_title('Authors with the Most Bestselling Books from 2009-2019', fontdict={'fontsize':22})
fig.show()

Note: This graph does not imply that Jeff Kinney had written 12 *different* bestsellers.
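
One way to keep the two readings apart is to count list appearances and distinct titles side by side. The sketch below uses a handful of made-up rows (not taken from the dataset) that share the `Name` and `Author` column names; `size` counts rows, while `nunique` counts different titles.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT taken from the dataset.
toy = pd.DataFrame({
    'Name':   ['Book A', 'Book A', 'Book B', 'Book C', 'Book C', 'Book C'],
    'Author': ['Author X', 'Author X', 'Author X', 'Author Y', 'Author Y', 'Author Y'],
})

# appearances: how many yearly-list rows an author has
# distinct_titles: how many different book names are behind those rows
per_author = toy.groupby('Author')['Name'].agg(appearances='size', distinct_titles='nunique')
print(per_author)
```

Running the same aggregation on `books` should show, per author, whether a high bar comes from many titles or from a few titles that reappear year after year.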

# How are Bestsellers rated on Amazon?

In [None]:
sns.displot(books, x='User Rating', kind='kde')
plt.gcf().set_size_inches(12, 8)
plt.title('KDE for User Rating', fontdict={'fontsize':22})
plt.show()

* Most ratings for bestselling books fall in the range 4.5-4.9
* The density peaks at about 4.75
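
Readings taken from a KDE curve can be cross-checked with plain numbers. The following is a minimal sketch on made-up ratings (not the dataset's values); the bounds 4.5 and 4.9 simply mirror the range mentioned above.

```python
import pandas as pd

# Toy ratings for illustration only; these are NOT taken from the dataset.
ratings = pd.Series([3.9, 4.2, 4.5, 4.6, 4.7, 4.7, 4.8, 4.8, 4.8, 4.9], name='User Rating')

# Share of ratings inside the closed interval [4.5, 4.9]
share_in_range = ratings.between(4.5, 4.9).mean()

# Most frequent rating value (first one if there are ties)
most_common = ratings.mode().iloc[0]

print(f'{share_in_range:.0%} of ratings lie between 4.5 and 4.9')
print(f'most common rating: {most_common}')
```

With the real data, `books['User Rating']` would take the place of `ratings`. Note that a KDE smooths the values, so its peak does not have to coincide exactly with the most frequent rating.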

# Fiction or Non Fiction - what is more popular?

In [None]:
genre_info = books.groupby('Genre', as_index=False).agg({'Name':'count', 'User Rating':'mean', 'Price':'mean'})

fig, ax1 = plt.subplots(figsize=(12,8))
sns.barplot(data=genre_info, x='Genre', y='Name', order=['Non Fiction', 'Fiction'], edgecolor="0", linewidth=3, ax=ax1)
ax1.set_xlabel('Genre')
ax1.set_ylabel(None)
ax1.set_title('Number of Bestsellers by Genre from 2009-2019', fontdict={'fontsize':22})
fig.show()

* Non Fiction books appear more often among the bestsellers, although the difference is not substantial

# Is there a price difference between the two Genres?

In [None]:
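# just calculations
most_expensive_book = books.loc[books.Price > 100]
# select Price explicitly: median() over the text columns raises a TypeError in pandas >= 2.0
median_vals = books.groupby('Genre')['Price'].median()

fig, ax = plt.subplots(figsize=(12,8))
sns.boxplot(x="Genre", y="Price", data=books, ax=ax)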
ax.set_yscale('log')
ax.set_ylabel('Price (log Scale)')
ax.set_title('Price per Genre', fontdict = {'fontsize' : 22})
ax.yaxis.set_major_formatter(FuncFormatter(lambda y, _: f'{y:.0f}€'))
fig.show()

* Non Fiction books tend to be more expensive
* The median price is 12€ for Non Fiction books and 9€ for Fiction
* Both genres have outliers above the upper whisker, and one Non Fiction bestseller costs more than 100€ <br>(Spoiler: The book I'm talking about is 'Diagnostic and Statistical Manual of Mental Disorders' and costs 105€)
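
The figures quoted above come from two small lookups: a per-genre median and a filter on price. A minimal sketch on made-up rows (names and prices are hypothetical, not from the dataset) could look like this:

```python
import pandas as pd

# Toy rows for illustration only; these are NOT taken from the dataset.
toy = pd.DataFrame({
    'Name':  ['Novel A', 'Novel B', 'Novel C', 'Guide A', 'Guide B', 'Manual C'],
    'Genre': ['Fiction'] * 3 + ['Non Fiction'] * 3,
    'Price': [5, 8, 20, 7, 11, 120],
})

# Median price per genre (the line inside each box of the boxplot)
median_price = toy.groupby('Genre')['Price'].median()

# Rows above the 100 threshold, i.e. the extreme outliers
over_100 = toy.loc[toy['Price'] > 100, ['Name', 'Genre', 'Price']]

print(median_price)
print(over_100)
```

Applied to `books`, the same two expressions should reproduce the medians and name the book behind the highest point in the plot.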

# Development of Price over the years

In [None]:
year_info = books.groupby(['Year', 'Genre'], as_index=False).agg({'Price':'mean'})

year_fiction = year_info.query('Genre == "Fiction"')
year_non_fiction = year_info.query('Genre == "Non Fiction"')

fig, ax = plt.subplots(figsize=(12,8))
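# ax.plot handles datetime x values; plot_date is deprecated in recent Matplotlib releases
ax.plot(year_fiction['Year'], year_fiction['Price'], marker='o', linestyle='solid', label='Fiction', color='#e36149')
# use the Non Fiction rows for both x and y
ax.plot(year_non_fiction['Year'], year_non_fiction['Price'], marker='o', linestyle='solid', label='Non Fiction', color='#1b86ba')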
ax.set_title('Average Price of Bestselling Books from 2009-2019', fontdict = {'fontsize' : 22})
ax.yaxis.set_major_formatter(FuncFormatter(lambda y, _: f'{y:.0f}€'))
ax.legend()
fig.show()

* Average prices went down over time for both Fiction and Non Fiction
* Fiction books were on average almost always cheaper than Non Fiction. The exception is 2009, when both genres averaged about 15.5€ and Non Fiction was slightly cheaper
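
To find the years in which the usual ordering flips, the yearly averages can be put side by side in a table. This sketch uses made-up prices and plain integer years (the notebook parses `Year` as dates); the `Difference` column is a helper introduced here for illustration.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT taken from the dataset.
toy = pd.DataFrame({
    'Year':  [2009, 2009, 2009, 2009, 2010, 2010, 2010, 2010],
    'Genre': ['Fiction', 'Fiction', 'Non Fiction', 'Non Fiction'] * 2,
    'Price': [18, 14, 13, 17, 10, 12, 13, 15],
})

# One row per year, one column per genre, mean price in each cell
avg = toy.pivot_table(index='Year', columns='Genre', values='Price', aggfunc='mean')

# Positive values: Non Fiction was more expensive that year
avg['Difference'] = avg['Non Fiction'] - avg['Fiction']

# Years in which Fiction was, on average, the more expensive genre
fiction_pricier = avg.index[avg['Difference'] < 0].tolist()

print(avg)
print(fiction_pricier)
```

The wide layout makes exceptions easy to spot, since any year with a negative `Difference` stands out immediately.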

# How does the book price affect User Rating?

In [None]:
fig, ax = plt.subplots(figsize=(8,7))
sns.regplot(data=books, x='Price', y='User Rating', ax=ax)
ax.set_xscale('log')
ax.set_title('User Rating to Price', fontdict={'fontsize': 22})
ax.xaxis.set_major_formatter(FuncFormatter(lambda x, _: f'{x:.0f}€'))

* User Rating tends to decrease slightly as Price increases

__Possible Explanation__: higher prices mean higher expectations
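
A regression line shows the direction of a trend; a correlation coefficient adds a rough measure of its strength. The sketch below computes a rank (Spearman-style) correlation as the Pearson correlation of the ranks, on made-up values that are not taken from the dataset. The rank-based measure is my choice here because prices are skewed; it is not something the plot above computes.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT taken from the dataset.
toy = pd.DataFrame({
    'Price':       [5, 8, 10, 15, 20, 40],
    'User Rating': [4.8, 4.7, 4.8, 4.6, 4.6, 4.4],
})

# Spearman correlation = Pearson correlation of the ranks (ties get average ranks)
rho = toy['Price'].rank().corr(toy['User Rating'].rank())

print(f'rank correlation between Price and User Rating: {rho:.2f}')
```

A value close to 0 would mean the downward slope is weak, while a value near -1 would indicate a consistent decline. Either way, a correlation alone does not confirm the explanation offered above.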

<h2 style='text-align: center'>Thank you for reading this notebook to the end!<br>Feel free to upvote and leave a comment.</h2><h4 style='text-align: center'>Also please tell me what I could've done better...</h4>