# Amazon top 50 bestselling books

Source: [Kaggle](https://www.kaggle.com/sootersaalu/amazon-top-50-bestselling-books-2009-2019)

The dataset contains the Top 50 best-selling books on Amazon for each year from 2009 to 2019.
That gives a total of 550 entries (11 years × 50 books).

__Index__

1. Data Import and Initial Overview
2. Summary of Data
3. Visualizations

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

## 1. Data Import and Initial Overview

### 1.1 Data Import

In [None]:
books = pd.read_csv('../input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv')
books.head()

### 1.2 Initial Overview

In [None]:
# Sort the dataset by Year and Book Name
books.sort_values(by = ['Year','Name'], inplace = True)
# drop = True discards the old index instead of adding it as an extra 'index' column
books.reset_index(drop = True, inplace = True)
books.head(10)

In [None]:
books.info()

In [None]:
print("Missing Values \n", books.isnull().sum())

In [None]:
print("Duplicate entries", books.duplicated().sum())

__Note:__
1. There are no missing values or duplicate entries.
2. Out of 7 variables, 4 are numerical and 3 are text type.
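
The checks behind this note can also be written as explicit values rather than read off printed output. The sketch below is only an illustration on a tiny hypothetical frame: the rows and the name `toy` are invented for the example, and only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    'Name': ['Book A', 'Book B', 'Book C'],
    'Author': ['Author X', 'Author Y', 'Author X'],
    'User Rating': [4.7, 4.5, 4.8],
    'Reviews': [1200, 350, 9800],
    'Price': [8, 0, 15],
    'Year': [2009, 2009, 2010],
    'Genre': ['Fiction', 'Non Fiction', 'Fiction'],
})

n_missing = int(toy.isnull().sum().sum())    # total number of missing cells
n_duplicates = int(toy.duplicated().sum())   # number of fully duplicated rows
n_numeric = toy.select_dtypes(include='number').shape[1]  # numerical columns
n_text = toy.shape[1] - n_numeric            # remaining (text) columns
rows_per_year = toy.groupby('Year').size()   # entries per year

print(n_missing, n_duplicates, n_numeric, n_text)
print(rows_per_year)
```

Run against `books` instead of `toy`, `rows_per_year` would be expected to show 50 entries for every year if the dataset matches its description.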

## 2. Summary of Data

### 2.1 Top Performing

#### 2.1.1 Top 10 Reviewed Books

In [None]:
book2 = books.groupby(by = 'Name').aggregate({'Author':'first','User Rating':'mean','Reviews':'mean','Price':'max','Genre':'first'})
results = book2.sort_values(by = ['Reviews','User Rating'], ascending = False)
results.loc[:, ['Author','Reviews','User Rating','Genre']].head(10)

#### 2.1.2 Top 10 Rated Books

In [None]:
results.sort_values(by = ['User Rating','Reviews'], ascending = False, inplace = True)
results.loc[:, ['Author','User Rating','Reviews','Genre']].head(10)

#### 2.1.3 Highest Rated Free Books

In [None]:
results = book2[book2.Price == 0]
results = results.sort_values(by = ['User Rating','Reviews'], ascending = False)
results.loc[:, ['Author','User Rating','Reviews','Genre']]

#### 2.1.4 Most Reviewed Non-Fiction

In [None]:
results = book2[book2.Genre == 'Non Fiction']
results = results.sort_values(by = ['Reviews','User Rating'], ascending = False)
results.loc[:, ['Author','User Rating','Reviews']].head(10)

#### 2.1.5 Most Reviewed Fiction

In [None]:
results = book2[book2.Genre == 'Fiction']
results = results.sort_values(by = ['Reviews','User Rating'], ascending = False)
results.loc[:, ['Author','User Rating','Reviews']].head(10)

#### 2.1.6 Highest Rated Books at or below USD 5

In [None]:
results = book2[book2.Price <= 5]
results = results.sort_values(by = ['User Rating','Price'], ascending = False)
results.head(10)

### 2.2 Statistical Summaries

In [None]:
books.describe()

In [None]:
books['Genre'].value_counts()

## 3. Visualizations

### 3.1 Categorical variables

#### 3.1.1 Distribution of Genre

In [None]:
# Number of best-sellers per Year and Genre
table = books.groupby(['Year','Genre']).size().reset_index(name = 'Count')
# Overall number of best-sellers per Genre (index = genre labels, values = counts)
genre_counts = books['Genre'].value_counts()

fig, ax = plt.subplots(1, 2, figsize=(24,8))
# Bar Plot: labels and heights come from the same Series, so they cannot be mismatched
ax[0].bar(x = genre_counts.index, height = genre_counts.values)
ax[0].set_title('Overall distribution of Genre')
# Line Plot
sns.lineplot(data = table, x = 'Year', y = 'Count', hue = 'Genre', ax = ax[1])
ax[1].set_title('Number of Best-Sellers per Genre over time')
plt.show()

__Note:__
1. From 2009 to 2019, Non-Fiction (56.3%) books have been on the best-selling list more often than Fiction (43.6%) books.
2. After 2017, we see that Non-Fiction books have more often been on the best-selling list.
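
Genre shares and per-year counts like the ones quoted in this note can be computed directly instead of being read off the plots. A minimal sketch, using a small hypothetical frame (`toy` and its rows are invented for illustration and are not the real data):

```python
import pandas as pd

# Hypothetical toy rows, not the real data.
toy = pd.DataFrame({
    'Year':  [2017, 2017, 2017, 2018, 2018, 2018, 2018, 2018],
    'Genre': ['Fiction', 'Non Fiction', 'Non Fiction',
              'Fiction', 'Fiction', 'Non Fiction', 'Non Fiction', 'Non Fiction'],
})

# Overall share of each genre, in percent.
share = (toy['Genre'].value_counts(normalize=True) * 100).round(1)

# Number of entries per year and genre: one row per year, one column per genre.
per_year = pd.crosstab(toy['Year'], toy['Genre'])

print(share)
print(per_year)
```

The same two calls applied to `books` would give the overall percentages and a year-by-year table in which the genre with more best-sellers in each year can be compared directly.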

#### 3.1.2 Top Books and Authors

In [None]:
# Summary of Book and Authors
table_book = books['Name'].value_counts()
table_book = table_book.head(10)

table_auth = books['Author'].value_counts()
table_auth = table_auth.head(25)

fig, ax = plt.subplots(1, 2, figsize=(30,10))
# Books
ax[0].barh(y = table_book.index, width = table_book.values)
ax[0].set_title('Number of Years a Book has been on Best-Selling List')
# Authors
ax[1].barh(y = table_auth.index, width = table_auth.values)
ax[1].set_title('Number of times Author has been on Best-Selling list')
plt.show()

__Note:__

Between 2009 and 2019, _Jeff Kinney_ is the author with the most entries on the best-selling list, and _Publication Manual of the American Psychological Association (6th Edition)_ is the book that has appeared on the list in the most years.
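
Note that `value_counts()` on `Author` counts list entries (one per book and year), so an author with a single long-running title and an author with many different titles can get the same count. The sketch below separates the two measures on a hypothetical frame; `toy` and its rows are invented for illustration only.

```python
import pandas as pd

# Hypothetical toy rows, not the real data.
toy = pd.DataFrame({
    'Name':   ['Book A', 'Book A', 'Book B', 'Book C', 'Book C', 'Book C'],
    'Author': ['Author X', 'Author X', 'Author X',
               'Author Y', 'Author Y', 'Author Y'],
    'Year':   [2009, 2010, 2010, 2009, 2010, 2011],
})

# Number of distinct years in which each book appears on the list.
years_per_book = toy.groupby('Name')['Year'].nunique().sort_values(ascending=False)

# Two different ways to rank authors.
entries_per_author = toy['Author'].value_counts()            # book-year entries
titles_per_author = toy.groupby('Author')['Name'].nunique()  # distinct titles

print(years_per_book)
print(entries_per_author)
print(titles_per_author)
```

In this toy example both authors have three entries, but only one of them has more than one distinct title, which is the kind of difference the bar chart of entry counts cannot show.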

### 3.2 Numerical variables

#### 3.2.1 Distribution of Price

In [None]:
fig, ax = plt.subplots(2,2, figsize = (24,12))
# Plot 1
sns.kdeplot(data = books, x = 'Price', ax = ax[0][0])
ax[0][0].set_title('Plot 1: Distribution of Price 2009 -- 2019')

# Plot 2
sns.kdeplot(data = books, x = 'Price', hue = 'Year', ax = ax[0][1])
ax[0][1].set_title('Plot 2: Distribution of Price vs Year')

# Plot 3
sns.kdeplot(data = books, x = 'Price', hue = 'Genre', ax = ax[1][0])
ax[1][0].set_title('Plot 3: Distribution of Price vs Genre')

# Plot 4
sns.lineplot(data = books, x = 'Year', y = 'Price', hue = 'Genre', ci = None, ax = ax[1][1])
ax[1][1].set_title('Plot 4: Time Series plot of Price of each Genre')

plt.show()

__Note:__
1. The overall price distribution is slightly positively skewed, i.e. most best-selling books are priced at the lower end of the range, with a long tail of a few expensive books. _(Plot 1)_
2. Over time, the price distribution is becoming tighter: in later years, best-selling books come from a narrower price range than in earlier years. _(Plot 2)_
3. Best-selling Non-Fiction books are slightly more expensive than Fiction books. _(Plot 3)_
4. The price gap between best-selling Fiction and Non-Fiction books has narrowed in 2018 and 2019. _(Plot 4)_
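
The observations on skew and spread can be backed by numbers as well as by density plots. The following is a minimal sketch on hypothetical prices (`toy` and its values are invented for illustration, not taken from the dataset); it uses the sample skewness from `Series.skew()` and the per-year standard deviation as one possible measure of how tight the distribution is.

```python
import pandas as pd

# Hypothetical toy prices, not the real data.
toy = pd.DataFrame({
    'Year':  [2009] * 5 + [2019] * 5,
    'Price': [2, 8, 10, 15, 60, 8, 9, 10, 11, 12],
})

# Sample skewness: positive when the long tail points towards high prices.
overall_skew = toy['Price'].skew()

# With a positive skew the mean is pulled above the median by the expensive tail.
mean_price = toy['Price'].mean()
median_price = toy['Price'].median()

# Spread of prices within each year; a smaller value means a tighter distribution.
spread = toy.groupby('Year')['Price'].std()

print(overall_skew, mean_price, median_price)
print(spread)
```

Other spread measures, such as the interquartile range, would serve the same purpose and are less sensitive to a few very expensive books.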

#### 3.2.2 Relation between Numerical variables

In [None]:
# Correlation
sns.heatmap(data = books.loc[:, ['User Rating','Price','Reviews']].corr(),
            cmap = 'YlOrBr', annot = True)

From the correlation plot, we can conclude that there is no strong linear correlation between the User Rating, Reviews and Price of a book.
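
`DataFrame.corr()` computes the Pearson coefficient by default, which only measures linear association. As a hedged cross-check, a rank-based coefficient can be computed alongside it. The sketch below uses hypothetical values (`toy` is invented for illustration, not the real data) in which two columns are perfectly monotonic but not linear.

```python
import pandas as pd

# Hypothetical toy values, not the real data.
toy = pd.DataFrame({
    'User Rating': [4.1, 4.3, 4.5, 4.7, 4.9],
    'Reviews':     [10, 100, 1000, 10000, 100000],
    'Price':       [12, 5, 20, 8, 15],
})

pearson = toy.corr(method='pearson')    # linear association (the default)
spearman = toy.corr(method='spearman')  # rank-based, monotonic association

print(pearson.round(2))
print(spearman.round(2))
```

If both matrices stay close to zero on `books`, that would support the conclusion above for monotonic as well as linear relationships.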

#### 3.2.3 Reviews and Ratings

In [None]:
# reset_index() turns Year and Genre back into regular columns for plotting
table_year = books.groupby(by = ['Year','Genre']).aggregate({'Reviews':'sum','User Rating':'mean'}).reset_index()

fig, ax = plt.subplots(1,2, figsize = (20,10))
sns.lineplot(data = table_year, x = 'Year', y = 'User Rating', hue = 'Genre', ax = ax[0])
ax[0].set_title('Plot 1: User rating vs Time')

sns.lineplot(data = table_year, x = 'Year', y = 'Reviews', hue = 'Genre', ax = ax[1])
ax[1].set_title('Plot 2: Reviews vs Time')
plt.show()
