# Amazon top 50 bestselling books - Data analysis

# Introduction

"Amazon.com, Inc. is an American multinational technology company based in Seattle, Washington, which focuses on e-commerce, cloud computing, digital streaming, and artificial intelligence. It is considered one of the Big Five companies in the U.S. information technology industry, along with Google, Apple, Microsoft, and Facebook. The company has been referred to as 'one of the most influential economic and cultural forces in the world', as well as the world's most valuable brand."
[https://en.wikipedia.org/wiki/Amazon_(company)](https://en.wikipedia.org/wiki/Amazon_(company))

# About

Dataset on Amazon's Top 50 bestselling books from 2009 to 2019. It contains 550 entries (50 per year), categorized into fiction and non-fiction using Goodreads.

# Purpose

The purpose of this notebook is to analyze the dataset and extract insights that could help the business grow, such as the best-selling genres and authors and a good price/rating relationship.
Such insights have great business value: by knowing what is selling and what is not, the company can focus its marketing strategies to improve overall sales. For example, customers might perceive a genre as too expensive and give it a low rating, or books with high ratings and low sales might not be visible to other customers.

# Table of Contents
1. [Data loading and data cleaning](#1.-Data-loading-and-data-cleaning)
2. ['Genre' performance as per 'User Rating'](#2.-'Genre'-performance-as-per-'User-Rating')
3. [Prices behaviour](#3-Prices-behaviour)
4. [Prices and User Rating](#4.-Prices-and-User-Rating)
5. [Which books give the best rating for money?](#5.-Which-books-give-the-best-rating-for-money?)
6. [Which are the free books with the best rating?](#6.-Which-are-the-free-books-with-the-best-rating?)


**Features of the dataset**

- **Name**: Name of the Book. String
- **Author**: The author of the Book. String
- **User Rating**: Amazon User Rating. Float
- **Reviews**: Number of written reviews on Amazon. Float
- **Price**: The price of the book (as at 13/10/2020). Float
- **Year**: The year(s) it ranked on the bestseller list. Date
- **Genre**: Whether fiction or non-fiction. String


In [None]:
# data wrangling
import pandas as pd
import numpy as np

# data viz
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

%matplotlib inline

# 1. Data loading and data cleaning
## Data Loading
Let's begin by loading the dataset and taking a quick look at the data, its general statistics, and any missing values.

In [None]:
amazon = pd.read_csv('../input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv')

In [None]:
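display(amazon.head())

# Visual check for missing values (a highlighted cell marks a null)
sns.heatmap(amazon.isnull())
plt.show()

print(amazon.shape, "\n")
amazon.info()  # info() prints its own report and returns None
print()
print(amazon.describe())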

Thankfully, there aren't any missing values in the dataset.
Looking at the descriptive statistics gives a first idea of how some features behave. The mean of 'Reviews' is much bigger than its median, meaning that a few books have far more reviews than the rest. Similarly, the median of 'Price' is lower than its mean, meaning that most of the books have a low price and a few expensive ones pull the mean up.
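
The mean/median comparison above can be illustrated with a minimal sketch. The numbers below are hypothetical toy values, not taken from the dataset; they only show how a single large value pulls the mean above the median.

```python
import pandas as pd

# Hypothetical toy data (not from the dataset): most values are small,
# one is very large, which gives a right-skewed distribution.
reviews = pd.Series([100, 200, 300, 400, 50_000], name='Reviews')

mean_reviews = reviews.mean()      # pulled upwards by the single large value
median_reviews = reviews.median()  # middle value, robust to the extreme

print(f"mean={mean_reviews:.0f}, median={median_reviews:.0f}")

# pandas can also quantify the asymmetry directly; a positive value
# indicates a longer right tail.
print("skewness:", reviews.skew())
```

On the notebook's own data, something like `amazon[['Reviews', 'Price']].skew()` should give the same kind of summary.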

## Data cleaning
We will go feature by feature, looking for inconsistencies and cleaning them.

In [None]:
amazon.columns

### Strings

In [None]:
#Name
print("Name:", '\n', amazon['Name'].value_counts())

Some book names appear more than once. Perhaps they made the list in different years. Let's check.

In [None]:
display(amazon[amazon['Name']=='Publication Manual of the American Psychological Association, 6th Edition'].head())
display(amazon[amazon['Name']=='StrengthsFinder 2.0'].head())
display(amazon[amazon['Name']=="Oh, the Places You'll Go!"].head())

# Drop rows that are identical in every column
print("shape of the dataset before dropping duplicates is : {}".format(amazon.shape))
amazon.drop_duplicates(inplace=True)
print("shape of the dataset after dropping duplicates is : {}".format(amazon.shape))


The names that appear more than once belong to books that made the bestseller list in several years, so their rows differ in 'Year'. The `drop_duplicates` function doesn't drop any rows, which means there are no fully duplicated rows in the dataset.
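
As a minimal sketch of the distinction, the toy frame below (hypothetical rows, not from the dataset) shows why `drop_duplicates` with its default arguments leaves repeated titles in place, and how the `subset` argument could be used if one row per title were wanted instead.

```python
import pandas as pd

# Hypothetical rows mimicking the dataset's layout: same book, different years.
books = pd.DataFrame({
    'Name': ['Book A', 'Book A', 'Book B'],
    'Author': ['Author 1', 'Author 1', 'Author 2'],
    'Year': [2010, 2011, 2010],
})

# No row is identical in every column, so nothing counts as a duplicate.
full_row_dupes = books.duplicated().sum()

# Restricting the check to 'Name' and 'Author' reveals the repeated title.
title_dupes = books.duplicated(subset=['Name', 'Author']).sum()

# One row per title, keeping the first year in which it appeared.
unique_titles = books.drop_duplicates(subset=['Name', 'Author'], keep='first')

print(full_row_dupes, title_dupes, len(unique_titles))
```

Whether to collapse titles this way depends on the question: per-year trends need every row, while per-book rankings may be clearer with one row per title.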

In [None]:
# Genre
print("Genre:", '\n', amazon['Genre'].value_counts())

# Pass the column as a keyword: recent seaborn versions no longer accept it positionally
sns.countplot(x='Genre', data=amazon, palette='Set3')
plt.show()

The 'Genre' feature is reasonably balanced between its two values and shows no inconsistencies.

### Numbers

In [None]:
numbers = ['User Rating', 'Reviews', 'Price']

amazon.loc[:,numbers].hist(color='salmon', figsize=(20,10), edgecolor='black', bins=10)
plt.show()

def binned_counts(column, bins=10):
    """Count how many books fall into each of `bins` equal-width intervals."""
    binned = pd.cut(amazon[column], bins=bins)
    counts = binned.value_counts(sort=False)  # keep the intervals in ascending order
    return counts.rename_axis(column).reset_index(name=column + ' Count')

# One (interval, count) pair of columns per numeric feature, side by side
display(pd.concat([binned_counts(c) for c in numbers], axis=1))


Even though the data presents some outliers, they are not going to be dropped (yet): we first want to see which books these outliers correspond to in the later sections.
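
To see which rows count as outliers before deciding what to do with them, one option is the common 1.5 × IQR rule. This is a sketch of that convention, not the method used in this notebook, and the prices below are hypothetical.

```python
import pandas as pd

# Hypothetical prices (not from the dataset) with one extreme value.
prices = pd.Series([5, 8, 9, 10, 11, 12, 13, 15, 105], name='Price')

# Interquartile range: the spread of the middle 50% of the values.
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1

# Conventional fences: anything beyond 1.5 * IQR from the quartiles is flagged.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (prices < lower) | (prices > upper)

# Inspect the flagged values instead of dropping them straight away.
print(prices[is_outlier])
```

Applied to the notebook's DataFrame, the same boolean mask could be used as `amazon[is_outlier]` to list the affected books.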

### Dates

In [None]:
# Year
print("Year:", '\n', amazon['Year'].value_counts())

# Pass the column as a keyword: recent seaborn versions no longer accept it positionally
sns.countplot(x='Year', data=amazon, palette='Set3')
plt.show()

Again, the data doesn't show any inconsistencies, and it is perfectly balanced: every year has the same number of books.

# 2. 'Genre' performance as per 'User Rating'

In [None]:
fig, ax = plt.subplots(1,2, figsize=(14,5))

# Overall rating per year (seaborn aggregates repeated x values with the mean).
# marker='o' draws a point per year; markers=True has no effect without `style`.
sns.lineplot(x='Year', y='User Rating', data=amazon, ci=None, marker='o', ax=ax[0])
ax[0].set_xticks(ticks=amazon['Year'].value_counts(ascending=True).index)

# Same series, split by genre
sns.lineplot(x='Year', y='User Rating', hue='Genre', data=amazon, ci=None, marker='o', ax=ax[1])
ax[1].set_xticks(ticks=amazon['Year'].value_counts(ascending=True).index)

plt.show()

time = pd.DataFrame(amazon.groupby('Year')['User Rating'].mean())
time_genre = pd.DataFrame(amazon.groupby(['Genre', 'Year'])['User Rating'].mean())

time['Rating Fiction'] = list(np.around(time_genre.loc['Fiction'].reset_index()['User Rating'], 3))
time['Rating Non Fiction'] = list(np.around(time_genre.loc['Non Fiction'].reset_index()['User Rating'], 3))
display(time)

* Since 2012, the books have been performing well in terms of average 'User Rating', showing an increasing trend. In 2016, though, the average dipped slightly, by 0.018. Looking at the performance per genre, the dip came from a weaker year for Non Fiction books, whose average fell from 4.655 to 4.588. Non Fiction books recovered in 2017, reversing the decline.
* Except for 2012 and 2013, Fiction books have been rated higher than Non Fiction books.
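
The per-year, per-genre table behind these observations can also be built in one step with `pivot_table`, and `diff()` makes the year-on-year changes explicit. This is a sketch on hypothetical ratings, not the dataset's values.

```python
import pandas as pd

# Hypothetical ratings (not from the dataset) for two years and two genres.
ratings = pd.DataFrame({
    'Year': [2015, 2015, 2015, 2016, 2016, 2016],
    'Genre': ['Fiction', 'Non Fiction', 'Non Fiction',
              'Fiction', 'Fiction', 'Non Fiction'],
    'User Rating': [4.8, 4.6, 4.4, 4.7, 4.9, 4.5],
})

# Rows are years, columns are genres, cells are mean ratings.
by_genre = ratings.pivot_table(index='Year', columns='Genre',
                               values='User Rating', aggfunc='mean')

# Add the overall mean per year alongside the genre columns.
by_genre['Overall'] = ratings.groupby('Year')['User Rating'].mean()

# Year-on-year change; the first year has no predecessor, so it is NaN.
yearly_change = by_genre.diff()

print(by_genre.round(3))
print(yearly_change.round(3))
```

With the notebook's DataFrame, replacing `ratings` with `amazon` should produce the same layout as the `time` table above.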

# 3 Prices behaviour

In [None]:
fig, ax = plt.subplots(1,2, figsize=(20,5))

# Overall price per year (seaborn aggregates repeated x values with the mean).
# marker='o' draws a point per year; markers=True has no effect without `style`.
sns.lineplot(x='Year', y='Price', data=amazon, ci=None, marker='o', ax=ax[0])
ax[0].set_xticks(ticks=amazon['Year'].value_counts(ascending=True).index)

# Same series, split by genre
sns.lineplot(x='Year', y='Price', data=amazon, ci=None, marker='o', ax=ax[1], hue='Genre')
ax[1].set_xticks(ticks=amazon['Year'].value_counts(ascending=True).index)

plt.show()

time = pd.DataFrame(amazon.groupby('Year')['Price'].mean())
time_genre = pd.DataFrame(amazon.groupby(['Genre', 'Year'])['Price'].mean())

time['Price Fiction'] = list(np.around(time_genre.loc['Fiction'].reset_index()['Price'], 3))
time['Price Non Fiction'] = list(np.around(time_genre.loc['Non Fiction'].reset_index()['Price'], 3))
display(time)

* Overall, prices show a decreasing trend, with a sharp drop in 2014 driven by the Non Fiction genre, which had been noticeably more expensive than Fiction until then. Perhaps this was an attempt to catch up with the Fiction genre in terms of 'User Rating'?
* Prices rose in 2015 for both genres, but fell again in 2016.

# 4. Prices and User Rating

In [None]:
fig, ax = plt.subplots(figsize=(7,5))

ax.set_title('Average Price and Average User Rating')

ax.plot(amazon.groupby('Year')['User Rating'].mean())
ax.tick_params('y', colors='blue')
ax.set_ylabel('User Rating', color='blue')

ax2 = ax.twinx()
ax2.plot(amazon.groupby('Year')['Price'].mean(), color='darkorange')
ax2.tick_params('y', colors='darkorange')
ax2.set_ylabel('Price', color='darkorange')

ax.set_xticks(ticks=amazon['Year'].value_counts(ascending=True).index)

plt.show()



In [None]:
fig, ax = plt.subplots(1, 2, figsize=(20,5))

fiction = amazon[amazon['Genre']=='Fiction']

Nonfiction = amazon[amazon['Genre']=='Non Fiction']


ax[0].set_title('Average Price and Average User Rating (Fiction)')

ax[0].plot(fiction.groupby('Year')['User Rating'].mean())
ax[0].tick_params('y', colors='blue')
ax[0].set_ylabel('User Rating', color='blue')

ax2 = ax[0].twinx()
ax2.plot(fiction.groupby('Year')['Price'].mean(), color='darkorange')
ax2.tick_params('y', colors='darkorange')
ax2.set_ylabel('Price', color='darkorange')

ax[0].set_xticks(ticks=amazon['Year'].value_counts(ascending=True).index)


ax[1].set_title('Average Price and Average User Rating (Non Fiction)')

ax[1].plot(Nonfiction.groupby('Year')['User Rating'].mean())
ax[1].tick_params('y', colors='blue')
ax[1].set_ylabel('User Rating', color='blue')

ax3 = ax[1].twinx()
ax3.plot(Nonfiction.groupby('Year')['Price'].mean(), color='darkorange')
ax3.tick_params('y', colors='darkorange')
ax3.set_ylabel('Price', color='darkorange')

ax[1].set_xticks(ticks=amazon['Year'].value_counts(ascending=True).index)


plt.show()

In [None]:
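# Restrict to numeric columns: corr() fails on text columns in recent pandas versions
sns.heatmap(amazon.select_dtypes('number').corr(), vmin=-1, vmax=1, cmap=sns.diverging_palette(20, 220, as_cmap=True), annot=True)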
plt.show()

* From the plots above, 'User Rating' shows a loose tendency to move in the opposite direction to 'Price'.

* The correlation between 'Price' and 'User Rating' is weak, so it is not a good idea to draw a definitive conclusion from these plots. Nevertheless, 'Price' trends downwards and 'User Rating' upwards in the final years.
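
To put a number on a relationship like this, the coefficient for a single pair of columns can be computed directly. The sketch below uses hypothetical yearly averages (not the dataset's values) and also computes a rank-based coefficient by correlating the ranks, which is the definition of Spearman's correlation and is less sensitive to outliers.

```python
import pandas as pd

# Hypothetical yearly averages (not from the dataset).
yearly = pd.DataFrame({
    'Price': [15.0, 14.0, 13.5, 12.0, 11.0, 10.5],
    'User Rating': [4.55, 4.58, 4.56, 4.62, 4.66, 4.70],
})

# Pearson: strength of the linear relationship, between -1 and 1.
pearson = yearly['Price'].corr(yearly['User Rating'])

# Spearman: Pearson correlation of the ranks, so it only measures
# whether the two series move in a consistently monotonic way.
rank_corr = yearly['Price'].rank().corr(yearly['User Rating'].rank())

print(f"Pearson: {pearson:.3f}, rank-based: {rank_corr:.3f}")
```

Note that a strong correlation between yearly averages does not imply a strong correlation at the level of individual books, which is what the heatmap above measures.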

# 5. Which books give the best rating for money?

In [None]:
amazon['Price/Rating'] = amazon['Price'] / amazon['User Rating']
amazon.sort_values('Price/Rating').head(10)

The free books don't help in this analysis: with a price of 0, their ratio is always 0, so they fill the top of the list regardless of rating. Let's take them out and analyze them separately in the next section. This section will focus on paid books.

In [None]:
amazon[amazon['Price']!=0].sort_values('Price/Rating').head(10)

* With a price of just 1 and a rating of 4.5, "Eat This Not That! Supermarket Survival Guide", by David Zinczenko, gives the best rating for money 
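
The ranking logic of this section can be sketched end to end on a toy frame. The book names and values below are hypothetical; the point is that a lower 'Price/Rating' means less money paid per rating point, and that free books have to be filtered out first.

```python
import pandas as pd

# Hypothetical books (not from the dataset).
books = pd.DataFrame({
    'Name': ['Free Book', 'Cheap Book', 'Pricey Book'],
    'Price': [0, 2, 20],
    'User Rating': [4.8, 4.5, 4.9],
})

# Free books always score 0 on the ratio, so exclude them up front.
# .copy() avoids modifying a view of the original frame.
paid = books[books['Price'] > 0].copy()

# Money paid per rating point: lower is better value.
paid['Price/Rating'] = paid['Price'] / paid['User Rating']

best_value = paid.sort_values('Price/Rating').iloc[0]['Name']
print(best_value)
```

The ratio favours cheap books heavily, because prices vary far more than ratings do; that is worth keeping in mind when reading the ranking.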

# 6. Which are the free books with the best rating?

In [None]:
amazon[amazon['Price']==0].sort_values('User Rating', ascending=False).head(10)

* "Little Blue Truck" by Alice Schertle is the most beloved free book with a user rating of 4.9

## Number of reviews and rating relationship
Sometimes the rating can be misleading: a very small group of people may rate a book very highly, or a heavily hyped book may attract lots of good reviews. Looking at the number of reviews together with the rating can give a clearer and more honest picture of the book's performance.

In [None]:
amazon['Reviews/Rating'] = amazon['Reviews']/amazon['User Rating']
amazon.sort_values('Reviews/Rating', ascending=False)

* We can trust the ratings of the top three books since they have lots of reviews. Compared with the other bestselling books, two of the top three books in this list ('The Girl on the Train' and 'Gone Girl') do not perform so well, with ratings of 4.1 and 4.0. On the other hand, "Where the Crawdads Sing" by Delia Owens looks like a very smart choice, since it got lots of reviews and a top rating.

* At the bottom we see Zhi Gang Sha's books. Despite their good ratings, these books have far fewer reviews than the others, so the rating gives an incomplete picture of them.
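
A simple alternative way to express "trust the rating only when there are enough reviews" is to filter on a minimum review count before ranking by rating. This is a sketch on hypothetical data, and the cut-off `min_reviews` is an arbitrary assumption rather than a value derived from the dataset.

```python
import pandas as pd

# Hypothetical books (not from the dataset).
books = pd.DataFrame({
    'Name': ['Widely Reviewed', 'Moderately Reviewed', 'Barely Reviewed'],
    'Reviews': [50_000, 8_000, 40],
    'User Rating': [4.1, 4.8, 4.9],
})

# Assumed cut-off: below this, the rating is considered too uncertain.
min_reviews = 1_000
books['Enough Reviews'] = books['Reviews'] >= min_reviews

# Rank by rating only among books with enough reviews.
trusted = (books[books['Enough Reviews']]
           .sort_values('User Rating', ascending=False))

print(trusted[['Name', 'Reviews', 'User Rating']])
```

Here the book with the highest rating is excluded because only 40 people reviewed it, which mirrors the caveat about the books at the bottom of the list.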

## Bonus. What's the most expensive book?

In [None]:
print('Would you buy the {}?'.format(amazon.sort_values('Price', ascending=False).iloc[0,0]))