# 1. Introduction: Business Goal & Problem Definition


The goal of this project is to perform an exploratory data analysis (EDA) of Amazon's Top 50 bestselling books from 2009 to 2019. The analysis summarizes the main characteristics of the dataset using several visual methods, primarily to see what the data can tell us. The dataset features available for analysis are:

* Name: Name of the Book
* Author: The author of the Book
* User Rating: Amazon User Rating
* Reviews: Number of written reviews on Amazon
* Price: The price of the book (as of 13/10/2020)
* Year: The year(s) it ranked on the bestseller list
* Genre: Whether fiction or non-fiction
* User Rating to Price Ratio: User Rating divided by Price (engineered feature)
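
A quick way to confirm that a loaded table matches the feature list above is to compare its columns against the expected names. The sketch below is illustrative only: `EXPECTED_COLUMNS`, `missing_columns`, and the two-row `sample` frame are hypothetical names and made-up data, not part of the original notebook.

```python
import pandas as pd

# Raw column names as listed above; the ratio column is engineered later,
# so it is not expected in the raw file.
EXPECTED_COLUMNS = ["Name", "Author", "User Rating", "Reviews", "Price", "Year", "Genre"]

def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df, in order."""
    return [col for col in expected if col not in df.columns]

# Made-up two-row sample, used only to exercise the helper.
sample = pd.DataFrame({
    "Name": ["Book A", "Book B"],
    "Author": ["Author X", "Author Y"],
    "User Rating": [4.7, 4.3],
    "Reviews": [1200, 530],
    "Price": [12, 8],
    "Year": [2015, 2016],
    "Genre": ["Fiction", "Non Fiction"],
})

print(missing_columns(sample))                          # nothing missing
print(missing_columns(sample.drop(columns=["Price"])))  # reports "Price"
```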

# 2. Importing Basic Libraries

In [None]:
!pip install openpyxl
import io
import openpyxl
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# 3. Data Collection

In [None]:
books_ds = pd.read_csv("../input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv", sep=",")

books_ds

# 4. Data Preliminary Exploration

In [None]:
#Checking a dataset sample

pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)
pd.options.display.float_format="{:,.2f}".format
books_ds.sample(n=10, random_state=0)

In [None]:
#Checking dataset info by feature

books_ds.info(verbose=True, show_counts=True)  # show_counts replaces the removed null_counts argument

In [None]:
#Counting zero values per feature

zeros_per_feature = (books_ds == 0).sum(axis=0)
zeros_per_feature.to_excel("zeros_per_feature.xlsx")
zeros_per_feature

In [None]:
#Checking the existence of duplicated rows

books_ds.duplicated().sum()

In [None]:
#Checking basic statistical data by feature

books_ds.describe(include="all")

# 5. Data Cleaning

    We'll perform the following:

    1. Treat the 12 rows with Price = 0 by replacing the zero with an estimate based on the User Rating, so the rows can be kept

    2. Create a calculated field that could add relevant information to the analysis: User Rating to Price Ratio

    3. Convert "Genre" to dummy variables so its correlations can be analyzed in step 7

    4. Convert the numerical variables User Rating, Price, and User Rating to Price Ratio to categorical ranges (used in step 7 when analyzing correlations)

    * No duplicated rows found
    * No missing or invalid values to treat, apart from the zero prices handled in item 1
    * No columns to remove
    * No outliers found
    * The entire dataset will be used
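
Item 1 can be read as scaling each book's rating by a dataset-wide "price per rating point". Below is a minimal sketch on made-up data (the `toy` frame, its values, and the `price_per_rating_point` name are hypothetical), assuming zero prices are first marked as missing:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "User Rating": [4.0, 5.0, 4.5],
    "Price": [8.0, 10.0, 0.0],  # the last price is a zero to be estimated
})

# Mark zero prices as missing so they are excluded from the price total.
toy["Price"] = toy["Price"].replace(0, np.nan)

# Total known price divided by the total rating of all rows.
price_per_rating_point = toy["Price"].sum() / toy["User Rating"].sum()

# Fill each missing price with rating * price-per-rating-point.
toy["Price"] = toy["Price"].fillna(price_per_rating_point * toy["User Rating"])

print(toy)
```

Because the rating total includes the rows whose price is missing, the estimate comes out lower than it would if only the priced rows were used in the denominator.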

In [None]:
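#1 Replace zero prices with an estimate proportional to each book's User Rating
#  (assigning back avoids chained inplace calls, which newer pandas versions warn about)

books_ds["Price"] = books_ds["Price"].replace(0, np.nan)
books_ds["Price"] = books_ds["Price"].fillna(books_ds["Price"].sum() / books_ds["User Rating"].sum() * books_ds["User Rating"])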

#2

books_ds["User Rating to Price Ratio"] = books_ds["User Rating"] / books_ds["Price"]


#3

# dtype=int keeps the dummy columns numeric (newer pandas versions default to bool)
books_ds = pd.concat([books_ds, pd.get_dummies(books_ds["Genre"], prefix="Genre", dtype=int)], axis=1)

#4

books_ds["User Rating_Range"] = np.where(books_ds["User Rating"]>=4.75, "4.75 to 5",
                                np.where(books_ds["User Rating"]>=4.5, "4.5 to 4.75",
                                np.where(books_ds["User Rating"]>=4, "4 to 4.5",
                                np.where(books_ds["User Rating"]>=3, "3 to 4",
                                np.where(books_ds["User Rating"]>=2, "2 to 3", "<2")))))
books_ds["Price_Range"] = np.where(books_ds["Price"]>=60, ">60",
                          np.where(books_ds["Price"]>=50, "50 to 60",
                          np.where(books_ds["Price"]>=40, "40 to 50",
                          np.where(books_ds["Price"]>=30, "30 to 40",
                          np.where(books_ds["Price"]>=20, "20 to 30",
                          np.where(books_ds["Price"]>=10, "10 to 20", "<10"))))))
books_ds["User Rating to Price Ratio_Range"] = np.where(books_ds["User Rating to Price Ratio"]>=2, ">2",
                                               np.where(books_ds["User Rating to Price Ratio"]>=1, "1 to 2",
                                               np.where(books_ds["User Rating to Price Ratio"]>=0.8, "0.8 to 1",
                                               np.where(books_ds["User Rating to Price Ratio"]>=0.6, "0.6 to 0.8",
                                               np.where(books_ds["User Rating to Price Ratio"]>=0.4, "0.4 to 0.6",
                                               np.where(books_ds["User Rating to Price Ratio"]>=0.2, "0.2 to 0.4", "<0.2"))))))


books_ds.to_excel("books_ds_clean.xlsx")
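
The nested `np.where` calls in step #4 could also be expressed with `pd.cut`. The following is a sketch of that alternative, not what the notebook runs: the `prices` series is made-up, and `right=False` is chosen so that each bin includes its lower edge, matching the `>=` comparisons used above.

```python
import numpy as np
import pandas as pd

# Made-up prices, including values that sit exactly on bin edges.
prices = pd.Series([0.5, 10, 19.99, 20, 45, 60, 105])

edges = [-np.inf, 10, 20, 30, 40, 50, 60, np.inf]
labels = ["<10", "10 to 20", "20 to 30", "30 to 40", "40 to 50", "50 to 60", ">60"]

# right=False makes the intervals closed on the left: [10, 20), [20, 30), ...
price_range = pd.cut(prices, bins=edges, labels=labels, right=False)

print(price_range.tolist())
```

One difference from the `np.where` version is that `pd.cut` returns an ordered categorical, so plots that respect category order list the ranges from lowest to highest without an explicit order argument.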

# 6. Data Exploration

# 6.1 Checking Top Reviews Books

In [None]:
#Checking Top Reviews Books

print("Checking top Reviews Books:")
books_ds.sort_values("Reviews", ascending=False).head(20)[["Name", "Author", "User Rating", "Reviews", "Price", "Year", "Genre"]].reset_index()

# 6.2 Checking Top User Rating Books

In [None]:
#Checking Top User Rating Books

print("Checking top User Rating Books:")
books_ds.sort_values("User Rating", ascending=False).head(20)[["Name", "Author", "User Rating", "Reviews", "Price", "Year", "Genre"]].reset_index()

# 6.3 Checking Top User Rating to Price Ratio Books

In [None]:
#Checking Top User Rating to Price Ratio Books

print("Checking top User Rating to Price Ratio Books:")
books_ds.sort_values("User Rating to Price Ratio", ascending=False).head(20)[["Name", "Author", "User Rating", "Reviews", "Price", "Year", "Genre", "User Rating to Price Ratio"]].reset_index()

# 6.4 Checking Book Names by Reviews and Genre Using TreeMap

In [None]:
#Checking Book Names by Reviews and Genre

import matplotlib
font = {'family' : 'sans-serif',
        'weight' : 'bold',
        'size'   : 12}
matplotlib.rc('font', **font)

px.treemap(books_ds, path=["Name"], values="Reviews", color="Genre", title="Book Names by Reviews and Genre").show()

# 6.5 Checking Book Names by User Rating and Genre Using TreeMap

In [None]:
#Checking Book Names by User Rating and Genre

pivot = books_ds.pivot_table(index=["Name", "Genre"], columns=[], values=["User Rating"], aggfunc="mean").reset_index()
px.treemap(pivot, path=["Name"], values="User Rating", color="Genre", title="Book Names by User Rating and Genre").show()

# 6.6 Checking Authors by Reviews and Genre Using TreeMap

In [None]:
#Checking Authors by Reviews and Genre

px.treemap(books_ds, path=["Author"], values="Reviews", color="Genre", title="Authors by Reviews and Genre").show()

# 6.7 Checking Authors by User Rating and Genre Using TreeMap

In [None]:
#Checking Authors by User Rating and Genre

pivot = books_ds.pivot_table(index=["Author", "Genre"], columns=[], values=["User Rating"], aggfunc="mean").reset_index()
px.treemap(pivot, path=["Author"], values="User Rating", color="Genre", title="Authors by User Rating and Genre").show()

# 6.8 Checking Book Names by Reviews and User Rating Using Bubble Chart

In [None]:
#Checking Book Names by Reviews and User Rating

px.scatter(books_ds, x="Year", y="User Rating", size="Reviews", color="Genre", hover_name="Name", size_max=30, title="Book Names by Reviews and User Rating").show()

# 6.9 Checking Book Names by Price and User Rating Using Bubble Chart

In [None]:
#Checking Book Names by Price and User Rating

px.scatter(books_ds, x="Year", y="User Rating", size="Price", color="Genre", hover_name="Name", size_max=30, title="Book Names by Price and User Rating").show()

# 6.10 Checking Dataset Behaviour Over Time Using Line Charts

In [None]:
#Checking Dataset Behaviour Over Time

fig, ax = plt.subplots(1, 3, figsize=(20,5))
fig.suptitle("Non Fiction Books Data Behaviour Over Time", fontsize=25)
sns.lineplot(data=books_ds.query('Genre == "Non Fiction"'), x="Year", y="Reviews", estimator="sum", ax=ax[0])
sns.lineplot(data=books_ds.query('Genre == "Non Fiction"'), x="Year", y="User Rating", ax=ax[1])
sns.lineplot(data=books_ds.query('Genre == "Non Fiction"'), x="Year", y="Price", ax=ax[2])

fig, ax = plt.subplots(1, 3, figsize=(20,5))
fig.suptitle("Fiction Books Data Behaviour Over Time", fontsize=25)
sns.lineplot(data=books_ds.query('Genre == "Fiction"'), x="Year", y="Reviews", estimator="sum", ax=ax[0])
sns.lineplot(data=books_ds.query('Genre == "Fiction"'), x="Year", y="User Rating", ax=ax[1])
sns.lineplot(data=books_ds.query('Genre == "Fiction"'), x="Year", y="Price", ax=ax[2])

fig, ax = plt.subplots(1, 3, figsize=(20,5))
fig.suptitle("All Books Data Behaviour Over Time", fontsize=25)
sns.lineplot(data=books_ds, x="Year", y="Reviews", estimator="sum", ax=ax[0])
sns.lineplot(data=books_ds, x="Year", y="User Rating", ax=ax[1])
sns.lineplot(data=books_ds, x="Year", y="Price", ax=ax[2])

# 6.11 Checking Categorical Variables Bar and Pie Charts

In [None]:
#Plotting Categorical Variables

fig, ax = plt.subplots(1, 2, figsize=(15,5))
books_ds["Genre"].value_counts().plot.bar(color="purple", ax=ax[0])
books_ds["Genre"].value_counts().plot.pie(autopct='%1.1f%%', shadow=True,ax=ax[1])
fig.suptitle("Genre Frequency", fontsize=25)
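# Rotate the tick labels of the bar chart (plt.xticks would target the pie axes, the last one created)
ax[0].tick_params(axis="x", labelrotation=90)
ax[0].tick_params(axis="y", labelrotation=45)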

fig, ax = plt.subplots(1, 2, figsize=(15,5))
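# sort_index() orders the years chronologically instead of by count
books_ds["Year"].value_counts().sort_index().plot.bar(color="purple", ax=ax[0])
books_ds["Year"].value_counts().sort_index().plot.pie(autopct='%1.1f%%',shadow=True,ax=ax[1])
fig.suptitle("Year Frequency", fontsize=25)
ax[0].tick_params(axis="x", labelrotation=90)
ax[0].tick_params(axis="y", labelrotation=45)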

# 6.12 Checking Numerical Variables Histogram, Boxplot and Violinplot

In [None]:
#Plotting Numerical Variables

fig, ax = plt.subplots(1, 3, figsize=(15,5))
fig.suptitle("User Rating Distribution", fontsize=25)
sns.histplot(books_ds["User Rating"], ax=ax[0])
sns.boxplot(x=books_ds["User Rating"], ax=ax[1])
sns.violinplot(x=books_ds["User Rating"], ax=ax[2])

fig, ax = plt.subplots(1, 3, figsize=(15,5))
fig.suptitle("Price Distribution", fontsize=25)
sns.histplot(books_ds["Price"], ax=ax[0])
sns.boxplot(x=books_ds["Price"], ax=ax[1])
sns.violinplot(x=books_ds["Price"], ax=ax[2])

fig, ax = plt.subplots(1, 3, figsize=(15,5))
fig.suptitle("User Rating to Price Ratio Distribution", fontsize=25)
sns.histplot(books_ds["User Rating to Price Ratio"], ax=ax[0])
sns.boxplot(x=books_ds["User Rating to Price Ratio"], ax=ax[1])
sns.violinplot(x=books_ds["User Rating to Price Ratio"], ax=ax[2])

In [None]:
#Alternatively, a Profile Report can be used to see variable statistics and correlations
#(the pandas_profiling package has been renamed ydata_profiling)

# from ydata_profiling import ProfileReport
# profile = ProfileReport(books_ds, title="Top 50 Bestselling Books EDA")
# profile.to_file(output_file="Top 50 Bestselling Books EDA.html")

# 7. Correlations Analysis

In [None]:
#Plotting a count plot using the categorical ranges created from the numerical variables in step 5

sns.set(font_scale=2)

fig, axarr = plt.subplots(1, 1, figsize=(30, 10))
sns.countplot(x="Price_Range", hue = "User Rating_Range", data = books_ds, hue_order = ["<2", "2 to 3", "3 to 4", "4 to 4.5", "4.5 to 4.75", "4.75 to 5"])

#Deleting original categorical columns

books_ds.drop(["Name", "Author", "Genre", "User Rating_Range", "Price_Range", "User Rating to Price Ratio_Range"], axis=1, inplace=True)

#Plotting a Heatmap

sns.set(font_scale=1)

fig, ax = plt.subplots(1, figsize=(10,10))
sns.heatmap(books_ds.corr(), annot=True, fmt=",.2f")
plt.title("Heatmap Correlation", fontsize=20)
plt.tick_params(labelsize=12)
plt.xticks(rotation=90)
plt.yticks(rotation=45)

#Plotting a Pairplot

sns.pairplot(books_ds)

# 8. Conclusions


The dataset contains 550 bestseller entries from 2009 to 2019, with an equal number of books (50) per year.
56% of the entries are Non-Fiction and 44% are Fiction books.
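
Shares and per-year counts like these can be obtained with `value_counts`. The sketch below uses a made-up five-row frame (`toy` and its values are hypothetical), so its numbers are not the dataset's; it only shows the calls involved.

```python
import pandas as pd

# Made-up rows; the real notebook would use books_ds instead.
toy = pd.DataFrame({
    "Year": [2009, 2009, 2010, 2010, 2010],
    "Genre": ["Fiction", "Non Fiction", "Non Fiction", "Non Fiction", "Fiction"],
})

# Share of each genre across all rows, as percentages.
genre_share = toy["Genre"].value_counts(normalize=True) * 100

# Number of rows per year, ordered chronologically.
rows_per_year = toy["Year"].value_counts().sort_index()

print(genre_share.round(1))
print(rows_per_year)
```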

Written Reviews:
Written reviews of Non-Fiction books grew 440% from 2009 to 2019, from 78,682 to 424,774.
Written reviews of Fiction books grew 136% from 2009 to 2019, from 156,824 to 370,143.


User Ratings:
Non-Fiction books' User Ratings grew 2% from 2009 to 2019, from 4.58 to 4.69.
Fiction books' User Ratings grew 5% from 2009 to 2019, from 4.59 to 4.82.

Prices:
Non-Fiction books' average prices decreased 31% from 2009 to 2019, from 15.23 to 10.57.
Fiction books' average prices decreased 40% from 2009 to 2019, from 15.58 to 9.35.
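
The percentages above follow from the usual relative-change formula, (end - start) / start. A small sketch that reproduces a few of them from the reported start and end values (`percent_change` is a helper name introduced here for illustration):

```python
def percent_change(start, end):
    """Relative change from start to end, in percent."""
    return (end - start) / start * 100

# Start and end values as reported in the conclusions above.
print(round(percent_change(78_682, 424_774)))   # Non-Fiction written reviews -> 440
print(round(percent_change(156_824, 370_143)))  # Fiction written reviews -> 136
print(round(percent_change(15.23, 10.57)))      # Non-Fiction average price -> -31
print(round(percent_change(15.58, 9.35)))       # Fiction average price -> -40
```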

As a strategy to gain market share, Amazon could invest more in ads for books with a high User Rating to Price Ratio,
that is, books rated very well by customers at relatively low prices, which suggests high and quick growth potential
in sales volume for these books.

As a strategy to increase revenue, Amazon could invest more in ads for the highest-priced books,
as long as they are also well rated by customers.

A machine learning model that suggests books to customers according to their behavior could also be implemented, but
that is outside the scope of this project.