# EDA - Amazon Top 50 Bestselling Books 2009 - 2019

##### https://www.kaggle.com/sootersaalu/amazon-top-50-bestselling-books-2009-2019

<br>

**This kernel features:**

**[I. Univariate Analysis](#I.-Univariate-Analysis)**

**[II. Bivariate Analysis](#II.-Bivariate-Analysis)**

**[III. Hypotheses and Conclusion](#III.-Hypotheses-and-Conclusion)**

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from matplotlib import pyplot as plt

import pandas as pd
import numpy as np

In [None]:
dataset = pd.read_csv("../input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv")
dataset.head()

In [None]:
dataset_authors = dataset["Author"]
dataset_users_ratings = dataset["User Rating"]
dataset_reviews = dataset["Reviews"]
dataset_prices = dataset["Price"]
dataset_years = dataset["Year"]
dataset_genres = dataset["Genre"]

In [None]:
dataset.shape

# I. Univariate Analysis

## 1. Users Ratings | Reviews

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.bar(dataset_users_ratings.value_counts().keys(), dataset_users_ratings.value_counts().values, width=0.05)
ax1.title.set_text("Users Ratings")
ax1.set_xlabel("ratings")
ax1.set_ylabel("number of books")

ax2.scatter(dataset_reviews.keys(), dataset_reviews.values, s=10)
ax2.title.set_text("Reviews")
ax2.set_xlabel("book index")
ax2.set_ylabel("number of reviews")

fig.show()

In [None]:
print("Minimum rating: " + str(np.amin(dataset_users_ratings)))
print("Maximum rating: " + str(np.amax(dataset_users_ratings)))

In [None]:
print("Minimum number of reviews: " + str(np.amin(dataset_reviews)))
print("Maximum number of reviews: " + str(np.amax(dataset_reviews)))
print("Rows with more than 40000 reviews: " + str(dataset_reviews.loc[dataset_reviews > 40000].shape[0]))

### ☆ Graphical Analysis ☆

#### Users Ratings

- Minimum value is 3.30.
- Maximum value is 4.90.
- Most values are between 4.60 and 4.75.

#### Reviews

- Minimum value is 37.
- Maximum value is 87841.
- 16 rows have more than 40000 reviews.
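
The minimum, maximum and "above a threshold" figures listed above can be gathered in one helper. The sketch below is a minimal example, assuming a DataFrame with the same `User Rating` and `Reviews` columns as the dataset; the helper name `summarize_column` and the `sample` rows are made up for illustration and are not taken from the real data.

```python
import pandas as pd

def summarize_column(series: pd.Series, threshold: float) -> dict:
    """Return the min, the max and the number of values strictly above threshold."""
    return {
        "min": series.min(),
        "max": series.max(),
        "above_threshold": int((series > threshold).sum()),
    }

# Hypothetical rows, only here to make the example runnable.
sample = pd.DataFrame({
    "User Rating": [4.7, 4.6, 3.3, 4.9],
    "Reviews": [17350, 2052, 37, 87841],
})

reviews_summary = summarize_column(sample["Reviews"], threshold=40000)
ratings_summary = summarize_column(sample["User Rating"], threshold=4.6)
print(reviews_summary)
print(ratings_summary)
```

On the real data, the same call would be `summarize_column(dataset["Reviews"], threshold=40000)`.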

## 2. Prices | Years

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.scatter(dataset_prices.keys(), dataset_prices.values, s=10)
ax1.title.set_text("Prices")
ax1.set_xlabel("book index")
ax1.set_ylabel("prices")

# pie() takes the wedge sizes first (rows per year) and the years as labels
year_counts = dataset_years.value_counts().sort_index()
ax2.pie(year_counts.values, labels=year_counts.index)
ax2.title.set_text("Years")

fig.show()

In [None]:
print("Minimum price: " + str(np.amin(dataset_prices)))
print("Maximum price: " + str(np.amax(dataset_prices)))
print("Rows with a price above 40: " + str(dataset_prices.loc[dataset_prices > 40].shape[0]))

### ☆ Graphical Analysis ☆

#### Prices

- Minimum value is 0.
- Maximum value is 105.
- 17 rows have a price higher than 40.

#### Years

- 11 distinct years, from 2009 to 2019.
- There are exactly 50 rows in each year.
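
The "same number of rows in each year" observation can be checked directly rather than read off the chart. This is a small sketch, assuming a `Year` column as in the dataset; `rows_per_year` and the `sample` frame are hypothetical names and values used only for illustration.

```python
import pandas as pd

def rows_per_year(df: pd.DataFrame) -> pd.Series:
    """Count the rows for each year, in chronological order."""
    return df["Year"].value_counts().sort_index()

# Hypothetical rows, two per year.
sample = pd.DataFrame({"Year": [2009, 2009, 2010, 2010, 2011, 2011]})

counts = rows_per_year(sample)
# The years are balanced when every year has the same row count.
is_balanced = counts.nunique() == 1
print(counts.to_dict(), is_balanced)
```

These counts are also the natural wedge sizes for a pie chart, with the years (the index of `counts`) as labels.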

## 3. Authors | Genres

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.barh(dataset_authors.value_counts().head(10).keys(), dataset_authors.value_counts().head(10).values)
ax1.title.set_text("Authors")
ax1.set_xlabel("number of books")

ax2.bar(dataset_genres.value_counts().keys(), dataset_genres.value_counts().values)
ax2.title.set_text("Genres")
ax2.set_xlabel("genres")
ax2.set_ylabel("number of books")

fig.show()

### ☆ Graphical Analysis ☆

#### Authors

- The chart shows the 10 authors who appear most often.
- The most frequent is "Jeff Kinney", with 12 entries.

#### Genres

- There are 2 values: "Non Fiction" and "Fiction".
- We can see that "Non Fiction" is in the majority.

# II. Bivariate Analysis

## 1. Authors ∈ Prices | Genres ∈ Prices

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# reshape(-1, 1) infers the number of rows instead of hard-coding the author count
dataset_authors_prices = dataset_authors.value_counts().keys().to_numpy().reshape(-1, 1)

def calculate_author_price(x):
    return np.array([x[0], dataset.loc[dataset['Author'] == x[0]]['Price'].mean()])

dataset_authors_prices = pd.DataFrame(np.apply_along_axis(calculate_author_price, arr=dataset_authors_prices, axis=1))
# np.float was removed in NumPy 1.24; the builtin float behaves the same here
dataset_authors_prices[1] = dataset_authors_prices[1].astype(float)
dataset_authors_prices = dataset_authors_prices.sort_values(by=[1], ascending=False)

ax1.barh(dataset_authors_prices.head(10)[0], dataset_authors_prices.head(10)[1])
ax1.title.set_text("Authors ∈ Prices")
ax1.set_xlabel("prices")
ax1.set_ylabel("authors")

# reshape(-1, 1) infers the number of rows instead of hard-coding the genre count
dataset_genre_prices = dataset_genres.value_counts().keys().to_numpy().reshape(-1, 1)

def calculate_genre_price(x):
    return np.array([x[0], dataset.loc[dataset['Genre'] == x[0]]['Price'].mean()])

dataset_genre_prices = pd.DataFrame(np.apply_along_axis(calculate_genre_price, arr=dataset_genre_prices, axis=1))
# np.float was removed in NumPy 1.24; the builtin float behaves the same here
dataset_genre_prices[1] = dataset_genre_prices[1].astype(float)
dataset_genre_prices = dataset_genre_prices.sort_values(by=[1], ascending=False)

ax2.bar(dataset_genre_prices[0], dataset_genre_prices[1])
ax2.title.set_text("Genres ∈ Prices")
ax2.set_xlabel("genres")
ax2.set_ylabel("prices")

fig.show()

In [None]:
dataset_authors_prices.head(1)

### ☆ Graphical Analysis ☆

#### Authors ∈ Prices

- The author with the highest average book price is "American Psychological Association".

#### Genres ∈ Prices

- "Non Fiction" is the more expensive genre on average.
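
The per-author and per-genre averages above can also be computed with a pandas `groupby`, which avoids fixed array shapes and keeps author names out of NumPy string arrays. This is only a sketch under the assumption that the columns are named `Author`, `Genre` and `Price` as in the dataset; `mean_price_by` and the `sample` rows are hypothetical.

```python
import pandas as pd

def mean_price_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Average price for each distinct value of `column`, highest first."""
    return df.groupby(column)["Price"].mean().sort_values(ascending=False)

# Hypothetical rows, only here to make the example runnable.
sample = pd.DataFrame({
    "Author": ["A", "A", "B", "C"],
    "Genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction"],
    "Price": [10, 20, 40, 8],
})

by_author = mean_price_by(sample, "Author")
by_genre = mean_price_by(sample, "Genre")
print(by_author.head(10))  # the ten most expensive authors on average
print(by_genre)
```

The result is a Series indexed by author or genre, so it can be passed to `barh` or `bar` through its `index` and `values`.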

## 2. Users Ratings ∈ Prices | Reviews ∈ Prices | Years ∈ Prices

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 4))

# reshape(-1, 1) infers the number of rows instead of hard-coding the count of distinct ratings
dataset_users_ratings_prices = dataset_users_ratings.value_counts().keys().to_numpy().reshape(-1, 1)

def calculate_user_rating_price(x):
    return np.array([x[0], dataset.loc[dataset['User Rating'] == x[0]]['Price'].mean()])

dataset_users_ratings_prices = pd.DataFrame(np.apply_along_axis(calculate_user_rating_price, arr=dataset_users_ratings_prices, axis=1))
# np.float was removed in NumPy 1.24; the builtin float behaves the same here
dataset_users_ratings_prices[1] = dataset_users_ratings_prices[1].astype(float)
dataset_users_ratings_prices = dataset_users_ratings_prices.sort_values(by=[1], ascending=False)

ax1.bar(dataset_users_ratings_prices[0], dataset_users_ratings_prices[1], width=0.05)
ax1.title.set_text("Users Ratings ∈ Prices")
ax1.set_xlabel("users ratings")
ax1.set_ylabel("prices")

# reshape(-1, 1) infers the number of rows instead of hard-coding the count of distinct review values
dataset_reviews_prices = dataset_reviews.value_counts().keys().to_numpy().reshape(-1, 1)

def calculate_reviews_price(x):
    return np.array([x[0], dataset.loc[dataset['Reviews'] == x[0]]['Price'].mean()])

dataset_reviews_prices = pd.DataFrame(np.apply_along_axis(calculate_reviews_price, arr=dataset_reviews_prices, axis=1))
# np.float was removed in NumPy 1.24; the builtin float behaves the same here
dataset_reviews_prices[1] = dataset_reviews_prices[1].astype(float)
dataset_reviews_prices = dataset_reviews_prices.sort_values(by=[0], ascending=False)

ax2.scatter(dataset_reviews_prices[1], dataset_reviews_prices[0], s=10)
ax2.title.set_text("Reviews ∈ Prices")
ax2.set_xlabel("prices")
ax2.set_ylabel("reviews")

# reshape(-1, 1) infers the number of rows instead of hard-coding the number of years
dataset_years_prices = dataset_years.value_counts().keys().to_numpy().reshape(-1, 1)

def calculate_year_price(x):
    return np.array([x[0], dataset.loc[dataset['Year'] == x[0]]['Price'].mean()])

dataset_years_prices = pd.DataFrame(np.apply_along_axis(calculate_year_price, arr=dataset_years_prices, axis=1))
# np.float was removed in NumPy 1.24; the builtin float behaves the same here
dataset_years_prices[1] = dataset_years_prices[1].astype(float)
dataset_years_prices = dataset_years_prices.sort_values(by=[0], ascending=False)

ax3.scatter(dataset_years_prices[0], dataset_years_prices[1], s=20)
ax3.title.set_text("Years ∈ Prices")
ax3.set_xlabel("years")
ax3.set_ylabel("prices")

fig.show()

In [None]:
dataset_users_ratings_prices.loc[dataset_users_ratings_prices[1] > 17.5]

In [None]:
dataset_reviews_prices.loc[dataset_reviews_prices[1] > 45]

### ☆ Graphical Analysis ☆

#### Users Ratings ∈ Prices
   
- Rating 4.5: mean price of \$20.97.
- Rating 3.6: mean price of \$19.00.
- Rating 3.9: mean price of \$17.67.

#### Reviews ∈ Prices 

- Rows with 13471 reviews: mean price of \$52.00.
- Rows with 8580 reviews: mean price of \$46.00.
- Rows with 6679 reviews: mean price of \$105.00.
        
#### Years ∈ Prices

- Book prices have decreased over time.
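
One way to put a number on this downward trend is the slope of a straight line fitted to the yearly mean prices. The sketch below uses `np.polyfit` for that; it is an illustration under the assumption that the columns are named `Year` and `Price`, and the helper `yearly_price_slope` and the `sample` rows are made up, not results from the dataset.

```python
import numpy as np
import pandas as pd

def yearly_price_slope(df: pd.DataFrame) -> float:
    """Least-squares slope of the mean price against the year (price units per year)."""
    yearly = df.groupby("Year")["Price"].mean()
    years = yearly.index.to_numpy(dtype=float)
    # Centering the years keeps the fit well conditioned; the slope is unchanged.
    slope, _ = np.polyfit(years - years.mean(), yearly.to_numpy(dtype=float), deg=1)
    return float(slope)

# Hypothetical rows: yearly means are 15, 12 and 9.
sample = pd.DataFrame({
    "Year": [2009, 2009, 2010, 2010, 2011, 2011],
    "Price": [16, 14, 13, 11, 10, 8],
})

slope = yearly_price_slope(sample)
print(f"mean price changes by {slope:.2f} per year")
```

A negative slope corresponds to prices falling over the years; its size says by how much per year on average.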

## 3. Users Ratings ∈ Reviews | Years ∈ Reviews

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# reshape(-1, 1) infers the number of rows instead of hard-coding the count of distinct ratings
dataset_users_ratings_reviews = dataset_users_ratings.value_counts().keys().to_numpy().reshape(-1, 1)

def calculate_user_rating_reviews(x):
    return np.array([x[0], dataset.loc[dataset['User Rating'] == x[0]]['Reviews'].mean()])

dataset_users_ratings_reviews = pd.DataFrame(np.apply_along_axis(calculate_user_rating_reviews, arr=dataset_users_ratings_reviews, axis=1))
# np.float was removed in NumPy 1.24; the builtin float behaves the same here
dataset_users_ratings_reviews[1] = dataset_users_ratings_reviews[1].astype(float)
dataset_users_ratings_reviews = dataset_users_ratings_reviews.sort_values(by=[1], ascending=False)

ax1.bar(dataset_users_ratings_reviews[0], dataset_users_ratings_reviews[1], width=0.05)
ax1.title.set_text("Users Ratings ∈ Reviews")
ax1.set_xlabel("users ratings")
ax1.set_ylabel("reviews")

# reshape(-1, 1) infers the number of rows instead of hard-coding the number of years
dataset_years_reviews = dataset_years.value_counts().keys().to_numpy().reshape(-1, 1)

def calculate_years_reviews(x):
    return np.array([x[0], dataset.loc[dataset['Year'] == x[0]]['Reviews'].mean()])

dataset_years_reviews = pd.DataFrame(np.apply_along_axis(calculate_years_reviews, arr=dataset_years_reviews, axis=1))
# np.float was removed in NumPy 1.24; the builtin float behaves the same here
dataset_years_reviews[1] = dataset_years_reviews[1].astype(float)
dataset_years_reviews = dataset_years_reviews.sort_values(by=[0], ascending=False)

ax2.scatter(dataset_years_reviews[0], dataset_years_reviews[1], s=20)
ax2.title.set_text("Years ∈ Reviews")
ax2.set_xlabel("years")
ax2.set_ylabel("reviews")

fig.show()

In [None]:
dataset_users_ratings_reviews.head(1)

### ☆ Graphical Analysis ☆

#### Users Ratings ∈ Reviews

- The highest mean number of reviews is at rating 3.8, with 47265 reviews.

#### Years ∈ Reviews

- The mean number of reviews increases over the years.

# III. Hypotheses and Conclusion

### 1. Hypotheses

- The bestseller lists cover the years 2009 to 2019.

- There is an equal number of books each year.

- There can be several books by one author.

- Books are classified into two genres: "Fiction" and "Non Fiction".

- Book ratings range from 3.3 to 4.9 and the number of reviews varies from 37 to 87841.

- Book prices range from \$0.00 to \$105.00.

### 2. Conclusion
    
#### We can observe:

* As the years go by, the price of books decreases.
* Conversely, the number of reviews increases over the years.

#### So there are links between: 

* Years and Prices
* Years and Reviews
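
A Pearson correlation matrix is one possible way to quantify such links between the numeric variables. The following is a minimal sketch, assuming columns named `Year`, `Price` and `Reviews`; the helper `pairwise_correlation` and the `sample` values are hypothetical and do not reproduce the dataset.

```python
import pandas as pd

def pairwise_correlation(df: pd.DataFrame, columns=("Year", "Price", "Reviews")) -> pd.DataFrame:
    """Pearson correlation between each pair of the given numeric columns."""
    return df[list(columns)].corr(method="pearson")

# Hypothetical rows: prices fall and reviews rise over the years.
sample = pd.DataFrame({
    "Year": [2009, 2010, 2011, 2012],
    "Price": [20, 15, 12, 5],
    "Reviews": [100, 250, 300, 400],
})

corr = pairwise_correlation(sample)
print(corr.round(2))
```

Values close to -1 or 1 indicate a strong linear relationship and values near 0 a weak one; the coefficient only measures linear association and says nothing about cause.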

### Beyond these, the analyses above do not reveal other links between the variables.

## Please upvote this kernel if you like it. ;)