# Amazon Top 50 Bestselling Books 2009 - 2019
Kaggle : https://www.kaggle.com/sootersaalu/amazon-top-50-bestselling-books-2009-2019/metadata

1. Data cleansing
2. EDA

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('/kaggle/input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv', engine='python')
print(df.shape)
df.head()

## 1. Data cleansing

### Cleansing "Author" column
- The same author appears under different spellings (e.g., "J.K. Rowling" and "J. K. Rowling"), so all spaces are removed to merge them.

In [None]:
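# Remove every space so that spelling variants of one author collapse into a single key
df["Author"] = df["Author"].str.replace(" ", "", regex=False)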
df["Author"]
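
As a quick sanity check of this normalization, the sketch below applies the same space removal to a small hypothetical Series (the toy names are illustrative and not read from the dataset) and compares the number of distinct authors before and after.

```python
import pandas as pd

# Hypothetical toy data: one author spelled two ways, plus a second author.
authors = pd.Series(["J.K. Rowling", "J. K. Rowling", "Dav Pilkey"])

# Same idea as the cell above: drop every space (literal match, not a regex).
normalized = authors.str.replace(" ", "", regex=False)

print(authors.nunique())     # distinct spellings before normalization
print(normalized.nunique())  # distinct spellings after normalization
print(normalized.tolist())
```

The price of this simple approach is that the names lose their spaces everywhere (e.g., "DavPilkey"), which is why later cells look authors up without spaces.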

### Delete duplicated "Name"
- The "Name" column contains duplicates, because a book can appear in the list in more than one year.
- For example, "Wonder" appears in 5 different years.

In [None]:
df.loc[df["Name"] == "Wonder"]

In [None]:
# Drop duplicated titles, keeping only the first row of each
df = df.drop_duplicates(subset=["Name"])
df.loc[df["Name"] == "Wonder"]
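
To make the effect of `drop_duplicates` explicit, here is a minimal sketch on a hypothetical toy frame (the rows are made up for illustration). By default pandas keeps the first occurrence of each title, so the surviving `Year` depends on the row order.

```python
import pandas as pd

# Hypothetical toy data: "Wonder" is listed in two different years.
toy = pd.DataFrame({
    "Name": ["Wonder", "Wonder", "Other Book"],
    "Year": [2013, 2014, 2013],
})

# keep="first" is the default: only the first row of each title survives.
deduped = toy.drop_duplicates(subset=["Name"])
print(deduped)
```

One consequence worth keeping in mind: after this step, each title counts only once in any per-year statistic, in the year of its first surviving row.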

## 2. EDA

We will create a new feature, "Weighted Rating".  
- When a book has enough reviews, most of the weight goes to the book's own rating (R).  
- When a book does not have enough reviews, most of the weight goes to the mean rating (C) rather than the book's own rating (R).


Weighted rating : 
$$
\text{Weighted Rating (WR)} = \frac{v}{v+m}R + \frac{m}{v+m}C
$$
where   
- R = average rating of the book  
- C = the mean rating across the whole data  
- v = number of reviews for the book  
- m = minimum number of reviews required to be listed in the best seller list (in the code, the smallest review count found in the data)  
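
The formula can be checked by hand with a small worked example. The sketch below uses a hypothetical helper `weighted_rating` and made-up values for C and m (they are not taken from the dataset) to show how the same rating of 4.9 is pulled toward C when there are few reviews and stays near R when there are many.

```python
def weighted_rating(R, v, C, m):
    """Weighted rating for a single book (hypothetical scalar helper)."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Assumed illustrative values, not computed from the real data.
C = 4.6   # mean rating across all books
m = 1000  # minimum number of reviews

few_reviews = weighted_rating(R=4.9, v=100, C=C, m=m)
many_reviews = weighted_rating(R=4.9, v=100_000, C=C, m=m)

print(round(few_reviews, 3))   # close to C
print(round(many_reviews, 3))  # close to R
```

Since the two weights v/(v+m) and m/(v+m) sum to 1, the weighted rating always lies between R and C.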



In [None]:
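def get_weighted_rate(dataframe):
    """Return the weighted rating of every book in `dataframe` as a NumPy array."""
    R = np.array(dataframe["User Rating"])  # rating of each book
    C = np.mean(R)                          # mean rating across the given data
    v = np.array(dataframe["Reviews"])      # number of reviews of each book
    m = np.min(v)                           # smallest review count in the given data

    # Books with many reviews stay close to R; books with few reviews are pulled toward C.
    return (v*R)/(v+m) + (m*C)/(v+m)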

To take genre into account, split the data by genre.

In [None]:
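# .copy() makes each subset an independent DataFrame, so adding a column later
# does not trigger pandas' SettingWithCopyWarning.
fiction_books = df.loc[df["Genre"] == "Fiction"].copy()
non_fiction_books = df.loc[df["Genre"] == "Non Fiction"].copy()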

In [None]:
df["Weighted Rating"] = get_weighted_rate(df)
fiction_books["Weighted Rating"] = get_weighted_rate(fiction_books)
non_fiction_books["Weighted Rating"] = get_weighted_rate(non_fiction_books)

### Distribution of ratings by genre

In [None]:
plt.boxplot(
    x=[fiction_books["User Rating"], fiction_books["Weighted Rating"], non_fiction_books["User Rating"], non_fiction_books["Weighted Rating"]],
    labels=["Fiction", "(Weighted)Fiction", "Non fiction", "(Weighted)Non fiction"]
);

plt.title("User rate by genre")
plt.xlabel("Genre")
plt.ylabel("Rate")
plt.tight_layout()

The standard deviation of the number of reviews is larger for fiction than for non-fiction.

In [None]:
print(fiction_books["Reviews"].std())
print(non_fiction_books["Reviews"].std())

### Average rating of each genre by year

In [None]:
fiction_rate_by_year = fiction_books.groupby("Year")["User Rating"].mean()
fiction_weighted_rate_by_year = fiction_books.groupby("Year")["Weighted Rating"].mean()

non_fiction_rate_by_year = non_fiction_books.groupby("Year")["User Rating"].mean()
non_fiction_weighted_rate_by_year = non_fiction_books.groupby("Year")["Weighted Rating"].mean()

df_rate_by_year = df.groupby("Year")["User Rating"].mean()
df_weighted_rate_by_year = df.groupby("Year")["Weighted Rating"].mean()

In [None]:
plt.figure(figsize=(10, 10))
plt.plot(fiction_rate_by_year, label="Fiction")
plt.plot(fiction_weighted_rate_by_year, label="Fiction(weight)")

plt.plot(non_fiction_rate_by_year, label="Non fiction")
plt.plot(non_fiction_weighted_rate_by_year, label="Non fiction(weight)")

plt.plot(df_rate_by_year, label="total")
plt.plot(df_weighted_rate_by_year, label="total(weight)")

plt.title("Average rate by year")
plt.xlabel("Year")
plt.ylabel("Average rate")
plt.legend(loc="upper left")


### Highly rated authors

- Only consider authors who have at least 2 books in the data

In [None]:
# Filter out authors who only published 1 book.
books_per_fiction_author = fiction_books.groupby("Author").count()
books_per_non_fiction_author = non_fiction_books.groupby("Author").count()

at_least_two_fiction = books_per_fiction_author.loc[books_per_fiction_author["Name"] > 1]
at_least_two_non_fiction = books_per_non_fiction_author.loc[books_per_non_fiction_author["Name"] > 1]

authors_at_least_two_fiction_books = fiction_books.loc[fiction_books["Author"].isin(at_least_two_fiction.index)]
authors_at_least_two_non_fiction_books = non_fiction_books.loc[non_fiction_books["Author"].isin(at_least_two_non_fiction.index)]
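
The same "at least two books" filter can also be written in one step with `groupby(...).transform("count")`. The sketch below is an alternative on hypothetical toy data, not the approach used in this notebook; the column names match the dataset but the rows are made up.

```python
import pandas as pd

# Hypothetical toy data: author X has two books, Y and Z have one each.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Author": ["X", "X", "Y", "Z"],
})

# For every row, count how many books its author has in the frame.
books_per_author = toy.groupby("Author")["Name"].transform("count")

# Keep only the rows whose author has at least two books.
at_least_two = toy.loc[books_per_author >= 2]
print(at_least_two)
```

Because `transform` returns a result aligned with the original rows, the intermediate count table and the `isin` lookup are not needed.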

In [None]:
high_rate_fiction_author = authors_at_least_two_fiction_books.groupby("Author")[["User Rating", "Weighted Rating", "Reviews"]].mean()
high_rate_non_fiction_author = authors_at_least_two_non_fiction_books.groupby("Author")[["User Rating", "Weighted Rating", "Reviews"]].mean()

high_rate_fiction_author = high_rate_fiction_author.sort_values(by=["Weighted Rating"])
high_rate_non_fiction_author = high_rate_non_fiction_author.sort_values(by=["Weighted Rating"])

In [None]:
plt.figure(figsize=(10, 10))
plt.scatter(high_rate_fiction_author.index, high_rate_fiction_author["User Rating"], label="Rate")
plt.scatter(high_rate_fiction_author.index, high_rate_fiction_author["Weighted Rating"], label="Weighted rate")

plt.xticks(rotation=45);
plt.legend(loc="upper left")

In fiction, highly rated books tend to have fewer reviews than the others.
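
One way to put a number on such a tendency is a rank correlation between rating and review count. The sketch below computes a Spearman correlation as the Pearson correlation of the ranks, on hypothetical toy values; it only illustrates the check and does not report a result for the real data.

```python
import pandas as pd

# Hypothetical toy data: higher ratings paired with fewer reviews.
toy = pd.DataFrame({
    "User Rating": [4.9, 4.8, 4.6, 4.4],
    "Reviews": [1000, 5000, 20000, 40000],
})

# Spearman correlation = Pearson correlation computed on the ranks.
rho = toy["User Rating"].rank().corr(toy["Reviews"].rank())
print(rho)  # a negative value means higher ratings go with fewer reviews
```

On the real data, the same two lines could be applied to `high_rate_fiction_author`; a value near zero would mean the tendency seen in the scatter plot is weak.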


In [None]:
high_rate_fiction_author.loc[high_rate_fiction_author.index.isin(["DavPilkey", "DanBrown"])]

In [None]:
plt.figure(figsize=(10, 10))
plt.scatter(high_rate_non_fiction_author.index, high_rate_non_fiction_author["User Rating"], label="Rate")
plt.scatter(high_rate_non_fiction_author.index, high_rate_non_fiction_author["Weighted Rating"], label="Weighted rate")

plt.xticks(rotation=45);
plt.legend(loc="upper left")

In [None]:
high_rate_non_fiction_author.loc[high_rate_non_fiction_author.index.isin(["TheCollegeBoard", "MarkR.Levin"])]