<h1 style="color:skyblue;font-size:40px;">About The Notebook</h1>

<p>This notebook performs exploratory data analysis (EDA) on the Amazon Top 50 Bestselling Books dataset.</p>

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

In [None]:
books = pd.read_csv("/kaggle/input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv")
books.head()

In [None]:
books.shape

In [None]:
#Columns Information
books.info()

In [None]:
#Numerical Columns Descriptive Analysis
books.describe(exclude='O')

In [None]:
#Non-Numeric Columns Descriptive Analysis
books.describe(include='O')

In [None]:
#Percentage Of NaN In Each Column
books.isna().sum()/books.shape[0]*100
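
The same percentages can also be obtained with `isna().mean()`, since the mean of a boolean column is the fraction of `True` values. Here is a minimal sketch on a made-up frame (`toy` and its columns are hypothetical and are not part of the bestsellers data):

```python
import pandas as pd

# Hypothetical toy frame, used only to illustrate the calculation
toy = pd.DataFrame({"a": [1.0, None, 3.0, 4.0], "b": ["x", "y", None, None]})

# Count of missing values divided by the number of rows
pct_sum = toy.isna().sum() / toy.shape[0] * 100
# Equivalent form: mean of the boolean missing-value mask
pct_mean = toy.isna().mean() * 100

print(pct_mean)
```

On the real data, the second form would read `books.isna().mean()*100`.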

<h1 style="color:gray;font-size:40px;">Which Kind Of Books Did People Like The Most?</h1>

In [None]:
books.Genre.value_counts().plot.pie(radius=2.5,autopct="%.2f",center=(10,10),textprops={'fontsize':'x-large'})
plt.ylabel(None)
plt.title("Which Genre Do People Like The Most?",fontdict={'fontsize':20,'color':'blue'},y=1.8)
plt.show()

<h1 style="color:skyblue;font-size:40px;">Did People's Choices Change With Time?</h1>

In [None]:
# One bar chart per year, laid out on a 3-column grid
year_list = sorted(books.Year.unique())
n_cols = 3
n_rows = int(np.ceil(len(year_list)/n_cols))
fig,ax = plt.subplots(n_rows,n_cols,sharex=False,sharey=False,figsize=(15,15),squeeze=False)
for k,axis in enumerate(ax.flat):
    if k < len(year_list):
        # sort_index keeps the genre order (and therefore the bar colors) the same in every panel
        books[books.Year == year_list[k]].Genre.value_counts().sort_index().plot.bar(color=['blue','orange'],ax=axis)
        axis.set_title(f"Year {year_list[k]}")
    else:
        axis.set_visible(False)  # hide unused panels
plt.tight_layout()
plt.show()
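
The bar charts can be complemented by a table of each genre's share per year. Here is a minimal sketch using `pd.crosstab` with `normalize='index'`. The `toy` frame below is made up for illustration. On the real data one would pass `books.Year` and `books.Genre` instead:

```python
import pandas as pd

# Hypothetical toy data, not taken from the bestsellers dataset
toy = pd.DataFrame({
    "Year": [2009, 2009, 2009, 2009, 2010, 2010, 2010, 2010],
    "Genre": ["Fiction", "Fiction", "Fiction", "Non Fiction",
              "Fiction", "Non Fiction", "Non Fiction", "Non Fiction"],
})

# normalize="index" divides each row by its total, so every row sums to 1
share = pd.crosstab(toy.Year, toy.Genre, normalize="index")
print(share)
```

Each row then reads as the fraction of that year's list belonging to each genre, which makes year-to-year shifts easier to compare than raw counts.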

In [None]:
# We can also check the relationship between Year and Genre with a chi-square test of independence

'''This test function Returns
-------
chi2 : float
    The test statistic.
p : float
    The p-value of the test
dof : int
    Degrees of freedom
expected : ndarray, same shape as `observed`
    The expected frequencies, based on the marginal sums of the table.'''

#H0 : Year & Genre are independent.
#H1 : Year & Genre are dependent.

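level_of_confidence = 0.95
chi_sq = stats.chi2_contingency(pd.crosstab(books.Year,books.Genre))
# Compare the test statistic (chi_sq[0]) with the critical value at chi_sq[2] degrees of freedom
if chi_sq[0] < stats.chi2.ppf(level_of_confidence,chi_sq[2]):
    print("Fail to reject H0: no evidence that Year and Genre are dependent")
else:
    print("Reject H0: Year and Genre are dependent")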
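
Comparing the statistic with a critical value is equivalent to comparing the p-value with the significance level. Here is a minimal sketch of both decisions side by side. The counts in `observed` are hypothetical and stand in for `pd.crosstab(books.Year,books.Genre)`, and `alpha` is assumed to be 0.05 to match the 0.95 confidence level:

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table: rows are years, columns are genres
observed = np.array([[30, 20],
                     [28, 22],
                     [25, 25]])

chi2, p, dof, expected = stats.chi2_contingency(observed)

alpha = 0.05                               # assumed significance level
critical = stats.chi2.ppf(1 - alpha, dof)  # critical value of the chi-square distribution

reject_by_stat = chi2 > critical  # decision from the test statistic
reject_by_p = p < alpha           # decision from the p-value

print(f"chi2={chi2:.3f}, dof={dof}, p={p:.3f}")
print("Reject H0" if reject_by_p else "Fail to reject H0")
```

Both rules lead to the same decision. Reporting the p-value has the advantage of showing how strong the evidence is, rather than only which side of the threshold it falls on.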

<h1 style="color:orange;font-size:40px;">Books That Made The Top 50 Multiple Times</h1>

In [None]:
# Titles that appear more than once (value_counts already sorts in descending order)
repeat_counts = books.loc[books.duplicated(['Name'],keep=False),'Name'].value_counts()
plt.figure(figsize=(25,8))
repeat_counts.plot.bar()
plt.xticks(rotation=90)
# Annotate each bar with its count
for i,j in enumerate(repeat_counts.values):
    plt.text(i-0.2,j+0.1,str(j))
plt.title("Books That Made The Best-Seller List Multiple Times (Fiction And Non-Fiction)",fontdict={'color':'blue','fontsize':20})
plt.show()

<h1 style="color:skyblue;font-size:40px;">Authors Who Top The Chart In The Fiction Category</h1>

In [None]:
# Fiction authors who appear more than once in the list
fiction = books[books.Genre == 'Fiction']
author_counts = fiction.loc[fiction.duplicated(['Author'],keep=False),'Author'].value_counts()
plt.figure(figsize=(25,8))
author_counts.plot.bar()
plt.xticks(rotation=90)
# Annotate each bar with its count
for i,j in enumerate(author_counts.values):
    plt.text(i-0.2,j+0.1,str(j))
plt.title("Authors With Multiple Best-Seller Appearances In The Fiction Category",fontdict={'color':'blue','fontsize':20})
plt.show()

<h1 style="color:skyblue;font-size:40px;">Authors Who Top The Chart In The Non-Fiction Category</h1>

In [None]:
# Non-fiction authors who appear more than once in the list
non_fiction = books[books.Genre != 'Fiction']
author_counts = non_fiction.loc[non_fiction.duplicated(['Author'],keep=False),'Author'].value_counts()
plt.figure(figsize=(25,8))
author_counts.plot.bar()
plt.xticks(rotation=90)
# Annotate each bar with its count
for i,j in enumerate(author_counts.values):
    plt.text(i-0.2,j+0.1,str(j))
plt.title("Authors With Multiple Best-Seller Appearances In The Non-Fiction Category",fontdict={'color':'blue','fontsize':20})
plt.show()

# What Can't We Do With This Data?

<p style="color:gold;">We can't rank the best-selling books because we don't have sales data available for each book. The Amazon Best Sellers Rank (BSR) is calculated from a product's sales and is updated hourly. The rank assignment considers both the product's current sales and its sales history. A product ranked #1 has recently sold more than any other product in that category on that store. Because it sells better than all the other products in that category, it is ranked above them.</p>

<p style="color:black;font-size:20px;">The Amazon BSR calculation is not based on product reviews or ratings.</p>

**You can read more about Amazon BSR rules on this [link](https://www.sellerapp.com/amazon-best-seller-rank.html)**

# Anything More We Can Add?

**If you have any suggestions for this notebook, please tell me in the comment section. I will try to add every feasible suggestion to make the notebook more useful.**

**If you liked this notebook then please upvote it.**

<h1 style="color:green;font-size:100px;text-align:center;">Thank You</h1>