# Amazon Top 50 EDA

For a quick synopsis, check out the [Summary](#Summary) section which includes all the main insights.

This notebook is a brief EDA on the [Top 50 Amazon best-selling books dataset](https://www.kaggle.com/sootersaalu/amazon-top-50-bestselling-books-2009-2019).

This can be considered a work in progress, so any feedback or criticism is welcome; I'd like to improve for future notebooks!

<a id="Top"></a>

# Table of Contents

[Summary](#Summary)

[Dataset Overview](#Dataset_Overview)

[Title](#Title)

[Author](#Author)

[User Rating](#User_Rating)

[Reviews](#Reviews)

[Price](#Price)

[Year](#Year)

[Genre](#Genre)

<a id="Summary"></a>

# Summary

Check for any immediate correlations between variables

In [None]:
# Correlate only the numeric columns (Title, Author and Genre hold strings)
corr_matrix = df.select_dtypes(include='number').corr()

sns.heatmap(corr_matrix);
plt.xticks(rotation=45);

In [None]:
plt.figure(figsize=(10,6))

# x is a standalone Series of per-title counts, so no data= argument is needed
ax = sns.countplot(x=df.Title.value_counts(), palette="Set2");

for p in ax.patches:
    ax.annotate(format(p.get_height(), 'd'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                   textcoords = 'offset points')

ax.set(xlabel="Appearances", ylabel="Count");
ax.set_title("Times a Title Shows up on Top 50 Charts (2009-2019)");
plt.show();

In [None]:
plt.figure(figsize=(10,6))

ax = sns.countplot(data=df_repeat, x="Top 50 Appearances", palette="Set2");

for p in ax.patches:
    ax.annotate('{:.1%}'.format(p.get_height()/len(df_repeat)), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                   textcoords = 'offset points')

ax.set(xlabel="Appearances", ylabel="Count");
ax.set_title("Repeat Top 50 Appearances");
plt.show();

In [None]:
plt.figure(figsize=(10,6))
bin_width=.05
ticks = list(df['User Rating'].unique());
ax = sns.histplot(df, x="User Rating", color='royalblue', kde=True, binwidth=bin_width);
ax.set(xlabel="User Rating", ylabel="Count");
ax.set_title("User Rating Distribution");
plt.xticks(ticks);
plt.show();

In [None]:
plt.figure(figsize=(10,6))
ax = sns.histplot(df, x="Reviews", color='royalblue', kde=True);
ax.set(xlabel="Reviews", ylabel="Count");
ax.set_title("Reviews Distribution");
plt.show();

In [None]:
# DataFrame.plot opens its own figure, so the size is passed here directly
ax = df_rev.plot(kind='scatter', x='Reviews', y='Title', figsize=(12,10));
plt.title('Books with Most Reviews');
plt.tick_params(axis='y', which='major', labelsize=5);
ax.set_ylabel('');
plt.show();

Bivariate distribution for User Rating and Reviews

In [None]:
sns.jointplot(x='User Rating', y='Reviews', data=df_rev,
              kind="kde", fill=True, color='deepskyblue');

In [None]:
# DataFrame.plot opens its own figure, so the size is passed here directly
ax = df_price.plot(kind='scatter', x='Price', y='Title', figsize=(12,10));
plt.title('Most Expensive Books');
plt.tick_params(axis='y', which='major', labelsize=7);
ax.set_ylabel('');
plt.show();

Bivariate distribution for User Rating and Price

In [None]:
sns.jointplot(x='User Rating', y='Price', data=df_price,
              kind="kde", fill=True, color='deepskyblue');

Authors that made repeat appearances per year

In [None]:
_, axes = plt.subplots(nrows=4, ncols=3, figsize=(20, 20))

dat_19.plot(kind='bar', x='Author', title="2019 Repeat Appearances", rot=0, ax=axes[0][0], color='royalblue');
dat_18.plot(kind='bar', x='Author', title="2018 Repeat Appearances", rot=0, ax=axes[0][1], color='royalblue');
dat_17.plot(kind='bar', x='Author', title="2017 Repeat Appearances", rot=0, ax=axes[0][2], color='royalblue');
dat_16.plot(kind='bar', x='Author', title="2016 Repeat Appearances", rot=0, ax=axes[1][0], color='royalblue');
dat_15.plot(kind='bar', x='Author', title="2015 Repeat Appearances", rot=90, ax=axes[1][1], color='royalblue');
dat_14.plot(kind='bar', x='Author', title="2014 Repeat Appearances", rot=0, ax=axes[1][2], color='royalblue');
dat_13.plot(kind='bar', x='Author', title="2013 Repeat Appearances", rot=0, ax=axes[2][0], color='royalblue');
dat_12.plot(kind='bar', x='Author', title="2012 Repeat Appearances", rot=0, ax=axes[2][1], color='royalblue');
dat_11.plot(kind='bar', x='Author', title="2011 Repeat Appearances", rot=0, ax=axes[2][2], color='royalblue');
dat_10.plot(kind='bar', x='Author', title="2010 Repeat Appearances", rot=0, ax=axes[3][0], color='royalblue');
dat_09.plot(kind='bar', x='Author', title="2009 Repeat Appearances", rot=90, ax=axes[3][1], color='royalblue');
axes[3,2].set_axis_off();
plt.tight_layout();

Genre distribution for Top 50

In [None]:
ax = sns.countplot(data=df, x='Genre', palette='Set2');
for p in ax.patches:
    ax.annotate('{:.1%}'.format(p.get_height()/len(df['Genre'])), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 5), 
                   textcoords = 'offset points')
ax.set(ylabel="Count");

Violin plots for User Rating and Price between Genres

In [None]:
_, axes = plt.subplots(1, 2, sharey=False, figsize=(10, 4))

sns.violinplot(x='Genre', y='User Rating', data=df, ax=axes[0], palette='Set2');
sns.violinplot(x='Genre', y='Price', data=df, ax=axes[1], palette='Set2');

<a id="Dataset_Overview"></a>

# Dataset Overview

[Back to top](#Top)

Because this dataset is a compilation over 11 years, some books appear more than once. I'll add a frequency column ('Freq') that holds the total number of appearances for each title.

I'll also rename the 'Name' column to 'Title' to better distinguish between book title and author name.

In [None]:
!python -m pip install seaborn==0.11.1

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')
%config InlineBackend.figure_format = 'retina'

In [None]:
df = pd.read_csv('../input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv')

In [None]:
df.rename(columns={'Name': 'Title'}, inplace=True)
df['Freq']=df.groupby(by='Title')['Title'].transform('count')
df.head()

In [None]:
df.columns

In [None]:
df.describe()

In [None]:
df.info()

<a id="Title"></a>

# Title

[Back to top](#Top)

None of the books are included in all of the Top 50 charts across the years.

The closest is the *APA Publication Manual*, which appears in 10 of the 11 years.

In [None]:
df.Title.value_counts()

In [None]:
plt.figure(figsize=(10,6))

# x is a standalone Series of per-title counts, so no data= argument is needed
ax = sns.countplot(x=df.Title.value_counts(), palette="Set2")

for p in ax.patches:
    ax.annotate(format(p.get_height(), 'd'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                   textcoords = 'offset points')

ax.set(xlabel="Appearances", ylabel="Count")
ax.set_title("Times a Title Shows up on Top 50 Charts (2009-2019)")
plt.show()

A vast majority of the books are unique to a single year's Top 50.

Let's see which titles have the three highest appearance counts (8, 9, or 10):

In [None]:
print("Title | Appearances")
for title, count in df.Title.value_counts().items():
    if count in (8, 9, 10):
        print(f"{title} | {count}")

In [None]:
df.loc[df['Title'] == 'Publication Manual of the American Psychological Association, 6th Edition']

It looks like the *APA Publication Manual* made 10 appearances in total, reaching the Top 50 every year except 2019.
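One way to confirm which year is missing is to compare the years a title appears in against the full 2009-2019 range. The sketch below uses a small made-up frame in place of `df`; the `missing_years` helper and the toy rows are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

def missing_years(frame, title, first=2009, last=2019):
    """Return the sorted years in [first, last] where `title` has no row in `frame`."""
    seen = set(frame.loc[frame['Title'] == title, 'Year'].tolist())
    return sorted(set(range(first, last + 1)) - seen)

# Toy stand-in for df: "Book A" is present every year except 2019.
toy = pd.DataFrame({
    'Title': ['Book A'] * 10 + ['Book B'],
    'Year': list(range(2009, 2019)) + [2019],
})

print(missing_years(toy, 'Book A'))
```

On the real data, the same helper could be called with `df` and the full APA manual title used in the cell above.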

<a id="Author"></a>

# Author

[Back to top](#Top)

In [None]:
df.Author.value_counts()

Let's look at how many times each author made the Top 50:

In [None]:
plt.figure(figsize=(10,6))

# x is a standalone Series of per-author counts, so no data= argument is needed
ax = sns.countplot(x=df.Author.value_counts(), palette="Set2")

for p in ax.patches:
    ax.annotate(format(p.get_height(), 'd'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                   textcoords = 'offset points')

ax.set(xlabel="Appearances", ylabel="Number of Authors")
ax.set_title("Total Appearances in the Top 50 Charts")
plt.show()

Clearly, most of the authors appear only once in the top 50 charts.

### Which authors appeared a total of 6 or more times?

In [None]:
for x, y in df.Author.value_counts().items():
    if y > 5:
        print(x, y)

Dr. Seuss appears a total of 9 times. Which books of his made the Top 50 over the years?

In [None]:
df.loc[df['Author'] == 'Dr. Seuss']

Dr. Seuss had only two distinct books in the Top 50,
but *Oh, the Places You'll Go!* made the list a total of 8 times.

### Authors with Repeat Top 50 Appearances

Of the authors that made repeat appearances, how many total times did they appear?

In [None]:
df_repeat = pd.DataFrame(
            [x, y] for x, y in df.Author.value_counts().items() if (y > 1)
)

In [None]:
df_repeat.columns = ['Author','Top 50 Appearances']
df_repeat

In [None]:
plt.figure(figsize=(10,6))

ax = sns.countplot(data=df_repeat, x="Top 50 Appearances", palette="Set2")

for p in ax.patches:
    ax.annotate('{:.1%}'.format(p.get_height()/len(df_repeat)), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                   textcoords = 'offset points')

ax.set(xlabel="Appearances", ylabel="Count")
ax.set_title("Repeat Top 50 Appearances")
plt.show()

Of those who made the Top 50 more than once, a majority (50.8%) only reached it twice.
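The percentages on the bars can also be computed directly rather than read off the plot. This is a minimal sketch on a made-up author column; `appearance_shares` and `toy_authors` are hypothetical names used only for illustration.

```python
import pandas as pd

def appearance_shares(authors):
    """Share of repeat authors (2+ rows) at each total appearance count."""
    counts = authors.value_counts()
    repeat = counts[counts > 1]
    # normalize=True turns the tally of tallies into fractions that sum to 1
    return repeat.value_counts(normalize=True).sort_index()

# Toy stand-in for df.Author: A and B appear twice, C three times, D once.
toy_authors = pd.Series(['A', 'A', 'B', 'B', 'C', 'C', 'C', 'D'])
shares = appearance_shares(toy_authors)
print(shares)
```

Passing `df.Author` instead of the toy Series should give the same shares that the annotated countplot displays.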

<a id="User_Rating"></a>

# User Rating

[Back to top](#Top)

In [None]:
df['User Rating'].value_counts()

In [None]:
plt.figure(figsize=(10,6))
bin_width=.05
ticks = list(df['User Rating'].unique())
ax = sns.histplot(df, x="User Rating", color='royalblue', kde=True, binwidth=bin_width)
ax.set(xlabel="User Rating", ylabel="Count")
ax.set_title("User Rating Distribution")
plt.xticks(ticks)
plt.show()

Unsurprisingly, the User Rating distribution is skewed left, as most books in the top 50 have high ratings.
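The direction of the skew can be checked numerically as well: a negative sample skewness, together with a mean below the median, points to a longer left tail. The ratings below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up ratings: mostly high, with a couple of low outliers.
toy_ratings = pd.Series([4.8, 4.7, 4.7, 4.6, 4.6, 4.6, 4.5, 4.0, 3.3])

skewness = toy_ratings.skew()      # negative means a left (low-end) tail
mean_rating = toy_ratings.mean()
median_rating = toy_ratings.median()

print(f"skew={skewness:.2f}, mean={mean_rating:.2f}, median={median_rating:.2f}")
```

Running the same three calls on `df['User Rating']` would show whether the real column behaves the same way.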

Let's look at which books received less than a 4.0-star rating:

In [None]:
df.loc[df['User Rating'] < 4.0].drop_duplicates('Title')

In [None]:
df.loc[df['User Rating'] == df['User Rating'].min()]

Five books received a rating below 4 stars, two of which made a repeat appearance in the Top 50.

J.K. Rowling's *The Casual Vacancy* received the lowest rating of all the books.

Interestingly, all these books are of the fiction genre.

<a id="Reviews"></a>

# Reviews

[Back to top](#Top)

In [None]:
df.Reviews.value_counts()

In [None]:
plt.figure(figsize=(10,6))
ax = sns.histplot(df, x="Reviews", color='royalblue', kde=True)
ax.set(xlabel="Reviews", ylabel="Count")
ax.set_title("Reviews Distribution")
plt.show()

Which titles got people talking the most?

In [None]:
df_rev = df.loc[df['Reviews'] > 2.5*df.Reviews.median()].drop_duplicates('Title')

In [None]:
# DataFrame.plot opens its own figure, so the size is passed here directly
ax = df_rev.plot(kind='scatter', x='Reviews', y='Title', figsize=(20,20))
plt.title('Books with Most Reviews')
plt.tick_params(axis='y', which='major', labelsize=5)
ax.set_ylabel('')
plt.show()

In [None]:
sns.jointplot(x='User Rating', y='Reviews', data=df_rev,
              kind="kde", fill=True, color='deepskyblue');

In [None]:
df.loc[df['Reviews'] == df['Reviews'].max()]

*Where the Crawdads Sing*, by Delia Owens, received the most reviews.

<a id="Price"></a>

# Price

[Back to top](#Top)

In [None]:
df.Price.value_counts(ascending=False)

Of all the titles that made it over the years, which was the most expensive?

In [None]:
df_price = df.loc[df['Price'] > 2.0*df.Price.median()].drop_duplicates('Title')

In [None]:
# DataFrame.plot opens its own figure, so the size is passed here directly
ax = df_price.plot(kind='scatter', x='Price', y='Title', figsize=(12,10))
plt.title('Most Expensive Books')
plt.tick_params(axis='y', which='major', labelsize=7)
ax.set_ylabel('')
plt.show()

In [None]:
sns.jointplot(x='User Rating', y='Price', data=df_price,
              kind="kde", fill=True, color='deepskyblue');

In [None]:
df.loc[df['Price'] == df['Price'].max()]

The *Diagnostic and Statistical Manual of Mental Disorders, 5th Edition: DSM-5* is the most expensive book that made the Top 50.

<a id="Year"></a>

# Year

[Back to top](#Top)

In [None]:
df.Year.value_counts()

### Create a dataframe for each year

In [None]:
df_2019 = df.loc[df['Year'] == 2019]
df_2018 = df.loc[df['Year'] == 2018]
df_2017 = df.loc[df['Year'] == 2017]
df_2016 = df.loc[df['Year'] == 2016]
df_2015 = df.loc[df['Year'] == 2015]
df_2014 = df.loc[df['Year'] == 2014]
df_2013 = df.loc[df['Year'] == 2013]
df_2012 = df.loc[df['Year'] == 2012]
df_2011 = df.loc[df['Year'] == 2011]
df_2010 = df.loc[df['Year'] == 2010]
df_2009 = df.loc[df['Year'] == 2009]

### Look at the authors that made multiple appearances in a single year

In [None]:
def repeat_authors(year_df):
    """Return the authors with more than one Top 50 entry in a single year's dataframe."""
    counts = year_df.Author.value_counts()
    # Keep repeat authors only and return them as 'Author' / 'Count' columns
    return counts[counts > 1].rename_axis('Author').reset_index(name='Count')

# One frame per year, replacing the repeated DataFrame.append loops
# (DataFrame.append was removed in pandas 2.0)
dat_19 = repeat_authors(df_2019)
dat_18 = repeat_authors(df_2018)
dat_17 = repeat_authors(df_2017)
dat_16 = repeat_authors(df_2016)
dat_15 = repeat_authors(df_2015)
dat_14 = repeat_authors(df_2014)
dat_13 = repeat_authors(df_2013)
dat_12 = repeat_authors(df_2012)
dat_11 = repeat_authors(df_2011)
dat_10 = repeat_authors(df_2010)
dat_09 = repeat_authors(df_2009)

In [None]:
_, axes = plt.subplots(nrows=4, ncols=3, figsize=(20, 20))

dat_19.plot(kind='bar', x='Author', title="2019 Repeat Appearances", rot=0, ax=axes[0][0], color='royalblue');
dat_18.plot(kind='bar', x='Author', title="2018 Repeat Appearances", rot=0, ax=axes[0][1], color='royalblue');
dat_17.plot(kind='bar', x='Author', title="2017 Repeat Appearances", rot=0, ax=axes[0][2], color='royalblue');
dat_16.plot(kind='bar', x='Author', title="2016 Repeat Appearances", rot=0, ax=axes[1][0], color='royalblue');
dat_15.plot(kind='bar', x='Author', title="2015 Repeat Appearances", rot=90, ax=axes[1][1], color='royalblue');
dat_14.plot(kind='bar', x='Author', title="2014 Repeat Appearances", rot=0, ax=axes[1][2], color='royalblue');
dat_13.plot(kind='bar', x='Author', title="2013 Repeat Appearances", rot=0, ax=axes[2][0], color='royalblue');
dat_12.plot(kind='bar', x='Author', title="2012 Repeat Appearances", rot=0, ax=axes[2][1], color='royalblue');
dat_11.plot(kind='bar', x='Author', title="2011 Repeat Appearances", rot=0, ax=axes[2][2], color='royalblue');
dat_10.plot(kind='bar', x='Author', title="2010 Repeat Appearances", rot=0, ax=axes[3][0], color='royalblue');
dat_09.plot(kind='bar', x='Author', title="2009 Repeat Appearances", rot=90, ax=axes[3][1], color='royalblue');
axes[3,2].set_axis_off()

plt.tight_layout()

One or more authors made multiple Top 50 appearances in every year.
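This observation can be verified without reading each subplot by counting rows per (Year, Author) pair. The sketch below runs on a tiny made-up frame; `repeat_authors_by_year` and the toy rows are assumptions for illustration only.

```python
import pandas as pd

def repeat_authors_by_year(frame):
    """Count rows per (Year, Author) and keep the pairs with more than one row."""
    per_year = frame.groupby(['Year', 'Author']).size()
    return per_year[per_year > 1]

# Toy stand-in for df: author A repeats in 2018, author C repeats in 2019.
toy = pd.DataFrame({
    'Year':   [2018, 2018, 2018, 2019, 2019, 2019],
    'Author': ['A',  'A',  'B',  'B',  'C',  'C'],
})

repeats = repeat_authors_by_year(toy)
years_with_repeats = set(repeats.index.get_level_values('Year'))

# True when every year in the frame has at least one repeat author
print(years_with_repeats == set(toy['Year']))
```

Applied to `df`, the final comparison would confirm (or refute) that all 11 years contain at least one author with multiple entries.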

<a id="Genre"></a>

# Genre

[Back to top](#Top)

Which genre has been more popular in the Top 50 over the years?

In [None]:
df.Genre.value_counts()

In [None]:
ax = sns.countplot(data=df, x='Genre', palette='Set2');
for p in ax.patches:
    ax.annotate('{:.1%}'.format(p.get_height()/len(df['Genre'])), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 5), 
                   textcoords = 'offset points')
ax.set(ylabel="Count");

The non-fiction genre accounts for 56.4% of the accumulated Top 50 charts.

In [None]:
_, axes = plt.subplots(1, 2, sharey=False, figsize=(10, 4))

sns.violinplot(x='Genre', y='User Rating', data=df, ax=axes[0], palette='Set2');
sns.violinplot(x='Genre', y='Price', data=df, ax=axes[1], palette='Set2');

In [None]:
sns.lmplot(x='Price', y='Reviews', data=df, hue='Genre', fit_reg=False, palette='Set2');

In general, the books with the most reviews have lower prices and are non-fiction.
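A visual impression like this is worth checking against per-genre summary statistics. The following is a rough sketch on made-up numbers (the `genre_summary` helper and the toy values are hypothetical and do not reflect the real dataset); medians are used because review counts and prices both have long tails.

```python
import pandas as pd

def genre_summary(frame):
    """Median Reviews and Price for each Genre."""
    return frame.groupby('Genre')[['Reviews', 'Price']].median()

# Made-up rows standing in for df.
toy = pd.DataFrame({
    'Genre':   ['Fiction', 'Fiction', 'Fiction',
                'Non Fiction', 'Non Fiction', 'Non Fiction'],
    'Reviews': [100, 300, 500, 50, 150, 250],
    'Price':   [5, 10, 15, 8, 12, 20],
})

summary = genre_summary(toy)
print(summary)
```

Calling `genre_summary(df)` on the real data would show how the two genres compare on typical review count and price.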

Thanks for reading! Again, feedback and criticism are appreciated. Feel free to leave a thought in the comments.

[Back to top](#Top)