# **Table of Contents**

* **Data loading and Data Preparation**
* **Exploratory analysis & visualization**
* **Asking and answering questions**
    1.     Top 10 bestselling books from 2009 to 2019 (fiction/non-fiction)
    1.     Top 10 bestselling authors from 2009 to 2019 (fiction/non-fiction)
    1.     Year-wise percentage category distribution of books
    1.     How many unique books and authors were included in the bestsellers list from 2009 to 2019?
    1.     Most expensive book and most affordable book
    1.     Highest rated and lowest rated books
    1.     Does the title length of a book matter for being a bestseller?
* **Summary and Conclusion**

In [None]:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Configuring styles
sns.set_style("whitegrid")
plt.rcParams['font.size'] = 14
plt.rcParams['figure.figsize'] = (9, 5)
plt.rcParams['figure.facecolor'] = '#E6E6E6'

## 1. Data Loading and Data Preparation

In [None]:
df = pd.read_csv('/kaggle/input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv')
df.head()

In [None]:
df.info()

* Dataset contains **550 rows and 7 columns**

* There are **no missing values** in the dataset
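Both observations can also be checked programmatically instead of being read off the `df.info()` output. The following is a minimal sketch on a small hypothetical frame (`sample` is a stand-in, not the Kaggle data); in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical three-row stand-in for the bestsellers DataFrame
sample = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C"],
    "User Rating": [4.7, 4.6, 4.8],
    "Price": [8, 0, 15],
})

n_rows, n_cols = sample.shape              # (number of rows, number of columns)
missing_per_column = sample.isna().sum()   # count of missing values in each column
total_missing = int(missing_per_column.sum())

print(n_rows, n_cols, total_missing)
```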

In [None]:
df.describe()

* The **minimum price is 0** and the **maximum price is 105**.

* Year ranges from **2009 to 2019**

* **Average rating is 4.6** for bestselling books from 2009 to 2019 

* **87841 is the highest number of reviews** and **37 is the lowest number of reviews** received by any book

## 2. Exploratory analysis & visualization

In [None]:
sns.set_palette("PRGn_r")
plt.title("Distribution of user ratings per book")
sns.histplot(df['User Rating'],edgecolor='black');

In [None]:
plt.title("CDF plot for user ratings")
sns.ecdfplot(df['User Rating'],linewidth=3);

* **70% of user ratings were above 4.5**
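A share like this can be computed directly rather than estimated from the CDF plot. Here is a minimal sketch, assuming a hypothetical `ratings` series; in the notebook it would be `df['User Rating']`.

```python
import pandas as pd

# Hypothetical ratings, standing in for df['User Rating']
ratings = pd.Series([4.9, 4.8, 4.7, 4.6, 4.5, 4.4, 4.3, 4.0, 4.6, 4.7])

# A boolean mask averages to the fraction of True values
share_above = (ratings > 4.5).mean()

print(f"{share_above:.0%} of ratings are above 4.5")
```

Note that the comparison is strict, so a rating of exactly 4.5 is not counted.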


In [None]:
sns.set_palette("PRGn")
plt.title("Distribution of number of reviews per book")
sns.histplot(df['Reviews'],edgecolor='black');

In [None]:
plt.xlabel("Number of reviews")
plt.title("CDF plot for number of reviews")
sns.ecdfplot(df.Reviews,linewidth=3);

* **80% of the books had fewer than 20K reviews**

In [None]:
sns.set_palette("YlOrBr_r")
plt.title("Distribution of price per book")
sns.histplot(df['Price'],edgecolor='black');

In [None]:
plt.title("CDF plot for prices of books")
sns.ecdfplot(df.Price,linewidth=3);

* **85% of books were priced below 20 dollars**
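The same reading can be approached from the other direction: ask for the price at the 85th percentile and compare it with the threshold. This is a sketch on hypothetical `prices`; `interpolation="higher"` is chosen here so the result is an actual price from the data rather than an interpolated value.

```python
import pandas as pd

# Hypothetical prices, standing in for df['Price']
prices = pd.Series([0, 4, 6, 8, 9, 11, 13, 15, 18, 54])

# Price at the 85th percentile, taken from the observed values
p85 = prices.quantile(0.85, interpolation="higher")

# Fraction of books priced strictly below 20
share_below_20 = (prices < 20).mean()

print(p85, share_below_20)
```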

In [None]:
sns.set_palette("PRGn_r")
plt.title("Year-wise user ratings of books")
sns.lineplot(y="User Rating", x="Year", data=df,linewidth=3);

* **User ratings for bestselling books have improved slightly over time**

In [None]:
custom_palette = ['crimson',"dodgerblue"]
sns.set_palette(custom_palette)
plt.title("Year-wise price trend of books")
sns.lineplot(y="Price", x="Year",hue="Genre", data=df,linewidth=3);

* **Non-fiction books were more expensive than fiction books in every year except 2009**

* **The overall price of books has decreased over time**
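The line plot draws the per-year mean price for each genre (the mean is seaborn's default estimator), so the comparison can be tabulated to see exactly which years go which way. A minimal sketch with a hypothetical two-year `sample`:

```python
import pandas as pd

# Hypothetical stand-in for df with two years of data
sample = pd.DataFrame({
    "Year":  [2009, 2009, 2009, 2009, 2010, 2010, 2010, 2010],
    "Genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction"] * 2,
    "Price": [14, 16, 10, 12, 8, 10, 15, 17],
})

# Rows are years, columns are genres, values are mean prices
mean_price = sample.groupby(["Year", "Genre"])["Price"].mean().unstack("Genre")

# Years in which fiction was, on average, the more expensive genre
fiction_costlier = mean_price[mean_price["Fiction"] > mean_price["Non Fiction"]]
years_fiction_costlier = fiction_costlier.index.tolist()

print(mean_price)
print(years_fiction_costlier)
```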

In [None]:
custom_palette = ['crimson',"dodgerblue"]
sns.set_palette(custom_palette)
plt.title("Year-wise number of reviews of books")
sns.lineplot(y="Reviews", x="Year", hue="Genre" ,data=df,linewidth=3);

* **Fiction books tended to receive more reviews than non-fiction books**

In [None]:
custom_palette = ['lightcoral',"skyblue"]
sns.set_palette(custom_palette)
genre = df.Genre.value_counts()
plt.title("Overall category of books")
plt.pie(genre, labels=genre.index, autopct='%0.2f%%', startangle=90);

* **On average, of the 50 bestselling books in a year, 28 were non-fiction and 22 were fiction**
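The step from the pie chart percentages to "books out of 50" is simple scaling of the genre shares. A sketch of that arithmetic on a hypothetical `sample` with a 60/40 split:

```python
import pandas as pd

# Hypothetical stand-in for df: 6 non-fiction rows and 4 fiction rows
sample = pd.DataFrame({
    "Genre": ["Non Fiction"] * 6 + ["Fiction"] * 4,
})

# Fraction of all rows belonging to each genre
genre_share = sample["Genre"].value_counts(normalize=True)

# Scale the shares to a list of 50 books and round to whole books
per_50 = (genre_share * 50).round().astype(int)

print(per_50)
```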

## 3. Asking and answering questions

### 1. Top 10 bestselling books from 2009 to 2019 (fiction/non-fiction)

In [None]:
nonfict = df[df['Genre']=='Non Fiction']
fict = df[df['Genre']=='Fiction']
top10fict = fict['Name'].value_counts().head(10)
top10nonfict = nonfict['Name'].value_counts().head(10)

In [None]:
sns.barplot(x=top10fict,y=top10fict.index, palette = 'PuBu_r',edgecolor='grey');
plt.title("Top 10 bestselling books (Fiction)")
plt.xlabel('# bestseller');

* **"Oh, the Places You'll Go!" was the top-selling fiction book with 8 appearances in the bestseller lists from 2009 to 2019.**

In [None]:
sns.barplot(x=top10nonfict,y=top10nonfict.index, palette = 'OrRd_r',edgecolor='grey');
plt.title("Top 10 bestselling books (Non-fiction)")
plt.xlabel('# bestseller');

* **"Publication Manual of the American Psychological Association, 6th Edition" was the top-selling non-fiction book with 10 appearances in the bestseller lists from 2009 to 2019.**

### 2. Top 10 bestselling authors from 2009 to 2019 (fiction/non-fiction)

In [None]:
top10fict_auth = fict['Author'].value_counts().head(10)
top10nonfict_auth = nonfict['Author'].value_counts().head(10)

In [None]:
sns.barplot(x=top10fict_auth,y=top10fict_auth.index, palette = 'PuBu_r',edgecolor='grey');
plt.title("Top 10 bestselling Authors (Fiction)")
plt.xlabel('# bestseller');

* **"Jeff Kinney" was the top-selling author in the fiction category with 12 appearances in the bestseller lists from 2009 to 2019.**

In [None]:
sns.barplot(x=top10nonfict_auth,y=top10nonfict_auth.index, palette = 'OrRd_r',edgecolor='grey');
plt.title("Top 10 bestselling Authors (Non-fiction)")
plt.xlabel('# bestseller');

* **"Gary Chapman" was the top-selling author in the non-fiction category with 11 appearances in the bestseller lists from 2009 to 2019.**

### 3. Year-wise percentage category distribution of books

In [None]:
# Count the number of books for each (year, genre) pair
temp1 = df.groupby(['Year','Genre']).size().reset_index(name='Count')

In [None]:
custom_palette = ["dodgerblue",'crimson']
sns.set_palette(custom_palette)
plt.figure(figsize=(14,6))
plt.title("Year-wise category of books")
sns.barplot(x=temp1.Year, y=temp1.Count * 100 / 50, hue=temp1.Genre,edgecolor='black');
plt.ylabel("Percentage");

* **Non-fiction books made up the larger share of the top 50 bestselling books in every year from 2009 to 2019 except 2014**
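The percentages can also be derived without hard-coding the list size of 50, by normalizing each year's counts to that year's total. A minimal sketch using `pd.crosstab` on a hypothetical `sample`:

```python
import pandas as pd

# Hypothetical stand-in for df with four books per year
sample = pd.DataFrame({
    "Year":  [2013] * 4 + [2014] * 4,
    "Genre": ["Fiction", "Non Fiction", "Non Fiction", "Non Fiction",
              "Fiction", "Fiction", "Fiction", "Non Fiction"],
})

# normalize="index" divides each row (year) by its own total
pct = pd.crosstab(sample["Year"], sample["Genre"], normalize="index") * 100

# Years in which fiction held the majority
fiction_majority_years = pct[pct["Fiction"] > 50].index.tolist()

print(pct)
print(fiction_majority_years)
```

Normalizing per row keeps the result correct even if a year were to contain more or fewer than 50 entries.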

### 4. How many unique books and authors were included in bestsellers list from 2009 to 2019? 

In [None]:
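uniquebook = df.Name.nunique()    # number of distinct titles
uniqueauth = df.Author.nunique()  # number of distinct authors
# Total rows minus distinct titles = number of repeat entries
print(uniquebook, len(df) - uniquebook)
print(uniqueauth)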

* **Of the 550 bestseller entries, 351 were unique books, which means 199 entries were repeat appearances**
* **248 unique authors appeared in the bestseller lists from 2009 to 2019**
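The repeat figure is the number of rows minus the number of distinct titles, so it counts extra appearances rather than the number of titles that appear more than once. The sketch below separates the two on a hypothetical `sample`:

```python
import pandas as pd

# Hypothetical stand-in for df: "A" appears three times, "C" twice
sample = pd.DataFrame({"Name": ["A", "A", "A", "B", "C", "C"]})

n_unique = sample["Name"].nunique()          # distinct titles
n_repeat_entries = len(sample) - n_unique    # appearances beyond the first

# Titles that show up in more than one row, with their counts
appearances = sample["Name"].value_counts()
repeated = appearances[appearances > 1]

print(n_unique, n_repeat_entries)
print(repeated)
```

Here three rows are repeat entries, but only two titles are actually repeated.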

### 5. Most expensive book and most affordable book 

In [None]:
temp2 = df[df.Price == df.Price.max()]
exp=temp2.drop_duplicates(subset=['Name'], keep='first')
exp[['Name','Author','Price']]

In [None]:
exp.Name.tolist()

* **"Diagnostic and Statistical Manual of Mental Disorders, 5th Edition: DSM-5" was the most expensive book, priced at 105 dollars**

In [None]:
temp3 = df[df.Price == df.Price.min()]
cheap = temp3.drop_duplicates(subset=['Name'], keep='first')
cheap[['Name','Author','Price']]

In [None]:
len(cheap.Name.tolist())

* **A total of 9 books were priced at 0 dollars**

### 6. Highest rated and lowest rated books

In [None]:
temp4 = df[df['User Rating'] == df['User Rating'].max()]
highrated = temp4.drop_duplicates(subset=['Name'], keep='first')
highrated[['Name','Author','User Rating']]

In [None]:
len(highrated.Name.tolist())

* **A total of 28 books received the highest rating of 4.9**

In [None]:
temp5 = df[df['User Rating'] == df['User Rating'].min()]
lowrated = temp5.drop_duplicates(subset=['Name'], keep='first')
lowrated[['Name','Author','User Rating']]

* **"The Casual Vacancy" by J.K. Rowling received the lowest rating of 3.3**

### 7. Does the title length of a book matter for being a bestseller?

In [None]:
# Character length of every title, as a plain list
name_len = df.Name.str.len().tolist()

In [None]:
sns.set_palette("PRGn")
sns.ecdfplot(data=name_len, linewidth=3);
plt.xlabel('Title length of books');
plt.title('CDF plot for title length of books');

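* **About 40% of the bestselling books had a title shorter than 30 characters. Because the data contains only bestsellers, this describes the list itself and not the chance that a book with a short title becomes a bestseller.**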
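What the CDF plot measures is the share of bestseller titles below a given length, and that share can be computed directly. A minimal sketch on hypothetical `titles`; in the notebook this would be `df.Name`.

```python
import pandas as pd

# Hypothetical titles, standing in for df.Name
titles = pd.Series([
    "Short Title",
    "Another Short One",
    "A Title That Is Clearly Longer Than Thirty Characters",
    "Yet Another Title Well Beyond the Thirty Character Mark",
    "Tiny",
])

title_len = titles.str.len()            # number of characters in each title
share_short = (title_len < 30).mean()   # fraction of titles under 30 characters

print(f"{share_short:.0%} of titles are shorter than 30 characters")
```

Estimating the chance of becoming a bestseller would additionally need titles of books that did not make the list, which this dataset does not include.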

## 4. Summary and Conclusion

#### Here is a summary of all the inferences drawn from the analysis

* **Average rating for bestselling books from 2009 to 2019 is 4.6**

* **85% of the books were priced below 20 dollars**

* **User ratings of bestselling books improved slightly over time.**

* **Non-fiction books were more expensive than fiction books in every year except 2009**

* **Fiction books tended to receive more reviews than non-fiction books**

* **On average, of the 50 bestselling books in a year, 28 were non-fiction and 22 were fiction**

* **"Oh, the Places You'll Go!" is the top-selling fiction book with 8 appearances**

* **"Publication Manual of the American Psychological Association, 6th Edition" is the top-selling non-fiction book with 10 appearances**

* **"Jeff Kinney" is the top-selling author in the fiction category with 12 appearances**

* **"Gary Chapman" is the top-selling author in the non-fiction category with 11 appearances**

* **Of the 550 bestseller entries, 351 were unique books, which means 199 entries were repeat appearances**

* **248 unique authors appeared in the bestseller lists from 2009 to 2019**

* **"Diagnostic and Statistical Manual of Mental Disorders, 5th Edition: DSM-5" was the most expensive book, priced at 105 dollars**

* **A total of 9 books were priced at 0 dollars, i.e. they were free**

* **A total of 28 books received the highest rating of 4.9**

* **"The Casual Vacancy" by J.K. Rowling received the lowest rating of 3.3**

* **About 40% of the bestselling books had a title shorter than 30 characters. Because the data contains only bestsellers, this describes the list itself and not the chance that a book with a short title becomes a bestseller.**