**IMPORT LIBRARIES**

*Importing required libraries for the EDA*

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
import plotly.express as px
import missingno as msno

import warnings
warnings.filterwarnings('ignore')



**LOADING DATASET**

*Loading the dataset into the dataframe*

In [None]:
#create bestseller dataframe
bestseller=pd.read_csv("../input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv")

**DATA TYPE EXPLORATION**

*Checking the types of data*

In [None]:
bestseller.info()

According to the output above, the data contains one float column, three integer columns and three object columns (Name, Author and Genre). Every column reports a full count of non-null entries, meaning there are no missing values.
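
The same breakdown can be checked programmatically instead of reading it off `info()`. The sketch below is only an illustration: the two rows are made up, and the frame `sample` is a hypothetical stand-in that assumes the same seven column names used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the dataset's seven columns (values are illustrative only).
sample = pd.DataFrame({
    "Name": ["Book A", "Book B"],
    "Author": ["Author X", "Author Y"],
    "User Rating": [4.7, 4.5],
    "Reviews": [1200, 350],
    "Price": [8, 15],
    "Year": [2016, 2011],
    "Genre": ["Fiction", "Non Fiction"],
})

# Split the columns by kind of dtype.
float_cols = sample.select_dtypes(include=[np.floating]).columns.tolist()
int_cols = sample.select_dtypes(include=[np.integer]).columns.tolist()
text_cols = sample.select_dtypes(exclude=[np.number]).columns.tolist()

print("float:", float_cols)
print("integer:", int_cols)
print("non-numeric:", text_cols)
```

Running the same three `select_dtypes` calls on `bestseller` should reproduce the counts that `info()` reports.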

*Display 10 random samples from the dataframe*

In [None]:
bestseller.sample(10)

*Return shape of the dataframe*

In [None]:
bestseller.shape

*Show stats of the dataframe*

In [None]:
bestseller.describe()

*Find null/missing values*

In [None]:
bestseller.isnull().any() 

*Visualize missing values as a bar chart*

In [None]:
msno.bar(bestseller, figsize=(10,4), color = "purple",fontsize=10) 

The bar chart shows how complete each feature is: the height of each bar is the number of non-missing values in that column. In our dataframe every bar reaches the full row count, so there are no missing values.
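
If exact numbers are preferred over a chart, the missing values can also be counted per column. This is a minimal sketch on a made-up frame (`demo` is hypothetical and deliberately contains two gaps so the counts are not all zero).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two gaps, for illustration only.
demo = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C"],
    "Price": [8.0, np.nan, 15.0],
    "Genre": ["Fiction", None, "Non Fiction"],
})

# isnull() gives a boolean frame; summing it counts the True values per column.
missing_per_column = demo.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing:", total_missing)
```

Applied to `bestseller`, every count would be expected to be zero, in line with the bar chart.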


*Return the number of books in each Genre category*

In [None]:
bestseller['Genre'].value_counts()

*Checking and dropping the duplicate rows*

Returning a boolean series to check for duplicates

In [None]:

# sort by name so that repeated titles sit next to each other
bestseller.sort_values("Name", inplace=True)

# boolean Series: True for every repeat of a name after its first occurrence
bestseller_bool = bestseller["Name"].duplicated()

# display the rows flagged as duplicates
bestseller[bestseller_bool]


As the output above shows, duplicates are detected on the 'Name' column. With the default setting (keep='first'), the first occurrence of each name is treated as unique and every later occurrence is flagged as a duplicate.

In the above case, you can see that the book Wonder by R. J. Palacio and You Are a Badass: How to Stop Doubting Your Gr... by Jen Sincero have several duplicate entries.
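
The effect of the `keep` argument is easiest to see on a tiny example. The sketch below uses a made-up Series of four titles rather than the real data.

```python
import pandas as pd

# Made-up titles: "Wonder" appears three times, "Book B" once.
names = pd.Series(["Wonder", "Wonder", "Wonder", "Book B"])

# Default keep='first': the first "Wonder" is not flagged, later ones are.
first_kept = names.duplicated()

# keep=False: every occurrence of a repeated value is flagged.
all_flagged = names.duplicated(keep=False)

print(first_kept.tolist())   # [False, True, True, False]
print(all_flagged.tolist())  # [True, True, True, False]
```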

Removing Duplicates

In [None]:
# boolean Series: keep=False flags every occurrence of a repeated name
bestseller_bool = bestseller["Name"].duplicated(keep=False)

# invert the mask to keep only the books whose name appears exactly once
bestseller_books = bestseller[~bestseller_bool]

# display the result
bestseller_books.info()
bestseller_books

As the output above shows, duplicated(keep=False) returns True for every row whose name appears more than once, so taking the NOT of the boolean series keeps only the books that appear exactly once in the new bestseller_books dataframe.
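
Note that this mask removes repeated titles entirely. If one row per title is wanted instead, `drop_duplicates` is an alternative. The sketch below contrasts the two on made-up rows; it is an illustration of the difference, not a change to the approach used in this notebook.

```python
import pandas as pd

# Made-up rows: "Wonder" is listed in two different years.
books = pd.DataFrame({
    "Name": ["Wonder", "Wonder", "Book B"],
    "Year": [2013, 2014, 2015],
})

# Approach used above: drop every title that occurs more than once.
only_single_entries = books[~books["Name"].duplicated(keep=False)]

# Alternative: keep the first row of each title.
one_row_per_title = books.drop_duplicates(subset="Name", keep="first")

print(only_single_entries)
print(one_row_per_title)
```

The first result contains only "Book B", while the second keeps "Wonder" once (its 2013 row) alongside "Book B".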

**DATA VISUALIZATION**

In [None]:
bestseller_books.head()

**1. The best selling books of the decade.**

In [None]:
decades_best = bestseller_books[bestseller_books['User Rating']==bestseller_books['User Rating'].max()]
# a list (rather than a set) keeps the order of the aggregated columns fixed
genre_review = decades_best.pivot_table(index=['Genre'], values=['Reviews'], aggfunc=["min","max","sum","count","mean"])
genre_review

* The output above summarizes the review counts, per genre, of the books that received the highest user rating, which is 4.9

In [None]:
#Best Selling Books of the decade based on genre
# take values and labels from the same sorted Series so each slice gets the right genre name
genre_counts = bestseller_books['Genre'].value_counts().sort_values()
plt.figure(figsize=(10,5))
plt.pie(genre_counts, labels=genre_counts.index, explode=[0.05,0],
        autopct='%1.2f%%', colors=['plum','salmon'])
plt.subplots_adjust(bottom=0, top=0.93, left=0.5, right=1)
plt.title("Best Selling Books of the Decade based on Genre",fontweight="bold", fontsize=12)
plt.show()

* About 55% of bestselling books in the decade are Non Fiction
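
The percentages printed on the pie chart can be cross-checked numerically with `value_counts(normalize=True)`. The sketch below runs on a made-up Series of five genre labels, so its numbers are illustrative and not the dataset's.

```python
import pandas as pd

# Made-up genre labels, for illustration only.
genres = pd.Series(["Fiction", "Non Fiction", "Non Fiction", "Fiction", "Non Fiction"])

# normalize=True returns fractions; multiply by 100 for percentages.
share = genres.value_counts(normalize=True).mul(100).round(2)

print(share)
```

Because the labels come from the index of the same Series as the values, each percentage is always attached to the right genre.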

**Box Plot**

* A box plot showing the distribution of review counts for the books with the highest user rating

In [None]:
# pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.boxplot(x=decades_best['User Rating'], y=decades_best['Reviews'])

**2. The percentage of fiction to non-fiction.**

* First group by "User Rating"; within each user rating, group by "Genre"


In [None]:
genre_rating = bestseller_books.groupby(['User Rating', 'Genre']) 
  
# Print the first book in each group 
genre_rating.first() 

* Distribution of different genre categories based on the user rating, reviews, price and year

In [None]:
# sum only the numeric columns so the pie charts are not fed text columns
genre_grp = bestseller_books.groupby('Genre').sum(numeric_only=True).plot(
    kind='pie', title="Percentage of Fiction & Non Fiction Books",
    subplots=True, shadow=True, startangle=90,
    figsize=(30,15), autopct='%1.1f%%')

* About 55% of bestselling books are Non Fiction, based on the summed ratings, prices and years
* Fiction accounts for about 63% of all reader reviews
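
Each pie chart shows a genre's share of a column total. The same shares can be computed as a table by dividing the per-genre sums by the column sums. This is a rough sketch on made-up numbers, with hypothetical names `totals` and `shares`.

```python
import pandas as pd

# Made-up rows, for illustration only.
books = pd.DataFrame({
    "Genre": ["Fiction", "Fiction", "Non Fiction"],
    "Reviews": [300, 100, 100],
    "Price": [10, 10, 20],
})

# Per-genre sums of the numeric columns.
totals = books.groupby("Genre").sum(numeric_only=True)

# Divide each column by its total to get the percentage share per genre.
shares = totals.div(totals.sum(axis=0), axis=1).mul(100).round(1)

print(shares)
```

In this toy frame Fiction holds 80% of the reviews but only 50% of the summed price, which illustrates why the share can differ from column to column.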

* To get the total reviews per genre.

In [None]:
# a list (rather than a set) keeps the order of the aggregated columns fixed
genre_review = bestseller_books.pivot_table(index=['Genre'], values=['Reviews'], aggfunc=["min","max","sum","count","mean"])
genre_review

* The 'sum' entry in aggfunc gives the total number of reviews per genre (across all user ratings); the other entries add the minimum, maximum, count and mean.
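
To show how the aggregated columns are laid out and how a single value can be read back, here is a small sketch on made-up data. It assumes that passing a list to `aggfunc` produces two-level column labels of the form (function, column).

```python
import pandas as pd

# Made-up rows, for illustration only.
books = pd.DataFrame({
    "Genre": ["Fiction", "Fiction", "Non Fiction"],
    "Reviews": [300, 100, 50],
})

# One column per aggregation function, addressed as (function, column).
summary = books.pivot_table(index=["Genre"], values=["Reviews"],
                            aggfunc=["sum", "mean", "count"])

fiction_total = summary.loc["Fiction", ("sum", "Reviews")]
fiction_mean = summary.loc["Fiction", ("mean", "Reviews")]

print(summary)
print("Total Fiction reviews:", fiction_total)
```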

In [None]:
genre_review.plot.pie(subplots=True, figsize=(30, 20),autopct='%1.1f%%' );

* Distribution of Genre based on pricing

In [None]:
genre_price = bestseller_books.pivot_table(index=['Genre'], values=['Price'], aggfunc="sum")
genre_price

In [None]:
genre_price.plot.pie(subplots=True, figsize=(15, 10),autopct='%1.1f%%' );

* Fiction accounts for about 60% of the summed book prices over the years

**3. Trends over the years.**

**Violin Plot**

In [None]:
plt.figure(figsize=(15,16))

# fliersize is a box plot option and is not a violinplot parameter, so it is dropped here
trend = sns.violinplot(data=bestseller_books,
                   x = 'Year',
                   y = 'Price', width=1.20, bw=5)
trend.set_title("Book prices over the years")


* This violin plot shows the relationship between year and book price. The inner box plot elements show that the median price of books sold in 2015 was lower than in other years. Each distribution is very narrow at both ends and wide in the middle, which means most prices cluster near the median. The plot also indicates that book prices were higher in 2009.
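
Medians read off a violin plot are approximate, so it can help to compute them directly. The sketch below does this on made-up prices for two years; the numbers are illustrative only.

```python
import pandas as pd

# Made-up prices for two years, for illustration only.
prices = pd.DataFrame({
    "Year": [2009, 2009, 2009, 2015, 2015, 2015],
    "Price": [10, 20, 30, 5, 8, 11],
})

# Median price per year, and the year with the lowest median.
median_by_year = prices.groupby("Year")["Price"].median()
lowest_year = median_by_year.idxmin()

print(median_by_year)
print("Year with the lowest median price:", lowest_year)
```

Running the same two lines on `bestseller_books` would confirm or correct the reading of the plot.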

In [None]:
bestseller_books.head()

**Scatter Plot**

* Assigning columns to variables to ease using them when plotting


In [None]:
book = bestseller_books['Name']
author = bestseller_books['Author']
user_rating = bestseller_books['User Rating']
reviews = bestseller_books['Reviews']
price = bestseller_books['Price']
year = bestseller_books['Year']
genre = bestseller_books['Genre']

In [None]:
plt.figure(figsize = (10,5))
plt.title('Year ratings')

colours = np.arange(len(year))


plt.xlabel('Year')
plt.ylabel('User Ratings')

plt.scatter(year, user_rating, c = colours, cmap = 'Blues', marker = 'o', alpha = 0.75, edgecolor = 'k')

cbar = plt.colorbar()
cbar.set_label('Intensity')

plt.show()

* The scatter plot helps us understand how the user ratings change over the years.
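
A per-year summary table can make the trend in the scatter plot easier to read. This is a minimal sketch on made-up ratings, with the hypothetical name `rating_by_year`.

```python
import pandas as pd

# Made-up ratings for two years, for illustration only.
ratings = pd.DataFrame({
    "Year": [2009, 2009, 2019, 2019],
    "User Rating": [4.0, 4.6, 4.8, 4.9],
})

# Mean, minimum and maximum rating for each year.
rating_by_year = ratings.groupby("Year")["User Rating"].agg(["mean", "min", "max"])

print(rating_by_year)
```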

**4. Top 10 Authors according to reviews**

* Find ranking of authors based on reviews

In [None]:
top_authors = bestseller_books.sort_values(by="Reviews", ascending=False).head(10) 
top_authors.pivot_table(index=['Author'], values=['Reviews'], aggfunc='sum') 

* Sorting reviews of top authors by descending order


In [None]:
top_authors = top_authors.sort_values(by=['Reviews'], ascending=False)
top_authors

**Horizontal Bar**

In [None]:
ax = top_authors.plot.barh(x='Author', y='Reviews', rot=0, color='lightcoral',
                           figsize=(15,10), # Figsize to make the plot larger
                           title='Distribution of the Top 10 Authors based on Reviews', # Adding a title to the top
                           fontsize='large')
ax.set_xlabel("Reviews") # In a horizontal bar chart the values run along the x axis
ax.set_ylabel("Authors") # and the categories along the y axis
ax.invert_yaxis() # Invert the chart so the largest bar is at the top

plt.xticks(rotation = 45);


* Based on the above bar plot, these are the authors of the 10 most-reviewed bestselling books
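
The ranking above is built from the 10 most-reviewed books, so an author is ranked by individual titles. A different way to rank authors is to total the reviews across all of their books first. The sketch below shows that alternative on made-up rows; it is not the method used in this notebook.

```python
import pandas as pd

# Made-up rows: "Author X" has two books.
books = pd.DataFrame({
    "Author": ["Author X", "Author Y", "Author X", "Author Z"],
    "Reviews": [500, 800, 400, 100],
})

# Total reviews per author, then the two largest totals.
author_totals = books.groupby("Author")["Reviews"].sum().nlargest(2)

print(author_totals)
```

Here "Author X" ranks first with 900 reviews in total, even though the single most-reviewed book belongs to "Author Y".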

In [None]:
ax = top_authors.plot.barh(x='Name', y='User Rating', rot=0, color='teal',
                           figsize=(15,10), # Figsize to make the plot larger
                           title='User rating of the 10 most-reviewed books', # Adding a title to the top
                           fontsize='large')
ax.set_xlabel("User Rating") # In a horizontal bar chart the values run along the x axis
ax.set_ylabel("Name") # and the categories along the y axis
ax.invert_yaxis() # Invert the chart so the first row is at the top

* These are the user ratings of the 10 most-reviewed books
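
To rank books by user rating itself, the frame can be sorted on the rating first, with the review count as a tie-breaker. This is a rough sketch on made-up rows; the tie-breaking rule is an assumption chosen for illustration.

```python
import pandas as pd

# Made-up rows: two books share the top rating.
books = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C"],
    "User Rating": [4.9, 4.9, 4.7],
    "Reviews": [100, 900, 5000],
})

# Sort by rating, then by reviews, both descending, and keep the top two.
top_rated = books.sort_values(["User Rating", "Reviews"],
                              ascending=[False, False]).head(2)

print(top_rated)
```

"Book C" has by far the most reviews but drops out, because its rating is lower than that of the other two books.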