## Amazon Top 50 Bestselling Books Analysis

This notebook analyzes the top 50 bestselling books on Amazon for each year from 2009 to 2019. The dataset is taken from [Kaggle](https://www.kaggle.com/sootersaalu/amazon-top-50-bestselling-books-2009-2019).

### Importing the Python libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

### Importing the dataset

In [None]:
df = pd.read_csv('/kaggle/input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv')

In [None]:
df.head(15)

### Exploratory data analysis 

**Checking for null values**

In [None]:
df.isnull().sum()

There are no null values.

**Checking the number of unique book names**

In [None]:
df['Name'].nunique()

There are 351 unique book names.

In [None]:
df.info()

Since there are 550 rows in total but only 351 unique book names, some books must appear more than once.
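The reasoning above can be checked directly. The sketch below uses a small made-up table (the `toy` frame and its values are hypothetical, not the Kaggle data) to show that the row count always equals the number of unique names plus the number of repeated rows. For the real dataset the same arithmetic gives 550 - 351 = 199.

```python
import pandas as pd

# Made-up rows for illustration only; not the real dataset.
toy = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book A", "Book C", "Book B", "Book A"],
    "Year": [2009, 2009, 2010, 2010, 2011, 2011],
})

total_rows = len(toy)
unique_names = toy["Name"].nunique()
# duplicated() flags every row whose name has already appeared earlier.
repeat_rows = int(toy.duplicated("Name").sum())

# Every row is either the first appearance of a name or a repeat of one.
print(total_rows, unique_names, repeat_rows)
```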

In [None]:
df.describe()

**Checking for duplicate book names**

In [None]:
df[df.duplicated('Name')]

There are 199 rows that repeat a book name already seen earlier in the dataset.

**Removing the rows with duplicate book names**

In [None]:
df.drop_duplicates(subset='Name',inplace=True)
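It may help to see what `drop_duplicates` keeps. By default it retains the first row for each name, in file order, and discards the rest. The sketch below uses a hypothetical `toy` frame to show the effect on a per-year count: after deduplication each title is counted only in the year of its first retained row.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book A", "Book C"],
    "Year": [2009, 2009, 2010, 2010],
})

# Default keep="first": Book A keeps its 2009 row, the 2010 row is dropped.
deduped = toy.drop_duplicates(subset="Name")
print(deduped)

# Per-year counts before and after removing repeated names.
books_per_year_before = toy["Year"].value_counts().sort_index()
books_per_year_after = deduped["Year"].value_counts().sort_index()
print(books_per_year_before)
print(books_per_year_after)
```

If a year-by-year view of the full list is needed, one option is to compute it on a copy taken before the duplicates are dropped.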

In [None]:
df.head(20)

**Total number of fiction books**

In [None]:
df[df['Genre'] == 'Fiction'].count()

There are 160 fiction books.

**Total number of non-fiction books**

In [None]:
df[df['Genre'] == 'Non Fiction'].count()

There are 191 non-fiction books.
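Both totals can also be read from a single call. The sketch below, on a hypothetical `toy` frame, uses `value_counts`, which returns one count per genre sorted from most to least frequent. Because of that ordering, any labels used later are safest when taken from the result's index instead of being typed by hand.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Genre": ["Fiction", "Non Fiction", "Non Fiction", "Fiction", "Non Fiction"],
})

# One count per genre, most frequent first.
genre_counts = toy["Genre"].value_counts()
print(genre_counts)

# Equivalent single-genre count using a boolean mask.
fiction_rows = int((toy["Genre"] == "Fiction").sum())
print(fiction_rows)
```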

**Visualizing the count of fiction and non-fiction books**

In [None]:
sns.countplot(data=df,x='Genre')

**Visualizing the distribution of fiction and non-fiction genre**

In [None]:
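genre_counts = df['Genre'].value_counts()
# Take the labels from the index so each slice is named after the genre it counts
plt.pie(genre_counts,labels=genre_counts.index,explode=(0,0.1),shadow=True,autopct='%1.1f%%')
plt.title('Distribution of Genre')
plt.legend()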

**Top 5 books with highest rating**

In [None]:
df.nlargest(5,['User Rating'])

**5 books with lowest rating**

In [None]:
df.nsmallest(5,['User Rating'])

The most expensive book in the dataset is 'Diagnostic and Statistical Manual of Mental Disorders, 5th Edition: DSM-5' by the American Psychiatric Association.
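A lookup of this kind can be sketched as follows. The `toy` frame and its prices are made up; only the column names `Name`, `Author` and `Price` come from the notebook. `idxmax` returns the row label of the highest price, and `nlargest` gives a ranked view when more than one row is wanted.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C"],
    "Author": ["Author X", "Author Y", "Author Z"],
    "Price": [8, 105, 15],
})

# Row with the single highest price.
most_expensive = toy.loc[toy["Price"].idxmax()]
print(most_expensive["Name"], most_expensive["Price"])

# The two most expensive rows, highest first.
top_two = toy.nlargest(2, "Price")
print(top_two)
```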

**Mean price of books by genre**

In [None]:
# Select the numeric column first; averaging text columns fails in pandas 2.x
df.groupby('Genre')['Price'].mean()
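Selecting the `Price` column before aggregating keeps text columns such as `Name` out of the calculation, which pandas 2.x would otherwise reject. As a minimal sketch on a hypothetical `toy` frame, the same pattern can report the median and the group size next to the mean, which helps when a few expensive titles pull the average up.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C", "Book D", "Book E"],
    "Genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction", "Non Fiction"],
    "Price": [10, 20, 5, 10, 45],
})

# Pick the numeric column, then compute several summaries per genre.
price_by_genre = toy.groupby("Genre")["Price"].agg(["mean", "median", "count"])
print(price_by_genre)
```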

**Prices of books over the years**

In [None]:
sns.lineplot(data=df,x='Year',y='Price',color='red',marker='o')
plt.title('Prices of books over the years')

* Prices remained almost the same during 2011-2013 and decreased sharply after 2013.
* Book prices rose sharply after 2015.
* Prices were highest in 2016 and lowest around 2015.

**Mean prices of books over the years**

In [None]:
# Select the numeric column first; averaging text columns fails in pandas 2.x
p = df.groupby('Year')['Price'].mean()

In [None]:
# Set the figure size before plotting; calling plt.figure afterwards opens a new, empty figure
plt.figure(figsize=(10,5))
sns.lineplot(data=p,marker='o')

**Bestselling author**

In [None]:
df2 = df['Author'].value_counts().reset_index().head()
df2.columns = ['Author', 'No. of Bestsellers']
df2

Jeff Kinney is the author with the most bestsellers.
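The ranking counts how many rows each author has in the table. A minimal sketch of the same idea on a hypothetical `toy` frame is shown below. It names the columns through `rename_axis` and `reset_index(name=...)`, which is one way to avoid relying on the default column names that `reset_index` produces in different pandas versions.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C", "Book D", "Book E", "Book F"],
    "Author": ["Author X", "Author Y", "Author X", "Author Z", "Author X", "Author Y"],
})

# Count rows per author, most frequent first, and keep the top two.
top_authors = (
    toy["Author"]
    .value_counts()
    .rename_axis("Author")
    .reset_index(name="No. of Bestsellers")
    .head(2)
)
print(top_authors)
```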

**Books per year**

In [None]:
df3 = df['Year'].value_counts().sort_values(ascending=False).head().reset_index()
df3.columns = ['Year', 'No. of Books']
df3

**Relationship between year and number of books**

In [None]:
sns.lineplot(x='Year',y='No. of Books',data=df3,linestyle=':')
plt.title('Year vs No. of Books')
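The table above keeps only the five years with the most books. For a trend across every year, one possible variation is to count all years and sort them chronologically. The sketch below does this on a hypothetical `toy` frame; the resulting table could be passed to `sns.lineplot` in the same way.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({"Year": [2011, 2009, 2010, 2009, 2011, 2011]})

# Count books per year, then order the result by year instead of by count.
books_per_year = (
    toy["Year"]
    .value_counts()
    .sort_index()
    .rename_axis("Year")
    .reset_index(name="No. of Books")
)
print(books_per_year)
```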

**Pairplot on the dataset**

In [None]:
sns.pairplot(data=df)
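A pair plot draws only the numeric columns, so text columns such as `Name` and `Genre` are left out. As a numeric companion to the plot, the sketch below selects the numeric columns of a hypothetical `toy` frame and computes their pairwise correlations. The values are made up; only the column names come from the notebook.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C", "Book D"],
    "Genre": ["Fiction", "Non Fiction", "Fiction", "Non Fiction"],
    "User Rating": [4.7, 4.6, 4.8, 4.4],
    "Price": [8, 16, 24, 32],
    "Year": [2009, 2010, 2011, 2012],
})

# Keep only numeric columns, the same ones a pair plot would draw.
numeric = toy.select_dtypes(include="number")

# Pairwise Pearson correlation between the numeric columns.
corr = numeric.corr()
print(corr)
```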