- Recently I got a lot of feedback from dear friends who have just changed, or are about to change, their careers towards Data Analysis / Data Science and Machine Learning. They pointed to the lack of material between the start of the analysis journey and the advanced techniques.

- They are looking for Exploratory Data Analysis examples that are detailed but still beginner friendly and not too complicated (no long detours into different regression or normalization techniques, etc.). The examples should show them how to start and, most importantly, how to read descriptive statistics and graphs.

- After getting this feedback, I decided to make a series of EDAs on different datasets, without making things too complicated for people at the first steps of their DS/ML journey.



**This notebook is part of the 9 Beginner Friendly EDAs. If these EDAs are helpful to anyone, I will be more than happy.**



### **INTRO**

- In this study, we are going to do an Exploratory Data Analysis (EDA) of Amazon.com's bestselling books.
- The study aims to be beginner friendly and to explain each step along the way as much as possible.
- The dataset has 550 books along with their ratings, review counts, prices, years, author names and genres.
- The data covers bestsellers from 2009 to 2019.

- Let's import the required libraries.

In [None]:
import pandas as pd
import numpy as np


import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

### Overview Stage

- Read the csv
- Look for basic information about the dataset

In [None]:
df = pd.read_csv('../input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv')
df.head()

In [None]:
df.shape

- So we have 550 books and 7 features to work on.

In [None]:
df.isnull().sum()

- We have a very clean dataset, which is very rare in the real world.
- So enjoy working with data that has no missing values in it.

In [None]:
df.info()

- We have 4 numeric variables.
- We also have 3 non-numeric variables.
- As for the data types, everything seems quite OK.
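- Instead of reading the column types off `df.info()`, we can also ask pandas for them directly with `select_dtypes`. Below is a minimal sketch on a tiny made-up frame that only borrows the column names of our dataset; the values are invented for illustration and `sample` is a hypothetical name.

```python
import pandas as pd

# Tiny made-up sample with the same column names as the dataset (values are invented)
sample = pd.DataFrame({
    'Name': ['Book A', 'Book B'],
    'Author': ['Author X', 'Author Y'],
    'User Rating': [4.7, 4.1],
    'Reviews': [1200, 350],
    'Price': [8, 15],
    'Year': [2016, 2011],
    'Genre': ['Fiction', 'Non Fiction'],
})

# Split the column names by data type
numeric_cols = sample.select_dtypes(include='number').columns.tolist()
non_numeric_cols = sample.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```

- On the real data you would call the same two lines on `df` instead of `sample`.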

In [None]:
df.describe()

Before going further, let's summarize what we have got from the dataset.

- Our dataset has 550 books from different authors and genres.

- The object-type variable Genre can be used to group the books and see the differences between the groups.

- The Reviews and Price columns most probably have outliers (look at the mean-median difference, the gap between the 75% value and the maximum, and the gap between the 25% value and the minimum).

- The numerical variables deserve special attention in the further analysis.


- Everything seems OK.  Let's move on to the next step: **analysis part**.
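- The outlier hint we read from `describe()` can also be checked with the common 1.5 x IQR rule. The sketch below is only one possible way to do it: `iqr_outliers` is a hypothetical helper and the review counts are made up, not taken from the dataset.

```python
import pandas as pd

def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the values that lie more than 1.5 * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]

# Made-up review counts with one obviously large value
reviews = pd.Series([100, 120, 130, 150, 160, 180, 200, 5000])

outliers = iqr_outliers(reviews)
print(outliers)
```

- On the real data, something like `iqr_outliers(df['Reviews'])` should list the candidate outliers before we even draw a plot.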

### Analysis Part

#### **Author**

In [None]:
df['Author'].nunique()

- There are 248 different authors in the dataset.

#### **Genre**

In [None]:
df['Genre'].value_counts(normalize=True)

- I was expecting more genres, so this is quite surprising.
- Anyway, Genre is still useful for seeing the differences between the two categories.

In [None]:
fig = px.histogram(df, x="Genre", title='Genre')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

#### **User Rating**

In [None]:
df['User Rating'].describe()

- The mean and median scores are quite close to each other (median = 4.7, mean = 4.618).
- Since the median is bigger than the mean, we can expect outliers on the minimum side.
- Most probably we will have a left-skewed distribution.
- But we can still expect the distribution to be close to normal.
- Let's see it.

In [None]:
fig = px.histogram(df, x= 'User Rating', title='User Rating', marginal="box", hover_data = df[['Name','Author']])
fig.show()

- Yep, as expected we have several outliers on the minimum side.
- The distribution is slightly left skewed, but still close to a normal distribution.
- Oh wait...
- Oh no!! J.K. Rowling's 'The Casual Vacancy' got the lowest rating.
#### **Reviews**

In [None]:
df['Reviews'].describe()

- We have a huge difference between the mean and the median (mean = 11953, median = 8580).
- Since the mean is well above the median, we can expect a highly right-skewed distribution with outliers on the maximum side.
- Let's see it.

In [None]:
fig = px.histogram(df, x= 'Reviews', title='Reviews', marginal="box", hover_data = df[['Name','Author']])
fig.show()

- As expected, we have a highly right-skewed distribution with outliers on the maximum side.

- By the way, I saw that Kristin Hannah's 'The Nightingale' is among the outliers with more than 49K reviews. It is a beautiful novel to read.

#### **Price**

In [None]:
df['Price'].describe()

- We can expect a slightly right-skewed distribution.
- Still, the distribution should be close to normal.
- We can expect outliers on the maximum side.
- Yeah, we also have free books.
- Let me note them down and check them after the analysis. If any of them are still free, I would be happy to have them.

In [None]:
fig = px.histogram(df, x= 'Price', title='Price', marginal="box", hover_data = df[['Name','Author']])
fig.show()

- As expected, we have a slightly right-skewed distribution with outliers on the maximum side.
- We also have 13 books in the $0-1 range. 13 free books, sounds good to me.
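- A histogram bar covers a price range, so to see which rows are really priced at 0 we can filter them directly. The sketch below uses a tiny made-up frame (invented names and prices) and also counts the unique titles, because the same book can be on the list in more than one year.

```python
import pandas as pd

# Made-up sample: 'Book A' appears twice, as a book would if it made the list in two years
sample = pd.DataFrame({
    'Name': ['Book A', 'Book A', 'Book B', 'Book C'],
    'Year': [2013, 2014, 2014, 2015],
    'Price': [0, 0, 8, 15],
})

# Rows with a price of exactly 0
free_books = sample[sample['Price'] == 0]

print(len(free_books))               # number of free rows
print(free_books['Name'].nunique())  # number of distinct free titles
```

- On the real data, `df[df['Price'] == 0]` would list the free books with their authors and years.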

- Before moving on to the details, let's see the correlation matrix for our dataset.

In [None]:
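# Correlate only the numeric columns (Year is left out, as before);
# recent pandas versions raise an error if text columns are passed to corr()
df[['User Rating', 'Reviews', 'Price']].corr()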

In [None]:
index_vals = df['Genre'].astype('category').cat.codes

fig = go.Figure(data=go.Splom(
                dimensions=[dict(label='User Rating',
                                 values=df['User Rating']),
                            dict(label='Reviews',
                                 values=df['Reviews']),
                            dict(label='Price',
                                 values=df['Price'])],
                showupperhalf=False, 
                text=df['Name'],
                marker=dict(color=index_vals,
                            showscale=False, # colors encode categorical variables
                            line_color='white', line_width=0.5)
                ))


fig.update_layout(
    title='Books',
    width=1000,
    height=1000,
)

fig.show()

- There isn't any significant correlation to consider for further analysis.
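- One detail worth knowing: `corr()` uses the Pearson method by default, which measures linear relationships and can be affected by skewed columns with outliers. A rank-based alternative is `corr(method='spearman')`. The sketch below uses made-up numbers (not our dataset) where one column always grows with the other, but not in a straight line.

```python
import pandas as pd

# Made-up data: 'y' always increases with 'x', but the last value jumps
toy = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [1, 4, 9, 16, 100],
})

pearson = toy.corr()                     # default method: linear correlation
spearman = toy.corr(method='spearman')   # rank-based correlation

print(pearson.loc['x', 'y'])
print(spearman.loc['x', 'y'])
```

- If both methods agree on the real data, as a quick check with `method='spearman'` would show, the "no significant correlation" reading is on firmer ground.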

- Now that we have the overall picture of the data, we can go into more detail.

In [None]:
# Count the books per year and genre; Genre stays as the index, Year becomes a column
genre_by_year = df.groupby(['Year', 'Genre']).size().rename('Genre count').reset_index(level='Year')
genre_by_year

#### **Book Genre in Each Year**

In [None]:
fig = px.line(genre_by_year, x='Year', y='Genre count', color= genre_by_year.index, title='Books By Genre in Each Year')
fig.show()

- The number of non-fiction books decreased sharply in 2014 and then increased sharply in 2015.
- The number of fiction books increased significantly in 2014 and then decreased sharply in 2015.
- The counts of both fiction and non-fiction books are inconsistent from year to year.
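- If you prefer to read these counts from a table instead of a line chart, `pd.crosstab` puts the years in rows and the genres in columns. A minimal sketch on made-up rows (the years and genres are only there for illustration):

```python
import pandas as pd

# Made-up rows: one row per bestseller entry
sample = pd.DataFrame({
    'Year': [2014, 2014, 2014, 2015, 2015, 2015],
    'Genre': ['Fiction', 'Fiction', 'Non Fiction',
              'Fiction', 'Non Fiction', 'Non Fiction'],
})

# Rows = years, columns = genres, cells = number of books
genre_table = pd.crosstab(sample['Year'], sample['Genre'])
print(genre_table)
```

- On the real data, `pd.crosstab(df['Year'], df['Genre'])` would give one row per year from 2009 to 2019.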

#### **Price of Books in Each Year**

In [None]:
fig = px.scatter(df, x='Year', y='Price', title='Price of the Books in Each Year', hover_data = df[['Name','Author']])
fig.show()

- Book prices stay in roughly the same range from year to year, with several outliers.
- In 2013 and 2014 we have books priced at $105 from the American Psychiatric Association.

#### **Number of Reviews in Each Year**

In [None]:
fig = px.scatter(df, x='Year', y='Reviews', title='Number of Reviews in Each Year', color='Genre',hover_data = df[['Name','Author']])
fig.show()

- The distribution is quite similar in each year, especially after 2010.
- Several outliers affect the distribution, as we mentioned before.

#### **User Rating in Each Year**

In [None]:
fig = px.scatter(df, x='Year', y='User Rating', title='User Rating in Each Year',color='Genre', hover_data = df[['Name','Author']])
fig.show()

- User rating has almost the same distribution in each year, with quite a few outliers.

### **Top 20 Highly Rated Books**

In [None]:
top_20 = df.sort_values('User Rating', ascending=False)[:20]
top_20

In [None]:
fig = px.bar(top_20, x='Name', y= 'User Rating',  hover_data = top_20[['Year','Genre', 'Price']], color='Genre')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Several books, counted once for each year they were on the list, got the highest ratings from the readers.
- Only one non-fiction book finds a place in the top 20 highly rated list.
- Also, the maximum price among the books in the top 20 list is $10.
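- Because the same book can be on the bestseller list in several years, it can fill more than one slot in a top 20. If you want each title only once, one option is to drop the repeated names before sorting. The sketch below is an assumption about how one might do it, on made-up rows:

```python
import pandas as pd

# Made-up sample: 'Book A' made the list in two different years
sample = pd.DataFrame({
    'Name': ['Book A', 'Book A', 'Book B', 'Book C'],
    'Year': [2017, 2018, 2018, 2019],
    'User Rating': [4.9, 4.9, 4.8, 4.6],
})

# Keep the first row of each title, then rank by rating
unique_books = sample.drop_duplicates(subset='Name')
top_2 = unique_books.sort_values('User Rating', ascending=False).head(2)
print(top_2)
```

- Note that `drop_duplicates` keeps the first row it sees for each title, so the year and price of that row are the ones that survive.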

#### **Lowest Rated 20 Books**

In [None]:
bottom_20 = df.sort_values('User Rating')[:20]
bottom_20

In [None]:
fig = px.bar(bottom_20, x='Name', y= 'User Rating',  hover_data = bottom_20[['Year','Genre', 'Price']], color='Genre')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- The lowest rating is 3.3.
- Only one non-fiction book is in the list.
- The maximum price among the books in the list is $20.

#### **Top 20 Reviewed Books**

In [None]:
top_20_reviews = df.sort_values('Reviews', ascending=False)[:20]
top_20_reviews

In [None]:
fig = px.bar(top_20_reviews, x='Name', y= 'Reviews',  hover_data = top_20_reviews[['Year','Genre', 'Price']], color='Genre')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Several books, with their different editions, got the most reviews.
- Only one non-fiction book finds a place in the list.
- The most expensive book in the list is one of my favorites, 'The Alchemist', at $35.

#### **20 Least Reviewed Books**

In [None]:
bottom_20_reviews = df.sort_values('Reviews')[:20]
bottom_20_reviews

In [None]:
fig = px.bar(bottom_20_reviews, x='Name', y= 'Reviews',  hover_data = bottom_20_reviews[['Year','Genre', 'Price']], color='Genre')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Two different books by Zhi Gang Sha got the fewest reviews.
- Only 3 fiction books are in the least reviewed list.
- The other books come from the non-fiction genre.
- We can make an assumption about this, but we still need other data to support it.

- OK, let's see the top 25 authors in our dataset.

#### **Top 25 Authors** 

In [None]:
top_25_authors = df['Author'].value_counts()[:25]
top_25_authors

In [None]:
fig = px.bar(top_25_authors, x= top_25_authors.index, y=top_25_authors.values, title='Top 25 Authors',labels={'y':'Number of Books', 'index':'Author'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Let's see the user ratings and review counts of these top 25 authors.

In [None]:
top_25_authors_ratings = df[df['Author'].isin(top_25_authors.index)][['Author','User Rating', 'Reviews']]
top_25_authors_ratings_grouped=top_25_authors_ratings.groupby('Author')[['User Rating','Reviews']].mean().sort_values('Reviews', ascending=False)

In [None]:
fig = px.bar(top_25_authors_ratings_grouped, x= top_25_authors_ratings_grouped.index, y='User Rating', title='Top 25 Authors with User Rating')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- As expected, they have very high user ratings (between 4 and 4.9).
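- A mean alone does not tell us how many books stand behind each author. With named aggregation in `groupby(...).agg(...)` we can put the book count and the means side by side. This is a sketch on made-up rows; the output column names `books`, `mean_rating` and `mean_reviews` are my own choice.

```python
import pandas as pd

# Made-up sample with invented authors and values
sample = pd.DataFrame({
    'Author': ['Author X', 'Author X', 'Author Y'],
    'Name': ['Book A', 'Book B', 'Book C'],
    'User Rating': [4.8, 4.6, 4.9],
    'Reviews': [1000, 3000, 500],
})

# One row per author: number of books, mean rating and mean review count
author_summary = (
    sample.groupby('Author')
    .agg(
        books=('Name', 'count'),
        mean_rating=('User Rating', 'mean'),
        mean_reviews=('Reviews', 'mean'),
    )
    .sort_values('books', ascending=False)
)
print(author_summary)
```

- On the real data, the same `agg` call on `df` filtered to the top 25 authors would give one summary table for both bar charts.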

In [None]:
# Mean rating and mean review count per author, sorted by reviews
top_25_authors_reviews_grouped = top_25_authors_ratings.groupby('Author')[['User Rating', 'Reviews']].mean().sort_values('Reviews', ascending=False)
fig = px.bar(top_25_authors_reviews_grouped, x=top_25_authors_reviews_grouped.index, y='Reviews', title='Top 25 Authors with Reviews')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

## This notebook is a part of the 9 Beginner Friendly EDAs
## If you like this one, you can also check out other notebooks in the Beginner Friendly EDAs series!

* [Data Analyst Jobs - EDA](https://www.kaggle.com/kaanboke/plotly-data-analyst-jobs)
* [Top Games on Google Play Store](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-top-games)
* [Hollywood Top Movies- EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-top-movies)
* [UDEMY Courses EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-udemy)
* [World Happiness Report - EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-eda)
* [Countries Life Expectancy](https://www.kaggle.com/kaanboke/plotly-beginner-friendly)
* [Netflix Movies- EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-netflix)
* [London Bike Sharing - EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-london-bike)

- Thanks to the dataset contributor for this data. I really enjoyed working on it.

- It was quite a pleasure to share this detailed, beginner friendly EDA with you. Thanks for your time.

- All the best