![](https://www.adazing.com/wp-content/uploads/2012/03/agold-book.png)

# <font color='mediumvioletred'>Motivation</font>
Hello everyone! I am a beginner in data science and ML. I created this notebook to improve my data visualization skills in Python. Below, I experiment with different chart types, aesthetics, and so on. Please feel free to point out my mistakes and suggest topics I can read to improve. Thanks!

# <a id=toc><font color='teal'>Table of Contents</font></a>
1. [Insights](#insights)
2. [Essential Libraries and Custom Functions](#lib)
3. [Let's look at the data](#data)
4. [Add Additional Features](#add_feat)
5. [Exploratory Data Analysis](#eda)

# <a id=insights><font color='firebrick'>Insights</font></a>

[Back to index](#toc)

- <div class="alert alert-block alert-warning">Non-Fiction was more popular than Fiction every year from 2009 to 2019. Of the 351 unique books, 54.4 percent were Non-Fiction and the remaining 45.6 percent were Fiction. Non-Fiction reached its highest yearly share (66 percent) in 2015, which was also the lowest year for Fiction. Fiction reached its highest yearly share (48 percent) in 2009, 2013 and 2017, which were also the lowest years for Non-Fiction.</div>
- <div class="alert alert-block alert-warning">Author `Jeff Kinney` is the top-selling author, with 12 appearances in the top-selling books from 2009 to 2019. However, author `EL James` has the highest number of reviews on her books. Only 2 authors, DK and Scholastic, have books in both genres.</div>
- <div class="alert alert-block alert-warning">The median and mean name length of fiction books is lower than that of non-fiction books.</div>
- <div class="alert alert-block alert-warning">There are 9 unique books with a price of zero dollars. Either these books are free or the price is an anomaly. Except for 2009, the average price of Non-Fiction books is higher than that of Fiction books each year.</div>
- <div class="alert alert-block alert-warning">None of the non-fiction books has a rating below 4.</div>
- <div class="alert alert-block alert-warning">Except for 2012 and 2013, the average user rating of Fiction books is higher than that of Non-Fiction books each year. Also, the total reviews of Non-Fiction books are higher than those of Fiction books, except for 2018 and 2019.</div>

# <a id=lib><font color='steelblue'>Essential Libraries and custom function</font></a>
[Back to index](#toc)

<div class="alert alert-block alert-warning">
    The default seaborn version on Kaggle is 0.10.0; the cell below upgrades it to the latest version, 0.11.0.
    </div>

In [None]:
%pip install seaborn --upgrade

In [None]:
import pandas as pd # dataframe manipulation
import numpy as np # linear algebra

# data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print('Seaborn version', sns.__version__)
sns.set_style('whitegrid')

# text data
import string
import re

# pie chart labels
def make_autopct(values):
    def my_autopct(pct):
        total = sum(values)
        val = int(round(pct*total/100.0))
        return '{p:.2f}%\n({v:d})'.format(p=pct,v=val)
    return my_autopct
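
As a quick check of what `make_autopct` produces, the sketch below calls the returned formatter by hand. The sample counts are made up for illustration and are not taken from the dataset.

```python
def make_autopct(values):
    """Build a pie-chart label formatter that shows the percentage and the absolute count."""
    def my_autopct(pct):
        total = sum(values)
        val = int(round(pct * total / 100.0))  # convert the percentage back to a count
        return '{p:.2f}%\n({v:d})'.format(p=pct, v=val)
    return my_autopct

# Hypothetical counts for two categories (illustrative only)
sample_counts = [30, 10]
formatter = make_autopct(sample_counts)

# matplotlib passes each wedge's percentage to the formatter
label = formatter(75.0)
print(label)
```

The closure is needed because matplotlib's `autopct` callback only receives the percentage, so the original counts have to be captured from the enclosing function.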

# <a id=data><font color='purple'>Let's look at the data...</font></a>
[Back to index](#toc)

In [None]:
df = pd.read_csv('../input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv')
df.head()

## Rename column

In [None]:
# rename column
df.rename(columns={"User Rating": "User_Rating"}, inplace=True)

## Dataset Shape

In [None]:
print(f"The dataset consists of {df.shape[0]} rows and {df.shape[1]} columns.")

## Missing Values

<div class="alert alert-block alert-warning">There are no missing values in any column.</div>

In [None]:
df.isnull().sum()

## Data Anomalies

1. <div class="alert alert-block alert-warning"><b>Misspelled Author Name:</b> The name of author `J.K. Rowling` is misspelled in two books, with an extra space between the initials J and K.</div>

In [None]:
# incorrect spelling
df[df.Author == 'J. K. Rowling']

In [None]:
# correct spelling
df[df.Author == 'J.K. Rowling']

2. <div class="alert alert-block alert-warning"><b>Books with a price of zero dollars:</b> There are 9 unique books with a price of zero dollars.</div>

In [None]:
df[df['Price'] == 0]

## Fix Anomaly

In [None]:
df.loc[df.Author == 'J. K. Rowling', 'Author'] = 'J.K. Rowling'
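
The fix above targets one known spelling. As a rough sketch of a more general approach, the hypothetical helper `normalize_initials` below (not part of the original notebook) uses a regular expression to remove the whitespace between consecutive initials. The pattern is an assumption about how such names are written.

```python
import re

def normalize_initials(name):
    """Remove whitespace between consecutive initials, e.g. 'J. K.' -> 'J.K.'.

    Hypothetical helper for illustration only.
    """
    # Drop the spaces after a period when the next token is another initial
    return re.sub(r'\.\s+(?=[A-Z]\.)', '.', name.strip())

fixed = normalize_initials('J. K. Rowling')
unchanged = normalize_initials('Jeff Kinney')
print(fixed, '|', unchanged)
```

It could be applied with `df['Author'] = df['Author'].apply(normalize_initials)`. Note that it would also rewrite any other author whose initials are separated by spaces, so it is worth inspecting the affected rows first.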

# <a id=add_feat><font color='teal'>Add Additional Features</font></a>
[Back to index](#toc)

1. <div class="alert alert-block alert-warning"><b>Book name length:</b> The number of characters in a book name, excluding the spaces between words.</div>

In [None]:
# length of text in book name
df['name_len'] = df['Name'].apply(lambda x: len(x) - x.count(" ")) # subtract whitespaces

2. <div class="alert alert-block alert-warning"><b>Punctuation percentage:</b> The share of punctuation characters in a book name, as a percentage of its non-space characters.</div>

In [None]:
punctuations = string.punctuation
print('list of punctuations : ', punctuations)

In [None]:
# percentage of punctuation characters
def count_punc(text):
    """Return the punctuation characters in text as a percentage of its non-space characters."""
    count = sum(1 for char in text if char in punctuations)
    return round(count/(len(text) - text.count(" "))*100, 3)

# apply function
df['punc%'] = df['Name'].apply(count_punc)
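
To make the punctuation percentage concrete, here is a small worked example on a made-up title. The function name `punctuation_percentage` and the guard for empty input are additions for this sketch, not part of the notebook.

```python
import string

def punctuation_percentage(text):
    """Percentage of non-space characters in text that are punctuation."""
    non_space = len(text) - text.count(" ")
    if non_space == 0:  # guard added for this sketch: empty or all-space input
        return 0.0
    count = sum(1 for char in text if char in string.punctuation)
    return round(count / non_space * 100, 3)

# Made-up title: 12 non-space characters, 2 of them punctuation (',' and '!')
example = punctuation_percentage("Hello, World!")
print(example)
```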

# <a id=eda><font color='orangered'>Exploratory Data Analysis</font></a>
[Back to index](#toc)

## 1. Genre
- <div class="alert alert-block alert-warning">Genre is a nominal categorical variable with 2 classes: Fiction and Non-Fiction.</div>
- <div class="alert alert-block alert-warning">Non-Fiction was more popular than Fiction every year from 2009 to 2019. Of the 351 unique books, 54.4 percent were Non-Fiction and the remaining 45.6 percent were Fiction.</div>
- <div class="alert alert-block alert-warning">Non-Fiction reached its highest yearly share (66 percent) in 2015, which was also the lowest year for Fiction. Fiction reached its highest yearly share (48 percent) in 2009, 2013 and 2017, which were also the lowest years for Non-Fiction.</div>

In [None]:
# remove duplicate book names
no_dup = df.drop_duplicates('Name')
# data
g_count = no_dup['Genre'].value_counts()
genre_col = ['navy','crimson']
#plot
fig, ax = plt.subplots(figsize=(8, 8))
center_circle = plt.Circle((0, 0), 0.7, color='white')
ax.pie(x=g_count.values, labels=g_count.index, autopct=make_autopct(g_count.values),
       startangle=90, textprops={'size': 15}, pctdistance=0.5, colors=genre_col)
ax.add_artist(center_circle)  # white circle turns the pie into a donut
fig.suptitle('Distribution of Genre for all unique books from 2009 to 2019', fontsize=20)
plt.show()

In [None]:
g_count = df['Genre'].value_counts() # includes duplicate books
fig, axes = plt.subplots(2, 6, figsize=(10,5))
axes = axes.ravel()
axes[0].pie(x=g_count.values, labels=None, autopct='%1.1f%%',
            startangle=90, textprops={'size': 11, 'color': 'white'},
            pctdistance=0.5, radius=1.3, colors=genre_col)
axes[0].set_title('2009 - 2019\n(Overall)', color='darkgreen', fontdict={'fontsize': 15})
for ax, year in zip(axes[1:], np.arange(2009, 2020)):
    counts = df[df['Year'] == year]['Genre'].value_counts()
    ax.pie(x=counts.values, labels=None, autopct='%1.1f%%', 
                  startangle=90, textprops={'size': 10,'color': 'white'}, 
                  pctdistance=0.5, colors=genre_col, radius=1.1)
    ax.set_title(year, color='darkred', fontdict={'fontsize': 15})
plt.suptitle('Distribution of Fiction and Non-Fiction books each year',fontsize=20)
fig.legend(g_count.index, loc='center right', fontsize=10)
plt.show()
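
The yearly percentages shown in the pie charts can also be tabulated directly. The sketch below uses a tiny made-up table in place of the real dataset and assumes the same `Year` and `Genre` column names.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    'Year':  [2009, 2009, 2009, 2009, 2010, 2010],
    'Genre': ['Fiction', 'Non Fiction', 'Non Fiction', 'Non Fiction',
              'Fiction', 'Non Fiction'],
})

# Row-normalised crosstab: each row sums to 1, so the values are yearly shares
genre_share = pd.crosstab(toy['Year'], toy['Genre'], normalize='index') * 100
print(genre_share.round(1))
```

A table like this makes it easier to read off the highest and lowest share for each genre than comparing eleven small pies by eye.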

## 2. Author

In [None]:
print(f"There are total {len(no_dup.Author.unique())} unique authors in list of top selling books from 2009 to 2019.")

### 2.1 Top 10 best selling authors for Non Fiction and Fiction Books
<div class="alert alert-block alert-warning">The best-selling authors are selected based on their appearances in the top 50 best-selling books each year, from 2009 to 2019.</div>

In [None]:
# count appearances per author and genre, then keep the top 10 of each genre
best_nf_authors = df.groupby(['Author', 'Genre']).agg({'Name': 'count'}).unstack()['Name', 'Non Fiction'].sort_values(ascending=False)[:10]
best_f_authors = df.groupby(['Author', 'Genre']).agg({'Name': 'count'}).unstack()['Name', 'Fiction'].sort_values(ascending=False)[:10]

with plt.style.context('Solarize_Light2'):
    fig, ax = plt.subplots(1, 2, figsize=(6, 7))
    ax[0].barh(y=best_nf_authors.index, width=best_nf_authors.values,
           color=genre_col[0])
    ax[0].invert_xaxis()
    ax[0].yaxis.tick_left()
    ax[0].set_xticks(np.arange(max(best_nf_authors.values)+1))
    ax[0].set_yticklabels(best_nf_authors.index, fontweight='semibold')
    ax[0].set_xlabel('Number of appearances')
    ax[0].set_title('Non Fiction Authors')
    
    ax[1].barh(y=best_f_authors.index, width=best_f_authors.values,
           color=genre_col[1])
    ax[1].yaxis.tick_right()
    ax[1].set_xticks(np.arange(max(best_f_authors.values)+1))
    ax[1].set_yticklabels(best_f_authors.index, fontweight='semibold')
    ax[1].set_title('Fiction Authors')
    ax[1].set_xlabel('Number of appearances')
    
    fig.legend(['Non Fiction', 'Fiction'], fontsize=12)
    
plt.show()

### 2.2 Top 20 best selling authors
- <div class="alert alert-block alert-warning">The best-selling authors are selected based on their appearances in the top 50 best-selling books each year. The number of appearances includes duplicate book names. Their unique publications and total reviews are shown below.</div>
- <div class="alert alert-block alert-warning">Author `Jeff Kinney` is the top-selling author, with 12 appearances in the top-selling books from 2009 to 2019.</div>

In [None]:
n_best = 20

top_authors = df.Author.value_counts().nlargest(n_best)
no_dup = df.drop_duplicates('Name') # removes all rows with duplicate book names

fig, ax = plt.subplots(1, 3, figsize=(10,10), sharey=True)

color = sns.color_palette("hls", n_best)

ax[0].hlines(y=top_authors.index , xmin=0, xmax=top_authors.values, color=color, linestyles='dashed')
ax[0].plot(top_authors.values, top_authors.index, 'go', markersize=9)
ax[0].set_xlabel('Number of appearances')
ax[0].set_xticks(np.arange(top_authors.values.max()+1))
ax[0].set_yticklabels(top_authors.index, fontweight='semibold')
ax[0].set_title('Appearances')

book_count = []
total_reviews = []
for name, col in zip(top_authors.index, color):
    book_count.append(len(no_dup[no_dup.Author == name]['Name']))
    total_reviews.append(no_dup[no_dup.Author == name]['Reviews'].sum()/1000)
ax[1].hlines(y=top_authors.index , xmin=0, xmax=book_count, color=color, linestyles='dashed')
ax[1].plot(book_count, top_authors.index, 'go', markersize=9)
ax[1].set_xlabel('Number of unique books')
ax[1].set_xticks(np.arange(max(book_count)+1))
ax[1].set_title('Unique books')

ax[2].barh(y=top_authors.index, width=total_reviews, color=color, edgecolor='black', height=0.7)
for name, val in zip(top_authors.index, total_reviews):
    ax[2].text(val+2, name, val)
ax[2].set_xlabel("Total Reviews (in 1000's)")
ax[2].set_title('Total reviews')
plt.show()

### 2.3 Top 20 Authors with highest reviews
<div class="alert alert-block alert-warning">Author `EL James` has the highest number of reviews on her books.</div>

In [None]:
n_best=20
top_reviews = no_dup.groupby('Author').agg({'Reviews': 'sum'})['Reviews'].nlargest(n_best)
with plt.style.context('bmh'):
    plt.figure(figsize=(10,12))
    plt.barh(y=top_reviews.index, width=top_reviews.values, height=0.7)
    for name, val in zip(top_reviews.index, top_reviews.values):
        plt.text(val/2, name, val, ha='center', color='white', fontsize=12)
    plt.yticks(fontweight='semibold')
    plt.xlabel('Total Reviews')
    plt.show()

### 2.4 Top 20 authors with long book names

In [None]:
n_best = 20

long_title = no_dup.groupby('Author').agg({'name_len': 'mean'})['name_len'].nlargest(n_best)

fig, ax = plt.subplots(1, 3, figsize=(10,10), sharey=True)

color = sns.color_palette("hls", n_best)

ax[0].barh(y=long_title.index, width=long_title.values, color=color, edgecolor='black')
for name, val in zip(long_title.index, long_title.values):
    ax[0].text(val+2, name, int(val))
ax[0].set_xlabel('Length of text in book name')
ax[0].set_title('Average text length')
ax[0].set_yticklabels(long_title.index, fontweight='semibold')

book_count = []
total_reviews = []
for name, col in zip(long_title.index, color):
    book_count.append(len(no_dup[no_dup.Author == name]['Name']))
    total_reviews.append(no_dup[no_dup.Author == name]['Reviews'].sum())
# use the authors of this chart (long_title), not top_authors from section 2.2
ax[1].hlines(y=long_title.index , xmin=0, xmax=book_count, color=color, linestyles='dashed')
ax[1].plot(book_count, long_title.index, 'go', markersize=9)
ax[1].set_xlabel('Number of unique books')
ax[1].set_xticks(np.arange(max(book_count)+1))
ax[1].set_title('Unique books')

ax[2].barh(y=long_title.index, width=total_reviews, color=color, edgecolor='black', height=0.7)
for name, val in zip(long_title.index, total_reviews):
    ax[2].text(val+1000, name, val)
ax[2].set_xlabel("Total Reviews")
ax[2].set_title('Total reviews')
plt.show()

### 2.5 Top 20 authors using a high percentage of punctuation in book names

In [None]:
n_best = 20

high_punc = no_dup.groupby('Author').agg({'punc%': 'mean'})['punc%'].nlargest(n_best)

fig, ax = plt.subplots(1, 3, figsize=(10,10), sharey=True)

color = sns.color_palette("hls", n_best)

ax[0].barh(y=high_punc.index, width=high_punc.values, color=color, edgecolor='black')
for name, val in zip(high_punc.index, high_punc.values):
    ax[0].text(val+1, name, round(val,2))
ax[0].set_xlabel('Punctuation percentage')
ax[0].set_title('Average punctuation percentage')
ax[0].set_yticklabels(high_punc.index, fontweight='semibold')

book_count = []
total_reviews = []
for name, col in zip(high_punc.index, color):
    book_count.append(len(no_dup[no_dup.Author == name]['Name']))
    total_reviews.append(no_dup[no_dup.Author == name]['Reviews'].sum())
# use the authors of this chart (high_punc), not top_authors from section 2.2
ax[1].hlines(y=high_punc.index , xmin=0, xmax=book_count, color=color, linestyles='dashed')
ax[1].plot(book_count, high_punc.index, 'go', markersize=9)
ax[1].set_xlabel('Number of unique books')
ax[1].set_xticks(np.arange(max(book_count)+1))
ax[1].set_title('Unique books')

ax[2].barh(y=high_punc.index, width=total_reviews, color=color, edgecolor='black', height=0.7)
for name, val in zip(high_punc.index, total_reviews):
    ax[2].text(val+1000, name, val)
ax[2].set_xlabel("Total Reviews")
ax[2].set_title('Total reviews')

plt.show()

### 2.6 Authors with books across both genres
<div class="alert alert-block alert-warning">Only 2 authors, `DK` and `Scholastic`, have books in both genres.</div>

In [None]:
no_dup.groupby(['Author', 'Genre']).agg({'Name': 'count'}).unstack().dropna()
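
An alternative way to list authors who appear in both genres is to count distinct genres per author. This is a sketch on made-up data, assuming the same `Author` and `Genre` column names; the author labels are placeholders.

```python
import pandas as pd

# Toy data with placeholder author labels (not from the dataset)
toy = pd.DataFrame({
    'Author': ['A', 'A', 'B', 'C', 'C'],
    'Genre':  ['Fiction', 'Non Fiction', 'Fiction', 'Non Fiction', 'Non Fiction'],
})

# Number of distinct genres each author has published in
genres_per_author = toy.groupby('Author')['Genre'].nunique()

# Keep only the authors present in both genres
both_genres = genres_per_author[genres_per_author == 2].index.tolist()
print(both_genres)
```

Compared with `unstack().dropna()`, this returns just the author names and does not depend on missing values marking an absent genre.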

### 2.7 Top 20 authors with the most expensive book collection bundle

In [None]:
exp_collection = no_dup.groupby('Author').agg({'Price': 'sum'})['Price'].nlargest(20)

fig, ax = plt.subplots(1, 3, figsize=(10,10), sharey=True)

color = sns.color_palette("hls", n_best)

ax[0].barh(y=exp_collection.index, width=exp_collection.values, color=color, edgecolor='black')
for name, val in zip(exp_collection.index, exp_collection.values):
    ax[0].text(val+2, name, val)
ax[0].set_xlabel('Price')
ax[0].set_title('Total price of all books')
ax[0].set_yticklabels(exp_collection.index, fontweight='semibold')

book_count = []
total_reviews = []
for name, col in zip(exp_collection.index, color):
    book_count.append(len(no_dup[no_dup.Author == name]['Name']))
    total_reviews.append(no_dup[no_dup.Author == name]['Reviews'].sum()/1000)
# use the authors of this chart (exp_collection), not top_authors from section 2.2
ax[1].hlines(y=exp_collection.index , xmin=0, xmax=book_count, color=color, linestyles='dashed')
ax[1].plot(book_count, exp_collection.index, 'go', markersize=9)
ax[1].set_xlabel('Number of unique books')
ax[1].set_xticks(np.arange(max(book_count)+1))
ax[1].set_title('Unique books of respective authors')

ax[2].barh(y=exp_collection.index, width=total_reviews, color=color, edgecolor='black', height=0.7)
for name, val in zip(exp_collection.index, total_reviews):
    ax[2].text(val+7, name, val)
ax[2].set_xlabel("Total Reviews (in 1000's)")
ax[2].set_title('Total reviews of all unique books')

plt.show()


## 3. Length of book name

### 3.1 Shortest Book Name

In [None]:
df[df.name_len == min(df.name_len)]

### 3.2 Longest Book Name

In [None]:
df[df.name_len == max(df.name_len)]

### 3.3 Distribution for all 351 unique books
- <div class="alert alert-block alert-warning">Overall, the length is slightly left skewed.</div>
- <div class="alert alert-block alert-warning">The median and mean name length of fiction books is lower than that of non-fiction books.</div>
- <div class="alert alert-block alert-warning">The empirical cumulative distribution function (ECDF) plots show how the proportion of books at or below a given name length differs between fiction and non-fiction books.</div>

In [None]:
no_dup = df.drop_duplicates('Name')

fig, ax = plt.subplots(3, 2, figsize=(12,15))
fig.subplots_adjust(wspace=0.4, hspace=0.4)

sns.histplot(data=no_dup, x='name_len', binwidth=9, kde=True, ax=ax[0,0])
ax[0,0].set_title('Histogram of length of book name', fontsize=15, color='darkred')
sns.histplot(data=no_dup, x='name_len', hue='Genre', binwidth=9, kde=True, ax=ax[0,1],palette=genre_col)
ax[0,1].set_title('Histogram of length of book name across genre', fontsize=15, color='darkred')

sns.boxplot(data=no_dup, x='name_len', ax=ax[1,0])
ax[1,0].set_title('Boxplot of length of book name', fontsize=15, color='darkred')
sns.boxplot(data=no_dup, x='name_len', y='Genre', ax=ax[1,1], palette=genre_col)
ax[1,1].set_title('Boxplot of length of book name across genre', fontsize=15, color='darkred')

sns.ecdfplot(data=no_dup, x='name_len', ax=ax[2,0])
ax[2,0].set_title('ECDF of length of book name', fontsize=15, color='darkred')
sns.ecdfplot(data=no_dup, x='name_len', hue='Genre', ax=ax[2,1], palette=genre_col)
ax[2,1].set_title('ECDF of length of book name across Genre', fontsize=15, color='darkred')

#fig.suptitle('Distribution of length of book name for all 351 unique books', fontsize=20)
plt.show()
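
For readers new to ECDF plots, the sketch below computes one by hand with NumPy on made-up name lengths. The helper `ecdf` is hypothetical; `sns.ecdfplot` draws the same quantity as a step function.

```python
import numpy as np

def ecdf(values):
    """Return the sorted values and the cumulative proportion at each of them."""
    x = np.sort(np.asarray(values))
    # The i-th sorted value has i observations at or below it (for distinct values)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Made-up name lengths (illustrative only)
x, y = ecdf([40, 10, 30, 20])
print(x)
print(y)
```

Reading the result: a point `(x, y)` says that a fraction `y` of the books have a name length of `x` or less, which is why two genres can be compared by looking at which curve rises earlier.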

### 3.4 Average length of book name across genre

In [None]:
# select the numeric column before averaging, so text columns are not passed to mean()
no_dup.groupby('Genre')['name_len'].mean()

## 4. Price
- <div class="alert alert-block alert-warning">Here, the price of a book is a continuous quantitative variable.</div>
- <div class="alert alert-block alert-warning">The minimum price of a book is zero dollars and the maximum is 105 dollars.</div>
- <div class="alert alert-block alert-warning">There are 9 unique books with a price of zero dollars. Either these books are free or the price is an anomaly.</div>

### 4.1 Details of books with a price of zero dollar

In [None]:
# details of books with a price of zero dollar
no_dup[no_dup['Price'] == 0]

### 4.2 Distribution of Price for all 351 unique books
<div class="alert alert-block alert-warning">Price is extremely skewed, so differences in its distribution across genres cannot be identified from these plots.</div>

In [None]:
no_dup = df.drop_duplicates('Name')

fig, ax = plt.subplots(3, 2, figsize=(12,15))
fig.subplots_adjust(wspace=0.4, hspace=0.4)

sns.histplot(data=no_dup, x='Price', kde=True, ax=ax[0,0])
ax[0,0].set_title('Histogram of price', fontsize=15, color='darkred')
sns.histplot(data=no_dup, x='Price', hue='Genre', kde=True, ax=ax[0,1], palette=genre_col)
ax[0,1].set_title('Histogram of price across genre', fontsize=15, color='darkred')

sns.boxplot(data=no_dup, x='Price', ax=ax[1,0])
ax[1,0].set_title('Boxplot of price', fontsize=15, color='darkred')
sns.boxplot(data=no_dup, x='Price', y='Genre', ax=ax[1,1], palette=genre_col)
ax[1,1].set_title('Boxplot of price across genre', fontsize=15, color='darkred')

sns.ecdfplot(data=no_dup, x='Price', ax=ax[2,0])
ax[2,0].set_title('ECDF of price', fontsize=15, color='darkred')
sns.ecdfplot(data=no_dup, x='Price', hue='Genre', ax=ax[2,1], palette=genre_col)
ax[2,1].set_title('ECDF of price across Genre', fontsize=15, color='darkred')

#fig.suptitle('Distribution of price for all 351 unique books', fontsize=20)
plt.show()

### 4.3 Log transformation of Price
<div class="alert alert-block alert-warning">Since the price of the unique books is extremely skewed, a log transformation is applied, which brings the distribution closer to normal. There appears to be a slight difference in price across genres.</div>

In [None]:
np.seterr(divide = 'ignore')
df['Price_log'] = np.where(df['Price']>0, np.log(df['Price']), 0)
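
One side effect of the `np.where` expression above is that a price of 0 and a price of 1 both map to 0 on the log scale. The sketch below reproduces that on made-up prices and shows `np.log1p` as one possible alternative; using `log1p` is an assumption for illustration, not the notebook's method.

```python
import numpy as np

prices = np.array([0, 1, 8, 105])  # made-up prices

# Same result as the np.where approach: log for positive prices, 0 otherwise.
# np.log is only applied to the positive entries, so no divide warning is raised.
log_price = np.zeros(len(prices))
positive = prices > 0
log_price[positive] = np.log(prices[positive])

# Alternative (an assumption, not the notebook's method): log(1 + price),
# which is defined at zero and keeps 0 and 1 distinct
log1p_price = np.log1p(prices)

print(log_price)
print(log1p_price)
```

Which variant is preferable depends on whether the zero-priced books are treated as genuine free books or as anomalies to be excluded.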

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(12,4))
plt.subplots_adjust(wspace=0.4)
sns.kdeplot(data=df, x='Price_log', hue='Genre', ax=ax[0], palette=genre_col)
sns.boxplot(data=df, x='Price_log', y='Genre', ax=ax[1], palette=genre_col)
sns.ecdfplot(data=df, x='Price_log', hue='Genre', ax=ax[2], palette=genre_col)
plt.legend(loc='right')
plt.show()

### 4.4 Average Price of top selling 50 books across Genre
- <div class="alert alert-block alert-warning">Except for 2009, the average price of the top 50 Non-Fiction books is higher than that of Fiction books each year.</div>

In [None]:
price_per_year = df.groupby(['Year', 'Genre']).agg({'Price': 'mean'}).unstack()

fig, ax = plt.subplots(figsize=(10,6))

width = 0.35
bar1 = ax.bar(price_per_year.index - width/2, price_per_year['Price', 'Fiction'],
        width, label='Fiction', color=genre_col[1])
"""for i, val in zip(price_per_year.index, price_per_year['Price', 'Fiction']):
    plt.text(i - width, val, round(val,2))"""
bar2 = ax.bar(price_per_year.index + width/2, price_per_year['Price', 'Non Fiction'],
        width, label='Non Fiction', color=genre_col[0])
"""for i, val in zip(price_per_year.index, price_per_year['Price', 'Non Fiction']):
    plt.text(i + width/2, val, round(val,2))"""
plt.xticks(price_per_year.index)

plt.suptitle('Average Price of Fiction and Non Fiction books each year', fontsize=20)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Price', fontsize=12)
plt.legend(loc='best')

plt.show()

### 4.5 Max Price of Fiction and Non Fiction book each year

In [None]:
max_price_per_year = df.groupby(['Year', 'Genre']).agg({'Price': 'max'}).unstack()

plt.figure(figsize=(10,6))

plt.plot(max_price_per_year.index, max_price_per_year['Price', 'Fiction'], 'o', 
         markersize=5, color='black')
plt.plot(max_price_per_year.index, max_price_per_year['Price', 'Fiction'],
         label='Fiction', color=genre_col[1])
plt.plot(max_price_per_year.index, max_price_per_year['Price', 'Non Fiction'], 'o', 
         markersize=5, color='black')
plt.plot(max_price_per_year.index, max_price_per_year['Price', 'Non Fiction'],
         label='Non Fiction', color=genre_col[0])


plt.xticks(max_price_per_year.index)
plt.yticks(np.arange(0, 111, 10))
plt.suptitle('Max Price of Fiction and Non Fiction book each year from 2009 to 2019', fontsize=20)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Price', fontsize=12)
plt.legend(loc='best')
plt.show()

## 5. User Ratings
- <div class="alert alert-block alert-warning">User Rating is a continuous quantitative variable that can range from 0 to 5.</div>
- <div class="alert alert-block alert-warning">The minimum and maximum user ratings in the data are 3.3 and 4.9.</div>
- <div class="alert alert-block alert-warning">The countplot indicates that the most frequently occurring user rating is 4.8. None of the non-fiction books has a rating below 4.</div>



### 5.1 Book with the lowest User Rating

In [None]:
df[df.User_Rating == min(df.User_Rating)]

### 5.2 Books with the best User Rating

In [None]:
print(f'There are a total of {len(no_dup[no_dup.User_Rating == max(no_dup.User_Rating)])} books that have received the best user rating of 4.9. Of these:')
for k, v in dict(no_dup[no_dup.User_Rating == max(no_dup.User_Rating)]['Genre'].value_counts()).items():
    print(k, v)

#### 5.2.1 Authors count
- <div class="alert alert-block alert-warning">Of the 28 best-rated books, author Dav Pilkey has 6, followed by J.K. Rowling and Rush Limbaugh with 4 and 2 books respectively.</div>

In [None]:
no_dup[no_dup.User_Rating == max(no_dup.User_Rating)]['Author'].value_counts()

### 5.3 Distribution of User Ratings

In [None]:
fig, ax = plt.subplots(2, 2, figsize=(12,10))
plt.subplots_adjust(wspace=0.3, hspace=0.3)

# sns.color_palette() only returns a palette; pass it to the plot to apply it
sns.countplot(data=no_dup, x='User_Rating', ax=ax[0,0], palette="Set2")
ax[0,0].set_title('Countplot of User Rating')
sns.countplot(data=no_dup, x='User_Rating', hue='Genre', ax=ax[0,1], palette=genre_col)
ax[0,1].set_title('Countplot of User Rating across Genre')


sns.boxplot(data=no_dup, x='User_Rating', ax=ax[1,0])
ax[1,0].set_title('Boxplot of User Rating')
sns.boxplot(data=no_dup, x='User_Rating', ax=ax[1,1], y='Genre', palette=genre_col)
ax[1,1].set_title('Boxplot of User Rating across Genre')

plt.show()

### 5.4 Average User Rating of Fiction and Non-Fiction books each year
- <div class="alert alert-block alert-warning">Except for 2012 and 2013, the average user rating of Fiction books is higher than that of Non-Fiction books each year.</div>

In [None]:
average_rating = df.groupby(['Year', 'Genre']).agg({'User_Rating': 'mean'}).unstack()

plt.figure(figsize=(10,6))

plt.plot(average_rating.index, average_rating.values, )
plt.plot(average_rating.index, average_rating.values, 'o', color='black')

plt.legend(('Fiction', 'Non-Fiction'))
plt.xticks(df.Year.unique())
plt.suptitle('Average User Rating of Fiction and Non-Fiction books each year from 2009 to 2019', fontsize=20)
plt.ylabel('User Rating', fontsize=12)
plt.xlabel('Year', fontsize=12)
plt.show()

## 6. Reviews
- <div class="alert alert-block alert-warning">Reviews is a discrete quantitative variable.</div>
- <div class="alert alert-block alert-warning">The minimum number of reviews for a book is 37 and the maximum is 87841.</div>
- <div class="alert alert-block alert-warning">80 percent of the books have 20,000 reviews or fewer.</div>
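
A share such as the 80 percent figure above can be computed directly rather than read off the ECDF plot. The sketch below uses made-up review counts; with the real data the series would be `no_dup['Reviews']`.

```python
import pandas as pd

reviews = pd.Series([37, 5000, 12000, 19000, 87841])  # made-up review counts

# A boolean comparison averages to the fraction of True values
share_at_or_below = (reviews <= 20000).mean() * 100

# The reverse question: which review count sits at the 80th percentile?
p80 = reviews.quantile(0.8)

print(f"{share_at_or_below:.1f} percent of books have 20,000 reviews or fewer")
print(f"80th percentile of reviews: {p80:.0f}")
```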

### 6.1 Book with minimum number of reviews

In [None]:
df[df.Reviews == min(df.Reviews)]

### 6.2 Book with maximum number of reviews

In [None]:
df[df.Reviews == max(df.Reviews)]

### 6.3 Distribution of Reviews

In [None]:
no_dup = df.drop_duplicates('Name')

fig, ax = plt.subplots(3, 2, figsize=(12,15))
fig.subplots_adjust(wspace=0.4, hspace=0.4)

sns.histplot(data=no_dup, x='Reviews', kde=True, ax=ax[0,0])
ax[0,0].set_title('Histogram of Reviews', fontsize=15, color='darkred')
sns.histplot(data=no_dup, x='Reviews', hue='Genre', kde=True, ax=ax[0,1], palette=genre_col)
ax[0,1].set_title('Histogram of Reviews across genre', fontsize=15, color='darkred')

sns.boxplot(data=no_dup, x='Reviews', ax=ax[1,0])
ax[1,0].set_title('Boxplot of Reviews', fontsize=15, color='darkred')
sns.boxplot(data=no_dup, x='Reviews', y='Genre', ax=ax[1,1], palette=genre_col)
ax[1,1].set_title('Boxplot of Reviews across genre', fontsize=15, color='darkred')

sns.ecdfplot(data=no_dup, x='Reviews', ax=ax[2,0])
ax[2,0].set_title('ECDF of Reviews', fontsize=15, color='darkred')
sns.ecdfplot(data=no_dup, x='Reviews', hue='Genre', ax=ax[2,1], palette=genre_col)
ax[2,1].set_title('ECDF of Reviews across Genre', fontsize=15, color='darkred')
ax[2,1].set_xticks(np.arange(0, 100000, 10000))

plt.suptitle('Distribution of Reviews for all 351 unique books', fontsize=15)
plt.show()

### 6.4 Log transformation of reviews

In [None]:
np.seterr(divide = 'ignore')
df['Reviews_log'] = np.where(df['Reviews']>0, np.log(df['Reviews']), 0)

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(12,4))
plt.subplots_adjust(wspace=0.4)
sns.kdeplot(data=df, x='Reviews_log', hue='Genre', ax=ax[0], palette=genre_col)
sns.boxplot(data=df, x='Reviews_log', y='Genre', ax=ax[1], palette=genre_col)
sns.ecdfplot(data=df, x='Reviews_log', hue='Genre', ax=ax[2], palette=genre_col)
plt.legend(loc='right')
plt.show()

### 6.5 Total number of Reviews of Fiction and Non-Fiction books each year
- <div class="alert alert-block alert-warning">Except for 2018 and 2019, the total reviews of Non-Fiction books are higher than those of Fiction books.</div>

In [None]:
total_reviews = df.groupby(['Year', 'Genre']).agg({'Reviews': 'sum'}).unstack()
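# take the genre labels from the unstacked columns so they match the column order of the values
category_names = total_reviews.columns.get_level_values('Genre')
results = dict(zip(total_reviews.index.astype(str), total_reviews.values))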

def survey(results, category_names):
    labels = list(results.keys())
    data = np.array(list(results.values()))
    data_cum = data.cumsum(axis=1)
    category_colors = plt.get_cmap('RdYlGn')(
        np.linspace(0.15, 0.85, data.shape[1]))


    fig, ax = plt.subplots(figsize=(10,10))
    ax.invert_yaxis()
    #ax.xaxis.set_visible(False)
    ax.set_xlim(0, np.sum(data, axis=1).max())

    for i, (colname, color) in enumerate(zip(category_names, category_colors)):
        widths = data[:, i]
        starts = data_cum[:, i] - widths
        ax.barh(labels, widths, left=starts, height=0.5,
                label=colname, color=color)
        xcenters = starts + widths / 2

        r, g, b, _ = color
        text_color = 'white' if r * g * b < 0.5 else 'darkgrey'
        for y, (x, c) in enumerate(zip(xcenters, widths)):
            ax.text(x, y, str(int(c)), ha='center', va='center',
                    color=text_color, fontsize=12)
    ax.legend(ncol=len(category_names), bbox_to_anchor=(0, 1),
              loc='lower left', fontsize='small')

    return fig, ax


survey(results, category_names)
plt.show()
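
The stacking logic inside `survey` comes down to a cumulative sum. The sketch below isolates that step on made-up totals, so it is easier to see where each bar segment starts and where its label is placed.

```python
import numpy as np

# Rows are years, columns are genres (made-up totals)
data = np.array([[10, 30],
                 [25, 15]])

data_cum = data.cumsum(axis=1)  # running total across genres within each year
starts = data_cum - data        # left edge of each bar segment
centers = starts + data / 2     # x position of each text label

print(starts)
print(centers)
```

Because the segments are drawn column by column, the order of `category_names` has to match the column order of the data, otherwise the legend labels end up attached to the wrong genre.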