# This is my very first publication in almost 2 years. 

Any and all feedback on how to improve my analysis, code, or anything else is welcome. This is a work in progress, and I will add more analysis over time.

If you find this notebook helpful, do leave a thumbs up!

Cheers!

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
data=pd.read_csv('/kaggle/input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv')

In [None]:
data.head()

In [None]:
data.describe()


What questions do we want to answer? Which ones would be interesting to ask? What observations can we draw from this dataset? Let's brainstorm and list them here.

1. How does Genre affect Ratings?
2. How does Price affect Ratings?
3. How does Year affect Ratings?
4. How does the number of reviews (Reviews) affect Ratings?
5. Who are the best Authors?
6. Which Authors have the best Ratings?
7. Which Authors have the most Reviews?
8. Which Genre do the Authors write in?
9. Price vs Ratings? If the price is high, will there be fewer ratings?
10. Which Authors have the most books?
11. Average price per book for top Authors.

In [None]:
#any null values?
data.isnull().sum()

No null values. That's cool.

In [None]:
#Plot to visualise number of Non Fiction vs Fiction books in our dataset
plt.figure(figsize = (8,6))
sns.countplot(x = 'Genre', data = data)
plt.title('Plot to Check Number of Non Fiction vs Fiction Books')
plt.show()

Our dataset has more Non Fiction books than Fiction books.

In [None]:
data.groupby('Genre')['User Rating'].mean().plot(kind = 'bar', figsize = (8,6))
plt.title('Plot for Avg Rating of Fiction vs Non Fiction Books')
plt.show()

Fiction is rated just a tad bit higher.
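
To put a number on "a tad bit", a small table can say more than the bar heights. Here is a minimal sketch on made-up ratings: the `sample` frame is invented for illustration and only mirrors the column names of the real file, so in the notebook you would run the same calls on `data`.

```python
import pandas as pd

# Invented rows that only mimic the dataset's columns (not real figures).
sample = pd.DataFrame({
    'Genre': ['Fiction', 'Fiction', 'Non Fiction', 'Non Fiction', 'Non Fiction'],
    'User Rating': [4.8, 4.6, 4.5, 4.7, 4.6],
})

# Mean, median and row count per genre in one table.
summary = sample.groupby('Genre')['User Rating'].agg(['mean', 'median', 'count'])
print(summary)

# Size of the gap between the two genre means.
gap = summary.loc['Fiction', 'mean'] - summary.loc['Non Fiction', 'mean']
print(round(gap, 2))
```

Showing the count next to the mean also makes it clear how many rows each average is based on.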

Let's plot the top 10 authors that occur in our dataset.

In [None]:
plt.figure(figsize = (18,8))
sns.countplot(x = 'Author',
             data = data,
             order = data['Author'].value_counts().iloc[: 10].index)
plt.xticks(rotation = 90)
plt.title('Top 10 Authors as per Occurrence')
plt.show()

As I am a big, big fan, I was really hoping to see Stephen King here. The Stand and The Shining are pure gold!
So, just out of curiosity, I am going to analyse our beloved horror master, Mr. King, a bit here.

In [None]:
#a dataset for Mr. King
king = data[data['Author'] == 'Stephen King']
#plots

fig,(ax1,ax2) = plt.subplots(2,1, figsize = (16,8))
ax1.bar(king['Name'], king['User Rating'])
ax1.set_title('Novel vs Rating')
ax2.bar(king['Name'], king['Reviews'])
ax2.set_title('Novel vs Reviews')
plt.tight_layout()
plt.show()

Only one thing worth noting here: **Doctor Sleep: A Novel** is the most reviewed and highest rated. It is also the sequel to the Stephen King classic, The Shining. Maybe the people who read and loved The Shining bought this the most and were keen to post their reviews.
Based on the rating, which is around 4.7, they liked it.
Sadly, I have not read it to date :(

Top 10 writers in Fiction and Non Fiction?

In [None]:
df_fiction = data[data['Genre'] == 'Fiction']
df_non_fiction =  data[data['Genre'] == 'Non Fiction']

fig, ax = plt.subplots(1,2, figsize = (18,8))
sns.countplot(x = 'Author',
             data = df_fiction,
             order = df_fiction['Author'].value_counts().iloc[: 10].index,
             ax = ax[0])
ax[0].set_title('Top Authors (Fiction)')
ax[0].tick_params(axis = 'x', labelrotation = 90)
sns.countplot(x = 'Author',
             data = df_non_fiction,
             order = df_non_fiction['Author'].value_counts().iloc[: 10].index,
             ax = ax[1])
ax[1].set_title('Top Authors (Non Fiction)')
ax[1].tick_params(axis = 'x', labelrotation = 90)
plt.tight_layout()
plt.show()

Let's now see how Genre affects ratings.

In [None]:
df_genre_ratings = data.groupby('Genre')['User Rating'].mean().to_frame(name = 'User Rating').reset_index()
plt.figure(figsize = (8,6))
sns.barplot(x = 'Genre', y = 'User Rating', data = df_genre_ratings)
plt.title ('Average User Rating vs Genre')
plt.show()

This is again a close call, as Fiction is rated just a tad bit higher than Non Fiction.
One possible explanation: if Fiction has more reviews than Non Fiction, then Fiction is rated by more people, which might go along with a higher average rating overall.
Let's put this hypothesis to the test.

In [None]:
df_genre_reviews = data.groupby('Genre')['Reviews'].sum().to_frame(name = 'Number of Reviews').reset_index()
plt.figure(figsize = (8,6))
sns.barplot(x = 'Genre', y = 'Number of Reviews', data = df_genre_reviews)
plt.title('Genre vs Number of Reviews')
plt.show()

Whoa! Fiction does have a lot more reviews than Non Fiction books.
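
A total can be large simply because a genre has more rows or a few heavily reviewed titles, so it may be worth looking at reviews per book and at a review-weighted average rating as well. The sketch below uses invented numbers and a hypothetical helper, `weighted_rating`, that is not part of the notebook. It is one way to look at the question, not a formal test.

```python
import numpy as np
import pandas as pd

# Invented rows that only mirror the dataset's column names.
sample = pd.DataFrame({
    'Genre': ['Fiction', 'Fiction', 'Non Fiction', 'Non Fiction'],
    'User Rating': [4.8, 4.4, 4.6, 4.2],
    'Reviews': [3000, 1000, 500, 1500],
})

# Total, mean and median reviews per genre. The mean and median are per book,
# so they do not grow just because a genre has more rows.
per_book = sample.groupby('Genre')['Reviews'].agg(['sum', 'mean', 'median'])
print(per_book)

def weighted_rating(group):
    """Average rating in which each book counts in proportion to its reviews."""
    return np.average(group['User Rating'], weights=group['Reviews'])

# One weighted average per genre.
weighted = {genre: weighted_rating(group) for genre, group in sample.groupby('Genre')}
print(weighted)
```

If the weighted and unweighted averages per genre come out close, the number of reviews is probably not what drives the small rating gap.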

Let's see which are the top 10 reviewed Authors and Books.

In [None]:
#top 10 reviewed Authors and Books
df_top_reviewed_authors = data.groupby('Author')['Reviews'].sum().reset_index().sort_values(by = ['Reviews'], ascending = False)
df_top_reviewed_books = data.groupby('Name')['Reviews'].sum().reset_index().sort_values(by = ['Reviews'], ascending = False)


In [None]:
#top Reviewed Authors
plt.figure (figsize = (18,8))
sns.barplot(x = 'Author', y = 'Reviews', data = df_top_reviewed_authors.iloc[: 10])
plt.title('Top Reviewed Authors')
plt.xticks(rotation = 90)
plt.show()
#top Reviewed Books
plt.figure (figsize = (18,8))
sns.barplot(x = 'Name', y = 'Reviews', data = df_top_reviewed_books.iloc[: 10])
plt.title('Top Reviewed Books')
plt.xticks(rotation = 90)
plt.show()

So we have Suzanne Collins, with almost 300000 reviews, as our top reviewed writer, and The Fault in Our Stars by John Green is our top reviewed book.
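
One caveat with summing `Reviews`: if a book is listed in several years and each of its rows carries the same review count, the sum counts those reviews more than once. Whether that happens in this file is worth checking rather than assuming. Here is a minimal sketch on invented rows that keeps one row per title before aggregating (the `sample` frame and its values are made up):

```python
import pandas as pd

# Invented rows: 'Book A' appears in two years with the same review count.
sample = pd.DataFrame({
    'Name': ['Book A', 'Book A', 'Book B', 'Book C'],
    'Author': ['Author X', 'Author X', 'Author X', 'Author Y'],
    'Reviews': [1000, 1000, 300, 1800],
    'Year': [2015, 2016, 2016, 2017],
})

# Plain sum over all rows: 'Book A' is counted twice.
naive = sample.groupby('Author')['Reviews'].sum()
print(naive)

# Keep the first row of each (Name, Author) pair, then sum.
unique_books = sample.drop_duplicates(subset = ['Name', 'Author'])
deduped = unique_books.groupby('Author')['Reviews'].sum().sort_values(ascending = False)
print(deduped)
```

In this toy example the ranking flips once repeated titles are counted only once, which is why the check can matter for a "top reviewed" chart.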

Let's check how ratings relate to the number of reviews.

In [None]:
plt.figure(figsize = (18,6))
sns.set_style('darkgrid')
sns.lineplot(x = 'User Rating', y = 'Reviews', hue = 'Genre', errorbar = None, data = data)
plt.title('Reviews vs User Ratings')
plt.show()
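
A line plot over `User Rating` can look jagged when each rating value has only a handful of books. As a complementary number, a rank correlation summarises whether higher-rated books tend to have more reviews. Here is a sketch with invented values. Spearman correlation is my choice for illustration, not something the notebook uses, and it is computed here as a Pearson correlation on ranks.

```python
import pandas as pd

# Invented values that only mirror the dataset's column names.
sample = pd.DataFrame({
    'User Rating': [4.0, 4.2, 4.5, 4.7, 4.9],
    'Reviews': [200, 150, 900, 1200, 3000],
})

# Spearman's rho equals the Pearson correlation of the ranked values.
rating_rank = sample['User Rating'].rank()
reviews_rank = sample['Reviews'].rank()
rho = rating_rank.corr(reviews_rank)
print(round(rho, 2))
```

A value near 1 or -1 points to a consistent monotonic relationship, while a value near 0 suggests the plot's ups and downs are mostly noise.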

Reviews vs Year?

In [None]:
plt.figure(figsize = (18,6))
sns.set_style('darkgrid')
sns.lineplot(x = 'Year', y = 'Reviews', hue = 'Genre', errorbar = None, data = data)
plt.title('Reviews vs Year')
plt.show()

I believe that it would also be interesting to check how Price varied with Year.

In [None]:
plt.figure(figsize = (8,6))
sns.set_style('darkgrid')
sns.boxplot(x = 'Year', y = 'Price', hue = 'Genre', data = data)
plt.title('Price vs Year')
plt.show()
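
The box plot shows the spread, but the central line of each box can be hard to read off for many years at once. A small table of the median price per year and genre gives the same information as numbers. This is a sketch on invented prices, so swap in `data` for `sample` in the notebook.

```python
import pandas as pd

# Invented rows that only mirror the dataset's column names.
sample = pd.DataFrame({
    'Year': [2009, 2009, 2009, 2010, 2010, 2010],
    'Genre': ['Fiction', 'Non Fiction', 'Non Fiction', 'Fiction', 'Fiction', 'Non Fiction'],
    'Price': [10, 20, 30, 8, 12, 15],
})

# Rows are years, columns are genres, cells are median prices.
median_price = sample.pivot_table(index = 'Year', columns = 'Genre',
                                  values = 'Price', aggfunc = 'median')
print(median_price)
```

The median is used here because it matches the line drawn inside each box and is less sensitive to a few very expensive books than the mean.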