#### Hi, I'm Jason. This notebook is based on a very small dataset that contains the top 50 bestselling books on Amazon from 2009 to 2019. 
#### I've created visualizations to show the number of top books by category (fiction vs. non-fiction), and I've also found the most popular book and the most popular author among them. There's a histogram of the user ratings, plus a pie chart and a bar chart of the genre (fiction vs. non-fiction) breakdown.

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib
import matplotlib.pyplot as plt
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Read the data
books = pd.read_csv('/kaggle/input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv')

In [None]:
# Take a peek at the data.
books.head()

In [None]:
# There's a column called "User Rating" that has a space in the name. 
# I removed the gap so that it's easier to call the column later on.
books = books.rename(columns={'User Rating': 'UserRating'})

In [None]:
# Check to see if there are any null values.
# There shouldn't be any, but I just want to double check. 
books.isnull().sum()

Let's look at the data column by column. 

In [None]:
# The first column is the name of the book. 
# What is the most popular book among the bestsellers? I.e. which book is on the bestseller list multiple times? 
# Here's the list of top 10 most popular books. 
books.groupby(by = ['Name']).count()[['Year']].sort_values(by = 'Year', ascending = False).head(10)

In [None]:
# The second column is the author of the book.
# Who is the most popular author then? Whose books were on the bestseller list multiple times? 
books.groupby(by = ['Author']).count()[['Name']].sort_values(by = 'Name', ascending = False).head(10)
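The two "how many times does this appear" counts above can also be written with `value_counts()`, which counts and sorts in one step. Here's a minimal sketch on a toy DataFrame; the rows are made up for illustration and are not from the real dataset.

```python
import pandas as pd

# Toy stand-in for the real data; every row here is made up for illustration.
sample = pd.DataFrame({
    'Name': ['Book A', 'Book A', 'Book B', 'Book C', 'Book A'],
    'Author': ['Author X', 'Author X', 'Author Y', 'Author X', 'Author X'],
    'Year': [2009, 2010, 2010, 2011, 2011],
})

# value_counts() counts appearances and sorts them in descending order by default.
top_books = sample['Name'].value_counts().head(10)
top_authors = sample['Author'].value_counts().head(10)

print(top_books)
print(top_authors)
```

On the real data, the same idea would be `books['Name'].value_counts().head(10)`, assuming the column names shown in the cells above.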

In [None]:
# To my surprise, books from the American Psychological Association are on the bestseller list 10 times! 
# That's an average of about once a year! Looks like people who shop on Amazon are really into psychology books. 
# I wonder what those books are. Is it one book that's on the list every year? Or multiple books? 
books[books.Author == 'American Psychological Association']

In [None]:
# It's just one book from the American Psychological Association that people keep buying every year. 
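Instead of reading through the rows by eye, the "one book or many?" question can be answered by counting distinct titles per author. This is a small sketch on made-up rows (the names `Publisher P` and `Author Q` are hypothetical), using `groupby` with `nunique()`.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
sample = pd.DataFrame({
    'Name': ['Manual A', 'Manual A', 'Novel B', 'Novel C'],
    'Author': ['Publisher P', 'Publisher P', 'Author Q', 'Author Q'],
    'Year': [2009, 2010, 2009, 2010],
})

# Number of distinct titles behind each author's appearances on the list.
titles_per_author = sample.groupby('Author')['Name'].nunique()
print(titles_per_author)
```

An author with many appearances but a count of 1 here has a single title that keeps coming back.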

In [None]:
# Let's look at user rating next.
# What are the highest and lowest ratings for these books? These are bestselling books, so the ratings should be fairly high, I guess.
print('Highest rating is ' + str(max(books['UserRating'])))
print('Lowest rating is ' + str(min(books['UserRating'])))

In [None]:
# Highest rating is 4.9, not surprised at all. However, lowest rating is 3.3?! That's like a passing grade. 
# I wonder which book is that? 
books[books.UserRating == 3.3]

In [None]:
# It's JK Rowling's The Casual Vacancy, rated by 9000+ people. 
# Given that it's rated by so many people, it should be a truthful rating. Why is such a low-rated book on the bestseller list then?
# Maybe people are buying it because it's written by JK Rowling, I guess? I can't tell with the limited data here.  
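The lookup above hard-codes 3.3 after reading it off the printed minimum. A slightly more robust version compares against the minimum directly, so it keeps working if the data changes. A minimal sketch on made-up rows (the `Reviews` column name is an assumption here):

```python
import pandas as pd

# Made-up rows for illustration only; 'Reviews' is an assumed column name.
sample = pd.DataFrame({
    'Name': ['Book A', 'Book B', 'Book C'],
    'UserRating': [4.8, 3.9, 4.6],
    'Reviews': [1200, 950, 300],
})

# Select every row whose rating equals the column minimum (keeps ties, if any).
lowest = sample[sample['UserRating'] == sample['UserRating'].min()]
print(lowest)
```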

In [None]:
# Let's also take a look at how the rating is distributed.
plt.hist(books['UserRating'], bins = 20)

In [None]:
# Even though there are a few low-rated books, most of the ratings are above 4.5. And the median rating is...
print('Median rating is ' + str(books['UserRating'].median()))
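To put a number on "most ratings are above 4.5", one option is to take the mean of a boolean mask, which gives the fraction of `True` values. A small sketch with made-up ratings, not the real ones:

```python
import pandas as pd

# Made-up ratings for illustration only.
ratings = pd.Series([4.9, 4.7, 4.6, 4.5, 4.0, 3.5, 4.8, 4.7])

# The mean of a boolean Series is the share of entries that are True.
share_high = (ratings > 4.5).mean()
print('Share of ratings above 4.5: ' + str(share_high))
```

On the real data this would be `(books['UserRating'] > 4.5).mean()`.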

In [None]:
# Now let's look at the genre breakdown.
# The genre here has only 2 categories: fiction vs non-fiction.
books.groupby('Genre').size().plot(kind = 'pie')

In [None]:
# Overall, there are more non-fiction books than fiction books on the list.
# What about year over year breakdown? 

In [None]:
genreCount = books[['Year', 'Genre', 'Name']].groupby(by = ['Year', 'Genre']).count()

In [None]:
ax = genreCount.unstack('Genre').plot.bar(stacked = True, figsize = (10, 5))
ax.legend(['Fiction', 'Non Fiction'], loc = 'best')
ax.set_ylabel('Number of Books')

In [None]:
# There's only one year (2014) in which there were more fiction books than non-fiction books; in every other year, it's either an even split or non-fiction wins.
# Looks like people who shop on Amazon still prefer non-fiction to fiction. 
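The year-by-year comparison can also be checked in code rather than read off the chart. Here's a sketch using `pd.crosstab` on made-up rows; the genre labels `'Fiction'` and `'Non Fiction'` are assumed to match the legend labels used above.

```python
import pandas as pd

# Made-up rows for illustration only; genre labels are assumed from the legend above.
sample = pd.DataFrame({
    'Year': [2009, 2009, 2009, 2010, 2010, 2010, 2011, 2011],
    'Genre': ['Fiction', 'Non Fiction', 'Non Fiction',
              'Fiction', 'Fiction', 'Non Fiction',
              'Fiction', 'Non Fiction'],
})

# One row per year, one column per genre, cells hold the number of books.
counts = pd.crosstab(sample['Year'], sample['Genre'])

# Years in which fiction outnumbers non-fiction.
fiction_lead_years = counts.index[counts['Fiction'] > counts['Non Fiction']].tolist()
print(counts)
print(fiction_lead_years)
```

Running the same two lines on `books['Year']` and `books['Genre']` would list the years where fiction comes out ahead.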