In [1]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

In [2]:
df_books = pd.read_pickle('../data/processed/bestsellers_cleaned')
df_books.head()

Unnamed: 0,Name,Author,User Rating,Reviews,Price,Year,Genre
0,10-Day Green Smoothie Cleanse,JJ Smith,4.7,17350,8.0,2016,Non Fiction
1,11/22/63: A Novel,Stephen King,4.6,2052,22.0,2011,Fiction
2,12 Rules For Life: An Antidote To Chaos,Jordan B. Peterson,4.7,18979,15.0,2018,Non Fiction
3,1984 (Signet Classics),George Orwell,4.7,21424,6.0,2017,Fiction
4,"5,000 Awesome Facts (About Everything!) (Natio...",National Geographic Kids,4.8,7665,12.0,2019,Non Fiction


# EDA

Let's first look at how our key variables are distributed.

In [19]:
fig = px.histogram(df_books, x='User Rating', title='Histogram of User Rating')
fig.show()

In [21]:
fig = px.histogram(df_books, x='Price', title='Histogram of Price')
fig.show()

In [22]:
fig = px.histogram(df_books, x='Reviews', title='Histogram of Reviews')
fig.show()

In [32]:
genre_counts = df_books.groupby('Genre')['Genre'].count()
px.bar(genre_counts, labels={'index': 'Genre', 'value': 'Counts'})

Both the price and the number of reviews are skewed towards the lower end.
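To put a number on the skew seen in the histograms, one option is to compare the sample skewness, mean and median of each column. The following is a minimal sketch; `skew_summary` and the `sample` rows are hypothetical stand-ins so that the snippet runs on its own, and in the notebook `df_books` would be passed instead. A positive skewness indicates a long right tail (most values at the lower end), a negative one the opposite.

```python
import pandas as pd

def skew_summary(df, columns):
    """Return skewness, mean and median for each of the given columns."""
    return pd.DataFrame({
        'skew': df[columns].skew(),
        'mean': df[columns].mean(),
        'median': df[columns].median(),
    })

# Hypothetical stand-in rows; in the notebook, pass df_books instead.
sample = pd.DataFrame({
    'Price': [4.0, 6.0, 8.0, 8.0, 12.0, 15.0, 22.0, 54.0],
    'Reviews': [1884, 2052, 4786, 5062, 7665, 17350, 21424, 87841],
    'User Rating': [4.9, 4.8, 4.8, 4.7, 4.7, 4.6, 4.0, 3.3],
})

summary = skew_summary(sample, ['Price', 'Reviews', 'User Rating'])
print(summary)
```

A mean well above the median points the same way as a positive skewness value, so the two checks can be read together.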

We have 160 Fiction books and 188 Non-Fiction books in our dataset, for a relatively even split. 

Data is skewed towards the higher end for ratings, with the minimum value being 3.3 and 108 books having a rating of 4.8+ (they are best sellers after all!). To better interpret user popularity, we can re-code the rating column ourselves. 


In [6]:
rating_dict = {'Very High': [4.8, 4.9, 5.0], 
              'High': [4.5, 4.6, 4.7], 
              'Medium': [4.0, 4.1, 4.2, 4.3, 4.4],
              'Low': [3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9]}

category_order = ['Low', 'Medium', 'High', 'Very High']

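def process_ratings(user_rating, rating_dict):
    """Return the rating category whose list of values contains user_rating."""
    for key, value in rating_dict.items():
        if user_rating in value:
            return key
    # No category matched (e.g. a rating outside the listed values)
    return None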

In [7]:
df_books['Rating Category'] = df_books['User Rating'].apply(lambda x: process_ratings(x, rating_dict))
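# Use an ordered categorical so that Low < Medium < High < Very High
df_books['Rating Category'] = pd.Categorical(
    df_books['Rating Category'], categories=category_order, ordered=True
)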
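The dictionary lookup depends on exact float matches against the listed values. As an alternative, the same re-coding could be sketched with `pd.cut`, which bins by interval instead. The bin edges below are an assumption: they are placed midway between one-decimal ratings so that each bin covers the same values as `rating_dict`. The `bin_ratings` helper and `sample_ratings` are hypothetical names used for illustration only.

```python
import pandas as pd

category_order = ['Low', 'Medium', 'High', 'Very High']

# Assumed bin edges, chosen to fall between one-decimal ratings:
# (3.25, 3.95] Low, (3.95, 4.45] Medium, (4.45, 4.75] High, (4.75, 5.05] Very High
bin_edges = [3.25, 3.95, 4.45, 4.75, 5.05]

def bin_ratings(ratings):
    """Map numeric ratings to ordered categories using interval bins."""
    return pd.cut(ratings, bins=bin_edges, labels=category_order)

# Hypothetical stand-in values; in the notebook, pass df_books['User Rating'].
sample_ratings = pd.Series([3.3, 3.9, 4.0, 4.4, 4.5, 4.7, 4.8, 4.9])
binned = bin_ratings(sample_ratings)
print(binned.tolist())
```

With labels supplied, `pd.cut` returns an ordered categorical, so the category order does not need to be set separately. Ratings outside the outermost edges would come back as missing values.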

In [8]:
rating_counts = df_books.groupby('Rating Category')['Rating Category'].count()

In [9]:
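fig = px.bar(
    # Turn the counts into a two-column frame so the axes can be named explicitly
    rating_counts.rename_axis('Rating Category').reset_index(name='Count'),
    x='Rating Category',
    y='Count',
    labels={'Count': 'Count of Rating Category'},
    category_orders={'Rating Category': category_order}
)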
fig.show()

Next, we look at how rating relates to other variables such as the number of reviews and price. We can also overlay our rating categories.

In [14]:
fig = px.scatter(
    df_books,
    x='User Rating',
    y='Reviews',
    color='Rating Category',
    category_orders={'Rating Category': category_order},
    title='Number of Reviews vs User Rating'
)
fig.show()

In [15]:
fig = px.scatter(
    df_books, 
    x='User Rating', 
    y='Price', 
    color='Rating Category', 
    category_orders={'Rating Category': category_order},
    title='Price vs User Rating'
)
fig.show()

In [18]:
fig = px.scatter(
    df_books, 
    x='Price', 
    y='Reviews', 
    color='Rating Category', 
    category_orders={'Rating Category': category_order},
    title='Price vs Number of Reviews'
)
fig.show()

From the above visualisations, we can't identify any significant relationship between the price of a book and the number of reviews that it has.
We also note that the prices of these books are generally low, with an average of \$13 and only a few books over \$20.
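The visual impression can be backed up with a few summary numbers. The sketch below computes a Spearman rank correlation between price and review count (as the Pearson correlation of the ranks, since both variables have long right tails), the mean price, and the share of books above a price threshold. `price_review_summary` and its `price_threshold` parameter are hypothetical names, and the `sample` frame is a stand-in built from a few rows shown in the outputs of this notebook; in practice `df_books` would be passed.

```python
import pandas as pd

def price_review_summary(df, price_threshold=20):
    """Summarise price level and its rank correlation with review count."""
    # Spearman correlation, computed as the Pearson correlation of the ranks
    spearman = df['Price'].rank().corr(df['Reviews'].rank())
    return {
        'spearman_price_reviews': spearman,
        'mean_price': df['Price'].mean(),
        'share_over_threshold': (df['Price'] > price_threshold).mean(),
    }

# Stand-in rows for illustration; in the notebook, pass df_books instead.
sample = pd.DataFrame({
    'Price': [8.0, 22.0, 15.0, 6.0, 12.0, 4.0],
    'Reviews': [17350, 2052, 18979, 21424, 7665, 7235],
})

stats = price_review_summary(sample)
print(stats)
```

A rank correlation close to zero on the full dataset would be consistent with the lack of a visible relationship in the scatter plot.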

In [66]:
fig = px.scatter(
    df_books, 
    x='Genre', 
    y='User Rating',
    title='Fiction vs Non-Fiction by User Rating'
)
fig.show()

Interestingly, all non-fiction books have a user rating of 4 and above, whereas the fiction books have a bit more variance. 
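One way to check this observation numerically is to compare the minimum and the spread of ratings per genre. This is a rough sketch: `rating_spread_by_genre` is a hypothetical helper and the `sample` values are made up for illustration, so the printed numbers say nothing about the real dataset. In the notebook, `df_books` would be passed instead.

```python
import pandas as pd

def rating_spread_by_genre(df):
    """Return count, min, max and standard deviation of ratings per genre."""
    return df.groupby('Genre')['User Rating'].agg(['count', 'min', 'max', 'std'])

# Made-up stand-in rows; in the notebook, pass df_books instead.
sample = pd.DataFrame({
    'Genre': ['Fiction', 'Fiction', 'Fiction',
              'Non Fiction', 'Non Fiction', 'Non Fiction'],
    'User Rating': [3.3, 4.6, 4.9, 4.7, 4.7, 4.8],
})

spread = rating_spread_by_genre(sample)
print(spread)
```

A `min` of at least 4.0 for Non Fiction and a larger `std` for Fiction on the full data would confirm what the scatter plot suggests.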

## Authors

In [60]:
common_authors = df_books.groupby('Author')['Author'].count().sort_values().reset_index(name='Count')
common_authors = common_authors[common_authors['Count'] >= 3]
fig = px.bar(
    common_authors,
    x='Author',
    y='Count',
    title='Top Authors by Number of Books'
)
fig.show()

## 4.9 star books

Our dataset contains 28 books with the highest observed rating of 4.9/5.
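Rather than hard-coding 4.9, the top rating and the number of books that share it can be derived from the data. A minimal sketch follows; `count_top_rated` is a hypothetical helper and `sample` holds made-up stand-in values, with `df_books` taking its place in the notebook.

```python
import pandas as pd

def count_top_rated(df, column='User Rating'):
    """Return the highest rating in the column and how many rows have it."""
    top = df[column].max()
    return top, int((df[column] == top).sum())

# Made-up stand-in rows; in the notebook, pass df_books instead.
sample = pd.DataFrame({'User Rating': [4.9, 4.7, 4.9, 4.6, 4.8]})

top_rating, n_top = count_top_rated(sample)
print(top_rating, n_top)
```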

In [33]:
# Quick peek at the top rated books
df_top_books = df_books[df_books['User Rating'] == 4.9].sort_values('Author')
df_top_books

Unnamed: 0,Name,Author,User Rating,Reviews,Price,Year,Genre,Rating Category
219,Little Blue Truck,Alice Schertle,4.9,1884,0.0,2014,Fiction,Very High
41,"Brown Bear, Brown Bear, What Do You See?",Bill Martin Jr.,4.9,14344,5.0,2019,Fiction,Very High
174,Humans Of New York : Stories,Brandon Stanton,4.9,2812,17.0,2015,Non Fiction,Very High
431,The Magnolia Story,Chip Gaines,4.9,7861,5.0,2016,Non Fiction,Very High
84,Dog Man: Brawl Of The Wild: From The Creator O...,Dav Pilkey,4.9,7235,4.0,2019,Fiction,Very High
85,Dog Man: Fetch-22: From The Creator Of Captain...,Dav Pilkey,4.9,12619,8.0,2019,Fiction,Very High
86,Dog Man: For Whom The Ball Rolls: From The Cre...,Dav Pilkey,4.9,9089,8.0,2019,Fiction,Very High
87,Dog Man: Lord Of The Fleas: From The Creator O...,Dav Pilkey,4.9,5470,6.0,2018,Fiction,Very High
82,Dog Man: A Tale Of Two Kitties: From The Creat...,Dav Pilkey,4.9,4786,8.0,2017,Fiction,Very High
81,Dog Man And Cat Kid: From The Creator Of Capta...,Dav Pilkey,4.9,5062,6.0,2018,Fiction,Very High
