In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# 1. Import Libraries

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# 2. Read Data

In [None]:
df = pd.read_csv('../input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv')

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.info()

In [None]:
df.isnull().sum()

Let's check whether any of our numeric variables have negative values. This will help us determine if there are any irregularities in the data that we should be concerned with. Now, there are a couple of ways we can find negatives. Let's start with the "slow and dumb" way. We'll go through each column individually and count the negative values we find in each one.

In [None]:
sum(x < 0 for x in df['Price'].values.flatten())

In [None]:
sum(x < 0 for x in df['User Rating'].values.flatten())

In [None]:
sum(x < 0 for x in df['Reviews'].values.flatten())

Great! No negatives means one less thing to worry about. But notice how tedious it is to write out the same expression for each column. Let's build a function that does the exact same thing, but that we only need to write once.

In [None]:
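def count_negatives(data):
    """Count the negative values across all numeric columns of a DataFrame."""
    # Iterating over a DataFrame yields column names, so select the
    # numeric columns and compare their values directly instead.
    numeric = data.select_dtypes(include = 'number')
    return int((numeric < 0).sum().sum())

count_negatives(df)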

We get the same correct result and if we want to check the dataset later on for whatever reason, we can just call this function again.
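
If a per-column breakdown is more useful than a single total, a vectorized check along the following lines might work. This is only a sketch: `sample` is a small hypothetical stand-in for `df`, with one deliberately negative value so that the output is not all zeros.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are made up for illustration.
sample = pd.DataFrame({
    'Name': ['Book A', 'Book B', 'Book C'],
    'User Rating': [4.7, 4.6, 4.8],
    'Reviews': [17350, 2052, -5],   # one deliberately bad value
    'Price': [8, 22, 0],
})

# Keep only the numeric columns, flag negatives, and count them per column.
negatives_per_column = (sample.select_dtypes(include='number') < 0).sum()
print(negatives_per_column)
```

Non-numeric columns such as `Name` are dropped by `select_dtypes`, so they never reach the comparison.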

# Price

Now that we've made sure we loaded the data properly and that nothing seems to be wrong with it, we can begin our analysis. Let's first check if any of the variables are correlated.

In [None]:
sns.heatmap(
    df[['Price', 'Reviews', 'User Rating', 'Year']].corr(),
    annot = True,
    cmap = 'BuPu'
)

The low correlations tell us that none of these variables has a strong linear relationship with the others.
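
One caveat is that `corr()` uses the Pearson coefficient by default, which only measures linear association. The sketch below uses made-up data (not the bestseller dataset) to show how a rank-based Spearman correlation can pick up a monotonic relationship that Pearson understates.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y depends on x perfectly, but not linearly.
x = np.arange(1, 51)
toy = pd.DataFrame({'x': x, 'y': np.exp(x / 5.0)})

# Pearson measures linear association; Spearman works on ranks.
pearson = toy.corr(method='pearson').loc['x', 'y']
spearman = toy.corr(method='spearman').loc['x', 'y']
print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

Running the heatmap cell with `method = 'spearman'` would be one way to check whether the same holds for our variables.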

Financial data is often heavily skewed, with a few high prices having more influence than most of the other data points. Because of this, it is often helpful to take the log of financial data so that it is easier to read and work with. To start, I will make a copy of our dataset.

In [None]:
data = df.copy()

Now, let's log transform our new dataset's price variable.

In [None]:
data['Price'] = data['Price'].map(lambda x: np.log(x) if x > 0 else 0)
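
A side effect of the guarded log above is that a free book (price 0) and a $1 book both end up at 0, since log(1) = 0. One possible alternative, sketched here on hypothetical prices, is `np.log1p`, which computes log(1 + x) and keeps the two apart.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, including a free book (0) and a $1 book.
prices = pd.Series([0, 1, 8, 15, 105], name='Price')

# Guarded log as used above: both 0 and 1 are mapped to 0.
guarded_log = prices.map(lambda x: np.log(x) if x > 0 else 0)

# log1p computes log(1 + x), so 0 maps to 0 and 1 maps to log(2).
log1p_prices = np.log1p(prices)

print(pd.DataFrame({'price': prices, 'guarded_log': guarded_log, 'log1p': log1p_prices}))
```

Whether the distinction matters depends on how many $0 and $1 listings the data contains.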

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (12,6))

ax[0].hist(
    df['Price'],
    color = 'pink',
    bins = 25,
    edgecolor = 'black',
    label = "Skewness: %.2f"%df['Price'].skew()
)

ax[0].legend()
ax[0].set_title('Original Data', fontsize = 14, fontweight = 'bold')

ax[1].hist(
    data['Price'],
    color = 'royalblue',
    bins = 25,
    edgecolor = 'black',
    label = "Skewness: %.2f"%data['Price'].skew()
)

ax[1].legend()
ax[1].set_title('Log-transformed Data', fontsize = 14, fontweight = 'bold')
# plt.gca() only returns the last subplot, so style both axes explicitly.
for axis in ax:
    axis.spines['top'].set_visible(False)
    axis.spines['right'].set_visible(False)

We can see from the above output that the original data has a long right tail, making it right-skewed. The legend confirms this with a skewness of 3.69, so the distribution is clearly non-normal. The graph on the right, however, shows a much more symmetric distribution, as evidenced by its skewness being closer to 0 (in this case -0.48).

** As a general rule of thumb, it is often beneficial to log-transform financial data for machine learning techniques, such as regression, that list normality among their assumptions.

Okay, now let's take a look at the highest and lowest priced books.

In [None]:
lowest_prices = df.groupby('Name', as_index = False)['Price'].mean().sort_values('Price').head(10)
lowest_prices

This is interesting. To Kill a Mockingbird has a mean price of 1.4. Let's take a closer look at this.

In [None]:
df[df['Name'] == 'To Kill a Mockingbird']

So, we can see from this that To Kill a Mockingbird appears on the bestseller list five times and, for whatever reason, one entry lists it at $7 instead of the usual free price tag. This is interesting. Let's see what other titles appear in the list more than once.

In [None]:
import collections

names = [item for item, count in collections.Counter(df['Name']).items() if count > 1]
names

This is a bit clunky and hard to read. Let's make it easier on the eyes by turning it into a dataframe.

In [None]:
# Name the column at construction time instead of renaming it afterwards.
new_df = pd.DataFrame({'name': names})
new_df

The above output shows us that 96 of the titles in the original dataset appear more than once. To avoid any confusion moving forward, let's use the median, which is less sensitive to a single unusual listing, for a better approximation of each book's price.

In [None]:
lowest_prices = df.groupby('Name', as_index = False)['Price'].median().sort_values('Price').head(10)
lowest_prices
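
To see why the choice of statistic matters for repeated titles, here is a small sketch on hypothetical rows: one title listed five times (free in four listings, $7 in one) and another listed twice. The column names `appearances`, `mean_price` and `median_price` are made up for this example.

```python
import pandas as pd

# Hypothetical rows: 'Book A' is free in four listings and costs $7 in one.
sample = pd.DataFrame({
    'Name': ['Book A'] * 5 + ['Book B'] * 2,
    'Price': [0, 0, 0, 0, 7, 12, 14],
})

# Named aggregation: how often each title appears, plus its mean and median price.
summary = (
    sample.groupby('Name')['Price']
    .agg(appearances='count', mean_price='mean', median_price='median')
    .reset_index()
)
print(summary)
```

The single $7 listing pulls the mean for `Book A` away from 0, while the median stays at the price the book had in most years.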

In [None]:
highest_prices = df.groupby('Name', as_index = False)['Price'].median().sort_values('Price', ascending = False).head(10)
highest_prices

In [None]:
# Add breaks in long title names.
x = ['Diagnostic and Statistical<br>Manual of Mental Disorders,<br>5th Edition: DSM-5', 
     'The Twilight Saga<br>Collection', 'Hamilton: The Revolution', 
     'The Book of Basketball:<br>The NBA According to<br>The Sports Guy', 
     'Harry Potter Paperback<br>Box Set (Books 1-7)', 
     'Publication Manual of the<br>American Psychological<br>Association, 6th Edition', 
     'Watchmen', 'The Official SAT<br>Study Guide', 'The Alchemist', 
     'The Official SAT<br>Study Guide, 2016 Edition<br>(Official Study Guide<br>for the New SAT)']

layout = go.Layout(
    title = "Highest Book Prices",
    plot_bgcolor = 'white', # Setting background color to white
    xaxis = dict(
        showgrid = False
    ),
    yaxis = dict(
        showgrid = False
    )
)

fig = go.Figure(layout = layout)

fig.add_trace(
    go.Bar(
    x = x,
    y = highest_prices['Price'],
    marker_color = 'royalblue',
    marker_line_color = 'black'
    )
)

# Add text above each bar.
fig.update_traces(
    text = highest_prices['Price'], 
    textposition = 'outside', 
    texttemplate = '%{y:$.2f}', 
    textfont = {'size': 10}
)

# Increase the height.
fig.update_layout(height = 620)

fig.update_xaxes(
    tickangle = 90,
    title_text = "Name of Book"
)
fig.update_yaxes(
    title_text = "Price"
)
fig.show()
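
Typing the wrapped labels by hand works, but they can fall out of step with `highest_prices` if the data changes. One possible alternative is to generate the line breaks with the standard library's `textwrap` module. In this sketch, `wrap_label` is a hypothetical helper and `titles` stands in for `highest_prices['Name']`.

```python
import textwrap

# Hypothetical titles standing in for highest_prices['Name'].
titles = [
    'The Official SAT Study Guide, 2016 Edition (Official Study Guide for the New SAT)',
    'Watchmen',
]

def wrap_label(title, width=25):
    """Break a title into lines of at most `width` characters, joined by <br>."""
    return '<br>'.join(textwrap.wrap(title, width=width))

labels = [wrap_label(t) for t in titles]
print(labels)
```

The resulting list could then be passed as `x` to `go.Bar` in place of the hand-written one.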

The bar graph shows us that the highest-priced books tend to be educational or reference works, in the case of non-fiction, or collections of stories, in the case of fiction.

Now, let's look at the effect of genre on book price. The above graph shows an almost even split, with non-fiction books just edging out fiction among the top ten most expensive books. Is this a trend that remains consistent throughout the entire dataset? Is there a difference in the distributions of each genre?

In [None]:
# Split prices by genre with a boolean mask instead of looping over rows.
is_fiction = df['Genre'] == 'Fiction'
fiction = df.loc[is_fiction, 'Price'].tolist()
non_fiction = df.loc[~is_fiction, 'Price'].tolist()

In [None]:
layout = go.Layout(
    title = "Distribution of Prices by Genre",
    plot_bgcolor = "white",
    xaxis = dict(
        title = "Price",
        showgrid = False
    ),
    yaxis = dict(
        title = 'Count',
        showgrid = False
    )
)

fig = go.Figure(layout = layout)

fig.add_trace(go.Histogram(x = fiction, name = 'Fiction', marker_color = 'salmon'))
fig.add_trace(go.Histogram(x = non_fiction, name = 'Non Fiction', marker_color = 'royalblue'))

fig.update_layout(barmode = 'stack')
fig.show()

In [None]:
layout = go.Layout(
    title = "Distribution of Prices by Genre",
    plot_bgcolor = "white",
    xaxis = dict(
        title = "Genre",
        showgrid = False
    ),
    yaxis = dict(
        title = 'Price',
        showgrid = False
    )
)

fig = go.Figure(layout = layout)

fig.add_trace(
    go.Box(
        y = fiction, 
        name = 'Fiction', 
        boxpoints = 'suspectedoutliers',
        marker = dict(
            color = 'salmon',
            outliercolor = 'aquamarine'
        )
    )
)

fig.add_trace(
    go.Box(
        y = non_fiction,
        name = 'Non Fiction',
        boxpoints = 'suspectedoutliers',
        marker = dict(
            color = 'royalblue',
            outliercolor = 'aquamarine'
        )
    )
)

The histograms and box plots show that the two genres follow roughly similar distributions. Non-fiction books have a higher average price and more variance, as the outliers in the box plot show.
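
The visual impression can be backed up with a few summary statistics per genre. The sketch below uses hypothetical prices, and `iqr` is a helper defined here for illustration, not something from the original notebook.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be df[['Genre', 'Price']].
sample = pd.DataFrame({
    'Genre': ['Fiction'] * 4 + ['Non Fiction'] * 4,
    'Price': [5, 8, 10, 13, 6, 9, 14, 40],
})

def iqr(s):
    """Interquartile range: the height of the box in a box plot."""
    return s.quantile(0.75) - s.quantile(0.25)

# Centre (mean, median) and spread (std, IQR) of price for each genre.
stats = sample.groupby('Genre')['Price'].agg(['mean', 'median', 'std', iqr])
print(stats)
```

Comparing the mean with the median also hints at how much a handful of expensive titles drive the average.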

# Genre

Let's go ahead and explore the genre variable some more. First, we'll make two new datasets, one containing only fiction books and the other only non-fiction books.

In [None]:
g1 = df[df['Genre'] == 'Fiction']
g2 = df[df['Genre'] == 'Non Fiction']

Let's take a look at the total number of bestselling books from each genre across all years reported.

In [None]:
col = 'Year'

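# Count fiction bestsellers per year. groupby sorts the years and, unlike
# value_counts().reset_index(), gives the same column names in every pandas version.
fict = g1.groupby(col).size().reset_index(name = 'count')
fict = fict.rename(columns = {col: 'year'})
fict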

In [None]:
# Same per-year count for non-fiction bestsellers.
nonfict = g2.groupby(col).size().reset_index(name = 'count')
nonfict = nonfict.rename(columns = {col: 'year'})
nonfict

In [None]:
layout = go.Layout(
    title = "Number of Bestsellers Each Year by Genre",
    plot_bgcolor = "white",
    xaxis = dict(
        title = "Year",
        showgrid = False
    ),
    yaxis = dict(
        title = 'Count',
        showgrid = False
    )
)

fig = go.Figure(layout = layout)
fig.add_trace(go.Scatter(x = fict['year'], y = fict['count'], name = 'Fiction', marker_color = 'salmon'))
fig.add_trace(go.Scatter(x = nonfict['year'], y = nonfict['count'], name = 'Non Fiction', marker_color = 'royalblue'))

The graph shows that in every year but 2014, more non-fiction than fiction books were on the bestseller list.
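
The same per-year comparison can be read off a table. As a sketch, `pd.crosstab` builds the year-by-genre counts in one step; the rows below are hypothetical, and the `Non Fiction lead` column is a name invented for this example.

```python
import pandas as pd

# Hypothetical rows; in the notebook these columns would come from df.
sample = pd.DataFrame({
    'Year': [2009, 2009, 2009, 2010, 2010, 2010],
    'Genre': ['Fiction', 'Non Fiction', 'Non Fiction',
              'Fiction', 'Fiction', 'Non Fiction'],
})

# One row per year, one column per genre, cells hold the number of books.
counts = pd.crosstab(sample['Year'], sample['Genre'])

# Positive values mean non-fiction had more bestsellers that year.
counts['Non Fiction lead'] = counts['Non Fiction'] - counts['Fiction']
print(counts)
```

Years in which fiction comes out ahead show up as negative values in the last column.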

# User Rating

In [None]:
fig = px.histogram(df, x = 'User Rating', color = 'Genre', marginal = 'box', title = "Distribution of User Ratings")
fig.update_layout({'plot_bgcolor': 'white'})

The distributions for user ratings follow about the same pattern regardless of genre. Fiction ratings are higher by 0.1 points.

Let's perform a hypothesis test on the two distributions to see if there is a significant difference between the ratings of the two genres. We could transform the data to satisfy the normality assumptions of t- or z-tests; however, I believe it is simpler to use a non-parametric test. Specifically, I will use the Mann-Whitney U test, which works well with non-normal distributions and tests the null hypothesis (H0) that the two distributions are the same.
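
As a rough illustration of what the test computes, the U statistic can be derived by ranking all observations together and summing the ranks of one group. The sketch below uses two small hypothetical samples and compares the hand-computed values with SciPy's result.

```python
import numpy as np
from scipy.stats import mannwhitneyu, rankdata

# Hypothetical ratings for two small groups.
a = np.array([4.8, 4.7, 4.9, 4.6, 4.8])
b = np.array([4.5, 4.6, 4.4, 4.7])

# Rank all observations together; ties receive their average rank.
ranks = rankdata(np.concatenate([a, b]))
rank_sum_a = ranks[:len(a)].sum()

# U for the first sample, and its complement for the second.
u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
u_b = len(a) * len(b) - u_a

stat, p_value = mannwhitneyu(a, b, alternative='two-sided')
print(u_a, u_b, stat, p_value)
```

Depending on the SciPy version, the reported statistic is either the U of the first sample or the smaller of the two, so it should match one of `u_a` and `u_b`.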

In [None]:
genre_rating = df.groupby('Genre')['User Rating'].mean().reset_index()
genre_rating = genre_rating.rename(columns = {'Genre': 'genre', 'User Rating': 'mean'})

In [None]:
df.groupby('Genre')['User Rating'].apply(np.std)

In [None]:
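# Compute the standard deviations instead of typing them in by hand.
# Both groupby results are ordered by genre, so the values line up with the means.
genre_rating['sdev'] = df.groupby('Genre')['User Rating'].apply(np.std).to_numpy()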
genre_rating
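
For reference, the mean and standard deviation can also be built in a single step. This is only a sketch on hypothetical ratings. Note that `np.std` divides by n (`ddof=0`), whereas the pandas `std` method defaults to n - 1, so `ddof=0` is passed explicitly here to match the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings; in the notebook this would be df[['Genre', 'User Rating']].
sample = pd.DataFrame({
    'Genre': ['Fiction'] * 3 + ['Non Fiction'] * 3,
    'User Rating': [4.6, 4.8, 4.7, 4.5, 4.6, 4.4],
})

grouped = sample.groupby('Genre')['User Rating']

# Both results are indexed by genre, so pandas aligns them row by row.
genre_rating = pd.DataFrame({
    'mean': grouped.mean(),
    'sdev': grouped.std(ddof=0),   # population standard deviation, like np.std
}).reset_index().rename(columns={'Genre': 'genre'})
print(genre_rating)
```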

In [None]:
print('##### Mann-Whitney U Test #####\n')
print('###############################\n')
print('H0: Distributions are equal\n')
print('H1: Distributions are not equal\n')

In [None]:
from scipy.stats import mannwhitneyu

# Two-sided test: H1 is that the distributions differ in either direction.
stat, p_value = mannwhitneyu(g1['User Rating'], g2['User Rating'], alternative = 'two-sided')
alpha = 0.05

print('U Statistic: ', stat)
print('P-value: ', p_value)

if p_value < alpha:
    print('Reject H0, Distributions are not equal.')
else:
    # Failing to reject H0 does not prove that the distributions are equal.
    print('Fail to Reject H0, no evidence that the distributions differ.')

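Indeed, the test rejects the null hypothesis that the two distributions are the same. Together with the higher fiction ratings seen in the histogram above, this suggests that user ratings for fiction books are significantly higher than those for non-fiction books.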
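
To test the direction explicitly, `mannwhitneyu` accepts `alternative = 'greater'`, which makes H1 "values in the first sample tend to be larger". The sketch below runs it on simulated ratings (not the real data) and adds a simple effect size: the share of pairs in which the first group has the higher rating.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Hypothetical ratings: group a is shifted slightly upward relative to b.
a = np.round(rng.normal(4.65, 0.2, size=200), 1)
b = np.round(rng.normal(4.55, 0.2, size=200), 1)

# One-sided test. H1: values in a tend to be larger than values in b.
stat, p_value = mannwhitneyu(a, b, alternative='greater')

# Share of (a, b) pairs in which a is higher, counting ties as half.
# Computed by brute force so it does not rely on which U SciPy returns.
diff = a[:, None] - b[None, :]
prob_a_greater = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print('P-value: ', p_value)
print('P(a > b): ', prob_a_greater)
```

A share close to 0.5 would mean the difference is small in practice, even if the p-value is tiny.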

# Reviews

In [None]:
genre_reviews = df.groupby('Genre')['Reviews'].sum().reset_index()
genre_reviews

In [None]:
go.Figure(go.Bar(x = genre_reviews['Genre'], y = genre_reviews['Reviews']), layout = go.Layout(plot_bgcolor = 'white'))

In [None]:
fig = px.histogram(df, x = 'Reviews', color = 'Genre')
fig.update_layout({'plot_bgcolor': 'white'})

Let's take a quick look at what different data transformations would do to our reviews data.

In [None]:
log_r = df.copy()
sqrt_r = df.copy()
cube_r = df.copy()

log_r['Reviews'] = log_r['Reviews'].map(lambda x: np.log(x) if x > 0 else 0)
sqrt_r['Reviews'] = sqrt_r['Reviews'].map(lambda y: np.sqrt(y) if y > 0 else 0)
cube_r['Reviews'] = cube_r['Reviews'].map(lambda z: np.cbrt(z) if z > 0 else 0)

In [None]:
fig, ax = plt.subplots(2, 2, figsize = (12, 6))

ax[0,0].hist(
    df['Reviews'],
    color = 'pink',
    bins = 25,
    edgecolor = 'black',
    label = "Skewness: %.2f"%df['Reviews'].skew()
)

ax[0,0].legend()
ax[0,0].set_title('Original Data', fontsize = 14, fontweight = 'bold')

ax[0,1].hist(
    log_r['Reviews'],
    color = 'royalblue',
    bins = 25,
    edgecolor = 'black',
    label = "Skewness: %.2f"%log_r['Reviews'].skew()
)

ax[0,1].legend()
ax[0,1].set_title('Log Transformation', fontsize = 14, fontweight = 'bold')

ax[1,0].hist(
    sqrt_r['Reviews'],
    color = 'darkcyan',
    bins = 25,
    edgecolor = 'black',
    label = "Skewness: %.2f"%sqrt_r['Reviews'].skew()
)

ax[1,0].legend()
ax[1,0].set_title('Square Root Transformation', fontsize = 14, fontweight = 'bold')

ax[1,1].hist(
    cube_r['Reviews'],
    color = 'khaki',
    bins = 25,
    edgecolor = 'black',
    label = "Skewness: %.2f"%cube_r['Reviews'].skew()
)

ax[1,1].legend()
ax[1,1].set_title('Cube Root Transformation', fontsize = 14, fontweight = 'bold')
fig.tight_layout()

The above graphs show us that the cube root reduces the skewness of the data the most, resulting in the value closest to zero of all the transformations tested, in this case 0.43. An interesting note: the square root distribution is very similar to the cube root distribution.
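
The comparison can also be made numerically rather than by reading the legends. The sketch below loops over candidate transformations and picks the one whose skewness is closest to zero. It uses simulated right-skewed counts and `np.log1p` for the log step, both of which are assumptions of this example, so the winner here need not match the one above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical right-skewed review counts, not the real Reviews column.
reviews = pd.Series(rng.lognormal(mean=9, sigma=1, size=500).round())

# Candidate transformations, keyed by a label for the results table.
transforms = {
    'original': lambda s: s,
    'log': np.log1p,     # log(1 + x), safe for zero counts
    'sqrt': np.sqrt,
    'cbrt': np.cbrt,
}

# Skewness after each transformation; values near 0 indicate symmetry.
skewness = pd.Series({name: fn(reviews).skew() for name, fn in transforms.items()})
best = skewness.abs().idxmin()

print(skewness.round(2))
print('Closest to zero: ', best)
```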

As with user rating, let's run a Mann-Whitney U Test to test the difference of the distributions.

Remember,

* Null hypothesis: The Distributions are Equal
* Alternative hypothesis: The Distributions are not Equal

In [None]:
stat, p_value = mannwhitneyu(g1['Reviews'], g2['Reviews'])
print('U Statistic: ', stat)
print('P-value: ', p_value)

The p-value falls below our significance level of 0.05, so we reject the null hypothesis: the distributions of review counts for fiction and non-fiction books are different, with fiction books tending to receive more reviews than non-fiction books.