## Introduction

This is my first EDA, and I hope to receive constructive criticism.

## EDA 

Import the necessary libraries and load the files needed for the initial dataset exploration.

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('../input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv')

Get an idea of the size and shape of the dataset.

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.head()

The head function shows the first five rows, which gives a better sense of how the entire dataset is organized.

In [None]:
df.describe()

By default, the describe function outputs summary statistics for the numeric columns only. Shown are the statistics for the float and integer columns listed by the info function.

In [None]:
df[df.Author == 'J. K. Rowling']

In [None]:
df[df.Author == 'J.K. Rowling']

The two outputs above reveal an inconsistency in the dataset: one author's name is not spelled the same way across all of its occurrences. The code below fixes that issue. Code Credit: Arush Chillar

In [None]:
df.loc[df.Author == 'J. K. Rowling', 'Author'] = 'J.K. Rowling'
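The Rowling case was spotted by hand, so other authors may have the same problem. Below is a minimal sketch of one way to look for similar spelling variants, assuming they differ only by whitespace or capitalization. The `toy` frame and its author names are hypothetical stand-ins for `df`, not rows from the dataset.

```python
import pandas as pd

# Toy data for illustration only (an assumption, not taken from the real dataset).
toy = pd.DataFrame({
    'Author': ['J.K. Rowling', 'J. K. Rowling', 'George R.R. Martin',
               'George R. R. Martin', 'Dr. Seuss'],
})

# Build a comparison key: strip all whitespace and lowercase the name.
key = toy['Author'].str.replace(r'\s+', '', regex=True).str.lower()

# Count the distinct spellings that share each key.
n_spellings = toy.groupby(key)['Author'].nunique()

# Keys with more than one spelling point to inconsistent names.
suspect_keys = n_spellings[n_spellings > 1].index
suspects = toy[key.isin(suspect_keys)]
print(suspects['Author'].unique())
```

In the notebook itself, `toy` would be replaced with `df`. This key only catches whitespace and capitalization differences; misspellings or abbreviations would need a fuzzier comparison.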

## Data Visualization and Insights

The package I will use for the data visualization of this dataset is "plotnine", which is based on the "ggplot2" graphics package from R. I chose this package because I am more familiar with R Tidyverse ggplot2 data visualization than with other common Python modules.

In [None]:
%matplotlib inline
import numpy as np
from plotnine import *

For this section, it is important to see exactly what can be compared. The easiest comparisons are across a categorical variable, which in this case is Genre: "Fiction" versus "Non Fiction".

## Price

In [None]:
(
    ggplot(df, aes(x='Price', color='Genre', fill='Genre'))
    + geom_density(alpha=0.1)
    + geom_rug()
)

Above is a visual representation of the distribution of prices within each genre. As you can see, both genres tend to have prices in the lower quarter of the price range, which is consistent with lower prices being commensurate with higher sales.

In [None]:
(
    ggplot(df, aes(x='Genre', y='Price', color='Genre'))
    + geom_boxplot()
    + labs(x='Genre', y='Price')
)

The figure above shows that price is roughly similar between the two genres. However, statistical tests would be desirable to compare the two.
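To put numbers next to the boxplot, the quartiles it draws can be tabulated per genre. This is a small sketch using made-up prices (the `toy` frame is an assumption for illustration, not data from the dataset).

```python
import pandas as pd

# Made-up prices for illustration only; these are NOT values from the dataset.
toy = pd.DataFrame({
    'Genre': ['Fiction', 'Fiction', 'Fiction',
              'Non Fiction', 'Non Fiction', 'Non Fiction'],
    'Price': [5, 8, 11, 6, 9, 30],
})

# Per-genre count, mean, std, min, quartiles, and max.
# The 25%, 50%, and 75% columns are the box edges and median line of a boxplot.
summary = toy.groupby('Genre')['Price'].describe()
print(summary)
```

In the notebook, the same call on `df` would give the medians and quartiles behind the figure above.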

## Price Statistical Tests

In [None]:
from scipy import stats

There are several things to consider when choosing which test to use in a comparison of means between these two groups. I want to use a t-test to carry out the mean comparison; however, the price data is skewed and does not meet the normality assumption.
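One way to check the normality assumption more formally is a Shapiro-Wilk test alongside the sample skewness. The sketch below runs both on synthetic right-skewed values; the `prices` array is generated for illustration and is an assumption, not the dataset's Price column.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic right-skewed "prices" drawn from a log-normal distribution.
# This is an illustrative assumption, NOT the real Price column.
prices = rng.lognormal(mean=2.5, sigma=0.6, size=200)

# Shapiro-Wilk: the null hypothesis is that the sample is normally distributed,
# so a small p-value is evidence against normality.
w_stat, p_value = stats.shapiro(prices)

# Positive skewness indicates a long right tail.
skewness = stats.skew(prices)

print(f"Shapiro-Wilk W={w_stat:.3f}, p={p_value:.4g}, skewness={skewness:.2f}")
```

In the notebook, the test could be run per genre, for example on `df.loc[df.Genre == 'Fiction', 'Price']`.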

The first step to take is to log transform the data and see if that helps meet the normality assumption, which the t-test relies on to keep the Type I error rate under control.

Before carrying out the transformation, I will first make a separate dataset of the price data.

In [None]:
price_data = df[['Genre', 'Price']]
price_data.head()

### Log Transformation

In [None]:
# Take the natural log of the Price column (a price of 0 becomes -inf)
price_data = np.log(price_data['Price'])

In [None]:
df['Price_Log'] = price_data

In [None]:
df.iloc[42]

Notice the result of the log transformation on this row: its price is zero, so the log comes out as negative infinity (-inf). A fix must be introduced in order to avoid taking the log of the zero price value.
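Two common ways to handle the zero price are sketched below, using a hypothetical `toy` frame rather than the real data. Neither is necessarily the right choice for this dataset. Option A changes the transformation to log(1 + x), and Option B drops the zero-priced rows from the log-scale analysis by turning them into missing values.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (an assumption, not rows from the dataset).
toy = pd.DataFrame({
    'Genre': ['Fiction', 'Fiction', 'Non Fiction', 'Non Fiction'],
    'Price': [0, 8, 15, 4],
})

# Option A: log(1 + x) is defined at zero, so no -inf values appear.
# Note that this is a slightly different transformation than a plain log.
toy['Price_Log1p'] = np.log1p(toy['Price'])

# Option B: replace non-positive prices with NaN first, then take the log.
# The zero-priced rows become NaN and can be skipped by later steps.
toy['Price_Log'] = np.log(toy['Price'].where(toy['Price'] > 0))

print(toy)
```

With Option B, plotting and testing code would need to ignore the NaN rows, for example by calling `dropna` on the column first.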

In [None]:
(
    ggplot(df, aes(x='Price_Log', color='Genre', fill='Genre'))
    + geom_density(alpha=0.1)
    + geom_rug()
)

Once the skewed data is log transformed, it resembles what one would expect to see in a normally distributed dataset. The transformed data can therefore be used in statistical tests that assume normality.
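As a sketch of what the comparison of means could look like on the log scale, the block below runs Welch's t-test, a t-test variant that does not assume equal variances, which is my choice here rather than something fixed by the analysis above. The `toy` frame holds synthetic values generated for illustration, so its result says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic log-prices for two genres; an assumption for illustration only.
toy = pd.DataFrame({
    'Genre': ['Fiction'] * 100 + ['Non Fiction'] * 100,
    'Price_Log': np.concatenate([
        rng.normal(2.3, 0.6, 100),
        rng.normal(2.5, 0.6, 100),
    ]),
})

fiction = toy.loc[toy['Genre'] == 'Fiction', 'Price_Log']
non_fiction = toy.loc[toy['Genre'] == 'Non Fiction', 'Price_Log']

# Welch's t-test: equal_var=False drops the equal-variance assumption.
# nan_policy='omit' skips NaN values; -inf values must be handled beforehand.
result = stats.ttest_ind(fiction, non_fiction, equal_var=False, nan_policy='omit')
t_stat, p_value = result.statistic, result.pvalue

print(f"t={t_stat:.3f}, p={p_value:.4g}")
```

In the notebook, `toy` would be replaced with `df` once the zero-price rows have been dealt with, since a -inf value would make the group mean meaningless.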

## User Rating

In [None]:
(
    ggplot(df, aes(x='User Rating', color='Genre', fill='Genre'))
    + geom_density(alpha=0.1)
)

The distribution of User Rating is what one would expect from a list of "best sellers": the ratings are concentrated within the 4 to 5 star range.