The main goal is to show various techniques for checking whether variables follow a normal distribution:

1. Normal distribution parameters check
2. Hypothesis testing
3. Graphic representation of density functions.

Besides applying the above methods, I will modify the dataset where needed to fix typos and to convert categorical variables into numerical ones.

## 1. Import required libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import scipy.stats

from sklearn.preprocessing import OrdinalEncoder

## 2. Upload the dataset

In [None]:
df = pd.read_csv('../input/goodreadsbooks/books.csv', on_bad_lines='skip')  # pandas >= 1.3; replaces the deprecated error_bad_lines=False

In [None]:
skipped_lines_percent = round(4/(df.shape[0]+4)*100,4)
skipped_lines_percent

The 4 skipped lines account for about 0.0359% of the whole dataset, so skipping them is acceptable.

## 3. Take a look at basic information about the dataset

In [None]:
df.head()

In [None]:
df.shape

The dataset consists of **11123 rows and 12 columns**. The **column names** are listed below:

In [None]:
df.columns.to_list()

## 4. Dataframe modification 
### 4.1 Fix column names 

I will remove the whitespace from the "num_pages" column name.

In [None]:
df.columns = df.columns.str.replace(' ', '')
df.columns.to_list()
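As a minimal sketch of what this cleanup does, the toy column index below (made up for illustration, not read from the dataset) carries leading spaces in one name. `str.replace(' ', '')` removes every space, while `str.strip()` would remove only leading and trailing whitespace; for names without inner spaces both give the same result.

```python
import pandas as pd

# Hypothetical column names; the padded entry mimics a header with stray spaces
columns = pd.Index(['bookID', '  num_pages', 'ratings_count'])

cleaned = columns.str.replace(' ', '')   # drops every space character
stripped = columns.str.strip()           # drops only leading/trailing whitespace

print(cleaned.to_list())
print(stripped.to_list())
```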

### 4.2 Remove needless columns

Some columns like:
* bookID
* isbn
* isbn13
* title
* authors

are not interesting from the viewpoint of distribution analysis, so they will be removed.

In [None]:
print(f'Number of columns before removing: {df.shape[1]}')
df = df.drop(['bookID', 'isbn', 'isbn13', 'title', 'authors'], axis=1)
print(f'Number of columns after removing: {df.shape[1]}')

In [None]:
df.columns.to_list()

### 4.3 Convert some categorical data into numerical

In [None]:
df.nunique()

Since the "language_code" column has only 27 unique values, it can easily be mapped to numerical values.

In [None]:
df.language_code.unique()

Language codes with the prefix 'en-', such as en-US, en-CA and en-GB, will be replaced by 'eng'.

In [None]:
df.language_code = df.language_code.replace(to_replace ='en-..', value = 'eng', regex = True)
np.sort(df.language_code.unique())
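To illustrate how the regex replacement behaves, here is a small sketch on a toy Series (the values are invented for the example). The pattern `en-..` matches 'en-' followed by any two characters; the anchored variant `^en-..$` is an optional, stricter alternative that only matches when the whole value has that shape.

```python
import pandas as pd

# Toy language codes, not taken from the dataset
codes = pd.Series(['eng', 'en-US', 'en-GB', 'fre', 'en-CA'])

# Same call as in the notebook: every 'en-XX' value becomes 'eng'
replaced = codes.replace(to_replace='en-..', value='eng', regex=True)

# Stricter variant (an assumption, not the notebook's pattern): match the full value only
replaced_anchored = codes.replace(to_replace=r'^en-..$', value='eng', regex=True)

print(replaced.to_list())
```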

We can see that the 'en-' values in the dataset have been replaced by 'eng'. The last step is to convert this column into a column of numerical values.

In [None]:
before = df.language_code.unique()

enc = OrdinalEncoder()
df.language_code = enc.fit_transform(df.language_code.values.reshape(-1, 1)).astype(int)

In [None]:
pd.DataFrame(data={'before': before,
                   'after': df.language_code.unique()}).sort_values(by='before')

Mapping the "language_code" values to integers makes the column usable in the distribution analysis that follows.
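The encoding step can be sketched on toy data as follows. With default settings, `OrdinalEncoder` sorts the unique values and assigns consecutive integers, so the codes are alphabetical labels rather than quantities; the fitted `categories_` attribute and `inverse_transform` recover the original strings. The sample values are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy language codes (illustrative only), shaped as a single feature column
codes = np.array(['eng', 'fre', 'spa', 'eng', 'ger']).reshape(-1, 1)

enc = OrdinalEncoder()
encoded = enc.fit_transform(codes).astype(int).ravel()

# categories_[0] holds the sorted labels; the position of a label is its code
mapping = {label: idx for idx, label in enumerate(enc.categories_[0])}

# inverse_transform maps the integer codes back to the original labels
decoded = enc.inverse_transform(encoded.reshape(-1, 1)).ravel()

print(mapping)
```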

### 4.4 Feature engineering

#### 4.4.1 Publication date
"publication_date" may be valuable for distribution analysis, especially once the year is extracted.

In [None]:
df['year'] = df.publication_date.str.rsplit("/", n=2, expand=True)[2].astype(int)
# n=2 because two splits on "/" yield the three parts of the date
# [2] because we are interested only in the last part, the year

df.head(2)
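A minimal sketch of the year extraction on toy date strings (made up for the example, assuming three "/"-separated parts with the year last). In `str.rsplit`, `n` is the maximum number of splits, so `n=2` already yields three parts. Taking the last element of a plain split is an equivalent alternative.

```python
import pandas as pd

# Toy date strings, assumed to have three '/'-separated parts with the year last
dates = pd.Series(['9/16/2006', '11/1/2003'])

parts = dates.str.rsplit('/', n=2, expand=True)  # two splits -> three columns
years = parts[2].astype(int)

# Equivalent alternative: take the last piece of a plain split
years_alt = dates.str.split('/').str[-1].astype(int)

print(years.to_list())
```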

Because the year is now stored in its own column, the redundant "publication_date" column will be removed.

In [None]:
df = df.drop(['publication_date'], axis=1)
df.head(2)

## 5. Normal distribution analysis
I will approach the analysis in three ways:
* using basic statistics to inspect the normal distribution parameters
* hypothesis testing for normality
* graphic representation of density functions

First, a last glance at the basic statistics to check that the data points look reasonable.

In [None]:
df.describe()

### 5.1 Normal distribution parameters check

Parameters that indicate whether a distribution is normal are:
* mean
* median
* kurtosis
* skewness.

For a normal distribution, the mean and median [should have the same value](https://en.wikipedia.org/wiki/Normal_distribution), and the skewness and excess kurtosis (which is what pandas reports) [should be equal to 0](https://en.wikipedia.org/wiki/Normal_distribution).
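To see what these parameters look like in practice, the sketch below compares a simulated normal sample with a clearly skewed (exponential) one. The samples, seed and sizes are arbitrary choices for illustration. For the normal sample the mean and median nearly coincide and both skewness and excess kurtosis are close to 0, while the exponential sample departs on all three counts.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Simulated data: one normal column, one right-skewed column
sample = pd.DataFrame({
    'normal': rng.normal(loc=10, scale=2, size=100_000),
    'exponential': rng.exponential(scale=2, size=100_000),
})

# 'kurt' is the excess kurtosis, i.e. 0 for a normal distribution
summary = sample.agg(['mean', 'median', 'kurt', 'skew']).T
print(summary.round(3))
```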


In [None]:
df.select_dtypes(include='number').agg(['mean', 'median', 'kurtosis', 'skew']).T  # numeric columns only; "publisher" is text

Mean and median have similar values for: 
* average_rating (left-skewed distribution)
* num_pages (right-skewed distribution)
* language_code (right-skewed distribution)
* year (left-skewed distribution, with the skewness closest to 0)

The "year" and "average_rating" are our front-runners in the race for normal distribution ;)

### 5.2 Hypothesis testing

In [None]:
results = []
p_value_list = []
alpha = 0.05  # significance level
numeric_columns = df.select_dtypes(include='number').columns

# Null hypothesis: the sample comes from a normal distribution
for col in numeric_columns:
    p_value = scipy.stats.normaltest(df[col]).pvalue  # keep only the p-value, not the statistic
    p_value_list.append(p_value)
    if p_value < alpha:
        results.append('rejected')
    else:
        results.append('not rejected')

pd.DataFrame(data={'variable': numeric_columns,
                   'p_value': p_value_list,
                   'null hypothesis': results})

According to the hypothesis tests, none of the variables comes from a normal distribution: the null hypothesis of normality is rejected for every one of them. It is hard to tell which feature is closest to normal, because all p-values are reported as 0.
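When every p-value is reported as 0, one rough way to still compare variables is to look at the test statistic itself, where larger values indicate a stronger departure from normality. The helper `normality_report` below is a hypothetical name introduced for this sketch, and the data are simulated, so this is an illustration under those assumptions rather than part of the original analysis.

```python
import numpy as np
import pandas as pd
import scipy.stats

def normality_report(frame, alpha=0.05):
    """Run scipy.stats.normaltest on each numeric column (hypothetical helper)."""
    rows = []
    for name in frame.select_dtypes(include='number').columns:
        statistic, p_value = scipy.stats.normaltest(frame[name])
        rows.append({
            'variable': name,
            'statistic': statistic,
            'p_value': p_value,
            'null hypothesis': 'rejected' if p_value < alpha else 'not rejected',
        })
    # Sort so that the column closest to normal (smallest statistic) comes first
    return pd.DataFrame(rows).sort_values('statistic').reset_index(drop=True)

rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'normal': rng.normal(size=2_000),        # simulated normal data
    'skewed': rng.exponential(size=2_000),   # simulated right-skewed data
})

report = normality_report(toy)
print(report)
```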

### 5.3 Graphic representation of density functions

In [None]:
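f, axes = plt.subplots(3, 2, figsize=(15, 10))
# sns.distplot is deprecated; histplot with kde=True draws a comparable histogram plus density curve
sns.histplot(df.average_rating, kde=True, stat='density', color='skyblue', ax=axes[0, 0])
sns.histplot(df.num_pages, kde=True, stat='density', color='olive', ax=axes[0, 1])
sns.histplot(df.ratings_count, kde=True, stat='density', color='gold', ax=axes[1, 0])
sns.histplot(df.text_reviews_count, kde=True, stat='density', color='teal', ax=axes[1, 1])
sns.histplot(df.year, kde=True, stat='density', color='skyblue', ax=axes[2, 0])
sns.countplot(x='language_code', data=df, ax=axes[2, 1])
plt.show()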

As we can see, "average_rating" is the variable whose density looks closest to a normal distribution.
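A visual judgement like this can be made a little easier by drawing the normal density with the sample's own mean and standard deviation on top of the histogram. The sketch below does this for simulated data; `plot_with_normal_fit` is a hypothetical helper name, and the sample parameters are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats

def plot_with_normal_fit(values, ax, bins=30):
    """Histogram of `values` with a normal pdf using the sample mean and std (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    mu, sigma = values.mean(), values.std(ddof=1)
    ax.hist(values, bins=bins, density=True, alpha=0.5)
    grid = np.linspace(values.min(), values.max(), 200)
    ax.plot(grid, scipy.stats.norm.pdf(grid, loc=mu, scale=sigma))
    return mu, sigma

rng = np.random.default_rng(1)
simulated = rng.normal(loc=4, scale=0.3, size=5_000)  # simulated, not the real ratings

fig, ax = plt.subplots()
mu, sigma = plot_with_normal_fit(simulated, ax)
```

In the notebook, the same helper could be called with a dataframe column and one of the existing axes, followed by `plt.show()`.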

## 6. Conclusion

The dataset originally consisted of 12 columns. Five of them (bookID, isbn, isbn13, title and authors) were removed because analysing their distribution was not meaningful.
The "publication_date" column was replaced by the "year" column.
In the end, the dataset had 7 columns: categorical ("language_code" and "publisher") and numerical (the rest).

The three methods gave different results. Although the distribution of "average_rating" looks approximately normal in the plot (5.3), none of the numerical variables is normally distributed according to the distribution parameters (5.1) and the hypothesis tests (5.2).