### Tests for Normality



#### Various forms of distribution

There are many kinds of probability distributions, and each describes the probability of the different outcomes of a random experiment.

The normal distribution is the most common and widely used distribution in statistics. It is also called the "bell curve" or the "Gaussian curve".

Approximately normal distributions occur commonly in nature.

#### What is normality?

Normality means that your data follows the normal distribution. Specifically, each value $y_i$ in $Y$ is a 'realization' of some normally distributed random variable $N(\mu_i, \sigma_i)$.

A normal (or Gaussian) distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is

$$
f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}
$$

where $\mu$ is the mean and $\sigma$ is the standard deviation. Like any probability density, it integrates to one:

$$
\int_{-\infty}^{+\infty} f(x) \, \mathrm{d}x = 1
$$

A normal distribution with these parameters is written

$$
N(\mu, \sigma)
$$
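
As a quick sanity check on the density formula above, the following sketch writes the formula out by hand, compares it with `scipy.stats.norm.pdf`, and integrates it numerically. The helper name `normal_pdf` and the parameter values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad


def normal_pdf(x, mu, sigma):
    """Normal density written out term by term (hypothetical helper)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


mu, sigma = 0.0, 1.0  # illustrative parameter values

# Compare the hand-written density with SciPy's on a small grid
x = np.linspace(-4.0, 4.0, 9)
max_diff = np.max(np.abs(normal_pdf(x, mu, sigma) - stats.norm.pdf(x, mu, sigma)))

# Numerically integrate the density over the whole real line
area, _ = quad(normal_pdf, -np.inf, np.inf, args=(mu, sigma))

print('largest difference from scipy.stats.norm.pdf: %.2e' % max_diff)
print('area under the curve: %.6f' % area)
```

The difference should be at the level of floating-point rounding, and the area should be very close to 1.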

Generate and plot a standard normal distribution, i.e., $\mu = 0$ and $\sigma = 1$.

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import math
import matplotlib as mpl

mu = 0
variance = 1
sigma = math.sqrt(variance)
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
plt.plot(x, stats.norm.pdf(x, mu, sigma))
plt.title('Normal Distribution')
plt.xlabel('x')
plt.ylabel('p(x)')
plt.show()

#### Assignment

- Generate a normal distribution over the range $x_{min} = 0$ to $x_{max} = 16$, with $\mu = 8$ and $\sigma = 2$ (the values set in the cell below).
- Use matplotlib to plot your distribution.

In [0]:
import matplotlib.pyplot as plt
import scipy.stats
import numpy as np


x_min = 0.0
x_max = 16.0

mean = 8.0 
std = 2.0

???

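Here's a more advanced example that plots a similar distribution (here with `std = 3.0`) and shades the regions that lie within one, two, and three standard deviations of the mean.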

In [0]:
import matplotlib.pyplot as plt
import scipy.stats
import numpy as np
import matplotlib as mpl

x_min = 0.0
x_max = 16.0

mean = 8.0 
std = 3.0

x = np.linspace(x_min, x_max, 100)

y = scipy.stats.norm.pdf(x,mean,std)

mpl.style.use('default')

plt.plot(x,y, color='black')

#----------------------------------------------------------------------------------------#
# fill area 1

pt1 = mean + std
plt.plot([pt1 ,pt1 ],[0.0,scipy.stats.norm.pdf(pt1 ,mean, std)], color='black')

pt2 = mean - std
plt.plot([pt2 ,pt2 ],[0.0,scipy.stats.norm.pdf(pt2 ,mean, std)], color='black')

ptx = np.linspace(pt1, pt2, 10)
pty = scipy.stats.norm.pdf(ptx,mean,std)

plt.fill_between(ptx, pty, color='#0b559f', alpha=1.0)

#----------------------------------------------------------------------------------------#
# fill area 2

pt1 = mean + std
plt.plot([pt1 ,pt1 ],[0.0,scipy.stats.norm.pdf(pt1 ,mean, std)], color='black')

pt2 = mean + 2.0 * std
plt.plot([pt2 ,pt2 ],[0.0,scipy.stats.norm.pdf(pt2 ,mean, std)], color='black')

ptx = np.linspace(pt1, pt2, 10)
pty = scipy.stats.norm.pdf(ptx,mean,std)

plt.fill_between(ptx, pty, color='#2b7bba', alpha=1.0)

#----------------------------------------------------------------------------------------#
# fill area 3

pt1 = mean - std
plt.plot([pt1 ,pt1 ],[0.0,scipy.stats.norm.pdf(pt1 ,mean, std)], color='black')

pt2 = mean - 2.0 * std
plt.plot([pt2 ,pt2 ],[0.0,scipy.stats.norm.pdf(pt2 ,mean, std)], color='black')

ptx = np.linspace(pt1, pt2, 10)
pty = scipy.stats.norm.pdf(ptx,mean,std)

plt.fill_between(ptx, pty, color='#2b7bba', alpha=1.0)

#----------------------------------------------------------------------------------------#
# fill area 4

pt1 = mean + 2.0 * std
plt.plot([pt1 ,pt1 ],[0.0,scipy.stats.norm.pdf(pt1 ,mean, std)], color='black')

pt2 = mean + 3.0 * std
plt.plot([pt2 ,pt2 ],[0.0,scipy.stats.norm.pdf(pt2 ,mean, std)], color='black')

ptx = np.linspace(pt1, pt2, 10)
pty = scipy.stats.norm.pdf(ptx,mean,std)

plt.fill_between(ptx, pty, color='#539ecd', alpha=1.0)

#----------------------------------------------------------------------------------------#
# fill area 5

pt1 = mean - 2.0 * std
plt.plot([pt1 ,pt1 ],[0.0,scipy.stats.norm.pdf(pt1 ,mean, std)], color='black')

pt2 = mean - 3.0 * std
plt.plot([pt2 ,pt2 ],[0.0,scipy.stats.norm.pdf(pt2 ,mean, std)], color='black')

ptx = np.linspace(pt1, pt2, 10)
pty = scipy.stats.norm.pdf(ptx,mean,std)

plt.fill_between(ptx, pty, color='#539ecd', alpha=1.0)

#----------------------------------------------------------------------------------------#
# fill area 6

pt1 = mean + 3.0 * std
plt.plot([pt1 ,pt1 ],[0.0,scipy.stats.norm.pdf(pt1 ,mean, std)], color='black')

pt2 = mean + 10.0 *std
plt.plot([pt2 ,pt2 ],[0.0,scipy.stats.norm.pdf(pt2 ,mean, std)], color='black')

ptx = np.linspace(pt1, pt2, 10)
pty = scipy.stats.norm.pdf(ptx,mean,std)

plt.fill_between(ptx, pty, color='#89bedc', alpha=1.0)

#----------------------------------------------------------------------------------------#
# fill area 7

pt1 = mean - 3.0 * std
plt.plot([pt1 ,pt1 ],[0.0,scipy.stats.norm.pdf(pt1 ,mean, std)], color='black')

pt2 = mean - 10.0 * std
plt.plot([pt2 ,pt2 ],[0.0,scipy.stats.norm.pdf(pt2 ,mean, std)], color='black')

ptx = np.linspace(pt1, pt2, 10)
pty = scipy.stats.norm.pdf(ptx,mean,std)

plt.fill_between(ptx, pty, color='#89bedc', alpha=1.0)

#----------------------------------------------------------------------------------------#

plt.grid()

plt.xlim(x_min,x_max)
plt.ylim(0,0.25)

plt.title('Normal Distribution',fontsize=10)

plt.xlabel('x')
plt.ylabel('p(x)')

# plt.savefig("normal_distribution_2.png")
plt.show()
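
The shaded bands in the plot correspond to areas under the curve, i.e., probabilities. As a small sketch (the dictionary name `within` is just an illustrative choice), the cumulative distribution function gives the probability of falling within one, two, and three standard deviations of the mean, often called the 68-95-99.7 rule.

```python
from scipy.stats import norm

mean, std = 8.0, 3.0  # same values as in the plot above

# Probability mass within k standard deviations of the mean
within = {}
for k in (1, 2, 3):
    upper = norm.cdf(mean + k * std, mean, std)
    lower = norm.cdf(mean - k * std, mean, std)
    within[k] = upper - lower
    print('P(|x - mean| <= %d std) = %.4f' % (k, within[k]))
```

These probabilities do not depend on the particular `mean` and `std` chosen; they are a property of every normal distribution.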

### Normality Tests in Python

In general, if the data is Gaussian (normal), we can use parametric statistical methods, i.e., methods that assume the data comes from a distribution described by a small, fixed set of parameters, for example $N(\mu, \sigma)$ for a normal distribution. So having a normal distribution makes our lives easier. Unfortunately, many important real-world distributions are not normal, and the assumption of normality is a significant problem in many statistical analyses.

If the data is not normal, then we need to use some form of nonparametric statistical methods, which do not assume a particular distributional form. These often rely on ranks or on repeatedly resampling the data rather than on a fixed set of distribution parameters.

In terms of modeling, the standard inference for basic methods like linear regression assumes normally distributed errors. If that assumption does not hold, we may need to transform the data or use more flexible models such as neural networks.

In [0]:
# generate gaussian data
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import std
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(100) + 50
# summarize
print('mean=%.3f stdv=%.3f' % (mean(data), std(data)))

### Visual Normality Checks

We can create plots of the data to check whether it is Gaussian.

These checks are qualitative, so they are less accurate than the statistical tests covered later. Nevertheless, they are fast and, like the statistical tests, must still be interpreted before you can make a call about your data sample.

In this section, we will look at two common methods for visually inspecting a dataset to check if it was drawn from a Gaussian distribution.

#### Histogram Plot

A simple and commonly used plot to quickly check the distribution of a sample of data is the histogram.

In the histogram, the data is divided into a pre-specified number of groups called bins. The data is then sorted into each bin and the count of the number of observations in each bin is retained.

The plot shows the bins across the x-axis maintaining their ordinal relationship, and the count in each bin on the y-axis.

A sample of data has a Gaussian distribution if the histogram plot shows the familiar bell shape.

A histogram can be created using the `hist()` matplotlib function. By default, it uses 10 bins; pass `bins='auto'` to have the number of bins estimated from the data sample.

A complete example demonstrating the histogram plot on the test problem is listed below.

In [0]:
# histogram plot
from numpy.random import seed
from numpy.random import randn
from matplotlib import pyplot
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(100) + 50
# histogram plot
pyplot.hist(data)
pyplot.show()

Running the example creates a histogram plot showing the number of observations in each bin.

We can see a Gaussian-like shape to the data that, although not strongly the familiar bell shape, is a rough approximation.
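
How bell-shaped the histogram looks depends on the binning. The sketch below uses `numpy.histogram` (no plotting) to show how the same sample is counted under different `bins` settings; it assumes that `hist()` bins data the same way, and the dictionary name `hist_results` is only illustrative.

```python
import numpy as np

# Same test sample as in the histogram example above
np.random.seed(1)
data = 5 * np.random.randn(100) + 50

# Count the observations per bin for a few different bin settings
hist_results = {}
for bins in (5, 10, 'auto'):
    counts, edges = np.histogram(data, bins=bins)
    hist_results[bins] = (counts, edges)
    print('bins=%-5s -> %2d bins, counts=%s' % (bins, len(counts), counts.tolist()))
```

Fewer bins smooth the shape but hide detail; more bins reveal detail but make the counts noisier.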

#### Assignment

Does changing the parameters of the histogram (for example, the number of bins) make the distribution look more normal?

What about increasing the size of the sample? I.e., rather than using $100$, try $1000$ or $10000$.

Your answer here:


#### Quantile-Quantile Plot

Another popular plot for checking the distribution of a data sample is the quantile-quantile plot, Q-Q plot, or QQ plot for short.

This plot generates its own sample of the idealized distribution that we are comparing with, in this case the Gaussian distribution. The idealized samples are divided into groups (e.g. 5), called quantiles. Each data point in the sample is paired with the member of the idealized distribution that sits at the same cumulative probability.

The resulting points are plotted as a scatter plot with the idealized value on the x-axis and the data sample on the y-axis.

A perfect match for the distribution will be shown by a line of dots on a 45-degree angle from the bottom left of the plot to the top right. Often a line is drawn on the plot to help make this expectation clear. Deviations of the dots from the line show a deviation from the expected distribution.

We can develop a QQ plot in Python using the `qqplot()` statsmodels function. The function takes the data sample and by default assumes we are comparing it to a Gaussian distribution. We can draw the standardized line by setting the `line` argument to `'s'`.

A complete example of plotting the test dataset as a QQ plot is provided below.

In [0]:
# QQ Plot
from numpy.random import seed
from numpy.random import randn
from statsmodels.graphics.gofplots import qqplot
from matplotlib import pyplot
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(100) + 50
# q-q plot
qqplot(data, line='s')
pyplot.show()
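
To make the pairing behind a QQ plot concrete, here is a minimal sketch that builds the point pairs by hand with NumPy and SciPy. The plotting positions $(i - 0.5)/n$ are one common convention and an assumption here; `qqplot()` may use a slightly different one.

```python
import numpy as np
from scipy.stats import norm

# Same test sample as in the QQ plot example above
np.random.seed(1)
data = 5 * np.random.randn(100) + 50

n = len(data)
sample_q = np.sort(data)                  # y-axis: ordered sample values
probs = (np.arange(1, n + 1) - 0.5) / n   # assumed plotting positions
theory_q = norm.ppf(probs)                # x-axis: standard normal quantiles

# For normal data the points should fall close to a straight line
slope, intercept = np.polyfit(theory_q, sample_q, 1)
r = np.corrcoef(theory_q, sample_q)[0, 1]

print('slope=%.3f intercept=%.3f correlation=%.4f' % (slope, intercept, r))
```

For a sample drawn from a normal distribution, the slope approximates the standard deviation, the intercept approximates the mean, and the correlation is close to 1.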

#### Assignment

Below is code to generate a bimodal distribution.

- Generate a QQ-plot with this distribution.
- Record your results below.

In [0]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from sklearn.neighbors import KernelDensity

# Plot a 1D density example
N = 100
np.random.seed(1)
X = np.concatenate((np.random.normal(0, 1, int(0.3 * N)),
                    np.random.normal(5, 1, int(0.7 * N))))[:, np.newaxis]

X_plot = np.linspace(-5, 10, 1000)[:, np.newaxis]

true_dens = (0.3 * norm(0, 1).pdf(X_plot[:, 0])
             + 0.7 * norm(5, 1).pdf(X_plot[:, 0]))

fig, ax = plt.subplots()
ax.fill(X_plot[:, 0], true_dens, fc='black', alpha=0.2,
        label='input distribution')

for kernel in ['epanechnikov', 'tophat', 'gaussian']:
    kde = KernelDensity(kernel=kernel, bandwidth=0.5).fit(X)
    log_dens = kde.score_samples(X_plot)
    ax.plot(X_plot[:, 0], np.exp(log_dens), '-',
            label="kernel = '{0}'".format(kernel))

ax.text(6, 0.38, "N={0} points".format(N))

ax.legend(loc='upper left')
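# keep the 1-D bimodal sample for later use (e.g. the QQ plot assignment)
data_bi = X[:, 0]
# mark each observation just below the x-axis, with a little vertical jitter
ax.plot(data_bi, -0.005 - 0.01 * np.random.random(X.shape[0]), '+k')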

ax.set_xlim(-4, 9)
ax.set_ylim(-0.02, 0.4)
plt.show()

In [0]:
#Your code here:

??
??

Your answers here:

### Testing for Normality using Skewness and Kurtosis


Several statistical tests are available to test the degree to which your data deviates from normality, and whether the deviation is statistically significant.
In this section, we'll look at moment-based measures, namely skewness and kurtosis, and the statistical tests of significance that are based on these measures, namely the Omnibus $K^2$ and Jarque-Bera tests.

<img src="446px-Negative_and_positive_skew_diagrams_(English).svg.png" width="400px"/>

Positive and negative skewness (CC BY-SA 3.0)


Skewness is defined as the third standardized central moment of the probability distribution of a random variable.
The formula for the population skewness of a random variable $Y$ is shown below:

$$
\gamma_1 = E \left[ \left(\dfrac{Y - \mu}{\sigma} \right)^3 \right]
$$

Skewness of the normal distribution is zero.
While a symmetric distribution will have a zero skewness, a distribution having zero skewness is not necessarily symmetric.
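
The formula above can be applied directly to a sample by replacing the expectation with an average. The following sketch does so for an exponential sample (chosen here only because it is clearly right-skewed) and compares the result with `scipy.stats.skew`, whose default `bias=True` setting uses the same population-style moments.

```python
import numpy as np
from scipy.stats import skew

# Illustrative right-skewed sample (an assumption, not from the notebook)
rng = np.random.default_rng(1)
sample = rng.exponential(scale=1.0, size=5000)

# Standardize with the population standard deviation (ddof=0), then average z^3
z = (sample - sample.mean()) / sample.std()
manual_skew = np.mean(z ** 3)

scipy_skew = skew(sample)  # default bias=True matches the formula above

print('manual skewness: %.4f' % manual_skew)
print('scipy skewness : %.4f' % scipy_skew)
```

Note that `skew()` takes the observations themselves as input, not the values of a density curve.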

Plot the density of a standard normal distribution.

In [0]:
from scipy.stats import skew 
import numpy as np  
import pylab as p  
  
x1 = np.linspace( -5, 5, 1000 ) 
y1 = 1./(np.sqrt(2.*np.pi)) * np.exp( -.5*(x1)**2  )  
p.plot(x1, y1) 
p.show();

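Calculate skew. Note that `skew()` expects observations rather than density values, so we draw a sample from the standard normal distribution first:

In [0]:
# draw a sample: skew(y1) would measure the density heights, not the distribution
print( '\nSkewness for data : ', skew(np.random.default_rng(1).normal(size=1000)))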

Generate a right-skewed distribution.

In [0]:
from scipy.stats import skewnorm
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1)
from scipy.stats import skew 

a = 4
mean, var, skew_, kurt = skewnorm.stats(a, moments='mvsk')

# Display the probability density function (pdf):

x = np.linspace(skewnorm.ppf(0.01, a),
                skewnorm.ppf(0.99, a), 100)
y2 = skewnorm.pdf(x, a)
ax.plot(x, y2,
       'r-', alpha=0.6, label='skewnorm pdf')
plt.show();

#### Assignment 

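Calculate skewness for the right-skewed distribution above. Hint: `skew()` expects observations, so draw a sample with `skewnorm.rvs(a, size=...)` rather than passing the density values `y2`.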

In [0]:
??

#### What is ‘Kurtosis’ and how to use it?

Kurtosis is a measure of how differently shaped the tails of a distribution are compared to the tails of the normal distribution. While skewness focuses on the asymmetry of the overall shape, kurtosis focuses on the tail shape.
Kurtosis is defined as the fourth standardized central moment of the probability distribution of a random variable. The normal distribution has a kurtosis of 3, so the 'excess kurtosis' $K - 3$ is often reported instead.
The formula for kurtosis is as follows:

$$
K = E \left[ \left(\dfrac{Y - \mu}{\sigma} \right)^4 \right]
$$

<img src="800px-Standard_symmetric_pdfs.png" width="400px" />


image from Wikipedia Commons

https://commons.wikimedia.org/wiki/File:Standard_symmetric_pdfs.png
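
To see how kurtosis separates distributions with different tails, the sketch below applies the formula above to samples from three symmetric distributions. The choice of distributions and the sample size are illustrative assumptions; their theoretical excess kurtosis values are 0 (normal), 3 (Laplace), and -1.2 (uniform).

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(1)
n = 200000  # large sample so the estimates are stable

samples = {
    'normal': rng.normal(size=n),
    'laplace': rng.laplace(size=n),
    'uniform': rng.uniform(-1.0, 1.0, size=n),
}

excess = {}
for name, y in samples.items():
    # Fourth standardized moment, using the population standard deviation
    z = (y - y.mean()) / y.std()
    k = np.mean(z ** 4)
    excess[name] = k - 3.0
    # scipy's default (fisher=True) reports the same excess kurtosis
    print('%-8s kurtosis=%.3f excess=%.3f scipy=%.3f'
          % (name, k, excess[name], kurtosis(y)))
```

Heavier tails than the normal give positive excess kurtosis; lighter tails give negative excess kurtosis.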

#### Normality tests based on Skewness and Kurtosis

While skewness and kurtosis quantify the amount of departure from normality, one would want to know whether the departure is statistically significant. Tests built on these two measures, such as the Omnibus $K^2$ and Jarque-Bera tests, let us do just that.

First, compute kurtosis with SciPy:


In [0]:
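from scipy.stats import kurtosis
import numpy as np
import pylab as p

# plot the standard normal density
x1 = np.linspace(-5, 5, 1000)
y1 = 1. / np.sqrt(2. * np.pi) * np.exp(-.5 * x1**2)
p.plot(x1, y1, '*')
p.show()

# kurtosis() expects observations rather than density values, so draw a sample
sample = np.random.default_rng(1).normal(size=10000)

# Fisher's definition (the default) subtracts 3: about 0 for a normal sample
print('\nExcess kurtosis (default)     :', kurtosis(sample))
print('\nExcess kurtosis (fisher=True) :', kurtosis(sample, fisher=True))

# Pearson's definition: about 3 for a normal sample
print('\nKurtosis (fisher=False)       :', kurtosis(sample, fisher=False))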
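
The Jarque-Bera test mentioned earlier combines sample skewness $S$ and kurtosis $K$ into the statistic $JB = \frac{n}{6}\left(S^2 + \frac{(K-3)^2}{4}\right)$, which is compared against a chi-squared distribution with two degrees of freedom. The sketch below computes it by hand and checks it against `scipy.stats.jarque_bera`; the variable names are illustrative, and the chi-squared approximation is generally only reliable for fairly large samples.

```python
import numpy as np
from scipy.stats import jarque_bera, skew, kurtosis, chi2

# Same Gaussian test sample used elsewhere in this notebook
np.random.seed(1)
data = 5 * np.random.randn(100) + 50

n = len(data)
s = skew(data)                    # sample skewness
k = kurtosis(data, fisher=False)  # sample kurtosis (normal distribution -> 3)

# Jarque-Bera statistic and its asymptotic chi-squared p-value
jb_manual = n / 6.0 * (s ** 2 + (k - 3.0) ** 2 / 4.0)
p_manual = chi2.sf(jb_manual, df=2)

jb_stat, jb_p = jarque_bera(data)

print('manual: JB=%.4f p=%.4f' % (jb_manual, p_manual))
print('scipy : JB=%.4f p=%.4f' % (jb_stat, jb_p))
```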

#### More advanced statistical normality tests

There are many statistical tests that we can use to quantify whether a sample of data looks as though it was drawn from a Gaussian distribution.

*Interpretation of a Test*

Before you can apply the statistical tests, you must know how to interpret the results.

Each test will return at least two things:

- `Statistic`: A quantity calculated by the test that can be interpreted in the context of the test via comparing it to critical values from the distribution of the test statistic.
- `p-value`: Used to interpret the test, in this case whether the sample was drawn from a Gaussian distribution.

The `statistic` can aid in the interpretation of the result, although it may require a deeper proficiency with statistics and a deeper knowledge of the specific statistical test. Instead, the `p-value` can be used to quickly and accurately interpret the statistic in practical applications.

The tests assume that the sample was drawn from a Gaussian distribution. Technically this is called the null hypothesis, or H0. A threshold level called alpha is chosen, typically 5% (or 0.05), and is used to interpret the p-value.

In the SciPy implementation of these tests, you can interpret the `p-value` as follows.

- p <= alpha: reject H0, not normal.
- p > alpha: fail to reject H0, normal.

This means that, in general, a larger `p-value` is consistent with our sample having been drawn from a Gaussian distribution.

A result above $5\%$ does not mean that the null hypothesis is true. It means only that the data does not provide enough evidence to reject it. The `p-value` is not the probability of the data fitting a Gaussian distribution; it can be thought of as a value that helps us interpret the statistical test.
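
One way to build intuition for alpha is a small simulation: when the data really is Gaussian, a test at alpha = 0.05 should still reject H0 in roughly 5% of samples. The sketch below checks this with `scipy.stats.shapiro`; the number of trials and the sample parameters are illustrative assumptions.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(1)
alpha = 0.05
n_trials = 1000  # illustrative number of repeated experiments

# Repeatedly draw truly Gaussian samples and count how often H0 is rejected
rejections = 0
for _ in range(n_trials):
    sample = rng.normal(loc=50, scale=5, size=100)
    _, p = shapiro(sample)
    if p <= alpha:
        rejections += 1

rejection_rate = rejections / n_trials
print('rejected H0 in %.1f%% of Gaussian samples' % (100 * rejection_rate))
```

This is why a single rejection is not proof of non-normality, and a single failure to reject is not proof of normality.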

#### Shapiro-Wilk Test

The Shapiro-Wilk test, named for Samuel Shapiro and Martin Wilk, evaluates a data sample and quantifies how likely it is that the data was drawn from a Gaussian distribution.

In practice, the Shapiro-Wilk test is believed to be a reliable test of normality, although there is some suggestion that the test is best suited to smaller samples of data, e.g. thousands of observations or fewer.

The `shapiro()` SciPy function will calculate the Shapiro-Wilk test on a given dataset. The function returns both the W-statistic calculated by the test and the p-value.

The complete example of performing the Shapiro-Wilk test on the dataset is listed below.

In [0]:
# Shapiro-Wilk Test
from numpy.random import seed
from numpy.random import randn
from scipy.stats import shapiro
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(100) + 50
# normality test
stat, p = shapiro(data)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')

#### D’Agostino’s $K^2$ Test

D’Agostino’s $K^2$ test, named for Ralph D’Agostino, calculates summary statistics from the data, namely kurtosis and skewness, to determine if the data distribution departs from the normal distribution.

Skew is a quantification of how much a distribution is pushed left or right, a measure of asymmetry in the distribution.

Kurtosis quantifies how much of the distribution is in the tail.

The $K^2$ test is a simple and commonly used statistical test for normality. It is available via the `normaltest()` SciPy function, which returns the test statistic and the `p-value`.

The complete example of the D’Agostino’s $K^2$ test on the dataset is listed below.

In [0]:
# D'Agostino and Pearson's Test
from numpy.random import seed
from numpy.random import randn
from scipy.stats import normaltest
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(100) + 50
# normality test
stat, p = normaltest(data)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')

Running the example calculates the statistic and prints the statistic and p-value.

The p-value is interpreted against an alpha of 5% and finds that the test dataset does not significantly deviate from normal.
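
To show how the $K^2$ statistic is assembled from skewness and kurtosis, the sketch below combines the z-scores returned by SciPy's `skewtest()` and `kurtosistest()` and compares the result with `normaltest()`. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import normaltest, skewtest, kurtosistest, chi2

# Same Gaussian test sample as above
np.random.seed(1)
data = 5 * np.random.randn(100) + 50

# z-scores for the skewness and kurtosis components
z_skew, _ = skewtest(data)
z_kurt, _ = kurtosistest(data)

# K^2 is the sum of the squared z-scores, compared to chi-squared with 2 dof
k2_manual = z_skew ** 2 + z_kurt ** 2
p_manual = chi2.sf(k2_manual, df=2)

stat, p = normaltest(data)

print('manual: K2=%.4f p=%.4f' % (k2_manual, p_manual))
print('scipy : K2=%.4f p=%.4f' % (stat, p))
```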

#### Assignment:

Evaluate the Shapiro-Wilk test and D’Agostino’s $K^2$ test on the skewed distribution.

Please feel free to adjust the distribution, i.e., skew or kurtosis, to see more significant results.

In [0]:
# your code here

#### Anderson-Darling Test

The Anderson-Darling test, named for Theodore Anderson and Donald Darling, is a statistical test that can be used to evaluate whether a data sample comes from one of many known distributions.

It can be used to check whether a data sample is normal. The test is a modified version of a more sophisticated nonparametric goodness-of-fit statistical test called the Kolmogorov-Smirnov test.

A feature of the Anderson-Darling test is that it returns a list of critical values rather than a single `p-value`. This can provide the basis for a more thorough interpretation of the result.

The `anderson()` SciPy function implements the Anderson-Darling test. It takes as parameters the data sample and the name of the distribution to test it against. By default, the test will check against the Gaussian distribution (`dist='norm'`).

The complete example of calculating the Anderson-Darling test on the sample problem is listed below.

In [0]:
# Anderson-Darling Test
from numpy.random import seed
from numpy.random import randn
from scipy.stats import anderson
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(100) + 50
# normality test
result = anderson(data)
print('Statistic: %.3f' % result.statistic)
# compare the statistic with the critical value at each significance level
for sl, cv in zip(result.significance_level, result.critical_values):
    if result.statistic < cv:
        print('%.3f: %.3f, data looks normal (fail to reject H0)' % (sl, cv))
    else:
        print('%.3f: %.3f, data does not look normal (reject H0)' % (sl, cv))

Running the example calculates the statistic on the test data set and prints the critical values.

Critical values in a statistical test are pre-defined boundaries, one per significance level: if the calculated statistic is less than the critical value, we fail to reject H0 at that level. Rather than just a single p-value, the test returns a critical value for a range of commonly used significance levels.

We can interpret the results by failing to reject the null hypothesis that the data is normal if the calculated test statistic is less than the critical value at a chosen significance level.

We can see that at each significance level, the test fails to reject the hypothesis that the data follows a normal distribution.

### What Test Should You Use?

We have covered a few normality tests, but these are not all of the tests that exist.

So which test do you use?

It's generally recommended to use some, if not all, of them on your data.

The question then becomes, how do you interpret the results? What if the tests disagree, which they often will?

There are a couple of ways to think about this question.

*1. Hard Fail*

Your data may not be normal for lots of different reasons. Each test looks at the question of whether a sample was drawn from a Gaussian distribution from a slightly different perspective.

A failure of one normality test means that your data is not normal. It's as simple as that.

You can either investigate why your data is not normal and perhaps use data preparation techniques to make the data more normal.

Or you can start looking into the use of nonparametric statistical methods instead of the parametric methods.

*2. Soft Fail*

If some of the methods suggest that the sample is Gaussian and some not, then perhaps take this as an indication that your data is `Gaussian-like`.

In many situations, you can treat your data as though it is Gaussian and proceed with your chosen parametric statistical methods.
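
The two viewpoints above can be sketched as a small helper that runs several tests and reports both readings. The function `normality_report` and its return format are hypothetical, and the choice of tests (Shapiro-Wilk, D'Agostino's $K^2$, Jarque-Bera) is just one possible selection.

```python
import numpy as np
from scipy.stats import shapiro, normaltest, jarque_bera


def normality_report(data, alpha=0.05):
    """Run several normality tests and summarise them (hypothetical helper)."""
    tests = {
        'shapiro': shapiro,
        'dagostino_k2': normaltest,
        'jarque_bera': jarque_bera,
    }
    # True means the test failed to reject H0 at the given alpha
    passed = {}
    for name, test in tests.items():
        _, p = test(data)
        passed[name] = bool(p > alpha)
    return {
        'passed': passed,
        # hard fail: any single rejection counts as "not normal"
        'hard_fail': not all(passed.values()),
        # soft fail: the tests disagree, so the data may be "Gaussian-like"
        'soft_fail': any(passed.values()) and not all(passed.values()),
    }


np.random.seed(1)
gaussian = 5 * np.random.randn(100) + 50
skewed = np.random.default_rng(0).exponential(size=500)

report_gaussian = normality_report(gaussian)
report_skewed = normality_report(skewed)
print('gaussian:', report_gaussian)
print('skewed  :', report_skewed)
```

Whichever reading you adopt, it is worth deciding on it before looking at the results.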