# Statistical Tests - Unit 01: Overview, Shapiro and Chi-Squared

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%201%20-%20Lesson%20Learning%20Outcome.png"> Lesson Learning Outcome

* **The Statistical Tests lesson is made up of 3 units.**
* By the end of this lesson, you should be able to:
  * Understand and apply the concepts considered in a Statistical Test
  * Conduct and interpret statistical tests like the Shapiro-Wilk, Chi-Squared, t-test, paired t-test, ANOVA, Mann-Whitney, Wilcoxon and Kruskal-Wallis tests

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

  * Understand and apply the concepts considered in a Statistical Test
  * Conduct and interpret statistical tests using the Shapiro-Wilk and Chi-Squared tests


---

* We will use Pandas and Pingouin (an open-source statistical package based mostly on Pandas and NumPy) libraries in this lesson.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Question%20mark%20icon.png">
 **Why do we study Statistical Tests?**
  * Because they let us determine whether groups differ from or resemble each other. In addition, we can evaluate whether a predictor variable has a statistically significant relationship with a target variable.


---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%203%20-%20Additional%20Learning%20Context.png"> Additional Context for Learning

* We encourage you to:
  * Add **code cells and try out** other possibilities, e.g. play around with parameter values in a function/method, or consider additional function parameters.
  * Also, **add your own comments** to the cells. It can help you consolidate the learning.

* Parameters in a given function/method
  * As you may expect, a given function in a package may contain multiple parameters.
  * Some of them are mandatory to declare, some have pre-defined values, and some are optional. We will cover the parameters most commonly used in Data Science for a particular function/method.
  * However, you may seek additional information in the respective package documentation, where you will find instructions on how to use a given function/method. The studied packages are open source, so this documentation is public.
  * **For Pandas the link is [here](https://pandas.pydata.org/) and for Pingouin is [here](https://pingouin-stats.org/api.html)**

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import pingouin as pg
import scipy

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Statistical Tests Overview

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> A statistical test is a mechanism for making a decision about a process.
* **The idea is to see if there is enough evidence to reject a hypothesis about the process.**


### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Hypothesis Testing

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Hypothesis testing is a way of forming opinions or conclusions from the data we collected.

* The data is used to choose between **two choices**, aka hypotheses or statements. In practical terms, the reasoning is done by comparing what we have observed to what we expected.
* The available data will typically be a sample of the entire population.

  * There is a **Null Hypothesis (H0)**, which is a statement about the population the sample was drawn from. Typically it says there is no difference between groups.
  * An **Alternative Hypothesis (H1)** is typically the research question and states that there is a difference between groups.



### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Significance Level

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The Significance Level, or alpha, is the probability of rejecting the null hypothesis when it is true. 
* This means it is the level of risk we are willing to accept of wrongly rejecting the null hypothesis.
* The researcher sets this percentage. It is frequently set at 5%, meaning there is a 5 in 100 chance of rejecting the null hypothesis when it is in fact true.
  * However, depending on the topic you are researching (typically, high stakes), you may be more conservative and select a lower alpha level. For example, if you are testing a new cancer drug, you want to be very sure about your conclusions.
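
To make the meaning of alpha concrete, the sketch below simulates many experiments in which the null hypothesis is true and counts how often a test rejects it anyway. It uses a two-sided z-test with a known standard deviation purely for illustration (that test is not part of this unit), and the names `z_test_p_value` and `false_positive_rate` are hypothetical helpers. The rejection rate should land close to the chosen alpha.

```python
import random
from statistics import NormalDist, mean


def z_test_p_value(sample, mu0, sigma):
    """Two-sided p-value for H0: population mean == mu0, with known sigma."""
    z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(alpha, n_experiments=2000, sample_size=30, seed=1):
    """Share of experiments that reject H0 even though H0 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_experiments):
        # H0 is true here: the data really comes from a mean-0, sd-1 population
        sample = [rng.gauss(0, 1) for _ in range(sample_size)]
        if z_test_p_value(sample, mu0=0, sigma=1) < alpha:
            rejections += 1
    return rejections / n_experiments


rate = false_positive_rate(alpha=0.05)
print(f"Rejected a true null hypothesis in {rate:.1%} of experiments")
```

Lowering `alpha` in the call should lower the share of wrong rejections, which is the trade-off described above for high-stakes research.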



### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Test Statistic

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> A statistical test works by calculating a test statistic, a number that summarizes how far the observed data departs from what the null hypothesis predicts.
* The method to calculate a test statistic varies between tests. For example, the formula for a test with 2 samples is different from a test with 3 samples. In tests that compare groups, the test statistic measures the differences between the samples.

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> P-value

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The p-value is a tool for deciding whether to reject the null hypothesis.

* In a simple definition, a p-value is the probability of observing data at least as extreme as ours, assuming the null hypothesis is true. It is not the probability that the null hypothesis is true. The smaller the p-value, the stronger the evidence against the null hypothesis. We will not focus on how it is calculated, such as which statistical tables are used. Let's keep it simple for the moment.

* Once you have a p-value and alpha (or Significance level), you are in a position to make a statistical conclusion and interpret a statistical test.
  * If the p-value is lower than alpha, you have enough evidence to reject the null hypothesis
  * If the p-value is not lower than alpha, you do not have enough evidence to reject the null hypothesis
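
The decision rule above can be written as a few lines of Python. This is a minimal sketch, and `interpret_test` is a hypothetical helper name, not a function from Pandas or Pingouin.

```python
def interpret_test(p_value, alpha=0.05):
    """Return the statistical conclusion for a p-value and a significance level."""
    if not 0 <= p_value <= 1:
        raise ValueError("p_value must be between 0 and 1")
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


print(interpret_test(0.03))              # reject the null hypothesis
print(interpret_test(0.40))              # fail to reject the null hypothesis
print(interpret_test(0.03, alpha=0.01))  # fail to reject the null hypothesis
```

Note that the same p-value can lead to different conclusions depending on the alpha chosen before running the test.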

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Shapiro-Wilk

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  The Shapiro-Wilk test checks whether a given sample is **normally distributed**.
* The null hypothesis states that the population is normally distributed. The alternative hypothesis states that the population is not normally distributed.
* Thus, if the p-value is less than the chosen alpha level (typically set at 0.05), the null hypothesis is rejected, and there is evidence that the data tested is not normally distributed. If the p-value is not less than alpha, we do not have enough evidence to say the data departs from normality.
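
As a quick illustration of this rule on a single array, the sketch below calls `scipy.stats.shapiro` directly and applies the alpha comparison by hand. The lesson itself works with Pingouin; using SciPy here, along with the sample names and the seed, is an assumption made only to keep the example short.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
normal_sample = rng.normal(loc=0, scale=2, size=200)   # drawn from a normal population
skewed_sample = rng.exponential(scale=20, size=200)    # drawn from a skewed population

alpha = 0.05
for name, sample in [("normal", normal_sample), ("exponential", skewed_sample)]:
    # shapiro returns the W statistic and the p-value
    statistic, p_value = stats.shapiro(sample)
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"{name}: W={statistic:.3f}, p-value={p_value:.4f} -> {decision}")
```

The exponential sample should be rejected clearly. For the normal sample, a rejection can still happen about 5% of the time by chance, which is what alpha means.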


First, let's generate some data to illustrate the concepts throughout the lesson, using the libraries we have learned so far.

from scipy.stats import skewnorm
np.random.seed(seed=1)  # fix the seed so the generated data is reproducible
size=200

# Bimodal: two normal distributions with different centers, concatenated
X1 = np.random.normal(loc=40, scale=2, size=int(size/2) )
X2 = np.random.normal(loc=10, scale=4, size=int(size/2) ) 
bi_modal = np.concatenate([X1, X2])

# Multimodal: four normal distributions with different centers, concatenated
X1 = np.random.normal(loc=40, scale=4, size=int(size/4) )
X2 = np.random.normal(loc=10, scale=4, size=int(size/4) ) 
X3 = np.random.normal(loc=0, scale=2, size=int(size/4) ) 
X4 = np.random.normal(loc=80, scale=2, size=int(size/4) ) 
multi_modal = np.concatenate([X1, X2, X3, X4])


df = pd.DataFrame(data={'Normal':np.random.normal(loc=0, scale=2, size=size),
                        "Positive Skewed": skewnorm.rvs(a=10, size=size),
                        "Negative Skewed": skewnorm.rvs(a=-10, size=size),
                        "Exponential":np.random.exponential(scale=20,size=size),
                        "Uniform":np.random.uniform(low=0.0, high=1.0, size=size),
                        "Bimodal":  bi_modal,
                        "Multimodal":  multi_modal,
                        "Poisson":np.random.poisson(lam=1.0, size=size),
                        "Discrete": np.random.choice([10,12,14,15,16,17,20],size=size),
                        }).round(3)

df.head(3)


Let's visualize the data distribution using a boxplot and a histogram for each variable
* We loop over each variable and create a figure with 2 plots: one boxplot and one histogram

for col in df.columns:
  fig, axes = plt.subplots(nrows=2 ,ncols=1 ,figsize=(5,5), gridspec_kw={"height_ratios": (.15, .85)})
  sns.boxplot(data=df, x=col, ax=axes[0])
  axes[0].set_xlabel(" ")
  sns.histplot(data=df, x=col, kde=True, ax=axes[1])
  fig.suptitle(f"{col} Distribution - Boxplot and Histogram")
  plt.show()
  print("\n\n")

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We can test if all numerical columns in a DataFrame are normally distributed with `pg.normality()`. The function documentation is [here](https://pingouin-stats.org/generated/pingouin.normality.html). The arguments we pass are `data` and `alpha=0.05` for the significance level.
* The output shows each variable name in the index, and the `normal` column indicates whether a given variable is normally distributed.

pg.normality(data=df, alpha=0.05)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Note that in the previous example each column holds a distinct numerical distribution.
* However, your data may be in a different arrangement. If your data is in a long format, has numerical and categorical variables, and you want to know if a numerical variable is normally distributed within each level of a given category, you can use the `dv` and `group` arguments
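
If the difference between the two arrangements is unclear, the sketch below reshapes a tiny wide-format DataFrame into long format with `DataFrame.melt`. The data and the column names `group` and `value` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical wide-format data: one column per group
df_wide = pd.DataFrame({"group_a": [1.2, 2.3, 3.1],
                        "group_b": [4.8, 5.1, 6.0]})

# Long format: one row per observation, with a column naming the group
df_long = df_wide.melt(var_name="group", value_name="value")
print(df_long)
```

With data shaped like `df_long`, the numerical column (`value`) would play the role of `dv` and the categorical column (`group`) the role of `group`.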


Consider the dataset below. It has records for 3 different species of penguins, collected from 3 islands in the Palmer Archipelago, Antarctica.

df_pinguins = sns.load_dataset('penguins')
print(df_pinguins.shape)
df_pinguins.head(3)

You can check if `bill_length_mm` (numerical variable) is normally distributed across `species` (categorical variable)
* We add the `dv` (dependent variable) as `bill_length_mm` and `group` (grouping variable) as `species`
* We note that only `bill_length_mm` in `Gentoo` species is not normally distributed

pg.normality(data=df_pinguins, dv='bill_length_mm', group='species', alpha=0.05)

However, you will notice that `bill_length_mm` on its own, with all species pooled together, is not normally distributed

pg.normality(data=df_pinguins['bill_length_mm'], alpha=0.05)

You can plot a histogram for `bill_length_mm`, and for `bill_length_mm` per `species`, to relate the shape of each distribution to the Shapiro-Wilk results
* The `bill_length_mm` variable is not normally distributed
* When you analyze `bill_length_mm` per species, Gentoo's `bill_length_mm` is not normally distributed

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> **Note** The visuals may mislead you. What matters is the result from the statistical test.

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15,7))
sns.histplot(data=df_pinguins, x='bill_length_mm', kde=True, ax=axes[0])
sns.histplot(data=df_pinguins, x='bill_length_mm',hue='species' , kde=True, palette='Set2', ax=axes[1])
plt.show();
print("\n\n")

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Chi-Squared Test (Test of Independence)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The Chi-Squared Test measures if there is a significant difference between the expected frequencies and the observed frequencies in categorical variables. Here, the expected frequencies are the ones we would see if the two variables were unrelated.


* Hypothesis
  * Null hypothesis - the two categorical variables are independent, so the proportion of occurrences in each category of one variable does not differ across the categories of the other
  * Alternative hypothesis - the two categorical variables are associated, so the proportion of occurrences in each category differs across the categories of the other
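
To see where the test statistic comes from, the sketch below computes the expected frequencies and the Pearson chi-squared statistic by hand for a small contingency table. The counts are hypothetical, `chi2_statistic` is an illustrative helper, and no continuity correction is applied, so the numbers may differ from what a library reports for a 2x2 table when a correction is enabled.

```python
import math


def chi2_statistic(observed):
    """Pearson chi-squared statistic (no continuity correction) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Frequency expected if the two variables were independent
            exp = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - exp) ** 2 / exp
    return statistic


# Hypothetical 2x2 table: rows are feature levels, columns are target levels
observed = [[30, 20],
            [20, 30]]
chi2 = chi2_statistic(observed)
# With 1 degree of freedom (a 2x2 table), the p-value has a closed form
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi2={chi2:.3f}, p-value={p_value:.4f}")  # chi2=4.000, p-value=0.0455
```

Every expected count here is 25, so each cell contributes (5² / 25) = 1 to the statistic. The larger the gap between observed and expected counts, the larger the statistic and the smaller the p-value.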


Let's consider a built-in dataset from Pingouin. It comes from a study on heart disease, where `target` equal to one indicates heart disease.

df = pg.read_dataset('chi2_independence')
print(df.shape)
df.head()

Let's check target (heart disease) distribution with `.value_counts()`

df['target'].value_counts()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's take `target` and `fbs` (which appears to represent fasting blood sugar)
* We ask ourselves, is `fbs` a good predictor for `target` (heart disease)? Is there any significant association between them?

Let's make a bar plot to investigate `fbs` levels across different `target` levels
* It shows the counts of people who have or don't have heart disease at each `fbs` level
* Visually, the split between people with and without heart disease looks similar across the `fbs` levels

sns.countplot(x='fbs',hue='target',data=df)
plt.show()

We use `pg.chi2_independence()` to conduct the Chi-Squared Test. The documentation link is [here](https://pingouin-stats.org/generated/pingouin.chi2_independence.html#pingouin.chi2_independence). The arguments we use are:
* `data`, plus `x` and `y` as the variables for the Chi-Squared Test. `y` tends to be the target variable you want to analyze across a given feature (`x`)

expected, observed, stats = pg.chi2_independence(data=df, x='fbs', y='target')

The test summary (`stats`) includes the result of the Pearson Chi-Squared test


stats

We are interested in the `pval` from the `pearson` test.
* We `query` `stats` for the row where `test == 'pearson'` and grab `pval`

stats.query("test == 'pearson'")['pval']

We consider our significance level alpha = 0.05.
* Since the p-value (0.744428) is greater than alpha, we fail to reject the null hypothesis.
* Therefore, we found no significant association between `fbs` and `target`.
  * `fbs` does not appear to be a good predictor for `target`

---

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Now let's take `target` and `sex`

* We ask ourselves, is `sex` a good predictor for `target` (heart disease)?
*  Is there any significant association between them?



Let's make a bar plot to investigate `sex` levels across different `target` levels
* Visually, the proportion of people without heart disease (target = 0) looks different between the two sexes.

sns.countplot(data=df, x='sex', hue='target')
plt.show()

We conduct the Chi-Squared Test, where now `x='sex'`

expected, observed, stats = pg.chi2_independence(data=df, x='sex', y='target')

And extract the p-value using the same rationale as in the previous exercise

stats.query("test == 'pearson'")['pval']

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We consider our significance level alpha = 0.05.
* Since the p-value (0.000002) is smaller than alpha, we reject the null hypothesis.

* Therefore, there is a significant association between `sex` and `target`.
  * `sex` appears to be a good predictor for `target` (heart disease)

---