# Statistical Tests - Unit 02: Parametric Statistical Tests

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

  * Conduct and interpret statistical tests using the t-test, the paired t-test and ANOVA

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import pingouin as pg
import scipy

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Parametric and Nonparametric Statistical Tests
  


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> There are parametric and nonparametric statistical tests.
* Which one you use depends on the normality of your data (meaning whether it follows a normal distribution or not)

  * If the data is normally distributed, we use a parametric test.
  * If it is not, we use a nonparametric test.
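
The rule above can be sketched as a small helper. This is only an illustration: `choose_test_family` is a hypothetical name, the data is simulated, and the Shapiro-Wilk test from `scipy.stats` is used here as one possible normality check (it stands in for `pg.normality()`, which is used later in this unit).

```python
import numpy as np
from scipy import stats


def choose_test_family(samples, alpha=0.05):
    """Return 'parametric' if no sample fails the Shapiro-Wilk test at level alpha."""
    for sample in samples:
        _, p_value = stats.shapiro(sample)
        if p_value < alpha:
            # Normality is rejected for this sample
            return "nonparametric"
    return "parametric"


# Simulated data (an assumption for illustration only)
rng = np.random.default_rng(0)
normal_like = rng.normal(loc=7, scale=1, size=250)
skewed = rng.exponential(scale=1.0, size=250)

print(choose_test_family([normal_like]))
print(choose_test_family([normal_like, skewed]))
```

In practice, a normality test is usually combined with a visual check (histogram or Q-Q plot), since with large samples even small deviations from normality can give a low p-value.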


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> **In this unit we will cover parametric tests**

----

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> T-test

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> A t-test, also known as Student's t-test, is a parametric test (the mean is the parameter) that tests the difference between two sample means
* In other words, it tests whether the difference between the means is 0.
* Both samples should be normally distributed
* It was developed by William Sealy Gosset of the Guinness brewery, who published it under the pseudonym "Student"
* The samples should be **independent (or unpaired)**.
  * For example, imagine we are evaluating the effect of a new drug treatment. We enrol 200 people, then randomly assign half to the treatment group and half to the control group. In this case, we have two independent groups.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The null hypothesis states that there is no difference between the means of the two samples. The alternative hypothesis states that there is a difference between the means of the two samples
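
To make the idea of "testing if the difference in the means is 0" more concrete, here is a minimal sketch that computes the classic (pooled-variance) Student's t statistic by hand and compares it with `scipy.stats.ttest_ind`. The function name `student_t_statistic` and the simulated groups are assumptions for illustration; this is not necessarily how any library computes it internally.

```python
import numpy as np
from scipy import stats


def student_t_statistic(x, y):
    """Two-sided independent t-test assuming equal variances in both groups."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n_x, n_y = len(x), len(y)
    dof = n_x + n_y - 2
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n_x - 1) * x.var(ddof=1) + (n_y - 1) * y.var(ddof=1)) / dof
    standard_error = np.sqrt(pooled_var * (1 / n_x + 1 / n_y))
    t_stat = (x.mean() - y.mean()) / standard_error
    # Two-sided p-value from the t distribution
    p_value = 2 * stats.t.sf(abs(t_stat), dof)
    return t_stat, p_value


# Simulated groups (hypothetical values)
rng = np.random.default_rng(42)
group_a = rng.normal(loc=7.0, scale=1.0, size=50)
group_b = rng.normal(loc=7.5, scale=1.0, size=50)

t_manual, p_manual = student_t_statistic(group_a, group_b)
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b, equal_var=True)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

The two printed lines should agree, which shows that the test statistic is simply the difference in means divided by its standard error.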





Let's consider a DataFrame with columns `Col1` and `Col2`, each generated with a NumPy function that creates normally distributed data

np.random.seed(123)
size = 250
df = pd.DataFrame(data={'Col1': np.random.normal(loc=7, scale=1, size=size),
                        "Col2": np.random.normal(loc=8, scale=1.2, size=size)}
                  )
df.head()

We check normality using `pg.normality()`. Both columns are normally distributed, so we can use a t-test to compare them.

pg.normality(data=df, alpha=0.05)

Let's plot both variables in a histogram and box plot and ask ourselves: **are they similar or different**?

fig, axes = plt.subplots(nrows=1 ,ncols=2 ,figsize=(12,5))

sns.histplot(data=df, kde=True, ax=axes[0])
for col in df.columns: 
  axes[0].axvline(df[col].mean(), color='r', linestyle='dashed', linewidth=1)
sns.boxplot(data=df, ax=axes[1])

plt.show()
print("\n\n")

We conduct a t-test using `pg.ttest()`. The documentation is found [here](https://pingouin-stats.org/generated/pingouin.ttest.html#pingouin.ttest). We pass the two numerical distributions as `x` and `y`
* We are interested in the p-value

pg.ttest(x=df['Col1'], y=df['Col2'])

For now, we are interested in `p-val`, which is the p-value. We extract it using `.loc[]`

pg.ttest(df['Col1'],df['Col2']).loc['T-test','p-val']

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We consider our significance level alpha = 0.05.
* Since the p-value (3.32e-20) is smaller than alpha, we reject the null hypothesis.

* Therefore there is a statistically significant difference between the Col1 and Col2 levels. **Their levels are different!**
  * The column names Col1 and Col2 make the result hard to interpret on their own, so frame them as math exam scores from 2 distinct groups. The conclusion, in this case, is that the scores of the two groups differ, and the second group has the higher scores
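
The p-value tells us that there is a difference, but not how large it is. One common way to quantify the size is Cohen's d (the difference in means divided by the pooled standard deviation). The sketch below is a hand-written illustration; `cohens_d` is a hypothetical helper and the input values are made up.

```python
import numpy as np


def cohens_d(x, y):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n_x, n_y = len(x), len(y)
    pooled_var = ((n_x - 1) * x.var(ddof=1) + (n_y - 1) * y.var(ddof=1)) / (n_x + n_y - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)


# Made-up exam scores: the second group scores one point higher on average
d = cohens_d([1, 2, 3], [2, 3, 4])
print(float(d))  # -1.0: the first group is one pooled standard deviation below the second
```

A negative value means the first sample has the lower mean. As a rough convention, absolute values around 0.2, 0.5 and 0.8 are often described as small, medium and large effects.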

---

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's consider another exercise and create a DataFrame with columns `Col3` and `Col4`, each generated with a NumPy function that creates normally distributed data

np.random.seed(3)
size = 250
df = pd.DataFrame(data={'Col3':  np.random.normal(loc=7, scale=1, size=size),
                        "Col4":np.random.normal(loc=7.2, scale=1, size=size)})
df.head(3)

We confirm normality with `pg.normality()`. They are normally distributed.

pg.normality(df, alpha=0.05)

Let's plot both variables in a histogram and box plot and ask ourselves: are they similarly distributed or different?

fig, axes = plt.subplots(nrows=1 ,ncols=2 ,figsize=(12,5))

sns.histplot(data=df, kde=True, ax=axes[0])
for col in df.columns: 
  axes[0].axvline(df[col].mean(), color='r', linestyle='dashed', linewidth=1)
sns.boxplot(data=df, ax=axes[1])

plt.show()
print("\n\n")

We conduct a t-test using `pg.ttest()`.

pg.ttest(df['Col3'],df['Col4'])

We can then extract `p-val`, which is the p-value

pg.ttest(df['Col3'],df['Col4']).loc['T-test','p-val']

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We consider our significance level alpha = 0.05.
* Since the p-value (0.0949) is greater than alpha, we fail to reject the null hypothesis.

* Therefore there is not enough statistical evidence of a difference between the Col3 and Col4 levels. We cannot conclude that their levels are different. Note that this is not proof that the levels are identical, only that the test did not detect a difference.
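
A confidence interval for the difference in means can help with this interpretation: when the p-value is above alpha, the interval includes 0, but it also shows which non-zero differences remain plausible. The following is a minimal sketch under the equal-variance assumption; `mean_difference_ci` is a hypothetical helper and the samples are simulated.

```python
import numpy as np
from scipy import stats


def mean_difference_ci(x, y, confidence=0.95):
    """Confidence interval for mean(x) - mean(y), assuming equal variances."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n_x, n_y = len(x), len(y)
    dof = n_x + n_y - 2
    pooled_var = ((n_x - 1) * x.var(ddof=1) + (n_y - 1) * y.var(ddof=1)) / dof
    standard_error = np.sqrt(pooled_var * (1 / n_x + 1 / n_y))
    # Critical value of the t distribution for the requested confidence level
    t_critical = stats.t.ppf(0.5 + confidence / 2, dof)
    difference = x.mean() - y.mean()
    return difference - t_critical * standard_error, difference + t_critical * standard_error


# Simulated samples (hypothetical values)
rng = np.random.default_rng(3)
sample_a = rng.normal(loc=7.0, scale=1.0, size=250)
sample_b = rng.normal(loc=7.2, scale=1.0, size=250)

low, high = mean_difference_ci(sample_a, sample_b)
print(f"95% CI for the difference in means: [{low:.3f}, {high:.3f}]")
```

If the interval contains 0, the corresponding two-sided test at alpha = 0.05 does not reject the null hypothesis, and vice versa.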

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Paired Student’s t-test

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> A Paired Student’s t-test is a parametric test (the mean is the parameter) that also tests the difference between two sample means. Both samples should be normally distributed

* The samples should be **dependent (or paired)**
  * It should be a sample of matched pairs.
  * For example, **imagine the same group is tested twice**. Say you want to examine the difference between people's scores on a test before and after a training intervention



<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The null hypothesis states that there is no difference between the means of the two samples. The alternative hypothesis states that there is a difference between the means of the two samples
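
One way to see why pairing matters: a paired t-test is equivalent to a one-sample t-test on the per-subject differences, tested against 0. The sketch below illustrates this with `scipy.stats` on simulated "before" and "after" scores (the values and variable names are assumptions for illustration).

```python
import numpy as np
from scipy import stats

# Simulated scores for the same 30 people, measured twice (hypothetical values)
rng = np.random.default_rng(7)
before = rng.normal(loc=5.5, scale=1.0, size=30)
after = before + rng.normal(loc=0.4, scale=0.8, size=30)

# Paired test on the two measurements
paired = stats.ttest_rel(before, after)
# One-sample test on the differences against a mean of 0
one_sample = stats.ttest_1samp(before - after, popmean=0.0)
# Independent test, shown only for contrast: it ignores the pairing
independent = stats.ttest_ind(before, after)

print("paired:     ", paired.statistic, paired.pvalue)
print("one-sample: ", one_sample.statistic, one_sample.pvalue)
print("independent:", independent.statistic, independent.pvalue)
```

The first two lines should match. The independent test generally gives a different result because it treats the two measurements as coming from unrelated groups.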



Consider a dataset from pingouin datasets. **It shows scores for a given test over time in different groups**.




* For this exercise, we query only the Meditation group.
* It shows the score (`Scores`), the month (`Time`) and the person ID (`Subject`)

df = (pg.read_dataset('mixed_anova')
    .query("Group == 'Meditation' and Time != 'January'")
    .drop(['Group'], axis=1)
    .reset_index(drop=True)
    )
print(df.shape)
df.head()

We will map `Time` to an integer that represents the month number and assign it to the `Month` column

df['Month'] = df['Time'].replace({"August":8, "June":6})
df.sort_values(by='Month', ascending=True, inplace=True)
df.head()

Let's check if the `Scores` are normally distributed in each `Month` with `pg.normality()`
* We see that the scores for both months, 6 (June) and 8 (August), are normally distributed

pg.normality(data=df, dv='Scores', group='Month', alpha=0.05)

We use `pg.pairwise_ttests()` to conduct a Paired Student t-test. Find the documentation [here](https://pingouin-stats.org/generated/pingouin.pairwise_ttests.html). The arguments used are `data`; `dv`, the dependent variable (`Scores`); `within`, the name of the column containing the within-subject factor (in this case `Month`); and `subject`, the subject identifier (the person ID)

* We want to evaluate whether the `Scores` levels are similar or different for the same group across `Month`

pg.pairwise_ttests(data=df, dv='Scores', within='Month', subject='Subject')

We are interested in the p-value, which is in the `p-unc` column (the uncorrected p-value)

pg.pairwise_ttests(data=df, dv='Scores', within='Month', subject='Subject', effsize='cohen').loc[0,'p-unc']

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We consider alpha = 0.05.
* Since the p-value (0.000143) is lower than alpha, we reject the null hypothesis.
* Therefore there is a statistically significant difference between scores in June and August. Their levels are not the same!

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We use `pg.plot_paired()` to visualize this experiment. The function documentation is [here](https://pingouin-stats.org/generated/pingouin.plot_paired.html). The arguments are similar to the previous function (`data`, `dv`, `within`, `subject`). In addition, `dpi` sets the image resolution; we set it to 150.
* It shows a boxplot indicating the distribution of Scores for Month 6 and Month 8.
* You will notice red and green dots and lines "travelling" from one Month to another. Each dot in this experiment is a person who had a given score in Month 6 and another score in Month 8.
* If the line is red, the score decreased between months. If the line is green, the score increased. Have a look at the plot and check visually whether, in general, there are more greens or reds and whether the scores change a lot or not.
* The test assesses whether the scores of the group as a whole changed between the two months.

pg.plot_paired(data=df, dv='Scores', within='Month', subject='Subject', dpi=150)
plt.show()
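
The visual check of "more greens or reds" can also be done numerically by reshaping the long-format data so that each subject is one row and each month is one column. The sketch below uses a tiny made-up DataFrame with the same column names as the dataset above; the values are hypothetical.

```python
import pandas as pd

# Tiny made-up example in the same long format (one row per subject and month)
scores = pd.DataFrame({
    "Subject": [1, 2, 3, 1, 2, 3],
    "Month": [6, 6, 6, 8, 8, 8],
    "Scores": [5.0, 6.0, 7.0, 5.5, 5.0, 7.0],
})

# One row per subject, one column per month
wide = scores.pivot(index="Subject", columns="Month", values="Scores")
# Per-subject change from Month 6 to Month 8
change = wide[8] - wide[6]

summary = {
    "increased": int((change > 0).sum()),
    "decreased": int((change < 0).sum()),
    "unchanged": int((change == 0).sum()),
}
print(summary)  # {'increased': 1, 'decreased': 1, 'unchanged': 1}
```

The same `pivot` call could be applied to the `df` used in this section to count how many subjects increased or decreased between June and August.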

---

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> However, in more realistic applications you may have more months to analyze.
* For example, the previous experiment may have been conducted over more months



Consider the same dataset as in the previous example, but now with 3 months
* We will map `Time` to an integer that represents the month number and assign it to the `Month` column

df = (pg.read_dataset('mixed_anova')
    .query("Group == 'Meditation'")
    .drop(['Group'], axis=1)
    )

df['Month'] = df['Time'].replace({"January":1, "June":6, "August":8})
df.sort_values(by='Month', ascending=True, inplace=True)

df.head()

Let's check if the `Scores` are normally distributed in each `Month` with `pg.normality()`
* The scores in each month are normally distributed

pg.normality(data=df, dv='Scores', group='Month', alpha=0.05)

We use `pg.pairwise_ttests()` to conduct pairwise Paired Student t-tests. We want to evaluate whether the `Scores` levels are similar or different for the same group of people across `Month`
* We will conduct 3 tests, each on a given pair of months. That is why it is called pairwise. We will be interested in the ``p-unc`` column
* In the end we will know, for each pair individually, if the score levels differ from:
  * January to June,
  * June to August and
  * January to August

pg.pairwise_ttests(data=df, dv='Scores', within='Month', subject='Subject', effsize='cohen')

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We consider alpha = 0.05. 

* From January (Month 1) to June (Month 6), the p-value (0.160902) is greater than alpha, so we fail to reject the null hypothesis. Therefore there is **not** enough statistical evidence of a difference between scores in **January and June.** We cannot conclude that their levels are different.

* From June (Month 6) to August (Month 8), the p-value (0.00014) is lower than alpha, so we reject the null hypothesis. Therefore there is a statistically significant **difference between scores in June and August**. Their levels are not the same!

* From January (Month 1) to August (Month 8), the p-value (0.052379) is slightly greater than alpha, so we fail to reject the null hypothesis. Therefore there is **not** enough statistical evidence of a difference between scores in **January and August**. We cannot conclude that their levels are different.
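
The `p-unc` values are uncorrected. When several tests are run on the same data, the chance of at least one false positive grows, so a correction is often applied. The sketch below applies a simple Bonferroni correction by hand to the three p-values reported above; `bonferroni` is a hypothetical helper written for illustration. The pingouin function also has a `padjust` argument for this purpose.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    number_of_tests = len(p_values)
    return [min(1.0, p * number_of_tests) for p in p_values]


# Uncorrected p-values reported in the text above
p_unc = {
    "January vs June": 0.160902,
    "June vs August": 0.00014,
    "January vs August": 0.052379,
}

adjusted = dict(zip(p_unc, bonferroni(list(p_unc.values()))))
alpha = 0.05
for pair, p_adj in adjusted.items():
    print(f"{pair}: adjusted p = {p_adj:.6f}, reject H0: {p_adj < alpha}")
```

In this case the conclusions do not change: only the June vs August comparison remains significant after the correction.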

We use `pg.plot_paired()` to visualize this experiment
* With 3 months in the experiment, there was not enough statistical evidence of a change in the group's scores when you compare Jan and Aug
* Visually there was an apparent increase from Jan to Jun, but it was not statistically significant.
* At the same time, if you compare the levels from Jun to Aug, there was a statistically significant decrease.

pg.plot_paired(data=df, dv='Scores', within='Month', subject='Subject', dpi=150)
plt.show()

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Analysis of Variance (ANOVA)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> An Analysis of Variance, or ANOVA, test is a parametric test that compares the means of 3 or more groups by analysing the variation between and within the groups. The data should be normally distributed
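
To illustrate what "comparing variation" means, here is a minimal sketch of a one-way ANOVA computed by hand: the F statistic is the ratio of the variation between group means to the variation within groups. `one_way_anova` is a hypothetical helper, the groups are made-up values, and the result is compared with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats


def one_way_anova(*groups):
    """Return the F statistic and p-value of a one-way ANOVA."""
    groups = [np.asarray(group, dtype=float) for group in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    p_value = stats.f.sf(f_stat, k - 1, n - k)
    return f_stat, p_value


# Made-up measurements for three groups (hypothetical values)
group_1 = [5.1, 4.9, 5.6, 5.0]
group_2 = [5.8, 6.1, 5.9, 6.4]
group_3 = [4.2, 4.8, 4.5, 4.4]

f_manual, p_manual = one_way_anova(group_1, group_2, group_3)
f_scipy, p_scipy = stats.f_oneway(group_1, group_2, group_3)
print(f_manual, p_manual)
print(f_scipy, p_scipy)
```

A large F statistic means the group means differ by more than would be expected from the spread within the groups, which leads to a small p-value.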

Consider a dataset from pingouin datasets. It shows `Pain threshold` levels across different people's `Hair color` (Dark Brunette, Light Blond, Dark Blond, Light Brunette). The subject is the person ID

df = pg.read_dataset('anova')
print(df.shape)
df.head(3)

We can check `Pain threshold` normality across different `Hair color`. It is normally distributed in each group

pg.normality(df, dv='Pain threshold',group='Hair color', alpha=0.05)

We combine a boxplot and swarm plot to visually check `Pain threshold` across different `Hair color`
* **Visually speaking**, we notice that there are few data points, and `Pain threshold` appears to differ across `Hair color`. However, it is **wise** not to conclude anything before conducting a statistical test.

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14,5))
sns.boxplot(data=df,x="Hair color", y="Pain threshold", ax=axes[0])
sns.swarmplot(data=df,x="Hair color", y="Pain threshold", dodge=True, ax=axes[1])
plt.show()

We conduct an ANOVA test with `pg.anova()`. The function documentation is found [here](https://pingouin-stats.org/generated/pingouin.anova.html#pingouin.anova)

pg.anova(data=df, dv='Pain threshold', between='Hair color', detailed=True)


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are interested in `p-unc`, which is 0.004114
* We consider our significance level alpha = 0.05.
* Since the p-value (0.004114) is lower than alpha, we reject the null hypothesis.

* Therefore there is enough statistical evidence to conclude that Pain threshold levels differ between hair colors

---