# Statistical Tests - Unit 01: Overview, Shapiro and Chi-Squared

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%201%20-%20Lesson%20Learning%20Outcome.png"> Lesson Learning Outcome

* **The Statistical Tests lesson is made up of 3 units.**
* By the end of this lesson, you should be able to:
  * Understand and apply the concepts considered in a Statistical Test
  * Conduct and interpret statistical tests such as the Shapiro-Wilk, Chi-Squared, T-test, Paired T-test, ANOVA, Mann-Whitney, Wilcoxon and Kruskal-Wallis tests

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

  * Understand and apply the concepts considered in a Statistical Test
  * Conduct and interpret statistical tests using the Shapiro-Wilk and Chi-Squared tests


---

* We will use Pandas and Pingouin (an open-source statistical package based mostly on Pandas and NumPy) libraries in this lesson.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Question%20mark%20icon.png"> **Why do we study Statistical Tests?**
  * Statistical tests let us determine whether groups differ or are similar, and evaluate whether a predictor variable is statistically important to a target variable.


---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%203%20-%20Additional%20Learning%20Context.png"> Additional Context for Learning

* We encourage you to:
  * Add **code cells and try out** other possibilities, play around with parameter values in a function/method, or consider additional function parameters etc.
  * Also, **add your own comments** to the cells. It can help you to consolidate your learning. 

* Parameters in a given function/method
  * As you may expect, a given function in a package may contain multiple parameters. 
  * Some of them are mandatory to declare; some have pre-defined values, and some are optional. We will cover the most common parameters used/employed in Data Science for a particular function/method. 
  * However, you may seek additional information in the respective package documentation, where you will find instructions on how to use a given function/method. The studied packages are open source, so this documentation is public.
  * **For Pandas, the link is [here](https://pandas.pydata.org/) and for Pingouin [here](https://pingouin-stats.org/api.html)**

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import pingouin as pg
import scipy

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Statistical Tests Overview

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> A statistical test provides a mechanism for making a decision about a process. 
* **The idea is to see if there is enough evidence to reject a hypothesis about the process.**


### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Hypothesis Testing

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Hypothesis testing is a way of forming opinions or conclusions from the data we collect.

* The data is used to choose between **two competing statements**, known as hypotheses. In practical terms, the reasoning is done by comparing what we have observed to what we expected. 
* The available data will typically be a sample of the entire population.

  * There is a **Null Hypothesis (H0)**, which is a statement about the population from which the sample was drawn. Typically it says there is no difference between groups.
  * An **Alternative Hypothesis (H1)** is typically the research question and states that there is a difference between groups.



### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Significance Level

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The Significance Level, or alpha, is the probability of rejecting the null hypothesis when it is true. 
* In other words, it is the risk of a false positive that we are willing to accept when rejecting the null hypothesis.
* The researcher sets this percentage; it is frequently set at 5%, meaning there is a 5 in 100 chance of rejecting the null hypothesis when it is, in fact, true.
  * However, when the stakes are high, you may be more conservative and select a lower alpha level. For example, if you are testing a new cancer drug, you want to be very sure about your conclusions.
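
To see what alpha means in practice, the sketch below simulates many experiments in which the null hypothesis is true by construction and counts how often a test at alpha = 0.05 rejects it anyway. It is an illustration only: it uses a simple two-sided z-test with a known standard deviation (not one of the tests covered in this lesson), and the names `simulate_false_positive_rate`, `n_trials` and `n_obs` are made up for this example.

```python
import math
import random
from statistics import NormalDist, mean


def simulate_false_positive_rate(alpha=0.05, n_trials=2000, n_obs=30, seed=42):
    """Fraction of experiments that reject a TRUE null hypothesis (mean == 0)."""
    rng = random.Random(seed)
    standard_normal = NormalDist(mu=0, sigma=1)
    rejections = 0
    for _ in range(n_trials):
        # Draw a sample from N(0, 1): the null hypothesis (mean == 0) is true
        sample = [rng.gauss(0, 1) for _ in range(n_obs)]
        # z statistic for a known standard deviation of 1
        z = mean(sample) * math.sqrt(n_obs)
        # Two-sided p-value
        p_value = 2 * (1 - standard_normal.cdf(abs(z)))
        if p_value < alpha:
            rejections += 1
    return rejections / n_trials


rate = simulate_false_positive_rate()
print(f"Share of true null hypotheses rejected: {rate:.3f}")
```

The printed share should land close to 0.05, which is the risk we agreed to take when we chose alpha.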



### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Test Statistic

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> A statistical test works by calculating a test statistic, which is a number that summarises how far the observed data deviate from what we would expect if the null hypothesis were true.
* The method to calculate a test statistic varies between tests; for example, the formula for a test with two samples differs from the formula for a test with three samples. The test statistic compares differences between the samples.

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> P-value

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The p-value is considered a tool for deciding whether to reject the null hypothesis.

* In simple terms, a p-value is the probability of observing results at least as extreme as the ones in our sample, assuming the null hypothesis is true. It is not the probability that the null hypothesis is true. The smaller the p-value, the stronger the evidence against the null hypothesis. We will not focus on how it is calculated, such as which statistical tables are used; let's keep it simple for the moment.

* Once you have a p-value and alpha (or Significance level), you are in a position to make a statistical conclusion and interpret a statistical test.
  * If the p-value is lower than the alpha, you have enough evidence to reject the null hypothesis
  * If the p-value is not lower than alpha, you do not have enough evidence to reject the null hypothesis
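
The decision rule above can be written as a tiny helper. This is a minimal sketch: the function name `interpret_test` and the p-values passed to it are invented for illustration and do not come from a real test.

```python
def interpret_test(p_value, alpha=0.05):
    """Return the statistical conclusion for a given p-value and alpha."""
    if not 0 <= p_value <= 1 or not 0 < alpha < 1:
        raise ValueError("p_value must be in [0, 1] and alpha in (0, 1)")
    if p_value < alpha:
        return "Reject the null hypothesis"
    return "Not enough evidence to reject the null hypothesis"


# Illustrative p-values only, not the output of a real test
print(interpret_test(0.003))
print(interpret_test(0.40))
```

Note that the second outcome is worded as "not enough evidence to reject" and not as "accept": the test never proves that the null hypothesis is true.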

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Shapiro-Wilk

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  The Shapiro-Wilk test checks whether a given sample comes from a **normally distributed** population.
* The null hypothesis states that the population is normally distributed. The alternative hypothesis states that the population is not normally distributed.
* Thus, if the p-value is less than the chosen alpha level (typically set at 0.05), the null hypothesis is rejected, and there is evidence that the data tested is not normally distributed.
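
As a minimal sketch of this rule, the snippet below runs the Shapiro-Wilk test directly with `scipy.stats.shapiro()` on two made-up samples. The sample names, sizes and seed are assumptions chosen for illustration; the lesson itself uses Pingouin for the same purpose.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
samples = {
    "normal_sample": rng.normal(loc=0, scale=2, size=200),
    "exponential_sample": rng.exponential(scale=20, size=200),
}

alpha = 0.05
results = {}
for name, values in samples.items():
    # shapiro() returns the W statistic and the p-value
    w_statistic, p_value = stats.shapiro(values)
    results[name] = p_value
    verdict = "reject normality" if p_value < alpha else "no evidence against normality"
    print(f"{name}: W={w_statistic:.3f}, p-value={p_value:.4f} -> {verdict}")
```

For the exponential sample the p-value should be far below alpha, so we reject the null hypothesis of normality for it.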


First, let's generate some data to illustrate the concepts over the lesson, using the libraries we have learned so far

from scipy.stats import skewnorm
np.random.seed(seed=1)
size=200

X1 = np.random.normal(loc=40, scale=2, size=int(size/2) )
X2 = np.random.normal(loc=10, scale=4, size=int(size/2) ) 
bi_modal = np.concatenate([X1, X2])

X1 = np.random.normal(loc=40, scale=4, size=int(size/4) )
X2 = np.random.normal(loc=10, scale=4, size=int(size/4) ) 
X3 = np.random.normal(loc=0, scale=2, size=int(size/4) ) 
X4 = np.random.normal(loc=80, scale=2, size=int(size/4) ) 
multi_modal = np.concatenate([X1, X2, X3, X4])


df = pd.DataFrame(data={'Normal':np.random.normal(loc=0, scale=2, size=size),
                        "Positive Skewed": skewnorm.rvs(a=10, size=size),
                        "Negative Skewed": skewnorm.rvs(a=-10, size=size),
                        "Exponential":np.random.exponential(scale=20,size=size),
                        "Uniform":np.random.uniform(low=0.0, high=1.0, size=size),
                        "Bimodal":  bi_modal,
                        "Multimodal":  multi_modal,
                        "Poisson":np.random.poisson(lam=1.0, size=size),
                        "Discrete": np.random.choice([10,12,14,15,16,17,20],size=size),
                        }).round(3)

df.head(3)


Let's visualise the data distribution using a boxplot and histogram for all variables.
* We loop over each variable and create a figure with two plots: one boxplot and one histogram

for col in df.columns:
  fig, axes = plt.subplots(nrows=2 ,ncols=1 ,figsize=(7,7), gridspec_kw={"height_ratios": (.15, .85)})
  sns.boxplot(data=df, x=col, ax=axes[0])
  axes[0].set_xlabel(" ")
  sns.histplot(data=df, x=col, kde=True, ax=axes[1])
  fig.suptitle(f"{col} Distribution - Boxplot and Histogram")
  plt.show()
  print("\n\n")

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We can test if all numerical columns in a DataFrame are normally distributed with `pg.normality()`. The function documentation is [here](https://pingouin-stats.org/generated/pingouin.normality.html). The arguments we pass are: `data`, and `alpha=0.05` for the significance level
* The output shows each variable name in the ``index``, and the ``normal`` column indicates whether a given variable is normally distributed or not.

pg.normality(data=df, alpha=0.05)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Note that in the previous example, each column holds a distinct numerical distribution.
* However, your data may have a different arrangement.  If your data is in a long format, has numerical and categorical variables, and you want to know if the numerical variables are normally distributed based on a given category, you can use the `dv` and `group` arguments


Consider the dataset below: It has records for three different species of penguins collected from 3 islands in the Palmer Archipelago, Antarctica

df_pinguins = sns.load_dataset('penguins')
print(df_pinguins.shape)
df_pinguins.head(3)

You can check if `bill_length_mm` (numerical variable) is normally distributed across `species` (categorical variable)
* We add the `dv` (dependent variable) as `bill_length_mm` and `group` (grouping variable) as `species`
* We note that only `bill_length_mm` in `Gentoo` species is not normally distributed

pg.normality(data=df_pinguins, dv='bill_length_mm', group='species', alpha=0.05)

However, you will notice that `bill_length_mm` as a whole (all species pooled together) is not normally distributed

pg.normality(data=df_pinguins['bill_length_mm'], alpha=0.05)

You can plot a histogram for `bill_length_mm`, and for `bill_length_mm` per `species`, to relate the shape of each distribution to the Shapiro-Wilk results
* The `bill_length_mm` variable is not normally distributed
* When you analyse `bill_length_mm` per species, Gentoo's `bill_length_mm` is not normally distributed

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> **Note** The visuals may mislead you; what matters is the result of the statistical test

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15,7))
sns.histplot(data=df_pinguins, x='bill_length_mm', kde=True, ax=axes[0])
sns.histplot(data=df_pinguins, x='bill_length_mm',hue='species' , kde=True, palette='Set2', ax=axes[1])
plt.show();
print("\n\n")

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Chi-Squared Test (Test of Independence)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Chi-Squared Test measures if there is a significant difference between the expected frequencies and the observed frequencies in categorical variables


* Hypothesis
  * Null hypothesis - there is no association between the two categorical variables: the observed frequencies match the frequencies expected if the variables were independent
  * Alternative hypothesis - there is an association between the two categorical variables: the observed frequencies differ from the expected ones
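
To make the comparison between observed and expected frequencies concrete, here is a minimal sketch that computes the Pearson chi-squared statistic by hand for a 2x2 table. The counts are invented for illustration, and the helper name `chi_squared_2x2` is hypothetical. The p-value shortcut with `math.erfc` is valid only for 1 degree of freedom (a 2x2 table), and the sketch applies no continuity correction, so a library may report slightly different numbers.

```python
import math


def chi_squared_2x2(observed):
    """Pearson chi-squared statistic and p-value for a 2x2 table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    # Expected count under independence: row total * column total / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(observed, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    # Survival function of a chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return expected, statistic, p_value


# Made-up counts: rows are two groups, columns are outcome 0 / outcome 1
observed = [[30, 20], [10, 40]]
expected, statistic, p_value = chi_squared_2x2(observed)
print(expected)
print(round(statistic, 3), p_value < 0.05)
```

The larger the gap between the observed and expected counts, the larger the statistic and the smaller the p-value.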


Let's consider a built-in dataset from Pingouin. It comes from a study on heart disease, where `target` equal to 1 indicates heart disease.

df = pg.read_dataset('chi2_independence')
print(df.shape)
df.head()

Let's check target (heart disease) distribution with `.value_counts()`

df['target'].value_counts()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's take `target` and `fbs` (which appears to indicate fasting blood sugar)
* We ask ourselves, is `fbs` a good predictor for the `target` (heart disease)? Is there any significant association between them?

Let's make a bar plot to investigate `fbs` levels across different `target` levels
* It shows how many people have or don't have heart disease at each `fbs` level
* Visually, the split between people with and without heart disease looks similar across `fbs` levels

sns.countplot(x='fbs',hue='target',data=df)
plt.show()

We use `pg.chi2_independence()` to conduct the Chi-Squared Test. The documentation link is [here](https://pingouin-stats.org/generated/pingouin.chi2_independence.html#pingouin.chi2_independence). The arguments we use are:
* `data`, plus `x` and `y` as the variables for the Chi-Squared Test. `y` tends to be the target variable you are interested in analysing across a given feature (`x`)

expected, observed, stats = pg.chi2_independence(data=df, x='fbs', y='target')

The test summary (`stats`) includes the result of the Pearson Chi-Squared test


stats

We are interested in the ``pval`` from the ``pearson`` test.
* We ``query`` from stats where `test == pearson` and grab `pval`

stats.query("test == 'pearson'")['pval']

We consider our significance level alpha = 0.05. 
* Since the ``p-value`` (0.744428) is greater than alpha, we do not have enough evidence to reject the null hypothesis.
* Therefore we did not find a significant association between `fbs` and `target`. 
  * `fbs` is not indicated to be a good predictor for `target`

---

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Now let's take `target` and `sex`

* We ask ourselves, is `sex` a good predictor for the `target` (heart disease)?
*  Is there any significant association between them?



Let's make a bar plot to investigate `sex` levels across different `target` levels
* Visually, the proportion of people without heart disease (target = 0) looks different between the two sexes.

sns.countplot(data=df, x='sex', hue='target')
plt.show()

We conduct the Chi-Squared Test, where now `x='sex'`

expected, observed, stats = pg.chi2_independence(data=df, x='sex', y='target')

And extract p-value using the same rationale from the previous exercise

stats.query("test == 'pearson'")['pval']

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We consider our significance level alpha = 0.05.
* Since the p-value (0.000002) is smaller than alpha, we reject the null hypothesis.

* Therefore there is a significant association between `sex` and `target`.
  * `sex` is indicated to be a good predictor for the `target` (heart disease)

---

# Statistical Tests - Unit 02: Parametric Statistical Tests

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

  * Conduct and interpret statistical tests using the T-test, Paired T-test and ANOVA

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import pingouin as pg
import scipy

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Parametric and Nonparametric Statistical Tests
  


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> There are parametric and nonparametric statistical tests.
* Depending on the normality of your data (meaning if it is a normal distribution or not), you may use a parametric test or a nonparametric test

  * If it is normally distributed, we use a parametric test.
  * If it is not, we use a nonparametric test.
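
The rule above can be sketched as a small lookup that suggests one of the tests from this lesson. The function `suggest_test` and its arguments are hypothetical names for illustration. The pairing of each parametric test with a nonparametric counterpart follows the usual convention, so treat it as a rule of thumb and not as a complete decision procedure.

```python
def suggest_test(is_normal, n_groups, paired=False):
    """Suggest a test from this lesson for comparing groups of numerical data."""
    if n_groups < 2:
        raise ValueError("Need at least two groups to compare")
    if n_groups >= 3:
        if paired:
            raise ValueError("Three or more paired groups are outside this sketch")
        return "ANOVA" if is_normal else "Kruskal-Wallis"
    if paired:
        return "Paired T-test" if is_normal else "Wilcoxon"
    return "T-test" if is_normal else "Mann-Whitney"


# is_normal would normally come from a normality check such as Shapiro-Wilk
print(suggest_test(is_normal=True, n_groups=2))
print(suggest_test(is_normal=False, n_groups=2, paired=True))
print(suggest_test(is_normal=False, n_groups=3))
```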


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> **In this unit, we will cover parametric tests**

----

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> T-test

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> A t-test, also known as Student's t-test, is a parametric test (the mean is the parameter) that computes and tests the difference between two sample means
* In other words, it tests whether the difference between the means is 0
* Both samples should be normally distributed 
* It was developed by William Sealy Gosset while working at the Guinness brewery
* The samples should be **independent (or unpaired)**. 
  * For example, imagine we are evaluating the effect of a new drug treatment, and we enrol 200 people, then randomise half in the treatment group and half in the control group. In this case, we have two independent groups


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  The null hypothesis states that there is no difference between the means of the two samples. The alternative hypothesis states that there is a difference between the means of the two samples.
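
To show what the test measures, here is a minimal sketch that computes the t statistic for two independent samples by hand, using only the standard library. It assumes both groups have equal variances (the classic Student's version), the scores are made up, and `independent_t_statistic` is a hypothetical helper name. Turning the statistic into a p-value requires the t distribution, which is left to a statistics library.

```python
import math
from statistics import mean, variance


def independent_t_statistic(x, y):
    """Student's t statistic for two independent samples (pooled variance)."""
    n_x, n_y = len(x), len(y)
    degrees_of_freedom = n_x + n_y - 2
    # Pooled variance: weighted average of the two sample variances
    pooled_variance = (
        (n_x - 1) * variance(x) + (n_y - 1) * variance(y)
    ) / degrees_of_freedom
    standard_error = math.sqrt(pooled_variance * (1 / n_x + 1 / n_y))
    t_statistic = (mean(x) - mean(y)) / standard_error
    return t_statistic, degrees_of_freedom


# Made-up exam scores for two independent groups
group_a = [5, 6, 7, 8, 9]
group_b = [7, 8, 9, 10, 11]
t_statistic, degrees_of_freedom = independent_t_statistic(group_a, group_b)
print(t_statistic, degrees_of_freedom)
```

The statistic is the difference between the means expressed in units of its standard error: the further it is from 0, the stronger the evidence against the null hypothesis.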





Let's consider a DataFrame with columns `Col1` and `Col2`, both generated with a NumPy function that creates normally distributed data

np.random.seed(123)
size = 250
df = pd.DataFrame(data={'Col1': np.random.normal(loc=7, scale=1, size=size),
                        "Col2": np.random.normal(loc=8, scale=1.2, size=size)}
                  )
df.head()

We check normality using `pg.normality()`. Both columns are normally distributed, so we can use a T-test to compare them.

pg.normality(data=df, alpha=0.05)

Let's plot both variables in a histogram and box plot and ask ourselves: **are they similar or different**?

fig, axes = plt.subplots(nrows=1 ,ncols=2 ,figsize=(12,5))

sns.histplot(data=df, kde=True, ax=axes[0])
for col in df.columns: 
  axes[0].axvline(df[col].mean(), color='r', linestyle='dashed', linewidth=1)
sns.boxplot(data=df, ax=axes[1])

plt.show()
print("\n\n")

We conduct a T-test using `pg.ttest()`. The documentation is found [here](https://pingouin-stats.org/generated/pingouin.ttest.html#pingouin.ttest). We pass both numerical distributions as `x` and `y`
* We are interested in the p-value

pg.ttest(x=df['Col1'], y=df['Col2'])

At this moment, we are interested in checking the `p-val`, which is the p-value. We get that using `.loc[]`

pg.ttest(df['Col1'],df['Col2']).loc['T-test','p-val']

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We consider our significance level alpha = 0.05. 
* Since the p-value (3.32e-20) is smaller than alpha, we reject the null hypothesis.

* Therefore there is a statistically significant difference between `Col1` and `Col2` levels. **Their levels are different!**
  * The result is hard to interpret while the columns are just named Col1 and Col2. Instead, think of them as math exam scores from 2 distinct groups. The conclusion, in this case, is that there is a difference between the groups, where the second has higher scores

---

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's consider another exercise and create a DataFrame with columns `Col3` and `Col4`, both generated with a NumPy function that creates normally distributed data

np.random.seed(3)
size = 250
df = pd.DataFrame(data={'Col3':  np.random.normal(loc=7, scale=1, size=size),
                        "Col4":np.random.normal(loc=7.2, scale=1, size=size)})
df.head(3)

We confirm normality with `pg.normality()`. They are normally distributed.

pg.normality(df, alpha=0.05)

Let's plot both variables in a histogram and box plot and ask ourselves: are they similarly distributed or different?

fig, axes = plt.subplots(nrows=1 ,ncols=2 ,figsize=(12,5))

sns.histplot(data=df, kde=True, ax=axes[0])
for col in df.columns: 
  axes[0].axvline(df[col].mean(), color='r', linestyle='dashed', linewidth=1)
sns.boxplot(data=df, ax=axes[1])

plt.show()
print("\n\n")

We conduct a T-test using `pg.ttest()`.

pg.ttest(df['Col3'],df['Col4'])

And we extract the `p-val`, which is the p-value

pg.ttest(df['Col3'],df['Col4']).loc['T-test','p-val']

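<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We consider our significance level alpha = 0.05. 
* Since the p-value (0.0949) is greater than alpha, we do not have enough evidence to reject the null hypothesis.

* Therefore there is not a statistically significant difference between `Col3` and `Col4` levels. Note that this does not prove the levels are the same: the data was generated with slightly different means (7 and 7.2), but the test did not find enough evidence to tell them apart.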

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Paired Student’s t-test

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  A Paired Student’s t-test is a parametric test (the mean is the parameter) and tests for the difference between two sample means. Both samples should be normally distributed.

* The samples should be **dependent (or paired)**
  * It should be a sample of matched pairs. 
  * For example, **imagine the same group is tested twice**. Say you want to examine the difference between people's scores on a test before and after a training intervention 



<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  The null hypothesis states that the mean difference between the paired measurements is zero. The alternative hypothesis states that the mean difference between the paired measurements is not zero.
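
What makes the paired version different is that it works on the per-subject differences, not on the two samples separately. The sketch below computes the paired t statistic by hand under that idea; the before/after scores are invented and `paired_t_statistic` is a hypothetical helper name.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """t statistic for paired samples, computed on the per-subject differences."""
    if len(before) != len(after):
        raise ValueError("Paired samples must have the same length")
    differences = [a - b for a, b in zip(after, before)]
    n = len(differences)
    standard_error = stdev(differences) / math.sqrt(n)
    return mean(differences) / standard_error, n - 1


# Made-up scores for the same five people before and after training
before = [10, 12, 9, 11, 13]
after = [12, 14, 10, 13, 14]
t_statistic, degrees_of_freedom = paired_t_statistic(before, after)
print(round(t_statistic, 3), degrees_of_freedom)
```

Because each person acts as their own control, a small but consistent change across subjects can produce a large statistic.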



Consider a dataset from Pingouin's built-in datasets. **It shows scores for a given test over time in different groups**.

* We keep only the `Meditation` group for this exercise.
* It shows `Scores`, the month (`Time`) and the person ID (`Subject`)

df = (pg.read_dataset('mixed_anova')
    .query("Group == 'Meditation' and Time != 'January'")
    .drop(['Group'], axis=1)
    .reset_index(drop=True)
    )
print(df.shape)
df.head()

We will convert `Time` to an integer that represents the "month value" and assign it to the `Month` column

df['Month'] = df['Time'].replace({"August":8, "June":6})
df.sort_values(by='Month', ascending=True, inplace=True)
df.head()

Let's check if the `Scores` are normally distributed across `Month` with `pg.normality()`
* We see that levels for both months - 6 (June) and 8 (Aug) - are normally distributed

pg.normality(data=df, dv='Scores', group='Month', alpha=0.05)

We use `pg.pairwise_ttests()` to conduct a Paired Student t-test. Find the documentation [here](https://pingouin-stats.org/generated/pingouin.pairwise_ttests.html). In recent Pingouin releases this function has been renamed to `pg.pairwise_tests()`, so use that name if `pairwise_ttests` is not available in your installed version. The arguments used are: 
* ``data`` 
* ``dv`` for the dependent variable (`Scores`)
* ``within`` is the name of the column containing the within-subject factor (in this case, `Month`)
* ``subject`` as the subject identifier (the person ID, `Subject`)

We are interested in evaluating if the `Scores` levels are similar or different, considering the same group across `Month`

pg.pairwise_ttests(data=df, dv='Scores', within='Month', subject='Subject')

We are interested in p-value: `p-unc`

pg.pairwise_ttests(data=df, dv='Scores', within='Month', subject='Subject', effsize='cohen').loc[0,'p-unc']

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  We consider alpha = 0.05. 
* Since the p-value (0.000143) is lower than alpha, we reject the null hypothesis.
* Therefore there is a statistically significant difference between scores in June and August. Their levels are not the same!

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  We use `pg.plot_paired()` to visualise this experiment. The function documentation is [here](https://pingouin-stats.org/generated/pingouin.plot_paired.html). The arguments are the same as in the previous function (`data`, `dv`, `within`, `subject`), plus `dpi`, the image resolution, which we set to 150.
* It shows a boxplot indicating the distribution levels of Scores for months 6 and 8. 
* You will notice dots connected by red and green lines "travelling" from one month to another. Each line in this experiment is a person who had a given score in Month 6 and another score in Month 8.
* If the line is red, the score decreases between months. If the line is green, the score increases. Have a look at the plot, and check visually if, in general, there are more greens or reds and if they are changing a lot or not.
* The test assesses whether the levels for the group as a whole changed.

pg.plot_paired(data=df, dv='Scores', within='Month', subject='Subject', dpi=150)
plt.show()

---

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  However, in more realistic applications you may have more "Months" to analyse. 
* For example, the previous experiment may have been conducted over more months



Consider the same dataset as in the previous example, but now we will consider three months
* We will convert `Time` to an integer that represents the "month value" and assign it to the `Month` column

df = (pg.read_dataset('mixed_anova')
    .query("Group == 'Meditation'")
    .drop(['Group'], axis=1)
    )

df['Month'] = df['Time'].replace({"January":1, "June":6, "August":8})
df.sort_values(by='Month', ascending=True, inplace=True)

df.head()

Let's check if the `Scores` are normally distributed across `Month` with `pg.normality()`
* The score in each month is normally distributed

pg.normality(data=df, dv='Scores', group='Month', alpha=0.05)

We use `pg.pairwise_ttests()` to conduct a pairwise Paired Student t-test. We are interested in evaluating if `Scores` levels are similar or different, considering the same group of people across `Month`
* We will conduct three tests, each with a given pair of months. That is why it is called pairwise. We will be interested in column ``p-unc``
* In the end we will know if there are different levels of scores, individually, from:
  * January to June,
  * June to August and 
  * January to August
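
The word "pairwise" simply means that every unordered pair of months gets its own test. A minimal sketch with the standard library makes the pairs explicit; the list `months` mirrors the three months used in this exercise.

```python
from itertools import combinations

months = ["January", "June", "August"]

# Every unordered pair of months gets its own paired test
month_pairs = list(combinations(months, 2))
for first, second in month_pairs:
    print(f"{first} vs {second}")

# n levels always produce n * (n - 1) / 2 pairs
n_pairs = len(months) * (len(months) - 1) // 2
```

With three months this gives three tests; the number of tests grows quickly as more months are added.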

pg.pairwise_ttests(data=df, dv='Scores', within='Month', subject='Subject', effsize='cohen')

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We consider alpha = 0.05. 

* From January (Month 1) to June (Month 6), the p-value (0.160902) is greater than alpha; we do not have enough evidence to reject the null hypothesis. Therefore there is **not** a statistically significant difference between scores in **January and June.**

* From June (Month 6) to August (Month 8), the p-value (0.00014) is lower than alpha; we reject the null hypothesis. Therefore there is a statistically significant **difference between scores in June and August**. Their levels are not the same!

* From January (Month 1) to August (Month 8), the p-value (0.052379) is slightly greater than alpha; we do not have enough evidence to reject the null hypothesis. Therefore there is **not** a statistically significant difference between scores in **January and August**.

We use `pg.plot_paired()` to visualise this experiment
* With three months in the experiment, there was no statistically significant difference in the group's scores between Jan and Aug
* Visually there was an apparent increase from Jan to Jun, but it was not statistically significant.
* At the same time, if you compare the levels from Jun to Aug, there was a statistically significant decrease.

pg.plot_paired(data=df, dv='Scores', within='Month', subject='Subject', dpi=150)
plt.show()

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Analysis of Variance (ANOVA)

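<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> An Analysis of Variance, or ANOVA, is a parametric test that compares the means of three or more groups by analysing their variation. The data should be normally distributed. The null hypothesis states that all group means are equal; the alternative hypothesis states that at least one group mean is different.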
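
To illustrate how ANOVA compares variation, the sketch below computes the F statistic of a one-way ANOVA by hand: the variation between group means divided by the variation within groups. The measurements are made up and `one_way_anova_f` is a hypothetical helper name. Converting F into a p-value requires the F distribution, which is left to a statistics library.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA, plus its two degrees of freedom."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)

    # Between-group variation: how far each group mean is from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group variation: how far each value is from its own group mean
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return f_statistic, df_between, df_within


# Made-up measurements for three groups
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_statistic, df_between, df_within = one_way_anova_f(groups)
print(round(f_statistic, 3), df_between, df_within)
```

A large F means the group means are spread out much more than the values inside each group, which is evidence against the null hypothesis.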

Consider a dataset from Pingouin's built-in datasets. It shows `Pain threshold` levels for people with different `Hair color` (Dark Brunette, Light Blond, Dark Blond, Light Brunette). The subject is the person's ID

df = pg.read_dataset('anova')
print(df.shape)
df.head(3)

We can check `Pain threshold` normality across different `Hair color`. It is normally distributed in every group

pg.normality(df, dv='Pain threshold',group='Hair color', alpha=0.05)

We combine a boxplot and swarm plot to visually check `Pain threshold` across different `Hair color`
* **Visually speaking**, we notice there are few data points, and there appears to be a `Pain threshold` difference across different `Hair color`. However, it is **wise** not to conclude anything before conducting a statistical test.

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14,5))
sns.boxplot(data=df,x="Hair color", y="Pain threshold", ax=axes[0])
sns.swarmplot(data=df,x="Hair color", y="Pain threshold", dodge=True, ax=axes[1])
plt.show()

We conduct an ANOVA test with `pg.anova()`. The function documentation is found [here](https://pingouin-stats.org/generated/pingouin.anova.html#pingouin.anova)

pg.anova(data=df, dv='Pain threshold', between='Hair color', detailed=True)


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are interested in p-unc, which is 0.004114	
* We consider our significance level alpha = 0.05.
* Since the p-value (0.004114) is lower than alpha, we reject the null hypothesis.

* Therefore, there is enough statistical evidence to conclude that `Pain threshold` levels differ between hair colours.

---



# Statistical Tests - Unit 03: Nonparametric Statistical Tests

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

  * Conduct and interpret statistical tests using Mann Whitney, Wilcoxon and Kruskal Wallis test

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import pingouin as pg
import scipy

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Statistical Tests - Unit 03: Nonparametric Statistical Tests

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> There are parametric and nonparametric statistical tests.
* Depending on the normality of your data (that is, whether or not it follows a normal distribution), you use either a parametric or a nonparametric test.

  * If it is normally distributed, we use a parametric test.
  * If it is not, we use a nonparametric test.
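
The rule above, combined with the tests named in this lesson, can be summarised as a small lookup. This is a simplified sketch: `choose_test` and its parameters are hypothetical names introduced for illustration, and real test selection also depends on other assumptions such as sample size.

```python
def choose_test(n_groups, paired, all_normal):
    """Suggest one of the tests covered in this lesson (simplified sketch)."""
    if n_groups < 2:
        raise ValueError("need at least two groups to compare")
    if n_groups == 2:
        if paired:
            return "Paired t-test" if all_normal else "Wilcoxon"
        return "Independent t-test" if all_normal else "Mann-Whitney U"
    if paired:
        # Three or more paired groups fall outside the tests listed in this lesson
        raise ValueError("three or more paired groups are not covered here")
    return "ANOVA" if all_normal else "Kruskal-Wallis"


print(choose_test(n_groups=2, paired=False, all_normal=False))
print(choose_test(n_groups=3, paired=False, all_normal=True))
```

Here `all_normal` stands for the outcome of a normality check such as `pg.normality()` on every group.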


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> **In this unit, we will cover nonparametric tests**

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Mann-Whitney U Test

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">   A Mann-Whitney U Test is a nonparametric test used to determine if there are differences between two groups where at least one group is not normally distributed.
* The samples should be independent (or unpaired).
* A nonparametric version of the independent t-test we saw in the previous unit.
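
To give an intuition for the U statistic, the sketch below counts, over every pair made of one value from each group, how often the value from the first group is the larger one. The data and the helper name `mann_whitney_u` are made up for illustration; this shows the idea only and does not compute a p-value.

```python
def mann_whitney_u(x, y):
    """Return (U_x, U_y) by comparing every x value with every y value."""
    u_x = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u_x += 1.0   # x wins this pair
            elif xi == yj:
                u_x += 0.5   # ties count as half a win
    u_y = len(x) * len(y) - u_x  # the two U values always sum to n_x * n_y
    return u_x, u_y


# Hypothetical toy samples
x = [1.1, 2.2, 3.3, 4.4]
y = [3.0, 5.5, 6.6]
u_x, u_y = mann_whitney_u(x, y)
print(u_x, u_y)
```

If the two groups had similar levels, both U values would sit close to half of `len(x) * len(y)`. A U value far from that midpoint points towards a difference between the groups.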

Let's consider a DataFrame with Col1 and Col2 columns, generated with NumPy's `np.random.uniform()` so that the data is not normally distributed

np.random.seed(1)
df = pd.DataFrame(data={'Col1':np.random.uniform(low=0, high=1, size=500),
                        "Col2":np.random.uniform(low=0.1, high=1, size=500)})
df.head()

We check for normality. Neither column is normally distributed

pg.normality(data=df, alpha=0.05)

We plot both columns in a histogram and boxplot to better understand the levels

fig, axes = plt.subplots(nrows=1 ,ncols=2 ,figsize=(12,5))
sns.histplot(data=df, kde=True, ax=axes[0])
sns.boxplot(data=df, ax=axes[1])
plt.show()
print("\n\n")

We use `pg.mwu()` to conduct a Mann-Whitney U Test. The documentation link is [here](https://pingouin-stats.org/generated/pingouin.mwu.html). The arguments we use are `x` and `y`, to which we pass the two numerical samples we want to compare

pg.mwu(x=df['Col1'], y=df['Col2'])


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are interested in p-val, which is 0.0784
* We consider our significance level alpha = 0.05.
* Since the p-value (0.0784) is higher than alpha, we fail to reject the null hypothesis.

* Therefore, there is not enough statistical evidence to conclude the levels are different.
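
The decision step used throughout this lesson can be written as a tiny helper. This is a sketch only: `decide` is a hypothetical name, and it uses the "lower than alpha" rule stated in the text. Note that a p-value above alpha does not prove the null hypothesis; it only means the data did not provide enough evidence against it.

```python
def decide(p_value, alpha=0.05):
    """Compare a p-value with the significance level alpha."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


print(decide(0.0784))    # p-value from the Mann-Whitney example above
print(decide(0.004114))  # p-value from the ANOVA example in the previous unit
```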

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Wilcoxon Test

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  A Wilcoxon Test is a nonparametric alternative to the paired t-test, used when at least one of the samples is not **normally distributed**.
* The samples should be **dependent (or paired)**
  * It should be a sample of matched pairs. 
  * For example, **imagine the same group is tested twice**. Say you want to examine the difference between people's scores on a test before and after a training intervention 



Let's consider a DataFrame that has Col3 and Col4 columns made with Python lists

df = pd.DataFrame(data={'Col3':[18.3, 13.3, 16.5, 12.6, 9.5, 13.6, 8.1, 8.9, 10, 8.3, 7.9, 8.1, 13.4],
                        "Col4":[12.7, 11.1, 15.3, 12.7, 10.5, 15.6, 11.2, 14.2, 16.3, 15.5, 19.9, 20.4, 36.8]
                        })
df.head()

Let's check for normality. One column is not normally distributed, so we can use the Wilcoxon test

pg.normality(data=df, alpha=0.05)

We plot both columns in a histogram and boxplot to better understand the levels

fig, axes = plt.subplots(nrows=1 ,ncols=2 ,figsize=(12,5))
sns.histplot(data=df, kde=True, ax=axes[0])
sns.boxplot(data=df, ax=axes[1])
plt.show()
print("\n\n")

We use `pg.wilcoxon()` to conduct a Wilcoxon Test. The documentation is [here](https://pingouin-stats.org/generated/pingouin.wilcoxon.html). The arguments we use are x and y, as the numerical data we want to compare.

pg.wilcoxon(x=df['Col3'], y=df['Col4'])


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are interested in the p-val, which is 0.0397
* We consider our significance level alpha = 0.05.
* Since the p-value (0.0397) is lower than alpha, we reject the null hypothesis.

* Therefore, there is enough statistical evidence to conclude the levels are different.
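
To see where the Wilcoxon statistic comes from, the sketch below ranks the absolute paired differences of the Col3 and Col4 values used above and sums the ranks separately for positive and negative differences. It is a simplified illustration that assumes no tied absolute differences (true for this data) and does not compute the p-value.

```python
col3 = [18.3, 13.3, 16.5, 12.6, 9.5, 13.6, 8.1, 8.9, 10, 8.3, 7.9, 8.1, 13.4]
col4 = [12.7, 11.1, 15.3, 12.7, 10.5, 15.6, 11.2, 14.2, 16.3, 15.5, 19.9, 20.4, 36.8]

# Paired differences; zero differences carry no sign information and are dropped
diffs = [a - b for a, b in zip(col3, col4) if a != b]

# Rank the absolute differences from smallest (rank 1) to largest
order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
ranks = [0] * len(diffs)
for rank, idx in enumerate(order, start=1):
    ranks[idx] = rank

# Sum the ranks of the positive and of the negative differences
w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
print(w_plus, w_minus)
```

If the two columns had similar levels, the two rank sums would be close to each other. Here the negative differences dominate, which is consistent with the test finding a difference.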

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Kruskal-Wallis

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> A Kruskal-Wallis test is a nonparametric test used to determine if there are differences between three or more groups, considered when at least one of the distributions is not normally distributed
* A nonparametric alternative to one-way ANOVA
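
As an intuition for how the test works, the sketch below computes the Kruskal-Wallis H statistic by hand: all observations are ranked together, and H grows when the rank sums of the groups differ strongly. The data is a made-up toy example and the code assumes there are no tied values, so treat it as an illustration rather than a replacement for a library function.

```python
# Hypothetical toy data: three groups that do not overlap at all
groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

# Rank all observations together (assumes no ties)
all_values = sorted(v for g in groups for v in g)
rank_of = {v: i for i, v in enumerate(all_values, start=1)}
n = len(all_values)

# Sum over groups of (rank sum squared / group size)
rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)

# H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1)
h_stat = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)
print(f"H = {h_stat:.2f}")
```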



We use a pingouin dataset. We are interested in the `Metric` and `Performance` variables to demonstrate the concept in this exercise

df = pg.read_dataset("rm_anova2").filter(['Metric', 'Performance'])
df.head()

We want to know how many observations each `Metric` level has, so we use `.value_counts()`. There are three levels (action, product and client)

df['Metric'].value_counts()

We check for normality. One group is not normally distributed, so we can use Kruskal-Wallis

pg.normality(data=df, dv='Performance', group='Metric', alpha=0.05)

We combine a boxplot and swarm plot to visually check `Performance` across different `Metric`
* **Visually**, we notice that there are few data points, and there appears to be a `Performance` difference across `Metric` levels. However, it is **wise** not to conclude anything before conducting a statistical test.

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14,5))
sns.boxplot(data=df, x="Metric", y="Performance", ax=axes[0])
sns.swarmplot(data=df, x="Metric", y="Performance", dodge=True, ax=axes[1])
plt.show()

We use `pg.kruskal()` to conduct a Kruskal-Wallis test. The documentation is [here](https://pingouin-stats.org/generated/pingouin.kruskal.html#pingouin.kruskal). The arguments are `data`, `dv` as the dependent variable and `between` as the grouping variable whose levels we compare.




pg.kruskal(data=df, dv='Performance', between='Metric')

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are interested in p-unc, which is 0.00012
* We consider our significance level alpha = 0.05.
* Since the p-value (0.00012) is lower than alpha, we reject the null hypothesis.

* Therefore, there is enough statistical evidence to conclude the levels are different.

---