<a href="https://colab.research.google.com/github/ram-anand/ram-anand.github.io/blob/main/List_of_Parametric_and_Non_Parametric_Hypothesis_Tests.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction to Hypothesis Tests

Gathering and analyzing a sample instead of the entire population is often the most practical and cost-effective way to make inferences about that population. From the sample data, we can estimate the actual population parameter with some degree of error. Equivalently, we can say that the actual population parameter likely lies within a confidence interval around the estimate.

A statistical hypothesis is an assumption about a population parameter, and a hypothesis test is a formal statistical procedure we use to reject or fail to reject that hypothesis.

To test whether a statistical hypothesis about a population parameter is true, we obtain a random sample from the population and perform a hypothesis test on the sample data.

<figure>
<img src="https://www.dxbydt.com/wp-content/uploads/2015/09/inferential-statistics-sample-population.png" width="350"/>
<img src="https://www.netquest.com/hs-fs/hubfs/sampling_image.jpg?width=804&name=sampling_image.jpg" width="250"/>
<img src="https://www.six-sigma-material.com/images/xPopSamples.GIF.pagespeed.ic.v4-6UWWh-1.webp" width="300"/>
</figure>

There are two types of statistical hypotheses:
- The null hypothesis: denoted as H0, assumes that the sample data occurs purely from chance. This hypothesis states that there is no effect of experiment on the variable of interest.
- The alternative hypothesis: denoted as H1 or Ha, assumes the sample data is influenced by some non-random cause. This hypothesis states that there is an effect of experiment and there will be significant difference between the control group and treatment (experimental group).

A statistical hypothesis can be one-tailed or two-tailed.
- A one-tailed hypothesis tests if a statistic is less than or greater than some value.
- A two-tailed hypothesis is used to test if the statistic is equal to or not equal to some value in its statement. The equal statement is always part of the null hypothesis.

<img src="https://qphs.fs.quoracdn.net/main-qimg-8d3108b7ec80883512b6a6f925ddf7ad" width="800"/>

**Dependent or Paired sample:**
If the values in one sample affect the values in the other sample, then the samples are dependent.

**Independent or unpaired sample:**
If the values in one sample reveal no information about those of the other sample, then the samples are independent.



### Confidence Interval

A confidence interval (CI) expresses how much uncertainty there is around a particular statistic. It is usually reported as a point estimate together with a margin of error.

>Confidence Interval, $CI = \text{point estimate} \pm \text{error margin}$\
$\text{error margin} = \text{critical value}*\text{standard error}$\
e.g. $CI = (\bar{x}-z_{\alpha/2}*s.e., \bar{x}+z_{\alpha/2}*s.e.)$, where $\bar{x}$ is the sample mean

>Point estimate: The point estimate of your confidence interval will be whatever statistical estimate you are making (e.g. population mean, the difference between population means, proportions, variation among groups).

- The confidence interval (CI) is a range of values that’s likely to include a population value with a certain degree of confidence (confidence level). It is a range of values we are fairly sure our true value lies in.

- A confidence interval is the mean of your estimate plus and minus the variation in that estimate. This is the range of values you expect your estimate to fall between if you redo your test, within a certain level of confidence.

#### Confidence Level

Confidence, in statistics, is another way to describe probability.
The confidence level is the percentage of times you expect to reproduce an estimate between the upper and lower bounds (interval) of the confidence interval, and is set by the alpha value called significance level.

> Confidence level, $CL = 1 − \alpha$

A significance level of 0.05, or equivalently a confidence level of 95%, says that 95% of experiments like the one we just did will produce an interval that includes the true mean, but 5% won't.

So there is a 1-in-20 chance (5%) that our Confidence Interval does NOT include the true mean.

For example, if you construct a confidence interval with a 95% confidence level, you are confident that 95 out of 100 times the estimate will fall between the upper and lower values specified by the confidence interval.
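
The coverage interpretation can be checked with a small simulation. The sketch below is only illustrative: it assumes a normal population with a known standard deviation, and the population values (`true_mean=50`, `sigma=10`), the sample size and the function names are made up for the example.

```python
import random
from statistics import NormalDist, mean

def z_confidence_interval(sample, sigma, confidence=0.95):
    """Return (lower, upper) for the population mean when sigma is known."""
    alpha = 1 - confidence
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for 95%
    margin = z_crit * sigma / len(sample) ** 0.5   # critical value * standard error
    centre = mean(sample)
    return centre - margin, centre + margin

def coverage(true_mean=50.0, sigma=10.0, n=40, trials=2000, seed=0):
    """Fraction of simulated intervals that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        lower, upper = z_confidence_interval(sample, sigma)
        hits += lower <= true_mean <= upper
    return hits / trials

# With a 95% confidence level the printed fraction should be close to 0.95.
print(round(coverage(), 3))
```

Each simulated interval is different, but roughly 95% of them should contain the true mean, which is what the confidence level describes.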

#### Error Margin vs Sample size

The error margin is the distance above or below the point estimate that defines the bounds of the confidence interval.

<figure>
<img src="https://sites.google.com/site/hellobenchen/_/rsrc/1570154043247/home/wiki/math/marginoferr/margin_of_error.png" width="450"/>
<img src="http://www.geoib.com/uploads/7/6/3/9/7639044/4776976.jpg" width="450"/>
<center><figcaption> Source: (1) hellobenchen (2) geoib.com</figcaption></center>
</figure>


>$\text{error margin} = \text{critical value}*s.e.$,\
$s.e.=s.d./\sqrt{n}$

**sample size**, $n = (\text{critical value})^2*(\frac{\text{standard deviation}}{\text{error margin}})^2$

<figure>
<img src="https://statsandr.com/blog/chi-square-test-of-independence-by-hand_files/Screenshot%202020-01-28%20at%2000.56.28.png"/>
<center><figcaption> Source: statsandr.com</figcaption></center>
</figure>

To find the critical value, follow these steps.
- Compute alpha: $\alpha$ = 1 - (confidence level / 100)
- Find the critical probability, $p* = 1 - \alpha/2$.
- Find the critical value (point on x-axis from pdf) having a cumulative probability equal to the critical probability $(p*)$
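
The steps above, together with the sample size formula, can be sketched in a few lines. This is a minimal example that assumes a z critical value (standard normal distribution); the function names and the numbers in the usage lines (`sd=15`, `error_margin=3`) are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def critical_value(confidence_level_pct):
    """Two-tailed z critical value for a confidence level given in percent."""
    alpha = 1 - confidence_level_pct / 100        # step 1: compute alpha
    p_star = 1 - alpha / 2                        # step 2: critical probability
    return NormalDist().inv_cdf(p_star)           # step 3: point with that cumulative probability

def margin_of_error(sd, n, confidence_level_pct=95):
    """Error margin = critical value * standard error."""
    return critical_value(confidence_level_pct) * sd / n ** 0.5

def sample_size_for_margin(sd, error_margin, confidence_level_pct=95):
    """Smallest n whose error margin does not exceed the target."""
    z = critical_value(confidence_level_pct)
    return ceil((z * sd / error_margin) ** 2)

print(round(critical_value(95), 2))               # 1.96
n = sample_size_for_margin(sd=15, error_margin=3)
print(n)                                          # 97
print(margin_of_error(sd=15, n=n) <= 3)           # True
```

Rounding the sample size up rather than to the nearest integer keeps the achieved error margin at or below the target.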







### Types of Decision Errors

There are two types of decision errors that one can make when doing a hypothesis test:

<figure><center>
<img src="https://i1.wp.com/statisticsbyjim.com/wp-content/uploads/2018/07/TypesErrorHypothesisTests.png?resize=600%2C400" width="400"/>
<img src="https://www.researchgate.net/profile/Abdulkerim-Gok/publication/316927316/figure/fig3/AS:667699772391428@1536203439714/Left-Definitions-of-terminologies-in-a-statistical-test-Right-An-illustration-of-power.ppm" width="450"/><figcaption>source: (1) statisticsbyjim.com (2) Picture by Abdulkerim Gok</figcaption></center>
</figure>

|Reality|Do Not Reject H0|Reject H0|
|---|---|---|
|H0 is True|Correct Decision|Type I Error ($\alpha$)|
|H0 is False|Type II Error ($\beta$)|Correct Decision|


- Type I error: You reject the null hypothesis when it is actually true. The probability of committing a Type I error is equal to the significance level, often called alpha ($\alpha$)  and commonly set at 0.05.

- Type II error: You fail to reject the null hypothesis when it is actually false. The probability of committing a Type II error is called the Beta, denoted as ($\beta$). 

- Statistical significance: 
  - It is a term used by researchers to state that it is unlikely their observations could have occurred under the null hypothesis of a statistical test. 
  - Significance is usually denoted by a p-value, or probability value. Statistical significance is arbitrary – it depends on the threshold (significance level), or alpha value, chosen by the researcher. 
  - The most common threshold is p < 0.05, which means that the data is likely to occur less than 5% of the time under the null hypothesis. 
  - When the p-value falls below the chosen alpha value, then we say the result of the test is statistically significant.




  







### Statistical Power (sensitivity)

The power of a test, **$(1-\beta)$**, is the probability that the test correctly rejects the null hypothesis H0 when H0 is false. In other words, power is the probability of avoiding a Type II error. The higher the statistical power of a test, the lower the risk of making a Type II error.


A true effect is a real, non-zero relationship between variables in a population. An effect is usually indicated by a real difference between groups or a correlation between variables.

Statistical power, or sensitivity, is the likelihood/chance of a significance test detecting an effect when there actually is one. High power in a study indicates a large chance of a test detecting a true effect. 
Low power means that your test only has a small chance of detecting a true effect or that the results are likely to be distorted by random and systematic error.

Power is usually set at 80% or higher. This means that if there are true effects to be found in 100 different studies with 80% power, only 80 out of 100 statistical tests will actually detect them. 

If you don’t ensure sufficient power, your study may not be able to detect a true effect at all. This means that resources like time and money are wasted, and it may even be unethical to collect data from participants (especially in clinical trials).

On the flip side, too much power means your tests are highly sensitive to true effects, including very small ones. This may lead to finding statistically significant results with very little usefulness in the real world.

To balance these pros and cons of low versus high statistical power, you should use a power analysis to set an appropriate level.

**What is a power analysis?**
A power analysis is a calculation that helps you determine the minimum sample size for your study.

A power analysis is made up of four main components: statistical power, sample size, significance level, and expected effect size. If you know or have estimates for any three of these, you can calculate the fourth component.

### Factors influencing power

>1. **Sample size**
- It is the minimum number of observations needed to observe an effect of a certain size with a given power level.
- Sample size is positively related to power. A small sample (less than 30 units) may only have low power while a large sample has high power.
Increasing sample size improves power. But there is a point at which increasing your sample size may not yield high enough benefits.
- Nonparametric tests can be subject to low power mainly due to small sample size. Therefore, it is important to consider the possibility of a Type II error when a nonparametric test fails to reject H0. There may be a true effect or difference, yet the nonparametric test is underpowered to detect it.
- Your research design is also related to power and sample size:
  - In a within-subjects design, each participant is tested in all treatments of a study, so individual differences will not unevenly affect the outcomes of different treatments.
  - In a between-subjects design, each participant only takes part in a single treatment, so with different participants in each treatment, there is a chance that individual differences can affect the results.
  A within-subjects design is more powerful, so fewer participants are needed. More participants are needed in a between-subjects design to establish relationships between variables.

>2. **Significance level (alpha)**
- It is the maximum risk of rejecting a true null hypothesis that you are willing to take, usually set at 5%.
- Significance level is positively related to power: increasing the significance level (e.g., from 5% to 10%) increases power, while decreasing it makes your significance test more conservative and less sensitive to true effects.
- Increasing the significance level makes a test more sensitive to detecting true effects, but it also increases the risk of making a Type I error.

>3. **Expected effect size**
- It is a standardized way of expressing the magnitude of the expected result of your study, usually based on similar studies or a pilot study.
- Power is positively related to effect size: larger effects are easier to detect. To increase the expected effect in an experiment, you could manipulate your independent variable more widely (e.g., spending 1 hour instead of 10 minutes in nature) to increase the effect on the dependent variable (stress level). This may not always be possible because there are limits to how much the outcomes in an experiment may vary.

>4. **Use a one-tailed test**
- Instead of a two-tailed test, go for one tailed test. When using a t test or z tests, a one-tailed test has higher power. 
- However, a one-tailed test should only be used when there’s a strong reason to expect an effect in a specific direction (e.g., one mean score will be higher than the other), because it won’t be able to detect an effect in the other direction. In contrast, a two-tailed test is able to detect an effect in either direction.

Other factors affecting Power


5. **Variability**
- The variability of the population characteristics affects the power of your test. High population variance reduces power. 
- In other words, using a population that takes on a large range of values for a variable will lower the sensitivity of your test, while using a population where the variable is relatively narrowly distributed will heighten the sensitivity of the test.
- Using a fairly specific population with defined demographic characteristics can lower the spread of the variable of interest and improve power.


6. **Measurement error**
- The higher the measurement error in a study, the lower the statistical power of a test. Measurement error can be random or systematic. 
- Reducing measurement error can improve power. Increasing the precision and accuracy of your measurement devices and procedures reduces variability, improving reliability and power.  Using multiple measures or methods, known as triangulation, can also help reduce systematic bias.
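
To see how these factors move power, here is a rough sketch for a one-sample z-test with known population standard deviation. The function name `z_test_power` and the baseline numbers are assumptions made for illustration; the one-tailed branch assumes the true effect lies in the hypothesized direction.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def z_test_power(delta, sigma, n, alpha=0.05, two_tailed=True):
    """Power of a one-sample z-test when the true mean is shifted by `delta`."""
    shift = abs(delta) / sigma * n ** 0.5          # effect size * sqrt(n)
    if two_tailed:
        z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
        # probability of landing in either rejection region
        return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    return STD_NORMAL.cdf(shift - z_crit)

baseline = dict(delta=5, sigma=10, n=30)
print("baseline          ", round(z_test_power(**baseline), 3))
print("larger sample     ", round(z_test_power(**{**baseline, "n": 60}), 3))
print("alpha = 0.10      ", round(z_test_power(**baseline, alpha=0.10), 3))
print("larger effect     ", round(z_test_power(**{**baseline, "delta": 8}), 3))
print("one-tailed        ", round(z_test_power(**baseline, two_tailed=False), 3))
print("more variability  ", round(z_test_power(**{**baseline, "sigma": 20}), 3))
```

Every line except the last should print a power above the baseline, while doubling the standard deviation lowers it.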



### Effect size

Effect size is a quantitative measure of the magnitude of an experimental effect; it evaluates the strength of a statistical claim.
The larger the effect size, the stronger the relationship.
Examples of effect sizes include:

- **Absolute or unstandardized measure**: 
How large the difference between the groups is (the mean difference). 
These statistics describe the size of the effect in the original units of the variables, which makes them easier to interpret.
e.g., the difference between group means = $\mu_{1}-\mu_{2}$

- **Standardized measure**:  
Effect size is the difference in means between the two groups divided by the standard deviation of the control group. Here, the scores are standardized using standard deviation to remove the units of the variables in the effect.
e.g., the standardized difference between group means = $\frac{\mu_{1}-\mu_{2}}{\sigma}$

While statistical significance shows that an effect exists in a study, practical significance shows that the effect is large enough to be meaningful in the real world. Statistical significance is denoted by p-values, whereas practical significance is represented by effect sizes.

Statistical significance alone can be misleading because it’s influenced by the sample size. Increasing the sample size always makes it more likely to find a statistically significant effect, no matter how small the effect truly is in the real world.

In contrast, effect sizes are independent of the sample size. Only the data is used to calculate effect sizes.

Standardized effect sizes help you evaluate how big or small an effect is when the units of measurement aren’t intuitive or can help you compare results across studies. It can be used in sample size calculations.

Effect sizes complement statistical hypothesis testing, and play an important role in power analyses, sample size planning, and in meta-analyses.


|Test|Effect size|Statistic for Effect size|Notes|Small|Medium|Large|
|---|---|---|---|---|---|---|
|One sample z-test|Cohen's d|$\frac{\mu_{1}-\mu_{0}}{\sigma}$|$\mu_{1}$ = population mean, <br>$\mu_{0}$ =  hypothesized population mean,<br>$\sigma$ = population sd|0.2|0.5|0.8|
|Two sample z-test|Cohen's d|$\frac{\mu_{1}-\mu_{2}}{\sigma}$|$\mu_{1},\mu_{2}$ = population means, <br>$\sigma$ = control group sd||||
|One proportion z-test|Cohen's d|$\frac{p_{1}-p_{0}}{\sqrt{p_{0}(1-p_{0})}}$|$p_{1}$ = sample proportion, <br>$p_{0}$ = hypothesized population proportion||||
|Two proportion z-test|Cohen's d|$\frac{p_{1}-p_{2}}{\sqrt{p(1-p)}}$|$p_{1}, p_{2}$ = sample proportions, <br>$p = \frac{p_{1}+p_{2}}{2}$ = mean proportion||||
|One sample t-test|Cohen's d|$\frac{\bar{x}-\mu_{0}}{s}$|$\bar{x}$ = sample mean,<br>$\mu_{0}$ = hypothesized population mean,<br> $s = \sqrt{\frac{\sum(x-\bar{x})^2}{n-1}}$ = sample sd||||
|Paired sample t-test|Cohen's d|$\frac{\bar{x}-\bar{y}}{s_{d}} = \frac{\bar{d}}{s_{d}}$|$\bar{d}$ = mean of differences,<br>$s_{d} = \sqrt{\frac{\sum(d_{i}-\bar{d})^2}{n-1}}$ = sd of differences||||
|Paired sample t-test|Cohen's d|$\frac{t_{d}}{\sqrt{n}}$|$t_{d} = \frac{\bar{d}}{s_{d}/\sqrt{n}}$ = t-statistic of differences,<br>$s_{d}$ = sd of differences||||
|Student t-test|Cohen's d|$\frac{\bar{x}_{1}-\bar{x}_{2}}{s_{pooled}} = t\sqrt{\frac{1}{n_{1}}+\frac{1}{n_{2}}}$|$\bar{x}_{1}, \bar{x}_{2}$ = sample means, <br> t = student t-statistic,<br>$s_{pooled}$ = pooled sample sd||||
|Student t-test|r squared or eta squared|$\frac{t^2}{t^2+df}$|t = student t-statistic,<br> degrees of freedom, $df=n_{1}+n_{2}-2$||||
|Correlation in two variables|r squared|$\frac{[\sum(x-\bar{x})(y-\bar{y})]^2}{\sum(x-\bar{x})^2\sum(y-\bar{y})^2}$|Measures degree of linear relationship <br>between two quantitative variables|0.2|0.5|0.8|
|Non parametric test|r squared or eta squared|$\frac{z^2}{n}$|$r^2=\eta^2$,<br>z = standardized value of test <br>statistic, n = number of observations||||
|Chi square goodness of fit|Cohen's w|$\sqrt{\frac{\chi^2}{N}}$|||||
|Chi square independence test|Phi|$\sqrt{\frac{\chi^2}{N}}$|N = total observations|0.1|0.3|0.5|
|Chi square independence test|Cramer's V|$\sqrt{\frac{\chi^2}{N*df}}$|$df=\min(R-1, C-1)$,<br>R, C = number of rows and columns||||
|Linear Regression|R squared|$\frac{SSR}{SST}$|Proportion of variance in one <br>variable explained by the other|0.02|0.13|0.26|
|Linear Regression|Cohen's f squared|$\frac{R^2}{1-R^2}$||0.02|0.15|0.35|
|ANOVA|Eta Squared, $\eta^2$|$\frac{(k-1)F}{(k-1)F+(N-k)}$|$F=\frac{MSB}{MSE}$,<br>k = number of groups,<br>N = total observations|0.01|0.06|0.14|
|ANOVA|(Partial) Eta Squared, $\eta^2$|$\frac{SSB}{SSB+SSE}$||0.01|0.06|0.14|
|ANOVA|Omega Squared, $\omega^2$|$\frac{SSB-df_{SSB}*MSE}{SST+MSE}$||0.01|0.06|0.14|
|ANOVA|Epsilon Squared, $\epsilon^2$|$\frac{SSB-df_{SSB}*MSE}{SST}$|||||
|ANOVA|Cohen's f|$\sqrt{\frac{R^2}{1-R^2}} = \sqrt{\frac{\eta^2}{1-\eta^2}} = \sqrt{\frac{\omega^2}{1-\omega^2}}$||0.10|0.25|0.40|
|Between Groups|Odds ratio (OR) |$\frac{odds_{group-1}}{odds_{group-2}}$||1.5|2|3|
|Between groups|Relative risk or risk ratio (RR)|$\frac{TR}{CR}$|CR = control group risk<br>TR = treatment group risk|2|3|4|
|Mann Whitney U test |Rank biserial correlation, r|$1-\frac{2U}{n_{1}n_{2}}$|U = Mann whitney test statistic<br>$n_{1}, n_{2}$ = sample sizes|0.2|0.5|0.8|
|Friedman test|Kendall's W|$\frac{\chi^2}{n(k-1)}$|$\chi^2$ = Friedman test statistic <br>n = sample size,<br> k = number of groups|0.1|0.3|0.5|
|Kruskal-Wallis H-test|Eta Squared, $\eta^2$|$\frac{H-k+1}{n-k}$|H = Kruskal Wallis test statistic,<br>k = number of groups,<br>n = total observations|0.1|0.3|0.5|
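
As a worked illustration of a few entries in the table, the sketch below computes Cohen's d for two independent samples with a pooled standard deviation, converts it to the Student t-statistic, and derives eta squared from that t. The two small samples and the function names are invented for the example.

```python
from statistics import mean, variance

def cohens_d(x, y):
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / pooled_var ** 0.5

def t_from_d(d, n1, n2):
    """Student t-statistic implied by Cohen's d: d = t * sqrt(1/n1 + 1/n2)."""
    return d / (1 / n1 + 1 / n2) ** 0.5

def eta_squared_from_t(t, df):
    """Share of variance explained, from a t-statistic and its degrees of freedom."""
    return t ** 2 / (t ** 2 + df)

treatment = [23, 25, 28, 30, 32]   # hypothetical scores
control = [20, 22, 24, 26, 28]     # hypothetical scores

d = cohens_d(treatment, control)
t = t_from_d(d, len(treatment), len(control))
eta2 = eta_squared_from_t(t, len(treatment) + len(control) - 2)
print(round(d, 3), round(t, 3), round(eta2, 3))
```

Here d is above 0.8, which the table would label a large effect, even though such a small sample gives only a modest t-statistic.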


### Sample size

> sample size, $n = \left(\frac{z_{1-\alpha/2}+z_{1-\beta}}{ES}\right)^2$,
where ES = effect size

As sample size increases, the power of your test also increases. A larger sample means that you have collected more information, which makes it easier to correctly reject the null hypothesis when you should.

To ensure that your sample size is big enough, you will need to conduct a power analysis calculation. 

For any power calculation, you will need to know:

- The type of test you plan to use (e.g., independent t-test, paired t-test, ANOVA, regression),
- The alpha value or significance level you are using (usually 0.01 or 0.05),
- The expected effect size,
- The sample size you are planning to use.

When these values are entered, a power value between 0 and 1 will be generated. If the power is less than 0.8, you will need to increase your sample size.
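
A power calculation of this kind can be sketched for the simplest case, a two-tailed one-sample z-test. The helper names below are hypothetical, and `achieved_power` uses the usual approximation that ignores the negligible chance of rejecting in the wrong tail.

```python
from math import ceil
from statistics import NormalDist

STD_NORMAL = NormalDist()

def required_sample_size(effect_size, alpha=0.05, power=0.80):
    """n = ((z_{1-alpha/2} + z_{1-beta}) / ES)^2, rounded up."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STD_NORMAL.inv_cdf(power)            # power = 1 - beta
    return ceil(((z_alpha + z_beta) / effect_size) ** 2)

def achieved_power(effect_size, n, alpha=0.05):
    """Approximate power for a planned sample size."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(effect_size * n ** 0.5 - z_alpha)

print(required_sample_size(0.5))                  # 32
print(achieved_power(0.5, 20) < 0.8)              # True: 20 observations are too few
print(achieved_power(0.5, 32) >= 0.8)             # True
```

Other test types need their own formulas or a statistics package, but the logic of fixing three components and solving for the fourth is the same.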

### Steps in Hypothesis Tests
A hypothesis test consists of five steps:

1. State the hypotheses. 

State the null and alternative hypotheses. These two hypotheses need to be mutually exclusive, so if one is true then the other must be false.

2. Determine a significance level to use for the hypothesis.

Decide on a significance level. Common choices are .01, .05, and .1. 

3. Find the test statistic.

Find the test statistic and the corresponding p-value. Often we are analyzing a population mean or proportion and the general formula to find the test statistic is: (sample statistic – population parameter) / (standard deviation of statistic)

4. Reject or fail to reject the null hypothesis.

Using the test statistic or the p-value, determine if you can reject or fail to reject the null hypothesis based on the significance level/cutoff level provided by the business.

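The p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one in the sample; the smaller it is, the stronger the evidence against the null hypothesis. If the p-value is less than the significance level, we reject the null hypothesis. Otherwise we fail to reject the null hypothesis.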

5. Interpret the results. 

Interpret the results of the hypothesis test in the context of the question being asked. 

### Choosing a statistical test

There are broadly two types of statistical tests:

1.  Parametric test : 

- Parametric tests generally have higher power, and thus require less data to reach a strong conclusion than nonparametric tests, provided their assumptions hold:
  - the data need to be continuous
  - the data need to be normally distributed (data points must follow a bell-shaped curve)
  - the data also need to have equal variance
- Parametric tests involve estimation of the key parameters of the population/distribution (e.g., the mean or difference in means) from the sample data. 

2.  Non-parametric test : 
- If the data do not meet the criteria for a parametric test (normally distributed, equal variance, and continuous), it must be analyzed with a nonparametric test.
- In nonparametric tests, the hypotheses are not about population parameters (e.g., $\mu=50$ or $\mu_{1}=\mu_{2}$). Instead, the null hypothesis is more general.
- The cost of fewer assumptions is that nonparametric tests are generally less powerful than their parametric counterparts (i.e., when the alternative is true, they may be less likely to reject H0).

For example, when comparing two independent groups in terms of a continuous outcome, the null hypothesis in a parametric test is H0: μ1 =μ2. In a nonparametric test the null hypothesis is that the two populations are equal, often this is interpreted as the two populations are equal in terms of their central tendency.

<figure><center>
<img src="https://dzchilds.github.io/stats-for-bio/images/stats_key.svg"/><figcaption>source: dzchilds.github.io</figcaption>
</center></figure>


## Parametric Tests

Assumptions: 
- Continuous: The data are measured on a continuous (interval or ratio) scale.
- Normality: Data have a normal distribution (or at least symmetric/bell shaped)
- Homogeneity of variances: Data from multiple groups have the same variance
- Independence: Data are independent (randomly selected)

<img src="https://miro.medium.com/max/1050/1*XLhQMDdKW-3sd8a8HSFvgA.png" width="450"/>

When to use z-test or t-test:
1. Population standard deviation is known or Sample size > 30  
- One sample z-test

2. Population standard deviation is unknown and sample size < 30
- One sample t-test
- Two sample t-test
 - Unpaired t-test (independent samples)
    - Equal population variance : Student t-test
    - Unequal population variance : Welch t-test
 - Paired t-test (dependent samples)
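
The decision list above can be written as a small helper. This is only a sketch of the rule of thumb as stated here: the function name and its parameters are hypothetical, and treating a sample of exactly 30 as "small" is an assumption, since the list does not cover that case.

```python
def choose_mean_test(sigma_known, n, two_samples=False, paired=False, equal_variance=True):
    """Return the name of the test suggested by the z-test / t-test decision list."""
    if sigma_known or n > 30:
        return "z-test"
    # population standard deviation unknown and small sample: use a t-test
    if not two_samples:
        return "one sample t-test"
    if paired:
        return "paired t-test"
    return "Student t-test" if equal_variance else "Welch t-test"

print(choose_mean_test(sigma_known=True, n=12))
print(choose_mean_test(sigma_known=False, n=12))
print(choose_mean_test(sigma_known=False, n=12, two_samples=True, paired=True))
print(choose_mean_test(sigma_known=False, n=12, two_samples=True, equal_variance=False))
```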

Confidence Interval for test statistic:
>$\text{Point Estimate} \pm \text{(error margin)}$ where \
$\text{error margin} = \text{(critical value)} * \text{(standard error)}$










### Table of test statistic 

|Type of test|When to use|Sample size|Test statistic|Standard error|Degree of freedom|Confidence Interval|Notes or Comments|
|---|---|---|---|---|---|---|---|
|One sample z-test|Tests if average of a single sample is equal to <br/> target hypothesized mean|n|$z = \frac{\bar{x}-\mu_{0}}{s.e.}$|$\sigma/\sqrt{n}$|-|$\bar{x} \pm z*s.e.$|hypothesized population mean $\mu_{0}$,<br/> Degree of freedom not defined|
|One proportion z-test|Tests if proportion of a single sample is equal to <br> target hypothesized proportion|n|$z = \frac{(p-p_{0})}{s.e.}$|$\sqrt{\frac{p_{0}(1-p_{0})}{n}}$|-|$p \pm z*s.e.$|min( $np_{0}, n(1-p_{0}))>5$, <br> sample proportion = $p$, <br> hypothesized mean count = $np_{0}$, <br>hypothesized variance of count = $np_{0}(1-p_{0})$|
|Two proportion z-test|Tests if difference between two proportions is equal to <br> zero (no difference)|$n_{1},n_{2}$|$z = \frac{(p_{1}-p_{2})}{s.e.}$|$\sqrt{p_{pooled}(1-p_{pooled})(\frac{1}{n_{1}} + \frac{1}{n_{2}})}$|-|$(p_{1}-p_{2}) \pm z*s.e$|min( $n_{1}p_{1}, n_{1}(1-p_{1}), n_{2}p_{2}, n_{2}(1-p_{2}))>5$, <br> sample proportions = $p_{1}, p_{2}$, <br> $p_{pooled} = \frac{p_{1}n_{1}+p_{2}n_{2}}{n_{1}+n_{2}}$|
|One sample t-test|Tests if average of a single sample is equal to <br/> target hypothesized mean|n|$t = \frac{\bar{x}-\mu_{0}}{s.e.}$|$s/\sqrt{n}$|$n-1$|$\bar{x} \pm t*s.e.$|sample mean = $\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_{i}$, <br/> sample variance = $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^2$ |
|Paired t-test|Tests if average of the differences between paired or<br/>  dependent samples is equal to zero (no difference)|n (number of pairs)|$t = \frac{\bar{d}}{s.e.}$|$s_{d}/\sqrt{n}$|$n-1$|$\bar{d} \pm t*s.e.$|difference values, $d_{i}=x_{i}-y_{i}, \bar{d} = \frac{1}{n}\sum_{i=1}^{n}d_{i}$, <br/> $s_{d}^2=\frac{1}{n-1}\sum_{i=1}^{n}(d_{i}-\bar{d})^2$|
|Student's t-test <br/> (equal population variance)|Tests if difference between the average of <br/> two independent samples is equal to target value|$n_{1},n_{2}$|$t = \frac{(\bar{x}_{1}-\bar{x}_{2})-(\mu_{1}-\mu_{2})}{s.e.}$|$\sqrt{s_{pooled}^2 * (\frac{1}{n_{1}}+\frac{1}{n_{2}})}$|$(n_{1}+n_{2}-2)$|$(\bar{x}_{1}-\bar{x}_{2}) \pm t*s.e.$|pooled variance = $s_{pooled}^2 = \frac{(n_{1}-1)s_{1}^2+(n_{2}-1)s_{2}^2}{(n_{1}+n_{2}-2)}$, <br> approx equal population variance, $\sigma_{1}^2 = \sigma_{2}^2$|
|Welch t-test <br/> (unequal population variance)|Tests if difference between the average of <br/> two independent samples is equal to target value|$n_{1},n_{2}$|$t = \frac{(\bar{x}_{1}-\bar{x}_{2})-(\mu_{1}-\mu_{2})}{s.e.}$|$\sqrt{(\frac{s_{1}^2}{n_{1}}+\frac{s_{2}^2}{n_{2}})}$|$\frac{(\frac{s_{1}^2}{n_{1}}+\frac{s_{2}^2}{n_{2}})^2}{\frac{(s_{1}^2/n_{1})^2}{n_{1}-1}+\frac{(s_{2}^2/n_{2})^2}{n_{2}-1}}$|$(\bar{x}_{1}-\bar{x}_{2}) \pm t*s.e.$|unequal population variance, $\sigma_{1}^2 \neq \sigma_{2}^2$|
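
To make the last two rows concrete, the following sketch computes the Student and Welch statistics and their degrees of freedom under the null hypothesis of no difference. The function names and sample values are hypothetical, and p-values are left out because they need the t distribution, which the Python standard library does not provide.

```python
from statistics import mean, variance

def student_t(x1, x2):
    """Return (t, df) for two independent samples assuming equal population variances."""
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    se = (pooled_var * (1 / n1 + 1 / n2)) ** 0.5
    return (mean(x1) - mean(x2)) / se, n1 + n2 - 2

def welch_t(x1, x2):
    """Return (t, df) without assuming equal variances (Welch-Satterthwaite df)."""
    n1, n2 = len(x1), len(x2)
    v1, v2 = variance(x1) / n1, variance(x2) / n2   # squared standard errors per group
    se = (v1 + v2) ** 0.5
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return (mean(x1) - mean(x2)) / se, df

group_a = [23, 25, 28, 30, 32]   # hypothetical measurements
group_b = [20, 22, 24, 26, 28]   # hypothetical measurements

print(student_t(group_a, group_b))
print(welch_t(group_a, group_b))
```

With equal group sizes the two t-statistics coincide; only the degrees of freedom differ, and Welch's are never larger than $n_{1}+n_{2}-2$.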




### One sample z-test
> Use to compare true mean of a population to a target value or a reference. 
1. Determine whether the population mean differs from the hypothesized mean that you specify.
2. Calculate a range of values that is likely to include the population mean.
- Two tailed z-test \
H0: The population mean (μ) equals the hypothesized mean (µ0) \
H1: μ ≠ µ0	The population mean (μ) differs from the hypothesized mean (µ0) \
- One tailed z-test \
H0: The population mean (μ) equals the hypothesized mean (µ0) \
H1: μ > µ0	The population mean (μ) is greater than the hypothesized mean (µ0).

> Input
1. sample size=$n$, 
2. hypothesized population mean under H0=$\mu_{0}$, 
3. hypothesized population mean under H1=$\mu_{1}$, 
4. sample mean = $\bar{x}$, 
5. sample variance = $s^2$ 
6. population variance = $\sigma^2$ 
7. For given significance level $\alpha$, the critical value for two tailed = $z_{\alpha/2}$

As per central limit theorem, 
 - the sample mean has a normal distribution, $\bar{x} \sim N(\mu, \sigma/\sqrt{n})$ 
 - or z score follows standard normal distribution,  $z \sim N(0, 1)$ where $z = \frac{\bar{x}-mean}{s.e.} = \frac{\bar{x}-\mu_{0}}{\sigma/\sqrt{n}}$

- test statistic, $z = \frac{\bar{x}-\mu_{0}}{\sigma/\sqrt{n}} \sim N(0, 1)$ under H0, i.e. z-score follows standard normal distribution $N(0, 1)$

- standard error, $s.e. = \sigma/\sqrt{n}$

- confidence interval = $\bar{x} \pm z_{\alpha/2}*s.e.$

- effect size, $ES = \frac{|\mu_{1}-\mu_{0}|}{\sigma}$

- Error, $e = z_{\alpha/2}*s.e.$

- Sample size for a target power $1-\beta$, $n = (\frac{z_{1-\alpha/2} + z_{1-\beta}}{ES})^2$; sample size for a target error $e$, $n = z_{\alpha/2}^2*(\frac{\sigma }{e})^2$

- cohen's $d = \frac{\bar{x}-\mu_{0}}{\sigma}$

- p-value, compared against the significance level (SL) $\alpha$: for an observed statistic $z_{obs}=\frac{\bar{x}-\mu_{0}}{\sigma/\sqrt{n}}$, the lower tail probability is $p(z<z_{obs}) = \Phi(z_{obs}) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{z_{obs}}e^{-\frac{z^2}{2}}dz$, and the two tailed p-value is $2(1-\Phi(|z_{obs}|))$

Decision rule:
>-  Reject H0 if ($z \lt -z_{\alpha/2}$ or $z \gt z_{\alpha/2}$) or (p-value < $\alpha$)
-  Do not reject H0 if ($-z_{\alpha/2} \le z \le z_{\alpha/2}$) or (p-value >= $\alpha$)
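
Putting the pieces of the one sample z-test together, a minimal two-tailed version might look like the sketch below. It assumes the population standard deviation is known; the function name, the returned dictionary keys and the sample values are made up for illustration.

```python
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()

def one_sample_z_test(sample, mu0, sigma, alpha=0.05):
    """Two-tailed one-sample z-test with known population standard deviation."""
    n = len(sample)
    x_bar = mean(sample)
    se = sigma / n ** 0.5                              # standard error
    z = (x_bar - mu0) / se                             # test statistic
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)         # critical value z_{alpha/2}
    p_value = 2 * (1 - STD_NORMAL.cdf(abs(z)))         # two-tailed p-value
    return {
        "z": z,
        "p_value": p_value,
        "ci": (x_bar - z_crit * se, x_bar + z_crit * se),
        "cohens_d": (x_bar - mu0) / sigma,
        "reject_h0": abs(z) > z_crit,
    }

sample = [52, 55, 49, 58, 53, 51, 56, 54, 50]          # hypothetical measurements
result = one_sample_z_test(sample, mu0=50, sigma=4)
for name, value in result.items():
    print(name, value)
```

For this sample the statistic is about 2.33 and the p-value is just under 0.02, so H0 is rejected at the 5% level and the confidence interval excludes the hypothesized mean of 50.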

<figure>
<img src="https://saylordotorg.github.io/text_introductory-statistics/section_12/ecf5f771ca148089665859c88d8679df.jpg" width="500"/>
<figcaption>Source: saylordotorg.github.io</figcaption>
</figure>

Effect Size
>|Cohen's d|Interpretation|
|---|---|
|0 - 0.2|Little or no effect|
|0.2 - 0.5|Small effect size|
|0.5 - 0.8|Medium effect size|
|0.8 or more|Large effect size|



Effect sizes have several advantages over p-values:
> 1. An effect size helps us get a better idea of how large the difference is between two groups or how strong the association is between two groups. A p-value can only tell us whether or not there is some significant difference or some significant association.
2. Unlike p-values, effect sizes can be used to quantitatively compare the results of different studies done in different settings. For this reason, effect sizes are often used in meta-analyses.
3. P-values can be affected by large sample sizes. The larger the sample size, the greater the statistical power of a hypothesis test, which enables it to detect even small effects. This can lead to low p-values, despite small effect sizes that may have no practical significance.


### One proportion z-test (Normal approximation method) - Dichotomous Outcome

Use to estimate a binomial population proportion and to compare the proportion to a target value or a reference value
- Determine whether the population proportion differs from the hypothesized proportion that you specify.
- Calculate a range of values that is likely to include the population proportion.

The normal approximation (CLT) to the binomial distribution of the sample proportion is sufficiently accurate if the following sample size constraint holds.

Sample size constraint:
> min( $np_{0}, n(1-p_{0}))>5$ 

where $p_{0}$ denotes the population proportion under H0, $p_{1}$ is the proportion under H1 and n is the sample size.

Hypotheses:
- H0: $p = p_{0}$, the population proportion is the same as the hypothesized proportion
- H1: $p \neq p_{0}$, the population proportion differs from the hypothesized proportion (two tailed test)

z score, $z = \frac{(p-p_{0})}{s.e.}$, computed with the sample proportion in place of $p$, follows the standard normal distribution under H0 when the sample size constraint holds.

- For given significance level $\alpha$, the critical value for two tailed = $z_{\alpha/2}$


where

- hypothesized mean = $np_{0}$, 

- hypothesized variance = $np_{0}(1-p_{0})$,

- standard error, $s.e. = \sqrt{\frac{p_{0}(1-p_{0})}{n}}$

- effect size, $ES = \frac{p_{1}-p_{0}}{\sqrt{p_{0}(1-p_{0})}}$

- confidence interval, $p \pm z_{\alpha/2}*s.e.$, centred on the sample proportion $p$

- error, $e = z_{\alpha/2}*s.e.$

- sample size for a target power $1-\beta$, $n = (\frac{z_{1-\alpha/2}+z_{1-\beta}}{ES})^2$; sample size for a target error $e$, $n = z_{\alpha/2}^2*\frac{p_{0}(1-p_{0})}{e^2}$

Decision rule:
>-  Reject H0 if ($z \lt -z_{\alpha/2}$ or $z \gt z_{\alpha/2}$) or (p-value < $\alpha$)
-  Do not reject H0 if ($-z_{\alpha/2} \le z \le z_{\alpha/2}$) or (p-value >= $\alpha$)
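
A two-tailed one proportion z-test can be sketched as follows. The function name and the example counts are hypothetical, and the guard at the top assumes the common rule that both $np_{0}$ and $n(1-p_{0})$ should exceed 5 before the normal approximation is trusted.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def one_proportion_z_test(successes, n, p0, alpha=0.05):
    """Return (z, p_value, reject_h0) for H0: p = p0 against H1: p != p0."""
    if min(n * p0, n * (1 - p0)) <= 5:
        raise ValueError("normal approximation may be inaccurate for this n and p0")
    p_hat = successes / n                              # sample proportion
    se = (p0 * (1 - p0) / n) ** 0.5                    # standard error under H0
    z = (p_hat - p0) / se
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    p_value = 2 * (1 - STD_NORMAL.cdf(abs(z)))
    return z, p_value, abs(z) > z_crit

# Hypothetical data: 60 successes in 100 trials, tested against p0 = 0.5.
z, p_value, reject = one_proportion_z_test(60, 100, 0.5)
print(round(z, 3), round(p_value, 4), reject)          # 2.0 0.0455 True
```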

<figure>
<img src="https://saylordotorg.github.io/text_introductory-statistics/section_12/01fe19537789cf83979f79f172b522c5.jpg" width="500"/>
<figcaption>Source: saylordotorg.github.io</figcaption>
</figure>
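
The one proportion z-test can be sketched with the Python standard library alone. The counts below (116 successes out of 200, tested against $p_{0} = 0.5$) are hypothetical and only illustrate the formulas; the confidence interval is centered on the sample proportion, which is the usual convention.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical data: 116 successes in n = 200 trials, tested against p0 = 0.5
n, successes, p0, alpha = 200, 116, 0.5, 0.05

# Sample size constraint for the normal approximation
assert min(n * p0, n * (1 - p0)) > 5

p_hat = successes / n                        # sample proportion
se = sqrt(p0 * (1 - p0) / n)                 # standard error under H0
z = (p_hat - p0) / se                        # test statistic

std_normal = NormalDist()                    # standard normal N(0, 1)
z_crit = std_normal.inv_cdf(1 - alpha / 2)   # two-tailed critical value
p_value = 2 * (1 - std_normal.cdf(abs(z)))   # two-tailed p-value
reject_h0 = abs(z) > z_crit

# Confidence interval around the sample proportion
se_hat = sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - z_crit * se_hat, p_hat + z_crit * se_hat)

print(f"z = {z:.3f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
print(f"CI for p: ({ci[0]:.3f}, {ci[1]:.3f})")
```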

### Two proportion z-test (Independent proportions) - Dichotomous Outcome

A two proportion z-test is used to test for a difference between two population proportions.

- H0: $p_{1} = p_{2}$, the two population proportions are equal
- H1: $p_{1} \neq p_{2}$, the two population proportions are not equal (two tailed), or the first proportion is smaller than the second (one tailed), $p_{1} < p_{2}$

For the risk difference, 
> H0: $p_{1} - p_{2} = 0$, versus H1: $p_{1} - p_{2} \neq 0$, which are, by definition, equal to H0: $RD = 0$ versus H1: $RD \neq 0$.

For Risk ratio
> If an investigator wants to focus on the risk ratio, the equivalent hypotheses are H0: $RR = 1$ versus H1: $RR \neq 1$.

for Odds ratio
> If the investigator wants to focus on the odds ratio, the equivalent hypotheses are H0: $OR = 1$ versus H1: $OR \neq 1$.

Sample size constraint:
> $min(n_{1}p_{1}, n_{1}(1-p_{1}), n_{2}p_{2}, n_{2}(1-p_{2})) > 5$

Let $p_{1}$ and $p_{2}$ be the sample proportions, $n_{1}$ and $n_{2}$ the sample sizes, and $p_{pooled}$ the pooled proportion of the two samples combined.

z score, $z = \frac{(p_{1}-p_{2})}{s.e.}$ approximately follows the standard normal distribution under H0 when the sample size constraint holds,

where

- standard error, $s.e. = \sqrt{p_{pooled}(1-p_{pooled})(\frac{1}{n_{1}} + \frac{1}{n_{2}})}$, where $p_{pooled} = \frac{p_{1}n_{1}+p_{2}n_{2}}{n_{1}+n_{2}}$

- For a given significance level $\alpha$, the critical value for two tailed = $z_{\alpha/2}$

- Confidence interval: $(p_{1}-p_{2}) \pm z_{\alpha/2}*s.e.$

- error, $e = z_{\alpha/2}*s.e.$

- sample size, $n_{1} = n_{2} = ( p_{1}(1-p_{1})+p_{2}(1-p_{2}) )*(\frac{z_{\alpha/2}}{e})^2$

Decision rule:
>-  Reject H0 if ($z \lt -z_{\alpha/2}$ or $z \gt z_{\alpha/2}$) or (p-value < SL)
-  Do not reject H0 if ($-z_{\alpha/2} \le z \le z_{\alpha/2}$) or (p-value >= SL)

<figure>
<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2020/03/Screenshot-from-2020-03-03-18-01-53.png" width="500"/>
<figcaption>Source: analyticsvidhya.com</figcaption>
</figure>
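
A minimal sketch of the two proportion z-test with the pooled standard error follows. The counts (45 of 100 versus 30 of 100) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical data: successes and sample sizes in two independent groups
x1, n1 = 45, 100
x2, n2 = 30, 100
alpha = 0.05

p1, p2 = x1 / n1, x2 / n2                    # sample proportions
p_pooled = (x1 + x2) / (n1 + n2)             # pooled proportion under H0
se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se                           # test statistic

std_normal = NormalDist()
z_crit = std_normal.inv_cdf(1 - alpha / 2)   # two-tailed critical value
p_value = 2 * (1 - std_normal.cdf(abs(z)))   # two-tailed p-value
reject_h0 = abs(z) > z_crit

print(f"pooled p = {p_pooled:.3f}, z = {z:.3f}, p-value = {p_value:.4f}")
print("reject H0" if reject_h0 else "do not reject H0")
```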

### One sample t-test

Tests whether the mean of a single population is equal to a target value. 

$t = \frac{\text{Difference between groups (means)}}{\text{Normal variability within group (or standard error SE)}} $

1. Determine whether the population mean($\mu$) differs from the hypothesized mean ($\mu_{0}$). 
  - Two tailed: H0: $\mu = \mu_{0}$, H1: $\mu \neq \mu_{0}$
  - One tailed: H0: $\mu = \mu_{0}$, H1: $\mu < \mu_{0}$ or $\mu > \mu_{0}$

2. Calculate a range of values that is likely to include the population mean.

> Input
1. sample size=$n$, 
2. hypothesized population mean under H0 =$\mu_{0}$, 
3. sample mean = $\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_{i}$, 
4. sample variance = $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^2$ 

- test statistic, $t = \frac{\bar{x}-\mu_{0}}{s.e.} \sim t_{n-1}$

- standard error, $ s.e. = s/\sqrt{n}$

- degree of freedom = $(n-1)$

- confidence interval = $\bar{x} \pm t_{\alpha/2}*s.e.$


the test statistic, t follows a t-distribution of (n-1) degree of freedom.

For given significance level $\alpha$, critical value for two tailed = $t_{\alpha/2}$


Decision rule:
-  Reject H0 if ($t \lt -t_{\alpha/2}$ or $t \gt t_{\alpha/2}$) or (p-value < SL)
-  Do not reject H0 if ($-t_{\alpha/2} \le t \le t_{\alpha/2}$) or (p-value >= SL)

<figure>
<img src="https://saylordotorg.github.io/text_introductory-statistics/section_12/ecf5f771ca148089665859c88d8679df.jpg" width="450"/>
<img src="https://saylordotorg.github.io/text_introductory-statistics/section_12/37a4f201cad923d15a8bfab828bd7640.jpg" width="450"/>
<figcaption>Source: saylordotorg.github.io</figcaption>
</figure>

> Assumptions
- The variable under study should be either an interval or ratio variable.
- The observations in the sample should be independent.
- The variable under study should be approximately normally distributed. You can check this assumption by creating a histogram and visually checking if the distribution has roughly a “bell shape.”
- The variable under study should have no outliers. You can check this assumption by creating a boxplot and visually checking for outliers.
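
The one sample t-test formulas can be worked through in a few lines. The sample below is hypothetical, and the critical value 2.262 is the two tailed $t_{\alpha/2}$ for 9 degrees of freedom at $\alpha = 0.05$ taken from a standard t table (the standard library has no t distribution; SciPy users would typically call `scipy.stats.ttest_1samp` instead).

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample of n = 10 measurements, tested against mu0 = 50
sample = [52.1, 48.3, 51.7, 53.4, 49.8, 50.9, 54.2, 47.6, 52.8, 51.2]
mu0 = 50.0

n = len(sample)
x_bar = mean(sample)            # sample mean
s = stdev(sample)               # sample standard deviation (n - 1 denominator)
se = s / sqrt(n)                # standard error
t_stat = (x_bar - mu0) / se     # test statistic, t with n - 1 degrees of freedom

t_crit = 2.262                  # table value: df = 9, two tailed alpha = 0.05
reject_h0 = abs(t_stat) > t_crit
ci = (x_bar - t_crit * se, x_bar + t_crit * se)

print(f"mean = {x_bar:.2f}, s.e. = {se:.3f}, t = {t_stat:.3f}, df = {n - 1}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f}), reject H0: {reject_h0}")
```

Here the interval contains $\mu_{0}$ exactly when H0 is not rejected, which is the usual link between the two tailed test and the confidence interval.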

### Two sample t-test (Paired or Unpaired)

Tests whether the difference between the true means of two populations is equal to a target value (usually zero).

- Determine whether the population means of two groups differ.
- Calculate a range of values that is likely to include the difference between the population means.

1. Paired or Matched
2. Unpaired or Independent
 - Equal population variance : Student's t-test
 - Unequal population variance : Welch's t-test

Hypothesis:

  - Two tailed: H0: $\mu_{1} = \mu_{2}$, H1: $\mu_{1} \neq \mu_{2}$
  - One tailed: H0: $\mu_{1} = \mu_{2}$, H1: $\mu_{1} < \mu_{2}$ or $\mu_{1} > \mu_{2}$

> Input
1. sample size of first sample = $n_{1}$, 
2. sample size of second sample = $n_{2}$, 
3. sample mean of first sample = $\bar{x}_{1}$, 
4. sample mean of second sample = $\bar{x}_{2}$, 

<img src="https://www.statology.org/wp-content/uploads/2020/04/confIntmeans1-1024x341.png"/>


#### Paired Sample t-test
Tests whether the mean of the differences between dependent or paired observations is equal to a target value

> Assumptions
- Paired, Dependent or Matched samples: Each observation in one sample corresponds to a specific observation in the other sample. The two comparison groups are said to be dependent, and the data can arise from a single sample of participants where each participant is measured twice (possibly before and after an intervention) or from two samples that are matched on specific characteristics (e.g., siblings). 
- Equal sample size: Both samples should be of the same size. 
- Normality: The differences between the paired observations should be approximately normally distributed.

If we have paired data (samples), both samples must be of the same size say $n$. Let x be the first sample with sample mean $\bar{x}$ and  population mean $\mu_{1}$ and y be the second sample with sample mean $\bar{y}$  and population mean $\mu_{2}$. Then x1 is paired with y1, x2 is paired with y2 etc. so for i=1,2,…,n, every $x_{i}$ is paired with $y_{i}$. 

Difference scores between the pairs, $d_{i} = x_{i}-y_{i}$ for each observation pairs$(x_{i}, y_{i})$,

Hypotheses:
- Null hypothesis(H0): No difference in the population means i.e. $\mu_{d} = 0$
- Alternate hypothesis(H1): Significant difference in the population means i.e. $\mu_{d} \neq 0$


- test statistic,  $t = \frac{\bar{d}}{s_{d}/\sqrt{n}}$

- mean of paired differences, $\bar{d} = \frac{1}{n}\sum_{1}^{n}(d_{i})$

- variance of paired differences, $s_{d}^2 = \frac{1}{n-1}\sum_{1}^{n}(d_{i}-\bar{d})^2$

- confidence interval = $\bar{d} \pm t_{\alpha/2}*\frac{s_{d}}{\sqrt{n}}$, with $(n-1)$ degrees of freedom


It can be understood as a one sample t-test on the differences $d_{i}$.

NOTE:
For this test to be valid the differences only need to be approximately normally distributed.
Therefore, it would not be advisable to use a paired t-test where there were any extreme
outliers.
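
As a sketch of the paired test, the code below reuses the before/after scores from this document's Wilcoxon signed rank example and treats the differences as a single sample. The critical value 2.365 is the two tailed table value for 7 degrees of freedom at $\alpha = 0.05$; in SciPy the same test is available as `scipy.stats.ttest_rel`.

```python
from math import sqrt
from statistics import mean, stdev

before = [85, 60, 70, 75, 95, 80, 85, 80]
after = [70, 70, 65, 80, 75, 70, 75, 85]

# Difference score for every pair
d = [x - y for x, y in zip(before, after)]
n = len(d)

d_bar = mean(d)                 # mean of paired differences
s_d = stdev(d)                  # standard deviation of the differences
se = s_d / sqrt(n)              # standard error of the mean difference
t_stat = d_bar / se             # one sample t statistic on the differences

t_crit = 2.365                  # table value: df = 7, two tailed alpha = 0.05
reject_h0 = abs(t_stat) > t_crit
ci = (d_bar - t_crit * se, d_bar + t_crit * se)

print(f"mean difference = {d_bar}, t = {t_stat:.3f}, df = {n - 1}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f}), reject H0: {reject_h0}")
```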


#### Student's t-test (Independent samples and Equal Population Variance)

This test assumes that both groups of data are sampled from populations that follow a normal distribution and that both populations have the same variance. Independent samples may occur, for instance, when the subjects in condition A are different from the subjects in condition B.

> Assumptions
- Independent or unpaired samples: The observations in one sample should be independent of the observations in the other sample. 
- Normality: The data should be approximately normally distributed.
- Equal variance: The two samples should have approximately the same variance.

test statistic, $ t =\frac{(\bar{x}_{1}-\bar{x}_{2})-(\mu_{1}-\mu_{2})}{s.e.}$ follows t-distribution with df degree of freedom, where 

- standard error, $s.e. = \sqrt{s_{pooled}^2 * (\frac{1}{n_{1}}+\frac{1}{n_{2}})}$

- Pooled sample variance, $s_{pooled}^2 = \frac{(n_{1}-1)s_{1}^2+(n_{2}-1)s_{2}^2}{(n_{1}+n_{2}-2)}$

- degree of freedom, $df = (n_{1}+n_{2}-2)$

- confidence interval = $(\bar{x}_{1} - \bar{x}_{2}) \pm t_{\alpha/2}*s.e.$
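
The pooled variance calculation can be illustrated as follows, testing H0: $\mu_{1}-\mu_{2} = 0$. Both groups are hypothetical, and 2.306 is the two tailed table value for 8 degrees of freedom at $\alpha = 0.05$; with SciPy, `scipy.stats.ttest_ind(..., equal_var=True)` performs the same test.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical independent samples
group_1 = [20, 22, 19, 24, 25]
group_2 = [28, 27, 30, 26, 29]
n1, n2 = len(group_1), len(group_2)

s1_sq, s2_sq = variance(group_1), variance(group_2)   # sample variances

# Pooled variance weights each sample variance by its degrees of freedom
s_pooled_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
se = sqrt(s_pooled_sq * (1 / n1 + 1 / n2))
df = n1 + n2 - 2

t_stat = (mean(group_1) - mean(group_2)) / se   # hypothesized difference is 0

t_crit = 2.306                                  # table value: df = 8, alpha = 0.05
reject_h0 = abs(t_stat) > t_crit

print(f"pooled variance = {s_pooled_sq}, s.e. = {se:.4f}")
print(f"t = {t_stat:.3f}, df = {df}, reject H0: {reject_h0}")
```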








#### Welch’s t-test (Independent samples, Unequal Population Variance) 

This test assumes that both groups of data are sampled from populations that follow a normal distribution, but it does not assume that those two populations have the same variance.

> Assumptions
- Independent or unpaired samples: The observations in one sample should be independent of the observations in the other sample.
- Normality: The data should be approximately normally distributed.
- Variances may differ: The two populations are not assumed to have equal variances (the test remains valid if they happen to be equal).

test statistic,  $t = \frac{(\bar{x}_{1}-\bar{x}_{2})-(\mu_{1}-\mu_{2})}{s.e.}$ ,where
- standard error, $s.e. = \sqrt{(\frac{s_{1}^2}{n_{1}}+\frac{s_{2}^2}{n_{2}})}$
- degree of freedom (Welch-Satterthwaite approximation), $df = \frac{(\frac{s_{1}^2}{n_{1}}+\frac{s_{2}^2}{n_{2}})^2}{\frac{(s_{1}^2/n_{1})^2}{n_{1}-1}+\frac{(s_{2}^2/n_{2})^2}{n_{2}-1}}$

The confidence interval for the difference in means $\mu_{1}-\mu_{2}$ is given by
- confidence interval = $(\bar{x}_{1}-\bar{x}_{2}) \pm t_{\alpha/2}*s.e.$

- error units, $e = t_{\alpha/2}*s.e.$

- sample size, $n_{1} = n_{2} = 2*(\frac{z_{\alpha/2}\sigma}{e})^2$
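
The following sketch computes Welch's statistic and the Welch-Satterthwaite degrees of freedom (with the squared numerator) for two hypothetical samples of unequal size and spread. Because the resulting df is usually not an integer, the comparison below rounds it down and uses the table value 2.447 for 6 degrees of freedom, which is a conservative assumption made for this example; `scipy.stats.ttest_ind(..., equal_var=False)` handles the fractional df directly.

```python
from math import floor, sqrt
from statistics import mean, variance

# Hypothetical independent samples with clearly different spread
group_a = [20, 22, 19, 24, 25]
group_b = [28, 35, 22, 31, 40, 24]
n1, n2 = len(group_a), len(group_b)

v1 = variance(group_a) / n1     # s1^2 / n1
v2 = variance(group_b) / n2     # s2^2 / n2

se = sqrt(v1 + v2)
t_stat = (mean(group_a) - mean(group_b)) / se   # hypothesized difference is 0

# Welch-Satterthwaite approximation of the degrees of freedom
df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

t_crit = 2.447                  # table value for floor(df) = 6, two tailed alpha = 0.05
reject_h0 = abs(t_stat) > t_crit

print(f"t = {t_stat:.3f}, df = {df:.2f} (rounded down to {floor(df)})")
print("reject H0" if reject_h0 else "do not reject H0")
```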




## Non-parametric Tests

- One sample
 - Dichotomous data:
   - Binomial test
 - Categorical data:
   - Chi-square goodness-of-fit test
 - Quantitative data:
   - Sign test for 1 median
   - Kolmogorov-Smirnov test
   - Shapiro-Wilk test
- Two samples
 - Both dichotomous data:
   - McNemar test
 - Both nominal data:
   - Chi-square independence test
   - Fisher’s Exact Test 
 - Both ordinal data:
   - Wilcoxon signed-ranks test
   - Sign test for 2 related medians
   - Mann-Whitney test (mean ranks)
   - Median test for 2+ independent medians

- Three or more samples
 - All dichotomous data:
   - Cochran Q test 
 - All nominal data:
   - Chi-square independence test
 - All ordinal data:
   - Friedman test
   - Kruskal-Wallis test (mean ranks)
   - Median test for 2+ independent medians

### Assigning Ranking

- Ordinal ranking ("1234" ranking): assigns ascending rank to all observations in the ordered list
- Fractional ranking ("1 2.5 2.5 4" ranking): assigns tied observations the mean of their ordinal ranks, e.g. in the table below Rank(7) = (5+6)/2 = 5.5
- Standard competition ranking ("1224" ranking): observations are first ordinally ranked, then tied observations all receive the lowest of their ordinal ranks
- Modified competition ranking ("1334" ranking): observations are first ordinally ranked, then tied observations all receive the highest of their ordinal ranks, so (n-1) ranks are skipped before a group of n tied observations
- Dense ranking ("1223" ranking): assigns the lowest ordinal rank to tied observations and continues ranking without skipping any rank number

||Data|0|2|3|5|7|7|9|10|
|---|---|---|---|---|---|---|---|---|---|
|Ordinal ranking|Rank|1|2|3|4|5|6|7|8|
|Fractional ranking|Rank|1|2|3|4|5.5|5.5|7|8|
|Standard competition ranking|Rank|1|2|3|4|5|5|7|8|
|Modified competition ranking|Rank|1|2|3|4|6|6|7|8|
|Dense ranking|Rank|1|2|3|4|5|5|6|7|
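
The ranking schemes can be compared with a small helper. The function `rank` and its method names are hypothetical names used for this sketch; SciPy offers the same schemes through `scipy.stats.rankdata`.

```python
def rank(values, method="fractional"):
    """Return the rank of each value; `method` decides how ties are handled."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    dense_rank = 0
    start = 0
    while start < len(order):
        # order[start:end] is one run of tied observations
        end = start
        while end < len(order) and values[order[end]] == values[order[start]]:
            end += 1
        dense_rank += 1
        for pos in range(start, end):
            ranks[order[pos]] = {
                "ordinal": pos + 1,                    # ties broken by position
                "fractional": (start + 1 + end) / 2,   # mean of the tied ordinal ranks
                "competition": start + 1,              # lowest tied ordinal rank
                "modified": end,                       # highest tied ordinal rank
                "dense": dense_rank,                   # no gaps after ties
            }[method]
        start = end
    return ranks


data = [0, 2, 3, 5, 7, 7, 9, 10]
for method in ("ordinal", "fractional", "competition", "modified", "dense"):
    print(f"{method:12s}", rank(data, method))
```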

### Mann-Whitney U test or Wilcoxon rank-sum test

A Mann-Whitney U test (sometimes called the Wilcoxon rank-sum test) is used to compare two independent samples
- to test whether the two samples are likely to derive from the same population (i.e., that the two populations have the same distribution),
- when the sample distributions are not normally distributed, and 
- when the sample sizes are small (n<30) 

It is considered to be the nonparametric equivalent to the two-sample independent t-test.

Assumptions:
- Ordinal or Continuous: The variable you’re analyzing is ordinal or continuous. Examples of ordinal variables include Likert items (e.g., a 5-point scale from “strongly disagree” to “strongly agree”). Examples of continuous variables include height (measured in inches), weight (measured in pounds), or exam scores (measured from 0 to 100).
- Independence: All of the observations from both groups are independent of each other.
- Shape: The shapes of the distributions for the two groups are roughly the same.

<figure>
<img src="https://www.statstest.com/wp-content/uploads/2020/02/mann-whitney-u-test.png" width="450"/>
<figcaption>Source: statstest.com</figcaption>
</figure>

Hypotheses:
- H0: The two populations are equal
- H1: The two populations are not equal

If there are two samples (groups) of ordinal data:

- $n_{1}$ = the sample size of sample 1
- $n_{2}$ = the sample size of sample 2

Observations from the two groups are ranked together using the fractional ranking approach, and then the test statistic is calculated as below.

- $R_{1}$ = sum of the ranks in group 1
- $R_{2}$ = sum of the ranks in group 2

- $U_{1} = n_{1}*n_{2}  +  n_{1}*(n_{1}+1)/2 - R_{1}$
- $U_{2} = n_{1}*n_{2}  +  n_{2}*(n_{2}+1)/2 - R_{2}$

Mann Whitney U statistic, $U = min(U_{1}, U_{2})$ 

Always, $(U_{1} + U_{2}) = n_{1}*n_{2}$

To determine the appropriate critical value we need sample sizes (n1, n2) and two-sided level of significance (e.g. 0.05).

Decision rule: 
>- Reject H0 if U <= Critical value or p-value <= Significance level
- Do not reject H0 if U > Critical value or p-value > Significance level

Note:
>For any Mann-Whitney U test, 
- the theoretical range of $U_{1}$ and $U_{2}$ is from 0 (complete separation between groups, H0 most likely false and H1 most likely true) to n1*n2. Since $U = min(U_{1}, U_{2})$ and $U_{1}+U_{2} = n_{1}*n_{2}$, U itself is at most n1*n2/2 (H0 most likely true, little evidence in support of H1).
- Smaller values of U support the research hypothesis (i.e., we reject H0 if U is small). Larger values of U are consistent with the null hypothesis. 

Example:

|Ranking|1|2|3|4.5|4.5|6|7.5|7.5|9|10|Rank sum|
|---|---|---|---|---|---|---|---|---|---|---|---|
|Placebo|1|2|3|4|||6||||18|
|New Drug|||||4|5||6|7|12|37|

- $R_{1} = 1 + 2 + 3 + 4.5 + 7.5 = 18$
- $R_{2} = 4.5 + 6 + 7.5 + 9 + 10 = 37$
- $U_{1} = 5*5 + 5*6/2 - 18 = 22$
- $U_{2} = 5*5 + 5*6/2 - 37 = 3$
- $U = min(U_{1}, U_{2}) = 3$

Critical value for (n1=n2=5 and SL=0.05) = 2.
Since U = 3 > critical value, the null hypothesis that the two groups come from the same population cannot be rejected.
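
The example can be reproduced step by step as below. The helper `fractional_ranks` is a hypothetical name introduced for this sketch, and the critical value 2 is the one quoted in the example; `scipy.stats.mannwhitneyu` provides a library implementation.

```python
def fractional_ranks(values):
    """Fractional ranking: tied values share the mean of their ordinal ranks."""
    ordered = sorted(values)
    # ordered.index(v) counts the values smaller than v; ties span `count` ranks
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


placebo = [1, 2, 3, 4, 6]
new_drug = [4, 5, 6, 7, 12]
n1, n2 = len(placebo), len(new_drug)

ranks = fractional_ranks(placebo + new_drug)   # rank both groups together
r1, r2 = sum(ranks[:n1]), sum(ranks[n1:])      # rank sum of each group

u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
u = min(u1, u2)

critical_value = 2   # tabulated value for n1 = n2 = 5 and SL = 0.05 (two-sided)
reject_h0 = u <= critical_value

print(f"R1 = {r1}, R2 = {r2}, U1 = {u1}, U2 = {u2}, U = {u}")
print("reject H0" if reject_h0 else "do not reject H0")
```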






### Wilcoxon Signed Rank Test

The Wilcoxon Signed Rank Test is the non-parametric version of the paired t-test to compare outcomes between two matched or paired groups. It is used to test whether or not there is a significant difference between the two paired populations (formally, whether the median of the paired differences is zero). It is used when the differences between the pairs are severely non-normally distributed.

Check for normality:
The easiest way to determine if the differences are non-normally distributed is to create a histogram of the differences and see if they follow a somewhat normal, “bell-shaped” distribution.

Keep in mind that the paired t-test is fairly robust to departures from normality, so the deviation from a normal distribution needs to be pretty severe to justify the use of the Wilcoxon Signed Rank test.
<figure>
<img src="https://www.statstest.com/wp-content/uploads/2020/11/Wilcoxon-Signed-Rank-Test.jpg" width="450"/>
<figcaption>Source: statstest.com</figcaption>
</figure>

Steps to perform test:
- Find the difference and absolute difference for each pair.
- Order the pairs by the absolute differences and assign a rank (signed fractional rank) from the smallest to largest absolute differences. Ignore pairs that have an absolute difference of zero and assign mean ranks when there are ties.
- Find the sum of the positive ranks and the negative ranks.

>|Groups|1|2|3|4|5|6|7|8|
|---|---|---|---|---|---|---|---|---|
|Before treatment|85|60|70|75|95|80|85|80|
|After treatment|70|70|65|80|75|70|75|85|
|Difference|15|-10|5|-5|20|10|10|-5|
|Ordinal Rank|7|4|1|2|8|5|6|3|
|Fractional Rank|7|5|2|2|8|5|5|2|
|Signed Fractional Rank|+7|-5|+2|-2|+8|+5|+5|-2|

number of pairs = n = 8
> Note: the sum of the ranks (ignoring the signs) will always equal n(n+1)/2, here 8(8+1)/2 = 36.

Hypotheses:
- Null hypothesis (H0): The median difference is zero 
- Alternate hypothesis (H1): The median difference is positive

The test statistic for the Wilcoxon Signed Rank Test is W, defined as the smaller of W+ (sum of the positive ranks) and W- (sum of the absolute values of the negative ranks). For the table above, W+ = 27 and W- = 9, so W = 9.

> $W = min(W_{+}, W_{-})$

We find the one-sided critical value for the given sample size (n) and the given level of significance ($\alpha$) from the table of Critical Values of W.

Decision rule:
>- Reject H0 if $W<=W_{\alpha}$ or p-value<= SL
- Do not reject H0 if $W>W_{\alpha}$ or p-value> SL

Inference:
- If the null hypothesis is true, we expect to see similar numbers of lower and higher ranks that are both positive and negative (i.e., W+ and W- would be similar). 
- If the alternate hypothesis is true we expect to see more higher and positive ranks (W+ much larger than W-).
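
The steps can be traced in code using the before/after table above. `fractional_ranks` is a hypothetical helper name; the sketch stops at the statistic W, which would then be compared with a tabulated critical value (or passed to a library routine such as `scipy.stats.wilcoxon` for a p-value).

```python
def fractional_ranks(values):
    """Fractional ranking: tied values share the mean of their ordinal ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


before = [85, 60, 70, 75, 95, 80, 85, 80]
after = [70, 70, 65, 80, 75, 70, 75, 85]

# Differences, ignoring pairs whose difference is zero
diffs = [b - a for b, a in zip(before, after) if b != a]
n = len(diffs)

# Rank the absolute differences, then split the ranks by sign
ranks = fractional_ranks([abs(d) for d in diffs])
w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
w = min(w_plus, w_minus)

print(f"ranks = {ranks}")
print(f"W+ = {w_plus}, W- = {w_minus}, W = {w}, n(n+1)/2 = {n * (n + 1) / 2}")
```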

### Sign Test

The Sign Test is the simplest nonparametric test for matched or paired data. 
- The test uses the signs of the difference scores between two matched (paired) groups to test whether there is a systematic difference between the groups.
- It does not account for the magnitude of those differences.  
- It is used when the differences between the pairs are severely non-normally distributed.

Hypotheses:
- Null hypothesis (H0): The median difference is zero 
- Alternate hypothesis (H1): The median difference is positive

Steps to perform test:
- Find the difference for each pair. Differences of zero carry no sign and need special handling:
 - If there is just one difference score of zero, some investigators drop that observation and reduce the sample size by 1 (i.e., the sample size for the binomial distribution would be n-1).
 - If there is an even number of zeros, we randomly assign them positive or negative signs.
 - If there is an odd number of zeros (>=3), we randomly drop one and reduce the sample size by 1, and then randomly assign the remaining zeros positive or negative signs. 
- Count the number of positive and negative differences

>|Groups|1|2|3|4|5|6|7|8|
|---|---|---|---|---|---|---|---|---|
|Before treatment|85|60|70|75|95|80|85|80|
|After treatment|70|70|65|80|75|70|75|85|
|Difference|15|-10|5|-5|20|10|10|-5|
|Sign|+|-|+|-|+|+|+|-|

n = total number of signs (all positive and negative)

The test statistic for the Sign Test is the number of positive signs or number of negative signs, whichever is smaller.

>Test statistic, $S = min (Count_{+} , Count_{-})$

Under H0 each difference is equally likely to be positive or negative, so the number of positive (or negative) signs follows a binomial distribution with n trials and p = 0.5. Using the binomial distribution formula, we can compute the probability of observing a split of positive and negative signs at least this uneven: 

> P-value calculation
- For a one tailed test, the p-value is the sum of the probabilities of getting at most S signs out of n \
p-value(One tailed) = $P(x \le S) = \sum_{x=0}^{S}\frac{n!}{x!(n-x)!} p^x (1-p)^{(n-x)}$, with $p = 0.5$
- For a two tailed test, p-value(Two tailed) = $2*P(x \le S)$ = 2*p-value(One tailed)

For a one tailed significance level of 0.05, the critical value k is the largest count for which $P(x \le k) \le 0.05$, i.e. we should have at most k positive or negative signs out of the n.

Decision rule:
>- Reject H0: if the smaller of the number of positive or negative signs is less than or equal to that critical value (S<=k), then we reject H0 in favor of H1 
- Do not reject H0: if the smaller of the number of positive or negative signs is greater than the critical value(S>k), then we do not reject H0. 
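
For the table above, the binomial p-value can be computed directly with `math.comb`. This is a minimal sketch assuming no zero differences need special handling, which holds for this data.

```python
from math import comb

before = [85, 60, 70, 75, 95, 80, 85, 80]
after = [70, 70, 65, 80, 75, 70, 75, 85]
diffs = [b - a for b, a in zip(before, after)]

positives = sum(d > 0 for d in diffs)
negatives = sum(d < 0 for d in diffs)
n = positives + negatives        # zero differences are not counted
s = min(positives, negatives)    # test statistic

# Under H0 the sign count is Binomial(n, 0.5), so each outcome has weight C(n, x) / 2^n
p_one_tailed = sum(comb(n, x) for x in range(s + 1)) / 2 ** n
p_two_tailed = min(1.0, 2 * p_one_tailed)

print(f"+: {positives}, -: {negatives}, S = {s}, n = {n}")
print(f"one tailed p = {p_one_tailed:.4f}, two tailed p = {p_two_tailed:.4f}")
```

With 5 positive and 3 negative signs the one tailed p-value is 93/256, about 0.363, so H0 is not rejected at the 0.05 level.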



### Chi-Square goodness of fit test

A Chi-Square goodness of fit test is used to determine whether or not a categorical variable follows a hypothesized distribution.

- Used when we have a single categorical variable

Hypotheses:
- H0: (null hypothesis) A variable follows a hypothesized distribution.
- H1: (alternative hypothesis) A variable does not follow a hypothesized distribution.

The test statistic follows chi-square distribution of n-1 degrees of freedom (where n is the number of categories).

|Observed frequency|$o_{1}$|$o_{2}$|$o_{3}$|-|$o_{n}$|
|---|---|---|---|---|---|
|Expected frequency|$e_{1}$|$e_{2}$|$e_{3}$|-|$e_{n}$|

Chi squared statistic:
>$\chi^2 = \sum \frac{(O-E)^2}{E} = \sum_{i=1}^{n}\frac{(o_{i}-e_{i})^2}{e_{i}} \sim \chi^2_{n-1}$

where:
- O = observed frequency
- E = expected frequency
- n =  number of categories
- (n-1) = degree of freedom of chi-square distribution
- $\alpha$ = significance level (SL) for right hand tailed test
- right tailed critical value for SL = $\chi^2_{\alpha}$

With $\chi^2$ tests there are no upper, lower or two-tailed variants. If the null hypothesis is true, the observed and expected frequencies will be close in value and the $\chi^2$ statistic will be close to zero. If the null hypothesis is false, then the $\chi^2$ statistic will be large. The rejection region for the $\chi^2$ goodness of fit test is always in the upper (right-hand) tail of the distribution. 

<figure>
<img src="https://saylordotorg.github.io/text_introductory-statistics/section_15/3406a41dcf8b2ad498d3271e90a762c1.jpg" width="500"/>
<figcaption>Source: saylordotorg.github.io</figcaption>
</figure>

For given significance level $\alpha$, right tailed critical value = $\chi^2_{\alpha}$

Decision Rule:
>-  Reject H0 if (statistic >= right tailed critical value) or (p value <= SL)
-  Do not reject H0 if (statistic < right tailed critical value) or (p value > SL)

Example:

|Observed frequency|4|5|7|3|8|5|6|Statistic|
|---|---|---|---|---|---|---|---|---|
|Expected frequency|4|5|6|2|7|8|8||
|$(O-E)^2$|0|0|1|1|1|9|4||
|$(O-E)^2/E$|0|0|1/6|1/2|1/7|9/8|1/2|2.43|

Right tailed critical value for the chi-square distribution with 7-1=6 degrees of freedom at $\alpha = 0.05$: $\chi^2_{0.05}(6) = 12.59$ 

Since the statistic 2.43 is smaller than the critical value 12.59, the null hypothesis that the variable follows the hypothesized distribution cannot be rejected.
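
The example statistic can be checked with plain Python. The critical value 12.59 is the one quoted in the example. Note that the observed and expected totals in this example differ (38 versus 40), so a library routine such as `scipy.stats.chisquare`, which in recent versions expects matching totals, may refuse this input; the sketch therefore evaluates the formula by hand.

```python
observed = [4, 5, 7, 3, 8, 5, 6]
expected = [4, 5, 6, 2, 7, 8, 8]

# Contribution of every category: (O - E)^2 / E
terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_sq = sum(terms)
df = len(observed) - 1

critical_value = 12.59           # chi-square table: df = 6, alpha = 0.05
reject_h0 = chi_sq >= critical_value

print("terms:", [round(t, 3) for t in terms])
print(f"chi-square = {chi_sq:.2f}, df = {df}, reject H0: {reject_h0}")
```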


### Chi-Square Test of Independence test - Two or More Independent categorical Samples

A Chi-Square Test of Independence is used to determine whether or not there is a significant association between two categorical (discrete) variables.

- Used when we have two categorical variables, each with two or more levels
- We compare the number of observations in each combination of categories with the number we would expect if the variables were independent of each other

In the table below, the grouping variable is shown in the rows of the table; r denotes the number of independent groups. 

The outcome variable is shown in the columns of the table; c denotes the number of response options in the outcome variable. 

Each combination of a row (group) and column (response) is called a cell of the table. The table has r*c cells and is sometimes called an r x c ("r by c") table.

- Variable 1 levels: col-1, col-2, col-3, col-4
- Variable 2 levels: row-1, row-2, row-3, row-4, row-5 

**Contingency table with observed frequencies**
>|Index|col-1|col-2|col-3|col-4|row-total|
|---|---|---|---|---|---|
|row-1|$o_{11}$|$o_{12}$|$o_{13}$|$o_{14}$|$RT_{1}$|
|row-2|$o_{21}$|$o_{22}$|$o_{23}$|$o_{24}$|$RT_{2}$|
|row-3|$o_{31}$|$o_{32}$|$o_{33}$|$o_{34}$|$RT_{3}$|
|row-4|$o_{41}$|$o_{42}$|$o_{43}$|$o_{44}$|$RT_{4}$|
|row-5|$o_{51}$|$o_{52}$|$o_{53}$|$o_{54}$|$RT_{5}$|
|col-total|$CT_{1}$|$CT_{2}$|$CT_{3}$|$CT_{4}$|$GT$|

**Contingency table with expected frequencies**
>|Index|col-1|col-2|col-3|col-4|
|---|---|---|---|---|
|row-1|$e_{11}$|$e_{12}$|$e_{13}$|$e_{14}$|
|row-2|$e_{21}$|$e_{22}$|$e_{23}$|$e_{24}$|
|row-3|$e_{31}$|$e_{32}$|$e_{33}$|$e_{34}$|
|row-4|$e_{41}$|$e_{42}$|$e_{43}$|$e_{44}$|
|row-5|$e_{51}$|$e_{52}$|$e_{53}$|$e_{54}$|

Hypotheses:
- Null hypothesis (H0): The two factors (categorical variables) are independent
- Alternative hypothesis (H1): The two factors ( categorical variables) are not independent

If variable 1 and variable 2 are assumed independent, the probability of a cell is equal to the product of the row (group) probability and the column (response) probability.

> Two events, A and B, are independent, if P(A and B) = P(A) P(B). Therefore, \
$P(cell_{ij}) = P(row_{i})*P(col_{j}) = \frac{RT_{i}}{GT}*\frac{CT_{j}}{GT}$ 

> Expected frequency of $cell_{ij}$, 
$E = P(cell_{ij}) * GT = \frac{RT_{i} * CT_{j}}{GT}$

Or simply, Expected Cell Frequency, $E = \frac{\text{Row Total} * \text{Column Total}}{\text{Grand Total}}$

- number of rows = r
- number of columns = c
- $RT_{i}$ = row total for i-th row (marginal row frequency)
- $CT_{j}$ = column total for j-th column (marginal column frequency)
- $GT$ =  Grand total of all observations (total sample size)
- chi-square statistic, $\chi^2 = \sum \frac{(O-E)^2}{E} = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(o_{ij}-e_{ij})^2}{e_{ij}} \sim \chi^2_{df}$ where:
- O = observed frequency
- E = expected frequency
- chi-square degree of freedom, $df = (r-1)*(c-1)$
- $\alpha$ = significance level (SL) for right hand tailed test
- right tailed critical value for SL = $\chi^2_{\alpha}$

Decision Rule:
>-  Reject H0 if (statistic >= right tailed critical value) or (p value <= SL)
-  Do not reject H0 if (statistic < right tailed critical value) or (p value > SL ) 

<figure>
<img src="https://saylordotorg.github.io/text_introductory-statistics/section_15/3406a41dcf8b2ad498d3271e90a762c1.jpg" width="500"/>
<figcaption>Source: saylordotorg.github.io</figcaption>
</figure>


If the p-value that corresponds to the test statistic is less than the chosen significance level, then we can reject the null hypothesis; this indicates that there is some association between the two variables.
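
A minimal sketch of the expected-frequency and statistic calculation follows. The 2 x 3 table is hypothetical, and 5.991 is the chi-square table value for 2 degrees of freedom at $\alpha = 0.05$; `scipy.stats.chi2_contingency` wraps the same computation.

```python
# Hypothetical observed counts: 2 groups (rows) by 3 response options (columns)
observed = [
    [20, 30, 50],
    [30, 30, 40],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency of each cell = row total * column total / grand total
expected = [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]

chi_sq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

critical_value = 5.991           # chi-square table: df = 2, alpha = 0.05
reject_h0 = chi_sq >= critical_value

print("expected:", expected)
print(f"chi-square = {chi_sq:.3f}, df = {df}, reject H0: {reject_h0}")
```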

### Fisher’s Exact Test

Fisher’s Exact Test is used to determine whether or not there is a significant association between two categorical variables. 

- The Chi-square test of independence is not accurate when we have a small number of observations, for example when
 - the expected frequency is less than 5 in more than 20% of cells,
 - or, more generally, one or more of the cell counts in a 2×2 table is less than 5.
- In these cases we can substitute Fisher’s exact test in a 2 x 2 design.

Hypotheses:
- Null hypothesis (H0): The two variables are independent.
- Alternative hypothesis (H1): The two variables are not independent.

Suppose we have the following 2×2 table:

||Group 1|Group 2|Row Total|
|---|---|---|---|
|Category 1|a|b|a+b|
|Category 2|c|d|c+d|
|Column Total|a+c|b+d|a+b+c+d = n|


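The probability of observing exactly this table, given the row and column totals, is:

$p = \frac{(a+b)!(c+d)!(a+c)!(b+d)!}{(a!b!c!d!n!)}$

This is the probability mass function of the hypergeometric distribution with the following parameters:

- population size = n
- population "successes" = a+b
- sample size = a + c
- sample "successes" = a

The one-tailed p value for Fisher’s Exact Test is the sum of these probabilities over the observed table and every table with the same totals that is more extreme in the direction of the alternative, i.e. a tail probability (CDF or its complement) of the same hypergeometric distribution.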

The two-tailed p value for Fisher’s Exact Test is less straightforward to calculate and can’t be found by simply multiplying the one-tailed p value by two.
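
The table probability and a one-tailed p value can be sketched with the standard library. The counts are hypothetical and the function name `table_probability` is introduced only for this example; the one-tailed direction chosen here is "cell a at least as large as observed". `scipy.stats.fisher_exact` provides a library implementation, including the two-tailed case.

```python
from math import comb, factorial

# Hypothetical 2x2 counts, laid out as in the table above
a, b, c, d = 8, 2, 1, 5
n = a + b + c + d


def table_probability(a, b, c, d):
    """Probability of one 2x2 table given its row and column totals."""
    n = a + b + c + d
    numerator = (factorial(a + b) * factorial(c + d)
                 * factorial(a + c) * factorial(b + d))
    denominator = (factorial(a) * factorial(b) * factorial(c)
                   * factorial(d) * factorial(n))
    return numerator / denominator


p_table = table_probability(a, b, c, d)

# The same value written as a hypergeometric probability mass function
p_hypergeom = comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

# One-tailed p value: all tables with the same totals whose top-left cell is >= a
row_1, col_1 = a + b, a + c
p_one_tailed = sum(
    table_probability(x, row_1 - x, col_1 - x, n - row_1 - col_1 + x)
    for x in range(a, min(row_1, col_1) + 1)
)

print(f"P(observed table) = {p_table:.5f} (hypergeometric: {p_hypergeom:.5f})")
print(f"one-tailed p value = {p_one_tailed:.5f}")
```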

### Chi square test - Effect size

#### Phi (φ)

**How to calculate:** Phi is calculated as $\phi = \sqrt{\chi^2 / n}$

where:

- $\chi^2$ is the Chi-Square test statistic
- n = total number of observations

**When to use:** It’s appropriate to calculate φ only when you’re working with a 2 x 2 contingency table (i.e. a table with exactly two rows and two columns).

**How to interpret:** A value of φ  = 0.1 is considered to be a small effect, 0.3 a medium effect, and 0.5 a large effect.

#### Cramer’s V (V)

**How to calculate:** Cramer’s V is calculated as $V = \sqrt{\chi^2 / (n*df)}$

where:

- $\chi^2$ is the Chi-Square test statistic
- n = total number of observations
- df = min(#rows-1, #columns-1)

**When to use:** It’s appropriate to calculate V when you’re working with any table larger than a 2 x 2 contingency table.

**How to interpret:** The following table shows how to interpret V based on the degrees of freedom:

|Degrees of freedom|Small|Medium|Large|
|---|---|---|---|
|1|0.10|0.30|0.50|
|2|0.07|0.21|0.35|
|3|0.06|0.17|0.29|
|4|0.05|0.15|0.25|
|5|0.04|0.13|0.22|

#### Odds Ratio (OR)

**How to calculate:** Given the following 2 x 2 table:

|Effect Size|# Successes|# Failures|
|---|---|---|
|Treatment Group|A|B|
|Control Group|C|D|

The odds ratio would be calculated as:

Odds ratio = (AD) / (BC)

**When to use:** It’s appropriate to calculate the odds ratio only when you’re working with a 2 x 2 contingency table. Typically the odds ratio is calculated when you’re interested in studying the odds of success in a treatment group relative to the odds of success in a control group.

**How to interpret:** There is no specific value at which we deem an odds ratio to be a small, medium, or large effect, but the further away the odds ratio is from 1, the higher the likelihood that the treatment has an actual effect.

It’s best to use domain specific expertise to determine if a given odds ratio should be considered small, medium, or large.
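
The three effect sizes can be computed for a hypothetical 2 x 2 table as follows. The chi-square statistic is calculated without a continuity correction, and Cramer's V uses the smaller of (#rows-1) and (#columns-1), which is the standard definition; for a 2 x 2 table it therefore coincides with φ.

```python
from math import sqrt

# Hypothetical 2x2 table: rows = treatment / control, columns = successes / failures
observed = [
    [30, 20],
    [15, 35],
]


def chi_square(table):
    """Pearson chi-square statistic of a contingency table (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return sum(
        (table[i][j] - row_totals[i] * col_totals[j] / total) ** 2
        / (row_totals[i] * col_totals[j] / total)
        for i in range(len(row_totals))
        for j in range(len(col_totals))
    )


n = sum(sum(row) for row in observed)
chi_sq = chi_square(observed)

phi = sqrt(chi_sq / n)
min_dim = min(len(observed) - 1, len(observed[0]) - 1)
cramers_v = sqrt(chi_sq / (n * min_dim))

(A, B), (C, D) = observed
odds_ratio = (A * D) / (B * C)

print(f"chi-square = {chi_sq:.3f}, phi = {phi:.3f}, V = {cramers_v:.3f}")
print(f"odds ratio = {odds_ratio}")
```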



### Friedman Test

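The Friedman Test is a non-parametric alternative to the Repeated Measures ANOVA. It is used to determine whether or not there is a statistically significant difference between three or more groups in which the same subjects show up in each group. Because it works on the ranks of each subject's scores rather than on the raw values, it compares the typical level of the groups without assuming normality.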

**When to use the Friedman Test**

The Friedman Test is commonly used in two situations:

1. Measuring the mean scores of subjects during three or more time points.

For example, you might want to measure the resting heart rate of subjects one month before they start a training program, one month after starting the program, and two months after using the program. You can perform the Friedman Test to see if there is a significant difference in the mean resting heart rate of patients across these three time points.

2. Measuring the mean scores of subjects under three different conditions.

For example, you might have subjects watch three different movies and rate each one based on how much they enjoyed it. Since each subject shows up in each sample, you can perform a Friedman Test to see if there is a significant difference in the mean rating of the three movies.

Hypotheses:

- The null hypothesis (H0): µ1 = µ2 = µ3 (the population locations are all equal, i.e. no condition tends to give higher or lower values than the others)
- The alternative hypothesis (Ha): at least one population differs from the rest
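
As an illustration, the sketch below ranks each subject's scores across three time points and evaluates the standard Friedman statistic $Q = \frac{12}{nk(k+1)}\sum_{j=1}^{k}R_{j}^2 - 3n(k+1)$, which is not derived in this document and is stated here as an assumption of the example. The heart-rate values are hypothetical and contain no ties within a subject; 5.991 is the chi-square table value for k-1 = 2 degrees of freedom at $\alpha = 0.05$, an approximation that is rough for so few subjects. SciPy exposes the test as `scipy.stats.friedmanchisquare`.

```python
# Hypothetical resting heart rates of 5 subjects at three time points
before = [72, 80, 65, 90, 78]
month_1 = [70, 76, 66, 85, 74]
month_2 = [68, 75, 63, 84, 71]

conditions = [before, month_1, month_2]
n, k = len(before), len(conditions)

# Rank the k scores of every subject (1 = smallest) and add them up per condition
rank_sums = [0] * k
for subject_scores in zip(*conditions):
    for j, value in enumerate(subject_scores):
        rank_sums[j] += 1 + sum(other < value for other in subject_scores)

q = 12 / (n * k * (k + 1)) * sum(r ** 2 for r in rank_sums) - 3 * n * (k + 1)

critical_value = 5.991           # chi-square table: df = k - 1 = 2, alpha = 0.05
reject_h0 = q >= critical_value

print(f"rank sums = {rank_sums}, Q = {q:.2f}, reject H0: {reject_h0}")
```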



### Kruskal-Wallis test

A Kruskal-Wallis test is used to determine whether or not there is a statistically significant difference between the medians of three or more independent groups. This test is the nonparametric equivalent of the one-way ANOVA and is typically used when the normality assumption is violated.   

- The Kruskal-Wallis test does not assume normality in the data 
- It is much less sensitive to outliers than the one-way ANOVA.

Kruskal-Wallis Test Assumptions
>1. Ordinal or Continuous Response Variable – the response variable should be an ordinal or continuous variable. An example of an ordinal variable is a survey response question measured on a Likert Scale (e.g. a 5-point scale from “strongly disagree” to “strongly agree”) and an example of a continuous variable is weight (e.g. measured in pounds).
2. Independence – the observations in each group need to be independent of each other. Usually a randomized design will take care of this.
3. Distributions have similar shapes – the distributions in each group need to have a similar shape.

Hypotheses:
- The null hypothesis (H0): The k population medians are all equal
- The alternative hypothesis: (Ha): At least one of the median is different from the others.

If there are k groups to be compared, 
>- observations from all groups are considered together 
- and are ranked together using the fractional ranking approach 

Note: the sum of all ranks will always equal N(N+1)/2, where N is the total number of observations

- $n_{j}$ = the sample size for j-th sample
- $R_{j}$ = sum of the ranks in j-th group
- Highest Rank, $N = \sum_{j=1}^{k} n_{j}$ = total number of observations of all groups

test statistic, $H = \frac{12}{N(N+1)}\sum_{j=1}^{k}\frac{R_{j}^2}{n_{j}}-3(N+1)$

Look up the Kruskal-Wallis Test Critical Values table for the critical value corresponding to the sample sizes (n1, n2, n3, ..., nk) and the given level of significance.
> If there are 3 or more comparison groups and 5 or more observations in each of the comparison groups, the test statistic H approximately follows a chi-square distribution with df = k-1. In that case the critical value can also be looked up in a chi-square critical values table.

Decision Rule:
- Reject H0 if (H >= critical value) or p-value <= SL: at least one group median differs from the others
- Do not reject H0 if (H < critical value) or p-value > SL: there is not enough evidence of a difference in the medians of the groups
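
The calculation above can be sketched in Python as follows. The three groups are hypothetical values chosen to contain no ties (an assumption made so that the hand-computed H matches `scipy.stats.kruskal`, which additionally applies a tie correction), and α = 0.05 is an assumed significance level.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for k = 3 independent groups (no tied values).
groups = [
    [27, 2, 4, 18, 7, 9],
    [20, 8, 14, 36, 21, 22],
    [34, 31, 3, 23, 30, 6],
]
sizes = [len(g) for g in groups]   # n_j
N = sum(sizes)                     # total number of observations

# Rank all observations together; rankdata gives tied values their average rank.
ranks = stats.rankdata(np.concatenate(groups))
assert np.isclose(ranks.sum(), N * (N + 1) / 2)  # the ranks always sum to N(N+1)/2

# Split the pooled ranks back into their groups and sum them to get R_j.
rank_sums = [r.sum() for r in np.split(ranks, np.cumsum(sizes)[:-1])]

# H = 12 / (N(N+1)) * sum(R_j^2 / n_j) - 3(N+1)
H = 12 / (N * (N + 1)) * sum(R**2 / n for R, n in zip(rank_sums, sizes)) - 3 * (N + 1)

# Chi-square approximation with df = k - 1.
alpha = 0.05
df = len(groups) - 1
critical_value = stats.chi2.ppf(1 - alpha, df)
p_value = stats.chi2.sf(H, df)

# Cross-check against SciPy's implementation.
scipy_H, scipy_p = stats.kruskal(*groups)

print(f"H = {H:.4f}, critical value = {critical_value:.4f}, p-value = {p_value:.4f}")
print("Reject H0" if H >= critical_value else "Do not reject H0")
```

With tied observations the hand-computed H would be slightly smaller than SciPy's value, because SciPy divides H by a tie-correction factor.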




### Dunn’s Test (Post Hoc test)

A Kruskal-Wallis test is used to determine whether or not there is a statistically significant difference between the medians of three or more independent groups. It is considered to be the non-parametric equivalent of the One-Way ANOVA.

If the results of a Kruskal-Wallis test are statistically significant, then it’s appropriate to conduct Dunn’s Test to determine exactly which groups are different.

Dunn’s Test performs pairwise comparisons between each independent group and tells you which groups are statistically significantly different at some level of α.

For example, suppose a researcher wants to know whether three different drugs have different effects on back pain. He recruits 30 subjects for the study and randomly assigns them to use Drug A, Drug B, or Drug C for one month and then measures their back pain at the end of the month.

The researcher can perform a Kruskal-Wallis test to determine if the median back pain is equal among the three drugs. If the p-value of the Kruskal-Wallis test is below the chosen significance level, it can be said that at least one drug produces a different effect from the others.

Following this, the researcher could then perform Dunn’s Test to determine which pairs of drugs differ significantly from each other.

Dunn’s Test: The Formula

You will likely never have to perform Dunn’s Test by hand since it can be performed using statistical software (like R, Python, Stata, SPSS, etc.), but the formula to calculate the z-test statistic for the difference between two groups is:

$z_{i} = \frac{y_{i}}{\sigma_{i}}$

where i is one of the 1 to m comparisons, $y_{i} = W_{A} - W_{B}$ (where $W_{A}$ is the mean of the ranks in group A, and likewise $W_{B}$ for group B) and $\sigma_{i}$ is calculated as:

$\sigma_{i} = \sqrt{\left(\frac{N(N+1)}{12} - \frac{\sum_{s=1}^{r}(T_{s}^{3} - T_{s})}{12(N-1)}\right)\left(\frac{1}{n_{A}} + \frac{1}{n_{B}}\right)}$

where N is the total number of observations across all groups, r is the number of distinct tied values, $T_{s}$ is the number of observations tied at the sth tied value, and $n_{A}$, $n_{B}$ are the sizes of the two groups being compared.
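
A minimal sketch of this z statistic for a single pair of groups is shown below. The helper name `dunn_z` and the pain scores are hypothetical, the standard error is computed as the tie-corrected variance term multiplied by (1/nA + 1/nB), and the two-sided p-value it returns is unadjusted for multiple comparisons.

```python
import numpy as np
from scipy import stats

def dunn_z(groups, a, b):
    """Return (z, unadjusted two-sided p) comparing groups[a] with groups[b]."""
    pooled = np.concatenate(groups)
    sizes = [len(g) for g in groups]
    N = len(pooled)

    # Rank all observations together, then take the mean rank of each group.
    ranks = stats.rankdata(pooled)
    mean_ranks = [r.mean() for r in np.split(ranks, np.cumsum(sizes)[:-1])]

    # Tie correction: T_s is the count of observations at each distinct value.
    # Untied values have T_s = 1 and contribute 1**3 - 1 = 0.
    _, tie_counts = np.unique(pooled, return_counts=True)
    tie_term = np.sum(tie_counts**3 - tie_counts) / (12 * (N - 1))

    sigma = np.sqrt((N * (N + 1) / 12 - tie_term) * (1 / sizes[a] + 1 / sizes[b]))
    z = (mean_ranks[a] - mean_ranks[b]) / sigma
    p = 2 * stats.norm.sf(abs(z))
    return float(z), float(p)

# Hypothetical back pain scores for Drug A, Drug B and Drug C.
pain_scores = [
    [3, 4, 4, 5, 6, 7],
    [5, 6, 6, 7, 8, 9],
    [7, 8, 9, 9, 10, 10],
]
for a, b, label in [(0, 1, "A vs B"), (0, 2, "A vs C"), (1, 2, "B vs C")]:
    z, p = dunn_z(pain_scores, a, b)
    print(f"{label}: z = {z:.3f}, unadjusted p = {p:.4f}")
```

The ranks come from pooling all groups, not just the two being compared, which is what distinguishes Dunn’s Test from running separate pairwise rank-sum tests.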

How to Control the Family-wise Error Rate

Whenever we make multiple comparisons at once, it’s important that we control the family-wise error rate. One way to do so is to adjust the p-values that result from the multiple comparisons.

There are several ways to adjust the p-values, but the two most common adjustment methods are:

1. The Bonferroni Adjustment

   Adjusted p-value = $p \cdot m$ (capped at 1)

2. The Sidak Adjustment

   Adjusted p-value = $1 - (1-p)^{m}$

where:

- p: The original p-value
- m: The total number of comparisons being made

By using one of these p-value adjustments, we keep the probability of committing at least one type I error among the set of multiple comparisons near the chosen α, instead of letting it grow with the number of comparisons.
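
Both adjustments are one-liners in NumPy. The function names `bonferroni` and `sidak` and the raw p-values below are hypothetical, chosen only to show the arithmetic for m = 3 comparisons.

```python
import numpy as np

def bonferroni(p_values):
    """Bonferroni adjustment: p * m, capped at 1."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * p.size, 1.0)

def sidak(p_values):
    """Sidak adjustment: 1 - (1 - p)^m."""
    p = np.asarray(p_values, dtype=float)
    return 1.0 - (1.0 - p) ** p.size

# Hypothetical unadjusted p-values from m = 3 pairwise comparisons.
raw_p = [0.01, 0.04, 0.20]
bonferroni_p = bonferroni(raw_p)
sidak_p = sidak(raw_p)

print("Bonferroni:", np.round(bonferroni_p, 4))
print("Sidak:     ", np.round(sidak_p, 4))
```

The Sidak values are never larger than the Bonferroni values, so Bonferroni is the slightly more conservative of the two.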

### Dunnett’s test (Post Hoc test)

An ANOVA (Analysis of Variance) is used to determine whether or not there is a statistically significant difference between the means of three or more independent groups. 

If the p-value from the ANOVA is less than some chosen significance level, we can reject the null hypothesis and conclude that we have sufficient evidence to say that at least one of the means of the groups is different from the others.

However, this doesn’t tell us which groups are different from each other. It simply tells us that not all of the group means are equal. In order to find out exactly which groups are different from each other, we must conduct a post-hoc test.

If one of the groups in the study is considered the control group, then we should use Dunnett’s test as the post-hoc test following the ANOVA.

Dunnett’s Test: Definition

We can use the following two steps to perform Dunnett’s test:

Step 1: Find Dunnett’s critical value.

First, we must find Dunnett’s critical value. This is calculated as:

Dunnett’s Critical value: $t_{d}\sqrt{\frac{2MS_{w}}{n}}$

where:

- $t_{d}$: The value found in Dunnett’s Table for a given alpha level, number of groups, and group sample sizes.
- $MS_{w}$: The Mean Squares of the “Within Group” in the ANOVA output table
- n: The size of the group samples

Step 2: Compare the differences in group means to Dunnett’s critical value.

Next, we calculate the absolute difference between the mean of each group and the mean of the control group. If the difference exceeds Dunnett’s critical value, then that difference is said to be statistically significant.

The following example shows how to perform Dunnett’s test in practice.

Dunnett’s Test: Example

Suppose a teacher wants to know whether or not two new studying techniques have the potential to increase exam scores for her students. To test this, she randomly splits her class of 30 students into the following three groups:

- Control Group: 10 students
- New Study Technique 1: 10 students
- New Study Technique 2: 10 students

After one week of using their assigned study technique, each student takes the same exam. The results are as follows:

- Mean exam score of control group: 81.6
- Mean exam score of new study technique 1 group: 85.8
- Mean exam score of new study technique 2 group: 87.7
- Mean Squares of the “Within Group” in the ANOVA output table: 23.3

Using this information, we can perform Dunnett’s test to determine if either of the two new study techniques produces a significantly different mean exam score compared to the control group.

Step 1: Find Dunnett’s critical value.

Using α = .05, group sample size n = 10 and total groups = 3, Dunnett’s table tells us to use a value of 2.57 in the critical value calculation.

[Figure: example of using Dunnett's table for multiple comparisons]

Next, we can plug this number into the formula to find Dunnett’s Critical value:

Dunnett’s Critical value: $t_{d}\sqrt{\frac{2MS_{w}}{n}} = 2.57\sqrt{\frac{2(23.3)}{10}} = 5.548$

Step 2: Compare the differences in group means to Dunnett’s critical value.

The absolute differences between the mean of each study technique and the mean of the control group are as follows:

- Abs. diff between new technique 1 and control: |85.8 – 81.6| = 4.2
- Abs. diff between new technique 2 and control: |87.7 – 81.6| = 6.1

Only the absolute difference between technique 2 and the control group is greater than Dunnett’s critical value of 5.548.

Thus, we can say that the new studying technique #2 produces significantly different exam scores compared to the control group, but the new studying technique #1 does not.
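
The arithmetic of this example can be reproduced with a few lines of Python. The table value 2.57 is taken from the text above rather than computed, and the variable names are illustrative only.

```python
import math

t_d = 2.57        # value read from Dunnett's table (alpha = .05, 3 groups, n = 10)
ms_within = 23.3  # Mean Squares "Within Group" from the ANOVA table
n = 10            # students per group

# Dunnett's critical value: t_d * sqrt(2 * MS_w / n)
critical_value = t_d * math.sqrt(2 * ms_within / n)

control_mean = 81.6
technique_means = {"technique 1": 85.8, "technique 2": 87.7}

# A technique differs significantly from the control when the absolute
# difference in means exceeds the critical value.
significant = {
    name: abs(mean - control_mean) > critical_value
    for name, mean in technique_means.items()
}

print(f"Dunnett's critical value = {critical_value:.3f}")
for name, is_significant in significant.items():
    diff = abs(technique_means[name] - control_mean)
    print(f"{name}: |diff| = {diff:.1f}, significant = {is_significant}")
```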



In [None]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt

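def t_dist_known_sd(group_size):
    """Simulate 50,000 test statistics standardized with the known population sd."""
    t_values_array = []
    sd = 2  # known population standard deviation
    for i in range(50000):
        # Draw one sample from N(5, sd^2); seeding with i + 1 keeps each draw reproducible
        X1 = stats.norm.rvs(loc=5,
                            scale=sd,
                            size=group_size,
                            random_state=(i + 1))
        # Standardize the sample mean using the known sd (a z-statistic)
        t_stat = (np.mean(X1) - 5) / (sd / np.sqrt(len(X1)))
        t_values_array.append(t_stat)
    t_values_array = np.array(t_values_array)
    return t_values_array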

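def t_dist_unknown_sd(group_size):
    """Simulate 50,000 test statistics standardized with the sample sd."""
    t_values_array = []
    for i in range(50000):
        # Same samples as in t_dist_known_sd (same seeds, same N(5, 2^2) population)
        X1 = stats.norm.rvs(loc=5,
                            scale=2,
                            size=group_size,
                            random_state=(i + 1))
        # Standardize with the sample sd (ddof=1), which gives a t-statistic
        t_stat = (np.mean(X1) - 5) / (np.std(X1, ddof=1) / np.sqrt(len(X1)))
        t_values_array.append(t_stat)
    t_values_array = np.array(t_values_array)
    return t_values_array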

In [None]:
# Distribution of the test statistic when the population standard deviation is known, for different sample sizes.
t_5 = t_dist_known_sd(5)
t_10 = t_dist_known_sd(10)
t_30 = t_dist_known_sd(30)
t_100 = t_dist_known_sd(100)
# sns.kdeplot replaces the deprecated sns.distplot(..., hist=False)
sns.kdeplot(x=t_5, color='black', label='Sample size: 5')
sns.kdeplot(x=t_10, color='darkblue', label='Sample size: 10')
sns.kdeplot(x=t_30, color='green', label='Sample size: 30')
sns.kdeplot(x=t_100, color='orange', label='Sample size: 100')
plt.xlim(-5, 5)
plt.legend()

In [None]:
# Distribution of the test statistic when the population standard deviation is unknown, for different sample sizes.
t_5 = t_dist_unknown_sd(5)
t_10 = t_dist_unknown_sd(10)
t_30 = t_dist_unknown_sd(30)
t_100 = t_dist_unknown_sd(100)
# sns.kdeplot replaces the deprecated sns.distplot(..., hist=False)
sns.kdeplot(x=t_5, color='black', label='Sample size: 5')
sns.kdeplot(x=t_10, color='darkblue', label='Sample size: 10')
sns.kdeplot(x=t_30, color='green', label='Sample size: 30')
sns.kdeplot(x=t_100, color='orange', label='Sample size: 100')
plt.xlim(-5, 5)
plt.legend()