In [5]:
import numpy as np

# Paired sample t-test

##  Theory

- $X_i$ and $Y_i$ are paired, and correlated
    - $Cov(X_i, Y_i) = \sigma_{XY}$
    
    
- Independence across pairs:
    - $E(X_i - Y_i) = E(\bar X - \bar Y) = \mu_X - \mu_Y$
    - $Var(\bar X - \bar Y) = \frac{1}{N}[\sigma^2_X+\sigma^2_Y - 2 \rho\sigma_X \sigma_Y]$, where $\rho$ is correlation coefficient between $X$ and $Y$.


- Normal assumption:
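    - $D = X-Y \sim N(\mu_D, \sigma_D^2)$, where $\mu_D = \mu_X - \mu_Y$ and $\sigma_D^2 = \sigma^2_X+\sigma^2_Y - 2 \rho\sigma_X \sigma_Y$
    - $\bar D \sim N(\mu_D, \sigma_D^2/N)$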


- With a large sample size:
    - $D = X-Y$ doesn't have to be normal
    - By the central limit theorem, $\bar D$ is approximately normal: $\bar D \xrightarrow{N\to\infty} N(.,.)$


- Test Statistic:
    - $$t = \frac{\bar D - \mu_D}{s_{\bar D}}$$
    - $df=N-1$, with $s_{\bar D} = s_D/\sqrt{N}$

**Comparison with independent sample t-test**
- When $\sigma_X = \sigma_Y$:
    - *Independent*: $Var(\bar X - \bar Y) = 2\sigma^2/N$
    - *Paired*:  $Var(\bar X - \bar Y) = 2\sigma^2(1-\rho)/N$
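
The variance comparison above can be checked by simulation. The sketch below is only an illustration: the values of `sigma`, `rho`, `N`, and `reps` are hypothetical choices, and the data are assumed to be bivariate normal with equal variances.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical settings, not taken from the notes
sigma, rho, N, reps = 1.0, 0.8, 20, 20000
cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])

# Draw `reps` datasets of N correlated (X, Y) pairs: shape (reps, N, 2)
xy = rng.multivariate_normal([0.0, 0.0], cov, size=(reps, N))

# One value of X_bar - Y_bar per simulated dataset
diff_of_means = xy[..., 0].mean(axis=1) - xy[..., 1].mean(axis=1)

var_paired_sim = diff_of_means.var(ddof=1)         # simulated Var(X_bar - Y_bar)
var_paired_theory = 2 * sigma**2 * (1 - rho) / N   # paired formula
var_indep_theory = 2 * sigma**2 / N                # independent-sample formula

print(var_paired_sim, var_paired_theory, var_indep_theory)
```

With a positive $\rho$, the simulated variance should sit near the paired formula and well below the independent-sample value, which is why pairing gives a more powerful test when the pairs are positively correlated.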

## Numerical Example

In [8]:
D = [2,4,10,12,16,15,4,27,9,-1,15]
N = len(D)

In [13]:
D_bar = np.mean(D)
D_bar

10.272727272727273

In [11]:
s_D = np.std(D, ddof=1)
s_D

7.976100664998018

In [12]:
s_D_bar = s_D / np.sqrt(N)
s_D_bar

2.404884835991147

In [15]:
T = D_bar / s_D_bar  # t statistic under H0: mu_D = 0
T

4.271608818429545

In [17]:
from scipy.stats import t
# two-sided p-value; equals 2 * (1 - cdf), valid as written because T > 0
p = 1 - (t.cdf(T, df = N-1)-0.5) * 2
p

0.001632849921999746

**Conclusion:** $p$ < 0.05, so we reject $H_0: \mu_D = 0$; the mean difference is statistically significant.
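
As a cross-check, the same test can be run in one call. This is a sketch using `scipy.stats.ttest_1samp` on the differences `D`; it assumes a two-sided alternative, which is the function's default. If the raw paired samples were available, `scipy.stats.ttest_rel` would play the same role.

```python
from scipy import stats

D = [2, 4, 10, 12, 16, 15, 4, 27, 9, -1, 15]

# One-sample t-test of the differences against mu_D = 0 (two-sided by default)
res = stats.ttest_1samp(D, popmean=0)
print(res.statistic, res.pvalue)
```

The statistic and p-value should agree with `T` and `p` computed step by step above.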

# ANOVA test

## Theory

**Assumption**:
$$Y_{ij} = \mu + \alpha_i + e_{ij}, \quad i = 1,\dots,I \text{ (treatment)},\ j = 1,\dots,J \text{ (observation)}$$

- $e_{ij} \sim N(0, \sigma^2)$,  iid
- Constraint: $\sum \alpha_i=0$
- $H_0: \alpha_i=0$ for each $i$


- Decomposition of the sum of squares:
    - Within groups: $SS_W = \sum_i\sum_j(Y_{ij}-\bar Y_{i.})^2$
    - Between groups: $SS_B = J \sum_i(\bar Y_{..}-\bar Y_{i.})^2$


- Under normal distribution:
    $$SS_W/\sigma^2 \sim \chi^2[I(J-1)]$$


- Under $H_0$:
    $$SS_B/\sigma^2 \sim \chi^2(I-1)$$


- When two $\chi^2$ random variables are **independent**
$$\frac{\chi^2_a/a}{\chi^2_b/b} \sim F(a,b)$$

    
- Under normal distribution and $H_0$
$$F = \frac{SS_B/(I-1)}{SS_W/[I(J-1)]} \sim F[I-1, I(J-1)]$$

    - Under $H_0$, $E(\text{numerator}) = E(\text{denominator}) = \sigma^2$, so $F$ should be close to 1
    - Under $H_1$, when some $\alpha_i \neq 0$, $E(\text{numerator}) > \sigma^2$, so $F$ tends to be larger than 1
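
The F-statistic above can be computed directly from its definition. The sketch below uses a small hypothetical balanced dataset (the numbers in `Y` are made up for illustration) and compares the hand computation with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Hypothetical balanced data: I = 3 treatments (rows), J = 4 observations each
Y = np.array([[4.1, 5.0, 4.6, 5.3],
              [6.2, 5.8, 6.9, 6.1],
              [5.0, 4.4, 5.6, 5.2]])
I, J = Y.shape

group_means = Y.mean(axis=1)   # Y_bar_{i.}
grand_mean = Y.mean()          # Y_bar_{..}

SS_W = ((Y - group_means[:, None]) ** 2).sum()       # within-group sum of squares
SS_B = J * ((group_means - grand_mean) ** 2).sum()   # between-group sum of squares

F = (SS_B / (I - 1)) / (SS_W / (I * (J - 1)))
p = stats.f.sf(F, I - 1, I * (J - 1))   # upper-tail probability of F[I-1, I(J-1)]

# Reference computation: each row of Y is passed as one group
F_ref, p_ref = stats.f_oneway(*Y)
print(F, p, F_ref, p_ref)
```

Only large values of $F$ count as evidence against $H_0$, so the p-value is the upper tail of the F distribution.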


<img src="https://ecstep.com/wp-content/uploads/2017/12/F-distribution-2.png" width="400">

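**Violation of assumptions**:
- Independence: the F-test is not robust to dependent observations, so this assumption should not be violated
- Normality: the F-test remains approximately valid for non-normal errors when the sample is large
- Non-constant variance: the F-test remains approximately valid when the groups have equal sample sizes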

## Multiple Comparison

**Bonferroni**
- Instead of the critical value $t_{df}(\alpha)$, use $t_{df}(\frac{\alpha}{M})$
- $M$ is the number of comparisons
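
A short sketch of how the adjustment changes the critical value. The values of `alpha`, `M`, and `df` are hypothetical, and the comparisons are assumed to be two-sided, so each tail gets half of the per-comparison level.

```python
from scipy.stats import t

# Hypothetical setting: 3 pairwise comparisons, 9 error degrees of freedom
alpha, M, df = 0.05, 3, 9

t_unadjusted = t.ppf(1 - alpha / 2, df)         # two-sided critical value at level alpha
t_bonferroni = t.ppf(1 - alpha / (2 * M), df)   # two-sided critical value at level alpha / M

print(t_unadjusted, t_bonferroni)
```

The Bonferroni critical value is always the larger of the two, so each individual comparison needs stronger evidence to be declared significant.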

## Two-factor ANOVA

**Assumption**
$$Y_{ijk} = \mu + \alpha_i + \beta_j + \delta_{ij} + e_{ijk}$$
**Sum-of-squares decomposition**
$$SS=SS_A+SS_B+SS_{AB}+SS_E$$
**Four $\chi^2$ distributions** (the first three hold under the corresponding null hypothesis)
$$SS_A/\sigma^2 \sim \chi^2(I-1)$$
$$SS_B/\sigma^2 \sim \chi^2(J-1)$$
$$SS_{AB}/\sigma^2 \sim \chi^2[(I-1)(J-1)]$$
$$SS_E/\sigma^2 \sim \chi^2[IJ(K-1)]$$
**Three F-statistics** (with $?$ standing for $A$, $B$, or $AB$)
$$F=\frac{MS_?}{MS_E} = \frac{{SS}_?/{df}_?}{SS_E/[IJ(K-1)]} \sim F[df_?, IJ(K-1)]$$
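
The decomposition and the three F-statistics can be sketched with NumPy for a balanced design. The dimensions `I`, `J`, `K` and the simulated data are hypothetical; the data are drawn with no treatment effects, so all three null hypotheses hold.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical balanced design: I levels of A, J levels of B, K replicates per cell
I, J, K = 2, 3, 4
Y = rng.normal(size=(I, J, K))   # simulated under all three null hypotheses

grand = Y.mean()
mean_A = Y.mean(axis=(1, 2))     # shape (I,)
mean_B = Y.mean(axis=(0, 2))     # shape (J,)
mean_AB = Y.mean(axis=2)         # cell means, shape (I, J)

SS_A = J * K * ((mean_A - grand) ** 2).sum()
SS_B = I * K * ((mean_B - grand) ** 2).sum()
SS_AB = K * ((mean_AB - mean_A[:, None] - mean_B[None, :] + grand) ** 2).sum()
SS_E = ((Y - mean_AB[:, :, None]) ** 2).sum()
SS_T = ((Y - grand) ** 2).sum()  # total sum of squares

df_E = I * J * (K - 1)
MS_E = SS_E / df_E

for name, ss, df in [("A", SS_A, I - 1), ("B", SS_B, J - 1), ("AB", SS_AB, (I - 1) * (J - 1))]:
    F = (ss / df) / MS_E
    print(name, F, stats.f.sf(F, df, df_E))
```

In a balanced design the four pieces add up exactly to the total sum of squares, which is a convenient sanity check on the computation.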

# Experiment Design


## Examples of confounding
- Effect of ***Gender*** on college admission confounded by ***Major***: women tend to apply to majors that are harder to get into
- Effect of ***Coffee Drinking*** on coronary disease confounded by ***Smoking***: coffee drinkers smoke more


## Randomized Block Design
- With a randomized block design, the experimenter divides subjects into subgroups called **blocks**, such that the variability within blocks is less than the variability between blocks. Then, subjects within each block are randomly assigned to treatment conditions. 
- Compared to a completely randomized design, this design reduces variability within treatment conditions and potential confounding, producing a better estimate of treatment effects.


- ***Example***: Paired-Sample t-test, where ***person*** is the block
- ***Example***: Fertilizer agricultural experiment, where ***field*** is the block

The table below shows a randomized block design for a hypothetical medical experiment.

| Gender | Treatment: Placebo | Treatment: Vaccine |
|:-:|:-:|:-:|
| Male | 250 | 250 |
| Female | 250 | 250 |

Subjects are assigned to blocks, based on gender. Then, within each block, subjects are randomly assigned to treatments (either a placebo or a cold vaccine). For this design, 250 men get the placebo, 250 men get the vaccine, 250 women get the placebo, and 250 women get the vaccine.

It is known that men and women are physiologically different and react differently to medication. This design ensures that each treatment condition has an equal proportion of men and women. As a result, differences between treatment conditions cannot be attributed to gender. This randomized block design removes gender as a potential source of variability and as a potential confounding variable.
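
The assignment procedure described above can be sketched as follows. The subject pool of 500 men and 500 women matches the table, while the variable names and the random seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)

# Subject pool matching the table: 500 men and 500 women
gender = np.array(["Male"] * 500 + ["Female"] * 500)
treatment = np.empty(gender.size, dtype="U7")

# Randomize separately within each block so both arms get half of the block
for block in ("Male", "Female"):
    idx = np.flatnonzero(gender == block)
    labels = np.array(["Placebo", "Vaccine"]).repeat(idx.size // 2)
    treatment[idx] = rng.permutation(labels)

# Cell counts of the resulting design
for block in ("Male", "Female"):
    for arm in ("Placebo", "Vaccine"):
        print(block, arm, int(np.sum((gender == block) & (treatment == arm))))
```

Shuffling a fixed set of labels inside each block, rather than flipping a coin per subject, is what guarantees the exact 250/250 split within each gender.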

# Categorical analysis