In [2]:
import pandas as pd
import numpy as np
import scipy.stats as st

## Topic 2: Hypothesis Test
Is there strong evidence that the population mean $\mu$ is different from some value that is of interest to us?
In other words, is it different from some hypothesized value $\mu_0$?

The null hypothesis is $H_0: \mu = \mu_0$  
Possible alternative hypotheses:
- $H_a: \mu < \mu_0$ (one-sided/tailed test)
- $H_a: \mu > \mu_0$ (one-sided/tailed)
- $H_a: \mu \neq \mu_0$ (two-sided/tailed)


There are two scenarios:
- $\sigma$ is known, Z-test
- $\sigma$ is unknown, T-test

#### 1. Z-test
$$
X \sim N(\mu, \sigma^2)
$$

Suppose we have a simple random sample of $n$ observations from a normally distributed population where $\sigma$ is known. The normality assumption (that we are sampling from a normally distributed population) is very important when the sample size is small. As the sample size grows, it becomes less and less important because of the central limit theorem.

To test $H_0: \mu = \mu_0$, we use the test statistic (Z statistic):
\begin{equation}
Z = \frac{\bar{X} - \mu_0}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}\\
\text{where: } \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}
\end{equation}
Note: $\sigma$ is the population standard deviation.  
Because the population is normal (or, for large $n$, because of the central limit theorem): $$ \bar{X} \sim N(\mu, \frac{\sigma^2}{n}) $$
so, if $H_0$ is true: $$\frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1) $$
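
As a minimal sketch of the formula above, the Z statistic can be computed directly with NumPy. The function name `z_statistic` and the four data values are hypothetical, chosen only so the arithmetic is easy to follow by hand (sample mean 12, standard deviation of the mean 2/√4 = 1).

```python
import numpy as np

def z_statistic(sample, mu_0, sigma):
    """Z statistic for H0: mu = mu_0 when the population sigma is known."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    sd_of_mean = sigma / np.sqrt(n)  # sigma_{X-bar} = sigma / sqrt(n)
    return (sample.mean() - mu_0) / sd_of_mean

# Hypothetical data: 4 observations, hypothesized mean 10, known sigma 2
z = z_statistic([11.0, 12.0, 13.0, 12.0], mu_0=10.0, sigma=2.0)
print(z)  # (12 - 10) / (2 / 2) = 2.0
```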

#### 2. Rejection Region Approach
1. Choose a value for $\alpha$, the significance level of the test.  
$\alpha$ is the probability of rejecting the null hypothesis if it is true.
2. Find the appropriate rejection region.
3. Reject the null hypothesis if the test statistic falls in the rejection region.
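
The three steps above can be sketched for the Z-test using `scipy.stats.norm.ppf` to find the critical values. The helper names `rejection_region` and `reject`, and the labels `"less"`, `"greater"`, `"two-sided"`, are illustrative choices, not something defined earlier in this notebook.

```python
import scipy.stats as st

def rejection_region(alpha, alternative):
    """Critical values (lower, upper) for a Z-test.

    Reject H0 if z <= lower or z >= upper. None means no bound on that side.
    """
    if alternative == "less":        # H_a: mu < mu_0
        return st.norm.ppf(alpha), None
    if alternative == "greater":     # H_a: mu > mu_0
        return None, st.norm.ppf(1 - alpha)
    if alternative == "two-sided":   # H_a: mu != mu_0
        c = st.norm.ppf(1 - alpha / 2)
        return -c, c
    raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")

def reject(z, alpha, alternative):
    """True if the Z statistic falls in the rejection region."""
    lower, upper = rejection_region(alpha, alternative)
    in_lower = lower is not None and z <= lower
    in_upper = upper is not None and z >= upper
    return bool(in_lower or in_upper)

lower, upper = rejection_region(0.05, "two-sided")
print(lower, upper)                      # roughly -1.96 and 1.96
print(reject(2.0, 0.05, "two-sided"))    # 2.0 is beyond 1.96
```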

#### 3. P-value  
The p-value measures the strength of the evidence against the null hypothesis.

__The definition__:  
The p-value is the probability of getting the observed value of the test statistic, or a value with even greater evidence against the null hypothesis, if the null hypothesis is true.  

- The smaller the p-value, the greater the evidence against the null hypothesis.  
- If we have a given significance level $\alpha$, then: reject $H_0$ if p-value $\le \alpha$.  
(If p-value $\le \alpha$, the evidence against $H_0$ is significant at the $\alpha$ level of significance.)
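
For a Z statistic, "a value with even greater evidence against the null hypothesis" translates into a tail area of the standard normal distribution, and which tail depends on the alternative. The sketch below assumes a hypothetical helper `z_test_p_value`. It uses `st.norm.cdf` for the lower tail and `st.norm.sf` (1 - cdf) for the upper tail.

```python
import scipy.stats as st

def z_test_p_value(z, alternative):
    """p-value of a Z statistic under the given alternative hypothesis."""
    if alternative == "less":        # small z is evidence against H0
        return st.norm.cdf(z)
    if alternative == "greater":     # large z is evidence against H0
        return st.norm.sf(z)
    if alternative == "two-sided":   # large |z| is evidence against H0
        return 2 * st.norm.sf(abs(z))
    raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")

p_two_sided = z_test_p_value(1.96, "two-sided")
p_greater = z_test_p_value(1.96, "greater")
print(p_two_sided, p_greater)  # roughly 0.05 and 0.025
```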



#### 4. Type I errors, Type II errors, Power of the Test.

- A Type I error is rejecting $H_0$ when, in reality, it is true.  
P(Type I error | $H_0$ is true) = $\alpha$  
- A Type II error is failing to reject $H_0$ when, in reality, it is false.  
P(Type II error | $H_0$ is false) = $\beta$  
$\beta$ depends on a number of factors, including the choice of $\alpha$, the sample size, and the true value of the parameter.  
- Power is the probability of rejecting the null hypothesis, given that it is false.  
Power = 1 - P(Type II error) = 1 - $\beta$

|    |            |__Underlying reality__ (unknown)| |
|----|------------|----------------|----------------|
|    |            |$H_0$ is false  |$H_0$ is true   |
|__Conclusion from test__ (known)|Reject $H_0$|Correct decision|Type I error|
|    |Fail to reject $H_0$|Type II error   |Correct decision|

If we choose a very small $\alpha$, we make it very hard to reject $H_0$. We then have a small chance of making a Type I error, but a higher chance of making a Type II error, so the power of the test is low.

If we choose a larger value of $\alpha$, the reverse holds: a Type I error becomes more likely, a Type II error becomes less likely, and the power of the test is higher.

Relationship between parameters:

|$\alpha$|P(reject $H_0$)|P(Type I error)|P(Type II error)|$\beta$|Power|
|-|-|-|-|-|-|
|up|up|up|down|down|up|
|down|down|down|up|up|down|
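
The trade-off can be checked numerically. The sketch below assumes an upper-tailed Z-test ($H_a: \mu > \mu_0$) with known $\sigma$. The function name and the parameter values (true mean 0.3, $\sigma$ = 1, n = 50) are hypothetical. If the true mean is $\mu$, the Z statistic is normal with mean $(\mu - \mu_0)/(\sigma/\sqrt{n})$ and variance 1, so the power is the area of that shifted distribution beyond the critical value.

```python
import numpy as np
import scipy.stats as st

def power_upper_tailed_z(mu_true, mu_0, sigma, n, alpha):
    """Power of the upper-tailed Z-test when the true mean is mu_true."""
    z_crit = st.norm.ppf(1 - alpha)                   # reject H0 if Z >= z_crit
    shift = (mu_true - mu_0) / (sigma / np.sqrt(n))   # mean of Z under mu_true
    return st.norm.sf(z_crit - shift)

powers = {}
for alpha in (0.01, 0.05, 0.10):
    powers[alpha] = power_upper_tailed_z(
        mu_true=0.3, mu_0=0.0, sigma=1.0, n=50, alpha=alpha
    )
    beta = 1 - powers[alpha]
    print(f"alpha={alpha:.2f}  power={powers[alpha]:.3f}  beta={beta:.3f}")
```

When `mu_true` equals `mu_0`, the same function returns $\alpha$, which matches P(Type I error | $H_0$ is true) = $\alpha$.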

#### 5. Confidence Interval

Z-test: we would reject $H_0$ at $\alpha$ = 0.05 (two-sided) if
$$
 \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \le -1.96\ \text{or}\ \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \ge 1.96
$$
Isolating $\mu_0$, we would reject $H_0$ if:
$$
\mu_0 \ge \bar{X} + 1.96 \cdot \frac{\sigma}{\sqrt{n}}\ \text{or}\ \mu_0 \le \bar{X} - 1.96 \cdot \frac{\sigma}{\sqrt{n}}
$$

The values of $\mu_0$ that would not be rejected form the 95% confidence interval for $\mu$:
- The upper bound of the 95% confidence interval is $\bar{X} + 1.96 \cdot \frac{\sigma}{\sqrt{n}}$  
- The lower bound of the 95% confidence interval is $\bar{X} - 1.96 \cdot \frac{\sigma}{\sqrt{n}}$
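
A minimal sketch of these bounds, assuming a known $\sigma$ and a hypothetical helper `z_confidence_interval`. The critical value comes from `st.norm.ppf`, so 1.96 is recovered for a 95% interval. The data values are made up for illustration.

```python
import numpy as np
import scipy.stats as st

def z_confidence_interval(sample, sigma, confidence=0.95):
    """Confidence interval for the mean when the population sigma is known."""
    sample = np.asarray(sample, dtype=float)
    z_crit = st.norm.ppf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    margin = z_crit * sigma / np.sqrt(sample.size)
    return sample.mean() - margin, sample.mean() + margin

# Hypothetical data: sample mean 12, sigma / sqrt(n) = 2 / 2 = 1
ci_low, ci_high = z_confidence_interval([11.0, 12.0, 13.0, 12.0], sigma=2.0)
print(ci_low, ci_high)  # roughly 10.04 and 13.96
```

With these numbers, a hypothesized value $\mu_0 = 10$ lies outside the interval, which agrees with the two-sided Z-test: the Z statistic is 2.0, beyond 1.96.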

#### 6. Distribution of P-value

$H_0$: $\mu$ = 0  
$H_a$: $\mu$ > 0  
where the population $\sigma$ = 1.

I am testing 3 scenarios:
- The population mean is 0, which means $H_0$ is true
- The population mean is 1, which means $H_0$ is false
- The population mean is 2, which means $H_0$ is false  

I drew 10,000 samples of size 50 from each scenario. After calculating the p-value of each sample, we can plot a histogram of the p-value distribution for each scenario.

In [7]:
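sample_size = 50
num_sample = 10000
mu_0 = 0  # hypothesized mean under H0

mu1, sigma1 = 0, 1  # population_1: H0 is true
mu2, sigma2 = 1, 1  # population_2: H0 is false
mu3, sigma3 = 2, 1  # population_3: H0 is false

# one list of p-values per scenario, keyed by the true population mean
p_values = {mu1: [], mu2: [], mu3: []}

for i in range(num_sample):
    for mu, sigma in [(mu1, sigma1), (mu2, sigma2), (mu3, sigma3)]:
        sample = np.random.normal(mu, sigma, sample_size)
        # Z statistic (sigma is known)
        z = (sample.mean() - mu_0) / (sigma / np.sqrt(sample_size))
        # one-sided p-value for H_a: mu > mu_0 (upper tail area)
        p_values[mu].append(st.norm.sf(z))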

In [6]:
sample_size1 = 20  # assumed: the output below shows 20 values
sample1 = np.random.normal(mu1, sigma1, sample_size1)
sample1

array([-0.44128516, -0.60979832, -0.34385109,  1.30108423, -0.71732452,
       -1.91477831, -0.21408047,  1.27536415,  1.06191106, -1.24144945,
       -0.76279453, -1.0733659 ,  0.40119336,  0.53604004,  1.1890227 ,
       -1.91696102, -1.64305129, -0.4724886 , -0.98144517,  0.75864234])

#### 7. T-test

When $\sigma$ is known, we test $H_0: \mu = \mu_0$ with the Z statistic:
\begin{equation}
Z = \frac{\bar{X} - \mu_0}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}\\
\text{where: } \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}
\end{equation}
Note: $\sigma$ is the population standard deviation.  
Because the population is normal (or, for large $n$, because of the central limit theorem): $$ \bar{X} \sim N(\mu, \frac{\sigma^2}{n}) $$
so, if $H_0$ is true: $$\frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1) $$

But in most cases the population $\sigma$ is unknown, so the standard deviation of the sampling distribution of the sample mean can't be calculated as $\frac{\sigma}{\sqrt{n}}$.  
We can only use the __Standard Error__, which is the estimated standard deviation of the sampling distribution of the sample mean, $\frac{s}{\sqrt{n}}$, where $s$ is the sample standard deviation.  
Then the test statistic (T statistic) has a t distribution with __n-1__ degrees of freedom when $H_0$ is true and the population is normal:  
$$T = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} \sim t(n - 1) $$
The t distribution is very similar to the standard normal distribution, except that it has a lower peak and heavier tails. As the degrees of freedom increase, the t distribution tends toward the standard normal distribution, because the standard error estimates the standard deviation better and better.
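
The T statistic can be computed by hand and compared with `scipy.stats.ttest_1samp`, which performs a two-sided one-sample t-test by default. This is only a sketch: the helper name `one_sample_t_test` and the data values are hypothetical.

```python
import numpy as np
import scipy.stats as st

def one_sample_t_test(sample, mu_0):
    """Two-sided one-sample t-test computed from the T statistic formula."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    s = sample.std(ddof=1)              # sample standard deviation (n - 1 divisor)
    standard_error = s / np.sqrt(n)     # estimated sd of the sample mean
    t_stat = (sample.mean() - mu_0) / standard_error
    p_value = 2 * st.t.sf(abs(t_stat), df=n - 1)   # both tails of t(n - 1)
    return t_stat, p_value

data = [11.0, 12.0, 13.0, 12.0]         # hypothetical observations
t_manual, p_manual = one_sample_t_test(data, mu_0=10.0)
result = st.ttest_1samp(data, popmean=10.0)

print(t_manual, p_manual)
print(result.statistic, result.pvalue)  # should agree with the manual values
```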