
## 14 - Hypothesis Testing

**Null Hypothesis (H0):** The status quo, that is, the default assumption of no effect or no difference.

**Alternative Hypothesis (Ha):** The claim that challenges the status quo.

### One-Sample t-Test
A one-sample t-test checks whether the mean of the population a sample was drawn from differs from a hypothesized value.
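As a minimal sketch of what the test computes, the statistic is the distance between the sample mean and the hypothesized mean, measured in standard errors. The helper `one_sample_t` and the small `demo` array are made up for illustration and are not part of the notebook; the result is compared against `stats.ttest_1samp`.

```python
import math
import numpy as np
import scipy.stats as stats

def one_sample_t(sample, popmean):
    """Return the t statistic and two-sided p-value for H0: population mean == popmean."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    # The standard error uses the sample standard deviation (ddof=1).
    standard_error = sample.std(ddof=1) / math.sqrt(n)
    t_stat = (sample.mean() - popmean) / standard_error
    # Two-sided p-value from the t distribution with n - 1 degrees of freedom.
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)
    return t_stat, p_value

# Hypothetical data, used only to compare the manual result with SciPy.
demo = np.array([9.8, 10.4, 10.1, 9.6, 10.9, 10.3, 9.9, 10.6])
t_manual, p_manual = one_sample_t(demo, 10.0)
t_scipy, p_scipy = stats.ttest_1samp(demo, 10.0)
print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```

The two printed lines should agree up to floating-point rounding.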

### Two-Sample t-Test
A two-sample t-test investigates whether the means of two independent data samples differ from one another. In a two-sample test, the null hypothesis is that the means of both groups are the same.
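The following sketch shows one way the two-sample statistic can be computed by hand, assuming both populations share the same variance (the default behavior of `stats.ttest_ind`). The helper `pooled_two_sample_t` and the two short arrays are hypothetical and exist only for this illustration.

```python
import math
import numpy as np
import scipy.stats as stats

def pooled_two_sample_t(x, y):
    """Return the pooled-variance t statistic and two-sided p-value for H0: equal means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n_x, n_y = x.size, y.size
    degrees_of_freedom = n_x + n_y - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n_x - 1) * x.var(ddof=1) + (n_y - 1) * y.var(ddof=1)) / degrees_of_freedom
    standard_error = math.sqrt(pooled_var * (1 / n_x + 1 / n_y))
    t_stat = (x.mean() - y.mean()) / standard_error
    p_value = 2 * stats.t.sf(abs(t_stat), df=degrees_of_freedom)
    return t_stat, p_value

# Hypothetical data, used only to compare the manual result with SciPy.
group_x = np.array([101.2, 98.7, 103.4, 99.9, 100.8, 97.5])
group_y = np.array([104.1, 102.3, 99.8, 105.6, 103.0, 101.7, 106.2])
t_manual, p_manual = pooled_two_sample_t(group_x, group_y)
t_scipy, p_scipy = stats.ttest_ind(group_x, group_y)
print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```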

In [0]:
import math
import scipy.stats as stats
import numpy as np

In [9]:
# Generate three random samples from normal distributions.
# rvs stands for "random variates".
# The inputs are (population mean, population standard deviation, sample size).

sample_A = stats.norm.rvs(10000, 1000, 50)
sample_B = stats.norm.rvs(1000, 100, 50)
sample_C = stats.norm.rvs(1000, 100, 50)

print("Sample A:\n\n", sample_A, end="\n\n")
print("Sample B:\n\n", sample_B, end="\n\n")
print("Sample C:\n\n", sample_C)

Sample A:

 [11160.80984163  9897.15605455  9917.85807547  7207.45721356
  9044.63900747  9708.01468765  9814.66261502 11212.4509415
  9781.42686519 11230.99268175  9683.88885264 11407.7875564
  9611.69452933  9572.65961423 10942.26233215 12019.97828666
 10560.91872098  8792.77569039  9467.34957794 10635.89734731
 10401.78021359 11378.53257134 11077.45005989  9446.35531629
 11885.4180422  11199.01227669  9827.50478455  9006.84111528
  9475.23153781  9863.81254383  9833.30848997 10470.77754384
  9569.99617606  9423.63220264  9045.16925518  9606.65397477
 10293.65403821 11276.30980596 10526.23972395 10773.0072756
 10971.59482502 11198.34729122  9262.69635057 11886.74650527
  8779.90343026  8703.77020232 10279.87608205  9742.93844366
 10748.94363695 10633.7913478 ]

Sample B:

 [ 986.33899723  919.60244233 1200.47068717  889.0902896  1087.63889928
  822.22279957  952.68589368 1010.73056979 1142.9187592  1096.48661537
 1271.97007861  902.84377869 1166.08454648  843.29867575  890.32356969


In [6]:
print("Sample mean of sample A =", round(sample_A.mean(),0))
print("Sample mean of sample B =", round(sample_B.mean(),0))
print("Sample mean of sample C =", round(sample_C.mean(),0))

Sample mean of sample A = 10185.0
Sample mean of sample B = 1006.0
Sample mean of sample C = 1012.0
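Because the samples are drawn without a seed, the values and sample means change every time the cells are rerun. A minimal sketch of making a draw repeatable, assuming the `random_state` argument of `rvs` is given an integer; the value `SEED = 42` is arbitrary and not taken from the notebook.

```python
import numpy as np
import scipy.stats as stats

SEED = 42  # arbitrary value chosen for illustration

# Passing the same integer as random_state makes repeated draws identical.
first_draw = stats.norm.rvs(10000, 1000, size=50, random_state=SEED)
second_draw = stats.norm.rvs(10000, 1000, size=50, random_state=SEED)

print("Identical draws:", np.array_equal(first_draw, second_draw))
print("Sample mean:", round(first_draw.mean(), 0))
```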


### One Sample t-Test

H0: the mean of the population that sample A was drawn from is 0

Ha: the mean of the population that sample A was drawn from is not 0


In [10]:
stats.ttest_1samp(sample_A, 0)

Ttest_1sampResult(statistic=73.61093736981869, pvalue=7.768664898302252e-52)

The p-value is the probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one computed from the sample.
If the population mean were 0 (null hypothesis), the chance of a sample mean as far from 0 as 10185 would be almost zero, so we reject the null hypothesis.

H0: the mean of the population that sample A was drawn from is 9500

Ha: the mean of the population that sample A was drawn from is not 9500

In [11]:
stats.ttest_1samp(sample_A, 9500)

Ttest_1sampResult(statistic=4.81701903598267, pvalue=1.4446299243790565e-05)

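The p-value is the probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one computed from the sample. If the population mean were 9500 (null hypothesis), the chance of a sample mean at least as far from 9500 as 10185 would be very slim, so we reject the null hypothesis.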
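To make the link between the statistic and the p-value concrete, the sketch below recomputes the two-sided p-value from the reported statistic of 4.817 and the sample size of 50, using the t distribution with 49 degrees of freedom. The helper name `two_sided_p_value` is an assumption made for this example.

```python
import scipy.stats as stats

def two_sided_p_value(t_statistic, sample_size):
    """Two-sided p-value of a one-sample t statistic with sample_size - 1 degrees of freedom."""
    degrees_of_freedom = sample_size - 1
    # sf is the upper-tail probability; doubling it covers both tails.
    return 2 * stats.t.sf(abs(t_statistic), df=degrees_of_freedom)

# Statistic reported by stats.ttest_1samp(sample_A, 9500) above.
p_9500 = two_sided_p_value(4.81701903598267, 50)
print(p_9500)
```

The printed value should match the `pvalue` in the result above, about 1.44e-05.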

H0: the mean of the population that sample A was drawn from is 9950

Ha: the mean of the population that sample A was drawn from is not 9950

In [16]:
stats.ttest_1samp(sample_A, 9950)

Ttest_1sampResult(statistic=1.558359746485175, pvalue=0.12558402033014057)

The p-value is the probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one computed from the sample. If the population mean were 9950 (null hypothesis), the chance of a sample mean at least as far from 9950 as 10185 would be about 12.6%, which is not slim. At the 0.05 significance level we are unable to reject the null hypothesis. This does not prove that the population mean is 9950; it only means the sample data are consistent with that value.
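Another way to read this result is through a confidence interval: a hypothesized mean is rejected at the 0.05 level exactly when it falls outside the 95% confidence interval for the mean. The sketch below illustrates this on a freshly generated sample, not the notebook's `sample_A`; the helper `mean_confidence_interval` and the seed are assumptions made for the example.

```python
import math
import numpy as np
import scipy.stats as stats

def mean_confidence_interval(sample, confidence=0.95):
    """Return (lower, upper) bounds of a t-based confidence interval for the mean."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    standard_error = sample.std(ddof=1) / math.sqrt(n)
    # The confidence level is passed positionally because its keyword name
    # differs across SciPy versions.
    return stats.t.interval(confidence, n - 1, loc=sample.mean(), scale=standard_error)

# Illustrative sample with an arbitrary seed.
demo_sample = stats.norm.rvs(10000, 1000, size=50, random_state=0)
lower, upper = mean_confidence_interval(demo_sample)
print("95% CI:", lower, upper)

for hypothesized_mean in (9500, 9950, 10000):
    _, p_value = stats.ttest_1samp(demo_sample, hypothesized_mean)
    inside = lower <= hypothesized_mean <= upper
    print(hypothesized_mean, "| inside CI:", inside, "| p > 0.05:", p_value > 0.05)
```

For each hypothesized mean, the two boolean columns should agree.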

### Two Sample Test - sample A vs sample B

H0: The populations that sample A and B were sampled from have the same mean

Ha: The populations that sample A and B were sampled from have different means

In [17]:
stats.ttest_ind(sample_A,sample_B)

Ttest_indResult(statistic=65.9752975108562, pvalue=5.091613568499395e-83)

With such a tiny p-value, we reject the null hypothesis in favor of the alternative hypothesis.
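By default, `stats.ttest_ind` assumes both populations have the same variance, which is not the case here (standard deviations of 1000 and 100). A hedged sketch of Welch's t-test, which drops that assumption via `equal_var=False`, is shown below. It uses freshly generated samples with an arbitrary seed, and the names `wide_sample` and `narrow_sample` are invented for the example.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)  # arbitrary seed chosen for illustration
wide_sample = rng.normal(10000, 1000, 50)   # same parameters as sample A
narrow_sample = rng.normal(1000, 100, 50)   # same parameters as sample B

# Default: pooled variance (equal_var=True).
pooled = stats.ttest_ind(wide_sample, narrow_sample)
# Welch's t-test: does not assume equal population variances.
welch = stats.ttest_ind(wide_sample, narrow_sample, equal_var=False)

print("Pooled:", pooled.statistic, pooled.pvalue)
print("Welch: ", welch.statistic, welch.pvalue)
```

With equal sample sizes the two statistics coincide; only the degrees of freedom, and therefore the p-value, differ. The conclusion here is the same either way.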

### Two Sample Test - sample B vs C

H0: The populations that sample B and C were sampled from have the same mean

Ha: The populations that sample B and C were sampled from have different means

In [19]:
stats.ttest_ind(sample_B,sample_C)

Ttest_indResult(statistic=0.15836381504601296, pvalue=0.8744959998730909)

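With such a large p-value, we are unable to reject the null hypothesis. Failing to reject is not the same as accepting it: the data give no evidence that samples B and C were drawn from populations with different means, which is consistent with how they were generated (both from a population with mean 1000).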

### Two Sample Test - sample D vs E 

Let's make the population means different (1000 vs 1005)

H0: The populations that sample D and E were sampled from have the same mean

Ha: The populations that sample D and E were sampled from have different means

In [26]:
sample_D = stats.norm.rvs(1000, 100, 50)
sample_E = stats.norm.rvs(1005, 100, 50)

stats.ttest_ind(sample_D, sample_E)

Ttest_indResult(statistic=0.3957814119673394, pvalue=0.6931266064232611)

The p-value is still relatively large, so we are unable to reject the null hypothesis.
Even though the population means are different, the difference is too small relative to the spread (standard deviation 100) and the sample size (50) for the test to detect it.
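How often the test detects a real difference can be estimated by simulation. The sketch below repeats the experiment many times for a given shift in the population mean and reports the fraction of runs with p < 0.05. The function `estimate_power`, the number of trials, and the seed are assumptions made for illustration.

```python
import numpy as np
import scipy.stats as stats

def estimate_power(mean_shift, sd=100, n=50, trials=2000, alpha=0.05, seed=0):
    """Estimate how often ttest_ind rejects H0 when the true means differ by mean_shift."""
    rng = np.random.default_rng(seed)
    group_d = rng.normal(1000, sd, size=(trials, n))
    group_e = rng.normal(1000 + mean_shift, sd, size=(trials, n))
    # axis=1 runs one independent test per simulated experiment (one per row).
    _, p_values = stats.ttest_ind(group_d, group_e, axis=1)
    return float(np.mean(p_values < alpha))

for shift in (5, 15, 30):
    share = estimate_power(shift)
    print(f"shift={shift}: rejected in {share:.1%} of simulated experiments")
```

For a shift of 5 the rejection rate stays close to the 5% false-positive rate, so a non-significant result is the expected outcome for a difference this small.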


### Two Sample Test - sample D vs E 

Let's make the population means somewhat more different (1000 vs 1015)

H0: The populations that sample D and E were sampled from have the same mean

Ha: The populations that sample D and E were sampled from have different means

In [27]:
sample_D = stats.norm.rvs(1000, 100, 50)
sample_E = stats.norm.rvs(1015, 100, 50)

stats.ttest_ind(sample_D, sample_E)

Ttest_indResult(statistic=-0.9016450519626695, pvalue=0.3694560773602692)

The p-value is smaller but still greater than 0.05 (the conventional significance level, widely used in social science research). We do not reject the null hypothesis: the sample means are not significantly different, even though the population means are.

### Two Sample Test - sample D vs E 

Let's make the population means even more different (1000 vs 1030)

H0: The populations that sample D and E were sampled from have the same mean

Ha: The populations that sample D and E were sampled from have different means

In [30]:
sample_D = stats.norm.rvs(1000, 100, 50)
sample_E = stats.norm.rvs(1030, 100, 50)

stats.ttest_ind(sample_D, sample_E)

Ttest_indResult(statistic=-2.0660955236732907, pvalue=0.04145771923675248)

Now the p-value is less than the conventional 0.05 threshold.
We can reject the null hypothesis and conclude that the means of the two populations are not the same.

### The End