
## 14 - Hypothesis Testing

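**Null Hypothesis (H0):** The status quo, i.e. the default assumption that there is no effect or no difference.

**Alternative Hypothesis (Ha):** The claim that challenges the status quo.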

### One-Sample t-Test
A one-sample t-test checks whether the mean of the population that a sample was drawn from differs from a hypothesized value.

### Two-Sample t-Test
A two-sample t-test investigates whether the means of two independent data samples differ from one another. In a two-sample test, the null hypothesis is that the means of both groups are the same.

In [0]:
import math
import scipy.stats as stats
import numpy as np

In [4]:
# Generate three random samples from normal distributions.
# rvs stands for "random variates".
# The positional inputs are (loc = population mean, scale = population standard deviation, size = sample size).

x1 = stats.norm.rvs(10000, 1000, 50)
x2 = stats.norm.rvs(1000, 100, 50)
x3 = stats.norm.rvs(1000, 100, 50)
print(x1)
print(x2)
print(x3)

[ 8892.5419476   8880.60349125  9933.10573274 10242.89280698
  9574.59963647 11015.68662426  8950.26910536  9413.52037614
 12541.30257325 10661.58207338  9553.46211005 11284.92265966
 11743.0319755   9419.50158648 10030.15547029  9818.04859664
 11592.99178839  9455.80743369 10179.71242169 10218.50850933
 10286.88527807  9332.46414959  9706.62376894  9931.82821025
  9647.10685708  9309.39086588 11833.39667801  9263.58485309
  9963.95743231  8313.95468574 10648.71401652 11657.39889729
  8812.86561169 11034.13958046 10819.9416706   9690.65376058
 11226.39721014  9701.93912815 10218.03763115 10274.46650508
  9675.27639018 10903.44502777  8885.91754944  8630.67176298
 11045.12386603  9646.8375689  11191.09897377  9307.25396189
  9430.14182421 11157.77127195]
[1141.07716085  944.9305697  1144.55513145  919.33297258 1039.44303219
 1051.3659284   931.36924514 1228.12682316 1089.99564413 1063.90645926
 1064.07019839  970.53764936 1042.32028801  896.72407958 1114.91436234
 1037.68463018 1057.980
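
Because no seed is set, the samples above change on every run, and so will every test result that follows. As a minimal sketch, assuming you want repeatable draws, `rvs` accepts a `random_state` argument. The seed value 42 and the names `a` and `b` are arbitrary choices for illustration and are not part of the original notebook.

```python
import numpy as np
import scipy.stats as stats

# Passing the same integer seed via random_state reproduces the same draws.
# The seed value 42 is arbitrary and used only for illustration.
a = stats.norm.rvs(loc=1000, scale=100, size=50, random_state=42)
b = stats.norm.rvs(loc=1000, scale=100, size=50, random_state=42)

# Both calls used the same seed, so the two arrays are identical.
print(np.array_equal(a, b))
print(a.shape)
```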

In [5]:
print(round(x1.mean(),2))
print(round(x2.mean(),2))
print(round(x3.mean(),2))

10098.99
1005.4
975.79


### One Sample t-Test

H0: the mean of the population that x1 was sampled from is 0

Ha: the mean of the population that x1 was sampled from is not 0


In [8]:
stats.ttest_1samp(x1, 0)

Ttest_1sampResult(statistic=74.76855251618234, pvalue=3.64053014577373e-52)

The p-value is the probability, assuming the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed.
If the population mean were 0 (the null hypothesis), a sample mean of 10098.99 would be practically impossible, so we reject the null hypothesis.

H0: the mean of the population that x1 was sampled from is 9500

Ha: the mean of the population that x1 was sampled from is not 9500

In [10]:
stats.ttest_1samp(x1, 9500)

Ttest_1sampResult(statistic=4.434667244386324, pvalue=5.2066525356193545e-05)

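The p-value is again the probability, under the null hypothesis, of a test statistic at least as extreme as the one observed. If the population mean were 9500 (the null hypothesis), a sample mean as far from 9500 as 10098.99 would be very unlikely (p is about 0.00005), so we reject the null hypothesis here as well.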
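
To make the one-sample test less of a black box, here is a minimal sketch that computes the t statistic and the two-sided p-value by hand and compares them with `stats.ttest_1samp`. The helper name `one_sample_t` and the seeded sample `demo` are assumptions introduced for illustration; they are not part of the original notebook.

```python
import numpy as np
import scipy.stats as stats

def one_sample_t(sample, popmean):
    """Two-sided one-sample t-test computed from first principles."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    # Standard error of the mean, using the sample standard deviation (ddof=1).
    se = sample.std(ddof=1) / np.sqrt(n)
    t = (sample.mean() - popmean) / se
    # Two-sided p-value from the t distribution with n - 1 degrees of freedom.
    p = 2 * stats.t.sf(abs(t), df=n - 1)
    return t, p

# Illustrative sample drawn the same way as x1, but seeded so it is repeatable.
demo = stats.norm.rvs(10000, 1000, 50, random_state=0)

t_manual, p_manual = one_sample_t(demo, 9500)
t_scipy, p_scipy = stats.ttest_1samp(demo, 9500)

print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

The two pairs of numbers should agree up to floating-point error, which shows that the test only depends on the sample mean, the sample standard deviation, and the sample size.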

### Two Sample Test - x1 vs x2

H0: The populations that x1 and x2 were sampled from have the same mean

Ha: The populations that x1 and x2 were sampled from have different means

In [0]:
stats.ttest_ind(x1,x2)

Ttest_indResult(statistic=75.2982688917182, pvalue=1.5464486135779753e-88)

The p-value is extremely small, so we reject the null hypothesis in favor of the alternative: the two population means differ.
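
By default `stats.ttest_ind` assumes both populations have the same variance and uses a pooled variance estimate. The sketch below reproduces that calculation by hand under the same assumption. The helper name `pooled_t` and the seeded samples `a` and `b` are hypothetical and introduced only for illustration.

```python
import numpy as np
import scipy.stats as stats

def pooled_t(a, b):
    """Two-sided two-sample t-test assuming equal population variances."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    na, nb = a.size, b.size
    df = na + nb - 2
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / df
    # Standard error of the difference between the two sample means.
    se = np.sqrt(sp2 * (1 / na + 1 / nb))
    t = (a.mean() - b.mean()) / se
    p = 2 * stats.t.sf(abs(t), df=df)
    return t, p

# Illustrative seeded samples (not the x1 and x2 arrays from the cells above).
a = stats.norm.rvs(1000, 100, 50, random_state=1)
b = stats.norm.rvs(1050, 100, 50, random_state=2)

t_manual, p_manual = pooled_t(a, b)
t_scipy, p_scipy = stats.ttest_ind(a, b)

print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

If the equal-variance assumption is doubtful, `stats.ttest_ind(a, b, equal_var=False)` runs Welch's t-test instead, which does not pool the variances.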

### Two Sample Test - x2 vs x3

H0: The populations that x2 and x3 were sampled from have the same mean

Ha: The populations that x2 and x3 were sampled from have different means

In [0]:
stats.ttest_ind(x2, x3)

Ttest_indResult(statistic=0.44192131789576433, pvalue=0.6595198044117564)

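The p-value is large (about 0.66), so we fail to reject the null hypothesis. This is the expected outcome, since x2 and x3 were drawn from the same distribution.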

### Two Sample Test - x4 vs x5 

Let's make the population means different (1000 vs 1005)

H0: The populations that x4 and x5 were sampled from have the same mean

Ha: The populations that x4 and x5 were sampled from have different means

In [0]:
x4 = stats.norm.rvs(1000, 100, 50)
x5 = stats.norm.rvs(1005, 100, 50)
stats.ttest_ind(x4,x5)

Ttest_indResult(statistic=-0.6165788219522748, pvalue=0.5389422757221363)

The p-value is still relatively large, so we fail to reject the null hypothesis.
Even though the population means really are different, the difference (5, against a standard deviation of 100) is too small for samples of size 50 to show as statistically significant.

### Two Sample Test - x4 vs x5 

Let's make the population means more different (1000 vs 1020)

H0: The populations that x4 and x5 were sampled from have the same mean

Ha: The populations that x4 and x5 were sampled from have different means

In [0]:
x4 = stats.norm.rvs(1000, 100, 50)
x5 = stats.norm.rvs(1020, 100, 50)
stats.ttest_ind(x4,x5)

Ttest_indResult(statistic=-0.49293490714470384, pvalue=0.6231612843320816)

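The p-value (about 0.62) is still well above 0.05, the threshold conventionally used in research, so we do not reject the null hypothesis. Note that it is actually larger than in the previous run (about 0.54) even though the population means are further apart: with a difference this small relative to the standard deviation, the outcome of a single pair of samples is dominated by sampling noise. We cannot conclude that the population means differ.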

### Two Sample Test - x4 vs x5 

Let's make the difference between the population means larger (1000 vs 1050), a 5% difference

H0: The populations that x4 and x5 were sampled from have the same mean

Ha: The populations that x4 and x5 were sampled from have different means

In [0]:
x4 = stats.norm.rvs(1000, 100, 50)
x5 = stats.norm.rvs(1050, 100, 50)
stats.ttest_ind(x4,x5)

Ttest_indResult(statistic=-2.8765854978573033, pvalue=0.004933252916142617)

Now we have a p-value that is less than the standard 0.05 threshold.
We can reject the null hypothesis and state that the means of the two populations are different.
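
Each experiment above draws only one pair of samples, so its p-value is itself random. A rough way to see how often a given difference would be detected is to repeat the experiment many times and count the rejections. The sketch below does that for the three differences used above (5, 20 and 50). The function name `rejection_rate`, the number of trials, and the seed are assumptions chosen for illustration.

```python
import numpy as np
import scipy.stats as stats

def rejection_rate(mean_a, mean_b, sd=100, n=50, trials=2000, alpha=0.05, seed=0):
    """Fraction of simulated experiments in which H0 (equal means) is rejected."""
    rng = np.random.default_rng(seed)
    # Each row is one simulated experiment with n observations per group.
    a = rng.normal(mean_a, sd, size=(trials, n))
    b = rng.normal(mean_b, sd, size=(trials, n))
    # Run one two-sample t-test per row.
    _, p = stats.ttest_ind(a, b, axis=1)
    return float(np.mean(p < alpha))

for shift in (5, 20, 50):
    print(shift, rejection_rate(1000, 1000 + shift))
```

With a standard deviation of 100 and 50 observations per group, a difference of 5 should be rejected only slightly more often than the 5% false-positive rate, while a difference of 50 should be detected in a clear majority of runs. This is why the first two experiments were likely to come out non-significant.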

### The End