<a href="https://colab.research.google.com/github/wcj365/python-stats-dataviz/blob/master/14%20-%20Hypothesis%20Testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 14 - Hypothesis Testing

**Null Hypothesis (H0):** The status quo: the default claim that there is no effect or no difference.

**Alternative Hypothesis (Ha):** The claim that challenges the status quo.

### One-Sample t-Test
A one-sample t-test checks whether a sample mean differs significantly from a hypothesized population mean.
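As a minimal sketch of what the test computes, the t statistic is the distance between the sample mean and the hypothesized mean, measured in standard errors. The data, the seed, and the hypothesized mean `mu_0` below are assumptions chosen for illustration only.

```python
import numpy as np
import scipy.stats as stats

# Illustrative data: 50 draws from a normal population (seed fixed for repeatability).
rng = np.random.default_rng(0)
sample = rng.normal(loc=10000, scale=1000, size=50)

mu_0 = 9500  # hypothesized population mean (an assumption for this example)

# t = (sample mean - hypothesized mean) / (sample standard deviation / sqrt(n))
n = sample.size
standard_error = sample.std(ddof=1) / np.sqrt(n)
t_manual = (sample.mean() - mu_0) / standard_error

# scipy computes the same statistic.
t_scipy = stats.ttest_1samp(sample, mu_0).statistic

print("manual t =", t_manual)
print("scipy  t =", t_scipy)
```

The two printed values should agree up to floating-point rounding.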

### Two-Sample t-Test
A two-sample t-test investigates whether the means of two independent data samples differ from one another. In a two-sample test, the null hypothesis is that the means of both groups are the same.
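The two-sample statistic can be sketched by hand as well. By default `stats.ttest_ind` assumes both populations share the same variance and uses a pooled variance estimate. The data and seed below are illustrative assumptions.

```python
import numpy as np
import scipy.stats as stats

# Illustrative data: two independent samples from the same normal population.
rng = np.random.default_rng(1)
group_1 = rng.normal(loc=1000, scale=100, size=50)
group_2 = rng.normal(loc=1000, scale=100, size=50)

n1, n2 = group_1.size, group_2.size
var_1 = group_1.var(ddof=1)
var_2 = group_2.var(ddof=1)

# Pooled variance: the two sample variances weighted by their degrees of freedom.
pooled_var = ((n1 - 1) * var_1 + (n2 - 1) * var_2) / (n1 + n2 - 2)
standard_error = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

t_manual = (group_1.mean() - group_2.mean()) / standard_error
t_scipy = stats.ttest_ind(group_1, group_2).statistic

print("manual t =", t_manual)
print("scipy  t =", t_scipy)
```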

In [1]:
import math
import scipy.stats as stats
import numpy as np

In [2]:
# Generate three random samples from a normal distribution.
# rvs stands for "random variates".
# The inputs are (population mean, population standard deviation, sample size).

sample_A = stats.norm.rvs(10000, 1000, 50)
sample_B = stats.norm.rvs(1000, 100, 50)
sample_C = stats.norm.rvs(1000, 100, 50)

print("Sample A:\n\n", sample_A, end="\n\n")
print("Sample B: =\n\n", sample_B, end="\n\n")
print("Sample C: =\n\n", sample_C)

Sample A:

 [ 8498.08012025 10865.2023432  11288.4812106  11295.73510285
 10930.91979229 11062.77119077  9526.94624105 10067.69657688
  9354.17479121  9511.20962643 10120.48501979  9314.31140253
 10500.38757665 10500.84207738  9234.92870109  9359.03746334
  9792.40520679 10103.47704609  9737.60734016 10772.31038748
 11493.62539981 11059.63859453  8620.54679657 10164.4629446
 10584.7972384   9584.57192169  8551.95710353  9947.78158494
 10174.08701436  9322.30657864  9332.41890757 10089.6015265
 10117.61442891 11589.47556556  8544.16966468  9779.18905504
  8917.20189614 11028.68763535 10770.51289744 11395.61127346
  9120.7704156   8694.83283846  9029.9979643   9810.68294015
  9628.47261687 10869.8796043  10248.2344638  10137.7969628
  9103.49603165 10068.50454317]

Sample B: =

 [ 909.63194976 1009.95471143 1196.61131103  907.12501033  980.89191733
  886.06644797 1017.09359913  960.61435419 1068.06107247 1034.80220537
 1118.60314489  951.91907356  989.67968101 1026.33448616  946.55770104

In [3]:
print("Sample mean of sample A =", round(sample_A.mean(),0))
print("Sample mean of sample B =", round(sample_B.mean(),0))
print("Sample mean of sample C =", round(sample_C.mean(),0))

Sample mean of sample A = 9992.0
Sample mean of sample B = 966.0
Sample mean of sample C = 988.0


### One-Sample t-Test

H0: the mean of the population that sample A was drawn from is 0

Ha: the mean of the population that sample A was drawn from is not 0


In [4]:
stats.ttest_1samp(sample_A, 0)

Ttest_1sampResult(statistic=83.03368600581942, pvalue=2.223787847497298e-54)

The p-value is the probability, assuming the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed.
If the population mean were 0 (null hypothesis), a sample mean of about 9992 would be almost impossible, so we reject the null hypothesis.

H0: the mean of the population that sample A was drawn from is 9500

Ha: the mean of the population that sample A was drawn from is not 9500

In [5]:
stats.ttest_1samp(sample_A, 9500)

Ttest_1sampResult(statistic=4.091362201187447, pvalue=0.0001594389369122153)

Here the p-value is about 0.00016. If the population mean were 9500 (null hypothesis), a sample mean at least as far from 9500 as the observed value of about 9992 would be very unlikely, so we reject the null hypothesis.
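The reported p-value can be recovered from the t statistic alone. The sketch below assumes a two-sided test with n - 1 = 49 degrees of freedom (sample A has 50 observations) and reuses the statistic printed for the test against 9500.

```python
import scipy.stats as stats

t_statistic = 4.091362201187447   # statistic reported for the test against 9500
degrees_of_freedom = 50 - 1       # sample size minus one

# Two-sided p-value: probability mass in both tails beyond |t|.
p_two_sided = 2 * stats.t.sf(abs(t_statistic), degrees_of_freedom)

print("two-sided p-value =", p_two_sided)
```

`stats.t.sf` is the survival function (1 - CDF) of Student's t distribution; doubling it accounts for both tails.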

H0: the mean of the population that sample A was drawn from is 9950

Ha: the mean of the population that sample A was drawn from is not 9950

In [6]:
stats.ttest_1samp(sample_A, 9950)

Ttest_1sampResult(statistic=0.3519889683364591, pvalue=0.7263558704995761)

Here the p-value is about 0.73 (72.6%). If the population mean were 9950 (null hypothesis), a sample mean at least as far from 9950 as the observed value of about 9992 would occur quite often, so we are unable to reject the null hypothesis. This does not prove that the population mean is 9950; it only means that the sample data are consistent with that value.
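Failing to reject one hypothesized mean does not single it out: a whole range of values is compatible with the data. One way to see this is a 95% confidence interval for the mean, sketched below on illustrative data (the seed and sample are assumptions, not the notebook's `sample_A`).

```python
import numpy as np
import scipy.stats as stats

# Illustrative data standing in for sample A.
rng = np.random.default_rng(2)
sample = rng.normal(loc=10000, scale=1000, size=50)

n = sample.size
mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean (uses ddof=1)

# 95% confidence interval based on Student's t with n - 1 degrees of freedom.
ci_low, ci_high = stats.t.interval(0.95, n - 1, loc=mean, scale=sem)
print("95% CI for the mean:", ci_low, "to", ci_high)

# Hypothesized means inside the interval are not rejected at the 0.05 level;
# hypothesized means outside it are rejected.
for mu_0 in (ci_low - 10, ci_low + 10, mean, ci_high - 10, ci_high + 10):
    p_value = stats.ttest_1samp(sample, mu_0).pvalue
    print(f"mu_0 = {mu_0:10.1f}  p-value = {p_value:.4f}")
```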

H0: the mean of the population that sample A was drawn from is 9900

Ha: the mean of the population that sample A was drawn from is not 9900

In [7]:
stats.ttest_1samp(sample_A, 9900)

Ttest_1sampResult(statistic=0.7674748830976801, pvalue=0.4464817867958044)

### Two-Sample t-Test - sample A vs sample B

H0: The populations that sample A and B were sampled from have the same mean

Ha: The populations that sample A and B were sampled from have different means

In [8]:
stats.ttest_ind(sample_A, sample_B)

Ttest_indResult(statistic=74.51335237754252, pvalue=4.243682582586811e-88)

With such a tiny p-value, we reject the null hypothesis in favor of the alternative hypothesis: the two populations have different means.
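Samples A and B were generated with very different standard deviations (1000 vs 100). By default `stats.ttest_ind` assumes equal population variances; passing `equal_var=False` runs Welch's t-test, which drops that assumption. A minimal sketch on illustrative data (the seed and samples are assumptions):

```python
import numpy as np
import scipy.stats as stats

# Illustrative data with unequal spreads, mirroring samples A and B.
rng = np.random.default_rng(3)
wide = rng.normal(loc=10000, scale=1000, size=50)
narrow = rng.normal(loc=1000, scale=100, size=50)

pooled = stats.ttest_ind(wide, narrow)                   # assumes equal variances
welch = stats.ttest_ind(wide, narrow, equal_var=False)   # Welch's t-test

# With equal sample sizes the two statistics coincide; only the degrees of
# freedom (and therefore the p-value) differ.
print("pooled:", pooled.statistic, pooled.pvalue)
print("welch: ", welch.statistic, welch.pvalue)
```

Here the conclusion is the same either way because the means are so far apart, but Welch's version is the safer choice when the spreads clearly differ.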

### Two-Sample t-Test - sample B vs sample C

H0: The populations that sample B and C were sampled from have the same mean

Ha: The populations that sample B and C were sampled from have different means

In [9]:
stats.ttest_ind(sample_B, sample_C)

Ttest_indResult(statistic=-1.0919749985163272, pvalue=0.27752136174262915)

With a p-value of about 0.28, well above 0.05, we are unable to reject the null hypothesis. This is not proof that the null hypothesis is true; it only means the data give no significant evidence that samples B and C were drawn from populations with different means.

### Two-Sample t-Test - sample D vs sample E

Let's make the population means different (1000 vs 1005)

H0: The populations that sample D and E were sampled from have the same mean

Ha: The populations that sample D and E were sampled from have different means

In [10]:
sample_D = stats.norm.rvs(1000, 100, 50)
sample_E = stats.norm.rvs(1005, 100, 50)

stats.ttest_ind(sample_D, sample_E)

Ttest_indResult(statistic=-0.5612732616510778, pvalue=0.5758924845791201)

The p-value is still relatively large (about 0.58), so we are unable to reject the null hypothesis.
Even though the population means are different, the difference between the sample means is not statistically significant.


### Two-Sample t-Test - sample D vs sample E

Let's make the population means somewhat more different (1000 vs 1015)

H0: The populations that sample D and E were sampled from have the same mean

Ha: The populations that sample D and E were sampled from have different means

In [11]:
sample_D = stats.norm.rvs(1000, 100, 50)
sample_E = stats.norm.rvs(1015, 100, 50)

stats.ttest_ind(sample_D, sample_E)

Ttest_indResult(statistic=-0.23023251455235985, pvalue=0.818391006052364)

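The p-value (about 0.82) is even larger than in the previous run, despite the larger gap between the population means; this reflects sampling variability from one random draw to the next. It is well above 0.05 (a conventional significance threshold, common in social science research), so we do not reject the null hypothesis: the difference between the sample means is not statistically significant.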

### Two-Sample t-Test - sample D vs sample E

Let's make the population means even more different (1000 vs 1030)

H0: The populations that sample D and E were sampled from have the same mean

Ha: The populations that sample D and E were sampled from have different means

In [12]:
sample_D = stats.norm.rvs(1000, 100, 50)
sample_E = stats.norm.rvs(1030, 100, 50)

stats.ttest_ind(sample_D, sample_E)

Ttest_indResult(statistic=-1.4255466585372618, pvalue=0.15717665187640334)

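The p-value (about 0.16) is smaller than before but still above the conventional 0.05 threshold, so we still cannot reject the null hypothesis.
With 50 observations per group and a standard deviation of 100, the standard error of the difference in means is about 100 × sqrt(2/50) = 20, so a true difference of 30 is only about 1.5 standard errors and is often missed by the test.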
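Because each run draws new random samples, a single p-value says little about how reliably the test detects a given difference. A rough way to check is to repeat the experiment many times and count how often the null hypothesis is rejected. The sketch below assumes the same setup as above (50 observations per group, standard deviation 100); the seed and the number of repetitions are arbitrary choices.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)
n_simulations = 2000   # repetitions per scenario (arbitrary choice)
sample_size = 50
alpha = 0.05

rejection_rates = {}
for true_difference in (5, 15, 30):
    # Each row is one simulated experiment.
    group_d = rng.normal(1000, 100, size=(n_simulations, sample_size))
    group_e = rng.normal(1000 + true_difference, 100, size=(n_simulations, sample_size))

    # One two-sample t-test per row.
    p_values = stats.ttest_ind(group_d, group_e, axis=1).pvalue

    # Fraction of experiments in which H0 is rejected at the alpha level.
    rejection_rates[true_difference] = np.mean(p_values < alpha)
    print(f"difference = {true_difference:2d}  "
          f"rejection rate = {rejection_rates[true_difference]:.3f}")
```

The rejection rate is an estimate of the test's power for that difference; increasing `sample_size` raises it.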

### The End