# Hypothesis Testing (One Sample)

In [57]:
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import ttest_1samp

### One-sided Hypothesis Tests

#### Example: Pharmaceutical Company

A pharmaceutical company is testing a medication for lowering blood sugar and managing diabetes. A Hemoglobin A1c level below 5.7% is considered normal. The company has treated 100 study volunteers with the medication and would like to show that their mean A1c after treatment is below 5.7%, that is, to test H0: μ = 5.7 against H1: μ < 5.7. The code below uses a sample mean of 5.1% and a sample standard deviation of 1.6%.

In [44]:
pop_mean = 5.7
sample_mean = 5.1
sample_std = 1.6
n = 100
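# t statistic: distance of the sample mean from the hypothesized mean, in standard errors
statistic = (sample_mean - pop_mean)/(sample_std/np.sqrt(n))
# lower-tail p-value from the t-distribution with n-1 degrees of freedom;
# by symmetry, sf(|t|) equals cdf(t) because the statistic is negative here
pval = stats.t.sf(np.abs(statistic), n-1)
print(statistic)
print(pval)
# the p-value is the probability of observing a statistic at least this extreme by pure chance, given that H0 is true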

-3.750000000000003
0.0001489332089038242
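
The calculation above can be wrapped in a small helper so that the tail is chosen explicitly instead of through `np.abs`. This is only a sketch: the function name `t_test_from_summary` and its `alternative` argument are illustrative and are not part of SciPy.

```python
import numpy as np
from scipy import stats

def t_test_from_summary(sample_mean, sample_std, n, pop_mean, alternative="two-sided"):
    """One-sample t-test from summary statistics (hypothetical helper)."""
    se = sample_std / np.sqrt(n)           # standard error of the mean
    t_stat = (sample_mean - pop_mean) / se
    df = n - 1
    if alternative == "less":              # H1: mu < pop_mean
        p = stats.t.cdf(t_stat, df)
    elif alternative == "greater":         # H1: mu > pop_mean
        p = stats.t.sf(t_stat, df)
    elif alternative == "two-sided":       # H1: mu != pop_mean
        p = 2 * stats.t.sf(np.abs(t_stat), df)
    else:
        raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")
    return t_stat, p

# Pharmaceutical example: H1 says the mean A1c is below 5.7
t_stat, p_less = t_test_from_summary(5.1, 1.6, 100, 5.7, alternative="less")
_, p_greater = t_test_from_summary(5.1, 1.6, 100, 5.7, alternative="greater")
print(t_stat, p_less, p_greater)
```

Choosing the wrong tail (`"greater"` here) gives a p-value close to 1, which is why `sf(np.abs(statistic))` is only safe when the statistic already points in the direction of H1.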


In [45]:
# Confidence Interval
# We are 95% confident that the population mean lies within this interval
stats.t.interval(0.95, df=n-1, loc=sample_mean, scale=(sample_std/np.sqrt(n)))

(4.78252528775861, 5.417474712241389)
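
To see what `stats.t.interval` does, the same interval can be built by hand as sample mean ± critical value × standard error. This is a minimal sketch using the numbers from the pharmaceutical example; the variable names `t_crit`, `manual_ci` and `scipy_ci` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

sample_mean, sample_std, n = 5.1, 1.6, 100
se = sample_std / np.sqrt(n)               # standard error of the mean

# two-sided 95% interval: 2.5% in each tail, so use the 97.5th percentile
t_crit = stats.t.ppf(0.975, df=n - 1)
manual_ci = (sample_mean - t_crit * se, sample_mean + t_crit * se)

scipy_ci = stats.t.interval(0.95, df=n - 1, loc=sample_mean, scale=se)
print(manual_ci)
print(scipy_ci)
```

Both tuples should agree up to floating-point rounding. The hypothesized value 5.7 lies above the upper bound, which is consistent with the small p-value of the test.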

#### Example: Municipal Children's Home

Boys of a certain age are known to have a mean weight of μ = 85 pounds. A complaint is made that the boys living in a municipal children's home are underfed and thus underweight, which calls for a one-sided test (H0: μ = 85 against H1: μ < 85). As one piece of evidence, n = 25 boys (of the same age) are weighed and found to have a mean weight of 80.94 pounds. It is known that the population standard deviation σ is 11.6 pounds (the unrealistic part of this example!).  
Based on the available data, what should be concluded concerning the complaint?

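In [46]:
# your code here
# the hypothesized population mean is 85; the observed sample mean is 80.94
pop_mean = 85
sample_mean = 80.94
sample_std = 11.6 # this is the known population sigma, kept under the same variable name
n = 25

statistic = (sample_mean - pop_mean)/(sample_std/np.sqrt(n))
# sigma is known, so a z-test would be the textbook choice;
# the t-distribution used here gives a slightly larger p-value
pval = stats.t.sf(np.abs(statistic), n-1) # lower-tail p-value, valid because the statistic is negative
print(statistic)
print(pval)

-1.750000000000001
0.046447544473094286


In [47]:
# Confidence Interval, centered on the sample mean
ci = stats.t.interval(0.95, df=n-1, loc=sample_mean, scale=(sample_std/np.sqrt(n)))
print(np.round(ci, 2))

[76.15 85.73]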
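
Because the problem states that the population standard deviation σ is known, the standard normal distribution can be used instead of the t-distribution. The following sketch shows that z-test variant; the names `pop_sigma`, `p_lower` and `z_ci` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

pop_mean = 85.0        # hypothesized mean weight under H0
sample_mean = 80.94    # observed mean of the 25 boys
pop_sigma = 11.6       # known population standard deviation
n = 25

se = pop_sigma / np.sqrt(n)
z = (sample_mean - pop_mean) / se

# H1: mu < 85, so the p-value is the lower-tail probability
p_lower = stats.norm.cdf(z)

# two-sided 95% interval around the sample mean, using normal quantiles
z_ci = stats.norm.interval(0.95, loc=sample_mean, scale=se)
print(z, p_lower)
print(z_ci)
```

The z-based p-value is somewhat smaller than the t-based one because the normal distribution has lighter tails than a t-distribution with 24 degrees of freedom.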

### Two-sided Hypothesis Tests

#### Example: Honolulu Heart Study

It is assumed that the mean systolic blood pressure is μ = 120 mm Hg. In the Honolulu Heart Study, a sample of n = 100 people had an average systolic blood pressure of 130.1 mm Hg with a standard deviation of 21.21 mm Hg. Is the group significantly different from the general population with respect to systolic blood pressure?

In [48]:
pop_mean = 120
sample_mean = 130.1
sample_std = 21.21
n = 100
statistic = (sample_mean - pop_mean)/(sample_std/np.sqrt(n))
pval = stats.t.sf(np.abs(statistic), n-1)*2 # two-sided test: double the one-tail probability
print(statistic)
print(pval)

4.761904761904759
6.562701817208617e-06


In [49]:
# Confidence Interval
stats.t.interval(0.95, df=n-1, loc=sample_mean, scale=(sample_std/np.sqrt(n)))

(125.89147584585008, 134.30852415414992)
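
A two-sided test at level α and a (1 − α) confidence interval lead to the same decision: H0 is rejected exactly when the hypothesized mean falls outside the interval. The sketch below checks this for the Honolulu numbers; `alpha`, `reject_by_pvalue` and `reject_by_interval` are names introduced here for illustration.

```python
import numpy as np
from scipy import stats

pop_mean, sample_mean, sample_std, n = 120, 130.1, 21.21, 100
alpha = 0.05

se = sample_std / np.sqrt(n)
statistic = (sample_mean - pop_mean) / se
pval = 2 * stats.t.sf(np.abs(statistic), n - 1)   # two-sided p-value

lower, upper = stats.t.interval(1 - alpha, df=n - 1, loc=sample_mean, scale=se)

reject_by_pvalue = bool(pval < alpha)
reject_by_interval = not (lower <= pop_mean <= upper)
print(reject_by_pvalue, reject_by_interval)
```

Here both checks agree that the group differs from μ = 120, since 120 lies below the lower bound of the interval.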

## Using data arrays

#### Generating 100 draws from a standard normal random variable

In [50]:
X = stats.norm(0, 1).rvs(size = 100)
print(X)

[-0.0758169   0.07699717  0.12301751  0.54543395 -1.16604264 -0.65512718
  0.89482722  0.59495611 -1.12732664  0.57850042 -2.297386    1.06786607
  1.15718644 -0.23189231  1.97199524  0.74520968 -1.06286837 -0.00670112
  0.14112174 -1.2858427  -0.40926177 -1.19964425  2.3169073  -1.19655971
 -0.5557465   1.17632065  0.62816248  0.25679331  1.88066522 -1.51435854
  0.12843899  0.23245213  0.41584989 -0.75309708 -0.42287099  0.3035092
  0.52835068  1.31204127  0.79095117 -1.92930371  1.38843501  0.49322562
 -1.91669852  0.17831097  0.52846128  2.77149932 -0.32381244  0.23785869
  0.06667989  0.13966841  1.09589284 -0.13329511  0.83489612  0.25197856
 -0.50031012  0.1391542   0.10282238 -0.40276017  1.18232596  0.39688147
  1.33797897 -0.77300062  0.69409556 -1.5750415   0.80489598  1.05957535
  0.39657565  1.25864629  0.53412303 -0.74147238 -1.22586525  0.69825177
  0.20349556  0.86092359  0.48900902  0.97990568  1.56790482 -0.32058991
 -1.08803462 -0.14222332 -0.90596648  0.71956841  0.

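#### Test if the mean of X is equal to 0

In [52]:
# H0: mean = 0. This result is not displayed because it is not the cell's last expression.
stats.ttest_1samp(X, 0)
# H0: mean = 5. This is the result printed below; the null is false by construction, so the p-value is tiny.
print(stats.ttest_1samp(X,5))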

Ttest_1sampResult(statistic=-48.99952874601842, pvalue=3.1472524231622514e-71)
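
The draws above change on every run, so the exact statistic cannot be reproduced. A seeded generator makes the experiment repeatable, and the result of `ttest_1samp` can be checked against the formula used in the earlier examples. This is a sketch: the seed value and the names `res0`, `res5` and `manual_t5` are arbitrary choices made here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)   # arbitrary seed, only for reproducibility
X = rng.standard_normal(100)          # 100 draws from N(0, 1)

res0 = stats.ttest_1samp(X, 0)        # H0: mean = 0 (true by construction)
res5 = stats.ttest_1samp(X, 5)        # H0: mean = 5 (false by construction)

# same statistic by hand: (sample mean - mu0) / (sample std / sqrt(n)),
# with ddof=1 so the standard deviation uses the n-1 denominator
manual_t5 = (X.mean() - 5) / (X.std(ddof=1) / np.sqrt(len(X)))

print(res0.statistic, res0.pvalue)
print(res5.statistic, res5.pvalue)
print(manual_t5)
```

`ttest_1samp` always returns a two-sided p-value by default, matching the `*2` in the Honolulu example.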


#### Using actual data

In [53]:
data = pd.read_csv('Fitbit2.csv') 
data.head()

Unnamed: 0,Date,Calorie burned,Steps,Distance,Floors,Minutes Sedentary,Minutes Lightly Active,Minutes Fairly Active,Minutes Very Active,Activity Calories,...,Distance_miles,Days,Days_encoded,Work_or_Weekend,Hours Sleep,Sleep efficiency,Yesterday_sleep,Yesterday_sleep_efficiency,Months,Months_encoded
0,2015-05-08,1934,905,0.65,0,1.355,46,0,0,1680,...,0.403891,Friday,4.0,1,6.4,92.086331,0.0,0.0,May,5
1,2015-05-09,3631,18925,14.11,4,611.0,316,61,60,2248,...,8.767545,Saturday,5.0,0,7.566667,92.464358,6.4,92.086331,May,5
2,2015-05-10,3204,14228,10.57,1,602.0,226,14,77,1719,...,6.567891,Sunday,6.0,0,6.45,88.761468,7.566667,92.464358,May,5
3,2015-05-11,2673,6756,5.02,8,749.0,190,23,4,9620,...,3.119282,Monday,0.0,1,5.183333,88.857143,6.45,88.761468,May,5
4,2015-05-12,2495,502,3.73,1,876.0,171,0,0,7360,...,2.317714,Tuesday,1.0,1,6.783333,82.892057,5.183333,88.857143,May,5


In [54]:
data.describe()

Unnamed: 0,Calorie burned,Steps,Distance,Floors,Minutes Sedentary,Minutes Lightly Active,Minutes Fairly Active,Minutes Very Active,Activity Calories,MinutesOfSleep,...,NumberOfAwakings,LengthOfRestInMinutes,Distance_miles,Days_encoded,Work_or_Weekend,Hours Sleep,Sleep efficiency,Yesterday_sleep,Yesterday_sleep_efficiency,Months_encoded
count,367.0,367.0,367.0,367.0,367.0,367.0,367.0,367.0,367.0,367.0,...,367.0,367.0,367.0,367.0,367.0,367.0,367.0,367.0,367.0,367.0
mean,2741.501362,10121.588556,8.549128,11.724796,563.934482,236.405995,26.163488,35.722071,2044.147139,290.479564,...,16.196185,321.343324,5.31218,3.010899,0.713896,4.841326,76.362799,4.818529,76.119842,6.501362
std,916.307036,5594.836225,3.409881,10.33737,294.793145,86.531376,20.319456,31.006682,2041.267168,154.752328,...,10.757622,170.786726,2.118801,1.998604,0.452555,2.579205,32.973194,2.58493,33.206279,3.459267
min,179.0,0.0,0.0,0.0,1.002,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,2698.0,6730.5,6.155,5.0,520.0,179.0,8.0,10.5,1218.5,224.0,...,7.0,248.0,3.824539,1.0,0.0,3.733333,86.238532,3.725,86.233673,3.5
50%,2974.0,10413.0,8.29,11.0,663.0,226.0,24.0,29.0,1553.0,337.0,...,16.0,370.0,5.151166,3.0,1.0,5.616667,89.433962,5.6,89.433962,7.0
75%,3233.0,13916.5,10.56,16.0,756.5,290.0,41.5,54.0,1927.5,400.5,...,24.0,440.5,6.561678,5.0,1.0,6.675,92.438419,6.65,92.438419,9.5
max,4351.0,26444.0,20.45,101.0,998.0,472.0,101.0,153.0,9830.0,553.0,...,45.0,607.0,12.707037,6.0,1.0,9.216667,100.0,9.216667,100.0,12.0


In [55]:
stats.ttest_1samp(data['Distance'], 8.5)

Ttest_1sampResult(statistic=0.2760091723270195, pvalue=0.7826967998541668)

In [56]:
stats.ttest_1samp(data['Distance'], 20)

Ttest_1sampResult(statistic=-64.33279347915331, pvalue=1.3621863860696342e-201)
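
The two results above can be cross-checked from the `data.describe()` output alone, since the t statistic only needs the count, mean and standard deviation of `Distance`. The sketch below plugs in the rounded values shown in the summary table (so the results agree with `ttest_1samp` only up to rounding); the dictionary `results` is a name introduced here.

```python
import numpy as np
from scipy import stats

# values for the Distance column, copied (rounded) from data.describe()
n = 367
mean_distance = 8.549128
std_distance = 3.409881   # pandas reports the sample std (n-1 denominator)

se = std_distance / np.sqrt(n)

results = {}
for mu0 in (8.5, 20):
    t_stat = (mean_distance - mu0) / se
    p_two_sided = 2 * stats.t.sf(np.abs(t_stat), n - 1)
    results[mu0] = (t_stat, p_two_sided)
    print(mu0, t_stat, p_two_sided)
```

A hypothesized mean of 8.5 is well within what the data support, while 20 is more than 64 standard errors away from the sample mean, so it is rejected decisively.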