In [46]:
from __future__ import print_function
__author__ = "Sung Hoon Yang, CUSP NYU 2018"
import numpy as np
import pandas as pd
import matplotlib
font = {'family' : 'sans-serif',
        'weight' : 'bold',
        'size'   : 88}

matplotlib.rc('font', **font)
import matplotlib.pyplot as plt
import seaborn as sns
import os
import re
%matplotlib inline
np.random.seed(999)

"""
Set up PUIDATA env var
"""
os.environ["PUIDATA"] = "%s/fall18/PUI/PUIDATA" % os.getenv("HOME")

## Assignment 3: Finish z-test lab and turn it in as a notebook.

What I am looking for here is a good Null/Alternative hypothesis statement and treatment, with a clear Null and Alternative spelled out AND written out as a formula, and a good interpretation of the Z value you obtain in terms of ability or inability to reject the Null Hypothesis. 
Here is the formula:

<img src="http://bit.ly/2N3HGT6" align="center" border="0" alt="Z = \frac{\mu_{pop} - \mu_{sample}}{\sigma / \sqrt{N}}" width="154" height="44"/>

This is also in the slides attached (in a more readable format).

The chapter of _Statistics In a Nutshell_ that covers these topics is called Inferential statistics. 
It is chapter 3 in the hard copies of the book in the CUSP library, 
but it was moved to chapter 7 in the online book version, which is in the link. The content is more or less the same.


### GRADING: 

Your notebook must display
- the complete formulation of the hypothesis (Null and Alternative) to be tested in words and formula
- the download of the data (which is in https://github.com/fedhere/PUI2018_fb55/blob/master/Lab4_fb55/times.txt, but you must get the raw data!)
- the calculation of the z statistic (with the given formula and the data processed from the data file)
- the comparison of the statistic with the significance threshold and the conclusions about the Null Hypothesis



In [64]:
!curl https://github.com/fedhere/PUI2018_fb55/blob/master/Lab4_fb55/times.txt | sed 's/<\/*[^>]*>//g' > /nfshome/shy256/fall18/PUI/PUIDATA/times.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 61828    0 61828    0     0   247k      0 --:--:-- --:--:-- --:--:--  248k
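
The grading notes ask for the raw data, whereas the command above fetches the rendered GitHub page and strips its HTML tags. A minimal sketch of an alternative, assuming GitHub's usual `raw.githubusercontent.com` URL layout applies to this file; `to_raw_url` and `download_raw` are hypothetical helper names that are not part of the original notebook:

```python
import os

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2

BLOB_URL = "https://github.com/fedhere/PUI2018_fb55/blob/master/Lab4_fb55/times.txt"

def to_raw_url(blob_url):
    """Rewrite a github.com 'blob' page URL into its raw-content URL (assumed layout)."""
    return (blob_url
            .replace("https://github.com/", "https://raw.githubusercontent.com/", 1)
            .replace("/blob/", "/", 1))

def download_raw(blob_url, dest_dir):
    """Download the raw file into dest_dir and return the local path."""
    dest = os.path.join(dest_dir, os.path.basename(blob_url))
    urlretrieve(to_raw_url(blob_url), dest)
    return dest

print(to_raw_url(BLOB_URL))
# To fetch the file: download_raw(BLOB_URL, os.getenv("PUIDATA"))
```

With the raw file on disk, no HTML stripping is needed before parsing the numbers.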


In [65]:
values = []
# Each line of the scraped page may hold one time, formatted like 12.34567890
pattern = re.compile(r"[0-9]{2}\.[0-9]{8}")
with open(os.path.join(os.getenv('PUIDATA'), 'times.txt'), 'r') as f:
    for line in f:
        match = pattern.search(line)
        if match:  # skip lines without a matching number
            values.append(float(match.group(0)))
ar = np.asarray(values)
ar.mean(), ar.std(), len(ar)

(34.268918561807233, 6.8815039221860799, 83)

In [66]:
__POPULATION_MEAN__ = 36
__POPULATION_STD__ = 6 

#### Hypothesis Testing
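I set up the hypothesis test as follows, where $\mu_{sample}$ is the mean of the process that produced the sampled times and $\mu_{pop} = 36$ is the known population mean:
$$
H_0: \mu_{sample} = \mu_{pop} = 36
$$
$$
H_1: \mu_{sample} \neq \mu_{pop}
$$
Since the alternative hypothesis is an inequality, the test is two-tailed. I will perform the test at each significance level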
$$
\alpha \in \{0.1, 0.05, 0.01\}
$$
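
For a two-tailed test, each significance level corresponds to a critical value $z^*$ with $P(|Z| > z^*) = \alpha$. A minimal sketch of how these could be computed instead of looked up in a table, assuming Python 3.8+ for the standard-library `statistics.NormalDist`; `two_tailed_critical_z` is a hypothetical helper name:

```python
from statistics import NormalDist  # standard library, Python 3.8+

def two_tailed_critical_z(alpha):
    """Return z* such that P(|Z| > z*) = alpha for a standard normal Z."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

critical = {alpha: two_tailed_critical_z(alpha) for alpha in (0.1, 0.05, 0.01)}
for alpha, z_star in critical.items():
    print("alpha: %.2f | critical z: %.3f" % (alpha, z_star))
```

The Null Hypothesis is rejected at level $\alpha$ when the absolute value of the z statistic exceeds the corresponding critical value.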
    

In [37]:
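from math import sqrt
# Hard-coded sample mean and standard deviation; N is the number of parsed values.
_ex, _sd, N = 34.466161688299998, 7.1015040681937762, len(values)
df = pd.DataFrame(values, columns=['val'])
# Deviation of each observation from the sample mean, in units of the standard error.
df['z-stat'] = (df['val'] - _ex) / (_sd / sqrt(N))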

In [38]:
df['z-stat'].head(7)

0   -4.004676
1   -2.316108
2   -5.966427
3   -4.298238
4    6.399192
5    0.501186
6    7.612080
Name: z-stat, dtype: float64

In [18]:
# Two-tailed critical z values and confidence intervals
critical_z = {
    .1: 1.645,
    .05: 1.96,
    .01: 2.576
}
print("z stat for 36: %.2f" % ((36 - _ex) / (_sd / sqrt(N))))
for a, z_crit in critical_z.items():
    print("alpha: %.2f\t| z-score: %.2f\t" % (a, z_crit))
    print("This is %d%% Confidence Interval (C.I.)" % round(100 * (1 - a)))
    print("C.I: (%.2f, %.2f)" % (-1.0 * z_crit * _sd / sqrt(N) + _ex, 1.0 * z_crit * _sd / sqrt(N) + _ex))


z stat for 36: 2.16
alpha: 0.10	| z-score: 1.65	
This is 90% Confidence Interval (C.I.)
C.I: (33.30, 35.63)
alpha: 0.05	| z-score: 1.96	
This is 95% Confidence Interval (C.I.)
C.I: (33.07, 35.86)
alpha: 0.01	| z-score: 2.58	
This is 99% Confidence Interval (C.I.)
C.I: (32.64, 36.30)


#### According to Confidence Intervals
* The 90% C.I. around the sample mean is (33.30, 35.63), which does not include 36. The z stat for 36, 2.16, is also greater than the critical value 1.645. So at alpha = 0.1 the test result is statistically significant and we reject the Null Hypothesis.
* The 95% C.I. around the sample mean is (33.07, 35.86), which does not include 36. The z stat for 36, 2.16, is also greater than the critical value 1.96. So at alpha = 0.05 the test result is statistically significant and we reject the Null Hypothesis.
* The 99% C.I. around the sample mean is (32.64, 36.30), which DOES include 36. The z stat for 36, 2.16, is also smaller than the critical value 2.576. So at alpha = 0.01 the test result is statistically insignificant and we DO NOT reject the Null Hypothesis.
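
The same decisions can be reached from a p-value rather than from a table of critical values. A minimal sketch, assuming the hard-coded sample mean and standard deviation used above and N = 100 (an assumption, since the notebook takes N from `len(values)`); `z_test` is a hypothetical helper, not part of the original notebook:

```python
from math import erfc, sqrt

def z_test(pop_mean, sample_mean, sigma, n):
    """Return (z, two-tailed p-value) for Z = (pop_mean - sample_mean) / (sigma / sqrt(n))."""
    z = (pop_mean - sample_mean) / (sigma / sqrt(n))
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

# Values hard-coded in the cells above; n = 100 is an assumption.
z, p = z_test(pop_mean=36, sample_mean=34.466161688299998,
              sigma=7.1015040681937762, n=100)
print("z = %.2f, two-tailed p = %.4f" % (z, p))

# Reject the Null Hypothesis whenever the p-value falls below alpha.
decisions = {alpha: p < alpha for alpha in (0.1, 0.05, 0.01)}
for alpha, reject in decisions.items():
    print("alpha = %.2f: %s the Null Hypothesis"
          % (alpha, "reject" if reject else "do not reject"))
```

Passing the population standard deviation (6) as `sigma` would follow the assignment formula more literally; the sample standard deviation is used here only to mirror the cell above.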



In [None]:
## End of Notebook