# Hypothesis Testing

In [1]:
import numpy as np
import scipy.stats as stats

## Payphone Fills

Past data have shown that if the payphones at the airport are emptied every 14 days, the coin collectors will be 70% full on the average. The phone company tries to schedule the collection visits at 70% full because money is lost if the phones get full and unusable, but visiting the phones too frequently is also an expense. The company keeps data on fill amounts during collection, in case they need to increase or decrease collection frequency. During the last visit, suppose that 5 phones were 50%, 40%, 70%, 75%, and 45% full, respectively. Do you think the frequency of visits needs to be changed, or is this just chance variation? (95% confidence is fine.)
Hint: Note that action will be taken (increasing or decreasing visits) if the average fill shifts away from the mean (70%) in either direction. State your hypotheses based on this fact.

Hint 2: You’ll need a p-value calculator to find the p-value: https://www.graphpad.com/quickcalcs/pvalue1.cfm 


Thoughts:

Sample **standard deviation** (std dev) is essentially how far each observation is from the mean on average. It is a measure of a distribution's spread. If the std dev is low, the distribution is tight around the mean, reflecting more certainty. If it is large, the sampled data is more spread out (i.e. more variance). Std dev is expressed in the units of the metric in question: for salary data it will be in the tens or hundreds of thousands, while for age it will be much smaller. Interpreting a std dev therefore requires knowing the scale of the underlying data.
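
To make the point about scale concrete, here is a small sketch that computes the sample standard deviation of the five fill readings from the problem statement twice: once in percent and once as fractions of a full collector. The variable names are illustrative only; the spread of the data is identical, but the number changes with the units.

```python
import numpy as np

# Fill readings from the problem statement, in percent
fills_pct = np.array([50, 40, 70, 75, 45])
# The same readings expressed as fractions of a full collector
fills_frac = fills_pct / 100

s_pct = fills_pct.std(ddof=1)    # sample standard deviation, in percentage points
s_frac = fills_frac.std(ddof=1)  # same spread, in fraction units

print("std dev (percent):  {:.4f}".format(s_pct))
print("std dev (fraction): {:.6f}".format(s_frac))
```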

It is useful to remove the scale of the std dev by converting it to a standardized statistic. With large n (a common rule of thumb is n ≥ 30) a **Z-score** can be used. A Z-score converts each data point to a dimensionless value (how many std devs it lies from the mean), which allows comparison against the standard normal distribution. This centers the distribution on 0 and gives it unit spread without changing its shape. The related Z-statistic measures how far the sample mean is from the known mean, in units of standard error. The Z-statistic then maps directly to a p-value, depending on whether the test is one- or two-tailed.
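
As a minimal sketch of standardization, the snippet below converts the five fill readings from the problem statement to Z-scores using the sample mean and sample standard deviation. With only five points this is purely illustrative; it shows the mechanics, not a case where a Z-test would be appropriate.

```python
import numpy as np

fills = np.array([50, 40, 70, 75, 45])

# z-score: distance of each observation from the sample mean, in standard deviations
z_scores = (fills - fills.mean()) / fills.std(ddof=1)

print(np.round(z_scores, 3))
# After standardizing, the values are centered on 0 with unit sample std dev
print("std of z-scores: {:.3f}".format(z_scores.std(ddof=1)))
```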

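In the examples below we use the **t-statistic** which, like the Z-statistic, measures how far the sample mean is from the hypothesized mean, but is suited to small samples where the standard deviation must be estimated from the data. With less data there is less certainty, so the t-distribution is wider (heavier-tailed) than the Z-distribution.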
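
One way to see the wider tails is to compare critical values. The sketch below prints the two-tailed 95% cutoff of the t-distribution for a few degrees of freedom (chosen arbitrarily for illustration) next to the corresponding cutoff of the standard normal distribution.

```python
import scipy.stats as stats

# Two-tailed 95% critical value of the standard normal (Z) distribution
z_crit = stats.norm.ppf(0.975)

# Same quantile of the t-distribution for a few degrees of freedom
for dof in (4, 9, 30, 1000):
    t_crit = stats.t.ppf(0.975, dof)
    print("dof={:>4}: t_crit={:.3f}  z_crit={:.3f}".format(dof, t_crit, z_crit))
```

The t cutoff is largest for small samples and approaches the normal cutoff as the degrees of freedom grow.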

The **p-value** is the probability of seeing a result at least as extreme as the one in our sample, assuming the null hypothesis is true (that is, assuming any difference is due to random chance alone). A small p-value means the sample would be unusual if nothing had really changed. This is useful when measuring change, to understand whether the differences we see in a new sample are real and worth acting on. We want reasonable assurance that the changes we make based on new observations will lead us to our desired destination, and to avoid chasing our tails each time a new set of information comes our way.
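
The definition can be checked with a rough simulation. The sketch below draws many hypothetical 5-phone visits from a world where the null hypothesis (mean fill of 70%) is true and counts how often the simulated t-statistic is at least as extreme as the one from the real readings. It assumes fills are normally distributed; the spread of 15 and the seed are arbitrary choices, and `t_stats` is a helper written only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

fills = np.array([50, 40, 70, 75, 45])
mu, n = 70, len(fills)

def t_stats(samples, mu):
    """t-statistic of each row of `samples` against the hypothesized mean `mu`."""
    means = samples.mean(axis=-1)
    sds = samples.std(axis=-1, ddof=1)
    return (means - mu) / (sds / np.sqrt(samples.shape[-1]))

t_obs = t_stats(fills, mu)

# Simulate many 5-phone visits in a world where H0 is true.
# Assumption: normally distributed fills; the spread (15) is arbitrary,
# because the t-statistic does not depend on it.
simulated = rng.normal(loc=mu, scale=15, size=(200_000, n))
t_sim = t_stats(simulated, mu)

# Two-tailed p-value: share of simulated visits at least as extreme as ours
p_sim = np.mean(np.abs(t_sim) >= abs(t_obs))
print("simulated p-value: {:.3f}".format(p_sim))
```

The simulated share should land close to the two-tailed p-value obtained from the t-distribution for the same data.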

In [2]:
def sample_std_dev(array):
    sample_mu = array.mean()
    n = len(array)
    return np.sqrt(np.sum((array - sample_mu) ** 2 )/ (n-1))

def t_statistic(samples_array, mu):
    sample_mu = samples_array.mean()
    s = sample_std_dev(samples_array)
    n = len(samples_array)
    return (sample_mu - mu) / (s / np.sqrt(n))

def p_value_from_t_statistic(t_stat, dof, tails=1):
    """p-value for a t-statistic; tails=1 is one-tailed, any other value is two-tailed."""
    # Area in the tail beyond |t_stat|. The t-distribution is symmetric around 0,
    # so this handles both positive and negative t_stat.
    tail_area = stats.t.sf(abs(t_stat), dof)
    return tail_area if tails == 1 else 2 * tail_area

Given fill data:

In [3]:
fills = np.array([50, 40, 70, 75, 45])

We need to compute t as shown below:

$$ t = \frac{\bar{x} - \mu}{\frac{s}{\sqrt{n}}} $$

In [4]:
x_bar = fills.mean()
x_bar

56.0

\\(\mu\\) is the average fill that we test our hypothesis against. In this case, historical data shows that the average fullness of each payphone is 70% after 14 days.

In [6]:
mu = 70 # Null Hypothesis H0

The sample standard deviation uses \\(n-1\\) degrees of freedom.

In [7]:
# With formula
s = sample_std_dev(fills)
print("{:.4f}".format(s))

15.5724


In [8]:
# With numpy
s = fills.std(ddof=1)
print("{:.4f}".format(s))

15.5724


n is our number of samples

In [9]:
# With len function
n = len(fills)

In [12]:
# With numpy shape attribute
fills.shape

(5,)

In [13]:
n = fills.shape[0]
print("n: {}".format(n))

n: 5


In [14]:
t_stat = t_statistic(fills, mu)
print("t_statistic: {:.4f}".format(t_stat))

t_statistic: -2.0103
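
Plugging the values computed above into the formula reproduces this result by hand (intermediate values rounded to four decimals):

$$ t = \frac{56 - 70}{15.5724 / \sqrt{5}} \approx \frac{-14}{6.9642} \approx -2.0103 $$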


Two-tailed significance level.

In [18]:
alpha = 1.0 - 0.95  # total significance level, covering both tails
print('alpha: {:.3f}'.format(alpha))

alpha: 0.050


Since we are investigating whether we should increase OR decrease the collection frequency, we need to test whether a change is real in either direction. This is known as a two-tailed test. To account for both directions, the 5% significance level is split between the two sides of the distribution, 2.5% in each tail. The two-tailed p-value computed below already adds up both tails, so with a required 95% confidence it must be < 0.05 for us to accept the change.
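
An equivalent way to run a two-tailed test is to compare the t-statistic with a critical value instead of computing a p-value. The sketch below is self-contained and recomputes the statistic from the fill readings; `tail_area`, `t_crit`, and `reject` are names introduced here for illustration.

```python
import numpy as np
import scipy.stats as stats

fills = np.array([50, 40, 70, 75, 45])
mu = 70
confidence = 0.95

# Each tail gets half of the 5% significance level
tail_area = (1 - confidence) / 2
t_crit = stats.t.ppf(1 - tail_area, df=len(fills) - 1)

t_stat = (fills.mean() - mu) / (fills.std(ddof=1) / np.sqrt(len(fills)))

# Reject H0 only if the t-statistic lands in either tail
reject = abs(t_stat) > t_crit
print("t_crit: {:.4f}, |t|: {:.4f}, reject H0: {}".format(t_crit, abs(t_stat), reject))
```

Comparing |t| with the cutoff at 2.5% per tail leads to the same decision as comparing the two-tailed p-value with 0.05.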

Accepting the change means rejecting the null hypothesis, which states that the mean fill is still 70% and there is no reason to change from the status quo. Any evidence that would get us to take action and make a change must be definitive, to whatever confidence level we choose. In other words, it should be hard to overturn the status quo.

In [15]:
# p-value from our helper (scipy.stats t-distribution); can also be found from the link above
p = p_value_from_t_statistic(t_stat, n-1, 2)
print('p-value: {:.4f}'.format(p))

p-value: 0.1148


In [16]:
# p-value from scipy stats (two-sided by default)
t_stat, p_val = stats.ttest_1samp(a=fills, popmean=mu)
print("t-statistic: {:.4f}".format(t_stat))
print("p-value: {:.4f}".format(p_val))

t-statistic: -2.0103
p-value: 0.1148


Can we reject \\(H_0\\)?

In [19]:
p <= alpha

False

We cannot reject the null hypothesis because our p-value is greater than our significance level: data like ours is not unusual enough when the null hypothesis is true. In other words, our data cannot show, with significance, that the average payphone fill after 14 days is not 70%, so there is no evidence that the collection frequency needs to change.
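
The same conclusion can be read from a confidence interval for the mean fill. This is a sketch of a closely related view, not part of the original exercise: if the 95% interval built from the sample contains 70, the two-tailed test at 95% confidence does not reject the null hypothesis.

```python
import numpy as np
import scipy.stats as stats

fills = np.array([50, 40, 70, 75, 45])
n = len(fills)
std_err = fills.std(ddof=1) / np.sqrt(n)  # standard error of the mean

# 95% confidence interval for the mean fill, based on the t-distribution
ci_low, ci_high = stats.t.interval(0.95, n - 1, loc=fills.mean(), scale=std_err)
print("95% CI: ({:.2f}, {:.2f})".format(ci_low, ci_high))
```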

## Trains

People in DC constantly complain that the metro consistently runs an average of 10 minutes late. You actually think it’s less than this, so you gather data for ten different trains at a specific location in DC. The following is your data in minutes of lateness: [4, 12, 6, 2, 1, 6, 7, 3, 16, 0]. Based on your data, are the people in DC correct?

In [96]:
h0 = 10
# h1 -> < 10 therefore left-tailed (one-tailed) test

# One-tailed significance value
C = 0.95
alpha = 1 - C

train_samples = np.array([4, 12, 6, 2, 1, 6, 7, 3, 16, 0])
n = len(train_samples)
sample_mean = train_samples.mean()
s = sample_std_dev(train_samples)

In [97]:
t_stat = t_statistic(train_samples, h0)
print("t_statistic: {:.4f}".format(t_stat))

t_statistic: -2.7129


In [98]:
p = p_value_from_t_statistic(t_stat, n-1, tails=1)
print('p-value: {:.4f}'.format(p))

p-value: 0.0119


In [99]:
# p-value from scipy stats; alternative='less' requests the left-tailed test (SciPy 1.6+)
t_stat, p_val = stats.ttest_1samp(a=train_samples, popmean=h0, alternative='less')
print("t-statistic: {:.4f}".format(t_stat))
print("p-value: {:.4f}".format(p_val))

t-statistic: -2.7129
p-value: 0.0119


In [100]:
p <= alpha

True

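We can reject the null hypothesis because our p-value is less than our significance level (alpha). In other words, if trains were 10 minutes late on average, there would be less than a 5% chance of observing a sample mean this low. The data supports the belief that the average lateness is less than 10 minutes, so the complaint is not borne out by this sample.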
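
The left-tailed decision can also be sketched with a critical value. Because the alternative hypothesis is "less than 10", the whole 5% rejection region sits in the lower tail. The snippet is self-contained; `t_crit` and `reject` are names introduced here for illustration.

```python
import numpy as np
import scipy.stats as stats

train_samples = np.array([4, 12, 6, 2, 1, 6, 7, 3, 16, 0])
h0 = 10
alpha = 0.05
n = len(train_samples)

t_stat = (train_samples.mean() - h0) / (train_samples.std(ddof=1) / np.sqrt(n))

# Left-tailed test: the cutoff is the 5th percentile of the t-distribution
t_crit = stats.t.ppf(alpha, df=n - 1)

# Reject H0 only if the t-statistic falls below the cutoff
reject = t_stat < t_crit
print("t_crit: {:.4f}, t: {:.4f}, reject H0: {}".format(t_crit, t_stat, reject))
```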