# $$\textit{testing}$$
$$\text{Schwartz}$$

_Significance testing_ is largely the product of _Karl Pearson_ (1857--1936), _William Sealy Gosset_ (1876--1937), and _Ronald Fisher_ (1890--1962), although evidence of its use dates back to _Pierre-Simon Laplace_ (1749--1827) in the 1770s. _Pearson_ created the notion of a _p-value_ and (Pearson's) _chi-squared test_ and founded the world's first statistics department at University College London in 1911. _Gosset_ developed and penned the _t-distribution_ and _t-test_ under the pseudonym _Student_ due to the objections of his employer -- the original Guinness Brewery in Dublin, Ireland -- regarding publication of internal practices. And _Fisher_ created _Analysis of Variance_ and popularized the use of a _null hypothesis_ for the so-called _significance test_. In addition to being regarded as the father of modern statistical science and experimental design, _Fisher_ also made significant contributions to agricultural biology and genetics. Indeed, Richard Dawkins named him "the greatest biologist since Darwin". 

_Hypothesis testing_ was developed by _Jerzy Neyman_ (1894--1981) and _Egon Pearson_ (1895--1980, son of Karl Pearson). Building on these ideas, Neyman later introduced _Confidence Intervals_ into the statistics landscape. At the time of the publication of their work on hypothesis testing in 1933, Neyman and Pearson (along with Fisher) were faculty members at University College London in the department of statistics (founded by the older Pearson). While Fisher, as a result of his agricultural background, emphasized rigorous experimental design and methods to extract a result from few samples assuming Gaussian distributions, Neyman (who teamed with the younger Pearson) emphasized mathematical rigor and methods to obtain more results from many samples and a wider range of distributions. 

Initially a _Bayesian_, Fisher sought to provide a more "objective" approach to inference. The significance testing he developed did not use the notion of an alternative hypothesis -- only a null hypothesis -- and hence did not involve the notion of _Type II error_. Fisher's interpretation of p-values was informal: p-values were only meant to provide guidance for potential future experiments.  Neyman and Pearson on the other hand formalized hypothesis testing with _Type I/II errors_ and developed a procedure to choose between competing hypotheses. They considered their formulation to be an improved and more objective generalization of significance testing as it provided a decision making tool to determine researcher behavior without requiring any inductive inference on the part of the researcher.  


<table align="center">
<tr>
<td>
<img src="stuff/Karl_Pearson.jpg" width="180px" align="left">
</td>
<td>
<img src="stuff/William_Sealy_Gosset.jpg" width="188px" align="left">
</td>
<td>
<img src="stuff/ronald-fisher-5.jpg" width="199px" align="left">
</td>
<td>
<img src="stuff/neyman.jpg" width="162px" align="left">
</td>
<td>
<img src="stuff/pearson.jpg" width="201px" align="left">
</td>
</tr>
<tr>
<td>
K. Pearson
</td>
<td>
W. Gosset
</td>
<td>
R. Fisher
</td>
<td>
J. Neyman
</td>
<td>
E. Pearson
</td>
</tr>
</table>





Fisher and Neyman/Pearson clashed bitterly, and often.  As they all shared the same building at the University College London they had ample opportunity to cross paths (and swords -- although only Fisher was ever knighted -- and not until many years later -- and Neyman was, after all, Polish, not English).  They disagreed about the proper role of models in statistical inference. Fisher thought the Neyman/Pearson approach was not applicable to scientific research because (1) initial assumptions about the null hypothesis are often discovered to be questionable as unexpected sources of error appear over the course of the experiment and (2) rigid reject/accept decisions based on models formulated before data is collected are incompatible with the real-world scenario faced by scientists and attempts to apply such formulations to scientific research would lead to mass confusion (as it has). 

In 1938 Neyman left University College London and moved to the University of California, Berkeley. This put much of the planetary diameter between both his partnership with Pearson and his dispute with Fisher. A further respite in the debate was provided by World War II.  Nonetheless, the disagreement between Fisher and Neyman only terminated (unresolved after 27 years) with Fisher's death in 1962.  Neyman wrote a well-regarded eulogy of Fisher upon his death.  And some of Neyman's later publications reported p-values and significance levels.

Afterword:

In an apparent effort to provide a "non-controversial" theory (as well as likely from confusion and misunderstanding of the topic, _per se_) the modern version of hypothesis testing used today is an inconsistent hybrid of the "Fisher versus Neyman/Pearson" formulations developed in the early 20th century.  Rather than comparing two directly competing realistic hypotheses, one of the hypotheses is made to be a "no effect null hypothesis" so (despite great conceptual differences and caveats) p-values can be interpreted from both the Fisher and the Neyman/Pearson perspectives.  Neyman and Pearson provided the stronger terminology, the more rigorous mathematics and the more consistent philosophy, but the hypothesis testing used today has more similarities with Fisher's method than theirs.

# Objectives

- Hypothesis Testing
    - Null Hypothesis
    - Test Statistic
    - Significance Level/Rejection Region
        - One/Two-Tailed
    - P-Values (never get this wrong...)
    - Relationship to Confidence Intervals
- Multiple Testing Adjustment
    - Bonferroni
    - False Discovery Rate (FDR)
- Alternative Hypothesis
    - Power
- Type I/II Errors





- Common Tests (and when to apply them)
    - Z/T-Test
    - Pearson's Chi-Squared Test
    - Fisher's Exact Test
    - Other common Non-Parametric Tests 
    - Kolmogorov-Smirnov (K-S) Test
    - F-Test
        

# Hypothesis Testing Workflow

0. Specify the _Population(s)_
1. Specify a _Null Hypothesis_ about _Population Parameters_ that we are interested in evaluating
2. Identify a _Test Statistic_ that is _informative about the Null Hypothesis_ and whose _distribution is known when the null hypothesis is true_ 
3. Specify a _Rejection Region_ (i.e., _Significance Level_) for the "Distribution of the Test Statistic under the Null Hypothesis" which includes hypothetical values of the _Test Statistic_ for which we would _Reject the Null Hypothesis_
4. Take a (hopefully _Representative_) _Sample_ from the _Population_ on which to base your inference and calculate the _Test Statistic_
5. Determine if the _Test Statistic_ lies in the _Rejection Region_, and
    
     1. if it does, _Reject_ the _Null Hypothesis_
     2. if it does not, _Fail to Reject_ the _Null Hypothesis_
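
The workflow above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the population is taken to be normal, the test is a two-sided one-sample t-test, and the population parameters, sample size, seed, and variable names are all hypothetical choices made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 0. Population: assumed here (hypothetically) to be N(mu = 10.5, sigma = 2)
# 1. Null hypothesis H0: mu = mu_0
mu_0 = 10.0

# 2. Test statistic: T = (xbar - mu_0) / (s / sqrt(n)), which is t_{n-1} under H0
# 3. Rejection region at significance level alpha: |T| > t_crit
alpha = 0.05
n = 30
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)

# 4. Take a sample from the population and calculate the test statistic
sample = rng.normal(loc=10.5, scale=2.0, size=n)
t_stat = (sample.mean() - mu_0) / (sample.std(ddof=1) / np.sqrt(n))

# 5. Reject H0 if the statistic lies in the rejection region, otherwise fail to reject
reject = abs(t_stat) > t_crit
print("T = {:.3f}, critical value = {:.3f}".format(t_stat, t_crit))
print("Reject H0" if reject else "Fail to reject H0")
```

Note that the rejection region is fixed in step 3, before the sample is drawn in step 4.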



_Note_: The _Null Hypothesis_ $H_0$ is often simply a placeholder or straw man which we fully expect to reject.  Confidence intervals give us a much better tool for actually making _inferences_ about what a population parameter might be.

# Examples

- Suppose we'd like to know whether _Spurs_ fans and _Rockets_ fans are represented in the same proportion here in Austin, TX.
- Suppose we'd like to independently verify that dogs are indeed heavier than cats.
- [A/B Testing] Suppose we'd like to assess which of two different versions of a website are more effective for business outcomes.

# A/B Testing Hints

# $$X_i \sim \text{Bernoulli}(p_X)$$ 

# $$Var[X_i] = p_X(1-p_X)$$

# $$X_i, i=1,\cdots,n_X, \;\; Y_i, i=1,\cdots,n_Y$$

# $$H_0: p_X = p_Y = p$$

# $$\bar X - \bar Y \sim ?$$

# Significance Level $\alpha$


\begin{align*}
\Large \text{Pr}(\text{rejecting $H_0$} | H_0 \text{ is true}) = \Large \text{Pr}(\text{Type I error})
\end{align*}

# P-Value $p$


\begin{align*}
\Large \text{Pr}(\text{Observing a test statistic as or more extreme than yours} | H_0\text{ is true})
\end{align*}

<table align="center">
<tr>
<td>
<img src="stuff/tests_1.png" width="1000px" align="center">
</td>

<td>
<img src="stuff/tests_2.png" width="1000px" align="center">
</td>
</tr>
</table>


P-Value blunders for which _I'll never forgive you_ (and which will _haunt_ you for the _rest of your natural and unnatural life_)

<br>
X: A p-value _is not_ the probability $H_0$ is False

$\checkmark$: $H_0$ is True, or it is not -- there is no "sometimes/probability"

<br>
X: A p-value _is not_ the probability of incorrectly rejecting $H_0$

$\checkmark$: Significance level $\alpha$ is the probability of wrongly rejecting $H_0$

<br>
X: A p-value _is not anything else except_ 

\begin{align*}
\Large \text{Pr}(\text{Observing a test statistic as or more extreme than yours} | H_0\text{ is true})
\end{align*}

$\checkmark$: A p-value is, _at all times, ever only and EXACTLY ONLY_ 

\begin{align*}
\Large \text{Pr}(\text{Observing a test statistic as or more extreme than yours} | H_0\text{ is true})
\end{align*}
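
A simulation can make the distinction between $p$ and $\alpha$ concrete. The sketch below assumes normally distributed data and a two-sided one-sample t-test; the number of experiments, sample size, and seed are arbitrary choices. Every simulated experiment is generated with $H_0$ true, so each p-value is a tail probability computed under a correct null, and the fraction of experiments with $p < \alpha$ should come out close to $\alpha$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
mu_0 = 0.0
n_experiments, n = 10000, 20

# Simulate many experiments in which H0 (mu = mu_0) is TRUE
samples = rng.normal(loc=mu_0, scale=1.0, size=(n_experiments, n))

# One p-value per experiment (one row of `samples` per experiment)
p_values = stats.ttest_1samp(samples, mu_0, axis=1).pvalue

# Fraction of experiments that wrongly reject H0: this estimates alpha, not "p"
type_I_rate = np.mean(p_values < alpha)
print("Observed Type I error rate: {:.4f}".format(type_I_rate))
```

The individual p-values vary from experiment to experiment; it is the long-run rejection rate that matches the significance level.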

# Confidence Intervals (via the pivot) and Hypothesis Testing

- A $100(1-\alpha)\%$ confidence interval _does not contain_ $\mu_0$ $\iff$ A two-sided test _rejects_ $H_0$ at the $\alpha$-significance level

- A $100(1-\alpha)\%$ confidence interval _contains_ $\mu_0$ $\iff$ A two-sided test _fails to reject_ $H_0$ at the $\alpha$-significance level

<table align="center">
<tr>
<td>
<img src="stuff/conf1.png" width="500px" align="center">
</td>
</tr>
</table>


If $H_0$ is true, then 
\begin{align*}
\Pr\left( -t_{n-1}^{\alpha/2} < \frac{\bar x - \mu_0} {\hat \sigma /\sqrt{n}} < t_{n-1}^{\alpha/2}\right) &=  \Pr\left( -\bar x -t_{n-1}^{\alpha/2} \frac{\hat \sigma}{\sqrt{n}} < - \mu_0 <  -\bar x  +t_{n-1}^{\alpha/2} \frac{\hat \sigma}{\sqrt{n}} \right)\\
&=  \Pr\left( \bar x +t_{n-1}^{\alpha/2} \frac{\hat \sigma}{\sqrt{n}} > \mu_0 >  \bar x  - t_{n-1}^{\alpha/2} \frac{\hat \sigma}{\sqrt{n}} \right)\\
&= \Pr\left( \bar x - t_{n-1}^{\alpha/2} \frac{\hat \sigma}{\sqrt{n}} < \mu_0 <  \bar x  + t_{n-1}^{\alpha/2} \frac{\hat \sigma}{\sqrt{n}} \right)
\end{align*}

"captures" $\mu_0$ in $100(1-\alpha)\%$ of hypothetically repeated experiments 




# Proof

If $H_0$ is true, then 
$\alpha = \text{Pr}_{\bar x}\left(\left\lvert  \frac{\bar x - \mu_0} {\hat \sigma /\sqrt{n}}\right\rvert > Z_{\alpha/2} \right)$
and the observed (two-sided) p-value under $H_0$ is 
$p = \text{Pr}_Z\left( \lvert Z \rvert > \left\lvert  \frac{\bar x - \mu_0} {\hat \sigma /\sqrt{n}}\right\rvert \right)$

And if $p < \alpha$ then

$\quad\;\;\; \left\lvert \frac{\bar x - \mu_0} {\hat \sigma /\sqrt{n}}\right\rvert > Z_{\alpha/2} $

$\Longrightarrow \mu_0 < \bar x - Z_{\alpha/2} \frac{\hat \sigma}{\sqrt{n}} \text{ or } 
\mu_0 > \bar x + Z_{\alpha/2} \frac{\hat \sigma}{\sqrt{n}}$


$\Longrightarrow \mu_0 \not \in \left(\bar x - Z_{\alpha/2} \frac{\hat \sigma}{\sqrt{n}}, 
\bar x + Z_{\alpha/2} \frac{\hat \sigma}{\sqrt{n}}\right)$

<br> 
So the $100(1-\alpha)\%$ confidence interval _does not_ contain $\mu_0$
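
The equivalence can also be checked numerically. The following is a minimal sketch that uses the $t_{n-1}$ version of the interval (as in the pivot derivation) rather than the $Z$ version used in the proof; the four measurements and the candidate values of $\mu_0$ are illustrative.

```python
import numpy as np
from scipy import stats

x = np.array([133.72, 137.02, 140.88, 135.45])  # illustrative measurements
alpha = 0.05
n = len(x)

# 100(1 - alpha)% confidence interval for the mean, built from the t pivot
xbar = x.mean()
se = x.std(ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
ci_low, ci_high = xbar - t_crit * se, xbar + t_crit * se
print("CI: ({:.2f}, {:.2f})".format(ci_low, ci_high))

# For each candidate mu_0, compare the two-sided test decision with CI membership
results = []
for mu_0 in [130.0, 133.0, 136.0, 142.0]:
    rejected = bool(stats.ttest_1samp(x, mu_0).pvalue < alpha)
    outside_ci = not (ci_low < mu_0 < ci_high)
    results.append((mu_0, rejected, outside_ci))
    print("mu_0 = {}: reject = {}, outside CI = {}".format(mu_0, rejected, outside_ci))
```

For every candidate value, the test rejects exactly when $\mu_0$ falls outside the interval.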


# Multiple Testing

- Each time we do a hypothesis test _[what?]_
- There's a chance we are wrong about our decision
- If $H_0$ is true, an $\alpha$ chance of being wrong
- So if we do $N$ tests, and $H_0$ is true for all of them
- _we still expect to wrongly reject $H_0$ about $\alpha \times N$ times!_
- Testing each at $\alpha' = \alpha/N$ gives at most an $\alpha$ chance that _any_ of the tests is wrong (when $H_0$ is true for all of them)
- This is called _Bonferroni correction_ 
- and it guarantees an $\alpha$ _family-wise error rate_ 



- Bonferroni correction is really quite stringent... 


- An alternative is to control the _False Discovery Rate (FDR)_ $q$
- which for a set of tests called significant (e.g., tests significant at the $\alpha$-level)
- is the expected proportion $q$ of those tests that were called incorrectly (i.e., the "FDR")
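
Both adjustments can be sketched in a few lines. The notes do not say which FDR procedure is intended; the sketch below uses the Benjamini-Hochberg step-up procedure, which is one standard choice, and the function names and p-values are hypothetical.

```python
import numpy as np

def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for tests with p <= alpha / N (controls the family-wise error rate)."""
    p_values = np.asarray(p_values)
    return p_values <= alpha / len(p_values)

def benjamini_hochberg_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure (controls the FDR at level q)."""
    p_values = np.asarray(p_values)
    m = len(p_values)
    order = np.argsort(p_values)               # indices of p-values, smallest first
    thresholds = q * np.arange(1, m + 1) / m   # k-th smallest p-value is compared to q*k/m
    below = p_values[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])       # largest rank that meets its threshold
        reject[order[:k + 1]] = True           # reject that test and all with smaller p
    return reject

p_values = [0.001, 0.008, 0.012, 0.041, 0.20, 0.74]  # hypothetical p-values
print("Bonferroni:        ", bonferroni_reject(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg_reject(p_values))
```

With these made-up p-values Bonferroni rejects two hypotheses and Benjamini-Hochberg rejects three, which illustrates how much more stringent Bonferroni is.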


# Multiple Testing in A/B Testing Contexts
<table align="center">
<tr>
<td>
<img src="stuff/figure_1.png" width="1000px" align="center">
</td>
</tr>
</table>

# Alternative Hypotheses $H_A$

Unlike the _Null Hypothesis_ which provides a formal straw man, the _Alternative Hypothesis_ $H_A$ is actually our "Best Guess" at what we think might actually be true. The _Alternative Hypothesis_ $H_A$ is used to assess the _Power_ of an experiment to _Reject_ $H_0$: 

the specification of $H_A$ allows us to perform probabilistic calculations under the distribution of the test statistic as we think it truly is (as opposed to how we think it is not, i.e., $H_0$).

# Test Power $1-\beta$

\begin{align*}
\Large \beta = \text{Pr}(\text{Failing to reject $H_0$} | H_A \text{ is true}) = \text{Pr}(\text{Type II error}), \qquad \text{Power} = 1 - \beta
\end{align*}
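
Power is the probability of rejecting $H_0$ when $H_A$ is true, i.e., one minus the Type II error probability. As a rough sketch, assume a one-sided z-test with known $\sigma$ and a single specific alternative $\mu_A > \mu_0$; the function name and all numbers below are hypothetical.

```python
import numpy as np
from scipy import stats

def z_test_power(mu_0, mu_a, sigma, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mu = mu_0 against HA: mu = mu_a > mu_0."""
    z_crit = stats.norm.ppf(1 - alpha)            # reject H0 when Z > z_crit
    shift = (mu_a - mu_0) / (sigma / np.sqrt(n))  # mean of Z when HA is true
    beta = stats.norm.cdf(z_crit - shift)         # Pr(fail to reject H0 | HA is true)
    return 1 - beta

sample_sizes = [4, 16, 64]
powers = [z_test_power(mu_0=133.0, mu_a=134.0, sigma=2.0, n=n) for n in sample_sizes]
for n, power in zip(sample_sizes, powers):
    print("n = {:3d}: power = {:.3f}".format(n, power))
```

When $\mu_A = \mu_0$ the function returns $\alpha$, and for a fixed alternative the power grows with $n$.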

<table align="center">
<tr>
<td>
<img src="stuff/tests_3.png" width="1000px" align="center">
</td>

<td>
<img src="stuff/tests_4.png" width="1000px" align="center">
</td>
</tr>
</table>

Note that p-values are not tightly related to relative likelihoods
- Compared to all possible _Alternative Hypotheses_, p-values $\sim .05$ are very strong evidence for $H_0$
- If some extreme $H_A$ is true then you'd never see anything like $.05$, but if $H_0$ is true then you would...

A quick wiki read of Karl Pearson or Sir Ronald Fisher shows the staggering and breathtaking extent of their contributions to the field of statistics. Even Neyman, bitter rival of Fisher, who tends to receive the short end of the stick in historical retrospectives and is generally dismissed as the foil to Fisher's genius, has been an absolute pillar of modern statistical thought.  

So who then is this upstart pretender William Sealy Gosset -- a mere beer maker -- who could dare to cross paths with these giants and be considered one of the major individuals responsible for hypothesis testing?  

<table align="center">
<tr>
<td>
<img src="stuff/William_Sealy_Gosset.jpg" width="188px" align="left">
</td>
<td>
<img src="stuff/ronald-fisher-5.jpg" width="199px" align="left">
</td>
<td>
<img src="stuff/neyman.jpg" width="162px" align="left">
</td>
<td>
<img src="stuff/pearson.jpg" width="201px" align="left">
</td>
<td>
<img src="stuff/Karl_Pearson.jpg" width="180px" align="left">
</td>
</tr>
</table>

Gosset was an Oxford man -- a trained Chemist -- and he ended up working at Guinness because of Claude Guinness's policy of recruiting the best graduates from Oxford and Cambridge to apply biochemistry and statistics to Guinness's industrial processes. Never formally trained in Statistics, Gosset was self-taught but steadily integrated himself with, and endeared himself to, the statisticians working at University College London. Fisher himself even lauded Gosset as "one of the most original minds in contemporary science" and a contemporary of Gosset and Fisher even went so far as to say "I think he [Gosset] was really the big influence in statistics... he asked the questions and Pearson and Fisher put them into statistical language, and then Neyman came to work with the mathematics. But I think most of it came from Gosset."

http://statistics.berkeley.edu/sites/default/files/tech-reports/541.pdf

http://faculty.fiu.edu/~blissl/GuinessGossetFisher.pdf

$\text{Let $X_i \overset{i.i.d.}{\sim}f(\theta)$, for $i = 1, \cdots n$,
and suppose we are interested in testing:}$

$ \left\{ \begin{array}{l}
H_0: \text{E}[X_i] = \mu_0\\
H_a: \text{E}[X_i] \not = \mu_0
\end{array} \right. $


$\text{If $X_i$ is distributed normally or the CLT applies$^*$, then if $H_0$ is true}$

$$\bar X - \mu_0 \overset{\tiny approx}{\sim} N\left(0, 
\frac{\text{Var}[X]}{n}\right)$$

$\text{However, if we }\textit{do not know } \text{Var}[X] \text{ we would need to estimate it,}$

$\text{and can do so in an unbiased manner with}$

$$ s^2 = \frac{\sum (X_i - \bar X)^2}{n-1} $$

$\text{Unfortunately, if the CLT has not yet ``kicked in'' }$ 
$\text{then }\textit{only if } f(\theta)\sim N(\mu,\sigma^2)
\text{ do we have that}$

$$\frac{\bar X - \mu_0}{\sqrt{s^2/n}} \sim t_{n-1}$$


$\text{But as $n \rightarrow \infty$, }$

$$
t_{n}
\longrightarrow
N\left(0, 1\right)
$$

$\text{and the CLT applies anyway at that point (so the use case for the t-test is } \textit{extremely } \text{limited)}$
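
To get a feel for how quickly $t_{n-1}$ approaches $N(0,1)$, one can compare two-sided 5% critical values; the sample sizes below are arbitrary.

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)  # two-sided 5% critical value of N(0, 1)

t_crits = []
for n in [4, 10, 30, 100, 1000]:
    t_crit = stats.t.ppf(0.975, df=n - 1)  # same quantile of t_{n-1}
    t_crits.append(t_crit)
    print("n = {:4d}: t critical = {:.3f}, z critical = {:.3f}".format(n, t_crit, z_crit))
```

For small $n$ the t critical value is noticeably larger than the normal one, which is where using the t-distribution matters.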

One of Gosset’s analyses focused on malt extract, which was measured in “degrees saccharine” per barrel of 168 lbs. malt. At the time, an extract in the neighborhood of 133° gave the targeted level of alcohol content for Guinness’s beer. A higher extract affected the life of the beer, and also the alcohol content which in turn affected the excise tax paid on alcoholic beverages. 

In Gosset’s view, +/-0.5° was a difference or error in malt extract level which Guinness and its customers could swallow, and he determined that "in order to get the accuracy we require we must take the mean of at least four determinations.”

http://fmmh.ycdsb.ca/teachers/fmmh_mcmanaman/pages/tok_ziliak1.pdf

In [None]:
import numpy as np
from scipy import stats

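H0_mu0 = 133  # hypothesized mean malt extract under H0
degrees_saccharine = np.array([133.72, 137.02, 140.88, 135.45])
n = len(degrees_saccharine)

xbar = np.mean(degrees_saccharine)
s2 = np.var(degrees_saccharine, ddof = 1)  # unbiased variance estimate
df = n - 1

# Two-sided p-value computed manually from the t-statistic
t_stat = (xbar - H0_mu0)/np.sqrt(s2/n)
print(2 * stats.t.sf(abs(t_stat), df = df))

# The same test using scipy's built-in one-sample t-test
stats.ttest_1samp(degrees_saccharine, H0_mu0)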


$\text{If we had $paired$ samples}$ 

$$X_i \overset{i.i.d.}{\sim}f(\theta_X) \text{ and } Y_i \overset{i.i.d.}{\sim}f(\theta_Y)$$ 

$\text{such that $X_i$ and $Y_i$ shared some dependency for $i = 1, \cdots n$, and were interested in testing}$

$ \left\{ \begin{array}{l}
H_0: \text{E}[X_i] = \text{E}[Y_i] \\
H_a: \text{E}[X_i] \not = \text{E}[Y_i]
\end{array} \right. $

$\text{we could let $Z_i = X_i - Y_i$ so that}$

$$\text{E}[Z_i] = \text{E}[X_i] - \text{E}[Y_i] \text{ and } \text{Var}[Z_i] = \text{Var}[X_i] + \text{Var}[Y_i] - 2 \text{Cov}[X_i,Y_i]$$

$\text{Then, if $H_0$ is true}$

$$\text{E}[Z_i] = 0$$

$\text{And if we don't know $\text{Var}[Z_i]$, we could estimate it in the usual way with}$

$$ s^2_Z = \frac{\sum (Z_i - \bar Z)^2}{n-1} $$

$\text{Then, if $n$ is sufficiently large to invoke the CLT approximation we have that}$

$$\bar Z \overset{\tiny approx}{\sim} N\left(0, 
\frac{s_Z^2}{n}\right)$$

$\text{Or, if $Z_i \sim N(\text{E}[Z_i],\text{Var}[Z_i])$}$ 

$$\frac{\bar Z - 0}{\sqrt{s_Z^2/n}} \sim t_{n-1}$$

$\text{Note that this is the same as the previous single sample case}$

* Student's actual publication 
http://seismo.berkeley.edu/~kirchner/eps_120/Odds_n_ends/Students_original_paper.pdf

<table align="left">
<tr><td>
<img src="stuff/t2.png" width="900px" align="left">
</td></tr>
</table>

In [None]:
soft_corn_yield = np.array([7.85, 8.89, 14.81, 13.55, 7.48, 15.39])
hard_corn_yield = np.array([7.27, 8.32, 13.81, 13.36, 7.97, 13.13])
diffs = soft_corn_yield - hard_corn_yield

H0_mu0 = 0 # H0: mean of (soft_corn_yield - hard_corn_yield) is 0; the one-sided alternative is that it is > 0
n = len(diffs)

xbar = np.mean(diffs)
s2b = np.var(diffs.tolist() + [.685], ddof = 0)
print("Gosset's variance estimator SD_G = {} was BIASED!".format(np.sqrt(s2b)))
s2w = np.var(diffs, ddof = 0)
print("Gosset's variance estimate SD_W = {} was WRONG!".format(np.sqrt(s2w)))
s2 = np.var(diffs, ddof = 1)
print("We use the unbiased (and correctly calculated) estimator, SD_U = {}\n".format(np.sqrt(s2)))
df = n - 1

In [None]:
# V0 -- fully manual 
t_stat = (xbar - H0_mu0)/np.sqrt(s2/n)
print(stats.t.sf(t_stat, df = df)) # one-sided test (upper tail)

# V1 -- manual differences then single sample test
# (halving the two-sided p-value is valid here because the t-statistic is positive)
one_sided_diffs_test = list(stats.ttest_1samp(diffs, H0_mu0))
one_sided_diffs_test[1] = one_sided_diffs_test[1]/2
print(one_sided_diffs_test)

# V2 -- automated two related samples test (ttest_rel always tests a mean difference of 0;
# its third positional argument is `axis`, not the hypothesized difference)
paired_sample_test = list(stats.ttest_rel(soft_corn_yield, hard_corn_yield))
paired_sample_test[1] = paired_sample_test[1]/2
print(paired_sample_test)

$\text{What if we again had $paired$ samples}$ 

$$X_i \overset{i.i.d.}{\sim}f(\theta_X) \text{ and } Y_i \overset{i.i.d.}{\sim}f(\theta_Y)$$ 

$\text{such that $X_i$ and $Y_i$ shared some dependency for $i = 1, \cdots n$, and were again interested in testing}$

$ \left\{ \begin{array}{l}
H_0: \text{E}[X_i] = \text{E}[Y_i] \\
H_a: \text{E}[X_i] \not = \text{E}[Y_i]
\end{array} \right. $

$\text{but we } \textit{did not } \text{take the difference $Z_i = X_i - Y_i$}$

$\text{and instead directly estimated the means and difference therein? That is, we examined}$

$$\bar X - \bar Y$$

$\text{Depending on $n$ and/or the distributions of $X_i$ and $Y_i$ ($how?$),}$
$\text{and after estimating $s_X^2$ and $s_Y^2$ in the usual way, we have either}$

$$ \bar X \overset{\tiny approx}{\sim} N\left(\text{E}[X], \frac{s_X^2}{n}\right) \text{ or }
\frac{\bar X-\text{E}[X]}{\sqrt{s_X^2/n}} \sim t_{n-1} 
\overset{n\rightarrow\infty}{\longrightarrow}
N\left(0, 1\right)  
$$

$\text{and the analogous situation holds for $\bar Y$ as well. Subsequently, we have}$


$$
\begin{eqnarray*}
\text{Var}[\bar X - \bar Y] &=& \text{Var}[\bar X] + \text{Var}[\bar Y] - 2\text{Cov}[\bar X, \bar Y]\\
&\overset{?}{=}& \text{Var}[\bar X] + \text{Var}[\bar Y] \\
&=& \frac{\text{Var}[X]}{n} + \frac{\text{Var}[Y]}{n} \\
&=& \frac{\text{Var}[X] + \text{Var}[Y]}{n} \\
&\overset{?}{=}& \frac{\text{Var}[X] + \text{Var}[Y] - 2\text{Cov}[X,Y]}{n} \\
&=& \frac{\text{Var}[Z]}{n} \\
\end{eqnarray*}
$$

$\text{So if there's } \textit{positive (negative) } \text{dependency between $X_i$ and $Y_i$, estimating the variance as}$

$$\text{Var}[\bar X - \bar Y] = \text{Var}[\bar X] + \text{Var}[\bar Y]$$

$\text{is wrong and } \textit{less efficient (overly optimistic)}\text{; on the other hand, testing arbitrarily paired independent variables using}$ 

$$\frac{\text{Var}[Z]}{n} = \frac{\text{Var}[X] + \text{Var}[Y] - 2\text{Cov}[X,Y]}{n}$$

$\text{could produce chance covariation in $X_i$ and $Y_i$ which spuriously affects the variance estimate}$

### If data is paired, paired testing is more powerful and efficient in terms of significance

In [None]:
paired_sample_test = list(stats.ttest_rel(soft_corn_yield, hard_corn_yield))
paired_sample_test[1] = paired_sample_test[1]/2 # one-sided p-value
print("Correctly treated as dependent samples: {}".format(paired_sample_test))

two_independent_samples_test = list(stats.ttest_ind(soft_corn_yield, hard_corn_yield))
two_independent_samples_test[1] = two_independent_samples_test[1]/2 # one-sided p-value
print("Incorrectly treated as independent samples: {}".format(two_independent_samples_test))

* Guinness really does taste better in Ireland

https://www.theguardian.com/science/blog/2011/may/27/barack-obama-guinness-taste-ireland
http://blog.minitab.com/blog/michelle-paret/guinness-t-tests-and-proving-a-pint-really-does-taste-better-in-ireland

$\begin{array}{c|ccc}
       & n & \bar X & s_x \\\hline
Ireland &42 & 74 & 7.4 \\
Elsewhere & 61 & 57 & 7.1 \\
\end{array}$


In [None]:
xbar1 = 74.
s21 = 7.4**2
n1 = 42

xbar2 = 57.
s22 = 7.1**2
n2 = 61

# Prove (xbar1-xbar2)/np.sqrt(s21/n1 + s22/n2) ~ N(0,1)...
1 - stats.norm.cdf((xbar1-xbar2)/np.sqrt(s21/n1 + s22/n2))

$\text{Making this concrete, suppose that we have two independent samples}$ 

$$X_i \overset{i.i.d.}{\sim}Bern(p_X) \text{ and } Y_j \overset{i.i.d.}{\sim}Bern(p_Y)$$ 

$\text{for $i = 1, \cdots n$ and $j = 1, \cdots m$ and we are again interested in testing}$

$ \left\{ \begin{array}{l}
H_0: \text{E}[X_i] = \text{E}[Y_j] \\
H_a: \text{E}[X_i] \not = \text{E}[Y_j]
\end{array} \right. $

$\text{We will test this with $\bar X - \bar Y$ which we notate as
 $\hat p_X - \hat p_Y$}$

$\text{Suppose $H_0$ is true and $\text{E}[X_i] = \text{E}[Y_j] = p$}$
$\text{Then, under $H_0$ we have } \textit{common variances}$

$$\text{Var}[X_i] = \text{Var}[Y_j] = p(1-p)$$

$\text{And we estimate $p$ with $\hat p = \frac{\sum X_i + \sum Y_j}{n+m}$ and $\text{Var}\left[\hat p_X - \hat p_Y\right]$ with $\frac{\hat p(1- \hat p)}{n} + \frac{\hat p(1- \hat p)}{m}$}$

$\text{For large $n$ and $m$ then, under $H_0$ the CLT and normality of added normal variables imply}$ 

$$ \hat p_X - \hat p_Y \overset{\tiny approx}{\sim} N\left(0, \frac{\hat p(1- \hat p)}{n} + \frac{\hat p(1- \hat p)}{m}\right)$$ 
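
The pooled two-proportion test described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function name and the conversion counts are hypothetical, and the normal approximation is only reasonable for large $n$ and $m$.

```python
import numpy as np
from scipy import stats

def two_proportion_z_test(successes_x, n, successes_y, m):
    """Two-sided pooled z-test of H0: p_X = p_Y for two independent Bernoulli samples."""
    p_hat_x = successes_x / n
    p_hat_y = successes_y / m
    p_hat = (successes_x + successes_y) / (n + m)      # pooled estimate of p under H0
    se = np.sqrt(p_hat * (1 - p_hat) * (1 / n + 1 / m))  # sqrt of estimated Var[p_hat_X - p_hat_Y]
    z = (p_hat_x - p_hat_y) / se
    p_value = 2 * stats.norm.sf(abs(z))                # two-sided tail probability
    return z, p_value

# Hypothetical A/B test: 120 of 1000 visitors convert on version X, 150 of 1000 on version Y
z, p_value = two_proportion_z_test(120, 1000, 150, 1000)
print("z = {:.3f}, p-value = {:.4f}".format(z, p_value))
```

The pooled $\hat p$ is used in the variance because the calculation is carried out under $H_0$, where both samples share the same $p$.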

In [None]:
# We did this in the morning, plus you'll be doing a bit of this type of stuff in your sprint so I'll leave it for you

$\text{Consider now the case of the two independent normally distributed samples}$

$$X_i \overset{i.i.d.}{\sim}N(\mu_X,\sigma^2) \text{ and } Y_j \overset{i.i.d.}{\sim}N(\mu_Y,\sigma^2) $$ 

$\text{ for } i = 1, \cdots n \text{ and } j = 1, \cdots m$

$\text{where we explicitly consider the case of a shared } \textit{common variance } \sigma^2$ 

$\text{and where we are again interested in}$

$ \left\{ \begin{array}{l}
H_0: \text{E}[X_i] = \text{E}[Y_j] \\
H_a: \text{E}[X_i] \not = \text{E}[Y_j]
\end{array} \right. $

$\text{If $H_0$ is true we can estimate the shared variance in an unbiased manner as}$


$$ s^2_{XY} = \frac{\sum (X_i - \bar X)^2 + \sum (Y_j - \bar Y)^2}{n + m - 2} $$

$\text{with the associated degrees of freedom $n + m - 2$ where we estimate}$


$$\widehat{\text{Var}}[\bar X - \bar Y] = \frac{s^2_{XY}}{n} + \frac{s^2_{XY}}{m}$$

$\text{and}$

$$\frac{\bar X - \bar Y}{\sqrt{\frac{s^2_{XY}}{n} + \frac{s^2_{XY}}{m}}} \sim t_{n+m-2}$$



$\textit{When the variances are not assumed to be equal, and CLT normality has not yet kicked in,}$
$\textit{the degrees of freedom for the t-distribution is given by the Welch–Satterthwaite equation}$ 
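
For reference, the Welch-Satterthwaite approximation to the degrees of freedom is

$$\nu \approx \frac{\left(\frac{s_X^2}{n} + \frac{s_Y^2}{m}\right)^2}{\frac{(s_X^2/n)^2}{n-1} + \frac{(s_Y^2/m)^2}{m-1}}$$

The sketch below computes it by hand and compares the result with `stats.ttest_ind(..., equal_var=False)`. The helper name `welch_t_test` is hypothetical, and the two arrays are the Case and Briant soft resin measurements that appear in these notes.

```python
import numpy as np
from scipy import stats

def welch_t_test(x, y):
    """Two-sided Welch t-test; returns (t statistic, approximate df, p-value)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    vx = x.var(ddof=1) / len(x)  # s_X^2 / n
    vy = y.var(ddof=1) / len(y)  # s_Y^2 / m
    t_stat = (x.mean() - y.mean()) / np.sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    p_value = 2 * stats.t.sf(abs(t_stat), df)
    return t_stat, df, p_value

x = [3.33, 3.39, 3.85, 4.16, 3.59, 4.25, 4.94, 4.06, 4.06, 4.55, 4.37]
y = [4.11, 5.48, 2.94, 1.52, 3.87, 5.64, 4.85, 4.68, 5.11, 4.13, 5.44, 2.34, 3.50, 5.19]

t_stat, df, p_value = welch_t_test(x, y)
reference = stats.ttest_ind(x, y, equal_var=False)
print("manual: t = {:.4f}, df = {:.2f}, p = {:.4f}".format(t_stat, df, p_value))
print("scipy:  t = {:.4f}, p = {:.4f}".format(reference.statistic, reference.pvalue))
```

The approximate degrees of freedom are generally not an integer and lie between $\min(n, m) - 1$ and $n + m - 2$.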

In 1897, Thomas B. Case -- Guinness’s first scientific brewer -- and a cooperating scientist named Briant were examining the amount of soft and hard resins found in 50 gram samples of several seasons of American and Kent hops. For one lot of Kent hops Case had examined 11 samples and Briant had examined 14 and they observed soft resin averages of 4.05 grams and 4.2 grams, respectively. They observed even greater differences in their two samples of “American, 1895” at 0.35 grams and .5 grams for soft resins and hard resins, respectively.  Case wrote that “We could not... support the conclusion that there are no differences between pockets of the same lot” but he had no basis for evaluating whether observed differences represented random error from the samples or actual differences in the population.  In 1899 Case hired Gosset, who in 1908 arrived at his small sample inference theory.  

http://fmmh.ycdsb.ca/teachers/fmmh_mcmanaman/pages/tok_ziliak1.pdf

In [None]:
soft_resin_Case = np.array([3.33,3.39,3.85,4.16,3.59,4.25,4.94,4.06,4.06,4.55,4.37])
soft_resin_Briant = np.array([4.11,5.48,2.94,1.52,3.87,5.64,4.85,4.68,5.11,4.13,5.44,2.34,3.50,5.19])

print("{}\n(with common variances)".format(stats.ttest_ind(soft_resin_Case, soft_resin_Briant, equal_var = True)))

In [None]:
print("{}\n(with unique variances)".format(stats.ttest_ind(soft_resin_Case, soft_resin_Briant, equal_var = False)))

In [None]:
print("Observed variance for soft_resin_Case was {}".format(np.var(soft_resin_Case, ddof = 1)))
print("Observed variance for soft_resin_Briant was {}".format(np.var(soft_resin_Briant, ddof = 1)))

print("\nObserved standard deviation for soft_resin_Case was {}".format(np.std(soft_resin_Case, ddof = 1)))
print("Observed standard deviation for soft_resin_Briant was {}".format(np.std(soft_resin_Briant, ddof = 1)))

In [None]:
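# The two samples are unpaired and of different sizes (11 vs. 14), so a paired test does not apply
try:
    stats.ttest_rel(soft_resin_Case, soft_resin_Briant)
except ValueError as err:
    print("ttest_rel failed: {}".format(err))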

$\text{If $n$ is small enough that the CLT does not yet provide a good approximation, and}$

$$X_i \sim f(\theta) \not = N\left(\mu,\sigma^2\right)$$

$\text{for $i = 1, \cdots n,$ then according to scikit-learn you need to get more data.}$

<table align="center">
<tr><td>
<img src="stuff/ml_map.png" width="750px" align="center">
</td></tr>
</table>


$\text{However, one thing you } \textit{could } \text{do is test the median, i.e.,}$

$ \left\{ \begin{array}{l}
H_0: \text{Median}(X_i) = m \\
H_a: \text{Median}(X_i) \not = m
\end{array} \right. $

$\text{since if $H_0$ holds}$

$$\sum 1_{[X_i>m]} \sim Binom(n, 0.5)$$

$\text{This approach belongs to a class of tests known as } \textit{nonparametric tests}$ 

$\textit{Nonparametric tests } \text{do not rely on any distributional assumptions about the data}$

$- \; \text{They therefore are much more generally applicable than their parametric counterparts}$

$- \; \text{And they are } \textit{still } \text{often nearly as powerful as their parametric counterparts}$

$- \; \text{One then wonders why these more general purpose tools are not more heavily emphasized... }$

In [None]:
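H0_mu0 = 133

# Sign test: under H0 the number of observations above the median is Binomial(n, 0.5)
n_obs = len(degrees_saccharine)
n_above = np.sum(degrees_saccharine > H0_mu0)
# two-sided p-value: double the smaller binomial tail probability (capped at 1)
sign_test_p = min(1.0, 2*min(stats.binom.cdf(n_above, n_obs, 0.5), stats.binom.sf(n_above - 1, n_obs, 0.5)))

print("Single Sample degrees saccharine test p-values")
print("Parametric: {}".format(stats.ttest_1samp(degrees_saccharine, H0_mu0)))
print("Nonparametric: {}".format(sign_test_p))
# Show that this is the correct p-value based on the binomial idea in the last block 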

$\text{Some very useful nonparametric tests are }$

$\quad - \; \textbf{The Wilcoxon signed-rank test} \text{, which addresses paired samples like the paired t-test}$

$\quad - \; \textbf{Mann–Whitney U} \text{, which addresses independent samples, like the two-sample t-test}$

$\quad - \; \textbf{Fisher's exact test} \text{, which addresses independence in (often, $2\times2$) contingency tables (we will return to this soon)}$

$\quad - \; \textbf{Kolmogorov-Smirnov test} \text{, which addresses distributional assumptions (we will return to this later)}$

$\quad - \; \text{And many many more... sign test, median test, Wilcoxon rank sum test, etc....} $




In [None]:
diffs = soft_corn_yield - hard_corn_yield
H0_mu0 = 0

print("Paired Soft/Hard corn yield test p-values")
print("Parametric: {}".format(stats.ttest_rel(soft_corn_yield, hard_corn_yield)))
print("Nonparametric: {}".format(stats.wilcoxon(soft_corn_yield, hard_corn_yield, correction = True)))

print("\nCase/Briant Independent soft resin test p-values")
print("Parametric: {}".format(stats.ttest_ind(soft_resin_Case, soft_resin_Briant, equal_var = False)))
print("Nonparametric: {}".format(stats.mannwhitneyu(soft_resin_Case, soft_resin_Briant, alternative = 'two-sided')))

Dr. Muriel Bristol, an acquaintance of Fisher, was a British algologist who took tea with milk and (as was the usual practice of high class British aristocrats) preferred that the milk be poured into the cup before the tea. One day Fisher was having afternoon tea with Muriel and William Roach (who would later marry Muriel) but Muriel politely declined tea because (on this particular occasion) the milk had not been poured first. To this Fisher responded "Nonsense. Surely it makes no difference," but Muriel insisted that it did, to which William proposed, "Let's test her!"  On their next afternoon tea engagement, Fisher brought back a newly invented test for the task at hand, a nonparametric test now known as "Fisher's Exact Test".  

With William's help, Fisher tested Muriel using a sample of eight cups of tea. In a random order, William prepared four cups of tea with the milk poured first, and four cups of tea with the milk poured second -- and Muriel correctly identified the milk/tea order for all of them. 

$\begin{array}{c|cc}
&\text{William poured milk first} & \text{William poured milk second} \\\hline
\text{Muriel said, "Milk poured first"} & 4 & 0\\
\text{Muriel said, "Milk poured second"} & 0 & 4 \\
\end{array}$
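
The p-value for this table can be worked out by counting. Assuming Muriel knows that exactly four cups had milk poured first, and that under $H_0$ she is merely guessing, every choice of four cups out of eight is equally likely and only one choice is entirely correct. The sketch below checks this count against the hypergeometric distribution and against `stats.fisher_exact`; the variable names are illustrative.

```python
from math import comb
from scipy import stats

# Under H0 (pure guessing) each of the C(8, 4) ways of picking four cups is equally likely
n_arrangements = comb(8, 4)
p_manual = 1 / n_arrangements  # only one arrangement gets all four "milk first" cups right

# Same probability from the hypergeometric distribution:
# population of 8 cups, 4 "milk first" cups, 4 cups picked, Pr(more than 3 correct)
p_hypergeom = stats.hypergeom.sf(3, 8, 4, 4)

# And from scipy's implementation of Fisher's exact test (one-sided)
_, p_scipy = stats.fisher_exact([[4, 0], [0, 4]], alternative='greater')

print("arrangements:", n_arrangements)
print("p-values:", p_manual, p_hypergeom, p_scipy)
```

All three calculations give $1/70 \approx 0.014$, so a perfect score on eight cups is fairly unlikely under guessing alone.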

In [None]:
stats.fisher_exact([[4,0],[0,4]], alternative = 'greater')

$\text{As with Fisher's Exact test, interest in contingency tables often lies in the independence of the marginal variables }$

$\text{That is, for discrete variables }
X_i \overset{i.i.d.}{\sim}f(\theta_X) \text{ and } Y_j \overset{i.i.d.}{\sim}f(\theta_Y) 
\text{ for } i = 1, \cdots n \text{ and } j = 1, \cdots m$

$\text{we are often interested in }$

$ \left\{ \begin{array}{l}
H_0: \text{$X_i$ and $Y_i$ are } \textit{independent} \\
H_a: \text{$X_i$ and $Y_i$ are } \textit{dependent}
\end{array} \right. $



$\text{A } \textit{chi-squared } \text{test based on a large sample distributional approximation can be used to evaluate $H_0$ }$

$\text{The approximation is (conservatively) recommended for use } \textit{only when }$

$\quad - \; \text{The expected count in each cell is $\geq 5$, or}$

$\quad - \; \text{The expected count in each cell is $\geq 10$ if the degrees of freedom are 1}$


$\text{The chi-squared test statistic itself is}$

$$
\begin{eqnarray*}
x^2 &=& \sum_{r=1}^R \sum_{c=1}^C \frac{(O_{rc} - E_{rc})^2}{E_{rc}} \text{ where }\\
E_{rc} &=& \frac{\left(\sum_i O_{ic}\right)\left(\sum_j O_{rj}\right)}{\sum_{i,j} O_{ij}} \text{ and} \\\\
x^2 &\overset{\tiny approx}{\sim}& \chi^2_{df} \;\; \text{where } df = (R-1)(C-1)
\end{eqnarray*}
$$

## Does grunting help McEnroe serve more aces?

* McEnroe lost only one collegiate match, which occurred at Trinity University in San Antonio, TX


$\begin{array}{c|ccc}
&\text{McEnroe serves an ace} & \text{McEnroe faults} & \text{McEnroe's serve is returned} \\\hline
\text{McEnroe grunts on serve} & 61 & 32 & 144\\
\text{McEnroe silent on serve} & 35 & 8 & 53 \\
\end{array}$

In [None]:
observed = [[61,32,144],[35,8,53]]
# chi2_contingency returns the statistic, p-value, degrees of freedom, and expected counts
chi2, p, dof, expected = stats.chi2_contingency(observed)

print("Observed:\n", observed, "\n")
print("Expected (if independent):\n", expected, "\n")

msg = "Test Statistic: {}\np-value: {}\nDegrees of Freedom: {}"
print(msg.format(chi2, p, dof))
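
As a cross-check on the formulas for $E_{rc}$ and $x^2$, the following sketch recomputes the expected counts and the test statistic for the McEnroe table by hand. The variable names are illustrative assumptions, and the closed-form tail probability used here is valid only because this table has exactly 2 degrees of freedom.

```python
import math

observed = [[61, 32, 144],   # McEnroe grunts on serve
            [35, 8, 53]]     # McEnroe silent on serve

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# E_rc = (row r total) * (column c total) / (grand total)
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# x^2 = sum over all cells of (O - E)^2 / E
statistic = sum((o - e) ** 2 / e
                for obs_row, exp_row in zip(observed, expected)
                for o, e in zip(obs_row, exp_row))
dof = (len(observed) - 1) * (len(observed[0]) - 1)

# For exactly 2 degrees of freedom the chi-squared upper tail is exp(-x / 2);
# other degrees of freedom need a general routine such as stats.chi2.sf.
assert dof == 2
p_value = math.exp(-statistic / 2)
print(statistic, dof, p_value)
```
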

$\text{A } \textit{chi-squared test } \text{can also be used to examine distributional assumptions }in \; toto$

$\text{For any binned random variable $X_i$ for $i = 1, \cdots, n$ we can test}$

$ \left\{ \begin{array}{l}
H_0: X_i \sim f(\theta) \\
H_a: H_0 \text{ False}
\end{array} \right. $

$\text{where $\Pr(X_i = k) = \theta_k$ }$

$$
\begin{eqnarray*}
x^2 &=& \sum_{k=1}^K \frac{(O_{k} - E_{k})^2}{E_{k}} \text{ where }\\
O_{k} &=& \sum^n_{i=1} 1_{[X_i=k]}  \\
E_{k} &=& n \cdot \theta_k \text{ and} \\
x^2 &\overset{approx.}{\sim}& \chi^2_{df} \;\; \text{where } df = K - 1 \text{ (for fully specified $\theta$)}
\end{eqnarray*}
$$


In [None]:
# No example here... this type of test is a total relic...

$\text{A } \textit{chi-squared test } \text{can also test variance in normally distributed populations }$

$$X_i \overset{i.i.d.}{\sim}N(\mu_X,\sigma_X^2), \; i = 1, \cdots n$$

$\text{using}$

$ \left\{ \begin{array}{l}
H_0: \sigma_X^2 = \sigma_0^2 \\
H_a: \sigma_X^2 \not = \sigma_0^2
\end{array} \right. $

$\text{since, if $H_0$ is True, then}$
$$ \frac{\sum (X_i - \bar X)^2}{\sigma_0^2} \sim \chi_{n-1}^2 $$

In [None]:
n = len(soft_resin_Case)
print("Estimated variance " + str(np.var(soft_resin_Case, ddof = 0)))
# Two-sided p-value taken as twice the lower tail (assumes the statistic falls in the lower tail)
print("p-value of H_0: variance=1 test " + str(2*stats.chi2.cdf((np.var(soft_resin_Case, ddof = 0)*n)/1, df = n - 1)) + "\n")

n = len(soft_resin_Briant)
print("Estimated variance " + str(np.var(soft_resin_Briant, ddof = 0)))
# Here twice the upper tail is used instead (assumes the statistic falls in the upper tail)
print("p-value of H_0: variance=1 test " + str(2*(1-stats.chi2.cdf((np.var(soft_resin_Briant, ddof = 0)*n)/1, df = n - 1))))
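
The cell above chooses which tail to double by hand. A minimal sketch of a helper that picks the smaller tail automatically might look like the following. The function `variance_chi2_test` and the simulated `sample` are hypothetical and stand in for the `soft_resin` data, which is defined elsewhere in the notebook.

```python
import numpy as np
from scipy import stats

def variance_chi2_test(x, sigma0_sq):
    """Two-sided chi-squared test of H_0: Var(X) = sigma0_sq, assuming normal data."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # Test statistic: sum of squared deviations scaled by the null variance.
    statistic = np.sum((x - x.mean()) ** 2) / sigma0_sq
    lower = stats.chi2.cdf(statistic, df=n - 1)
    upper = stats.chi2.sf(statistic, df=n - 1)
    # Double the smaller tail and cap the result at 1.
    p_value = min(1.0, 2.0 * min(lower, upper))
    return statistic, p_value

# Hypothetical data: 50 draws whose true variance is 4, tested against variance 1.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=2.0, size=50)
stat, p = variance_chi2_test(sample, sigma0_sq=1.0)
print(stat, p)
```
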

$\text{Further, the variances of two samples can be compared using an } \textit{F-test:}$

$\text{If $\chi^2_1$ and $\chi^2_2$ are chi-squared random variables with $v$ and $w$ degrees of freedom}$

$\text{then $F = \frac{\chi^2_1}{v}\div\frac{\chi^2_2}{w}$ is distributed as an } \textit{F distribution } \text{with degrees of freedom $v$ and $w$}$ 

$\text{Therefore, for $Y_j \overset{i.i.d.}{\sim} N(\mu_Y,\sigma_Y^2), \; j = 1, \cdots m$, we can test}$

$ \left\{ \begin{array}{l}
H_0: \sigma_X^2 = \sigma_Y^2 \\
H_a: \sigma_X^2 \not = \sigma_Y^2
\end{array} \right. $

$\text{using}$

$$ \frac{\sum (X_i - \bar X)^2\big/(n-1)}{\sum (Y_j - \bar Y)^2\big/(m-1)} \sim F_{n-1,m-1}$$


In [None]:
# Upper-tail p-value for the ratio of sample variances, var(Briant)/var(Case)
1 - stats.f.cdf(np.var(soft_resin_Briant, ddof = 1)/np.var(soft_resin_Case, ddof = 1),
               len(soft_resin_Briant)-1, len(soft_resin_Case)-1)
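
The cell above reports only the upper tail, whereas $H_a$ is two-sided. One hedged way to obtain a two-sided p-value is to double the smaller tail, as sketched below. The function `variance_ratio_test` and the simulated samples are hypothetical and are not the `soft_resin` data.

```python
import numpy as np
from scipy import stats

def variance_ratio_test(x, y):
    """Two-sided F-test of H_0: Var(X) = Var(Y), assuming both samples are normal."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Ratio of unbiased sample variances.
    f_stat = np.var(x, ddof=1) / np.var(y, ddof=1)
    dfn, dfd = x.size - 1, y.size - 1
    lower = stats.f.cdf(f_stat, dfn, dfd)
    upper = stats.f.sf(f_stat, dfn, dfd)
    # Double the smaller tail and cap the result at 1.
    return f_stat, min(1.0, 2.0 * min(lower, upper))

# Hypothetical samples with standard deviations 1 and 3.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=40)
y = rng.normal(0.0, 3.0, size=40)
f_stat, p_value = variance_ratio_test(x, y)
print(f_stat, p_value)
```
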

$\text{The F-test is most often encountered in model selection contexts, and in particular in the } \textit{multiple regression } \text{context}$

$\text{Here, it is used to test if the $k$ features (covariates) } \textbf{$X$}_i \text{ have any explanatory power with respect to outcome $Y_i$}$

$\text{Under the usual multiple regression assumptions, we can evaluate}$

$ \left\{ \begin{array}{l}
H_0: E[Y_i] = \beta_0 \\
H_a: E[Y_i] \not = \beta_0
\end{array} \right. $

$\text{using}$

$$ \frac{\sum (Y_i - \beta_0)^2\big/(n - 1)}{\sum (Y_i - \textbf{$X$}_i\textbf{$\beta$})^2\big/(n-k-1)} 
=
\frac{RSS_{1}\big/(n - 1)}{RSS_{k + 1}\big/(n - k - 1)} 
\sim F_{n-1,n-k-1} 
$$

$\text{where $RSS_p$ is the residual sum of squares of the model with $p$ coefficients (intercept included)}$

$\text{ or, using the preferred method (whose numerator and denominator are independent under $H_0$) }$

$$\frac{\left(\sum (Y_i - \beta_0)^2 - \sum (Y_i - \textbf{$X$}_i\textbf{$\beta$})^2\right)\big/k}{\sum (Y_i - \textbf{$X$}_i\textbf{$\beta$})^2\big/(n-k-1)} 
= 
\frac{\left(RSS_{1}- RSS_{k + 1}\right)\big/k}{RSS_{k + 1}\big/(n - k - 1)} 
\sim F_{k,n-k-1}
$$
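
As a hedged illustration of the nested-model form of the statistic, the sketch below simulates a small regression problem and compares the intercept-only fit with the full fit. It divides the drop in residual sum of squares by the number of added features $k$ and compares the result with an $F_{k,\,n-k-1}$ distribution. The data, coefficients, and variable names are all assumptions made for the example.

```python
import numpy as np
from scipy import stats

# Hypothetical data: n observations, k features, one of which has no effect.
rng = np.random.default_rng(1)
n, k = 100, 3
X = rng.normal(size=(n, k))
y = 1.0 + X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=n)

# Intercept-only fit: the least-squares estimate of beta_0 is the sample mean.
rss_null = np.sum((y - y.mean()) ** 2)

# Full fit: an intercept column plus the k features.
design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
rss_full = np.sum((y - design @ beta_hat) ** 2)

# Drop in RSS per added feature, relative to the full-model residual variance.
f_stat = ((rss_null - rss_full) / k) / (rss_full / (n - k - 1))
p_value = stats.f.sf(f_stat, k, n - k - 1)
print(f_stat, p_value)
```
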


In [None]:
# We will note this again in another lecture

$\text{An interesting result we will revisit with } \textit{logistic regression }\text{is that } \textit{deviance } \text{for a $true$ $model$ $M$ with $k$ parameters}$

$$D_M \overset{\tiny approx}{\sim} \chi^2_{n-k}$$

$\text{where for true model $M$ and saturated model $Y$}$

$$D_M = -2\left(\log f\left(Y|\hat \theta^M\right) - \log f\left(Y|\hat \theta^Y\right)\right)$$

$\text{Hence for Model $R$ with $m-k$ parameters nested within Model $F$ with $m$ parameters we can evaluate}$

$ \left\{ \begin{array}{l}
H_0: \text{Model $R$ fits the data fairly well}  \\
H_a: H_0 \text{ is False}
\end{array} \right. $

$\text{using}$

$$D_R - D_F \overset{\tiny approx}{\sim} \chi^2_{k}$$

$\text{since, for nested models, the difference of the two approximately $\chi^2$ deviances is itself approximately $\chi^2$, and if model $R$ does not fit the data, $D_R$ will be large}$

$\text{Because there is no residual sum of squares ($RSS$) in GLM contexts and only the deviance approximation is available,}$ 
$\text{in GLM contexts (such as logistic regression) this result plays the same role as the $F$ test in multiple linear regression}$
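
A small worked case may help make the deviance difference concrete. The sketch below is an assumption-laden illustration, not a logistic regression fit: it collapses the McEnroe table into "ace" versus "no ace" and compares a reduced binomial model with one shared ace probability against a full model with one probability per condition. The two models differ by one parameter, so the difference in deviances is compared with $\chi^2_1$.

```python
import math

# Aces and total serves per condition, collapsed from the McEnroe table
# (faults and returned serves are pooled as "no ace" for this illustration).
aces = {"grunt": 61, "silent": 35}
serves = {"grunt": 61 + 32 + 144, "silent": 35 + 8 + 53}

def binomial_loglik(successes, trials, prob):
    """Binomial log-likelihood, dropping the constant binomial coefficient."""
    return successes * math.log(prob) + (trials - successes) * math.log(1.0 - prob)

# Reduced model R: one shared ace probability (1 parameter).
pooled = sum(aces.values()) / sum(serves.values())
loglik_R = sum(binomial_loglik(aces[g], serves[g], pooled) for g in aces)

# Full model F: a separate ace probability per condition (2 parameters).
loglik_F = sum(binomial_loglik(aces[g], serves[g], aces[g] / serves[g]) for g in aces)

# D_R - D_F: the saturated-model terms cancel, leaving twice the log-likelihood gap.
deviance_diff = 2.0 * (loglik_F - loglik_R)

# For 1 degree of freedom the chi-squared upper tail equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(deviance_diff / 2.0))
print(deviance_diff, p_value)
```
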


In [None]:
# We will note this again in another lecture

$\text{The } \textit{Kolmogorov-Smirnov (K-S) test } \text{provides another test of distributional assumptions}$

$\text{Interestingly, while the K-S test examines distributional (often, parametric) assumptions,
the test itself is } \textit{nonparametric}$

$\text{The K-S test can examine distributional assumptions of a single sample}$

$ \left\{ \begin{array}{l}
H_0: X_i \sim f(\theta) \\
H_a: H_0 \text{ False}
\end{array} \right. $

$\text{or compare the distributions of two samples} $ 

$ \left\{ \begin{array}{l}
H_0: f_X(\theta_X) = f_Y(\theta_Y) \\
H_a: H_0 \text{ False}
\end{array} \right. $

$\text{Like all other tests, the K-S test has a test statistic (traditionally denoted $D_n$), and }$
$\text{is based on a null distribution (called the $Kolmogorov \; distribution$ for the K-S test)}$




<table align="center">
<tr>
<td><img src="stuff/ks1.png" width="300px" align="center"></td>
<td><img src="stuff/ks2.png" width="300px" align="center"></td>
</tr>
</table>

In [None]:
from scipy import stats

# print(stats.ks_2samp(soft_resin_Case, soft_resin_Briant))

# One-sample K-S test of t-distributed draws (30 degrees of freedom) against a standard normal
print(stats.kstest(stats.t.rvs(df=30,size=1000),'norm'))
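
To show what the statistic $D_n$ measures, here is a minimal standard-library sketch that computes the largest vertical gap between a sample's empirical CDF and a reference CDF. The helper names `normal_cdf` and `ks_statistic` and the simulated samples are assumptions for illustration. Only the statistic is computed here; the p-value, which requires the Kolmogorov distribution, is left to `stats.kstest`.

```python
import math
import random

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample, cdf):
    """Largest vertical gap between the empirical CDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x, so check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

random.seed(0)
near_normal = [random.gauss(0.0, 1.0) for _ in range(1000)]
shifted = [x + 3.0 for x in near_normal]   # clearly not standard normal

d_near = ks_statistic(near_normal, normal_cdf)
d_shifted = ks_statistic(shifted, normal_cdf)
print(d_near, d_shifted)
```

A sample that really comes from the reference distribution yields a small $D_n$, while the shifted sample yields a value close to 1.
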

# Parametric Versus Nonparametric

#### We might characterize an analysis framework as parametric if

### Results are buttressed or bolstered by modeling assumptions 
* E.g., leveraging the structure of normality via a t-test increases power but comes at a cost of loss of robustness compared to nonparametric tests free of distributional assumptions

### Predicted values are based on "parameters"
* E.g., the $\beta$ coefficients in linear regression

### Parameter estimation determines the specific instance of a model within a "model class" defined by those parameters
* E.g., the CLT is based on a normal distribution which is determined by estimating $\mu$ and $\sigma^2$

### The complexity of the model does not grow as data size $n$ grows
* E.g., trees grow in complexity as data becomes richer while a normal distribution is defined by $\mu$ and $\sigma^2$ regardless of $n$
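
The last point can be sketched in a few lines. The example below swaps in the empirical CDF as the nonparametric description (rather than a tree, which would need more machinery) and compares how much each description has to store as $n$ grows. The helper names and simulated data are assumptions for illustration.

```python
import random
import statistics

def normal_summary(data):
    """Parametric: the fitted normal is always described by two numbers."""
    return {"mu": statistics.mean(data), "sigma2": statistics.variance(data)}

def ecdf_summary(data):
    """Nonparametric: the empirical CDF keeps one jump point per distinct value."""
    return sorted(set(data))

random.seed(0)
sizes = {}
for n in (10, 100, 1000):
    data = [random.gauss(0.0, 1.0) for _ in range(n)]
    # Record (size of parametric description, size of nonparametric description).
    sizes[n] = (len(normal_summary(data)), len(ecdf_summary(data)))
print(sizes)
```

The parametric description stays at two numbers for every $n$, while the size of the empirical CDF grows with the sample.
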
