In [1]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

from toolz import memoize

ModuleNotFoundError: No module named 'toolz'

# Hypothesis Testing and Confidence Interval Review

## Distributions
It is important to distinguish the distribution of a `statistic` (e.g. the distribution of the sample `mean`, or of the **difference in `means`**) from the distribution of a `population` or a `sample`.

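For example, the CLT (Central Limit Theorem) tells us that the distribution of the sample `mean` or `proportion` (the statistic a confidence interval is built around) will have a standard deviation of:

$$ \frac{\sigma_p}{\sqrt{n}}$$

where $\sigma_p$ is the population standard deviation and $n$ is the sample size. (For the sample `sum`, the standard deviation is $\sigma_p \sqrt{n}$ instead.) This is not the standard deviation of a **large sample** ($\approx \sigma_p$) or of the population ($\sigma_p$).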
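A quick simulation can make this distinction concrete. The sketch below assumes a made-up normal population (the names `population`, `n`, and `repetitions` are illustrative, not from the course materials) and compares three standard deviations: the population's, one large sample's, and that of many sample means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population with SD close to 10 (an assumption for illustration).
population = rng.normal(loc=50, scale=10, size=100_000)
sigma_p = population.std()

n = 400              # size of each sample
repetitions = 5_000  # number of samples drawn

# The SD of one large sample should be close to the population SD.
one_sample_sd = rng.choice(population, size=n).std()

# The SD of many sample means should be close to sigma_p / sqrt(n).
sample_means = np.array([
    rng.choice(population, size=n).mean()
    for _ in range(repetitions)
])
sd_of_sample_means = sample_means.std()
clt_prediction = sigma_p / np.sqrt(n)

print("population SD:      ", round(sigma_p, 3))
print("SD of one sample:   ", round(one_sample_sd, 3))
print("SD of sample means: ", round(sd_of_sample_means, 3))
print("sigma_p / sqrt(n):  ", round(clt_prediction, 3))
```

The first two numbers should be close to each other, and the last two should be close to each other but roughly $\sqrt{n}$ times smaller.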

#### Example
Let's say we generate a confidence interval of `[0.4, 0.6]` for the true proportion of supporters of a particular policy.

In this case, it's clear that the confidence interval doesn't mean that members of the population give a fractional amount of support. Each person either supports the policy or does not. The interval describes plausible values for the population proportion, not individual responses.

As sample size increases, this interval becomes narrower. This makes sense because a sample that approaches the size of the population (in the limit, the population itself) will estimate the parameter exactly.
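To see how the interval narrows, here is a minimal sketch using the normal approximation, assuming the sample proportion stays at 0.5 while the sample size grows. The helper `approx_95_interval` is a hypothetical name introduced for illustration.

```python
import numpy as np

def approx_95_interval(p_hat, n):
    """Approximate 95% interval: p_hat plus or minus 2 SDs of the sample proportion."""
    sd_of_sample_proportion = np.sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - 2 * sd_of_sample_proportion,
            p_hat + 2 * sd_of_sample_proportion)

# Quadrupling the sample size halves the width of the interval.
for n in [100, 400, 1_600, 6_400]:
    low, high = approx_95_interval(0.5, n)
    print(f"n = {n:>5}: [{low:.3f}, {high:.3f}]  width = {high - low:.3f}")
```

Under these assumptions, a sample of 100 gives the interval `[0.4, 0.6]` from the example, and each fourfold increase in sample size cuts the width in half.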

## Hypothesis Testing v. Confidence Intervals
1. Hypothesis testing answers `yes` or `no` questions such as:
    * Is the proportion of red balloons 50%?
    * Is the proportion of red balloons greater / less than 50%?


2. Confidence intervals give us an interval of plausible values for our parameter, which allows us to answer the same questions and more.

## P-values and Confidence Levels

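**Hypothesis Testing:** we designate a cutoff (5%) for the proportion of samples, in one **tail** or split across both, that we consider **extreme**.

**Confidence Intervals:** we construct an interval with a confidence level (95%) and consider samples that fall outside the interval (in the remaining 5%) to be **extreme**.

Even when the null is true, it is possible that our sample was one of the 5% of **extreme** samples, in which case we reject incorrectly (false positive).
* Likewise, a sample that is not extreme is merely consistent with the null and does not prove it. This is why we **fail to reject**, not **accept**.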

<img src = 'distribution.jpg' width = 400/>
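The false positive rate can be checked by simulation. The sketch below assumes a two-sided test of a 50% model with samples of size 500, and uses the normal-approximation cutoff of about 2 SDs; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 500                 # sample size
null_proportion = 0.5   # the null model is true in this simulation
repetitions = 20_000

# Sample proportions from a world in which the null is true.
sample_proportions = rng.binomial(n, null_proportion, size=repetitions) / n

# Call a sample "extreme" if it lands more than about 2 SDs from 0.5
# (the two-sided 5% cutoff under the normal approximation).
sd_of_sample_proportion = np.sqrt(null_proportion * (1 - null_proportion) / n)
is_extreme = np.abs(sample_proportions - null_proportion) > 1.96 * sd_of_sample_proportion

false_positive_rate = is_extreme.mean()
print("fraction of true-null samples rejected:", false_positive_rate)
```

The printed fraction should land near 5%: even though the null holds in every simulated sample, roughly 1 in 20 of them looks extreme.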

## Testing

### Step 1: Null and alternative hypotheses
1. The null hypothesis states that the model holds (i.e. 50% of the balloons are red) and that any deviation from that model is due to random chance (not statistically significant).
2. The alternative hypothesis states what we believe instead. For example:
    * The proportion of red balloons is not 50%
    * The proportion of red balloons is more/less than 50%

### Step 2: Define test statistic
* A test statistic can be any number computed from the sample
* Choice depends on type of distribution (numerical or categorical) and our belief (distance or direction)

#### Categorical
1. One sample: TVD between model and sample
2. Two samples: TVD between samples
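As a reminder, the TVD (total variation distance) between two categorical distributions is half the sum of the absolute differences of their proportions. A minimal sketch, with a hypothetical sample distribution:

```python
import numpy as np

def total_variation_distance(dist1, dist2):
    """Half the sum of absolute differences between two arrays of proportions."""
    dist1 = np.asarray(dist1, dtype=float)
    dist2 = np.asarray(dist2, dtype=float)
    return np.abs(dist1 - dist2).sum() / 2

model = [0.5, 0.5]      # e.g. 50% red, 50% not red
sample = [0.56, 0.44]   # hypothetical observed proportions

tvd = total_variation_distance(model, sample)
print("TVD between model and sample:", round(tvd, 4))
```

For two samples, the same function applies with the second sample's proportions in place of the model.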

#### Numerical
1. One sample: `mean`, `proportion`, `sum`, etc.
2. Two samples: difference of `means`


### Step 3: Distribution of / Interval for Statistic

#### Hypothesis Testing
Sample from the **model**. For example:
* `['A' 'B']` if we think 50% of the population is `A`
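Sampling from the model can be sketched with plain NumPy as below. The sample size of 500 and the helper name `simulate_proportion_of_A` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# The model: each draw is 'A' or 'B' with equal chance.
model = np.array(['A', 'B'])

def simulate_proportion_of_A(sample_size):
    """Draw one sample from the model and return its proportion of 'A'."""
    simulated_sample = rng.choice(model, size=sample_size)
    return np.mean(simulated_sample == 'A')

sample_size = 500
null_statistics = np.array([
    simulate_proportion_of_A(sample_size)
    for _ in range(5_000)
])

print("center of the null distribution:", round(null_statistics.mean(), 3))
```

The array `null_statistics` is the empirical distribution of the statistic under the null, which should be centered near 0.5.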

#### Confidence Intervals
1. Bootstrap the original sample (e.g. `['A' 'A' 'A' 'A']`), or
2. Normal approximation by CLT: $\left(\mu - 2\frac{\sigma}{\sqrt{n}},\ \mu + 2\frac{\sigma}{\sqrt{n}}\right)$, where $\mu$ and $\sigma$ are estimated by the sample mean and the sample standard deviation
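Both options can be compared side by side. The sketch below assumes a hypothetical numerical sample of size 400 and uses the percentile method for the bootstrap interval; `bootstrap_means` is an illustrative helper, not a library function.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical original sample (a stand-in for real data).
sample = rng.normal(loc=100, scale=15, size=400)

def bootstrap_means(sample, repetitions=5_000):
    """Resample with replacement, at the original size, and record each mean."""
    n = len(sample)
    return np.array([
        rng.choice(sample, size=n, replace=True).mean()
        for _ in range(repetitions)
    ])

# Option 1: middle 95% of the bootstrapped means.
bootstrap_interval = np.percentile(bootstrap_means(sample), [2.5, 97.5])

# Option 2: sample mean plus or minus 2 SDs of the sample mean.
center = sample.mean()
sd_of_sample_mean = sample.std() / np.sqrt(len(sample))
normal_interval = np.array([center - 2 * sd_of_sample_mean,
                            center + 2 * sd_of_sample_mean])

print("bootstrap interval:", np.round(bootstrap_interval, 2))
print("normal interval:   ", np.round(normal_interval, 2))
```

With a large random sample, the two intervals should nearly coincide.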

### Step 4: Conclusion

#### Hypothesis Testing
Calculate the `p-value` (the proportion of statistics in our **null distribution** that are at least as extreme as our observed statistic) and reject / fail to reject based on the `p-value` cutoff.
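A minimal sketch of this calculation for a two-sided alternative follows. It assumes the statistic is the count of supporters in a sample of 500 (counts avoid floating-point ties), and the observed count of 270 is hypothetical.

```python
import numpy as np

def empirical_p_value(null_statistics, observed_statistic, null_value):
    """Proportion of simulated statistics at least as far from null_value as the observed one."""
    null_statistics = np.asarray(null_statistics)
    observed_distance = abs(observed_statistic - null_value)
    simulated_distances = np.abs(null_statistics - null_value)
    return np.mean(simulated_distances >= observed_distance)

rng = np.random.default_rng(4)

# Null distribution: counts of supporters in samples of 500 under a 50% model.
null_counts = rng.binomial(500, 0.5, size=10_000)

observed_count = 270   # hypothetical observed statistic
p_value = empirical_p_value(null_counts, observed_count, null_value=250)

cutoff = 0.05
print("p-value:", p_value)
print("reject the null" if p_value < cutoff else "fail to reject the null")
```

For a one-sided alternative, the comparison would use the direction of the alternative instead of the absolute distance.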

#### Confidence Interval
Reject / fail to reject based on whether the hypothesized value of the parameter is in the interval.

## Special Case: Comparing Samples
* When we have a sample with elements in one of 2 classes, we might want to see if **class causes some difference**
    * Categorical Distribution: Permutation Testing
    * Numerical Distribution:
        * Bootstrapped Confidence Interval for Difference, or
        * Permutation Testing 

For a numerical distribution, we usually use a Bootstrapped Confidence Interval for the Difference, since it allows us to make a stronger statement (a range of plausible differences rather than a yes/no answer), and we can use it **as long as we meet the conditions for the bootstrap**.

This can support **causation** only when class membership was randomly assigned; otherwise it shows an association. In addition, if our original sample is not random, then our results only hold **within our sample**.
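For reference, a permutation test for a numerical variable shuffles the class labels and recomputes the difference of means each time. The sketch below uses synthetic data and illustrative names (`difference_of_means`, `permutation_test`); it is one way to write the test, not the only one.

```python
import numpy as np

rng = np.random.default_rng(5)

def difference_of_means(values, labels):
    """Mean of class 'A' minus mean of class 'B'."""
    return values[labels == 'A'].mean() - values[labels == 'B'].mean()

def permutation_test(values, labels, repetitions=5_000):
    """Two-sided permutation test: shuffle labels, recompute the statistic."""
    observed = difference_of_means(values, labels)
    simulated = np.array([
        difference_of_means(values, rng.permutation(labels))
        for _ in range(repetitions)
    ])
    p_value = np.mean(np.abs(simulated) >= abs(observed))
    return observed, p_value

# Synthetic data: class 'B' is shifted upward by 2 units on purpose.
values = np.concatenate([rng.normal(10, 2, size=100),
                         rng.normal(12, 2, size=100)])
labels = np.array(['A'] * 100 + ['B'] * 100)

observed, p_value = permutation_test(values, labels)
print("observed difference:", round(observed, 3))
print("p-value:", p_value)
```

Shuffling the labels simulates a world where class has no relationship to the values, which is exactly what the null hypothesis claims.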

# Problems

## Problem 1

A senator of Oski University wants to know roughly how many people support her platform. She believes that around 50% of students support her platform; however, she doesn't know how best to confirm her hypothesis. Therefore, she consults 2 friends who are well versed in data science to help her out.

To test this hypothesis, her 2 friends take a random sample of 500 students in the university and ask each student whether they support the senator's platform. However, once they attain this sample, the 2 friends don't agree on the best approach to tackle this problem.

1. The first believes that he should run a hypothesis test
2. The second believes that he should use confidence intervals. 

They decided to try both approaches, starting with the hypothesis test, to see which one is better.

### Hypotheses

#### Null
50% of students in Oski University support her platform. Any observed deviation from this proportion is due to random chance.

#### Alternative
The true proportion of students who support her platform is not 50%.

### How can we use both hypothesis testing and confidence intervals?

#### Hypothesis Testing:
We define our model such that each person is a supporter with a 50% chance (e.g. `[0 1]`), simulate many samples of 500 from it, and compare the simulated proportions of supporters against the proportion observed in our sample.

#### Confidence Intervals:
We bootstrap our sample, build a confidence interval, and see if 50% is within the interval.

### Which Method is Better and Why?
Answer: **Confidence Intervals**

Once we have a confidence interval, we have a range of plausible values for the true proportion of supporters, so we can easily answer future `yes/no` questions about what the true proportion of supporters is.

For example, with a confidence interval, we can easily check whether the true proportion of supporters is 60% just by checking if 60% is in the interval.

For hypothesis testing, we would need to generate a distribution for each question since **our distribution is built under the null** (new question -> new null -> new distribution)
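The sketch below illustrates the "one interval, many questions" point. The survey result (265 supporters out of 500) is hypothetical, and `is_plausible` is a helper name introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical survey result: 265 of the 500 sampled students are supporters.
sample = np.array([1] * 265 + [0] * 235)

# Bootstrap the sample proportion once.
bootstrapped_proportions = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(5_000)
])
interval = np.percentile(bootstrapped_proportions, [2.5, 97.5])

def is_plausible(value, interval):
    """True if the candidate value lies inside the confidence interval."""
    low, high = interval
    return low <= value <= high

# The same interval answers every question; no new simulation is needed.
for candidate in [0.50, 0.55, 0.60]:
    verdict = "fail to reject" if is_plausible(candidate, interval) else "reject"
    print(f"Is the true proportion {candidate:.2f}? {verdict}")
```

A hypothesis test would need a fresh null distribution for each of the three candidate values, while the bootstrap runs only once.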

## Problem 2

2 students of Oski University, each pursuing different disciplines, argue about whether there is a difference in the IQ levels of the students within each of their fields of interest. To test this, they collect a sample of 1000 students of each discipline and record the IQ level of each student. They then decide to use this data to run a statistical test. Answer the following preliminary questions that the students asked themselves as they constructed and ran their test.

Assume the sample is **large enough** and **randomly** selected.

### Hypotheses

#### Null
Regardless of field of interest, there is no difference in the IQ levels of students. Any observed difference is due to random chance.

#### Alternative
Choice of field makes a difference in the IQ level of an individual.

### What type of test should we run and why?

#### Ans: Bootstrapped Confidence Interval for Difference in Means

We are comparing 2 numerical distributions, which means that we can use the `difference of means` of the 2 samples as our statistic. Since our samples are large and random, the conditions for the bootstrap are met, and **according to the CLT, the difference of sample means is roughly normal**, so we can use bootstrapping instead of a permutation test.
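A hedged sketch of this approach is shown below. The two arrays are synthetic stand-ins for the recorded IQ levels (both generated from the same distribution, purely for illustration), and each group is resampled independently at its original size.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic stand-ins for the two samples of 1000 students each.
field_1 = rng.normal(100, 15, size=1_000)
field_2 = rng.normal(100, 15, size=1_000)

def bootstrap_difference_of_means(x, y, repetitions=5_000):
    """Resample each group with replacement and record the difference of means."""
    differences = np.empty(repetitions)
    for i in range(repetitions):
        resampled_x = rng.choice(x, size=len(x), replace=True)
        resampled_y = rng.choice(y, size=len(y), replace=True)
        differences[i] = resampled_x.mean() - resampled_y.mean()
    return differences

differences = bootstrap_difference_of_means(field_1, field_2)
low, high = np.percentile(differences, [2.5, 97.5])

print(f"95% interval for the difference in means: [{low:.2f}, {high:.2f}]")
print("reject the null" if not (low <= 0 <= high) else "fail to reject the null")
```

If 0 falls inside the interval, we fail to reject the null of no difference; if not, the interval also tells us the direction and rough size of the difference.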

Why is bootstrapping "better" than a permutation test?

**Ans**: Same reason why confidence intervals are "better" than hypothesis tests.



### Can the results of this test show causality amongst the entire population or just the sample?

#### Answer: The Entire Population

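Since our original sample was **large and random**, the results **can be generalized to the entire population**, not just the sample itself. Note that a random sample supports generalization only: because students were not randomly assigned to fields, a difference here is evidence of an association rather than proof that the field itself causes it.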

In real life, this is rarely possible because obtaining a **truly random** sample is difficult.