# Hypothesis Testing

## Scenarios

- Chemistry - do inputs from two different barley fields produce different
yields?
- Astrophysics - do star systems with near-orbiting gas giants have hotter
stars?
- Economics - demography, surveys, etc.
- Medicine - BMI vs. Hypertension, etc.
- Business - which ad is more effective given engagement?

![img1](./img/img1.png)

![img2](./img/img2.png)

### Null Hypothesis / Alternative Hypothesis Structure

![img3](./img/img3.png)                  

### The Null Hypothesis

![gmonk](https://vignette.wikia.nocookie.net/villains/images/2/2f/Ogmork.jpg/revision/latest?cb=20120217040244) There is NOTHING, **no** difference.
![the nothing](https://vignette.wikia.nocookie.net/theneverendingstory/images/a/a0/Bern-foster-the-neverending-story-clouds.jpg/revision/latest?cb=20160608083230)

### The Alternative hypothesis

![difference](./img/giphy.gif)

### Error

- TYPE I: False positive (incorrectly reject a null hypothesis that is true)
- TYPE II: False negative (incorrectly fail to reject a null hypothesis that is false)
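One way to make the Type I error rate concrete is to simulate many experiments in which the null hypothesis really is true and count how often a t-test rejects it anyway. The sketch below is only an illustration: the normal distribution, the sample size of 10, the 2000 repetitions, and the 0.05 threshold are all assumptions, not values from these notes.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

alpha = 0.05          # assumed significance threshold
n_experiments = 2000  # assumed number of simulated experiments
sample_size = 10      # assumed observations per experiment

# Every row is one experiment drawn from a population whose true mean is 0,
# so the null hypothesis "mean == 0" is true in every experiment.
samples = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, sample_size))

# One-sample t-test per row (axis=1) against the true mean.
result = ttest_1samp(samples, popmean=0.0, axis=1)

# Any rejection here is a false positive (Type I error).
false_positive_rate = np.mean(result.pvalue < alpha)
print(f"Fraction of true nulls rejected: {false_positive_rate:.3f}")
```

The printed fraction should land close to the chosen threshold, which is the sense in which the threshold controls the false positive rate.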

### Choosing the right error rate

- Alpha, α
- Sigma, σ
- Depends on field of study, 0.00001 ≤ α ≤ 0.2
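Some fields quote their threshold as a number of standard deviations ("sigma") rather than as α. Assuming that is what the Sigma bullet refers to, and assuming a standard normal test statistic, the conversion can be sketched with the standard library. The helper name `sigma_to_alpha` is hypothetical.

```python
from statistics import NormalDist


def sigma_to_alpha(k, two_sided=True):
    """Probability that a standard normal value lies beyond k standard deviations."""
    tail = NormalDist().cdf(-k)  # lower-tail area, equal to the upper tail by symmetry
    return 2 * tail if two_sided else tail


for k in (1, 2, 3, 5):
    print(f"{k} sigma -> alpha = {sigma_to_alpha(k):.2e} (two-sided)")
```

A stricter sigma threshold corresponds to a smaller α, so fewer false positives are tolerated.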

### T-test

Why use it?
- Sometimes the population standard deviation is irrelevant, and sometimes it’s
unknown. (we’ll get to the different types of t-test later)
- Sometimes a sample is too small to be confident that it’s an accurate representation of reality

### T vs Z (again)

A t-test is like a modified z-test:
- Penalize for small sample size - “degrees of freedom”
- Use sample std. dev. s to estimate population σ
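The small-sample penalty can be seen by comparing critical values. The sketch below assumes a two-tailed test at α = 0.05 (an illustrative choice) and prints the z critical value next to t critical values for a few arbitrary degrees of freedom.

```python
from scipy import stats

alpha = 0.05  # assumed two-tailed significance level

# Critical value of the standard normal (z) distribution.
z_crit = stats.norm.ppf(1 - alpha / 2)

# Critical values of the t distribution for increasing degrees of freedom.
t_crits = {df: stats.t.ppf(1 - alpha / 2, df) for df in (2, 5, 10, 30, 100)}

print(f"z critical value: {z_crit:.3f}")
for df, t_crit in t_crits.items():
    print(f"df={df:>3}: t critical value = {t_crit:.3f}")
```

With few degrees of freedom the t critical value is much larger than the z value, so a small sample needs a more extreme statistic to reject; as the degrees of freedom grow, the two converge.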

![img5](./img/img5.png)

### T and Z in detail
![img4](./img/img4.png)

### T-value table

![img6](./img/img6.png)

### P-Values
![picjellybeans](https://imgs.xkcd.com/comics/significant.png)

### Language of Hypothesis Testing

If p < α : we *reject* the null hypothesis<br>
If p > α : we *fail to reject* the null hypothesis


Language is **important**

### What if the experiment fails?

- Don’t throw out failed experiments
- This methodology, with this data, does not produce significant results
 - More data
 - More time
 - More details

### T-test success recipe

Regardless of the type of t-test you are performing, there are 5 main steps to executing them:

- Set up null and alternative hypotheses

- Choose a significance level

- Calculate the test statistic

- Determine the critical value or the p-value (find the rejection region)

- Compare the t-value with the critical t-value (or the p-value with α) to reject or fail to reject the null hypothesis.
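The five steps can be sketched end to end for a one-sample, two-tailed test using the critical-value approach. The data and population mean are taken from Question 1 below; the significance level of 0.05 is an assumption made only for this illustration.

```python
from statistics import mean, stdev
from scipy import stats

data = [90, 100, 110]  # sample from Question 1
mu0 = 85               # population mean from Question 1

# Step 1: H0: the sample comes from a population with mean mu0.
#         H1: the population mean differs from mu0 (two-tailed).

# Step 2: choose a significance level (assumed value).
alpha = 0.05

# Step 3: calculate the test statistic.
n = len(data)
df = n - 1
t_stat = (mean(data) - mu0) / (stdev(data) / n ** 0.5)

# Step 4: find the critical value that bounds the rejection region.
t_crit = stats.t.ppf(1 - alpha / 2, df)

# Step 5: compare and state the result in the proper language.
decision = "reject" if abs(t_stat) > t_crit else "fail to reject"
print(f"t = {t_stat:.3f}, critical t = {t_crit:.3f}: {decision} the null hypothesis")
```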

# Question 1
Is this sample any different from the population?
- Population mean = 85
- Sample = [90,100,110]

#### Using `scipy`

In [None]:
from scipy.stats import ttest_1samp
data = [90,100,110]
ttest_1samp(data,85)

#### Manual implementation

In [None]:
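from statistics import mean, stdev

data = [90,100,110]
mu = 85
n = len(data)
x_bar = mean(data)  # sample mean
s = stdev(data)     # sample standard deviation (n-1 denominator)
df = n-1            # degrees of freedom

# t statistic: distance of the sample mean from mu, in standard errors
t = (x_bar-mu)/(s/(n**.5))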

In [None]:
print(t)
print(df)
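The manual t-value and degrees of freedom can be turned into a p-value with the t distribution's survival function and checked against `ttest_1samp`. This is a sketch that assumes a two-tailed test, which is the default for `ttest_1samp`.

```python
from statistics import mean, stdev
from scipy import stats

data = [90, 100, 110]
mu = 85
n = len(data)
df = n - 1

# Same t statistic as the manual implementation.
t_stat = (mean(data) - mu) / (stdev(data) / n ** 0.5)

# Two-tailed p-value: area in both tails beyond |t|.
p_manual = 2 * stats.t.sf(abs(t_stat), df)

# scipy's result for comparison.
result = stats.ttest_1samp(data, mu)

print(f"manual: t = {t_stat:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {result.statistic:.4f}, p = {result.pvalue:.4f}")
```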

# Question 2

I'm buying jeans from store A and store B.  I know nothing about their inventory other than prices. Should I go to just one store for a less expensive pair of jeans?
I'm pretty apprehensive about this big decision so alpha = 0.10

Try this both manually and with scipy

- [20,30,30,50,75,25,30,30,40,80]
- [60,30,70,90,60,40,70,40]

In [None]:
store1 = [20,30,30,50,75,25,30,30,40,80]
store2 = [60,30,70,90,60,40,70,40]

from scipy.stats import ttest_ind


In [None]:
ttest_ind(store1, store2, equal_var = False)
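For the manual version, one possible sketch is Welch's t-test, which is what `ttest_ind` computes when `equal_var = False`. The degrees of freedom come from the Welch-Satterthwaite approximation, and the p-value is two-tailed to match scipy's default; the variable names are illustrative.

```python
from statistics import mean, variance
from scipy import stats

store1 = [20, 30, 30, 50, 75, 25, 30, 30, 40, 80]
store2 = [60, 30, 70, 90, 60, 40, 70, 40]

n1, n2 = len(store1), len(store2)

# Squared standard error of each sample mean (sample variance / n).
v1 = variance(store1) / n1
v2 = variance(store2) / n2

# Welch's t statistic: difference in means over the combined standard error.
t_stat = (mean(store1) - mean(store2)) / (v1 + v2) ** 0.5

# Welch-Satterthwaite approximation for the degrees of freedom.
df_welch = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# Two-tailed p-value.
p_two_sided = 2 * stats.t.sf(abs(t_stat), df_welch)

result = stats.ttest_ind(store1, store2, equal_var=False)
print(f"manual: t = {t_stat:.4f}, df = {df_welch:.2f}, p = {p_two_sided:.4f}")
print(f"scipy:  t = {result.statistic:.4f}, p = {result.pvalue:.4f}")
```

Whether a one-tailed or two-tailed p-value is the right one to compare with alpha depends on how the hypotheses are stated, so that choice should be made before looking at the result.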

In [None]:
>>> import numpy as np
>>> from scipy import stats
>>> np.random.seed(12345678)
>>> # Test with samples that have identical means:
>>> rvs1 = stats.norm.rvs(loc=5,scale=10,size=500)
>>> rvs2 = stats.norm.rvs(loc=5,scale=10,size=500)
>>> stats.ttest_ind(rvs1,rvs2)
(0.26833823296239279, 0.78849443369564776)
>>> stats.ttest_ind(rvs1,rvs2, equal_var = False)
(0.26833823296239279, 0.78849452749500748)

In [None]:
print(t)
print(df)

# Question 3
Given the same data as in Question 1, how many more samples would you need to achieve p ≤ 0.01, assuming the sample mean and sample std. dev. do not change?

In [None]:
data = [90,100,110]
mu = 85
n = len(data)
s = stdev(data)
df = n-1

t = (100-85)/(s/(n**.5))

In [None]:
print(t)

In [None]:
for n in range(3,10):
    df = n-1
    t = (100-85)/(s/(n**.5))
    print (df,t)

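With a one-tailed test you'd need 5 degrees of freedom, n=6.  That's 3 more samples. With a two-tailed test (the default in `ttest_1samp`) you'd need 6 degrees of freedom, n=7, which is 4 more samples.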
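Rather than reading critical values from a table, the search can be sketched directly with p-values. The helper name `smallest_n` and the search limit of 50 are assumptions for illustration; the function holds the sample mean and standard deviation fixed and reports the first sample size whose p-value reaches the target, for either a one-tailed or a two-tailed test.

```python
from statistics import mean, stdev
from scipy import stats

data = [90, 100, 110]
mu = 85
x_bar = mean(data)  # held fixed as n grows
s = stdev(data)     # held fixed as n grows
target_p = 0.01


def smallest_n(two_sided, max_n=50):
    """Smallest sample size with p <= target_p, or None if not found below max_n."""
    for n in range(len(data), max_n):
        df = n - 1
        t_stat = (x_bar - mu) / (s / n ** 0.5)
        p = stats.t.sf(abs(t_stat), df)  # one-tailed area beyond |t|
        if two_sided:
            p *= 2
        if p <= target_p:
            return n
    return None


print("one-tailed:", smallest_n(two_sided=False))
print("two-tailed:", smallest_n(two_sided=True))
```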