In [1]:
import pandas as pd
import numpy as np
import math
from collections import defaultdict
import ts_code.nsfg as nsfg
import ts_code.thinkstats2 as thinkstats2
import ts_code.thinkplot as thinkplot
import ts_code.first as first
import matplotlib.pyplot as plt
%matplotlib inline

## Chapter 9 - Hypothesis testing

The fundamental question we want to address is whether the effects we see in a sample are likely to appear in the larger population.

There are several ways we could formulate this question, including Fisher null hypothesis testing, Neyman-Pearson decision theory, and Bayesian inference.

In this chapter we walk through a subset of all three, called **classical hypothesis testing**.

Here's how we answer the question above:

+ The first step is to quantify the size of the apparent effect by choosing a **test statistic**. 

+ The second step is to define a **null hypothesis**, which is a model of the system based on the assumption that the apparent effect is not real.

+ The third step is to compute a **p-value**, which is the probability of seeing an effect at least as large as the apparent effect if the null hypothesis is true.

+ The last step is to interpret the result. If the p-value is low, the effect is said to be **statistically significant**, which means that it is unlikely to have occurred by chance. In that case we infer that the effect is more likely to appear in the larger population.

One of the most common effects to test is a difference in means between two groups. 

In the NSFG data, we saw that the mean pregnancy length for first babies is slightly longer, and the mean birth weight is slightly smaller.


For these examples, the null hypothesis is that the distributions for the two
groups are the same. One way to model the null hypothesis is by **permutation**; that is, we can take values for first babies and others and shuffle them,
treating the two groups as one big group.

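So, we take the two groups, combine them, shuffle, and then partition them back into two groups of the original sizes. At the end of each trial we compute the difference in means, and the p-value is the fraction of trials in which that difference is at least as large as the observed one.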
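The permutation procedure can be sketched with NumPy alone. This is a minimal illustration, not the `thinkstats2.HypothesisTest` implementation; the function name `permutation_pvalue` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def permutation_pvalue(group1, group2, iters=1000, seed=0):
    """Estimate a two-sided p-value for a difference in means by permutation.

    Hypothetical helper for illustration; not part of thinkstats2.
    """
    rng = np.random.default_rng(seed)
    group1 = np.asarray(group1, dtype=float)
    group2 = np.asarray(group2, dtype=float)
    n = len(group1)

    # Test statistic: absolute difference in means
    observed = abs(group1.mean() - group2.mean())

    # Model of the null hypothesis: both groups come from one pool
    pool = np.concatenate([group1, group2])

    count = 0
    for _ in range(iters):
        shuffled = rng.permutation(pool)
        simulated = abs(shuffled[:n].mean() - shuffled[n:].mean())
        if simulated >= observed:
            count += 1
    return count / iters

# Synthetic data (assumed for the example, not the NSFG sample)
rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=200)
b = rng.normal(1.0, 1.0, size=200)

p_diff = permutation_pvalue(a, b)         # groups with different means
p_same = permutation_pvalue(a, a.copy())  # identical groups
print(p_diff, p_same)
```

When the two groups are identical the observed difference is zero, so every shuffle is at least as extreme and the estimated p-value is 1.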

Choosing the best test statistic depends on what question you are trying to
address.

If we had some reason to think that first babies are likely to be late, then we would not take the absolute value of the difference; instead we would use a one-sided difference.

This kind of test is called **one-sided** because it only counts one side of the distribution of differences.

The previous test, using both sides, is **two-sided**.

In general the p-value for a one-sided test is about half the p-value for a two-sided test, depending on the shape of the distribution.
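To see how the two kinds of test relate, the sketch below computes both p-values from the same set of shuffles. The data are deterministic and purely illustrative (`group1` is `group2` shifted up by 0.2), and the variable names are assumptions for the example.

```python
import numpy as np

# Deterministic illustrative data: group1 is group2 shifted up by 0.2
group2 = np.linspace(0.0, 1.0, 50)
group1 = group2 + 0.2

rng = np.random.default_rng(0)
n = len(group1)
pool = np.concatenate([group1, group2])

# Signed observed difference (group1 minus group2)
observed = group1.mean() - group2.mean()

# Signed differences simulated under the null hypothesis
sim_diffs = np.empty(1000)
for i in range(len(sim_diffs)):
    shuffled = rng.permutation(pool)
    sim_diffs[i] = shuffled[:n].mean() - shuffled[n:].mean()

# One-sided: count only differences in the same direction
p_one_sided = np.mean(sim_diffs >= observed)

# Two-sided: count differences of either sign
p_two_sided = np.mean(np.abs(sim_diffs) >= abs(observed))

print(p_one_sided, p_two_sided)
```

Because the observed difference is positive here, every shuffle counted by the one-sided test is also counted by the two-sided test, so the one-sided p-value cannot be larger.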

These same methods can be used to test correlations, with Pearson's or Spearman's correlation as the test statistic. If we have reason to expect positive correlation, we could use a one-sided test.
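A correlation test by permutation might look like the following sketch, which shuffles one variable to break any relationship with the other. It uses Pearson's correlation via `np.corrcoef`; the function name `correlation_pvalue` and the synthetic data are assumptions for illustration, and this is not the `CorrelationPermute` class used later in the exercises.

```python
import numpy as np

def correlation_pvalue(xs, ys, iters=1000, seed=0):
    """Two-sided permutation test for Pearson's correlation (illustrative)."""
    rng = np.random.default_rng(seed)
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)

    # Test statistic: absolute value of Pearson's correlation
    observed = abs(np.corrcoef(xs, ys)[0, 1])

    count = 0
    for _ in range(iters):
        # Shuffling xs models the null hypothesis of no correlation
        shuffled = rng.permutation(xs)
        if abs(np.corrcoef(shuffled, ys)[0, 1]) >= observed:
            count += 1
    return count / iters

# Synthetic, strongly correlated data (assumed for the example)
rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = x + 0.5 * rng.normal(size=100)

p_related = correlation_pvalue(x, y)
print(p_related)
```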

We can also test proportions by comparing observed values vs expected values. 

If you have a die that comes up 3 on 19 of 60 rolls and 4 on only 5, you might suspect it is crooked. To test this, our test statistic is the sum of the absolute differences between the observed frequencies and the expected frequency of 10 for each face.

Next, we simulate rolling a fair die and see how often a total deviation this big or bigger comes up. This gives us a p-value we can use to decide whether the deviation is statistically significant.
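The dice test can be sketched as follows. Only the counts for faces 3 and 4 come from the text above; the counts for the other four faces are assumed here so that the total is 60, and the helper name `total_deviation` is likewise an assumption for the example.

```python
import numpy as np

# Observed frequencies for faces 1..6. Only the counts for 3 (19) and 4 (5)
# come from the text; the other four are assumed so the total is 60.
observed = np.array([8, 9, 19, 5, 8, 11])
n_rolls = observed.sum()
expected = np.full(6, n_rolls / 6)

def total_deviation(freqs):
    """Sum of absolute differences between observed and expected counts."""
    return np.abs(freqs - expected).sum()

observed_stat = total_deviation(observed)

rng = np.random.default_rng(0)
iters = 1000
count = 0
for _ in range(iters):
    rolls = rng.integers(1, 7, size=n_rolls)     # fair die, faces 1..6
    freqs = np.bincount(rolls, minlength=7)[1:]  # counts for faces 1..6
    if total_deviation(freqs) >= observed_stat:
        count += 1

p_value = count / iters
print(observed_stat, p_value)
```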

More often, instead of using the total deviation as the test statistic for testing proportions, we use the **chi-squared** statistic:

$$\chi^{2} = \sum_{i} \frac{(O_{i}-E_{i})^{2}}{E_{i}}$$

where $O_{i}$ are the observed frequencies and $E_{i}$ are the expected frequencies.

Squaring the deviations (rather than taking absolute values) gives more weight to large deviations. Dividing by the expected frequencies standardizes the deviations.
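As a rough sketch, here is the chi-squared statistic applied to a die rolled 60 times, with the p-value estimated by simulating a fair die. The frequencies are illustrative assumptions (only the 19 threes and 5 fours are taken from the dice example above), and `chi_squared` is a hypothetical helper.

```python
import numpy as np

# Illustrative frequencies for faces 1..6; only the counts 19 and 5
# (faces 3 and 4) come from the text, the rest are assumed.
observed = np.array([8, 9, 19, 5, 8, 11])
n_rolls = observed.sum()
expected = np.full(6, n_rolls / 6)

def chi_squared(freqs):
    """Chi-squared statistic: sum of (O - E)^2 / E over all faces."""
    return ((freqs - expected) ** 2 / expected).sum()

observed_stat = chi_squared(observed)

rng = np.random.default_rng(0)
iters = 1000
count = 0
for _ in range(iters):
    rolls = rng.integers(1, 7, size=n_rolls)     # fair die, faces 1..6
    freqs = np.bincount(rolls, minlength=7)[1:]  # counts for faces 1..6
    # Small tolerance so floating-point ties count as "at least as big"
    if chi_squared(freqs) >= observed_stat - 1e-9:
        count += 1

p_value = count / iters
print(observed_stat, p_value)
```

With the same data, the chi-squared statistic and the total deviation can give different p-values, because squaring puts more weight on the large deviation at face 3.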

It is important to note that the p-value depends on the choice of test statistic and the model of the null hypothesis, and sometimes these choices determine whether an effect is statistically significant or not.

If the effect is actually due to chance, what is the probability that we will wrongly consider it significant? This is the **false positive rate**.

If the effect is real, what is the chance that the hypothesis test will fail? This is the **false negative rate**.

The false positive rate is relatively easy to compute: if the threshold is 5%,
the false positive rate is 5%.

If there is no effect, the null hypothesis is true, so we can simulate it to get the distribution of the test statistic. The p-value is the probability that a random value from this distribution exceeds the observed test statistic $t$, which is 1-CDF(t). If the p-value is less than 5%, then CDF(t) is greater than 95%, that is, $t$ exceeds the 95th percentile, and this happens 5% of the time.
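One way to check this claim empirically is to run many experiments in which the null hypothesis is true by construction and count how often the test comes out significant. The sketch below does that with a small permutation test; the sample sizes, iteration counts, and the helper `perm_pvalue` are all assumptions chosen to keep the example fast.

```python
import numpy as np

def perm_pvalue(g1, g2, rng, iters=200):
    """Two-sided permutation p-value for a difference in means (illustrative)."""
    n = len(g1)
    observed = abs(g1.mean() - g2.mean())
    pool = np.concatenate([g1, g2])
    hits = 0
    for _ in range(iters):
        s = rng.permutation(pool)
        if abs(s[:n].mean() - s[n:].mean()) >= observed:
            hits += 1
    return hits / iters

rng = np.random.default_rng(0)
experiments = 200
false_positives = 0
for _ in range(experiments):
    # Both groups come from the same distribution, so there is no real effect
    g1 = rng.normal(size=30)
    g2 = rng.normal(size=30)
    if perm_pvalue(g1, g2, rng) < 0.05:
        false_positives += 1

false_positive_rate = false_positives / experiments
print(false_positive_rate)
```

With only 200 experiments the estimate is noisy, but it should land in the neighborhood of the 5% threshold.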

The false negative rate depends on the *effect size* and normally we don't know that. So, we can compute a rate conditioned on a hypothetical effect size.

We assume the observed difference is accurate, then use the observed sample as a model of the population, running hypothesis tests with simulated data.

We would sample with replacement from our two groups, run the hypothesis test, check the result, and count the number of false negatives, that is, outcomes with a p-value above our threshold (normally 5%).

This gives us the fraction of the time we expect an experiment with this sample size to yield a negative test. It is generally presented as 1 minus the false negative rate, which is the fraction of the time we expect a positive test.

This "correct positive rate" is called the **power** of the test or "sensitivity". It reflects the ability of the test to detect an effect of a given size.

As a rule of thumb, a power of 80% is considered acceptable; a test with less power than that is "underpowered".
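The power calculation described above can be sketched like this: treat the observed groups as a model of the population, resample each with replacement, and count how often the test is positive. The helpers `perm_pvalue` and `estimate_power`, the synthetic data, and the small iteration counts are assumptions made for illustration.

```python
import numpy as np

def perm_pvalue(g1, g2, rng, iters=200):
    """Two-sided permutation p-value for a difference in means (illustrative)."""
    n = len(g1)
    observed = abs(g1.mean() - g2.mean())
    pool = np.concatenate([g1, g2])
    hits = 0
    for _ in range(iters):
        s = rng.permutation(pool)
        if abs(s[:n].mean() - s[n:].mean()) >= observed:
            hits += 1
    return hits / iters

def estimate_power(group1, group2, rng, num_runs=100, alpha=0.05):
    """Fraction of resampled experiments that yield a positive test."""
    positives = 0
    for _ in range(num_runs):
        # Resample each group with replacement, keeping its original size
        s1 = rng.choice(group1, size=len(group1), replace=True)
        s2 = rng.choice(group2, size=len(group2), replace=True)
        if perm_pvalue(s1, s2, rng) < alpha:
            positives += 1
    return positives / num_runs

# Synthetic groups with a large difference in means (assumed for the example)
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=50)
b = rng.normal(1.5, 1.0, size=50)

power = estimate_power(a, b, rng)
false_negative_rate = 1 - power
print(power, false_negative_rate)
```

Rerunning the sketch with a smaller difference between the groups, or with fewer values per group, should give a lower estimate of power.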

In general a negative hypothesis test does not imply that there is no difference
between the groups; instead it suggests that if there is a difference, it is too
small to detect with this sample size.

The testing procedure outlined above is problematic for a few reasons. We performed multiple tests. A single hypothesis test has a false positive rate of 5%, but if you run 20 independent tests, the chance of at least one false positive is about 64%.
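The effect of running several tests can be worked out directly. Assuming the tests are independent and each uses a 5% threshold, the chance of at least one false positive among $m$ tests is $1 - 0.95^{m}$. The function name below is a hypothetical choice for the example.

```python
def prob_at_least_one_false_positive(m, alpha=0.05):
    """Chance of one or more false positives in m independent tests."""
    return 1 - (1 - alpha) ** m

for m in (1, 5, 20, 100):
    print(m, round(prob_at_least_one_false_positive(m), 3))
```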

The same dataset was also used for exploration and testing. If you explore a large dataset, find a surprising effect, and then test whether it is significant, you have a good chance of generating a false positive.

To compensate for multiple tests, you can adjust the p-value threshold using the Holm-Bonferroni method:

+ Order the $m$ p-values from lowest to highest, $P_{1} \le \dots \le P_{m}$, with corresponding null hypotheses $H_{1}\dots H_{m}$. For a given significance level $\alpha$, let $k$ be the minimal index such that $P_{k}>\frac{\alpha}{m + 1 - k}$.

+ Reject the null hypotheses $H_{1}\dots H_{k-1}$ and do not reject $H_{k}\dots H_{m}$. If $k=1$, do not reject any null hypothesis; if no such $k$ exists, reject all of them.
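The two steps above can be sketched as a short function. This is a minimal illustration under the definitions given here; `holm_bonferroni` is a hypothetical name and the p-values are made up for the example.

```python
import numpy as np

def holm_bonferroni(pvalues, alpha=0.05):
    """Return a boolean array: True where the null hypothesis is rejected."""
    pvalues = np.asarray(pvalues, dtype=float)
    m = len(pvalues)
    order = np.argsort(pvalues)  # indices from lowest to highest p-value
    reject = np.zeros(m, dtype=bool)

    # k runs over 1..m in order of increasing p-value
    for k, idx in enumerate(order, start=1):
        if pvalues[idx] > alpha / (m + 1 - k):
            break  # minimal k found: keep H_k..H_m
        reject[idx] = True
    return reject

# Made-up p-values for illustration
pvals = [0.01, 0.04, 0.03, 0.005]
rejected = holm_bonferroni(pvals)
print(rejected)
```

In this example the sorted p-values are 0.005, 0.01, 0.03, 0.04 and the thresholds are 0.0125, 0.0167, 0.025, 0.05, so the first two hypotheses in sorted order are rejected and the procedure stops at the third.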

Alternatively you can address both problems by partitioning the data, using one set for exploration and the other for testing.
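Partitioning can be as simple as shuffling the row indices once and splitting them in half. The sketch below uses a placeholder array in place of a real dataset; the names `explore` and `confirm` and the 50/50 split are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder standing in for the rows of a real dataset
values = np.arange(100)

# Shuffle the indices once, then split them into two disjoint halves
indices = rng.permutation(len(values))
half = len(values) // 2
explore = values[indices[:half]]  # used to look for interesting effects
confirm = values[indices[half:]]  # held back for hypothesis tests

print(len(explore), len(confirm))
```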

It is also common to address these problems implicitly by replicating published
results. Typically the first paper to report a new result is considered exploratory. Subsequent papers that replicate the result with new data are considered confirmatory.

### Exercises

**Exercise 9.1** As sample size increases, the power of a hypothesis test increases, which means it is more likely to be positive if the effect is real. Conversely, as sample size decreases, the test is less likely to be positive even if the effect is real.

To investigate this behavior, run the tests in this chapter with different subsets of the NSFG data. You can use thinkstats2.SampleRows to select a random subset of the rows in a DataFrame.

What happens to the p-values of these tests as sample size decreases? 
>p-values grow as the sample size decreases.

In [2]:
import ts_code.hypothesis as hyp


In [None]:
live, firsts, others = first.MakeFrames()
live = live.dropna(subset=['agepreg', 'totalwgt_lb'])

In [None]:
sample_sizes = range(150, 4400, 500)
prglngth_ps=[]
agewgth_ps= []
weightdiff_ps = []
chi_ps = []

for n in sample_sizes:
    #sample data
    live_sample = thinkstats2.SampleRows(live, n)
    firsts_sample = thinkstats2.SampleRows(firsts, n)
    others_sample = thinkstats2.SampleRows(others, n)
    
    #difference in pregnancy lengths
    data = firsts_sample.prglngth.values, others_sample.prglngth.values
    ht = hyp.DiffMeansPermute(data)
    prglngth_pvalue = ht.PValue(1000)
    
    #mean weight difference, first vs others
    data = (firsts_sample.totalwgt_lb.dropna().values,
            others_sample.totalwgt_lb.dropna().values)
    ht = hyp.DiffMeansPermute(data)
    weightdiff_pvalue = ht.PValue(1000)
    
    #correlation between mom age and birth weight
    data = live_sample.agepreg.values, live_sample.totalwgt_lb.values
    ht = hyp.CorrelationPermute(data)
    age_bwght_pvalue = ht.PValue(1000)
    
    #pregnancy lengths (chi squared)
    data = firsts_sample.prglngth.values, others_sample.prglngth.values
    ht = hyp.PregLengthTest(data)
    chi_pvalue = ht.PValue(1000)
    
    prglngth_ps.append(prglngth_pvalue)
    agewgth_ps.append(age_bwght_pvalue)
    weightdiff_ps.append(weightdiff_pvalue)
    chi_ps.append(chi_pvalue)

In [None]:
plt.plot(sample_sizes, prglngth_ps, label = 'Pregnancy Length' )
plt.plot(sample_sizes, agewgth_ps, label = 'Age/Weight' )
plt.plot(sample_sizes, weightdiff_ps, label = 'Weight Mean Difference' )
plt.plot(sample_sizes, chi_ps, label = 'Pregnancy Length(Chi)' )
plt.ylabel('p-value')
plt.xlabel('sample size')
plt.legend();

**Exercise 9.2** In Section 9.3, we simulated the null hypothesis by permutation; that is, we treated the observed values as if they represented the entire population, and randomly assigned the members of the population to the
two groups.

An alternative is to use the sample to estimate the distribution for the population, then draw a random sample from that distribution. This process is called resampling. There are several ways to implement resampling, but one of the simplest is to draw a sample with replacement from the observed values, as in Section 9.10.

Write a class named *DiffMeansResample* that inherits from *DiffMeansPermute* and overrides RunModel to implement resampling, rather than permutation.

Use this model to test the differences in pregnancy length and birth weight.
How much does the model affect the results?

>In this situation it has a very small effect; after running the tests a few times, the results were identical more than once.

In [None]:
class DiffMeansPermute(thinkstats2.HypothesisTest):
    """Tests a difference in means by permutation."""

    def TestStatistic(self, data):
        """Computes the test statistic.

        data: data in whatever form is relevant        
        """
        group1, group2 = data
        test_stat = abs(group1.mean() - group2.mean())
        return test_stat

    def MakeModel(self):
        """Build a model of the null hypothesis.
        """
        group1, group2 = self.data
        self.n, self.m = len(group1), len(group2)
        self.pool = np.hstack((group1, group2))

    def RunModel(self):
        """Run the model of the null hypothesis.

        returns: simulated data
        """
        np.random.shuffle(self.pool)
        data = self.pool[:self.n], self.pool[self.n:]
        return data


In [None]:
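class DiffMeansResample(DiffMeansPermute):
    """Tests a difference in means by resampling with replacement."""

    def RunModel(self):
        """Run the model of the null hypothesis.

        returns: simulated data
        """
        # Draw len(pool) values with replacement, then split into two groups
        sample = thinkstats2.Resample(self.pool)
        data = sample[:self.n], sample[self.n:]
        return data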

In [None]:
#difference in pregnancy lengths
data = firsts.prglngth.values, others.prglngth.values
ht = hyp.DiffMeansPermute(data)
prglngth_perm = ht.PValue(1000)
    
#mean weight difference, first vs others
data = (firsts.totalwgt_lb.dropna().values,
            others.totalwgt_lb.dropna().values)
ht = hyp.DiffMeansPermute(data)
weightdiff_perm = ht.PValue(1000)
    
#difference in pregnancy lengths (Resample)
data = firsts.prglngth.values, others.prglngth.values
ht = DiffMeansResample(data)
prglngth_reshuf= ht.PValue(1000)
    
#mean weight difference, first vs others(Resample)
data = (firsts.totalwgt_lb.dropna().values,
            others.totalwgt_lb.dropna().values)
ht = DiffMeansResample(data)
weightdiff_reshuf = ht.PValue(1000)

print('Permutation: (%.05f, %.05f), Resample: (%.05f, %.05f)' %
      (prglngth_perm, weightdiff_perm, prglngth_reshuf, weightdiff_reshuf))

### Glossary

**hypothesis testing:** The process of determining whether an apparent effect is statistically significant.

**test statistic:** A statistic used to quantify an effect size.

**null hypothesis:** A model of a system based on the assumption that an apparent effect is due to chance.

**p-value:** The probability that an effect could occur by chance.

**statistically significant:** An effect is statistically significant if it is unlikely to occur by chance.

**permutation test:** A way to compute p-values by generating permutations of an observed dataset.

**resampling test:** A way to compute p-values by generating samples, with replacement, from an observed dataset.

**two-sided test:** A test that asks, "What is the chance of an effect as big as the observed effect, positive or negative?"

**one-sided test:** A test that asks, "What is the chance of an effect as big as the observed effect, and with the same sign?"

**chi-squared test:** A test that uses the chi-squared statistic as the test statistic.

**false positive:** The conclusion that an effect is real when it is not.

**false negative:** The conclusion that an effect is due to chance when it is not.

**power:** The probability of a positive test if the null hypothesis is false.