## File: Exercise 9-1_Edris_Safari.ipynb
## Name: Edris Safari
## Date: 1/20/2019
## Course: DSC530 - Data Exploration and Analysis
## Desc: Week 7 exercise 9-1 assignment

As sample size increases, the power of a hypothesis test increases, which means it is more likely to be positive if the effect is real. Conversely, as sample size decreases, the test is less likely to be positive even if the effect is real.
To investigate this behavior, run the tests in this chapter with different subsets of the NSFG data. You can use thinkstats2.SampleRows to select a random subset of the rows in a DataFrame. What happens to the p-values of these tests as the sample size decreases? What is the smallest sample size that yields a positive test?
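
The link between sample size and p-value can be illustrated without the NSFG data. The following is a minimal sketch that uses only the Python standard library and synthetic Gaussian data. The helper name `perm_test_pvalue`, the effect size of 0.2, the seed, and the sample sizes are illustrative assumptions; none of them come from the exercise or from thinkstats2.

```python
import random


def perm_test_pvalue(group1, group2, iters=200, rng=None):
    """Two-sided permutation test for a difference in means.

    Returns the fraction of shuffles whose absolute difference in means
    is at least as large as the observed one.
    """
    if rng is None:
        rng = random.Random()
    n = len(group1)
    pool = list(group1) + list(group2)
    m = len(pool) - n
    observed = abs(sum(group1) / n - sum(group2) / m)
    count = 0
    for _ in range(iters):
        rng.shuffle(pool)
        stat = abs(sum(pool[:n]) / n - sum(pool[n:]) / m)
        if stat >= observed:
            count += 1
    return count / iters


# Synthetic data (an assumption for illustration): a small real effect of 0.2.
rng = random.Random(18)
group1 = [rng.gauss(0.2, 1.0) for _ in range(1000)]
group2 = [rng.gauss(0.0, 1.0) for _ in range(1000)]

n = len(group1)
print('n\tp-value')
for _ in range(6):
    # Draw a random subset of each group without replacement.
    sub1 = rng.sample(group1, n)
    sub2 = rng.sample(group2, n)
    print('%d\t%0.2f' % (n, perm_test_pvalue(sub1, sub2, rng=rng)))
    n //= 2
```

Because each subset is random, the printed p-values need not grow monotonically as `n` shrinks. The same caveat applies to the NSFG runs.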

In [3]:
import first
import hypothesis
import scatter
import thinkstats2

import numpy as np

# DiffMeansResample inherits from DiffMeansPermute and overrides its RunModel method
class DiffMeansResample(hypothesis.DiffMeansPermute):
    """Tests a difference in means using resampling."""
    
    def RunModel(self):
        """Run the model of the null hypothesis.

        returns: simulated data
        """
        group1 = np.random.choice(self.pool, self.n, replace=True)
        group2 = np.random.choice(self.pool, self.m, replace=True)
        return group1, group2
  

def RunResampleTest(firsts, others):
    """Tests differences in means by resampling.

    firsts: DataFrame
    others: DataFrame
    """
    
    data = firsts.prglngth.values, others.prglngth.values
    ht = DiffMeansResample(data)
    # PValue runs the null-hypothesis model (RunModel) 10000 times and returns the
    # fraction of simulated test statistics that are at least as large as the observed one
    p_value = ht.PValue(iters=10000)
    print('\nmeans permute preglength')
    print('p-value =', p_value)
    print('actual =', ht.actual)
    print('ts max =', ht.MaxTestStat())

    data = (firsts.totalwgt_lb.dropna().values,
            others.totalwgt_lb.dropna().values)
    ht = hypothesis.DiffMeansPermute(data)
    p_value = ht.PValue(iters=10000)
    print('\nmeans permute birthweight')
    print('p-value =', p_value)
    print('actual =', ht.actual)
    print('ts max =', ht.MaxTestStat())


def RunTests(live, iters=1000):
    """Runs the tests from Chapter 9 with a subset of the data.
    live: DataFrame
    iters: how many iterations to run
    """
    n = len(live)
    firsts = live[live.birthord == 1]
    others = live[live.birthord != 1]

    # Four tests are conducted and recorded.
    # Test1-Pregnancy length
    # compare pregnancy lengths
    data = firsts.prglngth.values, others.prglngth.values
    ht = hypothesis.DiffMeansPermute(data)
    p1 = ht.PValue(iters=iters)

    # Test2-Birth weights
    data = (firsts.totalwgt_lb.dropna().values,
            others.totalwgt_lb.dropna().values)
    ht = hypothesis.DiffMeansPermute(data)
    p2 = ht.PValue(iters=iters)

    # Test3-Correlation
    # Use CorrelationPermute to test correlation between age and weight
    live2 = live.dropna(subset=['agepreg', 'totalwgt_lb'])
    data = live2.agepreg.values, live2.totalwgt_lb.values
    ht = hypothesis.CorrelationPermute(data)
    p3 = ht.PValue(iters=iters)

    # Test4-Pregnancy lengths (chi-squared)
    data = firsts.prglngth.values, others.prglngth.values
    ht = hypothesis.PregLengthTest(data)
    p4 = ht.PValue(iters=iters)

    print('%d\t%0.2f\t%0.2f\t%0.2f\t%0.2f' % (n, p1, p2, p3, p4))
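
The two null-hypothesis models used above differ in one respect. Permutation (`DiffMeansPermute`) shuffles the pooled values and splits them, while resampling (`DiffMeansResample`) draws both groups from the pool with replacement. A standard-library sketch of that contrast follows. It is not the thinkstats2 implementation, and the function names `run_model_permute` and `run_model_resample` and the toy pool are hypothetical.

```python
import random


def run_model_permute(pool, n, rng):
    """Shuffle a copy of the pool and split it into groups of size n and len(pool) - n.

    Every pooled value appears exactly once across the two groups.
    """
    shuffled = list(pool)
    rng.shuffle(shuffled)
    return shuffled[:n], shuffled[n:]


def run_model_resample(pool, n, m, rng):
    """Draw two groups of size n and m from the pool with replacement.

    A value may appear several times, or not at all.
    """
    return rng.choices(pool, k=n), rng.choices(pool, k=m)


rng = random.Random(18)
pool = [38, 39, 39, 40, 40, 41]  # toy values, not NSFG data

print('permute: ', run_model_permute(pool, 3, rng))
print('resample:', run_model_resample(pool, 3, 3, rng))
```

Permutation keeps the pooled sample fixed and only reassigns group labels. Resampling treats the pool as an estimate of the population and draws fresh samples from it.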


In [6]:
# Initialize random generators
thinkstats2.RandomSeed(18)
# Create dataframes
live, firsts, others = first.MakeFrames()

# Run model on firsts and others dataset
RunResampleTest(firsts, others)

n = len(live)
print('n\ttest1\ttest2\ttest3\ttest4')
# run the tests 10 times, halving the sample size after each run
for _ in range(10):
    # select a random subset of the rows
    sample = thinkstats2.SampleRows(live, n)
    RunTests(sample)
    n //= 2


means permute preglength
p-value = 0.1674
actual = 0.07803726677754952
ts max = 0.2267524361042348

means permute birthweight
p-value = 0.0
actual = 0.12476118453549034
ts max = 0.11224350119686566
n	test1	test2	test3	test4
9148	0.16	0.00	0.00	0.00
4574	0.03	0.02	0.00	0.00
2287	0.04	0.07	0.00	0.00
1143	0.70	0.04	0.80	0.07
571	0.53	0.00	0.00	0.35
285	0.96	0.84	0.35	0.53
142	0.87	0.49	0.20	0.06


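The p-values tend to increase as the sample size decreases, although not monotonically, because each run uses a different random subset. Taking p < 0.05 as a positive result, the smallest sample size at which test1 and test2 are both positive is 4574; at smaller sizes the results are erratic, with individual tests occasionally positive (for example, test2 at n = 571).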
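
The table above can also be scanned programmatically. The sketch below copies the printed p-values (rounded to two decimals, as in the output) and reports the smallest sample size that is positive for each test. It assumes a 0.05 threshold, which the notebook does not state explicitly, and the helper names are hypothetical.

```python
ALPHA = 0.05  # assumed significance threshold, not stated in the notebook

# (n, test1, test2, test3, test4), copied from the printed output table
RESULTS = [
    (9148, 0.16, 0.00, 0.00, 0.00),
    (4574, 0.03, 0.02, 0.00, 0.00),
    (2287, 0.04, 0.07, 0.00, 0.00),
    (1143, 0.70, 0.04, 0.80, 0.07),
    (571, 0.53, 0.00, 0.00, 0.35),
    (285, 0.96, 0.84, 0.35, 0.53),
    (142, 0.87, 0.49, 0.20, 0.06),
]
TEST_NAMES = ('test1', 'test2', 'test3', 'test4')


def smallest_positive_n(results, column, alpha=ALPHA):
    """Return the smallest n whose p-value in `column` is below alpha, or None."""
    sizes = [row[0] for row in results if row[column] < alpha]
    return min(sizes) if sizes else None


def smallest_n_all_positive(results, columns, alpha=ALPHA):
    """Return the smallest n at which every listed column is below alpha, or None."""
    sizes = [row[0] for row in results
             if all(row[c] < alpha for c in columns)]
    return min(sizes) if sizes else None


for column, name in enumerate(TEST_NAMES, start=1):
    print(name, smallest_positive_n(RESULTS, column))
print('test1 and test2 together:', smallest_n_all_positive(RESULTS, (1, 2)))
```

These values describe a single run with seed 18. A different seed would select different subsets and could move the thresholds.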