Exercise from Think Stats, 2nd Edition (thinkstats2.com)<br>
Allen Downey

In [1]:
%matplotlib inline

Suppose one of your co-workers is expecting a baby and you are participating in an office pool to predict the date of birth. Assuming that bets are placed during the 30th week of pregnancy, what variables could you use to make the best prediction? You should limit yourself to variables that are known before the birth, and likely to be available to the people in the pool. 

In [24]:
#set up the necessary imports
import nsfg
import chap01soln
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.formula.api as smf

df = nsfg.ReadFemPreg()
resp = chap01soln.ReadFemResp()

In [76]:
#let's just focus on the respondent data with live births (since we'll assume the best)
#bets are placed around week 30 of an "average" 40-week pregnancy,
#thus we know at least some of the length of the pregnancy. We also know the age of our pregnant friend.

live = df[df.outcome == 1]
live = live[live.prglngth > 30]

resp.index = resp.caseid
join = live.join(resp, on='caseid', rsuffix='_r')

In [83]:
#we want to test on a variety of factors to determine what relationships might exist
test_table = []
for name in join.columns:
    try:
        if join[name].var() < 1e-7: #make sure that there is a little bit of variance
            continue
            
        formula = 'prglngth ~ agecon + ' + str(name) #predicting prglngth as a function of age, other parameter
        model = smf.ols(formula, data=join)
        if model.nobs < len(join)/2:
            continue
        results = model.fit()
    except (ValueError, TypeError):
        continue
    test_table.append((results.rsquared, name))

In [84]:
test_table.sort(reverse=True) #sort by highest R-squared
for rsquared, name in test_table[:30]:
    print(name, rsquared) #print the top 30

(u'prglngth', 1.0)
(u'wksgest', 0.80647560229554416)
(u'agepreg', 0.77722780709056005)
('totalwgt_lb', 0.12589504948051811)
(u'birthwgt_lb', 0.12104803951300569)
(u'lbw1', 0.10399010733882053)
(u'mosgest', 0.095653948252776844)
(u'prglngth_i', 0.022702898461402943)
(u'canhaver', 0.0060528525930908517)
(u'datcon01_i', 0.006021881275759422)
(u'con1mar1_i', 0.0057470796692380421)
(u'nbrnaliv', 0.0046402822824620493)
(u'mar1con1_i', 0.0032946548980058443)
(u'anynurse', 0.0032494403967081587)
(u'bfeedwks', 0.0027926146173661293)
(u'agebaby1', 0.0027862747948664834)
(u'rmarout11_i', 0.0023613798588607571)
(u'marout11_i', 0.0023613798588607571)
(u'marcon11_i', 0.0023613798588607571)
(u'paydu', 0.0023452835507742353)
(u'pregend1', 0.0022912806602890523)
(u'datend02_i', 0.0021324893643125398)
(u'datcon02_i', 0.0021324893643125398)
(u'cmlastlb_r', 0.0021148239919197565)
(u'cmlastlb', 0.0021148239919197565)
(u'agecon02_i', 0.0021129060195237415)
(u'ageprg02_i', 0.0020748312204004193)
(u'fmarcon5_

Based on this information, I believe the more useful parameters (those either known or guessable by the folks participating in the pool) may be: canhaver (perhaps, if she is very close with people in the office), con1mar1 (the months between conception and marriage), mar1con1 (months between marriage and conception), agebaby1 (age at first birth, assuming more than one child), nbrnaliv (multiple births, if known), and paydu (current living quarters). The variables at the top of the list, such as wksgest, mosgest, and the birth-weight columns, are not known until the birth, so they cannot be used. None of the remaining variables, without further investigation, would make a compelling case on its own (each has an R-squared below 0.01), but together there might be something that predicts better than random guessing.
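
To check whether these weak predictors add up to anything together, one option is to put them all into a single regression instead of testing them one at a time. The following is a minimal sketch, not a tested solution to the exercise: `fit_combined` is a hypothetical helper name, and it assumes `data` is a DataFrame like `join` above whose columns can be used directly in a formula.

```python
import statsmodels.formula.api as smf


def fit_combined(data, response, predictors):
    """Fit one OLS model of `response` on all `predictors` at once.

    Hypothetical helper: `data` is assumed to be a DataFrame such as
    the `join` table built earlier. The formula interface drops rows
    that have a missing value in any of the columns used.
    """
    # Build a formula such as 'prglngth ~ agecon + nbrnaliv + paydu'
    formula = response + ' ~ ' + ' + '.join(predictors)
    return smf.ols(formula, data=data).fit()


# Possible usage with the joined table (column names taken from the list above):
# candidates = ['agecon', 'canhaver', 'agebaby1', 'nbrnaliv', 'paydu']
# results = fit_combined(join, 'prglngth', candidates)
# print(results.rsquared, results.rsquared_adj, results.nobs)
```

Comparing `rsquared_adj` of the combined model against the single-variable values would show whether the variables carry overlapping information. It is also worth checking `nobs`, since dropping rows with missing values across several columns can shrink the sample considerably.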

## Clarifying Questions

Use this space to ask questions regarding the content covered in the reading. These questions should be restricted to helping you better understand the material. For questions that push beyond what is in the reading, use the next answer field. If you don't have a fully formed question, but are generally having a difficult time with a topic, you can indicate that here as well.

I don't know if my intuition for models like this was really built well during this chapter, and I wonder if formula invocation could be walked through and reviewed further in class.

## Enrichment Questions

Use this space to ask any questions that go beyond (but are related to) the material presented in this reading. Perhaps there is a particular topic you'd like to see covered in more depth. Perhaps you'd like to know how to use a library in a way that wasn't shown in the reading. One way to think about this is what additional topics you would want covered in the next class (or addressed in a follow-up e-mail to the class). I'm a little fuzzy on what stuff will likely go here, so we'll see how things evolve.

## Additional Resources / Explorations

If you found any useful resources, or tried some useful exercises that you'd like to report, please do so here. Let us know what you did, what you learned, and how others can replicate it.

I found [this](http://www.cdc.gov/nchs/data/nsfg/NSFG_2006-2010_UG_App1a_FemRespFileIndex.pdf) a useful resource in addition to the codebook.