Suppose one of your co-workers is expecting a baby and you are participating in an office pool to predict the date of birth. Assuming that bets are placed during the 30th week of pregnancy, what variables could you use to make the best prediction? You should limit yourself to variables that are known before the birth, and likely to be available to the people in the pool.

In [4]:
%matplotlib inline

import numpy as np
import pandas as pd

import thinkstats2
import thinkplot
import first
live, firsts, others = first.MakeFrames()
live = live[live.prglngth>30]

In [2]:
import statsmodels.formula.api as smf
model = smf.ols('prglngth ~ birthord==1 + race==2 + nbrnaliv>1', data=live)
results = model.fit()
results.summary()

Dep. Variable:,prglngth,R-squared:,0.011
Model:,OLS,Adj. R-squared:,0.011
Method:,Least Squares,F-statistic:,34.28
Date:,"Wed, 12 May 2021",Prob (F-statistic):,5.09e-22
Time:,16:27:50,Log-Likelihood:,-18247.0
No. Observations:,8884,AIC:,36500.0
Df Residuals:,8880,BIC:,36530.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,38.7617,0.039,1006.410,0.000,38.686,38.837
birthord == 1[T.True],0.1015,0.040,2.528,0.011,0.023,0.180
race == 2[T.True],0.1390,0.042,3.311,0.001,0.057,0.221
nbrnaliv > 1[T.True],-1.4944,0.164,-9.086,0.000,-1.817,-1.172

Omnibus:,1587.47,Durbin-Watson:,1.619
Prob(Omnibus):,0.0,Jarque-Bera (JB):,6160.751
Skew:,-0.852,Prob(JB):,0.0
Kurtosis:,6.707,Cond. No.,10.9
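
To see what these coefficients imply for the office pool, the fitted model can be evaluated by hand. The sketch below hard-codes the coefficients from the summary above; the helper name `predict_prglngth` and the dictionary keys are hypothetical, introduced only for illustration. Because the data were restricted to `prglngth > 30`, the predictions apply only to pregnancies that have already passed week 30.

```python
# Coefficients copied from the regression summary above.
COEF = {
    "Intercept": 38.7617,
    "first_baby": 0.1015,   # birthord == 1
    "race_is_2": 0.1390,    # race == 2
    "multiple": -1.4944,    # nbrnaliv > 1
}

def predict_prglngth(first_baby, race_is_2, multiple):
    """Return the predicted pregnancy length in weeks (hypothetical helper)."""
    return (COEF["Intercept"]
            + COEF["first_baby"] * first_baby
            + COEF["race_is_2"] * race_is_2
            + COEF["multiple"] * multiple)

baseline = predict_prglngth(False, False, False)
first_single = predict_prglngth(True, False, False)
twins = predict_prglngth(False, False, True)

print(round(baseline, 2), round(first_single, 2), round(twins, 2))
```

With an R-squared of 0.011, these variables move the prediction by about a week and a half at most, and nearly all of the variation in pregnancy length remains unexplained.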


The Trivers-Willard hypothesis suggests that for many mammals the sex ratio depends on “maternal condition”; that is, factors like the mother’s age, size, health, and social status. See https://en.wikipedia.org/wiki/Trivers-Willard_hypothesis

Some studies have shown this effect among humans, but results are mixed. In this chapter we tested some variables related to these factors, but didn’t find any with a statistically significant effect on sex ratio.

As an exercise, use a data mining approach to test the other variables in the pregnancy and respondent files. Can you find any factors with a substantial effect?

In [6]:

# Solution

import nsfg

live = live[live.prglngth>30]
resp = nsfg.ReadFemResp()
resp.index = resp.caseid
join = live.join(resp, on='caseid', rsuffix='_r')

def GoMining(df):
    """Searches for variables that predict the sex of the baby.

    df: DataFrame of pregnancy records

    returns: list of (pseudo-R^2, variable name) pairs
    """
    df['boy'] = (df.babysex==1).astype(int)
    variables = []
    for name in df.columns:
        try:
            if df[name].var() < 1e-7:
                continue

            formula='boy ~ agepreg + ' + name
            model = smf.logit(formula, data=df)
            nobs = len(model.endog)
            if nobs < len(df)/2:
                continue

            results = model.fit()
        except Exception:
            # skip variables that the formula parser or the optimizer rejects
            continue

        variables.append((results.prsquared, name))

    return variables

variables = GoMining(join)

Optimization terminated successfully.
         Current function value: 0.692991
         Iterations 3
...

In [9]:
# Solution

# Here are the 30 variables that yield the highest pseudo-R^2 values.

import re

def ReadVariables():
    """Reads Stata dictionary files for NSFG data.

    returns: DataFrame that maps variable names to descriptions
    """
    vars1 = thinkstats2.ReadStataDct('2002FemPreg.dct').variables
    vars2 = thinkstats2.ReadStataDct('2002FemResp.dct').variables

    # DataFrame.append was removed in pandas 2.0; pd.concat works in all versions
    all_vars = pd.concat([vars1, vars2])
    all_vars.index = all_vars.name
    return all_vars

def MiningReport(variables, n=30):
    """Prints variables with the highest pseudo-R^2.

    variables: list of (pseudo-R^2, variable name) pairs
    n: number of pairs to print
    """
    all_vars = ReadVariables()

    variables.sort(reverse=True)
    for r2, name in variables[:n]:
        key = re.sub('_r$', '', name)
        try:
            desc = all_vars.loc[key].desc
            if isinstance(desc, pd.Series):
                desc = desc.iloc[0]  # duplicate names: take the first description
            print(name, r2, desc)
        except (KeyError, IndexError):
            print(name, r2)

MiningReport(variables)

totalwgt_lb 0.009696855253233383
birthwgt_lb 0.009274460080281988 BD-3 BIRTHWEIGHT IN POUNDS - 1ST BABY FROM THIS PREGNANCY
constat3 0.0010985419170438382 3RD PRIORITY CODE FOR CURRENT CONTRACEPTIVE STATUS
lbw1 0.0010519527860076705 LOW BIRTHWEIGHT - BABY 1
nplaced 0.001010368752280555 # OF R'S BIO CHILDREN SHE PLACED FOR ADOPTION (BASED ON BPA)
fmarout5 0.0009096579032891183 FORMAL MARITAL STATUS AT PREGNANCY OUTCOME
rmarout6 0.000818252143711895 INFORMAL MARITAL STATUS AT PREGNANCY OUTCOME - 6 CATEGORIES
infever 0.0008115919859909004 EVER USED INFERTILITY SERVICES OF ANY KIND
frsteatd 0.0007675331422082321 AGE (IN MOS) WHEN 1ST SUPPLEMENTED - 1ST FROM THIS PREG
splstwk1 0.0007334122339932581 IF-1 H/P DOING WHAT LAST WEEK (EMPLOYMENT STATUS) 1ST MENTION
pmarpreg 0.0007245809157658822 WHETHER PREGNANCY ENDED BEFORE R'S 1ST MARRIAGE (PREMARITALLY)
usefstp 0.0007122387685902787 EF-3 USE METHOD AT FIRST SEX WITH 1ST PARTNER IN PAST 12 MONTHS?
outcom02 0.0007015744602576479 OUTCOME OF PREG
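
The top of the report is dominated by birth-weight variables, which are only known after delivery and so cannot be used to predict the baby's sex in advance. A minimal sketch of screening them out follows. The exclusion set and the helper `drop_post_birth` are illustrative assumptions, not the procedure used to choose the model in the next cell.

```python
# (pseudo-R^2, name) pairs copied from the top of the report above.
report = [
    (0.009696855253233383, "totalwgt_lb"),
    (0.009274460080281988, "birthwgt_lb"),
    (0.0010985419170438382, "constat3"),
    (0.0010519527860076705, "lbw1"),
    (0.001010368752280555, "nplaced"),
    (0.0009096579032891183, "fmarout5"),
]

# Assumption: these variables are measured at or after birth.
KNOWN_ONLY_AFTER_BIRTH = {"totalwgt_lb", "birthwgt_lb", "lbw1"}

def drop_post_birth(pairs, excluded=KNOWN_ONLY_AFTER_BIRTH):
    """Keep only the pairs whose variable name is not in the excluded set."""
    return [(r2, name) for r2, name in pairs if name not in excluded]

candidates = drop_post_birth(report)
for r2, name in candidates:
    print(f"{name:12s} {r2:.5f}")
```

Even after this screen, a variable can rank highly for reasons unrelated to the baby's sex, so each remaining candidate still needs to be checked against its description.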

In [10]:
# Solution

# Eliminating variables that are not known during pregnancy and 
# others that are fishy for various reasons, here's the best model I could find:

formula='boy ~ agepreg + fmarout5==5 + infever==1'
model = smf.logit(formula, data=join)
results = model.fit()
results.summary() 

Optimization terminated successfully.
         Current function value: 0.691874
         Iterations 4


Dep. Variable:,boy,No. Observations:,8884.0
Model:,Logit,Df Residuals:,8880.0
Method:,MLE,Df Model:,3.0
Date:,"Wed, 12 May 2021",Pseudo R-squ.:,0.001653
Time:,16:35:04,Log-Likelihood:,-6146.6
converged:,True,LL-Null:,-6156.8
Covariance Type:,nonrobust,LLR p-value:,0.0001432

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-0.1805,0.118,-1.534,0.125,-0.411,0.050
fmarout5 == 5[T.True],0.1582,0.049,3.217,0.001,0.062,0.255
infever == 1[T.True],0.2194,0.065,3.374,0.001,0.092,0.347
agepreg,0.0050,0.004,1.172,0.241,-0.003,0.013
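
Two numbers in this summary can be checked by hand. The pseudo-R^2 that StatsModels reports for a logit model is McFadden's, 1 - LL/LL-Null, and the coefficients are log-odds that the logistic function converts to a probability. The sketch below uses only values copied from the summary; the helper names and the example mother (age 30) are assumptions made for illustration.

```python
import math

# Values copied from the summary above.
LL_MODEL = -6146.6
LL_NULL = -6156.8
COEF = {
    "Intercept": -0.1805,
    "fmarout5_is_5": 0.1582,
    "infever_is_1": 0.2194,
    "agepreg": 0.0050,
}

def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo-R^2 from the model and null log-likelihoods."""
    return 1 - ll_model / ll_null

def prob_boy(agepreg, fmarout5_is_5, infever_is_1):
    """Predicted probability of a boy (hypothetical helper)."""
    log_odds = (COEF["Intercept"]
                + COEF["agepreg"] * agepreg
                + COEF["fmarout5_is_5"] * fmarout5_is_5
                + COEF["infever_is_1"] * infever_is_1)
    return 1 / (1 + math.exp(-log_odds))

r2 = mcfadden_r2(LL_MODEL, LL_NULL)
p_low = prob_boy(30, False, False)
p_high = prob_boy(30, True, True)

print(f"pseudo-R^2 {r2:.6f}")
print(f"P(boy) with neither factor {p_low:.3f}, with both {p_high:.3f}")
```

The recomputed pseudo-R^2 differs from the reported 0.001653 only in the last digits, because the log-likelihoods in the table are rounded. The gap between the two predicted probabilities shows how small the effect is even when both indicator variables are set.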


If the quantity you want to predict is categorical, you can use multinomial logistic regression, which is implemented in StatsModels with a function called `mnlogit`. As an exercise, let’s use it to guess whether a woman is married, cohabitating, widowed, divorced, separated, or never married; in the NSFG dataset, marital status is encoded in a variable called `rmarital`.

Suppose you meet a woman who is 25 years old, white, and a high school graduate whose annual household income is about $45,000. What is the probability that she is married, cohabitating, and so on?

In [12]:
# Solution

# Here's the best model I could find.
join['numbabes'] = join.numbabes.replace([97], np.nan)
join['age2'] = join.age_r**2

formula='rmarital ~ age_r + age2 + C(race) + totincr + educat'
model = smf.mnlogit(formula, data=join)
results = model.fit()
results.summary() 

Optimization terminated successfully.
         Current function value: 1.084053
         Iterations 8


Dep. Variable:,rmarital,No. Observations:,8884.0
Model:,MNLogit,Df Residuals:,8849.0
Method:,MLE,Df Model:,30.0
Date:,"Wed, 12 May 2021",Pseudo R-squ.:,0.1682
Time:,16:36:19,Log-Likelihood:,-9630.7
converged:,True,LL-Null:,-11579.0
Covariance Type:,nonrobust,LLR p-value:,0.0

rmarital=2,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,9.0156,0.805,11.199,0.000,7.438,10.593
C(race)[T.2],-0.9237,0.089,-10.418,0.000,-1.097,-0.750
C(race)[T.3],-0.6179,0.136,-4.536,0.000,-0.885,-0.351
age_r,-0.3635,0.051,-7.150,0.000,-0.463,-0.264
age2,0.0048,0.001,6.103,0.000,0.003,0.006
totincr,-0.1310,0.012,-11.337,0.000,-0.154,-0.108
educat,-0.1953,0.019,-10.424,0.000,-0.232,-0.159
rmarital=3,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,2.9570,3.020,0.979,0.328,-2.963,8.877
C(race)[T.2],-0.4411,0.237,-1.863,0.062,-0.905,0.023
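
In a multinomial logit, each block of coefficients gives the log-odds of that category relative to a reference category. Since the summary starts at `rmarital=2`, the reference here appears to be `rmarital=1`. As a sketch, the `rmarital=2` block can be evaluated by hand for the 25-year-old in the exercise, using the NSFG codes from the prediction cell that follows (race 2, `totincr` 11, `educat` 12). The dictionary keys are hypothetical names chosen for illustration.

```python
import math

# Coefficients of the rmarital=2 block, copied from the summary above.
COEF_2 = {
    "Intercept": 9.0156,
    "race_2": -0.9237,
    "race_3": -0.6179,
    "age_r": -0.3635,
    "age2": 0.0048,
    "totincr": -0.1310,
    "educat": -0.1953,
}

# The person from the exercise, encoded as model inputs.
person = {
    "Intercept": 1,
    "race_2": 1,
    "race_3": 0,
    "age_r": 25,
    "age2": 25**2,
    "totincr": 11,
    "educat": 12,
}

# Log-odds of rmarital=2 relative to rmarital=1, then the odds themselves.
log_odds = sum(COEF_2[key] * person[key] for key in COEF_2)
odds = math.exp(log_odds)

print(f"log-odds {log_odds:.4f}, odds {odds:.3f}")
```

The resulting odds say how likely this person is to be in category 2 compared with category 1. The full probabilities also need the blocks for the remaining categories, which is what `results.predict` computes.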


In [13]:
# Solution

# This person has a 75% chance of being currently married, 
# a 13% chance of being "not married but living with opposite 
# sex partner", etc.

columns = ['age_r', 'age2', 'race', 'totincr', 'educat']
new = pd.DataFrame([[25, 25**2, 2, 11, 12]], columns=columns)
results.predict(new)

Unnamed: 0,0,1,2,3,4,5
0,0.750028,0.126397,0.001564,0.033403,0.021485,0.067122
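
The prediction is easier to read with labels attached. The sketch below copies the probabilities from the output above and assumes that columns 0 to 5 correspond to `rmarital` codes 1 to 6, in the order the exercise lists the categories. That mapping is consistent with the comment in the cell but is not printed in the output itself.

```python
# Probabilities copied from the prediction output above.
probs = [0.750028, 0.126397, 0.001564, 0.033403, 0.021485, 0.067122]

# Assumption: columns 0..5 map to rmarital codes 1..6 in this order.
labels = ["married", "cohabitating", "widowed",
          "divorced", "separated", "never married"]

by_status = dict(zip(labels, probs))
total = sum(probs)
most_likely = max(by_status, key=by_status.get)

for label, p in by_status.items():
    print(f"{label:14s} {p:.1%}")
print(f"total {total:.6f}, most likely: {most_likely}")
```

The probabilities sum to 1 up to rounding, as they should for a multinomial model. The ratio of the second to the first, about 0.17, is the odds of cohabitating relative to being married for this person.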
