http://thinkstats2.com
Copyright 2016 Allen B. Downey

In [1]:
import pandas as pd

import numpy as np

pd.set_option('display.max_columns', 300)
pd.set_option('display.precision', 2)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import rcParams
rcParams['figure.figsize'] = 10,8
import seaborn as sb
sb.set_style('white') 

In [2]:
from __future__ import print_function, division
import nsfg

__Anecdotal evidence__

Evidence based on personal, unpublished accounts. It usually fails because of:
1. Small number of observations: We need to compare a large number of observations to be sure that a difference exists.
2. Selection bias: The process of selecting the data might itself bias the results.
3. Confirmation bias: People who believe a claim might be more likely to contribute examples that confirm it.
4. Inaccuracy: Anecdotes are often personal, so they are likely to be misremembered, misrepresented, or repeated inaccurately.

## Statistical approach

1. Data collection
2. Descriptive statistics
3. Exploratory data analysis
4. Estimation
5. Hypothesis testing

__Data__
National Survey of Family Growth (NSFG) cross-sectional data

The NSFG data is stored in a gzip-compressed data file whose format is described in a Stata dictionary file. The `nsfg.py` module contains classes and functions for reading it.
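To make the file format concrete, here is a minimal sketch of reading a gzip-compressed text file with column positions of the kind a Stata dictionary describes. It assumes the data is fixed-width text; the `COLSPECS` layout, the sample records, and the `parse_fixed_width` helper are hypothetical and are not taken from `nsfg.py`.

```python
import gzip
import io

# Hypothetical column layout: (name, start, end) with 0-based, end-exclusive
# character positions. A real Stata dictionary lists many more variables.
COLSPECS = [("caseid", 0, 5), ("pregordr", 5, 7), ("prglngth", 7, 9)]

def parse_fixed_width(lines, colspecs):
    """Turn fixed-width text lines into a list of dicts, one per record."""
    records = []
    for line in lines:
        record = {}
        for name, start, end in colspecs:
            field = line[start:end].strip()
            # Blank fields are treated as missing values.
            record[name] = int(field) if field else None
        records.append(record)
    return records

# Build a tiny gzip-compressed file in memory so the example is self-contained.
raw = "    1 139\n    1 239\n    2 1  \n"
buffer = io.BytesIO()
with gzip.open(buffer, "wt") as f:
    f.write(raw)
buffer.seek(0)

# Read it back, decompressing on the fly.
with gzip.open(buffer, "rt") as f:
    records = parse_fixed_width(f, COLSPECS)

print(records)
```

In practice the column positions would come from the dictionary file rather than being typed by hand, which is the part `nsfg.py` takes care of.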

__Columns__
- `caseid` is the integer ID of the respondent.
- `prglngth` is the integer duration of the pregnancy in weeks.
- `outcome` is an integer code for the outcome; 1 indicates a live birth.
- `pregordr` is a pregnancy serial number; a respondent's first pregnancy is 1, the second is 2, and so on.
- `birthord` is a serial number for live births; the first child is 1, and the field is blank for outcomes other than a live birth.
- `birthwgt_lb` and `birthwgt_oz` contain the pounds and ounces parts of the birth weight of the baby.
- `agepreg` is the mother's age at the end of the pregnancy.
- `finalwgt` is the statistical weight indicating the number of people in the U.S. population this respondent represents.
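The pounds and ounces parts can be combined into a single weight in pounds, which seems to be how the `totalwgt_lb` column in this notebook's `preg.head()` output is derived (16 ounces to the pound). The sketch below is an illustration under that assumption; `total_weight_lb` is a hypothetical helper, not a function from `nsfg.py`.

```python
def total_weight_lb(pounds, ounces):
    """Combine pounds and ounces into a single weight in pounds."""
    # If either part is missing (None), the total is missing as well.
    if pounds is None or ounces is None:
        return None
    return pounds + ounces / 16.0

# (birthwgt_lb, birthwgt_oz) pairs from the first three rows of preg.head()
pairs = [(8.0, 13.0), (7.0, 14.0), (9.0, 2.0)]
totals = [total_weight_lb(lb, oz) for lb, oz in pairs]
print(totals)
```

Rounded to two decimals, these values agree with the `totalwgt_lb` entries displayed for the same rows.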

In [3]:
preg = nsfg.ReadFemPreg()
preg.head()

Unnamed: 0,caseid,pregordr,howpreg_n,howpreg_p,moscurrp,nowprgdk,pregend1,pregend2,nbrnaliv,multbrth,cmotpreg,prgoutcome,cmprgend,flgdkmo1,cmprgbeg,ageatend,hpageend,gestasun_m,gestasun_w,wksgest,mosgest,dk1gest,dk2gest,dk3gest,bpa_bdscheck1,bpa_bdscheck2,bpa_bdscheck3,babysex,birthwgt_lb,birthwgt_oz,lobthwgt,babysex2,birthwgt_lb2,birthwgt_oz2,lobthwgt2,babysex3,birthwgt_lb3,birthwgt_oz3,lobthwgt3,cmbabdob,kidage,hpagelb,birthplc,paybirth1,paybirth2,paybirth3,knewpreg,trimestr,ltrimest,priorsmk,postsmks,npostsmk,getprena,bgnprena,pnctrim,lpnctri,workpreg,workborn,didwork,matweeks,weeksdk,matleave,matchfound,livehere,alivenow,cmkidied,cmkidlft,lastage,wherenow,legagree,parenend,anynurse,fedsolid,frsteatd_n,frsteatd_p,frsteatd,quitnurs,ageqtnur_n,ageqtnur_p,ageqtnur,matchfound2,livehere2,alivenow2,cmkidied2,cmkidlft2,lastage2,wherenow2,legagree2,parenend2,anynurse2,fedsolid2,frsteatd_n2,frsteatd_p2,frsteatd2,quitnurs2,ageqtnur_n2,ageqtnur_p2,ageqtnur2,matchfound3,livehere3,alivenow3,cmkidied3,cmkidlft3,lastage3,wherenow3,legagree3,parenend3,anynurse3,fedsolid3,frsteatd_n3,frsteatd_p3,frsteatd3,quitnurs3,ageqtnur_n3,ageqtnur_p3,ageqtnur3,cmlastlb,cmfstprg,cmlstprg,cmintstr,cmintfin,cmintstrop,cmintfinop,cmintstrcr,cmintfincr,evuseint,stopduse,whystopd,whatmeth01,whatmeth02,whatmeth03,whatmeth04,resnouse,wantbold,probbabe,cnfrmno,wantbld2,timingok,toosoon_n,toosoon_p,wthpart1,wthpart2,feelinpg,hpwnold,timokhp,cohpbeg,cohpend,tellfath,whentell,tryscale,wantscal,whyprg1,whyprg2,whynouse1,whynouse2,whynouse3,anyusint,prglngth,outcome,birthord,datend,agepreg,datecon,agecon,fmarout5,pmarpreg,rmarout6,fmarcon5,learnprg,pncarewk,paydeliv,lbw1,bfeedwks,maternlv,oldwantr,oldwantp,wantresp,wantpart,cmbirth,ager,agescrn,fmarital,rmarital,educat,hieduc,race,hispanic,hisprace,rcurpreg,pregnum,parity,insuranc,pubassis,poverty,laborfor,religion,metro,brnout,yrstrus,prglngth_i,outcome_i,birthord_i,datend_i,agepreg_i,datecon_i,agecon_i,fmarout5_i,pmarpreg_i,rmarout6_i,fmarcon5_i,learnprg_i,pncarewk_i,paydeliv_i,lbw1_i,bfeedwks_i,maternlv_i,oldwantr_i,oldwantp_i,wantresp_i,wantpart_i,ager_i,fmarital_i,rmarital_i,educat_i,hieduc_i,race_i,hispanic_i,hisprace_i,rcurpreg_i,pregnum_i,parity_i,insuranc_i,pubassis_i,poverty_i,laborfor_i,religion_i,metro_i,basewgt,adj_mod_basewgt,finalwgt,secu_p,sest,cmintvw,totalwgt_lb
0,1,1,,,,,6.0,,1.0,,,1.0,1093.0,,1084.0,,,9.0,0.0,39.0,9.0,,,,0.0,,,1.0,8.0,13.0,,,,,,,,,,1093.0,138.0,37.0,,,,,,,,,,,,,,,,,,,,,1.0,,,,,,,,,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1166.0,1093.0,1166.0,920.0,1093.0,,,,,1.0,1.0,1.0,,,,,,,,,,3.0,,,1.0,,,1,2.0,,,1.0,1.0,,,,,,,,5,39,1,1.0,1093.0,33.16,1084,3241,1.0,2.0,1.0,1,,,,2.0,995.0,,1,2,1,2,695,44,44,1,1,16,12,2,2,2,2,2,2,2,2,469,3,2,1,5,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3410.39,3869.35,6448.27,2,9,,8.81
1,1,2,,,,,6.0,,1.0,,,1.0,1166.0,,1157.0,,,9.0,0.0,39.0,9.0,,,,0.0,,,2.0,7.0,14.0,,,,,,,,,,1166.0,65.0,42.0,1.0,1.0,2.0,,2.0,,,0.0,5.0,,1.0,4.0,,,5.0,,,,,,5.0,1.0,,,,,,,,1.0,,4.0,1.0,4.0,,20.0,1.0,20.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1166.0,1093.0,1166.0,1093.0,1166.0,1166.0,1231.0,,,1.0,1.0,1.0,,,,,,,,,,3.0,,,1.0,,,1,4.0,,,1.0,1.0,,,,,,,,5,39,1,2.0,1166.0,39.25,1157,3850,1.0,2.0,1.0,1,2.0,4.0,3.0,2.0,87.0,0.0,1,4,1,4,695,44,44,1,1,16,12,2,2,2,2,2,2,2,2,469,3,2,1,5,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3410.39,3869.35,6448.27,2,9,,7.88
2,2,1,,,,,5.0,,3.0,5.0,,1.0,1156.0,,1147.0,,,0.0,39.0,39.0,9.0,,,,0.0,,,1.0,9.0,2.0,,2.0,2.0,0.0,,1.0,1.0,4.0,,1156.0,75.0,24.0,,,,,,,,,,,,,,,,,,,,,5.0,1.0,,,,,,,,5.0,,,,,,,,,5.0,5.0,5.0,1156.0,,0.0,,,,,,,,,,,,,5.0,5.0,5.0,1156.0,,0.0,,,,,,,,,,,,,1204.0,1156.0,1204.0,1153.0,1156.0,,,,,5.0,,,,,,,5.0,5.0,,,,,,,,4.0,,5,,5.0,5.0,1.0,1.0,,,,,,,,5,39,1,1.0,1156.0,14.33,1147,1358,5.0,1.0,6.0,5,,,,2.0,995.0,,5,5,5,5,984,20,20,5,6,11,7,1,2,3,2,3,5,3,2,100,2,3,1,5,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,7226.3,8567.55,12999.54,2,12,,9.12
3,2,2,,,,,6.0,,1.0,,,1.0,1198.0,,1189.0,,,0.0,39.0,39.0,9.0,,,,0.0,,,2.0,7.0,0.0,,,,,,,,,,1198.0,33.0,25.0,1.0,3.0,,,3.0,,,0.0,5.0,,1.0,4.0,,,5.0,,,,,,5.0,5.0,1.0,,1205.0,7.0,2.0,,1.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1204.0,1156.0,1204.0,1156.0,1198.0,,,,,,,,4.0,,,,,5.0,,,,,,,,4.0,3.0,1,1.0,5.0,5.0,1.0,1.0,2.0,3.0,2.0,,,,,1,39,1,2.0,1198.0,17.83,1189,1708,5.0,1.0,6.0,5,3.0,4.0,4.0,2.0,995.0,0.0,5,3,5,3,984,20,20,5,6,11,7,1,2,3,2,3,5,3,2,100,2,3,1,5,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,7226.3,8567.55,12999.54,2,12,,7.0
4,2,3,,,,,6.0,,1.0,,,1.0,1204.0,,1195.0,,,0.0,39.0,39.0,9.0,,,,0.0,,,2.0,6.0,3.0,,,,,,,,,,1204.0,27.0,25.0,1.0,3.0,,,2.0,,,0.0,5.0,,1.0,4.0,,,1.0,5.0,2.0,,,,5.0,5.0,1.0,,1221.0,17.0,2.0,,1.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1204.0,1156.0,1204.0,1198.0,1204.0,1204.0,1231.0,,,,,,4.0,,,,,5.0,,,,,,,,4.0,5.0,5,,1.0,1.0,,,4.0,4.0,2.0,,,,,1,39,1,3.0,1204.0,18.33,1195,1758,5.0,1.0,6.0,5,2.0,4.0,4.0,2.0,995.0,3.0,5,5,5,5,984,20,20,5,6,11,7,1,2,3,2,3,5,3,2,100,2,3,1,5,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,7226.3,8567.55,12999.54,2,12,,6.19


Print the column names.

In [4]:
preg.columns

Index(['caseid', 'pregordr', 'howpreg_n', 'howpreg_p', 'moscurrp', 'nowprgdk',
       'pregend1', 'pregend2', 'nbrnaliv', 'multbrth',
       ...
       'laborfor_i', 'religion_i', 'metro_i', 'basewgt', 'adj_mod_basewgt',
       'finalwgt', 'secu_p', 'sest', 'cmintvw', 'totalwgt_lb'],
      dtype='object', length=244)

Make a dictionary that maps from each respondent's `caseid` to a list of indices into the pregnancy `DataFrame`.  Use it to select the pregnancy outcomes for a single respondent.

In [5]:
caseid = 10229
preg_map = nsfg.MakePregMap(preg)
indices = preg_map[caseid]
preg.outcome[indices].values

array([4, 4, 4, 4, 4, 4, 1], dtype=int64)
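As an illustration of the idea behind such a map, the sketch below groups row positions by `caseid` using `collections.defaultdict`. It is a minimal sketch on a plain Python list and is not necessarily how `nsfg.MakePregMap` is implemented; `make_index_map` is a hypothetical name.

```python
from collections import defaultdict

def make_index_map(caseids):
    """Map each caseid to the list of row positions where it appears."""
    index_map = defaultdict(list)
    for position, caseid in enumerate(caseids):
        index_map[caseid].append(position)
    return index_map

# Toy caseid column matching the first five rows of preg.head().
caseids = [1, 1, 2, 2, 2]
index_map = make_index_map(caseids)
print(dict(index_map))
```

Building the map once costs a single pass over the rows, after which every lookup by respondent is a dictionary access rather than a scan of the whole table.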

## Exercises

Select the `birthord` column, print the value counts, and compare them with the results published in the [codebook](http://www.icpsr.umich.edu/nsfg6/Controller?displayPage=labelDetails&fileCode=PREG&section=A&subSec=8016&srtLabel=611933).

In [7]:
preg['birthord'].value_counts()

1.00     4413
2.00     2874
3.00     1234
4.00      421
5.00      126
6.00       50
7.00       20
8.00        7
9.00        2
10.00       1
Name: birthord, dtype: int64

Count the number of NaN values, which correspond to pregnancies that did not end in a live birth.

In [8]:
preg.birthord.isnull().sum()

4445
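A quick consistency check on these two results: the value counts and the NaN count together should account for every row. The sketch below simply re-adds the numbers printed above; assuming `birthord` is blank exactly for outcomes other than a live birth, the first sum should match `(preg.outcome == 1).sum()` and the second should match `len(preg)`.

```python
# Value counts for birthord as printed above (birth order -> count).
birthord_counts = {1: 4413, 2: 2874, 3: 1234, 4: 421, 5: 126,
                   6: 50, 7: 20, 8: 7, 9: 2, 10: 1}
n_missing = 4445  # result of preg.birthord.isnull().sum()

# Every non-missing birthord value corresponds to one live birth.
n_live_births = sum(birthord_counts.values())

# Live births plus missing values should cover all pregnancy records.
n_total = n_live_births + n_missing
print(n_live_births, n_total)
```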