Case study: Do first babies tend to arrive late?

**Anecdotal evidence** usually fails, because:

* Small # of observations: If pregnancy length is longer for 1st babies, the difference is probably small compared to natural variation. In that case, we might have to compare a large number of pregnancies to be sure that a difference exists.
* Selection bias: People who join a discussion of this question might be interested *because* their 1st babies were late. In that case the process of selecting data would bias the results.
* Confirmation bias: People who believe the claim might be more likely to contribute examples that confirm it. People who doubt the claim are more likely to cite counterexamples.
* Inaccuracy: Anecdotes are often personal stories, and are often misremembered, misrepresented, repeated inaccurately, etc.

To address the limitations of anecdotes, use the tools of statistics, such as:

* Data collection: here = data from a large national survey designed explicitly w/ goal of generating statistically valid inferences about the U.S. population.
* Descriptive stats: generate statistics that summarize data concisely + evaluate different ways to visualize data.
* EDA: look for patterns, differences, + other features that address the questions we're interested in + check for inconsistencies + identify limitations.
* Estimation: use data from a sample (statistic) to estimate characteristics of the general population (parameter)
* Hypothesis testing: Where we see apparent effects, like a difference between 2 groups, evaluate whether the effect might have happened by chance.

Since '73, the CDC has conducted the National Survey of Family Growth (NSFG) to gather info on family life, marriage + divorce, pregnancy, infertility, use of contraception, and men’s + women’s health. The results are used to plan health services and health education programs, + to do statistical studies of families, fertility, and health.

To use this data effectively, we have to understand the design of the study:
* **cross-sectional** = captures a *snapshot* of a group at a point in time
    * common alternative = **longitudinal** = observes a group repeatedly over a period of time
* conducted 7 times + each deployment = a **cycle** 
    * we have cycle 6: Jan 2002 - Mar 2003
* Target population to draw conclusions about = people in US aged 15-44
    * impossible to survey entire US population, so take a **sample** of **respondents**
* ideally the sample is **representative** of the entire population (every member of the target pop. has an equal chance of being sampled/selected)
    * hard to achieve in practice
* NSFG is *NOT* representative --> is deliberately **oversampled** = 3 groups, Hispanics, African-Americans, + teenagers were recruited at higher rates than their *actual* representation in the US pop to make sure # of respondents in each group was large enough to draw valid inferences
    * oversampling drawback = not as easy to draw conclusions about general pop. from our statistics

### Variables used in this book:

* **caseid** = integer ID of the respondent.
* **prglngth** = integer duration of the pregnancy in weeks.
* **outcome** = integer code for the outcome of the pregnancy where 1 = a live birth.
* **pregordr** = a pregnancy serial # (code for a respondent’s 1st pregnancy = 1, for the 2nd pregnancy = 2, and so on).
* **birthord** = a serial # for live births; the code for a respondent’s 1st child = 1, and so on. 
    * *For outcomes other than live birth, this field is blank.*
* **birthwgt_lb** and **birthwgt_oz** contain pounds + ounces parts of the birth weight of the baby.
* **agepreg** = mother’s age at the end of the pregnancy.
* **finalwgt** = the *statistical* weight associated with the respondent = a floating-point value that indicates the # of people in the U.S. population this respondent represents
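
Because of oversampling, a plain average over respondents can differ from an average that accounts for how many people each respondent represents. A minimal sketch with made-up numbers (the values and weights below are hypothetical, not NSFG data), using weights that play the role of **finalwgt**:

```python
import numpy as np

# Hypothetical toy sample: 3 of 4 respondents come from an oversampled group,
# so each of them stands for fewer people in the population.
values = np.array([10.0, 10.0, 10.0, 20.0])        # some measured quantity
weights = np.array([100.0, 100.0, 100.0, 900.0])   # people represented (like finalwgt)

unweighted = values.mean()                          # every respondent counts equally
weighted = np.average(values, weights=weights)      # respondents count by weight

print(unweighted, weighted)
```

The two numbers differ because the single heavily weighted respondent dominates the weighted average.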

In [25]:
import sys
import numpy as np
import nsfg
import thinkstats2 # Python module that contains many classes + functions in this book 
# including functions that read the Stata dictionary and the NSFG data file. 

from collections import defaultdict

## Examples from Chapter 1

Read NSFG data into a Pandas DataFrame with **ReadFemPreg()**. It uses **ReadStataDct()**, which takes the name of a dictionary file + returns a **dct** = a FixedWidthVariables object that contains the info from the dictionary file.

A **dct** object provides **ReadFixedWidth()**, which reads the data file + results in a DataFrame

Our dictionary file = **2002FemPreg.dct**, a **Stata** (a statistical software system) dictionary file

A “dictionary” = a list of variable names, types, + indices that ID where in each line to find each variable.

Example lines:

* infile dictionary {
    *  _column(1)  str12  caseid    %12s  "RESPONDENT ID NUMBER"
    *  _column(13) byte   pregordr   %2f  "PREGNANCY ORDER (NUMBER)"
* }

This dictionary describes 2 variables: **caseid**, a 12-character string that represents the respondent ID + **pregordr**, a 1-byte integer that indicates which pregnancy this record describes for this respondent.
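
To see how column positions like these turn into a DataFrame, here is a rough sketch using `pandas.read_fwf` on made-up records. It is only an illustration of fixed-width parsing, not necessarily how `thinkstats2` does it; the records and the `colspecs` list are assumptions built from the 2 dictionary lines above.

```python
import io
import pandas as pd

# Made-up records laid out as the dictionary describes:
# characters 1-12 hold caseid, characters 13-14 hold pregordr.
records = [(1, 1), (1, 2), (2, 1)]
raw = "".join("{:>12}{:>2}\n".format(c, p) for c, p in records)

# _column(1) and _column(13) are 1-based starts;
# colspecs are 0-based, half-open (start, end) pairs.
colspecs = [(0, 12), (12, 14)]

df = pd.read_fwf(
    io.StringIO(raw),
    colspecs=colspecs,
    names=["caseid", "pregordr"],
    header=None,
)
print(df)
```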

In [26]:
preg = nsfg.ReadFemPreg() 
preg.head()

Unnamed: 0,caseid,pregordr,howpreg_n,howpreg_p,moscurrp,nowprgdk,pregend1,pregend2,nbrnaliv,multbrth,...,laborfor_i,religion_i,metro_i,basewgt,adj_mod_basewgt,finalwgt,secu_p,sest,cmintvw,totalwgt_lb
0,1,1,,,,,6.0,,1.0,,...,0,0,0,3410.389399,3869.349602,6448.271112,2,9,,8.8125
1,1,2,,,,,6.0,,1.0,,...,0,0,0,3410.389399,3869.349602,6448.271112,2,9,,7.875
2,2,1,,,,,5.0,,3.0,5.0,...,0,0,0,7226.30174,8567.54911,12999.542264,2,12,,9.125
3,2,2,,,,,6.0,,1.0,,...,0,0,0,7226.30174,8567.54911,12999.542264,2,12,,7.0
4,2,3,,,,,6.0,,1.0,,...,0,0,0,7226.30174,8567.54911,12999.542264,2,12,,6.1875


Check the dimensions of the df, as well as the column names

In [27]:
preg.shape

(13593, 244)

In [28]:
# Print column names.
preg.columns

Index(['caseid', 'pregordr', 'howpreg_n', 'howpreg_p', 'moscurrp', 'nowprgdk',
       'pregend1', 'pregend2', 'nbrnaliv', 'multbrth',
       ...
       'laborfor_i', 'religion_i', 'metro_i', 'basewgt', 'adj_mod_basewgt',
       'finalwgt', 'secu_p', 'sest', 'cmintvw', 'totalwgt_lb'],
      dtype='object', length=244)

Select a single column name via its index

In [29]:
# get 2nd col
preg.columns[1]

'pregordr'

In [30]:
# get a column via its name and check its data type
pregordr = preg['pregordr']
type(pregordr)

pandas.core.series.Series

It's a **Series** - like a list but w/ additional features (such as indexes)

Print this series column.

In [31]:
pregordr

0        1
1        2
2        1
3        2
4        3
5        1
6        2
7        3
8        1
9        2
10       1
11       1
12       2
13       3
14       1
15       2
16       3
17       1
18       2
19       1
20       2
21       1
22       2
23       1
24       2
25       3
26       1
27       1
28       2
29       3
        ..
13563    2
13564    3
13565    1
13566    1
13567    1
13568    2
13569    1
13570    2
13571    3
13572    4
13573    1
13574    2
13575    1
13576    1
13577    2
13578    1
13579    2
13580    1
13581    2
13582    3
13583    1
13584    2
13585    1
13586    2
13587    3
13588    1
13589    2
13590    3
13591    4
13592    5
Name: pregordr, Length: 13593, dtype: int64

In [32]:
# Select a single element (the 1st one) from a column via its index
# result is an int64 object
pregordr[0]

1

Select a **slice** from a column.

In [33]:
# result of a Slice = another series but w/ original indices
pregordr[2:5]

2    1
3    2
4    3
Name: pregordr, dtype: int64
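
A small sketch of what "original indices" means in practice, using a made-up 5-element Series (the data is illustrative only): after slicing, position 0 and label 0 are no longer the same thing, so `iloc` (by position) and `loc` (by label) are the unambiguous ways to pick an element.

```python
import pandas as pd

s = pd.Series([1, 2, 1, 2, 3], name="pregordr")

part = s[2:5]                # slice by position; labels 2, 3, 4 are kept
print(part.index.tolist())

first_by_position = part.iloc[0]   # 1st element of the slice
first_by_label = part.loc[2]       # same element, selected by its label

# part[0] would look for the label 0, which is not in the slice.
print(first_by_position, first_by_label)
```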

In [34]:
# Select a column using dot notation
# ****only works if col name is a valid Python identifier = must begin with a letter, can’t contain spaces, etc.
pregordr = preg.pregordr
pregordr

0        1
1        2
2        1
3        2
4        3
5        1
6        2
7        3
8        1
9        2
10       1
11       1
12       2
13       3
14       1
15       2
16       3
17       1
18       2
19       1
20       2
21       1
22       2
23       1
24       2
25       3
26       1
27       1
28       2
29       3
        ..
13563    2
13564    3
13565    1
13566    1
13567    1
13568    2
13569    1
13570    2
13571    3
13572    4
13573    1
13574    2
13575    1
13576    1
13577    2
13578    1
13579    2
13580    1
13581    2
13582    3
13583    1
13584    2
13585    1
13586    2
13587    3
13588    1
13589    2
13590    3
13591    4
13592    5
Name: pregordr, Length: 13593, dtype: int64

## 1.5 Variables


In [35]:
# Count # of times each variable value occurs w/ value_counts()
# sort them by index #
preg.outcome.value_counts().sort_index()

1    9148
2    1862
3     120
4    1921
5     190
6     352
Name: outcome, dtype: int64

In [36]:
# another column
preg.sest.value_counts().sort_index()

1     141
2     449
3      47
4      70
5      91
6      80
7     114
8      55
9     193
10    236
11    221
12    114
13     47
14    210
15    186
16    131
17    265
18     84
19     77
20     91
21     96
22    195
23    101
24     77
25    236
26     50
27    115
28    344
29     55
30    216
     ... 
55    152
56    239
57    318
58    157
59    126
60     52
61    187
62    212
63    208
64     57
65    470
66     42
67    101
68     37
69    311
70     95
71    155
72    132
73    103
74    147
75    194
76     64
77    181
78    648
79     65
80     67
81    178
82    174
83     57
84    262
Name: sest, Length: 84, dtype: int64

In [37]:
# another column
preg.birthwgt_lb.value_counts().sort_index()

0.0        8
1.0       40
2.0       53
3.0       98
4.0      229
5.0      697
6.0     2223
7.0     3049
8.0     1889
9.0      623
10.0     132
11.0      26
12.0      10
13.0       3
14.0       3
15.0       1
Name: birthwgt_lb, dtype: int64

Many of the variables are **recodes** = are not part of the raw data collected by the survey + are calculated *using* the raw data.

* **prglngth** for live births =  raw variable **wksgest (weeks of gestation)** if available 
* Otherwise it is estimated using **mosgest** (months of gestation) * 4.33 (average # of weeks in a month).
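
The fallback rule above can be sketched on toy data like this. The column names come from the notes; the values and the name `est` are hypothetical, and the real NSFG recode may apply extra rounding or consistency checks that are not shown here.

```python
import numpy as np
import pandas as pd

WEEKS_PER_MONTH = 4.33  # average # of weeks in a month

# Toy raw data: wksgest is missing for the 2nd record, mosgest for the 3rd.
raw = pd.DataFrame({
    "wksgest": [39.0, np.nan, 41.0],
    "mosgest": [9.0, 9.0, np.nan],
})

# Prefer wksgest; where it is missing, fall back to mosgest * 4.33.
est = raw["wksgest"].fillna(raw["mosgest"] * WEEKS_PER_MONTH)
print(est)
```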

Recodes are often based on logic that checks consistency + accuracy of the data.

Good idea to use recodes when available, unless there is a compelling reason to process raw data yourself.

## 1.6 Cleaning + Transformations

Data cleaning: we often have to check for errors in data, deal w/ special values, convert data into different formats, + perform calculations.

nsfg.py includes **CleanFemPreg()**, a function to clean the variables we will use

In [38]:
preg.agepreg.head()

0    33.16
1    39.25
2    14.33
3    17.83
4    18.33
Name: agepreg, dtype: float64

In [39]:
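def CleanFemPreg(df):
    # agepreg is stored in centiyears; convert to years
    df.agepreg /= 100.0

    # a birth weight > 20 lb is almost certainly an error (the 51 lb baby);
    # mark it missing *before* totalwgt_lb is computed so it can't leak in
    df.loc[df.birthwgt_lb > 20, 'birthwgt_lb'] = np.nan

    # 97, 98, 99 are special codes, not real weights
    na_vals = [97, 98, 99]
    df.birthwgt_lb.replace(na_vals, np.nan, inplace=True)
    df.birthwgt_oz.replace(na_vals, np.nan, inplace=True)

    # combine pounds + ounces into a single quantity, in pounds
    df['totalwgt_lb'] = df.birthwgt_lb + df.birthwgt_oz / 16.0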

**agepreg** = mother’s age at end of pregnancy, encoded as an integer number of *centiyears*, so the 1st line divides each element of agepreg by 100, yielding a floating-point value in years.

**birthwgt_lb** and **birthwgt_oz** contain weight of a baby, in pounds + ounces, for pregnancies that end in live birth. 
In addition it uses several special codes:

* 97 NOT ASCERTAINED
* 98 REFUSED  
* 99 DON'T KNOW

"Special values" encoded as #'s = dangerous b/c if not handled properly, they can generate bogus results, like a 99-pound baby. 
The **replace method** replaces these values with **np.nan** = special floating-point value that represents “not a number.” 
**inplace flag** = modify the *existing* Series rather than create a new one.

As part of the IEEE floating-point standard, all mathematical operations return nan if either argument is nan, so computations w/ nan tend to do the right thing + most pandas functions handle nan appropriately. 
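
A quick sketch of why the special codes matter, on a made-up Series of pound values (not NSFG data): left in place, 98 and 99 inflate the mean; once replaced with `np.nan`, pandas skips them by default.

```python
import numpy as np
import pandas as pd

lb = pd.Series([7, 8, 99, 6, 98])        # 98 + 99 are codes, not weights

bogus_mean = lb.mean()                   # codes treated as real weights

clean = lb.replace([97, 98, 99], np.nan) # returns a new Series (no inplace)
clean_mean = clean.mean()                # pandas ignores nan by default

print(bogus_mean, clean_mean)
print(np.nan + 1)                        # arithmetic with nan gives nan
print(clean.isnull().sum())              # how many values were replaced
```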

Then we create a new column **totalwgt_lb** = combines pounds + ounces into a single quantity, in pounds.

***NOTE*** When you add a new column to a DataFrame, must use *dictionary* syntax + *not* dot notation ( df.totalwgt_lb = df.birthwgt_lb + df.birthwgt_oz / 16.0 )

The version w/ dot notation *adds* an attribute to the DataFrame object that is *NOT* treated as a new column
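
A minimal demonstration of the difference on a toy DataFrame (the data and the name `total_attr` are made up for illustration). pandas emits a warning for the dot-notation assignment, which is silenced here only to keep the output short.

```python
import warnings
import pandas as pd

df = pd.DataFrame({"birthwgt_lb": [7.0, 8.0], "birthwgt_oz": [4.0, 8.0]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # pandas warns about this pattern
    # Dot notation: sets a plain Python attribute, NOT a column.
    df.total_attr = df.birthwgt_lb + df.birthwgt_oz / 16.0

# Dictionary syntax: creates a real column.
df["totalwgt_lb"] = df.birthwgt_lb + df.birthwgt_oz / 16.0

print(list(df.columns))
```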

The **loc** line replaces birth weights > 20 lbs with NaN. **loc** provides several ways to select rows + columns from a DataFrame.
   * In this example, 1st expression = row indexer (a boolean Series); 2nd expression = the column to modify
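
The same pattern on a toy DataFrame, with the boolean row indexer pulled out into its own variable (`too_heavy` is a hypothetical name) so both parts of the `loc` expression are visible:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "birthwgt_lb": [7.0, 51.0, 8.0],
    "birthwgt_oz": [4.0, 0.0, 8.0],
})

too_heavy = df.birthwgt_lb > 20            # row indexer: boolean Series
df.loc[too_heavy, "birthwgt_lb"] = np.nan  # only that column is changed

print(df)
```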

## 1.7 Validation

Validate data from other sources before working w/ it, + understand it well enough not to misinterpret it.

1 way to validate = compute summary statistics + compare them w/ published results

NSFG codebook includes tables that summarize each variable. Here is the table for **outcome** (encodes outcome of each pregnancy):

| value | label | Total |
|---|---|---|
| 1 | LIVE BIRTH | 9148 |
| 2 | INDUCED ABORTION | 1862 |
| 3 | STILLBIRTH | 120 |
| 4 | MISCARRIAGE | 1921 |
| 5 | ECTOPIC PREGNANCY | 190 |
| 6 | CURRENT PREGNANCY | 352 |

Now check

In [40]:
preg.outcome.value_counts()

1    9148
4    1921
2    1862
6     352
5     190
3     120
Name: outcome, dtype: int64

Similarly, here is the published table for birthwgt_lb

| value | label | Total |
|---|---|---|
| . | INAPPLICABLE | 4449 |
| 0-5 | UNDER 6 POUNDS | 1125 |
| 6 | 6 POUNDS | 2223 |
| 7 | 7 POUNDS | 3049 |
| 8 | 8 POUNDS | 1889 |
| 9-95 | 9 POUNDS OR MORE | 799 |

In [41]:
preg.birthwgt_lb.value_counts().sort_index()

0.0        8
1.0       40
2.0       53
3.0       98
4.0      229
5.0      697
6.0     2223
7.0     3049
8.0     1889
9.0      623
10.0     132
11.0      26
12.0      10
13.0       3
14.0       3
15.0       1
Name: birthwgt_lb, dtype: int64
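
The codebook groups values into ranges, so the counts have to be summed before comparing. A sketch that re-enters the counts printed above by hand (rather than reading the data file) and adds up the 2 ranges:

```python
import pandas as pd

# Counts copied from the value_counts() output above; index = pounds.
counts = pd.Series(
    [8, 40, 53, 98, 229, 697, 2223, 3049, 1889, 623, 132, 26, 10, 3, 3, 1],
    index=range(16),
)

under_6 = counts.loc[0:5].sum()    # label slice, inclusive: 0..5 lb
nine_plus = counts.loc[9:].sum()   # 9 lb and up

print(under_6, nine_plus)
```

Under 6 lb matches the published 1125. The 9+ total is 1 short of the published 799, which is consistent with the 51 lb record that **CleanFemPreg()** sets to NaN.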

Make a dictionary that maps from each respondent's `caseid` to a list of indices into the pregnancy `DataFrame`.  Use it to select the pregnancy outcomes for a single respondent.

In [42]:
caseid = 10229
preg_map = nsfg.MakePregMap(preg)
indices = preg_map[caseid]
preg.outcome[indices].values

array([4, 4, 4, 4, 4, 4, 1], dtype=int64)
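
For reference, here is one way such a map could be built with `defaultdict`. This is a sketch on toy data, not necessarily the code in `nsfg.py`; `make_preg_map` and the toy DataFrame are hypothetical.

```python
from collections import defaultdict

import pandas as pd


def make_preg_map(df):
    """Map each caseid to the list of row indices of its pregnancies."""
    d = defaultdict(list)
    for index, caseid in df["caseid"].items():
        d[caseid].append(index)
    return d


toy = pd.DataFrame({
    "caseid": [1, 1, 2, 2, 2],
    "outcome": [1, 1, 4, 1, 1],
})

toy_map = make_preg_map(toy)
indices = toy_map[2]                       # row labels for caseid 2
outcomes = toy.loc[indices, "outcome"].values

print(indices, outcomes)
```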

## Exercises

Select the `birthord` column, print the value counts, and compare to results published in the [codebook](http://www.icpsr.umich.edu/nsfg6/Controller?displayPage=labelDetails&fileCode=PREG&section=A&subSec=8016&srtLabel=611933)

In [43]:
# Solution goes here

We can also use `isnull` to count the number of nans.

In [44]:
preg.birthord.isnull().sum()

4445

Select the `prglngth` column, print the value counts, and compare to results published in the [codebook](http://www.icpsr.umich.edu/nsfg6/Controller?displayPage=labelDetails&fileCode=PREG&section=A&subSec=8016&srtLabel=611931)

In [45]:
# Solution goes here

To compute the mean of a column, you can invoke the `mean` method on a Series.  For example, here is the mean birthweight in pounds:

In [46]:
preg.totalwgt_lb.mean()

7.265628457623368

Create a new column named `totalwgt_kg` that contains birth weight in kilograms.  Compute its mean.  Remember that when you create a new column, you have to use dictionary syntax, not dot notation.

In [47]:
# Solution goes here

`nsfg.py` also provides `ReadFemResp`, which reads the female respondents file and returns a `DataFrame`:

In [48]:
resp = nsfg.ReadFemResp()

`DataFrame` provides a method `head` that displays the first five rows:

In [49]:
resp.head()

Unnamed: 0,caseid,rscrinf,rdormres,rostscrn,rscreenhisp,rscreenrace,age_a,age_r,cmbirth,agescrn,...,pubassis_i,basewgt,adj_mod_basewgt,finalwgt,secu_r,sest,cmintvw,cmlstyr,screentime,intvlngth
0,2298,1,5,5,1,5.0,27,27,902,27,...,0,3247.916977,5123.759559,5556.717241,2,18,1234,1222,18:26:36,110.492667
1,5012,1,5,1,5,5.0,42,42,718,42,...,0,2335.279149,2846.79949,4744.19135,2,18,1233,1221,16:30:59,64.294
2,11586,1,5,1,5,5.0,43,43,708,43,...,0,2335.279149,2846.79949,4744.19135,2,18,1234,1222,18:19:09,75.149167
3,6794,5,5,4,1,5.0,15,15,1042,15,...,0,3783.152221,5071.464231,5923.977368,2,18,1234,1222,15:54:43,28.642833
4,616,1,5,4,1,5.0,20,20,991,20,...,0,5341.329968,6437.335772,7229.128072,2,18,1233,1221,14:19:44,69.502667


Select the `age_r` column from `resp` and print the value counts.  How old are the youngest and oldest respondents?

In [50]:
# Solution goes here

We can use the `caseid` to match up rows from `resp` and `preg`.  For example, we can select the row from `resp` for `caseid` 2298 like this:

In [51]:
resp[resp.caseid==2298]

Unnamed: 0,caseid,rscrinf,rdormres,rostscrn,rscreenhisp,rscreenrace,age_a,age_r,cmbirth,agescrn,...,pubassis_i,basewgt,adj_mod_basewgt,finalwgt,secu_r,sest,cmintvw,cmlstyr,screentime,intvlngth
0,2298,1,5,5,1,5.0,27,27,902,27,...,0,3247.916977,5123.759559,5556.717241,2,18,1234,1222,18:26:36,110.492667


And we can get the corresponding rows from `preg` like this:

In [52]:
preg[preg.caseid==2298]

Unnamed: 0,caseid,pregordr,howpreg_n,howpreg_p,moscurrp,nowprgdk,pregend1,pregend2,nbrnaliv,multbrth,...,laborfor_i,religion_i,metro_i,basewgt,adj_mod_basewgt,finalwgt,secu_p,sest,cmintvw,totalwgt_lb
2610,2298,1,,,,,6.0,,1.0,,...,0,0,0,3247.916977,5123.759559,5556.717241,2,18,,6.875
2611,2298,2,,,,,6.0,,1.0,,...,0,0,0,3247.916977,5123.759559,5556.717241,2,18,,5.5
2612,2298,3,,,,,6.0,,1.0,,...,0,0,0,3247.916977,5123.759559,5556.717241,2,18,,4.1875
2613,2298,4,,,,,6.0,,1.0,,...,0,0,0,3247.916977,5123.759559,5556.717241,2,18,,6.875
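
Boolean selection works for 1 respondent at a time. To attach respondent info to every pregnancy row at once, a join on `caseid` is one option. A sketch with toy frames: the `age_r` values are taken from the `resp.head()` output above, the `pregordr` rows are made up, and `resp_toy`, `preg_toy`, `joined` are hypothetical names.

```python
import pandas as pd

resp_toy = pd.DataFrame({"caseid": [2298, 5012], "age_r": [27, 42]})
preg_toy = pd.DataFrame({
    "caseid": [2298, 2298, 5012],
    "pregordr": [1, 2, 1],
})

# Left join: keep every pregnancy row, add the matching respondent columns.
joined = preg_toy.merge(resp_toy, on="caseid", how="left")
print(joined)
```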


How old is the respondent with `caseid` 1?

In [53]:
# Solution goes here

What are the pregnancy lengths for the respondent with `caseid` 2298?

In [54]:
# Solution goes here

What was the birthweight of the first baby born to the respondent with `caseid` 5012?

In [55]:
# Solution goes here