## 01. Whether First Babies Tend to Be Born Late

In [30]:
import numpy as np
import pandas as pd

In [8]:
from IPython.display import HTML

In [39]:
preg = pd.read_csv('chap01/nsfg_pregnancy_data.csv')
preg.head()

Unnamed: 0,caseid,pregordr,howpreg_n,howpreg_p,moscurrp,nowprgdk,pregend1,pregend2,nbrnaliv,multbrth,...,poverty_i,laborfor_i,religion_i,metro_i,basewgt,adj_mod_basewgt,finalwgt,secu_p,sest,cmintvw
0,1,1,,,,,6.0,,1.0,,...,0,0,0,0,3410.389399,3869.349602,6448.271112,2,9,1231
1,1,2,,,,,6.0,,1.0,,...,0,0,0,0,3410.389399,3869.349602,6448.271112,2,9,1231
2,2,1,,,,,5.0,,3.0,5.0,...,0,0,0,0,7226.30174,8567.54911,12999.542264,2,12,1231
3,2,2,,,,,6.0,,1.0,,...,0,0,0,0,7226.30174,8567.54911,12999.542264,2,12,1231
4,2,3,,,,,6.0,,1.0,,...,0,0,0,0,7226.30174,8567.54911,12999.542264,2,12,1231


In [4]:
preg.columns

Index(['caseid', 'pregordr', 'howpreg_n', 'howpreg_p', 'moscurrp', 'nowprgdk',
       'pregend1', 'pregend2', 'nbrnaliv', 'multbrth',
       ...
       'poverty_i', 'laborfor_i', 'religion_i', 'metro_i', 'basewgt',
       'adj_mod_basewgt', 'finalwgt', 'secu_p', 'sest', 'cmintvw'],
      dtype='object', length=243)

In [5]:
preg['pregordr'].head()

0    1
1    2
2    1
3    2
4    3
Name: pregordr, dtype: int64

```pregordr``` is a pregnancy serial number: the code for a respondent's first pregnancy is 1, for the second pregnancy is 2, and so on.

### Validation
If you invest time to validate the data, you can save time later and avoid errors.

One way to validate data is to __compute basic statistics and compare them with published results__.

Here is the table for ```outcome```, which encodes the outcome of each pregnancy.

In [9]:
def show_table(d):
    df = pd.DataFrame(d)
    return HTML(df.to_html(index=False))

In [10]:
d = {
    "Value": [1, 2, 3, 4, 5, 6, "Total"],
    "Label": [
        "LIVE BIRTH",
        "INDUCED ABORTION",
        "STILLBIRTH",
        "MISCARRIAGE",
        "ECTOPIC PREGNANCY",
        "CURRENT PREGNANCY",
        "",
    ],
    "TOTAL": [9148, 1862, 120, 1921, 190, 352, 13593],
}
show_table(d)

Value,Label,TOTAL
1,LIVE BIRTH,9148
2,INDUCED ABORTION,1862
3,STILLBIRTH,120
4,MISCARRIAGE,1921
5,ECTOPIC PREGNANCY,190
6,CURRENT PREGNANCY,352
Total,,13593


The "TOTAL" column gives the published number of pregnancies with each outcome. Let's check these totals against the data:

In [12]:
preg["outcome"].unique()

array([1, 2, 4, 5, 3, 6])

In [11]:
preg["outcome"].value_counts().sort_index()

outcome
1    9148
2    1862
3     120
4    1921
5     190
6     352
Name: count, dtype: int64
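The comparison with the published table can also be done programmatically rather than by eye. The following is a minimal sketch: `compare_counts` is a hypothetical helper (not part of pandas or the NSFG materials), and the toy series is made-up data standing in for ```preg["outcome"]```.

```python
import pandas as pd

def compare_counts(column, published):
    """Tabulate observed vs. published counts for each code (hypothetical helper)."""
    observed = column.value_counts().sort_index()
    # Building the frame from two Series aligns them on the union of their codes.
    table = pd.DataFrame({"observed": observed, "published": pd.Series(published)})
    # A code missing on one side shows up as NaN; treat it as a count of 0.
    table = table.fillna(0).astype(int)
    table["match"] = table["observed"] == table["published"]
    return table

# Made-up stand-in data, not NSFG values.
toy_outcome = pd.Series([1, 1, 2, 4, 1, 4])
toy_published = {1: 3, 2: 1, 4: 2}

result = compare_counts(toy_outcome, toy_published)
print(result)
```

With the real data, the call would presumably look like `compare_counts(preg["outcome"], {1: 9148, 2: 1862, 3: 120, 4: 1921, 5: 190, 6: 352})`, using the totals from the published table.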

The counts match the published totals, so the values in ```outcome``` appear to be correct. Similarly, we can check ```birthwgt_lb``` against its published table.

In [24]:
counts = preg['birthwgt_lb'].value_counts().sort_index()

In [25]:
counts.loc[0.0:5.0].sum()

np.int64(1125)

In [26]:
d = {
    "Value": [".", "0-5", "6", "7", "8", "9-95", "97", "98", "99", "Total"],
    "Label": [
        "inapplicable",
        "UNDER 6 POUNDS",
        "6 POUNDS",
        "7 POUNDS",
        "8 POUNDS",
        "9 POUNDS OR MORE",
        "Not ascertained",
        "REFUSED",
        "DON'T KNOW",
        "",
    ],
    "TOTAL": [4449, 1125, 2223, 3049, 1889, 799, 1, 1, 57, 13593],
}
show_table(d)

Value,Label,TOTAL
.,inapplicable,4449
0-5,UNDER 6 POUNDS,1125
6,6 POUNDS,2223
7,7 POUNDS,3049
8,8 POUNDS,1889
9-95,9 POUNDS OR MORE,799
97,Not ascertained,1
98,REFUSED,1
99,DON'T KNOW,57
Total,,13593


In [28]:
counts = preg['birthwgt_lb'].value_counts(dropna=False).sort_index()
# dropna=False keeps missing values (NaN) in the counts instead of dropping them
counts

birthwgt_lb
0.0        8
1.0       40
2.0       53
3.0       98
4.0      229
5.0      697
6.0     2223
7.0     3049
8.0     1889
9.0      623
10.0     132
11.0      26
12.0      10
13.0       3
14.0       3
15.0       1
51.0       1
97.0       1
98.0       1
99.0      57
NaN     4449
Name: count, dtype: int64
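The published table groups the weights into ranges ("0-5", "9-95"), so the raw counts have to be summed over those ranges before they can be compared. The sketch below does this with label-based slicing; the counts are copied from the output above (with the `NaN` row left out), and the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts output above (NaN row omitted).
counts = pd.Series(
    [8, 40, 53, 98, 229, 697, 2223, 3049, 1889,
     623, 132, 26, 10, 3, 3, 1, 1, 1, 1, 57],
    index=[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
           9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 51.0, 97.0, 98.0, 99.0],
)

# .loc slices by label and includes both endpoints. On a sorted index the
# endpoints do not have to be present (95.0 is not an index label here).
under_6 = counts.loc[0.0:5.0].sum()
nine_or_more = counts.loc[9.0:95.0].sum()

print(under_6, nine_or_more)
```

Both sums agree with the published rows "UNDER 6 POUNDS" (1125) and "9 POUNDS OR MORE" (799). Note that the second range includes the 51-pound entry.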

The values 97, 98, and 99 represent cases where the birth weight is unknown. A simple option is to replace these values with ```NaN```. At the same time, we will also replace a value that is clearly wrong, 51 pounds.

In [31]:
preg["birthwgt_lb"] = preg["birthwgt_lb"].replace([51, 97, 98, 99], np.nan)

In [33]:
preg["birthwgt_lb"].value_counts(dropna=False).sort_index()

birthwgt_lb
0.0        8
1.0       40
2.0       53
3.0       98
4.0      229
5.0      697
6.0     2223
7.0     3049
8.0     1889
9.0      623
10.0     132
11.0      26
12.0      10
13.0       3
14.0       3
15.0       1
NaN     4509
Name: count, dtype: int64
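As a sanity check on the replacement, the number of missing values should grow by exactly the number of entries that were replaced: 4449 + 1 + 1 + 1 + 57 = 4509, which matches the output above. The same bookkeeping can be sketched on a small made-up series (not NSFG data):

```python
import numpy as np
import pandas as pd

# Made-up values: two plausible weights, three special codes, one missing value.
toy = pd.Series([7.0, 51.0, 97.0, 99.0, np.nan, 8.0])

special_codes = [51, 97, 98, 99]
cleaned = toy.replace(special_codes, np.nan)

missing_before = toy.isna().sum()
replaced = toy.isin(special_codes).sum()
missing_after = cleaned.isna().sum()

# Every replaced entry should become exactly one additional NaN.
print(missing_before, replaced, missing_after)
```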

### Transformation
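Some columns are easier to work with after a simple transformation. For example, ```agepreg``` contains the mother's age at the end of the pregnancy, encoded in hundredths of a year. We can use the ```mean``` method to compute its average.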

In [40]:
preg['agepreg'].mean()  # agepreg is encoded in centiyears (hundredths of a year)

np.float64(2468.8151197039497)

To convert it to years, we can divide through by 100.

In [41]:
preg['agepreg'] /= 100
preg['agepreg'].mean()

np.float64(24.6881511970395)

As another example, ```birthwgt_lb``` and ```birthwgt_oz``` contain birth weights with the pounds and ounces in separate columns. It will be more convenient to combine them into a single column that contains weights in pounds and fractions of a pound.

First, we'll clean ```birthwgt_oz``` as we did with ```birthwgt_lb```.

In [42]:
preg['birthwgt_oz'].value_counts(dropna=False).sort_index()

birthwgt_oz
0.0     1037
1.0      408
2.0      603
3.0      533
4.0      525
5.0      535
6.0      709
7.0      501
8.0      756
9.0      505
10.0     475
11.0     557
12.0     555
13.0     487
14.0     475
15.0     378
97.0       1
98.0       1
99.0      46
NaN     4506
Name: count, dtype: int64

In [43]:
preg["birthwgt_oz"] = preg["birthwgt_oz"].replace([97, 98, 99], np.nan)

Now we can use the cleaned values to create a new column that combines pounds and ounces into a single quantity.

In [44]:
preg['totalwgt_lb'] = preg['birthwgt_lb'] + preg['birthwgt_oz'] / 16.0
preg['totalwgt_lb'].mean()

np.float64(7.270508352693882)
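To see how the combined column behaves when one of the two parts is missing, here is a small sketch on made-up values (not NSFG data). The column names mirror the ones above; the frame `toy` is only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration.
toy = pd.DataFrame({
    "birthwgt_lb": [7.0, 8.0, np.nan, 6.0],
    "birthwgt_oz": [8.0, 0.0, 4.0, np.nan],
})

# There are 16 ounces in a pound, so 7 lb 8 oz becomes 7.5 lb.
toy["totalwgt_lb"] = toy["birthwgt_lb"] + toy["birthwgt_oz"] / 16.0

# If either part is NaN, the sum is NaN; mean() skips NaN by default.
mean_weight = toy["totalwgt_lb"].mean()
print(toy)
print(mean_weight)
```

This is why the special codes had to be replaced with `NaN` first: a leftover 99 in either column would silently produce a large, wrong total instead of a missing one.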

### Summary Statistics
A **statistic** is a number derived from a dataset, usually intended to quantify some aspect of the data. Examples include counts, mean, variance, and standard deviation.

Variance is a statistic that quantifies the spread of a set of values. It is the mean of the squared deviations, where a deviation is the difference between a value and the mean.

In [46]:
weights = preg['totalwgt_lb']
mean = weights.mean()

squared_deviations = (weights - mean) ** 2
squared_deviations

0        2.377738
1        0.365410
2        3.439139
3        0.073175
4        1.172907
           ...   
13588    1.172907
13589         NaN
13590         NaN
13591    0.052666
13592    0.052666
Name: totalwgt_lb, Length: 13593, dtype: float64

In [48]:
# Compute the mean of the squared deviations:
var = squared_deviations.mean()
var

np.float64(2.1980768905984482)

In [49]:
weights.var()

np.float64(2.1983200945031394)
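The two results differ slightly because they use different denominators. Taking the mean of the squared deviations divides by the number of non-missing values, n, while the pandas `var` method defaults to `ddof=1` and divides by n - 1. A minimal sketch on made-up values (not NSFG data) shows the relationship:

```python
import pandas as pd

# Made-up values for illustration.
values = pd.Series([6.0, 7.0, 8.0, 9.0])
n = values.count()

squared_deviations = (values - values.mean()) ** 2

population_var = squared_deviations.sum() / n    # same as squared_deviations.mean()
sample_var = squared_deviations.sum() / (n - 1)  # what var() computes by default

print(population_var, values.var(ddof=0))
print(sample_var, values.var())
```

Passing `ddof=0` to `var` reproduces the mean of the squared deviations. With thousands of values, as in the pregnancy data, the two versions differ only in the fourth decimal place.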

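Variance is useful in some computations, but it is not a good way to describe a dataset because its units are the square of the data's units (here, pounds squared). A better option is the **standard deviation**, which is the square root of variance and is expressed in the same units as the data.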

In [50]:
weights.std(ddof=0)

np.float64(1.4825912756381807)
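One way to see why the standard deviation is easier to interpret is to change units. The sketch below uses made-up weights (not NSFG data) and converts them from pounds to ounces: the standard deviation scales by the conversion factor, while the variance scales by its square.

```python
import pandas as pd

# Made-up weights in pounds, and the same weights in ounces.
values_lb = pd.Series([6.0, 7.0, 8.0, 9.0])
values_oz = values_lb * 16

# Ratio of spread in ounces to spread in pounds, for each statistic.
std_ratio = values_oz.std(ddof=0) / values_lb.std(ddof=0)
var_ratio = values_oz.var(ddof=0) / values_lb.var(ddof=0)

print(std_ratio)  # close to 16
print(var_ratio)  # close to 16 ** 2 = 256
```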

### Interpretation
As an example, let's select the rows in the pregnancy file with ```caseid``` 10229.

In [53]:
subset = preg.query("caseid == 10229")
subset.shape

(7, 244)

In [54]:
subset['outcome'].values

array([4, 4, 4, 4, 4, 4, 1])
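To read these codes, we can map them to the labels from the ```outcome``` table earlier in this section. The sketch below hard-codes the values shown above so that it runs on its own; `outcome_labels` is an illustrative name, not something defined in the dataset.

```python
import pandas as pd

# Labels copied from the published outcome table.
outcome_labels = {
    1: "LIVE BIRTH",
    2: "INDUCED ABORTION",
    3: "STILLBIRTH",
    4: "MISCARRIAGE",
    5: "ECTOPIC PREGNANCY",
    6: "CURRENT PREGNANCY",
}

# The outcome codes shown above for caseid 10229.
outcomes = pd.Series([4, 4, 4, 4, 4, 4, 1])

labels = outcomes.map(outcome_labels)
summary = labels.value_counts()
print(summary)
```

Assuming the rows are in pregnancy order, this respondent reported six pregnancies that ended in miscarriage, followed by one live birth. Each row of the file is a record about a real person, which is worth keeping in mind when interpreting the statistics.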