## Exploratory data analysis

### Approach

Let's start with a demographic question: do first babies tend to arrive late?

This simple question is hard to answer from anecdotal evidence, which makes it a good problem for starting to build our modeling toolkit with Python.

This toolkit includes techniques in data collection, descriptive statistics, exploratory data analysis, estimation, and hypothesis testing.

### Data source

We will use data from the National Survey of Family Growth (NSFG), which is stored as a `.csv` file in this repository. To read it, we will use a library called `pandas`.

`pandas` is one of the most useful Python packages for data analysis and modeling. Let's use it to read our input data:

In [1]:
import pandas as pd
df = pd.read_csv('.lesson/assets/FemPreg.csv')

In [2]:
df

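| | row_number | caseid | pregordr | howpreg_n | howpreg_p | moscurrp | nowprgdk | pregend1 | pregend2 | nbrnaliv | ... | laborfor_i | religion_i | metro_i | basewgt | adj_mod_basewgt | finalwgt | secu_p | sest | cmintvw | totalwgt_lb |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| 0 | 0 | 1 | 1 | | | | | 6.0 | | 1.0 | ... | 0 | 0 | 0 | 3410.389399 | 3869.349602 | 6448.271112 | 2 | 9 | | 8.8125 |
| 1 | 1 | 1 | 2 | | | | | 6.0 | | 1.0 | ... | 0 | 0 | 0 | 3410.389399 | 3869.349602 | 6448.271112 | 2 | 9 | | 7.8750 |
| 2 | 2 | 2 | 1 | | | | | 5.0 | | 3.0 | ... | 0 | 0 | 0 | 7226.301740 | 8567.549110 | 12999.542264 | 2 | 12 | | 9.1250 |
| 3 | 3 | 2 | 2 | | | | | 6.0 | | 1.0 | ... | 0 | 0 | 0 | 7226.301740 | 8567.549110 | 12999.542264 | 2 | 12 | | 7.0000 |
| 4 | 4 | 2 | 3 | | | | | 6.0 | | 1.0 | ... | 0 | 0 | 0 | 7226.301740 | 8567.549110 | 12999.542264 | 2 | 12 | | 6.1875 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13588 | 13588 | 12571 | 1 | | | | | 6.0 | | 1.0 | ... | 0 | 0 | 0 | 4670.540953 | 5795.692880 | 6269.200989 | 1 | 78 | | 6.1875 |
| 13589 | 13589 | 12571 | 2 | | | | | 3.0 | | | ... | 0 | 0 | 0 | 4670.540953 | 5795.692880 | 6269.200989 | 1 | 78 | | |
| 13590 | 13590 | 12571 | 3 | | | | | 3.0 | | | ... | 0 | 0 | 0 | 4670.540953 | 5795.692880 | 6269.200989 | 1 | 78 | | |
| 13591 | 13591 | 12571 | 4 | | | | | 6.0 | | 1.0 | ... | 0 | 0 | 0 | 4670.540953 | 5795.692880 | 6269.200989 | 1 | 78 | | 7.5000 |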


This dataset has been prepared to be easy to read, with columns that have (relatively) descriptive names:

In [3]:
df.columns

Index(['row_number', 'caseid', 'pregordr', 'howpreg_n', 'howpreg_p',
       'moscurrp', 'nowprgdk', 'pregend1', 'pregend2', 'nbrnaliv',
       ...
       'laborfor_i', 'religion_i', 'metro_i', 'basewgt', 'adj_mod_basewgt',
       'finalwgt', 'secu_p', 'sest', 'cmintvw', 'totalwgt_lb'],
      dtype='object', length=245)
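
As a small, self-contained illustration of the same loading and inspection steps, the sketch below reads a made-up three-row CSV from an in-memory string instead of the NSFG file. The values are invented; only the column names mirror the real data.

```python
import io

import pandas as pd

# Made-up stand-in for the real file: three pregnancy records, two respondents.
csv_text = """caseid,pregordr,outcome,prglngth
1,1,1,39
1,2,1,39
2,1,4,9
"""

# read_csv accepts a path or any file-like object, so StringIO works here.
toy = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = toy.shape          # (number of rows, number of columns)
column_names = list(toy.columns)    # column labels as a plain Python list
print(n_rows, n_cols)
print(column_names)
print(toy.head(2))                  # first two rows only
```

Checking `shape` and `columns` right after loading is a cheap way to confirm that the file was parsed as expected.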

Each column in the dataframe can be extracted and treated as a pandas `Series`, which can be thought of as an improved list. Let's examine `pregordr`, the pregnancy serial number.

In [4]:
pregord = df['pregordr']
type(pregord)
pregord

0        1
1        2
2        1
3        2
4        3
        ..
13588    1
13589    2
13590    3
13591    4
13592    5
Name: pregordr, Length: 13593, dtype: int64

We can see the components of a Series: the indices, the elements, the variable name, the length, and the data type.
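
Each of these components can also be read programmatically. The following is a minimal sketch on a made-up `Series` (the values are invented; the name simply mirrors the column above):

```python
import pandas as pd

# Made-up values, not the NSFG data.
s = pd.Series([1, 2, 1, 2, 3], name="pregordr")

index_labels = list(s.index)   # row labels: 0..4 by default
elements = s.to_numpy()        # the underlying values as a NumPy array
print(index_labels)
print(elements)
print(s.name, len(s), s.dtype)

# Unlike a plain list, comparisons apply element-wise,
# so a boolean mask can be used to filter the Series.
first_pregnancies = s[s == 1]
print(first_pregnancies)
```

The filtered result keeps the original index labels, which is one of the ways a `Series` differs from a list.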

### Variables

The most important variables in the dataset are the following:

- caseid is the integer ID of the respondent.

- prglngth is the integer duration of the pregnancy in weeks.

- outcome is an integer code for the outcome of the pregnancy. The code 1 indicates a live birth.

- pregordr is a pregnancy serial number; for example, the code for a respondent’s first pregnancy is 1, for the second pregnancy is 2, and so on.

- birthord is a serial number for live births; the code for a respondent’s first child is 1, and so on. For outcomes other than live birth, this field is blank.

- birthwgt_lb and birthwgt_oz contain the pounds and ounces parts of the birth weight of the baby.

- agepreg is the mother’s age at the end of the pregnancy.

- finalwgt is the statistical weight associated with the respondent. It is a floating-point value that indicates the number of people in the U.S. population this respondent represents.

Some of the variables are recodes, which means that they are calculated from the raw data rather than collected directly. For instance, you could do:

In [5]:
df['totalwgt_lb'] = df.birthwgt_lb + df.birthwgt_oz / 16.0

if it had not already been calculated.

Note that dot notation (as in `df.birthwgt_lb`) can be used to read an existing column, but not to add a new one: assigning to a new name with dot notation sets a plain Python attribute on the dataframe object rather than creating a column, which leads to confusion.
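
The difference between the two kinds of assignment can be seen in the following sketch. The dataframe is made up, and `totalwgt_kg` is a hypothetical name introduced only for this illustration.

```python
import warnings

import pandas as pd

# Made-up weights for two babies.
toy = pd.DataFrame({"birthwgt_lb": [7.0, 8.0], "birthwgt_oz": [8.0, 4.0]})

# Bracket assignment creates a real column.
toy["totalwgt_lb"] = toy.birthwgt_lb + toy.birthwgt_oz / 16.0

# Attribute assignment with a new name only sets a Python attribute.
# Recent pandas versions emit a UserWarning here; it is silenced for the demo.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    toy.totalwgt_kg = toy["totalwgt_lb"] * 0.45359237  # hypothetical name

print(list(toy.columns))
print("totalwgt_kg" in toy.columns)
```

Only `totalwgt_lb` shows up among the columns; the attribute is invisible to operations that work on columns.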

### Validation

One way to validate data is to compute basic statistics. In our case, we can look at how the outcome of each pregnancy is encoded and count the number of times each value appears using the `value_counts()` method.

In [6]:
df.outcome.value_counts().sort_index()

1    9148
2    1862
3     120
4    1921
5     190
6     352
Name: outcome, dtype: int64

The encoding for this variable is as follows:

| value | label |
|----|----|
| 1 | LIVE BIRTH |
| 2 | INDUCED ABORTION |
| 3 | STILLBIRTH |
| 4 | MISCARRIAGE |
| 5 | ECTOPIC PREGNANCY |
| 6 | CURRENT PREGNANCY |
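
To make the counts easier to read, the codes can be replaced by their labels. The sketch below copies the labels from the table above and applies them to a made-up `outcome` series; `OUTCOME_LABELS` is a name chosen for this example, not part of the dataset.

```python
import pandas as pd

# Labels copied from the encoding table above.
OUTCOME_LABELS = {
    1: "LIVE BIRTH",
    2: "INDUCED ABORTION",
    3: "STILLBIRTH",
    4: "MISCARRIAGE",
    5: "ECTOPIC PREGNANCY",
    6: "CURRENT PREGNANCY",
}

# Made-up outcome codes, not the NSFG data.
outcome = pd.Series([1, 1, 4, 2, 1, 4], name="outcome")

counts = outcome.value_counts().sort_index()     # index = code, value = count
labelled = counts.rename(index=OUTCOME_LABELS)   # swap codes for labels
print(labelled)
```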

We can do something similar with the weight of the newborn babies, here the pounds part:

In [7]:
df.birthwgt_lb.value_counts().sort_index()

0.0        8
1.0       40
2.0       53
3.0       98
4.0      229
5.0      697
6.0     2223
7.0     3049
8.0     1889
9.0      623
10.0     132
11.0      26
12.0      10
13.0       3
14.0       3
15.0       1
Name: birthwgt_lb, dtype: int64

Suppose we are told that the 15-pound value is a data-entry error and that the true weight of that baby is unknown. pandas provides an easy way to mark it as missing:

In [8]:
import numpy as np
df.loc[df.birthwgt_lb == 15.0, 'birthwgt_lb'] = np.nan
df.birthwgt_lb.value_counts().sort_index()

0.0        8
1.0       40
2.0       53
3.0       98
4.0      229
5.0      697
6.0     2223
7.0     3049
8.0     1889
9.0      623
10.0     132
11.0      26
12.0      10
13.0       3
14.0       3
Name: birthwgt_lb, dtype: int64

The `.loc` attribute provides several ways to select rows and columns from a dataframe. The first expression in brackets is the row indexer, whereas the second selects the column.
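
The same pattern is shown step by step below on a made-up dataframe, where 15.0 plays the role of the suspect value. Because `value_counts()` skips missing values by default, the sketch passes `dropna=False` to confirm that the value became `NaN` rather than disappearing.

```python
import numpy as np
import pandas as pd

# Made-up weights, not the NSFG data.
toy = pd.DataFrame({"caseid": [1, 2, 3], "birthwgt_lb": [7.0, 15.0, 6.0]})

# A boolean Series works as the row indexer: True marks the rows to select.
suspect = toy.birthwgt_lb == 15.0

# Row indexer first, column label second; the assignment modifies toy in place.
toy.loc[suspect, "birthwgt_lb"] = np.nan

# dropna=False keeps NaN in the tally so the replacement is visible.
counts = toy.birthwgt_lb.value_counts(dropna=False).sort_index()
print(counts)
```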

### Interpretation

Let's transform our data so that we can see more explicitly how many pregnancies each respondent had.

We will use `defaultdict`, a container type from the standard-library module `collections`. It behaves like a regular `dict`, except that accessing a key that does not exist creates an entry for it automatically, using a default factory to build the initial value (here `list`, which produces an empty list).

In [12]:
import collections

# Map each respondent's caseid to the row indices of her pregnancy records.
d1 = collections.defaultdict(list)
for index, caseid in df.caseid.items():
    d1[caseid].append(index)
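
To see what `defaultdict` adds over a regular `dict`, here is a minimal sketch that runs the same loop on a made-up list of `(row index, caseid)` pairs standing in for `df.caseid.items()`. The names `records`, `plain`, and `rows_by_case` are chosen for this example only.

```python
import collections

# Made-up (row index, caseid) pairs, not the NSFG data.
records = [(0, 1), (1, 1), (2, 2), (3, 2), (4, 2)]

plain = {}
try:
    plain[1].append(0)      # a regular dict raises KeyError on a missing key
except KeyError:
    print("plain dict: key 1 does not exist yet")

rows_by_case = collections.defaultdict(list)
for index, caseid in records:
    # A missing key is created on first access, with list() as its value.
    rows_by_case[caseid].append(index)

print(dict(rows_by_case))
```

Note that even reading a missing key from a `defaultdict` inserts it, so a lookup of an unknown `caseid` would silently add an empty list.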

There is one respondent worth our attention: `10229`. Let's see in which rows (i.e. in which pregnancy records) she appears and what the outcomes were:

In [13]:
d1[10229]
df.outcome[d1[10229]]

11093    4
11094    4
11095    4
11096    4
11097    4
11098    4
11099    1
Name: outcome, dtype: int64

If we go back to our encoding table we will see that this is a remarkable case in which the woman had six consecutive miscarriages and, finally, a live birth.
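
A similar per-respondent view can also be obtained with pandas' `groupby`, which the walkthrough above does not use. The following is an alternative sketch on made-up records, not a replacement for the `defaultdict` approach.

```python
import pandas as pd

# Made-up records: respondent 1 has two pregnancies, respondent 2 has three.
toy = pd.DataFrame({
    "caseid":  [1, 1, 2, 2, 2],
    "outcome": [1, 1, 4, 4, 1],
})

# Number of pregnancy records per respondent.
n_pregnancies = toy.groupby("caseid").size()

# Outcome codes per respondent, in row order.
outcomes = toy.groupby("caseid")["outcome"].apply(list)

print(n_pregnancies)
print(outcomes)
```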