In [1]:
import pandas as pd
import ts_code.nsfg as nsfg

## Chapter 01 - Exploratory data analysis

The goal is to use data combined with practical methods to answer questions and guide decisions under uncertainty.

Sometimes a question is answered with **anecdotal evidence.** In casual conversation this is fine, but it usually fails as persuasive or reliable evidence for one (or more) of the reasons below:

+ **Small number of observations:** If pregnancy length is longer for first babies, the difference is probably small compared to natural variation. In that case, we might have to compare a large number of pregnancies to be sure that a difference exists.

+ **Selection bias:** People who join a discussion of this question might be interested because their first babies were late. In that case the process of selecting data would bias the results.

+ **Confirmation bias:** People who believe the claim might be more likely to contribute examples that confirm it. People who doubt the claim are more likely to cite counterexamples.

+ **Inaccuracy:** Anecdotes are often personal stories, and often misremembered, misrepresented, repeated inaccurately, etc.

To address these limitations, let's use the tools of statistics:

+ **Data collection:** We will use data from a large national survey that was designed explicitly with the goal of generating statistically valid inferences about the U.S. population.

+ **Descriptive statistics:** We will generate statistics that summarize the data concisely, and evaluate different ways to visualize data.

+ **Exploratory data analysis:** We will look for patterns, differences, and other features that address the questions we are interested in. At the same time we will check for inconsistencies and identify limitations.

+ **Estimation:** We will use data from a sample to estimate characteristics of the general population.

+ **Hypothesis testing:** Where we see apparent effects, like a difference between two groups, we will evaluate whether the effect might have happened by chance.

In this book we will use the National Survey of Family Growth (NSFG) conducted by the CDC.

The NSFG is a **cross-sectional study**, which means that it captures a snapshot of a group at a point in time. 

The *most common alternative* is a **longitudinal study**, which observes a group repeatedly over a period of time.

Each deployment of the NSFG is called a **cycle**. We will be using data from *Cycle 6*, which was conducted from January 2002 to March 2003. (*I might see if I can use updated data*)

The goal of the survey is to draw conclusions about a **population**; the
target population of the NSFG is people in the United States aged 15-44.

Ideally data would be collected from every member of the population, but that's seldom possible. Instead, data is collected from a subset of the population called a **sample**. The participants are called **respondents**.

Cross-sectional studies are meant to be **representative**, which
means that every member of the target population has an equal chance of
participating.

The NSFG is not representative; instead it is deliberately **oversampled**. The
designers of the study recruited three groups (Hispanics, African-Americans
and teenagers) at rates higher than their representation in the U.S. population,
in order to make sure that the number of respondents in each of these
groups is large enough to draw valid statistical inferences.

When dealing with real world data, you often need to:
+ check for errors
+ deal with special values
+ convert data into different formats
+ perform calculations

These operations are called **data cleaning**.

In the nsfg module, there is a cleaning function called *CleanFemPreg*. The code below follows the book's version, with comments on what each step does.

In [2]:
import numpy as np

def CleanFemPreg(df):
    # agepreg holds the mother's age in centiyears;
    # dividing by 100.0 gives a floating point age in years
    df.agepreg /= 100.0

    # birthwgt_lb and birthwgt_oz contain the weight of the baby
    # and include the following special codes:
    # 97 NOT ASCERTAINED
    # 98 REFUSED
    # 99 DONT KNOW
    # we don't want special codes treated as numbers; it can lead us
    # to think there's a 99 pound baby
    na_vals = [97, 98, 99]
    # replace every value found in na_vals with numpy's not-a-number value;
    # assigning the result back (rather than inplace=True on the column)
    # updates the DataFrame under pandas copy-on-write as well
    df['birthwgt_lb'] = df.birthwgt_lb.replace(na_vals, np.nan)
    df['birthwgt_oz'] = df.birthwgt_oz.replace(na_vals, np.nan)

    # finally, add a column with the baby's total weight in pounds (float)
    df['totalwgt_lb'] = df.birthwgt_lb + df.birthwgt_oz / 16.0
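
To see what the cleaning step does, here is a minimal sketch on a made-up three-row DataFrame. The values are illustrative, not NSFG data, and the name `toy` is only for this example. It applies the same three steps as the cleaning function: convert centiyears to years, turn the special codes into NaN, and combine pounds and ounces.

```python
import numpy as np
import pandas as pd

# Toy rows, not NSFG data: ages in centiyears, weights with special codes.
toy = pd.DataFrame({
    'agepreg': [2575.0, 3100.0, 1950.0],
    'birthwgt_lb': [7, 99, 6],
    'birthwgt_oz': [8, 0, 98],
})

# Centiyears to years.
toy['agepreg'] = toy['agepreg'] / 100.0

# Special codes become NaN so they cannot be mistaken for real weights.
na_vals = [97, 98, 99]
for col in ['birthwgt_lb', 'birthwgt_oz']:
    # where() keeps values that pass the condition and fills the rest with NaN
    toy[col] = toy[col].where(~toy[col].isin(na_vals), np.nan)

# Total weight in pounds; NaN in either part gives NaN in the total.
toy['totalwgt_lb'] = toy['birthwgt_lb'] + toy['birthwgt_oz'] / 16.0
print(toy)
```

Only the first toy row keeps a total weight (7.5 pounds); the other two rows each contain a special code, so their totals are NaN rather than a misleading number.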

In [3]:
preg2002 = nsfg.ReadFemPreg()

In [4]:
preg2002.shape

(13593, 244)

Errors can be introduced when data is exported from one environment and imported into another. When getting familiar with a new dataset, one of the first steps is to **validate** the data.

One way to validate data is to compute basic statistics and compare them
with published results. The NSFG codebook includes tables
that summarize each variable. Here is the table for outcome, which encodes
the outcome of each pregnancy:

|Value | Label | Total |
| :---:| :-------:|:------:|
|1| LIVE BIRTH| 9148 |
|2| INDUCED ABORTION| 1862 |
|3| STILLBIRTH| 120 |
|4| MISCARRIAGE| 1921 |
|5| ECTOPIC PREGNANCY| 190 |
|6| CURRENT PREGNANCY| 352 |

In [5]:
preg2002['outcome'].value_counts(sort=False)

1    9148
2    1862
3     120
4    1921
5     190
6     352
Name: outcome, dtype: int64
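
Rather than comparing the two listings by eye, the check can be done in code. The sketch below is one possible way to do it: the codebook totals and the observed counts are typed in by hand from the listings above, and the names `codebook`, `observed`, and `mismatches` are just for this example. In the notebook, `observed` would come from `preg2002['outcome'].value_counts()`.

```python
import pandas as pd

# Totals from the codebook table for outcome.
codebook = pd.Series({1: 9148, 2: 1862, 3: 120, 4: 1921, 5: 190, 6: 352})

# Counts as printed by value_counts(); typed in here so the sketch runs alone.
observed = pd.Series({1: 9148, 2: 1862, 3: 120, 4: 1921, 5: 190, 6: 352})

# Align on the outcome code and keep only the codes whose counts disagree.
diff = observed.reindex(codebook.index, fill_value=0) - codebook
mismatches = diff[diff != 0]

print(mismatches.empty)   # True when every count matches the codebook
print(observed.sum())     # total number of pregnancy records
```

The counts also sum to 13593, which agrees with the number of rows reported by `preg2002.shape`.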

Let's check the values against the published table for the birth weight in pounds:

|Value | Label | Total |
| :---:| :-------:|:------:|
|.| INAPPLICABLE| 4449 |
|0-5| UNDER 6 POUNDS| 1125 |
|6| 6 POUNDS| 2223 |
|7| 7 POUNDS| 3049 |
|8| 8 POUNDS| 1889 |
|9-95| 9 POUNDS OR MORE| 799 |

In [6]:
preg2002.birthwgt_lb.value_counts(sort=False).sort_index() 

0.0        8
1.0       40
2.0       53
3.0       98
4.0      229
5.0      697
6.0     2223
7.0     3049
8.0     1889
9.0      623
10.0     132
11.0      26
12.0      10
13.0       3
14.0       3
15.0       1
Name: birthwgt_lb, dtype: int64
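
The codebook groups weights into ranges, so the per-pound counts have to be binned before they can be compared. Here is a minimal sketch of that step using `pd.cut`; the counts are typed in from the output above, and the bin edges and labels are my own choice to mirror the codebook rows.

```python
import pandas as pd

# Counts per pound as printed above (index = pounds, value = number of records).
counts = pd.Series(
    [8, 40, 53, 98, 229, 697, 2223, 3049, 1889, 623, 132, 26, 10, 3, 3, 1],
    index=range(16),
)

# Codebook ranges: 0-5, 6, 7, 8, and 9-95 pounds.
# pd.cut uses right-inclusive intervals: (-1, 5], (5, 6], (6, 7], (7, 8], (8, 95].
bins = [-1, 5, 6, 7, 8, 95]
labels = ['0-5', '6', '7', '8', '9-95']
groups = pd.cut(counts.index.to_numpy(), bins=bins, labels=labels)

# Sum the per-pound counts within each codebook range.
binned = counts.groupby(groups, observed=False).sum()
print(binned)
```

The first four bins come out to 1125, 2223, 3049, and 1889, matching the codebook. The last bin sums to 798 from the counts shown here, against 799 in the codebook, so there is a one-record gap worth tracing, for instance to a value that the cleaning step treats as missing.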

Let's look at the outcomes for a few respondents. Each respondent is assigned their own caseid, but the rows for a given caseid are not guaranteed to be grouped together, so we need to map each caseid to the list of row indices where it appears.

In [7]:
from collections import defaultdict
def MakePregMap(df):
    """Takes in the preg dataframe
    returns a dictionary with caseid values as keys
    and their index as values in a list"""
    d = defaultdict(list)
    for index, caseid in df.caseid.items():  # items() replaces the removed iteritems()
        d[caseid].append(index)
    return d
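
The same mapping can also be built with pandas itself. The sketch below is an alternative, not the book's approach: it uses `groupby(...).groups` on a small made-up frame (the names `toy` and `toy_map` are only for this example) in which one caseid appears in non-adjacent rows.

```python
import pandas as pd

# Toy frame, not NSFG data: caseid 2 appears in rows 1 and 4.
toy = pd.DataFrame({'caseid': [1, 2, 1, 3, 2], 'outcome': [1, 4, 1, 2, 1]})

# groups maps each caseid to the index labels of its rows.
groups = toy.groupby('caseid').groups

# Convert each Index of labels to a plain list, like MakePregMap returns.
toy_map = {caseid: labels.tolist() for caseid, labels in groups.items()}
print(toy_map)
```

Unlike the `defaultdict` version, looking up a caseid that is not in `toy_map` raises a `KeyError` instead of returning an empty list.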

In [8]:
preg_map = MakePregMap(preg2002)

In [9]:
# let's look at the indices for caseid=10229
preg_map[10229]  # -> indices in our df with information about this woman
# seven indices means she was pregnant 7 times

[11093, 11094, 11095, 11096, 11097, 11098, 11099]

In [10]:
#let's get this information from the dataframe
caseid = 10229
indices = preg_map[caseid] #value we pulled above
#let's look only at the outcome columns
preg2002['outcome'][indices].values 
#values gives us the value of the cell without the index as an np.array

array([4, 4, 4, 4, 4, 4, 1])

The outcome code 1 indicates a live birth and 4 indicates a miscarriage. Statistically this is not unusual, but if we think of the context behind the numbers, we can see a story. This woman was pregnant six times, each ending in a miscarriage, before her seventh and most recent pregnancy ended with a live birth.

Each record in this dataset represents a person who gave honest answers to difficult personal questions.

We can use this data to answer statistical questions about family life, reproduction, and health. At the same time, we have an obligation to consider the people represented by the data, and to afford them respect and gratitude.

### Exercises

**Exercise 1.1** - *See chap01ex.ipynb*.

**Exercise 1.2** In the repository you downloaded, you should find a file named
chap01ex.py; using this file as a starting place, write a function that reads
the respondent file, 2002FemResp.dat.gz.

The variable pregnum is a recode that indicates how many times each respondent has been pregnant. Print the value counts for this variable and compare them to the published results in the NSFG codebook.

You can also cross-validate the respondent and pregnancy files by comparing pregnum for each respondent with the number of records in the pregnancy file.

You can use nsfg.MakePregMap to make a dictionary that maps from each caseid to a list of indices into the pregnancy DataFrame.

In [11]:
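# code added to chap01ex.py below
# (assumes thinkstats2 is already imported at the top of chap01ex.py)
def read_FemResp(dct_file='ts_code/2002FemResp.dct',
                 dat_file='ts_code/2002FemResp.dat.gz'):
    """Reads the NSFG respondent file and returns a DataFrame."""
    dct = thinkstats2.ReadStataDct(dct_file)
    df = dct.ReadFixedWidth(dat_file, compression='gzip')
    return df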

In [12]:
import ts_code.chap01ex as chap01

In [13]:
df = chap01.read_FemResp() 

In [14]:
df.head()

Unnamed: 0,caseid,rscrinf,rdormres,rostscrn,rscreenhisp,rscreenrace,age_a,age_r,cmbirth,agescrn,...,pubassis_i,basewgt,adj_mod_basewgt,finalwgt,secu_r,sest,cmintvw,cmlstyr,screentime,intvlngth
0,2298,1,5,5,1,5.0,27,27,902,27,...,0,3247.916977,5123.759559,5556.717241,2,18,1234,1222,18:26:36,110.492667
1,5012,1,5,1,5,5.0,42,42,718,42,...,0,2335.279149,2846.79949,4744.19135,2,18,1233,1221,16:30:59,64.294
2,11586,1,5,1,5,5.0,43,43,708,43,...,0,2335.279149,2846.79949,4744.19135,2,18,1234,1222,18:19:09,75.149167
3,6794,5,5,4,1,5.0,15,15,1042,15,...,0,3783.152221,5071.464231,5923.977368,2,18,1234,1222,15:54:43,28.642833
4,616,1,5,4,1,5.0,20,20,991,20,...,0,5341.329968,6437.335772,7229.128072,2,18,1233,1221,14:19:44,69.502667


Values from the codebook:

|Value | Label | Total |
| :---:| :-------:|:------:|
|0| NONE| 2610 |
|1| 1 PREGNANCY| 1267 |
|2| 2 PREGNANCIES| 1432 |
|3| 3 PREGNANCIES| 1110 |
|4| 4 PREGNANCIES| 611 |
|5| 5 PREGNANCIES| 305 |
|6| 6 PREGNANCIES| 150 |
|7-95| 7 OR MORE PREGNANCIES| 158 |
|-|**Total**| **7643** |

In [15]:
df.pregnum.value_counts().sort_index()

0     2610
1     1267
2     1432
3     1110
4      611
5      305
6      150
7       80
8       40
9       21
10       9
11       3
12       2
14       2
19       1
Name: pregnum, dtype: int64

In [16]:
df.pregnum.value_counts().sum()

7643

In [17]:
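for caseid, rows in preg_map.items():
    # pregnum recorded for this respondent in the respondent file
    pregnum = df[df.caseid == caseid].pregnum.values[0]
    # it should equal the number of records in the pregnancy file
    if len(rows) != pregnum:
        print('Oh no!')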

No printout above, so they all seem to match!
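
One limitation of the loop above is that it only visits caseids that appear in `preg_map`, so a respondent with no pregnancy records is never checked. A vectorized version can cover those respondents too. The sketch below runs on small made-up stand-ins for the two files; the names `resp`, `preg`, `records`, and `mismatched` are only for this example.

```python
import pandas as pd

# Toy stand-ins for the respondent and pregnancy files (not NSFG data).
resp = pd.DataFrame({'caseid': [1, 2, 3], 'pregnum': [2, 0, 1]})
preg = pd.DataFrame({'caseid': [1, 3, 1]})

# Number of pregnancy records per caseid.
records = preg.groupby('caseid').size()

# pregnum per respondent, indexed by caseid.
expected = resp.set_index('caseid')['pregnum']

# Respondents with no pregnancy records get a count of 0.
actual = records.reindex(expected.index, fill_value=0)

# Respondents whose pregnum disagrees with their record count.
mismatched = expected[expected != actual]
print(len(mismatched))
```

With the real data, `resp` would be the respondent DataFrame and `preg` the pregnancy DataFrame; an empty `mismatched` means the two files agree for every respondent.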

**Exercise 1.3** The best way to learn about statistics is to work on a project
you are interested in. Is there a question like, "Do first babies arrive late,"
that you want to investigate?

I have a few, but one of the first ones I started working on is "Does a Warm December Mean Less Snow?" So far, for the Boston metro area, it looks like a resounding no.

### Glossary

**anecdotal evidence:** Evidence, often personal, that is collected casually rather than by a well-designed study.

**population:** A group we are interested in studying. "Population" often refers to a group of people, but the term is used for other subjects, too.

**cross-sectional study:** A study that collects data about a population at a particular point in time.

**cycle:** In a repeated cross-sectional study, each repetition of the study is called a cycle.

**longitudinal study:** A study that follows a population over time, collecting data from the same group repeatedly.

**record:** In a dataset, a collection of information about a single person or other subject.

**respondent:** A person who responds to a survey.

**sample:** The subset of a population used to collect data.

**representative:** A sample is representative if every member of the population has the same chance of being in the sample.

**oversampling:** The technique of increasing the representation of a sub-population in order to avoid errors due to small sample sizes.

**raw data:** Values collected and recorded with little or no checking, calculation or interpretation.

**recode:** A value that is generated by calculation and other logic applied to raw data.

**data cleaning:** Processes that include validating data, identifying errors, translating between data types and representations, etc.