# Introduction

This notebook covers two chapters:

- Chapter 1: Statistical Thinking
- Chapter 2: Descriptive Statistics

The tools of statistics are:

1. Data collection
2. Descriptive statistics
3. Exploratory data analysis
4. Hypothesis testing
5. Estimation

In [5]:
from __future__ import division, print_function
import numpy as np
import pandas as pd
%matplotlib inline

# Chapter 1 - Statistical Thinking

## Data Collection

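The data comes from the National Survey of Family Growth (NSFG), conducted by the US Centers for Disease Control and Prevention (CDC). The NSFG is a cross-sectional study, not a longitudinal one: it captures a snapshot of a group at a single point in time rather than following the same group over time.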



In [None]:
# Count the records (one per line) in each raw data file
with open("../data/2002FemResp.dat", 'rb') as f_resp:
    nbr_resp = sum(1 for _ in f_resp)
with open("../data/2002FemPreg.dat", 'rb') as f_preg:
    nbr_preg = sum(1 for _ in f_preg)

print("# of respondents: %d" % (nbr_resp))
print("# of pregnancies: %d" % (nbr_preg))

In [None]:
pregnancies = pd.read_fwf("../data/2002FemPreg.dat",
                          names=["caseid", "nbrnaliv", "babysex", "birthwgt_lb",
                                 "birthwgt_oz", "prglength", "outcome", "birthord",
                                 "agepreg", "finalwgt"],
                          # (start, end) pairs are 0-based and half-open.
                          # Assumption: the codebook puts birthwgt_lb in columns 57-58
                          # and birthwgt_oz in 59-60 (1-based, inclusive).
                          colspecs=[(0, 12), (21, 22), (55, 56), (56, 58), (58, 60),
                                    (274, 276), (276, 277), (278, 279), (283, 285), (422, 439)])
pregnancies.head()
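
To see how `colspecs` carves up a fixed-width record, here is a minimal sketch on a made-up three-line file. The field names, widths and values are invented for illustration and are not the NSFG layout. Each pair is a half-open `(start, end)` interval with 0-based offsets, so a field documented as columns 5-6 (1-based, inclusive) is written `(4, 6)`.

```python
import io
import pandas as pd

# Hypothetical fixed-width records: id in columns 1-4, weeks in 5-6, outcome in 7.
raw = io.StringIO(
    "0001391\n"
    "0002402\n"
    "0003381\n"
)

toy = pd.read_fwf(
    raw,
    names=["id", "weeks", "outcome"],
    # 0-based, half-open intervals: (start, end) covers characters start..end-1
    colspecs=[(0, 4), (4, 6), (6, 7)],
)
print(toy)
```

Overlapping or off-by-one intervals do not raise an error; they silently produce wrong values, so it is worth spot-checking a few parsed rows against the raw file.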

## More Insights

In [None]:
live_births = pregnancies[pregnancies["outcome"] == 1]
print("# of live births: %d" % len(live_births))

About 9,000 of the roughly 13,500 pregnancies in this dataset resulted in live births.

In [None]:
first_babies = live_births[live_births["birthord"] == 1]
other_babies = live_births[live_births["birthord"] != 1]
print("# first babies: %d" % len(first_babies))
print("# other babies: %d" % len(other_babies))

The live births are split fairly evenly between first babies and other babies.

In [None]:
avg_prglen_first_baby = first_babies["prglength"].mean()
avg_prglen_other_baby = other_babies["prglength"].mean()
print("Average pregnancy length for first baby: %.3f weeks" % (avg_prglen_first_baby))
print("Average pregnancy length for other baby: %.3f weeks" % (avg_prglen_other_baby))
print("Difference: %.3f weeks" % (avg_prglen_first_baby - avg_prglen_other_baby))

Pregnancy lengths for first babies appear to be only slightly longer (by 0.078 weeks) than for other babies. Such a difference is called an __apparent effect__, i.e., there might be something going on, but we are not yet sure.
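
To get a feel for how small this difference is, it helps to convert it to more familiar units. The sketch below simply hard-codes the 0.078 weeks quoted above rather than recomputing it from the data.

```python
# Difference in mean pregnancy length quoted above, in weeks
diff_weeks = 0.078

diff_days = diff_weeks * 7
diff_hours = diff_days * 24
print("%.3f weeks = %.2f days = %.1f hours" % (diff_weeks, diff_days, diff_hours))
```

That works out to roughly 13 hours, which is small compared with a pregnancy of about 39 weeks.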

# Chapter 2 - Descriptive Statistics

### Summary statistics and Distributions

Summary statistics (mean, variance, median, etc.) describe a dataset with a few numbers.

A distribution describes how often each value appears. Distributions are usually represented as histograms (raw frequencies binned into equally spaced blocks). A normalized histogram is called a Probability Mass Function (PMF).

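
The relationship between raw frequencies and a PMF can be sketched in a few lines. The sample below is made up for illustration, and `np.unique` is used here as one convenient way to count values; it is not necessarily how the rest of this notebook builds its distributions.

```python
import numpy as np

# Made-up sample of pregnancy lengths in weeks (illustrative only)
weeks = np.array([38, 39, 39, 39, 40, 40, 41])

# Raw frequencies: how many times each distinct value occurs
values, counts = np.unique(weeks, return_counts=True)

# Dividing by the sample size turns frequencies into probabilities
pmf = counts / counts.sum()

for v, c, p in zip(values, counts, pmf):
    print("week %d: count=%d, p=%.3f" % (v, c, p))
```

The PMF values sum to 1, which makes distributions from samples of different sizes directly comparable.
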
### Distributions - Histogram

In [None]:
prglen_first_babies = np.array(first_babies["prglength"])
prglen_other_babies = np.array(other_babies["prglength"])

# setting up range of histogram and number of bins
first_baby_min_prglen = np.min(prglen_first_babies)
first_baby_max_prglen = np.max(prglen_first_babies)
other_baby_min_prglen = np.min(prglen_other_babies)
other_baby_max_prglen = np.max(prglen_other_babies)
print("first baby preg length min: %d, max: %d" % 
      (first_baby_min_prglen, first_baby_max_prglen))
print("other baby preg length min: %d, max: %d" % 
      (other_baby_min_prglen, other_baby_max_prglen))

bin_lb = min([first_baby_min_prglen, other_baby_min_prglen])
bin_ub = max([first_baby_max_prglen, other_baby_max_prglen])
# One bin per whole week: np.histogram's last bin is closed on the right,
# so extend the range by one to keep the final two weeks in separate bins.
nbr_bins = bin_ub - bin_lb + 1
bin_range = (bin_lb, bin_ub + 1)
print("range:", bin_range, "#-bins:", nbr_bins)

# building the histograms
first_baby_fdist = np.histogram(prglen_first_babies, bins=nbr_bins, range=bin_range)
other_baby_fdist = np.histogram(prglen_other_babies, bins=nbr_bins, range=bin_range)
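
One detail of `np.histogram` matters when bins are meant to be exactly one week wide: every bin is half-open except the last, which also includes its right edge. The sketch below, on made-up integers, compares two ways of choosing the bin count and range.

```python
import numpy as np

data = np.array([1, 2, 3, 3])  # made-up integer values

# bins = max - min: the closed last bin [2, 3] absorbs both 2 and 3
merged_counts, merged_edges = np.histogram(data, bins=2, range=(1, 3))

# bins = max - min + 1 with range (min, max + 1): one bin per integer
split_counts, split_edges = np.histogram(data, bins=3, range=(1, 4))

print(merged_counts, merged_edges)
print(split_counts, split_edges)
```

With the second choice, each left bin edge corresponds to exactly one integer value, so reading a week off the edge array is unambiguous.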

In [None]:
def mode(fdist):
    """Return the left bin edge (week) with the highest count."""
    mode_idx = np.argmax(fdist[0])
    return fdist[1][mode_idx]

def all_modes(fdist):
    """Return (week, count) pairs sorted by count, highest first."""
    counts, edges = fdist
    order = np.argsort(counts)[::-1]
    return [(int(edges[i]), int(counts[i])) for i in order]
    
print("First baby arrival top week (mode): %d" % (mode(first_baby_fdist)))
print("Other baby arrival top week (mode): %d" % (mode(other_baby_fdist)))

print("First baby top 5 frequent weeks:", all_modes(first_baby_fdist)[0:5])
print("Other baby top 5 frequent weeks:", all_modes(other_baby_fdist)[0:5])
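
As a cross-check on the histogram-based mode, the same ranking can be obtained directly from the raw values with the standard library. This is an alternative sketch on made-up data, not the method used in the cell above.

```python
from collections import Counter

# Made-up pregnancy lengths in weeks (illustrative only)
weeks = [39, 40, 39, 38, 39, 40, 41]

# most_common(n) returns (value, frequency) pairs, most frequent first
top = Counter(weeks).most_common(2)
print(top)
```

Because `Counter` works on the exact values, it sidesteps any question about bin edges, which makes it a useful sanity check for integer-valued data such as weeks.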