## DSC 530 Data Exploration and Analysis 
* 3.2 Exercise: Preparing for Exploratory

# Chapter 1

Examples and Exercises from Think Stats, 2nd Edition

http://thinkstats2.com

Copyright 2016 Allen B. Downey

MIT License: https://opensource.org/licenses/MIT


In [3]:
from os.path import basename, exists


def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local)


download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkstats2.py")
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkplot.py")

In [4]:
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/nsfg.py")

download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemPreg.dct")
download(
    "https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemPreg.dat.gz"
)

## Exercise 1-1

*Not required, per the Colab example.*

## Exercise 1-2

In [6]:
# Import the nsfg module and load the pregnancy data into a DataFrame
import nsfg
preg = nsfg.ReadFemPreg()

In [7]:
# Load the respondent data into a second DataFrame
resp = nsfg.ReadFemResp()

In [9]:
# Count how many respondents report each value of pregnum, sorted by value
resp.pregnum.value_counts().sort_index()

0     2610
1     1267
2     1432
3     1110
4      611
5      305
6      150
7       80
8       40
9       21
10       9
11       3
12       2
14       2
19       1
Name: pregnum, dtype: int64

In [25]:
# Select the respondent with caseid 2298 from the resp DataFrame
resp[resp.caseid == 2298]

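| | caseid | rscrinf | rdormres | rostscrn | rscreenhisp | rscreenrace | age_a | age_r | cmbirth | agescrn | ... | pubassis_i | basewgt | adj_mod_basewgt | finalwgt | secu_r | sest | cmintvw | cmlstyr | screentime | intvlngth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2298 | 1 | 5 | 5 | 1 | 5.0 | 27 | 27 | 902 | 27 | ... | 0 | 3247.916977 | 5123.759559 | 5556.717241 | 2 | 18 | 1234 | 1222 | 18:26:36 | 110.492667 |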


In [23]:
# Select every pregnancy record for caseid 2298 from the preg DataFrame
preg[preg.caseid == 2298]

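| | caseid | pregordr | howpreg_n | howpreg_p | moscurrp | nowprgdk | pregend1 | pregend2 | nbrnaliv | multbrth | ... | laborfor_i | religion_i | metro_i | basewgt | adj_mod_basewgt | finalwgt | secu_p | sest | cmintvw | totalwgt_lb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2610 | 2298 | 1 | | | | | 6.0 | | 1.0 | | ... | 0 | 0 | 0 | 3247.916977 | 5123.759559 | 5556.717241 | 2 | 18 | | 6.875 |
| 2611 | 2298 | 2 | | | | | 6.0 | | 1.0 | | ... | 0 | 0 | 0 | 3247.916977 | 5123.759559 | 5556.717241 | 2 | 18 | | 5.5 |
| 2612 | 2298 | 3 | | | | | 6.0 | | 1.0 | | ... | 0 | 0 | 0 | 3247.916977 | 5123.759559 | 5556.717241 | 2 | 18 | | 4.1875 |
| 2613 | 2298 | 4 | | | | | 6.0 | | 1.0 | | ... | 0 | 0 | 0 | 3247.916977 | 5123.759559 | 5556.717241 | 2 | 18 | | 6.875 |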


In [15]:
# Look up pregnum for caseid 2298 in resp; it should match the number of preg rows for that caseid
resp.query('caseid == 2298')['pregnum']

0    4
Name: pregnum, dtype: int64
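
The cross-check between the two data frames can be made explicit by counting pregnancy records per respondent and comparing the count with `pregnum`. The following is a minimal sketch in plain Python rather than pandas. The values are copied by hand from the outputs above for `caseid` 2298, and the names `preg_caseids`, `resp_pregnum`, and `records_per_case` are hypothetical stand-ins, not part of `nsfg.py`.

```python
from collections import Counter

# Stand-in data copied from the outputs above: caseid 2298 has four
# rows in preg, and resp reports pregnum == 4 for the same caseid.
preg_caseids = [2298, 2298, 2298, 2298]
resp_pregnum = {2298: 4}

# Count pregnancy records per respondent.
records_per_case = Counter(preg_caseids)

# Compare the reported count with the number of records found.
for caseid, pregnum in resp_pregnum.items():
    found = records_per_case[caseid]
    status = "consistent" if found == pregnum else "mismatch"
    print(caseid, pregnum, found, status)
```

With the real data frames, the same idea would presumably be expressed as `len(preg[preg.caseid == 2298])` compared against the `pregnum` value returned above.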

# Chapter 2

In [18]:
import numpy as np

## Exercise 2-1

Based on the results in this chapter, suppose you were asked to summarize what you learned about whether first babies arrive late.

Which summary statistics would you use if you wanted to get a story on the evening news?
Which ones would you use if you wanted to reassure an anxious patient?
Finally, imagine that you are Cecil Adams, author of The Straight Dope (http://straightdope.com), and your job is to answer the question, “Do first babies arrive late?”

Write a paragraph that uses the results in this chapter to answer the question clearly, precisely, and honestly.

*Based on the results in this chapter, first babies do not appear to arrive notably early or late. The mean pregnancy length for first babies is 38.6 weeks, while the mean for all other babies is 38.5 weeks. The summary statistic most likely to make a story on the evening news is the variance of pregnancy length: a value of about seven (in squared weeks) sounds, at face value, like a big swing in either direction. To reassure an anxious patient, the standard deviation is a much better summary statistic to showcase. A standard deviation of 2.7 weeks could ease worries about delivering either early or late.*

*Ultimately, the answer to the question “Do first babies arrive late?” is that, based on these results, they do not. The mean pregnancy length for first babies is 38.6 weeks, while the mean for all others is 38.52 weeks, a difference of 0.08 weeks. This difference is negligible.*
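
To make the size of that difference concrete, the gap in weeks can be converted to hours. This is a small arithmetic sketch using the rounded means quoted in the answer above. The variable names are illustrative only.

```python
# Mean pregnancy lengths in weeks, as quoted in the answer above.
mean_firsts = 38.6
mean_others = 38.52

HOURS_PER_WEEK = 7 * 24

# Difference between the two means, in weeks and in hours.
diff_weeks = mean_firsts - mean_others
diff_hours = diff_weeks * HOURS_PER_WEEK

print(f"{diff_weeks:.2f} weeks is about {diff_hours:.1f} hours")
```

A difference of roughly half a day is small next to a standard deviation of 2.7 weeks, which supports calling it negligible.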

## Exercise 2-4

Using the variable *totalwgt_lb*, investigate whether first babies are lighter or heavier than others. Compute Cohen's *d* to quantify the difference between the two groups. How does it compare to the difference in pregnancy length?

In [17]:
def CohenEffectSize(group1, group2):
    """Computes Cohen's effect size for two groups.

    group1: Series or DataFrame
    group2: Series or DataFrame

    returns: float if the arguments are Series;
             Series if the arguments are DataFrames
    """
    diff = group1.mean() - group2.mean()

    var1 = group1.var()
    var2 = group2.var()
    n1, n2 = len(group1), len(group2)

    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
    d = diff / np.sqrt(pooled_var)
    return d

*The function above was provided by the author to calculate Cohen's effect size.*
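
The function divides the difference in group means by the square root of a pooled variance, where each group's variance is weighted by its size. As a sanity check of that arithmetic, here is a minimal standard-library sketch on made-up toy values. The name `cohen_effect_size` and the sample lists are hypothetical. It assumes `statistics.variance`, which uses the n - 1 denominator, is a fair stand-in for the default pandas `.var()`.

```python
import math
import statistics


def cohen_effect_size(group1, group2):
    """Cohen's d with the same pooled-variance weighting as the function above."""
    diff = statistics.mean(group1) - statistics.mean(group2)

    # Sample variances (n - 1 denominator).
    var1 = statistics.variance(group1)
    var2 = statistics.variance(group2)
    n1, n2 = len(group1), len(group2)

    # Weight each variance by its group size, as in CohenEffectSize.
    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
    return diff / math.sqrt(pooled_var)


# Toy data: means 2 and 3, each with sample variance 1.
group1 = [1, 2, 3]
group2 = [2, 3, 4]
d = cohen_effect_size(group1, group2)
print(d)  # -1.0
```

Here the means differ by one unit and the pooled standard deviation is also one, so the first group sits exactly one standard deviation below the second.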

In [19]:
# Make sure the preg and live DataFrames are defined
preg = nsfg.ReadFemPreg()
live = preg[preg.outcome == 1]

# Split live births into first babies and all other babies
firsts = live[live.birthord == 1]
others = live[live.birthord != 1]

In [20]:
# Mean birth weight in pounds for first babies and for all other babies
firsts.totalwgt_lb.mean(), others.totalwgt_lb.mean()

(7.201094430437772, 7.325855614973262)

In [21]:
# Cohen's d for birth weight, computed with the author's function
CohenEffectSize(firsts.totalwgt_lb, others.totalwgt_lb)

-0.088672927072602

In [22]:
# Cohen's d for pregnancy length, computed with the same function
CohenEffectSize(firsts.prglngth, others.prglngth)

0.028879044654449883

How does the Cohen effect size compare between totalwgt_lb and prglngth?

The Cohen effect size for total birth weight, first babies versus others, is negative. This indicates that the mean weight of all other babies is higher than the mean weight of first babies. First babies appear to be lighter than others, but only by a very small amount: the difference in means is about 0.12 lb.

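The Cohen effect size for pregnancy length, first babies versus others, is positive. This indicates that the mean pregnancy length for first babies is greater than the mean pregnancy length for all others. In magnitude, the weight effect (about 0.089 standard deviations) is roughly three times the pregnancy-length effect (about 0.029 standard deviations), but both are small.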
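
The comparison can be checked with a few lines of arithmetic on the values printed in the cells above. This is a sketch only. The numbers are copied by hand from those outputs, and the variable names are illustrative.

```python
# Values copied from the cell outputs above.
mean_wgt_firsts = 7.201094430437772
mean_wgt_others = 7.325855614973262
d_weight = -0.088672927072602
d_length = 0.028879044654449883

OUNCES_PER_POUND = 16

# Raw difference in mean birth weight, in pounds and ounces.
diff_lb = mean_wgt_others - mean_wgt_firsts
diff_oz = diff_lb * OUNCES_PER_POUND

# Relative magnitude of the two effect sizes, ignoring sign.
ratio = abs(d_weight) / abs(d_length)

print(f"weight difference: {diff_lb:.3f} lb ({diff_oz:.1f} oz)")
print(f"|d| ratio, weight vs. length: {ratio:.1f}")
```

The sign of each effect size only records which group has the larger mean. The absolute values are what indicate how far apart the groups are in standard-deviation units.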