### Load Relevant Birth Data from Raw CDC files
&copy; 2018-2022 Karl W. Schulz<br>
University of Texas<br>

This notebook parses raw CDC (denominator) text files year by year and imports a subset of the available variables into a pandas dataframe. The resulting dataframe for each year is saved to a pickle file in a `data/` subdirectory for use with subsequent analysis utilities. Births to non-U.S. residents are **not** included. The variables extracted from the raw CDC files are controlled via a companion runtime config file defined below.
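
The parser's log output further down reports each variable as a start index and a length for a given year (for example, `mager` at index 89 with length 2 for 2005, and at index 75 for 2014). A minimal sketch of how such a fixed-width lookup could work is given here. It assumes the indices are 1-based column positions; the `FIELDS` table and the `extract_field` and `parse_record` helpers are hypothetical names for illustration, not the actual format of `config.period` or the implementation in the `cdc` module.

```python
# Hypothetical per-year layout: variable -> (1-based start index, length).
# Values are copied from the parser's log output for 2005 and 2014.
FIELDS = {
    "2005": {"restatus": (138, 1), "mager": (89, 2), "dplural": (423, 1)},
    "2014": {"restatus": (104, 1), "mager": (75, 2), "dplural": (454, 1)},
}


def extract_field(record: str, index: int, length: int) -> str:
    """Return the fixed-width field starting at 1-based column `index`."""
    start = index - 1  # assumption: config indices are 1-based
    return record[start:start + length]


def parse_record(record: str, year: str) -> dict:
    """Extract every configured variable from one raw text record."""
    return {name: extract_field(record, idx, n)
            for name, (idx, n) in FIELDS[year].items()}


# Build a synthetic 2005-style record: blanks everywhere except three fields.
chars = [" "] * 500
chars[88:90] = "27"   # mager occupies columns 89-90
chars[137] = "1"      # restatus occupies column 138
chars[422] = "2"      # dplural occupies column 423
record = "".join(chars)

print(parse_record(record, "2005"))
```

Keeping the layout in a per-year table is what lets the same parsing code handle years in which the column positions differ.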

CDC files are available for download at: https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm#Births. Note that this analysis uses the **period-linked** files from the CDC.
***
### Runtime Input Controls
To use this utility, you first need local copies of the public CDC data for each year you want to parse (note that the CDC provides *zip* files, which must be decompressed after download). The parsing utility assumes that the filenames remain unchanged and that the files are organized in subdirectories named for each calendar year. Once downloaded, update the `dataDir` option in the companion runtime config file (`config.period`) to point to the top-level path of the local files. Finally, update the *years* variable below to list the years to parse. The companion *config.period* input file is configured to accommodate years from 2005 through 2017.

Note: on a typical system, it can take upwards of 10 minutes to load and parse data for a given year. 
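
Because each year takes minutes to parse, it can help to confirm up front that the raw files are where the parser will look for them. The sketch below assumes the layout visible in the log output (`<dataDir>/<year>/VS<yy>LINK...USDENPUB`). The `find_denom_files` and `missing_years` helpers are hypothetical and not part of the companion utilities. File names are not identical in form across years (the 2017 file, for example, appears as `VS17LINK.Public.USDENPUB`), so the check uses a glob pattern rather than an exact name.

```python
from pathlib import Path


def find_denom_files(data_dir, year):
    """Return denominator files found under <data_dir>/<year>/ (may be empty)."""
    yy = year[-2:]
    # Glob rather than exact match: names vary slightly by year (assumption
    # based on the log output, e.g. VS05LINK.USDENPUB, VS17LINK.Public.USDENPUB).
    return sorted((Path(data_dir) / year).glob(f"VS{yy}LINK*USDENPUB*"))


def missing_years(data_dir, years):
    """List the years for which no denominator file was found."""
    return [y for y in years if not find_denom_files(data_dir, y)]


if __name__ == "__main__":
    # './rawCDCdata/period' mirrors the path printed in the log; adjust it to
    # match the dataDir setting in your own config file.
    check_years = ['2005', '2006', '2007']
    print("Years with no denominator file:",
          missing_years('./rawCDCdata/period', check_years))
```

An empty list means every requested year has at least one candidate file.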

In [1]:
# input file describing variable locations in CDC files
configFile='config.period'        
# desired analysis years to load
years = ['2005','2006','2007','2008','2009','2010','2011','2012','2013','2014','2015','2016','2017']

In [2]:
%reload_ext autoreload
%autoreload 2
import utils
import cdc
import argparse
import timeit
import time
import pandas as pd
import pathlib

### Loop over years and load data

In [3]:
# Build the argument parser once and reuse it for every year
parser = argparse.ArgumentParser()
parser.add_argument("--configFile",type=str)
parser.add_argument("--year",type=str)
parser.add_argument("--loglevel",type=str)

# Make sure the output directory exists before the first save
pathlib.Path('data').mkdir(parents=True,exist_ok=True)

for year in years:
    # Setup runtime arguments for this year
    args = parser.parse_args(['--configFile',configFile,'--year',year])
    utils.initLogger(args)
    config = utils.runtimeConfig(args)
    data = cdc.cdc(args,config)

    # Parse the denominator file for this year and time the read
    t0 = time.time()
    datum = data.loadDenominatorFiles(None,config)
    t1 = time.time()
    print("\nTotal time to parse CDC file   = %.3f (secs)" % (t1-t0))

    # Save pandas dataframe to pickle file
    datum.to_pickle('data/parsed' + year + '.pickle')
    print("Total time to save pickle file = %.3f (secs)" % (time.time()-t1))


--
Parsing runtime options...
Using config from file = config.period
Overriding analysis year using command-line option: 2005
   --> years = ['2005']
   --> numRecordsPerYear = -1 

   --> 2005/numeratorFile = ./rawCDCdata/period//2005/VS05LINK.USNUMPUB
   --> 2005/denomFile     = ./rawCDCdata/period//2005/VS05LINK.USDENPUB

--
Initializing raw reads from desired CDC files...

variable         restatus (    int): index for year=2005 =>  138 (len =  1) (flag index = -999)
variable         revision (unknown): index for year=2005 =>    7 (len =  1) (flag index = -999)
variable            mager (    int): index for year=2005 =>   89 (len =  2) (flag index = -999)
variable          dplural (    int): index for year=2005 =>  423 (len =  1) (flag index = -999)

variable       rdmeth_rec (unknown): index for year=2005 =>  401 (len =  1) (flag index =  679)
variable         me_trial (unknown): index for year=2005 =>  394 (len =  1) (flag index =  621)
variable          rf_diab (unknown): index


Total number of raw births read           =  4,145,883
--> # of U.S. births (50 states)          =  4,138,573

Total time to parse CDC file   = 110.784 (secs)
Total time to save pickle file = 13.663 (secs)

--
Parsing runtime options...
Using config from file = config.period
Overriding analysis year using command-line option: 2006
   --> years = ['2006']
   --> numRecordsPerYear = -1 

   --> 2006/numeratorFile = ./rawCDCdata/period//2006/VS06LINK.USNUMPUB
   --> 2006/denomFile     = ./rawCDCdata/period//2006/VS06LINK.USDENPUB

--
Initializing raw reads from desired CDC files...

variable         restatus (    int): index for year=2006 =>  138 (len =  1) (flag index = -999)
variable         revision (unknown): index for year=2006 =>    7 (len =  1) (flag index = -999)
variable            mager (    int): index for year=2006 =>   89 (len =  2) (flag index = -999)
variable          dplural (    int): index for year=2006 =>  423 (len =  1) (flag index = -999)

variable       rdmeth_rec 


Total number of raw births read           =  4,273,264
--> # of U.S. births (50 states)          =  4,265,593

Total time to parse CDC file   = 152.727 (secs)
Total time to save pickle file = 13.108 (secs)

--
Parsing runtime options...
Using config from file = config.period
Overriding analysis year using command-line option: 2007
   --> years = ['2007']
   --> numRecordsPerYear = -1 

   --> 2007/numeratorFile = ./rawCDCdata/period//2007/VS07LINK.USNUMPUB
   --> 2007/denomFile     = ./rawCDCdata/period//2007/VS07LINK.USDENPUB

--
Initializing raw reads from desired CDC files...

variable         restatus (    int): index for year=2007 =>  138 (len =  1) (flag index = -999)
variable         revision (unknown): index for year=2007 =>    7 (len =  1) (flag index = -999)
variable            mager (    int): index for year=2007 =>   89 (len =  2) (flag index = -999)
variable          dplural (    int): index for year=2007 =>  423 (len =  1) (flag index = -999)

variable       rdmeth_rec 


Total number of raw births read           =  4,324,008
--> # of U.S. births (50 states)          =  4,316,233

Total time to parse CDC file   = 174.651 (secs)
Total time to save pickle file = 13.969 (secs)

--
Parsing runtime options...
Using config from file = config.period
Overriding analysis year using command-line option: 2008
   --> years = ['2008']
   --> numRecordsPerYear = -1 

   --> 2008/numeratorFile = ./rawCDCdata/period//2008/VS08LINK.USNUMPUB
   --> 2008/denomFile     = ./rawCDCdata/period//2008/VS08LINK.USDENPUB

--
Initializing raw reads from desired CDC files...

variable         restatus (    int): index for year=2008 =>  138 (len =  1) (flag index = -999)
variable         revision (unknown): index for year=2008 =>    7 (len =  1) (flag index = -999)
variable            mager (    int): index for year=2008 =>   89 (len =  2) (flag index = -999)
variable          dplural (    int): index for year=2008 =>  423 (len =  1) (flag index = -999)

variable       rdmeth_rec 


Total number of raw births read           =  4,255,188
--> # of U.S. births (50 states)          =  4,247,726

Total time to parse CDC file   = 191.968 (secs)
Total time to save pickle file = 13.674 (secs)

--
Parsing runtime options...
Using config from file = config.period
Overriding analysis year using command-line option: 2009
   --> years = ['2009']
   --> numRecordsPerYear = -1 

   --> 2009/numeratorFile = ./rawCDCdata/period//2009/VS09LINK.USNUMPUB
   --> 2009/denomFile     = ./rawCDCdata/period//2009/VS09LINK.USDENPUB

--
Initializing raw reads from desired CDC files...

variable         restatus (    int): index for year=2009 =>  138 (len =  1) (flag index = -999)
variable         revision (unknown): index for year=2009 =>    7 (len =  1) (flag index = -999)
variable            mager (    int): index for year=2009 =>   89 (len =  2) (flag index = -999)
variable          dplural (    int): index for year=2009 =>  423 (len =  1) (flag index = -999)

variable       rdmeth_rec 


Total number of raw births read           =  4,137,836
--> # of U.S. births (50 states)          =  4,130,665

Total time to parse CDC file   = 184.212 (secs)
Total time to save pickle file = 13.315 (secs)

--
Parsing runtime options...
Using config from file = config.period
Overriding analysis year using command-line option: 2010
   --> years = ['2010']
   --> numRecordsPerYear = -1 

   --> 2010/numeratorFile = ./rawCDCdata/period//2010/VS10LINK.USNUMPUB
   --> 2010/denomFile     = ./rawCDCdata/period//2010/VS10LINK.USDENPUB

--
Initializing raw reads from desired CDC files...

variable         restatus (    int): index for year=2010 =>  138 (len =  1) (flag index = -999)
variable         revision (unknown): index for year=2010 =>    7 (len =  1) (flag index = -999)
variable            mager (    int): index for year=2010 =>   89 (len =  2) (flag index = -999)
variable          dplural (    int): index for year=2010 =>  423 (len =  1) (flag index = -999)

variable       rdmeth_rec 


Total number of raw births read           =  4,007,105
--> # of U.S. births (50 states)          =  3,999,386

Total time to parse CDC file   = 196.586 (secs)
Total time to save pickle file = 11.886 (secs)

--
Parsing runtime options...
Using config from file = config.period
Overriding analysis year using command-line option: 2011
   --> years = ['2011']
   --> numRecordsPerYear = -1 

   --> 2011/numeratorFile = ./rawCDCdata/period//2011/VS11LINK.USNUMPUB
   --> 2011/denomFile     = ./rawCDCdata/period//2011/VS11LINK.USDENPUB

--
Initializing raw reads from desired CDC files...

variable         restatus (    int): index for year=2011 =>  138 (len =  1) (flag index = -999)
variable         revision (unknown): index for year=2011 =>    7 (len =  1) (flag index = -999)
variable            mager (    int): index for year=2011 =>   89 (len =  2) (flag index = -999)
variable          dplural (    int): index for year=2011 =>  423 (len =  1) (flag index = -999)

variable       rdmeth_rec 


Total number of raw births read           =  3,961,220
--> # of U.S. births (50 states)          =  3,953,590

Total time to parse CDC file   = 274.030 (secs)
Total time to save pickle file = 12.505 (secs)

--
Parsing runtime options...
Using config from file = config.period
Overriding analysis year using command-line option: 2012
   --> years = ['2012']
   --> numRecordsPerYear = -1 

   --> 2012/numeratorFile = ./rawCDCdata/period//2012/VS12LINK.USNUMPUB
   --> 2012/denomFile     = ./rawCDCdata/period//2012/VS12LINK.USDENPUB

--
Initializing raw reads from desired CDC files...

variable         restatus (    int): index for year=2012 =>  138 (len =  1) (flag index = -999)
variable         revision (unknown): index for year=2012 =>    7 (len =  1) (flag index = -999)
variable            mager (    int): index for year=2012 =>   89 (len =  2) (flag index = -999)
variable          dplural (    int): index for year=2012 =>  423 (len =  1) (flag index = -999)

variable       rdmeth_rec 


Total number of raw births read           =  3,960,796
--> # of U.S. births (50 states)          =  3,952,841

Total time to parse CDC file   = 277.145 (secs)
Total time to save pickle file = 11.605 (secs)

--
Parsing runtime options...
Using config from file = config.period
Overriding analysis year using command-line option: 2013
   --> years = ['2013']
   --> numRecordsPerYear = -1 

   --> 2013/numeratorFile = ./rawCDCdata/period//2013/VS13LINK.USNUMPUB
   --> 2013/denomFile     = ./rawCDCdata/period//2013/VS13LINK.USDENPUB

--
Initializing raw reads from desired CDC files...

variable         restatus (    int): index for year=2013 =>  138 (len =  1) (flag index = -999)
variable         revision (unknown): index for year=2013 =>    7 (len =  1) (flag index = -999)
variable            mager (    int): index for year=2013 =>   89 (len =  2) (flag index = -999)
variable          dplural (    int): index for year=2013 =>  423 (len =  1) (flag index = -999)

variable       rdmeth_rec 


Total number of raw births read           =  3,940,764
--> # of U.S. births (50 states)          =  3,932,181

Total time to parse CDC file   = 281.142 (secs)
Total time to save pickle file = 12.609 (secs)

--
Parsing runtime options...
Using config from file = config.period
Overriding analysis year using command-line option: 2014
   --> years = ['2014']
   --> numRecordsPerYear = -1 

   --> 2014/numeratorFile = ./rawCDCdata/period//2014/VS14LINK.USNUMPUB
   --> 2014/denomFile     = ./rawCDCdata/period//2014/VS14LINK.USDENPUB

--
Initializing raw reads from desired CDC files...

variable         restatus (    int): index for year=2014 =>  104 (len =  1) (flag index = -999)
variable         revision (unknown): index for year=2014 =>    7 (len =  1) (flag index = -999)
variable            mager (    int): index for year=2014 =>   75 (len =  2) (flag index = -999)
variable          dplural (    int): index for year=2014 =>  454 (len =  1) (flag index = -999)

variable       rdmeth_rec 


Total number of raw births read           =  3,998,175
--> # of U.S. births (50 states)          =  3,988,076

Total time to parse CDC file   = 358.346 (secs)
Total time to save pickle file = 12.489 (secs)

--
Parsing runtime options...
Using config from file = config.period
Overriding analysis year using command-line option: 2015
   --> years = ['2015']
   --> numRecordsPerYear = -1 

   --> 2015/numeratorFile = ./rawCDCdata/period//2015/VS15LINK.USNUMPUB.modified
   --> 2015/denomFile     = ./rawCDCdata/period//2015/VS15LINK.USDENPUB

--
Initializing raw reads from desired CDC files...

variable         restatus (    int): index for year=2015 =>  104 (len =  1) (flag index = -999)
variable         revision (unknown): index for year=2015 =>    7 (len =  1) (flag index = -999)
variable            mager (    int): index for year=2015 =>   75 (len =  2) (flag index = -999)
variable          dplural (    int): index for year=2015 =>  454 (len =  1) (flag index = -999)

variable       rd


Total number of raw births read           =  3,988,733
--> # of U.S. births (50 states)          =  3,978,497

Total time to parse CDC file   = 333.978 (secs)
Total time to save pickle file = 12.585 (secs)

--
Parsing runtime options...
Using config from file = config.period
Overriding analysis year using command-line option: 2016
   --> years = ['2016']
   --> numRecordsPerYear = -1 

   --> 2016/numeratorFile = ./rawCDCdata/period//2016/VS16LINK.USNUMPUB
   --> 2016/denomFile     = ./rawCDCdata/period//2016/VS16LINK.USDENPUB

--
Initializing raw reads from desired CDC files...

variable         restatus (    int): index for year=2016 =>  104 (len =  1) (flag index = -999)
variable         revision (unknown): index for year=2016 =>    7 (len =  1) (flag index = -999)
variable            mager (    int): index for year=2016 =>   75 (len =  2) (flag index = -999)
variable          dplural (    int): index for year=2016 =>  454 (len =  1) (flag index = -999)

variable       rdmeth_rec 


Total number of raw births read           =  3,956,112
--> # of U.S. births (50 states)          =  3,945,875

Total time to parse CDC file   = 328.976 (secs)
Total time to save pickle file = 12.273 (secs)

--
Parsing runtime options...
Using config from file = config.period
Overriding analysis year using command-line option: 2017
   --> years = ['2017']
   --> numRecordsPerYear = -1 

   --> 2017/numeratorFile = ./rawCDCdata/period//2017/VS17LINK.Public.USNUMPUB
   --> 2017/denomFile     = ./rawCDCdata/period//2017/VS17LINK.Public.USDENPUB

--
Initializing raw reads from desired CDC files...

variable         restatus (    int): index for year=2017 =>  104 (len =  1) (flag index = -999)
variable         revision (unknown): index for year=2017 =>    7 (len =  1) (flag index = -999)
variable            mager (    int): index for year=2017 =>   75 (len =  2) (flag index = -999)
variable          dplural (    int): index for year=2017 =>  454 (len =  1) (flag index = -999)

variable    


Total number of raw births read           =  3,864,781
--> # of U.S. births (50 states)          =  3,855,500

Total time to parse CDC file   = 311.850 (secs)
Total time to save pickle file = 11.250 (secs)
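
Once the loop finishes, each year's dataframe can be read back for downstream analysis. A minimal sketch, assuming the `data/parsed<year>.pickle` naming used in the save step; the `load_parsed_years` helper and the added `source_year` column are illustrative choices, not part of the companion utilities.

```python
from pathlib import Path

import pandas as pd


def load_parsed_years(years, data_dir='data'):
    """Read the per-year pickle files and stack them into one dataframe."""
    frames = []
    for year in years:
        path = Path(data_dir) / f'parsed{year}.pickle'
        if not path.exists():
            # Skip years that have not been parsed yet instead of failing
            print(f"Skipping {year}: {path} not found")
            continue
        df = pd.read_pickle(path)
        # Tag rows with the year they came from (hypothetical column name)
        frames.append(df.assign(source_year=year))
    # pd.concat raises on an empty list, so return an empty frame in that case
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

With roughly four million records per year, combining many years at once may require substantial memory, so loading a single year at a time is a reasonable alternative. Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.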
