**Michael Vizelman - Data Science - Summer 2019**

### Final Project - Classification Models and Ensemble Model Using Millennials Data - Data Extraction

## Introduction

This Jupyter notebook contains the steps taken to extract a subset of data from the Current Population Survey [2019 Annual Social and Economic (ASEC) Supplement](https://www.census.gov/data/datasets/time-series/demo/cps/cps-asec.html) which will be used to predict whether or not a millennial will receive financial assistance from the government. 

The original data sets contain 799 variables in the personal data file (file name: pppub19.csv) and 135 variables in the household data file (file name: hhpub19.csv). From each file we select a subset of variables that we believe could be relevant to our population of interest (millennials) and to our prediction goal, and then join the two subsets.

## Data Dictionary

Below is a data dictionary which includes details of the variables we chose to extract from each of the original data files, including the new (more meaningful) name we chose for each variable. The United States Census Bureau has a [full data dictionary](https://www2.census.gov/programs-surveys/cps/datasets/2019/march/06_ASEC_2019-Data_Dictionary_Full.pdf) which includes details of all the variables for which data was collected in the 2019 Annual Social and Economic Supplement (ASEC) survey.  

<ins>Personal data file variables:</ins> 

|**Original Variable Name**|**Description**|**New Variable Name**|**Values**|**Type**|
|----------------------|---------------|------------|----------|--------|
|PH_SEQ|Household seq number|person_household_id|00001:99999|categorical|
|A_AGE|Age|age|00-79 = 0-79 years of age <br /> 80 = 80-84 years of age <br /> 85 = 85+ years of age|continuous|
|A_SEX|Sex|gender|1 = Male <br/> 2 = Female|categorical|
|PRDTRACE|Race|race|01 = White only <br/> 02 = Black only <br/> 03 = American Indian, Alaskan Native only (AI) <br/> 04 = Asian only <br/> 05 = Hawaiian/Pacific Islander only (HP) <br/> 06 = White-Black <br/> 07 = White-AI <br/> 08 = White-Asian <br/> 09 = White-HP <br/> 10 = Black-AI <br/> 11 = Black-Asian <br/> 12 = Black-HP <br/> 13 = AI-Asian <br/> 14 = AI-HP <br/> 15 = Asian-HP <br/> 16 = White-Black-AI <br/> 17 = White-Black-Asian <br/> 18 = White-Black-HP <br/> 19 = White-AI-Asian <br/> 20 = White-AI-HP <br/> 21 = White-Asian-HP <br/> 22 = Black-AI-Asian <br/> 23 = White-Black-AI-Asian <br/> 24 = White-AI-Asian-HP <br/> 25 = Other 3 race comb. <br/> 26 = Other 4 or 5 race comb.|categorical|
|PEHSPNON|Are you Spanish, Hispanic, or Latino?|span_hisp_latin|1 = Yes <br/> 2 = No|categorical|
|A_MARITL|Marital status|marital_status| 1 = Married - civilian spouse present <br/> 2 = Married - AF spouse present <br/> 3 = Married - spouse absent (exc.separated) <br/> 4 = Widowed <br/> 5 = Divorced <br/> 6 = Separated <br/> 7 = Never married|categorical|
|A_HGA|Educational attainment|education_level|0 = Children <br /> 31 = Less than 1st grade <br /> 32 = 1st,2nd,3rd,or 4th grade <br /> 33 = 5th or 6th grade <br /> 34 = 7th and 8th grade <br /> 35 = 9th grade <br /> 36 = 10th grade <br /> 37 = 11th grade <br /> 38 = 12th grade no diploma <br /> 39 = High school graduate <br /> 40 = Some college but no degree <br /> 41 = Associate degree in college - occupation/vocation program <br /> 42 = Associate degree in college - academic program <br /> 43 = Bachelor's degree <br />  44 = Master's degree <br /> 45 = Professional school degree <br /> 46 = Doctorate degree (for example: PHD,EDD)| categorical|
|PECERT1|Do you have a currently active professional certification or a state or industry license?|active_certification_license| -1 = Not in universe <br/> 1 = Yes <br/> 2 = No|categorical <br /> Universe: PRPERTYP=02=Adult civilian household member|
|P_STAT|Status of person identifier|civilian_or_army|1 = Civilian 15+ <br/> 2 = Armed Forces <br/> 3 = Children 0 - 14|categorical|
|PRDISFLG|Does this person have any disability conditions? (blind, deaf, other physical or mental)|disability|-1 = Not in universe <br/> 1 = Yes <br/> 2 = No|categorical <br/> Universe: PRPERTYP=02=Adult civilian household member|
|HEA|Health status|health_status|1= Excellent<br />2= Very good<br />3= Good<br />4= Fair<br />5= Poor|categorical|
|NOW_COV|Currently covered by health insurance coverage|insurance_coverage_flag|1= Yes<br />2= No|categorical|
|PRCITSHP|Citizenship group|citizenship|1 = Native, born in US <br/> 2 = Native, born in PR or US outlying area <br/> 3 = Native, born abroad of US parent(s) <br/> 4 = Foreign born, US cit by naturalization <br/> 5 = Foreign born, not a US citizen|categorical|
|PENATVTY|In what country were you born?|birth_country|057-555 range, 057=United States, see full detail in Appendix H in the [Technical Documentation](https://www2.census.gov/programs-surveys/cps/techdocs/cpsmar19.pdf) of ASEC 2019 (page 365)|categorical|
|PEINUSYR|When did you come to the U.S. to stay (immigrated)?|immigration_period|00 = Not an immigrant <br/> 01 = Before 1950 <br/> 02 = 1950-1959 <br/> 03 = 1960-1964 <br/> 04 = 1965-1969 <br/> 05 = 1970-1974 <br/> 06 = 1975-1979 <br/> 07 = 1980-1981 <br/> 08 = 1982-1983 <br/> 09 = 1984-1985 <br/> 10 = 1986-1987 <br/> 11 = 1988-1989 <br/> 12 = 1990-1991 <br/> 13 = 1992-1993 <br/> 14 = 1994-1995 <br/> 15 = 1996-1997 <br/> 16 = 1998-1999 <br/> 17 = 2000-2001 <br/> 18 = 2002-2003 <br/> 19 = 2004-2005 <br/> 20 = 2006-2007 <br/> 21 = 2008-2009 <br/> 22 = 2010-2011 <br/> 23 = 2012-2013 <br/> 24 = 2014-2015 <br/> 25 = 2016-2019|categorical|
|PEMLR|Major labor force recode|employed_or_not|0 = Not in universe<br/>1 = Employed - at work<br/>2 = Employed - absent<br/>3 = Unemployed - on layoff<br/>4 = Unemployed - looking<br/>5 = Not in labor force - retired<br/>6 = Not in labor force - disabled<br/>7 = Not in labor force - other|categorical|
|A_CLSWKR|Class of worker|worker_class|0 = Not in universe or children and Armed Forces<br/>1 = Private<br/>2 = Federal government<br/>3 = State government<br/>4 = Local government<br/>5 = Self-employed-incorporated<br/>6 = Self-employed-not incorporated<br/>7 = Without pay<br/>8 = Never worked|categorical<br/>Universe: PEMLR (employed_or_not) =1-3 ,or PEMLR=4-7 and person worked in the last 12 months|
|A_MJIND|Major industry code|industry|0 = Not in universe, or children<br/>1 = Agriculture, forestry,fishing, and hunting<br/>2 = Mining<br/>3 = Construction<br/>4 = Manufacturing<br/>5 = Wholesale and retail trade<br/>6 = Transportation and utilities<br/>7 = Information<br/>8 = Financial activities<br/>9 = Professional and business services<br/>10 = Educational and health services<br/>11 = Leisure and hospitality<br/>12 = Other services<br/>13 = Public administration<br/>14 = Armed Forces|categorical <br/> Universe: A_CLSWKR (worker_class) = 1-7|
|PTOTVAL|Total persons income|total_income|0 = none<br />negative amt = income (loss) <br />positive amt = income<br/> -99999:99999999 range|continuous <br /> Universe: All Persons aged 15+|
|TAX_INC|Taxable income amount|taxable_income|0 = none; dollar amount|continuous|
|PERLIS|Poverty level of person|poverty_category|1 = below poverty level<br />2 = 100 - 124 percent of the poverty level<br />3 = 125 - 149 percent of the poverty level<br />4 = 150 percent of the poverty level and above|categorical|
|CHELSEW_YN|Does this person have a child living outside the household?|child_outside_household|0= Not in universe<br />1= Yes<br />2= No|categorical <br /> Universe: All Persons aged 15+|
|CHSP_YN|Is this person required to pay child support?|child_support_flag|0= Not in universe<br/> 1= Yes<br/> 2=No|categorical <br/> Universe: CHELSEW_YN (child_outside_household) = 1|
|CHSP_VAL|Annual amount of child support paid|annual_child_support|0 = Not in universe<br /> 1:99999 = amount paid in child support|continuous <br />Universe: CHSP_YN = 1|
|HHDREL|Detailed household summary|household_status|<ins>In household:</ins> <br/> 1 = Householder <br/> 2 = Spouse of householder <br/> <ins>Child of householder:</ins> <br/> 3 = Under 18 years, single (never married) <br/> 4 = Under 18 years, ever married <br/> 5 = 18 years and over <br/> <ins>Other household members:</ins> <br/> 6 = Other relative of householder <br/> 7 = Nonrelative of householder <br/> 8 = Secondary individual|categorical|
|FIN_YN|Received financial assistance?|financial_assistance_flag|0 = not in universe <br />1 = yes <br />2 = no|categorical <br />Universe: All Persons aged 15+|

<ins>Household data file variables:</ins> 
                    
|**Original Variable Name**|**Description**|**New Variable Name**|**Values**|**Type**|
|----------------------|---------------|------------|----------|--------|
|H_SEQ|Household sequence number|household_id|00001:99999|categorical|
|GTMETSTA|Metropolitan status|metropolitan_status|1 = Metropolitan<br/> 2 = Non-metropolitan<br/> 3 = Not identified|categorical|
|GESTFIPS|State code|state|01-56 State code|categorical|
|GEREG|Region|region|1 = Northeast<br/> 2 = Midwest<br/> 3 = South<br/> 4 = West|categorical|
|HTOTVAL|Total household income|household_income|negative amt = income (loss) <br />positive amt = income<br/> -99999:99999999 range|continuous <br /> Universe: All Persons aged 15+|
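
Most of the variables above are stored as integer codes. As a minimal sketch (not part of the original extraction), the snippet below shows one way such codes could be translated into readable labels with `Series.map`. The dictionaries `GENDER_LABELS` and `REGION_LABELS` and the frame `toy` are hypothetical names introduced here for illustration, and the rows are made-up values rather than ASEC records.

```python
import pandas as pd

# Hypothetical label maps, copied from the "Values" column of the tables above.
GENDER_LABELS = {1: "Male", 2: "Female"}
REGION_LABELS = {1: "Northeast", 2: "Midwest", 3: "South", 4: "West"}

# Toy rows (made-up values, not real ASEC records).
toy = pd.DataFrame({"gender": [1, 2, 2], "region": [4, 1, 3]})

# Series.map looks each code up in the dict; codes missing from the dict become NaN.
toy["gender_label"] = toy["gender"].map(GENDER_LABELS)
toy["region_label"] = toy["region"].map(REGION_LABELS)

print(toy)
```

Keeping the original integer columns next to the labels makes it easy to cross-check against the Census data dictionary.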

## Data Extraction

The original files are too large to host on GitHub. To reproduce the extraction steps below, download them in .csv format from the Current Population Survey [2019 Annual Social and Economic (ASEC) Supplement](https://www.census.gov/data/datasets/time-series/demo/cps/cps-asec.html) and load them into a Python environment from a local path.

We are taking the following steps for the extraction:
- load both files
- select the variables of interest
- rename the variables 
- join the two sets of variables on the household_id
- subset the population of interest:
    - millennials (born 1981-1996)
    - that are living without non-family members in the household 
    - that are civilians (not in the armed forces)
- export the subset in .csv file format
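
The steps above can be sketched on a toy example. This is an illustration under stated assumptions, not the notebook's actual filtering code: `person_toy` and `household_toy` are hypothetical frames with made-up values, the age range 23-38 assumes age is approximated as 2019 minus birth year, and treating `household_status` codes 7 and 8 as "non-family members" is an assumption based on the data dictionary.

```python
import pandas as pd

# Toy stand-ins for the two renamed tables (made-up values, not ASEC records).
person_toy = pd.DataFrame({
    "person_household_id": [1, 1, 2, 3, 3],
    "age":                 [30, 5, 45, 25, 27],
    "civilian_or_army":    [1, 3, 1, 2, 1],
    "household_status":    [1, 3, 1, 1, 7],
})
household_toy = pd.DataFrame({
    "household_id": [1, 2, 3],
    "region":       [1, 2, 3],
})

# Join: each person row picks up the columns of its household.
# validate="many_to_one" raises an error if household_id is not unique on the right.
joined = person_toy.merge(
    household_toy,
    how="left",
    left_on="person_household_id",
    right_on="household_id",
    validate="many_to_one",
)

# ASSUMPTION: born 1981-1996 is approximated as ages 23-38 in 2019.
is_millennial = joined["age"].between(23, 38)

# Code 1 = Civilian 15+ in the data dictionary.
is_civilian = joined["civilian_or_army"] == 1

# ASSUMPTION: a household has non-family members if any member has code 7 or 8.
has_nonfamily = (
    joined["household_status"].isin([7, 8])
    .groupby(joined["household_id"])
    .transform("any")
)

subset = joined[is_millennial & is_civilian & ~has_nonfamily]

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = subset.to_csv(index=False)
print(subset)
```

In this toy data only the 30-year-old civilian in household 1 survives all three filters: household 3 is dropped because one of its members is a nonrelative of the householder.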

In [1]:
#first import the necessary libraries and modules
import pandas as pd
import numpy as np

In [2]:
#next read in the personal data file - pppub19.csv
#the file is too large to host on GitHub,
#so it must be read from a local path
file_location = "/Users/michael/Desktop/pppub19.csv"
person = pd.read_csv(file_location)

#display all columns
pd.set_option('display.max_columns', None) 

#print shape
print("df shape is: ", person.shape)
#check results
person.head()

df shape is:  (180101, 799)


Unnamed: 0,PERIDNUM,GESTCEN,PH_SEQ,GTCBSA,P_SEQ,A_LINENO,PF_SEQ,PHF_SEQ,OED_TYP1,OED_TYP2,OED_TYP3,PERRP,PXRRP,PXMARITL,PXRACE1,PEHSPNON,PXHSPNON,PEAFEVER,PXAFEVER,PEAFWHN1,PXAFWHN1,PEAFWHN2,PEAFWHN3,PEAFWHN4,PXSPOUSE,PENATVTY,PXNATVTY,PEMNTVTY,PXMNTVTY,PEFNTVTY,PXFNTVTY,PEINUSYR,PXINUSYR,PRDASIAN,PRDTHSP,PRDTRACE,PRPERTYP,PRCITFLG,PRCITSHP,PECOHAB,PXCOHAB,PEABSRSN,PEHRUSLT,PEMLR,PRDISC,PRPTREA,PRUNTYPE,PRWKSTAT,PEIO1COW,PRCOW1,PRERELG,PRWERNAL,PRHERNAL,PRNLFSCH,PEDISEAR,PEDISEYE,PEDISREM,PEDISPHY,PEDISDRS,PEDISOUT,PXDISEAR,PXDISEYE,PXDISREM,PXDISPHY,PXDISDRS,PXDISOUT,PRDISFLG,PECERT1,PECERT2,PECERT3,PXCERT1,PXCERT2,PXCERT3,PEPAR1,PEPAR2,PEPAR1TYP,PEPAR2TYP,PXPAR1,PXPAR2,PXPAR1TYP,PXPAR2TYP,A_AGE,A_SEX,A_DTIND,A_HRSPAY,A_PAYABS,A_WANTJB,A_HRLYWK,A_ENRLW,A_EXPLF,A_MJIND,A_UNMEM,A_MJOCC,A_DTOCC,A_UNCOV,A_HSCOL,A_FTPT,A_WKSCH,A_FNLWGT,A_ERNLWT,A_FAMREL,A_FAMNUM,AXAGE,AXSEX,AXLFSR,AXHRS,AXWHYABS,AXPAYABS,AXCLSWKR,AXNLFLJ,AXUSLHRS,AXUNMEM,AXUNCOV,AXENRLW,AXHSCOL,AXFTPT,AXHGA,AXHRLYWK,A_USLHRS,A_FAMTYP,A_GRSWK,A_WKSLK,A_SPOUSE,A_MARITL,A_HGA,A_HRS1,P_STAT,A_USLFT,A_CIVLF,A_FTLF,A_UNTYPE,A_CLSWKR,A_EXPRRP,A_WKSTAT,A_LFSR,A_WHYABS,A_PFREL,A_WHENLJ,A_NLFLJ,MARSUPWT,ACTC_CRD,AGE1,AGI,ANN_VAL,ANN_YN,CAID,CAP_VAL,CAP_YN,CHAMPVA,CHCARE_YN,CHELSEW_YN,CHSP_VAL,CHSP_YN,CLWK,COV,COV_CYR,COV_MULT_CYR,CSP_VAL,CSP_YN,CTC_CRD,DBTN_VAL,DEPDIR,DEPGRP,DEPMIL,DEPMRK,DEPMRKS,DEPMRKUN,DEPNONM,DEPPRIV,DIR,DIRFTYP,DIRFTYP2,DIRLIN1,DIROUT,DIS_CS,DIS_HP,DIS_SC1,DIS_SC2,DIS_VAL1,DIS_VAL2,DIS_YN,DIV_VAL,DIV_YN,DSAB_VAL,DST_SC1,DST_SC2,DST_SC1_YNG,DST_SC2_YNG,DST_VAL1,DST_VAL2,DST_VAL1_YNG,DST_VAL2_YNG,DST_YN,DST_YN_YNG,EARNER,ED_VAL,ED_YN,EIT_CRED,ERN_OTR,ERN_SRCE,ERN_VAL,ERN_YN,FAMREL,FEDTAX_AC,FEDTAX_BC,FED_RET,FICA,FIN_VAL,FIN_YN,FRMOTR,FRM_VAL,FRSE_VAL,FRSE_YN,GRP,GRPFTYP,GRPFTYP2,GRPLIN1,GRPOUT,HEA,HHDFMX,HHDREL,HIPAID,HRCHECK,HRSWK,IHSFLG,INDUSTRY,INT_VAL,INT_YN,I_ANNVAL,I_ANNYN,I_CAID,I_CAPVAL,I_CAPYN,I_CHAMPVA,I_CHCAREYN,I_CHELSEWYN,I_CHSPVAL,I_CHSPYN,I_CSPVAL,I_CSPYN,I_DEPDIR,I_DEPGRP,I_DE
PMIL,I_DEPMRK,I_DEPMRKS,I_DEPMRKUN,I_DEPNONM,I_DEPPRIV,I_DIR,I_DIROUT,I_DISCS,I_DISHP,I_DISSC1,I_DISSC2,I_DISYN,I_DIVVAL,I_DIVYN,I_DSTSC,I_DSTSCCOMP,I_DSTVAL1COMP,I_DSTVAL2COMP,I_DSTYNCOMP,I_EDTYP,I_EDYN,I_ERNSRC,I_ERNVAL,I_ERNYN,I_FINVAL,I_FINYN,I_FRMVAL,I_FRMYN,I_GRP,I_GRPOUT,I_HEA,I_HIPAID,I_HRCHK,I_HRSWK,I_IHSFLG,I_INDUS,I_INTVAL,I_INTYN,I_LJCW,I_LKSTR,I_LKWEEK,I_LOSEWK,I_MCAID,I_MCARE,I_MCPREM,I_MIG1,I_MIG2,I_MIG3,I_MIL,I_MILOUT,I_MOOP,I_MOOP2,I_MRK,I_MRKOUT,I_MRKS,I_MRKSOUT,I_MRKUN,I_MRKUNOUT,I_NOEMP,I_NONM,I_NONMOUT,I_NOW_CAID,I_NOW_CHAMPVA,I_NOW_DEPDIR,I_NOW_DEPGRP,I_NOW_DEPMIL,I_NOW_DEPMRK,I_NOW_DEPMRKS,I_NOW_DEPMRKUN,I_NOW_DEPNONM,I_NOW_DEPPRIV,I_NOW_DIR,I_NOW_DIROUT,I_NOW_GRP,I_NOW_GRPOUT,I_NOW_HIPAID,I_NOW_IHSFLG,I_NOW_MCAID,I_NOW_MCARE,I_NOW_MIL,I_NOW_MILOUT,I_NOW_MRK,I_NOW_MRKOUT,I_NOW_MRKS,I_NOW_MRKSOUT,I_NOW_MRKUN,I_NOW_MRKUNOUT,I_NOW_NONM,I_NOW_NONMOUT,I_NOW_OTHMT,I_NOW_OUTDIR,I_NOW_OUTGRP,I_NOW_OUTMIL,I_NOW_OUTMRK,I_NOW_OUTMRKS,I_NOW_OUTMRKUN,I_NOW_OUTNONM,I_NOW_OUTPRIV,I_NOW_OWNDIR,I_NOW_OWNGRP,I_NOW_OWNMIL,I_NOW_OWNMRK,I_NOW_OWNMRKS,I_NOW_OWNMRKUN,I_NOW_OWNNONM,I_NOW_OWNPRIV,I_NOW_PCHIP,I_NOW_PRIV,I_NOW_PUB,I_NOW_VACARE,I_NWLKWK,I_NWLOOK,I_NXTRES,I_OCCUP,I_OEDVAL,I_OIVAL,I_OTHMT,I_OUTDIR,I_OUTGRP,I_OUTMIL,I_OUTMRK,I_OUTMRKS,I_OUTMRKUN,I_OUTNONM,I_OUTPRIV,I_OWNDIR,I_OWNGRP,I_OWNMIL,I_OWNMRK,I_OWNMRKS,I_OWNMRKUN,I_OWNNONM,I_OWNPRIV,I_PAWMO,I_PAWTYP,I_PAWVAL,I_PAWYN,I_PCHIP,I_PECOULD,I_PENINC,I_PENPLA,I_PENSC1,I_PENSC2,I_PENVAL1,I_PENVAL2,I_PENYN,I_PEOFFER,I_PEWNELIG1,I_PEWNELIG2,I_PEWNELIG3,I_PEWNELIG4,I_PEWNELIG5,I_PEWNELIG6,I_PEWNTAKE1,I_PEWNTAKE2,I_PEWNTAKE3,I_PEWNTAKE4,I_PEWNTAKE5,I_PEWNTAKE6,I_PEWNTAKE7,I_PEWNTAKE8,I_PHIPVAL,I_PHIPVAL2,I_PHMEMP,I_PMEDVAL,I_POTCVAL,I_PRIV,I_PTRSN,I_PTWKS,I_PTYN,I_PUB,I_PYRSN,I_RETCBVAL,I_RETCBYN,I_RINTSC,I_RINTVAL1,I_RINTVAL2,I_RINTYN,I_RNTVAL,I_RNTYN,I_RSNNOT,I_SEVAL,I_SEYN,I_SSIVAL,I_SSIYN,I_SSVAL,I_SSYN,I_SURSC1,I_SURSC2,I_SURYN,I_UCVAL,I_UCYN,I_VACARE,I_VETQVA,I_VETTYP,I_VETVAL,I_VETYN,I_WCTYP,I_WCVAL,I_WCY
N,I_WKCHK,I_WKSWK,I_WORKYN,I_WSVAL,I_WSYN,I_WTEMP,LJCW,LKNONE,LKSTRCH,LKWEEKS,LOSEWKS,MARG_TAX,MCAID,MCAID_CYR,MCARE,MIGSAME,MIG_DIV,MIG_MTR1,MIG_MTR3,MIG_MTR4,MIG_REG,MIG_ST,MIL,MILFTYP,MILFTYP2,MILLIN1,MILOUT,MOOP,MOOP2,MRK,MRKFTYP,MRKFTYP2,MRKLIN1,MRKOUT,MRKS,MRKSFTYP,MRKSFTYP2,MRKSLIN1,MRKSOUT,MRKUN,MRKUNFTYP,MRKUNFTYP2,MRKUNLIN1,MRKUNOUT,NOCOV_CYR,NOEMP,NONM,NONMFTYP,NONMFTYP2,NONMLIN1,NONMOUT,NOW_CAID,NOW_CHAMPVA,NOW_COV,NOW_DEPDIR,NOW_DEPGRP,NOW_DEPMIL,NOW_DEPMRK,NOW_DEPMRKS,NOW_DEPMRKUN,NOW_DEPNONM,NOW_DEPPRIV,NOW_DIR,NOW_DIRFTYP,NOW_DIRFTYP2,NOW_DIRLIN,NOW_DIROUT,NOW_GRP,NOW_GRPFTYP,NOW_GRPFTYP2,NOW_GRPLIN,NOW_GRPOUT,NOW_HIPAID,NOW_IHSFLG,NOW_MCAID,NOW_MCARE,NOW_MIL,NOW_MILFTYP,NOW_MILFTYP2,NOW_MILLIN,NOW_MILOUT,NOW_MRK,NOW_MRKFTYP,NOW_MRKFTYP2,NOW_MRKLIN,NOW_MRKOUT,NOW_MRKS,NOW_MRKSFTYP,NOW_MRKSFTYP2,NOW_MRKSLIN,NOW_MRKSOUT,NOW_MRKUN,NOW_MRKUNFTYP,NOW_MRKUNFTYP2,NOW_MRKUNLIN,NOW_MRKUNOUT,NOW_NONM,NOW_NONMFTYP,NOW_NONMFTYP2,NOW_NONMLIN,NOW_NONMOUT,NOW_OTHMT,NOW_OUTDIR,NOW_OUTGRP,NOW_OUTMIL,NOW_OUTMRK,NOW_OUTMRKS,NOW_OUTMRKUN,NOW_OUTNONM,NOW_OUTPRIV,NOW_OWNDIR,NOW_OWNGRP,NOW_OWNMIL,NOW_OWNMRK,NOW_OWNMRKS,NOW_OWNMRKUN,NOW_OWNNONM,NOW_OWNPRIV,NOW_PCHIP,NOW_PRIV,NOW_PUB,NOW_VACARE,NWLKWK,NWLOOK,NXTRES,OCCUP,OI_OFF,OI_VAL,OI_YN,OTHMT,OUTDIR,OUTGRP,OUTMIL,OUTMRK,OUTMRKS,OUTMRKUN,OUTNONM,OUTPRIV,OWNDIR,OWNGRP,OWNMIL,OWNMRK,OWNMRKS,OWNMRKUN,OWNNONM,OWNPRIV,PARENT,PAW_MON,PAW_TYP,PAW_VAL,PAW_YN,PCHIP,PCHIP_SP2,PEARNVAL,PECOULD,PEMCPREM,PENINCL,PENPLAN,PEN_SC1,PEN_SC2,PEN_VAL1,PEN_VAL2,PEN_YN,PEOFFER,PERLIS,PEWNELIG1,PEWNELIG2,PEWNELIG3,PEWNELIG4,PEWNELIG5,PEWNELIG6,PEWNTAKE1,PEWNTAKE2,PEWNTAKE3,PEWNTAKE4,PEWNTAKE5,PEWNTAKE6,PEWNTAKE7,PEWNTAKE8,PHIP_VAL,PHIP_VAL2,PHMEMPRS,PMED_VAL,PNSN_VAL,POCCU2,POTC_VAL,POTHVAL,POV_UNIV,PPPOS,PRECORD,PRIV,PRIV_CYR,PRSWKXPNS,PTOTVAL,PTOT_R,PTRSN,PTWEEKS,PTYN,PUB,PUB_CYR,PYRSN,RESNSS1,RESNSS2,RESNSSA,RESNSSI1,RESNSSI2,RESNSSIA,RETCB_VAL,RETCB_YN,RINT_SC1,RINT_SC2,RINT_VAL1,RINT_VAL2,RINT_YN,RNT_VAL,RNT_YN,RSNNOTW,SEMP_VAL,SEMP_YN,SEO
TR,SE_VAL,SPM_ACTC,SPM_CAPHOUSESUB,SPM_CHILDCAREXPNS,SPM_CHILDSUPPD,SPM_EITC,SPM_ENGVAL,SPM_EQUIVSCALE,SPM_FAMTYPE,SPM_FEDTAX,SPM_FEDTAXBC,SPM_FICA,SPM_GEOADJ,SPM_HAGE,SPM_HEAD,SPM_HHISP,SPM_HMARITALSTATUS,SPM_HRACE,SPM_ID,SPM_MEDXPNS,SPM_NUMADULTS,SPM_NUMKIDS,SPM_NUMPER,SPM_POOR,SPM_POVTHRESHOLD,SPM_RESOURCES,SPM_SCHLUNCH,SPM_SNAPSUB,SPM_STTAX,SPM_TENMORTSTATUS,SPM_TOTVAL,SPM_WCOHABIT,SPM_WEIGHT,SPM_WFOSTER22,SPM_WICVAL,SPM_WKXPNS,SPM_WNEWHEAD,SPM_WNEWPARENT,SPM_WUI_LT15,SRVS_VAL,SSI_VAL,SSI_YN,SS_VAL,SS_YN,STATETAX_A,STATETAX_B,STRKUC,SUBUC,SUR_SC1,SUR_SC2,SUR_VAL1,SUR_VAL2,SUR_YN,TAX_INC,TRDINT_VAL,UC_VAL,UC_YN,VACARE,VET_QVA,VET_TYP1,VET_TYP2,VET_TYP3,VET_TYP4,VET_TYP5,VET_VAL,VET_YN,WAGEOTR,WC_TYPE,WC_VAL,WC_YN,WECLW,WEIND,WELKNW,WEMIND,WEMOCG,WEUEMP,WEWKRS,WEXP,WICYN,WICYNA,WKCHECK,WKSWORK,WORKYN,WRK_CK,WSAL_VAL,WSAL_YN,WS_VAL,WTEMP,FL_665,SPM_CAPWKCCXPNS,TPEN_VAL1,TPEN_VAL2,TANN_VAL,TDST_VAL1,TDST_VAL2,TDST_VAL1_YNG,TDST_VAL2_YNG,TFIN_VAL,TOI_VAL,TTRDINT_VAL,TRINT_VAL1,TRINT_VAL2,TRNT_VAL,TCAP_VAL,TDIV_VAL,TCSP_VAL,TED_VAL,TCHSP_VAL,TPHIP_VAL,TPHIP_VAL2,TPMED_VAL,TPOTC_VAL,TPEMCPREM,TCERNVAL,TCWSVAL,TCSEVAL,TCFFMVAL,TSURVAL1,TSURVAL2,TDISVAL1,TDISVAL2,TAX_ID,PEIOIND,PEIOOCC,A_WERNTF,A_HERNTF,I_DISVL1,I_DISVL2,I_SURVL1,I_SURVL2,MIG_CBST,MIG_DSCP,DEP_STAT,FILEDATE,FILESTAT,MMYY
0,0100069124539430801101,11,4,0,1,1,1,1,0,0,0,41,0,0,0,2,0,2,0,-1,1,-1,-1,-1,1,57,0,57,0,57,0,0,0,-1,0,1,2,0,1,-1,1,0,30,1,0,20,0,7,4,4,0,0,0,2,2,2,2,2,2,2,0,0,0,0,0,0,2,2,-1,-1,0,0,0,-1,-1,-1,-1,1,1,1,1,21,1,45,-1,0,0,0,2,1,11,0,3,13,0,0,0,1,235592,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30,2,0,0,0,7,37,30,1,2,1,0,0,1,2,4,1,0,0,0,0,203167,0,4,18000,0,2,2,0,0,2,0,1,0,2,1,2,1,1,0,2,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,2,2,0,0,0,0,2,0,2,0,0,0,0,0,0,0,0,0,2,0,1,0,2,0,2,1,18000,1,10,600,600,0,1377,0,2,0,0,0,2,2,0,0,0,0,3,49,1,0,1,30,2,8660,0,2,0,0,0,0,0,0,0,0,0,0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,0,-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,0,-1,0,0,0,0,0,0,0,0,0,0,0,0,-1,0,0,0,0,-1,0,0,0,-1,0,-1,0,-1,0,0,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,0,-1,0,-1,-1,0,0,0,0,-1,0,-1,0,-1,0,-1,0,-1,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,-1,0,0,0,0,0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,10,2,1,2,1,1,1,1,1,1,0,2,0,0,0,0,2050,2050,2,0,0,0,0,2,0,0,0,0,2,0,0,0,0,3,6,2,0,0,0,0,2,2,2,0,0,0,0,0,0,0,0,2,0,0,0,0,2,0,0,0,0,0,2,2,2,2,0,0,0,0,2,0,0,0,0,2,0,0,0,0,2,0,0,0,0,2,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,2,2,0,0,0,4050,0,0,2,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0,18000,0,0,0,2,0,0,0,0,2,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2000,0,33,50,0,1,41,3,2,1,1929,18000,8,4,30,0,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,0,2,0,0,0,0,0,0,0,900,0.4635,5,600,600,1377,0.8997,21,1,0,7,1,4001,2050,1,0,1,0,10080,12961,0,0,-17,2,18000,0,203167,0,0,1929,1,0,0,0,0,2,0,2,-17,107,2,2,0,0,0,0,2,6000,0,0,2,2,0,0,0,0,0,0,0,2,0,0,0,2,5,18,7,11,13,8,2,7,0,0,3,52,1,1,18000,1,0,0,1,1929,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,401,8660,4050,0,0,0,0,0,0,0,0,0,110419,5,32019
1,3999403901610020901101,11,6,0,1,1,1,1,0,0,0,41,0,0,0,2,0,2,0,-1,1,-1,-1,-1,1,57,0,57,42,57,12,0,0,-1,0,1,2,0,1,-1,1,0,-1,5,0,0,0,1,0,0,0,0,0,0,1,2,1,1,1,1,0,0,0,0,0,0,1,2,-1,-1,0,0,0,-1,-1,-1,-1,1,1,1,1,85,2,0,-1,0,2,0,0,0,0,0,0,0,0,0,0,0,122080,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,-1,2,0,0,0,4,39,0,1,0,0,0,0,0,2,1,7,0,0,0,0,123204,0,17,18000,0,2,2,0,0,2,0,2,0,0,5,1,3,1,0,2,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,2,1,0,0,0,0,2,0,2,0,0,0,0,0,0,0,0,0,2,0,2,0,2,0,0,0,0,2,10,440,440,0,0,0,2,0,0,0,2,2,0,0,0,0,3,49,1,0,0,0,2,0,0,2,0,4,0,0,0,0,0,0,0,0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,0,-1,0,0,0,0,4,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,0,-1,0,0,0,0,0,11,0,0,0,0,0,0,2,0,0,0,0,-1,1,1,0,-1,0,-1,0,-1,0,0,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,0,-1,0,-1,-1,0,0,0,0,-1,0,-1,0,-1,0,-1,0,-1,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,4,0,-1,0,0,1,1,4,0,4,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,1,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,10,15,10,0,0,4,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10,2,1,1,1,1,1,1,1,1,0,2,0,0,0,0,6440,6440,2,0,0,0,0,2,0,0,0,0,2,0,0,0,0,1,0,2,0,0,0,0,2,2,1,0,0,0,0,0,0,0,0,2,0,0,0,0,2,0,0,0,0,0,2,2,1,2,0,0,0,0,2,0,0,0,0,2,0,0,0,0,2,0,0,0,0,2,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,1,2,0,2,0,0,0,0,2,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0,0,0,1608,0,0,5,0,18000,0,1,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1440,1440,0,1000,18000,53,4000,21780,1,41,3,2,1,0,21780,9,0,0,0,1,3,0,1,0,4,0,0,0,0,0,0,0,0,0,2,0,2,2,0,2,0,0,0,0,0,0,0,0,0.4635,5,440,440,0,0.8796,85,1,0,4,1,6001,8048,1,0,1,0,11483,13292,0,0,0,3,21780,0,123204,0,0,0,1,0,0,0,0,2,3780,1,0,0,2,2,0,0,0,0,2,4400,0,0,2,2,0,0,0,0,0,0,0,2,0,0,0,2,9,23,1,15,24,9,5,13,0,0,0,0,2,2,0,2,0,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,601,0,-1,0,0,0,0,0,0,0,0,0,110419,5,32019
2,9119340093206090901101,11,7,0,1,1,1,1,0,0,0,41,0,0,0,2,42,2,0,-1,1,-1,-1,-1,1,57,0,57,0,57,0,0,0,-1,0,1,2,0,1,-1,1,0,44,1,0,0,0,2,4,4,0,0,0,0,2,2,2,2,2,2,0,0,0,0,0,0,2,2,-1,-1,0,0,0,-1,-1,-1,-1,1,1,1,1,61,2,45,-1,0,0,0,0,1,11,0,3,13,0,0,0,1,117089,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,44,2,0,0,0,7,39,44,1,0,1,1,0,1,2,2,1,0,0,0,0,120917,0,13,12000,0,2,2,0,0,2,0,2,0,0,1,2,1,1,0,2,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,2,2,0,0,0,0,2,0,2,0,0,0,0,0,0,0,0,0,2,0,1,0,2,250,2,1,12000,1,10,-250,0,0,918,0,2,0,0,0,2,2,0,0,0,0,3,49,1,0,2,44,2,8660,0,2,0,0,0,0,0,0,0,0,0,0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,0,-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,0,-1,0,0,0,0,0,0,0,0,0,0,0,0,-1,0,0,0,0,-1,0,0,0,-1,0,-1,0,-1,0,0,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,0,-1,0,-1,-1,0,0,0,0,-1,0,-1,0,-1,0,-1,0,-1,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,-1,0,0,0,0,0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,2,1,2,1,1,1,1,1,1,0,2,0,0,0,0,500,500,2,0,0,0,0,2,0,0,0,0,2,0,0,0,0,3,1,2,0,0,0,0,2,2,2,0,0,0,0,0,0,0,0,2,0,0,0,0,2,0,0,0,0,0,2,2,2,2,0,0,0,0,2,0,0,0,0,2,0,0,0,0,2,0,0,0,0,2,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,2,2,0,0,0,4020,0,0,2,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0,12000,0,0,0,2,0,0,0,0,2,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,400,0,32,100,0,1,41,3,2,1,1929,12000,5,0,0,2,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,0,2,0,0,0,1317,0,0,250,0,0.4635,5,-250,0,918,0.8796,61,1,0,7,1,7001,500,1,0,1,1,11483,10232,0,0,-12,3,12000,0,120917,0,0,1929,1,0,0,0,0,2,0,2,-12,0,2,2,0,0,0,0,2,0,0,0,2,2,0,0,0,0,0,0,0,2,0,0,0,2,5,18,7,11,13,8,1,1,0,0,3,52,1,1,12000,1,0,0,1,1929,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,701,8660,4020,0,0,0,0,0,0,0,0,0,110419,5,32019
3,1410320300969990901101,11,8,0,1,1,1,1,0,0,0,40,0,0,0,2,0,2,0,-1,1,-1,-1,-1,1,57,0,301,0,57,0,0,0,-1,0,1,2,0,1,-1,1,0,-1,6,0,0,0,1,0,0,0,0,0,0,2,2,2,1,1,1,0,0,0,0,0,0,1,1,2,-1,0,0,0,-1,-1,-1,-1,1,1,1,1,73,2,0,-1,0,2,0,0,0,0,0,0,0,0,0,0,0,113253,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,1,0,0,0,5,39,0,1,0,0,0,0,0,1,1,7,0,5,0,0,114623,0,16,0,0,2,1,0,0,2,0,2,0,0,5,1,3,3,0,2,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,2,1,0,0,0,0,2,0,2,0,0,0,0,0,0,0,0,0,2,0,2,0,2,0,0,0,0,2,1,0,0,0,0,0,2,0,0,0,2,2,0,0,0,0,5,1,1,0,0,0,2,0,0,2,0,0,2,0,0,0,0,0,0,0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,0,-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,0,-1,0,0,0,0,0,0,0,0,0,0,2,0,2,0,0,0,0,-1,1,1,0,-1,0,-1,0,-1,0,0,-1,2,0,-1,-1,-1,-1,-1,-1,-1,-1,0,-1,0,-1,-1,0,2,0,0,-1,0,-1,0,-1,0,-1,0,-1,2,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,2,0,0,0,0,0,0,0,0,0,2,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,2,-1,0,0,0,0,0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,3,1,1,1,1,1,1,1,0,2,0,0,0,0,1200,1200,2,0,0,0,0,2,0,0,0,0,2,0,0,0,0,1,0,2,0,0,0,0,1,2,1,0,0,0,0,0,0,0,0,2,0,0,0,0,2,0,0,0,0,0,2,1,1,2,0,0,0,0,2,0,0,0,0,2,0,0,0,0,2,0,0,0,0,2,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,1,2,0,2,0,0,0,0,2,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0,0,0,0,0,0,0,0,0,0,2,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1000,0,53,200,10727,1,41,3,2,1,0,10727,5,0,0,0,1,3,0,2,0,0,1,0,0,0,0,0,0,0,0,2,0,2,1,0,2,0,0,0,234,0,0,250,0,0.6535,3,-250,0,918,0.8796,73,1,0,5,1,8001,1600,2,0,2,0,16190,20084,0,1308,-12,3,22727,0,114623,0,0,1929,1,0,0,0,7848,1,2879,1,0,0,2,2,0,0,0,0,2,0,0,0,2,2,0,0,0,0,0,0,0,2,0,0,0,2,9,23,1,15,24,9,5,13,0,0,0,0,2,2,0,2,0,2,1,1929,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,801,0,-1,0,0,0,0,0,0,0,0,0,110419,6,32019
4,1410320300969990901102,11,8,0,2,2,1,1,0,0,0,48,0,0,0,2,0,2,0,-1,1,-1,-1,-1,1,57,0,57,0,57,0,0,0,-1,0,1,2,0,1,-1,1,0,20,1,0,22,0,7,4,4,0,0,0,2,2,2,2,2,2,2,0,0,0,0,0,0,2,2,-1,-1,0,0,0,1,-1,1,-1,0,1,0,1,37,1,43,-1,0,0,0,2,1,10,0,3,15,0,0,0,1,189450,0,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20,1,0,0,0,7,39,20,1,2,1,0,0,1,5,4,1,0,3,0,0,148079,0,8,12000,0,2,2,0,0,2,0,2,0,0,1,2,1,1,0,2,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,2,2,0,0,0,0,2,0,2,0,0,0,0,0,0,0,0,0,2,0,1,0,2,250,2,1,12000,1,5,-250,0,0,918,0,2,0,0,0,2,2,0,0,0,0,3,9,5,0,1,20,2,8370,0,2,0,0,0,0,0,0,0,0,0,0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,0,-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,0,-1,0,0,0,0,0,0,0,0,0,0,0,0,-1,0,0,0,0,-1,1,1,0,-1,0,-1,0,-1,0,0,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,0,-1,0,-1,-1,0,0,0,0,-1,0,-1,0,-1,0,-1,0,-1,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,-1,0,0,0,0,0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,2,1,2,1,1,1,1,1,1,0,2,0,0,0,0,400,400,2,0,0,0,0,2,0,0,0,0,2,0,0,0,0,3,1,2,0,0,0,0,2,2,2,0,0,0,0,0,0,0,0,2,0,0,0,0,2,0,0,0,0,0,2,2,2,2,0,0,0,0,2,0,0,0,0,2,0,0,0,0,2,0,0,0,0,2,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,2,2,0,0,0,4610,0,0,2,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0,12000,0,0,0,2,0,0,0,0,2,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,200,0,37,200,0,1,42,3,2,1,1929,12000,5,2,52,0,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,0,2,0,0,0,234,0,0,250,0,0.6535,3,-250,0,918,0.8796,73,0,0,5,1,8001,1600,2,0,2,0,16190,20084,0,1308,-12,3,22727,0,114623,0,0,1929,0,0,0,0,0,2,0,2,-12,0,2,2,0,0,0,0,2,0,0,0,2,2,0,0,0,0,0,0,0,2,0,0,0,2,5,16,7,10,15,8,2,7,0,0,3,52,1,1,12000,1,0,0,1,1929,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,802,8370,4610,0,0,0,0,0,0,0,0,0,110419,5,32019
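
Only a small share of the 799 columns is used later on, so an alternative to loading the whole file is to pass the wanted column names to `pd.read_csv` through its `usecols` parameter. The sketch below is not what the notebook does; it uses a tiny in-memory CSV (`toy_csv`, made-up values) and hypothetical extra columns `EXTRA_1` and `EXTRA_2` to stand in for the unused variables.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for pppub19.csv (made-up values).
toy_csv = io.StringIO(
    "PH_SEQ,A_AGE,A_SEX,EXTRA_1,EXTRA_2\n"
    "4,21,1,0,0\n"
    "6,85,2,0,0\n"
)

# Columns to keep; in the notebook this would be the full list of selected names.
wanted = ["PH_SEQ", "A_AGE", "A_SEX"]

# usecols makes read_csv parse only the named columns, which lowers memory use.
person_small = pd.read_csv(toy_csv, usecols=wanted)

print(person_small.shape)
```

Reading the full file first, as done above, has the advantage that all 799 column names can be inspected before choosing.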


In [3]:
#create a subset with the variable 'PH_SEQ', needed to join with the household data,
#and our initial choice of variables
person_variables = person[['PH_SEQ','A_AGE', 'A_SEX', 'PRDTRACE', 'PEHSPNON', 'A_MARITL', 
                           'A_HGA', 'PECERT1', 'P_STAT', 'PRDISFLG', 'HEA', 'NOW_COV',
                           'PRCITSHP', 'PENATVTY', 'PEINUSYR','PEMLR', 'A_CLSWKR',
                           'A_MJIND', 'PTOTVAL', 'TAX_INC', 'PERLIS', 'CHELSEW_YN',
                           'CHSP_YN', 'CHSP_VAL', 'HHDREL', 'FIN_YN']]

#rename the columns with meaningful names
person_new = person_variables.rename(columns = {'PH_SEQ'    :'person_household_id',
                                                'A_AGE'     :'age',
                                                'A_SEX'     :'gender',
                                                'PRDTRACE'  :'race',
                                                'PEHSPNON'  :'span_hisp_latin',
                                                'A_MARITL'  :'marital_status',
                                                'A_HGA'     :'education_level',
                                                'PECERT1'   :'active_certification_license',      
                                                'P_STAT'    :'civilian_or_army',
                                                'PRDISFLG'  :'disability',
                                                'HEA'       :'health_status',
                                                'NOW_COV'   :'insurance_coverage_flag',
                                                'PRCITSHP'  :'citizenship',
                                                'PENATVTY'  :'birth_country',
                                                'PEINUSYR'  :'immigration_period',
                                                'PEMLR'     :'employed_or_not',
                                                'A_CLSWKR'  :'worker_class',
                                                'A_MJIND'   :'industry',
                                                'PTOTVAL'   :'total_income',
                                                'TAX_INC'   :'taxable_income',
                                                'PERLIS'    :'poverty_category',
                                                'CHELSEW_YN':'child_outside_household',
                                                'CHSP_YN'   :'child_support_flag',
                                                'CHSP_VAL'  :'annual_child_support',
                                                'HHDREL'    :'household_status',
                                                'FIN_YN'    :'financial_assistance_flag'})

#print shape
print("df shape is: ", person_new.shape)
#check results
person_new.head()

df shape is:  (180101, 26)


Unnamed: 0,person_household_id,age,gender,race,span_hisp_latin,marital_status,education_level,active_certification_license,civilian_or_army,disability,health_status,insurance_coverage_flag,citizenship,birth_country,immigration_period,employed_or_not,worker_class,industry,total_income,taxable_income,poverty_category,child_outside_household,child_support_flag,annual_child_support,household_status,financial_assistance_flag
0,4,21,1,1,2,7,37,2,1,2,3,2,1,57,0,1,1,11,18000,6000,3,1,2,0,1,2
1,6,85,2,1,2,4,39,2,1,1,3,1,1,57,0,5,0,0,21780,4400,4,2,0,0,1,2
2,7,61,2,1,2,7,39,2,1,2,3,2,1,57,0,1,1,11,12000,0,1,2,0,0,1,2
3,8,73,2,1,2,5,39,1,1,1,5,1,1,57,0,6,0,0,10727,0,3,2,0,0,1,2
4,8,37,1,1,2,7,39,2,1,2,3,2,1,57,0,1,1,10,12000,0,3,2,0,0,5,2
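
After renaming, it can be useful to confirm that each categorical column only holds codes listed in the data dictionary. The following is a minimal sketch of such a check, not a step from the original notebook: `allowed` is a hypothetical dictionary covering three of the columns, and `toy` holds made-up values, including one deliberately invalid code (9).

```python
import pandas as pd

# Allowed codes for a few columns, taken from the data dictionary above.
allowed = {
    "gender": {1, 2},
    "civilian_or_army": {1, 2, 3},
    "poverty_category": {1, 2, 3, 4},
}

# Toy rows (made-up values); poverty_category 9 is deliberately invalid.
toy = pd.DataFrame({
    "gender": [1, 2, 2],
    "civilian_or_army": [1, 1, 3],
    "poverty_category": [3, 4, 9],
})

# For each column, collect the values that are not listed in the dictionary.
unexpected = {}
for col, codes in allowed.items():
    extra = sorted(set(toy[col].unique().tolist()) - codes)
    if extra:
        unexpected[col] = extra

print(unexpected)
```

An empty result would mean every checked column matches the dictionary; any entry points to a code that needs a closer look in the Census documentation.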


In [4]:
#next read in the household data file from a local environment 
file_location2 = "/Users/michael/Desktop/hhpub19.csv"
household = pd.read_csv(file_location2)

#print shape
print("df shape is: ", household.shape)
#check results
household.head()

df shape is:  (94633, 135)


Unnamed: 0,H_IDNUM,GEREG,GESTFIPS,GEDIV,HEFAMINC,H_MONTH,H_YEAR,H_TENURE,H_HHNUM,H_LIVQRT,H_RESPNM,H_TELHHD,H_TELAVL,H_TELINT,H1TENURE,H1LIVQRT,H1TELHHD,H1TELAVL,H1TELINT,H_NUMPER,H_HHTYPE,H_TYPEBC,H_MIS,GESTCEN,HANNVAL,HANN_YN,HCHCARE_VAL,HCHCARE_YN,HCOV,HCSPVAL,HCSP_YN,HDISVAL,HDIS_YN,HDIVVAL,HDIV_YN,HDSTVAL,HDST_YN,HEARNVAL,HEDVAL,HED_YN,HENGAST,HENGVAL,HFDVAL,HFINVAL,HFIN_YN,HFLUNCH,HFLUNNO,HFOODMO,HFOODNO,HFOODSP,HFRVAL,HH5TO18,HHINC,HHOTLUN,HHOTNO,HHSTATUS,HH_HI_UNIV,HINC_FR,HINC_SE,HINC_UC,HINC_WC,HINC_WS,HINTVAL,HINT_YN,HLORENT,HMCAID,HNUMFAM,HOIVAL,HOI_YN,HOTHVAL,HPAWVAL,HPAW_YN,HPCTCUT,HPENVAL,HPEN_YN,HPRES_MORT,HPRIV,HPROP_VAL,HPUB,HPUBLIC,HRECORD,HRNTVAL,HRNT_YN,HRNUMWIC,HRWICYN,HSEVAL,HSSIVAL,HSSI_YN,HSSVAL,HSS_YN,HSUP_WGT,HSURVAL,HSUR_YN,HTOP5PCT,HTOTVAL,HUCVAL,HUNDER15,HUNDER18,HUNITS,HVETVAL,HVET_YN,HWCVAL,HWSVAL,H_SEQ,I_CHCAREVAL,I_HENGAS,I_HENGVA,I_HFDVAL,I_HFLUNC,I_HFLUNN,I_HFOODM,I_HFOODN,I_HFOODS,I_HHOTLU,I_HHOTNO,I_HLOREN,I_HPUBLI,I_HUNITS,I_PROPVAL,NOW_HCOV,NOW_HMCAID,NOW_HPRIV,NOW_HPUB,HRHTYPE,THCHCARE_VAL,THPROP_VAL,GTCBSA,GTCO,GTCBSAST,GTCBSASZ,GTCSA,GTMETSTA,GTINDVPC,FILEDATE,MMYY
0,2031046575209908011,1,23,1,-1,3,2019,0,1,1,0,0,0,0,0,0,0,0,0,0,3,1,7,11,0,0,-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,2,0,110419,32019
1,11000302225345308011,1,23,1,-1,3,2019,0,1,1,0,0,0,0,0,0,0,0,0,0,3,1,6,11,0,0,-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,2,0,110419,32019
2,2031054530232108011,1,23,1,-1,3,2019,0,1,1,0,0,0,0,0,0,0,0,0,0,3,1,6,11,0,0,-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,2,0,110419,32019
3,1000691245394308011,1,23,1,6,3,2019,1,1,5,1,1,0,1,0,0,0,0,0,1,1,0,5,11,0,2,-1,0,3,0,2,0,2,0,2,0,2,18000,0,2,1,900,0,0,2,0,0,0,0,2,0,0,8,0,0,2,1,2,2,2,2,1,0,2,0,3,1,0,2,0,0,2,3,0,2,2,3,8000,3,0,1,0,2,0,0,0,0,2,0,2,203167,0,2,2,18000,0,0,0,1,0,2,0,18000,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,3,3,3,6,0,0,0,0,3,0,0,2,0,110419,32019
4,14320203005595208011,1,23,1,-1,3,2019,0,1,1,0,0,0,1,0,0,0,0,0,0,3,1,8,11,0,0,-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,2,0,110419,32019


In [5]:
#create subset with 'H_SEQ' needed to join the data with personal data
#and initial choice of variables
household_variables = household[['H_SEQ','GTMETSTA','GESTFIPS','GEREG','HTOTVAL']]

#rename the columns with meaningful names
household_new = household_variables.rename(columns = {'H_SEQ'  :'household_id',
                                                     'GTMETSTA':'metropolitan_status',
                                                     'GESTFIPS':'state',
                                                     'GEREG'   :'region',
                                                     'HTOTVAL' :'household_income'
                                                    })
#print shape
print("df shape is: ", household_new.shape)
#check results
household_new.head()

df shape is:  (94633, 5)


Unnamed: 0,household_id,metropolitan_status,state,region,household_income
0,1,2,23,1,0
1,2,2,23,1,0
2,3,2,23,1,0
3,4,2,23,1,18000
4,5,2,23,1,0
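
The cell above loads all 135 household columns and then keeps five of them. As a hedged alternative, the same subset could be selected while reading the file. The sketch below uses a small in-memory CSV with made-up rows (not the real `hhpub19.csv`), and the names `csv_text`, `wanted`, and `household_small` are illustrative only.

```python
import io
import pandas as pd

# Toy stand-in for hhpub19.csv: made-up rows, only a handful of the 135 columns.
csv_text = io.StringIO(
    "H_SEQ,GEREG,GESTFIPS,HFOODSP,GTMETSTA,HTOTVAL\n"
    "4,1,23,2,2,18000\n"
    "5,1,23,0,2,0\n"
)

# Original name -> new name, in the column order we want to end up with.
wanted = {
    'H_SEQ': 'household_id',
    'GTMETSTA': 'metropolitan_status',
    'GESTFIPS': 'state',
    'GEREG': 'region',
    'HTOTVAL': 'household_income',
}

# usecols limits parsing to the listed columns; it does not reorder them,
# so we index with the same list afterwards to fix the column order.
household_small = (
    pd.read_csv(csv_text, usecols=list(wanted))[list(wanted)]
    .rename(columns=wanted)
)
print("df shape is: ", household_small.shape)
```

Reading only the needed columns mainly saves memory and load time; the resulting frame should match what the select-then-rename approach produces.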


In [6]:
#join the data sets using household_id
data = pd.merge(person_new,household_new,left_on='person_household_id',right_on='household_id',how='left')

#move the target variable to the last column
#(pop removes the column and returns it, so no index alignment with person_new is needed)
data['financial_assistance_flag'] = data.pop('financial_assistance_flag')

#drop id columns
data.drop(columns=['person_household_id','household_id'],inplace=True)

#print shape
print("df shape is: ", data.shape)
#check results
data.head()

df shape is:  (180101, 29)


Unnamed: 0,age,gender,race,span_hisp_latin,marital_status,education_level,active_certification_license,civilian_or_army,disability,health_status,insurance_coverage_flag,citizenship,birth_country,immigration_period,employed_or_not,worker_class,industry,total_income,taxable_income,poverty_category,child_outside_household,child_support_flag,annual_child_support,household_status,metropolitan_status,state,region,household_income,financial_assistance_flag
0,21,1,1,2,7,37,2,1,2,3,2,1,57,0,1,1,11,18000,6000,3,1,2,0,1,2,23,1,18000,2
1,85,2,1,2,4,39,2,1,1,3,1,1,57,0,5,0,0,21780,4400,4,2,0,0,1,2,23,1,21780,2
2,61,2,1,2,7,39,2,1,2,3,2,1,57,0,1,1,11,12000,0,1,2,0,0,1,2,23,1,12000,2
3,73,2,1,2,5,39,1,1,1,5,1,1,57,0,6,0,0,10727,0,3,2,0,0,1,2,23,1,22727,2
4,37,1,1,2,7,39,2,1,2,3,2,1,57,0,1,1,10,12000,0,3,2,0,0,5,2,23,1,22727,2
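
The join above assumes that every person row matches exactly one household row. A minimal sketch of how that assumption could be checked is shown below, using the `validate` and `indicator` options of `pd.merge` on toy frames. The frames `person_toy` and `household_toy` and their values are illustrative and are not taken from the full data.

```python
import pandas as pd

# Toy frames: several persons can share one household id.
person_toy = pd.DataFrame({
    'person_household_id': [4, 6, 8, 8],
    'age': [21, 85, 73, 37],
    'financial_assistance_flag': [2, 2, 1, 2],
})
household_toy = pd.DataFrame({
    'household_id': [4, 6, 8],
    'household_income': [18000, 21780, 22727],
})

merged = pd.merge(
    person_toy, household_toy,
    left_on='person_household_id', right_on='household_id',
    how='left',
    validate='many_to_one',  # raises MergeError if household_id is duplicated
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# Count person rows that found no matching household.
unmatched = int((merged['_merge'] != 'both').sum())
print("unmatched person rows: ", unmatched)

# Drop helper and id columns, then move the target to the last position.
merged = merged.drop(columns=['_merge', 'person_household_id', 'household_id'])
merged['financial_assistance_flag'] = merged.pop('financial_assistance_flag')
```

If `unmatched` were greater than zero, the household columns of those rows would be filled with NaN by the left join, which is worth knowing before modeling.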


In [7]:
#extract only observations of millennials, born from 1981 to 1996 
#born in 1981 means 38 in 2019 (the data is from 2019)
#born in 1996 means 23 in 2019 (the data is from 2019)
data_sub = data[(data.age>=23) & (data.age<=38)]

#extract only observations of millennials living without non-family members
#since we will be looking at household income
#we cannot consider the income of a non-family member (e.g. a roommate)
#class 7 = Nonrelative of householder
#class 8 = Secondary individual
data_sub = data_sub[(data_sub.household_status!=7) & (data_sub.household_status!=8)]

#extract only civilians, since we know from the data dictionary
#that for several variables there is no data for Armed Forces (they are not in universe)
#variable civilian_or_army = 1 = adult civilian
data_sub = data_sub[data_sub.civilian_or_army==1]

#drop the civilian_or_army column, since now we have only civilians
data_sub.drop(columns=['civilian_or_army'],inplace=True)

#print shape
print("df shape is: ", data_sub.shape)
#check results
data_sub.head()

df shape is:  (32883, 28)


Unnamed: 0,age,gender,race,span_hisp_latin,marital_status,education_level,active_certification_license,disability,health_status,insurance_coverage_flag,citizenship,birth_country,immigration_period,employed_or_not,worker_class,industry,total_income,taxable_income,poverty_category,child_outside_household,child_support_flag,annual_child_support,household_status,metropolitan_status,state,region,household_income,financial_assistance_flag
4,37,1,1,2,7,39,2,2,3,2,1,57,0,1,1,10,12000,0,3,2,0,0,5,2,23,1,22727,2
19,33,1,6,2,1,39,2,2,3,2,1,57,0,1,6,3,50000,0,2,2,0,0,2,2,23,1,50200,2
34,33,1,1,2,1,40,2,2,2,1,1,57,0,1,1,5,24003,4,1,2,0,0,1,2,23,1,24004,2
35,29,2,1,2,1,40,2,2,3,1,1,57,0,7,0,0,1,0,1,2,0,0,2,2,23,1,24004,2
43,34,2,1,2,1,41,1,2,2,1,1,57,0,1,1,10,43560,71900,4,2,0,0,1,2,23,1,95900,2
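
The three filters above can also be written as one boolean mask, which some readers may find easier to audit. The sketch below is an illustration on a toy frame, not a replacement for the cell above. The rows are made up, and the value 2 in `civilian_or_army` is only a stand-in for "not an adult civilian".

```python
import pandas as pd

# Toy frame with made-up rows covering each exclusion rule.
toy = pd.DataFrame({
    'age':              [21, 23, 30, 38, 39, 33, 25],
    'household_status': [1,  1,  7,  2,  1,  8,  1],
    'civilian_or_army': [1,  1,  1,  1,  1,  1,  2],
})

mask = (
    toy['age'].between(23, 38)                # inclusive on both ends by default
    & ~toy['household_status'].isin([7, 8])   # exclude nonrelatives and secondary individuals
    & (toy['civilian_or_army'] == 1)          # keep adult civilians only
)

# Apply the mask and drop the now-constant column.
toy_sub = toy[mask].drop(columns=['civilian_or_army'])
print("df shape is: ", toy_sub.shape)
```

Only the rows aged 23 and 38 with household status 1 and 2 survive here, which confirms that both age bounds are kept.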


In [8]:
#now we can export the subset we created as a .csv file
#this lets us upload it to GitHub so the rest of our analysis is easily reproducible
#running this cell writes the file to the working folder of this Jupyter notebook,
#i.e. the folder from which the notebook is run
data_sub.to_csv('final_project_data.csv',index=False,sep=',')
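
As a quick sanity check on the export, the file can be read back and compared with the frame that was written. The sketch below assumes a toy frame and a temporary directory, so it does not touch the real `final_project_data.csv`. The names `toy`, `reloaded`, and `round_trip_ok` are illustrative.

```python
import os
import tempfile
import pandas as pd

# Toy frame standing in for data_sub (made-up subset of columns).
toy = pd.DataFrame({
    'age': [37, 33],
    'household_income': [22727, 50200],
    'financial_assistance_flag': [2, 2],
})

# Write to a temporary folder so the working directory is left untouched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'final_project_data.csv')
    toy.to_csv(path, index=False)  # index=False keeps the row index out of the file
    reloaded = pd.read_csv(path)

# With index=False the reloaded frame gets a fresh 0..n-1 index, same as toy.
round_trip_ok = reloaded.equals(toy)
print("round trip ok: ", round_trip_ok)
```

Note that `data_sub` keeps its original row labels (4, 19, 34, ...) after filtering, so a comparison against the real frame would need `data_sub.reset_index(drop=True)` first.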