# 20151230-predict-household-income-from-census

Related post:  
https://stharrold.github.io/20151230-predict-household-income-from-census.html

Purpose: Predict total annual household income.

## Initialization

### Imports

In [1]:
cd ~

/home/samuel_harrold


In [30]:
# Import standard packages.
import collections
import json
import os
import pdb # TEST: Comment out pdb after testing.
import subprocess
import sys
# Import installed packages.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Import local packages.
sys.path.insert(
    0,
    os.path.join(os.path.curdir, 'stharrold.github.io/content/static/dsdemos'))
# TEST: Comment out autoreload after testing.
%reload_ext autoreload
%autoreload 2
import dsdemos as dsd
# IPython magic.
%matplotlib inline

### Globals

In [3]:
path_static = os.path.join(os.path.expanduser(r'~'), r'stharrold.github.io/content/static')
basename = r'20151230-predict-household-income-from-census'
path_disk = os.path.abspath(r'/mnt/disk-20151227t211000z/')
path_acs = os.path.join(path_disk, r'www2-census-gov/programs-surveys/acs/')
path_csv = os.path.join(path_acs, r'data/pums/2013/5-Year/ss13hdc.csv') # 'hdc' = 'housing DC'
path_ddict = os.path.join(path_acs, r'tech_docs/pums/data_dict/PUMS_Data_Dictionary_2009-2013.txt')

## Extract-transform-load

**TODO:**
* Just use pandas. Acknowledge dask.

In [4]:
%%time
with open(path_csv) as fobj:
    nlines = sum(1 for _ in fobj)
print("{path}:".format(path=path_csv))
print("size (MB) = {size:.1f}".format(size=os.path.getsize(path_csv)/1e6))
print("num lines = {nlines}".format(nlines=nlines))
df = pd.read_csv(path_csv)
print("df RAM usage (MB) = {mem:.1f}".format(mem=df.memory_usage().sum()/1e6))

/mnt/disk-20151227t211000z/www2-census-gov/programs-surveys/acs/data/pums/2013/5-Year/ss13hdc.csv:
size (MB) = 13.5
num lines = 17501
df RAM usage (MB) = 28.7
CPU times: user 444 ms, sys: 20 ms, total: 464 ms
Wall time: 466 ms
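
At 13.5 MB the file fits comfortably in memory, so a single `pd.read_csv` call is enough here. For larger PUMS files, one pandas-only option is to read in chunks with the `chunksize` parameter. The sketch below uses an in-memory CSV as a stand-in for `path_csv`; the column values, the chunk size, and the choice to drop rows with `NP == 0` (assumed here to mark vacant units) are illustrative assumptions.

```python
import io

import pandas as pd

# In-memory stand-in for a large CSV file (illustrative values only).
csv_text = "SERIALNO,NP,HINCP\n1,2,50000\n2,0,\n3,4,120000\n4,1,30000\n"

nrows = 0
chunks = []
# `chunksize` makes read_csv return an iterator of DataFrames.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    nrows += len(chunk)
    # Filter each chunk before keeping it (assumption: NP == 0 marks a vacant unit).
    chunks.append(chunk.loc[chunk['NP'] > 0])
df_small = pd.concat(chunks, ignore_index=True)
print(nrows, len(df_small))
```

Filtering inside the loop keeps peak memory close to the size of one chunk plus the retained rows.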


In [5]:
percentiles = [0.1587, 0.5000, 0.8413] # -1 std. dev., mean/median, +1 std. dev. for normal dist.
df.describe(percentiles=percentiles, include='all')

Unnamed: 0,insp,RT,SERIALNO,DIVISION,PUMA00,PUMA10,REGION,ST,ADJHSG,ADJINC,...,WGTP71,WGTP72,WGTP73,WGTP74,WGTP75,WGTP76,WGTP77,WGTP78,WGTP79,WGTP80
count,6561.0,17500,17500.0,17500.0,17500.0,17500.0,17500.0,17500.0,17500.0,17500.0,...,17500.0,17500.0,17500.0,17500.0,17500.0,17500.0,17500.0,17500.0,17500.0,17500.0
unique,,1,,,,,,,,,...,,,,,,,,,,
top,,H,,,,,,,,,...,,,,,,,,,,
freq,,17500,,,,,,,,,...,,,,,,,,,,
mean,999.282731,,2011068000000.0,5.0,56.427371,37.764171,3.0,11.0,1039364.231657,1048478.770229,...,17.050857,17.043486,17.05,17.049029,17.048,17.051543,17.0532,17.047029,17.046971,17.051486
std,1085.174484,,1401911000.0,0.0,55.291036,55.358495,0.0,0.0,31877.254257,29598.26989,...,17.593886,17.740566,17.534604,17.555515,17.558942,17.574232,17.623017,17.802284,17.267472,17.710924
min,0.0,,2009000000000.0,5.0,-9.0,-9.0,3.0,11.0,1000000.0,1007549.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15.9%,200.0,,2009001000000.0,5.0,-9.0,-9.0,3.0,11.0,1000000.0,1007549.0,...,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
50%,790.0,,2011001000000.0,5.0,101.0,-9.0,3.0,11.0,1035725.0,1054614.0,...,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0
84.1%,1500.0,,2013000000000.0,5.0,104.0,104.0,3.0,11.0,1086032.0,1085467.0,...,30.0,30.0,30.0,31.0,30.0,30.0,30.0,30.0,31.0,30.0
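
The percentiles 0.1587 and 0.8413 are the standard normal cumulative distribution function evaluated at -1 and +1 standard deviations. A minimal check using only the standard library (the helper name `norm_cdf` is ours, not part of the notebook):

```python
import math


def norm_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Percentiles at -1, 0, and +1 standard deviations, rounded to 4 places.
percentiles = [round(norm_cdf(z), 4) for z in (-1.0, 0.0, 1.0)]
print(percentiles)  # [0.1587, 0.5, 0.8413]
```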


In [6]:
ddict = dsd.census.parse_pumsdatadict(path=path_ddict)

In [7]:
ddict.keys()

odict_keys(['title', 'date', 'record_types', 'notes'])

In [8]:
ddict['title']

'2009-2013 ACS PUMS DATA DICTIONARY'

In [9]:
ddict['date']

'August 7, 2015'

In [10]:
ddict['record_types'].keys()

odict_keys(['HOUSING RECORD', 'PERSON RECORD'])

In [11]:
ddict['notes']

['*  In cases where the SOC occupation code ends in X(s) or Y(s), two or more SOC',
 'occupation codes were aggregated to correspond to a specific Census occupation',
 'code. In these cases, the Census occupation description is used for the SOC',
 'occupation title."',
 '** These codes are pseudo codes developed by the Census Bureau and are not',
 '   official or equivalent NAICS or SOC codes.',
 'Legend to Identify NAICS Equivalents',
 '     M = Multiple NAICS codes',
 '     P = Part of a NAICS code - NAICS code split between two or more Census',
 '         codes',
 '     S = Not specified Industry in NAICS sector - Specific to Census codes',
 '         only',
 '     Z = Exception to NAICS code - Part of NAICS industry but has a unique',
 '         Census code',
 'Occupation codes in OCCP10, OCCP12, SOCP10, and SOCP12 are based on the Standard',
 'Occupational Classification 2010. Occupation codes in OCCP02 and SOCP00 are based',
 'on Standard Occupational Classification 2000.  For mo

In [13]:
ddict['record_types']['HOUSING RECORD'].keys()

odict_keys(['RT', 'SERIALNO', 'DIVISION', 'PUMA00', 'PUMA10', 'REGION', 'ST', 'ADJHSG', 'ADJINC', 'WGTP', 'NP', 'TYPE', 'ACR', 'AGS', 'BATH', 'BDSP', 'BLD', 'BUS', 'CONP', 'ELEP', 'FS', 'FULP', 'GASP', 'HFL', 'INSP', 'MHP', 'MRGI', 'MRGP', 'MRGT', 'MRGX', 'REFR', 'RMSP', 'RNTM', 'RNTP', 'RWAT', 'RWATPR', 'SINK', 'SMP', 'STOV', 'TEL', 'TEN', 'TOIL', 'VACS', 'VALP', 'VEH', 'WATP', 'YBL', 'FES', 'FINCP', 'FPARC', 'GRNTP', 'GRPIP', 'HHL', 'HHT', 'HINCP', 'HUGCL', 'HUPAC', 'HUPAOC', 'HUPARC', 'KIT', 'LNGI', 'MULTG', 'MV', 'NOC', 'NPF', 'NPP', 'NR', 'NRC', 'OCPIP', 'PARTNER', 'PLM', 'PSF', 'R18', 'R60', 'R65', 'RESMODE', 'SMOCP', 'SMX', 'SRNT', 'SVAL', 'TAXP', 'WIF', 'WKEXREL', 'WORKSTAT', 'FACRP', 'FAGSP', 'FBATHP', 'FBDSP', 'FBLDP', 'FBUSP', 'FCONP', 'FELEP', 'FFSP', 'FFULP', 'FGASP', 'FHFLP', 'FINSP', 'FKITP', 'FMHP', 'FMRGIP', 'FMRGP', 'FMRGTP', 'FMRGXP', 'FMVP', 'FPLMP', 'FREFRP', 'FRMSP', 'FRNTMP', 'FRNTP', 'FRWATP', 'FRWATPRP', 'FSINKP', 'FSMP', 'FSMXHP', 'FSMXSP', 'FSTOVP', 'FTAXP', 

In [None]:
features = {
    ''
    }

In [12]:
features = collections.OrderedDict()
for col in df.columns.values:
    features[col] = collections.OrderedDict()
    features[col]['include'] = None # TODO: Decide per column.

array(['insp', 'RT', 'SERIALNO', 'DIVISION', 'PUMA00', 'PUMA10', 'REGION',
       'ST', 'ADJHSG', 'ADJINC', 'WGTP', 'NP', 'TYPE', 'ACR', 'AGS',
       'BATH', 'BDSP', 'BLD', 'BUS', 'CONP', 'ELEP', 'FS', 'FULP', 'GASP',
       'HFL', 'MHP', 'MRGI', 'MRGP', 'MRGT', 'MRGX', 'REFR', 'RMSP',
       'RNTM', 'RNTP', 'RWAT', 'RWATPR', 'SINK', 'SMP', 'STOV', 'TEL',
       'TEN', 'TOIL', 'VACS', 'VALP', 'VEH', 'WATP', 'YBL', 'FES', 'FINCP',
       'FPARC', 'GRNTP', 'GRPIP', 'HHL', 'HHT', 'HINCP', 'HUGCL', 'HUPAC',
       'HUPAOC', 'HUPARC', 'KIT', 'LNGI', 'MULTG', 'MV', 'NOC', 'NPF',
       'NPP', 'NR', 'NRC', 'OCPIP', 'PARTNER', 'PLM', 'PSF', 'R18', 'R60',
       'R65', 'RESMODE', 'SMOCP', 'SMX', 'SRNT', 'SVAL', 'TAXP', 'WIF',
       'WKEXREL', 'WORKSTAT', 'FACRP', 'FAGSP', 'FBATHP', 'FBDSP', 'FBLDP',
       'FBUSP', 'FCONP', 'FELEP', 'FFSP', 'FFULP', 'FGASP', 'FHFLP',
       'FINSP', 'FKITP', 'FMHP', 'FMRGIP', 'FMRGP', 'FMRGTP', 'FMRGXP',
       'FMVP', 'FPLMP', 'FREFRP', 'FRMSP', 'FRNTMP', 'F

In [182]:
for key in ddict['record_types']['HOUSING RECORD']:
    print("'{key}': __clude. {desc}. TODO.  ".format(
        key=key, desc=ddict['record_types']['HOUSING RECORD'][key]['description']))
    if key == 'FACRP':
        break

'RT': __clude. Record Type. TODO.  
'SERIALNO': __clude. Housing unit/GQ person serial number. TODO.  
'DIVISION': __clude. Division code. TODO.  
'PUMA00': __clude. Public use microdata area code (PUMA) based on Census 2000 definition for data collected prior to 2012. Use in combination with PUMA10.. TODO.  
'PUMA10': __clude. Public use microdata area code (PUMA) based on 2010 Census definition for data collected in 2012 or later. Use in combination with PUMA00.. TODO.  
'REGION': __clude. Region code. TODO.  
'ST': __clude. State Code. TODO.  
'ADJHSG': __clude. Adjustment factor for housing dollar amounts (6 implied decimal places). TODO.  
'ADJINC': __clude. Adjustment factor for income and earnings dollar amounts (6 implied decimal places). TODO.  
'WGTP': __clude. Housing Weight. TODO.  
'NP': __clude. Number of person records following this housing record. TODO.  
'TYPE': __clude. Type of unit. TODO.  
'ACR': __clude. Lot size. TODO.  
'AGS': __clude. Sales of Agriculture Produ

In [181]:
ddict['record_types']['HOUSING RECORD']['RT']

OrderedDict([('length', '5'),
             ('description', 'Housing Weight replicate 2'),
             ('var_codes',
              OrderedDict([('-9999..09999',
                            'Integer weight of housing unit')]))])
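
The record above stores its value codes under `'var_codes'`, with a numeric range written using `'..'` (here `'-9999..09999'`). Assuming that convention holds for the other variables, a sketch for separating range-valued variables from purely categorical ones could look like this. The helper `is_range_valued` and the toy `record_type` dictionary are hypothetical stand-ins, not part of `dsdemos`:

```python
import collections


def is_range_valued(variable):
    """Return True if any value code of the variable is a range like 'lo..hi'."""
    return any('..' in code for code in variable.get('var_codes', {}))


# Toy stand-in for ddict['record_types']['HOUSING RECORD'] (illustrative only).
record_type = collections.OrderedDict([
    ('WGTP2', {'var_codes': {'-9999..09999': 'Integer weight of housing unit'}}),
    ('RT', {'var_codes': {'H': 'Housing record (toy description)'}}),
])

categorical = [key for key, var in record_type.items() if not is_range_valued(var)]
continuous = [key for key, var in record_type.items() if is_range_valued(var)]
print(categorical, continuous)
```

Variables that mix special codes with a range (for example, an N/A code plus a dollar range) would count as range-valued under this rule and would still need their special codes handled separately.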

## Select features

**Notes:**
* Example consumer databases: http://www.consumerreports.org/cro/money/consumer-protection/big-brother-is-watching/overview/index.htm?rurl=http%3A%2F%2Fwww.consumerreports.org%2Fcro%2Fmoney%2Fconsumer-protection%2Fbig-brother-is-watching%2Foverview%2Findex.htm
* Random forests are invariant to monotonic transformations of the features, so the features do not need to be rescaled or linearized.
* Cast all values to floats so that they are compatible with most algorithms and support </> comparisons. Otherwise the features are less informationally dense and may require a deeper tree structure to be useful.
* To "map to float" ('b' is N/A, mapped to 0; 1 is Yes; 2 is No; other values are special):  
```python
test = pd.DataFrame(data=[['b', 1.0], ['1', 1.0], ['2', 1.0], ['3', 1.1], ['4', 1.1]], columns=['COL', 'ADJ'])
tfmask = test['COL'].isin(['b'])
test.loc[tfmask, 'COL'] = 0.0
test['COL'] = test['COL'].astype(float)
print(test.dtypes)
test
```
* To "adjust for inflation":  
```python
tfmask = test['COL'] >= 3.0
test.loc[tfmask, 'COL'] *= test.loc[tfmask, 'ADJ']
test
```
* TODO: Remove vacant units ('NP') from data frame.
* TODO: Filter categorical variables from metadata (those without '..').
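
The "map to float" and "adjust for inflation" notes above can be combined into one helper. This is a sketch, not the notebook's implementation: the function name `map_and_adjust`, its parameters, and the toy values are assumptions. It treats codes made up only of `'b'` as N/A mapped to 0, and divides the adjustment factor by 1e6 because the data dictionary describes 'ADJHSG' and 'ADJINC' as having 6 implied decimal places.

```python
import pandas as pd


def map_and_adjust(df, col, adj_col, min_dollar):
    """Map a coded column to float and inflation-adjust its dollar amounts.

    Values made up only of 'b' are treated as N/A and mapped to 0. Values at
    or above `min_dollar` are treated as dollar amounts and scaled by the
    adjustment factor, which is stored with 6 implied decimal places.
    """
    values = df[col].astype(str).str.strip()
    is_na = values.str.match(r'^b+$')
    result = values.mask(is_na, '0').astype(float)
    is_dollar = result >= min_dollar
    factor = df[adj_col].astype(float) / 1e6
    # Keep special codes as-is; scale only the dollar amounts.
    return result.where(~is_dollar, result * factor)


# Toy data (illustrative only): N/A, two special codes, and one dollar amount.
toy = pd.DataFrame({
    'WATP': ['bbbb', '0001', '0002', '0150'],
    'ADJHSG': [1086032, 1086032, 1000000, 1086032],
})
toy['WATP_adj'] = map_and_adjust(toy, col='WATP', adj_col='ADJHSG', min_dollar=3.0)
print(toy)
```

Passing `min_dollar` per column keeps the special low-valued codes (such as "included in rent") out of the inflation adjustment.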

'insp': Exclude. Quality control inspection metadata.  
'RT': Exclude. All same.  
'SERIALNO': Index. All unique.  
'DIVISION', 'REGION': Exclude. Use more precise location.  
'PUMA00', 'PUMA10', 'ST': Include. TODO: Combine and lookup lat-lon coordinates from census.gov.  
'ADJHSG', 'ADJINC': Include. TODO: Determine whether dollar amounts still need to be adjusted for inflation or have already been adjusted. https://www.census.gov/library/publications/2009/acs/pums.html App 5.  
'WGTP': Include. Use housing weights. TODO: Work through user verification file.  
'NP': Include. Map number of people in house to float.  
'TYPE': Include. Map type of housing unit to median income if correlated.  
'ACR': Include. Map lot size (acres) to float.  
'AGS': Include. Map agricultural product sales to float.  
'BATH': Include. Map has bathtub/shower to float.  
'BDSP': Include. Map number of bedrooms to float.  
'BLD': Include. Map units in structure to median income if correlated.  
'BUS': Include. Map has business/medical office on property to float.  
'CONP': Include. Map condo fee to float.  
'ELEP': Include. Map electricity cost to float. Adjust \$3+ with 'ADJHSG'.  
'FS': Include. Map food stamps to float.  
'FULP': Include. Map fuel cost to float. Adjust \$3+ with 'ADJHSG'.  
'GASP': Include. Map gas cost to float. Adjust \$4+ with 'ADJHSG'.  
'HFL': Include. Map house heating fuel to median income if correlated.  
'MHP': Include. Map mobile home costs to float. Adjust all with 'ADJHSG'.  
'MRGI': Include. Map first mortgage payment includes insurance to median income if correlated.  
'MRGP': Include. Map first mortgage payment to float.  
'MRGT': Include. Map first mortgage payment includes real estate taxes to median income if correlated.  
'MRGX': Include. Map first mortgage status to median income if correlated.  
'REFR': Exclude. Has refrigerator. Use 'KIT'.  
'RMSP': Include. Map number of rooms to float.  
'RNTM': Include. Map meals included in rent to float.  
'RNTP': Include. Map monthly rent to float. Adjust all with 'ADJHSG'.  
'RWAT': Include. Map has hot/cold running water to float.  
'RWATPR': Include. Map has running water to float. Join into 'RWAT'.  
'SINK': Include. Map has sink to float.  
'SMP': Include. Map second+ mortgage payments to float. Adjust all with 'ADJHSG'.  
'STOV': Exclude. Has stove/range. Use 'KIT'.  
'TEL': Include. Map has telephone to float.  
'TEN': Include. Map tenure to median income if correlated.  
'TOIL': Include. Map has toilet to float.  
'VACS': Include. Map vacancy status to median income if correlated.  
'VALP': Include. Map property value to float.  
'VEH': Include. Map number of vehicles to float.  
'WATP': Include. Map water cost to float. Adjust \$3+ with 'ADJHSG'.  
'YBL': Include. Map when structure built to float.  
'FES': Include. Map family type to float.  
'FINCP': Include. Map family income to float. Adjust all with 'ADJINC'.  
'FPARC': Exclude. Family presence and age children. Use 'HUPAC'.  
'GRNTP': Include. Map gross rent to float. Adjust all with 'ADJHSG'.  
'GRPIP': Exclude. Gross rent as a percentage of household income. Derived from the income that we want to predict.  
'HHL': Include. Map household language to float.  
'HHT': Include. Map household/family type to float.  
'HINCP': Include. Map household income to float. Adjust all with 'ADJINC'.  
'HUGCL': Include. Map household with grandparent and grandchildren to float.  
'HUPAC': Include. Map household presence and age of children to float.  
'HUPAOC': Exclude. Household presence and age of own children. Use 'HUPAC'.  
'HUPARC': Exclude. Household presence and age of related children. Use 'HUPAC'.  
'KIT': Include. Map has complete kitchen to float.  
'LNGI': Include. Map limited English to float.  
'MULTG': Exclude. Multigenerational household.  
'MV': Include. Map when moved into house to float.  
'NOC': Include. Number of own children in household. Map to float.  
'NPF': Include. Number of persons in family. Map to float.  
'NPP': Exclude. Grandparent headed household.  
'NR': Exclude. Presence of nonrelative in household.  
'NRC': Include. Number of related children in household. Map to float.  
'OCPIP': Exclude. Selected monthly owner costs as percentage of household income.  
'PARTNER': Include. Unmarried partner household.  
'PLM': Include. Complete plumbing facilities. Map to float. Compare to 'BATH', 'RWAT', 'SINK', and 'TOIL' above.  
'PSF': Exclude. Presence of subfamilies in household.  
'R18': Include. Presence of persons under 18 years in household. Map to float.  
'R60': Exclude. Presence of persons 60+ years in household. Use 'R65'.  
'R65': Include. Presence of persons 65+ years in household. Map to float.  
'RESMODE': Exclude. Response mode to ACS.  
'SMOCP': Include. Selected monthly owner costs. Map to float.  
'SMX': Include. Second+ mortgage status. Map to float.  
'SRNT': Exclude. Specified rent unit.  
'SVAL': Exclude. Specified value owner unit.  
'TAXP': Include. Property taxes. Map to float.  
'WIF': Include. Number of workers in family. Map to float.  
'WKEXREL': Include. Work experience of householder and spouse. Map to float.  
'WORKSTAT': Include. Work status of householder or spouse in family households. Map to float. Reduce detail and combine with another column.  
'FACRP': Exclude. Lot size allocation flag.  
...
'FYBLP': Exclude. When structure first built allocation flag.  
'WGTP1': Exclude. Housing Weight replicate 1.  
...  
'WGTP80': Exclude. Housing Weight replicate 80.  
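
Several entries above say "map to median income if correlated". One way to read that, offered as an assumption rather than the author's method, is to replace each category with the median of 'HINCP' within that category. The helper name `encode_by_median` and the toy data are hypothetical.

```python
import pandas as pd


def encode_by_median(df, col, target):
    """Replace each category in `col` with the median of `target` for that category."""
    medians = df.groupby(col)[target].median()
    return df[col].map(medians)


# Toy data (illustrative only): a categorical code and a household income.
toy = pd.DataFrame({
    'TEN': ['1', '1', '3', '3', '3'],
    'HINCP': [90000.0, 110000.0, 30000.0, 40000.0, 80000.0],
})
toy['TEN_median_hincp'] = encode_by_median(toy, col='TEN', target='HINCP')
print(toy)
```

In practice the medians would need to be computed on training data only and then applied to held-out data, since computing them on all rows leaks the target into the features.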

## Export ipynb to html

In [158]:
!date --rfc-3339='seconds'

2016-01-01 05:40:17+00:00


In [None]:
path_ipynb = os.path.join(path_static, basename, basename+'.ipynb')
for template in ['basic', 'full']:
    path_html = os.path.splitext(path_ipynb)[0]+'-'+template+'.html'
    cmd = ['jupyter', 'nbconvert', '--to', 'html', '--template', template, path_ipynb, '--output', path_html]
    print(' '.join(cmd))
    subprocess.run(args=cmd, check=True)