# How to Prepare Data for Modeling
The purpose of this notebook is to illustrate how to use `ACSModelPrep`, a module that integrates American Community Survey (ACS) data with user input data in preparation for modeling or predicting race & ethnicity.

**Note**: If GEOIDs are determined outside of ZRP, remember to include the full GEOID. A block group GEOID contains 12 characters in this format: [state][county][census tract][block group]
- Example of a full block group GEOID: **340230003002**
    - state fips: 34
    - county fips: 023
    - census tract code: 000300
    - block group code: 2
- The corresponding full census tract GEOID is **34023000300**
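
The layout described above can be checked with a few string slices. The following is a minimal sketch, not part of ZRP: the `BlockGroupGeoid` class and the `split_block_group_geoid` helper are hypothetical names introduced here for illustration, and the sketch assumes the GEOID is supplied as a 12-digit string so that leading zeros are preserved.

```python
from typing import NamedTuple


class BlockGroupGeoid(NamedTuple):
    """Components of a 12-character block group GEOID (hypothetical helper type)."""
    state_fips: str        # 2 characters
    county_fips: str       # 3 characters
    tract_code: str        # 6 characters
    block_group_code: str  # 1 character

    @property
    def tract_geoid(self) -> str:
        # The full census tract GEOID is the first 11 characters.
        return self.state_fips + self.county_fips + self.tract_code


def split_block_group_geoid(geoid: str) -> BlockGroupGeoid:
    """Split a full block group GEOID into its components."""
    geoid = str(geoid).strip()
    if len(geoid) != 12 or not geoid.isdigit():
        raise ValueError(f"Expected a 12-digit block group GEOID, got {geoid!r}")
    return BlockGroupGeoid(geoid[:2], geoid[2:5], geoid[5:11], geoid[11])


parts = split_block_group_geoid("340230003002")
print(parts)
print(parts.tract_geoid)
```

A GEOID that was read as an integer may have lost a leading zero, so it is safer to load GEOID columns as strings before applying a check like this.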
    

In [18]:
%load_ext autoreload
%autoreload 2
%config Completer.use_jedi=False

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
from os.path import join, expanduser, dirname
import pandas as pd
import sys
import os
import re
import warnings

In [3]:
warnings.filterwarnings(action='ignore')
home = expanduser('~')

src_path = '{}/zrp'.format(home)
sys.path.append(src_path)

In [5]:
from zrp.prepare.prepare import ZRP_Prepare, ACSModelPrep
from zrp.prepare.utils import load_file

## Load sample data for prediction
Load a processed list of New Jersey mayors downloaded from https://www.nj.gov/dca/home/2022mayors.csv

In [6]:
nj_mayors = load_file("../2022-nj-mayors-sample.csv")
nj_mayors.shape

(462, 9)

In [7]:
nj_mayors

Unnamed: 0,first_name,middle_name,last_name,house_number,street_address,city,state,zip_code,ZEST_KEY
0,Gabe,,Plumer,782,Frenchtown Road,Milford,NJ,08848,2
1,Ari,,Bernstein,500,West Crescent Avenue,Allendale,NJ,07401,4
2,David,J.,Mclaughlin,125,Corlies Avenue,Allenhurst,NJ,07711-1049,5
3,Thomas,C.,Fritts,8,North Main Street,Allentown,NJ,08501-1607,6
4,P.,,McCkelvey,49,South Greenwich Street,Alloway,NJ,08001-0425,7
...,...,...,...,...,...,...,...,...,...
457,William,,Degroff,3943,Route,Chatsworth,NJ,08019,558
458,Joseph,,Chukwueke,200,Cooper Avenue,Woodlynne,NJ,08107-2108,559
459,Paul,,Sarlo,85,Humboldt Street,Wood-Ridge,NJ,07075-2344,560
460,Craig,,Frederick,120,Village Green Drive,Woolwich Township,NJ,08085-3180,562


## Modeling Data Prep
To integrate ACS data, we will use `ACSModelPrep`.

The input to `ACSModelPrep` is tabular data with, at minimum, the following columns: first name, middle name, last name, block group, census tract, zip code, and ZEST key. The `ZEST_KEY` column should be specified to establish correspondence between inputs and outputs; it is effectively used as an index for the data table. Geocoding does not require first name, middle name, and last name, but it is best practice to include these columns if the intention is to return race & ethnicity proxies.

`ACSModelPrep` requires at least one Census GEOID: block group, census tract, or zip code. In this example, `zip_code` serves as the GEOID.
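
Before calling `ACSModelPrep`, it can help to confirm that the table has the key, the name columns, and at least one GEOID column. The sketch below is an assumption-laden illustration, not ZRP's own validation: `check_input_columns` is a hypothetical helper, and the column names `block_group` and `census_tract` are placeholders (only `zip_code`, `ZEST_KEY`, and the name columns appear in the sample data above).

```python
# Column names taken from the sample data in this notebook.
KEY_COLUMN = "ZEST_KEY"
NAME_COLUMNS = ["first_name", "middle_name", "last_name"]

# "block_group" and "census_tract" are assumed names used only for illustration.
GEO_COLUMNS = ["block_group", "census_tract", "zip_code"]


def check_input_columns(columns):
    """Return (missing key/name columns, GEOID columns that are present)."""
    present = set(columns)
    missing = [c for c in [KEY_COLUMN] + NAME_COLUMNS if c not in present]
    geo_present = [c for c in GEO_COLUMNS if c in present]
    return missing, geo_present


# Columns of the New Jersey mayors sample shown earlier.
columns = [
    "first_name", "middle_name", "last_name", "house_number",
    "street_address", "city", "state", "zip_code", "ZEST_KEY",
]
missing, geo_present = check_input_columns(columns)
print("missing:", missing)
print("GEOID columns:", geo_present)
if not geo_present:
    raise ValueError("At least one GEOID column is required")
```

With a DataFrame, the same check can be run by passing `list(nj_mayors.columns)`.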


In [13]:
%%time
acs = ACSModelPrep()
acs.fit()

CPU times: user 13 µs, sys: 23 µs, total: 36 µs
Wall time: 44.3 µs


When called, `.transform()` runs processing steps that can include processing the input data and integrating ACS data.
- No data is written out because `save_table` is set to False. If it is set to True, the data is saved to a file.


In [19]:
zrp_output = acs.transform(nj_mayors, save_table=False)

Generating Geo IDs
   ...loading ACS lookup tables
   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete


### Inspect the output


In [20]:
zrp_output.head()

Unnamed: 0,B01003_001,B02001_001,B02001_002,B02001_003,B02001_004,B02001_005,B02001_006,B02001_007,B02001_008,B02001_009,...,acs_source,city,first_name,house_number,index,last_name,middle_name,state,street_address,zip_code
2,8463,8463,8378,38,0,32,0,0,15,0,...,ZIP,Milford,Gabe,782,0.0,Plumer,,NJ,Frenchtown Road,8848
219,8463,8463,8378,38,0,32,0,0,15,0,...,ZIP,Milford,Daniel,61,174.0,Bush,,NJ,Church Road,8848
4,6765,6765,5787,46,0,744,0,0,188,0,...,ZIP,Allendale,Ari,500,1.0,Bernstein,,NJ,West Crescent Avenue,7401
8,28986,28986,24562,1950,48,710,0,594,1122,112,...,ZIP,Alpha,Craig,1001,5.0,Dunwell,S.,NJ,East Boulevard,8865
199,28986,28986,24562,1950,48,710,0,594,1122,112,...,ZIP,Phillipsburg,Brian,3003,156.0,Tipton,,NJ,Belvidere Road,8865
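
In the output above, `zip_code` appears without its leading zero (for example, 8848 for 08848), while the input sample also contains ZIP+4 values such as 07711-1049. If five-digit strings are needed for display or for joining against other tables, a small normalization step is one option. This is a minimal sketch under those assumptions; `normalize_zip` is a hypothetical helper and does not describe how ZRP handles zip codes internally.

```python
def normalize_zip(value) -> str:
    """Return a 5-digit ZIP string from an int, a digit string, or a ZIP+4 string."""
    # Keep only the part before the "+4" suffix, if there is one.
    base = str(value).strip().split("-", 1)[0]
    if not base.isdigit() or len(base) > 5:
        raise ValueError(f"Unrecognized ZIP code: {value!r}")
    # Restore leading zeros that are lost when the value is stored as a number.
    return base.zfill(5)


for raw in (8848, "07711-1049", "08001-0425"):
    print(raw, "->", normalize_zip(raw))
```

Assuming the column has no missing values, this could be applied with `zrp_output["zip_code"].map(normalize_zip)`.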
