# How to GeoCode
This notebook shows how to use `ZGeo`, which converts addresses to Census Geographic Identifiers (GEOIDs).

In [1]:
%load_ext autoreload
%autoreload 2
%config Completer.use_jedi=False

In [2]:
from os.path import join, expanduser, dirname
import pandas as pd
import sys
import os
import re
import warnings

In [5]:
warnings.filterwarnings(action='ignore')
home = expanduser('~')

src_path = '{}/zrp'.format(home)
sys.path.append(src_path)

In [6]:
from zrp.prepare.prepare import ZRP_Prepare, ZGeo
from zrp.prepare.utils import load_file

## Load sample data for prediction
Load a processed list of New Jersey mayors, downloaded from https://www.nj.gov/dca/home/2022mayors.csv

In [7]:
nj_mayors = load_file("../2022-nj-mayors-sample.csv")
nj_mayors.shape

(462, 9)

In [8]:
nj_mayors

Unnamed: 0,first_name,middle_name,last_name,house_number,street_address,city,state,zip_code,ZEST_KEY
0,Gabe,,Plumer,782,Frenchtown Road,Milford,NJ,08848,2
1,Ari,,Bernstein,500,West Crescent Avenue,Allendale,NJ,07401,4
2,David,J.,Mclaughlin,125,Corlies Avenue,Allenhurst,NJ,07711-1049,5
3,Thomas,C.,Fritts,8,North Main Street,Allentown,NJ,08501-1607,6
4,P.,,McCkelvey,49,South Greenwich Street,Alloway,NJ,08001-0425,7
...,...,...,...,...,...,...,...,...,...
457,William,,Degroff,3943,Route,Chatsworth,NJ,08019,558
458,Joseph,,Chukwueke,200,Cooper Avenue,Woodlynne,NJ,08107-2108,559
459,Paul,,Sarlo,85,Humboldt Street,Wood-Ridge,NJ,07075-2344,560
460,Craig,,Frederick,120,Village Green Drive,Woolwich Township,NJ,08085-3180,562


## Geocode
To map addresses to GEOIDs, we will use `ZGeo`.

Input to the prediction/modeling pipeline is tabular data with the following columns: first name, middle name, last name, house number, street address (street name), city, state, zip code, and ZEST key. The `ZEST_KEY` column establishes the correspondence between inputs and outputs; it is effectively the index of the data table. Geocoding does not require first name, middle name, or last name, but it is best practice to include these columns if you intend to return race and ethnicity proxies.
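
As a minimal sketch of the input layout described above, the helper below checks a table's column names against the expected columns. The column names are taken from the sample data shown earlier; `NAME_COLUMNS`, `ADDRESS_COLUMNS`, `KEY_COLUMN`, and `missing_columns` are hypothetical names introduced for illustration and are not part of `zrp`.

```python
# Column names as they appear in the sample data above.
# The constant and helper names are illustrative, not part of zrp.
NAME_COLUMNS = ["first_name", "middle_name", "last_name"]
ADDRESS_COLUMNS = ["house_number", "street_address", "city", "state", "zip_code"]
KEY_COLUMN = "ZEST_KEY"


def missing_columns(columns, include_names=True):
    """Return the expected columns that are absent from `columns`.

    `columns` can be any iterable of column names, e.g. `df.columns`.
    Set `include_names=False` when only geocoding is needed.
    """
    expected = ADDRESS_COLUMNS + [KEY_COLUMN]
    if include_names:
        expected = NAME_COLUMNS + expected
    present = set(columns)
    return [col for col in expected if col not in present]


# An address-only table is enough for geocoding ...
address_only = ADDRESS_COLUMNS + [KEY_COLUMN]
print(missing_columns(address_only, include_names=False))  # []
# ... but lacks the name columns used for race & ethnicity proxies.
print(missing_columns(address_only))  # ['first_name', 'middle_name', 'last_name']
```

In this notebook, a call such as `missing_columns(nj_mayors.columns)` before geocoding would surface a missing column early.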

`ZGeo` maps addresses to block group, census tract, and other Census geographic identifiers. When called, its `.transform()` method can both process the input data and geocode it.

In [10]:
%%time
geocode = ZGeo()
geocode.fit()

Notes about the `.transform()` parameters:
- `geo='34'` indicates that we want to geocode addresses from NJ.
    - The state FIPS code for NJ is 34. When using `ZGeo` alone, the user must supply the numeric state FIPS code. Refer to `inverse_state_mapping.json` for help mapping state abbreviations to state FIPS codes.
- The output data may be larger than the input data because `replicate` is set to True.
- No data is saved because `save_table` is set to False. If True, the geocoded data is saved to a file by state FIPS code.
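
To illustrate the abbreviation-to-FIPS lookup mentioned in the notes above, here is a minimal sketch built on a small hard-coded excerpt of state FIPS codes. `STATE_FIPS_EXCERPT` and `state_to_fips` are hypothetical names; the dictionary is not the contents or layout of `inverse_state_mapping.json`, which this notebook does not show.

```python
# A small excerpt of state FIPS codes, hard-coded for illustration.
# This is NOT the contents or layout of inverse_state_mapping.json.
STATE_FIPS_EXCERPT = {
    "DE": "10",
    "NJ": "34",
    "NY": "36",
    "PA": "42",
}


def state_to_fips(abbreviation):
    """Return the two-digit state FIPS code for a state abbreviation."""
    key = abbreviation.strip().upper()
    if key not in STATE_FIPS_EXCERPT:
        raise KeyError(f"No FIPS code in this excerpt for {abbreviation!r}")
    return STATE_FIPS_EXCERPT[key]


print(state_to_fips("nj"))  # 34
```

The returned string can be passed as the `geo` argument, as in `geo='34'` in the next cell.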


In [10]:
zrp_output = geocode.transform(nj_mayors, geo='34', processed=False, replicate=True, save_table=False)

  0%|          | 0/462 [00:00<?, ?it/s][Parallel(n_jobs=49)]: Using backend ThreadingBackend with 49 concurrent workers.
[Parallel(n_jobs=49)]: Done 102 tasks      | elapsed:    0.0s
100%|██████████| 462/462 [00:00<00:00, 13118.29it/s]


   Data is loaded
   [Start] Processing geo data
      ...formatting

['NJ']

['NJ']
/home/kam/zrp/zrp/prepare/../data/processed
      ...address cleaning
      ...replicating address
         ...Base
         ...Map street suffixes...



[Parallel(n_jobs=49)]: Done 352 tasks      | elapsed:    0.0s
[Parallel(n_jobs=49)]: Done 462 out of 462 | elapsed:    0.0s finished


         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=900)
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=900)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data
      ...merge user input & lookup table
      ...mapping
   [Completed] Validating input geo data
   [Completed] Mapping geo data
CPU times: user 5.48 s, sys: 1.22 s, total: 6.7 s
Wall time: 5.12 s
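
Because `replicate=True` expands the address table (the log above reports n=900 for 462 input rows), it can be useful to check how many rows share a key. The sketch below counts repeated values in any iterable of keys. `repeated_keys` is a hypothetical helper, and whether the final `zrp_output` keeps more than one row per `ZEST_KEY` is not shown in this notebook.

```python
from collections import Counter


def repeated_keys(keys):
    """Map each key that occurs more than once to its number of occurrences."""
    counts = Counter(keys)
    return {key: n for key, n in counts.items() if n > 1}


# Made-up key values, used only to demonstrate the helper.
example_keys = [2, 4, 4, 5, 5, 5]
print(repeated_keys(example_keys))  # {4: 2, 5: 3}
```

Assuming `ZEST_KEY` is the index of the output table, a call such as `repeated_keys(zrp_output.index)` would list any keys that appear on more than one row.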


### Inspect the output


In [12]:
zrp_output.head()

ZEST_KEY,first_name,middle_name,last_name,house_number,street_address,city,state,zip_code,BLKGRPCE,BLKGRPCE10,...,ZCTA5CE,ZCTA5CE10,ZEST_FULLNAME,ZEST_KEY_COL,ZEST_STATE,ZEST_ZIP,GEOID_ZIP,GEOID_CT,GEOID_BG,GEOID
56,MATTHEW,C,MOENCH,100,COMMONS WAY,BRIDGEWATER,NJ,8807,2,2,...,8807,8807,COMMONS WAY,56,,8807,8807,34035050703,340350507032,
202,LOUIS,,MANZO,114,BRIDGETON PIKE,MULLICA HILL,NJ,8062,2,2,...,8062,8062,BRIDGETON PIKE,202,,8062,8062,34015502002,340155020022,
204,JOHN,,DELORENZO,218,BOULEVARD,HASBROUCK HEIGHTS,NJ,7604,3,6,...,7604,7604,BOULEVARD,204,,7604,7604,34003025200,340030252003,
224,PAUL,J,RITTER,590,SHILOH PIKE,BRIDGETON,NJ,8302,3,3,...,8302,8302,SHILOH PIKE,224,,8302,8302,34011010600,340110106003,
248,THOMAS,,BARBERA,135,BROADWAY,LAUREL SPRINGS,NJ,8021,2,2,...,8021,8021,BROADWAY,248,,8021,8021,34007607900,340076079002,
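
The `GEOID_CT` and `GEOID_BG` values above follow the standard Census layout: a 2-digit state FIPS code, a 3-digit county code, a 6-digit tract code, and, for block groups, one further digit. The sketch below splits a block group GEOID into those parts. `split_block_group_geoid` is a hypothetical helper introduced for illustration, not part of `zrp`.

```python
def split_block_group_geoid(geoid_bg):
    """Split a 12-digit block group GEOID into its component codes."""
    geoid_bg = str(geoid_bg)
    if len(geoid_bg) != 12 or not geoid_bg.isdigit():
        raise ValueError(f"Expected a 12-digit GEOID, got {geoid_bg!r}")
    return {
        "state": geoid_bg[:2],       # state FIPS code
        "county": geoid_bg[2:5],     # county code within the state
        "tract": geoid_bg[5:11],     # census tract within the county
        "block_group": geoid_bg[11], # block group within the tract
    }


# GEOID_BG from the first row of the output above (ZEST_KEY 56).
parts = split_block_group_geoid("340350507032")
print(parts)
# {'state': '34', 'county': '035', 'tract': '050703', 'block_group': '2'}
```

The first eleven digits reproduce that row's `GEOID_CT` (34035050703), and the last digit matches its `BLKGRPCE` value. If a GEOID has been read as an integer, states whose FIPS code starts with 0 would need zero-padding to 12 digits before splitting.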
