# ZRP Zip Code Only
The purpose of this notebook is to illustrate how to use the `zrp` package to generate race/ethnicity proxies that are based on zip code only. 


In [1]:
%load_ext autoreload
%autoreload 2
%config Completer.use_jedi=False

In [2]:
import pandas as pd
import sys
import os
import re
import warnings
from zrp.prepare.utils import load_file, make_directory
from os.path import join, expanduser, dirname

warnings.filterwarnings(action='ignore')
home = expanduser('~')

In [3]:
import pkg_resources
print("ZRP version:", pkg_resources.get_distribution('zrp').version)

ZRP version: 0.2.2


## Load sample data for prediction
Load the processed list of New Jersey mayors, downloaded from https://www.nj.gov/dca/home/2022mayors.csv

In [4]:
nj_mayors = load_file("../2022-nj-mayors-sample.csv")
nj_mayors.shape

(462, 9)

In [5]:
nj_mayors

,first_name,middle_name,last_name,house_number,street_address,city,state,zip_code,ZEST_KEY
0,Gabe,,Plumer,782,Frenchtown Road,Milford,NJ,08848,2
1,Ari,,Bernstein,500,West Crescent Avenue,Allendale,NJ,07401,4
2,David,J.,Mclaughlin,125,Corlies Avenue,Allenhurst,NJ,07711-1049,5
3,Thomas,C.,Fritts,8,North Main Street,Allentown,NJ,08501-1607,6
4,P.,,McCkelvey,49,South Greenwich Street,Alloway,NJ,08001-0425,7
...,...,...,...,...,...,...,...,...,...
457,William,,Degroff,3943,Route,Chatsworth,NJ,08019,558
458,Joseph,,Chukwueke,200,Cooper Avenue,Woodlynne,NJ,08107-2108,559
459,Paul,,Sarlo,85,Humboldt Street,Wood-Ridge,NJ,07075-2344,560
460,Craig,,Frederick,120,Village Green Drive,Woolwich Township,NJ,08085-3180,562


#### Create output folder


In [6]:
make_directory()

Directory already exists


## Process Input Data
This step processes and standardizes the input data.

In [7]:
from zrp.prepare.preprocessing import ProcessGeo

In [8]:
pg = ProcessGeo()
pg.fit(nj_mayors)
processed = pg.transform(nj_mayors, processed=False, replicate=False)

  0%|          | 0/462 [00:00<?, ?it/s]

   [Start] Validating input geo data
   [Completed] Validating input geo data
   [Start] Processing geo data
      ...formatting
      ...address cleaning


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 15 concurrent workers.
[Parallel(n_jobs=-1)]: Done  20 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 170 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 420 tasks      | elapsed:    0.0s
100%|██████████| 462/462 [00:00<00:00, 15413.86it/s]

      ...formatting
   [Completed] Processing geo data



[Parallel(n_jobs=-1)]: Done 462 out of 462 | elapsed:    0.0s finished


In [9]:
processed.head()

ZEST_KEY,first_name,middle_name,last_name,house_number,street_address,city,state,zip_code,ZEST_KEY_COL
2,GABE,,PLUMER,782,FRENCHTOWN ROAD,MILFORD,NJ,8848,2
4,ARI,,BERNSTEIN,500,WEST CRESCENT AVENUE,ALLENDALE,NJ,7401,4
5,DAVID,J,MCLAUGHLIN,125,CORLIES AVENUE,ALLENHURST,NJ,7711,5
6,THOMAS,C,FRITTS,8,NORTH MAIN STREET,ALLENTOWN,NJ,8501,6
7,P,,MCCKELVEY,49,SOUTH GREENWICH STREET,ALLOWAY,NJ,8001,7
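
The processed output above shows upper-cased names, punctuation removed from initials, and ZIP+4 codes reduced to five digits. The following is a minimal pandas sketch of that kind of standardization, not the `ProcessGeo` implementation; the helper name `standardize_sketch` is hypothetical, and unlike the rendered table it keeps leading zeros in the zip code.

```python
import pandas as pd


def standardize_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: upper-case text fields, drop punctuation, keep ZIP5."""
    out = df.copy()
    for col in ["first_name", "middle_name", "last_name", "city"]:
        out[col] = (
            out[col]
            .fillna("")
            .astype(str)
            .str.upper()
            # keep letters, whitespace, and hyphens (e.g. WOOD-RIDGE)
            .str.replace(r"[^A-Z\s\-]", "", regex=True)
            .str.strip()
        )
    # Keep the leading five digits of a ZIP or ZIP+4 and preserve leading zeros
    out["zip_code"] = (
        out["zip_code"]
        .astype(str)
        .str.extract(r"(\d{1,5})", expand=False)
        .str.zfill(5)
    )
    return out


# One row taken from the sample data shown earlier
sample = pd.DataFrame({
    "first_name": ["David"],
    "middle_name": ["J."],
    "last_name": ["Mclaughlin"],
    "city": ["Allenhurst"],
    "zip_code": ["07711-1049"],
})
cleaned = standardize_sketch(sample)
print(cleaned.iloc[0].to_dict())
```
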


#### Modeling Data Prep  
To integrate ACS data, we will use `ACSModelPrep`.

Input data to `ACSModelPrep` is expected to be tabular, with the following columns: first name, middle name, last name, block group, census tract, zip code, and ZEST key. The `ZEST_KEY` should be specified to establish correspondence between inputs and outputs; it is effectively used as an index for the data table. Since we only want to use zip codes, we will have to bend the rules a bit with additional processing.

`ACSModelPrep` requires at least one Census geographic identifier: block group, census tract, or zip code. In this example, `zip_code` serves as the GEOID.
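
As a quick pre-flight check of the requirements above, one could verify which name and geography columns are present before calling `ACSModelPrep`. This is a sketch only: `summarize_inputs` is a hypothetical helper, and the column names `block_group` and `census_tract` are assumptions (only `zip_code` and the name columns appear in this notebook's data).

```python
import pandas as pd

NAME_COLS = ["first_name", "middle_name", "last_name"]
# Assumed column names for the geographic identifiers
GEO_COLS = ["block_group", "census_tract", "zip_code"]


def summarize_inputs(df: pd.DataFrame, key: str = "ZEST_KEY") -> dict:
    """Hypothetical helper: report which expected columns are available."""
    # The key may be a regular column or the index of the table
    has_key = key in df.columns or df.index.name == key
    # A geography counts only if the column exists and has at least one value
    geo_present = [c for c in GEO_COLS if c in df.columns and df[c].notna().any()]
    if not geo_present:
        raise ValueError("Need at least one of: " + ", ".join(GEO_COLS))
    return {
        "missing_name_cols": [c for c in NAME_COLS if c not in df.columns],
        "geo_present": geo_present,
        "has_key": has_key,
    }


# Toy frame with zip code as the only geography
toy = pd.DataFrame({
    "first_name": ["GABE"],
    "middle_name": [""],
    "last_name": ["PLUMER"],
    "zip_code": ["08848"],
    "ZEST_KEY": ["2"],
})
report = summarize_inputs(toy)
print(report)
```
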


In [10]:
from zrp.prepare.prepare import ACSModelPrep

In [11]:
%%time
acs = ACSModelPrep()
acs.fit()

CPU times: user 16 µs, sys: 19 µs, total: 35 µs
Wall time: 42.4 µs


When called, `.transform()` can both process the input data and integrate ACS data.
- No data is written out because `save_table` is set to False. If it is set to True, the data is saved to a file.


In [12]:
zrp_output = acs.transform(processed, save_table=False)

Generating Geo IDs
   ...loading ACS lookup tables
   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete


To work around the defaults, add placeholder columns for the missing geographies (block group and census tract).

In [13]:
zrp_output[['GEOID_BG', 'GEOID_CT']] = None

### Invoke `ZRP_Predict` on the sample data
To generate predictions, you can:
- Provide the path to the preferred pipeline directory in `__init__`.
    - Here we provide the path from the installed zrp version.
    - If using the git version of ZRP, the pipe path is:<br>
        `curpath = os.getcwd()`<br>
        `pipe_path = join(curpath, "../../zrp/modeling/models")`
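
Hard-coding the environment path ties the notebook to one machine. As an alternative sketch, the directory of an installed package can be located with the standard library and the sub-path appended; `package_subdir` is a hypothetical helper, and the `modeling/models` layout is taken from the path used in this notebook.

```python
import importlib.util
import os


def package_subdir(package: str, *parts: str) -> str:
    """Hypothetical helper: build a path inside an installed package's directory."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        raise ImportError(f"Cannot locate installed package {package!r}")
    # spec.origin points at the package's __init__.py; its parent is the package dir
    return os.path.join(os.path.dirname(spec.origin), *parts)


# With zrp installed, this would mirror the path used in this notebook:
# pipe_path = package_subdir("zrp", "modeling", "models")
```
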

In [14]:
from zrp.modeling.predict import ZRP_Predict
pipe_path = os.path.join(home, ".conda/envs/zrp_0.2.2/lib/python3.7/site-packages/zrp/modeling/models")

In [15]:
z_predict = ZRP_Predict(file_path="", pipe_path=pipe_path)
z_predict.fit(zrp_output)
predict_out = z_predict.transform(zrp_output)

   [Start] Validating pipeline input data
     Number of observations: 916
     Is key unique: False
   [Completed] Validating pipeline input data



  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 15 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1051.73it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


Directory already exists
Output saved
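
The validation log above reports 916 observations and a non-unique key, although the input had 462 rows. A minimal sketch for inspecting which keys repeat, assuming `ZEST_KEY` is the index of the table; `duplicate_key_report` is a hypothetical helper and the toy values are illustrative only.

```python
import pandas as pd


def duplicate_key_report(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: row counts for index values that occur more than once."""
    counts = df.index.value_counts()
    return counts[counts > 1]


# Toy frame in which key "2" appears twice
toy = pd.DataFrame(
    {"zip_code": ["08848", "08848", "07401"]},
    index=pd.Index(["2", "2", "4"], name="ZEST_KEY"),
)
dupes = duplicate_key_report(toy)
print(dupes)
```
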


In [16]:
predict_out.sort_values('source_bisg').head()

ZEST_KEY,AAPI,AIAN,BLACK,HISPANIC,WHITE,race_proxy,source_zip_code,source_bisg
354,0.255177,0.003159,0.021822,0.02782,0.692021,WHITE,1.0,0.0
446,0.029528,0.000717,0.003945,0.010995,0.954815,WHITE,1.0,0.0
445,0.015095,0.004106,0.009082,0.013552,0.958166,WHITE,1.0,0.0
444,0.000835,0.000667,0.038553,0.002209,0.957736,WHITE,1.0,0.0
443,0.002764,0.023738,0.315428,0.059286,0.598785,WHITE,1.0,0.0


Note: The source columns denote which method was used to generate the proxies. When `source_bisg` is missing, neither ZRP nor BISG was able to generate race/ethnicity predictions.
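
To summarize the source flags, one could collapse them into a single label per row. This is a sketch under assumptions: `label_source` is a hypothetical helper, the flag columns are taken from the output above, a missing flag is treated as "no prediction", and the toy values are illustrative rather than real results.

```python
import pandas as pd

SOURCE_COLS = ["source_zip_code", "source_bisg"]


def label_source(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: name the source flag set on each row, or 'none'."""
    # Missing flags are treated as 0 (no prediction from that source)
    flags = df[SOURCE_COLS].fillna(0).astype(int)
    # idxmax returns the first column holding the row maximum
    label = flags.idxmax(axis=1).str.replace("source_", "", regex=False)
    # Rows with no flag set at all are labelled 'none'
    return label.where(flags.sum(axis=1) > 0, "none")


# Illustrative toy values only
toy = pd.DataFrame(
    {
        "source_zip_code": [1.0, 0.0, float("nan")],
        "source_bisg": [0.0, 1.0, float("nan")],
    },
    index=pd.Index(["a", "b", "c"], name="ZEST_KEY"),
)
print(label_source(toy).value_counts())
```

Applied to `predict_out`, the same call would give a per-source count of how each proxy was produced.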