# How to Build a ZRP Model Using Your Own Data 

This notebook illustrates how to use `ZRP_Build`, a class that builds a new, custom ZRP model trained on user-supplied data. To build the custom model pipeline, you must supply the standard ZRP inputs (name and address) along with race. The pipeline, model, and supporting data are saved automatically to "./artifacts/experiments/{zrp_model_name}/" under the support files path you define.

This notebook is not intended to display ZRP performance. The dataset used is deliberately tiny so that training runs quickly. To view a notebook displaying ZRP performance, see https://github.com/zestai/zrp/blob/main/examples/ZRP-Tutorial.ipynb.

In [1]:
%load_ext autoreload
%autoreload 2
%config Completer.use_jedi=False

In [2]:
from os.path import join, expanduser, dirname
import pandas as pd
import sys
import os
import re
import shutil
import warnings

In [3]:
warnings.filterwarnings(action='ignore')
home = expanduser('~')

src_path = os.getcwd()
root = os.path.join(src_path, "../..")
sys.path.append(src_path)

In [4]:
from zrp import ZRP
from zrp.modeling import ZRP_Build, ZRP_Predict
from zrp.prepare.utils import load_file

In [5]:
## Clean up artifacts left by a previous run of this notebook
if os.path.exists('artifacts/experiments/test_block_group_only'):
    shutil.rmtree('artifacts/experiments/test_block_group_only')

# Load sample data for training

In [6]:
input_sample = load_file(os.path.join(root, "tests", "data", "sm_7.csv"))
input_sample['ZEST_KEY'] = input_sample.index.astype(str)  # must specify key to establish correspondence between inputs and outputs

input_sample.shape

(50, 10)

In [7]:
input_sample

Unnamed: 0,ZEST_KEY,first_name,last_name,middle_name,house_number,street_address,city,state,zip_code,race
0,0,DAVID,KRUEGER,JOSEPH,362,11TH ST,KEY COLONY BCH,FL,33051,WHITE
1,1,JOSEPH,SCHWARTZ,,8477,MAN O WAR RD,PALM BEACH GARDENS,FL,33418,WHITE
2,2,MICHAEL,SCHMITT,J,4104,FENROSE CIR,MELBOURNE,FL,32940,WHITE
3,3,BENJAMIN,MUELLER,M,3636,BRIGHTWOOD LN,PACE,FL,32571,WHITE
4,4,DANIEL,WEISS,JOSEPH,255,WAKISSA CV,DESTIN,FL,32541,WHITE
5,5,MARY,NOVAK,A,3177,WALLACE LAKE RD,PACE,FL,32571,WHITE
6,6,EMMA,KLEIN,LYNN,630,S SAPODILLA AVE,WEST PALM BEACH,FL,33401,WHITE
7,7,PATRICIA,MARTIN,ANN,362,LAZY LN,LAKE PLACID,FL,33852,WHITE
8,8,SUSAN,ANDERSON,MARIE,3775,GROVE VIEW LN,PORT ORANGE,FL,32129,WHITE
9,9,RICHARD,MILLER,MICHAEL,1298,W GLADIOLA DR,AVON PARK,FL,33825,WHITE


Note the columns of the input sample. As detailed in the `ZRP_Build` docstrings, the following columns are required: first_name, middle_name, last_name, house_number, street_address, city, state, zip_code, race.
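
Before calling ZRP Build, it can help to confirm that your own data carries these columns. The following is a minimal sketch using only the standard library; the `missing_required_columns` helper is hypothetical and not part of the zrp package, and it accepts any iterable of column names, such as `input_sample.columns`.

```python
# Column names listed in this notebook as required by ZRP Build.
REQUIRED_COLUMNS = [
    "first_name", "middle_name", "last_name",
    "house_number", "street_address", "city",
    "state", "zip_code", "race",
]


def missing_required_columns(columns):
    """Return the required column names absent from `columns`, in listed order.

    Hypothetical helper for illustration; it is not part of the zrp package.
    """
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Column names of the sample shown above (including the ZEST_KEY column).
sample_columns = ["ZEST_KEY", "first_name", "last_name", "middle_name",
                  "house_number", "street_address", "city", "state",
                  "zip_code", "race"]

print(missing_required_columns(sample_columns))
# prints []
print(missing_required_columns(["first_name", "last_name", "zip_code"]))
# prints ['middle_name', 'house_number', 'street_address', 'city', 'state', 'race']
```

An empty list means every required column is present; extra columns such as `ZEST_KEY` are ignored by this check.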

# Invoke ZRP Build on the sample data

ZRP Build lets you specify where to put the artifacts folder and its files (pipeline, model(s), and supporting data) generated during intermediate steps. This is controlled by the 'file_path' parameter. If it is not specified, the artifacts folder is created in the folder from which the function is called.

The ZRP consists of a waterfall of three models: block group, census tract, and ZIP code. For a refresher on this architecture, see https://github.com/zestai/zrp/blob/main/model_report.rst#prediction-process. When you build your own ZRP model with ZRP Build, these three models are trained. To mirror the structure of the ZRP module, ZRP Build places the three generated models and their associated files in distinct folders named for the geo level. These folders are stored in '[file_path]/artifacts/experiments/[zrp_model_name]/', where 'file_path' is the user-defined or default path to the 'artifacts' parent directory.

'zrp_model_name' is another relevant ZRP Build parameter; it specifies the name of the new model you are building. Giving each run a unique combination of 'file_path' and 'zrp_model_name' avoids overwriting previously built models in subsequent runs of ZRP Build.
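
The directory layout described above can be sketched with a couple of path helpers. This is only an illustration of how the pieces combine: `experiment_dir` and `geo_level_dir` are hypothetical names, not functions from the zrp package, and the default of the current directory is an assumption based on the description of 'file_path' above.

```python
import os


def experiment_dir(zrp_model_name, file_path=os.curdir):
    """Hypothetical helper: folder that holds one custom model's artifacts."""
    return os.path.join(file_path, "artifacts", "experiments", zrp_model_name)


def geo_level_dir(zrp_model_name, geo_level, file_path=os.curdir):
    """Hypothetical helper: sub-folder for one geo level, e.g. 'block_group'."""
    return os.path.join(experiment_dir(zrp_model_name, file_path), geo_level)


# Default location, relative to where the code is run.
print(experiment_dir("test_block_group_only"))

# A distinct file_path and model name keep separate runs from colliding.
print(geo_level_dir("my_second_model", "block_group", file_path="runs"))
```

Two runs only share a folder if both 'file_path' and 'zrp_model_name' match, which is why varying either one protects earlier models.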

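First, define custom XGBoost modeling parameters, including those related to early stopping. For the early stopping evaluation metric, either the class population-weighted average mean error or the AUC can be used. The cell below sets `eval_metric` to `merror`, XGBoost's multiclass classification error rate.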

In [8]:
xgb_params = {'gamma': 5,
              'learning_rate': 0.01,
              'max_depth': 3,
              'min_child_weight': 500,
              'n_estimators': 1000,
              'subsample': 0.20,
              'eval_metric':'merror',
              'early_stopping_rounds':15,
              'verbose':True}
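
To make the interplay between `eval_metric` and `early_stopping_rounds` concrete, here is a minimal pure-Python sketch of patience-based early stopping for a metric where lower is better. The `early_stop` helper is hypothetical and simplifies what XGBoost does internally; it only illustrates the stopping rule.

```python
def early_stop(val_scores, patience):
    """Return (best_round, stop_round) for a metric where lower is better.

    Training halts at the first round that is `patience` rounds past the best
    round without improving on it. If that never happens, stop_round is the
    last round available.
    """
    best_round, best_score = 0, float("inf")
    for round_idx, score in enumerate(val_scores):
        if score < best_score:
            best_round, best_score = round_idx, score
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    return best_round, len(val_scores) - 1


# A validation error that never improves: stop once 15 rounds have passed.
print(early_stop([0.82328] * 20, patience=15))
# prints (0, 15)

# A validation error that improves once, then stalls.
print(early_stop([0.5, 0.4, 0.45, 0.41, 0.42], patience=2))
# prints (1, 3)
```

With `n_estimators` set to 1000, a rule like this is what keeps training from running all 1000 rounds when the validation metric has stalled.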

In addition to the default constructor arguments, a validation holdout of 20% of the training data is specified here for use in early stopping. The source is also limited to the block group only.

In [9]:
%%time
zest_race_predictor = ZRP_Build(valid_size=0.2,
                                sources='block_group',
                                xgb_params=xgb_params,
                                zrp_model_name='test_block_group_only') 
zest_race_predictor.fit()
output = zest_race_predictor.transform(input_sample)

  0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/50 [00:00<?, ?it/s][A[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
100%|██████████| 50/50 [00:00<00:00, 1504.02it/s]

####################################
Processing rows: 0:25000
####################################
Data is loaded
   [Start] Validating input data
     Number of observations: 50
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace

[Start] Preparing geo data

  The following states are included in the data: ['FL']
   ... on state: FL

   Data is loaded
   [Start] Processing geo data
      ...address cleaning



[Parallel(n_jobs=-1)]: Done  38 out of  50 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:    0.0s finished


      ...replicating address
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=50)
         ...Base
         ...Map street suffixes...
         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=65)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data
      ...merge user input & lookup table


100%|██████████| 1/1 [00:07<00:00,  7.30s/it]

      ...mapping
   [Completed] Validating input geo data
Directory already exists
...Output saved
   [Completed] Mapping geo data

[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data
     Number of observations: 50
     Is key unique: True






   [Completed] Validating ACS input data

   ...loading ACS lookup tables
   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

chunk_size = 0.20283199999999998Mb


ZIP    50
NaN    50
BG     46
CT     46
Name: acs_source, dtype: int64

BUILDING block_group MODEL.

Dropping [] features
    ...Len features to keep list:  98
    ...Data shape pre feature drop:  (54, 190)
    ...Data shape post feature drop:  (54, 97)
Directory already exists
...Output saved
...Output saved
...Output saved
...Output saved
...Output saved
...Output saved
Post-sampling shape:  (50, 10)


Unique train labels:  ['HISPANIC', 'BLACK', 'WHITE', 'AIAN', 'AAPI']
Categories (5, object): ['HISPANIC', 'BLACK', 'WHITE', 'AIAN', 'AAPI']
Unique test labels:  ['HISPANIC', 'AAPI', 'BLACK', 'AIAN']
Categories (4, object): ['HISPANIC', 'AAPI', 'BLACK', 'AIAN']

---
Saving raw data
...Output saved
...Output saved

---
Building pipeline

---
Fitting pipeline
[Pipeline] ..... (step 1 of 8) Processing Drop Features, total=   0.0s
[Pipeline] .. (step 2 of 8) Processing Compound Name FE, total=   0.0s


  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 897.56it/s]

[Pipeline] ............ (step 3 of 8) Processing App FE, total=   0.2s
[Pipeline] ............ (step 4 of 8) Processing ACS FE, total=   0.0s
[Pipeline] .. (step 5 of 8) Processing Name Aggregation, total=   0.1s
[Pipeline] . (step 6 of 8) Processing Drop Features (2), total=   0.0s
[Pipeline] ............ (step 7 of 8) Processing Impute, total=   0.0s



[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 917.39it/s]

[Pipeline]  (step 8 of 8) Processing Correlated Feature Selection, total=   0.1s
Directory already exists

---
Transforming FE data



[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1548.86it/s]


---
Saving FE data
...Output saved
Directory already exists

---
Transforming FE data

---
Saving FE data
...Output saved

---
building zrp_model

 training data shape:31,97

---
fitting zrp_model... n_class=5
[0]	train-merror:0.08703	val-merror:0.82328
Multiple eval metrics have been passed: 'val-merror' will be used for early stopping.

Will train until val-merror hasn't improved in 15 rounds.



[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


[1]	train-merror:0.07998	val-merror:0.82328
[2]	train-merror:0.06445	val-merror:0.82328
[3]	train-merror:0.06445	val-merror:0.82328
[4]	train-merror:0.06445	val-merror:0.82328
[5]	train-merror:0.06445	val-merror:0.82328
[6]	train-merror:0.04046	val-merror:0.82328
[7]	train-merror:0.03199	val-merror:0.82328
[8]	train-merror:0.03199	val-merror:0.82328
[9]	train-merror:0.02352	val-merror:0.82328
[10]	train-merror:0.01505	val-merror:0.82328
[11]	train-merror:0.01505	val-merror:0.82328
[12]	train-merror:0.01505	val-merror:0.82328
[13]	train-merror:0.01505	val-merror:0.82328
[14]	train-merror:0.00658	val-merror:0.82328
[15]	train-merror:0.00658	val-merror:0.82328
Stopping. Best iteration:
[0]	train-merror:0.08703	val-merror:0.82328


---
finished fitting zrp_model....0.148
Directory already exists

---
finished saving zrp_model
Completed building block_group model.

##############################
Custom ZRP model build complete.
CPU times: user 30.4 s, sys: 2.93 s, total: 33.3 s
Wall time: 2

# Proxy Using Newly Built ZRP Model
The ZRP model typically relies on three sub-models: a block group model, a census tract model, and a ZIP code model. Each sub-model is trained separately: one for Census block groups, one for Census tracts, and another for ZIP codes.

The inputs to the ZRP model include a name and an address. The address is used to look up attributes of the corresponding region. The lookup process follows these steps:

1. Retrieve attributes for the Census block group.
2. If the block group lookup fails, retrieve attributes for the Census tract.
3. If the Census tract lookup fails, retrieve attributes for the ZIP code.

Attributes from the American Community Survey (ACS) associated with the retrieved geographic area are then appended to the first, middle, and last names. The resulting vector of predictors is used as input to the corresponding model (e.g., block group, tract, or ZIP code-based model).
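
The three-step lookup above can be sketched as a simple fallback chain. This is an illustration only: the `lookup_geo_attributes` helper, the record field names, and the toy tables with made-up keys and values are all hypothetical and do not reflect the internals of the zrp package.

```python
def lookup_geo_attributes(record, block_group_table, tract_table, zip_table):
    """Return (source, attributes) from the most granular level with a match.

    Levels are tried in waterfall order: block group, census tract, ZIP code.
    Returns (None, {}) when no level matches.
    """
    levels = [
        ("block_group", record.get("block_group"), block_group_table),
        ("census_tract", record.get("census_tract"), tract_table),
        ("zip_code", record.get("zip_code"), zip_table),
    ]
    for source, key, table in levels:
        if key in table:
            return source, table[key]
    return None, {}


# Toy lookup tables; keys and attribute values are made up for illustration.
bg_table = {"BG-1": {"example_attr": 1.0}}
ct_table = {"CT-1": {"example_attr": 2.0}, "CT-2": {"example_attr": 3.0}}
zip_table = {"33401": {"example_attr": 4.0}}

records = [
    {"block_group": "BG-1", "census_tract": "CT-1", "zip_code": "33401"},
    {"block_group": "BG-9", "census_tract": "CT-2", "zip_code": "33401"},
    {"zip_code": "33401"},
    {"zip_code": "00000"},
]
for record in records:
    print(lookup_geo_attributes(record, bg_table, ct_table, zip_table))
# prints:
# ('block_group', {'example_attr': 1.0})
# ('census_tract', {'example_attr': 3.0})
# ('zip_code', {'example_attr': 4.0})
# (None, {})
```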

Earlier in this notebook, we trained a block group model. Below, we load the test data saved alongside that trained model and use it to generate proxies from the block group model. Note that the data is loaded from the experiment path written by the build cell above. Make sure to pass the path where the model was written via the 'pipe_path' parameter. This means appending 'artifacts/experiments/[zrp_model_name]' to your specified 'file_path'. In this case, we did not specify a 'file_path', so the default is the path where this code is being run from.

In [10]:
test_sample = load_file("artifacts/experiments/test_block_group_only/block_group/X_test.feather")
test_sample['ZEST_KEY'] = test_sample.index.astype(str)  # must specify key to establish correspondence between inputs and outputs

test_sample.shape

(11, 98)

Since only a block group model was trained in this notebook and no other models exist for census tracts or zip codes, we will need to use `ZRP_Predict_BlockGroup` to generate proxies using the block group model. This function is available in "/zrp/modeling/predict.py".

If you have a census tract model only, use `ZRP_Predict_CensusTract`. If you have a ZIP code model only, use `ZRP_Predict_ZipCode`. If all three models have been trained, you can generate proxies for the train and test datasets using `ZRP_Predict`.
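
The choice of predictor class described above can be summarised as a small mapping. In this sketch the `choose_predictor_name` helper is hypothetical, it returns class names as strings rather than importing zrp, and the level names `census_tract` and `zip_code` are assumed by analogy with the `block_group` value used in this notebook.

```python
# Predictor class names as given in this notebook, keyed by assumed level name.
SINGLE_LEVEL_PREDICTORS = {
    "block_group": "ZRP_Predict_BlockGroup",
    "census_tract": "ZRP_Predict_CensusTract",
    "zip_code": "ZRP_Predict_ZipCode",
}


def choose_predictor_name(trained_levels):
    """Return the name of the predictor class that fits the trained sub-models.

    Hypothetical helper. Combinations this notebook does not describe
    (for example exactly two levels) raise ValueError.
    """
    levels = set(trained_levels)
    if levels == set(SINGLE_LEVEL_PREDICTORS):
        return "ZRP_Predict"
    if len(levels) == 1:
        (level,) = levels
        if level in SINGLE_LEVEL_PREDICTORS:
            return SINGLE_LEVEL_PREDICTORS[level]
    raise ValueError(f"No predictor described for levels: {sorted(levels)}")


print(choose_predictor_name(["block_group"]))
# prints ZRP_Predict_BlockGroup
print(choose_predictor_name(["block_group", "census_tract", "zip_code"]))
# prints ZRP_Predict
```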

In [11]:
from zrp.modeling import ZRP_Predict_BlockGroup

In [12]:
test_sample.head()

Unnamed: 0,index,ZEST_KEY,B08301_018,B19001_015,middle_name,B08301_010,B25004_008,B08301_011,B19001_002,B25075_018,...,B99021_002,B25075_013,C16001_036,B19001_001,B25075_020,C16001_015,B19001_009,B25075_024,B19001_014,C16001_006
0,7,0,0.0,67.0,M,66.0,0.0,39.0,0.0,0.0,...,0.0,0.0,,665.0,174.0,,0.0,0.0,115.0,
1,6,1,14.0,35.0,A,10.0,0.0,10.0,14.0,172.0,...,0.0,0.0,,720.0,19.0,,69.0,0.0,61.0,
2,2,2,0.0,112.0,E,0.0,0.0,0.0,27.0,121.0,...,0.0,0.0,,1129.0,109.0,,73.0,0.0,212.0,
3,25,3,0.0,166.0,T,0.0,24.0,0.0,46.0,67.0,...,53.0,45.0,,1201.0,168.0,,60.0,18.0,150.0,
4,23,4,9.0,42.0,L,0.0,69.0,0.0,26.0,34.0,...,0.0,0.0,,490.0,113.0,,26.0,8.0,71.0,


Use `ZRP_Predict_BlockGroup` to generate proxies on the new data sample.

In [13]:
%%time
zest_race_predictor = ZRP_Predict_BlockGroup(
                          pipe_path='artifacts/experiments/test_block_group_only',)
zest_race_predictor.fit()
zrp_output = zest_race_predictor.transform(test_sample.drop(['index'], axis=1).set_index('ZEST_KEY'))

  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1591.16it/s]

CPU times: user 290 ms, sys: 0 ns, total: 290 ms
Wall time: 294 ms



[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


*Note*: The following output data frame should not be evaluated for performance or accuracy. As stated above, this notebook trains on a very small dataset in order to demonstrate quickly how to use ZRP Build. A larger dataset is necessary to build a model with strong performance.

In [14]:
zrp_output

Unnamed: 0_level_0,AAPI,AIAN,BLACK,HISPANIC,WHITE,race_proxy,source_zrp_block_group
ZEST_KEY,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.096119,0.033168,0.03585,0.049208,0.785655,WHITE,1
1,0.309642,0.061927,0.107573,0.302884,0.217973,AAPI,1
10,0.135656,0.08889,0.414266,0.253064,0.108125,BLACK,1
2,0.198663,0.088306,0.207507,0.375011,0.130513,HISPANIC,1
3,0.044373,0.0438,0.582892,0.056639,0.272296,BLACK,1
4,0.194429,0.061057,0.079061,0.092391,0.573062,WHITE,1
5,0.093315,0.046406,0.41634,0.388372,0.055567,BLACK,1
6,0.056665,0.025234,0.495402,0.381087,0.041612,BLACK,1
7,0.138487,0.047759,0.236715,0.511249,0.06579,HISPANIC,1
8,0.037782,0.008889,0.021123,0.015957,0.91625,WHITE,1
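
In the rows shown above, `race_proxy` coincides with the class that has the highest probability. The sketch below reproduces that reading for three of the rows using plain dictionaries; the `top_class` helper is hypothetical, and this is an observation about the displayed output rather than a statement of how the zrp package assigns the label.

```python
CLASSES = ["AAPI", "AIAN", "BLACK", "HISPANIC", "WHITE"]


def top_class(probabilities):
    """Return the class label with the largest probability in one row."""
    return max(CLASSES, key=lambda label: probabilities[label])


# Probabilities copied from the output table above, keyed by ZEST_KEY.
rows = {
    "0": {"AAPI": 0.096119, "AIAN": 0.033168, "BLACK": 0.03585,
          "HISPANIC": 0.049208, "WHITE": 0.785655},
    "1": {"AAPI": 0.309642, "AIAN": 0.061927, "BLACK": 0.107573,
          "HISPANIC": 0.302884, "WHITE": 0.217973},
    "7": {"AAPI": 0.138487, "AIAN": 0.047759, "BLACK": 0.236715,
          "HISPANIC": 0.511249, "WHITE": 0.06579},
}

for key, probabilities in rows.items():
    print(key, top_class(probabilities))
# prints:
# 0 WHITE
# 1 AAPI
# 7 HISPANIC
```

Row 1 shows why the full probability vector is worth keeping: AAPI and HISPANIC are nearly tied, which the single label hides.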


Additionally, if you are starting with base fields like name and address, it is recommended to have all three ZRP models trained. Use `ZRP` to prepare the data (i.e., geocode the address, attach the ACS attributes, and create the pre-pipeline data) and to generate the ZRP proxies.