# How to Build a ZRP Model Using Your Own Data 

This notebook illustrates how to use ZRP_Build, a class that trains a new, custom ZRP model on user-supplied data. To build the custom model pipeline, you must supply the standard ZRP inputs (name and address) along with race. The pipeline, model, and supporting data are saved automatically to "./artifacts/experiments/{zrp_model_name}/" under the support files path you define.

This notebook is not intended to demonstrate ZRP performance. The dataset used is deliberately tiny so that training completes quickly. For a notebook that demonstrates ZRP performance, see https://github.com/zestai/zrp/blob/main/examples/ZRP-Tutorial.ipynb.

In [1]:
%load_ext autoreload
%autoreload 2
%config Completer.use_jedi=False

In [2]:
from os.path import join, expanduser, dirname
import pandas as pd
import sys
import os
import re
import shutil
import warnings

In [3]:
warnings.filterwarnings(action='ignore')
home = expanduser('~')

src_path = os.getcwd()
root = os.path.join(src_path, "../..")
sys.path.append(src_path)

In [4]:
from zrp import ZRP
from zrp.modeling import ZRP_Build
from zrp.prepare.utils import load_file

In [5]:
# Clean up the directory structure left over from previous runs
if os.path.exists('artifacts'):
    shutil.rmtree('artifacts')

# Load sample data for training

In [6]:
input_sample = load_file(os.path.join(root, "tests/data/sm_1.csv"))
# A key must be specified to establish correspondence between inputs and outputs
input_sample['ZEST_KEY'] = input_sample.index.astype(str)

input_sample.shape

(6, 14)

Note the input sample columns. As detailed in the ZRP Build docstrings, the following columns are required: first_name, middle_name, last_name, house_number, street_address, city, state, zip_code, race.
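Before training, it can help to confirm that the required columns listed above are all present. The helper below is a minimal sketch, not part of the ZRP package: the function name `missing_required_columns` is hypothetical, and it works on any iterable of column names, such as `input_sample.columns`.

```python
# Required columns, copied from the list in the text above.
REQUIRED_COLUMNS = [
    "first_name", "middle_name", "last_name", "house_number",
    "street_address", "city", "state", "zip_code", "race",
]


def missing_required_columns(columns):
    """Return the required columns absent from `columns`, in listed order.

    Hypothetical helper for illustration; it is not a ZRP function.
    """
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Made-up column list standing in for a real data frame's columns.
example_columns = ["first_name", "last_name", "city", "state", "zip_code", "ZEST_KEY"]
print(missing_required_columns(example_columns))
# -> ['middle_name', 'house_number', 'street_address', 'race']
```

With the frame loaded above, a call such as `missing_required_columns(input_sample.columns)` should return an empty list if the sample is complete.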

# Invoke ZRP Build on the sample data

ZRP Build lets you specify, via the 'file_path' parameter, where to place the artifacts folder and its files (pipeline, model(s), and supporting data) generated during intermediate steps. If 'file_path' is not specified, the artifacts folder is created in the directory from which the function is called.

The ZRP consists of a waterfall of 3 models: block group, census tract, and zip code. For a refresher on this architecture, see https://github.com/zestai/zrp/blob/main/model_report.rst#prediction-process. When you build your own ZRP model with ZRP Build, these three models are trained. To mirror the structure of the ZRP module, ZRP Build places the 3 generated models and their associated files in distinct folders named for the geo level. These three folders are stored in the directory '[file_path]/artifacts/experiments/[zrp_model_name]/', where 'file_path' is the user-defined or default path to the 'artifacts' parent directory.
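The directory layout described above can be sketched as follows. This is an illustration only: the helper name `expected_model_dirs` is hypothetical, the default of "." stands in for "the directory the function is called from", and the folder names 'census_tract' and 'zip_code' are assumptions (only 'block_group' appears in this notebook).

```python
import os

# Geo-level folder names. 'block_group' is used in this notebook; the other
# two names are assumed for illustration and may differ in practice.
GEO_LEVELS = ("block_group", "census_tract", "zip_code")


def expected_model_dirs(zrp_model_name, file_path="."):
    """Map each geo level to the folder where its model is expected to live.

    Hypothetical helper; it only builds paths and does not touch the disk.
    """
    base = os.path.join(file_path, "artifacts", "experiments", zrp_model_name)
    return {level: os.path.join(base, level) for level in GEO_LEVELS}


for level, path in expected_model_dirs("test_model_small_data").items():
    print(level, "->", path)
```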

'zrp_model_name' is another relevant ZRP Build parameter; it specifies the name of the new model you are building. Giving 'file_path' and 'zrp_model_name' unique values avoids overwriting previously built models in subsequent runs of ZRP Build.
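One simple way to keep model names unique across runs is to append a timestamp. The sketch below assumes nothing about ZRP itself; `unique_model_name` is a hypothetical helper whose result could be passed as 'zrp_model_name'.

```python
from datetime import datetime


def unique_model_name(prefix, now=None):
    """Append a second-resolution timestamp to `prefix`.

    Hypothetical helper; `now` can be injected to make the result reproducible.
    """
    if now is None:
        now = datetime.now()
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}"


print(unique_model_name("test_model_small_data", datetime(2024, 1, 2, 3, 4, 5)))
# -> test_model_small_data_20240102_030405
```

Two builds started within the same second would still collide, so this is a convenience rather than a guarantee.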

First, define custom XGBoost modeling parameters, including those related to early stopping. The early stopping evaluation metric can be either the multiclass error rate ('merror', which is the per-class error averaged with class-population weights) or the AUC.

In [16]:
xgb_params = {'gamma': 5,
              'learning_rate': 0.01,
              'max_depth': 3,
              'min_child_weight': 500,
              'n_estimators': 1000,
              'subsample': 0.20,
              'eval_metric':'merror',
              'early_stopping_rounds':15,
              'verbose':True}
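
For reference, XGBoost's 'merror' metric is the multiclass classification error rate: the number of wrong predictions divided by the number of cases. The sketch below computes it in plain Python on made-up labels; the function name is hypothetical and this is not how XGBoost implements it internally.

```python
def multiclass_error_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one observation is required")
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


# Made-up labels for illustration only.
y_true = ["BLACK", "AIAN", "WHITE", "WHITE", "AIAN"]
y_pred = ["WHITE", "AIAN", "WHITE", "WHITE", "AIAN"]
print(multiclass_error_rate(y_true, y_pred))  # -> 0.2
```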

In addition to the default constructor arguments, a validation holdout of 20% of the training data is specified here for early stopping. The sources are also limited to the block group only.

In [17]:
%%time
zest_race_predictor = ZRP_Build(valid_size=0.2,
                                sources='block_group',
                                xgb_params=xgb_params,
                                zrp_model_name='test_model_small_data') 
zest_race_predictor.fit()
output = zest_race_predictor.transform(input_sample)

  0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/6 [00:00<?, ?it/s]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
100%|██████████| 6/6 [00:00<00:00, 9731.56it/s]

####################################
Processing rows: 0:25000
####################################
Data is loaded
   [Start] Validating input data
     Number of observations: 6
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace

[Start] Preparing geo data

  The following states are included in the data: ['SC']
   ... on state: SC

   Data is loaded
   [Start] Processing geo data
      ...address cleaning
      ...replicating address
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=6)
         ...Base
         ...Map street suffixes...



[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:    0.0s finished


         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=7)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data
      ...merge user input & lookup table


100%|██████████| 1/1 [00:02<00:00,  2.36s/it]

      ...mapping
   [Completed] Validating input geo data
Directory already exists
...Output saved
   [Completed] Mapping geo data

[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data
     Number of observations: 6
     Is key unique: True

   [Completed] Validating ACS input data

   ...loading ACS lookup tables





   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

chunk_size = 0.023122999999999998Mb
BUILDING block_group MODEL.

Directory already exists
Dropping [] features
    ...Len features to keep list:  98
    ...Data shape pre feature drop:  (16, 192)
    ...Data shape post feature drop:  (16, 97)
Directory already exists
...Output saved
...Output saved
...Output saved
...Output saved
...Output saved
...Output saved
Post-sampling shape:  (6, 14)


Unique train labels:  ['BLACK', 'AIAN', 'WHITE']
Categories (3, object): ['BLACK', 'AIAN', 'WHITE']
Unique test labels:  ['AIAN', 'WHITE']
Categories (2, object): ['AIAN', 'WHITE']

---
Saving raw data
...Output saved
...Output saved

---
Building pipeline

---
Fitting pipeline
[Pipeline] ..... (step 1 of 8) Processing Drop Features, total=   0.0s
Pass through
[Pipeline] .. (step 2 of 8) Processing Compound Name FE, tota

  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1778.00it/s]

[Pipeline] ............ (step 3 of 8) Processing App FE, total=   0.1s
[Pipeline] ............ (step 4 of 8) Processing ACS FE, total=   0.0s
[Pipeline] .. (step 5 of 8) Processing Name Aggregation, total=   0.1s
[Pipeline] . (step 6 of 8) Processing Drop Features (2), total=   0.0s
[Pipeline] ............ (step 7 of 8) Processing Impute, total=   0.0s



[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 2029.17it/s]

[Pipeline]  (step 8 of 8) Processing Correlated Feature Selection, total=   0.1s
Directory already exists

---
Transforming FE data
Pass through

---
Saving FE data



[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 2084.64it/s]

...Output saved
Directory already exists

---
Transforming FE data
Pass through

---
Saving FE data
...Output saved

---
building zrp_model

 training data shape:4,118

---
fitting zrp_model... n_class=3
[0]	train-merror:0.16035	val-merror:0.00000
Multiple eval metrics have been passed: 'val-merror' will be used for early stopping.

Will train until val-merror hasn't improved in 15 rounds.
[1]	train-merror:0.16035	val-merror:0.00000
[2]	train-merror:0.16035	val-merror:0.00000
[3]	train-merror:0.16035	val-merror:0.00000
[4]	train-merror:0.16035	val-merror:0.00000
[5]	train-merror:0.16035	val-merror:0.00000
[6]	train-merror:0.16035	val-merror:0.00000
[7]	train-merror:0.16035	val-merror:0.00000
[8]	train-merror:0.16035	val-merror:0.00000
[9]	train-merror:0.16035	val-merror:0.00000
[10]	train-merror:0.16035	val-merror:0.00000
[11]	train-merror:0.16035	val-merror:0.00000
[12]	train-merror:0.16035	val-merror:0.00000
[13]	train-merror:0.16035	val-merror:0.00000



[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


[14]	train-merror:0.16035	val-merror:0.00000
[15]	train-merror:0.16035	val-merror:0.00000
Stopping. Best iteration:
[0]	train-merror:0.16035	val-merror:0.00000


---
finished fitting zrp_model....0.041
Directory already exists

---
finished saving zrp_model
Completed building block_group model.

##############################
Custom ZRP model build complete.
CPU times: user 20.2 s, sys: 1.96 s, total: 22.2 s
Wall time: 20.6 s
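
In the training log above, val-merror never improves on its round-0 value, so training stops at round 15 and round 0 is reported as the best iteration. The patience rule behind this can be sketched as below. It is a simplified illustration for a lower-is-better metric, written to reproduce what the log shows; XGBoost's actual implementation may differ in detail, and the function name is hypothetical.

```python
def early_stopping_round(val_scores, patience):
    """Return (stop_round, best_round) for a lower-is-better metric.

    Training stops at the first round that is `patience` rounds past the
    best one without improvement. stop_round is None if that never happens.
    """
    if not val_scores:
        raise ValueError("val_scores must not be empty")
    best_round, best_score = 0, val_scores[0]
    for i, score in enumerate(val_scores):
        if score < best_score:
            best_round, best_score = i, score
        elif i - best_round >= patience:
            return i, best_round
    return None, best_round


# Mirrors the log: val-merror stays at 0.0, n_estimators=1000, patience=15.
print(early_stopping_round([0.0] * 1000, patience=15))  # -> (15, 0)
```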


# Proxy using the newly built ZRP model

Make sure to specify the path where the model was saved (the 'pipe_path' parameter). This means appending 'artifacts/experiments/[zrp_model_name]' to your specified 'file_path'. In this case no 'file_path' was specified, so the default applies: the directory from which this code is run.

In [18]:
test_sample = load_file(os.path.join(root, "tests/data/sm_3.csv"))
# A key must be specified to establish correspondence between inputs and outputs
test_sample['ZEST_KEY'] = test_sample.index.astype(str)

test_sample.shape

(5, 14)

In [19]:
%%time
zest_race_predictor = ZRP(file_path='artifacts/experiments/test_model_small_data',
                          pipe_path='artifacts/experiments/test_model_small_data',
                          runname='custom_outputs')
zest_race_predictor.fit()
zrp_output = zest_race_predictor.transform(test_sample)

Directory already exists


Exception: New value of 'runname' parameter needs to be specified or the following files need to be moved or deleted: ['artifacts/experiments/test_model_small_data/artifacts/Zest_Geocoded_custom_outputs__2019__45_1.parquet']

In [20]:
%%time
zest_race_predictor = ZRP(file_path='artifacts/experiments/test_model_small_data',
                          pipe_path='artifacts/experiments/test_model_small_data')
zest_race_predictor.fit()
zrp_output = zest_race_predictor.transform(test_sample)

Directory already exists


Exception: New value of 'runname' parameter needs to be specified or the following files need to be moved or deleted: ['artifacts/experiments/test_model_small_data/artifacts/Zest_Geocoded__2019__45_1.parquet']
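
Both exceptions above report that output files from an earlier run already exist, and that either a new 'runname' must be given or the files must be moved or deleted. One way to move them aside is sketched below. The helper `move_stale_outputs` is hypothetical, and the 'Zest_Geocoded_*.parquet' pattern is inferred only from the file names in the exception messages; other leftover files may need the same treatment.

```python
import glob
import os
import shutil


def move_stale_outputs(artifacts_dir, backup_dir, pattern="Zest_Geocoded_*.parquet"):
    """Move files matching `pattern` from artifacts_dir into backup_dir.

    Returns the sorted names of the files that were moved. The default
    pattern is an assumption based on the exception messages above.
    """
    search = os.path.join(glob.escape(artifacts_dir), pattern)
    stale = sorted(glob.glob(search))
    if stale:
        os.makedirs(backup_dir, exist_ok=True)
    moved = []
    for path in stale:
        name = os.path.basename(path)
        shutil.move(path, os.path.join(backup_dir, name))
        moved.append(name)
    return moved


# Possible usage before re-running the cell (paths taken from the messages above):
# move_stale_outputs(
#     "artifacts/experiments/test_model_small_data/artifacts",
#     "artifacts/experiments/test_model_small_data/artifacts_backup",
# )
```

Passing a 'runname' that has not been used before, as the exception message suggests, avoids the clash without moving anything.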

*Note*: The following output data frame shouldn't be evaluated for performance or accuracy. As stated above, this notebook trains on a very small amount of data in order to demonstrate quickly how to use ZRP Build. A larger dataset is necessary to build a model with strong performance.