# How to Build a ZRP Model Using Your Own Data 

The purpose of this notebook is to illustrate how to use ZRP_Build, a class that trains a new, custom ZRP model on user-supplied data. To build the custom model pipeline, you must supply the standard ZRP inputs (name and address) along with race. The pipeline, model, and supporting data are saved automatically to "./artifacts/experiments/{zrp_model_name}/" under the support files path you define.

This notebook is not intended to demonstrate ZRP performance. The dataset is deliberately tiny so that training finishes quickly. For a notebook that demonstrates ZRP performance, see https://github.com/zestai/zrp/blob/main/examples/ZRP-Tutorial.ipynb.

In [1]:
%load_ext autoreload
%autoreload 2
%config Completer.use_jedi=False

In [2]:
from os.path import join, expanduser, dirname
import pandas as pd
import sys
import os
import re
import shutil
import warnings

In [3]:
warnings.filterwarnings(action='ignore')
home = expanduser('~')

src_path = os.getcwd()
root = os.path.join(src_path, "../..")
sys.path.append(src_path)

In [4]:
from zrp import ZRP
from zrp.modeling import ZRP_Build
from zrp.prepare.utils import load_file

In [5]:
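## Clean up the model directory left by any previous run
## (models are saved under artifacts/experiments/, as described above)
if os.path.exists('artifacts/experiments/test_model_small_data'):
    shutil.rmtree('artifacts/experiments/test_model_small_data')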

# Load sample data for training

In [6]:
input_sample = load_file(root + "/tests/data/sm_1.csv")
input_sample['ZEST_KEY'] = input_sample.index.astype(str)  # a key is required to match inputs to outputs

input_sample.shape

(6, 14)

Note the input sample columns. As detailed in ZRP Build docstrings, the following columns are needed: first_name, middle_name, last_name, house_number, street_address, city, state, zip_code, race.
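
A quick pre-flight check can catch a missing column or a duplicated key before a long build starts. The sketch below is not part of the zrp package: `REQUIRED_COLUMNS` simply restates the list above plus the `ZEST_KEY` column added in the previous cell, and `missing_columns` and `has_unique_keys` are hypothetical helper names.

```python
# Hypothetical helpers, not part of zrp: sanity-check the training input.
REQUIRED_COLUMNS = [
    "first_name", "middle_name", "last_name",
    "house_number", "street_address", "city", "state", "zip_code",
    "race",
    "ZEST_KEY",  # key column added in the cell above
]


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names not found in `columns`, in order."""
    present = set(columns)
    return [name for name in required if name not in present]


def has_unique_keys(keys):
    """Return True when no key value is repeated."""
    keys = list(keys)
    return len(set(keys)) == len(keys)


# Example with plain lists; in the notebook, pass input_sample.columns
# and input_sample["ZEST_KEY"] instead.
example_columns = REQUIRED_COLUMNS[:-2] + ["ZEST_KEY"]  # 'race' left out
print(missing_columns(example_columns))  # ['race']
print(has_unique_keys(["0", "1", "2"]))  # True
```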

# Invoke ZRP Build on the sample data

ZRP Build lets you choose where the artifacts folder and its files (pipeline, model(s), and supporting data generated during intermediate steps) are written, via the 'file_path' parameter. If 'file_path' is not specified, the artifacts folder is created in the directory from which the function is called.

The ZRP consists of a waterfall of three models: block group, census tract, and zip code. For a refresher on this architecture, please see https://github.com/zestai/zrp/blob/main/model_report.rst#prediction-process. When you build your own ZRP model with ZRP Build, all three models are trained. To mirror the structure of the ZRP module, ZRP Build places the three generated models and their associated files in separate folders named for the geo level. These three folders are stored in '[file_path]/artifacts/experiments/[zrp_model_name]/', where 'file_path' is the user-defined or default path to the 'artifacts' parent directory.
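
The waterfall idea can be illustrated with a toy selector that picks the finest geographic level at which a record has a match. This is only a sketch of the fallback order described above, not the zrp implementation; the function name `pick_geo_level` and the dictionary layout are made up for illustration.

```python
# Toy illustration of a waterfall over geo levels; not the zrp implementation.
GEO_LEVELS = ["block_group", "census_tract", "zip_code"]  # finest to coarsest


def pick_geo_level(matched):
    """Return the first level in GEO_LEVELS flagged True in `matched`, else None.

    `matched` maps a geo level name to whether the record's address
    could be matched at that level (a hypothetical structure).
    """
    for level in GEO_LEVELS:
        if matched.get(level, False):
            return level
    return None


records = {
    "a": {"block_group": True, "census_tract": True, "zip_code": True},
    "b": {"block_group": False, "census_tract": False, "zip_code": True},
    "c": {},  # no geographic match at any level
}
for key, matched in records.items():
    print(key, pick_geo_level(matched))
```

Record "a" resolves at the block group level, "b" falls through to zip code, and "c" has no match at all.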

'zrp_model_name' is another relevant ZRP Build parameter; it specifies the name of the new model you are building. Using a unique combination of 'file_path' and 'zrp_model_name' for each build avoids overwriting previously built models in subsequent runs of ZRP Build.
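
As a sketch of the layout described above, the experiment directory can be assembled with `os.path.join`. The helper name `experiment_dir` is hypothetical, and the sub-folder names `block_group`, `census_tract`, and `zip_code` are an assumption inferred from the build log rather than something the ZRP Build documentation confirms here.

```python
import os

# Assumed geo-level folder names (inferred from the build log, not confirmed).
GEO_LEVELS = ["block_group", "census_tract", "zip_code"]


def experiment_dir(zrp_model_name, file_path=None):
    """Build '[file_path]/artifacts/experiments/[zrp_model_name]'.

    When file_path is None, the current working directory is used,
    mirroring the default behaviour described above.
    """
    base = os.getcwd() if file_path is None else file_path
    return os.path.join(base, "artifacts", "experiments", zrp_model_name)


model_dir = experiment_dir("test_model_small_data", file_path="my_project")
print(model_dir)
for level in GEO_LEVELS:
    print(os.path.join(model_dir, level))
```

Printing the paths before a build makes it easy to confirm that a new run will not land on top of an existing model.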

In [7]:
%%time
zest_race_predictor = ZRP_Build(zrp_model_name='test_model_small_data') 
zest_race_predictor.fit()
output = zest_race_predictor.transform(input_sample)

  0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/6 [00:00<?, ?it/s]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
100%|██████████| 6/6 [00:00<00:00, 9443.09it/s]

####################################
Processing rows: 0:25000
####################################
Data is loaded
   [Start] Validating input data
     Number of observations: 6
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace

[Start] Preparing geo data

  The following states are included in the data: ['SC']
   ... on state: SC

   Data is loaded
   [Start] Processing geo data
      ...address cleaning



[Parallel(n_jobs=-1)]: Done   4 out of   6 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:    0.0s finished


      ...replicating address
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=6)
         ...Base
         ...Map street suffixes...
         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=7)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data
      ...merge user input & lookup table


100%|██████████| 1/1 [00:02<00:00,  2.76s/it]

      ...mapping
   [Completed] Validating input geo data
Directory already exists
...Output saved
   [Completed] Mapping geo data

[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data
     Number of observations: 6
     Is key unique: True

   [Completed] Validating ACS input data

   ...loading ACS lookup tables





   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

chunk_size = 0.023122999999999998Mb


ZIP    6
NaN    6
BG     4
CT     4
Name: acs_source, dtype: int64

BUILDING block_group MODEL.

Directory already exists
Dropping ['census_tract', 'zip_code'] features
    ...Len features to keep list:  98
    ...Data shape pre feature drop:  (8, 192)
    ...Data shape post feature drop:  (8, 97)
Directory already exists
...Output saved
...Output saved
...Output saved
...Output saved
Post-sampling shape:  (6, 14)


Unique train labels:  ['BLACK', 'AIAN', 'WHITE', 'AAPI']
Categories (4, object): ['BLACK', 'AIAN', 'WHITE', 'AAPI']
Unique test labels:  ['AAPI', 'WHITE']
Categories (2, object): ['AAPI', 'WHITE']

---
Saving raw data
...Output saved
...Output saved

---
Building pipeline

---
Fitting pipeline
[Pipeline] ..... (step 1 of 8) Processing Drop Features, total=   0.0s
Pass through
[Pipeline] .. (step 2 of 8) Processing Compound Name FE, total=   0.0s


  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1449.81it/s]

[Pipeline] ............ (step 3 of 8) Processing App FE, total=   0.1s
[Pipeline] ............ (step 4 of 8) Processing ACS FE, total=   0.0s
[Pipeline] .. (step 5 of 8) Processing Name Aggregation, total=   0.1s
[Pipeline] . (step 6 of 8) Processing Drop Features (2), total=   0.0s
[Pipeline] ............ (step 7 of 8) Processing Impute, total=   0.0s



[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1794.74it/s]

[Pipeline]  (step 8 of 8) Processing Correlated Feature Selection, total=   0.0s
Directory already exists

---
Transforming FE data
Pass through

---
Saving FE data
...Output saved

---
building zrp_model

 training data shape:5,57

---
fitting zrp_model... n_class=4
[0]	train-merror:0.31727	train-WeightedAUC:-0.50000
Multiple eval metrics have been passed: 'train-WeightedAUC' will be used for early stopping.

Will train until train-WeightedAUC hasn't improved in 2000 rounds.
[1]	train-merror:0.31727	train-WeightedAUC:-0.50000



[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


[2]	train-merror:0.31727	train-WeightedAUC:-0.50000
[3]	train-merror:0.31727	train-WeightedAUC:-0.50000
[4]	train-merror:0.31727	train-WeightedAUC:-0.50000
[5]	train-merror:0.31727	train-WeightedAUC:-0.50000
[6]	train-merror:0.31727	train-WeightedAUC:-0.50000
[7]	train-merror:0.31727	train-WeightedAUC:-0.50000
[8]	train-merror:0.31727	train-WeightedAUC:-0.50000
[9]	train-merror:0.31727	train-WeightedAUC:-0.50000
[10]	train-merror:0.31727	train-WeightedAUC:-0.50000
[11]	train-merror:0.31727	train-WeightedAUC:-0.50000
[12]	train-merror:0.31727	train-WeightedAUC:-0.50000
[13]	train-merror:0.31727	train-WeightedAUC:-0.50000
[14]	train-merror:0.31727	train-WeightedAUC:-0.50000
[15]	train-merror:0.31727	train-WeightedAUC:-0.50000
[16]	train-merror:0.31727	train-WeightedAUC:-0.50000
[17]	train-merror:0.31727	train-WeightedAUC:-0.50000
[18]	train-merror:0.31727	train-WeightedAUC:-0.50000
[19]	train-merror:0.31727	train-WeightedAUC:-0.50000
[20]	train-merror:0.31727	train-WeightedAUC:-0.50000
[

  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1677.72it/s]

[Pipeline] ............ (step 4 of 8) Processing ACS FE, total=   0.1s
[Pipeline] .. (step 5 of 8) Processing Name Aggregation, total=   0.1s
[Pipeline] . (step 6 of 8) Processing Drop Features (2), total=   0.0s
[Pipeline] ............ (step 7 of 8) Processing Impute, total=   0.0s



[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1693.30it/s]

[Pipeline]  (step 8 of 8) Processing Correlated Feature Selection, total=   0.1s
Directory already exists

---
Transforming FE data
Pass through

---
Saving FE data
...Output saved

---
building zrp_model

 training data shape:5,83

---
fitting zrp_model... n_class=4
[0]	train-merror:0.31727	train-WeightedAUC:-0.50000



[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


Multiple eval metrics have been passed: 'train-WeightedAUC' will be used for early stopping.

Will train until train-WeightedAUC hasn't improved in 2000 rounds.
[1]	train-merror:0.31727	train-WeightedAUC:-0.50000
[2]	train-merror:0.31727	train-WeightedAUC:-0.50000
[3]	train-merror:0.31727	train-WeightedAUC:-0.50000
[4]	train-merror:0.31727	train-WeightedAUC:-0.50000
[5]	train-merror:0.31727	train-WeightedAUC:-0.50000
[6]	train-merror:0.31727	train-WeightedAUC:-0.50000
[7]	train-merror:0.31727	train-WeightedAUC:-0.50000
[8]	train-merror:0.31727	train-WeightedAUC:-0.50000
[9]	train-merror:0.31727	train-WeightedAUC:-0.50000
[10]	train-merror:0.31727	train-WeightedAUC:-0.50000
[11]	train-merror:0.31727	train-WeightedAUC:-0.50000
[12]	train-merror:0.31727	train-WeightedAUC:-0.50000
[13]	train-merror:0.31727	train-WeightedAUC:-0.50000
[14]	train-merror:0.31727	train-WeightedAUC:-0.50000
[15]	train-merror:0.31727	train-WeightedAUC:-0.50000
[16]	train-merror:0.31727	train-WeightedAUC:-0.50000


  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1283.84it/s]

[Pipeline] ............ (step 4 of 8) Processing ACS FE, total=   0.1s
[Pipeline] .. (step 5 of 8) Processing Name Aggregation, total=   0.1s
[Pipeline] . (step 6 of 8) Processing Drop Features (2), total=   0.0s
[Pipeline] ............ (step 7 of 8) Processing Impute, total=   0.0s



[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


[Pipeline]  (step 8 of 8) Processing Correlated Feature Selection, total=   0.1s
Directory already exists

---
Transforming FE data
Pass through


  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1330.26it/s]


---
Saving FE data
...Output saved

---
building zrp_model

 training data shape:4,56

---
fitting zrp_model... n_class=2
[0]	train-merror:0.15203	train-WeightedAUC:-0.83333
Multiple eval metrics have been passed: 'train-WeightedAUC' will be used for early stopping.

Will train until train-WeightedAUC hasn't improved in 2000 rounds.



[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


[1]	train-merror:0.15203	train-WeightedAUC:-1.00000
[2]	train-merror:0.15203	train-WeightedAUC:-1.00000
[3]	train-merror:0.15203	train-WeightedAUC:-1.00000
[4]	train-merror:0.15203	train-WeightedAUC:-1.00000
[5]	train-merror:0.15203	train-WeightedAUC:-1.00000
[6]	train-merror:0.15203	train-WeightedAUC:-1.00000
[7]	train-merror:0.15203	train-WeightedAUC:-1.00000
[8]	train-merror:0.15203	train-WeightedAUC:-1.00000
[9]	train-merror:0.15203	train-WeightedAUC:-1.00000
[10]	train-merror:0.15203	train-WeightedAUC:-1.00000
[11]	train-merror:0.15203	train-WeightedAUC:-1.00000
[12]	train-merror:0.15203	train-WeightedAUC:-1.00000
[13]	train-merror:0.15203	train-WeightedAUC:-1.00000
[14]	train-merror:0.15203	train-WeightedAUC:-1.00000
[15]	train-merror:0.15203	train-WeightedAUC:-1.00000
[16]	train-merror:0.15203	train-WeightedAUC:-1.00000
[17]	train-merror:0.15203	train-WeightedAUC:-1.00000
[18]	train-merror:0.15203	train-WeightedAUC:-1.00000
[19]	train-merror:0.15203	train-WeightedAUC:-1.00000
[2

# Proxy using newly built ZRP model
Make sure to specify the path where the model was saved (the 'pipe_path' parameter). This means appending 'artifacts/experiments/[zrp_model_name]' to the 'file_path' you specified. In this case we did not specify a 'file_path', so the default is the directory from which this code is run.
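
Since 'pipe_path' must point at an existing build, it can help to confirm the directory is there before proxying. A minimal sketch, assuming the layout described above; `resolve_pipe_path` is a hypothetical helper, not a zrp function.

```python
import os
import tempfile


def resolve_pipe_path(zrp_model_name, file_path="."):
    """Return '[file_path]/artifacts/experiments/[zrp_model_name]' if it exists.

    Raises FileNotFoundError otherwise, so a typo in the model name is
    caught before any proxying work begins.
    """
    pipe_path = os.path.join(file_path, "artifacts", "experiments", zrp_model_name)
    if not os.path.isdir(pipe_path):
        raise FileNotFoundError(f"No built model found at {pipe_path!r}")
    return pipe_path


# Demonstration against a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "artifacts", "experiments", "demo_model"))
    print(os.path.basename(resolve_pipe_path("demo_model", file_path=tmp)))
    try:
        resolve_pipe_path("missing_model", file_path=tmp)
    except FileNotFoundError as err:
        print("not found:", type(err).__name__)
```

In this notebook the equivalent call would be `resolve_pipe_path('test_model_small_data')`, run from the directory where the model was built.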

In [8]:
test_sample = load_file(root + "/tests/data/sm_3.csv")
test_sample['ZEST_KEY'] = test_sample.index.astype(str)  # a key is required to match inputs to outputs

test_sample.shape

(5, 14)

In [12]:
%%time
zest_race_predictor = ZRP(file_path='artifacts/experiments/test_model_small_data+0',
                          pipe_path='artifacts/experiments/test_model_small_data',
                          runname='custom_outputs')
zest_race_predictor.fit()
zrp_output = zest_race_predictor.transform(test_sample)

  0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/5 [00:00<?, ?it/s]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
100%|██████████| 5/5 [00:00<00:00, 4855.64it/s]

####################################
Processing rows: 0:25000
####################################
Data is loaded
   [Start] Validating input data
     Number of observations: 5
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace

[Start] Preparing geo data

  The following states are included in the data: ['SC']
   ... on state: SC

   Data is loaded
   [Start] Processing geo data
      ...address cleaning
      ...replicating address
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=6)
         ...Base
         ...Map street suffixes...



[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.0s finished


         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=6)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data
      ...merge user input & lookup table


100%|██████████| 1/1 [00:02<00:00,  2.59s/it]

      ...mapping
   [Completed] Validating input geo data
Directory already exists
...Output saved
   [Completed] Mapping geo data

[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data
     Number of observations: 5
     Is key unique: True

   [Completed] Validating ACS input data

   ...loading ACS lookup tables





   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

   [Start] Validating pipeline input data
     Number of observations: 16
     Is key unique: False
   [Completed] Validating pipeline input data



  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1590.56it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1672.37it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


   ...Proxies generated
...Output saved
...Output saved
CPU times: user 22.8 s, sys: 2.39 s, total: 25.2 s
Wall time: 22.3 s


In [13]:
%%time
zest_race_predictor = ZRP(file_path = 'artifacts/experiments/test_model_small_data_1', pipe_path='artifacts/experiments/test_model_small_data')
zest_race_predictor.fit()
zrp_output = zest_race_predictor.transform(test_sample)

  0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/5 [00:00<?, ?it/s]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
100%|██████████| 5/5 [00:00<00:00, 9562.94it/s]

####################################
Processing rows: 0:25000
####################################
Data is loaded
   [Start] Validating input data
     Number of observations: 5
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace

[Start] Preparing geo data

  The following states are included in the data: ['SC']
   ... on state: SC

   Data is loaded
   [Start] Processing geo data
      ...address cleaning
      ...replicating address
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=6)
         ...Base
         ...Map street suffixes...



[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.0s finished


         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=6)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data
      ...merge user input & lookup table


100%|██████████| 1/1 [00:02<00:00,  2.56s/it]

      ...mapping
   [Completed] Validating input geo data
Directory already exists
...Output saved
   [Completed] Mapping geo data

[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data
     Number of observations: 5
     Is key unique: True

   [Completed] Validating ACS input data

   ...loading ACS lookup tables





   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

   [Start] Validating pipeline input data
     Number of observations: 16
     Is key unique: False
   [Completed] Validating pipeline input data



  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1120.87it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1258.42it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


   ...Proxies generated
...Output saved
...Output saved
CPU times: user 21.6 s, sys: 2.33 s, total: 24 s
Wall time: 22 s


*Note*: The following output data frame has not been evaluated for performance or accuracy. As stated above, this notebook trains on a very small dataset in order to demonstrate quickly how to use ZRP Build. A larger dataset is necessary to build a model with strong performance.

In [14]:
zrp_output

| Unnamed: 0 | ZEST_KEY | first_name | middle_name | last_name | house_number | street_address | city | state | zip_code | original_race | ... | original_sex | sex | age | AAPI | AIAN | BLACK | WHITE | race_proxy | source_zrp_block_group | source_zrp_zip_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | sam | T | j0nes | 1 2 1 | GRAYS MARKET RD | EARLY BRANCH | SC | 29916 | WHITE | ... | FEMALE | FEMALE | 19 | 0.0 | 0.0 | 0.165161 | 0.834839 | WHITE | 0.0 | 1.0 |
| 1 | 1 | henry | | ngo | 632 | SANDBAR PT | CLOVER | SC | 29710 | ASIAN | ... | MALE | MALE | 32 | 0.0 | 0.0 | 0.092399 | 0.907601 | WHITE | 0.0 | 1.0 |
| 2 | 2 | GABBY | L | BRIDGES | 605 | KERSHAW ST | CHERAW | SC | 29520 | WHITE | ... | FEMALE | FEMALE | 50 | 0.056507 | 0.015964 | 0.244797 | 0.682731 | WHITE | 1.0 | 0.0 |
| 3 | 3 | JAMES | M | HORN | 3401 | DUNCAN ST | COLUMBIA | SC | 29205 | WHITE | ... | MALE | MALE | 26 | 0.056507 | 0.015964 | 0.244797 | 0.682731 | WHITE | 1.0 | 0.0 |
| 4 | 4 | LONDON | Z | ABARA | 26 | PECAN CIR | YORK | SC | 29745 | BLACK/AFRICAN | ... | FEMALE | FEMALE | 22 | 0.056507 | 0.015964 | 0.244797 | 0.682731 | WHITE | 1.0 | 0.0 |
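
To see why the note above matters, the proxy labels can be set side by side with `original_race`. The sketch below copies the probabilities from the output above and assumes, as the rows shown suggest but this notebook does not state, that `race_proxy` is the class with the highest probability. `top_class` is a hypothetical helper.

```python
# Class probabilities copied from the output above, keyed by ZEST_KEY.
probabilities = {
    "0": {"AAPI": 0.0, "AIAN": 0.0, "BLACK": 0.165161, "WHITE": 0.834839},
    "1": {"AAPI": 0.0, "AIAN": 0.0, "BLACK": 0.092399, "WHITE": 0.907601},
    "2": {"AAPI": 0.056507, "AIAN": 0.015964, "BLACK": 0.244797, "WHITE": 0.682731},
    "3": {"AAPI": 0.056507, "AIAN": 0.015964, "BLACK": 0.244797, "WHITE": 0.682731},
    "4": {"AAPI": 0.056507, "AIAN": 0.015964, "BLACK": 0.244797, "WHITE": 0.682731},
}
original_race = {
    "0": "WHITE", "1": "ASIAN", "2": "WHITE", "3": "WHITE", "4": "BLACK/AFRICAN",
}


def top_class(probs):
    """Return the class name with the largest probability."""
    return max(probs, key=probs.get)


# Assumption: the proxy label is the highest-probability class.
proxies = {key: top_class(probs) for key, probs in probabilities.items()}
for key, proxy in proxies.items():
    print(f"{key}: original_race={original_race[key]:<14} top class={proxy}")
```

Every row comes out as WHITE regardless of its original label, which is consistent with a model trained on only six rows and is why these proxies should not be read as a measure of ZRP accuracy.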
