# How to Build a ZRP Model Using Your Own Data 

This notebook shows how to use ZRP_Build, a class that generates a new, custom ZRP model trained on user-supplied data. To build the custom model pipeline, you must supply the standard ZRP inputs (name and address) along with race. The pipeline, model, and supporting data are saved automatically to "./artifacts/experiments/{zrp_model_name}/" under the support files path you define.

This notebook is not intended to demonstrate ZRP performance. The dataset is deliberately tiny so that training runs quickly. For a notebook that demonstrates ZRP performance, see https://github.com/zestai/zrp/blob/main/examples/ZRP-Tutorial.ipynb.

In [1]:
%load_ext autoreload
%autoreload 2
%config Completer.use_jedi=False

In [2]:
from os.path import join, expanduser, dirname
import pandas as pd
import sys
import os
import re
import warnings

In [3]:
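warnings.filterwarnings(action='ignore')  # silence warnings to keep the notebook output readable
home = expanduser('~')

src_path = os.getcwd()                  # directory this notebook is run from
root = os.path.join(src_path, "../..")  # two levels up; the sample data is read from root/tests/data
sys.path.append(src_path)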

In [4]:
from zrp import ZRP
from zrp.modeling import ZRP_Build
from zrp.prepare.utils import load_file

# Load sample data for training

In [5]:
input_sample = load_file(root + "/tests/data/sm_1.csv")
input_sample['ZEST_KEY'] = input_sample.index.astype(str)  # a key must be specified to establish correspondence between inputs and outputs

input_sample.shape

(5, 14)

In [6]:
input_sample

| | ZEST_KEY | first_name | middle_name | last_name | house_number | street_address | city | state | zip_code | original_race | race | original_sex | sex | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | SAMANTHA | S | FERGUSON | 121 | GRAYS MARKET RD | EARLY BRANCH | SC | 29916 | WHITE | WHITE | FEMALE | FEMALE | 29 |
| 1 | 1 | HUOI | | A | 632 | SANDBAR PT | CLOVER | SC | 29710 | ASIAN | AAPI | MALE | MALE | 38 |
| 2 | 2 | GAIL | A | ABRUNZO | 605 | KERSHAW ST | CHERAW | SC | 29520 | WHITE | WHITE | FEMALE | FEMALE | 66 |
| 3 | 3 | JOHN | M | AHERN | 3401 | DUNCAN ST | COLUMBIA | SC | 29205 | WHITE | WHITE | MALE | MALE | 20 |
| 4 | 4 | PARIS | K | ASANI | 26 | PECAN CIR | YORK | SC | 29745 | BLACK/AFRICAN | BLACK | FEMALE | FEMALE | 18 |


Note the input sample columns. As detailed in ZRP Build docstrings, the following columns are needed: first_name, middle_name, last_name, house_number, street_address, city, state, zip_code, race.
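
Before calling ZRP Build, it can help to confirm that the training frame has every required column and that the key is unique. The following is a minimal sketch using only pandas. `REQUIRED_COLUMNS` and `check_zrp_build_input` are hypothetical names introduced here for illustration; they are not part of the zrp package, and the column list is simply copied from the paragraph above.

```python
import pandas as pd

# Columns listed above as required by ZRP Build (copied from this notebook's text).
REQUIRED_COLUMNS = [
    "first_name", "middle_name", "last_name", "house_number",
    "street_address", "city", "state", "zip_code", "race",
]


def check_zrp_build_input(df, key="ZEST_KEY"):
    """Hypothetical helper: return a list of problems found in df (empty if none)."""
    problems = []
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if key not in df.columns:
        problems.append(f"missing key column: {key}")
    elif not df[key].is_unique:
        problems.append(f"key column {key} is not unique")
    return problems


# Toy frame with the same column layout as the sample above (values are placeholders).
toy = pd.DataFrame({col: ["x", "y"] for col in REQUIRED_COLUMNS})
toy["ZEST_KEY"] = toy.index.astype(str)

print(check_zrp_build_input(toy))                       # no problems expected
print(check_zrp_build_input(toy.drop(columns="race")))  # reports the missing column
```

Returning a list of problems rather than raising on the first one makes it easier to fix several input issues in a single pass.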

# Invoke ZRP Build on the sample data

ZRP Build lets you specify where to put the artifacts folder and its files (pipeline, model(s), and supporting data) that are generated during intermediate steps. This is the 'file_path' parameter. If it is not specified, the artifacts folder is written to the folder from which the function is called.

The ZRP consists of a waterfall of three models: block group, census tract, and zip code. For a refresher on this architecture, please see https://github.com/zestai/zrp/blob/main/model_report.rst#prediction-process. When you build your own ZRP model using ZRP Build, all three models are trained. To reflect the structure of the ZRP module, ZRP Build places the three generated models and their associated files in distinct folders named for the geo level. These three folders are stored in the following directory: '[file_path]/artifacts/experiments/[zrp_model_name]/', where 'file_path' is the user-defined or default path to the 'artifacts' parent directory.

'zrp_model_name' is another relevant ZRP Build parameter; it specifies the name of the new model you are building. Giving each run a unique combination of 'file_path' and 'zrp_model_name' avoids overwriting previously built models in subsequent runs of ZRP Build.
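
The directory layout described above can be sketched with pathlib. This is an illustration only: `experiment_dir` and `geo_level_dirs` are hypothetical helpers, and the geo-level folder names (`block_group`, `census_tract`, `zip_code`) are an assumption based on the level names that appear in the build log, so check them against the folders ZRP Build actually writes.

```python
from pathlib import Path

# Assumed folder names, one per geo level of the waterfall.
GEO_LEVELS = ("block_group", "census_tract", "zip_code")


def experiment_dir(zrp_model_name, file_path="."):
    """Hypothetical helper: [file_path]/artifacts/experiments/[zrp_model_name]."""
    return Path(file_path) / "artifacts" / "experiments" / zrp_model_name


def geo_level_dirs(zrp_model_name, file_path="."):
    """Hypothetical helper: one expected sub-folder per geo level."""
    base = experiment_dir(zrp_model_name, file_path)
    return [base / level for level in GEO_LEVELS]


target = experiment_dir("test_model_small_data")
print(target.as_posix())  # artifacts/experiments/test_model_small_data

# Guard against overwriting a previously built model with the same name.
if target.exists():
    print("A model with this name already exists here; choose another name or file_path.")
```

Checking `target.exists()` before a build is one simple way to notice a name collision early, instead of discovering overwritten artifacts afterwards.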

In [6]:
zest_race_predictor = ZRP_Build(zrp_model_name='test_model_small_data') 
zest_race_predictor.fit()
output = zest_race_predictor.transform(input_sample)

  0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/5 [00:00<?, ?it/s]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 5/5 [00:00<00:00, 819.23it/s]

Data is loaded
   [Start] Validating input data
     Number of observations: 5
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace

[Start] Preparing geo data

  The following states are included in the data: ['SC']
   ... on state: SC

   Data is loaded
   [Start] Processing geo data
      ...address cleaning
      ...replicating address
         ...Base
         ...Map street suffixes...



[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.0s finished


         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=5)
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=5)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data
      ...merge user input & lookup table
      ...mapping
   [Completed] Validating input geo data
Directory already exists
Output saved
   [Completed] Mapping geo data


100%|██████████| 1/1 [00:03<00:00,  3.71s/it]



[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data
     Number of observations: 5
     Is key unique: True

   [Completed] Validating ACS input data

   ...loading ACS lookup tables
   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

BUILDING block_group MODEL.

Directory already exists
Dropping ['census_tract', 'zip_code'] features
Len features to keep list:  98
Data shape pre feature drop:  (12, 932)
Index(['B03001_001', 'B03001_002', 'B03001_003', 'B03001_004', 'B03001_005',
       'B03001_006', 'B03001_007', 'B03001_008', 'B03001_009', 'B03001_010',
       ...
       'age', 'city', 'house_number', 'original_race', 'original_sex', 'sex',
       'state', 'street_address', 'zest_in_state_fips', 'zip_code'],
      dtype='object', length=835)
Data shape post feature drop:  (12, 97)
Directory already exists
Output sa

  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1526.87it/s]

[Pipeline] ............ (step 3 of 8) Processing App FE, total=   0.1s
[Pipeline] ............ (step 4 of 8) Processing ACS FE, total=   0.1s
[Pipeline] .. (step 5 of 8) Processing Name Aggregation, total=   0.1s
[Pipeline] . (step 6 of 8) Processing Drop Features (2), total=   0.0s
[Pipeline] ............ (step 7 of 8) Processing Impute, total=   0.0s
[Pipeline]  (step 8 of 8) Processing Correlated Feature Selection, total=   0.0s
Directory already exists

---
Transforming FE data
Pass through



[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1736.77it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished



---
Saving FE data
Output saved
X_train_fe:  <bound method NDFrame.head of           B08301_010  B02001_006  B02001_004  B25004_006  B25075_008_pct  \
ZEST_KEY                                                                   
2                0.0         0.0         0.0         0.0             0.0   
3                0.0         0.0         0.0         0.0             0.0   
4                0.0         0.0         0.0         0.0             0.0   

          B25075_006_pct  B25075_003_pct  B25075_009_pct  B25075_004_pct  \
ZEST_KEY                                                                   
2                    0.0             0.0             0.0             0.0   
3                    0.0             0.0             0.0             0.0   
4                    0.0             0.0             0.0             0.0   

          B25075_007_pct  ...  B08301_013_pct  B08301_012_pct  B08301_011_pct  \
ZEST_KEY                  ...                                                   


  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1410.80it/s]

Output saved

---
Building pipeline

---
Fitting pipeline
[Pipeline] ..... (step 1 of 8) Processing Drop Features, total=   0.0s
Pass through
[Pipeline] .. (step 2 of 8) Processing Compound Name FE, total=   0.0s
[Pipeline] ............ (step 3 of 8) Processing App FE, total=   0.1s
[Pipeline] ............ (step 4 of 8) Processing ACS FE, total=   0.1s



[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


[Pipeline] .. (step 5 of 8) Processing Name Aggregation, total=   0.1s
[Pipeline] . (step 6 of 8) Processing Drop Features (2), total=   0.0s
[Pipeline] ............ (step 7 of 8) Processing Impute, total=   0.0s
[Pipeline]  (step 8 of 8) Processing Correlated Feature Selection, total=   0.1s
Directory already exists

---
Transforming FE data
Pass through


  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1010.19it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished



---
Saving FE data
Output saved
X_train_fe:  <bound method NDFrame.head of           B03001_001  B03001_008  B23020_001  B03001_012  B06009_025  \
ZEST_KEY                                                               
2             3743.0         0.0        38.8         0.0       305.0   
3             3743.0         0.0        38.8         0.0       305.0   
4             3743.0         0.0        38.8         0.0       305.0   

          B19001I_001  B04006_049  B03001_016  B10051B_002  B19001B_002  ...  \
ZEST_KEY                                                                 ...   
2                87.0       406.0        62.0         26.0         11.0  ...   
3                87.0       406.0        62.0         26.0         11.0  ...   
4                87.0       406.0        62.0         26.0         11.0  ...   

          B08301_013_pct  B08301_012_pct  B08301_011_pct  B08301_016_pct  \
ZEST_KEY                                                                   
2         

  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 838.69it/s]

[Pipeline] ............ (step 4 of 8) Processing ACS FE, total=   0.1s
[Pipeline] .. (step 5 of 8) Processing Name Aggregation, total=   0.1s
[Pipeline] . (step 6 of 8) Processing Drop Features (2), total=   0.0s
[Pipeline] ............ (step 7 of 8) Processing Impute, total=   0.0s



[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 927.53it/s]

[Pipeline]  (step 8 of 8) Processing Correlated Feature Selection, total=   0.0s
Directory already exists

---
Transforming FE data
Pass through



[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished



---
Saving FE data
Output saved
X_train_fe:  <bound method NDFrame.head of           B19001D_014  B10051I_007  B06009_020  B02001_006  B02001_004  \
ZEST_KEY                                                                 
0                 0.0         35.5         0.0         0.0        63.5   
1                 0.0         71.0         0.0         0.0       254.0   
2                 0.0         35.5         0.0         0.0         0.0   
3                 0.0          0.0         0.0         0.0         0.0   
4                 0.0         35.5         0.0         0.0         0.0   

          B19001D_015  unallocated_race  B25075_016_pct  B25075_004_pct  \
ZEST_KEY                                                                  
0                 0.0          0.996364        0.016013        0.001105   
1                 0.0          0.991068        0.019620        0.000760   
2                 0.0          1.000000        0.022260        0.000000   
3                 0.0         

# Proxy using the newly built ZRP model

Make sure to specify the path where the model was written (the 'pipe_path' parameter). To form it, append 'artifacts/experiments/[zrp_model_name]' to your specified 'file_path'. In this case we did not specify a 'file_path', so the default is the path from which this code is run.
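
One way to assemble that path and fail early if the model folder is missing is sketched below. `resolve_pipe_path` is a hypothetical helper, not part of the zrp package; it only joins the path components described above and checks that the directory exists. The demonstration uses a temporary directory so that it runs without a built model.

```python
import os
import tempfile


def resolve_pipe_path(zrp_model_name, file_path=None):
    """Hypothetical helper: build the pipe_path for a model written by ZRP Build.

    When file_path is None, fall back to the current working directory,
    mirroring the default described above.
    """
    base = os.getcwd() if file_path is None else file_path
    pipe_path = os.path.join(base, "artifacts", "experiments", zrp_model_name)
    if not os.path.isdir(pipe_path):
        raise FileNotFoundError(f"No built model found at {pipe_path}")
    return pipe_path


# Demonstration against a throwaway directory that mimics the expected layout.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "artifacts", "experiments", "demo_model"))
    print(resolve_pipe_path("demo_model", file_path=tmp))
```

The returned string could then be passed as the `pipe_path` argument when constructing `ZRP`.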

In [7]:
test_sample = load_file(root + "/tests/data/sm_3.csv")
test_sample['ZEST_KEY'] = test_sample.index.astype(str)  # a key must be specified to establish correspondence between inputs and outputs

test_sample.shape

(5, 14)

In [8]:
%%time
zest_race_predictor = ZRP(pipe_path='artifacts/experiments/test_model_small_data')
zest_race_predictor.fit()
zrp_output = zest_race_predictor.transform(test_sample)

  0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/5 [00:00<?, ?it/s]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 5/5 [00:00<00:00, 1918.36it/s]

Directory already exists
Data is loaded
   [Start] Validating input data
     Number of observations: 5
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace

[Start] Preparing geo data

  The following states are included in the data: ['SC']
   ... on state: SC

   Data is loaded
   [Start] Processing geo data
      ...address cleaning
      ...replicating address
         ...Base
         ...Map street suffixes...



[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.0s finished


         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=5)
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=5)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data
      ...merge user input & lookup table
      ...mapping
   [Completed] Validating input geo data
Directory already exists


100%|██████████| 1/1 [01:49<00:00, 109.88s/it]

Output saved
   [Completed] Mapping geo data


100%|██████████| 1/1 [01:50<00:00, 110.38s/it]



[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data
     Number of observations: 5
     Is key unique: True

   [Completed] Validating ACS input data

   ...loading ACS lookup tables
   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

   [Start] Validating pipeline input data
     Number of observations: 16
     Is key unique: False
   [Completed] Validating pipeline input data



  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1102.60it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1349.08it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


Directory already exists
Output saved
Output saved
CPU times: user 34.4 s, sys: 7.63 s, total: 42 s
Wall time: 28min 43s


*Note*: The following output data frame should not be evaluated for performance or accuracy. As stated above, this notebook trains on a very small dataset in order to demonstrate quickly how to use ZRP Build. A larger dataset is necessary to build a model with strong performance.