# ZRP Example Usage
The purpose of this notebook is to illustrate how to use `ZRP`, the main class of the zrp package, which processes user input data and returns race/ethnicity predictions.

In [1]:
%load_ext autoreload
%autoreload 2
%config Completer.use_jedi=False

In [2]:
from os.path import join, expanduser
import pandas as pd
import sys
import os
import re
import warnings

## Set source code path here

In [3]:
warnings.filterwarnings(action='once')
home = expanduser('~')

src_path = os.getcwd()
sys.path.append(src_path)

In [4]:
from zrp import ZRP
from zrp.prepare.utils import load_file, load_json

## Load sample data for prediction
Load the list of New Jersey mayors downloaded from https://www.nj.gov/dca/home/2022mayors.csv

In [5]:
nj_mayors = load_file(join(src_path, "2022-nj-mayors.csv"))
nj_mayors.shape

(565, 18)

In [6]:
nj_mayors

Unnamed: 0,MUNI CODE,MUNI NAME,COUNTY,ADDRESS 1,ADDRESS 2,CITY,STATE,ZIP,PHONE,FAX,MAYOR NAME,TERM START,TERM END,FORM,TERM LEGNTH,EMAIL,SOCIAL MEDIA HANDLE,Municipal Contact List
0,1330,Aberdeen Township,Monmouth,One Aberdeen Square,,Aberdeen,NJ,07747-2300,(732) 583-4200,,Fred Tagliarini,,12/31/2025,COUNCIL-MANAGER,4,fred.tagliarini@aberdeennj.org,,
1,0101,Absecon City,Atlantic,Absecon Municipal Complex,500 Mill Road,Absecon,NJ,08201,(609) 641-0663,(609) 645-5098,Kimberly Horton,,12/31/2024,MAYOR-COUNCIL,3,khorton@abseconnj.org,,
2,1001,Alexandria Township,Hunterdon,782 Frenchtown Road,,Milford,NJ,08848,(908) 996-7071,,Gabe Plumer,,12/31/2022,TOWNSHIP,3,clerk@alexandrianj.gov,,
3,2101,Allamuchy Township,Warren,Post Office Box A,,Allamuchy,NJ,07820,(908) 852-5132,,Rosemary Tuohy,,12/31/2024,FAULKNER ACT,3,mayor@allamuchynj.org,,
4,0201,Allendale Borough,Bergen,500 West Crescent Avenue,,Allendale,NJ,07401,(201) 818-4400,,Ari Bernstein,,12/31/2022,,,aribernstein@allendalenj.gov,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
560,0269,Wood-Ridge Borough,Bergen,85 Humboldt Street,,Wood-Ridge,NJ,07075-2344,(201) 939-0202,,Paul A Sarlo,,12/31/2023,,,psarlo@njwoodridge.org,,
561,1715,Woodstown Borough,Salem,Post Office Box 286,,Woodstown,NJ,08098,(856) 769-2200,,Donald Dietrich,,12/31/2023,,,Don.dietrich@comcast.net,,
562,0824,Woolwich Township,Gloucester,120 Village Green Drive,,Woolwich Township,NJ,08085-3180,(856) 467-2666,,Craig Frederick,,12/31/2024,,,cfrederick@woolwichtwp.org,,
563,0340,Wrightstown Borough,Burlington,21 Saylors Pond Road,,Wrightstown,NJ,08562,(609) 723-4450,(609) 723-7137,Donald Cottrell,,12/31/2022,,,mayor@wrightstownborough.com,,


### Wrangle NJ mayor data for predictions
Prepare the NJ mayor data. This parsing of the NJ mayors file leaves some missing values (NAs), but it is sufficient for demonstration purposes.


In [7]:
zrp_sample = pd.DataFrame(columns=['first_name', 'middle_name', 'last_name', 'house_number', 'street_address', 'city', 'state', 'zip_code'])

Prepare Names

In [8]:
split_mayor_names = nj_mayors['MAYOR NAME'].str.split(' ')
zrp_sample['first_name'] = split_mayor_names.str[0]
zrp_sample['last_name'] = split_mayor_names.str[-1]
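
The split above keeps only the first and last tokens, so a middle initial such as the "A" in "Paul A Sarlo" is dropped and `middle_name` stays empty. The following is a minimal sketch of one way to keep it, assuming names are whitespace separated and carry no prefixes or suffixes. `split_name` is a hypothetical helper and not part of zrp.

```python
def split_name(full_name):
    """Split a full name into (first, middle, last) on whitespace.

    Hypothetical helper: assumes no prefixes or suffixes (e.g. "Jr.").
    Non-string input (such as NaN) yields (None, None, None).
    """
    if not isinstance(full_name, str):
        return (None, None, None)
    tokens = full_name.split()
    if not tokens:
        return (None, None, None)
    if len(tokens) == 1:
        return (tokens[0], None, None)
    first, last = tokens[0], tokens[-1]
    # Everything between the first and last token is treated as the middle name.
    middle = " ".join(tokens[1:-1]) or None
    return (first, middle, last)


print(split_name("Paul A Sarlo"))     # ('Paul', 'A', 'Sarlo')
print(split_name("Fred Tagliarini"))  # ('Fred', None, 'Tagliarini')
```

One way to use it here would be `nj_mayors['MAYOR NAME'].map(split_name)`, unpacking the resulting tuples into the three name columns.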

City, State, Zip

In [9]:
zrp_sample['city'] = nj_mayors['CITY']
zrp_sample['state'] = nj_mayors['STATE']
zrp_sample['zip_code'] = nj_mayors['ZIP']

Address

In [10]:
zrp_sample['house_number'] = nj_mayors['ADDRESS 1'].str.extract('([0-9]+)')
zrp_sample['street_address'] = nj_mayors['ADDRESS 1'].str.extract('.*[0-9]+([^0-9]+)')
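
Note how the two patterns above behave: the house number is the first run of digits anywhere in the string, and the street is whatever follows the last run of digits. A numbered street therefore loses its number (a hypothetical "12 34th Street" yields the street "th Street"), and "Post Office Box 286" yields house number 286 with no street. A stricter alternative, sketched below under the assumption that a street address starts with its house number, only accepts a leading number. `split_address` is a hypothetical helper, not part of zrp.

```python
import re

# A leading run of digits, whitespace, then the remainder of the line.
_ADDRESS_RE = re.compile(r"^\s*(\d+)\s+(\S.*?)\s*$")


def split_address(address):
    """Return (house_number, street) or (None, None) if there is no leading number.

    Hypothetical helper: assumes the house number, when present, comes first.
    """
    if not isinstance(address, str):
        return (None, None)
    match = _ADDRESS_RE.match(address)
    if match is None:
        return (None, None)
    return (match.group(1), match.group(2))


print(split_address("782 Frenchtown Road"))  # ('782', 'Frenchtown Road')
print(split_address("12 34th Street"))       # ('12', '34th Street')
print(split_address("Post Office Box 286"))  # (None, None)
```

The trade-off is that addresses without a leading number, such as post office boxes and named buildings, produce no house number at all, so those rows would rely on ZIP code or name information instead.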


In [11]:
zrp_sample['ZEST_KEY'] = zrp_sample.index.astype(str)  # a key must be specified to establish correspondence between inputs and outputs
zrp_sample

Unnamed: 0,first_name,middle_name,last_name,house_number,street_address,city,state,zip_code,ZEST_KEY
0,Fred,,Tagliarini,,,Aberdeen,NJ,07747-2300,0
1,Kimberly,,Horton,,,Absecon,NJ,08201,1
2,Gabe,,Plumer,782,Frenchtown Road,Milford,NJ,08848,2
3,Rosemary,,Tuohy,,,Allamuchy,NJ,07820,3
4,Ari,,Bernstein,500,West Crescent Avenue,Allendale,NJ,07401,4
...,...,...,...,...,...,...,...,...,...
560,Paul,,Sarlo,85,Humboldt Street,Wood-Ridge,NJ,07075-2344,560
561,Donald,,Dietrich,286,,Woodstown,NJ,08098,561
562,Craig,,Frederick,120,Village Green Drive,Woolwich Township,NJ,08085-3180,562
563,Donald,,Cottrell,21,Saylors Pond Road,Wrightstown,NJ,08562,563


### Invoke the Zest Race Predictor on the sample data

To run with custom column names, provide a mapping from the expected column names (keys) to your custom column names (values), for example:

`        ZRP(**{'first_name':'example_first_name',
               'middle_name':'example_middle_name',
               'last_name':'example_last_name',
               'house_number':'example_house_number',
               'street_address':'example_street_address',
               'zip_code':'example_zip_code',
               'state':'example_state',
               'census_tract':'example_census_tract',
               'block_group':'example_block_group',
        })`
        
We recommend providing all of the dictionary keys above. If Census tract or Census block group is unavailable, `ZRP()` geocodes the input data using Census shapefile data. If house number is also unavailable, `ZRP()` uses ZIP/postal codes and underlying data to return proxies. All other columns are required, but `ZRP()` can still generate proxies when columns such as middle name (or even first or last name) are mostly or entirely missing. To accommodate more fair audit workflows, we have enabled generating ZRP (name + geo), BISG (name + geo), ZRP name-only, and ZRP geo-only proxies.
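
Because the kind of proxy returned depends on which fields are populated, it can help to check how complete each expected column is before calling `transform`. The sketch below is an illustrative pre-flight check rather than anything `ZRP()` does itself. The column list is copied from the mapping above, and `missing_share` is a hypothetical helper.

```python
import pandas as pd

# Expected column names, taken from the mapping shown above.
EXPECTED_COLUMNS = [
    "first_name", "middle_name", "last_name",
    "house_number", "street_address",
    "zip_code", "state", "census_tract", "block_group",
]


def missing_share(df, expected=EXPECTED_COLUMNS):
    """Return the fraction of missing values per expected column.

    Columns absent from `df` are reported as fully missing (1.0).
    """
    shares = {}
    for column in expected:
        if column in df.columns:
            shares[column] = float(df[column].isna().mean())
        else:
            shares[column] = 1.0
    return pd.Series(shares, name="missing_share")


# Toy frame with the same layout as `zrp_sample` (two rows from the table above).
toy = pd.DataFrame({
    "first_name": ["Gabe", "Fred"],
    "middle_name": [None, None],
    "last_name": ["Plumer", "Tagliarini"],
    "house_number": ["782", None],
    "street_address": ["Frenchtown Road", None],
    "zip_code": ["08848", "07747-2300"],
    "state": ["NJ", "NJ"],
})
report = missing_share(toy)
print(report)
```

On the real data, `missing_share(zrp_sample)` would show at a glance that `middle_name`, `census_tract`, and `block_group` are entirely missing in this example.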

Initialize, fit & transform `ZRP()`

In [14]:
%%time
zest_race_predictor = ZRP()
zest_race_predictor.fit()
zrp_output = zest_race_predictor.transform(zrp_sample)

  0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/565 [00:00<?, ?it/s]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:    0.1s
100%|██████████| 565/565 [00:00<00:00, 7439.98it/s]

Directory already exists
####################################
Processing rows: 0:25000
####################################
Data is loaded
   [Start] Validating input data
     Number of observations: 565
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace

[Start] Preparing geo data

  The following states are included in the data: ['NJ']
   ... on state: NJ

   Data is loaded
   [Start] Processing geo data
      ...address cleaning



[Parallel(n_jobs=-1)]: Done 565 out of 565 | elapsed:    0.1s finished


      ...replicating address
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=565)
         ...Base
         ...Map street suffixes...
         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=1003)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data
      ...merge user input & lookup table
      ...mapping


100%|██████████| 1/1 [00:04<00:00,  4.08s/it]

   [Completed] Validating input geo data
Directory already exists
...Output saved
   [Completed] Mapping geo data

[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data
     Number of observations: 565
     Is key unique: True

   [Completed] Validating ACS input data

   ...loading ACS lookup tables





   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

   [Start] Validating pipeline input data
     Number of observations: 1743
     Is key unique: False
   [Completed] Validating pipeline input data



  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 601.51it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 765.66it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 656.49it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 435.00it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


   ...Proxies generated
...Output saved
...Output saved
CPU times: user 50.2 s, sys: 5.11 s, total: 55.3 s
Wall time: 41 s


### Inspect the output and join

In [15]:
zrp_output

Unnamed: 0,first_name,middle_name,last_name,house_number,street_address,city,state,zip_code,ZEST_KEY,AAPI,AIAN,BLACK,HISPANIC,WHITE,race_proxy,source_zrp_block_group,source_zrp_census_tract,source_zrp_zip_code,source_bisg,source_zrp_name_only
0,Fred,,Tagliarini,,,Aberdeen,NJ,07747-2300,0,0.000445,0.000366,0.001628,0.000926,0.996634,WHITE,0.0,0.0,1.0,0.0,0.0
1,Kimberly,,Horton,,,Absecon,NJ,08201,1,0.020371,0.022385,0.363591,0.042607,0.551045,WHITE,0.0,0.0,1.0,0.0,0.0
2,Gabe,,Plumer,782,Frenchtown Road,Milford,NJ,08848,2,0.000117,0.000096,0.031971,0.002676,0.965141,WHITE,1.0,0.0,0.0,0.0,0.0
3,Rosemary,,Tuohy,,,Allamuchy,NJ,07820,3,0.000583,0.000500,0.001106,0.002331,0.995479,WHITE,0.0,0.0,1.0,0.0,0.0
4,Ari,,Bernstein,500,West Crescent Avenue,Allendale,NJ,07401,4,0.016766,0.008396,0.002317,0.049893,0.922627,WHITE,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
560,Paul,,Sarlo,85,Humboldt Street,Wood-Ridge,NJ,07075-2344,560,0.000533,0.000352,0.000895,0.000873,0.997347,WHITE,0.0,0.0,1.0,0.0,0.0
561,Donald,,Dietrich,286,,Woodstown,NJ,08098,561,0.009935,0.000523,0.007084,0.008390,0.974068,WHITE,0.0,0.0,1.0,0.0,0.0
562,Craig,,Frederick,120,Village Green Drive,Woolwich Township,NJ,08085-3180,562,0.012849,0.016040,0.291692,0.019360,0.660060,WHITE,1.0,0.0,0.0,0.0,0.0
563,Donald,,Cottrell,21,Saylors Pond Road,Wrightstown,NJ,08562,563,0.011897,0.027253,0.118932,0.023193,0.818726,WHITE,1.0,0.0,0.0,0.0,0.0
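
In the rows shown above, the five probability columns sum to roughly 1 and `race_proxy` matches the column with the highest probability. The sketch below checks both properties on two rows copied from the output. That they hold for every row is an assumption worth verifying on the full `zrp_output` rather than a documented guarantee.

```python
import pandas as pd

RACE_COLUMNS = ["AAPI", "AIAN", "BLACK", "HISPANIC", "WHITE"]

# Two rows copied from the output above (ZEST_KEY 0 and 1).
rows = pd.DataFrame({
    "ZEST_KEY": ["0", "1"],
    "AAPI": [0.000445, 0.020371],
    "AIAN": [0.000366, 0.022385],
    "BLACK": [0.001628, 0.363591],
    "HISPANIC": [0.000926, 0.042607],
    "WHITE": [0.996634, 0.551045],
    "race_proxy": ["WHITE", "WHITE"],
})

# Probabilities should sum to about 1 per row (up to rounding in the display).
row_sums = rows[RACE_COLUMNS].sum(axis=1)
sums_ok = bool(((row_sums - 1.0).abs() < 1e-4).all())

# The label should be the column holding the largest probability.
top_class = rows[RACE_COLUMNS].idxmax(axis=1)
labels_ok = bool((top_class == rows["race_proxy"]).all())

print(sums_ok, labels_ok)
```

Row 1 is a useful reminder that the label alone hides uncertainty: the proxy is WHITE, yet the BLACK probability is about 0.36.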


### Check the records most likely to be Hispanic

In [16]:
zrp_output.nlargest(10, "HISPANIC")

Unnamed: 0,first_name,middle_name,last_name,house_number,street_address,city,state,zip_code,ZEST_KEY,AAPI,AIAN,BLACK,HISPANIC,WHITE,race_proxy,source_zrp_block_group,source_zrp_census_tract,source_zrp_zip_code,source_bisg,source_zrp_name_only
377,Hector,,Lora,330.0,Passaic Street,Passaic,NJ,07055-5815,377,0.000128,0.00023,0.001379,0.97948,0.018783,HISPANIC,1.0,0.0,0.0,0.0,0.0
286,Marcial,,Mojena,249.0,,Columbus,NJ,08022,286,0.00102,0.000309,0.001687,0.965955,0.031029,HISPANIC,0.0,0.0,1.0,0.0,0.0
536,Gabriel,,Rodriguez,428.0,th Street,West New York,NJ,07093-2222,536,0.001721,0.005429,0.002682,0.958142,0.032027,HISPANIC,0.0,0.0,1.0,0.0,0.0
388,Helmin,,Caba,,,Perth Amboy,NJ,08861,388,0.021563,0.000426,0.007714,0.954092,0.016205,HISPANIC,0.0,0.0,1.0,0.0,0.0
236,Alberto,,Santos,402.0,Kearny Avenue,Kearny,NJ,07032,236,0.019993,0.000571,0.003235,0.951424,0.024777,HISPANIC,1.0,0.0,0.0,0.0,0.0
543,Ray,,Arroyo,101.0,Washington Avenue,Westwood,NJ,07675,543,0.010628,0.004962,0.00505,0.916381,0.062979,HISPANIC,0.0,0.0,1.0,0.0,0.0
499,Manuel,,Figueiredo,,,Union,NJ,07083-3597,499,0.000594,0.000619,0.001788,0.916256,0.080742,HISPANIC,0.0,0.0,1.0,0.0,0.0
378,Andre,,Sayegh,125.0,st Floor,Paterson,NJ,07505-1414,378,0.074799,0.000378,0.005142,0.889286,0.030395,HISPANIC,0.0,0.0,1.0,0.0,0.0
418,Ramopn,,Hache,131.0,North Maple Avenue,Ridgewood,NJ,07450-3236,418,0.001488,0.000555,0.001574,0.843779,0.152605,HISPANIC,0.0,0.0,1.0,0.0,0.0
398,Peter,,Cantu,641.0,Plainsboro Road,Plainsboro,NJ,08536,398,0.047685,0.007551,0.00451,0.83916,0.101094,HISPANIC,1.0,0.0,0.0,0.0,0.0


### Check the records most likely to be Black

In [17]:
zrp_output.nlargest(10, "BLACK")

Unnamed: 0,first_name,middle_name,last_name,house_number,street_address,city,state,zip_code,ZEST_KEY,AAPI,AIAN,BLACK,HISPANIC,WHITE,race_proxy,source_zrp_block_group,source_zrp_census_tract,source_zrp_zip_code,source_bisg,source_zrp_name_only
343,Ras,,Baraka,920.0,Broad Street,Newark,NJ,07102,343,0.000497,7e-05,0.978793,0.002025,0.018616,BLACK,1.0,0.0,0.0,0.0,0.0
215,Dahlia,,Vertreese,,,Hillside,NJ,07205,215,0.000565,0.000883,0.968902,0.010385,0.019265,BLACK,0.0,0.0,1.0,0.0,0.0
549,Tiffani,,Worthy,1.0,Salem Road,Willingboro,NJ,08046,549,0.00084,0.000422,0.963874,0.008897,0.025967,BLACK,1.0,0.0,0.0,0.0,0.0
229,Anthony,,Vauss,,,Irvington,NJ,07111-2412,229,0.00328,0.001508,0.962892,0.019847,0.012472,BLACK,0.0,0.0,1.0,0.0,0.0
370,Dwayne,,Warren,29.0,North Day Street,Orange,NJ,07050,370,0.003101,0.009734,0.955713,0.006223,0.025229,BLACK,1.0,0.0,0.0,0.0,0.0
397,Adrian,,Mapp,515.0,Watchung Avenue,Plainfield,NJ,07060-1720,397,0.004099,0.008725,0.944972,0.020197,0.022007,BLACK,1.0,0.0,0.0,0.0,0.0
258,Derek,,Armstead,301.0,North Wood Avenue,Linden,NJ,07036-4296,258,0.019287,0.008417,0.923846,0.005005,0.043444,BLACK,1.0,0.0,0.0,0.0,0.0
78,Jamila,,Odom-Bremmer,201.0,Grant Avenue,Chesilhurst,NJ,08089,78,0.013521,0.016294,0.919407,0.018952,0.031827,BLACK,1.0,0.0,0.0,0.0,0.0
250,Mary,,Wardlow,4.0,East Douglas Avenue,Lawnside,NJ,08045-1597,250,0.000533,0.000994,0.90578,0.023291,0.069402,BLACK,0.0,0.0,1.0,0.0,0.0
493,Maurice,,Hill,33.0,Washington Street,Toms River,NJ,08754,493,0.003581,0.015931,0.903502,0.016041,0.060945,BLACK,0.0,0.0,0.0,0.0,1.0
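
The `source_*` columns appear to be one-hot flags recording which model or geographic level produced each proxy. For example, the last row above has `source_zrp_name_only` set to 1.0. That reading is inferred from the column names and values rather than from documentation. Under that assumption, the flags can be collapsed into a single label per row, as in this sketch, where `source_label` is a hypothetical helper.

```python
import pandas as pd

SOURCE_COLUMNS = [
    "source_zrp_block_group", "source_zrp_census_tract",
    "source_zrp_zip_code", "source_bisg", "source_zrp_name_only",
]


def source_label(df, source_columns=SOURCE_COLUMNS):
    """Return the name of the flagged source column per row (NaN if none is set)."""
    flags = df[source_columns].fillna(0.0)
    label = flags.idxmax(axis=1)
    # idxmax returns the first column for an all-zero row, so mask those rows out.
    return label.where(flags.sum(axis=1) > 0)


# Source flags copied from three rows of the output above.
toy = pd.DataFrame({
    "ZEST_KEY": ["343", "215", "493"],
    "source_zrp_block_group": [1.0, 0.0, 0.0],
    "source_zrp_census_tract": [0.0, 0.0, 0.0],
    "source_zrp_zip_code": [0.0, 1.0, 0.0],
    "source_bisg": [0.0, 0.0, 0.0],
    "source_zrp_name_only": [0.0, 0.0, 1.0],
})
toy["source"] = source_label(toy)
print(toy["source"].value_counts())
```

Applied to the full output, `source_label(zrp_output).value_counts()` would summarize how many proxies came from each level.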


BISG proxies are saved by default when `ZRP` is run. Below we load the BISG proxies.

In [18]:
bisg_output = pd.read_feather("artifacts/r022_bisg_proxy_output.feather")

FileNotFoundError: [Errno 2] No such file or directory: 'artifacts/r022_bisg_proxy_output.feather'
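
The error above means the relative path did not resolve from the notebook's working directory. Where the file was actually written is not shown here, so the sketch below only searches for it instead of assuming a location. `find_artifacts` is a hypothetical helper, and the glob pattern is taken from the file name in the failing call.

```python
from pathlib import Path


def find_artifacts(root=".", pattern="*bisg_proxy_output.feather"):
    """Return files under `root` whose names match `pattern`, sorted by path."""
    return sorted(Path(root).rglob(pattern))
```

If `find_artifacts()` returns a non-empty list, its first entry can be passed to `pd.read_feather`. An empty list suggests the output was written somewhere outside the current working directory.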

In [None]:
bisg_output.head()

In [None]:
bisg_output

How many proxies does BISG return?

In [None]:
f"Out of {bisg_output.shape[0]} records only {bisg_output[bisg_output.race_proxy.notna()].shape[0]} proxies are returned"  

How many proxies does ZRP return?

In [None]:
f"Out of {zrp_output.shape[0]} records {zrp_output[zrp_output.race_proxy.notna()].shape[0]} proxies are returned"  
