# How to Generate Race & Ethnicity Predictions using ZRP
The purpose of this notebook is to illustrate how to use `ZRP_Predict`, a module that generates race & ethnicity predictions.

In [1]:
%load_ext autoreload
%autoreload 2
%config Completer.use_jedi=False

In [2]:
from os.path import join, expanduser, dirname
import pandas as pd
import sys
import os
import re
import warnings

In [3]:
warnings.filterwarnings(action='ignore')
home = expanduser('~')

src_path = '{}/zrp'.format(home)
sys.path.append(src_path)

In [4]:
from zrp.modeling.predict import ZRP_Predict
from zrp.prepare.prepare import ZRP_Prepare
from zrp.prepare.utils import load_file



## Load sample data for prediction
Load the processed list of New Jersey mayors, downloaded from https://www.nj.gov/dca/home/2022mayors.csv

In [5]:
nj_mayors = load_file("../2022-nj-mayors-sample.csv")
nj_mayors.shape

(462, 9)

In [6]:
nj_mayors

Unnamed: 0,first_name,middle_name,last_name,house_number,street_address,city,state,zip_code,ZEST_KEY
0,Gabe,,Plumer,782,Frenchtown Road,Milford,NJ,08848,2
1,Ari,,Bernstein,500,West Crescent Avenue,Allendale,NJ,07401,4
2,David,J.,Mclaughlin,125,Corlies Avenue,Allenhurst,NJ,07711-1049,5
3,Thomas,C.,Fritts,8,North Main Street,Allentown,NJ,08501-1607,6
4,P.,,McCkelvey,49,South Greenwich Street,Alloway,NJ,08001-0425,7
...,...,...,...,...,...,...,...,...,...
457,William,,Degroff,3943,Route,Chatsworth,NJ,08019,558
458,Joseph,,Chukwueke,200,Cooper Avenue,Woodlynne,NJ,08107-2108,559
459,Paul,,Sarlo,85,Humboldt Street,Wood-Ridge,NJ,07075-2344,560
460,Craig,,Frederick,120,Village Green Drive,Woolwich Township,NJ,08085-3180,562


### ZRP Prepare
Predictions can only be generated from prepared data: data that has been processed, mapped to Census GEOIDs (e.g., census tract), and has American Community Survey (ACS) attributes mapped to each unique record. To prepare the data, we use `ZRP_Prepare`.
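
Before preparing, it can help to confirm that the input table has the columns shown in the sample above and that the key is unique, since the preparation log reports both the number of observations and whether the key is unique. The sketch below is a standalone check using only pandas. `check_input` and `EXPECTED_COLUMNS` are hypothetical names introduced here for illustration, and the column list is simply copied from the sample data, not taken from the ZRP documentation.

```python
import pandas as pd

# Column names copied from the sample table in this notebook (an assumption,
# not an authoritative list of what ZRP_Prepare requires).
EXPECTED_COLUMNS = [
    "first_name", "middle_name", "last_name", "house_number",
    "street_address", "city", "state", "zip_code", "ZEST_KEY",
]


def check_input(df, key="ZEST_KEY"):
    """Return a small report: missing columns, row count, and key uniqueness."""
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    key_unique = key in df.columns and bool(df[key].is_unique)
    return {
        "missing_columns": missing,
        "n_observations": len(df),
        "key_unique": key_unique,
    }


# Two rows taken from the sample data above.
sample = pd.DataFrame(
    {
        "first_name": ["Gabe", "Ari"],
        "middle_name": [None, None],
        "last_name": ["Plumer", "Bernstein"],
        "house_number": ["782", "500"],
        "street_address": ["Frenchtown Road", "West Crescent Avenue"],
        "city": ["Milford", "Allendale"],
        "state": ["NJ", "NJ"],
        "zip_code": ["08848", "07401"],
        "ZEST_KEY": ["2", "4"],
    }
)
report = check_input(sample)
print(report)
```

On the full data, the same call would be `check_input(nj_mayors)`; an empty `missing_columns` list and `key_unique` set to `True` suggest the table has the same shape as the sample used here.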

In [7]:
%%time
zest_race_predictor = ZRP_Prepare()
zest_race_predictor.fit(nj_mayors)
prepared = zest_race_predictor.transform(nj_mayors)

  0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/462 [00:00<?, ?it/s]
[Parallel(n_jobs=49)]: Using backend ThreadingBackend with 49 concurrent workers.
[Parallel(n_jobs=49)]: Done 102 tasks      | elapsed:    0.0s
100%|██████████| 462/462 [00:00<00:00, 13880.96it/s]

Data is loaded
   [Start] Validating input data
     Number of observations: 462
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace

[Start] Preparing geo data

  The following states are included in the data: ['NJ']
   ... on state: NJ

   Data is loaded
   [Start] Processing geo data
      ...address cleaning



[Parallel(n_jobs=49)]: Done 352 tasks      | elapsed:    0.0s
[Parallel(n_jobs=49)]: Done 462 out of 462 | elapsed:    0.0s finished


      ...replicating address
         ...Base
         ...Map street suffixes...
         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=900)
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=900)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data
      ...merge user input & lookup table
      ...mapping


100%|██████████| 1/1 [00:05<00:00,  5.38s/it]

   [Completed] Validating input geo data
Directory already exists
Output saved
   [Completed] Mapping geo data

[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data
     Number of observations: 462
     Is key unique: True






   [Completed] Validating ACS input data

   ...loading ACS lookup tables
   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

CPU times: user 35.4 s, sys: 12 s, total: 47.5 s
Wall time: 28 s


### Invoke the ZRP_Predict on the sample data
To generate predictions, pass the path to the preferred pipeline directory to `ZRP_Predict` when initializing it. Here we provide the default path.

In [8]:
curpath = os.getcwd()
pipe_path = join(curpath, "../../zrp/modeling/models")

In [9]:
zrp_predict = ZRP_Predict(pipe_path)

To generate predictions, pass the prepared data from `ZRP_Prepare` to `transform`.

In [10]:
zrp_predict.fit(prepared)
zrp_output = zrp_predict.transform(prepared)

   [Start] Validating pipeline input data
     Number of observations: 1427
     Is key unique: False
   [Completed] Validating pipeline input data



  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1035.63it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 993.91it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 866.23it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


Directory already exists
Output saved


### Inspect the output
- Preview the data
- View what artifacts are saved

In [11]:
zrp_output.sort_values("source_block_group", ascending=False)

ZEST_KEY,AAPI,AIAN,BLACK,HISPANIC,WHITE,race_proxy,source_block_group,source_census_tract,source_zip_code,source_bisg
559,0.004409,0.001062,0.915445,0.052111,0.026974,BLACK,1.0,0.0,0.0,0.0
419,0.000306,0.000229,0.000952,0.001160,0.997353,WHITE,1.0,0.0,0.0,0.0
421,0.060321,0.011445,0.030831,0.036595,0.860807,WHITE,1.0,0.0,0.0,0.0
422,0.076223,0.000318,0.001026,0.019656,0.902777,WHITE,1.0,0.0,0.0,0.0
423,0.028568,0.012289,0.654524,0.023888,0.280731,BLACK,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
277,0.054039,0.010743,0.018480,0.030987,0.885752,WHITE,0.0,0.0,1.0,0.0
279,0.000806,0.001535,0.088218,0.026714,0.882727,WHITE,0.0,0.0,1.0,0.0
281,0.012340,0.004350,0.005831,0.029696,0.947782,WHITE,0.0,0.0,1.0,0.0
282,0.015886,0.005495,0.004929,0.022326,0.951364,WHITE,0.0,0.0,1.0,0.0


### Check Coverage
A quick check of the ZRP output shows a low missing rate: about 1.7% of records have no prediction. `ZRP_Predict` uses a waterfall method: it predicts from the block group where available, then falls back to the census tract, then to the zip code.

In [12]:
zrp_output.filter(regex='[A-Z]|race').isna().mean()

AAPI          0.017316
AIAN          0.017316
BLACK         0.017316
HISPANIC      0.017316
WHITE         0.017316
race_proxy    0.017316
dtype: float64
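
The waterfall idea described above can be sketched independently of ZRP: try the finest geography first and fall back to coarser ones only when the finer identifier is unavailable. The function below is an illustrative sketch, not ZRP's implementation. `pick_geo_level` and the record layout (a dict with optional `block_group`, `census_tract`, and `zip_code` entries) are assumptions made for this example.

```python
# Order of preference, finest geography first, as described above.
WATERFALL = ("block_group", "census_tract", "zip_code")


def pick_geo_level(record):
    """Return the first geography level with a usable value, else None."""
    for level in WATERFALL:
        value = record.get(level)
        if value is not None and value != "":
            return level
    return None  # no geography matched, so no prediction can be made


# Hypothetical records for illustration only (placeholder identifiers).
records = [
    {"block_group": "bg-1", "census_tract": "ct-1", "zip_code": "08107"},
    {"block_group": None, "census_tract": "ct-2", "zip_code": "07401"},
    {"block_group": None, "census_tract": None, "zip_code": "08848"},
    {"block_group": None, "census_tract": None, "zip_code": None},
]
levels = [pick_geo_level(r) for r in records]
print(levels)
```

Records for which every level is missing correspond to the unpredicted share reported in the coverage check.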

Check the distribution of predicted race & ethnicity:

In [13]:
zrp_output.race_proxy.value_counts(normalize=True, dropna=False)

WHITE       0.876623
BLACK       0.049784
HISPANIC    0.034632
AAPI        0.021645
NaN         0.017316
Name: race_proxy, dtype: float64
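
In the rows previewed earlier, `race_proxy` matches the class with the highest predicted probability. The sketch below checks that relationship on a few rows copied from the output above using `DataFrame.idxmax`. It is a sanity check based on that observation, not a statement of how `ZRP_Predict` assigns the label internally; the frame name `preview` is introduced here for illustration.

```python
import pandas as pd

PROB_COLS = ["AAPI", "AIAN", "BLACK", "HISPANIC", "WHITE"]

# Three rows copied from the preview of zrp_output shown earlier.
preview = pd.DataFrame(
    {
        "AAPI": [0.004409, 0.000306, 0.028568],
        "AIAN": [0.001062, 0.000229, 0.012289],
        "BLACK": [0.915445, 0.000952, 0.654524],
        "HISPANIC": [0.052111, 0.001160, 0.023888],
        "WHITE": [0.026974, 0.997353, 0.280731],
        "race_proxy": ["BLACK", "WHITE", "BLACK"],
    },
    index=pd.Index(["559", "419", "423"], name="ZEST_KEY"),
)

# Column label of the largest probability in each row.
top_class = preview[PROB_COLS].idxmax(axis=1)

# True where the reported proxy equals the highest-probability class.
agrees = top_class.eq(preview["race_proxy"])
print(agrees.all())
```

On the full output, drop rows without predictions first (for example with `dropna(subset=PROB_COLS)`), since `idxmax` is not meaningful for rows where every probability is missing.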

In [14]:
zrp_output.shape

(462, 10)

Refer to the `source_*` columns to determine which geographic identifier or method was used to generate each proxy.
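
One way to read the source columns is to collapse them into a single label per record. The sketch below does this on rows copied from the output above. It assumes, as in the preview, that each `source_*` column is a 0/1 flag and that at most one flag is set per record; `SOURCE_COLS`, `flags`, and `source_used` are names introduced here.

```python
import pandas as pd

SOURCE_COLS = [
    "source_block_group", "source_census_tract",
    "source_zip_code", "source_bisg",
]

# Four rows copied from the preview of zrp_output (source columns only).
flags = pd.DataFrame(
    {
        "source_block_group": [1.0, 1.0, 0.0, 0.0],
        "source_census_tract": [0.0, 0.0, 0.0, 0.0],
        "source_zip_code": [0.0, 0.0, 1.0, 1.0],
        "source_bisg": [0.0, 0.0, 0.0, 0.0],
    },
    index=pd.Index(["559", "419", "277", "279"], name="ZEST_KEY"),
)

# Treat missing flags as "not used" so every row can be compared.
clean = flags[SOURCE_COLS].fillna(0)

# Name of the flagged column per row, with the "source_" prefix removed.
source_used = clean.idxmax(axis=1).str.replace("source_", "", regex=False)

# Records with no flag set get no label instead of a misleading first column.
source_used = source_used.where(clean.sum(axis=1) > 0)

print(source_used.value_counts())
```

Applied to the real output, the same steps would start from `zrp_output[SOURCE_COLS]` instead of the hand-copied `flags` frame.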

`ZRP_Predict` generates multiple artifacts that are automatically saved:
- Dataframe with proxies
    - `proxy_output.feather`
- Validation dictionary for input data
    - `input_predict_validator.json`
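
To confirm which artifacts were written, the output directory can be checked for the two file names listed above. This is a small sketch using only the standard library. `find_artifacts` is a hypothetical helper, and the directory to inspect is an assumption, since this notebook does not state where `ZRP_Predict` saves its output; the demonstration uses a temporary directory with an empty placeholder file.

```python
import tempfile
from pathlib import Path

# File names taken from the artifact list above.
ARTIFACT_NAMES = ("proxy_output.feather", "input_predict_validator.json")


def find_artifacts(directory):
    """Map each expected artifact name to its path, or None if it is absent."""
    directory = Path(directory)
    return {
        name: (directory / name if (directory / name).is_file() else None)
        for name in ARTIFACT_NAMES
    }


# Demonstration on a temporary directory holding one placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "proxy_output.feather").touch()
    found = find_artifacts(tmp)
    present = sorted(name for name, path in found.items() if path is not None)
    print(present)
```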
