# How to Generate Race & Ethnicity Predictions using ZRP
The purpose of this notebook is to illustrate how to use `ZRP_Predict`, a module that generates race & ethnicity predictions.

In [1]:
%load_ext autoreload
%autoreload 2
%config Completer.use_jedi=False

In [2]:
from os.path import join, expanduser, dirname
import pandas as pd
import sys
import os
import re
import warnings

In [3]:
warnings.filterwarnings(action='ignore')
home = expanduser('~')

src_path = '{}/zrp'.format(home)
sys.path.append(src_path)

In [4]:
from zrp.modeling.predict import ZRP_Predict
from zrp.prepare.prepare import ZRP_Prepare
from zrp.prepare.utils import load_file

## Load sample data for prediction
Load a processed list of New Jersey mayors (source: https://www.nj.gov/dca/home/2022mayors.csv).

In [5]:
nj_mayors = load_file("../2022-nj-mayors-sample.csv")
nj_mayors.shape

(462, 9)

In [6]:
nj_mayors

Unnamed: 0,first_name,middle_name,last_name,house_number,street_address,city,state,zip_code,ZEST_KEY
0,Gabe,,Plumer,782,Frenchtown Road,Milford,NJ,08848,2
1,Ari,,Bernstein,500,West Crescent Avenue,Allendale,NJ,07401,4
2,David,J.,Mclaughlin,125,Corlies Avenue,Allenhurst,NJ,07711-1049,5
3,Thomas,C.,Fritts,8,North Main Street,Allentown,NJ,08501-1607,6
4,P.,,McCkelvey,49,South Greenwich Street,Alloway,NJ,08001-0425,7
...,...,...,...,...,...,...,...,...,...
457,William,,Degroff,3943,Route,Chatsworth,NJ,08019,558
458,Joseph,,Chukwueke,200,Cooper Avenue,Woodlynne,NJ,08107-2108,559
459,Paul,,Sarlo,85,Humboldt Street,Wood-Ridge,NJ,07075-2344,560
460,Craig,,Frederick,120,Village Green Drive,Woolwich Township,NJ,08085-3180,562


#### ZRP Prepare  
Predictions can only be generated from prepared data: data that has been processed, has Census GEOIDs (e.g., census tract), and has American Community Survey (ACS) data mapped to each unique record. To prepare the data, we use `ZRP_Prepare`.

In [7]:
%%time
zest_race_predictor = ZRP_Prepare()
zest_race_predictor.fit(nj_mayors)
prepared = zest_race_predictor.transform(nj_mayors)

  0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/462 [00:00<?, ?it/s]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 15 concurrent workers.
[Parallel(n_jobs=-1)]: Done  20 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 170 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 420 tasks      | elapsed:    0.0s
100%|██████████| 462/462 [00:00<00:00, 15458.00it/s]

Data is loaded
   [Start] Validating input data
     Number of observations: 462
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace

[Start] Preparing geo data

  The following states are included in the data: ['NJ']
   ... on state: NJ

   Data is loaded
   [Start] Processing geo data
      ...address cleaning



[Parallel(n_jobs=-1)]: Done 462 out of 462 | elapsed:    0.0s finished


      ...replicating address
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=462)
         ...Base
         ...Map street suffixes...
         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=900)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data
      ...merge user input & lookup table
      ...mapping


100%|██████████| 1/1 [00:05<00:00,  5.06s/it]

   [Completed] Validating input geo data
Directory already exists
...Output saved
   [Completed] Mapping geo data

[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data
     Number of observations: 462
     Is key unique: True

   [Completed] Validating ACS input data

   ...loading ACS lookup tables





   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

CPU times: user 28.4 s, sys: 2.33 s, total: 30.7 s
Wall time: 30.7 s
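The validation step in the log above reports the number of observations and whether the key is unique. A similar check can be run by hand before calling `ZRP_Prepare`. The following is a minimal sketch on plain row dictionaries, not ZRP's own validation code: `check_input` is a hypothetical helper, and the column list is simply copied from the sample file (whether `ZRP_Prepare` requires every one of these columns is an assumption).

```python
from collections import Counter

# Columns seen in the sample file; treating all of them as required is an assumption.
EXPECTED_COLUMNS = [
    "first_name", "middle_name", "last_name", "house_number",
    "street_address", "city", "state", "zip_code", "ZEST_KEY",
]

def check_input(records, key="ZEST_KEY"):
    """Summarize row count, missing columns, and duplicated keys for a list of row dicts."""
    present = set()
    for row in records:
        present.update(row.keys())
    missing = [col for col in EXPECTED_COLUMNS if col not in present]
    counts = Counter(row.get(key) for row in records)
    duplicates = sorted(k for k, n in counts.items() if n > 1)
    return {"n_obs": len(records), "missing_columns": missing, "duplicate_keys": duplicates}

# Two rows copied from the preview of nj_mayors.
rows = [
    {"first_name": "Gabe", "middle_name": "", "last_name": "Plumer", "house_number": "782",
     "street_address": "Frenchtown Road", "city": "Milford", "state": "NJ",
     "zip_code": "08848", "ZEST_KEY": "2"},
    {"first_name": "Ari", "middle_name": "", "last_name": "Bernstein", "house_number": "500",
     "street_address": "West Crescent Avenue", "city": "Allendale", "state": "NJ",
     "zip_code": "07401", "ZEST_KEY": "4"},
]

report = check_input(rows)
print(report)
```

For a DataFrame, the same rows can be obtained with `nj_mayors.to_dict("records")`.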


### Invoke the ZRP_Predict on the sample data
To generate predictions, pass the path to the pipeline directory when constructing `ZRP_Predict`. Here we use the default local path from a git clone.

In [8]:
curpath = os.getcwd()
pipe_path = join(curpath, "../../zrp/modeling/models")

Initialize & fit `ZRP_Predict`

In [9]:
zrp_predict = ZRP_Predict(pipe_path)
zrp_predict.fit(prepared)

   [Start] Validating pipeline input data
     Number of observations: 1543
     Is key unique: False
   [Completed] Validating pipeline input data



<zrp.modeling.predict.ZRP_Predict at 0x7f3543f80690>

To generate predictions, pass the prepared data from `ZRP_Prepare` to `transform`.

In [10]:
zrp_output = zrp_predict.transform(prepared)

   [Start] Validating pipeline input data
     Number of observations: 1543
     Is key unique: False
   [Completed] Validating pipeline input data



  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 15 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1226.05it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 15 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1158.01it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 15 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1455.85it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 15 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1102.89it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


   ...Proxies generated
Directory already exists
...Output saved


#### Troubleshooting
If you run into a `FileNotFoundError`, it is likely that an incorrect `pipe_path` was provided.

The pipelines are downloaded during installation, when `python -m zrp download` is run in the terminal. Please refer to the README if you missed this step. This notebook is configured to run with the pipelines downloaded to your local zrp folder. To check whether they are present in your local git-cloned zrp folder, run the following command.

In [11]:
!find {pipe_path}  -print | grep -i 'pipe.pkl'

/home/kam/zrp/examples/modeling/../../zrp/modeling/models/block_group/pipe.pkl
/home/kam/zrp/examples/modeling/../../zrp/modeling/models/census_tract/pipe.pkl
/home/kam/zrp/examples/modeling/../../zrp/modeling/models/zip_code/pipe.pkl


The expected output is similar to the listing above.

If `!find {pipe_path}  -print | grep -i 'pipe.pkl'` returns nothing, update your `pipe_path` to point to your virtual (conda) environment. An example is included below, where the virtual environment is named "zrp_v0.3.3":
- `pipe_path = os.path.join(home, ".conda/envs/zrp_v0.3.3/lib/python3.7/site-packages/zrp/modeling/models")`
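The same check can be done from Python without shelling out. The sketch below uses only `pathlib`; `find_pipelines` and `first_usable` are hypothetical helper names, not part of zrp, and the candidate paths are the two locations mentioned in this notebook. Unlike `grep -i`, it matches the exact file name `pipe.pkl`.

```python
from pathlib import Path

def find_pipelines(models_dir):
    """Return the sorted paths of every pipe.pkl found under models_dir (empty if the directory is absent)."""
    root = Path(models_dir)
    if not root.is_dir():
        return []
    return sorted(root.rglob("pipe.pkl"))

def first_usable(candidates):
    """Return the first candidate directory containing at least one pipe.pkl, or None."""
    for candidate in candidates:
        if find_pipelines(candidate):
            return Path(candidate)
    return None

# Candidate locations taken from this notebook; adjust to your own setup.
candidates = [
    Path.cwd() / "../../zrp/modeling/models",
    Path.home() / ".conda/envs/zrp_v0.3.3/lib/python3.7/site-packages/zrp/modeling/models",
]
pipe_path = first_usable(candidates)
print(pipe_path)  # None means no pipelines were found in any candidate
```

If `pipe_path` is not `None`, it can be converted with `str(pipe_path)` before being handed to `ZRP_Predict`, in case a plain string is expected.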


### Inspect the output
- Preview the data
- View what artifacts are saved

In [12]:
zrp_output

ZEST_KEY,AAPI,AIAN,BLACK,HISPANIC,WHITE,race_proxy,source_zrp_block_group,source_zrp_census_tract,source_zrp_zip_code,source_bisg,source_zrp_name_only
104,0.001839,0.000730,0.008079,0.044049,0.945302,WHITE,0.0,0.0,0.0,0.0,1.0
117,0.039753,0.035560,0.423672,0.025209,0.475807,WHITE,0.0,0.0,0.0,0.0,1.0
191,0.032259,0.023393,0.201179,0.059578,0.683592,WHITE,0.0,0.0,0.0,0.0,1.0
29,0.001591,0.043471,0.383610,0.044442,0.526887,WHITE,0.0,0.0,0.0,0.0,1.0
326,0.064744,0.001305,0.019531,0.017721,0.896698,WHITE,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...
87,0.000357,0.000264,0.000991,0.001061,0.997328,WHITE,1.0,0.0,0.0,0.0,0.0
88,0.029783,0.000419,0.002139,0.032234,0.935425,WHITE,1.0,0.0,0.0,0.0,0.0
92,0.000108,0.000093,0.001272,0.003412,0.995114,WHITE,1.0,0.0,0.0,0.0,0.0
93,0.038076,0.002857,0.031983,0.087664,0.839420,WHITE,1.0,0.0,0.0,0.0,0.0
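A quick sanity check on the preview: each row's five class probabilities should sum to roughly 1, and in the rows shown `race_proxy` coincides with the highest-probability class. The sketch below tests both on three rows copied from the table above. That `race_proxy` is always the arg-max class is an assumption drawn from these rows, not something this notebook states, and `check_row` is a hypothetical helper.

```python
CLASSES = ["AAPI", "AIAN", "BLACK", "HISPANIC", "WHITE"]

# Rows copied from the preview: ZEST_KEY -> (probabilities in CLASSES order, race_proxy)
preview = {
    "104": ([0.001839, 0.000730, 0.008079, 0.044049, 0.945302], "WHITE"),
    "117": ([0.039753, 0.035560, 0.423672, 0.025209, 0.475807], "WHITE"),
    "29": ([0.001591, 0.043471, 0.383610, 0.044442, 0.526887], "WHITE"),
}

def check_row(probs, proxy, tol=1e-4):
    """Return (sums_to_one, proxy_is_argmax) for one output row."""
    sums_to_one = abs(sum(probs) - 1.0) <= tol
    top_index = max(range(len(probs)), key=lambda i: probs[i])
    return sums_to_one, CLASSES[top_index] == proxy

results = {key: check_row(probs, proxy) for key, (probs, proxy) in preview.items()}
print(results)
```

Row 117 is worth a second look: WHITE (0.476) only narrowly beats BLACK (0.424), so the single `race_proxy` label hides a close call that the probability columns preserve.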


### Check Coverage
A quick glance at the ZRP output shows a low missing rate. `ZRP_Predict` uses a waterfall method: it predicts using the block group first, then the census tract, then the zip code.

In [13]:
zrp_output.filter(regex='[A-Z]|race').isna().mean()

AAPI          0.0
AIAN          0.0
BLACK         0.0
HISPANIC      0.0
WHITE         0.0
race_proxy    0.0
dtype: float64
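The waterfall idea described above can be sketched in a few lines: try each geographic level in order and keep the first one that yields a prediction. This is an illustration of the general technique, not `ZRP_Predict`'s implementation. `waterfall` is a hypothetical helper, and the output's `source_bisg` and `source_zrp_name_only` columns suggest further fallbacks whose order is not stated here, so they are left out.

```python
def waterfall(predictions_by_level, order=("block_group", "census_tract", "zip_code")):
    """Return (level, prediction) for the first level in `order` with a non-missing prediction."""
    for level in order:
        prediction = predictions_by_level.get(level)
        if prediction is not None:
            return level, prediction
    return None, None

# Illustrative record: no block group match, so the census tract prediction is used.
record = {
    "block_group": None,
    "census_tract": {"WHITE": 0.9, "BLACK": 0.1},
    "zip_code": {"WHITE": 0.8, "BLACK": 0.2},
}
level, prediction = waterfall(record)
print(level, prediction)
```

A record with no usable level returns `(None, None)`, which is the case a missing-rate check like the one above would surface.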

Check the distribution of predicted race & ethnicity:

In [14]:
zrp_output.race_proxy.value_counts(normalize=True, dropna=False)

WHITE       0.896104
BLACK       0.045455
HISPANIC    0.036797
AAPI        0.021645
Name: race_proxy, dtype: float64
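The normalized shares above can be converted back to record counts by multiplying by the number of observations (462, as reported during input validation). A small worked example:

```python
# Shares copied from the value_counts output above.
shares = {"WHITE": 0.896104, "BLACK": 0.045455, "HISPANIC": 0.036797, "AAPI": 0.021645}
n_records = 462  # number of observations reported during validation

counts = {label: round(share * n_records) for label, share in shares.items()}
print(counts)                # {'WHITE': 414, 'BLACK': 21, 'HISPANIC': 17, 'AAPI': 10}
print(sum(counts.values()))  # 462
```

AIAN does not appear in the distribution, meaning no record in this sample received AIAN as its `race_proxy`.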

In [15]:
zrp_output.shape

(462, 11)

Refer to the `source_*` columns to determine which geographic identifier or method was used to generate each proxy.
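To read the source columns more easily, the one-hot flags can be collapsed into a single label per record. The sketch below works on plain dictionaries and assumes that exactly one source column is set to 1.0 per row, which holds for the rows previewed earlier but is not guaranteed by this notebook; `source_label` is a hypothetical helper.

```python
SOURCE_COLUMNS = [
    "source_zrp_block_group", "source_zrp_census_tract", "source_zrp_zip_code",
    "source_bisg", "source_zrp_name_only",
]

def source_label(row):
    """Return the single source column flagged 1.0 in `row`, or None if zero or several are flagged."""
    flagged = [col for col in SOURCE_COLUMNS if row.get(col) == 1.0]
    return flagged[0] if len(flagged) == 1 else None

# Source flags copied from two previewed rows (ZEST_KEY 104 and 87).
row_104 = dict(zip(SOURCE_COLUMNS, [0.0, 0.0, 0.0, 0.0, 1.0]))
row_87 = dict(zip(SOURCE_COLUMNS, [1.0, 0.0, 0.0, 0.0, 0.0]))

labels = {"104": source_label(row_104), "87": source_label(row_87)}
print(labels)
```

On the DataFrame itself, `zrp_output[SOURCE_COLUMNS].idxmax(axis=1)` gives a comparable label column, though it does not detect rows where no flag is set.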

`ZRP_Predict` automatically saves the proxies to `proxy_output.feather`.