# How to Generate Race & Ethnicity Predictions using ZRP
## Geographic Identifier Level
The purpose of this notebook is to illustrate how to use ZRP submodules to generate race & ethnicity predictions. For the best coverage, `ZRP_Predict` is the recommended way to generate predictions. There are also three geographic identifier (GEOID) specific predict methods:
- `ZRP_Predict_ZipCode()`
- `ZRP_Predict_CensusTract()`
- `ZRP_Predict_BlockGroup()`

In this example we generate predictions using `ZRP_Predict_ZipCode`.

In [1]:
%load_ext autoreload
%autoreload 2
%config Completer.use_jedi=False

In [2]:
from os.path import join, expanduser, dirname
import pandas as pd
import sys
import os
import re
import warnings

In [3]:
warnings.filterwarnings(action='once')
home = expanduser('~')

src_path = '{}/zrp'.format(home)
sys.path.append(src_path)

In [4]:
from zrp.modeling.predict import ZRP_Predict_ZipCode
from zrp.prepare.prepare import ZRP_Prepare
from zrp.prepare.utils import load_file, load_json



## Load sample data for prediction
Load a processed list of New Jersey mayors, downloaded from https://www.nj.gov/dca/home/2022mayors.csv

In [5]:
nj_mayors = load_file("../2022-nj-mayors-sample.csv")
nj_mayors.shape

(462, 9)

In [6]:
nj_mayors

Unnamed: 0,first_name,middle_name,last_name,house_number,street_address,city,state,zip_code,ZEST_KEY
0,Gabe,,Plumer,782,Frenchtown Road,Milford,NJ,08848,2
1,Ari,,Bernstein,500,West Crescent Avenue,Allendale,NJ,07401,4
2,David,J.,Mclaughlin,125,Corlies Avenue,Allenhurst,NJ,07711-1049,5
3,Thomas,C.,Fritts,8,North Main Street,Allentown,NJ,08501-1607,6
4,P.,,McCkelvey,49,South Greenwich Street,Alloway,NJ,08001-0425,7
...,...,...,...,...,...,...,...,...,...
457,William,,Degroff,3943,Route,Chatsworth,NJ,08019,558
458,Joseph,,Chukwueke,200,Cooper Avenue,Woodlynne,NJ,08107-2108,559
459,Paul,,Sarlo,85,Humboldt Street,Wood-Ridge,NJ,07075-2344,560
460,Craig,,Frederick,120,Village Green Drive,Woolwich Township,NJ,08085-3180,562
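
The `zip_code` column in the sample mixes 5-digit ZIP codes with ZIP+4 values such as `07711-1049`. As a minimal sketch of how the two forms relate, the helper below splits a value into its 5-digit part and its optional 4-digit extension. `split_zip` and `ZIP_PATTERN` are hypothetical names used only for illustration; they are not part of ZRP, and how `ZRP_Prepare` treats the extension internally is not shown in this notebook.

```python
import re

# Matches a 5-digit ZIP code, optionally followed by "-" and a 4-digit extension.
ZIP_PATTERN = re.compile(r"^(\d{5})(?:-(\d{4}))?$")


def split_zip(zip_code):
    """Return (zip5, plus4) for a ZIP or ZIP+4 string; plus4 is None when absent."""
    match = ZIP_PATTERN.match(zip_code.strip())
    if match is None:
        raise ValueError(f"not a ZIP or ZIP+4 code: {zip_code!r}")
    return match.group(1), match.group(2)


print(split_zip("08848"))       # ('08848', None)
print(split_zip("07711-1049"))  # ('07711', '1049')
```

Keeping the value as a string matters here: New Jersey ZIP codes start with `0`, and an integer column would drop that leading zero.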


#### ZRP Prepare  
Predictions can only be generated from prepared data: data that has been processed, has Census GEOIDs (e.g., census tract), and has American Community Survey (ACS) data mapped to each unique record. To prepare the data we will use `ZRP_Prepare`.
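
Before preparing the data, it can help to confirm that the input frame has the columns shown in the sample above and that the key is unique. The sketch below is a rough pre-check written with pandas only. `check_input` and `EXPECTED_COLUMNS` are hypothetical names, and treating every column in the sample as required is an assumption rather than a documented ZRP rule.

```python
import pandas as pd

# Column names copied from the sample preview; whether ZRP needs all of them is an assumption.
EXPECTED_COLUMNS = [
    "first_name", "middle_name", "last_name", "house_number",
    "street_address", "city", "state", "zip_code", "ZEST_KEY",
]


def check_input(df, key="ZEST_KEY"):
    """Summarize missing columns, the row count, and whether the key column is unique."""
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    key_is_unique = key in df.columns and bool(df[key].is_unique)
    return {
        "missing_columns": missing,
        "n_observations": len(df),
        "key_is_unique": key_is_unique,
    }


# Two rows taken from the sample data.
toy = pd.DataFrame(
    {
        "first_name": ["Gabe", "Ari"],
        "middle_name": [None, None],
        "last_name": ["Plumer", "Bernstein"],
        "house_number": ["782", "500"],
        "street_address": ["Frenchtown Road", "West Crescent Avenue"],
        "city": ["Milford", "Allendale"],
        "state": ["NJ", "NJ"],
        "zip_code": ["08848", "07401"],
        "ZEST_KEY": ["2", "4"],
    }
)
report = check_input(toy)
print(report)
```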

In [7]:
%%time
zest_race_predictor = ZRP_Prepare()
zest_race_predictor.fit(nj_mayors)
prepared = zest_race_predictor.transform(nj_mayors)

  0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/462 [00:00<?, ?it/s]
[Parallel(n_jobs=49)]: Using backend ThreadingBackend with 49 concurrent workers.
[Parallel(n_jobs=49)]: Done 102 tasks      | elapsed:    0.0s
100%|██████████| 462/462 [00:00<00:00, 13510.30it/s]

Data is loaded
Directory already exists
   [Start] Validating input data
     Number of observations: 462
     Is key unique: True
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace

[Start] Preparing geo data

  The following states are included in the data: ['NJ']
   ... on state: NJ

   Data is loaded
   [Start] Processing geo data
/home/kam/zrp/zrp/prepare/../data/processed
      ...address cleaning



[Parallel(n_jobs=49)]: Done 352 tasks      | elapsed:    0.0s
[Parallel(n_jobs=49)]: Done 462 out of 462 | elapsed:    0.0s finished


      ...replicating address
         ...Base
         ...Map street suffixes...
         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=900)
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=900)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data




      ...merge user input & lookup table
      ...mapping


100%|██████████| 1/1 [00:04<00:00,  4.84s/it]

Directory already exists
Output saved
   [Completed] Mapping geo data

[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data
     Number of observations: 462
     Is key unique: True






   [Completed] Validating ACS input data

   ...loading ACS lookup tables
   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

CPU times: user 34.4 s, sys: 9.55 s, total: 43.9 s
Wall time: 21.6 s


### Invoke the ZRP_Predict on the sample data
To generate predictions, you can:
1. Provide the path to the preferred pipeline directory in the `__init__`. 
    - Here we provide the default path

In [8]:
curpath = os.getcwd()
pipe_path = join(curpath, "../../zrp/modeling/models")
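
The joined path keeps its `..` segments, which can make logs and error messages harder to read. As an optional sketch, `os.path.normpath` collapses them lexically. Note that it does not touch the filesystem, so if any directory on the way is a symlink the normalized path may point somewhere else; whether that matters depends on your checkout, which is an assumption here.

```python
import os
from os.path import join, normpath

curpath = os.getcwd()

# Same relative location as in the cell above, before and after normalization.
raw_path = join(curpath, "../../zrp/modeling/models")
pipe_path = normpath(raw_path)

print(raw_path)
print(pipe_path)
```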

In [9]:
zrp_predict = ZRP_Predict_ZipCode(pipe_path)

#### Load Feature List
Features are subset by GEOID
- zp: zip code
- bg: block group
- ct: census tract
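
The suffixes above can be kept in a small lookup so the GEOID level is chosen in one place. This is only a sketch: `GEOID_SUFFIX` and `feature_list_path` are hypothetical names, the level keys are made up for illustration, and the file naming pattern `feature_list_<suffix>.json` is assumed from the zip code file used in this notebook. The `bg` and `ct` file names are not confirmed here.

```python
from os.path import join

# Suffixes taken from the list above; the dictionary keys are illustrative labels.
GEOID_SUFFIX = {"zip_code": "zp", "block_group": "bg", "census_tract": "ct"}


def feature_list_path(model_dir, level):
    """Build the assumed feature list path for a GEOID level."""
    if level not in GEOID_SUFFIX:
        raise ValueError(f"unknown GEOID level: {level!r}")
    return join(model_dir, f"feature_list_{GEOID_SUFFIX[level]}.json")


print(feature_list_path("../../zrp/modeling", "zip_code"))
```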

In [10]:
feature_list = load_json(os.path.join(curpath, "../../zrp/modeling/feature_list_zp.json"))

We only need zip code related data, so let's filter the prepared data.

In [11]:
zip_only = prepared[prepared.acs_source=='GEOID_ZIP'].filter(feature_list)
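
To make the two steps in that expression concrete, here is a toy version with made-up data. The boolean mask keeps the rows whose `acs_source` is `GEOID_ZIP`, and `DataFrame.filter` then keeps only the listed columns, silently skipping names that are not in the frame. The column names, the `GEOID_BG` value, and the short `feature_list` are hypothetical stand-ins, not the real ZRP features.

```python
import pandas as pd

# Toy stand-in for the prepared data; values are illustrative only.
prepared = pd.DataFrame(
    {
        "first_name": ["GABE", "ARI", "DAVID"],
        "zip_code": ["08848", "07401", "07711"],
        "acs_source": ["GEOID_ZIP", "GEOID_BG", "GEOID_ZIP"],
        "extra": [1, 2, 3],
    },
    index=pd.Index(["2", "4", "5"], name="ZEST_KEY"),
)

# Hypothetical feature list; "not_in_frame" shows that unknown names are ignored.
feature_list = ["first_name", "zip_code", "not_in_frame"]

# Row selection first, then column selection.
zip_only = prepared[prepared.acs_source == "GEOID_ZIP"].filter(feature_list)
print(zip_only)
```

Because `acs_source` is not in the feature list, it is dropped by the column step, so the row mask has to be applied before `filter`.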

To generate predictions, pass the prepared data from `ZRP_Prepare`, filtered to the `feature_list` columns, to `transform`.

In [12]:
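zrp_predict.fit()
# filter(zip_only) keeps the columns of `prepared` that are named in zip_only (the feature_list columns)
zrp_output = zrp_predict.transform(prepared.filter(zip_only))
# Keep only the first row for each index value (ZEST_KEY)
zrp_output = zrp_output[~zrp_output.index.duplicated(keep='first')]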
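
The last line of that cell removes repeated keys from the index. A small sketch with made-up numbers shows what `Index.duplicated(keep='first')` does: it flags every occurrence of an index value after the first, and the `~` inverts the mask so only first occurrences remain.

```python
import pandas as pd

# Toy output with a repeated key; the probabilities are illustrative only.
out = pd.DataFrame(
    {"WHITE": [0.94, 0.95, 0.96]},
    index=pd.Index(["10", "10", "100"], name="ZEST_KEY"),
)

# True for the second and later occurrences of each index value.
is_repeat = out.index.duplicated(keep="first")

deduped = out[~is_repeat]
print(deduped)
```

Which row counts as "first" depends on the row order of the frame, so the result is only as meaningful as that ordering.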

  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 15 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1120.87it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


### Inspect the output
- Preview the data
    - Only one source column is expected in the output, since these predictions are GEOID specific
- View what artifacts are saved

In [13]:
zrp_output

ZEST_KEY,AAPI,AIAN,BLACK,HISPANIC,WHITE,race_proxy,source_zip_code
10,0.010662,0.014569,0.016970,0.013396,0.944403,WHITE,1
10,0.008043,0.011817,0.014175,0.014500,0.951466,WHITE,1
10,0.014464,0.028271,0.162401,0.023886,0.770978,WHITE,1
100,0.000979,0.008053,0.015015,0.012642,0.963312,WHITE,1
100,0.001589,0.017367,0.036589,0.017921,0.926534,WHITE,1
...,...,...,...,...,...,...,...
96,0.000938,0.000564,0.038707,0.002508,0.957284,WHITE,1
96,0.001729,0.001461,0.250965,0.003827,0.742019,WHITE,1
96,0.000861,0.000753,0.060394,0.002053,0.935938,WHITE,1
97,0.009594,0.000934,0.192721,0.012938,0.783813,WHITE,1


### Check Coverage
A quick glance at the `ZRP_Predict_ZipCode` output shows that all records with a proper zip code that maps to American Community Survey data have a proxy.

In [14]:
zrp_output.filter(regex='[A-Z]|race').isna().mean()

AAPI          0.0
AIAN          0.0
BLACK         0.0
HISPANIC      0.0
WHITE         0.0
race_proxy    0.0
dtype: float64
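
The coverage expression works in three steps: `filter(regex='[A-Z]|race')` keeps columns whose name contains an uppercase letter or the text `race`, `isna()` marks missing cells, and `mean()` turns each column of booleans into a share of missing values. The toy frame below, with made-up values and one deliberately empty row, shows the effect; note that `source_zip_code` is excluded because its name matches neither pattern.

```python
import pandas as pd

nan = float("nan")

# Toy output with one record that has no prediction; values are illustrative only.
toy = pd.DataFrame(
    {
        "AAPI": [0.1, nan, 0.2, 0.3],
        "WHITE": [0.9, nan, 0.8, 0.7],
        "race_proxy": ["WHITE", None, "WHITE", "WHITE"],
        "source_zip_code": [1, 0, 1, 1],
    }
)

# Share of missing values per selected column.
missing_share = toy.filter(regex="[A-Z]|race").isna().mean()
print(missing_share)
```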

Check the distribution of predicted race & ethnicity:

In [15]:
zrp_output.race_proxy.value_counts(normalize=True, dropna=False)

WHITE       0.876605
BLACK       0.054922
HISPANIC    0.038516
AAPI        0.029957
Name: race_proxy, dtype: float64
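
In the rows previewed earlier, `race_proxy` coincides with the column holding the largest probability, and the five probabilities add up to about 1. The sketch below checks both properties on three of those rows. Whether ZRP always assigns the proxy as the highest-probability class is an assumption made for this illustration, not something the notebook states.

```python
import pandas as pd

PROB_COLUMNS = ["AAPI", "AIAN", "BLACK", "HISPANIC", "WHITE"]

# Three rows copied from the output preview.
rows = pd.DataFrame(
    [
        [0.010662, 0.014569, 0.016970, 0.013396, 0.944403, "WHITE"],
        [0.014464, 0.028271, 0.162401, 0.023886, 0.770978, "WHITE"],
        [0.000979, 0.008053, 0.015015, 0.012642, 0.963312, "WHITE"],
    ],
    columns=PROB_COLUMNS + ["race_proxy"],
)

# Sum of the class probabilities for each row.
row_sums = rows[PROB_COLUMNS].sum(axis=1)

# Name of the column with the largest probability in each row.
top_class = rows[PROB_COLUMNS].idxmax(axis=1)

print(row_sums.round(4).tolist())
print((top_class == rows["race_proxy"]).all())
```

Running the same comparison on the full `zrp_output` would show whether the pattern holds beyond these rows.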

In [23]:
zrp_output.shape

(462, 7)

Please refer to the source columns to determine which geographic identifier or method was used to generate the proxy.

`ZRP_Predict` generates multiple artifacts that are automatically saved:
- Dataframe with proxies
    - `proxy_output.feather`
- Validation dictionary for input data
    - `input_predict_validator.json`


In [16]:
!ls artifacts

input_acs_validator.json      proxy_output.feather
input_predict_validator.json  Zest_Geocoded_test_2019__34.parquet
input_validator.json
