# How to Generate Race & Ethnicity Predictions using ZRP
This notebook illustrates how to use `ZRP_Predict`, a class that generates race and ethnicity predictions from data prepared with `ZRP_Prepare`.

In [1]:
%load_ext autoreload
%autoreload 2
%config Completer.use_jedi=False

In [2]:
from os.path import join, expanduser
import pandas as pd
import sys
import os
import re
import warnings

In [3]:
warnings.filterwarnings(action='once')
home = expanduser('~')

In [4]:
src_path = join(home, 'zrp')  # location of the local zrp checkout
sys.path.append(src_path)

In [5]:
test_ids = ['GA_10961114', 'GA_07588296', 'GA_11951308', 'GA_03567641',
            'GA_11493478', 'GA_08063136', 'GA_02144077', 'GA_06757359',
            'GA_10561962', 'GA_07690722', 'GA_11003386']

In [7]:
from zrp.modeling.predict import ZRP_Predict
from zrp.prepare.prepare import *
from zrp.prepare.utils import *

## Load data
Load a dataset to simulate user input data.

In [8]:
support_files_path = "/d/shared/zrp/shared_data"
key = 'ZEST_KEY'

In [9]:
df = load_file(join(support_files_path, "processed/data/state_level/voters/base_ga_2022q1.parquet"))
df.shape

(7517881, 14)

### Sample data
Select the rows whose key appears in `test_ids` to build a small test case.

In [10]:
# Filter first, then copy, so only the matching rows are duplicated
samp = df[df[key].isin(test_ids)].copy()
samp.shape

(11, 14)
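
The sampling step above can be sketched on a toy frame. This is a minimal illustration, not part of ZRP: the data, the `value` column, and the `wanted` list are made up, and only the `ZEST_KEY` column name comes from this notebook. It also shows one way to check whether every requested ID was actually found.

```python
import pandas as pd

# Toy data (hypothetical); only the "ZEST_KEY" column name comes from the notebook.
toy = pd.DataFrame(
    {
        "ZEST_KEY": ["GA_1", "GA_2", "GA_3", "GA_4"],
        "value": [10, 20, 30, 40],
    }
)
wanted = ["GA_2", "GA_4", "GA_9"]  # "GA_9" is deliberately absent from the toy data

# Keep only rows whose key is in the wanted list.
subset = toy[toy["ZEST_KEY"].isin(wanted)].copy()

# IDs that were requested but not present in the data.
missing = sorted(set(wanted) - set(subset["ZEST_KEY"]))

print(subset.shape)
print(missing)
```

Comparing the number of requested IDs with the row count of the subset is a quick way to confirm that the sample is complete before running the slower prepare step.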

## ZRP_Prepare
Prepare the sampled data with `ZRP_Prepare`.

In [11]:
%%time
zest_race_predictor = ZRP_Prepare()
zest_race_predictor.fit(samp)
output = zest_race_predictor.transform(samp)

Data is loaded
Data is loaded
   Formatting P1
   Formatting P2
reduce whitespace

[Start] Preparing geo data
  The following states are included in the data: ['GA']
   ... on state: GA

   Data is loaded
   [Start] Processing geo data
/d/shared/zrp/shared_data


  0%|          | 0/11 [00:00<?, ?it/s][Parallel(n_jobs=49)]: Using backend ThreadingBackend with 49 concurrent workers.
100%|██████████| 11/11 [00:00<00:00, 1439.45it/s]

      ...address cleaning



[Parallel(n_jobs=49)]: Done  11 out of  11 | elapsed:    0.2s finished


      ...replicating address
         ...Base
         ...Map street suffixes...
         ...Mapped & split by street suffixes...
         ...Number processing...

     Address dataframe expansion is complete! (n=17)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data
      ...merge user input & lookup table
      ...mapping
Output saved
   [Completed] Mapping geo data


[Completed] Preparing geo data

[Start] Preparing ACS data
User input data is loaded
   ...loading ACS lookup tables
   ... combining ACS & user input data
ZEST_KEY
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data
CPU times: user 44.2 s, sys: 11.2 s, total: 55.4 s
Wall time: 59.2 s


## ZRP_Predict
To generate predictions from data that was prepared by `ZRP_Prepare`:
1. Pass the path to the preferred pipeline directory to `__init__`.
2. Call `fit`; it requires no parameters.
3. Call `transform` with the prepared data from `ZRP_Prepare` to generate predictions.

By default, proxy probabilities are returned.
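
The prepare and predict steps can be chained in one helper, sketched below. The function name `prepare_and_predict` is hypothetical, and `FakePrepare` and `FakePredict` are stand-ins that only mimic the call pattern used in this notebook (`fit(data)`/`transform(data)` for the prepare step, `__init__(pipe_path=...)`/`fit()`/`transform(prepared)` for the predict step) so the sketch runs without ZRP installed. They do not reproduce any ZRP behavior.

```python
def prepare_and_predict(data, prepare_cls, predict_cls, pipe_path):
    """Run the prepare step, then the predict step, in the order used in this notebook."""
    preparer = prepare_cls()
    preparer.fit(data)
    prepared = preparer.transform(data)

    predictor = predict_cls(pipe_path=pipe_path)  # step 1: point at the pipeline directory
    predictor.fit()                               # step 2: fit takes no parameters
    return predictor.transform(prepared)          # step 3: transform the prepared data


# Stand-ins (hypothetical) that mimic only the call pattern, NOT the real classes.
class FakePrepare:
    def fit(self, data):
        return self

    def transform(self, data):
        return [row.strip().upper() for row in data]


class FakePredict:
    def __init__(self, pipe_path):
        self.pipe_path = pipe_path

    def fit(self):
        return self

    def transform(self, prepared):
        return [(row, self.pipe_path) for row in prepared]


result = prepare_and_predict([" a ", "b"], FakePrepare, FakePredict, "some/pipeline/dir")
print(result)
```

With ZRP available, the same helper could presumably be called with `ZRP_Prepare` and `ZRP_Predict` in place of the stand-ins, since it uses them exactly as the cells in this notebook do.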

In [13]:
pipe_path = "/d/shared/zrp/model_artifacts/experiment/exp_011"

In [14]:
%%time
zrp_predict = ZRP_Predict(pipe_path=pipe_path)
zrp_predict.fit()
out_predictions = zrp_predict.transform(output)

Handle Compounds (in transform): (11, 92)
Handle Compounds (in transform reset): (11, 93)
Handle Compounds (end transform): (12, 93)
App FE (in transform) (12, 94)


  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=90)]: Using backend ThreadingBackend with 90 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 115.28it/s]

App FE (in transform post data_fe 1) (12, 94)
App FE (in transform post data_fe 2) (12, 105)
App FE (end transform) (12, 105)
Custom Ratios (in transform) (12, 105)
Custom Ratios (end transform) (12, 126)
Name Aggregation (in transform) (12, 126)



[Parallel(n_jobs=90)]: Done   1 out of   1 | elapsed:    1.2s finished


(11, 15)
(11, 15)
Empty DataFrame
Columns: [HISPANIC_last_name, BLACK_middle_name, AAPI_middle_name, WHITE_last_name, BLACK_last_name, AAPI_last_name, AIAN_first_name, WHITE_middle_name, AAPI_first_name, HISPANIC_first_name, AIAN_middle_name, HISPANIC_middle_name, WHITE_first_name, AIAN_last_name, BLACK_first_name]
Index: []

(12, 126)

(11, 110)

(11, 15)



Output saved
CPU times: user 7.99 s, sys: 533 ms, total: 8.53 s
Wall time: 13.6 s


In [16]:
out_predictions.reset_index(drop=True)

| | AAPI | AIAN | BLACK | HISPANIC | WHITE | source_block_group |
|---|---|---|---|---|---|---|
| 0 | 0.003693 | 0.012285 | 0.68988 | 0.010139 | 0.284002 | 1 |
| 1 | 0.003317 | 0.008846 | 0.08652 | 0.006012 | 0.895306 | 1 |
| 2 | 0.015528 | 0.001163 | 0.001545 | 0.025489 | 0.956274 | 1 |
| 3 | 0.13645 | 0.03533 | 0.537617 | 0.071682 | 0.218921 | 1 |
| 4 | 0.009961 | 0.000892 | 0.948494 | 0.010603 | 0.03005 | 1 |
| 5 | 0.00314 | 6.6e-05 | 0.000432 | 0.994575 | 0.001788 | 1 |
| 6 | 0.003461 | 0.002152 | 0.013817 | 0.007392 | 0.973178 | 1 |
| 7 | 0.013305 | 0.010755 | 0.013567 | 0.024665 | 0.937708 | 1 |
| 8 | 0.013309 | 0.001035 | 0.003674 | 0.024191 | 0.95779 | 1 |
| 9 | 0.016148 | 0.000652 | 0.125726 | 0.012625 | 0.844849 | 1 |
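
Because the output holds proxy probabilities, two common follow-up checks are that each row sums to roughly one and which class has the highest probability. The sketch below uses plain pandas on the first three rows copied from the output above. The variable names are illustrative, and taking the highest-probability class is just one possible way to summarize a row, not something ZRP prescribes.

```python
import pandas as pd

# First three rows copied from the prediction output above.
probs = pd.DataFrame(
    {
        "AAPI": [0.003693, 0.003317, 0.015528],
        "AIAN": [0.012285, 0.008846, 0.001163],
        "BLACK": [0.68988, 0.08652, 0.001545],
        "HISPANIC": [0.010139, 0.006012, 0.025489],
        "WHITE": [0.284002, 0.895306, 0.956274],
    }
)
race_cols = ["AAPI", "AIAN", "BLACK", "HISPANIC", "WHITE"]

# Each row of proxy probabilities should sum to about 1 (up to rounding).
row_sums = probs[race_cols].sum(axis=1)

# Column name and value of the highest probability in each row.
top_class = probs[race_cols].idxmax(axis=1)
top_prob = probs[race_cols].max(axis=1)

print(row_sums.round(4).tolist())
print(top_class.tolist())
```

Keeping the full probability vector alongside any derived label preserves the uncertainty in the prediction, which a single label hides.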
