# ZRP Example Usage
This notebook illustrates how to use ZRP, the main class of the zrp package, which processes user input data and returns race/ethnicity predictions.

In [1]:
%load_ext autoreload
%autoreload 2
%config Completer.use_jedi=False

In [2]:
from os.path import join, expanduser
import pandas as pd
import sys
import os
import re
import warnings
from time import time

## Set source code path here

In [3]:
warnings.filterwarnings(action='once')
home = expanduser('~')

src_path = os.getcwd()
sys.path.append(src_path)

In [4]:
from zrp import ZRP
from zrp.prepare.utils import load_file

## Load sample data for prediction
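Load the sample data for prediction. The original example used the list of New Jersey mayors downloaded from https://www.nj.gov/dca/home/2022mayors.csv; the run shown here instead reads a processed South Carolina voter file from a local path, although the variable keeps the name nj_mayors.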

In [5]:
nj_mayors = pd.read_parquet('/d/shared/zrp/shared_data/processed/data/state_level/voters/processed_sc_2022q1.parquet')
nj_mayors.shape

(2087206, 13)

In [7]:
nj_mayors

ZEST_KEY,first_name,middle_name,last_name,house_number,street_address,city,state,zip_code,original_race,race,original_sex,sex,age
SC_471127307,SAMANTHA,S,FERGUSON,121,GRAYS MARKET RD,EARLY BRANCH,SC,29916,WHITE,WHITE,FEMALE,FEMALE,29
SC_471495249,HUOI,,A,632,SANDBAR PT,CLOVER,SC,29710,ASIAN,AAPI,MALE,MALE,38
SC_471076575,GAIL,A,ABRUNZO,605,KERSHAW ST,CHERAW,SC,29520,WHITE,WHITE,FEMALE,FEMALE,66
SC_406714510,JOHN,M,AHERN,3401,DUNCAN ST,COLUMBIA,SC,29205,WHITE,WHITE,MALE,MALE,20
SC_471105450,PARIS,K,ASANI,26,PECAN CIR,YORK,SC,29745,BLACK/AFRICAN,BLACK,FEMALE,FEMALE,18
...,...,...,...,...,...,...,...,...,...,...,...,...,...
SC_471459264,DANIELLE,R,MCGHEE,521,CARROL DR,SUMTER,SC,29150,BLACK/AFRICAN,BLACK,FEMALE,FEMALE,32
SC_104846164,PETER,B,MCGHEE,2112,PROMENADE CT,MT PLEASANT,SC,29466,WHITE,WHITE,MALE,MALE,44
SC_470837709,ZANE,A,MCGHEE,1524,NORTHLAND DR,CAYCE,SC,29033,WHITE,WHITE,MALE,MALE,22
SC_024152349,DAVID,W,MCGHEE,522,RAILROAD AVE,NORTH AUGUSTA,SC,29841,BLACK/AFRICAN,BLACK,MALE,MALE,47


### Wrangle the sample data for predictions
Prepare the loaded data (still named nj_mayors in the code) for prediction. Only some columns are copied, so fields such as middle_name are left as NA, which is sufficient for demonstration purposes.


In [6]:
zrp_sample = pd.DataFrame(columns=['first_name', 'middle_name', 'last_name', 'house_number', 'street_address', 'city', 'state', 'zip_code'])

Prepare Names

In [7]:
# split_mayor_names = nj_mayors['MAYOR NAME'].str.split(' ')
zrp_sample['first_name'] = nj_mayors['first_name']
zrp_sample['last_name'] = nj_mayors['last_name']

City, State, Zip

In [8]:
zrp_sample['city'] = nj_mayors['city']
zrp_sample['state'] = nj_mayors['state']
zrp_sample['zip_code'] = nj_mayors['zip_code']

Address

In [9]:
zrp_sample['house_number'] = nj_mayors['house_number']
zrp_sample['street_address'] = nj_mayors['street_address']


In [10]:
zrp_sample['ZEST_KEY'] = zrp_sample.index.astype(str)  # a key must be specified to link each input row to its output
# zrp_sample
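The key assignment above depends on a pandas behavior that is easy to miss: when a Series is assigned to a frame that has columns but no rows, the frame takes on the index of that Series. The following is a minimal sketch of that behavior with made-up rows; the names `source` and `sample` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the loaded table, indexed by a key (values are made up).
source = pd.DataFrame(
    {"first_name": ["ANA", "BO"], "last_name": ["DIAZ", "LEE"]},
    index=pd.Index(["K1", "K2"], name="ZEST_KEY"),
)

# Empty frame with a fixed column layout, like the cell that creates zrp_sample.
sample = pd.DataFrame(columns=["first_name", "middle_name", "last_name"])

# The first Series assignment gives the empty frame the index of the Series.
sample["first_name"] = source["first_name"]
sample["last_name"] = source["last_name"]

# The index can now be copied into a key column before it is reset.
sample["ZEST_KEY"] = sample.index.astype(str)
sample = sample.reset_index(drop=True)

# Columns that were never assigned (middle_name) remain entirely missing.
print(sample)
```

If the frame already had rows before the first assignment, pandas would align on the existing index instead, so the order of these steps matters.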

In [11]:
zrp_sample.reset_index(drop=True, inplace=True)

zrp_sample.head()

,first_name,middle_name,last_name,house_number,street_address,city,state,zip_code,ZEST_KEY
0,SAMANTHA,,FERGUSON,121,GRAYS MARKET RD,EARLY BRANCH,SC,29916,SC_471127307
1,HUOI,,A,632,SANDBAR PT,CLOVER,SC,29710,SC_471495249
2,GAIL,,ABRUNZO,605,KERSHAW ST,CHERAW,SC,29520,SC_471076575
3,JOHN,,AHERN,3401,DUNCAN ST,COLUMBIA,SC,29205,SC_406714510
4,PARIS,,ASANI,26,PECAN CIR,YORK,SC,29745,SC_471105450
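Before a long run it can help to check the frame by hand. The sketch below is a hypothetical pre-flight check, not part of the zrp package: it looks for the column layout this notebook builds and reports whether the key is unique, mirroring the "Number of observations" and "Is key unique" lines that the pipeline logs. The helper name `check_input` and the toy rows are assumptions.

```python
import pandas as pd

# Column layout built for zrp_sample in this notebook, plus the key column.
EXPECTED_COLUMNS = [
    "first_name", "middle_name", "last_name", "house_number",
    "street_address", "city", "state", "zip_code", "ZEST_KEY",
]


def check_input(df, key="ZEST_KEY"):
    """Hypothetical helper: summarize row count, missing columns, key uniqueness."""
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    has_key = key in df.columns
    return {
        "n_observations": len(df),
        "missing_columns": missing,
        "key_is_unique": has_key and bool(df[key].is_unique),
    }


# Made-up rows with a deliberately duplicated key.
rows = pd.DataFrame({col: ["x", "y"] for col in EXPECTED_COLUMNS})
rows["ZEST_KEY"] = ["K1", "K1"]

report = check_input(rows)
print(report)
```

On the real data, `check_input(zrp_sample)` would be the corresponding call.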


### Invoke the Zest Race Predictor on the sample data

In [12]:
%load_ext line_profiler
%load_ext memory_profiler

In [None]:
%%time
start_time = time()
zest_race_predictor = ZRP()
zest_race_predictor.fit()
zrp_output = zest_race_predictor.transform(zrp_sample)

Directory already exists
Data is loaded
   [Start] Validating input data
     Number of observations: 2087206
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace

[Start] Preparing geo data



  0%|          | 0/1 [00:00<?, ?it/s]

  The following states are included in the data: ['SC']
   ... on state: SC

   Data is loaded
   [Start] Processing geo data



  0%|          | 0/2087206 [00:00<?, ?it/s]

      ...address cleaning

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.1s
  0%|          | 136/2087206 [00:00<25:37, 1357.41it/s]
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:    0.2s
  0%|          | 1440/2087206 [00:00<18:43, 1855.88it/s]
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed:    0.3s
  0%|          | 2408/2087206 [00:00<14:11, 2449.39it/s]
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed:    0.3s
  0%|          | 3104/2087206 [00:00<11:25, 3039.74it/s]
[Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed:    0.4s
  0%|          | 3800/2087206 [00:00<09:29, 3657.20it/s]
[Parallel(n_jobs=-1)]: Done 4042 tasks      | elapsed:    0.5s
  0%|          | 4500/2087206 [00:00<08:08, 4266.81it/s]

      ...replicating address
         ...Base
         ...Map street suffixes...
         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=2525332)
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=2525332)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data
      ...merge user input & lookup table
            ZEST_FULLNAME FROMHN  TOHN ZEST_ZIP STATEFP COUNTYFP LFROMADD  \
0               BELGER RD   1298  1100    29836      SC      005     1298   
1       REVOLUTIONARY TRL   4801  4999    29836      SC      005     4801   
3            HICKORY ST S    501   599    29810      SC      005      501   
4                 MILL ST   1142  1060    29810      SC      005     1142   
5  PATTERSON TRAILER PARK   S601  S699    29810      SC      005     S601   

  LTOADD RFROMADD RTOADD  ... TRACTCE10 BLKGRPCE10 ZCTA5CE10 PUMACE10 


In [51]:
from zrp.validate import ValidateGeocoded
from zrp.zrp import ZRP_Prepare, ZRP_Predict

In [52]:
zest_race_predictor = ZRP()
zest_race_predictor.fit()

<zrp.zrp.ZRP at 0x7f97960e09d0>

In [31]:
%%time
%mprun -f zest_race_predictor.transform zest_race_predictor.transform(zrp_sample.head(100000))

  ipython_version = LooseVersion(IPython.__version__)
  other = LooseVersion(other)


Directory already exists
Data is loaded
   [Start] Validating input data
     Number of observations: 100000
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace


  0%|          | 0/1 [00:00<?, ?it/s]


[Start] Preparing geo data

  The following states are included in the data: ['SC']
   ... on state: SC

   Data is loaded
   [Start] Processing geo data



  0%|          | 0/100000 [00:00<?, ?it/s]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:    0.0s
  0%|          | 304/100000 [00:00<00:32, 3024.08it/s]
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:    0.1s

      ...address cleaning

  1%|          | 737/100000 [00:00<00:29, 3324.89it/s]
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:    0.2s
  1%|          | 1124/100000 [00:00<00:28, 3468.13it/s]
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:    0.3s
  1%|▏         | 1488/100000 [00:00<00:28, 3510.75it/s]
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed:    0.4s
  2%|▏         | 1840/100000 [00:00<00:28, 3495.98it/s]
  2%|▏         | 2188/100000 [00:00<00:28, 3489.01it/s]
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed:    0.6s
  3%|▎         | 2505/100000 [00:00<00:29, 3285.27it/s]
  3%|▎         | 2814/100000 [00:00<00:32, 2952.85it/s]
  3%|▎         | 3102/100000 [00:00<00:35, 2746.21it/s]
[Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed:    1.0s
  3%|▎         | 3374/100000 [00:01<00:37, 2582.54it/s]
  4%|▎         | 3633/100000 [00:01<00:38, 2509.36it/s]
  4%|▍         | 3885/100000 [00:01<00:39, 2436.18it/s]
[Parallel(n_jobs=-1)]: Done 4042 tasks

         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=121062)
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=121062)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data


  if distutils.version.LooseVersion(version) < minimum_version:


      ...merge user input & lookup table
      ...mapping
   [Completed] Validating input geo data
Directory already exists


100%|██████████| 1/1 [17:20<00:00, 1040.55s/it]

Output saved
   [Completed] Mapping geo data

[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data





     Number of observations: 100000
     Is key unique: True

   [Completed] Validating ACS input data

   ...loading ACS lookup tables




   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

   [Start] Validating pipeline input data
     Number of observations: 367922
     Is key unique: False
   [Completed] Validating pipeline input data



  0%|          | 0/9 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 9/9 [00:00<00:00, 112.45it/s]
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:    0.1s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 577.33it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/2 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 2/2 [00:00<00:00, 1084.50it/s]
[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:    0.0s finished


Directory already exists
Output saved




Output saved


Filename: /home/ika/zrp/zrp/zrp.py

Line #    Mem usage    Increment  Occurrences   Line Contents
    84   2225.9 MiB   2225.9 MiB           1       def transform(self, input_data):
    85                                                 """
    86                                                 Processes input data and generates ZRP predictions. Generates BISG predictions additionally if specified.
    87                                         
    88                                                 Parameters
    89                                                 -----------
    90                                                 input_data: pd.Dataframe
    91                                                     Dataframe to be transformed
    92                                                 """
    93                                                 # Load Data
    94   2225.9 MiB      0.0 MiB           1           try:
    95   2225.9 MiB      0.0 MiB           1               data 


CPU times: user 22min 4s, sys: 42.4 s, total: 22min 47s
Wall time: 20min 24s
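The timings reported above can be turned into a rough throughput figure. The sketch below only does arithmetic on numbers that appear in this output (100,000 rows, the `%%time` summary, and the 1040.55 s shown by the single-iteration progress bar that finishes just before "[Completed] Mapping geo data"). Treating that bar as the duration of the geo preparation step is an interpretation, and the helper name `to_seconds` is made up for illustration.

```python
def to_seconds(minutes, seconds):
    """Convert a 'XXmin YYs' reading into seconds."""
    return 60 * minutes + seconds


rows = 100_000                 # zrp_sample.head(100000)
wall = to_seconds(20, 24)      # Wall time: 20min 24s
cpu = to_seconds(22, 47)       # CPU times total: 22min 47s
geo_step = 1040.55             # seconds per iteration shown by the 1/1 progress bar

print(f"rows per wall-clock second: {rows / wall:.1f}")
print(f"CPU time / wall time:       {cpu / wall:.2f}")
print(f"geo step share of wall:     {geo_step / wall:.0%}")
```

Because the run was made under a memory profiler, these figures likely overstate the cost of an unprofiled run, so they are best read as an upper bound for this machine rather than a benchmark.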


In [33]:
%mprun -f ZRP_Prepare.transform zest_race_predictor.transform(zrp_sample.head(100000))



Directory already exists
Data is loaded
   [Start] Validating input data
     Number of observations: 100000
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace

[Start] Preparing geo data



  0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/100000 [00:00<?, ?it/s]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:    0.1s
  0%|          | 364/100000 [00:00<00:27, 3617.91it/s]

  The following states are included in the data: ['SC']
   ... on state: SC

   Data is loaded
   [Start] Processing geo data
      ...address cleaning

[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:    0.1s
  1%|          | 764/100000 [00:00<00:26, 3723.89it/s]
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:    0.2s
  1%|          | 1232/100000 [00:00<00:24, 3960.24it/s]
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:    0.3s
  2%|▏         | 1509/100000 [00:00<00:28, 3436.25it/s]
  2%|▏         | 1782/100000 [00:00<00:31, 3132.58it/s]
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed:    0.5s
  2%|▏         | 2152/100000 [00:00<00:29, 3272.18it/s]
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed:    0.7s
  3%|▎         | 2516/100000 [00:00<00:28, 3369.47it/s]
  3%|▎         | 2940/100000 [00:00<00:27, 3584.78it/s]
[Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed:    0.9s
  3%|▎         | 3424/100000 [00:00<00:24, 3874.24it/s]
  4%|▍         | 3811/100000 [00:01<00:25, 3807.89it/s]
[Parallel(n_jobs=-1)]: Done 4042 tasks      | elapsed:    1.1s
  4%|▍         | 4268/100000 [

      ...replicating address
         ...Base
         ...Map street suffixes...
         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=121062)
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=121062)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data
      ...merge user input & lookup table
      ...mapping
   [Completed] Validating input geo data
Directory already exists


100%|██████████| 1/1 [17:32<00:00, 1052.14s/it]

Output saved
   [Completed] Mapping geo data

[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data





     Number of observations: 100000
     Is key unique: True

   [Completed] Validating ACS input data

   ...loading ACS lookup tables




   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

   [Start] Validating pipeline input data
     Number of observations: 367922
     Is key unique: False
   [Completed] Validating pipeline input data



  0%|          | 0/9 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 9/9 [00:00<00:00, 82.55it/s]
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:    0.1s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 652.00it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/2 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 2/2 [00:00<00:00, 1656.85it/s]
[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:    0.0s finished


Directory already exists
Output saved




Output saved



Filename: /home/ika/zrp/zrp/prepare/prepare.py

Line #    Mem usage    Increment  Occurrences   Line Contents
    48   2424.9 MiB   2424.9 MiB           1       def transform(self, input_data):
    49                                                 """
    50                                                 Parameters
    51                                                 ----------
    52                                                 input_data: pd.Dataframe
    53                                                     Dataframe to be transformed
    54                                                 """  
    55   2424.9 MiB      0.0 MiB           1           curpath = dirname(__file__)
    56                                                 
    57                                                 # Load Data
    58   2424.9 MiB      0.0 MiB           1           try:
    59   2424.9 MiB      0.0 MiB           1               data = input_data.copy()
    60   2424.9 MiB      0.0 MiB     
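`%mprun` comes from the memory_profiler extension loaded earlier. Where that extension is unavailable, a rough substitute can be sketched with the standard-library `tracemalloc` module. It tracks only memory allocated through Python's allocator, so its numbers are not comparable with the process-level MiB figures in the tables above. The helper `peak_python_memory` and the toy function `build_list` are hypothetical names used for illustration.

```python
import tracemalloc


def peak_python_memory(func, *args, **kwargs):
    """Run func and return (result, peak bytes traced by tracemalloc during the call)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        # Always stop tracing, even if func raises.
        tracemalloc.stop()
    return result, peak


def build_list(n):
    """Toy workload standing in for an expensive transform."""
    return list(range(n))


result, peak = peak_python_memory(build_list, 100_000)
print(f"peak traced memory: {peak / 2**20:.2f} MiB")
```

With the objects from this notebook, the analogous call would be `peak_python_memory(zest_race_predictor.transform, zrp_sample.head(1000))`, which reports one peak value for the whole call rather than a per-line breakdown.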

In [37]:
from zrp.prepare.acs_mapper import ACSModelPrep
from zrp.prepare.geo_geocoder import ZGeo, ProcessGeo
from zrp.modeling.predict import ZRP_Predict_BlockGroup, ZRP_Predict_ZipCode, validate_case

In [53]:
%mprun -f ACSModelPrep.acs_combine zest_race_predictor.transform(zrp_sample.head(1000))



Directory already exists
Data is loaded
   [Start] Validating input data
     Number of observations: 1000
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1


  0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/1000 [00:00<?, ?it/s]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:    0.1s
 58%|█████▊    | 576/1000 [00:00<00:00, 5699.24it/s]

   Formatting P2
   reduce whitespace

[Start] Preparing geo data

  The following states are included in the data: ['SC']
   ... on state: SC

   Data is loaded
   [Start] Processing geo data
      ...address cleaning


[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:    0.1s
100%|██████████| 1000/1000 [00:00<00:00, 5035.49it/s]
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:    0.2s finished


      ...replicating address
         ...Base
         ...Map street suffixes...
         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=1205)
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=1205)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data




      ...merge user input & lookup table
      ...mapping


100%|██████████| 1/1 [00:14<00:00, 14.07s/it]

   [Completed] Validating input geo data
Directory already exists
Output saved
   [Completed] Mapping geo data

[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data
     Number of observations: 1000
     Is key unique: True

   [Completed] Validating ACS input data

   ...loading ACS lookup tables





   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
mbggk
(839, 926)
Index(['first_name', 'middle_name', 'last_name', 'house_number',
       'street_address', 'city', 'state', 'zip_code', 'BLKGRPCE', 'BLKGRPCE10',
       ...
       'B99021_002', 'B99021_003', 'B99162_001', 'B99162_002', 'B99162_003',
       'B99162_004', 'B99162_005', 'B99162_006', 'B99162_007', 'acs_source'],
      dtype='object', length=926)
first_name        object
middle_name       object
last_name         object
house_number      object
street_address    object
                   ...  
B99162_004        object
B99162_005        object
B99162_006        object
B99162_007        object
acs_source        object
Length: 926, dtype: object
             first_name middle_name last_name house_number street_address  \
ZEST_KEY                                                                    
SC_471076575       GAIL        None   ABRUNZO         

  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 690.19it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 678.91it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 674.43it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


Directory already exists
Output saved




Output saved





Filename: /home/ika/zrp/zrp/prepare/acs_mapper.py

Line #    Mem usage    Increment  Occurrences   Line Contents
    47   5967.3 MiB   5967.3 MiB           1       def acs_combine(self, data, acs_bg, acs_ct, acs_zip):
    48                                                 """
    49                                                 Combines ACS data with processed user input data.
    50                                                 Generating optional features for modeling.
    51                                                 
    52                                                 Parameters
    53                                                 ----------
    54                                                 data: str
    55                                                     Processed user input data, expected to include names & GEOID        
    56                                                 acs_bg: str
    57                                                     ACS block gro
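In the 100,000-row runs, the logs report a unique key when the ACS input is validated (100,000 observations) but a non-unique key when the pipeline input is validated (367,922 observations). One common way for this to happen is a one-to-many merge, where a single input row matches several lookup rows. The sketch below illustrates that mechanism with made-up data; whether it is what happens inside `acs_combine` or the geo mapping step is an assumption, and the column names `street` and `segment_id` are invented.

```python
import pandas as pd

# Two input rows with unique keys (made-up values).
people = pd.DataFrame(
    {"ZEST_KEY": ["K1", "K2"], "street": ["MAIN ST", "OAK AVE"]}
)

# Made-up lookup table in which one street matches two candidate segments.
lookup = pd.DataFrame(
    {"street": ["MAIN ST", "MAIN ST", "OAK AVE"], "segment_id": [10, 11, 20]}
)

# A left merge keeps every input row and repeats it once per matching lookup row.
merged = people.merge(lookup, on="street", how="left")
print(len(people), "->", len(merged), "rows")
print("key unique after merge:", merged["ZEST_KEY"].is_unique)

# One possible way back to a single row per key: keep the first candidate.
deduped = merged.drop_duplicates(subset="ZEST_KEY", keep="first")
print("key unique after dedup:", deduped["ZEST_KEY"].is_unique)
```

Keeping the first candidate is only one policy. Which candidate a real pipeline should keep depends on match quality, which this sketch does not model.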


In [43]:
%lprun -f ZGeo.transform zest_race_predictor.transform(zrp_sample.head(500000))

Directory already exists
Data is loaded
   [Start] Validating input data
     Number of observations: 500000
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace

[Start] Preparing geo data


  0%|          | 0/1 [00:00<?, ?it/s]


  The following states are included in the data: ['SC']
   ... on state: SC






   Data is loaded
   [Start] Processing geo data
      ...address cleaning


  0%|          | 0/500000 [00:00<?, ?it/s]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
  0%|          | 1/500000 [00:00<23:19:35,  5.95it/s]
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:    0.1s
  0%|          | 800/500000 [00:00<16:18:27,  8.50it/s]
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:    0.2s
  0%|          | 1584/500000 [00:00<11:24:10, 12.14it/s]
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed:    0.2s
  0%|          | 2264/500000 [00:00<7:58:37, 17.33it/s]
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed:    0.3s
  1%|          | 3020/500000 [00:00<5:34:51, 24.74it/s]
[Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed:    0.4s
  1%|          | 3696/500000 [00:00<3:54:27, 35.28it/s]
[Parallel(n_jobs=-1)

      ...replicating address
         ...Base
         ...Map street suffixes...
         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=605270)
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=605270)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data
      ...merge user input & lookup table
      ...mapping
   [Completed] Validating input geo data
Directory already exists


100%|██████████| 1/1 [30:44<00:00, 1844.73s/it]

Output saved
   [Completed] Mapping geo data






[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data
     Number of observations: 500000
     Is key unique: True

   [Completed] Validating ACS input data

   ...loading ACS lookup tables




   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

   [Start] Validating pipeline input data
     Number of observations: 1843097
     Is key unique: False
   [Completed] Validating pipeline input data



  0%|          | 0/43 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 43/43 [00:00<00:00, 90.78it/s] 
[Parallel(n_jobs=-1)]: Done  43 out of  43 | elapsed:    0.6s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 903.17it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/8 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 8/8 [00:00<00:00, 113.91it/s]
[Parallel(n_jobs=-1)]: Done   8 out of   8 | elapsed:    0.1s finished


Directory already exists




Output saved




Output saved


Timer unit: 1e-06 s

Total time: 1844.18 s
File: /home/ika/zrp/zrp/prepare/geo_geocoder.py
Function: transform at line 183

Line #      Hits         Time  Per Hit   % Time  Line Contents
   183                                               def transform(self, input_data, geo, processed, replicate, save_table=True):
   184                                                   """
   185                                                   Returns a DataFrame of geocoded addresses.
   186                                           
   187                                                   :param input_data: A pd.DataFrame.
   188                                                   :param geo: A String
   189                                                   :param processed: A boolean.
   190                                                   :param replicate: A boolean.
   191                                                   :param save_table: A boolean. Tables are saved if True. Default is True.


In [236]:
from zrp.prepare.geo_geocoder import replicate_address_2

In [237]:
%lprun -f replicate_address_2 zest_race_predictor.transform(zrp_sample.head(100000))

Directory already exists
Data is loaded
   [Start] Validating input data
     Number of observations: 100000
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace


  0%|          | 0/1 [00:00<?, ?it/s]


[Start] Preparing geo data

  The following states are included in the data: ['SC']
   ... on state: SC

   Data is loaded
   [Start] Processing geo data



  0%|          | 0/100000 [00:00<?, ?it/s]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:    0.1s
  1%|          | 608/100000 [00:00<00:16, 6073.98it/s]
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:    0.1s

      ...address cleaning

[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:    0.2s
  1%|▏         | 1408/100000 [00:00<00:15, 6530.75it/s]
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed:    0.3s
  2%|▏         | 2136/100000 [00:00<00:14, 6736.60it/s]
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed:    0.4s
  3%|▎         | 2644/100000 [00:00<00:15, 6122.36it/s]
[Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed:    0.5s
  3%|▎         | 3224/100000 [00:00<00:16, 6021.58it/s]
  4%|▍         | 3844/100000 [00:00<00:15, 6063.52it/s]
[Parallel(n_jobs=-1)]: Done 4042 tasks      | elapsed:    0.6s
  4%|▍         | 4456/100000 [00:00<00:15, 6070.28it/s]
[Parallel(n_jobs=-1)]: Done 4992 tasks      | elapsed:    0.8s
  5%|▌         | 5072/100000 [00:00<00:15, 6092.61it/s]
  6%|▌         | 5700/100000 [00:00<00:15, 6139.23it/s]
[Parallel(n_jobs=-1)]: Done 6042 tasks      | elapsed:    1.0s
  6%|▋         | 6316/100000 [00:01<00:15, 6136.97it/s]
  7%|▋         | 6915/10000

      ...replicating address
         ...Base
         ...Map street suffixes...
         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=121062)
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=121062)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data




      ...merge user input & lookup table
      ...mapping
   [Completed] Validating input geo data
Directory already exists


100%|██████████| 1/1 [06:13<00:00, 373.45s/it]

Output saved
   [Completed] Mapping geo data

[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data





     Number of observations: 100000
     Is key unique: True

   [Completed] Validating ACS input data

   ...loading ACS lookup tables




   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

   [Start] Validating pipeline input data
     Number of observations: 367922
     Is key unique: False
   [Completed] Validating pipeline input data



  0%|          | 0/9 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 9/9 [00:00<00:00, 112.35it/s]
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:    0.1s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 876.74it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/2 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 2/2 [00:00<00:00, 1949.93it/s]
[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:    0.0s finished


Directory already exists
Output saved




Output saved


Timer unit: 1e-06 s

Total time: 105.83 s
File: /home/ika/zrp/zrp/prepare/preprocessing.py
Function: replicate_address_2 at line 208

Line #      Hits         Time  Per Hit   % Time  Line Contents
   208                                           def replicate_address_2(data, street_address, street_suffix_mapping, unit_mapping):
   209                                               """
   210                                               Replicate street addresses 
   211                                               
   212                                               Parameters
   213                                               ----------
   214                                               data: pd.DataFrame
   215                                                   DataFrame to make changes to 
   216                                               street_address: str
   217                                                  Name of street address column 
   218                         
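
`%lprun` is provided by the third-party line_profiler extension and reports per-line timings for the functions named with `-f`. Where only function-level totals are needed, the standard library's `cProfile` gives a rougher view without extra dependencies. The sketch below is an illustration, not part of the notebook: `slow_step` is a hypothetical stand-in for a pipeline call such as `zest_race_predictor.transform(...)`, and `profile_call` is a helper name invented here.

```python
import cProfile
import io
import pstats


def slow_step(n):
    """Hypothetical stand-in for a pipeline step; not part of zrp."""
    return sum(i * i for i in range(n))


def profile_call(func, *args, top=5, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the outermost expensive calls come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


result, report = profile_call(slow_step, 100_000)
print(report)
```

Unlike `%lprun`, this reports time per function rather than per source line, so it is better suited to finding which step is slow than to finding which statement inside it is slow.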

In [232]:
%lprun -f ACSModelPrep.acs_combine zest_race_predictor.transform(zrp_sample.head(100000))

Directory already exists
Data is loaded
   [Start] Validating input data
     Number of observations: 100000
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace


  0%|          | 0/1 [00:00<?, ?it/s]


[Start] Preparing geo data

  The following states are included in the data: ['SC']
   ... on state: SC

   Data is loaded
   [Start] Processing geo data
      ...address cleaning



  0%|          | 0/100000 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:    0.1s

  1%|          | 808/100000 [00:00<00:12, 8070.35it/s][Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:    0.2s

  2%|▏         | 1568/100000 [00:00<00:12, 7911.85it/s][Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed:    0.2s

  2%|▏         | 2312/100000 [00:00<00:12, 7755.58it/s][Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed:    0.3s

  3%|▎         | 3040/100000 [00:00<00:12, 7599.14it/s][Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed:    0.4s

  4%|▍         | 3792/100000 [00:00<00:12, 7566.88it/s][Parallel(n_jobs=-1)]: Done 4042 tasks      | elapsed:    0.6s

  5%|▍         

      ...replicating address
         ...Base
         ...Map street suffixes...
         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=121062)
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=121062)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data
      ...merge user input & lookup table
      ...mapping
   [Completed] Validating input geo data
Directory already exists


100%|██████████| 1/1 [06:15<00:00, 375.08s/it]

Output saved
   [Completed] Mapping geo data

[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data





     Number of observations: 100000
     Is key unique: True

   [Completed] Validating ACS input data

   ...loading ACS lookup tables




   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

   [Start] Validating pipeline input data
     Number of observations: 367922
     Is key unique: False
   [Completed] Validating pipeline input data



  0%|          | 0/9 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 9/9 [00:00<00:00, 131.77it/s]
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:    0.1s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1010.68it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/2 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 2/2 [00:00<00:00, 2060.07it/s]
[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:    0.0s finished


Directory already exists
Output saved




Output saved


Timer unit: 1e-06 s

Total time: 16.9006 s
File: /home/ika/zrp/zrp/prepare/acs_mapper.py
Function: acs_combine at line 47

Line #      Hits         Time  Per Hit   % Time  Line Contents
    47                                               def acs_combine(self, data, acs_bg, acs_ct, acs_zip):
    48                                                   """
    49                                                   Combines ACS data with processed user input data.
    50                                                   Generating optional features for modeling.
    51                                                   
    52                                                   Parameters
    53                                                   ----------
    54                                                   data: str
    55                                                       Processed user input data, expected to include names & GEOID        
    56                                          
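
The lines `if distutils.version.LooseVersion(version) < minimum_version:` and `other = LooseVersion(other)` in the cell output are the source-line echoes of a warning; the warning text itself is not shown. Because the notebook sets `warnings.filterwarnings(action='once')`, such warnings are printed rather than hidden. A minimal sketch of a targeted filter follows. It assumes the warning is the `DeprecationWarning` that `distutils` raises for its version classes; the message pattern is an assumption and the helper name is hypothetical.

```python
import warnings


def silence_looseversion_warnings():
    """Ignore DeprecationWarnings about distutils version classes only.

    The message pattern is an assumption; adjust it to the warning text
    actually emitted in your environment.
    """
    warnings.filterwarnings(
        "ignore",
        message=r".*distutils Version classes are deprecated.*",
        category=DeprecationWarning,
    )


# Install the filter once, before running the pipeline.
silence_looseversion_warnings()
```

Filtering on both category and message keeps other deprecation warnings visible, which a blanket `warnings.filterwarnings('ignore')` would not.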

In [210]:
%lprun -f ZGeo.get_reduced zest_race_predictor.transform(zrp_sample.head(100000))

Directory already exists
Data is loaded
   [Start] Validating input data
     Number of observations: 100000
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace


  0%|          | 0/1 [00:00<?, ?it/s]


[Start] Preparing geo data

  The following states are included in the data: ['SC']
   ... on state: SC

   Data is loaded
   [Start] Processing geo data
      ...address cleaning



  0%|          | 0/100000 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:    0.1s

  1%|          | 824/100000 [00:00<00:12, 8222.03it/s][Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:    0.2s

  2%|▏         | 1700/100000 [00:00<00:11, 8373.09it/s][Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed:    0.2s

  2%|▏         | 2304/100000 [00:00<00:13, 7489.70it/s][Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed:    0.3s

  3%|▎         | 2917/100000 [00:00<00:13, 7022.38it/s][Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed:    0.4s

  4%|▎         | 3528/100000 [00:00<00:14, 6709.91it/s][Parallel(n_jobs=-1)]: Done 4042 tasks      | elapsed:    0.6s

  4%|▍         

      ...replicating address
         ...Base
         ...Map street suffixes...
         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=121062)
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=121062)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data




      ...merge user input & lookup table
      ...mapping
   [Completed] Validating input geo data
Directory already exists


100%|██████████| 1/1 [05:55<00:00, 355.42s/it]

Output saved
   [Completed] Mapping geo data

[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data





     Number of observations: 100000
     Is key unique: True

   [Completed] Validating ACS input data

   ...loading ACS lookup tables




   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

   [Start] Validating pipeline input data
     Number of observations: 367922
     Is key unique: False
   [Completed] Validating pipeline input data



  0%|          | 0/9 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 9/9 [00:00<00:00, 108.11it/s]
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:    0.1s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 794.07it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/2 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 2/2 [00:00<00:00, 1783.29it/s]
[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:    0.0s finished


Directory already exists
Output saved




Output saved


Timer unit: 1e-06 s

Total time: 211.008 s
File: /home/ika/zrp/zrp/prepare/geo_geocoder.py
Function: get_reduced at line 110

Line #      Hits         Time  Per Hit   % Time  Line Contents
   110                                               def get_reduced(self, tmp_data):
   111         1          4.0      4.0      0.0          keep_cols = ['ZEST_KEY', 'first_name', 'middle_name', 'last_name',
   112         1          2.0      2.0      0.0                       'house_number', 'street_address', 'city', 'state', 'zip_code',
   113         1          1.0      1.0      0.0                       'BLKGRPCE', 'BLKGRPCE10', 'COUNTYFP', 'COUNTYFP10', 'FROMHN', 'TOHN',
   114         1          2.0      2.0      0.0                       'LFROMADD', 'LTOADD', 'PUMACE', 'PUMACE10', 'RFROMADD', 'RTOADD', 'SIDE',
   115         1          2.0      2.0      0.0                       'STATEFP', 'STATEFP10', 'TBLKGPCE', 'TRACTCE', 'TRACTCE10', 'TTRACTCE',
   116         1          1.0      1.0    
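
The logs above report a unique key on the 100,000 input rows but a non-unique key on the 367,922 rows that reach pipeline validation, and the address expansion step reports n=121062. This suggests that some keys end up on several rows along the way. A sketch of how one might inspect that in pandas is given below. The frame is made-up toy data, `key_report` is a hypothetical helper, and only the index name `ZEST_KEY` is taken from the notebook's input.

```python
import pandas as pd


def key_report(df):
    """Summarise how often each index key appears in df."""
    counts = df.index.value_counts()
    return {
        "n_rows": len(df),
        "n_keys": int(df.index.nunique()),
        "is_unique": bool(df.index.is_unique),
        "max_rows_per_key": int(counts.max()) if len(counts) else 0,
    }


# Toy data only: three keys, one of which appears on three rows.
toy = pd.DataFrame(
    {"street_address": ["1 MAIN ST", "1 MAIN STREET", "1 MAIN", "2 OAK AVE", "3 ELM RD"]},
    index=pd.Index(["K1", "K1", "K1", "K2", "K3"], name="ZEST_KEY"),
)
report = key_report(toy)
print(report)
```

Comparing `n_rows` with `n_keys` before and after each stage would show where the row count grows.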

In [47]:
from zrp.prepare.utils import load_file

In [48]:
%lprun -f load_file zest_race_predictor.transform(zrp_sample.head(50))

  0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/50 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 50/50 [00:00<00:00, 4806.45it/s]

Directory already exists
Data is loaded
   [Start] Validating input data
     Number of observations: 50
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace

[Start] Preparing geo data

  The following states are included in the data: ['SC']
   ... on state: SC

   Data is loaded
   [Start] Processing geo data
      ...address cleaning



[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:    0.0s finished


      ...replicating address
         ...Base
         ...Map street suffixes...
         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=55)
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=55)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data




      ...merge user input & lookup table
      ...mapping


100%|██████████| 1/1 [00:03<00:00,  3.78s/it]

   [Completed] Validating input geo data
Directory already exists
Output saved
   [Completed] Mapping geo data

[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data
     Number of observations: 50
     Is key unique: True

   [Completed] Validating ACS input data

   ...loading ACS lookup tables





   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

   [Start] Validating pipeline input data
     Number of observations: 174
     Is key unique: False
   [Completed] Validating pipeline input data



  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 599.53it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 806.91it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


Directory already exists
Output saved




Output saved




Timer unit: 1e-06 s

Total time: 20.1638 s
File: /home/ika/zrp/zrp/prepare/utils.py
Function: load_file at line 96

Line #      Hits         Time  Per Hit   % Time  Line Contents
    96                                           def load_file(file_path):    
    97                                               """
    98                                               Load files. Compatible with csv, text, feather, xlsx, and parquet
    99                                               
   100                                               Parameters
   101                                               ----------
   102                                               file_path: str
   103                                                   File path of file to load
   104                                               """
   105                                           
   106         4          5.0      1.2      0.0      na_values = ["None",
   107         4          1.0      0.2      0.0     
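
The timing cell that follows runs the whole fit/transform cycle at several sample sizes and stores the elapsed seconds in `times`. A dictionary of that shape can be summarised as a fixed overhead plus a per-row cost by fitting an ordinary least-squares line. The sketch below is an illustration under that linear assumption, which may not hold at larger sizes; `fit_overhead_and_rate` is a hypothetical helper, and the sample values are the first five timings printed by that cell.

```python
def fit_overhead_and_rate(times):
    """Least-squares fit of seconds = overhead + per_row * n_obs.

    `times` maps sample size to elapsed seconds.
    Returns (overhead_seconds, seconds_per_row).
    """
    if len(times) < 2:
        raise ValueError("need at least two timings to fit a line")
    xs = list(times)
    ys = [times[n] for n in xs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    per_row = sxy / sxx
    overhead = mean_y - per_row * mean_x
    return overhead, per_row


# First five timings reported by the loop, in seconds.
observed = {5: 11.69, 10: 11.51, 100: 12.76, 500: 18.1, 1000: 19.73}
overhead, per_row = fit_overhead_and_rate(observed)
print(f"fixed overhead ~ {overhead:.2f} s, per-row cost ~ {per_row * 1000:.2f} ms")
```

Runs of 5 and 10 rows both take between 11 and 12 seconds, so a fixed startup cost appears to dominate small inputs.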

In [30]:
# %%time
# Time the full fit/transform cycle at increasing sample sizes.
times = {}
for n_obs in [5, 10, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000]:
    start_time = time()
    zest_race_predictor = ZRP()
    zest_race_predictor.fit()
    zrp_output = zest_race_predictor.transform(zrp_sample.head(n_obs))
    # Measure once so the printed and stored values agree.
    elapsed = round(time() - start_time, 2)
    print(f'N_obs: {n_obs}, time: {elapsed} seconds')
    times[n_obs] = elapsed

  0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/5 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 5/5 [00:00<00:00, 1586.83it/s]

Directory already exists
Data is loaded
   [Start] Validating input data
     Number of observations: 5
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace

[Start] Preparing geo data

  The following states are included in the data: ['SC']
   ... on state: SC

   Data is loaded
   [Start] Processing geo data
      ...address cleaning
      ...replicating address
         ...Base
         ...Map street suffixes...
         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=5)
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=5)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data



[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.0s finished
100%|██████████| 1/1 [00:02<00:00,  2.70s/it]

      ...merge user input & lookup table
      ...mapping
   [Completed] Validating input geo data
Directory already exists
Output saved
   [Completed] Mapping geo data

[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data
     Number of observations: 5
     Is key unique: True

   [Completed] Validating ACS input data

['/home/ika/zrp/zrp/prepare/../data/processed/acs/2019/5yr/processed_Zest_ACS_Lookup_20195yr_blockgroup_short.parquet']
   ...loading ACS lookup tables





GEOID         object
GEO_NAME      object
EXT_GEOID     object
B01003_001    uint16
B02001_001    uint16
               ...  
B99021_003     int32
B99162_003     int16
B99162_004     int16
B99162_005     int16
B99162_007     int16
Length: 91, dtype: object
   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

predict start
   [Start] Validating pipeline input data
     Number of observations: 16
     Is key unique: False
   [Completed] Validating pipeline input data



  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1030.29it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1041.29it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


Directory already exists
Output saved


  0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/10 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 10/10 [00:00<00:00, 2378.26it/s]
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    0.0s finished


Output saved
N_obs: 5, time: 11.69 seconds
Directory already exists
Data is loaded
   [Start] Validating input data
     Number of observations: 10
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace

[Start] Preparing geo data

  The following states are included in the data: ['SC']
   ... on state: SC

   Data is loaded
   [Start] Processing geo data
      ...address cleaning
      ...replicating address
         ...Base
         ...Map street suffixes...
         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=10)
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=10)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data


100%|██████████| 1/1 [00:02<00:00,  2.77s/it]

      ...merge user input & lookup table
      ...mapping
   [Completed] Validating input geo data
Directory already exists
Output saved
   [Completed] Mapping geo data

[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data
     Number of observations: 10
     Is key unique: True

   [Completed] Validating ACS input data

['/home/ika/zrp/zrp/prepare/../data/processed/acs/2019/5yr/processed_Zest_ACS_Lookup_20195yr_blockgroup_short.parquet']
   ...loading ACS lookup tables





GEOID         object
GEO_NAME      object
EXT_GEOID     object
B01003_001    uint16
B02001_001    uint16
               ...  
B99021_003     int32
B99162_003     int16
B99162_004     int16
B99162_005     int16
B99162_007     int16
Length: 91, dtype: object
   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

predict start
   [Start] Validating pipeline input data
     Number of observations: 30
     Is key unique: False
   [Completed] Validating pipeline input data



  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 994.62it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1154.18it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


Directory already exists
Output saved


  0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/100 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.0s
100%|██████████| 100/100 [00:00<00:00, 6174.72it/s]
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    0.0s finished


Output saved
N_obs: 10, time: 11.51 seconds
Directory already exists
Data is loaded
   [Start] Validating input data
     Number of observations: 100
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace

[Start] Preparing geo data

  The following states are included in the data: ['SC']
   ... on state: SC

   Data is loaded
   [Start] Processing geo data
      ...address cleaning
      ...replicating address
         ...Base
         ...Map street suffixes...
         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=117)
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=117)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data
      ...merge user input & lookup table
      ...mapping


100%|██████████| 1/1 [00:02<00:00,  2.80s/it]

   [Completed] Validating input geo data
Directory already exists
Output saved
   [Completed] Mapping geo data

[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data
     Number of observations: 100
     Is key unique: True

   [Completed] Validating ACS input data

['/home/ika/zrp/zrp/prepare/../data/processed/acs/2019/5yr/processed_Zest_ACS_Lookup_20195yr_blockgroup_short.parquet']
   ...loading ACS lookup tables





GEOID         object
GEO_NAME      object
EXT_GEOID     object
B01003_001    uint16
B02001_001    uint16
               ...  
B99021_003     int32
B99162_003     int16
B99162_004     int16
B99162_005     int16
B99162_007     int16
Length: 91, dtype: object
   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

predict start
   [Start] Validating pipeline input data
     Number of observations: 362
     Is key unique: False
   [Completed] Validating pipeline input data



  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1066.17it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1324.80it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


Directory already exists
Output saved


  0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/500 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:    0.0s
100%|██████████| 500/500 [00:00<00:00, 10458.57it/s]
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:    0.0s finished


Output saved
N_obs: 100, time: 12.76 seconds
Directory already exists
Data is loaded
   [Start] Validating input data
     Number of observations: 500
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace

[Start] Preparing geo data

  The following states are included in the data: ['SC']
   ... on state: SC

   Data is loaded
   [Start] Processing geo data
      ...address cleaning
      ...replicating address
         ...Base
         ...Map street suffixes...
         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=609)
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=609)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data
      ...merge user input & lookup table
      ...mapping


100%|██████████| 1/1 [00:03<00:00,  3.49s/it]

   [Completed] Validating input geo data
Directory already exists
Output saved
   [Completed] Mapping geo data

[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data
     Number of observations: 500
     Is key unique: True

   [Completed] Validating ACS input data

['/home/ika/zrp/zrp/prepare/../data/processed/acs/2019/5yr/processed_Zest_ACS_Lookup_20195yr_blockgroup_short.parquet']
   ...loading ACS lookup tables





GEOID         object
GEO_NAME      object
EXT_GEOID     object
B01003_001    uint16
B02001_001    uint16
               ...  
B99021_003     int32
B99162_003     int16
B99162_004     int16
B99162_005     int16
B99162_007     int16
Length: 91, dtype: object
   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

predict start
   [Start] Validating pipeline input data
     Number of observations: 1821
     Is key unique: False
   [Completed] Validating pipeline input data



  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1869.95it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1032.32it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1091.13it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


Directory already exists
Output saved


  0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/1000 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:    0.1s
100%|██████████| 1000/1000 [00:00<00:00, 12109.04it/s]
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:    0.1s finished


Output saved
N_obs: 500, time: 18.1 seconds
Directory already exists
Data is loaded
   [Start] Validating input data
     Number of observations: 1000
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace

[Start] Preparing geo data

  The following states are included in the data: ['SC']
   ... on state: SC

   Data is loaded
   [Start] Processing geo data
      ...address cleaning
      ...replicating address
         ...Base
         ...Map street suffixes...
         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=1205)
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=1205)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data
      ...merge user input & lookup table
      ...mapping


100%|██████████| 1/1 [00:04<00:00,  4.38s/it]

   [Completed] Validating input geo data
Directory already exists
Output saved
   [Completed] Mapping geo data

[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data
     Number of observations: 1000
     Is key unique: True

   [Completed] Validating ACS input data

['/home/ika/zrp/zrp/prepare/../data/processed/acs/2019/5yr/processed_Zest_ACS_Lookup_20195yr_blockgroup_short.parquet']
   ...loading ACS lookup tables





GEOID         object
GEO_NAME      object
EXT_GEOID     object
B01003_001    uint16
B02001_001    uint16
               ...  
B99021_003     int32
B99162_003     int16
B99162_004     int16
B99162_005     int16
B99162_007     int16
Length: 91, dtype: object
   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

predict start
   [Start] Validating pipeline input data
     Number of observations: 3672
     Is key unique: False
   [Completed] Validating pipeline input data



  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1270.23it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1166.70it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1135.44it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


Directory already exists
Output saved


  if distutils.version.LooseVersion(version) < minimum_version:
  other = LooseVersion(other)
  if distutils.version.LooseVersion(version) < minimum_version:
  other = LooseVersion(other)


Output saved
N_obs: 1000, time: 19.73 seconds
Directory already exists
Data is loaded
   [Start] Validating input data
     Number of observations: 5000
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1


  0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/5000 [00:00<?, ?it/s][A[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:    0.1s

 26%|██▌       | 1284/5000 [00:00<00:00, 12829.64it/s][A

   Formatting P2
   reduce whitespace

[Start] Preparing geo data

  The following states are included in the data: ['SC']
   ... on state: SC

   Data is loaded
   [Start] Processing geo data
      ...address cleaning


[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed:    0.2s

 39%|███▉      | 1960/5000 [00:00<00:00, 10094.29it/s][A[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed:    0.3s

 52%|█████▏    | 2620/5000 [00:00<00:00, 8701.69it/s] [A[Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed:    0.4s

 66%|██████▌   | 3312/5000 [00:00<00:00, 8071.96it/s][A
 80%|███████▉  | 3984/5000 [00:00<00:00, 7610.64it/s][A[Parallel(n_jobs=-1)]: Done 4042 tasks      | elapsed:    0.5s

 93%|█████████▎| 4664/5000 [00:00<00:00, 7340.68it/s][A[Parallel(n_jobs=-1)]: Done 4992 tasks      | elapsed:    0.7s
100%|██████████| 5000/5000 [00:00<00:00, 7655.82it/s]
[Parallel(n_jobs=-1)]: Done 5000 out of 5000 | elapsed:    0.7s finished


      ...replicating address
         ...Base
         ...Map street suffixes...
         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=6016)
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=6016)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data
      ...merge user input & lookup table
      ...mapping


  if distutils.version.LooseVersion(version) < minimum_version:
  other = LooseVersion(other)
100%|██████████| 1/1 [00:12<00:00, 12.26s/it]

   [Completed] Validating input geo data
Directory already exists
Output saved
   [Completed] Mapping geo data

[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data
     Number of observations: 5000
     Is key unique: True

   [Completed] Validating ACS input data

['/home/ika/zrp/zrp/prepare/../data/processed/acs/2019/5yr/processed_Zest_ACS_Lookup_20195yr_blockgroup_short.parquet']
   ...loading ACS lookup tables



  if distutils.version.LooseVersion(version) < minimum_version:
  other = LooseVersion(other)


GEOID         object
GEO_NAME      object
EXT_GEOID     object
B01003_001    uint16
B02001_001    uint16
               ...  
B99021_003     int32
B99162_003     int16
B99162_004     int16
B99162_005     int16
B99162_007     int16
Length: 91, dtype: object
   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

predict start
   [Start] Validating pipeline input data
     Number of observations: 18396
     Is key unique: False
   [Completed] Validating pipeline input data



  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1227.48it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1209.43it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1083.80it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


Directory already exists
Output saved


  if distutils.version.LooseVersion(version) < minimum_version:
  other = LooseVersion(other)
  if distutils.version.LooseVersion(version) < minimum_version:
  other = LooseVersion(other)


Output saved
N_obs: 5000, time: 27.65 seconds
Directory already exists
Data is loaded
   [Start] Validating input data
     Number of observations: 10000
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1


  0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/10000 [00:00<?, ?it/s][A[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:    0.1s

 13%|█▎        | 1304/10000 [00:00<00:00, 13033.46it/s][A

   Formatting P2
   reduce whitespace

[Start] Preparing geo data

  The following states are included in the data: ['SC']
   ... on state: SC

   Data is loaded
   [Start] Processing geo data
      ...address cleaning


[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed:    0.2s

 25%|██▌       | 2508/10000 [00:00<00:00, 12698.34it/s][A
 32%|███▏      | 3152/10000 [00:00<00:00, 9827.84it/s] [A[Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed:    0.3s

 38%|███▊      | 3816/10000 [00:00<00:00, 8589.58it/s][A[Parallel(n_jobs=-1)]: Done 4042 tasks      | elapsed:    0.4s

 45%|████▍     | 4496/10000 [00:00<00:00, 7957.48it/s][A[Parallel(n_jobs=-1)]: Done 4992 tasks      | elapsed:    0.6s

 52%|█████▏    | 5180/10000 [00:00<00:00, 7577.39it/s][A
 59%|█████▊    | 5872/10000 [00:00<00:00, 7355.14it/s][A[Parallel(n_jobs=-1)]: Done 6042 tasks      | elapsed:    0.7s

 66%|██████▌   | 6552/10000 [00:00<00:00, 7173.68it/s][A[Parallel(n_jobs=-1)]: Done 7192 tasks      | elapsed:    0.9s

 72%|███████▏  | 7236/10000 [00:00<00:00, 7050.68it/s][A
 79%|███████▉  | 7920/10000 [00:01<00:00, 6978.34it/s][A[Parallel(n_jobs=-1)]: Done 8442 ta

      ...replicating address
         ...Base
         ...Map street suffixes...
         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=12032)
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=12032)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data
      ...merge user input & lookup table
      ...mapping


  if distutils.version.LooseVersion(version) < minimum_version:
  other = LooseVersion(other)
100%|██████████| 1/1 [00:21<00:00, 21.36s/it]

   [Completed] Validating input geo data
Directory already exists
Output saved
   [Completed] Mapping geo data

[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data
     Number of observations: 10000
     Is key unique: True

   [Completed] Validating ACS input data

['/home/ika/zrp/zrp/prepare/../data/processed/acs/2019/5yr/processed_Zest_ACS_Lookup_20195yr_blockgroup_short.parquet']
   ...loading ACS lookup tables



  if distutils.version.LooseVersion(version) < minimum_version:
  other = LooseVersion(other)


GEOID         object
GEO_NAME      object
EXT_GEOID     object
B01003_001    uint16
B02001_001    uint16
               ...  
B99021_003     int32
B99162_003     int16
B99162_004     int16
B99162_005     int16
B99162_007     int16
Length: 91, dtype: object
   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

predict start
   [Start] Validating pipeline input data
     Number of observations: 36863
     Is key unique: False
   [Completed] Validating pipeline input data



  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1207.69it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 964.87it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1175.86it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


Directory already exists
Output saved


  if distutils.version.LooseVersion(version) < minimum_version:
  other = LooseVersion(other)
  if distutils.version.LooseVersion(version) < minimum_version:
  other = LooseVersion(other)


Output saved
N_obs: 10000, time: 39.49 seconds
Directory already exists
Data is loaded
   [Start] Validating input data
     Number of observations: 50000
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace


  0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/50000 [00:00<?, ?it/s][A


[Start] Preparing geo data

  The following states are included in the data: ['SC']
   ... on state: SC

   Data is loaded
   [Start] Processing geo data
      ...address cleaning


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:    0.1s

  3%|▎         | 1256/50000 [00:00<00:03, 12533.86it/s][A[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed:    0.2s

  5%|▌         | 2660/50000 [00:00<00:03, 12936.73it/s][A[Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed:    0.3s

  7%|▋         | 3352/50000 [00:00<00:04, 10243.82it/s][A
  8%|▊         | 4048/50000 [00:00<00:05, 8960.31it/s] [A[Parallel(n_jobs=-1)]: Done 4042 tasks      | elapsed:    0.4s

  9%|▉         | 4736/50000 [00:00<00:05, 8173.87it/s][A[Parallel(n_jobs=-1)]: Done 4992 tasks      | elapsed:    0.5s

      ...replicating address
         ...Base
         ...Map street suffixes...
         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=60443)
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=60443)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data
      ...merge user input & lookup table
      ...mapping
   [Completed] Validating input geo data
Directory already exists


  if distutils.version.LooseVersion(version) < minimum_version:
  other = LooseVersion(other)
100%|██████████| 1/1 [01:36<00:00, 96.51s/it]

Output saved
   [Completed] Mapping geo data

[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data
     Number of observations: 50000
     Is key unique: True






   [Completed] Validating ACS input data

['/home/ika/zrp/zrp/prepare/../data/processed/acs/2019/5yr/processed_Zest_ACS_Lookup_20195yr_blockgroup_short.parquet']
   ...loading ACS lookup tables


  if distutils.version.LooseVersion(version) < minimum_version:
  other = LooseVersion(other)


GEOID         object
GEO_NAME      object
EXT_GEOID     object
B01003_001    uint16
B02001_001    uint16
               ...  
B99021_003     int32
B99162_003     int16
B99162_004     int16
B99162_005     int16
B99162_007     int16
Length: 91, dtype: object
   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

predict start
   [Start] Validating pipeline input data
     Number of observations: 184001
     Is key unique: False
   [Completed] Validating pipeline input data



  0%|          | 0/5 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 5/5 [00:00<00:00, 199.89it/s]
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.1s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1045.44it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 843.25it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1101.16it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


Directory already exists
Output saved


  if distutils.version.LooseVersion(version) < minimum_version:
  other = LooseVersion(other)
  if distutils.version.LooseVersion(version) < minimum_version:
  other = LooseVersion(other)


Output saved
N_obs: 50000, time: 138.72 seconds
Directory already exists
Data is loaded
   [Start] Validating input data
     Number of observations: 100000
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace


  0%|          | 0/1 [00:00<?, ?it/s]


[Start] Preparing geo data

  The following states are included in the data: ['SC']
   ... on state: SC

   Data is loaded
   [Start] Processing geo data
      ...address cleaning



  0%|          | 0/100000 [00:00<?, ?it/s][A[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:    0.1s

  1%|▏         | 1256/100000 [00:00<00:07, 12512.28it/s][A[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed:    0.2s

  3%|▎         | 2684/100000 [00:00<00:07, 12978.09it/s][A[Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed:    0.3s

  4%|▎         | 3568/100000 [00:00<00:08, 11362.08it/s][A[Parallel(n_jobs=-1)]: Done 4042 tasks      | elapsed:    0.4s

  4%|▍         | 4315/100000 [00:00<00:10, 9532.97it/s] [A[Parallel(n_jobs=-1)]: Done 4992 tasks      | elapsed:    0.5s

  5%|▌

      ...replicating address
         ...Base
         ...Map street suffixes...
         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=121062)
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=121062)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data
      ...merge user input & lookup table
      ...mapping
   [Completed] Validating input geo data
Directory already exists


  if distutils.version.LooseVersion(version) < minimum_version:
  other = LooseVersion(other)
100%|██████████| 1/1 [03:10<00:00, 190.11s/it]

Output saved
   [Completed] Mapping geo data

[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data





     Number of observations: 100000
     Is key unique: True

   [Completed] Validating ACS input data

['/home/ika/zrp/zrp/prepare/../data/processed/acs/2019/5yr/processed_Zest_ACS_Lookup_20195yr_blockgroup_short.parquet']
   ...loading ACS lookup tables


  if distutils.version.LooseVersion(version) < minimum_version:
  other = LooseVersion(other)


GEOID         object
GEO_NAME      object
EXT_GEOID     object
B01003_001    uint16
B02001_001    uint16
               ...  
B99021_003     int32
B99162_003     int16
B99162_004     int16
B99162_005     int16
B99162_007     int16
Length: 91, dtype: object
   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

predict start
   [Start] Validating pipeline input data
     Number of observations: 367922
     Is key unique: False
   [Completed] Validating pipeline input data



  0%|          | 0/9 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 9/9 [00:00<00:00, 122.38it/s]
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:    0.1s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1088.30it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/2 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 2/2 [00:00<00:00, 2151.48it/s]
[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:    0.0s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1123.27it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


Directory already exists
Output saved


  if distutils.version.LooseVersion(version) < minimum_version:
  other = LooseVersion(other)
  if distutils.version.LooseVersion(version) < minimum_version:
  other = LooseVersion(other)


Output saved
N_obs: 100000, time: 258.68 seconds
Directory already exists
Data is loaded
   [Start] Validating input data
     Number of observations: 500000
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace

[Start] Preparing geo data


  0%|          | 0/1 [00:00<?, ?it/s]


  The following states are included in the data: ['SC']
   ... on state: SC

   Data is loaded
   [Start] Processing geo data
      ...address cleaning



  0%|          | 0/500000 [00:00<?, ?it/s][A[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:    0.1s

  0%|          | 1256/500000 [00:00<00:39, 12544.97it/s][A[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed:    0.2s

  1%|          | 2688/500000 [00:00<00:38, 13018.69it/s][A[Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done 4042 tasks      | elapsed:    0.3s

  1%|          | 4128/500000 [00:00<00:37, 13400.80it/s][A[Parallel(n_jobs=-1)]: Done 4992 tasks      | elapsed:    0.4s

  1%|          | 5044/500000 [00:00<00:47, 10447.70it/s][A
  1%| 

      ...replicating address
         ...Base
         ...Map street suffixes...
         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=605270)
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=605270)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data
      ...merge user input & lookup table
      ...mapping
   [Completed] Validating input geo data
Directory already exists


  if distutils.version.LooseVersion(version) < minimum_version:
  other = LooseVersion(other)
100%|██████████| 1/1 [16:02<00:00, 962.22s/it]

Output saved
   [Completed] Mapping geo data






[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data
     Number of observations: 500000
     Is key unique: True

   [Completed] Validating ACS input data

['/home/ika/zrp/zrp/prepare/../data/processed/acs/2019/5yr/processed_Zest_ACS_Lookup_20195yr_blockgroup_short.parquet']
   ...loading ACS lookup tables


  if distutils.version.LooseVersion(version) < minimum_version:
  other = LooseVersion(other)


GEOID         object
GEO_NAME      object
EXT_GEOID     object
B01003_001    uint16
B02001_001    uint16
               ...  
B99021_003     int32
B99162_003     int16
B99162_004     int16
B99162_005     int16
B99162_007     int16
Length: 91, dtype: object
   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

predict start
   [Start] Validating pipeline input data
     Number of observations: 1843097
     Is key unique: False
   [Completed] Validating pipeline input data



  0%|          | 0/43 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 43/43 [00:00<00:00, 88.03it/s] 
[Parallel(n_jobs=-1)]: Done  43 out of  43 | elapsed:    0.5s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 967.99it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished
  0%|          | 0/8 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 8/8 [00:00<00:00, 116.34it/s]
[Parallel(n_jobs=-1)]: Done   8 out of   8 | elapsed:    0.1s finished
  0%|          | 0/1 [00:00<?, ?it/s][Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
100%|██████████| 1/1 [00:00<00:00, 1172.90it/s]
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.0s finished


Directory already exists


  if distutils.version.LooseVersion(version) < minimum_version:
  other = LooseVersion(other)


Output saved


  if distutils.version.LooseVersion(version) < minimum_version:
  other = LooseVersion(other)


Output saved
N_obs: 500000, time: 1258.46 seconds


In [33]:
times

{5: 11.69,
 10: 11.51,
 100: 12.76,
 500: 18.1,
 1000: 19.73,
 5000: 27.65,
 10000: 39.49,
 50000: 138.72,
 100000: 258.68,
 500000: 1258.46}
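
To make the scaling in `times` easier to read, one option is to turn the dictionary into a small table with per-record cost and throughput. This is a minimal sketch that copies the values printed above; the column names (`n_obs`, `seconds`, `ms_per_record`, `records_per_second`) are illustrative choices, not part of the zrp package.

```python
import pandas as pd

# Timings copied from the `times` output above (N_obs -> seconds).
times = {5: 11.69, 10: 11.51, 100: 12.76, 500: 18.1, 1000: 19.73,
         5000: 27.65, 10000: 39.49, 50000: 138.72, 100000: 258.68,
         500000: 1258.46}

# One row per sample size; column names are illustrative.
timing = pd.DataFrame({"n_obs": list(times.keys()),
                       "seconds": list(times.values())})

# Average cost per record (milliseconds) and throughput (records per second).
timing["ms_per_record"] = 1000 * timing["seconds"] / timing["n_obs"]
timing["records_per_second"] = timing["n_obs"] / timing["seconds"]

print(timing.round(2))
```

In these runs the smallest samples take roughly 11 to 12 seconds regardless of size, while from 50,000 records upward the cost settles at about 2.5 to 2.8 ms per record.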

### Inspect the output and join

In [16]:
zrp_output

Unnamed: 0,first_name,middle_name,last_name,house_number,street_address,city,state,zip_code,ZEST_KEY,AAPI,AIAN,BLACK,HISPANIC,WHITE,race_proxy,source_block_group,source_zip_code
0,SAMANTHA,,FERGUSON,121,GRAYS MARKET RD,EARLY BRANCH,SC,29916,SC_471127307,0.017308,0.020048,0.154378,0.049648,0.758618,WHITE,0.0,1.0
1,HUOI,,A,632,SANDBAR PT,CLOVER,SC,29710,SC_471495249,0.930123,0.001428,0.006116,0.030818,0.031515,AAPI,0.0,1.0
2,GAIL,,ABRUNZO,605,KERSHAW ST,CHERAW,SC,29520,SC_471076575,0.002481,0.000866,0.019643,0.011117,0.965894,WHITE,1.0,0.0
3,JOHN,,AHERN,3401,DUNCAN ST,COLUMBIA,SC,29205,SC_406714510,0.024794,0.000483,0.008057,0.017206,0.94946,WHITE,1.0,0.0
4,PARIS,,ASANI,26,PECAN CIR,YORK,SC,29745,SC_471105450,0.001656,0.000606,0.916169,0.021014,0.060555,BLACK,1.0,0.0


### Check the records most likely to be Hispanic

In [14]:
zrp_output.nlargest(10, "HISPANIC")

Unnamed: 0,first_name,middle_name,last_name,house_number,street_address,city,state,zip_code,ZEST_KEY,AAPI,AIAN,BLACK,HISPANIC,WHITE,race_proxy,source_block_group,source_census_tract,source_zip_code,source_bisg
377,Hector,,Lora,330.0,Passaic Street,Passaic,NJ,07055-5815,377,0.000113,0.000222,0.001535,0.984599,0.01353,HISPANIC,1.0,0.0,0.0,0.0
286,Marcial,,Mojena,249.0,,Columbus,NJ,08022,286,0.00102,0.000309,0.001682,0.965945,0.031043,HISPANIC,0.0,0.0,1.0,0.0
536,Gabriel,,Rodriguez,428.0,th Street,West New York,NJ,07093-2222,536,0.001721,0.005429,0.002667,0.958142,0.032041,HISPANIC,0.0,0.0,1.0,0.0
388,Helmin,,Caba,,,Perth Amboy,NJ,08861,388,0.021552,0.000426,0.00847,0.953645,0.015907,HISPANIC,0.0,0.0,1.0,0.0
236,Alberto,,Santos,402.0,Kearny Avenue,Kearny,NJ,07032,236,0.019993,0.000571,0.003235,0.951424,0.024777,HISPANIC,1.0,0.0,0.0,0.0
543,Ray,,Arroyo,101.0,Washington Avenue,Westwood,NJ,07675,543,0.004734,0.006675,0.004117,0.940691,0.043783,HISPANIC,1.0,0.0,0.0,0.0
499,Manuel,,Figueiredo,,,Union,NJ,07083-3597,499,0.000594,0.000619,0.001788,0.916256,0.080742,HISPANIC,0.0,0.0,1.0,0.0
378,Andre,,Sayegh,125.0,st Floor,Paterson,NJ,07505-1414,378,0.075077,0.000379,0.005269,0.889816,0.029458,HISPANIC,0.0,0.0,1.0,0.0
418,Ramopn,,Hache,131.0,North Maple Avenue,Ridgewood,NJ,07450-3236,418,0.001468,0.000549,0.001547,0.84619,0.150247,HISPANIC,0.0,0.0,1.0,0.0
556,Carlos,,Rendo,188.0,Pascack Road,Woodcliff Lake,NJ,07677-7921,556,0.000315,0.000395,0.0013,0.837821,0.160169,HISPANIC,0.0,0.0,1.0,0.0


### Check the records most likely to be Black

In [15]:
zrp_output.nlargest(10, "BLACK")

Unnamed: 0,first_name,middle_name,last_name,house_number,street_address,city,state,zip_code,ZEST_KEY,AAPI,AIAN,BLACK,HISPANIC,WHITE,race_proxy,source_block_group,source_census_tract,source_zip_code,source_bisg
215,Dahlia,,Vertreese,,,Hillside,NJ,07205,215,0.000495,0.000783,0.973164,0.009093,0.016466,BLACK,0.0,0.0,1.0,0.0
229,Anthony,,Vauss,,,Irvington,NJ,07111-2412,229,0.002764,0.001271,0.970029,0.016721,0.009215,BLACK,0.0,0.0,1.0,0.0
549,Tiffani,,Worthy,1.0,Salem Road,Willingboro,NJ,08046,549,0.00084,0.000422,0.963874,0.008897,0.025967,BLACK,1.0,0.0,0.0,0.0
343,Ras,,Baraka,920.0,Broad Street,Newark,NJ,07102,343,0.000688,8.9e-05,0.961769,0.003171,0.034282,BLACK,1.0,0.0,0.0,0.0
370,Dwayne,,Warren,29.0,North Day Street,Orange,NJ,07050,370,0.002516,0.01233,0.96043,0.005826,0.018898,BLACK,0.0,0.0,1.0,0.0
397,Adrian,,Mapp,515.0,Watchung Avenue,Plainfield,NJ,07060-1720,397,0.004099,0.008725,0.944972,0.020197,0.022007,BLACK,1.0,0.0,0.0,0.0
258,Derek,,Armstead,301.0,North Wood Avenue,Linden,NJ,07036-4296,258,0.01619,0.009768,0.941253,0.002946,0.029844,BLACK,0.0,0.0,1.0,0.0
250,Mary,,Wardlow,4.0,East Douglas Avenue,Lawnside,NJ,08045-1597,250,0.000452,0.000842,0.925571,0.019787,0.053347,BLACK,0.0,0.0,1.0,0.0
78,Jamila,,Odom-Bremmer,201.0,Grant Avenue,Chesilhurst,NJ,08089,78,0.013521,0.016294,0.919407,0.018952,0.031827,BLACK,1.0,0.0,0.0,0.0
433,Donald,,Shaw,210.0,Chestnut Street,Roselle,NJ,07203-1218,433,0.004979,0.011493,0.849289,0.016776,0.117463,BLACK,1.0,0.0,0.0,0.0


BISG proxies are saved by default when `ZRP` is run. Below we load the saved BISG proxies.

In [16]:
bisg_output = pd.read_feather("artifacts/bisg_proxy_output.feather")

In [17]:
bisg_output.head()

Unnamed: 0,ZEST_KEY,AAPI,AIAN,BLACK,HISPANIC,WHITE,race_proxy,source_bisg
0,56,0.038829,0.000429,0.000807,0.006858,0.937739,WHITE,1
1,202,0.010862,0.000296,0.001673,0.239643,0.742021,WHITE,1
2,204,0.00905,0.000132,0.00033,0.028292,0.956646,WHITE,1
3,224,0.001166,0.014795,0.085358,0.042606,0.836503,WHITE,1
4,248,0.007348,0.000625,0.029071,0.082111,0.861174,WHITE,1
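
Because both outputs carry `ZEST_KEY`, the ZRP and BISG proxies can be compared side by side with a merge. The sketch below uses two tiny made-up frames in place of `zrp_output` and `bisg_output` (the keys and labels are illustrative, not taken from the run above) and assumes `ZEST_KEY` is unique in each frame.

```python
import pandas as pd

# Toy stand-ins for zrp_output and bisg_output; values are made up.
zrp_toy = pd.DataFrame({"ZEST_KEY": ["a", "b", "c"],
                        "race_proxy": ["WHITE", "BLACK", "HISPANIC"]})
bisg_toy = pd.DataFrame({"ZEST_KEY": ["a", "b", "c"],
                         "race_proxy": ["WHITE", "WHITE", None]})

# Left-join on the key; validate= raises if a key is duplicated on either side.
compared = zrp_toy.merge(bisg_toy, on="ZEST_KEY", how="left",
                         suffixes=("_zrp", "_bisg"), validate="one_to_one")

# Agreement is only meaningful where both methods returned a proxy.
both = compared.dropna(subset=["race_proxy_zrp", "race_proxy_bisg"])
agreement = (both["race_proxy_zrp"] == both["race_proxy_bisg"]).mean()

print(compared)
print(f"Agreement where both return a proxy: {agreement:.2f}")
```

If the key columns have different dtypes in the two frames (for example strings in one and integers in the other), cast them to a common type before merging.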


How many proxies does BISG return?

In [18]:
f"Out of {bisg_output.shape[0]} records only {bisg_output[bisg_output.race_proxy.notna()].shape[0]} proxies are returned"  

'Out of 565 records only 438 proxies are returned'

How many proxies does ZRP return?

In [19]:
f"Out of {zrp_output.shape[0]} records {zrp_output[zrp_output.race_proxy.notna()].shape[0]} proxies are returned"  


'Out of 565 records 551 proxies are returned'
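
The two counts above can be wrapped in one helper so that coverage is reported the same way for ZRP and BISG. This is a minimal sketch, assuming only that the frame has a `race_proxy` column; `proxy_coverage` is a hypothetical helper name and the toy frame is illustrative.

```python
from typing import Tuple

import pandas as pd


def proxy_coverage(df: pd.DataFrame, col: str = "race_proxy") -> Tuple[int, int, float]:
    """Return (records, proxies returned, share of records with a proxy)."""
    total = len(df)
    returned = int(df[col].notna().sum())
    share = returned / total if total else float("nan")
    return total, returned, share


# Toy frame with one missing proxy; real use would pass zrp_output or bisg_output.
toy = pd.DataFrame({"race_proxy": ["WHITE", None, "BLACK", "AAPI"]})
total, returned, share = proxy_coverage(toy)
print(f"Out of {total} records {returned} proxies are returned ({share:.1%})")
```

For the counts printed above, this corresponds to 438 of 565 records (about 77.5%) for BISG and 551 of 565 (about 97.5%) for ZRP.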