# How to Prepare Data for ZRP Predictions
This notebook illustrates how to use `ZRP_Prepare`, a module that prepares user input data for generating predictions, building models, and running analysis.

In [1]:
%load_ext autoreload
%autoreload 2
%config Completer.use_jedi=False

In [2]:
from os.path import join, expanduser, dirname
import pandas as pd
import sys
import os
import re
import warnings

In [3]:
warnings.filterwarnings(action='ignore')
home = expanduser('~')

src_path = '{}/zrp'.format(home)
sys.path.append(src_path)

In [4]:
from zrp.prepare.prepare import ZRP_Prepare
from zrp.prepare.utils import load_file

## Load sample data for prediction
Load the processed list of New Jersey mayors, downloaded from https://www.nj.gov/dca/home/2022mayors.csv

In [5]:
nj_mayors = load_file("../2022-nj-mayors-sample.csv")
nj_mayors.shape

(462, 9)

In [6]:
nj_mayors

Unnamed: 0,first_name,middle_name,last_name,house_number,street_address,city,state,zip_code,ZEST_KEY
0,Gabe,,Plumer,782,Frenchtown Road,Milford,NJ,08848,2
1,Ari,,Bernstein,500,West Crescent Avenue,Allendale,NJ,07401,4
2,David,J.,Mclaughlin,125,Corlies Avenue,Allenhurst,NJ,07711-1049,5
3,Thomas,C.,Fritts,8,North Main Street,Allentown,NJ,08501-1607,6
4,P.,,McCkelvey,49,South Greenwich Street,Alloway,NJ,08001-0425,7
...,...,...,...,...,...,...,...,...,...
457,William,,Degroff,3943,Route,Chatsworth,NJ,08019,558
458,Joseph,,Chukwueke,200,Cooper Avenue,Woodlynne,NJ,08107-2108,559
459,Paul,,Sarlo,85,Humboldt Street,Wood-Ridge,NJ,07075-2344,560
460,Craig,,Frederick,120,Village Green Drive,Woolwich Township,NJ,08085-3180,562


## ZRP Prepare
To prepare the data, we will use `ZRP_Prepare`.

Input to the prediction/modeling pipeline is tabular data with the following columns: first name, middle name, last name, house number, street address (street name), city, state, zip code, and ZEST key. The `ZEST_KEY` must be specified to establish correspondence between inputs and outputs; it is effectively used as the index of the data table.
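Before calling `ZRP_Prepare`, it can help to confirm that a table has the expected columns and a unique key. The following is a minimal sketch, not part of `zrp`: `check_input` and `REQUIRED_COLUMNS` are hypothetical names, and the column names are assumed from the sample table shown earlier in this notebook.

```python
import pandas as pd

# Column names assumed from the sample table in this notebook.
REQUIRED_COLUMNS = [
    "first_name", "middle_name", "last_name", "house_number",
    "street_address", "city", "state", "zip_code", "ZEST_KEY",
]


def check_input(df: pd.DataFrame, key: str = "ZEST_KEY") -> dict:
    """Hypothetical helper (not a zrp function): report missing columns
    and whether the key column uniquely identifies each row."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    key_unique = key in df.columns and bool(df[key].is_unique)
    return {
        "n_obs": len(df),
        "missing_columns": missing,
        "key_unique": key_unique,
    }


# Two rows copied from the sample data above.
sample = pd.DataFrame({
    "first_name": ["Gabe", "Ari"],
    "middle_name": [None, None],
    "last_name": ["Plumer", "Bernstein"],
    "house_number": ["782", "500"],
    "street_address": ["Frenchtown Road", "West Crescent Avenue"],
    "city": ["Milford", "Allendale"],
    "state": ["NJ", "NJ"],
    "zip_code": ["08848", "07401"],
    "ZEST_KEY": ["2", "4"],
})

report = check_input(sample)
print(report)
```

Zip codes and keys are kept as strings here so that leading zeros such as `08848` are preserved.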

`ZRP_Prepare` processes this input data into the feature vectors required for prediction. When called, `.transform()` geocodes the data (converting addresses to block groups or census tracts) and then matches the geocoded data against American Community Survey (ACS) lookup tables. This links each input record to demographic data for the individual's geography, so the returned table contains the original input columns plus the ACS features used for prediction.
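The matching step can be pictured as a lookup that falls back from finer to coarser geography. The sketch below is only an illustration with toy data, not the `zrp` implementation: the function `match_with_fallback`, the column names `block_group_geoid`, `tract_geoid`, and `acs_source`, and all values are assumptions. The order (block group, census tract, zip code, no match) is assumed from the stage names printed in the log output of `.transform()`.

```python
import pandas as pd

# Toy geocoded input (made-up GEOIDs and keys).
geocoded = pd.DataFrame({
    "ZEST_KEY": ["1", "2", "3", "4"],
    "block_group_geoid": ["340030010001", "340030020002", None, None],
    "tract_geoid": ["34003001000", "34003002000", None, None],
    "zip_code": ["07401", "07711", "08501", "99999"],
})

# Toy lookup tables, one per geographic level, each with one ACS column.
lookups = {
    "block_group": ("block_group_geoid", pd.DataFrame(
        {"block_group_geoid": ["340030010001"], "B01003_001": [589]})),
    "census_tract": ("tract_geoid", pd.DataFrame(
        {"tract_geoid": ["34003002000"], "B01003_001": [1266]})),
    "zip_code": ("zip_code", pd.DataFrame(
        {"zip_code": ["08501"], "B01003_001": [1722]})),
}


def match_with_fallback(df: pd.DataFrame, lookups: dict) -> pd.DataFrame:
    """Hypothetical sketch: match each record at the finest level
    available, then retry unmatched records at the next coarser level."""
    remaining = df.copy()
    matched = []
    for level, (key_col, table) in lookups.items():
        hit = remaining.merge(table, on=key_col, how="inner")
        hit["acs_source"] = level
        matched.append(hit)
        # Keep only the records that have not been matched yet.
        remaining = remaining[~remaining["ZEST_KEY"].isin(hit["ZEST_KEY"])]
    remaining = remaining.assign(acs_source="no_match")
    return pd.concat(matched + [remaining], ignore_index=True)


result = match_with_fallback(geocoded, lookups)
print(result[["ZEST_KEY", "acs_source", "B01003_001"]])
```

Records that never match keep their input columns and receive missing values for the ACS columns, so no input row is dropped in this sketch.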

In [7]:
%%time
prepare = ZRP_Prepare()
prepare.fit(nj_mayors)
zrp_output = prepare.transform(nj_mayors)

  0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/462 [00:00<?, ?it/s]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 15 concurrent workers.
[Parallel(n_jobs=-1)]: Done  20 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 170 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 420 tasks      | elapsed:    0.0s
100%|██████████| 462/462 [00:00<00:00, 15073.22it/s]

Data is loaded
   [Start] Validating input data
     Number of observations: 462
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace

[Start] Preparing geo data

  The following states are included in the data: ['NJ']
   ... on state: NJ

   Data is loaded
   [Start] Processing geo data
      ...address cleaning



[Parallel(n_jobs=-1)]: Done 462 out of 462 | elapsed:    0.0s finished


      ...replicating address
         ...Base
         ...Number processing...
         House number dataframe expansion is complete! (n=462)
         ...Base
         ...Map street suffixes...
         ...Mapped & split by street suffixes...
         ...Number processing...

         Address dataframe expansion is complete! (n=900)
      ...formatting
   [Completed] Processing geo data
   [Start] Mapping geo data
      ...merge user input & lookup table
      ...mapping


100%|██████████| 1/1 [00:04<00:00,  4.95s/it]

   [Completed] Validating input geo data
Directory already exists
...Output saved
   [Completed] Mapping geo data

[Completed] Preparing geo data

[Start] Preparing ACS data
   [Start] Validating ACS input data
     Number of observations: 462
     Is key unique: True

   [Completed] Validating ACS input data

   ...loading ACS lookup tables





   ... combining ACS & user input data
 ...Copy dataframes
 ...Block group
 ...Census tract
 ...Zip code
 ...No match
 ...Merge
 ...Merging complete
[Complete] Preparing ACS data

CPU times: user 25.8 s, sys: 2.35 s, total: 28.2 s
Wall time: 28.2 s


### Inspect the output
- Preview the data
- View what artifacts are saved

In [8]:
zrp_output.shape

(1543, 198)

In [9]:
zrp_output.head()

ZEST_KEY,B01003_001,B02001_001,B02001_002,B02001_003,B02001_004,B02001_005,B02001_006,B02001_007,B02001_008,B02001_009,...,house_number_LEFT,house_number_RIGHT,house_numer_numeric,last_name,middle_name,small,state,street_address,zest_in_state_fips,zip_code
10,589,589,534,8,0,8,0,0,39,0,...,,137,,MORGAN,M,,NJ,MAIN STREET,34,7821
100,1266,1266,999,233,0,0,0,0,34,0,...,,770,,TEMPLETON,L,,NJ,COOPERTOWN ROAD,34,8075
106,1722,1722,1447,44,0,108,0,50,73,0,...,,1011,,MEDANY,,,NJ,COOPER STREET,34,8096
107,1071,1071,755,55,0,107,0,137,17,0,...,,37,,BLACKMAN,,,NJ,NORTH SUSSEX STREET,34,7801
108,667,667,578,4,67,3,0,0,15,0,...,,288,,CAMPBELL,,,NJ,MAIN STREET,34,8345
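The output shape above has more rows than the input (1543 versus 462), so it may be worth checking how many rows each `ZEST_KEY` has before using the table downstream. The sketch below uses a toy frame indexed by `ZEST_KEY`; `rows_per_key` is a hypothetical helper, and the same call could be applied to `zrp_output`, assuming it is indexed by `ZEST_KEY` as the preview suggests.

```python
import pandas as pd


def rows_per_key(output: pd.DataFrame, n_input_keys: int) -> dict:
    """Hypothetical helper: summarize how output rows map to input keys."""
    counts = output.index.value_counts()
    return {
        "n_rows": len(output),
        "n_unique_keys": int(counts.size),
        "max_rows_per_key": int(counts.max()),
        "all_inputs_present": int(counts.size) == n_input_keys,
    }


# Toy output: key "10" appears twice, key "100" once.
toy_output = pd.DataFrame(
    {"B01003_001": [589, 589, 1266]},
    index=pd.Index(["10", "10", "100"], name="ZEST_KEY"),
)

summary = rows_per_key(toy_output, n_input_keys=2)
print(summary)
```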


`ZRP_Prepare` generates multiple artifacts that are automatically saved:
- Dataframe with address to GEOID mappings
    - `Zest_Geocoded_test_{year}__{state_fips}.parquet`
- Validation dictionary for input data
    - `input_validator.json`
- Validation dictionary for geographic data
    - `input_geo_validator.json`
- Validation dictionary for American Community Survey data
    - `input_acs_validator.json`
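To see which of these artifacts exist after a run, one option is to search the output directory by filename pattern. This is a sketch under assumptions: `find_artifacts` is a hypothetical helper, the directory where `ZRP_Prepare` writes its artifacts is not stated in this notebook and must be supplied by the user, and the demo below creates empty placeholder files (with a made-up year and state FIPS code) in a temporary directory rather than reading real output.

```python
import tempfile
from pathlib import Path


def find_artifacts(artifact_dir) -> dict:
    """Hypothetical helper: list saved artifacts by filename pattern."""
    root = Path(artifact_dir)
    return {
        "geocoded": sorted(
            p.name for p in root.glob("Zest_Geocoded_test_*__*.parquet")),
        "validators": sorted(
            p.name for p in root.glob("input*_validator.json")),
    }


# Demo with placeholder files; the names follow the patterns listed above.
with tempfile.TemporaryDirectory() as tmp:
    for name in [
        "Zest_Geocoded_test_2019__34.parquet",  # year and FIPS are made up
        "input_validator.json",
        "input_geo_validator.json",
        "input_acs_validator.json",
    ]:
        (Path(tmp) / name).write_text("{}")  # placeholder content only
    found = find_artifacts(tmp)

print(found)
```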
