# How to Perform Basic Processing 
This notebook illustrates how to use `ProcessStrings`, a class that processes user input data.

In [1]:
%load_ext autoreload
%autoreload 2
%config Completer.use_jedi=False

In [2]:
from os.path import join, expanduser, dirname
import pandas as pd
import sys
import os
import re
import warnings

In [7]:
warnings.filterwarnings(action='ignore')
home = expanduser('~')

src_path = '{}/zrp'.format(home)
sys.path.append(src_path)

In [8]:
from zrp.prepare.prepare import ProcessStrings
from zrp.prepare.utils import load_file

## Load sample data for prediction
Load a processed list of New Jersey mayors, downloaded from https://www.nj.gov/dca/home/2022mayors.csv

In [9]:
nj_mayors = load_file("../2022-nj-mayors-sample.csv")
nj_mayors.shape

(462, 9)
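As a side note on loading: the sample contains ZIP codes with leading zeros (for example `08848`). The sketch below is not how `load_file` works internally, which this notebook does not show; it only illustrates, under the assumption that you read a CSV with pandas directly, that passing `dtype=str` keeps such values as text. The inline CSV is a made-up two-row excerpt shaped like the sample.

```python
import io
import pandas as pd

# Hypothetical two-row excerpt with the same kind of columns as the sample file.
csv_text = (
    "first_name,last_name,zip_code,ZEST_KEY\n"
    "Gabe,Plumer,08848,2\n"
    "Ari,Bernstein,07401,4\n"
)

# dtype=str keeps every column as text, so ZIP codes keep their leading zeros.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)
print(df.shape)
print(df["zip_code"].tolist())
```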

In [10]:
nj_mayors

,first_name,middle_name,last_name,house_number,street_address,city,state,zip_code,ZEST_KEY
0,Gabe,,Plumer,782,Frenchtown Road,Milford,NJ,08848,2
1,Ari,,Bernstein,500,West Crescent Avenue,Allendale,NJ,07401,4
2,David,J.,Mclaughlin,125,Corlies Avenue,Allenhurst,NJ,07711-1049,5
3,Thomas,C.,Fritts,8,North Main Street,Allentown,NJ,08501-1607,6
4,P.,,McCkelvey,49,South Greenwich Street,Alloway,NJ,08001-0425,7
...,...,...,...,...,...,...,...,...,...
457,William,,Degroff,3943,Route,Chatsworth,NJ,08019,558
458,Joseph,,Chukwueke,200,Cooper Avenue,Woodlynne,NJ,08107-2108,559
459,Paul,,Sarlo,85,Humboldt Street,Wood-Ridge,NJ,07075-2344,560
460,Craig,,Frederick,120,Village Green Drive,Woolwich Township,NJ,08085-3180,562


## ZRP Preprocessing
To process the data quickly, we will use `ProcessStrings`. Other preprocessing classes target specific kinds of data: `ProcessGeo` for data we intend to geocode, `ProcessACS` for ACS data, and `ProcessGLookup` for geographic lookup tables. The implementation is similar for each processing class.

Input to the prediction/modeling pipeline is tabular data with the following columns: first name, middle name, last name, house number, street address (street name), city, state, zip code, and ZEST key. The `ZEST_KEY` must be specified to establish correspondence between inputs and outputs; it is effectively used as an index for the data table.
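Before calling the processor, it can help to confirm that a table has the expected columns and a unique key. The following is a minimal sketch in plain pandas, not ZRP's own validation; `check_input` and `REQUIRED_COLUMNS` are hypothetical names, and the column list is simply copied from the sample table in this notebook.

```python
import pandas as pd

# Column names copied from the sample table above (an assumption, not a ZRP constant).
REQUIRED_COLUMNS = [
    "first_name", "middle_name", "last_name", "house_number",
    "street_address", "city", "state", "zip_code", "ZEST_KEY",
]


def check_input(df: pd.DataFrame, key: str = "ZEST_KEY") -> dict:
    """Hypothetical helper: report missing columns and whether the key is unique."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    key_unique = key in df.columns and bool(df[key].is_unique)
    return {"n_obs": len(df), "missing_columns": missing, "key_unique": key_unique}


# Two rows taken from the preview of the sample data.
sample = pd.DataFrame({
    "first_name": ["Gabe", "Ari"],
    "middle_name": [None, None],
    "last_name": ["Plumer", "Bernstein"],
    "house_number": ["782", "500"],
    "street_address": ["Frenchtown Road", "West Crescent Avenue"],
    "city": ["Milford", "Allendale"],
    "state": ["NJ", "NJ"],
    "zip_code": ["08848", "07401"],
    "ZEST_KEY": ["2", "4"],
})

report = check_input(sample)
print(report)
```

A non-empty `missing_columns` list or `key_unique` equal to `False` suggests the table should be fixed before it is passed to `fit` and `transform`.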


In [11]:
%%time
preprocess = ProcessStrings()
preprocess.fit(nj_mayors)
zrp_output = preprocess.transform(nj_mayors)

Directory already exists
   [Start] Validating input data
     Number of observations: 462
     Is key unique: True
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace
CPU times: user 84.4 ms, sys: 905 µs, total: 85.3 ms
Wall time: 85.4 ms


### Inspect the output
- Preview the data

In [12]:
zrp_output.shape

(462, 8)

In [13]:
zrp_output.head()

ZEST_KEY,first_name,middle_name,last_name,house_number,street_address,city,state,zip_code
2,GABE,,PLUMER,782,FRENCHTOWN ROAD,MILFORD,NJ,8848
4,ARI,,BERNSTEIN,500,WEST CRESCENT AVENUE,ALLENDALE,NJ,7401
5,DAVID,J,MCLAUGHLIN,125,CORLIES AVENUE,ALLENHURST,NJ,77111049
6,THOMAS,C,FRITTS,8,NORTH MAIN STREET,ALLENTOWN,NJ,85011607
7,P,,MCCKELVEY,49,SOUTH GREENWICH STREET,ALLOWAY,NJ,80010425
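The preview shows names and addresses upper-cased, periods removed from initials, and `ZEST_KEY` used as the index. The sketch below approximates those visible effects in plain pandas. It is not the `ProcessStrings` implementation, and `normalize_text` is a hypothetical helper; the real class may treat hyphens, ZIP codes, and other edge cases differently.

```python
import re
import pandas as pd


def normalize_text(value):
    """Hypothetical cleaner: uppercase, drop punctuation, collapse whitespace."""
    if pd.isna(value):
        return value  # leave missing values untouched
    text = str(value).upper()
    text = re.sub(r"[^A-Z0-9\s]", "", text)   # remove punctuation such as "."
    return re.sub(r"\s+", " ", text).strip()  # reduce runs of whitespace


# Two rows based on the sample; the double space is added to show whitespace reduction.
raw = pd.DataFrame({
    "first_name": ["David", "P."],
    "middle_name": ["J.", None],
    "last_name": ["Mclaughlin", "McCkelvey"],
    "street_address": ["Corlies  Avenue", "South Greenwich Street"],
    "ZEST_KEY": ["5", "7"],
})

text_cols = ["first_name", "middle_name", "last_name", "street_address"]
cleaned = raw.copy()
for col in text_cols:
    cleaned[col] = cleaned[col].map(normalize_text)

# Use the key as the index, as in the processed output.
cleaned = cleaned.set_index("ZEST_KEY")
print(cleaned)
```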
