# How to Generate BISG Proxies
This notebook illustrates how to use `BISGWrapper`, a wrapper class for BISG (Bayesian Improved Surname Geocoding), to generate race & ethnicity predictions.

In [1]:
%load_ext autoreload
%autoreload 2
%config Completer.use_jedi=False

In [2]:
from os.path import join, expanduser, dirname
import pandas as pd
import sys
import os
import re
import warnings

In [3]:
warnings.filterwarnings(action='ignore')
home = expanduser('~')

src_path = '{}/zrp'.format(home)
sys.path.append(src_path)

In [12]:
from zrp.modeling.predict import BISGWrapper
from zrp.prepare import ProcessStrings
from zrp.prepare.utils import *

## Load sample data for prediction
Load a processed list of New Jersey mayors, downloaded from https://www.nj.gov/dca/home/2022mayors.csv

In [5]:
nj_mayors = load_file("../2022-nj-mayors-sample.csv")
nj_mayors.shape

(462, 9)

In [6]:
nj_mayors

Unnamed: 0,first_name,middle_name,last_name,house_number,street_address,city,state,zip_code,ZEST_KEY
0,Gabe,,Plumer,782,Frenchtown Road,Milford,NJ,08848,2
1,Ari,,Bernstein,500,West Crescent Avenue,Allendale,NJ,07401,4
2,David,J.,Mclaughlin,125,Corlies Avenue,Allenhurst,NJ,07711-1049,5
3,Thomas,C.,Fritts,8,North Main Street,Allentown,NJ,08501-1607,6
4,P.,,McCkelvey,49,South Greenwich Street,Alloway,NJ,08001-0425,7
...,...,...,...,...,...,...,...,...,...
457,William,,Degroff,3943,Route,Chatsworth,NJ,08019,558
458,Joseph,,Chukwueke,200,Cooper Avenue,Woodlynne,NJ,08107-2108,559
459,Paul,,Sarlo,85,Humboldt Street,Wood-Ridge,NJ,07075-2344,560
460,Craig,,Frederick,120,Village Green Drive,Woolwich Township,NJ,08085-3180,562


#### Data Processing (optional)  
We normally generate BISG proxies from the data output by ZRP_Prepare. Because `BISGWrapper` does not require that output, we instead use our most basic processing module, `ProcessStrings`, to clean the data before making BISG predictions.

In [7]:
%%time
process = ProcessStrings()
process.fit(nj_mayors)
prepared = process.transform(nj_mayors)

   [Start] Validating input data
     Number of observations: 462
     Is key unique: True
Directory already exists
   [Completed] Validating input data

   Formatting P1
   Formatting P2
   reduce whitespace
CPU times: user 76.8 ms, sys: 4.91 ms, total: 81.7 ms
Wall time: 82.2 ms


*Any warning message shown by this step can be ignored, since middle names are commonly missing.

### Invoke the BISGWrapper on the sample data
To generate BISG proxies, your dataframe must contain the `last_name` and `zip_code` columns.
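
Before fitting, it can help to confirm that both columns are present. The snippet below is a minimal sketch of such a check using plain pandas. The helper `missing_required_columns` and the toy frame are hypothetical and not part of `zrp`; only the two column names come from the requirement above.

```python
import pandas as pd

# Column names taken from the requirement stated above.
REQUIRED_COLUMNS = ("last_name", "zip_code")


def missing_required_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names absent from df (hypothetical helper, not in zrp)."""
    return [col for col in required if col not in df.columns]


# Toy frame standing in for `prepared`; it deliberately lacks zip_code.
toy = pd.DataFrame({"first_name": ["Ari"], "last_name": ["Bernstein"]})

missing = missing_required_columns(toy)
if missing:
    print(f"Missing required columns: {missing}")
```

In the notebook, the same helper could be called on `prepared` before `bisg.fit(prepared)`.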

In [8]:
bisg = BISGWrapper()
bisg.fit(prepared)
bisg_output = bisg.transform(prepared)

### Inspect the output


In [9]:
bisg_output

ZEST_KEY,AAPI,AIAN,BLACK,HISPANIC,WHITE,race_proxy,source_bisg
2,0.000478,0.000463,0.006154,0.001932,0.990258,WHITE,1
4,0.011623,0.000000,0.000197,0.004692,0.977975,WHITE,1
5,,,,,,,1
6,,,,,,,1
7,,,,,,,1
...,...,...,...,...,...,...,...
558,0.000386,0.000000,0.003039,0.004521,0.989895,WHITE,1
559,,,,,,,1
560,,,,,,,1
562,,,,,,,1


### Check Coverage
A quick glance at the BISG output shows a high rate of missing proxies. BISG has some limitations that contribute to missing values:
- Last names must appear in the Census Surname List (ref: [https://www.census.gov/topics/population/genealogy/data/2010_surnames.html](https://www.census.gov/topics/population/genealogy/data/2010_surnames.html))
    - An estimated 162K of the 6.3MM last names from the 2010 Census are on the list
- ZIP codes must be valid for BISG
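
In the rows displayed in this notebook, the records that received a proxy (keys 2, 4, and 558) have five-digit ZIP codes, while the displayed records with ZIP+4 codes did not. This may be coincidental, but it suggests one thing to try. The sketch below assumes that BISG expects five-digit ZIP codes, which this notebook does not confirm, and trims ZIP+4 values accordingly. `to_five_digit_zip` is a hypothetical helper, not part of `zrp`.

```python
import pandas as pd


def to_five_digit_zip(zip_codes):
    """Keep the leading five digits of each ZIP or ZIP+4 code.

    Values that are not a five-digit ZIP or a ZIP+4 become missing.
    Hypothetical helper, not part of zrp.
    """
    as_text = zip_codes.astype("string").str.strip()
    return as_text.str.extract(r"^(\d{5})(?:-\d{4})?$", expand=False)


# Toy values: a plain ZIP, two ZIP+4 codes, a ZIP that lost its leading zero, and a missing value.
sample = pd.Series(["08848", "07711-1049", " 08501-1607 ", "8848", None])
normalized = to_five_digit_zip(sample)
print(normalized.tolist())
```

Whether this improves coverage depends on how `BISGWrapper` matches ZIP codes, so compare `bisg_output.isna().mean()` before and after applying it.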

In [10]:
bisg_output.isna().mean()

AAPI           0.58658
AIAN           0.58658
BLACK          0.58658
HISPANIC       0.58658
WHITE          0.58658
race_proxy     0.58658
source_bisg    0.00000
dtype: float64

Check the distribution of predicted race & ethnicity:

In [11]:
bisg_output.race_proxy.value_counts(normalize=True, dropna=False)

NaN         0.586580
WHITE       0.376623
BLACK       0.017316
HISPANIC    0.012987
AAPI        0.006494
Name: race_proxy, dtype: float64
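
Because `dropna=False` counts missing proxies, the shares above are fractions of all rows. The sketch below shows one way to report coverage and the distribution among covered rows separately. It uses a small toy frame with made-up values rather than the real `bisg_output`, so the numbers are purely illustrative.

```python
import pandas as pd

# Toy stand-in for bisg_output; the values are invented for illustration.
toy_output = pd.DataFrame(
    {"race_proxy": ["WHITE", None, "WHITE", "BLACK", None, "WHITE", None, "HISPANIC"]}
)

# Share of rows that received a proxy.
coverage = toy_output["race_proxy"].notna().mean()

# Distribution among covered rows only (value_counts drops missing values by default).
covered_distribution = toy_output["race_proxy"].value_counts(normalize=True)

print(f"coverage: {coverage:.3f}")
print(covered_distribution)
```

Applied to the real output, the same two lines would use `bisg_output.race_proxy` in place of the toy column.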