# North Carolina Voter Registration Analysis

### Requirements
- Download the North Carolina voter registration database, available here: https://www.ncsbe.gov/results-data/voter-registration-data
- Use the BISG implementation available here: https://surgeo.readthedocs.io/en/dev/
- Use the “weighted estimator” described in this paper: https://arxiv.org/pdf/1811.11154

### Task

Write code (preferably in Python) to approximate the racial composition of each political party (DEM, REP, LIB, IND) using the weighted estimator, with the BISG implementation as your proxy predictor. Do this for a county of your choosing. Also choose an appropriate visualization to show the error of your estimates against the true race proportions.

### Some things to keep in mind
- You will need to do a little data processing of the North Carolina voter registration dataset. Make sure the code you write for this is well documented and easy to follow.
- I recommend wrapping the BISG library in a custom class, since we will be implementing many other methods for prediction by proxy. Try writing a “ProxyPredictor” interface that contains an “inference” method.
- Your subclass’s implementation of the “inference” method should take a pandas data frame as input and output a pandas data frame with race predictions.
  Note: this method will not be complicated for this example; it should just connect the functionality of Surgeo (the BISG library) to the codebase you are developing.
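
A minimal sketch of the interface described above, under stated assumptions: the class name `BISGPredictor`, the `model` constructor argument, and the column defaults are illustrative choices, not part of Surgeo. The wrapper relies only on a `get_probabilities(surnames, zctas)` method, which is the call used on `surgeo.SurgeoModel` later in this notebook, and it assumes the returned rows come back in input order.

```python
from abc import ABC, abstractmethod

import pandas as pd

# Probability columns produced by the BISG model used in this notebook.
RACE_COLUMNS = ["white", "black", "api", "native", "multiple", "hispanic"]


class ProxyPredictor(ABC):
    """Common interface for models that predict race from proxy variables."""

    @abstractmethod
    def inference(self, df: pd.DataFrame) -> pd.DataFrame:
        """Return one row of race probabilities per input row."""


class BISGPredictor(ProxyPredictor):
    """Thin wrapper around a surname + geography model (hypothetical helper)."""

    def __init__(self, model=None, surname_col="last_name", zip_col="zip_code"):
        if model is None:
            # Imported lazily so the interface is usable without Surgeo installed.
            import surgeo
            model = surgeo.SurgeoModel()
        self.model = model
        self.surname_col = surname_col
        self.zip_col = zip_col

    def inference(self, df: pd.DataFrame) -> pd.DataFrame:
        probs = self.model.get_probabilities(df[self.surname_col], df[self.zip_col])
        # Assumption: the model returns rows in input order, so the
        # probabilities can be relabelled with the input index.
        out = probs[RACE_COLUMNS].copy()
        out.index = df.index
        return out
```

Passing the model in through the constructor keeps the wrapper testable with a stand-in object and leaves room for other proxy methods to implement the same `inference` signature.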

### Download Dataset:


In [59]:
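import os

# Wake County (county_id 92) extract of the statewide voter file
datapath = "data/ncvoter92.txt"
if not os.path.exists("data"):
    os.makedirs("data")
# Download and unzip only if the file is not already present
# (requires the wget and unzip command-line tools)
if not os.path.isfile(datapath):
    !wget -O data.zip "https://s3.amazonaws.com/dl.ncsbe.gov/data/ncvoter92.zip"
    !unzip data.zip -d data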

zsh:1: command not found: wget
unzip:  cannot find or open data.zip, data.zip.zip or data.zip.ZIP.
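
The errors above come from `wget` not being installed on this machine. As a hedged alternative, the same download-and-unzip step can be sketched with the Python standard library only. The function name `fetch_voter_file` and its defaults are illustrative, not part of the original notebook.

```python
import urllib.request
import zipfile
from pathlib import Path


def fetch_voter_file(url: str, data_dir: str = "data",
                     filename: str = "ncvoter92.txt") -> Path:
    """Download and unzip the archive at `url` unless `filename` already exists."""
    data_dir = Path(data_dir)
    target = data_dir / filename
    if target.is_file():
        # Nothing to do: the extracted file is already on disk.
        return target
    data_dir.mkdir(parents=True, exist_ok=True)
    archive = data_dir / "data.zip"
    urllib.request.urlretrieve(url, archive)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(data_dir)
    return target


# Example call (not executed here because it needs network access):
# fetch_voter_file("https://s3.amazonaws.com/dl.ncsbe.gov/data/ncvoter92.zip")
```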


## Convert into DataFrame

In [170]:
import pandas as pd

# The file is tab-separated and Latin-1 encoded
voter_data = pd.read_csv("../" + datapath, sep="\t", encoding="latin1")
print(list(voter_data.columns))
print(voter_data["last_name"].isna().sum())  # TODO: should rows with a missing last name be removed?
print(voter_data["zip_code"].isna().sum())  # TODO: how does Surgeo handle missing zip codes? Should they be removed?
voter_data.head()

['county_id', 'county_desc', 'voter_reg_num', 'ncid', 'last_name', 'first_name', 'middle_name', 'name_suffix_lbl', 'status_cd', 'voter_status_desc', 'reason_cd', 'voter_status_reason_desc', 'res_street_address', 'res_city_desc', 'state_cd', 'zip_code', 'mail_addr1', 'mail_addr2', 'mail_addr3', 'mail_addr4', 'mail_city', 'mail_state', 'mail_zipcode', 'full_phone_number', 'confidential_ind', 'registr_dt', 'race_code', 'ethnic_code', 'party_cd', 'gender_code', 'birth_year', 'age_at_year_end', 'birth_state', 'drivers_lic', 'precinct_abbrv', 'precinct_desc', 'municipality_abbrv', 'municipality_desc', 'ward_abbrv', 'ward_desc', 'cong_dist_abbrv', 'super_court_abbrv', 'judic_dist_abbrv', 'nc_senate_abbrv', 'nc_house_abbrv', 'county_commiss_abbrv', 'county_commiss_desc', 'township_abbrv', 'township_desc', 'school_dist_abbrv', 'school_dist_desc', 'fire_dist_abbrv', 'fire_dist_desc', 'water_dist_abbrv', 'water_dist_desc', 'sewer_dist_abbrv', 'sewer_dist_desc', 'sanit_dist_abbrv', 'sanit_dist_des

| | county_id | county_desc | voter_reg_num | ncid | last_name | first_name | middle_name | name_suffix_lbl | status_cd | voter_status_desc | ... | sanit_dist_abbrv | sanit_dist_desc | rescue_dist_abbrv | rescue_dist_desc | munic_dist_abbrv | munic_dist_desc | dist_1_abbrv | dist_1_desc | vtd_abbrv | vtd_desc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 92 | WAKE | 100228366 | EH906352 | A | GIM | | | A | ACTIVE | ... | | | | | UNC | UNINCORPORATED | 10.0 | PROSECUTORIAL DISTRICT 10 | 17-01 | 17-01 |
| 1 | 92 | WAKE | 100790131 | EH1299704 | A | HMIT | | | A | ACTIVE | ... | | | | | RAL | RALEIGH | 10.0 | PROSECUTORIAL DISTRICT 10 | 17-01 | 17-01 |
| 2 | 92 | WAKE | 100688481 | EH1232725 | A | MAIH | | | A | ACTIVE | ... | | | | | RAL | RALEIGH | 10.0 | PROSECUTORIAL DISTRICT 10 | 17-01 | 17-01 |
| 3 | 92 | WAKE | 100548507 | EH1133682 | A | MON | | | A | ACTIVE | ... | | | | | RAL | RALEIGH | 10.0 | PROSECUTORIAL DISTRICT 10 | 17-01 | 17-01 |
| 4 | 92 | WAKE | 100302999 | EH962486 | A | RUP | | | A | ACTIVE | ... | | | | | RAL | RALEIGH | 10.0 | PROSECUTORIAL DISTRICT 10 | 17-09 | 17-09 |


## Preprocessing:
- Mapping NC race codes to the BISG race categories
- Converting zip codes to strings
- Removing rows with missing last names (not yet implemented)

In [171]:
import numpy as np

# Inspect the distinct party, ethnicity, and race codes in the file
voter_parties = sorted(voter_data["party_cd"].unique())
print(voter_parties)
print(sorted(voter_data["ethnic_code"].unique()))
print(sorted(voter_data["race_code"].unique()))


# TODO: remove rows with a missing last name


# Map "race_code" and "ethnic_code" onto Surgeo's BISG race categories.
# Hispanic ethnicity takes priority over race.
race_map = {"A": "api",       # Asian
            "B": "black",
            "P": "api",       # Native Hawaiian or Pacific Islander
            "I": "native",    # American Indian or Alaska Native
            "M": "multiple",  # two or more races
            "O": "multiple",  # other
            "U": "multiple",  # undesignated
            "W": "white"}

voter_data["bisg_race"] = voter_data["race_code"].map(race_map)
voter_data.loc[voter_data["ethnic_code"] == "HL", "bisg_race"] = "hispanic"


# Convert zip codes to strings (NAs are converted to 0)
voter_data["zip_code"] = voter_data["zip_code"].fillna(0)
voter_data["zip_code"] = voter_data["zip_code"].astype(int).astype(str)


# voter_data["last_name"] = voter_data["last_name"].fillna("")

['CST', 'DEM', 'GRE', 'JFA', 'LIB', 'NLB', 'REP', 'UNA', 'WTP']
['HL', 'NL', 'UN']
['A', 'B', 'I', 'M', 'O', 'P', 'U', 'W']
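
The preprocessing steps above can also be collected into one documented function. This is a sketch under assumptions: the name `preprocess_voters` and the `has_proxy_inputs` flag column are hypothetical. The race mapping is passed in as an argument rather than restated. Instead of deciding the open TODO about dropping rows, the sketch only flags rows that lack a last name or zip code, so the choice can be made later.

```python
import pandas as pd


def preprocess_voters(df: pd.DataFrame, race_map: dict) -> pd.DataFrame:
    """Return a copy of `df` with BISG-ready race labels and zip codes."""
    out = df.copy()

    # Self-reported race in BISG categories; Hispanic ethnicity overrides race.
    out["bisg_race"] = out["race_code"].map(race_map)
    out.loc[out["ethnic_code"] == "HL", "bisg_race"] = "hispanic"

    # Zero-padded five-character zip codes; missing or unparseable
    # values become "00000".
    zips = pd.to_numeric(out["zip_code"], errors="coerce")
    out["zip_code"] = zips.fillna(0).astype(int).astype(str).str.zfill(5)

    # Flag rows that have both proxy inputs instead of silently dropping the rest.
    out["has_proxy_inputs"] = out["last_name"].notna() & zips.notna()
    return out
```

Working on a copy keeps the raw frame available for checking how many rows each rule affects.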


In [125]:
# import sys, os
# sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
# from predictors.BISGPredictor_class import BISGPredictor

# bisg = BISGPredictor()

# sg_results = bisg.inference(voter_data)


## Getting Surgeo BISG race predictions

In [None]:
import surgeo

# Race probability columns returned by Surgeo
bisg_race = ["white", "black", "api", "native", "multiple", "hispanic"]

# Only the surname + geography model (sg) is used below;
# the other models are not used in this notebook yet.
fsg = surgeo.BIFSGModel()
sg = surgeo.SurgeoModel()
f = surgeo.FirstNameModel()
g = surgeo.GeocodeModel()
s = surgeo.SurnameModel()

first_names = voter_data["first_name"]
surnames = voter_data["last_name"]
zctas = voter_data["zip_code"]

# BISG probabilities from surname and zip code (ZCTA)
sg_results = sg.get_probabilities(surnames, zctas)
sg_results.sort_values(by=["name", "zcta5"])

| | zcta5 | name | white | black | api | native | multiple | hispanic |
|---|---|---|---|---|---|---|---|---|
| 627842 | 00000 | | | | | | | |
| 627847 | 00000 | | | | | | | |
| 627854 | 00000 | | | | | | | |
| 610348 | 27513 | | | | | | | |
| 627846 | 27518 | | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 960161 | 27608 | ZYWICKI | 0.990696 | 0.001092 | 0.000485 | 0.0 | 0.004992 | 0.002735 |
| 960162 | 27608 | ZYWICKI | 0.990696 | 0.001092 | 0.000485 | 0.0 | 0.004992 | 0.002735 |
| 960163 | 27616 | ZYWICKI | 0.908961 | 0.023250 | 0.008171 | 0.0 | 0.024368 | 0.035250 |
| 960165 | 27526 | ZYWIOLEK | | | | | | |


In [191]:
# voter_data.sort_values(by=["last_name", "zip_code"])[["zip_code", "last_name"]]
# Join the BISG probabilities onto the voter rows by index
display_cols = ["last_name", "zip_code", "party_cd", "bisg_race"] + bisg_race
pd.merge(voter_data, sg_results[bisg_race], left_index=True, right_index=True)[display_cols]

| | last_name | zip_code | party_cd | bisg_race | white | black | api | native | multiple | hispanic |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A | 27604 | UNA | multiple | | | | | | |
| 1 | A | 27604 | UNA | multiple | | | | | | |
| 2 | A | 27604 | REP | multiple | | | | | | |
| 3 | A | 27604 | DEM | multiple | | | | | | |
| 4 | A | 27610 | REP | multiple | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 960162 | ZYWICKI | 27608 | DEM | white | 0.990696 | 0.001092 | 0.000485 | 0.0 | 0.004992 | 0.002735 |
| 960163 | ZYWICKI | 27616 | UNA | white | 0.908961 | 0.023250 | 0.008171 | 0.0 | 0.024368 | 0.035250 |
| 960164 | ZYWICKI | 27519 | UNA | white | 0.956283 | 0.002195 | 0.021044 | 0.0 | 0.014954 | 0.005524 |
| 960165 | ZYWIOLEK | 27526 | REP | white | | | | | | |
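
Several rows in the merged output have no probabilities at all, for example short or rare surnames and missing zip codes. Before computing any estimate it may help to measure how many voters in each party actually received a prediction. The helper below is a hypothetical sketch. It assumes, as the index-based merge above does, that the probability frame and the party column share the same index.

```python
import pandas as pd


def proxy_coverage(probs: pd.DataFrame, groups: pd.Series) -> pd.Series:
    """Share of rows in each group that have a complete set of probabilities."""
    # A row counts as scored only if every race probability is present.
    scored = probs.notna().all(axis=1)
    return scored.groupby(groups).mean()


# Example call with the notebook's variables (not executed here):
# proxy_coverage(sg_results[bisg_race], voter_data["party_cd"])
```

Low coverage in a party would mean its estimated composition rests on a possibly unrepresentative subset of its members.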


## Compute weighted estimator

In [152]:
race_party_dist = {}

for race in bisg_race:
    race_dist = np.zeros(6)
    count = len(voter_data[voter_data["bisg_race"] == race])


37
331439
543
61
6573
3225
204136
414050
103
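
The cell above is unfinished, so here is a hedged sketch of one way to complete the task. The paper states the weighted estimator for outcome rates within a protected group, weighting each unit by its proxy probability. The sketch adapts that idea to party composition, which is an assumption about what the task intends: for party $p$ with scored members $I_p$, the estimated share of race $r$ is $\frac{1}{|I_p|}\sum_{i \in I_p} \hat{P}(r \mid \text{surname}_i, \text{zip}_i)$. All function names are hypothetical, rows without BISG probabilities are dropped, and the probability frame is assumed to share its index with the voter frame.

```python
import pandas as pd


def weighted_composition(probs: pd.DataFrame, party: pd.Series) -> pd.DataFrame:
    """Mean proxy probability of each race within each party (rows = parties)."""
    scored = probs.dropna()  # rows the proxy model could not score are excluded
    return scored.groupby(party.loc[scored.index]).mean()


def true_composition(race: pd.Series, party: pd.Series, categories: list) -> pd.DataFrame:
    """Observed share of each self-reported race within each party."""
    table = pd.crosstab(party, race, normalize="index")
    # Same column order as the estimate; races absent from the data get 0.
    return table.reindex(columns=categories, fill_value=0.0)


def plot_error(estimate: pd.DataFrame, truth: pd.DataFrame):
    """Grouped bar chart of estimated minus true share, one group per party."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    error = estimate - truth
    ax = error.plot.bar(figsize=(8, 4))
    ax.axhline(0, color="black", linewidth=0.8)
    ax.set_xlabel("party")
    ax.set_ylabel("estimated share minus true share")
    plt.tight_layout()
    return ax


# Example calls with the notebook's variables (not executed here):
# est = weighted_composition(sg_results[bisg_race], voter_data["party_cd"])
# tru = true_composition(voter_data["bisg_race"], voter_data["party_cd"], bisg_race)
# plot_error(est, tru)
```

A signed error chart shows both the size and the direction of the bias for each race within each party, which a single accuracy number would hide.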
