
# NC Data Prototyping

## Loading the data

In [186]:
# imports
import numpy as np
import pandas as pd

There are two data files: 
 - voter (info about the voter)
 - voterhistory (whether or not they voted)

#### Voter

The data is a 3.2 GB tab-separated text file. Loading the whole thing takes forever, so I'm only loading the first 100,000 rows for now. Note: I also shuffle the loaded rows so that `head()` doesn't just show the top of the file. This only reorders those first 100,000 rows, so systematic problems that occur elsewhere in the file can still be missed.

In [187]:
voter = pd.read_csv('ncvoter.txt', sep='\t', encoding = "ISO-8859-1", nrows=100000).sample(frac=1)
voter.head(3)

Unnamed: 0,county_id,county_desc,voter_reg_num,status_cd,voter_status_desc,reason_cd,voter_status_reason_desc,absent_ind,name_prefx_cd,last_name,...,munic_dist_desc,dist_1_abbrv,dist_1_desc,dist_2_abbrv,dist_2_desc,confidential_ind,birth_year,ncid,vtd_abbrv,vtd_desc
46219,1,ALAMANCE,9150809,A,ACTIVE,AV,VERIFIED,,,HUGHBANKS,...,GRAHAM,17.0,17TH PROSECUTORIAL,,,N,1997,AA185225,06E,06E
52243,1,ALAMANCE,9164402,A,ACTIVE,AV,VERIFIED,,,KENNEDY,...,BURLINGTON,17.0,17TH PROSECUTORIAL,,,N,1997,AA193478,126,126
93040,1,ALAMANCE,2249900,A,ACTIVE,AV,VERIFIED,,,STRICKLAND,...,BURLINGTON,17.0,17TH PROSECUTORIAL,,,N,1962,AA16281,125,125
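
If the first 100,000 rows turn out to be too narrow (the three rows shown above are all ALAMANCE), one option is to read the file in chunks and keep a random fraction of each chunk. This is only a rough sketch: `sample_tsv` is a made-up helper, the toy in-memory file stands in for `ncvoter.txt`, and the `frac` and `chunksize` values are placeholders.

```python
import io
import pandas as pd

def sample_tsv(source, frac, chunksize, seed=0, **read_kwargs):
    """Read a tab-separated file in chunks and keep a random fraction of each chunk."""
    chunks = pd.read_csv(source, sep='\t', chunksize=chunksize, **read_kwargs)
    # Use a different seed per chunk so the same row positions aren't picked every time
    parts = [chunk.sample(frac=frac, random_state=seed + i)
             for i, chunk in enumerate(chunks)]
    return pd.concat(parts)

# Toy stand-in for the real file: 1,000 rows, 10 fake counties of 100 rows each
toy_rows = ['county_id\tvoter_reg_num'] + [f'{i // 100}\t{i}' for i in range(1000)]
toy_file = io.StringIO('\n'.join(toy_rows))

sampled = sample_tsv(toy_file, frac=0.1, chunksize=100)
print(len(sampled), sampled['county_id'].nunique())
```

On the real file, extra arguments such as `encoding="ISO-8859-1"` would be passed through `read_kwargs`. This still scans the whole file, so it won't be fast, but it keeps memory use down to one chunk at a time.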


Let's try and figure out which of the 71 columns we actually need lol  

#### Voter history

In [188]:
vhis_unfiltered = pd.read_csv('ncvhis.txt', sep='\t', encoding = "ISO-8859-1", nrows=200000).sample(frac=1)
vhis_unfiltered.head(3)

Unnamed: 0,county_id,county_desc,voter_reg_num,election_lbl,election_desc,voting_method,voted_party_cd,voted_party_desc,pct_label,pct_description,ncid,voted_county_id,voted_county_desc,vtd_label,vtd_description
168838,1,ALAMANCE,9120481,11/08/2016,11/08/2016 GENERAL,IN-PERSON,UNA,UNAFFILIATED,03S,SOUTH BOONE,AA164253,1,ALAMANCE,03S,03S
54277,68,ORANGE,248274,11/02/2010,11/02/2010 GENERAL,ABSENTEE ONESTOP,DEM,DEMOCRATIC,CX,CHEEKS,AA123180,68,ORANGE,CX,CX
135404,1,ALAMANCE,9102871,11/03/2015,11/03/2015 MUNICIPAL GENERAL,CURBSIDE,REP,REPUBLICAN,03S,SOUTH BOONE,AA150595,1,ALAMANCE,03S,03S


In [189]:
# Filtering to only include 2012 and 2016 general elections
vhis_2012 = vhis_unfiltered[vhis_unfiltered['election_desc'] == '11/06/2012 GENERAL']
vhis_2016 = vhis_unfiltered[vhis_unfiltered['election_desc'] == '11/08/2016 GENERAL']

##### 2012 data

In [190]:
# Left joining the tables (voter is joined with the 2012 voter history first, and then with 2016)
joined_2012 = pd.merge(voter, vhis_2012, on='voter_reg_num', how='left')
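
One thing to double-check on the join key: as far as I understand the NC files, `voter_reg_num` is only unique within a county, and `ncid` is the statewide ID. That is an assumption worth verifying against the file layout. If it holds, joining on `voter_reg_num` alone can match a voter to someone else's history in another county. A toy sketch of the difference (made-up rows, not real voters):

```python
import pandas as pd

# Made-up rows: two different voters share voter_reg_num 100 in different counties
voter_toy = pd.DataFrame({
    'county_id': [1, 1, 68],
    'voter_reg_num': [100, 200, 100],
    'last_name': ['A', 'B', 'C'],
})
vhis_toy = pd.DataFrame({
    'county_id': [1, 68],
    'voter_reg_num': [100, 100],
    'voting_method': ['IN-PERSON', 'ABSENTEE ONESTOP'],
})

# Joining on voter_reg_num alone cross-matches the two voters numbered 100
by_reg_num = pd.merge(voter_toy, vhis_toy, on='voter_reg_num', how='left')

# Joining on both keys keeps one row per voter; validate raises if keys repeat
by_both = pd.merge(voter_toy, vhis_toy, on=['county_id', 'voter_reg_num'],
                   how='left', validate='one_to_one')

print(len(by_reg_num), len(by_both))
```

`validate='one_to_one'` makes pandas raise a `MergeError` if the key is duplicated on either side, which would also catch a voter with more than one history row for the same election.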

In [191]:
# Keeping only the columns we need
joined_2012 = joined_2012[['voter_reg_num', 'voter_status_desc', 'res_street_address',
                           'res_city_desc', 'state_cd', 'zip_code', 'race_code', 'voting_method',
                           'precinct_abbrv', 'precinct_desc', 'pct_description', 'vtd_label']]

In [192]:
# Renaming columns
joined_2012 = joined_2012.rename(columns={'voting_method': '2012_voting_method',
                                          'precinct_abbrv': 'current_pct_abbrv',
                                          'precinct_desc': 'current_pct_desc',
                                          'pct_description': '2012_pct_vote_name',
                                          'vtd_label': '2012_pct_vote_code'})

##### 2016 data

In [193]:
# Joining the 2012/voter data with the 2016 data 
joined_2016 = pd.merge(joined_2012, vhis_2016, on='voter_reg_num', how='left')

In [194]:
# Renaming columns from 2016
joined_2016 = joined_2016.rename(columns={'pct_description': '2016_pct_vote_name',
                                          'vtd_label': '2016_pct_vote_code',
                                          'voting_method': '2016_voting_method'})

In [195]:
# Filtering to necessary columns (.copy() so new columns can be added later without a SettingWithCopyWarning)
df = joined_2016[['voter_reg_num', 'voter_status_desc', 'res_street_address',
                  'res_city_desc', 'state_cd', 'zip_code', 'race_code',
                  'current_pct_abbrv', 'current_pct_desc', '2012_voting_method', '2012_pct_vote_name',
                  '2012_pct_vote_code', '2016_voting_method', '2016_pct_vote_name', '2016_pct_vote_code']].copy()

In [199]:
df[30:38]

Unnamed: 0,voter_reg_num,voter_status_desc,res_street_address,res_city_desc,state_cd,zip_code,race_code,current_pct_abbrv,current_pct_desc,2012_voting_method,2012_pct_vote_name,2012_pct_vote_code,2016_voting_method,2016_pct_vote_name,2016_pct_vote_code
30,9137321,INACTIVE,130 W CRESCENT SQUARE DR #B,GRAHAM,NC,27253.0,W,064,GRAHAM 4,ABSENTEE ONESTOP,GRAHAM 4,064,,,
31,9166560,ACTIVE,308 SUTTON PL,MEBANE,NC,27302.0,W,10S,SOUTH MELVILLE,,,,,,
32,9102538,ACTIVE,1630 GRACE LANDING DR,MEBANE,NC,27302.0,W,103,MELVILLE 3,IN-PERSON,MELVILLE 3,103,IN-PERSON,MELVILLE 3,103
33,1169700,ACTIVE,2566 CAPRICE LN,BURLINGTON,NC,27215.0,W,02,COBLE,,,,,,
34,9095360,ACTIVE,6870 STILLHOPE LN,LIBERTY,NC,27298.0,W,01,PATTERSON,,,,IN-PERSON,PATTERSON,01
35,9164212,ACTIVE,1742 OLD ARBOR WAY,MEBANE,NC,27302.0,U,103,MELVILLE 3,,,,,,
36,9077956,ACTIVE,1105 COQUINA CT,MEBANE,NC,27302.0,W,10S,SOUTH MELVILLE,ABSENTEE ONESTOP,SOUTH MELVILLE,10S,ABSENTEE ONESTOP,SOUTH MELVILLE,10S
37,9171631,ACTIVE,3637 S JIM MINOR RD,HAW RIVER,NC,27258.0,W,09S,SOUTH THOMPSON,,,,,,


The dataset now consists of:
 - One row per voter (assuming each voter has at most one history record per election)
 - A few columns about that voter (e.g., current address, race, current precinct)
 - Voting method, voting location name, and voting location code for the 2012 general election (null if no 2012 record was found, meaning they didn't vote or their record isn't in the 200,000 history rows loaded)
 - The same three columns for the 2016 general election (null under the same conditions)
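
A quick way to sanity-check that structure is to confirm the voter key is unique and to turn the nulls into explicit flags. A minimal sketch on three rows copied from the output above (the `voted_2012` / `voted_2016` column names are just suggestions):

```python
import pandas as pd

# Three rows copied from the df output above (only the columns needed here)
toy = pd.DataFrame({
    'voter_reg_num': [9137321, 9166560, 9102538],
    '2012_voting_method': ['ABSENTEE ONESTOP', None, 'IN-PERSON'],
    '2016_voting_method': [None, None, 'IN-PERSON'],
})

# True only if no voter appears on more than one row
one_row_per_voter = toy['voter_reg_num'].is_unique

# A non-null voting method means a history record was found for that election
toy['voted_2012'] = toy['2012_voting_method'].notna()
toy['voted_2016'] = toy['2016_voting_method'].notna()

print(one_row_per_voter)
print(toy[['voter_reg_num', 'voted_2012', 'voted_2016']])
```

The same lines should work on `df` itself. Keep in mind that a `False` flag only means no record was found in the rows loaded so far.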

We're not in bad shape. The main thing we still need to add is a column with the address of each voter's assigned polling place.

### Modeling

Some to-dos here:
 - Clean up the dataset lol
 - Figure out the distance metric and create a column for it
 - Decide how we want to share initial code
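
For the distance metric, one candidate is straight-line (great-circle) distance between a voter's address and their polling place. The metric hasn't been decided yet, so this is just a sketch of one option. It assumes both addresses have already been geocoded to latitude/longitude, which nothing above does yet, and the coordinates below are made up.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees.

    Works on scalars or on NumPy arrays / pandas columns of equal length.
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Made-up coordinates: two voter locations and two polling place locations
voter_lat = np.array([36.00, 36.10])
voter_lon = np.array([-79.40, -79.30])
poll_lat = np.array([36.00, 36.15])
poll_lon = np.array([-79.40, -79.35])

distance_km = haversine_km(voter_lat, voter_lon, poll_lat, poll_lon)
print(distance_km)
```

Straight-line distance ignores roads, so driving distance or travel time might be a better fit later, but this is cheap to compute for every row.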