<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Address-queries" data-toc-modified-id="Address-queries-1">Address queries</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Process" data-toc-modified-id="Process-1.0.1">Process</a></span></li><li><span><a href="#Adding-poll_changed-(can-delete-this-section-in-the-full-py-script-because-we'll-already-have-this-column)" data-toc-modified-id="Adding-poll_changed-(can-delete-this-section-in-the-full-py-script-because-we'll-already-have-this-column)-1.0.2">Adding poll_changed (can delete this section in the full py script because we'll already have this column)</a></span></li><li><span><a href="#4/4-Morning:-To-Do" data-toc-modified-id="4/4-Morning:-To-Do-1.0.3">4/4 Morning: To Do</a></span></li><li><span><a href="#Calculating-distance" data-toc-modified-id="Calculating-distance-1.0.4">Calculating distance</a></span><ul class="toc-item"><li><span><a href="#Sorting" data-toc-modified-id="Sorting-1.0.4.1">Sorting</a></span></li><li><span><a href="#Finding-50-closest-neighbors-for-those-with-new-polling-places" data-toc-modified-id="Finding-50-closest-neighbors-for-those-with-new-polling-places-1.0.4.2">Finding 50 closest neighbors for those with new polling places</a></span></li></ul></li></ul></li></ul></li></ul></div>

# Address queries

### Process
 - First, need to sort by lat/long, since this is only an O(N log N) operation 
 - Then, for each of the ones _with a new polling place_, which reduces it 10-fold, find the 100 closest locations (50 up and 50 down unless it's at the top or bottom)
 - Calculate distance between each polling and its closest 100 locations  
 - Instead of 2,000,000 x 2,000,000 calculations with the 0.1 degrees approach, this is only 200,000 x 100 

In [231]:
from dask import dataframe as dd
import cleaning_functions
from pygeocoder import Geocoder as geo
from math import radians, cos, sin, asin, sqrt
import numpy as np
import pandas as pd

In [232]:
# Loading data and dropping duplicates
df = pd.read_csv('NC-000_gc.tsv', sep='\t')
df = df.drop_duplicates(subset=['ncid'])

In [233]:
def lat_long_columns ():
    
    '''
    Function that creates latitude and longitude columns from the coords column
    '''

    latitude_list = []
    longitude_list = []

    for count, elem in enumerate(df['coords'].tolist()):
        latitutde = float(column_list[count].split(',')[0].replace('(', ''))
        latitude_list.append(latitutde)

        longitude = float(column_list[count].split(',')[1].replace(')', '').replace(' ', ''))
        longitude_list.append(longitude)
    
    df['latitude'] = latitude_list
    df['longitude'] = longitude_list
    
    return df

In [234]:
# Running the lat/long column function
df = lat_long_columns()

In [235]:
# Removing some unnecessary columns
df = df[['ncid', 'race_code', 'active', 'address', 'county_desc', 'polling_place_name', 'voting_method',
        'precinct', 'poll_address', 'latitude', 'longitude']]
df.head()

Unnamed: 0,ncid,race_code,active,address,county_desc,polling_place_name,voting_method,precinct,poll_address,latitude,longitude
0,AA100006,W,1.0,"3613 DOE LN ,HAW RIVER,NC 27258",ALAMANCE,SALEM UNITED METHODIST CHURCH,ABSENTEE ONESTOP,SOUTH THOMPSON,"4924 SALEM CHURCH RD, HAW RIVER, NC 27258",35.998117,-79.316782
1,AA100009,W,1.0,"2652 SAXAPAHAW-BETHLEHEM CHURCH RD ,GRAHAM,NC ...",ALAMANCE,SALEM UNITED METHODIST CHURCH,ABSENTEE ONESTOP,SOUTH THOMPSON,"4924 SALEM CHURCH RD, HAW RIVER, NC 27258",35.958465,-79.2907
2,AA100029,W,1.0,"3761 PHILLIPS CHAPEL RD ,HAW RIVER,NC 27258",ALAMANCE,SALEM UNITED METHODIST CHURCH,IN-PERSON,SOUTH THOMPSON,"4924 SALEM CHURCH RD, HAW RIVER, NC 27258",35.99482,-79.316823
3,AA100050,W,1.0,"5114 SWEPSONVILLE-SAXAPAHAW RD ,GRAHAM,NC 27253",ALAMANCE,SALEM UNITED METHODIST CHURCH,ABSENTEE ONESTOP,SOUTH THOMPSON,"4924 SALEM CHURCH RD, HAW RIVER, NC 27258",35.972383,-79.322328
4,AA100084,W,1.0,"7533 MORROW MILL RD ,MEBANE,NC 27302",ALAMANCE,SALEM UNITED METHODIST CHURCH,ABSENTEE ONESTOP,SOUTH THOMPSON,"4924 SALEM CHURCH RD, HAW RIVER, NC 27258",35.9393,-79.268773


^^^ Need to get the NCID - whether their location changed - column from other dataset

### Adding poll_changed (can delete this section in the full py script because we'll already have this column)

In [236]:
# Loading dataframe, dropping duplicate NCID's, and reducing to only two columns

changed_poll_df = pd.read_csv('NC_combined_files.tsv', sep='\t')
changed_poll_df = changed_poll_df.drop_duplicates(subset=['ncid'])
changed_poll_df = changed_poll_df[['ncid', 'poll_changed']]

In [237]:
changed_poll_df['poll_changed'].mean()

0.05800710309710488

In [238]:
df.head(2)

Unnamed: 0,ncid,race_code,active,address,county_desc,polling_place_name,voting_method,precinct,poll_address,latitude,longitude
0,AA100006,W,1.0,"3613 DOE LN ,HAW RIVER,NC 27258",ALAMANCE,SALEM UNITED METHODIST CHURCH,ABSENTEE ONESTOP,SOUTH THOMPSON,"4924 SALEM CHURCH RD, HAW RIVER, NC 27258",35.998117,-79.316782
1,AA100009,W,1.0,"2652 SAXAPAHAW-BETHLEHEM CHURCH RD ,GRAHAM,NC ...",ALAMANCE,SALEM UNITED METHODIST CHURCH,ABSENTEE ONESTOP,SOUTH THOMPSON,"4924 SALEM CHURCH RD, HAW RIVER, NC 27258",35.958465,-79.2907


In [239]:
# Merging the dataframes
df = pd.merge(df, changed_poll_df, left_on='ncid', right_on='ncid', how='inner')

In [240]:
df['poll_changed'].mean()

0.0026373626373626374

In [241]:
0.0026373626373626374 * len(df)

6.0

### 4/4 Morning: To Do
 - Pull in the full NC_combined_files.tsv dataframe in pandas
 - Drop duplicates
 - Reduce it to just two columns: NCID and poll_changed
 - Do an inner join between the data I already have above and this dataframe
 - Proceed as normal (no neighbors should be added if their location also changed)

### Calculating distance

Apparently this approximation method works well for small distances (e.g., less than 500 miles) and is very quick (100,000 calculations per second)  
https://stackoverflow.com/questions/15736995/how-can-i-quickly-estimate-the-distance-between-two-latitude-longitude-points

^^^ Tried a few after looking it up in Google Maps and it's directionally correct. Need to strip out the negatives for long tho, or change function. Worth doing a few more to confirm 

In [242]:
def haversine(lat1, lon1, lat2, lon2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    # Radius of earth in kilometers is 6371
    km = 6371* c
    miles = km*0.62
    return miles

#### Sorting

In [243]:
# Sorting by lat and long
df = df.sort_values(by=['latitude', 'longitude'])
df.head()

Unnamed: 0,ncid,race_code,active,address,county_desc,polling_place_name,voting_method,precinct,poll_address,latitude,longitude,poll_changed
930,AA109022,W,1.0,"4372 CEDAR CLIFF RD ,GRAHAM,NC 27253",ALAMANCE,MT HERMON COMMUNITY CENTER,IN-PERSON,ALBRIGHT,"3735 BASS MOUNTAIN RD, GRAHAM, NC 27253",35.847937,-79.392419,0
958,AA102856,O,1.0,"1039 KELSO LN ,BURLINGTON,NC 27215",ALAMANCE,MARVIN B SMITH ELEMENTARY SCHO,ABSENTEE ONESTOP,SOUTH BOONE,"2235 DELANEY DR, BURLINGTON, NC 27215",35.847937,-79.392419,0
937,AA110239,W,1.0,"4763 MT HERMON-ROCK CREEK RD ,SNOW CAMP,NC 27349",ALAMANCE,MT HERMON COMMUNITY CENTER,IN-PERSON,ALBRIGHT,"3735 BASS MOUNTAIN RD, GRAHAM, NC 27253",35.849359,-79.403678,0
949,AA101717,B,1.0,"405 COBBLESTONE CT ,BURLINGTON,NC 27215",ALAMANCE,MARVIN B SMITH ELEMENTARY SCHO,ABSENTEE ONESTOP,SOUTH BOONE,"2235 DELANEY DR, BURLINGTON, NC 27215",35.849359,-79.403678,0
965,AA10375,W,1.0,"229 COACHLIGHT TRL ,BURLINGTON,NC 27215",ALAMANCE,MARVIN B SMITH ELEMENTARY SCHO,ABSENTEE ONESTOP,SOUTH BOONE,"2235 DELANEY DR, BURLINGTON, NC 27215",35.849359,-79.403678,0


#### Finding 50 closest neighbors for those with new polling places

In [244]:
# Resetting index (will need to discuss how to do this part in dask) then getting indices w/ new locations
df = df.reset_index()
new_poll_indeces = df.index[df['poll_changed'] == 1].tolist()

In [245]:
new_poll_indeces

[133, 1216, 1217, 1404, 1487, 1612]

In [246]:
# Creating dataframe with NCID and indices
def ncid_to_index ():
    '''
    Creates a dataframe that maps NCID to that individual's index. Used in later cleaning.
    '''
    ncid_list = []
    for count, elem in enumerate (new_poll_indeces):
        ncid_list.append(df['ncid'][count])

    return pd.DataFrame({'ncid': ncid_list, 'index': new_poll_indeces})

In [247]:
ncid_to_index_df = ncid_to_index()

In [248]:
ncid_to_index_df

Unnamed: 0,ncid,index
0,AA109022,133
1,AA102856,1216
2,AA110239,1217
3,AA101717,1404
4,AA10375,1487
5,AA105827,1612


In [249]:
def fifty_nearest (new_poll_indeces):
    '''
    Params:
        new_poll_indices: list of indices where that voter's polling location changed 
    
    Returns:
        A dictionary where each key is the index of a voter whose and the 
        values is a list of indices of his 50 closest neighbors
    
    Note:
        This only returns indices of neighbors who did not have their location changed, 
        and does not return the individual himself  
    '''
    
    fifty_nearest_dict = {}

    for elem in new_poll_indeces:
        unfiltered_list = np.linspace(elem - 34, elem + 35, 70)
        removed_movers_list = [elem for elem in unfiltered_list if elem not in new_poll_indeces]
        fifty_nearest_dict[elem] = removed_movers_list[11:61]
    
    return fifty_nearest_dict

In [250]:
# Creating the dict
fifty_nearest_dict = fifty_nearest(new_poll_indeces)

In [251]:
fifty_nearest_dict

{133: [110.0,
  111.0,
  112.0,
  113.0,
  114.0,
  115.0,
  116.0,
  117.0,
  118.0,
  119.0,
  120.0,
  121.0,
  122.0,
  123.0,
  124.0,
  125.0,
  126.0,
  127.0,
  128.0,
  129.0,
  130.0,
  131.0,
  132.0,
  134.0,
  135.0,
  136.0,
  137.0,
  138.0,
  139.0,
  140.0,
  141.0,
  142.0,
  143.0,
  144.0,
  145.0,
  146.0,
  147.0,
  148.0,
  149.0,
  150.0,
  151.0,
  152.0,
  153.0,
  154.0,
  155.0,
  156.0,
  157.0,
  158.0,
  159.0,
  160.0],
 1216: [1193.0,
  1194.0,
  1195.0,
  1196.0,
  1197.0,
  1198.0,
  1199.0,
  1200.0,
  1201.0,
  1202.0,
  1203.0,
  1204.0,
  1205.0,
  1206.0,
  1207.0,
  1208.0,
  1209.0,
  1210.0,
  1211.0,
  1212.0,
  1213.0,
  1214.0,
  1215.0,
  1218.0,
  1219.0,
  1220.0,
  1221.0,
  1222.0,
  1223.0,
  1224.0,
  1225.0,
  1226.0,
  1227.0,
  1228.0,
  1229.0,
  1230.0,
  1231.0,
  1232.0,
  1233.0,
  1234.0,
  1235.0,
  1236.0,
  1237.0,
  1238.0,
  1239.0,
  1240.0,
  1241.0,
  1242.0,
  1243.0,
  1244.0],
 1217: [1194.0,
  1195.0,
  1196.0,
 

In [252]:
def generate_column_names ():
    '''
    Simple function to generate column names for the 50 closest neighbors to be used in distance matrix
    '''
    return ["Neighbor" + str(elem+1) for elem in range(50)] 

In [253]:
def generate_rows ():
    '''
    This function creates an array of rows (each of which is another array) to eventually
    be passed into the dataframe 
    '''
    
    list_of_rows = []
    for key in fifty_nearest_dict.keys():

        row_values = []
        for elem in fifty_nearest_dict[key]:
            row_values.append(haversine(df['latitude'][key], df['longitude'][key], df['latitude'][elem], df['longitude'][elem]))
        list_of_rows.append(row_values)
        row_values = []
    
    return list_of_rows  

??? Is there a way to parallelize this? This is the bottleneck right now 

In [254]:
# Generating the rows for the dataframe
rows_for_df = generate_rows()

# Creating the dataframe and adding a column for the initial index
neighbor_df = pd.DataFrame(rows_for_df, columns = generate_column_names())
neighbor_df['initial_index'] = new_poll_indeces

In [255]:
# Reordering columns
cols = list(neighbor_df)
reordered_cols = cols.insert(0, cols.pop(cols.index('initial_index')))
neighbor_df = neighbor_df.loc[:, cols]
neighbor_df.head()

Unnamed: 0,initial_index,Neighbor1,Neighbor2,Neighbor3,Neighbor4,Neighbor5,Neighbor6,Neighbor7,Neighbor8,Neighbor9,...,Neighbor41,Neighbor42,Neighbor43,Neighbor44,Neighbor45,Neighbor46,Neighbor47,Neighbor48,Neighbor49,Neighbor50
0,133,0.273952,0.273952,0.173588,0.173588,2.362806,2.362806,4.524423,4.524423,0.2321,...,4.605959,2.117311,2.117311,8.764145,8.764145,9.046994,9.046994,5.537262,5.537262,5.92537
1,1216,0.292044,0.209853,0.209853,9.797274,9.797274,9.797274,9.797274,0.247017,0.247017,...,0.390094,1.115974,1.115974,0.677291,0.677291,0.147989,0.147989,1.468524,1.468524,0.38103
2,1217,0.209853,0.209853,9.797274,9.797274,9.797274,9.797274,0.247017,0.247017,9.850631,...,1.115974,1.115974,0.677291,0.677291,0.147989,0.147989,1.468524,1.468524,0.38103,0.38103
3,1404,3.962775,3.962775,4.197738,4.197738,2.461992,2.461992,2.461992,2.461992,0.95382,...,3.563164,3.893841,3.893841,0.289374,0.324508,3.935745,3.935745,0.239678,0.123279,4.203182
4,1487,3.075806,0.572968,3.608078,3.608078,3.608078,3.608078,2.783405,2.783405,9.584196,...,4.992343,2.723813,2.723813,2.723813,2.723813,9.490754,9.490754,4.444684,4.444684,2.486605


In [256]:
# Merging with other df to get ncid column 
final_df = pd.merge(ncid_to_index_df, neighbor_df, left_on='index', right_on='initial_index', how='inner').drop(['initial_index'], axis=1)

In [257]:
final_df

Unnamed: 0,ncid,index,Neighbor1,Neighbor2,Neighbor3,Neighbor4,Neighbor5,Neighbor6,Neighbor7,Neighbor8,...,Neighbor41,Neighbor42,Neighbor43,Neighbor44,Neighbor45,Neighbor46,Neighbor47,Neighbor48,Neighbor49,Neighbor50
0,AA109022,133,0.273952,0.273952,0.173588,0.173588,2.362806,2.362806,4.524423,4.524423,...,4.605959,2.117311,2.117311,8.764145,8.764145,9.046994,9.046994,5.537262,5.537262,5.92537
1,AA102856,1216,0.292044,0.209853,0.209853,9.797274,9.797274,9.797274,9.797274,0.247017,...,0.390094,1.115974,1.115974,0.677291,0.677291,0.147989,0.147989,1.468524,1.468524,0.38103
2,AA110239,1217,0.209853,0.209853,9.797274,9.797274,9.797274,9.797274,0.247017,0.247017,...,1.115974,1.115974,0.677291,0.677291,0.147989,0.147989,1.468524,1.468524,0.38103,0.38103
3,AA101717,1404,3.962775,3.962775,4.197738,4.197738,2.461992,2.461992,2.461992,2.461992,...,3.563164,3.893841,3.893841,0.289374,0.324508,3.935745,3.935745,0.239678,0.123279,4.203182
4,AA10375,1487,3.075806,0.572968,3.608078,3.608078,3.608078,3.608078,2.783405,2.783405,...,4.992343,2.723813,2.723813,2.723813,2.723813,9.490754,9.490754,4.444684,4.444684,2.486605
5,AA105827,1612,9.980911,9.980911,3.362701,3.362701,5.364712,5.364712,2.378135,2.378135,...,5.244073,5.244073,9.954029,9.954029,9.843871,9.843871,10.057163,10.057163,4.415089,4.415089


In [260]:
print (neighbor_df['Neighbor25'].mean())
print (neighbor_df['Neighbor45'].mean())

1.976985061971191
3.746936241414877


In [259]:
# Also, make sure to test out a few and confirm that the answer is right. One good plot is the mean distance of 
# all 50 (should be lower in the middle, e.g, around 25)