# Data Cleaning for Projections

The historical data from the Colorado Secretary of State's office is not uniform: the spreadsheets are formatted differently from year to year, some years have precinct-by-precinct data while others do not, and the field names differ.

This notebook cleans all of that into a standard format, one year at a time, and writes the result to Pandas-friendly CSV files that can be loaded later.

In [1]:
import pandas as pd
import glob
import re
import string

reg_dir = '../data/registration/'
res_dir = '../data/results/'

## Registration Data

The voter registration data has a uniform format, so a single function should do the trick.

In [2]:
def clean_registration_data(year):

    print('Processing registration data for {}'.format(year))
    df = pd.read_excel(reg_dir+'raw/{}.xlsx'.format(year),
                       sheet_name='State Senate Districts', header=0)

    # upper-case the column names
    df = df.rename(columns={col: col.upper() for col in df.columns})

    # upper-case the county names, then drop rows that have no county
    df['COUNTY'] = df['COUNTY'].str.upper()
    df = df[df['COUNTY'].notnull()]

    # empty cells mean no voters, so replace NaN with 0
    df = df.fillna(0)

    # write to file
    df.to_csv(reg_dir+'cleaned/{}.csv'.format(year))
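
As a minimal sketch of this kind of normalization, the snippet below applies the same steps to a small made-up table (the column names and counts are invented, not taken from the Secretary of State files) and round-trips it through CSV. Because `to_csv` writes the index as the first column by default, the sketch reads the file back with `index_col=0`.

```python
import io
import pandas as pd

# Made-up stand-in for one registration sheet (not real data)
raw = pd.DataFrame({
    'County': ['Adams', 'Boulder', None],
    'Dem': [100.0, None, 5.0],
    'Rep': [80.0, 60.0, None],
})

# Upper-case the headers and county names, drop rows with no county,
# then fill the remaining empty counts with 0
toy = raw.rename(columns=str.upper)
toy['COUNTY'] = toy['COUNTY'].str.upper()
toy = toy[toy['COUNTY'].notnull()]
toy = toy.fillna(0)

# Round trip through CSV; index_col=0 consumes the index column to_csv wrote
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

Without `index_col=0`, the saved index would come back as an extra unnamed column.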

In [5]:
filepaths = glob.glob(reg_dir+'raw/*.xlsx')
years = [re.search(reg_dir+'raw/(.*).xlsx', filepath).group(1) for filepath in filepaths]
print(years)
for year in years:
    clean_registration_data(year)

['2016', '2020', '2012', '2014', '2018']
Processing registration data for 2016
Processing registration data for 2020
Processing registration data for 2012
Processing registration data for 2014
Processing registration data for 2018
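
An alternative to the regular expression is to take each file's stem with `pathlib`, which avoids having to escape the `.` characters in the path, and to sort the result so the years are processed in a predictable order. This is only a sketch: the helper name `list_years` is hypothetical, and a temporary directory with empty placeholder files stands in for the raw data directory.

```python
import tempfile
from pathlib import Path

def list_years(raw_dir):
    """Return the sorted file stems (e.g. '2012') of every .xlsx file in raw_dir."""
    return sorted(path.stem for path in Path(raw_dir).glob('*.xlsx'))

# Demo on empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ['2016.xlsx', '2012.xlsx', '2014.xlsx', 'notes.txt']:
        (Path(tmp) / name).touch()
    years = list_years(tmp)

print(years)
```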


## Results Data

Read in the results and keep only the information we want: strip out the ballot initiatives, federal elections, etc. so that only the State Senate elections remain.

In [22]:
def clean_results_data(year):

    print('Processing results data for {}'.format(year))
    df = pd.read_excel(res_dir+'raw/{}.xlsx'.format(year))
    
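    # deal with a change in formatting after two years: fold 'Candidate Votes'
    # into 'Yes Votes', treating empty cells as 0 so that NaN + number is not NaN
    if 'Candidate Votes' in df.columns:
        df['Yes Votes'] = df['Yes Votes'].fillna(0) + df['Candidate Votes'].fillna(0)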
        
    # upper-case all the column names
    rename_dict = {col: col.upper() for col in df.columns}

    # column-name edge cases:
    #   the spreadsheets name the "what was voted on" column differently every year
    rename_dict['Office/Issue/Judgeship'] = 'DISTRICT'         # 2012 and 2014
    rename_dict['OFFICE / BALLOT ISSUE'] = 'DISTRICT'          # 2016
    rename_dict['OFFICE/BALLOT ISSUE NUMBER'] = 'DISTRICT'     # 2018

    # apply the renames, then upper-case the county and party strings
    df = df.rename(columns=rename_dict)
    df['COUNTY'] = df['COUNTY'].str.upper()
    df['PARTY'] = df['PARTY'].str.upper()

    # isolate the State Senate rows and the columns we want
    cols = ['YES VOTES', 'DISTRICT', 'COUNTY', 'PARTY']
    df = df.loc[df['DISTRICT'].str.match('State Senate', na=False), cols]
    # deal with the 'district-total' corner case
    df = df[df['COUNTY'].notnull() & (df['COUNTY'] != 'TOTAL')].copy()
    df['DISTRICT'] = df['DISTRICT'].str.replace('State Senate - District', 'SD', regex=False)

    # write to file
    df.to_csv(res_dir+'cleaned/{}.csv'.format(year))
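
To illustrate the filtering logic on its own, here is a small sketch on invented rows (the values are placeholders, not actual results). It assumes district labels look like `State Senate - District N`, as the replacement string in the function suggests, and passes `na=False` so that rows with a missing label count as non-matches.

```python
import pandas as pd

# Invented rows standing in for one results spreadsheet (not real data)
toy = pd.DataFrame({
    'DISTRICT': ['State Senate - District 4', 'State Senate - District 4',
                 'Amendment 64', None],
    'COUNTY': ['DOUGLAS', 'TOTAL', 'DENVER', 'DENVER'],
    'PARTY': ['REP', 'REP', None, None],
    'YES VOTES': [1200, 1200, 900, 10],
})

# Keep State Senate rows; na=False turns missing labels into False
is_senate = toy['DISTRICT'].str.match('State Senate', na=False)
# Drop the per-district 'TOTAL' rows and rows with no county
is_county_row = toy['COUNTY'].notnull() & (toy['COUNTY'] != 'TOTAL')

senate = toy.loc[is_senate & is_county_row].copy()
# Shorten the label; regex=False makes this a plain substring replacement
senate['DISTRICT'] = senate['DISTRICT'].str.replace(
    'State Senate - District', 'SD', regex=False)
print(senate)
```

Only the `DOUGLAS` row survives both masks, with its district shortened to `SD 4`.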

In [23]:
filepaths = glob.glob(res_dir+'raw/*.xlsx')
years = [re.search(res_dir+'raw/(.*).xlsx', filepath).group(1) for filepath in filepaths]
print(years)
for year in years:
    clean_results_data(year)

['2016', '2012', '2014', '2018']
Processing results data for 2016
[False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False]
Processing results data for 2012
[False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False]
Processing results data for 2014
[False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False]
Processing results data for 2018
[False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False Fal