# Hawaii 2008 Presidential Elections: Data Cleaning & Preprocessing

**Goal:** Build a clean, analysis-ready county-level table for Hawaii, 2008 from the presidential primary and presidential general election results, and then derive summary statistics (party totals). Note: the Hawaii 2008 primary file currently contains no presidential results, so only the general election contributes to the output.

**Output**: A single CSV where each row is a county and columns include:

- General per-candidate vote counts (prefixed with `gen_`)
- Party totals: `rep_general_total`, `dem_general_total`, `lib_general_total`, `con_general_total`, `gre_general_total`, `ind_general_total`

**Last Updated**: 2025/10/02

## 0. Library Import

In [1]:
import re
import pandas as pd
import numpy as np
from pathlib import Path


## 1. Inputs & Parameters

Define raw file paths once here so the entire notebook is easy to rerun on another machine. If a path changes, we only update it here. We keep a single `OUTPUT_PATH` so all exports land in one known place.

In [2]:
# HI 2008 dataset path
PRIMARY_PATH = r"../../data/raw/2008/HI/20080920__hi__primary__precinct.csv"
GENERAL_PATH = r"../../data/raw/2008/HI/20081104__hi__general__precinct.csv"

# Output directory
OUTPUT_PATH  = r"../../data/processed/2008/HI/"

# Analysis parameters
DISPLAY_ROWS = 10   # Number of rows to display in dataframes
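
Because every path is defined in one place, a small guard can report missing inputs before any loading starts. The sketch below is only an illustration: `missing_inputs` is a hypothetical helper that is not part of the original notebook, and the paths are placeholders rather than the real `PRIMARY_PATH` and `GENERAL_PATH`.

```python
from pathlib import Path
from typing import List


def missing_inputs(*paths: str) -> List[str]:
    """Return the paths that do not point to an existing file (hypothetical helper)."""
    return [p for p in paths if not Path(p).is_file()]


# Placeholder paths for illustration; in the notebook these would be
# PRIMARY_PATH and GENERAL_PATH.
missing = missing_inputs("no/such/dir/primary.csv", "no/such/dir/general.csv")
if missing:
    print("Missing input files:", missing)
```

Failing early with a readable message is usually easier to debug than a `FileNotFoundError` raised halfway through the notebook.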

## 2. Load & Filter

We load primary and general datasets separately and immediately subset to the rows we truly need:

- Restrict `office` to 'President' to avoid mixing down-ballot contests

- Remove columns that are fully missing or irrelevant post-filter (e.g., a district column that’s empty for county-level rows)
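
As a minimal sketch of these two steps, assuming a toy table rather than the real Hawaii files, the filter and the removal of fully missing columns might look like this. The column names mirror the dataset, but the rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: two presidential rows (no district) and one down-ballot row
toy = pd.DataFrame({
    "county": ["A", "A", "B"],
    "office": ["President", "State Senate", "President"],
    "district": [np.nan, 3.0, np.nan],
    "votes": [10, 5, 7],
})

# Step 1: keep only the presidential contest
pres = toy[toy["office"] == "President"].copy()

# Step 2: drop columns that are entirely missing after the filter
pres = pres.dropna(axis=1, how="all")

print(pres.columns.tolist())
```

Here `district` disappears because it is empty for every presidential row, while `office` survives and still has to be dropped explicitly, as the notebook does later.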

### a. Primary Election Dataset

In [3]:
# Load primary data
primary_df = pd.read_csv(PRIMARY_PATH)
primary_df.head(DISPLAY_ROWS)

Unnamed: 0,county,precinct,office,district,party,candidate,absentee,early_votes,election_day,votes
0,County of Hawaii,01-01,Straight Party,,,DEMOCRATIC PARTY (D),0,0,427,427
1,County of Hawaii,01-01,Straight Party,,,INDEPENDENT PARTY (I),0,0,0,0
2,County of Hawaii,01-01,Straight Party,,,LIBERTARIAN PARTY (L),0,0,1,1
3,County of Hawaii,01-01,Straight Party,,,NONPARTISAN BALLOT (N),0,0,3,3
4,County of Hawaii,01-01,Straight Party,,,REPUBLICAN PARTY (R),0,0,44,44
5,County of Hawaii,01-01,US Representative,2.0,I,"STENSHOL, Shaun",0,0,0,0
6,County of Hawaii,01-01,US Representative,2.0,D,"HIRONO, Mazie",0,0,321,321
7,County of Hawaii,01-01,US Representative,2.0,R,"EVANS, Roger B.",0,0,36,36
8,County of Hawaii,01-01,US Representative,2.0,L,"MALLAN, Lloyd J. (Jeff)",0,0,1,1
9,County of Hawaii,01-01,State Senate,3.0,D,"ISBELL, Virginia",0,0,60,60


In [4]:
# Different values in 'office' column
primary_df["office"].value_counts()

office
Straight Party          2200
US Representative       1760
State Representative     947
State Senate             458
Name: count, dtype: int64

There are no presidential rows, so `primary_df` lacks the presidential contest information we are interested in. We stop exploring this dataframe here unless a future analysis needs the down-ballot contests.

### b. General Election Dataset

In [5]:
# Load general data
general_df = pd.read_csv(GENERAL_PATH)
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,precinct,office,district,party,candidate,absentee,early_votes,election_day,votes
0,County of Hawaii,01-01,President,,CON,"BALDWIN, Chuck / CASTLE, Darrell L.",,,,4
1,County of Hawaii,01-01,President,,,Total Blank Votes,,,,5
2,County of Hawaii,01-01,President,,,Total Over Votes,,,,0
3,County of Hawaii,01-01,President,,,Total Ballots,,,,758
4,County of Hawaii,01-01,President,,LIB,"BARR, Bob / ROOT, Wayne A.",,,,0
5,County of Hawaii,01-01,President,,,Total Blank Votes,,,,5
6,County of Hawaii,01-01,President,,,Total Over Votes,,,,0
7,County of Hawaii,01-01,President,,,Total Ballots,,,,758
8,County of Hawaii,01-01,President,,REP,"McCAIN, John / PALIN, Sarah",,,,169
9,County of Hawaii,01-01,President,,,Total Blank Votes,,,,5


In [6]:
# Different values in 'office' column
general_df["office"].value_counts()

office
President               10584
US Representative        6228
State Representative     1864
State Senate             1000
Name: count, dtype: int64

In contrast, `general_df` does contain presidential election data, so we take a closer look at this dataset.

In [7]:
# Only keep rows where 'office' is 'President'
general_df = general_df[general_df["office"] == "President"]
general_df.shape

(10584, 10)

In [8]:
# Now, drop the "office" column as it's no longer needed
# Also, drop the district column as it's not applicable 
general_df = general_df.drop(columns=["office", "district"]).reset_index(drop=True)
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,precinct,party,candidate,absentee,early_votes,election_day,votes
0,County of Hawaii,01-01,CON,"BALDWIN, Chuck / CASTLE, Darrell L.",,,,4
1,County of Hawaii,01-01,,Total Blank Votes,,,,5
2,County of Hawaii,01-01,,Total Over Votes,,,,0
3,County of Hawaii,01-01,,Total Ballots,,,,758
4,County of Hawaii,01-01,LIB,"BARR, Bob / ROOT, Wayne A.",,,,0
5,County of Hawaii,01-01,,Total Blank Votes,,,,5
6,County of Hawaii,01-01,,Total Over Votes,,,,0
7,County of Hawaii,01-01,,Total Ballots,,,,758
8,County of Hawaii,01-01,REP,"McCAIN, John / PALIN, Sarah",,,,169
9,County of Hawaii,01-01,,Total Blank Votes,,,,5


There are some peculiar observations in the dataset. Let's list the `candidate` values.

In [9]:
# Candidates in general_df
general_df["candidate"].value_counts()

candidate
Total Blank Votes                      2646
Total Over Votes                       2646
Total Ballots                          2646
BALDWIN, Chuck / CASTLE, Darrell L.     441
BARR, Bob / ROOT, Wayne A.              441
McCAIN, John / PALIN, Sarah             441
McKINNEY, Cynthia / CLEMENTE, Rosa      441
NADER, Ralph / GONZALEZ, Matt           441
OBAMA, Barack / BIDEN, Joe              441
Name: count, dtype: int64

The dataset contains summary rows ("Total Blank Votes", "Total Over Votes", "Total Ballots") that are not candidates, so we drop them first.

In [10]:
# Drop rows where "candidate" starts with "Total"
# .copy() avoids a SettingWithCopyWarning when the column is modified later
general_df = general_df[~general_df["candidate"].str.startswith("Total")].copy()
general_df.shape

(2646, 8)

In [11]:
# Updated candidate list in general_df
general_df["candidate"].value_counts()

candidate
BALDWIN, Chuck / CASTLE, Darrell L.    441
BARR, Bob / ROOT, Wayne A.             441
McCAIN, John / PALIN, Sarah            441
McKINNEY, Cynthia / CLEMENTE, Rosa     441
NADER, Ralph / GONZALEZ, Matt          441
OBAMA, Barack / BIDEN, Joe             441
Name: count, dtype: int64

Notice that the values in the `candidate` column have the format "President / Vice President". We only want the presidential candidate, so we split each value and keep only the presidential candidate.

In [12]:
# Keep only the presidential candidate in the "candidate" column
general_df["candidate"] = (
    general_df["candidate"]
      .str.split(r"\s*/\s*", n=1, expand=True)[0]
      .str.strip()
)

# Candidates in general_df
general_df["candidate"].value_counts()

candidate
BALDWIN, Chuck       441
BARR, Bob            441
McCAIN, John         441
McKINNEY, Cynthia    441
NADER, Ralph         441
OBAMA, Barack        441
Name: count, dtype: int64
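
The splitting rule can be checked on a few literal strings. This is a sketch on made-up input, not the notebook's data; the third entry is a hypothetical ticket with no running mate, included to show that a value without a slash passes through unchanged.

```python
import pandas as pd

tickets = pd.Series([
    "OBAMA, Barack / BIDEN, Joe",
    "McCAIN, John / PALIN, Sarah",
    "NADER, Ralph",  # hypothetical: no running mate listed
])

# Split once on the slash, keep the part before it, and trim whitespace
presidents = tickets.str.split("/", n=1).str[0].str.strip()
print(presidents.tolist())
```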

Next, we check what the `absentee`, `early_votes`, and `election_day` columns contain before deciding how to handle them.

In [19]:
# Check for values in "absentee", "early_votes", and "election_day" columns
general_df[["absentee", "early_votes", "election_day"]].describe(include="all")

Unnamed: 0,absentee,early_votes,election_day
count,0.0,0.0,0.0
mean,,,
std,,,
min,,,
25%,,,
50%,,,
75%,,,
max,,,


All three columns contain only missing values, so we drop them.

In [20]:
# Drop the three columns as they are all missing values
general_df = general_df.drop(columns=["absentee", "early_votes", "election_day"]).reset_index(drop=True)
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,precinct,party,candidate,votes
0,County of Hawaii,01-01,CON,"BALDWIN, Chuck",4
1,County of Hawaii,01-01,LIB,"BARR, Bob",0
2,County of Hawaii,01-01,REP,"McCAIN, John",169
3,County of Hawaii,01-01,GRE,"McKINNEY, Cynthia",2
4,County of Hawaii,01-01,IND,"NADER, Ralph",11
5,County of Hawaii,01-01,DEM,"OBAMA, Barack",567
6,County of Hawaii,01-02,CON,"BALDWIN, Chuck",3
7,County of Hawaii,01-02,LIB,"BARR, Bob",1
8,County of Hawaii,01-02,REP,"McCAIN, John",103
9,County of Hawaii,01-02,GRE,"McKINNEY, Cynthia",2


Now, we aggregate precinct vote counts into county vote counts.

In [21]:
# Make sure votes are numeric
general_df["votes"] = pd.to_numeric(general_df["votes"], errors="coerce").fillna(0)

# Aggregate precinct vote counts into county vote counts
general_df = (
    general_df
    .groupby(["county", "party", "candidate"], as_index=False)["votes"]
    .sum()
)[["county", "candidate", "party", "votes"]]        # Reorder columns

# Preview the aggregated data
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,candidate,party,votes
0,City & County of Honolulu,"BALDWIN, Chuck",CON,705
1,City & County of Honolulu,"OBAMA, Barack",DEM,260016
2,City & County of Honolulu,"McKINNEY, Cynthia",GRE,720
3,City & County of Honolulu,"NADER, Ralph",IND,2914
4,City & County of Honolulu,"BARR, Bob",LIB,968
5,City & County of Honolulu,"McCAIN, John",REP,99820
6,County of Hawaii,"BALDWIN, Chuck",CON,141
7,County of Hawaii,"OBAMA, Barack",DEM,29181
8,County of Hawaii,"McKINNEY, Cynthia",GRE,137
9,County of Hawaii,"NADER, Ralph",IND,403
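
A quick consistency check can confirm that the aggregation neither loses nor duplicates votes. The sketch below uses a made-up precinct table with the same column names, and the two assertions are a suggested sanity check rather than part of the original workflow.

```python
import pandas as pd

# Toy precinct-level data (values are illustrative only)
precincts = pd.DataFrame({
    "county": ["X", "X", "Y", "Y"],
    "party": ["DEM", "DEM", "DEM", "REP"],
    "candidate": ["OBAMA, Barack", "OBAMA, Barack", "OBAMA, Barack", "McCAIN, John"],
    "votes": [10, 15, 7, 3],
})

county = (
    precincts
    .groupby(["county", "party", "candidate"], as_index=False)["votes"]
    .sum()
)

# The grand total must be unchanged by the aggregation
assert county["votes"].sum() == precincts["votes"].sum()

# Each county/party/candidate combination must appear exactly once
assert not county.duplicated(["county", "party", "candidate"]).any()
```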


In [22]:
# List out all the parties in the general election data
general_df["party"].value_counts()

party
CON    4
DEM    4
GRE    4
IND    4
LIB    4
REP    4
Name: count, dtype: int64

In [23]:
# Missing values count
general_df.isnull().sum()

county       0
candidate    0
party        0
votes        0
dtype: int64

In [24]:
# Final look at cleaned general_df
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,candidate,party,votes
0,City & County of Honolulu,"BALDWIN, Chuck",CON,705
1,City & County of Honolulu,"OBAMA, Barack",DEM,260016
2,City & County of Honolulu,"McKINNEY, Cynthia",GRE,720
3,City & County of Honolulu,"NADER, Ralph",IND,2914
4,City & County of Honolulu,"BARR, Bob",LIB,968
5,City & County of Honolulu,"McCAIN, John",REP,99820
6,County of Hawaii,"BALDWIN, Chuck",CON,141
7,County of Hawaii,"OBAMA, Barack",DEM,29181
8,County of Hawaii,"McKINNEY, Cynthia",GRE,137
9,County of Hawaii,"NADER, Ralph",IND,403


In [25]:
# Shape after preprocessing
general_df.shape

(24, 4)

## 3. Table Pivoting

We convert the tall table (one row per county/party/candidate) into a wide table (one row per county with one column per candidate). This keeps the schema consistent with the previously cleaned datasets.

Helper functions:

- `normalize_party(s)`: lowercases the three-letter party abbreviations (e.g., `DEM` -> `dem`, `REP` -> `rep`) so column names are stable
- `candidate_token(name)`: turns “OBAMA, Barack” -> OBAMA, “McCAIN, John” -> MCCAIN, etc., creating a short, readable, unique token for column names
- `pivot_wide(df, prefix, key_col="county")`: main pivot function

    * groups by `county` × `party` × `candidate` and sums `votes`,
    * pivots to columns named like:
        * Primary: `pri_dem_OBAMA`, `pri_rep_MCCAIN`,...
        * General: `gen_dem_OBAMA`, `gen_rep_MCCAIN`,...

    * flattens the MultiIndex into plain column strings,
    * returns one wide row per county
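
The pivot-and-flatten step can be sketched on a toy table. This assumes the party key and candidate token have already been computed; the data and the `gen` prefix are illustrative only.

```python
import pandas as pd

# Toy tall table: one row per county/party/candidate
tall = pd.DataFrame({
    "county": ["X", "X", "Y", "Y"],
    "party_key": ["dem", "rep", "dem", "rep"],
    "candidate_token": ["OBAMA", "MCCAIN", "OBAMA", "MCCAIN"],
    "votes": [25, 9, 7, 3],
})

# Pivot to one row per county with (party, candidate) MultiIndex columns
wide = tall.pivot_table(
    index="county",
    columns=["party_key", "candidate_token"],
    values="votes",
    aggfunc="sum",
    fill_value=0,
)

# Flatten the MultiIndex into plain strings such as "gen_dem_OBAMA"
wide.columns = [f"gen_{party}_{cand}" for party, cand in wide.columns]
wide = wide.reset_index()

print(wide.columns.tolist())
```

Flattening before `reset_index()` keeps the list comprehension simple, because at that point every column label is still a `(party, candidate)` tuple.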

In [None]:
def normalize_party(s: pd.Series) -> pd.Series:
    """
    Normalize party names: Lowercase the three-letter abbreviations
    """
    return s.str.lower()

In [30]:
def candidate_token(name: str) -> str:
    """
    Turn "McCAIN, John" -> MCCAIN, "OBAMA, Barack" -> OBAMA
    Keep the surname (text before the first comma), uppercase it, and remove punctuation
    """
    if pd.isna(name):
        return "UNKNOWN"                # Defensive purposes only, would not expect missing values

    # Take everything before the first comma instead of assuming a trailing comma
    surname = str(name).split(",")[0]

    # Remove spaces and punctuation, then uppercase
    token = re.sub(r"[^A-Za-z0-9]", "", surname).upper()
    return token if token else "UNKNOWN"

In [31]:
def pivot_wide(df: pd.DataFrame, prefix: str, key_col: str="county") -> pd.DataFrame:
    """
    Pivot the dataframe to wide format based on party and candidate
    """
    # Work on a copy so the caller's dataframe is not modified
    df = df.copy()

    # Normalize party names
    df['party_key'] = normalize_party(df['party'])
    
    # Create candidate tokens
    df['candidate_token'] = df['candidate'].apply(candidate_token)
    
    # Pivot the dataframe
    pivot_df = df.pivot_table(index=key_col, 
                              columns=["party_key", "candidate_token"], 
                              values="votes", 
                              aggfunc='sum', 
                              fill_value=0)
    
    # Flatten multi-level columns into names like "gen_dem_OBAMA"
    pivot_df.columns = [f"{prefix}_{p}_{c}" for p, c in pivot_df.columns]
    
    # Reset index to turn key_col back into a column
    pivot_df = pivot_df.reset_index()
    
    return pivot_df

In [32]:
# General dataframe pivot
general_pivot = pivot_wide(general_df, prefix="gen")
general_pivot.head(DISPLAY_ROWS)

Unnamed: 0,county,gen_con_BALDWIN,gen_dem_OBAMA,gen_gre_MCKINNEY,gen_ind_NADER,gen_lib_BARR,gen_rep_MCCAIN
0,City & County of Honolulu,705,260016,720,2914,968,99820
1,County of Hawaii,141,29181,137,403,157,9296
2,County of Kauai,56,11314,53,185,51,3769
3,County of Maui,111,25360,69,323,138,7681


In [33]:
# General dataframe shape after pivot
general_pivot.shape

(4, 7)

## 4. Adding Party Total Columns

Now, we add party total columns for the general election:

* `rep_general_total` = sum of all `gen_rep_*` columns
* `dem_general_total` = sum of all `gen_dem_*` columns
* `lib_general_total` = sum of all `gen_lib_*` columns
* `con_general_total` = sum of all `gen_con_*` columns
* `gre_general_total` = sum of all `gen_gre_*` columns
* `ind_general_total` = sum of all `gen_ind_*` columns
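
Since all six totals follow the same rule, they could also be computed in a loop. The following is a sketch on a toy wide table, not the notebook's actual cell; the `PARTIES` list is an assumption that mirrors the six abbreviations above.

```python
import pandas as pd

# Toy wide table with only two of the six parties present
wide = pd.DataFrame({
    "county": ["X", "Y"],
    "gen_dem_OBAMA": [25, 7],
    "gen_rep_MCCAIN": [9, 3],
})

PARTIES = ["rep", "dem", "lib", "con", "gre", "ind"]  # assumed party keys

for party in PARTIES:
    # The trailing underscore avoids matching a longer party key by accident
    cols = [c for c in wide.columns if c.startswith(f"gen_{party}_")]
    # A party with no candidate columns gets a total of 0
    wide[f"{party}_general_total"] = wide[cols].sum(axis=1) if cols else 0

print(wide[["county", "dem_general_total", "rep_general_total"]])
```

The loop keeps the column order fixed by `PARTIES`, so the output schema stays the same even when a party has no candidate in a given state.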

In [34]:
# Add party totals for general election
rep_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_rep_")]
dem_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_dem_")]
lib_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_lib_")]
con_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_con_")]
gre_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_gre_")]
ind_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_ind_")]


general_pivot["rep_general_total"] = general_pivot[rep_general_cols].sum(axis=1) if rep_general_cols else 0
general_pivot["dem_general_total"] = general_pivot[dem_general_cols].sum(axis=1) if dem_general_cols else 0
general_pivot["lib_general_total"] = general_pivot[lib_general_cols].sum(axis=1) if lib_general_cols else 0
general_pivot["con_general_total"] = general_pivot[con_general_cols].sum(axis=1) if con_general_cols else 0
general_pivot["gre_general_total"] = general_pivot[gre_general_cols].sum(axis=1) if gre_general_cols else 0
general_pivot["ind_general_total"] = general_pivot[ind_general_cols].sum(axis=1) if ind_general_cols else 0

In [35]:
# Print out all the column names in the final dataframe
print("Final columns in the cleaned general dataframe:")
general_pivot.columns

Final columns in the cleaned general dataframe:


Index(['county', 'gen_con_BALDWIN', 'gen_dem_OBAMA', 'gen_gre_MCKINNEY',
       'gen_ind_NADER', 'gen_lib_BARR', 'gen_rep_MCCAIN', 'rep_general_total',
       'dem_general_total', 'lib_general_total', 'con_general_total',
       'gre_general_total', 'ind_general_total'],
      dtype='object')

In [36]:
# Preview the general_pivot dataframe with totals
general_pivot.head(DISPLAY_ROWS)

Unnamed: 0,county,gen_con_BALDWIN,gen_dem_OBAMA,gen_gre_MCKINNEY,gen_ind_NADER,gen_lib_BARR,gen_rep_MCCAIN,rep_general_total,dem_general_total,lib_general_total,con_general_total,gre_general_total,ind_general_total
0,City & County of Honolulu,705,260016,720,2914,968,99820,99820,260016,968,705,720,2914
1,County of Hawaii,141,29181,137,403,157,9296,9296,29181,157,141,137,403
2,County of Kauai,56,11314,53,185,51,3769,3769,11314,51,56,53,185
3,County of Maui,111,25360,69,323,138,7681,7681,25360,138,111,69,323


Now, we save the cleaned dataframe into the processed directory.

In [37]:
# Save the cleaned dataframe to CSV
out_dir = Path(OUTPUT_PATH)
out_dir.mkdir(parents=True, exist_ok=True)
general_pivot.to_csv(out_dir / "HI.csv", index=False)
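
After exporting, it can be useful to read the file back and confirm that it still has one row per county and the expected columns. The sketch below writes a toy table to a temporary directory, so it does not touch the real output path; the checks are suggestions rather than part of the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the final wide table
table = pd.DataFrame({"county": ["X", "Y"], "gen_dem_OBAMA": [25, 7]})

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "HI.csv"
    table.to_csv(out_file, index=False)
    round_trip = pd.read_csv(out_file)

# Same shape and columns, and each county appears exactly once
assert round_trip.shape == table.shape
assert list(round_trip.columns) == list(table.columns)
assert round_trip["county"].is_unique
```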