# Connecticut 2008 Presidential Elections: Data Cleaning & Preprocessing

**Goal:** Build a clean, analysis-ready county-level table of the 2008 presidential general election results for Connecticut, then derive summary statistics (party totals). Note: no presidential primary results are available for Connecticut 2008 so far, so only the general election is processed.

**Output**: A single CSV where each row is a county and columns include:

- General per-candidate vote counts (prefixed with `gen_`)
- Party totals: `rep_general_total`, `dem_general_total`, `ind_general_total`, `wri_general_total`

**Last Updated**: 2025/10/02

## 0. Library Import

In [1]:
import re
import pandas as pd
import numpy as np
from pathlib import Path


## 1. Inputs & Parameters

Define raw file paths once here so the entire notebook is easy to rerun on another machine. If a path changes, we only update it here. We keep a single `OUTPUT_PATH` so all exports land in one known place.

In [2]:
# CT 2008 dataset path
# PRIMARY_PATH = r""
GENERAL_PATH = r"../../data/raw/2008/CT/20081104__ct__general__town.csv"

# Output directory
OUTPUT_PATH  = r"../../data/processed/2008/CT/"

# Analysis parameters
DISPLAY_ROWS = 10   # Number of rows to display in dataframes

## 2. Load & Filter

We load the general election dataset (no primary dataset is available for this state and year) and immediately subset to the rows we need:

- Restrict `office` to 'President' to avoid mixing in down-ballot contests

- Remove columns that are fully missing or irrelevant after the filter (e.g., a `district` column that is empty for presidential rows)
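The two filtering steps above can be sketched on a small toy frame. This is only an illustration under assumed data: the rows and the names `toy` and `pres` are made up for the example, and `dropna(axis=1, how="all")` is one possible way to remove fully missing columns (the notebook itself drops `district` by name).

```python
import numpy as np
import pandas as pd

# Toy rows mimicking the raw layout (illustrative values, not real results)
toy = pd.DataFrame({
    "county": ["A", "A", "A"],
    "town": ["T1", "T1", "T1"],
    "office": ["President", "President", "US House"],
    "district": [np.nan, np.nan, 2.0],
    "party": ["Rep", "Dem", "Rep"],
    "candidate": ["X", "Y", "Z"],
    "votes": [10, 20, 5],
})

# Step 1: keep presidential rows only
pres = toy[toy["office"] == "President"]

# Step 2: drop columns that are entirely missing after the filter
pres = pres.dropna(axis=1, how="all")

# 'office' is now constant, so it carries no information and can go too
pres = pres.drop(columns=["office"]).reset_index(drop=True)
print(pres.columns.tolist())
```

Dropping by "all values missing" rather than by name may generalize better to other states, at the cost of being less explicit about which columns disappear.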

### b. General Election Dataset

In [3]:
# Load general data
general_df = pd.read_csv(GENERAL_PATH)
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,town,office,district,party,candidate,votes
0,Tolland,Andover,President,,Rep,John McCain,745
1,Tolland,Andover,President,,Dem,Barack Obama,1090
2,Tolland,Andover,President,,Ind,Ralph Nader,31
3,Tolland,Andover,President,,write-in,Chuck Baldwin,0
4,Tolland,Andover,President,,write-in,Roger Calero,0
5,Tolland,Andover,President,,write-in,Cynthia McKinney,0
6,Tolland,Andover,President,,write-in,Stewart Moore,0
7,Tolland,Andover,US House,2.0,Rep,Sean Sullivan,563
8,Tolland,Andover,US House,2.0,Dem,Joe Courtney,1030
9,Tolland,Andover,US House,2.0,Green,Scott Deshefy,38


In [4]:
# Different values in 'office' column
general_df["office"].value_counts()

office
President       1190
US House         934
State House      611
State Senate     518
Name: count, dtype: int64

In [5]:
# Only keep rows where 'office' is 'President'
general_df = general_df[general_df["office"] == "President"]
general_df.shape

(1190, 7)

In [6]:
# Now, drop the "office" column as it's no longer needed
# Also, drop the district column as it's not applicable 
general_df = general_df.drop(columns=["office", "district"]).reset_index(drop=True)
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,town,party,candidate,votes
0,Tolland,Andover,Rep,John McCain,745
1,Tolland,Andover,Dem,Barack Obama,1090
2,Tolland,Andover,Ind,Ralph Nader,31
3,Tolland,Andover,write-in,Chuck Baldwin,0
4,Tolland,Andover,write-in,Roger Calero,0
5,Tolland,Andover,write-in,Cynthia McKinney,0
6,Tolland,Andover,write-in,Stewart Moore,0
7,New Haven,Ansonia,Rep,John McCain,2918
8,New Haven,Ansonia,Dem,Barack Obama,4616
9,New Haven,Ansonia,Ind,Ralph Nader,124


Now, we aggregate town vote counts into county vote counts.

In [7]:
# Make sure votes are numeric
general_df["votes"] = pd.to_numeric(general_df["votes"], errors="coerce").fillna(0).astype(int)

# Aggregate town vote counts into county vote counts
general_df = (
    general_df.
    groupby(["county", "party", "candidate"], as_index=False)["votes"]
    .sum()
)[["county", "candidate", "party", "votes"]]        # Reorder columns

# Preview the aggregated data
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,candidate,party,votes
0,Fairfield,Barack Obama,Dem,242936
1,Fairfield,Ralph Nader,Ind,3018
2,Fairfield,John McCain,Rep,167736
3,Fairfield,Chuck Baldwin,write-in,42
4,Fairfield,Cynthia McKinney,write-in,6
5,Fairfield,Roger Calero,write-in,2
6,Fairfield,Stewart Moore,write-in,1
7,Hartford,Barack Obama,Dem,268721
8,Hartford,Ralph Nader,Ind,4909
9,Hartford,John McCain,Rep,138984


In [8]:
# Candidates in general_df
general_df["candidate"].value_counts()

candidate
Barack Obama        9
Ralph Nader         9
John McCain         9
Chuck Baldwin       9
Cynthia McKinney    9
Roger Calero        9
Stewart Moore       9
Name: count, dtype: int64

In [9]:
# List out all the parties in the general election data
general_df["party"].value_counts()

party
write-in    36
Dem          9
Ind          9
Rep          9
Name: count, dtype: int64

In [10]:
# Missing values count
general_df.isnull().sum()

county       0
candidate    0
party        0
votes        0
dtype: int64

In [11]:
# Final look at the cleaned general_df
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,candidate,party,votes
0,Fairfield,Barack Obama,Dem,242936
1,Fairfield,Ralph Nader,Ind,3018
2,Fairfield,John McCain,Rep,167736
3,Fairfield,Chuck Baldwin,write-in,42
4,Fairfield,Cynthia McKinney,write-in,6
5,Fairfield,Roger Calero,write-in,2
6,Fairfield,Stewart Moore,write-in,1
7,Hartford,Barack Obama,Dem,268721
8,Hartford,Ralph Nader,Ind,4909
9,Hartford,John McCain,Rep,138984


In [12]:
# Shape after preprocessing
general_df.shape

(63, 4)

## 3. Table Pivoting

We convert the tall table (one row per county/party/candidate) into a wide table (one row per county, one column per candidate). This keeps the schema consistent with the previously cleaned datasets from the group.

Helper functions:

- `normalize_party(s)`: lowercases party labels (and maps `write-in` to `wri`) so column names stay consistent with the other dataframes
- `candidate_token(name)`: creates a short, readable, unique token for column names, e.g. “Barack Obama” -> `OBAMA`, “John McCain” -> `MCCAIN`
- `pivot_wide(df, prefix, key_col="county")`: main pivot function

    * groups by `county` × `party` × `candidate` and sums `votes`,
    * pivots to columns named like:
        * Primary: `pri_dem_OBAMA`, `pri_rep_MCCAIN`,...
        * General: `gen_dem_OBAMA`, `gen_rep_MCCAIN`,...

    * flattens the MultiIndex into plain column strings,
    * returns one wide row per county
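As a rough illustration of the tall-to-wide step described above, here is a minimal sketch on toy data. The frame `tall`, its vote counts, and the county labels are invented for the example; the party keys and candidate tokens are assumed to be already normalized.

```python
import pandas as pd

# Toy tall table: one row per county/party/candidate (illustrative values)
tall = pd.DataFrame({
    "county": ["A", "A", "B", "B"],
    "party_key": ["dem", "rep", "dem", "rep"],
    "candidate_token": ["OBAMA", "MCCAIN", "OBAMA", "MCCAIN"],
    "votes": [30, 20, 5, 15],
})

# Pivot: counties become rows, (party, candidate) pairs become columns
wide = tall.pivot_table(
    index="county",
    columns=["party_key", "candidate_token"],
    values="votes",
    aggfunc="sum",
    fill_value=0,
)

# Flatten the two-level column index into strings like 'gen_dem_OBAMA'
wide.columns = [f"gen_{p}_{c}" for p, c in wide.columns]

# Turn the county index back into a regular column
wide = wide.reset_index()
print(wide)
```

Each county ends up as a single row, and a missing county/candidate combination would show up as 0 because of `fill_value=0`.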

In [13]:
def normalize_party(s: pd.Series) -> pd.Series:
    """
    Normalize party labels for use in column names:
    'write-in' -> 'wri'; every other label is stripped and lowercased
    (e.g. 'Dem' -> 'dem', 'Rep' -> 'rep').
    """
    cleaned = s.str.strip()
    return (cleaned
            .str.capitalize()
            .map({"Write-in": "wri"})
            .fillna(cleaned.str.lower()))      # Any label without an explicit mapping falls back to lowercase

In [14]:
SUFFIXES = {
    "JR","SR","JNR","SNR",
    "II","III","IV","V","VI","VII","VIII","IX","X","XI","XII"
}

def candidate_token(name: str) -> str:
    """
    Turn John McCain -> MCCAIN, Barack Obama -> OBAMA
    Skip suffixes, keep the last name token, uppercase it, and remove punctuation
    """
    if pd.isna(name):
        return "UNKNOWN"                # Defensive purposes only, would not expect missing values
    
    # Coerce to string and trim surrounding whitespace
    raw = str(name).strip()

    # If a comma exists, treat as 'LAST, FIRST ...'
    if "," in raw:
        last_part = raw.split(",", 1)[0]
        last_part = re.sub(r"[^A-Za-z0-9\s]+", "", last_part).strip().upper()
        tokens = last_part.split()
        return tokens[-1] if tokens else "UNKNOWN"

    # Otherwise: remove punctuation, split, then drop trailing suffixes
    tokens = re.sub(r"[^A-Za-z0-9\s]+", "", raw).strip().upper().split()
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return tokens[-1] if tokens else "UNKNOWN"

In [15]:
def pivot_wide(df: pd.DataFrame, prefix: str, key_col: str="county") -> pd.DataFrame:
    """
    Pivot the dataframe to wide format based on party and candidate
    """
    # Work on a copy so the caller's dataframe is not mutated
    df = df.copy()

    # Normalize party names
    df['party_key'] = normalize_party(df['party'])
    
    # Create candidate tokens
    df['candidate_token'] = df['candidate'].apply(candidate_token)
    
    # Pivot the dataframe
    pivot_df = df.pivot_table(index=key_col, 
                              columns=["party_key", "candidate_token"], 
                              values="votes", 
                              aggfunc='sum', 
                              fill_value=0)
    
    # Flatten multi-level columns
    pivot_df.columns = [f"{prefix}_{p}_{c}" for p, c in pivot_df.columns]
    
    # Reset index to turn key_col back into a column
    pivot_df = pivot_df.reset_index()
    
    return pivot_df

In [16]:
# General dataframe pivot
general_pivot = pivot_wide(general_df, prefix="gen")
general_pivot.head(DISPLAY_ROWS)

Unnamed: 0,county,gen_dem_OBAMA,gen_ind_NADER,gen_rep_MCCAIN,gen_wri_BALDWIN,gen_wri_CALERO,gen_wri_MCKINNEY,gen_wri_MOORE
0,Fairfield,242936,3018,167736,42,2,6,1
1,Hartford,268721,4909,138984,84,7,20,6
2,Litchfield,51041,1726,46173,22,0,4,0
3,Middlesex,52983,1334,32918,11,0,4,2
4,New Haven,233589,4417,144650,74,4,26,1
5,New London,74776,1559,48491,34,0,7,7
6,Tolland,45053,1148,29266,17,1,14,2
7,Total,997772,19162,629428,311,15,90,19
8,Windham,28673,925,21210,27,1,9,0


In [17]:
# General dataframe shape after pivot
general_pivot.shape

(9, 8)

Notice that `general_pivot` contains a `Total` row. It is an aggregate rather than a county, so we drop it.

In [18]:
# Drop the total observation 
general_pivot = general_pivot.drop(general_pivot[general_pivot['county'].str.lower() == 'total'].index).reset_index(drop=True)
general_pivot.head(DISPLAY_ROWS)

Unnamed: 0,county,gen_dem_OBAMA,gen_ind_NADER,gen_rep_MCCAIN,gen_wri_BALDWIN,gen_wri_CALERO,gen_wri_MCKINNEY,gen_wri_MOORE
0,Fairfield,242936,3018,167736,42,2,6,1
1,Hartford,268721,4909,138984,84,7,20,6
2,Litchfield,51041,1726,46173,22,0,4,0
3,Middlesex,52983,1334,32918,11,0,4,2
4,New Haven,233589,4417,144650,74,4,26,1
5,New London,74776,1559,48491,34,0,7,7
6,Tolland,45053,1148,29266,17,1,14,2
7,Windham,28673,925,21210,27,1,9,0
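
Before discarding a `Total` row, it can be worth confirming that it really equals the sum of the county rows. The sketch below does this for the `gen_dem_OBAMA` column only, using the values shown in the pivoted output above; the variable names (`pivot`, `is_total`, `counties_only`) are illustrative and not part of the notebook.

```python
import pandas as pd

# gen_dem_OBAMA values copied from the pivoted table, including the Total row
pivot = pd.DataFrame({
    "county": ["Fairfield", "Hartford", "Litchfield", "Middlesex",
               "New Haven", "New London", "Tolland", "Total", "Windham"],
    "gen_dem_OBAMA": [242936, 268721, 51041, 52983,
                      233589, 74776, 45053, 997772, 28673],
})

# Flag the aggregate row (case-insensitive, as in the drop step)
is_total = pivot["county"].str.lower() == "total"

# Compare the reported total with the sum over the real counties
county_sum = pivot.loc[~is_total, "gen_dem_OBAMA"].sum()
reported = pivot.loc[is_total, "gen_dem_OBAMA"].iloc[0]
matches = county_sum == reported
print("Total row matches county sum:", matches)

# Keep only the county rows
counties_only = pivot[~is_total].reset_index(drop=True)
```

If the check failed, that would suggest a missing or duplicated town in the raw file, which would be worth investigating before exporting.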


## 4. Adding Party Total Columns

Now, we add party total columns for the general election:

* `rep_general_total` = sum of all `gen_rep_*` columns
* `dem_general_total` = sum of all `gen_dem_*` columns
* `ind_general_total` = sum of all `gen_ind_*` columns
* `wri_general_total` = sum of all `gen_wri_*` columns
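
The same totals can be computed in a loop over party keys. This is a minimal sketch on a toy frame (the counts and the name `wide` are made up); it assumes the write-in columns carry the `gen_wri_` prefix that the pivot step produces.

```python
import pandas as pd

# Toy wide table (illustrative values only)
wide = pd.DataFrame({
    "county": ["A", "B"],
    "gen_dem_OBAMA": [30, 5],
    "gen_rep_MCCAIN": [20, 15],
    "gen_wri_BALDWIN": [1, 0],
    "gen_wri_MOORE": [2, 3],
})

for party in ["rep", "dem", "ind", "wri"]:
    # Match on the full 'gen_<party>_' prefix to avoid accidental overlaps
    cols = [c for c in wide.columns if c.startswith(f"gen_{party}_")]
    # A party with no candidate columns gets a total of 0
    wide[f"{party}_general_total"] = wide[cols].sum(axis=1) if cols else 0

print(wide.filter(like="_general_total"))
```

In this toy frame there is no `gen_ind_*` column, so `ind_general_total` is filled with 0 rather than raising an error.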


In [20]:
# Add party totals for general election
rep_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_rep_")]
dem_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_dem_")]
ind_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_ind_")]
wri_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_wri_")]

# Sum across each party's candidate columns; fall back to 0 if a party has none
general_pivot["rep_general_total"] = general_pivot[rep_general_cols].sum(axis=1) if rep_general_cols else 0
general_pivot["dem_general_total"] = general_pivot[dem_general_cols].sum(axis=1) if dem_general_cols else 0
general_pivot["ind_general_total"] = general_pivot[ind_general_cols].sum(axis=1) if ind_general_cols else 0
general_pivot["wri_general_total"] = general_pivot[wri_general_cols].sum(axis=1) if wri_general_cols else 0

In [21]:
# Print out all the column names in the final dataframe
print("Final columns in the cleaned general dataframe:")
general_pivot.columns

Final columns in the cleaned general dataframe:


Index(['county', 'gen_dem_OBAMA', 'gen_ind_NADER', 'gen_rep_MCCAIN',
       'gen_wri_BALDWIN', 'gen_wri_CALERO', 'gen_wri_MCKINNEY',
       'gen_wri_MOORE', 'rep_general_total', 'dem_general_total',
       'ind_general_total', 'wri_general_total'],
      dtype='object')

In [22]:
# Preview the general_pivot dataframe with totals
general_pivot.head(DISPLAY_ROWS)

Unnamed: 0,county,gen_dem_OBAMA,gen_ind_NADER,gen_rep_MCCAIN,gen_wri_BALDWIN,gen_wri_CALERO,gen_wri_MCKINNEY,gen_wri_MOORE,rep_general_total,dem_general_total,ind_general_total,wri_general_total
0,Fairfield,242936,3018,167736,42,2,6,1,167736,242936,3018,51
1,Hartford,268721,4909,138984,84,7,20,6,138984,268721,4909,117
2,Litchfield,51041,1726,46173,22,0,4,0,46173,51041,1726,26
3,Middlesex,52983,1334,32918,11,0,4,2,32918,52983,1334,17
4,New Haven,233589,4417,144650,74,4,26,1,144650,233589,4417,105
5,New London,74776,1559,48491,34,0,7,7,48491,74776,1559,48
6,Tolland,45053,1148,29266,17,1,14,2,29266,45053,1148,34
7,Windham,28673,925,21210,27,1,9,0,21210,28673,925,37


Now, we save the cleaned dataframe into the processed directory.

In [23]:
# Save the cleaned dataframe to CSV
out_dir = Path(OUTPUT_PATH)
out_dir.mkdir(parents=True, exist_ok=True)
general_pivot.to_csv(out_dir / "CT.csv", index=False)