# Iowa 2008 Presidential Elections: Data Cleaning & Preprocessing

**Goal:** Build a clean, analysis-ready county-level table for Iowa, 2008 by merging the presidential primary and general election results, then derive summary statistics (party totals). Note: the available Iowa 2008 primary file contains no presidential results so far, so the table below covers the general election only.

**Output**: A single CSV where each row is a county and columns include:

- Primary per-candidate vote counts (prefixed with `pri_`), once presidential primary data is available (none for Iowa 2008 so far)
- General per-candidate vote counts (prefixed with `gen_`)
- Party totals: `rep_general_total`, `dem_general_total`, `lib_general_total`, `cst_general_total`, `grn_general_total`, `pfp_general_total`, `swp_general_total`, `psl_general_total`, `spu_general_total`

**Last Updated**: 2025/10/02

## 0. Library Import

In [1]:
import re
import pandas as pd
import numpy as np
from pathlib import Path

## 1. Inputs & Parameters

Define raw file paths once here so the entire notebook is easy to rerun on another machine. If a path changes, we only update it here. We keep a single `OUTPUT_PATH` so all exports land in one known place.

In [2]:
# IA 2008 dataset path
PRIMARY_PATH = r"../../data/raw/2008/IA/20080603__ia__primary__county.csv"
GENERAL_PATH = r"../../data/raw/2008/IA/20081104__ia__general__county.csv"

# Output directory
OUTPUT_PATH  = r"../../data/processed/2008/IA/"

# Analysis parameters
DISPLAY_ROWS = 10   # Number of rows to display in dataframes
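
Because every path is defined in one place, a quick pre-flight check can report missing inputs before any loading starts. The following is a minimal sketch, not part of the original notebook; the helper name `missing_inputs` and the example path are hypothetical.

```python
from pathlib import Path

def missing_inputs(*paths):
    """Return the given paths (as strings) that do not point to an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]

# Hypothetical path, chosen only because it should not exist
missing = missing_inputs("this/path/does/not/exist.csv")
print(missing)
```

In the notebook this could be called as `missing_inputs(PRIMARY_PATH, GENERAL_PATH)`, and an empty list would mean both raw files were found.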

## 2. Load & Filter

We load primary and general datasets separately and immediately subset to the rows we truly need:

- Restrict `office` to the presidential contest (labeled `President/Vice President` in the general file) to avoid mixing in down-ballot contests

- Remove columns that are fully missing or irrelevant post-filter (e.g., a district column that’s empty for county-level rows)

### a. Primary Election Dataset

In [3]:
# Load primary data
primary_df = pd.read_csv(PRIMARY_PATH)
primary_df.head(DISPLAY_ROWS)

Unnamed: 0,office,district,candidate,party,reporting_level,jurisdiction,votes
0,U.S. Senate,,TOM HARKIN,Democrat,county,Adair,199
1,U.S. Senate,,OVER VOTES,Democrat,county,Adair,0
2,U.S. Senate,,UNDER VOTES,Democrat,county,Adair,19
3,U.S. Senate,,SCATTERING,Democrat,county,Adair,0
4,U.S. Senate,,TOTAL,Democrat,county,Adair,218
5,U.S. Senate,,TOM HARKIN,Democrat,county,Adams,256
6,U.S. Senate,,OVER VOTES,Democrat,county,Adams,0
7,U.S. Senate,,UNDER VOTES,Democrat,county,Adams,0
8,U.S. Senate,,SCATTERING,Democrat,county,Adams,0
9,U.S. Senate,,TOTAL,Democrat,county,Adams,256


In [4]:
# Different values in 'office' column
primary_df["office"].value_counts()

office
State Representative            2910
U.S. Senate                     1200
United States Representative    1172
State Senator                   1045
Name: count, dtype: int64

There are no presidential rows, so the current `primary_df` lacks the presidential contest we are interested in. We stop exploring this dataframe here unless a future exploratory direction calls for it.
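
The check for presidential rows can also be done programmatically instead of by reading the `value_counts()` output. This is a minimal sketch under the assumption that a presidential contest always has "president" somewhere in its `office` label; the helper name `presidential_offices` and the toy frames are illustrative only.

```python
import pandas as pd

def presidential_offices(df, col="office"):
    """List the distinct office labels that mention 'president' (case-insensitive)."""
    offices = df[col].dropna().astype(str)
    mask = offices.str.lower().str.contains("president", regex=False)
    return sorted(offices[mask].unique())

# Toy frames that reuse office labels seen in the two files
toy_primary = pd.DataFrame({"office": ["U.S. Senate", "State Senator", "State Representative"]})
toy_general = pd.DataFrame({"office": ["President/Vice President", "U.S. Senate"]})

primary_hits = presidential_offices(toy_primary)
general_hits = presidential_offices(toy_general)
print(primary_hits, general_hits)
```

An empty list for the primary frame corresponds to the situation described above: there is nothing to filter on.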

### b. General Election Dataset

In [5]:
# Load general data
general_df = pd.read_csv(GENERAL_PATH)
general_df.head(DISPLAY_ROWS)

Unnamed: 0,office,district,candidate,party,county,votes
0,President/Vice President,,BARACK OBAMA / JOE BIDEN,Democrat,Adair,1924
1,President/Vice President,,JOHN MCCAIN / SARAH PALIN,Republican,Adair,2060
2,President/Vice President,,CHUCK BALDWIN / DARRELL L. CASTLE,Constitution Party,Adair,11
3,President/Vice President,,CYNTHIA MCKINNEY / ROSA CLEMENTE,Green Party,Adair,4
4,President/Vice President,,BOB BARR / WAYNE A. ROOT,Libertarian,Adair,10
5,President/Vice President,,RALPH NADER / MATT GONZALEZ,Peace and Freedom,Adair,31
6,President/Vice President,,JAMES HARRIS / ALYSON KENNEDY,Socialist Workers Party,Adair,0
7,President/Vice President,,OVER VOTES,,Adair,0
8,President/Vice President,,UNDER VOTES,,Adair,0
9,President/Vice President,,SCATTERING,,Adair,13


In [6]:
# Different values in 'office' column
general_df["office"].value_counts()

office
State Representative            1698
President/Vice President        1300
United States Representative     625
State Senator                    618
U.S. Senate                      504
Name: count, dtype: int64

In [7]:
# Only keep rows where 'office' is 'President/Vice President'
general_df = general_df[general_df["office"] == "President/Vice President"]
general_df.shape

(1300, 6)

In [8]:
# Now, drop the "office" column as it's no longer needed
# Also, drop the district column as it's not applicable 
general_df = general_df.drop(columns=["office", "district"]).reset_index(drop=True)
general_df.head(DISPLAY_ROWS)

Unnamed: 0,candidate,party,county,votes
0,BARACK OBAMA / JOE BIDEN,Democrat,Adair,1924
1,JOHN MCCAIN / SARAH PALIN,Republican,Adair,2060
2,CHUCK BALDWIN / DARRELL L. CASTLE,Constitution Party,Adair,11
3,CYNTHIA MCKINNEY / ROSA CLEMENTE,Green Party,Adair,4
4,BOB BARR / WAYNE A. ROOT,Libertarian,Adair,10
5,RALPH NADER / MATT GONZALEZ,Peace and Freedom,Adair,31
6,JAMES HARRIS / ALYSON KENNEDY,Socialist Workers Party,Adair,0
7,OVER VOTES,,Adair,0
8,UNDER VOTES,,Adair,0
9,SCATTERING,,Adair,13


Not every row is a candidate: the `candidate` column also holds entries such as `OVER VOTES`, `UNDER VOTES`, and `SCATTERING`. We now look at all the values in the `candidate` column.

In [9]:
# Values in "candidate" column
general_df["candidate"].value_counts()

candidate
BARACK OBAMA / JOE BIDEN              100
JOHN MCCAIN / SARAH PALIN             100
CHUCK BALDWIN / DARRELL L. CASTLE     100
CYNTHIA MCKINNEY / ROSA CLEMENTE      100
BOB BARR / WAYNE A. ROOT              100
RALPH NADER / MATT GONZALEZ           100
JAMES HARRIS / ALYSON KENNEDY         100
OVER VOTES                            100
UNDER VOTES                           100
SCATTERING                            100
GLORIA LA RIVA / ROBERT MOSES         100
BRIAN MOORE / STEWART A. ALEXANDER    100
TOTAL                                 100
Name: count, dtype: int64

We can safely drop "OVER VOTES", "UNDER VOTES", "SCATTERING", and "TOTAL", as they are not candidates and do not serve our main purpose of modeling.

In [15]:
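# Drop rows with "OVER VOTES", "UNDER VOTES", "SCATTERING" and "TOTAL" in "candidate" column
# .copy() makes the result an independent dataframe, so later column assignments do not raise SettingWithCopyWarning
general_df = general_df[~general_df["candidate"].isin(["OVER VOTES", "UNDER VOTES", "SCATTERING", "TOTAL"])].copy()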
general_df.shape

(900, 4)
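
Before discarding the bookkeeping rows, it can be useful to record how many votes they hold, so the drop is documented rather than silent. The sketch below uses a toy frame with made-up vote counts (not values from the Iowa file); the names `NON_CANDIDATE`, `dropped_votes`, and `kept` are illustrative.

```python
import pandas as pd

NON_CANDIDATE = ["OVER VOTES", "UNDER VOTES", "SCATTERING", "TOTAL"]

# Toy rows with made-up vote counts (not taken from the Iowa file)
toy = pd.DataFrame({
    "candidate": ["CANDIDATE A / RUNNING MATE A", "CANDIDATE B / RUNNING MATE B",
                  "OVER VOTES", "SCATTERING", "TOTAL"],
    "votes": [120, 95, 0, 4, 219],
})

is_bookkeeping = toy["candidate"].isin(NON_CANDIDATE)

# Votes held by each bookkeeping label, recorded before the rows are discarded
dropped_votes = toy[is_bookkeeping].groupby("candidate")["votes"].sum()

# Rows that survive the filter
kept = toy[~is_bookkeeping].copy()

print(dropped_votes.to_dict())
print(len(kept))
```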

Another thing to notice is the format of the remaining candidate names: "President / Vice President". We only want the presidential candidate.

In [16]:
# Keep only the presidential candidate in the "candidate" column
general_df["candidate"] = (
    general_df["candidate"]
      .str.split(r"(?i)\s*(?:\band\b|&|/)\s*", n=1, expand=True)[0]
      .str.strip()
)

# Candidates in general_df
general_df["candidate"].value_counts()

candidate
BARACK OBAMA        100
JOHN MCCAIN         100
CHUCK BALDWIN       100
CYNTHIA MCKINNEY    100
BOB BARR            100
RALPH NADER         100
JAMES HARRIS        100
GLORIA LA RIVA      100
BRIAN MOORE         100
Name: count, dtype: int64
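
To make the ticket-splitting step concrete, here is a small sketch on a few ticket strings from the value counts above. It splits on the slash only, which is a simplification of the notebook's pattern, and passes `regex=True` explicitly (available in pandas 1.4 and later).

```python
import pandas as pd

tickets = pd.Series([
    "BARACK OBAMA / JOE BIDEN",
    "GLORIA LA RIVA / ROBERT MOSES",
    "BRIAN MOORE / STEWART A. ALEXANDER",
    "SCATTERING",  # no separator: the whole string is kept
])

# Split once on the first slash (with optional surrounding spaces) and keep the left part
heads = (
    tickets
      .str.split(r"\s*/\s*", n=1, expand=True, regex=True)[0]
      .str.strip()
)
print(heads.tolist())
```

Strings without a separator pass through unchanged, which is why the bookkeeping rows are removed before this step rather than relying on the split to handle them.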

In [17]:
# List out all the parties in the general election data
general_df["party"].value_counts()

party
Democrat                              100
Republican                            100
Constitution Party                    100
Green Party                           100
Libertarian                           100
Peace and Freedom                     100
Socialist Workers Party               100
Party for Socialism and Liberation    100
Socialist Party USA                   100
Name: count, dtype: int64

In [18]:
# Missing values count
general_df.isnull().sum()

candidate    0
party        0
county       0
votes        0
dtype: int64

In [19]:
# Final look at the cleaned general_df
general_df.head(DISPLAY_ROWS)

Unnamed: 0,candidate,party,county,votes
0,BARACK OBAMA,Democrat,Adair,1924
1,JOHN MCCAIN,Republican,Adair,2060
2,CHUCK BALDWIN,Constitution Party,Adair,11
3,CYNTHIA MCKINNEY,Green Party,Adair,4
4,BOB BARR,Libertarian,Adair,10
5,RALPH NADER,Peace and Freedom,Adair,31
6,JAMES HARRIS,Socialist Workers Party,Adair,0
10,BARACK OBAMA,Democrat,Adams,1118
11,JOHN MCCAIN,Republican,Adams,1046
12,CHUCK BALDWIN,Constitution Party,Adams,7


In [20]:
# Shape after preprocessing
general_df.shape

(900, 4)

## 3. Table Pivoting

We convert the tall table (one row per county/party/candidate) into a wide one (one row per county with one column per candidate). This keeps the schema consistent with the group's previously cleaned data.

Helper functions:

- `normalize_party(s)`: maps each party name to a short lowercase key (`dem`, `rep`, ...) so column names stay stable across dataframes
- `candidate_token(name)`: turns "Barack Obama" -> `OBAMA`, "John McCain" -> `MCCAIN`, etc., creating a short, readable token for column names (unique as long as no two candidates of the same party share a final name word)
- `pivot_wide(df, prefix, key_col="county")`: main pivot function, which

    * groups by `county` × `party` × `candidate` and sums `votes`,
    * pivots to columns named like:
        * Primary: `pri_dem_OBAMA`, `pri_rep_MCCAIN`, ...
        * General: `gen_dem_OBAMA`, `gen_rep_MCCAIN`, ...

    * flattens the MultiIndex into plain column strings,
    * returns one wide row per county
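
The tall-to-wide steps listed above can be sketched on a toy frame. The county names and vote counts here are made up, and the party keys and tokens are assumed to be already normalized; the column names `party_key` and `candidate_token` follow the helpers described above.

```python
import pandas as pd

# Toy tall table: one row per county/party/candidate (made-up numbers)
tall = pd.DataFrame({
    "county": ["A", "A", "B", "B"],
    "party_key": ["dem", "rep", "dem", "rep"],
    "candidate_token": ["OBAMA", "MCCAIN", "OBAMA", "MCCAIN"],
    "votes": [10, 7, 3, 8],
})

# Pivot: counties become rows, (party, candidate) pairs become columns
wide = tall.pivot_table(index="county",
                        columns=["party_key", "candidate_token"],
                        values="votes",
                        aggfunc="sum",
                        fill_value=0)

# Flatten the two-level column index into plain strings such as gen_dem_OBAMA
wide.columns = [f"gen_{p}_{c}" for p, c in wide.columns]
wide = wide.reset_index()
print(wide)
```

Using `aggfunc="sum"` means duplicate county/party/candidate rows are added together rather than raising an error, which is convenient but can also hide duplicated input rows.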

In [21]:
def normalize_party(s: pd.Series) -> pd.Series:
    """
    Normalize party names to short keys: Democrat -> dem, Republican -> rep, etc.
    """
    return(s.str.strip()
           .str.capitalize()
           .map({
                "Democrat"               : "dem", 
                "Republican"             : "rep",
                "Libertarian"            : "lib",
                "Constitution party"     : "cst",
                "Green party"            : "grn",
                "Peace and freedom"      : "pfp",
                "Socialist workers party": "swp",
                "Party for socialism and liberation": "psl",
                "Socialist party usa"    : "spu"
               })
           .fillna(s.str.strip().str.lower()))      # For defensive purposes only, would not expect other parties

In [22]:
SUFFIXES = {
    "JR","SR","JNR","SNR",
    "II","III","IV","V","VI","VII","VIII","IX","X","XI","XII"
}

def candidate_token(name: str) -> str:
    """
    Turn John McCain -> MCCAIN, Barack Obama -> OBAMA
    Skip suffixes, keep the last name token, uppercase it, and remove punctuation.
    Multi-word surnames keep only their final word (GLORIA LA RIVA -> RIVA).
    """
    if pd.isna(name):
        return "UNKNOWN"                # Defensive purposes only, would not expect missing values
    
    # Trim surrounding whitespace
    raw = str(name).strip()

    # If a comma exists, treat as 'LAST, FIRST ...'
    if "," in raw:
        last_part = raw.split(",", 1)[0]
        last_part = re.sub(r"[^A-Za-z0-9\s]+", "", last_part).strip().upper()
        tokens = last_part.split()
        return tokens[-1] if tokens else "UNKNOWN"

    # Otherwise: remove punctuation, split, then drop trailing suffixes
    tokens = re.sub(r"[^A-Za-z0-9\s]+", "", raw).strip().upper().split()
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return tokens[-1] if tokens else "UNKNOWN"
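
Since the token keeps only the final name word, two different candidates of the same party could in principle collapse into one column. A minimal sketch of a collision check follows; `last_token` is a simplified stand-in for `candidate_token` (no suffix or comma handling), and the toy names are made up.

```python
import pandas as pd

def last_token(name):
    """Simplified stand-in for candidate_token: upper-cased final word of the name."""
    return str(name).upper().split()[-1]

# Made-up candidates: two Democrats share a surname, one Republican has the same full name
toy = pd.DataFrame({
    "party": ["dem", "dem", "rep"],
    "candidate": ["JOHN SMITH", "JANE SMITH", "JOHN SMITH"],
})
toy["token"] = toy["candidate"].map(last_token)

# Count distinct full names behind each (party, token) pair; more than one is a collision
per_key = toy.groupby(["party", "token"])["candidate"].nunique()
collisions = per_key[per_key > 1]
print(collisions)
```

An empty `collisions` result would suggest the tokens are safe to use as column names for that dataset.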

In [23]:
def pivot_wide(df: pd.DataFrame, prefix: str, key_col: str="county") -> pd.DataFrame:
    """
    Pivot the dataframe to wide format based on party and candidate
    """
    # Work on a copy so the caller's dataframe is not mutated
    df = df.copy()

    # Normalize party names
    df['party_key'] = normalize_party(df['party'])
    
    # Create candidate tokens
    df['candidate_token'] = df['candidate'].apply(candidate_token)
    
    # Pivot the dataframe
    pivot_df = df.pivot_table(index=key_col, 
                              columns=["party_key", "candidate_token"], 
                              values="votes", 
                              aggfunc='sum', 
                              fill_value=0)
    
    # Flatten multi-level columns
    pivot_df.columns = [f"{prefix}_{p}_{c}" for p, c in pivot_df.columns]
    
    # Reset index to turn key_col back into a column
    pivot_df = pivot_df.reset_index()
    
    return pivot_df

In [24]:
# General dataframe pivot
general_pivot = pivot_wide(general_df, prefix="gen")
general_pivot.head(DISPLAY_ROWS)

Unnamed: 0,county,gen_cst_BALDWIN,gen_dem_OBAMA,gen_grn_MCKINNEY,gen_lib_BARR,gen_pfp_NADER,gen_psl_RIVA,gen_rep_MCCAIN,gen_spu_MOORE,gen_swp_HARRIS
0,Adair,11,1924,4,10,31,0,2060,0,0
1,Adams,7,1118,4,6,13,0,1046,0,5
2,Allamakee,17,3971,9,20,42,1,2965,2,3
3,Appanoose,28,2970,7,27,36,0,3086,3,1
4,Audubon,15,1739,2,6,17,0,1634,1,4
5,Benton,35,7058,15,26,72,4,6447,1,8
6,Black Hawk,109,39184,39,171,348,3,24662,9,8
7,Boone,57,7356,10,56,66,0,6293,2,2
8,Bremer,29,6940,8,27,74,0,5741,1,2
9,Buchanan,25,6050,7,21,61,1,4139,3,2


In [25]:
# General dataframe shape after pivot
general_pivot.shape

(100, 10)

Iowa has only 99 counties, yet the pivot has 100 rows. There must be an extra row, perhaps a statewide total.

In [30]:
# List of all values in "county" column
general_pivot["county"].values

array(['Adair', 'Adams', 'Allamakee', 'Appanoose', 'Audubon', 'Benton',
       'Black Hawk', 'Boone', 'Bremer', 'Buchanan', 'Buena Vista',
       'Butler', 'Calhoun', 'Carroll', 'Cass', 'Cedar', 'Cerro Gordo',
       'Cherokee', 'Chickasaw', 'Clarke', 'Clay', 'Clayton', 'Clinton',
       'Crawford', 'Dallas', 'Davis', 'Decatur', 'Delaware', 'Des Moines',
       'Dickinson', 'Dubuque', 'Emmet', 'Fayette', 'Floyd', 'Franklin',
       'Fremont', 'Greene', 'Grundy', 'Guthrie', 'Hamilton', 'Hancock',
       'Hardin', 'Harrison', 'Henry', 'Howard', 'Humboldt', 'Ida', 'Iowa',
       'Jackson', 'Jasper', 'Jefferson', 'Johnson', 'Jones', 'Keokuk',
       'Kossuth', 'Lee', 'Linn', 'Louisa', 'Lucas', 'Lyon', 'Madison',
       'Mahaska', 'Marion', 'Marshall', 'Mills', 'Mitchell', 'Monona',
       'Monroe', 'Montgomery', 'Muscatine', "O'Brien", 'Osceola', 'Page',
       'Palo Alto', 'Plymouth', 'Pocahontas', 'Polk', 'Pottawattamie',
       'Poweshiek', 'Ringgold', 'Sac', 'Scott', 'Shelby', 'Sioux',

This confirms that there is indeed a `Total` row in the dataframe. We drop it now so that statewide votes are not counted twice.

In [31]:
# Drop the total row in general_pivot
general_pivot = general_pivot[general_pivot["county"].str.upper() != "TOTAL"].reset_index(drop=True)
general_pivot.shape


(99, 10)
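
The row-count surprise above can be turned into an explicit check. This is a sketch only: the helper name `check_counties`, the set of aggregate labels, and the toy frame are assumptions, and `expected` would be 99 for Iowa.

```python
import pandas as pd

def check_counties(df, key_col="county", expected=99):
    """Summarize whether the key column holds exactly `expected` distinct, non-aggregate names."""
    names = df[key_col].astype(str).str.strip()
    duplicates = sorted(names[names.duplicated()].unique())
    # Assumed labels for aggregate rows; extend if a file uses something else
    aggregates = sorted(names[names.str.upper().isin({"TOTAL", "TOTALS"})].unique())
    n_unique = int(names.nunique())
    return {
        "n_unique": n_unique,
        "duplicates": duplicates,
        "aggregates": aggregates,
        "ok": n_unique == expected and not duplicates and not aggregates,
    }

# Toy frame with two real county names and a stray aggregate row
toy = pd.DataFrame({"county": ["Adair", "Adams", "Total"]})
before = check_counties(toy, expected=2)
after = check_counties(toy[toy["county"].str.upper() != "TOTAL"], expected=2)
print(before)
print(after)
```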

## 4. Adding Party Total Columns

Now we add one total column per party for the general election:

* `rep_general_total` = sum of all `gen_rep_*` columns
* `dem_general_total` = sum of all `gen_dem_*` columns
* `lib_general_total` = sum of all `gen_lib_*` columns
* `cst_general_total` = sum of all `gen_cst_*` columns
* `grn_general_total` = sum of all `gen_grn_*` columns
* `pfp_general_total` = sum of all `gen_pfp_*` columns
* `swp_general_total` = sum of all `gen_swp_*` columns
* `psl_general_total` = sum of all `gen_psl_*` columns
* `spu_general_total` = sum of all `gen_spu_*` columns

In [32]:
# Add party totals for the general election
PARTY_KEYS = ["rep", "dem", "lib", "cst", "grn", "pfp", "swp", "psl", "spu"]

for party in PARTY_KEYS:
    # The trailing underscore keeps a prefix such as "gen_rep_" from matching a longer party key
    party_cols = [c for c in general_pivot.columns if c.startswith(f"gen_{party}_")]
    # Sum across that party's candidate columns; fall back to 0 if the party has no columns
    general_pivot[f"{party}_general_total"] = general_pivot[party_cols].sum(axis=1) if party_cols else 0

In [33]:
# Print out all the column names in the final dataframe
print("Final columns in the cleaned general dataframe:")
general_pivot.columns

Final columns in the cleaned general dataframe:


Index(['county', 'gen_cst_BALDWIN', 'gen_dem_OBAMA', 'gen_grn_MCKINNEY',
       'gen_lib_BARR', 'gen_pfp_NADER', 'gen_psl_RIVA', 'gen_rep_MCCAIN',
       'gen_spu_MOORE', 'gen_swp_HARRIS', 'rep_general_total',
       'dem_general_total', 'lib_general_total', 'cst_general_total',
       'grn_general_total', 'pfp_general_total', 'swp_general_total',
       'psl_general_total', 'spu_general_total'],
      dtype='object')

In [34]:
# Preview the general_pivot dataframe with totals
general_pivot.head(DISPLAY_ROWS)

Unnamed: 0,county,gen_cst_BALDWIN,gen_dem_OBAMA,gen_grn_MCKINNEY,gen_lib_BARR,gen_pfp_NADER,gen_psl_RIVA,gen_rep_MCCAIN,gen_spu_MOORE,gen_swp_HARRIS,rep_general_total,dem_general_total,lib_general_total,cst_general_total,grn_general_total,pfp_general_total,swp_general_total,psl_general_total,spu_general_total
0,Adair,11,1924,4,10,31,0,2060,0,0,2060,1924,10,11,4,31,0,0,0
1,Adams,7,1118,4,6,13,0,1046,0,5,1046,1118,6,7,4,13,5,0,0
2,Allamakee,17,3971,9,20,42,1,2965,2,3,2965,3971,20,17,9,42,3,1,2
3,Appanoose,28,2970,7,27,36,0,3086,3,1,3086,2970,27,28,7,36,1,0,3
4,Audubon,15,1739,2,6,17,0,1634,1,4,1634,1739,6,15,2,17,4,0,1
5,Benton,35,7058,15,26,72,4,6447,1,8,6447,7058,26,35,15,72,8,4,1
6,Black Hawk,109,39184,39,171,348,3,24662,9,8,24662,39184,171,109,39,348,8,3,9
7,Boone,57,7356,10,56,66,0,6293,2,2,6293,7356,56,57,10,66,2,0,2
8,Bremer,29,6940,8,27,74,0,5741,1,2,5741,6940,27,29,8,74,2,0,1
9,Buchanan,25,6050,7,21,61,1,4139,3,2,4139,6050,21,25,7,61,2,1,3
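
As a sanity check on the totals, the sum of all party totals in a row should equal the sum of all `gen_` candidate columns in that row. The sketch below demonstrates the idea on a toy frame with made-up numbers and hypothetical candidate columns `gen_dem_X` and `gen_rep_Y`.

```python
import pandas as pd

# Toy wide table with made-up numbers
toy = pd.DataFrame({
    "county": ["A", "B"],
    "gen_dem_X": [10, 3],
    "gen_rep_Y": [7, 8],
    "dem_general_total": [10, 3],
    "rep_general_total": [7, 8],
})

candidate_cols = [c for c in toy.columns if c.startswith("gen_")]
total_cols = [c for c in toy.columns if c.endswith("_general_total")]

# Rows where the per-candidate sum and the per-party sum disagree
mismatch = toy[toy[candidate_cols].sum(axis=1) != toy[total_cols].sum(axis=1)]
print(len(mismatch))
```

A non-empty `mismatch` would point to a party whose columns were missed or matched twice by the prefix filter.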


Now, we save the cleaned dataframe into the processed directory.

In [35]:
# Save the cleaned dataframe to CSV
out_dir = Path(OUTPUT_PATH)
out_dir.mkdir(parents=True, exist_ok=True)
general_pivot.to_csv(out_dir / "IA.csv", index=False)
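
After exporting, reading the file back is a cheap way to confirm that nothing changed on the way to disk (for example, county names containing an apostrophe). This is a minimal sketch that writes a toy frame with made-up counts to a temporary directory instead of the real output path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with made-up counts; "O'Brien" exercises quoting of special characters
frame = pd.DataFrame({
    "county": ["Adair", "O'Brien"],
    "gen_dem_X": [12, 34],
})

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "IA.csv"
    frame.to_csv(out_file, index=False)
    reloaded = pd.read_csv(out_file)

# True when values, column names, and dtypes all survive the round trip
round_trip_ok = frame.equals(reloaded)
print(round_trip_ok)
```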