# Colorado 2008 Presidential Elections: Data Cleaning & Preprocessing

**Goal:** Build a clean, analysis-ready, county-level table of Colorado's 2008 presidential general election results, then derive summary statistics (party totals). Note: no presidential primary results are available for Colorado 2008 so far, so only the general election is processed.

**Output**: A single CSV where each row is a county and columns include:

- General per-candidate vote counts (prefixed with `gen_`)
- Party totals: `rep_general_total`, `dem_general_total`, `lib_general_total`, `cst_general_total`, `grn_general_total`, `una_general_total`, `aip_general_total`, `btp_general_total`, `obj_general_total`, `prh_general_total`, `psl_general_total`, `swp_general_total`, `spu_general_total`, `usp_general_total`, `hq8_general_total`

**Last Updated**: 2025/10/02

## 0. Library Import

In [1]:
import re
import pandas as pd
import numpy as np
from pathlib import Path


## 1. Inputs & Parameters

Define raw file paths once here so the entire notebook is easy to rerun on another machine. If a path changes, we only update it here. We keep a single `OUTPUT_PATH` so all exports land in one known place.

In [2]:
# CO 2008 dataset path
PRIMARY_PATH = r"../../data/raw/2008/CO/20080812__co__primary.csv"
GENERAL_PATH = r"../../data/raw/2008/CO/20081104__co__general__precinct.csv"

# Output directory
OUTPUT_PATH  = r"../../data/processed/2008/CO/"

# Analysis parameters
DISPLAY_ROWS = 10   # Number of rows to display in dataframes
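
Since every later cell depends on these paths, it can help to fail fast when a file is missing. The following is a minimal sketch and not part of the original notebook: `missing_inputs` is a hypothetical helper, and the demo uses a temporary file as a stand-in for a raw CSV.

```python
from pathlib import Path
import tempfile

def missing_inputs(paths):
    """Return the subset of `paths` that are not existing files (hypothetical helper)."""
    return [str(p) for p in paths if not Path(p).is_file()]

# Demo: one file that exists and one that does not
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "general.csv"
    present.write_text("county,votes\nAdams,1\n")
    absent = Path(tmp) / "primary.csv"
    missing = missing_inputs([present, absent])

print(missing)
```

In the notebook, the equivalent call would be `missing_inputs([PRIMARY_PATH, GENERAL_PATH])` right after the parameter cell.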

## 2. Load & Filter

We load primary and general datasets separately and immediately subset to the rows we truly need:

- Restrict `office` to 'President' to avoid mixing down-ballot contests

- Remove columns that are fully missing or irrelevant post-filter (e.g., a district column that’s empty for county-level rows)
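
The two filtering steps above can be sketched on a toy table. The rows below are illustrative values rather than real results, and using `dropna(axis=1, how="all")` to remove fully missing columns is an assumption about one convenient way to do it (the notebook itself drops the columns by name).

```python
import pandas as pd

# Toy rows shaped like the raw file (illustrative values only)
toy = pd.DataFrame({
    "county":   ["Adams", "Adams", "Baca"],
    "office":   ["President", "U.S. Senate", "President"],
    "district": [None, None, None],
    "votes":    [10, 7, 3],
})

# Keep presidential rows only, then drop columns that are entirely missing
pres = toy[toy["office"] == "President"]
pres = pres.dropna(axis=1, how="all").reset_index(drop=True)

print(pres.columns.tolist())
```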

### a. Primary Election Dataset

In [3]:
# Load primary data
primary_df = pd.read_csv(PRIMARY_PATH)
primary_df.head(DISPLAY_ROWS)

Unnamed: 0,county,office,district,party,candidate,votes
0,Adams,U.S. Senate,,DEM,Mark Udall,16410
1,Alamosa,U.S. Senate,,DEM,Mark Udall,784
2,Arapahoe,U.S. Senate,,DEM,Mark Udall,22093
3,Archuleta,U.S. Senate,,DEM,Mark Udall,309
4,Baca,U.S. Senate,,DEM,Mark Udall,157
5,Bent,U.S. Senate,,DEM,Mark Udall,95
6,Boulder,U.S. Senate,,DEM,Mark Udall,24433
7,Broomfield,U.S. Senate,,DEM,Mark Udall,3235
8,Chaffee,U.S. Senate,,DEM,Mark Udall,1095
9,Cheyenne,U.S. Senate,,DEM,Mark Udall,40


In [4]:
# Different values in 'office' column
primary_df["office"].value_counts()

office
State House     753
U.S. House      382
U.S. Senate     260
State Senate    238
Name: count, dtype: int64

There are no presidential rows, so `primary_df` lacks the contest we are interested in. We stop exploring this dataframe here unless a future analysis calls for the down-ballot contests.

### b. General Election Dataset

In [5]:
# Load general data
general_df = pd.read_csv(GENERAL_PATH)
general_df.head(DISPLAY_ROWS)


Unnamed: 0,county,precinct,office,district,party,candidate,votes
0,Adams,2233301070,President,,Republican,John McCain/ Sarah Palin,397
1,Adams,2233301070,President,,Democrat,Barack Obama/ Joe Biden,429
2,Adams,2233301070,President,,Constitution,Chuck Baldwin/ Darrell L. Castle,0
3,Adams,2233301070,President,,Libertarian,Bob Barr/ Wayne A. Root,4
4,Adams,2233301070,President,,Green,Cynthia McKinney/ Rosa A. Clemente,2
5,Adams,2233301070,President,,HeartQuake '08,Jonathan E. Allen/ Jeffrey D. Stath,0
6,Adams,2233301070,President,,Prohibition,Gene C. Amondson/ Leroy J. Pletten,0
7,Adams,2233301070,President,,Socialist Workers,James Harris/ Alyson Kennedy,0
8,Adams,2233301070,President,,Boston Tea,Charles Jay/ Dan Sallis Jr.,0
9,Adams,2233301070,President,,America's Independent,Alan Keyes/ Brian Rohrbough,0


In [6]:
# Different values in 'office' column
general_df["office"].value_counts()

office
President                 53600
State House               41395
U.S. Senate               23450
U.S. House                16674
State Senate              16134
STATE SENATE - DISTRIC      239
Name: count, dtype: int64

In contrast, `general_df` does contain presidential rows, so we take a closer look at this dataset.

In [7]:
# Only keep rows where 'office' is 'President'
general_df = general_df[general_df["office"] == "President"]
general_df.shape

(53600, 7)

In [8]:
# Now, drop the "office" column as it's no longer needed
# Also, drop the district column as it's not applicable 
general_df = general_df.drop(columns=["office", "district"]).reset_index(drop=True)
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,precinct,party,candidate,votes
0,Adams,2233301070,Republican,John McCain/ Sarah Palin,397
1,Adams,2233301070,Democrat,Barack Obama/ Joe Biden,429
2,Adams,2233301070,Constitution,Chuck Baldwin/ Darrell L. Castle,0
3,Adams,2233301070,Libertarian,Bob Barr/ Wayne A. Root,4
4,Adams,2233301070,Green,Cynthia McKinney/ Rosa A. Clemente,2
5,Adams,2233301070,HeartQuake '08,Jonathan E. Allen/ Jeffrey D. Stath,0
6,Adams,2233301070,Prohibition,Gene C. Amondson/ Leroy J. Pletten,0
7,Adams,2233301070,Socialist Workers,James Harris/ Alyson Kennedy,0
8,Adams,2233301070,Boston Tea,Charles Jay/ Dan Sallis Jr.,0
9,Adams,2233301070,America's Independent,Alan Keyes/ Brian Rohrbough,0


Now, we aggregate precinct vote counts into county vote counts.

In [9]:
# Make sure votes are numeric
general_df["votes"] = pd.to_numeric(general_df["votes"], errors="coerce").fillna(0).astype(int)

# Aggregate precinct vote counts into county vote counts
general_df = (
    general_df
    .groupby(["county", "party", "candidate"], as_index=False)["votes"]
    .sum()
)[["county", "candidate", "party", "votes"]]        # Reorder columns

# Preview the aggregated data
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,candidate,party,votes
0,Adams,Alan Keyes/ Brian Rohrbough,America's Independent,346
1,Adams,Charles Jay/ Dan Sallis Jr.,Boston Tea,29
2,Adams,Chuck Baldwin/ Darrell L. Castle,Constitution,459
3,Adams,Barack Obama/ Joe Biden,Democrat,93445
4,Adams,Cynthia McKinney/ Rosa A. Clemente,Green,217
5,Adams,Jonathan E. Allen/ Jeffrey D. Stath,HeartQuake '08,25
6,Adams,Bob Barr/ Wayne A. Root,Libertarian,725
7,Adams,Thomas Robert Stevens/ Alden Link,Objectivist,13
8,Adams,Gene C. Amondson/ Leroy J. Pletten,Prohibition,8
9,Adams,John McCain/ Sarah Palin,Republican,63976
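
One caveat of the coercion step is that unparseable vote values silently become 0. A small sketch on toy precinct rows (illustrative values, not real results) shows how to count such values before filling them and how to confirm that the aggregation neither creates nor loses votes.

```python
import pandas as pd

# Toy precinct-level rows (illustrative values only)
precincts = pd.DataFrame({
    "county":    ["Adams", "Adams", "Adams", "Baca"],
    "precinct":  ["p1", "p2", "p1", "p9"],
    "party":     ["Democrat", "Democrat", "Republican", "Democrat"],
    "candidate": ["A", "A", "B", "A"],
    "votes":     ["5", "7", "x", "2"],   # strings, one of them unparseable
})

# Count values that cannot be parsed before they are replaced by 0
parsed = pd.to_numeric(precincts["votes"], errors="coerce")
n_bad = int(parsed.isna().sum())
precincts["votes"] = parsed.fillna(0).astype(int)

county = (
    precincts
    .groupby(["county", "party", "candidate"], as_index=False)["votes"]
    .sum()
)

# The county table should carry exactly the same number of votes
assert county["votes"].sum() == precincts["votes"].sum()
print(n_bad)
print(county)
```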


In [10]:
# Candidates in general_df
general_df["candidate"].value_counts()

candidate
Alan Keyes/ Brian Rohrbough            64
Charles Jay/ Dan Sallis Jr.            64
Chuck Baldwin/ Darrell L. Castle       64
Barack Obama/ Joe Biden                64
Cynthia McKinney/ Rosa A. Clemente     64
Jonathan E. Allen/ Jeffrey D. Stath    64
Bob Barr/ Wayne A. Root                64
Thomas Robert Stevens/ Alden Link      64
Gene C. Amondson/ Leroy J. Pletten     64
John McCain/ Sarah Palin               64
Gloria La Riva/ Robert Moses           64
James Harris/ Alyson Kennedy           64
Brian Moore/ Stewart A. Alexander      64
Bradford Lyttle/ Abraham Bassford      64
Frank Edward McEnulty/ David Mangan    64
Ralph Nader/ Matt Gonzalez             64
Name: count, dtype: int64

Notice that values in the `candidate` column have the format "President/ Vice President". We only want the presidential candidate, so we split each value on the separator and keep only the first part.

In [11]:
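# Keep only the presidential candidate in the "candidate" column.
# Split on the first "/", "&", or standalone word "and"; the word boundaries
# keep names that merely contain "and" (e.g. "Alexander") intact.
general_df["candidate"] = (
    general_df["candidate"]
      .str.split(r"(?i)\s*(?:\band\b|&|/)\s*", n=1, expand=True)[0]
      .str.strip()
)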

# Candidates in general_df
general_df["candidate"].value_counts()

candidate
Alan Keyes               64
Charles Jay              64
Chuck Baldwin            64
Barack Obama             64
Cynthia McKinney         64
Jonathan E. Allen        64
Bob Barr                 64
Thomas Robert Stevens    64
Gene C. Amondson         64
John McCain              64
Gloria La Riva           64
James Harris             64
Brian Moore              64
Bradford Lyttle          64
Frank Edward McEnulty    64
Ralph Nader              64
Name: count, dtype: int64
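
The same split can be reproduced outside pandas with the standard `re` module, which makes the separator pattern easy to test in isolation. This is a sketch under one assumption: the word `and` should only act as a separator when it stands alone, so it is wrapped in word boundaries. `presidential_name` is a hypothetical helper, and the last ticket is a made-up example.

```python
import re

# First "/", "&", or standalone "and" separates the two names on a ticket
SEP = re.compile(r"\s*(?:\band\b|&|/)\s*", flags=re.IGNORECASE)

def presidential_name(ticket: str) -> str:
    """Return the part of `ticket` before the first separator (hypothetical helper)."""
    return SEP.split(ticket, maxsplit=1)[0].strip()

tickets = [
    "Barack Obama/ Joe Biden",
    "Brian Moore/ Stewart A. Alexander",
    "Charles Jay/ Dan Sallis Jr.",
    "Alexander Doe and Jane Roe",     # made-up ticket; "and" inside "Alexander" must not split
]
names = [presidential_name(t) for t in tickets]
print(names)
```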

In [12]:
# List out all the parties in the general election data
general_df["party"].value_counts()

party
Unaffiliated                128
America's Independent        64
Boston Tea                   64
Constitution                 64
Democrat                     64
Green                        64
HeartQuake '08               64
Libertarian                  64
Objectivist                  64
Prohibition                  64
Republican                   64
Socialism and Liberation     64
Socialist Workers            64
Socialist, USA               64
U.S. Pacifist                64
Name: count, dtype: int64

In [13]:
# Missing values count
general_df.isnull().sum()

county       0
candidate    0
party        0
votes        0
dtype: int64

In [14]:
# Final look at the cleaned general_df
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,candidate,party,votes
0,Adams,Alan Keyes,America's Independent,346
1,Adams,Charles Jay,Boston Tea,29
2,Adams,Chuck Baldwin,Constitution,459
3,Adams,Barack Obama,Democrat,93445
4,Adams,Cynthia McKinney,Green,217
5,Adams,Jonathan E. Allen,HeartQuake '08,25
6,Adams,Bob Barr,Libertarian,725
7,Adams,Thomas Robert Stevens,Objectivist,13
8,Adams,Gene C. Amondson,Prohibition,8
9,Adams,John McCain,Republican,63976


In [15]:
# Shape after preprocessing
general_df.shape

(1024, 4)

## 3. Table Pivoting

We convert the tall table (one row per county/party/candidate) into a wide one (one row per county, with one column per candidate). This keeps the schema consistent with the previously cleaned datasets.

Helper functions:

- `normalize_party(s)`: maps each party name to a short lowercase code (e.g., `dem`, `rep`) so column names are stable across dataframes
- `candidate_token(name)`: turns “Barack Obama” -> `OBAMA`, “John McCain” -> `MCCAIN`, etc., creating a short, readable token for column names
- `pivot_wide(df, prefix, key_col="county")`: main pivot function
    * groups by `county` × `party` × `candidate` and sums `votes`,
    * pivots to columns named like:
        * Primary: `pri_dem_OBAMA`, `pri_rep_MCCAIN`, ...
        * General: `gen_dem_OBAMA`, `gen_rep_MCCAIN`, ...
    * flattens the MultiIndex into plain column strings,
    * returns one wide row per county
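
To make the tall-to-wide step concrete, here is a minimal sketch on a toy table where the party codes and candidate tokens are already derived. The vote counts are illustrative only, and the `gen` prefix is hard-coded for brevity.

```python
import pandas as pd

# Toy tall table (illustrative values only)
tall = pd.DataFrame({
    "county":          ["Adams", "Adams", "Baca", "Baca"],
    "party_key":       ["dem", "rep", "dem", "rep"],
    "candidate_token": ["OBAMA", "MCCAIN", "OBAMA", "MCCAIN"],
    "votes":           [5, 3, 1, 4],
})

wide = tall.pivot_table(
    index="county",
    columns=["party_key", "candidate_token"],
    values="votes",
    aggfunc="sum",
    fill_value=0,
)

# Flatten the (party, token) MultiIndex into plain strings
wide.columns = [f"gen_{p}_{c}" for p, c in wide.columns]
wide = wide.reset_index()
print(wide)
```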

In [None]:
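def normalize_party(s: pd.Series) -> pd.Series:
    """
    Normalize party names to short lowercase codes, e.g. Democrat -> dem, Republican -> rep.
    Names missing from the mapping fall back to their stripped, lowercased form.
    """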
    return(s.str.strip()
           .str.capitalize()
           .map({
               "Democrat"       : "dem", 
               "Republican"     : "rep",
               "Libertarian"    : "lib",
               "Constitution"   : "cst",
               "Green"          : "grn",

               "Unaffiliated"            : "una",
               "America's independent"   : "aip",
               "Boston tea"              : "btp",
               "Objectivist"             : "obj",
               "Prohibition"             : "prh",
               "Socialism and liberation": "psl",
               "Socialist workers"       : "swp",
               "Socialist, usa"          : "spu",
               "U.s. pacifist"           : "usp",
               "Heartquake '08"          : "hq8",
                })
           .fillna(s.str.strip().str.lower()))      # For defensive purposes only, would not expect other parties

In [19]:
SUFFIXES = {
    "JR","SR","JNR","SNR",
    "II","III","IV","V","VI","VII","VIII","IX","X","XI","XII"
}

def candidate_token(name: str) -> str:
    """
    Turn John McCain -> MCCAIN, Barack Obama -> OBAMA
    Skip suffixes, keep the last name token, uppercase it, and remove punctuation
    """
    if pd.isna(name):
        return "UNKNOWN"                # Defensive purposes only, would not expect missing values
    
    # Work on a stripped string version of the name
    raw = str(name).strip()

    # If a comma exists, treat as 'LAST, FIRST ...'
    if "," in raw:
        last_part = raw.split(",", 1)[0]
        last_part = re.sub(r"[^A-Za-z0-9\s]+", "", last_part).strip().upper()
        tokens = last_part.split()
        return tokens[-1] if tokens else "UNKNOWN"

    # Otherwise: remove punctuation, split, then drop trailing suffixes
    tokens = re.sub(r"[^A-Za-z0-9\s]+", "", raw).strip().upper().split()
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return tokens[-1] if tokens else "UNKNOWN"
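
Because the pivot sums votes per party and token, two different candidates of the same party who share a last name would be merged into one column. The sketch below shows one way to detect such collisions before pivoting. `last_name_token` is a simplified stand-in for `candidate_token`, `token_collisions` is a hypothetical helper, and "Jane Nader" is a made-up name used only to trigger a collision.

```python
import re
from collections import defaultdict

def last_name_token(name: str) -> str:
    """Simplified stand-in for `candidate_token`: uppercase last word, punctuation removed."""
    words = re.sub(r"[^A-Za-z0-9\s]+", "", name).upper().split()
    return words[-1] if words else "UNKNOWN"

def token_collisions(pairs):
    """Map (party, token) -> sorted candidate names, keeping only keys shared by 2+ names."""
    seen = defaultdict(set)
    for party, candidate in pairs:
        seen[(party, last_name_token(candidate))].add(candidate)
    return {key: sorted(names) for key, names in seen.items() if len(names) > 1}

pairs = [
    ("una", "Ralph Nader"),
    ("una", "Frank Edward McEnulty"),
    ("una", "Jane Nader"),          # hypothetical, for illustration only
]
collisions = token_collisions(pairs)
print(collisions)
```

An empty result means every (party, token) pair identifies exactly one candidate.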

In [20]:
def pivot_wide(df: pd.DataFrame, prefix: str, key_col: str="county") -> pd.DataFrame:
    """
    Pivot the dataframe to wide format based on party and candidate
    """
    # Work on a copy so the caller's dataframe is not modified
    df = df.copy()

    # Normalize party names
    df['party_key'] = normalize_party(df['party'])
    
    # Create candidate tokens
    df['candidate_token'] = df['candidate'].apply(candidate_token)
    
    # Pivot the dataframe
    pivot_df = df.pivot_table(index=key_col, 
                              columns=["party_key", "candidate_token"], 
                              values="votes", 
                              aggfunc='sum', 
                              fill_value=0)
    
    # Flatten multi-level columns
    pivot_df.columns = [f"{prefix}_{p}_{c}" for p, c in pivot_df.columns]
    
    # Reset index to turn key_col back into a column
    pivot_df = pivot_df.reset_index()
    
    return pivot_df

In [21]:
# General dataframe pivot
general_pivot = pivot_wide(general_df, prefix="gen")
general_pivot.head(DISPLAY_ROWS)

Unnamed: 0,county,gen_aip_KEYES,gen_btp_JAY,gen_cst_BALDWIN,gen_dem_OBAMA,gen_grn_MCKINNEY,gen_hq8_ALLEN,gen_lib_BARR,gen_obj_STEVENS,gen_prh_AMONDSON,gen_psl_RIVA,gen_rep_MCCAIN,gen_spu_MOORE,gen_swp_HARRIS,gen_una_MCENULTY,gen_una_NADER,gen_usp_LYTTLE
0,Adams,346,29,459,93445,217,25,725,13,8,6,63976,17,14,93,1121,7
1,Alamosa,6,2,24,3521,14,0,16,2,0,3,2635,0,1,1,61,0
2,Arapahoe,309,61,636,148224,278,27,1210,28,7,13,113868,27,12,73,1365,18
3,Archuleta,11,4,38,2836,8,0,31,1,0,3,3638,2,2,2,48,1
4,Baca,11,2,10,536,5,1,10,1,0,0,1572,0,2,0,25,0
5,Bent,1,1,6,799,2,1,6,0,2,0,1077,1,0,0,23,1
6,Boulder,133,35,376,124159,250,20,928,34,5,4,44904,13,13,33,852,4
7,Broomfield,42,14,79,16168,32,4,165,8,0,2,12757,1,1,11,169,0
8,Chaffee,21,3,37,4862,13,1,39,1,0,0,4873,0,0,4,67,0
9,Cheyenne,1,0,5,198,1,0,3,0,0,0,890,0,0,1,12,0


In [22]:
# General dataframe shape after pivot
general_pivot.shape

(64, 17)
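
The shapes reported so far are consistent with each other, which is a quick sanity check worth writing down. The sketch below only restates counts already shown in the notebook (64 rows per candidate, 16 candidates).

```python
# Counts reported earlier in the notebook
n_counties = 64      # each candidate appears in 64 county rows
n_candidates = 16    # distinct presidential candidates

tall_rows = n_counties * n_candidates    # rows in the tall county-level table
wide_cols = 1 + n_candidates             # `county` plus one column per candidate

print(tall_rows, wide_cols)   # prints: 1024 17
```

These match the `(1024, 4)` tall shape and the `(64, 17)` wide shape above.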

## 4. Adding Party Total Columns

Now, we add one total column per party for the general election:

* `rep_general_total` = sum of all `gen_rep_*` columns
* `dem_general_total` = sum of all `gen_dem_*` columns
* `lib_general_total` = sum of all `gen_lib_*` columns
* `cst_general_total` = sum of all `gen_cst_*` columns
* `grn_general_total` = sum of all `gen_grn_*` columns

* `una_general_total` = sum of all `gen_una_*` columns
* `aip_general_total` = sum of all `gen_aip_*` columns
* `btp_general_total` = sum of all `gen_btp_*` columns
* `obj_general_total` = sum of all `gen_obj_*` columns
* `prh_general_total` = sum of all `gen_prh_*` columns
* `psl_general_total` = sum of all `gen_psl_*` columns
* `swp_general_total` = sum of all `gen_swp_*` columns
* `spu_general_total` = sum of all `gen_spu_*` columns
* `usp_general_total` = sum of all `gen_usp_*` columns
* `hq8_general_total` = sum of all `gen_hq8_*` columns
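
Since every total follows the same rule, the computation can also be written as a loop over party codes, which leaves fewer names to type by hand. This is a sketch on a toy table with illustrative values; `PARTY_CODES` is a shortened, hypothetical list rather than the notebook's full set.

```python
import pandas as pd

# Toy wide table (illustrative values only)
wide = pd.DataFrame({
    "county":           ["Adams", "Baca"],
    "gen_dem_OBAMA":    [5, 1],
    "gen_una_NADER":    [2, 0],
    "gen_una_MCENULTY": [1, 3],
})

PARTY_CODES = ["dem", "rep", "una"]   # subset of the codes, for brevity

for code in PARTY_CODES:
    # The trailing underscore keeps one code from matching a longer one
    cols = [c for c in wide.columns if c.startswith(f"gen_{code}_")]
    # A party with no candidate columns gets a total of 0
    wide[f"{code}_general_total"] = wide[cols].sum(axis=1) if cols else 0

print(wide[["county", "dem_general_total", "rep_general_total", "una_general_total"]])
```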

In [23]:
# Add party totals for general election
rep_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_rep")] 
dem_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_dem")]
lib_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_lib")]
cst_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_cst")]
grn_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_grn")]

una_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_una")]
aip_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_aip")]
btp_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_btp")]
obj_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_obj")]
prh_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_prh")]
psl_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_psl")]
swp_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_swp")]
spu_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_spu")]
usp_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_usp")]
hq8_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_hq8")]

general_pivot["rep_general_total"] = general_pivot[rep_general_cols].sum(axis=1) if rep_general_cols else 0
general_pivot["dem_general_total"] = general_pivot[dem_general_cols].sum(axis=1) if dem_general_cols else 0
general_pivot["lib_general_total"] = general_pivot[lib_general_cols].sum(axis=1) if lib_general_cols else 0
general_pivot["cst_general_total"] = general_pivot[cst_general_cols].sum(axis=1) if cst_general_cols else 0
general_pivot["grn_general_total"] = general_pivot[grn_general_cols].sum(axis=1) if grn_general_cols else 0

general_pivot["una_general_total"] = general_pivot[una_general_cols].sum(axis=1) if una_general_cols else 0
general_pivot["aip_general_total"] = general_pivot[aip_general_cols].sum(axis=1) if aip_general_cols else 0
general_pivot["btp_general_total"] = general_pivot[btp_general_cols].sum(axis=1) if btp_general_cols else 0
general_pivot["obj_general_total"] = general_pivot[obj_general_cols].sum(axis=1) if obj_general_cols else 0
general_pivot["prh_general_total"] = general_pivot[prh_general_cols].sum(axis=1) if prh_general_cols else 0
general_pivot["psl_general_total"] = general_pivot[psl_general_cols].sum(axis=1) if psl_general_cols else 0
general_pivot["swp_general_total"] = general_pivot[swp_general_cols].sum(axis=1) if swp_general_cols else 0
general_pivot["spu_general_total"] = general_pivot[spu_general_cols].sum(axis=1) if spu_general_cols else 0
general_pivot["usp_general_total"] = general_pivot[usp_general_cols].sum(axis=1) if usp_general_cols else 0
general_pivot["hq8_general_total"] = general_pivot[hq8_general_cols].sum(axis=1) if hq8_general_cols else 0

In [24]:
# Print out all the column names in the final dataframe
print("Final columns in the cleaned general dataframe:")
general_pivot.columns

Final columns in the cleaned general dataframe:


Index(['county', 'gen_aip_KEYES', 'gen_btp_JAY', 'gen_cst_BALDWIN',
       'gen_dem_OBAMA', 'gen_grn_MCKINNEY', 'gen_hq8_ALLEN', 'gen_lib_BARR',
       'gen_obj_STEVENS', 'gen_prh_AMONDSON', 'gen_psl_RIVA', 'gen_rep_MCCAIN',
       'gen_spu_MOORE', 'gen_swp_HARRIS', 'gen_una_MCENULTY', 'gen_una_NADER',
       'gen_usp_LYTTLE', 'rep_general_total', 'dem_general_total',
       'lib_general_total', 'cst_general_total', 'grn_general_total',
       'una_general_total', 'aip_general_total', 'btp_general_total',
       'obj_general_total', 'prh_general_total', 'psl_general_total',
       'swp_general_total', 'spu_general_total', 'usp_general_total',
       'hq8_general_total'],
      dtype='object')

In [25]:
# Preview the general_pivot dataframe with totals
general_pivot.head(DISPLAY_ROWS)

Unnamed: 0,county,gen_aip_KEYES,gen_btp_JAY,gen_cst_BALDWIN,gen_dem_OBAMA,gen_grn_MCKINNEY,gen_hq8_ALLEN,gen_lib_BARR,gen_obj_STEVENS,gen_prh_AMONDSON,...,una_general_total,aip_general_total,btp_general_total,obj_general_total,prh_general_total,psl_general_total,swp_general_total,spu_general_total,usp_general_total,hq8_general_total
0,Adams,346,29,459,93445,217,25,725,13,8,...,1214,346,29,13,8,6,14,17,7,25
1,Alamosa,6,2,24,3521,14,0,16,2,0,...,62,6,2,2,0,3,1,0,0,0
2,Arapahoe,309,61,636,148224,278,27,1210,28,7,...,1438,309,61,28,7,13,12,27,18,27
3,Archuleta,11,4,38,2836,8,0,31,1,0,...,50,11,4,1,0,3,2,2,1,0
4,Baca,11,2,10,536,5,1,10,1,0,...,25,11,2,1,0,0,2,0,0,1
5,Bent,1,1,6,799,2,1,6,0,2,...,23,1,1,0,2,0,0,1,1,1
6,Boulder,133,35,376,124159,250,20,928,34,5,...,885,133,35,34,5,4,13,13,4,20
7,Broomfield,42,14,79,16168,32,4,165,8,0,...,180,42,14,8,0,2,1,1,0,4
8,Chaffee,21,3,37,4862,13,1,39,1,0,...,71,21,3,1,0,0,0,0,0,1
9,Cheyenne,1,0,5,198,1,0,3,0,0,...,13,1,0,0,0,0,0,0,0,0


Now, we save the cleaned dataframe into the processed directory.

In [26]:
# Save the cleaned dataframe to CSV
out_dir = Path(OUTPUT_PATH)
out_dir.mkdir(parents=True, exist_ok=True)
general_pivot.to_csv(out_dir / "CO.csv", index=False)
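
After exporting, a quick read-back confirms that the file on disk matches the table in memory. The sketch below uses a toy table with illustrative values and a temporary directory instead of the real output path.

```python
import tempfile
from pathlib import Path
import pandas as pd

# Toy stand-in for the final table (illustrative values only)
table = pd.DataFrame({"county": ["Adams", "Baca"], "gen_dem_OBAMA": [5, 1]})

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "CO.csv"
    table.to_csv(out_file, index=False)
    reloaded = pd.read_csv(out_file)

# Same shape, same columns, and one row per county
assert reloaded.shape == table.shape
assert list(reloaded.columns) == list(table.columns)
assert reloaded["county"].is_unique
print(reloaded)
```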