# California 2008 Presidential Elections: Data Cleaning & Preprocessing

**Goal:** Build a clean, analysis-ready county-level table of California's 2008 presidential general election results, then derive summary statistics (party totals). Note: no presidential primary election results for California 2008 are available so far.

**Output**: A single CSV where each row is a county and columns include:

- General per-candidate vote counts (prefixed with `gen_`)
- Party totals: `rep_general_total`, `dem_general_total`, `lib_general_total`, `grn_general_total`, `ind_general_total`, `ai_general_total`, `pf_general_total`

**Last Updated**: 2025/10/02

## 0. Library Import

In [1]:
import re
import pandas as pd
import numpy as np
from pathlib import Path



## 1. Inputs & Parameters

Define raw file paths once here so the entire notebook is easy to rerun on another machine. If a path changes, we only update it here. We keep a single `OUTPUT_PATH` so all exports land in one known place.

In [2]:
# CA 2008 dataset path
# PRIMARY_PATH = r""
GENERAL_PATH = r"../../data/raw/2008/CA/20081104__ca__general__president.csv"

# Output directory
OUTPUT_PATH  = r"../../data/processed/2008/CA/"

# Analysis parameters
DISPLAY_ROWS = 10   # Number of rows to display in dataframes

## 2. Load & Filter

We load the general dataset into the notebook and create a subset of the columns we truly need:

- Restrict `office` to 'President' to avoid mixing down-ballot contests

- Remove columns that are fully missing or irrelevant post-filter (e.g., a district column that’s empty for county-level rows)
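
The two filtering steps above can be sketched on a toy frame. This is a minimal illustration, not the notebook's own cell: the rows, the `State Senate` contest, and the names `toy_raw` and `president_only` are made up for the example.

```python
import pandas as pd

# Toy rows shaped like the raw file; values are illustrative, not real results
toy_raw = pd.DataFrame({
    "county":    ["Alameda", "Alameda", "Alpine"],
    "office":    ["President", "State Senate", "President"],
    "district":  [None, None, None],
    "party":     ["DEM", "DEM", "REP"],
    "candidate": ["Barack Obama", "Example Candidate", "John McCain"],
    "votes":     [100, 50, 20],
})

# Keep only presidential rows so down-ballot contests are not mixed in
president_only = toy_raw[toy_raw["office"] == "President"]

# Drop columns that are entirely missing after the filter (here: district)
president_only = president_only.dropna(axis=1, how="all").reset_index(drop=True)

print(president_only.columns.tolist())
```

Using `dropna(axis=1, how="all")` removes only columns with no values at all, so a partially filled column would survive and could be inspected separately.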

### b. General Election Dataset

In [3]:
# Load general data
general_df = pd.read_csv(GENERAL_PATH)
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,office,district,party,candidate,votes
0,Alameda,President,,AI,Alan Keyes,1205
1,Alameda,President,,DEM,Barack Obama,489106
2,Alameda,President,,LIB,Bob Barr,2426
3,Alameda,President,,IND (W/I),Chuck Baldwin,116
4,Alameda,President,,GRN,Cynthia McKinney,2536
5,Alameda,President,,IND (W/I),Frank Moore,10
6,Alameda,President,,IND (W/I),James Harris,5
7,Alameda,President,,REP,John McCain,119555
8,Alameda,President,,PF,Ralph Nader,5557
9,Alameda,President,,IND (W/I),Ron Paul,513


In [4]:
# Different values in 'office' column
general_df["office"].value_counts()

office
President    580
Name: count, dtype: int64

Since the `office` column contains only the value "President", we can safely drop it. We also drop the `district` column, which is empty in the rows shown above and gives us no additional information.

In [5]:
# Drop the "office" column as it's no longer needed
# Also, drop the district column as it's not applicable 
general_df = general_df.drop(columns=["office", "district"]).reset_index(drop=True)
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,party,candidate,votes
0,Alameda,AI,Alan Keyes,1205
1,Alameda,DEM,Barack Obama,489106
2,Alameda,LIB,Bob Barr,2426
3,Alameda,IND (W/I),Chuck Baldwin,116
4,Alameda,GRN,Cynthia McKinney,2536
5,Alameda,IND (W/I),Frank Moore,10
6,Alameda,IND (W/I),James Harris,5
7,Alameda,REP,John McCain,119555
8,Alameda,PF,Ralph Nader,5557
9,Alameda,IND (W/I),Ron Paul,513


In [6]:
# List out all the parties in the general election data
general_df["party"].value_counts()

party
IND (W/I)    232
AI            58
DEM           58
LIB           58
GRN           58
REP           58
PF            58
Name: count, dtype: int64

In [7]:
# Missing values count
general_df.isnull().sum()

county       0
party        0
candidate    0
votes        0
dtype: int64

In [8]:
# Final look at the cleaned general_df
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,party,candidate,votes
0,Alameda,AI,Alan Keyes,1205
1,Alameda,DEM,Barack Obama,489106
2,Alameda,LIB,Bob Barr,2426
3,Alameda,IND (W/I),Chuck Baldwin,116
4,Alameda,GRN,Cynthia McKinney,2536
5,Alameda,IND (W/I),Frank Moore,10
6,Alameda,IND (W/I),James Harris,5
7,Alameda,REP,John McCain,119555
8,Alameda,PF,Ralph Nader,5557
9,Alameda,IND (W/I),Ron Paul,513


In [9]:
# Shape after preprocessing
general_df.shape

(580, 4)
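
The 580 rows are consistent with 58 counties × 10 candidates. A quick completeness check along those lines might look like the sketch below; it runs on a toy frame, and the names `toy`, `is_complete`, and `has_duplicates` are assumptions for illustration only.

```python
import pandas as pd

# Toy tall table: 2 counties x 2 candidates (illustrative values)
toy = pd.DataFrame({
    "county":    ["A", "A", "B", "B"],
    "candidate": ["X", "Y", "X", "Y"],
    "votes":     [1, 2, 3, 4],
})

# Rows per county; every county should list the same number of candidates
rows_per_county = toy.groupby("county").size()

# The table is complete if rows == counties x candidates
n_counties = toy["county"].nunique()
n_candidates = toy["candidate"].nunique()
is_complete = len(toy) == n_counties * n_candidates

# A repeated (county, candidate) pair would be silently summed by a later pivot
has_duplicates = bool(toy.duplicated(subset=["county", "candidate"]).any())

print(rows_per_county)
print(is_complete, has_duplicates)
```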

## 3. Table Pivoting

We convert the tall table (one row per county/party/candidate) into a wide one (one row per county, with one column per candidate). This keeps the schema consistent with the previously cleaned data.

Helper functions:

- `normalize_party(s)`: lowercases party labels (and maps `IND (W/I)` to `ind`) so column names stay stable across dataframes
- `candidate_token(name)`: turns “Barack Obama” -> OBAMA, “John McCain” -> MCCAIN, etc., giving a short, readable, unique token for column names
- `pivot_wide(df, prefix, key_col="county")`: main pivot function

    * groups by `county` × `party` × `candidate` and sums `votes`,
    * pivots to columns named like:
        * Primary: `pri_dem_OBAMA`, `pri_rep_MCCAIN`,...
        * General: `gen_dem_OBAMA`, `gen_rep_MCCAIN`,...

    * flattens the MultiIndex into plain column strings,
    * returns one wide row per county

In [11]:
def normalize_party(s: pd.Series) -> pd.Series:
    """
    Normalize party labels: "IND (W/I)" -> "ind"; any other label is
    lowercased (e.g. "DEM" -> "dem", "REP" -> "rep")
    """
    stripped = s.str.strip()
    # Apply the explicit mapping first; unmapped labels fall back to lowercase
    return stripped.map({"IND (W/I)": "ind"}).fillna(stripped.str.lower())

In [12]:
SUFFIXES = {
    "JR","SR","JNR","SNR",
    "II","III","IV","V","VI","VII","VIII","IX","X","XI","XII"
}

def candidate_token(name: str) -> str:
    """
    Turn John McCain -> MCCAIN, Barack Obama -> OBAMA
    Skip suffixes, keep the last name token, uppercase it, and remove punctuation
    """
    if pd.isna(name):
        return "UNKNOWN"                # Defensive purposes only, would not expect missing values

    # Strip surrounding whitespace
    raw = str(name).strip()

    # If a comma exists, treat as 'LAST, FIRST ...'
    if "," in raw:
        last_part = raw.split(",", 1)[0]
        last_part = re.sub(r"[^A-Za-z0-9\s]+", "", last_part).strip().upper()
        tokens = last_part.split()
        return tokens[-1] if tokens else "UNKNOWN"

    # Otherwise: remove punctuation, split, then drop trailing suffixes
    tokens = re.sub(r"[^A-Za-z0-9\s]+", "", raw).strip().upper().split()
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return tokens[-1] if tokens else "UNKNOWN"

In [13]:
def pivot_wide(df: pd.DataFrame, prefix: str, key_col: str="county") -> pd.DataFrame:
    """
    Pivot the dataframe to wide format based on party and candidate
    """
    # Work on a copy so the caller's dataframe is not mutated
    df = df.copy()

    # Normalize party names
    df['party_key'] = normalize_party(df['party'])

    # Create candidate tokens
    df['candidate_token'] = df['candidate'].apply(candidate_token)
    
    # Pivot the dataframe
    pivot_df = df.pivot_table(index=key_col, 
                              columns=["party_key", "candidate_token"], 
                              values="votes", 
                              aggfunc='sum', 
                              fill_value=0)
    
    # Flatten multi-level columns
    pivot_df.columns = [f"{prefix}_{p}_{c}" for p, c in pivot_df.columns]
    
    # Reset index to turn key_col back into a column
    pivot_df = pivot_df.reset_index()
    
    return pivot_df
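
To make the pivot and flatten steps concrete, here is a small self-contained sketch on toy data. It repeats the same `pivot_table` call and column flattening as `pivot_wide`, but the frame `toy_tall`, its vote counts, and the hard-coded `gen` prefix are illustrative assumptions.

```python
import pandas as pd

# Toy tall table with party keys and candidate tokens already derived
toy_tall = pd.DataFrame({
    "county":          ["Alameda", "Alameda", "Alpine", "Alpine"],
    "party_key":       ["dem", "rep", "dem", "rep"],
    "candidate_token": ["OBAMA", "MCCAIN", "OBAMA", "MCCAIN"],
    "votes":           [10, 4, 3, 2],
})

# One row per county, one (party, candidate) column pair per candidate
toy_wide = toy_tall.pivot_table(index="county",
                                columns=["party_key", "candidate_token"],
                                values="votes",
                                aggfunc="sum",
                                fill_value=0)

# Flatten the two-level column index into plain strings
toy_wide.columns = [f"gen_{p}_{c}" for p, c in toy_wide.columns]

# Turn the county index back into a regular column
toy_wide = toy_wide.reset_index()

print(toy_wide)
```

Because `aggfunc="sum"` is used, duplicate county/party/candidate rows would be added together rather than raising an error, and `fill_value=0` turns a candidate missing from a county into a zero count.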

In [14]:
# General dataframe pivot
general_pivot = pivot_wide(general_df, prefix="gen")
general_pivot.head(DISPLAY_ROWS)

Unnamed: 0,county,gen_ai_KEYES,gen_dem_OBAMA,gen_grn_MCKINNEY,gen_ind_BALDWIN,gen_ind_HARRIS,gen_ind_MOORE,gen_ind_PAUL,gen_lib_BARR,gen_pf_NADER,gen_rep_MCCAIN
0,Alameda,1205,489106,2536,116,5,10,513,2426,5557,119555
1,Alpine,5,422,5,0,0,0,1,2,5,252
2,Amador,103,7813,43,5,0,0,36,92,157,10561
3,Butte,473,49013,335,21,0,0,223,526,1028,46706
4,Calaveras,118,9813,83,14,0,0,44,170,229,12835
5,Colusa,18,2569,19,0,0,0,12,30,48,3733
6,Contra Costa,1356,306983,1112,71,4,3,464,1868,3353,136436
7,Del Norte,47,4323,36,0,0,0,0,42,116,4967
8,El Dorado,358,40529,211,38,0,1,167,502,806,50314
9,Fresno,812,136706,609,42,0,1,283,911,1910,131015


In [15]:
# General dataframe shape after pivot
general_pivot.shape

(58, 11)

## 4. Adding Party Total Columns

Next, we add party total columns for the general election:

* `rep_general_total` = sum of all `gen_rep_*` columns
* `dem_general_total` = sum of all `gen_dem_*` columns
* `lib_general_total` = sum of all `gen_lib_*` columns
* `grn_general_total` = sum of all `gen_grn_*` columns
* `ind_general_total` = sum of all `gen_ind_*` columns
* `ai_general_total`  = sum of all `gen_ai_*` columns
* `pf_general_total`  = sum of all `gen_pf_*` columns

In [17]:
# Add party totals for general election
# The trailing underscore keeps a prefix from matching a longer party key
rep_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_rep_")]
dem_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_dem_")]
lib_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_lib_")]
grn_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_grn_")]
ind_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_ind_")]
ai_general_cols     = [c for c in general_pivot.columns if c.startswith("gen_ai_")]
pf_general_cols     = [c for c in general_pivot.columns if c.startswith("gen_pf_")]

general_pivot["rep_general_total"] = general_pivot[rep_general_cols].sum(axis=1) if rep_general_cols else 0
general_pivot["dem_general_total"] = general_pivot[dem_general_cols].sum(axis=1) if dem_general_cols else 0
general_pivot["lib_general_total"] = general_pivot[lib_general_cols].sum(axis=1) if lib_general_cols else 0
general_pivot["grn_general_total"] = general_pivot[grn_general_cols].sum(axis=1) if grn_general_cols else 0
general_pivot["ind_general_total"] = general_pivot[ind_general_cols].sum(axis=1) if ind_general_cols else 0
general_pivot["ai_general_total"]  = general_pivot[ai_general_cols].sum(axis=1) if ai_general_cols else 0
general_pivot["pf_general_total"]  = general_pivot[pf_general_cols].sum(axis=1) if pf_general_cols else 0
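
The same totals could also be computed in a loop over party keys. The sketch below is one possible alternative, shown on a toy frame; the `PARTIES` list and `toy_pivot` are hypothetical names introduced for the example, not part of the notebook.

```python
import pandas as pd

# Party keys as they appear in the pivoted column names
PARTIES = ["rep", "dem", "lib", "grn", "ind", "ai", "pf"]

# Toy wide table (illustrative values, only a few candidate columns)
toy_pivot = pd.DataFrame({
    "county":         ["Alameda", "Alpine"],
    "gen_dem_OBAMA":  [10, 3],
    "gen_rep_MCCAIN": [4, 2],
    "gen_ind_PAUL":   [1, 0],
    "gen_ind_MOORE":  [2, 0],
})

for party in PARTIES:
    # Match on "gen_<party>_" so one key cannot match a longer one
    cols = [c for c in toy_pivot.columns if c.startswith(f"gen_{party}_")]
    # A party with no candidate columns gets a total of 0
    toy_pivot[f"{party}_general_total"] = toy_pivot[cols].sum(axis=1) if cols else 0

print(toy_pivot.filter(like="_general_total"))
```

A loop keeps the column naming rule in one place, at the cost of being slightly less explicit than one line per party.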

In [18]:
# Print out all the column names in the final dataframe
print("Final columns in the cleaned general dataframe:")
general_pivot.columns

Final columns in the cleaned general dataframe:


Index(['county', 'gen_ai_KEYES', 'gen_dem_OBAMA', 'gen_grn_MCKINNEY',
       'gen_ind_BALDWIN', 'gen_ind_HARRIS', 'gen_ind_MOORE', 'gen_ind_PAUL',
       'gen_lib_BARR', 'gen_pf_NADER', 'gen_rep_MCCAIN', 'rep_general_total',
       'dem_general_total', 'lib_general_total', 'grn_general_total',
       'ind_general_total', 'ai_general_total', 'pf_general_total'],
      dtype='object')

In [19]:
# Preview the general_pivot dataframe with totals
general_pivot.head(DISPLAY_ROWS)

Unnamed: 0,county,gen_ai_KEYES,gen_dem_OBAMA,gen_grn_MCKINNEY,gen_ind_BALDWIN,gen_ind_HARRIS,gen_ind_MOORE,gen_ind_PAUL,gen_lib_BARR,gen_pf_NADER,gen_rep_MCCAIN,rep_general_total,dem_general_total,lib_general_total,grn_general_total,ind_general_total,ai_general_total,pf_general_total
0,Alameda,1205,489106,2536,116,5,10,513,2426,5557,119555,119555,489106,2426,2536,644,1205,5557
1,Alpine,5,422,5,0,0,0,1,2,5,252,252,422,2,5,1,5,5
2,Amador,103,7813,43,5,0,0,36,92,157,10561,10561,7813,92,43,41,103,157
3,Butte,473,49013,335,21,0,0,223,526,1028,46706,46706,49013,526,335,244,473,1028
4,Calaveras,118,9813,83,14,0,0,44,170,229,12835,12835,9813,170,83,58,118,229
5,Colusa,18,2569,19,0,0,0,12,30,48,3733,3733,2569,30,19,12,18,48
6,Contra Costa,1356,306983,1112,71,4,3,464,1868,3353,136436,136436,306983,1868,1112,542,1356,3353
7,Del Norte,47,4323,36,0,0,0,0,42,116,4967,4967,4323,42,36,0,47,116
8,El Dorado,358,40529,211,38,0,1,167,502,806,50314,50314,40529,502,211,206,358,806
9,Fresno,812,136706,609,42,0,1,283,911,1910,131015,131015,136706,911,609,326,812,1910
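
One way to sanity-check the totals is to confirm that, for each county, the party totals add up to the sum of the per-candidate columns. The sketch below does this for the Alameda and Alpine rows copied from the preview above; the names `check_df` and `mismatches` are assumptions for illustration.

```python
import pandas as pd

# Alameda and Alpine rows copied from the preview above
check_df = pd.DataFrame({
    "county":            ["Alameda", "Alpine"],
    "gen_ai_KEYES":      [1205, 5],
    "gen_dem_OBAMA":     [489106, 422],
    "gen_grn_MCKINNEY":  [2536, 5],
    "gen_ind_BALDWIN":   [116, 0],
    "gen_ind_HARRIS":    [5, 0],
    "gen_ind_MOORE":     [10, 0],
    "gen_ind_PAUL":      [513, 1],
    "gen_lib_BARR":      [2426, 2],
    "gen_pf_NADER":      [5557, 5],
    "gen_rep_MCCAIN":    [119555, 252],
    "rep_general_total": [119555, 252],
    "dem_general_total": [489106, 422],
    "lib_general_total": [2426, 2],
    "grn_general_total": [2536, 5],
    "ind_general_total": [644, 1],
    "ai_general_total":  [1205, 5],
    "pf_general_total":  [5557, 5],
})

# Sum of per-candidate columns vs. sum of party total columns, per county
candidate_cols = [c for c in check_df.columns if c.startswith("gen_")]
total_cols = [c for c in check_df.columns if c.endswith("_general_total")]
candidate_sum = check_df[candidate_cols].sum(axis=1)
total_sum = check_df[total_cols].sum(axis=1)

# Counties where the two sums disagree (expected to be empty)
mismatches = check_df.loc[candidate_sum != total_sum, "county"]
print(mismatches.tolist())
```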


Now, we save the cleaned dataframe into the processed directory.

In [21]:
# Save the cleaned dataframe to CSV
out_dir = Path(OUTPUT_PATH)
out_dir.mkdir(parents=True, exist_ok=True)
general_pivot.to_csv(out_dir / "CA.csv", index=False)
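
After saving, a round-trip read can confirm the file matches the dataframe. The following is a minimal sketch that writes to a temporary directory instead of the real output path; `toy_out` and its three columns are a cut-down, illustrative stand-in for the full table.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Cut-down stand-in for the final table (values taken from the preview above)
toy_out = pd.DataFrame({
    "county":            ["Alameda", "Alpine"],
    "gen_dem_OBAMA":     [489106, 422],
    "dem_general_total": [489106, 422],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "CA.csv"
    # index=False avoids writing the row index as an extra unnamed column
    toy_out.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

# Raises AssertionError if values or column names differ after the round trip
pd.testing.assert_frame_equal(toy_out, reloaded, check_dtype=False)
print(reloaded.shape)
```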