# Arkansas 2008 Presidential Elections: Data Cleaning & Preprocessing

**Goal:** Build a clean, analysis-ready county-level table of the 2008 presidential general election results in Arkansas, then derive summary statistics (party totals). Note that no presidential primary results are available for Arkansas 2008 so far.

**Output**: A single CSV where each row is a county and columns include:

- General per-candidate vote counts (prefixed with `gen_`)
- Party totals: `rep_general_total`, `dem_general_total`, `lib_general_total`, `cst_general_total`, `grn_general_total`, `ind_general_total`, `psl_general_total`

**Last Updated**: 2025/10/01

## 0. Library Import

In [17]:
import re
import pandas as pd
import numpy as np
from pathlib import Path

## 1. Inputs & Parameters

Define raw file paths once here so the entire notebook is easy to rerun on another machine. If a path changes, we only update it here. We keep a single `OUTPUT_PATH` so all exports land in one known place.

In [18]:
# AR 2008 dataset path
PRIMARY_PATH = r"../../data/raw/2008/AR/20080520__ar__primary__precinct.csv"
GENERAL_PATH = r"../../data/raw/2008/AR/20081104__ar__general__precinct.csv"

# Output directory
OUTPUT_PATH  = r"../../data/processed/2008/AR/"

# Analysis parameters
DISPLAY_ROWS = 10   # Number of rows to display in dataframes

## 2. Load & Filter

We load primary and general datasets separately and immediately subset to the rows we truly need:

- Restrict `office` to 'President' to avoid mixing down-ballot contests

- Remove columns that are fully missing or irrelevant post-filter (e.g., a district column that’s empty for county-level rows)
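
The two filtering rules above can be sketched on a toy dataframe. The helper name `filter_office` and the toy values are assumptions made for illustration; the notebook itself applies these steps inline in the cells that follow.

```python
import pandas as pd

def filter_office(df: pd.DataFrame, office: str = "President") -> pd.DataFrame:
    """Keep rows for one office, then drop columns that are entirely missing."""
    subset = df[df["office"] == office]
    # how="all" only drops a column when every remaining value is missing
    return subset.dropna(axis=1, how="all").reset_index(drop=True)

# Toy data (hypothetical values): 'district' is empty for presidential rows
toy = pd.DataFrame({
    "county": ["Newton County", "Newton County", "Saline County"],
    "office": ["President", "President", "State Representative"],
    "district": [float("nan"), float("nan"), 28.0],
    "votes": [10, 20, 5],
})

filtered = filter_office(toy)
print(filtered)
```

Dropping only fully empty columns is the conservative choice here: a column with even one real value after filtering is kept for manual review.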

### a. Primary Election Dataset

In [19]:
# Load primary data
primary_df = pd.read_csv(PRIMARY_PATH)
primary_df.head(DISPLAY_ROWS)

Unnamed: 0,county,precinct,office,district,party,candidate,votes
0,Saline County,Mountainside Church,State Representative,28,Democrat,Lamont B. Cornwell,0
1,Saline County,Mountainside Church,State Representative,28,Democrat,Barbara Nix,0
2,Saline County,Congo Road Baptist Church,State Representative,28,Democrat,Barbara Nix,0
3,Saline County,Congo Road Baptist Church,State Representative,28,Democrat,Lamont B. Cornwell,0
4,Saline County,Olive Hill Church,State Representative,28,Democrat,Barbara Nix,0
5,Saline County,Olive Hill Church,State Representative,28,Democrat,Lamont B. Cornwell,0
6,Saline County,Trinity Baptist Church,State Representative,28,Democrat,Lamont B. Cornwell,30
7,Saline County,Trinity Baptist Church,State Representative,28,Democrat,Barbara Nix,27
8,Saline County,Tyndall Park,State Representative,28,Democrat,Barbara Nix,93
9,Saline County,Tyndall Park,State Representative,28,Democrat,Lamont B. Cornwell,40


In [20]:
# Different values in 'office' column
primary_df["office"].value_counts()

office
State Representative    3141
State Senate             698
Name: count, dtype: int64

Wait, there are no presidential rows. The current `primary_df` lacks the presidential contest we are interested in, so we stop exploring this dataframe here unless a future analysis calls for it.

### b. General Election Dataset

In [21]:
# Load general data
general_df = pd.read_csv(GENERAL_PATH)
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,precinct,office,district,party,candidate,votes
0,Newton County,Jackson,President,,Democrat,Barack Obama and Joe Biden,71
1,Newton County,Jackson,President,,Libertarian,Bob Barr and Wayne Allyn Root,1
2,Newton County,Jackson,President,,Constitution,Chuck Baldwin and Darrell L. Castle,1
3,Newton County,Jackson,President,,Green,Cynthia McKinney and Rosa Clemente,2
4,Newton County,Jackson,President,,Socialism & Liberation,Gloria La Riva and Eugene Puryear,0
5,Newton County,Jackson,President,,Republican,John McCain and Sarah Palin,123
6,Newton County,Jackson,President,,Independent,Ralph Nader and Matt Gonzalez,9
7,Newton County,Mt. Sherman,President,,Independent,Ralph Nader and Matt Gonzalez,6
8,Newton County,Mt. Sherman,President,,Democrat,Barack Obama and Joe Biden,11
9,Newton County,Mt. Sherman,President,,Libertarian,Bob Barr and Wayne Allyn Root,0


In [22]:
# Different values in 'office' column
general_df["office"].value_counts()

office
President               14854
State Representative     4357
U.S. Senate              4240
U.S. House               3420
State Senate               94
Name: count, dtype: int64

In contrast, `general_df` does contain presidential election rows, so we take a closer look at this dataset.

In [23]:
# Only keep rows where 'office' is 'President'
general_df = general_df[general_df["office"] == "President"]
general_df.shape

(14854, 7)

In [24]:
# Now, drop the "office" column as it's no longer needed
# Also, drop the district column as it's not applicable 
general_df = general_df.drop(columns=["office", "district"]).reset_index(drop=True)
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,precinct,party,candidate,votes
0,Newton County,Jackson,Democrat,Barack Obama and Joe Biden,71
1,Newton County,Jackson,Libertarian,Bob Barr and Wayne Allyn Root,1
2,Newton County,Jackson,Constitution,Chuck Baldwin and Darrell L. Castle,1
3,Newton County,Jackson,Green,Cynthia McKinney and Rosa Clemente,2
4,Newton County,Jackson,Socialism & Liberation,Gloria La Riva and Eugene Puryear,0
5,Newton County,Jackson,Republican,John McCain and Sarah Palin,123
6,Newton County,Jackson,Independent,Ralph Nader and Matt Gonzalez,9
7,Newton County,Mt. Sherman,Independent,Ralph Nader and Matt Gonzalez,6
8,Newton County,Mt. Sherman,Democrat,Barack Obama and Joe Biden,11
9,Newton County,Mt. Sherman,Libertarian,Bob Barr and Wayne Allyn Root,0


Now, we aggregate precinct vote counts into county vote counts.

In [25]:
# Make sure votes are numeric
general_df["votes"] = pd.to_numeric(general_df["votes"], errors="coerce").fillna(0).astype(int)

# Aggregate precinct vote counts into county vote counts
general_df = (
    general_df
    .groupby(["county", "party", "candidate"], as_index=False)["votes"]
    .sum()
)[["county", "candidate", "party", "votes"]]        # Reorder columns

# Preview the aggregated data
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,candidate,party,votes
0,Arkansas County,Chuck Baldwin and Darrell L. Castle,Constitution,39
1,Arkansas County,Barack Obama and Joe Biden,Democrat,1338
2,Arkansas County,Cynthia McKinney and Rosa Clemente,Green,12
3,Arkansas County,Ralph Nader and Matt Gonzalez,Independent,98
4,Arkansas County,Bob Barr and Wayne Allyn Root,Libertarian,19
5,Arkansas County,John McCain and Sarah Palin,Republican,2470
6,Arkansas County,Gloria La Riva and Eugene Puryear,Socialism & Liberation,6
7,Ashley County,Chuck Baldwin and Darrell L. Castle,Constitution,38
8,Ashley County,Barack Obama and Joe Biden,Democrat,2976
9,Ashley County,Cynthia McKinney and Rosa Clemente,Green,54
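
A quick way to gain confidence in a precinct-to-county aggregation like this is to confirm that no votes are lost and that each county/party/candidate combination appears once. The following is a minimal sketch on toy data; the vote values are made up and the variable names are assumptions.

```python
import pandas as pd

# Toy precinct-level rows (hypothetical vote counts)
precincts = pd.DataFrame({
    "county": ["Newton County"] * 4,
    "precinct": ["Jackson", "Jackson", "Mt. Sherman", "Mt. Sherman"],
    "party": ["Democrat", "Republican", "Democrat", "Republican"],
    "candidate": ["Barack Obama and Joe Biden", "John McCain and Sarah Palin"] * 2,
    "votes": [10, 20, 5, 7],
})

counties = (
    precincts
    .groupby(["county", "party", "candidate"], as_index=False)["votes"]
    .sum()
)

# The grand total must be unchanged by the aggregation
assert counties["votes"].sum() == precincts["votes"].sum()
# Each county/party/candidate combination must appear exactly once
assert not counties.duplicated(["county", "party", "candidate"]).any()
print(counties)
```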


While cleaning, note that the `candidate` field holds the full presidential ticket (both president and vice president on one line) rather than just the presidential candidate's name. We split these values and keep only the presidential candidate.

In [26]:
# Candidates in general_df
general_df["candidate"].value_counts()

candidate
Chuck Baldwin and Darrell L. Castle    75
Barack Obama and Joe Biden             75
Cynthia McKinney and Rosa Clemente     75
Ralph Nader and Matt Gonzalez          75
Bob Barr and Wayne Allyn Root          75
John McCain and Sarah Palin            75
Gloria La Riva and Eugene Puryear      75
Name: count, dtype: int64

In [27]:
# Keep only the presidential candidate in the "candidate" column
# (drop everything from the ticket separator onward: " and ", " & ", or "/")
general_df["candidate"] = (
    general_df["candidate"]
        .str.replace(r"(?i)\s+(?:and|&)\s+.*$|\s*/.*$", "", regex=True)
        .str.strip()
)

# Candidates in general_df
general_df["candidate"].value_counts()

candidate
Chuck Baldwin       75
Barack Obama        75
Cynthia McKinney    75
Ralph Nader         75
Bob Barr            75
John McCain         75
Gloria La Riva      75
Name: count, dtype: int64
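
Ticket separators vary between source files, so it can help to test the splitting rule on a few sample strings before applying it to a whole column. The sketch below uses the standard `re` module; the pattern and the helper name `presidential_name` are assumptions for illustration, and only the `and` form is known to occur in this dataset.

```python
import re

# Separators handled: " and ", " & " (surrounded by whitespace), or "/"
TICKET_SPLIT = re.compile(r"\s+(?:and|&)\s+.*$|\s*/.*$", flags=re.IGNORECASE)

def presidential_name(ticket: str) -> str:
    """Return the part of a ticket string before the running-mate separator."""
    return TICKET_SPLIT.sub("", ticket).strip()

samples = [
    "Barack Obama and Joe Biden",
    "John McCain & Sarah Palin",      # hypothetical format
    "Ralph Nader / Matt Gonzalez",    # hypothetical format
    "Gloria La Riva",                 # already a single name
]
cleaned = [presidential_name(s) for s in samples]
print(cleaned)
```

Requiring whitespace around `and` keeps surnames that merely contain those letters (for example, Anderson) intact.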

In [28]:
# List out all the parties in the general election data
general_df["party"].value_counts()

party
Constitution              75
Democrat                  75
Green                     75
Independent               75
Libertarian               75
Republican                75
Socialism & Liberation    75
Name: count, dtype: int64

In [29]:
# Missing values count
general_df.isnull().sum()

county       0
candidate    0
party        0
votes        0
dtype: int64

In [30]:
# Final look at the cleaned general_df
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,candidate,party,votes
0,Arkansas County,Chuck Baldwin,Constitution,39
1,Arkansas County,Barack Obama,Democrat,1338
2,Arkansas County,Cynthia McKinney,Green,12
3,Arkansas County,Ralph Nader,Independent,98
4,Arkansas County,Bob Barr,Libertarian,19
5,Arkansas County,John McCain,Republican,2470
6,Arkansas County,Gloria La Riva,Socialism & Liberation,6
7,Ashley County,Chuck Baldwin,Constitution,38
8,Ashley County,Barack Obama,Democrat,2976
9,Ashley County,Cynthia McKinney,Green,54


In [31]:
# Shape after preprocessing
general_df.shape

(525, 4)

## 3. Table Pivoting

We convert the tall table (one row per county/party/candidate) into a wide table (one row per county, one column per candidate). This keeps the schema consistent with previously cleaned datasets.

Helper functions:

- `normalize_party(s)`: maps full party names to short lowercase codes (e.g., `Democrat` -> `dem`) so column names stay consistent with other dataframes
- `candidate_token(name)`: turns “Barack Obama” -> OBAMA, “John McCain” -> MCCAIN, etc., creating a short, readable, unique token for column names
- `pivot_wide(df, prefix, key_col="county")`: main pivot function

    * groups by `county` × `party` × `candidate` and sums `votes`,
    * pivots to columns named like:
        * Primary: `pri_dem_OBAMA`, `pri_rep_MCCAIN`,...
        * General: `gen_dem_OBAMA`, `gen_rep_MCCAIN`,...

    * flattens the MultiIndex into plain column strings,
    * returns one wide row per county
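
Before the full helper definitions, the pivot-and-flatten steps listed above can be sketched on a three-row toy table. The toy counties and vote counts are made up, and the columns `party_key` and `candidate_token` are assumed to have been computed already.

```python
import pandas as pd

# Toy tall table (hypothetical values); B County has no Republican row
tall = pd.DataFrame({
    "county": ["A County", "A County", "B County"],
    "party_key": ["dem", "rep", "dem"],
    "candidate_token": ["OBAMA", "MCCAIN", "OBAMA"],
    "votes": [10, 20, 5],
})

wide = tall.pivot_table(
    index="county",
    columns=["party_key", "candidate_token"],
    values="votes",
    aggfunc="sum",
    fill_value=0,          # missing county/candidate combinations become 0
)

# Columns are a two-level MultiIndex of (party_key, candidate_token) tuples
wide.columns = [f"gen_{p}_{c}" for p, c in wide.columns]
wide = wide.reset_index()
print(wide)
```

Note that `fill_value=0` treats a missing row as zero votes, which is only appropriate when every candidate appeared on the ballot in every county.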

In [36]:
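def normalize_party(s: pd.Series) -> pd.Series:
    """
    Normalize party names to short codes: Democrat -> dem, Republican -> rep, etc.
    Note: str.capitalize() lowercases everything after the first character,
    so the map key for PSL is "Socialism & liberation"
    """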
    return(s.str.strip()
           .str.capitalize()
           .map({
                "Democrat"       : "dem", 
                "Republican"     : "rep",
                "Libertarian"    : "lib",
                "Constitution"   : "cst",
                "Green"          : "grn",
                "Independent"    : "ind",
                "Socialism & liberation" : "psl"
               })
           .fillna(s.str.strip().str.lower()))      # For defensive purposes only, would not expect other parties

In [37]:
SUFFIXES = {
    "JR","SR","JNR","SNR",
    "II","III","IV","V","VI","VII","VIII","IX","X","XI","XII"
}

def candidate_token(name: str) -> str:
    """
    Turn John McCain -> MCCAIN, Barack Obama -> OBAMA
    Skip suffixes, keep the last name token, uppercase it, and remove punctuation
    """
    if pd.isna(name):
        return "UNKNOWN"                # Defensive purposes only, would not expect missing values
    
    # Trim surrounding whitespace
    raw = str(name).strip()

    # If a comma exists, treat as 'LAST, FIRST ...'
    if "," in raw:
        last_part = raw.split(",", 1)[0]
        last_part = re.sub(r"[^A-Za-z0-9\s]+", "", last_part).strip().upper()
        tokens = last_part.split()
        return tokens[-1] if tokens else "UNKNOWN"

    # Otherwise: remove punctuation, split, then drop trailing suffixes
    tokens = re.sub(r"[^A-Za-z0-9\s]+", "", raw).strip().upper().split()
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return tokens[-1] if tokens else "UNKNOWN"

In [38]:
def pivot_wide(df: pd.DataFrame, prefix: str, key_col: str="county") -> pd.DataFrame:
    """
    Pivot the dataframe to wide format based on party and candidate
    """
    # Work on a copy so the caller's dataframe is not modified
    df = df.copy()

    # Normalize party names
    df['party_key'] = normalize_party(df['party'])
    
    # Create candidate tokens
    df['candidate_token'] = df['candidate'].apply(candidate_token)
    
    # Pivot the dataframe
    pivot_df = df.pivot_table(index=key_col, 
                              columns=["party_key", "candidate_token"], 
                              values="votes", 
                              aggfunc='sum', 
                              fill_value=0)
    
    # Flatten multi-level columns
    pivot_df.columns = [f"{prefix}_{p}_{c}" for p, c in pivot_df.columns]
    
    # Reset index to turn key_col back into a column
    pivot_df = pivot_df.reset_index()
    
    return pivot_df

In [39]:
# General dataframe pivot
general_pivot = pivot_wide(general_df, prefix="gen")
general_pivot.head(DISPLAY_ROWS)

Unnamed: 0,county,gen_cst_BALDWIN,gen_dem_OBAMA,gen_grn_MCKINNEY,gen_ind_NADER,gen_lib_BARR,gen_psl_RIVA,gen_rep_MCCAIN
0,Arkansas County,39,1338,12,98,19,6,2470
1,Ashley County,38,2976,54,112,39,17,4268
2,Baxter County,96,3296,75,278,125,16,6761
3,Benton County,318,23331,143,706,417,34,29052
4,Boone County,81,1917,63,209,94,17,4966
5,Bradley County,17,1680,27,32,11,12,2262
6,Calhoun County,8,691,6,29,14,7,1462
7,Carroll County,53,2220,58,131,76,6,4043
8,Chicot County,2,3043,7,24,13,0,2119
9,Clark County,33,2269,29,116,31,10,2975


In [40]:
# General dataframe shape after pivot
general_pivot.shape

(75, 8)

## 4. Adding Party Total Columns

Now, we add party total columns for the general election:

* `rep_general_total` = sum of all `gen_rep_*` columns
* `dem_general_total` = sum of all `gen_dem_*` columns
* `lib_general_total` = sum of all `gen_lib_*` columns
* `cst_general_total` = sum of all `gen_cst_*` columns
* `grn_general_total` = sum of all `gen_grn_*` columns
* `ind_general_total` = sum of all `gen_ind_*` columns
* `psl_general_total` = sum of all `gen_psl_*` columns
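
The totals listed above all follow one rule, so they can also be computed in a loop over party codes. This is a compact sketch rather than the notebook's own cell; the helper name `add_party_totals`, the `PARTY_CODES` list, and the toy values are assumptions.

```python
import pandas as pd

PARTY_CODES = ["rep", "dem", "lib", "cst", "grn", "ind", "psl"]

def add_party_totals(wide: pd.DataFrame, prefix: str = "gen",
                     label: str = "general") -> pd.DataFrame:
    """Add one '<party>_<label>_total' column per party code."""
    out = wide.copy()
    for code in PARTY_CODES:
        # Trailing underscore avoids matching a longer code with the same stem
        cols = [c for c in out.columns if c.startswith(f"{prefix}_{code}_")]
        out[f"{code}_{label}_total"] = out[cols].sum(axis=1) if cols else 0
    return out

# Toy wide table (hypothetical values) with only two parties present
toy = pd.DataFrame({
    "county": ["A County", "B County"],
    "gen_dem_OBAMA": [10, 5],
    "gen_rep_MCCAIN": [20, 7],
})
with_totals = add_party_totals(toy)
print(with_totals.filter(like="_total"))
```

Parties with no matching columns get a total of 0, which keeps the output schema identical across states.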

In [41]:
# Add party totals for general election
rep_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_rep")] 
dem_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_dem")]
lib_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_lib")]
cst_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_cst")]
grn_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_grn")]
ind_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_ind")]
psl_general_cols    = [c for c in general_pivot.columns if c.startswith("gen_psl")]


general_pivot["rep_general_total"] = general_pivot[rep_general_cols].sum(axis=1) if rep_general_cols else 0
general_pivot["dem_general_total"] = general_pivot[dem_general_cols].sum(axis=1) if dem_general_cols else 0
general_pivot["lib_general_total"] = general_pivot[lib_general_cols].sum(axis=1) if lib_general_cols else 0
general_pivot["cst_general_total"] = general_pivot[cst_general_cols].sum(axis=1) if cst_general_cols else 0
general_pivot["grn_general_total"] = general_pivot[grn_general_cols].sum(axis=1) if grn_general_cols else 0
general_pivot["ind_general_total"] = general_pivot[ind_general_cols].sum(axis=1) if ind_general_cols else 0
general_pivot["psl_general_total"] = general_pivot[psl_general_cols].sum(axis=1) if psl_general_cols else 0

In [44]:
# Print out all the column names in the final dataframe
print("Final columns in the cleaned general dataframe:")
general_pivot.columns

Final columns in the cleaned general dataframe:


Index(['county', 'gen_cst_BALDWIN', 'gen_dem_OBAMA', 'gen_grn_MCKINNEY',
       'gen_ind_NADER', 'gen_lib_BARR', 'gen_psl_RIVA', 'gen_rep_MCCAIN',
       'rep_general_total', 'dem_general_total', 'lib_general_total',
       'cst_general_total', 'grn_general_total', 'ind_general_total',
       'psl_general_total'],
      dtype='object')

In [42]:
# Preview the general_pivot dataframe with totals
general_pivot.head(DISPLAY_ROWS)

Unnamed: 0,county,gen_cst_BALDWIN,gen_dem_OBAMA,gen_grn_MCKINNEY,gen_ind_NADER,gen_lib_BARR,gen_psl_RIVA,gen_rep_MCCAIN,rep_general_total,dem_general_total,lib_general_total,cst_general_total,grn_general_total,ind_general_total,psl_general_total
0,Arkansas County,39,1338,12,98,19,6,2470,2470,1338,19,39,12,98,6
1,Ashley County,38,2976,54,112,39,17,4268,4268,2976,39,38,54,112,17
2,Baxter County,96,3296,75,278,125,16,6761,6761,3296,125,96,75,278,16
3,Benton County,318,23331,143,706,417,34,29052,29052,23331,417,318,143,706,34
4,Boone County,81,1917,63,209,94,17,4966,4966,1917,94,81,63,209,17
5,Bradley County,17,1680,27,32,11,12,2262,2262,1680,11,17,27,32,12
6,Calhoun County,8,691,6,29,14,7,1462,1462,691,14,8,6,29,7
7,Carroll County,53,2220,58,131,76,6,4043,4043,2220,76,53,58,131,6
8,Chicot County,2,3043,7,24,13,0,2119,2119,3043,13,2,7,24,0
9,Clark County,33,2269,29,116,31,10,2975,2975,2269,31,33,29,116,10


Now, we save the cleaned dataframe into the processed directory.

In [46]:
# Save the cleaned dataframe to CSV
out_dir = Path(OUTPUT_PATH)
out_dir.mkdir(parents=True, exist_ok=True)
general_pivot.to_csv(out_dir / "AR.csv", index=False)
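
After exporting, a read-back check can confirm that the file on disk has the expected shape and one row per county. The sketch below writes a toy dataframe to a temporary directory rather than the real output path; the helper name `roundtrip_ok` and the file name `AR_check.csv` are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

def roundtrip_ok(df: pd.DataFrame, path: Path) -> bool:
    """Write df to CSV, read it back, and compare basic structure."""
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    return (
        reloaded.shape == df.shape
        and list(reloaded.columns) == list(df.columns)
        and bool(reloaded["county"].is_unique)   # one row per county
    )

# Toy data (hypothetical values)
toy = pd.DataFrame({"county": ["A County", "B County"], "gen_dem_OBAMA": [10, 5]})
with tempfile.TemporaryDirectory() as tmp:
    ok = roundtrip_ok(toy, Path(tmp) / "AR_check.csv")
print(ok)
```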