# Illinois 2008 Presidential Elections: Data Cleaning & Preprocessing

**Goal:** Build a clean, analysis-ready county-level table for Illinois (2008) by merging the presidential primary and presidential general election results, then derive summary statistics (party totals).

**Output**: A single CSV where each row is a county and columns include:

- Primary per-candidate vote counts (prefixed with `pri_`)
- General per-candidate vote counts (prefixed with `gen_`)
- Party totals: `rep_primary_total`, `dem_primary_total`, `grn_primary_total`, `rep_general_total`, `dem_general_total`, `lib_general_total`, `grn_general_total`, `cpi_general_total`, `ind_general_total`, `new_general_total`, `wri_general_total`

**Last Updated**: 2025/10/01

## 0. Library Import

In [3]:
import re
import pandas as pd
import numpy as np
from pathlib import Path



## 1. Inputs & Parameters

Define raw file paths once here so the entire notebook is easy to rerun on another machine. If a path changes, we only update it here. We keep a single `OUTPUT_PATH` so all exports land in one known place.

In [4]:
# IL 2008 dataset path
PRIMARY_PATH = r"../../data/raw/2008/IL/20080205__il__primary__county.csv"
GENERAL_PATH = r"../../data/raw/2008/IL/20081104__il__general__county.csv"

# Output directory
OUTPUT_PATH  = r"../../data/processed/2008/IL/"

# Analysis parameters
DISPLAY_ROWS = 10   # Number of rows to display in dataframes
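
Because every later cell depends on these paths, it can help to fail fast when an input file is missing. The following is a minimal sketch, not part of the original notebook; `missing_inputs` is a hypothetical helper name, and the temporary files exist only for the demonstration.

```python
from pathlib import Path
import tempfile

def missing_inputs(*paths: str) -> list[str]:
    """Return the paths that do not point to an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]

# Demonstration with one temporary file that exists and one path that does not
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "present.csv"
    present.write_text("county,votes\n")
    absent = Path(tmp) / "absent.csv"
    report = missing_inputs(str(present), str(absent))

print(report)  # lists only the path ending in absent.csv
```

In this notebook the call would presumably be `missing_inputs(PRIMARY_PATH, GENERAL_PATH)`, raising or stopping if the returned list is non-empty.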

## 2. Load & Filter

We load primary and general datasets separately and immediately subset to the rows we truly need:

- Restrict `office` to 'President' to avoid mixing down-ballot contests

- Remove columns that are fully missing or irrelevant post-filter (e.g., a district column that’s empty for county-level rows)
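
The two filtering steps can be sketched as one small helper. This is an illustration on an in-memory frame rather than the notebook's own code: `filter_president` is a hypothetical name, and the sample rows are copied from the previews shown in this notebook.

```python
import pandas as pd

def filter_president(df: pd.DataFrame) -> pd.DataFrame:
    """Keep presidential rows, then drop 'office' and any all-missing columns."""
    out = df[df["office"] == "President"].drop(columns=["office"])
    out = out.dropna(axis=1, how="all")   # e.g. an empty 'district' column
    return out.reset_index(drop=True)

sample = pd.DataFrame({
    "county":    ["DuPAGE", "ADAMS", "ALEXANDER"],
    "office":    ["14TH REPUBLICAN DELEGATE", "President", "President"],
    "district":  [None, None, None],
    "party":     ["REP", "DEM", "DEM"],
    "candidate": ["DENNIS WIGGINS", "BARACK OBAMA", "BARACK OBAMA"],
    "votes":     [81, 3713, 1097],
})

president_only = filter_president(sample)
print(president_only)
```

Dropping columns with `how="all"` only removes a column when every remaining value is missing, so a partially filled column such as `party` in the general data would be kept.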

### a. Primary Election Dataset

In [5]:
# Load primary data
primary_df = pd.read_csv(PRIMARY_PATH)
primary_df.head(DISPLAY_ROWS)

Unnamed: 0,county,office,district,party,candidate,votes
0,DuPAGE,14TH REPUBLICAN DELEGATE,,REP,DENNIS WIGGINS,81
1,HENRY,14TH REPUBLICAN DELEGATE,,REP,DENNIS WIGGINS,31
2,KANE,14TH REPUBLICAN DELEGATE,,REP,DENNIS WIGGINS,1027
3,KENDALL,14TH REPUBLICAN DELEGATE,,REP,DENNIS WIGGINS,277
4,LEE,14TH REPUBLICAN DELEGATE,,REP,DENNIS WIGGINS,131
5,WHITESIDE,14TH REPUBLICAN DELEGATE,,REP,DENNIS WIGGINS,18
6,BUREAU,14TH REPUBLICAN DELEGATE,,REP,LARRY D. WEGMAN,2
7,DeKALB,14TH REPUBLICAN DELEGATE,,REP,LARRY D. WEGMAN,73
8,DuPAGE,14TH REPUBLICAN DELEGATE,,REP,LARRY D. WEGMAN,65
9,HENRY,14TH REPUBLICAN DELEGATE,,REP,LARRY D. WEGMAN,26


In [6]:
# Different values in 'office' column
primary_df["office"].value_counts()

office
President                                 2040
19TH REPUBLICAN DELEGATE                   720
19TH REPUBLICAN ALTERNATE DELEGATE         720
15TH REPUBLICAN DELEGATE                   550
15TH REPUBLICAN ALTERNATE DELEGATE         528
                                          ... 
1ST SUPREME - McMORROW VACANCY               1
11TH CIRCUIT- COOGAN VACANCY                 1
13TH CIRCUIT- CARTER VACANCY                 1
3RD SUBCIRCUIT - DONNERSBERGER VACANCY       1
11TH CIRCUIT- FROBISH VACANCY                1
Name: count, Length: 145, dtype: int64

In [7]:
# Only keep rows where 'office' is 'President'
primary_df = primary_df[primary_df["office"] == "President"]
primary_df.shape

(2040, 6)

In [8]:
# Now, drop the "office" column as it's no longer needed. Also, drop the district column
primary_df = primary_df.drop(columns=["office", "district"]).reset_index(drop=True)
primary_df.head(DISPLAY_ROWS)

Unnamed: 0,county,party,candidate,votes
0,ADAMS,DEM,BARACK OBAMA,3713
1,ALEXANDER,DEM,BARACK OBAMA,1097
2,BOND,DEM,BARACK OBAMA,921
3,BOONE,DEM,BARACK OBAMA,2652
4,BROWN,DEM,BARACK OBAMA,313
5,BUREAU,DEM,BARACK OBAMA,2432
6,CALHOUN,DEM,BARACK OBAMA,446
7,CARROLL,DEM,BARACK OBAMA,1160
8,CASS,DEM,BARACK OBAMA,679
9,CHAMPAIGN,DEM,BARACK OBAMA,17033


In [9]:
# Unique parties in primary_df
primary_df["party"].value_counts()

party
REP    918
DEM    714
GRN    408
Name: count, dtype: int64

In [10]:
# Candidates in primary_df
primary_df["candidate"].value_counts()

candidate
BARACK OBAMA                        102
HILLARY CLINTON                     102
RUDY GIULIANI                       102
TOM TANCREDO                        102
JAMES CREIGHTON MITCHELL, JR.       102
ALAN KEYES                          102
FRED THOMPSON                       102
JOHN McCAIN                         102
MITT ROMNEY                         102
RON PAUL                            102
HOWIE HAWKINS                       102
JARED A. BALL                       102
CYNTHIA McKINNEY                    102
KENT PHILIP MESPLAY                 102
JOE BIDEN                           102
CHRISTOPHER JOHN DODD               102
DENNIS J. KUCINICH                  102
WILLIAM "BILL" B. RICHARDSON III    102
JOHN EDWARDS                        102
MIKE HUCKABEE                       102
Name: count, dtype: int64

In [11]:
# Missing values count
primary_df.isnull().sum()

county       0
party        0
candidate    0
votes        0
dtype: int64

In [15]:
# Count the duplicate rows
duplicate_rows = primary_df.duplicated()
duplicate_rows.sum()

0

In [12]:
# Final look at the cleaned primary_df
primary_df.head(DISPLAY_ROWS)

Unnamed: 0,county,party,candidate,votes
0,ADAMS,DEM,BARACK OBAMA,3713
1,ALEXANDER,DEM,BARACK OBAMA,1097
2,BOND,DEM,BARACK OBAMA,921
3,BOONE,DEM,BARACK OBAMA,2652
4,BROWN,DEM,BARACK OBAMA,313
5,BUREAU,DEM,BARACK OBAMA,2432
6,CALHOUN,DEM,BARACK OBAMA,446
7,CARROLL,DEM,BARACK OBAMA,1160
8,CASS,DEM,BARACK OBAMA,679
9,CHAMPAIGN,DEM,BARACK OBAMA,17033


In [13]:
# Shape after preprocessing
primary_df.shape

(2040, 4)

### b. General Election Dataset

In [16]:
# Load general data
general_df = pd.read_csv(GENERAL_PATH)
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,office,district,party,candidate,votes
0,ADAMS,President,,NEW,JOHN JOSEPH POLACHEK,1
1,ALEXANDER,President,,NEW,JOHN JOSEPH POLACHEK,2
2,BOND,President,,NEW,JOHN JOSEPH POLACHEK,1
3,BOONE,President,,NEW,JOHN JOSEPH POLACHEK,6
4,BROWN,President,,NEW,JOHN JOSEPH POLACHEK,1
5,BUREAU,President,,NEW,JOHN JOSEPH POLACHEK,4
6,CALHOUN,President,,NEW,JOHN JOSEPH POLACHEK,0
7,CARROLL,President,,NEW,JOHN JOSEPH POLACHEK,2
8,CASS,President,,NEW,JOHN JOSEPH POLACHEK,0
9,CHAMPAIGN,President,,NEW,JOHN JOSEPH POLACHEK,14


In [17]:
# Different values in 'office' column
general_df["office"].value_counts()

office
President                              992
U.S. Senate                            524
State House                            433
U.S. House                             391
State Senate                           222
                                      ... 
11TH CIRCUIT- COOGAN VACANCY             1
11TH CIRCUIT- FROBISH VACANCY            1
10TH SUBCIRCUIT - MORRISSEY VACANCY      1
5TH CIRCUIT- ANDREWS VACANCY             1
14TH SUBCIRCUIT - HENRY VACANCY          1
Name: count, Length: 221, dtype: int64

In [18]:
# Only keep rows where 'office' is 'President'
general_df = general_df[general_df["office"] == "President"]
general_df.shape

(992, 6)

In [19]:
# Now, drop the "office" column as it's no longer needed. Also, drop the district column
general_df = general_df.drop(columns=["office", "district"]).reset_index(drop=True)
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,party,candidate,votes
0,ADAMS,NEW,JOHN JOSEPH POLACHEK,1
1,ALEXANDER,NEW,JOHN JOSEPH POLACHEK,2
2,BOND,NEW,JOHN JOSEPH POLACHEK,1
3,BOONE,NEW,JOHN JOSEPH POLACHEK,6
4,BROWN,NEW,JOHN JOSEPH POLACHEK,1
5,BUREAU,NEW,JOHN JOSEPH POLACHEK,4
6,CALHOUN,NEW,JOHN JOSEPH POLACHEK,0
7,CARROLL,NEW,JOHN JOSEPH POLACHEK,2
8,CASS,NEW,JOHN JOSEPH POLACHEK,0
9,CHAMPAIGN,NEW,JOHN JOSEPH POLACHEK,14


In [20]:
# List out all the parties in the general election data
general_df["party"].value_counts()

party
NEW    102
IND    102
CPI    102
GRN    102
LIB    102
DEM    102
REP    102
Name: count, dtype: int64

In [26]:
# Missing values count
general_df.isnull().sum()

county         0
party        278
candidate      0
votes          0
dtype: int64

Notice that the `party` column has 278 missing values. We create a sub-dataframe of those rows to see whether the missing data can be resolved.

In [32]:
# Rows that have missing values in party
general_missing_df = general_df[general_df["party"].isnull()]
general_missing_df.head(DISPLAY_ROWS)

Unnamed: 0,county,party,candidate,votes
714,ADAMS,,RON PAUL,0
715,ALEXANDER,,RON PAUL,0
716,BOND,,RON PAUL,0
717,BOONE,,RON PAUL,0
718,BUREAU,,RON PAUL,0
719,CALHOUN,,RON PAUL,0
720,CARROLL,,RON PAUL,0
721,CASS,,RON PAUL,0
722,CHAMPAIGN,,RON PAUL,0
723,CHRISTIAN,,RON PAUL,0


In [33]:
# Names of candidates with missing party affiliation
general_missing_df["candidate"].value_counts()

candidate
RON PAUL             94
FRANK JAMES MOORE    75
DONALD K. ALLEN      56
RONALD G. HOBBS      53
Name: count, dtype: int64

In the Illinois 2008 presidential general election, all four were recorded as write-in candidates. So, we fill the missing `party` values with `WRI` for consistency.

In [34]:
# Fill in missing values in 'party' column with 'WRI' (Write-In)
general_df["party"] = general_df["party"].fillna("WRI")
general_df.isnull().sum()

county       0
party        0
candidate    0
votes        0
dtype: int64
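
A blanket `fillna` is only safe if none of the unlabeled candidates also appears elsewhere with a party label; otherwise one candidate would be split across two party columns after pivoting. The sketch below shows one way to check this on toy rows shaped like `general_df` (values taken from the previews in this notebook); the variable names are illustrative, not the notebook's own.

```python
import pandas as pd

# Toy rows shaped like general_df
toy = pd.DataFrame({
    "county":    ["ADAMS", "ADAMS", "BOND", "BOND"],
    "party":     ["DEM", None, "REP", None],
    "candidate": ["BARACK OBAMA", "RON PAUL", "JOHN McCAIN", "RON PAUL"],
    "votes":     [11794, 0, 3947, 0],
})

missing = toy["party"].isna()
# Candidates with at least one row lacking a party label
unlabeled = set(toy.loc[missing, "candidate"])
# Candidates with at least one row carrying a party label
labeled = set(toy.loc[~missing, "candidate"])

# A non-empty intersection would mean the fill splits a candidate across two labels
conflicts = unlabeled & labeled
print(conflicts)

toy["party"] = toy["party"].fillna("WRI")
```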

In [35]:
# Final look at the cleaned general_df
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,party,candidate,votes
0,ADAMS,NEW,JOHN JOSEPH POLACHEK,1
1,ALEXANDER,NEW,JOHN JOSEPH POLACHEK,2
2,BOND,NEW,JOHN JOSEPH POLACHEK,1
3,BOONE,NEW,JOHN JOSEPH POLACHEK,6
4,BROWN,NEW,JOHN JOSEPH POLACHEK,1
5,BUREAU,NEW,JOHN JOSEPH POLACHEK,4
6,CALHOUN,NEW,JOHN JOSEPH POLACHEK,0
7,CARROLL,NEW,JOHN JOSEPH POLACHEK,2
8,CASS,NEW,JOHN JOSEPH POLACHEK,0
9,CHAMPAIGN,NEW,JOHN JOSEPH POLACHEK,14


In [36]:
# Shape after preprocessing
general_df.shape

(992, 4)

## 3. Table Pivoting

We convert the tall table (one row per county/party/candidate) into a wide one (one row per county, with one column per candidate). This keeps the schema consistent with the group's previously cleaned data.

Helper functions:

- `normalize_party(s)`: lowercases the party abbreviations so column names stay consistent with the other dataframes
- `candidate_token(name)`: builds a short, readable token for column names from the candidate's surname, e.g. “BARACK OBAMA” -> `OBAMA`, “JOHN McCAIN” -> `MCCAIN`. The tokens are unique within this dataset
- `pivot_wide(df, prefix, key_col="county")`: the main pivot function, which
    * groups by `county` × `party` × `candidate` and sums `votes`,
    * pivots to columns named like:
        * Primary: `pri_dem_OBAMA`, `pri_rep_MCCAIN`,...
        * General: `gen_dem_OBAMA`, `gen_rep_MCCAIN`,...
    * flattens the MultiIndex into plain column strings,
    * returns one wide row per county

In [37]:
def normalize_party(s: pd.Series) -> pd.Series:
    """
    Normalize party names: Lowercase the three-letter abbreviations
    """
    return s.str.lower()

In [55]:
SUFFIXES = {
    "JR","SR","JNR","SNR",
    "II","III","IV","V","VI","VII","VIII","IX","X","XI","XII"
}

def candidate_token(name: str) -> str:
    """
    Turn John McCain -> MCCAIN, Barack Obama -> OBAMA
    Skip suffixes, keep last name/token, capitalize, and remove punctuation
    """
    if pd.isna(name):
        return "UNKNOWN"                # Defensive purposes only, would not expect missing values
    
    # Trim surrounding whitespace
    raw = str(name).strip()

    # If a comma exists, the surname sits in the part before it
    # (covers both 'LAST, FIRST ...' and 'FIRST LAST, JR.')
    if "," in raw:
        last_part = raw.split(",", 1)[0]
        last_part = re.sub(r"[^A-Za-z0-9\s]+", "", last_part).strip().upper()
        tokens = last_part.split()
        return tokens[-1] if tokens else "UNKNOWN"

    # Otherwise: remove punctuation, split, then drop trailing suffixes
    tokens = re.sub(r"[^A-Za-z0-9\s]+", "", raw).strip().upper().split()
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return tokens[-1] if tokens else "UNKNOWN"
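
Surname-based tokens can collide when two candidates of the same party share a last name, in which case the pivot would silently sum their votes into one column. The sketch below shows one way to detect that. `find_token_collisions` is a hypothetical helper, `simple_token` is a simplified stand-in for `candidate_token`, and the "SMITH" names are invented purely to show what a collision looks like.

```python
import pandas as pd

def find_token_collisions(df: pd.DataFrame, token_fn) -> pd.DataFrame:
    """Return (party, token) pairs that map to more than one distinct candidate."""
    pairs = df[["party", "candidate"]].drop_duplicates()
    pairs = pairs.assign(token=pairs["candidate"].map(token_fn))
    counts = pairs.groupby(["party", "token"])["candidate"].nunique()
    return counts[counts > 1].reset_index(name="n_candidates")

def simple_token(name: str) -> str:
    """Simplified stand-in: last whitespace-separated word, uppercased."""
    return name.split()[-1].upper()

# Names that appear in this notebook: no collision
real = pd.DataFrame({
    "party": ["DEM", "DEM", "REP"],
    "candidate": ["BARACK OBAMA", "HILLARY CLINTON", "JOHN McCAIN"],
})
no_clash = find_token_collisions(real, simple_token)

# Hypothetical names, only to illustrate a collision
invented = pd.DataFrame({
    "party": ["IND", "IND"],
    "candidate": ["JANE SMITH", "JOHN SMITH"],
})
clash = find_token_collisions(invented, simple_token)
print(clash)
```

Passing the notebook's own `candidate_token` as `token_fn` would run the same check against the real tokenizer.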

In [56]:
def pivot_wide(df: pd.DataFrame, prefix: str, key_col: str="county") -> pd.DataFrame:
    """
    Pivot the dataframe to wide format based on party and candidate
    """
    # Work on a copy so the caller's dataframe is not modified
    df = df.copy()

    # Normalize party names
    df['party_key'] = normalize_party(df['party'])
    
    # Create candidate tokens
    df['candidate_token'] = df['candidate'].apply(candidate_token)
    
    # Pivot the dataframe
    pivot_df = df.pivot_table(index=key_col, 
                              columns=["party_key", "candidate_token"], 
                              values="votes", 
                              aggfunc='sum', 
                              fill_value=0)
    
    # Flatten multi-level columns
    pivot_df.columns = [f"{prefix}_{p}_{c}" for p, c in pivot_df.columns]
    
    # Reset index to turn key_col back into a column
    pivot_df = pivot_df.reset_index()
    
    return pivot_df

In [57]:
# Primary dataframe pivot
primary_pivot = pivot_wide(primary_df, prefix="pri")
primary_pivot.head(DISPLAY_ROWS)

Unnamed: 0,county,pri_dem_BIDEN,pri_dem_CLINTON,pri_dem_DODD,pri_dem_EDWARDS,pri_dem_KUCINICH,pri_dem_OBAMA,pri_dem_RICHARDSON,pri_grn_BALL,pri_grn_HAWKINS,...,pri_grn_MESPLAY,pri_rep_GIULIANI,pri_rep_HUCKABEE,pri_rep_KEYES,pri_rep_MCCAIN,pri_rep_MITCHELL,pri_rep_PAUL,pri_rep_ROMNEY,pri_rep_TANCREDO,pri_rep_THOMPSON
0,ADAMS,10,2534,0,237,2,3713,17,0,0,...,0,50,1535,17,3137,1,256,2148,0,53
1,ALEXANDER,5,725,2,156,4,1097,8,0,0,...,0,3,118,1,137,0,15,130,0,5
2,BOND,7,916,4,71,4,921,9,0,0,...,0,9,420,9,603,0,115,382,0,17
3,BOONE,9,2042,5,96,2,2652,5,1,0,...,0,78,983,19,2974,4,397,1856,3,63
4,BROWN,3,247,1,63,1,313,2,1,0,...,1,16,190,3,375,0,31,141,0,10
5,BUREAU,6,1489,1,113,5,2432,12,1,4,...,2,25,653,5,1374,2,106,1014,1,18
6,CALHOUN,7,606,1,86,1,446,3,0,0,...,0,2,70,1,178,0,27,103,0,3
7,CARROLL,3,590,3,40,1,1160,2,0,2,...,2,10,292,3,681,2,127,504,0,22
8,CASS,4,497,0,48,3,679,4,0,0,...,0,9,199,6,415,4,33,256,1,8
9,CHAMPAIGN,31,5515,7,345,133,17033,38,16,26,...,17,187,2687,43,5689,2,1060,5552,7,197


In [58]:
# Primary dataframe shape after pivot
primary_pivot.shape

(102, 21)

In [59]:
# General dataframe pivot
general_pivot = pivot_wide(general_df, prefix="gen")
general_pivot.head(DISPLAY_ROWS)

Unnamed: 0,county,gen_cpi_BALDWIN,gen_dem_OBAMA,gen_grn_MCKINNEY,gen_ind_NADER,gen_lib_BARR,gen_new_POLACHEK,gen_rep_MCCAIN,gen_wri_ALLEN,gen_wri_HOBBS,gen_wri_MOORE,gen_wri_PAUL
0,ADAMS,51,11794,40,150,76,1,18711,0,0,0,0
1,ALEXANDER,3,2189,14,24,13,2,1692,0,0,0,0
2,BOND,37,3843,7,53,36,1,3947,0,0,0,0
3,BOONE,53,11333,60,179,123,6,10403,0,0,0,0
4,BROWN,4,986,8,18,8,1,1544,0,0,0,0
5,BUREAU,42,8889,50,158,59,4,7911,0,0,0,0
6,CALHOUN,3,1423,8,32,12,0,1221,0,0,0,0
7,CARROLL,16,3965,22,46,27,2,3596,0,0,0,0
8,CASS,9,2690,28,44,20,0,2617,0,0,0,0
9,CHAMPAIGN,178,48597,313,610,560,14,33871,0,0,0,0


In [60]:
# General dataframe shape after pivot
general_pivot.shape

(102, 12)

## 4. Merge Dataframes

Before merging, we verify that county names match across primary and general:

In [61]:
# Check if county names match between primary_df and general_df
primary_counties = set(primary_df["county"].unique())
general_counties = set(general_df["county"].unique())
common_counties = primary_counties.intersection(general_counties)
print(f"Number of common counties: {len(common_counties)} out of {len(primary_counties)}")

Number of common counties: 102 out of 102


Great. Since all county names match, no further preprocessing of the county names is needed, and we can merge the two tables:

In [62]:
# Merge primary and general dataframes on 'county'
# (fillna(0) is only a safeguard; no missing values are expected here)
merged_df = primary_pivot.merge(general_pivot, on="county", how="inner").fillna(0)
merged_df.head(DISPLAY_ROWS)

Unnamed: 0,county,pri_dem_BIDEN,pri_dem_CLINTON,pri_dem_DODD,pri_dem_EDWARDS,pri_dem_KUCINICH,pri_dem_OBAMA,pri_dem_RICHARDSON,pri_grn_BALL,pri_grn_HAWKINS,...,gen_dem_OBAMA,gen_grn_MCKINNEY,gen_ind_NADER,gen_lib_BARR,gen_new_POLACHEK,gen_rep_MCCAIN,gen_wri_ALLEN,gen_wri_HOBBS,gen_wri_MOORE,gen_wri_PAUL
0,ADAMS,10,2534,0,237,2,3713,17,0,0,...,11794,40,150,76,1,18711,0,0,0,0
1,ALEXANDER,5,725,2,156,4,1097,8,0,0,...,2189,14,24,13,2,1692,0,0,0,0
2,BOND,7,916,4,71,4,921,9,0,0,...,3843,7,53,36,1,3947,0,0,0,0
3,BOONE,9,2042,5,96,2,2652,5,1,0,...,11333,60,179,123,6,10403,0,0,0,0
4,BROWN,3,247,1,63,1,313,2,1,0,...,986,8,18,8,1,1544,0,0,0,0
5,BUREAU,6,1489,1,113,5,2432,12,1,4,...,8889,50,158,59,4,7911,0,0,0,0
6,CALHOUN,7,606,1,86,1,446,3,0,0,...,1423,8,32,12,0,1221,0,0,0,0
7,CARROLL,3,590,3,40,1,1160,2,0,2,...,3965,22,46,27,2,3596,0,0,0,0
8,CASS,4,497,0,48,3,679,4,0,0,...,2690,28,44,20,0,2617,0,0,0,0
9,CHAMPAIGN,31,5515,7,345,133,17033,38,16,26,...,48597,313,610,560,14,33871,0,0,0,0
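
An inner merge drops unmatched counties without warning, and the intersection count above only compares against the primary side. As a complementary check, an outer merge with pandas' `indicator=True` and `validate="one_to_one"` options reports rows found on only one side and raises if a county is duplicated. The sketch below uses toy frames with values from this notebook, where one county was deliberately removed from each side for illustration.

```python
import pandas as pd

# Toy frames: BOONE is missing on the right, BROWN on the left (on purpose)
left = pd.DataFrame({
    "county": ["ADAMS", "BOND", "BOONE"],
    "pri_dem_OBAMA": [3713, 921, 2652],
})
right = pd.DataFrame({
    "county": ["ADAMS", "BOND", "BROWN"],
    "gen_dem_OBAMA": [11794, 3843, 986],
})

# validate raises MergeError if 'county' is not unique on either side
check = left.merge(right, on="county", how="outer",
                   indicator=True, validate="one_to_one")

# Rows present in only one of the two inputs
unmatched = check.loc[check["_merge"] != "both", ["county", "_merge"]]
print(unmatched)
```

For the real `primary_pivot` and `general_pivot`, an empty `unmatched` frame would confirm that the inner merge loses nothing.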


In [63]:
# Statistics check on merged dataframe 
merged_df.describe()

Unnamed: 0,pri_dem_BIDEN,pri_dem_CLINTON,pri_dem_DODD,pri_dem_EDWARDS,pri_dem_KUCINICH,pri_dem_OBAMA,pri_dem_RICHARDSON,pri_grn_BALL,pri_grn_HAWKINS,pri_grn_MCKINNEY,...,gen_dem_OBAMA,gen_grn_MCKINNEY,gen_ind_NADER,gen_lib_BARR,gen_new_POLACHEK,gen_rep_MCCAIN,gen_wri_ALLEN,gen_wri_HOBBS,gen_wri_MOORE,gen_wri_PAUL
count,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,...,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0
mean,37.137255,6548.333333,11.480392,389.401961,41.509804,12923.862745,34.686275,3.04902,4.54902,14.833333,...,33523.02,116.058824,303.411765,192.568627,11.264706,19913.519608,0.029412,0.039216,0.029412,0.009804
std,189.038371,31609.286165,52.662334,1422.424992,244.600614,74054.312696,162.340004,8.346906,16.248459,64.755608,...,163464.0,407.84235,924.776134,602.522643,46.844154,53965.490473,0.169792,0.312181,0.220525,0.099015
min,0.0,199.0,0.0,18.0,0.0,242.0,0.0,0.0,0.0,0.0,...,845.0,5.0,16.0,2.0,0.0,1212.0,0.0,0.0,0.0,0.0
25%,4.0,689.75,1.0,73.25,2.0,851.25,4.25,0.0,0.0,1.0,...,2783.5,14.0,55.0,25.25,1.25,3712.25,0.0,0.0,0.0,0.0
50%,7.0,1099.5,2.0,127.0,5.0,1664.0,8.5,1.0,1.0,2.5,...,5166.5,38.0,105.0,48.5,3.0,6124.5,0.0,0.0,0.0,0.0
75%,16.75,2725.75,6.0,278.75,15.0,3820.25,19.0,3.0,4.0,6.0,...,12464.25,63.25,187.0,105.0,6.0,11308.0,0.0,0.0,0.0,0.0
max,1901.0,314634.0,512.0,14249.0,2461.0,743686.0,1636.0,77.0,160.0,638.0,...,1629024.0,4006.0,8903.0,5602.0,467.0,487736.0,1.0,3.0,2.0,1.0


Next, we add party-total columns:

- Primary totals:
    * `rep_primary_total` = sum of all `pri_rep_*` columns
    * `dem_primary_total` = sum of all `pri_dem_*` columns
    * `grn_primary_total` = sum of all `pri_grn_*` columns

- General totals:
    * `rep_general_total` = sum of all `gen_rep_*` columns
    * `dem_general_total` = sum of all `gen_dem_*` columns
    * `lib_general_total` = sum of all `gen_lib_*` columns
    * `grn_general_total` = sum of all `gen_grn_*` columns
    * `cpi_general_total` = sum of all `gen_cpi_*` columns
    * `ind_general_total` = sum of all `gen_ind_*` columns
    * `new_general_total` = sum of all `gen_new_*` columns
    * `wri_general_total` = sum of all `gen_wri_*` columns

In [64]:
# Add party totals for primary election
rep_primary_cols   = [c for c in merged_df.columns if c.startswith("pri_rep_")]
dem_primary_cols   = [c for c in merged_df.columns if c.startswith("pri_dem_")]
grn_primary_cols   = [c for c in merged_df.columns if c.startswith("pri_grn_")]

merged_df["rep_primary_total"] = merged_df[rep_primary_cols].sum(axis=1) if rep_primary_cols else 0
merged_df["dem_primary_total"] = merged_df[dem_primary_cols].sum(axis=1) if dem_primary_cols else 0
merged_df["grn_primary_total"] = merged_df[grn_primary_cols].sum(axis=1) if grn_primary_cols else 0

In [65]:
# Add party totals for general election
rep_general_cols   = [c for c in merged_df.columns if c.startswith("gen_rep_")]
dem_general_cols   = [c for c in merged_df.columns if c.startswith("gen_dem_")]
lib_general_cols   = [c for c in merged_df.columns if c.startswith("gen_lib_")]
grn_general_cols   = [c for c in merged_df.columns if c.startswith("gen_grn_")]
cpi_general_cols   = [c for c in merged_df.columns if c.startswith("gen_cpi_")]
ind_general_cols   = [c for c in merged_df.columns if c.startswith("gen_ind_")]
new_general_cols   = [c for c in merged_df.columns if c.startswith("gen_new_")]
wri_general_cols   = [c for c in merged_df.columns if c.startswith("gen_wri_")]

merged_df["rep_general_total"] = merged_df[rep_general_cols].sum(axis=1) if rep_general_cols else 0
merged_df["dem_general_total"] = merged_df[dem_general_cols].sum(axis=1) if dem_general_cols else 0
merged_df["lib_general_total"] = merged_df[lib_general_cols].sum(axis=1) if lib_general_cols else 0
merged_df["grn_general_total"] = merged_df[grn_general_cols].sum(axis=1) if grn_general_cols else 0
merged_df["cpi_general_total"] = merged_df[cpi_general_cols].sum(axis=1) if cpi_general_cols else 0
merged_df["ind_general_total"] = merged_df[ind_general_cols].sum(axis=1) if ind_general_cols else 0
merged_df["new_general_total"] = merged_df[new_general_cols].sum(axis=1) if new_general_cols else 0
merged_df["wri_general_total"] = merged_df[wri_general_cols].sum(axis=1) if wri_general_cols else 0
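
The per-party lines above all follow one pattern, so the same totals could also be produced by a loop that discovers the parties from the column names. This is an alternative sketch, not the notebook's method: `add_party_totals` is a hypothetical helper, and the toy frame holds only a subset of candidates, so its totals differ from the real ones.

```python
import pandas as pd

def add_party_totals(df: pd.DataFrame, prefix: str, stage: str) -> pd.DataFrame:
    """Add '<party>_<stage>_total' for every party found under '<prefix>_<party>_*'."""
    out = df.copy()
    cols = [c for c in df.columns if c.startswith(prefix + "_")]
    # The party is the second underscore-separated piece, e.g. 'pri_dem_OBAMA' -> 'dem'
    parties = sorted({c.split("_")[1] for c in cols})
    for party in parties:
        party_cols = [c for c in cols if c.startswith(f"{prefix}_{party}_")]
        out[f"{party}_{stage}_total"] = df[party_cols].sum(axis=1)
    return out

toy = pd.DataFrame({
    "county": ["ADAMS", "ALEXANDER"],
    "pri_dem_OBAMA": [3713, 1097],
    "pri_dem_CLINTON": [2534, 725],
    "pri_rep_MCCAIN": [3137, 137],
    "gen_dem_OBAMA": [11794, 2189],
    "gen_rep_MCCAIN": [18711, 1692],
})

with_totals = add_party_totals(add_party_totals(toy, "pri", "primary"), "gen", "general")
print(with_totals.filter(like="_total"))
```

One difference from the explicit version: the loop emits total columns in alphabetical party order, so the column order of the exported CSV would change.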

In [68]:
# Print out all the column names in the final dataframe
print("Final columns in the cleaned dataframe:")
merged_df.columns

Final columns in the cleaned dataframe:


Index(['county', 'pri_dem_BIDEN', 'pri_dem_CLINTON', 'pri_dem_DODD',
       'pri_dem_EDWARDS', 'pri_dem_KUCINICH', 'pri_dem_OBAMA',
       'pri_dem_RICHARDSON', 'pri_grn_BALL', 'pri_grn_HAWKINS',
       'pri_grn_MCKINNEY', 'pri_grn_MESPLAY', 'pri_rep_GIULIANI',
       'pri_rep_HUCKABEE', 'pri_rep_KEYES', 'pri_rep_MCCAIN',
       'pri_rep_MITCHELL', 'pri_rep_PAUL', 'pri_rep_ROMNEY',
       'pri_rep_TANCREDO', 'pri_rep_THOMPSON', 'gen_cpi_BALDWIN',
       'gen_dem_OBAMA', 'gen_grn_MCKINNEY', 'gen_ind_NADER', 'gen_lib_BARR',
       'gen_new_POLACHEK', 'gen_rep_MCCAIN', 'gen_wri_ALLEN', 'gen_wri_HOBBS',
       'gen_wri_MOORE', 'gen_wri_PAUL', 'rep_primary_total',
       'dem_primary_total', 'grn_primary_total', 'rep_general_total',
       'dem_general_total', 'lib_general_total', 'grn_general_total',
       'cpi_general_total', 'ind_general_total', 'new_general_total',
       'wri_general_total'],
      dtype='object')

In [66]:
# Preview merged dataframe with totals
merged_df.head(DISPLAY_ROWS)

Unnamed: 0,county,pri_dem_BIDEN,pri_dem_CLINTON,pri_dem_DODD,pri_dem_EDWARDS,pri_dem_KUCINICH,pri_dem_OBAMA,pri_dem_RICHARDSON,pri_grn_BALL,pri_grn_HAWKINS,...,dem_primary_total,grn_primary_total,rep_general_total,dem_general_total,lib_general_total,grn_general_total,cpi_general_total,ind_general_total,new_general_total,wri_general_total
0,ADAMS,10,2534,0,237,2,3713,17,0,0,...,6513,2,18711,11794,76,40,51,150,1,0
1,ALEXANDER,5,725,2,156,4,1097,8,0,0,...,1997,0,1692,2189,13,14,3,24,2,0
2,BOND,7,916,4,71,4,921,9,0,0,...,1932,2,3947,3843,36,7,37,53,1,0
3,BOONE,9,2042,5,96,2,2652,5,1,0,...,4811,3,10403,11333,123,60,53,179,6,0
4,BROWN,3,247,1,63,1,313,2,1,0,...,630,3,1544,986,8,8,4,18,1,0
5,BUREAU,6,1489,1,113,5,2432,12,1,4,...,4058,11,7911,8889,59,50,42,158,4,0
6,CALHOUN,7,606,1,86,1,446,3,0,0,...,1150,0,1221,1423,12,8,3,32,0,0
7,CARROLL,3,590,3,40,1,1160,2,0,2,...,1799,7,3596,3965,27,22,16,46,2,0
8,CASS,4,497,0,48,3,679,4,0,0,...,1235,0,2617,2690,20,28,9,44,0,0
9,CHAMPAIGN,31,5515,7,345,133,17033,38,16,26,...,23102,160,33871,48597,560,313,178,610,14,0


Now, we save the cleaned dataframe into the processed directory.

In [67]:
# Save the cleaned and merged dataframe to CSV
out_dir = Path(OUTPUT_PATH)
out_dir.mkdir(parents=True, exist_ok=True)
merged_df.to_csv(out_dir / "IL.csv", index=False)   # Path join works with or without a trailing slash in OUTPUT_PATH
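
Before relying on the exported file, one inexpensive sanity check is that, for each county, the party totals of a stage add up to the sum of that stage's candidate columns. The sketch below assumes the naming scheme used in this notebook (`gen_*` candidate columns, `*_general_total` totals); `totals_are_consistent` is a hypothetical helper and the toy frame contains only two candidates.

```python
import pandas as pd

def totals_are_consistent(df: pd.DataFrame, prefix: str, stage: str) -> bool:
    """True if, per row, candidate columns and party-total columns sum to the same value."""
    candidate_cols = [c for c in df.columns if c.startswith(f"{prefix}_")]
    total_cols = [c for c in df.columns if c.endswith(f"_{stage}_total")]
    same = df[candidate_cols].sum(axis=1) == df[total_cols].sum(axis=1)
    return bool(same.all())

toy = pd.DataFrame({
    "county": ["ADAMS", "ALEXANDER"],
    "gen_dem_OBAMA": [11794, 2189],
    "gen_rep_MCCAIN": [18711, 1692],
    "dem_general_total": [11794, 2189],
    "rep_general_total": [18711, 1692],
})
ok = totals_are_consistent(toy, "gen", "general")

# Corrupt one total to show the check failing
broken = toy.copy()
broken.loc[0, "dem_general_total"] = 0
not_ok = totals_are_consistent(broken, "gen", "general")
print(ok, not_ok)
```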