# Delaware 2008 Presidential Elections: Data Cleaning & Preprocessing

**Goal:** Build a clean, analysis-ready county-level table for Delaware, 2008 by merging the presidential primary and presidential general election results, and then derive summary stats (party totals).

**Output**: A single CSV where each row is a county and columns include:

- Primary per-candidate vote counts (prefixed with `pri_`)
- General per-candidate vote counts (prefixed with `gen_`)
- Party totals: `rep_primary_total`, `dem_primary_total`, `rep_general_total`, `dem_general_total`, `lib_general_total`, `cst_general_total`, `grn_general_total`, `ind_general_total`, `swp_general_total`

**Last Updated**: 2025/10/02

## 0. Library Import

In [1]:
import re
import pandas as pd
import numpy as np
from pathlib import Path


## 1. Inputs & Parameters

Define raw file paths once here so the entire notebook is easy to rerun on another machine. If a path changes, we only update it here. We keep a single `OUTPUT_PATH` so all exports land in one known place.

In [2]:
# DE 2008 dataset path
PRIMARY_PATH1 = r"../../data/raw/2008/DE/20080205__de__primary__precinct.csv"
PRIMARY_PATH2 = r"../../data/raw/2008/DE/20080909__de__primary__precinct.csv"
GENERAL_PATH  = r"../../data/raw/2008/DE/20081104__de__general__precinct.csv"

# Output directory
OUTPUT_PATH  = r"../../data/processed/2008/DE/"

# Analysis parameters
DISPLAY_ROWS = 10   # Number of rows to display in dataframes
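
Because every path is defined in one place, it is cheap to fail fast when a file is missing on a new machine. The following is a minimal sketch, not part of the original notebook; the helper name `missing_inputs` and the placeholder file name are assumptions made for illustration.

```python
from pathlib import Path


def missing_inputs(paths):
    """Return the paths (as strings) that do not exist on disk."""
    return [str(p) for p in map(Path, paths) if not p.exists()]


# Demonstration with a placeholder file name that is assumed not to exist.
# In the notebook, the list would hold PRIMARY_PATH1, PRIMARY_PATH2 and GENERAL_PATH.
missing = missing_inputs(["definitely_not_a_real_file_12345.csv"])
print(missing)
```

If the returned list is non-empty, raising a `FileNotFoundError` before any loading starts gives a clearer message than a failure halfway through the notebook.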

## 2. Load & Filter

We load primary and general datasets separately and immediately subset to the rows we truly need:

- Restrict `office` to 'President' to avoid mixing down-ballot contests

- Remove columns that are fully missing or irrelevant post-filter (e.g., a district column that’s empty for county-level rows)
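
The two filtering rules above can be sketched on a toy frame as follows. This is an illustrative sketch only: the helper `keep_presidential` is a hypothetical name, the toy values are made up, and the notebook itself drops columns by name rather than with `dropna`.

```python
import pandas as pd


def keep_presidential(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only 'President' rows, then drop columns that are entirely missing."""
    out = df[df["office"] == "President"]
    return out.dropna(axis=1, how="all").reset_index(drop=True)


# Toy data (made up): `district` is populated only for the down-ballot row
toy = pd.DataFrame({
    "county": ["Kent", "Kent", "Sussex"],
    "office": ["President", "Governor", "President"],
    "district": [float("nan"), 1.0, float("nan")],
    "votes": [10, 20, 30],
})

filtered = keep_presidential(toy)
print(filtered)
```

After filtering, `district` is empty for every remaining row, so `dropna(axis=1, how="all")` removes it without naming it explicitly.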

### a. Primary Election Dataset

There are two files for the primary election. We preprocess each of them in turn, then merge them if necessary.

In [3]:
# Load primary data
primary1_df = pd.read_csv(PRIMARY_PATH1)
primary1_df.head(DISPLAY_ROWS)

Unnamed: 0,county,election_district,office,district,party,candidate,election_day,absentee,votes
0,New Castle,01-01,President,,DEMOCRATIC,Biden J,8,0,8
1,New Castle,01-01,President,,DEMOCRATIC,Clinton H,99,7,106
2,New Castle,01-01,President,,DEMOCRATIC,Dodd C,0,0,0
3,New Castle,01-01,President,,DEMOCRATIC,Edwards J,1,1,2
4,New Castle,01-01,President,,DEMOCRATIC,Kucinich D,0,0,0
5,New Castle,04-01,President,,DEMOCRATIC,Biden J,13,0,13
6,New Castle,04-01,President,,DEMOCRATIC,Clinton H,77,1,78
7,New Castle,04-01,President,,DEMOCRATIC,Dodd C,0,0,0
8,New Castle,04-01,President,,DEMOCRATIC,Edwards J,0,0,0
9,New Castle,04-01,President,,DEMOCRATIC,Kucinich D,0,0,0


In [4]:
# Different values in 'office' column
primary1_df["office"].value_counts()

office
President    3768
Name: count, dtype: int64

In [5]:
# Since there is only presidential data, drop "office" column
# Also, drop the district column
primary1_df = primary1_df.drop(columns=["office", "district"]).reset_index(drop=True)
primary1_df.head(DISPLAY_ROWS)

Unnamed: 0,county,election_district,party,candidate,election_day,absentee,votes
0,New Castle,01-01,DEMOCRATIC,Biden J,8,0,8
1,New Castle,01-01,DEMOCRATIC,Clinton H,99,7,106
2,New Castle,01-01,DEMOCRATIC,Dodd C,0,0,0
3,New Castle,01-01,DEMOCRATIC,Edwards J,1,1,2
4,New Castle,01-01,DEMOCRATIC,Kucinich D,0,0,0
5,New Castle,04-01,DEMOCRATIC,Biden J,13,0,13
6,New Castle,04-01,DEMOCRATIC,Clinton H,77,1,78
7,New Castle,04-01,DEMOCRATIC,Dodd C,0,0,0
8,New Castle,04-01,DEMOCRATIC,Edwards J,0,0,0
9,New Castle,04-01,DEMOCRATIC,Kucinich D,0,0,0


We check how the `election_day`, `absentee` and `votes` columns relate to each other.

In [6]:
# Check the relationship between `election_day`, `absentee` and `votes` columns
difference = primary1_df["election_day"] + primary1_df["absentee"] - primary1_df["votes"]
difference.value_counts()

0    3768
Name: count, dtype: int64

Every row has a difference of 0, so `votes` always equals the sum of `election_day` and `absentee`. We can therefore drop those two columns and keep only `votes`.
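
The same check can be written as a reusable guard that fails loudly if the relationship ever breaks on another file. This is a sketch under the assumption that the three columns are already numeric; `votes_are_consistent` is a hypothetical helper name, and the toy rows reuse values shown in the preview above.

```python
import pandas as pd


def votes_are_consistent(df: pd.DataFrame) -> bool:
    """True when votes == election_day + absentee on every row."""
    return bool((df["election_day"] + df["absentee"] == df["votes"]).all())


# Three rows taken from the primary preview (Biden, Clinton, Edwards in 01-01)
toy = pd.DataFrame({
    "election_day": [8, 99, 1],
    "absentee": [0, 7, 1],
    "votes": [8, 106, 2],
})

# Stop before dropping the component columns if the totals disagree
assert votes_are_consistent(toy), "votes != election_day + absentee"
```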

In [7]:
# Drop the "election_day" and "absentee" columns
primary1_df = primary1_df.drop(columns=["election_day", "absentee"]).reset_index(drop=True)
primary1_df.head(DISPLAY_ROWS)

Unnamed: 0,county,election_district,party,candidate,votes
0,New Castle,01-01,DEMOCRATIC,Biden J,8
1,New Castle,01-01,DEMOCRATIC,Clinton H,106
2,New Castle,01-01,DEMOCRATIC,Dodd C,0
3,New Castle,01-01,DEMOCRATIC,Edwards J,2
4,New Castle,01-01,DEMOCRATIC,Kucinich D,0
5,New Castle,04-01,DEMOCRATIC,Biden J,13
6,New Castle,04-01,DEMOCRATIC,Clinton H,78
7,New Castle,04-01,DEMOCRATIC,Dodd C,0
8,New Castle,04-01,DEMOCRATIC,Edwards J,0
9,New Castle,04-01,DEMOCRATIC,Kucinich D,0
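
Before summing precinct rows, it may be worth confirming that every `election_district` value really is a precinct code, since any subtotal or summary row in the raw file would be double counted by the aggregation. The sketch below assumes that precinct codes follow the two-digit `NN-NN` shape seen in the previews; both that pattern and the `"Total"` label in the toy data are assumptions, not facts taken from the raw files.

```python
import pandas as pd

# Assumption: precinct codes look like "01-01" (two digits, dash, two digits)
PRECINCT_REGEX = r"^\d{2}-\d{2}$"


def unexpected_districts(df: pd.DataFrame, col: str = "election_district") -> list:
    """Return the sorted distinct labels that do not look like precinct codes."""
    labels = df[col].astype(str).str.strip()
    mask = ~labels.str.match(PRECINCT_REGEX)
    return sorted(labels[mask].unique())


# Toy data: the "Total" row is hypothetical and only illustrates the check
toy = pd.DataFrame({
    "election_district": ["01-01", "04-01", "Total"],
    "votes": [8, 13, 21],
})

print(unexpected_districts(toy))
```

An empty list means every row looks like a precinct; anything else deserves a manual look before it is included in county sums.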


Now, we aggregate district vote counts into county vote counts.

In [8]:
# Make sure votes are numeric
primary1_df["votes"] = pd.to_numeric(primary1_df["votes"], errors="coerce").fillna(0)

# Aggregate precinct vote counts into county vote counts
primary1_df = (
    primary1_df.
    groupby(["county", "party", "candidate"], as_index=False)["votes"]
    .sum()
)[["county", "candidate", "party", "votes"]]        # Reorder columns

# Preview the aggregated data
primary1_df.head(DISPLAY_ROWS)

Unnamed: 0,county,candidate,party,votes
0,Kent,Biden J,DEMOCRATIC,408
1,Kent,Clinton H,DEMOCRATIC,5533
2,Kent,Dodd C,DEMOCRATIC,32
3,Kent,Edwards J,DEMOCRATIC,273
4,Kent,Kucinich D,DEMOCRATIC,32
5,Kent,Obama B,DEMOCRATIC,6735
6,Kent,Giuliani R,REPUBLICAN,192
7,Kent,Huckabee M,REPUBLICAN,1568
8,Kent,Mccain J,REPUBLICAN,3598
9,Kent,Paul R,REPUBLICAN,289
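
A quick way to gain confidence in the aggregation is to confirm that the grand total of votes is unchanged by the `groupby`. The sketch below reuses four precinct rows from the earlier preview; it is an illustration of the idea, not a cell from the original notebook.

```python
import pandas as pd

# Four precinct rows from the preview: Biden and Clinton in 01-01 and 04-01
precinct = pd.DataFrame({
    "county": ["New Castle"] * 4,
    "party": ["DEMOCRATIC"] * 4,
    "candidate": ["Biden J", "Clinton H", "Biden J", "Clinton H"],
    "votes": [8, 106, 13, 78],
})

county = (
    precinct
    .groupby(["county", "party", "candidate"], as_index=False)["votes"]
    .sum()
)

# Aggregation should move votes between rows, never create or lose any
assert county["votes"].sum() == precinct["votes"].sum()
print(county)
```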


In [9]:
# Unique parties in primary1_df
primary1_df["party"].value_counts()

party
DEMOCRATIC    18
REPUBLICAN    18
Name: count, dtype: int64

In [10]:
# Candidates in primary1_df
primary1_df["candidate"].value_counts()

candidate
Biden J       3
Clinton H     3
Dodd C        3
Edwards J     3
Kucinich D    3
Obama B       3
Giuliani R    3
Huckabee M    3
Mccain J      3
Paul R        3
Romney M      3
Tancredo T    3
Name: count, dtype: int64

In [11]:
# Final look at the cleaned primary1_df
primary1_df.head(DISPLAY_ROWS)

Unnamed: 0,county,candidate,party,votes
0,Kent,Biden J,DEMOCRATIC,408
1,Kent,Clinton H,DEMOCRATIC,5533
2,Kent,Dodd C,DEMOCRATIC,32
3,Kent,Edwards J,DEMOCRATIC,273
4,Kent,Kucinich D,DEMOCRATIC,32
5,Kent,Obama B,DEMOCRATIC,6735
6,Kent,Giuliani R,REPUBLICAN,192
7,Kent,Huckabee M,REPUBLICAN,1568
8,Kent,Mccain J,REPUBLICAN,3598
9,Kent,Paul R,REPUBLICAN,289


In [12]:
# Shape after preprocessing
primary1_df.shape

(36, 4)

With the first file done, we now look at the second primary election file.

In [13]:
# Load primary data
primary2_df = pd.read_csv(PRIMARY_PATH2)
primary2_df.head(DISPLAY_ROWS)

Unnamed: 0,county,election_district,office,district,party,candidate,election_day,absentee,votes
0,New Castle,01-01,U.S. House,,DEMOCRATIC,Hartley-Na,190,6,196
1,New Castle,01-01,U.S. House,,DEMOCRATIC,Miller M,184,6,190
2,New Castle,01-01,U.S. House,,DEMOCRATIC,Northingto,21,3,24
3,New Castle,02-01,U.S. House,,DEMOCRATIC,Hartley-Na,145,3,148
4,New Castle,02-01,U.S. House,,DEMOCRATIC,Miller M,183,7,190
5,New Castle,02-01,U.S. House,,DEMOCRATIC,Northingto,20,0,20
6,New Castle,03-01,U.S. House,,DEMOCRATIC,Hartley-Na,109,2,111
7,New Castle,03-01,U.S. House,,DEMOCRATIC,Miller M,169,2,171
8,New Castle,03-01,U.S. House,,DEMOCRATIC,Northingto,21,0,21
9,New Castle,04-01,U.S. House,,DEMOCRATIC,Hartley-Na,185,6,191


In [14]:
# Different values in 'office' column
primary2_df["office"].value_counts()

office
Governor          1356
U.S. House        1020
State Senate       199
State Assembly      20
Name: count, dtype: int64

This file contains no presidential rows, so we can safely ignore it and proceed with the general election dataset.

### b. General Election Dataset

In [15]:
# Load general data
general_df = pd.read_csv(GENERAL_PATH)
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,election_district,office,district,party,candidate,election_day,absentee,votes
0,New Castle,01-01,President,,DEMOCRATIC,Obama B,886,87,973
1,New Castle,01-01,President,,REPUBLICAN,Mccain J,107,15,122
2,New Castle,01-01,President,,CONSTITUTN,Baldwinc,0,0,0
3,New Castle,01-01,President,,GREEN,Mckinney C,0,0,0
4,New Castle,01-01,President,,IND OF DEL,Nader R,2,2,4
5,New Castle,02-01,President,,DEMOCRATIC,Obama B,834,47,881
6,New Castle,02-01,President,,REPUBLICAN,Mccain J,30,1,31
7,New Castle,02-01,President,,CONSTITUTN,Baldwinc,0,0,0
8,New Castle,02-01,President,,GREEN,Mckinney C,0,0,0
9,New Castle,02-01,President,,IND OF DEL,Nader R,1,0,1


In [16]:
# Different values in 'office' column
general_df["office"].value_counts()

office
President              3073
U.S. House             1314
Governor               1308
Lieutenant Governor    1308
State Assembly          946
U.S. Senate             876
State Senate            347
Name: count, dtype: int64

In [17]:
# Only keep rows where 'office' is 'President'
general_df = general_df[general_df["office"] == "President"]
general_df.shape

(3073, 9)

In [18]:
# Now, drop the "office" column as it's no longer needed
# Also, drop the district column
general_df = general_df.drop(columns=["office", "district"]).reset_index(drop=True)
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,election_district,party,candidate,election_day,absentee,votes
0,New Castle,01-01,DEMOCRATIC,Obama B,886,87,973
1,New Castle,01-01,REPUBLICAN,Mccain J,107,15,122
2,New Castle,01-01,CONSTITUTN,Baldwinc,0,0,0
3,New Castle,01-01,GREEN,Mckinney C,0,0,0
4,New Castle,01-01,IND OF DEL,Nader R,2,2,4
5,New Castle,02-01,DEMOCRATIC,Obama B,834,47,881
6,New Castle,02-01,REPUBLICAN,Mccain J,30,1,31
7,New Castle,02-01,CONSTITUTN,Baldwinc,0,0,0
8,New Castle,02-01,GREEN,Mckinney C,0,0,0
9,New Castle,02-01,IND OF DEL,Nader R,1,0,1


In [19]:
# Sanity check if votes = election_day + absentee
difference = general_df["election_day"] + general_df["absentee"] - general_df["votes"]
difference.value_counts()

0    3073
Name: count, dtype: int64

In [20]:
# Thus, we drop the "election_day" and "absentee" columns
general_df = general_df.drop(columns=["election_day", "absentee"]).reset_index(drop=True)
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,election_district,party,candidate,votes
0,New Castle,01-01,DEMOCRATIC,Obama B,973
1,New Castle,01-01,REPUBLICAN,Mccain J,122
2,New Castle,01-01,CONSTITUTN,Baldwinc,0
3,New Castle,01-01,GREEN,Mckinney C,0
4,New Castle,01-01,IND OF DEL,Nader R,4
5,New Castle,02-01,DEMOCRATIC,Obama B,881
6,New Castle,02-01,REPUBLICAN,Mccain J,31
7,New Castle,02-01,CONSTITUTN,Baldwinc,0
8,New Castle,02-01,GREEN,Mckinney C,0
9,New Castle,02-01,IND OF DEL,Nader R,1


Now, we again aggregate district vote counts into county vote counts.

In [21]:
# Make sure votes are numeric
general_df["votes"] = pd.to_numeric(general_df["votes"], errors="coerce").fillna(0)

# Aggregate precinct vote counts into county vote counts
general_df = (
    general_df.
    groupby(["county", "party", "candidate"], as_index=False)["votes"]
    .sum()
)[["county", "candidate", "party", "votes"]]        # Reorder columns

# Snippet at the aggregated data
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,candidate,party,votes
0,Kent,Baldwinc,CONSTITUTN,107
1,Kent,Obama B,DEMOCRATIC,36392
2,Kent,Mckinney C,GREEN,70
3,Kent,Nader R,IND OF DEL,381
4,Kent,Barr B,LIBERTARIN,144
5,Kent,Mccain J,REPUBLICAN,29827
6,Kent,Calero R,SOC WORKER,4
7,New Castle,Baldwinc,CONSTITUTN,378
8,New Castle,Obama B,DEMOCRATIC,178768
9,New Castle,Mckinney C,GREEN,248


In [22]:
# List out all the parties in the general election data
general_df["party"].value_counts()

party
CONSTITUTN    3
DEMOCRATIC    3
GREEN         3
IND OF DEL    3
LIBERTARIN    3
REPUBLICAN    3
SOC WORKER    3
Name: count, dtype: int64

In [23]:
# Missing values count
general_df.isnull().sum()

county       0
candidate    0
party        0
votes        0
dtype: int64

In [24]:
# Final look at cleaned general_df
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,candidate,party,votes
0,Kent,Baldwinc,CONSTITUTN,107
1,Kent,Obama B,DEMOCRATIC,36392
2,Kent,Mckinney C,GREEN,70
3,Kent,Nader R,IND OF DEL,381
4,Kent,Barr B,LIBERTARIN,144
5,Kent,Mccain J,REPUBLICAN,29827
6,Kent,Calero R,SOC WORKER,4
7,New Castle,Baldwinc,CONSTITUTN,378
8,New Castle,Obama B,DEMOCRATIC,178768
9,New Castle,Mckinney C,GREEN,248


In [25]:
# Shape after preprocessing
general_df.shape

(21, 4)

## 3. Table Pivoting

We convert the tall table (one row per county/party/candidate) into a wide one (one row per county, one column per candidate). This keeps the schema consistent with the data previously cleaned by the group.

Helper functions:

- `normalize_party(s)`: maps the raw party labels (e.g., `DEMOCRATIC`, `REPUBLICAN`, `LIBERTARIN`) to short keys such as `dem`, `rep`, `lib`, so column names are stable
- `candidate_token(name)`: turns "Obama B" -> `OBAMA`, "Mccain J" -> `MCCAIN`, etc. The raw names put the last name first, so the first token, uppercased, gives a short, readable token for column names
- `pivot_wide(df, prefix, key_col="county")`: Main pivot function
        
    * groups by `county` × `party` × `candidate`, sums `votes`,
    * pivots to columns named like:
        * Primary: `pri_dem_OBAMA`, `pri_rep_MCCAIN`,...
        * General: `gen_dem_OBAMA`, `gen_rep_MCCAIN`,...

    * flattens the MultiIndex into plain column strings,
    * returns one wide row per county
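
The pivot-and-flatten step can be seen in isolation on a toy frame. This is a minimal sketch that assumes the `party_key` and `candidate_token` columns have already been derived; the vote counts are the Kent and New Castle general election values shown earlier.

```python
import pandas as pd

# Tall input: one row per county/party/candidate
tall = pd.DataFrame({
    "county": ["Kent", "Kent", "New Castle", "New Castle"],
    "party_key": ["dem", "rep", "dem", "rep"],
    "candidate_token": ["OBAMA", "MCCAIN", "OBAMA", "MCCAIN"],
    "votes": [36392, 29827, 178768, 74608],
})

# Pivot to a (party_key, candidate_token) MultiIndex on the columns
wide = tall.pivot_table(
    index="county",
    columns=["party_key", "candidate_token"],
    values="votes",
    aggfunc="sum",
    fill_value=0,
)

# Flatten each (party, candidate) tuple into a single column name
wide.columns = [f"gen_{p}_{c}" for p, c in wide.columns]
wide = wide.reset_index()
print(wide)
```

Flattening the column tuples into plain strings is what makes the later prefix-based selection (`startswith("gen_dem_")`) possible.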

In [26]:
def normalize_party(s: pd.Series) -> pd.Series:
    """
    Normalize raw party labels to short keys, e.g.
    DEMOCRATIC -> dem, REPUBLICAN -> rep, LIBERTARIN -> lib
    """
    return (s.str.strip()
             .str.capitalize()                  # "IND OF DEL" -> "Ind of del"
             .map({
                 "Democratic"    : "dem",
                 "Republican"    : "rep",
                 "Libertarin"    : "lib",
                 "Constitutn"    : "cst",
                 "Green"         : "grn",
                 "Ind of del"    : "ind",
                 "Soc worker"    : "swp"
             })
             .fillna(s.str.strip().str.lower()))      # Defensive fallback: unmapped parties keep their lowercased label

In [27]:
def candidate_token(name: str) -> str:
    """
    Turn "Mccain J" -> MCCAIN, "Obama B" -> OBAMA
    Names are stored last-name-first, so keep the first token and uppercase it
    """
    if pd.isna(name):
        return "UNKNOWN"                # Defensive purposes only, would not expect missing values

    # Return first token uppercase
    raw = str(name).strip()
    tokens = raw.split()
    return tokens[0].upper() if tokens else "UNKNOWN"

In [28]:
def pivot_wide(df: pd.DataFrame, prefix: str, key_col: str="county") -> pd.DataFrame:
    """
    Pivot a tall (county, party, candidate, votes) dataframe to wide format,
    with one `<prefix>_<party>_<CANDIDATE>` column per party/candidate pair
    """
    # Work on a copy so the caller's dataframe is not modified
    df = df.copy()

    # Normalize party names
    df['party_key'] = normalize_party(df['party'])
    
    # Create candidate tokens
    df['candidate_token'] = df['candidate'].apply(candidate_token)
    
    # Pivot the dataframe
    pivot_df = df.pivot_table(index=key_col, 
                              columns=["party_key", "candidate_token"], 
                              values="votes", 
                              aggfunc='sum', 
                              fill_value=0)
    
    # Flatten multi-level columns
    pivot_df.columns = [f"{prefix}_{p}_{c}" for p, c in pivot_df.columns]
    
    # Reset index to turn key_col back into a column
    pivot_df = pivot_df.reset_index()
    
    return pivot_df

In [29]:
# Primary dataframe pivot
primary_pivot = pivot_wide(primary1_df, prefix="pri")
primary_pivot.head(DISPLAY_ROWS)

Unnamed: 0,county,pri_dem_BIDEN,pri_dem_CLINTON,pri_dem_DODD,pri_dem_EDWARDS,pri_dem_KUCINICH,pri_dem_OBAMA,pri_rep_GIULIANI,pri_rep_HUCKABEE,pri_rep_MCCAIN,pri_rep_PAUL,pri_rep_ROMNEY,pri_rep_TANCREDO
0,Kent,408,5533,32,273,32,6735,192,1568,3598,289,2806,31
1,New Castle,1785,26564,82,575,116,37795,744,3140,13225,1434,8758,109
2,Sussex,3533,49405,226,1634,236,57718,1574,10704,28429,2539,21124,210


In [30]:
# Primary dataframe shape after pivot
primary_pivot.shape

(3, 13)

In [31]:
# General dataframe pivot
general_pivot = pivot_wide(general_df, prefix="gen")
general_pivot.head(DISPLAY_ROWS)

Unnamed: 0,county,gen_cst_BALDWINC,gen_dem_OBAMA,gen_grn_MCKINNEY,gen_ind_NADER,gen_lib_BARR,gen_rep_MCCAIN,gen_swp_CALERO
0,Kent,107,36392,70,381,144,29827,4
1,New Castle,378,178768,248,1571,796,74608,48
2,Sussex,767,295758,452,2850,1278,200313,64


In [32]:
# General dataframe shape after pivot
general_pivot.shape

(3, 8)

## 4. Merge Dataframes

Before merging, we verify that county names match across primary and general:

In [33]:
# Check if county names match between primary1_df and general_df
primary_counties = set(primary1_df["county"].unique())
general_counties = set(general_df["county"].unique())
common_counties = primary_counties.intersection(general_counties)
print(f"Number of common counties: {len(common_counties)} out of {len(primary_counties)}")

Number of common counties: 3 out of 3


All county names match, so no further preprocessing of county names is needed and we can merge the two tables:

In [34]:
# Merge primary and general dataframes on 'county'
merged_df = primary_pivot.merge(general_pivot, on="county", how="inner").fillna(0)    # There should be no missing values to fill with 0
merged_df.head(DISPLAY_ROWS)

Unnamed: 0,county,pri_dem_BIDEN,pri_dem_CLINTON,pri_dem_DODD,pri_dem_EDWARDS,pri_dem_KUCINICH,pri_dem_OBAMA,pri_rep_GIULIANI,pri_rep_HUCKABEE,pri_rep_MCCAIN,pri_rep_PAUL,pri_rep_ROMNEY,pri_rep_TANCREDO,gen_cst_BALDWINC,gen_dem_OBAMA,gen_grn_MCKINNEY,gen_ind_NADER,gen_lib_BARR,gen_rep_MCCAIN,gen_swp_CALERO
0,Kent,408,5533,32,273,32,6735,192,1568,3598,289,2806,31,107,36392,70,381,144,29827,4
1,New Castle,1785,26564,82,575,116,37795,744,3140,13225,1434,8758,109,378,178768,248,1571,796,74608,48
2,Sussex,3533,49405,226,1634,236,57718,1574,10704,28429,2539,21124,210,767,295758,452,2850,1278,200313,64
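
As an alternative to comparing county sets by hand, pandas can enforce the same expectations during the merge itself. The sketch below is an illustration, not the notebook's method: it uses an outer join with `validate="one_to_one"` and `indicator=True` so that duplicated or unmatched counties surface explicitly. The toy frames reuse the Obama columns from the pivots above.

```python
import pandas as pd

left = pd.DataFrame({
    "county": ["Kent", "New Castle", "Sussex"],
    "pri_dem_OBAMA": [6735, 37795, 57718],
})
right = pd.DataFrame({
    "county": ["Kent", "New Castle", "Sussex"],
    "gen_dem_OBAMA": [36392, 178768, 295758],
})

# validate raises MergeError if a county appears more than once on either side;
# indicator adds a `_merge` column telling where each row came from
checked = left.merge(
    right, on="county", how="outer", validate="one_to_one", indicator=True
)

unmatched = checked[checked["_merge"] != "both"]
assert unmatched.empty, f"Counties missing on one side: {unmatched['county'].tolist()}"

merged = checked.drop(columns="_merge")
print(merged)
```

With an inner join, a county missing from one side would be dropped silently; the outer join plus indicator makes that case visible instead.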


In [35]:
# Statistics check on merged dataframe 
merged_df.describe()

Unnamed: 0,pri_dem_BIDEN,pri_dem_CLINTON,pri_dem_DODD,pri_dem_EDWARDS,pri_dem_KUCINICH,pri_dem_OBAMA,pri_rep_GIULIANI,pri_rep_HUCKABEE,pri_rep_MCCAIN,pri_rep_PAUL,pri_rep_ROMNEY,pri_rep_TANCREDO,gen_cst_BALDWINC,gen_dem_OBAMA,gen_grn_MCKINNEY,gen_ind_NADER,gen_lib_BARR,gen_rep_MCCAIN,gen_swp_CALERO
count,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0
mean,1908.666667,27167.333333,113.333333,827.333333,128.0,34082.666667,836.666667,5137.333333,15084.0,1420.666667,10896.0,116.666667,417.333333,170306.0,256.666667,1600.666667,739.333333,101582.666667,38.666667
std,1566.166126,21942.221955,100.724045,714.726754,102.528045,25693.435666,695.644545,4884.529592,12519.446913,1125.059258,9344.279748,89.745938,331.753422,129889.894033,191.147413,1234.76732,569.119788,88386.048279,31.069814
min,408.0,5533.0,32.0,273.0,32.0,6735.0,192.0,1568.0,3598.0,289.0,2806.0,31.0,107.0,36392.0,70.0,381.0,144.0,29827.0,4.0
25%,1096.5,16048.5,57.0,424.0,74.0,22265.0,468.0,2354.0,8411.5,861.5,5782.0,70.0,242.5,107580.0,159.0,976.0,470.0,52217.5,26.0
50%,1785.0,26564.0,82.0,575.0,116.0,37795.0,744.0,3140.0,13225.0,1434.0,8758.0,109.0,378.0,178768.0,248.0,1571.0,796.0,74608.0,48.0
75%,2659.0,37984.5,154.0,1104.5,176.0,47756.5,1159.0,6922.0,20827.0,1986.5,14941.0,159.5,572.5,237263.0,350.0,2210.5,1037.0,137460.5,56.0
max,3533.0,49405.0,226.0,1634.0,236.0,57718.0,1574.0,10704.0,28429.0,2539.0,21124.0,210.0,767.0,295758.0,452.0,2850.0,1278.0,200313.0,64.0


Next, we add the party total columns:

- Primary totals:
    * `rep_primary_total` = sum of all `pri_rep_*` columns
    * `dem_primary_total` = sum of all `pri_dem_*` columns

- General totals:
    * `rep_general_total` = sum of all `gen_rep_*` columns
    * `dem_general_total` = sum of all `gen_dem_*` columns
    * `lib_general_total` = sum of all `gen_lib_*` columns
    * `cst_general_total` = sum of all `gen_cst_*` columns
    * `grn_general_total` = sum of all `gen_grn_*` columns
    * `ind_general_total` = sum of all `gen_ind_*` columns
    * `swp_general_total` = sum of all `gen_swp_*` columns
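
Since every total follows the same prefix rule, the definitions above can also be expressed as one loop. This is a compact sketch with a hypothetical helper name (`add_party_totals`) and a one-row toy frame built from Kent's primary counts; it is meant to illustrate the rule, not to replace the notebook cells.

```python
import pandas as pd


def add_party_totals(df, stage_prefix, stage_name, parties):
    """Add `<party>_<stage_name>_total` = row-wise sum of `<stage_prefix>_<party>_*` columns."""
    out = df.copy()
    for party in parties:
        cols = [c for c in out.columns if c.startswith(f"{stage_prefix}_{party}_")]
        # A party with no candidate columns gets a total of 0
        out[f"{party}_{stage_name}_total"] = out[cols].sum(axis=1) if cols else 0
    return out


# Toy row using Kent's primary counts for three candidates
toy = pd.DataFrame({
    "county": ["Kent"],
    "pri_dem_OBAMA": [6735],
    "pri_dem_CLINTON": [5533],
    "pri_rep_MCCAIN": [3598],
})

with_totals = add_party_totals(toy, "pri", "primary", ["rep", "dem"])
print(with_totals[["county", "rep_primary_total", "dem_primary_total"]])
```

For the general election, the call would use `"gen"`, `"general"` and the seven party keys.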

In [36]:
# Add party totals for primary election
rep_primary_cols   = [c for c in merged_df.columns if c.startswith("pri_rep_")]
dem_primary_cols   = [c for c in merged_df.columns if c.startswith("pri_dem_")]

merged_df["rep_primary_total"] = merged_df[rep_primary_cols].sum(axis=1) if rep_primary_cols else 0
merged_df["dem_primary_total"] = merged_df[dem_primary_cols].sum(axis=1) if dem_primary_cols else 0

In [37]:
# Add party totals for general election
rep_general_cols   = [c for c in merged_df.columns if c.startswith("gen_rep_")]
dem_general_cols   = [c for c in merged_df.columns if c.startswith("gen_dem_")]
lib_general_cols   = [c for c in merged_df.columns if c.startswith("gen_lib_")]
cst_general_cols   = [c for c in merged_df.columns if c.startswith("gen_cst_")]
grn_general_cols   = [c for c in merged_df.columns if c.startswith("gen_grn_")]
ind_general_cols   = [c for c in merged_df.columns if c.startswith("gen_ind_")]
swp_general_cols   = [c for c in merged_df.columns if c.startswith("gen_swp_")]

merged_df["rep_general_total"] = merged_df[rep_general_cols].sum(axis=1) if rep_general_cols else 0
merged_df["dem_general_total"] = merged_df[dem_general_cols].sum(axis=1) if dem_general_cols else 0
merged_df["lib_general_total"] = merged_df[lib_general_cols].sum(axis=1) if lib_general_cols else 0
merged_df["cst_general_total"] = merged_df[cst_general_cols].sum(axis=1) if cst_general_cols else 0
merged_df["grn_general_total"] = merged_df[grn_general_cols].sum(axis=1) if grn_general_cols else 0
merged_df["ind_general_total"] = merged_df[ind_general_cols].sum(axis=1) if ind_general_cols else 0
merged_df["swp_general_total"] = merged_df[swp_general_cols].sum(axis=1) if swp_general_cols else 0


In [38]:
# Print out all the column names in the final dataframe
print("Final columns in the cleaned dataframe:")
merged_df.columns

Final columns in the cleaned dataframe:


Index(['county', 'pri_dem_BIDEN', 'pri_dem_CLINTON', 'pri_dem_DODD',
       'pri_dem_EDWARDS', 'pri_dem_KUCINICH', 'pri_dem_OBAMA',
       'pri_rep_GIULIANI', 'pri_rep_HUCKABEE', 'pri_rep_MCCAIN',
       'pri_rep_PAUL', 'pri_rep_ROMNEY', 'pri_rep_TANCREDO',
       'gen_cst_BALDWINC', 'gen_dem_OBAMA', 'gen_grn_MCKINNEY',
       'gen_ind_NADER', 'gen_lib_BARR', 'gen_rep_MCCAIN', 'gen_swp_CALERO',
       'rep_primary_total', 'dem_primary_total', 'rep_general_total',
       'dem_general_total', 'lib_general_total', 'cst_general_total',
       'grn_general_total', 'ind_general_total', 'swp_general_total'],
      dtype='object')

In [39]:
# Preview merged dataframe with totals
merged_df.head(DISPLAY_ROWS)

Unnamed: 0,county,pri_dem_BIDEN,pri_dem_CLINTON,pri_dem_DODD,pri_dem_EDWARDS,pri_dem_KUCINICH,pri_dem_OBAMA,pri_rep_GIULIANI,pri_rep_HUCKABEE,pri_rep_MCCAIN,...,gen_swp_CALERO,rep_primary_total,dem_primary_total,rep_general_total,dem_general_total,lib_general_total,cst_general_total,grn_general_total,ind_general_total,swp_general_total
0,Kent,408,5533,32,273,32,6735,192,1568,3598,...,4,8484,13013,29827,36392,144,107,70,381,4
1,New Castle,1785,26564,82,575,116,37795,744,3140,13225,...,48,27410,66917,74608,178768,796,378,248,1571,48
2,Sussex,3533,49405,226,1634,236,57718,1574,10704,28429,...,64,64580,112752,200313,295758,1278,767,452,2850,64


Now, we save the cleaned dataframe into the processed directory.

In [40]:
# Save the cleaned and merged dataframe to CSV
out_dir = Path(OUTPUT_PATH)
out_dir.mkdir(parents=True, exist_ok=True)
merged_df.to_csv(out_dir / "DE.csv", index=False)
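
After exporting, reading the file back is a cheap way to confirm that the CSV has the expected shape and columns. The sketch below writes to a temporary directory so that it runs anywhere; the toy frame stands in for `merged_df`, and the checks shown are one possible choice rather than an established part of the pipeline.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for merged_df, using the Calero column from the general pivot
df = pd.DataFrame({
    "county": ["Kent", "New Castle", "Sussex"],
    "gen_swp_CALERO": [4, 48, 64],
})

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "DE.csv"
    df.to_csv(out_file, index=False)
    reloaded = pd.read_csv(out_file)

# The reloaded file should have the same shape, columns and vote counts
assert reloaded.shape == df.shape
assert list(reloaded.columns) == list(df.columns)
assert reloaded["gen_swp_CALERO"].tolist() == df["gen_swp_CALERO"].tolist()
```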