# Arizona 2008 Presidential Elections: Data Cleaning & Preprocessing

**Goal:** Build a clean, analysis-ready county-level table for Arizona in 2008 by merging the presidential primary and presidential general election results, then derive summary statistics (party totals).

**Output**: A single CSV where each row is a county and columns include:

- Primary per-candidate vote counts (prefixed with `pri_`)
- General per-candidate vote counts (prefixed with `gen_`)
- Party totals: `rep_primary_total`, `dem_primary_total`, `rep_general_total`, `dem_general_total`, `grn_general_total`, `lbt_general_total`, `wri_general_total`

**Last Updated**: 2025/10/01

## 0. Library Import

In [5]:
import re
import pandas as pd
import numpy as np
from pathlib import Path

## 1. Inputs & Parameters

Define raw file paths once here so the entire notebook is easy to rerun on another machine. If a path changes, we only update it here. We keep a single `OUTPUT_PATH` so all exports land in one known place.

In [9]:
# AZ 2008 dataset path
PRIMARY_PATH = r"../../data/raw/2008/AZ/20080205__az__primary__president.csv"
GENERAL_PATH = r"../../data/raw/2008/AZ/20081104__az__general.csv"

# Output directory
OUTPUT_PATH  = r"../../data/processed/2008/AZ/"

# Analysis parameters
DISPLAY_ROWS = 10   # Number of rows to display in dataframes
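
Since every later cell depends on these paths, it can help to fail fast when one of them is wrong. A minimal sketch, assuming a hypothetical helper `missing_inputs` that is not part of the original notebook:

```python
from pathlib import Path

def missing_inputs(paths):
    """Return the subset of `paths` that do not point to an existing file.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    return [str(p) for p in paths if not Path(p).is_file()]

# Example with a file name that is assumed not to exist
print(missing_inputs(["no_such_election_file.csv"]))
```

In the notebook this could be called as `missing_inputs([PRIMARY_PATH, GENERAL_PATH])`; an empty list would mean both files were found.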

## 2. Load & Filter

We load primary and general datasets separately and immediately subset to the rows we truly need:

- Restrict `office` to 'President' to avoid mixing down-ballot contests

- Remove columns that are fully missing or irrelevant post-filter (e.g., a district column that’s empty for county-level rows)

### a. Primary Election Dataset

In [10]:
# Load primary data
primary_df = pd.read_csv(PRIMARY_PATH)
primary_df.head(DISPLAY_ROWS)

Unnamed: 0,county,office,district,party,candidate,votes,winner,write-in,notes
0,Apache,President,,DEM,"Peter ""Simon"" Bollander",8,,,
1,Cochise,President,,DEM,"Peter ""Simon"" Bollander",3,,,
2,Coconino,President,,DEM,"Peter ""Simon"" Bollander",19,,,
3,Gila,President,,DEM,"Peter ""Simon"" Bollander",11,,,
4,Graham,President,,DEM,"Peter ""Simon"" Bollander",2,,,
5,Greenlee,President,,DEM,"Peter ""Simon"" Bollander",1,,,
6,La Paz,President,,DEM,"Peter ""Simon"" Bollander",2,,,
7,Maricopa,President,,DEM,"Peter ""Simon"" Bollander",48,,,
8,Mohave,President,,DEM,"Peter ""Simon"" Bollander",3,,,
9,Navajo,President,,DEM,"Peter ""Simon"" Bollander",9,,,


In [11]:
# Different values in 'office' column
primary_df["office"].value_counts()

office
President    768
Name: count, dtype: int64

This dataset contains only presidential results, so the `office` column carries no information and we can drop it.

In [12]:
# Drop the "office" column 
# Also, drop the district, winner, and notes columns
primary_df = primary_df.drop(columns=["office", "district", "winner", "notes"]).reset_index(drop=True)
primary_df.head(DISPLAY_ROWS)

Unnamed: 0,county,party,candidate,votes,write-in
0,Apache,DEM,"Peter ""Simon"" Bollander",8,
1,Cochise,DEM,"Peter ""Simon"" Bollander",3,
2,Coconino,DEM,"Peter ""Simon"" Bollander",19,
3,Gila,DEM,"Peter ""Simon"" Bollander",11,
4,Graham,DEM,"Peter ""Simon"" Bollander",2,
5,Greenlee,DEM,"Peter ""Simon"" Bollander",1,
6,La Paz,DEM,"Peter ""Simon"" Bollander",2,
7,Maricopa,DEM,"Peter ""Simon"" Bollander",48,
8,Mohave,DEM,"Peter ""Simon"" Bollander",3,
9,Navajo,DEM,"Peter ""Simon"" Bollander",9,


Before going further, we take a closer look at the `write-in` column to understand what it contains.

In [15]:
# Closer look at "write-in"
primary_df["write-in"].nunique()

0

`nunique()` ignores missing values, so a result of 0 means the column has no non-null entries at all. We therefore drop it as well.

In [17]:
# Drop "write-in" column
primary_df = primary_df.drop(columns=["write-in"]).reset_index(drop=True)
primary_df.head(DISPLAY_ROWS)

Unnamed: 0,county,party,candidate,votes
0,Apache,DEM,"Peter ""Simon"" Bollander",8
1,Cochise,DEM,"Peter ""Simon"" Bollander",3
2,Coconino,DEM,"Peter ""Simon"" Bollander",19
3,Gila,DEM,"Peter ""Simon"" Bollander",11
4,Graham,DEM,"Peter ""Simon"" Bollander",2
5,Greenlee,DEM,"Peter ""Simon"" Bollander",1
6,La Paz,DEM,"Peter ""Simon"" Bollander",2
7,Maricopa,DEM,"Peter ""Simon"" Bollander",48
8,Mohave,DEM,"Peter ""Simon"" Bollander",3
9,Navajo,DEM,"Peter ""Simon"" Bollander",9


In [18]:
# Unique parties in primary_df
primary_df["party"].value_counts()

party
DEM    384
REP    384
Name: count, dtype: int64

In [19]:
# Candidates in primary_df
primary_df["candidate"].value_counts()

candidate
Peter "Simon" Bollander         16
William Campbell                16
Jerry Curry                     16
John Michael Fitzpatrick        16
Bob Forthan                     16
Daniel Gilbert                  16
Rudy Giuliani                   16
Mike Huckabee                   16
Duncan Hunter                   16
Alan Keyes                      16
John McCain                     16
Frank McEnulty                  16
John R. McGrath                 16
James Creighton Mitchell Jr.    16
Sean "Cf" Murphy                16
Rick Outzen                     16
Ron Paul                        16
Mitt Romney                     16
David Ruben                     16
Michael P. Shaw                 16
Jack Shepard                    16
Charles Skelley                 16
Rhett R. Smith                  16
Hugh Cort                       16
Michael Burzynski               16
Sandy Whitehouse                16
Libby Hubbard                   16
Hillary Clinton                 16
Orion Dale

In [20]:
# Missing values count
primary_df.isnull().sum()

county       48
party         0
candidate     0
votes         0
dtype: int64

There are missing values in `county`. Looking at the data, I suspect that the rows with a missing `county` are statewide total rows, one per candidate. A quick test is whether the number of missing values in `county` matches the number of candidates.

In [38]:
# Number of candidates in primary_df
primary_df["candidate"].nunique()

48

The counts match (48 and 48), which supports this reading. We can therefore drop the rows with a missing county without losing any county-level data.

In [22]:
# Drop rows with missing county in primary_df
primary_df = primary_df.dropna(subset=["county"]).copy()
primary_df.isnull().sum()

county       0
party        0
candidate    0
votes        0
dtype: int64
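
Matching counts are suggestive rather than conclusive. A stricter check, run before the drop, would compare each candidate's missing-county row against the sum of that candidate's county rows. The sketch below uses a small made-up table (`toy`, with invented candidates and vote counts) rather than the real data:

```python
import pandas as pd

# Toy stand-in for the raw primary table; all values are invented.
toy = pd.DataFrame({
    "county":    ["Apache", "Cochise", None, "Apache", "Cochise", None],
    "candidate": ["A", "A", "A", "B", "B", "B"],
    "votes":     [8, 3, 11, 5, 7, 12],
})

# Separate county-level rows from the suspected statewide total rows.
is_total = toy["county"].isna()
county_sums = toy[~is_total].groupby("candidate")["votes"].sum()
total_rows = toy[is_total].set_index("candidate")["votes"]

# Align both series on candidate and keep any candidate that disagrees.
comparison = pd.DataFrame({"county_sum": county_sums, "total_row": total_rows})
mismatches = comparison[comparison["county_sum"] != comparison["total_row"]]
print(mismatches.empty)
```

An empty `mismatches` frame would confirm that the missing-county rows are pure totals and that dropping them loses nothing.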

In [23]:
# Count the duplicate rows
duplicate_rows = primary_df.duplicated()
duplicate_rows.sum()

0

In [24]:
# Final look at the cleaned primary_df
primary_df.head(DISPLAY_ROWS)

Unnamed: 0,county,party,candidate,votes
0,Apache,DEM,"Peter ""Simon"" Bollander",8
1,Cochise,DEM,"Peter ""Simon"" Bollander",3
2,Coconino,DEM,"Peter ""Simon"" Bollander",19
3,Gila,DEM,"Peter ""Simon"" Bollander",11
4,Graham,DEM,"Peter ""Simon"" Bollander",2
5,Greenlee,DEM,"Peter ""Simon"" Bollander",1
6,La Paz,DEM,"Peter ""Simon"" Bollander",2
7,Maricopa,DEM,"Peter ""Simon"" Bollander",48
8,Mohave,DEM,"Peter ""Simon"" Bollander",3
9,Navajo,DEM,"Peter ""Simon"" Bollander",9


In [25]:
# Shape after preprocessing
primary_df.shape

(720, 4)

### b. General Election Dataset

In [26]:
# Load general data
general_df = pd.read_csv(GENERAL_PATH)
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,office,district,party,candidate,votes,winner,write-in,notes
0,Apache,President,,DEM,Barack Obama,15390,,,
1,Cochise,President,,DEM,Barack Obama,18943,,,
2,Coconino,President,,DEM,Barack Obama,31433,,,
3,Gila,President,,DEM,Barack Obama,7884,,,
4,Graham,President,,DEM,Barack Obama,3487,,,
5,Greenlee,President,,DEM,Barack Obama,1165,,,
6,La Paz,President,,DEM,Barack Obama,1929,,,
7,Maricopa,President,,DEM,Barack Obama,602166,,,
8,Mohave,President,,DEM,Barack Obama,22092,,,
9,Navajo,President,,DEM,Barack Obama,15579,,,


In [27]:
# Different values in 'office' column
general_df["office"].value_counts()

office
State House                 287
State Senate                159
U.S. House                  143
President                   128
Corporation Commissioner     96
Name: count, dtype: int64

In [28]:
# Only keep rows where 'office' is 'President'
general_df = general_df[general_df["office"] == "President"]
general_df.shape

(128, 9)

In [29]:
# Now, drop the "office" column as it's no longer needed
# Also, drop the district, winner, and notes columns
general_df = general_df.drop(columns=["office", "district", "winner", "notes"]).reset_index(drop=True)
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,party,candidate,votes,write-in
0,Apache,DEM,Barack Obama,15390,
1,Cochise,DEM,Barack Obama,18943,
2,Coconino,DEM,Barack Obama,31433,
3,Gila,DEM,Barack Obama,7884,
4,Graham,DEM,Barack Obama,3487,
5,Greenlee,DEM,Barack Obama,1165,
6,La Paz,DEM,Barack Obama,1929,
7,Maricopa,DEM,Barack Obama,602166,
8,Mohave,DEM,Barack Obama,22092,
9,Navajo,DEM,Barack Obama,15579,


Again, we check whether the `write-in` column contains anything.

In [32]:
# Closer look at "write-in"
general_df["write-in"].notna().sum()

48

There are non-empty values in the `write-in` column, so we look at what they are.

In [33]:
# Closer look at "write-in"
general_df["write-in"].value_counts()

write-in
True    48
Name: count, dtype: int64

This is just a Boolean flag, and write-in candidates remain identifiable through the `party` column (which has a `Write-In` value), so we can drop it without losing information.
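
Before dropping the flag, one could confirm that it is redundant with `party`. The sketch below assumes the flag is set exactly on the rows whose party is `Write-In`; it runs on a small invented table (`toy`), not on the real data:

```python
import pandas as pd

# Toy stand-in for the general table; values are invented.
toy = pd.DataFrame({
    "party":    ["DEM", "REP", "Write-In", "Write-In", "NONE"],
    "write-in": [None, None, True, True, None],
})

# Rows where the flag is set versus rows labelled as write-in by party.
flagged = toy["write-in"].notna()
is_write_in_party = toy["party"].eq("Write-In")

# True only if both masks agree on every row.
flag_is_redundant = bool((flagged == is_write_in_party).all())
print(flag_is_redundant)
```

If this were `False` on the real data, the flag would carry information that `party` does not, and dropping it would deserve a second look.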

In [35]:
# Drop "write-in" column
general_df = general_df.drop(columns=["write-in"]).reset_index(drop=True)
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,party,candidate,votes
0,Apache,DEM,Barack Obama,15390
1,Cochise,DEM,Barack Obama,18943
2,Coconino,DEM,Barack Obama,31433
3,Gila,DEM,Barack Obama,7884
4,Graham,DEM,Barack Obama,3487
5,Greenlee,DEM,Barack Obama,1165
6,La Paz,DEM,Barack Obama,1929
7,Maricopa,DEM,Barack Obama,602166
8,Mohave,DEM,Barack Obama,22092
9,Navajo,DEM,Barack Obama,15579


In [36]:
# List out all the parties in the general election data
general_df["party"].value_counts()

party
Write-In    48
DEM         16
GRN         16
LBT         16
REP         16
NONE        16
Name: count, dtype: int64

Note that `party` contains both a "Write-In" value and a "NONE" value. For our purposes, we group both into a single "WRI" bucket for write-in/nonpartisan candidates.

In [44]:
# Replace "Write-In" and "NONE" with "WRI"
general_df["party"] = (
    general_df["party"].replace({
        "Write-In": "WRI",
        "NONE": "WRI"
    })
)

# Sanity check
general_df["party"].value_counts()

party
WRI    60
DEM    15
GRN    15
LBT    15
REP    15
Name: count, dtype: int64

In [45]:
# Missing values count
general_df.isnull().sum()

county       0
party        0
candidate    0
votes        0
dtype: int64

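As in the primary data, we expect statewide total rows with a missing `county`. (The zeros shown above appear to come from re-running this cell after the drop below; the row count falls from 128 to 120 across that step, which implies 8 such rows.) With the same suspicion as before, we check whether that number agrees with the number of candidates in the dataframe.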

In [46]:
# Number of candidates in general_df
general_df["candidate"].nunique()

8

This agrees again, so we can drop the rows with a missing county value.

In [47]:
# Drop rows with missing county in general_df
general_df = general_df.dropna(subset=["county"]).copy()
general_df.isnull().sum()

county       0
party        0
candidate    0
votes        0
dtype: int64

In [48]:
# Final look at the cleaned general_df
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,party,candidate,votes
0,Apache,DEM,Barack Obama,15390
1,Cochise,DEM,Barack Obama,18943
2,Coconino,DEM,Barack Obama,31433
3,Gila,DEM,Barack Obama,7884
4,Graham,DEM,Barack Obama,3487
5,Greenlee,DEM,Barack Obama,1165
6,La Paz,DEM,Barack Obama,1929
7,Maricopa,DEM,Barack Obama,602166
8,Mohave,DEM,Barack Obama,22092
9,Navajo,DEM,Barack Obama,15579


In [49]:
# Shape after preprocessing
general_df.shape

(120, 4)

## 3. Table Pivoting

We convert the tall table (one row per county/party/candidate) into a wide one (one row per county, one column per candidate). This keeps the schema consistent with the group's previously cleaned datasets.

Helper functions:

- `normalize_party(s)`: lowercases the party codes so column names stay consistent with the other dataframes
- `candidate_token(name)`: turns “Barack Obama” -> OBAMA, “John McCain” -> MCCAIN, etc., giving a short, readable token for column names (unique as long as no two candidates in a party share a last name)
- `pivot_wide(df, prefix, key_col="county")`: the main pivot function, which
    * groups by `county` × `party` × `candidate` and sums `votes`,
    * pivots to columns named like:
        * Primary: `pri_dem_OBAMA`, `pri_rep_MCCAIN`,...
        * General: `gen_dem_OBAMA`, `gen_rep_MCCAIN`,...
    * flattens the MultiIndex into plain column strings,
    * returns one wide row per county
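
The pivot-and-flatten steps listed above can be sketched on a tiny table. The data here is invented, and the `party_key` and `candidate_token` columns are assumed to have been computed already:

```python
import pandas as pd

# Tall toy table: one row per county/party/candidate (invented vote counts).
tall = pd.DataFrame({
    "county":          ["Apache", "Apache", "Cochise", "Cochise"],
    "party_key":       ["dem", "rep", "dem", "rep"],
    "candidate_token": ["OBAMA", "MCCAIN", "OBAMA", "MCCAIN"],
    "votes":           [10, 5, 7, 9],
})

# Pivot to one row per county with (party, candidate) column pairs.
wide = tall.pivot_table(
    index="county",
    columns=["party_key", "candidate_token"],
    values="votes",
    aggfunc="sum",
    fill_value=0,
)

# Flatten the two-level column index into plain strings.
wide.columns = [f"gen_{p}_{c}" for p, c in wide.columns]
wide = wide.reset_index()
print(wide)
```

The result has one row per county and the columns `county`, `gen_dem_OBAMA`, and `gen_rep_MCCAIN`.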

In [50]:
def normalize_party(s: pd.Series) -> pd.Series:
    """
    Normalize party names: Lowercase the three-letter abbreviations
    """
    return s.str.lower()

In [51]:
SUFFIXES = {
    "JR","SR","JNR","SNR",
    "II","III","IV","V","VI","VII","VIII","IX","X","XI","XII"
}

def candidate_token(name: str) -> str:
    """
    Turn John McCain -> MCCAIN, Barack Obama -> OBAMA
    Skip suffixes, keep last name/token, capitalize, and remove punctuation
    """
    if pd.isna(name):
        return "UNKNOWN"                # Defensive purposes only, would not expect missing values
    
    # Strip surrounding whitespace (suffixes are removed further down)
    raw = str(name).strip()

    # If a comma exists, treat as 'LAST, FIRST ...'
    if "," in raw:
        last_part = raw.split(",", 1)[0]
        last_part = re.sub(r"[^A-Za-z0-9\s]+", "", last_part).strip().upper()
        tokens = last_part.split()
        return tokens[-1] if tokens else "UNKNOWN"

    # Otherwise: remove punctuation, split, then drop trailing suffixes
    tokens = re.sub(r"[^A-Za-z0-9\s]+", "", raw).strip().upper().split()
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return tokens[-1] if tokens else "UNKNOWN"

In [52]:
def pivot_wide(df: pd.DataFrame, prefix: str, key_col: str = "county") -> pd.DataFrame:
    """
    Pivot the dataframe to wide format based on party and candidate.
    Columns are named `{prefix}_{party}_{TOKEN}`, one row per `key_col` value.
    """
    # Work on a copy so the caller's dataframe is not modified
    df = df.copy()

    # Normalize party names
    df['party_key'] = normalize_party(df['party'])
    
    # Create candidate tokens
    df['candidate_token'] = df['candidate'].apply(candidate_token)
    
    # Pivot the dataframe (votes are summed per key x party x candidate token)
    pivot_df = df.pivot_table(index=key_col, 
                              columns=["party_key", "candidate_token"], 
                              values="votes", 
                              aggfunc='sum', 
                              fill_value=0)
    
    # Flatten multi-level columns
    pivot_df.columns = [f"{prefix}_{p}_{c}" for p, c in pivot_df.columns]
    
    # Reset index to turn key_col back into a column
    pivot_df = pivot_df.reset_index()
    
    return pivot_df
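
Because the pivot sums votes per party and candidate token, two different candidates in the same party who share a last name would be merged into one column without warning. A hedged sketch of a collision check follows; the names are invented, and `last_word_token` is a simplified, hypothetical stand-in for `candidate_token`:

```python
import pandas as pd

def last_word_token(name: str) -> str:
    """Hypothetical stand-in for candidate_token: upper-cased last word."""
    return name.split()[-1].upper()

# Invented candidates; two REP candidates share the last name "Lee".
toy = pd.DataFrame({
    "party":     ["REP", "REP", "REP", "DEM"],
    "candidate": ["Ann Lee", "Bob Lee", "Cal Poe", "Dee Lee"],
})
toy["token"] = toy["candidate"].map(last_word_token)

# Count distinct full names behind each (party, token) pair.
names_per_token = toy.groupby(["party", "token"])["candidate"].nunique()

# Any pair backed by more than one name would collapse into one column.
collisions = names_per_token[names_per_token > 1]
print(collisions)
```

An empty `collisions` series on the real data would mean every wide column maps back to exactly one candidate.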

In [53]:
# Primary dataframe pivot
primary_pivot = pivot_wide(primary_df, prefix="pri")
primary_pivot.head(DISPLAY_ROWS)

Unnamed: 0,county,pri_dem_BOLLANDER,pri_dem_CAMPBELL,pri_dem_CLINTON,pri_dem_DALEY,pri_dem_DOBSON,pri_dem_DODD,pri_dem_EDWARDS,pri_dem_GEST,pri_dem_GRAVEL,...,pri_rep_MURPHY,pri_rep_OUTZEN,pri_rep_PAUL,pri_rep_ROMNEY,pri_rep_RUBEN,pri_rep_SHAW,pri_rep_SHEPARD,pri_rep_SKELLEY,pri_rep_SMITH,pri_rep_THOMPSON
0,Apache,8,18,3768,8,20,5,208,3,5,...,1,4,140,1006,2,2,3,1,0,24
1,Cochise,3,4,4896,4,14,14,468,2,14,...,1,0,444,4260,4,2,4,1,1,82
2,Coconino,19,16,4884,6,18,13,360,4,7,...,5,1,421,2295,2,3,6,2,0,60
3,Gila,11,8,2839,7,14,12,633,5,7,...,2,1,250,1615,0,2,0,0,0,126
4,Graham,2,10,1027,4,2,10,184,1,0,...,1,1,58,1845,0,0,0,1,0,13
5,Greenlee,1,5,574,1,0,0,121,1,0,...,0,0,7,137,0,0,0,0,0,7
6,La Paz,2,3,581,0,2,1,57,1,3,...,1,0,38,377,0,0,0,0,0,17
7,Maricopa,48,96,125553,35,203,275,15712,33,154,...,217,30,15106,116995,61,35,41,35,30,7448
8,Mohave,3,9,6541,3,13,16,591,2,5,...,4,1,819,5469,0,3,2,0,1,286
9,Navajo,9,15,3834,8,18,15,325,2,15,...,3,2,257,4065,0,2,4,3,0,29


In [54]:
# Primary dataframe shape after pivot
primary_pivot.shape

(15, 49)

In [55]:
# General dataframe pivot
general_pivot = pivot_wide(general_df, prefix="gen")
general_pivot.head(DISPLAY_ROWS)

Unnamed: 0,county,gen_dem_OBAMA,gen_grn_MCKINNEY,gen_lbt_BARR,gen_rep_MCCAIN,gen_wri_ALLEN,gen_wri_BALDWIN,gen_wri_JAY,gen_wri_NADER
0,Apache,15390,75,111,8551,0,26,0,109
1,Cochise,18943,90,371,29026,0,34,0,356
2,Coconino,31433,117,267,22186,1,31,0,309
3,Gila,7884,31,150,14095,0,17,0,156
4,Graham,3487,23,60,8376,0,5,0,56
5,Greenlee,1165,3,16,1712,0,0,0,17
6,La Paz,1929,14,39,3509,0,8,0,53
7,Maricopa,602166,1799,7605,746448,5,832,12,6095
8,Mohave,22092,111,433,44333,0,75,0,561
9,Navajo,15579,70,158,19761,0,50,0,182


In [56]:
# General dataframe shape after pivot
general_pivot.shape

(15, 9)

## 4. Merge Dataframes

Before merging, we verify that county names match across primary and general:

In [57]:
# Check if county names match between primary_df and general_df
primary_counties = set(primary_df["county"].unique())
general_counties = set(general_df["county"].unique())
common_counties = primary_counties.intersection(general_counties)
print(f"Number of common counties: {len(common_counties)} out of {len(primary_counties)}")

Number of common counties: 15 out of 15
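
Comparing the intersection size against the primary set alone would not reveal a county that appears only in the general data. A small sketch with invented county sets shows a check in both directions:

```python
# Invented county sets; "La Paz" is deliberately spelled differently in each.
primary_counties = {"Apache", "Cochise", "La Paz"}
general_counties = {"Apache", "Cochise", "LaPaz"}

# Names present on one side only, in each direction.
only_in_primary = sorted(primary_counties - general_counties)
only_in_general = sorted(general_counties - primary_counties)

print(only_in_primary, only_in_general)
```

Both lists being empty is the condition under which the county names fully agree.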


Great. Since all county names match, no further preprocessing of county names is needed and we can merge the two tables:

In [58]:
# Merge primary and general dataframes on 'county'
merged_df = primary_pivot.merge(general_pivot, on="county", how="inner").fillna(0)    # There should be no missing values to fill with 0
merged_df.head(DISPLAY_ROWS)

Unnamed: 0,county,pri_dem_BOLLANDER,pri_dem_CAMPBELL,pri_dem_CLINTON,pri_dem_DALEY,pri_dem_DOBSON,pri_dem_DODD,pri_dem_EDWARDS,pri_dem_GEST,pri_dem_GRAVEL,...,pri_rep_SMITH,pri_rep_THOMPSON,gen_dem_OBAMA,gen_grn_MCKINNEY,gen_lbt_BARR,gen_rep_MCCAIN,gen_wri_ALLEN,gen_wri_BALDWIN,gen_wri_JAY,gen_wri_NADER
0,Apache,8,18,3768,8,20,5,208,3,5,...,0,24,15390,75,111,8551,0,26,0,109
1,Cochise,3,4,4896,4,14,14,468,2,14,...,1,82,18943,90,371,29026,0,34,0,356
2,Coconino,19,16,4884,6,18,13,360,4,7,...,0,60,31433,117,267,22186,1,31,0,309
3,Gila,11,8,2839,7,14,12,633,5,7,...,0,126,7884,31,150,14095,0,17,0,156
4,Graham,2,10,1027,4,2,10,184,1,0,...,0,13,3487,23,60,8376,0,5,0,56
5,Greenlee,1,5,574,1,0,0,121,1,0,...,0,7,1165,3,16,1712,0,0,0,17
6,La Paz,2,3,581,0,2,1,57,1,3,...,0,17,1929,14,39,3509,0,8,0,53
7,Maricopa,48,96,125553,35,203,275,15712,33,154,...,30,7448,602166,1799,7605,746448,5,832,12,6095
8,Mohave,3,9,6541,3,13,16,591,2,5,...,1,286,22092,111,433,44333,0,75,0,561
9,Navajo,9,15,3834,8,18,15,325,2,15,...,0,29,15579,70,158,19761,0,50,0,182
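
An inner merge silently drops counties that are missing on either side. As an optional safeguard, pandas' `merge` accepts `validate` and `indicator` arguments; the sketch below applies them to two invented frames (`left`, `right`) rather than to the real pivots:

```python
import pandas as pd

# Invented wide frames with one county missing on each side.
left = pd.DataFrame({"county": ["Apache", "Cochise"], "pri_dem_X": [1, 2]})
right = pd.DataFrame({"county": ["Apache", "Gila"], "gen_dem_X": [3, 4]})

# validate raises if a county is duplicated on either side;
# indicator adds a "_merge" column recording where each row came from.
checked = left.merge(
    right,
    on="county",
    how="outer",
    validate="one_to_one",
    indicator=True,
)

# Rows that did not find a partner in the other frame.
unmatched = checked.loc[checked["_merge"] != "both", ["county", "_merge"]]
print(unmatched)
```

When `unmatched` is empty, the inner merge and the outer merge contain the same counties.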


In [59]:
# Statistics check on merged dataframe 
merged_df.describe()

Unnamed: 0,pri_dem_BOLLANDER,pri_dem_CAMPBELL,pri_dem_CLINTON,pri_dem_DALEY,pri_dem_DOBSON,pri_dem_DODD,pri_dem_EDWARDS,pri_dem_GEST,pri_dem_GRAVEL,pri_dem_GRAYSON,...,pri_rep_SMITH,pri_rep_THOMPSON,gen_dem_OBAMA,gen_grn_MCKINNEY,gen_lbt_BARR,gen_rep_MCCAIN,gen_wri_ALLEN,gen_wri_BALDWIN,gen_wri_JAY,gen_wri_NADER
count,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,...,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0
mean,10.266667,16.533333,15300.066667,6.533333,26.533333,32.266667,1574.733333,5.466667,22.666667,21.466667,...,2.933333,632.8,68980.466667,227.066667,837.0,82007.4,0.533333,91.4,1.066667,753.4
std,12.144762,23.188256,32781.062045,8.919214,50.318793,68.588282,3956.328844,8.408046,43.009412,37.733779,...,7.620149,1891.685575,155816.50325,465.140911,1931.309141,189270.393765,1.302013,209.807327,3.195235,1557.993939
min,1.0,1.0,574.0,0.0,0.0,0.0,57.0,0.0,0.0,3.0,...,0.0,7.0,1165.0,3.0,16.0,1712.0,0.0,0.0,0.0,17.0
25%,2.5,6.5,2520.0,1.0,5.5,8.0,196.0,1.5,4.0,7.5,...,0.0,20.5,8283.5,27.0,85.5,8463.5,0.0,6.5,0.0,82.5
50%,8.0,10.0,4884.0,4.0,14.0,13.0,468.0,3.0,7.0,10.0,...,1.0,82.0,18559.0,75.0,205.0,22186.0,0.0,26.0,0.0,182.0
75%,10.5,15.5,6736.5,7.5,18.0,18.0,728.5,5.0,14.5,17.5,...,1.5,295.5,34161.0,116.5,481.5,51877.0,0.5,62.5,0.0,561.5
max,48.0,96.0,125553.0,35.0,203.0,275.0,15712.0,33.0,154.0,155.0,...,30.0,7448.0,602166.0,1799.0,7605.0,746448.0,5.0,832.0,12.0,6095.0


Now we add the party total columns:

- Primary totals:
    * `rep_primary_total` = sum of all `pri_rep_*` columns
    * `dem_primary_total` = sum of all `pri_dem_*` columns

- General totals:
    * `rep_general_total` = sum of all `gen_rep_*` columns
    * `dem_general_total` = sum of all `gen_dem_*` columns
    * `grn_general_total` = sum of all `gen_grn_*` columns
    * `lbt_general_total` = sum of all `gen_lbt_*` columns
    * `wri_general_total` = sum of all `gen_wri_*` columns

In [60]:
# Add party totals for primary election
rep_primary_cols   = [c for c in merged_df.columns if c.startswith("pri_rep_")]
dem_primary_cols   = [c for c in merged_df.columns if c.startswith("pri_dem_")]

merged_df["rep_primary_total"] = merged_df[rep_primary_cols].sum(axis=1) if rep_primary_cols else 0
merged_df["dem_primary_total"] = merged_df[dem_primary_cols].sum(axis=1) if dem_primary_cols else 0

In [62]:
# Add party totals for general election
rep_general_cols   = [c for c in merged_df.columns if c.startswith("gen_rep_")]
dem_general_cols   = [c for c in merged_df.columns if c.startswith("gen_dem_")]
grn_general_cols   = [c for c in merged_df.columns if c.startswith("gen_grn_")]
lbt_general_cols   = [c for c in merged_df.columns if c.startswith("gen_lbt_")]
wri_general_cols   = [c for c in merged_df.columns if c.startswith("gen_wri_")]

merged_df["rep_general_total"] = merged_df[rep_general_cols].sum(axis=1) if rep_general_cols else 0
merged_df["dem_general_total"] = merged_df[dem_general_cols].sum(axis=1) if dem_general_cols else 0
merged_df["grn_general_total"] = merged_df[grn_general_cols].sum(axis=1) if grn_general_cols else 0
merged_df["lbt_general_total"] = merged_df[lbt_general_cols].sum(axis=1) if lbt_general_cols else 0
merged_df["wri_general_total"] = merged_df[wri_general_cols].sum(axis=1) if wri_general_cols else 0

In [63]:
# Print out all the column names in the final dataframe
print("Final columns in the cleaned dataframe:")
merged_df.columns

Final columns in the cleaned dataframe:


Index(['county', 'pri_dem_BOLLANDER', 'pri_dem_CAMPBELL', 'pri_dem_CLINTON',
       'pri_dem_DALEY', 'pri_dem_DOBSON', 'pri_dem_DODD', 'pri_dem_EDWARDS',
       'pri_dem_GEST', 'pri_dem_GRAVEL', 'pri_dem_GRAYSON', 'pri_dem_HAYMER',
       'pri_dem_HUBBARD', 'pri_dem_KRUEGER', 'pri_dem_KUCINICH', 'pri_dem_LEE',
       'pri_dem_LYNCH', 'pri_dem_MONTELL', 'pri_dem_OATMAN', 'pri_dem_OBAMA',
       'pri_dem_RICHARDSON', 'pri_dem_SEE', 'pri_dem_TANNER',
       'pri_dem_VITULLO', 'pri_dem_WHITEHOUSE', 'pri_rep_BURZYNSKI',
       'pri_rep_CORT', 'pri_rep_CURRY', 'pri_rep_FITZPATRICK',
       'pri_rep_FORTHAN', 'pri_rep_GILBERT', 'pri_rep_GIULIANI',
       'pri_rep_HUCKABEE', 'pri_rep_HUNTER', 'pri_rep_KEYES', 'pri_rep_MCCAIN',
       'pri_rep_MCENULTY', 'pri_rep_MCGRATH', 'pri_rep_MITCHELL',
       'pri_rep_MURPHY', 'pri_rep_OUTZEN', 'pri_rep_PAUL', 'pri_rep_ROMNEY',
       'pri_rep_RUBEN', 'pri_rep_SHAW', 'pri_rep_SHEPARD', 'pri_rep_SKELLEY',
       'pri_rep_SMITH', 'pri_rep_THOMPSON', 'gen_d

In [64]:
# Preview merged dataframe with totals
merged_df.head(DISPLAY_ROWS)

Unnamed: 0,county,pri_dem_BOLLANDER,pri_dem_CAMPBELL,pri_dem_CLINTON,pri_dem_DALEY,pri_dem_DOBSON,pri_dem_DODD,pri_dem_EDWARDS,pri_dem_GEST,pri_dem_GRAVEL,...,gen_wri_BALDWIN,gen_wri_JAY,gen_wri_NADER,rep_primary_total,dem_primary_total,rep_general_total,dem_general_total,grn_general_total,lbt_general_total,wri_general_total
0,Apache,8,18,3768,8,20,5,208,3,5,...,26,0,109,2361,6481,8551,15390,75,111,135
1,Cochise,3,4,4896,4,14,14,468,2,14,...,34,0,356,10699,9393,29026,18943,90,371,390
2,Coconino,19,16,4884,6,18,13,360,4,7,...,31,0,309,8281,12050,22186,31433,117,267,341
3,Gila,11,8,2839,7,14,12,633,5,7,...,17,0,156,6046,5852,14095,7884,31,150,173
4,Graham,2,10,1027,4,2,10,184,1,0,...,5,0,56,2801,2244,8376,3487,23,60,61
5,Greenlee,1,5,574,1,0,0,121,1,0,...,0,0,17,368,1142,1712,1165,3,16,17
6,La Paz,2,3,581,0,2,1,57,1,3,...,8,0,53,1332,1041,3509,1929,14,39,61
7,Maricopa,48,96,125553,35,203,275,15712,33,154,...,832,12,6095,350246,253985,746448,602166,1799,7605,6944
8,Mohave,3,9,6541,3,13,16,591,2,5,...,75,0,561,17511,10679,44333,22092,111,433,636
9,Navajo,9,15,3834,8,18,15,325,2,15,...,50,0,182,7579,7472,19761,15579,70,158,232
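
As a final sanity check, the party totals in the wide table can be reconciled against the tall table they came from. The sketch below uses invented numbers and assumes the `{party}_general_total` naming used above:

```python
import pandas as pd

# Invented tall table (county x party x votes) and its wide totals.
tall = pd.DataFrame({
    "county": ["Apache", "Apache", "Apache", "Cochise", "Cochise", "Cochise"],
    "party":  ["DEM", "WRI", "WRI", "DEM", "WRI", "WRI"],
    "votes":  [100, 4, 6, 80, 1, 2],
})
wide = pd.DataFrame({
    "county": ["Apache", "Cochise"],
    "dem_general_total": [100, 80],
    "wri_general_total": [10, 3],
})

# Recompute the totals directly from the tall table.
expected = tall.groupby(["county", "party"])["votes"].sum().unstack(fill_value=0)
expected.columns = [f"{p.lower()}_general_total" for p in expected.columns]

# Align the wide table to the same rows and columns, then compare.
actual = wide.set_index("county").loc[expected.index, expected.columns]
totals_match = bool((expected == actual).all().all())
print(totals_match)
```

A `False` result would point to votes lost or double-counted somewhere between the pivot and the totals.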


Now, we save the cleaned dataframe into the processed directory.

In [65]:
# Save the cleaned and merged dataframe to CSV
out_dir = Path(OUTPUT_PATH)
out_dir.mkdir(parents=True, exist_ok=True)
merged_df.to_csv(out_dir / "AZ.csv", index=False)