# Pennsylvania 2008 Presidential Elections: Data Cleaning & Preprocessing

**Goal:** Build a clean, analysis-ready county-level table for Pennsylvania, 2008 by merging the presidential primary and presidential general election results, and then derive summary stats (party totals).

**Output**: A single CSV where each row is a county and columns include:

- Primary per-candidate vote counts (prefixed with `pri_`)
- General per-candidate vote counts (prefixed with `gen_`)
- Party totals: `rep_primary_total`, `dem_primary_total`, `rep_general_total`, `dem_general_total`, `lib_general_total`, `ind_general_total`

**Last Updated**: 2025/10/01

## 0. Library Import

In [1]:
import re
import pandas as pd
import numpy as np
from pathlib import Path


## 1. Inputs & Parameters

Define raw file paths once here so the entire notebook is easy to rerun on another machine. If a path changes, we only update it here. We keep a single `OUTPUT_PATH` so all exports land in one known place.

In [2]:
# PA 2008 dataset path
PRIMARY_PATH = r"../../data/raw/2008/PA/20080422__pa__primary__precinct.csv"
GENERAL_PATH = r"../../data/raw/2008/PA/20081104__pa__general__county.csv"

# Output directory
OUTPUT_PATH  = r"../../data/processed/2008/PA/"

# Analysis parameters
DISPLAY_ROWS = 10   # Number of rows to display in dataframes

## 2. Load & Filter

We load primary and general datasets separately and immediately subset to the rows we truly need:

- Restrict to presidential rows (`office_code == "USP"` in the primary file, `office == "President"` in the general file) to avoid mixing in down-ballot contests

- Remove columns that are fully missing or irrelevant post-filter (e.g., a district column that’s empty for county-level rows)

- Handle any additional complications as they arise at each step
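
The filtering steps listed above can be sketched on a toy frame shaped like the county-level general file. This is a minimal illustration, not the notebook's actual cell: the name `toy` and the three sample rows are assumptions made for the example, with vote counts copied from the Adams County rows shown later in this notebook.

```python
import pandas as pd

# Toy rows shaped like the county-level general file (illustrative subset)
toy = pd.DataFrame({
    "county":   ["Adams", "Adams", "Adams"],
    "office":   ["President", "President", "U.S. House"],
    "district": [None, None, 19.0],
    "party":    ["DEM", "REP", "DEM"],
    "votes":    [17633, 26349, 13416],
})

# Keep only the presidential contest
pres = toy[toy["office"] == "President"]

# Drop columns that are entirely missing after the filter (here: district)
pres = pres.dropna(axis=1, how="all").reset_index(drop=True)

print(pres.columns.tolist())
```

Dropping with `how="all"` removes a column only when every remaining value is missing, so partially filled columns survive.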

### a. Primary Election Dataset

In [3]:
# Pennsylvania precinct schema (33 cols) for 2008-era files
PA2008_COLS = [
    "year", "election_type", "county_code", "precinct_code",
    "cand_office_rank", "cand_district", "cand_party_rank", "cand_ballot_position",
    "office_code", "party_code", "candidate_id", "last", "first", "middle", "suffix",
    "votes", "us_cd", "state_sen", "state_house", "mcd_type", "mcd_name",
    "ward_code", "ward_name", "precinct_part_code", "precinct_part_name",
    "bi_county_flag", "mcd_code", "county_fips3", "vtd_code", "prev_precinct_code",
    "prev_us_cd", "prev_state_sen", "prev_state_house"
]

# Load primary data (no header in CSV)
primary_df = pd.read_csv(
    PRIMARY_PATH,
    header=None,
    names=PA2008_COLS,
    na_values=["", "NA", "N/A"]
)

# Sneak peek at the data
primary_df.head(DISPLAY_ROWS)


Unnamed: 0,year,election_type,county_code,precinct_code,cand_office_rank,cand_district,cand_party_rank,cand_ballot_position,office_code,party_code,...,precinct_part_code,precinct_part_name,bi_county_flag,mcd_code,county_fips3,vtd_code,prev_precinct_code,prev_us_cd,prev_state_sen,prev_state_house
0,2008,P,1,10,1,0,1,1,USP,DEM,...,,,0,5,1,10.0,0,19,33,91
1,2008,P,1,20,1,0,1,1,USP,DEM,...,,,0,10,1,20.0,0,19,33,91
2,2008,P,1,30,1,0,1,1,USP,DEM,...,,,0,15,1,30.0,0,19,33,193
3,2008,P,1,40,1,0,1,1,USP,DEM,...,,,0,20,1,40.0,0,19,33,91
4,2008,P,1,50,1,0,1,1,USP,DEM,...,,,0,25,1,50.0,0,19,33,193
5,2008,P,1,60,1,0,1,1,USP,DEM,...,,,0,27,1,60.0,0,19,33,91
6,2008,P,1,70,1,0,1,1,USP,DEM,...,,,0,30,1,70.0,0,19,33,193
7,2008,P,1,80,1,0,1,1,USP,DEM,...,,,0,33,1,80.0,0,19,33,91
8,2008,P,1,85,1,0,1,1,USP,DEM,...,,,0,33,1,,0,0,0,0
9,2008,P,1,90,1,0,1,1,USP,DEM,...,,,0,35,1,90.0,0,19,33,91


Most of these columns are not needed. We keep only the ones required for the analysis:

In [4]:
# Take columns we care about
needed_cols = ["county_code", "precinct_code", "office_code", "party_code", "last", "votes"]
primary_df = primary_df[needed_cols]
primary_df.head(DISPLAY_ROWS)

Unnamed: 0,county_code,precinct_code,office_code,party_code,last,votes
0,1,10,USP,DEM,OBAMA,36
1,1,20,USP,DEM,OBAMA,54
2,1,30,USP,DEM,OBAMA,18
3,1,40,USP,DEM,OBAMA,84
4,1,50,USP,DEM,OBAMA,53
5,1,60,USP,DEM,OBAMA,80
6,1,70,USP,DEM,OBAMA,113
7,1,80,USP,DEM,OBAMA,150
8,1,85,USP,DEM,OBAMA,96
9,1,90,USP,DEM,OBAMA,126


Based on the documentation, we keep only the rows whose `office_code` is `USP`, that is, the rows for the presidential election.

In [5]:
# Only keep rows where 'office_code' is 'USP'
primary_df = primary_df[primary_df["office_code"] == "USP"]
primary_df.shape

(64904, 6)

In [6]:
# Now, drop the "office_code" column as it's no longer needed.
primary_df = primary_df.drop(columns=["office_code"]).reset_index(drop=True)
primary_df.head(DISPLAY_ROWS)

Unnamed: 0,county_code,precinct_code,party_code,last,votes
0,1,10,DEM,OBAMA,36
1,1,20,DEM,OBAMA,54
2,1,30,DEM,OBAMA,18
3,1,40,DEM,OBAMA,84
4,1,50,DEM,OBAMA,53
5,1,60,DEM,OBAMA,80
6,1,70,DEM,OBAMA,113
7,1,80,DEM,OBAMA,150
8,1,85,DEM,OBAMA,96
9,1,90,DEM,OBAMA,126


In [7]:
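# Save the filtered precinct-level primary data as an intermediate checkpoint.
# Note: this writes to the same PA.csv that the final export at the end of the notebook overwrites.
out_dir = Path(OUTPUT_PATH)
out_dir.mkdir(parents=True, exist_ok=True)
primary_df.to_csv(out_dir / "PA.csv", index=False)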

Now, we aggregate precinct vote counts into county vote counts.

In [8]:
# Make sure votes are numeric
primary_df["votes"] = pd.to_numeric(primary_df["votes"], errors="coerce").fillna(0)

# Aggregate precinct vote counts into county vote counts
primary_df = (
    primary_df
    .groupby(["county_code", "party_code", "last"], as_index=False)["votes"]
    .sum()
    .rename(columns={"party_code": "party", "last": "candidate", "county_code": "county"})
)[["county", "candidate", "party", "votes"]]        # Reorder columns

# Preview the aggregated data
primary_df.head(DISPLAY_ROWS)

Unnamed: 0,county,candidate,party,votes
0,1,CLINTON,DEM,6567
1,1,OBAMA,DEM,4733
2,1,WRITE-IN,DEM,0
3,1,HUCKABEE,REP,1201
4,1,MCCAIN,REP,6561
5,1,PAUL,REP,1693
6,1,WRITE-IN,REP,0
7,2,CLINTON,DEM,169707
8,2,OBAMA,DEM,142361
9,2,WRITE-IN,DEM,0


Some issues may remain in the data, and we will look into them below. First, we map county codes to county names for readability.

In [9]:
# Code (1..67) -> county name
CODE_TO_NAME = {
     1:"Adams",         2:"Allegheny",   3:"Armstrong",    4:"Beaver",          5:"Bedford",
     6:"Berks",         7:"Blair",       8:"Bradford",     9:"Bucks",          10:"Butler",
    11:"Cambria",      12:"Cameron",    13:"Carbon",      14:"Centre",         15:"Chester",
    16:"Clarion",      17:"Clearfield", 18:"Clinton",     19:"Columbia",       20:"Crawford",
    21:"Cumberland",   22:"Dauphin",    23:"Delaware",    24:"Elk",            25:"Erie",
    26:"Fayette",      27:"Forest",     28:"Franklin",    29:"Fulton",         30:"Greene",
    31:"Huntingdon",   32:"Indiana",    33:"Jefferson",   34:"Juniata",        35:"Lackawanna",
    36:"Lancaster",    37:"Lawrence",   38:"Lebanon",     39:"Lehigh",         40:"Luzerne",
    41:"Lycoming",     42:"Mckean",     43:"Mercer",      44:"Mifflin",        45:"Monroe",
    46:"Montgomery",   47:"Montour",    48:"Northampton", 49:"Northumberland", 50:"Perry",
    51:"Philadelphia", 52:"Pike",       53:"Potter",      54:"Schuylkill",     55:"Snyder",
    56:"Somerset",     57:"Sullivan",   58:"Susquehanna", 59:"Tioga",          60:"Union",
    61:"Venango",      62:"Warren",     63:"Washington",  64:"Wayne",          65:"Westmoreland",
    66:"Wyoming",      67:"York"
}

# Map county codes to names
primary_df["county"] = pd.to_numeric(primary_df["county"], errors="coerce").map(CODE_TO_NAME)

# Preview the data with county names
primary_df.head(DISPLAY_ROWS)

Unnamed: 0,county,candidate,party,votes
0,Adams,CLINTON,DEM,6567
1,Adams,OBAMA,DEM,4733
2,Adams,WRITE-IN,DEM,0
3,Adams,HUCKABEE,REP,1201
4,Adams,MCCAIN,REP,6561
5,Adams,PAUL,REP,1693
6,Adams,WRITE-IN,REP,0
7,Allegheny,CLINTON,DEM,169707
8,Allegheny,OBAMA,DEM,142361
9,Allegheny,WRITE-IN,DEM,0
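
`Series.map` with a dictionary returns `NaN` for any key missing from the dictionary, so a gap in the code-to-name mapping would silently produce missing county names. A minimal sketch of how such a gap could be detected, assuming a truncated copy of the mapping and a deliberately invalid code (`99`), both made up for illustration:

```python
import pandas as pd

# Truncated, illustrative copy of the code -> name mapping
code_to_name = {1: "Adams", 2: "Allegheny"}

# 99 is a deliberately invalid code, used only for this example
codes = pd.Series([1, 2, 99])

# Codes missing from the dictionary map to NaN
names = codes.map(code_to_name)

# Collect the codes that failed to map
unmapped = codes[names.isna()].unique().tolist()
print(unmapped)
```

An empty list here would mean every code found a name.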


In [10]:
# Unique parties in primary_df
primary_df["party"].value_counts()

party
REP    268
DEM    201
Name: count, dtype: int64

In [11]:
# Candidates in primary_df
primary_df["candidate"].value_counts()

candidate
WRITE-IN    134
CLINTON      67
OBAMA        67
HUCKABEE     67
MCCAIN       67
PAUL         67
Name: count, dtype: int64

There are many write-in rows. Let's check whether they can be dropped.

In [12]:
primary_df[primary_df["candidate"] == "WRITE-IN"]["votes"].describe()

count     134.000000
mean      133.917910
std       193.441606
min         0.000000
25%         0.000000
50%        60.500000
75%       213.000000
max      1201.000000
Name: votes, dtype: float64

The write-in rows carry real votes (up to 1,201 for one party in a single county), so we cannot simply drop them. We keep them for now so that they are included in the party totals computed after pivoting.

In [13]:
# Missing values count
primary_df.isnull().sum()

county       0
candidate    0
party        0
votes        0
dtype: int64

In [14]:
# Final look at the (supposed) cleaned primary_df
primary_df.head(DISPLAY_ROWS)

Unnamed: 0,county,candidate,party,votes
0,Adams,CLINTON,DEM,6567
1,Adams,OBAMA,DEM,4733
2,Adams,WRITE-IN,DEM,0
3,Adams,HUCKABEE,REP,1201
4,Adams,MCCAIN,REP,6561
5,Adams,PAUL,REP,1693
6,Adams,WRITE-IN,REP,0
7,Allegheny,CLINTON,DEM,169707
8,Allegheny,OBAMA,DEM,142361
9,Allegheny,WRITE-IN,DEM,0


In [15]:
# Shape after preprocessing
primary_df.shape

(469, 4)

### b. General Election Dataset

In [16]:
# Load general data
general_df = pd.read_csv(GENERAL_PATH)
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,office,district,party,candidate,votes
0,Adams,President,,DEM,Barack Obama,17633
1,Adams,President,,REP,John McCain,26349
2,Adams,President,,IND,Ralph Nader,355
3,Adams,President,,LIB,Bob Barr,154
4,Adams,Attorney General,,DEM,John M. Morganelli,12005
5,Adams,Attorney General,,REP,Tom Corbett,30390
6,Adams,Attorney General,,LIB,Marakay J. Rogers,1147
7,Adams,U.S. House,19.0,DEM,Philip J. Avillo,13416
8,Adams,U.S. House,19.0,REP,Todd Platts,30393
9,Adams,State Senate,33.0,DEM,Bruce Tushingham,14843


In [17]:
# Different values in 'office' column
general_df["office"].value_counts()

office
State House         441
President           268
U.S. House          246
Attorney General    201
State Senate        141
Name: count, dtype: int64

In [18]:
# Only keep rows where 'office' is 'President'
general_df = general_df[general_df["office"] == "President"]
general_df.shape

(268, 6)

In [19]:
# Now, drop the "office" column as it's no longer needed. Also, drop the district column
general_df = general_df.drop(columns=["office", "district"]).reset_index(drop=True)
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,party,candidate,votes
0,Adams,DEM,Barack Obama,17633
1,Adams,REP,John McCain,26349
2,Adams,IND,Ralph Nader,355
3,Adams,LIB,Bob Barr,154
4,Allegheny,DEM,Barack Obama,373153
5,Allegheny,REP,John McCain,272347
6,Allegheny,IND,Ralph Nader,3927
7,Allegheny,LIB,Bob Barr,2009
8,Armstrong,DEM,Barack Obama,11138
9,Armstrong,REP,John McCain,18542


In [20]:
# List out all the parties in the general election data
general_df["party"].value_counts()

party
DEM    67
REP    67
IND    67
LIB    67
Name: count, dtype: int64

In [21]:
# Missing values count
general_df.isnull().sum()

county       0
party        0
candidate    0
votes        0
dtype: int64

In [22]:
# Final look at cleaned general_df
general_df.head(DISPLAY_ROWS)

Unnamed: 0,county,party,candidate,votes
0,Adams,DEM,Barack Obama,17633
1,Adams,REP,John McCain,26349
2,Adams,IND,Ralph Nader,355
3,Adams,LIB,Bob Barr,154
4,Allegheny,DEM,Barack Obama,373153
5,Allegheny,REP,John McCain,272347
6,Allegheny,IND,Ralph Nader,3927
7,Allegheny,LIB,Bob Barr,2009
8,Armstrong,DEM,Barack Obama,11138
9,Armstrong,REP,John McCain,18542


In [23]:
# Shape after preprocessing
general_df.shape

(268, 4)

## 3. Table Pivoting

We convert the tall table (one row per county/party/candidate) into a wide table (one row per county with one column per candidate). This keeps the schema consistent with the previous group's cleaned data.

Helper functions:

- `normalize_party(s)`: lowercases the three-letter party abbreviations (e.g., `DEM` -> `dem`, `REP` -> `rep`) so column names are stable
- `candidate_token(name)`: turns “Barack Obama” -> OBAMA, “John McCain” -> MCCAIN, etc., creating a short, readable token for column names
- `pivot_wide(df, prefix, key_col="county")`: main pivot function
    * groups by `county` × `party` × `candidate` and sums `votes`,
    * pivots to columns named like:
        * Primary: `pri_dem_OBAMA`, `pri_rep_MCCAIN`, ...
        * General: `gen_dem_OBAMA`, `gen_rep_MCCAIN`, ...
    * flattens the MultiIndex into plain column strings,
    * returns one wide row per county
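
The tall-to-wide step described above can be sketched on a four-row toy table. This is an illustration under stated assumptions, not the notebook's `pivot_wide` itself: the names `tall` and `wide` are made up, the party and candidate values are assumed to be normalized already, and the vote counts are the Adams and Allegheny general-election figures shown elsewhere in this notebook.

```python
import pandas as pd

# Tall input: one row per county / party / candidate
tall = pd.DataFrame({
    "county":    ["Adams", "Adams", "Allegheny", "Allegheny"],
    "party":     ["dem", "rep", "dem", "rep"],
    "candidate": ["OBAMA", "MCCAIN", "OBAMA", "MCCAIN"],
    "votes":     [17633, 26349, 373153, 272347],
})

# Pivot to one row per county with a (party, candidate) column MultiIndex
wide = tall.pivot_table(
    index="county",
    columns=["party", "candidate"],
    values="votes",
    aggfunc="sum",
    fill_value=0,
)

# Flatten the MultiIndex into plain strings such as gen_dem_OBAMA
wide.columns = [f"gen_{p}_{c}" for p, c in wide.columns]

# Turn the county index back into a regular column
wide = wide.reset_index()
print(wide)
```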

In [24]:
def normalize_party(s: pd.Series) -> pd.Series:
    """
    Normalize party names: Lowercase the three-letter abbreviations
    """
    return s.str.lower()

In [25]:
SUFFIXES = {
    "JR","SR","JNR","SNR",
    "II","III","IV","V","VI","VII","VIII","IX","X","XI","XII"
}

def candidate_token(name: str) -> str:
    """
    Turn John McCain -> MCCAIN, Barack Obama -> OBAMA
    Drop suffixes, keep the last-name token, uppercase it, and remove punctuation
    """
    if pd.isna(name):
        return "UNKNOWN"                # Defensive purposes only, would not expect missing values
    
    # Normalize the input to a stripped string
    raw = str(name).strip()

    # If a comma exists, treat as 'LAST, FIRST ...'
    if "," in raw:
        last_part = raw.split(",", 1)[0]
        last_part = re.sub(r"[^A-Za-z0-9\s]+", "", last_part).strip().upper()
        tokens = last_part.split()
        return tokens[-1] if tokens else "UNKNOWN"

    # Otherwise: remove punctuation, split, then drop trailing suffixes
    tokens = re.sub(r"[^A-Za-z0-9\s]+", "", raw).strip().upper().split()
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return tokens[-1] if tokens else "UNKNOWN"

In [26]:
def pivot_wide(df: pd.DataFrame, prefix: str, key_col: str="county") -> pd.DataFrame:
    """
    Pivot the dataframe to wide format based on party and candidate
    """
    # Work on a copy so the caller's dataframe is not modified
    df = df.copy()

    # Normalize party names
    df['party_key'] = normalize_party(df['party'])
    
    # Create candidate tokens
    df['candidate_token'] = df['candidate'].apply(candidate_token)
    
    # Pivot the dataframe
    pivot_df = df.pivot_table(index=key_col, 
                              columns=["party_key", "candidate_token"], 
                              values="votes", 
                              aggfunc='sum', 
                              fill_value=0)
    
    # Flatten multi-level columns
    pivot_df.columns = [f"{prefix}_{p}_{c}" for p, c in pivot_df.columns]
    
    # Reset index to turn key_col back into a column
    pivot_df = pivot_df.reset_index()
    
    return pivot_df

In [27]:
# Primary dataframe pivot
primary_pivot = pivot_wide(primary_df, prefix="pri")
primary_pivot.head(DISPLAY_ROWS)

Unnamed: 0,county,pri_dem_CLINTON,pri_dem_OBAMA,pri_dem_WRITEIN,pri_rep_HUCKABEE,pri_rep_MCCAIN,pri_rep_PAUL,pri_rep_WRITEIN
0,Adams,6567,4733,0,1201,6561,1693,0
1,Allegheny,169707,142361,0,5153,42509,9280,0
2,Armstrong,7246,2888,147,792,4410,815,285
3,Beaver,28331,12278,462,1309,6895,1565,748
4,Bedford,3711,1582,0,1027,4103,430,0
5,Berks,36064,26111,0,2403,14731,5869,0
6,Blair,8876,4827,87,1979,8564,1576,366
7,Bradford,3877,2014,70,1203,6007,939,396
8,Bucks,71757,42860,0,2941,29148,5868,0
9,Butler,15278,8864,195,2248,12557,2313,437


In [28]:
# Primary dataframe shape after pivot
primary_pivot.shape

(67, 8)

Note that because the primary data contains write-in rows, the pivoted dataframe has corresponding `pri_rep_WRITEIN` and `pri_dem_WRITEIN` columns. We keep them until the party totals have been calculated, then drop them.

In [29]:
# General dataframe pivot
general_pivot = pivot_wide(general_df, prefix="gen")
general_pivot.head(DISPLAY_ROWS)

Unnamed: 0,county,gen_dem_OBAMA,gen_ind_NADER,gen_lib_BARR,gen_rep_MCCAIN
0,Adams,17633,355,154,26349
1,Allegheny,373153,3927,2009,272347
2,Armstrong,11138,263,138,18542
3,Beaver,40499,826,268,42895
4,Bedford,6059,164,96,16124
5,Berks,97047,1614,826,80513
6,Blair,19813,335,246,32708
7,Bradford,10306,302,122,15057
8,Bucks,179031,2405,1240,150248
9,Butler,32260,722,369,57074


In [30]:
# General dataframe shape after pivot
general_pivot.shape

(67, 5)

## 4. Merge Dataframes

Before merging, we verify that county names match across primary and general:

In [31]:
# Check if county names match between primary_df and general_df
primary_counties = set(primary_df["county"].unique())
general_counties = set(general_df["county"].unique())
common_counties = primary_counties.intersection(general_counties)
print(f"Number of common counties: {len(common_counties)} out of {len(primary_counties)}")

Number of common counties: 67 out of 67


Great. Since all county names match, no further preprocessing of county names is needed, and we can merge the two tables:

In [32]:
# Merge primary and general dataframes on 'county'
merged_df = primary_pivot.merge(general_pivot, on="county", how="inner").fillna(0)    # No missing values are expected; fillna(0) is only a safeguard
merged_df.head(DISPLAY_ROWS)

Unnamed: 0,county,pri_dem_CLINTON,pri_dem_OBAMA,pri_dem_WRITEIN,pri_rep_HUCKABEE,pri_rep_MCCAIN,pri_rep_PAUL,pri_rep_WRITEIN,gen_dem_OBAMA,gen_ind_NADER,gen_lib_BARR,gen_rep_MCCAIN
0,Adams,6567,4733,0,1201,6561,1693,0,17633,355,154,26349
1,Allegheny,169707,142361,0,5153,42509,9280,0,373153,3927,2009,272347
2,Armstrong,7246,2888,147,792,4410,815,285,11138,263,138,18542
3,Beaver,28331,12278,462,1309,6895,1565,748,40499,826,268,42895
4,Bedford,3711,1582,0,1027,4103,430,0,6059,164,96,16124
5,Berks,36064,26111,0,2403,14731,5869,0,97047,1614,826,80513
6,Blair,8876,4827,87,1979,8564,1576,366,19813,335,246,32708
7,Bradford,3877,2014,70,1203,6007,939,396,10306,302,122,15057
8,Bucks,71757,42860,0,2941,29148,5868,0,179031,2405,1240,150248
9,Butler,15278,8864,195,2248,12557,2313,437,32260,722,369,57074
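
An inner merge silently drops any county whose name differs between the two tables. One way to make such mismatches visible is an outer merge with pandas' `indicator` and `validate` options. The sketch below uses two toy frames with made-up vote counts and a hypothetical spelling mismatch (`Mckean` vs `McKean`); none of these values come from the real data.

```python
import pandas as pd

# Toy frames with a deliberate spelling mismatch (all values are made up)
left = pd.DataFrame({"county": ["Adams", "Mckean"], "pri_votes": [10, 20]})
right = pd.DataFrame({"county": ["Adams", "McKean"], "gen_votes": [30, 40]})

# validate raises if a county appears more than once on either side;
# indicator adds a _merge column recording where each row came from
check = left.merge(
    right,
    on="county",
    how="outer",
    indicator=True,
    validate="one_to_one",
)

# Rows present on only one side are the mismatched county names
unmatched = check.loc[check["_merge"] != "both", ["county", "_merge"]]
print(unmatched)
```

If `unmatched` is empty, the inner merge loses no counties.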


In [33]:
# Statistics check on merged dataframe 
merged_df.describe()

Unnamed: 0,pri_dem_CLINTON,pri_dem_OBAMA,pri_dem_WRITEIN,pri_rep_HUCKABEE,pri_rep_MCCAIN,pri_rep_PAUL,pri_rep_WRITEIN,gen_dem_OBAMA,gen_ind_NADER,gen_lib_BARR,gen_rep_MCCAIN
count,67.0,67.0,67.0,67.0,67.0,67.0,67.0,67.0,67.0,67.0,67.0
mean,19020.820896,15846.985075,70.865672,1402.656716,8757.462687,2046.925373,196.970149,48900.940299,641.447761,297.19403,39640.074627
std,30360.201826,39916.207585,95.130197,1086.779668,9519.491622,2347.502653,241.422613,92572.082439,735.476022,370.267399,47860.898226
min,409.0,281.0,0.0,103.0,532.0,46.0,0.0,879.0,30.0,9.0,1323.0
25%,3375.5,1772.0,0.0,673.0,3146.5,578.0,0.0,6926.5,191.0,77.0,10833.0
50%,7901.0,4398.0,27.0,1034.0,4609.0,1069.0,94.0,16780.0,379.0,154.0,20750.0
75%,23333.5,14495.0,107.0,1946.0,11076.5,2447.0,303.5,47214.0,821.0,351.0,50551.5
max,169707.0,288376.0,462.0,5153.0,42509.0,9280.0,1201.0,595980.0,3927.0,2009.0,272347.0


Now, we add the party-total columns:

- Primary totals:
    * `rep_primary_total` = sum of all `pri_rep_*` columns
    * `dem_primary_total` = sum of all `pri_dem_*` columns

- General totals:
    * `rep_general_total` = sum of all `gen_rep_*` columns
    * `dem_general_total` = sum of all `gen_dem_*` columns
    * `ind_general_total` = sum of all `gen_ind_*` columns
    * `lib_general_total` = sum of all `gen_lib_*` columns

In [35]:
# Add party totals for primary election
rep_primary_cols   = [c for c in merged_df.columns if c.startswith("pri_rep_")]
dem_primary_cols   = [c for c in merged_df.columns if c.startswith("pri_dem_")]

merged_df["rep_primary_total"] = merged_df[rep_primary_cols].sum(axis=1) if rep_primary_cols else 0
merged_df["dem_primary_total"] = merged_df[dem_primary_cols].sum(axis=1) if dem_primary_cols else 0

The party totals now include the write-in votes, so we can drop the two WRITEIN columns.

In [36]:
# Drop WRITE-IN columns for primary election
writein_cols = [c for c in merged_df.columns if "WRITEIN" in c]
merged_df = merged_df.drop(columns=writein_cols)

# Preview the merged dataframe with primary totals
merged_df.head(DISPLAY_ROWS)

Unnamed: 0,county,pri_dem_CLINTON,pri_dem_OBAMA,pri_rep_HUCKABEE,pri_rep_MCCAIN,pri_rep_PAUL,gen_dem_OBAMA,gen_ind_NADER,gen_lib_BARR,gen_rep_MCCAIN,rep_primary_total,dem_primary_total
0,Adams,6567,4733,1201,6561,1693,17633,355,154,26349,9455,11300
1,Allegheny,169707,142361,5153,42509,9280,373153,3927,2009,272347,56942,312068
2,Armstrong,7246,2888,792,4410,815,11138,263,138,18542,6302,10281
3,Beaver,28331,12278,1309,6895,1565,40499,826,268,42895,10517,41071
4,Bedford,3711,1582,1027,4103,430,6059,164,96,16124,5560,5293
5,Berks,36064,26111,2403,14731,5869,97047,1614,826,80513,23003,62175
6,Blair,8876,4827,1979,8564,1576,19813,335,246,32708,12485,13790
7,Bradford,3877,2014,1203,6007,939,10306,302,122,15057,8545,5961
8,Bucks,71757,42860,2941,29148,5868,179031,2405,1240,150248,37957,114617
9,Butler,15278,8864,2248,12557,2313,32260,722,369,57074,17555,24337
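
Because the totals were computed before the write-in columns were dropped, each primary total can exceed the sum of the remaining named-candidate columns. A small worked check for Armstrong County, using the values from the tables in this notebook (the variable names are made up for the example):

```python
# Armstrong County, Republican primary (values copied from the tables in this notebook)
rep_candidates = {"HUCKABEE": 792, "MCCAIN": 4410, "PAUL": 815}
rep_writein = 285

# Sum over the named candidates only
named_only = sum(rep_candidates.values())

# Adding the write-ins reproduces rep_primary_total for Armstrong
with_writein = named_only + rep_writein

print(named_only, with_writein)
```

The second number matches the `rep_primary_total` of 6302 shown for Armstrong; the difference from the first is the write-in count.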


In [37]:
# Add party totals for general election
rep_general_cols   = [c for c in merged_df.columns if c.startswith("gen_rep_")]
dem_general_cols   = [c for c in merged_df.columns if c.startswith("gen_dem_")]
lib_general_cols   = [c for c in merged_df.columns if c.startswith("gen_lib_")]
ind_general_cols   = [c for c in merged_df.columns if c.startswith("gen_ind_")]

merged_df["rep_general_total"] = merged_df[rep_general_cols].sum(axis=1) if rep_general_cols else 0
merged_df["dem_general_total"] = merged_df[dem_general_cols].sum(axis=1) if dem_general_cols else 0
merged_df["lib_general_total"] = merged_df[lib_general_cols].sum(axis=1) if lib_general_cols else 0
merged_df["ind_general_total"] = merged_df[ind_general_cols].sum(axis=1) if ind_general_cols else 0

In [38]:
# Print out all the column names in the final dataframe
print("Final columns in the cleaned dataframe:")
merged_df.columns

Final columns in the cleaned dataframe:


Index(['county', 'pri_dem_CLINTON', 'pri_dem_OBAMA', 'pri_rep_HUCKABEE',
       'pri_rep_MCCAIN', 'pri_rep_PAUL', 'gen_dem_OBAMA', 'gen_ind_NADER',
       'gen_lib_BARR', 'gen_rep_MCCAIN', 'rep_primary_total',
       'dem_primary_total', 'rep_general_total', 'dem_general_total',
       'lib_general_total', 'ind_general_total'],
      dtype='object')

In [39]:
# Preview merged dataframe with totals
merged_df.head(DISPLAY_ROWS)

Unnamed: 0,county,pri_dem_CLINTON,pri_dem_OBAMA,pri_rep_HUCKABEE,pri_rep_MCCAIN,pri_rep_PAUL,gen_dem_OBAMA,gen_ind_NADER,gen_lib_BARR,gen_rep_MCCAIN,rep_primary_total,dem_primary_total,rep_general_total,dem_general_total,lib_general_total,ind_general_total
0,Adams,6567,4733,1201,6561,1693,17633,355,154,26349,9455,11300,26349,17633,154,355
1,Allegheny,169707,142361,5153,42509,9280,373153,3927,2009,272347,56942,312068,272347,373153,2009,3927
2,Armstrong,7246,2888,792,4410,815,11138,263,138,18542,6302,10281,18542,11138,138,263
3,Beaver,28331,12278,1309,6895,1565,40499,826,268,42895,10517,41071,42895,40499,268,826
4,Bedford,3711,1582,1027,4103,430,6059,164,96,16124,5560,5293,16124,6059,96,164
5,Berks,36064,26111,2403,14731,5869,97047,1614,826,80513,23003,62175,80513,97047,826,1614
6,Blair,8876,4827,1979,8564,1576,19813,335,246,32708,12485,13790,32708,19813,246,335
7,Bradford,3877,2014,1203,6007,939,10306,302,122,15057,8545,5961,15057,10306,122,302
8,Bucks,71757,42860,2941,29148,5868,179031,2405,1240,150248,37957,114617,150248,179031,1240,2405
9,Butler,15278,8864,2248,12557,2313,32260,722,369,57074,17555,24337,57074,32260,369,722


Now, we save the cleaned dataframe into the processed directory.

In [41]:
# Save the cleaned and merged dataframe to CSV
out_dir = Path(OUTPUT_PATH)
out_dir.mkdir(parents=True, exist_ok=True)
merged_df.to_csv(out_dir / "PA.csv", index=False)