# 109th Congress Data

This notebook loads, cleans, and examines data from the 2004 election to produce a viable metric relating PVI (Partisan Voting Index) to electoral success.

It feeds into a larger project about fairness in redistricting. For instance, if a Democrat cannot realistically win an R+15 district, then that district can be classified as wholly "safe" and bucketed with R+30 districts.

The nuance enters in the margins. How safe is an R+6 district, for instance? Can we quantify a district moving from R+3 to R+6 as a significant bias against Democrats? By cleaning this dataset, and others like it, we can estimate the predictive power of PVI in each district, and use those probabilities later to assess redistricting fairness.

## Retrieve Data

Currently, no free source pairs PVI values with congressional maps. However, our previous dataset from the 110th Congress has PVIs that also apply to the 2004 results. In this section, we will load data on the sitting members of the 109th Congress and match it to the previous dataset.

In [None]:
import pandas as pd
pd.set_option("display.max_rows", None, "display.max_columns", None)

In [87]:
#load in PVIs from the 110th Congress
pvi_only_109 = pd.read_csv("/Users/xavier/Desktop/DSPP/solo_projects/redistricting_project/raw_data/pvi_only_109.csv")
#load in wikipedia pull (with some pre-cleaning)
wiki = pd.read_csv("/Users/xavier/Desktop/DSPP/solo_projects/redistricting_project/raw_data/110_109_wikipedia.csv")
#data pulled from https://en.wikipedia.org/wiki/2004_United_States_House_of_Representatives_elections

In [94]:
#keep only the district and party columns
wiki = wiki[["District","Party"]]
#drop empty rows and repeated header rows
wiki = wiki[wiki.Party != ""]
wiki = wiki[wiki.Party != "Party"]
#fold Democratic-NPL and Independent members into the Democratic label
wiki.Party = wiki.Party.str.replace("Democratic-NPL","Democratic")
wiki.Party = wiki.Party.str.replace("Independent","Democratic")
#set Florida 16 (Republican win) and New Jersey 13 (Democratic win) by hand
wiki.at[171,'Party']='Republican'
wiki.at[518,'Party']='Democratic'
#remove rows with missing values
wiki = wiki.dropna()
#test results
print(wiki.Party.unique())
print(wiki.shape)

['R' 'D' 'Republican' 'Democratic']
(435, 2)
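
The output above still mixes single-letter and spelled-out party labels. As a minimal sketch of one way to normalize them in a single pass, the block below maps every variant to a one-letter code and reports anything left unmapped. The toy rows and the `PARTY_CODES` dictionary are assumptions made for illustration; they are not part of the original pull.

```python
import pandas as pd

# Toy rows standing in for the Wikipedia pull; values are illustrative only.
raw = pd.DataFrame({
    "District": ["Alabama 1", "District", "North Dakota at-large",
                 "Vermont at-large", "Alabama 5"],
    "Party": ["R", "Party", "Democratic-NPL", "Independent", "D"],
})

# Hypothetical mapping from each label variant to a single-letter code.
PARTY_CODES = {
    "R": "R", "Republican": "R",
    "D": "D", "Democratic": "D", "Democratic-NPL": "D",
    "Independent": "D",  # assumption: folded into Democratic, as in the cell above
}

cleaned = raw[raw["Party"] != "Party"].copy()   # drop repeated header rows
cleaned["seat"] = cleaned["Party"].map(PARTY_CODES)

# Labels missing from the mapping become NaN, so list them explicitly.
unmapped = cleaned.loc[cleaned["seat"].isna(), "Party"].unique()
print(cleaned)
print("unmapped labels:", list(unmapped))
```

Using `map` rather than chained `str.replace` calls means an unexpected label shows up as a missing value instead of passing through unchanged.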


## Clean the Results Data for the 109th Congress

In [95]:
#copy so later column assignments do not modify (or warn about) the wiki frame
results_109 = wiki.copy()

In [96]:
#abbreviate the at-large designation
results_109.District = results_109.District.str.replace("at-large","AL")
results_109.head(8)

Unnamed: 0,District,Party
0,Alabama 1,R
1,Alabama 2,R
2,Alabama 3,R
3,Alabama 4,R
4,Alabama 5,D
5,Alabama 6,R
6,Alabama 7,D
16,Alaska AL,R


In [104]:
#create all the baseline variables
results_109["year"] = 2004
results_109["congress"] = 109
results_109["seat"] = results_109["Party"].str[0]
results_109['ST'] = results_109['District'].str[:-2]
results_109.ST = results_109.ST.str.rstrip()
results_109['num'] = results_109['District'].str[-2:]
results_109.num = results_109.num.str.lstrip()
results_109["is_GOP"] = results_109["seat"].replace("R",1)
results_109["is_GOP"] = results_109["is_GOP"].replace("D",0)
#The seats of Charlie Dent and Pat Meehan in Pennsylvania went R -> D during this time frame
#convert to integer for later numeric analysis
results_109["is_GOP"] = results_109["is_GOP"].astype(int)
results_109.head(n=7)
print(results_109["is_GOP"].unique())

[1 0]
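
The fixed-width slices above (`str[:-2]` and `str[-2:]`) rely on district numbers being at most two characters long. A hedged alternative, shown here on made-up rows, splits on the last space instead and builds the dummy with `map`. The frame `districts` is hypothetical and exists only for this sketch.

```python
import pandas as pd

# Illustrative rows only; the real frame is results_109.
districts = pd.DataFrame({
    "District": ["Alabama 1", "California 10", "Alaska AL", "New York 3"],
    "seat": ["R", "D", "R", "D"],
})

# Split on the last space only, so multi-word state names stay intact.
parts = districts["District"].str.rsplit(" ", n=1, expand=True)
districts["ST"] = parts[0]
districts["num"] = parts[1]

# Map the seat letter to a 0/1 dummy; any other letter would become NaN.
districts["is_GOP"] = districts["seat"].map({"R": 1, "D": 0})
print(districts)
```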


In [105]:
#call in a dictionary of state abbreviations
us_state_to_abbrev = {
    "Alabama": "AL",
    "Alaska": "AK",
    "Arizona": "AZ",
    "Arkansas": "AR",
    "California": "CA",
    "Colorado": "CO",
    "Connecticut": "CT",
    "Delaware": "DE",
    "Florida": "FL",
    "Georgia": "GA",
    "Hawaii": "HI",
    "Idaho": "ID",
    "Illinois": "IL",
    "Indiana": "IN",
    "Iowa": "IA",
    "Kansas": "KS",
    "Kentucky": "KY",
    "Louisiana": "LA",
    "Maine": "ME",
    "Maryland": "MD",
    "Massachusetts": "MA",
    "Michigan": "MI",
    "Minnesota": "MN",
    "Mississippi": "MS",
    "Missouri": "MO",
    "Montana": "MT",
    "Nebraska": "NE",
    "Nevada": "NV",
    "New Hampshire": "NH",
    "New Jersey": "NJ",
    "New Mexico": "NM",
    "New York": "NY",
    "North Carolina": "NC",
    "North Dakota": "ND",
    "Ohio": "OH",
    "Oklahoma": "OK",
    "Oregon": "OR",
    "Pennsylvania": "PA",
    "Rhode Island": "RI",
    "South Carolina": "SC",
    "South Dakota": "SD",
    "Tennessee": "TN",
    "Texas": "TX",
    "Utah": "UT",
    "Vermont": "VT",
    "Virginia": "VA",
    "Washington": "WA",
    "West Virginia": "WV",
    "Wisconsin": "WI",
    "Wyoming": "WY",
    "District of Columbia": "DC",
    "American Samoa": "AS",
    "Guam": "GU",
    "Northern Mariana Islands": "MP",
    "Puerto Rico": "PR",
    "United States Minor Outlying Islands": "UM",
    "U.S. Virgin Islands": "VI",
}

In [106]:
#abbreviate the ST column
results_109 = results_109.replace({"ST": us_state_to_abbrev})
#check that both ST and num function appropriately
print(results_109["ST"].unique())
results_109["ST#"] = results_109["ST"] + results_109["num"]

['AL' 'AK' 'AZ' 'AR' 'CA' 'CO' 'CT' 'DE' 'FL' 'GA' 'HI' 'ID' 'IL' 'IN'
 'IA' 'KS' 'KY' 'LA' 'ME' 'MD' 'MA' 'MI' 'MN' 'MS' 'MO' 'MT' 'NE' 'NV'
 'NH' 'NJ' 'NM' 'NY' 'NC' 'ND' 'OH' 'OK' 'OR' 'PA' 'RI' 'SC' 'SD' 'TN'
 'TX' 'UT' 'VT' 'VA' 'WA' 'WV' 'WI' 'WY']
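
Printing the unique values works as a visual check. As a sketch of a more explicit one, the block below uses `Series.map` with a small excerpt of the dictionary and lists any state name that failed to map. The excerpt, the toy frame, and the deliberate typo are assumptions for illustration only.

```python
import pandas as pd

# Small excerpt of the state dictionary, repeated so the sketch runs on its own.
abbrev_excerpt = {"Alabama": "AL", "Alaska": "AK", "New York": "NY"}

# Toy frame; the last state name is a deliberate typo.
frame = pd.DataFrame({
    "ST": ["Alabama", "Alaska", "New York", "Alabma"],
    "num": ["1", "AL", "3", "2"],
})

mapped = frame["ST"].map(abbrev_excerpt)
unmapped_states = frame.loc[mapped.isna(), "ST"].tolist()
print("unmapped:", unmapped_states)

# Build the district code only for rows whose state mapped cleanly.
frame["ST"] = mapped
valid = frame.dropna(subset=["ST"]).copy()
valid["ST#"] = valid["ST"] + valid["num"]
print(valid)
```

Unlike `replace`, which leaves unknown names untouched, `map` turns them into missing values, so misspellings are easy to find.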


In [108]:
results_109 = results_109[["year","congress","ST","ST#","seat","is_GOP"]]
results_109.head(8)

Unnamed: 0,year,congress,ST,ST#,seat,is_GOP
0,2004,109,AL,AL1,R,1
1,2004,109,AL,AL2,R,1
2,2004,109,AL,AL3,R,1
3,2004,109,AL,AL4,R,1
4,2004,109,AL,AL5,D,0
5,2004,109,AL,AL6,R,1
6,2004,109,AL,AL7,D,0
16,2004,109,AK,AKAL,R,1


## Merge the 109th Election Results with PVI

In [None]:
#Merge

In [None]:
import numpy as np
import plotnine as p9
from plotnine import ggplot, aes, facet_grid, labs, geom_point, geom_smooth
from sklearn.linear_model import LinearRegression as lm
import warnings
warnings.filterwarnings('ignore')

In [None]:
#load in and check the data
pvi_only_109 = pd.read_csv("/Users/xavier/Desktop/DSPP/solo_projects/redistricting_project/raw_data/pvi_only_109.csv")
print(pvi_only_109.head(n=1))
print(pvi_only_109.shape)

In [None]:
#Merge with congressional districts
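
The merge cell is empty in this notebook, so the join itself is not shown. The block below is only a sketch of how election results could be joined to PVI values on the shared `ST#` district code. It is not necessarily the join that the later cells rely on, and the toy frames and PVI values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins; column names follow the cells above, values are invented.
results_toy = pd.DataFrame({
    "ST#": ["AL1", "AL5", "AKAL"],
    "seat": ["R", "D", "R"],
    "is_GOP": [1, 0, 1],
})
pvi_toy = pd.DataFrame({
    "ST#": ["AL1", "AL5", "AKAL"],
    "PVI": ["R+12", "R+6", "R+14"],  # placeholder values, not real PVIs
})

merged = results_toy.merge(
    pvi_toy,
    on="ST#",
    how="left",
    validate="one_to_one",  # raise if a district code repeats on either side
    indicator=True,         # adds a _merge column showing where each row matched
)

# Rows that did not find a PVI match would appear here.
unmatched = merged[merged["_merge"] != "both"]
print(merged)
print("unmatched rows:", len(unmatched))
```

With 435 districts expected, checking `len(unmatched)` after the join is a quick way to catch district codes that were formatted differently in the two sources.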

In [None]:
#correct the column name for later use
pvi_109 = pvi_109.rename(columns={"Party of\nRepresentative": "Representative"})

In [None]:
#fix at large designation
pvi_109['District'] = pvi_109['District'].str.replace('at-large','AL')
pvi_109.head(n=8)

In [None]:
#fix the capitalized at-large designation and strip ordinal suffixes (1st, 2nd, 3rd, 4th)
pvi_109['District'] = pvi_109['District'].str.replace('At-large','AL')
#a regex anchored at the end removes the suffix itself; rstrip would strip any trailing s/t/n/d/r/h characters
pvi_109['District'] = pvi_109['District'].str.replace(r'(st|nd|rd|th)$','',regex=True)
pvi_109.head(n=8)
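
Because `str.rstrip` treats its argument as a set of characters rather than a suffix, it helps to confirm the cleaning step on a few representative labels. The sketch below does this with an anchored pattern that keeps the digits and drops the ordinal suffix. The `labels` series is a made-up sample, not data from the PVI file.

```python
import pandas as pd

# Made-up sample of district labels as they might appear before cleaning.
labels = pd.Series(["1st", "2nd", "3rd", "4th", "11th", "21st", "At-large"])

cleaned_labels = (
    labels
    # match at-large regardless of capitalization
    .str.replace(r"^at-large$", "AL", case=False, regex=True)
    # keep the digits, drop a trailing ordinal suffix
    .str.replace(r"^(\d+)(?:st|nd|rd|th)$", r"\1", regex=True)
)
print(cleaned_labels.tolist())
```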

In [None]:
#separate state and district
pvi_109["num"] = pvi_109.District
pvi_109["ST"] =  pvi_109["State"]
pvi_109.head(n=1)

In [None]:
#abbreviate the ST column
pvi_109 = pvi_109.replace({"ST": us_state_to_abbrev})
#check that both ST and num function appropriately
print(pvi_109["ST"].unique())
print(pvi_109["num"].unique())

In [None]:
#create the district code variable
pvi_109["ST#"] = pvi_109["ST"] + pvi_109["num"]
#create the "party of representative" variable
pvi_109["seat"] = pvi_109.Representative.str[0]
#make a dummy
pvi_109["is_GOP"] = pvi_109["seat"].replace("R",1)
pvi_109["is_GOP"] = pvi_109["is_GOP"].replace("D",0)
#The seats of Charlie Dent and Pat Meehan in Pennsylvania went R -> D during this time frame
#convert to integer for later numeric analysis
pvi_109["is_GOP"] = pvi_109["is_GOP"].astype(int)
pvi_109.head(n=7)
print(pvi_109["is_GOP"].unique())

In [None]:
#possibly unnecessary
#pull out district lean
pvi_109["lean"] = pvi_109.PVI.str[0]
pvi_109["lean"].unique()

In [None]:
#split out pvi by party and weight
#first eliminate even values
pvi_109["PVI"] = pvi_109["PVI"].str.replace("EVEN","R+0")
pvi_109["pvi_party"] = pvi_109.PVI.str[0]
pvi_109['partisan_weight'] = pvi_109['PVI'].str.split('+').str[1]
#check results
print(pvi_109.head(n=1))
print(pvi_109["pvi_party"].unique())

In [None]:
#create a variable that is negative when the party is democratic
pvi_109["neg"] = pvi_109['pvi_party'].str.replace('D','-')
pvi_109["neg"] = pvi_109["neg"].str.replace('R','')
#ensure partisan weight has a negative value for democratic leaning and a positive value for Republican leaning
pvi_109['partisan_weight'] = pvi_109["neg"] + pvi_109['partisan_weight']
#ensure values are integers
pvi_109['partisan_weight'] = pvi_109['partisan_weight'].astype(int)

In [None]:
pvi_109["metric"] = ((pvi_109['partisan_weight'] / 2) + 50) / 100
pvi_109.head(n=7)
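
The metric rescales the signed PVI weight so that an even district sits at 0.50, R+10 at 0.55, and D+10 at 0.45. As a minimal sketch, the same arithmetic can be written as one function and spot-checked on a few labels. The helper name `pvi_to_metric` is hypothetical and is not used elsewhere in this notebook.

```python
def pvi_to_metric(pvi: str) -> float:
    """Convert a PVI label such as 'R+6', 'D+10', or 'EVEN' to the 0-1 metric."""
    label = pvi.strip().upper()
    if label == "EVEN":
        return 0.5
    party, weight = label.split("+")
    if party not in ("R", "D"):
        raise ValueError(f"unexpected PVI label: {pvi!r}")
    # Republican lean is positive, Democratic lean is negative.
    signed = int(weight) if party == "R" else -int(weight)
    # Same formula as the cell above: ((weight / 2) + 50) / 100
    return (signed / 2 + 50) / 100


for label in ["D+10", "EVEN", "R+6", "R+10"]:
    print(label, pvi_to_metric(label))
```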

In [None]:
#add constants for later aggregation
pvi_109["year"] = 2004
pvi_109["congress"] = 109

In [None]:
#create a new dataset for the district PVIs from the previous congress
pvi_only_109 = pvi_109.copy()
pvi_only_109["year"] = 2004
pvi_only_109["congress"] = 109
#keep only the identifying columns and the PVI measures
pvi_only_109 = pvi_only_109[["year","congress","ST","ST#","PVI","metric"]]
pvi_only_109.to_csv("/Users/xavier/Desktop/DSPP/solo_projects/redistricting_project/raw_data/pvi_only_109.csv",index=False)

## Export clean versions of the data

In [None]:
#create a dataset solely to correlate pvi with the holder of the seat
pure_109 = pvi_109[["year","metric","is_GOP"]]
pure_109.to_csv("/Users/xavier/Desktop/DSPP/solo_projects/redistricting_project/clean_data/pure_datasets/pure_109.csv",index=False)
#create a more detailed dataset for broader use
data_109 = pvi_109[["year","congress","ST","ST#","seat","is_GOP","PVI","metric"]]
data_109.to_csv("/Users/xavier/Desktop/DSPP/solo_projects/redistricting_project/clean_data/full_districts/data_109.csv",index=False)

In [None]:
data_109.head(8)

## Conduct preliminary examinations of the data

In [None]:
# Create a simple scatterplot to examine the relationship between district PVI and the party holding the seat
(p9.ggplot(data=pure_109, mapping=p9.aes(x='metric', y='is_GOP'))
 + p9.geom_point() 
 + labs(x='GOP Leaning of District', y='GOP Representation',color="",title="PVI Correlation"))

In [None]:
#group by metric and average GOP seats
bm_109 = pure_109.groupby('metric').mean()
#this will not be produced into a dataset without aggregating all years, as 
bm_109 = bm_109.reset_index()
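
Averaging a 0/1 dummy within each metric value gives the share of seats held by Republicans at that PVI. The sketch below shows the same aggregation on made-up data and also keeps the number of districts behind each average, since a share based on one or two districts is noisy. The `toy` frame and the column names `gop_share` and `n_districts` are assumptions for illustration.

```python
import pandas as pd

# Made-up districts: metric is the rescaled PVI, is_GOP the seat dummy.
toy = pd.DataFrame({
    "metric": [0.47, 0.47, 0.50, 0.53, 0.53, 0.53],
    "is_GOP": [0, 0, 1, 1, 1, 0],
})

share = (
    toy.groupby("metric", as_index=False)
       .agg(gop_share=("is_GOP", "mean"),    # share of GOP-held seats
            n_districts=("is_GOP", "size"))  # districts behind each share
)
print(share)
```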

In [None]:
#Plot averages
(p9.ggplot(data=bm_109, mapping=p9.aes(x='metric', y='is_GOP'))
 + p9.geom_point() 
 + labs(x='GOP Leaning of District', y="Share of GOP Representatives",color="",title="PVI Correlation"))

In [None]:
#now let's limit the data to only the R+10 to D+10 range, calling it Limited Domain
ld_109 = bm_109[bm_109["metric"] <= .55]
ld_109 = ld_109[ld_109["metric"] >= .45]

In [None]:
#Plot averages
(p9.ggplot(data=ld_109, mapping=p9.aes(x='metric', y='is_GOP'))
 + p9.geom_point() 
 + geom_smooth(method = "lm", color = "red", se = False)
 + labs(x='GOP Leaning of District', y="Share of GOP Representatives",color="",title="PVI Correlation"))
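
The red line drawn by `geom_smooth(method="lm")` is an ordinary least-squares fit. To read off a slope and intercept rather than only plot them, one option is `numpy.polyfit` with degree 1, which is a swapped-in technique here and not what the plot cell calls internally. The arrays below are invented placeholders, not results from the 2004 data.

```python
import numpy as np

# Invented placeholder values for the limited domain (D+10 to R+10).
metric = np.array([0.45, 0.47, 0.49, 0.51, 0.53, 0.55])
gop_share = np.array([0.10, 0.25, 0.40, 0.60, 0.75, 0.90])

# Degree-1 least-squares fit: gop_share is approximated by slope * metric + intercept.
slope, intercept = np.polyfit(metric, gop_share, deg=1)


def predicted_share(m: float) -> float:
    """Fitted GOP seat share at a given metric value."""
    return slope * m + intercept


print("slope:", slope)
print("intercept:", intercept)
print("fitted share at an even district:", predicted_share(0.50))
```

A linear fit can return values below 0 or above 1 outside the limited domain, so it is best read only within the range it was fitted on.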