# 117th Congress Data

This notebook retrieves, cleans, and examines data from the 2020 election, with the goal of producing a viable metric that relates PVI to electoral success.

It feeds into a larger project about fairness in redistricting. For instance, if a Democrat cannot realistically win an R+15 district, then that district can be classified as "safe" and bucketed with R+30 districts.

The nuance enters in the margins. How safe is an R+6 district, for instance? Can we quantify a district moving from R+3 to R+6 as a significant bias against Democrats? By cleaning this dataset, and others like it, we can determine the predictive power of PVI in each district, and use those probabilities later to assess redistricting fairness.

## Retrieve data from Wikipedia

Wikipedia's Cook PVI page contains a free and current version of the index (up to the 2020 election results), whereas the free downloads from Cook's website are dated, containing 2018 results and 2016 PVI metrics.

In [1]:
import requests
import pandas as pd
pd.set_option("display.max_rows", None, "display.max_columns", None)

In [2]:
#read every table on the page and stack them into one frame
url = "https://en.wikipedia.org/wiki/Cook_Partisan_Voting_Index"
page = pd.read_html(url)
dat = pd.concat(page,ignore_index=True)

In [3]:
#split data into state and district pvis
wiki_district = dat[0:435]
wiki_state = dat[435:485]

In [4]:
#cut columns that contain any NA values
wiki_district = wiki_district.dropna(axis=1, how='any')
wiki_state = wiki_state.dropna(axis=1, how='any')

## Clean and Organize State Data

In [5]:
#call in a dictionary of state abbreviations
us_state_to_abbrev = {
    "Alabama": "AL",
    "Alaska": "AK",
    "Arizona": "AZ",
    "Arkansas": "AR",
    "California": "CA",
    "Colorado": "CO",
    "Connecticut": "CT",
    "Delaware": "DE",
    "Florida": "FL",
    "Georgia": "GA",
    "Hawaii": "HI",
    "Idaho": "ID",
    "Illinois": "IL",
    "Indiana": "IN",
    "Iowa": "IA",
    "Kansas": "KS",
    "Kentucky": "KY",
    "Louisiana": "LA",
    "Maine": "ME",
    "Maryland": "MD",
    "Massachusetts": "MA",
    "Michigan": "MI",
    "Minnesota": "MN",
    "Mississippi": "MS",
    "Missouri": "MO",
    "Montana": "MT",
    "Nebraska": "NE",
    "Nevada": "NV",
    "New Hampshire": "NH",
    "New Jersey": "NJ",
    "New Mexico": "NM",
    "New York": "NY",
    "North Carolina": "NC",
    "North Dakota": "ND",
    "Ohio": "OH",
    "Oklahoma": "OK",
    "Oregon": "OR",
    "Pennsylvania": "PA",
    "Rhode Island": "RI",
    "South Carolina": "SC",
    "South Dakota": "SD",
    "Tennessee": "TN",
    "Texas": "TX",
    "Utah": "UT",
    "Vermont": "VT",
    "Virginia": "VA",
    "Washington": "WA",
    "West Virginia": "WV",
    "Wisconsin": "WI",
    "Wyoming": "WY",
    "District of Columbia": "DC",
    "American Samoa": "AS",
    "Guam": "GU",
    "Northern Mariana Islands": "MP",
    "Puerto Rico": "PR",
    "United States Minor Outlying Islands": "UM",
    "U.S. Virgin Islands": "VI",
}

In [6]:
state_117 = wiki_state.copy()
state_117["ST"] = state_117["State"]
#abbreviate the ST column
state_117 = state_117.replace({"ST": us_state_to_abbrev})
#check that every state name was abbreviated
print(state_117["ST"].unique())

['AL' 'AK' 'AZ' 'AR' 'CA' 'CO' 'CT' 'DE' 'FL' 'GA' 'HI' 'ID' 'IL' 'IN'
 'IA' 'KS' 'KY' 'LA' 'ME' 'MD' 'MA' 'MI' 'MN' 'MS' 'MO' 'MT' 'NE' 'NV'
 'NH' 'NJ' 'NM' 'NY' 'NC' 'ND' 'OH' 'OK' 'OR' 'PA' 'RI' 'SC' 'SD' 'TN'
 'TX' 'UT' 'VT' 'VA' 'WA' 'WV' 'WI' 'WY']


In [7]:
#split out pvi by party and weight
#first eliminate even values
state_117["PVI"] = state_117["PVI"].str.replace("EVEN","R+0")
state_117["pvi_party"] = state_117.PVI.str[0]
state_117['partisan_weight'] = state_117['PVI'].str.split('+').str[1]
#check results
print(state_117.head(n=1))
print(state_117["pvi_party"].unique())

    Housebalance   PVI Party ofgovernor Partyin Senate    State  ST pvi_party  \
435       6R, 1D  R+15       Republican     Republican  Alabama  AL         R   

    partisan_weight  
435              15  
['R' 'D']


In [8]:
#add constants for later aggregation
state_117["year"] = 2020
state_117["congress"] = 117

In [None]:
#create a variable that is negative when the party is democratic
state_117["neg"] = state_117['pvi_party'].str.replace('D','-')
state_117["neg"] = state_117["neg"].str.replace('R','')
#ensure partisan weight has a negative value for democratic leaning and a positive value for Republican leaning
state_117['partisan_weight'] = state_117["neg"] + state_117['partisan_weight']
#ensure values are integers
state_117['partisan_weight'] = state_117['partisan_weight'].astype(int)

In [None]:
state_117["metric"] = ((state_117['partisan_weight'] / 2) + 50) / 100
state_117.head(n=7)
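
The cells above turn a PVI label into a signed weight and then into the 0 to 1 metric in several column-wise steps. As a compact cross-check, the same arithmetic can be sketched for a single label. The function name `pvi_to_metric` is hypothetical and is not part of the notebook; it assumes labels have the form `R+15`, `D+4`, or `EVEN`.

```python
def pvi_to_metric(pvi: str) -> float:
    """Convert a PVI label ('R+15', 'D+4', 'EVEN') to the notebook's 0-1 metric."""
    label = pvi.strip().upper()
    if label == "EVEN":
        weight = 0  # the notebook treats EVEN as R+0
    else:
        party, points = label.split("+")
        if party not in ("R", "D"):
            raise ValueError(f"unexpected PVI label: {pvi!r}")
        # Republican lean is positive, Democratic lean is negative
        weight = int(points) if party == "R" else -int(points)
    # same formula as the notebook: halve the weight, center on 50, scale to 0-1
    return (weight / 2 + 50) / 100


for label in ["R+15", "EVEN", "D+4"]:
    print(label, pvi_to_metric(label))
```

Under this formula an EVEN district sits at 0.5, Republican-leaning districts land above it, and Democratic-leaning districts land below it.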

In [None]:
state_117 = state_117[["ST","PVI","metric"]]
state_117.to_csv("/Users/xavier/Desktop/DSPP/solo_projects/redistricting_project/clean_data/state_pvi/state_117.csv",index=False)

## Clean and Organize the District Data

In [None]:
import numpy as np
import plotnine as p9
from plotnine import ggplot, aes, facet_grid, labs, geom_point, geom_smooth
from sklearn.linear_model import LinearRegression as lm
import warnings
warnings.filterwarnings('ignore')

In [None]:
#load in and check the data
pvi_117 = wiki_district.copy()
#add constants for later aggregation
pvi_117["year"] = 2020
pvi_117["congress"] = 117
print(pvi_117.head(n=1))
print(pvi_117.shape)

In [None]:
#correct the column name for later use
pvi_117 = pvi_117.rename(columns={"Party ofrepresentative": "Representative"})

In [None]:
#fix at large designation
pvi_117['District'] = pvi_117['District'].str.replace('at-large','AL')
pvi_117.head(n=8)

In [None]:
#separate state and district
pvi_117["num"] = pvi_117.District.str[-2:]
pvi_117["state"] = pvi_117.District.str[:-2]
pvi_117["state"] = pvi_117['state'].str.rstrip()
pvi_117["num"] = pvi_117['num'].str.lstrip()
pvi_117["ST"] =  pvi_117["state"]
pvi_117.head(n=1)
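
The split above relies on the district number occupying the last two characters of the label. A possible alternative, sketched here for a single label, splits on the final space instead, so it does not depend on a fixed width. The function name `split_district` is hypothetical, and the sample labels assume the Wikipedia format of a state name followed by a number or `at-large`.

```python
from typing import Tuple


def split_district(label: str) -> Tuple[str, str]:
    """Split a label like 'New York 12' or 'Alaska at-large' into (state, num)."""
    # mirror the notebook's at-large fix before splitting
    normalized = label.replace("at-large", "AL").strip()
    # split once from the right so multi-word state names stay intact
    state, num = normalized.rsplit(" ", 1)
    return state.strip(), num


for label in ["New York 12", "Alaska at-large", "West Virginia 3"]:
    print(split_district(label))
```

Splitting from the right keeps names such as "West Virginia" whole, which is the same outcome the fixed-width slice plus `rstrip` and `lstrip` is aiming for.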

In [None]:
#abbreviate the ST column
pvi_117 = pvi_117.replace({"ST": us_state_to_abbrev})
#check that both ST and num function appropriately
print(pvi_117["ST"].unique())
print(pvi_117["num"].unique())

In [None]:
#create the district code variable
pvi_117["ST#"] = pvi_117["ST"] + pvi_117["num"]
#create the "party of representative" variable
pvi_117["seat"] = pvi_117.Representative.str[0]
#make a dummy
pvi_117["is_GOP"] = pvi_117["seat"].replace("R",1)
pvi_117["is_GOP"] = pvi_117["is_GOP"].replace("D",0)
pvi_117["is_GOP"] = pvi_117["is_GOP"].astype(int)
pvi_117.head(n=7)

In [None]:
#possibly unnecessary
#pull out district lean
pvi_117["lean"] = pvi_117.PVI.str[0]
pvi_117["lean"].unique()

In [None]:
#split out pvi by party and weight
#first eliminate even values
pvi_117["PVI"] = pvi_117["PVI"].str.replace("EVEN","R+0")
pvi_117["pvi_party"] = pvi_117.PVI.str[0]
pvi_117['partisan_weight'] = pvi_117['PVI'].str.split('+').str[1]
#check results
print(pvi_117.head(n=1))
print(pvi_117["pvi_party"].unique())

In [None]:
#create a variable that is negative when the party is democratic
pvi_117["neg"] = pvi_117['pvi_party'].str.replace('D','-')
pvi_117["neg"] = pvi_117["neg"].str.replace('R','')
#ensure partisan weight has a negative value for democratic leaning and a positive value for Republican leaning
pvi_117['partisan_weight'] = pvi_117["neg"] + pvi_117['partisan_weight']
#ensure values are integers
pvi_117['partisan_weight'] = pvi_117['partisan_weight'].astype(int)

In [None]:
pvi_117["metric"] = ((pvi_117['partisan_weight'] / 2) + 50) / 100
pvi_117.head(n=7)

## Export clean versions of the data

In [None]:
#create a dataset solely to correlate pvi with the holder of the seat
pure_117 = pvi_117[["metric","is_GOP"]]
pure_117.to_csv("/Users/xavier/Desktop/DSPP/solo_projects/redistricting_project/clean_data/pure_datasets/pure_117.csv",index=False)
#create a more detailed dataset for greater uses
data_117 = pvi_117[["year","congress","ST","ST#","seat","is_GOP","PVI","metric"]]
data_117.to_csv("/Users/xavier/Desktop/DSPP/solo_projects/redistricting_project/clean_data/full_districts/data_117.csv",index=False)

In [None]:
data_117.head()

## Conduct preliminary examinations of the data

Because this data is of little use until it is aggregated with other cycles (2020 alone is not a good basis for prediction, since it cannot account for major waves), we will run only basic examinations in this notebook.

In [None]:
# Create a super simple scatterplot to examine the relationship between PVI and the party holding each congressional district
(p9.ggplot(data=pure_117, mapping=p9.aes(x='metric', y='is_GOP'))
 + p9.geom_point() 
 + labs(x='GOP Leaning of District', y='GOP Representation',color="",title="PVI Correlation"))

From the 117th Congress, we can see that PVI generally predicts the race outcome perfectly, except within the range of approximately D+5 to R+5.
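
One way to make this reading less dependent on eyeballing the plot is to look for the most extreme districts whose representative's party disagrees with the district's lean. The sketch below does that on signed PVI values (negative for a Democratic lean, as in `partisan_weight`). The helper name `contested_range` and the sample tuples are hypothetical and are not drawn from the 2020 data.

```python
def contested_range(districts):
    """Return (min, max) signed PVI among districts held by the party opposite their lean.

    `districts` is an iterable of (signed_pvi, is_gop) pairs.
    Returns None when every district is held by the party it leans toward.
    """
    mismatches = [
        weight
        for weight, is_gop in districts
        # R-leaning seat held by a Democrat, or D-leaning seat held by a Republican
        if (weight > 0 and not is_gop) or (weight < 0 and is_gop)
    ]
    if not mismatches:
        return None
    return min(mismatches), max(mismatches)


# invented sample: (signed PVI, 1 if the seat is held by a Republican else 0)
sample = [(15, 1), (4, 0), (-3, 1), (-20, 0), (2, 1), (0, 1)]
print(contested_range(sample))
```

EVEN districts (weight 0) have no lean to contradict, so they never count as mismatches here.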

In [None]:
#group by metric and average GOP seats
bm_117 = pure_117.groupby('metric').mean()
#this will not be exported as a dataset without first aggregating all years
bm_117 = bm_117.reset_index()

In [None]:
#Plot averages
(p9.ggplot(data=bm_117, mapping=p9.aes(x='metric', y='is_GOP'))
 + p9.geom_point() 
 + labs(x='GOP Leaning of District', y="Share of GOP Representatives",color="",title="PVI Correlation"))

Herein we see the error in a single-cycle dataset: it would imply that a certain Republican PVI still has a 100% chance of being represented by a Democrat. At a glance, this appears to be R+3, where all four districts with that value, in Iowa, New York, New Jersey, and Virginia, sent Democrats to Congress, and no R+3 district sent a Republican.
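
Part of the problem is that a per-metric average hides how few districts sit behind each point. A minimal sketch of reporting the district count alongside the GOP share follows. The function name `share_by_bucket` is hypothetical; the first bucket mirrors the R+3 case described above (four districts, all Democratic), and the second bucket is invented for contrast.

```python
from collections import defaultdict


def share_by_bucket(rows):
    """Map each metric value to (GOP share, number of districts)."""
    buckets = defaultdict(list)
    for metric, is_gop in rows:
        buckets[metric].append(is_gop)
    return {
        metric: (sum(flags) / len(flags), len(flags))
        for metric, flags in sorted(buckets.items())
    }


rows = [(0.515, 0)] * 4          # R+3: four districts, all held by Democrats
rows += [(0.55, 1), (0.55, 1), (0.55, 0)]  # invented R+10 bucket for contrast
for metric, (share, count) in share_by_bucket(rows).items():
    print(metric, share, count)
```

In pandas, the equivalent on the notebook's frame would be `pure_117.groupby('metric')['is_GOP'].agg(['mean', 'count'])`, which keeps the count next to the mean so small buckets are easy to spot.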

In [None]:
#now let's limit the data to only the R+10 to D+10 range, calling it Limited Domain
ld_117 = bm_117[bm_117["metric"] <= .55]
ld_117 = ld_117[ld_117["metric"] >= .45]

In [None]:
#Plot averages
(p9.ggplot(data=ld_117, mapping=p9.aes(x='metric', y='is_GOP'))
 + p9.geom_point() 
 + geom_smooth(method = "lm", color = "red", se = False)
 + labs(x='GOP Leaning of District', y="Share of GOP Representatives",color="",title="PVI Correlation"))