# 116th Congress Data

This notebook is meant to call, clean, and examine data from the 2018 election, to produce a viable metric correlating PVI and electoral success.

It feeds into a larger project about fairness in redistricting. For instance, if an R+15 district is effectively unwinnable for a Democrat, then the district can be classified as "safe" and bucketed with R+30 districts.

The nuance enters at the margins. How safe is an R+6 district, for instance? Can we quantify a district moving from R+3 to R+6 as a significant bias against Democrats? By cleaning this dataset, and others like it, we can determine the predictive power of PVI in each district, and use those probabilities later to assess redistricting fairness.
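
As a rough illustration of the bucketing idea, the sketch below labels districts by a signed PVI (negative for Democratic lean, positive for Republican lean). The `SAFE_THRESHOLD` value, the `bucket` helper, and the toy rows are hypothetical; the actual cutoff is what the larger project aims to estimate.

```python
import pandas as pd

# Hypothetical cutoff: treat any district at or beyond +/-15 as "safe".
SAFE_THRESHOLD = 15


def bucket(signed_pvi: int, threshold: int = SAFE_THRESHOLD) -> str:
    """Label a signed PVI (negative = Democratic lean, positive = Republican lean)."""
    if signed_pvi >= threshold:
        return "safe R"
    if signed_pvi <= -threshold:
        return "safe D"
    return "competitive"


# Toy rows only, not real districts.
toy = pd.DataFrame({
    "PVI": ["R+30", "R+15", "R+6", "R+3", "D+20"],
    "signed_pvi": [30, 15, 6, 3, -20],
})
toy["bucket"] = toy["signed_pvi"].apply(bucket)
print(toy)
```

Under this assumed threshold, R+15 and R+30 land in the same bucket, while R+3 and R+6 both remain "competitive", which is exactly the range where a finer-grained probability is needed.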

## Retrieve data from Wikipedia

Wikipedia's Cook PVI page offers a free, current version of the index (up to 2020 election results), whereas the free downloads from Cook's website are dated, containing 2018 results and 2016 PVI metrics. The code below reads an archived snapshot of that page from 14 July 2020.

In [1]:
import requests
import pandas as pd
pd.set_option("display.max_rows", None, "display.max_columns", None)

In [2]:
#import and format url
url = "https://web.archive.org/web/20200714173127/https://en.wikipedia.org/wiki/Cook_Partisan_Voting_Index"
page = pd.read_html(url)
dat = pd.concat(page,ignore_index=True)

In [3]:
#split data into state and district pvis
wiki_district = dat[4:439]
wiki_state = dat[439:]

In [4]:
#cut NA columns
wiki_district = wiki_district.dropna(axis=1, how='any')
wiki_state = wiki_state.dropna(axis=1, how='any')

## Clean and Organize State Data

In [5]:
#call in a dictionary of state abbreviations
us_state_to_abbrev = {
    "Alabama": "AL",
    "Alaska": "AK",
    "Arizona": "AZ",
    "Arkansas": "AR",
    "California": "CA",
    "Colorado": "CO",
    "Connecticut": "CT",
    "Delaware": "DE",
    "Florida": "FL",
    "Georgia": "GA",
    "Hawaii": "HI",
    "Idaho": "ID",
    "Illinois": "IL",
    "Indiana": "IN",
    "Iowa": "IA",
    "Kansas": "KS",
    "Kentucky": "KY",
    "Louisiana": "LA",
    "Maine": "ME",
    "Maryland": "MD",
    "Massachusetts": "MA",
    "Michigan": "MI",
    "Minnesota": "MN",
    "Mississippi": "MS",
    "Missouri": "MO",
    "Montana": "MT",
    "Nebraska": "NE",
    "Nevada": "NV",
    "New Hampshire": "NH",
    "New Jersey": "NJ",
    "New Mexico": "NM",
    "New York": "NY",
    "North Carolina": "NC",
    "North Dakota": "ND",
    "Ohio": "OH",
    "Oklahoma": "OK",
    "Oregon": "OR",
    "Pennsylvania": "PA",
    "Rhode Island": "RI",
    "South Carolina": "SC",
    "South Dakota": "SD",
    "Tennessee": "TN",
    "Texas": "TX",
    "Utah": "UT",
    "Vermont": "VT",
    "Virginia": "VA",
    "Washington": "WA",
    "West Virginia": "WV",
    "Wisconsin": "WI",
    "Wyoming": "WY",
    "District of Columbia": "DC",
    "American Samoa": "AS",
    "Guam": "GU",
    "Northern Mariana Islands": "MP",
    "Puerto Rico": "PR",
    "United States Minor Outlying Islands": "UM",
    "U.S. Virgin Islands": "VI",
}

In [6]:
#copy so that adding columns does not modify wiki_state
state_116 = wiki_state.copy()
state_116["ST"] = state_116["State"]
#abbreviate the ST column
state_116 = state_116.replace({"ST": us_state_to_abbrev})
#check that ST was abbreviated appropriately
print(state_116["ST"].unique())

['AL' 'AK' 'AZ' 'AR' 'CA' 'CO' 'CT' 'DE' 'FL' 'GA' 'HI' 'ID' 'IL' 'IN'
 'IA' 'KS' 'KY' 'LA' 'ME' 'MD' 'MA' 'MI' 'MN' 'MS' 'MO' 'MT' 'NE' 'NV'
 'NH' 'NJ' 'NM' 'NY' 'NC' 'ND' 'OH' 'OK' 'OR' 'PA' 'RI' 'SC' 'SD' 'TN'
 'TX' 'UT' 'VT' 'VA' 'WA' 'WV' 'WI' 'WY']


In [7]:
#split out pvi by party and weight
#first eliminate even values
state_116["PVI"] = state_116["PVI"].str.replace("EVEN","R+0")
state_116["pvi_party"] = state_116.PVI.str[0]
state_116['partisan_weight'] = state_116['PVI'].str.split('+').str[1]
#check results
print(state_116.head(n=1))
print(state_116["pvi_party"].unique())

    Housebalance   PVI Party ofgovernor Partyin Senate    State  ST pvi_party  \
439       6R, 1D  R+14       Republican           Both  Alabama  AL         R   

    partisan_weight  
439              14  
['R' 'D']


In [8]:
#add constants for later aggregation
state_116["year"] = 2018
state_116["congress"] = 116

In [9]:
#create a variable that is negative when the party is democratic
state_116["neg"] = state_116['pvi_party'].str.replace('D','-')
state_116["neg"] = state_116["neg"].str.replace('R','')
#ensure partisan weight has a negative value for democratic leaning and a positive value for Republican leaning
state_116['partisan_weight'] = state_116["neg"] + state_116['partisan_weight']
#ensure values are integers
state_116['partisan_weight'] = state_116['partisan_weight'].astype(int)

In [10]:
state_116["metric"] = ((state_116['partisan_weight']) + 50) / 100
state_116.head(n=7)

Unnamed: 0,Housebalance,PVI,Party ofgovernor,Partyin Senate,State,ST,pvi_party,partisan_weight,year,congress,neg,metric
439,"6R, 1D",R+14,Republican,Both,Alabama,AL,R,14,2018,116,,0.64
440,1R,R+9,Republican,Republican,Alaska,AK,R,9,2018,116,,0.59
441,"5D, 4R",R+5,Republican,Both,Arizona,AZ,R,5,2018,116,,0.55
442,4R,R+15,Republican,Republican,Arkansas,AR,R,15,2018,116,,0.65
443,"45D, 8R",D+12,Democratic,Democratic,California,CA,D,-12,2018,116,-,0.38
444,"4D, 3R",D+1,Democratic,Both,Colorado,CO,D,-1,2018,116,-,0.49
445,5D,D+6,Democratic,Democratic,Connecticut,CT,D,-6,2018,116,-,0.44
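
The conversion carried out across cells 7 to 10 can be condensed into one function. This is a minimal sketch rather than the notebook's own code: `pvi_to_metric` is a hypothetical helper that applies the same rule, with EVEN treated as R+0, Democratic leans negated, and the result shifted by 50 and divided by 100.

```python
def pvi_to_metric(pvi: str) -> float:
    """Map a PVI label such as 'R+14', 'D+6', or 'EVEN' to the 0-1 metric."""
    label = pvi.strip().upper()
    if label == "EVEN":
        return 0.5
    party, weight = label.split("+")
    if party not in ("R", "D"):
        raise ValueError(f"unexpected PVI label: {pvi!r}")
    # Republican leans are positive, Democratic leans are negative.
    signed = int(weight) if party == "R" else -int(weight)
    return (signed + 50) / 100


for label in ["R+14", "D+12", "EVEN"]:
    print(label, pvi_to_metric(label))
```

On this scale 0.5 is an even district, values above 0.5 lean Republican, and values below 0.5 lean Democratic.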


In [11]:
state_116 = state_116[["year","ST","PVI","metric"]]
state_116.to_csv("/Users/xavier/Desktop/DSPP/solo_projects/redistricting_project/clean_data/state_pvi/state_116.csv",index=False)

In [12]:
BLOCK

NameError: name 'BLOCK' is not defined

Below this point the code is not up to date

## Clean and Organize the District Data

In [None]:
import numpy as np
import plotnine as p9
from plotnine import ggplot, aes, facet_grid, labs, geom_point, geom_smooth
from sklearn.linear_model import LinearRegression as lm
import warnings
warnings.filterwarnings('ignore')

In [None]:
#load in and check the data
pvi_115 = wiki_district
print(pvi_115.head(n=1))
print(pvi_115.shape)

In [None]:
#correct the column name for later use
pvi_115 = pvi_115.rename(columns={"Party of Representative": "Representative"})

In [None]:
#fix at large designation
pvi_115['District'] = pvi_115['District'].str.replace('at-large','AL')
pvi_115.head(n=8)

In [None]:
#separate state and district
pvi_115["num"] = pvi_115.District.str[-2:]
pvi_115["state"] = pvi_115.District.str[:-2]
pvi_115["state"] = pvi_115['state'].str.rstrip()
pvi_115["num"] = pvi_115['num'].str.lstrip()
pvi_115["ST"] =  pvi_115["state"]
pvi_115.head(n=1)
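
The split above depends on the district number occupying the last two characters of the string. A regular expression is one alternative that does not depend on character positions. The sketch below assumes the `District` values look like "Alabama 1" or "Alaska AL" (after the at-large replacement); the toy rows are illustrative, not taken from the scraped table.

```python
import pandas as pd

# Toy rows in the assumed "<State name> <number or AL>" format.
toy = pd.DataFrame({"District": ["Alabama 1", "New York 12", "Alaska AL"]})

# Named groups become the column names of the extracted frame.
parts = toy["District"].str.extract(r"^(?P<state>.+?)\s+(?P<num>\d+|AL)$")
toy = toy.join(parts)
print(toy)
```

Rows that do not match the pattern come back as NaN in both columns, which makes unexpected formats easy to spot.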

In [None]:
#abbreviate the ST column
pvi_115 = pvi_115.replace({"ST": us_state_to_abbrev})
#check that both ST and num function appropriately
print(pvi_115["ST"].unique())
print(pvi_115["num"].unique())

In [None]:
#create the district code variable
pvi_115["ST#"] = pvi_115["ST"] + pvi_115["num"]
#create the "party of representative" variable
pvi_115["seat"] = pvi_115.Representative.str[0]
#make a dummy
pvi_115["is_GOP"] = pvi_115["seat"].replace("R",1)
pvi_115["is_GOP"] = pvi_115["is_GOP"].replace("D",0)
#The seats of Charlie Dent and Pat Meehan in Pennsylvania went R -> D during this time frame
#color them as Republicans because they were originally elected as R seats
pvi_115["is_GOP"] = pvi_115["is_GOP"].replace("V",1)
#convert to integer for later numeric analysis
pvi_115["is_GOP"] = pvi_115["is_GOP"].astype(int)
pvi_115.head(n=7)
print(pvi_115["is_GOP"].unique())

In [None]:
#possibly unnecessary
#pull out district lean
pvi_115["lean"] = pvi_115.PVI.str[0]
pvi_115["lean"].unique()

In [None]:
#split out pvi by party and weight
#first eliminate even values
pvi_115["PVI"] = pvi_115["PVI"].str.replace("EVEN","R+0")
pvi_115["pvi_party"] = pvi_115.PVI.str[0]
pvi_115['partisan_weight'] = pvi_115['PVI'].str.split('+').str[1]
#check results
print(pvi_115.head(n=1))
print(pvi_115["pvi_party"].unique())

In [None]:
#create a variable that is negative when the party is democratic
pvi_115["neg"] = pvi_115['pvi_party'].str.replace('D','-')
pvi_115["neg"] = pvi_115["neg"].str.replace('R','')
#ensure partisan weight has a negative value for democratic leaning and a positive value for Republican leaning
pvi_115['partisan_weight'] = pvi_115["neg"] + pvi_115['partisan_weight']
#ensure values are integers
pvi_115['partisan_weight'] = pvi_115['partisan_weight'].astype(int)

In [None]:
pvi_115["metric"] = ((pvi_115['partisan_weight']) + 50) / 100
pvi_115.head(n=7)

In [None]:
#add constants for later aggregation
pvi_115["year"] = 2016
pvi_115["congress"] = 115

## Export clean versions of the data

In [None]:
#create a dataset solely to correlate pvi with the holder of the seat
pure_115 = pvi_115[["year","metric","is_GOP"]]
pure_115.to_csv("/Users/xavier/Desktop/DSPP/solo_projects/redistricting_project/clean_data/pure_datasets/pure_115.csv",index=False)
#create a more detailed dataset for greater uses
data_115 = pvi_115[["year","congress","ST","ST#","seat","is_GOP","PVI","metric"]]
data_115.to_csv("/Users/xavier/Desktop/DSPP/solo_projects/redistricting_project/clean_data/full_districts/data_115.csv",index=False)

In [None]:
data_115.head(8)

## Conduct preliminary examinations of the data

Because a single cycle's data is of little predictive use on its own (2020 alone is not a good basis for prediction, since it cannot account for major wave elections), we run only basic examinations in this notebook.

In [None]:
# Create a simple scatterplot to examine the relationship between PVI and the party holding each congressional district
(p9.ggplot(data=pure_115, mapping=p9.aes(x='metric', y='is_GOP'))
 + p9.geom_point() 
 + labs(x='GOP Leaning of District', y='GOP Representation',color="",title="PVI Correlation"))

From the 115th Congress, we can see that PVI is generally a perfect predictor of race outcome, except within the range of approximately D+10 to R+10, with a few outliers.
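
One way to put a number on that observation is to compare how often the PVI lean matches the seat holder inside and outside the D+10 to R+10 range. The sketch below uses made-up rows with the same `metric` and `is_GOP` columns as `pure_115`; the values are illustrative and are not results from the real data.

```python
import pandas as pd

# Toy data only: metric follows the notebook's scale (0.5 = EVEN, above 0.5 = GOP lean).
toy = pd.DataFrame({
    "metric": [0.30, 0.38, 0.46, 0.48, 0.52, 0.55, 0.62, 0.70],
    "is_GOP": [0, 0, 1, 0, 0, 1, 1, 1],
})

# PVI "predicts" a GOP seat when the district leans Republican.
toy["predicted_GOP"] = (toy["metric"] > 0.5).astype(int)
toy["correct"] = toy["predicted_GOP"] == toy["is_GOP"]

# D+10 to R+10 corresponds to 0.40 <= metric <= 0.60.
toy["competitive"] = toy["metric"].between(0.40, 0.60)

# Share of districts where the lean matches the seat holder, by group.
accuracy = toy.groupby("competitive")["correct"].mean()
print(accuracy)
```

Running the same grouping on `pure_115` would show whether mismatches are in fact concentrated in the competitive range.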

In [None]:
#group by metric and average GOP seats
bm_115 = pure_115.groupby('metric').mean()
#this will not be produced into a dataset without aggregating all years, as 
bm_115 = bm_115.reset_index()

In [None]:
#Plot averages
(p9.ggplot(data=bm_115, mapping=p9.aes(x='metric', y='is_GOP'))
 + p9.geom_point() 
 + labs(x='GOP Leaning of District', y="Share of GOP Representatives",color="",title="PVI Correlation"))

These averages look more evenly distributed: some Democratic-leaning seats are held by Republicans and vice versa, but without substantial bias. The regression line below passes close to (0.5, 0.5), which indicates a fairly predictable result.

In [None]:
#now let's limit the data to only the D+10 to R+10 range (metric 0.40 to 0.60), calling it Limited Domain
ld_115 = bm_115[bm_115["metric"] <= .60]
ld_115 = ld_115[ld_115["metric"] >= .40]

In [None]:
#Plot averages
(p9.ggplot(data=ld_115, mapping=p9.aes(x='metric', y='is_GOP'))
 + p9.geom_point() 
 + geom_smooth(method = "lm", color = "red", se = False)
 + labs(x='GOP Leaning of District', y="Share of GOP Representatives",color="",title="PVI Correlation"))
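
The plot shows the fitted line visually. To check numerically how close it passes to (0.5, 0.5), a straight line can be fit directly. The sketch below uses `numpy.polyfit` on hypothetical averages instead of `ld_115`, so the numbers illustrate the method only.

```python
import numpy as np

# Toy averages only (hypothetical values), on the notebook's 0-1 metric scale.
metric = np.array([0.40, 0.45, 0.50, 0.55, 0.60])
gop_share = np.array([0.10, 0.30, 0.50, 0.70, 0.90])

# Least-squares straight line: gop_share ~ slope * metric + intercept.
slope, intercept = np.polyfit(metric, gop_share, deg=1)

# Predicted GOP share for an EVEN district (metric = 0.5).
at_even = slope * 0.5 + intercept
print(f"slope={slope:.2f}, intercept={intercept:.2f}, share at EVEN={at_even:.2f}")
```

A predicted share near 0.5 at `metric = 0.5` would support the reading that neither party holds a systematic advantage in even districts.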