# Data 512 Final Project: Measuring political polarization in United States Congressional roll-call votes, 2000-2020

# Motivation & Problem Statement

## Goal
This project will analyze the polarization of Congress according to roll-call vote data from 2000-2020. The two goals will be:
   
   1. Identification of trends over time in polarization.
   2. Identification of correlation between topics and polarization.
   
Identifying trends in polarization over time and polarization of topics may provide insight into the causal factors of polarization and how to best treat them.

## Why is it important?
Political polarization has been a topic covered by multiple media sources in the past four years. [Articles](https://greatergood.berkeley.edu/article/item/what_is_the_true_cost_of_polarization_in_america) have suggested that increased polarization could reach deeply into people's everyday lives through their personal interactions, and even impact their physical health. Beyond that, political polarization in Congress could have legislative impact, such as the [ping-pong effect on energy regulation](https://www.spglobal.com/platts/en/market-insights/latest-news/coal/102320-us-elections-political-polarization-creating-regulatory-ping-pong-effect-for-us-energy). Both the legislative and the individual impact highlight the importance of political polarization at the human level.

## Why don't we know about this, and what is the contribution of this analysis?
A few Google searches did not turn up substantial walkthroughs or approachable analyses of roll-call vote data as a measure of political polarization. Many articles represent congresspeople through ideological scores (e.g., [NYT](https://www.nytimes.com/2014/06/12/upshot/polarization-is-dividing-american-society-not-just-politics.html), [Pew](https://www.pewresearch.org/fact-tank/2014/06/12/polarized-politics-in-congress-began-in-the-1970s-and-has-been-getting-worse-ever-since/)) or [polling data](https://fivethirtyeight.com/features/how-hatred-negative-partisanship-came-to-dominate-american-politics/). Roll-call votes would be an alternative measurement of polarization, with the following potential benefits:

   1. The data is easily accessible. Others can re-create the polarization analysis more readily than with the survey or ideological score data referenced above.
   2. The roll-call vote is a direct outcome of legislative intent compared to ideological score or survey data. If polarization exists outside of legislative votes, but not within legislative votes, then it may not have real impact on legislative outcomes.

## Human-centered aspect of the project
The problem itself, as mentioned in "Why is it important?", concerns a topic that can impact humans both directly (e.g., through physical health) and indirectly through legislative outcomes.

In addition to this, the visualization and data representation itself will require human-centered design.

# Data
We will use the open-sourced [congress GitHub](https://github.com/unitedstates/congress) set of scrapers to download Congressional roll-call vote data from 2000-2020. More specifically, we'll utilize the [votes data](https://github.com/unitedstates/congress/wiki/votes) which contains data on Congress roll-call votes.

From the project:

> This project collects data on roll call votes, which are the sorts of votes in which the individual positions of legislators is recorded. Other sorts of votes such as unanimous consent requests and voice votes are not collected here.
> 
> Congress publishes roll call vote data in XML starting in 1990 (101st Congress, 2nd session) for the House and 1989 (101st Congress, 1st Session) for the Senate. Senate votes are numbered uniquely by session. Sessions roughly follow calendar years, and there are two sessions per Congress. House vote numbering continues consecutively throughout the Congress.

NOTE: despite what the source says, we ran into errors retrieving data from before 2000, so we truncated the dataset for this analysis.

The dataset schema is similar to what follows (it may be updated depending on whether or not I pull in additional data from the scraper):

| Name | Description | Example Value |
| ----- | ----------------------| ------ |
| chamber | Either "h" for House or "s" for Senate | "h" |
| congress | Number of the Congress which carried out this vote | 113 |
| date | The date the roll-call vote happened | "2013-07-18T22:40:39-04:00" |
| number | The number of the vote | 202 |
| session | The year that Congress carried out this vote | 2013 |
| source_url | The source url for the data | "http://clerk.house.gov/evs/2013/roll202.xml" |
| updated_at | The date that the data was updated at | "2013-07-18T22:40:39-04:00" |
| vote_id | The vote id of the roll-call vote | "h202-113.2013" |
| category | The type of roll-call vote | "amendment" |
| question | The question that the roll-call vote is on | "On Agreeing to the Amendment: Amendment 24 to H R 2217" |
| type | The type of vote this is | "On the Amendment" |
| requires | The fraction of the vote required to pass | "1/2" |
| result | The result of the roll-call vote | "failed" |
| result_text | The result of the vote (this is just a duplicate field according to the documentation) | "Failed" |
| display_name | Congressperson's display name | "John Jay" |
| party | "D" for Democrat, "R" for Republican | "R" |
| state | Two-letter state abbreviation of the state that the congressperson represents | "NC" |

## Data preparation
I have used [code/scripts/get_votes.ps1](code/scripts/get_votes.ps1) to download the data for multiple Congresses.

The dataset is rather large and takes an additional Python script to get the data into a dataframe format. It may be reasonable to precompute aggregations (e.g., by year or month) for some portions of the data to be able to visualize and play with it faster.
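
As a rough illustration of that flattening step, the sketch below turns one vote record into one row per legislator, which could then be passed to `pandas.DataFrame(rows)`. It is not the project's actual script. The field names come from the schema table above, but the nested `votes` mapping, the `data.json` file name, and the helper names `flatten_vote` and `load_votes` are assumptions made for illustration.

```python
import json
from pathlib import Path

# Vote-level fields copied onto every row (names taken from the schema table).
VOTE_FIELDS = ("vote_id", "chamber", "congress", "session", "date",
               "category", "question", "result")


def flatten_vote(record):
    """Turn one roll-call vote record into one row per legislator."""
    shared = {key: record.get(key) for key in VOTE_FIELDS}
    rows = []
    # ASSUMPTION: record["votes"] maps each option ("Yea", "Nay", ...)
    # to a list of legislator dicts.
    for option, voters in record.get("votes", {}).items():
        for voter in voters:
            if not isinstance(voter, dict):
                continue  # skip any entry that is not a legislator dict
            rows.append({**shared,
                         "vote": option,
                         "display_name": voter.get("display_name"),
                         "party": voter.get("party"),
                         "state": voter.get("state")})
    return rows


def load_votes(data_dir):
    """Flatten every data.json found under data_dir (hypothetical layout)."""
    rows = []
    for path in sorted(Path(data_dir).rglob("data.json")):
        with open(path, encoding="utf-8") as handle:
            rows.extend(flatten_vote(json.load(handle)))
    return rows
```

Keeping the output as a plain list of dicts means the same rows can feed a dataframe or a precomputed aggregation without re-reading the raw files.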

Furthermore, some of the downloads appear to have run into errors a few times, so I may need to shorten the period of analysis to simplify dealing with substantial missing values.



# Unknowns
<ol>
    <li>Some Congresses have incomplete voting data through the scrapers. Should we throw out these years or use the partial data? If we use it, how?</li>
    <li>How should we consider party identification of independents or parties outside the two-party system?</li>
</ol>

# Research Questions
<ol>
    <li>Are there trends in the mean difference in roll-call votes by party identification from 2000 to 2020?</li>
    <li>If topic areas of bills can be identified automatically, are there particular topic areas or types of bills that have greater mean differences in votes by party identification than others?</li>
</ol>
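
The "difference in votes by party identification" is not yet pinned down in this proposal. One possible definition, sketched below, is the absolute gap between the share of Democrats and the share of Republicans voting yes on a single roll call. The yes/no label sets, the exclusion of other parties and non-votes, and the function name `party_difference` are all assumptions for illustration, not a settled choice.

```python
from collections import defaultdict

YES = {"Yea", "Aye"}  # ASSUMPTION: labels that count as a "yes" vote
NO = {"Nay", "No"}    # ASSUMPTION: labels that count as a "no" vote


def party_difference(rows, parties=("D", "R")):
    """Absolute gap between two parties' share of yes votes on one roll call.

    rows: iterable of dicts with "party" and "vote" keys, all for one vote_id.
    Returns a float in [0, 1], or None if either party cast no yes/no votes.
    """
    yes = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        if row["vote"] in YES or row["vote"] in NO:
            total[row["party"]] += 1
            if row["vote"] in YES:
                yes[row["party"]] += 1
    first, second = parties
    if total[first] == 0 or total[second] == 0:
        return None
    return abs(yes[first] / total[first] - yes[second] / total[second])
```

Under this definition, 0 means both parties split identically and 1 means a perfect party-line vote. Members outside the two listed parties are ignored, which is only one of several ways to handle the independents question raised under Unknowns.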

## Hypotheses
   NOTE: both of these are included at the suggestion of the rubric and in order to provide a more specific context to focus on -- but the main questions will be exploratory.
  1. <b>The mean difference in votes by party identification has remained relatively stable over time.</b> There are many roll-call votes which are related to on-going legislation.
  2. <b>The 90th percentile of difference in votes by party identification increases over time.</b> In other words, the extremes of voting are more extreme over the years but the center of the distribution (as mentioned in point (1)) does not change substantially.
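
One way these two hypotheses could be checked is to summarize the per-vote party difference by year and then fit a straight line through each yearly series. The sketch below assumes the input is a list of `(year, difference)` pairs with one entry per roll call. The nearest-rank percentile and the helper names `percentile`, `yearly_summary`, and `slope` are illustrative choices, not part of the project code.

```python
import math
from collections import defaultdict
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def yearly_summary(votes):
    """votes: iterable of (year, party_difference) pairs, one per roll call."""
    by_year = defaultdict(list)
    for year, diff in votes:
        by_year[year].append(diff)
    return {year: {"mean": mean(diffs), "p90": percentile(diffs, 90)}
            for year, diffs in sorted(by_year.items())}


def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs (needs 2+ distinct xs)."""
    x_bar, y_bar = mean(xs), mean(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator
```

A slope near zero for the yearly means alongside a positive slope for the yearly 90th percentiles would be consistent with both hypotheses, though a plain slope says nothing about statistical significance.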
  

# Background or related work
> NOTE: TO-DO: fix the footnotes here; for some reason they're broken.

Background work suggests the following about the design and methodology of this project:
  1. The topic of a bill has heterogeneous effects on the partisanship of the vote on that bill [^1].
  2. Since each Congress votes on different topics [^1], we may have to control for the topics voted on in a particular Congress to understand the trend in partisanship across years of Congress. There is also prior work on identifying topics in speech which could be particularly divisive [^2][^3].
  3. Partisanship has increased in the last 3 decades[^4].
  
# Limitations
We likely will not be able to address the following topics due to time constraints:

   1. Normalizing the comparison of roll-call vote partisanship over time by the topic areas of bills across years.
   2. Pulling data for Congresses before 2000

# Methodology
The following methodologies will be utilized:

| Research Question # | Methodology | Summary of Application | Expected outcome of method |
| -- | ------------------ | ------------------ | ------------------------- |
| 1 | Plotting and visualization using Seaborn & Matplotlib | We will use plots and graphs to provide a view of the time series of mean difference in roll-call vote. | A visual with a clear depiction of trend over time of mean difference in roll-call vote |
| 1 | ARIMA / regression | We will utilize a simple ARIMA or regression analysis to identify a trendline which estimates the mean change per year or month in mean difference in roll-call vote | An estimate of whether polarization is increasing or decreasing over time |
| 2 | Count of Top-K key phrases using nltk by question | We'll extract the top-k key phrases of the "question" field per vote, and then measure the mean difference in roll-call vote by topic area. | An approximate measure of the polarization by topic area |
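
The key-phrase step in the last row could look roughly like the sketch below. To stay self-contained it uses plain regular-expression tokenization and `collections.Counter` in place of nltk, so it is a stand-in for the planned approach rather than the approach itself. The stop-word list and the function names `key_phrases` and `top_k_phrases` are assumptions.

```python
import re
from collections import Counter

# ASSUMPTION: a tiny hand-picked stop list; a real run would need a fuller one.
STOP_WORDS = {"on", "the", "to", "of", "a", "an", "and", "for", "in",
              "h", "r", "s"}


def key_phrases(question, n=2):
    """Return n-word phrases from a question after dropping stop words.

    Only alphabetic tokens are kept, so bill and amendment numbers drop out.
    Stop words are removed before phrases are formed, so a phrase may join
    words that were not adjacent in the original text.
    """
    words = [word for word in re.findall(r"[a-z]+", question.lower())
             if word not in STOP_WORDS]
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def top_k_phrases(questions, k=10, n=2):
    """Count each phrase once per question and return the k most common."""
    counts = Counter()
    for question in questions:
        counts.update(set(key_phrases(question, n)))
    return counts.most_common(k)
```

Each resulting phrase could then serve as a rough topic label, with the mean party difference computed over the votes whose question contains it.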
 



# Appendix

## Some notes on References
[This paper discusses methods to identify topics in speech](https://www.brown.edu/Research/Shapiro/pdfs/politext.pdf) and polarization of particular topics of speech. Brookings also has a paper which [discusses the polarization of speech](https://www.brookings.edu/wp-content/uploads/2012/09/2012b_Jensen.pdf) specifically. 
<br/><br/>
[This article finds rising polarization](https://scholar.harvard.edu/files/rogowski/files/npat-paper-july2019.pdf) and discusses a common methodology, DW-Nominate scores. It also mentions the need to isolate polarization from other factors (e.g., the number of votes cast by a particular Congressman, the topics they've voted on, etc.). So the literature points towards potential variance in topics by Congressional year.
<br/><br/>
This [Pew ideological scoring of Congressmen shows a clear shift between 1970 and 2012](https://www.pewresearch.org/fact-tank/2014/06/12/polarized-politics-in-congress-began-in-the-1970s-and-has-been-getting-worse-ever-since/), suggesting that polarization could be identifiable in the general political beliefs of Congressmen.


# References
[^1]: [Parsing Party Polarization in Congress](https://scholar.harvard.edu/files/rogowski/files/npat-paper-july2019.pdf)

[^2]: [Measuring Group Differences in High-Dimensional Choices: Method and Application to Congressional Speech](https://www.brown.edu/Research/Shapiro/pdfs/politext.pdf)

[^3]: [Political Polarization and the Dynamics of Political Language: Evidence from 130 Years of Partisan Speech](https://www.brookings.edu/wp-content/uploads/2012/09/2012b_Jensen.pdf)

[^4]: [Pew: The polarized Congress of today has its roots in the 1970s](https://www.pewresearch.org/fact-tank/2014/06/12/polarized-politics-in-congress-began-in-the-1970s-and-has-been-getting-worse-ever-since/)
