In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from ipython_memwatcher import MemWatcher
mw = MemWatcher()
mw.start_watching_memory()

In [2] used 0.000 MiB RAM in 0.002s, peaked 0.000 MiB above current, total RAM usage 42.496 MiB


# Ideal Point Estimation

Here we perform ideal point estimation for legislators in the 113th Congress.

## Load Data
### Legislators
First we load every legislator in the GovTrack data. This covers all of time, not just the 113th Congress.

In [3]:
import ideal_point.raw_data

In [3] used 32.645 MiB RAM in 1.254s, peaked 0.000 MiB above current, total RAM usage 75.141 MiB


In [4]:
legislator_df = ideal_point.raw_data.legislators()
legislator_df.head()

Unnamed: 0,last_name,first_name,birthday,gender,type,state,district,party,url,address,...,thomas_id,opensecrets_id,lis_id,cspan_id,govtrack_id,votesmart_id,ballotpedia_id,washington_post_id,icpsr_id,wikipedia_id
0,Brown,Sherrod,1952-11-09,M,sen,OH,,Democrat,https://www.brown.senate.gov,713 Hart Senate Office Building Washington DC ...,...,136.0,N00003535,S307,5051.0,400050,27018.0,Sherrod Brown,,29389.0,Sherrod Brown
1,Cantwell,Maria,1958-10-13,F,sen,WA,,Democrat,https://www.cantwell.senate.gov,511 Hart Senate Office Building Washington DC ...,...,172.0,N00007836,S275,26137.0,300018,27122.0,Maria Cantwell,,39310.0,Maria Cantwell
2,Cardin,Benjamin,1943-10-05,M,sen,MD,,Democrat,https://www.cardin.senate.gov,509 Hart Senate Office Building Washington DC ...,...,174.0,N00001955,S308,4004.0,400064,26888.0,Ben Cardin,,15408.0,Ben Cardin
3,Carper,Thomas,1947-01-23,M,sen,DE,,Democrat,http://www.carper.senate.gov,513 Hart Senate Office Building Washington DC ...,...,179.0,N00012508,S277,663.0,300019,22421.0,Tom Carper,,15015.0,Tom Carper
4,Casey,Robert,1960-04-13,M,sen,PA,,Democrat,https://www.casey.senate.gov,393 Russell Senate Office Building Washington ...,...,1828.0,N00027503,S309,47036.0,412246,2541.0,"Bob Casey, Jr.",,40703.0,Bob Casey Jr.


In [4] used 10.219 MiB RAM in 0.245s, peaked 0.000 MiB above current, total RAM usage 85.359 MiB


### Votes

Next we can load in all the votes. We get two dataframes from this, `vote_df` and `position_df`.

Each row of `vote_df` corresponds to one roll call vote (like on the passage of a bill).

Each row of `position_df` corresponds to one legislator's position on a vote.
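To make the relationship between the two dataframes concrete, here is a minimal sketch on toy data. It assumes that `vote_index` and `legislator_index` refer to row labels in `vote_df` and `legislator_df`; the toy frames, the placeholder names, and the `joined` variable are hypothetical and are not produced by `ideal_point.raw_data`.

```python
import pandas as pd

# Toy stand-ins for vote_df and legislator_df (placeholder values only)
toy_vote_df = pd.DataFrame({
    "vote_id": ["h1-113.2013", "h10-113.2013"],
    "category": ["quorum", "procedural"],
})
toy_legislator_df = pd.DataFrame({"last_name": ["Alpha", "Bravo", "Charlie"]})

# One row per (legislator, vote) pair, as in position_df
toy_position_df = pd.DataFrame({
    "legislator_index": [0, 2, 1],
    "position": ["Present", "Not Voting", "Yea"],
    "vote_index": [0, 0, 1],
})

# Assumption: the *_index columns are row labels of the other two frames
joined = (
    toy_position_df
    .merge(toy_vote_df, left_on="vote_index", right_index=True)
    .merge(toy_legislator_df, left_on="legislator_index", right_index=True)
)
print(joined[["last_name", "position", "vote_id"]])
```

Each joined row then reads as "this legislator took this position on this roll call".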

In [5]:
vote_df, position_df = ideal_point.raw_data.votes(legislator_df)


In [5] used 196.305 MiB RAM in 315.142s, peaked 22.344 MiB above current, total RAM usage 281.664 MiB


In [6]:
vote_df.head()

Unnamed: 0,amendment_author,amendment_number,amendment_purpose,amendment_type,bill_congress,bill_number,bill_title,bill_type,category,chamber,...,record_modified,requires,result,result_text,session,source_url,subject,type,updated_at,vote_id
0,,,,,,,,,quorum,h,...,,QUORUM,Passed,Passed,2013,http://clerk.house.gov/evs/2013/roll001.xml,,Call of the House,2014-06-18T11:22:24-04:00,h1-113.2013
1,,,,,,,,,procedural,h,...,,1/2,Failed,Failed,2013,http://clerk.house.gov/evs/2013/roll010.xml,,On the Motion to Adjourn,2014-06-18T11:22:23-04:00,h10-113.2013
2,,,,,113.0,1120.0,,hr,recommit,h,...,,1/2,Failed,Failed,2013,http://clerk.house.gov/evs/2013/roll100.xml,Preventing Greater Uncertainty in Labor-Manage...,On the Motion to Recommit,2014-06-18T11:22:04-04:00,h100-113.2013
3,,,,,113.0,1120.0,,hr,passage,h,...,,1/2,Passed,Passed,2013,http://clerk.house.gov/evs/2013/roll101.xml,Preventing Greater Uncertainty in Labor-Manage...,On Passage of the Bill,2014-06-18T11:22:04-04:00,h101-113.2013
4,,,,,,,,,procedural,h,...,,1/2,Passed,Passed,2013,http://clerk.house.gov/evs/2013/roll102.xml,,On Approving the Journal,2014-06-18T11:22:04-04:00,h102-113.2013


In [6] used 0.098 MiB RAM in 0.039s, peaked 0.000 MiB above current, total RAM usage 281.762 MiB


In [7]:
position_df.head()

Unnamed: 0,legislator_index,position,vote_index
0,54,Not Voting,0
1,212,Not Voting,0
2,213,Not Voting,0
3,301,Not Voting,0
4,39,Present,0


In [7] used 0.000 MiB RAM in 0.009s, peaked 0.000 MiB above current, total RAM usage 281.762 MiB


## Transform Data

Next we have to transform our data to a format we can train our model on.

Our observed data is basically `position_df`, but instead of categorical `position`s, we need them to
be 1s and 0s. Also, since we aren't using all of the legislators, we need to transform
the `legislator_index` into a relative index. We call this transformed dataframe `model_position_df`.
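The transformation described above can be sketched roughly as follows. This is not the code in `ideal_point.ideal_point.transform_data`; `toy_transform` is a hypothetical helper, and the position labels `"Yea"`, `"Aye"`, `"Nay"`, and `"No"` are assumptions (only `"Not Voting"` and `"Present"` appear in the output above). The sketch simply drops positions that are neither yes nor no, which may differ from what the real function does.

```python
import pandas as pd

# Toy stand-in for position_df; the yes/no labels are assumed
toy_position_df = pd.DataFrame({
    "legislator_index": [54, 212, 54, 301, 212],
    "position": ["Yea", "Nay", "Not Voting", "Aye", "Present"],
    "vote_index": [0, 0, 1, 1, 1],
})

YES = {"Yea", "Aye"}
NO = {"Nay", "No"}


def toy_transform(position_df):
    """Binarize positions and re-index legislators and votes from 0."""
    # Keep only clear yes/no positions (an assumption of this sketch)
    kept = position_df[position_df["position"].isin(YES | NO)]
    # factorize assigns dense 0..n-1 codes and returns the original values
    legislator_codes, legislator_values = pd.factorize(kept["legislator_index"])
    vote_codes, vote_values = pd.factorize(kept["vote_index"])
    model_df = pd.DataFrame({
        "legislator": legislator_codes,
        "position": kept["position"].isin(YES).astype(int).to_numpy(),
        "vote": vote_codes,
    })
    # The two series map relative index -> original index
    return model_df, pd.Series(legislator_values), pd.Series(vote_values)


toy_model_df, toy_legislator_map, toy_vote_map = toy_transform(toy_position_df)
print(toy_model_df)
```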

In [8]:
import ideal_point.ideal_point

In [8] used 9.543 MiB RAM in 1.769s, peaked 0.000 MiB above current, total RAM usage 291.305 MiB


In [9]:
model_position_df, model_legislator_index, model_vote_index = ideal_point.ideal_point.transform_data(position_df, vote_df, legislator_df)

In [9] used 15.746 MiB RAM in 1.372s, peaked 21.543 MiB above current, total RAM usage 307.051 MiB


In [10]:
model_position_df.head()

Unnamed: 0,legislator,position,vote
0,39,0,0
1,530,0,0
2,40,0,0
3,375,0,0
4,0,0,0


In [10] used 0.000 MiB RAM in 0.013s, peaked 0.000 MiB above current, total RAM usage 307.051 MiB


In [11]:
model_position_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 508555 entries, 0 to 508554
Data columns (total 3 columns):
legislator    508555 non-null int64
position      508555 non-null int64
vote          508555 non-null int64
dtypes: int64(3)
memory usage: 11.6 MB
In [11] used 0.000 MiB RAM in 0.010s, peaked 0.000 MiB above current, total RAM usage 307.051 MiB


The two series `model_legislator_index` and `model_vote_index` map the values in `model_position_df` to the full dataframes, from index to value.
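As a small illustration of that mapping, the sketch below goes from relative indices back to rows of a full dataframe. All of the data here is made up, and it assumes the series values are row labels of the full dataframe.

```python
import pandas as pd

# Toy full dataframe with placeholder names
toy_legislator_df = pd.DataFrame({"last_name": ["Alpha", "Bravo", "Charlie", "Delta"]})

# Relative (model) index -> row label in the full dataframe (assumed layout)
toy_model_legislator_index = pd.Series([3, 0, 2])

toy_model_position_df = pd.DataFrame({
    "legislator": [0, 1, 2, 1],
    "position": [1, 0, 1, 1],
    "vote": [0, 0, 0, 1],
})

# Look up the full-dataframe row for each observation, then fetch the name
full_rows = toy_model_legislator_index.loc[toy_model_position_df["legislator"]].to_numpy()
names = toy_legislator_df.loc[full_rows, "last_name"].tolist()
print(names)
```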

## Create Model

Now we can create our model, given we have observed those votes. The notation is based
on ["Comparing NOMINATE and IDEAL: Points of Difference and Monte Carlo Tests"](http://scholar-qa.princeton.edu/sites/default/files/jameslo/files/lsq_nomvsideal.pdf).
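For intuition, here is a minimal sketch of the kind of vote model this describes: each legislator has an ideal point, and each vote has an ideology (discrimination) parameter and a bias (offset). The function name, the logistic link, and the sign conventions are assumptions for illustration; IDEAL is commonly formulated with a probit link, and `ideal_point.gradient` may parameterize things differently.

```python
import math


def vote_yes_probability(legislator_ideology, vote_ideology, vote_bias):
    """Probability of a yes vote under an assumed logistic ideal point model."""
    # A legislator whose ideal point has the same sign as the vote's ideology
    # is pushed toward yes; the bias shifts every legislator equally.
    logit = legislator_ideology * vote_ideology + vote_bias
    return 1.0 / (1.0 + math.exp(-logit))


# Two hypothetical legislators on opposite sides of a hypothetical vote
p_left = vote_yes_probability(-1.0, vote_ideology=-0.9, vote_bias=0.2)
p_right = vote_yes_probability(1.0, vote_ideology=-0.9, vote_bias=0.2)
print(round(p_left, 3), round(p_right, 3))
```

Under this parameterization, a vote with a large positive bias passes with support from almost everyone, regardless of ideology.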

In [12]:
import ideal_point.gradient

In [12] used 0.000 MiB RAM in 0.012s, peaked 0.000 MiB above current, total RAM usage 307.051 MiB


In [13]:
g = ideal_point.gradient.Gradient(model_position_df)

In [13] used 883.781 MiB RAM in 3.888s, peaked 0.000 MiB above current, total RAM usage 1190.832 MiB


In [14]:
g.n_legislators, g.n_votes

(548, 1446)

In [14] used 0.023 MiB RAM in 0.007s, peaked 0.000 MiB above current, total RAM usage 1190.855 MiB


## Train Model

Now we can run variational inference to compute estimated parameters for the model.

In [15]:
# g.run(1000)
# params = g.params
# ideal_point.ideal_point.save_params(params)

In [15] used 0.000 MiB RAM in 0.001s, peaked 0.000 MiB above current, total RAM usage 1190.855 MiB


Alternatively, we can load the parameters from disk if we have already computed them (training takes about an hour and a half on my computer).

In [16]:
params = ideal_point.ideal_point.load_params()

In [16] used 0.070 MiB RAM in 0.032s, peaked 0.000 MiB above current, total RAM usage 1190.926 MiB


## Integrate Data

Now we can integrate the parameters we learned back into our `vote_df` and `legislator_df`. We add an `ideology` column to both of them and filter out rows without ideal points. We also add a `bias` to the votes (which is greater if any legislator is more likely to vote yes).

In [17]:
# Use the params loaded from disk above; g.params is only trained if g.run() was called
legislators_pt_df = ideal_point.ideal_point.leg_add_ideology(legislator_df, model_legislator_index, params)
vote_pt_df = ideal_point.ideal_point.vote_add_ideology_and_bias(vote_df, model_vote_index, params)

In [17] used 6.031 MiB RAM in 0.065s, peaked 0.000 MiB above current, total RAM usage 1196.957 MiB


### Visualize Points

We can do a quick gut check of our legislator ideal points to make sure they separate Democrats and Republicans.
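The same gut check can also be done numerically. The sketch below uses made-up ideal points and assumes `party` and `ideology` columns like those plotted in this section; it makes no assumption about which party ends up on the positive side, since the sign of a fitted scale is arbitrary.

```python
import pandas as pd

# Made-up ideal points, for illustration only
toy_legislators_pt_df = pd.DataFrame({
    "party": ["Democrat", "Democrat", "Republican", "Republican", "Independent"],
    "ideology": [-1.2, -0.8, 0.9, 1.4, -0.5],
})

# Per-party summary of the estimated ideal points
party_summary = toy_legislators_pt_df.groupby("party")["ideology"].agg(["mean", "min", "max"])
print(party_summary)

dem = party_summary.loc["Democrat"]
rep = party_summary.loc["Republican"]
# The parties separate if their ranges do not overlap, in either direction
parties_separate = bool(dem["max"] < rep["min"] or rep["max"] < dem["min"])
print(parties_separate)
```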

In [18]:
from altair import Chart  # only Chart is used below, so avoid the wildcard import

In [18] used 0.000 MiB RAM in 0.261s, peaked 0.000 MiB above current, total RAM usage 1196.957 MiB


In [19]:
Chart(legislators_pt_df[["ideology", "party"]]).mark_tick().encode(
    x='ideology:Q',
    y='party:O',
)

In [19] used 0.121 MiB RAM in 0.078s, peaked 0.000 MiB above current, total RAM usage 1197.078 MiB


In [20]:
vote_pt_df.head()

Unnamed: 0,amendment_author,amendment_number,amendment_purpose,amendment_type,bill_congress,bill_number,bill_title,bill_type,category,chamber,...,result,result_text,session,source_url,subject,type,updated_at,vote_id,bias,ideology
2,,,,,113.0,1120.0,,hr,recommit,h,...,Failed,Failed,2013,http://clerk.house.gov/evs/2013/roll100.xml,Preventing Greater Uncertainty in Labor-Manage...,On the Motion to Recommit,2014-06-18T11:22:04-04:00,h100-113.2013,0.206324,-0.90987
3,,,,,113.0,1120.0,,hr,passage,h,...,Passed,Passed,2013,http://clerk.house.gov/evs/2013/roll101.xml,Preventing Greater Uncertainty in Labor-Manage...,On Passage of the Bill,2014-06-18T11:22:04-04:00,h101-113.2013,0.351693,-0.051634
5,,,,,113.0,1162.0,,hr,passage-suspension,h,...,Passed,Passed,2013,http://clerk.house.gov/evs/2013/roll103.xml,"To amend title 31, United States Code, to make...","On Motion to Suspend the Rules and Pass, as Am...",2014-06-18T11:22:04-04:00,h103-113.2013,2.138762,0.761325
6,,,,,113.0,882.0,,hr,passage-suspension,h,...,Passed,Passed,2013,http://clerk.house.gov/evs/2013/roll104.xml,To prohibit the awarding of a contract or gran...,"On Motion to Suspend the Rules and Pass, as Am...",2014-06-18T11:22:04-04:00,h104-113.2013,-0.332675,-2.306667
7,,,,,113.0,249.0,,hr,passage-suspension,h,...,Failed,Failed,2013,http://clerk.house.gov/evs/2013/roll105.xml,"To amend title 5, United States Code, to provi...",On Motion to Suspend the Rules and Pass,2014-06-18T11:22:04-04:00,h105-113.2013,0.004038,-0.457308


In [20] used 0.000 MiB RAM in 0.033s, peaked 0.000 MiB above current, total RAM usage 1197.078 MiB


### Validation

Some of the most conservative members in our model include Mike Pompeo, who served on the House Select Committee on Benghazi, and Randy Weber, who drew fire for a tweet declaring Barack Obama a "socialist dictator."

Some of the most liberal members include Jim McGovern, who represents the Pioneer Valley, and Jerrold Nadler, who represents Manhattan's Upper West Side. The most liberal legislator, Jan Schakowsky, is a longtime critic of the Iraq War.
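Members at either extreme can be pulled out by sorting on the `ideology` column. The sketch below uses placeholder names and made-up values rather than the fitted results; which end counts as liberal or conservative depends on the sign of the fitted scale, so it has to be read off from known legislators.

```python
import pandas as pd

# Placeholder legislators with made-up ideal points
toy_legislators_pt_df = pd.DataFrame({
    "last_name": ["Alpha", "Bravo", "Charlie", "Delta", "Echo"],
    "ideology": [-1.5, 0.2, 2.1, -0.3, 1.0],
})

# The two most extreme members at each end of the scale
lowest = toy_legislators_pt_df.nsmallest(2, "ideology")
highest = toy_legislators_pt_df.nlargest(2, "ideology")
print(lowest["last_name"].tolist(), highest["last_name"].tolist())
```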

The House bills that all Democrats voted for sit close to the Democratic end of the ideology scale.

In [21]:
# pd.set_option('display.max_columns', 999)
# pd.set_option('display.max_colwidth', 200)
# vote_pt_df[vote_pt_df["question"].str.contains("Immigration", na=False)]
# vote_pt_df[vote_pt_df["number"] == 168]

In [21] used 0.000 MiB RAM in 0.002s, peaked 0.000 MiB above current, total RAM usage 1197.078 MiB


In [22]:
import pandas as pd  # needed here: pandas was not imported earlier in the notebook

# Short labels for a few well-known bills, keyed by their row index in vote_pt_df
labels = pd.Series([
    "Violence Against Women Reauthorization Act of 2013",
    "Border Security, Economic Opportunity, and Immigration Modernization Act",
    "Keystone XL Pipeline Approval Act",
    "Consolidated Appropriations Act, 2014"
], index=[
    500,
    717,
    76,
    570
])

In [22] used 0.000 MiB RAM in 0.204s, peaked 0.000 MiB above current, total RAM usage 1197.078 MiB


In [None]:
chart_df = vote_pt_df[["ideology", "bias"]].assign(bill=labels).dropna()

In [None]:
Chart(chart_df).mark_text().encode(
    x='ideology:Q',
    y='bias:Q',
    text='bill',
)

In [None]:
Chart(vote_pt_df[["ideology", "bias"]].dropna()).mark_circle().encode(
    x='ideology:Q',
    y='bias:Q',
)

## Text reuse

For now, this step exports the vote ideal points so that bills can be linked with the text duplication data.

In [26]:
tmp = vote_pt_df[["bill_congress", "bill_number", "bill_type", "ideology"]]
tmp.to_csv("vote_pt_df.csv")
# Run wtfpandasjoin.py to do the join; attempts to do it in pandas kept hitting errors/bugs.

In [26] used 0.000 MiB RAM in 0.040s, peaked 0.000 MiB above current, total RAM usage 1197.238 MiB


In [None]:
import pandas as pd

# read_csv with index_col=0, parse_dates=True replaces the deprecated DataFrame.from_csv
reuse_df = pd.read_csv("pairs_enhanced_again.txt", index_col=0, parse_dates=True)
# Many, many pairs are "unknown" because the bill never got a vote (many things
# are killed in committee). Ideal points without votes: a research opportunity.
criterion = reuse_df['ideology_a'] != "unknown"
reuse_df_no_unk = reuse_df[criterion].copy()  # copy so the assignments below are safe
reuse_df_no_unk['ideology_a'] = reuse_df_no_unk['ideology_a'].astype(float)
# Fixed: this line previously overwrote ideology_a with ideology_b
reuse_df_no_unk['ideology_b'] = reuse_df_no_unk['ideology_b'].astype(float)
# Drop pairs where both bills have exactly the same ideal point
reuse_df_no_unk['ideology_eq'] = reuse_df_no_unk['ideology_a'] == reuse_df_no_unk['ideology_b']
reuse_df_no_unk = reuse_df_no_unk[~reuse_df_no_unk['ideology_eq']]

In [None]:
reuse_df_no_unk.tail()

In [None]:
%matplotlib inline
# http://stackoverflow.com/questions/14300137/making-matplotlib-scatter-plots-from-dataframes-in-pythons-pandas
import matplotlib.pyplot as plt

reuse_df_no_unk.plot(kind='scatter', x='ideology_a', y='ideology_b', title="Ideologies for reuse pairings")

#### Bipartisanship?

It appears that a liberal section 

In [None]:
import pandas as pd

# read_csv with index_col=0, parse_dates=True replaces the deprecated DataFrame.from_csv
reuse_df = pd.read_csv("pairs_enhanced_again.txt", index_col=0, parse_dates=True)
# Many, many pairs are "unknown" because the bill never got a vote (many things
# are killed in committee). Ideal points without votes: a research opportunity.
criterion = reuse_df['ideology_a'] != "unknown"
reuse_df_no_unk = reuse_df[criterion].copy()  # copy so the assignments below are safe
# sa / sb: whether each bill's ideal point is positive
reuse_df_no_unk["sa"] = reuse_df_no_unk["ideology_a"].astype(float) > 0
reuse_df_no_unk["sb"] = reuse_df_no_unk["ideology_b"].astype(float) > 0
# Despite its name, "diff" is True when both bills fall on the same side of zero
reuse_df_no_unk["diff"] = reuse_df_no_unk["sa"] == reuse_df_no_unk["sb"]
# Keep only the pairs whose bills fall on opposite sides
reuse_df_no_unk = reuse_df_no_unk[~reuse_df_no_unk["diff"]]
reuse_df_no_unk.shape