# Basic Analysis
Now that we've cleaned up our data and have only the features we care about, we can run some basic statistical analysis to see if we can find any obvious patterns or interesting insights.

In [1]:
import pandas as pd
import numpy as np
import lzma, ast, gc
from scipy.sparse import lil_matrix, csr_matrix, save_npz, load_npz
from sklearn.metrics.pairwise import cosine_similarity

In [17]:
with lzma.open("./cleaned_input/bills.pkl.xz", 'r') as f:
    bills = pd.read_pickle(f)
with lzma.open("./cleaned_input/people.pkl.xz", 'r') as f:
    people = pd.read_pickle(f)
with lzma.open("./cleaned_input/votes.pkl.xz", 'r') as f:
    votes = pd.read_pickle(f)

Let's take a quick look at our people dataframe; there are a few interesting things going on that are worth pointing out.

In [3]:
people

ID,Name,Party,Role,State,District
6033,Carl Gatto,R,Rep,AK,HD-013
6034,Robert Lynn,R,Rep,AK,HD-026
6035,Max Gruenberg,D,Rep,AK,HD-016
6036,Nancy Dahlstrom,R,Rep,AK,HD-018
6037,Wes Keller,R,Rep,AK,HD-010
...,...,...,...,...,...
8675,Cale Case,R,Sen,WY,SD-025
8679,Dan Dockstader,R,Sen,WY,SD-016
8711,Dan Zwonitzer,R,Rep,WY,HD-043
8713,Bob Nicholas,R,Rep,WY,HD-008


Woah, 177,598 people have served in elected legislative positions since 2008? That seems wrong; I suspect there are a fair number of duplicates in there. Let's look at the dataframe with the duplicates removed.

In [4]:
people.loc[~people.duplicated()]

ID,Name,Party,Role,State,District
6033,Carl Gatto,R,Rep,AK,HD-013
6034,Robert Lynn,R,Rep,AK,HD-026
6035,Max Gruenberg,D,Rep,AK,HD-016
6036,Nancy Dahlstrom,R,Rep,AK,HD-018
6037,Wes Keller,R,Rep,AK,HD-010
...,...,...,...,...,...
24296,Forrest Chadwick,R,Rep,WY,HD-062
24307,Joshua Larson,R,Rep,WY,HD-017
24311,Stacy Jones,R,Sen,WY,SD-013
24375,Liz Storer,D,Rep,WY,HD-023


Much better, 21,761 is far more reasonable. It's important to note how we removed duplicates: we only dropped rows that were exactly the same, which represent people who served in the same position across multiple years. Some people have served in different positions or different districts, so we keep those "duplicates" even though their index values are the same, because they carry useful information.
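
To make the distinction concrete, here is a small sketch on a made-up dataframe (the names, IDs, and districts are invented for illustration and are not taken from our data). It shows that `duplicated()` only flags rows whose column values exactly repeat an earlier row.

```python
import pandas as pd

# Toy data: ID 1 appears three times, twice with identical rows
# and once after moving to a different chamber and district.
toy = pd.DataFrame(
    {
        "Name": ["A. Example", "A. Example", "A. Example", "B. Sample"],
        "Role": ["Rep", "Rep", "Sen", "Rep"],
        "District": ["HD-001", "HD-001", "SD-004", "HD-002"],
    },
    index=pd.Index([1, 1, 1, 2], name="ID"),
)

# Keep the first occurrence of each distinct row; later exact repeats are dropped.
deduped = toy.loc[~toy.duplicated()]
print(deduped)
```

ID 1 still appears twice in the result, once per distinct position. One thing to keep in mind is that `duplicated()` compares column values and ignores the index, so two rows with different IDs but identical columns would also be collapsed into one.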

In [5]:
people = people.loc[~people.duplicated()]

Now we want to combine the votes with our people data so we can see how each person voted simply by looking up their row, rather than scanning through every roll call. To do this, we'll collect each person's votes into a new dataframe, then join it with the current `people` dataframe.

In [18]:
people_votes = {}
for roll_call in votes.index:
    # Each "Votes" entry is a stringified list of (person ID, vote) pairs.
    vote = ast.literal_eval(votes.loc[roll_call, "Votes"])
    bill = votes.loc[roll_call, "Bill ID"]
    for person, v in vote:
        people_votes.setdefault(person, []).append((bill, roll_call, v))
# Store each list as its string representation; it is parsed back with ast.literal_eval later.
for p in people_votes:
    people_votes[p] = f"{people_votes[p]}"
# After the transpose: one row per person ID with a single "Votes" column.
people_votes_df = pd.DataFrame.from_records(people_votes, index=["Votes"]).T
people = people.join(people_votes_df)

This is a pretty important dataframe for later analysis, so let's save it to a file like we did in the data cleaning section.

In [7]:
people.to_pickle("./cleaned_input/people_votes.pkl.xz")

Now that that's done, let's create one big matrix of all people and all votes to make comparisons easy. Stored densely, this would take up a massive amount of space (roughly 18,000 x 1,292,603), but it will be insanely sparse, so we'll take advantage of that and use the sparse matrix representations offered by SciPy.

In [8]:
try:
    smaller = load_npz("./similarity_matrix.npz")
except FileNotFoundError:
    # Rows are person IDs and columns are roll call IDs, so each axis is sized by its largest ID.
    people_votes_matrix = lil_matrix((np.max(people.index) + 1, np.max(votes.index) + 1), dtype="uint8")
    # Note: "Yea" maps to 0, which a sparse matrix cannot tell apart from "no vote recorded".
    vote_nums = ["Yea", "Absent", "NV", "Nay"]
    for person in people.index:
        person_votes = people.loc[person]["Votes"]
        # IDs that appear on several rows come back as a Series; use its first entry.
        if isinstance(person_votes, pd.Series):
            person_votes = person_votes.iloc[0]
        # People with no recorded votes hold NaN here after the join, so skip them.
        if not isinstance(person_votes, str):
            continue
        for vote in ast.literal_eval(person_votes):
            vote_id = vote[1]
            vote_actual = vote[2]
            if vote_actual not in vote_nums:
                continue
            people_votes_matrix[person, vote_id] = vote_nums.index(vote_actual)
    # CSR is the more compact format and is faster for the row-wise math that follows.
    smaller = csr_matrix(people_votes_matrix)
    save_npz("./similarity_matrix.npz", smaller)
    del people_votes_matrix
    gc.collect()
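
As a rough illustration of how the sparse representation behaves, the sketch below builds a tiny matrix the same way (the sizes and entries are invented). It also shows one property worth keeping in mind: a value of 0 is the same as an empty cell, so under the `vote_nums` encoding above a "Yea" looks identical to a vote that was never cast.

```python
import numpy as np
from scipy.sparse import lil_matrix, csr_matrix

# Hypothetical toy size: 4 people by 6 roll calls.
toy = lil_matrix((4, 6), dtype="uint8")
toy[0, 1] = 3   # "Nay" under the encoding above
toy[2, 4] = 1   # "Absent"
toy[3, 0] = 0   # "Yea" is encoded as 0, so this cell reads the same as an empty one

# LIL is convenient for incremental writes; CSR is better for arithmetic.
compact = csr_matrix(toy)
print(compact.count_nonzero())   # how many cells actually carry information
print(compact.toarray())
```

If distinguishing "Yea" from "no vote" matters for a later analysis, one option would be to start the encoding at 1 instead of 0; that is a suggestion, not something the cell above does.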

Alright, now let's run cosine similarity on this to determine similarity between representatives.

In [9]:
similarities = cosine_similarity(smaller)
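
For intuition, cosine similarity measures the angle between two rows: it is 1 when two rows point in the same direction and 0 when they share no non-zero columns. Here is a minimal sketch on invented vote rows (the values are made up, not real votes), checked against the formula written out by hand.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Invented vote rows for three people over four roll calls.
toy_votes = csr_matrix(np.array([
    [3.0, 0.0, 1.0, 3.0],
    [3.0, 0.0, 1.0, 3.0],   # identical to person 0
    [0.0, 2.0, 0.0, 0.0],   # shares no non-zero columns with person 0
]))

# Returns a dense (n_people x n_people) array of pairwise similarities.
toy_sim = cosine_similarity(toy_votes)
print(np.round(toy_sim, 2))

# The same number by hand for persons 0 and 1: dot product over the product of norms.
a, b = toy_votes[0].toarray().ravel(), toy_votes[1].toarray().ravel()
by_hand = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

Note that the result is a dense square array, so its size grows with the square of the number of rows even when the input is sparse.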

Let's make sure that this is doing what we want it to do by checking to see who is similar to Lauren Boebert (I suspect MTG will be quite similar, as well as some other QAnon wackos).

In [10]:
boebert = people.loc[people["Name"] == "Lauren Boebert"].index[0]
people.loc[similarities[boebert].argsort()[::-1][1:6]]

ID,Name,Party,Role,State,District,Votes
21935,Marjorie Greene,R,Rep,US,HD-GA-14,"[(1394633, 1014585, 'Yea'), (1460770, 1015299,..."
21975,Robert Good,R,Rep,US,HD-VA-5,"[(1394633, 1014585, 'Yea'), (1460770, 1015299,..."
21952,Matt Rosendale,R,Rep,US,HD-MT,"[(1394633, 1014585, 'Yea'), (1460770, 1015299,..."
21941,Mary Miller,R,Rep,US,HD-IL-15,"[(1394633, 1014585, 'Yea'), (1460770, 1015299,..."
21934,Andrew Clyde,R,Rep,US,HD-GA-9,"[(1394633, 1014585, 'Yea'), (1460770, 1015299,..."
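
The `argsort()[::-1][1:6]` chain in the lookup is compact, so here is a small sketch of what each step does, using an invented row of similarity scores rather than real data.

```python
import numpy as np

# Invented similarities between person 1 and five people (index 1 is person 1 itself).
toy_row = np.array([0.20, 1.00, 0.95, 0.10, 0.80])

ranked = toy_row.argsort()[::-1]   # indices ordered from most to least similar
nearest = ranked[1:4]              # skip position 0, assumed to be the person themselves
print(nearest)
```

The slice assumes the person's own row sorts first. If someone else has a similarity of exactly 1.0, the tie could put them in position 0 instead, so filtering out the person's own index explicitly would be a safer variant.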


Yeah, that all checks out. These representatives are politically close to Boebert, so it makes sense that they would all vote similarly. Looks like cosine similarity does what we want it to do.

Now let's add some similarity metrics to the bills so we can further compare representatives across states and legislative sessions.

In [11]:
# IMPORT THE BILL TOPICS STUFF HERE

With our new shiny similarity metrics, we don't want to compare every single vote to every single other vote, primarily because that matrix would be MASSIVE (a dense 20,000 x 1,200,000 matrix of 8-byte floats comes to about 192 GB). Instead, we'll average each representative's votes on each bill, then average those per-bill values across all bills in each category to get a general "temperature" of how each person is voting on each topic. This is fairly rudimentary, and could definitely be improved in the future, but it works for now.
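
As a quick back-of-the-envelope check on the dense size, here is the arithmetic for the approximate shape quoted above. The 8 bytes per entry is an assumption (NumPy's default `float64`); a smaller dtype would shrink the total proportionally.

```python
import numpy as np

# Approximate shape from the text; float64 is assumed for the entry size.
n_people, n_votes = 20_000, 1_200_000
bytes_per_entry = np.dtype("float64").itemsize

dense_gb = n_people * n_votes * bytes_per_entry / 1e9
print(f"{dense_gb:,.0f} GB")
```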

First, we'll convert the indexing of the votes dataframe to better see which votes pertain to each bill. This will allow us to go bill by bill, rather than vote by vote, making temperature calculations a bit easier.

In [31]:
votes = votes.reset_index().set_index(["Bill ID", "ID"])
votes

Bill ID,ID,Description,Passed,Votes
454312,306479,Senate: <pre> SR 1 Final Passage,True,"[(6044, 'Yea'), (6061, 'Yea'), (6064, 'Yea'), ..."
472178,306480,Senate: CSHB 84(FIN)(efd am S) Third Reading -...,True,"[(6044, 'Yea'), (6061, 'Yea'), (6064, 'Yea'), ..."
472178,306481,Senate: CSHB 84(FIN)(efd am S) Third Reading -...,True,"[(6044, 'Yea'), (6061, 'Yea'), (6064, 'Yea'), ..."
472178,306482,House: Concur,True,"[(6034, 'Yea'), (6035, 'Yea'), (6037, 'Yea'), ..."
545632,306483,House: Special Order of Business,True,"[(6034, 'Yea'), (6035, 'Yea'), (6037, 'Yea'), ..."
...,...,...,...,...
1673024,1268431,Line Item Veto Override 27-3-1-0-0,True,"[(8641, 'Yea'), (8663, 'Yea'), (8675, 'Yea'), ..."
1673024,1268432,Line Item Veto Override 29-1-1-0-0,True,"[(8641, 'Yea'), (8663, 'Yea'), (8675, 'Yea'), ..."
1673024,1268433,Line Item Veto Override 27-3-1-0-0,True,"[(8641, 'Yea'), (8663, 'Yea'), (8675, 'Yea'), ..."
1673024,1268434,Line Item Veto Override 23-7-1-0-0,True,"[(8641, 'Nay'), (8663, 'Nay'), (8675, 'Yea'), ..."


In [45]:
people_votes_temperatures = {}
# Score each vote: "Yea" is +1, "Nay" is -1, anything else (absent, not voting) is 0.
vote_scores = {"Yea": 1, "Nay": -1}
for bill_id, vote_id in votes.index:
    vote_list = ast.literal_eval(votes.loc[(bill_id, vote_id), "Votes"])
    for person, v in vote_list:
        # Collect every score this person has on this bill across its roll calls.
        person_bills = people_votes_temperatures.setdefault(person, {})
        person_bills.setdefault(bill_id, []).append(vote_scores.get(v, 0))
# Collapse each person's scores on a bill into a single mean "temperature".
for person in people_votes_temperatures:
    for bill in people_votes_temperatures[person]:
        people_votes_temperatures[person][bill] = np.mean(people_votes_temperatures[person][bill])

Now we'll do something similar, but this time we'll incorporate the topics of each bill and give each person a temperature on a topic, rather than specific bills.
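
One possible shape for this step, purely as a sketch: the `topic_temperatures` helper and the `bill_topics` mapping (bill ID to topic name) are hypothetical names introduced here for illustration, and the toy values are invented. It assumes per-bill temperatures in the same nested-dict layout as `people_votes_temperatures`.

```python
import numpy as np

def topic_temperatures(bill_temperatures, bill_topics):
    """Average each person's per-bill temperatures within each topic.

    bill_temperatures: {person_id: {bill_id: temperature}}
    bill_topics: {bill_id: topic}  (hypothetical mapping, assumed to exist)
    """
    result = {}
    for person, per_bill in bill_temperatures.items():
        by_topic = {}
        for bill_id, temperature in per_bill.items():
            topic = bill_topics.get(bill_id)
            if topic is None:
                continue  # skip bills that have no known topic
            by_topic.setdefault(topic, []).append(temperature)
        # One mean temperature per topic for this person.
        result[person] = {t: float(np.mean(v)) for t, v in by_topic.items()}
    return result

# Invented example: two people, three bills, two topics.
toy_temperatures = {101: {1: 1.0, 2: -1.0, 3: 0.5}, 102: {1: -1.0, 3: -0.5}}
toy_topics = {1: "health", 2: "health", 3: "energy"}
toy_result = topic_temperatures(toy_temperatures, toy_topics)
print(toy_result)
```

Averaging per bill first and then per topic means a bill with many roll calls does not outweigh a bill with a single one, which seems to match the intent described earlier.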