# Basic Analysis
Now that we've cleaned up our data and have only the features we care about, we can run some basic statistical analysis to see if we can find any obvious patterns or interesting insights.

In [2]:
import pandas as pd
import numpy as np
import lzma

In [3]:
# Each cleaned table is stored as an xz-compressed pickle.
with lzma.open("./cleaned_input/bills.pkl.xz", 'r') as f:
    bills = pd.read_pickle(f)
with lzma.open("./cleaned_input/people.pkl.xz", 'r') as f:
    people = pd.read_pickle(f)
with lzma.open("./cleaned_input/votes.pkl.xz", 'r') as f:
    votes = pd.read_pickle(f)

Let's take a quick look at our people dataframe; there are some interesting things going on that are worth pointing out.

In [4]:
people

ID,Name,Party,Role,State,District
6033,Carl Gatto,R,Rep,AK,HD-013
6034,Robert Lynn,R,Rep,AK,HD-026
6035,Max Gruenberg,D,Rep,AK,HD-016
6036,Nancy Dahlstrom,R,Rep,AK,HD-018
6037,Wes Keller,R,Rep,AK,HD-010
...,...,...,...,...,...
8675,Cale Case,R,Sen,WY,SD-025
8679,Dan Dockstader,R,Sen,WY,SD-016
8711,Dan Zwonitzer,R,Rep,WY,HD-043
8713,Bob Nicholas,R,Rep,WY,HD-008


Whoa, 177,598 people have served in elected legislative positions since 2008? That seems wrong. There are probably a fair number of duplicates in there, so let's look at the dataframe with the duplicates removed.

In [5]:
people.loc[~people.duplicated()]

ID,Name,Party,Role,State,District
6033,Carl Gatto,R,Rep,AK,HD-013
6034,Robert Lynn,R,Rep,AK,HD-026
6035,Max Gruenberg,D,Rep,AK,HD-016
6036,Nancy Dahlstrom,R,Rep,AK,HD-018
6037,Wes Keller,R,Rep,AK,HD-010
...,...,...,...,...,...
24296,Forrest Chadwick,R,Rep,WY,HD-062
24307,Joshua Larson,R,Rep,WY,HD-017
24311,Stacy Jones,R,Sen,WY,SD-013
24375,Liz Storer,D,Rep,WY,HD-023


Much better: 21,761 is far more reasonable. It's important to note how we removed duplicates. We only dropped rows whose column values were exactly the same, which represent people who served in the same position across multiple years. Some people have served in different positions or different districts, so we keep those "duplicates" even though they share an ID, because they carry interesting information.

In [6]:
people = people.loc[~people.duplicated()]
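
To make the behaviour of the deduplication above concrete, here is a minimal sketch on a made-up dataframe. The names, IDs, and districts are hypothetical; only the column layout mirrors `people`. It also illustrates one caveat: `duplicated()` compares column values only and ignores the index, so two different IDs with identical columns would be collapsed unless the ID is moved into the columns first.

```python
import pandas as pd

# Hypothetical toy data; only the column layout mirrors `people`.
toy = pd.DataFrame(
    {
        "Name": ["A. Smith", "A. Smith", "A. Smith", "B. Jones", "B. Jones"],
        "Party": ["R", "R", "R", "D", "D"],
        "Role": ["Rep", "Rep", "Sen", "Rep", "Rep"],
        "State": ["AK", "AK", "AK", "AK", "AK"],
        "District": ["HD-001", "HD-001", "SD-A", "HD-002", "HD-002"],
    },
    index=pd.Index([1, 1, 1, 2, 3], name="ID"),
)

# Same pattern as the notebook: drop rows whose column values repeat an earlier row.
# ID 1 keeps two rows because the role and district changed.
deduped = toy.loc[~toy.duplicated()]
print(deduped)

# duplicated() ignores the index, so ID 3 was dropped above as a copy of ID 2.
# Moving ID into the columns makes it part of the comparison.
mask = toy.reset_index().duplicated().to_numpy()
deduped_with_id = toy.loc[~mask]
print(deduped_with_id)
```

Whether the second variant is preferable depends on whether distinct IDs with identical details can occur in the real data, which this sketch does not check.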

Now let's look at our votes dataframe. As it stands, each row is indexed by its own ID and the bill that was voted on is stored in a column, which is not the best way to represent this data. Let's switch to a multi-index on the bill ID and the vote ID so we can group all votes on a single bill together.

In [22]:
votes = votes.reset_index().set_index(["Bill ID", "ID"])
votes

Bill ID,ID,Description,Passed,Votes
454312,306479,Senate: <pre> SR 1 Final Passage,True,"[(6044, 'Yea'), (6061, 'Yea'), (6064, 'Yea'), ..."
472178,306480,Senate: CSHB 84(FIN)(efd am S) Third Reading -...,True,"[(6044, 'Yea'), (6061, 'Yea'), (6064, 'Yea'), ..."
472178,306481,Senate: CSHB 84(FIN)(efd am S) Third Reading -...,True,"[(6044, 'Yea'), (6061, 'Yea'), (6064, 'Yea'), ..."
472178,306482,House: Concur,True,"[(6034, 'Yea'), (6035, 'Yea'), (6037, 'Yea'), ..."
545632,306483,House: Special Order of Business,True,"[(6034, 'Yea'), (6035, 'Yea'), (6037, 'Yea'), ..."
...,...,...,...,...
1673024,1268431,Line Item Veto Override 27-3-1-0-0,True,"[(8641, 'Yea'), (8663, 'Yea'), (8675, 'Yea'), ..."
1673024,1268432,Line Item Veto Override 29-1-1-0-0,True,"[(8641, 'Yea'), (8663, 'Yea'), (8675, 'Yea'), ..."
1673024,1268433,Line Item Veto Override 27-3-1-0-0,True,"[(8641, 'Yea'), (8663, 'Yea'), (8675, 'Yea'), ..."
1673024,1268434,Line Item Veto Override 23-7-1-0-0,True,"[(8641, 'Nay'), (8663, 'Nay'), (8675, 'Yea'), ..."
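
As a rough illustration of what the multi-index buys us, the sketch below builds a small made-up dataframe with the same index levels and columns as `votes`, then pulls out every roll call for one bill and counts roll calls per bill. All bill IDs, vote IDs, and vote lists here are hypothetical.

```python
import pandas as pd

# Hypothetical toy data shaped like `votes` after the reindexing above.
toy_votes = pd.DataFrame(
    {
        "Bill ID": [10, 20, 20],
        "ID": [100, 101, 102],
        "Description": ["Final Passage", "Third Reading", "Concur"],
        "Passed": [True, True, False],
        "Votes": [
            [(1, "Yea"), (2, "Nay")],
            [(1, "Yea"), (2, "Yea")],
            [(1, "Nay"), (2, "Nay")],
        ],
    }
).set_index(["Bill ID", "ID"])

# Selecting on the first level returns every roll call for that bill,
# indexed by the remaining level (the vote ID).
bill_20 = toy_votes.loc[20]
print(bill_20)

# Grouping on the first level gives per-bill summaries, here a simple count.
roll_calls_per_bill = toy_votes.groupby(level="Bill ID").size()
print(roll_calls_per_bill)
```

The same two calls should work on the real `votes` dataframe with an actual bill ID in place of `20`.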
