Written by Robert McCormick
# The purpose of this notebook is to create a similarity score between the Media Topics and the contents of each bill.

In this approach, we characterize the subject of each bill with TF-IDF term weights. We chose this method because each bill is processed on its own rather than as part of a corpus; note that with a single document the inverse document frequency is identical for every word, so the ranking is effectively by term frequency within the bill. We extracted the top 10 words of each bill and placed them in a word vector, matching the 10 words that describe each Media Topic.
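
As a rough illustration of the single-document case, the sketch below ranks words by raw term frequency using only the standard library. The function name `top_words_by_frequency`, the toy text, and the tiny stop-word set are assumptions made for this example; the notebook itself uses scikit-learn's `TfidfVectorizer`, and the order of equally frequent words may differ between the two.

```python
import re
from collections import Counter


def top_words_by_frequency(text, n=10, stop_words=frozenset()):
    """Return the n most frequent words of one document, skipping stop words."""
    # Lowercase the text and keep tokens of two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    counts = Counter(tok for tok in tokens if tok not in stop_words)
    return [word for word, _ in counts.most_common(n)]


# Hypothetical one-sentence "bill" used only for illustration.
toy_bill = "The refugee act amends the refugee program and refugee status"
print(top_words_by_frequency(toy_bill, n=3, stop_words={"the", "and"}))
```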
 
Next, we converted both the bill keywords and the ten words of a topic into GloVe embeddings using GloVe 6B. For each bill keyword embedding, we calculated the cosine similarity with every word embedding in the topic and kept the maximum. Repeating this for all ten bill keywords gives ten maximum similarities, which we averaged to obtain a single number representing the similarity of that bill to that topic. We performed this process for every bill and every topic.
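
The max-then-average step can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: the helper names `cosine` and `max_mean_similarity` and the toy 2-D vectors (standing in for GloVe embeddings) are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def max_mean_similarity(bill_vecs, topic_vecs):
    """Average, over bill vectors, of the best cosine match among topic vectors."""
    best_per_bill_word = [max(cosine(b, t) for t in topic_vecs) for b in bill_vecs]
    return sum(best_per_bill_word) / len(best_per_bill_word)


# Toy 2-D "embeddings" standing in for GloVe vectors.
bill_vecs = [[1.0, 0.0], [0.0, 1.0]]
topic_vecs = [[1.0, 0.0], [1.0, 1.0]]
print(round(max_mean_similarity(bill_vecs, topic_vecs), 4))
```

Because the maximum is taken per bill keyword, the score is not symmetric: swapping the roles of the bill and the topic would generally give a different number.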
 
From the topic modeling, we had over 750 topics; however, due to time and computing constraints, we limited the analysis to the top 50 topics. For each bill, we calculated the similarity score for each of these 50 topics and paired it with the topic's sentiment score. We then derived two further datasets from this one, giving three datasets for evaluation. The first is the original, with 100 columns: 50 similarity scores and 50 sentiment scores. The second keeps the similarity and sentiment scores of the top 20 topics (40 columns). The third applies PCA to the 100 columns: because the features in the large dataset were highly correlated, we extracted the top 32 principal components, which accounted for 95 percent of the variance in the dataset. Lastly, we repeated this process to create three more datasets, this time multiplying by the weight that each News Topic word received in the topic modeling process. We did this in hopes of capturing the importance of each word within its topic.


In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
torch.manual_seed(1)

<torch._C.Generator at 0x7f8bb0b41890>

In [2]:
import pandas as pd 


df = pd.read_csv("/Users/robertmccormick/Desktop/Advanced ML/final project/data/115th_clean.csv")
df

Unnamed: 0.1,Unnamed: 0,bill_id,bill_slug,bill_type,number,bill_uri,title,short_title,sponsor_title,sponsor_id,...,committees,committee_codes,subcommittee_codes,primary_subject,summary,summary_short,latest_major_action_date,latest_major_action,raw_text,cleaned_text
0,0,hr7401-115,hr7401,hr,H.R.7401,https://api.propublica.org/congress/v1/115/bil...,To modify provisions of law relating to refuge...,Strengthening Refugee Resettlement Act,Rep.,E000288,...,House Ways and Means Committee,['HSJU'],[],Immigration,,,2019-01-02,Referred to the Subcommittee on Trade.,[Congressional Bills 115th Congress]\n[From th...,115th congress 2d session h. r. 7401 to modify...
1,1,hr7400-115,hr7400,hr,H.R.7400,https://api.propublica.org/congress/v1/115/bil...,Making continuing appropriations for the Coast...,Making continuing appropriations for the Coast...,Rep.,W000826,...,House Appropriations Committee,['HSAP'],[],Transportation and Public Works,,,2019-01-02,Referred to the House Committee on Appropriati...,[Congressional Bills 115th Congress]\n[From th...,115th congress 2d session h. r. 7400 making co...
2,2,hr7399-115,hr7399,hr,H.R.7399,https://api.propublica.org/congress/v1/115/bil...,To amend the Federal Election Campaign Act of ...,Inaugural Fund Integrity Act,Rep.,S001205,...,House Oversight and Reform Committee,['HSHA'],[],Government Operations and Politics,,,2018-12-27,Referred to the Committee on House Administrat...,[Congressional Bills 115th Congress]\n[From th...,115th congress 2d session h. r. 7399 to amend ...
3,3,hr7397-115,hr7397,hr,H.R.7397,https://api.propublica.org/congress/v1/115/bil...,To provide further additional continuing appro...,To provide further additional continuing appro...,Rep.,H000874,...,House Budget Committee,['HSAP'],[],Economics and Public Finance,DIVISION A--FURTHER ADDITIONAL CONTINUING APPR...,DIVISION A--FURTHER ADDITIONAL CONTINUING APPR...,2018-12-22,"Referred to the Committee on Appropriations, a...",[Congressional Bills 115th Congress]\n[From th...,115th congress 2d session h. r. 7397 to provid...
4,4,hr7398-115,hr7398,hr,H.R.7398,https://api.propublica.org/congress/v1/115/bil...,To prohibit the operation of an exercise facil...,SPA Act,Rep.,F000454,...,House Committee on House Administration,['HSHA'],[],Congress,,,2018-12-22,Referred to the House Committee on House Admin...,[Congressional Bills 115th Congress]\n[From th...,115th congress 2d session h. r. 7398 to prohib...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8871,8871,hres13-115,hres13,hres,H.RES.13,https://api.propublica.org/congress/v1/115/bil...,Expressing the sense of the House of Represent...,Expressing the sense of the House of Represent...,Rep.,J000032,...,House Homeland Security Committee,['HSHM'],[],Emergency Management,Expresses the sense of the House of Representa...,Expresses the sense of the House of Representa...,2017-01-03,Referred to the House Committee on Homeland Se...,[Congressional Bills 115th Congress]\n[From th...,115th congress 1st session h. res. 13 expressi...
8872,8872,hconres2-115,hconres2,hconres,H.CON.RES.2,https://api.propublica.org/congress/v1/115/bil...,To authorize the use of United States Armed Fo...,Authorization for Use of Military Force Agains...,Rep.,C001053,...,House Foreign Affairs Committee,['HSFA'],[],International Affairs,Authorization for Use of Military Force Agains...,Authorization for Use of Military Force Agains...,2017-01-03,Referred to the House Committee on Foreign Aff...,[Congressional Bills 115th Congress]\n[From th...,115th congress 1st session h. con. res. 2 to a...
8873,8873,hconres3-115,hconres3,hconres,H.CON.RES.3,https://api.propublica.org/congress/v1/115/bil...,Recognizing former United States Federal Judge...,Recognizing former United States Federal Judge...,Rep.,G000553,...,House Judiciary Committee,['HSJU'],['HSJU10'],"Civil Rights and Liberties, Minority Issues",Recognizes former federal judge Frank Minis Jo...,Recognizes former federal judge Frank Minis Jo...,2017-01-11,Referred to the Subcommittee on the Constituti...,[Congressional Bills 115th Congress]\n[From th...,115th congress 1st session h. con. res. 3 reco...
8874,8874,hconres1-115,hconres1,hconres,H.CON.RES.1,https://api.propublica.org/congress/v1/115/bil...,Regarding consent to assemble outside the seat...,Regarding consent to assemble outside the seat...,Rep.,S000250,...,,[],[],Congress,(This measure has not been amended since it wa...,(This measure has not been amended since it wa...,2017-01-04,Received in the Senate.,[Congressional Bills 115th Congress]\n[From th...,115th congress 1st session h. con. res. 1 in t...


In [4]:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import numpy as np
from torchtext.vocab import GloVe
from torch.nn.functional import cosine_similarity
from statistics import mean 
import os
glove = GloVe(name='6B')




# TF-IDF process for Bill Key Words 

In [5]:
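def get_top_words_bills(bill, n=10):
    """Return the n highest-weighted TF-IDF terms of a single bill text."""
    # 'additional' and 'stopwords' look like placeholder entries; replace them
    # with domain-specific stop words if generic legislative terms should be removed.
    custom_stop_words = list(ENGLISH_STOP_WORDS.union({'additional', 'stopwords'}))
    vectorizer = TfidfVectorizer(stop_words=custom_stop_words)

    # Fit on this one bill only, so the matrix has a single row.
    tfidf_matrix = vectorizer.fit_transform([bill])

    feature_array = vectorizer.get_feature_names_out()
    # Indices of the terms sorted by descending TF-IDF weight.
    tfidf_sorting = np.argsort(tfidf_matrix.toarray().flatten())[::-1]

    return feature_array[tfidf_sorting][:n].tolist()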

# The following two functions convert the words into GloVe embedding vectors and calculate the average cosine similarity score between the two word vectors.

In [6]:
def get_score(bill_words, topic_words):
    """Mean over bill keywords of the max cosine similarity to any topic word."""
    bill_vecs = glove.get_vecs_by_tokens(get_top_words_bills(bill_words))
    news_vec = glove.get_vecs_by_tokens(topic_words)

    avg_lst = []
    for i in range(len(bill_vecs)):
        best = float("-inf")
        bill_word_vec = bill_vecs[i, :].unsqueeze(0)
        # Iterate over the topic words (not the bill words) so the function
        # also works when the two lists differ in length.
        for j in range(len(news_vec)):
            news_word_vec = news_vec[j, :].unsqueeze(0)
            sim = cosine_similarity(bill_word_vec, news_word_vec).item()
            if sim > best:
                best = sim
        avg_lst.append(best)

    return mean(avg_lst)

In [7]:
def get_score_weights(bill_words, topic_words, weights):
    """Like get_score, but each similarity is multiplied by the topic word's weight."""
    bill_vecs = glove.get_vecs_by_tokens(get_top_words_bills(bill_words))
    news_vec = glove.get_vecs_by_tokens(topic_words)

    avg_lst = []
    for i in range(len(bill_vecs)):
        best = float("-inf")
        bill_word_vec = bill_vecs[i, :].unsqueeze(0)
        # Iterate over the topic words (not the bill words) so that weights[j]
        # and news_vec[j] always refer to the same topic word.
        for j in range(len(news_vec)):
            news_word_vec = news_vec[j, :].unsqueeze(0)
            weighted_sim = cosine_similarity(bill_word_vec, news_word_vec).item() * weights[j]
            if weighted_sim > best:
                best = weighted_sim
        avg_lst.append(best)

    return mean(avg_lst)
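
In the weighted variant the weight is applied before the maximum is taken, so the winning topic word is not necessarily the one closest to the bill keyword. The sketch below illustrates this with made-up numbers; `weighted_best_match`, the similarities, and the weights are all hypothetical and only serve the example.

```python
def weighted_best_match(similarities, weights):
    """Return (index, score) of the topic word with the largest weight * similarity."""
    scored = [s * w for s, w in zip(similarities, weights)]
    best_index = max(range(len(scored)), key=scored.__getitem__)
    return best_index, scored[best_index]


# Hypothetical similarities of one bill keyword to three topic words,
# and hypothetical topic-model weights for those three words.
similarities = [0.9, 0.6, 0.2]
weights = [0.005, 0.010, 0.008]

print(weighted_best_match(similarities, weights))
```

Here the second topic word wins even though the first is the closest match. Because the topic-model weights in this notebook are on the order of 0.01, the weighted scores are correspondingly small.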

In [8]:
get_top_words_bills(df.loc[0, "cleaned_text"])


['section',
 'refugee',
 'act',
 'shall',
 'alien',
 'states',
 'resettlement',
 'united',
 'granted',
 'status']

In [9]:
topic_date_1 = pd.read_csv("/Users/robertmccormick/Desktop/Advanced ML/final project/data/topics_collapsed/topics_2017_01")

topic_date_1["words"][0]

"['obamacare', 'health', 'insurance', 'care', 'repeal', 'medicaid', 'bill', 'republicans', 'coverage', 'affordable']"

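Note that the value printed above is a string that looks like a list, not a list. A minimal sketch of turning it back into a Python list with the standard library's `ast.literal_eval` follows; the variable names `raw_words` and `words` are chosen for this example.

```python
import ast

# Value of topic_date_1["words"][0] as printed above: a string, not a list.
raw_words = "['obamacare', 'health', 'insurance', 'care', 'repeal', 'medicaid', 'bill', 'republicans', 'coverage', 'affordable']"

# literal_eval only evaluates Python literals, so it is safer than eval().
words = ast.literal_eval(raw_words)
print(type(words).__name__, len(words), words[0])
```
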
# Preprocessing Data to Create Final Data Frame 

In [10]:
path = "/Users/robertmccormick/Desktop/Advanced ML/final project/data/topics_collapsed/"
print(os.listdir(path))
file_lst = ['topics_2017_01', 'topics_2017_02', 'topics_2017_03', 'topics_2017_04', 'topics_2017_05',
            'topics_2017_06', 'topics_2017_07', 'topics_2017_08', 'topics_2017_09', 'topics_2017_10']

lst = []
# Read one topics file per monthly window; Window runs from 1 (2017-01) to 10 (2017-10).
for num, file in enumerate(file_lst, start=1):
    # Use a separate name so the bills DataFrame `df` loaded earlier is not overwritten.
    topic_df = pd.read_csv(os.path.join(path, file))
    topic_df["Window"] = num
    lst.append(topic_df)

topics = pd.concat(lst, ignore_index=True)




['topics_2017_04', 'topics_2017_03', 'topics_2017_02', 'topics_2017_05', 'topics_2017_10', 'topics_2017_09', 'topics_2017_07', 'topics_2017_06', 'topics_2017_01', 'topics_2017_08']


In [11]:
topics.to_excel('/Users/robertmccormick/Desktop/Advanced ML/final project/data/inspection.xlsx', index=False)

In [12]:
final_topics = pd.read_excel("/Users/robertmccormick/Desktop/Advanced ML/final project/data/removed_spanish_stopwords.xlsx")
bills = pd.read_csv("/Users/robertmccormick/Desktop/Advanced ML/final project/data/115th_clean.csv")
final_topics

Unnamed: 0.1,Unnamed: 0,topics,topic_words,mean_positive,count,words,weights,Window
0,1,0,"[('obamacare', 0.008568612132015228), ('health...",0.117310,1398,"['obamacare', 'health', 'insurance', 'care', '...","[0.008568612132015228, 0.008531346294839339, 0...",1
1,2,1,"[('nationals', 0.016195271509683095), ('baseba...",0.387597,774,"['nationals', 'baseball', 'cubs', 'league', 'i...","[0.016195271509683095, 0.008930316187410938, 0...",1
2,3,2,"[('restaurant', 0.00646270945965236), ('food',...",0.574893,701,"['restaurant', 'food', 'recipe', 'sauce', 'che...","[0.00646270945965236, 0.006035357343149471, 0....",1
3,4,3,"[('album', 0.01177499259119117), ('music', 0.0...",0.761384,549,"['album', 'music', 'songs', 'song', 'band', 'j...","[0.01177499259119117, 0.010106354264392641, 0....",1
4,5,4,"[('capitals', 0.01946676853954486), ('trotz', ...",0.561111,540,"['capitals', 'trotz', 'goals', 'ovechkin', 'ga...","[0.01946676853954486, 0.014246855036860814, 0....",1
...,...,...,...,...,...,...,...,...
495,46,45,"[('win', 0.011423773806217894), ('wcac', 0.011...",0.787097,155,"['win', 'wcac', '4a', 'championship', 'johns',...","[0.011423773806217894, 0.011407813389731262, 0...",10
496,47,46,"[('cancer', 0.011434200279381218), ('hospital'...",0.233766,154,"['cancer', 'hospital', 'doctors', 'patients', ...","[0.011434200279381218, 0.006656138014150969, 0...",10
497,48,47,"[('jerusalem', 0.03379421185625118), ('israel'...",0.261438,153,"['jerusalem', 'israel', 'palestinian', 'palest...","[0.03379421185625118, 0.02152774780718253, 0.0...",10
498,49,48,"[('opioid', 0.019648905632138128), ('drug', 0....",0.125000,152,"['opioid', 'drug', 'opioids', 'fentanyl', 'add...","[0.019648905632138128, 0.012292262724203263, 0...",10


In [13]:
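# "words" and "weights" are stored as string representations of lists.
# They are parsed with ast.literal_eval when the scores are computed below.
final_topics[["words", "weights"]].dtypes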


In [14]:
bills.columns 

Index(['Unnamed: 0', 'bill_id', 'bill_slug', 'bill_type', 'number', 'bill_uri',
       'title', 'short_title', 'sponsor_title', 'sponsor_id', 'sponsor_name',
       'sponsor_state', 'sponsor_party', 'sponsor_uri', 'gpo_pdf_uri',
       'congressdotgov_url', 'govtrack_url', 'introduced_date', 'active',
       'last_vote', 'house_passage', 'senate_passage', 'enacted', 'vetoed',
       'cosponsors', 'cosponsors_by_party', 'committees', 'committee_codes',
       'subcommittee_codes', 'primary_subject', 'summary', 'summary_short',
       'latest_major_action_date', 'latest_major_action', 'raw_text',
       'cleaned_text'],
      dtype='object')

In [15]:
bills['introduced_date'] = pd.to_datetime(bills['introduced_date'], errors='coerce')

start_date = '2017-01-01'
end_date = '2017-10-31'
# .copy() makes bills_date an independent DataFrame, so columns can be added
# to it later without a SettingWithCopyWarning.
bills_date = bills[(bills['introduced_date'] >= start_date) & (bills['introduced_date'] <= end_date)].copy()


In [16]:
# assign returns a new DataFrame, which avoids the SettingWithCopyWarning
# raised when writing into a filtered slice of bills.
bills_date = bills_date.assign(Month=bills_date['introduced_date'].dt.month)
bills_date


Unnamed: 0.1,Unnamed: 0,bill_id,bill_slug,bill_type,number,bill_uri,title,short_title,sponsor_title,sponsor_id,...,committee_codes,subcommittee_codes,primary_subject,summary,summary_short,latest_major_action_date,latest_major_action,raw_text,cleaned_text,Month
3884,3884,hr4198-115,hr4198,hr,H.R.4198,https://api.propublica.org/congress/v1/115/bil...,To promote the economic security and safety of...,Security and Financial Empowerment Act of 2017,Rep.,R000486,...,['HSED'],[],Crime and Law Enforcement,,,2017-10-31,Referred to the Committee on Education and the...,[Congressional Bills 115th Congress]\n[From th...,115th congress 1st session h. r. 4198 to promo...,10
3885,3885,hr4194-115,hr4194,hr,H.R.4194,https://api.propublica.org/congress/v1/115/bil...,To direct the Mayor of the District of Columbi...,To direct the Mayor of the District of Columbi...,Del.,N000147,...,['HSGO'],[],Armed Forces and National Security,,,2017-10-31,Referred to the House Committee on Oversight a...,[Congressional Bills 115th Congress]\n[From th...,115th congress 1st session h. r. 4194 to direc...,10
3886,3886,hjres120-115,hjres120,hjres,H.J.RES.120,https://api.propublica.org/congress/v1/115/bil...,Proposing an amendment to the Constitution of ...,Proposing an amendment to the Constitution of ...,Rep.,C001068,...,['HSJU'],['HSJU10'],Crime and Law Enforcement,,,2017-11-02,Sponsor introductory remarks on measure. (CR H...,[Congressional Bills 115th Congress]\n[From th...,115th congress 1st session h. j. res. 120 prop...,10
3887,3887,hr4181-115,hr4181,hr,H.R.4181,https://api.propublica.org/congress/v1/115/bil...,To amend the Higher Education Act of 1965 rega...,POST Act of 2017,Rep.,C001068,...,['HSED'],[],Education,,,2017-10-31,Referred to the House Committee on Education a...,[Congressional Bills 115th Congress]\n[From th...,115th congress 1st session h. r. 4181 to amend...,10
3888,3888,hr4186-115,hr4186,hr,H.R.4186,https://api.propublica.org/congress/v1/115/bil...,"To amend title 18, United States Code, to prot...",Lori Jackson Domestic Violence Survivor Protec...,Rep.,H001047,...,['HSJU'],['HSJU08'],Crime and Law Enforcement,Lori Jackson Domestic Violence Survivor Protec...,Lori Jackson Domestic Violence Survivor Protec...,2017-11-17,"Referred to the Subcommittee on Crime, Terrori...",[Congressional Bills 115th Congress]\n[From th...,115th congress 1st session h. r. 4186 to amend...,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8871,8871,hres13-115,hres13,hres,H.RES.13,https://api.propublica.org/congress/v1/115/bil...,Expressing the sense of the House of Represent...,Expressing the sense of the House of Represent...,Rep.,J000032,...,['HSHM'],[],Emergency Management,Expresses the sense of the House of Representa...,Expresses the sense of the House of Representa...,2017-01-03,Referred to the House Committee on Homeland Se...,[Congressional Bills 115th Congress]\n[From th...,115th congress 1st session h. res. 13 expressi...,1
8872,8872,hconres2-115,hconres2,hconres,H.CON.RES.2,https://api.propublica.org/congress/v1/115/bil...,To authorize the use of United States Armed Fo...,Authorization for Use of Military Force Agains...,Rep.,C001053,...,['HSFA'],[],International Affairs,Authorization for Use of Military Force Agains...,Authorization for Use of Military Force Agains...,2017-01-03,Referred to the House Committee on Foreign Aff...,[Congressional Bills 115th Congress]\n[From th...,115th congress 1st session h. con. res. 2 to a...,1
8873,8873,hconres3-115,hconres3,hconres,H.CON.RES.3,https://api.propublica.org/congress/v1/115/bil...,Recognizing former United States Federal Judge...,Recognizing former United States Federal Judge...,Rep.,G000553,...,['HSJU'],['HSJU10'],"Civil Rights and Liberties, Minority Issues",Recognizes former federal judge Frank Minis Jo...,Recognizes former federal judge Frank Minis Jo...,2017-01-11,Referred to the Subcommittee on the Constituti...,[Congressional Bills 115th Congress]\n[From th...,115th congress 1st session h. con. res. 3 reco...,1
8874,8874,hconres1-115,hconres1,hconres,H.CON.RES.1,https://api.propublica.org/congress/v1/115/bil...,Regarding consent to assemble outside the seat...,Regarding consent to assemble outside the seat...,Rep.,S000250,...,[],[],Congress,(This measure has not been amended since it wa...,(This measure has not been amended since it wa...,2017-01-04,Received in the Senate.,[Congressional Bills 115th Congress]\n[From th...,115th congress 1st session h. con. res. 1 in t...,1


In [17]:
final_topics = final_topics.reset_index(drop=True)
final_topics

Unnamed: 0.1,Unnamed: 0,topics,topic_words,mean_positive,count,words,weights,Window
0,1,0,"[('obamacare', 0.008568612132015228), ('health...",0.117310,1398,"['obamacare', 'health', 'insurance', 'care', '...","[0.008568612132015228, 0.008531346294839339, 0...",1
1,2,1,"[('nationals', 0.016195271509683095), ('baseba...",0.387597,774,"['nationals', 'baseball', 'cubs', 'league', 'i...","[0.016195271509683095, 0.008930316187410938, 0...",1
2,3,2,"[('restaurant', 0.00646270945965236), ('food',...",0.574893,701,"['restaurant', 'food', 'recipe', 'sauce', 'che...","[0.00646270945965236, 0.006035357343149471, 0....",1
3,4,3,"[('album', 0.01177499259119117), ('music', 0.0...",0.761384,549,"['album', 'music', 'songs', 'song', 'band', 'j...","[0.01177499259119117, 0.010106354264392641, 0....",1
4,5,4,"[('capitals', 0.01946676853954486), ('trotz', ...",0.561111,540,"['capitals', 'trotz', 'goals', 'ovechkin', 'ga...","[0.01946676853954486, 0.014246855036860814, 0....",1
...,...,...,...,...,...,...,...,...
495,46,45,"[('win', 0.011423773806217894), ('wcac', 0.011...",0.787097,155,"['win', 'wcac', '4a', 'championship', 'johns',...","[0.011423773806217894, 0.011407813389731262, 0...",10
496,47,46,"[('cancer', 0.011434200279381218), ('hospital'...",0.233766,154,"['cancer', 'hospital', 'doctors', 'patients', ...","[0.011434200279381218, 0.006656138014150969, 0...",10
497,48,47,"[('jerusalem', 0.03379421185625118), ('israel'...",0.261438,153,"['jerusalem', 'israel', 'palestinian', 'palest...","[0.03379421185625118, 0.02152774780718253, 0.0...",10
498,49,48,"[('opioid', 0.019648905632138128), ('drug', 0....",0.125000,152,"['opioid', 'drug', 'opioids', 'fentanyl', 'add...","[0.019648905632138128, 0.012292262724203263, 0...",10


# Function that creates final data frame and applies scores by topic as columns for each bill.

In [18]:
import ast
import warnings

# Adding the Topic/Sentiment columns one at a time triggers pandas'
# PerformanceWarning about fragmentation; it is silenced locally in the
# loop below, where the columns are created.
total_bills = bills_date.copy()

for idx, row in bills_date.iterrows():
  
    topics_for_month = final_topics[final_topics["Window"] == row["Month"]]

    

    for idx2, topic_row in topics_for_month.iterrows():
        
        topic_col_name = f'Topic {topic_row["Unnamed: 0"]}'
        sentiment_col_name = f'Sentiment {topic_row["Unnamed: 0"]}'

        
        if topic_col_name not in total_bills.columns:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore", category=pd.errors.PerformanceWarning)
                total_bills[topic_col_name] = None
                total_bills[sentiment_col_name] = None 

        
        score = get_score_weights(row["cleaned_text"], ast.literal_eval(topic_row["words"]),ast.literal_eval(topic_row["weights"]) )

        total_bills.at[idx, topic_col_name] = score
        total_bills.loc[idx, sentiment_col_name] = topic_row["mean_positive"]
        

    



total_bills

Unnamed: 0.1,Unnamed: 0,bill_id,bill_slug,bill_type,number,bill_uri,title,short_title,sponsor_title,sponsor_id,...,Topic 46,Sentiment 46,Topic 47,Sentiment 47,Topic 48,Sentiment 48,Topic 49,Sentiment 49,Topic 50,Sentiment 50
3884,3884,hr4198-115,hr4198,hr,H.R.4198,https://api.propublica.org/congress/v1/115/bil...,To promote the economic security and safety of...,Security and Financial Empowerment Act of 2017,Rep.,R000486,...,0.001646,0.787097,0.00146,0.233766,0.005072,0.261438,0.002448,0.125,0.003685,0.195946
3885,3885,hr4194-115,hr4194,hr,H.R.4194,https://api.propublica.org/congress/v1/115/bil...,To direct the Mayor of the District of Columbi...,To direct the Mayor of the District of Columbi...,Del.,N000147,...,0.002762,0.787097,0.001808,0.233766,0.005079,0.261438,0.002368,0.125,0.004394,0.195946
3886,3886,hjres120-115,hjres120,hjres,H.J.RES.120,https://api.propublica.org/congress/v1/115/bil...,Proposing an amendment to the Constitution of ...,Proposing an amendment to the Constitution of ...,Rep.,C001068,...,0.002373,0.787097,0.001492,0.233766,0.005338,0.261438,0.002172,0.125,0.005603,0.195946
3887,3887,hr4181-115,hr4181,hr,H.R.4181,https://api.propublica.org/congress/v1/115/bil...,To amend the Higher Education Act of 1965 rega...,POST Act of 2017,Rep.,C001068,...,0.001843,0.787097,0.001363,0.233766,0.003697,0.261438,0.001884,0.125,0.003588,0.195946
3888,3888,hr4186-115,hr4186,hr,H.R.4186,https://api.propublica.org/congress/v1/115/bil...,"To amend title 18, United States Code, to prot...",Lori Jackson Domestic Violence Survivor Protec...,Rep.,H001047,...,0.002946,0.787097,0.001591,0.233766,0.004896,0.261438,0.002189,0.125,0.004604,0.195946
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8871,8871,hres13-115,hres13,hres,H.RES.13,https://api.propublica.org/congress/v1/115/bil...,Expressing the sense of the House of Represent...,Expressing the sense of the House of Represent...,Rep.,J000032,...,0.003473,0.613636,0.001307,0.407692,0.003068,0.09375,0.004326,0.062992,0.003041,0.420635
8872,8872,hconres2-115,hconres2,hconres,H.CON.RES.2,https://api.propublica.org/congress/v1/115/bil...,To authorize the use of United States Armed Fo...,Authorization for Use of Military Force Agains...,Rep.,C001053,...,0.002345,0.613636,0.001214,0.407692,0.002477,0.09375,0.003783,0.062992,0.001682,0.420635
8873,8873,hconres3-115,hconres3,hconres,H.CON.RES.3,https://api.propublica.org/congress/v1/115/bil...,Recognizing former United States Federal Judge...,Recognizing former United States Federal Judge...,Rep.,G000553,...,0.001861,0.613636,0.002067,0.407692,0.002483,0.09375,0.003948,0.062992,0.001641,0.420635
8874,8874,hconres1-115,hconres1,hconres,H.CON.RES.1,https://api.propublica.org/congress/v1/115/bil...,Regarding consent to assemble outside the seat...,Regarding consent to assemble outside the seat...,Rep.,S000250,...,0.002894,0.613636,0.001419,0.407692,0.00192,0.09375,0.004153,0.062992,0.002372,0.420635


In [19]:
final_bills_large  = total_bills.copy()
final_bills_large.to_csv('/Users/robertmccormick/Desktop/Advanced ML/final project/data/final_bills_large_weighted.csv', index=False)
final_bills_large

Unnamed: 0.1,Unnamed: 0,bill_id,bill_slug,bill_type,number,bill_uri,title,short_title,sponsor_title,sponsor_id,...,Topic 46,Sentiment 46,Topic 47,Sentiment 47,Topic 48,Sentiment 48,Topic 49,Sentiment 49,Topic 50,Sentiment 50
3884,3884,hr4198-115,hr4198,hr,H.R.4198,https://api.propublica.org/congress/v1/115/bil...,To promote the economic security and safety of...,Security and Financial Empowerment Act of 2017,Rep.,R000486,...,0.001646,0.787097,0.00146,0.233766,0.005072,0.261438,0.002448,0.125,0.003685,0.195946
3885,3885,hr4194-115,hr4194,hr,H.R.4194,https://api.propublica.org/congress/v1/115/bil...,To direct the Mayor of the District of Columbi...,To direct the Mayor of the District of Columbi...,Del.,N000147,...,0.002762,0.787097,0.001808,0.233766,0.005079,0.261438,0.002368,0.125,0.004394,0.195946
3886,3886,hjres120-115,hjres120,hjres,H.J.RES.120,https://api.propublica.org/congress/v1/115/bil...,Proposing an amendment to the Constitution of ...,Proposing an amendment to the Constitution of ...,Rep.,C001068,...,0.002373,0.787097,0.001492,0.233766,0.005338,0.261438,0.002172,0.125,0.005603,0.195946
3887,3887,hr4181-115,hr4181,hr,H.R.4181,https://api.propublica.org/congress/v1/115/bil...,To amend the Higher Education Act of 1965 rega...,POST Act of 2017,Rep.,C001068,...,0.001843,0.787097,0.001363,0.233766,0.003697,0.261438,0.001884,0.125,0.003588,0.195946
3888,3888,hr4186-115,hr4186,hr,H.R.4186,https://api.propublica.org/congress/v1/115/bil...,"To amend title 18, United States Code, to prot...",Lori Jackson Domestic Violence Survivor Protec...,Rep.,H001047,...,0.002946,0.787097,0.001591,0.233766,0.004896,0.261438,0.002189,0.125,0.004604,0.195946
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8871,8871,hres13-115,hres13,hres,H.RES.13,https://api.propublica.org/congress/v1/115/bil...,Expressing the sense of the House of Represent...,Expressing the sense of the House of Represent...,Rep.,J000032,...,0.003473,0.613636,0.001307,0.407692,0.003068,0.09375,0.004326,0.062992,0.003041,0.420635
8872,8872,hconres2-115,hconres2,hconres,H.CON.RES.2,https://api.propublica.org/congress/v1/115/bil...,To authorize the use of United States Armed Fo...,Authorization for Use of Military Force Agains...,Rep.,C001053,...,0.002345,0.613636,0.001214,0.407692,0.002477,0.09375,0.003783,0.062992,0.001682,0.420635
8873,8873,hconres3-115,hconres3,hconres,H.CON.RES.3,https://api.propublica.org/congress/v1/115/bil...,Recognizing former United States Federal Judge...,Recognizing former United States Federal Judge...,Rep.,G000553,...,0.001861,0.613636,0.002067,0.407692,0.002483,0.09375,0.003948,0.062992,0.001641,0.420635
8874,8874,hconres1-115,hconres1,hconres,H.CON.RES.1,https://api.propublica.org/congress/v1/115/bil...,Regarding consent to assemble outside the seat...,Regarding consent to assemble outside the seat...,Rep.,S000250,...,0.002894,0.613636,0.001419,0.407692,0.00192,0.09375,0.004153,0.062992,0.002372,0.420635


In [20]:
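# The first 37 columns are bill metadata (36 original columns plus Month); the next 40
# are the Topic/Sentiment pairs for topics 1-20, giving 77 columns in total.
final_bills_small = final_bills_large.iloc[:, :77]
final_bills_small.to_csv('/Users/robertmccormick/Desktop/Advanced ML/final project/data/final_bills_small_weighted.csv', index=False)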

In [50]:
# Keep only the 100 Topic/Sentiment columns (everything after the 37 metadata columns).
pca_df = final_bills_large.iloc[:, 37:]
pca_df


Unnamed: 0,Topic 1,Sentiment 1,Topic 2,Sentiment 2,Topic 3,Sentiment 3,Topic 4,Sentiment 4,Topic 5,Sentiment 5,...,Topic 46,Sentiment 46,Topic 47,Sentiment 47,Topic 48,Sentiment 48,Topic 49,Sentiment 49,Topic 50,Sentiment 50
3884,0.003469,0.099167,0.003252,0.169715,0.002177,0.143678,0.001206,0.410405,0.001867,0.211538,...,0.001646,0.787097,0.00146,0.233766,0.005072,0.261438,0.002448,0.125,0.003685,0.195946
3885,0.003923,0.099167,0.004808,0.169715,0.003799,0.143678,0.001414,0.410405,0.001652,0.211538,...,0.002762,0.787097,0.001808,0.233766,0.005079,0.261438,0.002368,0.125,0.004394,0.195946
3886,0.004728,0.099167,0.004229,0.169715,0.0037,0.143678,0.001309,0.410405,0.00199,0.211538,...,0.002373,0.787097,0.001492,0.233766,0.005338,0.261438,0.002172,0.125,0.005603,0.195946
3887,0.003958,0.099167,0.002672,0.169715,0.001967,0.143678,0.00088,0.410405,0.001341,0.211538,...,0.001843,0.787097,0.001363,0.233766,0.003697,0.261438,0.001884,0.125,0.003588,0.195946
3888,0.003722,0.099167,0.004291,0.169715,0.003138,0.143678,0.001575,0.410405,0.001884,0.211538,...,0.002946,0.787097,0.001591,0.233766,0.004896,0.261438,0.002189,0.125,0.004604,0.195946
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8871,0.002402,0.11731,0.002481,0.387597,0.001483,0.574893,0.00111,0.761384,0.002063,0.561111,...,0.003473,0.613636,0.001307,0.407692,0.003068,0.09375,0.004326,0.062992,0.003041,0.420635
8872,0.001947,0.11731,0.002765,0.387597,0.000985,0.574893,0.001055,0.761384,0.001926,0.561111,...,0.002345,0.613636,0.001214,0.407692,0.002477,0.09375,0.003783,0.062992,0.001682,0.420635
8873,0.001981,0.11731,0.002218,0.387597,0.001015,0.574893,0.00167,0.761384,0.001436,0.561111,...,0.001861,0.613636,0.002067,0.407692,0.002483,0.09375,0.003948,0.062992,0.001641,0.420635
8874,0.002467,0.11731,0.002026,0.387597,0.001066,0.574893,0.001311,0.761384,0.001806,0.561111,...,0.002894,0.613636,0.001419,0.407692,0.00192,0.09375,0.004153,0.062992,0.002372,0.420635


# PCA analysis 

In [51]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(pca_df)

pca = PCA()
pca.fit(X)


cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
print(cumulative_variance)

n_components = np.where(cumulative_variance >= 0.95)[0][0] + 1
pca = PCA(n_components=n_components)
principal_components = pca.fit_transform(X)

column_names = [f"PC{i+1}" for i in range(n_components)]
principal_df = pd.DataFrame(data=principal_components, columns=column_names)




[0.17492743 0.31537142 0.44229626 0.55877295 0.65416673 0.73926437
 0.80526537 0.86132472 0.89512617 0.91414396 0.9192371  0.92381412
 0.92816756 0.93236729 0.93603313 0.93963612 0.94317972 0.94656402
 0.94976715 0.95269615 0.9553004  0.95779777 0.96014079 0.96234234
 0.96442186 0.96644356 0.96841205 0.97027515 0.97206302 0.97383487
 0.97556384 0.97710449 0.9786022  0.98007419 0.98149021 0.98284798
 0.9841322  0.98537918 0.98654736 0.98760296 0.98863244 0.98961374
 0.99054187 0.99142954 0.99227184 0.99311054 0.99387854 0.99464364
 0.99535508 0.99603749 0.99669542 0.99731392 0.99790309 0.99842794
 0.9989343  0.99941326 0.99963563 0.99983287 1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.       
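
The component count can be checked by hand against the printed values. The sketch below copies the first 20 cumulative values from the output above and finds where they first reach 0.95; the helper name `components_needed` is an assumption made for this example.

```python
def components_needed(cumulative, threshold=0.95):
    """Return the 1-based count of components at which cumulative >= threshold."""
    for count, value in enumerate(cumulative, start=1):
        if value >= threshold:
            return count
    raise ValueError("threshold is never reached")


# First 20 cumulative explained-variance values copied from the output above.
cumulative_variance_head = [
    0.17492743, 0.31537142, 0.44229626, 0.55877295, 0.65416673, 0.73926437,
    0.80526537, 0.86132472, 0.89512617, 0.91414396, 0.9192371, 0.92381412,
    0.92816756, 0.93236729, 0.93603313, 0.93963612, 0.94317972, 0.94656402,
    0.94976715, 0.95269615,
]

print(components_needed(cumulative_variance_head))
```

For this run the threshold is first reached at the 20th value, which agrees with the 20 columns (PC1 to PC20) of `principal_df`.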

In [52]:
principal_df

Unnamed: 0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,PC11,PC12,PC13,PC14,PC15,PC16,PC17,PC18,PC19,PC20
0,2.323007,-2.030491,0.417065,-2.629868,4.112415,-4.512404,-1.750318,5.697752,0.436233,-0.509256,0.371670,0.204484,0.413278,-0.095539,0.322021,0.144843,0.186527,-0.419737,0.541590,0.429093
1,1.019288,-1.826251,0.262051,-2.104244,4.500607,-4.598679,-2.909120,5.862785,-3.088431,0.231067,-0.029148,-0.330180,-0.249809,1.368128,-0.814840,0.458949,-0.120813,-0.313341,-0.520864,-0.069791
2,1.645720,-2.054209,0.006113,-2.142857,4.950369,-5.345003,-2.466043,5.900272,-2.639806,-0.009997,-0.193918,0.619178,0.222482,0.830829,0.204386,0.300944,-0.474825,-0.347486,0.208287,-0.220777
3,2.257788,-1.653079,0.482212,-2.320129,4.085735,-4.489537,-1.417351,5.581553,1.464769,0.394044,-0.275651,-0.237628,0.277210,-0.345134,0.139114,-0.220851,-0.167788,0.504913,-0.050450,-0.232214
4,1.316194,-2.255532,-0.099609,-2.438909,4.269174,-4.984843,-2.157008,5.911649,-1.683756,-0.161836,0.041036,-0.413706,0.652925,0.810090,0.189202,-0.592441,0.508064,0.223306,0.221717,-0.120774
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4987,7.385945,3.549393,-1.136075,3.266528,-1.672932,-0.520607,-0.061313,-0.376113,-1.379749,-0.469706,-0.252406,1.238482,1.369741,-0.653806,0.611185,0.534690,0.300029,0.632659,-0.254808,0.741243
4988,7.613184,3.643783,-1.008998,3.071097,-1.505509,-0.326568,0.336897,-0.655299,-0.032192,-0.656883,-1.904584,1.780182,0.479555,-1.771854,-0.704926,0.444865,0.138436,0.036903,0.430237,0.680624
4989,6.799999,3.206913,-0.125639,2.945822,-1.484215,-0.378001,0.462912,-0.783412,-0.713811,-0.210000,0.339318,0.123096,-0.962394,-1.041034,0.012991,-0.335310,-1.236404,-0.994567,-1.164177,-0.730821
4990,6.588829,3.180833,-0.236356,3.230896,-1.652373,-0.537540,0.429552,-0.365411,-0.545025,-0.096130,-0.086134,0.273485,0.148400,-0.587200,0.136473,-0.160127,-0.131659,-0.612049,-0.754385,-0.261935


In [54]:
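# Attach the principal components to the bill metadata. join() aligns rows on
# the index, so bills_date needs the same 0..n-1 index as principal_df.
final_pca = bills_date.join(principal_df)

# Drop the first column and display the result.
final_pca = final_pca.iloc[:, 1:]
final_pca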

Unnamed: 0.1,index,Unnamed: 0,bill_id,bill_slug,bill_type,number,bill_uri,title,short_title,sponsor_title,...,PC11,PC12,PC13,PC14,PC15,PC16,PC17,PC18,PC19,PC20
0,3884,3884,hr4198-115,hr4198,hr,H.R.4198,https://api.propublica.org/congress/v1/115/bil...,To promote the economic security and safety of...,Security and Financial Empowerment Act of 2017,Rep.,...,0.371670,0.204484,0.413278,-0.095539,0.322021,0.144843,0.186527,-0.419737,0.541590,0.429093
1,3885,3885,hr4194-115,hr4194,hr,H.R.4194,https://api.propublica.org/congress/v1/115/bil...,To direct the Mayor of the District of Columbi...,To direct the Mayor of the District of Columbi...,Del.,...,-0.029148,-0.330180,-0.249809,1.368128,-0.814840,0.458949,-0.120813,-0.313341,-0.520864,-0.069791
2,3886,3886,hjres120-115,hjres120,hjres,H.J.RES.120,https://api.propublica.org/congress/v1/115/bil...,Proposing an amendment to the Constitution of ...,Proposing an amendment to the Constitution of ...,Rep.,...,-0.193918,0.619178,0.222482,0.830829,0.204386,0.300944,-0.474825,-0.347486,0.208287,-0.220777
3,3887,3887,hr4181-115,hr4181,hr,H.R.4181,https://api.propublica.org/congress/v1/115/bil...,To amend the Higher Education Act of 1965 rega...,POST Act of 2017,Rep.,...,-0.275651,-0.237628,0.277210,-0.345134,0.139114,-0.220851,-0.167788,0.504913,-0.050450,-0.232214
4,3888,3888,hr4186-115,hr4186,hr,H.R.4186,https://api.propublica.org/congress/v1/115/bil...,"To amend title 18, United States Code, to prot...",Lori Jackson Domestic Violence Survivor Protec...,Rep.,...,0.041036,-0.413706,0.652925,0.810090,0.189202,-0.592441,0.508064,0.223306,0.221717,-0.120774
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4987,8871,8871,hres13-115,hres13,hres,H.RES.13,https://api.propublica.org/congress/v1/115/bil...,Expressing the sense of the House of Represent...,Expressing the sense of the House of Represent...,Rep.,...,-0.252406,1.238482,1.369741,-0.653806,0.611185,0.534690,0.300029,0.632659,-0.254808,0.741243
4988,8872,8872,hconres2-115,hconres2,hconres,H.CON.RES.2,https://api.propublica.org/congress/v1/115/bil...,To authorize the use of United States Armed Fo...,Authorization for Use of Military Force Agains...,Rep.,...,-1.904584,1.780182,0.479555,-1.771854,-0.704926,0.444865,0.138436,0.036903,0.430237,0.680624
4989,8873,8873,hconres3-115,hconres3,hconres,H.CON.RES.3,https://api.propublica.org/congress/v1/115/bil...,Recognizing former United States Federal Judge...,Recognizing former United States Federal Judge...,Rep.,...,0.339318,0.123096,-0.962394,-1.041034,0.012991,-0.335310,-1.236404,-0.994567,-1.164177,-0.730821
4990,8874,8874,hconres1-115,hconres1,hconres,H.CON.RES.1,https://api.propublica.org/congress/v1/115/bil...,Regarding consent to assemble outside the seat...,Regarding consent to assemble outside the seat...,Rep.,...,-0.086134,0.273485,0.148400,-0.587200,0.136473,-0.160127,-0.131659,-0.612049,-0.754385,-0.261935


In [55]:
final_pca.to_csv('/Users/robertmccormick/Desktop/Advanced ML/final project/data/final_pca_weighted.csv')

In [33]:
final_pca

Unnamed: 0.1,Unnamed: 0,bill_id,bill_slug,bill_type,number,bill_uri,title,short_title,sponsor_title,sponsor_id,...,PC11,PC12,PC13,PC14,PC15,PC16,PC17,PC18,PC19,PC20
3884,3884,hr4198-115,hr4198,hr,H.R.4198,https://api.propublica.org/congress/v1/115/bil...,To promote the economic security and safety of...,Security and Financial Empowerment Act of 2017,Rep.,R000486,...,-0.192693,0.316541,0.305797,-0.302148,-0.205310,-0.284881,0.000398,-0.639528,0.423800,0.166096
3885,3885,hr4194-115,hr4194,hr,H.R.4194,https://api.propublica.org/congress/v1/115/bil...,To direct the Mayor of the District of Columbi...,To direct the Mayor of the District of Columbi...,Del.,N000147,...,0.396020,-0.458572,-0.245778,-0.075398,0.205259,-0.037999,-0.825235,-0.564465,-1.111216,0.319677
3886,3886,hjres120-115,hjres120,hjres,H.J.RES.120,https://api.propublica.org/congress/v1/115/bil...,Proposing an amendment to the Constitution of ...,Proposing an amendment to the Constitution of ...,Rep.,C001068,...,1.299166,-1.181868,-0.354268,-0.286531,-0.438750,0.428335,0.036069,1.259122,-0.108619,-0.054720
3887,3887,hr4181-115,hr4181,hr,H.R.4181,https://api.propublica.org/congress/v1/115/bil...,To amend the Higher Education Act of 1965 rega...,POST Act of 2017,Rep.,C001068,...,-0.505990,0.227894,-0.641409,0.620496,0.449717,-0.301923,-0.288209,0.472881,0.040873,0.038390
3888,3888,hr4186-115,hr4186,hr,H.R.4186,https://api.propublica.org/congress/v1/115/bil...,"To amend title 18, United States Code, to prot...",Lori Jackson Domestic Violence Survivor Protec...,Rep.,H001047,...,1.477354,-1.327159,-0.153487,-1.398784,-0.104037,-0.088486,0.290064,-0.105980,0.579084,1.398548
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8871,8871,hres13-115,hres13,hres,H.RES.13,https://api.propublica.org/congress/v1/115/bil...,Expressing the sense of the House of Represent...,Expressing the sense of the House of Represent...,Rep.,J000032,...,,,,,,,,,,
8872,8872,hconres2-115,hconres2,hconres,H.CON.RES.2,https://api.propublica.org/congress/v1/115/bil...,To authorize the use of United States Armed Fo...,Authorization for Use of Military Force Agains...,Rep.,C001053,...,,,,,,,,,,
8873,8873,hconres3-115,hconres3,hconres,H.CON.RES.3,https://api.propublica.org/congress/v1/115/bil...,Recognizing former United States Federal Judge...,Recognizing former United States Federal Judge...,Rep.,G000553,...,,,,,,,,,,
8874,8874,hconres1-115,hconres1,hconres,H.CON.RES.1,https://api.propublica.org/congress/v1/115/bil...,Regarding consent to assemble outside the seat...,Regarding consent to assemble outside the seat...,Rep.,S000250,...,,,,,,,,,,
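
In the `In [33]` output above, the PC columns are NaN for the rows labelled 8871 to 8874, while `principal_df` is indexed from 0 to 4990. That pattern is consistent with `DataFrame.join` aligning on index labels rather than on row position. The sketch below reproduces the effect on two toy frames (`left` and `right` are hypothetical stand-ins for `bills_date` and `principal_df`) and shows one way to align by position instead.

```python
import pandas as pd

# Toy stand-ins: `left` keeps labels from an earlier filter, while
# `right` has a fresh 0..n-1 index, as a PCA output frame would.
left = pd.DataFrame({"bill_id": ["a", "b", "c"]}, index=[2, 3, 4])
right = pd.DataFrame({"PC1": [0.1, 0.2, 0.3]})

# join() matches on index labels: only label 2 exists in both frames,
# and it pairs bill "a" with the third PC row, not the first.
misaligned = left.join(right)

# Resetting the left index first aligns the rows by position.
aligned = left.reset_index(drop=True).join(right)

print(misaligned)
print(aligned)
```

Note that the misaligned join not only produces NaN for unmatched labels but also silently pairs matched labels with the wrong rows, so checking for NaN alone would not catch the problem.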


In [26]:
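# Split the feature columns into media-topic similarity scores
# (names containing "Topic") and sentiment scores (everything else).
media_cols = [name for name in pca_df.columns if "Topic" in name]
sentiment = [name for name in pca_df.columns if "Topic" not in name]

sentiment_df = pca_df.loc[:, sentiment]
media_score = pca_df.loc[:, media_cols]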

In [30]:
sentiment_df.describe()

Unnamed: 0,Sentiment 1,Sentiment 2,Sentiment 3,Sentiment 4,Sentiment 5,Sentiment 6,Sentiment 7,Sentiment 8,Sentiment 9,Sentiment 10,...,Sentiment 41,Sentiment 42,Sentiment 43,Sentiment 44,Sentiment 45,Sentiment 46,Sentiment 47,Sentiment 48,Sentiment 49,Sentiment 50
count,4992.0,4992.0,4992.0,4992.0,4992.0,4992.0,4992.0,4992.0,4992.0,4992.0,...,4992.0,4992.0,4992.0,4992.0,4992.0,4992.0,4992.0,4992.0,4992.0,4992.0
unique,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,...,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
top,0.11731,0.387597,0.574893,0.761384,0.561111,0.230303,0.138716,0.247379,0.796976,0.171569,...,0.107143,0.306569,0.874074,0.11194,0.356061,0.613636,0.407692,0.09375,0.062992,0.420635
freq,910.0,910.0,910.0,910.0,910.0,910.0,910.0,910.0,910.0,910.0,...,910.0,910.0,910.0,910.0,910.0,910.0,910.0,910.0,910.0,910.0


In [31]:
media_score.describe()

Unnamed: 0,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,Topic 6,Topic 7,Topic 8,Topic 9,Topic 10,...,Topic 41,Topic 42,Topic 43,Topic 44,Topic 45,Topic 46,Topic 47,Topic 48,Topic 49,Topic 50
count,4992.0,4992.0,4992.0,4992.0,4992.0,4992.0,4992.0,4992.0,4992.0,4992.0,...,4992.0,4992.0,4992.0,4992.0,4992.0,4992.0,4992.0,4992.0,4992.0,4992.0
unique,4967.0,4967.0,4967.0,4967.0,4967.0,4967.0,4967.0,4967.0,4967.0,4967.0,...,4967.0,4967.0,4967.0,4967.0,4967.0,4967.0,4967.0,4967.0,4967.0,4967.0
top,0.002697,0.002486,0.001493,0.001347,0.001259,0.002307,0.003237,0.001901,0.001402,0.001296,...,0.002277,0.004089,0.001225,0.006547,0.004238,0.002821,0.007232,0.001957,0.001967,0.005911
freq,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,...,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0
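
The `count`, `unique`, `top`, and `freq` rows in the two summaries above are what `describe()` reports for non-numeric (for example object-dtype) columns, which suggests the scores may be stored as something other than floats. The sketch below uses a toy column (an assumption, not the real `pca_df`) to show how converting with `pd.to_numeric` changes the summary to means and quantiles.

```python
import pandas as pd

# Toy column stored as strings (assumption: mimics scores read as text).
scores = pd.DataFrame({"Topic 1": ["0.002", "0.004", "0.002", "0.006"]})

# Non-numeric dtype: describe() reports count, unique, top, freq.
as_text = scores.describe()

# Convert each column to numbers, then describe again to get
# count, mean, std, min, quartiles, and max.
numeric = scores.apply(pd.to_numeric)
as_numbers = numeric.describe()

print(list(as_text.index))
print(list(as_numbers.index))
```

If the real columns are already numeric, the conversion is harmless and the summary layout would point to some other cause.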
