# TF-IDF for Sponsored Legislation
The goal is to generate characteristic words and phrases from each legislator's sponsored bills, so that viewers can quickly get a good idea of that legislator's work in Congress.

Steps:

1. Connect to either the relational Postgres DB or the Mongo DB, get the bill descriptions

2. Create a single document per legislator by concatenating all of that legislator's bill descriptions

3. Use sklearn to calculate tf-idf for all 1/2/3 word combinations

4. Take the top 10 by tf-idf per legislator

5. Create a new table in the Postgres DB with these characteristic words
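
The "1/2/3 word combinations" in step 3 are word n-grams of length one to three. As a rough illustration of what gets enumerated, here is a minimal pure-Python sketch. `word_ngrams` is a hypothetical helper, and its tokenization (lowercase, split on whitespace) is simpler than what scikit-learn applies.

```python
def word_ngrams(text, min_n=1, max_n=3):
    """Return all word n-grams of length min_n..max_n, shortest first."""
    tokens = text.lower().split()
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the text
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

print(word_ngrams("interstate highways bill"))
```

Every n-gram becomes its own column in the TF-IDF matrix, which is why the feature count grows quickly once bigrams and trigrams are included.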

In [3]:
import numpy as np
import pandas as pd
import psycopg
from sqlalchemy import create_engine
import os
import dotenv
from sklearn.feature_extraction.text import TfidfTransformer,TfidfVectorizer

dotenv.load_dotenv()
POSTGRES_PASSWORD = os.getenv('POSTGRES_PASSWORD')

### Connect to the relational Postgres DB, get the bill descriptions

In [5]:
dbms = 'postgresql'
package = 'psycopg'
user = 'postgres'
password = POSTGRES_PASSWORD
host = 'localhost'
port = '5432'
db = 'contrans'

engine = create_engine(f'{dbms}+{package}://{user}:{password}@{host}:{port}/{db}')
engine

Engine(postgresql+psycopg://postgres:***@localhost:5432/contrans)
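
One caveat with building the URL from an f-string: a password containing characters such as `@` or `/` would be misread as part of the URL structure. A minimal sketch of one way to guard against that, assuming percent-encoding with the standard library is acceptable here. `build_db_url` is a hypothetical helper and the password shown is made up.

```python
from urllib.parse import quote_plus

def build_db_url(user, password, host, port, db,
                 dbms="postgresql", package="psycopg"):
    """Assemble a SQLAlchemy-style URL, percent-encoding the credentials."""
    return (f"{dbms}+{package}://{quote_plus(user)}:{quote_plus(password)}"
            f"@{host}:{port}/{db}")

# Hypothetical password with characters that need escaping
url = build_db_url("postgres", "p@ss/word", "localhost", "5432", "contrans")
print(url)
```

The resulting string can be passed to `create_engine` in place of the hand-built one.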

In [8]:
myquery = '''
SELECT b.bioguide_id,bv.text
FROM bill_versions bv
INNER JOIN bills b 
    ON bv.bill_type = b.bill_type AND bv.bill_number = b.bill_number
'''
bill_description = pd.read_sql_query(myquery, con=engine)

In [10]:
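import re

# Non-greedy match, so each tag is removed on its own rather than everything
# between the first '<' and the last '>'.
TAG_PATTERN = re.compile(r'<.*?>')

def strip_html_tags(text):
    """Remove HTML tags from a bill description, leaving the text content."""
    return TAG_PATTERN.sub('', text)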

bill_description['text'] = bill_description['text'].apply(strip_html_tags)
bill_description

| | bioguide_id | text |
|---|---|---|
| 0 | J000293 | Shutdown Fairness ActThis bill provides approp... |
| 1 | O000173 | Federal Worker Childcare Protection Act of 202... |
| 2 | A000380 | This bill requires the federal government to r... |
| 3 | S001194 | Federal Employees Civil Relief Act This bill e... |
| 4 | S001203 | Fair Pay for Federal Contractors Act of 2025Th... |
| ... | ... | ... |
| 2746 | M001233 | This joint resolution nullifies the final rule... |
| 2747 | C001129 | Laken Riley ActThis bill requires the Departme... |
| 2748 | B001319 | Laken Riley ActThis bill requires the Departme... |
| 2749 | L000598 | This resolution disapproves of the Central Bus... |
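
In the output above, removing tags without a replacement fuses a bill's title with the sentence that follows it ("Shutdown Fairness ActThis bill"), and HTML entities such as `&nbsp;` are left in place. A possible variant, sketched here as an assumption rather than what the notebook ran, replaces each tag with a space and decodes entities before collapsing whitespace. `clean_description` and the sample markup are hypothetical.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]*>")
SPACE_RE = re.compile(r"\s+")

def clean_description(text):
    """Strip tags, decode HTML entities, and collapse runs of whitespace."""
    no_tags = TAG_RE.sub(" ", text)      # a space keeps adjacent words apart
    decoded = html.unescape(no_tags)     # e.g. &nbsp; becomes a non-breaking space
    return SPACE_RE.sub(" ", decoded).strip()

# Hypothetical markup, not taken from the database
sample = "<p><b>Example Act</b></p><p>This bill&nbsp;requires a report.</p>"
print(clean_description(sample))
```

Keeping the words separate matters for TF-IDF, because a fused token like "ActThis" would otherwise be counted as a rare term of its own.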


In [11]:
bill_description = bill_description.groupby('bioguide_id').agg({'text': ' '.join}).reset_index()

In [14]:
bill_description

| | bioguide_id | text |
|---|---|---|
| 0 | A000055 | Departments of Labor, Health and Human Service... |
| 1 | A000148 | Supporting Transit Commutes Act&nbsp;This bill... |
| 2 | A000369 | Coin Metal Modification Authorization and Cost... |
| 3 | A000370 | This resolution recognizes (1) the Greensboro ... |
| 4 | A000371 | No Hungry Kids in Schools ActThis bill directs... |
| ... | ... | ... |
| 485 | W000829 | Facility for Runway Operations and Safe Transp... |
| 486 | W000830 | Chiquita Canyon Tax Relief ActThis bill exclud... |
| 487 | Y000064 | Synthetic Biology Advancement Act of 2025This ... |
| 488 | Y000067 | This resolution supports the designation of Na... |


In [12]:
x = ['have', 'a', 'nice', 'day']
x

['have', 'a', 'nice', 'day']

In [13]:
"!,".join(x)

'have!,a!,nice!,day'
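
The `' '.join` passed to `agg` in the aggregation above works the same way: within each `bioguide_id` group, all of the text values are joined with a single space. A minimal pure-Python sketch of that grouping step, using made-up rows and hypothetical IDs:

```python
from collections import defaultdict

rows = [  # hypothetical (bioguide_id, text) pairs
    ("A000001", "This bill funds highways."),
    ("B000002", "This resolution honors teachers."),
    ("A000001", "This bill regulates drones."),
]

# Collect every description under its sponsor
texts_by_member = defaultdict(list)
for bioguide_id, text in rows:
    texts_by_member[bioguide_id].append(text)

# One document per legislator, descriptions separated by a space
documents = {member: " ".join(texts) for member, texts in texts_by_member.items()}
print(documents["A000001"])
```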

In [16]:
tfIdfVectorizer = TfidfVectorizer(stop_words='english',
                                  max_df=0.8,
                                  ngram_range=(1,3))   
tfIdf = tfIdfVectorizer.fit_transform(bill_description['text'])


In [17]:
tfIdf

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 473354 stored elements and shape (490, 260231)>
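
Each of the 490 rows is one legislator and each of the 260,231 columns is one n-gram. With its default settings, `TfidfVectorizer` weights a term by its raw count times a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit L2 length. The sketch below reproduces that arithmetic in pure Python on made-up documents. It covers unigrams only and ignores `stop_words` and `max_df`, so it illustrates the weighting, not the full vectorizer.

```python
import math
from collections import Counter

docs = {  # hypothetical per-legislator documents
    "A000001": "highways highways agricultural vehicles",
    "B000002": "drug advertising drug sampling",
    "C000003": "highways drug appropriations",
}

counts = {member: Counter(text.split()) for member, text in docs.items()}
n_docs = len(docs)
# Number of documents each term appears in
doc_freq = Counter(term for c in counts.values() for term in c)

def tfidf_row(term_counts):
    """Raw count times smoothed idf, then L2-normalised."""
    raw = {t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
           for t, tf in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = {member: tfidf_row(c) for member, c in counts.items()}
top = sorted(weights["A000001"].items(), key=lambda kv: kv[1], reverse=True)
print(top[0][0])
```

A term scores highly when it is frequent in one legislator's document and rare across the others, which is what makes the top-scoring n-grams characteristic of that legislator.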

In [29]:
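def tfidf_one_legislator(index):
    """Return the 10 highest-scoring n-grams for the legislator at row `index`."""
    # Turn the legislator's sparse row into a column labeled by n-gram
    tfidf_data = pd.DataFrame(tfIdf[index].T.todense(),
                index=tfIdfVectorizer.get_feature_names_out(),
                columns=['TF-IDF']).sort_values('TF-IDF', ascending=False).head(10)

    # Rows of tfIdf are in the same order as the rows of bill_description
    tfidf_data['bioguide_id'] = bill_description['bioguide_id'][index]
    tfidf_data = tfidf_data.reset_index()
    tfidf_data = tfidf_data.rename({'index':'keyword'}, axis=1)
    return tfidf_data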

In [30]:
tfidf_one_legislator(100)

| | keyword | TF-IDF | bioguide_id |
|---|---|---|---|
| 0 | advertising | 0.136238 | D000216 |
| 1 | sampling | 0.134639 | D000216 |
| 2 | fda | 0.121398 | D000216 |
| 3 | drug | 0.098595 | D000216 |
| 4 | direct consumer advertising | 0.091144 | D000216 |
| 5 | consumer advertising | 0.091144 | D000216 |
| 6 | animal feeding | 0.091144 | D000216 |
| 7 | appropriations | 0.089683 | D000216 |
| 8 | cr | 0.086844 | D000216 |
| 9 | feeding | 0.083509 | D000216 |


In [31]:
tfidf_list = [tfidf_one_legislator(i) for i in range(len(bill_description))]

In [41]:
tfidf_fulldata = pd.concat(tfidf_list)
tfidf_fulldata = tfidf_fulldata.rename({'TF-IDF':'tf_idf'}, axis=1)
tfidf_fulldata

| | keyword | tf_idf | bioguide_id |
|---|---|---|---|
| 0 | education | 0.127234 | A000055 |
| 1 | health | 0.124415 | A000055 |
| 2 | labor | 0.124057 | A000055 |
| 3 | administration | 0.119192 | A000055 |
| 4 | safety health | 0.116229 | A000055 |
| ... | ... | ... | ... |
| 5 | fort belknap | 0.131791 | Z000018 |
| 6 | reservation montana | 0.131791 | Z000018 |
| 7 | community | 0.123193 | Z000018 |
| 8 | fort | 0.116811 | Z000018 |


### Create a new table in the Postgres DB with these characteristic words

In [42]:
tfidf_fulldata.to_csv('../data/thirdNF/tfidf.csv', index=False)

In [43]:
tfidf_fulldata.to_sql('tfidf', con=engine, if_exists='replace', index=False, chunksize=1000)

-5

In [45]:
myquery = '''
SELECT t.keyword, t.tf_idf
FROM tfidf t
INNER JOIN members m
    ON t.bioguide_id = m.bioguide_id
WHERE m.state_abbrev = 'VA' AND m.district_code = 5
'''
pd.read_sql_query(myquery, con=engine)

| | keyword | tf_idf |
|---|---|---|
| 0 | virginia | 0.181661 |
| 1 | dc | 0.159028 |
| 2 | highways | 0.126172 |
| 3 | covered agricultural | 0.123099 |
| 4 | agricultural vehicles | 0.123099 |
| 5 | agricultural | 0.111065 |
| 6 | interstate highways | 0.105755 |
| 7 | interstate | 0.097527 |
| 8 | vehicle | 0.084455 |
| 9 | visited | 0.082066 |
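
To look up a different district, the state and district can be passed as bound parameters instead of being written into the SQL text. The sketch below shows the same join with placeholders. It uses an in-memory SQLite database purely as a stand-in for Postgres so that it runs on its own, and all rows and IDs in it are made up. The placeholder syntax (`?` here) depends on the database driver, so the Postgres version may need a different style.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the Postgres database
conn.executescript("""
CREATE TABLE members (bioguide_id TEXT, state_abbrev TEXT, district_code INTEGER);
CREATE TABLE tfidf (keyword TEXT, tf_idf REAL, bioguide_id TEXT);
INSERT INTO members VALUES ('X000001', 'VA', 5), ('X000002', 'VA', 6);
INSERT INTO tfidf VALUES ('highways', 0.2, 'X000001'),
                         ('drones', 0.1, 'X000001'),
                         ('schools', 0.3, 'X000002');
""")

query = """
SELECT t.keyword, t.tf_idf
FROM tfidf t
INNER JOIN members m
    ON t.bioguide_id = m.bioguide_id
WHERE m.state_abbrev = ? AND m.district_code = ?
ORDER BY t.tf_idf DESC
"""
# Values are bound by the driver rather than pasted into the SQL string
rows = conn.execute(query, ("VA", 5)).fetchall()
print(rows)
```

With pandas, the same idea applies through the `params` argument of `pd.read_sql_query`.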
