# Topic Cluster Analysis

This is where the big guns come out. We'll use sentence transformers to turn each bill's topic into an embedding that represents its semantic meaning, then cluster those embeddings to group similar topics together.

In [19]:
import pandas as pd
import numpy as np
import pickle
from openai import OpenAI
from sklearn.cluster import MiniBatchKMeans
from nltk.cluster import cosine_distance
from sentence_transformers import SentenceTransformer, util

Let's grab the topics list we made in the last notebook.

In [20]:
with open("./topics.pkl", "rb") as f:
    topics = pickle.load(f)

In a normal NLP task, this is the part where we would fine-tune the model on our data so that it performs better on this dataset. Unfortunately, that only works well with labeled data, which this data is not. We could go through and manually label every topic, but that would defeat the purpose of using a model to cluster them. Instead, we will rely on what the model learned during its original training, as this will likely be accurate enough for our purposes. We will use the Sentence Transformers (SBERT) model `all-mpnet-base-v2`, which is built on MPNet. The Sentence Transformers documentation lists it as the best-quality general-purpose pretrained model, and it applies to a wide range of contexts without fine-tuning.

In [21]:
model = SentenceTransformer('all-mpnet-base-v2')
embeddings = model.encode(topics)

Now that we have embeddings, we can cluster them to get a baseline understanding of which topics are similar to each other. We'll use mini-batch k-means with 700 clusters.
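One detail worth keeping in mind for the clustering step: k-means assigns points by Euclidean distance, while sentence embeddings are usually compared by cosine similarity. For unit-length vectors the two agree, since `||a - b||^2 = 2 - 2 * cos(a, b)`, so L2-normalizing the embeddings makes k-means group by direction. The sketch below shows this on made-up 3-D vectors. The names `toy` and `unit` and the cluster count of 2 are illustrative assumptions, not part of this notebook's pipeline. As far as I know, `all-mpnet-base-v2` already returns normalized embeddings, in which case this step changes nothing; it matters for models that do not.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)

# Toy "embeddings": two groups pointing in clearly different directions.
group_a = rng.normal(loc=[5.0, 0.0, 0.0], scale=0.1, size=(20, 3))
group_b = rng.normal(loc=[0.0, 5.0, 0.0], scale=0.1, size=(20, 3))
toy = np.vstack([group_a, group_b])

# L2-normalize each row so every vector has length 1.
unit = toy / np.linalg.norm(toy, axis=1, keepdims=True)

# For unit vectors, squared Euclidean distance equals 2 - 2 * cosine similarity.
a, b = unit[0], unit[25]
squared_distance = float(np.sum((a - b) ** 2))
cosine_similarity = float(a @ b)

# Cluster the normalized vectors; random_state makes the toy run repeatable.
toy_clusterer = MiniBatchKMeans(n_clusters=2, n_init=3, random_state=0, batch_size=16)
toy_labels = toy_clusterer.fit_predict(unit)
```

On the real data the same idea would be a single normalization line before `fit_predict`, assuming the embeddings are not already unit length.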

In [33]:
n = 700
clusterer = MiniBatchKMeans(n, init="k-means++", n_init="auto", max_iter=1000000000, batch_size=8192)
clusters = clusterer.fit_predict(embeddings)

Hopefully that's given us something useful. Let's make sure by checking on some big-ticket topics.

In [34]:
topics_series = pd.Series(topics)
clusters_series = pd.Series(clusters)
matched = pd.DataFrame({"Topic": topics_series, "Cluster": clusters_series})

In [35]:
matched.loc[matched["Topic"].str.contains("gun", case=False)]

Unnamed: 0,Topic,Cluster
4623,CONCEALED HANDGUNS,145
6875,Concealed handgun licensing-revisions,145
6876,Concealed handgun-possess in school safety zon...,145
6877,Concealed handguns/firearms control/ self-defe...,145
11348,Electric Guns,481
11357,Electric Projectile Guns,481
14182,GUN TRIGGER SAFETY LOCKS,218
14183,"GUNTER, CITY OF",626
14448,Ghost Guns,481
14863,Gun Safety,145
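
The gun matches above land in several clusters (145, 218, 481, 626), and one of them, "GUNTER, CITY OF", only matches because the substring search is loose. A quick way to summarize that spread is to count matches per cluster. The helper below is a sketch on a toy frame; `clusters_for_keyword` is a hypothetical name, and the toy rows only mimic a few lines of the output above.

```python
import pandas as pd

def clusters_for_keyword(matched, keyword):
    """Count how many topics containing `keyword` fall in each cluster."""
    # regex=False treats the keyword as a literal substring, not a pattern.
    mask = matched["Topic"].str.contains(keyword, case=False, regex=False)
    return matched.loc[mask, "Cluster"].value_counts()

# Toy stand-in for the `matched` frame built above.
toy_matched = pd.DataFrame({
    "Topic": ["Gun Safety", "Ghost Guns", "Electric Guns", "GUNTER, CITY OF", "Abortion"],
    "Cluster": [145, 481, 481, 626, 315],
})

spread = clusters_for_keyword(toy_matched, "gun")
```

A keyword concentrated in one or two clusters suggests the grouping is coherent; a long tail of single matches usually points to substring false positives like the city name.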


In [36]:
matched.loc[matched["Topic"].str.contains("abortion", case=False)]

Unnamed: 0,Topic,Cluster
103,ABORTION,315
104,ABORTION COMPLICATIONS,315
105,ABORTION COMPLICATIONS REPORTING ACT,315
509,Abortion,315
510,Abortion Care,315
511,Abortion Data,315
512,Abortion Data Reporting Act,315
513,Abortion-inducing Drugs,315
514,Abortion-judicial consent for minor-hearing pr...,315
515,Abortion-pregnant minor-judicial consent,315


Cool, topics with the same general subject seem to be grouped together. One issue remains: what do the cluster numbers mean? Looking at the filtered rows above, cluster 315 clearly concerns abortion, but if we saw only one of these topics and its cluster number out of context, we could not easily draw the same conclusion. That leaves the task of naming each cluster, which isn't obvious with unsupervised learning. One way we can try is finding the topic closest to the center of each cluster and using that as a title.

In [37]:
for i, mean in enumerate(clusterer.cluster_centers_):
    clustered_topics = matched.loc[matched["Cluster"] == i]
    big = (-1, 0)
    for j in range(len(clustered_topics)):
        dist = cosine_distance(embeddings[j], mean)
        if dist > big[1]:
            big = (j, dist)
    if big[0] == -1: continue
    print(f"Cluster {i} has topic {clustered_topics.iloc[big[0]]['Topic']}")

Cluster 0 has topic Prohibit electronic text-based communication-while driving a vehicle
Cluster 1 has topic Children, abduction
Cluster 2 has topic STORYBOOK CAPITAL OF TEXAS
Cluster 3 has topic Energy And Water Efficiency Fund
Cluster 4 has topic GOOD CONDUCT TIME
Cluster 5 has topic ECONOMIC DEVELOPMENT, Redevelopment and Renovation of Urban Areas
Cluster 6 has topic Relating To School Construction And Education
Cluster 7 has topic Fish & Game (both)
Cluster 8 has topic Tax Fraud
Cluster 9 has topic Advanced practice registered nurses/ physician assistants-admit patients
Cluster 10 has topic Commercial Offices (s)
Cluster 11 has topic Oversight
Cluster 12 has topic Blind or Deaf Persons
Cluster 13 has topic Kenbridge Town of
Cluster 14 has topic SANDOW MUNICIPAL UTILITY DISTRICT NO. 1
Cluster 15 has topic ABC license applicants, criminal background check process revised
Cluster 16 has topic Event Wagering
Cluster 17 has topic Department of Environmental Quality
Cluster 18 has topic 

Ok, that didn't work as intended: many of these "titles" are not descriptive of their clusters at all. This is going to require significantly bigger armaments, since attaching meaningful names to the output of an unsupervised method is difficult. For this specific task, I think turning to OpenAI's GPT is the best option. It's not free, but for the number of tokens being sent here the cost should hopefully be fairly low.
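
One caveat about the loop above before moving on: it looks up `embeddings[j]` with `j`, a position inside the cluster, rather than the topic's row in `matched`, and it keeps the largest cosine distance rather than the smallest. Both would pull the chosen title away from the cluster center. A minimal sketch of a nearest-to-center lookup follows, using made-up 2-D vectors. `closest_topic_per_cluster` is a hypothetical helper and it uses plain NumPy instead of `nltk`'s `cosine_distance`. It has not been run on the real embeddings here, so whether it yields usable titles is an open question.

```python
import numpy as np

def closest_topic_per_cluster(embeddings, labels, centers, topics):
    """Return {cluster_id: topic} for the member nearest each center (cosine distance)."""
    embeddings = np.asarray(embeddings, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels)
    titles = {}
    for cluster_id, center in enumerate(centers):
        # Row numbers in the full embedding matrix, not positions within the cluster.
        member_rows = np.flatnonzero(labels == cluster_id)
        if member_rows.size == 0:
            continue  # empty cluster: nothing to name
        members = embeddings[member_rows]
        # Assumes no all-zero vectors, which would divide by zero here.
        cos_sim = (members @ center) / (
            np.linalg.norm(members, axis=1) * np.linalg.norm(center)
        )
        # Smallest cosine distance means closest to the center.
        best_row = member_rows[np.argmin(1.0 - cos_sim)]
        titles[cluster_id] = topics[best_row]
    return titles

# Toy data: three topics in cluster 0, two in cluster 1.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [0.0, 1.0], [0.1, 0.9]]
toy_labels = [0, 0, 0, 1, 1]
toy_centers = [[0.9, 0.1], [0.0, 1.0]]
toy_topics = ["a0", "a1", "a2", "b0", "b1"]

toy_titles = closest_topic_per_cluster(toy_embeddings, toy_labels, toy_centers, toy_topics)
```

With the notebook's objects, the call would take `embeddings`, `clusters`, `clusterer.cluster_centers_`, and `topics` in that order.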

In [38]:
with open("openai_key.txt", "r") as f:
    # Strip surrounding whitespace; a trailing newline in the file would corrupt the key.
    api_key = f.read().strip()

In [39]:
# PROMPT
p = """
You are designed to output summaries of the theme of a group of topics in as few words as possible, with a maximum word count of 6 words. 
These topic groups are the given topics for certain bills passed by bodies of legislatures around the United States. Many of these bills refer to the same topic but have varying topic labels, so your job is to group the similar topics together into one topic group.
If there is no discernable theme you will respond with \"Misc\". Your answer should be formatted as simply the determined topic group with each word in title case, separated by commas if necessary. Do not include extra punctuation like quotes, asterisks, or periods.
"""

In [40]:
client = OpenAI(api_key=api_key)
for i in range(10):
  response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
      {"role": "system", "content": p},
      {"role": "user", "content":  f"{matched.loc[matched['Cluster'] == i]['Topic'].to_list()}"}
    ]
  )
  print(f"Cluster {i} has topic {response.choices[0].message.content}")

Cluster 0 has topic Texting While Driving
Cluster 1 has topic Children, Abduction, Labor, Protection, Health, Fatalities
Cluster 2 has topic Texas Cities and Landmarks
Cluster 3 has topic Water Resources Funding
Cluster 4 has topic Ethics, Governmental ethics, Public corruption
Cluster 5 has topic Economic Development, Redevelopment, Urban Revitalization, Districts
Cluster 6 has topic Education Facilities, School Buildings, School Supplies
Cluster 7 has topic Fish and Game
Cluster 8 has topic False Identity, Fraudulent Claims, Forgery, Perjury
Cluster 9 has topic Nursing, Nurse Practitioner, Nurse Education, Nursing Board


In [41]:
matched.loc[matched['Cluster'] == 0]['Topic'].tolist()

['Commercial driver licensing/texting while driving/maximum vehicle lengths',
 'Motor vehicles, wireless telecommunications devices, restrictions on use while driving, provided',
 'Prohibit electronic text-based communication-while driving a vehicle',
 'TEXTING AND DRIVING',
 'Text Messaging While Driving',
 'Texting While Driving',
 'Traffic law photo-monitoring devices-prohibit use of',
 'Wireless Communication Device While Driving']

In [42]:
abortion = matched.loc[matched["Topic"] == "abortion"]["Cluster"].iloc[0]
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": p},
        {"role": "user", "content": f"{matched.loc[matched['Cluster'] == abortion]['Topic'].to_list()}"}
    ]
)
print(f"Cluster {abortion} has topic {response.choices[0].message.content}")

Cluster 315 has topic Abortion,Maternal Health,Fertility Fraud,Misc
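
The raw labels do not always follow the prompt: cluster 4 came back with mixed case ("Governmental ethics"), and cluster 315 appends "Misc" to an otherwise real label. A small post-processing step can enforce the format regardless of what the model returns. The helper below is a hypothetical sketch; the name `normalize_label` and its rules (drop stray punctuation, drop "Misc" when other parts exist, capitalize each word, cap the total at six words) are my assumptions about what a cleaned label should look like, not anything provided by the OpenAI client.

```python
def normalize_label(raw, max_words=6):
    """Clean a model-generated cluster label into comma-separated title-case phrases."""
    # Remove quotes, asterisks, periods, and spaces from both ends.
    cleaned = raw.strip().strip("\"'*. ")
    kept = []
    seen = set()
    words_used = 0
    for part in cleaned.split(","):
        part = part.strip()
        key = part.lower()
        # Skip empty pieces, repeats, and a stray "Misc" mixed into a real label.
        if not part or key in seen or key == "misc":
            continue
        remaining = max_words - words_used
        if remaining <= 0:
            break
        words = part.split()[:remaining]  # truncate so the total stays within the cap
        seen.add(key)
        kept.append(" ".join(word.capitalize() for word in words))
        words_used += len(words)
    # Fall back to "Misc" only when nothing usable is left.
    return ", ".join(kept) if kept else "Misc"

label_ethics = normalize_label("Ethics, Governmental ethics, Public corruption")
label_abortion = normalize_label("Abortion,Maternal Health,Fertility Fraud,Misc")
```

In the naming loop, this would wrap `response.choices[0].message.content` before printing or storing the label.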


In [43]:
matched.loc[matched['Cluster'] == abortion]['Topic'].tolist()

['ABORTION',
 'ABORTION COMPLICATIONS',
 'ABORTION COMPLICATIONS REPORTING ACT',
 'Abortion',
 'Abortion Care',
 'Abortion Data',
 'Abortion Data Reporting Act',
 'Abortion-inducing Drugs',
 'Abortion-judicial consent for minor-hearing procedure/burden of proof',
 'Abortion-pregnant minor-judicial consent',
 'Abortions',
 'Abortions For Minors',
 'Aspiration Abortions',
 'Born-alive Abortion Survivors Protection Act',
 'Civil Fertility Fraud',
 'FETAL HEARTBEAT PREBORN CHILD PROTECTION ACT',
 'Fertility Fraud',
 'Fetal Death',
 'GESTATIONAL AGREEMENTS ACT',
 'Health, abortion',
 'Health: abortion',
 'IDAHO UNBORN INFANTS DIGNITY ACT',
 'Individual Reproductive Rights (ConAm)',
 'Induce pregnant woman to use drugs-violate corrupting another with drugs statute',
 'MATERNAL MORTALITY',
 'MATERNAL MORTALITY & MORBIDITY TASK FORCE',
 'Maternal Fatalities',
 'Maternal Health Equity',
 'Maternal Mortality Review Comm.',
 'PREBORN NONDISCRIMINATION ACT',
 'PREBORN PAIN ACT',
 'PREGNANT PRISONE