# Topic Cluster Analysis

This is where the big guns come out. We'll be utilizing sentence transformers to turn the topic sentence of each bill into an embedding representing semantic meaning, then clustering topics from there to attempt to group topics together.

In [1]:
import pandas as pd
import numpy as np
import pickle
from openai import OpenAI
from sklearn.cluster import MiniBatchKMeans
from nltk.cluster import cosine_distance
from sentence_transformers import SentenceTransformer, util

Let's grab the topics list we made in the last notebook.

In [2]:
with open("./topics.pkl", "rb") as f:
    topics = pickle.load(f)

In a normal NLP task, this is the part where we would tune our LLM to the data we have so that it performs better on the given dataset. Unfortunately, that only works well with labeled data, which this data is not. We could go through and manually label every topic, but that would defeat the purpose of using an LLM to cluster. Instead, we will just rely on what the model was trained with initially, as this will likely be accurate enough for our purposes. We will use BERT's `all-mpnet-base-v2`, as it is the most accurate without fine tuning and can apply to a wide range of contexts.

In [3]:
model = SentenceTransformer('all-mpnet-base-v2')
embeddings = model.encode(topics)

Now that we have embeddings, we can run some basic similarity metrics to get a baseline understanding of what terms are similar to each other.

In [4]:
n = 500
clusterer = MiniBatchKMeans(n, init="k-means++", n_init="auto", max_iter=1000000, batch_size=8192)
clusters = clusterer.fit_predict(embeddings)

Hopefully that's given us something useful. Let's make sure by checking on some big ticket topics.

In [5]:
topics_series = pd.Series(topics)
clusters_series = pd.Series(clusters)
matched = pd.DataFrame({"Topic": topics_series, "Cluster": clusters_series})

In [6]:
matched.loc[matched["Topic"].str.contains("gun", case=False)]

Unnamed: 0,Topic,Cluster
4347,CONCEALED HANDGUNS,202
6460,Concealed handgun licensing-revisions,202
6461,Concealed handgun-possess in school safety zon...,202
6462,Concealed handguns/firearms control/ self-defe...,202
10680,Electric Guns,203
10688,Electric Projectile Guns,203
13337,GUN TRIGGER SAFETY LOCKS,202
13338,"GUNTER, CITY OF",11
13579,Ghost Guns,203
13953,Gun Safety,202


In [7]:
matched.loc[matched["Topic"].str.contains("abortion", case=False)]

Unnamed: 0,Topic,Cluster
92,ABORTION,401
93,ABORTION COMPLICATIONS,401
94,ABORTION COMPLICATIONS REPORTING ACT,401
477,Abortion,401
478,Abortion Care,401
479,Abortion Data,401
480,Abortion Data Reporting Act,401
481,Abortion-inducing Drugs,401
482,Abortion-judicial consent for minor-hearing pr...,401
483,Abortion-pregnant minor-judicial consent,401


Cool, topics with the same general topic seem to be grouped together. One issue remains, what do the cluster numbers mean? With this narrowing down we can see that it clearly concerns abortion, but if we saw only one of these topics out of context and its cluster we could not easily make the same assumption. Now we have the task of naming each cluster, which isn't a super obvious task with unsupervised learning. One way we can try is finding the topic closest to the center of each cluster and using that as a title.

In [8]:
for i, mean in enumerate(clusterer.cluster_centers_):
    clustered_topics = matched.loc[matched["Cluster"] == i]
    big = (-1, 0)
    for j in range(len(clustered_topics)):
        dist = cosine_distance(embeddings[j], mean)
        if dist > big[1]:
            big = (j, dist)
    if big[0] == -1: continue
    print(f"Cluster {i} has topic {clustered_topics.iloc[big[0]]['Topic']}")

Cluster 0 has topic BELL COUNTY WATER CONTROL & IMPROVEMENT DISTRICT NO. 6
Cluster 1 has topic Appointments (h)
Cluster 2 has topic Administrative law and regulatory procedures
Cluster 3 has topic Age
Cluster 4 has topic Health and Mental Health
Cluster 5 has topic Education, financing
Cluster 6 has topic Economic development, plant rehabilitation
Cluster 7 has topic Federal
Cluster 8 has topic HARRIS COUNTY MUNICIPAL UTILITY DISTRICT NO. 61
Cluster 9 has topic TAX/CORP FRANCHISE
Cluster 10 has topic State Route 88
Cluster 11 has topic Jarratt Town of
Cluster 12 has topic Trauma-Informed
Cluster 13 has topic Tier 2 Studies
Cluster 14 has topic Department of Human Resources Development
Cluster 15 has topic Child Care Expenses
Cluster 16 has topic Aquifer Protection Permits
Cluster 17 has topic PIPELINE CORPORATIONS
Cluster 18 has topic Radioactive wastes and releases
Cluster 19 has topic Election Law Amendments
Cluster 20 has topic Working Animals
Cluster 21 has topic State and County G

Ok that didn't work as intended, many of these "titles" are not descriptive of their clusters at all. This is going to require significantly bigger armaments, as turning an unsupervised machine learning task into a supervised task is very difficult. For this specific task, I think turning to OpenAI's GPT is the best option. It's not free, but for the number of tokens being sent here, this cost should hopefully be fairly cheap.

In [9]:
with open("openai_key.txt", 'r') as f:
    api_key = f.read()

In [10]:
# PROMPT
p = f"In as few words as possible, what is the general theme of this list of topics: {list}. If there is no obvious theme, respond 'misc'"

In [15]:
client = OpenAI(api_key=api_key)
for i in range(10):
  response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
      {"role": "system", "content": "You are designed to output summaries of the theme of a group of topics in as few words as possible, with a maximum word count of 6 words. If there is no discernable theme you will respond with \"Misc\""},
      {"role": "user", "content":  f"{matched.loc[matched['Cluster'] == i]['Topic'].to_list()}"}
    ]
  )
  print(f"Cluster {i} has topic {response.choices[0].message.content}")

Cluster 0 has topic Water control and improvement districts.
Cluster 1 has topic Appointment scheduling and availability services.
Cluster 2 has topic Administrative Procedures and Hearings oversight.
Cluster 3 has topic Aging population, caregivers, senior citizens.


In [12]:
i = 47
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[
    {"role": "system", "content": "You are designed to output summaries of the theme of a group of topics in as few words as possible, with a maximum word count of 6 words. If there is no discernable theme you will respond with \"Misc\""},
    {"role": "user", "content":  f"{matched.loc[matched['Cluster'] == i]['Topic'].to_list()}"}
]
)
print(f"Cluster {i} has topic {response.choices[0].message.content}")

Cluster 47 has topic Forensic Audits


In [13]:
matched.loc[matched['Cluster'] == i]['Topic'].to_list()

['AUDITS',
 'AUDITS/AUDITING',
 'Advamced Forensic Testing',
 'Advanced Forensic Testing (h)',
 'Annual Inspection',
 'Attorney General Investigation',
 'Audit',
 'Audit Findings',
 'Audit Recommendations',
 'Audit Results',
 'Audits',
 'Audits (s)',
 'Audits/Auditing',
 'CRIMINAL INVESTIGATIONS',
 'Crime Lab',
 'Crime Scene Investigation',
 'Crime Scene Investigations',
 'Depositions',
 'Examination On Oath',
 'FORENSIC TESTING',
 'Forensic',
 'Forensic Evidence Testing',
 'Forensic Examinations',
 'Forensic Interview',
 'Forensic Medical Exam',
 'Forensic Peer Specialist Program',
 'Forensic Science',
 'Forensic Sciences',
 'Forensic Sciences Department',
 'Full-service Crime Laboratory',
 'Full-service Crime Labs',
 'Government investigations',
 'Independent Audit',
 'Inspection',
 'Inspection And Examination',
 'Inspection and Duplication',
 'Inspections',
 'Inspections (h)',
 'Inspectors',
 'Internal Audit',
 'Internal Investigations',
 'Investigation',
 'Investigation (h)',
 'Inv