# Topic Cluster Analysis

This is where the big guns come out. We'll use sentence transformers to turn each bill's topic label into an embedding that represents its semantic meaning, then cluster those embeddings to group similar topics together.

In [1]:
import pandas as pd
import numpy as np
import pickle
from openai import OpenAI
from sklearn.cluster import MiniBatchKMeans
from nltk.cluster import cosine_distance
from sentence_transformers import SentenceTransformer, util

Let's grab the topics list we made in the last notebook.

In [2]:
with open("./topics.pkl", "rb") as f:
    topics = pickle.load(f)

In a normal NLP task, this is the part where we would fine-tune our model on the data we have so that it performs better on the given dataset. Unfortunately, that only works well with labeled data, which this data is not. We could go through and manually label every topic, but that would defeat the purpose of using a model to cluster. Instead, we will rely on what the model learned during its original training, as this will likely be accurate enough for our purposes. We will use the Sentence Transformers model `all-mpnet-base-v2` (built on MPNet rather than BERT), as it is one of the strongest general-purpose models in the library without fine-tuning and applies to a wide range of contexts.

In [3]:
model = SentenceTransformer('all-mpnet-base-v2')
embeddings = model.encode(topics)
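To get a feel for what these embeddings encode, it can help to look at cosine similarity, the measure usually used to compare sentence embeddings. The sketch below is only an illustration: the helper name `cosine_similarity_matrix` and the 3-dimensional toy vectors are assumptions standing in for the real 768-dimensional output of `all-mpnet-base-v2`.

```python
import numpy as np

def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero-length rows
    return unit @ unit.T

# Toy stand-ins for real embeddings (hypothetical values).
toy = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],  # same direction as row 0, different length
    [0.0, 1.0, 0.0],  # orthogonal to row 0
])
sims = cosine_similarity_matrix(toy)
```

Rows pointing in the same direction score close to 1 regardless of length, while unrelated (orthogonal) rows score close to 0. On the real data the same helper could be applied to a small slice such as `embeddings[:100]`.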

Now that we have embeddings, we can cluster them to get a baseline grouping of similar topics. We'll start with mini-batch k-means and a fixed `n = 250` clusters.

In [4]:
n = 250
clusterer = MiniBatchKMeans(n, init="k-means++", n_init="auto", max_iter=1000000000, batch_size=8192)
clusters = clusterer.fit_predict(embeddings)

Hopefully that's given us something useful. Let's make sure by checking on some big ticket topics.

In [5]:
topics_series = pd.Series(topics)
clusters_series = pd.Series(clusters)
matched = pd.DataFrame({"Topic": topics_series, "Cluster": clusters_series})

In [6]:
matched.loc[matched["Topic"].str.contains("gun", case=False)]

| | Topic | Cluster |
|---|---|---|
| 4623 | CONCEALED HANDGUNS | 58 |
| 6875 | Concealed handgun licensing-revisions | 58 |
| 6876 | Concealed handgun-possess in school safety zon... | 58 |
| 6877 | Concealed handguns/firearms control/ self-defe... | 58 |
| 11348 | Electric Guns | 58 |
| 11357 | Electric Projectile Guns | 58 |
| 14182 | GUN TRIGGER SAFETY LOCKS | 58 |
| 14183 | GUNTER, CITY OF | 249 |
| 14448 | Ghost Guns | 58 |
| 14863 | Gun Safety | 58 |


In [7]:
matched.loc[matched["Topic"].str.contains("abortion", case=False)]

| | Topic | Cluster |
|---|---|---|
| 103 | ABORTION | 188 |
| 104 | ABORTION COMPLICATIONS | 188 |
| 105 | ABORTION COMPLICATIONS REPORTING ACT | 188 |
| 509 | Abortion | 188 |
| 510 | Abortion Care | 188 |
| 511 | Abortion Data | 188 |
| 512 | Abortion Data Reporting Act | 188 |
| 513 | Abortion-inducing Drugs | 188 |
| 514 | Abortion-judicial consent for minor-hearing pr... | 87 |
| 515 | Abortion-pregnant minor-judicial consent | 188 |


Cool, topics with the same general theme seem to be grouped together. One issue remains: what do the cluster numbers mean? With the filtered view above we can see that cluster 188 clearly concerns abortion, but if we saw only one of these topics and its cluster number out of context, we could not easily make the same inference. Now we have the task of naming each cluster, which isn't easy with unsupervised learning. One approach is to find the topic closest to the center of each cluster and use that as a title.

In [13]:
for i, mean in enumerate(clusterer.cluster_centers_):
    clustered_topics = matched.loc[matched["Cluster"] == i]
    big = (-1, 0)
    for j in range(len(clustered_topics)):
        dist = cosine_distance(embeddings[j], mean)
        if dist > big[1]:
            big = (j, dist)
    if big[0] == -1: continue
    print(f"Cluster {i} has topic {clustered_topics.iloc[big[0]]['Topic']}")
    if i > 8:
        break

Cluster 0 has topic Purchase of Real Property
Cluster 1 has topic Foreign Adversary Divestment Act
Cluster 2 has topic Health, Safety And Welfare
Cluster 3 has topic Transportation (h)
Cluster 4 has topic Economically Disadvantaged
Cluster 5 has topic MOORE COUNTY
Cluster 6 has topic To Vacate The Forfeiture Or Revocation Of The Charter Of Tabor Franchi Post 2396 Veterans Of Foreign Wars
Cluster 7 has topic ROSE CITY MUNICIPAL UTILITY DISTRICT
Cluster 8 has topic Other Relief
Cluster 9 has topic Simulators


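OK, that didn't work as intended: many of these "titles" are not descriptive of their clusters at all. Part of the problem is the loop itself, which indexes `embeddings` by position within the cluster rather than by original row and keeps the largest cosine distance, so it does not actually pick the topic nearest the center. Even a correct nearest-topic title would still be a single raw topic string rather than a summary, so this is going to require significantly bigger armaments, as attaching meaningful labels to the output of an unsupervised method is very difficult. For this specific task, I think turning to OpenAI's GPT is the best option. It's not free, but for the number of tokens being sent here, the cost should be fairly low.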
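For reference, here is a minimal sketch of nearest-to-center selection that looks up each cluster's members by their original row position and keeps the highest cosine similarity. The function name `nearest_topic_per_cluster` and the toy arrays are assumptions for illustration, and the sketch assumes no embedding or center is the zero vector. It has not been run against the real data, so it says nothing about how descriptive the resulting titles would be.

```python
import numpy as np

def nearest_topic_per_cluster(embeddings, labels, centers):
    """Map each cluster id to the row index of its member closest (by cosine) to the center."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    centers = np.asarray(centers, dtype=float)
    # Normalize so that a dot product equals cosine similarity.
    unit_emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit_cen = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    nearest = {}
    for cluster_id in range(len(centers)):
        members = np.flatnonzero(labels == cluster_id)  # original row positions
        if members.size == 0:
            continue  # skip clusters that received no points
        sims = unit_emb[members] @ unit_cen[cluster_id]
        nearest[cluster_id] = int(members[np.argmax(sims)])
    return nearest

# Toy data (hypothetical): rows 0 and 2 belong to cluster 0, rows 1 and 3 to cluster 1.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]
toy_labels = [0, 1, 0, 1]
toy_centers = [[1.0, 0.1], [-1.0, 1.2]]
result = nearest_topic_per_cluster(toy_embeddings, toy_labels, toy_centers)
```

With the notebook's variables, the call would look like `nearest_topic_per_cluster(embeddings, clusters, clusterer.cluster_centers_)`, and each returned index can be looked up with `topics[idx]`.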

In [9]:
with open("openai_key.txt", 'r') as f:
    api_key = f.read().strip()  # drop any trailing newline or whitespace from the key file

In [10]:
# PROMPT
p = """
You are designed to output summaries of the theme of a group of topics in as few words as possible, with a maximum word count of 6 words. 
These topic groups are the given topics for certain bills passed by bodies of legislatures around the United States. Many of these bills refer to the same topic but have varying topic labels, so your job is to group the similar topics together into one topic group.
If there is no discernable theme you will respond with \"Misc\". Your answer should be formatted as simply the determined topic group with each word in title case, separated by commas if necessary. Do not include extra punctuation like quotes, asterisks, or periods.
"""

In [11]:
client = OpenAI(api_key=api_key)
for i in range(10):
  response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
      {"role": "system", "content": p},
      {"role": "user", "content":  f"{matched.loc[matched['Cluster'] == i]['Topic'].to_list()}"}
    ]
  )
  print(f"Cluster {i} has topic {response.choices[0].message.content}")

Cluster 0 has topic Real Estate, Appraisals, Foreclosures, Property Sales
Cluster 1 has topic Misc
Cluster 2 has topic Health and Human Services, Social Services
Cluster 3 has topic Transportation
Cluster 4 has topic Homelessness Assistance, Food and Nutrition Programs, Financial Assistance
Cluster 5 has topic Counties, Pistol permit fees, Voting center
Cluster 6 has topic To Vacate The Forfeiture of Charters
Cluster 7 has topic Municipal Utility Districts
Cluster 8 has topic Health Services, Massage Therapy, Medical Devices, Anatomical Gifts, Prosthetics
Cluster 9 has topic Gaming, Board Management, Gaming Regulation, Game Development


Alright, these are looking more focused. Let's make sure by checking the raw contents of that first cluster.

In [14]:
matched.loc[matched['Cluster'] == 0]['Topic'].tolist()

['APPRAISALS',
 'APPRAISEMENT',
 'APPRAISERS',
 'Adre (see Also: State Real Estate Department)',
 'Adre (see Also: State Real Estate Department) (h)',
 'Agents and Brokers',
 'Appraisal',
 'Appraisal Dispute',
 'Appraisal Management Companies',
 'Appraisal Services',
 'Appraisals',
 'Appraiser',
 'Appraiser Liability',
 'Appraisers',
 'Approves the Sale of the Leased Fee Interest in TMK No. 1-4-1-33-42',
 'Blighted/abandoned residential parcels- expedite foreclosure & transfer',
 'Bureau of Conveyances',
 'COMMERCIAL PROPERTY',
 'CONDOMINIUM PROPERTY ACT',
 'Commercial Property',
 'Commercial Real Property Receivership',
 'Conveyance Of Land And Buildings',
 'Conveyance of state-owned real estate to Step by Step Academy',
 'Department Of Real Estate',
 'Distressed Property Consultant',
 'ESCROW',
 'ESTATES',
 'Economic development, obsolete property and rehabilitation',
 'Economic development: obsolete property and rehabilitation',
 'Escrow Agents',
 'Estates',
 'FORECLOSURE',
 'FORECL

Depending on the run, this looks pretty decent from both approaches, actually. Unfortunately, the approach is not deterministic, so repeated runs can produce wildly different outcomes for essentially no reason. Fixing this completely isn't easy with a GPT model, but the variation can be reduced.
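One way to reduce run-to-run variation on the GPT side is to set `temperature=0`, pass a fixed `seed`, and send the topics in a stable order. The sketch below only builds the request arguments, so it runs without a network call. The helper name `build_label_request` is an assumption, and the API documents `seed` as best-effort reproducibility on models that support it, not as a guarantee.

```python
def build_label_request(cluster_topics, system_prompt, model="gpt-3.5-turbo", seed=0):
    """Keyword arguments for client.chat.completions.create with sampling randomness reduced."""
    return {
        "model": model,
        "temperature": 0,  # favor the most likely token at each step
        "seed": seed,      # best-effort reproducibility, not guaranteed
        "messages": [
            {"role": "system", "content": system_prompt},
            # Sorting keeps the prompt identical even if row order changes between runs.
            {"role": "user", "content": str(sorted(cluster_topics))},
        ],
    }

# Hypothetical example input.
request = build_label_request(["Gun Safety", "Ghost Guns"], "Summarize the theme.")
```

The result can be passed as `client.chat.completions.create(**request)`. On the clustering side, passing `random_state=0` (or any fixed integer) to `MiniBatchKMeans` makes the cluster assignments repeatable for the same input.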

The biggest issue so far isn't actually caused by the GPT model but by the clustering algorithm. By using k-means, we fix the number of clusters in advance, and that number is almost never exactly right. With too few clusters, they aren't focused enough and pick up off-topic items; with too many, similar topics get split by a thin, near-meaningless line. To tackle this, we need to shift to a different clustering algorithm, preferably one that's resistant to outliers and requires as few hard-coded values as possible.
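As one illustration of that direction, the sketch below runs scikit-learn's `DBSCAN` with a cosine metric on synthetic points. It does not need a cluster count and marks outliers with the label `-1`, but it swaps `n` for two density parameters, so it is not free of hard-coded values. The data, `eps`, and `min_samples` here are assumptions chosen for the toy example, not values tuned for the bill topics.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two tight synthetic groups plus one stray point, standing in for topic embeddings.
group_a = np.array([1.0, 0.0, 0.0]) + 0.01 * rng.standard_normal((20, 3))
group_b = np.array([0.0, 1.0, 0.0]) + 0.01 * rng.standard_normal((20, 3))
stray = np.array([[0.0, 0.0, 1.0]])
points = np.vstack([group_a, group_b, stray])

# eps and min_samples are illustrative guesses, not tuned values.
db = DBSCAN(eps=0.05, min_samples=5, metric="cosine")
labels = db.fit_predict(points)

# DBSCAN assigns -1 to points it treats as noise.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
```

On the real embeddings, a suitable `eps` would have to be found by experiment, and a poor choice can lump everything into one cluster or mark most topics as noise.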

Reading:

https://en.wikipedia.org/wiki/Fuzzy_clustering

https://en.wikipedia.org/wiki/DBSCAN

https://en.wikipedia.org/wiki/Self-organizing_map

https://en.wikipedia.org/wiki/Cluster_analysis 

**Reintegration**

OK, now that we've given the clusters labels, let's assign those labels to each bill so that we can override the old labels.
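A rough sketch of that step with pandas is below. The notebook does not show the bills table or a stored mapping of cluster labels, so `bills`, `bill_id`, `cluster_labels`, and the label text are all hypothetical stand-ins. Only the `Topic` and `Cluster` columns mirror the `matched` frame built earlier.

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's data.
matched = pd.DataFrame({
    "Topic": ["ABORTION", "Abortion Care", "Gun Safety"],
    "Cluster": [188, 188, 58],
})
cluster_labels = {188: "Abortion", 58: "Firearms"}  # hypothetical generated labels
bills = pd.DataFrame({
    "bill_id": [1, 2, 3],
    "Topic": ["Gun Safety", "ABORTION", "Unknown Topic"],
})

# Topic -> generated label lookup; assumes each topic string appears once in `matched`.
topic_to_label = (
    matched.assign(Label=matched["Cluster"].map(cluster_labels))
    .set_index("Topic")["Label"]
)

# Fall back to the old label for any topic that has no generated one.
bills["Label"] = bills["Topic"].map(topic_to_label).fillna(bills["Topic"])
```

Writing the result to a new `Label` column rather than overwriting `Topic` keeps the old labels available for comparison.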

With that, we can now do the same analysis we did in the previous notebook, but with the "correct" labels (note: I will refer to the newly generated labels as the "correct" ones from here on out, even if they aren't exactly perfect).