# Topic Cluster Analysis

This is where the big guns come out. We'll use sentence transformers to turn each bill's topic into an embedding that represents its semantic meaning, then cluster those embeddings to group related topics together.

In [1]:
import pandas as pd
import numpy as np
import pickle, nltk
from nltk.cluster.util import cosine_distance
from nltk.cluster import KMeansClusterer
from sentence_transformers import SentenceTransformer, util

Let's grab the topics list we made in the last notebook.

In [2]:
with open("./topics.pkl", "rb") as f:
    topics = pickle.load(f)

In [3]:
topics

['',
 '"Certificate of birth resulting in stillbirth"',
 '"Hang on Sloopy"-official state rock song',
 '"Hawaii Made"',
 '"Hawaii Made" Program',
 '"Made in Hawaii"',
 '"Redboxing"',
 '"Stir Crazy in Williamsburg"',
 '"Stoppers"',
 '#TEXAS',
 '#TEXASTODO',
 '#TXLEGE',
 '(1) 4-1-001:013',
 '(h)',
 '100th anniversary of the American Legion',
 '100th anniversary of the Army Warrant Officer Corp',
 '100th anniversary of the creation of the National',
 '1847 COLT WALKER',
 '1998 Makaha Beach Park Master Plan',
 '2012 Summer Paralympic Games',
 '2017 Statutory Construction Bill',
 '2018',
 '2018 Winter Olympics',
 '2018-2019',
 '2020-2021',
 '2020-2021 (h)',
 '2020-2021 School Year',
 '2021',
 '2021 Compact Trust Fund',
 '2021 Gaming Compact Amendment',
 '2021-2022',
 '2022',
 '2022 (h)',
 '2022 Amendments',
 '2022 Election Cycle',
 '2022 Taxable Year',
 '2022-2023',
 '2023',
 '2023-2024',
 '2023-2024 (h)',
 '2030 Agenda for Sustainable Development',
 '2030 Development Agenda',
 '2050 (h)',


In a normal NLP task, this is the part where we would fine-tune the model on our data so that it performs better on this dataset. Unfortunately, that only works well with labeled data, which ours is not. We could go through and manually label every topic, but that would defeat the purpose of clustering with a pretrained model. Instead, we will rely on what the model learned during its original training, as this will likely be accurate enough for our purposes.

In [4]:
model = SentenceTransformer('all-mpnet-base-v2')
embeddings = model.encode(topics)

Now that we have embeddings, we can cluster them with k-means, using cosine distance as the similarity measure, to get a baseline understanding of which topics are similar to each other.

In [5]:
n = 150
clusterer = KMeansClusterer(n, distance=cosine_distance, repeats=25, avoid_empty_clusters=True)
clusters = clusterer.cluster(embeddings, assign_clusters=True)
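For reference, the `cosine_distance` passed to the clusterer is one minus the cosine of the angle between two vectors, so it depends on direction rather than magnitude. Here is a minimal standard-library sketch on toy 3-dimensional vectors; the helper name `cosine_dist` and the vectors are made up for illustration and are not taken from the model.

```python
import math

def cosine_dist(u, v):
    """Return 1 - cos(angle between u and v); 0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

# Toy vectors standing in for embeddings (hypothetical values).
a = [1.0, 0.0, 0.0]
b = [2.0, 0.0, 0.0]  # same direction as a, different length
c = [0.0, 1.0, 0.0]  # orthogonal to a

same_direction = cosine_dist(a, b)
orthogonal = cosine_dist(a, c)
print(same_direction, orthogonal)
```

Vectors pointing the same way get a distance near 0 regardless of length, which is why two topics with similar meaning can cluster together even if their raw embedding magnitudes differ.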

Hopefully that's given us something useful. Let's make sure by checking on some big-ticket topics.

In [14]:
topics_series = pd.Series(topics)
clusters_series = pd.Series(clusters)
matched = pd.DataFrame({"Topic": topics_series, "Embedding": embeddings.tolist(), "Cluster": clusters_series})

In [15]:
matched.loc[matched["Topic"].str.contains("gun", case=False)]

Unnamed: 0,Topic,Embedding,Cluster
4347,CONCEALED HANDGUNS,"[0.02568192221224308, -0.005536421202123165, 0...",57
6460,Concealed handgun licensing-revisions,"[0.01703336089849472, 0.018000798299908638, 0....",121
6461,Concealed handgun-possess in school safety zon...,"[0.01550246775150299, -0.0051614465191960335, ...",95
6462,Concealed handguns/firearms control/ self-defe...,"[-0.016524557024240494, -0.01991763710975647, ...",117
10680,Electric Guns,"[-0.010645641945302486, -0.01648801751434803, ...",84
10688,Electric Projectile Guns,"[0.0063621667213737965, 0.0124552296474576, 0....",84
13337,GUN TRIGGER SAFETY LOCKS,"[-0.038710109889507294, -0.04523995891213417, ...",71
13338,"GUNTER, CITY OF","[0.021835029125213623, -0.015142696909606457, ...",49
13579,Ghost Guns,"[0.022212084382772446, -0.0220542773604393, 0....",57
13953,Gun Safety,"[-0.029492579400539398, -0.01094827800989151, ...",71


In [16]:
matched.loc[matched["Topic"].str.contains("abortion", case=False)]

Unnamed: 0,Topic,Embedding,Cluster
92,ABORTION,"[0.029020236805081367, 0.04534588009119034, 0....",36
93,ABORTION COMPLICATIONS,"[0.011687698774039745, 0.013806693255901337, -...",36
94,ABORTION COMPLICATIONS REPORTING ACT,"[-0.028023317456245422, 0.016842016950249672, ...",36
477,Abortion,"[0.029020264744758606, 0.04534583166241646, 0....",36
478,Abortion Care,"[0.019065028056502342, 0.0446045808494091, -0....",36
479,Abortion Data,"[-0.024709586054086685, 0.15588581562042236, -...",36
480,Abortion Data Reporting Act,"[-0.032505061477422714, 0.07749525457620621, -...",36
481,Abortion-inducing Drugs,"[0.004018457606434822, 0.01607905514538288, 0....",36
482,Abortion-judicial consent for minor-hearing pr...,"[0.03210146352648735, 0.007982312701642513, 0....",36
483,Abortion-pregnant minor-judicial consent,"[0.03251469135284424, -0.005309178493916988, 0...",36


The gun topics are a little all over the place, but the abortion topics have grouped perfectly. One issue remains: what is cluster 36? Filtered down like this, we can see that it clearly concerns abortion, but if we saw only one of these topics and its cluster number out of context, we could not easily draw the same conclusion. So now we have the task of naming each cluster. The easiest way to do this will be to find the center of each cluster and then find the topic that is closest to that center.

In [36]:
for i, mean in enumerate(clusterer.means()):
    clustered_topics = matched.loc[matched["Cluster"] == i+1]
    big = (-1, 0)
    for j, embedding in enumerate(clustered_topics["Embedding"]):
        dist = cosine_distance(embedding, mean)
        if dist > big[1]:
            big = (j, dist)
    if big[0] == -1: continue
    print(f"Cluster {i + 1} has topic {clustered_topics.iloc[big[0]]['Topic']}")

Cluster 1 has topic COVID-19 Travel Restriction Notification Procedure for Airlines
Cluster 2 has topic Gross Weight
Cluster 3 has topic Community Reinvestment Agencies
Cluster 5 has topic RITA LITTLEFIELD CHRONIC KIDNEY DISEASE CENTRALIZED RESOURCE CENTER
Cluster 6 has topic BRAZIL, DAVID BRIAN
Cluster 7 has topic Enjoy
Cluster 8 has topic Evidence (s)
Cluster 9 has topic Spousal Medicare Part B Premium Reimbursement
Cluster 10 has topic Breast Tomosynthesis
Cluster 11 has topic Immigrant health and welfare
Cluster 12 has topic Severability (s)
Cluster 13 has topic Relating To Food And Drugs-regulation Of Powdered Caffeine
Cluster 14 has topic Child Preference
Cluster 15 has topic SEVILLA, MARIO ALEXIS
Cluster 16 has topic Adult Dental Benefits
Cluster 17 has topic Mental Health and Relationships of Children and Teenagers
Cluster 18 has topic LIENS OF MECHANICS AND MATERIALMEN
Cluster 19 has topic Kihei Launch Ramp
Cluster 20 has topic Italian Consulate in Detroit
Cluster 21 has topic

Hmmm, that does not look right at all. Cluster 36, which we know from above deals with abortion, has its mean closest to "Congenital Diaphragmatic Hernia Week-week that includes July 17." Yes, that does deal with birth and reproductive healthcare, but it's not a very informative title. CDH is a very specific topic, so from that name alone you wouldn't expect the rest of the cluster to be about abortion in general.

For a more obvious example, look at cluster 57, which includes topics related to handguns. Its mean is closest to the topic "Perfluoroalkyl and Perfluoroalkyl Substances." This has nothing to do with guns and is more closely related to health risks stemming from human-made chemicals. Clearly something is wrong here. Maybe we didn't have a large enough k?
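Before raising k, it may also be worth double-checking the naming loop itself. As written, `if dist > big[1]` keeps the largest cosine distance, which picks the topic farthest from each mean rather than the closest. The filter `matched["Cluster"] == i+1` also assumes 1-based labels, whereas, as far as I can tell, NLTK's `cluster(..., assign_clusters=True)` returns each vector's 0-based index into `means()`. Below is a minimal sketch of a nearest-to-mean lookup on toy data. The function name `name_clusters` and all vectors are hypothetical, and it uses plain lists so it runs without the model.

```python
import math

def cosine_dist(u, v):
    """Return 1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

def name_clusters(topics, vectors, labels, means):
    """Map each cluster label to the topic whose vector is closest to that cluster's mean."""
    best = {}  # label -> (smallest distance seen so far, topic)
    for topic, vec, label in zip(topics, vectors, labels):
        # Assumption: labels are 0-based indices into `means`.
        dist = cosine_dist(vec, means[label])
        if label not in best or dist < best[label][0]:
            best[label] = (dist, topic)
    return {label: topic for label, (_, topic) in best.items()}

# Toy stand-ins for topics, embeddings, cluster labels, and cluster means.
toy_topics = ["Gun Safety", "Ghost Guns", "Abortion", "Abortion Care"]
toy_vectors = [[1.0, 0.1], [0.6, 0.6], [0.1, 1.0], [0.5, 0.7]]
toy_labels = [0, 0, 1, 1]
toy_means = [[1.0, 0.0], [0.0, 1.0]]

names = name_clusters(toy_topics, toy_vectors, toy_labels, toy_means)
print(names)
```

On the real data this would presumably be called with `topics`, `embeddings`, `clusters`, and `clusterer.means()`. Whether it yields sensible names there is not something this toy run can show.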

In [None]:
n = 1000
clusterer2 = KMeansClusterer(n, distance=cosine_distance, repeats=25, avoid_empty_clusters=True)
clusters2 = clusterer2.cluster(embeddings, assign_clusters=True)

In [None]:
# Use the labels from the k = 1000 run (clusters2), not the earlier k = 150 run
clusters_series = pd.Series(clusters2)
matched = pd.DataFrame({"Topic": topics_series, "Embedding": embeddings.tolist(), "Cluster": clusters_series})

In [None]:
matched.loc[matched["Topic"].str.contains("gun", case=False)]

In [None]:
matched.loc[matched["Topic"].str.contains("abortion", case=False)]
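To compare the two runs with something more concrete than eyeballing the tables, one option is to count how many distinct clusters a keyword's matches land in. This is a small sketch, not part of the original analysis. The helper `cluster_spread` is hypothetical, and the sample rows are a handful of untruncated topics and labels copied from the k = 150 tables above.

```python
from collections import Counter

def cluster_spread(topics, labels, keyword):
    """Count how often each cluster label appears among topics containing `keyword`."""
    keyword = keyword.lower()
    return Counter(
        label
        for topic, label in zip(topics, labels)
        if keyword in topic.lower()
    )

# A few (topic, cluster) pairs taken from the k = 150 output shown earlier.
sample_topics = [
    "CONCEALED HANDGUNS", "Electric Guns", "Electric Projectile Guns",
    "GUN TRIGGER SAFETY LOCKS", "Ghost Guns", "Gun Safety",
    "ABORTION", "Abortion Care", "Abortion Data",
]
sample_labels = [57, 84, 84, 71, 57, 71, 36, 36, 36]

gun_spread = cluster_spread(sample_topics, sample_labels, "gun")
abortion_spread = cluster_spread(sample_topics, sample_labels, "abortion")
print(len(gun_spread), len(abortion_spread))
```

Fewer distinct clusters for a keyword suggests tighter grouping, though with a much larger k some fragmentation is expected simply because there are more clusters to land in.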