# Cluster Verified Indicators with BGE-M3 Embeddings

**Primary authors:** Victoria (HDBSCAN clustering, embedding pipeline, cohesion analysis), Sahana (agglomerative clustering, seed word integration)
**Builds on:** Data Cleaning for Indicator Clustering copy.ipynb (Victoria)
**Prompt engineering:** Victoria
**AI assistance:** Claude (Anthropic), Gemini (Google)
**Environment:** Local (with sentence-transformers installed) or Colab (GPU recommended)

---

With roughly 14,000 phrases, there is enough data for patterns to emerge, but distances in the 1024-dimensional embedding space also become less informative (the "curse of dimensionality"), which can make clustering the raw embeddings unreliable.
In a dataset of this size made up of short phrases, many entries will be near-duplicates or highly similar. Here is how to adjust the previous approach for this larger volume:

## 1. Scaling the Parameters
With 14,000 rows, an `n_neighbors=5` setting is too small: UMAP becomes overly sensitive to tiny variations, and the clustering that follows tends to fragment into a very large number of tiny clusters.
* <b>Increase `n_neighbors`</b>: Try <b>15 to 30</b>. This forces UMAP to look at the broader "neighborhood" of a phrase, which helps group various ways of saying the same thing (e.g., grouping "ignoring odds" with all 15 variations of that concept).
* <b>Dimensions</b>: Stick to <b>5–10</b> for clustering. Even with 14k rows, the semantic "concepts" in 1–6 word phrases aren't complex enough to justify 50 dimensions.
## 2. The Clustering Strategy (HDBSCAN)
Standard K-Means is a poor fit for 14,000 phrases because it needs the number of clusters chosen in advance and tends to favor roughly spherical clusters of similar size. <b>HDBSCAN</b> is a better match here because:
* It handles <b>varying densities</b> (some clusters might have 500 phrases, others only 10).
* It identifies <b>noise</b> (out of 14k phrases, a few thousand will likely be unique "junk" that shouldn't be forced into a cluster).
## 3. Optimized Workflow for 14k Rows
Since BGE-M3 is a heavy model, you should process this in <b>batches</b> to avoid memory errors.
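
The batching idea can be sketched without loading the model. This is a minimal illustration, not the notebook's pipeline: `fake_encode` is a hypothetical stand-in for `model.encode` (which already batches internally through its `batch_size` argument), and the 8-dimensional output is arbitrary.

```python
import numpy as np

def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def fake_encode(phrases, dim=8):
    """Hypothetical stand-in for model.encode: returns one vector per phrase."""
    return np.array([[float(len(p))] * dim for p in phrases])

def encode_in_batches(phrases, batch_size=64, encode_fn=fake_encode):
    """Encode one batch at a time, then stack the per-batch results."""
    chunks = [encode_fn(batch) for batch in iter_batches(phrases, batch_size)]
    return np.vstack(chunks)

phrases = [f"phrase {i}" for i in range(150)]
embeddings_demo = encode_in_batches(phrases, batch_size=64)
print(embeddings_demo.shape)  # one row per phrase, one column per dimension
```

Only one batch is held by the encoder at a time, which is why a smaller `batch_size` is the first thing to try when memory errors appear.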

In [1]:
# imports
import os
from pathlib import Path
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
import umap       # umap may take 15-60 seconds to import
from sklearn.cluster import HDBSCAN
from sklearn import metrics
from sklearn.metrics import pairwise_distances


In [2]:
# ==========================
# PATHS & CONFIG
# ==========================
# 1. Detect environment
IS_COLAB = 'google.colab' in str(get_ipython())

if IS_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    PROJECT_ROOT = Path('/content/drive/MyDrive/SIADS 692 Milestone II/Milestone II - NLP Cryptic Crossword Clues')
else:
    # On local, move up from notebooks/ to project root
    # Adjust the number of .parent calls based on where this notebook sits
    PROJECT_ROOT = Path.cwd().parent 

DATA_DIR = PROJECT_ROOT / "data"
OUTPUT_DIR = PROJECT_ROOT / "outputs"


In [3]:
# Load the data (using pandas) to make a list of verified indicators
df = pd.read_csv(f'{DATA_DIR}/verified_indicators.csv', header=None)
#df = pd.read_csv('verified_indicators.csv', header=None)

# Name the column
df.rename(columns={0: 'indicator'}, inplace=True)

# Make the data into a list as the BGE-M3 model requires
verified_indicators_list = df['indicator'].tolist()

The next cell is slow because it creates the embeddings. On Colab, see the GPU instructions further down for runtime settings and faster alternative code.

In [4]:
# Embed the indicator phrases and cluster them

# 1. Load the BGE-M3 model (for dense embeddings)
# This model is specifically designed for short-to-long text consistency
model = SentenceTransformer('BAAI/bge-m3')

# 2. Generate the Embeddings
# batch_size=32 or 64 helps manage GPU/RAM memory
embeddings = model.encode(
    verified_indicators_list,
    batch_size=64,
    show_progress_bar=True
    )

# 3. Dimensionality Reduction with UMAP
# We reduce from BGE-M3's 1024 dimensions to 5-10 to "distill"
# the semantic signal
reducer = umap.UMAP(
    n_neighbors=30,      # Increased for larger dataset
    min_dist=0.0,        # Helps HDBSCAN find dense clusters
    n_components=10,     # Slightly more "room" for 14k rows
    metric='cosine',
    random_state=42
)
embeddings_reduced = reducer.fit_transform(embeddings)




In [None]:
# Save the original embeddings as parquet


If you are on Google Colab, you have access to a free T4 GPU, which is the simplest way to speed up encoding the 14,000 phrases. Without the GPU, BGE-M3 will be painfully slow on Colab's standard CPU.
### Step 1: Turn on the GPU
1. Go to the Runtime menu at the top.
2. Select Change runtime type.
3. Under "Hardware accelerator," choose T4 GPU (or any available GPU).
4. Click Save.
### Step 2: High-Speed Encoding Code
Now that the GPU is active, use this specific configuration to process your 14k list in a fraction of the time:

In [5]:
import torch

# 1. Load the model directly to the GPU
# 'cuda' is the engine that powers the NVIDIA T4 GPU
#device = "cuda" if torch.cuda.is_available() else "cpu"
#model = SentenceTransformer('BAAI/bge-m3', device=device)

# 2. Encode with a larger batch size
# Colab's T4 can easily handle a batch_size of 128 for short phrases
#embeddings = model.encode(
#    verified_indicators_list,
#    batch_size=128,
#    show_progress_bar=True,
#    convert_to_numpy=True
#)

In [6]:
# 4. Clustering with HDBSCAN - Find the natural groups
# This algorithm doesn't force us to pick the number of clusters k
clusterer = HDBSCAN(
    min_cluster_size=10 # Minimum number of phrases to form a "theme"
)
cluster_labels = clusterer.fit_predict(embeddings_reduced)



In [7]:
# Store the data and results in a summary dataframe
df_results = pd.DataFrame({
    "phrase": verified_indicators_list,  # The exact list used for model.encode
    "cluster": clusterer.labels_,        # The output from clusterer.fit_predict
    "probability": clusterer.probabilities_
})

To extract representative phrases from HDBSCAN, we leverage the `probabilities_` attribute.

In density-based clustering, the "most representative" points aren't just in the spatial center; they are the points with the highest <b>membership probability</b>, meaning they are located in the densest, most stable part of the cluster.


In [8]:
# 2. Calculate the size of each cluster
cluster_counts = df_results["cluster"].value_counts().to_dict()

# 3. Get representative phrases (Top 5 by probability)
representative_samples = (
    df_results[df_results["cluster"] != -1]
    .sort_values("probability", ascending=False)
    .groupby("cluster")
    .head(5)
)

# 4. Print Summary (Sorted by largest clusters first)
# Filter out -1 (noise) for the printing loop
valid_clusters = [c for c in cluster_counts.keys() if c != -1]
sorted_clusters = sorted(valid_clusters, key=lambda x: cluster_counts[x], reverse=True)

print(f"Total Clusters Found: {len(valid_clusters)}")
print(f"Total Noise Points (Cluster -1): {cluster_counts.get(-1, 0)}\n")

for cluster_id in sorted_clusters:
    size = cluster_counts[cluster_id]
    print(f"=== Cluster {cluster_id} | Size: {size} phrases ===")

    top_phrases = representative_samples[representative_samples["cluster"] == cluster_id]["phrase"].tolist()

    for i, phrase in enumerate(top_phrases, 1):
        print(f"  {i}. {phrase}")
    print("-" * 40) # Divider for readability

Total Clusters Found: 352
Total Noise Points (Cluster -1): 4210

=== Cluster 176 | Size: 185 phrases ===
  1. volunteers up
  2. when growing up
  3. when climbing
  4. upward
  5. upped
----------------------------------------
=== Cluster 95 | Size: 150 phrases ===
  1. recounted
  2. after reorganisation
  3. after replanting
  4. being replanted
  5. being redeveloped
----------------------------------------
=== Cluster 175 | Size: 150 phrases ===
  1. return of
  2. made a return
  3. it will come back
  4. has come back
  5. having got back
----------------------------------------
=== Cluster 321 | Size: 131 phrases ===
  1. becoming odd
  2. in a strange way
  3. in an odd way
  4. in an unconventional way
  5. in peculiar guise
----------------------------------------
=== Cluster 327 | Size: 126 phrases ===
  1. scatty
  2. sally
  3. shabby
  4. shaggy
  5. shaky
----------------------------------------
=== Cluster 64 | Size: 115 phrases ===
  1. affair
  2. becomes involved
  

In [9]:
df_results.sort_values(by = 'cluster', ascending=True)

                      phrase  cluster  probability
10208             will offer       -1     0.000000
4386                 pulsing       -1     0.000000
4383              pulled out       -1     0.000000
10203            which shows       -1     0.000000
10197               visiting       -1     0.000000
...                      ...      ...          ...
3105               incorrect      351     0.731708
3106             incorrectly      351     0.952843
2980                in error      351     1.000000
3107   incorrectly delivered      351     1.000000


In [10]:
len(df_results['cluster'].unique())

353

## Performance Comparison

We'll compare the clustering above with our group's first attempt to cluster indicators.

Results from the first models, which embedded indicators together with their clue context:

=== MODEL COMPARISON SUMMARY ===

                    ARI  Silhouette  MeanVariance
DistilRoBERTa  0.005098    0.015992      0.001302
MPNet          0.003627    0.014522      0.001302
MiniLM         0.003058    0.017184      0.002604

CPU times: user 3h 57min 9s, sys: 3min 42s, total: 4h 52s
Wall time: 3h 34min 26s

In [11]:
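# 1. Silhouette Score
# Measures how similar a phrase is to its own cluster vs. other clusters
# (Higher is better, range -1 to 1)
# Note: noise points (label -1) are included and scored as if they were one cluster;
# cosine distance is used here although HDBSCAN clustered with its default Euclidean metric
sil_score = metrics.silhouette_score(embeddings_reduced, clusterer.labels_, metric='cosine')

# 2. Mean Variance
# Measures the average 'spread' within clusters.
# Lower variance usually means tighter, more coherent groups.
# Note: np.var pools every coordinate of a cluster into a single number,
# so the value depends on the scale of the embedding space.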
def mean_variance(data, labels):
    variances = []
    for cluster in np.unique(labels):
        if cluster == -1: continue # Skip noise
        cluster_points = data[labels == cluster]
        variances.append(np.var(cluster_points))
    return np.mean(variances) if variances else 0

m_var = mean_variance(embeddings_reduced, clusterer.labels_)

# 3. ARI (Adjusted Rand Index)
# ONLY run this if you have a variable 'true_labels' (your ground truth)
# ari_score = metrics.adjusted_rand_score(true_labels, clusterer.labels_)

print(f"=== BGE-M3 + UMAP SUMMARY ===")
print(f"Silhouette Score: {sil_score:.6f}")
print(f"Mean Variance:    {m_var:.6f}")
# print(f"ARI Score:        {ari_score:.6f}")

=== BGE-M3 + UMAP SUMMARY ===
Silhouette Score: 0.292060
Mean Variance:    4.463622


=== BGE-M3 + UMAP SUMMARY ===

Silhouette Score: 0.304369

Mean Variance:    5.295987
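
The Silhouette score computed above includes HDBSCAN's noise points (label -1) as if they formed one cluster. A minimal sketch of how the score could be restricted to clustered points is shown here on synthetic data; the helper name `silhouette_without_noise` and the toy blobs are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def silhouette_without_noise(points, labels, metric="euclidean"):
    """Silhouette score computed only on points whose label is not -1."""
    labels = np.asarray(labels)
    mask = labels != -1
    if len(np.unique(labels[mask])) < 2:
        raise ValueError("Need at least two non-noise clusters.")
    return silhouette_score(points[mask], labels[mask], metric=metric)

# Synthetic demo: two tight blobs plus scattered points labeled as noise
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(50, 2))
noise = rng.uniform(low=-10.0, high=15.0, size=(30, 2))
demo_points = np.vstack([blob_a, blob_b, noise])
demo_labels = np.array([0] * 50 + [1] * 50 + [-1] * 30)

score_with_noise = silhouette_score(demo_points, demo_labels)
score_without_noise = silhouette_without_noise(demo_points, demo_labels)
print(f"noise scored as a cluster: {score_with_noise:.3f}")
print(f"noise excluded:            {score_without_noise:.3f}")
```

In this notebook the equivalent call would be `silhouette_without_noise(embeddings_reduced, clusterer.labels_)`. Reporting both numbers makes it clear how much of the score depends on the phrases HDBSCAN declined to cluster.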

## Technical Interpretation
* <b>Silhouette Score (about 0.30 vs about 0.015):</b> A score of 0.30 indicates real but moderate cluster structure: on average, phrases sit noticeably closer to their own cluster than to the nearest neighboring one. A commonly cited rule of thumb treats 0.26 to 0.50 as weak-to-moderate structure and values above 0.5 as reasonably well separated, so 0.30 is best read as a clear improvement over the earlier models rather than proof of clean separation. Two caveats apply: the score is computed in the 10D UMAP space, so it may not be directly comparable to the earlier scores, and the HDBSCAN noise points (label -1) are scored as if they formed one cluster.
* <b>Mean Variance (5.29):</b> While this number looks much higher than the previous results (~0.001), that reflects a difference in scale, not a decline in quality. The previous values of 0.001302 and 0.002604 equal 1/768 and 1/384, which is what this statistic returns for unit-length vectors in 768 and 384 dimensions whose coordinates average close to zero. They therefore most likely reflect normalized embeddings rather than cluster tightness. UMAP coordinates are not normalized and span a much wider range, so the variance in the 10D space is naturally larger, and the two sets of numbers should not be compared directly.
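
The earlier Mean Variance values (0.001302 and 0.002604) match 1/768 and 1/384. A small check shows why that pattern would appear if the earlier embeddings were unit-normalized. That normalization is an assumption, not something the earlier results confirm, and the random vectors below are only stand-ins for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

def flattened_variance_of_unit_vectors(n_vectors, dim):
    """np.var over all coordinates of random unit-length vectors."""
    vectors = rng.normal(size=(n_vectors, dim))
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # scale each row to length 1
    return np.var(vectors)

for dim in (768, 384):
    var = flattened_variance_of_unit_vectors(1000, dim)
    print(f"dim={dim}: np.var={var:.6f}, 1/dim={1 / dim:.6f}")
```

For unit-length rows the mean of the squared coordinates is exactly 1/dim, so `np.var` over the flattened array equals 1/dim minus the squared grand mean. When the coordinates average close to zero, the statistic reports the dimensionality of the space rather than how tight the clusters are.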
## Conceptual Interpretation
Conceptually, the difference between your first results and these results is the difference between looking at a crowd from a satellite vs. walking through the crowd.
* <b>Previous Models:</b> Saw a blurry mass of words.
* <b>BGE-M3 + UMAP:</b> Sees the "groups" (the 352 clusters). The phrases in Cluster A (e.g., "alternation" words) are now measurably distant from those in Cluster B (e.g., "anagram" words).
## Why the jump was so big
1. <b>Model Power:</b> BGE-M3 appears to handle short, functional phrases better than the "all-*" models used earlier.
2. <b>Noise Removal:</b> UMAP compressed the 1,024 embedding dimensions down to 10, dropping 1,014 dimensions' worth of variation that added little neighborhood structure and diluted distance-based scores such as Silhouette.
3. <b>Algorithmic Fit:</b> HDBSCAN is better suited to finding "natural" shapes in 10D than a standard K-Means approach would be in 768D.
### The Verdict:
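You have moved from models that showed almost no cluster structure to one that captures meaningful structure in your data. The 352 clusters are a promising starting point, but about 30% of the phrases (4,210 of 14,196) were left as noise, and individual clusters still need inspection before they can be treated as reliable.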

Because the Silhouette score is a single average, a <b>"Cluster Cohesion Check"</b> helps show which of those 352 clusters are the "tightest" (most synonymous) and which ones are more diverse.

## Cluster Cohesion Check

To perform a <b>Cluster Cohesion Check</b>, we will calculate the Intra-cluster Distance. This helps you identify which of your 352 clusters are "Golden Clusters" (highly synonymous, like "ignoring the odds" and "skipping every other") versus "Loose Clusters" (phrases that share a vibe but aren't strictly identical).

We'll use the <b>Medoid</b> (the most central point) as the anchor and measure how far away, on average, the other phrases in that cluster are.


In [12]:
from sklearn.metrics import pairwise_distances

cohesion_results = []

# We'll use the reduced 10D embeddings for this
for cluster_id in np.unique(clusterer.labels_):
    if cluster_id == -1:
        continue # Skip noise

    # Get points belonging to this cluster
    cluster_indices = np.where(clusterer.labels_ == cluster_id)[0]
    cluster_points = embeddings_reduced[cluster_indices]

    # Find the Medoid (the point with the lowest average distance to all others)
    dist_matrix = pairwise_distances(cluster_points, metric='euclidean')
    dist_sums = dist_matrix.sum(axis=1)
    medoid_idx = np.argmin(dist_sums)

    # Calculate average distance to medoid
    avg_dist = np.mean(dist_matrix[medoid_idx])

    # Store results
    cohesion_results.append({
        'cluster': cluster_id,
        'cohesion_score': avg_dist,
        'size': len(cluster_indices),
        'representative': verified_indicators_list[cluster_indices[medoid_idx]]
    })

# Convert to DataFrame for easy analysis
df_cohesion = pd.DataFrame(cohesion_results).sort_values('cohesion_score')


In [13]:

print("=== TOP 5 TIGHTEST CLUSTERS (High Cohesion) ===")
print(df_cohesion.head(5)[['cluster', 'cohesion_score', 'size', 'representative']])

print("\n=== TOP 5 LOOSEST CLUSTERS (Low Cohesion) ===")
print(df_cohesion.tail(5)[['cluster', 'cohesion_score', 'size', 'representative']])


=== TOP 5 TIGHTEST CLUSTERS (High Cohesion) ===
     cluster  cohesion_score  size representative
131      131        0.006957    12        imbibed
5          5        0.007098    25  being retired
11        11        0.008047    18      invest in
33        33        0.008129    13      hiding in
149      149        0.008435    17     reflection

=== TOP 5 LOOSEST CLUSTERS (Low Cohesion) ===
     cluster  cohesion_score  size   representative
235      235        0.222534    34       creatively
331      331        0.237024    47          muddled
289      289        0.240192    14            reels
187      187        0.287006    70         to house
154      154        0.342581    97  to be put right


In [14]:
# Take a look at the cohesion for the largest clusters
df_cohesion.sort_values(by='size', ascending=False).head(5)

     cluster  cohesion_score  size  representative
176      176        0.175856   185       awakening
95        95        0.133930   150       reshaping
175      175        0.145040   150       to return
321      321        0.172197   131  in unusual way
327      327        0.207011   126          scatty


In [15]:
len(df_cohesion)

352

## Agglomerative Clustering (Bottom-Up)

In [16]:
from sklearn.cluster import AgglomerativeClustering
import numpy as np
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram

In [17]:
len(verified_indicators_list)

14196

In [18]:
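# Random placeholder with the same shape as the reduced embeddings;
# the cells below use embeddings_reduced, not this array
embeddings_placeholder = np.random.rand(len(verified_indicators_list), 10)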

In [19]:
# Read the different seeds
cc_for_dummies_ALL = pd.read_excel(f'{DATA_DIR}/Wordplay Seeds.xlsx', sheet_name = "cc_for_dummies_ALL")
cc_for_dummies_ho_6 = pd.read_excel(f'{DATA_DIR}/Wordplay Seeds.xlsx', sheet_name = "cc_for_dummies_ho_6")
minute_cryptic_ALL = pd.read_excel(f'{DATA_DIR}/Wordplay Seeds.xlsx', sheet_name = "minute_cryptic_ALL")
minute_cryptic_ho_7 = pd.read_excel(f'{DATA_DIR}/Wordplay Seeds.xlsx', sheet_name = "minute_cryptic_ho_7")

In [20]:
# Number of wordplay categories (columns) in each seed sheet
seed_category_counts = {
    'cc_for_dummies_ALL': len(cc_for_dummies_ALL.columns),
    'cc_for_dummies_ho_6': len(cc_for_dummies_ho_6.columns),
    'minute_cryptic_ALL': len(minute_cryptic_ALL.columns),
    'minute_cryptic_ho_7': len(minute_cryptic_ho_7.columns),
}
seed_results = {}

In [21]:
# For each seed sheet, map every wordplay category (column) to the positions
# of its seed phrases in verified_indicators_list
indicator_groupings = [cc_for_dummies_ALL, cc_for_dummies_ho_6, minute_cryptic_ALL, minute_cryptic_ho_7]
indicator_dicts = []
for grouping in indicator_groupings:
    grouping_index_lookup = {}
    grouping_list = grouping.columns.tolist()
    for group in grouping_list:
        group_values = grouping[group].values
        group_values = group_values[~pd.isna(group_values)]  # Drop empty cells
        indexes = []
        for value in group_values:
            try:
                index = verified_indicators_list.index(value)
                indexes.append(index)
            except ValueError:
                # Seed phrase is not among the verified indicators; skip it
                pass
        grouping_index_lookup[group] = indexes
    indicator_dicts.append(grouping_index_lookup)

In [22]:
indicator_dicts[0]

{'anagram': [954,
  1486,
  1323,
  1279,
  6325,
  1858,
  6195,
  1891,
  4144,
  1571,
  3725,
  1834,
  978,
  6368,
  4485,
  5175,
  600],
 'container': [566,
  7588,
  6848,
  7652,
  7855,
  7167,
  7512,
  10856,
  7767,
  7342,
  5034],
 'hidden': [9600, 7310, 7588, 7107, 9250, 996, 9560, 2347, 7474, 9891, 9985],
 'reversal': [12822,
  566,
  960,
  12913,
  13030,
  2645,
  3394,
  13568,
  13680,
  11585,
  8891,
  13770,
  6314,
  14081],
 'deletion': [4637, 8565, 8899, 4005, 8707, 1475, 8541, 9077],
 'deletion_positioning': [],
 'homophone': [10505,
  10545,
  947,
  10621,
  10346,
  10276,
  10562,
  10608,
  10651,
  10741],
 'charade_positioning': [10836, 10863, 7540, 3528, 9947, 11831, 7770, 8519]}

In [23]:
# Run Ward agglomerative clustering once per seed scheme, using the number of
# seed categories in that scheme as the number of clusters
for grouping, num_categories in seed_category_counts.items():
    bottom_up_clustering = AgglomerativeClustering(n_clusters=num_categories, linkage='ward', compute_distances=True)
    bottom_up_clustering_predictions = bottom_up_clustering.fit_predict(embeddings_reduced)
    df_bottom_up_results = pd.DataFrame({
        "phrase": verified_indicators_list,       # The exact list used for model.encode
        "cluster": bottom_up_clustering.labels_   # Labels assigned by AgglomerativeClustering
    })
    # Silhouette is computed with cosine distance on the reduced embeddings
    sil_score = metrics.silhouette_score(embeddings_reduced, bottom_up_clustering.labels_, metric='cosine')
    seed_results[grouping] = (sil_score, df_bottom_up_results)

In [24]:
for grouping, results in seed_results.items():
    print("Grouping: ", grouping)
    print("Number of Groupings :", seed_category_counts[grouping])
    print("Silhouette Score: ", results[0])

Grouping:  cc_for_dummies_ALL
Number of Groupings : 8
Silhouette Score:  0.3847180902957916
Grouping:  cc_for_dummies_ho_6
Number of Groupings : 6
Silhouette Score:  0.4153389036655426
Grouping:  minute_cryptic_ALL
Number of Groupings : 26
Silhouette Score:  0.45207664370536804
Grouping:  minute_cryptic_ho_7
Number of Groupings : 12
Silhouette Score:  0.4649384617805481
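
The seed lookups in `indicator_dicts` are built above but are not yet compared with the cluster labels. One possible way to connect them, sketched here on toy data, is to count which cluster each category's seed phrases landed in. The helper name `seed_cluster_counts` and the toy inputs are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def seed_cluster_counts(seed_index_lookup, cluster_labels):
    """For each seed category, count the cluster labels of its seed phrases."""
    return {
        category: Counter(int(cluster_labels[i]) for i in indexes)
        for category, indexes in seed_index_lookup.items()
    }

# Toy data: 8 phrases in 3 clusters; three seed categories, one of them empty
toy_labels = [0, 0, 1, 1, 1, 2, 2, 0]
toy_seeds = {"anagram": [0, 1, 7], "reversal": [2, 3, 5], "hidden": []}

counts = seed_cluster_counts(toy_seeds, toy_labels)
for category, counter in counts.items():
    print(category, dict(counter))
```

With the notebook's variables, a call such as `seed_cluster_counts(indicator_dicts[0], seed_results['cc_for_dummies_ALL'][1]['cluster'].to_numpy())` would show whether the seeds of one wordplay category concentrate in a single agglomerative cluster or spread across several.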
