# ITR: generate clusters based on semantic similarity

**Tangxiaoxue Zhang**

## Method's Current Usage

A common “semantic clustering” workflow in recent research is: (i) represent each item (text, image, or image–text pair) as a dense embedding from a pretrained model, then (ii) run a clustering algorithm (often k-means) in that vector space to discover groups that share meaning, topic, or latent structure. This is attractive because embeddings compress high-dimensional raw data (pixels or tokens) into a geometry where semantic similarity is more likely to correspond to distance.

In computer vision and multimodal work, embeddings are often treated as a way to “name” or “index” concept neighborhoods. For example, MoDE (Ma et al., 2024) clusters large-scale CLIP-style training data so that each “data expert” specializes in one semantic cluster, reducing noise from mismatched image–caption pairs; in practice, it clusters the dataset using caption-side embeddings, trains a separate model per cluster, and ensembles them at inference time. This is a strong example of semantic clustering as data organization: clusters become actionable units for model training, routing, and analysis.

In NLP, the same logic appears in embedding-based clustering for discovering latent intents or topics without labels. Park et al. (2024) explicitly compare utterance embedding models (e.g., MiniLM / MPNet / SimCSE) and multiple clustering methods (including k-means) for intent induction. Their main takeaway is directly relevant to my experience when testing with my data: clustering outcomes depend heavily on both the embedding space and the clustering algorithm, so it is not enough to “just run k-means”; I also need to evaluate robustness and metric sensitivity.

A third line of work focuses on which embedding pooling strategy produces clusterable representations. Ortakci (2024) tests SBERT variants and pooling methods (CLS / mean / max) across many text clustering tasks and shows that there is no universal “best SBERT,” but mean pooling is most consistently effective in their benchmark set.  This matters for my project because “semantic similarity” is only as good as the embedding: if the embedding model collapses distinctions that humans care about, k-means will appear unstable or misaligned with category labels.
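
To make the three pooling strategies concrete, here is a minimal sketch on toy numbers. The function name `pool_tokens` and the toy token matrix are my own illustrations, not code from Ortakci (2024); it assumes a BERT-style layout where the first position is the [CLS] token and padding is marked by an attention mask. The embeddings used later in this notebook are already single vectors per concept, so this only illustrates what "pooling" means.

```python
import numpy as np

def pool_tokens(token_embeddings, attention_mask, strategy="mean"):
    """Collapse per-token vectors (tokens x dims) into one sentence vector.

    attention_mask marks real tokens with 1 and padding with 0.
    Assumes the first position holds the [CLS] token (BERT-style models).
    """
    token_embeddings = np.asarray(token_embeddings, dtype=float)
    mask = np.asarray(attention_mask, dtype=bool)
    real_tokens = token_embeddings[mask]  # drop padded positions
    if strategy == "cls":
        return token_embeddings[0]
    if strategy == "mean":
        return real_tokens.mean(axis=0)
    if strategy == "max":
        return real_tokens.max(axis=0)
    raise ValueError(f"unknown pooling strategy: {strategy}")

# Toy input (hypothetical): 4 token positions, 3 dimensions, last one is padding.
toy_tokens = np.array([[1.0, 0.0, 2.0],
                       [3.0, 4.0, 0.0],
                       [2.0, 2.0, 1.0],
                       [9.0, 9.0, 9.0]])
toy_mask = [1, 1, 1, 0]

for name in ("cls", "mean", "max"):
    print(name, pool_tokens(toy_tokens, toy_mask, name))
```

The same token matrix yields three different sentence vectors, which is why the pooling choice can change which items end up close together before any clustering is run.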

How this informs my project: my goal is to form semantically close groups of images that plausibly co-occur “in reality” (shared contexts, shared cultural meaning, similar concepts). Embedding + k-means is widely used for exactly that: it produces a data-driven semantic partition that can be used as (a) a control/stratification variable, (b) a sampling tool for constructing matched sets, or (c) a way to test whether memorability patterns are driven by semantic neighborhoods rather than isolated object labels. At the same time, the literature emphasizes that stability is not guaranteed: it must be measured and engineered (e.g., pooling choice, normalization, multiple restarts, stability metrics).

## Test on my data

### Read Data

In [1]:
import pandas as pd

vectors_path = "data/semantic_embedding.csv"
word_list_path = "data/word_list.csv"
concept_path = "data/concepts.tsv"

vectors = pd.read_csv(vectors_path, header=None)
word_list = pd.read_csv(word_list_path, header=None)
concept = pd.read_csv(concept_path, sep="\t")

column_category = ["Bottom-up Category (Human Raters)",
                   "Top-down Category (WordNet)",
                   "Top-down Category (manual selection)"]
concept_list = concept[column_category]

word_list.columns = ["Word"]
vectors.columns = [f"Dimension_{i}" for i in range(1, len(vectors.columns)+1)]
dataset = pd.concat([word_list, concept_list, vectors], axis=1)
dataset.index = range(1, len(dataset)+1)
dataset.head()

Unnamed: 0,Word,Bottom-up Category (Human Raters),Top-down Category (WordNet),Top-down Category (manual selection),Dimension_1,Dimension_2,Dimension_3,Dimension_4,Dimension_5,Dimension_6,...,Dimension_291,Dimension_292,Dimension_293,Dimension_294,Dimension_295,Dimension_296,Dimension_297,Dimension_298,Dimension_299,Dimension_300
1,aardvark,animal,animal,animal,0.002518,0.068236,-0.028361,0.166795,-0.065438,0.031492,...,-0.045665,0.011164,-0.005354,0.035327,-0.001382,0.077301,-0.083461,0.064104,-0.004183,0.046304
2,abacus,,,home decor,0.056792,-0.063938,-0.001322,0.045321,-0.038369,0.048631,...,-0.00218,0.041572,-0.017164,-0.052926,-0.051482,-0.040223,-0.066239,0.016329,-0.076589,-0.009422
3,accordion,musical instrument,musical instrument,musical instrument,0.027205,0.002443,-0.02544,0.022057,-0.027733,0.004925,...,0.018821,0.07433,-0.086789,-0.115503,0.019062,0.06938,0.001089,-0.006804,0.006405,0.036277
4,acorn,,fruit,,0.034074,0.006323,-0.079977,0.064698,-0.002513,-0.01995,...,0.012923,0.023666,-0.046211,-0.060001,0.047634,-0.036336,0.012826,-0.053503,-0.013425,0.05968
5,air conditioner,,,electronic device,0.001522,0.003388,-0.031035,-0.008351,-0.013928,0.066164,...,-0.048569,0.047054,-0.002966,0.038435,0.027077,-0.031518,-0.092717,0.145555,0.015335,0.023629


In [2]:
# For clustering: select the vector columns [Dimension_1 to Dimension_300]
column_selected_vector = dataset.columns[4:]
dataset_vector = dataset[column_selected_vector]  # select only vector columns
# Remove rows with NaN values; .copy() so the "cluster" column added later
# is written to an independent frame (avoids SettingWithCopyWarning)
dataset_vector = dataset_vector.loc[~dataset_vector["Dimension_1"].isna()].copy()

In [3]:
dataset_vector.isna().sum().sum() # check if there is any NaN value

np.int64(0)

In [4]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=27, random_state=42)
clusters = kmeans.fit_predict(dataset_vector)

dataset_vector["cluster"] = clusters

In [5]:
dataset_vector

Unnamed: 0,Dimension_1,Dimension_2,Dimension_3,Dimension_4,Dimension_5,Dimension_6,Dimension_7,Dimension_8,Dimension_9,Dimension_10,...,Dimension_292,Dimension_293,Dimension_294,Dimension_295,Dimension_296,Dimension_297,Dimension_298,Dimension_299,Dimension_300,cluster
1,0.002518,0.068236,-0.028361,0.166795,-0.065438,0.031492,0.072952,0.005569,0.006577,-0.011896,...,0.011164,-0.005354,0.035327,-0.001382,0.077301,-0.083461,0.064104,-0.004183,0.046304,9
2,0.056792,-0.063938,-0.001322,0.045321,-0.038369,0.048631,0.050771,-0.090274,-0.016943,0.067346,...,0.041572,-0.017164,-0.052926,-0.051482,-0.040223,-0.066239,0.016329,-0.076589,-0.009422,25
3,0.027205,0.002443,-0.025440,0.022057,-0.027733,0.004925,0.069669,-0.036609,0.043710,-0.023415,...,0.074330,-0.086789,-0.115503,0.019062,0.069380,0.001089,-0.006804,0.006405,0.036277,13
4,0.034074,0.006323,-0.079977,0.064698,-0.002513,-0.019950,0.094188,-0.002276,-0.002716,0.128304,...,0.023666,-0.046211,-0.060001,0.047634,-0.036336,0.012826,-0.053503,-0.013425,0.059680,11
5,0.001522,0.003388,-0.031035,-0.008351,-0.013928,0.066164,0.036939,-0.052600,0.041855,0.092639,...,0.047054,-0.002966,0.038435,0.027077,-0.031518,-0.092717,0.145555,0.015335,0.023629,19
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1850,-0.007247,-0.009497,-0.098389,0.014883,-0.074022,-0.034846,0.031211,-0.075538,0.023160,0.065245,...,0.112528,-0.085710,-0.013974,-0.015309,-0.062135,-0.036656,-0.000741,0.035923,0.008339,5
1851,-0.046959,0.029584,0.026730,-0.007823,-0.048818,-0.022490,0.042724,-0.058452,-0.051158,0.060507,...,0.004924,-0.097171,-0.011267,0.022133,-0.018902,0.015964,-0.019782,0.044278,0.005173,16
1852,-0.012443,-0.002405,-0.115256,0.053251,-0.037307,-0.033384,0.041371,-0.027366,-0.069512,-0.011347,...,0.061006,-0.054515,0.034001,0.131828,0.113010,0.030256,0.042306,0.012012,0.003221,9
1853,-0.002309,0.003196,-0.023785,0.035685,-0.052391,-0.026013,-0.007483,0.030385,0.060418,-0.001094,...,0.004240,-0.056979,0.045449,0.008078,-0.051059,0.035830,-0.024869,-0.006122,0.064542,6


In [6]:
category = dataset_vector["cluster"].unique()

dataset["cluster"] = None
for i in category:
    dataset.loc[dataset_vector[dataset_vector["cluster"]==i].index, "cluster"] = i

In [10]:
dataset_cluster = dataset.iloc[:,[0,1,2,3,len(dataset.columns)-1]]
dataset_cluster = dataset_cluster.sort_values(by="cluster")
dataset_cluster

Unnamed: 0,Word,Bottom-up Category (Human Raters),Top-down Category (WordNet),Top-down Category (manual selection),cluster
1326,roller coaster,,,,0
581,ferry,,vehicle,vehicle,0
343,chute,,,,0
604,fishing pole,,,,0
140,blimp,,vehicle,,0
...,...,...,...,...,...
1293,ready meal,food,,food,
1522,spring roll,food,,food,
1600,swing set,,,,
1700,train car,vehicle,,,
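
One way to check how far the clusters line up with the category columns is to cross-tabulate them and summarize the table with purity and the adjusted Rand index. This is a minimal sketch: the helper `category_agreement` is hypothetical, and the toy frame reuses a few words from the table above with made-up cluster ids, so the numbers say nothing about the real data. Rows without a category or without a cluster are skipped.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

def category_agreement(frame, category_col, cluster_col="cluster"):
    """Cross-tabulate clusters against one label column, skipping unlabeled rows."""
    labeled = frame.dropna(subset=[category_col, cluster_col])
    clusters = labeled[cluster_col].astype(int)  # the notebook stores ids as objects
    categories = labeled[category_col]
    table = pd.crosstab(clusters, categories)
    # Purity: share of labeled items that sit in their cluster's majority category.
    purity = table.max(axis=1).sum() / table.values.sum()
    ari = adjusted_rand_score(categories, clusters)
    return table, purity, ari

# Toy frame (cluster ids are invented for illustration only).
toy = pd.DataFrame({
    "Word": ["ferry", "blimp", "train car", "ready meal", "spring roll", "acorn"],
    "Top-down Category (manual selection)":
        ["vehicle", "vehicle", "vehicle", "food", "food", None],
    "cluster": [0, 0, 1, 1, 1, 2],
})

table, purity, ari = category_agreement(toy, "Top-down Category (manual selection)")
print(table)
print(f"purity={purity:.2f}, ARI={ari:.2f}")
```

On the real data the same call could be repeated for each of the three category columns, for example `category_agreement(dataset, "Top-down Category (WordNet)")`, keeping in mind that many rows have no category and are therefore excluded from the comparison.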


In [8]:
# dataset_cluster[dataset_cluster["cluster"]==15]["Word"].to_csv("cluster_weapon.txt", index=False, header=False)

Here, I read back a saved, one-time clustering output, because k-means can produce different partitions from run to run (for example, when the random seed or initialization changes).

In [9]:
with open("cluster_weapon.txt", "r") as f:
    lines = f.readlines()
    for line in lines:
        print(line.strip())

gun
landmine
lighter
flamethrower
dynamite
fire
firecracker
firetruck
fireworks
solar panel
extinguisher
cannon
fire alarm
bulldozer
blowtorch
blowgun
rifle
grenade
catapult
rocket
trigger
detonator
shell
hail
dart
missile
bazooka
cannonball
squirt gun
revolver
tank
bullet
remote control
slingshot
torpedo
submarine
bomb
machine gun
armor
bulletproof vest
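
A list like this mixes clear members with looser ones (for example "solar panel", "hail", or "remote control"). One way to surface such items is to rank a cluster's members by their distance to the cluster centroid. The sketch below is an assumption-laden illustration: `rank_by_centroid_distance` is a hypothetical helper, and the vectors are synthetic stand-ins rather than the real embeddings. It relies on `KMeans.transform`, which returns each row's distance to every centroid.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def rank_by_centroid_distance(vectors, words, model, cluster_id):
    """List one cluster's members from farthest to closest to its centroid."""
    labels = model.predict(vectors)
    distances = model.transform(vectors)  # distance of each row to every centroid
    in_cluster = labels == cluster_id
    ranked = pd.DataFrame({
        "Word": np.asarray(words)[in_cluster],
        "distance": distances[in_cluster, cluster_id],
    })
    return ranked.sort_values("distance", ascending=False).reset_index(drop=True)

# Synthetic stand-in for the embeddings: two tight groups plus one stray point.
rng = np.random.default_rng(0)
group_a = 0.01 * rng.standard_normal((10, 5))
group_b = 5.0 + 0.01 * rng.standard_normal((10, 5))
stray = np.array([[1.0, 0.0, 0.0, 0.0, 0.0]])  # nearer to group_a, but not part of it
toy_vectors = np.vstack([group_a, group_b, stray])
toy_words = [f"a{i}" for i in range(10)] + [f"b{i}" for i in range(10)] + ["stray"]

toy_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_vectors)
stray_cluster = toy_model.predict(stray)[0]
ranked = rank_by_centroid_distance(toy_vectors, toy_words, toy_model, stray_cluster)
print(ranked.head(3))
```

In the notebook, the fitted `kmeans` object and the 300 dimension columns could be passed in the same way. Items at the top of the ranking are candidates for manual review, not automatic removal.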


## Reflection Based on Current Exploration
Clustering is not just a technical step—it is a way of imposing structure on a cultural object. In my case, the “object” is a universe of images (and their concepts) that carry meanings shaped by human experience, cultural categories, and media exposure. Semantic embedding + k-means effectively asks: what counts as “similar” in a learned semantic geometry, and what kinds of meaning neighborhoods does that geometry produce?

If the clusters are stable and interpretable, they can become a useful lens for my broader project because they provide a middle layer between raw stimuli and outcomes like memorability. Instead of treating each image as isolated, clustering lets me test whether memorability behaves like a property of semantic neighborhoods: e.g., are certain semantic clusters systematically more memorable because they map to culturally salient themes, threat-related content, novelty schemas, or repeated media motifs? This is aligned with the idea that cultural environments and shared exposure shape what stands out and what gets remembered—clustering helps operationalize that “shared structure” as measurable groups.

At the same time, the instability I observed is itself socially meaningful. If k-means partitions change drastically across runs (or across minor preprocessing choices), it suggests the embedding space may not contain strong, consensual boundaries between categories at the granularity I chose (k=27). Substantively, this can be interpreted as: the dataset’s human-coded categories may not correspond to a single coherent semantic taxonomy, or the embedding model is encoding similarity in a way that blurs distinctions humans consider meaningful. That mismatch matters for social science: it reminds us that “semantic similarity” is not a neutral fact—it’s a model-mediated cultural measurement, influenced by what the embedding model saw during training and what cultural regularities it absorbed.
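
The run-to-run variation discussed here can be put on a scale by fitting k-means with several seeds and comparing every pair of partitions with the adjusted Rand index (ARI), where 1 means identical partitions and values near 0 mean chance-level agreement. This is a minimal sketch under my own assumptions: `seed_stability` is a hypothetical helper, the data are synthetic, and `n_init=1` is chosen on purpose so that each seed corresponds to a single initialization.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def seed_stability(vectors, n_clusters, seeds=(0, 1, 2, 3, 4), n_init=1):
    """Mean and minimum pairwise ARI between k-means runs that differ only in seed."""
    runs = [
        KMeans(n_clusters=n_clusters, n_init=n_init, random_state=seed).fit_predict(vectors)
        for seed in seeds
    ]
    scores = [adjusted_rand_score(a, b) for a, b in combinations(runs, 2)]
    return float(np.mean(scores)), float(np.min(scores))

# Synthetic comparison: clearly separated groups versus structureless noise.
rng = np.random.default_rng(0)
centers = 10.0 * np.eye(3, 10)  # three group centers in 10 dimensions
blobs = np.vstack([c + 0.1 * rng.standard_normal((60, 10)) for c in centers])
noise = rng.standard_normal((180, 10))

blob_mean, blob_min = seed_stability(blobs, n_clusters=3)
noise_mean, noise_min = seed_stability(noise, n_clusters=3)
print(f"separated groups: mean ARI={blob_mean:.2f}, min ARI={blob_min:.2f}")
print(f"pure noise:       mean ARI={noise_mean:.2f}, min ARI={noise_min:.2f}")
```

Applied to the real vectors with `n_clusters=27`, a mean ARI well below 1 would support the reading that the embedding space lacks sharp boundaries at this granularity, while a value close to 1 would suggest the partition is more stable than it first appeared.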

This reflection directly informs my next methodological step. If my research question depends on “semantically close groups that are likely together in reality,” then I need to show that those groups are (1) stable, (2) interpretable, and (3) not merely an artifact of forcing k=27. The solution is not to abandon clustering, but to treat clustering results as a claim that must be validated: add stability checks, justify k (or compare multiple k), and potentially adopt a more stable variant (e.g., more restarts, cosine-aware clustering, PCA denoising). This makes my clustering outputs more defensible as a measurement of semantic structure—and therefore more useful for explaining memorability patterns rather than accidentally inventing them. 
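
The validation steps named above (compare multiple k, cosine-aware clustering, PCA denoising, more restarts) can be combined in one small scan. The sketch below is only one possible setup, with a hypothetical helper `scan_k` and synthetic data. It L2-normalizes each row first: for unit-length vectors, squared Euclidean distance equals 2 × (1 − cosine similarity), so ordinary k-means then respects cosine similarity. Note that the optional PCA step comes after normalization, so the projected rows are no longer exactly unit length. The silhouette score is used here as one of several possible criteria for comparing k.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import normalize

def scan_k(vectors, k_values, n_components=None, n_init=10, random_state=0):
    """Silhouette score for each k after L2 normalization and optional PCA."""
    features = normalize(np.asarray(vectors, dtype=float))  # unit-length rows
    if n_components is not None:
        pca = PCA(n_components=n_components, random_state=random_state)
        features = pca.fit_transform(features)  # keep only the leading directions
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=n_init, random_state=random_state)
        labels = model.fit_predict(features)
        scores[k] = float(silhouette_score(features, labels))
    return scores

# Synthetic data with four groups pointing in different directions.
rng = np.random.default_rng(0)
centers = 5.0 * np.eye(4, 20)
toy_vectors = np.vstack([c + 0.3 * rng.standard_normal((50, 20)) for c in centers])

scores = scan_k(toy_vectors, k_values=range(2, 7), n_components=5)
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(k, round(score, 3))
print("best k by silhouette:", best_k)
```

For the real embeddings, a scan over a range of k that includes 27 would show whether 27 is a natural granularity or simply one point on a flat curve, in which case the choice of k needs a substantive justification instead.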

----

## Bibliography
Ma, J., Huang, P.-Y., Xie, S., Li, S.-W., Zettlemoyer, L., Chang, S.-F., Yih, W.-T., & Xu, H. (2024). MoDE: CLIP Data Experts via Clustering. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 26344–26353. https://doi.org/10.1109/CVPR52733.2024.02489

Ortakci, Y. (2024). Revolutionary text clustering: Investigating transfer learning capacity of SBERT models through pooling techniques. Engineering Science and Technology, an International Journal, 55, 101730. https://doi.org/10.1016/j.jestch.2024.101730

Park, J., Jang, Y., Lee, C., & Lim, H. (2024). Analysis of Utterance Embeddings and Clustering Methods Related to Intent Induction for Task-Oriented Dialogue (No. arXiv:2212.02021). arXiv. https://doi.org/10.48550/arXiv.2212.02021

