<a href="https://colab.research.google.com/github/samipn/clustering_demos/blob/main/document_clustering_llm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment (g): Document Clustering with LLM Embeddings

This notebook embeds a small corpus with the `sentence-transformers` model `all-mpnet-base-v2`, clusters the embeddings with k-means, and reports clustering quality via the silhouette score.


In [1]:
!pip install --quiet sentence-transformers


[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m363.4/363.4 MB[0m [31m4.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.8/13.8 MB[0m [31m50.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m50.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m883.7/883.7 kB[0m [31m29.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m664.8/664.8 MB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m211.5/211.5 MB[0m [31m7.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m56.3/56.3 MB[0m [31m15.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m127.9/127.9 MB[0m [31m11.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [2]:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np


In [3]:
# Example corpus (replace with your own documents)
documents = [
    "Neural networks are a powerful tool for machine learning.",
    "Gradient descent is used to optimize neural networks.",
    "Transformers have revolutionized natural language processing.",
    "The stock market fluctuates based on many economic factors.",
    "Bonds are often considered safer than stocks.",
    "Portfolio diversification reduces overall risk.",
    "Soccer is the most popular sport in many countries.",
    "Basketball is a fast-paced indoor sport.",
    "Tennis is played either individually or in pairs.",
]

for i, doc in enumerate(documents):
    print(f"[{i}] {doc}")


[0] Neural networks are a powerful tool for machine learning.
[1] Gradient descent is used to optimize neural networks.
[2] Transformers have revolutionized natural language processing.
[3] The stock market fluctuates based on many economic factors.
[4] Bonds are often considered safer than stocks.
[5] Portfolio diversification reduces overall risk.
[6] Soccer is the most popular sport in many countries.
[7] Basketball is a fast-paced indoor sport.
[8] Tennis is played either individually or in pairs.


In [4]:
# Embed documents with the all-mpnet-base-v2 sentence-transformer (768-dimensional vectors)
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

embeddings = model.encode(documents, convert_to_numpy=True, show_progress_bar=True)
print("Embedding shape:", embeddings.shape)





Embedding shape: (9, 768)
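
Before clustering, it can help to look at how similar the vectors are to one another. The following is a minimal sketch of a pairwise cosine similarity matrix using NumPy only. The helper name `cosine_similarity_matrix` and the random `stand_in` array are assumptions made for illustration; `stand_in` only mimics the `(9, 768)` shape printed above and is not real model output.

```python
import numpy as np

def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Return the pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero-length rows
    return unit @ unit.T

# Hypothetical stand-in for `embeddings`: random vectors with the same shape.
rng = np.random.default_rng(0)
stand_in = rng.normal(size=(9, 768))

sim = cosine_similarity_matrix(stand_in)
print(sim.shape)                       # (9, 9)
print(np.allclose(np.diag(sim), 1.0))  # True: every row is identical to itself
```

With the real `embeddings` array in place of `stand_in`, documents on the same topic would be expected to score higher against each other than against the rest, though that is not verified here.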


In [5]:
# Cluster embeddings & evaluate
num_clusters = 3
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
labels = kmeans.fit_predict(embeddings)

# Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters
sil = silhouette_score(embeddings, labels)
print("Silhouette score:", sil)

# Show clusters and their member documents
clusters = {i: [] for i in range(num_clusters)}
for doc, label in zip(documents, labels):
    clusters[label].append(doc)

for cluster_id, docs in clusters.items():
    print(f"\n=== Cluster {cluster_id} ===")
    for d in docs:
        print("-", d)


Silhouette score: 0.18386763

=== Cluster 0 ===
- The stock market fluctuates based on many economic factors.
- Bonds are often considered safer than stocks.
- Portfolio diversification reduces overall risk.

=== Cluster 1 ===
- Soccer is the most popular sport in many countries.
- Basketball is a fast-paced indoor sport.
- Tennis is played either individually or in pairs.

=== Cluster 2 ===
- Neural networks are a powerful tool for machine learning.
- Gradient descent is used to optimize neural networks.
- Transformers have revolutionized natural language processing.
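
To make the silhouette score easier to interpret, here is a minimal from-scratch sketch of the computation with Euclidean distance, which is the default metric of `silhouette_score`. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to the members of any other cluster, and the point's score is `(b - a) / max(a, b)`. The function name `silhouette_from_scratch` and the toy arrays are assumptions for illustration, not part of the notebook, and the sketch assumes at least two clusters and no duplicate points.

```python
import numpy as np

def silhouette_from_scratch(X: np.ndarray, labels: np.ndarray) -> float:
    """Mean silhouette coefficient over all rows of X, using Euclidean distance."""
    # Pairwise Euclidean distances between all rows of X.
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    cluster_ids = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # a singleton cluster keeps a score of 0
        # a: mean distance to the other members of the point's own cluster
        # (dist[i, i] is 0, so dividing by n_same - 1 excludes the point itself).
        a = dist[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean() for c in cluster_ids if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

# Hypothetical 2-D toy data (not the notebook's embeddings): two tight groups.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels_toy = np.array([0, 0, 1, 1])
print(silhouette_from_scratch(X_toy, labels_toy))  # about 0.90
```

On the toy data the two groups are far apart relative to their internal spread, so the score is close to 1. A score near 0, by contrast, indicates that within-cluster and nearest-cluster distances are of similar size, even when the cluster assignments themselves look sensible.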
