# Sentence-Transformers — Intro Practice

**Goal.** This notebook practices the core workflow of `sentence-transformers`:
1) load a pre-trained sentence embedding model,
2) encode short news-like sentences,
3) compute cosine similarities and run semantic search (Top-K),
4) cluster embeddings with KMeans.

**Model.** `sentence-transformers/all-MiniLM-L6-v2` (384-dim; small & fast).

**How to run.**
- Kernel: `Python (framing-py310)`
- Dependencies: see `requirements.txt`
- Execute the code cells in order; they fall into three stages: Load → Similarity/Search → Clustering.

**Expected outputs.**
- Embedding shape like `(6, 384)`
- A 6×6 cosine similarity matrix
- Top-K search results for the query “Central bank hikes rates again.”
- KMeans cluster IDs for K=3, roughly matching three themes (economy, technology, sports); with only six sentences the grouping will not be perfect.


In [1]:
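import sys
import importlib.util

print(sys.executable)  # path should include \envs\framing-py310\python.exe
# importlib.util.find_spec replaces pkgutil.find_loader, which is deprecated in newer Python versions.
print("sentence-transformers installed?",
      importlib.util.find_spec("sentence_transformers") is not None)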


E:\Anaconda\download\envs\framing-py310\python.exe
sentence-transformers installed? True


In [1]:
# Practice with sentence-transformers: load a small, fast model and encode a tiny corpus.
# Model choice: "all-MiniLM-L6-v2" is lightweight (384-dim) and good for demos on CPU.

from sentence_transformers import SentenceTransformer, util
import torch


In [2]:
# 1) Load the sentence-embedding model (downloads on first use).
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


In [3]:
# 2) A tiny toy corpus of news-like sentences across different topics.
sentences = [
    "The Federal Reserve raised interest rates again.",
    "Stocks fell on inflation worries.",
    "Apple unveils the new iPhone.",
    "Google announces an AI laptop chip.",
    "Real Madrid wins the derby.",
    "Olympic committee adds a new sport."
]

In [4]:
# 3) Encode all sentences into embeddings (vectors).
#    - convert_to_tensor=True: returns a PyTorch tensor for fast similarity ops.
#    - normalize_embeddings=True: L2-normalizes vectors (cosine similarity becomes dot product).
embeddings = model.encode(
    sentences,
    convert_to_tensor=True,
    normalize_embeddings=True
)

print("Embeddings shape (num_sentences, dim):", tuple(embeddings.shape))
print("Device:", embeddings.device.type.upper())  # the device the embeddings actually live on

Embeddings shape (num_sentences, dim): (6, 384)
Device: CPU
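
The comment on `normalize_embeddings=True` says that cosine similarity becomes a plain dot product once vectors are L2-normalized. A minimal sketch of that identity follows, in pure Python so it runs without the model. The 3-d vectors are made up for illustration (they are not real embeddings), and the helper names `l2_normalize`, `dot`, and `cosine` are hypothetical.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed from the raw, unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors for illustration only; not real sentence embeddings.
a = [1.0, 2.0, 2.0]
b = [2.0, 0.0, 1.0]

cos_raw = cosine(a, b)                            # cosine on raw vectors
dot_unit = dot(l2_normalize(a), l2_normalize(b))  # dot product on unit vectors

print(f"cosine(raw) = {cos_raw:.6f}")
print(f"dot(unit)   = {dot_unit:.6f}")
```

Both numbers agree up to floating-point rounding, which is why the later cells can treat dot products of the normalized embeddings as cosine similarities.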


In [5]:
# (A) compute pairwise cosine similarities among the 6 sentences
# (B) run a semantic search: given a query, find the Top-K most similar sentences

from sentence_transformers import util  # already imported above; repeated so this section can run on its own


In [6]:
# (A) Pairwise cosine similarity matrix (6x6). Higher value => more similar.
sim_matrix = util.cos_sim(embeddings, embeddings)

print("Pairwise cosine similarity matrix (rounded to 3 decimals):")
for i in range(sim_matrix.size(0)):
    row = [f"{float(sim_matrix[i, j]):.3f}" for j in range(sim_matrix.size(1))]
    print(row)

Pairwise cosine similarity matrix (rounded to 3 decimals):
['1.000', '0.307', '0.233', '0.092', '-0.070', '0.167']
['0.307', '1.000', '0.085', '0.006', '-0.074', '-0.060']
['0.233', '0.085', '1.000', '0.307', '0.021', '0.193']
['0.092', '0.006', '0.307', '1.000', '-0.021', '0.131']
['-0.070', '-0.074', '0.021', '-0.021', '1.000', '0.147']
['0.167', '-0.060', '0.193', '0.131', '0.147', '1.000']
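
One way to read the matrix is to rank the off-diagonal entries, since the diagonal is always 1.000 (each sentence compared with itself). The sketch below does that using the values copied from the printed output above. The helper `top_pairs` is a hypothetical name, and the block uses plain lists rather than the tensor so it runs on its own.

```python
# Values copied from the printed 6x6 similarity matrix above.
sim = [
    [1.000, 0.307, 0.233, 0.092, -0.070, 0.167],
    [0.307, 1.000, 0.085, 0.006, -0.074, -0.060],
    [0.233, 0.085, 1.000, 0.307, 0.021, 0.193],
    [0.092, 0.006, 0.307, 1.000, -0.021, 0.131],
    [-0.070, -0.074, 0.021, -0.021, 1.000, 0.147],
    [0.167, -0.060, 0.193, 0.131, 0.147, 1.000],
]

def top_pairs(matrix, n=2):
    """Return the n highest-scoring (score, i, j) pairs with i < j.

    Only the upper triangle is scanned: the matrix is symmetric and the
    diagonal (self-similarity) carries no information.
    """
    pairs = [
        (matrix[i][j], i, j)
        for i in range(len(matrix))
        for j in range(i + 1, len(matrix))
    ]
    return sorted(pairs, reverse=True)[:n]

best = top_pairs(sim)
for score, i, j in best:
    print(f"{score:.3f}  sentences {i} and {j}")
```

The two strongest pairs are the two technology sentences (indices 2 and 3) and the two economy sentences (indices 0 and 1), both at 0.307. The two sports sentences (indices 4 and 5) score only 0.147.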


In [7]:
# (B) Semantic search (Top-K): encode the query and retrieve the K most similar sentences
query = "Central bank hikes rates again."
q_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

top_k = 3
hits = util.semantic_search(q_emb, embeddings, top_k=top_k)[0]

print("\nQuery:", query)
print(f"Top-{top_k} results (score -> sentence):")
for h in hits:
    print(f"{h['score']:.3f} -> {sentences[h['corpus_id']]}")


Query: Central bank hikes rates again.
Top-3 results (score -> sentence):
0.754 -> The Federal Reserve raised interest rates again.
0.213 -> Apple unveils the new iPhone.
0.195 -> Stocks fell on inflation worries.
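
Conceptually, Top-K search scores every corpus vector against the query and keeps the K best. The sketch below shows that idea in pure Python. It is not how `util.semantic_search` is implemented internally, the helper `top_k_by_dot` is a hypothetical name, and the 2-d unit vectors are toy data rather than real embeddings.

```python
import heapq

def top_k_by_dot(query, corpus, k):
    """Return the k best (score, corpus_index) pairs, highest score first.

    Assumes the query and every corpus vector are already L2-normalized,
    so the dot product equals the cosine similarity.
    """
    scored = (
        (sum(q * c for q, c in zip(query, vec)), idx)
        for idx, vec in enumerate(corpus)
    )
    return heapq.nlargest(k, scored)

# Toy unit vectors for illustration only.
corpus = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
    [-1.0, 0.0],
]
query = [0.8, 0.6]

hits = top_k_by_dot(query, corpus, k=2)
for score, idx in hits:
    print(f"{score:.3f} -> corpus[{idx}]")
```

Brute-force scoring like this is fine for a six-sentence corpus. The design choice only starts to matter for large corpora, where batching or an approximate nearest-neighbor index is the usual next step.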


In [8]:
# Clustering groups similar sentences together WITHOUT labels.
# We use KMeans with K=3 (roughly: economy, technology, sports).

from sklearn.cluster import KMeans
import numpy as np


In [9]:
# 1) Convert embeddings to NumPy (scikit-learn expects NumPy arrays).
X = embeddings.detach().cpu().numpy()

# 2) Run KMeans with a fixed random_state for reproducibility.
k = 3
kmeans = KMeans(n_clusters=k, random_state=42, n_init="auto").fit(X)
labels = kmeans.labels_

In [10]:
# 3) Print cluster assignment for each sentence.
print(f"KMeans clustering with K={k}:")
for lab, sent in sorted(zip(labels, sentences), key=lambda x: (x[0], x[1])):
    print(f"[cluster {lab}] {sent}")


KMeans clustering with K=3:
[cluster 0] Apple unveils the new iPhone.
[cluster 0] Google announces an AI laptop chip.
[cluster 0] Olympic committee adds a new sport.
[cluster 1] Real Madrid wins the derby.
[cluster 2] Stocks fell on inflation worries.
[cluster 2] The Federal Reserve raised interest rates again.
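
To make the clustering step less of a black box, here is a minimal sketch of the basic k-means loop (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its members, and repeat until assignments stop changing. This is a simplified illustration, not scikit-learn's implementation, which among other things picks its starting centroids differently. The function names and the 2-d points are made up for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd_kmeans(points, init_centroids, max_iters=10):
    """Run plain Lloyd iterations from the given starting centroids."""
    centroids = [list(c) for c in init_centroids]  # copy; do not mutate the input
    labels = []
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Three well-separated toy groups in 2-d (illustration only).
points = [
    [0.0, 0.0], [0.0, 1.0],
    [5.0, 5.0], [5.0, 6.0],
    [10.0, 0.0], [10.0, 1.0],
]
# Hypothetical fixed initialization: one starting centroid per group.
labels_demo, centroids_demo = lloyd_kmeans(points, [points[0], points[2], points[4]])
print(labels_demo)
print(centroids_demo)
```

Because cluster IDs depend on the starting centroids, the numbers themselves (0, 1, 2) carry no meaning; only which sentences share an ID does.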


In [11]:
# 4) (Optional) Inspect how "central" each sentence is to each cluster via centroid similarity.
centroids = kmeans.cluster_centers_  # shape: (k, dim)
centroids = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)  # L2-normalize

print("\nCentroid-to-sentence cosine similarities (rounded):")
for ci, c in enumerate(centroids):
    sims = np.dot(X, c)  # dot product == cosine because both sides are normalized
    sims_str = ", ".join(f"{float(s):.3f}" for s in sims)
    print(f"cluster {ci}: {sims_str}")


Centroid-to-sentence cosine similarities (rounded):
cluster 0: 0.238, 0.015, 0.727, 0.696, 0.071, 0.641
cluster 1: -0.070, -0.074, 0.021, -0.021, 1.000, 0.147
cluster 2: 0.809, 0.809, 0.197, 0.061, -0.089, 0.066
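
The table above can be read column by column: for each sentence, the cluster whose centroid has the highest cosine similarity. The sketch below does this with the values copied from the printed output. Note that KMeans itself assigns by Euclidean distance to the unnormalized centroids, so this argmax is a reading aid that agrees with the labels here, not a guaranteed equivalence.

```python
# Rows = clusters 0..2, columns = sentences 0..5; values copied from the output above.
centroid_sims = [
    [0.238, 0.015, 0.727, 0.696, 0.071, 0.641],
    [-0.070, -0.074, 0.021, -0.021, 1.000, 0.147],
    [0.809, 0.809, 0.197, 0.061, -0.089, 0.066],
]

num_clusters = len(centroid_sims)
num_sentences = len(centroid_sims[0])

# For each sentence (column), pick the cluster (row) with the highest similarity.
nearest = [
    max(range(num_clusters), key=lambda c: centroid_sims[c][s])
    for s in range(num_sentences)
]
print(nearest)
```

This gives `[2, 2, 0, 0, 1, 0]`, which matches the printed cluster assignments. Cluster 1 contains a single sentence, so its centroid is that sentence's embedding: hence the 1.000 in its row, and the rest of that row repeats row 4 of the pairwise similarity matrix.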
