---
title: "Sentence Transformers"
jupyter: python3
execute:
  eval: false
---

## The Spoiler

**BERT produces a matrix of token vectors; Sentence Transformers collapse that matrix into a single coordinate, turning semantic similarity into geometric distance.**

## The Mechanism (Why It Works)

BERT gives you a vector for every token in a sentence. To compare two sentences, you are stuck comparing two matrices with different numbers of rows. The naive fix is to average all token vectors from an off-the-shelf BERT, but that discards token order at the pooling step, weights every token equally, and relies on a model that was never trained to produce sentence-level vectors. The word "not" in "not good" should drastically change the sentence embedding, but simple averaging dilutes its impact.
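
To make the dilution concrete, here is a minimal sketch. It assumes toy one-hot word vectors, and `word_vectors`, `mean_embedding`, and `cosine` are illustrative names rather than part of any library. Real BERT token vectors are contextual, so the number below only illustrates the averaging effect.

```python
import numpy as np

# Toy assumption: each word gets a fixed one-hot vector. Real BERT token
# vectors are contextual, so this only illustrates the averaging effect.
vocab = ["the", "movie", "was", "not", "good"]
word_vectors = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}


def mean_embedding(sentence: str) -> np.ndarray:
    """Average the toy word vectors of a whitespace-tokenized sentence."""
    return np.mean([word_vectors[w] for w in sentence.split()], axis=0)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


positive = mean_embedding("the movie was good")
negative = mean_embedding("the movie was not good")

# The negation flips the meaning, yet the averaged vectors stay close.
similarity = cosine(positive, negative)
print(f"{similarity:.3f}")  # 0.894
```

A single "not" shifts the average by only one token's worth of weight, which is why the two opposite reviews still score close to 0.9 here.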

**Sentence-BERT** (SBERT) solves this by training a **Siamese Network**. The same BERT model processes two sentences independently, producing their respective token matrices. We then apply pooling (mean, max, or CLS-token extraction) to collapse each matrix into a single vector. The training objective is typically contrastive: if the sentences are semantically similar (e.g., paraphrases), their vectors should be close in Euclidean or cosine space. If they are unrelated, their vectors should be far apart.
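
As a rough illustration of a contrastive objective, the sketch below implements one classic pairwise formulation on cosine distance. It is not necessarily the loss used to train any particular Sentence Transformer model; the function name `contrastive_loss` and the `margin` value of 0.5 are assumptions chosen for the example.

```python
import numpy as np


def contrastive_loss(u: np.ndarray, v: np.ndarray, is_similar: bool,
                     margin: float = 0.5) -> float:
    """Pairwise contrastive loss on cosine distance (illustrative only).

    Similar pairs are penalized for being far apart; dissimilar pairs are
    penalized only while they sit closer than `margin`.
    """
    cos_sim = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    distance = 1.0 - cos_sim
    if is_similar:
        return float(distance ** 2)
    return float(max(0.0, margin - distance) ** 2)


a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])  # orthogonal to a, so the cosine distance is 1.0

loss_same = contrastive_loss(a, a, is_similar=True)              # 0.0
loss_far_similar = contrastive_loss(a, b, is_similar=True)       # 1.0
loss_far_unrelated = contrastive_loss(a, b, is_similar=False)    # 0.0
loss_close_unrelated = contrastive_loss(a, a, is_similar=False)  # 0.25
```

The loss is zero exactly when geometry agrees with the label: similar pairs coincide, and unrelated pairs are at least `margin` apart.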

Think of it like creating a library catalog. Instead of storing every word on every page, you compress each book into a single Dewey Decimal number. Books on similar topics get similar numbers, enabling efficient retrieval. The compression loses fine-grained detail, but gains search speed.

The mathematical trick is the **Siamese architecture**: weight sharing ensures both sentences are embedded into the same vector space by identical transformations. This makes the distance between vectors meaningful, so similar sentences cluster together and dissimilar ones are pushed apart.

## The Application (How We Use It)

Sentence Transformers enable semantic search, clustering, and similarity comparisons. Let's see how to use them in practice.

### Basic Semantic Search

Here's how to encode sentences and find the most similar matches:

```python
from sentence_transformers import SentenceTransformer, util

# Load a pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "Someone in a gorilla costume is playing a set of drums."
]

# Encode all sentences into 384-dimensional vectors
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query = "A man is eating pasta."
query_embedding = model.encode(query, convert_to_tensor=True)

# Compute cosine similarities
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)

print(f"Query: {query}")
print("\nTop 3 most similar sentences:")
for hit in hits[0]:
    print(f"{corpus[hit['corpus_id']]} (Score: {hit['score']:.4f})")
```

Expected output (exact scores depend on the model and library version):
```
Query: A man is eating pasta.

Top 3 most similar sentences:
A man is eating food. (Score: 0.6964)
A man is eating a piece of bread. (Score: 0.6281)
A man is riding a horse. (Score: 0.2235)
```

The model correctly identifies that "eating pasta" is semantically closest to "eating food" and "eating bread," even though the exact words don't match. This is semantic search: matching by meaning, not keywords.
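
Conceptually, a cosine-based search normalizes the vectors, takes dot products, and keeps the highest scores. The sketch below shows that idea in plain NumPy. It is not the library's implementation, and `top_k_cosine` plus the 2-D toy embeddings are hypothetical stand-ins for real `model.encode(...)` output.

```python
import numpy as np


def top_k_cosine(query_vec: np.ndarray, corpus_vecs: np.ndarray, k: int = 3):
    """Return (index, score) pairs for the k corpus rows most similar to the query."""
    # Normalizing to unit length makes the dot product equal cosine similarity.
    query_unit = query_vec / np.linalg.norm(query_vec)
    corpus_unit = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = corpus_unit @ query_unit
    best = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in best]


# Hypothetical 2-D embeddings standing in for model.encode(...) output.
toy_corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([2.0, 0.0])

results = top_k_cosine(toy_query, toy_corpus, k=2)
for index, score in results:
    print(index, round(score, 4))
```

Note that the query's length (2.0) does not affect the ranking: only direction matters for cosine similarity.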

### Clustering Documents

You can also cluster documents by their semantic content:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "Python is a programming language",
    "Java is used for software development",
    "The cat sat on the mat",
    "Dogs are loyal animals",
    "Machine learning is a subset of AI",
    "Neural networks mimic the brain",
]

# Returns a NumPy array of shape (len(sentences), 384)
embeddings = model.encode(sentences)

# Cluster into 2 groups; n_init is set explicitly so the behavior does not
# depend on the scikit-learn version's default
num_clusters = 2
clustering_model = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
clustering_model.fit(embeddings)
cluster_assignment = clustering_model.labels_

# Group the sentences by their assigned cluster label
clustered_sentences = {}
for sentence_id, cluster_id in enumerate(cluster_assignment):
    if cluster_id not in clustered_sentences:
        clustered_sentences[cluster_id] = []
    clustered_sentences[cluster_id].append(sentences[sentence_id])

for cluster_id, cluster_sentences in clustered_sentences.items():
    print(f"\nCluster {cluster_id + 1}:")
    for sentence in cluster_sentences:
        print(f"  - {sentence}")
```

Expected clustering (the cluster numbers themselves are arbitrary and may be swapped):
```
Cluster 1:
  - Python is a programming language
  - Java is used for software development
  - Machine learning is a subset of AI
  - Neural networks mimic the brain

Cluster 2:
  - The cat sat on the mat
  - Dogs are loyal animals
```

The model separates technical/programming sentences from animal-related sentences without any labeled data.

### Choosing the Right Model

Different Sentence Transformer models optimize for different trade-offs:

- **all-MiniLM-L6-v2**: Fast and lightweight (384 dimensions), good for most applications
- **all-mpnet-base-v2**: Higher quality (768 dimensions), slower but more accurate
- **multi-qa-mpnet-base-dot-v1**: Optimized for question-answering and retrieval tasks
- **paraphrase-multilingual-mpnet-base-v2**: Supports 50+ languages

Choose based on your constraints: speed vs. accuracy, monolingual vs. multilingual, general-purpose vs. domain-specific.

### Architecture: The Siamese Network

The key innovation is the **Siamese Network** architecture:

![Siamese Network](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SBERT_Siamese_Network.png)

Both sentences pass through the **same BERT model** (shared weights). This ensures they're embedded into a common vector space. The pooling layer then collapses each token matrix into a single vector. During training, the loss function pushes similar sentence pairs together and dissimilar pairs apart.
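
The weight-sharing pattern can be sketched with a deliberately tiny encoder. Everything here is an assumption made for illustration: the five-word `vocab`, the random `embedding_table` and `projection` weights, and the `encode` function stand in for a full transformer. Only the fact that both branches reuse the same weights matters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy encoder: an embedding table plus one linear layer.
# A real SBERT encoder is a full transformer; only the sharing pattern matters here.
vocab = {"a": 0, "man": 1, "eats": 2, "pasta": 3, "food": 4}
embedding_table = rng.normal(size=(len(vocab), 8))
projection = rng.normal(size=(8, 4))


def encode(sentence: str) -> np.ndarray:
    """Embed tokens, project them, then mean-pool into one sentence vector."""
    token_ids = [vocab[token] for token in sentence.split()]
    token_vectors = embedding_table[token_ids] @ projection  # (num_tokens, 4)
    return token_vectors.mean(axis=0)                         # (4,)


# Both branches call the same function with the same weights, so inputs of
# different lengths land in one shared 4-dimensional space.
left = encode("a man eats pasta")
right = encode("man eats food")

print(left.shape, right.shape)  # (4,) (4,)
```

Because there is only one set of weights, any training update that moves `left` also changes how every other sentence is embedded, which is what keeps the space consistent.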

Common pooling strategies:

- **Mean pooling**: Average all token vectors (most common)
- **Max pooling**: Take element-wise maximum across tokens
- **CLS-token**: Use the [CLS] token's final hidden state (BERT's built-in sentence representation)

Mean pooling generally works best because it captures information from all tokens while being robust to varying sentence lengths.
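
The three strategies can be sketched on a single padded token matrix. This is a minimal NumPy illustration, not the library's pooling module; the `pool` function and the toy `tokens` and `mask` arrays are assumptions, and the CLS branch assumes the `[CLS]` vector sits in the first row.

```python
import numpy as np


def pool(token_vectors: np.ndarray, attention_mask: np.ndarray,
         strategy: str = "mean") -> np.ndarray:
    """Collapse a (num_tokens, dim) matrix into one (dim,) vector.

    attention_mask holds 1 for real tokens and 0 for padding.
    """
    mask = attention_mask.astype(bool)
    if strategy == "mean":
        return token_vectors[mask].mean(axis=0)
    if strategy == "max":
        return token_vectors[mask].max(axis=0)
    if strategy == "cls":
        return token_vectors[0]  # assumes [CLS] is the first token
    raise ValueError(f"unknown pooling strategy: {strategy}")


# Three real tokens plus one padding row that must be ignored.
tokens = np.array([[1.0, 4.0], [3.0, 0.0], [2.0, 2.0], [9.0, 9.0]])
mask = np.array([1, 1, 1, 0])

mean_vec = pool(tokens, mask, "mean")  # [2.0, 2.0]
max_vec = pool(tokens, mask, "max")    # [3.0, 4.0]
cls_vec = pool(tokens, mask, "cls")    # [1.0, 4.0]
```

Masking out padding matters in batched inference: without it, the padding row `[9.0, 9.0]` would distort both the mean and the max.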

### Where This Breaks

**Static Compression**: A sentence gets exactly one fixed-size vector, regardless of how it will be queried later. Sentences that use "bank" in the sense of a river bank and in the sense of a financial bank can still land close together if they share enough surrounding words. The model compresses meaning into a single point and loses nuance in the process.

**Word Order Insensitivity**: "The dog bit the man" and "The man bit the dog" contain exactly the same words. If the model leans too heavily on lexical overlap (bag-of-words similarity), the two sentences end up dangerously close in vector space. Good models learn syntax, but they are not perfect.
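
A purely lexical measure shows how extreme this case is. The sketch below computes Jaccard overlap on word sets; `jaccard_overlap` is an illustrative helper, not part of any library, and it assumes simple lowercase whitespace tokenization.

```python
def jaccard_overlap(sentence_a: str, sentence_b: str) -> float:
    """Share of distinct lowercase words the two sentences have in common."""
    words_a = set(sentence_a.lower().split())
    words_b = set(sentence_b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)


overlap = jaccard_overlap("The dog bit the man", "The man bit the dog")
print(overlap)  # 1.0: identical word sets, opposite meanings
```

Pairs like this make a useful sanity check: if a model's similarity score for them is nearly indistinguishable from its score for true paraphrases, it is probably relying on word overlap.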

**Computational Cost**: Retrieval is fast (dot products), but encoding a large corpus is expensive. Encoding 1 million sentences with a large model can take hours, depending on hardware. Pre-compute and cache embeddings whenever possible.
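
One simple way to cache is to save the embedding matrix to disk and reload it on later runs. The sketch below assumes a hypothetical `load_or_encode` helper and uses `fake_encode` as a stand-in for a real `model.encode` call so that it runs without downloading a model.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_or_encode(sentences, encode_fn, cache_path: Path) -> np.ndarray:
    """Return cached embeddings if present; otherwise encode and save them."""
    if cache_path.exists():
        return np.load(cache_path)
    embeddings = np.asarray(encode_fn(sentences))
    np.save(cache_path, embeddings)
    return embeddings


calls = []


def fake_encode(sentences):
    """Hypothetical stand-in for model.encode; records each invocation."""
    calls.append(len(sentences))
    return np.ones((len(sentences), 3))


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / "corpus_embeddings.npy"
    first = load_or_encode(["a", "b"], fake_encode, cache_file)
    second = load_or_encode(["a", "b"], fake_encode, cache_file)  # served from disk

print(len(calls))  # 1: the encoder ran only once
```

This version does not notice when the corpus or the model changes. A more careful cache would key the file name on something like a hash of the corpus and the model name.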

**Domain Shift**: Models trained on general text (Wikipedia, news) may perform poorly on specialized domains (medical, legal). Fine-tuning on domain-specific data helps, but requires labeled sentence pairs.

## The Takeaway

Sentence Transformers collapse BERT's token matrix into a single vector through pooling, and Siamese training with a contrastive objective makes the distances between those vectors meaningful. The result is fast semantic search: encode once, compare with dot products. Choose your pooling strategy and model size based on speed-accuracy trade-offs, and remember that compression always loses information.