<a href="https://colab.research.google.com/github/skojaku/Practical-Guide-to-Sentence-Transformers/blob/main/notebook/Practical_Guide_to_Sentence_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Tutorial: Practical Guide to Sentence Transformers**
- v 0.1: _Monday, 17th Mar 2025_ - Added pipeline exploration
- v 0.0: _Monday, 27th Sep 2021_ - Added sentence-transformer models and interactive hands-on

By Sadamori Kojaku.

## References
- *Paper*:
Reimers, Nils, and Iryna Gurevych. 2019. “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks.” EMNLP. arXiv [cs.CL]. http://arxiv.org/abs/1908.10084.


- *Library*:
https://www.sbert.net/


- *Video*:
https://www.youtube.com/watch?v=Ey81KfQ3PQU

# **1. How to use sentence-transformer models**

## **1.1. Setup**

First, we need the following libraries to use Sentence-BERT.

- [transformers](https://huggingface.co/transformers/) provides a variety of pre-trained transformer-based models.
- [sentence-transformers](https://www.sbert.net/index.html) provides a lightweight wrapper for transformers and training procedures.

If you are running this notebook locally, the easiest way is to use [uv](https://github.com/astral-sh/uv) to install the libraries.  

```bash 
pip install uv # Install uv 
uv venv .venv # Create a virtual environment
source .venv/bin/activate # activate the virtual environment
uv pip install torch torchvision torchaudio # Install torch
uv pip install sentence-transformers transformers datasets jupyter # Install other libraries
```

For Colab users, uncomment the following cell and run it to install the libraries.


In [1]:
#%%capture
#!pip install -U sentence-transformers datasets transformers

After installing the libraries, we import the modules for loading sentence transformers.

In [1]:
from sentence_transformers import SentenceTransformer

## **1.2. Model**

Next, we need to select the transformer-based model for embedding. There are more than 15,000 models to choose from, so which model to use is a critical modeling decision.

The key feature of sentence-transformers models is that they are fine-tuned so that ***sentence embeddings*** are useful, whereas pre-trained models are trained so that ***word embeddings*** are useful. The fine-tuned models can be downloaded from the [sentence-transformers library](https://www.sbert.net/docs/pretrained_models.html). You can also use pre-trained models from the [Hugging Face model hub](https://huggingface.co/models).

As a demonstration of sentence-transformers, we use a fine-tuned model. The model is trained on various sentence pairs in Wikipedia, scientific papers, reviews, and Q&A websites. 

In [2]:
MODEL_NAME = "paraphrase-MiniLM-L6-v2"

The model can be downloaded by

In [None]:
model = SentenceTransformer(MODEL_NAME)
# model = SentenceTransformer(MODEL_NAME, device="cuda")  # device="cpu" for CPU, device="cuda" for GPU

The model's `encode` method takes a list of sentences and produces an array of embeddings (`numpy.ndarray`). Each row in the array is the embedding vector for the corresponding sentence.

In [None]:
sentences = [
    "This framework generates embeddings for each input sentence",
    "Sentences are passed as a list of string.",
    "The quick brown fox jumps over the lazy dog.",
]

# Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)

# Print the embeddings
for i, sentence in enumerate(sentences):
    print("Sentence:", sentence)
    print("Embedding[:10]:", embeddings[i, :10])
    print("")

## **1.3. Semantic search**

A key feature of BERT is its ability to capture semantics. To demonstrate this, let us consider a basic NLP task: 
- You are given pairs of sentences, e.g., "He likes eating noodles", and "His favorite food is noodles".
- You are asked to rate the semantic relatedness of the two sentences.

To calculate the semantic relatedness, we'll embed the given sentences and calculate the similarity.
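
As a minimal sketch of the similarity step, the snippet below computes the cosine similarity between toy vectors with NumPy. The vectors and the helper name `cosine_similarity` are made up for illustration; real sentence embeddings have hundreds of dimensions.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy 3-dimensional "embeddings" (hypothetical values, not model output)
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction as `a`
c = [0.0, 0.0, 3.0]  # orthogonal to `a`

print(cosine_similarity(a, b))  # close to 1: same direction
print(cosine_similarity(a, c))  # 0: orthogonal
```

Cosine similarity depends only on the direction of the vectors, not on their length, which is why `a` and `b` score as nearly identical.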

Let us consider the following sentence pairs, listed in decreasing order of semantic relatedness.

In [6]:
sentence_pairs = [
    [
        "The little bird is bathing in the sink.",
        "Birdie is washing itself in the water basin.",
    ],
    [
        "Two boys on a couch are playing video games.",
        "Two boys are playing a video game.",
    ],
    [
        "John said he is considered a witness but not suspect.",
        "'He is not a suspect anymore', John said.",
    ],
    ["They flew out of the nest in groups.", "They flew into the nest together."],
    [
        "The woman is playing the violin.",
        "The young lady enjoys listening to the guitar.",
    ],
    [
        "The black dog is running through the snow.",
        "A race car driver is driving his car through the mud.",
    ],
]

The first sentence pair is semantically equivalent, although the two sentences share no words other than function words such as 'is' and 'in', which makes it a very challenging example. The second sentence pair is also semantically very similar, but some details differ. The last sentence pair is semantically unrelated.

Can sentence-transformers really capture the semantic relatedness?

In [None]:
import numpy as np

MODEL_NAME = "paraphrase-MiniLM-L6-v2"

model = SentenceTransformer(MODEL_NAME)


def cosine_similarity_matrix(emb):
    emb = np.einsum("ij,i->ij", emb, 1 / np.linalg.norm(emb, axis=1))
    return emb @ emb.T


for sentence_pair in sentence_pairs:
    emb = model.encode(sentence_pair)
    sim = cosine_similarity_matrix(emb)[0, 1]
    print(
        "sim = {sim:.2f}: '{sent1}' '{sent2}'".format(
            sent1=sentence_pair[0], sent2=sentence_pair[1], sim=sim
        )
    )

The similarity for the first sentence pair is relatively high even though the two sentences share only general words such as 'in' and 'is'. The second to fourth pairs have clearly higher similarity than the semantically less related pairs (the fifth and sixth).

## **1.4. Semantic search with pre-trained models**

The `sentence-transformers` library makes it easy to generate sentence embeddings with pre-trained models. 
Although pre-trained models are not trained to produce sentence embeddings, they capture some aspects of the semantic relatedness of words. With a pre-trained model, the embedding for a sentence is calculated as the average of the embeddings of the words (tokens) in the sentence.
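
The averaging step can be sketched with NumPy as follows. This is an illustration under assumptions, not the library's code: the token embeddings are hypothetical values, and the `attention_mask` convention (1 for real tokens, 0 for padding) is a common way to keep padded positions out of the average.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings, ignoring padded positions (mask == 0)."""
    token_embeddings = np.asarray(token_embeddings, dtype=float)
    mask = np.asarray(attention_mask, dtype=float)[:, None]  # shape (n_tokens, 1)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()


# Hypothetical 2-dimensional embeddings for a 3-token sentence plus one padding slot
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]  # the last position is padding

sentence_embedding = mean_pool(token_embeddings, attention_mask)
print(sentence_embedding)  # [2. 2.]
```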

Pre-trained models are sometimes useful because there are more than 15,000 pre-trained models trained for various tasks, whereas there are fewer than 50 sentence-transformer models, each trained for specific tasks.

An example is sentiment analysis: given a sentence, decide whether its sentiment is positive or negative. As of 09/23/2021, there is no sentence-transformers model for sentiment analysis, but there are numerous pre-trained models for it.

Here, we use a model from the [Hugging Face model hub](https://huggingface.co/) for sentiment analysis.

In [8]:
PRE_TRAINED_MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment"

In [None]:
model = SentenceTransformer(PRE_TRAINED_MODEL_NAME)

Suppose that we have a list of sentences with different sentiments:

In [10]:
sentences = [
    "I love you",
    "I don't like you",
    "I know you",
    "I like you before and although you did something good to me, I hate you",
]

Our task is to rank the sentences by their sentiment similarity to a given query sentence:

In [None]:
query = "I like you"  # Query sentence

emb = model.encode([query] + sentences)
sim = cosine_similarity_matrix(emb)[0, 1:]
hits = np.argsort(-sim)
for i in hits:
    print("sim = {sim:.2f}: '{sent}'".format(sent=sentences[i], sim=sim[i]))
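
The ranking logic in the cell above can be wrapped in a small reusable helper. The sketch below runs on hypothetical 2-dimensional vectors instead of model output so that it works without downloading a model; the name `rank_by_cosine` and the `top_k` parameter are assumptions introduced for illustration.

```python
import numpy as np


def rank_by_cosine(query_emb, corpus_emb, top_k=3):
    """Return (index, similarity) pairs for the top_k corpus rows most similar to the query."""
    q = np.asarray(query_emb, dtype=float)
    c = np.asarray(corpus_emb, dtype=float)
    q = q / np.linalg.norm(q)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    sim = c @ q  # cosine similarity of every corpus row with the query
    order = np.argsort(-sim)[:top_k]  # indices sorted by decreasing similarity
    return [(int(i), float(sim[i])) for i in order]


# Hypothetical 2-dimensional embeddings standing in for model.encode() output
query_emb = [1.0, 0.0]
corpus_emb = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]

for idx, score in rank_by_cosine(query_emb, corpus_emb, top_k=2):
    print(f"sim = {score:.2f}: corpus item {idx}")
```

With real data, `query_emb` would be `model.encode([query])[0]` and `corpus_emb` would be `model.encode(sentences)`.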

## **1.5. Interactive Hands-On**
- With fine-tuned models
  1. Go to the [sentence-transformers library](https://www.sbert.net/docs/pretrained_models.html) and find "Model Overview" section
  2. Copy a model name into `MODEL_NAME`.
  3. Adapt the text and make your first semantic search.
- With pre-trained models 
  1. Go to the [Hugging Face model hub](https://huggingface.co/models) and click a model for text classification. 
  2. In the model card of a model, copy the model name at the top left and paste it into `MODEL_NAME`.
  3. Adapt the text and make your first semantic search.

In [12]:
MODEL_NAME = ""

model = SentenceTransformer(MODEL_NAME)

In [None]:
sentences = []

In [None]:
query = ""  # Query sentence

emb = model.encode([query] + sentences)
sim = cosine_similarity_matrix(emb)[0, 1:]
hits = np.argsort(-sim)
for i in hits:
    print("sim = {sim:.2f}: '{sent}'".format(sent=sentences[i], sim=sim[i]))

# **2. Fine-tuning models with your data**

So far, we have used existing fine-tuned or pre-trained models trained on generic texts. However, we all have different problems, and thus often want to tailor the models to our own data. In the following, we will walk through how to fine-tune transformer-based models using the sentence-transformers architecture. 

## **2.1 Setup**

To start, we need to import some libraries:

In [None]:
from sentence_transformers import (
    InputExample,
    SentencesDataset,
    SentenceTransformer,
    evaluation,
    losses,
    models,
)
from torch.utils.data import DataLoader

## **2.2 Define a model**

The `sentence-transformers` library provides building blocks for defining a sentence-transformer model. Here, we construct a sentence-transformer model from a pre-trained model, `distilroberta-base`, and a mean pooling layer.

In [None]:
# Define the base model for word embeddings
word_embedding_model = models.Transformer("distilroberta-base", max_seq_length=512)

# Define the pooling layer that aggregates word embeddings into a sentence embedding
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(), pooling_mode="mean"
)

# Construct a sentence transformer
model = SentenceTransformer(
    modules=[word_embedding_model, pooling_model],
    #   device="cuda",  # Use the GPU if one is available
)

##### **The following are alternative model designs:**

Model that produces unit-norm sentence embeddings:
```python
word_embedding_model = models.Transformer("distilroberta-base", max_seq_length=512)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(), pooling_mode="mean"
)

# This ensures the unit norm of the sentence embedding. 
normalize_model = models.Normalize()

model = SentenceTransformer(
    modules=[word_embedding_model, pooling_model, normalize_model],
    device="cpu",  # use "cuda" if a GPU is available
)
```
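
One practical consequence of unit-norm embeddings is that the plain dot product coincides with cosine similarity. The NumPy sketch below checks this on hypothetical vectors; it imitates what a normalization layer does to its input and is not the library's implementation.

```python
import numpy as np

# Hypothetical embeddings (rows) standing in for the output of the pooling layer
emb = np.array([[3.0, 4.0], [1.0, 0.0], [-2.0, 2.0]])

# What a normalization layer does: rescale every row to unit L2 norm
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Cosine similarity computed from the raw embeddings
norms = np.linalg.norm(emb, axis=1)
cosine = (emb @ emb.T) / np.outer(norms, norms)

# With unit-norm rows, the plain dot product gives the same matrix
dot = unit @ unit.T
print(np.allclose(dot, cosine))  # True
```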

Model that produces more compact sentence embeddings:
```python
from torch import nn  # needed for nn.Tanh() below

word_embedding_model = models.Transformer("distilroberta-base", max_seq_length=512)

pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(), pooling_mode="mean"
)

# This reduces 768 dimensional embeddings to 256 dimensional embeddings.
dense_model = models.Dense(
    in_features=pooling_model.get_sentence_embedding_dimension(),
    out_features=256,
    activation_function=nn.Tanh(),
)

model = SentenceTransformer(
    modules=[word_embedding_model, pooling_model, dense_model], device="cpu"
)
```
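
To see what the dense layer does to the shape of the embeddings, here is a NumPy sketch of an affine map followed by `tanh`. The weights are random placeholders for parameters that would be learned during training, and the input batch is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

in_features, out_features = 768, 256

# Randomly initialized weights stand in for parameters learned during training
W = rng.normal(scale=0.02, size=(in_features, out_features))
b = np.zeros(out_features)

# A batch of 4 hypothetical pooled sentence embeddings
pooled = rng.normal(size=(4, in_features))

# Dense layer followed by tanh: affine map, then squash each value into (-1, 1)
compact = np.tanh(pooled @ W + b)

print(pooled.shape, "->", compact.shape)  # (4, 768) -> (4, 256)
```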

## **2.3 Training Data**


To train the sentence-transformer, we need pairs of sentences that are semantically similar. Each sentence pair should be wrapped in an `InputExample`, and all pairs should be loaded into a `DataLoader`. For example, 

```python
train_examples = [
    InputExample(texts=["My first sentence", "My second sentence"]),
    InputExample(texts=["Another pair", "Related sentence"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
```

As a toy example, here we use the titles of physics papers. This dataset consists of 5000 pairs of papers published in the American Physical Society journals. A paper `i` is paired with another paper `j` when `i` cites `j`. 

In [None]:
import pandas as pd

data_table = pd.read_csv(
    "https://raw.githubusercontent.com/skojaku/Practical-Guide-to-Sentence-Transformers/main/data/training-data.csv"
)
data_table.head()

In [None]:
train_examples = [
    InputExample(texts=[row["src"], row["trg"]]) for _, row in data_table.iterrows()
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

## **2.4 Loss function**

The choice of loss function is by far the most critical factor for performance. Several loss functions are available in the [sentence-transformers library](https://www.sbert.net/docs/package_reference/losses.html?highlight=loss%20functions). A common choice is *triplet loss*. See [this paper](https://arxiv.org/abs/1703.07737) for details.  

Another important choice is the measure used to compare embeddings. Euclidean distance, dot product, and cosine similarity are commonly used. Here we use the cosine-based measure, which can be specified through the `distance_metric` argument of the loss function. 

In [None]:
train_loss = losses.BatchSemiHardTripletLoss(
    model=model,
    distance_metric=losses.BatchHardTripletLossDistanceFunction.cosine_distance,
)
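
For intuition, the basic triplet loss on a single (anchor, positive, negative) triplet can be sketched in NumPy as below. This is a simplified illustration: the embeddings and the `margin` value are hypothetical, and the sketch leaves out how the library's batch variants select triplets within a batch.

```python
import numpy as np


def cosine_distance(u, v):
    """1 - cosine similarity; 0 for identical directions, 2 for opposite ones."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-dimensional embeddings
anchor = [1.0, 0.0]
positive = [1.0, 0.0]       # same direction as the anchor
far_negative = [-1.0, 0.0]  # opposite direction: already well separated
near_negative = [1.0, 1.0]  # only 45 degrees away: violates the margin

print(triplet_loss(anchor, positive, far_negative))   # 0.0
print(triplet_loss(anchor, positive, near_negative))  # about 0.207
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only triplets that are still "too close" contribute to training.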

## **2.5 Evaluator**

Training usually takes some time, and we might want to monitor the learning progress. The `evaluation` module contains various evaluators that measure performance improvements during the training phase. See [here](https://www.sbert.net/docs/package_reference/evaluation.html?highlight=evaluators) for the available evaluators. 



In [None]:
# We will make two groups of sentence pairs, `pos_pairs` and `neg_pairs`.
# `pos_pairs` is composed of sentences paired by citations.
pos_pairs = data_table.sample(frac=0.1)
pos_pairs["score"] = 1  # group label

# `neg_pairs` is composed of the pairs of randomly selected sentences.
neg_pairs = data_table.copy()
neg_pairs["trg"] = neg_pairs["trg"].sample(frac=1).values
neg_pairs = neg_pairs.sample(frac=0.1)
neg_pairs["score"] = 0  # group label

# Concatenate the pairs
eval_data_table = pd.concat([pos_pairs, neg_pairs])

# Set up the evaluator
evaluator = evaluation.EmbeddingSimilarityEvaluator(
    eval_data_table["src"].values.tolist(),  # sentence
    eval_data_table["trg"].values.tolist(),  # sentence
    scores=eval_data_table["score"].values.tolist(),  # similarity
    show_progress_bar=True,
)
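
An evaluator of this kind scores the model by how well embedding similarities agree with the given scores, typically through a rank correlation. The sketch below computes a Spearman correlation from scratch on hypothetical numbers; it illustrates the idea and is not the evaluator's own code.

```python
import numpy as np


def average_ranks(x):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])


# Hypothetical cosine similarities for four sentence pairs and their group labels
similarities = [0.9, 0.8, 0.2, 0.1]
scores = [1, 1, 0, 0]

print(round(spearman(similarities, scores), 3))  # 0.894
```

A value near 1 means that pairs labeled 1 (citation pairs) consistently receive higher similarity than pairs labeled 0 (random pairs).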

## **2.6. Training**

Set some parameters for training:

In [None]:
num_epochs = 4
warmup_steps = 100
evaluation_steps = 1000
model_save_path = "model"

All set. We can train the model by

In [None]:
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    epochs=num_epochs,
    evaluation_steps=evaluation_steps,
    warmup_steps=warmup_steps,
    output_path=model_save_path,
)
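
To get a feel for what `warmup_steps` means, the sketch below traces a learning rate that ramps up linearly during warmup and then decays linearly to zero. It assumes a warmup-then-linear-decay schedule and a peak learning rate of `2e-5`; neither is set explicitly in this notebook, and the helper `warmup_linear_lr` is hypothetical.

```python
import math


def warmup_linear_lr(step, total_steps, warmup_steps, peak_lr):
    """Linear ramp from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


# Numbers taken from this tutorial: 5000 pairs, batch size 16, 4 epochs, 100 warmup steps
steps_per_epoch = math.ceil(5000 / 16)
total_steps = steps_per_epoch * 4
peak_lr = 2e-5  # assumed peak learning rate, not set anywhere in this notebook

print(total_steps)  # 1252
for step in (0, 50, 100, total_steps):
    print(step, warmup_linear_lr(step, total_steps, 100, peak_lr))
```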

# **3. Pipeline**

[pipeline](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines) is a module in the `transformers` library that provides a convenient way to use transformer-based models for various NLP tasks. Although it is not directly related to sentence transformers, we briefly introduce it here given its popularity and ease of use.

An example is worth a thousand words. Let's use it for sentiment analysis.  

In [None]:
from transformers import pipeline

# Initialize the sentiment analysis pipeline
model = pipeline("sentiment-analysis")

# Analyze some text examples
model("I love this product! It works perfectly.")

`pipeline` bundles the model, tokenizer, and task-specific processing steps into a single object that you can call like a function. We can specify the model for the task with the `model` argument. 

In [None]:
from transformers import pipeline

# Initialize the sentiment analysis pipeline
model = pipeline("sentiment-analysis", model = "tabularisai/multilingual-sentiment-analysis")

# Analyze some text examples
model("I love this product! It works perfectly.")
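
To make the bundling idea concrete, here is a toy class that chains tokenization, scoring, and post-processing behind a single call. Everything in it is a stand-in invented for illustration (the class `ToySentimentPipeline`, the whitespace tokenizer, and the word-score lexicon); it mimics the shape of a pipeline call, not how `transformers` implements it.

```python
import math


class ToySentimentPipeline:
    """Bundles tokenization, scoring, and post-processing behind one call."""

    def __init__(self, lexicon):
        self.lexicon = lexicon  # stand-in for a trained model's parameters

    def _tokenize(self, text):
        # Stand-in for a real tokenizer: lowercase and split on whitespace
        return [tok.strip(".,!?") for tok in text.lower().split()]

    def _forward(self, tokens):
        # Stand-in for the model: sum word scores into a single logit
        return sum(self.lexicon.get(tok, 0.0) for tok in tokens)

    def _postprocess(self, logit):
        # Turn the logit into a label and a probability-like score
        prob_positive = 1.0 / (1.0 + math.exp(-logit))
        if prob_positive >= 0.5:
            return {"label": "POSITIVE", "score": prob_positive}
        return {"label": "NEGATIVE", "score": 1.0 - prob_positive}

    def __call__(self, text):
        return [self._postprocess(self._forward(self._tokenize(text)))]


# Hypothetical word scores, chosen only for this illustration
toy = ToySentimentPipeline({"love": 2.0, "perfectly": 1.0, "hate": -2.0})
print(toy("I love this product! It works perfectly."))
```

The caller only ever sees text going in and a labeled score coming out, which is the convenience that `pipeline` offers for real models.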

There are various tasks that can be performed with `pipeline`.
- "audio-classification"
- "automatic-speech-recognition"
- "depth-estimation"
- "document-question-answering"
- "feature-extraction"
- "fill-mask"
- "image-classification"
- "image-feature-extraction"
- "image-segmentation"
- "image-text-to-text"
- "image-to-image"
- "image-to-text"
- "mask-generation"
- "object-detection"
- "question-answering"
- "summarization"
- "table-question-answering"
- "text2text-generation"
- "text-classification" (alias "sentiment-analysis" available)
- "text-generation"
- "text-to-audio" (alias "text-to-speech" available)
- "token-classification" (alias "ner" available)
- "translation"
- "translation_xx_to_yy"
- "video-classification"
- "visual-question-answering"
- "zero-shot-classification"
- "zero-shot-image-classification"
- "zero-shot-audio-classification"
- "zero-shot-object-detection"
 

In [None]:
# For RoBERTa model
roberta_fill_mask = pipeline("fill-mask", model="roberta-base")

# Example with RoBERTa mask token
roberta_fill_mask("The capital of France is <mask>.")

## Exercise: Pipeline Exploration

Now that you've seen basic examples of the pipeline module, it's time to experiment with it yourself. The pipeline abstraction makes it incredibly easy to try different NLP tasks and models without writing extensive code.

Explore at least three of the following pipeline tasks:

1. **Text Generation**: Generate text completions from prompts
    ```python
    generator = pipeline("text-generation")
    generator("In the distant future, artificial intelligence")
    ```

2. **Question Answering**: Answer questions based on context
    ```python
    qa = pipeline("question-answering")
    qa(question="Where do I live?", context="My name is Sarah and I live in London")
    ```

3. **Text Summarization**: Create concise summaries of longer texts
    ```python
    summarizer = pipeline("summarization")
    summarizer("Your long article or text goes here...", max_length=100, min_length=30)
    ```

4. **Translation**: Translate text between languages
    ```python
    translator = pipeline("translation_en_to_fr")
    translator("The house is wonderful.")
    ```

5. **Named Entity Recognition**: Identify entities like people, organizations, locations
    ```python
    ner = pipeline("ner")
    ner("My name is Sarah Jessica Parker and I live in New York City")
    ```

6. **Zero-shot Classification**: Classify text into candidate labels that the model was not explicitly trained on
    ```python
    classifier = pipeline("zero-shot-classification")
    classifier("I have a problem with my iPhone that needs to be fixed.",
               candidate_labels=["electronics", "travel", "cooking"])
    ```

Try out different tasks and models to see how they work! 