# Fine Tuning Embedding Models for Retrieval on Domain Specific Data

1. Preparing a synthetic dataset of positive question + chunk pairs
2. Manipulating and preparing the dataset for training and evaluators
3. Evaluating the base performance of the embedding model
4. Fine tuning the embedding model on our data with Matryoshka Representation Learning
5. Publishing the fine tuned model to Hugging Face
6. Evaluating the performance of our fine-tuned model

---
## Install Dependencies

In [None]:
# %%capture
!pip install --upgrade sentence-transformers datasets transformers torch tensorboard

Collecting transformers
  Using cached transformers-4.52.4-py3-none-any.whl.metadata (38 kB)
Collecting torch
  Using cached torch-2.7.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (29 kB)

In [None]:
import torch

from sentence_transformers import SentenceTransformer, SentenceTransformerModelCardData, SentenceTransformerTrainingArguments, SentenceTransformerTrainer
from sentence_transformers.evaluation import InformationRetrievalEvaluator, SequentialEvaluator
from sentence_transformers.util import cos_sim
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

from datasets import load_dataset, concatenate_datasets

**Login to Hugging Face**

Logging in is needed to push the model to the Hugging Face Hub and to download gated models or datasets.

In [None]:
from huggingface_hub import login
from google.colab import userdata

login(token=userdata.get('HF_TOKEN'), add_to_git_credential=True)

---
## Dataset Preparation

In [None]:
!pip install -U datasets huggingface_hub fsspec
# Load dataset from the hub
dataset = load_dataset("thanhpham1/almost_final", split="train")

Collecting fsspec
  Using cached fsspec-2025.5.1-py3-none-any.whl.metadata (11 kB)


In [None]:
# Clean & Format Columns
dataset = dataset.rename_column("question", "anchor")
dataset = dataset.rename_column("text", "positive")
dataset = dataset.remove_columns(["Unnamed: 0", "question_id", "answer"]) # keep global_chunk_id

# Add an id column to the dataset
dataset = dataset.add_column("id", range(len(dataset)))

Once formatted, we shuffle the entries and split into a 90/10 train/test split. These are saved briefly onto our disk for easier loading.

In [None]:
# Shuffle Dataset
dataset = dataset.shuffle()

# Split Dataset Into a 90/10 Train/Test split
dataset = dataset.train_test_split(test_size=0.1)

# Save Datasets to Disk
dataset["train"].to_json("train_dataset.json", orient="records")
dataset["test"].to_json("test_dataset.json", orient="records")

Creating json from Arrow format:   0%|          | 0/6 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

562459

---
## Base Model Evaluation & Matryoshka Dimensions

Now that our dataset is prepared, we can load the base model and measure its baseline performance. For this example we will be using [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2).

In [None]:
# Hugging Face model ID
model_id = "sentence-transformers/all-mpnet-base-v2"

# Loading via SentenceTransformer
model = SentenceTransformer(
    model_id, device="cuda" if torch.cuda.is_available() else "cpu"
)

The `InformationRetrievalEvaluator` used for evaluation requires three key data structures:

1. A corpus dictionary mapping IDs to documents (`{corpus_id: text_chunk}`)
2. A queries dictionary mapping IDs to questions (`{query_id: question_text}`)
3. A relevant_docs dictionary specifying which corpus documents are relevant for each query (`{query_id: [relevant_corpus_ids]}`)


In [None]:
# Load train and test datasets from their respective JSON files
# These contain pairs of questions (anchors) and text chunks (positives)
test_dataset = load_dataset("json", data_files="test_dataset.json", split="train")
train_dataset = load_dataset("json", data_files="train_dataset.json", split="train")

# Combine train and test datasets into a single corpus
# This ensures we have all possible text chunks available for retrieval evaluation
corpus_dataset = concatenate_datasets([train_dataset, test_dataset])

# Convert datasets into dictionary format required by the InformationRetrievalEvaluator
# corpus: maps corpus IDs to their text chunks (documents)
# Format: {corpus_id: text_chunk}
corpus = dict(
    zip(corpus_dataset["id"], corpus_dataset["positive"])
)

# queries: maps query IDs to their questions
# Format: {query_id: question_text}
queries = dict(
    zip(test_dataset["id"], test_dataset["anchor"])
)

# Create a mapping between queries and their relevant documents
# This tells the evaluator which documents are correct matches for each query
relevant_docs = {}
for q_id, global_chunk_id in zip(test_dataset["id"], test_dataset["global_chunk_id"]):
    # Initialize empty list for each query if not already present
    if q_id not in relevant_docs:
        relevant_docs[q_id] = []

    # Find all corpus entries that share the same global_chunk_id
    # This handles cases where multiple questions can refer to the same text chunk
    matching_corpus_ids = [
        cid for cid, chunk in zip(corpus_dataset["id"], corpus_dataset["global_chunk_id"])
        if chunk == global_chunk_id
    ]
    # Add the matching corpus IDs to the relevant documents for this query
    relevant_docs[q_id].extend(matching_corpus_ids)

Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]
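The cell above scans the whole corpus once per query when building `relevant_docs`. As a minimal sketch of an alternative, the same mapping can be assembled from a lookup table keyed by `global_chunk_id`. The rows below are hypothetical toy data that only mirror the notebook's column names, and sets are used instead of lists for the relevant IDs.

```python
from collections import defaultdict

# Hypothetical toy rows mirroring the notebook's columns (not real data).
train_rows = [
    {"id": 0, "anchor": "What is Ray?", "positive": "Ray is a unified framework.", "global_chunk_id": 10},
    {"id": 1, "anchor": "What does GCS stand for?", "positive": "GCS is the Global Control Service.", "global_chunk_id": 11},
]
test_rows = [
    {"id": 2, "anchor": "Which framework scales Python apps?", "positive": "Ray is a unified framework.", "global_chunk_id": 10},
]
corpus_rows = train_rows + test_rows

# {corpus_id: text_chunk}
corpus = {row["id"]: row["positive"] for row in corpus_rows}
# {query_id: question_text}
queries = {row["id"]: row["anchor"] for row in test_rows}

# Index corpus ids by chunk id once, so each query becomes a dictionary lookup.
ids_by_chunk = defaultdict(set)
for row in corpus_rows:
    ids_by_chunk[row["global_chunk_id"]].add(row["id"])

# {query_id: {relevant_corpus_ids}}
relevant_docs = {row["id"]: set(ids_by_chunk[row["global_chunk_id"]]) for row in test_rows}

print(relevant_docs)  # {2: {0, 2}}
```

Query 2 is matched to corpus entries 0 and 2 because both carry the same chunk, which is the behaviour the nested loop above produces as well.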

While we can use and train the base model as such, an interesting approach that's gained popularity is applying [matryoshka embedding](https://huggingface.co/blog/matryoshka) ([paper](https://arxiv.org/pdf/2205.13147)) techniques.


Matryoshka Representation Learning (MRL) is a technique for training models to encode information at different granularities within the same embedding vector, with coarser/higher-level information packed into earlier dimensions and finer details in later dimensions. Named after Russian nesting dolls, this approach allows for flexible truncation of the embedding to different sizes while maintaining comparable accuracy to independently trained models of those smaller sizes, enabling adaptive compute-vs-accuracy trade-offs during deployment.

<img src="https://weaviate.io/assets/images/hero-237ed4b707a303e4ad3353daaf4edab8.jpeg" width=400>
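To make the truncation idea concrete, here is a small sketch of what "using fewer dimensions" means at inference time: keep the first `dim` values of each embedding, rescale to unit length, and compare as usual. The 8-dimensional vectors are hypothetical stand-ins for real 768-dimensional model outputs, and the helper names are illustrative only.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical embeddings; a real model would produce 768 values each.
query = [0.9, 0.1, 0.3, -0.2, 0.05, 0.02, -0.01, 0.03]
chunk = [0.8, 0.2, 0.25, -0.1, -0.04, 0.01, 0.02, -0.02]

# Compare the same pair at progressively smaller prefixes.
for dim in (8, 4, 2):
    q = truncate_and_normalize(query, dim)
    c = truncate_and_normalize(chunk, dim)
    print(dim, round(cosine(q, c), 4))
```

A model trained with MRL is encouraged to keep these prefix similarities useful for ranking, which is what the per-dimension evaluators in the next cell measure.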




In [None]:
# Dimensions of interest
matryoshka_dimensions = [768, 512, 256, 128, 64] # Important: large to small

# Create empty list to hold evaluators
matryoshka_evaluators = []

# Create an evaluator for each above dimension
for dim in matryoshka_dimensions:
    # Define the evaluator
    ir_evaluator = InformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant_docs,
        name=f"dim_{dim}",
        truncate_dim=dim,  # Truncate the embeddings to the respective dimension
        score_functions={"cosine": cos_sim},
    )
    # Add to list
    matryoshka_evaluators.append(ir_evaluator)

# Create a sequential evaluator
# Able to run all our dimension specific InformationRetrievalEvaluators sequentially.
evaluator = SequentialEvaluator(matryoshka_evaluators)

For our embedding model evaluation, we focus on five complementary metrics that together provide a comprehensive view of retrieval quality.


Technical implementation:
```
Accuracy@k = (queries with ≥1 relevant doc in top k) / (total queries)
```

```
DCG@k = Σ(i=1 to k) rel_i / log2(i + 1)
NDCG@k = DCG@k / IDCG@k
```

```
Precision@k = (relevant docs in top k) / k
Recall@k = (relevant docs in top k) / (total relevant docs)
```


```
MRR = (1/|Q|) Σ(i=1 to |Q|) 1/rank_i
```

```
AP@k = (1/min(k, R)) Σ(r=1 to k) (P@r * rel(r))
MAP@k = (1/|Q|) Σ(q=1 to |Q|) AP@k(q)
```
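As an illustration of the formulas above, the following sketch computes each metric for a single query with binary relevance. It is not the evaluator's own code; the ranked list and relevant set are hypothetical, and ranks are 1-indexed to match the formulas.

```python
import math

def accuracy_at_k(ranked, relevant, k):
    """1.0 if at least one relevant document appears in the top k."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0

def precision_at_k(ranked, relevant, k):
    return sum(doc in relevant for doc in ranked[:k]) / k

def recall_at_k(ranked, relevant, k):
    return sum(doc in relevant for doc in ranked[:k]) / len(relevant)

def reciprocal_rank_at_k(ranked, relevant, k):
    """1 / rank of the first relevant document in the top k, else 0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    # Ideal ordering: all relevant documents ranked first.
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, min(k, len(relevant)) + 1))
    return dcg / idcg

def average_precision_at_k(ranked, relevant, k):
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / min(k, len(relevant))

# Hypothetical retrieval result for one query.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2"}

print(accuracy_at_k(ranked, relevant, 1))           # 0.0
print(recall_at_k(ranked, relevant, 3))             # 0.5
print(reciprocal_rank_at_k(ranked, relevant, 10))   # 0.5
print(average_precision_at_k(ranked, relevant, 5))  # 0.5
print(round(ndcg_at_k(ranked, relevant, 5), 4))
```

MRR and MAP are then the means of the reciprocal rank and average precision over all queries.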

---

These metrics are evaluated at several cutoffs k. In this notebook we report Accuracy, Precision, and Recall at k=1, 3, 5, and 10, NDCG and MRR at k=10, and MAP at k=100, to assess performance across different result depths. Together, they provide a comprehensive framework for evaluating retrieval systems across multiple dimensions: basic retrieval capability (Accuracy), ranking quality (NDCG), result set precision and completeness (Precision/Recall), first-relevant-result positioning (MRR), and overall ranking effectiveness (MAP).

In [None]:
# Evaluate the model
base_results = evaluator(model)

# Print header
print("\nBase Model Evaluation Results")
print("-" * 85)
print(f"{'Metric':15} {'768d':>12} {'512d':>12} {'256d':>12} {'128d':>12} {'64d':>12}")
print("-" * 85)

# List of metrics to display
metrics = [
    'ndcg@10',
    'mrr@10',
    'map@100',
    'accuracy@1',
    'accuracy@3',
    'accuracy@5',
    'accuracy@10',
    'precision@1',
    'precision@3',
    'precision@5',
    'precision@10',
    'recall@1',
    'recall@3',
    'recall@5',
    'recall@10'
]

# Print each metric
for metric in metrics:
    values = []
    for dim in matryoshka_dimensions:
        key = f"dim_{dim}_cosine_{metric}"
        values.append(base_results[key])

    # Highlight NDCG@10
    metric_name = f"=={metric}==" if metric == "ndcg@10" else metric
    print(f"{metric_name:15}", end="  ")
    for val in values:
        print(f"{val:12.4f}", end=" ")
    print()

# Print sequential score
print("-" * 85)
print(f"seq_score: {base_results['sequential_score']:.4f}")


Base Model Evaluation Results
-------------------------------------------------------------------------------------
Metric                  768d         512d         256d         128d          64d
-------------------------------------------------------------------------------------
==ndcg@10==            0.4926       0.4761       0.4222       0.3701       0.2671 
mrr@10                 0.4232       0.4063       0.3640       0.3111       0.2199 
map@100                0.4682       0.4521       0.4071       0.3547       0.2604 
accuracy@1             0.3584       0.3392       0.3077       0.2570       0.1748 
accuracy@3             0.4231       0.4126       0.3601       0.3129       0.2220 
accuracy@5             0.5245       0.5157       0.4598       0.3899       0.2885 
accuracy@10            0.6381       0.6224       0.5472       0.4895       0.3689 
precision@1            0.3584       0.3392       0.3077       0.2570       0.1748 
precision@3            0.3228       0.3118       0.2

For our matryoshka embedding evaluation, we track these metrics across multiple embedding dimensions: 768, 512, 256, 128, and 64. This helps us understand how retrieval quality changes as we reduce the embedding size.

---
## Training

Now with our training data prepared, our evaluation methodology ready, and our base model loaded with baseline metrics, it's time to train!

We'll continue using the Sentence Transformers [fine-tuning](https://sbert.net/docs/sentence_transformer/training_overview.html) tools. See the linked documentation for further details.


In [None]:
# Load the model with the eager attention implementation
model = SentenceTransformer(
    model_id,
    model_kwargs={"attn_implementation": "eager"},
    model_card_data=SentenceTransformerModelCardData(
        language="en",
        license="apache-2.0",
        model_name="Fine-tune-all-mpnet-base-v2",
    ),
)

Next we define our loss function. A loss function guides the model at train time: it compares the model's current outputs with the expected outputs, and the gradient of that difference determines the direction in which the weights are updated.

Sentence Transformers offers [many different loss functions](https://sbert.net/docs/sentence_transformer/loss_overview.html) for various scenarios. Given our commitment to MRL training, we will need not only a base loss function, but an additional adapter.

Given our data structure of positive pairs, we utilize [`MultipleNegativesRankingLoss`](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) which optimizes for retrieval scenarios by treating each batch as (a₁, p₁)...(aₙ, pₙ) pairs where (aᵢ, pᵢ) are positive pairs and (aᵢ, pⱼ) for i≠j become negative pairs. This effectively samples n-1 negative examples per positive pair within each batch, with performance scaling with batch size.
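The in-batch negatives idea can be sketched in a few lines. This is an illustrative re-implementation rather than the library's code: each row of the similarity matrix holds one anchor's cosine similarity to every positive in the batch, and the loss is a cross-entropy whose correct "class" is the diagonal entry. The `scale` value of 20 and the similarity values are assumptions made for the example.

```python
import math

def in_batch_negatives_loss(similarities, scale=20.0):
    """Mean cross-entropy where row i's correct column is column i."""
    losses = []
    for i, row in enumerate(similarities):
        logits = [scale * s for s in row]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Hypothetical cosine similarities for a batch of 3 (anchor, positive) pairs.
# Rows are anchors, columns are positives.
good = [[0.90, 0.10, 0.20], [0.00, 0.80, 0.10], [0.20, 0.10, 0.85]]
poor = [[0.30, 0.40, 0.20], [0.50, 0.30, 0.10], [0.20, 0.60, 0.30]]

print(in_batch_negatives_loss(good) < in_batch_negatives_loss(poor))  # True
```

With a batch of n pairs every anchor is contrasted against n-1 other positives, which is why larger batches tend to give a stronger training signal.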

We wrap this with [`MatryoshkaLoss`](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) to enable multi-dimensional embedding training, allowing for dynamic dimensionality reduction at inference time without requiring retraining.
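Conceptually, the wrapper evaluates the base loss on several truncated prefixes of the same embeddings and adds the results together. The sketch below illustrates that idea under simplifying assumptions (equal weights by default, cosine similarity, a hypothetical scale of 20, toy 4-dimensional vectors); it is not the library's implementation.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def ranking_loss(anchors, positives, scale=20.0):
    """In-batch negatives loss: anchor i should score highest against positive i."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [scale * cosine(a, p) for p in positives]
        m = max(logits)
        total += m + math.log(sum(math.exp(l - m) for l in logits)) - logits[i]
    return total / len(anchors)

def truncate(vectors, dim):
    return [v[:dim] for v in vectors]

def matryoshka_loss(anchors, positives, dims, weights=None):
    """Weighted sum of the base loss computed on each truncated prefix."""
    weights = weights or [1.0] * len(dims)
    return sum(
        w * ranking_loss(truncate(anchors, d), truncate(positives, d))
        for d, w in zip(dims, weights)
    )

# Hypothetical 4-dimensional embeddings for two (anchor, positive) pairs.
anchors = [[1.0, 0.2, 0.1, 0.0], [0.1, 1.0, 0.0, 0.3]]
positives = [[0.9, 0.1, 0.2, 0.1], [0.2, 0.8, 0.1, 0.4]]

per_dim = {d: ranking_loss(truncate(anchors, d), truncate(positives, d)) for d in (4, 2)}
total = matryoshka_loss(anchors, positives, dims=[4, 2])
print(per_dim, total)
```

Because every prefix contributes to the total, the model is pushed to keep the most useful information in the leading dimensions.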

In [None]:
# Initial Loss
base_loss = MultipleNegativesRankingLoss(model)

# Matryoshka Loss Wrapper
train_loss = MatryoshkaLoss(
    model, base_loss, matryoshka_dims=matryoshka_dimensions
)

Below are the defined training hyperparameters. These are taken directly from [Philipp Schmid's original blog post](https://www.philschmid.de/fine-tune-embedding-model-for-rag#4-fine-tune-embedding-model-with-sentencetransformerstrainer). It is worth testing various combinations of hyperparameters for optimal performance, but for the sake of this demonstration we will default to Philipp's provided arguments.

In [None]:
# Training Arguments
args = SentenceTransformerTrainingArguments(
    output_dir="Fine-tune-all-mpnet-base-v2", # output directory and hugging face model ID
    num_train_epochs=4,                                        # number of epochs
    per_device_train_batch_size=32,                            # train batch size
    gradient_accumulation_steps=16,                            # for a global batch size of 512
    per_device_eval_batch_size=16,                             # evaluation batch size
    warmup_ratio=0.1,                                          # warmup ratio
    learning_rate=2e-5,                                        # learning rate, 2e-5 is a good value
    lr_scheduler_type="cosine",                                # use cosine learning rate scheduler
    optim="adamw_torch_fused",                                 # use fused adamw optimizer
    tf32=False,                                                # tf32 precision disabled
    bf16=True,                                                 # use bf16 precision
    batch_sampler=BatchSamplers.NO_DUPLICATES,                 # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
    eval_strategy="epoch",                                     # evaluate after each epoch
    save_strategy="epoch",                                     # save after each epoch
    logging_steps=10,                                          # log every 10 steps
    save_total_limit=3,                                        # save only the last 3 models
    load_best_model_at_end=True,                               # load the best model when training ends
    metric_for_best_model="eval_dim_128_cosine_ndcg@10",       # Optimizing for the best ndcg@10 score for the 128 dimension
    report_to="none"                                           # Turning off training logging for now, input 'wandb' etc. if desired.
)
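As a rough sanity check on these arguments, the arithmetic behind the "global batch size of 512" comment can be written out. The device count and training-set size below are assumptions for illustration only, and the trainer's exact step rounding may differ slightly from this estimate.

```python
import math

# Values taken from the training arguments above.
per_device_train_batch_size = 32
gradient_accumulation_steps = 16
num_train_epochs = 4
warmup_ratio = 0.1

# Assumptions for illustration only: a single GPU and a made-up dataset size.
num_devices = 1
num_train_examples = 5_000

# Examples that contribute to one optimizer update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Approximate number of optimizer updates.
steps_per_epoch = math.ceil(num_train_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(effective_batch_size)  # 512
print(steps_per_epoch, total_steps, warmup_steps)
```

Note that gradient accumulation enlarges the batch used per optimizer update, but the in-batch negatives seen by `MultipleNegativesRankingLoss` still come from each forward batch of 32.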

Finally, we package our model, training arguments, dataset, loss function, and evaluator together into a `SentenceTransformerTrainer`.

In [None]:
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset.select_columns(["anchor", "positive"]),
    loss=train_loss,
    evaluator=evaluator,
)

Start the training run!

In [None]:
# Start training
trainer.train()

# Save the best model based on our eval_dim_128_cosine_ndcg@10 criteria
trainer.save_model()

Epoch,Training Loss,Validation Loss,Dim 768 Cosine Accuracy@1,Dim 768 Cosine Accuracy@3,Dim 768 Cosine Accuracy@5,Dim 768 Cosine Accuracy@10,Dim 768 Cosine Precision@1,Dim 768 Cosine Precision@3,Dim 768 Cosine Precision@5,Dim 768 Cosine Precision@10,Dim 768 Cosine Recall@1,Dim 768 Cosine Recall@3,Dim 768 Cosine Recall@5,Dim 768 Cosine Recall@10,Dim 768 Cosine Ndcg@10,Dim 768 Cosine Mrr@10,Dim 768 Cosine Map@100,Dim 512 Cosine Accuracy@1,Dim 512 Cosine Accuracy@3,Dim 512 Cosine Accuracy@5,Dim 512 Cosine Accuracy@10,Dim 512 Cosine Precision@1,Dim 512 Cosine Precision@3,Dim 512 Cosine Precision@5,Dim 512 Cosine Precision@10,Dim 512 Cosine Recall@1,Dim 512 Cosine Recall@3,Dim 512 Cosine Recall@5,Dim 512 Cosine Recall@10,Dim 512 Cosine Ndcg@10,Dim 512 Cosine Mrr@10,Dim 512 Cosine Map@100,Dim 256 Cosine Accuracy@1,Dim 256 Cosine Accuracy@3,Dim 256 Cosine Accuracy@5,Dim 256 Cosine Accuracy@10,Dim 256 Cosine Precision@1,Dim 256 Cosine Precision@3,Dim 256 Cosine Precision@5,Dim 256 Cosine Precision@10,Dim 256 Cosine Recall@1,Dim 256 Cosine Recall@3,Dim 256 Cosine Recall@5,Dim 256 Cosine Recall@10,Dim 256 Cosine Ndcg@10,Dim 256 Cosine Mrr@10,Dim 256 Cosine Map@100,Dim 128 Cosine Accuracy@1,Dim 128 Cosine Accuracy@3,Dim 128 Cosine Accuracy@5,Dim 128 Cosine Accuracy@10,Dim 128 Cosine Precision@1,Dim 128 Cosine Precision@3,Dim 128 Cosine Precision@5,Dim 128 Cosine Precision@10,Dim 128 Cosine Recall@1,Dim 128 Cosine Recall@3,Dim 128 Cosine Recall@5,Dim 128 Cosine Recall@10,Dim 128 Cosine Ndcg@10,Dim 128 Cosine Mrr@10,Dim 128 Cosine Map@100,Dim 64 Cosine Accuracy@1,Dim 64 Cosine Accuracy@3,Dim 64 Cosine Accuracy@5,Dim 64 Cosine Accuracy@10,Dim 64 Cosine Precision@1,Dim 64 Cosine Precision@3,Dim 64 Cosine Precision@5,Dim 64 Cosine Precision@10,Dim 64 Cosine Recall@1,Dim 64 Cosine Recall@3,Dim 64 Cosine Recall@5,Dim 64 Cosine Recall@10,Dim 64 Cosine Ndcg@10,Dim 64 Cosine Mrr@10,Dim 64 Cosine Map@100,Sequential Score
1,14.5908,No log,0.56993,0.648601,0.774476,0.867133,0.56993,0.497086,0.38042,0.226573,0.256993,0.583916,0.72771,0.857955,0.717876,0.643079,0.679269,0.54021,0.627622,0.763986,0.867133,0.54021,0.471445,0.371678,0.22535,0.248689,0.556964,0.71183,0.857809,0.703403,0.619966,0.660086,0.545455,0.625874,0.741259,0.837413,0.545455,0.472611,0.361189,0.21993,0.251311,0.56046,0.691434,0.829108,0.69269,0.617381,0.657252,0.517483,0.61014,0.715035,0.812937,0.517483,0.463287,0.359091,0.213287,0.230041,0.536276,0.677885,0.802739,0.665788,0.59177,0.631181,0.41958,0.493007,0.604895,0.732517,0.41958,0.375291,0.295804,0.191084,0.188666,0.436043,0.559003,0.722465,0.571994,0.492547,0.537236,0.571994
2,8.5538,No log,0.578671,0.666084,0.79021,0.875874,0.578671,0.507576,0.38986,0.23007,0.260781,0.595571,0.744318,0.86961,0.729467,0.65367,0.689483,0.561189,0.652098,0.786713,0.877622,0.561189,0.49359,0.388112,0.23007,0.254953,0.575612,0.739656,0.869027,0.72092,0.640234,0.678067,0.564685,0.646853,0.753497,0.854895,0.564685,0.491841,0.373427,0.224825,0.257867,0.578089,0.710373,0.846591,0.710866,0.636598,0.674525,0.533217,0.624126,0.734266,0.83042,0.533217,0.471445,0.363636,0.215734,0.239219,0.549679,0.690851,0.815851,0.679296,0.606294,0.644782,0.440559,0.520979,0.629371,0.762238,0.440559,0.393357,0.309441,0.198077,0.198427,0.458916,0.587558,0.746212,0.594158,0.514284,0.557581,0.594158
3,6.916,No log,0.589161,0.676573,0.793706,0.884615,0.589161,0.515734,0.393357,0.231993,0.26486,0.603875,0.749563,0.876748,0.738185,0.663398,0.698559,0.573427,0.662587,0.798951,0.879371,0.573427,0.504079,0.392657,0.230594,0.260052,0.588578,0.751166,0.871795,0.729259,0.651326,0.688469,0.564685,0.662587,0.772727,0.858392,0.564685,0.498834,0.382168,0.22535,0.256993,0.586976,0.726836,0.848776,0.714912,0.640763,0.679596,0.547203,0.63986,0.743007,0.833916,0.547203,0.481935,0.366783,0.218357,0.248106,0.565268,0.699155,0.824883,0.69157,0.618951,0.657396,0.438811,0.531469,0.631119,0.756993,0.438811,0.39627,0.309441,0.197028,0.199301,0.466055,0.591492,0.741259,0.593855,0.514792,0.559781,0.593855
4,6.5704,No log,0.587413,0.681818,0.795455,0.886364,0.587413,0.518065,0.394406,0.231993,0.263986,0.607372,0.752185,0.878059,0.738661,0.663561,0.698873,0.573427,0.666084,0.800699,0.881119,0.573427,0.505245,0.393706,0.230944,0.260052,0.591492,0.754371,0.872669,0.730334,0.652236,0.689387,0.566434,0.666084,0.77972,0.858392,0.566434,0.501166,0.386364,0.22535,0.257721,0.589307,0.735431,0.848776,0.716787,0.643294,0.682358,0.54021,0.63986,0.743007,0.83042,0.54021,0.479604,0.367832,0.218182,0.245192,0.562354,0.701049,0.822844,0.688633,0.614658,0.654367,0.435315,0.533217,0.631119,0.762238,0.435315,0.394522,0.309441,0.198252,0.198427,0.465472,0.591055,0.746795,0.595302,0.513878,0.559206,0.595302


Optionally save the model to Hugging Face


In [21]:
# Upload model to hub
trainer.model.push_to_hub("Fine-tune-all-mpnet-base-v2")

Uploading...:   0%|          | 0.00/438M [00:00<?, ?B/s]

'https://huggingface.co/thanhpham1/Fine-tune-all-mpnet-base-v2/commit/47f2116eac2fec631a515617186b8c5983703eee'

---
## Evaluating Trained Model

In [22]:
fine_tuned_model = SentenceTransformer(
    args.output_dir, device="cuda" if torch.cuda.is_available() else "cpu"
)

# Evaluate the model
ft_results = evaluator(fine_tuned_model)

# Print header
print("Fine Tuned Model Evaluation Results")
print("-" * 85)
print(f"{'Metric':15} {'768d':>12} {'512d':>12} {'256d':>12} {'128d':>12} {'64d':>12}")
print("-" * 85)

# List of metrics to display
metrics = [
    'ndcg@10',
    'mrr@10',
    'map@100',
    'accuracy@1',
    'accuracy@3',
    'accuracy@5',
    'accuracy@10',
    'precision@1',
    'precision@3',
    'precision@5',
    'precision@10',
    'recall@1',
    'recall@3',
    'recall@5',
    'recall@10'
]

# Print each metric
for metric in metrics:
    values = []
    for dim in matryoshka_dimensions:
        key = f"dim_{dim}_cosine_{metric}"
        values.append(ft_results[key])

    # Highlight NDCG@10
    metric_name = f"=={metric}==" if metric == "ndcg@10" else metric
    print(f"{metric_name:15}", end="  ")
    for val in values:
        print(f"{val:12.4f}", end=" ")
    print()

# Print sequential score
print("-" * 85)
print(f"seq_score: {ft_results['sequential_score']:.4f}")

Fine Tuned Model Evaluation Results
-------------------------------------------------------------------------------------
Metric                  768d         512d         256d         128d          64d
-------------------------------------------------------------------------------------
==ndcg@10==            0.7376       0.7283       0.7143       0.6903       0.5945 
mrr@10                 0.6625       0.6513       0.6405       0.6156       0.5149 
map@100                0.6977       0.6886       0.6795       0.6546       0.5594 
accuracy@1             0.5874       0.5734       0.5647       0.5420       0.4388 
accuracy@3             0.6766       0.6661       0.6626       0.6399       0.5297 
accuracy@5             0.7955       0.8007       0.7727       0.7413       0.6294 
accuracy@10            0.8864       0.8776       0.8566       0.8339       0.7587 
precision@1            0.5874       0.5734       0.5647       0.5420       0.4388 
precision@3            0.5152       0.5052     

---
## Base vs FT Comparison

| Metric | Dimension | Base | Fine-tuned | Abs. Improvement | % Improvement |
|---------|-----------|------|------------|-----------------|---------------|
| ndcg@10 | 768d | 0.4926 | 0.7376 | 0.2450 | 49.7% |
| ndcg@10 | 512d | 0.4761 | 0.7283 | 0.2522 | 53.0% |
| ndcg@10 | 256d | 0.4222 | 0.7143 | 0.2921 | 69.2% |
| ndcg@10 | 128d | 0.3701 | 0.6903 | 0.3202 | 86.5% |
| ndcg@10 | 64d | 0.2671 | 0.5945 | 0.3274 | 122.6% |
| mrr@10 | 768d | 0.4232 | 0.6625 | 0.2393 | 56.5% |
| mrr@10 | 512d | 0.4063 | 0.6513 | 0.2450 | 60.3% |
| mrr@10 | 256d | 0.3640 | 0.6405 | 0.2765 | 76.0% |
| mrr@10 | 128d | 0.3111 | 0.6156 | 0.3045 | 97.9% |
| mrr@10 | 64d | 0.2199 | 0.5149 | 0.2950 | 134.2% |
| map@100 | 768d | 0.4682 | 0.6977 | 0.2295 | 49.0% |
| map@100 | 512d | 0.4521 | 0.6886 | 0.2365 | 52.3% |
| map@100 | 256d | 0.4071 | 0.6795 | 0.2724 | 66.9% |
| map@100 | 128d | 0.3547 | 0.6546 | 0.2999 | 84.6% |
| map@100 | 64d | 0.2604 | 0.5594 | 0.2990 | 114.8% |
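The improvement columns can be reproduced directly from the reported scores. The short sketch below does this for the ndcg@10 rows, using the values printed in the two evaluation tables; the helper name `improvement` is illustrative.

```python
# ndcg@10 per embedding dimension, copied from the evaluation outputs above.
base = {768: 0.4926, 512: 0.4761, 256: 0.4222, 128: 0.3701, 64: 0.2671}
fine_tuned = {768: 0.7376, 512: 0.7283, 256: 0.7143, 128: 0.6903, 64: 0.5945}

def improvement(before, after):
    """Return (absolute gain, relative gain in percent)."""
    absolute = after - before
    return round(absolute, 4), round(100 * absolute / before, 1)

for dim in base:
    abs_gain, pct_gain = improvement(base[dim], fine_tuned[dim])
    print(f"ndcg@10 | {dim}d | {abs_gain:.4f} | {pct_gain:.1f}%")
```

The relative gain grows as the dimension shrinks because the base model's scores fall off much faster under truncation than the fine-tuned model's do.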

These are impressive results from our fine-tuning: NDCG@10 improves by roughly 50% at 768 dimensions and by more than 120% at 64 dimensions, and the model generalizes effectively to unseen queries across the existing knowledge base.

Further testing would have to be run to understand how well this model generalizes to unseen documents outside of the knowledge base.

---
## Using the Model
The model can now be loaded and used like any other sentence transformers model:

In [23]:
%%capture
!pip install --upgrade sentence-transformers
!pip install git+https://github.com/huggingface/transformers

In [1]:
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("thanhpham1/Fine-tune-all-mpnet-base-v2", truncate_dim=256)

In [2]:
# Run inference
sentences = [
    'What type of framework is Ray described as?',
    '''Overview#
Ray is an open-source unified framework for scaling AI and Python applications like machine learning. It provides the compute layer for parallel processing so that you don’t need to be a distributed systems expert. Ray minimizes the complexity of running your distributed individual and end-to-end machine learning workflows with these components:

Scalable libraries for common machine learning tasks such as data preprocessing, distributed training, hyperparameter tuning, reinforcement learning, and model serving.
Pythonic distributed computing primitives for parallelizing and scaling Python applications.
Integrations and utilities for integrating and deploying a Ray cluster with existing tools and infrastructure such as Kubernetes, AWS, GCP, and Azure.

For data scientists and machine learning practitioners, Ray lets you scale jobs without needing infrastructure expertise:''', # Corresponding Positive
    '''Fault tolerance#Fault tolerance in Ray Train and Tune consists of experiment-level and trial-level
restoration. Experiment-level restoration refers to resuming all trials,
in the event that an experiment is interrupted in the middle of training due
to a cluster-level failure. Trial-level restoration refers to resuming
individual trials, in the event that a trial encounters a runtime
error such as OOM.

Framework#The deep-learning framework used for the model(s), loss(es), and optimizer(s)
inside an RLlib Algorithm. RLlib currently supports PyTorch and TensorFlow.

GCS / Global Control Service#Centralized metadata server for a Ray cluster. It runs on the Ray head node
and has functions like managing node membership and actor directory.
It’s also known as the Global Control Store.

Head node#A node that runs extra cluster-level processes like GCS and API server in
addition to those processes running on a worker node. A Ray cluster only has
one head node.''', # Random Excerpt
]

embeddings = model.encode(sentences)
print(embeddings.shape)

(3, 256)


In [3]:
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities[0])

tensor([1.0000, 0.5034, 0.1761])


For comparison, here is the output from our base model:

In [5]:
# Download from the 🤗 Hub
model2 = SentenceTransformer("sentence-transformers/all-mpnet-base-v2", truncate_dim=256)

In [6]:
# Run inference
sentences = [
    'What type of framework is Ray described as?',
    '''Overview#
Ray is an open-source unified framework for scaling AI and Python applications like machine learning. It provides the compute layer for parallel processing so that you don’t need to be a distributed systems expert. Ray minimizes the complexity of running your distributed individual and end-to-end machine learning workflows with these components:

Scalable libraries for common machine learning tasks such as data preprocessing, distributed training, hyperparameter tuning, reinforcement learning, and model serving.
Pythonic distributed computing primitives for parallelizing and scaling Python applications.
Integrations and utilities for integrating and deploying a Ray cluster with existing tools and infrastructure such as Kubernetes, AWS, GCP, and Azure.

For data scientists and machine learning practitioners, Ray lets you scale jobs without needing infrastructure expertise:''', # Corresponding Positive

    '''Fault tolerance#Fault tolerance in Ray Train and Tune consists of experiment-level and trial-level
restoration. Experiment-level restoration refers to resuming all trials,
in the event that an experiment is interrupted in the middle of training due
to a cluster-level failure. Trial-level restoration refers to resuming
individual trials, in the event that a trial encounters a runtime
error such as OOM.

Framework#The deep-learning framework used for the model(s), loss(es), and optimizer(s)
inside an RLlib Algorithm. RLlib currently supports PyTorch and TensorFlow.

GCS / Global Control Service#Centralized metadata server for a Ray cluster. It runs on the Ray head node
and has functions like managing node membership and actor directory.
It’s also known as the Global Control Store.

Head node#A node that runs extra cluster-level processes like GCS and API server in
addition to those processes running on a worker node. A Ray cluster only has
one head node.''', # Random Excerpt
]

embeddings2 = model2.encode(sentences)
print(embeddings2.shape)

(3, 256)


In [35]:
# Get the similarity scores for the embeddings
similarities2 = model2.similarity(embeddings2, embeddings2)
print(similarities2[0])

tensor([1.0000, 0.3352, 0.4369])
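To read the two printed tensors side by side, a small sketch can compute the margin between the query's similarity to its corresponding positive and to the random excerpt. The numbers are copied from the outputs above (both at 256 dimensions); the dictionary layout is just for illustration.

```python
# Similarity of the query to the corresponding positive and to the random excerpt,
# copied from the printed tensors for each model.
fine_tuned = {"positive": 0.5034, "random": 0.1761}
base = {"positive": 0.3352, "random": 0.4369}

for name, scores in (("fine-tuned", fine_tuned), ("base", base)):
    margin = scores["positive"] - scores["random"]
    top = max(scores, key=scores.get)  # which passage would be ranked first
    print(f"{name}: margin={margin:+.4f}, top match={top}")

# fine-tuned: margin=+0.3273, top match=positive
# base: margin=-0.1017, top match=random
```

For this query, the base model scores the random excerpt above the correct chunk at 256 dimensions, while the fine-tuned model ranks the correct chunk first by a clear margin.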
