In [1]:
%load_ext autoreload
%autoreload 2

# Systematically Improving RAG

Depending on your use case, you might want to use an open-source model instead of a proprietary one such as Cohere's. In this notebook, we'll look at how to use an open-source model from Hugging Face.

## Setup

We'll use a setup similar to the previous notebook: we create a train-test split and then fine-tune the model on the training data.

Let's first see how we can use the same metrics as before to evaluate the model.
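
As a reminder of what those metrics measure, here is a minimal pure-Python sketch of recall@k and MRR@k for a single query. The helper names `recall_at_k` and `mrr_at_k` and the toy categories are hypothetical and exist only for illustration; the notebook itself relies on `calculate_metrics_at_k` from `indomee`, whose internals may differ.

```python
def recall_at_k(preds: list[str], labels: list[str], k: int) -> float:
    """Fraction of the relevant labels that appear in the top-k predictions."""
    if not labels:
        return 0.0
    top_k = set(preds[:k])
    return sum(label in top_k for label in labels) / len(labels)


def mrr_at_k(preds: list[str], labels: list[str], k: int) -> float:
    """Reciprocal rank of the first relevant prediction in the top k (0 if none)."""
    relevant = set(labels)
    for rank, pred in enumerate(preds[:k], start=1):
        if pred in relevant:
            return 1.0 / rank
    return 0.0


# Toy query (made-up categories): the correct category is ranked second.
preds = ["Travel", "Software", "Meals"]
labels = ["Software"]

print(recall_at_k(preds, labels, k=1), mrr_at_k(preds, labels, k=1))
print(recall_at_k(preds, labels, k=3), mrr_at_k(preds, labels, k=3))
```

When every query has exactly one expected label, recall@1 and MRR@1 reduce to the same number: both are 1 only if the top prediction is correct.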

In [2]:
# To resolve the warning from huggingface/tokenizers about parallelism:
# 1. Avoid using `tokenizers` before the fork if possible.
# 2. Explicitly set the environment variable TOKENIZERS_PARALLELISM to either 'true' or 'false'.
import os

# Set the environment variable to disable parallelism warnings
os.environ["TOKENIZERS_PARALLELISM"] = "false"

TEST_SIZE = 0.2
BASE_MODEL_NAME = "BAAI/bge-base-en"
FINETUNED_MODEL_NAME = "ivanleomk/finetuned-bge-base-en"

MODEL_OUTPUT_DIR = "./models/bge-base-en"
WANDB_RUN_NAME = "bge-base-en"
CATEGORIES_PATH = "data/categories.json"
TRAIN_EVALUATOR_NAME = "bge-base-en-train"
EVAL_EVALUATOR_NAME = "bge-base-en-eval"


In [3]:
from braintrust import init_dataset
import json

# Use a context manager so the file handle is closed after loading
with open(CATEGORIES_PATH) as f:
    categories = json.load(f)
dataset = init_dataset(project="fine-tuning", name="Synthetic Transactions")


def get_dataset_split(split: str, dataset):
    return [
        {
            "input": transaction["input"],
            "expected": transaction["expected"],
        }
        for transaction in dataset
        if transaction["metadata"]["split"] == split
    ]


train_data = get_dataset_split("train", dataset)
eval_data = get_dataset_split("eval", dataset)

# Fine-tuning

We can fine-tune the model using the same approach as before. In this case, we'll split the data three ways and fine-tune the model on the training portion.

- Train: the model is trained on this data.
- Test: we use this data to see how the model performs during training.
- Eval: this is held out and only used at the end to measure the model's performance.
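
The three-way split can be sketched as follows. This is an illustrative helper under stated assumptions, not the notebook's exact code: `three_way_split`, the `seed` parameter, and the toy records are hypothetical. The shuffle is an addition here; the notebook simply takes the first 20% of the training data as the test slice.

```python
import random


def three_way_split(train_pool, eval_pool, test_size=0.2, seed=42):
    """Carve a test slice out of the training pool; leave eval untouched."""
    shuffled = list(train_pool)
    random.Random(seed).shuffle(shuffled)  # shuffle a copy, reproducibly
    cut = int(len(shuffled) * test_size)
    return shuffled[cut:], shuffled[:cut], list(eval_pool)


# Toy records in the same {"input", "expected"} shape used in this notebook.
train_pool = [{"input": f"txn {i}", "expected": ["Software"]} for i in range(10)]
eval_pool = [{"input": "held-out txn", "expected": ["Travel"]}]

train, test, held_out = three_way_split(train_pool, eval_pool)
print(len(train), len(test), len(held_out))
```

Shuffling before slicing avoids a test slice that only covers whichever categories happen to come first in the dataset.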


We'll do so with Sentence Transformers because it provides a large number of loss functions out of the box and is compatible with Hugging Face, which makes it easy to save and upload the fine-tuned model when we're done.

We'll be using the Batch Semi Hard Triplet Loss (`BatchSemiHardTripletLoss`) in this case. It helps the model learn a decision boundary between the positive and negative examples.
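
To make the idea concrete, here is a minimal sketch of the triplet margin loss and the "semi-hard" condition on hand-picked 2-D points. The margin of `1.0`, the Euclidean distance, and the helper names are assumptions chosen for illustration; the library's loss works on batches of embeddings and builds triplets from the labels inside each batch, so treat this as the underlying intuition rather than its implementation.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(d(a, p) - d(a, n) + margin, 0)"""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


def is_semi_hard(anchor, positive, negative, margin=1.0):
    """The negative is farther than the positive, but still inside the margin."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return d_pos < d_neg < d_pos + margin


anchor, positive = (0.0, 0.0), (1.0, 0.0)  # d(a, p) = 1.0
semi_hard_negative = (1.5, 0.0)            # d(a, n) = 1.5, inside the margin
easy_negative = (5.0, 0.0)                 # d(a, n) = 5.0, already far away

print(triplet_loss(anchor, positive, semi_hard_negative))  # 0.5
print(triplet_loss(anchor, positive, easy_negative))       # 0.0
```

Easy negatives contribute zero loss, so they teach the model nothing. Semi-hard negatives are already ranked correctly but still violate the margin, which gives a useful, stable training signal.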

To do so, we'll need to create two formats of our dataset:

1. A triplet dataset: this is used by the evaluator to measure the performance of the model.
2. A transaction-to-label dataset: this is used to train the model so that it groups similar transactions together in the latent space.
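
Before building them from the real data, it may help to see the shape of both formats on a few toy records. The transactions and category names below are made up, and the fixed random seed is an assumption added so the sketch is reproducible.

```python
import random

toy_data = [
    {"input": "GitHub Copilot subscription", "expected": ["Software"]},
    {"input": "Uber ride to airport", "expected": ["Travel"]},
    {"input": "Team lunch at cafe", "expected": ["Meals"]},
]

# Format 1: sentence -> integer label (what the batch triplet loss trains on).
label_to_idx = {}
for item in toy_data:
    label_to_idx.setdefault(item["expected"][0], len(label_to_idx))
sentence_to_label = {
    "sentence": [item["input"] for item in toy_data],
    "label": [label_to_idx[item["expected"][0]] for item in toy_data],
}

# Format 2: (anchor, positive, negative) strings for the evaluator.
rng = random.Random(0)
all_labels = sorted(label_to_idx)
triplets = {"anchor": [], "positive": [], "negative": []}
for item in toy_data:
    label = item["expected"][0]
    triplets["anchor"].append(item["input"])  # the transaction text
    triplets["positive"].append(label)        # its correct category
    triplets["negative"].append(rng.choice([l for l in all_labels if l != label]))

print(sentence_to_label["label"])
print(triplets["positive"])
```

The first format carries only class membership and leaves triplet mining to the loss. The second spells out one explicit comparison per transaction, which is what an accuracy-style evaluator needs.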

In [4]:

train_data = get_dataset_split("train", dataset)
eval_data = get_dataset_split("eval", dataset)

test_data = train_data[: int(len(train_data) * TEST_SIZE)]
train_data = train_data[int(len(train_data) * TEST_SIZE) :]

In [5]:
from collections import defaultdict
import random
from datasets import Dataset


def create_labels(data):
    label_to_example = defaultdict(list)

    for item in data:
        label_to_example[item["expected"][0]].append(item)

    return {label: idx for idx, label in enumerate(label_to_example.keys())}


def create_sentence_to_label_dataset(data, label_to_idx):
    return Dataset.from_dict(
        {
            "sentence": [item["input"] for item in data],
            "label": [label_to_idx[item["expected"][0]] for item in data],
        }
    )


def create_triplet_dataset(data):
    label_to_example = defaultdict(list)

    for item in data:
        label_to_example[item["expected"][0]].append(item)

    labels = set(label_to_example.keys())

    anchors = []
    positives = []
    negatives = []

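    for item in data:
        label = item["expected"][0]
        # The evaluator encodes text, so the anchor is the transaction string
        anchor = item["input"]
        positive = label
        # Sample a negative from any label other than the correct one
        negative = random.choice([other for other in labels if other != label])
        anchors.append(anchor)
        positives.append(positive)
        negatives.append(negative)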

    return {"anchor": anchors, "positive": positives, "negative": negatives}


labels_to_idx = create_labels(train_data)

train_triplets = create_triplet_dataset(train_data)
test_triplets = create_triplet_dataset(test_data)
eval_triplets = create_triplet_dataset(eval_data)

sentence_to_label_train_dataset = create_sentence_to_label_dataset(
    train_data, labels_to_idx
)
sentence_to_label_test_dataset = create_sentence_to_label_dataset(
    test_data, labels_to_idx
)
sentence_to_label_eval_dataset = create_sentence_to_label_dataset(
    eval_data, labels_to_idx
)


In [6]:
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import BatchSemiHardTripletLoss
from sentence_transformers.training_args import BatchSamplers
from sentence_transformers.evaluation import TripletEvaluator

model = SentenceTransformer(BASE_MODEL_NAME)
loss = BatchSemiHardTripletLoss(model)
args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir=MODEL_OUTPUT_DIR,
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,  # Set to False if you get an error that your GPU can't run on FP16
    bf16=False,  # Set to True if you have a GPU that supports BF16
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # Avoids duplicate samples in a batch; BatchSamplers.GROUP_BY_LABEL is the sampler usually recommended for batch triplet losses
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2,
    logging_steps=100,
    run_name=WANDB_RUN_NAME,  # Will be used in W&B if `wandb` is installed
)

train_evaluator = TripletEvaluator(
    anchors=train_triplets["anchor"],
    positives=train_triplets["positive"],
    negatives=train_triplets["negative"],
    name=TRAIN_EVALUATOR_NAME,
)

train_evaluator(model)

{'bge-base-en-train_cosine_accuracy': 0.8461538461538461,
 'bge-base-en-train_dot_accuracy': 0.15384615384615385,
 'bge-base-en-train_manhattan_accuracy': 0.8557692307692307,
 'bge-base-en-train_euclidean_accuracy': 0.8461538461538461,
 'bge-base-en-train_max_accuracy': 0.8557692307692307}

In [7]:
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=sentence_to_label_train_dataset,
    eval_dataset=sentence_to_label_test_dataset,
    loss=loss,
    evaluator=train_evaluator,
)

trainer.train()

TrainOutput(global_step=65, training_loss=4.900086388221154, metrics={'train_runtime': 5.0757, 'train_samples_per_second': 204.897, 'train_steps_per_second': 12.806, 'total_flos': 0.0, 'train_loss': 4.900086388221154, 'epoch': 5.0})

In [8]:
test_evaluator = TripletEvaluator(
    anchors=eval_triplets["anchor"],
    positives=eval_triplets["positive"],
    negatives=eval_triplets["negative"],
    name=EVAL_EVALUATOR_NAME,
)
test_evaluator(model)

{'bge-base-en-eval_cosine_accuracy': 0.9545454545454546,
 'bge-base-en-eval_dot_accuracy': 0.045454545454545456,
 'bge-base-en-eval_manhattan_accuracy': 0.9545454545454546,
 'bge-base-en-eval_euclidean_accuracy': 0.9545454545454546,
 'bge-base-en-eval_max_accuracy': 0.9545454545454546}

In [9]:
model.save_pretrained(f"models/finetuned-{BASE_MODEL_NAME}")
model.push_to_hub(FINETUNED_MODEL_NAME, exist_ok=True)


'https://huggingface.co/ivanleomk/finetuned-bge-base-en/commit/c3c83d0594e980f60275019f1cb16fba8738d27d'

In [10]:
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

def create_lancedb_table(model_name: str, categories: list[str]):
    model = get_registry().get("huggingface").create(name=model_name)


    class Category(LanceModel):
        text: str = model.SourceField()
        embedding: Vector(model.ndims()) = model.VectorField()


    db = lancedb.connect("./lancedb")
    table = db.create_table(
        f"categories-{model_name.replace('/', '-')}", schema=Category, mode="overwrite"
    )

    table.add(
        [
            {
                "text": category["category"],
            }
            for category in categories
        ]
    )

    return table

In [11]:
from indomee import calculate_metrics_at_k
from braintrust import Score, Eval

with open(CATEGORIES_PATH) as f:
    categories = json.load(f)
base_table = create_lancedb_table(BASE_MODEL_NAME, categories)
finetuned_table = create_lancedb_table(FINETUNED_MODEL_NAME, categories)


def evaluate_braintrust(input, output, **kwargs):
    metrics = calculate_metrics_at_k(
        metrics=["mrr", "recall"], k=[1, 3, 5], preds=output, labels=kwargs["expected"]
    )
    return [
        Score(
            name=metric,
            score=metrics[metric],
            metadata={"query": input, "result": output, **kwargs["metadata"]},
        )
        for metric in metrics
    ]


def task(user_query, table_to_query):
    return [
        item["text"]
        for item in table_to_query.search(user_query, query_type="vector")
        .limit(25)
        .to_list()
    ]

for query_table in [base_table, finetuned_table]:
    await Eval(
        "fine-tuning",
        experiment_name=f"synthetic-transactions-train-{query_table.name}",
        data=lambda: eval_data,
        task=lambda query: task(query, query_table),
        scores=[evaluate_braintrust],
        metadata={"model": query_table.name},
    )


Experiment synthetic-transactions-train-categories-BAAI-bge-base-en is running at https://www.braintrust.dev/app/567/p/fine-tuning/experiments/synthetic-transactions-train-categories-BAAI-bge-base-en

Experiment synthetic-transactions-train-categories-ivanleomk-finetuned-bge-base-en is running at https://www.braintrust.dev/app/567/p/fine-tuning/experiments/synthetic-transactions-train-categories-ivanleomk-finetuned-bge-base-en



synthetic-transactions-train-categories-BAAI-bge-base-en compared to synthetic-transactions-train-bdf7ddaa:
36.36% (-19.70%) 'mrr@1'    score	(6 improvements, 19 regressions)
48.99% (-20.45%) 'mrr@3'    score	(8 improvements, 29 regressions)
51.41% (-19.17%) 'mrr@5'    score	(8 improvements, 32 regressions)
36.36% (-19.70%) 'recall@1' score	(6 improvements, 19 regressions)
68.18% (-18.18%) 'recall@3' score	(0 improvements, 12 regressions)
78.79% (-12.12%) 'recall@5' score	(0 improvements, 8 regressions)

5.77s (-89.66%) 'duration'	(18 improvements, 48 regressions)

See results for synthetic-transactions-train-categories-BAAI-bge-base-en at https://www.braintrust.dev/app/567/p/fine-tuning/experiments/synthetic-transactions-train-categories-BAAI-bge-base-en



synthetic-transactions-train-categories-ivanleomk-finetuned-bge-base-en compared to synthetic-transactions-train-categories-BAAI-bge-base-en:
56.06% (+19.70%) 'mrr@1'    score	(19 improvements, 6 regressions)
69.44% (+20.45%) 'mrr@3'    score	(29 improvements, 8 regressions)
70.58% (+19.17%) 'mrr@5'    score	(32 improvements, 8 regressions)
56.06% (+19.70%) 'recall@1' score	(19 improvements, 6 regressions)
86.36% (+18.18%) 'recall@3' score	(12 improvements, 0 regressions)
90.91% (+12.12%) 'recall@5' score	(8 improvements, 0 regressions)

5.58s (-18.99%) 'duration'	(56 improvements, 10 regressions)

See results for synthetic-transactions-train-categories-ivanleomk-finetuned-bge-base-en at https://www.braintrust.dev/app/567/p/fine-tuning/experiments/synthetic-transactions-train-categories-ivanleomk-finetuned-bge-base-en
