# Systematically Improving Your RAG Application

We want to improve the quality of our expense categorization. To do so, we'll fine-tune a Cohere reranker.

Since we only have ~80 examples, we'll bootstrap additional ones by pairing each transaction in our training dataset with hard negatives randomly sampled from the other categories.

To fine-tune the reranker, we need to format our dataset the way Cohere expects. Each training example looks like this:

```
{
    "query": "What is the name of the merchant?",
    "relevant_passages": ["McDonalds", "Starbucks"],
    "hard_negatives": ["Exxon", "Chevron"]
}
```

In our case, the `query` is the transaction input that we previously generated, and the correct label is the only entry in `relevant_passages`. The other labels serve as our `hard_negatives`.

So for each example in our dataset, we generate two training rows, each with 4 hard negatives sampled from the other labels.
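
As a rough illustration of the mapping described above, the sketch below builds rows for a single transaction. The transaction text, the label names, and the helper `build_rows` are hypothetical stand-ins for the real dataset, and the number of rows per transaction is a parameter. It uses `random.sample`, which draws without replacement, so the negatives within a row are distinct.

```python
import json
import random

# Hypothetical category labels; the real ones come from the training split.
LABELS = ["Meals", "Fuel", "Software", "Travel", "Office Supplies", "Utilities"]


def build_rows(query, correct_label, labels, n_rows=2, n_negatives=4, seed=0):
    """Build fine-tuning rows for one transaction with distinct hard negatives."""
    rng = random.Random(seed)
    # Every label except the correct one is a candidate hard negative.
    candidates = sorted(label for label in labels if label != correct_label)
    k = min(n_negatives, len(candidates))
    rows = []
    for _ in range(n_rows):
        rows.append(
            {
                "query": query,
                "relevant_passages": [correct_label],
                # sample() draws without replacement, so no label repeats in a row
                "hard_negatives": rng.sample(candidates, k),
            }
        )
    return rows


rows = build_rows("STARBUCKS #1234 SEATTLE WA $5.75", "Meals", LABELS)
for row in rows:
    print(json.dumps(row))
```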

In [22]:
from braintrust import init_dataset

dataset = init_dataset(project="fine-tuning", name="Synthetic Transactions")


def get_dataset_split(split: str, dataset):
    return [
        {
            "input": transaction["input"],
            "expected": transaction["expected"],
        }
        for transaction in dataset
        if transaction["metadata"]["split"] == split
    ]


train_data = get_dataset_split("train", dataset)
eval_data = get_dataset_split("eval", dataset)

In [12]:
from pydantic import BaseModel


class CohereFinetuneItem(BaseModel):
    query: str
    relevant_passages: list[str]
    hard_negatives: list[str]


labels = {transaction["expected"][0] for transaction in train_data}

In [23]:
import random

finetuning_data = []

for transaction in train_data:
    query = transaction["input"]
    relevant_passages = [transaction["expected"][0]]

    valid_hard_negatives = [label for label in labels if label != relevant_passages[0]]

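    for _ in range(2):
        # random.sample draws without replacement, so the negatives in a row are
        # distinct (random.choices could return the same label more than once)
        hard_negatives = random.sample(
            valid_hard_negatives, k=min(4, len(valid_hard_negatives))
        )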
        finetuning_data.append(
            CohereFinetuneItem(
                query=query,
                relevant_passages=relevant_passages,
                hard_negatives=hard_negatives,
            )
        )

with open("./data/cohere_finetune.jsonl", "w") as f:
    for item in finetuning_data:
        f.write(item.model_dump_json() + "\n")
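
Before uploading, it can help to sanity-check the JSONL file locally. The sketch below is only an illustration: `validate_rows` is a hypothetical helper, and the rules it enforces (a non-empty query, at least one relevant passage, no label that is both relevant and a hard negative, no duplicate hard negatives) are reasonable guesses at what a reranker dataset should satisfy, not Cohere's own validation logic.

```python
import json


def validate_rows(lines):
    """Return a list of (line_number, problem) tuples for malformed rows."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            problems.append((line_number, "not valid JSON"))
            continue
        if not isinstance(row, dict):
            problems.append((line_number, "row is not a JSON object"))
            continue
        query = row.get("query")
        positives = row.get("relevant_passages") or []
        negatives = row.get("hard_negatives") or []
        if not isinstance(query, str) or not query.strip():
            problems.append((line_number, "missing or empty query"))
        if not positives:
            problems.append((line_number, "no relevant_passages"))
        if set(positives) & set(negatives):
            problems.append((line_number, "label is both relevant and a hard negative"))
        if len(set(negatives)) != len(negatives):
            problems.append((line_number, "duplicate hard negatives"))
    return problems


# Made-up rows: the second and third are deliberately broken.
sample_lines = [
    '{"query": "UBER TRIP", "relevant_passages": ["Travel"], "hard_negatives": ["Meals", "Fuel"]}',
    '{"query": "SHELL OIL", "relevant_passages": ["Fuel"], "hard_negatives": ["Fuel", "Meals"]}',
    "not json",
]
problems = validate_rows(sample_lines)
print(problems)
```

To check the real file, an open handle on `./data/cohere_finetune.jsonl` can be passed as `lines`.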


In [24]:
import cohere

co = cohere.ClientV2()

reranked_dataset = co.datasets.create(
    name="Synthetic Transactions Finetune",
    data=open("./data/cohere_finetune.jsonl", "rb"),
    type="reranker-finetune-input",
)

co.wait(reranked_dataset).dataset.validation_status


...
...
...


'validated'

In [25]:
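# Cohere's BaseModel is aliased so it does not shadow the pydantic BaseModel imported earlier
from cohere.finetuning import BaseModel as CohereBaseModel
from cohere.finetuning import FinetunedModel, Settings

finetune_request = co.finetuning.create_finetuned_model(
    request=FinetunedModel(
        name="finetuned-cohere-reranker",
        settings=Settings(
            base_model=CohereBaseModel(base_type="BASE_TYPE_RERANK"),
            dataset_id=reranked_dataset.id,
        ),
    )
)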

In [29]:
response = co.finetuning.get_finetuned_model(finetune_request.finetuned_model.id)
response.finetuned_model.status


'STATUS_READY'
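
Fine-tuning takes a while, so the status check above usually has to be repeated until the model is ready. A minimal polling sketch follows. `wait_for_status` is a hypothetical helper, and every status name other than `STATUS_READY` is an assumption about the Cohere API rather than something shown in this notebook. A fake status sequence stands in for the real call so the example runs without an API key.

```python
import time


def wait_for_status(
    fetch_status,
    ready="STATUS_READY",
    failed=("STATUS_FAILED",),  # assumed failure status name
    poll_seconds=30.0,
    max_polls=120,
):
    """Call fetch_status() until it returns `ready`; raise on failure or timeout."""
    for _ in range(max_polls):
        status = fetch_status()
        if status == ready:
            return status
        if status in failed:
            raise RuntimeError(f"fine-tuning ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError("fine-tuned model was not ready in time")


# Stand-in for the real API call; the intermediate status names are assumed.
fake_statuses = iter(["STATUS_QUEUED", "STATUS_FINETUNING", "STATUS_READY"])
final_status = wait_for_status(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)
```

With the real client, `fetch_status` could wrap the call shown above, for example `lambda: co.finetuning.get_finetuned_model(finetune_request.finetuned_model.id).finetuned_model.status`.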

In [31]:
import lancedb
db = lancedb.connect("./lancedb")
table = db.open_table("categories")
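
The evaluation that follows reports MRR@k and recall@k. As a small worked example of what those numbers mean, the sketch below scores one hypothetical ranked list against a single expected label. The label names are made up, and the two helpers are simplified restatements of the metric definitions used in this notebook. Note that with exactly one expected label, MRR@1 and recall@1 are always equal.

```python
def reciprocal_rank(predictions, gt):
    """Return 1 / rank of the first correct label, or 0.0 if none is present."""
    for rank, label in enumerate(predictions, start=1):
        if label in gt:
            return 1 / rank
    return 0.0


def recall(predictions, gt):
    """Return the fraction of expected labels that appear in the predictions."""
    return sum(label in predictions for label in gt) / len(gt)


# Hypothetical retrieval result: the correct label sits at rank 3.
ranked = ["Fuel", "Travel", "Meals", "Software", "Utilities"]
expected = ["Meals"]

scores = {}
for k in (1, 3, 5):
    top_k = ranked[:k]  # only the first k results count toward metric@k
    scores[f"mrr@{k}"] = reciprocal_rank(top_k, expected)
    scores[f"recall@{k}"] = recall(top_k, expected)
print(scores)
```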


In [36]:
from braintrust import Eval, Score
import itertools
from lancedb.rerankers import CohereReranker

def calculate_mrr(predictions: list[str], gt: list[str]):
    mrr = 0
    for label in gt:
        if label in predictions:
            mrr = max(mrr, 1 / (predictions.index(label) + 1))
    return mrr


def get_recall(predictions: list[str], gt: list[str]):
    return len([label for label in gt if label in predictions]) / len(gt)


eval_metrics = [["mrr", calculate_mrr], ["recall", get_recall]]
sizes = [1, 3, 5]

# Build one scorer per (metric, k) pair. The default arguments bind the current
# metric function and cutoff so each lambda keeps its own values.
metrics = {
    f"{metric_name}@{size}": lambda predictions, gt, m=metric_fn, s=size: m(
        predictions[:s], gt
    )
    for (metric_name, metric_fn), size in itertools.product(eval_metrics, sizes)
}


def evaluate_braintrust(input, output, **kwargs):
    return [
        Score(
            name=metric,
            score=score_fn(output, kwargs["expected"]),
            metadata={"query": input, "result": output, **kwargs["metadata"]},
        )
        for metric, score_fn in metrics.items()
    ]


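def task(input, reranker):
    # Retrieve the top 25 candidate categories by vector search
    query = table.search(input, query_type="vector").limit(25)

    # Optionally reorder the candidates with the given reranker
    if reranker:
        query = query.rerank(reranker)

    return [item["text"] for item in query.to_list()]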


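# Experiments run in this order, so the outputs below appear in the same order
rerankers = [
    # Fine-tuned reranker
    CohereReranker(model_name=f"{finetune_request.finetuned_model.id}-ft"),
    # Off-the-shelf Cohere reranker
    CohereReranker(model_name="rerank-english-v3.0"),
    # No reranking (vector search only)
    None,
]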

for reranker in rerankers:
    await Eval(
        "fine-tuning",  # Braintrust project name
        data=eval_data,
        # Retrieve, and optionally rerank, the categories for each query
        task=lambda query: task(query, reranker),
        scores=[evaluate_braintrust],
    )

Experiment fine-tuning-1730888663 is running at https://www.braintrust.dev/app/567/p/fine-tuning/experiments/fine-tuning-1730888663
fine-tuning (data): 66it [00:00, 40968.49it/s]


fine-tuning (tasks):   0%|          | 0/66 [00:00<?, ?it/s]


fine-tuning-1730888663 compared to fine-tuning-1730888635:
72.73% 'mrr@1'    score
81.31% (+31.57%) 'mrr@3'    score	(32 improvements, 3 regressions)
81.62% (+29.14%) 'mrr@5'    score	(33 improvements, 3 regressions)
72.73% 'recall@1' score
90.91% (+28.79%) 'recall@3' score	(19 improvements, 0 regressions)
92.42% (+18.18%) 'recall@5' score	(12 improvements, 0 regressions)

3.89s (+79.16%) 'duration'	(15 improvements, 51 regressions)

See results for fine-tuning-1730888663 at https://www.braintrust.dev/app/567/p/fine-tuning/experiments/fine-tuning-1730888663


Experiment fine-tuning-1730888676 is running at https://www.braintrust.dev/app/567/p/fine-tuning/experiments/fine-tuning-1730888676
fine-tuning (data): 66it [00:00, 25122.43it/s]


fine-tuning (tasks):   0%|          | 0/66 [00:00<?, ?it/s]


fine-tuning-1730888676 compared to fine-tuning-1730888663:
22.73% (-50.00%) 'mrr@1'    score	(2 improvements, 35 regressions)
25.76% (-55.56%) 'mrr@3'    score	(3 improvements, 43 regressions)
27.20% (-54.42%) 'mrr@5'    score	(3 improvements, 44 regressions)
22.73% (-50.00%) 'recall@1' score	(2 improvements, 35 regressions)
30.30% (-60.61%) 'recall@3' score	(0 improvements, 40 regressions)
36.36% (-56.06%) 'recall@5' score	(0 improvements, 37 regressions)

4.38s (+49.30%) 'duration'	(7 improvements, 59 regressions)

See results for fine-tuning-1730888676 at https://www.braintrust.dev/app/567/p/fine-tuning/experiments/fine-tuning-1730888676


Experiment fine-tuning-1730888689 is running at https://www.braintrust.dev/app/567/p/fine-tuning/experiments/fine-tuning-1730888689
fine-tuning (data): 66it [00:00, 24096.80it/s]


fine-tuning (tasks):   0%|          | 0/66 [00:00<?, ?it/s]


fine-tuning-1730888689 compared to fine-tuning-1730888676:
40.91% (+18.18%) 'mrr@1'    score	(19 improvements, 7 regressions)
49.75% (+23.99%) 'mrr@3'    score	(28 improvements, 8 regressions)
52.47% (+25.28%) 'mrr@5'    score	(33 improvements, 8 regressions)
40.91% (+18.18%) 'recall@1' score	(19 improvements, 7 regressions)
62.12% (+31.82%) 'recall@3' score	(25 improvements, 4 regressions)
74.24% (+37.88%) 'recall@5' score	(26 improvements, 1 regressions)

2.58s (-180.16%) 'duration'	(64 improvements, 2 regressions)

See results for fine-tuning-1730888689 at https://www.braintrust.dev/app/567/p/fine-tuning/experiments/fine-tuning-1730888689
