In [2]:
%load_ext autoreload
%autoreload 2

# Week 2 : Systematically Improving Your RAG Application

# Why Fine-tune Embeddings?

> If you have not already, please make sure that you've gone through Week 1's notebooks. That'll help to contextualise this week's notebook.

Fine-tuning allows a model to learn the unique nuances and patterns in your data. This lets us use smaller specialised models that can provide better performance than a general-purpose model at a significantly reduced cost.

With just 100 examples, we were able to beat proprietary models like OpenAI's `text-embedding-3-small` on the Quora dataset, achieving greater accuracy at a lower price point with open source embedding models such as `bge-base-en-v1.5` (see [Fine Tuning Embeddings With Modal](https://modal.com/blog/fine-tuning-embeddings)).

For most production applications, that's less than **1-2 weeks of data** to get a massive improvement in accuracy. 

[Ramp](https://engineering.ramp.com/transaction-embeddings) fine-tuned an embedding model on transaction data to automatically suggest expense categories. Even though each customer's expense categories were unique, their model accurately generalized to new customers. This showcases how fine-tuning can adapt models to specific, real-world tasks.

## What you'll achieve in this notebook

We're going to replicate Ramp's process using synthetic financial transaction data that we'll generate from scratch. This will give you a step-by-step guide to fine-tuning an embedding model for your own applications, whether it's through a proprietary model provider like Cohere or an open source embedding model.

We have defined 25 categories ahead of time that we'll use to classify our synthetic transactions. Each transaction should map to a single category. We want to measure the ability of a model to identify the correct category for each transaction using just embeddings.
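
To make "using just embeddings" concrete, here is a minimal sketch of ranking categories by cosine similarity to a transaction embedding. The three-dimensional vectors and the category names are made up for illustration; a real embedding model produces much larger vectors, and the helper names `cosine_similarity` and `rank_categories` are ours rather than part of any library used in this notebook.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_categories(query: list[float], category_vectors: dict[str, list[float]]) -> list[str]:
    """Return category names sorted from most to least similar to the query."""
    return sorted(
        category_vectors,
        key=lambda name: cosine_similarity(query, category_vectors[name]),
        reverse=True,
    )


# Hypothetical toy embeddings, not produced by any real model
toy_category_vectors = {
    "Meals": [0.9, 0.1, 0.0],
    "Travel": [0.1, 0.9, 0.1],
    "Software": [0.0, 0.1, 0.9],
}
toy_transaction_vector = [0.8, 0.3, 0.1]

ranking = rank_categories(toy_transaction_vector, toy_category_vectors)
print(ranking)
```

The category at the front of `ranking` is the model's top suggestion; the metrics introduced later measure how often the correct category appears near the front.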

This will be done in 3 main steps

1. **Understand Transaction Dataset** : We'll learn about the type of data used by Ramp to fine-tune their embedding model and how we can replicate that with synthetic data using `instructor`.

2. **Iterate on Synthetic Data** : We'll then iteratively build a large dataset of transactions by generating examples with a language model and manually selecting the best ones using a Streamlit application. For each new batch of transactions, we'll verify that these are challenging examples by measuring the recall of a baseline model on them.

3. **Evaluate and Fine-Tune Models** : Once we've generated enough transactions, we'll split our data into a train and evaluation set that we can use to measure the performance improvements from fine-tuning. We'll do so with both Cohere's re-ranker models and an open source embedding model.

Throughout this process, we'll use [`braintrust`](https://braintrust.dev) to collect data and measure improvements. Braintrust makes it easy to collaborate with a team and simplifies data collection and evaluation.



## Understanding Transactions

To fine-tune our model effectively, we need to understand the transaction data we're working with.

Typical Transaction Fields:

- Merchant Name: The vendor or service provider's name.
- Merchant Category Code (MCC): General category of the transaction (e.g., Restaurants).
- Department Name: The company department responsible for the transaction.
- Location: Where the transaction took place.
- Amount: The transaction's monetary value.
- Spend Program Name: Specific budget or spend limit allocated.
- Trip Name: If the transaction occurred during travel.

We can see an example below

```
Name : Beirut Bakery
Category: Restaurants, Cafeteria
Department: Engineering
Location: Calgary, AB, Canada
Amount: 56.67 CAD
Card: Ramp's Physical Card
Trip Name: unknown
```

This is a difficult task because there's very little information. 

Additionally, since each company has unique categories with implicit rules, it's difficult for a general embedding model to classify these transactions without fine-tuning.

# Generating Synthetic Transactions

Since we don't have a dataset of transactions to work with, we'll need to generate our own. We use `instructor` here because it makes it easy to switch between different models and iterate on the prompt with Jinja templating. We're using `gpt-4o-mini` because it's cheap to use, making it perfect for practising labelling and generating examples.

We'll be iteratively generating this dataset in 3 steps

1. **Data Generation** : We'll first generate an initial batch of synthetic transactions using a simple prompt. We'll then use a Streamlit application, which you can run with `streamlit run label.py`, to manually review and select the best examples. We recommend using `ChatGPT` or `Claude` to reason about and evaluate these examples so that you get a sense of what makes a good or bad example during this process.

2. **Data Refinement** : We'll then use these initial examples to generate new examples that are more challenging by including a random subset of these examples in the prompt. The important step here is to use these examples as a way to find characteristics and patterns of good and bad examples that we can then add to our prompt as rules or few shot examples to guide the model to generate better examples.

3. **Braintrust Evaluation**: We'll ingest the categories into `lancedb` and use `braintrust` to log the recall@1,3,5 and mrr@1,3,5 of our initial and refined examples. We use `lancedb` because it provides a single API for re-rankers, vector search and batching of embeddings. This makes it easy to experiment with different retrieval configurations (e.g. vector search, re-rankers).

We'll repeat this process until we've generated at least 300 examples. This ensures that we have enough examples to fine-tune a Cohere re-ranker (which requires a minimum of 256 examples) or a sentence transformer model, while also having enough examples to create a held-out evaluation set.
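
As a quick sanity check on dataset size, the sketch below works out how many examples are needed in total so that the training portion alone reaches a minimum count. The helper names `train_size` and `min_total_examples` are ours, the 256 minimum is the figure quoted above, and the 80% train share is an assumption that matches the split used later in this notebook.

```python
def train_size(total: int, train_percent: int) -> int:
    """Number of training rows when the first train_percent% of the data is used for training."""
    return total * train_percent // 100


def min_total_examples(min_train: int, train_percent: int) -> int:
    """Smallest dataset size whose training portion has at least min_train rows."""
    if not 0 < train_percent <= 100:
        raise ValueError("train_percent must be between 1 and 100")
    total = min_train
    while train_size(total, train_percent) < min_train:
        total += 1
    return total


print(train_size(300, 80))          # training rows available from 300 examples
print(min_total_examples(256, 80))  # total examples needed for 256 training rows
```

Under these assumptions, an 80/20 split of exactly 300 examples leaves 240 for training, so you would either generate a few more examples or hold out a smaller share if the 256 minimum has to be met by the training split alone.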

We want to iterate on our synthetic data for two main reasons.

1. By generating a small amount of data, it becomes practical to manually label examples and slowly build up a high quality dataset of examples.  We can ensure that each example is challenging by calculating recall and mrr at different subsets of the retrieved data using braintrust as we iteratively generate better examples.

2. By repeatedly drawing a random sample from an ever-growing pool of examples, we introduce randomness into our prompt, which helps create a diverse dataset. This helps us avoid the diversity and quality issues that a single pass of data generation can introduce.
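
The sampling idea in the second point can be sketched as follows. This is a minimal illustration with placeholder strings standing in for approved transactions; the helper name `sample_few_shot` is ours. It caps the sample size because `random.sample` raises a `ValueError` when asked for more items than the pool contains, which can happen in early rounds when only a handful of examples have been approved.

```python
import random


def sample_few_shot(pool: list, k: int, rng: random.Random) -> list:
    """Draw up to k distinct examples from the pool without replacement."""
    # Cap k so that small pools from early labelling rounds do not raise ValueError
    return rng.sample(pool, min(k, len(pool)))


rng = random.Random(0)  # seeded only so that the sketch is repeatable
approved = [f"example-{i}" for i in range(4)]  # placeholder for approved transactions

first_prompt_examples = sample_few_shot(approved, 10, rng)

# The pool grows after another labelling round, so later prompts see a different mix
approved += [f"example-{i}" for i in range(4, 30)]
second_prompt_examples = sample_few_shot(approved, 10, rng)

print(len(first_prompt_examples), len(second_prompt_examples))
```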


## Step 1 : Generating our initial transactions.

We'll start by generating our initial transactions using a simple prompt. These are going to be very simple examples that will not be great but they are useful as an initial starting point.


In [1]:
from pydantic import BaseModel, field_validator, ValidationInfo
from openai import AsyncOpenAI
import instructor
from typing import Optional
from textwrap import dedent
import random
import json
import asyncio

# Load in pre-defined categories
categories = json.load(open("data/categories.json"))

# Define a Pydantic model that can represent the same transaction data that Ramp was using
class Transaction(BaseModel):
    merchant_name: str
    merchant_category: list[str]
    department: str
    location: str
    amount: float
    spend_program_name: str
    trip_name: Optional[str] = None
    expense_category: str

    def generate_transaction(self):
        return dedent(f"""
        Name : {self.merchant_name}
        Category: {", ".join(self.merchant_category)}
        Department: {self.department}
        Location: {self.location}
        Amount: {self.amount}
        Card: {self.spend_program_name}
        Trip Name: {self.trip_name if self.trip_name else "unknown"}
        """)

    @field_validator("expense_category")
    @classmethod
    def validate_expense_category(cls, v, info: ValidationInfo):
        if not info.context or not info.context["category"]:
            return v
        return info.context["category"]["category"]


client = instructor.from_openai(AsyncOpenAI())

async def generate_transaction(category):
    return await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": """Generate a transaction for a tech company that could be filed under the category of {{ category }}. This should be distinct from the sample_transactions provided in the categories.json file

                - The spend program is a specific spending authority or allocation that has defined limits, rules, and permissions. It's like a virtual card or spending account set up for a specific purpose.
                - Merchant Category Name is a label that best describes the merchant of the transaction.
                - Merchant name should be realistic and not obviously made up.
                """,
            }
        ],
        context={"category": category},
        response_model=Transaction,
    )


# Generate 5 initial transactions and choose the category randomly
coros = []
for _ in range(5):
    coros.append(generate_transaction(random.choice(categories)))

transactions = await asyncio.gather(*coros)
with open("./data/generated_transactions.jsonl", "a") as f:
    for transaction in transactions:
        f.write(transaction.model_dump_json() + "\n")

## Step 2 : Labeling Transactions

Now that we've generated a small set of initial transactions, please run `streamlit run label.py` to manually select transactions that you think might be difficult to classify. 

> You can modify and edit the transaction details before approving them. The hotkeys Ctrl+E (approve) and Ctrl+R (reject) make this process much faster. Only approved transactions will be saved to `generated_transactions.jsonl` below. We'll then use these examples to generate a new set of transactions that are more challenging.

You can also manually override transaction details in the Streamlit application. We recommend using `ChatGPT` or `Claude` to discuss and generate good defaults and examples. A prompt that I used in the chat UI was:

```
I'd like to generate a transaction for a tech company that is challenging to classify into a specific category. Here are the details

<Details go here>

I'd like you to help rewrite some of the details to make it more realistic. Please stick to the following rules

- MCCs should be realistic. If possible, let's try to use a MCC that will cover a superset of the given category
- Let's try to suggest a non-uniform number (Eg. not 1500 ) so that it seems more realistic
- The Spend Program name should be a specific spending authority or allocation that has defined limits, rules, and permissions. It's like a virtual card or spending account set up for a specific purpose. In our case, this spend program name should not be a name that directly mentions the category or merchant
```



In [234]:
async def generate_transaction_with_examples(category, examples: list[Transaction]):
    return await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """
                Generate a potentially ambiguous business transaction that could reasonably be categorized as {{ category }} or another similar category. The goal is to create transactions that challenge automatic categorization systems by having characteristics that could fit multiple categories.


                Available categories in the system:
                <categories>
                {% for category_option in categories %}
                    {{ category_option["category"] }}
                {% endfor %}
                </categories>

                
                The transaction should:
                1. Use a realistic but non-obvious merchant name (international names welcome), don't use names that are obviously made up
                2. Include a plausible but non-rounded amount with decimals (e.g., $1247.83)
                3. Be difficult to categorize definitively (could fit in multiple categories)
                4. Merchant Category Name(s) should not reference the category at all and should be able to be used for other similar categories if possible.

                Here are some good examples of transactions that were previously generated for other categories.

                {% for example in examples %}
                {{ example.model_dump_json() }}
                {% endfor %}
                """,
            }
        ],
        context={"category": category, "examples": examples, "categories": categories},
        response_model=Transaction,
    )

In [240]:
with open("./data/cleaned.jsonl", "r") as f:
    sample_transactions = []
    for line in f:
        sample_transactions.append(Transaction(**json.loads(line)))


coros = []
for _ in range(20):
    coros.append(generate_transaction_with_examples(random.choice(categories), random.sample(sample_transactions, 10)))

transactions = await asyncio.gather(*coros)

with open("./data/generated_transactions.jsonl", "w") as f:
    for transaction in transactions:
        f.write(transaction.model_dump_json() + "\n")

## Step 3 : Evaluating Recall and MRR Performance.

Remember that we're building a model that can suggest transaction categories to a user. To do so, we'll only be able to show the top 3-5 results and we want to make sure that the correct result is ranked as highly as possible.

Therefore, we'll be using recall and mrr to evaluate our model's performance here.

- `recall` : This measures whether the correct category is in the top k retrieved results.
- `mrr` : This measures how highly the correct category is ranked in the retrieved results.
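
A small worked example may help here. The sketch below computes both metrics for one hypothetical ranking; the function names `recall_at_k` and `mrr_at_k` and the category names are ours for illustration, and the logic mirrors the definitions above rather than any particular library.

```python
def recall_at_k(predictions: list[str], expected: list[str], k: int) -> float:
    """Fraction of expected labels that appear in the top k predictions."""
    top_k = predictions[:k]
    return sum(label in top_k for label in expected) / len(expected)


def mrr_at_k(predictions: list[str], expected: list[str], k: int) -> float:
    """Reciprocal rank of the first expected label within the top k, or 0 if none appears."""
    for rank, label in enumerate(predictions[:k], start=1):
        if label in expected:
            return 1 / rank
    return 0.0


# Hypothetical ranking returned for a single transaction
ranked = ["Travel", "Meals", "Software", "Office Supplies"]
truth = ["Meals"]

for k in (1, 3):
    print(k, recall_at_k(ranked, truth, k), mrr_at_k(ranked, truth, k))
# prints "1 0.0 0.0" then "3 1.0 0.5"
```

Because each transaction has exactly one correct category, recall@1 and mrr@1 always coincide: both are 1 when the top result is correct and 0 otherwise.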

Ideally, we want a model with high recall and mrr. This means that when we display the results, they're likely to be relevant to the user. Measuring recall and mrr on each new batch also lets us check that we're consistently generating examples that the baseline model finds challenging.

We're using `lancedb` here since it provides an easy way to perform these evaluations with automatic batching of embeddings for queries and data along with a single api for vector search and reranking.

In [167]:
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

func = get_registry().get("openai").create(name="text-embedding-3-small")
categories = json.load(open("data/categories.json"))

class Category(LanceModel):
    text: str = func.SourceField()
    embedding: Vector(func.ndims()) = func.VectorField()


db = lancedb.connect("./lancedb")
table = db.create_table("categories", schema=Category, mode="overwrite")


table.add(
    [
        {
            "text": category["category"],
        }
        for category in categories
    ]
)

table.create_fts_index(field_names=["text"], replace=True)

In [242]:
from braintrust import Eval, Score
import itertools

transactions = []
for line in open("./data/cleaned.jsonl").readlines():
    transactions.append(Transaction(**json.loads(line)))

len(transactions)

def calculate_mrr(predictions: list[str], gt: list[str]):
    mrr = 0
    for label in gt:
        if label in predictions:
            mrr = max(mrr, 1 / (predictions.index(label) + 1))
    return mrr


def get_recall(predictions: list[str], gt: list[str]):
    return len([label for label in gt if label in predictions]) / len(gt)


eval_metrics = [["mrr", calculate_mrr], ["recall", get_recall]]
sizes = [1,3,5]

metrics = {
    # Bind metric_fn and size as default arguments so each lambda keeps its own values
    f"{metric_name}@{size}": lambda predictions, gt, m=metric_fn, s=size: m(
        predictions[:s], gt
    )
    for (metric_name, metric_fn), size in itertools.product(eval_metrics, sizes)
}


def evaluate_braintrust(input, output, **kwargs):
    return [
        Score(
            name=metric,
            score=score_fn(output, kwargs["expected"]),
            metadata={"query": input, "result": output, **kwargs["metadata"]},
        )
        for metric, score_fn in metrics.items()
    ]


def task(user_query):
    return [
        item["text"]
        for item in table.search(user_query, query_type="vector")
        .limit(25)
        .to_list()
    ]


await Eval(
    "fine-tuning",  # Replace with your project name
    data=lambda: [
        {
            "input": transaction.generate_transaction(),
            "expected": [transaction.expense_category],
        }
        for transaction in transactions
    ],  # Replace with your eval dataset
    task=task,  # Replace with your LLM call
    scores=[evaluate_braintrust],
)

Experiment fine-tuning-1730883986 is running at https://www.braintrust.dev/app/567/p/fine-tuning/experiments/fine-tuning-1730883986
fine-tuning (data): 326it [00:00, 67959.40it/s]


fine-tuning (tasks):   0%|          | 0/326 [00:00<?, ?it/s]


fine-tuning-1730883986 compared to fine-tuning-1730883924:
38.34% (-) 'mrr@1'    score	(0 improvements, 0 regressions)
49.28% (-) 'mrr@3'    score	(0 improvements, 0 regressions)
52.14% (-) 'mrr@5'    score	(0 improvements, 0 regressions)
38.34% (-) 'recall@1' score	(0 improvements, 0 regressions)
63.80% (-) 'recall@3' score	(0 improvements, 0 regressions)
76.38% (-) 'recall@5' score	(0 improvements, 0 regressions)

8.22s (-54.64%) 'duration'	(190 improvements, 56 regressions)

See results for fine-tuning-1730883986 at https://www.braintrust.dev/app/567/p/fine-tuning/experiments/fine-tuning-1730883986


EvalResultWithSummary(summary="...", results=[...])

At this point, we've generated a large dataset of synthetic transactions that we can use to fine-tune a model on. However, it's important here to call out that synthetic data has its challenges.

1. Quality and Diversity : It's difficult to ensure that synthetic data is diverse and of high quality. We've addressed this by manually reviewing and selecting good examples, but ultimately we need real production data to ensure that our model is able to generalise.

2. Human Error : Manual review is great to ensure the quality of transactions but is expensive and error prone. This is not something that scales well, especially if you're trying to generate thousands of examples which you'd like humans to manually label.

We want to treat this synthetic data as a starting point and iteratively make it better using the techniques we've discussed in this notebook. But you will need to eventually mix in production data and continue generating synthetic data in order to adequately evaluate and test the generalisation capabilities of your model.



# Creating a Dataset

Next, we'll turn these transactions into a dataset with separate train and evaluation portions.

We want to split our data into a train and evaluation set because this lets us evaluate the performance of our model on data that it hasn't seen before. We use `braintrust` here to upload our dataset, with a simple metadata flag to distinguish the train and evaluation portions. This allows us to easily run evaluations on our model in subsequent notebooks.

If we fine-tuned our model on the same data that we evaluated it on, it would be difficult to tell if the improvements we made were due to the model generalizing better or due to overfitting. In this case, we're just going to split our data by randomly shuffling it and then selecting the first 80% as our training set and the remaining 20% as our evaluation set.
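
One caveat with a plain `random.shuffle` is that the split changes on every run. The sketch below shows a seeded variant of the same 80/20 split so that the train and evaluation sets are reproducible; the helper name `split_train_eval` and the seed value are our own choices, and integers stand in for transactions.

```python
import random


def split_train_eval(items: list, train_ratio: float = 0.8, seed: int = 42) -> tuple[list, list]:
    """Shuffle a copy of items with a fixed seed and split it into train and eval lists."""
    shuffled = list(items)  # copy so the caller's list keeps its original order
    random.Random(seed).shuffle(shuffled)
    cutoff = int(train_ratio * len(shuffled))
    return shuffled[:cutoff], shuffled[cutoff:]


# 326 placeholder items, matching the number of transactions evaluated earlier
train, held_out = split_train_eval(list(range(326)))
print(len(train), len(held_out))
```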

In practice, you'd want to think carefully about these splits, for example by using the category to ensure that each split has a diverse set of examples, or by generating new labels for the evaluation set based on the training labels (e.g. Restaurants -> Dining Establishments, or randomly grouping categories together).
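
As one possible way to use the category when splitting, here is a minimal sketch of a stratified split, in which every category contributes roughly the same share of its rows to each side. This is not what the notebook does below; the helper name `stratified_split` and the toy rows are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows: list[dict], label_key: str, train_ratio: float = 0.8, seed: int = 0):
    """Split rows so each label contributes roughly train_ratio of its rows to the train set."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)
    train, held_out = [], []
    for label in sorted(by_label):  # sorted for a deterministic iteration order
        group = by_label[label]
        rng.shuffle(group)
        cutoff = int(train_ratio * len(group))
        train.extend(group[:cutoff])
        held_out.extend(group[cutoff:])
    return train, held_out


# Hypothetical rows: 10 transactions for each of 3 made-up categories
rows = [
    {"id": i, "expense_category": category}
    for i, category in enumerate(["Meals", "Travel", "Software"] * 10)
]
train, held_out = stratified_split(rows, "expense_category")
print(len(train), len(held_out))
```

With 25 categories and a few hundred examples, some categories may have very few rows, so it is worth checking that every category still appears in the evaluation set after splitting.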

Before we start fine-tuning our models here, we also need to make sure that the evaluation set and training set are similar. We do so by measuring the recall and mrr and verifying that they have similar values.

In [245]:
from braintrust import init_dataset

train_ratio = 0.8 * len(transactions)

random.shuffle(transactions)

train_transactions = transactions[:int(train_ratio)]
eval_transactions = transactions[int(train_ratio):]

dataset = init_dataset(project="fine-tuning", name="Synthetic Transactions")

for transaction in train_transactions:
    dataset.insert(
        input=transaction.generate_transaction(),
        expected=[transaction.expense_category],
        metadata={"split": "train"},
    )

for transaction in eval_transactions:
    dataset.insert(
        input=transaction.generate_transaction(),
        expected=[transaction.expense_category],
        metadata={"split": "eval"},
    )

print(dataset.summarize())

Now let's see if we can get a baseline performance for each individual split

In [246]:
def get_dataset_split(split:str, dataset):
    return [
        {
            "input": transaction['input'],
            "expected": transaction['expected'],
        }
        for transaction in dataset
        if transaction["metadata"]["split"] == split
    ]

train_data = get_dataset_split("train", dataset)
eval_data = get_dataset_split("eval", dataset)
len(train_data), len(eval_data)


(260, 66)

In [247]:
await Eval(
    "fine-tuning",
    experiment_name="synthetic-transactions-train",
    data=lambda: train_data,
    task=task,  # Replace with your LLM call
    scores=[evaluate_braintrust],
)

Experiment synthetic-transactions-train-04a0fc30 is running at https://www.braintrust.dev/app/567/p/fine-tuning/experiments/synthetic-transactions-train-04a0fc30
fine-tuning [experiment_name=synthetic-transactions-train] (data): 260it [00:00, 53451.58it/s]


fine-tuning [experiment_name=synthetic-transactions-train] (tasks):   0%|          | 0/260 [00:00<?, ?it/s]


synthetic-transactions-train-04a0fc30 compared to fine-tuning-1730883986:
37.69% (-) 'mrr@1'    score	(0 improvements, 0 regressions)
49.17% (-) 'mrr@3'    score	(0 improvements, 0 regressions)
52.05% (-) 'mrr@5'    score	(0 improvements, 0 regressions)
37.69% (-) 'recall@1' score	(0 improvements, 0 regressions)
64.23% (-) 'recall@3' score	(0 improvements, 0 regressions)
76.92% (-) 'recall@5' score	(0 improvements, 0 regressions)

6.65s (-125.99%) 'duration'	(145 improvements, 115 regressions)

See results for synthetic-transactions-train-04a0fc30 at https://www.braintrust.dev/app/567/p/fine-tuning/experiments/synthetic-transactions-train-04a0fc30


EvalResultWithSummary(summary="...", results=[...])

In [248]:
await Eval(
    "fine-tuning",
    experiment_name="synthetic-transactions-train",
    data=lambda: eval_data,
    task=task,  # Replace with your LLM call
    scores=[evaluate_braintrust],
)

Experiment synthetic-transactions-train-cb6b07e2 is running at https://www.braintrust.dev/app/567/p/fine-tuning/experiments/synthetic-transactions-train-cb6b07e2
fine-tuning [experiment_name=synthetic-transactions-train] (data): 66it [00:00, 26946.76it/s]


fine-tuning [experiment_name=synthetic-transactions-train] (tasks):   0%|          | 0/66 [00:00<?, ?it/s]


synthetic-transactions-train-cb6b07e2 compared to synthetic-transactions-train-04a0fc30:
40.91% 'mrr@1'    score
49.75% 'mrr@3'    score
52.47% 'mrr@5'    score
40.91% 'recall@1' score
62.12% 'recall@3' score
74.24% 'recall@5' score

1.67s duration

See results for synthetic-transactions-train-cb6b07e2 at https://www.braintrust.dev/app/567/p/fine-tuning/experiments/synthetic-transactions-train-cb6b07e2


EvalResultWithSummary(summary="...", results=[...])

# Results and Analysis

We can see that the performance of the text-embedding-3-small model is similar between the training and evaluation set. 

| Metric | Training Set (n=260) | Evaluation Set (n=66) | Difference (train - eval) |
|--------|---------------------|----------------------|------------|
| Recall@1 | 37.69% | 40.91% | -3.22 pts |
| Recall@3 | 64.23% | 62.12% | +2.11 pts |
| Recall@5 | 76.92% | 74.24% | +2.68 pts |
| MRR@1 | 37.69% | 40.91% | -3.22 pts |
| MRR@3 | 49.17% | 49.75% | -0.58 pts |
| MRR@5 | 52.05% | 52.47% | -0.42 pts |


In our initial experiment, we implemented a straightforward evaluation approach by randomly shuffling the dataset and creating an 80-20 split, resulting in 260 training examples and 66 evaluation examples. The metrics above behave consistently across the training and evaluation sets, with every difference within a few percentage points. This consistency suggests that both sets are drawn from the same underlying distribution, providing a solid foundation for model training and evaluation.
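
One way to judge whether gaps of this size matter is to compare them with the sampling noise expected from a 66-example evaluation set. The sketch below uses the binomial standard error of a proportion, which applies to recall@1 here because each example scores either 0 or 1. This is our own back-of-the-envelope check rather than something the Braintrust run reports; the two recall@1 values are copied from the experiment output above.

```python
import math


def proportion_standard_error(p: float, n: int) -> float:
    """Binomial standard error of an observed proportion p measured on n examples."""
    return math.sqrt(p * (1 - p) / n)


train_recall_at_1 = 0.3769  # from the train experiment output (n=260)
eval_recall_at_1 = 0.4091   # from the eval experiment output (n=66)

gap = abs(train_recall_at_1 - eval_recall_at_1)
eval_se = proportion_standard_error(eval_recall_at_1, 66)

print(f"gap={gap:.3f}, eval standard error={eval_se:.3f}")
```

The standard error on the evaluation set comes out at roughly 0.06, which is larger than the observed gap, so a difference of about three points between the splits is well within what random sampling alone would produce.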

While these results are promising for our proof-of-concept, it's important to note that a production implementation would benefit from more sophisticated validation strategies. 

This could include

1. Generating synonymous labels to test if the model is able to generalise to other similar labels
2. Implementing multiple validation sets with different characteristics
3. Carefully managing data composition to prevent leakage
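
As an illustration of the third point, the sketch below checks for the simplest form of leakage: evaluation inputs that also appear in the training set once case and whitespace differences are ignored. The helper names and the sample rows are hypothetical, and real leakage checks would usually go further, for example by flagging near-duplicates from the same merchant.

```python
def normalise(text: str) -> str:
    """Collapse whitespace and case so trivially different copies compare equal."""
    return " ".join(text.lower().split())


def find_leaked_inputs(train_rows: list[dict], eval_rows: list[dict]) -> list[str]:
    """Return eval inputs whose normalised text also appears in the train rows."""
    seen = {normalise(row["input"]) for row in train_rows}
    return [row["input"] for row in eval_rows if normalise(row["input"]) in seen]


# Hypothetical rows in the same {"input", "expected"} shape used for the Braintrust dataset
train_rows = [{"input": "Name : Beirut Bakery\nAmount: 56.67", "expected": ["Meals"]}]
eval_rows = [
    {"input": "name : beirut bakery   amount: 56.67", "expected": ["Meals"]},
    {"input": "Name : Acme Cloud\nAmount: 1247.83", "expected": ["Software"]},
]

leaked = find_leaked_inputs(train_rows, eval_rows)
print(len(leaked))
```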

These enhancements are out of scope for this notebook, and we'll leave them for a production implementation. In the meantime, let's move on to fine-tuning Cohere's re-ranker models.

