# Training Embedding Model

We'll be using the `MultipleNegativesRankingLoss` loss function, so we need to adapt our training data to the format it expects: (anchor, positive, negative) triplets.

This notebook contains the logic to do the training.

**If you want to generate the training data in the right format, go to the "Training Data" notebook.**

### Background: Existing Training Data File Format

Each row in the existing JSONL training file includes a source model that we want to harmonize to a target model. The ground-truth harmonization is in `expected_mappings.tsv`.

The JSONL file has 3 columns: `input_source_model`, `input_target_model`, `harmonized_mapping`

Those 3 columns are effectively populated by the contents of these files:

- `source_model.json` == `input_source_model`
- `expected_mappings.tsv` == `harmonized_mapping`
- `target_model.json` == `input_target_model`

### How we converted existing training JSONL file to the format expected for embedding model training

For each ground truth mapping in `harmonized_mapping`:

- Extrapolate the "negatives" (wrong choices) by providing the same source variable but with every target variable _except_ the correct, "positive" one, in the harmonized mapping
- Output a new CSV file with 3 columns: `anchor`, `positive`, `negative`

Where:

- `anchor` == the source variable from the harmonized mapping
- `positive` == the target variable from the harmonized mapping
- `negative` == every other target variable except the right one
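
The conversion above can be sketched roughly as follows. This is a minimal illustration, not the code from the "Training Data" notebook: the function name `build_triplet_rows`, the in-memory shape of the mapping (a list of `(source, target)` tuples), and the example variable names are all assumptions. The negatives are stored as a JSON string under a `negatives` key, matching how the later cells in this notebook describe that column.

```python
import json


def build_triplet_rows(mappings, target_variables):
    """Build one row per ground-truth mapping.

    mappings: list of (source_variable, target_variable) pairs (assumed shape).
    target_variables: every variable available in the target model.
    """
    rows = []
    for source_var, target_var in mappings:
        # Every target variable other than the correct one is a negative.
        negatives = [t for t in target_variables if t != target_var]
        rows.append(
            {
                "anchor": source_var,
                "positive": target_var,
                # Stored as a JSON string so the list fits in one CSV cell.
                "negatives": json.dumps(negatives),
            }
        )
    return rows


# Hypothetical variable names, for illustration only.
rows = build_triplet_rows(
    mappings=[("pt_age", "age_at_enrollment")],
    target_variables=["age_at_enrollment", "sex", "body_mass_index"],
)
print(rows[0])
```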

> tl;dr of the NOTE below: to keep things simple at first, we are NOT going to provide explicit negatives; the other examples in each batch will serve as negatives instead. The training dataset will still include them for potential future use.

> NOTE: negatives were originally expanded in a denormalized way, so that a given anchor/positive pair ended up with n training examples, where n is the number of unique negatives.
>
> It's not clear whether that results in better outcomes than a single example per anchor/positive pair with a list of negatives. Update: it seems `MultipleNegativesRankingLoss` can't take an actual list, so it appears to expect the expanded version.
>
> It's unclear to me how to support a list without
> fully expanding duplicate anchor/positive rows with every possible negative. And I don't
> think doing so really achieves what I want with the `MultipleNegativesRankingLoss` function.
> It seemingly treats every other positive in the batch as a negative when given only
> anchor/positive pairs, so we can try that.
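
To make the in-batch behaviour concrete, here is a small pure-Python sketch of the idea behind the loss: each anchor is scored against every positive in the batch, and cross-entropy pushes the matching positive above the rest. It is an illustration only, not the library's implementation; the toy vectors, the `scale` value, and the helper names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_negatives_loss(anchor_vecs, positive_vecs, scale=20.0):
    """Mean cross-entropy where anchor i's correct 'class' is positive i."""
    losses = []
    for i, anchor in enumerate(anchor_vecs):
        # Score this anchor against every positive in the batch.
        scores = [scale * cosine(anchor, p) for p in positive_vecs]
        # Negative log-softmax of the matching positive; the other
        # positives in the batch act as negatives.
        log_denominator = math.log(sum(math.exp(s) for s in scores))
        losses.append(log_denominator - scores[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
well_aligned = [[0.9, 0.1], [0.1, 0.9]]  # positive i is close to anchor i
swapped = [[0.1, 0.9], [0.9, 0.1]]       # positive i is close to the wrong anchor

print(in_batch_negatives_loss(anchors, well_aligned))  # close to zero
print(in_batch_negatives_loss(anchors, swapped))       # much larger
```

This also suggests why duplicates within a batch are harmful: a repeated positive would be scored as a negative for its own anchor.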

Change the paths below to wherever you have the existing training data.

In [None]:
training_data_csv_path = (
    "../datasets/embedding_training_data_v0.0.1/embedding_training.csv"
)

#### Use Training Data CSV from above and create splits

In [None]:
import csv
import sys
import os
import random
from typing import Tuple

csv.field_size_limit(sys.maxsize)

def stream_split_csv(
    input_path: str,
    output_dir: str,
    seed: int = 42,
    fractions: Tuple[float, float, float] = (0.80, 0.10, 0.10),
    header: bool = True,
):
    """
    Split a gigantic CSV into train/eval/test without loading it into RAM.

    The function reads *one line at a time*, assigns it to a split using a
    deterministic random choice, and writes the line to the appropriate output
    file.

    Args:
        input_path: Path to the huge CSV
        output_dir: Folder where `train.csv`, `eval.csv` and
            `test.csv` will be written.  Will be created if it does not exist.
        seed: Seed for the random assignment - the split will be deterministic.
        fractions: Desired (train, eval, test) fractions.
            Must sum to 1.0.  Default is (0.80, 0.10, 0.10).
        header: If the input file has a header row, it will be copied to
            every output file.

    Returns:
        None. Three CSV files are written into `output_dir`.
    """
    # Check that the fractions sum to 1.0
    if abs(sum(fractions) - 1.0) > 1e-6:
        raise ValueError("fractions must sum to 1.0")

    # Compute the cumulative split thresholds
    train_threshhold = fractions[0]
    validate_threshhold = fractions[0] + fractions[1]

    # Create the output directory
    os.makedirs(output_dir, exist_ok=True)
    out_paths = {
        "train": os.path.join(output_dir, "train.csv"),
        "eval": os.path.join(output_dir, "eval.csv"),
        "test": os.path.join(output_dir, "test.csv"),
    }

    # Open a CSV writer for each split
    files = {}
    writers = {}
    for key, path in out_paths.items():
        file = open(path, "w", newline="", encoding="utf-8")
        files[key] = file
        writers[key] = csv.writer(file)

    open_kwargs = {"newline": "", "encoding": "utf-8"}
    reader = open(input_path, "r", **open_kwargs)

    random.seed(seed)

    # Counters for a quick sanity check after processing
    n_rows = n_train = n_val = n_test = 0

    with reader as f_in:
        reader_obj = csv.reader(f_in)

        # If the file has a header, pull it out first
        header_row = next(reader_obj) if header else None

        # Write the header to all output files
        if header_row:
            for w in writers.values():
                w.writerow(header_row)

        # Process each line once
        for row in reader_obj:
            n_rows += 1
            r = random.random()
            if r < train_threshhold:
                writers["train"].writerow(row)
                n_train += 1
            elif r < validate_threshhold:
                writers["eval"].writerow(row)
                n_val += 1
            else:
                writers["test"].writerow(row)
                n_test += 1

    # Close all file handles
    for file in files.values():
        file.close()

    # Report the final counts
    print(f"✓ Split complete – {n_rows:,} rows processed")
    print(f"   Train: {n_train:,} rows")
    print(f"   Val:   {n_val:,} rows")
    print(f"   Test:  {n_test:,} rows")

> **WARNING**: The cell below can take a long time, and it _duplicates_ the large input into separate files, so you need roughly double the input's disk space for the whole process to work.

The split also randomizes which rows land in train, eval, and test: each row is assigned by an independent random draw, while the original row order is preserved within each split.
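
The assignment rule used by `stream_split_csv` boils down to comparing one random draw per row against cumulative thresholds. The sketch below isolates that rule; `assign_split` is a hypothetical helper, not part of the notebook's code, and it uses a local `random.Random` instance rather than the global seed.

```python
import random


def assign_split(r, fractions=(0.80, 0.10, 0.10)):
    """Map a draw r in [0, 1) to a split name using cumulative thresholds."""
    train_cutoff = fractions[0]
    eval_cutoff = fractions[0] + fractions[1]
    if r < train_cutoff:
        return "train"
    if r < eval_cutoff:
        return "eval"
    return "test"


rng = random.Random(123)  # a local generator leaves the global state untouched
counts = {"train": 0, "eval": 0, "test": 0}
for _ in range(10_000):
    counts[assign_split(rng.random())] += 1
print(counts)  # roughly 80% / 10% / 10%, not exact
```

Because every row is an independent draw, the resulting split sizes are only approximately 80/10/10.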

In [None]:
stream_split_csv(
    input_path=training_data_csv_path,
    output_dir=os.path.join(os.path.dirname(training_data_csv_path), "splits"),
    seed=123,
    fractions=(0.80, 0.10, 0.10),  # train / val / test
    header=True,
)

## Training

In [None]:
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "3"  # change as necessary

The following is heavily influenced by [Hugging Face's blog article](https://huggingface.co/blog/train-sentence-transformers#trainer).

If using the pre-computed training data, the counts are:

```
✓ Split complete – 812,327 rows processed
   Train: 649,352 rows
   Val:   81,496 rows
   Test:  81,479 rows
```

> NOTE: The .zip from above, when expanded, is ~80GB, and you'll need more disk space for the models, dataset loading, and training. Ensure you have enough headroom on disk.
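
A quick way to check headroom before starting is to ask the operating system how much space is free. This is a minimal sketch using only the standard library; the `required_gb` figure is an assumption (about twice the ~80GB expanded dataset), so adjust it to your own setup.

```python
import shutil


def free_space_gb(path="."):
    """Return free disk space, in GiB, on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 1024**3


required_gb = 160  # assumption: ~2x the expanded dataset; tune as needed
available_gb = free_space_gb(".")
if available_gb < required_gb:
    print(f"Only {available_gb:.0f} GiB free; about {required_gb} GiB suggested.")
else:
    print(f"{available_gb:.0f} GiB free.")
```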

In [None]:
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    SentenceTransformerModelCardData,
)
from sentence_transformers.losses import (
    MultipleNegativesRankingLoss,
    CachedMultipleNegativesRankingLoss,
)
from sentence_transformers.training_args import BatchSamplers
from sentence_transformers.evaluation import TripletEvaluator

import torch
import os

EMBEDDING_MODEL = "google/embeddinggemma-300m"
# EMBEDDING_MODEL = "Qwen/Qwen3-Embedding-0.6B"
# EMBEDDING_MODEL = "BAAI/bge-large-en-v1.5"

# 1. Load a model to finetune with 2. model card data
model = SentenceTransformer(
    EMBEDDING_MODEL,
    model_card_data=SentenceTransformerModelCardData(
        language="en",
        license="apache-2.0",
        model_name="Base trained on appropriate relationships between similar biomedical variables",
    ),
)

# 3. Load a dataset to finetune on
path_to_csv_splits = (
    "../datasets/harmonization_training_Mutated_SDCs_v3_20250423_v0.0.2/_training_data/splits"
)
dataset = load_dataset(
    "csv",
    data_files={
        "train": os.path.join(path_to_csv_splits, "train.csv"),
        "eval": os.path.join(path_to_csv_splits, "eval.csv"),
        "test": os.path.join(path_to_csv_splits, "test.csv"),
    },
)

Subset the full dataset as needed.

In [None]:
train_dataset = dataset["train"].select(range(100_000))
eval_dataset = dataset["eval"].select(range(20_000))
test_dataset = dataset["test"].select(range(20_000))

In [None]:
# Filter out any rows that contain None b/c our loss function expects full triplet of anchor, positive and negative examples.
# These shouldn't exist, but in case they do...
def has_no_nones(example):
    return all(example[key] is not None for key in example)

train_dataset = train_dataset.filter(has_no_nones)
eval_dataset = eval_dataset.filter(has_no_nones)
test_dataset = test_dataset.filter(has_no_nones)

# For now, let's remove the negatives. It's unclear to me how to support a list without
# fully expanding duplicate anchor/positive rows with every possible negative.
# Note that the negatives column right now is a JSON-ified array string
#
# Even if I could figure out how to supply these,
# I don't think it really achieves what I want with the MultipleNegativesRankingLoss function.
# It seemingly treats every other option as negative if just providing anchor/positive,
# so let's try that.
try:
    train_dataset = train_dataset.remove_columns("negatives")
    eval_dataset = eval_dataset.remove_columns("negatives")
    test_dataset = test_dataset.remove_columns("negatives")
except ValueError:
    # might already be removed
    pass

print(len(train_dataset), len(eval_dataset), len(test_dataset))

In [None]:
# 4. Define a loss function
loss = MultipleNegativesRankingLoss(model)

# TODO: was running into device issues on apple silicon when trying CachedMultipleNegativesRankingLoss
#       to save on memory usage. Giving up for now.
# # Pick a device you actually have
# if torch.cuda.is_available():
#     device = torch.device("cuda")
# elif torch.backends.mps.is_available():
#     # Apple‑silicon
#     device = torch.device("mps")
# else:
#     device = torch.device("cpu")
# print(f"device: {device}")
# # Build / load your model
# model.to(device)
# loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=64)

# 5. (Optional) Specify training arguments
args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir=f"models/{EMBEDDING_MODEL.split('/')[-1]}-bio-mapping",
    # Optional training parameters:
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_ratio=0.1,
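    # NOTE: the EmbeddingGemma model card states that float16 is not supported for that model.
    # If you see NaNs or errors with it, try fp16=False with bf16=True (or full precision).
    fp16=True,  # Set to False if GPU can't handle FP16
    bf16=False,  # Set to True if GPU supports BF16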
    # MultipleNegativesRankingLoss benefits from no duplicates
    batch_sampler=BatchSamplers.NO_DUPLICATES,
    # Optional tracking/debugging parameters:
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2,
    logging_steps=100,
    # Used in W&B if `wandb` is installed
    run_name=f"{EMBEDDING_MODEL.split('/')[-1]}-bio-mapping",
)

# 6. (Optional) Create an evaluator & evaluate the base model
# dev_evaluator = TripletEvaluator(
#     anchors=eval_dataset["anchor"],
#     positives=eval_dataset["positive"],
#     negatives=eval_dataset["negatives"],
#     name="bio-mapping-eval",
# )
# dev_evaluator(model)

In [None]:
# 7. Create a trainer & train
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    # evaluator=dev_evaluator,
    # Disable model‑card callback (if the library supports it)
    # callbacks=[],
)
trainer.train()

# (Optional) Evaluate the trained model on the test set, after training completes
# test_evaluator = TripletEvaluator(
#     anchors=test_dataset["anchor"],
#     positives=test_dataset["positive"],
#     negatives=test_dataset["negatives"],
#     name="bio-mapping-eval",
# )
# test_evaluator(model)

# 8. Save the trained model
model.save_pretrained(f"models/{EMBEDDING_MODEL.split('/')[-1]}-bio-mapping/final")

# 9. (Optional) Push it to the Hugging Face Hub
# model.push_to_hub(f"{EMBEDDING_MODEL.split('/')[-1]}-bio-mapping")

## Test the Trained AI Model

In [None]:
# Load the trained model from disk
embedding_model = SentenceTransformer(
    f"models/{EMBEDDING_MODEL.split('/')[-1]}-bio-mapping/final"
)
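
One way to sanity-check the loaded model is to rank candidate target variables by similarity to a source variable. The sketch below is an assumption about how such a check could look, not part of the original notebook: `rank_targets` is a hypothetical helper that accepts any callable mapping a list of strings to a list of vectors, so `embedding_model.encode` can be passed in directly.

```python
import math


def rank_targets(encode, source_variable, target_variables):
    """Rank target variables by cosine similarity to the source variable.

    encode: callable mapping a list of strings to a list of vectors,
            for example `embedding_model.encode`.
    """
    vectors = encode([source_variable] + list(target_variables))
    source_vec, target_vecs = vectors[0], vectors[1:]

    def cosine(u, v):
        dot = sum(float(a) * float(b) for a, b in zip(u, v))
        norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
        norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
        return dot / (norm_u * norm_v)

    scored = [
        (target, cosine(source_vec, vec))
        for target, vec in zip(target_variables, target_vecs)
    ]
    # Highest similarity first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# In the notebook (the variable names below are made up for illustration):
# rank_targets(embedding_model.encode, "pt_age", ["age_at_enrollment", "sex"])
```

If fine-tuning worked, the correct target should tend to rank at or near the top for held-out anchor/positive pairs.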