# Training Data for Embedding Model

We'll be using a multiple-negatives loss function, so we need to adapt our training data to the format it expects: (anchor, positive, negative) triplets.

This notebook contains the logic to generate the training data.

**If you already have the training data in the right format, skip ahead to the "Training" notebook.**

### Existing Training Data File Format

Each row in the existing JSONL training file includes a source model that should be harmonized to a target model. The ground-truth harmonization for that pair comes from `expected_mappings.tsv`.

Each row of the JSONL file has 3 fields: `input_source_model`, `input_target_model`, `harmonized_mapping`

Those 3 fields are populated from the contents of these files:

- `source_model.json` == `input_source_model`
- `expected_mappings.tsv` == `harmonized_mapping`
- `target_model.json` == `input_target_model`
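
To make the layout concrete, here is a minimal sketch of what one JSONL row might look like. Only the three top-level keys and the `ai_model_node_prop_desc` header come from this notebook; the model contents, the property names, and the second TSV header name (`harmonized_model_node_prop_desc`) are made-up placeholders, not taken from the real dataset.

```python
import json

# Hypothetical example row; model contents and descriptions are placeholders.
example_row = {
    "input_source_model": {"nodes": []},  # stands in for source_model.json
    "input_target_model": {"nodes": []},  # stands in for target_model.json
    # Contents of expected_mappings.tsv, stored inline as one string:
    # a header row, then one "source<TAB>target" pair per line.
    # The second header name is an assumption made for this sketch.
    "harmonized_mapping": (
        "ai_model_node_prop_desc\tharmonized_model_node_prop_desc\n"
        "patient.age_years\tsubject.age\n"
        "patient.sex\tsubject.gender\n"
    ),
}

# One JSONL line is the JSON encoding of that dict on a single line.
jsonl_line = json.dumps(example_row)

# Decoding the line gives back the three fields; the mapping is split by hand.
decoded = json.loads(jsonl_line)
mapping_rows = [
    line.split("\t")
    for line in decoded["harmonized_mapping"].split("\n")
    if line
]
header, pairs = mapping_rows[0], mapping_rows[1:]
print(sorted(decoded.keys()))
print(pairs)
```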

### How to convert existing training JSONL file to the format expected for embedding model training

For each ground truth mapping in `harmonized_mapping`:

- Extrapolate the "negatives" (wrong choices) by pairing the same source variable with every target variable _except_ the correct, "positive" one from the harmonized mapping
- Output a new CSV file with 3 columns: `anchor`, `positive`, `negatives`

Where:

- `anchor` == the source variable from the harmonized mapping
- `positive` == the target variable from the harmonized mapping
- `negatives` == a JSON-encoded list of every other target variable except the right one
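
The conversion rule above can be sketched on toy data as follows. `build_training_row` and the variable names are hypothetical and only illustrate the idea; the real cell further down works on property descriptions extracted from the target model.

```python
import json


def build_training_row(source_var, target_var, all_target_vars):
    """Return (anchor, positive, negatives_json) for one ground-truth mapping."""
    # Every target variable other than the correct one is a negative.
    negatives = [t for t in all_target_vars if t != target_var]
    # The negatives are stored as a JSON string so they fit in one CSV cell.
    return source_var, target_var, json.dumps(negatives)


# Toy target model with three variables (placeholder names).
target_vars = ["subject.age", "subject.gender", "subject.race"]

anchor, positive, negatives_json = build_training_row(
    "patient.age_years", "subject.age", target_vars
)
print(anchor, positive, json.loads(negatives_json))
```

Note that the number of negatives per row grows with the size of the target model, which is one reason the output file gets large.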

> **WARNING**: The cell below takes a long time to run on the full dataset, and the resulting file is large.

In [None]:
import json
import csv
from tqdm import tqdm
from pathlib import Path
from ai_harmonization.simple_data_model import (
    SimpleDataModel,
    get_data_model_as_node_prop_type_descriptions,
)


def load_model(json_obj) -> SimpleDataModel | None:
    """
    Attempt to parse a JSON model into a SimpleDataModel instance.
    Returns None if parsing fails.
    """
    try:
        # The entry was already decoded from JSON, so re-serialize it to a string
        # before handing it to the parser
        return SimpleDataModel.get_from_unknown_json_format(json.dumps(json_obj))
    except Exception as exc:
        print(f"Could not parse model: {exc}")
        return None


def process_jsonl(
    input_jsonl_path: Path,
    output_csv_path: Path,
) -> None:
    """
    Reads an input JSONL where each line contains:
        - input_source_model (dict)
        - input_target_model (dict)
        - harmonized_mapping (str holding the TSV content, 2 tab-separated columns)
    For each source/target property pair in the TSV, writes one row of the form:
        anchor (source description), positive (target description),
        negatives (JSON-encoded list of every other target property description)
    The output is a CSV file with headers: anchor,positive,negatives.
    """
    # Count total lines once for a static progress bar (and close the handle afterwards)
    with input_jsonl_path.open("r", encoding="utf-8") as count_file:
        total_lines = sum(1 for _ in count_file)

    with (
        input_jsonl_path.open("r", encoding="utf-8") as infile,
        output_csv_path.open("w", newline="", encoding="utf-8") as outfile,
    ):
        writer = csv.writer(outfile)
        # header
        writer.writerow(["anchor", "positive", "negatives"])

        for line_no, line in tqdm(
            enumerate(infile, start=1), total=total_lines, desc="Processing lines"
        ):
            line = line.strip()
            if not line:
                continue

            try:
                entry = json.loads(line)
            except json.JSONDecodeError as exc:
                print(f"Skipping malformed JSON at line {line_no}: {exc}")
                continue

            # Load source & target models
            source_model = load_model(entry.get("input_source_model"))
            target_model = load_model(entry.get("input_target_model"))
            if source_model is None or target_model is None:
                print(f"Skipping line {line_no} due to model parse failure.")
                continue

            # Get all target property descriptions
            target_props = get_data_model_as_node_prop_type_descriptions(target_model)

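            # Parse the harmonization TSV, which is stored inline as a string.
            # Fall back to an empty string if the field is missing or null.
            mapping_tsv = entry.get("harmonized_mapping") or ""
            for tsv_line in mapping_tsv.split("\n"):
                # Drop a trailing carriage return in case of Windows line endings
                parts = tsv_line.rstrip("\r").split("\t")
                if len(parts) != 2:
                    print(f"Skipping malformed TSV row {tsv_line}")
                    continue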

                source_desc, target_desc = parts

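                # Replace commas with semicolons to keep the text comma-free.
                # (csv.writer already quotes fields that contain commas, so this is
                # a precaution for downstream consumers rather than a CSV requirement.)
                source_desc = source_desc.replace(",", ";")
                target_desc = target_desc.replace(",", ";")

                # Skip the TSV header row and rows with an empty source description
                if source_desc == "ai_model_node_prop_desc" or not source_desc:
                    continue

                # Build the list of negatives (exclude the positive itself).
                # Sanitize each candidate before comparing, so the positive is still
                # excluded when its original description contained commas.
                sanitized_targets = [prop.replace(",", ";") for prop in target_props]
                negatives_list = [
                    neg for neg in sanitized_targets if neg != target_desc
                ]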

                # Convert the list to a JSON string
                negatives_json = json.dumps(negatives_list)

                # Write the single row
                writer.writerow([source_desc, target_desc, negatives_json])

Change the paths below to wherever you have the existing training data.

In [None]:
input_path = Path(
    "../datasets/harmonization_training_Mutated_SDCs_v3_20250423_v0.0.2/_training_data/final_training_data.jsonl"
)
output_path = Path(
    "../datasets/embedding_training_data_v0.0.1/embedding_training_with_negatives.csv"
)
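# Create the output directory if it does not exist yet
output_path.parent.mkdir(parents=True, exist_ok=True)
process_jsonl(input_path, output_path)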
print(output_path)
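
Once the file is written, a downstream step needs to decode the `negatives` column again. The sketch below is one way to do that with the standard library. `iter_triplets` is a hypothetical helper, not part of `ai_harmonization`, and the example uses a small in-memory CSV with placeholder values instead of the real output file. Because a single `negatives` cell can be very long, the `csv` module's field size limit is raised first; that step is an assumption about the data and is harmless if it turns out to be unnecessary.

```python
import csv
import io
import json

# Cells holding many negatives can exceed the csv module's default field limit.
csv.field_size_limit(2**31 - 1)


def iter_triplets(csv_file):
    """Yield one (anchor, positive, negative) triplet per negative in each row."""
    for row in csv.DictReader(csv_file):
        for negative in json.loads(row["negatives"]):
            yield row["anchor"], row["positive"], negative


# Build a tiny stand-in for the generated CSV (placeholder values).
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["anchor", "positive", "negatives"])
writer.writerow(
    ["patient.age_years", "subject.age", json.dumps(["subject.gender", "subject.race"])]
)
buffer.seek(0)

triplets = list(iter_triplets(buffer))
print(triplets)
```

With the real file, open it with `output_path.open(newline="", encoding="utf-8")` and pass the handle to `iter_triplets`.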