# Harmonization Approach Using Abstractions

## Prerequisites

Install package manager and sync required packages.

## Setup

In [None]:
# If you are actively working on related *.py files and would like changes to reload automatically into this notebook
%load_ext autoreload
%autoreload 2

In [None]:
import os
import json
import time
from matplotlib import pyplot as plt


from harmonization.jsonl import (
    split_harmonization_jsonl_by_input_target_model,
    jsonl_to_csv,
)
from harmonization.harmonization_benchmark import get_metrics_for_approach
from harmonization.harmonization_approaches.similarity_inmem import (
    SimilaritySearchInMemoryVectorDb,
)
from harmonization.harmonization_approaches.embeddings import (
    MedGemmaEmbeddings,
    QwenEmbeddings
)
from langchain_huggingface import HuggingFaceEmbeddings

Set available GPUs (skip this step if using CPUs)

In [None]:
#os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"  # change as necessary

## Single Benchmark Test File

Each test case pairs a source model with the target model it should be harmonized to. The expected harmonization is given in `expected_mappings.tsv`.

Tests are stored in a JSONL file, with one source-to-target pair per row.

Each row of the JSONL file should have 3 columns: `input_source_model`, `input_target_model`, `harmonized_mapping`

Those 3 columns should be populated with the contents of the following files:

- `source_model.json` == `input_source_model`
- `expected_mappings.tsv` == `harmonized_mapping`
- `target_model.json` == `input_target_model`
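
As a rough illustration of the file-to-column mapping above, here is a minimal sketch that builds one JSONL row from the three files of a single test case. The helper `build_benchmark_row`, the toy file contents, and the choice to store each file as raw text are assumptions made for this example, not the benchmark's actual tooling.

```python
import json
import tempfile
from pathlib import Path


def build_benchmark_row(test_dir: Path) -> dict:
    """Read the three files of one test case into one row (hypothetical helper)."""
    return {
        "input_source_model": (test_dir / "source_model.json").read_text(encoding="utf-8"),
        "input_target_model": (test_dir / "target_model.json").read_text(encoding="utf-8"),
        "harmonized_mapping": (test_dir / "expected_mappings.tsv").read_text(encoding="utf-8"),
    }


# Toy test case written to a temporary directory; the contents are made up.
with tempfile.TemporaryDirectory() as tmp:
    test_dir = Path(tmp)
    (test_dir / "source_model.json").write_text('{"nodes": []}', encoding="utf-8")
    (test_dir / "target_model.json").write_text('{"nodes": []}', encoding="utf-8")
    (test_dir / "expected_mappings.tsv").write_text("source\ttarget\n", encoding="utf-8")

    row = build_benchmark_row(test_dir)
    # One test case becomes exactly one line of the JSONL file.
    jsonl_line = json.dumps(row)

print(sorted(row))
```

Because `json.dumps` escapes tabs and newlines, the multi-line TSV content still fits on a single JSONL line.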

Below we read either the `output.jsonl` file, which contains 710 lines, or the `limited_output.jsonl` file, which contains the first 10 lines of `output.jsonl`.

`limited_output.jsonl` can be useful for testing locally.

In [None]:
benchmark_filepath = (
    # Synthetic Benchmark
    # "../datasets/harmonization_benchmark_SDCs_27_Gen3_DMs_mutated_v0.0.2/output.jsonl"
    # Real Benchmark
    "../datasets/harmonization_benchmark_real_BIOLINCC_BDC_v0.0.1/harmonization_benchmark_real_BIOLINCC_BDC_v0.0.1.jsonl"
)

In [None]:
output_jsonls_per_target_model_dir_path = (
    f"../output/temp/{os.path.basename(benchmark_filepath)}/per_target"
)
split_harmonization_jsonl_by_input_target_model(
    benchmark_filepath, output_jsonls_per_target_model_dir_path
)

In [None]:
folder_name = time.time()
output_directory = "./output/harmonization/"

### Choose sentence-transformers, MedGemma or Qwen embedding

Here are links to the models that can be used for embeddings:

* sentence-transformers model (default model, 768-dimension): https://huggingface.co/sentence-transformers/all-mpnet-base-v2
* Qwen model (1024-dimension): https://huggingface.co/Qwen/Qwen3-Embedding-0.6B
* MedGemma model (2560-dimension): https://huggingface.co/google/medgemma-4b-it
* EmbeddingGemma model (768-dimension): https://huggingface.co/google/embeddinggemma-300m

> Please note: You might need to request access before using the MedGemma or EmbeddingGemma models, and you need to use your HF_TOKEN inside this notebook to allow it to connect to the model. If you want to use the MedGemma or EmbeddingGemma models, please uncomment the following code.

In [None]:
# Uncomment this code and ensure it works if your model requires authorization via HuggingFace token
# import os
# from huggingface_hub import login

# login(os.environ["HF_TOKEN"])

Choose desired embedding by uncommenting a line, and configure batch size.

> Tip: if you are using GPUs and get an out-of-memory error, try setting a smaller batch size

In [None]:

# embedding_fn = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
embedding_fn = QwenEmbeddings(model_name="Qwen/Qwen3-Embedding-0.6B")
# embedding_fn = MedGemmaEmbeddings(model_name="google/medgemma-4b-it")
# embedding_fn = MedGemmaEmbeddings(model_name="google/embeddinggemma-300m")

batch_size = 32

Optional - test embeddings on small text inputs:

In [None]:
#text = "heart disease"
#embedded_text = embedding_fn.embed_query(text)
#print("Embedded text:", embedded_text)
#print("Embedding dimension:", len(embedded_text))
#del embedded_text

> Warning: The next cell may take **a long time** and use a lot of CPU/GPU when you run it. On the first run, it embeds every single target data model into a persistent vectorstore on disk (also loaded in memory) as it goes. On every run, it then embeds each test case `node.property` and performs a similarity search.
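
To make the similarity-search step more concrete, here is a small, library-free sketch of a top-k lookup over pre-computed embeddings. It is not the implementation of `SimilaritySearchInMemoryVectorDb`: the function names, the toy 3-dimensional vectors, the property names, and the use of cosine similarity are all assumptions for illustration, and the real vector store may use a different distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_similar(query_vec, indexed, k):
    """Return the k (name, score) pairs most similar to query_vec (hypothetical helper)."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in indexed.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up embeddings standing in for target-model properties.
target_index = {
    "subject.age": [0.9, 0.1, 0.0],
    "subject.sex": [0.0, 1.0, 0.1],
    "lab.glucose": [0.1, 0.0, 0.95],
}
# Stands in for the embedding of one source `node.property`.
query = [1.0, 0.2, 0.0]

suggestions = top_k_similar(query, target_index, k=2)
for name, score in suggestions:
    print(f"{name}: {score:.3f}")
```

In the notebook, `k` plays the role of `max_suggestions_per_property`: each source property receives up to that many ranked target suggestions.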

In [None]:
# Remove future warning from pandas
import pandas as pd
pd.set_option('future.no_silent_downcasting', True)

for file in os.listdir(output_jsonls_per_target_model_dir_path):
    benchmark_filepath = os.path.join(output_jsonls_per_target_model_dir_path, file)
    print(f"Opening {benchmark_filepath}...")
    output_filepath = f"{output_directory}/{folder_name}/{file}"
    os.makedirs(os.path.dirname(output_filepath), exist_ok=True)

    # since these files are separated by target model already, just get the first row
    input_target_model = None
    with open(benchmark_filepath, "r", encoding="utf-8") as infile:
        for line in infile:
            row = json.loads(line)
            try:
                input_target_model = json.loads(row["input_target_model"])
            except Exception:
                input_target_model = row["input_target_model"]

            break
    print("Input target model received")

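    # Truncate the name parts because of the length limit on chromadb collection names.
    # Single quotes inside the f-string keep this valid on Python versions before 3.12.
    harmonization_approach = SimilaritySearchInMemoryVectorDb(
        vectordb_persist_directory_name=f"{file[:53]}-{embedding_fn.model.name_or_path.split('/')[-1][:5]}-0",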
        input_target_model=input_target_model,
        embedding_function=embedding_fn,
        batch_size=batch_size,
    )
    print("Input target model added to vectorstore")

    max_suggestions_per_property = 5
    output_filename = get_metrics_for_approach(
        benchmark_filepath,
        harmonization_approach,
        output_filepath,
        k=max_suggestions_per_property,
        metrics_column_name="custom_metrics",
        output_sssom_per_row=True,
        output_tsvs_per_row=True,
        output_expected_results_per_row=True,
    )
    print(f"Output metrics to {output_filepath}")

Optional - empty GPU cache if GPU used:

In [None]:
#import torch
#del embedding_fn
#torch.cuda.empty_cache()

### Example conversion to CSVs

In [None]:
csv_path = output_filename.replace(".jsonl", ".csv")
jsonl_to_csv(jsonl_path=output_filename, csv_path=csv_path)

## Visualize Results

In [None]:

def visualize_metrics(jsonl_file):
    """
    Visualizes metrics from a JSONL file.

    Args:
        jsonl_file (str): Path to the JSONL file.
    """

    data = []
    row_numbers = []
    with open(jsonl_file, "r") as f:
        for i, line in enumerate(f):
            try:
                json_data = json.loads(line)
                if "custom_metrics" in json_data:
                    data.append(json_data["custom_metrics"])
                    row_numbers.append(i + 1)  # Row numbers start at 1
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON: {e}")
                continue

    if not data:
        print("No data found in the JSONL file.")
        return

    df = pd.DataFrame(data)

    # --- Overall Accuracy ---
    plt.figure(figsize=(12, 6))
    plt.bar(row_numbers, df["recall"])
    mean_accuracy = df["recall"].mean()
    plt.axhline(
        y=mean_accuracy, color="r", linestyle="--", label=f"Mean: {mean_accuracy:.2f}"
    )
    plt.xlabel("Row Number")
    plt.ylabel("Overall Accuracy (Recall)")
    plt.title("Overall Accuracy (Recall) by Row Number")
    plt.xticks(row_numbers, rotation=90)
    plt.legend()
    plt.tight_layout()
    plt.show()

    # --- Precision and F1-Score (recall is plotted above) ---
    metrics = ["precision", "f1_score"]
    for metric in metrics:
        plt.figure(figsize=(12, 6))
        plt.bar(row_numbers, df[metric])
        mean_metric = df[metric].mean()
        plt.axhline(
            y=mean_metric, color="r", linestyle="--", label=f"Mean: {mean_metric:.2f}"
        )
        plt.xlabel("Row Number")
        plt.ylabel(metric.capitalize())
        plt.title(f"{metric.capitalize()} by Row Number")
        plt.xticks(row_numbers, rotation=90)
        plt.legend()
        plt.tight_layout()
        plt.show()

visualize_metrics(output_filename)

We expect high recall and low precision above, because our approaches provide *suggestions* for a final mapping and we expect an expert in the loop to select the right one. In other words, we expect a lot of false positives, and that's okay.

So we should weigh only "recall" (overall accuracy) heavily. Precision would ideally be high, but tuning for it might result in the right option not being presented at all, so we should not focus too heavily on it. Similarly, the F1 score is the harmonic mean of precision and recall, so we expect it to be low because precision is low.
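
A small worked example may help show why this pattern appears. The sketch below uses the standard set-based definitions of precision, recall, and F1; how `get_metrics_for_approach` computes its metrics is not shown in this notebook, so treat the helper and the property names as illustrative assumptions. With 5 suggestions for a property that has a single correct target, recall can be perfect while precision stays at 0.2.

```python
def precision_recall_f1(suggested, expected):
    """Standard set-based precision, recall, and F1 (hypothetical helper)."""
    suggested, expected = set(suggested), set(expected)
    true_positives = len(suggested & expected)
    precision = true_positives / len(suggested) if suggested else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Five made-up suggestions for one source property; only one is the expected target.
suggested = [
    "subject.age",
    "subject.age_at_visit",
    "subject.dob",
    "visit.date",
    "subject.sex",
]
expected = ["subject.age"]

precision, recall, f1 = precision_recall_f1(suggested, expected)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here the correct option is always presented to the expert (recall of 1.0), yet four of the five suggestions are false positives, which pulls both precision and F1 down.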