# Harmonization Approach Using Abstractions

## Prerequisites

Install the package manager and sync the required packages.

In [None]:
# If you are actively working on related *.py files and would like changes to reload automatically into this notebook
%load_ext autoreload
%autoreload 2

Set available GPUs:

In [None]:
#import os
#os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"  # change as necessary

## Single Benchmark Test File

Each test should include a source model (`*__ai_model_output.json`) that we want to harmonize to a target model (`harmonized_data_model.json`). The expected harmonization is given in `expected_mappings.tsv`.

The tests are stored in a JSONL file with one test per row.

Each row has 3 columns: `input_source_model`, `input_target_model`, `harmonized_mapping`

Those 3 columns are populated with the contents of the files:

- `*__ai_model_output.json` == `input_source_model`
- `harmonized_data_model.json` == `input_target_model`
- `expected_mappings.tsv` == `harmonized_mapping`
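
As an illustration of the row layout, the sketch below builds one JSONL row from three strings and reads it back. The model contents and the TSV header names are made-up placeholders, not taken from the benchmark; only the three column names come from the description above.

```python
import json

# Hypothetical stand-ins for the raw text of the three files of one test case.
source_model_text = json.dumps({"nodes": [{"name": "subject", "properties": ["age"]}]})
target_model_text = json.dumps({"nodes": [{"name": "case", "properties": ["age_at_index"]}]})
mapping_text = "source\ttarget\nsubject.age\tcase.age_at_index\n"  # header names are invented

# Each column stores file content as a string, so a whole test fits on one line.
row = {
    "input_source_model": source_model_text,
    "input_target_model": target_model_text,
    "harmonized_mapping": mapping_text,
}
jsonl_line = json.dumps(row)

# Reading it back: parse the line, then parse the embedded model JSON.
parsed = json.loads(jsonl_line)
target_model = json.loads(parsed["input_target_model"])
print(sorted(parsed.keys()))
```

Because `json.dumps` escapes the newlines inside the TSV text, the row stays on a single physical line, which is what makes the one-test-per-row layout workable.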

In [None]:
import os
import json
import time

from harmonization.jsonl import (
    split_harmonization_jsonl_by_input_target_model,
    jsonl_to_csv,
)
from harmonization.harmonization_benchmark import get_metrics_for_approach
from harmonization.harmonization_approaches.similarity_inmem import (
    SimilaritySearchInMemoryVectorDb,
)
from harmonization.harmonization_approaches.embeddings import (
    MedGemmaEmbeddings,
    QwenEmbeddings
)
from langchain_huggingface import HuggingFaceEmbeddings

`output.jsonl` contains 710 lines; `limited_output.jsonl` contains the first 10 lines of `output.jsonl`.

`limited_output.jsonl` is useful for quick local testing.
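
If you need a different subset size, a limited file can be produced by copying the first `n` lines of a JSONL file. The following is a minimal sketch; `head_jsonl` is a hypothetical helper that is not part of the `harmonization` package, and the demo writes to a temporary directory rather than the real dataset path.

```python
import itertools
import os
import tempfile


def head_jsonl(src_path, dst_path, n=10):
    """Copy the first n lines of src_path to dst_path and return the count."""
    count = 0
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in itertools.islice(src, n):
            dst.write(line)
            count += 1
    return count


# Demo on a throwaway file with 25 dummy rows.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "output.jsonl")
dst = os.path.join(tmp_dir, "limited_output.jsonl")
with open(src, "w", encoding="utf-8") as f:
    for i in range(25):
        f.write('{"row": %d}\n' % i)

written = head_jsonl(src, dst, n=10)
print(f"Wrote {written} lines to {dst}")
```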

In [None]:
#output_json_filepath = (
#    "../datasets/harmonization_benchmark_SDCs_27_Gen3_DMs_mutated_v0.0.2/output.jsonl"
#)

output_json_filepath = (
    "../datasets/harmonization_benchmark_SDCs_27_Gen3_DMs_mutated_v0.0.2/limited_output.jsonl"
)

In [None]:
output_jsonls_per_target_model_dir_path = (
    "../output/temp/harmonization/v0.0.2/per_target"
)
split_harmonization_jsonl_by_input_target_model(
    output_json_filepath, output_jsonls_per_target_model_dir_path
)
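
To make the splitting step concrete, here is a minimal sketch of grouping rows by their `input_target_model` value. This is not the implementation of `split_harmonization_jsonl_by_input_target_model`; the helper name `group_rows_by_target_model` and the hash-based group key are assumptions for illustration only.

```python
import hashlib
import json
from collections import defaultdict


def group_rows_by_target_model(jsonl_lines):
    """Group parsed rows by a short hash of their target model text."""
    groups = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        # Assumption: identical target model text means the same target model.
        key = hashlib.sha256(row["input_target_model"].encode("utf-8")).hexdigest()[:12]
        groups[key].append(row)
    return dict(groups)


# Three dummy rows that share two distinct target models.
lines = [
    json.dumps({"input_source_model": "a", "input_target_model": '{"m": 1}', "harmonized_mapping": "x"}),
    json.dumps({"input_source_model": "b", "input_target_model": '{"m": 1}', "harmonized_mapping": "y"}),
    json.dumps({"input_source_model": "c", "input_target_model": '{"m": 2}', "harmonized_mapping": "z"}),
]
groups = group_rows_by_target_model(lines)
for key, rows in groups.items():
    print(key, len(rows))
```

Grouping by target model means each target model only has to be embedded into a vectorstore once, no matter how many test rows refer to it.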

In [None]:
folder_name = time.time()
output_directory = "./output/harmonization/"

### Choose an embedding model: sentence-transformers, Qwen, MedGemma or EmbeddingGemma

Here are links to the models that can be used for embeddings:

* sentence-transformers model (default model, 768-dimension): https://huggingface.co/sentence-transformers/all-mpnet-base-v2
* Qwen model (1024-dimension): https://huggingface.co/Qwen/Qwen3-Embedding-0.6B
* MedGemma model (2560-dimension): https://huggingface.co/google/medgemma-4b-it
* EmbeddingGemma model (768-dimension): https://huggingface.co/google/embeddinggemma-300m

> Please note: You might need to request access before using the MedGemma or EmbeddingGemma models, and you need to provide your HF_TOKEN inside this notebook so it can connect to the model. If you want to use the MedGemma or EmbeddingGemma models, uncomment the following code.

In [1]:
# Uncomment this code if your model requires authorization via HuggingFace token
#import os
#from huggingface_hub import login
#hf_token = None
#with open(os.path.expanduser("~/.bashrc"), "r") as f:
#    for line in f:
#        if line.startswith("export HF_TOKEN="):
#            hf_token = line.strip().split("=", 1)[1]
#            break
## Remove any quotes (if present)
#if hf_token is not None:
#    hf_token = hf_token.strip('"').strip("'")
#login(hf_token)

Choose the desired embedding by uncommenting one line, and configure the batch size.

> Tip: if you are using GPUs and get an Out of Memory error, try a smaller batch size.

In [None]:

#embedding_fn = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
embedding_fn = QwenEmbeddings(model_name="Qwen/Qwen3-Embedding-0.6B")
#embedding_fn = MedGemmaEmbeddings(model_name="google/medgemma-4b-it")
#embedding_fn = MedGemmaEmbeddings(model_name="google/embeddinggemma-300m")

batch_size = 100
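
A rough sketch of why a smaller batch size lowers peak memory: texts are embedded in chunks of at most `batch_size`, so only one chunk is in flight at a time. `ToyEmbeddings` is a hypothetical stand-in that mimics the `embed_documents` method of LangChain-style embedding classes; how `SimilaritySearchInMemoryVectorDb` actually uses `batch_size` internally is not shown here and may differ.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


class ToyEmbeddings:
    """Hypothetical embedder: returns a tiny 2-d vector per text."""

    def embed_documents(self, texts):
        return [[float(len(t)), float(t.count(" "))] for t in texts]


def embed_in_batches(embedder, texts, batch_size):
    """Embed texts chunk by chunk and collect the vectors in input order."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embedder.embed_documents(batch))
    return vectors


texts = [f"property {i}" for i in range(250)]
vectors = embed_in_batches(ToyEmbeddings(), texts, batch_size=100)
batch_sizes = [len(b) for b in batched(texts, 100)]
print(len(vectors), batch_sizes)
```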

Optional - test embeddings on small text inputs:

In [None]:
#text = "heart disease"
#embedded_text = embedding_fn.embed_query(text)
#print("Embedded text:", embedded_text)
#print("Embedding dimension:", len(embedded_text))
#del embedded_text

> Warning: The next cell takes **a very long time** and a lot of CPU/GPU the first time you run it (32 minutes on an M3 Mac for me), and still **a long time** on later runs (20 minutes on the same machine). On the first run it embeds every target data model into a persistent vectorstore on disk (also loaded in memory). On every run it embeds each test case `node.property` and performs a similarity search.
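
Conceptually, the similarity search step compares the embedding of a source `node.property` against the stored embeddings of the target model and keeps the closest matches. The sketch below illustrates this with cosine similarity on hand-written 3-dimensional vectors. The property names and vectors are invented, and the distance metric actually used by the vectorstore is an assumption here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=1):
    """Return the k names whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in indexed.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical target model index: node.property -> embedding.
target_index = {
    "case.age_at_index": [0.9, 0.1, 0.0],
    "case.gender": [0.0, 1.0, 0.1],
    "sample.tissue_type": [0.1, 0.0, 1.0],
}
query = [0.8, 0.2, 0.05]  # pretend embedding of a source property such as "subject.age"
best = top_k(query, target_index, k=2)
print(best)
```

Real embeddings have hundreds or thousands of dimensions (768 to 2560 for the models listed above), which is where most of the run time goes.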

In [None]:
# Remove future warning from pandas
import pandas as pd
pd.set_option('future.no_silent_downcasting', True)

for file in os.listdir(output_jsonls_per_target_model_dir_path):
    full_file_path = os.path.join(output_jsonls_per_target_model_dir_path, file)
    print(f"Opening {full_file_path}...")
    output_json_filepath = f"{output_directory}/{folder_name}/{file}"
    os.makedirs(os.path.dirname(output_json_filepath), exist_ok=True)

    # since these files are separated by target model already, just get the first row
    input_target_model = None
    with open(full_file_path, "r", encoding="utf-8") as infile:
        for line in infile:
            row = json.loads(line)
            input_target_model = json.loads(row["input_target_model"])
            break
    print("Input target model received")

    # truncate the name to 62 characters because of the length limit on chromadb collection names
    harmonization_approach = SimilaritySearchInMemoryVectorDb(
        vectordb_persist_directory_name=f"{file[:62]}",
        input_target_model=input_target_model,
        embedding_function=embedding_fn,
        batch_size=batch_size
    )
    print("Input target model added to vectorstore")

    output_filename = get_metrics_for_approach(
        full_file_path,
        harmonization_approach,
        output_json_filepath,
        metrics_column_name="custom_metrics",
    )
    print(f"Output metrics to {output_json_filepath}")

Optional - empty GPU cache if GPU used:

In [None]:
#import torch
#del embedding_fn
#torch.cuda.empty_cache()

### Example conversion to CSVs

In [None]:
# output_directory = "./output/harmonization/"
# output_directory = os.path.join(
#     output_directory, "1755028259.3249412"
# )  # REPLACE with folder you want
# for file in os.listdir(output_directory):
#     full_file_path = os.path.abspath(os.path.join(output_directory, file))
#     csv_path = full_file_path.replace(".jsonl", ".csv")
#     jsonl_to_csv(jsonl_path=full_file_path, csv_path=csv_path)