# Harmonization Approach Using Abstractions

## Prerequisites

Install the project's package manager and use it to sync the required packages before running this notebook.

In [None]:
# If you are actively working on related *.py files and would like changes to reload automatically into this notebook
%load_ext autoreload
%autoreload 2

## Single Benchmark Test File

Each test consists of a source model that we want to harmonize to a target model, together with the expected result of that harmonization (`expected_mappings.tsv`).

Tests are stored in a JSONL file, with one source-to-target test per row.

Each row of the JSONL file should have 3 columns (fields): `input_source_model`, `input_target_model`, `harmonized_mapping`

Populate those 3 columns with the contents of the following files:

- `source_model.json` == `input_source_model`
- `expected_mappings.tsv` == `harmonized_mapping`
- `target_model.json` == `input_target_model`
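
As a minimal sketch of how one such row could be assembled from the three files listed above: the helper name `build_benchmark_row` and the one-directory-per-test layout are assumptions made for illustration and are not part of the `harmonization` package. It also assumes each file's content is stored as a string in the row.

```python
import json
from pathlib import Path


def build_benchmark_row(test_dir):
    """Build one JSONL row (as a string) from the three files in test_dir.

    Hypothetical helper: assumes test_dir contains source_model.json,
    target_model.json, and expected_mappings.tsv.
    """
    test_dir = Path(test_dir)
    row = {
        "input_source_model": (test_dir / "source_model.json").read_text(encoding="utf-8"),
        "input_target_model": (test_dir / "target_model.json").read_text(encoding="utf-8"),
        "harmonized_mapping": (test_dir / "expected_mappings.tsv").read_text(encoding="utf-8"),
    }
    # One JSON object per line; json.dumps escapes embedded newlines and tabs,
    # so the result is always a single line.
    return json.dumps(row)
```

Writing the returned string plus a trailing newline to a file, once per test, would produce a JSONL file in the shape described above.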

In [None]:
import os
import json
import time
import pandas as pd
import matplotlib.pyplot as plt

from harmonization.jsonl import (
    split_harmonization_jsonl_by_input_target_model,
    jsonl_to_csv,
)
from harmonization.harmonization_benchmark import get_metrics_for_approach
from harmonization.harmonization_approaches.similarity_inmem import (
    SimilaritySearchInMemoryVectorDb,
)

In [None]:
benchmark_filepath = (
    # Synthetic Benchmark
    # "../datasets/harmonization_benchmark_SDCs_27_Gen3_DMs_mutated_v0.0.2/output.jsonl"
    # Real Benchmark
    "../datasets/harmonization_benchmark_real_BIOLINCC_BDC_v0.0.1/harmonization_benchmark_real_BIOLINCC_BDC_v0.0.1.jsonl"
)

In [None]:
output_jsonls_per_target_model_dir_path = (
    f"../output/temp/{os.path.basename(benchmark_filepath)}/per_target"
)
split_harmonization_jsonl_by_input_target_model(
    benchmark_filepath, output_jsonls_per_target_model_dir_path
)

> Warning: The next cells may take **a very long time**, depending on the input dataset and how many target models exist, and may use a lot of CPU/GPU the first time you run them (it took me 32 minutes on an M3 Mac). Later runs should take just **a long time** (20 minutes on an M3 Mac). On the first run, every target data model is embedded into a persistent vectorstore on disk (and loaded into memory) as the loop proceeds. On every run, each test case's `node.property` is embedded and used for a similarity search.
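
To make the embed-then-search step concrete, here is a toy sketch of a top-k similarity search. It is not how `SimilaritySearchInMemoryVectorDb` is implemented: the vectors are hand-written stand-ins for real embeddings, the property names are made up, and `cosine_similarity` and `top_k_matches` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_matches(query_vector, target_vectors, k=5):
    """Return the k (name, score) pairs most similar to query_vector."""
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in target_vectors.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy "embeddings" for illustration only; real ones come from an embedding model.
target_vectors = {
    "subject.age_at_enrollment": [0.9, 0.1, 0.0],
    "subject.sex": [0.1, 0.9, 0.0],
    "sample.collection_date": [0.0, 0.2, 0.9],
}
suggestions = top_k_matches([0.8, 0.2, 0.1], target_vectors, k=2)
print([name for name, _ in suggestions])
```

A vectorstore does the same ranking at scale, which is why embedding every target model dominates the first run.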

In [None]:
# Use the current Unix timestamp (whole seconds) as a per-run output folder name
folder_name = int(time.time())
output_directory = "./output/harmonization"

In [None]:
for file in os.listdir(output_jsonls_per_target_model_dir_path):
    benchmark_filepath = os.path.join(output_jsonls_per_target_model_dir_path, file)
    print(f"Opening {benchmark_filepath}...")
    output_filepath = f"{output_directory}/{folder_name}/{file}"
    os.makedirs(os.path.dirname(output_filepath), exist_ok=True)

    # These files are already split by target model, so the first row is enough
    input_target_model = None
    with open(benchmark_filepath, "r", encoding="utf-8") as infile:
        for line in infile:
            row = json.loads(line)
            try:
                # The target model may be stored as a JSON-encoded string...
                input_target_model = json.loads(row["input_target_model"])
            except (TypeError, json.JSONDecodeError):
                # ...or as an already-parsed object (or a non-JSON string); use it as-is
                input_target_model = row["input_target_model"]

            break

    # Truncate to 62 characters because of the limitation on chromadb collection names
    harmonization_approach = SimilaritySearchInMemoryVectorDb(
        vectordb_persist_directory_name=file[:62],
        input_target_model=input_target_model,
    )

    max_suggestions_per_property = 5
    output_filename = get_metrics_for_approach(
        benchmark_filepath,
        harmonization_approach,
        output_filepath,
        k=max_suggestions_per_property,
        metrics_column_name="custom_metrics",
        output_sssom_per_row=True,
        output_tsvs_per_row=True,
        output_expected_results_per_row=True,
    )
    print(f"Output metrics to {output_filepath}")

### Example conversion to CSV

In [None]:
csv_path = output_filename.replace(".jsonl", ".csv")
jsonl_to_csv(jsonl_path=output_filename, csv_path=csv_path)

## Visualize Results

In [None]:

def visualize_metrics(jsonl_file):
    """
    Visualizes metrics from a JSONL file.

    Args:
        jsonl_file (str): Path to the JSONL file.
    """

    data = []
    row_numbers = []
    with open(jsonl_file, "r") as f:
        for i, line in enumerate(f):
            try:
                json_data = json.loads(line)
                if "custom_metrics" in json_data:
                    data.append(json_data["custom_metrics"])
                    row_numbers.append(i + 1)  # Row numbers start at 1
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON: {e}")
                continue

    if not data:
        print("No data found in the JSONL file.")
        return

    df = pd.DataFrame(data)

    # --- Overall Accuracy ---
    plt.figure(figsize=(12, 6))
    plt.bar(row_numbers, df["recall"])
    mean_accuracy = df["recall"].mean()
    plt.axhline(
        y=mean_accuracy, color="r", linestyle="--", label=f"Mean: {mean_accuracy:.2f}"
    )
    plt.xlabel("Row Number")
    plt.ylabel("Overall Accuracy (Recall)")
    plt.title("Overall Accuracy (Recall) by Row Number")
    plt.xticks(row_numbers, rotation=90)
    plt.legend()
    plt.tight_layout()
    plt.show()

    # --- Precision and F1-Score (recall is plotted above) ---
    metrics = ["precision", "f1_score"]
    for metric in metrics:
        plt.figure(figsize=(12, 6))
        plt.bar(row_numbers, df[metric])
        mean_metric = df[metric].mean()
        plt.axhline(
            y=mean_metric, color="r", linestyle="--", label=f"Mean: {mean_metric:.2f}"
        )
        plt.xlabel("Row Number")
        plt.ylabel(metric.capitalize())
        plt.title(f"{metric.capitalize()} by Row Number")
        plt.xticks(row_numbers, rotation=90)
        plt.legend()
        plt.tight_layout()
        plt.show()


visualize_metrics(output_filename)

We expect high recall and low precision above, because our approaches provide *suggestions* for a final mapping and we expect an expert in the loop to select the right one. In other words, we expect a lot of false positives, and that's okay.

So we should weigh only recall (overall accuracy) heavily. Precision would ideally be high, but tuning for it might result in the right option not being presented at all, so we should not focus too heavily on it. Similarly, the F1 score is the harmonic mean of precision and recall, so we expect it to be low because precision is low.
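
A small worked example of the reasoning above, using one common set-based definition of the three metrics. The exact definitions used by `get_metrics_for_approach` may differ; `suggestion_metrics` and the property names below are hypothetical.

```python
def suggestion_metrics(suggested, expected):
    """Compute precision, recall, and F1 over (source, target) pairs.

    suggested: set of pairs proposed by the approach.
    expected: set of pairs in the expected mapping.
    """
    true_positives = len(suggested & expected)
    precision = true_positives / len(suggested) if suggested else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    total = precision + recall
    f1_score = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1_score": f1_score}


# Two source properties, each with exactly one correct target.
expected = {("age", "subject.age"), ("sex", "subject.sex")}

# Five suggestions per source property: the correct target plus four wrong ones.
suggested = set(expected)
for source in ("age", "sex"):
    suggested |= {(source, f"wrong_candidate_{i}") for i in range(4)}

metrics = suggestion_metrics(suggested, expected)
print(metrics)
```

Here every correct mapping is among the suggestions, so recall is 1.0, while only 2 of the 10 suggestions are correct, so precision is 0.2 and F1 is about 0.33. With 5 suggestions per property and one correct answer each, precision cannot exceed 0.2 under this definition.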