# AI-Assisted Data Curation Toolkit: BDC Use Cases

This notebook demonstrates the AI-Assisted Data Curation Toolkit for use cases in NHLBI BioData Catalyst.

The toolkit suggests harmonizations from a source data model to a target data model using AI-backed approaches, while leaving the expert curator in complete control.

For BDC we will demonstrate the following use cases:

- Assist in mapping between the new DMC BDC-HM data model (modeled in LinkML, exported as JSON) and the existing NHLBI BioData Catalyst Gen3 Data Dictionary (modeled in YAML/JSON)
- Assist in mapping from a BioLINCC study's original variables to variables in the new DMC BDC-HM data model

## Setup

Let's pull the latest generated JSON Schema of the BDC HM data model.

In [None]:
!wget https://raw.githubusercontent.com/RTIInternational/NHLBI-BDC-DMC-HM/refs/heads/main/generated/bdchm.schema.json

Now set up some imports from the toolkit.

In [None]:
import os
import json

from ai_harmonization.interactive import (
    get_interactive_table_for_suggestions,
    get_nodes_and_properties_df,
)
from ai_harmonization.simple_data_model import (
    SimpleDataModel,
    get_data_model_as_node_prop_type_descriptions,
)
from ai_harmonization.harmonization_approaches.similarity_inmem import (
    SimilaritySearchInMemoryVectorDb,
)
from ai_harmonization.harmonization_approaches.embeddings import BGEEmbeddings

Set available GPUs (skip this step if using CPUs).

In [None]:
# os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"  # change as necessary

## Use a Harmonization Approach to get Suggestions

### Input 

- A `source data model` you want to harmonize from
- A `target data model` you want to harmonize to

For this initial example, we will evaluate two use cases:

1. BDC-HM to Gen3 Data Dictionary
   - The `source data model` will be the BDC-HM model we pulled from the public repo above (`bdchm.schema.json`).
   - The `target data model` is the **NHLBI BioData Catalyst Gen3 Data Dictionary v4.6.5** (latest version as of 21 AUG 2025).
2. Unharmonized Study to BDC-HM
   - The `source data model` will be `example_real_source_model.json`, a real original study before ingestion into the NHLBI BioData Catalyst ecosystem (i.e., not yet harmonized). It is based on an original BioLINCC data model.
   - The `target data model` will be the BDC-HM model we pulled from the public repo above (`bdchm.schema.json`).

The toolkit is generalized for any models, so you can experiment with altering these.
You can supply your own source or target model, so long as its format follows one of the examples or the simplified JSON format.

## 1. BDC-HM to BDC Gen3 Data Dictionary

The first step is to parse the source and target data models into a core data schema, also known as a simple data model, which is generalized to represent both graph-like and relational models. It is plain JSON, and the toolkit has utilities for parsing it from LinkML JSON Schema output, from Gen3, and from other ARPA-H AI Curation tools (we leave those out of this explanation, since we're dealing with existing data models).

In [None]:
source_file = "bdchm.schema.json"
target_file = "./examples/example_target_model_BDC.json"

with open(source_file, "r") as f:
    input_source_model = json.load(f)

# Note: there is a SimpleDataModel.get_from_unknown_json_format(), but since we know the format - we can specify
input_source_model = SimpleDataModel.from_linkml_jsonschema(
    json.dumps(input_source_model), ignore_properties_with_endings=["id"]
)

with open(target_file, "r") as f:
    input_target_model = json.load(f)

input_target_model = SimpleDataModel.from_gen3_model(
    json.dumps(input_target_model), ignore_properties_with_endings=["id"]
)

In [None]:
print("Source Model")
input_source_model.get_property_df()

In [None]:
print("Target Model")
input_target_model.get_property_df()

### Use a Specific Harmonization Approach to get Suggestions

This could take some time depending on what hardware you're running this on. It embeds the entire target and source models and performs in-memory vector database searches.

This is a **baseline algorithm** we are using as an initial harmonization approach to prove out the concept. We plan to use a more sophisticated approach, with trained AI models, for better results in the future.
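
To make the role of `score_threshold` and `k` concrete, here is a minimal, self-contained sketch of embedding-based similarity search. It is not the toolkit's implementation: it assumes cosine similarity as the score, uses made-up three-dimensional vectors in place of real embeddings, and the property names and the `top_k_matches` helper are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_matches(source_vec, target_vecs, k, score_threshold):
    """Return up to k (target_name, score) pairs scoring at or above the threshold."""
    scored = [
        (name, cosine_similarity(source_vec, vec))
        for name, vec in target_vecs.items()
    ]
    kept = [(name, score) for name, score in scored if score >= score_threshold]
    # Highest score first, then truncate to the k best candidates.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional "embeddings" (assumption); real embedding vectors are much longer.
target_vecs = {
    "subject.age_at_enrollment": [0.9, 0.1, 0.0],
    "subject.sex": [0.0, 1.0, 0.0],
    "sample.collection_site": [0.0, 0.1, 0.9],
}
# Stands in for the embedding of a hypothetical source property.
source_vec = [1.0, 0.2, 0.0]

matches = top_k_matches(source_vec, target_vecs, k=10, score_threshold=0.7)
for name, score in matches:
    print(f"{name}: {score:.3f}")
```

In this toy data only one target clears the 0.7 threshold, so only one suggestion is returned even though `k` allows ten. A source property with no target above the threshold yields an empty list, which is the situation the "No relevant docs" warnings describe.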

In [None]:
embedding_fn = BGEEmbeddings(model_name="BAAI/bge-large-en-v1.5")
batch_size = 32

harmonization_approach = SimilaritySearchInMemoryVectorDb(
    # A unique name for this file and embedding algorithm, truncated to fit the length limit of the in-memory vectorstore
    # (single quotes inside the f-string keep this valid on Python versions before 3.12)
    vectordb_persist_directory_name=f"{os.path.basename(target_file)[:53]}-{embedding_fn.model.name_or_path.split('/')[-1][:5]}-3",
    input_target_model=input_target_model,
    embedding_function=embedding_fn,
    batch_size=batch_size,
)

max_suggestions_per_property = 10
score_threshold = 0.7

suggestions = harmonization_approach.get_harmonization_suggestions(
    input_source_model=input_source_model,
    input_target_model=input_target_model,
    score_threshold=score_threshold,
    k=max_suggestions_per_property,
)
# you may see warnings about No relevant docs being retrieved. This is okay.
# There may not be a great mapping from every source variable.
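
The persist directory name passed to `SimilaritySearchInMemoryVectorDb` is built from a truncated file name, a truncated model name, and a fixed suffix. The sketch below reproduces that string logic in a hypothetical helper, `build_persist_name`, so you can see what the name looks like. It assumes `embedding_fn.model.name_or_path` equals the `model_name` given to `BGEEmbeddings`, which this notebook does not confirm.

```python
import os


def build_persist_name(target_file, model_name_or_path, suffix="3"):
    """Mirror the notebook's f-string: truncated file name, truncated model name, suffix."""
    file_part = os.path.basename(target_file)[:53]  # at most 53 characters
    model_part = model_name_or_path.split("/")[-1][:5]  # at most 5 characters
    return f"{file_part}-{model_part}-{suffix}"


name_bdc = build_persist_name(
    "./examples/example_target_model_BDC.json", "BAAI/bge-large-en-v1.5"
)
name_hm = build_persist_name("bdchm.schema.json", "BAAI/bge-large-en-v1.5")

print(name_bdc)  # example_target_model_BDC.json-bge-l-3
print(name_hm)   # bdchm.schema.json-bge-l-3
```

With a one-character suffix the name never exceeds 53 + 1 + 5 + 1 + 1 = 61 characters, and different target files produce different names.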

### Visualize Suggestions

In [None]:
table_df = suggestions.to_simlified_dataframe()
table_df.sort_values(by="Similarity", ascending=False, inplace=True)

# Group by 'Original Node.Property' and find the index of max similarity for each group
idx = table_df.groupby("Original Node.Property")["Similarity"].idxmax()

# Filter DataFrame using the indices found above
filtered_df = table_df.loc[idx]
filtered_df.drop(columns=["Original Description", "Target Description"], inplace=True)
filtered_df.sort_values(by="Similarity", ascending=False, inplace=True)
filtered_df
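
For readers less familiar with `groupby(...).idxmax()`, the same "keep the best suggestion per source property" logic can be sketched in plain Python. The rows are invented for illustration; only the `Original Node.Property` and `Similarity` column names come from this notebook, and `Target Node.Property` is an assumed column name.

```python
# Hypothetical suggestion rows, one per (source property, target property) pair.
rows = [
    {"Original Node.Property": "demographics.age",
     "Target Node.Property": "subject.age_at_enrollment", "Similarity": 0.91},
    {"Original Node.Property": "demographics.age",
     "Target Node.Property": "subject.birth_year", "Similarity": 0.78},
    {"Original Node.Property": "labs.hdl",
     "Target Node.Property": "measurement.hdl_cholesterol", "Similarity": 0.84},
]

best_by_property = {}
for row in rows:
    key = row["Original Node.Property"]
    current = best_by_property.get(key)
    # Strict ">" keeps the first row on ties, matching idxmax's first-occurrence rule.
    if current is None or row["Similarity"] > current["Similarity"]:
        best_by_property[key] = row

# One row per source property, best matches first.
best_rows = sorted(
    best_by_property.values(), key=lambda r: r["Similarity"], reverse=True
)
for row in best_rows:
    print(row["Original Node.Property"], "->", row["Target Node.Property"], row["Similarity"])
```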

### Create Interactive Table for Selecting Suggestions

This is where an expert will go through the suggestions from the toolkit, evaluate, and use this information to inform a final harmonization.

In [None]:
table = get_interactive_table_for_suggestions(
    table_df,
    column_for_filtering=1,
    # additional config for the interactive table
    maxBytes="2MB",
    pageLength=50,
)
table

### Example ITable 
![Example ITable](./examples/example_itable.png)

> **Don't see the table or see an error above?** Try restarting the kernel, then try restarting JupyterLab (if that's what you're using). The installs for AnyWidgets might not be picked up yet.

> **Dark Theme?** If you're using a dark theme, you might need to switch to light for the table to display properly. 

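You can select interactively above, but we can also dump the entire table to a file. Note that the call below writes tab-separated values (`sep="\t"`) even though the file name ends in `.csv`: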

In [None]:
table_df.to_csv(
    "./all_suggestions_1.csv",
    index=False,
    na_rep="N/A",
    sep="\t",
    quotechar='"',
)
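
Because the export uses `sep="\t"` and `na_rep="N/A"`, anything that reads the file back needs a tab delimiter and should treat `N/A` as missing. Below is a minimal sketch using only the standard library. The file contents are an invented stand-in (read from a string rather than from `all_suggestions_1.csv`), and `Target Node.Property` is an assumed column name.

```python
import csv
import io

# Invented stand-in for the exported file: tab-separated, with "N/A" for missing values.
exported = (
    "Original Node.Property\tTarget Node.Property\tSimilarity\n"
    "demographics.age\tsubject.age_at_enrollment\t0.91\n"
    "demographics.notes\tN/A\t0.72\n"
)

reader = csv.DictReader(io.StringIO(exported), delimiter="\t", quotechar='"')
records = []
for record in reader:
    # Map the "N/A" placeholder back to None.
    cleaned = {key: (None if value == "N/A" else value) for key, value in record.items()}
    # csv yields strings, so convert the score back to a number.
    cleaned["Similarity"] = float(cleaned["Similarity"])
    records.append(cleaned)

print(records)
```

To read the real export, replace `io.StringIO(exported)` with a file opened via `open("./all_suggestions_1.csv", newline="")`.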

## 2. Original BioLINCC Study to BDC-HM

- The `source data model` will be `example_real_source_model.json`, a real original study before ingestion into the NHLBI BioData Catalyst ecosystem (i.e., not yet harmonized). It is based on an original BioLINCC data model.
- The `target data model` will be the BDC-HM model we pulled from the public repo above (`bdchm.schema.json`).

In [None]:
source_file_2 = "./examples/example_real_source_model.json"
target_file_2 = "bdchm.schema.json"

with open(source_file_2, "r") as f:
    input_source_model_2 = json.load(f)

input_source_model_2 = SimpleDataModel.from_simple_json(
    json.dumps(input_source_model_2), ignore_properties_with_endings=["id"]
)

with open(target_file_2, "r") as f:
    input_target_model_2 = json.load(f)

input_target_model_2 = SimpleDataModel.from_linkml_jsonschema(
    json.dumps(input_target_model_2), ignore_properties_with_endings=["id"]
)

In [None]:
print("Source Model")
input_source_model_2.get_property_df()

In [None]:
print("Target Model")
input_target_model_2.get_property_df()

### Use a Specific Harmonization Approach to get Suggestions

This could take some time depending on what hardware you're running this on. It embeds the entire target and source models and performs in-memory vector database searches.

This is a **baseline algorithm** we are using as an initial harmonization approach to prove out the concept. We plan to use a more sophisticated approach, with trained AI models, for better results in the future.

In [None]:
harmonization_approach_2 = SimilaritySearchInMemoryVectorDb(
    # A unique name for this file and embedding algorithm, truncated to fit the length limit of the in-memory vectorstore.
    # Built from target_file_2 (not target_file) so it does not collide with the directory used in section 1.
    vectordb_persist_directory_name=f"{os.path.basename(target_file_2)[:53]}-{embedding_fn.model.name_or_path.split('/')[-1][:5]}-3",
    input_target_model=input_target_model_2,
    embedding_function=embedding_fn,
    batch_size=batch_size,
)

suggestions_2 = harmonization_approach_2.get_harmonization_suggestions(
    input_source_model=input_source_model_2,
    input_target_model=input_target_model_2,
    score_threshold=score_threshold,
    k=max_suggestions_per_property,
)
# you may see warnings about No relevant docs being retrieved. This is okay.
# There may not be a great mapping from every source variable.

### Visualize Suggestions

In [None]:
table_df_2 = suggestions_2.to_simlified_dataframe()
table_df_2.sort_values(by="Similarity", ascending=False, inplace=True)
table_df_2

In [None]:
# Group by 'Original Node.Property' and find the index of max similarity for each group
idx_2 = table_df_2.groupby("Original Node.Property")["Similarity"].idxmax()

# Filter DataFrame using the indices found above
filtered_df_2 = table_df_2.loc[idx_2]
filtered_df_2.drop(columns=["Original Description", "Target Description"], inplace=True)
filtered_df_2.sort_values(by="Similarity", ascending=False, inplace=True)
filtered_df_2

### Create Interactive Table for Selecting Suggestions

In [None]:
table_2 = get_interactive_table_for_suggestions(
    table_df_2,
    column_for_filtering=1,
    # additional config for the interactive table
    maxBytes="2MB",
    pageLength=50,
)
table_2

### Example ITable 
![Example ITable](./examples/example_itable_2.png)

> **Don't see the table or see an error above?** Try restarting the kernel, then try restarting JupyterLab (if that's what you're using). The installs for AnyWidgets might not be picked up yet.

> **Dark Theme?** If you're using a dark theme, you might need to switch to light for the table to display properly. 

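You can select interactively above, but we can also dump the entire table to a file. Note that the call below writes tab-separated values (`sep="\t"`) even though the file name ends in `.csv`: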

In [None]:
table_df_2.to_csv(
    "./all_suggestions_2.csv",
    index=False,
    na_rep="N/A",
    sep="\t",
    quotechar='"',
)