The datasets are stored in Google Drive ([assets folder](https://drive.google.com/drive/folders/1ZVjLNbd2LriFEqVV_kj8A4VuNHGeTkD-?usp=drive_link)). Create a shortcut to this folder in "MyDrive" so that the `/content/drive/MyDrive/assets/...` paths used below resolve.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Testing [OntoAligner](https://github.com/sciknoworg/OntoAligner) KG embedding for entity matching

In [2]:
!pip install ontoaligner

Collecting argparse (from ontoaligner)
  Using cached argparse-1.4.0-py2.py3-none-any.whl.metadata (2.8 kB)
Using cached argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Installing collected packages: argparse
Successfully installed argparse-1.4.0


Import necessary modules from the ontoaligner package

In [3]:
from ontoaligner.ontology import GraphTripleOMDataset          # Handles ontology data collection in graph triple format
from ontoaligner.encoder import GraphTripleEncoder             # Encodes graph triple data into model-consumable format
from ontoaligner.aligner import ConvEAligner                   # Alignment model based on ConvE (Knowledge Graph Embedding)
from ontoaligner.postprocess import graph_postprocessor        # Applies post-processing to improve alignment quality
from ontoaligner.utils import metrics, xmlify                  # Utilities for evaluation and XML export

INFO:pykeen.utils:Using opt_einsum


### Step 1: Initialize ontology matching task

In [4]:
task = GraphTripleOMDataset()
task.ontology_name = "Customer-Guests"                             # Assign a name for the ontology matching task
print("task:", task)
# Expected output: task: Track: GraphTriple, Source-Target sets: Customer-Guests

task: Track: GraphTriple, Source-Target sets: Customer-Guests


### Step 2: Load source, target, and reference ontologies

In [13]:
dataset = task.collect(
    source_ontology_path="/content/drive/MyDrive/assets/integration_test_data/customers.xml",      # Path to source ontology
    target_ontology_path="/content/drive/MyDrive/assets/integration_test_data/guests.xml",      # Path to target ontology
    reference_matching_path="/content/drive/MyDrive/assets/integration_test_data/reference.xml" # Path to ground-truth reference alignments
)
print("dataset key-values:", dataset.keys())
# Example output: dict_keys(['dataset-info', 'source', 'target', 'reference'])
print("dataset [source]: ", dataset["source"][:3])
print("dataset [target]: ", dataset["target"][:3])
print("dataset [reference]: ", dataset["reference"][:3])

100%|██████████| 41/41 [00:00<00:00, 32197.43it/s]

dataset key-values: dict_keys(['dataset-info', 'source', 'target', 'reference'])
dataset [source]:  [{'subject': ('http://example.org/ontology/customer#birth_day', 'birth_day'), 'predicate': ('http://www.w3.org/2000/01/rdf-schema#range', 'range'), 'object': ('http://www.w3.org/2001/XMLSchema#string', 'string'), 'subject_is_class': False, 'object_is_class': False}, {'subject': ('http://example.org/ontology/customer#a2ab1195-6e12-4ba4-9550-94cd8e9a9433', 'a2ab1195-6e12-4ba4-9550-94cd8e9a9433'), 'predicate': ('http://www.w3.org/1999/02/22-rdf-syntax-ns#type', 'type'), 'object': ('http://example.org/ontology/customer#Customer', 'Customer'), 'subject_is_class': False, 'object_is_class': True}, {'subject': ('http://example.org/ontology/customer#Customer', 'Customer'), 'predicate': ('http://www.w3.org/1999/02/22-rdf-syntax-ns#type', 'type'), 'object': ('http://www.w3.org/2002/07/owl#Class', 'Class'), 'subject_is_class': True, 'object_is_class': False}]
dataset [target]:  [{'subject': ('http://




Print a sample of the parsed source ontology data

In [14]:
print("Sample source ontology:", dataset['source'][0])

Sample source ontology: {'subject': ('http://example.org/ontology/customer#birth_day', 'birth_day'), 'predicate': ('http://www.w3.org/2000/01/rdf-schema#range', 'range'), 'object': ('http://www.w3.org/2001/XMLSchema#string', 'string'), 'subject_is_class': False, 'object_is_class': False}
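
Each parsed triple is a dictionary whose `subject`, `predicate`, and `object` fields hold an (IRI, label) pair, plus two flags marking whether the subject or object is a class. As a minimal sketch of how such data could be inspected, the snippet below counts predicate labels and lists class subjects over the three source triples printed in Step 2. The names `sample_triples`, `count_predicates`, and `class_subjects` are illustrative and not part of ontoaligner.

```python
from collections import Counter

# The three source triples shown in the Step 2 output, copied verbatim.
sample_triples = [
    {"subject": ("http://example.org/ontology/customer#birth_day", "birth_day"),
     "predicate": ("http://www.w3.org/2000/01/rdf-schema#range", "range"),
     "object": ("http://www.w3.org/2001/XMLSchema#string", "string"),
     "subject_is_class": False, "object_is_class": False},
    {"subject": ("http://example.org/ontology/customer#a2ab1195-6e12-4ba4-9550-94cd8e9a9433",
                 "a2ab1195-6e12-4ba4-9550-94cd8e9a9433"),
     "predicate": ("http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "type"),
     "object": ("http://example.org/ontology/customer#Customer", "Customer"),
     "subject_is_class": False, "object_is_class": True},
    {"subject": ("http://example.org/ontology/customer#Customer", "Customer"),
     "predicate": ("http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "type"),
     "object": ("http://www.w3.org/2002/07/owl#Class", "Class"),
     "subject_is_class": True, "object_is_class": False},
]


def count_predicates(triples):
    """Count how often each predicate label (second tuple element) occurs."""
    return Counter(t["predicate"][1] for t in triples)


def class_subjects(triples):
    """Return the labels of subjects that are flagged as classes."""
    return [t["subject"][1] for t in triples if t["subject_is_class"]]


predicate_counts = count_predicates(sample_triples)
classes = class_subjects(sample_triples)
print(predicate_counts)
print(classes)
```

Running the same helpers over the full `dataset["source"]` and `dataset["target"]` lists would give a quick overview of how many class, property, and instance triples each ontology contributes.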


### Step 3: Encode the dataset into a format suitable for the aligner

In [15]:
encoder = GraphTripleEncoder()
encoded_dataset = encoder(**dataset)                           # Transforms raw ontology triples into embedding-compatible format

### Step 4: Define training parameters for the ConvE aligner

In [16]:
kge_params = {
    'device': 'cuda',                  # Device to use ('cpu' or 'cuda')
    'embedding_dim': 300,            # Size of embedding vectors
    'num_epochs': 50,                # Total number of training epochs
    'train_batch_size': 128,         # Batch size for training
    'eval_batch_size': 64,           # Batch size for evaluation
    'num_negs_per_pos': 5,           # Number of negative samples for each positive sample
    'random_seed': 42,               # Seed for reproducibility
}
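
Because `'device': 'cuda'` is hard-coded above, the notebook presumably needs a GPU runtime to train. A minimal sketch of a fallback, assuming PyTorch is the backend (the logs show PyKEEN, which builds on it): `pick_device` and `params` are hypothetical names introduced here for illustration only.

```python
def pick_device(preferred="cuda"):
    """Return `preferred` if it is usable, otherwise fall back to 'cpu'."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    if preferred == "cuda" and not torch.cuda.is_available():
        return "cpu"
    return preferred


device = pick_device()
# Illustrative subset of the parameters defined above, with the device chosen at runtime.
params = {"device": device, "embedding_dim": 300}
print(params)
```

The same idea can be applied to the full parameter dictionary by assigning the helper's result to its `'device'` entry before the aligner is created.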

### Step 5: Initialize and train the aligner model
For faster training, connect to a T4 GPU runtime.

In [17]:
aligner = ConvEAligner(**kge_params)
matchings = aligner.generate(input_data=encoded_dataset)       # Generate predicted alignments between source and target ontologies

INFO:pykeen.pipeline.api:Using device: cuda
INFO:pykeen.nn.modules:Resolving None * None * None = 300.
INFO:pykeen.nn.representation:Inferred unique=False for Embedding()
INFO:pykeen.nn.representation:Inferred unique=False for Embedding()
INFO:pykeen.nn.representation:Inferred unique=False for Embedding()
INFO:pykeen.triples.triples_factory:Creating inverse triples.


Training epochs on cuda:0:   0%|          | 0/50 [00:00<?, ?epoch/s]

INFO:pykeen.triples.triples_factory:Creating inverse triples.
INFO:pykeen.training.training_loop:Dropping last (incomplete) batch each epoch (1/1 (100.00%) batches).


Training batches on cuda:0:   0%|          | 0.00/1.00 [00:00<?, ?batch/s]



Evaluating on cuda:0:   0%|          | 0.00/83.0 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 0.11s seconds
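
The `Dropping last (incomplete) batch each epoch (1/1 (100.00%) batches)` message in the log above deserves attention: it suggests that the only training batch was smaller than `train_batch_size` and was discarded. The arithmetic can be sketched as follows. `count_batches` is a hypothetical helper, and the triple count of 83 is borrowed from the evaluation progress bar purely for illustration, since the actual number of training triples is not shown.

```python
import math


def count_batches(num_triples, batch_size):
    """Return (total_batches, incomplete_batches) for one epoch."""
    total = math.ceil(num_triples / batch_size)   # batches including a partial one
    full = num_triples // batch_size              # batches that are completely filled
    return total, total - full


# Illustrative numbers: 83 triples (assumed), batch size 128 as configured in Step 4.
total, dropped = count_batches(83, 128)
print(f"{dropped}/{total} batches are incomplete")

# A smaller batch size leaves some full batches to train on.
total_small, dropped_small = count_batches(83, 32)
print(f"{dropped_small}/{total_small} batches are incomplete")
```

If this reading of the log is right, a `train_batch_size` smaller than the number of training triples may be needed for the model to receive any gradient updates on such a small dataset.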


### Step 6: Post-process the predicted matchings
Filters matchings using a similarity threshold (e.g., 0.5)

In [18]:
processed_matchings = graph_postprocessor(predicts=matchings, threshold=0.5)

100%|██████████| 1/1 [00:00<00:00, 14563.56it/s]
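
As a rough illustration of threshold filtering, the sketch below keeps only matchings whose score reaches the threshold. It is not the actual `graph_postprocessor` implementation, whose comparison rule and any additional steps are not shown in this notebook. The helper `filter_by_threshold` is hypothetical, and the single input row is the prediction reported in the results table of this notebook.

```python
def filter_by_threshold(matchings, threshold=0.5):
    """Keep matchings whose 'score' is at least `threshold` (assumed rule)."""
    return [m for m in matchings if m["score"] >= threshold]


predicted = [
    {
        "source": "http://example.org/ontology/customer#Customer",
        "target": "http://example.org/ontology/guest#Guest",
        "score": 0.03763,
    }
]

kept = filter_by_threshold(predicted, threshold=0.5)
print(kept)  # → []
```

With a score of 0.03763, the only prediction falls well below 0.5, which is consistent with the evaluation report showing `'predictions-len': 0` after post-processing.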


## Visualise results using a pandas DataFrame

In [19]:
import pandas as pd
import numpy as np
df_matchings = pd.DataFrame(matchings)

# Build lookup dictionary
lookup = {
    (d["source"], d["target"]): d["score"]
    for d in processed_matchings
}

# Assign new column
df_matchings["post_processed_score"] = df_matchings.apply(
    lambda row: lookup.get((row["source"], row["target"])),
    axis=1
)

ref_df = pd.DataFrame(dataset["reference"])

# Convert "=" → 1.0, everything else → NaN (or 0.0 if you prefer)
ref_df["true_match"] = np.where(ref_df["relation"] == "=", 1.0, np.nan)

df_matchings = df_matchings.merge(
    ref_df[["source", "target", "true_match"]],
    on=["source", "target"],
    how="left"
)

df_matchings

|   | source | target | score | post_processed_score | true_match |
|---|--------|--------|-------|----------------------|------------|
| 0 | http://example.org/ontology/customer#Customer | http://example.org/ontology/guest#Guest | 0.03763 |  |  |


### Step 7: Evaluate matchings before and after post-processing

In [20]:
evaluation = metrics.evaluation_report(predicts=matchings, references=dataset['reference'])
print("Matching Evaluation Report:\n", evaluation)

evaluation = metrics.evaluation_report(predicts=processed_matchings, references=dataset['reference'])
print("Matching Evaluation Report -- after post-processing:\n", evaluation)

Matching Evaluation Report:
 {'intersection': 0, 'precision': 0.0, 'recall': 0.0, 'f-score': 0, 'predictions-len': 1, 'reference-len': 5}
Matching Evaluation Report -- after post-processing:
 {'intersection': 0, 'precision': 0, 'recall': 0.0, 'f-score': 0, 'predictions-len': 0, 'reference-len': 5}
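
The report fields can be related to the standard set-based definitions of precision, recall, and F-score. The sketch below is an assumption about how such numbers are typically derived, not a copy of `metrics.evaluation_report`; `prf` is a hypothetical helper, and the library may scale or round its values differently.

```python
def prf(intersection, predictions_len, reference_len):
    """Compute (precision, recall, f_score), guarding against empty sets."""
    precision = intersection / predictions_len if predictions_len else 0.0
    recall = intersection / reference_len if reference_len else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Counts taken from the two reports above.
before = prf(intersection=0, predictions_len=1, reference_len=5)
after = prf(intersection=0, predictions_len=0, reference_len=5)
print("before post-processing:", before)
print("after post-processing: ", after)

# Made-up counts, only to show non-zero values.
example = prf(intersection=3, predictions_len=4, reference_len=5)
print("illustrative:", example)
```

Since the single predicted pair is not among the five reference alignments, the intersection is zero and every metric is zero both before and after filtering.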


### Step 8: Convert processed matchings to XML format and save to file

In [None]:
xml_str = xmlify.xml_alignment_generator(matchings=processed_matchings)
with open("matchings.xml", "w", encoding="utf-8") as xml_file:
    xml_file.write(xml_str)