![logo](https://raw.githubusercontent.com/sciknoworg/OntoAligner/main/images/logo-with-background.png)

[![PyPI version](https://badge.fury.io/py/OntoAligner.svg)](https://badge.fury.io/py/OntoAligner)
[![PyPI Downloads](https://static.pepy.tech/badge/ontoaligner)](https://pepy.tech/projects/ontoaligner)
![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://github.com/pre-commit/pre-commit)
[![Documentation Status](https://readthedocs.org/projects/ontoaligner/badge/?version=main)](https://ontoaligner.readthedocs.io/)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](MAINTANANCE.md)
 [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.14533133.svg)](https://doi.org/10.5281/zenodo.14533133)

- **Documentation website**: [https://ontoaligner.readthedocs.io/index.html](https://ontoaligner.readthedocs.io/index.html)
- **Resource Paper**: [https://doi.org/10.1007/978-3-031-94578-6_10](https://doi.org/10.1007/978-3-031-94578-6_10)


--------

# Deep Dive into OntoAligner Modules 1 (Parser, Encoder, Evaluator and Exporter)


Before diving in, let's look at the big picture of how ontology alignment (OA) models operate.


![](https://mdpi-res.com/futureinternet/futureinternet-02-00238/article_deploy/html/images/futureinternet-02-00238-g001.png)


**Key components to consider when developing an OA system:**
- How should I load the source ontology and extract the data needed for alignment? Only `concepts` (classes), `concepts` + `childs` (classes and their child classes), or something else?
- How should the input for OA models be prepared?
- Do the alignments need post-processing after matching, such as threshold-based filtering? --> ``topic of the next tutorial``!
- How should the alignments be evaluated?
- How should the results be stored?

--------
Contents of this tutorial:
1. How the ``Parser`` module works?
2. What is the ``task`` (or ``OMDataset``) in OntoAligner?
3. How the ``Encoder`` module works?
4. How the ``Exporter`` module works.
5. Putting it all together - A complete workflow with an ``Evaluation``.

--------


In [None]:
# Install the OntoAligner library. (restart the notebook after installation)
# Quote the version specifier so the shell does not treat ">" as an output redirect.
!pip install -q ontoaligner "numpy>=2.0"

---

# 1️⃣. How the ``Parser`` module works?

The **Parser** module is responsible for reading ontology files (OWL, RDF/XML, etc.) and converting them into a standardized format that OntoAligner can work with. It extracts key information such as:

- Entity IRIs (Internationalized Resource Identifiers)
- Labels and names
- Comments and documentation
- Synonyms
- Hierarchical relationships (parents, children)

**Key Parser Classes:**
- `BaseOntologyParser` - Base parser for ontologies; ``GenericOntology`` builds on it to parse ontologies in a generic manner.
- `BaseAlignmentsParser` - Parses alignment/matching/reference files that are being used for evaluation.
- Domain-specific parsers for OAEI benchmark tasks

The cell below parses a source ontology, a target ontology, and their reference alignments:

In [4]:
from ontoaligner.ontology import GenericOntology
from ontoaligner.base import BaseAlignmentsParser

# Initialize parsers
ontology_parser = GenericOntology()
alignment_parser = BaseAlignmentsParser()

# Parse source ontology
source_parsed = ontology_parser.parse(
    "https://raw.githubusercontent.com/sciknoworg/OntoAligner/main/assets/mouse-human/source.xml"
)

# Parse target ontology
target_parsed = ontology_parser.parse(
    "https://raw.githubusercontent.com/sciknoworg/OntoAligner/main/assets/mouse-human/target.xml"
)

# Parse reference alignments
alignments_parsed = alignment_parser.parse(
    "https://raw.githubusercontent.com/sciknoworg/OntoAligner/main/assets/mouse-human/reference.xml"
)

print(f"\nSource entities parsed: {len(source_parsed)}")
print(f"Target entities parsed: {len(target_parsed)}")
print(f"Alignments parsed: {len(alignments_parsed)}")

2744it [00:00, 13836.60it/s]
3304it [00:00, 5829.51it/s]
100%|██████████| 9102/9102 [00:00<00:00, 32444.79it/s]


Source entities parsed: 2743
Target entities parsed: 3304
Alignments parsed: 1516





In [6]:
from pprint import pprint

In [9]:
# Examine a parsed entity in detail (index 5, i.e. the sixth entity)
print("\nDetailed view of a source entity:")
pprint(source_parsed[5])


Detailed view of a source entity:
{'childrens': [],
 'comment': [],
 'iri': 'http://mouse.owl#MA_0000006',
 'label': 'head/neck',
 'name': 'head/neck',
 'parents': [{'iri': 'http://mouse.owl#MA_0002433',
              'label': 'anatomic region',
              'name': 'anatomic region'}],
 'synonyms': []}
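
Because every parsed entity is a plain dictionary with the fields shown above, a lookup table keyed by IRI is often handy when inspecting alignments later. The following is a minimal sketch that uses two records copied by hand from this tutorial's outputs rather than the full parsed list; the helper name `build_iri_index` is hypothetical and not part of OntoAligner.

```python
from typing import Dict, List

# Two entities copied by hand from the parsed outputs shown in this tutorial.
sample_entities: List[dict] = [
    {
        "iri": "http://mouse.owl#MA_0000006",
        "name": "head/neck",
        "label": "head/neck",
        "childrens": [],
        "comment": [],
        "synonyms": [],
        "parents": [{"iri": "http://mouse.owl#MA_0002433",
                     "label": "anatomic region",
                     "name": "anatomic region"}],
    },
    {
        "iri": "http://mouse.owl#MA_0000001",
        "name": "mouse anatomy",
        "label": "mouse anatomy",
        "childrens": [],
        "comment": [],
        "synonyms": [],
        "parents": [],
    },
]


def build_iri_index(entities: List[dict]) -> Dict[str, dict]:
    """Map each entity IRI to its parsed record (hypothetical helper)."""
    return {entity["iri"]: entity for entity in entities}


iri_index = build_iri_index(sample_entities)
entity = iri_index["http://mouse.owl#MA_0000006"]
parent_labels = [parent["label"] for parent in entity["parents"]]
print(entity["label"], "->", parent_labels)
```

With the real data, the same helper could be applied to `source_parsed` and `target_parsed` to turn the IRIs in an alignment into readable labels.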


In [10]:
# Examine parsed alignments
print("\nSample parsed alignments:")
pprint(alignments_parsed[:5])


Sample parsed alignments:
[{'relation': '=',
  'source': 'http://mouse.owl#MA_0002401',
  'target': 'http://human.owl#NCI_C52561'},
 {'relation': '=',
  'source': 'http://mouse.owl#MA_0000270',
  'target': 'http://human.owl#NCI_C33736'},
 {'relation': '=',
  'source': 'http://mouse.owl#MA_0001951',
  'target': 'http://human.owl#NCI_C12715'},
 {'relation': '=',
  'source': 'http://mouse.owl#MA_0002303',
  'target': 'http://human.owl#NCI_C52701'},
 {'relation': '=',
  'source': 'http://mouse.owl#MA_0001543',
  'target': 'http://human.owl#NCI_C12385'}]
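
To make the shape of these records less of a black box, here is a rough sketch of how a list of `source`/`target`/`relation` dictionaries can be read from an alignment file with Python's standard `xml.etree.ElementTree`. It assumes the reference file uses the same `Cell` layout as the exported XML shown in the Exporter section of this tutorial; the function `read_cells` and the inline sample are illustrative and are not how `BaseAlignmentsParser` is implemented.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Namespaces taken from the exported alignment XML shown in this tutorial.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ALIGN_NS = "http://knowledgeweb.semanticweb.org/heterogeneity/alignment"

# A one-cell sample document (hand-written for illustration).
SAMPLE_XML = """<?xml version="1.0"?>
<rdf:RDF xmlns="http://knowledgeweb.semanticweb.org/heterogeneity/alignment"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <Alignment>
    <map>
      <Cell>
        <entity1 rdf:resource="http://mouse.owl#MA_0002401"/>
        <entity2 rdf:resource="http://human.owl#NCI_C52561"/>
        <relation>=</relation>
      </Cell>
    </map>
  </Alignment>
</rdf:RDF>
"""


def read_cells(xml_text: str) -> List[Dict[str, str]]:
    """Collect one dict per <Cell> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    cells = []
    for cell in root.iter(f"{{{ALIGN_NS}}}Cell"):
        entity1 = cell.find(f"{{{ALIGN_NS}}}entity1")
        entity2 = cell.find(f"{{{ALIGN_NS}}}entity2")
        cells.append({
            "relation": cell.findtext(f"{{{ALIGN_NS}}}relation"),
            "source": entity1.get(f"{{{RDF_NS}}}resource"),
            "target": entity2.get(f"{{{RDF_NS}}}resource"),
        })
    return cells


cells = read_cells(SAMPLE_XML)
print(cells)
```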


# 2️⃣. What is the ``task`` (or ``OMDataset``) in OntoAligner?




Ontology alignment systems often handle multiple tasks in a benchmark dataset. Each task can be evaluated independently. In OntoAligner, a task typically consists of:
* **Source ontology**: the first ontology in the alignment.
* **Target ontology**: the second ontology you want to align to the source.
* **Reference or gold standard alignment** (optional): used for evaluation.


üí°üí°üí° Explore how two different parsers work together to build up a task (a.k.a OMDataset) [https://ontoaligner.readthedocs.io/developerguide/parsers.html](https://ontoaligner.readthedocs.io/developerguide/parsers.html)


**Why does this matter?** OntoAligner packages an OA task as a single module, which simplifies experimentation, for example in OntoAligner's workflow.

In [None]:
from ontoaligner.ontology import GenericOMDataset

task = GenericOMDataset()

dataset = task.collect(
    source_ontology_path="https://raw.githubusercontent.com/sciknoworg/OntoAligner/main/assets/mouse-human/source.xml",
    target_ontology_path="https://raw.githubusercontent.com/sciknoworg/OntoAligner/main/assets/mouse-human/target.xml",
    reference_matching_path="https://raw.githubusercontent.com/sciknoworg/OntoAligner/main/assets/mouse-human/reference.xml"
)

In [13]:
# Inspect a source ontology entity
print("\nSource Ontology Entity:")
pprint(dataset['source'][0])


Source Ontology Entity:
{'childrens': [],
 'comment': [],
 'iri': 'http://mouse.owl#MA_0000001',
 'label': 'mouse anatomy',
 'name': 'mouse anatomy',
 'parents': [],
 'synonyms': []}


In [25]:
# Inspect a target ontology entity
print("\nTarget Ontology Entity:")
pprint(dataset['target'][0])


Target Ontology Entity:
{'childrens': [{'iri': 'http://human.owl#NCI_C12680',
                'label': 'Body_Region',
                'name': 'Body_Region'},
               {'iri': 'http://human.owl#NCI_C12919',
                'label': 'Organ_System',
                'name': 'Organ_System'},
               {'iri': 'http://human.owl#NCI_C13018',
                'label': 'Organ',
                'name': 'Organ'},
               {'iri': 'http://human.owl#NCI_C13236',
                'label': 'Body_Fluid_or_Substance',
                'name': 'Body_Fluid_or_Substance'},
               {'iri': 'http://human.owl#NCI_C21599',
                'label': 'Microanatomy',
                'name': 'Microanatomy'},
               {'iri': 'http://human.owl#NCI_C25444',
                'label': 'Cavity',
                'name': 'Cavity'},
               {'iri': 'http://human.owl#NCI_C32221',
                'label': 'Body_Part',
                'name': 'Body_Part'},
               {'iri': 'http://huma

In [23]:
# Inspect reference alignments
print("\nReference Alignments (first 5):")
pprint(dataset['reference'][:5])


Reference Alignments (first 5):
[{'relation': '=',
  'source': 'http://mouse.owl#MA_0002401',
  'target': 'http://human.owl#NCI_C52561'},
 {'relation': '=',
  'source': 'http://mouse.owl#MA_0000270',
  'target': 'http://human.owl#NCI_C33736'},
 {'relation': '=',
  'source': 'http://mouse.owl#MA_0001951',
  'target': 'http://human.owl#NCI_C12715'},
 {'relation': '=',
  'source': 'http://mouse.owl#MA_0002303',
  'target': 'http://human.owl#NCI_C52701'},
 {'relation': '=',
  'source': 'http://mouse.owl#MA_0001543',
  'target': 'http://human.owl#NCI_C12385'}]


**Or you can use a dedicated OA dataset module:**

In [27]:
from ontoaligner.ontology.oaei import MouseHumanOMDataset

task = MouseHumanOMDataset()

dataset = task.collect(
    source_ontology_path="https://raw.githubusercontent.com/sciknoworg/OntoAligner/main/assets/mouse-human/source.xml",
    target_ontology_path="https://raw.githubusercontent.com/sciknoworg/OntoAligner/main/assets/mouse-human/target.xml",
    reference_matching_path="https://raw.githubusercontent.com/sciknoworg/OntoAligner/main/assets/mouse-human/reference.xml"
)

print(f"\nDataset Info: {dataset['dataset-info']}")
print(f"Number of source entities: {len(dataset['source'])}")
print(f"Number of target entities: {len(dataset['target'])}")
print(f"Number of reference alignments: {len(dataset['reference'])}")

2744it [00:00, 6365.82it/s]
3304it [00:00, 3778.01it/s]
100%|██████████| 9102/9102 [00:00<00:00, 47764.76it/s]



Dataset Info: {'track': 'anatomy', 'ontology-name': 'mouse-human'}
Number of source entities: 2737
Number of target entities: 3298
Number of reference alignments: 1516


The base ``OMDataset`` module structure:

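```python
from abc import ABC
from typing import Any, Dict

from ontoaligner.base import BaseAlignmentsParser


class OMDataset(ABC):
    track: str = ""
    ontology_name: str = ""

    # Parsers for the source and target ontologies (set by subclasses)
    source_ontology: Any = None
    target_ontology: Any = None

    # Parser for the reference alignments
    alignments: Any = BaseAlignmentsParser()

    def collect(self,
                source_ontology_path: str,
                target_ontology_path: str,
                reference_matching_path: str = "") -> Dict:
        ...  # body elided in this overview
```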

Read more on OA datasets and tasks definitions at: https://ontoaligner.readthedocs.io/developerguide/parsers.html
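
To illustrate how such a class ties three parsers together, here is a toy stand-in that mirrors the attribute layout above. Everything in it is an assumption made for illustration: `ToyParser` and `ToyOMDataset` are hypothetical names, the parsers return canned records instead of reading files, and the handling of an empty `reference_matching_path` is a guess. Only the returned keys (`dataset-info`, `source`, `target`, `reference`) and the `dataset-info` fields are taken from the outputs shown in this tutorial.

```python
from typing import Any, Dict, List


class ToyParser:
    """Stand-in parser that returns canned records instead of reading a file."""

    def __init__(self, records: List[dict]):
        self.records = records

    def parse(self, path: str) -> List[dict]:
        # A real parser would read `path`; this toy version ignores it.
        return list(self.records)


class ToyOMDataset:
    """Hypothetical mirror of the OMDataset layout (not the library class)."""

    track: str = "anatomy"
    ontology_name: str = "mouse-human"

    source_ontology: Any = ToyParser(
        [{"iri": "http://mouse.owl#MA_0000001", "label": "mouse anatomy"}]
    )
    target_ontology: Any = ToyParser(
        [{"iri": "http://human.owl#NCI_C12219",
          "label": "Anatomic_Structure_System_or_Substance"}]
    )
    alignments: Any = ToyParser([])

    def collect(self,
                source_ontology_path: str,
                target_ontology_path: str,
                reference_matching_path: str = "") -> Dict:
        # Assumption: without a reference path, the reference list stays empty.
        reference = (
            self.alignments.parse(reference_matching_path)
            if reference_matching_path else []
        )
        return {
            "dataset-info": {"track": self.track,
                             "ontology-name": self.ontology_name},
            "source": self.source_ontology.parse(source_ontology_path),
            "target": self.target_ontology.parse(target_ontology_path),
            "reference": reference,
        }


toy_dataset = ToyOMDataset().collect("source.xml", "target.xml")
print(list(toy_dataset.keys()))
```

The point of the design is that swapping a parser (for example, a domain-specific one) changes how the files are read without changing the shape of the dictionary that later steps consume.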

---

# 3️⃣. How the ``Encoder`` Module Works?

As you can see, the parser returns a lot of information for each class in the source or target ontology:
```
{'childrens': [{'iri': 'http://human.owl#NCI_C12680',
                'label': 'Body_Region',
                'name': 'Body_Region'},
               {'iri': 'http://human.owl#NCI_C12919',
                'label': 'Organ_System',
                'name': 'Organ_System'},
               {'iri': 'http://human.owl#NCI_C13018',
                'label': 'Organ',
                'name': 'Organ'},
               {'iri': 'http://human.owl#NCI_C13236',
                'label': 'Body_Fluid_or_Substance',
                'name': 'Body_Fluid_or_Substance'},
               {'iri': 'http://human.owl#NCI_C21599',
                'label': 'Microanatomy',
                'name': 'Microanatomy'},
               {'iri': 'http://human.owl#NCI_C25444',
                'label': 'Cavity',
                'name': 'Cavity'},
               {'iri': 'http://human.owl#NCI_C32221',
                'label': 'Body_Part',
                'name': 'Body_Part'},
               {'iri': 'http://human.owl#NCI_C33904',
                'label': 'Other_Anatomic_Concept',
                'name': 'Other_Anatomic_Concept'}],
 'comment': [],
 'iri': 'http://human.owl#NCI_C12219',
 'label': 'Anatomic_Structure_System_or_Substance',
 'name': 'Anatomic_Structure_System_or_Substance',
 'parents': [],
 'synonyms': []}
 ```
Which of these pieces of information should be used as input to the alignment model, and in what format?

This is where the **Encoder** module comes into play: it converts the parsed output into a structured textual format that later steps feed to the alignment models. The textual representation of an entity can be the class alone, or the class accompanied by its parents or children. **The encoder is also responsible for preprocessing.**


**Key encoding approaches covered here:**
* ``ConceptLightweightEncoder``: A basic textual entity representation that uses only class labels.
* ``ConceptChildrenLightweightEncoder``: A basic textual entity representation that uses class labels and their children.
* ``ConceptParentLightweightEncoder``: A basic textual entity representation that uses class labels and their parents.


üí°üí° List of Encoders are presented at: https://ontoaligner.readthedocs.io/package_reference/encoders.html

üìùüìù Encoders are mostly model independent, for different models we might apply different encoder dependeing what information each aligner wants. But with OntoAligner you can plug and play with existing modules.

In [47]:
from ontoaligner.encoder import (
    ConceptLightweightEncoder,
    ConceptChildrenLightweightEncoder,
    ConceptParentLightweightEncoder,
)

encoder = ConceptLightweightEncoder()

encoded_source_onto, encoded_target_onto = encoder(
        source=dataset['source'],
        target=dataset['target']
)

print("Encoded source ontology:")
pprint(encoded_source_onto[0])

print("Encoded target ontology:")
pprint(encoded_target_onto[0])

Encoded source ontology:
{'iri': 'http://mouse.owl#MA_0000001', 'text': 'mouse anatomy'}
Encoded target ontology:
{'iri': 'http://human.owl#NCI_C12219',
 'text': 'anatomic structure system or substance'}


In [48]:
encoder = ConceptChildrenLightweightEncoder()

encoded_source_onto, encoded_target_onto = encoder(
        source=dataset['source'],
        target=dataset['target']
)

print("Encoded source ontology:")
pprint(encoded_source_onto[0])

print("Encoded target ontology:")
pprint(encoded_target_onto[0])

Encoded source ontology:
{'iri': 'http://mouse.owl#MA_0000001', 'text': 'mouse anatomy  '}
Encoded target ontology:
{'iri': 'http://human.owl#NCI_C12219',
 'text': 'anatomic structure system or substance  body region, other anatomic '
         'concept, organ, cavity, organ system, body fluid or substance, '
         'microanatomy, body part'}


In [54]:
encoder = ConceptParentLightweightEncoder()

encoded_source_onto, encoded_target_onto = encoder(
        source=dataset['source'],
        target=dataset['target']
)

print("Encoded source ontology:")
pprint(encoded_source_onto[0])

print("Encoded target ontology:")
pprint(encoded_target_onto[0])

Encoded source ontology:
{'iri': 'http://mouse.owl#MA_0000001', 'text': 'mouse anatomy  '}
Encoded target ontology:
{'iri': 'http://human.owl#NCI_C12219',
 'text': 'anatomic structure system or substance  '}


---

# 4️⃣. How the ``Exporter`` Module Works

The **Exporter** module converts computed alignments into standardized formats that can be used by other systems and evaluation benchmarks. It handles format conversion and validation.

**Supported Export Formats:**
- **RDF/XML**: Standard semantic web format
- **SKOS-XL**: SKOS (Simple Knowledge Organization System) eXtension for Labels
- **Alignment Format (EDOAL)**: Standard ontology alignment format
- **CSV**: Tabular format for easy inspection
- **JSON**: Structured data format

**Key Exporter Classes:**
- `BaseExporter` - Base exporting functionality
- `RDFExporter` - Export to RDF/XML format
- `AlignmentFormatExporter` - Export to OAEI alignment format

In [69]:
from ontoaligner.utils import xmlify

# Create sample alignments to export
sample_alignments = [
    {
        'source': 'http://mouse.owl#MA_0002401',
        'target': 'http://human.owl#NCI_C52561',
        'relation': '=',
        'score': 0.95
    },
    {
        'source': 'http://mouse.owl#MA_0000270',
        'target': 'http://human.owl#NCI_C33736',
        'relation': '=',
        'score': 0.87
    },
    {
        'source': 'http://mouse.owl#MA_0001951',
        'target': 'http://human.owl#NCI_C12715',
        'relation': '=',
        'score': 0.92
    }
]

In [70]:
# 📄 Export to XML (RDF Alignment Format)
alignment_xml = xmlify.xml_alignment_generator(matchings=sample_alignments)

print(alignment_xml)

with open("alignments.xml", "w", encoding="utf-8") as f:
    f.write(alignment_xml)

<?xml version="1.0" ?>
<rdf:RDF xmlns="http://knowledgeweb.semanticweb.org/heterogeneity/alignment" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
  <Alignment>
    <xml>yes</xml>
    <level>0</level>
    <type>??</type>
    <map>
      <Cell>
        <entity1 rdf:resource="http://mouse.owl#MA_0002401"/>
        <entity2 rdf:resource="http://human.owl#NCI_C52561"/>
        <relation>=</relation>
        <measure rdf:datatype="xsd:float">0.9</measure>
      </Cell>
    </map>
    <map>
      <Cell>
        <entity1 rdf:resource="http://mouse.owl#MA_0000270"/>
        <entity2 rdf:resource="http://human.owl#NCI_C33736"/>
        <relation>=</relation>
        <measure rdf:datatype="xsd:float">0.8</measure>
      </Cell>
    </map>
    <map>
      <Cell>
        <entity1 rdf:resource="http://mouse.owl#MA_0001951"/>
        <entity2 rdf:resource="http://human.owl#NCI_C12715"/>
        <relation>=</relation>
        <measure rdf:dataty

In [71]:
# 🧾 Export to JSON
import json

with open("alignments.json", "w", encoding="utf-8") as f:
    json.dump(sample_alignments, f, indent=4, ensure_ascii=False)

In [73]:
# 📊 Export to CSV
import csv

with open("alignments.csv", "w", newline='', encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=['source', 'target', 'relation', 'score'])
    writer.writeheader()
    writer.writerows(sample_alignments)
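
One detail worth keeping in mind with the CSV route is that every value comes back as a string when the file is read again. The following sketch, which uses only the standard library and a temporary directory (the single sample row is copied from `sample_alignments` above), writes one alignment and reads it back, converting `score` to a float by hand:

```python
import csv
import os
import tempfile

# One row copied from the sample alignments used in this section.
rows = [
    {
        "source": "http://mouse.owl#MA_0002401",
        "target": "http://human.owl#NCI_C52561",
        "relation": "=",
        "score": 0.95,
    },
]
fieldnames = ["source", "target", "relation", "score"]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "alignments.csv")

    # Write the rows exactly as in the export cell above.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    # Read them back; DictReader yields strings, so restore the float score.
    with open(csv_path, newline="", encoding="utf-8") as f:
        loaded = [{**row, "score": float(row["score"])} for row in csv.DictReader(f)]

print(loaded == rows)
```

JSON does not need this step, since `json.load` restores numbers as numbers.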

---

# 5️⃣. Putting it all together - A complete workflow with an ``Evaluation``.

Now let's demonstrate how all modules work together in an end-to-end ontology alignment workflow:

In [79]:
# STEP 1: PARSER - Load ontologies and alignments
from ontoaligner.ontology import GenericOMDataset

task = GenericOMDataset()
dataset = task.collect(
    source_ontology_path="https://raw.githubusercontent.com/sciknoworg/OntoAligner/main/assets/mouse-human/source.xml",
    target_ontology_path="https://raw.githubusercontent.com/sciknoworg/OntoAligner/main/assets/mouse-human/target.xml",
    reference_matching_path="https://raw.githubusercontent.com/sciknoworg/OntoAligner/main/assets/mouse-human/reference.xml"
)
print(f"✓ Loaded dataset info: {dataset.keys()}")
print(f"✓ Loaded {len(dataset['source'])} source entities")
print(f"✓ Loaded {len(dataset['target'])} target entities")
print(f"✓ Loaded {len(dataset['reference'])} reference alignments")

2744it [00:00, 13022.28it/s]
3304it [00:00, 9744.22it/s]
100%|██████████| 9102/9102 [00:00<00:00, 36866.93it/s]

✓ Loaded dataset info: dict_keys(['dataset-info', 'source', 'target', 'reference'])
✓ Loaded 2743 source entities
✓ Loaded 3304 target entities
✓ Loaded 1516 reference alignments





In [75]:
# STEP 2: ENCODER - Structure the Input
from ontoaligner.encoder import ConceptLightweightEncoder

encoder = ConceptLightweightEncoder()
encoded_source_onto, encoded_target_onto = encoder(
        source=dataset['source'],
        target=dataset['target']
)

print(f"✓ Structured {len(encoded_source_onto)} source entities")
print(f"✓ Structured {len(encoded_target_onto)} target entities")

✓ Structured 2743 source entities
✓ Structured 3304 target entities


In [78]:
# STEP 3: ALIGNER - Define the aligner model and generate the alignments
from ontoaligner.aligner import SimpleFuzzySMLightweight

model = SimpleFuzzySMLightweight(fuzzy_sm_threshold=0.7)
alignments = model.generate(input_data=[encoded_source_onto, encoded_target_onto])


print(f"\n✓ Found {len(alignments)} high-confidence alignments (>0.7)")
print("\nTop 5 alignments:")
for i, alignment in enumerate(alignments[:5], 1):
    source_id = alignment['source']
    target_id = alignment['target']
    print(f"  {i}. {source_id} -> {target_id} (confidence-score: {alignment['score']:.4f})")

100%|██████████| 2743/2743 [00:02<00:00, 1113.45it/s]


✓ Found 2159 high-confidence alignments (>0.7)

Top 5 alignments:
  1. http://mouse.owl#MA_0000001 -> http://human.owl#NCI_C21599 (confidence-score: 0.7200)
  2. http://mouse.owl#MA_0000002 -> http://human.owl#NCI_C49799 (confidence-score: 0.8444)
  3. http://mouse.owl#MA_0000003 -> http://human.owl#NCI_C12919 (confidence-score: 1.0000)
  4. http://mouse.owl#MA_0000004 -> http://human.owl#NCI_C33816 (confidence-score: 1.0000)
  5. http://mouse.owl#MA_0000006 -> http://human.owl#NCI_C12418 (confidence-score: 0.8182)
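
To give an intuition for what a thresholded fuzzy matcher does with the encoded `iri`/`text` records, here is a small stand-in built on Python's `difflib.SequenceMatcher`. This is not how `SimpleFuzzySMLightweight` is implemented: the similarity function is swapped for `difflib`'s ratio, the function `fuzzy_align` is hypothetical, and keeping only the best-scoring target per source entity is an assumption. The example IRIs are made up for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def fuzzy_align(source: List[Dict[str, str]],
                target: List[Dict[str, str]],
                threshold: float = 0.7) -> List[dict]:
    """Keep, per source entity, the best-scoring target if it passes the threshold."""
    alignments = []
    for src in source:
        best_score, best_target = 0.0, None
        for tgt in target:
            score = SequenceMatcher(None, src["text"], tgt["text"]).ratio()
            if score > best_score:
                best_score, best_target = score, tgt
        if best_target is not None and best_score >= threshold:
            alignments.append({
                "source": src["iri"],
                "target": best_target["iri"],
                "score": best_score,
            })
    return alignments


# Toy encoded entities with made-up IRIs.
toy_source = [
    {"iri": "http://example.org/source#organ_system", "text": "organ system"},
    {"iri": "http://example.org/source#tail", "text": "tail"},
]
toy_target = [
    {"iri": "http://example.org/target#organ_system", "text": "organ system"},
    {"iri": "http://example.org/target#body_part", "text": "body part"},
]

toy_alignments = fuzzy_align(toy_source, toy_target, threshold=0.7)
for alignment in toy_alignments:
    print(alignment)
```

Raising the threshold trades recall for precision: fewer pairs survive, but the surviving ones are closer string matches.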





**Now, let's evaluate the performance of the fuzzy matcher by comparing the predicted matchings with the reference alignments before exporting them.**

In [80]:
# STEP 4: EVALUATE - Measure the quality of the alignments.
import json

from ontoaligner.utils import metrics

evaluation = metrics.evaluation_report(
    predicts=alignments,
    references=dataset['reference']
)

print("Evaluation Report:\n", json.dumps(evaluation, indent=4))

Evaluation Report:
 {
    "intersection": 1163,
    "precision": 53.86753126447429,
    "recall": 76.71503957783641,
    "f-score": 63.29251700680272,
    "predictions-len": 2159,
    "reference-len": 1516
}
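
The numbers in this report follow from the three counts it contains: precision is the intersection divided by the number of predictions, recall is the intersection divided by the number of references, and the F-score is their harmonic mean. The sketch below recomputes them, first on a toy example and then from the counts reported above. It assumes that a prediction counts as correct when its `(source, target)` pair appears in the reference; the function `evaluate_pairs` is hypothetical and `metrics.evaluation_report` may differ in detail.

```python
from typing import Dict, List


def evaluate_pairs(predicts: List[dict], references: List[dict]) -> Dict[str, float]:
    """Precision, recall and F-score (in percent) over (source, target) pairs."""
    predicted = {(p["source"], p["target"]) for p in predicts}
    reference = {(r["source"], r["target"]) for r in references}
    intersection = len(predicted & reference)
    precision = intersection / len(predicted) if predicted else 0.0
    recall = intersection / len(reference) if reference else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return {
        "intersection": intersection,
        "precision": 100 * precision,
        "recall": 100 * recall,
        "f-score": 100 * f_score,
    }


# Toy example: 4 predictions, 4 references, 2 pairs in common.
toy_predicts = [{"source": s, "target": t} for s, t in
                [("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")]]
toy_references = [{"source": s, "target": t} for s, t in
                  [("a", "x"), ("b", "y"), ("e", "v"), ("f", "u")]]
toy_report = evaluate_pairs(toy_predicts, toy_references)
print(toy_report)

# Recompute the reported scores from the counts in the report above.
reported = {"intersection": 1163, "predictions-len": 2159, "reference-len": 1516}
p = reported["intersection"] / reported["predictions-len"]
r = reported["intersection"] / reported["reference-len"]
f = 2 * p * r / (p + r)
print(round(100 * p, 2), round(100 * r, 2), round(100 * f, 2))
```

Read this way, the report says that roughly three quarters of the reference alignments were found, while a little under half of the 2159 predictions were wrong, which is why threshold tuning and post-processing matter.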


In [82]:
# STEP 5: EXPORTER - Export the results in XML format
from ontoaligner.utils import xmlify

xml_str = xmlify.xml_alignment_generator(alignments)
with open("alignments.xml", "w", encoding="utf-8") as xml_file:
    xml_file.write(xml_str)

print("✓ Exported to Alignment Format (OAEI standard)")
print("  Format: RDF/XML")

✓ Exported to Alignment Format (OAEI standard)
  Format: RDF/XML


---

# ✅ Key Takeaways

1. **Parser Module**: Enables flexible parsing of ontologies from multiple file formats.
2. **OMDataset**: Defines an ontology alignment task by loading source and target ontologies together with their reference alignment in a single step.
3. **Encoder Module**: Structures and prepares ontology data as input for aligners.
4. **Exporter Module**: Produces standardized alignment outputs compatible with OAEI benchmarks.
5. **Modularity**: Each component can be used independently or combined to build complete alignment workflows.

For more information, visit the [OntoAligner Documentation](https://ontoaligner.readthedocs.io/)

-----------------------------------------------------------
-----------------------------------------------------------

📃 Acknowledgement

OntoAligner is licensed under [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)


```bibtex
@inproceedings{babaei2025ontoaligner,
  title={OntoAligner: A Comprehensive Modular and Robust Python Toolkit for Ontology Alignment},
  author={Babaei Giglou, Hamed and D'Souza, Jennifer and Karras, Oliver and Auer, S{\"o}ren},
  booktitle={European Semantic Web Conference},
  pages={174--191},
  year={2025},
  organization={Springer}
}
```