Ontology-Based Synthetic Data Generation for Neuro-Symbolic Knowledge Graph Reasoning
This repository contains the source code for my bachelor thesis at KU Leuven.
Neuro-Symbolic AI aims to bridge the gap between two paradigms: the robustness and pattern-matching capabilities of Neural AI (like KG embeddings and GNNs) and the interpretable, rigorous reasoning of Symbolic AI (e.g. formal logic and ontologies). A key application domain is Knowledge Graph Reasoning (KGR), which involves predicting missing links in a Knowledge Graph (KG) by performing multi-hop logical reasoning.
However, training effective Neuro-Symbolic models requires large datasets that specifically necessitate complex reasoning. Existing data generation methods - such as standard benchmarks, forward-chaining reasoners, or Answer Set Programming (ASP) - often produce datasets that are:
- Biased towards "easy" logic, allowing models to succeed via shallow heuristics (pattern recognition) rather than learning the underlying logical rules.
- Limited in rule coverage, failing to represent the full complexity of the ontology.
This project investigates the following research question:
How to generate high-quality data that enables a model to perform multi-hop logical reasoning rather than just pattern recognition?
The core hypothesis is that backward-chaining data generation - applying deductive reasoning on ontologies (TBox) to generate synthetic data (ABox) - can create high-quality training datasets. By constructing proof trees for derived facts, we can:
- Ensure logical consistency and diverse reasoning depths.
- Generate "hard" negative samples via proof-based corruption (breaking specific links in a valid proof chain), forcing the model to distinguish between valid and invalid reasoning paths.
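To make the idea concrete, here is a toy sketch (not the generator's actual code) of corrupting one step of a proof chain; the triple representation and entity pool are illustrative assumptions:

```python
import random

def corrupt_proof_step(proof_chain, entities, rng=random):
    """Turn a valid proof chain into a 'hard' negative sample.

    proof_chain: list of (head, relation, tail) triples forming a proof.
    entities: pool of entities to substitute in (illustrative assumption).
    Picks one step and swaps its tail for a different entity, so the
    chain still looks plausible but is no longer a valid derivation.
    """
    chain = [list(t) for t in proof_chain]
    i = rng.randrange(len(chain))          # pick one link of the proof
    _, _, tail = chain[i]
    candidates = [e for e in entities if e != tail]
    chain[i][2] = rng.choice(candidates)   # break exactly that link
    return [tuple(t) for t in chain]
```

Because only a single link changes, the model cannot reject the sample via surface statistics alone; it has to check the reasoning path.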
This repository implements this generator and evaluates the quality of the generated data by training a Recursive Reasoning Network (RRN), a Neuro-Symbolic link prediction model, as well as other baseline models to benchmark performance.
- Introduction
- Features
- Installation
- Generating datasets
- Training RRN model
- Hyperparameter Optimization (WandB Sweeps)
- Full workflow
- Custom configurations
- Experiments
- Development
- TODO
- Known issues
Don't worry if the repository looks a bit overwhelming :) I value reproducibility of scientific experiments very highly, so:
- I created a sophisticated `uv` monorepo, i.e. a single repository containing multiple packages as 'subprojects', each with their own dependencies and configurations.
- I added a Linux devcontainer for easy setup on any OS (including Windows, which is not Unix-based like Linux or macOS).
The subprojects (located in apps/) are:
- `ont_generator`: The backward-chaining ontology-based data generator I created for my thesis.
- `asp_generator`: The ASP-based family tree data generator by Patrick Hohenecker (see below).
- `rrn`: The Recursive Reasoning Network (also by Patrick Hohenecker), a neuro-symbolic link prediction model used for testing the quality of the generated datasets.
- `baselines`: A collection of baseline link prediction models (e.g., TransE, DistMult, ComplEx) to further benchmark the performance of the generated datasets.
The uv workspace setup makes it easy to manage dependencies between these subprojects. Furthermore, the repo provides a task runner (invoke) to run common tasks (e.g., generating datasets, training models, running experiments) from the project root. Use the following commands to see and run the available tasks:
```
uv run invoke --list       # list all available tasks
uv run invoke <task-name>  # run a specific task
```

This project uses uv for dependency management and invoke for task automation.
Make sure you have cloned the repo and are in the project root directory.
On Unix systems, you can locally run all commands as-is. As an alternative, follow the Windows instructions to use the devcontainer. Below are the steps to set up the project on your own macOS or Linux machine without using the devcontainer.
If you don't already have uv installed, do so first, e.g. on macOS with Homebrew:
```
brew install uv
```

Or on Linux, using the official installation script:

```
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Then, install the project dependencies:

```
uv sync
```

With uv, installing dependencies is as easy as running a single command! No conflicting requirements.txt files or anything like that :)
The family tree data generator makes use of the DLV system in order to perform symbolic reasoning over family trees by means of the ontology mentioned above.
If you are running the project on your own Linux machine, you can use the provided installation script to download and set up DLV automatically:
```
bash install-dlv-linux.sh
```

If you are running the project on your own macOS machine, you have to download the DLV executable for your platform from the official website.

After you have downloaded and extracted the DLV executable, change its permissions to make it executable:

```
chmod +x /path/to/dlv/executable
```

Finally, update the configuration file `configs/asp_generator/config.yaml` to point to the DLV executable you just downloaded:
```yaml
# configs/asp_generator/config.yaml
# ...
dlv: /path/to/dlv/executable # <- change this!
# ...
```

For the easiest use, open the devcontainer included in `.devcontainer/`, for example using VS Code:
- I assume you are in the project root directory.
- Click the `><` icon in the bottom-left corner of VS Code.
- Select `Reopen in Container`.
The (Linux) devcontainer will be built from the provided Dockerfile, and post_create.sh will take care of installing uv, syncing the project dependencies, and setting up the config files.
After the installation is complete, VS Code might prompt you with
"Press any key to exit"
Once you actually press a key, a new terminal will open in the devcontainer, but the virtual environment might not be activated yet.
Close the terminal and open a new one (CMD + J or Terminal > Create New Terminal). This new terminal should now have the virtual environment activated automatically.
You should always see (synthology) > at the beginning of the terminal prompt when working in the devcontainer, which indicates that the virtual environment is active.
You don't need to install DLV manually (like on macOS/Linux), as it is already installed in the devcontainer.
See the Development section for instructions on setting up development tools like ruff and ty (using VS Code extensions is recommended).
All generators output data in a standardized format.
Each split (train, val, test) contains:
- `facts.csv`: Base facts (explicit relations/memberships).
- `targets.csv`: All facts (base + inferred) and negative samples.
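As an illustration, a split in this format could be loaded with a few lines of Python. The exact column layout is an assumption here: one triple per row in facts.csv, plus a 0/1 label column in targets.csv:

```python
import csv
from pathlib import Path

def load_split(split_dir: str):
    """Load one train/val/test split in the standardized format.

    Assumed layout (for illustration): facts.csv rows are
    (subject, relation, object); targets.csv rows additionally
    carry a 0/1 truth label.
    """
    split = Path(split_dir)
    with open(split / "facts.csv", newline="") as f:
        facts = [tuple(row) for row in csv.reader(f)]
    with open(split / "targets.csv", newline="") as f:
        targets = [(s, r, o, int(lbl)) for s, r, o, lbl in csv.reader(f)]
    return facts, targets
```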
Below, I describe how to generate the reldata Family Tree dataset based on the ASP solver by Patrick Hohenecker.
Quick Start (generates and converts to standard format):
```
uv run invoke gen-ft-asp
```

This command generates raw reldata output in `data/asp/out-reldata` and then automatically converts it to the standard format (`facts.csv` and `targets.csv`) in `data/asp/family_tree/{train,val,test}`.
Step-by-Step (for more control):
1. **Generate raw data only:**

   ```
   uv run --package asp_generator python apps/asp_generator/src/asp_generator/create_data.py
   ```

   This generates raw reldata output in `data/asp/out-reldata` without converting.

2. **Convert to standard format (separate step):**

   ```
   uv run invoke convert-reldata
   ```

   This converts existing data in `data/asp/out-reldata` to the standard format.
To tweak the generation parameters, please refer to the configuration section.
To use the backward-chaining ontology-based generator (which outputs the standard format):
```
uv run invoke gen-ft-ont
```

Or run directly:

```
uv run --package ont_generator python -m ont_generator.create_data
```

This generates `facts.csv` and `targets.csv` in `data/ont/family/{train,val,test}`.
To train the Recursive Reasoning Network (RRN) model on the generated family tree datasets, use the following invoke task:
```
uv run invoke train-rrn
```
```
configs/rrn/
├── config.yaml
├── data/
│   ├── default.yaml
│   └── dataset/
│       ├── asp.yaml
│       └── ont.yaml
├── model/
│   └── default.yaml
└── hyperparams/
    └── default.yaml
```

You can run hyperparameter sweeps that span both the ontology data generation and the RRN model training. This allows you to find the optimal combination of dataset characteristics (e.g., complexity, size, negative sampling ratio) and model hyperparameters.
A wrapper script scripts/sweep_ont_rrn.py handles the coordination between the generator and the model.
1. **Define your sweep configuration:** Create a YAML file (e.g., `configs/my_sweep.yaml`) defining the parameters to tune. Use the prefix `gen.` for generator parameters and `rrn.` for RRN parameters. Example (`configs/sweep_sample.yaml`):

   ```yaml
   program: scripts/sweep_ont_rrn.py
   method: bayes
   metric:
     name: val_loss
     goal: minimize
   parameters:
     # Generator parameters
     gen.dataset.n_train:
       values: [1000, 2000]
     gen.negative_sampling.ratio:
       min: 0.5
       max: 2.0
     # Model parameters
     rrn.hyperparams.learning_rate:
       min: 0.0001
       max: 0.01
   ```

2. **Initialize the sweep:**

   ```
   uv run wandb sweep configs/sweep_sample.yaml
   ```

   This will output a sweep ID (e.g., `username/project/sweep_id`).

3. **Start the agent:**

   ```
   uv run wandb agent <SWEEP_ID>
   ```
The script automatically generates a temporary dataset for each run, trains the model on it, reports metrics to WandB, and cleans up the data afterwards.
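The prefix-splitting step this relies on can be sketched as follows (a simplified illustration; the real scripts/sweep_ont_rrn.py may differ):

```python
def split_sweep_config(config: dict) -> tuple[dict, dict]:
    """Split a flat WandB sweep config into generator and RRN parts.

    Keys use the gen./rrn. prefixes from the sweep YAML, e.g.
    "gen.dataset.n_train" or "rrn.hyperparams.learning_rate".
    """
    gen_cfg = {k[len("gen."):]: v for k, v in config.items() if k.startswith("gen.")}
    rrn_cfg = {k[len("rrn."):]: v for k, v in config.items() if k.startswith("rrn.")}
    return gen_cfg, rrn_cfg
```

The generator half then drives dataset creation for the run, and the RRN half is passed on to training.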
- Generate a dataset using either the ASP-based (default for now) or ontology-based generator (work in progress).
- Make sure the `data/asp/family_tree/` or `data/ont/family_tree/` folder contains 3 folders: `train/`, `val/`, and `test/`, each containing `.csv` files with triples.
- Train the RRN model on the generated dataset.
This repo uses Hydra for configuration management.
You can modify the default configurations in 2 ways:
All configurations -- for the link-prediction models and the data generators -- are stored in the configs/ folder.
You can create your own configuration files by copying and modifying the existing ones.
For example, create a hyperparams2.yaml file in configs/rrn/hyperparams/ and modify configs/rrn/config.yaml to use it:
```yaml
defaults:
  - data: default
  - model: default
  - hyperparams: hyperparams2 # <- your custom hyperparameters
  - _self_
# rest of config...
```

You can also override specific configuration options directly from the command line.
(note that this only works when running the packages directly, not via invoke)
```
uv run --package ont_generator python -m ont_generator.create_data \
    dataset.n_train=500 \
    dataset.n_val=100 \
    dataset.n_test=100
```

Another example, for training the RRN model with custom (hyper)parameters:
```
uv run --package rrn python -m rrn.train \
    hyperparams.num_epochs=20 \
    data/dataset=asp
```

I aim to complete the first two experiments listed below. Time will probably not allow for the other two, although they are very useful for evaluating the generator.
- Goal: Show that the Synthology generator creates data that requires deep reasoning
- Method:
- Train RRN on 500 KGs generated by LUBM-10 (10 universities)
- Train RRN on 500 KGs generated by Synthology (with high depth)
- Test both models on a hard test set featuring 3+ hop inferences.
- Metric: test set accuracies
- Expected Result: Synthology should perform better on the hard test set.
- Goal: Validate the proof-based corruption (hard negatives)
- Method:
- Test out all the different corruption methods:
  - Random
  - Constrained
  - Proof-based
  - Mixed
- Metric: Measure False Positive Rate (FPR) on test set that contains near-miss triples.
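For reference, the FPR computation itself is straightforward; a minimal sketch over binary predictions (1 = "true triple") might look like:

```python
def false_positive_rate(predictions, labels):
    """FPR = FP / (FP + TN) over binary predictions.

    predictions/labels: sequences of 0/1, where 1 means 'true triple'.
    Returns 0.0 when there are no negatives to misclassify.
    """
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    tn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 0)
    return fp / (fp + tn) if (fp + tn) else 0.0
```

A model trained without hard negatives should show a noticeably higher FPR on near-miss triples, since they are exactly the cases shallow heuristics accept.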
- Goal: Show that the data is information-dense
- Method:
- Take the Synthology and LUBM datasets.
- Create training subsets of increasing sizes: 10%, 25%, 50%, and 100% of the total data.
- Train the RRN from scratch on each subset.
- Evaluate all models on the exact same standard evaluation/test set.
- Metric: Learning curves for each subset size
- Expected Result:
- An intelligent generator should reach high accuracy with fewer training samples because every sample is designed to be a "lesson" in logic.
- We expect a steeper learning curve for the Synthology generator.
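The subset construction can be sketched as follows; one reasonable design choice (an assumption here, not necessarily what the experiment scripts do) is to shuffle once and take nested prefixes, so every smaller subset is contained in the larger ones:

```python
import random

def make_subsets(train_facts, fractions=(0.10, 0.25, 0.50, 1.00), seed=0):
    """Build nested training subsets of increasing size.

    Shuffling once and slicing prefixes keeps subsets nested, so
    differences between learning curves reflect data volume only.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    shuffled = train_facts[:]
    rng.shuffle(shuffled)
    return {f: shuffled[: max(1, int(len(shuffled) * f))] for f in fractions}
```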
- Goal: Show that the generator can scale to large ontologies (way more complex than family tree)
- Method:
- Select 3 ontologies with increasing complexity:
  - Level 1: simple hierarchy (standard subclasses, 1-hop relations)
  - Level 2: intermediate (transitivity and property chains)
  - Level 3: complex (heavy use of disjointness, recursive rules and deep property chains)
- Generate a fixed number of valid triples (10 000) for each ontology
- Metric:
- Total CPU execution time for generation
- Peak RAM usage during generation
- Number of "dead ends" (discarded proofs)
- Expected Result: The generator should be able to scale to large ontologies without running out of memory or taking too long to generate.
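One lightweight way to collect the time and memory metrics in Python is the stdlib time and tracemalloc modules; note that tracemalloc tracks Python heap allocations rather than total process RSS, and `generate` is a hypothetical entry point standing in for the real generator:

```python
import time
import tracemalloc

def profile_generation(generate, ontology):
    """Measure wall-clock time and peak traced memory of one run."""
    tracemalloc.start()
    start = time.perf_counter()
    result = generate(ontology)          # hypothetical generator call
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()
    return result, elapsed, peak
```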
A standard test set is chosen at random, meaning it usually consists of "easy" 1-hop facts. To create a hard test set, we must filter triples based on their proof complexity:
- Run the backward-chainer to generate a massive pool of valid inferences.
- Filter this pool to only include triples where the shortest possible proof tree requires 3 or more hops.
- Sample our test triples exclusively from this filtered pool.
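These steps can be sketched as follows, assuming each candidate inference is paired with its (precomputed) shortest proof depth — an assumed bookkeeping structure, not the generator's actual output:

```python
import random

def build_hard_test_set(inferences, min_hops=3, n=100, seed=0):
    """Filter the inference pool down to 'hard' triples and sample a test set.

    inferences: iterable of (triple, shortest_proof_depth) pairs --
    an assumed structure for illustration.
    """
    pool = [triple for triple, depth in inferences if depth >= min_hops]
    rng = random.Random(seed)  # fixed seed so the test set is reproducible
    return rng.sample(pool, min(n, len(pool)))
```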
We shall publish the hard test set(s) for reproducibility.
Creating a new subproject:
```
uv init apps/my-new-app --package
uv sync
```
Adding new dependencies only to a specific subproject:
```
uv add <dependency> --package my-new-app
```
- Add LUBM generator
- Add experiments
- Add `invoke` commands to reproduce experiments
- Add LUBM Java dependencies to devcontainer
In case the terminal doesn't show real-time updates, try setting the following environment variable:
```
export PYTHONUNBUFFERED=1
```

This forces Python to flush its output buffer immediately.