This notebook builds a **text-to-SPARQL engine**: a GPT-2 model fine-tuned on synthetic SPARQL queries generated from a given knowledge graph (a Turtle/TTL file). The pipeline has the following steps:

1. **Loading and parsing the TTL file**
2. **Generating synthetic NLQ-SPARQL pairs**
3. **Applying entity masking**
4. **Tokenising input and output data**
5. **Preparing a dataset**
6. **Fine-tuning GPT-2**
7. **Post-processing and evaluating the model using BLEU score**


# 1: Load and explore the TTL file

We start by loading the TTL file using the `rdflib` library to understand the structure of the knowledge graph.

In [19]:
from rdflib import Graph

# Load the TTL file
g = Graph()
g.parse("data/Industry_Demos_-_Energy_Objects_NEN2660_2_2024-10-01_1354.ttl", format="ttl")

# Print sample triples to explore the data
for s, p, o in list(g)[:10]:
    print(f"Subject: {s}, Predicate: {p}, Object: {o}")


Subject: ne591e7fa22564f3db14c2322a70ea98bb137, Predicate: http://www.w3.org/ns/shacl#class, Object: http://hub.laces.tech/semmtech/consultancy/demonstrations/industries/energy/otl/industry-demos---energy-objects-nen2660/956208b8-6780-381c-bf8f-acca108efbbd
Subject: http://hub.laces.tech/semmtech/consultancy/demonstrations/industries/energy/otl/industry-demos---energy-objects-nen2660/e16464c7-bb94-35a8-aa4c-94c5b962da01, Predicate: http://hub.laces.tech/semmtech/consultancy/demonstrations/industries/energy/otl/industry-demos---energy-objects-nen2660/shallBeCompliantWith, Object: http://hub.laces.tech/ns/semmtech/private/live/demo-energy/library/test-demo-energy/0c85b485-dacb-3b18-b169-be950a0a58c2
Subject: http://hub.laces.tech/semmtech/consultancy/demonstrations/industries/energy/otl/industry-demos---energy-objects-nen2660/5e832a8a-fa56-3b48-b390-65fbff973169, Predicate: http://www.w3.org/ns/shacl#property, Object: ne591e7fa22564f3db14c2322a70ea98bb267
Subject: ne591e7fa22564f3db14c23
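Before writing question templates, it can help to see which predicates actually occur in the graph. The following is a minimal sketch and not part of the original notebook: `predicate_counts` is a hypothetical helper that accepts any iterable of `(subject, predicate, object)` triples, so the loaded graph `g` could be passed to it directly. The three sample triples are made up for illustration.

```python
from collections import Counter

def predicate_counts(triples):
    """Count how often each predicate occurs in an iterable of (s, p, o) triples."""
    return Counter(str(p) for _, p, _ in triples)

# Made-up sample triples; with rdflib, pass the loaded graph `g` instead.
sample_triples = [
    ("ex:a", "http://www.w3.org/2004/02/skos/core#prefLabel", "Agent"),
    ("ex:b", "http://www.w3.org/2004/02/skos/core#prefLabel", "Convertor"),
    ("ex:a", "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "ex:Class"),
]

# Most frequent predicates first
for predicate, count in predicate_counts(sample_triples).most_common():
    print(count, predicate)
```

Predicates that appear often but have no template in the next step are natural candidates for new question patterns.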

# 2: Generate synthetic NLQ-SPARQL pairs

For each triple whose predicate matches one of three templates (`prefLabel`, `type`, `hasPart`), we generate a natural language question (NLQ) and the corresponding SPARQL query.


In [20]:
import rdflib

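# Function to generate synthetic NLQ-SPARQL pairs
def generate_synthetic_data(graph, num_samples=100):
    """Scan the first `num_samples` triples and build (NLQ, SPARQL) pairs.

    Triples whose predicate matches no template are skipped, so fewer than
    `num_samples` pairs may be returned.
    """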
    nlq_sparql_pairs = []

    for s, p, o in list(graph)[:num_samples]:
        subject_label = s.split("/")[-1] if isinstance(s, rdflib.URIRef) else "Entity"
        predicate_label = p.split("#")[-1]

        if "prefLabel" in predicate_label:
            nlq = f"What is the label of {subject_label}?"
            sparql = f"SELECT ?label WHERE {{ <{s}> <{p}> ?label }}"
        elif "type" in predicate_label:
            nlq = f"What type is {subject_label}?"
            sparql = f"SELECT ?type WHERE {{ <{s}> <{p}> ?type }}"
        elif "hasPart" in predicate_label:
            nlq = f"What parts does {subject_label} have?"
            sparql = f"SELECT ?part WHERE {{ <{s}> <{p}> ?part }}"
        else:
            continue

        nlq_sparql_pairs.append((nlq, sparql))

    return nlq_sparql_pairs

# Generate synthetic data
synthetic_data = generate_synthetic_data(g)
print(synthetic_data[:5])


[('What is the label of e16464c7-bb94-35a8-aa4c-94c5b962da01?', 'SELECT ?label WHERE { <http://hub.laces.tech/semmtech/consultancy/demonstrations/industries/energy/otl/industry-demos---energy-objects-nen2660/e16464c7-bb94-35a8-aa4c-94c5b962da01> <http://www.w3.org/2004/02/skos/core#prefLabel> ?label }'), ('What type is 48d41e29-7d6b-4bfc-afa2-b839dcf3c62b?', 'SELECT ?type WHERE { <http://hub.laces.tech/semmtech/consultancy/demonstrations/industries/energy/otl/industry-demos---energy-objects-nen2660/48d41e29-7d6b-4bfc-afa2-b839dcf3c62b> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type }'), ('What type is 2bcd1ad4-542f-3793-a21b-411e8945c0bc?', 'SELECT ?type WHERE { <http://hub.laces.tech/semmtech/consultancy/demonstrations/industries/energy/otl/industry-demos---energy-objects-nen2660/2bcd1ad4-542f-3793-a21b-411e8945c0bc> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type }'), ('What is the label of d5580707-98e0-3a9b-a36e-345b4c55f6a2?', 'SELECT ?label WHERE { <http://hub.la

In [3]:
synthetic_data

[('What is the label of e16464c7-bb94-35a8-aa4c-94c5b962da01?',
  'SELECT ?label WHERE { <http://hub.laces.tech/semmtech/consultancy/demonstrations/industries/energy/otl/industry-demos---energy-objects-nen2660/e16464c7-bb94-35a8-aa4c-94c5b962da01> <http://www.w3.org/2004/02/skos/core#prefLabel> ?label }'),
 ('What type is Entity?',
  'SELECT ?type WHERE { <n6fa24a935db34e198678039900cfac3bb171> <http://www.w3.org/ns/shacl#datatype> ?type }'),
 ('What type is 48d41e29-7d6b-4bfc-afa2-b839dcf3c62b?',
  'SELECT ?type WHERE { <http://hub.laces.tech/semmtech/consultancy/demonstrations/industries/energy/otl/industry-demos---energy-objects-nen2660/48d41e29-7d6b-4bfc-afa2-b839dcf3c62b> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type }'),
 ('What type is 2bcd1ad4-542f-3793-a21b-411e8945c0bc?',
  'SELECT ?type WHERE { <http://hub.laces.tech/semmtech/consultancy/demonstrations/industries/energy/otl/industry-demos---energy-objects-nen2660/2bcd1ad4-542f-3793-a21b-411e8945c0bc> <http://www.w3
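The second pair in the output above shows two side effects of substring matching: the SHACL predicate `datatype` is caught by the `"type"` test, and a blank-node subject is wrapped in `<...>`, which SPARQL reads as a relative IRI rather than the blank node, so the query cannot match the original triple. A hedged sketch of a stricter variant follows. `local_name`, `TEMPLATES` and `make_pair` are hypothetical names, the `startswith("http")` test is an assumed stand-in for `isinstance(s, rdflib.URIRef)`, and the `example.org` IRI is invented.

```python
def local_name(iri):
    """Return the part of an IRI after the last '#' or '/'."""
    return iri.rsplit("#", 1)[-1].rsplit("/", 1)[-1]

# Local name -> (question template, SPARQL variable); exact matches only.
TEMPLATES = {
    "prefLabel": ("What is the label of {subject}?", "label"),
    "type": ("What type is {subject}?", "type"),
    "hasPart": ("What parts does {subject} have?", "part"),
}

def make_pair(s, p):
    """Build one (NLQ, SPARQL) pair, or None if the triple should be skipped."""
    # Assumption: IRIs start with "http"; blank-node identifiers do not.
    if not str(s).startswith("http"):
        return None
    template = TEMPLATES.get(local_name(str(p)))
    if template is None:
        return None
    question, var = template
    nlq = question.format(subject=local_name(str(s)))
    sparql = f"SELECT ?{var} WHERE {{ <{s}> <{p}> ?{var} }}"
    return nlq, sparql

print(make_pair("http://example.org/otl/Agent",
                "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"))
print(make_pair("n6fa24a935db34e198678039900cfac3bb171",
                "http://www.w3.org/ns/shacl#datatype"))
```

Keying the templates by exact local name keeps the mapping in one place, so adding a new question pattern is a one-line change.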

# 3: Mask entities in NLQs

So that the model learns query structure rather than memorising identifiers, we replace specific entities in the NLQs with placeholder tokens such as `[ENT1]`.


In [21]:
import re

# Function to mask entities in NLQs
def mask_entities(nlq, entities):
    masked_nlq = nlq
    for i, entity in enumerate(entities):
        # Escape the entity so regex metacharacters in it are matched literally
        masked_nlq = re.sub(re.escape(entity), f"[ENT{i+1}]", masked_nlq)
    return masked_nlq

# Example usage
entities = ["Agent", "Convertor"]
nlq = "What is the label of Agent?"
masked_nlq = mask_entities(nlq, entities)
print(masked_nlq)


What is the label of [ENT1]?
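The post-processing step later needs to know which placeholder stands for which entity. A minimal sketch of one way to keep that link, assuming a hypothetical helper `mask_pair` that masks the NLQ and its SPARQL query together and returns the placeholder-to-IRI map. The `example.org` IRI is invented for illustration.

```python
def mask_pair(nlq, sparql, entities):
    """Mask entities in an NLQ and its SPARQL query with shared placeholders.

    `entities` maps the surface form used in the question to the IRI used in
    the query. Returns the masked NLQ, the masked query, and a dict from
    placeholder to "<IRI>" for restoring the query later.
    """
    entity_map = {}
    for i, (label, iri) in enumerate(entities.items(), start=1):
        if label not in nlq:
            continue  # only create placeholders for entities that occur
        mask = f"[ENT{i}]"
        nlq = nlq.replace(label, mask)
        sparql = sparql.replace(f"<{iri}>", mask)
        entity_map[mask] = f"<{iri}>"
    return nlq, sparql, entity_map

masked_question, masked_query, placeholder_map = mask_pair(
    "What is the label of Agent?",
    "SELECT ?label WHERE { <http://example.org/otl/Agent> "
    "<http://www.w3.org/2004/02/skos/core#prefLabel> ?label }",
    {"Agent": "http://example.org/otl/Agent"},
)
print(masked_question)
print(masked_query)
print(placeholder_map)
```

Masking both sides with the same placeholder means the model only has to copy `[ENT1]` from the question into the query instead of reproducing a long IRI.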


# 4: Tokenise data for training

We tokenise both the input NLQs and the output SPARQL queries with Hugging Face’s `GPT2Tokenizer`, padding or truncating each to 128 tokens.

In [22]:
from transformers import GPT2Tokenizer

# Load the GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Add a padding token if it doesn't exist
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': tokenizer.eos_token})

# Tokenize function for both input (NLQ) and output (SPARQL)
def tokenize_function(example):
    input_encoding = tokenizer(
        example["input_text"],
        truncation=True,
        padding="max_length",
        max_length=128
    )
    output_encoding = tokenizer(
        example["output_text"],
        truncation=True,
        padding="max_length",
        max_length=128
    )

    return {
        "input_ids": input_encoding["input_ids"],
        "attention_mask": input_encoding["attention_mask"],
        "labels": output_encoding["input_ids"],
    }

# Prepare the dataset
train_data = [{"input_text": nlq, "output_text": sparql} for nlq, sparql in synthetic_data]
tokenized_data = [tokenize_function(data) for data in train_data]

# Verify tokenized data
print(tokenized_data[:2])


[{'input_ids': [2061, 318, 262, 6167, 286, 304, 23237, 2414, 66, 22, 12, 11848, 5824, 12, 2327, 64, 23, 12, 7252, 19, 66, 12, 5824, 66, 20, 65, 4846, 17, 6814, 486, 30, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
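`GPT2LMHeadModel` is a causal language model: its loss predicts each token of `labels` from the preceding tokens of `input_ids`, so the two are normally the same sequence. A common alternative to encoding question and query separately is therefore to concatenate them into one sequence and set `labels` to `-100` (a value the Hugging Face loss ignores) on the prompt and padding positions. The sketch below works on plain lists of token IDs; `build_causal_example` is a hypothetical helper, the toy IDs are invented, and 50256 is the end-of-text ID that appears as padding in the output above.

```python
IGNORE_INDEX = -100  # label value that the Hugging Face loss skips

def build_causal_example(prompt_ids, target_ids, eos_id, pad_id, max_length):
    """Join prompt and target token IDs into one padded training example."""
    sequence = (prompt_ids + target_ids + [eos_id])[:max_length]
    n_prompt = min(len(prompt_ids), max_length)
    # Only target tokens (and the closing EOS) contribute to the loss.
    labels = [IGNORE_INDEX] * n_prompt + sequence[n_prompt:]
    n_pad = max_length - len(sequence)
    return {
        "input_ids": sequence + [pad_id] * n_pad,
        "attention_mask": [1] * len(sequence) + [0] * n_pad,
        "labels": labels + [IGNORE_INDEX] * n_pad,
    }

# Toy IDs for illustration; in practice they would come from the tokenizer,
# e.g. tokenizer(nlq + " => ")["input_ids"] and tokenizer(sparql)["input_ids"],
# where " => " is an assumed separator.
example = build_causal_example([11, 12, 13], [21, 22], eos_id=50256,
                               pad_id=50256, max_length=8)
print(example)
```

At inference time the same prompt format (question plus separator) would be fed to `model.generate`, and the text produced after the separator taken as the query.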

# 5: Convert to Hugging Face Dataset

We convert the tokenised data to a `Dataset` using Hugging Face’s `datasets` library.

In [23]:
from datasets import Dataset

# Convert to Hugging Face Dataset
hf_dataset = Dataset.from_list(tokenized_data)

# Verify the structure
print(hf_dataset[0])


{'input_ids': [2061, 318, 262, 6167, 286, 304, 23237, 2414, 66, 22, 12, 11848, 5824, 12, 2327, 64, 23, 12, 7252, 19, 66, 12, 5824, 66, 20, 65, 4846, 17, 6814, 486, 30, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

# 6: Fine-tune a GPT-2 model

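We fine-tune a pre-trained GPT-2 model using Hugging Face’s `Trainer` class. With the small synthetic dataset generated above, three epochs at a batch size of 4 amount to only 21 optimisation steps, fewer than `logging_steps=500`, so no intermediate loss values are logged.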

In [24]:
from transformers import GPT2LMHeadModel, Trainer, TrainingArguments
from transformers import DataCollatorWithPadding

# Load the model
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Prepare data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    save_steps=10_000,
    save_total_limit=2,
    logging_dir="./logs",
    logging_steps=500,
)

# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=hf_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Train the model
trainer.train()



TrainOutput(global_step=21, training_loss=6.1940663655598955, metrics={'train_runtime': 23.4025, 'train_samples_per_second': 3.589, 'train_steps_per_second': 0.897, 'total_flos': 5487132672000.0, 'train_loss': 6.1940663655598955, 'epoch': 3.0})
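The figures in `TrainOutput` can be cross-checked with a little arithmetic. This is a sketch under stated assumptions: a single device, no gradient accumulation, and a dataset size that is inferred from the reported throughput because the notebook never prints it. It recovers the 21 steps reported as `global_step` and converts the loss into a rough perplexity.

```python
import math

train_runtime = 23.4025      # seconds, from TrainOutput
samples_per_second = 3.589   # from TrainOutput
num_epochs = 3
batch_size = 4

# Samples processed over the whole run, divided by epochs (inferred, not printed).
dataset_size = round(train_runtime * samples_per_second / num_epochs)
steps = math.ceil(dataset_size / batch_size) * num_epochs

# The training loss is a mean cross-entropy in nats; exp() gives a perplexity.
perplexity = math.exp(6.1940663655598955)

print(dataset_size, steps, round(perplexity))
```

A perplexity in the hundreds after three epochs indicates that the model has not yet fitted the training pairs.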

# 7: Post-process generated SPARQL queries

After the model generates a query, we replace each placeholder token with the IRI of the original entity.

In [25]:
def replace_masked_entities(query, entity_map):
    for mask, entity in entity_map.items():
        query = query.replace(mask, entity)
    return query

# Example usage
generated_query = "SELECT ?label WHERE { [ENT1] <http://www.w3.org/2004/02/skos/core#prefLabel> ?label }"
entity_map = {"[ENT1]": "<http://hub.laces.tech/semmtech/consultancy/demonstrations/industries/energy/otl/industry-demos---energy-objects-nen2660/Agent>"}
final_query = replace_masked_entities(generated_query, entity_map)
print(final_query)


SELECT ?label WHERE { <http://hub.laces.tech/semmtech/consultancy/demonstrations/industries/energy/otl/industry-demos---energy-objects-nen2660/Agent> <http://www.w3.org/2004/02/skos/core#prefLabel> ?label }
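A generated query may contain a placeholder that is missing from `entity_map`, in which case `replace_masked_entities` leaves it in place without any warning. A small sketch of a guard for that case; `find_unresolved` is a hypothetical helper and the `example.org` IRIs are invented.

```python
import re

# Matches placeholders of the form [ENT1], [ENT2], ...
PLACEHOLDER = re.compile(r"\[ENT\d+\]")

def find_unresolved(query):
    """Return any [ENTn] placeholders still present in a query."""
    return PLACEHOLDER.findall(query)

restored = ("SELECT ?part WHERE { <http://example.org/otl/Agent> "
            "<http://example.org/otl/hasPart> [ENT2] }")
print(find_unresolved(restored))
```

An empty list means every placeholder was restored; anything else signals that the query should be rejected or regenerated before it is sent to an endpoint.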


In [26]:
from nltk.translate.bleu_score import sentence_bleu

# Example evaluation
reference = "SELECT ?label WHERE { <http://hub.laces.tech/semmtech/consultancy/demonstrations/industries/energy/otl/industry-demos---energy-objects-nen2660/Agent> <http://www.w3.org/2004/02/skos/core#prefLabel> ?label }"
candidate = "SELECT ?label WHERE { <http://hub.laces.tech/semmtech/consultancy/demonstrations/industries/energy/otl/industry-demos---energy-objects-nen2660/Agent> <http://www.w3.org/2004/02/skos/core#prefLabel> ?label }"

# sentence_bleu expects token lists: a list of tokenised references and one
# tokenised candidate. Passing untokenised strings gives no n-gram matches
# and a score of 0, even for identical queries.
score = sentence_bleu([reference.split()], candidate.split())
print(f"BLEU Score: {score}")


BLEU Score: 1.0
