# Fine-tuning the OpenAI gpt-4o-mini model

The untuned models do not reliably generate SPARQL queries, even when given the ontology and prompted with different techniques. This notebook creates a dataset of question-answer pairs for fine-tuning the model. An important goal is to clarify the parts of query generation that the models handle worst, mainly the use of wktLiterals instead of geometry objects.

In [3]:
template = """
Write a SPARQL SELECT query for querying a graph database.
The ontology schema delimited by triple backticks in Turtle format is:
```
{}
```
Use only the classes and properties provided in the schema to construct the SPARQL query.
Do not use any classes or properties that are not explicitly provided in the SPARQL query.
Include all necessary prefixes.
Do not include any explanations or apologies in your responses.
Do not wrap the query in backticks.
Do not include any text except the SPARQL query generated.
The question delimited by triple backticks is:
```
{}
```
"""

In [4]:
import os
import json
import time
import itertools
from rdflib import Graph
from openai import OpenAI
from functions.sparql_requests import sparql_select

graphdb_server_url = 'http://localhost:7200'
repository_id = 'geonuts'
select_endpoint_url = f"{graphdb_server_url}/repositories/{repository_id}"
update_endpoint_url = f"{graphdb_server_url}/repositories/{repository_id}/statements"
base_iri_geometry = "http://geonuts.eu/geometry/"

ontology_full = Graph()
ontology_full = ontology_full.parse("data/ontology_full_v1.ttl", format="turtle")
ontology_full_turtle = ontology_full.serialize(format="turtle")

In [5]:
client = OpenAI(api_key = os.environ['OPENAI1'])
ft_model_name = os.environ.get('FT_MODEL')

In [6]:
def test_requests(comb, ontology):
    """Send one (model, template, prompt) combination and return the generated text."""
    model, template, prompt = comb
    full_prompt = template.format(ontology, prompt)

    if model == "gemini-1.5-flash":
        # Not used in this notebook: GenerativeModel is never imported here
        # and has to come from the Gemini SDK before this branch can run.
        time.sleep(15)  # pause between requests
        runner = GenerativeModel(model_name=model)
        response = runner.generate_content(full_prompt)
        try:
            return response.text
        except Exception:
            return "COULD NOT GET TEXT GEMINI"

    else:
        completion = client.chat.completions.create(
            model=model,
            temperature=0,
            n=1,
            messages=[{"role": "user", "content": full_prompt}],
            max_tokens=500
            )
        try:
            return completion.choices[0].message.content
        except Exception:
            return "COULD NOT GET TEXT FROM OPENAI"


def run_tests(ontology, models=(), templates=(), prompts=()):
    """Run every combination of model, template and prompt against the ontology."""
    print("Testing current ontology:")
    combinations = list(itertools.product(models, templates, prompts))

    results = [test_requests(comb, ontology) for comb in combinations]
    return results



# Define 5 example questions

In [8]:
prompts = ["What are cities that are within 5 km of the NUTS region DE30?",
           "What NUTS regions are neighbors of the NUTS region AT22 and have the same NUTS level?",
           "Which NUTS region border the NUTS region FRK21 but are not inside the region?",
           "Which cities with more then 20000 inhabitant are within a 20 km radius of Warsaw?",
           "Which cities are 20 km or less from the NUTS region DED43 but not inside the region itself and have more than 40 thousand inhabitants?"]

## Send the examples to gpt-4o-mini
The answers are then corrected manually and saved in a training list. The fixed examples follow below.

In [10]:
results = run_tests(ontology_full_turtle, models=["gpt-4o-mini"], templates=[template], prompts=prompts)

Testing current ontology:


In [11]:
print(results[0])

PREFIX geo: <http://www.opengis.net/ont/geosparql#> 
PREFIX geof: <http://www.opengis.net/def/function/geosparql/> 
PREFIX gn: <https://www.geonames.org/ontology#> 
PREFIX nutsdef: <http://data.europa.eu/nuts/> 

SELECT ?city ?cityName WHERE {
  ?city a gn:Feature .
  ?city gn:name ?cityName .
  ?city geo:hasGeometry ?cityGeometry .
  nutsdef:DE30 geo:hasGeometry ?nutsGeometry .
  FILTER(geof:sfWithin(?cityGeometry, geof:buffer(?nutsGeometry, 5000)))
}


In [12]:
fixed = """
PREFIX dcterms: <http://purl.org/dc/terms/> 
PREFIX geo: <http://www.opengis.net/ont/geosparql#> 
PREFIX geof: <http://www.opengis.net/def/function/geosparql/> 
PREFIX gn: <https://www.geonames.org/ontology#> 
PREFIX nutsdef: <http://data.europa.eu/nuts/> 
PREFIX skos: <http://www.w3.org/2004/02/skos/core#> 

SELECT ?city ?cityName WHERE {
  ?city a gn:Feature ;
        gn:name ?cityName ;
        geo:hasGeometry ?cityGeometry .
  ?cityGeometry geo:asWKT ?cityWKT .
  ?region a skos:Concept ;
          skos:notation "DE30" ;
          geo:hasGeometry ?regionGeometry .
  ?regionGeometry geo:asWKT ?regionWKT .
  FILTER(geof:sfWithin(?cityWKT, geof:buffer(?regionWKT, 5000)))
}"""

The first result already has problems with the wktLiterals: the spatial functions receive the geometry objects instead of their WKT literals. The corrected query is saved as a fine-tuning example.
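The wktLiteral problem described here can be spotted mechanically before a query is corrected by hand. The following is a minimal sketch of such a check, not part of the original notebook: the function name `geometry_vars_in_filters` is hypothetical, and the approach is a line-based regex heuristic rather than a real SPARQL parser.

```python
import re

# Spatial function calls such as geof:sfWithin(...) or geof:buffer(...)
GEOF_CALL = re.compile(r"geof:\w+\s*\(")
# Variables bound as geometry objects: "geo:hasGeometry ?geom"
HAS_GEOM = re.compile(r"geo:hasGeometry\s+(\?\w+)")
# Any SPARQL variable
VARIABLE = re.compile(r"\?\w+")


def geometry_vars_in_filters(query):
    """Return geometry-object variables that are passed to geof: functions.

    Heuristic only: every line containing a geof: call is scanned for
    variables that were bound via geo:hasGeometry (and are therefore
    geometry objects, not WKT literals).
    """
    geom_vars = set(HAS_GEOM.findall(query))
    offending = set()
    for line in query.splitlines():
        if GEOF_CALL.search(line):
            offending |= set(VARIABLE.findall(line)) & geom_vars
    return offending
```

Applied to the raw model output for the first question, this would report `?cityGeometry` and `?nutsGeometry`; for the corrected query it would return an empty set.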

In [14]:
training_list = []
training_list.append([template.format(ontology_full_turtle, prompts[0]), fixed])

In [15]:
print(results[1])

PREFIX nutsdef: <http://data.europa.eu/nuts/>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX gn: <https://www.geonames.org/ontology#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?neighborRegion
WHERE {
  ?region a skos:Concept ;
          nutsdef:level ?level ;
          skos:notation "AT22" .
  
  ?neighborRegion a skos:Concept ;
                  nutsdef:level ?level ;
                  geo:hasGeometry ?geometry1 .
  
  ?region geo:hasGeometry ?geometry2 .
  
  FILTER(geof:sfTouches(?geometry1, ?geometry2))
}


In [16]:
fixed = """
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/> 
PREFIX nutsdef: <http://data.europa.eu/nuts/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?neighborNotation
WHERE {
	?region a skos:Concept ;
        nutsdef:level ?level ;
        skos:notation "AT22" ;
    	geo:hasGeometry ?regionGeom .
    ?regionGeom geo:asWKT ?regionWKT .
    ?neighborRegion a skos:Concept ;
        nutsdef:level ?level ;
        skos:notation ?neighborNotation ;
    	geo:hasGeometry ?neighborRegionGeom .
    ?neighborRegionGeom geo:asWKT ?neighborRegionWKT .
    FILTER(geof:sfTouches(?regionWKT, ?neighborRegionWKT)) .
}"""

In [17]:
training_list.append([template.format(ontology_full_turtle, prompts[1]), fixed])

In [18]:
print(prompts[2])
print(results[2])

Which NUTS region border the NUTS region FRK21 but are not inside the region?
PREFIX fno: <https://w3id.org/function/ontology#> 
PREFIX geo: <http://www.opengis.net/ont/geosparql#> 
PREFIX gn: <https://www.geonames.org/ontology#> 
PREFIX nutsdef: <http://data.europa.eu/nuts/> 
PREFIX skos: <http://www.w3.org/2004/02/skos/core#> 

SELECT ?region WHERE {
  ?region a skos:Concept .
  ?region nutsdef:hasGeometry ?geometry .
  ?frk21 a skos:Concept ;
          nutsdef:hasGeometry ?geometry_frk21 .
  ?geometry_frk21 geo:asWKT ?wkt_frk21 .
  ?geometry geo:asWKT ?wkt .
  
  FILTER(geof:sfCrosses(?wkt_frk21, ?wkt) && !geof:sfWithin(?wkt, ?wkt_frk21))
}


In [19]:
fixed = """
PREFIX geo: <http://www.opengis.net/ont/geosparql#> 
PREFIX geof: <http://www.opengis.net/def/function/geosparql/> 
PREFIX skos: <http://www.w3.org/2004/02/skos/core#> 

SELECT ?regionNotation
WHERE {
  ?region a skos:Concept ;
          geo:hasGeometry ?geometry ;
    	  skos:notation ?regionNotation .
    ?geometry geo:asWKT ?wkt .
  ?region2 a skos:Concept ;
           skos:notation "FRK21" ;
           geo:hasGeometry ?geometry2 .
    ?geometry2 geo:asWKT ?wkt2 .
  FILTER(geof:sfTouches(?wkt, ?wkt2) && !geof:sfWithin(?wkt, ?wkt2))
}"""

In [20]:
training_list.append([template.format(ontology_full_turtle, prompts[2]), fixed])

In [21]:
print(prompts[3])
print(results[3])

Which cities with more then 20000 inhabitant are within a 20 km radius of Warsaw?
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX gn: <https://www.geonames.org/ontology#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?city ?population WHERE {
  ?city a gn:Feature ;
        gn:population ?population ;
        geo:hasGeometry ?geometry .
  ?geometry geo:asWKT ?wkt .
  
  FILTER(?population > 20000)
  
  FILTER(geof:sfWithin(?wkt, geof:buffer(geo:wktLiteral("POINT(21.0122 52.2297)"^^geo:wktLiteral), 20000)))
}


In [22]:
fixed = """
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX gn: <https://www.geonames.org/ontology#>

SELECT ?cityName WHERE {
    ?city a gn:Feature ;
    	gn:name ?cityName ;
        gn:population ?population ;
        geo:hasGeometry ?geometry .
    ?geometry geo:asWKT ?wkt .

    FILTER(?population > 20000)

    ?warsaw a gn:Feature ;
    	gn:name "Warsaw" ;
    	geo:hasGeometry ?warsawGeom .
	?warsawGeom geo:asWKT ?warsawWKT .

  	FILTER(geof:sfWithin(?wkt, geof:buffer(?warsawWKT, 20000)))
}"""

In [23]:
training_list.append([template.format(ontology_full_turtle, prompts[3]), fixed])

In [24]:
print(prompts[4])
print(results[4])

Which cities are 20 km or less from the NUTS region DED43 but not inside the region itself and have more than 40 thousand inhabitants?
PREFIX fno: <https://w3id.org/function/ontology#> 
PREFIX geo: <http://www.opengis.net/ont/geosparql#> 
PREFIX geof: <http://www.opengis.net/def/function/geosparql/> 
PREFIX gn: <https://www.geonames.org/ontology#> 
PREFIX nutsdef: <http://data.europa.eu/nuts/> 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 

SELECT ?city ?population WHERE {
  ?city a gn:Feature ;
        gn:population ?population ;
        geo:hasGeometry ?cityGeometry .
  
  ?region a skos:Concept ;
          nutsdef:level "3" ;
          rdfs:label "DED43" ;
          geo:hasGeometry ?regionGeometry .

  FILTER(?population > 40000)
  FILTER(geof:sfDisjoint(?cityGeometry, ?regionGeometry))
  FILTER(geof:sfWithin(?cityGeometry, geof:buffer(?regionGeometry, 20000)))
}


In [25]:
fixed = """
PREFIX geo: <http://www.opengis.net/ont/geosparql#> 
PREFIX geof: <http://www.opengis.net/def/function/geosparql/> 
PREFIX gn: <https://www.geonames.org/ontology#> 
PREFIX skos: <http://www.w3.org/2004/02/skos/core#> 

SELECT ?cityName ?population 
WHERE {
    ?city a gn:Feature ;
    	gn:name ?cityName ;
        gn:population ?population ;
        geo:hasGeometry ?cityGeom .

    ?cityGeom geo:asWKT ?cityWKT .

    ?region a skos:Concept ;
        skos:notation "DED43" ;
        geo:hasGeometry ?regionGeom .
    ?regionGeom geo:asWKT ?regionWKT .

    FILTER(?population > 40000)
    FILTER(geof:sfDisjoint(?cityWKT, ?regionWKT))
  	FILTER(geof:sfWithin(?cityWKT, geof:buffer(?regionWKT, 20000)))
}"""

In [26]:
training_list.append([template.format(ontology_full_turtle, prompts[4]), fixed])

# Additional training examples
Define more questions, generate the answers and fix them manually.

In [28]:
ex = "What are the 5 biggest cities in the region MT00?"
prompts.append(ex)
print(run_tests(ontology_full_turtle, models=["gpt-4o-mini"], templates=[template], prompts=[ex])[0])

Testing current ontology:
PREFIX gn: <https://www.geonames.org/ontology#>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX nutsdef: <http://data.europa.eu/nuts/>

SELECT ?city ?population WHERE {
  ?city a gn:Feature ;
        gn:population ?population ;
        gn:geonamesID ?geonamesID .
  ?region a skos:Concept ;
          nutsdef:level "1" ;
          skos:notation "MT00" .
  ?city geo:hasGeometry ?geometry .
  ?geometry geo:asWKT ?wkt .
} ORDER BY DESC(?population) LIMIT 5


In [29]:
fixed = """PREFIX geo: <http://www.opengis.net/ont/geosparql#> 
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX gn: <https://www.geonames.org/ontology#> 
PREFIX skos: <http://www.w3.org/2004/02/skos/core#> 

SELECT ?cityName ?population WHERE {
    ?city a gn:Feature ;
        gn:population ?population ;
    	gn:name ?cityName ;
        geo:hasGeometry ?cityGeom .
    ?cityGeom geo:asWKT ?cityWKT .
            
    ?region a skos:Concept ;
    	skos:notation "MT00" ;
        geo:hasGeometry ?regionGeom .
    ?regionGeom geo:asWKT ?regionWKT .
    FILTER(geof:sfWithin(?cityWKT, ?regionWKT))
} ORDER BY DESC(?population) LIMIT 5"""

In [30]:
training_list.append([template.format(ontology_full_turtle, ex), fixed])

In [31]:
ex = "What the 3 biggest cities that are in a region bordering and with the same NUTS level as the region PL22A?"
prompts.append(ex)
print(run_tests(ontology_full_turtle, models=["gpt-4o-mini"], templates=[template], prompts=[ex])[0])

Testing current ontology:
PREFIX fno: <https://w3id.org/function/ontology#> 
PREFIX geo: <http://www.opengis.net/ont/geosparql#> 
PREFIX geof: <http://www.opengis.net/def/function/geosparql/> 
PREFIX gn: <https://www.geonames.org/ontology#> 
PREFIX nutsdef: <http://data.europa.eu/nuts/> 
PREFIX owl: <http://www.w3.org/2002/07/owl#> 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
PREFIX skos: <http://www.w3.org/2004/02/skos/core#> 
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#> 

SELECT ?city ?population WHERE {
  ?region a skos:Concept ;
          nutsdef:level "2" ;
          skos:notation "PL22A" .
  
  ?borderingRegion a skos:Concept ;
                   nutsdef:level "2" ;
                   geo:hasGeometry ?geometry .
  
  ?city a gn:Feature ;
        gn:population ?population ;
        geo:hasGeometry ?cityGeometry .
  
  geof:sfIntersects(?geometry, ?cityGeometry) .
  
  FILTER(?region != ?borderingRegion)
} 
ORDER BY DESC(?population) 
LIMIT 3


In [32]:
fixed = """PREFIX geo: <http://www.opengis.net/ont/geosparql#> 
PREFIX geof: <http://www.opengis.net/def/function/geosparql/> 
PREFIX gn: <https://www.geonames.org/ontology#> 
PREFIX nutsdef: <http://data.europa.eu/nuts/> 
PREFIX skos: <http://www.w3.org/2004/02/skos/core#> 

SELECT ?cityName ?population
WHERE {
    ?region a skos:Concept ;
        nutsdef:level ?level ;
        skos:notation "PL22A" ;
        geo:hasGeometry ?regionGeom .
    ?regionGeom geo:asWKT ?regionWKT .

    ?city a gn:Feature ;
        gn:population ?population ;
    	gn:name ?cityName ;
        geo:hasGeometry ?cityGeom .
    ?cityGeom geo:asWKT ?cityWKT .

    ?borderingRegion a skos:Concept ;
        nutsdef:level ?level ;
        geo:hasGeometry ?borderGeometry .
    ?borderGeometry geo:asWKT ?borderWKT .

    FILTER(geof:sfTouches(?regionWKT, ?borderWKT))
    FILTER(geof:sfIntersects(?cityWKT, ?borderWKT))
}
ORDER BY DESC(?population)
LIMIT 3"""

In [33]:
training_list.append([template.format(ontology_full_turtle, ex), fixed])

In [34]:
ex = "Which regions are neighbors of the region ITF1 and share the same NUTS level?"
prompts.append(ex)
print(run_tests(ontology_full_turtle, models=["gpt-4o-mini"], templates=[template], prompts=[ex])[0])

Testing current ontology:
PREFIX fno: <https://w3id.org/function/ontology#> 
PREFIX geo: <http://www.opengis.net/ont/geosparql#> 
PREFIX gn: <https://www.geonames.org/ontology#> 
PREFIX nutsdef: <http://data.europa.eu/nuts/> 
PREFIX skos: <http://www.w3.org/2004/02/skos/core#> 

SELECT ?neighborRegion 
WHERE {
  ?region a skos:Concept ;
          nutsdef:level ?level .
  ?neighborRegion a skos:Concept ;
                  nutsdef:level ?level ;
                  geo:hasGeometry ?geometryNeighbor .
  ?region geo:hasGeometry ?geometryRegion .
  FILTER(?region = <http://data.europa.eu/nuts/ITF1>)
}


In [35]:
fixed = """
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/> 
PREFIX nutsdef: <http://data.europa.eu/nuts/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?neighborNotation
WHERE {
    ?region a skos:Concept ;
    	skos:notation "ITF1" ;
    	nutsdef:level ?level ;
    	geo:hasGeometry ?regionGeom .
    ?regionGeom geo:asWKT ?regionWKT .
    ?neighborRegion a skos:Concept ;
        nutsdef:level ?level ;
        skos:notation ?neighborNotation ;
    	geo:hasGeometry ?neighborGeom .
    ?neighborGeom geo:asWKT ?neighborWKT .
    	
    FILTER(geof:sfTouches(?regionWKT, ?neighborWKT))
}"""

In [36]:
training_list.append([template.format(ontology_full_turtle, ex), fixed])

In [37]:
ex = "What are the three biggest cities in the NUTS region RO32?"
prompts.append(ex)
print(run_tests(ontology_full_turtle, models=["gpt-4o-mini"], templates=[template], prompts=[ex])[0])

Testing current ontology:
PREFIX gn: <https://www.geonames.org/ontology#>
PREFIX nutsdef: <http://data.europa.eu/nuts/>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?city ?population
WHERE {
  ?city a gn:Feature ;
        gn:population ?population ;
        gn:geonamesID ?geonameID .
  ?city geo:hasGeometry ?geometry .
  ?geometry geo:asWKT ?wkt .
  ?city rdfs:label ?label .
  FILTER(CONTAINS(STR(?label), "RO32"))
}
ORDER BY DESC(?population)
LIMIT 3


In [38]:
fixed = """
PREFIX gn: <https://www.geonames.org/ontology#>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>

SELECT ?cityName ?population 
WHERE {
    ?city a gn:Feature ;
        gn:population ?population ;
    	gn:name ?cityName ;
        geo:hasGeometry ?cityGeom .
    ?cityGeom geo:asWKT ?cityWKT .
    ?region a skos:Concept ;
    	skos:notation "RO32" ;
        geo:hasGeometry ?regionGeom .
    ?regionGeom geo:asWKT ?regionWKT .
    
	FILTER(geof:sfWithin(?cityWKT, ?regionWKT))
}
ORDER BY DESC(?population)
LIMIT 3"""

In [39]:
training_list.append([template.format(ontology_full_turtle, ex), fixed])

In [40]:
ex = "How many neighbors of the same level does the NUTS region EL30 have?"
prompts.append(ex)
print(run_tests(ontology_full_turtle, models=["gpt-4o-mini"], templates=[template], prompts=[ex])[0])

Testing current ontology:
PREFIX nutsdef: <http://data.europa.eu/nuts/>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX gn: <https://www.geonames.org/ontology#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT (COUNT(?neighbor) AS ?count) WHERE {
  ?nutsRegion a skos:Concept ;
              nutsdef:level ?level ;
              rdfs:label "EL30" .
  ?neighbor a skos:Concept ;
            nutsdef:level ?level ;
            ?s ?nutsRegion .
}


In [41]:
fixed = """
PREFIX geo: <http://www.opengis.net/ont/geosparql#> 
PREFIX geof: <http://www.opengis.net/def/function/geosparql/> 
PREFIX nutsdef: <http://data.europa.eu/nuts/> 
PREFIX skos: <http://www.w3.org/2004/02/skos/core#> 

SELECT (COUNT(?neighbor) AS ?count) WHERE {
    ?region a skos:Concept ;
        nutsdef:level ?level ;
        skos:notation "EL30" ;
    	geo:hasGeometry ?regionGeom .
    ?regionGeom geo:asWKT ?regionWKT .
    ?neighbor a skos:Concept ;
        nutsdef:level ?level ;
        skos:notation ?neighborNotation ;
    	geo:hasGeometry ?neighborGeom .
    ?neighborGeom geo:asWKT ?neighborWKT .
    	
    FILTER(geof:sfTouches(?neighborWKT, ?regionWKT))
}"""

In [42]:
training_list.append([template.format(ontology_full_turtle, ex), fixed])

In [43]:
ex = "How many cities with more than 50000 inhabitants are within 20 km of the NUTS region SK01?"
prompts.append(ex)
print(run_tests(ontology_full_turtle, models=["gpt-4o-mini"], templates=[template], prompts=[ex])[0])

Testing current ontology:
PREFIX fno: <https://w3id.org/function/ontology#> 
PREFIX geo: <http://www.opengis.net/ont/geosparql#> 
PREFIX geof: <http://www.opengis.net/def/function/geosparql/> 
PREFIX gn: <https://www.geonames.org/ontology#> 
PREFIX nutsdef: <http://data.europa.eu/nuts/> 
PREFIX owl: <http://www.w3.org/2002/07/owl#> 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
PREFIX skos: <http://www.w3.org/2004/02/skos/core#> 
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#> 

SELECT (COUNT(?city) AS ?cityCount) 
WHERE {
  ?city a gn:Feature ;
        gn:population ?population ;
        geo:hasGeometry ?cityGeometry .
  ?region a skos:Concept ;
          nutsdef:level "1" ;
          geo:hasGeometry ?regionGeometry ;
          skos:notation "SK01" .
  FILTER(?population > 50000)
  FILTER(geof:sfWithin(?cityGeometry, geof:buffer(?regionGeometry, 20000)))
}


In [44]:
fixed = """
PREFIX geo: <http://www.opengis.net/ont/geosparql#> 
PREFIX geof: <http://www.opengis.net/def/function/geosparql/> 
PREFIX gn: <https://www.geonames.org/ontology#> 
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT  (COUNT(?city) AS ?cityCount) 
WHERE { 
    ?city a gn:Feature ; 
        gn:population ?population ; 
        geo:hasGeometry ?cityGeometry .
    ?cityGeometry geo:asWKT ?cityWKT .

    ?region a skos:Concept ; 
        skos:notation "SK01" ; 
        geo:hasGeometry ?regionGeom . 
   	?regionGeom geo:asWKT ?regionWKT .
    
    FILTER(?population > 50000) 
    FILTER(geof:sfWithin(?cityWKT, geof:buffer(?regionWKT, 20000))) 
}"""

In [45]:
training_list.append([template.format(ontology_full_turtle, ex), fixed])

In [46]:
ex = "In which NUTS level 3 region is the city of Rostock located?"
prompts.append(ex)
print(run_tests(ontology_full_turtle, models=["gpt-4o-mini"], templates=[template], prompts=[ex])[0])

Testing current ontology:
PREFIX fno: <https://w3id.org/function/ontology#> 
PREFIX geo: <http://www.opengis.net/ont/geosparql#> 
PREFIX geof: <http://www.opengis.net/def/function/geosparql/> 
PREFIX gn: <https://www.geonames.org/ontology#> 
PREFIX nutsdef: <http://data.europa.eu/nuts/> 
PREFIX owl: <http://www.w3.org/2002/07/owl#> 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
PREFIX skos: <http://www.w3.org/2004/02/skos/core#> 
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#> 

SELECT ?region 
WHERE { 
  ?city a gn:Feature ; 
        gn:name "Rostock" ; 
        geo:hasGeometry ?geometry . 
  ?region a skos:Concept ; 
          nutsdef:level "3" ; 
          geo:hasGeometry ?regionGeometry . 
  FILTER(geof:sfContains(?regionGeometry, ?geometry)) 
} 


In [47]:
fixed = """
PREFIX geo: <http://www.opengis.net/ont/geosparql#> 
PREFIX gn: <https://www.geonames.org/ontology#> 
PREFIX nutsdef: <http://data.europa.eu/nuts/> 
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/> 

SELECT ?region 
WHERE { 
    ?city gn:name "Rostock" . 
    ?city geo:hasGeometry ?cityGeom .
    ?cityGeom geo:asWKT ?cityWKT .
    
    ?region nutsdef:level "3" . 
    ?region geo:hasGeometry ?regionGeom .
    ?regionGeom geo:asWKT ?regionWKT .
    
    FILTER(geof:sfContains(?regionWKT, ?cityWKT))
}"""

In [48]:
training_list.append([template.format(ontology_full_turtle, ex), fixed])

In [49]:
prompts

['What are cities that are within 5 km of the NUTS region DE30?',
 'What NUTS regions are neighbors of the NUTS region AT22 and have the same NUTS level?',
 'Which NUTS region border the NUTS region FRK21 but are not inside the region?',
 'Which cities with more then 20000 inhabitant are within a 20 km radius of Warsaw?',
 'Which cities are 20 km or less from the NUTS region DED43 but not inside the region itself and have more than 40 thousand inhabitants?',
 'What are the 5 biggest cities in the region MT00?',
 'What the 3 biggest cities that are in a region bordering and with the same NUTS level as the region PL22A?',
 'Which regions are neighbors of the region ITF1 and share the same NUTS level?',
 'What are the three biggest cities in the NUTS region RO32?',
 'How many neighbors of the same level does the NUTS region EL30 have?',
 'How many cities with more than 50000 inhabitants are within 20 km of the NUTS region SK01?',
 'In which NUTS level 3 region is the city of Rostock located?']

# Alternative formulations
For each question, we formulate two alternative wordings that are semantically identical. In total, this gives 36 question-answer pairs, of which the pairs for the first seven questions (21 pairs) are used for fine-tuning. This is a fairly small fine-tuning dataset, but fine-tuning is not the main focus of the study.

In [67]:
idents = [
    "Which cities are 5 km or closer to the NUTS region DE30?",
    "Which NUTS regions of the same level share a border with the NUTS region AT22?",
    "What are the NUTS regions that are neighbors of the region FRK21?",
    "What cities are a maximum of 20 km from Warsaw and have more than 20 thousand inhabitants?",
    "Which cities are closer than 20 km from the NUTS region DED43, have at least 40 thousand inhabitants and are not inside the region itself?",
    "The 5 largest cities by population inside the region MT00?",
    "Give the 3 most populated cities that are within a NUTS region that borders and has the same level as the region PL22A?",
    "What regions have the same NUTS level as the region ITF1 and share a border with it?",
    "What are the three most populated cities in the NUTS region RO32?",
    "How many regions exist that share the level of and a border with the NUTS region EL30?",
    "The number of cities with more than 50 thousand people that are closer than 20 km to the NUTS region SK01?",
    "Give the NUTS region (level 3) that contains Rostock."]

idents2 = [
    "What cities are within 5 km of the NUTS region DE30?",
    "Which NUTS regions at the same level are adjacent to the NUTS region AT22?",
    "Which NUTS regions border the region FRK21 but are outside of it?",
    "Which cities are within 20 km of Warsaw and have a population of more than 20,000?",
    "What cities are less than 20 km from the NUTS region DED43, have a population of at least 40,000, and are not located within the region itself?",
    "What are the 5 most populous cities within the region MT00?",
    "List the 3 most populous cities within a NUTS region that borders and is at the same level as the region PL22A.",
    "Which regions are at the same NUTS level as the region ITF1 and share a border with it?",
    "Which are the three most populous cities in the NUTS region RO32?",
    "How many regions are there that are at the same level and share a border with the NUTS region EL30?",
    "What is the number of cities with more than 50,000 people that are within 20 km of the NUTS region SK01?",
    "Identify the level 3 NUTS region with Rostock in it."]

In [69]:
answers = [x[1] for x in training_list]

In [71]:
train = []
train.extend(prompts[:7])
train.extend(idents[:7])
train.extend(idents2[:7])

train = [template.format(ontology_full_turtle, x) for x in train]

answers_train = answers[:7] * 3
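The slicing in the cell above keeps the first seven questions (in all three wordings) for training and leaves the remaining five as held-out test questions. A minimal sketch of the same split as a reusable helper follows; the name `split_wordings` and its parameters are hypothetical and not part of the original notebook.

```python
def split_wordings(wordings, answers, n_train):
    """Split parallel question lists into training and held-out pairs.

    wordings: list of lists, each holding one wording per question
    answers:  one reference query per question
    n_train:  number of leading questions used for training
    """
    for questions in wordings:
        # zip() would silently truncate, so check the lengths explicitly
        if len(questions) != len(answers):
            raise ValueError("each wording list needs one entry per answer")

    train, held_out = [], []
    for questions in wordings:
        for i, (question, answer) in enumerate(zip(questions, answers)):
            target = train if i < n_train else held_out
            target.append((question, answer))
    return train, held_out
```

With the lists defined in this notebook, `split_wordings([prompts, idents, idents2], answers, n_train=7)` would yield 21 training pairs and 15 held-out pairs, in the same order as the cell above.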

In [73]:
codes = ["DE30", "AT22", "FRK21", "DED43", "MT00", "PL22A", "ITF1", "RO32", "EL30", "SK01"]
cities = ["Warsaw", "Rostock"]
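One possible use for lists like `codes` and `cities` is a sanity check that each rewording still mentions the same region code or city as its original question, since a mismatch would pair a question with the wrong reference query. The sketch below assumes that use; the function names are hypothetical, and the matching is plain substring search.

```python
def mentioned(question, identifiers):
    """Return the identifiers (region codes or city names) found in a question."""
    return {ident for ident in identifiers if ident in question}


def check_rewordings(originals, rewordings, identifiers):
    """Return the indices where a rewording mentions different identifiers
    than the original question at the same position."""
    mismatches = []
    for i, (orig, alt) in enumerate(zip(originals, rewordings)):
        if mentioned(orig, identifiers) != mentioned(alt, identifiers):
            mismatches.append(i)
    return mismatches
```

For example, `check_rewordings(prompts, idents, codes + cities)` would return an empty list if every rewording refers to the same region or city as its original.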

In [77]:
def create_line(q, a):
    line = {}
    system = {"role": "system", "content": "You are a helpful assistant that generates sparql queries based on a users question and a ontology schema"}
    user = {"role": "user", "content": q}
    assistant = {"role": "assistant", "content": a}
    line["messages"] = [system, user, assistant]
    return line
    # print(line)

# Fine tuning with train-test-split

In [79]:
lines = []

for question, answer in zip(train, answers_train):
    line = create_line(question, answer)
    lines.append(line)

file_path = 'tuning/fine-tuning-file.jsonl'
if not os.path.exists(file_path):
    print("Writing new file")
    with open(file_path, 'w') as jsonl_file:
        for line in lines:
            jsonl_file.write(json.dumps(line) + '\n')

Writing new file
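Before uploading, it can be worth reading the file back and checking that every line has the structure produced by `create_line`. The following is a minimal sketch using only the standard library; `validate_jsonl` is a hypothetical helper, and the checks are limited to what this notebook writes (one system, one user and one assistant message per line).

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_jsonl(path):
    """Check that every line is valid JSON with the expected chat roles.

    Returns the number of examples; raises ValueError on the first bad line
    (json.JSONDecodeError is a subclass of ValueError).
    """
    count = 0
    with open(path, encoding="utf-8") as fh:
        for lineno, raw in enumerate(fh, start=1):
            record = json.loads(raw)
            messages = record.get("messages", [])
            roles = [m.get("role") for m in messages]
            if roles != EXPECTED_ROLES:
                raise ValueError(f"line {lineno}: unexpected roles {roles}")
            if not all(m.get("content") for m in messages):
                raise ValueError(f"line {lineno}: empty message content")
            count += 1
    return count
```

For the file written above, `validate_jsonl(file_path)` should return 21 if the file was created by this cell.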


# Fine Tuning
Fine-tuning takes place on the OpenAI platform. The fine-tuning-file.jsonl file is uploaded and the gpt-4o-mini model is fine-tuned using standard parameters.

Afterwards, we can again try to generate a few SPARQL queries from the questions above. For that, the environment variable FT_MODEL has to be set to the model identifier that is available on the OpenAI platform.

Keep in mind that the results might differ if these tests are run again.
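The upload and job creation described above can also be scripted instead of being done in the web interface. The sketch below is one way to do that with the OpenAI Python client and is not taken from the original workflow: `start_fine_tune` is a hypothetical helper, and the default `base_model` snapshot name is an assumption that should be checked against the models currently offered for fine-tuning.

```python
def start_fine_tune(client, file_path, base_model="gpt-4o-mini-2024-07-18"):
    """Upload a JSONL training file and start a fine-tuning job.

    client:     an openai.OpenAI instance, as created earlier in this notebook
    file_path:  path to the JSONL file with the training examples
    base_model: assumed snapshot name; verify it before use
    Returns the id of the created fine-tuning job.
    """
    with open(file_path, "rb") as fh:
        uploaded = client.files.create(file=fh, purpose="fine-tune")
    job = client.fine_tuning.jobs.create(
        training_file=uploaded.id,
        model=base_model,
    )
    return job.id
```

Once the job has finished, `client.fine_tuning.jobs.retrieve(job_id).fine_tuned_model` should hold the identifier to store in FT_MODEL.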

In [82]:
ex = prompts[8]
gen = run_tests(ontology_full_turtle, models=[ft_model_name], templates=[template], prompts=[ex])[0]

print(ex)
sparql_select(gen, select_endpoint_url)

Testing current ontology:
What are the three biggest cities in the NUTS region RO32?


Unnamed: 0,cityName,population
0,Bucharest,1877155
1,Voluntari,30323
2,Buftea,20691


### -> PASSED

In [84]:
ex = prompts[9]
gen = run_tests(ontology_full_turtle, models=[ft_model_name], templates=[template], prompts=[ex])[0]

print(ex)
sparql_select(gen, select_endpoint_url)

Testing current ontology:
How many neighbors of the same level does the NUTS region EL30 have?


Unnamed: 0,neighborCount
0,2


### -> PASSED (2 is correct, EL65 & EL64)

In [86]:
ex = prompts[10]
gen = run_tests(ontology_full_turtle, models=[ft_model_name], templates=[template], prompts=[ex])[0]

print(ex)
sparql_select(gen, select_endpoint_url)

Testing current ontology:
How many cities with more than 50000 inhabitants are within 20 km of the NUTS region SK01?


Unnamed: 0,cityCount
0,2


### -> PASSED (Bratislava & Trnava)

In [92]:
ex = prompts[11]
gen = run_tests(ontology_full_turtle, models=[ft_model_name], templates=[template], prompts=[ex])[0]

print(ex)
sparql_select(gen, select_endpoint_url)

Testing current ontology:
In which NUTS level 3 region is the city of Rostock located?


Unnamed: 0,regionNotation
0,DE803


### -> PASSED (DE803 is the correct region)

# Test 2 with idents
Testing again with some of the alternative formulations.

In [97]:
ex = idents[8]
gen = run_tests(ontology_full_turtle, models=[ft_model_name], templates=[template], prompts=[ex])[0]

print(ex)
sparql_select(gen, select_endpoint_url)

Testing current ontology:
What are the three most populated cities in the NUTS region RO32?


Unnamed: 0,cityName,population
0,Bucharest,1877155
1,Voluntari,30323
2,Buftea,20691


### -> PASSED

In [100]:
ex = idents[9]
gen = run_tests(ontology_full_turtle, models=[ft_model_name], templates=[template], prompts=[ex])[0]

print(ex)
sparql_select(gen, select_endpoint_url)

Testing current ontology:
How many regions exist that share the level of and a border with the NUTS region EL30?


Unnamed: 0,regionCount
0,0


### -> FAILED (Should be 2 neighbors)

In [102]:
ex = idents[10]
gen = run_tests(ontology_full_turtle, models=[ft_model_name], templates=[template], prompts=[ex])[0]

print(ex)
sparql_select(gen, select_endpoint_url)

Testing current ontology:
The number of cities with more than 50 thousand people that are closer than 20 km to the NUTS region SK01?


Unnamed: 0,cityCount
0,2


### -> PASSED

In [104]:
ex = idents[11]
gen = run_tests(ontology_full_turtle, models=[ft_model_name], templates=[template], prompts=[ex])[0]

print(ex)
sparql_select(gen, select_endpoint_url)

Testing current ontology:
Give the NUTS region (level 3) that contains Rostock.


Unnamed: 0,regionNotation
0,DE803


### -> PASSED

# Takeaway
After fine-tuning, considerably more questions are successfully translated into SPARQL queries. Here, questions from the fine-tuning dataset are used as examples; in the actual study, the results are analyzed more systematically. However, since these questions are quite similar to the questions in the study, overfitting is quite likely here.
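As a small step towards a more systematic view, the manual verdicts from the two test rounds in this notebook can be tallied. The sketch below only restates the PASSED/FAILED outcomes recorded above; the dictionary layout and the helper `pass_rate` are illustrative choices.

```python
# Outcomes as recorded in the test cells of this notebook (True = passed)
outcomes = {
    "original wording": {"RO32": True, "EL30": True, "SK01": True, "Rostock": True},
    "first rewording": {"RO32": True, "EL30": False, "SK01": True, "Rostock": True},
}


def pass_rate(results):
    """Fraction of passed checks in a {question: bool} mapping."""
    return sum(results.values()) / len(results)


for wording, results in outcomes.items():
    print(f"{wording}: {pass_rate(results):.2f}")

# Flatten both rounds to get the overall rate (7 of 8 checks passed)
overall = {f"{w}/{q}": ok for w, res in outcomes.items() for q, ok in res.items()}
print(f"overall: {pass_rate(overall):.3f}")
```

With only eight checks, all on questions that closely resemble the training examples, these rates say little about how well the model generalizes.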