## Question to SPARQL query generation

In this notebook, we generate a SPARQL query from a plain-English question and execute it against a knowledge graph.

Below are the two prompts we will use for the language model. First, `SPARQL_TEMPLATE` is used to construct a SPARQL query from an input question and context. The query is then executed against the knowledge graph, and `ANSWER_TEMPLATE` is used to generate a human-readable answer describing the results.

In [1]:

SPARQL_TEMPLATE = """
Generate a SPARQL query to answer the input question. A sample of the knowledge graph schema is provided to help construct the query.
After you generate the sparql, you should display it.
When generating sparql:
* never enclose the sparql in back-quotes.
* always include the prefix declarations.
* prefer using OPTIONAL when selecting multiple variables.
* Allow case-insensitive matching of strings.

Use the following format:

Question: the input question for which you must generate a SPARQL query
Information: the schema information in RDF format. This will help you generate the sparql query with the correct format.

Question: {question_str}
Information:
{context_str}
Answer:
"""

ANSWER_TEMPLATE = """
The following describes a user question, the associated SPARQL query and the result of executing the query.
Based on this information, write an answer in simple terms that describes the results.
When appropriate, use markdown formatting to format the results into a table or bullet points.

Question:
{question_str}
Query:
{query_str}
Result:
{result_str}
Answer:
"""
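
To make the role of the placeholders concrete, here is a minimal sketch of filling a template with `str.format`. `DEMO_TEMPLATE` is a shortened stand-in for `SPARQL_TEMPLATE` and the context is a single triple, both chosen for illustration. How the chain fills the template internally is not shown in this notebook, so treat this as an approximation of the idea rather than the actual mechanism.

```python
# Shortened stand-in for SPARQL_TEMPLATE, keeping the same two placeholders.
DEMO_TEMPLATE = """
Question: {question_str}
Information:
{context_str}
Answer:
"""

# One schema triple used as example context (an illustrative choice).
demo_context = (
    "<http://schema.org/programmingLanguage> "
    "<http://www.w3.org/2000/01/rdf-schema#label> "
    '"programming language" .'
)

# Substitute the question and the context into the template.
demo_prompt = DEMO_TEMPLATE.format(
    question_str="What softwares are written in Python?",
    context_str=demo_context,
)
print(demo_prompt)
```

The trailing `Answer:` line is left open so that the model continues the text with the query itself.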

We set up a minimal configuration, with the vector database (Chroma) running in client-only mode and a small RDF file acting as the knowledge graph. This file contains both the instance data and the ontology; the ontology is enclosed in a named graph inside the file.

For the sake of the demo, we use a small model for embeddings (all-MiniLM-L6-v2) and rely on the OpenAI API for text generation. A local model can be used instead, but it requires a large amount of RAM and ideally a GPU.

In [19]:
from aikg.config import ChatConfig, ChromaConfig, SparqlConfig

chroma_config = ChromaConfig(
    host="local",
    port=8000,
    collection_name="test",
    embedding_model="all-MiniLM-L6-v2",
)
sparql_config = SparqlConfig(
    endpoint="../data/test_data.trig",
)
chat_config = ChatConfig(
    answer_template=ANSWER_TEMPLATE,
    sparql_template=SPARQL_TEMPLATE
)


In [25]:
import os
# Replace the placeholder with your own key; avoid committing a real key to version control.
os.environ["OPENAI_API_KEY"] = "sk-..."

In [20]:

from aikg.utils.llm import setup_llm_chain
from aikg.utils.rdf import setup_kg

# Use OpenAI API
from langchain.llms import OpenAI
llm = OpenAI(model_name="gpt-3.5-turbo-0125")

# For now, both chains share the same model to spare memory
answer_chain = setup_llm_chain(llm, chat_config.answer_template)
sparql_chain = setup_llm_chain(llm, chat_config.sparql_template)
kg = setup_kg(**sparql_config.dict())



First, we need to embed the ontology into the vector database. This will allow us to retrieve semantically similar concepts from the ontology based on the question.

In the example RDF file, the ontology is enclosed in a named graph called `https://example.org/ontology`.

In [14]:
from aikg.flows.chroma_build import chroma_build_flow
chroma_build_flow(chroma_config, sparql_config, graph="https://example.org/ontology")


[Completed(message=None, type=COMPLETED, result=UnpersistedResult(type='unpersisted', artifact_type='result', artifact_description='Unpersisted result of type `tuple`')),
 Completed(message=None, type=COMPLETED, result=UnpersistedResult(type='unpersisted', artifact_type='result', artifact_description='Unpersisted result of type `list`')),
 Completed(message=None, type=COMPLETED, result=UnpersistedResult(type='unpersisted', artifact_type='result', artifact_description='Unpersisted result of type `NoneType`'))]

In [21]:

from aikg.utils.chroma import setup_client, setup_collection
client = setup_client(
    chroma_config.host,
    chroma_config.port,
    chroma_config.persist_directory,
)
collection = setup_collection(
    client,
    chroma_config.collection_name,
    chroma_config.embedding_model,
)


The Chroma collection now contains the ontology concepts as vectors. We can retrieve the most similar concepts to a given question.
Notice that the property "programmingLanguage" is retrieved, even though the question does not contain the word "programming".

In [22]:
QUESTION = "What softwares are written in Python?"
results = collection.query(query_texts=QUESTION, n_results=5)
print('\n'.join([res.get("triples", "") for res in results['metadatas'][0]]))


<http://schema.org/programmingLanguage> <http://www.w3.org/2000/01/rdf-schema#label> "programming language" .
<http://schema.org/programmingLanguage> <http://www.w3.org/2000/01/rdf-schema#range> <http://www.w3.org/2001/XMLSchema#string> .
<http://schema.org/programmingLanguage> <http://www.w3.org/2000/01/rdf-schema#comment> "The computer programming language." .
<http://schema.org/programmingLanguage> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/1999/02/22-rdf-syntax-ns#Property> .
<http://schema.org/programmingLanguage> <http://www.w3.org/2000/01/rdf-schema#domain> <http://schema.org/SoftwareSourceCode> .

<http://schema.org/SoftwareSourceCode> <http://www.w3.org/2000/01/rdf-schema#comment> "Computer programming source code. Example: Full (compile ready) solutions, code snippet samples, scripts, templates." .
<http://schema.org/SoftwareSourceCode> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/01/rdf-schema#Class> .
<http://schema.org
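
The retrieval step above is a nearest-neighbour search in embedding space. The sketch below illustrates the idea with cosine similarity over hand-written 3-dimensional vectors. The vectors, the `schema:author` entry and the helper names are made up for illustration; real all-MiniLM-L6-v2 embeddings have 384 dimensions, and Chroma's own distance metric and index are not reproduced here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """Return the k keys of `index` whose vectors are most similar to `query_vec`."""
    ranked = sorted(
        index,
        key=lambda name: cosine_similarity(query_vec, index[name]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy "embeddings" of ontology concepts (values are invented).
toy_index = {
    "schema:programmingLanguage": [0.9, 0.1, 0.0],
    "schema:SoftwareSourceCode": [0.7, 0.3, 0.1],
    "schema:author": [0.0, 0.2, 0.9],
}
# Hypothetical embedding of the question.
toy_question = [0.8, 0.2, 0.1]

nearest = top_k(toy_question, toy_index, k=2)
print(nearest)
```

Because similarity is computed on vectors rather than on words, a concept can rank highly without sharing any token with the question, which is the effect seen with `programmingLanguage`.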

Then, we can generate the SPARQL query.

In [23]:
from aikg.utils.chat import generate_sparql
query = generate_sparql(QUESTION, collection, sparql_chain)
print(query)

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT DISTINCT ?softwareName
WHERE {
  ?software rdf:type <http://schema.org/SoftwareSourceCode> .
  ?software <http://schema.org/programmingLanguage> ?language .
  FILTER regex(str(?language), "python", "i") .
  ?software <http://schema.org/name> ?softwareName .
}
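
The prompt asks the model never to enclose the query in back-quotes, but a language model may not always follow such instructions. A minimal defensive sketch is shown below. `strip_code_fence` is a hypothetical helper, not part of `aikg`, and the notebook does not show whether `generate_sparql` already performs a similar cleanup.

```python
import re

def strip_code_fence(text: str) -> str:
    """Return the content of the first ``` fenced block, or the text itself if none."""
    match = re.search(
        r"```(?:sparql)?\s*\n(.*?)```",
        text,
        flags=re.DOTALL | re.IGNORECASE,
    )
    body = match.group(1) if match else text
    return body.strip()

# Example of a reply that ignores the "no back-quotes" instruction.
fenced_reply = "```sparql\nSELECT * WHERE { ?s ?p ?o }\n```"
clean_query = strip_code_fence(fenced_reply)
print(clean_query)
```

A query that is already free of fences passes through unchanged apart from surrounding whitespace.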


We then execute it against the knowledge graph:

In [24]:
from aikg.utils.rdf import query_kg
results = query_kg(kg, query)
print(results)

[['softwareName'], ['SDSC-ORD/gimie'], ['SDSC-ORD/zarr_linked_data']]
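
The result printed above is a list of rows in which the first row holds the variable names. Assuming that layout holds in general (it is inferred from this single output, not from the `aikg` documentation), the rows can be turned into dictionaries keyed by variable name. `rows_to_records` is a hypothetical helper introduced for this sketch.

```python
def rows_to_records(rows):
    """Turn [header, row1, row2, ...] into a list of {variable: value} dicts."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]

# The result shown above, copied verbatim.
sample = [['softwareName'], ['SDSC-ORD/gimie'], ['SDSC-ORD/zarr_linked_data']]
records = rows_to_records(sample)
for record in records:
    print(record["softwareName"])
```

This form is convenient when a query selects several variables, since each value can be looked up by name instead of by column position.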


We can now generate a human-readable answer from the results of the query:

In [32]:
from aikg.utils.chat import generate_answer
generate_answer(QUESTION, query, results, answer_chain)

'The query returned two softwares written in Python: SDSC-ORD/gimie and SDSC-ORD/zarr_linked_data.'