## Ask a question about your Ontology file

This notebook answers a user's natural-language question by querying an ontology file.

Workflow outline:
- Load the graph.
- Extract the labels from the graph.
- Find which labels are related to the user's question. A sentence-transformer model embeds the question and every label, then ranks the labels by their cosine similarity to the question.
- Find the individuals of the related classes and the literal (datatype) values attached to them.
- Pass these individuals and values into the prompt together with the user's question, and let the LLM evaluate the data and generate an answer in natural language.
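
The label-ranking step in the outline can be illustrated without downloading a model. The sketch below is only an illustration: it swaps the sentence transformer for a toy bag-of-words vector (an assumption made so the example is self-contained), and the helper names `embed`, `cosine`, and `rank_labels` as well as two of the sample labels are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding": an assumption standing in for a sentence transformer.
    return Counter(text.lower().replace("?", " ").split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_labels(question, labels, n=3):
    # Score every label against the question and keep the n best.
    q = embed(question)
    scored = [(label, cosine(q, embed(label))) for label in labels]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Sample labels: the first comes from the notebook output, the others are made up.
labels = ["maximum height without special approval", "vehicle width", "road surface"]
top = rank_labels("What is the maximum height limitation in the traffic?", labels, n=2)
for label, score in top:
    print(f"{label}: {score:.4f}")
```

A real embedding model also matches labels that share no words with the question, which this toy version cannot do.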

Note: Ollama, an open-source tool for running LLMs locally, can be downloaded [here](https://ollama.com).
The Llama 3.1 8B model was used in this notebook.

Future Work:
- Investigate the capabilities of GraphRAG.
- Add an initial LLM step that handles the user's query and activates GraphRAG functions.
- Let the LLM write SPARQL queries if GraphRAG does not work, so the method can be used for generic purposes.
- Use LangGraph to build a more complex workflow, possibly with a ReAct approach that lets the LLM plan the workflow (which functions to use).

In [161]:
from langchain_community.graphs import RdfGraph

def loadGraph(ontology_source="eSCRO_Developing.ttl"):
    graph = RdfGraph(
        source_file=ontology_source,
        #source_file="a3cae519-0ece-48bd-b55e-dd271dd895fc.ttl",
        standard="rdf",
        local_copy="local_copy.ttl",
    )
    graph.load_schema()
    return graph

In [178]:
def extract_labels_from_graph(graph):
    query = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?class ?label
    WHERE {
    ?class rdfs:label ?label .
    }
    """
    class_labels = graph.query(query)
    
    # Extract labels from the query result and ensure they are strings
    labels = []
    for _, label in class_labels:
        if label is not None and isinstance(label, str):
            labels.append(label.strip())
    
    print(f"Extracted {len(labels)} valid labels from the graph.")
    
    return labels


In [186]:
import time
from sentence_transformers import SentenceTransformer, util

# NLP function to find the labels most related to the question
def find_related_labels_byNLP(question, label_list, n=1):
    start = time.time()
    model = SentenceTransformer('paraphrase-MiniLM-L6-v2')  # Mini model for speed

    def find_top_n_similar(query, items, n):
        query_embedding = model.encode(query, convert_to_tensor=True)
        item_embeddings = model.encode(items, convert_to_tensor=True)

        # Calculate cosine similarity between query and all items
        similarities = util.pytorch_cos_sim(query_embedding, item_embeddings)[0]

        # Get the indices of the top 'n' most similar items (n cannot exceed the item count)
        top_n_indices = similarities.topk(min(n, len(items))).indices

        return [(items[i], similarities[i].item()) for i in top_n_indices]

    # Find the top n similar labels
    top_n_similar = find_top_n_similar(question, label_list, n)

    end = time.time()
    latency = end - start
    print("Processing time: ", latency)

    most_similar_labels = []
    print(f"The top {n} most similar items to '{question}' are:")
    for item, score in top_n_similar:
        print(f"'{item}' with a similarity score of {score:.4f}")
        most_similar_labels.append(item)

    return most_similar_labels


In [181]:
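# Find the individuals of the classes matching the labels, then collect their literal property values
def find_Individuals_datatypes(graph, top10_labels):
    # Function to get the class URI based on a label
    def get_class_from_label(label):
        query = f"""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?class WHERE {{
            ?class rdfs:label ?label .
            FILTER (lcase(str(?label)) = "{label.lower()}")
        }}
        """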
        results = graph.query(query)
        if results:
            return results[0]["class"]
        else:
            print(f"No classes found for label: {label}")
            return None

    # Function to get individuals
    def get_individuals_of_class(class_uri):
        query = f"""
        SELECT ?individual WHERE {{
            ?individual a <{class_uri}>.
        }}
        """
        results = graph.query(query)
        individuals = [result["individual"] for result in results]
        return individuals
    
    individuals_list = [] 

    for label in top10_labels:
        class_uri = get_class_from_label(label) 
        if class_uri:
            print(f"Class URI for label '{label}': {class_uri}")
            individuals = get_individuals_of_class(class_uri)  
            if individuals:
                individuals_list.extend(individuals)  
            print(f"Individuals of class '{label}': {individuals}")
        else:
            print(f"No class found for label '{label}'")

    # Function to get data properties of an individual
    def get_data_properties_of_individual(ind_list):
        data_properties_list = []
        for ind in ind_list:
            query = f"""
            SELECT ?property ?value WHERE {{
                <{ind}> ?property ?value .
                FILTER isLiteral(?value)
            }}
            """
            results = graph.query(query)
            data_properties = [(result["property"], result["value"]) for result in results]
            data_properties_list.extend(data_properties) 
        return data_properties_list

    # Return the data properties 
    return get_data_properties_of_individual(individuals_list)
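
In `get_class_from_label` the label is interpolated directly into the SPARQL string, so a label containing a double quote or a backslash would produce an invalid query. The sketch below shows one possible guard. The helpers `escape_sparql_literal` and `build_label_query` are hypothetical names introduced here for illustration; they are not part of the notebook, rdflib, or LangChain.

```python
def escape_sparql_literal(text):
    # Escape a string so it can sit inside a double-quoted SPARQL literal.
    replacements = [
        ("\\", "\\\\"),  # backslash first, so later escapes are not doubled
        ('"', '\\"'),
        ("\n", "\\n"),
        ("\r", "\\r"),
        ("\t", "\\t"),
    ]
    for raw, escaped in replacements:
        text = text.replace(raw, escaped)
    return text

def build_label_query(label):
    # Build the label lookup query with the escaped, lower-cased label.
    safe = escape_sparql_literal(label.lower())
    return f"""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?class WHERE {{
        ?class rdfs:label ?label .
        FILTER (lcase(str(?label)) = "{safe}")
    }}
    """

# A made-up label containing quotes, to show the escaping.
query = build_label_query('Height "max"')
print(query)
```

The resulting string could be passed to `graph.query` in place of the inline f-string.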

In [190]:
from langchain.prompts.prompt import PromptTemplate
from langchain_community.llms import Ollama

def LLM_node(question, data):
    def define_label_finding_chain():

        #model = define_llm()
        model =  Ollama(model="llama3.1")

        prompt = PromptTemplate(
            template = """
                You are an assistant that has to answer the user's question with the provided data.
                Evaluate the provided data and extract meaning from it to answer the user's question.
                You must find the answer **only** from the information provided in the "Provided data" section and answer the user's question. If the data contains relevant information, extract the relevant part and include it in your answer.

                Provided data:
                {data}  
                End of the data.

                User's question:
                {question}

                Answer:
                """,
            input_variables=["data", "question"]
        )

        return prompt | model 


    chain = define_label_finding_chain()
    print("\n\n")
    print("--- LLM ANSWER START --- ")
    result = chain.invoke({"question": question, "data": data})
    print(result)
    print("--- LLM ANSWER END --- ")
    # Return the answer so the caller can store it
    return result
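
The `data` argument reaches the prompt as a Python list of `(property, value)` tuples with full URIs. One optional refinement is to shorten each property URI to its local name before building the prompt. The sketch below assumes hypothetical helpers `local_name` and `format_data_for_prompt`, and the sample property URI uses a made-up `example.org` namespace.

```python
import re

def local_name(uri):
    # Keep only the part of the URI after the last '#' or '/'.
    return re.split(r"[#/]", str(uri).rstrip("#/"))[-1]

def format_data_for_prompt(pairs):
    # Turn (property, value) pairs into one readable line per pair.
    lines = [f"- {local_name(prop)}: {value}" for prop, value in pairs]
    return "\n".join(lines)

# Made-up sample pair; the namespace is an assumption for illustration.
pairs = [("https://example.org/ontology/hasSimpleExpressionValue", "4.5m")]
prompt_data = format_data_for_prompt(pairs)
print(prompt_data)
```

The formatted string could then be passed as `data` when calling `LLM_node`.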

## Final step: RUN

For now, I run all of the functions above in sequence.

In the future, I will use LangGraph to call the functions in a more complex workflow.

In [191]:
if __name__ == "__main__":

    ## ASK Your Question
    Question = "What is the maximum height limitation in the traffic?"

    #Load the graph
    graph = loadGraph(ontology_source= "eSCRO_Developing.ttl")
    
    # Extract Labels from the ontology
    label_list = extract_labels_from_graph(graph)
    
    # Find the labels most related to the question
    related_labels = find_related_labels_byNLP(Question, label_list)

    print("Most related labels:", related_labels)

    # Find the individuals of these classes and collect their literal property values
    data_properties = find_Individuals_datatypes(graph, related_labels)

    # Run LLM to answer question in natural language
    result = LLM_node(Question, data_properties)
 

Extracted 2728 valid labels from the graph.




Processing time:  14.563163995742798
The top 3 most similar items to 'What is the maximum height limitation in the traffic?' are:
'maximum height without special approval' with a similarity score of 0.6253
Most related labels: ['maximum height without special approval']
Class URI for label 'maximum height without special approval': http://quantecton.com/kb/eSCRO#MaximumHeightWithoutSpecialApproval
Individuals of class 'maximum height without special approval': [rdflib.term.URIRef('http://quantecton.com/kb/eSCRO#TransportMaxHeight')]



--- LLM ANSWER START --- 
Based on the provided data, it appears that there is a property "hasSimpleExpressionValue" associated with a URI "https://spec.industrialontologies.org/ontology/core/Core/" and a value of "4.5m".

To answer the user's question about the maximum height limitation in traffic, I would extract the relevant part of this information:

"...a value of '4.5m'"

This suggests that the maximum height limitation in traffic is 4.5 meters.

A

In [172]:
# from langgraph.graph import Graph

# workflow = Graph()

# workflow.add_node("label", extract_labels_from_graph)
# workflow.add_node("nlp", find_related_labels_byNLP)

# workflow.add_edge('label','nlp')

# workflow.add_node("Individuals", find_Individuals_datatypes)

# workflow.add_edge('nlp','Individuals')

# workflow.add_node("llm", LLM_node)

# workflow.add_edge('Individuals','llm')

# workflow.set_entry_point('label')
# workflow.set_finish_point('llm')

# app = workflow.compile()
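
The functions in this notebook take different arguments (for example, the NLP step needs both the question and the label list), so chaining them as nodes requires some shared state. The sketch below shows that idea in plain Python, without LangGraph and without making any claim about its API. The runner `run_pipeline` and the stub steps are hypothetical stand-ins, not the notebook's real functions.

```python
def run_pipeline(steps, state):
    # Run each step in order; every step reads the state dict and returns updates.
    for _, step in steps:
        updates = step(state)
        state = {**state, **updates}
    return state

# Stub steps (hypothetical stand-ins for the notebook functions).
def label_step(state):
    return {"labels": ["maximum height", "vehicle width"]}

def nlp_step(state):
    # Keep the labels that share at least one word with the question.
    words = set(state["question"].lower().split())
    related = [label for label in state["labels"] if words & set(label.split())]
    return {"related": related}

steps = [("label", label_step), ("nlp", nlp_step)]
final = run_pipeline(steps, {"question": "What is the maximum height"})
print(final["related"])
```

Wrapping each real function in a small adapter with this single-dict signature would let the steps be wired in sequence the same way.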