# Application of LLM-Augmented Knowledge Graphs for Wirearchy Management

### Universitat Oberta de Catalunya
### Data Science Master's Degree - Data Analysis and Big Data.<br>Final project

- Author: Xavier Ventura de los Ojos
- Project Supervisor: Francesc Julbe López
- Coordinating Professor: Albert Solé Ribalta
- Date of submission: 06/2024

## POC 1: How to query a KG using Natural Language

In this POC we showcase how prompt engineering techniques using state-of-the-art Large Language Models (LLMs) can provide a Knowledge Graph with natural language query capabilities.

Building an application with chat or Q&A capabilities is out of scope for this POC. The goal is to assemble the necessary building blocks and to measure how these tools perform on a set of predefined questions.

Instead of an interactive UI, the code in this notebook runs tests on a question dataset (questions.json) using the indicated configuration (configurations.json). Questions, answers and related metadata are stored in JSON files (poc1_answers folder) for later analysis.

For a detailed analysis of the interactions with the LLMs, it is recommended to configure [LangSmith](https://smith.langchain.com).

These are the LLMs in scope of the POC:


|Vendor| Model | Description |
|---|---|--|
|[OpenAI](https://openai.com)|[GPT-3.5 Turbo](https://platform.openai.com/docs/models/gpt-3-5-turbo)|The latest GPT-3.5 Turbo model with higher accuracy at responding in requested formats and a fix for a bug which caused a text encoding issue for non-English language function calls. Returns a maximum of 4,096 output tokens.|
|[OpenAI](https://openai.com)|[GPT-4o](https://platform.openai.com/docs/models/gpt-4o)|Our most advanced, multimodal flagship model that’s cheaper and faster than GPT-4 Turbo. Currently points to gpt-4o-2024-05-13.|
|[Anthropic](https://www.anthropic.com)|[Claude 3 Haiku](https://docs.anthropic.com/en/docs/models-overview)|Our fastest and most compact model, designed for near-instant responsiveness and seamless AI experiences that mimic human interactions|
|[Anthropic](https://www.anthropic.com)|[Claude 3 Opus](https://docs.anthropic.com/en/docs/models-overview)|Our most powerful model, delivering state-of-the-art performance on highly complex tasks and demonstrating fluency and human-like understanding|

> NOTE: Additional models can easily be added to the test by adding the corresponding chat model to the *LLMS* dict and an entry to *configurations.json*. More details in the corresponding sections below.


## Prerequisites and requirements

The queries are executed on a Neo4j graph database that must be created and populated as a prerequisite for POCs 1 and 2.
Detailed instructions are available as part of the thesis project documentation. 

The following Python components and modules are used:

| Component | Description |
| ----- | ---- |
| [LangChain](https://python.langchain.com/v0.1/docs/get_started/introduction) | Framework for developing applications powered by LLMs |
| [GraphCypherQAChain](https://python.langchain.com/docs/integrations/graphs/neo4j_cypher) | Chain to interact with Neo4j graph database |
| [OpenAI API Keys (login required)](https://platform.openai.com/api-keys) | API Keys to access OpenAI GPT-3.5 Turbo and GPT-4o models|
| [Anthropic API Keys](https://docs.anthropic.com/en/api/getting-started) | API Keys to access Anthropic Haiku and Opus Models| 

## Initial setup 

The following API keys and the Neo4j password are expected to be available as environment variables:

- OPENAI_API_KEY: OpenAI API KEY.
- ANTHROPIC_API_KEY: Anthropic API KEY.
- LANGCHAIN_API_KEY: (Recommended).
- NEO4J_PWD: Password of the Neo4j user.
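
A pre-flight check for the variables listed above could look like the following minimal sketch. The helper name `missing_env_vars` and the split into required and optional lists are assumptions made for illustration; they are not part of the notebook.

```python
import os

# Variable names taken from the list above; LANGCHAIN_API_KEY is only recommended.
REQUIRED_VARS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "NEO4J_PWD"]
OPTIONAL_VARS = ["LANGCHAIN_API_KEY"]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping.

    Hypothetical helper: `environ` defaults to os.environ and can be replaced
    by a plain dict for testing.
    """
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Report problems without raising, so the check is safe to run in any cell.
missing_required = missing_env_vars(REQUIRED_VARS)
missing_optional = missing_env_vars(OPTIONAL_VARS)
if missing_required:
    print("Missing required variables:", ", ".join(missing_required))
if missing_optional:
    print("Missing optional variables:", ", ".join(missing_optional))
```

Running this before creating the chat models and the graph connection surfaces a missing key early, instead of as an authentication error in the middle of a test run.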


Execute the following %pip commands to install required packages if needed.

In [1]:
# Import required modules

from langchain.chains import GraphCypherQAChain
from langchain.chains.graph_qa.prompts import CYPHER_GENERATION_PROMPT, CYPHER_QA_PROMPT
from langchain_community.graphs import Neo4jGraph
from langchain_community.callbacks.manager import get_openai_callback

from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic

from langchain.prompts.prompt import PromptTemplate

import neo4j
import os, glob
import json, pickle, jsonpickle
import datetime, time

In [2]:
# Constants

# Project name in LangSmith
LANGCHAIN_PROJECT = "TFM"

# Folders
ANSWERS_FOLDER = "poc1_answers"
CONFIGS_FOLDER = "poc1_config"

# Neo4j
NEO4J_URL = "bolt://localhost:7687"
NEO4J_USERNAME = "neo4j" 

# Text format
TEXT_BOLD  = '\033[1m'
TEXT_BLUE  = '\033[94m'
TEXT_GREEN = '\033[92m'
TEXT_END   = '\033[0m' 

## Load the test configuration

The test consists of the following components:

- Dataset with 20 predefined **questions** related to the graph.
- 2 **prompt templates** to guide the LLMs and improve their performance.
- 8 different **scenarios**, which are the chosen combinations of the 4 models and the 3 templates (1 default + 2 custom).

The **questions** dataset is loaded from the *poc1_config/questions.json* file with the following structure:

```
[
    {
        "id": "Q001",
        "question": "Qui és el president de la Generalitat de Catalunya?"
    },
    {
        "id": "Q002",
        "question": "Qui té actualment el càrrec de 'President de la Generalitat de Catalunya'?"
    },
    ...
]
```

The different execution **scenarios** are loaded from the *poc1_config/configurations.json* file with the following structure:

```
{
"GPT35_P0":  {
        "id": "GPT35_P0",
        "description": "gpt-3.5-turbo default prompt",
        "llm": "gpt-3.5-turbo"
    },
"GPT4O_P0": {
        "id": "GPT4O_P0",
        "description": "gpt-4o default prompt",
        "llm": "gpt-4o"
    },
    ...
"OPUS_P1":  {
        "id": "OPUS_P1",
        "description": "claude-3-opus complex prompt 1",
        "llm": "claude-3-opus",
        "cypher_prompt_template": "prompt_template_1",
        "requests_per_minute": 2
    },
    ...
}
```

The different **prompt templates** are loaded from the *poc1_config/prompt_template_\*.txt* files.
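
Because each scenario refers to an LLM and, optionally, to a prompt template by name, a quick consistency check can catch typos before any tokens are spent. The sketch below is an assumption-laden illustration: `validate_scenarios` is a hypothetical helper, and the sample data only mirrors the structure shown above.

```python
def validate_scenarios(configurations, llm_names, template_names):
    """Return a list of human-readable problems found in the scenario definitions.

    Hypothetical helper: checks that each entry's "id" matches its key, that
    "llm" is a known model name, and that an optional "cypher_prompt_template"
    refers to a known template.
    """
    problems = []
    for key, cfg in configurations.items():
        if cfg.get("id") != key:
            problems.append(f"{key}: 'id' does not match the dictionary key")
        if cfg.get("llm") not in llm_names:
            problems.append(f"{key}: unknown llm {cfg.get('llm')!r}")
        template = cfg.get("cypher_prompt_template")
        if template is not None and template not in template_names:
            problems.append(f"{key}: unknown prompt template {template!r}")
    return problems


# Illustrative data following the configuration structure shown above.
example_scenarios = {
    "GPT35_P0": {
        "id": "GPT35_P0",
        "description": "gpt-3.5-turbo default prompt",
        "llm": "gpt-3.5-turbo",
    },
    "OPUS_P1": {
        "id": "OPUS_P1",
        "description": "claude-3-opus complex prompt 1",
        "llm": "claude-3-opus",
        "cypher_prompt_template": "prompt_template_1",
        "requests_per_minute": 2,
    },
}

problems = validate_scenarios(
    example_scenarios,
    llm_names={"gpt-3.5-turbo", "claude-3-opus"},
    template_names={"prompt_template_1"},
)
print(problems or "All scenarios are consistent")
```

In the notebook, the same check could be fed with the loaded configurations, the keys of the *LLMS* dict and the keys of the loaded templates.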



In [3]:
# Helper json functions 

def save_json(filename, data):
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)

def load_json(path, filename): 
    file = os.path.join(path,filename)
    with open(file, 'r', encoding='utf-8') as fp:  # explicit encoding: the questions contain accented characters
        data = json.load(fp)
    return data

def save_jsonpickle(filename, data):
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(jsonpickle.encode(data, unpicklable= False, indent=4))

# Templates are stored in txt files for convenience
def load_templates():
    """Load the templates from the TXT files located in the CONFIGS_FOLDER.
    
    Returns:
        A dict with the templates. The key is the filename after removing the ".txt" extension.
    """
    templates={}
    for filename in glob.glob(os.path.join(CONFIGS_FOLDER,"prompt_template*.txt")):
        with open(filename, 'r', encoding='utf-8') as file:
            templates[filename.split(os.sep)[-1][:-4]] = file.read()
    return templates

In [4]:
# Load the configurations

TEMPLATES = load_templates()
CONFIGURATIONS = load_json(CONFIGS_FOLDER,"configurations.json")
QUESTIONS = load_json(CONFIGS_FOLDER,"questions.json")

In [5]:
print("Configurations:\n")
for k,a in CONFIGURATIONS.items(): 
    print( TEXT_BOLD + a['id'] + TEXT_END ,a['description'], TEXT_BLUE+ a.get("cypher_prompt_template","") + TEXT_END)

Configurations:

GPT35_P0 gpt-3.5-turbo default prompt
GPT4O_P0 gpt-4o default prompt
GPT35_P1 gpt-3.5-turbo complex prompt 1 prompt_template_1
GPT4O_P1 gpt-4o complex prompt 1 prompt_template_1
HAIKU_P1 claude-3-haiku complex prompt 1 prompt_template_1
OPUS_P1 claude-3-opus complex prompt 1 prompt_template_1
GPT4O_P2 gpt-4o complex prompt 2 prompt_template_2
OPUS_P2 claude-3-opus complex prompt 2 prompt_template_2


In [6]:
# Print templates
print(TEXT_BOLD + "Cypher prompt templates:\n\n" + TEXT_BLUE + "DEFAULT:\n" + TEXT_END)
print(CYPHER_GENERATION_PROMPT.template)

for k,a in TEMPLATES.items(): print("\n\n" + TEXT_BLUE + TEXT_BOLD + k +":" + TEXT_END +"\n\n" + a)

Cypher prompt templates:

DEFAULT:

Task:Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
Schema:
{schema}
Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.

The question is:
{question}


prompt_template_1:

Task:Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
Always use Cypher CONTAINS expressions when searching on the "subject", "name" or "role" attributes.
Always use variables to bind nodes and relationships.
Return only the properties that are

In [7]:
print ("Questions:\n")
for q in QUESTIONS: print(TEXT_BOLD + q["id"] + TEXT_END, q["question"])

Questions:

Q001 Qui és el president de la Generalitat de Catalunya?
Q002 Qui té actualment el càrrec de 'President de la Generalitat de Catalunya'?
Q003 Qui és el responsable de Direcció General de Turisme?
Q004 Quina és la estructura del Parlament de Catalunya?
Q005 Quines reunions s'han celebrat amb el Grup Universitat Oberta de Catalunya?
Q006 Qui es va reunir amb el grup Universitat Oberta de Catalunya?
Q007 Qui és Jaume Giró Ribas?
Q008 Quan es va reunir en Tomàs Roy Català amb el grup Universitat Oberta de Catalunya? I quin tema es va tractar?
Q009 Amb qui s'ha reunit en Tomàs Roy Català?
Q010 Llista les 10 persones amb més carrecs
Q011 Parlam sobre Elisenda Guillaumes Cullell?
Q012 Quina és la relació entre 'Miquel Salazar Canalda' i 'Joan Vintró Castells'? Descriu la relació pas a pas.
Q013 Quins grups s'han reunit per tractar sobre la sequera? Incloure el tema de la reunió

# Instantiate LangChain chat LLMs

The LLMS dict holds the LangChain chat models used to access the LLMs in the scope of the POC.


In [8]:
# Create the chats to interact with the LLMs.
LLMS={}

# OpenAI models
LLMS["gpt-3.5-turbo"] = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
# LLMS["gpt-4-turbo"]   = ChatOpenAI(temperature=0, model="gpt-4-turbo")
LLMS["gpt-4o"]        = ChatOpenAI(temperature=0, model="gpt-4o") # gpt-4o released 2024-05-13

# Anthropic models: https://docs.anthropic.com/en/docs/models-overview#model-comparison
LLMS["claude-3-haiku"] = ChatAnthropic(temperature=0, model_name="claude-3-haiku-20240307")
LLMS["claude-3-opus"]  = ChatAnthropic(temperature=0, model_name="claude-3-opus-20240229")


# Procedures to execute the Q&A tests

- now(): returns the current timestamp in ISO format.
- create_chain(configuration): creates a GraphCypherQAChain object from the given configuration.
- ask_questions(configuration, questions, run_id=None, comment=None): invokes the given model to answer a list of questions. The results are stored in a JSON file for analysis.
- print_run(run): prints the run results for proper documentation.


In [9]:

def now():
    return datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    
def create_chain(configuration):
    """Creates an instance of a GraphCypherQAChain from a given configuration
    The configuration indicates the LLM (Chat) to use and the prompt template to
    generate the Cypher queries.

    In all cases the Department label is excluded to provide the LLM with a simpler graph schema.
    This does not exclude any node because Departments are also Organizations.
    
    Args:
        configuration: Dict with the required configuration (from configurations.json file).

    Returns:
        A GraphCypherQAChain chain 
    """

    # Create a prompt for Cypher code generation if there is a template specified in the configuration
    cypher_prompt = None
    cypher_prompt_template = configuration.get("cypher_prompt_template")
    if cypher_prompt_template:
        cypher_prompt = PromptTemplate(
            input_variables=["schema", "question"], template=TEMPLATES[cypher_prompt_template]
        )

    # Since Departments are also Organizations, we exclude them to provide the LLM with a slightly simpler schema.
    exclude_types= ["Department"]
    
    chain = GraphCypherQAChain.from_llm(
        llm = LLMS[configuration["llm"]],
        cypher_prompt = cypher_prompt,
        graph=graph, 
        verbose=False, 
        return_intermediate_steps=True,
        exclude_types = exclude_types
    )
    return chain


def ask_questions(configuration, questions, run_id=None, comment=None):
    """Sends the list of questions to the GraphCypherQAChain and stores the results in a JSON file.

    The GraphCypherQAChain answers the questions in three steps. 
        1) The question in Natural Language is converted to a Cypher query.
        2) The Cypher query is sent to Neo4j.
        3) The results of the Cypher query and the original query are sent to the LLM to get the answer in natural language.

    The questions and answers plus execution metadata are stored in the "poc1_answers/<run_id>.json" file for further evaluation.
    For models with limited throughput, a pause is introduced to respect the corresponding requests-per-minute rate.
    
    If LangSmith is configured (recommended) questions are tagged with "POC1" and metadata with 
    configuration id, llm name and question id is provided.

    Args:
        configuration: Dict with the configuration to create a GraphCypherQAChain.
        questions: List of questions to send to the LLM via the Chain.
        run_id: Name of the JSON file where the answers to the questions are stored.
        comment: Optional comment stored with the run results.

    Returns:
        A tuple with the Dict with the Questions and Answers and the GraphCypherQAChain
    """

    print("Run Id:", run_id)
    print("Configuration:",configuration["id"],configuration["description"])
    print()
    
    chain = create_chain(configuration)

    sleep_between_questions = 60.0 / configuration.get("requests_per_minute") if configuration.get("requests_per_minute") else 0.0
        
    results = {"run_id": run_id,"comment": comment, "start_time": now()}
    run_start_time = time.time()
    
    # Tags and metadata information for LangSmith
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_PROJECT"] = LANGCHAIN_PROJECT
    config = {"tags":["POC1"],"metadata":{"config": configuration["id"],"llm": configuration["llm"]}}
    if run_id: config["metadata"]["run_id"] = run_id

    answers = []
    for idx,q in enumerate(questions):

        # Wait for next call to avoid error 429 with Claude
        if idx > 0 and (sleep_between_questions - result_metadata["query_time"])>0:
            wait_s = sleep_between_questions - result_metadata["query_time"]
            print(" wait(s):", wait_s)
            time.sleep(wait_s)
        
        start_time = time.time()
        query_start_time = now()
        result = {"question": q}
        result_metadata ={}

        # Send question to the LLM via the Chain
        print("\nQuestion: ", q)

        config["metadata"]["question"] = q["id"]
        
        try:
            with get_openai_callback() as cb:
                result["answer"] = chain.invoke(q["question"],config = config)
            if cb.total_tokens > 0:
                result_metadata.update({"prompt_tokens": cb.prompt_tokens,
                                        "completion_tokens": cb.completion_tokens,
                                        "total_tokens": cb.total_tokens,
                                        "total_cost": cb.total_cost})

        except Exception as e:
            result["answer"] = {"query": q["question"], "result": str(e)}
        
        print("Answer: ", result["answer"]["result"])
        
        result_metadata["start_time"] = query_start_time
        result_metadata["end_time"] = now()
        result_metadata["query_time"] = time.time() - start_time
        result["metadata"] = result_metadata
        
        answers.append(result)

    results["end_time"] = now()
    results["total_time"] = time.time() - run_start_time
    results["num_questions"] = len(answers)
    results["configuration"] = configuration
    results["questions"] = answers

    # Results are saved in a json file for further analysis.
    if run_id:
        save_jsonpickle(os.path.join(ANSWERS_FOLDER,f"{run_id}.json"), results)
    
    return (results, chain)
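
The pause between questions used in `ask_questions` comes down to simple arithmetic: the minimum interval between calls is 60 seconds divided by the allowed requests per minute, minus the time the previous question already took. A minimal standalone sketch of that calculation follows; the function name `pause_before_next_call` is hypothetical and not part of the notebook.

```python
def pause_before_next_call(requests_per_minute, last_query_time):
    """Seconds to wait so that consecutive calls respect the given rate.

    Hypothetical helper mirroring the pacing logic: no limit configured
    (None or 0) means no pause, and a slow previous call needs no extra wait.
    """
    if not requests_per_minute:
        return 0.0
    interval = 60.0 / requests_per_minute
    return max(0.0, interval - last_query_time)


# With 2 requests per minute, calls must be at least 30 seconds apart.
print(pause_before_next_call(2, 12.5))     # previous call took 12.5 s
print(pause_before_next_call(2, 45.0))     # previous call already exceeded the interval
print(pause_before_next_call(None, 3.0))   # no rate limit configured
```

Subtracting the previous query time keeps the overall run as short as the rate limit allows, rather than always sleeping for the full interval.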

In [10]:
def print_run(run):
    """ Helper procedure to produce a user friendly printed version of a given run (JSON).

    Args:
        run: Dict with the run (JSON) to print.
    """
        
    print("Run ID:", TEXT_BOLD + str(run.get("run_id")) + TEXT_END, run.get("comment"))  # run_id may be None
    print("Config: " + run["configuration"]["id"] + " " + run["configuration"]["description"])
    print("Time:  ", run.get("start_time"), " - ", run.get("end_time"), "\n\nQuestions:")

    for q in run["questions"]:
        print("\n"+ TEXT_BOLD + q["question"]["id"] + " " + TEXT_BLUE + q["question"]["question"] + TEXT_END +"\n")
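        # Questions that failed with an exception have no intermediate steps, so fall back to an empty query.
        steps = q["answer"].get("intermediate_steps") or [{}]
        print(TEXT_GREEN + steps[0].get("query", "") + TEXT_END)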
        print("\n" + q["answer"]["result"])


## Execute the tests

### Connect to Graph
If using Neo4j Desktop, ensure that the Neo4j Graph Engine is up and running.


In [11]:
# Connecting to the Neo4j graph (Desktop)
graph = Neo4jGraph(url = NEO4J_URL, username = NEO4J_USERNAME, password = os.environ['NEO4J_PWD'])

# Print the graph schema
print(graph.schema)

Node properties:
Organization {last_event_date: DATE, code: STRING, id: INTEGER, name: STRING, first_event_date: DATE, pk: STRING}
Person {name: STRING, pk: STRING}
Department {name: STRING, id: INTEGER, first_event_date: DATE, last_event_date: DATE, pk: STRING, code: STRING}
Group {name: STRING, url: STRING, proporals: STRING, type: STRING, id: STRING, has_events: BOOLEAN, pk: STRING, mission: STRING}
Event {time: LOCAL_TIME, date: DATE, subject: STRING}
Agreement {subject: STRING, signees_gencat: STRING, signees_other: STRING, signature_date: DATE, validity_date: DATE, code: STRING, title: STRING, document: STRING}
Relationship properties:
RESPONSIBLE_OF {role: STRING, date: DATE, date_from: DATE, date_to: DATE}
PARTICIPATE {role: STRING}
The relationships:
(:Organization)-[:CHILD_OF]->(:Organization)
(:Organization)-[:CHILD_OF]->(:Department)
(:Organization)-[:PARTICIPATE]->(:Event)
(:Person)-[:RESPONSIBLE_OF]->(:Organization)
(:Person)-[:RESPONSIBLE_OF]->(:Department)
(:Person)-[:P

### Execute each test run

The *ask_questions* function is used to test the different configurations.
The test is executed in two stages:

* Stage 1: Using the initial graph created from the CSV files where the **Agreement** nodes are NOT connected to the rest of the graph.


Convert the following cell from "Raw" to "Code" to perform the actual tests.<br>Otherwise, the JSON files with the results of each run are already available in the "poc1_answers" folder.

**REMEMBER**: Running the tests consumes LLM tokens and therefore incurs some cost (it should be less than one USD, though). Actual token usage and a cost estimate are available in the JSON files for the OpenAI models. Token usage per call is available in LangSmith.
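
As a rough illustration of how the token and cost metadata stored with each run could be totalled, the sketch below sums the per-question fields written by `ask_questions`. The helper name `summarize_usage` and the numbers in the sample run are made up for the example; real values come from the files in the "poc1_answers" folder.

```python
def summarize_usage(run):
    """Sum the token and cost metadata of all questions in a run dict.

    Hypothetical helper: questions without usage metadata (for example,
    calls to non-OpenAI models) simply contribute zero.
    """
    totals = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0, "total_cost": 0.0}
    for question in run.get("questions", []):
        metadata = question.get("metadata", {})
        for key in totals:
            totals[key] += metadata.get(key, 0)
    return totals


# Made-up sample with the same shape as a saved run; the values are not real results.
sample_run = {
    "run_id": "example",
    "questions": [
        {"metadata": {"prompt_tokens": 100, "completion_tokens": 20, "total_tokens": 120, "total_cost": 0.25}},
        {"metadata": {"prompt_tokens": 200, "completion_tokens": 40, "total_tokens": 240, "total_cost": 0.5}},
        {"metadata": {"query_time": 1.5}},  # no token information recorded
    ],
}

print(summarize_usage(sample_run))
```

A saved run can be loaded with `json.load` and passed to the same function to check the actual spend of a test.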


* Stage 2: Run another set of tests AFTER the completion of the POC2 (Enrich graph). After the POC2 the nodes belonging to the **Agreement** 2022/9/0304 should be connected to "Person", "Organization" and "Group" nodes.

In this case we are interested in the last 3 questions of the dataset, which specifically refer to that agreement.

AFTER running the POC2, the graph schema needs to be reloaded by executing the Connect to Graph cell above.<br>
The new :SIGNED and :REPRESENTS relationships should show in the cell output.
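
One way to confirm that the reloaded schema contains the new relationship types is a plain text search over the schema string. This is only a sketch: `missing_relationship_types` is a hypothetical helper, and the sample schema text is invented to match the `(:A)-[:REL]->(:B)` format printed by the Connect to Graph cell.

```python
def missing_relationship_types(schema_text, expected=("SIGNED", "REPRESENTS")):
    """Return the expected relationship types that do not appear in the schema text.

    Hypothetical helper: looks for the "[:TYPE]" pattern used in the printed schema.
    """
    return [rel for rel in expected if f"[:{rel}]" not in schema_text]


# Invented schema fragments for illustration only; the node labels are assumptions.
schema_before = "(:Person)-[:RESPONSIBLE_OF]->(:Organization)"
schema_after = (
    "(:Person)-[:RESPONSIBLE_OF]->(:Organization)\n"
    "(:Person)-[:SIGNED]->(:Agreement)\n"
    "(:Person)-[:REPRESENTS]->(:Group)"
)

print(missing_relationship_types(schema_before))
print(missing_relationship_types(schema_after))
```

In the notebook, the text to check would be the value of `graph.schema` after the connection cell has been executed again.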

**POC 1 Notebook ends here.**