# Enhanced Question Answering Integrating Unstructured and Graph Knowledge using Neo4j and LangChain

In this notebook, we walk through the implementation of a question-answering system that combines Neo4j and LangChain. The step-by-step guide shows how to integrate unstructured data with graph knowledge, using the Neo4j Vector Index and GraphCypherQAChain to retrieve context and Mistral-7B to generate the final response.

![neo4j_mistral_architecture](../assets/img/neo4j_mistral_architecture.png)

In [None]:
%pip install langchain openai wikipedia tiktoken neo4j python-dotenv transformers
%pip install -U sagemaker

## Neo4j Vector Index

We start by importing the libraries and modules needed for dataset preparation, the Neo4j Vector Index, and text generation with Mistral 7B. The dotenv package loads environment variables from a local .env file, which keeps the credentials for the OpenAI API and the Neo4j database out of the notebook itself.

In [None]:
import os
import re
from langchain.vectorstores.neo4j_vector import Neo4jVector
from langchain.document_loaders import WikipediaLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer
from dotenv import load_dotenv

load_dotenv()
os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY')
os.environ["NEO4J_URI"] = os.getenv('NEO4J_URI')
os.environ["NEO4J_USERNAME"] = os.getenv('NEO4J_USERNAME')
os.environ["NEO4J_PASSWORD"] = os.getenv('NEO4J_PASSWORD')
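
Assigning `None` to an `os.environ` key raises a `TypeError`, so the cell above fails with an unhelpful message when a variable is missing from the .env file. The following is a minimal sketch of an explicit check; the helper name `missing_env_vars` and the `REQUIRED_VARS` tuple are illustrative assumptions, not part of the original notebook.

```python
import os

# Names taken from the cell above; the tuple itself is an illustrative assumption.
REQUIRED_VARS = ("OPENAI_API_KEY", "NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD")

def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]

missing = missing_env_vars()
if missing:
    # Report every missing name at once instead of failing on the first one.
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv()` reports all missing credentials before any connection is attempted.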

We use the Wikipedia page of Leonhard Euler for our experiment and the bert-base-uncased tokenizer to count tokens. The WikipediaLoader loads the raw content of the page, which is then split into smaller chunks using RecursiveCharacterTextSplitter from LangChain. The splitter caps each chunk at 200 tokens with an overlap of 20 tokens, which keeps chunks within the context window limits of embedding models and preserves continuity of context across chunk boundaries.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def bert_len(text):
    tokens = tokenizer.encode(text)
    return len(tokens)

In [None]:
raw_documents = WikipediaLoader(query="Leonhard Euler").load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=20,
    length_function=bert_len,
    separators=['\n\n', '\n', ' ', ''],
)

documents = text_splitter.create_documents([raw_documents[0].page_content])

In [None]:
print(len(documents))

17
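
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size chunking with overlap over a list of tokens. It is not the algorithm RecursiveCharacterTextSplitter uses (that one splits recursively on the separators and then merges pieces); the function name `chunk_tokens` and the toy token list are assumptions for illustration only.

```python
def chunk_tokens(tokens, chunk_size=200, chunk_overlap=20):
    """Slide a window of chunk_size tokens, advancing by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Toy input: 450 placeholder tokens instead of real tokenizer output.
words = [f"w{i}" for i in range(450)]
chunks = chunk_tokens(words)
print([len(c) for c in chunks])
```

Each chunk repeats the last 20 tokens of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk.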


The chunked documents are then stored as nodes in a Neo4j vector index. Neo4jVector.from_documents embeds each chunk with OpenAI embeddings and writes the result to the Neo4j graph database.

In [None]:
# Instantiate Neo4j vector from documents
neo4j_vector = Neo4jVector.from_documents(
    documents,
    OpenAIEmbeddings(),
    url=os.environ["NEO4J_URI"],
    username=os.environ["NEO4J_USERNAME"],
    password=os.environ["NEO4J_PASSWORD"]
)

After ingesting the documents into the vector index, we run a vector similarity search for a sample user query and retrieve the top two most similar documents.

In [None]:
query = "Who were the siblings of Leonhard Euler?"
vector_results = neo4j_vector.similarity_search(query, k=2)
for i, res in enumerate(vector_results):
    print(res.page_content)
    if i != len(vector_results)-1:
        print()
vector_result = vector_results[0].page_content

== Early life ==
Leonhard Euler was born on 15 April 1707, in Basel to Paul III Euler, a pastor of the Reformed Church, and Marguerite (née Brucker), whose ancestors include a number of well-known scholars in the classics. He was the oldest of four children, having two younger sisters, Anna Maria and Maria Magdalena, and a younger broth

Leonhard Euler ( OY-lər, German: [ˈleːɔnhaʁt ˈɔʏlɐ] ; 15 April 1707 – 18 September 1783) was a Swiss mathematician, physicist, astronomer, geographer, logician, and engineer who founded the studies of graph theory and topology and
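
Conceptually, the similarity search ranks stored chunks by how close their embeddings are to the query embedding. The sketch below illustrates that ranking with cosine similarity, a common choice for embedding search; the three-dimensional vectors, the chunk names, and the helper names are made up for illustration and do not come from the index built above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical toy embeddings; real OpenAI embeddings have many more dimensions.
toy_index = {
    "early_life": [0.9, 0.1, 0.0],
    "summary": [0.6, 0.4, 0.1],
    "publications": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]
print(top_k(toy_query, toy_index, k=2))
```

In the notebook itself, the embedding of the query and the ranking are handled by `similarity_search`, so none of this needs to be written by hand.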


## Build Knowledge Graph

Inspired by the NaLLM project, we use its open-source code to construct a knowledge graph from unstructured data. Below is a knowledge graph constructed from a single chunk of the Wikipedia article on Leonhard Euler.

![p3_kg](../assets/img/p3_kg.png)

## Neo4j DB QA chain

Next, we import the libraries needed to set up the Neo4j DB QA chain.

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import GraphCypherQAChain
from langchain.graphs import Neo4jGraph

In [None]:
graph = Neo4jGraph(
    url=os.environ["NEO4J_URI"], username=os.environ["NEO4J_USERNAME"], password=os.environ["NEO4J_PASSWORD"]
)

With the Neo4jGraph instance connected to the constructed graph, we print its schema to inspect the node labels, properties, and relationship types.

In [None]:
print(graph.schema)


        Node properties are the following:
        [{'labels': 'Person', 'properties': [{'property': 'name', 'type': 'STRING'}, {'property': 'nationality', 'type': 'STRING'}, {'property': 'death_date', 'type': 'STRING'}, {'property': 'birth_date', 'type': 'STRING'}]}, {'labels': 'Location', 'properties': [{'property': 'name', 'type': 'STRING'}]}, {'labels': 'Organization', 'properties': [{'property': 'name', 'type': 'STRING'}]}, {'labels': 'Publication', 'properties': [{'property': 'name', 'type': 'STRING'}]}]
        Relationship properties are the following:
        []
        The relationships are the following:
        ['(:Person)-[:worked_at]->(:Organization)', '(:Person)-[:influenced_by]->(:Person)', '(:Person)-[:born_in]->(:Location)', '(:Person)-[:lived_in]->(:Location)', '(:Person)-[:child_of]->(:Person)', '(:Person)-[:sibling_of]->(:Person)', '(:Person)-[:published]->(:Publication)']
        


The GraphCypherQAChain abstracts away the details and returns a natural language response to a natural language question (NLQ). Internally, it uses an LLM to generate a Cypher query from the NLQ, runs that query against the graph database, and then uses an LLM again to turn the query results into the final natural language response.

In [None]:
chain = GraphCypherQAChain.from_llm(
    ChatOpenAI(temperature=0), graph=graph, verbose=True
)

In [None]:
graph_result = chain.run("Who were the siblings of Leonhard Euler?")



> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (p:Person {name: 'Leonhard Euler'})-[:sibling_of]->(sibling)
RETURN sibling.name
Full Context:
[{'sibling.name': 'Maria Magdalena'}, {'sibling.name': 'Anna Maria'}]

> Finished chain.


In [None]:
graph_result

'The siblings of Leonhard Euler were Maria Magdalena and Anna Maria.'
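
The three stages the chain runs (question to Cypher, Cypher to rows, rows to answer) can be sketched as a plain function. This is only an illustration of the control flow, not LangChain's implementation: the function `answer_from_graph` and the three `fake_*` stand-ins are hypothetical, and the stubbed query and rows are copied from the verbose output above rather than produced by an LLM or a database.

```python
def answer_from_graph(question, generate_cypher, run_query, summarize):
    """Run the three stages in order and return the final answer string."""
    cypher = generate_cypher(question)   # stage 1: LLM writes a Cypher query
    rows = run_query(cypher)             # stage 2: database executes the query
    return summarize(question, rows)     # stage 3: LLM phrases the rows as text

# Hypothetical stand-ins so the sketch runs without an LLM or a Neo4j instance.
def fake_generate_cypher(question):
    return ("MATCH (p:Person {name: 'Leonhard Euler'})-[:sibling_of]->(sibling) "
            "RETURN sibling.name")

def fake_run_query(cypher):
    return [{"sibling.name": "Maria Magdalena"}, {"sibling.name": "Anna Maria"}]

def fake_summarize(question, rows):
    names = [row["sibling.name"] for row in rows]
    return "The siblings of Leonhard Euler were " + " and ".join(names) + "."

answer = answer_from_graph(
    "Who were the siblings of Leonhard Euler?",
    fake_generate_cypher,
    fake_run_query,
    fake_summarize,
)
print(answer)
```

Separating the stages this way also shows where errors can enter: a wrong Cypher query in stage 1 yields empty or misleading rows, which stage 3 will still phrase fluently.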

## Mistral-7b-Instruct

We set up the Mistral-7B endpoint from Hugging Face within the AWS SageMaker environment.

In [None]:
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

In [None]:
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

hub = {
    'HF_MODEL_ID':'mistralai/Mistral-7B-Instruct-v0.1',
    'SM_NUM_GPUS': json.dumps(1)
}

In [None]:
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface",version="1.1.0"),
    env=hub,
    role=role,
)

We first deploy the model to a SageMaker endpoint. The final response is then produced by constructing a prompt that combines an instruction, the relevant text from the vector index, the relevant information from the graph database, and the user's query. This prompt is passed to the Mistral-7B model, which generates the answer from the provided information.

In [None]:
mistral7b_predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.4xlarge",
    container_startup_health_check_timeout=300,
)

In [None]:
query = "Who were the siblings of Leonhard Euler?"
final_prompt = f"""You are a helpful question-answering agent. Your task is to analyze
and synthesize information from two sources: the top result from a similarity search
(unstructured information) and relevant data from a graph database (structured information).
Given the user's query: {query}, provide a meaningful and efficient answer based
on the insights derived from the following data:

Unstructured information: {vector_result}.
Structured information: {graph_result}.
"""

print(final_prompt)

You are a helpful question-answering agent. Your task is to analyze 
and synthesize information from two sources: the top result from a similarity search 
(unstructured information) and relevant data from a graph database (structured information). 
Given the user's query: Who were the siblings of Leonhard Euler?, provide a meaningful and efficient answer based 
on the insights derived from the following data:

Unstructured information: == Early life ==
Leonhard Euler was born on 15 April 1707, in Basel to Paul III Euler, a pastor of the Reformed Church, and Marguerite (née Brucker), whose ancestors include a number of well-known scholars in the classics. He was the oldest of four children, having two younger sisters, Anna Maria and Maria Magdalena, and a younger broth. 
Structured information: The siblings of Leonhard Euler were Maria Magdalena and Anna Maria..
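
The printed prompt ends with a doubled period ("Anna Maria..") because the graph answer already ends in one and the template adds another. A small helper can normalize both inputs before formatting. The sketch below reuses the template text from the cell above; the names `PROMPT_TEMPLATE` and `build_prompt` are assumptions introduced for illustration.

```python
# Same wording as the f-string above, with named placeholders.
PROMPT_TEMPLATE = """You are a helpful question-answering agent. Your task is to analyze
and synthesize information from two sources: the top result from a similarity search
(unstructured information) and relevant data from a graph database (structured information).
Given the user's query: {query}, provide a meaningful and efficient answer based
on the insights derived from the following data:

Unstructured information: {unstructured}.
Structured information: {structured}.
"""

def build_prompt(query, unstructured, structured):
    """Fill the template, dropping trailing periods so the template adds exactly one."""
    return PROMPT_TEMPLATE.format(
        query=query,
        unstructured=unstructured.strip().rstrip("."),
        structured=structured.strip().rstrip("."),
    )

example_prompt = build_prompt(
    "Who were the siblings of Leonhard Euler?",
    "He was the oldest of four children",
    "The siblings of Leonhard Euler were Maria Magdalena and Anna Maria.",
)
print(example_prompt)
```

In the notebook, the call would be `build_prompt(query, vector_result, graph_result)`.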



In [None]:
response = mistral7b_predictor.predict({
    "inputs": final_prompt,
})

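# Extract the text after "Answer: "; fall back to the full output if the marker is absent
generated_text = response[0]['generated_text']
match = re.search(r"Answer: (.+)", generated_text)
print(match.group(1) if match else generated_text)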

The siblings of Leonhard Euler were Maria Magdalena and Anna Maria.
