<H1>Graph Retrieval Augmented Generation with LangChain and Neo4j</H1>

Questions are provided at the end of the notebook (4 marks)

Submission Due Date: November 1 2024


**Versioning**
- Revised by Alvern Ong Wei Zhe in August 2024
- Revised by Wang Qiuhong in September-October 2024

**Consultation**
- RAG and neo4j: Li Kaixin <likaixin@u.nus.edu>
- Assignment: Goh Kaitlyn Wen Jing <e0774226@u.nus.edu>

<img src="https://drive.google.com/uc?id=1y0rfDpTbkPr32UkP9KMY_01Ff2shCc6H" width="500" height="300"></img>

In this Jupyter notebook, we will walk through how to combine a Large Language Model (LLM) with Retrieval-Augmented Generation (RAG).

The notebook is broken down into 3 main sections and covers two forms of RAG: one with graphs and one without. In Section 3a, we define a RAG model without graphs to show how a standard RAG model is implemented. In Section 3b, we focus on how to bring graphs into the RAG structure with Neo4j. There, we create a hybrid retriever that combines a structured retriever and an unstructured retriever.

An unstructured retriever retrieves data from text sources. This data can take the form of text passages, vector embeddings, etc.

A structured retriever retrieves data from structured data sources such as databases, tables or knowledge graphs.

The hybrid retriever then passes the data from the structured and unstructured retrievers to the LLM, and the LLM produces a response.

<h3><u>TechStack:</u></h3>

**LangChain**
- A framework designed to help developers build and integrate applications that use large language models (LLMs) and other AI models
- For example, we will use the following functions to customize a prompt:
 - **ChatPromptTemplate.from_template**: builds a prompt from a template string with placeholders for dynamic values
 - **ChatPromptTemplate.from_messages**: builds a prompt from a list of structured message objects

**Neo4j**
- A graph database management system used to manage and query graph data
- For example, we will use the following functions in Python to communicate with a Neo4j instance:
 - **Neo4jGraph.add_graph_documents** to import graph data into the database
 - **Neo4jGraph.query** to retrieve information from the Neo4j graph database

<h3><u>The sections are :</u></h3>
<ol>
    <li>Section 1: Initialisation</li>
    <li>Section 2: Load data for RAG</li>
    <li>Section 3a: Defining a RAG structure without graphs</li>
    <li>Section 3b: Defining a RAG structure with graphs</li>
    <ol>
        <li> Step 3b.1: Adding documents to the graph structure</li>
        <li> Step 3b.2: Creating a Hybrid Retrieval for RAG</li>
            <ol>
                <li> Step 3b.2.1: Unstructured data retriever</li>
                <li> Step 3b.2.2: Structured data retriever</li>
                <li> Step 3b.2.3: Hybrid data retriever</li>
            </ol>
        <li> Step 3b.3 Define the RAG chain</li>
    </ol>
</ol>


Reference
- https://medium.com/@jinglemind.dev/mastering-advanced-rag-methods-graphrag-with-neo4j-implementation-with-langchain-42b8f1d05246

<h2>Download required modules through pip </h2>


In [None]:
%pip install --upgrade --quiet  langchain langchain-community langchain-openai langchain-experimental neo4j wikipedia tiktoken yfiles_jupyter_graphs py2neo faiss-cpu pypdf

In [None]:
%pip install unstructured[all-docs]


<h2>Import all required modules</h2>

In [None]:
# LangChain's core runnables for orchestrating tasks in workflows
from langchain_core.runnables import (
    RunnableBranch,
    RunnableLambda,
    RunnableParallel,
    RunnablePassthrough,
)
# LangChain's core components for building custom prompts, handling messages, and parsing outputs
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.prompts.prompt import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.output_parsers import StrOutputParser

# Typing Imports
from typing import Tuple, List

# Integrating LangChain with Neo4j, which can be useful for tasks like combining graph databases and vector stores for advanced AI workflows.
# For example:
# We can use Neo4jGraph to retrieve structured graph data from Neo4j
# We can store and query document embeddings using Neo4jVector
# We can leverage LLMGraphTransformer to help the LLM reason about relationships within the graph
# We can use remove_lucene_chars to ensure that queries passed into Neo4j are well-formatted and don’t cause issues with search.
from langchain_community.graphs import Neo4jGraph
from langchain_community.vectorstores import Neo4jVector
from langchain_community.vectorstores.neo4j_vector import remove_lucene_chars
from langchain_experimental.graph_transformers import LLMGraphTransformer

# Document Loaders and Text Splitters
# from langchain.document_loaders import WikipediaLoader
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import TokenTextSplitter

# LangChain components that interface with OpenAI models
# ChatOpenAI handles interactive conversations with a language model
# OpenAIEmbeddings transforms text into vectors so that the semantic meaning of user inputs or documents can be stored and compared in a vector store such as Neo4jVector.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Neo4j & Graph Visualization
# GraphDatabase establishes a connection to a Neo4j database so that we can run Cypher queries against its nodes and relationships
from neo4j import GraphDatabase
# To visually represent the graph data retrieved from Neo4j
from yfiles_jupyter_graphs import GraphWidget

# FAISS (Facebook AI Similarity Search) stores text embeddings and then retrieves similar documents based on a query
from langchain_community.vectorstores import FAISS

# Chains for QA by combining a retrieval mechanism (like FAISS) with a language model
from langchain.chains import RetrievalQA

# Miscellaneous
import os
import warnings
import textwrap

# Colab imports, if running in Google Colab
try:
  import google.colab
  from google.colab import output
  output.enable_custom_widget_manager()
except ImportError:
  # Not running in Colab, so there is no widget manager to enable
  pass

warnings.filterwarnings("ignore")



For example, replace imports like: `from langchain_core.pydantic_v1 import BaseModel`
with: `from pydantic import BaseModel`
or the v1 compatibility namespace if you are working in a code base that has not been fully upgraded to pydantic 2 yet: `from pydantic.v1 import BaseModel`


#<h1>Section 1. Initialisation</h1>

##<h2>Step 1.1: Initialise a Neo4j Database Instance.</h2>
<h3>Create an instance using Neo4j Aura. (launched through a web browser)</h3>

Neo4j is the graph database application we will use to store the Graph documents.

<ol>
    <li>Create an account with Neo4j Aura and log in to the web browser.</li>
    <li>Click 'New Instance' under the 'Instances' tab and select the 'Free' option.
        <ul>
            <li>Remember to save the generated password as it will be needed to access your db instance.</li>
        </ul>
    </li>
    <li>Look for your Connection URI in the instance and save it as well</li>
    <li>Lastly, key in your OpenAI API key, your username ('neo4j' by default), your db password (from step 2) and your URI (from step 3) into the fields below.</li>
    <li>Note: if you created a free Neo4j instance but have not run it for a while, you need to resume it on the Neo4j web portal, which may take a few minutes as it is an on-demand service.</li>
</ol>

In [None]:
# Save these variables in your environment.
# Replace the placeholders with your own credentials; never share or commit real keys.
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
os.environ["NEO4J_URI"] = "neo4j+s://<your-instance-id>.databases.neo4j.io"
os.environ["NEO4J_USERNAME"] = "neo4j"
os.environ["NEO4J_PASSWORD"] = "<your-neo4j-password>"

# Create a connection to the Neo4j database
# graph = Neo4jGraph()
graph = Neo4jGraph(url=os.environ["NEO4J_URI"], username=os.environ["NEO4J_USERNAME"], password=os.environ["NEO4J_PASSWORD"]) # Explicitly pass the connection details to Neo4jGraph

##<h2>Step 1.2: Initialise LLM</h2>

Load the large language model that has already been trained by OpenAI

In [None]:
#Initialize the Language Model and Graph Transformer
llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-0125") # gpt-4-0125-preview occasionally has issues but in theory you would want to use the most capable model to construct the graph
llm_transformer = LLMGraphTransformer(llm=llm)

## Step 1.3: An example of how the LLM responds to a prompt about which it has no knowledge

In [None]:
print('Example of LLM without RAG process: \n')
response = llm.invoke("What are the deliverables of BT4222, and how are students assessed?").content
wrapped_response = textwrap.fill(response, width=80)
print(wrapped_response)


Example of LLM without RAG process: 

The deliverables of BT4222, a module in the National University of Singapore's
School of Computing, typically include individual or group projects,
presentations, reports, and possibly exams.   Students are assessed through a
combination of these deliverables, with their performance being evaluated based
on criteria such as the quality of their work, their understanding of the
subject matter, their ability to apply concepts to real-world problems, and
their communication and presentation skills. Grades are usually assigned based
on a combination of these factors, with a weighting assigned to each deliverable
based on its importance in the overall assessment of the module.


As you can see from the output of the LLM, the answer is generic and is not based on the actual BT4222 guideline we are asking about. This is because the LLM was not trained on the information we are trying to retrieve. Thus, we can use retrieval-augmented generation to feed the model relevant information before an answer is generated.
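
To make the idea of "feeding the model relevant information" concrete, here is a minimal sketch in plain Python. It is not how the rest of this notebook retrieves text: the word-overlap scoring, the function names and the two toy passages are all assumptions chosen for illustration. It only shows the two stages of RAG, retrieving a passage and then placing it in the prompt.

```python
from typing import List

def overlap_score(question: str, passage: str) -> int:
    """Count the words shared by the question and the passage (toy relevance score)."""
    return len(set(question.lower().split()) & set(passage.lower().split()))

def retrieve(question: str, passages: List[str], k: int = 1) -> List[str]:
    """Return the k passages that share the most words with the question."""
    return sorted(passages, key=lambda p: overlap_score(question, p), reverse=True)[:k]

def build_rag_prompt(question: str, passages: List[str]) -> str:
    """Place the retrieved passages in front of the question."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

# Hypothetical toy passages, not taken from the course PDF
passages = [
    "Source code must run on Google Colab.",
    "The cafeteria opens at 8 am.",
]
question = "Where must the source code run?"
top = retrieve(question, passages, k=1)
prompt = build_rag_prompt(question, top)
print(prompt)
```

The prompt that reaches the LLM now contains the relevant passage, so the model can answer from it instead of guessing.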

#<h1> Section 2. Load data for RAG </h1>

<h2>Load Data</h2>

For this demonstration, we will use information from the BT4222 course project deliverable guideline pdf file. We can utilize LangChain loaders to fetch and split the documents from PDFs seamlessly.

In [None]:
from google.colab import drive
### Permit this notebook to access your Google Drive files?
### Please select "Connect to Google Drive", choose your account and select continue
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
### adjust the data directory if needed
### For example, here we created a "data" folder
### in Google Drive and put the data files needed under the "data" folder
%cd /content/drive/My Drive/data

/content/drive/My Drive/data


In [None]:
# Load the relevant data using PyPDFLoader
pdf_loader = PyPDFLoader("Without Table BT4222 Project Deliverable Guideline 2024-25 Term1.pdf")
raw_documents = pdf_loader.load()
print(len(raw_documents))




5


In [None]:
# Define chunking strategy
text_splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=20)
documents = text_splitter.split_documents(raw_documents)
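
The effect of `chunk_size` and `chunk_overlap` can be sketched with a plain sliding window. This is a simplified illustration, not the `TokenTextSplitter` implementation: it assumes the text has already been turned into a list of tokens, and `chunk_tokens` is a hypothetical helper name.

```python
from typing import List

def chunk_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):
            break  # the last window reached the end of the text
        start += step
    return chunks

# Ten toy tokens, windows of 4 with 1 token shared between neighbours
tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap means the last token of one chunk is repeated at the start of the next, which helps keep sentences that straddle a boundary retrievable from either chunk.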

In [None]:
print("The number of chunks from the raw documents:",len(documents))
# Print each key-value pair from the __dict__ attribute line by line
print("\n Attributes of documents[2]:\n")
for key, value in documents[2].__dict__.items():
    print(f"{key}: {value}")


The number of chunks from the raw documents: 7

 Attributes of documents[2]:

id: None
metadata: {'source': 'Without Table BT4222 Project Deliverable Guideline 2024-25 Term1.pdf', 'page': 2}
page_content: ©QIUHONG WANG 2024 3 • Use table and/or figures to report the model performance on both the training set and the testing sets (all these outcome can be reproduced by running your source code). Precise and succinct explanation should be provided if necessary.  This section is a general requirement for any machine learning related project. Let us keep it simple and straightforward with only necessary information.  Section 3 Contribution and Justification  Among the four aspects identified below or your unique aspects that are not listed here, please provide your justification within 1-3 pages for each aspect. This excludes the self-evaluation table.  • Complete the following table and assess your own contribution taken into account both the extent of efforts and the effectiveness of the

#<h1> Section 3a. Defining a RAG structure without graphs </h1>

<img src="https://drive.google.com/uc?id=1d97tQa3yuMupwvWRuGeeUi-PqOe4ZdHZ" width="400"></img>

In this section, we will demonstrate how to set up a standard RAG structure without the graph database.

In [None]:
# Initializes the embeddings model from OpenAI. This model converts text into numerical vectors.
embeddings = OpenAIEmbeddings()

# Uses the FAISS library to create a vector store from the documents. FAISS is a library for efficient similarity search.
# It indexes the documents after converting them to vectors using the embeddings model, allowing for fast retrieval.
vectorstore = FAISS.from_documents(documents, embeddings)

In [None]:
# Print basic information
print(f"Type of vectorstore: {type(vectorstore)}")
print(f"Number of documents: {len(vectorstore.index_to_docstore_id)}")

# Print information about the underlying FAISS index
faiss_index = vectorstore.index
print(f"\nFAISS Index type: {type(faiss_index)}")
print(f"FAISS Index dimension: {faiss_index.d}")
print(f"Total number of vectors: {faiss_index.ntotal}")


# Print the document ID stored at each index position
print("\nExample document IDs:")
for i, doc_id in vectorstore.index_to_docstore_id.items():
    print(f"Index {i}: Document ID {doc_id}")


print('The last two vector embeddings stored in vectorstore:\n')
vectors = vectorstore.index.reconstruct_n(len(vectorstore.index_to_docstore_id)-2, 2)
print(vectors)


Type of vectorstore: <class 'langchain_community.vectorstores.faiss.FAISS'>
Number of documents: 7

FAISS Index type: <class 'faiss.swigfaiss_avx2.IndexFlatL2'>
FAISS Index dimension: 1536
Total number of vectors: 7

Example document IDs:
Index 0: Document ID 392319f7-1cb5-40de-a7d3-a05d323379ed
Index 1: Document ID 763f7b88-1b3b-437d-80fd-569c616376f2
Index 2: Document ID 69105248-b13d-437c-8780-85c7ac724105
Index 3: Document ID b2ff5040-36b2-4952-827f-f39727a441e3
Index 4: Document ID ea31ae98-e10d-46d0-9e8a-7823f4f2ed20
Index 5: Document ID 3c0b89e5-a330-48a1-8fbb-4b1dbce344fa
Index 6: Document ID ddb8f64a-eb38-4477-93e1-9d59ff91dc79
The last two vector embeddings stored in vectorstore:

[[-0.02597512  0.00025854  0.00511051 ... -0.03230054 -0.01881777
  -0.0346247 ]
 [-0.01280921 -0.00227121  0.01899435 ... -0.03567426 -0.00469538
  -0.04522463]]
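
The output above reports an `IndexFlatL2` index, which suggests an exhaustive search by Euclidean (L2) distance. The following is a minimal pure-Python sketch of that idea on 2-dimensional toy vectors. It is an illustration only: `top_k` is a hypothetical helper, and FAISS itself is far more optimised.

```python
import math
from typing import List

def l2_distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query: List[float], vectors: List[List[float]], k: int) -> List[int]:
    """Return the positions of the k stored vectors closest to the query."""
    ranked = sorted(range(len(vectors)), key=lambda i: l2_distance(query, vectors[i]))
    return ranked[:k]

# Toy 2-dimensional "embeddings"; real OpenAI embeddings here have 1536 dimensions
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
nearest = top_k([0.9, 0.1], toy_vectors, k=2)
print(nearest)
```

The returned positions play the same role as the index numbers printed above: each one maps back to a document ID, and from there to the chunk of text handed to the LLM.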


In [None]:
# Set up the question-answering (QA) chain using the vectorstore as a retriever
qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=vectorstore.as_retriever())

In [None]:
print('Demonstration of RAG response:\n')
#question_step3a = 'Who teaches R in the NUS course about business analytics, what else is taught and how are students graded?'
question_step3a = 'Does the project deliverable require source code and data?'

response = qa_chain.invoke({"query": question_step3a})["result"]
wrapped_response = textwrap.fill(response, width=80)
print(wrapped_response)

Demonstration of RAG response:

Yes, the project deliverable requires both source code and data. The source code
files should be uploaded by a specific deadline, and they should be self-
explanatory, able to run directly via Google Colab, and able to reproduce the
reported model performance. Additionally, the datasets should be provided with
accessible links, and a PDF file explaining their purpose and content should be
uploaded by a certain deadline as well.


As you can see from the above response, the model is now able to give a response that is relevant to BT4222.

However, the response is narrative rather than concise.

In the subsequent sections, we will show how graph RAG can help with this.

#<h1> Section 3b. Defining a RAG structure with graphs </h1>

##<h2>Step 3b.1 Add the documents to the graph (Neo4j)</h2>

Now it's time to construct a graph based on the loaded documents. For this purpose, we use LangChain's LLMGraphTransformer module, which simplifies constructing and storing a knowledge graph in a graph database.

The LLM graph transformer returns graph documents, which can be imported to Neo4j via the `add_graph_documents` method.
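
To make the shape of a graph document concrete, here is a minimal sketch using plain dataclasses. The classes mirror the attributes accessed later in this notebook (`id`, `type`, `properties`, `source`, `target`), but they are simplified stand-ins rather than LangChain's own classes, and the `USES` relationship is a hypothetical example.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    id: str
    type: str
    properties: Dict[str, str] = field(default_factory=dict)

@dataclass
class Relationship:
    source: Node
    target: Node
    type: str
    properties: Dict[str, str] = field(default_factory=dict)

@dataclass
class GraphDocument:
    nodes: List[Node]
    relationships: List[Relationship]

# Two toy nodes joined by one hypothetical relationship
ensemble = Node(id="Ensemble Learning", type="Machine learning")
models = Node(id="Ml Models", type="Machine learning")
toy_doc = GraphDocument(
    nodes=[ensemble, models],
    relationships=[Relationship(source=ensemble, target=models, type="USES")],
)

for rel in toy_doc.relationships:
    print(f"{rel.source.id} - {rel.type} -> {rel.target.id}")
```

Each text chunk yields one such graph document, so importing them into Neo4j amounts to writing every node and every relationship of every chunk.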

In [None]:
# Construct a graph based on the retrieved documents.
# Using the LLMGraphTransformer module significantly simplifies constructing and storing a knowledge graph in a graph database.
graph_documents = llm_transformer.convert_to_graph_documents(documents)

In [None]:
print(f"count of documents:{len(graph_documents)}")
print(f"count of nodes in the document chunk at index 3:{len(graph_documents[3].nodes)}")
print(f"count of relationships in the document chunk at index 3:{len(graph_documents[3].relationships)}")

count of documents:7
count of nodes in the document chunk at index 3:13
count of relationships in the document chunk at index 3:12


In [None]:
print("As shown below, each document is split into nodes and relationships:\n")

# Iterate through each node of the document chunk at index 3
for item in graph_documents[3].nodes:
    # Print details of the Node
    print(f"Node ID: {item.id}")
    print(f"Node Type: {item.type}")
    print(f"Node Properties: {item.properties}")
    print("-" * 50)  # Separator for clarity

for item in graph_documents[3].relationships:
    # Print details of the relationships
    print(f"Relationship from: {item.source.id} (Type: {item.source.type})")
    print(f"  to: {item.target.id} (Type: {item.target.type})")
    print(f"Relationship Type: {item.type}")
    print(f"Relationship Properties: {item.properties}")
    print("-" * 50)  # Separator for clarity


As shown below, each document is split into nodes and relationships:

Node ID: Ensemble Learning
Node Type: Machine learning
Node Properties: {}
--------------------------------------------------
Node ID: Ml Models
Node Type: Machine learning
Node Properties: {}
--------------------------------------------------
Node ID: Innovative Architecture
Node Type: Machine learning
Node Properties: {}
--------------------------------------------------
Node ID: Pipeline
Node Type: Machine learning
Node Properties: {}
--------------------------------------------------
Node ID: Ml Methods
Node Type: Machine learning
Node Properties: {}
--------------------------------------------------
Node ID: Research Papers
Node Type: Research paper
Node Properties: {}
--------------------------------------------------
Node ID: Creativity
Node Type: Concept
Node Properties: {}
--------------------------------------------------
Node ID: Insights
Node Type: Concept
Node Properties: {}
----------------------

In [None]:
# Check if any nodes are available in the database
check_query = "MATCH (n) RETURN count(n) AS node_count"
result = graph.query(check_query)
for record in result:
    print(record["node_count"])  # Should print 0 if the database is empty

59


In [None]:
# To start from an empty database, you can use a Cypher query that deletes all nodes and relationships
clear_db_query = """
MATCH (n)
DETACH DELETE n
"""

# Execute the query to clear the database
graph.query(clear_db_query)

[]

In [None]:
# baseEntityLabel: this parameter assigns an additional __Entity__ label to each node, enhancing indexing and query performance.
# include_source: this parameter links nodes to their originating documents, facilitating data traceability and context understanding.
graph.add_graph_documents(
    graph_documents,
    # Ensures that each entity in graph_documents is labeled with its base entity type
    baseEntityLabel=True,
    # Indicate that the source information (like the original document or context) should be included in the graph nodes or edges.
    include_source=True
)

Now that we have added the graph documents to the graph, we will define a function to display the Neo4j graph.

With this function, we can visualise the nodes and edges that we added above.

In [None]:
# Cypher is the query language used for interacting with Neo4j.
# Here we write a query that finds relationships where the id of either the source node or the target node contains 'data'.
# Both ids are lower-cased with toLower, so the match is case-insensitive.
default_cypher = "MATCH (s)-[r]->(t) WHERE toLower(s.id) CONTAINS 'data' OR toLower(t.id) CONTAINS 'data' RETURN s, r, t"
# You can try other queries
# default_cypher = "MATCH (s)-[r:IDENTIFY]->(t) RETURN s,r,t LIMIT 50"

# Function to display graph structure
def showGraph(cypher: str = default_cypher):
    # Open a Neo4j driver and session just long enough to run the query
    driver = GraphDatabase.driver(
        uri = os.environ["NEO4J_URI"],
        auth = (os.environ["NEO4J_USERNAME"],
                os.environ["NEO4J_PASSWORD"]))
    try:
        with driver.session() as session:
            # .graph() collects the returned nodes and relationships
            widget = GraphWidget(graph = session.run(cypher).graph())
    finally:
        # Release the connection once the widget has been built
        driver.close()
    widget.node_label_mapping = 'id'
    return widget

showGraph()

GraphWidget(layout=Layout(height='640px', width='100%'))

In [None]:
# We want to understand entity 'Document' and its relationships with other entities
# We will use the text property of entity 'Document'
showGraph("MATCH p=(d:Document)-[]->() RETURN p LIMIT 25 UNION MATCH p=()-[]->(d:Document) RETURN p;")

GraphWidget(layout=Layout(height='760px', width='100%'))

##<h2>Step 3b.2 Creating a Hybrid Retrieval for RAG</h2>

After generating the graph, we will start designing the hybrid retrieval function.

We will use a hybrid retrieval approach that combines vector and keyword indexes with graph retrieval for RAG applications.

<img src="https://drive.google.com/uc?id=1MJsLg6W8_7SOflvK4LP5-hbwHRUMWIe1" width="400"></img>

The diagram illustrates a retrieval process that begins with a user posing a question, which is then directed to a RAG retriever.

The hybrid retrieval process is circled in red and is a combination of an unstructured and a structured retriever.

The unstructured retriever employs keyword and vector searches over unstructured text data, and its results are combined with the information that the structured retriever collects from the knowledge graph through graph search.

The collected data from these sources is fed into an LLM to generate and deliver the final answer.
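
As a rough sketch of this final merging step, the two retrievers' results can be joined into one context string before it is placed in the prompt. The function name `combine_context`, the section labels and the sample inputs are assumptions made for illustration, not the retriever defined later in this notebook.

```python
from typing import List

def combine_context(structured_lines: List[str], unstructured_chunks: List[str]) -> str:
    """Join graph triples and text chunks into a single context block."""
    structured = "\n".join(structured_lines) or "(none)"
    unstructured = "\n---\n".join(unstructured_chunks) or "(none)"
    return f"Structured data:\n{structured}\n\nUnstructured data:\n{unstructured}"

# Hypothetical results from the two retrievers
context = combine_context(
    ["Course Project - REQUIRES -> Source Code"],
    ["Source code files should run directly via Google Colab."],
)
print(context)
```

Keeping the two sources under separate labels lets the LLM tell short graph facts apart from longer narrative passages.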

###<h2>Step 3b.2.1 Unstructured data retriever</h2>

<img src="https://drive.google.com/uc?id=1_90bjkrX_mSOKAZokxIeseb_XIFQqadS" width="400"></img>

First, we start by designing the unstructured retriever.

In the `from_existing_graph()` method, we use both keyword-based and vector-based searches and target nodes with the label 'Document'. From each 'Document' node, we extract the 'text' property. We pick the 'Document' node because its 'text' property stores chunks of text, which are useful for providing context to the LLM.

The `similarity_search` method can be used to retrieve the relevant documents. By default, the 4 most similar documents are retrieved.
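
One simple way to picture a hybrid search is as the fusion of two ranked lists, one from a vector index and one from a keyword index. The sketch below is an illustrative assumption, not a description of what `Neo4jVector` does internally: it scales each list by its top score, keeps the best score per document, and returns the top `k`. All scores and chunk names are made up.

```python
from typing import Dict, List

def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores so that the best hit in the list gets 1.0."""
    top = max(scores.values(), default=0.0)
    if top == 0:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: s / top for doc_id, s in scores.items()}

def hybrid_rank(vector_scores: Dict[str, float], keyword_scores: Dict[str, float], k: int = 4) -> List[str]:
    """Merge two score lists, keep each document's best normalized score, return the top k."""
    combined: Dict[str, float] = {}
    for scores in (normalize(vector_scores), normalize(keyword_scores)):
        for doc_id, s in scores.items():
            combined[doc_id] = max(s, combined.get(doc_id, 0.0))
    # Sort by score (highest first), breaking ties by document id
    return sorted(combined, key=lambda d: (-combined[d], d))[:k]

# Hypothetical scores: cosine similarities and keyword relevance scores
vector_scores = {"chunk-1": 0.92, "chunk-2": 0.80, "chunk-3": 0.40}
keyword_scores = {"chunk-3": 5.0, "chunk-4": 2.5}
ranked = hybrid_rank(vector_scores, keyword_scores, k=3)
print(ranked)
```

Here `chunk-3` ranks low by vector similarity but is the best keyword hit, so the hybrid ranking promotes it; this is the kind of case where combining both searches helps.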

In [None]:
vector_index = Neo4jVector.from_existing_graph(
    # Uses a model from OpenAI that converts text into vector embeddings which are used for vector-based search
    OpenAIEmbeddings(),
    # Search for similar words using a hybrid approach, combining both keyword-based and vector-based searches.
    search_type="hybrid",
    # Only nodes with the Document label will be indexed
    node_label="Document",
    # Within the node, we will return the 'text' property
    text_node_properties=["text"],
    embedding_node_property="embedding"
)

In [None]:
print('Example of the output of similarity search:\n')
# By default, the method returns the top 4 most similar results.
# To tune this, pass the parameter k (the number of results) to the similarity_search function.

def display_matching_strings(results, query_string):
    """Print each of the top 4 results, noting whether it contains query_string verbatim."""

    for doc in results[:4]:
        if query_string in doc.page_content:
            print("\n this is matching result: " + doc.page_content)
        else:
            print("\n there are no exactly matching strings in the result: " + doc.page_content)

# The similarity_search method is used to retrieve documents or nodes based on their vector similarity to a given query.
results = vector_index.similarity_search('Justification', k=4)
# Please note that as we search nodes labeled "Document",
# the retrieved results can be long, since they are the full text chunks relevant to the query_string.
display_matching_strings(results, 'Justification')

Example of the output of similarity search:






 this is matching result: 
text: ©QIUHONG WANG 2024 3 • Use table and/or figures to report the model performance on both the training set and the testing sets (all these outcome can be reproduced by running your source code). Precise and succinct explanation should be provided if necessary.  This section is a general requirement for any machine learning related project. Let us keep it simple and straightforward with only necessary information.  Section 3 Contribution and Justification  Among the four aspects identified below or your unique aspects that are not listed here, please provide your justification within 1-3 pages for each aspect. This excludes the self-evaluation table.  • Complete the following table and assess your own contribution taken into account both the extent of efforts and the effectiveness of the outcomes.  Regarding the four aspects of the contribution, you are not required to cover all of them in your project. It should be your own decision depending on your int

###<h2>Step 3b.2.2 Structured retriever</h2>

In this example, we will use a full-text index to identify relevant nodes and then return their direct neighborhood.

<img src="https://drive.google.com/uc?id=1vhSHND4m3K_TZhEZH_IJvwZXaogESe5h" width="400"></img>


The graph retriever starts by identifying relevant entities in the input. For simplicity, we instruct the LLM to identify **deliverable**, **expectation**, **level** and **assessment** entities. To achieve this, we will use the `with_structured_output` method.

After the entities are detected in the user's question, we will use them to extract structured information from the graph database (Neo4j) in the form of nodes and edges. This is done with a query written in Cypher, Neo4j's native query language.

For example, suppose we ask the question 'What are the best practices for completing the course project to ensure satisfactory contribution?'. The entities detected in the question would be 'course project', 'best practices' and 'satisfactory contribution'. Using these detected entities, we query the graph database to return their neighbourhood in the form of nodes and edges.

##### Step 1. **Detect** specified entities in the input question

In [None]:
# This class defines the schema of the structured output that the LLM must return
class Entities(BaseModel):
    """Identifying information about entities."""

    # This field structures the output of the LLM as a list of names.
    names: List[str] = Field(
        ...,
        description="All the course deliverable, expectation, level and assessment entities "
        "that appear in the text",
    )

# Each tuple represents a message with a specific role and content
# that helps define how different messages should be structured
# and formatted when interacting with the LLM.
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are tasked with extracting specific entities from the text. Focus on course deliverable, expectation, level and assessment",
        ),
        (
            "human",
            "Use the given format to extract information from the following "
            "input: {question}",
        ),
    ]
)

# Combine the prompt template (prompt) with the language model, and specify that
# the output should be structured according to the Entities schema in order to extract entities.
entity_chain = prompt | llm.with_structured_output(Entities)

In [None]:
question_step3b2_2 = 'What are the best practices for completing the course project to ensure satisfactory contribution?'

print("Demonstration that the entity chain can now extract the entities from the user's query:\n")
print(entity_chain.invoke({"question": question_step3b2_2}).names)

Demonstration that the entity chain can now extract the entities from the user's query:

['course project', 'best practices', 'satisfactory contribution']


#####Step 2. Create a full text **index** on nodes in the Neo4j database

A full-text index search refers to a type of search in a database that allows you to find records based on text data contained within text fields (or properties) of the database entries. Unlike simple keyword searches that may only look for exact matches, full-text search enables more complex querying and more flexible retrieval of data, accommodating various text search requirements.

In [None]:
# Create a full-text index in Neo4j on the id property of nodes with the __Entity__ label.
# A full-text index allows efficient querying of text properties
graph.query(
    "CREATE FULLTEXT INDEX entity IF NOT EXISTS FOR (e:__Entity__) ON EACH [e.id]")

[]

#####Step 3. **Retrieve** the neighborhood of relevant nodes using the detected entities, from the graph database (Neo4j database)
Now that we can detect entities in the question, let's use the full-text index to map them to the knowledge graph. The index was created in Step 2, so what remains is a function that generates full-text queries which tolerate a bit of misspelling; we won't go into much detail on it here. For example, the entity 'course project' becomes the query `course~2 AND project~2`.

In [None]:
import re

def remove_lucene_chars(input: str) -> str:
    """
    Remove special characters that are not allowed in Lucene queries.
    """
    return re.sub(r'[^a-zA-Z0-9\s]', '', input)

def generate_full_text_query(input: str) -> str:
    """
    Generate a full-text search query for a given input string.

    This function constructs a query string suitable for a full-text search.
    It processes the input string by splitting it into words and appending a
    similarity threshold (~2 changed characters) to each word, then combines
    them using the AND operator. Useful for mapping entities from user questions
    to database values, and allows for some misspellings.
    """
    full_text_query = ""
    words = [el for el in remove_lucene_chars(input).split() if el]

    if not words:
        return ""

    for word in words[:-1]:
        full_text_query += f" {word}~2 AND"
    full_text_query += f" {words[-1]}~2"

    return full_text_query.strip()
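
To see what kind of query string this produces, here is a compact, self-contained sketch that should behave the same way as the function above. The name `fuzzy_and_query` is hypothetical and is used only to avoid shadowing the notebook's own function.

```python
import re

def fuzzy_and_query(text: str) -> str:
    """Strip non-alphanumeric characters, then join the words as `word~2 AND word~2 ...`."""
    cleaned = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    words = cleaned.split()
    return " AND ".join(f"{word}~2" for word in words)

print(fuzzy_and_query("Project Sorce Code!"))  # Project~2 AND Sorce~2 AND Code~2
print(repr(fuzzy_and_query("???")))            # ''
```

The misspelled word "Sorce" is kept as typed; it is the `~2` suffix that lets the full-text index still match it to "Source".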

Since we have created a full-text index and can detect entities in the user's question, we can start defining the structured retriever function.

The `structured_retriever` function starts by detecting entities in the user question. Next, it iterates over the detected entities and uses a Cypher query to retrieve the neighborhood of relevant nodes.

In [None]:
def structured_retriever(question: str) -> str:
    """
    Collects the neighborhood of entities mentioned
    in the question.
    """
    result = ""
    entities = entity_chain.invoke({"question": question})

    for entity in entities.names:
        # This Cypher query runs a full-text search over the 'entity' index and keeps the top two matching nodes.
        # For each matched node, the CALL subquery collects relationships that point from or to that node,
        # excluding relationships of type 'MENTIONS'. Each UNION ALL branch starts with `WITH node` so that
        # both directions stay bound to the matched node.
        response = graph.query(
            """
            CALL db.index.fulltext.queryNodes('entity', $query, {limit: 2})
            YIELD node, score
            CALL {
              WITH node
              MATCH (node)-[r]->(neighbor)
              WHERE type(r) <> 'MENTIONS'
              RETURN node.id + ' - ' + type(r) + ' -> ' + neighbor.id AS output
              UNION ALL
              WITH node
              MATCH (neighbor)-[r]->(node)
              WHERE type(r) <> 'MENTIONS'
              RETURN neighbor.id + ' - ' + type(r) + ' -> ' + node.id AS output
            }
            RETURN output LIMIT 50
            """,
            {"query": generate_full_text_query(entity)},
        )

        # Append results
        result += "\n".join([el['output'] for el in response]) + "\n"

    return result.strip()
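
The neighborhood lookup that the structured retriever is meant to perform can be pictured without a database. The sketch below uses a small in-memory list of triples; the data and the name `toy_neighborhood` are assumptions for illustration and are not part of the notebook or of Neo4j.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (source id, relationship type, target id)

# Hypothetical triples standing in for the Neo4j graph
TOY_GRAPH: List[Triple] = [
    ("Project Report", "CONTAINS", "Abstract"),
    ("Document 1", "MENTIONS", "Project Report"),
    ("Student", "SUBMITS", "Project Report"),
    ("Datasets", "UPLOADED_TO", "Canvas"),
    ("Project Report", "UPLOADED_TO", "Canvas"),
]

def toy_neighborhood(node_id: str, graph: List[Triple], limit: int = 50) -> List[str]:
    """List outgoing, then incoming relationships of a node, skipping MENTIONS."""
    outgoing = [f"{s} - {rel} -> {t}" for s, rel, t in graph
                if s == node_id and rel != "MENTIONS"]
    incoming = [f"{s} - {rel} -> {t}" for s, rel, t in graph
                if t == node_id and rel != "MENTIONS"]
    return (outgoing + incoming)[:limit]

for line in toy_neighborhood("Project Report", TOY_GRAPH):
    print(line)
# Project Report - CONTAINS -> Abstract
# Project Report - UPLOADED_TO -> Canvas
# Student - SUBMITS -> Project Report
```

Note that the `MENTIONS` edge from the source document is dropped and that the unrelated `Datasets` triple does not appear, since only relationships touching the requested node are kept.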


In [None]:
question_step3b2_2 = 'What are the best practices for completing the course project to ensure satisfactory contribution?'
print('Example of the output of a structured retriever: \n')
print(structured_retriever(question_step3b2_2))

Example of the output of a structured retriever: 

Project Source Code - UPLOADED_TO -> Canvas
Project Source Code - ACCESSIBLE_VIA -> Public Link
Project Final Presentation - UPLOADED_TO -> Canvas
Project Final Report - UPLOADED_TO -> Canvas
Project Source Code - UPLOADED_TO -> Canvas
Project Source Code - ACCESSIBLE_VIA -> Public Link
Datasets - UPLOADED_TO -> Canvas
Datasets - ACCESSIBLE_VIA -> Public Link
Source_Code_File - CONTAINS -> Model_Performance
Source_Code_File - CONTAINS -> Markdown_Explanation
Source_Code_File - RUNNABLE_IN -> Google_Colab
Model_Performance - INCLUDES -> Training_Set
Model_Performance - INCLUDES -> Testing_Set
Model_Performance - INCLUDES -> Source_Code
Project_Report - CONTAINS -> Abstract
Project_Report - CONTAINS -> Feature_Engineering
Project_Report - CONTAINS -> Models_Performance
Contribution_Justification - FOCUS -> Datasets
Contribution_Justification - FOCUS -> Feature_Engineering
Contribution_Justification - FOCUS -> Aspects_Contribution
Contrib

###<h2>Step 3b.2.3 Combine the Unstructured and Structured (Hybrid) Retriever</h2>

<img src="https://drive.google.com/uc?id=19nii0wD4UAi5LR9QZYVPP0NlErskoPlr" width="400"></img>

Now, we'll combine the unstructured and structured retrievers defined above into the final function that passes information to the LLM.

In [None]:
# Define a function that combines the structured and unstructured data retrieved above into a context string to be fed to the LLM
def retriever(question: str):
    print(f"Search query: {question}")
    structured_data = structured_retriever(question)
    unstructured_data = [el.page_content for el in vector_index.similarity_search(question)]
    final_data = f"""Structured data:
{structured_data}
Unstructured data:
{"#Document ".join(unstructured_data)}
    """
    return final_data
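
To make the shape of the combined context easier to inspect, here is a small sketch of the same string assembly with hard-coded inputs instead of Neo4j and the vector index. The function name `build_context` and the sample passages are hypothetical.

```python
from typing import List

def build_context(structured_data: str, unstructured_docs: List[str]) -> str:
    """Assemble the two retrieval results into one context string."""
    return (
        "Structured data:\n"
        f"{structured_data}\n"
        "Unstructured data:\n"
        f"{'#Document '.join(unstructured_docs)}"
    )

context = build_context(
    "Datasets - UPLOADED_TO -> Canvas",
    ["First passage. ", "Second passage."],
)
print(context)
# Structured data:
# Datasets - UPLOADED_TO -> Canvas
# Unstructured data:
# First passage. #Document Second passage.
```

One detail worth noticing: `str.join` places the `#Document ` marker only between passages, so the first passage carries no marker of its own.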

In [None]:
question_step3b2_3 = 'What are the best practices for completing the course project to ensure satisfactory contribution?'
print('Example of the output of final retriever: \n')
print(retriever(question_step3b2_3))

Example of the output of final retriever: 

Search query: What are the best practices for completing the course project to ensure satisfactory contribution?




Structured data:
Project Source Code - UPLOADED_TO -> Canvas
Project Source Code - ACCESSIBLE_VIA -> Public Link
Project Final Presentation - UPLOADED_TO -> Canvas
Project Final Report - UPLOADED_TO -> Canvas
Project Source Code - UPLOADED_TO -> Canvas
Project Source Code - ACCESSIBLE_VIA -> Public Link
Datasets - UPLOADED_TO -> Canvas
Datasets - ACCESSIBLE_VIA -> Public Link
Source_Code_File - CONTAINS -> Model_Performance
Source_Code_File - CONTAINS -> Markdown_Explanation
Source_Code_File - RUNNABLE_IN -> Google_Colab
Model_Performance - INCLUDES -> Training_Set
Model_Performance - INCLUDES -> Testing_Set
Model_Performance - INCLUDES -> Source_Code
Project_Report - CONTAINS -> Abstract
Project_Report - CONTAINS -> Feature_Engineering
Project_Report - CONTAINS -> Models_Performance
Contribution_Justification - FOCUS -> Datasets
Contribution_Justification - FOCUS -> Feature_Engineering
Contribution_Justification - FOCUS -> Aspects_Contribution
Contribution_Justification - FOCUS -> Ml_

##<h2>Step 3b.3 Define the RAG Chain</h2>

We have successfully implemented the retrieval component of the RAG pipeline.
The following cell introduces the query-rewriting step that allows conversational follow-ups. Conversational follow-ups are not crucial for Graph RAG, but we include this step for completeness.

In [None]:
# Condense a chat history and follow-up question into a standalone question
_template = """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question,
in its original language.
Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""
CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(_template)

# Formats chat history to incorporate into a query for the LLM
def _format_chat_history(chat_history: List[Tuple[str, str]]) -> List:
    buffer = []
    for human, ai in chat_history:
        buffer.append(HumanMessage(content=human))
        buffer.append(AIMessage(content=ai))
    return buffer

_search_query = RunnableBranch(
    # If input includes chat_history, we condense it with the follow-up question
    (
        RunnableLambda(lambda x: bool(x.get("chat_history"))).with_config(
            run_name="HasChatHistoryCheck"
        ),  # Condense follow-up question and chat into a standalone_question
        RunnablePassthrough.assign(
            chat_history=lambda x: _format_chat_history(x["chat_history"])
        )
        | CONDENSE_QUESTION_PROMPT
        | ChatOpenAI(temperature=0)
        | StrOutputParser(),
    ),
    # Else, we have no chat history, so just pass through the question
    RunnableLambda(lambda x : x["question"]),
)
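
The branching logic of `_search_query` can be summarised in plain Python. In this sketch the LLM call is replaced by `fake_condense`, a hypothetical stand-in that only concatenates strings, so it illustrates the control flow rather than the real rewriting quality.

```python
from typing import Callable, Dict, List, Tuple

ChatHistory = List[Tuple[str, str]]  # list of (human message, AI message) pairs

def fake_condense(chat_history: ChatHistory, question: str) -> str:
    """Hypothetical stand-in for the condense prompt + LLM + output parser."""
    last_question, _last_answer = chat_history[-1]
    return f"{question} (in the context of: {last_question})"

def search_query(inputs: Dict, condense: Callable[[ChatHistory, str], str] = fake_condense) -> str:
    """Rewrite the question when there is chat history; otherwise pass it through."""
    if inputs.get("chat_history"):
        return condense(inputs["chat_history"], inputs["question"])
    return inputs["question"]

print(search_query({"question": "Where do I upload the report?"}))
# Where do I upload the report?
print(search_query({
    "question": "And the source code?",
    "chat_history": [("Where do I upload the report?", "To Canvas.")],
}))
# And the source code? (in the context of: Where do I upload the report?)
```

An empty or missing `chat_history` takes the pass-through branch, which mirrors the `bool(x.get("chat_history"))` check above.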

Now that this is done, we will carry out a demonstration to show how Graph RAG benefits from structured, unstructured and hybrid retrieval.

### Prompt Augmentation and LLM Response

To demonstrate how hybrid retrieval benefits from both unstructured and structured retrieval, we will show the LLM's response with:
<ol>
    <li>Unstructured Retrieval</li>
    <li>Structured Retrieval</li>
    <li>Unstructured and Structured Retrieval (Hybrid Retrieval)</li>
</ol>

To do this, we will compare the different responses by the LLM for each of the retrievals.

Take note that the responses may change with each run of the demonstration, so the description written in the markdown cells may not always align with the output shown. If that happens, you can generate a new response from the LLM for that demonstration.


In [None]:
demo_question = 'What are the best practices for completing the course project to ensure satisfactory contribution?'

#####<h3>Demonstration 1: Unstructured Retrieval</h3>

In [None]:
# Retrieval
def just_unstructured_retriever(question: str):
    print(f"Search query: {question}")
    unstructured_data = [el.page_content for el in vector_index.similarity_search(question)]
    final_data = f"""
Unstructured data:
{"#Document ".join(unstructured_data)}
    """
    return final_data

# Prompt augmentation: the template instructs the model to answer the question using only the context provided.
template = """Answer the question based only on the following context:
{context}
Question: {question}
Use natural language and be concise.
Answer:"""
prompt = ChatPromptTemplate.from_template(template)

# LLM generation: run two operations in parallel (retrieve the context and pass through the question)
unstructured_chain = (
    RunnableParallel(
        {
            "context": _search_query | just_unstructured_retriever,
            "question": RunnablePassthrough(),
        }
    )
    | prompt
    | llm
    | StrOutputParser()
)
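
To clarify what the parallel step hands to the prompt, here is a plain-Python analogy. It does not use LangChain: `run_parallel`, `passthrough` and `toy_retriever` are hypothetical stand-ins, built on the assumption that every branch of the parallel step receives the same input and that a pass-through returns that input unchanged.

```python
from typing import Callable, Dict

def run_parallel(steps: Dict[str, Callable], inputs: Dict) -> Dict:
    """Call every step with the same input and collect the results by key."""
    return {key: step(inputs) for key, step in steps.items()}

def passthrough(inputs: Dict) -> Dict:
    """Return the input unchanged."""
    return inputs

def toy_retriever(inputs: Dict) -> str:
    """Hypothetical retriever that just echoes the question."""
    return f"context for: {inputs['question']}"

template = "Context: {context}\nQuestion: {question}"
inputs = {"question": "Where is the report uploaded?"}

whole_dict = run_parallel({"context": toy_retriever, "question": passthrough}, inputs)
only_text = run_parallel(
    {"context": toy_retriever, "question": lambda x: x["question"]}, inputs
)

print(template.format(**whole_dict))
# Context: context for: Where is the report uploaded?
# Question: {'question': 'Where is the report uploaded?'}
print(template.format(**only_text))
# Context: context for: Where is the report uploaded?
# Question: Where is the report uploaded?
```

Under that assumption, a bare pass-through fills the question slot with the whole input dictionary rather than only its text. The model usually copes with this, but if you want only the question text in the prompt, extracting the `"question"` key (as the second variant does) is one option; it is worth checking the rendered prompt in your own run.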

In [None]:
print('Example of the retrieval output fed to the LLM: \n')
print(just_unstructured_retriever(demo_question))

Example of the retrieval output fed to the LLM: 

Search query: What are the best practices for completing the course project to ensure satisfactory contribution?





Unstructured data:

text: ©QIUHONG WANG 2024 3 • Use table and/or figures to report the model performance on both the training set and the testing sets (all these outcome can be reproduced by running your source code). Precise and succinct explanation should be provided if necessary.  This section is a general requirement for any machine learning related project. Let us keep it simple and straightforward with only necessary information.  Section 3 Contribution and Justification  Among the four aspects identified below or your unique aspects that are not listed here, please provide your justification within 1-3 pages for each aspect. This excludes the self-evaluation table.  • Complete the following table and assess your own contribution taken into account both the extent of efforts and the effectiveness of the outcomes.  Regarding the four aspects of the contribution, you are not required to cover all of them in your project. It should be your own decision depending on your interest, 

In [None]:
print('Unstructured retrieval model response: \n')
response = unstructured_chain.invoke({"question": demo_question})
wrapped_response = textwrap.fill(response, width=80)
print(wrapped_response)

Unstructured retrieval model response: 

Search query: What are the best practices for completing the course project to ensure satisfactory contribution?




To ensure satisfactory contribution to the course project, it is important to
use valuable and high-quality datasets, be creative in feature engineering,
design or adapt new ML methods/architecture, and provide objective
justifications for self-evaluation. Additionally, source code should be well-
organized, reproducible, and include necessary markdown explanations. Following
the project deliverable guidelines and assessment rubrics is crucial for meeting
the project requirements.


#####<h3>Demonstration 2: Structured Retrieval</h3>

In [None]:
# Retrieval
def just_structured_retriever(question: str):
    print(f"Search query: {question}")
    structured_data = structured_retriever(question)
   # unstructured_data = [el.page_content for el in vector_index.similarity_search(question)]
    final_data = f"""Structured data:
{structured_data}
    """
    return final_data

# Prompt augmentation: the template instructs the model to answer the question using only the context provided.
template = """Answer the question based only on the following context:
{context}

Question: {question}
Use natural language and be concise.
Answer:"""
prompt = ChatPromptTemplate.from_template(template)

# LLM generation: run two operations in parallel (retrieve the context and pass through the question)
structured_chain = (
    RunnableParallel(
        {
            "context": _search_query | just_structured_retriever,
            "question": RunnablePassthrough(),
        }
    )
    | prompt
    | llm
    | StrOutputParser()
)

In [None]:
print('Example of the retrieval output fed to the LLM: \n')
print(just_structured_retriever(demo_question))

Example of the retrieval output fed to the LLM: 

Search query: What are the best practices for completing the course project to ensure satisfactory contribution?
Structured data:
Project Source Code - UPLOADED_TO -> Canvas
Project Source Code - ACCESSIBLE_VIA -> Public Link
Project Final Presentation - UPLOADED_TO -> Canvas
Project Final Report - UPLOADED_TO -> Canvas
Project Source Code - UPLOADED_TO -> Canvas
Project Source Code - ACCESSIBLE_VIA -> Public Link
Datasets - UPLOADED_TO -> Canvas
Datasets - ACCESSIBLE_VIA -> Public Link
Source_Code_File - CONTAINS -> Model_Performance
Source_Code_File - CONTAINS -> Markdown_Explanation
Source_Code_File - RUNNABLE_IN -> Google_Colab
Model_Performance - INCLUDES -> Training_Set
Model_Performance - INCLUDES -> Testing_Set
Model_Performance - INCLUDES -> Source_Code
Project_Report - CONTAINS -> Abstract
Project_Report - CONTAINS -> Feature_Engineering
Project_Report - CONTAINS -> Models_Performance
Contribution_Justification - FOCUS -> Data

In [None]:
print('Structured retrieval model response: \n')
response = structured_chain.invoke({"question": demo_question})
wrapped_response = textwrap.fill(response, width=80)
print(wrapped_response)

Structured retrieval model response: 

Search query: What are the best practices for completing the course project to ensure satisfactory contribution?
To ensure satisfactory contribution in completing the course project, it is best
practice to upload project source code, final presentation, and final report to
Canvas, make them accessible via public links, focus on datasets and feature
engineering, utilize ML methods and bias identification, enhance insights and
creativity, explain prediction results, identify bias and gaps, impact business
decision making, utilize objective evidences for self-evaluation justification,
ensure source code quality, understand data, present with clarity, logic, and
succinctness, belong to a group, include presentation and Q&A, and avoid using
scripts.


As we can see from the LLM's response, the model benefits from the structured data returned and gives more detailed and structured elaborations. For some responses, the model managed to state that students are graded based on tutorial assignments and lab sessions, and are tested on Datacamp Assignments.

However, we can also see that the response gets the course instructor wrong due to the lack of context. This is where combining the structured and unstructured retrievers greatly improves the response.

####<h3>Demonstration 3: Hybrid Retrieval</h3>

Finally, we can go ahead and test our hybrid RAG implementation.
We will use the **retriever** function defined in Step 3b.2.3, which combines the unstructured and structured retrievers into a hybrid retriever.

In [None]:
# Retrieval
# refer to retriever defined in Step 3b.2.3


# Prompt augmentation: the template instructs the model to answer the question using only the context provided.
template = """Answer the question based only on the following context:
{context}

Question: {question}
Use natural language and be concise.
Answer:"""
prompt = ChatPromptTemplate.from_template(template)

# LLM generation: run two operations in parallel (retrieve the context and pass through the question)
final_chain = (
    RunnableParallel(
        {
            "context": _search_query | retriever,
            "question": RunnablePassthrough(),
        }
    )
    | prompt
    | llm
    | StrOutputParser()
)

In [None]:
print('Example of the retrieval output fed to the LLM: \n')
print(retriever(demo_question))

Example of the retrieval output fed to the LLM: 

Search query: What are the best practices for completing the course project to ensure satisfactory contribution?




Structured data:
Project Source Code - UPLOADED_TO -> Canvas
Project Source Code - ACCESSIBLE_VIA -> Public Link
Project Final Presentation - UPLOADED_TO -> Canvas
Project Final Report - UPLOADED_TO -> Canvas
Project Source Code - UPLOADED_TO -> Canvas
Project Source Code - ACCESSIBLE_VIA -> Public Link
Datasets - UPLOADED_TO -> Canvas
Datasets - ACCESSIBLE_VIA -> Public Link
Source_Code_File - CONTAINS -> Model_Performance
Source_Code_File - CONTAINS -> Markdown_Explanation
Source_Code_File - RUNNABLE_IN -> Google_Colab
Model_Performance - INCLUDES -> Training_Set
Model_Performance - INCLUDES -> Testing_Set
Model_Performance - INCLUDES -> Source_Code
Project_Report - CONTAINS -> Abstract
Project_Report - CONTAINS -> Feature_Engineering
Project_Report - CONTAINS -> Models_Performance
Contribution_Justification - FOCUS -> Datasets
Contribution_Justification - FOCUS -> Feature_Engineering
Contribution_Justification - FOCUS -> Aspects_Contribution
Contribution_Justification - FOCUS -> Ml_

In [None]:
print('Hybrid retrieval model response: \n')
response = final_chain.invoke({"question": demo_question})
wrapped_response = textwrap.fill(response, width=80)
print(wrapped_response)

Hybrid retrieval model response: 

Search query: What are the best practices for completing the course project to ensure satisfactory contribution?




To ensure satisfactory contribution for the course project, it is important to
follow these best practices: 1. Use valuable and high-quality datasets,
including integrating existing datasets or collecting new data from multiple
sources. 2. Be creative in feature engineering by generating new features based
on relevant theories or domain knowledge. 3. Design or adapt new ML
methods/architecture with a balance of resources and costs. 4. Utilize data
analytics methods and additional ML methods to clarify, distinguish, identify
bias, or evaluate ML output for effective business decision-making. 5. Provide
objective evidence to justify self-evaluation, including necessary details and
reproducible outcomes from your models.


Now that we have completed the demonstrations, we will show how to incorporate chat history for follow-up questions to the LLM. We will use the response from Demonstration 3 to continue the conversation.

In [None]:
question_step3b3_2 = "What did i ask in the previous question?"
previous_qn = demo_question
previous_res = wrapped_response
final_chain.invoke(
    {
        "question": question_step3b3_2,
        "chat_history": [(previous_qn, previous_res)],
    }
)

Search query: What was asked in the previous question?




'You asked about the best practices for completing the course project to ensure satisfactory contribution.'

# **Question** (4 marks)

**Part 1**. Identify a domain or scenario where the LLM is unlikely to have been trained on the documents from this domain or scenario and where, at the same time, applying LLM+RAG would be valuable. (1 mark)

For example, the LLM is unlikely to have been trained on the BT4222 deliverable guideline document. This is why it was used in this example.  

Adapt this example source code to a document from your proposed domain and demonstrate how LLM+RAG could help you better understand specific queries related to this domain.

**Part 2**. Please list your query and the best responses from LLM (1 mark)

**Part 3**. Please report which parts of the source code have been revised to fit the specific domain (if applicable) (2 marks)

*Hint*:
- With different content, the entities to be detected from your query or input can be different.
- PyPDFLoader is weak at extracting content from tables.

To be awarded the 3 marks for Part 2 and Part 3, please keep all output cells generated by the source code above in your submitted notebook file as evidence of your results.

**Your answer for part 1**:

**Your answer for part 2**:

**Your answer for part 3**: