# Real-Time GraphRAG QA

- Author: [Jongcheol Kim](https://github.com/greencode-99)
- Design: 
- Peer Review: [Heesun Moon](https://github.com/MoonHeesun), [Taylor(Jihyun Kim)](https://github.com/Taylor0819)
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain-academy/blob/main/module-4/sub-graph.ipynb) [![Open in LangChain Academy](https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/66e9eba12c7b7688aa3dbb5e_LCA-badge-green.svg)](https://academy.langchain.com/courses/take/intro-to-langgraph/lessons/58239937-lesson-2-sub-graphs)

## Overview

This tutorial provides **GraphRAG QA** functionality that extracts knowledge from PDF documents and enables natural language queries through a **Neo4j graph database**. After users upload PDF documents, they are processed with **OpenAI models**: a chat model (`gpt-4o`) extracts entities and relationships, and an embedding model (e.g., `text-embedding-ada-002`) supports the vector index.

The extracted information is stored in a **Neo4j graph database**. Users can then interact with the graph in real-time by asking natural language questions, which are converted into **Cypher queries** to retrieve answers from the graph.


### Features

- **Real-time GraphRAG:** Enables real-time knowledge extraction from documents and supports interactive querying.
- **Modular and Configurable:** Users can set up their own credentials for `OpenAI` and `Neo4j`.
- **Natural Language Interface:** Ask questions in plain English and get answers from the graph database.


### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Neo4j Database Connection](#neo4j-database-connection)
- [PDF Processing](#pdf-processing)
- [Graph Transformation](#graph-transformation)
- [Vector Index Creation](#vector-index-creation)
- [QA Chain Setup](#qa-chain-setup)
- [Define QA Function](#define-qa-function)
- [Usage Example](#usage-example)

### References

- [LangChain Documentation: Neo4j Integration](https://python.langchain.com/docs/integrations/retrievers/self_query/neo4j_self_query/#filter-k)
- [Neo4j Graph Labs](https://neo4j.com/labs/genai-ecosystem/langchain/)
- [LangChain Graph QA Chain](https://python.langchain.com/api_reference/community/chains/langchain_community.chains.graph_qa.base.GraphQAChain.html#graphqachain)
- [Graphy v1](https://github.com/AIAnytime/Graphy-v1)

----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. 
- You can check out the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langchain",
        "langchain_neo4j",
        "langchain_openai",
        "langchain_core",
        "langchain_text_splitters",
        "langchain_experimental",
        "pypdf",
        "json-repair",
    ],
    verbose=False,
    upgrade=False,
)

In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Real-Time GraphRAG QA",
        "NEO4J_URL": "",
        "NEO4J_USERNAME": "",
        "NEO4J_PASSWORD": "",
    }
)

Environment variables have been set successfully.


In [4]:
from dotenv import load_dotenv

load_dotenv(override=True)

True

## Neo4j Database Connection

This tutorial uses Neo4j and Neo4j Sandbox for graph database construction.
 
`Neo4j Sandbox` is a **free, cloud-based Neo4j instance** that lets you experiment with graph databases without a local installation.

**[Note]**
Neo4j can be set up in several different ways:

1. [`Neo4j Desktop`](https://neo4j.com/docs/operations-manual/current/installation/) :  A desktop application for local development

2. [`Neo4j Sandbox` ](https://neo4j.com/sandbox/) : A free, cloud-based platform for working with graph databases

3. [`Docker` ](https://neo4j.com/docs/operations-manual/current/docker/) : Run Neo4j in a container using the official Neo4j Docker image

### Setup Neo4j Sandbox
- Go to Neo4j Sandbox and click the "+ New Project" button. Select your desired dataset to create a database.

![select-dataset](./assets/realtime-qa-setup-01.png)

- After creation, click the toggle to see example connection code provided for different programming languages. You can easily connect using the neo4j-driver library.

![neo4j-driver](./assets/realtime-qa-setup-02.png)

- To connect the graph database with LangChain, you'll need the connection details from this section.

![connection-details](./assets/realtime-qa-setup-03.png)


The following code connects to the Neo4j database and initializes the necessary components for our application.
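
Before connecting, it can help to confirm that every credential is actually present, since an empty `NEO4J_URL` or `NEO4J_PASSWORD` only surfaces later as a connection failure. The following is a minimal sketch under that assumption; `missing_env_vars` is a hypothetical helper, not part of `langchain-opentutorial`, and the variable names are the ones used in the `set_env` cell above.

```python
import os

# Variable names taken from the set_env cell; the helper itself is hypothetical.
REQUIRED_VARS = ("OPENAI_API_KEY", "NEO4J_URL", "NEO4J_USERNAME", "NEO4J_PASSWORD")


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.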

In [5]:
import os
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from langchain_neo4j.graphs.neo4j_graph import Neo4jGraph

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise ValueError("OPENAI_API_KEY is not set")

NEO4J_URL = os.getenv("NEO4J_URL")
NEO4J_USERNAME = os.getenv("NEO4J_USERNAME")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD")

embeddings = OpenAIEmbeddings()
llm = ChatOpenAI(model_name="gpt-4o")


def connect_to_neo4j():
    try:
        graph = Neo4jGraph(
            url=NEO4J_URL, username=NEO4J_USERNAME, password=NEO4J_PASSWORD
        )
        print("Successfully connected to Neo4j database")
        return graph
    except Exception as e:
        print(f"Failed to connect to Neo4j: {e}")
        return None


graph = connect_to_neo4j()

Successfully connected to Neo4j database


## PDF Processing


Here's how we process PDF documents and extract text from them.

First, we use `PyPDFLoader` to load the PDF file and split it into individual pages. Then, we use `RecursiveCharacterTextSplitter` to break these pages down into chunks of at most 200 characters, with up to 40 characters of overlap between consecutive chunks so that context carries across chunk boundaries.

Once the splitting is complete, we clean up the text. For each chunk, we remove newline characters and attach the source file path as metadata so we can track where the content originated. The cleaned chunks are collected into a list of `Document` objects.

This approach helps us transform complex PDF documents into a format that's much more suitable for AI processing and analysis.
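
To make the effect of `chunk_size=200` and `chunk_overlap=40` concrete, here is a simplified sketch that cuts text into fixed windows. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraphs and spaces, so its chunks are often shorter than 200 characters. `sliding_chunks` is a hypothetical name used only for this illustration.

```python
def sliding_chunks(text, chunk_size=200, chunk_overlap=40):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts 160 characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(500))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
# The tail of one chunk is repeated at the head of the next one.
print(chunks[0][-40:] == chunks[1][:40])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of some duplicated text.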

In [6]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document


def process_pdf(file_path):
    loader = PyPDFLoader(file_path)
    pages = loader.load_and_split()

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=40)
    docs = text_splitter.split_documents(pages)

    lc_docs = []
    for doc in docs:
        lc_docs.append(
            Document(
                page_content=doc.page_content.replace("\n", ""),
                metadata={"source": file_path},
            )
        )

    return lc_docs

## Graph Transformation

Here's how to transform extracted text into a graph structure using our transformation function.

First, we initialize the graph database by clearing any existing nodes and relationships. We then define a set of allowed nodes and their permitted relationships:
- `Allowed nodes` : ["Device", "PowerSource", "OperatingSystem", "ConnectionStatus", "Software", "Action"]
- `Permitted relationships` : ["USES_POWER", "OPERATES_ON", "HAS_STATUS", "REQUIRES", "PERFORMS"]

For demonstration purposes, the node and relationship types listed above were generated with the `gpt-4o` model.

We then create a graph transformer using `LLMGraphTransformer` and configure it with our defined nodes and relationships. To keep things simple, we set both `node_properties` and `relationship_properties` to `False`. This transformer takes our document chunks and converts them into graph documents, which are then added to our Neo4j database along with their source information.

This whole process transforms our text data into a structured graph format, making it much easier to query and analyze relationships between different entities in our documents.
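
The idea of constraining extraction to an allowed schema can be sketched in plain Python. This is a conceptual illustration only and does not reproduce the internals of `LLMGraphTransformer`: the candidate triples are invented, `filter_triples` is a hypothetical helper, and the case-insensitive matching is an assumption of the sketch. The last line shows how Python's `str.capitalize()` would produce labels such as `Connectionstatus`, which is consistent with the labels printed by the inspection cell in the Usage Example; whether the transformer uses exactly this normalization is likewise an assumption.

```python
ALLOWED_NODES = ["Device", "PowerSource", "OperatingSystem",
                 "ConnectionStatus", "Software", "Action"]
ALLOWED_RELATIONSHIPS = ["USES_POWER", "OPERATES_ON", "HAS_STATUS",
                         "REQUIRES", "PERFORMS"]


def filter_triples(triples, allowed_nodes, allowed_relationships):
    """Keep (source_type, relationship, target_type) triples that fit the schema."""
    nodes = {n.lower() for n in allowed_nodes}
    rels = {r.lower() for r in allowed_relationships}
    return [
        (src, rel, dst)
        for src, rel, dst in triples
        if src.lower() in nodes and dst.lower() in nodes and rel.lower() in rels
    ]


# Hypothetical extraction output, invented for illustration.
candidate_triples = [
    ("Device", "USES_POWER", "PowerSource"),
    ("Device", "MADE_BY", "Company"),  # neither MADE_BY nor Company is allowed
    ("Device", "HAS_STATUS", "connectionstatus"),
]

kept = filter_triples(candidate_triples, ALLOWED_NODES, ALLOWED_RELATIONSHIPS)
print(kept)

# str.capitalize() lowercases everything after the first character.
print([label.capitalize() for label in ALLOWED_NODES])
```

Knowing the stored label spelling matters later, because Cypher labels are case-sensitive.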

In [7]:
from langchain_experimental.graph_transformers import LLMGraphTransformer


def transform_to_graph(docs, graph):
    cypher = """
    MATCH (n)
    DETACH DELETE n;
    """
    graph.query(cypher)

    allowed_nodes = [
        "Device",
        "PowerSource",
        "OperatingSystem",
        "ConnectionStatus",
        "Software",
        "Action",
    ]
    allowed_relationships = [
        "USES_POWER",
        "OPERATES_ON",
        "HAS_STATUS",
        "REQUIRES",
        "PERFORMS",
    ]

    transformer = LLMGraphTransformer(
        llm=llm,
        allowed_nodes=allowed_nodes,
        allowed_relationships=allowed_relationships,
        node_properties=False,
        relationship_properties=False,
    )

    graph_documents = transformer.convert_to_graph_documents(docs)
    graph.add_graph_documents(graph_documents, include_source=True)

    return graph

## Vector Index Creation


Here's how to create a vector index using our vector index creation function.

First, we use `Neo4jVector.from_existing_graph()` to create a vector index from our existing Neo4j graph database - this is what allows us to perform similarity searches on our graph data.

The function needs several important parameters to work properly:
- We use `OpenAI` embeddings for vector representation
- We provide database connection parameters (`url`, `username`, `password`)
- For node configuration, we specify `"Document"` as the `node_label` for indexing, because the source chunks are stored as `Document` nodes and the label must exist in the graph
- We define `text_node_properties` to use both `"id"` and `"text"` for content
- We specify where to store vector embeddings using the `embedding_node_property` parameter

For the indexing setup, we create two types of indices:
- A vector index and a keyword index (specified by `index_name` and `keyword_index_name`)
- We set the `search_type` to `"hybrid"`, enabling both vector and keyword-based searching

With this setup, the graph database supports both **similarity searches** and **hybrid querying**.

In [8]:
from langchain_neo4j.vectorstores.neo4j_vector import Neo4jVector


def create_vector_index():
    index = Neo4jVector.from_existing_graph(
        embedding=embeddings,
        url=NEO4J_URL,
        username=NEO4J_USERNAME,
        password=NEO4J_PASSWORD,
        database="neo4j",
        # Index the Document nodes stored by add_graph_documents(include_source=True)
        node_label="Document",
        text_node_properties=["id", "text"],
        embedding_node_property="embedding",
        index_name="vector_index",
        keyword_index_name="entity_index",
        search_type="hybrid",
    )
    return index
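
To give an intuition for what a hybrid search combines, here is a toy sketch that scores each document by both vector similarity and keyword overlap and keeps the better of the two scores. It is not the scoring that `Neo4jVector` performs inside the database; the two-dimensional "embeddings", the scoring rule, and the helper names are all invented for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query, text):
    """Fraction of query tokens that occur in the text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    return len(query_tokens & text_tokens) / len(query_tokens) if query_tokens else 0.0


def hybrid_rank(query, query_vec, docs, k=2):
    """Rank docs by the better of their vector and keyword scores."""
    scored = []
    for doc in docs:
        score = max(cosine(query_vec, doc["vec"]), keyword_score(query, doc["text"]))
        scored.append((score, doc["text"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy 2-d "embeddings", invented for illustration.
docs = [
    {"text": "insert two AAA batteries", "vec": [1.0, 0.0]},
    {"text": "press the connect button", "vec": [0.0, 1.0]},
    {"text": "select add a device", "vec": [0.6, 0.8]},
]

print(hybrid_rank("connect button", [0.1, 0.9], docs))
```

In this toy data the second document wins on keywords and the third is retrieved purely through vector similarity, which is the kind of complementary behavior a hybrid index is meant to provide.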

## QA Chain Setup


Here's how to create a powerful question-answering chain using the `GraphCypherQAChain`.

First, we create a custom prompt template that helps generate Cypher queries. This template is quite comprehensive - it includes all the available relationship types in our database like `MENTIONS`, `PERFORMS`, `USES_POWER`, `HAS_STATUS`, `OPERATES_ON`, and `REQUIRES`. It also provides an example query structure to ensure proper formatting and includes a placeholder where we'll insert the user's question.

Once we have our template ready, we create a `GraphCypherQAChain` with several important configurations:
- We use our configured `llm` for query generation
- We connect it to our `graph` database
- We incorporate our `cypher_prompt` template
- We enable `verbose` mode for detailed logging
- We set `return_intermediate_steps` to see what's happening under the hood
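- We set `allow_dangerous_requests` to `True`, an explicit acknowledgment that the chain executes LLM-generated Cypher against the database; use credentials with narrowly scoped permissions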
- We limit our results to the top 3 matches with `top_k`

This whole setup creates a **powerful chain** that can take **natural language questions** from users, convert them into **proper Cypher queries**, and fetch relevant answers from our graph database. It's like having a translator that converts human questions into database language and back again.
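
The relationship list in the prompt has to match what is actually stored in the graph. One way to keep the two in sync is to build the template text from a list of relationship types, as in this minimal sketch. `build_cypher_template` is a hypothetical helper, the template wording is shortened from the one used in this tutorial, and the relationship types are the ones reported by the inspection cell in the Usage Example.

```python
def build_cypher_template(relationship_types):
    """Return a prompt template string that still contains the {question} placeholder."""
    allowed = ", ".join(sorted(relationship_types))
    return (
        "Generate a Cypher query to find information about the question.\n"
        f"Use only these relationships that exist in the database: {allowed}\n"
        "\n"
        "Question: {question}\n"  # plain string, so the placeholder is preserved
    )


rel_types = ["MENTIONS", "PERFORMS", "HAS_STATUS", "OPERATES_ON", "REQUIRES", "USES_POWER"]

template = build_cypher_template(rel_types)
# Fill the placeholder the same way the prompt template would.
print(template.format(question="What type of batteries does this mouse use?"))
```

The returned string can be passed to `PromptTemplate(template=..., input_variables=["question"])` in the same way as the hand-written template.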


In [9]:
from langchain_neo4j.chains.graph_qa.cypher import GraphCypherQAChain
from langchain_core.prompts import PromptTemplate


def setup_qa_chain(graph):
    template = """
    Generate a Cypher query to find information about the question.
    Use only these relationships that exist in the database: MENTIONS, PERFORMS, USES_POWER, HAS_STATUS, OPERATES_ON, REQUIRES
    
    Example query structure:
    MATCH (d:Document)-[:MENTIONS]->(a:Action)
    WHERE toLower(d.text) CONTAINS 'keyword'
    RETURN d.text as answer
    
    Question: {question}
    """

    question_prompt = PromptTemplate(template=template, input_variables=["question"])

    qa = GraphCypherQAChain.from_llm(
        llm=llm,
        graph=graph,
        cypher_prompt=question_prompt,
        verbose=True,
        return_intermediate_steps=True,
        allow_dangerous_requests=True,
        top_k=3,
    )

    return qa
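
Because the chain runs Cypher that an LLM generated, it may be worth screening queries before trusting them. The sketch below is a conservative keyword check, not a Cypher parser, and it is no substitute for connecting with a database user that has read-only permissions. `looks_read_only` and the clause list are assumptions made for this illustration.

```python
import re

# Clauses that modify data or schema; illustrative, not exhaustive.
WRITE_CLAUSES = ("CREATE", "MERGE", "DELETE", "DETACH", "SET", "REMOVE", "DROP")

STRING_LITERAL = re.compile(r"'[^']*'" + "|" + r'"[^"]*"')
WRITE_PATTERN = re.compile(r"\b(?:" + "|".join(WRITE_CLAUSES) + r")\b", re.IGNORECASE)


def looks_read_only(cypher):
    """Return True when no write clause appears as a whole word in the query."""
    # Drop quoted string literals so a keyword inside 'text' is not flagged.
    without_literals = STRING_LITERAL.sub("", cypher)
    return WRITE_PATTERN.search(without_literals) is None


print(looks_read_only("MATCH (d:Document)-[:MENTIONS]->(a:Action) RETURN d.text"))
print(looks_read_only("MATCH (n) DETACH DELETE n"))
```

A check like this could be applied to the generated query exposed through `return_intermediate_steps` when reviewing what the chain executed.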

## Define QA Function

Here's how this function combines a **graph database query** with an **LLM fallback** to provide answers efficiently.

First, it searches the graph database with a parameterized `Cypher query`. The query looks for `Document` nodes that have a `MENTIONS` relationship and whose text contains the entire lowercased question as a substring. This is the primary way the function tries to find answers.

If that query returns nothing, the function splits the question into individual words and retries with each word longer than three characters, stopping at the first match. This helps when the full question does not appear verbatim in the text but one of its words does.

The function includes several fallback mechanisms:
- If no answer is found in the database, the question is passed to an `LLM`
- The `LLM` processes the query and generates an answer independently
- The entire process is wrapped in a `try-except` block for smooth error handling
- Users receive friendly error messages if something goes wrong

The function follows a clear decision path:
- Return database answer if found
- Use LLM's answer if database search fails
- Ask user to rephrase if both methods fail
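
The decision path above can be sketched with stand-in functions so that it runs without a database or an LLM. `answer_with_fallback`, `fake_search`, and `fake_llm` are hypothetical names; they play the roles of the graph lookup and the QA chain, and the sample text is invented.

```python
def answer_with_fallback(question, search_graph, ask_llm):
    """Try a full-string match, then per-word matches, then the LLM."""
    result = search_graph(question.lower())
    if not result:
        for word in question.lower().split():
            if len(word) > 3:  # skip short words
                result = search_graph(word)
                if result:
                    break
    if result:
        return result[0]["answer"]
    llm_answer = ask_llm(question)
    if llm_answer:
        return llm_answer
    return "Unable to find an answer. Please try rephrasing your question."


# Stand-ins for the graph lookup and the QA chain, invented for illustration.
FAKE_TEXTS = ["Press and hold the connect button until the light flashes."]


def fake_search(keyword):
    return [{"answer": t} for t in FAKE_TEXTS if keyword in t.lower()][:1]


def fake_llm(question):
    return None  # simulate an LLM that has no answer


print(answer_with_fallback("Which button pairs the mouse?", fake_search, fake_llm))
print(answer_with_fallback("Why?", fake_search, fake_llm))
```

The first call succeeds through the per-word step (the word "button" matches), while the second falls through every step and returns the rephrase message.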

This **hybrid approach** tries an inexpensive database lookup first and falls back to the LLM-backed chain only when the lookup finds nothing. Because the lookup is a plain substring match, answer quality depends heavily on which word of the question happens to match first.

In [10]:
def ask_question(qa, question):
    try:
        base_query = """
        MATCH (d:Document)-[:MENTIONS]->(a)
        WHERE toLower(d.text) CONTAINS toLower($keyword)
        RETURN DISTINCT d.text as answer
        LIMIT 1
        """

        result = graph.query(base_query, {"keyword": question.lower()})

        if not result:
            keywords = question.lower().split()
            for keyword in keywords:
                if len(keyword) > 3:
                    result = graph.query(base_query, {"keyword": keyword})
                    if result:
                        break

        if result and len(result) > 0:
            return result[0]["answer"]

        qa_result = qa.invoke({"query": question})
        if qa_result and "result" in qa_result:
            return qa_result["result"]

        return "Unable to find an answer. Please try rephrasing your question."

    except Exception as e:
        print(f"Error: {str(e)}")
        return "An error occurred while processing your question."

## Usage Example

Here's a practical example.
- I used the following document for this demonstration.
- Please download the document using the link below and save it to the `data` folder.

**Document Details**
- Product Name: `Microsoft Bluetooth Notebook Mouse 5000`
- Document Type: **User Manual**
- File Size: `1.18 MB`
- Pages: `7`

**Related Documents**
- `Laser Desktop Keyboard 6000 v3.0` Quick Start Manual
- `Bluetooth Notebook Mouse 5000` User Manual
- `Laser Mouse 6000` Specifications
- `Mouse 6000` Getting Started Manual
- `Compact Optical Mouse 500` Quick Start Manual

**Download Link**
[Microsoft Bluetooth Notebook Mouse 5000 Manual](https://www.manualslib.com/download/1876132/Microsoft-Bluetooth-Notebook-Mouse-5000.html)

In [11]:
pdf_path = "data/bluetooth_notebook_mouse_5000.pdf"

# PDF Processing
docs = process_pdf(pdf_path)

In [12]:
# Graph Transformation
# The graph creation process may take a long time depending on the size of the document.
# For a 7-page document, it takes about 2 minutes.
graph = transform_to_graph(docs, graph)

In [13]:
# Data Inspection
def inspect_neo4j_data(graph):
    # All Nodes Query
    nodes_query = """
    MATCH (n)
    RETURN DISTINCT labels(n) as labels, count(*) as count
    """
    print("=== Node Types and Count ===")
    nodes = graph.query(nodes_query)
    print(nodes)

    # All Relationships Query
    rels_query = """
    MATCH ()-[r]->()
    RETURN DISTINCT type(r) as type, count(*) as count
    """
    print("\n=== Relationship Types and Count ===")
    relationships = graph.query(rels_query)
    print(relationships)

    # Sample Graph Structure
    sample_query = """
    MATCH (n)-[r]->(m)
    RETURN n, r, m
    LIMIT 3
    """
    print("\n=== Sample Graph Structure ===")
    sample = graph.query(sample_query)
    print(sample)


print("Current State of Neo4j Database:")
inspect_neo4j_data(graph)

Current State of Neo4j Database:
=== Node Types and Count ===
[{'labels': ['Document'], 'count': 27}, {'labels': ['Device'], 'count': 19}, {'labels': ['Connectionstatus'], 'count': 10}, {'labels': ['Action'], 'count': 22}, {'labels': ['Operatingsystem'], 'count': 3}, {'labels': ['Software'], 'count': 11}, {'labels': ['Powersource'], 'count': 3}]

=== Relationship Types and Count ===
[{'type': 'MENTIONS', 'count': 102}, {'type': 'PERFORMS', 'count': 22}, {'type': 'HAS_STATUS', 'count': 12}, {'type': 'OPERATES_ON', 'count': 7}, {'type': 'REQUIRES', 'count': 15}, {'type': 'USES_POWER', 'count': 4}]

=== Sample Graph Structure ===
[{'n': {'text': 'www.microsoft.com/hardware/supportwww.microsoft.com/hardware/productguidewww.microsoft.com/hardware/downloads1  Insira duas pilhas alcalinas do tipo AAA e ligue o mouse.', 'source': 'data/bluetooth_notebook_mouse_5000.pdf', 'id': '822061c9baab0a2a3b354693240b1f00'}, 'r': ({'text': 'www.microsoft.com/hardware/supportwww.microsoft.com/hardware/prod

In [14]:
# Vector Index Creation
index = create_vector_index()

In [15]:
# QA Chain Setup
qa = setup_qa_chain(graph)

In [16]:
question = "What happens when you press and hold the connect button?"
answer = ask_question(qa, question)
print(f"\nQuestion: {question}")
print(f"Answer: {answer}")


Question: What happens when you press and hold the connect button?
Answer: control panel, and in category view, locate hardware and sound, and then select add a device.c. When the mouse is listed, select  it, and follow the instructions.


In [17]:
# Test multiple questions
def test_qa():
    questions = [
        "What happens when you press and hold the connect button?",
        "What type of batteries does this mouse use?",
        "How do I connect to Windows 8?",
        "Where is the connect button located?",
    ]

    print("\nTesting multiple questions:")
    for q in questions:
        print(f"\nQ: {q}")
        print(f"A: {ask_question(qa, q)}")


# Run
test_qa()


Testing multiple questions:

Q: What happens when you press and hold the connect button?
A: control panel, and in category view, locate hardware and sound, and then select add a device.c. When the mouse is listed, select  it, and follow the instructions.

Q: What type of batteries does this mouse use?
A: type control panel, select control panel from the search results, and then select add devices and printers .WindoWs 7: On your computer, from the start menu, select

Q: How do I connect to Windows 8?
A: 2  Pour connecter la souris à votre ordinateur :a. Maintenez enfoncé le bouton « connect » jusqu’à ce que le voyant sur le dessus de la souris clignote en rouge et vert.

Q: Where is the connect button located?
A: 2  Pour connecter la souris à votre ordinateur :a. Maintenez enfoncé le bouton « connect » jusqu’à ce que le voyant sur le dessus de la souris clignote en rouge et vert.
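
Several of the answers above appear to be matched on generic words rather than on the topic of the question: the battery question returns a chunk that contains "type" but says nothing about batteries, and "connect" also matches the French section of the manual. One possible refinement, sketched here as an assumption rather than as part of the tutorial's function, is to strip punctuation, drop common question words, and try the longest words first. `extract_keywords` and the short stopword list are hypothetical.

```python
import string

# A small, illustrative stopword list; a real one would be longer.
STOPWORDS = {"what", "when", "where", "which", "does", "this", "that",
             "with", "from", "have", "your"}


def extract_keywords(question, min_length=4):
    """Lowercase, strip punctuation, drop stopwords, longest words first."""
    cleaned = question.lower().translate(str.maketrans("", "", string.punctuation))
    words = [w for w in cleaned.split() if len(w) >= min_length and w not in STOPWORDS]
    # sorted() is stable, so words of equal length keep their original order.
    return sorted(words, key=len, reverse=True)


print(extract_keywords("What type of batteries does this mouse use?"))
print(extract_keywords("Where is the connect button located?"))
```

Trying longer words first is only a heuristic for specificity, and it does not address the multilingual content of the manual; filtering chunks by language would be a separate step.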


In [18]:
# Debugging
def check_database_content():
    queries = [
        "MATCH (d:Document) WHERE toLower(d.text) CONTAINS 'connect button' RETURN d.text LIMIT 1",
        "MATCH (a:Action) WHERE toLower(a.id) CONTAINS 'connect' RETURN a.id",
        "MATCH (d:Document)-[:MENTIONS]->(a) RETURN DISTINCT labels(a) as node_types",
    ]

    print("\nDatabase Content Check:")
    for query in queries:
        result = graph.query(query)
        print(f"\nQuery: {query}")
        print(f"Result: {result}")


check_database_content()


Database Content Check:

Query: MATCH (d:Document) WHERE toLower(d.text) CONTAINS 'connect button' RETURN d.text LIMIT 1
Result: [{'d.text': '2  To connect the mouse to your computer:a. Press and hold the connect button until the light on the top of the mouse flashes red and green.b. WindoWs 8: On your computer, press the Windows key,'}]

Query: MATCH (a:Action) WHERE toLower(a.id) CONTAINS 'connect' RETURN a.id
Result: [{'a.id': 'Connecter'}, {'a.id': 'Connect Button'}, {'a.id': 'Connect_Button'}]

Query: MATCH (d:Document)-[:MENTIONS]->(a) RETURN DISTINCT labels(a) as node_types
Result: [{'node_types': ['Connectionstatus']}, {'node_types': ['Device']}, {'node_types': ['Action']}, {'node_types': ['Operatingsystem']}, {'node_types': ['Software']}, {'node_types': ['Powersource']}]
