# Real-Time GraphRAG QA

- Author: [Jongcheol Kim](https://github.com/greencode-99)
- Design: 
- Peer Review: 
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)


## Overview

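This tutorial builds a GraphRAG QA pipeline that extracts knowledge from PDF documents and answers natural-language questions through a Neo4j graph database.
After a user uploads a PDF document, it is processed with OpenAI models: `gpt-4o` extracts entities and relationships, and an OpenAI embedding model (for example, `text-embedding-ada-002`) supports the vector index.

The extracted information is stored in a Neo4j graph database. Users can then interact with the graph in real time by asking natural-language questions, which are translated into Cypher queries that retrieve answers from the graph.

Features
- Real-time GraphRAG: extracts knowledge from documents and makes it queryable in real time.
- Modular and configurable: users supply their own credentials for OpenAI and Neo4j.
- Natural-language interface: ask questions in plain English and get answers from the graph database.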


### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Neo4j Database Connection](#neo4j-database-connection)
- [PDF Processing](#pdf-processing)
- [Graph Transformation](#graph-transformation)
- [Vector Index Creation](#vector-index-creation)
- [QA Chain Setup](#qa-chain-setup)
- [Usage Example](#usage-example)

### References

- [LangChain Documentation: Neo4j Integration](https://python.langchain.com/docs/integrations/retrievers/self_query/neo4j_self_query/#filter-k)
- [Neo4j Graph Labs](https://neo4j.com/labs/genai-ecosystem/langchain/)
- [LangChain Graph QA Chain](https://python.langchain.com/api_reference/community/chains/langchain_community.chains.graph_qa.base.GraphQAChain.html#graphqachain)

----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup helpers, functions, and utilities for tutorials.
- See the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langchain",
        "langchain_community",  # provides PyPDFLoader, imported in the PDF Processing section
        "langchain_neo4j",
        "langchain_openai",
        "langchain_core",
        "langchain_text_splitters",
        "langchain_experimental",
        "pypdf",
        "json-repair",
    ],
    verbose=False,
    upgrade=False,
)

In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Real-Time GraphRAG QA",
        "NEO4J_URL": "",
        "NEO4J_USERNAME": "",
        "NEO4J_PASSWORD": "",
    }
)

Environment variables have been set successfully.


In [4]:
from dotenv import load_dotenv

load_dotenv(override=True)

True

## Neo4j Database Connection

Connect to the Neo4j database and initialize the core components.

In [5]:
import os
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from langchain_neo4j.graphs.neo4j_graph import Neo4jGraph


# Read the OpenAI API key
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise ValueError("OPENAI_API_KEY is not set.")

# Neo4j connection settings
NEO4J_URL = os.getenv("NEO4J_URL")
NEO4J_USERNAME = os.getenv("NEO4J_USERNAME")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD")

# Initialize the LangChain components
embeddings = OpenAIEmbeddings()
llm = ChatOpenAI(model_name="gpt-4o")


def connect_to_neo4j():
    """Return a Neo4jGraph connection, or None if the connection fails."""
    try:
        graph = Neo4jGraph(
            url=NEO4J_URL, username=NEO4J_USERNAME, password=NEO4J_PASSWORD
        )
        print("Connected to the Neo4j database.")
        return graph
    except Exception as e:
        print(f"Failed to connect to Neo4j: {e}")
        return None


graph = connect_to_neo4j()

Connected to the Neo4j database.
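
Because `connect_to_neo4j` returns `None` on failure, an empty connection setting only surfaces later, when `graph` is first used. The following is a minimal sketch of a pre-flight check that could run before connecting. The helper `missing_settings` and the constant `REQUIRED_SETTINGS` are hypothetical names introduced here for illustration; they are not part of the tutorial or of any library.

```python
import os

# Hypothetical helper (not part of the tutorial): report which required
# settings are absent or blank in a mapping such as os.environ.
REQUIRED_SETTINGS = ("NEO4J_URL", "NEO4J_USERNAME", "NEO4J_PASSWORD")


def missing_settings(env, required=REQUIRED_SETTINGS):
    """Return the names in `required` that are absent or blank in `env`."""
    return [name for name in required if not (env.get(name) or "").strip()]


missing = missing_settings(os.environ)
if missing:
    print(f"Missing Neo4j settings: {', '.join(missing)}")
else:
    print("All Neo4j settings are present.")
```

Failing early with the list of missing names is usually easier to debug than a connection error raised deep inside the driver.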


## PDF Processing

Define the function that loads a PDF document, extracts its text, and splits it into chunks.

In [6]:
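from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document


def process_pdf(file_path):
    # Load the PDF (load_and_split also applies the loader's default splitter)
    loader = PyPDFLoader(file_path)
    pages = loader.load_and_split()

    # Split the text into small, overlapping chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=40)
    docs = text_splitter.split_documents(pages)

    # Preprocess: strip newlines and keep only the source path as metadata.
    # Note: newlines are removed without inserting a space, so words that
    # sat on adjacent lines end up joined together.
    lc_docs = []
    for doc in docs:
        lc_docs.append(
            Document(
                page_content=doc.page_content.replace("\n", ""),
                metadata={"source": file_path},
            )
        )

    return lc_docs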
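
To make the `chunk_size=200, chunk_overlap=40` setting concrete, here is a simplified sketch of overlapping chunking using a plain sliding window. It is a stand-in for illustration only: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and spaces, so its real chunk boundaries will differ. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=200, chunk_overlap=40):
    """Split text into fixed-size windows that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# 500 characters of dummy text
sample = "".join(str(i % 10) for i in range(500))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])          # [200, 200, 180]
print(chunks[0][-40:] == chunks[1][:40])  # True: neighbours share 40 characters
```

The overlap means a sentence cut at a chunk boundary still appears intact in at least one chunk, at the cost of some duplicated text.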

## Graph Transformation

Define the function that converts the extracted text into a graph.

In [7]:
from langchain_experimental.graph_transformers import LLMGraphTransformer

def transform_to_graph(docs, graph):
    # Reset the graph database (this deletes ALL existing nodes and relationships)
    cypher = """
    MATCH (n)
    DETACH DELETE n;
    """
    graph.query(cypher)

    # Define the allowed node and relationship types
    allowed_nodes = [
        "Device",
        "PowerSource",
        "OperatingSystem",
        "ConnectionStatus",
        "Software",
        "Action",
    ]
    allowed_relationships = [
        "USES_POWER",
        "OPERATES_ON",
        "HAS_STATUS",
        "REQUIRES",
        "PERFORMS",
    ]

    # Convert the documents into graph documents
    transformer = LLMGraphTransformer(
        llm=llm,
        allowed_nodes=allowed_nodes,
        allowed_relationships=allowed_relationships,
        node_properties=False,
        relationship_properties=False,
    )

    graph_documents = transformer.convert_to_graph_documents(docs)
    graph.add_graph_documents(graph_documents, include_source=True)

    return graph
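
Conceptually, `allowed_nodes` and `allowed_relationships` act as a schema that restricts what ends up in the graph. The sketch below illustrates that idea on hand-written triples. It is not how `LLMGraphTransformer` is implemented, and the candidate triples and the `filter_triples` helper are made up for illustration.

```python
ALLOWED_NODES = {
    "Device", "PowerSource", "OperatingSystem",
    "ConnectionStatus", "Software", "Action",
}
ALLOWED_RELATIONSHIPS = {
    "USES_POWER", "OPERATES_ON", "HAS_STATUS", "REQUIRES", "PERFORMS",
}


def filter_triples(triples, allowed_nodes, allowed_relationships):
    """Keep only triples whose node types and relationship type are allowed."""
    kept = []
    for source_type, rel_type, target_type in triples:
        if (
            source_type in allowed_nodes
            and target_type in allowed_nodes
            and rel_type in allowed_relationships
        ):
            kept.append((source_type, rel_type, target_type))
    return kept


# Made-up candidate triples (source type, relationship type, target type)
candidates = [
    ("Device", "USES_POWER", "PowerSource"),
    ("Device", "MANUFACTURED_BY", "Company"),  # neither type is in the schema
    ("Software", "OPERATES_ON", "OperatingSystem"),
]
print(filter_triples(candidates, ALLOWED_NODES, ALLOWED_RELATIONSHIPS))
```

Keeping the schema small makes the later Cypher generation step easier, because the prompt can enumerate every relationship type that exists.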

## Vector Index Creation

Define the function that creates the vector index.

In [8]:
from langchain_neo4j.vectorstores.neo4j_vector import Neo4jVector


def create_vector_index():
    index = Neo4jVector.from_existing_graph(
        embedding=embeddings,
        url=NEO4J_URL,
        username=NEO4J_USERNAME,
        password=NEO4J_PASSWORD,
        database="neo4j",
        # Index the Document nodes (the source chunks stored by add_graph_documents)
        node_label="Document",
        text_node_properties=["id", "text"],
        embedding_node_property="embedding",
        index_name="vector_index",
        keyword_index_name="entity_index",
        search_type="hybrid",
    )
    return index
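
With `search_type="hybrid"`, retrieval combines a vector index with a keyword index. The sketch below covers only the vector half: ranking items by the similarity of their embeddings to a query embedding. It assumes cosine similarity and uses made-up three-dimensional vectors in place of real embeddings; the function names are hypothetical, and this is not the Neo4j implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, docs, k=3):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "embeddings" for illustration only
toy_docs = [
    ("battery chunk", [1.0, 0.0, 0.0]),
    ("pairing chunk", [0.0, 1.0, 0.0]),
    ("mixed chunk", [1.0, 1.0, 0.0]),
]
top = rank_by_similarity([1.0, 0.1, 0.0], toy_docs, k=2)
for score, text in top:
    print(f"{score:.3f}  {text}")
```

Unlike a substring match, this kind of ranking can surface a chunk that shares no exact words with the question, as long as the embeddings are close.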

## QA Chain Setup

Define the functions that set up the question-answering chain and query it.

In [36]:
from langchain_neo4j.chains.graph_qa.cypher import GraphCypherQAChain
from langchain_core.prompts import PromptTemplate

def setup_qa_chain(graph):
    template = """
    Generate a Cypher query to find information about the question.
    Use only these relationships that exist in the database: MENTIONS, PERFORMS, USES_POWER, HAS_STATUS, OPERATES_ON, REQUIRES
    
    Example query structure:
    MATCH (d:Document)-[:MENTIONS]->(a:Action)
    WHERE toLower(d.text) CONTAINS 'keyword'
    RETURN d.text as answer
    
    Question: {question}
    """
    
    question_prompt = PromptTemplate(
        template=template,
        input_variables=["question"]
    )
    
    qa = GraphCypherQAChain.from_llm(
        llm=llm,
        graph=graph,
        cypher_prompt=question_prompt,
        verbose=True,
        return_intermediate_steps=True,
        allow_dangerous_requests=True,
        top_k=3,  # Return at most the top 3 results
    )
    
    return qa
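
Setting `allow_dangerous_requests=True` means the chain will run LLM-generated Cypher against the database. The following is a rough sketch of a guard that checks a generated query before it is executed. It is regex-based rather than a real Cypher parser, so it can reject valid queries (for example, one that contains the word `set` inside a string literal), and the helper names are hypothetical. The relationship list is the one named in the prompt above.

```python
import re

ALLOWED_RELATIONSHIPS = {
    "MENTIONS", "PERFORMS", "USES_POWER", "HAS_STATUS", "OPERATES_ON", "REQUIRES",
}
# Clauses that modify data; matched as whole words, case-insensitively
WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP)\b", re.IGNORECASE
)


def relationship_types(cypher):
    """Collect relationship type names from patterns such as [:A] or [r:A|B*1..2]."""
    types = set()
    for inner in re.findall(r"\[([^\]]*)\]", cypher):
        if ":" not in inner:
            continue  # e.g. [r] has a variable but no type
        type_part = inner.split(":", 1)[1]
        type_part = re.split(r"[*{]", type_part)[0]  # drop length/property parts
        types.update(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", type_part))
    return types


def is_safe_read_query(cypher):
    """True if the query has no write clause and uses only allowed relationships."""
    if WRITE_CLAUSES.search(cypher):
        return False
    return relationship_types(cypher) <= ALLOWED_RELATIONSHIPS


example = (
    "MATCH (d:Document)-[:MENTIONS]->(a:Action) "
    "WHERE toLower(d.text) CONTAINS 'keyword' RETURN d.text AS answer"
)
print(is_safe_read_query(example))                      # True
print(is_safe_read_query("MATCH (n) DETACH DELETE n"))  # False
```

A check like this complements, but does not replace, connecting with a database user that has read-only permissions.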

In [37]:
def ask_question(qa, question):
    try:
        # Base search query: find a source chunk whose text contains the keyword
        base_query = """
        MATCH (d:Document)-[:MENTIONS]->(a)
        WHERE toLower(d.text) CONTAINS toLower($keyword)
        RETURN DISTINCT d.text as answer
        LIMIT 1
        """

        # First, search for the whole question as a single phrase
        # (this uses the module-level `graph` connection)
        result = graph.query(base_query, {'keyword': question.lower()})

        # If nothing matches, split the question into keywords and try each one
        if not result:
            keywords = question.lower().split()
            for keyword in keywords:
                if len(keyword) > 3:  # Skip short words
                    result = graph.query(base_query, {'keyword': keyword})
                    if result:
                        break

        if result and len(result) > 0:
            return result[0]['answer']

        # If there is still no result, fall back to the QA chain
        qa_result = qa.invoke({"query": question})
        if qa_result and 'result' in qa_result:
            return qa_result['result']

        return "Unable to find an answer. Please try rephrasing your question."

    except Exception as e:
        print(f"Error: {e}")
        return "An error occurred while processing your question."
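
The keyword fallback in `ask_question` splits on whitespace, so a token such as `button?` keeps its punctuation and common words such as `what` or `when` are tried as keywords. Below is a minimal sketch of a stricter keyword extractor that could feed the same Cypher query. The stopword list is a small hand-picked assumption, and `extract_keywords` is a hypothetical helper, not part of the tutorial.

```python
import re

# A small, hand-picked stopword list (an assumption; extend as needed)
STOPWORDS = {
    "what", "when", "where", "which", "does", "this", "that",
    "with", "from", "have", "your", "into", "about",
}


def extract_keywords(question, min_length=4):
    """Return lowercase content words, without punctuation, stopwords, or repeats."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    keywords = []
    seen = set()
    for word in words:
        if len(word) >= min_length and word not in STOPWORDS and word not in seen:
            seen.add(word)
            keywords.append(word)
    return keywords


print(extract_keywords("What happens when you press and hold the connect button?"))
# ['happens', 'press', 'hold', 'connect', 'button']
```

`min_length=4` mirrors the `len(keyword) > 3` check in `ask_question`.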


## Usage Example

The following cells run the system end to end.

In [11]:
# Set the PDF file path
pdf_path = "data/bluetooth_notebook_mouse_5000.pdf"

# Process the PDF
docs = process_pdf(pdf_path)

In [20]:
# Transform the documents into a graph
graph = transform_to_graph(docs, graph)

In [38]:
# Inspect the data
def inspect_neo4j_data(graph):
    # Count nodes by label
    nodes_query = """
    MATCH (n)
    RETURN DISTINCT labels(n) as labels, count(*) as count
    """
    print("=== Node types and counts ===")
    nodes = graph.query(nodes_query)
    print(nodes)

    # Count relationships by type
    rels_query = """
    MATCH ()-[r]->()
    RETURN DISTINCT type(r) as type, count(*) as count
    """
    print("\n=== Relationship types and counts ===")
    relationships = graph.query(rels_query)
    print(relationships)

    # Fetch a small sample of the overall graph structure
    sample_query = """
    MATCH (n)-[r]->(m)
    RETURN n, r, m
    LIMIT 3
    """
    print("\n=== Graph structure sample ===")
    sample = graph.query(sample_query)
    print(sample)

# Run
print("Current state of the Neo4j database:")
inspect_neo4j_data(graph)

Current state of the Neo4j database:
=== Node types and counts ===
[{'labels': ['Document'], 'count': 27}, {'labels': ['Software'], 'count': 24}, {'labels': ['Powersource'], 'count': 3}, {'labels': ['Device'], 'count': 13}, {'labels': ['Action'], 'count': 18}, {'labels': ['Operatingsystem'], 'count': 3}, {'labels': ['Connectionstatus'], 'count': 7}]

=== Relationship types and counts ===
[{'type': 'MENTIONS', 'count': 99}, {'type': 'USES_POWER', 'count': 3}, {'type': 'REQUIRES', 'count': 13}, {'type': 'PERFORMS', 'count': 18}, {'type': 'OPERATES_ON', 'count': 9}, {'type': 'HAS_STATUS', 'count': 9}]

=== Graph structure sample ===
[{'n': {'text': 'www.microsoft.com/hardwareEnglish - EnEspañol (latinoamérica) - XXportuguês (Brasil) - X cFrançais (canada) - XdX182903901bkt.indd   2-3 5/21/2012   9:40:30 AM', 'source': 'data/bluetooth_notebook_mouse_5000.pdf', 'id': '5283bb653fe2319d093ee1b57d4d9948'}, 'r': ({'text': 'www.microsoft.com/hardwareEnglish - EnEspañol (latinoamérica) - XXportuguês (Brasil) - X cFrançais (canada) - XdX182903901bkt.indd  
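
The node counts above list labels such as `Powersource` and `Operatingsystem`, while `allowed_nodes` spelled them `PowerSource` and `OperatingSystem`. Cypher labels are case-sensitive, so hand-written queries need the stored spelling. The stored labels are consistent with Python's `str.capitalize()`; whether the transformer applies exactly this normalization is an assumption, so treat the sketch below as a way to predict the spelling and verify it against the actual output.

```python
allowed_nodes = [
    "Device",
    "PowerSource",
    "OperatingSystem",
    "ConnectionStatus",
    "Software",
    "Action",
]

# str.capitalize() upper-cases the first character and lower-cases the rest
stored_labels = [name.capitalize() for name in allowed_nodes]
print(stored_labels)
# ['Device', 'Powersource', 'Operatingsystem', 'Connectionstatus', 'Software', 'Action']
```

For example, a query for power sources would match on `(:Powersource)`, not `(:PowerSource)`.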

In [22]:
# Create the vector index
index = create_vector_index()

In [23]:
# Set up the QA chain
qa = setup_qa_chain(graph)

In [39]:
# Ask a question
# question = "What power source does the mouse use?"

question = "What happens when you press and hold the connect button?"
answer = ask_question(qa, question)
print(f"\nQuestion: {question}")
print(f"Answer: {answer}")


Question: What happens when you press and hold the connect button?
Answer: control panel, and in category view, locate hardware and sound, and then select add a device.c. When the mouse is listed, select  it, and follow the instructions.


In [40]:
# Test with several questions
def test_qa():
    questions = [
        "What happens when you press and hold the connect button?",
        "What type of batteries does this mouse use?",
        "How do I connect to Windows 8?",
        "Where is the connect button located?"
    ]
    
    print("\nTesting multiple questions:")
    for q in questions:
        print(f"\nQ: {q}")
        print(f"A: {ask_question(qa, q)}")

# Run
test_qa()


Testing multiple questions:

Q: What happens when you press and hold the connect button?
A: control panel, and in category view, locate hardware and sound, and then select add a device.c. When the mouse is listed, select  it, and follow the instructions.

Q: What type of batteries does this mouse use?
A: type control panel, select control panel from the search results, and then select add devices and printers .WindoWs 7: On your computer, from the start menu, select

Q: How do I connect to Windows 8?
A: 2  To connect the mouse to your computer:a. Press and hold the connect button until the light on the top of the mouse flashes red and green.b. WindoWs 8: On your computer, press the Windows key,

Q: Where is the connect button located?
A: 2  To connect the mouse to your computer:a. Press and hold the connect button until the light on the top of the mouse flashes red and green.b. WindoWs 8: On your computer, press the Windows key,
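
In the output above, the first question is answered with a chunk that never mentions the connect button. That is consistent with `ask_question` returning the first chunk that contains any single keyword, since its query uses `LIMIT 1`. One possible refinement, sketched here under the assumption that candidate chunks are available as plain strings, is to rank chunks by how many keywords they contain. The `rank_chunks` helper is hypothetical, and the two chunks are abridged from the answers shown above.

```python
def rank_chunks(chunks, keywords, top_k=1):
    """Return up to top_k chunks, ordered by number of distinct keyword hits."""
    scored = []
    for position, chunk in enumerate(chunks):
        text = chunk.lower()
        hits = sum(1 for keyword in keywords if keyword in text)
        if hits:
            # -position breaks ties in favour of the earlier chunk
            scored.append((hits, -position, chunk))
    scored.sort(reverse=True)
    return [chunk for _, _, chunk in scored[:top_k]]


chunks = [
    "When the mouse is listed, select it, and follow the instructions.",
    "Press and hold the connect button until the light on the top of "
    "the mouse flashes red and green.",
]
keywords = ["happens", "press", "hold", "connect", "button"]
print(rank_chunks(chunks, keywords))
```

With these inputs the second chunk wins because it contains four of the five keywords, while the first contains none.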


In [30]:
# Check the database contents for debugging
def check_database_content():
    queries = [
        "MATCH (d:Document) WHERE toLower(d.text) CONTAINS 'connect button' RETURN d.text LIMIT 1",
        "MATCH (a:Action) WHERE toLower(a.id) CONTAINS 'connect' RETURN a.id",
        "MATCH (d:Document)-[:MENTIONS]->(a) RETURN DISTINCT labels(a) as node_types"
    ]

    print("\nDatabase contents:")
    for query in queries:
        result = graph.query(query)
        print(f"\nQuery: {query}")
        print(f"Result: {result}")

check_database_content()


Database contents:

Query: MATCH (d:Document) WHERE toLower(d.text) CONTAINS 'connect button' RETURN d.text LIMIT 1
Result: [{'d.text': '2  To connect the mouse to your computer:a. Press and hold the connect button until the light on the top of the mouse flashes red and green.b. WindoWs 8: On your computer, press the Windows key,'}]

Query: MATCH (a:Action) WHERE toLower(a.id) CONTAINS 'connect' RETURN a.id
Result: [{'a.id': 'Connect'}]

Query: MATCH (d:Document)-[:MENTIONS]->(a) RETURN DISTINCT labels(a) as node_types
Result: [{'node_types': ['Software']}, {'node_types': ['Action']}, {'node_types': ['Powersource']}, {'node_types': ['Device']}, {'node_types': ['Operatingsystem']}, {'node_types': ['Connectionstatus']}]
