# Graph RAG Implementation with LlamaIndex

This notebook demonstrates the implementation of **Graph RAG (Retrieval-Augmented Generation)** using LlamaIndex. Graph RAG combines knowledge graphs with large language models to provide more accurate and context-aware answers by understanding entity relationships and community structures.

## Overview
1. **Data Loading**: Parse a PDF with Docling (the original news-articles loader is kept commented out)
2. **LLM Setup**: Configure OpenRouter with GPT-4o-mini
3. **Graph Extractor**: Custom implementation to extract entities and relationships
4. **Graph Store**: Neo4j-based store with community detection
5. **Query Engine**: Custom query engine using hierarchical community summaries
6. **Testing**: Query the knowledge graph for insights

---

## 🚀 Setup Instructions

Follow these steps to set up the project environment before running the notebook.

### Prerequisites
- Python 3.11+ installed
- Docker installed and running
- Git (optional, for cloning)

---

### Step 1: Create Python Environment

**Using Conda (Recommended)**

See the [Miniconda installation guide](https://www.anaconda.com/docs/getting-started/miniconda/install) if Conda is not installed yet.


```bash
# Create a new conda environment
conda create -n graph-rag-demo python=3.11 -y

# Activate the environment
conda activate graph-rag-demo
```
---

### Step 2: Install Required Packages

```bash

# install packages individually:
pip install pandas llama-index llama-index-core
pip install llama-index-llms-openrouter llama-index-embeddings-huggingface
pip install llama-index-graph-stores-neo4j neo4j
pip install networkx graspologic
pip install nest-asyncio python-dotenv
pip install sentence-transformers
pip install docling
```
---

### Step 3: Set Up Environment Variables

Create a `.env` file in the project root directory:

```bash
# Create .env file (Windows PowerShell)
New-Item -Path .env -ItemType File

# Create .env file (Linux/Mac/Git Bash)
touch .env
```

Add your OpenRouter API key to the `.env` file:
```
OPENROUTER_API_KEY=your_api_key_here
```

**To get an OpenRouter API key:**
1. Go to [https://openrouter.ai/](https://openrouter.ai/)
2. Sign up or log in
3. Navigate to API Keys section
4. Create a new API key
5. Copy and paste it into your `.env` file
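
Before making any API call, it can help to confirm that the key is actually visible to Python. The sketch below uses only the standard library; `require_env` is a hypothetical helper name (not part of any library used here), and it assumes the variable has already been loaded into the process environment, for example by `load_dotenv()` from `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message.

    Hypothetical helper for illustration; not part of LlamaIndex or python-dotenv.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file and restart the kernel."
        )
    return value


# Example usage (assumes the .env file has been loaded first):
# api_key = require_env("OPENROUTER_API_KEY")
```

Failing early with a readable message is usually easier to debug than an authentication error raised deep inside an LLM call.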

---

### Step 4: Set Up Neo4j Database with Docker

**Run Neo4j with APOC plugin:**

```bash
# Windows PowerShell
docker run `
  -p 7474:7474 -p 7687:7687 `
  -v "$PWD/data:/data" -v "$PWD/plugins:/plugins" `
  --name neo4j-apoc `
  -e 'NEO4J_AUTH=neo4j/12345aA#' `
  -e NEO4J_apoc_export_file_enabled=true `
  -e NEO4J_apoc_import_file_enabled=true `
  -e NEO4J_apoc_import_file_use__neo4j__config=true `
  -e NEO4JLABS_PLUGINS='["apoc"]' `
  neo4j:latest
```

```bash
# Linux/Mac/Git Bash
docker run \
  -p 7474:7474 -p 7687:7687 \
  -v "$PWD/data:/data" -v "$PWD/plugins:/plugins" \
  --name neo4j-apoc \
  -e 'NEO4J_AUTH=neo4j/12345aA#' \
  -e NEO4J_apoc_export_file_enabled=true \
  -e NEO4J_apoc_import_file_enabled=true \
  -e NEO4J_apoc_import_file_use__neo4j__config=true \
  -e NEO4JLABS_PLUGINS='["apoc"]' \
  neo4j:latest
```

**Access Neo4j Browser:**
- Open browser: [http://localhost:7474](http://localhost:7474)
- Username: `neo4j`
- Password: `12345aA#`
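
If the browser page does not load, a quick connectivity check can show whether the container's ports are reachable at all. The following is a minimal sketch using only the standard library; `is_port_open` is a hypothetical helper, and it only tests that a TCP connection can be opened, not that Neo4j accepts the credentials.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


# Example usage with the ports published by the docker command above:
# 7474 is the HTTP (browser) port and 7687 is the Bolt port.
# for port in (7474, 7687):
#     print(port, "reachable" if is_port_open("localhost", port) else "not reachable")
```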

---

### Step 5: Verify Setup

Run this cell to verify all imports work:

```python
import pandas as pd
from llama_index.core import Document
from llama_index.llms.openrouter import OpenRouter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore
import networkx as nx
from graspologic.partition import hierarchical_leiden
print("✅ All packages imported successfully!")
```

---

### ⚠️ Common Issues & Solutions

**Issue: `graspologic` version error**
```bash
pip install --upgrade graspologic
```

**Issue: Neo4j connection failed**
- Ensure Docker is running
- Check if container is running: `docker ps`
- Verify ports 7474 and 7687 are not in use

**Issue: OpenRouter API key not found**
- Check `.env` file exists in project root
- Verify the key name is exactly `OPENROUTER_API_KEY`
- Restart the notebook kernel after creating `.env`

---

### 📚 Resources
- [LlamaIndex Documentation](https://docs.llamaindex.ai/)
- [GraphRAG v2 Demo](https://developers.llamaindex.ai/python/examples/cookbooks/graphrag_v2/)
- [Neo4j Documentation](https://neo4j.com/docs/)
- [OpenRouter API](https://openrouter.ai/docs)

---

## 1. Data Loading and Preparation

The original example loads a news articles dataset from GitHub and converts the first 50 articles into LlamaIndex documents. That loader is kept below but commented out; this notebook parses a PDF with Docling instead.

In [17]:
# import pandas as pd
# from llama_index.core import Document

# news = pd.read_csv(
#     "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
# )[:50]

# news.head()

In [18]:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    PictureDescriptionApiOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from pydantic import AnyUrl, SecretStr
import os


def parse_document(
    input_doc_path,
    do_picture_description: bool = False,
    do_formula_enrichment: bool = False,
):
    api_key = SecretStr(os.environ["OPENROUTER_API_KEY"])
    model = "qwen/qwen-2-vl-7b-instruct"
    picture_desc_api_option = PictureDescriptionApiOptions(
        url=AnyUrl("https://openrouter.ai/api/v1/chat/completions"),
        prompt="Describe this image in sentences in a single paragraph.",
        params=dict(
            model=model,
        ),
        headers={
            "Authorization": f"Bearer {api_key.get_secret_value()}",
            "X-Title": "docling-pdf-parser",
        },
        timeout=60,
    )
    pipeline_options = PdfPipelineOptions(
        do_picture_description=do_picture_description,
        picture_description_options=picture_desc_api_option,
        enable_remote_services=True,
        generate_picture_images=True,
        do_formula_enrichment=do_formula_enrichment,
        images_scale=2,
    )

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
    conv_res = converter.convert(source=input_doc_path)
    return conv_res

## 1.1 Document Parsing with Docling

**Docling** is an advanced document parsing library that converts PDF files into structured text while preserving:
- Layout and formatting information
- Images and figures (with optional AI-based descriptions)
- Mathematical formulas and equations
- Tables and structured data

### Key Features of the `parse_document` Function

- **Picture Description**: Uses OpenRouter API with vision models to generate natural language descriptions of images found in PDFs
- **Formula Enrichment**: Optionally extracts and processes mathematical formulas
- **High-Resolution Image Processing**: Scales images (2x) for better quality
- **Configurable Pipeline**: Flexible options for enabling/disabling specific features

This makes it ideal for academic papers, research documents, and technical content where preserving structure and context is crucial.

In [19]:
from pathlib import Path

file = Path("./paper/short-paper-with-formula-and-image.pdf")

source = file
result = parse_document(file, do_picture_description=True, do_formula_enrichment=False)

[INFO] 2026-01-18 23:23:55,190 [RapidOCR] base.py:22: Using engine_name: torch
[INFO] 2026-01-18 23:23:55,191 [RapidOCR] device_config.py:50: Using CPU device
[INFO] 2026-01-18 23:23:55,227 [RapidOCR] download_file.py:60: File exists and is valid: C:\Users\BUITOFU\anaconda3\envs\graphrag_clean\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.pth
[INFO] 2026-01-18 23:23:55,230 [RapidOCR] main.py:50: Using C:\Users\BUITOFU\anaconda3\envs\graphrag_clean\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.pth
[INFO] 2026-01-18 23:23:55,560 [RapidOCR] base.py:22: Using engine_name: torch
[INFO] 2026-01-18 23:23:55,561 [RapidOCR] device_config.py:50: Using CPU device
[INFO] 2026-01-18 23:23:55,565 [RapidOCR] download_file.py:60: File exists and is valid: C:\Users\BUITOFU\anaconda3\envs\graphrag_clean\Lib\site-packages\rapidocr\models\ch_ptocr_mobile_v2.0_cls_infer.pth
[INFO] 2026-01-18 23:23:55,566 [RapidOCR] main

### Parse PDF Document

Load and parse a PDF file using the Docling document converter:
- **Input**: PDF file path from the `paper/` directory
- **Processing**: Extracts text, images, and structure
- **Picture Description**: Enabled to generate AI descriptions for images in the PDF
- **Formula Enrichment**: Disabled (can be enabled if needed)

The parsed result contains the document structure which will be converted to plain text for knowledge graph extraction.

### Convert to LlamaIndex Documents

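This cell originally transformed the news pandas DataFrame into LlamaIndex `Document` objects, combining title and text for each article. It is commented out here because the parsed PDF is wrapped in a `Document` in the Text Chunking section.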

In [20]:
# documents = [
#     Document(text=f"{row['title']}: {row['text']}") for i, row in news.iterrows()
# ]

## 2. LLM Configuration

Configure the Large Language Model using OpenRouter with GPT-4o-mini. This LLM will be used for:
- Extracting entities and relationships from text
- Generating community summaries
- Answering user queries

In [21]:
from dotenv import load_dotenv
import os

load_dotenv()

from llama_index.llms.openrouter import OpenRouter
from llama_index.core.llms import ChatMessage

llm = OpenRouter(
    model="openai/gpt-4o-mini",
    api_key=os.getenv("OPENROUTER_API_KEY"),
    max_tokens=1024,
    context_window=2048,
    temperature=0.1,
    timeout=60.0,
)

from llama_index.core import Settings

Settings.llm = llm

## 3. Custom Graph Extractor Implementation

### GraphRAGExtractor Class

This custom extractor extends LlamaIndex's `TransformComponent` to extract knowledge graph triplets (entity-relation-entity) from text chunks using an LLM.

**Key Features:**
- **Async Processing**: Uses asyncio for parallel extraction across multiple nodes
- **Flexible Prompting**: Customizable extraction prompt
- **Entity Extraction**: Identifies entities with types and descriptions
- **Relationship Extraction**: Discovers relationships between entities with descriptions
- **Metadata Management**: Stores extracted graph data in node metadata
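
The async processing idea can be pictured with the standard library alone. The sketch below is a simplified stand-in for a bounded-concurrency pattern, not LlamaIndex's `run_jobs` implementation; `fake_extract` and `run_bounded` are hypothetical names, and the short sleep stands in for an LLM call.

```python
import asyncio
from typing import Awaitable, Callable, List


async def fake_extract(chunk: str) -> str:
    """Stand-in for one LLM extraction call on a single chunk (assumption)."""
    await asyncio.sleep(0.01)  # simulate network latency
    return chunk.upper()


async def run_bounded(
    items: List[str],
    worker: Callable[[str], Awaitable[str]],
    num_workers: int = 4,
) -> List[str]:
    """Run `worker` over `items` with at most `num_workers` calls in flight."""
    semaphore = asyncio.Semaphore(num_workers)

    async def guarded(item: str) -> str:
        async with semaphore:
            return await worker(item)

    # gather returns results in input order, regardless of completion order.
    return list(await asyncio.gather(*(guarded(item) for item in items)))


results = asyncio.run(
    run_bounded(["chunk a", "chunk b", "chunk c"], fake_extract, num_workers=2)
)
```

Inside a notebook, where an event loop is already running, you would either `await run_bounded(...)` directly or apply `nest_asyncio` first, as the extractor cell does.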

In [22]:
import asyncio
import nest_asyncio
nest_asyncio.apply()

from typing import Any, List, Callable, Optional, Union, Dict
from IPython.display import Markdown, display

from llama_index.core.async_utils import run_jobs
from llama_index.core.indices.property_graph.utils import (
		default_parse_triplets_fn,
)
from llama_index.core.graph_stores.types import (
		EntityNode,
		KG_NODES_KEY,
		KG_RELATIONS_KEY,
		Relation,
)
from llama_index.core.llms.llm import LLM
from llama_index.core.prompts import PromptTemplate
from llama_index.core.prompts.default_prompts import (
		DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
)
from llama_index.core.schema import TransformComponent, BaseNode
from llama_index.core.bridge.pydantic import BaseModel, Field

class GraphRAGExtractor(TransformComponent):
	"""Extract triples from a graph.

		Uses an LLM and a simple prompt + output parsing to extract paths (i.e. triples) and entity, relation descriptions from text.

		Args:
				llm (LLM):
						The language model to use.
				extract_prompt (Union[str, PromptTemplate]):
						The prompt to use for extracting triples.
				parse_fn (callable):
						A function to parse the output of the language model.
				num_workers (int):
						The number of workers to use for parallel processing.
				max_paths_per_chunk (int):
						The maximum number of paths to extract per chunk.
	"""

	llm: LLM 
	extract_prompt: PromptTemplate 
	parse_fn: Callable 
	num_workers: int 
	max_paths_per_chunk: int 

	def __init__(
		self,
		llm: Optional[LLM] = None,
		extract_prompt: Optional[Union[str, PromptTemplate]] = None,
		parse_fn: Callable = default_parse_triplets_fn,
		max_paths_per_chunk: int = 10,
		num_workers: int = 4,
	)-> None: 
		from llama_index.core import Settings

		if isinstance(extract_prompt, str):
				extract_prompt = PromptTemplate(extract_prompt)
		
		super().__init__(
			 llm=llm or Settings.llm,
						extract_prompt=extract_prompt or DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
						parse_fn=parse_fn,
						num_workers=num_workers,
						max_paths_per_chunk=max_paths_per_chunk,
		)

	@classmethod
	def class_name(cls) -> str:
		return "GraphExtractor"
	
	def __call__(
			self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
	) -> List[BaseNode]:
		"""Extract triplets from nodes."""
		return asyncio.run(
						self.acall(nodes, show_progress=show_progress, **kwargs)
				)
	
	async def _aextract(
			self, node: BaseNode
	) -> BaseNode:
		"""Extract triples from a node."""
		assert hasattr(node, "text")
		
		text = node.get_content(metadata_mode="llm")
		try: 
				llm_response = await self.llm.apredict(
						self.extract_prompt,
						text = text,
						max_knowledge_triplets = self.max_paths_per_chunk,
				)

				entities, entities_relationship = self.parse_fn(llm_response)
		except ValueError:
				entities = []
				entities_relationship = []

		existing_nodes = node.metadata.pop(KG_NODES_KEY, [])
		existing_relations = node.metadata.pop(KG_RELATIONS_KEY, [])
		entity_metadata = node.metadata.copy()
		
		for entity, entity_type, description in entities:
			# normalize label
			if isinstance(entity_type, list):
				if len(entity_type) == 0:
					label = "Entity"
				else:
					label = entity_type[0]
			else:
				label = entity_type

			entity_metadata["entity_description"] = description
			entity_metadata["types"] = entity_type
			entity_node = EntityNode(
				name=str(entity),
				label=str(label),
				properties=entity_metadata
			)
			existing_nodes.append(entity_node)

		relation_metadata = node.metadata.copy()
		
		for triplet in entities_relationship: 
			subj, obj, rel, description = triplet
			relation_metadata["relationship_description"] = description
			relation = Relation(
				label = rel,
				source_id = subj,
				target_id = obj,
				properties = relation_metadata
			)
			existing_relations.append(relation)
		
		node.metadata[KG_NODES_KEY] = existing_nodes
		node.metadata[KG_RELATIONS_KEY] = existing_relations
		return node



	async def acall(
		self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
		) -> List[BaseNode]:
		jobs = []
		for node in nodes: 
				jobs.append(self._aextract(node))
		return await run_jobs(
			jobs,
			show_progress,
			desc="Extracting paths from text"
		)

## 4. Custom Graph Store with Community Detection

### GraphRAGStore Class

Extends Neo4j Property Graph Store with community detection and summarization capabilities using the Hierarchical Leiden algorithm.

**Key Features:**
- **Community Detection**: Groups related entities into communities using hierarchical Leiden clustering
- **Community Summarization**: Uses LLM to generate natural language summaries for each community
- **NetworkX Integration**: Converts graph to NetworkX format for analysis
- **Multi-community Membership**: Entities can belong to multiple communities
- **Relationship Aggregation**: Collects all relationships within each community
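
To make the multi-community membership and relationship aggregation concrete, here is a small sketch that mirrors the bookkeeping on a toy graph using plain dictionaries. The edge data and the `(node, cluster_id)` assignments are invented for illustration; in the notebook they come from Neo4j and from `hierarchical_leiden`, and `collect_community_info` here is a hypothetical standalone function.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Toy undirected graph: (node_a, node_b) -> (relation label, description).
edges: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("Apple", "Beats"): ("ACQUIRED", "Apple acquired Beats in 2014"),
    ("Apple", "iPhone"): ("PRODUCED", "iPhone is a smartphone"),
    ("Google", "TensorFlow"): ("DEVELOPED", "Google developed TensorFlow"),
}

# Invented clustering output: one (node, cluster_id) pair per membership.
# "Apple" appears twice, so it belongs to two communities.
assignments: List[Tuple[str, int]] = [
    ("Apple", 0), ("Beats", 0), ("iPhone", 0),
    ("Google", 1), ("TensorFlow", 1),
    ("Apple", 2),
]


def collect_community_info(edges, assignments):
    """Return (entity -> cluster ids, cluster id -> relationship strings)."""
    # Adjacency map so every edge is visible from both of its endpoints.
    neighbors = defaultdict(dict)
    for (a, b), data in edges.items():
        neighbors[a][b] = data
        neighbors[b][a] = data

    entity_info = defaultdict(set)
    community_info = defaultdict(list)
    for node, cluster_id in assignments:
        entity_info[node].add(cluster_id)
        for neighbor, (label, description) in neighbors[node].items():
            community_info[cluster_id].append(
                f"{node} -> {neighbor} -> {label} -> {description}"
            )
    return {k: sorted(v) for k, v in entity_info.items()}, dict(community_info)


entity_info, community_info = collect_community_info(edges, assignments)
```

Because each edge is visited from both endpoints, a relationship between two members of the same community is listed twice, once per direction; the text handed to the LLM for summarization therefore contains some redundancy.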

In [None]:
import re
import networkx as nx
from graspologic.partition import hierarchical_leiden
from collections import defaultdict

from llama_index.core.llms import ChatMessage
from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore


class GraphRAGStore(Neo4jPropertyGraphStore):

    def __init__(self, *args, llm=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.llm = llm
        self.community_summary = {}
        self.entity_info = None
        self.max_cluster_size = 5
    
    def generate_community_summary(self, text):
        """Generate summary for a given text using an LLM."""
        messages = [
            ChatMessage(
                role="system",
                content=(
                    "You are provided with a set of relationships from a knowledge graph, each represented as "
                    "entity1->entity2->relation->relationship_description. Your task is to create a summary of these "
                    "relationships. The summary should include the names of the entities involved and a concise synthesis "
                    "of the relationship descriptions. The goal is to capture the most critical and relevant details that "
                    "highlight the nature and significance of each relationship. Ensure that the summary is coherent and "
                    "integrates the information in a way that emphasizes the key aspects of the relationships."
                ),
            ),
            ChatMessage(role="user", content=text),
        ]
        response = self.llm.chat(messages)
        clean_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
        return clean_response

    def build_communities(self):
        """Builds communities from the graph and summarizes them."""
        nx_graph = self._create_nx_graph()
        community_hierarchical_clusters = hierarchical_leiden(
            nx_graph, max_cluster_size=self.max_cluster_size
        )
        self.entity_info, community_info = self._collect_community_info(
            nx_graph, community_hierarchical_clusters
        )
        self._summerize_communites(community_info)

    def _create_nx_graph(self):
        """Converts internal graph representation to NetworkX graph."""
        nx_graph = nx.Graph()
        triplets = self.get_triplets()
        for entity1, relation, entity2 in triplets:
            nx_graph.add_node(entity1.name)
            nx_graph.add_node(entity2.name)
            nx_graph.add_edge(
                relation.source_id,
                relation.target_id,
                relationship=relation.label,
                description=relation.properties.get("relationship_description", ""),
            )
        return nx_graph

    def _collect_community_info(self, nx_graph, clusters):
        """Collect information for each node based on their community,
        allowing entities to belong to multiple clusters."""
        entity_info = defaultdict(set)
        community_info = defaultdict(list)

        for item in clusters:
            node = item.node
            cluster_id = item.cluster

            # update entity_info
            entity_info[node].add(cluster_id)

            for neighbor in nx_graph.neighbors(node):
                edge_data = nx_graph.get_edge_data(node, neighbor)
                if edge_data:
                    detail = f"{node} -> {neighbor} -> {edge_data['relationship']} -> {edge_data['description']}"
                    community_info[cluster_id].append(detail)
        # Convert sets to lists for easier serialization if needed
        entity_info = {k: list(v) for k, v in entity_info.items()}
        return dict(entity_info), dict(community_info)

    # Example of community_info before summarization:
    # community_info = {
    #     0: [
    #         "Apple -> Beats -> ACQUIRED -> Apple acquired Beats in 2014",
    #         "Apple -> iPhone -> PRODUCED -> iPhone is a smartphone",
    #         "Beats -> Headphones -> IS_A -> Beats produces headphones"
    #     ]
    # }
    # Example of community_summary after summarization:
    # community_summary = {
    #     0: "Apple is a technology company that acquired Beats in 2014 and produces consumer electronics such as the iPhone. Beats is known for producing headphones.",
    #     1: "Google developed TensorFlow, a machine learning framework widely used for machine learning tasks."
    # }

    def _summerize_communites(self, community_info):
        """Generate and store summaries for each community."""
        for community_id, details in community_info.items():
            details_text = "\n".join(details) + "."
            self.community_summary[community_id] = self.generate_community_summary(
                details_text
            )

    def get_community_summaries(self):
        """Returns the community summaries, building them if not already done."""
        if not self.community_summary:
            self.build_communities()
        return self.community_summary

## 5. Custom Query Engine

### GraphRAGQueryEngine Class

Implements a custom query engine that leverages community summaries for answering questions.

**Query Process:**
1. **Entity Retrieval**: Find relevant entities from the query using similarity search
2. **Community Mapping**: Identify communities that contain these entities
3. **Community-based Answering**: Generate answers from each relevant community summary
4. **Answer Aggregation**: Combine individual community answers into a final coherent response

This approach provides more comprehensive answers by considering the broader context of entity communities.
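
The first two steps can be sketched without any LLM or database. The example below applies the same `entity -> relation -> entity` regular expression used by the query engine to some retrieved text, then maps the entities to community IDs; the sample text, `entity_info`, and `community_summaries` are invented for illustration, and the helper names are hypothetical.

```python
import re
from typing import Dict, Iterable, List, Set

# Same pattern as in the query engine: one "subject -> relation -> object" per line.
TRIPLET_PATTERN = r"^(\w+(?:\s+\w+)*)\s*->\s*([a-zA-Z\s]+?)\s*->\s*(\w+(?:\s+\w+)*)$"


def entities_from_text(text: str) -> Set[str]:
    """Collect subject and object names from triplet-formatted lines."""
    entities: Set[str] = set()
    matches = re.findall(TRIPLET_PATTERN, text, re.MULTILINE | re.IGNORECASE)
    for subject, _relation, obj in matches:
        entities.add(subject)
        entities.add(obj)
    return entities


def communities_for(entities: Iterable[str], entity_info: Dict[str, List[int]]) -> Set[int]:
    """Return the IDs of all communities that contain any of the entities."""
    ids: Set[int] = set()
    for entity in entities:
        ids.update(entity_info.get(entity, []))
    return ids


# Invented sample data for illustration.
entity_info = {"Apple": [0], "Beats": [0], "Google": [1]}
community_summaries = {
    0: "Apple acquired Beats.",
    1: "Google developed TensorFlow.",
    2: "Unrelated community.",
}

entities = entities_from_text("Apple -> ACQUIRED -> Beats")
community_ids = communities_for(entities, entity_info)
relevant = {cid: s for cid, s in community_summaries.items() if cid in community_ids}
```

One consequence of this pattern is worth knowing: `\w` does not match punctuation, so a line whose entity name contains a period or hyphen (for example `Vaswani et al. 2017`) is not matched and contributes no entities.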

In [45]:
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.llms import LLM, ChatMessage
from llama_index.core import PropertyGraphIndex

import re

class GraphRAGQueryEngine(CustomQueryEngine):
    graph_store: GraphRAGStore
    index: PropertyGraphIndex
    llm: LLM
    similarity_top_k: int = 20
    
    def custom_query(self, query_str):
        """Answer a query from the summaries of the communities that contain the retrieved entities."""

        entities = self.get_entities(query_str, self.similarity_top_k)

        # Fetch summaries first: this builds the communities (and entity_info) if needed.
        community_summaries = self.graph_store.get_community_summaries()
        community_ids = self.retrieve_entity_communities(
            self.graph_store.entity_info, entities
        )

        community_answers = [
            self.generate_answer_from_summary(community_summary, query_str)
            for community_id, community_summary in community_summaries.items()
            if community_id in community_ids
        ]

        final_answer = self.aggregate_answers(community_answers)

        return final_answer

    def get_entities(self, query_str, similarity_top_k):
        """Retrieve triplets similar to the query and collect the entity names they mention."""
        nodes_retrieved = self.index.as_retriever(
            similarity_top_k=similarity_top_k
        ).retrieve(query_str)

        entities = set()

        # Each matching line has the form: entity -> relation -> entity
        pattern = (
            r"^(\w+(?:\s+\w+)*)\s*->\s*([a-zA-Z\s]+?)\s*->\s*(\w+(?:\s+\w+)*)$"
        )

        for node in nodes_retrieved:
            matches = re.findall(
                pattern, node.text, re.MULTILINE | re.IGNORECASE
            )

            for match in matches:
                subject = match[0]
                obj = match[2]
                entities.add(subject)
                entities.add(obj)
        return list(entities)
  
    def retrieve_entity_communities(self, entity_info, entities):
        community_ids = []

        for entity in entities:
            if entity in entity_info:
                community_ids.extend(entity_info[entity])

        return list(set(community_ids))
    
    
    def generate_answer_from_summary(self, community_summary, query):
        """Generate an answer from a community summary based on a given query using LLM."""
        prompt = f"""
            Answer the question below as if you are responding directly to a user.

            Guidelines:
            - Do NOT mention or refer to any internal processes, summaries, or intermediate data.
            - Do NOT use phrases such as "community summary", "based on the information above",
            "the provided data", "no community", or similar meta expressions.
            - Provide a natural, confident answer.
            - Reasoning and inference are allowed, but must remain implicit.

            Question:
            {query}
            """


        messages = [
            ChatMessage(role="system", content=community_summary),
            ChatMessage(role="user", content=prompt),
        ]
        response = self.llm.chat(messages)
        cleaned_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
        return cleaned_response
    
    def aggregate_answers(self, community_answers):
        """Aggregate individual community answers into a final, coherent response."""
        # intermediate_text = " ".join(community_answers)
        prompt = """
            You are responding to a user question based on user given context.

            Instructions:
            - Produce a single final answer.
            - Do NOT mention combining, aggregating, or synthesizing.
            - Do NOT refer to previous answers or internal steps.
            - The response must read as a direct standalone answer.
            """

        messages = [
            ChatMessage(role="system", content=prompt),
            ChatMessage(
                role="user",
                content=f"Intermediate answers: {community_answers}",
            ),
        ]
        final_response = self.llm.chat(messages)
        cleaned_final_response = re.sub(
            r"^assistant:\s*", "", str(final_response)
        ).strip()
        return cleaned_final_response
    

## 6. Text Chunking

Split documents into smaller chunks for more granular entity extraction. Using:
- **Chunk size**: 1024 tokens
- **Overlap**: 20 tokens to maintain context across chunks
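
To see how chunk size and overlap interact, consider a fixed-window sketch over a list of tokens. This is a deliberately simplified illustration, not how `SentenceSplitter` works internally: the real splitter counts tokens with a tokenizer and prefers sentence boundaries. `chunk_tokens` is a hypothetical function name.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)
# Each chunk starts with the last token of the previous one.
```

With a 1024-token chunk and a 20-token overlap, consecutive chunks share only a small tail, which keeps the number of LLM extraction calls low while still carrying a little context across chunk boundaries.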

In [25]:
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20,
)
# pdf_document to text (test)
pdf_document = result.document.export_to_text()


documents = [Document(text=pdf_document)]


nodes = splitter.get_nodes_from_documents(documents)
nodes

len(nodes)

Parameter `strict_text` has been deprecated and will be ignored.


8

## 7. Custom Extraction Prompt

Define a detailed prompt template for knowledge graph triplet extraction.

**Prompt Instructions:**
- Identify entities with names, types, and descriptions
- Find relationships between entity pairs
- Return structured JSON output with entities and relationships
- Extract up to `max_knowledge_triplets` per chunk

This prompt ensures consistent, structured output for reliable parsing.
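
A JSON response in this shape still has to be turned into the tuples the extractor unpacks: `(name, type, description)` for entities and `(source, target, relation, description)` for relationships. The function below is a minimal sketch of such a parser under the assumption that the model returns the field names listed in the prompt; `parse_json_triplets` is a hypothetical name, not a LlamaIndex function.

```python
import json
import re
from typing import List, Tuple

ENTITY_KEYS = ("entity_name", "entity_type", "entity_description")
RELATION_KEYS = ("source_entity", "target_entity", "relation", "relationship_description")


def parse_json_triplets(response: str) -> Tuple[List[tuple], List[tuple]]:
    """Parse the LLM's JSON answer into (entities, relationships) tuple lists."""
    # Take the outermost {...} in case the model adds text around the JSON.
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if match is None:
        return [], []
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return [], []

    # Skip items that are malformed or missing a required field.
    entities = [
        tuple(str(item[key]) for key in ENTITY_KEYS)
        for item in data.get("entities", [])
        if isinstance(item, dict) and all(key in item for key in ENTITY_KEYS)
    ]
    relationships = [
        tuple(str(item[key]) for key in RELATION_KEYS)
        for item in data.get("relationships", [])
        if isinstance(item, dict) and all(key in item for key in RELATION_KEYS)
    ]
    return entities, relationships
```

A function like this could be passed as `parse_fn` when constructing `GraphRAGExtractor`. Returning empty lists on malformed output means one bad chunk is skipped instead of aborting the whole extraction run.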

In [26]:
# KG_TRIPLET_EXTRACT_TMPL = """
# -Goal-
# Given a text document, identify entities and relationships to construct a knowledge graph.
# Each extraction MUST be centered around exactly ONE Paper entity, which acts as the anchor node.
# All extracted entities and relationships MUST be directly or indirectly connected to this Paper.

# Extract up to {max_knowledge_triplets} entity-relation triplets.

# You MUST strictly follow the predefined knowledge graph schema described below.

# -Predefined Knowledge Graph Schema-

# MAIN Entity Types (ONLY these types are allowed):
# - Paper
# - Section
# - Argument
# - Claim
# - Evidence
# - Concept
# - Background
# - Author

# IMPORTANT:
# - Subtypes are NOT allowed.
# - Do NOT encode any subtype information in entity_type.
# - If the text implies a subtype (e.g. table, figure, dataset), capture it ONLY in entity_description.

# Allowed Relationships (directional, fixed):

# Paper-centered relationships (Paper is the anchor):
# - Paper -> Section : has
# - Paper -> Background : has
# - Paper -> Paper : cites
# - Author -> Paper : wrote

# Structural relationships:
# - Section -> Concept : mentions
# - Section -> Argument : contains
# - Argument -> Claim : has
# - Claim -> Evidence : supported_by

# Concept relationships:
# - Concept -> Concept : related_to

# Schema Enforcement Rules:
# - Exactly ONE Paper entity MUST exist in the output.
# - Every entity MUST be connected (directly or indirectly) to the Paper entity.
# - Do NOT invent entity types outside the allowed list.
# - Do NOT invent relationships.
# - Relationship direction MUST match the schema.
# - Do NOT extract isolated entities.
# - If information does not fit the schema, do NOT extract it.

# -Steps-
# 1. Identify the single Paper entity (anchor).
# 2. Identify all other entities that relate to this Paper.
# For each entity, extract:
# - entity_name: Name of the entity, capitalized
# - entity_type: One of the allowed MAIN entity types
# - entity_description: Description of the entity's role in the document

# 3. Identify relationships.
# Extract ONLY schema-compliant relationships.
# For each relationship:
# - source_entity: name of the source entity
# - target_entity: name of the target entity
# - relation: EXACT relationship name from the allowed list
# - relationship_description: Explanation grounded in the text

# 4. Validation:
# - Ensure there is exactly ONE Paper entity.
# - Ensure the Paper acts as the anchor of the graph.
# - Ensure all Concept-to-Concept links use ONLY `related_to`.

# -Output Formatting-
# - Return valid JSON with two keys: 'entities' and 'relationships'
# - No text outside JSON
# - If nothing valid is found, return:
#   { "entities": [], "relationships": [] }

# ****IMPORTANT****
# - Return ONLY valid JSON.
# - Do NOT include explanations, comments, or markdown.
# - The response must start with '{' and end with '}'.

# -Real Data-
# ######################
# text: {text}
# ######################
# output:
# """

KG_TRIPLET_EXTRACT_TMPL = """
-Context-
The input text is a CHUNK of a larger document.
Multiple extractions may be performed on different chunks of the SAME document.

You MUST assume:
- All chunks from the same document refer to ONE and ONLY ONE Paper entity.
- The Paper entity identity MUST be logically consistent across all chunks.
- You are NOT allowed to create multiple Paper entities for the same document.

-Goal-
Given a text chunk, extract entities and relationships to construct a knowledge graph.
Each extraction MUST be centered around exactly ONE Paper entity, which acts as the anchor node.
All extracted entities and relationships MUST be directly or indirectly connected to this Paper.

Extract up to {max_knowledge_triplets} entity-relation triplets.

You MUST strictly follow the predefined knowledge graph schema described below.

-Predefined Knowledge Graph Schema-

MAIN Entity Types (ONLY these types are allowed):
- Paper
- Reference
- Section
- Argument
- Claim
- Evidence
- Concept
- Background
- Author

IMPORTANT TYPE RULES:
- Subtypes are NOT allowed.
- Do NOT encode any subtype information in entity_type.
- If the text implies a subtype (e.g. table, figure, dataset), capture it ONLY in entity_description.

Allowed Relationships (directional, fixed):

Paper-centered relationships:
- Paper -> Section : has
- Paper -> Reference : cites
- Paper -> Background : has
- Paper -> Paper : cites
- Author -> Paper : wrote

Structural relationships:
- Section -> Concept : mentions
- Section -> Argument : contains
- Argument -> Claim : has
- Claim -> Evidence : supported_by

Concept relationships:
- Concept -> Concept : related_to

CRITICAL IDENTITY & CONNECTIVITY RULES:
- Exactly ONE Paper entity MUST exist in the output.
- If the paper title is explicitly stated, use it verbatim as entity_name.
- If the title is NOT stated, use a single generic name:
  "Unknown Paper (Derived from Document Context)"
- You MUST reuse the SAME Paper entity_name consistently.
- Every extracted entity MUST be connected to the Paper entity by at least one relationship path.
- If an entity cannot be connected to the Paper, do NOT extract it.
- Do NOT create disconnected subgraphs.

-Schema Enforcement Rules-
- Do NOT invent entity types outside the allowed list.
- Do NOT invent relationships.
- Relationship direction MUST match the schema exactly.
- Do NOT extract isolated entities.

-Steps-
1. Identify the canonical Paper entity for the document.
   - Normalize the paper identity so it remains stable across chunks.
2. Identify all other entities that relate to this Paper.
For each entity, extract:
- entity_name: Name of the entity, capitalized
- entity_type: One of the allowed MAIN entity types
- entity_description: Description of the entity's role in the document

3. Identify relationships.
Extract ONLY schema-compliant relationships.
For each relationship:
- source_entity: name of the source entity
- target_entity: name of the target entity
- relation: EXACT relationship name from the allowed list
- relationship_description: Explanation grounded in the text

4. Final Validation (MANDATORY):
- There is exactly ONE Paper entity.
- All entities are connected (directly or indirectly) to the Paper.
- No disconnected clusters exist.
- All Concept-to-Concept links use ONLY `related_to`.

-An Output Example-
{
  "entities": [
    {
      "entity_name": "Attention Is All You Need",
      "entity_type": "Paper",
      "entity_description": "The main research paper discussing the Transformer architecture."
    },
    {
      "entity_name": "Introduction",
      "entity_type": "Section",
      "entity_description": "The section introducing the motivation and context of the paper."
    },
    {
      "entity_name": "Transformer Model",
      "entity_type": "Concept",
      "entity_description": "A neural network architecture based on self-attention mechanisms."
    },
    {
      "entity_name": "Vaswani et al. 2017",
      "entity_type": "Reference",
      "entity_description": "A cited work introducing the Transformer architecture."
    }
  ],
  "relationships": [
    {
      "source_entity": "Attention Is All You Need",
      "target_entity": "Introduction",
      "relation": "has",
      "relationship_description": "The paper contains an introduction section."
    },
    {
      "source_entity": "Introduction",
      "target_entity": "Transformer Model",
      "relation": "mentions",
      "relationship_description": "The introduction mentions the Transformer model."
    },
    {
      "source_entity": "Attention Is All You Need",
      "target_entity": "Vaswani et al. 2017",
      "relation": "cites",
      "relationship_description": "The paper cites the original work that proposed the Transformer model."
    }
  ]
}

-Output Formatting-
- Return valid JSON with two keys: 'entities' and 'relationships'
- No text outside JSON
- The response MUST start with '{' and end with '}'
- If nothing valid is found, return:
  { "entities": [], "relationships": [] }

-Real Data-
######################
text: {text}
######################
output:
"""


## 8. JSON Parser and Extractor Initialization

### Custom Parse Function

Parses LLM responses to extract structured entity and relationship data:
- Removes markdown formatting (```json blocks)
- Extracts JSON content using regex
- Handles parsing errors gracefully
- Logs raw responses for debugging

### Initialize GraphRAGExtractor

Configure the extractor with:
- Custom LLM
- Custom extraction prompt
- Maximum 8 triplets per chunk (`max_paths_per_chunk=8`)
- Custom parsing function

In [27]:
import json, re
from typing import List, Tuple


def parse_fn(response_str: str) -> Tuple[List, List]:
    entities, relationships = [], []

    # 1. strip markdown
    response_str = re.sub(r"```json", "", response_str, flags=re.IGNORECASE)
    response_str = re.sub(r"```", "", response_str)

    response_str = response_str.strip()
    # Save the cleaned response for debugging (mode "w" keeps only the latest call)
    with open("raw.txt", "w", encoding="utf-8") as f:
        f.write(response_str + "\n")

    # 2. extract JSON block (greedy: from the first '{' to the last '}')
    match = re.search(r"\{[\s\S]*\}", response_str)
    if not match:
        return entities, relationships

    json_str = match.group(0)

    # 3. parse
    try:
        data = json.loads(json_str)
    except json.JSONDecodeError:
        return entities, relationships

    for e in data.get("entities", []):
        raw_type = e.get("entity_type")

        # Normalize entity_type → List[str]
        if isinstance(raw_type, str):
            # "Section:Abstract" -> ["Section", "Abstract"]
            entity_types = [t.strip() for t in raw_type.split(":") if t.strip()]
        elif isinstance(raw_type, list):
            # already a list (future-proof)
            entity_types = raw_type
        else:
            entity_types = []
        entities.append(
            (
                e.get("entity_name"),
                entity_types,
                e.get("entity_description"),
            )
        )

    for r in data.get("relationships", []):
        relationships.append(
            (
                r.get("source_entity"),
                r.get("target_entity"),
                r.get("relation"),
                r.get("relationship_description"),
            )
        )

    return entities, relationships


kg_extractor = GraphRAGExtractor(
    llm=llm,
    extract_prompt=KG_TRIPLET_EXTRACT_TMPL,
    max_paths_per_chunk=8,
    parse_fn=parse_fn,
)
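
To illustrate the cleanup that `parse_fn` performs before parsing, here is a minimal, self-contained sketch of the fence-stripping and brace-matching steps. The helper name `extract_json_block` and the sample response are made up for illustration; they are not part of the notebook. The sketch also shows a limitation of the greedy pattern: when the response contains two separate JSON objects, the match spans both and parsing fails.

```python
import json
import re

# Built dynamically so this example can itself sit inside a fenced block.
FENCE = "`" * 3


def extract_json_block(response_str: str) -> dict:
    """Hypothetical helper: strip markdown fences, then parse the outermost {...} span."""
    cleaned = re.sub(FENCE + r"(?:json)?", "", response_str, flags=re.IGNORECASE).strip()
    # Greedy match: from the first '{' to the last '}' in the response.
    match = re.search(r"\{[\s\S]*\}", cleaned)
    if not match:
        return {}
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}


# Made-up LLM response wrapped in a markdown fence with leading chatter.
payload = json.dumps(
    {
        "entities": [
            {
                "entity_name": "Introduction",
                "entity_type": "Section",
                "entity_description": "Opening section of the paper.",
            }
        ],
        "relationships": [],
    }
)
sample = "Here is the graph:\n" + FENCE + "json\n" + payload + "\n" + FENCE

data = extract_json_block(sample)
print(data["entities"][0]["entity_name"])

# Two separate objects: the greedy span covers both, so parsing fails.
print(extract_json_block('{"a": 1} and {"b": 2}'))
```

Returning an empty result on failure mirrors the behavior of `parse_fn`, which returns empty entity and relationship lists rather than raising.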

## 9. Embedding Model Setup

Initialize a HuggingFace embedding model for semantic similarity search. This notebook uses `sentence-transformers/all-MiniLM-L6-v2`:
- Lightweight and fast
- Good balance between performance and speed
- Used for entity retrieval during queries

In [28]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
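
As a rough illustration of how embeddings support entity retrieval, the sketch below ranks entities by cosine similarity to a query vector. The three-dimensional vectors and entity names are toy values chosen for readability; real vectors would come from the embedding model, and the actual retrieval is handled by LlamaIndex rather than by this code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, entity_vecs, k):
    """Return the names of the k entities most similar to the query vector."""
    ranked = sorted(
        entity_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy vectors (assumption): stand-ins for real embedding output.
entity_vecs = {
    "ENTITY A": [1.0, 0.0, 0.0],
    "ENTITY B": [0.9, 0.1, 0.0],
    "ENTITY C": [0.0, 1.0, 0.0],
}

print(top_k([1.0, 0.0, 0.0], entity_vecs, k=2))
```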

## 10. Graph Store Initialization

Initialize the custom `GraphRAGStore`, connected to a Neo4j database:
- **Database**: Neo4j running on localhost:7687
- **Features**: Community detection and summarization capabilities
- **Storage**: Persists graph data for reuse


In [29]:
from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore

# Note: used to be `Neo4jPGStore`
graph_store = GraphRAGStore(
    llm=llm, username="neo4j", password="12345aA#", url="bolt://localhost:7687"
)
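
The cell above hard-codes the Neo4j credentials. One possible alternative, sketched below, is to read them from environment variables, which could be populated from the same `.env` file used for the OpenRouter key. The variable names `NEO4J_USERNAME`, `NEO4J_PASSWORD`, and `NEO4J_URL` and the helper `neo4j_settings` are assumptions made for this example, not names defined by the notebook.

```python
import os


def neo4j_settings(env=None) -> dict:
    """Hypothetical helper: collect Neo4j connection settings from environment variables."""
    env = os.environ if env is None else env
    return {
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": env.get("NEO4J_PASSWORD", ""),
        "url": env.get("NEO4J_URL", "bolt://localhost:7687"),
    }


settings = neo4j_settings()
print(settings["url"])

# In the notebook, the settings could then be passed on, for example:
# graph_store = GraphRAGStore(llm=llm, **settings)
```

Keeping the password out of the notebook avoids committing it to version control along with the cell outputs.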

## 11. Build Property Graph Index

**Main Processing Step**: Construct the knowledge graph by:
1. Processing each text chunk with the graph extractor
2. Extracting entities and relationships using the LLM
3. Storing graph data in Neo4j
4. Creating embeddings for semantic search

⚠️ **Note**: This step may take several minutes as it processes all documents and makes multiple LLM calls.

In [30]:
from llama_index.core import PropertyGraphIndex


index = PropertyGraphIndex(
    nodes=[],
    kg_extractors=[kg_extractor],
    property_graph_store=graph_store,
    embed_model=embed_model,
    show_progress=True,
    llm=llm,
)

In [31]:
# Ingest new documents whenever needed

index.build_index_from_nodes(nodes)

Extracting paths from text: 100%|██████████| 8/8 [00:23<00:00,  2.98s/it]
Generating embeddings: 100%|██████████| 1/1 [00:00<00:00,  2.29it/s]
Generating embeddings: 100%|██████████| 6/6 [00:00<00:00, 10.58it/s]


IndexLPG(index_id='af9f67de-989a-48a2-8105-1002684b3911', summary=None)

---

## 12. Graph Analysis and Querying

The following cells demonstrate how to analyze the constructed knowledge graph and query it for insights.

### Inspect Graph Triplets

Examine a sample triplet from the knowledge graph to understand the extracted relationships and their properties.

In [32]:
index.property_graph_store.get_triplets()[7][1].properties

{'triplet_source_id': '1068c79e-abba-412d-9aa4-bc452ab77809',
 'relationship_description': 'Garner is referenced as the author of the work discussing the canonical signed-digit code.'}

### Build Communities

Execute community detection and generate summaries for each community. This step:
- Applies hierarchical Leiden clustering
- Groups entities into communities
- Generates natural language summaries using the LLM

In [41]:
index.property_graph_store.build_communities()
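
To make the grouping step more concrete, the sketch below shows one simplified way relationship descriptions could be collected per community before being handed to an LLM for summarization. It is not the `GraphRAGStore` implementation: the cluster assignments are hard-coded instead of being computed by hierarchical Leiden, and the node names, triplets, and the helper `group_by_community` are toy values invented for this example.

```python
from collections import defaultdict

# Toy cluster assignments (assumption): in the notebook these come from Leiden clustering.
node_to_community = {
    "PAPER": 0,
    "LITERATURE REVIEW": 0,
    "METHOD X": 0,
    "CONCEPT A": 1,
    "CONCEPT B": 1,
}

# Toy triplets: (source, relation, target, relationship description).
triplets = [
    ("PAPER", "has", "LITERATURE REVIEW", "The paper contains a literature review."),
    ("LITERATURE REVIEW", "mentions", "METHOD X", "The review mentions Method X."),
    ("CONCEPT A", "related_to", "CONCEPT B", "Concept A is related to Concept B."),
]


def group_by_community(node_to_community, triplets):
    """Collect one text line per relationship for every community it touches."""
    info = defaultdict(list)
    for src, rel, dst, desc in triplets:
        line = f"{src} -> {rel} -> {dst}: {desc}"
        # A relationship whose endpoints sit in different communities is added to both.
        for cid in {node_to_community[src], node_to_community[dst]}:
            info[cid].append(line)
    return dict(info)


for cid, lines in sorted(group_by_community(node_to_community, triplets).items()):
    print(f"Community {cid}:")
    for line in lines:
        print("  " + line)
```

Each community's lines would then form the input text for one summary.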

### Initialize Query Engine

Create the custom query engine with:
- Reference to the graph store
- LLM for answer generation
- Similarity search with top 10 results

In [42]:


query_engine = GraphRAGQueryEngine(
    graph_store=index.property_graph_store,
    llm=llm,
    index=index,
    similarity_top_k=10,
)

### Query 1: Paper Authors

Ask the graph who authored the paper. The query engine will:
1. Find entities related to the query
2. Retrieve relevant community summaries
3. Generate a comprehensive answer
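
The three steps above can be sketched with plain Python as follows. This is a simplified illustration, not the `GraphRAGQueryEngine` code: substring matching stands in for embedding similarity, the final LLM call is replaced by building the prompt text, and the entity-to-community mapping, summaries, and function names are placeholder assumptions.

```python
# Toy stand-ins (assumption): in the notebook these would come from the graph store.
entity_info = {"REITWIESNER'S METHOD": [0], "ALGORITHM 9": [1]}
community_summary = {
    0: "Placeholder summary for community 0.",
    1: "Placeholder summary for community 1.",
}


def find_entities(query, entity_names):
    """Step 1 (simplified): substring match instead of embedding similarity."""
    q = query.lower()
    return [name for name in entity_names if name.lower() in q]


def relevant_summaries(entities, entity_info, community_summary):
    """Step 2: collect the summaries of every community the entities belong to."""
    community_ids = sorted({cid for e in entities for cid in entity_info.get(e, [])})
    return [community_summary[cid] for cid in community_ids if cid in community_summary]


def build_answer_prompt(query, summaries):
    """Step 3 (simplified): build the text an LLM would be asked to answer from."""
    context = "\n".join(f"- {s}" for s in summaries)
    return f"Context:\n{context}\n\nQuestion: {query}"


query = "What does Algorithm 9 simplify?"
entities = find_entities(query, entity_info)
summaries = relevant_summaries(entities, entity_info, community_summary)
print(build_answer_prompt(query, summaries))
```

If no entity matches the query, no summaries are retrieved and the answer has no context to draw on, which is one plausible reason a query can come back with "not mentioned in the provided information".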

In [46]:
response = query_engine.query("What are the main authors in the paper?")
display(Markdown(f"{response.response}"))

The main authors in the paper are not mentioned in the provided information.

### Query 2: Main Topics Overview

Broad query to understand the main topics discussed in the document.

In [47]:
response = query_engine.query("What are the main topics discussed in the document?")
display(Markdown(f"{response.response}"))

The document discusses several key topics, including the REITWIESNER algorithm and its modification, REITWIESNERMODIFIED, along with the GARNERREVISITED algorithm. It covers the conversion of conventional binary numbers, the canonical signed-digit code, and the performance evaluations of various algorithms such as Algorithm 5, Algorithm 9, Algorithm 10, STRING_0, and STRING_1. Additionally, it explores the relationships between these algorithms, particularly focusing on the simplification of Algorithm 5 by Algorithm 9, and the implications of the findings related to the REITWIESNERMODIFIED algorithm as presented in Algorithm 8 and the conclusions section. Section 5 is highlighted for its elaboration on Algorithm 9 and the overall analysis of algorithmic performance.

In [48]:
response_2 = query_engine.query(
    "Which themes connect the technology and geopolitics articles in the dataset"
)
display(Markdown(f"{response_2.response}"))

The provided information does not contain any details regarding technology and geopolitics articles or their themes.

### View All Community Summaries

Display all generated community summaries to understand the different topic clusters discovered in the corpus.

In [38]:
for cid, summary in index.property_graph_store.community_summary.items():
    print(f"\n===== COMMUNITY {cid} =====")
    print(summary)


===== COMMUNITY 0 =====
The relationships outlined in the knowledge graph highlight the connections between various components of a paper focused on the conversion of nonnegative integers to the canonical signed-digit representation. 

1. **A NOTE ON THE CONVERSION OF NONNEGATIVE INTEGERS TO THE CANONICAL SIGNED-DIGIT REPRESENTATION** has a **LITERATURE REVIEW** that discusses methods for canonical signed-digit conversion, indicating that this section is crucial for understanding the existing techniques in the field.

2. The **LITERATURE REVIEW** also references **REITWIESNER'S METHOD**, emphasizing its significance in the context of signed-digit conversion. This suggests that Reitwiesner's method is a notable approach within the broader discussion of conversion methods.

Overall, the summary underscores the importance of the literature review in providing a comprehensive overview of canonical signed-digit conversion methods, with a specific mention of Reitwiesner's method as a key re

### Debug: Verify Graph Structure

Diagnostic cell to verify that the knowledge graph was built correctly:
- Check total number of triplets extracted
- Inspect sample triplets
- View entity-community mappings
- Confirm community summaries exist
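
As a complement to the checks listed above, the sketch below counts relation labels and flags any that fall outside the schema given in the extraction prompt. It assumes each triplet is a `(source, relation, target)` sequence whose relation exposes a `.label` attribute, as the sample output suggests. The helpers `relation_label_counts` and `off_schema_labels` are hypothetical, and `SimpleNamespace` objects stand in for the real node and relation classes.

```python
from collections import Counter
from types import SimpleNamespace

# Relation names taken from the "Allowed Relationships" list in the extraction prompt.
ALLOWED_RELATIONS = {
    "has", "cites", "wrote", "mentions", "contains", "supported_by", "related_to",
}


def relation_label_counts(triplets):
    """Count how often each relation label occurs."""
    return Counter(rel.label for _, rel, _ in triplets)


def off_schema_labels(triplets):
    """Return relation labels that are not in the allowed schema."""
    return sorted(set(relation_label_counts(triplets)) - ALLOWED_RELATIONS)


def node(label):
    # Stand-in for an entity node; only the label matters here.
    return SimpleNamespace(label=label)


# Toy triplets (assumption); the third one uses a label outside the schema.
toy_triplets = [
    (node("Paper"), SimpleNamespace(label="has"), node("Section")),
    (node("Section"), SimpleNamespace(label="mentions"), node("Concept")),
    (node("Section"), SimpleNamespace(label="describes"), node("Concept")),
]

print(relation_label_counts(toy_triplets))
print(off_schema_labels(toy_triplets))
```

In the notebook, the same functions could be applied to `index.property_graph_store.get_triplets()` to see whether the LLM followed the relationship schema.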

In [39]:
# After building index, check if triplets exist:
triplets = index.property_graph_store.get_triplets()
print(f"Number of triplets: {len(triplets)}")
if triplets:
    print(f"Sample triplet: {triplets[0]}")

# Check community info
print(f"Entity info: {index.property_graph_store.entity_info}")
print(f"Community summaries: {index.property_graph_store.community_summary}")

Number of triplets: 136
Sample triplet: [EntityNode(label='Paper', embedding=None, properties={'triplet_source_id': '2e6d7c42-1216-4bc3-80b1-ef92ac31f6e2', 'entity_description': 'The main paper discussing the signed-digit representation of non-negative integers.', 'types': ['Paper'], 'id': 'A NOTE ON THE CONVERSION OF NONNEGATIVE INTEGERS TO THE CANONICAL SIGNED-DIGIT REPRESENTATION'}, name='A NOTE ON THE CONVERSION OF NONNEGATIVE INTEGERS TO THE CANONICAL SIGNED-DIGIT REPRESENTATION'), Relation(label='has', source_id='A NOTE ON THE CONVERSION OF NONNEGATIVE INTEGERS TO THE CANONICAL SIGNED-DIGIT REPRESENTATION', target_id='LITERATURE REVIEW', properties={'triplet_source_id': '2e6d7c42-1216-4bc3-80b1-ef92ac31f6e2', 'relationship_description': 'The paper contains a section that reviews methods for canonical signed-digit conversion.'}), EntityNode(label='Section', embedding=None, properties={'triplet_source_id': '2e6d7c42-1216-4bc3-80b1-ef92ac31f6e2', 'entity_description': 'A section in 