# GraphRAG
GraphRAG is an AI-based content interpretation and search capability. Using LLMs, it parses data to create a knowledge graph and answer user questions about a user-provided private dataset.
The GraphRAG project is a data pipeline and transformation suite that is designed to extract meaningful, structured data from unstructured text using the power of LLMs.

https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/

# What is GraphRAG and How Does It Work?
Unlike baseline RAG, which uses a vector database to retrieve semantically similar text, GraphRAG enhances RAG by incorporating knowledge graphs (KGs). A knowledge graph is a data structure that stores entities and links them according to their relationships.
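
To make the idea concrete, here is a minimal sketch of a knowledge graph as a plain Python data structure. The `KnowledgeGraph` class, its methods, and the sample relationships are illustrative assumptions and are not part of the GraphRAG API; GraphRAG builds its graph with an LLM-driven pipeline.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy knowledge graph: entities are nodes, labeled relationships are edges.

    Hypothetical helper for illustration only, not a GraphRAG class.
    """

    def __init__(self):
        # entity -> list of (relation, other_entity) pairs
        self._edges = defaultdict(list)

    def add_relationship(self, source, relation, target):
        # Store the edge in both directions so either entity can be an entry point.
        self._edges[source].append((relation, target))
        self._edges[target].append((relation, source))

    def neighbors(self, entity):
        # Entities directly linked to `entity`, in insertion order, without duplicates.
        seen = []
        for _, other in self._edges.get(entity, []):
            if other not in seen:
                seen.append(other)
        return seen


kg = KnowledgeGraph()
kg.add_relationship("F. SCOTT FITZGERALD", "authored", "THE GREAT GATSBY")
kg.add_relationship("PROJECT GUTENBERG", "publishes", "THE GREAT GATSBY")

print(kg.neighbors("THE GREAT GATSBY"))
```

Looking up the neighbors of an entity is the basic operation that lets a graph-based retriever move from one matched entity to related ones, which a pure vector search cannot do.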

### Indexing and querying processes
![image.png](pipeline.png)

#### From Local to Global: A Graph RAG Approach to Query-Focused Summarization
https://arxiv.org/pdf/2404.16130


#### Leiden
https://arxiv.org/pdf/1810.08473

### Milvus Vector Database
Milvus is a high-performance, highly scalable vector database that runs efficiently across a wide range of environments, from a laptop to large-scale distributed systems. It is available as both open-source software and a cloud service.

https://milvus.io/

In [None]:
! pip install --upgrade pymilvus
! pip install git+https://github.com/zc277584121/graphrag.git

In [19]:
! python -m pip install future -q
! python -m pip install python-dotenv -q

In [1]:
import nest_asyncio
nest_asyncio.apply()

In [2]:
import os
import urllib.request

In [3]:
index_root = os.path.join(os.getcwd(), 'graphrag_index')
os.makedirs(os.path.join(index_root, 'input'), exist_ok=True)

In [4]:

url = "https://www.gutenberg.org/cache/epub/7785/pg7785.txt"
# Optional: download the sample text from the URL above instead of using a local file.
# file_path = os.path.join(index_root, 'input', 'davinci.txt')
# urllib.request.urlretrieve(url, file_path)
file_path = os.path.join(index_root, 'input', 'book.txt')  # This file must already exist.

with open(file_path, 'r+', encoding='utf-8') as file:
    # As written, the whole file is rewritten unchanged.
    # To reduce API cost, write only the first N lines instead
    # (the commented-out call below keeps the first 934 lines).
    lines = file.readlines()
    file.seek(0)
    # file.writelines(lines[:934])  # Lower this number to reduce API cost further.
    file.writelines(lines)
    file.truncate()

# Initialize the workspace

In [32]:
! python -m graphrag.index --help

usage: __main__.py [-h] [--config CONFIG] [-v] [--memprofile] [--root ROOT]
                   [--resume RESUME] [--reporter REPORTER] [--emit EMIT]
                   [--dryrun] [--nocache] [--init] [--overlay-defaults]

options:
  -h, --help           show this help message and exit
  --config CONFIG      The configuration yaml file to use when running the
                       pipeline
  -v, --verbose        Runs the pipeline with verbose logging
  --memprofile         Runs the pipeline with memory profiling
  --root ROOT          If no configuration is defined, the root directory to
                       use for input data and output data. Default value: the
                       current directory
  --resume RESUME      Resume a given data run leveraging Parquet output
                       files.
  --reporter REPORTER  The progress reporter to use. Valid values are 'rich',
                       'print', or 'none'
  --emit EMIT          The data formats to emit, comma-separate

In [5]:
! python -m graphrag.index --init --root ./graphrag_index

Initializing project at ./graphrag_index
⠋ GraphRAG Indexer

# Running the indexing pipeline

In [6]:
! python -m graphrag.index --root ./graphrag_index

🚀 Reading settings from graphrag_index/settings.yaml
⠹ GraphRAG Indexer
├── Loading Input (InputFileType.text) - 1 files loaded (0 filtered) ━ 100% … 0…
└── create_base_text_units

In [7]:
import os
import pandas as pd
import tiktoken
from graphrag.query.context_builder.entity_extraction import EntityVectorStoreKey
from graphrag.query.indexer_adapters import (
    # read_indexer_covariates,
    read_indexer_entities,
    read_indexer_relationships,
    read_indexer_reports,
    read_indexer_text_units,
)
from graphrag.query.input.loaders.dfs import (
    store_entity_semantic_embeddings,
)
from graphrag.query.llm.oai.chat_openai import ChatOpenAI
from graphrag.query.llm.oai.embedding import OpenAIEmbedding
from graphrag.query.llm.oai.typing import OpenaiApiType
from graphrag.query.question_gen.local_gen import LocalQuestionGen
from graphrag.query.structured_search.local_search.mixed_context import (
    LocalSearchMixedContext,
)
from graphrag.query.structured_search.local_search.search import LocalSearch
from graphrag.vector_stores import MilvusVectorStore

In [8]:
output_dir = os.path.join(index_root, "output")
subdirs = [os.path.join(output_dir, d) for d in os.listdir(output_dir)]
latest_subdir = max(subdirs, key=os.path.getmtime)  # Get latest output directory
INPUT_DIR = os.path.join(latest_subdir, "artifacts")

In [9]:
COMMUNITY_REPORT_TABLE = "create_final_community_reports"
ENTITY_TABLE = "create_final_nodes"
ENTITY_EMBEDDING_TABLE = "create_final_entities"
RELATIONSHIP_TABLE = "create_final_relationships"
COVARIATE_TABLE = "create_final_covariates"
TEXT_UNIT_TABLE = "create_final_text_units"
COMMUNITY_LEVEL = 2

In [10]:
# read nodes table to get community and degree data
entity_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_TABLE}.parquet")
entity_embedding_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_EMBEDDING_TABLE}.parquet")

# Entities

In [11]:
entities = read_indexer_entities(entity_df, entity_embedding_df, COMMUNITY_LEVEL)
description_embedding_store = MilvusVectorStore(
    collection_name="entity_description_embeddings",
)
# description_embedding_store.connect(uri="http://localhost:19530") # For Milvus docker service
description_embedding_store.connect(uri="./milvus.db") # For Milvus Lite
entity_description_embeddings = store_entity_semantic_embeddings(
    entities=entities, vectorstore=description_embedding_store
)
print(f"Entity count: {len(entity_df)}")
entity_df.head()

Entity count: 1023


Unnamed: 0,level,title,type,description,source_id,degree,human_readable_id,id,size,graph_embedding,community,top_level_node_id,x,y
0,0,"""PROJECT GUTENBERG""","""ORGANIZATION""",Project Gutenberg is an organization founded b...,"3a765e78bbe9953c33030edec77489ab,4d72b9e64b16c...",9,0,b45241d70f0e43fca764df95b2b81f77,9,,,b45241d70f0e43fca764df95b2b81f77,0,0
1,0,"""THE GREAT GATSBY""","""EVENT""","""The Great Gatsby"" is a novel authored by F. S...","3a765e78bbe9953c33030edec77489ab,4fa7ef9cb033a...",8,1,4119fd06010c494caa07f439b333f4c5,8,,,4119fd06010c494caa07f439b333f4c5,0,0
2,0,"""F. SCOTT FITZGERALD""","""PERSON""",F. Scott Fitzgerald is the renowned author of ...,"3a765e78bbe9953c33030edec77489ab,4fa7ef9cb033a...",2,2,d3835bf3dda84ead99deadbeac5d0d7d,2,,,d3835bf3dda84ead99deadbeac5d0d7d,0,0
3,0,"""UNITED STATES""","""GEO""",The United States is highlighted as the primar...,"3a765e78bbe9953c33030edec77489ab,df3c422bf520d...",1,3,077d2820ae1845bcbb1803379a3d1eae,1,,,077d2820ae1845bcbb1803379a3d1eae,0,0
4,0,"""ALEX CABAL""","""PERSON""","""Alex Cabal is credited with producing the eBo...",3a765e78bbe9953c33030edec77489ab,1,4,3671ea0dd4e84c1a9b02c5ab2c8f4bac,1,,,3671ea0dd4e84c1a9b02c5ab2c8f4bac,0,0


# Relationships

In [12]:
relationship_df = pd.read_parquet(f"{INPUT_DIR}/{RELATIONSHIP_TABLE}.parquet")
relationships = read_indexer_relationships(relationship_df)
print(f"Relationship count: {len(relationship_df)}")
relationship_df.head()


Relationship count: 307


Unnamed: 0,source,target,weight,description,text_unit_ids,id,human_readable_id,source_degree,target_degree,rank
0,"""PROJECT GUTENBERG""","""THE GREAT GATSBY""",1.0,"""Project Gutenberg has made 'The Great Gatsby'...",[3a765e78bbe9953c33030edec77489ab],2fb66f9a0de6406d83b61742a3b52cd6,0,9,8,17
1,"""PROJECT GUTENBERG""","""PROJECT GUTENBERG AUSTRALIA""",1.0,"""Project Gutenberg Australia operates as a par...",[3a765e78bbe9953c33030edec77489ab],b0e6cfd979ea48b997019b059999d3c2,1,9,1,10
2,"""PROJECT GUTENBERG""","""THE PROJECT GUTENBERG LITERARY ARCHIVE FOUNDA...",1.0,"""Project Gutenberg's electronic works are owne...",[df3c422bf520d888d06d2ebc04c77453],ef00ec3a324f4f5986141401002af3f6,2,9,1,10
3,"""PROJECT GUTENBERG""","""UNITED STATES""",1.0,"""Project Gutenberg's activities, particularly ...",[df3c422bf520d888d06d2ebc04c77453],a542fd7aed7341468028928937ea2983,3,9,1,10
4,"""PROJECT GUTENBERG""","""PROJECT GUTENBERG LITERARY ARCHIVE FOUNDATION""",2.0,The Project Gutenberg Literary Archive Foundat...,"[4d72b9e64b16c757a763b3bd66422a1f, 903cdd5308f...",1c5e296a5ac541c1b5cac4357537c22d,4,9,1,10


# Community Reports

In [13]:
report_df = pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_REPORT_TABLE}.parquet")
reports = read_indexer_reports(report_df, entity_df, COMMUNITY_LEVEL)

print(f"Report records: {len(report_df)}")
report_df.head()

Report records: 37


Unnamed: 0,community,full_content,level,rank,title,rank_explanation,summary,findings,full_content_json,id
0,30,# The Complex World of Mr. Gatsby and His Ques...,2,7.5,The Complex World of Mr. Gatsby and His Quest ...,The impact severity rating is high due to the ...,This report delves into the intricate communit...,[{'explanation': 'Mr. Gatsby's claimed associa...,"{\n ""title"": ""The Complex World of Mr. Gats...",6da7e08b-92ff-444c-b65d-ef874cbbf2ac
1,31,# Chicago and Its Central Role in Character Dy...,2,7.5,Chicago and Its Central Role in Character Dyna...,The impact severity rating is relatively high ...,This report delves into the intricate relation...,[{'explanation': 'Chicago is depicted as a cen...,"{\n ""title"": ""Chicago and Its Central Role ...",870f9fd9-b14e-4a33-b2d2-e24494711438
2,32,# The Complex Web of Gatsby and His Circle\n\n...,2,7.5,The Complex Web of Gatsby and His Circle,The impact severity rating is high due to the ...,This report delves into the intricate relation...,[{'explanation': 'Jay Gatsby's life and action...,"{\n ""title"": ""The Complex Web of Gatsby and...",8753ce48-e3d3-4a33-a773-daa0f39dac51
3,33,# West Egg: A Tale of New Wealth and Social As...,2,8.5,West Egg: A Tale of New Wealth and Social Aspi...,The impact severity rating is high due to the ...,West Egg serves as a pivotal setting in the na...,[{'explanation': 'West Egg represents the emer...,"{\n ""title"": ""West Egg: A Tale of New Wealt...",a52a6c5a-9ca2-496a-8538-c37b85de2ec3
4,34,# Gatsby's Household and Associates\n\nThis re...,2,7.5,Gatsby's Household and Associates,The impact severity rating is relatively high ...,This report delves into the intricate relation...,[{'explanation': 'The Butler emerges as a cent...,"{\n ""title"": ""Gatsby's Household and Associ...",c576c7c1-22ec-4be9-b4a3-8f8bacd76543


# Text Units

In [14]:
text_unit_df = pd.read_parquet(f"{INPUT_DIR}/{TEXT_UNIT_TABLE}.parquet")
text_units = read_indexer_text_units(text_unit_df)
print(f"Text unit records: {len(text_unit_df)}")
text_unit_df.head()

Text unit records: 64


Unnamed: 0,id,text,n_tokens,document_ids,entity_ids,relationship_ids
0,3a765e78bbe9953c33030edec77489ab,Project Gutenberg eBook of The Great Gatsby\n ...,1200,[a8e86e4eb56075ab3a7ef61716a89e63],"[b45241d70f0e43fca764df95b2b81f77, 4119fd06010...","[2fb66f9a0de6406d83b61742a3b52cd6, b0e6cfd979e..."
1,8185e9ebc4df48f58fc509947870d49d,"5, just a quarter of\na century after my fathe...",1200,[a8e86e4eb56075ab3a7ef61716a89e63],"[96aad7cb4b7d40e9b7e13b94a67af206, c9632a35146...","[423b72bbd56f4caa98f3328202c1c3c9, 5c7ef01f46a..."
2,2611fae4b169a75b2f59911eabd6b01c,dinner with the Tom\nBuchanans. Daisy was my ...,1200,[a8e86e4eb56075ab3a7ef61716a89e63],"[254770028d7a4fa9877da4ba0ad5ad21, e69dc259edb...","[ab3a5a6713244fd595a1ace978c3d960, 2fbd74d5ccc..."
3,229aa7c469dadb63b7280786e10d588c,"end of the divan, completely motionless, and ...",1200,[a8e86e4eb56075ab3a7ef61716a89e63],"[bc0e3f075a4c4ebbb7c7b152b65a5625, 254770028d7...","[aefde1f7617f4c0e9aed31db77f6d862, ab3a5a67132..."
4,3fafa1a846a8fa6c642a4591309a0329,"enderly, languidly, their hands set lightly on...",1200,[a8e86e4eb56075ab3a7ef61716a89e63],"[e69dc259edb944ea9ea41264b9fcfe59, d64ed762ea9...","[ec45e1c400654c4f875046926486ded7, e1f524d4b97..."


# Querying

GraphRAG has two different querying workflows tailored for different queries.

- Global Search for reasoning about holistic questions related to the whole data corpus by leveraging the community summaries.
- Local Search for reasoning about specific entities by fanning out to their neighbors and associated concepts.

# Global Search

![image.png](global.png)

- User Query and Conversation History: The system takes the user query and the conversation history as the initial input.
- Community Report Batches: The system uses node community reports generated by the LLM from a specified level of the community hierarchy as context data. These community reports are shuffled and divided into multiple batches (Shuffled Community Report Batch 1, Batch 2… Batch N).
- RIR (Rated Intermediate Responses): Each batch of community reports is further divided into predefined-sized text chunks. Each text chunk is used to generate an intermediate response. The response contains a list of information pieces called points. Each point has a numerical score indicating its importance. These generated intermediate responses are the Rated Intermediate Responses (Rated Intermediate Response 1, Response 2… Response N).
- Ranking and Filtering: The system ranks and filters these intermediate responses, selecting the most important points. The selected important points form the Aggregated Intermediate Responses.
- Final Response: The aggregated intermediate responses are used as context to generate the final reply.
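
The steps above follow a map-reduce pattern, which can be sketched in plain Python. This is a simplified illustration, not GraphRAG's implementation: `rate_batch` is a hypothetical stand-in for the LLM call that scores points by keyword overlap, each batch produces one intermediate response (the chunking inside a batch is skipped), and the sample reports are made-up one-line summaries.

```python
import random


def rate_batch(query, batch):
    """Stand-in for the LLM "map" call: return (point, score) pairs for one batch.

    Assumption: the score is the share of query words found in the report,
    scaled to 0-100. A real system would ask the LLM to rate each point.
    """
    words = set(query.lower().split())
    points = []
    for report in batch:
        hits = sum(1 for w in words if w in report.lower())
        points.append((report, 100 * hits // max(len(words), 1)))
    return points


def global_search_sketch(query, reports, batch_size=2, top_n=2, seed=0):
    # 1. Shuffle the community reports and split them into batches.
    shuffled = reports[:]
    random.Random(seed).shuffle(shuffled)
    batches = [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]

    # 2. Map: produce one rated intermediate response per batch.
    rated = [rate_batch(query, batch) for batch in batches]

    # 3. Rank and filter: drop zero-score points, keep the top_n overall.
    all_points = [p for response in rated for p in response if p[1] > 0]
    all_points.sort(key=lambda p: p[1], reverse=True)
    selected = all_points[:top_n]

    # 4. Reduce: the selected points become the context for the final answer.
    return [text for text, _ in selected]


reports = [
    "West Egg is home to new wealth",
    "Gatsby hosts lavish parties in West Egg",
    "Chicago plays a role in business dealings",
    "Gatsby and Daisy share a romantic past",
]
context = global_search_sketch("Gatsby parties", reports)
print(context)
```

In the real workflow the final step passes the aggregated points to the LLM as context; here the function simply returns them so the ranking and filtering are easy to inspect.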

# Local Search

![image.png](local.png)

- User Query: First, the system receives a user query, which could be a simple question or a more complex query.
- Similar Entity Search: The system identifies a set of entities from the knowledge graph that are semantically related to the user input. These entities serve as entry points into the knowledge graph. This step uses a vector database like Milvus to conduct text similarity searches.
- Entity-Text Unit Mapping: The extracted text units are mapped to the corresponding entities, removing the original text information.
- Entity-Relationship Extraction: This step extracts specific information about the entities and their corresponding relationships.
- Entity-Covariate Mapping: This step maps entities to their covariates, which may include statistical data or other relevant attributes.
- Entity-Community Report Mapping: Community reports are integrated into the search results, incorporating some global information.
- Utilization of Conversation History: If provided, the system uses conversation history to better understand the user’s intent and context.
- Response Generation: Finally, the system constructs and responds to the user query based on the filtered and sorted data generated in the previous steps.
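
The first steps of this flow (similar entity search, then fanning out to neighbors and their text units) can be sketched as follows. Everything in this block is an assumption made for illustration: the 3-dimensional vectors are made up rather than produced by an embedding model, the dictionaries stand in for the knowledge graph and the vector database, and `local_context_sketch` is a hypothetical helper, not a GraphRAG function.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy entity embeddings (made-up vectors, not real model output).
entity_embeddings = {
    "JAY GATSBY": [0.9, 0.1, 0.0],
    "DAISY BUCHANAN": [0.7, 0.6, 0.1],
    "CHICAGO": [0.0, 0.2, 0.9],
}
# Toy graph edges and text units keyed by entity (illustrative data).
relationships = {
    "JAY GATSBY": ["DAISY BUCHANAN"],
    "DAISY BUCHANAN": ["JAY GATSBY"],
    "CHICAGO": [],
}
text_units = {
    "JAY GATSBY": ["Gatsby hosts parties in West Egg."],
    "DAISY BUCHANAN": ["Daisy lives across the bay."],
    "CHICAGO": ["Chicago is mentioned in phone calls."],
}


def local_context_sketch(query_embedding, top_k=1):
    # 1. Similar entity search: rank entities by similarity to the query vector.
    ranked = sorted(
        entity_embeddings,
        key=lambda name: cosine_similarity(query_embedding, entity_embeddings[name]),
        reverse=True,
    )
    entry_points = ranked[:top_k]

    # 2. Fan out to neighbors through the relationships.
    related = []
    for entity in entry_points:
        for neighbor in relationships[entity]:
            if neighbor not in entry_points and neighbor not in related:
                related.append(neighbor)

    # 3. Collect the text units attached to the entry points and their neighbors.
    context = [unit for e in entry_points + related for unit in text_units[e]]
    return entry_points, related, context


entry_points, related, context = local_context_sketch([1.0, 0.0, 0.0])
print(entry_points, related)
print(context)
```

A vector database such as Milvus replaces the brute-force `sorted` call in step 1 with an indexed similarity search, which matters once the graph holds more than a handful of entities.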

# Create a local search engine

In [15]:
from dotenv import load_dotenv
load_dotenv("graphrag_index/.env")

True

In [16]:
api_key = os.environ["OPENAI_API_KEY"]  # Your OpenAI API key
llm_model = "gpt-4o"  # Or gpt-4-turbo-preview
embedding_model = "text-embedding-3-small"

In [17]:
llm = ChatOpenAI(
    api_key=api_key,
    model=llm_model,
    api_type=OpenaiApiType.OpenAI,
    max_retries=20,
)
token_encoder = tiktoken.get_encoding("cl100k_base")
text_embedder = OpenAIEmbedding(
    api_key=api_key,
    api_base=None,
    api_type=OpenaiApiType.OpenAI,
    model=embedding_model,
    deployment_name=embedding_model,
    max_retries=20,
)

In [18]:
context_builder = LocalSearchMixedContext(
    community_reports=reports,
    text_units=text_units,
    entities=entities,
    relationships=relationships,
    covariates=None,  # TODO: pass covariates once they are loaded
    entity_text_embeddings=description_embedding_store,
    embedding_vectorstore_key=EntityVectorStoreKey.ID,  # if the vectorstore uses entity title as ids, set this to EntityVectorStoreKey.TITLE
    text_embedder=text_embedder,
    token_encoder=token_encoder,
)

In [19]:
local_context_params = {
    "text_unit_prop": 0.5,
    "community_prop": 0.1,
    "conversation_history_max_turns": 5,
    "conversation_history_user_turns_only": True,
    "top_k_mapped_entities": 10,
    "top_k_relationships": 10,
    "include_entity_rank": True,
    "include_relationship_weight": True,
    "include_community_rank": False,
    "return_candidate_context": False,
    "embedding_vectorstore_key": EntityVectorStoreKey.ID,  # set this to EntityVectorStoreKey.TITLE if the vectorstore uses entity title as ids
    "max_tokens": 12_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000)
}
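
The proportions in these parameters divide the context window between data sources. The sketch below shows the arithmetic under one assumption: that `text_unit_prop` and `community_prop` are fractions of `max_tokens` and that the remainder goes to the entity and relationship tables. `split_token_budget` is a hypothetical helper written for this example; check the context builder's source for the authoritative behavior.

```python
def split_token_budget(max_tokens, text_unit_prop, community_prop):
    """Split a context window into three shares (hypothetical helper)."""
    if text_unit_prop + community_prop > 1:
        raise ValueError("text_unit_prop + community_prop must not exceed 1")
    text_unit_tokens = int(max_tokens * text_unit_prop)
    community_tokens = int(max_tokens * community_prop)
    # Assumption: whatever remains is available for entities and relationships.
    local_tokens = max_tokens - text_unit_tokens - community_tokens
    return {
        "text_units": text_unit_tokens,
        "community_reports": community_tokens,
        "entities_and_relationships": local_tokens,
    }


budget = split_token_budget(12_000, text_unit_prop=0.5, community_prop=0.1)
print(budget)
```

Under that assumption, the settings in this cell give 6,000 tokens to text units, 1,200 to community reports, and 4,800 to entities and relationships.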

In [20]:
llm_params = {
    "max_tokens": 2_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 1000-1500)
    "temperature": 0.0,
}

In [21]:
search_engine = LocalSearch(
    llm=llm,
    context_builder=context_builder,
    token_encoder=token_encoder,
    llm_params=llm_params,
    context_builder_params=local_context_params,
    response_type="multiple paragraphs",  # free form text describing the response type and format, can be anything, e.g. prioritized list, single paragraph, multiple paragraphs, multiple-page report
)

In [22]:
result = await search_engine.asearch("Tell me about Great Gatsby")
print(result.response)

# Overview of "The Great Gatsby"

"The Great Gatsby" is a novel authored by F. Scott Fitzgerald, set during the Jazz Age, also known as the Roaring Twenties. This era is characterized by themes of decadence, idealism, resistance to change, excess, and the pursuit of the American Dream. Fitzgerald's work captures the essence of this vibrant period, reflecting on the complexities and nuances of society and individual aspirations [Data: Entities (1, 2)].

## Central Characters and Themes

The narrative revolves around Jay Gatsby, a wealthy and enigmatic figure known for his lavish parties at his mansion in West Egg. Born James Gatz, Gatsby reinvents himself, driven by a Platonic conception of himself and a deep ambition to fulfill a vision of vast, vulgar, and meretricious beauty. His mysterious past includes claims of being the son of wealthy parents, being educated at Oxford, and being a decorated war hero. However, he is also portrayed as a penniless young man deeply in love with Daisy

In [28]:
result = await search_engine.asearch("what is the relation of Gatsby with Daisy Buchanan")
print(result.response)

# The Relationship Between Gatsby and Daisy Buchanan

The relationship between Jay Gatsby and Daisy Buchanan is central to the narrative of "The Great Gatsby," intricately weaving themes of love, wealth, and the American Dream. Their romantic history is pivotal, with Daisy being the primary object of Gatsby's affection. Before her marriage to Tom Buchanan, Daisy had a romantic relationship with Gatsby, a connection that deeply influences the storyline [Data: Relationships (63, 162)].

## Romantic History and Emotional Connection

Gatsby's deep love for Daisy is a central theme of the narrative, fueling the central conflict. Gatsby's opulent lifestyle and the extravagant parties he throws in West Egg are primarily efforts to recapture Daisy's affection. This deep emotional bond and their shared past are at the heart of the narrative, underscoring the complexity of their relationship and its impact on the unfolding events [Data: Entities (23, 43); Relationships (63, 162)].

## Themes of 

# Question Generation
GraphRAG can also generate questions based on historical queries, which is useful for creating recommended questions in a chatbot dialogue. This method combines structured data from the knowledge graph with unstructured data from input documents to produce candidate questions related to specific entities.

In [29]:
question_generator = LocalQuestionGen(
    llm=llm,
    context_builder=context_builder,
    token_encoder=token_encoder,
    llm_params=llm_params,
    context_builder_params=local_context_params,
)

In [30]:
question_history = [
    "Tell me about Great Gatsby",
    "what is the relation of Gatsby with Daisy Buchanan",
    "Gatsby's past events",
]

In [31]:
candidate_questions = await question_generator.agenerate(
    question_history=question_history, context_data=None, question_count=5
)
candidate_questions.response

["- What are the key events in Gatsby's past that shaped his character and ambitions?",
 "- How did Gatsby's past relationship with Daisy influence his actions and decisions?",
 "- What role did Gatsby's mysterious background play in his social interactions and reputation?",
 "- How did Gatsby's experiences during the war impact his life and relationships?",
 "- What were the significant turning points in Gatsby's life that led to his current status and lifestyle?"]

In [64]:
import shutil

# Clean up: delete the index workspace (input, settings, and output) created above.
shutil.rmtree(index_root)