<a href="https://www.kaggle.com/code/william2020/graphrag?scriptVersionId=186753127" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# GraphRAG

by Microsoft

GraphRAG is a structured, hierarchical approach to Retrieval Augmented Generation (RAG), as opposed to naive semantic-search approaches that use plain text snippets. The GraphRAG process involves extracting a knowledge graph from raw text, building a community hierarchy, generating summaries for these communities, and then leveraging these structures when performing RAG-based tasks.
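
To make these stages concrete, here is a toy sketch on hand-written data. It is not how GraphRAG is implemented: the library relies on an LLM for extraction and summarization and on a graph clustering algorithm for the community hierarchy. Here connected components stand in for communities, string concatenation stands in for summarization, and every name (`TRIPLES`, `build_graph`, `find_communities`, `summarize`) is hypothetical.

```python
from collections import defaultdict

# Hand-written (subject, relation, object) triples standing in for LLM extraction.
TRIPLES = [
    ("Ada", "mentors", "Ben"),
    ("Ben", "works_at", "Acme"),
    ("Cara", "lives_in", "Paris"),
]


def build_graph(triples):
    """Return an undirected adjacency map from each entity to its neighbors."""
    graph = defaultdict(set)
    for subj, _rel, obj in triples:
        graph[subj].add(obj)
        graph[obj].add(subj)
    return graph


def find_communities(graph):
    """Group entities into connected components (a stand-in for real clustering)."""
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        communities.append(sorted(component))
    return communities


def summarize(community, triples):
    """Join the triples inside a community (an LLM would write prose here)."""
    members = set(community)
    facts = [f"{s} {r} {o}" for s, r, o in triples if s in members and o in members]
    return "; ".join(facts)


graph = build_graph(TRIPLES)
communities = find_communities(graph)
summaries = [summarize(c, TRIPLES) for c in communities]
for community, summary in zip(communities, summaries):
    print(community, "->", summary)
```

At query time, it is these per-community summaries and the graph neighborhoods, rather than raw text chunks, that supply context to the LLM.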

More information can be found here: https://microsoft.github.io/graphrag/

### Let's install GraphRAG!

In [None]:
!pip install graphrag

### You will need to create a '.env' file containing your API key. This can be an OpenAI API key.

In [30]:
env_file_path = '/kaggle/working/.env'

# The content to write to the .env file
env_content = 'GRAPHRAG_API_KEY="api-key"\n'

# Write the content to the .env file
with open(env_file_path, 'w') as file:
    file.write(env_content)

print(f".env file created at {env_file_path} with GRAPHRAG_API_KEY")

.env file created at /kaggle/working/.env with GRAPHRAG_API_KEY


### Let's make sure it's there before we continue

In [32]:
# Read and print the contents of the .env file
with open(env_file_path, 'r') as file:
    content = file.read()

print("Contents of .env file:")
print(content)

Contents of .env file:
GRAPHRAG_API_KEY="api-key"
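
Printing the file echoes the key into the notebook output, which is harmless for the placeholder but not for a real key. A minimal sketch of a masked check follows; `parse_env` and `mask` are hypothetical helpers, and the parser only handles simple `KEY=VALUE` lines.

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def mask(secret, visible=3):
    """Show only the first few characters of a secret."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)


# Sample text mirroring the placeholder written to the .env file.
sample = 'GRAPHRAG_API_KEY="api-key"\n'
env = parse_env(sample)
print("GRAPHRAG_API_KEY =", mask(env["GRAPHRAG_API_KEY"]))
```

In the notebook, pass the `content` variable read from the file instead of `sample`.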



### Create an 'input' directory to put our data into. This directory is necessary for how GraphRAG works.

In [4]:
!mkdir -p input

### Let's download the data we will be using. In this case, we'll use a book by Charles Dickens.

In [18]:
# Use curl to download the file into the created directory
!curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt > /kaggle/working/input/book.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  184k  100  184k    0     0   231k      0 --:--:-- --:--:-- --:--:--  231k


### We will initialize our project with the --init flag

In [11]:
!python -m graphrag.index --init --root .

2024-07-04 01:31:48.310701: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-04 01:31:48.310898: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-04 01:31:48.460754: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Initializing project at .
⠋ GraphRAG Indexer 
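
The `Unable to register ... factory` lines are printed before GraphRAG reports anything and recur with every command in this notebook; they appear unrelated to the indexing itself. If you capture command output as text, a small filter can hide them and strip terminal control sequences. This is a sketch only: `clean_log` is a hypothetical helper, and the patterns are assumptions based on the output shown in this notebook.

```python
import re

# ANSI CSI sequences such as ESC[2K (erase line) or ESC[32m (green text).
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")
# The cuDNN / cuFFT / cuBLAS registration messages seen in this notebook.
NOISE = re.compile(r"Unable to register cu\w+ factory")


def clean_log(text):
    """Strip ANSI escape sequences and drop the factory-registration lines."""
    kept = []
    for line in text.splitlines():
        line = ANSI_ESCAPE.sub("", line)
        if NOISE.search(line):
            continue
        kept.append(line)
    return "\n".join(kept)


raw = (
    "2024-07-04 01:31:48.310701: E external/local_xla/xla/stream_executor/cuda/"
    "cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register "
    "factory for plugin cuDNN when one has already been registered\n"
    "\x1b[2KInitializing project at ."
)
print(clean_log(raw))
```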

### This will generate the necessary files. You can view these in the Output section on the right-hand side.

In [19]:
import os

# List all files and folders in the working directory
output_dir = '/kaggle/working'
files = os.listdir(output_dir)
print(files)

['prompts', '.env', 'input', '.virtual_documents', 'settings.yaml', 'output', 'cache']
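
Rather than reading the listing by eye, you can check for the pieces the later steps rely on. The sketch below is an assumption-laden convenience: the names in `EXPECTED` are taken from the listing above, treating them as required is my assumption, and `missing_project_files` is a hypothetical helper.

```python
import os
import tempfile

# Names taken from the directory listing; treating them as required is an assumption.
EXPECTED = ["settings.yaml", "prompts", ".env", "input"]


def missing_project_files(root, expected=EXPECTED):
    """Return the expected names that are absent from `root`."""
    present = set(os.listdir(root))
    return [name for name in expected if name not in present]


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as demo_root:
    open(os.path.join(demo_root, "settings.yaml"), "w").close()
    os.mkdir(os.path.join(demo_root, "input"))
    demo_missing = missing_project_files(demo_root)
    print("missing:", demo_missing)
```

In this notebook, `missing_project_files('/kaggle/working')` should return an empty list, given the listing above.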


## Let's have GraphRAG start indexing our data

This will take a few minutes.

In [26]:
!python -m graphrag.index --root .

2024-07-04 02:05:35.059068: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-04 02:05:35.059140: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-04 02:05:35.060901: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
🚀 Reading settings from settings.yaml
⠧ GraphRAG Indexer
├── Loading Input (text) - 1 files loaded (0 filtered) ━━━━━━ 100% 0:00:… 0:00:…
⠧ GraphRAG Indexer
├── Loading Inpu

### Query
At query time, these structures are used to provide materials for the LLM context window when answering a question. The primary query modes are:

- Global Search, for reasoning about holistic questions about the corpus by leveraging the community summaries.
- Local Search, for reasoning about specific entities by fanning out to their neighbors and associated concepts.
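
The difference between the two modes can be illustrated with a toy sketch. It only shows which material each mode gathers; the real query engine also ranks that material and passes it to an LLM, none of which is modeled here. The data and the functions `global_context` and `local_context` are hypothetical.

```python
# Toy data; in GraphRAG these structures come from the indexing step.
COMMUNITY_SUMMARIES = {
    "c1": "Ada mentors Ben at Acme.",
    "c2": "Cara lives in Paris.",
}
NEIGHBORS = {
    "Ada": ["Ben"],
    "Ben": ["Ada", "Acme"],
    "Acme": ["Ben"],
    "Cara": ["Paris"],
    "Paris": ["Cara"],
}


def global_context(summaries):
    """Global search: every community summary contributes to the context."""
    return [summaries[key] for key in sorted(summaries)]


def local_context(entity, neighbors, hops=1):
    """Local search: start at one entity and fan out to its neighbors."""
    frontier, seen = {entity}, {entity}
    for _ in range(hops):
        frontier = {n for node in frontier for n in neighbors.get(node, [])} - seen
        seen |= frontier
    return sorted(seen)


print(global_context(COMMUNITY_SUMMARIES))
print(local_context("Ada", NEIGHBORS))
print(local_context("Ada", NEIGHBORS, hops=2))
```

A corpus-wide question such as "what are the main themes?" suits the first function, since it touches every community; a question about one character suits the second, since only that entity's neighborhood matters.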

## Now, let's query our data using the GLOBAL method:

In [28]:
!python -m graphrag.query --root . --method global "who is the author?"

2024-07-04 02:14:06.127868: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-04 02:14:06.127936: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-04 02:14:06.129553: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


INFO: Reading settings from settings.yaml
creating llm client with {'api_key': 'REDACTED,len=51', 'type': "openai_chat", 'model': 'gpt-4-turbo-preview', 'max_tokens': 4000, 'request_timeout': 180.0, 'api_base': None, 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': Tru

## And query using the LOCAL method:

In [29]:
!python -m graphrag.query --root . --method local "who is the author?"

2024-07-04 02:16:12.663528: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-04 02:16:12.663600: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-04 02:16:12.665423: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


INFO: Reading settings from settings.yaml
[0m[38;5;8m[[0m2024-07-04T02:17:32Z [0m[33mWARN [0m lance::dataset[0m[38;5;8m][0m No existing dataset at /kaggle/working/lancedb/description_embedding.lance, it will be created
creating llm client with {'api_key': 'REDACTED,len=51', 'type': "openai_chat", 'model': 'gpt-4-turbo-preview', 'max_tokens': 4000, 'req