# <center>Harry Potter's  GraphRAG: Enhancing Retrieval-Augmented Generation with Hebrew Knowledge Graphs</center>


![harry potter](https://curiositadalmondo.it/wp-content/uploads/2020/08/Categoria-Harry-potter.jpg)

## Overview
Microsoft GraphRAG is an advanced Retrieval-Augmented Generation (RAG) system that integrates knowledge graphs to improve the performance of large language models (LLMs). Developed by Microsoft Research, GraphRAG addresses limitations in traditional RAG approaches by using LLM-generated knowledge graphs to enhance document analysis and improve response quality.

## Motivation
Traditional RAG systems often struggle with complex queries that require synthesizing information from disparate sources. GraphRAG aims to: Connect related information across datasets. Enhance understanding of semantic concepts. Improve performance on global sensemaking tasks.

## Key Components
* Knowledge Graph Generation: Constructs graphs with entities as nodes and relationships as edges.
* Community Detection: Identifies clusters of related entities within the graph.
* Summarization: Generates summaries for each community to provide context for LLMs.
* Query Processing: Uses these summaries to enhance the LLM's ability to answer complex questions.

![graphrag](https://pbs.twimg.com/media/GNSX2jBWQAA_U8b?format=jpg&name=4096x4096)

To run this notebook we will  use both OpenAI API key and Groq API key. Create a .env file and fill in the credentials for your OpenAI and Groq. The following code loads these environment variables and sets up our AI client.

In [1]:
import os
from dotenv import load_dotenv
from openai import OpenAI
from groq import Groq
import yaml
import subprocess
import re
from IPython.display import Markdown

In [2]:
# Define the path to the .env file in the root directory
env_path = os.path.join(os.getcwd(), '..', '.env')

# Load the .env file
load_dotenv(env_path)

# Access the environment variables
GROQ_API_KEY = os.getenv('GROQ_API_KEY')
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')


GraphRag has a convenient set of CLI commands we can use. We'll start by configuring the system, then run the indexing operation. Indexing with GraphRag is a much lengthier process, and one that costs significantly more, since rather than just calculating embeddings, GraphRag makes many LLM calls to analyse the text, extract entities, and construct the graph. That's a one-time expense, though.

## Stage 1 - Experiment with GraphRag Defaults with OpenAI

In [3]:
# crete a data folder inside the notebook folder
if not os.path.exists('data'): 
    os.makedirs('data')

In [4]:
# create a graphrag index folder inside the data folder
if not os.path.exists('data/graphrag'):
    !python -m graphrag.index --init --root data/graphrag

In [5]:
# edit the settings.yaml inside graphrag
with open('data/graphrag/settings.yaml', 'r') as f:
    settings_yaml = yaml.load(f, Loader=yaml.FullLoader)
settings_yaml = {}

# Encoding model
settings_yaml['encoding_model'] = 'cl100k_base'

# Skip workflows
settings_yaml['skip_workflows'] = []

# LLM Settings
settings_yaml['llm'] = {
    'api_key': OPENAI_API_KEY,  # This will be replaced with your actual key
    'type': 'openai_chat',
    'model': 'gpt-4o-mini',
    'model_supports_json': True,
    'request_timeout': 600000.0  # Timeout value
}

# Parallelization settings
settings_yaml['parallelization'] = {
    'stagger': 0.2,
    'async_mode': 'threaded'
}

# Embeddings settings
settings_yaml['embeddings'] = {
    'async_mode': 'threaded',
    'llm': {
        'api_key': OPENAI_API_KEY,  # This will be replaced with your actual key
        'type': 'openai_embedding',
        'model': 'text-embedding-3-small'
    }
}

# Chunk settings
settings_yaml['chunks'] = {
    'size': 1000,
    'overlap': 50,
    'group_by_columns': ['id']
}

# Input settings
settings_yaml['input'] = {
    'type': 'file',
    'file_type': 'text',
    'base_dir': 'input',
    'file_encoding': 'utf-8',
    'file_pattern': '.*\\.txt$'
}

# Cache settings
settings_yaml['cache'] = {
    'type': 'file',
    'base_dir': 'cache_new'  # Clear cache by using a new folder
}

# Storage settings
settings_yaml['storage'] = {
    'type': 'file',
    'base_dir': 'output/${timestamp}/artifacts'
}

# Reporting settings
settings_yaml['reporting'] = {
    'type': 'file',
    'base_dir': 'output/${timestamp}/reports'
}

# # Entity extraction settings
# settings_yaml['entity_extraction'] = {
#     'prompt': 'prompts/entity_extraction_hebrew.txt',
#     'entity_types': ['דמויות', 'חפצים קסומים', 'מקומות', 'אירועים', 'מוסדות'],
#     'max_gleanings': 2
# }



In [6]:
with open('data/graphrag/settings.yaml', 'w') as f:
    yaml.dump(settings_yaml, f)

In [7]:
# create an input folder and move files into it
if not os.path.exists('data/graphrag/input'):
    os.makedirs('data/graphrag/input')

In [8]:
# move txt files to the input folder 
!cp ../data/processed/harry_potter1.txt data/graphrag/input/harry_potter1.txt

In [9]:
# initiatae graphrag 
!python -m graphrag.index --root ./data/graphrag

[2KLogging enabled at r 
data/graphrag/output/[1;36m20240910[0m-[1;36m115956[0m/reports/indexing-engine.log
[2K⠙ GraphRAG Indexer 
[2K[1A[2K⠙ GraphRAG Indexer e.text) - 7 files loaded (0 filtered) [90m━[0m [35m100%[0m [36m…[0m [33m0…[0m
├── Loading Input (InputFileType.text) - 7 files loaded (0 filtered) [90m━[0m [35m100%[0m [36m…[0m [33m0…[0m
[2K[1A[2K[1A[2K⠙ GraphRAG Indexer 
├── Loading Input (InputFileType.text) - 7 files loaded (0 filtered) [90m━[0m [35m100%[0m [36m…[0m [33m0…[0m
[2K[1A[2K[1A[2K⠙ GraphRAG Indexer 
├── Loading Input (InputFileType.text) - 7 files loaded (0 filtered) [90m━[0m [35m100%[0m [36m…[0m [33m0…[0m
[2K[1A[2K[1A[2K⠙ GraphRAG Indexer 
├── Loading Input (InputFileType.text) - 7 files loaded (0 filtered) [90m━[0m [35m100%[0m [36m…[0m [33m0…[0m
[2K[1A[2K[1A[2K⠹ GraphRAG Indexer 
├── Loading Input (InputFileType.text) - 7 files loaded (0 filtered) [90m━[0m [35m100%[0m [36m…[0m [33m0…[0m
└

*You should get an output: 🚀 All workflows completed successfully.*

let's query to see the results

In [10]:
DEFAULT_RESPONSE_TYPE = 'Summarize and explain in 1-2 paragraphs with bullet points using at most 300 tokens'
DEFAULT_MAX_CONTEXT_TOKENS = 10000

def remove_data(text):
    return re.sub(r'\[Data:.*?\]', '', text).strip()


def ask_graph(query,method):
    env = os.environ.copy() | {
      'GRAPHRAG_GLOBAL_SEARCH_MAX_TOKENS': str(DEFAULT_MAX_CONTEXT_TOKENS),
    }
    command = [
      'python', '-m', 'graphrag.query',
      '--root', './data/graphrag',
      '--method', method,
      '--response_type', DEFAULT_RESPONSE_TYPE,
      query,
    ]
    output = subprocess.check_output(command, universal_newlines=True, env=env, stderr=subprocess.DEVNULL)
    return remove_data(output.split('Search Response: ')[0])

GrpahRag offers 2 types of search:

Global Search for reasoning about holistic questions about the corpus by leveraging the community summaries.
Local Search for reasoning about specific entities by fanning-out to their neighbors and associated concepts.
Let's check the local search:

In [11]:
local_query="מתי האגריד והאיר נפגשו לראשונה?"
local_result = ask_graph(local_query,'local')

Markdown(local_result)

INFO: Reading settings from data/graphrag/settings.yaml

INFO: Vector Store Args: {}
creating llm client with {'api_key': 'REDACTED,len=132', 'type': "openai_chat", 'model': 'gpt-4o-mini', 'max_tokens': 4000, 'temperature': 0.0, 'top_p': 1.0, 'n': 1, 'request_timeout': 600000.0, 'api_base': None, 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': True, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}
creating embedding llm client with {'api_key': 'REDACTED,len=132', 'type': "openai_embedding", 'model': 'text-embedding-3-small', 'max_tokens': 4000, 'temperature': 0, 'top_p': 1, 'n': 1, 'request_timeout': 180.0, 'api_base': None, 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': None, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}

SUCCESS: Local Search Response:
לפי המידע הזמין, האגריד והארי פוטר נפגשו לראשונה בספר הראשון של הסדרה, "הארי פוטר ואבן החכמים". האגריד, ששימש כמדריך להארי, פגש אותו ביום הולדתו ה-11, כאשר הוא הגיע להודיע לו על כך שהוא קוסם וללוות אותו להוגוורטס. 

### נקודות עיקריות:
- **מועד הפגישה**: יום הולדתו ה-11 של הארי.
- **תפקיד האגריד**: מדריך והדמות הראשונה שמביאה את הארי לעולם הקסמים.
- **הקשר בין השניים**: האגריד מספק להארי מידע חשוב על משפחתו ועל עולמו החדש . 

אם יש לך שאלות נוספות על הדמויות או על הסדרה, אני כאן לעזור!

In [12]:
global_query="לאיזה בית שייך מאלפוי?"
global_result = ask_graph(global_query,'global')

Markdown(global_result)

INFO: Reading settings from data/graphrag/settings.yaml
creating llm client with {'api_key': 'REDACTED,len=132', 'type': "openai_chat", 'model': 'gpt-4o-mini', 'max_tokens': 4000, 'temperature': 0.0, 'top_p': 1.0, 'n': 1, 'request_timeout': 600000.0, 'api_base': None, 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': True, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}

SUCCESS: Global Search Response:
מאלפוי שייך לבית סלית'רין, אחד מבתי הוגוורטס, הידוע בהעדפתו לתלמידים עם שורשים טהורים ובנטייה לרוע. בית סלית'רין מתאפיין בשאיפה לכוח, חוכמה ותחכום, ולעיתים קרובות מקושר לדמויות בעלות נטיות אפלות או קשרים עם כוחות האופל. 

### נקודות עיקריות:
- מאלפוי הוא חלק מהנרטיב המרכזי של סדרת הארי פוטר, עם קשרים לדמויות כמו לוציוס ודראקו מאלפוי .
- בית סלית'רין ידוע בהקשרו עם קסמים אפלים ובערכים כמו שאפתנות ותככנות, מה שמשפיע על יחסיו של דרקו עם דמויות אחרות, במיוחד עם הארי פוטר .
- ההיסטוריה של סלית'רין כוללת דמויות רבות עם קשרים לרוע, מה שמדגיש את המתחים והקונפליקטים הנובעים מהתנהגותם ומורשתם .

## Stage 2 - Experiment with GraphRag - Ollama for Moedeling & Nomic as Embedding Model

for this part of our journey, we will use [LLM Studio](https://lmstudio.ai/) to download, and run local LLMs. 
* Embedding Model: [nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q4_K_M.gguf](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF)
* LLM: [Ollama with Gemma2](https://ollama.com/library/gemma2:9b)

In [None]:
# uncomment if not downloaded already 
# !ollama pull gemma2

*make sure to turn on serving on LLM Studio!*

In [40]:
# delete the graphrag folder and re-starting
!rm -rf data/graphrag

In [41]:
# create a graphrag index folder inside the data folder
if not os.path.exists('data/graphrag'):
    !python -m graphrag.index --init --root data/graphrag

[2KInitializing project at data/graphrag
⠋ GraphRAG Indexer 

In [42]:
# Load the existing YAML file
with open('data/graphrag/settings.yaml', 'r') as f:
    settings_yaml = yaml.load(f, Loader=yaml.FullLoader)

# Update the settings
settings_yaml = {}

# Encoding model
settings_yaml['encoding_model'] = 'cl100k_base'

# Skip workflows
settings_yaml['skip_workflows'] = []

# LLM Settings
settings_yaml['llm'] = {
    'api_key': '${GRAPHRAG_API_KEY}',
    'api_base': "http://localhost:11434/v1",
    'type': 'openai_chat',
    'model': 'gemma2',
    'model_supports_json': True,
    'max_tokens': 5000,
    'request_timeout': 180.0
}

# Parallelization settings
settings_yaml['parallelization'] = {
    'stagger': 0.3
}

settings_yaml['async_mode'] = 'threaded'

# Embeddings settings
settings_yaml['embeddings'] = {
    'async_mode': 'threaded',
    'llm': {
        'api_key': '${GRAPHRAG_API_KEY}',
        'type': 'openai_embedding',
        'model': 'nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q5_K_M.gguf',
        'api_base': 'http://localhost:1234/v1'
    }
}

# Chunk settings
settings_yaml['chunks'] = {
    'size': 300,
    'overlap': 100,
    'group_by_columns': ['id']
}

# Input settings
settings_yaml['input'] = {
    'type': 'file',
    'file_type': 'text',
    'base_dir': 'input',
    'file_encoding': 'utf-8',
    'file_pattern': '.*\\.txt$'
}

# Cache settings
settings_yaml['cache'] = {
    'type': 'file',
    'base_dir': 'cache'
}

# Entity extraction settings
settings_yaml['entity_extraction'] = {
    'prompt': "prompts/entity_extraction.txt",
    'entity_types': ['person', 'geo', 'event'],
    'max_gleanings': 0
}

# Summarize descriptions settings
settings_yaml['summarize_descriptions'] = {
    'prompt': "prompts/summarize_descriptions.txt",
    'max_length': 500
}

# Claim extraction settings
settings_yaml['claim_extraction'] = {
    'prompt': "prompts/claim_extraction.txt",
    'description': "Any claims or facts that could be relevant to information discovery.",
    'max_gleanings': 0
}

# Community report settings
settings_yaml['community_report'] = {
    'prompt': "prompts/community_report.txt",
    'max_length': 2000,
    'max_input_length': 8000
}

# Cluster graph settings
settings_yaml['cluster_graph'] = {
    'max_cluster_size': 10
}

# Embed graph settings
settings_yaml['embed_graph'] = {
    'enabled': False
}

# Save the updated YAML settings back to the file
with open('data/graphrag/settings.yaml', 'w') as f:
    yaml.dump(settings_yaml, f)

In [43]:
with open('data/graphrag/settings.yaml', 'w') as f:
    yaml.dump(settings_yaml, f)

In [44]:
# create an input folder and move files into it
if not os.path.exists('data/graphrag/input'):
    os.makedirs('data/graphrag/input')

In [45]:
# move txt files to the input folder 
!cp ../data/processed/harry_potter1.txt data/graphrag/input/harry_potter1.txt

In [46]:
# initiatae graphrag 
!python -m graphrag.index --root ./data/graphrag

[2KLogging enabled at r 
data/graphrag/output/[1;36m20240910[0m-[1;36m154602[0m/reports/indexing-engine.log
[2K⠙ GraphRAG Indexer 
[2K[1A[2K⠙ GraphRAG Indexer e.text) - 1 files loaded (0 filtered) [90m━[0m [35m100%[0m [36m…[0m [33m0…[0m
├── Loading Input (InputFileType.text) - 1 files loaded (0 filtered) [90m━[0m [35m100%[0m [36m…[0m [33m0…[0m
[2K[1A[2K[1A[2K⠙ GraphRAG Indexer 
├── Loading Input (InputFileType.text) - 1 files loaded (0 filtered) [90m━[0m [35m100%[0m [36m…[0m [33m0…[0m
[2K[1A[2K[1A[2K⠙ GraphRAG Indexer 
├── Loading Input (InputFileType.text) - 1 files loaded (0 filtered) [90m━[0m [35m100%[0m [36m…[0m [33m0…[0m
[2K[1A[2K[1A[2K⠙ GraphRAG Indexer 
├── Loading Input (InputFileType.text) - 1 files loaded (0 filtered) [90m━[0m [35m100%[0m [36m…[0m [33m0…[0m
[2K[1A[2K[1A[2K⠹ GraphRAG Indexer 
├── Loading Input (InputFileType.text) - 1 files loaded (0 filtered) [90m━[0m [35m100%[0m [36m…[0m [33m0…[0m


*This might take a while You should get an output: 🚀 All workflows completed successfully.*

now, let's test the graph we produced with the same questions from earlier

In [47]:
local_query="מתי האגריד והאיר נפגשו לראשונה?"
local_result = ask_graph(local_query,'local')

Markdown(local_result)

INFO: Reading settings from data/graphrag/settings.yaml

INFO: Vector Store Args: {}
creating llm client with {'api_key': 'REDACTED,len=9', 'type': "openai_chat", 'model': 'gemma2', 'max_tokens': 5000, 'temperature': 0.0, 'top_p': 1.0, 'n': 1, 'request_timeout': 180.0, 'api_base': 'http://localhost:11434/v1', 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': True, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}
creating embedding llm client with {'api_key': 'REDACTED,len=9', 'type': "openai_embedding", 'model': 'nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q5_K_M.gguf', 'max_tokens': 4000, 'temperature': 0, 'top_p': 1, 'n': 1, 'request_timeout': 180.0, 'api_base': 'http://localhost:1234/v1', 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': None, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}

SUCCESS: Local Search Response:
האגריד והאיר נפגשו לראשונה בספר הראשון בסדרת הספרים "ההוביט", **"ההוביט: או מסע ת'ר".** 

הם פוגשים זה את זה כאשר האיר, שמתגורר ביער, מגלה את ההוביט גנבולט בתוך ביתו.

In [49]:
global_query="היכן התגורר הארי לפני שהתחיל ללמוד בהוגוורטס?"
global_result = ask_graph(global_query,'global')

Markdown(global_result)

INFO: Reading settings from data/graphrag/settings.yaml
creating llm client with {'api_key': 'REDACTED,len=9', 'type': "openai_chat", 'model': 'gemma2', 'max_tokens': 5000, 'temperature': 0.0, 'top_p': 1.0, 'n': 1, 'request_timeout': 180.0, 'api_base': 'http://localhost:11434/v1', 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': True, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}

SUCCESS: Global Search Response:
I am sorry but I am unable to answer this question given the provided data.

In [50]:
local_query="היכן התגורר הארי לפני שהתחיל ללמוד בהוגוורטס?"
local_result = ask_graph(local_query,'local')

Markdown(local_result)

INFO: Reading settings from data/graphrag/settings.yaml

INFO: Vector Store Args: {}
creating llm client with {'api_key': 'REDACTED,len=9', 'type': "openai_chat", 'model': 'gemma2', 'max_tokens': 5000, 'temperature': 0.0, 'top_p': 1.0, 'n': 1, 'request_timeout': 180.0, 'api_base': 'http://localhost:11434/v1', 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': True, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}
creating embedding llm client with {'api_key': 'REDACTED,len=9', 'type': "openai_embedding", 'model': 'nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q5_K_M.gguf', 'max_tokens': 4000, 'temperature': 0, 'top_p': 1, 'n': 1, 'request_timeout': 180.0, 'api_base': 'http://localhost:1234/v1', 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': None, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}

SUCCESS: Local Search Response:
הארי פוטר התגורר אצל משפחת דלטון, הוריו של ג'יימס ולילי פוטר. 


הם היו קרובי משפחה של הארי, אך לא ידעו שהוא היה ילד מיוחד.

## Stage 3 - Experiment with GraphRag - Ollama for Moedeling & Nomic as Embedding Model & customize the Entity Extraction
for  this final stage, we will use the same models, but to improve the process of the graph creartion - we will customize the prompt which create the entities, based on our text

In [52]:
# delete the previoius graphrag output folder and re-starting
!rm -rf data/graphrag

In [53]:
# create a graphrag index folder inside the data folder
if not os.path.exists('data/graphrag'):
    !python -m graphrag.index --init --root data/graphrag

[2KInitializing project at data/graphrag
⠋ GraphRAG Indexer 

In [54]:
# create a custom txt file in the prompts folder
# Create the directory if it doesn't exist
!mkdir -p data/graphrag/prompts

# Define the content of the file
entity_extraction_content = """
-Goal-
Given a Hebrew text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
   - entity_name: Name of the entity, capitalized
   - entity_type: One of the following types: [דמויות, חפצים קסומים, מקומות, אירועים, מוסדות]
   - entity_description: Comprehensive description of the entity's attributes and activities
   Format each entity as:
   ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
   For each pair of related entities, extract the following information:
   - source_entity: name of the source entity, as identified in step 1
   - target_entity: name of the target entity, as identified in step 1
   - relationship_description: explanation as to why you think the source entity and the target entity are related to each other
   - relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
   Format each relationship as:
   ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_strength>)

3. Return the output in English as a single list of all the entities and relationships identified in steps 1 and 2. Use **{record_delimiter}** as the list delimiter.

4. When finished, output **{completion_delimiter}**.

######################
-Examples-
######################
Example 1:
Entity_types: דמויות, מוסדות
Text:
הארי פוטר לומד בבית הספר הוגוורטס לכישוף ולקוסמות, והוא מתיידד עם רון ויזלי והרמיוני גריינג'ר. סוורוס סנייפ הוא מורה שמראה סימני חשדנות כלפי הארי.
######################
Output:
("entity"{tuple_delimiter}הארי פוטר{tuple_delimiter}דמויות{tuple_delimiter}הארי פוטר הוא גיבור הסיפור, תלמיד בבית הספר הוגוורטס לכישוף ולקוסמות)
{record_delimiter}
("entity"{tuple_delimiter}הוגוורטס{tuple_delimiter}מוסדות{tuple_delimiter}הוגוורטס הוא בית ספר לקוסמים שבו לומדים הארי וחבריו)
{record_delimiter}
("entity"{tuple_delimiter}סוורוס סנייפ{tuple_delimiter}דמויות{tuple_delimiter}סוורוס סנייפ הוא מורה בהוגוורטס עם יחס חשדני כלפי הארי פוטר)
{record_delimiter}
("relationship"{tuple_delimiter}הארי פוטר{tuple_delimiter}הוגוורטס{tuple_delimiter}הארי פוטר לומד בהוגוורטס{tuple_delimiter}9)
{record_delimiter}
("relationship"{tuple_delimiter}סוורוס סנייפ{tuple_delimiter}הארי פוטר{tuple_delimiter}סנייפ מראה חשדנות כלפי הארי פוטר בבית הספר{tuple_delimiter}7)
{completion_delimiter}

######################
-Real Data-
######################
Entity_types: דמויות, חפצים קסומים, מקומות, אירועים, מוסדות
Text: {input_text}
######################
Output:
"""

# Write the content to the file
with open('data/graphrag/prompts/entity_extraction_hebrew.txt', 'w', encoding='utf-8') as f:
    f.write(entity_extraction_content)

print("File created successfully!")

File created successfully!


In [56]:
# Load the existing YAML file
with open('data/graphrag/settings.yaml', 'r') as f:
    settings_yaml = yaml.load(f, Loader=yaml.FullLoader)

# Update the settings
settings_yaml = {}

# Encoding model
settings_yaml['encoding_model'] = 'cl100k_base'

# Skip workflows
settings_yaml['skip_workflows'] = []

# LLM Settings
settings_yaml['llm'] = {
    'api_key': '${GRAPHRAG_API_KEY}',
    'api_base': "http://localhost:11434/v1",
    'type': 'openai_chat',
    'model': 'gemma2',
    'model_supports_json': True,
    'max_tokens': 5000,
    'request_timeout': 180.0
}

# Parallelization settings
settings_yaml['parallelization'] = {
    'stagger': 0.3
}

settings_yaml['async_mode'] = 'threaded'

# Embeddings settings
settings_yaml['embeddings'] = {
    'async_mode': 'threaded',
    'llm': {
        'api_key': '${GRAPHRAG_API_KEY}',
        'type': 'openai_embedding',
        'model': 'nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q5_K_M.gguf',
        'api_base': 'http://localhost:1234/v1'
    }
}

# Chunk settings
settings_yaml['chunks'] = {
    'size': 300,
    'overlap': 100,
    'group_by_columns': ['id']
}

# Input settings
settings_yaml['input'] = {
    'type': 'file',
    'file_type': 'text',
    'base_dir': 'input',
    'file_encoding': 'utf-8',
    'file_pattern': '.*\\.txt$'
}

# Cache settings
settings_yaml['cache'] = {
    'type': 'file',
    'base_dir': 'cache'
}

# Entity extraction settings
settings_yaml['entity_extraction'] = {
    'prompt': "prompts/entity_extraction_hebrew.txt",
    'entity_types': ['דמויות', 'חפצים', 'קסומים', 'מקומות', 'אירועים', 'מוסדות'],  # Added the missing comma
    'max_gleanings': 0
}

# Summarize descriptions settings
settings_yaml['summarize_descriptions'] = {
    'prompt': "prompts/summarize_descriptions.txt",
    'max_length': 500
}

# Claim extraction settings
settings_yaml['claim_extraction'] = {
    'prompt': "prompts/claim_extraction.txt",
    'description': "Any claims or facts that could be relevant to information discovery.",
    'max_gleanings': 0
}

# Community report settings
settings_yaml['community_report'] = {
    'prompt': "prompts/community_report.txt",
    'max_length': 2000,
    'max_input_length': 8000
}

# Cluster graph settings
settings_yaml['cluster_graph'] = {
    'max_cluster_size': 10
}

# Embed graph settings
settings_yaml['embed_graph'] = {
    'enabled': False
}

# Save the updated YAML settings back to the file
with open('data/graphrag/settings.yaml', 'w') as f:
    yaml.dump(settings_yaml, f)

In [57]:
with open('data/graphrag/settings.yaml', 'w') as f:
    yaml.dump(settings_yaml, f)

In [58]:
# create an input folder and move files into it
if not os.path.exists('data/graphrag/input'):
    os.makedirs('data/graphrag/input')

In [59]:
# move txt files to the input folder 
!cp ../data/processed/harry_potter1.txt data/graphrag/input/harry_potter1.txt

In [60]:
# initiatae graphrag 
!python -m graphrag.index --root ./data/graphrag

[2KLogging enabled at r 
data/graphrag/output/[1;36m20240910[0m-[1;36m165214[0m/reports/indexing-engine.log
[2K⠙ GraphRAG Indexer 
[2K[1A[2K⠙ GraphRAG Indexer e.text) - 1 files loaded (0 filtered) [90m━[0m [35m100%[0m [36m…[0m [33m0…[0m
├── Loading Input (InputFileType.text) - 1 files loaded (0 filtered) [90m━[0m [35m100%[0m [36m…[0m [33m0…[0m
[2K[1A[2K[1A[2K⠙ GraphRAG Indexer 
├── Loading Input (InputFileType.text) - 1 files loaded (0 filtered) [90m━[0m [35m100%[0m [36m…[0m [33m0…[0m
[2K[1A[2K[1A[2K⠙ GraphRAG Indexer 
├── Loading Input (InputFileType.text) - 1 files loaded (0 filtered) [90m━[0m [35m100%[0m [36m…[0m [33m0…[0m
[2K[1A[2K[1A[2K⠙ GraphRAG Indexer 
├── Loading Input (InputFileType.text) - 1 files loaded (0 filtered) [90m━[0m [35m100%[0m [36m…[0m [33m0…[0m
[2K[1A[2K[1A[2K⠹ GraphRAG Indexer 
├── Loading Input (InputFileType.text) - 1 files loaded (0 filtered) [90m━[0m [35m100%[0m [36m…[0m [33m0…[0m


*This might take a while You should get an output: 🚀 All workflows completed successfully.*

In [62]:
local_query="מי הן הדמויות הראשיות בהארי פוטר?"
local_result = ask_graph(local_query,'local')

Markdown(local_result)

INFO: Reading settings from data/graphrag/settings.yaml

INFO: Vector Store Args: {}
creating llm client with {'api_key': 'REDACTED,len=9', 'type': "openai_chat", 'model': 'gemma2', 'max_tokens': 5000, 'temperature': 0.0, 'top_p': 1.0, 'n': 1, 'request_timeout': 180.0, 'api_base': 'http://localhost:11434/v1', 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': True, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}
creating embedding llm client with {'api_key': 'REDACTED,len=9', 'type': "openai_embedding", 'model': 'nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q5_K_M.gguf', 'max_tokens': 4000, 'temperature': 0, 'top_p': 1, 'n': 1, 'request_timeout': 180.0, 'api_base': 'http://localhost:1234/v1', 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': None, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}

SUCCESS: Local Search Response:
הדמויות הראשיות בסדרת הספרים "הארי פוטר" הן:

* **הארי פוטר:** הגיבור הראשי, יתום שגילה שהוא צעיר קסם ומתקבל לבית הספר להוראות קסמים והמכשפות, הוגוורטס.
* **רונה וויזלי:** חברה טובה של הארי, ממוצא קסום, בעלת אישיות חזקה וחכמה. 
* **הרון מאל福י:** נבל הרע בסדרה, צעיר קסם שונא-מגלים ורוצה להשתלט על העולם הקסום.

**דמויות נוספות חשובות:**

* **רמפסון סניפ:** מורה לשיעורי הוגוורטס, ידוע באישיותו הקשוחה והחוקית.
* **אלברט פוטר:** אביו של הארי, קסם חזק ומוכשר שנהרג על ידי וולדמורט.
* **לילי פוטר:** אמו של הארי, קסמת חזקה וחכמה שהקריבה את עצמה כדי להגן עליו.
* **דמבלדור:** מנהל הוגוורטס, קסם חכם ומנוסה שמסייע לארי במאבק נגד וולדמורט.

## Conclusions

* It seems that switching from GPT to Gemma did not improved answers quality.
* maybe switching to different embedding models such as DistilBERT (which are more comptible to hebrew) will improve answers quality.
* overall GraphRAG is an overkiller solution to such a problem. in addition, the token consumption to create entities & communities is large and will be a problem to scale with.

## Next:
see the notebook gr