# Global Search

Global search uses a `map-reduce` approach to search across all generated `community report`s and derive an answer from them. **In other words, the `community report`s are the `context`.**

<br>

> Load `./output/{time}/artifacts/create_final_community_reports.parquet` as the context.
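The map-reduce idea can be illustrated with a toy sketch. This is not graphrag's implementation: the `Point` structure, the keyword-overlap scoring, and the sample reports are all hypothetical stand-ins for the LLM calls that the real map and reduce steps make.

```python
from dataclasses import dataclass


@dataclass
class Point:
    """One key point extracted from a batch of reports (hypothetical structure)."""

    description: str
    score: int


def map_step(batch: list[str], query: str) -> list[Point]:
    """Stand-in for the per-batch LLM call: score each report by keyword overlap."""
    points = []
    for report in batch:
        score = sum(1 for word in query.split() if word in report)
        if score > 0:
            points.append(Point(description=report, score=score))
    return points


def reduce_step(all_points: list[list[Point]], top_k: int = 3) -> str:
    """Stand-in for the final LLM call: keep the highest-scoring points and join them."""
    flat = [p for batch_points in all_points for p in batch_points]
    flat.sort(key=lambda p: p.score, reverse=True)
    return "\n".join(p.description for p in flat[:top_k])


# Hypothetical community reports, split into two context batches.
reports = [
    "digital transformation raises efficiency",
    "regional gaps slow transformation",
    "asset returns depend on profitability",
]
batches = [reports[:2], reports[2:]]
query = "digital transformation"

# Map over every batch independently, then reduce the partial results.
answer = reduce_step([map_step(batch, query) for batch in batches])
print(answer)
```

Each batch is processed on its own in the map step, so the batches can be handled concurrently; only the reduce step sees the combined partial results.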

## step1 import libs

In [1]:
import os

import pandas as pd
import tiktoken

from graphrag.query.indexer_adapters import read_indexer_entities, read_indexer_reports
from graphrag.query.llm.oai.chat_openai import ChatOpenAI
from graphrag.query.llm.oai.typing import OpenaiApiType
from graphrag.query.structured_search.global_search.community_context import (
    GlobalCommunityContext,
)
from graphrag.query.structured_search.global_search.search import GlobalSearch



## step2 LLM param setup

In [3]:
api_key = 'qwen'
llm_model = 'Qwen2-7B-Instruct'

llm = ChatOpenAI(
    api_key=api_key,
    api_base='http://0.0.0.0:8000/v1',
    model=llm_model,
    api_type=OpenaiApiType.OpenAI, 
    max_retries=20,
)

token_encoder = tiktoken.get_encoding("cl100k_base")

## step3 load community reports(context)


- `create_final_community_reports`: the context data for global search
- `create_final_nodes` and `create_final_entities`: loaded as entities, used to compute the weights for ranking the context

In [4]:
# parquet files generated from indexing pipeline
INPUT_DIR = "./output/20240807-093938/artifacts"
COMMUNITY_REPORT_TABLE = "create_final_community_reports"
ENTITY_TABLE = "create_final_nodes"
ENTITY_EMBEDDING_TABLE = "create_final_entities"

# community level in the Leiden community hierarchy from which we will load the community reports
# higher value means we use reports from more fine-grained communities (at the cost of higher computation cost)
COMMUNITY_LEVEL = 2

In [5]:
entity_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_TABLE}.parquet")
report_df = pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_REPORT_TABLE}.parquet")
entity_embedding_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_EMBEDDING_TABLE}.parquet")

reports = read_indexer_reports(report_df, entity_df, COMMUNITY_LEVEL)
entities = read_indexer_entities(entity_df, entity_embedding_df, COMMUNITY_LEVEL)
print(f"Total report count: {len(report_df)}")
print(
    f"Report count after filtering by community level {COMMUNITY_LEVEL}: {len(reports)}"
)
report_df.head()

Total report count: 25
Report count after filtering by community level 2: 20


| | community | full_content | level | rank | title | rank_explanation | summary | findings | full_content_json | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23 | # 国有企业数字化转型与挑战\n\n本社区围绕国有企业数字化转型展开，涉及传统行业、地区差距... | 2 | 7.0 | 国有企业数字化转型与挑战 | 社区内的国有企业数字化转型面临多方面挑战，影响其发展速度和效果，因此影响严重性评级为中等。 | 本社区围绕国有企业数字化转型展开，涉及传统行业、地区差距、经济效益、贫富差距等多个方面。转型... | [{'explanation': '国有企业在数字化转型过程中面临地区差距的挑战，尽管后发地... | {\n "title": "\u56fd\u6709\u4f01\u4e1a\u657... | b8b09b9b-be5a-4cd4-ba8b-47e4ac29757a |
| 1 | 24 | # 国有企业与成长能力、可持续增长率\n\n本社区围绕国有企业、成长能力和可持续增长率三个关... | 2 | 4.5 | 国有企业与成长能力、可持续增长率 | 社区内的实体关系和关联信息表明，提升成长能力和实现可持续增长对国有企业至关重要，但面临挑战，... | 本社区围绕国有企业、成长能力和可持续增长率三个关键实体展开，它们在提升企业成长能力方面存在关... | [{'explanation': '国有企业在提升成长能力方面面临挑战，这可能影响其市场竞争... | {\n "title": "\u56fd\u6709\u4f01\u4e1a\u4e0... | 2905edec-2995-4be9-b26f-f5f96d952080 |
| 2 | 10 | # 国有企业数字化转型与挑战\n\n本社区围绕国有企业数字化转型展开，涉及多个实体，包括国有... | 1 | 6.5 | 国有企业数字化转型与挑战 | 社区内的实体在数字化转型过程中面临多方面的挑战，这些挑战可能对转型的顺利进行和最终效果产生重... | 本社区围绕国有企业数字化转型展开，涉及多个实体，包括国有企业、成长能力、传统行业数字化转型等... | [{'explanation': '国有企业在数字化转型过程中面临地区差距、城乡差距、贫富差... | {\n "title": "\u56fd\u6709\u4f01\u4e1a\u657... | 57bb162b-cd5a-47eb-8491-acfadd4d3c60 |
| 3 | 11 | # Digitalization and Operational Performance i... | 1 | 6.0 | Digitalization and Operational Performance in ... | The impact severity rating is moderate, consid... | The community focuses on the impact of digital... | [{'explanation': 'Digitalization is a key focu... | {\n "title": "Digitalization and Operationa... | 7c9cb5fd-9e03-4738-a5ac-8f778ac3529f |
| 4 | 12 | # 资产收益率与盈利能力\n\n社区围绕着 "资产收益率 "和 "盈利能力 "两个关键实体，... | 1 | 4.5 | 资产收益率与盈利能力 | 社区的影响力中等，主要由 "资产收益率 "实体的挑战和 "盈利能力 "实体的关联性引起。 | 社区围绕着 "资产收益率 "和 "盈利能力 "两个关键实体，它们分别代表了技术公司和业务数字... | [{'explanation': ' "资产收益率 "实体代表了一家技术公司，它在数据的合理... | {\n "title": "\u8d44\u4ea7\u6536\u76ca\u738... | 15f57bf6-7e60-4556-8171-b2fafcbca1e1 |
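The effect of `COMMUNITY_LEVEL` on the report count can be pictured with a simplified filter. This sketch only keeps rows whose `level` does not exceed the chosen level; it is an approximation under that assumption, and `read_indexer_reports` may apply further selection criteria. The row with level 3 is hypothetical, added to show a report being dropped.

```python
COMMUNITY_LEVEL = 2

# Minimal stand-in for report_df: only the two columns the filter needs.
rows = [
    {"community": "23", "level": 2},
    {"community": "10", "level": 1},
    {"community": "30", "level": 3},  # hypothetical finer-grained community
]

# Keep reports at or above the requested granularity (level <= COMMUNITY_LEVEL).
kept = [row for row in rows if row["level"] <= COMMUNITY_LEVEL]

print(f"Total report count: {len(rows)}")
print(f"Report count after filtering by community level {COMMUNITY_LEVEL}: {len(kept)}")
```

Raising `COMMUNITY_LEVEL` admits more fine-grained communities, which means more reports to process in the map step.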


## step4 Build global context 

Build the global context from the `community reports` and define the parameters.

In [6]:
context_builder = GlobalCommunityContext(
    community_reports=reports,
    entities=entities,  # default to None if you don't want to use community weights for ranking
    token_encoder=token_encoder,
)

In [8]:
context_builder_params = {
    "use_community_summary": False,  # False means using full community reports. True means using community short summaries.
    "shuffle_data": True,
    "include_community_rank": True,
    "min_community_rank": 0,
    "community_rank_name": "rank",
    "include_community_weight": True,
    "community_weight_name": "occurrence weight",
    "normalize_community_weight": True,
    "max_tokens": 12_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000)
    "context_name": "Reports",
}

map_llm_params = {
    "max_tokens": 1000,
    "temperature": 0.0,
    "response_format": {"type": "json_object"},
}

reduce_llm_params = {
    "max_tokens": 2000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 1000-1500)
    "temperature": 0.0,
}
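How a token budget such as `max_tokens` shapes the context can be sketched as greedy packing of reports into batches. This is an illustration only, assuming the budget is applied per batch; `count_tokens` is a hypothetical whitespace counter standing in for the `tiktoken` encoder, and the sample reports are made up.

```python
def count_tokens(text: str) -> int:
    """Hypothetical token counter; a whitespace split stands in for a real tokenizer."""
    return len(text.split())


def pack_into_batches(reports: list[str], max_tokens: int) -> list[list[str]]:
    """Greedily fill each batch until adding the next report would exceed the budget."""
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for report in reports:
        cost = count_tokens(report)
        # Start a new batch when the current one cannot take this report.
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(report)
        used += cost
    if current:
        batches.append(current)
    return batches


sample_reports = ["a b c", "d e", "f g h i", "j"]
packed = pack_into_batches(sample_reports, max_tokens=5)
print(packed)
```

A smaller budget produces more, shorter batches, which means more map calls but a lower risk of exceeding the model's context window.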

## step5 perform global search

In [9]:
search_engine = GlobalSearch(
    llm=llm,
    context_builder=context_builder,
    token_encoder=token_encoder,
    max_data_tokens=12_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000)
    map_llm_params=map_llm_params,
    reduce_llm_params=reduce_llm_params,
    allow_general_knowledge=False,  # set this to True will add instruction to encourage the LLM to incorporate general knowledge in the response, which may increase hallucinations, but could be useful in some use cases.
    json_mode=True,  # set this to False if your LLM model does not support JSON mode.
    context_builder_params=context_builder_params,
    concurrent_coroutines=32,
    response_type="multiple paragraphs",  # free form text describing the response type and format, can be anything, e.g. prioritized list, single paragraph, multiple paragraphs, multiple-page report
)

In [23]:
# Query (in Chinese): "What are the recommendations for digital transformation?"
result = await search_engine.asearch(
    "数字化转型的建议有哪些？"
)

print(result.response)

Exception in _map_response_single_batch
Traceback (most recent call last):
  File "/root/aaa/graphrag_env/lib/python3.10/site-packages/graphrag/query/structured_search/global_search/search.py", line 182, in _map_response_single_batch
    search_response = await self.llm.agenerate(
  File "/root/aaa/graphrag_env/lib/python3.10/site-packages/graphrag/query/llm/oai/chat_openai.py", line 110, in agenerate
    async for attempt in retryer:
  File "/root/aaa/graphrag_env/lib/python3.10/site-packages/tenacity/asyncio/__init__.py", line 166, in __anext__
    do = await self.iter(retry_state=self._retry_state)
  File "/root/aaa/graphrag_env/lib/python3.10/site-packages/tenacity/asyncio/__init__.py", line 153, in iter
    result = await action(retry_state)
  File "/root/aaa/graphrag_env/lib/python3.10/site-packages/tenacity/_utils.py", line 99, in inner
    return call(*args, **kwargs)
  File "/root/aaa/graphrag_env/lib/python3.10/site-packages/tenacity/__init__.py", line 398, in <lambda>
    se

I am sorry but I am unable to answer this question given the provided data.
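The traceback above is cut off, so the actual cause of the exception in the map step is not visible. Two settings in this notebook carry their own warnings in the code comments: `json_mode=True` (to be set to `False` if the model does not support JSON mode) and the 12_000-token budgets (with 5000 suggested for a model with an 8k limit). The following is a sketch of more conservative settings to try, assuming without verification that the locally served `Qwen2-7B-Instruct` endpoint is affected by one of them. The variable names are hypothetical.

```python
# Conservative variants of the settings defined earlier. Whether they resolve
# the exception is an assumption, since the traceback is truncated.

# Smaller context budget, following the "8k limit" hint in the original comments.
context_builder_overrides = {"max_tokens": 5000}

# Map-step LLM parameters without the JSON response_format request.
map_llm_params_plain = {"max_tokens": 1000, "temperature": 0.0}

# Keyword arguments to change when constructing GlobalSearch.
global_search_overrides = {"max_data_tokens": 5000, "json_mode": False}

print(context_builder_overrides, map_llm_params_plain, global_search_overrides)
```

These values would replace the corresponding entries in `context_builder_params`, `map_llm_params`, and the `GlobalSearch(...)` call before rerunning the query.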


In [12]:
# inspect the data used to build the context for the LLM responses
result.context_data["reports"]

| | id | title | occurrence weight | content | rank |
|---|---|---|---|---|---|
| 0 | 9 | 国有企业数字化转型 | 1.0 | # 国有企业数字化转型\n\n本社区围绕中国国有企业进行数字化转型这一核心主题，涵盖了企业规... | 6.5 |
| 1 | 16 | 企业数字化转型与相关实体 | 0.818182 | # 企业数字化转型与相关实体\n\n本社区围绕企业数字化转型这一核心实体，涉及国有企业、民营... | 6.5 |
| 2 | 21 | China's Digital Transformation and Its Impact ... | 0.545455 | # China's Digital Transformation and Its Impac... | 6.5 |
| 3 | 20 | 地方国企与数字化转型 | 0.545455 | # 地方国企与数字化转型\n\n本社区围绕地方国企及其在数字化转型、高质量发展、数字经济等领... | 6.5 |
| 4 | 23 | 国有企业数字化转型与挑战 | 0.454545 | # 国有企业数字化转型与挑战\n\n本社区围绕国有企业数字化转型展开，涉及传统行业、地区差距... | 7.0 |
| 5 | 14 | Digital Regional Layout and Infrastructure | 0.454545 | # Digital Regional Layout and Infrastructure\n... | 6.0 |
| 6 | 2 | 全要素生产率与数字化转型 | 0.363636 | # 全要素生产率与数字化转型\n\n本社区围绕全要素生产率、国有企业、民营企业、地方国企等实... | 6.0 |
| 7 | 3 | Digital Infrastructure and Economic Developmen... | 0.272727 | # Digital Infrastructure and Economic Developm... | 7.0 |
| 8 | 13 | Industry Leaders and Digital Transformation | 0.272727 | # Industry Leaders and Digital Transformation\... | 6.5 |
| 9 | 7 | Central Enterprises and Their Digital Transfor... | 0.272727 | # Central Enterprises and Their Digital Transf... | 6.5 |
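The `occurrence weight` values in the output above are consistent with raw counts divided by the largest count (for example, 9/11 ≈ 0.818182), which is what one would expect with `normalize_community_weight` set to `True`. The sketch below shows that normalization; the raw counts are hypothetical, chosen only so that the ratios reproduce the displayed weights, and the function is not graphrag's own.

```python
def normalize_weights(raw_counts: dict[str, int]) -> dict[str, float]:
    """Scale every count by the maximum so the top community gets weight 1.0."""
    top = max(raw_counts.values())
    return {community: round(count / top, 6) for community, count in raw_counts.items()}


# Hypothetical raw occurrence counts per community id (not taken from the data).
raw_counts = {"9": 11, "16": 9, "21": 6, "23": 5, "2": 4, "3": 3}

weights = normalize_weights(raw_counts)
for community, weight in weights.items():
    print(community, weight)
```

Normalizing by the maximum keeps the weights comparable across queries regardless of how many text units the index contains.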


In [13]:
# inspect number of LLM calls and tokens
print(f"LLM calls: {result.llm_calls}. LLM tokens: {result.prompt_tokens}")

LLM calls: 1. LLM tokens: 11138


In [15]:
"""
'completion_time', 'context_data', 'context_text', 'llm_calls', 'map_responses', 'prompt_tokens', 
'reduce_context_data', 'reduce_context_text', 'response'

"""
res_attr = dir(result)
print(res_attr)

['__annotations__', '__class__', '__dataclass_fields__', '__dataclass_params__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__match_args__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'completion_time', 'context_data', 'context_text', 'llm_calls', 'map_responses', 'prompt_tokens', 'reduce_context_data', 'reduce_context_text', 'response']


In [16]:
result.completion_time

125.68955135345459

In [21]:
result.context_text

['\n\nid|title|occurrence weight|content|rank\n9|国有企业数字化转型|1.0|"# 国有企业数字化转型\n\n本社区围绕中国国有企业进行数字化转型这一核心主题，涵盖了企业规模管理、绩效考核、转型程度、地区间数字鸿沟等多个方面。企业通过数字化转型提升经济效益、优化运营流程，同时面临地区间数字鸿沟、绩效考核约束等挑战。\n\n## 国有企业数字化转型的经济影响\n\n中国国有企业通过数字化转型提升经济效益，但转型程度对成长能力、资产收益率和成本利润率的提升作用不明显。这表明在数字化转型过程中，企业需要更加注重经济效益的释放，以实现可持续发展。[Data: 企业规模 (1), 经济效益 (4), 转型程度 (38), 成长能力 (42), 资产收益率 (43)]\n\n## 地区间数字鸿沟的挑战\n\n国有企业在地区间数字鸿沟的弥合方面面临挑战，数字化转型可能加剧地区差距。这要求国有企业在数字化进程中考虑地区差异，确保资源的公平分配，以促进地区间的均衡发展。[Data: 地区间数字鸿沟 (37, 39, 40, 41, +more)]\n\n## 绩效考核约束的影响\n\n国有企业在绩效考核的严格约束下，数字化转型进程受到制约。这表明在追求数字化转型的同时，需要平衡绩效考核与创新发展的关系，以促进企业的长期发展。[Data: 绩效考核 (37, 7, +more)]\n\n## 转型程度的局限性\n\n国有企业在转型程度下对成长能力、资产收益率和成本利润率的提升作用不明显。这提示企业在数字化转型时，应更加关注转型的深度和广度，以实现更显著的经济效益提升。[Data: 转型程度 (38), 成长能力 (42), 资产收益率 (43)]\n\n## 政策导向与资源保障\n\n国有企业对政策导向更敏感，具有资源保障优势。这表明政策支持和资源投入是国有企业数字化转型成功的关键因素。企业应积极与政府合作，利用政策优势，优化资源配置，推动数字化转型。[Data: 政策导向 (25), 资源保障 (26)]"|6.5\n16|企业数字化转型与相关实体|0.8181818181818182|"# 企业数字化转型与相关实体\n\n本社区围绕企业数字化转型这一核心实体，涉及国有企业、民营企业、数字鸿沟程度、全要素生产率等多个相关实体，展示了中国