# Local Search

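Local search generates answers by **combining** the extracted knowledge graph (KG) with chunks (text units) of the original documents.

It is best suited to questions about **specific entities** mentioned in the documents.

> Load `create_final_text_units.parquet` and the graph data tables as context.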

## step1 import libs

In [3]:
import os

import pandas as pd
import tiktoken

from graphrag.query.context_builder.entity_extraction import EntityVectorStoreKey
from graphrag.query.indexer_adapters import (
    read_indexer_covariates,
    read_indexer_entities,
    read_indexer_relationships,
    read_indexer_reports,
    read_indexer_text_units,
)
from graphrag.query.input.loaders.dfs import (
    store_entity_semantic_embeddings,
)
from graphrag.query.llm.oai.chat_openai import ChatOpenAI
from graphrag.query.llm.oai.embedding import OpenAIEmbedding
from graphrag.query.llm.oai.typing import OpenaiApiType
from graphrag.query.question_gen.local_gen import LocalQuestionGen
from graphrag.query.structured_search.local_search.mixed_context import (
    LocalSearchMixedContext,
)
from graphrag.query.structured_search.local_search.search import LocalSearch
from graphrag.vector_stores.lancedb import LanceDBVectorStore


## step2 data process 

Parameter settings

In [4]:
INPUT_DIR = "./output/20240807-093938/artifacts"
LANCEDB_URI = "./../lancedb"

COMMUNITY_REPORT_TABLE = "create_final_community_reports"
ENTITY_TABLE = "create_final_nodes"
ENTITY_EMBEDDING_TABLE = "create_final_entities"
RELATIONSHIP_TABLE = "create_final_relationships"
COVARIATE_TABLE = "create_final_covariates"
TEXT_UNIT_TABLE = "create_final_text_units"
COMMUNITY_LEVEL = 2

Read entities

In [5]:
# read nodes table to get community and degree data
entity_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_TABLE}.parquet")
entity_embedding_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_EMBEDDING_TABLE}.parquet")

entities = read_indexer_entities(entity_df, entity_embedding_df, COMMUNITY_LEVEL)

# load description embeddings to an in-memory lancedb vectorstore
# to connect to a remote db, specify url and port values.
description_embedding_store = LanceDBVectorStore(
    collection_name="entity_description_embeddings",
)
description_embedding_store.connect(db_uri=LANCEDB_URI)
entity_description_embeddings = store_entity_semantic_embeddings(
    entities=entities, vectorstore=description_embedding_store
)

print(f"Entity count: {len(entity_df)}")
entity_df.head()

Entity count: 1047


Unnamed: 0,level,title,type,description,source_id,degree,human_readable_id,id,size,graph_embedding,entity_type,community,top_level_node_id,x,y
0,0,项安波,PERSON,国务院发展研究中心企业研究所副所长、研究员,"596832d7737018724c91ff06c047794d,82d2a2a927618...",0,0,b45241d70f0e43fca764df95b2b81f77,0,,,,b45241d70f0e43fca764df95b2b81f77,0,0
1,0,杨继东,PERSON,中国人民大学学国有经济研究院副院长、经济学院教授,"596832d7737018724c91ff06c047794d,82d2a2a927618...",0,1,4119fd06010c494caa07f439b333f4c5,0,,,,4119fd06010c494caa07f439b333f4c5,0,0
2,0,高秋男,PERSON,中国人民大学经济学院博士研究生,"596832d7737018724c91ff06c047794d,82d2a2a927618...",0,2,d3835bf3dda84ead99deadbeac5d0d7d,0,,,,d3835bf3dda84ead99deadbeac5d0d7d,0,0
3,0,国有企业,ORGANIZATION,State-owned enterprises in China play a crucia...,"2b499ad361e589ce60c47bf6ca9ba9f0,2f145e08d2bfa...",46,3,077d2820ae1845bcbb1803379a3d1eae,46,,ORGANIZATION,0.0,077d2820ae1845bcbb1803379a3d1eae,0,0
4,0,企业数字化转型,EVENT,"The entity, referred to as ""\u4f01\u4e1a\u6570...","596832d7737018724c91ff06c047794d,82d2a2a927618...",13,4,3671ea0dd4e84c1a9b02c5ab2c8f4bac,13,,,4.0,3671ea0dd4e84c1a9b02c5ab2c8f4bac,0,0


Read relationships

In [6]:
relationship_df = pd.read_parquet(f"{INPUT_DIR}/{RELATIONSHIP_TABLE}.parquet")
relationships = read_indexer_relationships(relationship_df)

print(f"Relationship count: {len(relationship_df)}")
relationship_df.head()

Relationship count: 163


Unnamed: 0,source,target,weight,description,text_unit_ids,id,human_readable_id,source_degree,target_degree,rank
0,国有企业,企业数字化转型,2.0,"The entity ""\u56fd\u6709\u4f01\u4e1a"" is invol...","[596832d7737018724c91ff06c047794d, 82d2a2a9276...",b45ef27279c043269b23b894461d7d8c,0,46,13,59
1,国有企业,数字经济,1.0,国有企业数字化转型有助于推动数字经济的发展,[9ae220e7093685618365542418189752],10983a248cc448c59c94df4d1d0898f0,1,46,3,49
2,国有企业,地区间的数字鸿沟,11.0,国有企业在弥合地区间的数字鸿沟方面有积极作用国有企业在地区间的数字鸿沟弥合方面有积极作用,[9ae220e7093685618365542418189752],e2ec7d3cdbeb4dd086ae6eb399332363,2,46,3,49
3,国有企业,高质量发展,11.0,国有企业数字化转型可能影响和制约高质量发展国有企业数字化转型可能影响高质量发展国有企业数字化...,[9ae220e7093685618365542418189752],67f10971666240ea930f3b875aabdc1a,3,46,3,49
4,国有企业,经济效益,18.0,"The entity ""\u56fd\u6709\u4f01\u4e1a"" is deepl...","[9ae220e7093685618365542418189752, f3bfb6d6e9b...",8b95083939ad4771b57a97c2d5805f36,4,46,4,50


Read covariates (optional)

In [7]:
covariate_df = pd.read_parquet(f"{INPUT_DIR}/{COVARIATE_TABLE}.parquet")

claims = read_indexer_covariates(covariate_df)

print(f"Claim records: {len(claims)}")
covariates = {"claims": claims}

FileNotFoundError: [Errno 2] No such file or directory: './output/20240807-093938/artifacts/create_final_covariates.parquet'
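
The `FileNotFoundError` above means this index run did not write a `create_final_covariates.parquet` file, and the `covariates` argument is accordingly commented out when the context builder is created in step4. A minimal sketch of one way to treat the table as optional, so that a missing file does not abort the cell; the helper name `optional_table_path` is hypothetical and not part of graphrag:

```python
from pathlib import Path
from typing import Optional


def optional_table_path(input_dir: str, table_name: str) -> Optional[Path]:
    """Return the parquet path for a table, or None if the file is absent.

    Hypothetical helper for illustration; not part of graphrag.
    """
    path = Path(input_dir) / f"{table_name}.parquet"
    return path if path.is_file() else None


# Intended use in this notebook (names taken from the cells above):
#
# covariate_path = optional_table_path(INPUT_DIR, COVARIATE_TABLE)
# if covariate_path is None:
#     covariates = None  # no claims table in this index run
# else:
#     claims = read_indexer_covariates(pd.read_parquet(covariate_path))
#     covariates = {"claims": claims}
```

With this guard, later cells can check `covariates is None` instead of relying on the table being present.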

Read community reports

In [8]:
report_df = pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_REPORT_TABLE}.parquet")
reports = read_indexer_reports(report_df, entity_df, COMMUNITY_LEVEL)

print(f"Report records: {len(report_df)}")
report_df.head()

Report records: 25


Unnamed: 0,community,full_content,level,rank,title,rank_explanation,summary,findings,full_content_json,id
0,23,# 国有企业数字化转型与挑战\n\n本社区围绕国有企业数字化转型展开，涉及传统行业、地区差距...,2,7.0,国有企业数字化转型与挑战,社区内的国有企业数字化转型面临多方面挑战，影响其发展速度和效果，因此影响严重性评级为中等。,本社区围绕国有企业数字化转型展开，涉及传统行业、地区差距、经济效益、贫富差距等多个方面。转型...,[{'explanation': '国有企业在数字化转型过程中面临地区差距的挑战，尽管后发地...,"{\n ""title"": ""\u56fd\u6709\u4f01\u4e1a\u657...",b8b09b9b-be5a-4cd4-ba8b-47e4ac29757a
1,24,# 国有企业与成长能力、可持续增长率\n\n本社区围绕国有企业、成长能力和可持续增长率三个关...,2,4.5,国有企业与成长能力、可持续增长率,社区内的实体关系和关联信息表明，提升成长能力和实现可持续增长对国有企业至关重要，但面临挑战，...,本社区围绕国有企业、成长能力和可持续增长率三个关键实体展开，它们在提升企业成长能力方面存在关...,[{'explanation': '国有企业在提升成长能力方面面临挑战，这可能影响其市场竞争...,"{\n ""title"": ""\u56fd\u6709\u4f01\u4e1a\u4e0...",2905edec-2995-4be9-b26f-f5f96d952080
2,10,# 国有企业数字化转型与挑战\n\n本社区围绕国有企业数字化转型展开，涉及多个实体，包括国有...,1,6.5,国有企业数字化转型与挑战,社区内的实体在数字化转型过程中面临多方面的挑战，这些挑战可能对转型的顺利进行和最终效果产生重...,本社区围绕国有企业数字化转型展开，涉及多个实体，包括国有企业、成长能力、传统行业数字化转型等...,[{'explanation': '国有企业在数字化转型过程中面临地区差距、城乡差距、贫富差...,"{\n ""title"": ""\u56fd\u6709\u4f01\u4e1a\u657...",57bb162b-cd5a-47eb-8491-acfadd4d3c60
3,11,# Digitalization and Operational Performance i...,1,6.0,Digitalization and Operational Performance in ...,"The impact severity rating is moderate, consid...",The community focuses on the impact of digital...,[{'explanation': 'Digitalization is a key focu...,"{\n ""title"": ""Digitalization and Operationa...",7c9cb5fd-9e03-4738-a5ac-8f778ac3529f
4,12,"# 资产收益率与盈利能力\n\n社区围绕着 ""资产收益率 ""和 ""盈利能力 ""两个关键实体，...",1,4.5,资产收益率与盈利能力,"社区的影响力中等，主要由 ""资产收益率 ""实体的挑战和 ""盈利能力 ""实体的关联性引起。","社区围绕着 ""资产收益率 ""和 ""盈利能力 ""两个关键实体，它们分别代表了技术公司和业务数字...","[{'explanation': ' ""资产收益率 ""实体代表了一家技术公司，它在数据的合理...","{\n ""title"": ""\u8d44\u4ea7\u6536\u76ca\u738...",15f57bf6-7e60-4556-8171-b2fafcbca1e1


Read text units

In [9]:
text_unit_df = pd.read_parquet(f"{INPUT_DIR}/{TEXT_UNIT_TABLE}.parquet")
text_units = read_indexer_text_units(text_unit_df)

print(f"Text unit records: {len(text_unit_df)}")
text_unit_df.head()

Text unit records: 12


Unnamed: 0,id,text,n_tokens,document_ids,entity_ids,relationship_ids
0,596832d7737018724c91ff06c047794d,国企数字化转型的进展、趋势与政策选择\n\n\n项安波 国务院发展研究中心企业研究所副所长、...,1200,[85aafe3f5accc4887497ecd42d75b3e1],"[b45241d70f0e43fca764df95b2b81f77, 4119fd06010...","[b45ef27279c043269b23b894461d7d8c, fe98fb197d2..."
1,9ae220e7093685618365542418189752,制企业的数字化转型战略所导致的，央企的数字化转型目标更侧重于长远发展，侧重于数字技术的研发、...,1200,[85aafe3f5accc4887497ecd42d75b3e1],"[077d2820ae1845bcbb1803379a3d1eae, c9632a35146...","[10983a248cc448c59c94df4d1d0898f0, e2ec7d3cdbe..."
2,ae48e7b126e793a9c6c82340e98af622,�。三是发挥包括大型国企在内的龙头骨干企业以及数字协同平台等公共服务平台的赋能作用，组织专项...,1200,[85aafe3f5accc4887497ecd42d75b3e1],"[077d2820ae1845bcbb1803379a3d1eae, 254770028d7...","[24652fab20d84381b112b8491de2887e, d4602d4a27b..."
3,82d2a2a927618b8802f4e5d76818ace5,使用了滞后一期的解释变量；考察转型的中期影响时，则使用了滞后四期的解释变量，以尽量避免潜在的...,1200,"[85aafe3f5accc4887497ecd42d75b3e1, 85aafe3f5ac...","[b45241d70f0e43fca764df95b2b81f77, 4119fd06010...","[b45ef27279c043269b23b894461d7d8c, a64b4b17b07..."
4,af6fd1102603efa75a7d1f32210af69b,导向更敏感、更有资源保障和数据基础，而且更多承担着新型基础设施建设、需求牵引、信任维持等任务...,1200,[85aafe3f5accc4887497ecd42d75b3e1],"[077d2820ae1845bcbb1803379a3d1eae, 1fd3fa8bb5a...","[a2b1621a3e424ae29a6a73f00edbeca3, 6f3dd1fd6d7..."


## step3 model param setup

In [11]:
api_key = 'qwen'
llm_model = 'Qwen2-7B-Instruct'
embedding_model = 'gpt-4'

llm = ChatOpenAI(
    api_key=api_key,
    api_base='http://0.0.0.0:8000/v1',
    model=llm_model,
    api_type=OpenaiApiType.OpenAI,  # OpenaiApiType.OpenAI or OpenaiApiType.AzureOpenAI
    max_retries=20,
)

token_encoder = tiktoken.get_encoding("cl100k_base")

text_embedder = OpenAIEmbedding(
    api_key=api_key,
    api_base='http://0.0.0.0:8200/v1',
    api_type=OpenaiApiType.OpenAI,
    model=embedding_model,
    deployment_name=embedding_model,
    max_retries=20,
)

## step4 create local search context builder and engine

In [13]:
context_builder = LocalSearchMixedContext(
    community_reports=reports,
    text_units=text_units,
    entities=entities,
    relationships=relationships,
    # covariates=covariates,
    entity_text_embeddings=description_embedding_store,
    embedding_vectorstore_key=EntityVectorStoreKey.ID,  # if the vectorstore uses entity title as ids, set this to EntityVectorStoreKey.TITLE
    text_embedder=text_embedder,
    token_encoder=token_encoder,
)

In [14]:
# text_unit_prop: proportion of context window dedicated to related text units
# community_prop: proportion of context window dedicated to community reports.
# The remaining proportion is dedicated to entities and relationships. Sum of text_unit_prop and community_prop should be <= 1
# conversation_history_max_turns: maximum number of turns to include in the conversation history.
# conversation_history_user_turns_only: if True, only include user queries in the conversation history.
# top_k_mapped_entities: number of related entities to retrieve from the entity description embedding store.
# top_k_relationships: control the number of out-of-network relationships to pull into the context window.
# include_entity_rank: if True, include the entity rank in the entity table in the context window. Default entity rank = node degree.
# include_relationship_weight: if True, include the relationship weight in the context window.
# include_community_rank: if True, include the community rank in the context window.
# return_candidate_context: if True, return a set of dataframes containing all candidate entity/relationship/covariate records that
# could be relevant. Note that not all of these records will be included in the context window. The "in_context" column in these
# dataframes indicates whether the record is included in the context window.
# max_tokens: maximum number of tokens to use for the context window.


local_context_params = {
    "text_unit_prop": 0.5,
    "community_prop": 0.1,
    "conversation_history_max_turns": 5,
    "conversation_history_user_turns_only": True,
    "top_k_mapped_entities": 10,
    "top_k_relationships": 10,
    "include_entity_rank": True,
    "include_relationship_weight": True,
    "include_community_rank": False,
    "return_candidate_context": False,
    "embedding_vectorstore_key": EntityVectorStoreKey.ID,  # set this to EntityVectorStoreKey.TITLE if the vectorstore uses entity title as ids
    "max_tokens": 12_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000)
}

llm_params = {
    "max_tokens": 2_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 1000-1500)
    "temperature": 0.0,
}
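
The comments in this cell describe how the context window is divided between text units, community reports, and entities/relationships. As a worked example for the values in `local_context_params`, the sketch below computes an approximate token budget per section. The function `split_context_budget` is a hypothetical helper, and truncating to whole tokens with `int` is an assumption; how graphrag rounds internally is not shown in this notebook.

```python
def split_context_budget(
    max_tokens: int, text_unit_prop: float, community_prop: float
) -> dict:
    """Split a context window into per-section token budgets.

    Hypothetical helper: whatever is not given to text units or
    community reports is left for entities and relationships.
    """
    if text_unit_prop < 0 or community_prop < 0:
        raise ValueError("proportions must be non-negative")
    if text_unit_prop + community_prop > 1:
        raise ValueError("text_unit_prop + community_prop must be <= 1")

    text_unit_tokens = int(max_tokens * text_unit_prop)
    community_tokens = int(max_tokens * community_prop)
    return {
        "text_units": text_unit_tokens,
        "community_reports": community_tokens,
        "entities_and_relationships": max_tokens
        - text_unit_tokens
        - community_tokens,
    }


# Values taken from local_context_params above.
budget = split_context_budget(12_000, text_unit_prop=0.5, community_prop=0.1)
print(budget)
```

With `max_tokens` of 12,000 this gives roughly 6,000 tokens for text units, 1,200 for community reports, and 4,800 for entities and relationships.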

In [15]:
search_engine = LocalSearch(
    llm=llm,
    context_builder=context_builder,
    token_encoder=token_encoder,
    llm_params=llm_params,
    context_builder_params=local_context_params,
    response_type="multiple paragraphs",  # free form text describing the response type and format, can be anything, e.g. prioritized list, single paragraph, multiple paragraphs, multiple-page report
)

## step5 perform local search

In [17]:
result = await search_engine.asearch("数字化转型建议是什么？")
print(result.response)

Error embedding chunk {'OpenAIEmbedding': '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\n<html><head>\n<meta type="copyright" content="Copyright (C) 1996-2016 The Squid Software Foundation and contributors">\n<meta http-equiv="Content-Type" CONTENT="text/html; charset=utf-8">\n<title>ERROR: The requested URL could not be retrieved</title>\n<style type="text/css"><!-- \n /*\n * Copyright (C) 1996-2016 The Squid Software Foundation and contributors\n *\n * Squid software is distributed under GPLv2+ license and includes\n * contributions from numerous individuals and organizations.\n * Please see the COPYING and CONTRIBUTORS files for details.\n */\n\n/*\n Stylesheet for Squid Error pages\n Adapted from design by Free CSS Templates\n http://www.freecsstemplates.org\n Released for free under a Creative Commons Attribution 2.5 License\n*/\n\n/* Page basics */\n* {\n\tfont-family: verdana, sans-serif;\n}\n\nhtml body {\n\tmargin: 0;\n\tpadding: 0

ZeroDivisionError: Weights sum to zero, can't be normalized
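
The first message shows that the embedding request was answered with a Squid proxy error page rather than an embedding, so no chunk was embedded; the `ZeroDivisionError` that follows is consistent with averaging over an empty set of chunk embeddings. One possible cause, stated here as an assumption, is that HTTP proxy environment variables are routing the call to the local endpoint (`http://0.0.0.0:8200/v1`) through the proxy. A minimal sketch of excluding local hosts from proxying before the clients are created; `add_no_proxy_hosts` is a hypothetical helper, and whether the HTTP client in use honors `NO_PROXY` should be verified:

```python
import os
from typing import Iterable, MutableMapping, Optional


def add_no_proxy_hosts(
    hosts: Iterable[str], env: Optional[MutableMapping[str, str]] = None
) -> str:
    """Append hosts to NO_PROXY / no_proxy, keeping existing entries.

    Hypothetical helper for illustration. Returns the merged value.
    """
    env = os.environ if env is None else env
    current = env.get("NO_PROXY") or env.get("no_proxy") or ""
    merged = [h.strip() for h in current.split(",") if h.strip()]
    for host in hosts:
        if host not in merged:
            merged.append(host)
    value = ",".join(merged)
    # Set both spellings, since tools differ in which one they read.
    env["NO_PROXY"] = value
    env["no_proxy"] = value
    return value


# Run this before constructing ChatOpenAI / OpenAIEmbedding.
add_no_proxy_hosts(["0.0.0.0", "127.0.0.1", "localhost"])
```

If the error persists, checking that the embedding server on port 8200 is actually reachable is the next step.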

In [None]:
question = "Tell me about Dr. Jordan Hayes"
result = await search_engine.asearch(question)
print(result.response)
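
Top-level `await` works in a notebook because an event loop is already running. Outside a notebook, the same call would need to be driven by `asyncio.run`. A minimal sketch; `StubEngine` and `StubResult` are stand-ins invented for this example so that it runs without a model server, and only the `asearch(...)` / `.response` usage is taken from the cells above:

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StubResult:
    """Stand-in for the search result; only `.response` is modeled."""

    response: str


class StubEngine:
    """Stand-in for the search engine, used so the sketch runs offline."""

    async def asearch(self, query: str) -> StubResult:
        return StubResult(response=f"(stub answer for: {query})")


async def ask(engine, question: str) -> str:
    """Await one search and return the response text."""
    result = await engine.asearch(question)
    return result.response


if __name__ == "__main__":
    # In a script, asyncio.run creates and closes the event loop.
    print(asyncio.run(ask(StubEngine(), "Tell me about Dr. Jordan Hayes")))
```

In the notebook itself, `await ask(search_engine, question)` would be the equivalent call.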

## step6 inspect the context data

In [None]:
result.context_data["entities"].head()

In [None]:
result.context_data["relationships"].head()

In [None]:
result.context_data["reports"].head()

In [None]:
result.context_data["sources"].head()

In [None]:
if "claims" in result.context_data:
    print(result.context_data["claims"].head())
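
The cells above inspect each context table one at a time. A small sketch that reports the row count of every table present, which also copes with optional keys such as `claims`; `summarize_context` is a hypothetical helper and relies only on each value supporting `len()`, as pandas DataFrames do:

```python
from typing import Dict, Mapping, Sized


def summarize_context(context_data: Mapping[str, Sized]) -> Dict[str, int]:
    """Map each context table name to its number of rows."""
    return {name: len(table) for name, table in context_data.items()}


# Stand-in data so the sketch runs on its own; in the notebook,
# pass result.context_data instead.
example = {"entities": [1, 2, 3], "relationships": [1], "reports": []}
for name, rows in summarize_context(example).items():
    print(f"{name}: {rows} rows")
```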

## step7 generate the next candidate questions

In [None]:
question_generator = LocalQuestionGen(
    llm=llm,
    context_builder=context_builder,
    token_encoder=token_encoder,
    llm_params=llm_params,
    context_builder_params=local_context_params,
)

In [None]:
question_history = [
    "Tell me about Agent Mercer",
    "What happens in Dulce military base?",
]
candidate_questions = await question_generator.agenerate(
    question_history=question_history, context_data=None, question_count=5
)
print(candidate_questions.response)