# RAG Fusion

Re-implemented from [this GitHub repo](https://github.com/Raudaschl/rag-fusion); all credit goes to the original author.

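> RAG-Fusion is a search methodology that aims to bridge the gap between traditional search paradigms and the multifaceted dimensions of human queries. Inspired by the capabilities of Retrieval Augmented Generation (RAG), this project goes a step further by employing multiple query generation and Reciprocal Rank Fusion to re-rank search results.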

## Setup

For this example, we will use Pinecone and some fake data. To configure Pinecone, set the following environment variable:

- `PINECONE_API_KEY`: Your Pinecone API key

In [None]:
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

In [None]:
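all_documents = {
    "doc1": "Climate change and economic impact.",
    "doc2": "Public health concerns due to climate change.",
    "doc3": "Climate change: A social perspective.",
    "doc4": "Technological solutions to climate change.",
    "doc5": "Policy changes needed to combat climate change.",
    "doc6": "Climate change and its impact on biodiversity.",
    "doc7": "Climate change: The science and models.",
    "doc8": "Global warming: A subset of climate change.",
    "doc9": "How climate change affects daily weather.",
    "doc10": "The history of climate change activism.",
}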

In [None]:
vectorstore = PineconeVectorStore.from_texts(
    list(all_documents.values()), OpenAIEmbeddings(), index_name="rag-fusion"
)

## Define the Query Generator

We will now define a chain to do the query generation.

In [7]:
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

In [68]:
from langchain import hub

prompt = hub.pull("langchain-ai/rag-fusion-query-generation")

In [None]:
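# from langchain_core.prompts import ChatPromptTemplate
#
# prompt = ChatPromptTemplate.from_messages([
#     ("system", "You are a helpful assistant that generates multiple search queries based on a single input query."),
#     ("user", "Generate multiple search queries related to: {original_query}"),
#     ("user", "OUTPUT (4 queries):")
# ])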

In [5]:
generate_queries = (
    prompt | ChatOpenAI(temperature=0) | StrOutputParser() | (lambda x: x.split("\n"))
)
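
One thing to watch in the chain above: `x.split("\n")` keeps empty strings when the model emits blank lines, and any list numbering stays attached to each query. A stricter parser could look like the sketch below. `parse_queries` is a hypothetical helper, not part of the original notebook, and the numbered or bulleted output format it handles is an assumption about what the model returns.

```python
import re


def parse_queries(text: str) -> list[str]:
    """Split LLM output into clean queries (hypothetical helper)."""
    queries = []
    for line in text.splitlines():
        # Drop a leading "1." / "2)" / "-" / "*" marker, if present (assumed format)
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        if cleaned:  # skip blank lines
            queries.append(cleaned)
    return queries


print(parse_queries("1. climate change effects\n\n2. economic impact of climate change\n"))
```

If this fits your model's output, it could replace the final `lambda` step in `generate_queries`.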

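## Define the full chain

We can now put it all together and define the full chain. This chain:

1. Generates a bunch of queries
2. Looks up each query in the retriever
3. Joins all the results together using reciprocal rank fusion

Note that it does NOT do a final generation step.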
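
To make the three steps concrete without any API keys, here is a minimal pure-Python sketch of the same flow. Everything in it is a stand-in: `toy_generate_queries` replaces the LLM, `toy_retrieve` replaces the Pinecone retriever with a naive word-overlap ranking, and `CORPUS` is a hypothetical four-document sample. It illustrates the shape of the pipeline, not the behavior of the real chain.

```python
from collections import defaultdict

# Hypothetical toy corpus standing in for the Pinecone index.
CORPUS = [
    "Climate change and economic impact.",
    "Public health concerns due to climate change.",
    "How climate change affects daily weather.",
    "Technological solutions to climate change.",
]


def toy_generate_queries(original_query):
    """Step 1 stand-in: a real chain would ask an LLM for related queries."""
    return [original_query, f"economic {original_query}", f"{original_query} on health"]


def toy_retrieve(query, top_n=2):
    """Step 2 stand-in: rank documents by word overlap with the query."""
    query_words = set(query.lower().split())

    def overlap(doc):
        return len(query_words & set(doc.lower().rstrip(".").split()))

    return sorted(CORPUS, key=overlap, reverse=True)[:top_n]


def toy_fuse(ranked_lists, k=60):
    """Step 3: reciprocal rank fusion over the per-query rankings."""
    scores = defaultdict(float)
    for docs in ranked_lists:
        for rank, doc in enumerate(docs):  # rank is 0-based
            scores[doc] += 1 / (rank + k)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def toy_rag_fusion(original_query):
    queries = toy_generate_queries(original_query)
    ranked_lists = [toy_retrieve(q) for q in queries]
    return toy_fuse(ranked_lists)


fused = toy_rag_fusion("impact of climate change")
for doc, score in fused:
    print(f"{score:.4f}  {doc}")
```

Documents that appear near the top for several queries accumulate the highest fused score, which is the property the real chain relies on.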

In [None]:
original_query = "impact of climate change"

In [75]:
vectorstore = PineconeVectorStore.from_existing_index("rag-fusion", OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

In [None]:
from langchain.load import dumps, loads


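def reciprocal_rank_fusion(results: list[list], k=60):
    """Fuse several ranked document lists into one list of (doc, score) pairs."""
    fused_scores = {}
    for docs in results:
        # Assumes the docs are returned in sorted order of relevance
        for rank, doc in enumerate(docs):
            # Serialize the document so it can be used as a dictionary key
            doc_str = dumps(doc)
            if doc_str not in fused_scores:
                fused_scores[doc_str] = 0
            # rank is 0-based, so a top-ranked result contributes 1 / k
            fused_scores[doc_str] += 1 / (rank + k)

    # Sort by fused score, highest first, and deserialize back into documents
    reranked_results = [
        (loads(doc), score)
        for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    ]
    return reranked_results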

In [77]:
chain = generate_queries | retriever.map() | reciprocal_rank_fusion

In [78]:
chain.invoke({"original_query": original_query})

[(Document(page_content='Climate change and economic impact.'),
  0.06558258417063283),
 (Document(page_content='Climate change: A social perspective.'),
  0.06400409626216078),
 (Document(page_content='How climate change affects daily weather.'),
  0.04787506400409626),
 (Document(page_content='Climate change and its impact on biodiversity.'),
  0.03306010928961749),
 (Document(page_content='Public health concerns due to climate change.'),
  0.016666666666666666),
 (Document(page_content='Technological solutions to climate change.'),
  0.016666666666666666),
 (Document(page_content='Policy changes needed to combat climate change.'),
  0.01639344262295082)]
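
As a sanity check, the smaller scores in this output can be reproduced by hand from the formula `1 / (rank + k)` with `k=60` and 0-based ranks. In the sketch below, `rrf_score` is a hypothetical helper, and the rank combinations are simply ones that reproduce the printed values; the original run did not record which query returned which document at which position.

```python
import math


def rrf_score(ranks, k=60):
    """Sum of 1 / (rank + k) over the 0-based ranks at which one document appeared."""
    return sum(1 / (rank + k) for rank in ranks)


top_once = rrf_score([0])           # ranked first by a single query
second_once = rrf_score([1])        # ranked second by a single query
top_and_second = rrf_score([0, 1])  # first for one query, second for another

# Compare against the scores printed above
print(math.isclose(top_once, 0.016666666666666666))
print(math.isclose(second_once, 0.01639344262295082))
print(math.isclose(top_and_second, 0.03306010928961749))
```

Read this way, a score of about 0.0167 is consistent with a document that one query ranked first and no other query returned, while the larger scores at the top belong to documents returned for several queries.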