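# Hybrid Search

Hybrid search combines traditional keyword-based search with semantic search to return more accurate and relevant results. In a RAG application over research papers, it retrieves articles that match the user's query both by keyword and by meaning. This makes it particularly useful for complex queries that involve subtle concepts, synonyms, and related ideas.

![Hybrid Search](images/Hybrid_Search.png)

In this notebook, we walk through the implementation of hybrid search in a RAG application and explore how it uses both keyword-based and semantic search techniques to deliver a more effective search experience.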

Here are the steps:
* [Load the chunked dataset](#loading-the-chunks-from-the-previous-steps)
* [Sparse index](#hybrid-search---sparse-index)
* [Dense index](#hybrid-search---dense-index)
* [Merge the results](#hybrid-search---merging-results)
* [Generate a reply using the merged results](#using-merged-results-to-generate-a-reply)

### Visual improvements

We will use the [rich library](https://github.com/Textualize/rich) to make the output more readable, and we will suppress warning messages.

In [None]:
from rich.console import Console
from rich_theme_manager import Theme, ThemeManager
import pathlib

theme_dir = pathlib.Path("themes")
theme_manager = ThemeManager(theme_dir=theme_dir)
dark = theme_manager.get("dark")

# Create a console with the dark theme
console = Console(theme=dark)

In [None]:
import warnings

# Suppress warnings
warnings.filterwarnings('ignore')

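## Hybrid Search - Sparse Index

We will use a BM25-based index to complement the semantic search of the vector database. BM25 ranks documents by the query terms they contain, so it catches exact keyword matches that an embedding model can miss.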
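
To make the scoring idea concrete, here is a minimal pure-Python sketch of BM25 over a toy corpus. It assumes the common Lucene-style IDF and the parameters `k1=1.5` and `b=0.75`; the `bm25s` library used below may default to a different variant or different parameters, and the function and variable names here are illustrative only.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with a basic BM25."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: the number of documents each term appears in
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Lucene-style IDF (an assumption of this sketch), always positive
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Longer-than-average documents are penalized through b
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

toy_corpus = [
    ["mixtral", "context", "size", "tokens"],
    ["llama", "model", "training", "data"],
    ["context", "window", "attention"],
]
toy_bm25_scores = bm25_scores(["mixtral", "context"], toy_corpus)
print(toy_bm25_scores)
```

The first document matches both query terms and scores highest, the third matches only `context`, and the second matches nothing and scores zero.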

In [None]:
import bm25s
from bm25s.tokenization import Tokenizer, Tokenized
import Stemmer  # optional: for stemming

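### Loading the chunks from the previous steps

We will use the chunks from the AI Arxiv dataset that we used earlier. These chunks were split with semantic chunking and enriched with context.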

In [None]:
import json

# Load the enriched chunks; the context manager closes the file handle
with open('data/corpus.json') as f:
    corpus_json = json.load(f)

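### Creating the sparse index

We will use an in-memory index based on BM25. Many (vector) databases support BM25 natively, and many others support indexing and searching precomputed sparse vectors.

In this example, we also define a stemmer and stopwords to clean the text and to better select the tokens/terms that go into the sparse index.

The tokenizer can encode (convert text to IDs) and decode (convert IDs back to text).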
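
The encode/decode round trip can be sketched with a toy tokenizer. This is not the `bm25s` `Tokenizer`: the `ToyTokenizer` class, its tiny stopword list, and its handling of unknown query terms are assumptions made for illustration, and stemming is left out to keep the sketch free of dependencies.

```python
import re

TOY_STOPWORDS = {"what", "is", "of", "the", "a"}  # hypothetical, tiny stopword list

class ToyTokenizer:
    """Illustrative only: lowercases, splits on word characters, drops stopwords."""

    def __init__(self, stopwords):
        self.stopwords = stopwords
        self.vocab = {}  # token -> integer ID

    def encode(self, text, update_vocab=True):
        ids = []
        for token in re.findall(r"\w+", text.lower()):
            if token in self.stopwords:
                continue
            if token not in self.vocab:
                if not update_vocab:
                    continue  # toy behavior: unknown query terms are skipped
                self.vocab[token] = len(self.vocab)
            ids.append(self.vocab[token])
        return ids

    def decode(self, ids):
        id_to_token = {i: t for t, i in self.vocab.items()}
        return [id_to_token[i] for i in ids]

toy_tokenizer = ToyTokenizer(TOY_STOPWORDS)
doc_ids = toy_tokenizer.encode("The context size of Mixtral")
query_ids = toy_tokenizer.encode("What is the context window?", update_vocab=False)
print(doc_ids, toy_tokenizer.decode(doc_ids), query_ids)
```

The vocabulary is built only while indexing documents. Queries are encoded with `update_vocab=False` so that they reuse the IDs already assigned, which mirrors how the notebook tokenizes the corpus and the query below.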

In [None]:
corpus_text = [doc["text"] for doc in corpus_json]

# optional: create a stemmer
english_stemmer = Stemmer.Stemmer("english")

# Initialize the Tokenizer with the stemmer
sparse_tokenizer = Tokenizer(
    stemmer=english_stemmer,
    lower=True, # lowercase the tokens
    stopwords="english",  # or pass a list of stopwords
    splitter=r"\w+",  # by default r"(?u)\b\w\w+\b", can also be a function
)

In [None]:
console.print(sparse_tokenizer.stopwords)

In [None]:
# Tokenize the corpus and only keep the ids (faster and saves memory)
corpus_sparse_tokens = (
    sparse_tokenizer
    .tokenize(
        corpus_text, 
        update_vocab=True, # update the vocab as we tokenize
        return_as="ids"
    )
)

# Create the BM25 retriever and attach your corpus_json to it
sparse_index = bm25s.BM25(corpus=corpus_json)
# Now, index the corpus_tokens (the corpus_json is not used yet)
sparse_index.index(corpus_sparse_tokens)

In [None]:
vocab_dict = sparse_tokenizer.get_vocab_dict()
console.print(f"The tokenizer vocabulary includes {len(vocab_dict)} tokens/terms")

focus_token = 'context'
focus_token_index = vocab_dict.get(focus_token)
console.print(f"The index of the {focus_token} is {focus_token_index}")

Decoding reverses the lookup: here the tokenizer turns the ID we just retrieved back into its token.

In [None]:
console.print(sparse_tokenizer.decode([[focus_token_index]]))

### Exploring the sparse index

In [None]:
console.print(sparse_index.scores)

For each token, the index stores the list of documents (chunks) that contain it, together with the token's score in each of those documents.
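
The `data`, `indices`, and `indptr` keys follow the compressed sparse column layout, with one column per token, which is how the next cell reads them. A small sketch with made-up numbers shows how one token's entries are sliced out; the `toy_index` values and the `postings` helper are hypothetical.

```python
# Toy index over 3 tokens and 4 documents, stored column by column (one column per token)
toy_index = {
    "data":    [0.9, 0.4, 1.2, 0.3, 0.7, 0.5],  # scores
    "indices": [0,   2,   1,   0,   1,   3],    # document IDs, aligned with data
    "indptr":  [0, 2, 3, 6],                    # token t owns entries indptr[t]:indptr[t+1]
}

def postings(index, token_id):
    """Return (document ID, score) pairs for one token."""
    start, end = index["indptr"][token_id], index["indptr"][token_id + 1]
    return list(zip(index["indices"][start:end], index["data"][start:end]))

token_postings = postings(toy_index, 2)
print(token_postings)
```

Because `indptr` has one more entry than there are tokens, `indptr[t + 1]` is always valid, even for the last token.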

In [None]:
from rich.table import Table
from rich.style import Style

token_index = vocab_dict.get(focus_token)
console.print(f"Index of the token `{focus_token}` in the BM25 retriever: {token_index}")
score_index = sparse_index.scores.get('indptr')[token_index]
next_score_index = sparse_index.scores.get('indptr')[token_index+1]

table = Table(title=f"Document Scores for `{focus_token}`")

table.add_column("Document ID", justify="right", style="cyan", no_wrap=True)
table.add_column("Score", justify="right", style="bright_green")

max_score = max(sparse_index.scores['data'][score_index:next_score_index])
# Define styles for specific rows
highlight_style = Style(bgcolor="yellow")

for i in range(score_index, next_score_index):
    doc_id = sparse_index.scores['indices'][i]
    doc_score = sparse_index.scores['data'][i]
    if doc_score == max_score:
        table.add_row(
            str(doc_id),
            str(doc_score), style=highlight_style
        )
    else:
        table.add_row(
            str(doc_id),
            str(doc_score)
        )

console.print(table)

### Searching the sparse index

As with the dense index, we need to tokenize and encode the query text:

In [None]:
# Query the corpus
query = "What is context size of Mixtral?"
query_tokens = (
    sparse_tokenizer
    .tokenize(
        [query], 
        update_vocab=False, 
        return_as="ids"
    )
)

console.print(query_tokens)

Then we use the encoded query to search the sparse index:

In [None]:
# Query the corpus
sparse_results, sparse_scores = sparse_index.retrieve(query_tokens, k=10)

for i in range(sparse_results.shape[1]):
    doc, score = sparse_results[0, i], sparse_scores[0, i]
    console.print(f"Rank {i+1} (score: {score:.2f}): {doc}")

## Hybrid Search - Dense Index

For hybrid search we also need the dense index of a vector database, as used in the previous steps.
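
As a reminder of what a dense index computes, here is a brute-force sketch of cosine-similarity search over toy two-dimensional vectors. A real vector database such as Qdrant uses approximate nearest-neighbor structures rather than a full scan, and the vectors and helper names here are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_search(query_vec, doc_vecs, limit=2):
    """Score every document against the query and keep the best `limit` hits."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]

toy_doc_vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
toy_hits = brute_force_search([1.0, 0.1], toy_doc_vectors)
print(toy_hits)
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why it is a common choice for comparing sentence embeddings.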

### Creating the dense index

In [None]:
from qdrant_client import QdrantClient
from qdrant_client.http import models
from sentence_transformers import SentenceTransformer

qdrant_client = QdrantClient(
    ":memory:"
) 

# Create the embedding encoder
dense_encoder = SentenceTransformer('all-MiniLM-L6-v2') # Model to create embeddings

In [None]:
collection_name = "hybrid_search"

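# recreate_collection drops any existing collection with this name and creates it anew;
# the value it returns is a success flag, not an index object
dense_index = qdrant_client.recreate_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=dense_encoder.get_sentence_embedding_dimension(),  # vector size is defined by the model used
        distance=models.Distance.COSINE
    )
)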
print(dense_index)

In [None]:
# vectorize!
qdrant_client.upload_points(
    collection_name=collection_name,
    points=[
        models.PointStruct(
            id=idx,
            vector=dense_encoder.encode(doc["text"]).tolist(),
            payload=doc
        ) for idx, doc in enumerate(corpus_json)  # corpus_json holds all the enriched chunks
    ]
)

### Searching the dense index

We first encode the query with the dense encoder:

In [None]:
query_vector = dense_encoder.encode(query).tolist()

Then we use the encoded query to search the dense index:

In [None]:
dense_results = qdrant_client.search(
    collection_name=collection_name,
    query_vector=query_vector,
    limit=10
)

In [None]:
console.print(dense_results)

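## Hybrid Search - Merging Results

There are several ways to merge the results of the two methods (sparse and dense). In this notebook, we use a simple weighted average.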
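
For comparison, one widely used alternative that this notebook does not use is reciprocal rank fusion (RRF), which combines rankings rather than scores and therefore needs no normalization. The sketch below assumes the commonly cited constant `k=60`; the document IDs and rankings are made up.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several rankings: each document earns 1 / (k + rank) per list it appears in."""
    fused = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)

toy_dense_ranking = ["d1", "d2", "d3"]
toy_sparse_ranking = ["d3", "d1", "d4"]
toy_fused = reciprocal_rank_fusion([toy_dense_ranking, toy_sparse_ranking])
print(toy_fused)
```

Documents that appear in both rankings (`d1` and `d3`) rise above those found by only one method. The weighted average used in the rest of this notebook works on the scores themselves instead.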

In [None]:
# Start from the dense hits; each payload is the original chunk, so it carries id and text
documents_with_scores = []
for hit in dense_results:
    documents_with_scores.append({
        "id": hit.payload["id"],
        "text": hit.payload["text"],
        "dense_score": hit.score
    })

# Attach the sparse score to every dense hit that the sparse index also returned.
# Documents returned only by the sparse index are not added to the merged list.
for i, result in enumerate(sparse_results[0]):
    for doc in documents_with_scores:
        if doc["id"] == result["id"]:
            doc["sparse_score"] = sparse_scores[0][i]
            break




In [None]:
console.print(documents_with_scores)

We normalize the scores from each index and then compute a weighted score in which the dense index gets the higher weight (0.8).
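
As a worked example with made-up numbers, the sketch below applies min-max normalization to each score list and blends them with `alpha = 0.2` for the sparse side. The helper names `min_max` and `blend_scores` are illustrative and are not part of the notebook's code.

```python
def min_max(values):
    """Scale values to [0, 1]; return zeros if all values are equal."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

def blend_scores(dense, sparse, alpha=0.2):
    """Weighted average of normalized scores: (1 - alpha) * dense + alpha * sparse."""
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    return [(1 - alpha) * d + alpha * s for d, s in zip(dense_n, sparse_n)]

toy_dense = [0.82, 0.75, 0.60]  # cosine similarities
toy_sparse = [3.0, 9.0, 0.0]    # BM25 scores, on a different scale
toy_weighted = blend_scores(toy_dense, toy_sparse)
print(toy_weighted)
```

Normalization is what makes the two lists comparable: raw BM25 scores are unbounded, while cosine similarities lie in a narrow range, so blending them without rescaling would let one side dominate.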

In [None]:
import numpy as np

# Normalize the two types of scores (a missing score counts as 0).
# Separate names keep the sparse_scores array returned by retrieve() from being overwritten.
dense_values = np.array([doc.get("dense_score", 0) for doc in documents_with_scores])
sparse_values = np.array([doc.get("sparse_score", 0) for doc in documents_with_scores])

def min_max_normalize(values):
    # Scale to [0, 1]; return zeros when all values are equal to avoid dividing by zero
    value_range = np.max(values) - np.min(values)
    return (values - np.min(values)) / value_range if value_range > 0 else np.zeros(len(values))

dense_scores_normalized = min_max_normalize(dense_values)
sparse_scores_normalized = min_max_normalize(sparse_values)

# Calculate a weighted score, giving the sparse score a weight of alpha = 0.2
alpha = 0.2
weighted_scores = (1 - alpha) * dense_scores_normalized + alpha * sparse_scores_normalized

# Pick up the top 3 documents with the weighted score
top_docs = sorted(
    zip(
        documents_with_scores, 
        weighted_scores
    ), 
    key=lambda x: x[1], 
    reverse=True
)[:3]



In [None]:
console.print(top_docs)

## Using Merged Results to Generate a Reply

We can now take the merged results and call the LLM to generate a reply to the user's query.

In [None]:
# define a variable to hold the search results for the generation model
search_results = [doc[0]['text'] for doc in top_docs]

In [None]:
from dotenv import load_dotenv

load_dotenv()

In [None]:
# Now time to connect to the large language model
from openai import OpenAI
from rich.text import Text

client = OpenAI()
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a chatbot and a research expert. Your top priority is to help guide users to understand research papers."},
        {"role": "user", "content": query},
        {"role": "assistant", "content": str(search_results)}
    ]
)

response_text = Text(completion.choices[0].message.content)

In [None]:
from rich.panel import Panel

panel = Panel(response_text, title=f"Hybrid Search Reply to \"{query}\"")
console.print(panel)

Save the retrieved documents for use in the next notebook on reranking, which shows more advanced ways of merging hybrid search results.

In [None]:
import json

with open('data/dense_results.json', 'w') as f:
    json.dump([dense_result.payload for dense_result in dense_results], f, default=str)

with open('data/sparse_results.json', 'w') as f:
    json.dump([sparse_result for sparse_result in sparse_results[0]], f, default=str)

