# Hands-on RAG with Qwen3 Embedding and Reranker Models Using Milvus

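If you've been keeping an eye on the embedding model space, you've probably noticed that Alibaba just released the [Qwen3 Embedding series](https://qwenlm.github.io/blog/qwen3-embedding/). It includes embedding models and reranker models, each in three sizes (0.6B, 4B, 8B), all built on the Qwen3 foundation models and designed specifically for retrieval tasks.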

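A few features of the Qwen3 series stood out to me:

- **Multilingual embeddings**: they claim a unified semantic space across 100+ languages
- **Instruction prompts**: you can pass custom instructions to modify embedding behavior
- **Variable dimensions**: different embedding sizes are supported through Matryoshka Representation Learning
- **32K context length**: longer input sequences can be processed
- **Standard bi-encoder/cross-encoder setup**: the embedding model is a bi-encoder, the reranker is a cross-encoder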
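To make the variable-dimension point concrete, here is a minimal sketch of how a Matryoshka-style embedding is usually shortened: keep the leading components and rescale to unit length. The helper `truncate_embedding` and the toy vector are illustrative assumptions, not part of the Qwen3 release.

```python
import math


def truncate_embedding(vector, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]


# Toy 4-dimensional vector standing in for a real embedding (assumption).
full = [3.0, 4.0, 0.0, 12.0]
small = truncate_embedding(full, 2)
print(len(small), small)
```

With sentence-transformers, the `truncate_dim` argument of `SentenceTransformer` should achieve a similar effect at load time. Whichever route you take, the vector database collection has to be created with the shortened dimension.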

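Judging by the benchmark results, Qwen3-Embedding-8B scores 70.58 on the MTEB multilingual leaderboard, ahead of BGE, E5, and even Google Gemini. Qwen3-Reranker-8B reaches 69.02 on multilingual ranking tasks. That is not just "pretty good for an open-source model": on these benchmarks the models outperform mainstream commercial APIs across the board. For RAG retrieval, cross-language search, and code search systems, especially in Chinese-language settings, these models are already production-ready.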

![](https://lh7-rt.googleusercontent.com/docsz/AD_4nXdZCKoPqf8mpxwQ_s-gGbdHYvw_HhWn6Ib62v8C_VEZF8AOSnY1yLEEv1ztkINpmwgHAVC5kZw6rWplfx5OkISf_gL4VvoqlXxSfs8s_qd8mdBuA0HBhP9kEdipXy0QVuPmEyOJRg?key=nqzZfIwgkzdlEZQ2MYSMGQ)

![](https://lh7-rt.googleusercontent.com/docsz/AD_4nXdNppvBpn_5M9d6WDb0-pCjgTobVc9eFw_m6m6Vg73wJtB9OvcPFw5089FUui_N2-LbJVjJPe1c8_EnYY4F3Ryw0021kvmJ0jU0Q06qG2ZX2D1vywIyd5aKqO_cx-77U_spMVr8cQ?key=nqzZfIwgkzdlEZQ2MYSMGQ)

If you've already worked with the usual models (OpenAI's embeddings, BGE, E5), you may be wondering whether these are worth your time. Spoiler: they are.

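## What We're Building

This tutorial builds a complete RAG system with Qwen3-Embedding-0.6B, Qwen3-Reranker-0.6B, and Milvus. We'll implement a two-stage retrieval pipeline followed by a generation step:

1. Dense retrieval with Qwen3 embeddings to quickly select candidates
2. Reranking with the Qwen3 cross-encoder to improve precision
3. Generating the final response with OpenAI's GPT-4

By the end, you'll have a working system that handles multilingual queries, uses instruction prompts for domain adaptation, and balances speed and accuracy through reranking.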
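Before bringing in the real models, the retrieve-then-rerank flow described above can be sketched with toy data. Everything here is a stand-in: the 2-dimensional vectors replace real embeddings, and `toy_overlap_score` is a hypothetical word-overlap scorer used in place of the cross-encoder.

```python
def dense_retrieve(query_vec, doc_vecs, top_k):
    """Stage 1: rank all documents by inner product and keep the top_k indices."""
    scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]


def rerank(query, docs, candidate_ids, score_fn, top_n):
    """Stage 2: rescore only the candidates with a slower, more precise scorer."""
    scored = [(i, score_fn(query, docs[i])) for i in candidate_ids]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def toy_overlap_score(query, doc):
    """Hypothetical stand-in for a cross-encoder: fraction of query words found in doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words)


docs = [
    "milvus stores data in segments",
    "how to install docker",
    "data is stored as vectors in milvus",
]
doc_vecs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]]  # toy embeddings (assumption)
query, query_vec = "how is data stored in milvus", [1.0, 0.0]

candidates = dense_retrieve(query_vec, doc_vecs, top_k=2)
final = rerank(query, docs, candidates, toy_overlap_score, top_n=1)
print(candidates, final)
```

The point of the split is cost: the first stage touches every document but only needs a dot product, while the second stage runs the expensive scorer on a handful of candidates.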

## Environment Setup

Let's start with the dependencies. Note the minimum version requirements; they matter for compatibility:

```bash
pip install --upgrade pymilvus openai requests tqdm sentence-transformers transformers
```

_Requires transformers>=4.51.0 and sentence-transformers>=2.7.0_

In this tutorial we'll use OpenAI as the generation model. Set your API key:

```python
import os
os.environ["OPENAI_API_KEY"] = "sk-***********"
```

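## Data Preparation

We'll use the Milvus documentation as our knowledge base. It's a good mix of technical content for testing both retrieval and generation quality.

Download and extract the documentation: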


```python
# Uncomment to run in a notebook. Note the capital -O, which sets the output file name.
# !wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip -O ../runs/milvus_docs_2.4.x_en.zip
# !unzip -q ../runs/milvus_docs_2.4.x_en.zip -d ../runs/milvus_docs
```

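Load the Markdown files. To keep things simple, each FAQ file is treated as a single chunk here; for production systems, consider a more sophisticated chunking strategy: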


```python
import os
from glob import glob
import json

from sentence_transformers import SentenceTransformer
import torch
from transformers import AutoModel, AutoTokenizer, AutoModelForCausalLM
from pymilvus import MilvusClient
from tqdm import tqdm
```



```python
# Disable tokenizers parallelism to silence the fork warning
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
# Use the Hugging Face mirror for mainland China
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
```

```python
text_lines = []

for file_path in glob('../runs/milvus_docs/en/faq/*.md', recursive=True):
  with open(file_path, 'r') as file:
    file_text = file.read()
    text_lines.append(file_text)

print(f'Import text lines is {len(text_lines)}')
```

```
Import text lines is 4
```
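The loading cell above appends each file as one entry. If you want finer-grained chunks, one simple option is to start a new chunk at every Markdown heading. The following is a minimal sketch of that idea; `split_by_headings` is a hypothetical helper, not part of the original notebook, and it deliberately ignores `#` lines inside fenced code blocks.

```python
import re


def split_by_headings(markdown_text):
    """Split Markdown into chunks, starting a new chunk at each ATX heading line."""
    chunks, current = [], []
    in_fence = False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # toggle when entering or leaving a code fence
        is_heading = not in_fence and re.match(r"#{1,6}\s", line) is not None
        if is_heading and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


sample = "# FAQ\nintro\n#### Q1\nanswer 1\n```\n# not a heading\n```\n#### Q2\nanswer 2"
for chunk in split_by_headings(sample):
    print(repr(chunk))
```

In the loading loop, `text_lines.extend(split_by_headings(file_text))` would then replace the `append` call. Note that the chunk count, and therefore the outputs shown in the rest of this tutorial, would change.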


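## Model Setup

Now let's initialize our models. We're using the lightweight 0.6B versions, which strike a good balance between performance and resource requirements: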


```python
EMBED_MODEL_NAME = 'Qwen/Qwen3-Embedding-0.6B'
RERANKER_MODEL_NAME = 'Qwen/Qwen3-Reranker-0.6B'

# Load Qwen3-Embedding-0.6B model for text embeddings
embedding_model = SentenceTransformer(EMBED_MODEL_NAME)

# Load Qwen3-Reranker-0.6B model for reranking
# Left padding keeps the final position of every sequence aligned for scoring
reranker_tokenizer = AutoTokenizer.from_pretrained(RERANKER_MODEL_NAME, padding_side='left')
reranker_model = AutoModelForCausalLM.from_pretrained(RERANKER_MODEL_NAME).eval()

# Reranker configuration
token_false_id = reranker_tokenizer.convert_tokens_to_ids('no')
token_true_id = reranker_tokenizer.convert_tokens_to_ids('yes')
max_reranker_length = 8192

prefix = '<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
# Note: the official Qwen3-Reranker example places an empty <think></think> block in this
# suffix; the string below is the one that produced the token IDs printed underneath.
suffix = '<|im_end|>\n<|im_start|>assistant\n\n\n\n\n'
prefix_tokens = reranker_tokenizer.encode(prefix, add_special_tokens=False)
suffix_tokens = reranker_tokenizer.encode(suffix, add_special_tokens=False)

print(f'token_false_id: {token_false_id}')
print(f'token_true_id: {token_true_id}')
print(f'prefix_tokens: {prefix_tokens}')
print(f'suffix_tokens: {suffix_tokens}')
```

```
token_false_id: 2152
token_true_id: 9693
prefix_tokens: [151644, 8948, 198, 60256, 3425, 279, 11789, 20027, 279, 8502, 3118, 389, 279, 11361, 323, 279, 758, 1235, 3897, 13, 7036, 429, 279, 4226, 646, 1172, 387, 330, 9693, 1, 476, 330, 2152, 3263, 151645, 198, 151644, 872, 198]
suffix_tokens: [151645, 198, 151644, 77091, 14621]
```


```python
def emb_text(text: str, is_query=False):
  """Generate text embeddings using Qwen3-Embedding-0.6B model

  Args:
    text: Input text to embed
    is_query: Whether this is query (True) or document (False)

  Returns:
    List of embedding values
  """

  if is_query:
    embeddings = embedding_model.encode([text], prompt_name='query')
  else:
    embeddings = embedding_model.encode([text])

  return embeddings[0].tolist()
```
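The `prompt_name='query'` argument tells sentence-transformers to prepend the model's stored query prompt, so queries and documents are embedded asymmetrically. The sketch below shows roughly what the text looks like once that prompt is applied. The template `Instruct: ...\nQuery:` is an assumption based on the published Qwen3 embedding usage examples; check the prompts shipped in the model's own configuration for the exact wording.

```python
# Assumed default task description; it matches the instruction used by the reranker below.
DEFAULT_TASK = "Given a web search query, retrieve relevant passages that answer the query"


def build_query_text(query, task=DEFAULT_TASK):
    """Prefix a query with a task instruction; documents are embedded without a prefix."""
    return f"Instruct: {task}\nQuery:{query}"


print(build_query_text("How is data stored in milvus?"))
print(build_query_text("How is data stored in milvus?",
                       task="Given a question about Milvus, retrieve FAQ entries that answer it"))
```

To try a domain-specific instruction with the real model, `encode` also accepts a `prompt` string, so something like `embedding_model.encode([text], prompt="Instruct: <your task>\nQuery:")` should work in place of `prompt_name`.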

Let's test the embedding function and check the output dimension:


```python
test_embedding = emb_text('这是一个测试')  # "This is a test" in Chinese
embedding_dim = len(test_embedding)
print(f'Embedding dimension: {embedding_dim}')
print(f'First 10 values: {test_embedding[:10]}')
```

```
Embedding dimension: 1024
First 10 values: [-0.013050967827439308, -0.07333958148956299, -0.010180089622735977, -0.056118570268154144, 0.031972434371709824, 0.04878552630543709, 0.01245422288775444, 0.04370049387216568, -0.0664646252989769, 0.0325627475976944]
```


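## Reranker Implementation

The reranker uses a cross-encoder architecture to evaluate query-document pairs. This is computationally more expensive than the bi-encoder embedding model, but it provides more fine-grained relevance scores.

Here is the complete reranking pipeline: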


```python
def format_instruction(instruction, query, doc):
  """Format instruction for reranker input"""
  if instruction is None:
    instruction = 'Given a web search query, retrieve relevant passages that answer the query'
  output = '<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}'.format(
    instruction=instruction, query=query, doc=doc
  )
  return output


def process_inputs(pairs):
  """Process inputs for reranker"""
  # Step 1: tokenize the pairs, leaving room for the prefix and suffix tokens
  tokenized = reranker_tokenizer(
    pairs,
    add_special_tokens=False,
    truncation=True,
    max_length=max_reranker_length - len(prefix_tokens) - len(suffix_tokens),
  )

  # Step 2: concatenate prefix_tokens + pair tokens + suffix_tokens
  input_ids = [prefix_tokens + ids + suffix_tokens for ids in tokenized['input_ids']]

  # Step 3: pad the batch and convert it to tensors in one call
  padded = reranker_tokenizer.pad(
    {'input_ids': input_ids}, padding=True, return_tensors='pt', max_length=max_reranker_length
  )

  return padded


@torch.no_grad()
def compute_logits(inputs):
  """Compute relevance scores using reranker"""
  # Logits at the last position predict the next token ("yes" or "no")
  batch_scores = reranker_model(**inputs).logits[:, -1, :]
  true_vector = batch_scores[:, token_true_id]
  false_vector = batch_scores[:, token_false_id]
  batch_scores = torch.stack([false_vector, true_vector], dim=1)
  batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
  # Probability assigned to "yes" is used as the relevance score
  scores = batch_scores[:, 1].exp().tolist()
  return scores


def rerank_documents(query, documents, task_instruction=None):
  """Rerank documents based on query relevance using Qwen3-Reranker

  Args:
    query: Search query
    documents: List of documents to rerank
    task_instruction: Task instruction for reranking

  Returns:
    List of (document, score) tuples sorted by relevance score
  """
  if task_instruction is None:
    task_instruction = 'Given a web search query, retrieve relevant passages that answer the query'

  # Format inputs for reranker
  pairs = [format_instruction(task_instruction, query, doc) for doc in documents]
  # Process inputs for reranker
  inputs = process_inputs(pairs)
  scores = compute_logits(inputs)

  # Combine documents with scores and sort by score (descending)
  doc_scores = list(zip(documents, scores))
  doc_scores.sort(key=lambda x: x[1], reverse=True)
  return doc_scores
```
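The scoring step in `compute_logits` is easier to follow in isolation: it takes the logits for the "yes" and "no" tokens, applies a softmax over just those two values, and keeps the probability of "yes". Here is a small standalone sketch of that arithmetic in plain Python, using made-up logit values. It also shows that the result equals a sigmoid of the logit difference.

```python
import math


def yes_probability(logit_yes, logit_no):
    """Softmax over the two logits, returning the probability mass on 'yes'."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def sigmoid(x):
    """Logistic function, used here only to check the equivalence."""
    return 1.0 / (1.0 + math.exp(-x))


# Made-up logits for three query-document pairs (assumption, not model output).
for logit_yes, logit_no in [(2.0, -1.0), (0.0, 0.0), (-3.0, 1.5)]:
    p = yes_probability(logit_yes, logit_no)
    print(f"{p:.4f} vs {sigmoid(logit_yes - logit_no):.4f}")
```

Because only the difference between the two logits matters, scores near 1.0 mean the model strongly prefers "yes", and scores near 0.0 mean it strongly prefers "no".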

## Setting Up the Milvus Vector Database

Now let's set up the vector database. For simplicity we use Milvus Lite, but the same code also works with a full Milvus deployment:


```python
milvus_client = MilvusClient(uri='./milvus_demo.db')

collection_name = 'my_rag_collection'
```



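**Deployment options:**

- **Local file (e.g. `./milvus.db`)**: uses Milvus Lite, ideal for development
- **Docker/Kubernetes**: use a server URI such as `http://localhost:19530` for production
- **Zilliz Cloud**: use the cloud endpoint and API key of the managed service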
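The three options above differ only in what you pass to `MilvusClient`. As a rough sketch, a small helper can build the keyword arguments for each case. `connection_kwargs` and the placeholder endpoint and key are hypothetical; only the `uri` and `token` parameter names come from pymilvus.

```python
def connection_kwargs(deployment, *, path="./milvus_demo.db",
                      host="localhost", port=19530,
                      endpoint=None, api_key=None):
    """Build keyword arguments for MilvusClient for one of three deployment styles."""
    if deployment == "lite":
        return {"uri": path}  # local file, handled by Milvus Lite
    if deployment == "server":
        return {"uri": f"http://{host}:{port}"}  # Docker or Kubernetes deployment
    if deployment == "cloud":
        if not endpoint or not api_key:
            raise ValueError("cloud deployments need an endpoint and an API key")
        return {"uri": endpoint, "token": api_key}  # managed service
    raise ValueError(f"unknown deployment: {deployment!r}")


print(connection_kwargs("lite"))
print(connection_kwargs("server"))
# Placeholder values only; substitute your own endpoint and key.
print(connection_kwargs("cloud", endpoint="https://your-endpoint.example", api_key="YOUR_API_KEY"))
```

With pymilvus installed, `MilvusClient(**connection_kwargs("server"))` would then connect to a local server, and the rest of the tutorial code stays the same.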

Drop any existing collection with the same name and create a new one:


```python
# Remove existing collection if it exists
if milvus_client.has_collection(collection_name):
  milvus_client.drop_collection(collection_name)
# Create new collection with our embedding dimensions
milvus_client.create_collection(
  collection_name=collection_name,
  dimension=embedding_dim,  # 1024 for Qwen3-Embedding-0.6B
  metric_type='IP',  # Inner product for similarity
  consistency_level='Strong',  # Ensure data consistency
)
```
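The collection uses inner product (`IP`) as its metric. For unit-length vectors, inner product and cosine similarity give the same value, which is why `IP` is a common choice for normalized embeddings. The sketch below checks this on toy vectors. Whether the embeddings in your setup are actually normalized is an assumption worth verifying.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity, independent of vector length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]  # toy vectors (assumption)
ip_raw = dot(a, b)                          # depends on vector length
ip_unit = dot(normalize(a), normalize(b))   # inner product after normalization
cos_sim = cosine(a, b)
print(ip_raw, ip_unit, cos_sim)
```

A quick check on real output is `math.sqrt(sum(x * x for x in test_embedding))`. If the result is not close to 1.0, creating the collection with `metric_type='COSINE'` may be the safer choice.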

## Loading Data into Milvus

Now let's process our documents and insert them into the vector database:


```python
data = []

for i, line in enumerate(tqdm(text_lines, desc='Creating embeddings')):
  data.append({'id': i, 'vector': emb_text(line), 'text': line})

milvus_client.insert(collection_name=collection_name, data=data)
```

```
Creating embeddings: 100%|██████████| 4/4 [00:02<00:00,  1.67it/s]

{'insert_count': 4, 'ids': [0, 1, 2, 3], 'cost': 0}
```

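## Enhancing RAG with Reranking

Now for the exciting part: putting it all together into a complete retrieval-augmented generation system.

### Step 1: Query and Initial Retrieval

Let's test with a common question about Milvus: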


```python
question = 'How is data stored in milvus?'

# Perform initial dense retrieval to get top candidates
search_res = milvus_client.search(
  collection_name=collection_name,
  data=[emb_text(question, is_query=True)],  # Use query prompt
  limit=10,  # Get top 10 candidates for reranking
  search_params={'metric_type': 'IP', 'params': {}},
  output_fields=['text'],  # Return the actual text content
)

print(f'Found {len(search_res[0])} initial candidates')
```

```
Found 4 initial candidates
```


```python
[entry.keys() for entry in search_res[0]]
```

```
[KeysView({'id': 3, 'distance': 0.7633353471755981}),
 KeysView({'id': 0, 'distance': 0.6970409154891968}),
 KeysView({'id': 2, 'distance': 0.5539933443069458}),
 KeysView({'id': 1, 'distance': 0.5401155948638916, 'entity': {'text': '---\nid: troubleshooting.md\nsummary: Learn about common issues you may encounter with Milvus and how to overcome them.\ntitle: Troubleshooting\n---\n# Troubleshooting\nThis page lists common issues that may occur when running Milvus, as well as possible troubleshooting tips. Issues on this page fall into the following categories:\n\n- [Boot issues](#boot_issues)\n- [Runtime issues](#runtime_issues)\n- [API issues](#api_issues)\n- [etcd crash issues](#etcd_crash_issues)\n\n\n  ## Boot issues\n\n  Boot errors are usually fatal. Run the following command to view error details:\n\n  ```\n  $ docker logs <your milvus container id>\n  ```\n\n\n  ## Runtime issues\n\n  Errors that occur during runtime may cause service breakdown. To troubleshoot this issue, chec
```

### Step 2: Reranking for Precision

Extract the candidate documents and apply the reranker:


```python
# Extract candidate documents
candidate_docs = [res['entity']['text'] for res in search_res[0]]

# Rerank using Qwen3-Reranker
print('Reranking documents...')
reranked_docs = rerank_documents(question, candidate_docs)

# Select top 3 after reranking
top_reranked_docs = reranked_docs[:3]
print(f'Selected top {len(top_reranked_docs)} documents after reranking')
```

```
You're using a Qwen2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Reranking documents...
Selected top 3 documents after reranking
```


### Step 3: Compare the Results

Let's see how reranking changes the results:


```python
print('Reranked results (top 3):')
print(json.dumps(top_reranked_docs, ensure_ascii=False, indent=2))

print('================================================================================')
print('Original embedding-based results (top 3):')
for entry in search_res[0][:3]:
  del entry['entity']
  print(entry)
```

```
Reranked results (top 3):
[
  [
    "---\nid: operational_faq.md\nsummary: Find answers to commonly asked questions about operations in Milvus.\ntitle: Operational FAQ\n---\n\n# Operational FAQ\n\n<!-- TOC -->\n\n\n<!-- /TOC -->\n\n#### What if I failed to pull the Milvus Docker image from Docker Hub?\n\nIf you failed to pull the Milvus Docker image from Docker Hub, try adding other registry mirrors. \n\nUsers from Mainland China can add the URL \"https://registry.docker-cn.com\" to the registry-mirrors array in **/etc.docker/daemon.json**.\n\n```\n{\n  \"registry-mirrors\": [\"https://registry.docker-cn.com\"]\n}\n```\n\n#### Is Docker the only way to install and run Milvus?\n\nDocker is an efficient way to deploy Milvus, but not the only way. You can also deploy Milvus from source code. This requires Ubuntu (18.04 or higher) or CentOS (7 or higher). See [Building Milvus from Source Code](https://github.com/milvus-io/milvus#build-milvus-from-source-code) for more information.\n\n#### 
```

Compared with embedding similarity scores, reranker scores are typically more discriminative: relevant documents land closer to 1.0.
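One way to make the comparison easier to read is to list how each document's rank moved between the two stages. The sketch below assumes you carry document IDs through reranking, which the `rerank_documents` function above does not do by default. `rank_changes` is a hypothetical helper and the reranked order is illustrative, not taken from the run above.

```python
def rank_changes(original_ids, reranked_ids):
    """Return (doc_id, old_rank, new_rank) for each reranked document, ranks starting at 1."""
    old_rank = {doc_id: pos for pos, doc_id in enumerate(original_ids, start=1)}
    return [
        (doc_id, old_rank[doc_id], new_pos)
        for new_pos, doc_id in enumerate(reranked_ids, start=1)
    ]


original_order = [3, 0, 2, 1]  # IDs in order of embedding similarity
reranked_order = [0, 3, 2]     # illustrative top 3 after reranking (assumption)

for doc_id, old, new in rank_changes(original_order, reranked_order):
    print(f"doc {doc_id}: rank {old} -> {new}")
```

Documents that move up by several positions are the cases where the cross-encoder disagrees most with the embedding model, and they are usually the most informative ones to inspect by hand.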


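### Step 4: Generate the Final Response

Now let's use the retrieved context to generate a comprehensive answer.

First, convert the retrieved documents into a single string.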
