# Data Preparation for Fine-tuning

## 0. Installation

In [None]:
%pip install -U datasets

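Suppose we want to fine-tune a model for tasks in the financial domain.

We found an open-source dataset that may be useful: financial-qa-10k.

Let's see how to properly prepare this dataset for fine-tuning.

The raw dataset has the following structure:
- 5 columns: `'question'`, `'answer'`, `'context'`, `'ticker'`, and `'filing'`;
- 7,000 rows in total.

Note: users in mainland China can download the dataset to a local directory with the following command.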

```powershell
huggingface-cli download `
  --repo-type dataset `
  --resume-download `
  virattt/financial-qa-10K `
  --local-dir ./.data/financial-qa-10K `
  --local-dir-use-symlinks False
```

In [None]:
import os
from datasets import load_dataset
try:
    # Running as a script
    current_dir = os.path.dirname(os.path.abspath(__file__))
except NameError:
    # Notebook / interactive session: fall back to the current working directory
    current_dir = os.getcwd()

# Infer the project root (three levels above the current directory)
project_root = os.path.dirname(os.path.dirname(os.path.dirname(current_dir)))

# Build the path to the downloaded dataset
data_path = os.path.join(project_root, ".data", "financial-qa-10K")
print("Loading data from:", data_path)
ds = load_dataset(data_path, split="train")
ds

Loading data from: e:\git\FlagEmbedding-cn


Generating train split:   0%|          | 0/7000 [00:00<?, ? examples/s]

Dataset({
    features: ['question', 'answer', 'context', 'ticker', 'filing'],
    num_rows: 7000
})
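The three nested `os.path.dirname` calls in the loading cell climb three directory levels from the current directory. A minimal sketch of the same idea with `pathlib` follows; the directory layout is hypothetical and chosen only for illustration, not the tutorial's real location.

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical location of the notebook's directory (illustrative only).
current_dir = PurePosixPath("/work/repo/docs/tutorials/finetune")

# parents[0] is the direct parent, so parents[2] is three levels up.
project_root = current_dir.parents[2]
data_path = project_root / ".data" / "financial-qa-10K"

# The same root obtained by applying dirname three times.
nested = posixpath.dirname(posixpath.dirname(posixpath.dirname(str(current_dir))))
print(project_root, data_path)
```

If the notebook lives at a different depth in your checkout, the index passed to `parents` (or the number of `dirname` calls) has to change accordingly.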

## 1. Data Format for Fine-tuning

Arrange the dataset in the following format:

``` python
{"query": str, "pos": List[str], "neg":List[str], "pos_scores": List[int], "neg_scores": List[int], "prompt": str, "type": str}
```

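- `query`: the query text;
- `pos`: a list of positive texts;
- `neg`: a list of negative texts;
- `pos_scores` and `neg_scores`: only needed for knowledge distillation; otherwise both fields can be omitted:
  - `pos_scores`: a list of scores for the `query` paired with each text in `pos`;
  - `neg_scores`: a list of scores for the `query` paired with each text in `neg`;
- `prompt`: the prompt used for the query; it overrides `query_instruction_for_retrieval`;
- `type`: used for the `bge-en-icl` model; possible values include `"normal"`, `"symmetric_class"`, `"symmetric_clustering"`, and so on.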
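To make the format concrete, here is a small sketch that checks one record against the field descriptions above. The helper `validate_record`, the choice of which keys count as required, and the toy record are assumptions made for illustration; the actual training script may accept or require a different set of fields.

```python
from typing import Any, Dict, List

# Assumed minimum set of keys; the score, prompt, and type fields are treated as optional.
REQUIRED_KEYS = ("query", "pos", "neg")

def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one training record (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in record:
            problems.append(f"missing key: {key}")
    if not isinstance(record.get("query", ""), str):
        problems.append("query must be a string")
    for key in ("pos", "neg"):
        value = record.get(key, [])
        if not isinstance(value, list) or not all(isinstance(t, str) for t in value):
            problems.append(f"{key} must be a list of strings")
    # Scores matter only for knowledge distillation; when present, each
    # score list must line up one-to-one with its text list.
    for text_key, score_key in (("pos", "pos_scores"), ("neg", "neg_scores")):
        scores = record.get(score_key)
        if scores is not None and len(scores) != len(record.get(text_key, [])):
            problems.append(f"{score_key} length does not match {text_key}")
    return problems

# Made-up record, for illustration only.
example = {
    "query": "toy question",
    "pos": ["toy relevant passage"],
    "neg": ["toy unrelated passage 1", "toy unrelated passage 2"],
    "prompt": "Represent this sentence for searching relevant passages: ",
}
print(validate_record(example))
```

Running a check like this over a handful of rows before training can catch a `pos` column that was left as a plain string instead of a list.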

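If a query has no negative texts, you can randomly sample some from the entire corpus to use as negatives.

We take the `'question'` and `'context'` columns of the raw dataset as `query` and `pos` (the positive text), renaming them accordingly. We also add an `'id'` column for use in evaluation later.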

In [8]:
ds = ds.select_columns(column_names=["question", "context"])
ds = ds.rename_column("question", "query")
ds = ds.rename_column("context", "pos")
ds = ds.add_column("id", [str(i) for i in range(len(ds))])
ds[0]

{'query': 'What area did NVIDIA initially focus on before expanding to other computationally intensive fields?',
 'pos': 'Since our original focus on PC graphics, we have expanded to several other large and important computationally intensive fields.',
 'id': '0'}

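Negative examples matter a great deal when training embedding models. Our dataset does not come with negative texts, so we simply sample some at random from the whole corpus.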

In [9]:
import numpy as np

np.random.seed(520)
neg_num = 10

def str_to_lst(data):
    data["pos"] = [data["pos"]]
    return data

# Sample negative texts for every row
new_col = []
for i in range(len(ds)):
    ids = np.random.randint(0, len(ds), size=neg_num)
    # Resample until the row's own index is not among the negatives
    while i in ids:
        ids = np.random.randint(0, len(ds), size=neg_num)
    neg = [ds[j.item()]["pos"] for j in ids]
    new_col.append(neg)
ds = ds.add_column("neg", new_col)

# Convert the value of 'pos' into a list
ds = ds.map(str_to_lst)

Map:   0%|          | 0/7000 [00:00<?, ? examples/s]
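The sampling loop above only excludes the row's own index, and `np.random.randint` draws with replacement. As a result, a negative can appear twice in one list, and if the same passage occurs in several rows it could be drawn as a negative for its own query. The sketch below shows one stricter alternative using only the standard library; `sample_negatives` and the toy corpus are hypothetical and not part of the original tutorial.

```python
import random
from typing import List

def sample_negatives(corpus: List[str], positive: str, k: int, rng: random.Random) -> List[str]:
    """Sample k distinct texts from the corpus that differ from the positive text."""
    # dict.fromkeys drops duplicate passages while keeping their order,
    # which keeps the result reproducible for a fixed seed.
    candidates = [text for text in dict.fromkeys(corpus) if text != positive]
    if len(candidates) < k:
        raise ValueError("not enough distinct candidates to sample from")
    return rng.sample(candidates, k)

rng = random.Random(520)
toy_corpus = ["passage A", "passage B", "passage A", "passage C", "passage D"]
negatives = sample_negatives(toy_corpus, positive="passage A", k=2, rng=rng)
print(negatives)
```

Random negatives are easy to produce but tend to be easy for the model to reject, so this is a baseline rather than a tuned sampling strategy.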

Finally, we add the prompt used for queries. At inference time, this prompt serves as the `query_instruction_for_retrieval`.

In [10]:
instruction = "Represent this sentence for searching relevant passages: "
ds = ds.add_column("prompt", [instruction]*len(ds))
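As a rough illustration of what the instruction is for: it is combined with the query text before the query is encoded. The sketch below assumes plain string concatenation. How the prompt is actually applied depends on the model and on the library's formatting settings, so the helper `build_query_text` should be read as hypothetical.

```python
instruction = "Represent this sentence for searching relevant passages: "

def build_query_text(query: str, prompt: str = instruction) -> str:
    """Prepend the retrieval instruction to a raw query (assumed format)."""
    return f"{prompt}{query}"

query = (
    "What area did NVIDIA initially focus on before expanding "
    "to other computationally intensive fields?"
)
print(build_query_text(query))
```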

A row of the dataset now looks like this:

In [11]:
ds[0]

{'query': 'What area did NVIDIA initially focus on before expanding to other computationally intensive fields?',
 'pos': ['Since our original focus on PC graphics, we have expanded to several other large and important computationally intensive fields.'],
 'id': '0',
 'neg': ['Kroger expects that its value creation model will deliver total shareholder return within a target range of 8% to 11% over time.',
  'CSB purchased First Mortgages of $2.9 billion during 2023.',
  'See Note 13 to our Consolidated Financial Statements for information on certain legal proceedings for which there are contingencies.',
  'Diluted earnings per share were $16.69 in fiscal 2022 compared to $15.53 in fiscal 2021.',
  'In the year ended December 31, 2023, Total net sales and revenue increased primarily due to: (1) increased net wholesale volumes primarily due to increased sales of crossover vehicles and full-size pickup trucks, partially offset by decreased sales of mid-size pickup trucks; (2) favorable Pri

Next, we split the dataset into a training set and a test set.

In [12]:
split = ds.train_test_split(test_size=0.1, shuffle=True, seed=520)
train = split["train"]
test = split["test"]
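Conceptually, a seeded 90/10 split shuffles the row indices once and cuts off the test portion. The plain-Python sketch below illustrates this; `split_indices` is a hypothetical helper, and its permutation will not match the one produced by `train_test_split`, only the resulting sizes.

```python
import random
from typing import List, Tuple

def split_indices(n_rows: int, test_size: float, seed: int) -> Tuple[List[int], List[int]]:
    """Shuffle row indices with a fixed seed and cut off the test portion."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(7000, test_size=0.1, seed=520)
print(len(train_idx), len(test_idx))  # 6300 700
```

Fixing the seed keeps the split reproducible, so the test queries stay out of the training data across reruns.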

The data is now ready to be saved for fine-tuning:

In [13]:
train.to_json("ft_data/training.json")

Creating json from Arrow format:   0%|          | 0/7 [00:00<?, ?ba/s]

16583481
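By default, `Dataset.to_json` writes JSON Lines (one JSON object per line) and returns the number of bytes written, which is the integer shown in the output above. A minimal sketch for reading such a file back and spot-checking it with the standard library follows; it writes its own small temporary file with made-up records so that it runs on its own.

```python
import json
import os
import tempfile

# Made-up records standing in for training rows.
records = [
    {"query": "toy question 1", "pos": ["toy passage 1"], "neg": ["toy passage 2"]},
    {"query": "toy question 2", "pos": ["toy passage 2"], "neg": ["toy passage 1"]},
]

# Write one JSON object per line (the JSON Lines layout).
path = os.path.join(tempfile.mkdtemp(), "training.json")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back line by line and inspect the first record's keys.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), sorted(loaded[0]))
```

Note that calling `json.load` on the whole file would fail for this layout, because the file holds several JSON documents rather than a single array.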

## 2. Test Data for Evaluation

The last step is to build the test dataset for evaluation.

In [14]:
test

Dataset({
    features: ['query', 'pos', 'id', 'neg', 'prompt'],
    num_rows: 700
})

First, select the columns for the queries:

In [15]:
queries = test.select_columns(column_names=["id", "query"])
queries = queries.rename_column("query", "text")
queries[0]

{'id': '1289',
 'text': 'How does Starbucks recognize the interest and penalties related to income tax matters on their financial statements?'}

Then select the columns for the corpus:

In [16]:
corpus = ds.select_columns(column_names=["id", "pos"])
corpus = corpus.rename_column("pos", "text")
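Because `pos` was converted to a single-element list earlier, the `text` column of `corpus` holds lists rather than plain strings. If downstream code expects one string per document, a small flattening step helps. The sketch below uses plain dictionaries with made-up rows in place of dataset rows, and `flatten_text` is a hypothetical helper.

```python
from typing import Any, Dict

# Made-up rows with the same shape as the corpus built above.
corpus_rows = [
    {"id": "0", "text": ["toy passage zero"]},
    {"id": "1", "text": ["toy passage one"]},
]

def flatten_text(row: Dict[str, Any]) -> Dict[str, Any]:
    """Turn a single-element list in 'text' into a plain string."""
    text = row["text"]
    if isinstance(text, list):
        if len(text) != 1:
            raise ValueError(f"expected exactly one passage for id {row['id']}")
        text = text[0]
    return {"id": row["id"], "text": text}

flat_corpus = [flatten_text(row) for row in corpus_rows]
print(flat_corpus[0])
```

Whether to flatten at save time or at load time is a design choice; what matters is that the evaluation code and the saved files agree on the type of `text`.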

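Finally, build the qrels (query-document relevance labels), which indicate which corpus documents are relevant to each query. Because each test row's `id` identifies both the query and its positive passage, `qid` and `docid` are identical here.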

In [17]:
qrels = test.select_columns(["id"])
qrels = qrels.rename_column("id", "qid")
qrels = qrels.add_column("docid", list(test["id"]))
qrels = qrels.add_column("relevance", [1]*len(test))
qrels[0]

Flattening the indices:   0%|          | 0/700 [00:00<?, ? examples/s]

{'qid': '1289', 'docid': '1289', 'relevance': 1}
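To show how qrels are typically consumed, here is a sketch of recall@k computed from a qrels table and per-query ranked document ids. The function `recall_at_k` and the toy rankings are illustrative assumptions; a real evaluation would obtain the rankings from the retriever.

```python
from collections import defaultdict
from typing import Dict, List

def recall_at_k(qrels: List[dict], rankings: Dict[str, List[str]], k: int) -> float:
    """Average fraction of relevant documents found in the top-k results per query."""
    relevant = defaultdict(set)
    for row in qrels:
        if row["relevance"] > 0:
            relevant[row["qid"]].add(row["docid"])
    scores = []
    for qid, docs in relevant.items():
        top_k = set(rankings.get(qid, [])[:k])
        scores.append(len(docs & top_k) / len(docs))
    return sum(scores) / len(scores) if scores else 0.0

toy_qrels = [
    {"qid": "1289", "docid": "1289", "relevance": 1},
    {"qid": "42", "docid": "42", "relevance": 1},
]
# Made-up retrieval results: ranked document ids for each query id.
toy_rankings = {"1289": ["7", "1289", "3"], "42": ["5", "6", "42"]}
print(recall_at_k(toy_qrels, toy_rankings, k=2))  # 0.5
```

With exactly one relevant document per query, as in this dataset, recall@k is simply the share of queries whose positive passage appears in the top k.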

Save the test queries, the corpus, and the qrels.

In [18]:
queries.to_json("ft_data/test_queries.jsonl")
corpus.to_json("ft_data/corpus.jsonl")
qrels.to_json("ft_data/test_qrels.jsonl")

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/7 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

30574
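Before moving on to evaluation, it can be worth confirming that the three files are consistent with each other: every `qid` in the qrels should exist among the queries, and every `docid` should exist in the corpus. A small sketch with toy rows follows; `find_dangling_qrels` is a hypothetical helper.

```python
from typing import Dict, List

def find_dangling_qrels(queries: List[Dict], corpus: List[Dict], qrels: List[Dict]) -> List[Dict]:
    """Return qrels rows whose qid or docid is missing from the queries or corpus."""
    query_ids = {row["id"] for row in queries}
    doc_ids = {row["id"] for row in corpus}
    return [
        row for row in qrels
        if row["qid"] not in query_ids or row["docid"] not in doc_ids
    ]

# Made-up rows with the same keys as the saved files.
toy_queries = [{"id": "1", "text": "toy question"}]
toy_corpus = [{"id": "0", "text": "toy passage 0"}, {"id": "1", "text": "toy passage 1"}]
toy_qrels = [
    {"qid": "1", "docid": "1", "relevance": 1},
    {"qid": "9", "docid": "1", "relevance": 1},  # qid 9 has no matching query
]
print(find_dangling_qrels(toy_queries, toy_corpus, toy_qrels))
```

An empty result means every relevance label points at a query and a document that actually exist.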