# Fine-Tuning a Language Model with Hugging Face Transformers: Question Answering

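We have already learned how to use Pipeline to load a pretrained model that supports question answering. This tutorial shows how to fine-tune a model for that task.

**Note: the fine-tuned model still answers questions by extracting a substring of the context, not by generating new text.**

### Example of a model performing question answering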

![Widget inference representing the QA task](docs/images/question_answering.png)

In [16]:
import subprocess
import os

os.environ['HF_HOME'] = '/mnt/new_volume/hf'  # Custom directory where transformers downloads and caches models
os.environ['HF_HUB_CACHE'] = '/mnt/new_volume/hf/hub'

result = subprocess.run('bash -c "source /etc/network_turbo && env | grep proxy"', shell=True, capture_output=True, text=True)
output = result.stdout
for line in output.splitlines():
    if '=' in line:
        var, value = line.split('=', 1)
        os.environ[var] = value

In [17]:
# Adjust the following key parameters to match the model you use and your available GPU resources
squad_v2 = False
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

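## Downloading the dataset

In this tutorial we use the [Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer/).

### The SQuAD dataset

The **Stanford Question Answering Dataset (SQuAD)** is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.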

In [18]:
from datasets import load_dataset

In [21]:
datasets = load_dataset("squad_v2" if squad_v2 else "squad",cache_dir="data/squad")

FileNotFoundError: Couldn't find file at https://huggingface.co/datasets/squad/resolve/main/data/squad/downloads/b8bb19735e1bb591510a01cc032f4c9f969bc0eeb081ae1b328cd306f3b24008

The `datasets` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key per split (here, training and validation).

In [4]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

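#### Comparing datasets

Compared with the Yelp reviews dataset used in the quick start, the SQuAD training and validation sets add columns for the context, the question, and the answers to the question:

**YelpReviewFull Dataset:**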

```json

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})
```

In [5]:
datasets["train"][0]

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}

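#### Composing the answer from the context

We can see that answers are represented by their start position in the text (here, character offset 515) together with their full text, which is a substring of the context shown above.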
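
To make this representation concrete, the sketch below checks that slicing the context at `answer_start` recovers the answer text. The `record` dict is a shortened, hand-written stand-in in the SQuAD layout (an assumption for illustration, not a row taken from the dataset), and `extract_answer` is a hypothetical helper.

```python
# Hypothetical record in the SQuAD layout; the context is shortened for illustration.
record = {
    "context": "It is a replica of the grotto at Lourdes, France where the Virgin Mary "
               "reputedly appeared to Saint Bernadette Soubirous in 1858.",
    "answers": {"text": ["Saint Bernadette Soubirous"], "answer_start": []},
}

# Compute the character offset instead of hard-coding it, then store it as SQuAD would.
answer_text = record["answers"]["text"][0]
record["answers"]["answer_start"].append(record["context"].find(answer_text))


def extract_answer(rec):
    """Slice the context using the character offset and the answer length."""
    start = rec["answers"]["answer_start"][0]
    end = start + len(rec["answers"]["text"][0])
    return rec["context"][start:end]


print(extract_answer(record))
```

The same slice, `context[answer_start : answer_start + len(text)]`, is what the preprocessing later relies on when it converts character positions into token positions.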

In [None]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

In [7]:
show_random_elements(datasets["train"])

Unnamed: 0,id,title,context,question,answers
0,572b76f734ae481900deae32,Idealism,"Kierkegaard criticised Hegel's idealist philosophy in several of his works, particularly his claim to a comprehensive system that could explain the whole of reality. Where Hegel argues that an ultimate understanding of the logical structure of the world is an understanding of the logical structure of God's mind, Kierkegaard asserting that for God reality can be a system but it cannot be so for any human individual because both reality and humans are incomplete and all philosophical systems imply completeness. A logical system is possible but an existential system is not. ""What is rational is actual; and what is actual is rational"". Hegel's absolute idealism blurs the distinction between existence and thought: our mortal nature places limits on our understanding of reality;","Even though Kierkegaard does not believe in the possibility of an existential system of reality, what sort of system can exist?","{'text': ['logical'], 'answer_start': [517]}"
1,570629cf75f01819005e7a0e,"Atlantic_City,_New_Jersey","By the late 1960s, many of the resort's once great hotels were suffering from embarrassing vacancy rates. Most of them were either shut down, converted to cheap apartments, or converted to nursing home facilities by the end of the decade. Prior to and during the advent of legalized gaming, many of these hotels were demolished. The Breakers, the Chelsea, the Brighton, the Shelburne, the Mayflower, the Traymore, and the Marlborough-Blenheim were demolished in the 1970s and 1980s. Of the many pre-casino resorts that bordered the boardwalk, only the Claridge, the Dennis, the Ritz-Carlton, and the Haddon Hall survive to this day as parts of Bally's Atlantic City, a condo complex, and Resorts Atlantic City. The old Ambassador Hotel was purchased by Ramada in 1978 and was gutted to become the Tropicana Casino and Resort Atlantic City, only reusing the steelwork of the original building. Smaller hotels off the boardwalk, such as the Madison also survived.",Who purchased the old Ambassador Hotel in 1978?,"{'text': ['Ramada'], 'answer_start': [753]}"
2,572e9100c246551400ce434e,Endangered_Species_Act,"The Lacey Act of 1900 was the first federal law that regulated commercial animal markets. It prohibited interstate commerce of animals killed in violation of state game laws, and covered all fish and wildlife and their parts or products, as well as plants. Other legislation followed, including the Migratory Bird Conservation Act of 1929, a 1937 treaty prohibiting the hunting of right and gray whales, and the Bald Eagle Protection Act of 1940. These later laws had a low cost to society–the species were relatively rare–and little opposition was raised.",What was the first federal law that regulated wildlife commerce?,"{'text': ['Lacey Act of 1900'], 'answer_start': [4]}"
3,56f9551e9e9bad19000a0830,List_of_numbered_streets_in_Manhattan,"40°48′32″N 73°57′14″W﻿ / ﻿40.8088°N 73.9540°W﻿ / 40.8088; -73.9540 122nd Street is divided into three noncontiguous segments, E 122nd Street, W 122nd Street, and W 122nd Street Seminary Row, by Marcus Garvey Memorial Park and Morningside Park.",Which park divides 122nd Street along with Marcus Garvey Memorial Park?,"{'text': ['Morningside Park'], 'answer_start': [226]}"
4,570d4103b3d812140066d5f6,Franco-Prussian_War,"When the war had begun, European public opinion heavily favored the Germans; many Italians attempted to sign up as volunteers at the Prussian embassy in Florence and a Prussian diplomat visited Giuseppe Garibaldi in Caprera. Bismarck's demand for the return of Alsace caused a dramatic shift in that sentiment in Italy, which was best exemplified by the reaction of Garibaldi soon after the revolution in Paris, who told the Movimento of Genoa on 7 September 1870 that ""Yesterday I said to you: war to the death to Bonaparte. Today I say to you: rescue the French Republic by every means."" Garibaldi went to France and assumed command of the Army of the Vosges, with which he operated around Dijon till the end of the war.","To whom is the quote, ""Rescue the French Republic by every means"" attributed?","{'text': ['Garibaldi'], 'answer_start': [366]}"
5,5727ff4d2ca10214002d9afb,Dominican_Order,"Concerning humanity as the image of Christ, English Dominican spirituality concentrated on the moral implications of image-bearing rather than the philosophical foundations of the imago Dei. The process of Christ's life, and the process of image-bearing, amends humanity to God's image. The idea of the ""image of God"" demonstrates both the ability of man to move toward God (as partakers in Christ's redeeming sacrifice), and that, on some level, man is always an image of God. As their love and knowledge of God grows and is sanctified by faith and experience, the image of God within man becomes ever more bright and clear.",The idea of the image of God allows man to do what?,"{'text': ['move toward God'], 'answer_start': [358]}"
6,572a3b543f37b319004787f1,"New_Haven,_Connecticut","New Haven's greatest culinary claim to fame may be its pizza, which has been claimed to be among the best in the country, or even in the world. New Haven-style pizza, called ""apizza"" (pronounced ah-BEETS, [aˈpitts] in the original Italian dialect), made its debut at the iconic Frank Pepe Pizzeria Napoletana (known as Pepe's) in 1925. Apizza is baked in coal- or wood-fired brick ovens, and is notable for its thin crust. Apizza may be red (with a tomato-based sauce) or white (with a sauce of garlic and olive oil), and pies ordered ""plain"" are made without the otherwise customary mozzarella cheese (originally smoked mozzarella, known as ""scamorza"" in Italian). A white clam pie is a well-known specialty of the restaurants on Wooster Street in the Little Italy section of New Haven, including Pepe's and Sally's Apizza (which opened in 1938). Modern Apizza on State Street, which opened in 1934, is also well-known.","In general what are ""apizza"" known for?","{'text': ['its thin crust'], 'answer_start': [407]}"
7,56e7b2f200c9c71400d77512,Arena_Football_League,"After its return in 2010, the AFL had its national television deal with the NFL Network for a weekly Friday night game. All AFL games not on the NFL Network could be seen for free online, provided by Ustream.",What cable television network signed a broadcast deal with the AFL in 2010?,"{'text': ['NFL Network'], 'answer_start': [76]}"
8,570a9a214103511400d59867,Houston,"According to the 2010 Census, whites made up 51% of Houston's population; 26% of the total population were non-Hispanic whites. Blacks or African Americans made up 25% of Houston's population. American Indians made up 0.7% of the population. Asians made up 6% (1.7% Vietnamese, 1.3% Chinese, 1.3% Indian, 0.9% Pakistani, 0.4% Filipino, 0.3% Korean, 0.1% Japanese), while Pacific Islanders made up 0.1%. Individuals from some other race made up 15.2% of the city's population, of which 0.2% were non-Hispanic. Individuals from two or more races made up 3.3% of the city. At the 2000 Census, there were 1,953,631 people and the population density was 3,371.7 people per square mile (1,301.8/km²). The racial makeup of the city was 49.3% White, 25.3% African American, 5.3% Asian, 0.7% American Indian, 0.1% Pacific Islander, 16.5% from some other race, and 3.1% from two or more races. In addition, Hispanics made up 37.4% of Houston's population while non-Hispanic whites made up 30.8%, down from 62.4% in 1970.",What percentage of Houston's population is African-American?,"{'text': ['25%'], 'answer_start': [164]}"
9,56e1c9a9cd28a01900c67b88,Communications_in_Somalia,"Broadband wireless services were offered by both dial up and non-dial up ISPs in major cities, such as Mogadishu, Bosaso, Hargeisa, Galkayo and Kismayo. Pricing ranged from $150 to $300 a month for unlimited internet access, with bandwidth rates of 64 kbit/s up and down. The main patrons of these wireless services were scholastic institutions, corporations, and UN, NGO and diplomatic missions. Mogadishu had the biggest subscriber base nationwide and was also the headquarters of the largest wireless internet services, among which were Dalkom (Wanaag HK), Orbit, Unitel and Webtel.",What is another name for Dalcom?,"{'text': ['Wanaag HK'], 'answer_start': [548]}"


## Preprocessing the data

In [8]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

The following assertion ensures that our tokenizer is a fast tokenizer (implemented in Rust, with advantages in both speed and functionality).

In [9]:
import transformers
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

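You can check the big table of models to see which model types have a fast tokenizer available and which do not.

You can call this tokenizer directly on two sentences (one for the question, one for the context):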

In [10]:
tokenizer("What is your name?", "My name is Sylvain.")

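# The exact arguments accepted by a tokenizer in the transformers library can vary by tokenizer type,
# but common tokenizers (for example PreTrainedTokenizerFast) generally accept the following main arguments:

# text: the text or list of texts to encode. This is the tokenizer's main input.
# text_pair: the second text (or list of texts) of a pair, e.g. the context that goes with a question in question answering.
# max_length: the maximum length of the encoded text. Longer inputs are truncated or otherwise handled.
# padding: whether (and how) to pad so that all inputs have the same length.
# truncation: whether (and how) to truncate so that the length stays within max_length.
# return_tensors: the type of tensors to return, e.g. "pt" for PyTorch tensors or "tf" for TensorFlow tensors.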

{'input_ids': [101, 2054, 2003, 2115, 2171, 1029, 102, 2026, 2171, 2003, 25353, 22144, 2378, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

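### Advanced tokenizer operations

One issue specific to question answering preprocessing is how to handle very long documents.

In other tasks we usually truncate documents that are longer than the model's maximum sequence length, but here removing part of the context could mean losing the answer we are looking for.

To handle this, we allow one (long) example in the dataset to produce several input features, each shorter than the model's maximum length (or the hyperparameter we set).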

In [11]:
# The maximum length of a feature (question and context)
max_length = 384 
# The allowed overlap between two parts of the context when it needs to be split.
doc_stride = 128 

#### Handling text that exceeds the maximum length

Below, we find an example in the training set that is longer than the maximum length (384):

In [12]:

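# enumerate is a Python built-in that walks over an iterable (list, tuple, string, ...)
# and yields each index together with its element.
for i, example in enumerate(datasets["train"]):
    # On each iteration, encode the question and context of the current example with the
    # tokenizer and measure the number of input tokens. Stop as soon as it exceeds 384.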
    if len(tokenizer(example["question"], example["context"])["input_ids"]) > 384:
        break
example = datasets["train"][i]

In [13]:
len(tokenizer(example["question"], example["context"])["input_ids"])

396

In [14]:
len(tokenizer(example["question"],
              example["context"],
              max_length=max_length,
              truncation="only_second")["input_ids"])
# truncation="only_second" selects the truncation strategy:
# only the second input (the context) is truncated, never the question.
# This is a common choice for question answering because it keeps the question intact.

384

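#### Truncation strategies

- Truncate the context only and discard the overflow: `truncation="only_second"` (the question is kept intact)
- Truncate the context only and keep the overflow as additional features: also set `return_overflowing_tokens=True` and `stride`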


In [15]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=doc_stride
)

# return_overflowing_tokens=True: when the encoded text is longer than max_length,
# the tokens that do not fit are returned as additional features instead of being dropped.

# stride=doc_stride: the number of tokens shared between two consecutive features.
# The overlap makes it less likely that an answer is cut in half at a feature boundary.

# The resulting tokenized_example therefore holds one encoding per feature,
# each of which fits within the model's input limit.

With this strategy, the tokenizer returns several `input_ids` lists.

In [16]:
[len(x) for x in tokenized_example["input_ids"]]

# [len(x) for x in ...] is a list comprehension that builds a list with the number of tokens in each feature.

[384, 157]
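
The two lengths can be reproduced with a little arithmetic. The sketch below is a simplified model of the sliding window, not the tokenizer's actual implementation. `feature_lengths` is a hypothetical helper, and the 17-token overhead (14 question tokens plus `[CLS]` and two `[SEP]` tokens) is inferred from the offset and sequence-ID outputs printed in this notebook.

```python
def feature_lengths(num_context_tokens, overhead, max_length, stride):
    """Return the length of each feature when a long context is split
    into overlapping windows (simplified model, for illustration only)."""
    capacity = max_length - overhead      # context tokens that fit in one feature
    if stride >= capacity:
        raise ValueError("stride must be smaller than the per-feature context capacity")
    step = capacity - stride              # each new window re-uses `stride` tokens
    lengths, start = [], 0
    while True:
        end = min(start + capacity, num_context_tokens)
        lengths.append(end - start + overhead)
        if end == num_context_tokens:
            break
        start += step
    return lengths


# 396 total tokens minus 17 overhead tokens leaves 379 context tokens.
print(feature_lengths(396 - 17, overhead=17, max_length=384, stride=128))
```

Under these assumptions the first feature is full (384 tokens), and the second holds the 128 overlapping tokens plus the 12 context tokens that did not fit, plus the 17 overhead tokens, for 157 in total.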

Decoding the two input features shows the overlapping part:

In [17]:
for x in tokenized_example["input_ids"][:2]:
    print(tokenizer.decode(x))

# tokenized_example["input_ids"][:2] is the first two elements of the list of token IDs, i.e. the first two features.

[CLS] how many wins does the notre dame men's basketball team have? [SEP] the men's basketball team has over 1, 600 wins, one of only 12 schools who have reached that mark, and have appeared in 28 ncaa tournaments. former player austin carr holds the record for most points scored in a single game of the tournament with 61. although the team has never won the ncaa tournament, they were named by the helms athletic foundation as national champions twice. the team has orchestrated a number of upsets of number one ranked teams, the most notable of which was ending ucla's record 88 - game winning streak in 1974. the team has beaten an additional eight number - one teams, and those nine wins rank second, to ucla's 10, all - time in wins against the top team. the team plays in newly renovated purcell pavilion ( within the edmund p. joyce center ), which reopened for the beginning of the 2009 – 2010 season. the team is coached by mike brey, who, as of the 2014 – 15 season, his fifteenth at notr

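#### Using offset_mapping to map tokens back to the original text

Setting `return_offsets_mapping=True` makes the tokenizer return, for every token in each feature produced by the split, the start and end character positions of that token in the original text.

As shown below, the first token (`[CLS]`) has (0, 0) as its start and end characters because it does not correspond to any part of the question or context, and the second token corresponds to characters 0 to 3 of the question.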

In [18]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=doc_stride
)
print(tokenized_example["offset_mapping"][0][:100])

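# return_overflowing_tokens=True and return_offsets_mapping=True
# tell the tokenizer to return the overflowing tokens and the offset mapping, respectively.

# stride=doc_stride: the number of tokens shared between two consecutive features.

# tokenized_example["offset_mapping"][0][:100] takes the offset mapping of the first feature and prints its first 100 entries.

# The offset mapping links each encoded token back to the original text:
# it gives the start and end character positions of every token, so printing the first 100 entries is a quick way to check the mapping.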

[(0, 0), (0, 3), (4, 8), (9, 13), (14, 18), (19, 22), (23, 28), (29, 33), (34, 37), (37, 38), (38, 39), (40, 50), (51, 55), (56, 60), (60, 61), (0, 0), (0, 3), (4, 7), (7, 8), (8, 9), (10, 20), (21, 25), (26, 29), (30, 34), (35, 36), (36, 37), (37, 40), (41, 45), (45, 46), (47, 50), (51, 53), (54, 58), (59, 61), (62, 69), (70, 73), (74, 78), (79, 86), (87, 91), (92, 96), (96, 97), (98, 101), (102, 106), (107, 115), (116, 118), (119, 121), (122, 126), (127, 138), (138, 139), (140, 146), (147, 153), (154, 160), (161, 165), (166, 171), (172, 175), (176, 182), (183, 186), (187, 191), (192, 198), (199, 205), (206, 208), (209, 210), (211, 217), (218, 222), (223, 225), (226, 229), (230, 240), (241, 245), (246, 248), (248, 249), (250, 258), (259, 262), (263, 267), (268, 271), (272, 277), (278, 281), (282, 285), (286, 290), (291, 301), (301, 302), (303, 307), (308, 312), (313, 318), (319, 321), (322, 325), (326, 330), (330, 331), (332, 340), (341, 351), (352, 354), (355, 363), (364, 373), (374,

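We can therefore use this mapping to find the positions of the start and end tokens of the answer in a given feature.

We only need to tell which parts of the offsets correspond to the question and which correspond to the context.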
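
As a model-free illustration of the idea, the sketch below converts a character span into a token span using offsets. It assumes a toy whitespace "tokenizer" (`whitespace_offsets`) in place of the real subword tokenizer, and `char_span_to_token_span` is a hypothetical helper. Only the offset logic carries over to the real case.

```python
import re


def whitespace_offsets(text):
    """Toy stand-in for a tokenizer's offset mapping: one (start, end) pair per word."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(offsets, start_char, end_char):
    """Return (first_token, last_token) indices covering [start_char, end_char)."""
    # First token that ends after the answer starts.
    token_start = next(i for i, (s, e) in enumerate(offsets) if e > start_char)
    # Last token that starts before the answer ends.
    token_end = max(i for i, (s, e) in enumerate(offsets) if s < end_char)
    return token_start, token_end


context = "The men's basketball team has over 1,600 wins."
answer = "over 1,600"
start_char = context.index(answer)
end_char = start_char + len(answer)

offsets = whitespace_offsets(context)
first, last = char_span_to_token_span(offsets, start_char, end_char)
print(first, last, context[offsets[first][0]:offsets[last][1]])
```

With a real tokenizer the offsets list also contains question tokens and special tokens, which is why the context tokens have to be isolated before running this kind of search.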

In [19]:
first_token_id = tokenized_example["input_ids"][0][1]
offsets = tokenized_example["offset_mapping"][0][1]
print(tokenizer.convert_ids_to_tokens([first_token_id])[0], example["question"][offsets[0]:offsets[1]])

# first_token_id = tokenized_example["input_ids"][0][1]: the ID of the token at index 1 of the first feature,
# i.e. the first token after [CLS].

# offsets = tokenized_example["offset_mapping"][0][1]: the offset mapping of that token, i.e. its start and end character positions in the original text.

# tokenizer.convert_ids_to_tokens([first_token_id])[0]: converts the ID back to its token string.

# example["question"][offsets[0]:offsets[1]]: the substring of the original question that the offsets point to.

# print(...): prints both so we can compare the encoded token with the original text.

# A check like this helps verify that tokens can be mapped back to the original text correctly.

how How


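The `sequence_ids` method of `tokenized_example` makes it easy to tell which sequence each token comes from:

- for special tokens it returns None;
- for regular tokens it returns the index of the sequence they belong to (numbering starts at 0).

Putting this together, we can now easily find the start and end tokens of the answer in an input feature.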

In [20]:
sequence_ids = tokenized_example.sequence_ids()
print(sequence_ids)

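# This gets the sequence ID of every token in the encoding and prints it.
# The list tells you which input sequence each token belongs to; with text pairs there are usually two, e.g. the question and the context.
# The printed sequence_ids can be long: it has one entry per token of the encoded feature.
# This information is useful in multi-sequence tasks for understanding the structure of the encoded input.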

[None, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

In [21]:
# Get the answer for this example: its start and end character positions in the original context.
answers = example["answers"]
start_char = answers["answer_start"][0]
end_char = start_char + len(answers["text"][0])

# Start token index of the context in the current span:
# use sequence_ids to skip the special tokens and the question.
token_start_index = 0
while sequence_ids[token_start_index] != 1:
    token_start_index += 1

# End token index of the context in the current span:
# walk backwards from the last token until we reach a context token.
token_end_index = len(tokenized_example["input_ids"][0]) - 1
while sequence_ids[token_end_index] != 1:
    token_end_index -= 1

# Detect whether the answer falls outside the span (in that case the feature is labeled with the CLS index).
# If the answer lies within the span, move both indices to the two ends of the answer and print the result;
# otherwise print a message saying the answer is not in this feature.
offsets = tokenized_example["offset_mapping"][0]
if (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
    # Move token_start_index and token_end_index to the two ends of the answer.
    # Note: we could go past the last token if the answer is the last word (edge case).
    while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
        token_start_index += 1
    start_position = token_start_index - 1
    while offsets[token_end_index][1] >= end_char:
        token_end_index -= 1
    end_position = token_end_index + 1
    print(start_position, end_position)
else:
    print("The answer is not in this feature.")


23 26


Print to check that we found the right span:

In [22]:
print(tokenizer.decode(tokenized_example["input_ids"][0][start_position: end_position+1]))
print(answers["text"][0])

over 1, 600
over 1,600


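#### Padding strategies

- Text shorter than the maximum length is padded up to that length.
- For models that expect padding on the left, swap the order of the question and the context.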
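
The padding step itself can be sketched in a few lines of plain Python. This is an illustration of the concept, not the tokenizer's implementation: `pad_to_length` is a hypothetical helper, and `pad_id=0` is an assumed padding token ID.

```python
def pad_to_length(input_ids, max_length, pad_id=0, padding_side="right"):
    """Pad a list of token IDs to max_length and build the matching attention mask."""
    if len(input_ids) > max_length:
        raise ValueError("sequence is longer than max_length; truncate it first")
    num_pad = max_length - len(input_ids)
    mask = [1] * len(input_ids)           # 1 marks real tokens, 0 marks padding
    if padding_side == "right":
        return input_ids + [pad_id] * num_pad, mask + [0] * num_pad
    return [pad_id] * num_pad + input_ids, [0] * num_pad + mask


ids = [101, 2054, 2003, 102]
print(pad_to_length(ids, 6))
print(pad_to_length(ids, 6, padding_side="left"))
```

The padding side matters for question answering because it determines which of the two inputs should come second and be truncated, which is why the question and context are swapped for left-padding models.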

In [23]:
pad_on_right = tokenizer.padding_side == "right"

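### Putting all the preprocessing steps together

Let's put everything into one function and apply it to the training set.

For impossible answers (the context is too long and the answer lies in another feature), we set both the start and end positions to the CLS index.

We could also simply discard these examples from the training set if the `allow_impossible_answers` flag is False.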

In [24]:
def prepare_train_features(examples):
    # Some questions have a lot of whitespace on the left, which is not useful and makes
    # truncation of the context fail (the tokenized question takes up a lot of space).
    # So we remove that left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize the examples with truncation and padding, but keep the overflow using a stride.
    # When a context is long, one example yields several features, each with a context
    # that overlaps a bit with the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example can yield several features when its context is long, we need a map
    # from each feature to its source example. "overflow_to_sample_mapping" is that map;
    # pop removes the key from tokenized_examples and returns its value.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mapping gives, for each token, its character positions in the original
    # context. It is what lets us compute the start and end positions of the answer.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Get the sequence IDs of this feature (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can yield several spans; this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answer is given, use cls_index as the answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start and end character indices of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect whether the answer is outside the span (in that case the feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go past the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

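#### Advanced use of datasets.map

Use the `datasets.map` method to apply `prepare_train_features` to every split (here, training and validation):

- batched: process the data in batches.
- remove_columns: preprocessing changes the number of samples, so the old columns must be removed when applying it.
- load_from_cache_file: whether to use the datasets library's automatic cache.

The datasets library implements an efficient caching mechanism for large-scale data and automatically detects whether the function passed to map has changed (in which case the cached data must not be used). Passing `load_from_cache_file=False` to map forces the preprocessing to be reapplied.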
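
The reason the old columns have to go can be seen with a plain-Python analogy. The sketch below does not call the datasets library. `split_long_rows` is a hypothetical batched function that, like `prepare_train_features`, receives a dict of lists and may return more rows than it was given.

```python
def split_long_rows(batch, max_words=4):
    """Toy batched function: every text longer than max_words becomes several rows."""
    chunks = []
    for text in batch["text"]:
        words = text.split()
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    return {"chunk": chunks}


batch = {"id": ["a", "b"], "text": ["one two three four five six", "seven eight"]}
out = split_long_rows(batch)

# The input batch has 2 rows but the output has 3, so the original "id" and
# "text" columns can no longer sit next to the new "chunk" column.
print(len(batch["id"]), "input rows ->", len(out["chunk"]), "output rows")
```

Because the output column is longer than the input columns, keeping both in one table would leave columns of different lengths, which is the mismatch that `remove_columns` avoids.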

In [26]:
tokenized_datasets = datasets.map(prepare_train_features,
                                  batched=True,
                                  remove_columns=datasets["train"].column_names)

# remove_columns=datasets["train"].column_names:
# drops the original columns from the mapped dataset. prepare_train_features can return more rows
# than it receives, so the original columns would no longer line up with the new features.

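## Fine-tuning the model

Now that our data is ready for training, we can download the pretrained model and fine-tune it.

Since our task is question answering, we use the `AutoModelForQuestionAnswering` class (by comparison, the Yelp review rating task used `AutoModelForSequenceClassification`).

The warning tells us that some weights are being discarded (the `vocab_transform` and `vocab_layer_norm` layers and the `vocab_projector` bias) and that others are randomly initialized (the `qa_outputs` layer). This is entirely normal when fine-tuning: we are removing the head used for the masked language modeling pretraining task and replacing it with a new question answering head for which we have no pretrained weights, so the library warns us that the model should be fine-tuned before it is used for inference, which is exactly what we are about to do.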

In [27]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to

#### Training hyperparameters (TrainingArguments)

In [28]:
batch_size=64
model_dir = "models"
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    f"{model_dir}/{model_name}-finetuned-squad",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

#### Data Collator

The data collator assembles training samples into batches for the model. This tutorial uses the default `default_data_collator`.


In [29]:
from transformers import default_data_collator

data_collator = default_data_collator

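### Instantiating the Trainer

To reduce training time (training requires substantial compute), we do not compute evaluation metrics during training in this tutorial.

Instead, we evaluate the model separately once training is finished.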

In [30]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

Detected kernel version 4.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


#### GPU usage

Training data and model configuration:

- SQuAD v1.1
- model_checkpoint = "distilbert-base-uncased"
- batch_size = 64

NVIDIA GPU usage:

```shell
Every 1.0s: nvidia-smi                                                   Wed Dec 20 15:39:57 2023

Wed Dec 20 15:39:57 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:0D.0 Off |                    0 |
| N/A   67C    P0              67W /  70W |  14617MiB / 15360MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     16384      C   /root/miniconda3/bin/python               14612MiB |
+---------------------------------------------------------------------------------------+
```

In [31]:
trainer.train()



Epoch,Training Loss,Validation Loss
1,1.5055,1.278168
2,1.1378,1.172962
3,0.9938,1.179897


TrainOutput(global_step=4152, training_loss=1.3227212240700548, metrics={'train_runtime': 8451.7055, 'train_samples_per_second': 31.422, 'train_steps_per_second': 0.491, 'total_flos': 2.602335381127373e+16, 'train_loss': 1.3227212240700548, 'epoch': 3.0})

### Save the model weights as soon as training finishes

In [56]:
trained_model_path = f"{model_dir}/{model_name}-finetuned-squad-trained"

In [57]:
trainer.save_model(trained_model_path)  # save_model returns None, so there is nothing to assign

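## Evaluating the model

**Evaluating the model's output takes some extra processing: the model's predictions have to be mapped back to parts of the context.**

The model directly outputs **logits** for the `start position` and `end position` of the predicted answer.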

In [90]:
import torch

for batch in trainer.get_eval_dataloader():
    break
batch = {k: v.to(trainer.args.device) for k, v in batch.items()}
with torch.no_grad():
    output = trainer.model(**batch)
output.keys()

# for batch in trainer.get_eval_dataloader(): break
# Iterates over the evaluation dataloader and stops after the first batch,
# so `batch` holds only the first batch of evaluation data.

# batch = {k: v.to(trainer.args.device) for k, v in batch.items()}
# Moves every tensor in the batch to trainer.args.device so that the model
# and the data live on the same device (usually the GPU).

# with torch.no_grad():
# Disables gradient tracking inside the block. This is standard for evaluation:
# it reduces memory use and speeds up the forward pass.

# output = trainer.model(**batch)
# Runs inference with the model held by the Trainer. `**batch` unpacks the
# dictionary into keyword arguments for the model's forward call.

# output.keys()
# Lists the fields of the model output so we can inspect its structure.

# In short: take the first evaluation batch, move it to the right device,
# run the model on it, and look at the keys of the output.

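The model output is a dict-like object that contains the loss (because we supplied labels) as well as the start and end logits. We do not need the loss for prediction, so let's look at the logits: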

In [34]:
output.start_logits.shape, output.end_logits.shape

(torch.Size([64, 384]), torch.Size([64, 384]))

In [35]:
output.start_logits.argmax(dim=-1), output.end_logits.argmax(dim=-1)

(tensor([ 46,  57,  78,  43, 118,  15,  72,  35,  15,  34,  73,  41,  80,  91,
         156,  35,  83,  91,  80,  58,  77,  31,  42,  53,  41,  35,  42,  77,
          11,  44,  27, 133,  66,  40,  87,  44,  85,  83, 127,  26,  28,  33,
          87, 127,  95,  25,  43, 132,  42,  29,  44,  46,  24,  44,  65,  58,
          81,  14,  59,  76,  25,  36,  55,  43], device='cuda:0'),
 tensor([ 47,  58,  81,  44, 171, 110,  75,  37, 110,  36,  76,  42,  83,  94,
         158,  35,  83,  94,  83,  60,  80,  31,  43,  54,  42,  35,  43,  80,
          13,  45,  28, 133,  66,  41,  89,  45,  87,  85, 127,  27,  30,  34,
          89, 127,  97,  26,  44, 132,  43,  30,  45,  47,  25,  45,  65,  59,
          81,  14,  60,  72,  25,  36,  58,  43], device='cuda:0'))

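#### How to turn the position logits into an answer

We have one logit per token for every feature. The most obvious way to predict an answer for a feature is to take the index of the largest start logit as the start position and the index of the largest end logit as the end position.

This works well in many cases, but what if the prediction is impossible? The start position may come after the end position, or the span may point at a piece of the question rather than the context. In that case we would like to look at the second-best prediction, check whether it gives a possible answer, and pick that instead.

Choosing the second-best answer is not as easy as choosing the best one:
- Is it the second-best index in the start logits paired with the best index in the end logits?
- Or the best index in the start logits paired with the second-best index in the end logits?
- And if the second-best answer is also impossible, things get even trickier for the third-best.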
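
To make the difficulty concrete, here is a minimal sketch with made-up logits. The numbers and the helper name `best_valid_span` are illustrative assumptions, not model outputs or library functions. Taking the two argmax indices independently yields a start that comes after the end, whereas scoring candidate pairs jointly and discarding the invalid ones yields a usable span.

```python
# Toy logits for a 6-token feature; the values are invented for illustration.
start_logits = [0.1, 0.3, 2.0, 0.2, 5.0, 0.0]
end_logits = [0.2, 0.1, 0.4, 4.0, 0.5, 0.3]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def best_valid_span(start_logits, end_logits, n_best=3):
    """Return (score, start, end) for the best span with start <= end, or None."""

    def top(values):
        # Indices of the n_best largest values, best first.
        return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:n_best]

    candidates = [
        (start_logits[s] + end_logits[e], s, e)
        for s in top(start_logits)
        for e in top(end_logits)
        if s <= e  # discard spans that end before they start
    ]
    return max(candidates) if candidates else None


naive = (argmax(start_logits), argmax(end_logits))
best = best_valid_span(start_logits, end_logits)
print("independent argmax (start, end):", naive)
print("best valid (score, start, end):", best)
```

In this toy case the independent argmax gives start 4 and end 3, which is not a span at all, while the joint ranking settles on tokens 2 to 3.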

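To rank the candidate answers, we will:
1. Score each candidate by adding its start logit and its end logit.
1. Introduce a hyperparameter named `n_best_size` so that we do not have to rank every possible answer.
1. Take the `n_best_size` best indices in the start logits and in the end logits, and collect every answer these combinations predict.
1. Check each candidate for validity, sort the valid ones by score, and keep the best.

Here is how to do this on the first feature of the batch: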

In [36]:
n_best_size = 20

In [37]:
import numpy as np

start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()

# Gather the indices of the best start and end positions:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()

valid_answers = []

# Loop over every combination of start and end index
for start_index in start_indexes:
    for end_index in end_indexes:
        if start_index <= end_index:  # This test needs refining to check that the answer is inside the context
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": ""  # We still need a way to recover the original substring of the context that corresponds to the answer
                }
            )



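We can then sort `valid_answers` by score and keep only the best one. The only remaining questions are how to check that a given span lies in the context (and not in the question), and how to get back the text inside it. To do this, we need to add two things to our validation features:

- the ID of the example that generated the feature (since each example can generate several features, as shown before);
- the offset mapping, which gives us a map from token indices to character positions in the context.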
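
The idea behind an offset mapping can be sketched without any model. The snippet below assumes a hypothetical whitespace tokenizer, `tokenize_with_offsets`, as a stand-in for the real subword tokenizer; only the shape of the result (one `(start, end)` character pair per token) mirrors what the notebook relies on.

```python
import re


def tokenize_with_offsets(text):
    """Split text into whitespace-delimited tokens with (start, end) character offsets.

    A hypothetical stand-in for a real tokenizer's offset mapping.
    """
    tokens, offsets = [], []
    for match in re.finditer(r"\S+", text):
        tokens.append(match.group())
        offsets.append((match.start(), match.end()))
    return tokens, offsets


context = "The Denver Broncos defeated the Carolina Panthers."
tokens, offsets = tokenize_with_offsets(context)

# Suppose the model predicts that the answer covers tokens 1..2 (inclusive).
start_index, end_index = 1, 2
start_char = offsets[start_index][0]  # first character of the start token
end_char = offsets[end_index][1]      # one past the last character of the end token
answer = context[start_char:end_char]
print(answer)
```

Because the answer is sliced out of the original string, it keeps the original casing and spacing, which is what the evaluation expects from an extractive model.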

That is why we re-process the validation set with the following function, which differs slightly from `prepare_train_features`:

In [38]:
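def prepare_validation_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and possibly padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit with the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the ID of the example that gave us this feature, and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence ids for this feature (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans; this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offsets that are not part of the context, so it is easy to tell whether a token
        # position belongs to the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples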


Apply `prepare_validation_features` to the whole validation set:

In [39]:
validation_features = datasets["validation"].map(
    prepare_validation_features,
    batched=True,
    remove_columns=datasets["validation"].column_names
)

Now we can grab the predictions for all features by using the `Trainer.predict` method:

In [40]:
raw_predictions = trainer.predict(validation_features)

The `Trainer` hides the columns the model does not use (here `example_id` and `offset_mapping`, which we need for post-processing), so we set them back:

In [41]:
validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))

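We can now refine the test from before:

Because we set the offset mapping to None wherever it corresponds to part of the question, it is easy to check whether an answer lies entirely inside the context. We can also exclude very long answers from consideration (using a hyperparameter we can tune).

In more detail, the implementation works as follows:
- First, take the start and end logits from the model output; they indicate where in the text the answer is likely to start and end.
- Then use the offset mapping (offset_mapping) to locate those token positions in the original text.
- Next, loop over the candidate start and end index combinations, discarding those that fall outside the context or have an unsuitable length.
- For each valid answer, compute a score (the sum of its start and end logits) and store the answer together with its score.
- Finally, sort the answers by score and return the highest-scoring ones.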

In [42]:
max_answer_length = 30

In [43]:
start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
offset_mapping = validation_features[0]["offset_mapping"]

# The first feature comes from the first example. In the general case, we have to match the example_id to an example index.
context = datasets["validation"][0]["context"]

# Gather the indices of the best start/end logits:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # Skip out-of-scope answers: either the index is out of bounds, or it points to a
        # part of the input_ids that is not in the context.
        if (
            start_index >= len(offset_mapping)
            or end_index >= len(offset_mapping)
            or offset_mapping[start_index] is None
            or offset_mapping[end_index] is None
        ):
            continue
        # Skip answers with a negative length or a length greater than max_answer_length.
        if end_index < start_index or end_index - start_index + 1 > max_answer_length:
            continue
        # Every span that reaches this point is valid and lies in the context: recover its text via the offsets.
        start_char = offset_mapping[start_index][0]
        end_char = offset_mapping[end_index][1]
        valid_answers.append(
            {
                "score": start_logits[start_index] + end_logits[end_index],
                "text": context[start_char: end_char]
            }
        )

valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[:n_best_size]
valid_answers


[{'score': 16.85605, 'text': 'Denver Broncos'},
 {'score': 14.393804,
  'text': 'Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'},
 {'score': 12.881678, 'text': 'Broncos'},
 {'score': 12.01755, 'text': 'Denver'},
 {'score': 11.989048,
  'text': 'American Football Conference (AFC) champion Denver Broncos'},
 {'score': 11.743425, 'text': 'Carolina Panthers'},
 {'score': 10.916711,
  'text': 'The American Football Conference (AFC) champion Denver Broncos'},
 {'score': 10.419431,
  'text': 'Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'},
 {'score': 10.310012,
  'text': 'Denver Broncos defeated the National Football Conference'},
 {'score': 9.981752,
  'text': 'Denver Broncos defeated the National Football Conference (NFC)'},
 {'score': 9.526801,
  'text': 'American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'},
 {'score': 9.417948

Compare the model output with the ground truth:

In [44]:
datasets["validation"][0]["answers"]

{'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'],
 'answer_start': [177, 177, 177]}

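**The model's top-scoring prediction matches the ground truth.**

As the code above shows, this is easy for the first feature because we know it comes from the first example.

For the other features, we need a map between examples and their corresponding features.

Moreover, since one example can give several features, we have to gather all the answers from all the features generated by a given example and then pick the best one.

The code below builds a map from each example index to the indices of its features: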

In [45]:
import collections

examples = datasets["validation"]
features = validation_features

example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)

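When `squad_v2 = True`, some questions have no answer in the context (an impossible answer).

The code above only keeps answers that lie inside the context; we also need the score of the impossible answer, whose start and end indices both correspond to the index of the CLS token.

When one example gives several features, we should predict the impossible answer only when every one of those features predicts it, because a single feature may predict the impossible answer simply because the answer is not in the part of the context it has access to. That is why the score of the impossible answer for one example is the *minimum* of the impossible-answer scores over all the features generated by that example.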
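
The rule above can be sketched on toy data. The layout of `features` below (a `null_score` plus a list of `(score, text)` candidates per feature) and the function name `pick_answer` are hypothetical simplifications for illustration, and all scores are invented.

```python
def pick_answer(features):
    """Choose between the best span and the empty answer for one example.

    `features` holds one dict per feature generated from the example, each with
    a `null_score` (CLS start logit + CLS end logit) and a list of
    (score, text) span candidates. This layout is a hypothetical simplification.
    """
    # The example-level null score is the minimum over all of its features.
    null_score = min(f["null_score"] for f in features)
    spans = [span for f in features for span in f["spans"]]
    best_score, best_text = max(spans) if spans else (float("-inf"), "")
    return best_text if best_score > null_score else ""


# The first window does not contain the answer, so its null score is high;
# the second window does contain it, so its null score is low.
answerable = [
    {"null_score": 9.0, "spans": [(2.0, "in 1999")]},
    {"null_score": 1.5, "spans": [(7.5, "Denver Broncos")]},
]
# Every window prefers the null answer over its best span.
unanswerable = [
    {"null_score": 8.0, "spans": [(3.0, "Carolina")]},
    {"null_score": 6.5, "spans": [(2.5, "Panthers")]},
]

print(repr(pick_answer(answerable)))
print(repr(pick_answer(unanswerable)))
```

Had the maximum been used instead of the minimum, the first example's null score would be 9.0 and the correct span (score 7.5) would be rejected just because one window could not see the answer.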

In [46]:
from tqdm.auto import tqdm

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map from each example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

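    # The dictionary we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Indices of the features associated with the current example.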
        feature_indices = features_per_example[example_index]

        min_null_score = None # Only used if squad_v2 is True.
        valid_answers = []

        context = example["context"]
        # Loop over all the features associated with the current example.
        for feature_index in feature_indices:
            # Grab the model's predictions for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # The offset mapping lets us map positions in the logits to spans of text in the original context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update the minimum null prediction: keep the smallest CLS score seen across this example's features.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or feature_null_score < min_null_score:
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` greatest start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip out-of-scope answers: either the index is out of bounds, or it points to a
                    # part of the input_ids that is not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Skip answers with a negative length or a length greater than max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
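            # In the very rare case where we do not have a single non-null prediction, create a fake prediction to avoid failure.
            best_answer = {"text": "", "score": 0.0}

        # Pick the final answer: the best answer or the null answer (squad_v2 only)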
        if not squad_v2:
            predictions[example["id"]] = best_answer["text"]
        else:
            answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
            predictions[example["id"]] = answer

    return predictions


Apply the post-processing function to the raw predictions:

In [47]:
final_predictions = postprocess_qa_predictions(datasets["validation"], validation_features, raw_predictions.predictions)

Post-processing 10570 example predictions split into 10784 features.



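Load the evaluation metric (`squad`, or `squad_v2` when `squad_v2 = True`) with `datasets.load_metric`. Note that `load_metric` is deprecated in recent releases of `datasets` in favor of the separate `evaluate` library.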

In [48]:
from datasets import load_metric

metric = load_metric("squad_v2" if squad_v2 else "squad")


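Next, we can run the metric on the post-processed predictions.

We only need to adjust the format of the predictions and labels a little, because the metric expects a list of dictionaries rather than one big dictionary.

With the `squad_v2` dataset we also have to set the `no_answer_probability` argument (set to 0.0 here, because whenever we picked the empty answer we have already set the prediction text to an empty string).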

In [49]:
if squad_v2:
    formatted_predictions = [{"id": k, "prediction_text": v, "no_answer_probability": 0.0} for k, v in final_predictions.items()]
else:
    formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in datasets["validation"]]
metric.compute(predictions=formatted_predictions, references=references)

{'exact_match': 74.66414380321665, 'f1': 83.41118380007063}
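
To get a feel for what these two numbers measure, here is a simplified sketch of per-question exact match and token-level F1. It is modelled on the commonly used SQuAD v1.1 scoring logic, but it is an illustration under that assumption: the metric loaded above remains the authoritative implementation and may differ in details. The function names are chosen for this sketch.

```python
import collections
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction, truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    common = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_truths(metric_fn, prediction, truths):
    """A question may have several reference answers; take the best match."""
    return max(metric_fn(prediction, truth) for truth in truths)


truths = ["Denver Broncos", "Denver Broncos", "Denver Broncos"]
em_full = best_over_truths(exact_match, "the Denver Broncos", truths)
em_partial = best_over_truths(exact_match, "Broncos", truths)
f1_partial = best_over_truths(f1_score, "Broncos", truths)
print(em_full, em_partial, round(f1_partial, 4))
```

The dataset-level scores are averages of these per-question values expressed as percentages, which is why F1 is always at least as high as exact match: a partially overlapping answer earns partial F1 credit but no exact-match credit.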

### Homework 2-2: load the locally saved model, evaluate it, and keep training for a higher F1 score

In [7]:
import subprocess
import os

# Point the Hugging Face libraries at custom download/cache directories
# os.environ["CUDA_VISIBLE_DEVICES"] = "1"
os.environ["HF_DATASETS_CACHE"] = "/autodl-tmp/datasets_cache/"
os.environ["HF_HOME"] = "/autodl-tmp/cache/"
os.environ["HUGGINGFACE_HUB_CACHE"] = "/autodl-tmp/hub_cache/"
os.environ["TRANSFORMERS_CACHE"] = "/autodl-tmp/transform_cache/"

result = subprocess.run('bash -c "source /etc/network_turbo && env | grep proxy"', shell=True, capture_output=True, text=True)
output = result.stdout
for line in output.splitlines():
    if '=' in line:
        var, value = line.split('=', 1)
        os.environ[var] = value       

In [8]:
# Check that the environment variables were set correctly
print("http_proxy",os.environ.get("http_proxy"))
print("https_proxy",os.environ.get("https_proxy"))
print("HF_HOME",os.environ.get("HF_HOME"))
print("HF_DATASETS_CACHE",os.environ.get("HF_DATASETS_CACHE"))
print("HUGGINGFACE_HUB_CACHE",os.environ.get("HUGGINGFACE_HUB_CACHE"))
print("TRANSFORMERS_CACHE",os.environ.get("TRANSFORMERS_CACHE"))

http_proxy http://172.20.0.113:12798
https_proxy http://172.20.0.113:12798
HF_HOME /autodl-tmp/cache/
HF_DATASETS_CACHE /autodl-tmp/datasets_cache/
HUGGINGFACE_HUB_CACHE /autodl-tmp/hub_cache/
TRANSFORMERS_CACHE /autodl-tmp/transform_cache/


In [9]:
# Adjust the key parameters below to match your model and GPU resources
squad_v2 = False
model_checkpoint = "distilbert-base-uncased"

In [16]:
# No need to run this cell again after the first run
from datasets import load_dataset

datasets = load_dataset("squad_v2" if squad_v2 else "squad")

datasets.save_to_disk('../../autodl-tmp/data/squad')


In [10]:
from datasets import load_from_disk
datasets = load_from_disk('../../autodl-tmp/data/squad')

In [2]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [3]:
datasets["train"][0]

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}

In [4]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=2):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

show_random_elements(datasets["train"])

Unnamed: 0,id,title,context,question,answers
0,56d45f882ccc5a1400d830f1,Kanye_West,"Kanye West began his early production career in the mid-1990s, making beats primarily for burgeoning local artists, eventually developing a style that involved speeding up vocal samples from classic soul records. His first official production credits came at the age of nineteen when he produced eight tracks on Down to Earth, the 1996 debut album of a Chicago rapper named Grav. For a time, West acted as a ghost producer for Deric ""D-Dot"" Angelettie. Because of his association with D-Dot, West wasn't able to release a solo album, so he formed and became a member and producer of the Go-Getters, a late-1990s Chicago rap group composed of him, GLC, Timmy G, Really Doe, and Arrowstar. His group was managed by John ""Monopoly"" Johnson, Don Crowley, and Happy Lewis under the management firm Hustle Period. After attending a series of promotional photo shoots and making some radio appearances, The Go-Getters released their first and only studio album World Record Holders in 1999. The album featured other Chicago-based rappers such as Rhymefest, Mikkey Halsted, Miss Criss, and Shayla G. Meanwhile, the production was handled by West, Arrowstar, Boogz, and Brian ""All Day"" Miller.",What late 1990s Chicago rap group was Kanye West a member of?,"{'text': ['Go-Getters'], 'answer_start': [587]}"
1,570c5f39b3d812140066d19f,"John,_King_of_England","Popular representations of John first began to emerge during the Tudor period, mirroring the revisionist histories of the time. The anonymous play The Troublesome Reign of King John portrayed the king as a ""proto-Protestant martyr"", similar to that shown in John Bale's morality play Kynge Johan, in which John attempts to save England from the ""evil agents of the Roman Church"". By contrast, Shakespeare's King John, a relatively anti-Catholic play that draws on The Troublesome Reign for its source material, offers a more ""balanced, dual view of a complex monarch as both a proto-Protestant victim of Rome's machinations and as a weak, selfishly motivated ruler"". Anthony Munday's play The Downfall and The Death of Robert Earl of Huntington portrays many of John's negative traits, but adopts a positive interpretation of the king's stand against the Roman Catholic Church, in line with the contemporary views of the Tudor monarchs. By the middle of the 17th century, plays such as Robert Davenport's King John and Matilda, although based largely on the earlier Elizabethan works, were transferring the role of Protestant champion to the barons and focusing more on the tyrannical aspects of John's behaviour.","In The Troublesome Reign of King John, John portrayed the king as what?","{'text': ['proto-Protestant martyr'], 'answer_start': [207]}"


In [25]:
# Use a custom proxy; not needed in most cases

import subprocess
import os

# Point the Hugging Face libraries at custom download/cache directories
# os.environ["CUDA_VISIBLE_DEVICES"] = "1"
os.environ["HF_DATASETS_CACHE"] = "/autodl-tmp/datasets_cache/"
os.environ["HF_HOME"] = "/autodl-tmp/cache/"
os.environ["HUGGINGFACE_HUB_CACHE"] = "/autodl-tmp/hub_cache/"
os.environ["TRANSFORMERS_CACHE"] = "/autodl-tmp/transform_cache/"

result = subprocess.run('bash -c "source /etc/profile.d/clash.sh && env | grep proxy"', shell=True, capture_output=True, text=True)
output = result.stdout
for line in output.splitlines():
    if '=' in line:
        var, value = line.split('=', 1)
        os.environ[var] = value    

In [27]:
# Only needed on the first run

from transformers import AutoTokenizer, AutoConfig

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
config = AutoConfig.from_pretrained(model_checkpoint)

cache_directory = '../../autodl-tmp/tokenizer/distilbert-base-uncased'

# Create the directory if it does not exist
if not os.path.exists(cache_directory):
    os.makedirs(cache_directory)

tokenizer.save_pretrained(cache_directory)
config.save_pretrained(cache_directory)

print(f"Max Sequence Length for distilbert-base-uncased (tokenizer): {tokenizer.model_max_length}")


Max Sequence Length for distilbert-base-uncased (tokenizer): 512


In [11]:
from transformers import AutoTokenizer

# Path to the locally saved tokenizer
cache_directory = '../../autodl-tmp/tokenizer/distilbert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(cache_directory)

# Print the tokenizer's maximum sequence length
print(f"Max Sequence Length for distilbert-base-uncased (tokenizer): {tokenizer.model_max_length}")

# The assertion below makes sure we are using a fast tokenizer (backed by Rust), which is faster and offers extra features such as offset mappings.
import transformers
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

Max Sequence Length for distilbert-base-uncased (tokenizer): 512


In [12]:
# The maximum length of a feature (question and context)
max_length = 384
# The authorized overlap between two parts of the context when splitting is needed.
doc_stride = 128
pad_on_right = tokenizer.padding_side == "right"
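
How `max_length` and `doc_stride` interact can be sketched with plain lists. The helper `split_with_overlap` below is a hypothetical illustration, not the tokenizer's implementation: it treats the stride as the number of tokens shared by consecutive windows, and it ignores the room that the question and the special tokens take up inside each real feature.

```python
def split_with_overlap(tokens, window, overlap):
    """Split `tokens` into chunks of at most `window` items that share `overlap` items."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap  # how far each new window advances
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the sequence
        start += step
    return chunks


# Ten toy "tokens", windows of 4 with an overlap of 2.
context_tokens = list(range(10))
chunks = split_with_overlap(context_tokens, window=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last two tokens of the previous one, so an answer that straddles a window boundary still appears whole in at least one window as long as it is no longer than the overlap.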

In [80]:
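# Experiment: a closer look at the structure the tokenizer returns

from datasets import ClassLabel, Sequence
import pandas as pd
from IPython.display import display, HTML

# Find an example longer than 384 tokens
for i, example in enumerate(datasets["train"]):
    # On every iteration, encode the question and context of the current example and measure the
    # number of input tokens. Leave the loop as soon as the length exceeds 384.
    if len(tokenizer(example["question"], example["context"])["input_ids"]) > 384:
        break

# In this run, examples i+4 and i+5 turned out to be one over-length and one within-length example; compare them via overflow_to_sample_mapping (the last entry of the output)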
example = datasets["train"][i+4]
example2 = datasets["train"][i+5]

examples=datasets["train"][i+4:i+6]

print(len(tokenizer(example["question"], example["context"])["input_ids"]))
print(len(tokenizer(example2["question"], example2["context"])["input_ids"]))

tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

df = pd.DataFrame(tokenized_examples)
# display(HTML(df.to_html()))
print(df)
tokenized_examples


395
171
                            0
0                   input_ids
1              attention_mask
2              offset_mapping
3  overflow_to_sample_mapping


{'input_ids': [[101, 2040, 2001, 1996, 10289, 8214, 2273, 1005, 1055, 3455, 2873, 1999, 2297, 1029, 102, 1996, 2273, 1005, 1055, 3455, 2136, 2038, 2058, 1015, 1010, 5174, 5222, 1010, 2028, 1997, 2069, 2260, 2816, 2040, 2031, 2584, 2008, 2928, 1010, 1998, 2031, 2596, 1999, 2654, 5803, 8504, 1012, 2280, 2447, 5899, 12385, 4324, 1996, 2501, 2005, 2087, 2685, 3195, 1999, 1037, 2309, 2208, 1997, 1996, 2977, 2007, 6079, 1012, 2348, 1996, 2136, 2038, 2196, 2180, 1996, 5803, 2977, 1010, 2027, 2020, 2315, 2011, 1996, 16254, 2015, 5188, 3192, 2004, 2120, 3966, 3807, 1012, 1996, 2136, 2038, 23339, 1037, 2193, 1997, 6314, 2015, 1997, 2193, 2028, 4396, 2780, 1010, 1996, 2087, 3862, 1997, 2029, 2001, 4566, 12389, 1005, 1055, 2501, 6070, 1011, 2208, 3045, 9039, 1999, 3326, 1012, 1996, 2136, 2038, 7854, 2019, 3176, 2809, 2193, 1011, 2028, 2780, 1010, 1998, 2216, 3157, 5222, 4635, 2117, 1010, 2000, 12389, 1005, 1055, 2184, 1010, 2035, 1011, 2051, 1999, 5222, 2114, 1996, 2327, 2136, 1012, 1996, 2136, 32

In [13]:
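# This function finds, for every feature produced by the tokenizer (a long example may be split into several features), the token positions of the answer.
# The returned columns are: input_ids, attention_mask, start_positions, end_positions

def prepare_train_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that left whitespace.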
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # return_overflowing_tokens=True: when the encoded text is longer than max_length, the tokens
    # that do not fit are not dropped but returned as additional features (extra input_ids lists).

    # stride=doc_stride: the number of tokens that two consecutive features of the same context
    # share. The overlap reduces the chance that an answer is cut in half at a window boundary.

    # return_offsets_mapping=True: for every token, also return the (start, end) character
    # positions it covers in the original text.

    # With this truncation strategy, the tokenizer returns several input_ids lists for one long
    # example, each of which fits within the model's input limit.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    ) 

    # Since one example might give us several features when it has a long context, we need a map
    # from each feature to its corresponding example. The "overflow_to_sample_mapping" key provides
    # exactly that, so we pop it from tokenized_examples.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mapping gives us a map from each token to its character positions in the original
    # context, which is what we need to compute the start and end positions of the answer.
    offset_mapping = tokenized_examples.pop("offset_mapping")
    # pop removes the given key from the dictionary-like tokenized_examples and returns its value,
    # which is stored in sample_mapping and offset_mapping respectively.

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
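        # We will label impossible answers with the index of the CLS token.
        # Take the i-th feature.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence ids for this feature: they tell us which tokens belong to the
        # question and which belong to the context.
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans; this is the index of the original example
        # (taken from overflow_to_sample_mapping) that this span of text came from.
        sample_index = sample_mapping[i]
        # Look up the answers of that example.
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.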
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

In [14]:
tokenized_datasets = datasets.map(prepare_train_features,
                                  batched=True,
                                  remove_columns=datasets["train"].column_names)

tokenized_datasets
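# remove_columns=datasets["train"].column_names:
# Drops the original columns once the mapping is done. This is required here because one example can
# produce several features, so the new columns no longer line up with the old ones; the raw text is
# also no longer needed after tokenization.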

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'start_positions', 'end_positions'],
        num_rows: 88524
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'start_positions', 'end_positions'],
        num_rows: 10784
    })
})
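
The core labeling step inside `prepare_train_features`, turning a character-level answer span into token indices, can be isolated in a small sketch. The helper name `char_span_to_token_span` and the whitespace "tokenizer" are assumptions made for illustration; the sketch also returns `None` for an answer outside the window, where the function above uses the CLS index instead.

```python
import re


def char_span_to_token_span(offsets, start_char, end_char):
    """Return the inclusive (start_token, end_token) covering [start_char, end_char).

    `offsets` holds one (start, end) character pair per context token.
    Returns None when the answer is not fully inside this window.
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None
    # Walk forward past every token that starts at or before the answer start.
    start_token = 0
    while start_token < len(offsets) and offsets[start_token][0] <= start_char:
        start_token += 1
    # Walk backward past every token that ends at or after the answer end.
    end_token = len(offsets) - 1
    while end_token >= 0 and offsets[end_token][1] >= end_char:
        end_token -= 1
    return start_token - 1, end_token + 1


context = "The Denver Broncos defeated the Carolina Panthers."
# Whitespace tokens stand in for real subword tokens in this sketch.
offsets = [(m.start(), m.end()) for m in re.finditer(r"\S+", context)]

answer_text = "Denver Broncos"
start_char = context.index(answer_text)
end_char = start_char + len(answer_text)

span = char_span_to_token_span(offsets, start_char, end_char)
print(span)
# A later window that starts at "defeated" no longer contains the answer.
print(char_span_to_token_span(offsets[3:], start_char, end_char))
```

This also explains the row counts above: 87,599 training examples become 88,524 features because long contexts are split, and each extra feature gets its own start and end label (or the CLS index when the answer falls outside its window).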

In [15]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

# First download the model to a local folder from the command line:
# huggingface-cli download --resume-download --local-dir-use-symlinks False distilbert-base-uncased --local-dir /root/autodl-tmp/model/distilbert-base-uncased

# Path to the locally downloaded model
cache_directory = '../../autodl-tmp/model/distilbert-base-uncased'

# Since our task is question answering, we use the AutoModelForQuestionAnswering class.
# (The Yelp review rating tutorial used AutoModelForSequenceClassification instead.)
model = AutoModelForQuestionAnswering.from_pretrained(cache_directory)


Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at ../../autodl-tmp/model/distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [16]:
batch_size=97 # about 23 GB of the 24 GB GPU memory in use at this batch size
model_dir ='../../autodl-tmp/model'
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    f"{model_dir}/{model_name}-finetuned-squad",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=10,
    weight_decay=0.01,
    save_total_limit=5,
)

In [17]:
from transformers import default_data_collator

data_collator = default_data_collator

In [18]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [19]:
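# trainer.train()  # first run: train from scratch
trainer.train(resume_from_checkpoint=True)  # later runs: resume from the latest checkpoint in the output directory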

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 7 | 0.6603 | 1.257377 |
| 8 | 0.6143 | 1.307329 |
| 9 | 0.5757 | 1.325786 |
| 10 | 0.5502 | 1.336105 |


TrainOutput(global_step=9130, training_loss=0.2023373147352371, metrics={'train_runtime': 1548.5595, 'train_samples_per_second': 571.654, 'train_steps_per_second': 5.896, 'total_flos': 8.674451270424576e+16, 'train_loss': 0.2023373147352371, 'epoch': 10.0})
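
The `global_step=9130` reported above can be cross-checked against the dataset size and the batch size. The following is a minimal sketch, assuming a single GPU and no gradient accumulation (neither is stated explicitly in this notebook); the variable names are illustrative only.

```python
import math

num_train_features = 88524   # rows in tokenized_datasets["train"]
batch_size = 97              # per_device_train_batch_size
num_epochs = 10              # num_train_epochs

# Assumption: one device and gradient_accumulation_steps == 1,
# so each optimizer step consumes exactly one batch.
steps_per_epoch = math.ceil(num_train_features / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch)  # 913
print(total_steps)      # 9130
```

Under those assumptions the result matches the logged `global_step`, which suggests the whole 10-epoch schedule is counted even though only the last epochs were run after resuming.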

In [20]:
trained_model_path = f"{model_dir}/{model_name}-finetuned-squad-trained"
trainer.save_model(trained_model_path)  # save_model() returns None, so there is nothing to assign

In [21]:
trained_model = AutoModelForQuestionAnswering.from_pretrained(trained_model_path)

trainer = Trainer(
    trained_model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [22]:
import torch

for batch in trainer.get_eval_dataloader():
    break
batch = {k: v.to(trainer.args.device) for k, v in batch.items()}
with torch.no_grad():
    output = trainer.model(**batch)
output.keys()

odict_keys(['loss', 'start_logits', 'end_logits'])

In [23]:
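# This function builds the features in the format needed later by the evaluation step.

def prepare_validation_features(examples):
    # Some questions have a lot of whitespace on the left, which is not useful and can make
    # truncation of the context fail (the tokenized question would take up a lot of space).
    # So we strip that leading whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflowing tokens using a
    # stride. A long context can therefore yield several features, each with a context that
    # overlaps a little with the context of the previous feature.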
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

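    # Since one example might give us several features if it has a long context, we need a map from
    # a feature to its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the ID of the example that produced each feature, and we will also store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence IDs of this feature (to know which tokens are context and which are question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans; this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offsets that are not part of the context, so it is easy to tell whether a
        # token position belongs to the context or not.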
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples
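
To make the effect of `stride` and `return_overflowing_tokens` more concrete, here is a simplified, pure-Python sketch of how one long context is cut into overlapping windows. It is not the tokenizer's actual implementation: `split_with_stride` is a hypothetical helper, it works on a plain list of tokens, and it ignores the question and special tokens that also take up part of `max_length` in the real tokenizer.

```python
def split_with_stride(tokens, max_len, stride):
    """Split tokens into windows of at most max_len items,
    where consecutive windows share `stride` items."""
    if stride >= max_len:
        raise ValueError("stride must be smaller than max_len")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(tokens):
            break
        # Move forward, keeping `stride` tokens of overlap.
        start += max_len - stride
    return windows


context_tokens = list(range(10))  # stand-in for token IDs
windows = split_with_stride(context_tokens, max_len=4, stride=2)
for w in windows:
    print(w)
```

Each window becomes one feature, which is why the validation set grows from 10570 examples to 10784 features: any example whose context does not fit in a single window contributes more than one row.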


In [24]:
validation_features = datasets["validation"].map(
    prepare_validation_features,
    batched=True,
    remove_columns=datasets["validation"].column_names
)
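# remove_columns=datasets["validation"].column_names: drops all original columns of the validation set
# during the mapping. One example can produce several features, so the old columns would no longer line
# up with the new rows; only the columns returned by prepare_validation_features are kept.
validation_features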

Dataset({
    features: ['input_ids', 'attention_mask', 'offset_mapping', 'example_id'],
    num_rows: 10784
})

In [25]:
raw_predictions = trainer.predict(validation_features)

In [26]:
# The Trainer hides the columns the model does not use (here example_id and offset_mapping, which we need for post-processing), so we set them back:
validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))

In [27]:
import numpy as np

n_best_size = 20
max_answer_length = 30
start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
offset_mapping = validation_features[0]["offset_mapping"]

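# The first feature comes from the first example. In the general case, we would need to match the example_id to an example index.
context = datasets["validation"][0]["context"]

print(context)

# Gather the indices of the best start/end logits: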
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
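        # Skip out-of-scope answers, either because the indices are out of bounds or because they
        # correspond to a part of the input_ids that is not in the context.
        if (
            start_index >= len(offset_mapping)
            or end_index >= len(offset_mapping)
            or offset_mapping[start_index] is None
            or offset_mapping[end_index] is None
        ):
            continue
        # Skip answers with a negative length or a length greater than max_answer_length.
        if end_index < start_index or end_index - start_index + 1 > max_answer_length:
            continue
        if start_index <= end_index: # This test needs refining to check that the answer lies inside the context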
            start_char = offset_mapping[start_index][0]
            end_char = offset_mapping[end_index][1]
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": context[start_char: end_char]
                }
            )

valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[:n_best_size]
valid_answers


Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.


[{'score': 20.658417, 'text': 'Denver Broncos'},
 {'score': 18.63585,
  'text': 'Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'},
 {'score': 17.130009, 'text': 'Broncos'},
 {'score': 15.812667, 'text': 'Carolina Panthers'},
 {'score': 15.107442,
  'text': 'Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'},
 {'score': 13.223436, 'text': 'Denver'},
 {'score': 13.144026,
  'text': 'The American Football Conference (AFC) champion Denver Broncos'},
 {'score': 11.371013,
  'text': 'American Football Conference (AFC) champion Denver Broncos'},
 {'score': 11.2548275, 'text': 'Panthers'},
 {'score': 11.12146,
  'text': 'The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'},
 {'score': 9.986939,
  'text': 'Denver Broncos defeated the National Football Conference (NFC) champion Carolina'},
 {'score': 9.348447,
  'text': 'American Foot

In [28]:
datasets["validation"][0]["answers"]

{'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'],
 'answer_start': [177, 177, 177]}
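
The span search used above can be reproduced on toy data. The sketch below uses made-up logits and offsets (not model output) and plain Python instead of NumPy; `best_spans` is a hypothetical helper name, and the default sizes are shrunk to fit the toy input.

```python
def best_spans(start_logits, end_logits, offsets, context,
               n_best_size=3, max_answer_length=3):
    """Return candidate answers sorted by start_logit + end_logit."""
    def top_indexes(logits):
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return order[:n_best_size]

    candidates = []
    for s in top_indexes(start_logits):
        for e in top_indexes(end_logits):
            # Skip tokens outside the context (their offset is None).
            if offsets[s] is None or offsets[e] is None:
                continue
            # Skip spans that end before they start or that are too long.
            if e < s or e - s + 1 > max_answer_length:
                continue
            candidates.append({
                "score": start_logits[s] + end_logits[e],
                "text": context[offsets[s][0]:offsets[e][1]],
            })
    return sorted(candidates, key=lambda c: c["score"], reverse=True)


context = "Denver Broncos won"
# One entry per token; None marks a non-context token (e.g. [CLS] or the question).
offsets = [None, (0, 6), (7, 14), (15, 18)]
start_logits = [0.1, 5.0, 2.0, 0.5]
end_logits = [0.2, 1.0, 6.0, 0.3]

answers = best_spans(start_logits, end_logits, offsets, context)
print(answers[0])  # {'score': 11.0, 'text': 'Denver Broncos'}
```

The score is a sum of two logits rather than a probability, which is why the values in the real output above are not bounded by 1.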

In [29]:
import collections

examples = datasets["validation"]
features = validation_features

example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)

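# example_id_to_index: a dict that maps each example ID to its index in the validation set.

# features_per_example: a defaultdict(list) that groups feature indices by example. When one example yields several features (because of tokenization and truncation), they all land under the same key.

# The loop walks over features, looks up each feature's example index through example_id_to_index, and appends the feature's own index to that example's list.

# As a result, features_per_example maps an example's index to the list of indices of the features generated from it.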
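
The same grouping can be seen on a tiny made-up input. In this sketch the IDs `"a"`, `"b"`, `"c"` are placeholders, and the second example is assumed to have been split into two features.

```python
import collections

# Toy stand-ins: three examples; example "b" produced two features.
example_ids = ["a", "b", "c"]
feature_example_ids = ["a", "b", "b", "c"]

example_id_to_index = {k: i for i, k in enumerate(example_ids)}
features_per_example = collections.defaultdict(list)
for i, ex_id in enumerate(feature_example_ids):
    features_per_example[example_id_to_index[ex_id]].append(i)

print(dict(features_per_example))  # {0: [0], 1: [1, 2], 2: [3]}
```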

In [30]:
from tqdm.auto import tqdm

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map from each example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionary we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated with the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None # Only used if squad_v2 is True.
        valid_answers = []
        
        context = example["context"]
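        # Loop over all the features associated with the current example.
        for feature_index in feature_indices:
            # Grab the model's predictions for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This lets us map positions in the logits to spans of text in the original context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update the minimum null prediction.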
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score > feature_null_score:  # keep the smallest null score across features
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` highest start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
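                    # Skip out-of-scope answers, either because the indices are out of bounds or because they
                    # correspond to a part of the input_ids that is not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Skip answers with a negative length or a length greater than max_answer_length.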
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
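            # In the very rare case where we do not have a single non-null prediction, we create a fake prediction to avoid failure.
            best_answer = {"text": "", "score": 0.0}

        # Pick our final answer: the best one or the null answer (squad_v2 only).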
        if not squad_v2:
            predictions[example["id"]] = best_answer["text"]
        else:
            answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
            predictions[example["id"]] = answer

    return predictions
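
For SQuAD 2.0, the last branch of the function compares the best span score with the null score taken at the `[CLS]` position. Here is a toy illustration with made-up scores; `choose_answer` is a hypothetical helper that mirrors only that final comparison, not the rest of the post-processing.

```python
def choose_answer(best_answer, null_score):
    """Return the best span text, or "" when the null score wins."""
    return best_answer["text"] if best_answer["score"] > null_score else ""


confident = choose_answer({"text": "Denver Broncos", "score": 20.6}, null_score=3.2)
unanswerable = choose_answer({"text": "Broncos", "score": 1.5}, null_score=4.0)

print(repr(confident))     # 'Denver Broncos'
print(repr(unanswerable))  # ''
```

With `squad_v2 = False`, as in this notebook, the comparison is skipped and the best span is always returned.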


In [31]:
final_predictions = postprocess_qa_predictions(datasets["validation"], validation_features, raw_predictions.predictions)

Post-processing 10570 example predictions split into 10784 features.


  0%|          | 0/10570 [00:00<?, ?it/s]

In [32]:
import evaluate
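
# The squad metric files were downloaded beforehand from GitHub (https://github.com/huggingface/evaluate/tree/main) to a local folder,
# because loading a metric by name (for example "accuracy") goes online, and that fails when the Hub is not reachable from this machine.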

file_path = f'../../autodl-tmp/evaluate/metrics/{"squad_v2" if squad_v2 else "squad"}'

metric = evaluate.load(file_path)

print(f"Loaded metric: {metric}")

Loaded metric: EvaluationModule(name: "squad", module_type: "metric", features: {'predictions': {'id': Value(dtype='string', id=None), 'prediction_text': Value(dtype='string', id=None)}, 'references': {'id': Value(dtype='string', id=None), 'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None)}}, usage: """
Computes SQuAD scores (F1 and EM).
Args:
    predictions: List of question-answers dictionaries with the following key-values:
        - 'id': id of the question-answer pair as given in the references (see below)
        - 'prediction_text': the text of the answer
    references: List of question-answers dictionaries with the following key-values:
        - 'id': id of the question-answer pair (see above),
        - 'answers': a Dict in the SQuAD dataset format
            {
                'text': list of possible texts for the answer, as a list of strings
                'answer_start': list of start positio

In [33]:
if squad_v2:
    formatted_predictions = [{"id": k, "prediction_text": v, "no_answer_probability": 0.0} for k, v in final_predictions.items()]
else:
    formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in datasets["validation"]]
metric.compute(predictions=formatted_predictions, references=references)

{'exact_match': 75.30747398297068, 'f1': 84.00384093883733}
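
To give a feel for what `exact_match` and `f1` measure, here is a simplified re-implementation in the spirit of SQuAD v1.1 scoring. It is a sketch for illustration, not the code that `evaluate` actually runs; the helper names are hypothetical. Both scores are computed per question against every reference answer, the best one is kept, and the reported numbers are averages over all questions expressed as percentages.

```python
import collections
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize_answer(s):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in PUNCTUATION)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(prediction, ground_truth):
    return float(normalize_answer(prediction) == normalize_answer(ground_truth))


def f1_score(prediction, ground_truth):
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(ground_truth).split()
    # Multiset intersection counts the tokens shared by both answers.
    common = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(metric_fn, prediction, ground_truths):
    return max(metric_fn(prediction, gt) for gt in ground_truths)


refs = ["Denver Broncos", "Denver Broncos", "Denver Broncos"]
em = best_over_references(exact_match, "the Denver Broncos", refs)
f1 = best_over_references(f1_score, "Broncos", refs)
print(em)  # 1.0, because the article "the" is removed by normalization
print(f1)  # about 0.667: precision 1.0, recall 0.5
```

A partially correct span such as "Broncos" scores 0 on exact match but still earns F1 credit, which is why F1 is consistently higher than exact match in the results above.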

In [None]:
Note: after 10 epochs of training, the scores improved from {'exact_match': 74.66414380321665, 'f1': 83.41118380007063}
to {'exact_match': 75.30747398297068, 'f1': 84.00384093883733}.