# Getting Started with Fine-Tuning Using Hugging Face Transformers

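This example walks through the main steps of fine-tuning a model with Transformers, including:
- Downloading the dataset
- Preprocessing the data
- Configuring training hyperparameters
- Setting up evaluation metrics
- An introduction to the Trainer
- Hands-on training
- Saving the model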

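## The YelpReviewFull Dataset

**Hugging Face dataset: [YelpReviewFull](https://huggingface.co/datasets/yelp_review_full)**

### Dataset Summary

The Yelp reviews dataset consists of reviews from Yelp. It was extracted from the Yelp Dataset Challenge 2015 data.

### Supported Tasks and Leaderboards
Text classification, sentiment classification: the dataset is mainly used for text classification, that is, predicting the sentiment of a given text.

### Languages
The reviews are mainly written in English.

### Dataset Structure

#### Data Instances
A typical data point consists of a text and its corresponding label.

An example from the YelpReviewFull test set looks as follows: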

```json
{
    'label': 0,
    'text': 'I got \'new\' tires from them and within two weeks got a flat. I took my car to a local mechanic to see if i could get the hole patched, but they said the reason I had a flat was because the previous patch had blown - WAIT, WHAT? I just got the tire and never needed to have it patched? This was supposed to be a new tire. \\nI took the tire over to Flynn\'s and they told me that someone punctured my tire, then tried to patch it. So there are resentful tire slashers? I find that very unlikely. After arguing with the guy and telling him that his logic was far fetched he said he\'d give me a new tire \\"this time\\". \\nI will never go back to Flynn\'s b/c of the way this guy treated me and the simple fact that they gave me a used tire!'
}
```

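#### Data Fields

- 'text': The review text is escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".
- 'label': Corresponds to the review's score (between 1 and 5 stars). In the loaded dataset it is stored as a class index from 0 to 4, as the `'label': 0` in the example above shows.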
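The two field conventions above can be made concrete with a small sketch. It assumes the class names appear in index order as they do in this notebook's output ("1 star", "2 star", "3 stars", ...); the helper names `label_to_stars` and `unescape_newlines` and the shortened sample record are made up for illustration and are not part of the dataset or any library.

```python
# Class names in index order, as they appear in this notebook's output.
STAR_NAMES = ["1 star", "2 star", "3 stars", "4 stars", "5 stars"]


def label_to_stars(label: int) -> int:
    """Map a class index (0-4) to a star rating (1-5)."""
    if not 0 <= label < len(STAR_NAMES):
        raise ValueError(f"label must be in 0..4, got {label}")
    return label + 1


def unescape_newlines(text: str) -> str:
    """Turn the literal two-character sequence backslash + n into a real newline."""
    return text.replace("\\n", "\n")


# Hypothetical, shortened record in the same shape as a dataset row.
sample = {"label": 2, "text": "Nice place.\\n\\nI do wish there was more seating."}

label = sample["label"]
print(f"class index {label} is named {STAR_NAMES[label]!r}")
print(unescape_newlines(sample["text"]))
```

Whether to unescape newlines before tokenizing is a preprocessing choice; the rest of this notebook feeds the text to the tokenizer unchanged.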

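#### Data Splits

The Yelp reviews full star dataset is built by randomly taking 130,000 training samples and 10,000 test samples for each star rating from 1 to 5. In total there are 650,000 training samples and 50,000 test samples.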

## Downloading the Dataset

In [1]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")



In [2]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [3]:
dataset["train"][111]

{'label': 2,
 'text': "As far as Starbucks go, this is a pretty nice one.  The baristas are friendly and while I was here, a lot of regulars must have come in, because they bantered away with almost everyone.  The bathroom was clean and well maintained and the trash wasn't overflowing in the canisters around the store.  The pastries looked fresh, but I didn't partake.  The noise level was also at a nice working level - not too loud, music just barely audible.\\n\\nI do wish there was more seating.  It is nice that this location has a counter at the end of the bar for sole workers, but it doesn't replace more tables.  I'm sure this isn't as much of a problem in the summer when there's the space outside.\\n\\nThere was a treat receipt promo going on, but the barista didn't tell me about it, which I found odd.  Usually when they have promos like that going on, they ask everyone if they want their receipt to come back later in the day to claim whatever the offer is.  Today it was one of th

In [4]:
import random
import pandas as pd
import datasets
from IPython.display import display, HTML

In [5]:
def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [6]:
show_random_elements(dataset["train"])

Unnamed: 0,label,text
0,4 stars,"As soon I found out we would be staying at the Mirage, I had my heart set on hitting up the Carnegie Deli.\n\nRight after we arrived, I walked up to the to go counter and ordered the Woody Allen. This sandwhich was HUGE! After carrying it back to my room, I felt like I had just done some high intensity reps at the gym.\n\nDon't let the $20 price tag scare you. It was worth every penny.\n\nService was a little slow and rude but hey it's a New York-style deli. What else would you expect?"
1,2 star,"I don't get the positive reviews. Close to $7.00 with tax for a mediocre normal sized carne asada burrito. If the burrito was $4.50 it would be reasonable. It was Super in name only, only a Super fail. Extra star for the hot sauces that I liked."
2,2 star,"This is a two and a half hour \""tour\"" that begins at the Royal Resort on Convention Center Drive, and it covers much of the history of Las Vegas when the Mob had tremendous influence and controlled much.\n\nMuch of the tour consists of sitting on the bus in the shade of some buildings while the tour director describes mob activities and shows photos and video on the flat panel at the front of the bus. The first talk was about 15 minutes as we sat in the shade of the buildings west of the Royal Resort. We then took a circuitous route down Convention Center Drive, made a left on Paradise, went to Sahara where we made a right, turned right on Joe W. Brown and then parked in the shade of the Las Vegas Hilton near its Sports Book entrance. From there we went back down Joe W. to karen to Maryland to Sahara to Tony Roma's where Frank Rosenthal survived a car bombing attempt on October 4, 1982. See what I mean about how the \""tour\"" goes. For a local, it was awful to see the roundabout way we were going. As a tourist, I might not have noticed.\n\nThe material offered during the tour is excellent. There just isn't enough to see on the tour. The only place we got out of the bus and toured anything was the last place, the Flamingo Hotel, where we walked to Benjamin \""Bugsy\"" Siegel's Garden. Now that I know where it is I'll be back to see it in daylight.\n\nWe were fortunate to have as our narrator Dennis Griffin, the author of several books. He has a lot of knowledge.\n\nI'm glad I had a Groupon for this experience."
3,1 star,"Gosh, I don't see how this place can stay open for much longer. We should have realized that this place was going to suck since it looked empty. We gave it a shot anyway.\n\nI can say the escargot was good, as was the prime rib. But that's about it. Sure sure they bring you a big plate of shrimp and crab legs, but you can get this at other buffets. The deserts were mediocre to sad. There aren't very many choices. Sure it looks like a lot, but it's all the same thing spaced out to look like there is more variety than what's actually there.\n\nThe utensils still had food on them.YES, FOOD! When we asked the waitress to bring us more forks for the crab legs she said something like \""I've been telling him.\"" WHO is she telling!? I thought SHE was our waitress. I felt like she did not really care what we had to say considering there were 9 of us, so her tip was already included. Her english was so bad we couldn't understand her AT ALL. Being the only multilingual person at the table everyone looked at me. \nFriends: What did she say?\nMe: Huh?\nFriends: What was she saying?\nMe: How the h*ll should I know. She's speaking Vietnamese and I don't understand Vietnamese. ( non pun intended to my Vietnamese friend who was sitting @ the table with me). \n\nWhat really did it she told my friend that 2 beers were $22, and when the bill came ( seperate for alcohol drinks) it was $16. When he asked her about the price she tried to say that she told him that it was $16.22\nShe did not say that at all.\n\nAfter all these things we will NEVER be back to this place again. Advise: MAKE SURE YOUR UTENSILS DON'T HAVE FOOD STILL STUCK ON THEM ( I'm sure you do that anyway, right?)"
4,2 star,"The Pleasure Pit and Earl of Sandwich would be my favorite highlights. The hotel is connected to the Miracle Mile shops which is convenient and I like that the hotel is pretty central on the strip. I am an asthmatic and the smoke seems so much worse here than other hotels/casinos. Bring your eye drops. PH should invest in some sort of ventilation system. The decor seems outdated. The carpets in the halls need stretched and the rooms' bathrooms redone. The bathrooms are large with a tub, but who uses a tub in a hotel anyway. I would rather a double sink or vanity area to make getting ready easier. Also there is no closet only a small armoire. I thought the room was good. It was clean and the beds comfortable. It has the perfect amount of amenities since you spend almost no time in there. I wish the rooms' windows were larger and more centered on the wall so you can see the view. The pool is pretty private (on the 6th floor) and I like that they have two separate pool areas; one is 21+ and the other a family pool. The adult pool has a DJ most days of the week but it's still a far cry from the other party pools. This is more of a pool to relax. I was shocked that for such a lack luster experience you have to pay so much for beds/cabanas at the pool. They were running a spa deal for $99 if you choose at least $200 in services. With this you also get to spend the day using the other spa amenties (gym, hot tub, showers, steam room and sauna). The deal seemed good but after my facial I only stayed an hour because I ran out of things to do. What they should do is get rid of Koi (Its so bad they give out free open bar all night every night to everyone- males included) and revamp the spa area. It is outdated. Although I haven't been, I heard Aria has great spa amenities such hot stone beds, a pool, etc. and the things I mentioned above. The gym at PH costs $25 a day and I'm not sure why. It has less equipment than a Holiday Inn."
5,5 stars,"I know it may be hard to combine 5 star and southwest cuisine, but Mesa Grill has it all. Upscale decor, good staff, great ambiance, and a taste to die for (if you like gourmet flavorful southwestern foods). Had the wild mushroom appetizer - great. If you like grits the blue corn grits is to die for. We had the halibut - great; the wild and farm raised salmon - just as great; 16 spiceed chicken - great; and a fall back of steak with green and red pepper sauce - great; and the pork tenderloin - also great. The red and green southwest peppers sauces were fantastic. Las Vegas can buy anything and they spent their money wisely to bring Bobby Flay to Caesers."
6,1 star,"I have visited this casino several times through the years, most recently being on memorial day weekend--2014. It is a fun and neat little casino. The dice dealers are great. When making my reservations for memorial day weekend, I initially wanted to stay here and try it out. I discovered that they charge a resort fee of $20.00 per night. This is a small hotel casino and I don't think anyone would classify this place as a resort. They don't even have a pool. We ended up staying at another hotel/casino on Fremont Street that does not believe in this \""resort fee\"" scam. This is a shame that they are charging this ridiculous fee. This is probably why the hotel has so many empty rooms during mid-week. The casino is really cool though. If they ever stop this \""resort fee\"" nonsense, we will gladly stay here."
7,4 stars,"Saturday lunch on the Square - a very pleasant option with NO waiting, a low noise level and great burgers!\nThe menu says ALL burgers are 8 dollars. note - there are no sides included, so the burger comes looking a little bit lonely on the plate. but that is OK.\nThe Salmon burger with wasabi mayo and a jicama/carrot slaw was delish. The bun was mildly whole wheat and slightly warmed.\nThe Bloody Mary was VERY GOOD- and very pretty, with a sword of olives and pickles. I called that my salad!!\nI will go back to try more burger choices and some of the sides, like the sweet potato fries seen at a neighboring table. they looked awesome.\nService was quick and pleasant. \nThe space is modern and well- Spacious.\nThe \""game\"" was on - but the sound was OFF - Thank you so much!! I confess I am one of those people in this town who is NOT totally into the football thing."
8,3 stars,"Went with a group of co-workers for lunch. Patio was didn't have enough space for our large group as there were only a few tables open so we sat inside. The service was relatively slow but done with a smile. Our drinks took awhile to arrive and they screwed up two of our orders, one minor and one major that caused that person to get their food very late. They were apologetic but didn't offer anything in the way of discount or free dessert.\n\nSeveral of us ordered the mac and cheese burrito because it sounded amazing. Was tasty, but very filling. The chips and salsa was tasty and while we ordered just one portion it was enough for the whole table to try some. Salsa was tangy and had a little heat. Most of the group enjoyed their food although some at the other of the end of the table were not as pleased but I couldn't speak to why.\n\nOverall not a bad place to eat but not good if you are in a time crunch. Probably won't be back for awhile but I can see really wanting to have that burrito again some day."
9,5 stars,"Spent more with them over the years than I care to remember. Always good selection and service. Something for almost any budget, provided you're willing to walk next door to the lower-range Denmarket subsidiary. It's the first place I think of shopping for furniture because whenever I've tried someplace else, something has happened. \n\nGet on their mailing list for their annual (or semi-annual) warehouse sale. I've found some good stuff there with slight wear or needing just minor reconditioning."


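## Preprocessing the Data

After downloading the dataset locally, use a Tokenizer to process the text. Because the inputs vary in length, padding and truncation strategies are used to bring every sequence to a uniform length.

The Datasets `map` method applies a preprocessing function to the entire dataset in one pass.

The following pads every example to the maximum length (for `bert-base-cased`, the tokenizer's model maximum of 512 tokens) and processes the whole dataset: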

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)

In [8]:
show_random_elements(tokenized_datasets["train"], num_examples=1)

Unnamed: 0,label,text,input_ids,token_type_ids,attention_mask
0,2 star,"OMG, how far this place has fallen. Last time I ate here about 5 yrs ago it was great. Now it sucks. Had to wait about 40 min to get in. I give it two stars mostly for the deserts. The flan and creme brulee were great.","[101, 152, 14666, 117, 1293, 1677, 1142, 1282, 1144, 4984, 119, 4254, 1159, 146, 8756, 1303, 1164, 126, 194, 1733, 2403, 1122, 1108, 1632, 119, 1986, 1122, 22797, 119, 6467, 1106, 3074, 1164, 1969, 11241, 1106, 1243, 1107, 119, 146, 1660, 1122, 1160, 2940, 2426, 1111, 1103, 6941, 1116, 119, 1109, 22593, 1389, 1105, 172, 16996, 1162, 9304, 8722, 1162, 1127, 1632, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]"
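The trailing zeros in `input_ids` and `attention_mask` above come from padding. As a rough illustration of the idea (not the tokenizer's actual implementation), the sketch below pads or truncates a list of token ids to a fixed length and builds the matching attention mask. The function name `pad_or_truncate` and the short id sequence are made up; the ids 101, 102, and 0 are BERT's `[CLS]`, `[SEP]`, and `[PAD]` tokens. A real tokenizer also keeps the special tokens in place when truncating, which this sketch does not attempt.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids)[:max_length]   # truncation: drop anything past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_length - len(ids)      # number of pad positions to append
    return ids + [pad_id] * padding, mask + [0] * padding


# Made-up short sequence: [CLS] ... [SEP]
short = [101, 152, 14666, 117, 102]
ids, mask = pad_or_truncate(short, max_length=8)
print(ids)   # [101, 152, 14666, 117, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The attention mask is what lets the model ignore the padded positions, so padding changes the tensor shape without changing what the model attends to.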


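### Sampling the Data

Use 1,000 samples to demonstrate small-scale training on BERT (based on the PyTorch Trainer).

`shuffle()` randomly reorders the rows of the dataset. For more control over the shuffling algorithm, pass the `generator` argument to this function to use a different `numpy.random.Generator`.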

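In [9]:
# Small-scale option (disabled in this run): keep 1,000 shuffled samples per split.
# small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
# small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

In [10]:
# This run uses the full train and test splits; the `small_` names are kept so later cells work unchanged.
small_train_dataset = tokenized_datasets["train"]
small_eval_dataset = tokenized_datasets["test"]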
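To see what the `shuffle(seed=42).select(range(1000))` pattern in the commented-out cell does conceptually, here is a plain-Python sketch: permute the row indices with a seeded generator, then keep the first `n`. The helper `shuffle_and_select` and the toy rows are hypothetical, and the sketch uses Python's `random` module rather than the NumPy generator that Datasets uses, so the rows it picks will not match the library's.

```python
import random


def shuffle_and_select(rows, n, seed):
    """Return n rows drawn from a seeded random permutation of rows."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)   # seeded, so the order is reproducible
    return [rows[i] for i in indices[:n]]


# Toy stand-in for a dataset split.
rows = [{"label": i % 5, "text": f"review {i}"} for i in range(20)]
subset = shuffle_and_select(rows, n=5, seed=42)
print([r["text"] for r in subset])
```

Fixing the seed is what makes the subsample repeatable across runs, which matters when comparing hyperparameter settings on the same 1,000 examples.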

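## Fine-Tuning Configuration

### Loading the BERT Model

The warning below tells us that some weights are randomly initialized (`classifier.weight` and `classifier.bias`). This is entirely normal when fine-tuning: the head used for the pretraining objective (masked language modeling) is dropped and replaced with a new sequence classification head, for which there are no pretrained weights. The library therefore warns that the model should be fine-tuned before it is used for inference, which is exactly what we are about to do.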

In [11]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Training Hyperparameters (TrainingArguments)

Full list of parameters and their defaults: https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/trainer#transformers.TrainingArguments

Source definition: https://github.com/huggingface/transformers/blob/v4.36.1/src/transformers/training_args.py#L161

**The most important setting: the path where model weights are saved (`output_dir`)**

In [12]:
from transformers import TrainingArguments

model_dir = "models/bert-base-cased-finetune-yelp"

# logging_steps defaults to 500; given our training data and step count, set it to 100
training_args = TrainingArguments(output_dir=model_dir,
                                  per_device_train_batch_size=16,
                                  num_train_epochs=5,
                                  logging_steps=100)

In [13]:
# Full hyperparameter configuration
print(training_args)

TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=

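### Evaluating Metrics During Training (Evaluate)

**[The Hugging Face Evaluate library](https://huggingface.co/docs/evaluate/index)** gives access to dozens of evaluation methods across domains (natural language processing, computer vision, reinforcement learning, and more) with a single line of code. The full list of currently supported metrics: **https://huggingface.co/evaluate-metric**

The Trainer does not automatically evaluate model performance during training, so we need to pass it a function that computes and reports metrics.

The Evaluate library provides a simple accuracy metric, which you can load with the `evaluate.load` function: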

In [14]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")


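Next, call the metric's `compute` function to calculate the accuracy of the predictions.

Before passing predictions to `compute`, we need to convert the logits into predictions (**all Transformers models return logits**).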

In [15]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
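The logits-to-accuracy step can be checked by hand on a toy batch. The sketch below uses NumPy only and computes accuracy directly instead of calling the Evaluate metric; the logits and labels are made-up numbers for four reviews over the five star classes.

```python
import numpy as np

# Toy logits for 4 reviews over the 5 star classes (made-up numbers).
logits = np.array([
    [2.0, 0.1, -1.0, 0.3, 0.0],   # largest value at index 0
    [0.2, 0.1, 3.5, 0.3, 0.0],    # index 2
    [0.0, 0.1, 0.2, 0.3, 4.0],    # index 4
    [0.5, 1.5, 0.2, 0.3, 0.0],    # index 1
])
labels = np.array([0, 2, 3, 1])

predictions = np.argmax(logits, axis=-1)          # class with the largest logit per row
accuracy = float((predictions == labels).mean())  # fraction of matching predictions

print(predictions)  # [0 2 4 1]
print(accuracy)     # 0.75
```

Three of the four predictions match their labels, hence 0.75. Taking the argmax over the last axis works on raw logits because softmax preserves their ordering.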

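#### Monitoring Metrics During Training

To monitor how the evaluation metrics change during training, set the `evaluation_strategy` parameter in `TrainingArguments` so that metrics are reported at the end of each epoch.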

In [16]:
import os

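# Check whether a previously saved checkpoint exists before trying to resume from it.
checkpoint_path = "models/bert-base-cased-finetune-yelp/checkpoint-63000"
if os.path.exists(checkpoint_path):
    print("Checkpoint exists; training can resume.")
else:
    print("Checkpoint path is wrong and needs to be fixed.")

Checkpoint path is wrong and needs to be fixed.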


In [17]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir=model_dir,
                                  evaluation_strategy="epoch", 
                                  per_device_train_batch_size=16,
                                  num_train_epochs=3,
                                  logging_steps=200)
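It helps to estimate how many optimizer steps this configuration implies before launching a long run. The back-of-the-envelope sketch below assumes the full 650,000-sample train split, a single GPU (`_n_gpu=1` in the printed arguments), and `gradient_accumulation_steps=1`; the Trainer's own step accounting may differ in edge cases, so treat the numbers as an estimate.

```python
import math

train_samples = 650_000   # size of the full train split
batch_size = 16           # per_device_train_batch_size
num_devices = 1           # assumption: single GPU (_n_gpu=1 in the printed arguments)
grad_accum = 1            # gradient_accumulation_steps in the printed arguments
epochs = 3                # num_train_epochs
logging_steps = 200

steps_per_epoch = math.ceil(train_samples / (batch_size * num_devices * grad_accum))
total_steps = steps_per_epoch * epochs
log_events = total_steps // logging_steps

print(steps_per_epoch)  # 40625
print(total_steps)      # 121875
print(log_events)       # 609
```

Under these assumptions, a step count such as the 63000 in the `checkpoint-63000` path checked earlier would fall partway through the second epoch.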

## Starting Training

### Instantiating the Trainer

`kernel version` warning: it does not currently affect running this example code.

In [18]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


## Checking GPU Usage with nvidia-smi

To watch GPU usage in real time, poll with the `watch` command: `watch -n 1 nvidia-smi`:

```shell
Every 1.0s: nvidia-smi                                                   Wed Dec 20 14:37:41 2023

Wed Dec 20 14:37:41 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:0D.0 Off |                    0 |
| N/A   64C    P0              69W /  70W |   6665MiB / 15360MiB |     98%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     18395      C   /root/miniconda3/bin/python                6660MiB |
+---------------------------------------------------------------------------------------+
```

In [None]:
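# Resumes from the latest checkpoint in output_dir; raises an error if none exists (use trainer.train() for a fresh run).
trainer.train(resume_from_checkpoint=True)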

Epoch,Training Loss,Validation Loss
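Since the earlier check reported that the expected checkpoint folder was missing, it can be safer to look up the latest checkpoint rather than hard-coding a step number. The sketch below is a standard-library illustration that assumes checkpoints live in `checkpoint-<step>` subfolders of the output directory, as in the path used earlier. The helper `find_last_checkpoint` is hypothetical; Transformers ships a similar utility, `get_last_checkpoint` in `transformers.trainer_utils`, which is likely the better choice in real code.

```python
import os
import re
import tempfile

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir):
    """Return the checkpoint-<step> subfolder with the highest step, or None."""
    if not os.path.isdir(output_dir):
        return None
    steps = []
    for name in os.listdir(output_dir):
        match = CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            steps.append(int(match.group(1)))
    if not steps:
        return None
    return os.path.join(output_dir, f"checkpoint-{max(steps)}")


# Demo on a throwaway directory so no real training output is touched.
with tempfile.TemporaryDirectory() as demo_dir:
    empty_result = find_last_checkpoint(demo_dir)
    for step in (500, 1000, 63000):
        os.makedirs(os.path.join(demo_dir, f"checkpoint-{step}"))
    last_name = os.path.basename(find_last_checkpoint(demo_dir))

print(empty_result)  # None
print(last_name)     # checkpoint-63000
```

With a helper like this, the training call could be written as `trainer.train(resume_from_checkpoint=find_last_checkpoint(model_dir))`: a path resumes from that checkpoint, while `None` starts a fresh run.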


In [None]:
small_test_dataset = tokenized_datasets["test"].shuffle(seed=64).select(range(100))

In [None]:
trainer.evaluate(small_test_dataset)

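### Saving the Model and Training State

- Use the `trainer.save_model` method to save the model; it can later be reloaded with the `from_pretrained()` method
- Use the `trainer.save_state` method to save the training state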

In [None]:
model_dir_2 = "models/bert-base-cased-finetune-yelp-result"
trainer.save_model(model_dir_2)

In [None]:
trainer.save_state()

In [None]:
# trainer.model.save_pretrained("./")

## Homework: Train on the full YelpReviewFull dataset and see how high the accuracy can get