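# Getting Started with Fine-Tuning Using Hugging Face Transformers

This example walks through the main steps of fine-tuning a model with Transformers:
- Downloading the dataset
- Preprocessing the data
- Configuring training hyperparameters
- Setting up evaluation metrics
- A brief introduction to the Trainer
- Running the training
- Saving the model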

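## The YelpReviewFull Dataset

**Hugging Face dataset: [YelpReviewFull](https://huggingface.co/datasets/yelp_review_full)**

### Dataset Summary

The Yelp reviews dataset consists of reviews from Yelp. It was extracted from the Yelp Dataset Challenge 2015 data.

### Supported Tasks and Leaderboards
Text classification, sentiment classification: the dataset is mainly used for text classification, that is, predicting the sentiment of a given text.

### Languages
The reviews are written mainly in English.

### Dataset Structure

#### Data Instances
A typical data point consists of a text and its corresponding label.

An example from the YelpReviewFull test set: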

```json
{
    'label': 0,
    'text': 'I got \'new\' tires from them and within two weeks got a flat. I took my car to a local mechanic to see if i could get the hole patched, but they said the reason I had a flat was because the previous patch had blown - WAIT, WHAT? I just got the tire and never needed to have it patched? This was supposed to be a new tire. \\nI took the tire over to Flynn\'s and they told me that someone punctured my tire, then tried to patch it. So there are resentful tire slashers? I find that very unlikely. After arguing with the guy and telling him that his logic was far fetched he said he\'d give me a new tire \\"this time\\". \\nI will never go back to Flynn\'s b/c of the way this guy treated me and the simple fact that they gave me a used tire!'
}
```

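#### Data Fields

- 'text': The review text is escaped using double quotes ("), and any internal double quote is escaped as two double quotes (""). Newlines are escaped as a backslash followed by an "n" character, i.e. "\n".
- 'label': The review score, from 1 to 5 stars. It is stored as a class index from 0 to 4, so `'label': 0` in the example above corresponds to 1 star.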
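
As a small illustration of the label field, the sketch below maps a class index to its star score. The `LABEL_NAMES` list is an assumption reconstructed from the label names that appear in the notebook outputs further down (`1 star`, `2 star`, `3 stars`, and so on); it is not read from the dataset itself, and `label_to_stars` is a hypothetical helper.

```python
# Label names as they appear in the notebook outputs (assumed order: index 0 -> "1 star").
LABEL_NAMES = ["1 star", "2 star", "3 stars", "4 stars", "5 stars"]


def label_to_stars(label: int) -> int:
    """Convert a 0-based class index into a 1-to-5 star score."""
    if not 0 <= label < len(LABEL_NAMES):
        raise ValueError(f"label must be in 0..{len(LABEL_NAMES) - 1}, got {label}")
    return label + 1


# The test-set example above has label 0.
print(label_to_stars(0), LABEL_NAMES[0])
```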

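#### Data Splits

The Yelp reviews full star dataset is constructed by randomly taking 130,000 training samples and 10,000 test samples for each star rating from 1 to 5. In total there are 650,000 training samples and 50,000 test samples.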
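
The totals follow directly from the per-class counts. A quick check, using only the numbers stated above:

```python
NUM_CLASSES = 5            # star ratings 1 to 5
TRAIN_PER_CLASS = 130_000
TEST_PER_CLASS = 10_000

train_total = NUM_CLASSES * TRAIN_PER_CLASS
test_total = NUM_CLASSES * TEST_PER_CLASS
print(train_total, test_total)  # 650000 50000
```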

## Downloading the Dataset

In [1]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")


In [2]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [3]:
dataset["train"][12]

{'label': 3,
 'text': "I drove by yesterday to get a sneak peak.  It re-opens on July 14th and I can't wait to take my kids.  The new range looks amazing.  The entire range appears to be turf, which may or many not help your game, but it looks really nice.  The tee boxes look state of the art and the club house looks like something you'll see on a newer course.  Can't wait to experience it!"}

In [4]:
import random
import pandas as pd
import datasets
from IPython.display import display, HTML

In [5]:
def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [6]:
show_random_elements(dataset["train"])

Unnamed: 0,label,text
0,2 star,"I really wanted to like this place - went here based off of the reviews it had during our trip to Vegas. However, it was very mediocre and I found myself saying what they could have done to everything to make it just a bit better. One star for using real blueberries in their pancakes, and one star for the service. Our waitress was so nice with us even when they lost our order, it didn't even seem like a big deal to us when she told us because she was so apologetic. The only good thing I did eat was the pumpkin pancakes. The whipped cream on top was actually one of the best things I ate here. \n\nTheir \""cream cheese\"" syrup was meh. They should use more cream cheese in that syrup to keep the tang of the cheese. One fork dip for the testing and that was all I needed. They should also think about using real maple syrup...... Just sayin'. The hashbrowns were so generously pulled out of the freezer, browned a bit on each side with some thick grease, and put on a plate. Real tasty and this is me being real sarcastic. Do yourself a favor and go somewhere else."
1,5 stars,"Wonderful, locally-grown, organic food. An incredible gem in the Charlotte area. Service is warm, and the food is outstanding. Love this place. Jessica and Luca (owners) are sweet, caring people. Try it. You will not be dissapointed."
2,3 stars,"I like the laid back atmosphere, playing one of my favs ben harper. Great beer selection, food was pretty good. I got the ruben I liked it wasn't out of this world good, but if your in the neighborhood worth a try."
3,4 stars,"The food at this place is really good. We stopped by here not long after it opened, so I'm sure quality control was extremely stringent. The wait staff was attentive and very friendly. The dining area is large, and the plates are all very gourmet (small plate style). We came here for the brunch. Bottomless mimosa is available for ten dollars more. Every dish here was tasty (for buffet caliber foods), and not a single one disappointed me. I still think the Wynn is my favorite, but this is a strong contender."
4,2 star,"The food was fine, but my problem was with the service. \nOur waiter was hard to catch, and I ended up having to get help from the sommelier twice because he was the only one who checked on us. \nThe sauce that came with my steak had cilantro in it, which I dislike. I asked if they could make it without the cilantro, and the waiter was pretty rude and told me that they don't make the sauce fresh, so they can't make it without any ingredients. I would think an expensive restaurant like this would make their sauces fresh and would accommodate patrons.\nI agree with everyone else-- the music was too loud. I couldn't carry on a conversation with my guests at all. \nIf you're looking for a good steakhouse, try the one at Circus Circus. If you want to spend a bunch of money on a mediocre dinner so you can brag to your friends, try Jean-Georges."
5,4 stars,"An extra star for letting allowing us to sit in close proximity to the bar with my 2.5 year old and getting the happy hour rate for apps. Score! I was at the venetian for a conference and decided to meet up with some peers for a post conference drink and pre-dinner nosh. Try the horchata margeritia it was a party in your mouth. Service was prompt,friendly and helpful. LIke many people stated the cerviche was fresh and clean."
6,1 star,"Love the frozen hot chocolates but the service was really bad for us!! We sat down and waited for ten minutes and had to flag someone down to ask to take our order. \""I'll find a server for you.\"" was his response. A family sat down right after and a server tended to them right away. We looked around and noticed that we were the minorities. Id hate to pull the race card but thats how I'm perceiving it. Disappointed is what I am."
7,4 stars,"Note: I only give out 5 stars to surprisingly EXCEPTIONAL, best-in-class establishments.\n\nThis \""man store\"" has it all, well, most of it. It has anything related to fishing & hunting. Prices are good & reasonable, but sure, there might be a deal or two out there that's cheaper via the internet. This location reminds me of a cross between an outdoor sports store, a boat dealership, the Food Network and Disneyland... except for men. \n\nTo possibly receive a 5-star rating: \n1. The tactical gear/clothing section could use a little work. I know Bass Pro Shops is more geared for outdoor sports, but since you carry a tactical gear section, it should be the best if that's what you're aiming to be. \n2. Add a fee-based indoor shooting range. It's a good way for you to bring in extra $$ while also keeping people around to buy your stuff. Plus, it makes sense since you sell firearms, ammunition, accessories and targets. \n3. Expand your horizons/knowledge of outdoor recreational activities. Include things like climbing gear (think REI). You could really do some damage if your company merged with REI or at least took some pages out of their book. \n4. Offer a loyalty-based \""annual dividend\""/discount program like the one REI has. Combined with their legendary customer service and return program, they're my #1 go-to place for outdoor equipment. You're missing out.\n5. Some of the gear you offer is for amateur cheapskates. Offer some higher-end products. Take a look at the lifejacket/PFD department, for instance. All that stuff sucks, and is only for someone who is either really cheap or doesn't care about what they're buying. On that note, the swimming goggle section sucks, too. I was there to check out some Columbia shirts, then decided to look around for other things. Since I was looking for some good swimming goggles, I stopped by that section, but quickly left when I realized that it sucks. Again, you're missing out on sales."
8,1 star,"We came here for like 5 times before everything was awesome!. BUT las night we ate there 5 ppl and we ALL feeling sick and THROWING UP all night long, we will probably go to UMC but you need to be carful from this place!!\nMaybe it's time for the health department to check this place!"
9,4 stars,"The food was absolutely wonderful. I ordered a cup of their soup of the day ($6), which happened to be a beef and barley soup. This soup totally blew me away. The broth was so herby and flavorful, I couldn't bare to let a single drop remain in the bowl when I was done. There was tons of barley, good sized chunks of beef (but still bite sized), and some other vegetables in the mix. I kept eating it slowly because I wanted to make sure I savored each bite. I cannot compliment this soup enough.\nMy main course was the salmon served with potatoes, arugula and asparagus ($32). I ordered the salmon medium well, and it came out perfect. A light crunch to the outside, and tender, juicy meat on the inside, and it was seasoned perfectly. The vegetables were delicious and cooked perfectly. They had the right balance of tenderness with crunch. I paired the meal with a glass of riesling ($11), which could have been sweeter, in my opinion (I love the really sweet wines), but it was still a nice wine.\nThe atmosphere is pleasant, especially with the decor of the panes of leaves and flowers throughout the restaurant. The hostesses were pleasant and even pulled out a chair for me when we got to the table. Our server was also nice and helpful, and didn't forget anything, although he did give us the wrong check the first time (it was the check from the table directly next to us), he amended it quickly and with a sincere apology. It's a restaurant I would come back to because the food was just amazing. I could eat that soup every day for a week and I wish every piece of salmon I ever order would taste like that salmon. It was perfection."


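## Preprocessing the Data

After the dataset has been downloaded locally, a tokenizer is used to process the text. Inputs of unequal length can be handled with padding and truncation strategies.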
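
To make the two strategies concrete, here is a toy sketch of padding and truncating a list of token ids to a fixed length. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are illustrative, `pad_id=0` is an assumption, and unlike a real tokenizer it does not re-append special tokens such as `[SEP]` after truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids)[:max_length]     # truncation: drop tokens past max_length
    mask = [1] * len(ids)                  # 1 marks a real token
    padding = max_length - len(ids)
    # padding: fill the remainder with pad_id, marked 0 in the attention mask
    return ids + [pad_id] * padding, mask + [0] * padding


short_ids, short_mask = pad_or_truncate([101, 1188, 1282, 102], max_length=6)
long_ids, long_mask = pad_or_truncate([101, 1188, 1282, 1108, 1385, 102], max_length=4)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, which is why the tokenized output later in this section carries an `attention_mask` column alongside `input_ids`.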

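The `map` method of Datasets applies a preprocessing function to an entire dataset in one call.

Below, the whole dataset is processed with the pad-to-max-length strategy: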

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map: 100%|██████████| 50000/50000 [00:19<00:00, 2510.19 examples/s]


In [8]:
show_random_elements(tokenized_datasets["train"], num_examples=3)

Unnamed: 0,label,text,input_ids,token_type_ids,attention_mask
0,4 stars,"This place was old school when it old school was in session. I can't spot any changes since it was one of the original Cork N' Cleavers - maybe new carpet, but that may be it.\n\nThe salad bar is so old school, it still has jello - not jello shots - jello. And since, I'm old school, I'm a fan of the salad bar.\n\nMonday 4 buck burgers with 2.50 salad bar is the only food you'll require for the whole day. Decent burger too.\n\nGood barkeeps - they remember your name or call you hon (old school).\n\nThe food is pretty basic, but they start with decent ingredients and prepare them well. Had the prime rib a couple of weeks ago and my view is that the only way to eat prime rib is to have it breath it's last on the way from the kitchen to the table. It was early enough in the evening that they were able to deliver a tasty hunk of meat that was as requested.\n\nIf you're into modern, contempo kinds of joints, Feeney's won't be for you - but if you're in the mood for dark, old school, slightly seedy (in a good way) - then Feeney's might just fit the bill.","[101, 1188, 1282, 1108, 1385, 1278, 1165, 1122, 1385, 1278, 1108, 1107, 4912, 119, 146, 1169, 112, 189, 3205, 1251, 2607, 1290, 1122, 1108, 1141, 1104, 1103, 1560, 8711, 151, 112, 140, 19094, 10704, 118, 2654, 1207, 10797, 117, 1133, 1115, 1336, 1129, 1122, 119, 165, 183, 165, 183, 1942, 4638, 19359, 2927, 1110, 1177, 1385, 1278, 117, 1122, 1253, 1144, 179, 13323, 118, 1136, 179, 13323, 6981, 118, 179, 13323, 119, 1262, 1290, 117, 146, 112, 182, 1385, 1278, 117, 146, 112, 182, 170, 5442, 1104, 1103, 19359, 2927, 119, 165, 183, 165, 183, 2107, 25323, 1183, 125, 171, ...]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"
1,5 stars,"This hotel resort is perfection down to the staff! We first got in and were starving! The only thing open was the Sanza bar: I ordered the nachos and my new hubby got the Cuban sandwich! Both were amazing! We took the rest of the nachos upstairs and ate them for the next day or so! \nThe next day (Tuesday) I was treated with a massage! My first one in my whole life! We are on our honeymoon and my bday is today 8/28. THE BEST EXPERIENCE! I now have serious standards. It was $100 which included double the tip for 50 mins! Well worth it!\nThe next morning we decided to get the breakfast Buffett! Worth every penny! The waiters, the omelet cook, the food, everything was FANTASTIC! \nThe pool/beach area including the ladies working the towel station exceeded my expectations!\nAdditionally, they treated us extra well with complimentary items because we were on our honeymoon and my birthday! We are still here and we don't want to leave! I could ramble more, this is my first ever review on yelp, but just had to share!\nAdvice: rent a car, go to Grand Canyon (6 hours round trip including tours), utilize staff!","[101, 1188, 3415, 8037, 1110, 17900, 1205, 1106, 1103, 2546, 106, 1284, 1148, 1400, 1107, 1105, 1127, 20285, 106, 1109, 1178, 1645, 1501, 1108, 1103, 1727, 3293, 2927, 131, 146, 2802, 1103, 9468, 8401, 1116, 1105, 1139, 1207, 10960, 2665, 1400, 1103, 9383, 14327, 106, 2695, 1127, 6929, 106, 1284, 1261, 1103, 1832, 1104, 1103, 9468, 8401, 1116, 8829, 1105, 8756, 1172, 1111, 1103, 1397, 1285, 1137, 1177, 106, 165, 183, 1942, 4638, 1397, 1285, 113, 9667, 114, 146, 1108, 5165, 1114, 170, 26088, 106, 1422, 1148, 1141, 1107, 1139, 2006, 1297, 106, 1284, 1132, 1113, 1412, 25619, 1105, 1139, ...]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"
2,4 stars,"We stopped by this past weekend to check out the Peace Love & Hoppyness party. We visited it too last year and had a great time so we were looking forward to it again. It's a pretty reasonable beer garden. For $30, you received a glass pint glass and $30 tokens for beer. This year featured 50 beers to choose from but a majority of them were IPAs or heavier. I don't mind a super heavy/dark beer once in a while but it's difficult for most people to spend an evening consuming IPAs. Another thing was that 2 of the lighter beers were out early in the night further limiting the choices. \n\nOverall it was a good time. The bands were at a reasonable volume, it was outside, eclectic group of people and a good price for a Saturday night out.","[101, 1284, 2141, 1118, 1142, 1763, 5138, 1106, 4031, 1149, 1103, 5370, 2185, 111, 12965, 5005, 1757, 1710, 119, 1284, 3891, 1122, 1315, 1314, 1214, 1105, 1125, 170, 1632, 1159, 1177, 1195, 1127, 1702, 1977, 1106, 1122, 1254, 119, 1135, 112, 188, 170, 2785, 9483, 5298, 4605, 119, 1370, 109, 1476, 117, 1128, 1460, 170, 2525, 10473, 1204, 2525, 1105, 109, 1476, 22559, 1116, 1111, 5298, 119, 1188, 1214, 2081, 1851, 23147, 1106, 4835, 1121, 1133, 170, 2656, 1104, 1172, 1127, 27925, 1116, 1137, 12163, 119, 146, 1274, 112, 189, 1713, 170, 7688, 2302, 120, 1843, 5298, 1517, 1107, 170, ...]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"


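### Sampling the Data

We use 1,000 samples to demonstrate small-scale training on BERT (using the PyTorch-based Trainer).

The `shuffle()` method randomly reorders the rows of the dataset. For more control over the shuffling algorithm, pass the `generator` argument to use a different `numpy.random.Generator`.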

In [9]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
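
The idea behind `shuffle(seed=...)` followed by `select(range(k))` can be sketched with the standard library: build a seeded permutation of the row indices, then keep the first `k`. This is only an illustration of why a fixed seed gives a reproducible subset. It is not the Datasets implementation, and the indices it picks will differ from those chosen by `Dataset.shuffle`; `shuffle_and_select` is a hypothetical helper.

```python
import random


def shuffle_and_select(num_rows, k, seed):
    """Return k distinct row indices taken from a seeded permutation of range(num_rows)."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)   # same seed -> same permutation
    return indices[:k]


first = shuffle_and_select(num_rows=50_000, k=1_000, seed=42)
second = shuffle_and_select(num_rows=50_000, k=1_000, seed=42)
print(first == second, len(set(first)))  # True 1000
```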

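## Fine-Tuning Configuration

### Loading the BERT Model

Loading the model prints a warning that some weights (here `classifier.weight` and `classifier.bias`) were not initialized from the checkpoint and are newly, randomly initialized. This is entirely normal when fine-tuning: the head used for the masked language modeling pretraining task is discarded and replaced with a new classification head, for which there are no pretrained weights. The library therefore warns that the model should be fine-tuned before it is used for inference, which is exactly what we are about to do.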

In [9]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
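
One way to picture where this warning comes from is as a set difference between the parameter names stored in the checkpoint and those the new model expects. The sketch below uses a handful of illustrative names (an assumption, a real BERT checkpoint has many more entries) and does not load any model.

```python
# Illustrative parameter names only; a real BERT checkpoint has many more entries.
checkpoint_keys = {
    "bert.embeddings.word_embeddings.weight",
    "bert.pooler.dense.weight",
    "cls.predictions.bias",            # masked-language-modeling head (pretraining only)
}
model_keys = {
    "bert.embeddings.word_embeddings.weight",
    "bert.pooler.dense.weight",
    "classifier.weight",               # new 5-way classification head
    "classifier.bias",
}

newly_initialized = sorted(model_keys - checkpoint_keys)      # no pretrained values available
unused_in_checkpoint = sorted(checkpoint_keys - model_keys)   # stored in the checkpoint, not needed
print(newly_initialized)
print(unused_in_checkpoint)
```

The first list matches the names reported in the warning above.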


### Training Hyperparameters (TrainingArguments)

Full list of parameters and their defaults: https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/trainer#transformers.TrainingArguments

Source definition: https://github.com/huggingface/transformers/blob/v4.36.1/src/transformers/training_args.py#L161

**The most important setting: the path where model weights are saved (`output_dir`)**

In [11]:
from transformers import TrainingArguments

model_dir = "models/bert-base-cased"

# logging_steps defaults to 500; given our training data size and step count, set it to 100
training_args = TrainingArguments(output_dir=f"{model_dir}/test_trainer",
                                  logging_dir=f"{model_dir}/test_trainer/runs",
                                  logging_steps=100)

In [12]:
# Full hyperparameter configuration
print(training_args)

TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=

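### Evaluating Metrics During Training (Evaluate)

The **[Hugging Face Evaluate library](https://huggingface.co/docs/evaluate/index)** provides, in a single line of code, dozens of evaluation methods across different domains (natural language processing, computer vision, reinforcement learning, and more). The **full list of currently supported metrics is at https://huggingface.co/evaluate-metric**

The Trainer does not automatically evaluate model performance during training, so we need to pass it a function that computes and reports metrics.

The Evaluate library provides a simple accuracy metric, which you can load with the `evaluate.load` function: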

In [13]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")



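Next, call the metric's `compute` function to calculate the accuracy of the predictions.

Before passing predictions to `compute`, we need to convert the logits into predicted classes (**all Transformers models return logits**).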

In [14]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
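
The conversion from logits to predictions is just an argmax over the class dimension, and no softmax is needed first because softmax does not change which class is largest. A pure-Python sketch with made-up logits (the helper names and values are illustrative, not part of the notebook):

```python
import math


def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three reviews over the five star classes.
logits = [
    [2.0, 0.5, -1.0, -2.0, -3.0],
    [-1.5, 0.2, 1.9, 0.4, -0.8],
    [-2.0, -1.0, 0.1, 1.2, 1.1],
]
labels = [0, 2, 4]

predictions = [argmax(row) for row in logits]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Softmax is monotonic, so it never changes which class is largest.
assert predictions == [argmax(softmax(row)) for row in logits]
print(predictions, accuracy)
```

Here two of the three predictions match their labels, so the accuracy is two thirds.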

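#### Monitoring Metrics During Training

To monitor how the evaluation metrics change during training, we can set the `evaluation_strategy` parameter in `TrainingArguments` so that metrics are reported at the end of each epoch.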

In [15]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir=f"{model_dir}/test_trainer",
                                  evaluation_strategy="epoch", 
                                  logging_dir=f"{model_dir}/test_trainer/runs",
                                  logging_steps=100)

## Starting Training

### Instantiating the Trainer

The `kernel version` warning below does not currently affect running this example.

In [16]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
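
To see whether a machine would trigger this warning, the running kernel version can be compared against the 5.5.0 minimum it mentions. The sketch below is an assumption about how such a check could look, not how the library performs it; the helper names and the sample release strings are hypothetical, and on non-Linux systems `platform.release()` does not return a Linux kernel version.

```python
import platform


def parse_kernel_version(release):
    """Turn a release string such as '3.10.0-1160.el7.x86_64' into (3, 10, 0)."""
    numeric = release.split("-")[0]
    parts = []
    for piece in numeric.split(".")[:3]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def kernel_is_below(release, minimum=(5, 5, 0)):
    """True if the parsed release is older than the given minimum version."""
    return parse_kernel_version(release) < minimum


print(platform.release(), kernel_is_below(platform.release()))
print(kernel_is_below("3.10.0"))   # the version reported in the warning above
```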


## Monitoring GPU Usage with nvidia-smi

To watch GPU usage in real time, poll with the `watch` command: `watch -n 1 nvidia-smi`

```shell
Every 1.0s: nvidia-smi                                                   Wed Dec 20 14:37:41 2023

Wed Dec 20 14:37:41 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:0D.0 Off |                    0 |
| N/A   64C    P0              69W /  70W |   6665MiB / 15360MiB |     98%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     18395      C   /root/miniconda3/bin/python                6660MiB |
+---------------------------------------------------------------------------------------+
```
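
For logging GPU usage from a script rather than watching a terminal, `nvidia-smi` can also emit machine-readable CSV through its `--query-gpu` and `--format` options. The following is a minimal sketch under the assumption that an NVIDIA driver is installed and every queried field is numeric (a field reported as `[N/A]` would make the parse fail); `parse_gpu_line` and `query_gpus` are hypothetical helper names.

```python
import subprocess

QUERY_FIELDS = "memory.used,memory.total,utilization.gpu"


def parse_gpu_line(line):
    """Parse one CSV line such as '6665, 15360, 98' into a dict of ints."""
    used, total, util = (int(field.strip()) for field in line.split(","))
    return {"memory_used_mib": used, "memory_total_mib": total, "utilization_pct": util}


def query_gpus():
    """Run nvidia-smi once and return one dict per GPU (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [parse_gpu_line(line) for line in result.stdout.splitlines() if line.strip()]


# Parsing the values shown in the table above, without calling nvidia-smi:
print(parse_gpu_line("6665, 15360, 98"))
```

Calling `query_gpus()` in a loop with a short sleep would approximate what `watch -n 1 nvidia-smi` does interactively.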

In [17]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,1.4823,1.130525,0.511
2,1.0902,1.10326,0.533
3,0.8352,1.011419,0.573


TrainOutput(global_step=375, training_loss=1.045119140625, metrics={'train_runtime': 347.7528, 'train_samples_per_second': 8.627, 'train_steps_per_second': 1.078, 'total_flos': 789354427392000.0, 'train_loss': 1.045119140625, 'epoch': 3.0})
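
The reported `global_step=375` can be reproduced by hand. The sketch assumes the `TrainingArguments` defaults of a per-device batch size of 8 and 3 epochs on a single GPU with no gradient accumulation; the printed configuration earlier is cut off before the batch size and epoch fields, so those two values are an assumption, although they agree with the step count above. `total_training_steps` is a hypothetical helper.

```python
import math


def total_training_steps(num_samples, batch_size=8, num_epochs=3):
    """Optimizer steps for one device with no gradient accumulation."""
    steps_per_epoch = math.ceil(num_samples / batch_size)   # a final partial batch still counts
    return steps_per_epoch * num_epochs


steps = total_training_steps(1000)
logged_at = list(range(100, steps + 1, 100))   # logging_steps=100
print(steps, logged_at)  # 375 [100, 200, 300]
```

This also shows why `logging_steps=100` is a sensible choice here: with the default of 500, a 375-step run would never reach a logging step.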

In [18]:
small_test_dataset = tokenized_datasets["test"].shuffle(seed=64).select(range(100))

In [19]:
trainer.evaluate(small_test_dataset)

{'eval_loss': 1.1320194005966187,
 'eval_accuracy': 0.54,
 'eval_runtime': 2.9635,
 'eval_samples_per_second': 33.744,
 'eval_steps_per_second': 4.387,
 'epoch': 3.0}

### Saving the Model and Training State

- Use the `trainer.save_model` method to save the model; it can later be reloaded with `from_pretrained()`
- Use the `trainer.save_state` method to save the training state

In [20]:
trainer.save_model(f"{model_dir}/finetuned-trainer")

In [21]:
trainer.save_state()
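
Reloading the saved model for inference might look like the sketch below. It assumes the model was saved to `models/bert-base-cased/finetuned-trainer` as in the cell above, and that the tokenizer has to come from the original `bert-base-cased` checkpoint because the `Trainer` here was created without one. `predict_stars` is a hypothetical helper, not part of the notebook.

```python
def predict_stars(text, model_path="models/bert-base-cased/finetuned-trainer"):
    """Reload the saved model and return the predicted star rating (1 to 5) for one review."""
    # Imported lazily so that merely defining the function does not load the libraries.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Assumption: only the model files were saved, so the tokenizer is loaded
    # from the original checkpoint instead of from model_path.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
    model.eval()

    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1).item()) + 1   # class index 0..4 -> 1..5 stars
```

A call such as `predict_stars("The food was absolutely wonderful.")` would then return an integer between 1 and 5.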

## Homework: Train on the Full YelpReviewFull Dataset and See How High the Accuracy Can Go

In [10]:
# Because of resource limits, train on one tenth of the dataset
full_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(65000))
full_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(5000))

In [12]:
from transformers import TrainingArguments, Trainer

model_dir = "models/my-bert-base-cased"

# logging_steps defaults to 500; given our training data size and step count, set it to 1000
training_args = TrainingArguments(output_dir=f"{model_dir}/test_trainer",
                                  logging_dir=f"{model_dir}/test_trainer/runs",
                                  logging_steps=1000)


In [13]:
print(training_args)

TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=

In [14]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=full_train_dataset,
    eval_dataset=full_eval_dataset
)

Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [15]:
trainer.train()

Step,Training Loss
1000,1.1633
2000,1.0121
3000,0.9767
4000,0.9673
5000,0.9385
6000,0.9338
7000,0.9162
8000,0.9029
9000,0.7957
10000,0.7752


TrainOutput(global_step=24375, training_loss=0.7672466421274039, metrics={'train_runtime': 16719.4084, 'train_samples_per_second': 11.663, 'train_steps_per_second': 1.458, 'total_flos': 5.130803778048e+16, 'train_loss': 0.7672466421274039, 'epoch': 3.0})

In [16]:
# Evaluate the model
my_test_dataset = tokenized_datasets["test"].shuffle(seed=12).select(range(5000))
trainer.evaluate(my_test_dataset)

{'eval_loss': 0.9432960748672485,
 'eval_runtime': 144.7788,
 'eval_samples_per_second': 34.535,
 'eval_steps_per_second': 4.317,
 'epoch': 3.0}
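
The evaluation result above contains no `eval_accuracy` entry because the `Trainer` in this homework section was created without `compute_metrics`. Passing `compute_metrics=compute_metrics` when constructing the `Trainer`, as in the earlier section, would add it. Accuracy can also be computed afterwards from `trainer.predict`, whose output carries the logits in `predictions` and the labels in `label_ids`. The sketch below is a hypothetical helper, exercised here with a stand-in object and made-up logits rather than a real trainer.

```python
from types import SimpleNamespace


def accuracy_from_trainer(trainer, dataset):
    """Compute accuracy from the output of trainer.predict(dataset)."""
    output = trainer.predict(dataset)
    correct = 0
    total = 0
    for row, label in zip(output.predictions, output.label_ids):
        predicted = max(range(len(row)), key=lambda i: row[i])   # argmax over the class logits
        correct += int(predicted == int(label))
        total += 1
    return correct / total


# Stand-in object with made-up logits, used only to exercise the function.
fake_trainer = SimpleNamespace(
    predict=lambda dataset: SimpleNamespace(
        predictions=[[0.1, 2.0, 0.3, 0.0, -1.0], [1.5, 0.2, 0.1, 0.0, 0.3]],
        label_ids=[1, 4],
    )
)
print(accuracy_from_trainer(fake_trainer, dataset=None))  # 0.5
```

With the real objects, the call would be `accuracy_from_trainer(trainer, my_test_dataset)`.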

In [18]:
my_test_dataset = tokenized_datasets["test"].shuffle(seed=64).select(range(100))
trainer.evaluate(my_test_dataset)

{'eval_loss': 1.1617443561553955,
 'eval_runtime': 2.8496,
 'eval_samples_per_second': 35.093,
 'eval_steps_per_second': 4.562,
 'epoch': 3.0}

In [17]:
# Save the model and training state
trainer.save_model()
trainer.save_state()

In [19]:
# Final run with the evaluation_strategy parameter added
final_train_dataset = tokenized_datasets["train"].shuffle(seed=45).select(range(6500))
final_eval_dataset = tokenized_datasets["test"].shuffle(seed=45).select(range(500))

model_dir = "models/final-bert-base-cased"

# logging_steps defaults to 500; given our training data size and step count, set it to 100
final_training_args = TrainingArguments(output_dir=f"{model_dir}/test_trainer",
                                  evaluation_strategy="epoch", 
                                  logging_dir=f"{model_dir}/test_trainer/runs",
                                  logging_steps=100)


In [20]:
trainer = Trainer(
    model=model,
    args=final_training_args,
    train_dataset=final_train_dataset,
    eval_dataset=final_eval_dataset
)
trainer.train()

Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


Epoch,Training Loss,Validation Loss
1,0.8365,0.893085
2,0.5013,0.966964
3,0.292,1.493477


TrainOutput(global_step=2439, training_loss=0.5779791664226762, metrics={'train_runtime': 1711.1869, 'train_samples_per_second': 11.396, 'train_steps_per_second': 1.425, 'total_flos': 5130803778048000.0, 'train_loss': 0.5779791664226762, 'epoch': 3.0})
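
In this run the training loss falls every epoch while the validation loss rises, a pattern usually read as overfitting. A small sketch that picks the epoch with the lowest validation loss from the table above (the values are copied from that table; the variable names are illustrative):

```python
# (epoch, training loss, validation loss) copied from the table above.
history = [
    (1, 0.8365, 0.893085),
    (2, 0.5013, 0.966964),
    (3, 0.2920, 1.493477),
]

best_epoch, _, best_val_loss = min(history, key=lambda row: row[2])
train_falls = all(a[1] > b[1] for a, b in zip(history, history[1:]))
val_rises = all(a[2] < b[2] for a, b in zip(history, history[1:]))
print(best_epoch, best_val_loss, train_falls and val_rises)  # 1 0.893085 True
```

In practice, `TrainingArguments` offers `load_best_model_at_end=True` (with matching save and evaluation strategies) to keep the best checkpoint automatically; whether that is worthwhile here is a judgment call.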

In [21]:
final_test_dataset = tokenized_datasets["test"].shuffle(seed=64).select(range(100))
trainer.evaluate(final_test_dataset)

{'eval_loss': 1.8210952281951904,
 'eval_runtime': 2.8889,
 'eval_samples_per_second': 34.616,
 'eval_steps_per_second': 4.5,
 'epoch': 3.0}