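# Getting Started with Fine-Tuning in Hugging Face Transformers

This example walks through the main steps of fine-tuning a model with Transformers:
- Downloading the dataset
- Preprocessing the data
- Configuring training hyperparameters
- Setting up evaluation metrics
- A brief introduction to the Trainer
- Running the training
- Saving the model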

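## The YelpReviewFull Dataset

**Hugging Face dataset: [YelpReviewFull](https://huggingface.co/datasets/yelp_review_full)**

### Dataset Summary

The Yelp reviews dataset consists of reviews from Yelp. It is extracted from the Yelp Dataset Challenge 2015 data.

### Supported Tasks and Leaderboards
Text classification, sentiment classification: the dataset is mainly used for text classification, that is, predicting the sentiment of a given text.

### Languages
The reviews are mainly written in English.

### Dataset Structure

#### Data Instances
A typical data point consists of a text and its corresponding label.

An example from the YelpReviewFull test set looks as follows: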

```python
{
    'label': 0,
    'text': 'I got \'new\' tires from them and within two weeks got a flat. I took my car to a local mechanic to see if i could get the hole patched, but they said the reason I had a flat was because the previous patch had blown - WAIT, WHAT? I just got the tire and never needed to have it patched? This was supposed to be a new tire. \\nI took the tire over to Flynn\'s and they told me that someone punctured my tire, then tried to patch it. So there are resentful tire slashers? I find that very unlikely. After arguing with the guy and telling him that his logic was far fetched he said he\'d give me a new tire \\"this time\\". \\nI will never go back to Flynn\'s b/c of the way this guy treated me and the simple fact that they gave me a used tire!'
}
```

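#### Data Fields

- 'text': The review text is escaped using double quotes ("), and any internal double quote is escaped by two double quotes (""). Newlines are escaped as a backslash followed by an "n" character, i.e. "\n".
- 'label': The star rating of the review. Ratings range from 1 to 5 stars, but the loaded dataset stores them as class indices from 0 to 4 (the example above has `'label': 0`, which is a 1-star review).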
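
The relationship between the stored label and the star rating can be sketched as follows. This is only an illustration: `LABEL_NAMES` is an assumption reconstructed from the label names that appear in this notebook's outputs ("1 star", "2 star", "3 stars", ...), and `label_to_stars` is a hypothetical helper. In practice the names come from the dataset's `ClassLabel` feature.

```python
# Label names as they appear in this notebook's outputs (assumed order: index 0..4).
LABEL_NAMES = ["1 star", "2 star", "3 stars", "4 stars", "5 stars"]


def label_to_stars(label: int) -> int:
    """Convert a stored class index (0-4) to the star rating (1-5)."""
    if not 0 <= label < len(LABEL_NAMES):
        raise ValueError(f"label must be in 0..{len(LABEL_NAMES) - 1}, got {label}")
    return label + 1


example = {"label": 0}
print(LABEL_NAMES[example["label"]], "->", label_to_stars(example["label"]))
```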

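#### Data Splits

The Yelp reviews full star dataset is built by randomly taking 130,000 training samples and 10,000 test samples for each star rating from 1 to 5. In total there are 650,000 training samples and 50,000 test samples.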
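
As a quick sanity check of these numbers, the sketch below recomputes the split sizes. It also derives the accuracy of random guessing, which is a useful baseline when reading the results later; that baseline assumes the classes are perfectly balanced, as the construction above implies.

```python
NUM_CLASSES = 5            # star ratings 1-5
TRAIN_PER_CLASS = 130_000  # training samples per star rating
TEST_PER_CLASS = 10_000    # test samples per star rating

train_total = NUM_CLASSES * TRAIN_PER_CLASS
test_total = NUM_CLASSES * TEST_PER_CLASS

# With balanced classes, uniform random guessing is right 1 time in NUM_CLASSES.
chance_accuracy = 1 / NUM_CLASSES

print(train_total, test_total, chance_accuracy)
```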

## Downloading the Dataset

In [2]:
import subprocess
import os

# Alternative cache locations (custom download path for Transformers models):
# os.environ['HF_HOME'] = '/autodl-tmp/new_volume/hf'
# os.environ['HF_HUB_CACHE'] = '/autodl-tmp/new_volume/hf/hub'

# os.environ['HF_HOME'] = '/mnt/new_volume/hf'
# os.environ['HF_HUB_CACHE'] = '/mnt/new_volume/hf/hub'

# Point the Hugging Face caches at custom paths.
# Set these before importing datasets/transformers so the libraries pick them up.
# os.environ["CUDA_VISIBLE_DEVICES"] = "1"
os.environ["HF_DATASETS_CACHE"] = "/autodl-tmp/datasets_cache/"
os.environ["HF_HOME"] = "/autodl-tmp/cache/"
os.environ["HUGGINGFACE_HUB_CACHE"] = "/autodl-tmp/hub_cache/"
os.environ["TRANSFORMERS_CACHE"] = "/autodl-tmp/transform_cache/"

# Source the host's proxy script and copy the proxy variables it exports into this process.
result = subprocess.run('bash -c "source /etc/network_turbo && env | grep proxy"', shell=True, capture_output=True, text=True)
output = result.stdout
for line in output.splitlines():
    if '=' in line:
        # Split on the first '=' only; the value itself may contain '='.
        var, value = line.split('=', 1)
        os.environ[var] = value


In [3]:
# Verify that the environment variables were set
print("http_proxy",os.environ.get("http_proxy"))
print("https_proxy",os.environ.get("https_proxy"))
print("HF_HOME",os.environ.get("HF_HOME"))
print("HF_DATASETS_CACHE",os.environ.get("HF_DATASETS_CACHE"))
print("HUGGINGFACE_HUB_CACHE",os.environ.get("HUGGINGFACE_HUB_CACHE"))
print("TRANSFORMERS_CACHE",os.environ.get("TRANSFORMERS_CACHE"))

http_proxy http://172.20.0.113:12798
https_proxy http://172.20.0.113:12798
HF_HOME /autodl-tmp/cache/
HF_DATASETS_CACHE /autodl-tmp/datasets_cache/
HUGGINGFACE_HUB_CACHE /autodl-tmp/hub_cache/
TRANSFORMERS_CACHE /autodl-tmp/transform_cache/
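
The parsing step in the setup cell can be isolated into a small function that is easy to test without a proxy script. This is a minimal sketch: `parse_env_lines` is a hypothetical helper name, and the sample text stands in for whatever `env | grep proxy` prints on a given host.

```python
def parse_env_lines(output: str) -> dict:
    """Parse KEY=VALUE lines (as printed by `env`) into a dict."""
    variables = {}
    for line in output.splitlines():
        if "=" in line:
            # Split on the first "=" only, since values may contain "=" themselves.
            key, value = line.split("=", 1)
            variables[key] = value
    return variables


# Hypothetical sample of what `env | grep proxy` might print.
sample_output = (
    "http_proxy=http://proxy.example:8080\n"
    "https_proxy=http://proxy.example:8080/?a=b\n"
)
proxies = parse_env_lines(sample_output)
print(proxies)
```

The resulting dict could then be applied with `os.environ.update(proxies)`.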


In [13]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")

In [None]:
dataset

In [6]:
dataset["train"][10]

{'label': 0,
 'text': "Owning a driving range inside the city limits is like a license to print money.  I don't think I ask much out of a driving range.  Decent mats, clean balls and accessible hours.  Hell you need even less people now with the advent of the machine that doles out the balls.  This place has none of them.  It is april and there are no grass tees yet.  BTW they opened for the season this week although it has been golfing weather for a month.  The mats look like the carpet at my 107 year old aunt Irene's house.  Worn and thread bare.  Let's talk about the hours.  This place is equipped with lights yet they only sell buckets of balls until 730.  It is still light out.  Finally lets you have the pit to hit into.  When I arrived I wasn't sure if this was a driving range or an excavation site for a mastodon or a strip mining operation.  There is no grass on the range. Just mud.  Makes it a good tool to figure out how far you actually are hitting the ball.  Oh, they are cash 

In [7]:
import random
import pandas as pd
import datasets
from IPython.display import display, HTML

In [8]:
def show_random_elements(dataset, num_examples=10):
    # Fail early if more examples are requested than the dataset contains.
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."

    picks = []
    # `_` is a throwaway loop variable: we only need to repeat num_examples times.
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        # Redraw until the index has not been picked before, so all picks are unique.
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    # Indexing the dataset with a list of indices returns the selected rows,
    # which are loaded into a pandas DataFrame for display.
    df = pd.DataFrame(dataset[picks])

    for column, typ in dataset.features.items():
        # ClassLabel features store integer class indices.
        if isinstance(typ, datasets.ClassLabel):
            # transform applies the lambda to every value in the column,
            # mapping each index i to its label name typ.names[i].
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [9]:
show_random_elements(dataset["train"])

Unnamed: 0,label,text
0,2 star,"Considering that Gourmet magazine said it was the best Thai restaurant in North America, AND that Frommers highly recommended it, I expected a lot more. My boyfriend and I both had the wonton soup (excellent) and the chicken with cashew nuts. Personally, I've had much better at other Thai restaurants. \n\nIt's a run-down little place, but obviously plenty of people know about it. Autographed pictures with celebrities adorn the walls. Service is okay, but nothing exceptional. It was often hard to track down our server."
1,4 stars,"I have been a bit absent from Yelp as I lost faith with some hidden reviews but I choose one of my faves in Phoenix to return with a review. True Food has added a wonderful roasted veggie salad to their menu with a horseradish dressing that is bursting with flavor! Their shrimp curry bowl is very good but lacks spice. For a beverage, you must try the Medicine Man. It's an antioxidant pomegranate/black tea based elixir that tastes like a vacation.! The only reason I don't give 5 stars is the dismissive server attitudes and sub-par service we tend to receive. Overall, I highly recommend!"
2,1 star,"Ugh... Not good.\n\nWeekday Night. Restaurant was empty. Four of us(2 adults/2 kids) and five workers(3 cooks) for dinner.\n\nMenu is way too complicated for the quality of food. chinese, thai, japanese, bbq?. Wish they stuck to one thing and make it good.\n\nAll food delivered separately. First kids meals Lo Mein w/ Teriyaki chicken($4.95). Looked like one of those weight watchers teriyaki meal. barely filled their plate with noodles. Everything on the plate for the kids looks unappetizing. Brown Noodles and brown chicken. Chicken was tough, very little teriyaki taste. One of the kids liked, the other not soo much.\n\nNext(5 minutes after kids meals came) Pad Thai\nNothing special. corn flakes on top? bland.\n\nNext (5 minutes after Pad Thai) \nShrimp LoMein - I didn't taste, but the shrimp looked like the small cheap ones that you get buy one - get one at the grocery. Small, unimpressive. Not even medium in size. Throw in at least one large one. How about one medium? Wife said, \""Nothing wrong, nothing special\""\n\nBut, overall odd the time it took to get the food out. With 3 cooks, you think they could have coordinated better. Doesn't shrimp take less time than chicken to cook. I'm thinking the kids chicken may have not been cooked to order. Reheated?\n\nSpent $31. Health Inspection Score on wall was 90.5."
3,2 star,"I don't know what others are talking about....\n\nCons:\n- Bed was hard\n- Room layout was awkward \n- Towels we're rough.... Couldn't tell which was the floor towel and which ones we're body towels\n- Extra blankets were dirty, like thank goodness I didn't have one of those CSI blue lights, kind of stains....\n\nPros:\n- Staff was friendly\n- Free breakfast\n- Walking distance to North Strip casinos."
4,4 stars,"Bachi has a breakfast restaurant?! What!! Obsessed with Bachi Burger as it is, and super disappointed by their ramen, I'm glad HLK lives up to the Bachi name. \n\nMy friends and I are fatties and had the Bella farms foie gras de canard (torchon, jelly, marcona almonds, cornichons, cranberry walnut bread) as a starter. Delicious! It has a pretty thick layer of duck fat on top as well. Yeah, as disgusting as you may feel thinking about eating straight up fat, who cares? Spread that stuff on! \n\nI had the wild mushroom hash (potato, kale, gruyere, egg, fried onions, bechamel) as well as the corned beef hash (potato, brussel sprouts, onion, egg, bechamel) . No, not all too myself, I'm not that much of a fatty. Disappointed a bit, because none of us were asked how we wanted our eggs cooked. On both hashes they came over hard, which I hate. I love runny yolks, don't take that away from me! Especially on a hash. But other than that, they were both delicious. Corned beef had good flavor. \n\nFriend got the braised short ribs loco modo ( two eggs, fried rice, onion rings, orange lentil gravy). Looked bomb. Meat was tender. Braised for thirty six hours. \n\nAlso tried the crispy sweet chili chicken and waffle (apple bacon, brussel sprouts, green onion waffle, curry butter). Chicken was very moist, and the sauce on it was delicious. Unfortunately they only put it on one of the pieces. Waffle was ok. Butter ess definitely the high point of the waffle. You get three choices of syrups (maple, coconut, blueberry). \n\nOverall, good experience. Great service. Thanks Adrianna! Definitely a come back spot."
5,5 stars,"The owner was very talkative, but I wouldn't call him rude. He was actually quite nice. I didn't get any attitude for changing my cheesesteak to no onions with provolone instead of cheese wiz. It was delicious too. I wish they had this place in Chicago.\n\nVisited: 11/19/09"
6,3 stars,"I've been dreading this review I'm about to give.. Looking back on the experience, I probably shouldn't go to Japanese restaurants to be turned on, or blown away. My climax with sushi was 3-4 years ago, now I get bored with it. I'm more of a NEW American/French food lover. \n First off, the server wasn't very informative, so I talked to another server passing by, she actually gave me some good guidance. We ordered a wide range of items, going off their recommendations. We started with sashimi yellowtail with jalape\u00f1os. Of course it was good, Yuzu makes anything rock, and fresh raw fish does it's job well, no help needed. Second was the lobster ceviche. I found it boring, uninspired, bland. Sorry. Third was the new style sashimi, salmon. This was by far the best. Slightly cooked, warm sesame, miso sauce. Delish. Main courses were the black cod miso,a restaurant favorite, but not my favorite. I just kinda ate it, without having any real feelings for it. I was just thinking ok, maybe the next plate will be better. Creamy spicy rock shrimp tempura, I had two pieces and decided it was time to order a freakin sushi roll and call it quits. I'm hungry and not so impressed. Thank god I wasn't paying. I ordered the nobu roll I believe, it had ahi tuna, yellowtail, avocado, scallion, masago. I didn't think I'd have to get a roll to be happy and satisfied, oh well..\n\n\n\n My only recommendation, the new style sashimi."
7,4 stars,best coffee around.
8,3 stars,"Too pricey! I couldn't taste ANY alcohol in my wife's birthday cake shake, it tasted good though. The burger a \""bigbun\"" wasn't so big at all! My wife got kogi beef quesadilla. She thought it was awesome! It looked very good, she also got pizza twinkies. I didn't quite get the concept. I've had waaay better gourmet burgers @ other chains. Fries weren't very good & they didn't give you very many. There's no shortage on spuds last time I checked."
9,5 stars,"Hands down, the best date place in central Phoenix. \n\nSadly, I was apparently not on a date and ended up having to pay for our bill BUT in retrospect I was still so damn impressed by this place.\n\nRumor has it that this place is owned by several male flight attendants and when they're not attending flights they are running this restaurant. Leave it to world travelers to produce a classy but quaint but sosticated but understated establishment that can truly please everyone. In true Phoenix fashion, Coronado Cafe was once a house and now stands as a Bistro Cafe. The front porch is now patio seating with cute little tables and bench style sitting all under canopy drapes and string lights. On the inside is the cutest little bar area, very small, but the older gentlemen running the place is so cute and you can tell he really know is cocktails and puts passion into crafting each one. There's also a secluded side patio that's ADORABLE.\n\nWe ate inside the dining room because the rain (rain, I know... in Phoenix?) started coming down and the wait staff was so attentive and helped us relocate. We were sat in the corner by a window that had its own little garden bed outside. Could not get over this palces cute factor. Tables were small but candle lit and had the presh checker print table clothes, real napkins and honestly one of the best waitresses I've ever had. \n\nWe order 2 Tequilatinis on Happy Hour and the artichoke dip. This artichoke dip opened the gates of heaven and brought to us the most delightful, out of the ordinary, festive taste I have ever had. It was spicy, but cheesy. It was melty but had good size chunks of artichoke. Simply, cannot go on enough about how good it was.\n\nI cannot wait to go back here and have brunch or enjoy the sunset with drinks on the patio. If you're looking for somewhere small and intimate that you'd like to take a significant other or truly impress someone you care about, pelase go here. 
Everything is perfect for the perfect date."
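
The unique-index loop in `show_random_elements` can also be expressed with the standard library's `random.sample`, which draws without replacement. The sketch below shows only that sampling step; `pick_unique_indices` is a hypothetical helper, not part of the notebook, and the `seed` parameter is an addition for reproducibility.

```python
import random


def pick_unique_indices(dataset_size: int, num_examples: int = 10, seed=None):
    """Return num_examples distinct indices in [0, dataset_size)."""
    if num_examples > dataset_size:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    # A private generator keeps the global random state untouched.
    rng = random.Random(seed)
    # sample() draws without replacement, so no duplicate check is needed.
    return rng.sample(range(dataset_size), num_examples)


picks = pick_unique_indices(650_000, num_examples=10, seed=0)
print(picks)
```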


## Preprocessing the Data

Once the dataset has been downloaded, use a tokenizer to process the text. Inputs of unequal length can be handled with padding and truncation strategies.
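
To illustrate what padding and truncation do to a sequence of token ids, here is a minimal pure-Python sketch. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and a real tokenizer also preserves special tokens such as `[SEP]` when truncating, which this sketch ignores.

```python
PAD_ID = 0  # assumed padding id; BERT's [PAD] token has id 0


def pad_or_truncate(input_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(input_ids[:max_length])   # truncation: drop tokens beyond max_length
    attention_mask = [1] * len(ids)      # 1 marks real tokens
    num_pad = max_length - len(ids)
    ids += [pad_id] * num_pad            # padding: fill up to max_length
    attention_mask += [0] * num_pad      # 0 marks padding positions
    return ids, attention_mask


short_ids, short_mask = pad_or_truncate([101, 2038, 2094, 102], max_length=6)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 6, 102], max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Both results have the same length, which is what allows samples to be stacked into one batch.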

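The `map` method of Datasets applies a preprocessing function to the entire dataset in a single call.

Below, the whole dataset is processed with the pad-to-max-length strategy: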

In [54]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Print the tokenizer's maximum sequence length
print(f"Max Sequence Length for bert-base-cased (tokenizer): {tokenizer.model_max_length}")


def tokenize_function(examples):
    # padding="max_length" pads every sample to a fixed length: the max_length argument if given,
    # otherwise the tokenizer's model_max_length printed above (not the longest sample in the data).
    # Shorter sequences are filled at the end with the padding token ([PAD], id 0 for BERT);
    # truncation=True cuts longer sequences down to that length.
    # Equal lengths allow samples to be batched together, which matters for efficient training.
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# map applies tokenize_function to the dataset. With batched=True the function receives
# a batch of samples per call, which speeds up processing.
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# attention_mask is a binary mask with the same length as the input sequence.
# It tells the model which positions to attend to:
#   0: the position is masked (padding); the attention mechanism ignores it.
#   1: the position holds a real token; the model takes it into account.
# Sequences in a batch must be padded to a common length to form a tensor;
# the mask keeps the padding from influencing the result, so variable-length
# texts are handled correctly.

# The tokenizer returns a dict that includes input_ids and attention_mask,
# and attention_mask is passed to the model together with input_ids.

ProxyError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /bert-base-cased/resolve/main/tokenizer_config.json (Caused by ProxyError('Cannot connect to proxy.', TimeoutError('_ssl.c:980: The handshake operation timed out')))"), '(Request ID: 88d5509b-7481-4509-8747-e556ab331747)')

In [11]:
show_random_elements(tokenized_datasets["train"], num_examples=1)

Unnamed: 0,label,text,input_ids,token_type_ids,attention_mask
0,5 stars,Great food!! Amazing atmosphere !!! And best of all fantastic service. Ask for Kyle he is wonderful!,"[101, 2038, 2094, 106, 106, 16035, 6814, 106, 106, 106, 1262, 1436, 1104, 1155, 14820, 1555, 119, 18149, 1111, 7156, 1119, 1110, 7310, 106, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]"


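### Sampling the Data

Use 1000 samples to demonstrate small-scale training on BERT (with the PyTorch Trainer).

`shuffle()` randomly reorders the rows of the dataset. For more control over the shuffling algorithm, pass the `generator` parameter to use a different `numpy.random.Generator`.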
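
The idea of "shuffle with a fixed seed, then keep the first k rows" can be sketched in plain Python. This is an illustration of the principle only: it uses Python's `random` module rather than the NumPy generator that Datasets uses, so the actual order produced by `shuffle(seed=42)` will differ. `shuffle_and_select` is a hypothetical helper.

```python
import random


def shuffle_and_select(rows, seed, k):
    """Shuffle row indices deterministically, then keep the first k rows."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # same seed -> same permutation
    return [rows[i] for i in indices[:k]]


rows = [f"review {i}" for i in range(10)]
subset_a = shuffle_and_select(rows, seed=42, k=3)
subset_b = shuffle_and_select(rows, seed=42, k=3)
print(subset_a)
print(subset_a == subset_b)  # identical, because the seed is the same
```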

In [12]:
# .shuffle(seed=42) randomly shuffles the split; the fixed seed 42 makes the result reproducible.
# .select(range(1000)) then keeps the first 1000 samples of the shuffled split (indices 0 to 999).
# The result is a small dataset that makes training and debugging much faster.

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

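## Fine-Tuning Configuration

### Loading the BERT Model

The warning below tells us that some weights are randomly initialized (the `classifier` layer: `classifier.weight` and `classifier.bias`), while the pretraining heads stored in the checkpoint are not used. This is perfectly normal when fine-tuning: we drop the heads used for the pretraining tasks and replace them with a new classification head for which no pretrained weights exist. The library therefore warns that the model should be fine-tuned before it is used for inference, which is exactly what we are about to do.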

In [13]:
from transformers import AutoModelForSequenceClassification

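# AutoModelForSequenceClassification:
# used for sequence classification tasks such as text classification.
# It loads the pretrained model and adds a classification head on top;
# num_labels=5 sizes the head for the five star ratings. The head is trained during fine-tuning.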

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Training Hyperparameters (TrainingArguments)

Full list of parameters and their defaults: https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/trainer#transformers.TrainingArguments

Source definition: https://github.com/huggingface/transformers/blob/v4.36.1/src/transformers/training_args.py#L161

**The most important setting: the path where model weights are saved (`output_dir`)**

In [17]:
from transformers import TrainingArguments

model_dir = "models/bert-base-cased"

# logging_steps defaults to 500; given our small training set and step count, we set it to 100.

# output_dir:
#   Directory where the model and other training outputs are written,
#   here {model_dir}/test_trainer.

# logging_dir:
#   Directory for the TensorBoard log files, here {model_dir}/test_trainer/runs.

# logging_steps:
#   How often training information is logged, here once every 100 training steps.

# save_total_limit:
#   Maximum number of checkpoints kept in output_dir; older ones are deleted.

training_args = TrainingArguments(output_dir=f"{model_dir}/test_trainer",
                                  logging_dir=f"{model_dir}/test_trainer/runs",
                                  logging_steps=100,
                                  save_total_limit=5)

In [18]:
# The full hyperparameter configuration
print(training_args)

TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_le

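### Evaluating Metrics During Training (Evaluate)

The **[Hugging Face Evaluate library](https://huggingface.co/docs/evaluate/index)** gives access, with a single line of code, to dozens of evaluation methods across domains (natural language processing, computer vision, reinforcement learning, and more). The **full list of supported metrics** is at https://huggingface.co/evaluate-metric

The Trainer does not automatically evaluate model performance during training, so we need to pass it a function that computes and reports metrics.

The Evaluate library provides a simple accuracy function, which can be loaded with `evaluate.load`: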

In [None]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")


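Next, call the `compute` function to calculate the accuracy of the predictions.

Before passing predictions to `compute`, the logits must be converted into predictions (**all Transformers models return logits**).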
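
Logits are unnormalized scores, not probabilities. A softmax turns them into probabilities, but because softmax preserves the ordering of the scores, taking the argmax of the logits directly yields the same predicted class. The sketch below demonstrates this with made-up logits for the five star classes; `softmax` and `argmax` are small helpers written for this illustration.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


logits = [0.4, 0.6, 0.9, -1.2, 0.1]  # made-up scores for the 5 classes
probs = softmax(logits)
prediction = argmax(logits)

print(prediction, argmax(probs))  # both pick the same class
```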

In [17]:
def compute_metrics(eval_pred):
    # eval_pred unpacks into two elements: the model's logits (one score per class)
    # and the ground-truth labels.
    logits, labels = eval_pred
    # np.argmax picks the class with the highest logit for each sample, i.e. the model's prediction.
    # axis=-1 runs argmax over the last dimension, which is the class dimension here.
    predictions = np.argmax(logits, axis=-1)
    # metric.compute compares the predictions with the true labels
    # and returns a dict holding the computed metric values.
    return metric.compute(predictions=predictions, references=labels)

# How axis=-1 works: consider a 2D array (matrix) holding logits for three classes per example:
# import numpy as np

# logits = np.array([[0.8, 0.2, 0.1],
#                    [0.4, 0.6, 0.9],
#                    [0.2, 0.5, 0.7]])

# Each row holds one example's logits; each column corresponds to one class.
# To find the index (class) with the highest logit for each example, use np.argmax:
# predictions = np.argmax(logits, axis=-1)
# print(predictions)

# The output is an array with the index (class) of the highest logit for each example:
# [0 2 2]

# Step by step:

# The logits array has shape (3, 3): the first dimension is the number of examples, the second the number of classes.
# axis=-1 applies the operation (here np.argmax) along the last dimension, which in a 2D array is the second one.
# So for each row (example), np.argmax runs across the columns (classes) and selects the index of the highest logit.
# Here class 0 has the highest logit for the first example, and class 2 for the second and third. This is the usual way to determine the predicted class of every example in a batch.
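
For intuition, accuracy is simply the fraction of predictions that match the reference labels. The following pure-Python sketch mirrors the shape of the dictionary returned for the accuracy metric; it is an illustration, not the Evaluate library's implementation, and the reference labels are made up.

```python
def accuracy(predictions, references):
    """Fraction of predictions equal to the reference labels."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}


# Predictions from the argmax example above, with made-up reference labels.
result = accuracy(predictions=[0, 2, 2], references=[0, 1, 2])
print(result)
```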

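#### Monitoring Metrics During Training

To track how the evaluation metrics change during training, set the `evaluation_strategy` parameter of `TrainingArguments`, here to `"epoch"`, so that metrics are reported at the end of every epoch.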

In [19]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir=f"{model_dir}/test_trainer",
                                  evaluation_strategy="epoch", 
                                  logging_dir=f"{model_dir}/test_trainer/runs",
                                  logging_steps=100)

## Start Training

### Instantiating the Trainer

The `kernel version` warning shown below does not currently affect this example.

In [20]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


## Monitoring GPU Usage with nvidia-smi

To watch GPU usage in real time, poll with the `watch` command, `watch -n 1 nvidia-smi`:

```shell
Every 1.0s: nvidia-smi                                                   Wed Dec 20 14:37:41 2023

Wed Dec 20 14:37:41 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:0D.0 Off |                    0 |
| N/A   64C    P0              69W /  70W |   6665MiB / 15360MiB |     98%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     18395      C   /root/miniconda3/bin/python                6660MiB |
+---------------------------------------------------------------------------------------+
```
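
If the memory figures are needed programmatically, the `used / total` field can be pulled out of a table row with a regular expression. This is a rough sketch under the assumption that the row has the layout shown above; `parse_memory_usage` is a hypothetical helper. For scripting, the query interface of nvidia-smi (for example `nvidia-smi --query-gpu=memory.used,memory.total --format=csv`) is likely a more robust source than the human-readable table.

```python
import re


def parse_memory_usage(line: str):
    """Extract (used_mib, total_mib) from an nvidia-smi table row."""
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", line)
    if match is None:
        raise ValueError("no 'usedMiB / totalMiB' field found")
    return int(match.group(1)), int(match.group(2))


# Row copied from the nvidia-smi output above.
row = "| N/A   64C    P0              69W /  70W |   6665MiB / 15360MiB |     98%      Default |"
used, total = parse_memory_usage(row)
print(f"{used} / {total} MiB = {used / total:.1%} of GPU memory in use")
```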

In [27]:
trainer.train()
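# The default batch size is 8 and the default number of epochs is 3. With 1000 training samples,
# each epoch takes 1000 / 8 = 125 steps, so training runs for 125 * 3 = 375 steps in total.
# batch_size is the number of samples fed to the model for each parameter update.
# Training data is split into mini-batches with a fixed number of samples each; that number is batch_size.

# How to read the nvidia-smi output shown above:

# Driver Version and CUDA Version:

# Driver Version is the version of the installed NVIDIA driver.
# CUDA Version is the highest CUDA version this driver supports (not necessarily the installed toolkit version).
# GPU information:

# GPU Name: the name of the card.
# Persistence-M: persistence mode; "On" means it is enabled.
# Bus-Id: the card's identifier on the system bus.
# Disp.A: whether the card is driving a display; "Off" means it is not.
# Fan, Temp, Perf: fan status, temperature, and performance state.
# Pwr:Usage/Cap: current power draw and power limit.
# Memory-Usage: GPU memory in use and total GPU memory.
# GPU-Util: GPU utilization, i.e. how heavily the GPU is being used.
# Compute M.: the compute mode (e.g. Default).
# Processes:

# Lists the processes using the GPU: GPU ID, process ID (PID), process type, process name, and GPU memory usage.
# The fields to watch are memory usage, GPU utilization, and temperature. They show how
# heavily the GPU is loaded while the model is training.

# Press Ctrl + C to exit the watch command and return to the shell.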

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.3883 | 1.477182 | 0.547 |
| 2 | 0.378 | 1.625033 | 0.571 |
| 3 | 0.2084 | 1.832777 | 0.578 |


TrainOutput(global_step=375, training_loss=0.28354778798421226, metrics={'train_runtime': 112.2767, 'train_samples_per_second': 26.72, 'train_steps_per_second': 3.34, 'total_flos': 789354427392000.0, 'train_loss': 0.28354778798421226, 'epoch': 3.0})
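
The numbers in this `TrainOutput` can be checked by hand. The sketch below recomputes the step count and the throughput figures, assuming a single GPU and the Trainer defaults of batch size 8 and 3 epochs.

```python
import math

num_samples = 1000        # size of small_train_dataset
batch_size = 8            # default per-device train batch size (single GPU assumed)
num_epochs = 3            # default number of epochs
train_runtime = 112.2767  # seconds, from the TrainOutput above

steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = round(num_samples * num_epochs / train_runtime, 2)
steps_per_second = round(total_steps / train_runtime, 2)

print(steps_per_epoch, total_steps)          # matches global_step
print(samples_per_second, steps_per_second)  # matches the reported throughput
```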

In [22]:
small_test_dataset = tokenized_datasets["test"].shuffle(seed=64).select(range(100))

In [23]:
trainer.evaluate(small_test_dataset)

{'eval_loss': 1.1293487548828125,
 'eval_accuracy': 0.48,
 'eval_runtime': 1.0121,
 'eval_samples_per_second': 98.801,
 'eval_steps_per_second': 12.844,
 'epoch': 3.0}

### Saving the Model and Training State

- Use `trainer.save_model` to save the model; it can later be reloaded with `from_pretrained()`
- Use `trainer.save_state` to save the training state

In [24]:
trainer.save_model(f"{model_dir}/finetuned-trainer")

In [25]:
trainer.save_state()

## Homework

### Homework 2-1: Train on the full YelpReviewFull dataset and compare how high the accuracy can go
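
Before launching the full run, it helps to estimate its size. The sketch below is a back-of-the-envelope calculation, assuming the same defaults as the demo (batch size 8, 3 epochs, one GPU) and the throughput of roughly 3.34 steps per second reported by the demo run. Actual runtime depends on hardware and settings, so treat the estimate as a rough order of magnitude.

```python
import math


def total_training_steps(num_samples, batch_size=8, num_epochs=3):
    """Number of optimizer steps for a run on a single device."""
    return math.ceil(num_samples / batch_size) * num_epochs


small_run = total_training_steps(1_000)    # the 1000-sample demo
full_run = total_training_steps(650_000)   # the full training split

assumed_steps_per_second = 3.34            # taken from the demo run; an assumption for the full run
est_hours = full_run / assumed_steps_per_second / 3600

print(small_run, full_run)
print(f"estimated runtime: about {est_hours:.1f} hours")
```

Under these assumptions the full run takes 650 times as many steps as the demo, so it is worth configuring checkpointing and evaluation intervals carefully before starting.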

In [5]:
import subprocess
import os

# os.environ['HF_HOME'] = '/autodl-tmp/new_volume/hf'  # custom download path for Transformers models
# os.environ['HF_HUB_CACHE'] = '/autodl-tmp/new_volume/hf/hub'

# os.environ['HF_HOME'] = '/mnt/new_volume/hf' 
# os.environ['HF_HUB_CACHE'] = '/mnt/new_volume/hf/hub'

# Point the Hugging Face caches at custom paths
# os.environ["CUDA_VISIBLE_DEVICES"] = "1"
os.environ["HF_DATASETS_CACHE"] = "/autodl-tmp/datasets_cache/"
os.environ["HF_HOME"] = "/autodl-tmp/cache/"
os.environ["HUGGINGFACE_HUB_CACHE"] = "/autodl-tmp/hub_cache/"
os.environ["TRANSFORMERS_CACHE"] = "/autodl-tmp/transform_cache/"

result = subprocess.run('bash -c "source /etc/network_turbo && env | grep proxy"', shell=True, capture_output=True, text=True)
output = result.stdout
for line in output.splitlines():
    if '=' in line:
        var, value = line.split('=', 1)
        os.environ[var] = value        

In [6]:
# Verify that the environment variables were set
print("http_proxy",os.environ.get("http_proxy"))
print("https_proxy",os.environ.get("https_proxy"))
print("HF_HOME",os.environ.get("HF_HOME"))
print("HF_DATASETS_CACHE",os.environ.get("HF_DATASETS_CACHE"))
print("HUGGINGFACE_HUB_CACHE",os.environ.get("HUGGINGFACE_HUB_CACHE"))
print("TRANSFORMERS_CACHE",os.environ.get("TRANSFORMERS_CACHE"))

http_proxy http://172.20.0.113:12798
https_proxy http://172.20.0.113:12798
HF_HOME /autodl-tmp/cache/
HF_DATASETS_CACHE /autodl-tmp/datasets_cache/
HUGGINGFACE_HUB_CACHE /autodl-tmp/hub_cache/
TRANSFORMERS_CACHE /autodl-tmp/transform_cache/


In [None]:
# Skip this cell on subsequent runs

from datasets import load_dataset

dataset = load_dataset("yelp_review_full")

In [None]:
# Skip this cell on subsequent runs
dataset.save_to_disk('../../autodl-tmp/data/yelp_review_full')

In [1]:
from datasets import load_from_disk
dataset = load_from_disk('../../autodl-tmp/data/yelp_review_full')

In [2]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [3]:
dataset["train"][10]

{'label': 0,
 'text': "Owning a driving range inside the city limits is like a license to print money.  I don't think I ask much out of a driving range.  Decent mats, clean balls and accessible hours.  Hell you need even less people now with the advent of the machine that doles out the balls.  This place has none of them.  It is april and there are no grass tees yet.  BTW they opened for the season this week although it has been golfing weather for a month.  The mats look like the carpet at my 107 year old aunt Irene's house.  Worn and thread bare.  Let's talk about the hours.  This place is equipped with lights yet they only sell buckets of balls until 730.  It is still light out.  Finally lets you have the pit to hit into.  When I arrived I wasn't sure if this was a driving range or an excavation site for a mastodon or a strip mining operation.  There is no grass on the range. Just mud.  Makes it a good tool to figure out how far you actually are hitting the ball.  Oh, they are cash 

In [4]:
import random
import pandas as pd
import datasets
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    # Fail early if more examples are requested than the dataset contains.
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."

    picks = []
    # `_` is a throwaway loop variable: we only need to repeat num_examples times.
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        # Redraw until the index has not been picked before, so all picks are unique.
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    # Indexing the dataset with a list of indices returns the selected rows,
    # which are loaded into a pandas DataFrame for display.
    df = pd.DataFrame(dataset[picks])

    for column, typ in dataset.features.items():
        # ClassLabel features store integer class indices.
        if isinstance(typ, datasets.ClassLabel):
            # transform applies the lambda to every value in the column,
            # mapping each index i to its label name typ.names[i].
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

show_random_elements(dataset["train"])

Unnamed: 0,label,text
0,4 stars,"There are few places that I return to where I consistently eat as much as I can and then walk outside and dry heave (really the other only places are all-you-can-eat sushi joints). \n\nLoLo's is good, very good, but not great. The fried chicken is good; the skin is delicious. The scrambled eggs w/ cheese and onions are amazing, but the waffles are just all right. The people I go with tell me that the grits are fantastic, however grits don't do it for me. The amount of butter, cheese, and fried goodness is probably enough to actually kill a small child, and certainly enough to send anyone and everyone into a food coma. Of course you can get smaller portions than the 'K.K.' (3 pieces of fried chicken, 2 waffles, eggs, and grits), but why?! They also serve Kool-Aid and sweet tea, but they're pretty much just liquid sugar (try half sweet tea, half regular). And the \""homemade\"" lemonade is definitely just made from those powder lemonade mixes. There's usually a line, but I've never had to wait more than 10 minutes, and the service ain't great, but that hasn't really ever bothered me (I'm usually too excited for the chicken & waffles to let something like crappy service bother me). \n\nAll-in-all, it's a great place, especially if you're able to take a siesta afterwards (i.e., close the door to your office and pass out for an hour). But I can't say it's better than Roscoe's in LA; I mean, come on, Roscoe's puts cinnamon in its waffle batter, so good ....."
1,5 stars,"I'll stay here again, hands down. \n\nI can't thank Derek and his staff enough for the excellent service we had during our three night stay. Very welcoming staff. \n\nThe shuttle picked us up at the airport and offered to take us to our off site rental car location. We were allowed to check in early. Anything we needed Derek made it happen. \n\nThe rooms were clean, comfortable perfect for us to come back and relax after the long days we had. \n\nI just wanted to say that it was a very wonderful experience to stay here. Derek made the trip less stressful making sure we got to where we needed to go. Keep up the good work, its rare to find that kind of customer service these days."
2,5 stars,"One of the Best establishments in Vegas! I had no complaints at all. Excellent service, Friendly staff. Eye Candy, Small Gaming area, purple hue lighting everywhere, gigantic chandeliers. We were on the 50th plus floor. Might i mention the elevators are super fast. Rooms are Immaculate and Hip. View from our balcony overlooked the Bellagio. Perfect setting at night during the water show. I was pretty surprised with the self parking garage too. Modern Art on walls, Green light Red light indicators for aisle slots making it easy to find a parking spot. Perfect place to bring the Ladies!!!!"
3,5 stars,I chose this place based on the great reviews on Yelp and my experience was exactly as expected. I brought my BMW 535 to them for a custom amp install and it looks and sounds great. Highly recommended.
4,3 stars,"Even if you aren't crazy about M&M's like me, this was a nice little \""free\"" attraction on the strip. Has four stories of merchandise - top floor has a race car that you can take your picture with. They have a free movie, however, it was broken while we were there so no show for us. = ( \n\nPrices were a bit steep, but in line with Vegas standards ($10 for a baby bib, $28 for shirts, etc). The red M&M was out on the sidewalk for photos. We thought it was a stuffed animal at first because it didn't move and interact with the kids. We asked and was told that it was a costumed character. Not sure if it's just hard to move around in there or they weren't really into their job. All in all, a nice 20 minute walk through attraction."
5,5 stars,Both my husband and I have had our iPhones repaired here and they have consistently been professional and quick. There was no sales pressure and the technician was friendly and responsive to any questions I had. I'd highly recommend the camelback location and we will be back (hopefully not too soon). :).
6,5 stars,"Geat customer service ! Can't beat prices ! Staff is so helpful ! Check them out juice is amazing and I vape daily . Wax, herb ect they have it all . 180 plus flavors ."
7,1 star,"This is the devils playground, dont go! \nFor real.\n\nDont serve $1.50 PBR to the kind of peps that go to Philthy Phils. I saw 2 bar fights (same people, two separte fights,) before the music even started. \nI love a good dive bar..i am no snob.\nThis is the devils playground."
8,4 stars,"Late in posting this one. Stayed at the Elara last July for five night and was very pleased. Got a killer deal on one bedroom suite ( was the same price as most regular rooms in the nicer hotels). Room was great....spectacular view, full size automated projection screen, huge jacuzzi bath tub, clean comfortable furnishings, fully equipped kitchen. Pool was immaculate with cool lounge sofas everywhere. Location is perfect as it is connected to a mall, which leads to the Planet Hollywood casino, which then leads to the strip.......the nice part of the strip I might add. The room was quiet too by the way....( of course I didn't room next to the noisy people who posted below who thought having a party in the hallway at 5am was OK because by Vegas standards that's still early ????? )"
9,1 star,"The food is absolute garbage. Period. Anyone who says otherwise knows nothing about good food. Happy hour menu is exceptionally bad. Nice view, but that's about it. Don't eat there."


In [12]:
# Use a custom proxy; not needed in most environments

import subprocess
import os

# Set custom cache paths for the models and datasets that transformers downloads
# os.environ["CUDA_VISIBLE_DEVICES"] = "1"
os.environ["HF_DATASETS_CACHE"] = "/autodl-tmp/datasets_cache/"
os.environ["HF_HOME"] = "/autodl-tmp/cache/"
os.environ["HUGGINGFACE_HUB_CACHE"] = "/autodl-tmp/hub_cache/"
os.environ["TRANSFORMERS_CACHE"] = "/autodl-tmp/transform_cache/"

result = subprocess.run('bash -c "source /etc/profile.d/clash.sh && env | grep proxy"', shell=True, capture_output=True, text=True)
output = result.stdout
for line in output.splitlines():
    if '=' in line:
        var, value = line.split('=', 1)
        os.environ[var] = value    
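
The loop above copies proxy variables from the shell into `os.environ`. The parsing step can be sketched on its own as follows. The helper name `parse_env_lines` and the sample proxy addresses are assumptions made for illustration, not values taken from this environment.

```python
def parse_env_lines(output: str) -> dict:
    """Parse `env`-style output (one NAME=value per line) into a dict."""
    parsed = {}
    for line in output.splitlines():
        if "=" in line:
            # Split on the first '=' only, because values may contain '=' themselves.
            name, value = line.split("=", 1)
            parsed[name] = value
    return parsed


# Hypothetical sample of what `env | grep proxy` might print.
sample_output = (
    "http_proxy=http://127.0.0.1:7890\n"
    "https_proxy=http://127.0.0.1:7890\n"
    "no_proxy=localhost,127.0.0.1\n"
)
proxy_vars = parse_env_lines(sample_output)
print(proxy_vars)
```

Returning a dict instead of writing to `os.environ` directly makes it easy to inspect what would be set before applying it.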

In [24]:
# Run this cell only the first time

from transformers import AutoTokenizer, AutoConfig

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
config = AutoConfig.from_pretrained("bert-base-cased")

cache_directory = '../../autodl-tmp/tokenizer/bert-base-cased'

# Create the directory if it does not exist
os.makedirs(cache_directory, exist_ok=True)

tokenizer.save_pretrained(cache_directory)
config.save_pretrained(cache_directory)

print(f"Max Sequence Length for bert-base-cased (tokenizer): {tokenizer.model_max_length}")

Max Sequence Length for bert-base-cased (tokenizer): 512


In [5]:
from transformers import AutoTokenizer
# from transformers import AutoConfig

# tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# Path to the locally saved tokenizer
cache_directory = '../../autodl-tmp/tokenizer/bert-base-cased'

# Model name or path
# model_name_or_path = "bert-base-cased"

# Load the model config
# config = AutoConfig.from_pretrained(cache_directory)

# Print the maximum sequence length
# print(f"Max Sequence Length for bert-base-cased: {config.max_position_embeddings}")

tokenizer = AutoTokenizer.from_pretrained(cache_directory)

# Print the tokenizer's maximum sequence length
print(f"Max Sequence Length for bert-base-cased (tokenizer): {tokenizer.model_max_length}")

Max Sequence Length for bert-base-cased (tokenizer): 512


In [6]:
def tokenize_function(examples):
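    # padding="max_length" pads every sample to a fixed length: the max_length argument if given, otherwise the tokenizer's model_max_length (512 for bert-base-cased). Shorter sequences are padded at the end with the padding token ([PAD], id 0 for BERT). It does not pad to the longest sequence in the data; that is what padding="longest" (or padding=True) does.
    # truncation=True cuts sequences longer than the maximum length down to that length.
    # Sequences of equal length can be stacked into batches, which is what makes training efficient.
    # return tokenizer(examples["text"], padding="max_length", truncation=True)
    return tokenizer(examples["text"], padding="max_length", truncation=True, return_attention_mask=True)

# Dataset.map applies tokenize_function to the whole dataset. With batched=True the function receives a batch of samples per call instead of one sample, which speeds up tokenization.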
tokenized_datasets = dataset.map(tokenize_function, batched=True)

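# In NLP, and particularly with pretrained Transformer models such as BERT or GPT, attention_mask is a binary mask that tells the model which positions to attend to.

# For a given input sequence, attention_mask has the same length as input_ids,
# and each position holds either 0 or 1:

# 0: the position is masked (typically padding), so the attention mechanism ignores it.
# 1: the position holds a real token, so the model takes it into account.
# Sequences in a batch must share one length to be stacked into a tensor, so shorter texts are padded.
# Setting the mask to 0 at the padded positions lets the model handle variable-length texts correctly.

# In Hugging Face Transformers, attention_mask is normally passed to the model as an input.
# The tokenizer returns a dict-like object that includes input_ids and attention_mask.
# Passing attention_mask to the model keeps the padding from influencing the result.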
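
To make the relationship between padding, truncation, and `attention_mask` concrete, here is a minimal sketch in plain Python. It is not the tokenizer's implementation: `pad_to_max_length` is a hypothetical helper, and unlike the real BERT tokenizer it truncates by simple slicing rather than preserving the final `[SEP]` token.

```python
def pad_to_max_length(token_ids, max_length, pad_token_id=0):
    """Truncate or right-pad token_ids to exactly max_length and build the mask."""
    ids = list(token_ids[:max_length])      # truncation: drop anything past max_length
    attention_mask = [1] * len(ids)         # 1 marks a real token
    num_pad = max_length - len(ids)
    ids += [pad_token_id] * num_pad         # padding: fill the tail with the pad id
    attention_mask += [0] * num_pad         # 0 marks a padded position
    return {"input_ids": ids, "attention_mask": attention_mask}


short = pad_to_max_length([101, 5255, 24235, 102], max_length=8)
long = pad_to_max_length(list(range(1, 13)), max_length=8)
print(short)
print(long)
```

The short input gains four padded positions with mask 0, while the long input is cut to eight tokens and its mask is all ones.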

In [7]:
# Print the tokenizer's maximum sequence length
print(f"Max Sequence Length for bert-base-cased: {tokenizer.model_max_length}")
show_random_elements(tokenized_datasets["train"], num_examples=2)

Max Sequence Length for bert-base-cased: 512


Unnamed: 0,label,text,input_ids,token_type_ids,attention_mask
0,3 stars,"Liquid was very clean and modern. Small layout. $50 dollar minimum. $15 drinks. However, the pool boys know what customer service means. On hot days, get the frozen fruits $16, its frozen goodness!","[101, 5255, 24235, 1108, 1304, 4044, 1105, 2030, 119, 6844, 9726, 119, 109, 1851, 8876, 5867, 119, 109, 1405, 8898, 119, 1438, 117, 1103, 4528, 3287, 1221, 1184, 8132, 1555, 2086, 119, 1212, 2633, 1552, 117, 1243, 1103, 7958, 11669, 109, 1479, 117, 1157, 7958, 18023, 106, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]"
1,3 stars,There once was a chick who was sure\nShe would dig this new club at Encore.\nThough the guest list was free\nThe dudes were FUG-ly\nAnd the best part was just the d\u00e9cor. \n\n:),"[101, 1247, 1517, 1108, 170, 22282, 1150, 1108, 1612, 165, 183, 1708, 4638, 1156, 11902, 1142, 1207, 1526, 1120, 13832, 9475, 119, 165, 183, 1942, 14640, 5084, 1103, 3648, 2190, 1108, 1714, 165, 183, 1942, 4638, 17869, 1116, 1127, 143, 2591, 2349, 118, 181, 1183, 165, 183, 1592, 3276, 1103, 1436, 1226, 1108, 1198, 1103, 173, 165, 190, 7629, 1162, 1580, 19248, 119, 165, 183, 165, 183, 131, 114, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]"


In [8]:
# Check that the data has been processed: every sample should have 512 input_ids
for i,example in enumerate(tokenized_datasets["train"]):
    if i >= 10:
        break
    print(len(example['input_ids']))

512
512
512
512
512
512
512
512
512
512


In [9]:
from transformers import AutoModelForSequenceClassification

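# First download the model to a local directory from the command line:
# huggingface-cli download --resume-download --local-dir-use-symlinks False bert-base-cased --local-dir /root/autodl-tmp/model/bert-base-cased

# AutoModelForSequenceClassification:
# Used for sequence classification tasks such as text classification.
# It loads the pretrained weights and adds a classification head on top. The head (classifier.weight, classifier.bias) is newly initialized, as the warning below shows, and is learned during fine-tuning.

# Path to the local model directory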
cache_directory = '../../autodl-tmp/model/bert-base-cased'

model = AutoModelForSequenceClassification.from_pretrained(cache_directory, num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ../../autodl-tmp/model/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [30]:
import requests

def test_network_connection(url="http://huggingface.co"):
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()
        print("Network connection is successful.")
    except requests.RequestException as e:
        print(f"Network connection failed. Error: {e}")

# Test the network connection
test_network_connection()

Network connection is successful.


In [26]:
import evaluate

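# The accuracy metric files were downloaded beforehand from GitHub (https://github.com/huggingface/evaluate/tree/main) to a local directory,
# because evaluate.load("accuracy") fetches the script over the network, which fails when the Hugging Face Hub is unreachable.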

file_path = '../../autodl-tmp/evaluate/metrics/accuracy'

metric = evaluate.load(file_path)

print(f"Loaded metric: {metric}")

Loaded metric: EvaluationModule(name: "accuracy", module_type: "metric", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """
Args:
    predictions (`list` of `int`): Predicted labels.
    references (`list` of `int`): Ground truth labels.
    normalize (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
    sample_weight (`list` of `float`): Sample weights Defaults to None.

Returns:
    accuracy (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input, if `normalize` is set to `True`.. A higher score means higher accuracy.

Examples:

    Example 1-A simple example
        >>> accuracy_metric = evaluate.load("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
        >>> print(result

In [27]:
import numpy as np 

def compute_metrics(eval_pred):
    # eval_pred is a pair: the model's logits (one score per class) and the ground-truth labels.
    # Unpack it into the variables logits and labels.
    logits, labels = eval_pred
    # np.argmax picks the class with the highest logit for each sample, which is the model's prediction.
    # axis=-1 runs argmax over the last dimension, which is the class dimension in classification tasks.
    predictions = np.argmax(logits, axis=-1)
    # Pass the predictions and the ground-truth labels to metric.compute.
    # It returns a dict with the computed metric values.
    return metric.compute(predictions=predictions, references=labels)

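# axis=-1 means "operate along the last dimension". Consider a 2D array (matrix) holding the logits of three classes for each example:
# import numpy as np

# logits = np.array([[0.8, 0.2, 0.1],
#                    [0.4, 0.6, 0.9],
#                    [0.2, 0.5, 0.7]])

# Each row holds one example's logits and each column corresponds to one class. To find the index (class) with the highest logit for each example, use np.argmax:
# predictions = np.argmax(logits, axis=-1)
# print(predictions)

# The output is an array holding the index (class) of the highest logit for each example:
# [0 2 2]

# How axis=-1 works in this context:

# logits has shape (3, 3): the first dimension is the number of examples, the second the number of classes.
# axis=-1 applies the operation (np.argmax here) along the last dimension, which for a 2D array is the second one.
# So for each row (example), np.argmax scans across the columns (classes) and returns the index of the highest logit.
# In this example class 0 has the highest logit for the first example and class 2 for the second and third. This is the usual way to turn a batch of logits into predicted classes in a classification task.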
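
The same computation can be followed end to end without any library. The sketch below reimplements the two steps of `compute_metrics` in plain Python purely for illustration; `argmax_last_axis` and `accuracy` are hypothetical helpers standing in for `np.argmax(..., axis=-1)` and the loaded accuracy metric, and the labels are made up.

```python
def argmax_last_axis(logits):
    """For each row, return the index of its largest value (like np.argmax(logits, axis=-1))."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy(predictions, references):
    """Fraction of predictions that equal the reference labels."""
    correct = sum(int(p == r) for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}


logits = [[0.8, 0.2, 0.1],
          [0.4, 0.6, 0.9],
          [0.2, 0.5, 0.7]]
labels = [0, 2, 1]  # hypothetical ground-truth labels

predictions = argmax_last_axis(logits)
result = accuracy(predictions, labels)
print(predictions, result)
```

With these labels two of the three predictions are correct, so the accuracy is 2/3.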

In [19]:
from transformers import TrainingArguments, Trainer

model_dir = '../../autodl-tmp/model/bert-base-cased-trained' #"models/bert-base-cased-all"
batch_size=31

training_args = TrainingArguments(output_dir=f"{model_dir}/test_trainer",
                                  evaluation_strategy="epoch", 
                                  logging_dir=f"{model_dir}/test_trainer/runs",
                                  logging_steps=1000,
                                  per_device_train_batch_size=batch_size,
                                  save_total_limit=5)

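# logging_steps defaults to 500; given the size of our training data and the number of steps, it is set to 1000 here.

# output_dir:
# Directory where checkpoints and other training outputs are written.
# Here they are saved to {model_dir}/test_trainer.

# logging_dir:
# Directory for TensorBoard log files.
# Here the logs are saved to {model_dir}/test_trainer/runs.

# logging_steps:
# Number of training steps between two log entries.
# Here training information is logged every 1000 steps.

# save_total_limit caps how many checkpoints are kept on disk, so that checkpoint files do not take up too much space during training.

# Full hyperparameter configuration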
print(training_args)

TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=epoch,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_
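
The effect of `save_total_limit=5` set in the cell above can be pictured as a rotation over checkpoint directories. The sketch below is not the Trainer's implementation; it only assumes the `checkpoint-<step>` directory naming and shows which directories would be the oldest ones beyond the limit. The function name `checkpoints_to_delete` and the step values are hypothetical.

```python
import re


def checkpoints_to_delete(checkpoint_dirs, save_total_limit):
    """Return the oldest checkpoint directories that exceed save_total_limit."""
    def step_of(name):
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        return int(match.group(1)) if match else -1

    # Keep only well-formed names and order them from oldest to newest step.
    ordered = sorted((d for d in checkpoint_dirs if step_of(d) >= 0), key=step_of)
    excess = max(0, len(ordered) - save_total_limit)
    return ordered[:excess]


# Seven hypothetical checkpoints: checkpoint-500 ... checkpoint-3500.
existing = [f"checkpoint-{step}" for step in range(500, 4000, 500)]
print(checkpoints_to_delete(existing, save_total_limit=5))
```

With seven checkpoints and a limit of five, the two with the lowest step numbers are the ones selected for removal.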

In [20]:
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

In [29]:
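# resume_from_checkpoint=True resumes from the latest checkpoint in output_dir and raises an error if none exists; on a first run, call trainer.train() instead.
trainer.train(resume_from_checkpoint=True)
# The defaults are per_device_train_batch_size=8 and num_train_epochs=3. This run sets the batch size to 31 and trains on all 650,000 samples,
# so one epoch takes ceil(650000 / 31) = 20968 steps and three epochs take 62904 steps, which matches global_step in the output below.
# batch_size is an important concept in deep learning: the number of samples fed to the model for each parameter update.
# Training data is split into mini-batches of this fixed size, and the model processes one mini-batch per step.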

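# The notes below explain the output of nvidia-smi, the command for checking NVIDIA GPU status.

# Driver Version and CUDA Version:

# Driver Version is the version of the installed NVIDIA driver.
# CUDA Version is the highest CUDA (Compute Unified Device Architecture) version the driver supports; CUDA is NVIDIA's platform for parallel computing on GPUs.
# GPU information:

# GPU Name: the model name of the GPU.
# Persistence-M: persistence mode; "On" means it is enabled.
# Bus-Id: the GPU's unique ID on the system bus.
# Disp.A: whether a display is active on this GPU; "Off" means none is.
# Fan, Temp, Perf: fan speed, temperature, and performance state.
# Pwr:Usage/Cap: current power draw and the power limit.
# Memory-Usage: GPU memory in use.
# GPU-Util: GPU utilization, i.e. how heavily the GPU is being used.
# Compute M.: the compute mode (for example Default or Exclusive Process).
# Processes:

# Lists the processes using the GPU: GPU ID, process ID (PID), process type, process name, and GPU memory usage.
# The fields to watch are memory usage, GPU utilization, and temperature. They help you monitor the GPU, especially its load while a model is training.

# Press Ctrl + C to exit the watch command and return to the shell.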

Epoch,Training Loss,Validation Loss,Accuracy
1,0.7332,0.697371,0.709
2,0.6679,0.672976,0.718
3,0.5507,0.72242,0.715


TrainOutput(global_step=62904, training_loss=0.5141013600391755, metrics={'train_runtime': 20072.0677, 'train_samples_per_second': 97.15, 'train_steps_per_second': 3.134, 'total_flos': 5.1361871752428134e+17, 'train_loss': 0.5141013600391755, 'epoch': 3.0})
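
The `global_step` value above can be reproduced from the dataset size, batch size, and epoch count. The sketch below assumes a single device and no gradient accumulation; `total_training_steps` is a hypothetical helper, not a Transformers function.

```python
import math


def total_training_steps(num_samples, batch_size, num_epochs):
    """Optimizer steps for a single device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_samples / batch_size)  # the last batch may be smaller
    return steps_per_epoch * num_epochs


# This run: 650,000 training samples, batch size 31, 3 epochs.
full_run = total_training_steps(650_000, 31, 3)
# A 1,000-sample subset with the default batch size of 8 and 3 epochs.
small_run = total_training_steps(1_000, 8, 3)
print(full_run, small_run)
```

The first value equals the reported `global_step`; the second shows how short a run on a 1,000-sample subset would be with the default settings.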

In [30]:
small_test_dataset = tokenized_datasets["test"].shuffle(seed=64).select(range(100))

trainer.evaluate(small_test_dataset)

{'eval_loss': 0.8521449565887451,
 'eval_accuracy': 0.61,
 'eval_runtime': 0.5673,
 'eval_samples_per_second': 176.283,
 'eval_steps_per_second': 22.917,
 'epoch': 3.0}

In [32]:
trainer.save_model(f"{model_dir}/finetuned-trainer")
trainer.save_state()