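# Hugging Face Transformers Core Library in Practice

This notebook contains all the example code from the document "Hugging Face Ecosystem and Core Libraries", covering the full workflow for Pipeline, AutoClass, Datasets, and Trainer.

## 1. Out-of-the-Box Pipeline

Pipeline is the simplest way in: specify only the task type, and a default model for that task is loaded automatically.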

In [11]:
from transformers import pipeline

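# Sentiment analysis (downloads the default English model distilbert-base-uncased-finetuned-sst-2-english)
# In production, pass model= and revision= explicitly, as the warning in the output advises.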
classifier = pipeline("sentiment-analysis")
result = classifier("I love Hugging Face!")
print(result)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9998641014099121}]


In [12]:
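# Image classification example (uses the small MobileViT model, only about 20 MB)
# Requires pillow: pip install pillow
# No task is given here; the pipeline infers it from the model.
vision_classifier = pipeline(model="apple/mobilevit-small")

# Test with an online image URL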
result = vision_classifier("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/coco_sample.png")
print(result)

Device set to use cuda:0


[{'label': 'tabby, tabby cat', 'score': 0.29938068985939026}, {'label': 'Egyptian cat', 'score': 0.14064963161945343}, {'label': 'tiger cat', 'score': 0.13906577229499817}, {'label': 'remote control, remote', 'score': 0.01750039868056774}, {'label': 'mouse, computer mouse', 'score': 0.015360303223133087}]
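The pipeline returns a list of dicts, each with a `label` and a `score`. A minimal sketch of post-processing that output, using the scores printed above (rounded to four decimals); the helper name `top_labels` and the 0.1 threshold are illustrative assumptions, not part of the Transformers API:

```python
# Pipeline output copied from the cell above (scores rounded to 4 decimals).
result = [
    {"label": "tabby, tabby cat", "score": 0.2994},
    {"label": "Egyptian cat", "score": 0.1406},
    {"label": "tiger cat", "score": 0.1391},
    {"label": "remote control, remote", "score": 0.0175},
    {"label": "mouse, computer mouse", "score": 0.0154},
]

def top_labels(predictions, min_score=0.1):
    """Return (label, score) pairs at or above min_score, highest first.

    Hypothetical helper for illustration; not a Transformers function.
    """
    kept = [(p["label"], p["score"]) for p in predictions if p["score"] >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

for label, score in top_labels(result):
    print(f"{label}: {score:.2%}")
```

With these scores, only the three cat classes pass the threshold, which is a reminder that a low top score (about 30% here) can still be the most likely label.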


## 2. AutoClass: Loading Models Manually

Use `AutoTokenizer` and `AutoModel` to load a model and tokenizer manually, then save them to a local directory.

In [13]:
from transformers import AutoTokenizer, AutoModel

# Automatic loading: no need to import BertTokenizer or BertModel by hand
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

# Save: write the model weights and tokenizer files to a local directory
model.save_pretrained("./my_local_bert")
tokenizer.save_pretrained("./my_local_bert")

('./my_local_bert\\tokenizer_config.json',
 './my_local_bert\\special_tokens_map.json',
 './my_local_bert\\vocab.txt',
 './my_local_bert\\added_tokens.json',
 './my_local_bert\\tokenizer.json')
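A directory written by `save_pretrained` can be passed back to `from_pretrained` in place of a Hub model name. The sketch below checks that round trip offline. To avoid a download, it assumes a tiny, randomly initialized `BertConfig` (the sizes are arbitrary illustration values) rather than `bert-base-chinese`:

```python
import tempfile

import torch
from transformers import AutoModel, BertConfig, BertModel

# Assumption: a tiny random BERT stands in for bert-base-chinese so that
# the example runs without network access. The sizes are arbitrary.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)
model = BertModel(config)

with tempfile.TemporaryDirectory() as save_dir:
    # Writes config.json plus the weight file into save_dir.
    model.save_pretrained(save_dir)
    # AutoModel reads config.json and picks the matching model class.
    reloaded = AutoModel.from_pretrained(save_dir)

# Every tensor should survive the round trip unchanged.
original_state = model.state_dict()
weights_match = all(
    torch.equal(tensor, original_state[name])
    for name, tensor in reloaded.state_dict().items()
)
print(type(reloaded).__name__, weights_match)
```

The same pattern should apply to the `./my_local_bert` directory saved above: pass that path to `AutoTokenizer.from_pretrained` and `AutoModel.from_pretrained`.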

## 3. Core Components: The Standard Inference Flow

This section walks through the full path from Tokenizer to Model to post-processing.

In [14]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

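# Prepare the model and tokenizer
checkpoint = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# 1. Tokenizer: text -> tensor
# The Chinese sentence ("Hugging Face makes NLP simple") is kept because the checkpoint is a Chinese model.
text = "Hugging Face 让 NLP 变得简单"
inputs = tokenizer(text, return_tensors="pt")
print(f"Input IDs: {inputs['input_ids']}")

# 2. Model: tensor -> logits
# For inference only, wrapping this call in torch.no_grad() skips gradient tracking
# (the grad_fn shown in the output comes from running without it).
outputs = model(**inputs)
logits = outputs.logits

# 3. Post-processing: logits -> probabilities
# The classification head is newly initialized (see the warning in the output),
# so these probabilities are not meaningful until the model is fine-tuned.
predictions = torch.nn.functional.softmax(logits, dim=-1)
print(f"Predictions: {predictions}")

# 4. Save (optional): after fine-tuning, the model can be saved
# model.save_pretrained("./my_fine_tuned_bert")
# tokenizer.save_pretrained("./my_fine_tuned_bert")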

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-chinese and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Input IDs: tensor([[ 101,  100,  100, 6375,  100, 1359, 2533, 5042, 1296,  102]])
Predictions: tensor([[0.5858, 0.4142]], grad_fn=<SoftmaxBackward0>)
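Step 3 turns raw logits into probabilities with softmax: each logit is exponentiated and divided by the sum of all exponentials. A plain-Python sketch of that step, using made-up logits and a hypothetical `id2label` mapping (in practice the mapping comes from the model's config):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Assumption: illustrative logits and label names, not real model output.
logits = [1.2, -0.3]
id2label = {0: "LABEL_0", 1: "LABEL_1"}

probs = softmax(logits)
# The predicted class is the index with the highest probability.
predicted_id = max(range(len(probs)), key=probs.__getitem__)
print(probs, id2label[predicted_id])
```

Softmax preserves the ordering of the logits, so taking the argmax of the logits directly would pick the same class; the probabilities are only needed when a confidence value is wanted.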


## 4. Datasets: Loading Data

Load the `rotten_tomatoes` dataset. Tokenization is applied in the Trainer section.

In [15]:
from datasets import load_dataset

# Load the Rotten Tomatoes movie review dataset
dataset = load_dataset("rotten_tomatoes")
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

## 5. Trainer: Fine-Tuning and Evaluation

Use `Trainer` for a standard fine-tuning run, including data preprocessing, argument configuration, and metric definition.

In [16]:
from transformers import TrainingArguments, Trainer, AutoModelForSequenceClassification, AutoTokenizer

# Use the English model distilbert-base-uncased
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    # Pad or truncate every example to the model's maximum length
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# For a quick demo, take only a small slice of the data
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(100))
small_eval_dataset = tokenized_datasets["validation"].shuffle(seed=42).select(range(100))

# 1. Prepare the model
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

# 2. Configure the training arguments
training_args = TrainingArguments(
    output_dir="test_trainer",
    eval_strategy="epoch",  # evaluate at the end of every epoch
    num_train_epochs=1,
)

# 3. Instantiate the Trainer and start training
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
)
trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss
1,No log,0.690155


TrainOutput(global_step=13, training_loss=0.6814216467050406, metrics={'train_runtime': 5.6133, 'train_samples_per_second': 17.815, 'train_steps_per_second': 2.316, 'total_flos': 13246739865600.0, 'train_loss': 0.6814216467050406, 'epoch': 1.0})
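The `global_step=13` in the output follows from the batch size. Assuming the `TrainingArguments` defaults of `per_device_train_batch_size=8` and `logging_steps=500`, and a single device (the earlier logs show `cuda:0`), 100 samples need 13 optimizer steps per epoch, and the `No log` entry is consistent with the logging interval never being reached. A quick check of the arithmetic:

```python
import math

num_samples = 100     # size of small_train_dataset
batch_size = 8        # assumed default per_device_train_batch_size
num_devices = 1       # assumed single GPU
num_epochs = 1        # num_train_epochs from the cell above
logging_steps = 500   # assumed default logging_steps

# The last batch is partial (4 samples), so round up.
steps_per_epoch = math.ceil(num_samples / (batch_size * num_devices))
total_steps = steps_per_epoch * num_epochs
print(total_steps)                   # 13
print(total_steps >= logging_steps)  # False: no training loss gets logged
```

To see a training loss in such a short run, one option is to set a smaller `logging_steps` value in `TrainingArguments`.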

In [17]:
import numpy as np
import evaluate

# Load the evaluation metric
metric = evaluate.load("accuracy")

# Metric function: convert logits to predictions and compute accuracy
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Re-instantiate the Trainer, passing in compute_metrics.
# Note: `model` was already trained for one epoch in the previous cell,
# so this run continues from those weights rather than starting fresh.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,  # the key step
)

# Train and evaluate
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.682575,0.54


TrainOutput(global_step=13, training_loss=0.6109083249018743, metrics={'train_runtime': 4.2211, 'train_samples_per_second': 23.69, 'train_steps_per_second': 3.08, 'total_flos': 13246739865600.0, 'train_loss': 0.6109083249018743, 'epoch': 1.0})
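`compute_metrics` receives the logits and labels for the whole evaluation set as arrays. A sketch of what the argmax-then-accuracy step computes, written with NumPy alone on invented logits (the function name `accuracy_from_logits` is hypothetical and stands in for the `evaluate` metric):

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float(np.mean(predictions == labels))}

# Assumption: made-up logits for four examples and two classes.
logits = np.array([[2.0, 0.5], [0.1, 1.3], [0.9, 0.2], [0.4, 0.6]])
labels = np.array([0, 1, 1, 1])

print(accuracy_from_logits(logits, labels))  # {'accuracy': 0.75}
```

Returning a dict matters: the keys are what show up as extra columns in the evaluation table, such as `Accuracy` above.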