## Text Classification

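### Transfer Learning with Pretrained Models

With high-quality pretrained models and the PaddleHub Fine-tune API, a deep learning model for natural language processing or computer vision tasks can be built with only a small amount of code.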

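### Selecting and Loading a Pretrained Model

This section uses the ERNIE Tiny model to demonstrate fine-tuning with PaddleHub. ERNIE Tiny is a compressed version of the ERNIE 2.0 Base model, obtained mainly through model-structure compression and model distillation. Compared with ERNIE 2.0, ERNIE Tiny delivers a 4.3x speedup in prediction, which makes it better suited to industrial deployment.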

In [2]:
# Load the pretrained model
# Params:
# - name         model name
# - version      model version
# - task         fine-tune task; seq-cls here denotes text classification
# - num_classes  number of classes in the text classification task, determined by the dataset in use; defaults to 2

import paddlehub as hub

model = hub.Module(name='ernie_tiny', task='seq-cls', num_classes=2)

[2024-02-23 12:53:51,588] [    INFO] - Already cached C:\Users\Neo\.paddlenlp\models\ernie-tiny\model_state.pdparams
[2024-02-23 12:53:51,589] [    INFO] - Loading weights file model_state.pdparams from cache at C:\Users\Neo\.paddlenlp\models\ernie-tiny\model_state.pdparams
[2024-02-23 12:53:51,932] [    INFO] - Loaded weights file from disk, setting weights to model.
[2024-02-23 12:54:08,007] [    INFO] - All model checkpoint weights were used when initializing ErnieForSequenceClassification.

You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Preparing and Loading the Dataset

This example uses ChnSentiCorp, a sentiment analysis dataset built into PaddleHub.

In [3]:
# Automatically download the dataset and extract it to $HUB_HOME/.paddlehub/dataset under the user directory
train_dataset = hub.datasets.ChnSentiCorp(tokenizer=model.get_tokenizer(), max_seq_len=128, mode='train')
dev_dataset = hub.datasets.ChnSentiCorp(tokenizer=model.get_tokenizer(), max_seq_len=128, mode='dev')

[2024-02-23 12:54:23,704] [    INFO] - Already cached C:\Users\Neo\.paddlenlp\models\ernie-tiny\vocab.txt
[2024-02-23 12:54:23,707] [    INFO] - Already cached C:\Users\Neo\.paddlenlp\models\ernie-tiny\spm_cased_simp_sampled.model
[2024-02-23 12:54:23,709] [    INFO] - Already cached C:\Users\Neo\.paddlenlp\models\ernie-tiny\dict.wordseg.pickle
[2024-02-23 12:54:28,415] [    INFO] - tokenizer config file saved in C:\Users\Neo\.paddlenlp\models\ernie-tiny\tokenizer_config.json
[2024-02-23 12:54:28,419] [    INFO] - Special tokens file saved in C:\Users\Neo\.paddlenlp\models\ernie-tiny\special_tokens_map.json
[2024-02-23 12:54:38,461] [    INFO] - Already cached C:\Users\Neo\.paddlenlp\models\ernie-tiny\vocab.txt
[2024-02-23 12:54:38,463] [    INFO] - Already cached C:\Users\Neo\.paddlenlp\models\ernie-tiny\spm_cased_simp_sampled.model
[2024-02-23 12:54:38,466] [    INFO] - Already cached
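
The `max_seq_len=128` argument above caps the number of tokens per example. As a rough illustration of what a fixed sequence length means, here is a minimal sketch that truncates or pads a list of token ids. The helper `pad_or_truncate`, the padding id `0`, and the sample ids are assumptions made for this example; the real tokenizer also handles special tokens and may behave differently.

```python
from typing import List


def pad_or_truncate(token_ids: List[int], max_seq_len: int, pad_id: int = 0) -> List[int]:
    """Return exactly max_seq_len ids: clip long inputs, right-pad short ones.

    pad_id=0 is an assumed padding id, not taken from the ERNIE Tiny vocabulary.
    """
    clipped = token_ids[:max_seq_len]
    return clipped + [pad_id] * (max_seq_len - len(clipped))


# Hypothetical token ids, chosen only to show both branches.
print(pad_or_truncate([101, 7, 8, 102], max_seq_len=6))  # [101, 7, 8, 102, 0, 0]
print(pad_or_truncate(list(range(10)), max_seq_len=6))   # [0, 1, 2, 3, 4, 5]
```

A larger `max_seq_len` keeps more of each review but increases memory use and time per step, so 128 is a trade-off rather than a required value.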

### Choosing the Optimization Strategy and Run Configuration

In [4]:
import paddle

"""Optimizer

Available optimizers include SGD, Adam, and Adamax.
learning_rate: global learning rate (defaults to 1e-3)
parameters: the model parameters to optimize
"""

optimizer = paddle.optimizer.Adam(learning_rate=5e-5, parameters=model.parameters())
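
To make the role of `learning_rate` concrete, the following is a minimal sketch of one Adam update on a single scalar parameter. It follows the standard Adam rule with the commonly used defaults `beta1=0.9`, `beta2=0.999`, and `eps=1e-8`. These defaults, the function name `adam_step`, and the toy gradient are assumptions for illustration; this is not PaddlePaddle's implementation.

```python
import math


def adam_step(param, grad, m, v, t, lr=5e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one standard Adam update to a scalar and return (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for step t (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy values (assumed): a parameter at 1.0 with gradient 0.3.
param, m, v = adam_step(param=1.0, grad=0.3, m=0.0, v=0.0, t=1)
print(param)  # roughly 1.0 - 5e-5
```

On the first step the bias-corrected ratio is close to 1 in magnitude, so the parameter moves by roughly `learning_rate`. This is one reason a small value such as 5e-5 keeps fine-tuning updates gentle.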

In [None]:
"""Run configuration

Trainer controls the fine-tuning run and accepts the following parameters:
- model: the model to optimize
- optimizer: the optimizer to use
- use_gpu: whether to train on a GPU
- use_vdl: whether to visualize the training process with VisualDL
- checkpoint_dir: directory where model parameters are saved
- compare_metrics: the metric used to select the best model
"""

trainer = hub.Trainer(model, optimizer, checkpoint_dir='output/ernie_text_cls')
trainer.train(train_dataset, epochs=1, batch_size=64, eval_dataset=dev_dataset, save_interval=1)

[2024-02-23 14:11:19,287] [   TRAIN] - Epoch=1/1, Step=10/150 loss=0.1617 acc=0.9422 lr=0.000050 step/sec=0.02 | ETA 02:19:48
[2024-02-23 14:21:13,709] [   TRAIN] - Epoch=1/1, Step=20/150 loss=0.1411 acc=0.9469 lr=0.000050 step/sec=0.02 | ETA 02:24:12
[2024-02-23 14:31:21,063] [   TRAIN] - Epoch=1/1, Step=30/150 loss=0.1463 acc=0.9516 lr=0.000050 step/sec=0.02 | ETA 02:26:44
[2024-02-23 14:41:12,881] [   TRAIN] - Epoch=1/1, Step=40/150 loss=0.1438 acc=0.9484 lr=0.000050 step/sec=0.02 | ETA 02:27:03
[2024-02-23 14:50:40,843] [   TRAIN] - Epoch=1/1, Step=50/150 loss=0.1227 acc=0.9609 lr=0.000050 step/sec=0.02 | ETA 02:26:02
[2024-02-23 15:00:25,958] [   TRAIN] - Epoch=1/1, Step=60/150 loss=0.1208 acc=0.9500 lr=0.000050 step/sec=0.02 | ETA 02:26:04
[2024-02-23 15:10:21,408] [   TRAIN] - Epoch=1/1, Step=70/150 loss=0.0963 acc=0.9750 lr=0.000050 step/sec=0.02 | ETA 02:26:28
[2024-02-23 15:20:47,8
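
The `Step=.../150` counter and the slow `step/sec` figure in the log can be sanity-checked with a little arithmetic. The sketch below assumes a training split of 9,600 examples (an assumption, chosen because it is consistent with 150 steps at `batch_size=64`) and reads two timestamps copied from the log lines for steps 10 and 20. The helper names are hypothetical.

```python
import math
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of optimizer steps needed to see every example once."""
    return math.ceil(num_examples / batch_size)


def seconds_per_step(t_start: str, t_end: str, steps: int) -> float:
    """Average wall-clock seconds per step between two log timestamps."""
    delta = datetime.strptime(t_end, LOG_TIME_FORMAT) - datetime.strptime(t_start, LOG_TIME_FORMAT)
    return delta.total_seconds() / steps


# 9600 is an assumed training-set size; 64 is the batch_size passed to trainer.train.
print(steps_per_epoch(9600, 64))

# Timestamps of the Step=10 and Step=20 log lines (10 steps apart).
print(seconds_per_step("2024-02-23 14:11:19,287", "2024-02-23 14:21:13,709", steps=10))
```

At about one minute per step, a single epoch takes over two hours in this run, which is where the `use_gpu` option of `Trainer` becomes relevant.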

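### Model Prediction

Once fine-tuning finishes, the model that performed best on the validation set during fine-tuning is saved in the `${CHECKPOINT_DIR}/best_model` directory, where `${CHECKPOINT_DIR}` is the checkpoint directory chosen for fine-tuning.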
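
As a small convenience, the path to the best checkpoint can be built from the checkpoint directory and checked before loading. This is only a sketch: the helper `best_model_params` is hypothetical, and the file name `model.pdparams` is taken from the `load_checkpoint` argument used in the prediction cell.

```python
from pathlib import Path


def best_model_params(checkpoint_dir: str, filename: str = "model.pdparams") -> Path:
    """Build ${CHECKPOINT_DIR}/best_model/<filename> as a Path."""
    return Path(checkpoint_dir) / "best_model" / filename


params_path = best_model_params("output/ernie_text_cls")
print(params_path.as_posix())  # output/ernie_text_cls/best_model/model.pdparams

# Fail early with a readable message instead of a loading error later on.
if not params_path.is_file():
    print("No best_model checkpoint found; run fine-tuning first.")
```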

In [None]:
import paddlehub as hub

# Data to predict: each inner list holds one text
data = [
    [''],
    [''],
    ['']
]

# Mapping from predicted class index to label
label_map = {0: 'negative', 1: 'positive'}

# Load the fine-tuned model
model = hub.Module(
    name='ernie_tiny',
    # version='2.0.1',
    task='seq-cls',
    load_checkpoint='./output/ernie_text_cls/best_model/model.pdparams',
    label_map=label_map
)

# Run prediction
res = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False)

# Print the results
for idx, text in enumerate(data):
    print('Text: {} \t Label: {}'.format(text[0], res[idx]))
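
Conceptually, `label_map` turns a predicted class index into a readable label. The sketch below shows that idea in plain Python, assuming the classifier produces one raw score (logit) per class. The helpers `softmax` and `to_label` and the sample logits are hypothetical and do not reflect PaddleHub's internals.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_label(logits, label_map):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return label_map[best], probs[best]


label_map = {0: 'negative', 1: 'positive'}

# Hypothetical logits for a single review; index 1 has the larger score.
label, prob = to_label([-1.2, 2.3], label_map)
print(label, round(prob, 3))
```

Keeping the probability next to the label makes it easier to spot low-confidence predictions when reviewing results.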