## Text Classification

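- Text classification is the task of using a computer to assign text to categories automatically. It is one of the core tasks in natural language processing (NLP).
- ERNIE is a pretrained model that uses three levels of Knowledge Masking to help the model learn language knowledge, and it outperforms BERT on a number of tasks. Structurally, it uses the Encoder part of the Transformer as its backbone.
    - Seq2Seq model: a sequence-to-sequence model is a family of end-to-end frameworks that transform one sequence into another. It is used in machine translation, automatic question answering, and similar scenarios.
    - Attention mechanism: attention can be explained intuitively through human cognition. For example, our visual system tends to focus on the parts of an image that help with a judgment and to ignore irrelevant information. Likewise, in NLP problems some parts of the input may be more helpful for a decision than others.
    - Transformer model: many semantic learning problems in NLP involve large amounts of training data, but RNN-style models have sequential computational dependencies and cannot be trained efficiently in parallel. Self-attention replaces the recurrence: each input position computes a matching score against the other positions to determine its attention weights, so the model gains an attention mechanism and can also be computed in parallel. The Transformer is an Encoder-Decoder architecture built around this self-attention structure.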
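
The self-attention idea in the last bullet can be sketched in a few lines. The following is a minimal, dependency-free illustration of scaled dot-product self-attention, assuming the queries, keys, and values are all the raw input vectors (a real Transformer first applies learned projection matrices, which are omitted here). The function name `self_attention` and the toy inputs are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (an assumption).

    x is a list of n vectors (lists of floats) of dimension d.
    Returns (outputs, weights); weights[i][j] is how much position i attends to position j.
    """
    d = len(x[0])
    weights = []
    outputs = []
    for q in x:
        # Matching score between this position and every position
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors
        outputs.append([sum(wj * v[t] for wj, v in zip(w, x)) for t in range(d)])
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` is computed independently of the others, which is what allows the computation to run in parallel across positions, unlike an RNN step that has to wait for the previous one.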

## Classifying News Headlines with the ERNIE Development Kit

### Dataset

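THUCNews was built by filtering historical data from the Sina News RSS subscription channels from 2005 to 2011; all files are UTF-8 plain text. Starting from the original Sina News category scheme, the data was reorganized into 14 candidate categories: 财经 (finance), 彩票 (lottery), 房产 (real estate), 股票 (stocks), 家居 (home), 教育 (education), 科技 (technology), 社会 (society), 时尚 (fashion), 时政 (politics), 体育 (sports), 星座 (horoscope), 游戏 (games), and 娱乐 (entertainment).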

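The dataset used in this example consists of news headlines sampled from THUCNews in fixed proportions per category. The training set contains about 271,000 headlines and the test set about 67,000. A label vocabulary file, `label_dict.txt`, maps each category to an integer id.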
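
For orientation, the loading code later in this walkthrough implies that each line of `train.txt` holds a label and a headline separated by a tab, and each line of `label_dict.txt` holds a label and its integer id separated by whitespace. The sketch below parses that assumed layout from in-memory strings instead of the real files. The headline and the ids are the ones shown in the notebook output further down; the three-line label file is only an excerpt.

```python
import io
from collections import Counter

# Assumed file layouts, inferred from the loading code in this walkthrough:
#   label_dict.txt: "<label> <id>" per line
#   train.txt:      "<label>\t<headline>" per line
label_file = io.StringIO("星座 0\n科技 1\n房产 2\n")
data_file = io.StringIO("星座\t爱情测试，你的爱情年老时是啥样\n")

# Build the label -> id mapping
label2id = {}
for line in label_file:
    name, idx = line.split()
    label2id[name] = int(idx)

# Parse each data line into a {'text', 'label'} example
examples = []
for line in data_file:
    label, text = line.rstrip("\n").split("\t", maxsplit=1)
    examples.append({"text": text, "label": label2id[label]})

print(examples[0])
print(Counter(ex["label"] for ex in examples))
```

Counting labels this way on the full training file is a quick check of how balanced the 14 categories are before training.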

In [3]:
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddle.io import Dataset

import paddlenlp
from paddlenlp.datasets import MapDataset
from paddlenlp.data import Stack, Tuple, Pad
from paddlenlp.transformers import LinearDecayWithWarmup

import numpy as np
from functools import partial

INFO 2023-01-11 17:54:31,038 utils.py:148] Note: NumExpr detected 10 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.


In [4]:
"""Build the dataset class"""

class NewsDataset(Dataset):
    def __init__(self, data_path, label_path):
        # Load the label dictionary (label name -> integer id)
        self.label2id = self._load_label_dict(label_path)

        self.label_list = list(self.label2id.keys())

        # Load the dataset
        self.data = self._load_data(data_path)
        
    def _load_data(self, data_path):
        dataset = []
        
        with open(data_path, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                # Each line is "<label>\t<headline>"
                label, text = line.split('\t', maxsplit=1)
                example = {'text': text, 'label': self.label2id[label]}
                dataset.append(example)
                
        return dataset
                
    def _load_label_dict(self, label_path):
        with open(label_path, 'r', encoding='utf-8') as f:
            lines = [line.strip().split() for line in f.readlines()]
            lines = [(line[0], int(line[1])) for line in lines]
            label_dict = dict(lines)
            
        return label_dict 
    
    def __getitem__(self, idx):
        return self.data[idx]
    
    def __len__(self):
        return len(self.data)

In [5]:
data_path = '../datasets/THUCNews/train.txt'
label_path = '../datasets/THUCNews/label_dict.txt'

news_dataset = NewsDataset(data_path, label_path)

print('Training set size:', len(news_dataset))
print('Example format:', news_dataset[0])

print('Label dictionary:', news_dataset.label2id)
print('Label list:', news_dataset.label_list)

Training set size: 271167
Example format: {'text': '爱情测试，你的爱情年老时是啥样', 'label': 0}
Label dictionary: {'星座': 0, '科技': 1, '房产': 2, '股票': 3, '彩票': 4, '时尚': 5, '教育': 6, '体育': 7, '娱乐': 8, '家居': 9, '时政': 10, '社会': 11, '财经': 12, '游戏': 13}
Label list: ['星座', '科技', '房产', '股票', '彩票', '时尚', '教育', '体育', '娱乐', '家居', '时政', '社会', '财经', '游戏']


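In [6]:
"""Data format conversion

Convert each example into the input format ERNIE expects: token ids, segment ids (token type ids),
and position ids (generated inside the model).
The tokenizer packaged with PaddleNLP performs the conversion.
"""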

def convert_example(example, tokenizer, max_seq_length=128, is_test=False):
    encoded_inputs = tokenizer(text=example['text'], max_seq_length=max_seq_length)
    input_ids = encoded_inputs['input_ids']
    token_type_ids = encoded_inputs['token_type_ids']
    
    if not is_test:
        label = np.array([example['label']], dtype='int64')
        return input_ids, token_type_ids, label
    else:
        return input_ids, token_type_ids
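
To make the three inputs concrete, here is a toy sketch of what a BERT-style tokenizer such as ERNIE's produces for a single sentence. It assumes character-level tokens, a tiny made-up vocabulary, and the usual `[CLS] ... [SEP]` layout. `toy_encode` and `TOY_VOCAB` are hypothetical names, and the real `ErnieTokenizer` uses its own vocabulary and ids.

```python
# Hypothetical toy vocabulary; real ids come from the pretrained vocab file.
TOY_VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3,
             "爱": 4, "情": 5, "测": 6, "试": 7}

def toy_encode(text, max_seq_length=128):
    """Encode one sentence as [CLS] tokens [SEP], truncating to max_seq_length."""
    tokens = list(text)[: max_seq_length - 2]  # leave room for [CLS] and [SEP]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    input_ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    token_type_ids = [0] * len(input_ids)        # a single sentence is all segment 0
    position_ids = list(range(len(input_ids)))   # normally generated inside the model
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "position_ids": position_ids,
    }

encoded = toy_encode("爱情测试", max_seq_length=5)
print(encoded)
```

With `max_seq_length=5` the fourth character is dropped so that the two special tokens still fit, which mirrors how the length limit applies to the whole encoded sequence and not only to the raw text.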

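In [7]:
"""Build the data loader

Build a DataLoader that assembles the data into regular mini-batches for the model. The flow is:
1. Convert each example to the expected format with the conversion function, then build the DataLoader.
2. While the DataLoader produces each mini-batch, batchify_fn pads the text sequences to a common length,
   stacks the labels, and so on, so that the returned data can be fed to the model.
"""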

def create_dataloader(dataset, mode='train', batch_size=1, batchify_fn=None, trans_fn=None):
    # Apply trans_fn to convert each example into the model's input format
    if trans_fn:
        dataset = dataset.map(trans_fn)

    shuffle = mode == 'train'
    
    if mode == 'train':
        batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle)
        
    return paddle.io.DataLoader(
        dataset=dataset,
        batch_sampler=batch_sampler,
        collate_fn=batchify_fn,
        return_list=True
    )

In [8]:
# Pretrained model to use
MODEL_NAME = 'ernie-1.0'
# Use the tokenizer packaged with PaddleNLP for the data format conversion
tokenizer = paddlenlp.transformers.ErnieTokenizer.from_pretrained(MODEL_NAME)

[2023-01-11 17:54:49,812] [    INFO] - Already cached /Users/neowong/.paddlenlp/models/ernie-1.0/vocab.txt
[2023-01-11 17:54:49,831] [    INFO] - tokenizer config file saved in /Users/neowong/.paddlenlp/models/ernie-1.0/tokenizer_config.json
[2023-01-11 17:54:49,832] [    INFO] - Special tokens file saved in /Users/neowong/.paddlenlp/models/ernie-1.0/special_tokens_map.json


In [9]:
# Pad the variable-length examples in a mini-batch into the form the model expects
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment ids
    Stack(dtype='int64')  # labels
): list(fn(samples))

In [12]:
from paddlenlp.data import Stack, Tuple, Pad

# Stack several samples into one batch
a = [1, 2, 3, 4]
b = [3, 4, 5, 6]
c = [5, 6, 7, 8]
result = Stack()([a, b, c])
print('Stacked Data: \n', result, '\n')

# Pad samples to the same length
a = [1, 2, 3, 4]
b = [5, 6, 7]
c = [8, 9]
result = Pad(pad_val=0)([a, b, c])
print('Padded Data: \n', result, '\n')

# Apply one function per field: pad the features, stack the labels.
# A separate name is used so the batchify_fn defined for training is not overwritten.
data = [
    [[1, 2, 3, 4], [1]],
    [[5, 6, 7], [0]],
    [[8, 9], [1]]
]
demo_batchify_fn = Tuple(Pad(pad_val=0), Stack())
ids, labels = demo_batchify_fn(data)
print('Ids: \n', ids, '\n')
print('Labels: \n', labels, '\n')

Stacked Data: 
 [[1 2 3 4]
 [3 4 5 6]
 [5 6 7 8]] 

Padded Data: 
 [[1 2 3 4]
 [5 6 7 0]
 [8 9 0 0]] 

Ids: 
 [[1 2 3 4]
 [5 6 7 0]
 [8 9 0 0]] 

Labels: 
 [[1]
 [0]
 [1]] 



In [10]:
"""Model construction

Load the pretrained ERNIE model directly through PaddleNLP,
then define a linear layer for text classification, with Dropout as regularization.
"""

class ErnieForSequenceClassification(paddle.nn.Layer):
    def __init__(self, MODEL_NAME, num_class=14, dropout=None):
        super(ErnieForSequenceClassification, self).__init__()
        
        # Load the pretrained model by name
        self.ernie = paddlenlp.transformers.ErnieModel.from_pretrained(MODEL_NAME)
        self.dropout = nn.Dropout(dropout if dropout is not None else self.ernie.config['hidden_dropout_prob'])
        self.classifier = nn.Linear(self.ernie.config['hidden_size'], num_class)
        
    def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None):
        _, pooled_output = self.ernie(
            input_ids,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            attention_mask=attention_mask
        )
        
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        return logits
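
The classifier returns raw logits, one per class, and the training loop later turns them into a loss with `F.cross_entropy`. As a rough, framework-free sketch of that computation (assuming a single example and an integer label; `cross_entropy` here is a hypothetical helper, not Paddle's implementation):

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# An untrained 14-class model that scores every class equally
uniform_loss = cross_entropy([0.0] * 14, label=3)
print(round(uniform_loss, 4))  # equals ln(14)

# A confident, correct prediction gives a loss near zero
confident = [0.0] * 14
confident[3] = 10.0
print(cross_entropy(confident, label=3) < 0.01)
```

With 14 classes, a model that guesses uniformly has a loss of ln(14), roughly 2.64, which is a useful reference point when reading the first few lines of a training log.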

In [11]:
"""Training configuration"""

# Hyperparameters
n_epochs = 1
batch_size = 128
max_seq_length = 128
n_classes = 14
dropout_rate = None

learning_rate = 5e-5
warmup_proportion = 0.1
weight_decay = 0.01

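# A lightweight ERNIE variant with faster training and inference.
# Note: the tokenizer above was loaded from 'ernie-1.0'. A tokenizer and a model are normally
# loaded under the same pretrained name so that their vocabularies match.
MODEL_NAME = 'ernie-tiny'

# Load the datasets and build the DataLoaders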
train_set = NewsDataset('../datasets/THUCNews/train.txt', '../datasets/THUCNews/label_dict.txt')
test_set = NewsDataset('../datasets/THUCNews/test.txt', '../datasets/THUCNews/label_dict.txt')

label2id = train_set.label2id
train_set = MapDataset(train_set)
test_set = MapDataset(test_set)

trans_fn = partial(convert_example, tokenizer=tokenizer, max_seq_length=max_seq_length)
train_data_loader = create_dataloader(train_set, mode='train', batch_size=batch_size, 
                                      batchify_fn=batchify_fn, trans_fn=trans_fn)
test_data_loader = create_dataloader(test_set, mode='test', batch_size=batch_size,
                                     batchify_fn=batchify_fn, trans_fn=trans_fn)

# Use the GPU for training when one is available
use_gpu = paddle.get_device().startswith('gpu')
if use_gpu:
    paddle.set_device('gpu:0')
    
# Load the model
model = ErnieForSequenceClassification(MODEL_NAME, num_class=n_classes, dropout=dropout_rate)

# Set up the optimizer
num_training_steps = len(train_data_loader) * n_epochs
lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)
# Apply weight decay to every parameter except biases and normalization weights
decay_params = [
    p.name for n, p in model.named_parameters()
    if not any(nd in n for nd in ["bias", "norm"])
]
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    apply_decay_param_fun=lambda x: x in decay_params)

[2023-01-11 17:55:04,675] [    INFO] - Already cached /Users/neowong/.paddlenlp/models/ernie-tiny/ernie_tiny.pdparams


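In [23]:
"""Model training and evaluation

During training, the model parameters are updated by backpropagating the loss.
After training, the evaluation metric can be used to select a model that trained well for inference.
"""

# Define the evaluation metric
metric = paddle.metric.Accuracy()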

@paddle.no_grad()
def evaluate(model, metric, data_loader):
    model.eval()

    # Reset the metric's accumulated state before each evaluation on the test set
    metric.reset()
    losses = []

    for batch in data_loader:
        # Unpack the batch
        input_ids, segment_ids, labels = batch

        # Forward pass
        logits = model(input_ids, segment_ids)

        # Compute the loss
        loss = F.cross_entropy(input=logits, label=labels)
        loss = paddle.mean(loss)
        losses.append(loss.numpy())

        # Update the accuracy metric
        correct = metric.compute(logits, labels)
        metric.update(correct)

    # Read the accumulated accuracy once, after all batches
    accu = metric.accumulate()
    print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), accu))
    metric.reset()
    
    
def train(model):
    global_step = 0
    for epoch in range(1, n_epochs + 1):
        model.train()
        for step, batch in enumerate(train_data_loader, start=1):
            # Unpack the batch
            input_ids, segment_ids, labels = batch
            # Forward pass
            logits = model(input_ids, segment_ids)
            loss = F.cross_entropy(input=logits, label=labels)
            loss = paddle.mean(loss)

            # Update the running accuracy
            probs = F.softmax(logits, axis=1)
            correct = metric.compute(probs, labels)
            metric.update(correct)
            acc = metric.accumulate()

            # Print intermediate training results
            global_step += 1
            if global_step % 10 == 0:
                print("global step %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f" %
                      (global_step, epoch, step, float(loss), acc))

            # Backpropagate and update the parameters
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.clear_grad()

        # Evaluate on the test set at the end of each epoch
        evaluate(model, metric, test_data_loader)
        
        
train(model)

global step 10, epoch: 1, batch: 10, loss: 2.98997, acc: 0.05859
global step 20, epoch: 1, batch: 20, loss: 2.87861, acc: 0.06094
global step 30, epoch: 1, batch: 30, loss: 2.62139, acc: 0.06979
global step 40, epoch: 1, batch: 40, loss: 2.55458, acc: 0.08398
global step 50, epoch: 1, batch: 50, loss: 2.52601, acc: 0.09031
global step 60, epoch: 1, batch: 60, loss: 2.40739, acc: 0.10117
global step 70, epoch: 1, batch: 70, loss: 2.38055, acc: 0.11161
global step 80, epoch: 1, batch: 80, loss: 2.50773, acc: 0.12295
global step 90, epoch: 1, batch: 90, loss: 2.19281, acc: 0.13984
global step 100, epoch: 1, batch: 100, loss: 2.00311, acc: 0.15906
global step 110, epoch: 1, batch: 110, loss: 2.08255, acc: 0.17564
global step 120, epoch: 1, batch: 120, loss: 1.67801, acc: 0.19792
global step 130, epoch: 1, batch: 130, loss: 1.60479, acc: 0.21689
global step 140, epoch: 1, batch: 140, loss: 1.63914, acc: 0.23588
global step 150, epoch: 1, batch: 150, loss: 1.51181, acc: 0.25542
...

global step 1230, epoch: 1, batch: 1230, loss: 0.53150, acc: 0.71936
global step 1240, epoch: 1, batch: 1240, loss: 0.63918, acc: 0.72029
global step 1250, epoch: 1, batch: 1250, loss: 0.46274, acc: 0.72126
global step 1260, epoch: 1, batch: 1260, loss: 0.54385, acc: 0.72217
global step 1270, epoch: 1, batch: 1270, loss: 0.50451, acc: 0.72320
global step 1280, epoch: 1, batch: 1280, loss: 0.52351, acc: 0.72422
global step 1290, epoch: 1, batch: 1290, loss: 0.40506, acc: 0.72523
global step 1300, epoch: 1, batch: 1300, loss: 0.40808, acc: 0.72604
global step 1310, epoch: 1, batch: 1310, loss: 0.67256, acc: 0.72694
global step 1320, epoch: 1, batch: 1320, loss: 0.58643, acc: 0.72777
global step 1330, epoch: 1, batch: 1330, loss: 0.60177, acc: 0.72874
global step 1340, epoch: 1, batch: 1340, loss: 0.57780, acc: 0.72964
global step 1350, epoch: 1, batch: 1350, loss: 0.62892, acc: 0.73046
global step 1360, epoch: 1, batch: 1360, loss: 0.51418, acc: 0.73134
global step 1370, epoch: 1, batch:
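
The learning-rate schedule configured earlier (`LinearDecayWithWarmup` with `warmup_proportion = 0.1`) raises the rate linearly during the first 10% of steps and then decays it linearly toward zero. The function below is a plain-Python approximation of that shape, written as an assumption about the scheduler's behaviour and not copied from PaddleNLP. `linear_decay_with_warmup` is a hypothetical name and the step count is an example value.

```python
def linear_decay_with_warmup(step, peak_lr, total_steps, warmup_proportion):
    """Learning rate at a given step: linear warmup to peak_lr, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_proportion)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

total_steps = 2000  # example value, not the actual step count of the run above
for step in (0, 100, 200, 1100, 2000):
    print(step, linear_decay_with_warmup(step, 5e-5, total_steps, 0.1))
```

A small rate in the first steps is one plausible reason the loss falls slowly at the start of the log above and faster once the warmup phase has passed.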

In [12]:
"""Saving the model

After training, save the model and optimizer parameters to disk so they can be used for inference
or to resume training. The tokenizer can be saved as well for later use.
"""

model_name = "ernie_for_sequence_classification"

paddle.save(model.state_dict(), "{}.pdparams".format(model_name))
paddle.save(optimizer.state_dict(), "{}.optparams".format(model_name))
tokenizer.save_pretrained('./tokenizer')

[2023-01-11 17:55:18,607] [    INFO] - tokenizer config file saved in ./tokenizer/tokenizer_config.json
[2023-01-11 17:55:18,608] [    INFO] - Special tokens file saved in ./tokenizer/special_tokens_map.json


('./tokenizer/tokenizer_config.json',
 './tokenizer/special_tokens_map.json',
 './tokenizer/added_tokens.json')

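In [19]:
"""Model prediction

Before predicting, the raw text still has to be tokenized and split into batches.
The trained model then computes a probability distribution over the categories for each headline,
and the ID with the highest probability is converted to its category label.
"""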

def predict(data, id2label, batch_size=1):
    examples = []
    # Preprocess: each item in data is a dict with a 'text' field
    for example in data:
        input_ids, segment_ids = convert_example(
            example,
            tokenizer,
            max_seq_length=128,
            is_test=True)
        examples.append((input_ids, segment_ids))

    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input ids
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment ids
    ): fn(samples)

    # Split the examples into batches of batch_size
    batches = [examples[i:i + batch_size]
               for i in range(0, len(examples), batch_size)]

    # Run the model on each batch and collect the predicted labels
    results = []
    model.eval()
    with paddle.no_grad():
        for batch in batches:
            input_ids, segment_ids = batchify_fn(batch)
            input_ids = paddle.to_tensor(input_ids)
            segment_ids = paddle.to_tensor(segment_ids)
            logits = model(input_ids, segment_ids)
            probs = F.softmax(logits, axis=1)
            idx = paddle.argmax(probs, axis=1).numpy().tolist()
            results.extend(id2label[i] for i in idx)
    return results

data = [{"text":"重磅数据公布！2022年存款增超26万亿，超额储蓄待释放！"}]

id2label = {idx: label for label, idx in label2id.items()}
results = predict(data, id2label)
print(results)

['财经']
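
The last step of `predict` (softmax over the logits, then argmax, then an id-to-label lookup) can be illustrated without the model. In the sketch below the logits are invented for illustration, and the three-entry `id2label` is a subset of the label dictionary printed earlier.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Subset of the label dictionary shown earlier (id -> label)
id2label = {0: "星座", 1: "科技", 12: "财经"}
class_ids = [0, 1, 12]

# Hypothetical logits for one headline, aligned with class_ids
logits = [-1.2, 0.3, 2.5]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # argmax over the probabilities
print(id2label[class_ids[best]], round(probs[best], 3))
```

Because softmax preserves ordering, taking the argmax of the logits directly yields the same label. The softmax step matters only when the probabilities themselves are reported.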
