<a href="https://colab.research.google.com/github/Rosefinch-Midsummer/Awesome-Colab/blob/master/NLP/fastNLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.](https://github.com/fastnlp/fastNLP)

[Docs](https://fastnlp.readthedocs.io/zh/latest/index.html)

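fastNLP is a lightweight natural language processing (NLP) toolkit that aims to make it quick to implement NLP tasks and build complex models.

fastNLP offers the following features:

- A unified tabular data container that simplifies data preprocessing;
- Built-in Loaders and Pipes for many datasets, which removes the need to write preprocessing code;
- Convenient NLP utilities, such as embedding loading (including ELMo and BERT) and caching of intermediate data;
- Automatic download of some datasets and pretrained models;
- A range of neural network components and reproduced models (covering Chinese word segmentation, named entity recognition, syntactic parsing, text classification, text matching, coreference resolution, summarization, and other tasks);
- A Trainer with many built-in Callbacks for experiment logging, exception handling, and more.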

In [2]:
!pip install fastNLP
!python -m spacy download en

Collecting fastNLP
  Downloading https://files.pythonhosted.org/packages/de/97/56d84b45c6f416943ba21d9516d98649328ae1afb82ede55bdbe53ba60cb/FastNLP-0.5.0-py3-none-any.whl (270kB)

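# Text classification

Text classification assigns a sentence or passage to a specific category, for example spam detection or sentiment classification.

Steps

The task consists of the following steps:

1. Load the data

2. Preprocess the data

3. Choose pretrained word embeddings

4. Build the model

5. Train the model

(1) Load the data

fastNLP can automatically download and load many datasets. For the data used here, a Loader downloads and loads it automatically. For more on using Loaders, see the loader documentation.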

In [3]:
from fastNLP.io import ChnSentiCorpLoader

loader = ChnSentiCorpLoader()        # initialize a Loader for Chinese sentiment classification
data_dir = loader.download()         # download the data to the default cache directory and return that path
data_bundle = loader.load(data_dir)  # read the data from data_dir into a DataBundle

print(data_bundle)

print(data_bundle.get_dataset('train')[:2])  # show the first two samples of the train set

http://212.129.155.247/dataset/chn_senti_corp.zip not found in cache, downloading to /tmp/tmpf51hdew7
Finish download from http://212.129.155.247/dataset/chn_senti_corp.zip
Copy file to /root/.fastNLP/dataset/chn_senti_corp
In total 3 datasets:
	dev has 1200 instances.
	train has 9600 instances.
	test has 1200 instances.

+-------------------------------------------+--------+
| raw_chars                                 | target |
+-------------------------------------------+--------+
| 选择珠江花园的原因就是方便，有电动扶梯... | 1      |
| 15.4寸笔记本的键盘确实爽，基本跟台式机... | 1      |
+-------------------------------------------+--------+
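
The log above suggests that the archive name in the download URL determines the cache sub-directory. A minimal sketch of that mapping, inferred only from the log lines shown here; the helper `cache_dir_for` and its `kind` and `root` parameters are hypothetical and are not part of fastNLP:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def cache_dir_for(url, kind="dataset", root="/root/.fastNLP"):
    """Hypothetical helper: guess the cache directory for a download URL.

    Mirrors the pattern in the log, where chn_senti_corp.zip ends up in
    /root/.fastNLP/dataset/chn_senti_corp. Not a fastNLP function.
    """
    archive_name = PurePosixPath(urlparse(url).path).name  # e.g. chn_senti_corp.zip
    stem = PurePosixPath(archive_name).stem                # drop the .zip suffix
    return str(PurePosixPath(root) / kind / stem)


print(cache_dir_for("http://212.129.155.247/dataset/chn_senti_corp.zip"))
# /root/.fastNLP/dataset/chn_senti_corp
```

In practice the path returned by `loader.download()` is the one to rely on; the sketch only illustrates the naming pattern visible in the output.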


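(2) Preprocess the data

In NLP tasks, preprocessing usually includes:

- splitting a whole sentence into characters or words;
- converting the text into indices.

fastNLP also provides processing classes for many datasets; here we use fastNLP's ChnSentiCorpPipe directly. For more on Pipes, see the pipe documentation.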
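
The two preprocessing steps can be sketched in plain Python. This is only an illustration of the idea, not how ChnSentiCorpPipe is implemented; the function names, the `<pad>`/`<unk>` tokens, and the toy English strings are assumptions made for the example:

```python
def build_vocab(texts, padding="<pad>", unknown="<unk>"):
    """Map every character seen in `texts` to an integer index."""
    char_to_index = {padding: 0, unknown: 1}
    for text in texts:
        for char in text:
            # setdefault only assigns a new index the first time a character appears
            char_to_index.setdefault(char, len(char_to_index))
    return char_to_index


def encode(text, char_to_index, unknown="<unk>"):
    """Split `text` into characters and look up each character's index."""
    unk = char_to_index[unknown]
    return [char_to_index.get(char, unk) for char in text]


train_texts = ["fast", "nlp"]     # toy corpus
vocab = build_vocab(train_texts)
chars = encode("flat", vocab)     # plays the role of the chars field
print(chars, len(chars))          # the length plays the role of seq_len
# [2, 7, 3, 5] 4
```

Characters that never appeared in the training texts fall back to the `<unk>` index, which is why a vocabulary is normally built from the training split only.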

In [4]:
from fastNLP.io import ChnSentiCorpPipe

pipe = ChnSentiCorpPipe()
data_bundle = pipe.process(data_bundle)  # every Pipe implements process(); its input and output are both DataBundle objects

print(data_bundle)  # print data_bundle to see what changed

print(data_bundle.get_dataset('train')[:2])

In total 3 datasets:
	dev has 1200 instances.
	train has 9600 instances.
	test has 1200 instances.
In total 2 vocabs:
	chars has 4409 entries.
	target has 2 entries.

+-------------------------+--------+------------------------+---------+
| raw_chars               | target | chars                  | seq_len |
+-------------------------+--------+------------------------+---------+
| 选择珠江花园的原因就... | 0      | [338, 464, 1400, 78... | 106     |
| 15.4寸笔记本的键盘确... | 0      | [50, 133, 20, 135, ... | 56      |
+-------------------------+--------+------------------------+---------+


In [5]:
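# A new chars column (a list of integers) has been added, and the target column has been converted to integers.
# The names of these two columns match the names of the two Vocabulary objects in data_bundle.
# Print a Vocabulary to inspect its contents.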
char_vocab = data_bundle.get_vocab('chars')
print(char_vocab)

Vocabulary(['选', '择', '珠', '江', '花']...)


In [6]:
# A Vocabulary records the mapping between words and indices, for example:
index = char_vocab.to_index('选')
print("The index of '选' is {}".format(index))  # matches the first index in chars of the first instance printed above
print("Index {} maps to the character {}".format(index, char_vocab.to_word(index)))

The index of '选' is 338
Index 338 maps to the character 选


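(3) Choose pretrained word embeddings

Pretrained models such as word2vec, GloVe, ELMo, and BERT can improve model performance, so it matters to choose suitable pretrained word embeddings before training on a specific task. fastNLP provides several Embedding classes that make loading these pretrained models easier. Here we first show an example that uses word2vec vectors pretrained on Chinese characters, and then a text classification example that uses BERT. The pretrained embedding used here is 'cn-char-fastnlp-100d'; fastNLP downloads it automatically to the local cache. For the embeddings that fastNLP can load by name, and related notes, see fastNLP.embeddings.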

In [7]:
from fastNLP.embeddings import StaticEmbedding

word2vec_embed = StaticEmbedding(char_vocab, model_dir_or_name='cn-char-fastnlp-100d')

http://212.129.155.247/embedding/cn_char_fastnlp_100d.zip not found in cache, downloading to /tmp/tmp_abvygvx
Finish download from http://212.129.155.247/embedding/cn_char_fastnlp_100d.zip
Copy file to /root/.fastNLP/embedding/cn_char_fastnlp_100d
Found 4321 out of 4409 words in the pre-training embedding.
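
The last log line reports how many vocabulary entries were found in the pretrained file. A rough sketch of what such a lookup involves, assuming that entries missing from the pretrained vectors receive small random values; this is an illustration and not StaticEmbedding's actual code, and `init_embedding` and the toy vectors are made up for the example:

```python
import random


def init_embedding(vocab, pretrained, dim, seed=0):
    """Build one vector per vocabulary entry.

    Words found in `pretrained` copy their vector; the rest get small random
    values. Returns the embedding table and the number of words found.
    """
    rng = random.Random(seed)
    table, found = [], 0
    for word in vocab:
        if word in pretrained:
            table.append(list(pretrained[word]))
            found += 1
        else:
            table.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return table, found


vocab = ["选", "择", "珠", "<unk>"]
pretrained = {"选": [0.1, 0.2], "择": [0.3, 0.4], "珠": [0.5, 0.6]}  # toy 2-d vectors
table, found = init_embedding(vocab, pretrained, dim=2)
print("Found {} out of {} words in the pre-training embedding.".format(found, len(vocab)))
# Found 3 out of 4 words in the pre-training embedding.
```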


(4) Build the model

In [0]:
from torch import nn
from fastNLP.modules import LSTM
import torch

# Define the model
class BiLSTMMaxPoolCls(nn.Module):
    def __init__(self, embed, num_classes, hidden_size=400, num_layers=1, dropout=0.3):
        super().__init__()
        self.embed = embed

        self.lstm = LSTM(self.embed.embedding_dim, hidden_size=hidden_size//2, num_layers=num_layers,
                         batch_first=True, bidirectional=True)
        self.dropout_layer = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_size, num_classes)

    # Argument names must match the field names in the DataSet: the DataSet has a chars field, so the argument is named chars
    def forward(self, chars, seq_len):
        # chars: [batch_size, max_len]
        # seq_len: [batch_size, ]
        chars = self.embed(chars)
        outputs, _ = self.lstm(chars, seq_len)
        outputs = self.dropout_layer(outputs)
        outputs, _ = torch.max(outputs, dim=1)
        outputs = self.fc(outputs)

        # [batch_size, num_classes]; the return value must be a dict, and 'pred' is the recommended key for predictions
        return {'pred': outputs}

# Initialize the model
model = BiLSTMMaxPoolCls(word2vec_embed, len(data_bundle.get_vocab('target')))
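
The pooling step in `forward` reduces the LSTM outputs from [batch_size, max_len, hidden_size] to [batch_size, hidden_size] by taking the maximum of each feature over time. The sketch below shows the same reduction on nested lists. Note one deliberate difference: the model above calls `torch.max(outputs, dim=1)` over every position, padding included, whereas this sketch also uses `seq_len` to skip padded steps, which is a variation added here for illustration:

```python
def max_pool_over_time(outputs, seq_len):
    """Take the per-feature maximum over the valid time steps of each sequence.

    outputs: [batch_size][max_len][hidden_size] nested lists
    seq_len: [batch_size] true lengths, used to skip padded steps
    """
    pooled = []
    for sequence, length in zip(outputs, seq_len):
        valid_steps = sequence[:length]
        # zip(*valid_steps) groups the values of each feature across time
        pooled.append([max(feature) for feature in zip(*valid_steps)])
    return pooled


outputs = [
    [[0.1, -0.5], [0.7, -0.2], [0.0, 0.0]],   # length 2, last step is padding
    [[-0.3, 0.4], [-0.1, 0.9], [-0.6, 0.2]],  # length 3
]
print(max_pool_over_time(outputs, seq_len=[2, 3]))
# [[0.7, -0.2], [-0.1, 0.9]]
```

Without the length mask, the second feature of the first sequence would pool to 0.0 (the padding value) instead of -0.2, which is the kind of effect masking avoids when all real activations are negative.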

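(5) Train the model

fastNLP provides the Trainer object to organize training. It computes the loss (so a loss type must be specified when initializing the Trainer), applies gradient updates (so an optimizer must be provided), and evaluates performance on the validation set (so a Metric must be provided).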
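
To make the loss and metric concrete, here is a plain-Python sketch of softmax cross-entropy and accuracy computed from class scores. It illustrates the quantities only and is not the implementation of fastNLP's CrossEntropyLoss or AccuracyMetric; the function names and the toy scores are assumptions:

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target]


def accuracy(batch_logits, targets):
    """Fraction of examples whose highest-scoring class equals the target."""
    correct = sum(
        1 for logits, target in zip(batch_logits, targets)
        if logits.index(max(logits)) == target
    )
    return correct / len(targets)


batch_logits = [[2.0, 0.5], [0.1, 1.2], [0.3, 0.2]]  # toy 'pred' scores for 2 classes
targets = [0, 1, 1]
mean_loss = sum(cross_entropy(l, t) for l, t in zip(batch_logits, targets)) / len(targets)
print(round(mean_loss, 4), accuracy(batch_logits, targets))
```

The optimizer's job is to lower the mean loss on the training batches, while the metric is what gets reported on the dev set after each epoch.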

In [9]:
from fastNLP import Trainer
from fastNLP import CrossEntropyLoss
from torch.optim import Adam
from fastNLP import AccuracyMetric

loss = CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=0.001)
metric = AccuracyMetric()
device = 0 if torch.cuda.is_available() else 'cpu'  # training is faster on a GPU if one is available

trainer = Trainer(train_data=data_bundle.get_dataset('train'), model=model, loss=loss,
                  optimizer=optimizer, batch_size=32, dev_data=data_bundle.get_dataset('dev'),
                  metrics=metric, device=device)
trainer.train()  # start training; by default the model that performed best on dev is reloaded afterwards

# Evaluate the model on the test set
from fastNLP import Tester
print("Performance on test is:")
tester = Tester(data=data_bundle.get_dataset('test'), model=model, metrics=metric, batch_size=64, device=device)
tester.test()

input fields after batch(if batch size is 2):
	target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2]) 
	chars: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 106]) 
	seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2]) 
target fields after batch(if batch size is 2):
	target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2]) 

training epochs started 2020-01-03-04-34-52


Evaluate data in 0.62 seconds!
Evaluation on dev at Epoch 1/10. Step:300/3000: 
AccuracyMetric: acc=0.825



Evaluate data in 0.64 seconds!
Evaluation on dev at Epoch 2/10. Step:600/3000: 
AccuracyMetric: acc=0.863333



Evaluate data in 0.61 seconds!
Evaluation on dev at Epoch 3/10. Step:900/3000: 
AccuracyMetric: acc=0.889167



Evaluate data in 0.62 seconds!
Evaluation on dev at Epoch 4/10. Step:1200/3000: 
AccuracyMetric: acc=0.895



Evaluate data in 0.62 seconds!
Evaluation on dev at Epoch 5/10. Step:1500/3000: 
AccuracyMetric: acc=0.890833



Evaluate data in 0.62 seconds!
Evaluation on dev at Epoch 6/10. Step:1800/3000: 
AccuracyMetric: acc=0.890833



Evaluate data in 0.62 seconds!
Evaluation on dev at Epoch 7/10. Step:2100/3000: 
AccuracyMetric: acc=0.915



Evaluate data in 0.62 seconds!
Evaluation on dev at Epoch 8/10. Step:2400/3000: 
AccuracyMetric: acc=0.905



Evaluate data in 0.64 seconds!
Evaluation on dev at Epoch 9/10. Step:2700/3000: 
AccuracyMetric: acc=0.910833



Evaluate data in 0.61 seconds!
Evaluation on dev at Epoch 10/10. Step:3000/3000: 
AccuracyMetric: acc=0.903333


In Epoch:7/Step:2100, got best dev performance:
AccuracyMetric: acc=0.915
Reloaded the best model.
Performance on test is:


Evaluate data in 0.54 seconds!
[tester] 
AccuracyMetric: acc=0.8875


{'AccuracyMetric': {'acc': 0.8875}}
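
The "Reloaded the best model" line means the checkpoint kept for testing is the one with the highest dev accuracy, not the last epoch. A small sketch of that selection, using the dev accuracies copied from the log above; `best_epoch` is a hypothetical helper, not a fastNLP function:

```python
# Dev accuracies copied from the training log above, epochs 1 to 10.
dev_acc = [0.825, 0.863333, 0.889167, 0.895, 0.890833,
           0.890833, 0.915, 0.905, 0.910833, 0.903333]


def best_epoch(scores):
    """Return (1-based epoch, score) of the highest score; ties keep the earliest."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return best_index + 1, scores[best_index]


epoch, acc = best_epoch(dev_acc)
print("best dev accuracy {} at epoch {}".format(acc, epoch))
# best dev accuracy 0.915 at epoch 7
```

The test accuracy of 0.8875 is lower than the best dev accuracy of 0.915, which is expected because the checkpoint was chosen to maximize the dev score.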

Text classification with BERT

In [10]:
# Only the Embedding needs to be swapped
from fastNLP.embeddings import BertEmbedding

# For this demonstration, BERT's weights are frozen (requires_grad=False)
bert_embed = BertEmbedding(char_vocab, model_dir_or_name='cn', auto_truncate=True, requires_grad=False)
model = BiLSTMMaxPoolCls(bert_embed, len(data_bundle.get_vocab('target')))


import torch
from fastNLP import Trainer
from fastNLP import CrossEntropyLoss
from torch.optim import Adam
from fastNLP import AccuracyMetric

loss = CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=2e-5)
metric = AccuracyMetric()
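device = 0 if torch.cuda.is_available() else 'cpu'  # training is faster on a GPU if one is available

# Note: dev_data is the test split here, so the best checkpoint is selected on the same
# data the Tester evaluates below; pass get_dataset('dev') for a held-out test score.
trainer = Trainer(train_data=data_bundle.get_dataset('train'), model=model, loss=loss,
                  optimizer=optimizer, batch_size=16, dev_data=data_bundle.get_dataset('test'),
                  metrics=metric, device=device, n_epochs=3)
trainer.train()  # start training; by default the model that performed best on dev_data is reloaded afterwards

# Evaluate the model on the test set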
from fastNLP import Tester
print("Performance on test is:")
tester = Tester(data=data_bundle.get_dataset('test'), model=model, metrics=metric, batch_size=64, device=device)
tester.test()

http://212.129.155.247/embedding/bert-chinese-wwm.zip not found in cache, downloading to /tmp/tmpv1ujjwc6
Finish download from http://212.129.155.247/embedding/bert-chinese-wwm.zip
Copy file to /root/.fastNLP/embedding/bert-chinese-wwm
loading vocabulary file /root/.fastNLP/embedding/bert-chinese-wwm/vocab.txt
Load pre-trained BERT parameters from file /root/.fastNLP/embedding/bert-chinese-wwm/chinese_wwm_pytorch.bin.
Start to generate word pieces for word.
Found(Or segment into word pieces) 4286 words out of 4409.
input fields after batch(if batch size is 2):
	target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2]) 
	chars: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 106]) 
	seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2]) 
target fields after batch(if batch size is 2):
	target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2]) 

training epochs started 2020-01-03-04-38-33


Evaluate data in 32.21 seconds!
Evaluation on dev at Epoch 1/3. Step:600/1800: 
AccuracyMetric: acc=0.878333



Evaluate data in 32.2 seconds!
Evaluation on dev at Epoch 2/3. Step:1200/1800: 
AccuracyMetric: acc=0.898333



Evaluate data in 32.26 seconds!
Evaluation on dev at Epoch 3/3. Step:1800/1800: 
AccuracyMetric: acc=0.913333


In Epoch:3/Step:1800, got best dev performance:
AccuracyMetric: acc=0.913333
Reloaded the best model.
Performance on test is:


Evaluate data in 49.11 seconds!
[tester] 
AccuracyMetric: acc=0.913333


{'AccuracyMetric': {'acc': 0.913333}}