In [2]:
%matplotlib inline


Text Classification with TorchText
==================================

   这个例子主要示范了如何使用torchtext的文本分类的数据集，包含：


    
       - AG_NEWS,
       - SogouNews,
       - DBpedia,
       - YelpReviewPolarity,
       - YelpReviewFull,
       - YahooAnswers,
       - AmazonReviewPolarity,
       - AmazonReviewFull  
本例使用了AG_News数据集，训练了一个监督学习的算法做分类。

Load data with ngrams
---------------------

词袋特征被用来捕捉关于词序的一些局部特征。实际应用中，bi-gram或tri-gram作为词袋模型
的一种能比单个单词提供更多的信息，比如：



       "load data with ngrams"
       Bi-grams results: "load data", "data with", "with ngrams"
       Tri-grams results: "load data with", "data with ngrams"

``TextClassification``数据集提供了ngrams的方法。如果将ngrams设置为2，那么数据集里
的文本就会变成一列单个词加上bi-grams字符。




In [1]:
import torch
import torchtext
from torchtext.datasets import text_classification
NGRAMS = 2
import os
if not os.path.isdir('./.data'):
	os.mkdir('./.data')
#设置ngram
train_dataset, test_dataset = text_classification.DATASETS['AG_NEWS'](
    root='./.data', ngrams=NGRAMS, vocab=None)
BATCH_SIZE = 16
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

.data\ag_news_csv.tar.gz: 11.8MB [00:01, 10.9MB/s]
120000lines [00:07, 15845.83lines/s]
120000lines [00:13, 9218.16lines/s]
7600lines [00:00, 9527.08lines/s]


In [21]:
print(train_dataset[0])

(2, tensor([    572,     564,       2,    2326,   49106,     150,      88,       3,
           1143,      14,      32,      15,      32,      16,  443749,       4,
            572,     499,      17,      10,  741769,       7,  468770,       4,
             52,    7019,    1050,     442,       2,   14341,     673,  141447,
         326092,   55044,    7887,     411,    9870,  628642,      43,      44,
            144,     145,  299709,  443750,   51274,     703,   14312,      23,
        1111134,  741770,  411508,  468771,    3779,   86384,  135944,  371666,
           4052]))


Define the model
----------------

这个模型由[EmbeddingBag](https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html#torch.nn.EmbeddingBag)层和全连接层组成。Embeddingbag计算了每个bag词嵌入的均值，放入nn.EmbeddingBag的句子可以是不同长度的，不需要padding,因为文本的长度储存在offsets里面了。

除此之外，由于nn.EmbeddingBag计算了词袋的平均值，它可以提高处理一个张量序列的性能和内存效率。  



  
  
  

![](./model.png)


In [2]:
import torch.nn as nn
import torch.nn.functional as F
class TextSentiment(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        #EmbeddingBag先建立一个词嵌入的表，提取出bag里面每个词的词向量（这个的一个句子是一个bag，对应一个标签）
        #然后将句子里面所有词向量的词嵌入相加，得到一个词嵌入向量，大小为embed_dim
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)#（batch_size, embed_dim）
        return self.fc(embedded)

Initiate an instance
--------------------

AG_NEWS数据集有四个标签:

    1 : World
    2 : Sports
    3 : Business
    4 : Sci/Tec
vocab size 等于词汇量的大小（包括单个单词和ngrams），number of classes 是标签的数量4。




In [3]:
VOCAB_SIZE = len(train_dataset.get_vocab())
EMBED_DIM = 32
NUN_CLASS = len(train_dataset.get_labels())
model = TextSentiment(VOCAB_SIZE, EMBED_DIM, NUN_CLASS).to(device)

Functions used to generate batch
--------------------------------





由于text有不同的长度，使用generate_batch()来产生data batch和offsets。

这个函数传递给torch.utils.data.Dataloader里面的collate_fn。给collate_fn的输入是大小为batch_size的张量，通过collate_fn将他们打包成mini-batch。注意要确保collate_fn声明为top level的函数，使得他在每一步都是可调用的。

文本在原本的数据中被打包在列表中，并且concatenated成一个张量作为nn.EmbeddingBag的输入。offsets是一个定界张量，用来表示文本中每一个单独序列的开始引索。Label是用来保存每条文本标签的张量。




In [4]:
def generate_batch(batch):
    '''
    输入batch：每条输入的第一维是label，第二维是text
    如dataset里面第一条输入：[ [2, torch.tensor([572,564,2])] ]
    
    输出：text:(tensor([572, 564, 2]),将所有batch里面文本的index拼接成一个tensor
        offsets:tensor([0]),只有一条文本offsets就是0，第二条文本的offsets就是第一条的长度
        label:tensor([2])
    '''
    label = torch.tensor([entry[0] for entry in batch])
    text = [entry[1] for entry in batch]
    offsets = [0] + [len(entry) for entry in text]
    # torch.Tensor.cumsum returns the cumulative sum
    # of elements in the dimension dim.
    # torch.Tensor([1.0, 2.0, 3.0]).cumsum(dim=0)

    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text = torch.cat(text)
    return text, offsets, label

Define functions to train the model and evaluate results.
---------------------------------------------------------





torch.utils.data.DataLoader能够并行加载数据。



In [5]:
from torch.utils.data import DataLoader

def train_func(sub_train_):

    # Train the model
    train_loss = 0
    train_acc = 0
    data = DataLoader(sub_train_, batch_size=BATCH_SIZE, shuffle=True,
                      collate_fn=generate_batch)
    for i, (text, offsets, cls) in enumerate(data):
        optimizer.zero_grad()
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        output = model(text, offsets)
        loss = criterion(output, cls)
        train_loss += loss.item()
        loss.backward()
        optimizer.step()
        train_acc += (output.argmax(1) == cls).sum().item()

    # Adjust the learning rate
    scheduler.step()

    return train_loss / len(sub_train_), train_acc / len(sub_train_)

def test(data_):
    loss = 0
    acc = 0
    data = DataLoader(data_, batch_size=BATCH_SIZE, collate_fn=generate_batch)
    for text, offsets, cls in data:
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        with torch.no_grad():
            output = model(text, offsets)
            loss = criterion(output, cls)
            loss += loss.item()
            acc += (output.argmax(1) == cls).sum().item()

    return loss / len(data_), acc / len(data_)

Split the dataset and run the model
-----------------------------------

由于原始的数据集没有验证集，所以使用[torch.utils.data.dataset.random_split](https://pytorch.org/docs/stable/data.html?highlight=random_split#torch.utils.data.random_split)将训练集分为训练/验证比例0.95/0.05。

[nn.CrossEntropyLoss](https://pytorch.org/docs/stable/nn.html?highlight=crossentropyloss#torch.nn.CrossEntropyLoss)结合了nn.LogSoftmax()和nn.NLLLoss(),多被用来做多分类问题。

使用[SGD](https://pytorch.org/docs/stable/_modules/torch/optim/sgd.html)作为优化器，初始的学习率为4.0

[StepLR](https://pytorch.org/docs/master/_modules/torch/optim/lr_scheduler.html#StepLR)在每个epochs后面调整学习率





In [6]:
import time
from torch.utils.data.dataset import random_split
N_EPOCHS = 5
min_valid_loss = float('inf')

criterion = torch.nn.CrossEntropyLoss().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=4.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)

train_len = int(len(train_dataset) * 0.95)
sub_train_, sub_valid_ = \
    random_split(train_dataset, [train_len, len(train_dataset) - train_len])

for epoch in range(N_EPOCHS):

    start_time = time.time()
    train_loss, train_acc = train_func(sub_train_)
    valid_loss, valid_acc = test(sub_valid_)

    secs = int(time.time() - start_time)
    mins = secs / 60
    secs = secs % 60

    print('Epoch: %d' %(epoch + 1), " | time in %d minutes, %d seconds" %(mins, secs))
    print(f'\tLoss: {train_loss:.4f}(train)\t|\tAcc: {train_acc * 100:.1f}%(train)')
    print(f'\tLoss: {valid_loss:.4f}(valid)\t|\tAcc: {valid_acc * 100:.1f}%(valid)')

Epoch: 1  | time in 0 minutes, 9 seconds
	Loss: 0.0263(train)	|	Acc: 84.6%(train)
	Loss: 0.0001(valid)	|	Acc: 89.5%(valid)
Epoch: 2  | time in 0 minutes, 7 seconds
	Loss: 0.0117(train)	|	Acc: 93.8%(train)
	Loss: 0.0001(valid)	|	Acc: 90.4%(valid)
Epoch: 3  | time in 0 minutes, 7 seconds
	Loss: 0.0068(train)	|	Acc: 96.4%(train)
	Loss: 0.0002(valid)	|	Acc: 90.9%(valid)
Epoch: 4  | time in 0 minutes, 7 seconds
	Loss: 0.0037(train)	|	Acc: 98.2%(train)
	Loss: 0.0002(valid)	|	Acc: 90.8%(valid)
Epoch: 5  | time in 0 minutes, 7 seconds
	Loss: 0.0022(train)	|	Acc: 99.0%(train)
	Loss: 0.0002(valid)	|	Acc: 90.5%(valid)


Evaluate the model with test dataset
------------------------------------




In [7]:
print('Checking the results of test dataset...')
test_loss, test_acc = test(test_dataset)
print(f'\tLoss: {test_loss:.4f}(test)\t|\tAcc: {test_acc * 100:.1f}%(test)')

Checking the results of test dataset...
	Loss: 0.0003(test)	|	Acc: 89.2%(test)


Test on a random news
---------------------
`here <https://pytorch.org/text/datasets.html?highlight=ag_news#torchtext.datasets.AG_NEWS>`




In [33]:
import re
from torchtext.data.utils import ngrams_iterator
from torchtext.data.utils import get_tokenizer

ag_news_label = {1 : "World",
                 2 : "Sports",
                 3 : "Business",
                 4 : "Sci/Tec"}

def predict(text, model, vocab, ngrams):
    tokenizer = get_tokenizer("basic_english")
    with torch.no_grad():
        #将文本转化为token
        #其中vocab为训练集用到的vocab
        #如果文本为：‘i like my mom’
        #token：i:381,  like:432, my:1807, mom:17721, i like:189786, like my:394101, my mom:970653
        #最终转化为：[381,432,1807, 17721,189786,394101,970653]
        text = torch.tensor([vocab[token]
                            for token in ngrams_iterator(tokenizer(text), ngrams)])
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item() + 1

ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."
#str_ = 'i like my mom'
#print(len(str_))
vocab = train_dataset.get_vocab()
model = model.to("cpu")
#print(vocab['like my'])
print("This is a %s news" %ag_news_label[predict(ex_text_str, model, vocab, 2)])

This is a Sports news


You can find the code examples displayed in this note
`here <https://github.com/pytorch/text/tree/master/examples/text_classification>`__.


