<a href="https://colab.research.google.com/github/zhangguanheng66/tutorials/blob/sentiment_analysis/Torchtext_with_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%shell

rm -r /usr/local/lib/python3.6/dist-packages/torch*
#pip uninstall torch torchtext
#pip install --pre torch torchvision torchtext -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
pip install --pre torch torchvision torchtext -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html


Looking in links: https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html
Collecting torch
  Downloading https://download.pytorch.org/whl/nightly/cu101/torch-1.8.0.dev20201008%2Bcu101-cp36-cp36m-linux_x86_64.whl (737.4MB)
Collecting torchvision
  Downloading https://download.pytorch.org/whl/nightly/cu101/torchvision-0.8.0.dev20201008%2Bcu101-cp36-cp36m-linux_x86_64.whl (24.9MB)
Collecting torchtext
  Downloading https://download.pytorch.org/whl/nightly/torchtext-0.8.0.dev20201008-cp36-cp36m-linux_x86_64.whl (7.1MB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)



In [3]:
import torch
print(torch.__version__)
print(torch.__file__)
import torchtext
print(torchtext.__version__)
print(torchtext.__file__)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

1.8.0.dev20201008+cu101
/usr/local/lib/python3.6/dist-packages/torch/__init__.py
0.8.0.dev20201008
/usr/local/lib/python3.6/dist-packages/torchtext/__init__.py
cuda


# Prototype pipeline with the new torchtext library

In this tutorial, we show how to use the new torchtext library to build a dataset for text classification. The nightly release of torchtext provides a few prototype building blocks for data processing. With the new torchtext library, you have the flexibility to

*   Access the raw data as an iterator
*   Build a data processing pipeline that converts raw text strings into `torch.Tensor` objects that can be used to train the model
*   Shuffle and iterate over the data with `torch.utils.data.DataLoader`


## Step 1: Access the raw dataset iterators
----------------------------

Some advanced users prefer to work on the raw data strings with their own data processing pipelines. The new torchtext library provides a few raw dataset iterators, which yield the raw text strings. For example, the AG_NEWS dataset iterators yield each sample as a tuple of label and text.

In [4]:
from torchtext.experimental.datasets.raw import AG_NEWS
train_iter, = AG_NEWS(data_select=('train'))

train.csv: 29.5MB [00:00, 70.5MB/s]
test.csv: 1.86MB [00:00, 24.8MB/s]                  




```
next(iter(train_iter))
>>> (3, "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - 
Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green 
again.")

next(iter(train_iter))
>>> (3, 'Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private 
investment firm Carlyle Group,\\which has a reputation for making well-timed 
and occasionally\\controversial plays in the defense industry, has quietly 
placed\\its bets on another part of the market.')

next(iter(train_iter))
>>> (3, "Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring 
crude prices plus worries\\about the economy and the outlook for earnings are 
expected to\\hang over the stock market next week during the depth of 
the\\summer doldrums.")
```
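
The three calls above return three different samples, which suggests that the raw dataset keeps its position and is consumed as it is read. That would explain why later cells call `AG_NEWS` again, or wrap the iterator in `list(...)`, before reusing the data. The sketch below illustrates this single-pass behaviour with a hypothetical generator, `toy_raw_iter`, standing in for the real dataset; the sample rows are invented for illustration.

```python
def toy_raw_iter():
    """Hypothetical stand-in for a raw dataset iterator: yields (label, text)."""
    samples = [
        (3, "Stocks rally as oil prices fall"),
        (2, "Home team wins the championship"),
        (4, "New chip doubles battery life"),
    ]
    for label, text in samples:
        yield label, text


raw_iter = toy_raw_iter()
first = next(raw_iter)       # consumes the first sample
remaining = list(raw_iter)   # materializes whatever is left
leftover = list(raw_iter)    # the iterator is exhausted at this point
```

Materializing the samples into a list, as the later `DataLoader` cells do, gives a map-style collection that can be indexed and shuffled.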



## Step 2: Prepare data processing pipelines
----------------------------
We have revisited the basic components of the torchtext library, including vocab, word vectors, a tokenizer backed by regular expressions, and SentencePiece. These are the basic data processing building blocks for raw text strings.

### 2.1 Tokenizer-vocabulary data processing pipeline

Here is an example of typical NLP data processing with a tokenizer and a vocabulary.

The first step is to build a vocabulary from the raw training dataset. We provide a function, `build_vocab_from_iterator`, that builds the vocabulary from an iterator over tokenized text.

In [None]:
from torchtext.experimental.vocab import build_vocab_from_iterator
from torchtext.experimental.transforms import basic_english_normalize
tokenizer = basic_english_normalize()
train_iter, = AG_NEWS(data_select=('train'))
vocab = build_vocab_from_iterator(iter(tokenizer(line) for label, line in train_iter))

The vocabulary block converts a list of tokens into integers.
```
vocab(['here', 'is', 'an', 'example'])
>>> [475, 21, 30, 5286]
```
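
To make the idea of a vocabulary lookup concrete, here is a minimal pure-Python sketch of building a token-to-index map from an iterator of token lists. The helper `build_toy_vocab`, its `min_freq` and `unk_token` parameters, and the frequency-based ordering are assumptions made for illustration; torchtext's own ordering and unknown-token handling may differ.

```python
from collections import Counter


def build_toy_vocab(token_iter, min_freq=1, unk_token="<unk>"):
    """Count tokens and return a function mapping token lists to index lists."""
    counter = Counter()
    for tokens in token_iter:
        counter.update(tokens)
    # Most frequent tokens first; ties are broken alphabetically for determinism.
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = [unk_token] + [tok for tok, freq in ordered if freq >= min_freq]
    stoi = {tok: idx for idx, tok in enumerate(itos)}

    def lookup(tokens):
        # Tokens not seen while building the vocabulary map to the unknown index.
        return [stoi.get(tok, stoi[unk_token]) for tok in tokens]

    return lookup


corpus = ["here is an example", "here is another one"]
toy_vocab = build_toy_vocab(line.split() for line in corpus)
ids = toy_vocab(["here", "is", "an", "unseen"])
```

In this toy corpus, `here` and `is` occur twice and therefore receive the smallest non-zero indices, while the unseen token falls back to index 0.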

Next, prepare the data processing pipelines with the tokenizer and vocabulary. The pipelines will be applied to the raw data strings from the dataset iterators.

In [6]:
def generate_text_pipeline(tokenizer, vocab):
  def _forward(text):
    return vocab(tokenizer(text))
  return _forward
text_pipeline = generate_text_pipeline(basic_english_normalize(), vocab)
#label_pipeline = lambda x: 1 if x == 'pos' else 0
label_pipeline = lambda x: int(x)

The text pipeline converts a text string into a list of integers based on the lookup defined in the vocab. The label pipeline converts the label into an integer. For example,

```
text_pipeline('here is the an example')
>>> [475, 21, 2, 30, 5286]
label_pipeline('10')
>>> 10
```

### 2.2 SentencePiece data processing pipeline

SentencePiece is an unsupervised text tokenizer and detokenizer, mainly for neural network-based text generation systems where the vocabulary size is predetermined prior to training the neural model. The sentencepiece transforms in torchtext support both subword units (e.g., byte-pair encoding (BPE)) and the unigram language model. We provide a few pretrained SentencePiece models, which are accessible from `PRETRAINED_SP_MODEL`. Here is an example that applies the SentencePiece transform to build the dataset.

In [8]:
from torchtext.experimental.transforms import (
    PRETRAINED_SP_MODEL,
    sentencepiece_processor,
    load_sp_model,
)
from torchtext.utils import download_from_url
spm_filepath = download_from_url(PRETRAINED_SP_MODEL['text_unigram_25000'])
spm_transform = sentencepiece_processor(spm_filepath)
sp_model = load_sp_model(spm_filepath)

text_unigram_25000.model: 100%|██████████| 678k/678k [00:00<00:00, 1.66MB/s]


The sentencepiece processor converts a text string into a list of integers. You can use the `decode` method to convert a list of integers back to the original string.

```
spm_transform('here is the an example')
>>> [130, 46, 9, 76, 1798]
spm_transform.decode([6468, 17151, 4024, 8246, 16887, 87, 23985, 12, 581, 15120])
>>> 'torchtext sentencepiece processor can encode and decode'
```



### 2.3 (Optional for tutorial) Tokenizer + Vocab + Embedding data processing pipeline

Word embeddings are a type of word representation that allows words with similar meanings to have similar representations. FastText and GloVe are well-established baseline word vectors in the NLP community. In the new torchtext library, a Vector object supports the mapping between tokens and their corresponding vector representations (i.e., word embeddings).

In [None]:
from torchtext.experimental.vectors import FastText
def generate_vector_pipeline(tokenizer, vector):
  def _forward(text):
    return vector(tokenizer(text))
  return _forward
word_vector_pipeline = generate_vector_pipeline(basic_english_normalize(), FastText())

The word_vector_pipeline tokenizes a text string and converts the tokens into vectors, according to the pretrained word vectors.

```
word_vector_pipeline('here is the an example')
>>> tensor([[-0.1564,  0.0486,  0.1724,  ...,  0.4588, -0.0021,  0.3085],
            [ 0.0359,  0.1452,  0.1193,  ..., -0.0016,  0.1708, -0.0355],
            [-0.0653, -0.0930, -0.0176,  ...,  0.1664, -0.1308,  0.0354],
            [-0.0671,  0.0014, -0.1857,  ...,  0.1050, -0.2144,  0.0944],
            [ 0.0144,  0.1337, -0.1489,  ..., -0.0202,  0.0657, -0.0029]])
```
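
The claim that words with similar meanings have similar representations is usually checked with cosine similarity. The sketch below uses three made-up 3-dimensional vectors (real FastText vectors are much larger, and these values are purely hypothetical) to show how such a comparison could be computed.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented toy embeddings, for illustration only.
toy_vectors = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "table": [0.0, 0.1, 0.9],
}
sim_close = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_far = cosine_similarity(toy_vectors["good"], toy_vectors["table"])
```

With these toy values, the pair `good`/`great` scores much closer to 1 than the pair `good`/`table`.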

## Step 3: Generate data batch and iterator

The PyTorch data loading utility is the `torch.utils.data.DataLoader` class. It works with a map-style dataset, which implements the `__getitem__()` and `__len__()` protocols and represents a map from indices/keys to data samples. It also works with an iterable-style dataset when the `shuffle` argument is `False`. Before a batch is sent to the model, the `collate_fn` function is applied to the batch of samples generated by the `DataLoader`, so we can add the data processing pipelines from Step 2 to `collate_fn`.

In [9]:
from torch.utils.data import DataLoader
# [TODO] integrate with torchtext.experimental.transforms.PadTransform
# Need to land https://github.com/pytorch/text/pull/952
from torch.nn.utils.rnn import pad_sequence

cls_id = sp_model.PieceToId('<cls>')
pad_id = sp_model.PieceToId('<pad>')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

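def collate_batch(batch):
    label_list, text_list = [], []
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        # Prepend the <cls> token id to the SentencePiece ids of each text.
        text_list.append(torch.tensor([cls_id] + spm_transform(_text)))
    # Pad every sequence to the longest one in the batch: (batch, seq_len).
    text_list = pad_sequence(text_list, batch_first=True, padding_value=float(pad_id))
    label_list = torch.tensor(label_list)
    # Return the text as (seq_len, batch), the layout used by the model in Step 4.
    return label_list.to(device), text_list.transpose(0, 1).contiguous().to(device)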

train_iter, = AG_NEWS(data_select=('train'))
dataloader = DataLoader(list(train_iter), batch_size=8, shuffle=True, collate_fn=collate_batch)
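
The padding and transpose steps in `collate_batch` can be hard to picture, so here is a plain-Python illustration of the same two operations on nested lists. The helpers `pad_batch` and `transpose`, and the ids `CLS = 1` and `PAD = 0`, are hypothetical stand-ins chosen for this sketch; the real ids come from the SentencePiece model.

```python
def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]


def transpose(rows):
    """Swap the two axes of a rectangular nested list."""
    return [list(col) for col in zip(*rows)]


CLS, PAD = 1, 0  # hypothetical token ids
batch = [[CLS, 7, 8, 9], [CLS, 5], [CLS, 4, 6]]

padded = pad_batch(batch, PAD)   # batch-first: 3 rows of length 4
seq_first = transpose(padded)    # sequence-first: 4 rows of length 3
```

After the transpose, row 0 holds the `<cls>` id of every sample in the batch, which is presumably why the model in Step 4 reads `output[0]` to classify each sequence.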

## Step 4: Model for text classification task
---

We use a transformer model for text classification. The model consists of an embedding layer followed by a positional encoding layer, then a transformer encoder, and finally a linear layer that projects the encoder output at the first position (the `<cls>` token) onto the classes.

In [10]:
from torch import nn
import math
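# AG_NEWS labels run from 1 to 4; using 5 outputs leaves index 0 unused
# so that the raw labels can serve directly as class indices.
NUM_CLASSES = 5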
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)


class TextClassificationModel(nn.Module):
    """Contain a transformer encoder."""

    def __init__(self, ntoken, ninp, nhead, nhid, nlayers, dropout=0.5):
        super(TextClassificationModel, self).__init__()
        self.embed_layer = nn.Embedding(ntoken, ninp)
        self.pos_encoder = PositionalEncoding(ninp, dropout)
        encoder_layers = nn.TransformerEncoderLayer(ninp, nhead, nhid, dropout)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layers, nlayers)
        self.activation = nn.Tanh()
        self.projection = nn.Linear(ninp, NUM_CLASSES)

    def forward(self, src_seq):
        output = self.embed_layer(src_seq)
        output = self.pos_encoder(output)
        output = self.transformer_encoder(output)
        output = self.activation(output[0])
        return self.projection(output)
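
The `PositionalEncoding` module above fills even columns with sines and odd columns with cosines of `position * div_term`, where `div_term` equals `10000 ** (-i / d_model)` for each even column index `i`. The following pure-Python sketch recomputes the same table with the `math` module for a tiny `max_len` and `d_model`; the function name and the small sizes are chosen here only for illustration.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Same frequency term as div_term in the module above.
            angle = pos * math.exp(i * (-math.log(10000.0) / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


pe = positional_encoding(max_len=4, d_model=4)
```

At position 0 every sine is 0 and every cosine is 1, and the later column pairs vary much more slowly with position than the first pair.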

We build a model with the following parameters


*   the embedding dimension - 64
*   the number of heads in the transformer model - 8
*   the hidden dimension in the transformer model - 128
*   the number of layers in the transformer model - 1
*   the dropout value - 0.2



In [11]:
vocab_size = sp_model.GetPieceSize()
emsize, nhead, nhid, nlayers, dropout = 64, 8, 128, 1, 0.2
model = TextClassificationModel(vocab_size, emsize, nhead, nhid, nlayers, dropout).to(device)


## Step 5: Train and test the model
---

Then, we train and test the transformer model on the text classification dataset.

In [12]:
import time
import math

def fine_tune(model, dataloader, optimizer, criterion, batch_size, device, SEQENCE_LENGTH):
    model.train()
    total_loss, total_acc, total_count = 0, 0, 0
    log_interval = 500
    start_time = time.time()

    for idx, (label, text) in enumerate(dataloader):
        optimizer.zero_grad()
        # print(seq_input.size(), tok_type.size())
        if text.size(0) > SEQENCE_LENGTH:
            text = text[:SEQENCE_LENGTH]
        predited_label = model(text)
        loss = criterion(predited_label, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_loss += loss.item()
        total_acc += (predited_label.argmax(1) == label).sum().item()
        #print(predited_label.argmax(1), label)
        total_count += text.size(1)
        # Note: `epoch` and `scheduler` below are read from the notebook's global scope.
        if idx % log_interval == 0 and idx > 0:
            cur_loss = total_loss / log_interval
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:05.5f} | '
                  'ms/batch {:5.2f} | loss {:5.2f} '
                  '| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
                                              scheduler.get_last_lr()[0],
                                              elapsed * 1000 / log_interval,
                                              cur_loss, total_acc/total_count))
            total_loss, total_acc, total_count = 0, 0, 0
            start_time = time.time()

def evaluate(model, dataloader, optimizer,
             criterion, batch_size, device, SEQENCE_LENGTH):
    model.eval()
    total_loss, total_acc, total_count = 0, 0, 0
    ans_pred_tokens_samples = []

    with torch.no_grad():
        for idx, (label, text) in enumerate(dataloader):
            if text.size(0) > SEQENCE_LENGTH:
                text = text[:SEQENCE_LENGTH]
            predited_label = model(text)
            loss = criterion(predited_label, label)
            total_loss += loss.item()
            total_acc += (predited_label.argmax(1) == label).sum().item()
            total_count += text.size(1)
    return total_loss / len(dataloader), total_acc/total_count
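
The training function calls `torch.nn.utils.clip_grad_norm_` with a maximum norm of 0.1. As a rough illustration of the idea, the sketch below rescales a flat list of gradient values so that their overall L2 norm does not exceed the limit. The helper `clip_by_global_norm` and its `eps` parameter are assumptions of this sketch, and a flat list of floats stands in for the gradients of all model parameters.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale grads down when their L2 norm exceeds max_norm; return (grads, norm)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = max_norm / (total_norm + eps)
    if scale < 1.0:
        # Only shrink gradients; small gradients are left untouched.
        grads = [g * scale for g in grads]
    return grads, total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=0.1)
norm_after = math.sqrt(sum(g * g for g in clipped))

small, _ = clip_by_global_norm([0.03, 0.04], max_norm=0.1)
```

The direction of the gradient is preserved; only its length is capped, which limits the size of each SGD step when the learning rate is large.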

Here are a few hyperparameters used in the pipeline


*   The number of epochs - 10
*   The initial learning rate - 5.0
*   The batch size - 64
*   The maximum sequence length - 768
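
The training loop that follows keeps the learning rate fixed while validation accuracy keeps up with the best value seen so far, and steps the scheduler (multiplying the rate by `gamma=0.1`) after any epoch that falls below it. The sketch below simulates that rule in plain Python; the function `simulate_lr` and the accuracy values are made up for illustration.

```python
def simulate_lr(initial_lr, val_accuracies, gamma=0.1):
    """Return the learning rate used in each epoch under the decay-on-regress rule."""
    lr, best = initial_lr, None
    history = []
    for acc in val_accuracies:
        history.append(lr)  # learning rate in effect during this epoch
        if best is not None and best > acc:
            lr *= gamma     # accuracy dropped below the best so far: decay
        else:
            best = acc      # new best (or first) accuracy: keep the rate
    return history


# Invented validation accuracies for four epochs.
lrs = simulate_lr(5.0, [0.70, 0.75, 0.74, 0.80])
```

In this made-up run the rate stays at 5.0 for the first three epochs and is reduced for the fourth, because the third epoch scored below the best accuracy so far.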



In [13]:
# Hyperparameters
EPOCHS = 10 # epoch
LR = 5  # learning rate
BATCH_SIZE = 64 # batch size for training
SEQENCE_LENGTH = 768 # the maximum sequence length
  
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.1)
total_accu = None
train_iter, test_iter = AG_NEWS()
train_dataloader = DataLoader(list(train_iter), batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(list(test_iter), batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=collate_batch)

for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    fine_tune(model, train_dataloader, optimizer,
              criterion, BATCH_SIZE, device, SEQENCE_LENGTH)
    loss_val, accu_val = evaluate(model, test_dataloader, optimizer,
                                  criterion, BATCH_SIZE, device, SEQENCE_LENGTH)
    if total_accu is not None and total_accu > accu_val:
        scheduler.step()
    else:
        total_accu = accu_val
    print('-' * 89)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid loss {:5.2f} | valid accuracy {:8.3f} '.format(epoch,
                                         (time.time() - epoch_start_time),
                                         loss_val, accu_val))
    print('-' * 89)

| epoch   1 |   500/ 1875 batches | lr 5.00000 | ms/batch 34.56 | loss  1.34 | accuracy    0.360
| epoch   1 |  1000/ 1875 batches | lr 5.00000 | ms/batch 34.37 | loss  1.08 | accuracy    0.541
| epoch   1 |  1500/ 1875 batches | lr 5.00000 | ms/batch 34.65 | loss  0.98 | accuracy    0.589
-----------------------------------------------------------------------------------------
| end of epoch   1 | time: 66.76s | valid loss  0.75 | valid accuracy    0.707 
-----------------------------------------------------------------------------------------
| epoch   2 |   500/ 1875 batches | lr 5.00000 | ms/batch 34.60 | loss  0.89 | accuracy    0.639
| epoch   2 |  1000/ 1875 batches | lr 5.00000 | ms/batch 34.22 | loss  0.84 | accuracy    0.665
| epoch   2 |  1500/ 1875 batches | lr 5.00000 | ms/batch 34.77 | loss  0.80 | accuracy    0.680
-----------------------------------------------------------------------------------------
| end of epoch   2 | time: 66.30s | valid loss  0.61 | valid accurac