
In [None]:
%%shell

rm -r /usr/local/lib/python3.6/dist-packages/torch*
pip install numpy
pip install --pre torch torchtext -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html

In [None]:
import torch
import torchtext
print(torchtext.__version__)

## Sentiment Analysis with Torchtext

This tutorial shows how to build a dataset for sentiment analysis with the torchtext library. The building blocks in torchtext give you the flexibility to build a custom data processing pipeline.



## Step 1: Prepare datasets
----------------------------
We have already covered the basic components of the torchtext library, including the vocab, word vectors, the tokenizer backed by regular expressions, and SentencePiece. These are the basic building blocks for processing raw text strings.

### Tokenizer-vocabulary data processing pipeline

Here is an example of typical NLP data processing with a tokenizer and a vocabulary. The vocabulary is saved in a text file, which is available for download.

In [None]:
%%shell
rm -f bert_vocab.txt
wget https://pytorch.s3.amazonaws.com/models/text/torchtext_bert_example/bert_vocab.txt

Prepare the text and label pipelines for the dataset:

In [None]:
from torchtext.experimental.transforms import (
    basic_english_normalize,
    TextSequentialTransforms,
)
from torchtext.experimental.vocab import vocab_from_file
with open('bert_vocab.txt', 'r') as f:
  vocab = vocab_from_file(f)
text_pipeline = TextSequentialTransforms(basic_english_normalize(), vocab)
label_pipeline = lambda x: 1 if x == 'pos' else 0
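
The tokenizer-vocabulary pipeline above chains two steps: split a raw string into tokens, then map each token to an integer id. The following is a minimal pure-Python sketch of that idea. `toy_normalize`, `ToyVocab`, and `compose` are hypothetical stand-ins written for illustration; they are not torchtext APIs, and the normalization is much simpler than `basic_english_normalize`.

```python
import re


def toy_normalize(line):
    """Lowercase, separate common punctuation, and split on whitespace."""
    line = line.lower()
    line = re.sub(r"([.,!?()'\"])", r" \1 ", line)
    return line.split()


class ToyVocab:
    """Map tokens to integer ids; unseen tokens map to the <unk> id."""

    def __init__(self, tokens, unk_token="<unk>"):
        self.stoi = {tok: idx for idx, tok in enumerate(tokens)}
        self.unk_id = self.stoi[unk_token]

    def __call__(self, tokens):
        return [self.stoi.get(tok, self.unk_id) for tok in tokens]


def compose(*transforms):
    """Chain transforms so the output of one feeds the next."""
    def pipeline(x):
        for transform in transforms:
            x = transform(x)
        return x
    return pipeline


toy_vocab = ToyVocab(["<unk>", "the", "movie", "was", "great", "!"])
toy_text_pipeline = compose(toy_normalize, toy_vocab)
toy_ids = toy_text_pipeline("The movie was GREAT!")
print(toy_ids)
```

Words missing from the vocabulary, such as "plot" in this toy setup, fall back to the `<unk>` id instead of raising an error.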

### (Optional for tutorial) Word-vector embedding data processing pipeline

Word embeddings are a type of word representation that gives words with similar meanings similar representations. FastText and GloVe are well-established baseline word vectors in the NLP community. In the new torchtext library, a Vector object supports the mapping between tokens and their corresponding vector representations (i.e. word embeddings).

In [None]:
from torchtext.experimental.transforms import (
    basic_english_normalize,
    TextSequentialTransforms,
)
from torchtext.experimental.vectors import FastText
vector = FastText()
word_vector_pipeline = TextSequentialTransforms(basic_english_normalize(), vector)
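
To make the token-to-vector mapping concrete, here is a small sketch with made-up three-dimensional vectors. The values in `toy_vectors` are invented for illustration and do not come from FastText or GloVe, and the zero-vector fallback for unknown tokens is an assumption of this sketch.

```python
import math

# Invented 3-dimensional vectors; real FastText vectors are much larger.
toy_vectors = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "awful": [-0.7, 0.1, 0.3],
}
DIM = 3


def lookup(token):
    """Return the vector for a token, or a zero vector if it is unknown."""
    return toy_vectors.get(token, [0.0] * DIM)


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


sim_close = cosine(lookup("good"), lookup("great"))
sim_far = cosine(lookup("good"), lookup("awful"))
print(sim_close > sim_far)
```

Words with similar meanings end up with a higher cosine similarity than words with opposite meanings, which is the property the paragraph above describes.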

### (Optional for tutorial) SentencePiece data processing pipeline

SentencePiece is an unsupervised text tokenizer and detokenizer, mainly for neural-network-based text generation systems where the vocabulary size is fixed before the neural model is trained. The SentencePiece transforms in torchtext support both subword units (e.g., byte-pair encoding (BPE)) and the unigram language model. Here is an example that applies a SentencePiece transform to build the dataset.

In [None]:
from torchtext.experimental.transforms import (
    PRETRAINED_SP_MODEL,
    sentencepiece_processor,
)
from torchtext.utils import download_from_url
spm_filepath = download_from_url(PRETRAINED_SP_MODEL['text_unigram_25000'])
spm_transform = sentencepiece_processor(spm_filepath)
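
As a rough illustration of subword segmentation, the sketch below applies a hand-written list of BPE merges to single words. This is a toy: `apply_bpe` and `toy_merges` are hypothetical, the merges are not learned from data, and SentencePiece itself works on whole sentences and treats whitespace as a symbol. The pretrained model used in this tutorial is also a unigram model, not BPE.

```python
def apply_bpe(word, merges):
    """Segment a word by applying merges in priority order."""
    symbols = list(word)  # start from single characters
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Fuse an adjacent (left, right) pair into one symbol.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hand-picked merges; a real model learns these from corpus statistics.
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(apply_bpe("lower", toy_merges))
print(apply_bpe("lowest", toy_merges))
```

The number of merges bounds the number of distinct subword symbols, which is why the vocabulary size can be fixed before the model is trained.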

### Construct the dataset with raw text iterator and transforms

The raw text dataset iterators are available in the `torchtext.experimental.datasets.raw` folder. The datasets in `torchtext.experimental.datasets.raw` return iterators that yield the raw data. This lets users define custom data processing pipelines and work directly on the raw data.

In [None]:
from torchtext.experimental.datasets.raw import IMDB
train_iter, test_iter = IMDB()
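
An iterator can only be consumed once, which is why the raw iterators are materialized with `list(...)` before they are wrapped in a dataset. The sketch below shows this with a hypothetical generator, `toy_raw_imdb`, that yields `(label, text)` pairs in the same order the dataset class in this tutorial expects; the two samples are invented.

```python
def toy_raw_imdb():
    """Yield (label, text) pairs, mimicking a raw dataset iterator."""
    samples = [
        ("pos", "A wonderful film."),
        ("neg", "Dull and far too long."),
    ]
    for label, text in samples:
        yield label, text


raw_iter = toy_raw_imdb()
first_pass = list(raw_iter)   # consumes the iterator
second_pass = list(raw_iter)  # nothing left to yield
print(len(first_pass), len(second_pass))
```

After materializing, the list supports indexing and `len()`, which a map-style dataset needs.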

Materialize the raw IMDB data iterators, then pass the data and the data processing pipelines (a.k.a. transforms) to the IMDB dataset abstraction. `IMDBDataset` applies the user-defined transform pipelines to the raw (label, text) samples.

In [None]:
class IMDBDataset(torch.utils.data.Dataset):
    """Map-style dataset that applies transforms to raw (label, text) samples.
    """

    def __init__(self, data, transforms):
        """Initialize the text-classification dataset.
        """

        super(IMDBDataset, self).__init__()
        self.data = data
        self.transforms = transforms  # (label_transforms, text_transforms)

    def __getitem__(self, i):
        label = self.data[i][0]
        txt = self.data[i][1]
        return (self.transforms[0](label), self.transforms[1](txt))

    def __len__(self):
        return len(self.data)

train_data = IMDBDataset(list(train_iter), (label_pipeline, spm_transform))
test_data = IMDBDataset(list(test_iter), (label_pipeline, spm_transform))



### JIT support for the data processing pipeline

The new building blocks in the torchtext library are compatible with `torch.jit.script`. TorchScript is a way to create serializable and optimizable models from PyTorch code. Any TorchScript program can be saved from a Python process and loaded in a process where there is no Python dependency. The data processing pipelines above can be converted and run in JIT mode without a Python dependency.

In [None]:
text_pipeline = text_pipeline.to_ivalue()
jit_text_pipeline = torch.jit.script(text_pipeline)

## Step 2: Data Iterator

The PyTorch data loading utility is the `torch.utils.data.DataLoader` class. It works with a map-style dataset, which implements the `__getitem__()` and `__len__()` protocols and represents a map from indices/keys to data samples. Before a batch is sent to the model, the `collate_fn` function processes the list of samples drawn by the DataLoader.

In [None]:
from torch.utils.data import DataLoader

# [TODO] integrate with torchtext.experimental.transforms.PadTransform
# Need to land https://github.com/pytorch/text/pull/952
from torch.nn.utils.rnn import pad_sequence

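# NOTE: assumes the SentencePiece model defines the '<cls>' and '<pad>' pieces.
cls_id = spm_transform.sp_model.PieceToId('<cls>')  # integer id
pad_id = spm_transform.sp_model.PieceToId('<pad>')  # integer id
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    label_list, text_list = [], []
    for (_label, _text) in batch:
        label_list.append(_label)
        # Prepend the <cls> id to the list of token ids.
        text_list.append(torch.tensor([cls_id] + _text))
    # Pad every sequence to the longest one in the batch: (batch, seq_len).
    text_list = pad_sequence(text_list, batch_first=True, padding_value=float(pad_id))
    label_list = torch.tensor(label_list)
    # Transpose to (seq_len, batch), the default layout for nn.TransformerEncoder.
    return label_list.to(device), text_list.transpose(0, 1).contiguous().to(device)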

dataloader = DataLoader(train_data, batch_size=8, shuffle=True, collate_fn=collate_batch)
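
The batching step can be traced on plain Python lists. The sketch below is a hypothetical `toy_collate` that mirrors the intent of the collate function in this step without using torch: prepend a class token id, pad to the longest sequence, and transpose to a sequence-first layout. The ids `cls_id=2` and `pad_id=0` are arbitrary values chosen for illustration.

```python
def toy_collate(batch, cls_id, pad_id):
    """Collate (label, token_ids) samples into labels and a padded batch."""
    labels = [label for label, _ in batch]
    # Prepend the class token id to every sequence.
    rows = [[cls_id] + ids for _, ids in batch]
    # Pad each row to the length of the longest row: shape (batch, seq_len).
    max_len = max(len(row) for row in rows)
    padded = [row + [pad_id] * (max_len - len(row)) for row in rows]
    # Transpose to (seq_len, batch).
    seq_first = [list(column) for column in zip(*padded)]
    return labels, seq_first


toy_batch = [(1, [5, 6, 7]), (0, [8])]
toy_labels, toy_text = toy_collate(toy_batch, cls_id=2, pad_id=0)
print(toy_labels)
print(toy_text)
```

In the sequence-first layout, the first row holds the class token of every sample, which is the position the model in Step 3 reads for its prediction.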

## Step 3: Model for Sentiment Analysis Task
---

In [None]:
from torch import nn
class SentimentAnalysisModel(nn.Module):
    """Contain a transformer encoder."""

    def __init__(self, ntoken, ninp, nhead, nhid, nlayers, dropout=0.5):
        super(SentimentAnalysisModel, self).__init__()
        self.embed_layer = nn.Embedding(ntoken, ninp)
        encoder_layers = nn.TransformerEncoderLayer(ninp, nhead, nhid, dropout)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layers, nlayers)
        self.activation = nn.Tanh()
        self.projection = nn.Linear(ninp, 2)

    def forward(self, src_seq):
        output = self.embed_layer(src_seq)
        output = self.transformer_encoder(output)
        output = self.activation(output[0])
        return self.projection(output)

In [None]:
vocab_size = spm_transform.sp_model.GetPieceSize()
emsize, nhead, nhid, nlayers, dropout = 64, 8, 128, 1, 0.2
model = SentimentAnalysisModel(vocab_size, emsize, nhead, nhid, nlayers, dropout).to(device)
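
Multi-head attention splits the embedding dimension evenly across heads, so `emsize` must be divisible by `nhead`. The check below is a small sketch of that constraint; `head_dim` is a hypothetical helper, not a PyTorch function.

```python
def head_dim(embed_dim, num_heads):
    """Return the per-head dimension, or raise if the split is uneven."""
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    return embed_dim // num_heads


# The values used in this tutorial: emsize=64, nhead=8.
per_head = head_dim(64, 8)
print(per_head)
```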


## Step 4: Train and test the Model
---

Next, we train the transformer model on the IMDB training split and evaluate it on the test split.

In [None]:
import time
import math

def fine_tune(model, dataloader, optimizer, criterion, batch_size, device, SEQENCE_LENGTH):
    model.train()
    total_loss = 0.
    log_interval = 200
    start_time = time.time()

    for idx, (label, text) in enumerate(dataloader):
        optimizer.zero_grad()
        # text has shape (seq_len, batch); truncate overly long sequences.
        if text.size(0) > SEQENCE_LENGTH:
            text = text[:SEQENCE_LENGTH]
        predicted_label = model(text)
        loss = criterion(predicted_label, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_loss += loss.item()
        if idx % log_interval == 0 and idx > 0:
            cur_loss = total_loss / log_interval
            elapsed = time.time() - start_time
            # NOTE: `epoch` and `scheduler` are read from the notebook's global scope.
            print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:05.5f} | '
                  'ms/batch {:5.2f} | '
                  'loss {:5.2f} | ppl {:8.2f}'.format(epoch, idx, len(dataloader),
                                                      scheduler.get_last_lr()[0],
                                                      elapsed * 1000 / log_interval,
                                                      cur_loss, math.exp(cur_loss)))
            total_loss = 0
            start_time = time.time()

def evaluate(model, dataloader, optimizer, criterion, batch_size, device, SEQENCE_LENGTH):
    model.eval()
    total_loss = 0.

    with torch.no_grad():
        for idx, (label, text) in enumerate(dataloader):
            # text has shape (seq_len, batch); truncate overly long sequences.
            if text.size(0) > SEQENCE_LENGTH:
                text = text[:SEQENCE_LENGTH]
            predicted_label = model(text)
            loss = criterion(predicted_label, label)
            total_loss += loss.item()
    # Average loss per batch.
    return total_loss / len(dataloader)
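
The training loop clips gradients with `clip_grad_norm_` and a maximum norm of 0.1. The sketch below reproduces the idea on a flat list of numbers: compute the global L2 norm and rescale only when it exceeds the limit. `clip_by_global_norm` is a hypothetical helper, and the small `eps` term is an assumption added to avoid division by zero.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale grads so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Never scale up: the factor is capped at 1.0.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=0.1)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

Gradients whose norm is already below the limit pass through unchanged, so clipping only affects the occasional large update.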

Here are the hyperparameters used in the pipeline, followed by the training loop:

In [None]:
# Hyperparameters
EPOCHS = 3  # number of epochs
LR = 0.5  # learning rate
BATCH_SIZE = 16  # batch size for training
SEQENCE_LENGTH = 128  # the maximum sequence length

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
# With step_size=1, each scheduler.step() multiplies the learning rate by gamma.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.1)
eval_loss = None
train_dataloader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_data, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)

for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    fine_tune(model, train_dataloader, optimizer, criterion, BATCH_SIZE, device, SEQENCE_LENGTH)
    _loss = evaluate(model, test_dataloader, optimizer, criterion, BATCH_SIZE, device, SEQENCE_LENGTH)
    # Decay the learning rate when the evaluation loss gets worse.
    if eval_loss is not None and _loss > eval_loss:
        scheduler.step()
    else:
        eval_loss = _loss
    print('-' * 89)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid loss {:5.2f} | '.format(epoch, (time.time() - epoch_start_time),
                                         _loss))
    print('-' * 89)
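
The loop above decays the learning rate only when the evaluation loss is worse than the best loss seen so far. The sketch below simulates that rule without torch. `simulate_lr` is a hypothetical helper, and the loss values are invented for illustration rather than taken from a real run.

```python
def simulate_lr(eval_losses, lr=0.5, gamma=0.1):
    """Return the learning rate in effect after each epoch."""
    best = None
    history = []
    for loss in eval_losses:
        if best is not None and loss > best:
            lr *= gamma  # loss got worse: decay the learning rate
        else:
            best = loss  # loss improved (or first epoch): remember it
        history.append(lr)
    return history


# Invented evaluation losses for three epochs.
lr_history = simulate_lr([0.70, 0.65, 0.68])
print(lr_history)
```

With these made-up losses, the rate stays at its initial value for two epochs and is multiplied by `gamma` after the third, when the loss rises above the best value.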