<a href="https://colab.research.google.com/github/zhangguanheng66/tutorials/blob/sentiment_analysis/Torchtext_with_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%shell

#rm -r /usr/local/lib/python3.6/dist-packages/torch*
#pip install numpy
pip uninstall -y torch torchtext
pip install --pre torch torchtext -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html

In [None]:
import torch
print(torch.__version__)
import torchtext
print(torchtext.__version__)

# Sentiment Analysis with Torchtext

[TODO] Add more details about the task and describe the work here, show how to prepare data.
This tutorial shows how to build a dataset for sentiment analysis with the torchtext library. The building blocks in torchtext give you the flexibility to assemble a custom data processing pipeline.



## Step 1: Access to the raw dataset iterators
----------------------------

The torchtext library provides a few raw dataset iterators, which yield the raw text strings and labels. For example, the IMDB dataset iterators yield the raw data as a tuple of label and text.

In [None]:
from torchtext.experimental.datasets.raw import IMDB
train_iter, = IMDB(data_select=('train'))



```
next(iter(train_iter))
>>> ('neg',
 'I rented I AM CURIOUS-YELLOW from my video store because of all the 
controversy that surrounded it when it was first released in 1967. I also heard 
that at first it was seized by U.S. customs if it ever tried to enter this 
country, therefore being a fan of films considered "controversial" I really had 
to see this for myself.<br /><br />The plot is centered around a young Swedish 
drama student named Lena who wants to learn everything she can about life. In 
particular she wants to focus her attentions to making some sort of documentary 
on what the average Swede thought about certain political issues such as the 
Vietnam War and race issues in the United States. In between asking politicians 
and ordinary denizens of Stockholm about their opinions on politics, she has 
sex with her drama teacher, classmates, and married men.<br /><br />What kills 
me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered 
pornographic. Really, the sex and nudity scenes are few and far between, even 
then it\'s not shot like some cheaply made porno. While my countrymen mind find 
it shocking, in reality sex and nudity are a major staple in Swedish cinema. 
Even Ingmar Bergman, arguably their answer to good old boy John Ford, had sex 
scenes in his films.<br /><br />I do commend the filmmakers for the fact that 
any sex shown in the film is shown for artistic purposes rather than just to 
shock people and make money to be shown in pornographic theaters in America. I 
AM CURIOUS-YELLOW is a good film for anyone wanting to study the meat and 
potatoes (no pun intended) of Swedish cinema. But really, this film doesn\'t 
have much of a plot.')

next(iter(train_iter))
>>> ('neg',
 '"I Am Curious: Yellow" is a risible and pretentious steaming pile. It 
doesn\'t matter what one\'s political views are because this film can hardly be 
taken seriously on any level. As for the claim that frontal male nudity is an 
automatic NC-17, that isn\'t true. I\'ve seen R-rated films with male nudity. 
Granted, they only offer some fleeting views, but where are the R-rated films 
with gaping vulvas and flapping labia? Nowhere, because they don\'t exist. The 
same goes for those crappy cable shows: schlongs swinging in the breeze but not 
a clitoris in sight. And those pretentious indie movies like The Brown Bunny, 
in which we\'re treated to the site of Vincent Gallo\'s throbbing johnson, but 
not a trace of pink visible on Chloe Sevigny. Before crying (or implying) 
"double-standard" in matters of nudity, the mentally obtuse should take into 
account one unavoidably obvious anatomical difference between men and women: 
there are no genitals on display when actresses appears nude, and the same 
cannot be said for a man. In fact, you generally won\'t see female genitals in 
an American film in anything short of porn or explicit erotica. This alleged 
double-standard is less a double standard than an admittedly depressing ability 
to come to terms culturally with the insides of women\'s bodies.')
```
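
Because each item is a plain `(label, text)` tuple, the iterator can be consumed with ordinary Python. The sketch below is only an illustration: `toy_samples` is a hand-written stand-in for the IMDB iterator (an assumption, so the snippet runs without downloading the dataset), and `count_labels` is a hypothetical helper, not part of torchtext.

```python
from collections import Counter

# Hypothetical stand-in for the raw IMDB iterator: each item is (label, text).
toy_samples = [
    ("neg", "this film does not have much of a plot"),
    ("pos", "a good film for anyone who likes swedish cinema"),
    ("neg", "hardly to be taken seriously on any level"),
]

def count_labels(samples):
    """Count how often each label occurs in an iterable of (label, text) tuples."""
    return Counter(label for label, _text in samples)

# Wrap the list in iter() to mimic a one-pass iterator.
label_counts = count_labels(iter(toy_samples))
print(label_counts["neg"], label_counts["pos"])  # 2 1
```

An iterator is exhausted after one full pass, which is presumably why the cells in this tutorial call `IMDB(...)` again before each new use.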



## Step 2: Prepare data
----------------------------
This step uses the basic building blocks of the torchtext library: vocab, word vectors, a tokenizer backed by regular expressions, and SentencePiece. Each of them processes raw text strings.

### Tokenizer-vocabulary data processing pipeline

Here is an example of a typical NLP data processing pipeline built from a tokenizer and a vocabulary.

Build the vocab from the raw training dataset:

In [None]:
from torchtext.experimental.vocab import build_vocab_from_iterator
from torchtext.experimental.transforms import basic_english_normalize
tokenizer = basic_english_normalize()
train_iter, = IMDB(data_select=('train'))
vocab = build_vocab_from_iterator(iter(tokenizer(line) for label, line in train_iter))

```
vocab(['here', 'is', 'an', 'example'])
>>> [131, 9, 40, 464]
```
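
To show what a token-to-index lookup of this kind does conceptually, here is a minimal sketch in plain Python. It is not the torchtext implementation: `build_toy_vocab`, `lookup`, the `<unk>` fallback, and the frequency-then-alphabetical ordering are all assumptions made for illustration, so the indices will not match the ones printed above.

```python
from collections import Counter

def build_toy_vocab(token_lists, unk_token="<unk>"):
    """Map each token to an integer index, most frequent tokens first."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by descending frequency, then alphabetically, so the result is deterministic.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    index = {unk_token: 0}
    for tok in ordered:
        index[tok] = len(index)
    return index

def lookup(index, tokens, unk_token="<unk>"):
    """Convert tokens to indices, falling back to the unknown-token index."""
    return [index.get(tok, index[unk_token]) for tok in tokens]

toy_corpus = [["here", "is", "an", "example"], ["here", "is", "another"]]
toy_index = build_toy_vocab(toy_corpus)
ids = lookup(toy_index, ["here", "is", "an", "unseen"])
print(ids)  # [1, 2, 3, 0]
```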

Prepare the data pipelines for the dataset:

In [None]:
def generate_text_pipeline(tokenizer, vocab):
  def _forward(text):
    return vocab(tokenizer(text))
  return _forward
text_pipeline = generate_text_pipeline(basic_english_normalize(), vocab)
label_pipeline = lambda x: 1 if x == 'pos' else 0

The text pipeline converts a text string into a list of integers based on the lookup defined in the vocab. The label pipeline converts the label into an integer. For example,

```
text_pipeline('here is the an example')
>>> [131, 9, 1, 40, 464]
label_pipeline('pos')
>>> 1
```

### (Optional for tutorial) Tokenizer + Vocab + Embedding data processing pipeline

Word embeddings are a type of word representation in which words with similar meanings have similar representations. FastText and GloVe are well-established baseline word vectors in the NLP community. In the new torchtext library, a Vector object maps tokens to their corresponding vector representations (i.e., word embeddings).

In [None]:
from torchtext.experimental.vectors import FastText
def generate_vector_pipeline(tokenizer, vector):
  def _forward(text):
    return vector(tokenizer(text))
  return _forward
word_vector_pipeline = generate_vector_pipeline(basic_english_normalize(), FastText())

The `word_vector_pipeline` tokenizes a text string and converts the tokens into vectors according to the pretrained word vectors.

```
word_vector_pipeline('here is the an example')
>>> tensor([[-0.1564,  0.0486,  0.1724,  ...,  0.4588, -0.0021,  0.3085],
            [ 0.0359,  0.1452,  0.1193,  ..., -0.0016,  0.1708, -0.0355],
            [-0.0653, -0.0930, -0.0176,  ...,  0.1664, -0.1308,  0.0354],
            [-0.0671,  0.0014, -0.1857,  ...,  0.1050, -0.2144,  0.0944],
            [ 0.0144,  0.1337, -0.1489,  ..., -0.0202,  0.0657, -0.0029]])
```
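
The output above has one row per token. The plain-Python sketch below mimics that shape with a tiny hand-written table. Everything in it is hypothetical: `toy_table` holds made-up 3-dimensional vectors rather than FastText values, and mapping unknown tokens to an all-zero vector is an assumption of this sketch.

```python
def make_vector_lookup(table, dim):
    """Return a function that maps a list of tokens to a list of vectors."""
    zero = [0.0] * dim

    def _lookup(tokens):
        # Tokens missing from the table fall back to an all-zero vector
        # (an assumption of this sketch).
        return [table.get(tok, zero) for tok in tokens]

    return _lookup

# Made-up vectors, for illustration only.
toy_table = {"here": [0.1, 0.2, 0.3], "example": [0.4, 0.5, 0.6]}
toy_vectors = make_vector_lookup(toy_table, dim=3)

rows = toy_vectors("here is an example".split())
print(len(rows), len(rows[0]))  # 4 3
```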

### SentencePiece data processing pipeline

SentencePiece is an unsupervised text tokenizer and detokenizer, mainly for neural-network-based text generation systems where the vocabulary size is fixed before the neural model is trained. The SentencePiece transforms in torchtext support both subword units (e.g., byte-pair encoding (BPE)) and the unigram language model. Here is an example that applies the SentencePiece transform to build the dataset.

In [None]:
from torchtext.experimental.transforms import (
    PRETRAINED_SP_MODEL,
    sentencepiece_processor,
    load_sp_model,
)
from torchtext.utils import download_from_url
spm_filepath = download_from_url(PRETRAINED_SP_MODEL['text_unigram_25000'])
spm_transform = sentencepiece_processor(spm_filepath)
sp_model = load_sp_model(spm_filepath)
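
To make "subword units" more concrete, here is a toy sketch of how byte-pair encoding learns merges: it repeatedly finds the most frequent adjacent symbol pair and fuses it into one symbol. This is not SentencePiece's implementation and has nothing to do with the pretrained unigram model loaded above. The corpus `toy_words`, the helper names, and the tie-breaking rule are all assumptions chosen for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair across all words."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    # Ties are broken by comparing the pairs themselves (an arbitrary choice).
    return max(pairs, key=lambda p: (pairs[p], p)) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: word (split into characters) -> frequency.
toy_words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):  # learn three merges
    pair = most_frequent_pair(toy_words)
    if pair is None:
        break
    merges.append(pair)
    toy_words = merge_pair(toy_words, pair)

print(merges)  # [('s', 't'), ('e', 'st'), ('o', 'w')]
```

After these three merges, frequent fragments such as `est` and `ow` have become single symbols, which is the sense in which a subword vocabulary sits between characters and whole words.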

## Step 3: Data Iterator

The PyTorch data loading utility is the `torch.utils.data.DataLoader` class. It works with a map-style dataset, which implements the `__getitem__()` and `__len__()` protocols and represents a map from indices/keys to data samples. Before a batch is sent to the model, the `collate_fn` function processes the batch of samples generated by the DataLoader. Here it converts the labels to integers, prepends the `<cls>` id to each tokenized text, and pads the texts to a common length.

In [None]:
from torch.utils.data import DataLoader

# [TODO] integrate with torchtext.experimental.transforms.PadTransform
# Need to land https://github.com/pytorch/text/pull/952
from torch.nn.utils.rnn import pad_sequence

cls_id = sp_model.PieceToId('<cls>')
pad_id = sp_model.PieceToId('<pad>')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    label_list, text_list = [], []
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        text_list.append(torch.tensor([cls_id] + spm_transform(_text)))
    text_list = pad_sequence(text_list, batch_first=True, padding_value=float(pad_id))
    label_list = torch.tensor(label_list)
    return label_list.to(device), text_list.transpose(0, 1).contiguous().to(device)

train_iter, = IMDB(data_select=('train'))
dataloader = DataLoader(list(train_iter), batch_size=8, shuffle=True, collate_fn=collate_batch)
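
The shape handling in `collate_batch` is easier to follow on a tiny example. The sketch below reproduces the same three steps (prepend a class id, pad to the longest sequence, switch to sequence-first layout) with plain Python lists instead of tensors. `toy_collate` is a hypothetical helper, and the ids `cls_id=2` and `pad_id=0` are made-up values, not the ones from the pretrained SentencePiece model.

```python
def toy_collate(batch, cls_id, pad_id):
    """Prepend cls_id, pad to the longest sequence, return (labels, seq-first ids)."""
    labels = [label for label, _ids in batch]
    rows = [[cls_id] + ids for _label, ids in batch]
    max_len = max(len(row) for row in rows)
    # Layout after padding: (batch, seq).
    padded = [row + [pad_id] * (max_len - len(row)) for row in rows]
    # Transpose to (seq, batch), mirroring text_list.transpose(0, 1).
    seq_first = [list(col) for col in zip(*padded)]
    return labels, seq_first

# Two already-tokenized samples: (label, token ids).
toy_batch = [(1, [7, 8, 9]), (0, [5])]
labels, text = toy_collate(toy_batch, cls_id=2, pad_id=0)

print(labels)  # [1, 0]
print(text)    # [[2, 2], [7, 5], [8, 0], [9, 0]]
```

In the sequence-first layout, the first row holds the class token of every sample, which is the position the model in the next step reads with `output[0]`.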

## Step 4: Model for Sentiment Analysis Task
---

In [None]:
from torch import nn
import math
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)

class SentimentAnalysisModel(nn.Module):
    """Contain a transformer encoder."""

    def __init__(self, ntoken, ninp, nhead, nhid, nlayers, dropout=0.5):
        super(SentimentAnalysisModel, self).__init__()
        self.embed_layer = nn.Embedding(ntoken, ninp)
        self.pos_encoder = PositionalEncoding(ninp, dropout)
        encoder_layers = nn.TransformerEncoderLayer(ninp, nhead, nhid, dropout)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layers, nlayers)
        self.activation = nn.Tanh()
        self.projection = nn.Linear(ninp, 2)

    def forward(self, src_seq):
        output = self.embed_layer(src_seq)
        output = self.pos_encoder(output)
        output = self.transformer_encoder(output)
        output = self.activation(output[0])
        return self.projection(output)

In [None]:
vocab_size = sp_model.GetPieceSize()
emsize, nhead, nhid, nlayers, dropout = 64, 8, 128, 1, 0.2
model = SentimentAnalysisModel(vocab_size, emsize, nhead, nhid, nlayers, dropout).to(device)
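
The `PositionalEncoding` module fills even columns with sines and odd columns with cosines of `position * div_term`. As a cross-check of that formula, here is a minimal sketch that computes the same table with the standard `math` module only. The function name `positional_encoding` and the small sizes are assumptions for illustration; it omits dropout and the extra batch dimension.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Same frequency term as div_term in the module above.
            angle = pos * math.exp(i * (-math.log(10000.0) / d_model))
            table[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                table[pos][i + 1] = math.cos(angle)
    return table

pe = positional_encoding(max_len=4, d_model=4)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]
```

Position 0 always encodes to alternating 0 and 1, since sin(0) = 0 and cos(0) = 1. Later positions rotate each sine/cosine pair at a different frequency.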


## Step 5: Train and test the Model
---

Next, we train the transformer model on the IMDB training split and evaluate it on the test split after each epoch.

In [None]:
import time
import math

def fine_tune(model, dataloader, optimizer, criterion, batch_size, device, SEQENCE_LENGTH):
    model.train()
    total_loss, total_acc, total_count = 0, 0, 0
    log_interval = 50
    start_time = time.time()

    for idx, (label, text) in enumerate(dataloader):
        optimizer.zero_grad()
        # Truncate along the sequence dimension; text has shape (seq_len, batch).
        if text.size(0) > SEQENCE_LENGTH:
            text = text[:SEQENCE_LENGTH]
        predicted_label = model(text)
        loss = criterion(predicted_label, label)
        loss.backward()
        # Clip the gradient norm to stabilize training.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_loss += loss.item()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        # text is (seq_len, batch), so size(1) is the number of samples in the batch.
        total_count += text.size(1)
        if idx % log_interval == 0 and idx > 0:
            cur_loss = total_loss / log_interval
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:05.5f} | '
                  'ms/batch {:5.2f} | loss {:5.2f} '
                  '| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
                                              scheduler.get_last_lr()[0],
                                              elapsed * 1000 / log_interval,
                                              cur_loss, total_acc/total_count))
            total_loss, total_acc, total_count = 0, 0, 0
            start_time = time.time()

def evaluate(model, dataloader, optimizer,
             criterion, batch_size, device, SEQENCE_LENGTH):
    model.eval()
    total_loss, total_acc, total_count = 0, 0, 0
    ans_pred_tokens_samples = []

    with torch.no_grad():
        for idx, (label, text) in enumerate(dataloader):
            if text.size(0) > SEQENCE_LENGTH:
                text = text[:SEQENCE_LENGTH]
            predicted_label = model(text)
            loss = criterion(predicted_label, label)
            total_loss += loss.item()
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += text.size(1)
    return total_loss / len(dataloader), total_acc/total_count
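
Both functions track accuracy the same way: take the argmax over the two class scores, compare it with the label, and accumulate the number of correct predictions and the number of samples. The sketch below shows that bookkeeping on plain lists. `batch_accuracy_counts` and the toy scores are hypothetical and only stand in for the tensor expression `(predicted.argmax(1) == label).sum()`.

```python
def batch_accuracy_counts(logits, labels):
    """Return (number correct, number of samples) for one batch of class scores."""
    correct = 0
    for scores, label in zip(logits, labels):
        # argmax over the class dimension
        predicted = max(range(len(scores)), key=lambda k: scores[k])
        correct += int(predicted == label)
    return correct, len(labels)

# Made-up scores for three samples and two classes (0 = neg, 1 = pos).
toy_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3]]
toy_labels = [1, 0, 0]

correct, count = batch_accuracy_counts(toy_logits, toy_labels)
print(correct, count)  # 2 3
```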

Here are the hyperparameters used in the pipeline:

In [None]:
# Hyperparameters
EPOCHS = 10 # epoch
LR = 5  # learning rate
BATCH_SIZE = 64 # batch size for training
SEQENCE_LENGTH = 1025 # the maximum sequence length
  
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
eval_loss = None
train_iter, test_iter = IMDB()
train_dataloader = DataLoader(list(train_iter), batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(list(test_iter), batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=collate_batch)

for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    fine_tune(model, train_dataloader, optimizer,
              criterion, BATCH_SIZE, device, SEQENCE_LENGTH)
    loss_val, accu_val = evaluate(model, test_dataloader, optimizer,
                                  criterion, BATCH_SIZE, device, SEQENCE_LENGTH)
    if eval_loss is not None and loss_val > eval_loss:
        scheduler.step()
    else:
        eval_loss = loss_val
    print('-' * 89)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid loss {:5.2f} | valid accuracy {:8.3f} '.format(epoch,
                                         (time.time() - epoch_start_time),
                                         loss_val, accu_val))
    print('-' * 89)
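
The loop only calls `scheduler.step()` when the evaluation loss is worse than the lowest loss seen so far. With a step size of 1 and `gamma=0.1`, each such call multiplies the learning rate by 0.1. The sketch below replays that rule in plain Python. `simulate_lr` is a hypothetical helper and `toy_losses` are made-up per-epoch losses, not results from this model.

```python
def simulate_lr(losses, lr, gamma=0.1):
    """Decay lr whenever the loss is worse than the lowest loss seen so far."""
    best_seen = None
    history = []
    for loss in losses:
        if best_seen is not None and loss > best_seen:
            lr *= gamma  # the effect of one scheduler step with step size 1
        else:
            best_seen = loss
        history.append(lr)
    return history

# Hypothetical per-epoch evaluation losses.
toy_losses = [0.70, 0.60, 0.65, 0.50, 0.55]
lr_history = simulate_lr(toy_losses, lr=5.0)
print(lr_history)
```

In this made-up run the learning rate drops after epochs 3 and 5, the two epochs where the loss got worse.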