
# Lab 6: Transfer Learning in NLP

In [0]:
__author__ = "Alex Wang"
__version__ = "DSGA 1012, NYU, Spring 2019 term"

The field of natural language understanding (NLU) has seen a surge of progress in just the past year by transferring high-quality language models to various NLU tasks. In this notebook we'll familiarize ourselves with a popular pretrained language model, BERT, and its usage.

Logistical notes:
- make a copy of this notebook
- make sure you're running a Python **3** kernel
- make sure you adjust the runtime to have a GPU backend (Runtime -> Change runtime type -> GPU)

## BERT

BERT [(Devlin et al., 2018)](https://arxiv.org/abs/1810.04805) is a pretrained language model that has been applied productively to transfer learning on a variety of NLP tasks, including:
- all the tasks in [GLUE](https://gluebenchmark.com): acceptability judgments, sentiment classification, natural language inference, paraphrase detection, etc.
- [reading comprehension](https://rajpurkar.github.io/SQuAD-explorer/) 
- [question answering](https://ai.google.com/research/NaturalQuestions/leaderboard)
- [commonsense reasoning](https://leaderboard.allenai.org/swag/submissions/public)
- [constituency parsing (Kitaev and Klein, 2018)](https://arxiv.org/abs/1812.11760)

Why is BERT so effective? What are the ingredients?
- big data, big models, big compute
- _masked_ language modeling
- next sentence prediction
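
To make the _masked_ language modeling ingredient concrete, here is a minimal sketch of the corruption step described in Devlin et al. (2018): roughly 15% of tokens are selected as prediction targets, and of those about 80% are replaced by `[MASK]`, 10% by a random token, and 10% are left unchanged. The function name `mask_tokens`, the toy vocabulary, and the per-token coin flip are assumptions made for illustration; this is not the original pretraining code.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}  # never selected as prediction targets


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels).

    labels[i] holds the original token at positions the model must predict
    and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Skip special tokens and, with probability 1 - mask_prob, ordinary ones.
        if tok in SPECIAL or rng.random() >= mask_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # ~80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # ~10%: replace with a random token
        # remaining ~10%: keep the original token
    return corrupted, labels


tokens = "[CLS] The best part of today is lab . [SEP]".split()
toy_vocab = ["cat", "dog", "lab", "today"]  # hypothetical vocabulary
corrupted, labels = mask_tokens(tokens, toy_vocab, mask_prob=0.5, seed=1)
print(corrupted)
print(labels)
```

The loss is computed only at positions where `labels` is not `None`, which is why the model cannot simply copy its input.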

A factor in BERT's popularity has been the release of its [code](https://github.com/google-research/bert) and pretrained models (and for us, a re-implementation in [PyTorch](https://github.com/huggingface/pytorch-pretrained-BERT)). BERT's performance and accessibility mean that it is becoming the _de facto_ baseline approach to NLU tasks. We'll play around with the code release today.

In [2]:
!pip install pytorch_pretrained_bert

Collecting pytorch_pretrained_bert
  Downloading https://files.pythonhosted.org/packages/5d/3c/d5fa084dd3a82ffc645aba78c417e6072ff48552e3301b1fa3bd711e03d4/pytorch_pretrained_bert-0.6.1-py3-none-any.whl (114kB)
    100% |████████████████████████████████| 122kB 4.7MB/s
Installing collected packages: pytorch-pretrained-bert
Successfully installed pytorch-pretrained-bert-0.6.1


In [0]:
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM
version = "bert-base-cased"

In [0]:
tokenizer = BertTokenizer.from_pretrained(version)
model = BertForMaskedLM.from_pretrained(version)
model.eval()
mask_tok = "[MASK]"

### Important Implementation Details

Take note of the following implementation details while using BERT:
- tokenization: BERT uses wordpiece tokenization, which is a subword tokenization scheme.
- special tokens: Instead of beginning/end-of-sentence tokens (e.g. SOS/EOS), BERT uses [CLS] and [SEP], which stand for classification and separation, respectively. For masked language modeling, the authors also reserve a [MASK] token. It is crucial that you match these tokens exactly.
- modeling tasks: To apply BERT to any task, the inputs are prepended with the [CLS] token. The output of the Transformer corresponding to this [CLS] token is taken as the representation of the _entire_ input. To apply BERT to sentence pair tasks, the authors concatenate the two sentence inputs with a [SEP] token (and only one [CLS] token).
- segment IDs: To indicate where the two sentences begin and end, the authors also input a _segment ID_ (either 0 or 1), that is embedded and used to compute representations. The segment ID is also used in the single-sentence case.
- many models: the `pytorch-pretrained-bert` library contains many versions of BERT depending on the task you want to do. Make sure that you are using the architecture appropriate for your task (e.g. `BertForMultipleChoice` if you have a multiple choice reading comprehension task) and the correct pretrained version of BERT (base vs large, cased vs uncased, multilingual vs monolingual).
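
To illustrate the wordpiece point above, here is a minimal sketch of greedy longest-match-first subword splitting, the general strategy wordpiece tokenizers follow at inference time. The function name `wordpiece_tokenize` and the tiny vocabulary are hypothetical; in practice you would call `tokenizer.tokenize(...)`, which also handles punctuation, casing, and the real vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first search."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # "##" marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"secret", "##ions", "jack", "##son", "##ville", "the"}  # hypothetical
print(wordpiece_tokenize("secretions", toy_vocab))    # ['secret', '##ions']
print(wordpiece_tokenize("jacksonville", toy_vocab))  # ['jack', '##son', '##ville']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

Because rare words are broken into known pieces, the vocabulary can stay small while almost never producing `[UNK]`.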

Below is an example of using BERT for masked language modeling, also known as a _cloze_ task.

In [0]:
text = "[CLS] The best part of today is lab . [SEP]"
tokenized_text = text.split()  # simple whitespace split; for arbitrary text use tokenizer.tokenize(text)
mask_idx = 7  # position of "lab" in the token list
tokenized_text[mask_idx] = mask_tok
print(tokenized_text)

['[CLS]', 'The', 'best', 'part', 'of', 'today', 'is', '[MASK]', '.', '[SEP]']


In [0]:
tok_idxs = tokenizer.convert_tokens_to_ids(tokenized_text)
seg_idxs = [0] * len(tok_idxs)
tok_tensor = torch.tensor([tok_idxs])
seg_tensor = torch.tensor([seg_idxs])
print(tok_tensor)
print(seg_tensor)

tensor([[ 101, 1109, 1436, 1226, 1104, 2052, 1110,  103,  119,  102]])
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])


In [0]:
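# preds has shape (batch, seq_len, vocab_size): one score per vocabulary item at each position
with torch.no_grad():  # inference only, so skip gradient tracking
    preds = model(tok_tensor, seg_tensor)
pred_idx = torch.argmax(preds[0, mask_idx]).item()
pred_tok = tokenizer.convert_ids_to_tokens([pred_idx])[0]
print("Predicted \"%s\"" % pred_tok)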
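
The cell above keeps only the single highest-scoring token. To look at runner-up candidates, the scores at the masked position can be passed through a softmax and ranked. The sketch below does this in plain Python on made-up logits; the function name `top_k_from_logits` and the toy vocabulary are assumptions. With the real model, `preds[0, mask_idx].tolist()` could supply the logits, and `torch.topk` would be a more direct route.

```python
import math


def top_k_from_logits(logits, id_to_token, k=3):
    """Return the k most probable (token, probability) pairs under a softmax."""
    m = max(logits)                             # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(id_to_token[i], probs[i]) for i in ranked]


toy_vocab = ["lab", "lunch", "over", "here"]    # hypothetical 4-word vocabulary
toy_logits = [2.0, 3.0, 0.5, -1.0]              # made-up scores at the [MASK] position
for tok, p in top_k_from_logits(toy_logits, toy_vocab, k=2):
    print("%s\t%.3f" % (tok, p))
```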

__Exercise__: Write code to process a sentence pair example and use BERT to evaluate a masked location in that pair.

### Fine-Tuning BERT

Masked language modeling is fun, but we care a lot about other NLU tasks. Let's try to _fine-tune_ BERT for SST-2, the binary sentiment classification task in GLUE.
Download [this SST data](https://gluebenchmark.com/tasks), and upload it if you're on Colab (next cell).

In [0]:
from google.colab import files
uploaded = files.upload()

In [0]:
data_dir = "."
max_seq_len = 128
batch_size = 32
n_epochs = 3
learning_rate = 1e-5
validate_every = 100
print_every = 50

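At the bottom of this notebook are some functions to help with data loading and processing. Go to the bottom and run those first: the cells below rely on names defined there (for example `n_labels`, `n_train_steps`, and `eval_dataloader`).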

We'll use the `BertForSequenceClassification` class and the BERT-specific optimizer, `BertAdam`. For other tasks, you'll need different task-specific architectures, such as `BertForQuestionAnswering`.

In [0]:
from pytorch_pretrained_bert import BertTokenizer, BertForSequenceClassification, BertAdam

# Prepare model
tokenizer = BertTokenizer.from_pretrained(version)
model = BertForSequenceClassification.from_pretrained(version, num_labels = n_labels)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Prepare optimizer
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]
optimizer = BertAdam(optimizer_grouped_parameters, lr=learning_rate, warmup=0.1, t_total=n_train_steps)
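
The `warmup=0.1` and `t_total` arguments ask for a learning rate that ramps up over the first 10% of training steps and then decays. The sketch below shows the general shape of such a schedule (linear warmup, then linear decay to zero); the function name `warmup_linear_lr` is hypothetical, and the exact formula used inside `BertAdam` may differ between library versions.

```python
def warmup_linear_lr(step, t_total, base_lr, warmup=0.1):
    """Learning rate at `step`: linear warmup to base_lr, then linear decay to 0."""
    progress = step / t_total                 # fraction of training completed
    if progress < warmup:
        return base_lr * progress / warmup    # ramp up from 0 to base_lr
    # Decay linearly from base_lr (at the end of warmup) to 0 (at t_total).
    return base_lr * max((1.0 - progress) / (1.0 - warmup), 0.0)


for step in (0, 50, 100, 550, 1000):
    print(step, warmup_linear_lr(step, t_total=1000, base_lr=1e-5))
```

Warmup keeps the first few updates small, which tends to help when fine-tuning a large pretrained model with a freshly initialized classification layer.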

**Warmup**: write a function `evaluate` that takes in a model and a (classification) dataset and evaluates the model on that dataset, returning the accuracy.

In [0]:
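def accuracy(out, labels):
    # Returns the number of correct predictions (a count, not a rate);
    # evaluate() divides by the number of examples afterwards.
    outputs = np.argmax(out, axis=1)
    return np.sum(outputs == labels)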

def evaluate(model, dataloader, batch_size=32):
    model.eval()
    eval_loss, eval_accuracy = 0., 0.
    nb_eval_steps, nb_eval_examples = 0, 0

    for input_ids, input_mask, segment_ids, label_ids in dataloader:
        input_ids = input_ids.to(device)
        input_mask = input_mask.to(device)
        segment_ids = segment_ids.to(device)
        label_ids = label_ids.to(device)

        with torch.no_grad():
            # With labels the model returns the loss; without labels it returns the logits
            tmp_eval_loss = model(input_ids, segment_ids, input_mask, label_ids)
            logits = model(input_ids, segment_ids, input_mask)

        logits = logits.detach().cpu().numpy()
        label_ids = label_ids.to('cpu').numpy()
        tmp_eval_accuracy = accuracy(logits, label_ids)

        eval_loss += tmp_eval_loss.mean().item()
        eval_accuracy += tmp_eval_accuracy

        nb_eval_examples += input_ids.size(0)
        nb_eval_steps += 1

    eval_loss = eval_loss / nb_eval_steps
    eval_accuracy = eval_accuracy / nb_eval_examples
    result = {'eval_loss': eval_loss,
              'eval_accuracy': eval_accuracy}
    return result

**Exercise**: Write the fine-tuning (training) procedure for a fixed number of epochs.

Important lines:
- `batch = tuple(t.to(device) for t in batch)`
- `input_ids, input_mask, segment_ids, label_ids = batch`
- `loss = model(input_ids, segment_ids, input_mask, label_ids)`

### RUN THIS CODE BEFORE EXECUTING STUFF ###

In [0]:
import os
import csv
import sys
import numpy as np

class InputExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        """Constructs a InputExample.
        Args:
            guid: Unique id for the example.
            text_a: string. The untokenized text of the first sequence. For single
            sequence tasks, only this sequence must be specified.
            text_b: (Optional) string. The untokenized text of the second sequence.
            Only must be specified for sequence pair tasks.
            label: (Optional) string. The label of the example. This should be
            specified for train and dev examples, but not for test examples.
        """
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label


class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, input_ids, input_mask, segment_ids, label_id):
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_id = label_id
        
class DataProcessor(object):
    """Base class for data converters for sequence classification data sets."""

    def get_train_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the train set."""
        raise NotImplementedError()

    def get_dev_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the dev set."""
        raise NotImplementedError()

    def get_labels(self):
        """Gets the list of labels for this data set."""
        raise NotImplementedError()

    @classmethod
    def _read_tsv(cls, input_file, quotechar=None):
        """Reads a tab separated value file."""
        with open(input_file, "r") as f:
            reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
            lines = []
            for line in reader:
                if sys.version_info[0] == 2:
                    line = list(unicode(cell, 'utf-8') for cell in line)
                lines.append(line)
            return lines
        
class Sst2Processor(DataProcessor):
    """Processor for the SST-2 data set (GLUE version)."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_labels(self):
        """See base class."""
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, i)
            text_a = line[0]
            label = line[1]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples

In [0]:
def convert_examples_to_features(examples, label_list, max_seq_length, tokenizer):
    """Converts a list of `InputExample`s into a list of `InputFeatures`."""

    label_map = {label : i for i, label in enumerate(label_list)}

    features = []
    for (ex_index, example) in enumerate(examples):
        tokens_a = tokenizer.tokenize(example.text_a)

        tokens_b = None
        if example.text_b:
            tokens_b = tokenizer.tokenize(example.text_b)
            # Modifies `tokens_a` and `tokens_b` in place so that the total
            # length is less than the specified length.
            # Account for [CLS], [SEP], [SEP] with "- 3"
            _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
        else:
            # Account for [CLS] and [SEP] with "- 2"
            if len(tokens_a) > max_seq_length - 2:
                tokens_a = tokens_a[:(max_seq_length - 2)]

        # The convention in BERT is:
        # (a) For sequence pairs:
        #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
        #  type_ids: 0   0  0    0    0     0       0 0    1  1  1  1   1 1
        # (b) For single sequences:
        #  tokens:   [CLS] the dog is hairy . [SEP]
        #  type_ids: 0   0   0   0  0     0 0
        #
        # Where "type_ids" are used to indicate whether this is the first
        # sequence or the second sequence. The embedding vectors for `type=0` and
        # `type=1` were learned during pre-training and are added to the wordpiece
        # embedding vector (and position vector). This is not *strictly* necessary
        # since the [SEP] token unambiguously separates the sequences, but it makes
        # it easier for the model to learn the concept of sequences.
        #
        # For classification tasks, the first vector (corresponding to [CLS]) is
        # used as the "sentence vector". Note that this only makes sense because
        # the entire model is fine-tuned.
        tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
        segment_ids = [0] * len(tokens)

        if tokens_b:
            tokens += tokens_b + ["[SEP]"]
            segment_ids += [1] * (len(tokens_b) + 1)

        input_ids = tokenizer.convert_tokens_to_ids(tokens)

        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        input_mask = [1] * len(input_ids)

        # Zero-pad up to the sequence length.
        padding = [0] * (max_seq_length - len(input_ids))
        input_ids += padding
        input_mask += padding
        segment_ids += padding

        assert len(input_ids) == max_seq_length
        assert len(input_mask) == max_seq_length
        assert len(segment_ids) == max_seq_length

        label_id = label_map[example.label]
        if ex_index < 5:
            print("*** Example ***")
            print("guid: %s" % (example.guid))
            print("tokens: %s" % " ".join(
                    [str(x) for x in tokens]))
            print("input_ids: %s" % " ".join([str(x) for x in input_ids]))
            print("input_mask: %s" % " ".join([str(x) for x in input_mask]))
            print(
                    "segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
            print("label: %s (id = %d)" % (example.label, label_id))

        features.append(
                InputFeatures(input_ids=input_ids,
                              input_mask=input_mask,
                              segment_ids=segment_ids,
                              label_id=label_id))
    return features

In [0]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset


processor = Sst2Processor()
n_labels = 2
label_list = processor.get_labels()

train_examples = processor.get_train_examples(data_dir)
train_features = convert_examples_to_features(train_examples, label_list, max_seq_len, tokenizer)

all_input_ids = torch.tensor([f.input_ids for f in train_features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in train_features], dtype=torch.long)
all_segment_ids = torch.tensor([f.segment_ids for f in train_features], dtype=torch.long)
all_label_ids = torch.tensor([f.label_id for f in train_features], dtype=torch.long)
train_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

eval_examples = processor.get_dev_examples(data_dir)
eval_features = convert_examples_to_features(eval_examples, label_list, max_seq_len, tokenizer)
all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long)
eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
eval_sampler = SequentialSampler(eval_data)
eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=batch_size)


n_train_steps = int(len(train_examples) / batch_size) * n_epochs

*** Example ***
guid: train-1
tokens: [CLS] hide new secret ##ions from the parental units [SEP]
input_ids: 101 4750 1207 3318 5266 1121 1103 22467 2338 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
input_mask: 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
label: 0 (id = 0)
*** Example ***
guid: train-2
tokens: [CLS] contains

## Other Pretrained Models

There are a variety of other pretrained models that have been productively used for transfer learning.

### GPT and GPT2

OpenAI released a high-quality unidirectional Transformer language model known as GPT (which actually stands for "Generative PreTraining") that beat out ELMo (see below), likely due to increased model size, improved model architecture, and contiguous sentence data. Soon after, they released a larger version, GPT2, that is even bigger and trained on even more data. This model is so scary, the authors did not release the full pretrained weights for fear of the horrors they might unleash into the world. You may have heard about it in the news. GPT and GPT2 are available as part of pytorch-pretrained-bert (Big & Extending Repository of pretrained Transformers).

### ELMo

ELMo [(Peters et al., 2018)](https://allennlp.org/elmo) was the first of the pretrained language models that was accompanied by an easy-to-use code release. The model consists of a pair of unidirectional two-layer language models whose representations are concatenated to form each token's representations. When applying ELMo to a downstream task, the model is typically _frozen_ except for a set of scalar layer mixing weights. ELMo is available as part of AllenNLP.
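
The "scalar layer mixing weights" can be sketched as follows: each layer gets one learnable scalar, the scalars are normalized with a softmax, and a token's task-specific representation is the weighted sum of its per-layer vectors, scaled by a learnable factor gamma. The function name `scalar_mix` and the toy vectors are assumptions for illustration; AllenNLP ships its own implementation.

```python
import math


def scalar_mix(layer_vectors, scalar_params, gamma=1.0):
    """Combine one token's per-layer vectors into a single vector.

    layer_vectors: list of equal-length vectors, one per layer
    scalar_params: one unnormalized weight per layer (learned during fine-tuning)
    gamma:         global scale factor (also learned)
    """
    m = max(scalar_params)
    exps = [math.exp(s - m) for s in scalar_params]
    total = sum(exps)
    weights = [e / total for e in exps]        # softmax over layers
    dim = len(layer_vectors[0])
    return [
        gamma * sum(w * vec[d] for w, vec in zip(weights, layer_vectors))
        for d in range(dim)
    ]


# Toy example: three layers, 2-dimensional vectors, equal weights -> plain average
layers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(scalar_mix(layers, scalar_params=[0.0, 0.0, 0.0]))
```

Only the scalars and gamma are trained alongside the downstream model, which is what lets the language model itself stay frozen.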

### CoVe

Some other work explores pretraining models on tasks other than language modeling. CoVe [(McCann et al., 2018)](https://github.com/salesforce/cove) is pretrained on English-German translation and came out slightly ahead of ELMo. There is public code for CoVe, but it is not as easy to use as ELMo and BERT.

In closing, the general hierarchy of models is CoVe < ELMo < GPT < BERT in terms of effectiveness and ease of use, though given how easy these models are to use, it's not a bad idea to try several of them.

## BONUS ?!