# Pretrained Transformers

Created by [Gerard I. Gállego](https://www.linkedin.com/in/gerard-gallego/) for the [Postgraduate Course in Artificial Intelligence with Deep Learning](https://www.talent.upc.edu/ing/estudis/formacio/curs/310400/postgrau-artificial-intelligence-deep-learning/) ([UPC School](https://www.talent.upc.edu/ing/), 2021).

The Transformer was first designed for Machine Translation, but Transformer-based architectures are now used in many different fields. Beyond the original encoder-decoder architecture, the newer architectures can be classified as encoder-based or decoder-based.

In this lab, we will see how to use pretrained Transformers from [🤗 Hugging Face](https://huggingface.co/), for tasks like Sentiment Analysis and Text Generation.

Note: If this is your first time working with the Transformer, we recommend completing [this notebook](https://colab.research.google.com/github/telecombcn-dl/labs-all/blob/main/labs/transformer/lab_transformer1_todo.ipynb) first.

In [None]:
pip install -qq transformers datasets

In [None]:
import logging
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader

from datasets import load_dataset
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

from sklearn.decomposition import PCA


class DisableLogger():
    """Context manager that silences all logging calls inside its block."""
    def __enter__(self):
        logging.disable(logging.CRITICAL)
    def __exit__(self, exit_type, exit_value, exit_traceback):
        logging.disable(logging.NOTSET)


## Encoder-based models (BERT-ish)

Models of this type, inspired by [BERT](https://arxiv.org/abs/1810.04805), consist of a Transformer encoder pre-trained with a self-supervised strategy (check [this](https://jalammar.github.io/illustrated-bert#bert-from-decoders-to-encoders) for more information).
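
The self-supervised strategy used by BERT is masked language modeling: some input tokens are hidden and the encoder is trained to recover them from the surrounding context. The block below is a simplified, plain-Python sketch of how such training pairs could be built. The names `mask_tokens` and `mask_prob` and the toy sentence are illustrative assumptions, not part of any library. Real BERT pre-training works on subword ids and does not always substitute `[MASK]` for the selected tokens.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens and keep the originals as labels."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)    # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)   # no loss is computed at this position
    return masked, labels

# Toy, whitespace-tokenized sentence (hypothetical example).
tokens = "the movie was surprisingly good and the cast was great".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

No human annotation is needed: the labels come from the text itself, which is what makes the strategy self-supervised.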

Once pre-trained, the model returns sentence embeddings and contextualized word representations, which can be used for many downstream tasks.

In this section we will use [DistilBERT](https://medium.com/huggingface/distilbert-8cf3380435b5)* to perform Sentiment classification on IMDB movie reviews.


\* Far fewer parameters than BERT, with similar performance.


In [None]:
distilbert_tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
distilbert = AutoModel.from_pretrained("distilbert-base-uncased")

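def get_sents_representations(sents):
    # Tokenize the batch, padding to the longest sentence.
    encoded_input = distilbert_tokenizer(sents, return_tensors='pt', padding=True, truncation=True)

    # First element of the model output: last hidden states, (B x T x 768).
    distilbert_output = distilbert(**encoded_input)[0]
    # The first token ([CLS]) is used as the sentence representation, (B x 768).
    sentence_repr = distilbert_output[:, 0]

    return distilbert_output, sentence_repr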

In [None]:
#@title  { run: "auto", vertical-output: true }

#@markdown Project 5 DistilBERT sentence representations to 2-D
sentence_1 = "Hello, my name is Joe" #@param {type:"string"}
sentence_2 = "Hi, I'm Joey" #@param {type:"string"}
sentence_3 = "Goodbye, see you at 5pm" #@param {type:"string"}
sentence_4 = "Bye, see you later" #@param {type:"string"}
sentence_5 = "Attention is All You Need" #@param {type:"string"}


sentences = [sentence_1, sentence_2, sentence_3, sentence_4, sentence_5]

distilbert_output, sentence_repr = get_sents_representations(sentences)

print(f"DistilBERT output: {distilbert_output.shape}")
print(f"Sentence representations: {sentence_repr.shape}")
print("\n")

pca = PCA(n_components=2)
sentence_repr_2d = pca.fit_transform(sentence_repr.detach().numpy())

fig, ax = plt.subplots()
plt.scatter(sentence_repr_2d[:,0], sentence_repr_2d[:,1])
plt.title("Sentence representations (PCA projection)")
plt.xlim(sentence_repr_2d[:,0].min() - 1, sentence_repr_2d[:,0].max() + 4)
plt.ylim(sentence_repr_2d[:,1].min() - 1, sentence_repr_2d[:,1].max() + 1)

for x, y, s in zip(sentence_repr_2d[:,0], sentence_repr_2d[:,1], sentences):
    plt.text(x+0.15, y+0.15, s)

plt.show()
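
The PCA plot gives a visual impression of which sentences lie close together. A numerical way to compare two representations is cosine similarity, sketched below in plain Python. The three short vectors are made-up stand-ins for the 768-dimensional DistilBERT representations, and `cosine_similarity` is a helper written for this illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-D vectors standing in for 768-D sentence representations.
greeting_a = [1.0, 0.2, 0.0]
greeting_b = [0.9, 0.3, 0.1]
farewell = [-0.1, 0.1, 1.0]

sim_close = cosine_similarity(greeting_a, greeting_b)
sim_far = cosine_similarity(greeting_a, farewell)
print(f"greeting vs greeting: {sim_close:.3f}")
print(f"greeting vs farewell: {sim_far:.3f}")
```

With the real tensors, the same comparison should be possible with `F.cosine_similarity(sentence_repr[0], sentence_repr[1], dim=0)`.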

We've seen that the model extracts meaningful sentence representations. Now, we'll build a Sentiment classifier (positive/negative movie reviews) by appending a simple MLP to DistilBERT.

In [None]:
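# Subsampling the dataset to make the exercise faster.
# Both ends of each split are taken so that both classes are represented.
imdb_train = load_dataset("imdb", split='train[:10%]+train[-10%:]')
# Evaluate on the held-out test split so that no review is seen during training.
imdb_test = load_dataset("imdb", split='test[:5%]+test[-5%:]')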

def collate_imdb(batch):
    input_enc = distilbert_tokenizer(
        [s['text'] for s in batch],
        truncation=True, padding=True,
        max_length=512, return_tensors='pt'
    )
    label = torch.tensor([s['label'] for s in batch], dtype=torch.float).unsqueeze(-1)
    return input_enc, label

In [None]:
class DistilBERTSentenceClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.distilbert = AutoModel.from_pretrained("distilbert-base-uncased")
        self.classifier = nn.Sequential(
            nn.Linear(768, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        with torch.no_grad():
            distilbert_output = self.distilbert(**x)[0]  # (B x T x 768)
            sentence_repr = distilbert_output[:, 0]      # (B x 768)

        out = self.classifier(sentence_repr)
        return out

In [None]:
lr = 1e-3
batch_size = 64
log_interval = 10
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

imdb_loader_train = DataLoader(
    imdb_train,
    batch_size=batch_size,
    shuffle=True,
    collate_fn=collate_imdb
)

model = DistilBERTSentenceClassifier()
model.to(device)
model.train()

optimizer = optim.Adam(model.classifier.parameters(), lr=lr)
criterion = F.binary_cross_entropy

print("Training model...")

loss_avg = 0
for i, (net_input, target) in enumerate(imdb_loader_train):
    net_input, target = net_input.to(device), target.to(device)

    optimizer.zero_grad()
    output = model(net_input)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()

    loss_avg += loss.item()
    if (i+1) % log_interval == 0:
        loss_avg /= log_interval
        print(f"{i+1}/{len(imdb_loader_train)}\tLoss: {loss_avg}")
        loss_avg = 0  # reset so each report averages only the last log_interval batches

In [None]:
batch_size_test = 128
log_interval_test = 5

imdb_loader_test = DataLoader(
    imdb_test,
    batch_size=batch_size_test,
    shuffle=True,
    collate_fn=collate_imdb
)

model.eval()

print("\nTesting model...")

n_correct = 0
n_total = 0
for i, (net_input, target) in enumerate(imdb_loader_test):
    net_input = net_input.to(device)

    output = model(net_input).detach().cpu()

    n_correct += torch.eq(output.round(), target).sum().item()
    n_total += output.numel()

    if (i+1) % log_interval_test == 0:
        print(f"{i+1}/{len(imdb_loader_test)}")

print(f"Test Accuracy: {round(100 * n_correct / n_total, 2)}%")

As you can see, we get a remarkable result by just training a small classifier with 2 layers (~200k params) on top of DistilBERT.
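
The ~200k figure can be checked with a short calculation. The sketch below counts weights and biases for the two linear layers of the classifier head, using the layer sizes from the model definition above (768 → 256 → 1). `linear_params` is a helper introduced only for this illustration.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

# Layer sizes taken from the classifier head: 768 -> 256 -> 1.
hidden = linear_params(768, 256)   # 768 * 256 weights + 256 biases
output = linear_params(256, 1)     # 256 weights + 1 bias
total = hidden + output
print(f"Trainable parameters in the head: {total:,}")
# prints: Trainable parameters in the head: 197,121
```

Only these parameters are passed to the optimizer; the DistilBERT weights stay frozen because the forward pass wraps the encoder in `torch.no_grad()`.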

## Decoder-based models (GPT-ish)

On the other hand, using the Transformer Decoder autoregressively led to powerful synthetic text generation models. OpenAI has proposed increasingly powerful models in recent years ([GPT-2](http://www.persagen.com/files/misc/radford2019language.pdf), [GPT-3](https://arxiv.org/abs/2005.14165)).

These models consist of a Transformer Decoder, but without the encoder-decoder attention blocks. Check [this](https://jalammar.github.io/illustrated-gpt2/) for more information.
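
Without the encoder-decoder attention, what remains in each block is masked self-attention: every position may attend only to itself and to earlier positions. The block below is a minimal plain-Python sketch of that causal mask. `causal_mask` is an illustrative helper, not a library function.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

mask = causal_mask(4)
for row in mask:
    print(" ".join("1" if allowed else "0" for allowed in row))
# prints:
# 1 0 0 0
# 1 1 0 0
# 1 1 1 0
# 1 1 1 1
```

The lower-triangular pattern is what allows the model to be used autoregressively: the prediction for the next token never depends on tokens that have not been generated yet.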

In this section we will generate synthetic text by using pretrained models. We will also try some models which have been fine-tuned for some specific domains.
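
The `generate_text` helper defined in the next cell either decodes greedily or samples with a temperature of 0.9. The sketch below illustrates the difference for a single decoding step in plain Python. The logits are made up, and `softmax_with_temperature` and `pick_next_token` are hypothetical helpers. The actual `generate` method of 🤗 Transformers applies further processing on top of this basic idea.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn scores into probabilities; lower temperature gives a sharper distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_next_token(logits, temperature=0.9, greedy=False, rng=random):
    """Return the index of the next token for one decoding step."""
    if greedy:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Sampling: draw a token according to the temperature-scaled probabilities.
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-token vocabulary
print(softmax_with_temperature(toy_logits, temperature=0.9))
print(pick_next_token(toy_logits, greedy=True))   # prints 0
print(pick_next_token(toy_logits, greedy=False))  # may differ between runs
```

Greedy decoding returns the same continuation every time, while sampling produces varied text, which is why repeated calls with the same prompt give different outputs.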


In [None]:
def download_hf_model(name):
    model = AutoModelForCausalLM.from_pretrained(name)
    tokenizer = AutoTokenizer.from_pretrained(name)
    # Use the GPU when available instead of assuming CUDA is present.
    model.to('cuda' if torch.cuda.is_available() else 'cpu')
    return model, tokenizer


def generate_text(prompt, model, tokenizer, max_length=100, greedy=False):
    # Keep the prompt on the same device as the model.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    with DisableLogger():
        gen_tokens = model.generate(input_ids, do_sample=(not greedy), temperature=0.9, max_length=max_length)
    print(tokenizer.batch_decode(gen_tokens.cpu())[0])

In [None]:
#@title  { run: "auto", vertical-output: true }
#@markdown Download a pretrained text generation model:
model_name = "gpt2" #@param ["distilgpt2", "gpt2"]
model, tokenizer = download_hf_model(model_name)


In [None]:
generate_text("The researcher presented an astonishing deep learning model. ", model, tokenizer)

Normally, language models are trained with as much (and as diverse) data as possible. In some scenarios, it can be useful to fine-tune those general text generators for more specific domains. Now we will test some models of this type:

- [ktrapeznikov/gpt2-medium-topic-small-set](https://huggingface.co/ktrapeznikov/gpt2-medium-topic-news-v2): Generates fake news. The prompt must follow the format `topic {topic} source {source} title {title} body`

- [lvwerra/gpt2-imdb-pos](https://huggingface.co/lvwerra/gpt2-imdb-pos): A small GPT2 language model fine-tuned to produce positive movie reviews, based on the IMDB dataset.

- [mrm8488/gpt2-imdb-neg](https://huggingface.co/mrm8488/gpt2-imdb-neg): A small GPT2 language model fine-tuned to produce negative movie reviews, based on the IMDB dataset.

In [None]:
#@title  { run: "auto", vertical-output: true }
#@markdown Download a pretrained text generation model:
model_name = "ktrapeznikov/gpt2-medium-topic-small-set" #@param ["ktrapeznikov/gpt2-medium-topic-small-set", "lvwerra/gpt2-imdb-pos", "mrm8488/gpt2-imdb-neg"]
model, tokenizer = download_hf_model(model_name)


In [None]:
#@title  { run: "auto", vertical-output: true }
prompt = "topic politics source washington post title Donald Trump gives his fortune to the poor body" #@param ["topic politics source washington post title Donald Trump gives his fortune to the poor body", "The movie"] {allow-input: true}
max_length = 500 #@param {type:"number"}

generate_text(prompt, model, tokenizer, max_length)

## References

The code is partially inspired by:
- https://huggingface.co/transformers/custom_datasets.html#sequence-classification-with-imdb-reviews