In [7]:
%run supportvectors-common.ipynb



<center><img src="https://d4x5p7s4.rocketcdn.me/wp-content/uploads/2016/03/logo-poster-smaller.png"/> </center>
<div style="color:#aaa;font-size:8pt">
<hr/>
&copy; SupportVectors. All rights reserved. <blockquote>This notebook is the intellectual property of SupportVectors, and part of its training material. 
Currently, only participants in SupportVectors workshops are allowed to study the notebooks for educational purposes; copying or using them for any other purpose without written permission is prohibited.

<b> These notebooks are chapters and sections from Asif Qamar's textbook that he is writing on Data Science. So we request you to not circulate the material to others.</b>
 </blockquote>
 <hr/>
</div>



## <img src="https://huggingface.co/datasets/huggingface/brand-assets/resolve/main/hf-logo.png" width="60px">  Pretrained Transformers

In this lab, we will delve into pretrained transformers.

In particular, we will continue our running example of the sentiment-analysis classification problem, and see how to use a Hugging Face transformer for this task.

### Transformers

A pre-trained transformer, such as `bert` in the Hugging Face repository, exists as a headless body: in other words, it produces latent state embeddings. The weights of the transformer have been trained on some general task, such as MLM (masked language modeling): mask a few tokens in a sentence, and train the transformer to reconstruct the missing tokens as accurately as possible, over a vast corpus of documents. These weights form the **checkpoint** of the transformer model. Therefore, when we download and load a transformer from the Hugging Face repository, we are actually loading the model with these particular trained checkpoint weights.
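
To make the masked-language-model objective concrete, here is a minimal sketch of the masking step only. It works on whitespace-separated words rather than a real subword tokenizer. The helper name `mask_tokens`, the default masking rate, and the fixed seed are assumptions chosen for illustration; they are not taken from any library.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with a mask token (illustrative helper).

    Returns the masked sequence and a dict {position: original token},
    which is what a model would be trained to reconstruct.
    """
    rng = random.Random(seed)
    # Mask at least one token so that there is always something to predict.
    n_to_mask = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = mask_token
    return masked, targets

tokens = "a thing of beauty is a joy for ever".split()
masked, targets = mask_tokens(tokens)
print(masked)   # the sentence with one word hidden
print(targets)  # the hidden word, keyed by its position
```

During pre-training, the transformer sees only `masked` and is penalized according to how poorly it predicts the entries in `targets`.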

Such a transformer body can be used for a variety of tasks by using these latent state embeddings as input to a head meant to perform a particular task.

Let us say that we would like to create a text classifier for sentiment analysis -- i.e. classify each piece of text into the sentiment it represents. Then the pipeline we can build would look like this:

<img src = "classifier-pipeline.png"/>



First, we will need to tokenize the input text using a tokenizer. We covered this in a previous lab. The output of the tokenizer becomes the input to the transformer. The transformer will emit a hidden state embedding for each token of the text. These embeddings become the input to the classifier network, which we can train using gradient descent and backpropagation.

Therefore, we typically make the classifier itself a differentiable function, made of a few layers of a feed-forward network. In its simplest form, it could be a single softmax layer.

For classification purposes, it is common to take only the transformer-generated embedding of the `[CLS]` token as the input to the classifier, if one is using the `bert` transformer.
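
As a rough sketch of such a head, the snippet below slices out the first-token embedding from a batch of hidden states and feeds it to a single linear layer followed by a softmax. The hidden states here are random stand-ins for real transformer output, and the names `ClsSoftmaxHead`, `hidden_size`, and `num_classes` are illustrative assumptions, not part of the `transformers` API.

```python
import torch
from torch import nn

class ClsSoftmaxHead(nn.Module):
    """Hypothetical single-layer classification head over the [CLS] embedding."""

    def __init__(self, hidden_size: int, num_classes: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, num_classes)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # last_hidden_state has shape (batch, sequence_length, hidden_size)
        cls_embedding = last_hidden_state[:, 0]   # (batch, hidden_size)
        logits = self.linear(cls_embedding)       # (batch, num_classes)
        return torch.softmax(logits, dim=-1)

torch.manual_seed(0)
# Random stand-in for the transformer output: 2 texts, 11 tokens, 768 dimensions.
fake_hidden_states = torch.randn(2, 11, 768)
head = ClsSoftmaxHead(hidden_size=768, num_classes=5)
probabilities = head(fake_hidden_states)
print(probabilities.shape)
```

When training such a head, one would usually return the logits instead and use `nn.CrossEntropyLoss`, which applies the log-softmax internally.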

<hr />
<img src="nlp-with-transformers-book.png"  width="150" style="padding:20px;" align="left"> <b>Note</b>: Some of the code snippets below are inspired by, or directly taken from, the book <a href="https://www.amazon.com/dp/1098136799"> Natural Language Processing with Transformers, Revised Edition. </a>


In [22]:
from transformers import AutoModelForSequenceClassification

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(105879, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12

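`DistilBERT` is a much smaller model that approximates the accuracy of the much larger `BERT` model. Since it is much easier to use for inference, especially on laptops without GPU/TPU acceleration, it is a good fit for labs like this one. The cells shown in this notebook, however, load the `nlptown/bert-base-multilingual-uncased-sentiment` checkpoint, a BERT model that already carries a sentiment-classification head.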

**Note**: Considering that many of us do not have powerful laptops with tensor accelerators, we have not moved the model to the "cuda" device. However, if you do have that available, strongly consider using that with the syntax:

```
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
```

Let us explore what the output of the transformer looks like. First, we will take a text and tokenize it; then we will pass it to the transformer and look at the output.

In [23]:

from transformers import AutoTokenizer

checkpoint = "nlptown/bert-base-multilingual-uncased-sentiment" 
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
text = 'A thing of beauty is a joy for ever'
inputs = tokenizer(text, return_tensors='pt')
print(inputs)
print(f"The shape of the input is {inputs['input_ids'].size()}")

{'input_ids': tensor([[  101,   143, 21973, 10108, 25209, 10127,   143, 27318, 10139, 15765,
           102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
The shape of the input is torch.Size([1, 11])


In [24]:
tokenizer.model_input_names

['input_ids', 'token_type_ids', 'attention_mask']

In what follows, for simplicity, we will skip the step of loading the model onto the GPU. As an exercise, augment the code to run it on the GPU. Hint: use `.to(device)` on the input values before passing them to the model.
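
One possible way to approach the hint is sketched below on a hand-made dictionary of tensors rather than on real tokenizer output: every tensor in the dictionary is moved to the chosen device before the forward pass. The name `example_inputs` and the token ids in it are made-up placeholders.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hand-made stand-in for the tokenizer output (placeholder values).
example_inputs = {
    "input_ids": torch.tensor([[101, 143, 102]]),
    "attention_mask": torch.tensor([[1, 1, 1]]),
}

# Move each tensor to the target device; the model must live on the same device.
example_inputs = {name: tensor.to(device) for name, tensor in example_inputs.items()}
print({name: tensor.device for name, tensor in example_inputs.items()})
```

With the real objects, the same dictionary comprehension would be applied to `inputs`, after the model itself has been moved with `model.to(device)`.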

In [35]:
import torch
with torch.no_grad():
    outputs = model(**inputs)
    print(outputs)
    
logits = outputs.logits
logits

SequenceClassifierOutput(loss=None, logits=tensor([[-2.2633, -2.5239, -0.9028,  1.4483,  3.4063]]), hidden_states=None, attentions=None)


tensor([[-2.2633, -2.5239, -0.9028,  1.4483,  3.4063]])

In [31]:
logits.size()

torch.Size([1, 5])

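We see that the output is a vector of 5 logits for the 1 input sentence, one per sentiment class. Because this model carries a classification head, it does not return the 768-dimensional embeddings of the 11 tokens; the head has already reduced them to class scores.

Recall that `[CLS]` is the first token, and that it is its latent state, a 768-dimensional vector, that feeds the classifier. To turn the resulting logits into class probabilities, we apply a softmax over the last dimension. Let's see: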

In [32]:
from torch.nn.functional import softmax
predictions = softmax(logits, dim=-1)
predictions

tensor([[0.0030, 0.0023, 0.0116, 0.1216, 0.8615]])
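
As a sanity check, the softmax can be reproduced by hand from the logits printed earlier, using only the standard library. The variable names are illustrative; the logit values are copied from the model output above.

```python
import math

# Logits copied from the model output shown earlier.
logits_row = [-2.2633, -2.5239, -0.9028, 1.4483, 3.4063]

# Softmax: exponentiate each logit, then normalize so the values sum to 1.
exps = [math.exp(x) for x in logits_row]
total = sum(exps)
probs = [e / total for e in exps]

# The predicted class is the index with the highest probability.
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 4) for p in probs])
print(predicted_class)
```

The hand-computed probabilities agree closely with the tensor returned by `softmax`, and the most probable class is the last one (index 4).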

### Load the `emotion` dataset

In the previous lab, we explored the `emotion` dataset. Let's load it again:

In [50]:
from datasets import load_dataset
emotions = load_dataset('emotion')
emotions

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

Let us now define functions that act on a mini-batch (commonly called a batch) of inputs.

In [72]:
def tokenize(batch):
    # pad to the longest text in the batch; truncate to the model's maximum length
    return tokenizer(batch['text'], padding=True, truncation=True)

def extract_hidden_state(batch):
    # keep only the columns that the model accepts as inputs
    inputs = {k: v for k, v in batch.items() if k in tokenizer.model_input_names}

    # forward-pass the batch through the transformer,
    # to get the hidden state of each of the tokens;
    # disable gradient computations, as we are in inference mode
    with torch.no_grad():
        # the model loaded above carries a classification head and returns logits,
        # so request the hidden states explicitly and take the last layer
        outputs = model(**inputs, output_hidden_states=True)
        last_hidden_state = outputs.hidden_states[-1]

    # return only the hidden state of the [CLS] token
    return {'hidden_state': last_hidden_state[:, 0].cpu().numpy()}

Let us try these functions:

In [73]:
tokens = tokenize(emotions['train'][:2])
tokens

{'input_ids': [[101, 1045, 2134, 2102, 2514, 26608, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2064, 2175, 2013, 3110, 2061, 20625, 2000, 2061, 9636, 17772, 2074, 2013, 2108, 2105, 2619, 2040, 14977, 1998, 2003, 8300, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
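
The output above shows two effects of `padding=True`: the shorter sequence is padded with zeros up to the length of the longest one in the batch, and the attention mask marks real tokens with 1 and padding with 0. A minimal pure-Python sketch of that behavior follows. The helper `pad_batch` is hypothetical and only mimics the visible result; it is not the tokenizer's implementation, and the id lists are shortened, illustrative inputs.

```python
def pad_batch(sequences, pad_id=0):
    """Pad variable-length id lists to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two short, illustrative id sequences of different lengths.
batch = pad_batch([[101, 1045, 2134, 102], [101, 1045, 2064, 2175, 2013, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The attention mask lets the transformer ignore the padded positions, so padding changes the tensor shape without changing what the model attends to.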

#### Tokenize the `emotions` training dataset

With the `tokenize()` function, we can now tokenize the entire dataset:

In [74]:
emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)
print(emotions_encoded['train'].column_names)

['text', 'label', 'input_ids', 'attention_mask']


Our model expects its inputs as PyTorch tensors. We can ensure that with:

In [75]:
emotions_encoded.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
emotions_encoded

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 2000
    })
})

#### Forward pass through the transformer

Let us now pass the entire dataset through the transformer.

In [76]:
emotions_latent = emotions_encoded.map(extract_hidden_state, batched=True)