In [1]:
%run supportvectors-common.ipynb



<center><img src="https://d4x5p7s4.rocketcdn.me/wp-content/uploads/2016/03/logo-poster-smaller.png"/> </center>
<div style="color:#aaa;font-size:8pt">
<hr/>
&copy; SupportVectors. All rights reserved. <blockquote>This notebook is the intellectual property of SupportVectors, and part of its training material. 
Currently, only participants in SupportVectors workshops are allowed to study the notebooks for educational purposes; they are prohibited from copying or using them for any other purpose without written permission.

<b> These notebooks are chapters and sections from Asif Qamar's textbook that he is writing on Data Science. So we request you to not circulate the material to others.</b>
 </blockquote>
 <hr/>
</div>



## <img src="https://huggingface.co/datasets/huggingface/brand-assets/resolve/main/hf-logo.png" width="60px">  Pretrained Transformers

In this lab, we will delve into pretrained transformers.

In particular, we will continue our running example of the sentiment analysis classification problem. We will see how to use a Hugging Face transformer for this task.

### Transformers

A pretrained transformer, such as `bert` in the Hugging Face repository, exists as a headless body: in other words, it produces latent state embeddings. The weights of the transformer have been trained on some general task, such as MLM (masked language modeling): mask a few tokens in a sentence, and train the transformer to reconstruct the missing tokens as accurately as possible, over a vast corpus of documents. These weights form the **checkpoint** of the transformer model. Therefore, when we download and load a transformer from the Hugging Face repository, we are actually loading the model with these particular trained checkpoint weights.
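The masking step of MLM can be illustrated with a toy sketch. The function name `mask_tokens`, the whitespace tokenization, and the masking probability used here are illustrative assumptions; the actual BERT pretraining recipe is more elaborate than simply substituting `[MASK]`.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK]; return (masked, labels)."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(token)   # the model must reconstruct this token
        else:
            masked.append(token)
            labels.append(None)    # no reconstruction target at this position
    return masked, labels

# Whitespace splitting stands in for a real tokenizer (an assumption).
tokens = "a thing of beauty is a joy for ever".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

During pretraining, the transformer sees `masked` as input and is penalized for every position where its prediction differs from the corresponding non-empty entry in `labels`.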

Such a transformer body can be used for a variety of tasks by using these latent state embeddings as input to a head meant to perform a particular task.

Let us say that we would like to create a text classifier for sentiment analysis -- i.e. classify each piece of text into the sentiment it represents. Then the pipeline we can build would look like this:

<img src = "classifier-pipeline.png"/>



First, we will need to tokenize the input text using a tokenizer. We covered this in a previous lab. The output of the tokenizer is the input to the transformer. The transformer will emit a hidden state embedding for each token of the text. These embeddings become the input to the classifier network, which we can train using gradient descent and backpropagation.

Therefore, we typically make the classifier itself a differentiable function, made of a few feed-forward layers. In its simplest form, it could be a single softmax layer.

For classification purposes, it is common to take only the transformer-generated embedding of the `[CLS]` token as the input to the classifier, if one is using the `bert` transformer.
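As a rough illustration of this idea, the following pure-Python sketch takes made-up hidden states, keeps only the vector at position 0 (where `[CLS]` sits), and passes it through a single dense layer followed by a softmax. The sizes, the random weights, and the helper names `linear` and `softmax` are assumptions for illustration; a real head would be a trained PyTorch module.

```python
import math
import random

def softmax(values):
    peak = max(values)                      # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def linear(vector, weights, bias):
    """One dense layer: `weights` has one row per output class."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

rng = random.Random(0)
hidden_size, num_classes = 4, 5

# Hypothetical transformer output for one text: one vector per token,
# with the [CLS] token at position 0.
hidden_states = [[rng.uniform(-1, 1) for _ in range(hidden_size)]
                 for _ in range(3)]

# Untrained (random) classifier weights, for illustration only.
weights = [[rng.uniform(-1, 1) for _ in range(hidden_size)]
           for _ in range(num_classes)]
bias = [0.0] * num_classes

cls_embedding = hidden_states[0]            # keep only the [CLS] vector
cls_logits = linear(cls_embedding, weights, bias)
cls_probabilities = softmax(cls_logits)
print(cls_probabilities)
```

Training the classifier amounts to adjusting `weights` and `bias` so that the probability assigned to the correct class increases.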

<hr />
<img src="nlp-with-transformers-book.png"  width="150" style="padding:20px;" align="left"> <b>Note</b>: Some of the code snippets below are inspired by, or directly taken from, the book <a href="https://www.amazon.com/dp/1098136799"> Natural Language Processing with Transformers, Revised Edition. </a>


### Picking a sentiment analysis library

Let us pick a pretrained model for sentiment analysis. There are many available on the Hugging Face repository, but we will pick one that gives a familiar rating on the Likert scale (from 1 to 5). Its reported accuracy is reasonably good:

<img src="images/sentiment-analysis-metrics.png" width=400/>

For a given input sequence, it produces the logit-vector with 5 elements, each element corresponding to a rating value from 1 to 5.

Let us use a softmax classifier at the end to transform the logits to probabilities. 


In [2]:
from transformers import AutoModelForSequenceClassification

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(105879, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12

<!--

`Distilbert` is a much smaller model that approximates the accuracy of the much larger `BERT` model. Since it is much easier to play with for inference, especially when working on laptops without gpu/tpu acceleration support, we will use it for this lab.

**Note**: Considering that many of us do not have powerful laptops with tensor accelerators, we have not moved the model to the "cuda" device. However, if you do have that available, strongly consider using that with the syntax:

```
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
```
-->

Let us explore what the output of this model looks like. First we will take a text and tokenize it. Then we will pass it to the model, and look at the output.

In [3]:

from transformers import AutoTokenizer

checkpoint = "nlptown/bert-base-multilingual-uncased-sentiment" 
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
text = 'A thing of beauty is a joy for ever'
inputs = tokenizer(text, return_tensors='pt')
print(inputs)
print(f"The shape of the input is {inputs['input_ids'].size()}")

{'input_ids': tensor([[  101,   143, 21973, 10108, 25209, 10127,   143, 27318, 10139, 15765,
           102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
The shape of the input is torch.Size([1, 11])


In [4]:
tokenizer.model_input_names

['input_ids', 'token_type_ids', 'attention_mask']

In what follows, for simplicity, we will skip the step of loading the model onto the GPU. As an exercise, augment the code to run it on the GPU. Hint: use `.to(device)` on the model and on the input tensors before passing them to the model.
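One possible shape of the device-placement pattern is sketched below. To keep the sketch self-contained, `ToyClassifier` is a hypothetical stand-in for the pretrained model and the token ids are made up; only the `.to(device)` calls are the point.

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"

class ToyClassifier(nn.Module):
    """Hypothetical stand-in for the pretrained model (not BERT)."""
    def __init__(self, vocab_size=1000, hidden_size=16, num_labels=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.embed(input_ids)                  # (batch, seq, hidden)
        mask = attention_mask.unsqueeze(-1).float()     # (batch, seq, 1)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
        return self.head(pooled)                        # (batch, num_labels)

toy_model = ToyClassifier().to(device)   # move the weights to the device
toy_model.eval()

# Same layout as the tokenizer output: a dict of tensors (made-up ids).
toy_inputs = {
    "input_ids": torch.tensor([[101, 143, 102]]),
    "attention_mask": torch.tensor([[1, 1, 1]]),
}
# Move every input tensor to the same device as the model.
toy_inputs = {name: tensor.to(device) for name, tensor in toy_inputs.items()}

with torch.no_grad():
    toy_logits = toy_model(**toy_inputs)

# Bring the result back to the CPU before converting to NumPy.
print(toy_logits.cpu().numpy())
```

The same two moves apply to the real `model` and `inputs`. Note that a tensor living on the GPU must be brought back with `.cpu()` before `.numpy()` is called on it.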

In [5]:
import torch
with torch.no_grad():
    outputs = model(**inputs)
    print(outputs)
    
logits = outputs.logits
print (f'The logits are: {logits[0].numpy()}')

SequenceClassifierOutput(loss=None, logits=tensor([[-2.2633, -2.5239, -0.9028,  1.4483,  3.4063]]), hidden_states=None, attentions=None)
The logits are: [-2.2632933 -2.5239275 -0.9028079  1.4482905  3.406254 ]


In [6]:
from torch.nn.functional import softmax
predictions = softmax(logits, dim=-1)
predictions

tensor([[0.0030, 0.0023, 0.0116, 0.1216, 0.8615]])
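To see what the softmax does, the computation can be reproduced by hand from the logits printed earlier. This is a plain-Python sketch; the names `softmax_by_hand`, `example_logits`, and `example_probs` are illustrative and not part of any library.

```python
import math

def softmax_by_hand(logits):
    peak = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the model output above.
example_logits = [-2.2632933, -2.5239275, -0.9028079, 1.4482905, 3.406254]
example_probs = softmax_by_hand(example_logits)

for rating, p in enumerate(example_probs, start=1):
    print(f"Rating: {rating}  Probability: {p:.4f}")
```

Each logit is exponentiated and divided by the sum of all exponentials, so the results are positive and add up to 1; they should agree with the tensor above up to rounding.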

Let us display it in a more intuitive manner.

In [7]:
ratings = [1,2,3,4,5]
for rating, probability in zip(ratings, predictions[0]):
    print (f'Rating: {rating}  Probability: {probability.numpy():.2f}')

Rating: 1  Probability: 0.00
Rating: 2  Probability: 0.00
Rating: 3  Probability: 0.01
Rating: 4  Probability: 0.12
Rating: 5  Probability: 0.86


In other words, if the text `A thing of beauty is a joy for ever` were a comment on a product, the sentiment it expresses would be very positive.


### Passing a batch of inputs

Needless to say, we could have passed a small batch of texts as inputs for sentiment analysis. The process remains the same, with only minor and expected changes. Let us see that example below.

In [8]:
texts = ['A thing of beauty is a joy forever', 
         'I would really not recommend this horrible shoe; it hurts on long walks!',
         'It is quite satisfactory, but not exceptionally good'
        ]
inputs = tokenizer(texts, return_tensors='pt', padding=True)
import torch
with torch.no_grad():
    outputs = model(**inputs)
    print(outputs)
    
logits = outputs.logits
predictions = softmax(logits, dim=-1)
predictions

SequenceClassifierOutput(loss=None, logits=tensor([[-2.1803, -2.4672, -0.8363,  1.3856,  3.2928],
        [ 4.5518,  2.7818, -0.1818, -3.0668, -3.2394],
        [-2.7100, -0.2763,  2.2295,  2.0550, -1.1312]]), hidden_states=None, attentions=None)


tensor([[3.5822e-03, 2.6888e-03, 1.3736e-02, 1.2672e-01, 8.5327e-01],
        [8.4743e-01, 1.4435e-01, 7.4532e-03, 4.1628e-04, 3.5030e-04],
        [3.6457e-03, 4.1567e-02, 5.0934e-01, 4.2776e-01, 1.7680e-02]])
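The one visible change in the batch call is `padding=True`: texts of different lengths must be brought to a common length before they can be stacked into one tensor. The sketch below shows the idea in plain Python; it assumes right-padding with id 0 (consistent with `padding_idx=0` in the model printout), and the function name `pad_batch` and the token ids are made up for illustration. It is not the tokenizer's actual implementation.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad each id list to the longest length and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padding
    return input_ids, attention_mask

# Made-up token ids for two texts of different lengths.
batch = [[101, 143, 21973, 102], [101, 143, 102]]
padded_ids, padded_mask = pad_batch(batch)
print(padded_ids)    # [[101, 143, 21973, 102], [101, 143, 102, 0]]
print(padded_mask)   # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The attention mask tells the transformer which positions hold real tokens, so that the padding does not influence the result.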

Let us display it in a more readable, user-friendly manner.

In [9]:
for text, prediction in zip (texts, predictions):
    print (f'{text:<100} {prediction}')

A thing of beauty is a joy forever                                                                   tensor([0.0036, 0.0027, 0.0137, 0.1267, 0.8533])
I would really not recommend this horrible shoe; it hurts on long walks!                             tensor([8.4743e-01, 1.4435e-01, 7.4532e-03, 4.1628e-04, 3.5030e-04])
It is quite satisfactory, but not exceptionally good                                                 tensor([0.0036, 0.0416, 0.5093, 0.4278, 0.0177])


In [10]:
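import numpy as np
import pandas as pd

# For each text, keep the most probable rating (1-based) and its probability
results = [(1 + np.argmax(prediction.numpy()), np.max(prediction.numpy())) for prediction in predictions]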

ratings = []
probs = []

for rating, prob in results:
    ratings.append('*'*rating)
    probs.append(prob)

data = pd.DataFrame(data={'Text': texts, 'Rating': ratings, 'Probability': probs})

data


|   | Text | Rating | Probability |
|---|------|--------|-------------|
| 0 | A thing of beauty is a joy forever | ***** | 0.853274 |
| 1 | I would really not recommend this horrible sho... | * | 0.84743 |
| 2 | It is quite satisfactory, but not exceptionall... | *** | 0.509343 |


## What did we learn?

In this lab, we learned to build a complete text classifier for sentiment analysis. To do this, we concatenated the following pieces into a hand-constructed pipeline:

* a pretrained tokenizer
* a pretrained transformer
* finally, a softmax classifier

This gives us an insight into what the `pipeline()` from the previous labs was doing. Let us revisit the simpler way of directly using the `pipeline()`.

In [11]:
from transformers import pipeline
checkpoint = 'nlptown/bert-base-multilingual-uncased-sentiment'
sentiment_classifier = pipeline (task='sentiment-analysis', 
                                 tokenizer=checkpoint, # not necessary to mention, since it can be inferred
                                 model=checkpoint)
sentiment_classifier(texts)

[{'label': '5 stars', 'score': 0.8532741069793701},
 {'label': '1 star', 'score': 0.8474298119544983},
 {'label': '3 stars', 'score': 0.5093432068824768}]
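The label and score in each result correspond to the largest entry of the probability vector we computed by hand. A minimal sketch of that final step is shown below; the function name `to_labels` and the way the label text is built are assumptions for illustration (the real pipeline presumably reads its label names from the model configuration).

```python
def to_labels(probability_rows):
    """Turn each probability vector into a {'label', 'score'} dictionary."""
    results = []
    for row in probability_rows:
        best = max(range(len(row)), key=row.__getitem__)   # index of the largest probability
        stars = best + 1                                   # ratings are 1-based
        label = f"{stars} star" if stars == 1 else f"{stars} stars"
        results.append({"label": label, "score": row[best]})
    return results

# Probabilities copied from the batch output earlier in this lab.
batch_probabilities = [
    [3.5822e-03, 2.6888e-03, 1.3736e-02, 1.2672e-01, 8.5327e-01],
    [8.4743e-01, 1.4435e-01, 7.4532e-03, 4.1628e-04, 3.5030e-04],
    [3.6457e-03, 4.1567e-02, 5.0934e-01, 4.2776e-01, 1.7680e-02],
]
summary = to_labels(batch_probabilities)
for item in summary:
    print(item)
```

The labels and scores agree with the `pipeline()` output above, up to the rounding of the printed probabilities.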