# Transformers <a name="top"></a>

https://huggingface.co/course/chapter0?fw=pt

**Table of contents**
- [Pipeline](#pipeline)
    - [Sentiment Analysis](#sentiment-analysis)
    - [Zero shot classification](#zero-shot-classification)
    - [Text generation](#text-generation)
    - [Mask filling](#mask-filling)
    - [Named Entity Recognition](#named-entity-recognition)
    - [Question Answering](#question-answering)
    - [Summarisation](#summarisation)
    - [Translation](#translation)
- [Components of `pipeline`](#components)
    - [Tokenizer](#tokenizer)
        - [Encoding](#encoding)
        - [Decoding](#decoding)
        - [Padding](#padding)
        - [Attention mask](#attention-mask)
        - [Truncation](#truncation)
        - [Token type id](#token-type-id)
    - [Model](#model)
    - [Postprocessing the output](#postprocessing)
- [Finetuning](#finetuning)
    - [MRPC dataset](#mrpc-dataset)

## Pipeline <a name="pipeline"></a>

A convenient high-level function for running common NLP tasks quickly.

In [1]:
from transformers import pipeline

The first parameter of `pipeline` is `task`. Passing the name of a common task, e.g. `sentiment-analysis` or `text-generation`, returns a function that performs that NLP task.

### Sentiment Analysis <a name="sentiment-analysis"></a>

[Back to top](#top)

In [2]:
classifier = pipeline("sentiment-analysis")
classifier([
    "I've been waiting for a HuggingFace course my whole life.", 
    "I hate this so much!",
])

[{'label': 'POSITIVE', 'score': 0.9598047733306885},
 {'label': 'NEGATIVE', 'score': 0.9994558095932007}]

### Zero shot classification <a name="zero-shot-classification"></a>

[Back to top](#top)

In [3]:
classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445960879325867, 0.11197630316019058, 0.04342757537961006]}

### Text generation <a name="text-generation"></a>

[Back to top](#top)

In [4]:
generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to apply your data analysis skills to understanding, analyzing and applying the RNNs. We will examine various techniques which determine the optimal value of a RNN. Our target audience for the course is those individuals'}]

We can also specify the model and pass additional generation parameters:

In [5]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to use a computer program to quickly copy and paste files stored between the systems. This is an overview of'},
 {'generated_text': 'In this course, we will teach you how to navigate the virtual world while simultaneously using the latest-generation virtualization software. If you’ve'}]

### Mask filling <a name="mask-filling"></a>

[Back to top](#top)

In [6]:
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

[{'sequence': 'This course will teach you all about mathematical models.',
  'score': 0.19619831442832947,
  'token': 30412,
  'token_str': ' mathematical'},
 {'sequence': 'This course will teach you all about computational models.',
  'score': 0.040527213364839554,
  'token': 38163,
  'token_str': ' computational'}]

### Named Entity Recognition <a name="named-entity-recognition"></a>

[Back to top](#top)

In [7]:
ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

[{'entity_group': 'PER',
  'score': 0.9981693774461746,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019991238912,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932105541229248,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

### Question Answering <a name="question-answering"></a>

[Back to top](#top)

In [8]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn"
)

{'score': 0.6949764490127563, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

### Summarisation <a name="summarisation"></a>

[Back to top](#top)

In [9]:
summarizer = pipeline("summarization")
summarizer("""
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
""")

[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]

### Translation <a name="translation"></a>

Models can be selected from https://huggingface.co/models

[Back to top](#top)

In [10]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

[{'translation_text': 'This course is produced by Hugging Face.'}]

## Components of `pipeline` <a name="components"></a>

<img src="transformers-pipeline.png" alt="components that form the pipeline" width="80%"/>

Under the hood of a pipeline, there is a `tokenizer`, a `model` and some post-processing steps.

First, the text is passed to the tokenizer, which outputs either `pytorch` or `tensorflow` tensors.\
The tensors are passed to the model, which returns logits.\
The logits are then post-processed, e.g. passed through a softmax function, to produce output that suits the NLP task.

[Back to top](#top)

### Tokenizer <a name="tokenizer"></a>

A `tokenizer` takes text as input and does the following:

- Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens
- Mapping each token to an integer
- Adding additional inputs that may be useful to the model
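
The three steps above can be sketched in plain Python. This is only a toy illustration: the vocabulary, the ids, and the `toy_tokenize` helper are made up for the example and are not what the library does internally.

```python
import re

# Hypothetical toy vocabulary; a real tokenizer ships a fixed, pretrained one.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "hate": 5, "this": 6, "so": 7, "much": 8, "!": 9}

def toy_tokenize(text):
    # Step 1: split into lowercase words and punctuation symbols.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Step 2: map each token to an integer, falling back to [UNK].
    ids = [toy_vocab.get(tok, toy_vocab["[UNK]"]) for tok in tokens]
    # Step 3: add the extra inputs the model expects (special tokens here).
    ids = [toy_vocab["[CLS]"]] + ids + [toy_vocab["[SEP]"]]
    return tokens, ids

tokens, ids = toy_tokenize("I hate this so much!")
print(tokens)  # ['i', 'hate', 'this', 'so', 'much', '!']
print(ids)     # [2, 4, 5, 6, 7, 8, 9, 3]
```

Words missing from the toy vocabulary map to `[UNK]`, which is the problem subword tokenization is meant to reduce.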

To understand the different ways of tokenization, e.g. tokenizing by words, by characters, by symbols, check out: https://huggingface.co/course/chapter2/4?fw=pt

[Back to top](#top)

In [11]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [12]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.", 
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
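
The trailing zeros in the second row of `input_ids` are padding, and the zeros in `attention_mask` mark the positions the model should ignore. A minimal sketch of that bookkeeping in plain Python, assuming right-padding with a pad id of `0` (the `pad_batch` helper is hypothetical, not a library function):

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Token ids copied from the output above, with the padding removed.
first = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
         17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
second = [101, 1045, 5223, 2023, 2061, 2172, 999, 102]

input_ids, attention_mask = pad_batch([first, second])
print(input_ids[1])
print(attention_mask[1])
```

Both rows end up with 16 entries, which is what lets them be stacked into a single rectangular tensor.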


In [13]:
type(inputs['input_ids'][0])

torch.Tensor

The tokenizer can be saved by using the `save_pretrained` method.

In [14]:
tokenizer.save_pretrained("directory_on_my_computer")

('directory_on_my_computer/tokenizer_config.json',
 'directory_on_my_computer/special_tokens_map.json',
 'directory_on_my_computer/vocab.txt',
 'directory_on_my_computer/added_tokens.json',
 'directory_on_my_computer/tokenizer.json')

**Under the hood of the tokenizer**

The BERT tokenizer is a **subword** tokenizer: it splits words until it obtains tokens that can be represented by its vocabulary. In the example below, `transformer` is split into two tokens: `transform` and `##er`.

In [35]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

['Using', 'a', 'transform', '##er', 'network', 'is', 'simple']
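
One way to picture how such a split comes about is greedy longest-match lookup against the vocabulary. The sketch below assumes a tiny hand-made vocabulary and a hypothetical `greedy_wordpiece` helper; it illustrates the idea only and is not the library's implementation.

```python
def greedy_wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary containing only the pieces needed here.
toy_vocab = {"Using", "a", "transform", "##er", "network", "is", "simple"}

sequence = "Using a transformer network is simple"
tokens = [piece for word in sequence.split()
          for piece in greedy_wordpiece(word, toy_vocab)]
print(tokens)
# ['Using', 'a', 'transform', '##er', 'network', 'is', 'simple']
```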


#### Encoding <a name="encoding"></a>

The (subword) tokens are then turned into integers using the vocabulary.

[Back to top](#top)

In [41]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[7993, 170, 11303, 1200, 2443, 1110, 3014]


The tokens are stored as keys in the vocabulary dictionary, and the corresponding indices are its values.

In [42]:
[tokenizer.vocab[t] for t in tokens]

[7993, 170, 11303, 1200, 2443, 1110, 3014]

#### Decoding <a name="decoding"></a>

**Decoding** goes the other way around: from vocabulary indices, we want to get a string.

The `decode` method not only converts the indices back to tokens, but also groups together the tokens that were part of the same word to produce a readable sentence. It can be used as follows:

[Back to top](#top)

In [31]:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

Using a transformer network is simple
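
The grouping step can be sketched by hand: invert the vocabulary, look up each id, and glue `##` pieces onto the previous word. The vocabulary slice below reuses the ids printed earlier; `toy_decode` is a hypothetical helper for illustration, not the library's `decode`.

```python
# Slice of the vocabulary, using the ids shown in the encoding section.
toy_vocab = {"Using": 7993, "a": 170, "transform": 11303, "##er": 1200,
             "network": 2443, "is": 1110, "simple": 3014}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}

def toy_decode(ids):
    words = []
    for idx in ids:
        token = id_to_token[idx]
        if token.startswith("##") and words:
            # Continuation piece: append to the previous word without a space.
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)

decoded = toy_decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded)  # Using a transformer network is simple
```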


#### Padding <a name="padding"></a>

[Back to top](#top)

#### Attention mask <a name="attention-mask"></a>

[Back to top](#top)

#### Truncation <a name="truncation"></a>

[Back to top](#top)

#### `token_type_ids` <a name="token-type-id"></a>

[Back to top](#top)

### Model <a name="model"></a>

Note that the checkpoint used should be the same as the one used by the tokenizer.

[Back to top](#top)

In [15]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.weight', 'pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [16]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.", 
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])


Task-specific model classes can be used instead of `AutoModel`, e.g. `AutoModelForSequenceClassification`:

In [17]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [18]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.", 
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

outputs = model(**inputs)

The inputs were 2 tokenized sentences passed through an `AutoModelForSequenceClassification`. The logits have shape `(2, 2)`: one row per sentence and one column per class, since the task classifies text into `2` classes. This is much smaller than the `(2, 16, 768)` hidden-state output of `AutoModel`.

In [19]:
print(outputs.logits.shape)

torch.Size([2, 2])


### Postprocessing the output <a name="postprocessing"></a>

The logits are raw, unnormalized scores rather than probabilities: they can be negative and they don't sum to 1.

[Back to top](#top)

In [20]:
print(outputs.logits)

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward>)


The logits should be passed to a softmax function.

In [21]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward>)
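
To see what softmax does to each row, the same computation can be written in plain Python. This is a sketch using the logits printed above; the results should agree with the tensor output to about four decimal places.

```python
import math

def softmax(row):
    # Subtracting the max keeps exp() numerically stable; the result is unchanged.
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

logits = [[-1.5607, 1.6123],
          [4.1692, -3.3464]]
probabilities = [softmax(row) for row in logits]

for row in probabilities:
    # Each row is now non-negative and sums to 1.
    print([round(p, 4) for p in row], "sum =", round(sum(row), 4))
```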


Now the predictions make more sense: `0.040195` for label `0` and `0.95980` for label `1` for the first sentence.\
But we still don't know what the labels mean. To get the label corresponding to each position, we can inspect the `id2label` attribute of the model config.

In [22]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
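
Putting the two pieces together, the highest probability in each row plus the `id2label` mapping gives results in the same shape as the `sentiment-analysis` pipeline output at the start of these notes. A minimal sketch, with the probabilities copied from the softmax output above and a hypothetical `to_labels` helper:

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
probabilities = [[4.0195e-02, 9.5980e-01],
                 [9.9946e-01, 5.4418e-04]]

def to_labels(probabilities, id2label):
    results = []
    for row in probabilities:
        # Index of the largest probability in this row.
        best = max(range(len(row)), key=row.__getitem__)
        results.append({"label": id2label[best], "score": row[best]})
    return results

results = to_labels(probabilities, id2label)
print(results)
# [{'label': 'POSITIVE', 'score': 0.9598}, {'label': 'NEGATIVE', 'score': 0.99946}]
```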

The `AutoModel` class and all of its relatives are actually simple **wrappers** over the wide variety of models available in the library. It’s a clever wrapper as it can automatically guess the appropriate model architecture for your checkpoint, and then instantiates a model with this architecture.

However, if the type of model is known, the class that defines its architecture can be used directly.\
For example, for BERT:

In [23]:
from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)

In [24]:
print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.6.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



Doing the above creates a model from the default configuration and initializes it with random values. However, for the model to be usable it needs to be trained, which is compute-intensive.

Loading an already trained Transformer model is simple with the `from_pretrained` method:

In [25]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


**Saving a model**

A model can be saved by using the `save_pretrained` method. Two files will be saved: `'config.json'` and `'pytorch_model.bin'`.

- `'config.json'`: contains the hyperparameters of the model, the attributes necessary to build the model architecture, and some metadata, e.g. the version of `Transformers` the model was saved with.
- `'pytorch_model.bin'`: contains the model's weights.

In [26]:
model.save_pretrained("./directory_on_my_computer")

In [27]:
import os
os.listdir("./directory_on_my_computer/")

['tokenizer_config.json',
 'special_tokens_map.json',
 'config.json',
 'tokenizer.json',
 'vocab.txt',
 'pytorch_model.bin']

## Finetuning <a name="finetuning"></a>

[Back to top](#top)

In [46]:
import torch
from torch.optim import AdamW  # the AdamW class in transformers is deprecated in favour of this one
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
batch

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  2023,  2607,  2003,  6429,   999,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}

In [47]:
# This is new
batch["labels"] = torch.tensor([1, 1])

optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()
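
When `labels` is part of the batch, the model also returns a loss. For integer class labels with two or more classes this is, as far as I know, the usual cross-entropy between the logits and the labels. A plain-Python sketch of that quantity, using made-up logits (the real ones depend on the randomly initialized classification head):

```python
import math

def cross_entropy(logits, labels):
    """Mean negative log-probability assigned to the correct class."""
    losses = []
    for row, label in zip(logits, labels):
        peak = max(row)
        # log(sum(exp(row))), computed in a numerically stable way.
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        losses.append(log_norm - row[label])
    return sum(losses) / len(losses)

# Hypothetical logits for the two sentences; both labels are 1, as above.
hypothetical_logits = [[0.2, -0.1], [-0.3, 0.4]]
labels = [1, 1]
loss = cross_entropy(hypothetical_logits, labels)
print(loss)
```

`loss.backward()` then computes the gradient of this value with respect to the model parameters, and `optimizer.step()` moves the parameters in the direction that lowers it.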

In [48]:
batch

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  2023,  2607,  2003,  6429,   999,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'labels': tensor([1, 1])}

### MRPC dataset <a name="mrpc-dataset"></a>

The **MRPC (Microsoft Research Paraphrase Corpus)** dataset, introduced in a paper by William B. Dolan and Chris Brockett, consists of 5,801 pairs of sentences, with a label indicating if they are paraphrases or not (i.e., if both sentences mean the same thing). It’s a small dataset, so it’s easy to experiment with training on it.

The hub doesn't just contain models; it also contains datasets such as the MRPC dataset.

[Back to top](#top)

In [52]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

Reusing dataset glue (/Users/Tay/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


DatasetDict({
    train: Dataset({
        features: ['idx', 'label', 'sentence1', 'sentence2'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['idx', 'label', 'sentence1', 'sentence2'],
        num_rows: 408
    })
    test: Dataset({
        features: ['idx', 'label', 'sentence1', 'sentence2'],
        num_rows: 1725
    })
})

The above code returns a `DatasetDict` object which contains the **training** set, the **validation** set, and the test set.

Each of those contains several columns (`sentence1`, `sentence2`, `label`, and `idx`) and a different number of rows, one per example in the set. A split can be accessed by key, like a dictionary, and a row by index:

In [65]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

{'idx': 0,
 'label': 1,
 'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}

The `'label': 1` entry does not say what `1` refers to.\
To know which integer corresponds to which label, we can inspect the `features` of our `raw_train_dataset`.

In [66]:
raw_train_dataset.features

{'idx': Value(dtype='int32', id=None),
 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
 'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None)}

Behind the scenes, `label` is of type `ClassLabel`, and the mapping of integers to label names is stored in its `names` field. `0` corresponds to `not_equivalent`, and `1` corresponds to `equivalent`.

To preprocess the dataset, we need to convert the text to numbers the model can make sense of.\
We can tokenize them all at once:

In [75]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])

In [77]:
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)

In [79]:
tokenized_dataset.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
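
Passing two sentences together produces a single sequence per pair, and `token_type_ids` records which sentence each token belongs to. For BERT the layout is `[CLS] sentence1 [SEP] sentence2 [SEP]`, with `0` for the first segment and `1` for the second. A sketch with hand-written tokens (`build_pair` is a hypothetical helper, not a library function):

```python
def build_pair(tokens_a, tokens_b):
    # First segment: [CLS] + sentence 1 + [SEP], all with type id 0.
    first = ["[CLS]"] + tokens_a + ["[SEP]"]
    # Second segment: sentence 2 + [SEP], all with type id 1.
    second = tokens_b + ["[SEP]"]
    tokens = first + second
    token_type_ids = [0] * len(first) + [1] * len(second)
    return tokens, token_type_ids

tokens, token_type_ids = build_pair(["this", "is", "first"],
                                    ["this", "is", "second"])
for tok, seg in zip(tokens, token_type_ids):
    print(f"{tok:>8} {seg}")
```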

In [80]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

In [83]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

Loading cached processed dataset at /Users/Tay/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-45c1e2c429d5d127.arrow
Loading cached processed dataset at /Users/Tay/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-2dc02d9648235839.arrow
Loading cached processed dataset at /Users/Tay/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-bddcfdfa2d4a83e8.arrow


DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 408
    })
    test: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 1725
    })
})

In [84]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
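
Note that `tokenize_function` above truncates but does not pad. `DataCollatorWithPadding` instead pads each batch to the length of its own longest sample, which avoids padding everything to the longest sentence in the dataset. The saving can be sketched in plain Python with made-up token id lists (all helper names here are hypothetical):

```python
def pad_to(sequences, length, pad_id=0):
    return [seq + [pad_id] * (length - len(seq)) for seq in sequences]

def collate(batch, pad_id=0):
    # Dynamic padding: only as long as the longest sample in this batch.
    longest = max(len(seq) for seq in batch)
    return pad_to(batch, longest, pad_id)

def total_tokens(batches):
    return sum(len(seq) for batch in batches for seq in batch)

# Hypothetical tokenized samples of different lengths, split into two batches.
samples = [[101, 7, 102],
           [101, 7, 8, 9, 102],
           [101, 7, 8, 9, 10, 11, 12, 102],
           [101, 102]]
batches = [samples[0:2], samples[2:4]]

dynamic = [collate(batch) for batch in batches]
global_len = max(len(seq) for seq in samples)
static = [pad_to(batch, global_len) for batch in batches]

print(total_tokens(dynamic), total_tokens(static))  # 26 32
```

The first batch only needs length 5 instead of 8, so fewer padding tokens go through the model.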