# Transformer Pipeline

This section explains how the `pipeline` class in the Transformers library works. In a typical workflow, a pipeline object is created from the library, raw text is passed to it directly as input, and predicted labels come back as output.

A pipeline handles several steps internally:
1. Tokenization
2. Feeding the tokenized input to the model to generate logits
3. Post-processing the logits into output labels
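
To make these three steps concrete, here is a toy sketch in plain Python. It is not how the real `pipeline` is implemented: the vocabulary, the per-token scores, and the function names (`tokenize`, `toy_model`, `postprocess`, `toy_pipeline`) are all made up for illustration, and the "model" is a hand-written lookup rather than a neural network.

```python
import math

# Hypothetical toy vocabulary and per-token scores; not taken from any real model.
VOCAB = {"[UNK]": 0, "i": 1, "love": 2, "hate": 3, "this": 4}
TOKEN_SCORES = {2: (-1.0, 2.0), 3: (2.0, -1.0)}  # token id -> (negative, positive)
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def tokenize(text):
    """Step 1: split raw text into tokens and map each token to an ID."""
    return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in text.lower().split()]


def toy_model(ids):
    """Step 2: turn IDs into one logit per label (stand-in for a network)."""
    logits = [0.0, 0.0]
    for token_id in ids:
        neg, pos = TOKEN_SCORES.get(token_id, (0.0, 0.0))
        logits[0] += neg
        logits[1] += pos
    return logits


def postprocess(logits):
    """Step 3: softmax the logits and report the most likely label."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}


def toy_pipeline(text):
    return postprocess(toy_model(tokenize(text)))


result = toy_pipeline("I love this")
print(result["label"])
```

The real pipeline follows the same shape, with a trained tokenizer and model in place of the lookups.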

Before going further in this notebook, it is important to understand the difference between **AutoModelForSequenceClassification, AutoModel, and BertModel**.

| Class | What does it do? | Use case |
| --- | --- | --- |
| **AutoModel** | Loads a base model such as BERT, **without any task head**. | When we want to extract embeddings (hidden states). |
| **BertModel** | Same as AutoModel, but specific to BERT. It also has no task-specific head. | When we want to work directly with BERT's embeddings or use it as the base for other models. |
| **AutoModelForSequenceClassification** | Loads a base model and adds a classification head on top of it. | Directly using or fine-tuning the model for classification tasks. |

The `AutoModelFor...` family includes the following classes:

["AutoModelForSequenceClassification", "AutoModelForTokenClassification","AutoModelForQuestionAnswering", "AutoModelForMaskedLM", "AutoModelForCausalLM", "AutoModelForSeq2SeqLM", "AutoModelForMultipleChoice", "AutoModelForNextSentencePrediction", "AutoModelForImageClassification", "AutoModelForVision2Seq", "AutoModelForSpeechSeq2Seq", "AutoModelForAudioClassification", "AutoModelForCTC", "AutoModelForImageToText", "AutoModelForZeroShotObjectDetection", "AutoModelForDepthEstimation", "AutoModelForDocumentQuestionAnswering"]


In [1]:
# check whether a GPU is available; only query its name when one exists
import torch
print(torch.cuda.is_available(),
      torch.cuda.get_device_name(0) if torch.cuda.is_available() else "no GPU found")

True Tesla P100-PCIE-16GB


## Import Libraries

In [2]:
from transformers import pipeline, AutoTokenizer, AutoModel



## Initializing the classifier (using Hugging Face's pipeline)

- Create an instance of the **pipeline** class for the "sentiment-analysis" task, which loads a pre-trained model for that task.
- When we call the pipeline, the following steps run in the background:
  -  Tokenizer --> takes the raw input text, splits it into tokens, and converts the tokens into IDs.
  -  Model --> takes the IDs and produces logits.
  -  Post-processing --> converts the logits into predictions (labels and scores).

In [3]:
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


## Prediction

In [4]:
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

[{'label': 'POSITIVE', 'score': 0.9598047137260437},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

> The output says that the first sentence is positive with a probability of about 0.960, while the second sentence is negative with a probability of about 0.999.

# Transformer AutoModel and AutoTokenizer

## Defining checkpoint

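A checkpoint in Transformers is a saved state of a model, captured during or after training. It includes the model weights, so training can be resumed or the model can be used later without retraining. A checkpoint published on the Hugging Face Hub is referred to by its name, like the one defined below.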
> Checkpoints are useful for fault recovery, model evaluation, and deployment.

In [5]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

## Getting tokenizer and pre-trained model (from a checkpoint)

In [6]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

## Tokenizing raw inputs

Unlike with `pipeline`, tokenization is done manually here: the tokenizer that AutoTokenizer loads from the checkpoint is called directly on the raw input.

The output is a dictionary with the following keys:
1. input_ids
2. attention_mask

In [7]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, 
                   return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


In [8]:
inputs.keys()

dict_keys(['input_ids', 'attention_mask'])
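
The zeros in the second row of `input_ids` and `attention_mask` come from `padding=True`. The following is a minimal sketch of that padding step in plain Python, assuming a padding ID of 0 and right-side padding as in the output above; `pad_batch` is a hypothetical helper, not a Transformers function.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # real IDs, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs copied from the tokenizer output above, with the padding removed.
first = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
         2607, 2026, 2878, 2166, 1012, 102]
second = [101, 1045, 5223, 2023, 2061, 2172, 999, 102]

padded = pad_batch([first, second])
print(padded["input_ids"][1])
print(padded["attention_mask"][1])
```

The shorter sentence is extended to 16 positions, and its mask marks the last 8 of them as padding.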

## Feeding tokenized inputs to the model

`model(**inputs)` unpacks the `inputs` dictionary into keyword arguments, which is equivalent to:
> **model(input_ids=...  , attention_mask=...)**
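
Here is a small illustration of that unpacking in plain Python. `fake_model` is a made-up function that only reports what it received; it stands in for the model's forward call so the example runs without any library.

```python
def fake_model(input_ids=None, attention_mask=None):
    """Hypothetical stand-in for a model call; it only summarizes its arguments."""
    return {
        "n_sequences": len(input_ids),
        "n_real_tokens": sum(sum(row) for row in attention_mask),
    }


inputs = {
    "input_ids": [[101, 7592, 102], [101, 102, 0]],
    "attention_mask": [[1, 1, 1], [1, 1, 0]],
}

unpacked = fake_model(**inputs)  # keys become keyword arguments
explicit = fake_model(input_ids=inputs["input_ids"],
                      attention_mask=inputs["attention_mask"])
print(unpacked == explicit)
```

Both calls receive exactly the same arguments, so their results are identical.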

In [9]:
outputs = model(**inputs)
outputs

BaseModelOutput(last_hidden_state=tensor([[[-0.1798,  0.2333,  0.6321,  ..., -0.3017,  0.5008,  0.1481],
         [ 0.2758,  0.6497,  0.3200,  ..., -0.0760,  0.5136,  0.1329],
         [ 0.9046,  0.0985,  0.2950,  ...,  0.3352, -0.1407, -0.6464],
         ...,
         [ 0.1466,  0.5661,  0.3235,  ..., -0.3376,  0.5100, -0.0561],
         [ 0.7500,  0.0487,  0.1738,  ...,  0.4684,  0.0030, -0.6084],
         [ 0.0519,  0.3729,  0.5223,  ...,  0.3584,  0.6500, -0.3883]],

        [[-0.2937,  0.7283, -0.1497,  ..., -0.1187, -1.0227, -0.0422],
         [-0.2206,  0.9384, -0.0951,  ..., -0.3643, -0.6605,  0.2407],
         [-0.1536,  0.8988, -0.0728,  ..., -0.2189, -0.8528,  0.0710],
         ...,
         [-0.3017,  0.9002, -0.0200,  ..., -0.1082, -0.8412, -0.0861],
         [-0.3338,  0.9674, -0.0729,  ..., -0.1952, -0.8181, -0.0634],
         [-0.3454,  0.8824, -0.0426,  ..., -0.0993, -0.8329, -0.1065]]],
       grad_fn=<NativeLayerNormBackward0>), hidden_states=None, attentions=None)

The output of the Transformer model is usually a large tensor. It generally has three dimensions:
* Batch size: The number of sequences processed at a time (2 in the example above).
* Sequence length: The number of tokens in each padded sequence (16 in the example above).
* Hidden size: The dimension of the vector that represents each token (768 for this model, as the shape below shows).


In [10]:
# Let's look at the shape of the last hidden state
outputs.last_hidden_state.shape

torch.Size([2, 16, 768])

> The **model heads** take the high-dimensional hidden states as input and project them onto a different dimension. They are usually composed of one or a few linear layers.

The Transformers library offers many different architectures with attached heads, each designed around a specific task. Here we explore the sequence classification architecture.
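
As a rough picture of what a head does, the sketch below applies a single linear layer in plain Python. The hidden vector has only 4 values instead of 768, and the weights and bias are made-up numbers, not the weights of any real model.

```python
def linear_head(hidden, weight, bias):
    """Project one hidden vector onto len(weight) output values (logits)."""
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weight, bias)
    ]


hidden = [0.5, -1.0, 2.0, 0.0]   # made-up hidden state (size 4 instead of 768)
weight = [
    [1.0, 0.0, 0.5, 0.0],        # weights for label 0
    [0.0, 1.0, 0.0, 1.0],        # weights for label 1
]
bias = [0.1, -0.1]

logits = linear_head(hidden, weight, bias)
print(logits)
```

The head turns a vector of hidden size into one value per label, which is why a two-label classifier ends up with two logits per sentence.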

## Using architecture with model heads

Some key points to consider:
1. We are using the same checkpoint that was defined above.
2. We are feeding the model the same tokenized inputs, created by the tokenizer from that checkpoint.

In [11]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

In [12]:
outputs.logits, outputs.logits.shape

(tensor([[-1.5607,  1.6123],
         [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>),
 torch.Size([2, 2]))

> The model head takes the high-dimensional vectors as input and outputs vectors containing two values, one for each label.

> Since we have two sentences and two labels, the result we get from our model has shape 2 x 2.

## Post processing the output

* The outputs are not probabilities but logits: the raw, unnormalized scores produced by the last layer of the model.
* To convert them to probabilities, they need to go through a softmax layer.
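
To show what softmax computes, here is a plain Python version applied to the logits printed above (rounded to four decimals, so the probabilities agree with the library's result only approximately). The `softmax` helper is written for illustration; subtracting the maximum before exponentiating is a common trick for numerical stability.

```python
import math


def softmax(logits):
    """Exponentiate each logit and normalize so the values sum to 1."""
    top = max(logits)                            # subtract the max for stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]  # rounded logits from the output above
probs = [softmax(row) for row in logits]
for row in probs:
    print([round(p, 4) for p in row])
```

Each row now sums to 1, and the larger logit in a row receives the larger probability.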

In [13]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4419e-04]], grad_fn=<SoftmaxBackward0>)


In [14]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

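Combining the probabilities with this label mapping:
1. The 1st sentence is positive (probability of about 0.960 at index 1).
2. The 2nd sentence is negative (probability of about 0.999 at index 0).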
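
This last step can be sketched in plain Python as follows. The probabilities and the label mapping are copied from the outputs above; `to_labels` is a hypothetical helper that picks the index with the highest probability in each row.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
probabilities = [
    [4.0195e-02, 9.5980e-01],   # sentence 1
    [9.9946e-01, 5.4419e-04],   # sentence 2
]


def to_labels(probabilities, id2label):
    """Pick the most probable index per row and translate it to a label."""
    results = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)
        results.append({"label": id2label[best], "score": row[best]})
    return results


results = to_labels(probabilities, id2label)
for item in results:
    print(item)
```

The resulting list has the same structure as the output the pipeline returned at the start of the notebook.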

So far, we have seen the following usage of the **Transformers** library:

<img src = transformer_lib.jpg width="700">

# Loading Pre-trained BERT

## Initializing BERT with random parameters

In [15]:
from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()  # The configuration contains many attributes 
                       # that are used to build the model.

# Building the model from the config 

model = BertModel(config) # this initializes the model randomly [not pre-trained]

In [16]:
config

BertConfig {
  "_attn_implementation_autoset": true,
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.51.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

## Loading a pre-trained BERT instead

In [17]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

In [18]:
model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(28996, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

## Saving a model

In [19]:
model.save_pretrained("sample_model")

## Inference

In [20]:
import torch

sequences = ["Hello!", "Cool.", "Nice!"]
encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]

model_inputs = torch.tensor(encoded_sequences)

In [21]:
model_inputs

tensor([[ 101, 7592,  999,  102],
        [ 101, 4658, 1012,  102],
        [ 101, 3835,  999,  102]])

In [22]:
output = model(model_inputs)

In [23]:
output.last_hidden_state.shape  # [batch size, sequence length, hidden size]

torch.Size([3, 4, 768])

# Handling multiple sequences

In [24]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Defining a checkpoint and loading a tokenizer and model from that checkpoint
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Defining an input sequence
sequence = "I've been waiting for a HuggingFace course my whole life."

# splitting the text into tokens, converting the tokens to IDs, then to a tensor
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor([ids])

In [25]:
input_ids.shape

torch.Size([1, 14])

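So, the tokenizer has split the input sentence into 14 token IDs. Calling `tokenize` and `convert_tokens_to_ids` directly does not add the special tokens `[CLS]` (ID 101) and `[SEP]` (ID 102), which is why there are 14 IDs here rather than the 16 we got earlier when calling the tokenizer on the same sentence.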

In [26]:
output = model(input_ids)
print("Logits:", output.logits)

Logits: tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)


## What about multiple inputs?

In [27]:
# we are going to use the pad token id from the tokenizer object. 
# let's see what it is
tokenizer.pad_token_id

0

In [28]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]] # placeholder IDs standing in for sentence 1
sequence2_ids = [[200, 200]]      # placeholder IDs standing in for a shorter sentence 2
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3373, -1.2163]], grad_fn=<AddmmBackward0>)


> The logits for sequence 2 do not match: inside the batch, the model also attends to the padding token, which changes the result.

## Using attention masks to get the same results

In [29]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), 
                attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
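
Why does the mask restore the result? The sketch below shows the idea on a single row of attention scores in plain Python. It is a simplified illustration, not the model's actual code: the scores are made up, and `masked_softmax` is a hypothetical helper that excludes masked positions before normalizing. It assumes at least one position is unmasked.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over scores, giving zero weight to positions where mask is 0."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Made-up attention scores; the third position plays the role of the padding token.
unpadded = masked_softmax([2.0, 1.0], [1, 1])
padded_masked = masked_softmax([2.0, 1.0, 0.5], [1, 1, 0])
padded_unmasked = masked_softmax([2.0, 1.0, 0.5], [1, 1, 1])

print(unpadded)
print(padded_masked)
print(padded_unmasked)
```

With the mask, the padded position gets a weight of exactly zero and the remaining weights match the unpadded case. Without it, the padding takes a share of the attention and shifts the other weights.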


# Fine-tuning a pretrained model (General Flow)

In [30]:
# Import needed libraries
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Defining a checkpoint and then initializing a tokenizer and model object
# using that checkpoint
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Define raw text as input
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Important keyword arguments for the tokenizer:
* **padding=True**
  * Pads all sequences in the batch to the length of the longest one. This gives every sequence the same length, which is required to stack them into a single tensor.
* **truncation=True**
  * Truncates sequences longer than the model's maximum length (e.g., 512 tokens for BERT), so that inputs never exceed what the model can handle.
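
Padding was visible in the tokenizer outputs earlier; truncation is easier to see with a small sketch. The plain Python version below handles a single sentence only and is an assumption about the general idea, not the tokenizer's actual code (real tokenizers also offer several strategies for sentence pairs). `truncate` is a hypothetical helper, and the content IDs are made up.

```python
CLS_ID, SEP_ID = 101, 102   # BERT's [CLS] and [SEP] IDs, as seen in the outputs above


def truncate(token_ids, max_length):
    """Keep at most max_length IDs in total, including the two special tokens."""
    budget = max_length - 2             # room left for content tokens
    return [CLS_ID] + token_ids[:budget] + [SEP_ID]


content = list(range(1000, 1010))       # 10 made-up content token IDs

short = truncate(content, max_length=6)
full = truncate(content, max_length=512)

print(short)
print(len(full))
```

With a limit of 6, only the first 4 content tokens survive; with a limit of 512, nothing is cut because the sentence already fits.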

In [31]:
# creating batch from the raw input text and tokenizer object
batch = tokenizer(sequences, 
                  padding=True, 
                  truncation=True, 
                  return_tensors="pt")

batch

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  2023,  2607,  2003,  6429,   999,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}

In [32]:
batch.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [33]:
# Since we are creating training examples, we also need labels.
# For this case, let's give a label of 1 to both training examples.
batch["labels"] = torch.tensor([1, 1])

In [34]:
# create an optimizer to update the model parameters
optimizer = AdamW(model.parameters(), 
                  lr=5e-5)  # a small learning rate, as is usual for fine-tuning

# Runs the model on the input batch and computes the loss.
# batch is a dictionary with keys like input_ids and attention_mask;
# the loss is returned only because it also contains labels.
loss = model(**batch).loss

# Computes the gradients of the loss with respect to the model parameters
loss.backward()

# Applies the calculated gradients to update the model weights
optimizer.step()
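
The forward pass, backward pass, and optimizer step can be traced by hand on a tiny example. The sketch below uses a one-parameter model and plain gradient descent rather than AdamW, with made-up numbers, so it only illustrates the order of the three steps.

```python
# One-parameter model: prediction = w * x, loss = (prediction - y) ** 2
w = 0.0            # the single "model parameter"
x, y = 2.0, 4.0    # one made-up training example
lr = 0.1           # learning rate


def compute_loss(w):
    return (w * x - y) ** 2


loss_before = compute_loss(w)     # forward pass: compute the loss
grad = 2 * (w * x - y) * x        # backward pass: d(loss)/dw, derived by hand
w = w - lr * grad                 # optimizer step: move against the gradient
loss_after = compute_loss(w)

print(loss_before, w, loss_after)
```

A single update moves `w` toward the value that fits the example, and the loss drops as a result. `loss.backward()` and `optimizer.step()` do the same for every parameter of the model.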

# Fine-tuning a pretrained model (using Trainer Class)

Let's use the MRPC (Microsoft Research Paraphrase Corpus) dataset, introduced in a paper by William B. Dolan and Chris Brockett. The dataset consists of 5,801 pairs of sentences, with a label indicating whether they are paraphrases or not.

### Importing dataset

In [35]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [36]:
raw_datasets['train'][0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

In [37]:
raw_datasets['train'].features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

### Tokenizing the Inputs

In [38]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [39]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], 
                     example["sentence2"], 
                     truncation=True
                    )

> **batched=True**

> Instead of applying the function to one example at a time, `map` passes a whole batch of examples to the function at once (a dictionary in which each column holds a list of values).
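
The difference is mainly in what the function receives. Below is a simplified stand-in for a batched map in plain Python; `map_batched` and `count_words` are hypothetical helpers written for this illustration and do not reproduce the library's implementation.

```python
def map_batched(rows, fn, batch_size=2):
    """Simplified batched map: fn sees a dict of lists, not a single row."""
    output = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # turn a list of rows into one dict of columns
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = fn(batch)
        # attach the new column values back to each row
        for i, row in enumerate(chunk):
            merged = dict(row)
            for key, values in new_columns.items():
                merged[key] = values[i]
            output.append(merged)
    return output


def count_words(batch):
    # receives lists of values and returns a list for the new column
    return {"n_words": [len(s.split()) for s in batch["sentence1"]]}


rows = [{"sentence1": "a b c"}, {"sentence1": "d e"}, {"sentence1": "f"}]
mapped = map_batched(rows, count_words)
print(mapped)
```

Since the function handles many examples per call, a tokenizer that can process lists of texts efficiently gets to do so.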

In [40]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

In [41]:
tokenized_datasets['train']

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 3668
})

In [42]:
raw_datasets['train']

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 3668
})

### Dynamic Padding

**IMPORTANT**
* Dynamic padding is a strategy where sequences in a batch are padded only up to the length of the longest sequence in that batch, instead of a fixed maximum length.
* With dynamic padding, different batches can—and usually do—have different sequence lengths, because each batch is padded only up to its longest sequence.
  * Models expect input tensors to be uniform in size within a batch.
  * But between batches, lengths can vary. 

> In short: when we batch elements together, pad all the examples to the length of the longest element in that batch.

> The function that is responsible for putting together samples inside a batch is called a **collate function**.


**DataCollatorWithPadding** takes a tokenizer when you instantiate it, so that it knows:
  * which padding token to use
  * whether the model expects padding on the left or on the right of the inputs

In [43]:
from transformers import DataCollatorWithPadding

# DataCollatorWithPadding - Automatically pads each batch 
#                           dynamically to the length of the longest 
#                           sequence in that batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [44]:
# Let's see how it works. First, drop the columns the model does not need.
samples = {k: v for k, v in tokenized_datasets["train"][:8].items() 
           if k not in ["idx", "sentence1", "sentence2"]}
samples.keys()

dict_keys(['label', 'input_ids', 'token_type_ids', 'attention_mask'])

In [45]:
# let's check the number of tokens in each of these training examples
[len(x) for x in samples["input_ids"]]

[50, 59, 47, 67, 59, 50, 62, 32]

In [46]:
# here's how data collator will work
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

{'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'attention_mask': torch.Size([8, 67]),
 'labels': torch.Size([8])}
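
The padded length of 67 is simply the longest of the eight examples. The sketch below redoes that arithmetic in plain Python using the token counts printed above. `padded_lengths` is a hypothetical helper, and the fixed length of 128 is an arbitrary value chosen only for comparison.

```python
lengths = [50, 59, 47, 67, 59, 50, 62, 32]   # token counts printed above


def padded_lengths(lengths, batch_size):
    """Length each batch is padded to when padding up to its longest member."""
    return [
        max(lengths[i:i + batch_size])
        for i in range(0, len(lengths), batch_size)
    ]


one_batch = padded_lengths(lengths, batch_size=8)
two_batches = padded_lengths(lengths, batch_size=4)

# padding tokens added: dynamic padding (one batch of 8) vs. a fixed length of 128
dynamic_pad = sum(max(lengths) - n for n in lengths)
fixed_pad = sum(128 - n for n in lengths)

print(one_batch, two_batches)
print(dynamic_pad, fixed_pad)
```

Splitting the same examples into two batches of four gives different padded lengths per batch, and dynamic padding adds far fewer padding tokens than a fixed length would.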

### Fine-tuning with trainer API

* Transformers provides a Trainer class to help fine-tune any of the pretrained models.
* The first step before defining the **Trainer** is to **create a TrainingArguments object** that contains all the hyperparameters the Trainer will use for training and evaluation.
  * The <u>only argument we provide here is a directory where the trained model will be saved</u>, along with the checkpoints along the way.
  * The rest can be left at their defaults, which work well for basic fine-tuning.

In [47]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer",
                                 report_to="none")

In [48]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, 
                                                           num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [49]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    processing_class=tokenizer
)

In [50]:
# Now, fine tuning
trainer.train()

Step,Training Loss
500,0.4922
1000,0.2374


TrainOutput(global_step=1377, training_loss=0.2942901561462801, metrics={'train_runtime': 130.751, 'train_samples_per_second': 84.16, 'train_steps_per_second': 10.531, 'total_flos': 405114969714960.0, 'train_loss': 0.2942901561462801, 'epoch': 3.0})

### Evaluation

In [51]:
!pip install evaluate



In [52]:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

(408, 2) (408,)


In [53]:
predictions.predictions[:2]

array([[-3.162256 ,  4.407572 ],
       [ 2.5285764, -4.0733857]], dtype=float32)

In [54]:
import numpy as np
import evaluate

preds = np.argmax(predictions.predictions, axis=-1) # this line tells which index
                                                    # has max value in the array
print("preds: {}".format(preds))


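metric = evaluate.load("glue", "mrpc")
# compute() returns the scores as a dictionary; its return value is discarded here,
# so the print below shows the metric object itself rather than the scores
metric.compute(predictions=preds, references=predictions.label_ids)
print("metrics: {}".format(metric))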

preds: [1 0 0 1 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
 0 1 1 0 1 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 0 1 1
 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 1 1 1 1 0 0 1 1 1 0 0 1 0 1 1 1
 0 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1
 1 1 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1
 1 0 1 1 1 1 1 1 1 1 0 0 1 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 0 0 0 1 1 1
 1 1 1 1 1 1 1 0 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 0 1 1 1 0
 0 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 0 0 0 0 0 1 0 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
 0 1 1 1 0 0 1 0 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1
 1]
metrics: EvaluationModule(name: "glue", module_type: "metric", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}

In [55]:
import numpy as np
import evaluate

def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
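
To see what happens between the logits and the reported scores, here is a plain Python sketch of argmax, accuracy, and F1. The two logit rows are copied from the prediction output above; the `preds` and `labels` lists are made up, and treating label 1 as the positive class for F1 is an assumption of this sketch.

```python
def argmax(row):
    """Index of the largest value in a row of logits."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_and_f1(predictions, references, positive=1):
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": correct / len(pairs), "f1": f1}


logits = [[-3.162256, 4.407572], [2.5285764, -4.0733857]]  # first two rows above
print([argmax(row) for row in logits])

preds = [1, 0, 1, 1, 0, 1]    # made-up predictions
labels = [1, 0, 0, 1, 1, 1]   # made-up true labels
scores = accuracy_and_f1(preds, labels)
print(scores)
```

The first two predictions come out as 1 and 0, matching the start of the `preds` array printed earlier.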

### Re-defining the trainer

In [56]:
training_args = TrainingArguments("test-trainer", 
                                  eval_strategy="epoch",
                                  fp16=True,
                                  no_cuda=False,
                                  num_train_epochs=3,
                                  report_to="none"  # no logging to wandb
                                 )

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, 
                                                           num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [57]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.417061,0.828431,0.883721
2,0.552400,0.428832,0.845588,0.890052
3,0.332400,0.679138,0.848039,0.895623


TrainOutput(global_step=1377, training_loss=0.3718539413197281, metrics={'train_runtime': 132.6482, 'train_samples_per_second': 82.956, 'train_steps_per_second': 10.381, 'total_flos': 405114969714960.0, 'train_loss': 0.3718539413197281, 'epoch': 3.0})

# Fine-tuning a pretrained model (without Trainer Class)

In [58]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

# get dataset
raw_datasets = load_dataset("glue", "mrpc")

# define a checkpoint from which model and tokenizer will be loaded
checkpoint = "bert-base-uncased"

# get the tokenizer from the checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
def tokenize_function(example):
    return tokenizer(example["sentence1"], 
                     example["sentence2"], 
                     truncation=True)

# batched tokenization of the raw dataset
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# data collator for preparing batches to feed to model
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


## Prepare for training

We need to apply a bit of postprocessing to tokenized_datasets, to take care of some things that the **Trainer** automatically did.
* Remove the columns corresponding to values the model does not expect (like the sentence1 and sentence2 columns).
* Rename the column label to labels.
* Set the format of the datasets so they return PyTorch tensors instead of lists.

In [59]:
# Steps that the Trainer performed automatically
# 1. remove unwanted columns
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", 
                                                        "sentence2", 
                                                        "idx"])
# 2. rename the target column to "labels", the name the model expects
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

# 3. set the dataset format so it returns PyTorch tensors
tokenized_datasets.set_format("torch")

tokenized_datasets["train"].column_names

['labels', 'input_ids', 'token_type_ids', 'attention_mask']

## Define DataLoader

In [60]:
from torch.utils.data import DataLoader

# Things done by trainer
# 4. DataLoader - groups your dataset into mini-batches efficiently
#               - Shuffle the data
#               - collate / pad examples
train_dataloader = DataLoader(
    tokenized_datasets["train"], 
    shuffle=True, 
    batch_size=8, 
    collate_fn=data_collator
)

eval_dataloader = DataLoader(
    tokenized_datasets["validation"], 
    batch_size=8, 
    collate_fn=data_collator
)

In [61]:
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

{'labels': torch.Size([8]),
 'input_ids': torch.Size([8, 81]),
 'token_type_ids': torch.Size([8, 81]),
 'attention_mask': torch.Size([8, 81])}

In [62]:
batch_count=0
for batch in train_dataloader:
    batch_count+=1

print(batch_count)

459


## Feeding to Model

In [63]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, 
                                                           num_labels=2)
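# `batch` is whatever the loop above left behind: the final training batch,
# which holds only 4 examples because 3668 is not a multiple of 8
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)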

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tensor(0.7701, grad_fn=<NllLossBackward0>) torch.Size([4, 2])


## Define Optimizer

In [64]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

## Learning rate scheduler

In [65]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)

1377
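
Both the step count and the shape of the schedule can be checked by hand. The sketch below is a plain Python approximation of a linear schedule with optional warmup; `linear_lr` is a hypothetical helper written for this illustration, not the library function.

```python
import math


def linear_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate at a given step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


steps_per_epoch = math.ceil(3668 / 8)   # training examples / batch size
total_steps = 3 * steps_per_epoch       # 3 epochs

start_lr = linear_lr(0, 5e-5, total_steps)
middle_lr = linear_lr(total_steps // 2, 5e-5, total_steps)
end_lr = linear_lr(total_steps, 5e-5, total_steps)

print(steps_per_epoch, total_steps)
print(start_lr, middle_lr, end_lr)
```

With no warmup, the learning rate starts at its full value and shrinks steadily until it reaches zero at the last training step.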


## The Training Loop

In [66]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

device(type='cuda')

In [67]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()    # puts the model in training mode --> enables dropout
for epoch in range(num_epochs):  # iterate through all epochs
    for batch in train_dataloader:  # iterate through mini-batches of data
        batch = {k: v.to(device) for k, v in batch.items()} # move batch to device
        outputs = model(**batch)  # FORWARD PASS
        loss = outputs.loss       # loss from the forward pass (labels are in the batch)
        loss.backward()           # BACKWARD PASS

        optimizer.step()          # update model weights
        lr_scheduler.step()       # adjusts the learning rate
        optimizer.zero_grad()     # resets gradient before next batch
        progress_bar.update(1)    # advances progress bar by 1 after each batch


## Evaluation

In [68]:
import evaluate

metric = evaluate.load("glue", "mrpc")   # loading metrics

model.eval()    # setting the model in evaluation mode now...
for batch in eval_dataloader:  # iterating through batches
    batch = {k: v.to(device) for k, v in batch.items()}   # sending data to device
    with torch.no_grad():    # this ensures that no grads are calculated
        outputs = model(**batch)

    logits = outputs.logits    # get logits
    predictions = torch.argmax(logits, dim=-1)   # get predictions
    
    # feeds this batch's preds and true labels to metric tracker
    # this will accumulate across batches
    metric.add_batch(predictions=predictions, references=batch["labels"]) 

# After all batches are processed, this computes the final evaluation results.
metric.compute()

{'accuracy': 0.8725490196078431, 'f1': 0.9103448275862069}
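
The `add_batch` / `compute` pattern accumulates counts over all batches before computing a score. Here is a toy accumulator in plain Python that follows the same pattern for accuracy only; `AccuracyTracker` is a hypothetical class written for this sketch and the predictions are made up.

```python
class AccuracyTracker:
    """Toy accumulator following the add_batch / compute pattern."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def add_batch(self, predictions, references):
        # keep running counts instead of a per-batch score
        self.correct += sum(p == r for p, r in zip(predictions, references))
        self.total += len(references)

    def compute(self):
        return {"accuracy": self.correct / self.total}


tracker = AccuracyTracker()
tracker.add_batch(predictions=[1, 0, 1], references=[1, 1, 1])   # 2 of 3 correct
tracker.add_batch(predictions=[0], references=[0])               # 1 of 1 correct
result = tracker.compute()
print(result)
```

Accumulating counts gives 3 correct out of 4. Averaging the two per-batch accuracies instead would give a different number, because the batches have different sizes, as the last evaluation batch often does.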

## Supercharge training loop with "Accelerate"

In [69]:
# from accelerate import Accelerator     # addition
# from torch.optim import AdamW
# from transformers import AutoModelForSequenceClassification, get_scheduler

# accelerator = Accelerator()            # addition

# model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
# optimizer = AdamW(model.parameters(), lr=3e-5)

# # addition
# train_dl, eval_dl, model, optimizer = accelerator.prepare(
#     train_dataloader, eval_dataloader, model, optimizer
# )

# num_epochs = 3
# num_training_steps = num_epochs * len(train_dl)
# lr_scheduler = get_scheduler(
#     "linear",
#     optimizer=optimizer,
#     num_warmup_steps=0,
#     num_training_steps=num_training_steps,
# )

# progress_bar = tqdm(range(num_training_steps))

# model.train()
# for epoch in range(num_epochs):
#     for batch in train_dl:
#         outputs = model(**batch)
#         loss = outputs.loss
#         accelerator.backward(loss)     # addition

#         optimizer.step()
#         lr_scheduler.step()
#         optimizer.zero_grad()
#         progress_bar.update(1)