In [1]:
!pip install transformers
!pip install datasets
!pip install accelerate



**DataCollatorWithPadding:**

`DataCollatorWithPadding` is a class provided by the Hugging Face Transformers library. It collates tokenized examples into a batch and pads them to a common length.

**Purpose:**

The `data_collator` instance created below is used during training to assemble batches. By default it pads every sequence in a batch to the length of the longest sequence in that batch (dynamic padding), using the tokenizer's padding token, so the batch can be stacked into a single tensor.
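
To make the idea of dynamic padding concrete, here is a minimal pure-Python sketch. It is not how `DataCollatorWithPadding` is implemented; the helper `pad_batch` and the token ids are made up for illustration, and `pad_id=0` is an assumed padding id.

```python
def pad_batch(sequences, pad_id=0):
    """Pad lists of token ids to the longest length in the batch (hypothetical helper)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two illustrative sequences of different lengths.
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])       # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Because the target length is chosen per batch, short batches are not padded up to the longest sequence in the whole dataset.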


In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


# Prepare for Training

Before writing our training loop, we need to define a few objects. The first are the dataloaders we will use to iterate over batches. Before we can define them, we need to apply a bit of postprocessing to our `tokenized_datasets`, to take care of some things that the `Trainer` did for us automatically. Specifically, we need to:

1. Remove the columns corresponding to values the model does not expect (like the `sentence1` and `sentence2` columns).
2. Rename the column `label` to `labels` (because the model expects the argument to be named `labels`).
3. Set the format of the datasets so they return PyTorch tensors instead of lists.

Our `tokenized_datasets` has one method for each of those steps:


In [3]:
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

['labels', 'input_ids', 'token_type_ids', 'attention_mask']

**DataLoader Class:**

The DataLoader class is part of the PyTorch library and is used for creating data loaders. It efficiently loads and batches data during training or evaluation of machine learning models.

**Purpose and Usage:**

The primary purpose of DataLoader is to create an iterable over a dataset. It provides an efficient way to load data in batches, shuffle the data, and apply transformations.

**Parameters:**

The DataLoader class takes several important parameters:
- `dataset`: The dataset object; here, a split of the Hugging Face `tokenized_datasets`.
- `batch_size`: The number of samples in each batch.
- `shuffle`: Whether to reshuffle the data at the start of every epoch before creating batches.
- `collate_fn`: An optional function that merges a list of individual samples into a batch.

**Benefits of Using DataLoader:**

- **Efficient loading:** DataLoader can load data in parallel with multiple worker processes when `num_workers` is greater than 0. With the default `num_workers=0`, data is loaded in the main process.
- **Batching:** It automatically groups samples into batches.
- **Shuffling:** If `shuffle=True`, DataLoader reshuffles the data at every epoch.
- **Custom batch assembly:** `collate_fn` controls how samples are combined into a batch. Here it is the `data_collator`, which pads each batch.
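
The roles of `batch_size`, `shuffle`, and `collate_fn` can be sketched in a few lines of plain Python. This is only a conceptual illustration, not PyTorch's implementation; `iterate_batches` and its `seed` argument are hypothetical names introduced for this example.

```python
import random


def iterate_batches(dataset, batch_size, shuffle=False, collate_fn=None, seed=None):
    """Yield batches from an indexable dataset, loosely mimicking DataLoader."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)  # new sample order for this pass
    for start in range(0, len(indices), batch_size):
        samples = [dataset[i] for i in indices[start:start + batch_size]]
        # collate_fn turns a list of samples into a single batch object.
        yield collate_fn(samples) if collate_fn else samples


toy_dataset = list(range(10))
batches = list(iterate_batches(toy_dataset, batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Note that the last batch is smaller than `batch_size`. `DataLoader` behaves the same way unless `drop_last=True` is set.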


In [4]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

To quickly check there is no mistake in the data processing, we can inspect a batch like this:



In [5]:
# Grab a single batch and inspect the shape of each tensor.
batch = next(iter(train_dataloader))
{k: v.shape for k, v in batch.items()}

{'labels': torch.Size([8]),
 'input_ids': torch.Size([8, 75]),
 'token_type_ids': torch.Size([8, 75]),
 'attention_mask': torch.Size([8, 75])}

# Instantiating the model


In [6]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


To make sure that everything will go smoothly during training, we pass our batch to this model:



In [7]:
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

tensor(0.7130, grad_fn=<NllLossBackward0>) torch.Size([8, 2])
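
As a rough sanity check on the loss value above, a classifier that has learned nothing yet should be close to chance level. The short calculation below is an illustrative sketch of that baseline for a two-class problem.

```python
import math

num_labels = 2
# A classifier with no information assigns probability 1/num_labels to each class,
# so its cross-entropy loss is -log(1/num_labels) = log(num_labels).
chance_loss = math.log(num_labels)
print(round(chance_loss, 4))  # 0.6931
```

The printed loss of 0.7130 is near this value, which is what one would expect from a freshly initialized classification head.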


**Initializing the Optimizer:**

The next line `optimizer = AdamW(model.parameters(), lr=5e-5)` initializes an instance of the AdamW optimizer. Here’s what each part does:

- `model.parameters()`: This provides the trainable parameters (weights and biases) of the model instantiated above.
- `lr=5e-5`: This sets the learning rate for the optimizer to 5e-5 (which is equivalent to 0.00005).


The resulting optimizer instance is used during training to update the model’s parameters based on the gradients computed during backpropagation. The learning rate scales the size of each update step. Smaller learning rates generally make training more stable but slower to converge.
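
To show how the learning rate enters the update, here is a simplified sketch of one AdamW step for a single scalar parameter. It follows the standard AdamW update rule with decoupled weight decay; the function `adamw_step` is a hypothetical helper, and the hyperparameter defaults other than `lr` are assumptions chosen for illustration.

```python
import math


def adamw_step(param, grad, m, v, t, lr=5e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One AdamW update for a single scalar parameter (t starts at 1)."""
    param = param * (1 - lr * weight_decay)   # decoupled weight decay
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


w, m, v = 1.0, 0.0, 0.0
w, m, v = adamw_step(w, grad=0.3, m=m, v=v, t=1)
print(1.0 - w)  # close to the learning rate, 5e-5
```

On the first step the bias-corrected ratio is approximately 1 regardless of the gradient's magnitude, so the parameter moves by about `lr`.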




In [8]:
# transformers.AdamW is deprecated and unavailable in recent releases; PyTorch's AdamW replaces it.
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)



**Importing the Necessary Function:**

The line `from transformers import get_scheduler` imports a function called `get_scheduler` from the transformers library. This function is used to create a learning rate scheduler for training neural network models.

**Setting Up Variables:**

- `num_epochs = 3`: This variable represents the total number of training epochs. An epoch is a complete pass through the entire training dataset.
- `num_training_steps = num_epochs * len(train_dataloader)`: Here, we calculate the total number of training steps based on the number of epochs and the length of the training data loader (`train_dataloader`). Each training step corresponds to one batch of data processed during training.
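
The step count can be worked out by hand. The sketch below assumes the 3,668 examples of the MRPC training split and the batch size of 8 used for `train_dataloader`.

```python
import math

num_train_examples = 3668  # size of the MRPC training split
batch_size = 8
num_epochs = 3

# DataLoader keeps the final, smaller batch by default (drop_last=False),
# so the number of batches per epoch is rounded up.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
num_training_steps = num_epochs * steps_per_epoch
print(steps_per_epoch, num_training_steps)  # 459 1377
```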

**Creating the Learning Rate Scheduler:**

`lr_scheduler = get_scheduler(...)`: This line initializes a learning rate scheduler using the `get_scheduler` function. The function takes several arguments:

- `"linear"`: The type of scheduler. In this case, it’s a linear scheduler.
- `optimizer`: The optimizer whose learning rate will be scheduled; here, the AdamW instance created above.
- `num_warmup_steps=0`: The number of warm-up steps. Warm-up steps gradually increase the learning rate from zero to its initial value. Setting it to zero means no warm-up.
- `num_training_steps=num_training_steps`: The total number of training steps (calculated earlier based on epochs and data loader length).

**Linear Learning Rate Schedule:**

The "linear" scheduler decreases the learning rate linearly from its initial value to zero over the course of training. It’s a simple and commonly used schedule. During the warm-up phase (if specified), the learning rate gradually increases from zero to its initial value. After the warm-up, the learning rate decreases linearly as training progresses.
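
The schedule described above can be written as a multiplier on the initial learning rate. The following is a sketch of that rule in plain Python rather than the library's own code; `linear_schedule_factor` is a hypothetical name.

```python
def linear_schedule_factor(step, num_training_steps, num_warmup_steps=0):
    """Multiplier applied to the initial learning rate at a given step."""
    if step < num_warmup_steps:
        # Warm-up: ramp up linearly from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Decay: ramp down linearly from 1 to 0 at the final step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


initial_lr = 5e-5
total_steps = 1377
for step in (0, 459, 918, 1377):
    print(step, initial_lr * linear_schedule_factor(step, total_steps))
```

With no warm-up, the learning rate starts at 5e-5, reaches two thirds and one third of that value at the end of the first and second epochs, and hits zero at step 1377.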


In [9]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)

1377


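# Training Loop

One last thing: we will want to use the GPU if we have access to one (on a CPU, training might take several hours instead of a couple of minutes). Rather than defining a device and moving the model and every batch to it by hand, we let 🤗 Accelerate handle placement: `accelerator.prepare()` puts the dataloaders, model, and optimizer on the available device.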

In [10]:
from accelerate import Accelerator

accelerator = Accelerator()
# prepare() returns the objects in the same order they were passed in.
train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

In [11]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

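model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        # Forward pass; the model returns a loss because the batch contains `labels`.
        outputs = model(**batch)
        loss = outputs.loss
        # accelerator.backward() takes the place of the usual loss.backward().
        accelerator.backward(loss)

        optimizer.step()       # update the weights
        lr_scheduler.step()    # advance the learning rate schedule
        optimizer.zero_grad()  # reset gradients for the next batch
        progress_bar.update(1)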


# The evaluation loop

As we did earlier, we will use a metric provided by the 🤗 Evaluate library. We’ve already seen the `metric.compute()` method, but metrics can also accumulate batches for us as we go over the prediction loop with the `add_batch()` method. Once all the batches have been accumulated, we get the final result with `metric.compute()`. Here’s how to implement all of this in an evaluation loop:

In [13]:
!pip install evaluate
import torch
import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()



{'accuracy': 0.8480392156862745, 'f1': 0.8934707903780068}
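
To illustrate what accumulating with `add_batch()` and finishing with `compute()` amounts to, here is a small stand-in written in plain Python. The class `BinaryMetric` is hypothetical and is not the 🤗 Evaluate implementation; it assumes binary labels with 1 as the positive class.

```python
class BinaryMetric:
    """Accumulates predictions batch by batch, then computes accuracy and F1."""

    def __init__(self):
        self.predictions, self.references = [], []

    def add_batch(self, predictions, references):
        # Store the batch; nothing is computed until compute() is called.
        self.predictions.extend(predictions)
        self.references.extend(references)

    def compute(self):
        pairs = list(zip(self.predictions, self.references))
        tp = sum(p == 1 and r == 1 for p, r in pairs)  # true positives
        fp = sum(p == 1 and r == 0 for p, r in pairs)  # false positives
        fn = sum(p == 0 and r == 1 for p, r in pairs)  # false negatives
        accuracy = sum(p == r for p, r in pairs) / len(pairs)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        return {"accuracy": accuracy, "f1": f1}


toy_metric = BinaryMetric()
toy_metric.add_batch(predictions=[1, 0, 1, 1], references=[1, 0, 0, 1])
toy_metric.add_batch(predictions=[0, 1], references=[1, 1])
result = toy_metric.compute()
print(result)
```

In this toy run, four of the six predictions are correct, and there are three true positives, one false positive, and one false negative, giving an F1 of 0.75.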