<a href="https://colab.research.google.com/github/ucheokechukwu/courses/blob/main/HuggingFace_NLP_Course/3_Finetuning_a_pretrained_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

https://huggingface.co/learn/nlp-course/chapter3/1?fw=pt

In [1]:
!pip install -q transformers[sentencepiece]
!pip install -q datasets

# Processing the data

In [2]:
import torch
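# AdamW is imported from torch.optim; the version shipped in transformers is deprecated
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification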

# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# This is new
batch["labels"] = torch.tensor([1, 1])

optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Loading a dataset from the hub

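- fast tokenizers are backed by Rust and can process batches of texts in parallel
- the `map()` method applies a function to every example of a dataset (or to batches of examples with `batched=True`)
- datasets are stored as Apache Arrow files on disk and memory-mapped, which keeps RAM usage low
- other useful `Dataset` methods:

    * `remove_columns()`
    * `rename_columns()`
    * `select(range())` - generate a sample
    * `with_format('pt')`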

In [3]:
from datasets import load_dataset

raw_datasets = load_dataset('glue', 'mrpc')
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [4]:
raw_train_dataset = raw_datasets['train']
raw_train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

In [5]:
type(raw_train_dataset.features['label'])

datasets.features.features.ClassLabel

## Preprocessing a dataset

* `token_type_ids` distinguishes sentence 1 from sentence 2

In [6]:
inputs = tokenizer(
    "This is the first sentence.",
    "This is the second one."
)
inputs

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
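The pattern in `token_type_ids` follows from how BERT joins a sentence pair: `[CLS] sentence1 [SEP] sentence2 [SEP]` (ids 101 and 102 in the output above). A minimal plain-Python sketch, not the tokenizer's real code, that rebuilds the segment ids from the two token counts; the helper name `pair_token_type_ids` is made up for illustration.

```python
def pair_token_type_ids(num_tokens_a, num_tokens_b):
    """Segment ids for `[CLS] A [SEP] B [SEP]` (hypothetical helper)."""
    # [CLS], the tokens of A and the first [SEP] belong to segment 0
    segment_a = [0] * (1 + num_tokens_a + 1)
    # the tokens of B and the final [SEP] belong to segment 1
    segment_b = [1] * (num_tokens_b + 1)
    return segment_a + segment_b


# Both example sentences above are split into 6 tokens each.
ids = pair_token_type_ids(6, 6)
print(ids)  # eight 0s followed by seven 1s, as in the tokenizer output
```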

In [7]:
# so this will work...

from transformers import AutoTokenizer
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

tokenizer(
    raw_datasets['train']['sentence1'],
    raw_datasets['train']['sentence2'],
    padding=True,
    truncation=True
).keys()


dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

**Note:** this call returns a plain dictionary whose values are lists of lists, so the whole tokenized dataset has to fit in RAM. It is better to keep the data in a `Dataset`, which is stored in Apache Arrow files on disk, by applying the tokenizer with the `Dataset.map()` method.

In [8]:
def tokenizer_function(example):
    return tokenizer(
        example['sentence1'],
        example['sentence2'],
        truncation=True
    )

tokenized_datasets = raw_datasets.map(tokenizer_function, batched=True)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})
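With `batched=True`, the function passed to `map()` receives a dictionary of columns (each value is a list covering many examples) instead of a single example, which is why the tokenizer can work on whole lists of sentences at once. The sketch below only imitates that calling convention in plain Python; `toy_batched_map` and `fake_tokenize` are invented stand-ins and not part of the datasets library.

```python
def toy_batched_map(columns, function, batch_size=2):
    """Call `function` on slices of a dict of columns and add the columns it returns."""
    num_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, num_rows, batch_size):
        # each batch is a dict of lists, like the `example` argument above
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in function(batch).items():
            new_columns.setdefault(name, []).extend(values)
    return {**columns, **new_columns}


def fake_tokenize(batch):
    # stand-in for a real tokenizer: count whitespace-separated words per pair
    return {
        "num_words": [
            len(s1.split()) + len(s2.split())
            for s1, s2 in zip(batch["sentence1"], batch["sentence2"])
        ]
    }


data = {
    "sentence1": ["a b c", "d e", "f"],
    "sentence2": ["g h", "h i", "j"],
}
mapped = toy_batched_map(data, fake_tokenize)
print(mapped["num_words"])  # [5, 4, 2]
```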

## Dynamic padding - the collate function

The function that puts samples together into a batch is the *collate function*.

It is passed as an argument (`collate_fn`) when building a `DataLoader`.

transformers provides a class called `DataCollatorWithPadding`, which takes a tokenizer and does everything needed for dynamic padding: each batch is padded only to the length of its own longest sequence.

In [9]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# testing out this toy
samples = tokenized_datasets['train'][:8]

# remove the string features which are not needed
samples = {k: v for k, v in samples.items() if k not in ['idx', 'sentence1', 'sentence2']}

In [10]:
# note how the sequences have different lengths
[len(x) for x in samples['input_ids']]

[50, 59, 47, 67, 59, 50, 62, 32]

In [11]:
batch = data_collator(samples)
[len(x) for x in batch['input_ids']]

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


[67, 67, 67, 67, 67, 67, 67, 67]

In [12]:
batch.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
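The padding step can be imitated in plain Python to see what happens to the eight samples above. This is only a sketch of the idea, not how `DataCollatorWithPadding` is implemented; `pad_to_longest` is a made-up name, and the pad id of 0 is the `[PAD]` id of `bert-base-uncased`.

```python
def pad_to_longest(features, pad_token_id=0):
    """Pad every sequence in one batch to the length of the longest one in that batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # padded positions get a 0 so the model ignores them
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch


# dummy features with the same lengths as the eight samples above
lengths = [50, 59, 47, 67, 59, 50, 62, 32]
features = [{"input_ids": [1] * n, "attention_mask": [1] * n} for n in lengths]
padded = pad_to_longest(features)
print([len(x) for x in padded["input_ids"]])  # every sequence now has length 67
```

A different batch with a shorter longest sequence would be padded to that shorter length, which is what saves computation compared with padding the whole dataset to one global maximum.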

# Fine-Tuning a model with the Trainer API

The Trainer class helps with fine-tuning pre-trained models.

In [13]:
# %pip install transformers[torch] -q
# %pip install datasets -q
# %pip install accelerate>=0.20.1 -q
# %pip install evaluate -q

In [14]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

In [15]:
raw_datasets = load_dataset('glue', 'mrpc')
checkpoint = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(
        example['sentence1'],
        example['sentence2'],
        truncation=True
    )

tokenized_dataset = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


## Training

1. Step one: define a `TrainingArguments` object that contains all the hyperparameters for the `Trainer`.
Its only required argument is the directory where the trained model will be saved.

2. Step two: define the model.

3. Step three: define the `Trainer` by passing it all the objects built so far: the model, the `TrainingArguments`, the training and validation datasets, the `data_collator` and the tokenizer.

In [16]:
# 1. step 1


from transformers import TrainingArguments

try:
    training_args = TrainingArguments('test-trainer')
except:
    !pip install accelerate -U -q
    training_args = TrainingArguments('test-trainer')

# 2. define the model
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# 3. define the Trainer

from transformers import Trainer
trainer = Trainer(
    model,
    training_args,
    train_dataset = tokenized_dataset['train'],
    eval_dataset = tokenized_dataset['validation'],
    data_collator = data_collator,
    tokenizer = tokenizer
)

# fine tune the model
trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
500,0.5891
1000,0.4449


TrainOutput(global_step=1377, training_loss=0.46770091115692514, metrics={'train_runtime': 183.055, 'train_samples_per_second': 60.113, 'train_steps_per_second': 7.522, 'total_flos': 405324636337200.0, 'train_loss': 0.46770091115692514, 'epoch': 3.0})

## Evaluation
Evaluation can be included during training by passing a `compute_metrics()` function to the `Trainer` and setting `evaluation_strategy` in the `TrainingArguments` (newer releases of transformers name this argument `eval_strategy`).

After training, we can run `trainer.predict()` on the evaluation dataset.

The output of the `predict()` method is a named tuple with three fields: `predictions`, `label_ids`, and `metrics`.

* `predictions` - the logits for each element of the dataset,
* `label_ids` - the true labels,
* `metrics` - the loss and any metrics returned by the `compute_metrics()` function, plus some timing statistics

In [17]:
predictions = trainer.predict(tokenized_dataset['validation'])
predictions.predictions.shape, predictions.label_ids.shape

((408, 2), (408,))

In [20]:
# to get the predictions as 0, 1 i.e. in the same form as the labels...
import numpy as np
preds = np.argmax(predictions.predictions, axis=-1)

# now run the evaluation
try:
    import evaluate
except:
    !pip install evaluate -q
    import evaluate
metric = evaluate.load('glue', 'mrpc')
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.821078431372549, 'f1': 0.8768971332209106}
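To make the two numbers less of a black box: as far as I understand, the GLUE MRPC metric reports plain accuracy plus the F1 score of the positive class (label 1). Below is a plain-Python sketch of those two formulas on made-up toy predictions; `accuracy_and_f1` is a hypothetical helper, not the code inside `evaluate`.

```python
def accuracy_and_f1(predictions, references, positive_label=1):
    """Accuracy and binary F1 for the positive class (hypothetical helper)."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive_label and r == positive_label for p, r in pairs)
    fp = sum(p == positive_label and r != positive_label for p, r in pairs)
    fn = sum(p != positive_label and r == positive_label for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# toy data: 4 of 6 correct, 3 true positives, 1 false positive, 1 false negative
toy_preds = [1, 0, 1, 1, 0, 1]
toy_refs = [1, 0, 0, 1, 1, 1]
scores = accuracy_and_f1(toy_preds, toy_refs)
print(scores)
```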

In [21]:
def compute_metrics(eval_preds):
    metric = evaluate.load('glue', 'mrpc')
    logits, labels = eval_preds
    try:
        predictions = logits.argmax(-1)
    except:
        predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)


In [22]:
# define a new Trainer instance that evaluates at the end of every epoch
training_args = TrainingArguments('test-trainer',
                                  evaluation_strategy='epoch')
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset = tokenized_dataset['train'],
    eval_dataset = tokenized_dataset['validation'],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [23]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.470278,0.779412,0.830827
2,0.551100,0.661137,0.745098,0.842424
3,0.395800,0.584061,0.848039,0.891608


TrainOutput(global_step=1377, training_loss=0.42026507032030114, metrics={'train_runtime': 192.4259, 'train_samples_per_second': 57.186, 'train_steps_per_second': 7.156, 'total_flos': 405540469624800.0, 'train_loss': 0.42026507032030114, 'epoch': 3.0})

## Try It Out: Preprocessing and Fine Tuning the GLUE SST-2 dataset

https://huggingface.co/datasets/gimmaru/glue-sst2

### Preprocessing
- load the dataset
- create the tokenizer function
- map the dataset to create the tokenized-dataset
- add the collate function for dynamic padding


In [24]:
glue_dataset = load_dataset('gimmaru/glue-sst2', 'sst2')
glue_dataset

DatasetDict({
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
})

In [26]:
glue_dataset['validation'][0]

{'sentence': 'it gets onto the screen just about as much of the novella as one could reasonably expect , and is engrossing and moving in its own right . ',
 'label': 1,
 'idx': 726}

In [27]:
# create the tokenizer function
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


In [28]:
sample = glue_dataset['validation'][169]
tokenizer(sample['sentence'])

{'input_ids': [101, 2079, 2025, 2156, 2023, 2143, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

In [29]:
def glue_tokenizer_function(example):

    return tokenizer(example['sentence'],
                     truncation=True)

tokenized_glue_dataset = glue_dataset.map(glue_tokenizer_function,
                                          batched=True)

# inspect the tokenized dataset (the collator is created in the fine-tuning step)

tokenized_glue_dataset

DatasetDict({
    validation: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 872
    })
})

### Fine tuning with the Trainer class

- get the model
- get the collate function
- define the `TrainingArguments` - the output directory and `evaluation_strategy='epoch'`
- get the compute_metrics
- set up the Trainer
- run it

In [30]:
# model
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# collate function
from transformers import DataCollatorWithPadding
collate_function = DataCollatorWithPadding(tokenizer=tokenizer)

# training arguments
from transformers import TrainingArguments
training_args = TrainingArguments('glue_sentiment',
                                  evaluation_strategy='epoch'
                                  )

# compute metrics
import evaluate





def compute_metrics(eval_params):
    metric = evaluate.load('glue', 'sst2')
    logits, labels = eval_params
    predictions = logits.argmax(axis=-1)
    return metric.compute(predictions=predictions, references=labels)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [31]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset = tokenized_glue_dataset["validation"].select(range(600)),
    eval_dataset = tokenized_glue_dataset["validation"].select(range(600,872)),
    data_collator = collate_function,
    tokenizer=tokenizer,
    compute_metrics = compute_metrics
)
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.386634,0.852941
2,No log,0.456324,0.882353
3,No log,0.493048,0.897059


TrainOutput(global_step=225, training_loss=0.271107415093316, metrics={'train_runtime': 25.0161, 'train_samples_per_second': 71.954, 'train_steps_per_second': 8.994, 'total_flos': 38352547428960.0, 'train_loss': 0.271107415093316, 'epoch': 3.0})

In [32]:
tokenized_glue_dataset

DatasetDict({
    validation: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 872
    })
})

In [33]:
tokenized_glue_dataset['validation'].features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['negative', 'positive'], id=None),
 'idx': Value(dtype='int32', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

In [34]:
tokenized_dataset['train'].features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

In [35]:
predictions = trainer.predict(tokenized_glue_dataset["validation"].select(range(600,872)))

# to get the predictions as 0, 1 i.e. in the same form as the labels...
import numpy as np
preds = np.argmax(predictions.predictions, axis=-1)

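# now run the evaluation
# note: this loads the MRPC metric, which reports F1 alongside accuracy;
# evaluate.load('glue', 'sst2') would report accuracy only
import evaluate
metric = evaluate.load('glue', 'mrpc')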
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.8970588235294118, 'f1': 0.8955223880597015}

## A Full Training Loop with PyTorch

In [36]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset('glue', 'mrpc')
checkpoint = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example['sentence1'],
                     example['sentence2'],
                     truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Before writing the training loop, we need to create dataloaders to iterate over batches.

Before defining the dataloaders, the `tokenized_datasets` need some postprocessing, which the `Trainer` previously did for us automatically.

In [37]:
# remove columns that the model does not expect
tokenized_datasets = tokenized_datasets.remove_columns(['sentence1',
                                                        'sentence2',
                                                        'idx'])
# rename label to labels which matches the model's expectations
tokenized_datasets = tokenized_datasets.rename_column('label', 'labels')
# set the format so that pytorch tensors are returned not lists
tokenized_datasets.set_format('torch')
tokenized_datasets['train'].column_names


['labels', 'input_ids', 'token_type_ids', 'attention_mask']

In [39]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(tokenized_datasets['train'],
                              shuffle=True,
                              batch_size=8,
                              collate_fn=data_collator)
eval_dataloader = DataLoader(tokenized_datasets['validation'],
                             batch_size=8,
                             collate_fn=data_collator)

In [43]:
for batch in train_dataloader:
    print(type(batch))
    break
{k:v.shape for k,v in batch.items()}

<class 'transformers.tokenization_utils_base.BatchEncoding'>


{'labels': torch.Size([8]),
 'input_ids': torch.Size([8, 64]),
 'token_type_ids': torch.Size([8, 64]),
 'attention_mask': torch.Size([8, 64])}

In [44]:
batch['labels']

tensor([0, 1, 0, 1, 1, 1, 0, 1])

In [48]:
# instantiate the model
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [51]:
# checking by passing the batch into the model
outputs = model(**batch)
outputs, outputs.loss, outputs.logits.shape

(SequenceClassifierOutput(loss=tensor(0.8186, grad_fn=<NllLossBackward0>), logits=tensor([[ 0.2388, -0.3589],
         [ 0.2740, -0.3262],
         [ 0.2707, -0.3447],
         [ 0.2821, -0.3386],
         [ 0.2738, -0.3172],
         [ 0.2964, -0.3637],
         [ 0.2513, -0.3429],
         [ 0.2849, -0.3217]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None),
 tensor(0.8186, grad_fn=<NllLossBackward0>),
 torch.Size([8, 2]))

In [57]:
# get the optimizer and learning rate scheduler
# (AdamW is imported from torch.optim; the version in transformers is deprecated)
from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=5e-5)

In [59]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    'linear',
    optimizer=optimizer,
    num_warmup_steps = 0,
    num_training_steps=num_training_steps,
)
num_training_steps

1377
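Where 1377 comes from, and what the `'linear'` schedule does with it: the train split has 3668 examples, batches of 8 give 459 steps per epoch (the last batch is smaller), and 3 epochs give 1377 steps. The sketch below redoes that arithmetic and writes out the linear warmup/decay rule by hand; `linear_lr` is a made-up helper for illustration, not the scheduler object returned by `get_scheduler`.

```python
import math

num_train_examples = 3668  # size of the MRPC train split shown earlier
batch_size = 8
num_epochs = 3
base_lr = 5e-5

steps_per_epoch = math.ceil(num_train_examples / batch_size)  # last batch is smaller
num_training_steps = num_epochs * steps_per_epoch
print(num_training_steps)  # 1377


def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` updates: linear warmup, then linear decay to 0."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# learning rate at the start, around the middle and at the end of training
for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step, base_lr, num_training_steps))
```

The same arithmetic explains the `global_step=225` of the SST-2 run: 600 training examples in batches of 8 are 75 steps per epoch, times 3 epochs.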

In [63]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [64]:
device

'cuda'

In [None]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

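* Notice that we never define a loss function ourselves: when `labels` are included in the batch, the Hugging Face model computes the loss internally and returns it as `outputs.loss`. For this classification head it appears to be a cross-entropy loss, as the `NllLossBackward0` in the earlier output suggests.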
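As a sanity check on where `outputs.loss` comes from, the cross-entropy can be recomputed by hand from the logits printed earlier. This is a plain-Python sketch that assumes the labels printed in cell 44 belong to the same batch as those logits; `cross_entropy` is my own helper, not the model's code.

```python
import math

# logits as printed for the checked batch, labels as printed by batch['labels']
logits = [
    [0.2388, -0.3589],
    [0.2740, -0.3262],
    [0.2707, -0.3447],
    [0.2821, -0.3386],
    [0.2738, -0.3172],
    [0.2964, -0.3637],
    [0.2513, -0.3429],
    [0.2849, -0.3217],
]
labels = [0, 1, 0, 1, 1, 1, 0, 1]


def cross_entropy(logits, labels):
    """Mean negative log-probability of the correct class."""
    total = 0.0
    for row, label in zip(logits, labels):
        log_sum_exp = math.log(sum(math.exp(x) for x in row))
        total += log_sum_exp - row[label]
    return total / len(labels)


loss = cross_entropy(logits, labels)
print(round(loss, 4))  # should land close to the 0.8186 reported by the model
```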

In [65]:
from tqdm.auto import tqdm
p_bar = tqdm(range(num_training_steps))
model.train() # put model in training mode
for epoch in range(num_epochs):
    for batch in train_dataloader:
        # put tensors on the right device
        batch = {k:v.to(device) for k,v in batch.items()}
        # calculate the outputs
        outputs = model(**batch)
        # calculate the loss
        loss = outputs.loss
        # back propagate the loss
        loss.backward()
        # optimizer step
        optimizer.step()
        # learning rate scheduler step
        lr_scheduler.step()
        # optimizer zero gradient
        optimizer.zero_grad()

        p_bar.update(1)



### Evaluation

In [68]:
import evaluate
metric = evaluate.load('glue', 'mrpc')
model.eval() # put in evaluation mode

for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    # accumulate predictions batch by batch; compute() then scores all of them together
    metric.add_batch(predictions=predictions, references= batch['labels'])

metric.compute()

{'accuracy': 0.8602941176470589, 'f1': 0.9028960817717206}
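The `add_batch()` / `compute()` pattern stores predictions and references as they arrive and scores everything once at the end, which is not always the same as averaging per-batch scores. A toy imitation with plain accuracy follows; `ToyAccuracy` is an invented class for illustration and not how `evaluate` is implemented.

```python
class ToyAccuracy:
    """Imitates the add_batch()/compute() pattern with a plain accuracy metric."""

    def __init__(self):
        self.predictions = []
        self.references = []

    def add_batch(self, predictions, references):
        # only store the values; nothing is scored yet
        self.predictions.extend(predictions)
        self.references.extend(references)

    def compute(self):
        correct = sum(p == r for p, r in zip(self.predictions, self.references))
        return {"accuracy": correct / len(self.references)}


toy_metric = ToyAccuracy()
toy_metric.add_batch(predictions=[1, 0, 1, 1], references=[1, 0, 0, 1])  # 3 of 4 correct
toy_metric.add_batch(predictions=[0, 1], references=[1, 1])              # 1 of 2 correct
result = toy_metric.compute()

# averaging the two batch accuracies would weight the small batch too heavily
mean_of_batches = (3 / 4 + 1 / 2) / 2
print(result["accuracy"], mean_of_batches)
```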

## Accelerate training loop with Accelerate

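By importing and instantiating an `Accelerator()` object and passing the model, optimizer and dataloaders through its `prepare()` method, the same training loop can run on different hardware setups (CPU, a single GPU, several GPUs or TPUs), while you still retain control over the loop.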

In [70]:
try:
    from accelerate import Accelerator
except:
    !pip install -q accelerate
    from accelerate import Accelerator

In [76]:
# named `accelerator` so that the `accelerate` module name is not shadowed
accelerator = Accelerator()
model_acc, optimizer_acc, train_dataloader_acc = accelerator.prepare(model, optimizer, train_dataloader)

In [77]:
type(optimizer_acc), type (optimizer), type(model_acc), type (model), type(train_dataloader_acc), type (train_dataloader)

(accelerate.optimizer.AcceleratedOptimizer,
 torch.optim.adamw.AdamW,
 transformers.models.bert.modeling_bert.BertForSequenceClassification,
 transformers.models.bert.modeling_bert.BertForSequenceClassification,
 accelerate.data_loader.DataLoaderShard,
 torch.utils.data.dataloader.DataLoader)

### Two methods of implementing Accelerate

1. from the terminal with `accelerate config`, then `accelerate launch train.py`

2. from a notebook with `accelerate.notebook_launcher(training_function)`

In [90]:
%%writefile train.py
from accelerate import Accelerator
accelerator = Accelerator()
from transformers import AutoModelForSequenceClassification, AdamW, get_scheduler
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)
train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

num_epochs=3
num_training_steps = len(train_dataloader) * num_epochs
lr_scheduler = get_scheduler(
    'linear',
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

pbar = tqdm(range(num_training_steps))
model.train()
for epoch in range(num_epochs):
    for batch in train_dl:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        pbar.update(1)



Overwriting train.py


In [82]:
!accelerate config

----------------------------------------------------------------------------------------------------
In which compute environment are you running?
Please input a choice index (starting from 0), and press enter
 ➔  This machine
    AWS (Amazon SageMaker)
0
This machine
----------------------------------------------------------------------------------------------------
Which type of machine are you using?
Please input a choice index (starting from 0), and press enter
 ➔  No distributed training
    multi-CPU
    multi-XPU
    multi-GPU
    multi-NPU
    TPU
3
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 


In [83]:
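# Fails as written (see the traceback below): `accelerate launch` runs train.py in a
# separate Python process, so names defined in this notebook (checkpoint,
# train_dataloader, eval_dataloader, torch, tqdm) do not exist there.
# !accelerate launch train.py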

Traceback (most recent call last):
  File "/content/train.py", line 4, in <module>
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
NameError: name 'checkpoint' is not defined. Did you mean: 'breakpoint'?
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 994, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 636, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train.py']' returned non-zero exit status 1.


In [91]:
def training_function():
    from accelerate import Accelerator
    accelerator = Accelerator()
    from transformers import AutoModelForSequenceClassification, get_scheduler
    from torch.optim import AdamW  # the AdamW in transformers is deprecated
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    optimizer = AdamW(model.parameters(), lr=3e-5)
    train_dl, eval_dl, model, optimizer = accelerator.prepare(
        train_dataloader, eval_dataloader, model, optimizer)
    # no manual device placement is needed: accelerator.prepare() handles it

    num_epochs=3
    # use the prepared dataloader, whose length is per process in a distributed run
    num_training_steps = len(train_dl) * num_epochs
    lr_scheduler = get_scheduler(
        'linear',
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps
    )

    pbar = tqdm(range(num_training_steps))
    model.train()
    for epoch in range(num_epochs):
        for batch in train_dl:
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)

            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            pbar.update(1)
    return model



In [94]:
from accelerate import notebook_launcher

model = notebook_launcher(training_function)

Launching training on one GPU.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

