# CUDA out of memory?
No worries. I'll show you some tricks so that you cudn't run out of memory.

Let's first connect to a runtime with a T4 GPU. Turn off the High-RAM option. After that, we install and import the packages.

In [None]:
!pip install datasets

import torch
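from transformers import BertTokenizer, BertModel, BertForSequenceClassification, BertConfig
from torch.optim import AdamW  # the AdamW that shipped with transformers is deprecated and gone from newer releases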
from datasets import load_dataset
import pdb
import time
import gc
from tqdm import tqdm



# Set device
Let's make sure you CUDA use GPU.

In [None]:
# Check if a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


# Load model
Time to BERT onto the scene. A large one, of course.

In [None]:
model_name = "bert-large-uncased"
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2).to(device)  # SST-2 is binary: negative/positive
tokenizer = BertTokenizer.from_pretrained(model_name)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




# Data Preprocessing
Wanna know how people are feeling? We are using a sentiment analysis dataset for our demo: SST-2.

Click [here](https://nlp.stanford.edu/sentiment/) to learn more about SST-2.

In [None]:
dataset = load_dataset("glue", "sst2")
train_dataset = dataset["train"]

Your model can't understand the input unless you tokenize it. Here's how you do it.

In [None]:
def preprocess(batch):
    return tokenizer(batch["sentence"], padding=True, truncation=True, max_length=128)
train_dataset = train_dataset.map(preprocess, batched=True, batch_size=len(train_dataset))

# transform data to pytorch tensors
train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])



# Start training
I know you are excited about training the model, but first, you gotta put the model in training mode and initialize the optimizer.

In [None]:
model.train()
optimizer = AdamW(model.parameters(), lr=2e-5)




Let's set ourselves up for failure. Use a dataloader with a batch size of 128.

In [None]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=128, shuffle=True)

I bet you can't even finish 10 batches with that big of a batch size.

In [None]:
for batch_id, batch in enumerate(train_loader):
  # stop before the 11th batch
  if batch_id >= 10:
      break
  print(f"Training batch {batch_id} now")

  # Move inputs and labels to the GPU
  inputs = batch["input_ids"].to(device)
  attention_mask = batch["attention_mask"].to(device)
  labels = batch["label"].to(device)

  # Forward pass
  optimizer.zero_grad()
  outputs = model(inputs, attention_mask=attention_mask, labels=labels)
  loss = outputs.loss
  loss.backward()  # Backward pass
  optimizer.step()

Training batch 0 now
Training batch 1 now


OutOfMemoryError: CUDA out of memory. Tried to allocate 34.00 MiB. GPU 
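
A quick back-of-the-envelope estimate helps explain the error. This is only a sketch under stated assumptions: FP32 training with AdamW, a `bert-large-uncased` checkpoint of roughly 1.34 GB, and a T4 with 16 GB of memory. The variable names are made up for illustration, and activations (the part that grows with batch size) are not counted.

```python
# Rough static memory budget for FP32 training with AdamW (activations excluded).
# Assumptions: ~1.34 GB FP32 checkpoint and 16 GB of GPU memory on a T4.
BYTES_PER_FP32 = 4
checkpoint_bytes = 1.34e9
gpu_bytes = 16e9

num_params = checkpoint_bytes / BYTES_PER_FP32        # about 335 million weights
weight_bytes = num_params * BYTES_PER_FP32
grad_bytes = num_params * BYTES_PER_FP32              # one gradient per weight
adamw_state_bytes = 2 * num_params * BYTES_PER_FP32   # first and second moment per weight

static_bytes = weight_bytes + grad_bytes + adamw_state_bytes
static_gb = static_bytes / 1e9
activation_budget_gb = (gpu_bytes - static_bytes) / 1e9

print(f"parameters: {num_params / 1e6:.0f}M")
print(f"weights + gradients + AdamW state: {static_gb:.2f} GB")
print(f"left for activations: {activation_budget_gb:.2f} GB")
```

So about a third of the card is likely gone before a single activation is stored. AdamW also creates its moment buffers on the first `optimizer.step()`, which is one plausible reason the first batch got through and the second did not.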

## Reduce the batch size
What did I just say? You couldn't even finish the second batch. Let's wipe the slate clean before we do anything else. Disconnect and delete the runtime, then reconnect to a runtime with a T4 GPU. Turn off the High-RAM option. Then rerun the setup code below.

In [None]:
# Install datasets
!pip install datasets

# Import packages
import torch
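from transformers import BertTokenizer, BertModel, BertForSequenceClassification, BertConfig
from torch.optim import AdamW  # the AdamW that shipped with transformers is deprecated and gone from newer releases
from datasets import load_dataset
import pdb
import time
import gc
from tqdm import tqdm  # needed again for the progress bar in the training loop below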

# Check if a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Reinitialize the model
model_name = "bert-large-uncased"
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2).to(device)  # SST-2 is binary: negative/positive
tokenizer = BertTokenizer.from_pretrained(model_name)

# reload the data
dataset = load_dataset("glue", "sst2")
train_dataset = dataset["train"]

# Tokenize the data
def preprocess(batch):
    return tokenizer(batch["sentence"], padding=True, truncation=True, max_length=128)
train_dataset = train_dataset.map(preprocess, batched=True, batch_size=len(train_dataset))

# transform data to pytorch tensors
train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

Using device: cuda


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


This time, we use a smaller batch size of 16 instead.

In [None]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True)

Let's scale the learning rate down by the same factor as the batch size. In this example, the factor is 128/16 = 8, so the learning rate goes from 2e-5 to 2.5e-6.

Note that this is not guaranteed to be the optimal learning rate. It's just a good place to start. You still need to tune the learning rate with the new batch size.
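
The arithmetic is small enough to write down. Here is a minimal sketch of that linear scaling; the helper name `scale_learning_rate` is hypothetical and not part of any library.

```python
def scale_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Linear scaling: change the learning rate by the same factor as the batch size."""
    return base_lr * new_batch_size / base_batch_size


# Going from a batch size of 128 down to 16 divides the learning rate by 8.
scaled_lr = scale_learning_rate(base_lr=2e-5, base_batch_size=128, new_batch_size=16)
print(scaled_lr)
```

Treat the result as a starting point for tuning, not as a tuned value.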

In [None]:
model.train()
optimizer = AdamW(model.parameters(), lr=2.5e-6)

Now you should be able to train 10 batches easily.

In [None]:
# Initialize the progress bar
progress_bar = tqdm(train_loader)

# initialize running loss
running_loss = 0.0

print("Start training")

for batch_id, batch in enumerate(progress_bar):
  # Move inputs and labels to the GPU
  inputs = batch["input_ids"].to(device)
  attention_mask = batch["attention_mask"].to(device)
  labels = batch["label"].to(device)

  # Forward pass
  optimizer.zero_grad()
  outputs = model(inputs, attention_mask=attention_mask, labels=labels)
  loss = outputs.loss
  loss.backward()  # Backward pass
  optimizer.step()

  # Accumulate loss for tracking
  running_loss += loss.item()

  # Update the tqdm progress bar with the current loss
  progress_bar.set_postfix({'loss': running_loss / (batch_id + 1)})

  0%|          | 0/4210 [00:00<?, ?it/s]

Start training


  1%|          | 35/4210 [00:22<45:07,  1.54it/s, loss=0.903]


KeyboardInterrupt: 

### Use gradient accumulation
I've shown you that big batches don't fit in memory, but small batches can be problematic too, because they give noisier gradient estimates.

Actually, if you accumulate the gradients over 8 mini-batches of size 16 before each model update, you get an effective batch size of 8 × 16 = 128, just like before.
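
If you want to convince yourself that this is more than a trick, here is a toy check in plain Python. It assumes a one-parameter least-squares model (nothing to do with BERT) and made-up data, and it shows that summing the gradients of 8 mini-batches, each divided by 8, reproduces the gradient of the full batch of 128.

```python
import math


def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# 128 made-up training examples and an arbitrary starting weight
xs = [float(i) for i in range(1, 129)]
ys = [3.0 * x + 1.0 for x in xs]
w = 0.5

# Gradient computed on the whole batch of 128 at once
full_batch_grad = mse_grad(w, xs, ys)

# Same gradient, built up from 8 mini-batches of 16
accumulation_steps = 8
mini_batch_size = len(xs) // accumulation_steps
accumulated_grad = 0.0
for step in range(accumulation_steps):
    start = step * mini_batch_size
    stop = start + mini_batch_size
    # Dividing by accumulation_steps plays the role of `loss / accumulation_steps`
    accumulated_grad += mse_grad(w, xs[start:stop], ys[start:stop]) / accumulation_steps

print(math.isclose(full_batch_grad, accumulated_grad, rel_tol=1e-9))
```

The match is exact (up to rounding) here because every mini-batch has the same size and the loss is a mean over examples.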

In [None]:
# Set gradient accumulation steps
accumulation_steps = 8  # Accumulate over 8 mini-batches of 16 for an effective batch size of 128

Now we can train the model with a large effective batch size of 128 and still keep it within the available memory.

In [None]:
# Start from clean gradients in case the interrupted run left any behind
optimizer.zero_grad()

# Training loop
for batch_id, batch in enumerate(train_loader):
    # Stop at the 11th batch
    if batch_id >= 10:
        break
    print(f"Training batch {batch_id} now")

    # Move inputs and labels to the GPU
    inputs = batch["input_ids"].to(device)
    attention_mask = batch["attention_mask"].to(device)
    labels = batch["label"].to(device)

    # Forward pass
    outputs = model(inputs, attention_mask=attention_mask, labels=labels)
    loss = outputs.loss

    # Normalize loss for gradient accumulation
    loss = loss / accumulation_steps

    # Backward pass
    loss.backward()

    # Step the optimizer every `accumulation_steps`
    if (batch_id + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()  # Clear gradients

Training batch 0 now
Training batch 1 now
Training batch 2 now
Training batch 3 now
Training batch 4 now
Training batch 5 now
Training batch 6 now
Training batch 7 now
Training batch 8 now
Training batch 9 now


## Mixed precision training
Saving memory is like saving money. All you need to do is knock off a bunch of 0s: mixed precision runs much of the computation in 16-bit floats (FP16) instead of 32-bit ones. In this case, PyTorch does it for you automatically. Just import the tools.

In [None]:
# Import from torch.amp on recent PyTorch releases (GradScaler is exposed there from roughly 2.3 onward)
from torch.amp import autocast, GradScaler

# In earlier versions of PyTorch, import from torch.cuda.amp instead
# from torch.cuda.amp import autocast, GradScaler

Wipe the slate clean before you start over, as always.

In [None]:
# delete the variables that take up the most memory
del optimizer
del train_loader
del batch
del inputs
del attention_mask
del labels
del outputs
del loss

# Clear cache before measuring
gc.collect()
torch.cuda.empty_cache()
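
If you want to check how much the cleanup actually freed, a small helper along these lines can print the numbers. It is a sketch: `gpu_memory_report` is a hypothetical name, and it simply wraps PyTorch's `torch.cuda.memory_allocated`, `torch.cuda.memory_reserved`, and `torch.cuda.mem_get_info`.

```python
import torch


def gpu_memory_report(device_index=0):
    """Return GPU memory figures in MiB, or None when no CUDA device is available."""
    if not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    free_bytes, total_bytes = torch.cuda.mem_get_info(device_index)
    return {
        # Memory currently occupied by live tensors
        "allocated_mib": torch.cuda.memory_allocated(device_index) / mib,
        # Memory held by PyTorch's caching allocator (live tensors plus cached blocks)
        "reserved_mib": torch.cuda.memory_reserved(device_index) / mib,
        # What the driver reports as free and total on the device
        "free_mib": free_bytes / mib,
        "total_mib": total_bytes / mib,
    }


print(gpu_memory_report())
```

Calling it before and after `torch.cuda.empty_cache()` shows the reserved figure drop while the allocated figure stays put, since `empty_cache()` only returns cached blocks that no tensor is using.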


Now, reinitialize the dataloader with a batch size of 128 and reset the learning rate in the optimizer.

In [None]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=128, shuffle=True)
optimizer = AdamW(model.parameters(), lr=2e-5)

Initialize the gradient scaler. Do NOT skip this: the scaler multiplies the loss by a large factor before the backward pass so that small gradients don't underflow to zero in FP16.
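
To see the underflow problem without touching the GPU, here is a toy illustration using only the standard library, whose `struct` module can round a number to half precision. The gradient value and the scale factor are assumptions picked for the demo (65536 is also the default initial scale of `GradScaler`), and `to_fp16` is a hypothetical helper.

```python
import struct


def to_fp16(value):
    """Round a Python float to IEEE 754 half precision and convert it back."""
    return struct.unpack("e", struct.pack("e", value))[0]


tiny_grad = 1e-8       # made-up gradient, smaller than anything FP16 can represent
loss_scale = 65536.0   # 2 ** 16, chosen for illustration

unscaled = to_fp16(tiny_grad)             # stored directly in FP16: the value is lost
scaled = to_fp16(tiny_grad * loss_scale)  # scaled first: the value survives in FP16
recovered = scaled / loss_scale           # unscaled again in higher precision

print(unscaled, scaled, recovered)
```

This is the idea behind scaling the loss before `backward()` and unscaling the gradients before the optimizer step.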

In [None]:
scaler = GradScaler()

Now PyTorch will throw that FP16 in the mix while you are training the model.

In [None]:
for batch_id, batch in enumerate(train_loader):
    # Stop at the 11th batch
    if batch_id >= 10:
        break
    print(f"Training batch {batch_id} now")

    # Move inputs and labels to the GPU
    inputs = batch["input_ids"].to(device)
    attention_mask = batch["attention_mask"].to(device)
    labels = batch["label"].to(device)

    # Forward pass with autocast for mixed precision
    optimizer.zero_grad()
    with autocast(device_type='cuda', dtype=torch.float16):  # Specify device type and dtype
        outputs = model(inputs, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss

    # Backward pass with scaled loss
    scaler.scale(loss).backward()

    # Update model parameters with scaled gradients
    scaler.step(optimizer)

    # Update the scaler for the next iteration
    scaler.update()

Training batch 0 now
Training batch 1 now
Training batch 2 now
Training batch 3 now
Training batch 4 now
Training batch 5 now
Training batch 6 now
Training batch 7 now
Training batch 8 now
Training batch 9 now


## Gradient checkpointing
I know there's gradient checkpointing in Hugging Face, but if you want to customize the trade-off between memory and speed, I still recommend using PyTorch. It allows you to checkpoint any particular layer you want: the checkpointed layer's activations are not stored during the forward pass and are recomputed during the backward pass.

We define a new BERT model to checkpoint the self-attention and intermediate layers of BERT, which are the most memory-consuming layers in my experience.
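
Before wiring this into BERT, it may help to see `torch.utils.checkpoint.checkpoint` on something tiny. The sketch below uses made-up modules (`TinyBlock`, `TinyStack`) rather than real BERT layers, runs on the CPU, and checks that gradients with and without checkpointing agree. It assumes a PyTorch version that accepts the `use_reentrant` argument.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)


class TinyBlock(nn.Module):
    """Hypothetical stand-in for a transformer feed-forward block."""

    def __init__(self, hidden_size=32):
        super().__init__()
        self.feed_forward = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, hidden_states):
        return hidden_states + self.feed_forward(hidden_states)


class TinyStack(nn.Module):
    def __init__(self, num_layers=4, hidden_size=32):
        super().__init__()
        self.blocks = nn.ModuleList([TinyBlock(hidden_size) for _ in range(num_layers)])
        self.use_checkpoint = True

    def forward(self, hidden_states):
        for block in self.blocks:
            if self.use_checkpoint:
                # Do not keep this block's activations; recompute them during backward
                hidden_states = checkpoint(block, hidden_states, use_reentrant=False)
            else:
                hidden_states = block(hidden_states)
        return hidden_states


def gradients(model, inputs):
    """Run one forward/backward pass and return a copy of every parameter gradient."""
    model.zero_grad()
    model(inputs).pow(2).mean().backward()
    return [p.grad.clone() for p in model.parameters()]


model = TinyStack()
inputs = torch.randn(8, 32)

model.use_checkpoint = True
grads_checkpointed = gradients(model, inputs)
model.use_checkpoint = False
grads_plain = gradients(model, inputs)

gradients_match = all(
    torch.allclose(a, b, atol=1e-6) for a, b in zip(grads_checkpointed, grads_plain)
)
print("gradients match:", gradients_match)
```

The result is the same either way; what changes is that the checkpointed run pays for a second forward pass through each block in exchange for not storing its intermediate activations.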

In [None]:
from torch.utils.checkpoint import checkpoint

class CheckpointedIntermediateBertForSequenceClassification(BertForSequenceClassification):
    def __init__(self, config):
        super().__init__(config)

    def forward(self, input_ids=None, attention_mask=None, labels=None):
        # add breakpoint for the debugger
        # pdb.set_trace()

        # Embedding layers (unchanged)
        embedding_output = self.bert.embeddings(input_ids=input_ids)

        # Manually handle the transformer layers (BertEncoder)
        hidden_states = embedding_output
        for layer in self.bert.encoder.layer:
            # add breakpoint for the debugger
            pdb.set_trace()
            # use checkpoint for self-attention
            # attention_output = checkpoint(
            #     layer.attention,  # Checkpoint only the attention layer
            #     hidden_states,
            #     attention_mask
            # )

            # use regular self-attention without gradient checkpointing
            attention_output = layer.attention(hidden_states, attention_mask)

            # Check if attention_output is a tuple and extract the hidden states if it is
            if isinstance(attention_output, tuple):
                attention_output = attention_output[0]  # Assuming the first element is the hidden state tensor

            hidden_states = attention_output  # layer.attention already applies its output projection, residual, and LayerNorm

            # Apply gradient checkpointing to the intermediate (feed-forward) layer
            # intermediate_output = checkpoint(
            #     layer.intermediate,  # Checkpoint only the intermediate layer
            #     hidden_states
            # )

            # Use intermediate layer without gradient checkpointing
            intermediate_output = layer.intermediate(hidden_states)

            # Process the output layer with the intermediate output
            hidden_states = layer.output(intermediate_output, hidden_states)

        # Pooling, dropout, and classification
        pooled_output = self.bert.pooler(hidden_states)
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        # Calculate loss if labels are provided
        loss = None
        if labels is not None:
            loss_fct = torch.nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

        return loss, logits

We can initialize the new model with gradient checkpointing.

In [None]:
# Delete the variables that take up the most memory before loading the new model
del model, optimizer, batch, inputs, attention_mask, labels, outputs, loss

# Hand the freed memory back to the GPU
gc.collect()
torch.cuda.empty_cache()

model = CheckpointedIntermediateBertForSequenceClassification.from_pretrained(model_name, num_labels=2).to(device)  # SST-2 is binary

Some weights of CheckpointedIntermediateBertForSequenceClassification were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Run the training loop with gradient checkpointing.

In [None]:
# Training loop
model.train()  # Make sure the model is in training mode
optimizer = AdamW(model.parameters(), lr=2e-5)
for batch_id, batch in enumerate(train_loader):
    # Stop at the 11th batch
    if batch_id >= 10:
        break
    print(f"Training batch {batch_id} now")

    # Move inputs and labels to the GPU
    inputs = batch["input_ids"].to(device)
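    # BERT's layers expect an additive mask: 0 for real tokens, a large negative value for padding
    padding_mask = batch["attention_mask"].to(device).float().unsqueeze(1).unsqueeze(2)
    attention_mask = (1.0 - padding_mask) * torch.finfo(torch.float32).min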
    labels = batch["label"].to(device)  # Ensure this key matches your dataset

    # Forward pass
    optimizer.zero_grad()  # Clear the gradients
    outputs = model(inputs, attention_mask=attention_mask, labels=labels)


    loss = outputs[0]
    # Backward pass
    loss.backward()  # Compute gradients

    # Update model parameters
    optimizer.step()

Training batch 0 now
> <ipython-input-23-adbc0c2e6dbb>(27)forward()
     25 
     26             # use regular self-attention without gradient checkpointing
---> 27             attention_output = layer.attention(hidden_states, attention_mask)
     28 
     29             # Check if attention_output is a tuple and extract the hidden states if it is

ipdb> n
> <ipython-input-23-adbc0c2e6dbb>(30)forward()
     28 
     29             # Check if attention_output is a tuple and extract the hidden states if it is
---> 30 

OutOfMemoryError: CUDA out of memory. Tried to allocate 34.00 MiB. GPU 