# Training a GPT2-Sequence-Classification-Model Using LAAM

## Install Required Libraries


In [1]:
!pip install datasets # unified interface for accessing and working with various datasets (by hugging face)
!pip install -U accelerate # Hugging Face library for running PyTorch training across devices (CPU, GPU, multi-GPU, mixed precision); used by the Trainer
!pip install -U transformers # library by hugging face that gives easy access to pre-trained models, tokenizers, and tools for fine-tuning models

import utils

Collecting datasets
  Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Collecting huggingface-hub>=0.21.2 (from datasets)
  Downloading huggingface_hub-0.23.0-py3-none-any

## Loading and Processing the Dataset

We load the AG News dataset from Hugging Face. Each sample consists of one string feature that stores the title as well as the (start of the) article text. The label is the category that the article belongs to (world, sports, business, sci/tech). [Link](https://huggingface.co/datasets/ag_news/viewer/default/train) to explore the structure of the data.

In [2]:
from datasets import load_dataset

dataset = load_dataset('ag_news')

print(dataset)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})


Reduce the size of the dataset (to shorten training time) while ensuring that the original structure and distribution of the data are kept.

In [3]:
from datasets import DatasetDict

dataset_train_1percent = utils.take_a_percentage_of_data(dataset['train'], percentage=0.01)
dataset_test_1percent = utils.take_a_percentage_of_data(dataset['test'], percentage=0.01)

dataset_1percent = DatasetDict({
    'train': dataset_train_1percent,
    'test': dataset_test_1percent
}) # combine the shortened datasets back into the old structure.

print(dataset_1percent)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1200
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 76
    })
})
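The helper `utils.take_a_percentage_of_data` is not shown in this notebook. As a rough idea of what a distribution-preserving reduction could look like, here is a minimal sketch that samples the same fraction from every label. The function name `stratified_sample_indices`, the `seed` parameter, and the toy labels are assumptions for illustration, not the actual helper.

```python
import random
from collections import defaultdict

def stratified_sample_indices(labels, percentage, seed=0):
    """Return sorted indices that keep roughly `percentage` of every label (hypothetical helper)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    chosen = []
    for indices in by_label.values():
        # keep at least one example per label so no class disappears
        k = max(1, round(len(indices) * percentage))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)

# Toy data: 4 labels with 300 examples each
toy_labels = [i % 4 for i in range(1200)]
subset = stratified_sample_indices(toy_labels, percentage=0.01)
print(len(subset))  # 3 examples per label
```

With a Hugging Face `Dataset`, such indices could then be passed to `Dataset.select` to obtain the reduced split.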


## Tokenizing the Dataset

Tokenize the dataset with the same tokenizer that the pre-trained GPT-2 model uses.

In [4]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token # padding tokens added to sequences will be represented by an end-of-sequence token
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length")

tokenized_dataset = dataset_1percent.map(tokenize_function, batched=True) # performed in batches to increase performance

print(tokenized_dataset) # tokenization adds two features: 'input_ids' (the tokenized representation of 'text') as well as 'attention_mask', which ensures that the model does not attend to padding tokens added during tokenization




DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 1200
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 76
    })
})
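To make the two added features concrete, the following toy sketch mimics what right-padding to a fixed length produces. It does not call the real tokenizer: the token ids `[101, 202, 303]` and the function `pad_to_max_length` are made up for illustration. Only the pad id 50256 (GPT-2's end-of-sequence token id) reflects the setup above, where `pad_token` is set to `eos_token`.

```python
def pad_to_max_length(token_ids, max_length, pad_id):
    """Right-pad a list of token ids and build the matching attention mask (toy illustration)."""
    n_pad = max(0, max_length - len(token_ids))
    return {
        "input_ids": token_ids + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding that the model should not attend to
        "attention_mask": [1] * len(token_ids) + [0] * n_pad,
    }

encoded = pad_to_max_length([101, 202, 303], max_length=6, pad_id=50256)
print(encoded["input_ids"])       # [101, 202, 303, 50256, 50256, 50256]
print(encoded["attention_mask"])  # [1, 1, 1, 0, 0, 0]
```

Because the padding token and the end-of-sequence token share one id, the attention mask is the only reliable way to tell padding apart from a genuine end-of-sequence token.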


## Loading the Pre-trained GPT2-Model

Load the pre-trained GPT2-Model for sequence classification.

In [5]:
from transformers import GPT2ForSequenceClassification

gpt2_model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=4) # our GPT-2 model should distinguish between 4 labels; this adds a final fully connected layer with 4 output neurons.

print(gpt2_model)


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=4, bias=False)
)


## Implementing the Laplacian Adaptive Attention Block

The code below was extracted from the paper and then adapted to use a Laplacian probability distribution rather than a Gaussian probability distribution.

All the lines that had to be changed are marked with a comment `# ADAPTED`. The changes were made according to the formulas that were specified in our proposal.
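For orientation, the textbook Laplace density and its maximum-likelihood estimates are given below. This is standard material and not a restatement of the proposal's formulas, which are not reproduced in this notebook. The median and the mean absolute deviation computed in the code correspond to $\hat{\mu}$ and $\hat{b}$; the kernel in the code uses different scaling terms, as its inline comment notes.

$$
f(x \mid \mu, b) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right),
\qquad
\hat{\mu} = \operatorname{median}(x_1, \dots, x_n),
\qquad
\hat{b} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \hat{\mu}|
$$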

In [6]:
import torch
import torch.nn as nn

class LaplacianAdaptiveAttention(nn.Module):
    def __init__(self, norm_axis, num_heads, num_laplacians, padding_value, mean_offset_init=0, eps=1e-8):
        super().__init__()
        if not isinstance(norm_axis, int):
            raise ValueError("norm_axis must be an integer.")
        if num_heads <= 0 or not isinstance(num_heads, int):
            raise ValueError("num_heads must be a positive integer.")
        if num_laplacians <= 0 or not isinstance(num_laplacians, int):
            raise ValueError("num_laplacians must be a positive integer.")

        self.norm_axis = norm_axis
        self.eps = eps
        self.num_heads = num_heads
        self.padding_value = padding_value
        self.num_laplacians = num_laplacians

        self.mean_offsets = nn.Parameter(torch.zeros(num_laplacians, dtype=torch.float))
        self.c = nn.Parameter(torch.randn(num_laplacians, dtype=torch.float))

    def forward(self, x, return_attention_details=False):
        if x.dim() < 2:
            raise ValueError(f"Input tensor must have at least 2 dimensions, got {x.dim()}.")
        if self.norm_axis >= x.dim() or self.norm_axis < -x.dim():
            raise ValueError(f"norm_axis {self.norm_axis} is out of bounds for input tensor with {x.dim()} dimensions.")

        mask = x != self.padding_value if self.padding_value is not None else None
        x_masked = torch.where(mask, x, torch.zeros_like(x)) if mask is not None else x
        median = x_masked.median(dim=self.norm_axis, keepdim=True)[0] # ADAPTED
        b = torch.abs(x_masked - median).mean(dim=self.norm_axis, keepdim=True) + self.eps # ADAPTED

        mixture = 1
        for i in range(self.num_laplacians):
            adjusted_median = median + self.mean_offsets[i] # ADAPTED
            y_norm = (x - adjusted_median) / torch.sqrt(b) # ADAPTED
            laplacian = torch.exp(-(torch.abs(y_norm) / (self.c[i] ** 2))) / torch.sqrt(2 * torch.pi * (self.c[i] ** 2)) # ADAPTED - equation (9), but second division term cannot be found in the paper - must be a scaling term. Kept the way it is.
            mixture *= laplacian

        mixture /= mixture.sum(dim=self.norm_axis, keepdim=True).clamp(min=self.eps)

        if return_attention_details:
            return torch.where(mask, x * mixture, x) if mask is not None else x * mixture, mixture.detach()
        else:
            return torch.where(mask, x * mixture, x) if mask is not None else x * mixture


class MultiHeadLaplacianAdaptiveAttention(nn.Module):
    def __init__(self, norm_axis, num_heads, num_laplacians, padding_value=None, eps=1e-8):
        super().__init__()
        self.norm_axis = norm_axis
        self.num_heads = num_heads
        self.attention_heads = nn.ModuleList([
            LaplacianAdaptiveAttention(norm_axis, num_heads, num_laplacians, padding_value, eps=eps)  # pass eps by keyword; positionally it would land in mean_offset_init
            for _ in range(num_heads)
        ])

    def forward(self, x, return_attention_details=False):
        chunk_size = x.shape[self.norm_axis] // self.num_heads
        if chunk_size == 0:
            raise ValueError(f"Input tensor size along norm_axis ({self.norm_axis}) must be larger than the number of heads ({self.num_heads}).")

        outputs, attention_details_ = [], []
        for i in range(self.num_heads):
            start_index = i * chunk_size
            end_index = start_index + chunk_size if i < self.num_heads - 1 else x.shape[self.norm_axis]
            chunk = x.narrow(self.norm_axis, start_index, end_index - start_index)
            if return_attention_details:
                out, mixture = self.attention_heads[i](chunk, return_attention_details=True)
                outputs.append(out)
                attention_details_.append(mixture)
            else:
                outputs.append(self.attention_heads[i](chunk))

        if return_attention_details:
            return torch.cat(outputs, dim=self.norm_axis), torch.cat(attention_details_, dim=self.norm_axis)
        else:
            return torch.cat(outputs, dim=self.norm_axis)
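To see what a single head computes along `norm_axis`, the following plain-Python sketch mirrors the steps of `LaplacianAdaptiveAttention.forward` for one 1-D list and a single Laplacian. It is an illustration under simplifying assumptions (no masking, one component, fixed `offset` and `c` instead of learned parameters); the name `laplacian_weights` is hypothetical.

```python
import math
from statistics import median

def laplacian_weights(values, offset=0.0, c=1.0, eps=1e-8):
    """Normalized Laplacian weights for one 1-D list, following the forward pass step by step."""
    # An odd-length list is used below so that this median matches torch's
    # (torch.median returns the lower of the two middle values for even lengths).
    med = median(values)
    b = sum(abs(v - med) for v in values) / len(values) + eps  # mean absolute deviation

    kernel = []
    for v in values:
        y_norm = (v - (med + offset)) / math.sqrt(b)
        kernel.append(math.exp(-abs(y_norm) / c**2) / math.sqrt(2 * math.pi * c**2))

    total = max(sum(kernel), eps)  # same role as .clamp(min=eps)
    return [k / total for k in kernel]

weights = laplacian_weights([0.0, 1.0, 2.0, 3.0, 10.0])
print([round(w, 3) for w in weights])
```

The weights sum to one, peak at the median, and decay with the absolute distance from it, so the outlier at 10.0 receives the smallest weight.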

**Approach**: Create a wrapper class around the `MultiHeadLaplacianAdaptiveAttention` layer and insert it into every transformer block of the GPT-2 architecture. The wrapper ensures that `MultiHeadLaplacianAdaptiveAttention` is compatible with the interface GPT-2 expects from its attention module.

The `MultiHeadLaplacianAdaptiveAttention` layer is instantiated with the chosen hyperparameters `num_heads = 4` and `num_laplacians = 5`.

In [12]:
import copy

import torch

laplacian_model = copy.deepcopy(gpt2_model)

class MultiHeadLaplacianAdaptiveAttentionWrapper(torch.nn.Module):
    def __init__(self, config, num_heads=4, num_laplacians=5, norm_axis=1):
        super().__init__()
        self.attention = MultiHeadLaplacianAdaptiveAttention(
            norm_axis=norm_axis,
            num_heads=num_heads,
            num_laplacians=num_laplacians,
            padding_value=config.eos_token_id,
            eps=config.layer_norm_epsilon
        )

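    def forward(self, hidden_states, **kwargs):
        # Extra arguments passed by GPT2Block (attention_mask, layer_past, ...) are accepted but ignored
        attention_output = self.attention(hidden_states)
        # NOTE: the first tuple element is the unchanged input; tuple(attention_output)
        # splits the Laplacian output along the batch dimension into the remaining elements
        return (hidden_states,) + tuple(attention_output)  # Ensure the return value is a tuple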


# Replace the attention mechanism in each transformer block
for block in laplacian_model.transformer.h: # access each transformer block within the GPT-2 model
    block.attn = MultiHeadLaplacianAdaptiveAttentionWrapper(config=laplacian_model.config, num_heads=4, num_laplacians=5, norm_axis=1) # and replace its attention module with the Laplacian attention block.
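One detail worth checking in the wrapper: the first element of the tuple it returns is the unchanged `hidden_states`, and as far as we can tell `GPT2Block` reads the attention result from exactly that first element. If so, the Laplacian output never reaches the residual stream. The sketch below shows a variant that returns the attention output first. The class names `AttentionOutputWrapper` and `Halve` are hypothetical, `Halve` is only a stand-in for the real attention layer, and the `None` placeholder assumes a `(attn_output, present_or_weights)` tuple layout that may differ between `transformers` versions.

```python
import torch
import torch.nn as nn

class AttentionOutputWrapper(nn.Module):
    """Hypothetical variant: returns the wrapped module's output as the first tuple element."""
    def __init__(self, attention):
        super().__init__()
        self.attention = attention

    def forward(self, hidden_states, **kwargs):
        # kwargs such as attention_mask or layer_past are accepted and ignored
        attention_output = self.attention(hidden_states)
        # second element: placeholder for the cache / attention-weights slot (assumed layout)
        return (attention_output, None)

class Halve(nn.Module):
    """Stand-in for the attention layer so the effect of the wrapper is visible."""
    def forward(self, x):
        return 0.5 * x

wrapper = AttentionOutputWrapper(Halve())
hidden = torch.ones(2, 8, 16)  # (batch, sequence, hidden)
out = wrapper(hidden, attention_mask=None)
print(out[0].shape, torch.equal(out[0], 0.5 * hidden))
```

This has not been run inside the full GPT-2 model here, so it is a starting point for debugging rather than a verified fix.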


In [13]:
print(laplacian_model)
print(gpt2_model)

GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): MultiHeadLaplacianAdaptiveAttentionWrapper(
          (attention): MultiHeadLaplacianAdaptiveAttention(
            (attention_heads): ModuleList(
              (0-3): 4 x LaplacianAdaptiveAttention()
            )
          )
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=4, bias=False)
)
GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    

## Check: Do both models run?

1. The Normal GPT2-Model
2. The GPT2-Model with LAAM

In [14]:
input_ids = torch.randint(0, gpt2_model.config.vocab_size, (1, 512))
labels = torch.tensor([1]).unsqueeze(0)

outputs = gpt2_model(input_ids=input_ids, labels=labels)
loss, logits = outputs['loss'], outputs['logits']
print(f"GPT2 with regular attention mechanism: Loss = {round(loss.item(), 4)}, logits = {logits.detach()}")

outputs = laplacian_model(input_ids=input_ids, labels=labels)
loss, logits = outputs['loss'], outputs['logits']
print(f"GPT2 with Laplacian attention mechanism: Loss = {round(loss.item(), 4)}, logits = {logits.detach()}")


GPT2 with regular attention mechanism: Loss = 0.008, logits = tensor([[-2.5557e+00,  6.0127e+00, -2.5160e-03,  7.9631e-01]])
GPT2 with Laplacian attention mechanism: Loss = 1.0243, logits = tensor([[-0.5086,  0.5875, -0.0469,  0.5048]])


## Training

Training the GPT2-Model with Laplacian attention for 10 epochs on 1 percent of the training data.

In [15]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    num_train_epochs=10,
    use_cpu=False,
    no_cuda=False,
    save_strategy="epoch",
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=laplacian_model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=utils.compute_accuracy
)
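`utils.compute_accuracy` is not shown in this notebook. A metric function for the `Trainer` typically receives the logits and labels of the evaluation set and returns a dictionary; the sketch below shows one plausible shape. The name `compute_accuracy_sketch` and the toy arrays are assumptions, not the actual helper.

```python
import numpy as np

def compute_accuracy_sketch(eval_pred):
    """Hypothetical stand-in for utils.compute_accuracy: fraction of correct argmax predictions."""
    logits, labels = eval_pred  # the Trainer passes predictions and label ids
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}

# Toy check with 4 examples and 4 classes
toy_logits = np.array([[2.0, 0.1, 0.0, 0.0],
                       [0.0, 0.3, 0.2, 0.1],
                       [0.0, 0.0, 0.1, 3.0],
                       [1.0, 0.0, 0.0, 0.0]])
toy_labels = np.array([0, 1, 2, 0])
print(compute_accuracy_sketch((toy_logits, toy_labels)))  # {'accuracy': 0.75}
```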

In [16]:
trainer.train()

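| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 1.4071 | 1.388321 | 0.25 |
| 2 | 1.3969 | 1.408977 | 0.25 |
| 3 | 1.3996 | 1.393989 | 0.25 |
| 4 | 1.3955 | 1.398186 | 0.25 |
| 5 | 1.3942 | 1.388539 | 0.25 |
| 6 | 1.3941 | 1.390117 | 0.25 |
| 7 | 1.3918 | 1.388874 | 0.25 |
| 8 | 1.3895 | 1.387034 | 0.25 |
| 9 | 1.3889 | 1.386437 | 0.25 |
| 10 | 1.3881 | 1.38635 | 0.25 |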


TrainOutput(global_step=12000, training_loss=1.3945801086425782, metrics={'train_runtime': 2830.0486, 'train_samples_per_second': 4.24, 'train_steps_per_second': 4.24, 'total_flos': 4181198635008000.0, 'train_loss': 1.3945801086425782, 'epoch': 10.0})

### Interpretation of the Results

Replacing the Gaussian distribution with a Laplacian distribution leads to very similar results.

However, the results remain very poor: accuracy stays at 0.25 in every epoch, which is what random guessing achieves on four equally frequent classes. This suggests a larger issue with the model architecture, the Gaussian/Laplacian attention mechanism, or the training process. These results therefore cannot be used to conclude whether Laplacian attention mechanisms are less, equally, or more powerful than Gaussian attention mechanisms.
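A quick way to put the reported losses in context is to compare them with the cross-entropy of a classifier that assigns the same probability to every class. The short calculation below assumes nothing beyond the four classes of the task.

```python
import math

num_classes = 4
# Cross-entropy of a uniform prediction: -log(1 / num_classes)
uniform_loss = -math.log(1.0 / num_classes)
chance_accuracy = 1.0 / num_classes
print(round(uniform_loss, 4), chance_accuracy)  # 1.3863 0.25
```

The validation loss in the table converges to about 1.386, which matches this value. The model is therefore effectively predicting a uniform distribution over the four classes.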