# Training a GPT2-Sequence-Classification-Model Using Regular Attention and GAAM

## Install Required Libraries


In [1]:
!pip install datasets # Hugging Face library with a unified interface for loading and processing datasets
!pip install -U accelerate # Hugging Face library for running PyTorch training on different hardware setups (CPU, GPU, mixed precision)
!pip install -U transformers # Hugging Face library that gives easy access to pre-trained models, tokenizers, and tools for fine-tuning models

import utils # helper module providing take_a_percentage_of_data and compute_accuracy, which are used below


## Loading and Processing the Dataset

We load the AG News dataset from Hugging Face. Each sample has a single string feature, `text`, which holds the title and the beginning of the article text. The label is the category the article belongs to (world, sports, business, sci/tech). Use this [link](https://huggingface.co/datasets/ag_news/viewer/default/train) to explore the structure of the data.

In [2]:
from datasets import load_dataset

dataset = load_dataset('ag_news')

print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})


Keep only 1% of each split to shorten training times, while preserving the original structure and distribution of the data.

In [3]:
from datasets import DatasetDict

dataset_train_1percent = utils.take_a_percentage_of_data(dataset['train'], percentage=0.01)
dataset_test_1percent = utils.take_a_percentage_of_data(dataset['test'], percentage=0.01)

dataset_1percent = DatasetDict({
    'train': dataset_train_1percent,
    'test': dataset_test_1percent
}) # combine the shortened datasets back into the old structure.

print(dataset_1percent)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1200
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 76
    })
})
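The helper `utils.take_a_percentage_of_data` is not shown in this notebook. A minimal sketch of how such a subsample could keep the class distribution is given below. The function name `take_fraction_stratified`, its signature, and the toy labels are assumptions made for illustration, not the actual helper.

```python
import random
from collections import defaultdict

def take_fraction_stratified(labels, fraction, seed=0):
    """Return sorted row indices that keep roughly `fraction` of every class.

    Hypothetical stand-in for utils.take_a_percentage_of_data.
    """
    rng = random.Random(seed)
    indices_by_label = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_label[label].append(index)

    picked = []
    for indices in indices_by_label.values():
        # keep at least one row per class so that no label disappears
        sample_size = max(1, round(len(indices) * fraction))
        picked.extend(rng.sample(indices, sample_size))
    return sorted(picked)

# toy data: 4 balanced classes with 300 rows each
toy_labels = [i % 4 for i in range(1200)]
subset_indices = take_fraction_stratified(toy_labels, fraction=0.01)
print(len(subset_indices))  # 12 rows, 3 per class
```

With a Hugging Face `Dataset`, indices like these could be passed to `Dataset.select` to build the reduced split.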


## Tokenizing the Dataset

Tokenize the dataset with the tokenizer that belongs to the pre-trained GPT-2 model, so that the inputs match what the model saw during pre-training.

In [4]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token # GPT-2 has no padding token, so the end-of-sequence token is reused for padding

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length")

tokenized_dataset = dataset_1percent.map(tokenize_function, batched=True) # performed in batches to increase performance

# tokenization adds two features: 'input_ids' (the tokenized representation of 'text') and
# 'attention_mask', which marks the padding tokens so that the model does not attend to them
print(tokenized_dataset)



DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 1200
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 76
    })
})
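To make the two new features concrete, the following toy sketch pads a short sequence by hand and builds the matching attention mask. It does not call the real tokenizer. The function `pad_to_max_length` and the token ids are made up for illustration, and right-side padding is assumed (the default for the GPT-2 tokenizer, as far as we know). Only the pad id 50256 is GPT-2's actual end-of-sequence id.

```python
def pad_to_max_length(token_ids, max_length, pad_id):
    """Right-pad one sequence and build the matching attention mask."""
    num_real = len(token_ids)
    if num_real > max_length:
        raise ValueError("sequence is longer than max_length; truncation would be needed")
    num_pad = max_length - num_real
    return {
        "input_ids": token_ids + [pad_id] * num_pad,
        "attention_mask": [1] * num_real + [0] * num_pad,  # 1 = real token, 0 = padding
    }

EOS_ID = 50256  # GPT-2's end-of-sequence id, reused as the padding id
example = pad_to_max_length([11, 22, 33], max_length=6, pad_id=EOS_ID)
print(example["input_ids"])       # [11, 22, 33, 50256, 50256, 50256]
print(example["attention_mask"])  # [1, 1, 1, 0, 0, 0]
```

In the notebook, `padding="max_length"` is used without an explicit `max_length`, so every sample is presumably padded to the tokenizer's maximum length rather than to 6 as in this toy case.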


## Loading the Pre-trained GPT2-Model

Load the pre-trained GPT2-Model for sequence classification.

In [5]:
from transformers import GPT2ForSequenceClassification

gpt2_model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=4) # the model should distinguish between 4 labels, so a final linear layer ('score') with 4 output neurons is added

print(gpt2_model)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=4, bias=False)
)
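The `score` layer produces logits for every position, but only one position per sequence is used for classification. According to the Hugging Face documentation, `GPT2ForSequenceClassification` uses the last non-padding token when `config.pad_token_id` is set, and simply the last position otherwise. The pure-Python sketch below mimics that rule in simplified form. The function `pooled_position` is hypothetical and is not the library's implementation.

```python
def pooled_position(input_ids, pad_token_id=None):
    """Index of the token whose logits would be used for classification (simplified)."""
    if pad_token_id is None:
        # no padding id known to the model: take the last position of the row
        return len(input_ids) - 1
    # otherwise: walk backwards to the last position that does not hold the pad id
    for position in range(len(input_ids) - 1, -1, -1):
        if input_ids[position] != pad_token_id:
            return position
    return len(input_ids) - 1  # row consists of padding only

padded_row = [11, 22, 33, 50256, 50256, 50256]
print(pooled_position(padded_row))                      # 5: a padding position
print(pooled_position(padded_row, pad_token_id=50256))  # 2: the last real token
```

In this notebook only the tokenizer's `pad_token` is set. If the model should pool on the last real token, one option might be to also set `gpt2_model.config.pad_token_id = tokenizer.pad_token_id`. This was not done for the runs reported here, so its effect on the results is unknown.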


## Implementing the Gaussian Adaptive Attention Block

Install the `gaussian-adaptive-attention` package that accompanies the GAAM paper.

In [6]:
!pip3 install gaussian-adaptive-attention

Collecting gaussian-adaptive-attention
  Downloading gaussian_adaptive_attention-0.1.5-py3-none-any.whl (8.7 kB)
Installing collected packages: gaussian-adaptive-attention
Successfully installed gaussian-adaptive-attention-0.1.5


**Approach**: Create a new module that uses both the regular GPT2 attention as well as GAAM. This approach is based on a suggestion by one of the authors of the paper, Georgios Ioannides. With this setup, the model will first apply Gaussian attention and then in a second step apply GPT2 attention on the outputs of the Gaussian attention.

The `MultiHeadGaussianAdaptiveAttention` layer is instantiated with the chosen hyperparameters `num_heads = 4` and `num_gaussians = 5`.

In [7]:
import torch

class MultiHeadCombinedAttention(torch.nn.Module):
    def __init__(self, config, gaussian_attention, original_attention):
        super().__init__()
        self.gaussian_attention = gaussian_attention
        self.original_attention = original_attention

    def forward(self, hidden_states, **kwargs):
        # Note: **kwargs (e.g. the attention_mask passed in by GPT2Block) are accepted but not forwarded
        # Pass hidden_states through Gaussian attention first
        gaussian_output = self.gaussian_attention(hidden_states)
        # Pass the output of Gaussian attention through the original attention
        combined_output = self.original_attention(gaussian_output)
        return combined_output
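
In the class above, the keyword arguments that the surrounding block passes in (such as `attention_mask`) never reach the original attention. Whether that is intended is not stated. The sketch below shows a variant that hands them on. The class names `CombinedAttentionForwardingKwargs` and `RecordingAttention` are hypothetical, and the two stand-in layers only demonstrate the plumbing, not real attention. This variant was not used for the training run reported in this notebook.

```python
import torch

class CombinedAttentionForwardingKwargs(torch.nn.Module):
    """Variant of the combined attention that forwards keyword arguments."""

    def __init__(self, gaussian_attention, original_attention):
        super().__init__()
        self.gaussian_attention = gaussian_attention
        self.original_attention = original_attention

    def forward(self, hidden_states, **kwargs):
        gaussian_output = self.gaussian_attention(hidden_states)
        # hand attention_mask, head_mask, etc. on to the wrapped attention
        return self.original_attention(gaussian_output, **kwargs)

class RecordingAttention(torch.nn.Module):
    """Stand-in for the wrapped attention: records the keyword arguments it receives."""

    def __init__(self):
        super().__init__()
        self.seen_kwargs = None

    def forward(self, hidden_states, **kwargs):
        self.seen_kwargs = kwargs
        return (hidden_states,)  # tuple, since the first element is read as the attention output

stand_in = RecordingAttention()
combined = CombinedAttentionForwardingKwargs(torch.nn.Identity(), stand_in)

hidden = torch.zeros(1, 4, 8)     # (batch, sequence, hidden)
mask = torch.ones(1, 1, 1, 4)     # toy mask, shape chosen for illustration
output = combined(hidden, attention_mask=mask)
print(sorted(stand_in.seen_kwargs))  # ['attention_mask']
```

Whether the Gaussian layer should also take the mask into account is a separate design question that this sketch leaves open.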

In [8]:
import copy

from gaussian_adaptive_attention import MultiHeadGaussianAdaptiveAttention

# create a copy of the gpt2-model
gaussian_model = copy.deepcopy(gpt2_model)

# instantiate GAAM with the chosen hyperparameters num_heads = 4, num_gaussians = 5
multi_head_gaussian_adaptive_attention = MultiHeadGaussianAdaptiveAttention(
    norm_axis=1,
    num_heads=4,
    num_gaussians=5,
    padding_value=gaussian_model.config.eos_token_id,
    eps=gaussian_model.config.layer_norm_epsilon,
)

# Replace the attention mechanism in each transformer block
# Note: the same GAAM instance is passed to every block, so all blocks share this one module
for block in gaussian_model.transformer.h:
    # Save the original attention mechanism
    original_attention = block.attn
    # Replace it with the combined attention mechanism
    block.attn = MultiHeadCombinedAttention(gaussian_model.config, multi_head_gaussian_adaptive_attention, original_attention)

In [9]:
print(gaussian_model)
print(gpt2_model)

GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): MultiHeadCombinedAttention(
          (gaussian_attention): MultiHeadGaussianAdaptiveAttention(
            (attention_heads): ModuleList(
              (0-3): 4 x GaussianAdaptiveAttention()
            )
          )
          (original_attention): GPT2Attention(
            (c_attn): Conv1D()
            (c_proj): Conv1D()
            (attn_dropout): Dropout(p=0.1, inplace=False)
            (resid_dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=

## Check: Do both models run?

1. The Normal GPT2-Model
2. The GPT2-Model with Normal Attention and GAAM

In [10]:
input_ids = torch.randint(0, gpt2_model.config.vocab_size, (1, 512))
labels = torch.tensor([1]).unsqueeze(0)

outputs = gpt2_model(input_ids=input_ids, labels=labels)
loss, logits = outputs['loss'], outputs['logits']
print(f"GPT2 with regular attention mechanism: Loss = {round(loss.item(), 4)}, logits = {logits.detach()}")

outputs = gaussian_model(input_ids=input_ids, labels=labels)
loss, logits = outputs['loss'], outputs['logits']
print(f"GPT2 with Gaussian attention mechanism: Loss = {round(loss.item(), 4)}, logits = {logits.detach()}")


GPT2 with regular attention mechanism: Loss = 6.9644, logits = tensor([[ 1.0402,  1.5031,  8.4659, -0.5331]])
GPT2 with Gaussian attention mechanism: Loss = 8.8657, logits = tensor([[ 3.8026, -0.6036,  8.2501,  0.1685]])
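
As a sanity check, the reported losses can be reproduced by hand from the printed logits, since the loss for single-label classification is the cross-entropy of the logits against label 1. The helper `cross_entropy` below is our own small function, not a library call. The logits are copied from the output above, so agreement is limited by their four-decimal rounding.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of one sample: log-sum-exp of the logits minus the logit of the true label."""
    max_logit = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[label]

loss_regular = cross_entropy([1.0402, 1.5031, 8.4659, -0.5331], label=1)
loss_gaussian = cross_entropy([3.8026, -0.6036, 8.2501, 0.1685], label=1)

print(round(loss_regular, 3))   # 6.964, reported: 6.9644
print(round(loss_gaussian, 3))  # 8.866, reported: 8.8657
```

Both losses are high because the inputs are random token ids and the classification head is freshly initialized. The check only confirms that both models run end to end.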


## Training

Train the GPT2-Model with the combined attention (GAAM followed by regular attention) for 10 epochs on 1% of the data.

In [11]:
from transformers import Trainer, TrainingArguments

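training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    num_train_epochs=10,
    use_cpu=False,  # do not force training onto the CPU
    no_cuda=False,
    save_strategy="no",  # do not write checkpoints
    logging_strategy="epoch",  # log the training loss once per epoch
    evaluation_strategy="epoch",  # evaluate on the test split after every epoch
)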

trainer = Trainer(
    model=gaussian_model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=utils.compute_accuracy
)
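
The metric function `utils.compute_accuracy` is not shown in this notebook. A plausible minimal version is sketched below, assuming it receives the usual `(logits, labels)` pair that `Trainer` hands to `compute_metrics` and returns a dictionary with an `accuracy` entry. The name `compute_accuracy_sketch` and the toy arrays are assumptions for illustration.

```python
import numpy as np

def compute_accuracy_sketch(eval_pred):
    """Hypothetical stand-in for utils.compute_accuracy."""
    logits, labels = eval_pred                # unpack predictions and true labels
    predictions = np.argmax(logits, axis=-1)  # predicted class = index of the largest logit
    return {"accuracy": float((predictions == labels).mean())}

toy_logits = np.array([
    [0.1, 2.0, 0.3, 0.0],  # predicts class 1
    [1.5, 0.2, 0.1, 0.0],  # predicts class 0
    [0.0, 0.1, 0.2, 3.0],  # predicts class 3
])
toy_labels = np.array([1, 0, 2])
print(compute_accuracy_sketch((toy_logits, toy_labels)))  # two of three correct
```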

In [12]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 1.401 | 1.391345 | 0.25 |
| 2 | 1.3995 | 1.388064 | 0.25 |
| 3 | 1.3959 | 1.394787 | 0.25 |
| 4 | 1.3948 | 1.386702 | 0.25 |
| 5 | 1.3916 | 1.386755 | 0.25 |
| 6 | 1.3923 | 1.387551 | 0.25 |
| 7 | 1.3911 | 1.386982 | 0.25 |
| 8 | 1.3912 | 1.387499 | 0.25 |
| 9 | 1.3902 | 1.386482 | 0.25 |
| 10 | 1.3881 | 1.386441 | 0.25 |


TrainOutput(global_step=12000, training_loss=1.3935629475911457, metrics={'train_runtime': 4947.5258, 'train_samples_per_second': 2.425, 'train_steps_per_second': 2.425, 'total_flos': 6271238209536000.0, 'train_loss': 1.3935629475911457, 'epoch': 10.0})

### Interpretation of the Results

The results indicate that stacking the regular and the Gaussian attention does not improve performance. Accuracy stays at 0.25 in every epoch, which is chance level for four equally frequent classes, and the validation loss barely moves away from roughly 1.39, close to ln 4 ≈ 1.386, the loss of a uniform guess. The model is therefore just as bad as when only GAAM is used. We conclude that combining regular and Gaussian attention in this way is not desirable in this specific architecture.
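
The numbers in the training log can be put into perspective with a short calculation. The sketch below compares the final validation loss with the cross-entropy of a uniform guess over four classes and checks the step count against the training setup. All values are copied from the outputs above. The variable names are our own.

```python
import math

num_labels = 4
chance_loss = math.log(num_labels)  # cross-entropy when every class gets probability 1/4
chance_accuracy = 1 / num_labels    # accuracy of guessing among equally frequent classes

final_validation_loss = 1.386441    # epoch 10 in the training log
gap = final_validation_loss - chance_loss
print(f"chance loss = {chance_loss:.6f}, gap to final validation loss = {gap:.6f}")
print(f"chance accuracy = {chance_accuracy}")  # matches the reported 0.25

# bookkeeping: 1200 training samples, batch size 1, 10 epochs
total_steps = (1200 // 1) * 10
steps_per_second = total_steps / 4947.5258  # train_runtime in seconds
print(total_steps, round(steps_per_second, 3))  # 12000 steps, about 2.425 per second
```

The gap of well under 0.001 suggests that the model ends up predicting close to uniform probabilities instead of learning the task.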