# Training a GPT2-Sequence-Classification-Model Using GAAM

## Install Required Libraries


In [2]:
!pip install datasets # unified interface for accessing and working with various datasets (by hugging face)
!pip install -U accelerate # Hugging Face library for running PyTorch training across CPU, GPU, and multi-device setups
!pip install -U transformers # library by hugging face that gives easy access to pre-trained models, tokenizers, and tools for fine-tuning models

import utils # some utility functions we wrote that are used across the different notebooks



## Loading and Processing the Dataset

We load the AG News dataset from Hugging Face. Each sample has a single string feature, `text`, which holds the title and the start of the article text. The label is the category the article belongs to (world, sports, business, sci/tech). Use this [link](https://huggingface.co/datasets/ag_news/viewer/default/train) to explore the structure of the data.

In [3]:
from datasets import load_dataset

dataset = load_dataset('ag_news')

print(dataset)


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})


We reduce the size of the dataset to shorten training times, while keeping the original structure and label distribution of the data.

In [4]:
from datasets import DatasetDict

dataset_train_1percent = utils.take_a_percentage_of_data(dataset['train'], percentage=0.01)
dataset_test_1percent = utils.take_a_percentage_of_data(dataset['test'], percentage=0.01)

dataset_1percent = DatasetDict({
    'train': dataset_train_1percent,
    'test': dataset_test_1percent
}) # combine the shortened datasets back into the old structure.

print(dataset_1percent)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1200
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 76
    })
})
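The helper `utils.take_a_percentage_of_data` is not shown in this notebook. As a rough illustration of what sub-sampling "while keeping the label distribution" can look like, here is a minimal sketch in plain Python. The function name `take_stratified_fraction`, its parameters, and the toy rows are assumptions made for this example and are not the actual helper.

```python
import random
from collections import defaultdict

def take_stratified_fraction(rows, fraction, label_key="label", seed=0):
    """Return roughly `fraction` of `rows`, sampled separately for each label."""
    rng = random.Random(seed)

    # Group the rows by their label.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    # Draw the same fraction from every group.
    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        k = max(1, round(len(group) * fraction))  # keep at least one row per class
        sample.extend(rng.sample(group, k))

    rng.shuffle(sample)  # avoid returning the rows sorted by label
    return sample

# Toy data with four balanced classes (hypothetical rows, not AG News itself).
rows = [{"text": f"article {i}", "label": i % 4} for i in range(1200)]
subset = take_stratified_fraction(rows, fraction=0.01)

counts = defaultdict(int)
for row in subset:
    counts[row["label"]] += 1
print(len(subset), dict(sorted(counts.items())))
```

With the `datasets` library, a comparable effect can probably be achieved with `Dataset.train_test_split` and its `stratify_by_column` argument, which requires the label column to be a `ClassLabel` feature.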


## Tokenizing the Dataset

We tokenize the dataset with the same tokenizer that the pre-trained GPT-2 model uses.

In [5]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token # GPT-2 has no dedicated padding token, so the end-of-sequence token is reused for padding

def tokenize_function(examples):
    # pad every sample to the tokenizer's maximum length (1024 tokens for GPT-2)
    return tokenizer(examples["text"], padding="max_length")

tokenized_dataset = dataset_1percent.map(tokenize_function, batched=True) # performed in batches to increase performance

print(tokenized_dataset) # tokenization adds two features: 'input_ids' (the tokenized representation of 'text') as well as 'attention_mask', which ensures that the model does not attend to padding tokens added during tokenization



DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 1200
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 76
    })
})
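To make the two added features concrete, the following sketch pads a short sequence by hand and builds the matching attention mask. It is a simplified stand-in for what the tokenizer does with `padding="max_length"`, not the tokenizer's actual implementation. The function `pad_to_max_length` and the token IDs are made up for illustration; `50256` is the ID of GPT-2's end-of-sequence token, and padding is appended on the right, which is the GPT-2 tokenizer's default.

```python
def pad_to_max_length(token_ids, max_length, pad_id):
    """Right-pad `token_ids` to `max_length` and build the matching attention mask."""
    if len(token_ids) > max_length:
        raise ValueError("sequence is longer than max_length")
    n_pad = max_length - len(token_ids)
    return {
        "input_ids": token_ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding that should not be attended to
        "attention_mask": [1] * len(token_ids) + [0] * n_pad,
    }

EOS_ID = 50256  # GPT-2's end-of-sequence ID, reused here as the padding ID
encoded = pad_to_max_length([101, 202, 303], max_length=8, pad_id=EOS_ID)  # made-up token IDs

print(encoded["input_ids"])
print(encoded["attention_mask"])
```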


## Loading the Pre-trained GPT2-Model

Load the pre-trained GPT2-Model for sequence classification.

In [6]:
from transformers import GPT2ForSequenceClassification

gpt2_model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=4) # the model should distinguish between 4 labels; this adds a final fully connected layer with 4 output neurons

print(gpt2_model)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=4, bias=False)
)
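The `score` layer classifies a sequence from the hidden state of a single token. According to the Transformers documentation, `GPT2ForSequenceClassification` uses the last token that is not a padding token if `pad_token_id` is defined in the model configuration, and otherwise simply the last position of each row. The sketch below mimics that selection rule in plain Python. The function `classification_position` and the token IDs are hypothetical, and the library's own implementation differs in detail. Whether `config.pad_token_id` is set for the models in this notebook is not shown in the cells above.

```python
def classification_position(input_ids, pad_token_id=None):
    """Index of the token whose hidden state would feed the classification head."""
    if pad_token_id is None:
        return len(input_ids) - 1  # no padding ID known: use the last position
    for index in range(len(input_ids) - 1, -1, -1):
        if input_ids[index] != pad_token_id:
            return index  # last token that is not padding
    return len(input_ids) - 1  # illustrative fallback for a row of only padding

PAD_ID = 50256  # GPT-2's end-of-sequence ID, used as padding in this notebook
row = [101, 202, 303, PAD_ID, PAD_ID, PAD_ID, PAD_ID, PAD_ID]  # made-up token IDs

print(classification_position(row))                       # padding ID unknown
print(classification_position(row, pad_token_id=PAD_ID))  # padding ID known
```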


## Implementing the Gaussian Adaptive Attention Block

Install the package published by the authors of the paper.

In [7]:
!pip3 install gaussian-adaptive-attention

Collecting gaussian-adaptive-attention
  Downloading gaussian_adaptive_attention-0.1.5-py3-none-any.whl (8.7 kB)
Installing collected packages: gaussian-adaptive-attention
Successfully installed gaussian-adaptive-attention-0.1.5


**Approach**: Create a wrapper class around the `MultiHeadGaussianAdaptiveAttention` module and insert it into the GPT-2 architecture layer by layer. The wrapper ensures that `MultiHeadGaussianAdaptiveAttention` is compatible with the GPT-2 architecture.

The `MultiHeadGaussianAdaptiveAttention` layer is initialized with the chosen hyperparameters `num_heads = 4` and `num_gaussians = 5`.

In [8]:
import copy

import torch
from gaussian_adaptive_attention import MultiHeadGaussianAdaptiveAttention

gaussian_model = copy.deepcopy(gpt2_model) # work on a copy so that the original GPT-2 model stays unchanged

class MultiHeadGaussianAdaptiveAttentionWrapper(torch.nn.Module):
    def __init__(self, config, num_heads=4, num_gaussians=5, norm_axis=1):
        super().__init__()
        self.attention = MultiHeadGaussianAdaptiveAttention(
            norm_axis=norm_axis,
            num_heads=num_heads,
            num_gaussians=num_gaussians,
            padding_value=config.eos_token_id,
            eps=config.layer_norm_epsilon
        )

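    def forward(self, hidden_states, **kwargs):
        # GPT2Block passes extra keyword arguments (e.g. attention_mask, layer_past);
        # they are accepted here but not forwarded to the Gaussian attention module.
        attention_output = self.attention(hidden_states)
        # GPT2Block reads element 0 of the returned tuple as the attention output.
        # Here element 0 is the unmodified input `hidden_states`, followed by the
        # elements obtained by iterating over the GAAM output.
        return (hidden_states,) + tuple(attention_output)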

# Replace the attention module in every transformer block with the Gaussian attention wrapper
for block in gaussian_model.transformer.h: # iterate over the 12 transformer blocks of the GPT-2 model
    block.attn = MultiHeadGaussianAdaptiveAttentionWrapper(config=gaussian_model.config, num_heads=4, num_gaussians=5, norm_axis=1)

In [9]:
print(gaussian_model)
print(gpt2_model)

GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): MultiHeadGaussianAdaptiveAttentionWrapper(
          (attention): MultiHeadGaussianAdaptiveAttention(
            (attention_heads): ModuleList(
              (0-3): 4 x GaussianAdaptiveAttention()
            )
          )
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=4, bias=False)
)

## Check: Do Both Models Run?

We run a single forward pass with random token IDs through both models:

1. the standard GPT-2 model
2. the GPT-2 model with GAAM

In [10]:
input_ids = torch.randint(0, gpt2_model.config.vocab_size, (1, 512))
labels = torch.tensor([1]).unsqueeze(0)

outputs = gpt2_model(input_ids=input_ids, labels=labels)
loss, logits = outputs['loss'], outputs['logits']
print(f"GPT2 with regular attention mechanism: Loss = {round(loss.item(), 4)}, logits = {logits.detach()}")

outputs = gaussian_model(input_ids=input_ids, labels=labels)
loss, logits = outputs['loss'], outputs['logits']
print(f"GPT2 with Gaussian attention mechanism: Loss = {round(loss.item(), 4)}, logits = {logits.detach()}")


GPT2 with regular attention mechanism: Loss = 0.0552, logits = tensor([[-4.2974,  2.0370, -0.8821, -4.8758]])
GPT2 with Gaussian attention mechanism: Loss = 0.4546, logits = tensor([[-0.9896,  0.9155, -0.5822, -0.6785]])
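The reported losses can be checked by hand: for a single sample, the cross-entropy loss is the negative log of the softmax probability assigned to the true label (here label `1`). The following sketch recomputes it in plain Python from the printed logits; the helper `cross_entropy` is written for this example only. The results should match the reported 0.0552 and 0.4546 up to the rounding of the printed logits.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy loss of one sample: -log(softmax(logits)[label])."""
    highest = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = highest + math.log(sum(math.exp(z - highest) for z in logits))
    return log_sum_exp - logits[label]

# Logits copied from the output above; the label used in the check is 1.
gpt2_logits = [-4.2974, 2.0370, -0.8821, -4.8758]
gaam_logits = [-0.9896, 0.9155, -0.5822, -0.6785]

gpt2_loss = cross_entropy(gpt2_logits, label=1)
gaam_loss = cross_entropy(gaam_logits, label=1)
print(round(gpt2_loss, 4))
print(round(gaam_loss, 4))
```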


## Training

Training the GPT2-Model with Gaussian attention for 10 epochs on 1 percent of the training data.

In [13]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    num_train_epochs=10,
    use_cpu=False, # train on the GPU if one is available
    no_cuda=False,
    save_strategy="epoch",
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    load_best_model_at_end=True # after training, reload the checkpoint with the lowest validation loss
)

trainer = Trainer(
    model=gaussian_model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=utils.compute_accuracy
)
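The helper `utils.compute_accuracy` is defined outside this notebook. A metric function passed to `Trainer` as `compute_metrics` receives the model's logits together with the true labels and returns a dictionary of metric values. The sketch below shows one way such a function could look in plain Python; it is an assumption about the helper's behavior, not its actual code, and the toy logits are made up.

```python
def compute_accuracy(eval_pred):
    """Fraction of samples whose highest-scoring class equals the label."""
    logits, labels = eval_pred
    correct = 0
    for row, label in zip(logits, labels):
        scores = list(row)
        predicted = scores.index(max(scores))  # argmax over the class scores
        correct += int(predicted == label)
    return {"accuracy": correct / len(labels)}

# Toy check: class 2 has the highest score in every row, the labels are balanced.
toy_logits = [
    [0.1, 0.2, 0.9, 0.0],
    [0.3, 0.1, 0.8, 0.2],
    [0.0, 0.4, 0.7, 0.1],
    [0.2, 0.3, 0.6, 0.5],
]
toy_labels = [0, 1, 2, 3]

result = compute_accuracy((toy_logits, toy_labels))
print(result)
```

In the toy check, a model that predicts the same class for every sample of a balanced four-class set is right for exactly one sample in four.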

In [14]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 1.4046 | 1.390067 | 0.25 |
| 2 | 1.4011 | 1.389341 | 0.25 |
| 3 | 1.3948 | 1.392255 | 0.25 |
| 4 | 1.3982 | 1.388889 | 0.25 |
| 5 | 1.3958 | 1.387185 | 0.25 |
| 6 | 1.3919 | 1.387794 | 0.25 |
| 7 | 1.3924 | 1.386747 | 0.25 |
| 8 | 1.3893 | 1.386823 | 0.25 |
| 9 | 1.3913 | 1.386543 | 0.25 |
| 10 | 1.3867 | 1.38642 | 0.25 |


TrainOutput(global_step=12000, training_loss=1.394611806233724, metrics={'train_runtime': 2772.1361, 'train_samples_per_second': 4.329, 'train_steps_per_second': 4.329, 'total_flos': 4181198635008000.0, 'train_loss': 1.394611806233724, 'epoch': 10.0})

### Interpretation of the Results

Unfortunately, the same problematic behavior as in the notebook `03_gpt2_with_GAAM_hyp_tuning` can be observed: the training and validation losses barely drop below 1.4, and the accuracy stays at 25%, meaning that the model continues to output `2` as the predicted label for every sample.

This training run shows that adding (a bit) more data and increasing the number of epochs do not improve the accuracy of the model. We therefore conclude that there must be a more fundamental deficiency in the model that limits its performance. We reached out to the authors of the paper to ask for their feedback on this model, in particular whether they could spot a mistake in our implementation. They did not mention any implementation mistakes, but they suggested that we try two new approaches. These are documented in the notebooks `06_gpt2_with_normal_attention_and_GAAM.ipynb` and `07_two_model_architecture.ipynb`.
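As a point of reference for the loss values, a classifier that assigns probability 1/4 to each of the four classes has a cross-entropy loss of ln 4, regardless of the true label. The sketch below compares this baseline with the final validation loss from the table above; the variable names are chosen for this example only.

```python
import math

num_classes = 4
uniform_loss = -math.log(1 / num_classes)  # cross-entropy when every class gets probability 1/4
final_validation_loss = 1.38642            # validation loss after epoch 10, from the table above

gap = final_validation_loss - uniform_loss
print(round(uniform_loss, 6))
print(round(gap, 6))
```

The small gap indicates that the predicted class probabilities remain close to uniform after ten epochs.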

