# Hyperparameter Tuning for GPT2-Sequence-Classification-Model Using GAAM

## Install Required Libraries

In [1]:
!pip install datasets # unified interface for accessing and working with various datasets (by hugging face)
!pip install -U accelerate # library by hugging face that handles device placement and mixed precision for PyTorch training (needed by the Trainer)
!pip install -U transformers # library by hugging face that gives easy access to pre-trained models, tokenizers, and tools for fine-tuning models

import utils # some utility functions we wrote that are used across the different notebooks

Collecting datasets
  Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Collecting huggingface-hub>=0.21.2 (from datasets)
  Downloading huggingface_hub-0.23.0-py3-none-any

## Loading and Processing the Dataset

We load the AG News dataset from Hugging Face. Each sample consists of a single string feature, `text`, that stores the title together with the beginning of the article text. The label is the category the article belongs to (world, sports, business, sci/tech). [Link](https://huggingface.co/datasets/ag_news/viewer/default/train) to explore the structure of the data.

In [2]:
from datasets import load_dataset

dataset = load_dataset('ag_news')

print(dataset)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})


We reduce the dataset to 0.2% of its original size to shorten training times, while keeping the original structure and label distribution of the data.

In [3]:
from datasets import DatasetDict

dataset_train_reduced = utils.take_a_percentage_of_data(dataset['train'], percentage=0.002)
dataset_test_reduced = utils.take_a_percentage_of_data(dataset['test'], percentage=0.002)

dataset_reduced = DatasetDict({
    'train': dataset_train_reduced,
    'test': dataset_test_reduced
}) # combine the shortened datasets back into the old structure.

print(dataset_reduced)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 240
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 12
    })
})
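The helper `utils.take_a_percentage_of_data` is not shown in this notebook. The following is a minimal sketch of how such a class-preserving reduction could work, assuming the helper samples each class separately and rounds down. The function name `take_a_percentage_per_class` and the `seed` parameter are hypothetical. The assumption is at least consistent with the row counts printed above (240 and 12).

```python
import random
from collections import defaultdict

def take_a_percentage_per_class(labels, percentage, seed=0):
    """Return sorted indices keeping int(percentage * class_size) samples of every class."""
    indices_by_label = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_label[label].append(index)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    selected = []
    for indices in indices_by_label.values():
        n_keep = int(len(indices) * percentage)  # rounds down, e.g. 1900 * 0.002 -> 3
        selected.extend(rng.sample(indices, n_keep))
    return sorted(selected)

# Toy label column shaped like the AG News test split: 4 classes with 1900 samples each
test_labels = [i % 4 for i in range(7600)]
subset_indices = take_a_percentage_per_class(test_labels, percentage=0.002)
print(len(subset_indices))  # 12
```

With a real `datasets.Dataset`, the returned indices could be passed to `dataset.select(indices)` to build the reduced split.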


## Tokenizing the Dataset

Tokenize the dataset with the tokenizer that belongs to the pre-trained GPT-2 model, so that the inputs match what the model saw during pre-training.

In [4]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token # GPT-2 has no dedicated padding token, so the end-of-sequence token is reused for padding

def tokenize_function(examples):
    # pad every sample to the maximum length supported by the tokenizer
    return tokenizer(examples["text"], padding="max_length")

tokenized_dataset = dataset_reduced.map(tokenize_function, batched=True) # performed in batches to increase performance

# tokenization adds two features: 'input_ids' (the token ids of 'text') and 'attention_mask', which marks padding tokens so that the model does not attend to them
print(tokenized_dataset)



DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 240
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 12
    })
})
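To make the two added features concrete, here is a toy sketch of what padding to a fixed length produces. It does not call the real tokenizer: `pad_to_max_length` is a hypothetical helper, the two token ids are arbitrary example values, and a maximum length of 6 is used only to keep the output readable. The padding id 50256 is GPT-2's end-of-sequence token, which the cell above reuses as the padding token.

```python
def pad_to_max_length(token_ids, max_length, pad_token_id):
    """Right-pad a list of token ids and build the matching attention mask."""
    n_pad = max_length - len(token_ids)
    if n_pad < 0:
        raise ValueError("sequence is longer than max_length")
    return {
        "input_ids": token_ids + [pad_token_id] * n_pad,
        "attention_mask": [1] * len(token_ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }

encoded = pad_to_max_length([15496, 995], max_length=6, pad_token_id=50256)
print(encoded["input_ids"])       # [15496, 995, 50256, 50256, 50256, 50256]
print(encoded["attention_mask"])  # [1, 1, 0, 0, 0, 0]
```

Because the padding token and the end-of-sequence token share one id, the attention mask is the only reliable way to tell real tokens from padding.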


## Loading the Pre-trained GPT2-Model

Load the pre-trained GPT2-Model for sequence classification.

In [6]:
from transformers import GPT2ForSequenceClassification

gpt2_model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=4) # our GPT-2 model should distinguish between 4 labels; this adds a final fully connected layer with 4 output neurons

print(gpt2_model)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=4, bias=False)
)


## Implementing the Gaussian Adaptive Attention Block

Install the package published alongside the GAAM paper.

In [7]:
!pip3 install gaussian-adaptive-attention

Collecting gaussian-adaptive-attention
  Downloading gaussian_adaptive_attention-0.1.5-py3-none-any.whl (8.7 kB)
Installing collected packages: gaussian-adaptive-attention
Successfully installed gaussian-adaptive-attention-0.1.5


**Approach**: Create a wrapper class around the `MultiHeadGaussianAdaptiveAttention` module and insert it into every block of the GPT-2 architecture in place of the original attention module. The wrapper ensures that `MultiHeadGaussianAdaptiveAttention` is compatible with the interface the GPT-2 architecture expects.

In [8]:
import copy

import torch
from gaussian_adaptive_attention import MultiHeadGaussianAdaptiveAttention

class MultiHeadGaussianAdaptiveAttentionWrapper(torch.nn.Module):
    def __init__(self, config, num_heads=4, num_gaussians=5, norm_axis=1):
        super().__init__()
        self.attention = MultiHeadGaussianAdaptiveAttention(
            norm_axis=norm_axis,
            num_heads=num_heads,
            num_gaussians=num_gaussians,
            padding_value=config.eos_token_id,
            eps=config.layer_norm_epsilon
        )

    def forward(self, hidden_states, **kwargs):
        # GPT-2 passes further keyword arguments (e.g. the attention mask); they are accepted here but not forwarded to GAAM
        attention_output = self.attention(hidden_states)
        # Return a tuple: the unchanged input first, followed by the GAAM output unpacked along its first dimension
        return (hidden_states,) + tuple(attention_output)


## Hyperparameter Tuning

The hyperparameters of the multi-head Gaussian adaptive attention block are the number of Gaussians and the number of heads.

The paper suggests a setup with `num_gaussians = 5` and `num_heads = 4`. Because this setup has only a small number of parameters, we investigated how increasing `num_gaussians` and `num_heads` affects the performance of the model. While doing so, we kept the suggested ratio between `num_gaussians` and `num_heads` (5:4) constant.

In [12]:
possible_num_gaussians = [5, 10, 20]
possible_num_heads = [4, 8, 16]

models_with_different_parameters = {}

# Create a model for each of the different sets of hyperparameters
for num_gaussians, num_heads in zip(possible_num_gaussians, possible_num_heads):

    # create a new instance of the model with a descriptive name
    var_name = f"gaussian_gpt2_model_gaussians_{num_gaussians}_heads_{num_heads}"
    gaussian_model = copy.deepcopy(gpt2_model)

    # replace the attention module of every block with a GAAM block
    for block in gaussian_model.transformer.h:
        block.attn = MultiHeadGaussianAdaptiveAttentionWrapper(config=gaussian_model.config, num_heads=num_heads, num_gaussians=num_gaussians, norm_axis=1)

    # freeze the pretrained layers and compute the number of trainable parameters
    gaussian_model = utils.freeze_pretrained_layers(gaussian_model)
    trainable_params = sum(p.numel() for p in gaussian_model.parameters() if p.requires_grad)
    print(f"Total number trainable parameters for model {var_name}: {trainable_params}")

    # store the model under its name
    models_with_different_parameters[var_name] = gaussian_model


Total number trainable parameters for model gaussian_gpt2_model_gaussians_5_heads_4: 3552
Total number trainable parameters for model gaussian_gpt2_model_gaussians_10_heads_8: 4992
Total number trainable parameters for model gaussian_gpt2_model_gaussians_20_heads_16: 10752
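The printed totals can be reproduced with a simple formula. This is a sketch under two assumptions inferred from the numbers, not taken from the package source or from `utils.freeze_pretrained_layers`: the classification head `score` (768 × 4 weights, no bias) stays trainable, and each GAAM head holds two learnable scalars per Gaussian. The function name `expected_trainable_parameters` is hypothetical.

```python
def expected_trainable_parameters(num_heads, num_gaussians,
                                  num_layers=12, hidden_size=768, num_labels=4):
    """Trainable parameter count under the assumptions stated above."""
    score_head = hidden_size * num_labels      # 768 * 4 = 3072 weights, no bias
    per_layer = num_heads * 2 * num_gaussians  # assumption: two scalars per Gaussian per head
    return score_head + num_layers * per_layer

for num_gaussians, num_heads in [(5, 4), (10, 8), (20, 16)]:
    total = expected_trainable_parameters(num_heads, num_gaussians)
    print(f"gaussians={num_gaussians}, heads={num_heads}: {total}")
# gaussians=5, heads=4: 3552
# gaussians=10, heads=8: 4992
# gaussians=20, heads=16: 10752
```

Under these assumptions, the classification head accounts for most of the trainable parameters in all three setups, and the GAAM blocks contribute between 480 and 7680 parameters.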


### Training

Because our computational resources were limited, we were forced to freeze the pretrained layers and use a very small training loop to test the models with the different hyperparameter setups. We trained all the models for 5 epochs on 0.2% of the data.

In [10]:
import pandas as pd
from transformers import Trainer, TrainingArguments

results = pd.DataFrame(columns=['Validation Loss', 'Accuracy'])

num_epochs = 5
num_models = len(models_with_different_parameters)

for idx, (model_name, network) in enumerate(models_with_different_parameters.items(), start=1):

    training_args = TrainingArguments(
        output_dir="./results",
        learning_rate=2e-5,
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        num_train_epochs=num_epochs,
        use_cpu=False,
        save_strategy='epoch',
        logging_strategy='epoch',
        evaluation_strategy='epoch',
        load_best_model_at_end=True
    )

    trainer = Trainer(
        model=network,
        args=training_args,
        train_dataset=tokenized_dataset["train"],
        eval_dataset=tokenized_dataset["test"],
        compute_metrics=utils.compute_accuracy
    )

    print(f"Model name: {model_name}")

    trainer.train()

    # log_history alternates between a training entry and an evaluation entry per epoch,
    # so index 1 + 2 * (num_epochs - 1) is the evaluation entry of the final epoch
    final_eval = trainer.state.log_history[1 + 2 * (num_epochs - 1)]
    results.loc[model_name] = [trainer.state.best_metric, final_eval["eval_accuracy"]]

    print(f"{(idx / num_models)*100}% done!")

print(results)

Model name: gaussian_gpt2_model_gaussians_5_heads_4


Epoch,Training Loss,Validation Loss,Accuracy
1,1.4441,1.41145,0.25
2,1.4122,1.402184,0.25
3,1.4059,1.396868,0.25
4,1.4058,1.394114,0.25
5,1.3919,1.393971,0.25


33.33333333333333% done!
Model name: gaussian_gpt2_model_gaussians_10_heads_8


Epoch,Training Loss,Validation Loss,Accuracy
1,1.4441,1.41145,0.25
2,1.4122,1.402184,0.25
3,1.4059,1.396868,0.25
4,1.4058,1.394114,0.25
5,1.3919,1.393971,0.25


66.66666666666666% done!
Model name: gaussian_gpt2_model_gaussians_20_heads_16


Epoch,Training Loss,Validation Loss,Accuracy
1,1.4441,1.41145,0.25
2,1.4122,1.402184,0.25
3,1.4059,1.396868,0.25
4,1.4058,1.394114,0.25
5,1.3919,1.393971,0.25


100.0% done!
                                           Validation Loss  Accuracy
gaussian_gpt2_model_gaussians_5_heads_4           1.393971      0.25
gaussian_gpt2_model_gaussians_10_heads_8          1.393971      0.25
gaussian_gpt2_model_gaussians_20_heads_16         1.393971      0.25
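The training cell reads the accuracy from a fixed position in `trainer.state.log_history`, which only works while exactly one training entry and one evaluation entry are logged per epoch. A less brittle alternative is to filter the history for evaluation entries. The sketch below uses a mock `log_history` built from the values in the tables above; the helper names and the final summary entry (with a placeholder runtime) are illustrative assumptions.

```python
def eval_entries(log_history):
    """Keep only the entries that were written by an evaluation step."""
    return [entry for entry in log_history if "eval_loss" in entry]

def best_eval_entry(log_history):
    """Evaluation entry with the lowest validation loss."""
    return min(eval_entries(log_history), key=lambda entry: entry["eval_loss"])

# Mock history: one training entry and one evaluation entry per epoch
train_losses = [1.4441, 1.4122, 1.4059, 1.4058, 1.3919]
eval_losses = [1.41145, 1.402184, 1.396868, 1.394114, 1.393971]
log_history = []
for epoch, (train_loss, eval_loss) in enumerate(zip(train_losses, eval_losses), start=1):
    log_history.append({"loss": train_loss, "epoch": float(epoch)})
    log_history.append({"eval_loss": eval_loss, "eval_accuracy": 0.25, "epoch": float(epoch)})
log_history.append({"train_runtime": 0.0, "epoch": 5.0})  # placeholder summary entry

last = eval_entries(log_history)[-1]
best = best_eval_entry(log_history)
print(last["eval_accuracy"], best["eval_loss"], best["epoch"])  # 0.25 1.393971 5.0
```

In this run the best epoch is also the final one, so the fixed index and the filtered lookup return the same entry. They would differ if the validation loss rose again in later epochs.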


### Interpretation of the Results

Surprisingly, across repeated runs the hyperparameters of GAAM had no effect on the training process: all three models reach exactly the same loss and accuracy in every epoch, although training took substantially longer for the models with more Gaussians and more heads.
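One possible explanation lies in the return value of the wrapper defined above. It returns the unchanged `hidden_states` as the first tuple element, and in the `transformers` versions we are aware of, a GPT-2 block treats the first element of the attention module's return tuple as the attention output. If that applies here, the GAAM output never reaches the logits and its parameters receive no gradient. The toy sketch below illustrates the effect with stand-in modules: `PassThroughAttention` and `ToyBlock` are hypothetical simplifications, not the real GAAM or GPT-2 classes.

```python
import torch

class PassThroughAttention(torch.nn.Module):
    """Stand-in for the wrapper: returns its input first and the computed output second."""
    def __init__(self, scale):
        super().__init__()
        self.scale = torch.nn.Parameter(torch.tensor(float(scale)))  # stand-in for GAAM weights

    def forward(self, hidden_states, **kwargs):
        attention_output = hidden_states * self.scale
        return (hidden_states, attention_output)

class ToyBlock(torch.nn.Module):
    """Consumes the attention tuple the way described above: element 0 plus a residual."""
    def __init__(self, attn):
        super().__init__()
        self.attn = attn

    def forward(self, hidden_states):
        residual = hidden_states
        attn_outputs = self.attn(hidden_states)
        attn_output = attn_outputs[0]  # only the first element is used
        return attn_output + residual

x = torch.ones(1, 3, 2, requires_grad=True)
block_small = ToyBlock(PassThroughAttention(scale=0.5))
block_large = ToyBlock(PassThroughAttention(scale=7.0))

out_small = block_small(x)
out_large = block_large(x)
outputs_identical = torch.equal(out_small, out_large)

out_small.sum().backward()
scale_received_gradient = block_small.attn.scale.grad is not None
print(outputs_identical, scale_received_gradient)  # True False
```

Both toy blocks produce identical outputs regardless of `scale`, which mirrors the identical loss curves observed for the three setups. If, in addition, the classification head reads the final (padding) position because `config.pad_token_id` is not set on the model, every input would yield the same logits, which would be consistent with a constant prediction. Both points are hypotheses that would need to be verified against the installed library versions.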

Furthermore, the model never exceeds an accuracy of 25%, which is chance level for four balanced classes. This is because the model always outputs 2 as the label, both before and after training. We tried several ways to change this behavior, e.g. deliberately overfitting on a small batch (50 epochs of training on only 60 samples) and adjusting the training hyperparameters (alternating between small and large learning rates). The results were consistently the same: the model reaches a loss of approximately 1.4 and an accuracy of 25%.
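Both numbers match what a classifier that ignores its input would produce. The short calculation below is a sketch that assumes the reduced test set contains three samples per class (12 samples, 4 classes); that split is consistent with the row count but not shown explicitly in this notebook.

```python
import math

num_classes = 4

# Cross-entropy of a model that assigns the same probability to every class
uniform_loss = -math.log(1 / num_classes)
print(round(uniform_loss, 4))  # 1.3863

# Accuracy of a model that always predicts label 2 on a balanced test set
labels = [0, 1, 2, 3] * 3         # assumed: 3 samples per class
predictions = [2] * len(labels)
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(accuracy)  # 0.25
```

The final validation loss of 1.393971 lies just above ln 4 ≈ 1.3863, so the trained models appear to have learned little beyond the class frequencies.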

To conclude, we decided to keep the hyperparameters of the smallest model (`num_gaussians = 5`, `num_heads = 4`) for later models, because they match the suggestion in the code of the paper and because they reduce training times.