# 1️⃣ Training an Adapter for a Transformer model

In this notebook, we train an adapter for a **RoBERTa** ([Liu et al., 2019](https://arxiv.org/pdf/1907.11692.pdf)) model for **sequence classification** with seven classes using [adapter-transformers](https://github.com/Adapter-Hub/adapter-transformers), the _AdapterHub_ adaptation of HuggingFace's _transformers_ library.

If you're unfamiliar with the theoretical parts of adapters or the AdapterHub framework, check out our [introductory blog post](https://adapterhub.ml/blog/2020/11/adapting-transformers-with-adapterhub/) first.

We train a **Task Adapter** for a pre-trained model here. Most of the code is identical to a full finetuning setup using HuggingFace's transformers. For comparison, have a look at the [same guide using full finetuning](https://colab.research.google.com/drive/1brXJg5Mokm8h3shxqPRnoIsRwHQoncus?usp=sharing).

For training, we use the `yxchar/sciie-tlm` dataset from the HuggingFace Hub. Each sample consists of a short text and one of seven class labels. We download the dataset via HuggingFace's [datasets](https://github.com/huggingface/datasets) library.

## Installation

First, let's install the required libraries:

In [1]:
!pip install -U adapter-transformers
!pip install datasets



## Dataset Preprocessing

Before we start to train our adapter, we first prepare the training data. Our training dataset can be loaded via HuggingFace `datasets` using one line of code:

In [2]:
from datasets import load_dataset, DatasetDict

dataset = load_dataset("yxchar/sciie-tlm")
dataset.num_rows

{'test': 974, 'train': 3219, 'validation': 455}

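Every dataset sample has an input text and a class label (one of seven classes, as we verify further down). Let's check the length in characters of the longest training text and the structure of the dataset: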

In [19]:
max([len(x['text']) for x in dataset['train']])

577

In [20]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'text', 'id', 'label'],
        num_rows: 3219
    })
    test: Dataset({
        features: ['Unnamed: 0', 'text', 'id', 'label'],
        num_rows: 974
    })
    validation: Dataset({
        features: ['Unnamed: 0', 'text', 'id', 'label'],
        num_rows: 455
    })
})

Now, we need to encode all dataset samples to valid inputs for our Transformer model. Since we want to train on `roberta-base`, we load the corresponding `RobertaTokenizer`. Using `dataset.map()`, we can pass the full dataset through the tokenizer in batches:

In [3]:
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

def encode_batch(batch):
  """Encodes a batch of input data using the model tokenizer."""
  return tokenizer(batch["text"], max_length=512, truncation=True, padding="max_length")

# Encode the input data
dataset = dataset.map(encode_batch, batched=True)
# The transformers model expects the target class column to be named "labels"
dataset = dataset.rename_column("label", "labels")
# Transform to pytorch tensors and only output the required columns
dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])


Now we're ready to train our model. The encoded dataset now carries the `input_ids` and `attention_mask` columns:

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'attention_mask', 'id', 'input_ids', 'labels', 'text'],
        num_rows: 3219
    })
    test: Dataset({
        features: ['Unnamed: 0', 'attention_mask', 'id', 'input_ids', 'labels', 'text'],
        num_rows: 974
    })
    validation: Dataset({
        features: ['Unnamed: 0', 'attention_mask', 'id', 'input_ids', 'labels', 'text'],
        num_rows: 455
    })
})
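
To make the effect of `max_length`, `truncation`, and `padding` in `encode_batch` more concrete, here is a minimal sketch with a toy whitespace tokenizer. The function `toy_encode`, the vocabulary, and the padding id `0` are hypothetical stand-ins; the real `RobertaTokenizer` uses byte-level BPE, adds special tokens, and has its own padding id.

```python
# Toy illustration of truncation and padding to a fixed length.
# The whitespace "tokenizer" and the vocabulary are hypothetical stand-ins
# and do not reproduce what RobertaTokenizer does internally.
PAD_ID = 0

def toy_encode(text, vocab, max_length):
    """Map words to ids, cut to max_length, then pad with PAD_ID."""
    ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]
    ids = ids[:max_length]                      # corresponds to truncation=True
    mask = [1] * len(ids)                       # 1 marks a real token
    pad = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * pad,      # corresponds to padding="max_length"
        "attention_mask": mask + [0] * pad,     # 0 marks padding
    }

vocab = {}
short_enc = toy_encode("adapters are small", vocab, max_length=5)
long_enc = toy_encode("adapters are small modules added to each layer", vocab, max_length=5)
print(short_enc)
print(long_enc)
```

Short inputs are padded on the right and flagged with `0` in the attention mask, while long inputs are cut at `max_length`.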

## Training

We use a pre-trained RoBERTa model from HuggingFace. We use `RobertaModelWithHeads`, a class unique to `adapter-transformers`, which allows us to add and configure prediction heads more flexibly.

In [None]:
import numpy as np

In [None]:
np.unique(dataset['train']['labels'])

array([0, 1, 2, 3, 4, 5, 6])

In [4]:
from transformers import RobertaConfig, RobertaModelWithHeads

config = RobertaConfig.from_pretrained(
    "roberta-base",
    num_labels=7,
)
model = RobertaModelWithHeads.from_pretrained(
    "roberta-base",
    config=config,
)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModelWithHeads: ['lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaModelWithHeads from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModelWithHeads from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModelWithHeads were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and infere

**Here comes the important part!**

We add a new adapter to our model by calling `add_adapter()`. We pass a name (`"sciie"`) for the new [task adapter](https://docs.adapterhub.ml/adapters.html#adapter-types). Next, we add a classification head with seven output labels. It's convenient to give the prediction head the same name as the adapter. This allows us to activate both together in the next step. The `train_adapter()` method does two things:

1. It freezes all weights of the pre-trained model so only the adapter weights are updated during training.
2. It activates the adapter and the prediction head such that both are used in every forward pass.
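
The freezing step can be pictured as a per-parameter flag that decides which weights receive updates. The sketch below is purely conceptual: the function `trainable_flags` and all parameter names are hypothetical and do not reproduce the names used inside `adapter-transformers`, and it assumes that the prediction head stays trainable together with the adapter.

```python
# Conceptual sketch only: the parameter names are hypothetical and do not
# reproduce the exact names used inside adapter-transformers.
def trainable_flags(param_names, adapter_name):
    """Return {name: bool}; True means the weight would be updated in training."""
    flags = {}
    for name in param_names:
        is_adapter = f".adapters.{adapter_name}." in name
        # Assumption: the matching prediction head is trained as well
        is_head = name.startswith(f"heads.{adapter_name}.")
        flags[name] = is_adapter or is_head
    return flags

params = [
    "roberta.encoder.layer.0.attention.self.query.weight",
    "roberta.encoder.layer.0.output.dense.weight",
    "roberta.encoder.layer.0.output.adapters.sciie.adapter_down.weight",
    "roberta.encoder.layer.0.output.adapters.sciie.adapter_up.weight",
    "heads.sciie.classifier.weight",
]
flags = trainable_flags(params, "sciie")
for name, trainable in flags.items():
    print(f"{'train ' if trainable else 'frozen'}  {name}")
```

In a real PyTorch model, the same idea is expressed through the `requires_grad` attribute of each parameter.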

In [5]:
model.load_adapter("/content/final_adapter")

'sciie'

In [7]:
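# Disabled in this run: the cells around this one load an already trained adapter.
# To train a new adapter instead, uncomment the lines below.
#
# # Add a new adapter
# model.add_adapter("sciie")
# # Add a matching classification head with seven labels
# model.add_classification_head("sciie", num_labels=7)
# # Freeze the pre-trained weights and activate the adapter for training
# model.train_adapter("sciie")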

In [6]:
adapter_name = model.load_adapter("/content/final_adapter")

model.set_active_adapters(adapter_name)

Overwriting existing adapter 'sciie'.
Overwriting existing head 'sciie'


In [None]:
import torch
# Fall back to the CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

In [7]:
test_dataset = dataset["test"]
bsz = 16
# Slice the test split into consecutive batches of at most `bsz` samples
batches = [test_dataset[i:i + bsz] for i in range(0, test_dataset.num_rows, bsz)]

In [33]:
# Put the model in evaluation mode on the target device
model.to(device)
model.eval()

# Predicted logits and ground-truth labels, one array per batch
predictions, true_labels = [], []

for batch in batches:
  # Move the batch tensors to the same device as the model
  input_ids = batch["input_ids"].to(device)
  attention_mask = batch["attention_mask"].to(device)
  labels = batch["labels"].to(device)

  # Disable gradient tracking to save memory and speed up prediction
  with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask, labels=labels)

  # With labels passed, the output is (loss, logits, ...)
  logits = outputs[1]

  # Store logits and labels as NumPy arrays on the CPU
  predictions.append(logits.detach().cpu().numpy())
  true_labels.append(labels.cpu().numpy())

In [34]:
predictions

[array([[-1.8884428 ,  0.1050658 , -0.7264077 ,  3.2858791 ,  3.100538  ,
         -4.3508964 , -0.60108906],
        [-1.2388635 ,  2.5230796 , -2.715296  ,  3.9984171 ,  4.7768316 ,
         -4.4296246 , -3.981271  ],
        [-2.971092  ,  1.7569726 , -3.2996414 ,  5.138061  ,  2.7759657 ,
         -4.9823833 ,  0.5523543 ],
        [-2.643563  , -0.43135262, -1.0583457 ,  4.458822  ,  1.0931156 ,
         -3.330486  ,  0.392027  ],
        [-2.351655  , -2.3277557 , -3.5022995 ,  7.9692497 , -0.76124746,
          0.73324275, -0.26425788],
        [-2.75016   ,  1.2060477 , -4.0436044 ,  7.8976936 , -0.4114193 ,
         -3.7191281 ,  0.657977  ],
        [-5.0352163 ,  2.5033891 , -1.242385  ,  4.7942786 , -0.57123315,
         -4.5497103 ,  2.7803087 ],
        [-5.148657  ,  3.4410336 , -2.0782764 ,  4.8292174 ,  0.13492694,
         -4.656694  ,  2.3420086 ],
        [-1.7422447 , -2.3740318 , -4.0334535 ,  8.089609  , -1.2591072 ,
          0.0239407 ,  0.5522504 ],
        [ 

For training, we use the `AdapterTrainer` class, which `adapter-transformers` provides alongside the `Trainer` class built into `transformers`. We configure the training process using a `TrainingArguments` object and define a method that calculates the evaluation accuracy at the end. We pass both, together with the training and validation splits of our dataset, to the trainer instance.

**Note the differences in hyperparameters compared to full finetuning.** Adapter training usually requires a few more training epochs than full finetuning.
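
The accuracy metric mentioned above boils down to taking the highest-scoring class per sample and comparing it with the label. Here is a minimal pure-Python sketch of that idea. The two logit rows are copied from the prediction output shown earlier, while the helper names and the "true" labels are made up for illustration.

```python
# Pure-Python sketch of an argmax-based accuracy metric.
# The logit rows come from the prediction output shown earlier;
# the ground-truth labels are hypothetical.
def argmax(row):
    """Index of the largest value in a list of floats."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)

logits = [
    [-1.8884428, 0.1050658, -0.7264077, 3.2858791, 3.100538, -4.3508964, -0.60108906],
    [-1.2388635, 2.5230796, -2.715296, 3.9984171, 4.7768316, -4.4296246, -3.981271],
]
labels = [3, 3]  # hypothetical ground truth
print(accuracy(logits, labels))
```

The NumPy version used for training does the same thing in vectorized form with `np.argmax(..., axis=1)`.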

In [None]:
import torch, gc

gc.collect()
torch.cuda.empty_cache()

In [None]:
import numpy as np
from transformers import TrainingArguments, AdapterTrainer, EvalPrediction

training_args = TrainingArguments(
    learning_rate=2e-5,
    num_train_epochs=30,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    logging_steps=200,
    output_dir="./training_output",
    overwrite_output_dir=True,
    # The next line is important to ensure the dataset labels are properly passed to the model
    remove_unused_columns=False,
)

def compute_accuracy(p: EvalPrediction):
  preds = np.argmax(p.predictions, axis=1)
  return {"acc": (preds == p.label_ids).mean()}

trainer = AdapterTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=compute_accuracy,
)


Start the training 🚀

In [None]:
trainer.train()

***** Running training *****
  Num examples = 3219
  Num Epochs = 30
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 6060


| Step | Training Loss |
|------|---------------|
| 200  | 0.492  |
| 400  | 0.4311 |
| 600  | 0.4038 |
| 800  | 0.4006 |
| 1000 | 0.3773 |
| 1200 | 0.3962 |
| 1400 | 0.404  |
| 1600 | 0.3944 |
| 1800 | 0.3687 |
| 2000 | 0.3689 |


TrainOutput(global_step=6060, training_loss=0.32730629200195716, metrics={'train_runtime': 3898.0, 'train_samples_per_second': 24.774, 'train_steps_per_second': 1.555, 'total_flos': 2.58503554994688e+16, 'train_loss': 0.32730629200195716, 'epoch': 30.0})
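
The numbers in the training log can be cross-checked with a few lines of arithmetic. The sketch below assumes a single device and no gradient accumulation, as reported in the log; all input values are taken from the output above.

```python
import math

num_examples = 3219      # "Num examples" in the training log
batch_size = 16          # per_device_train_batch_size, single device assumed
num_epochs = 30
train_runtime = 3898.0   # seconds, from TrainOutput

# The last batch of each epoch is only partially filled, hence the ceiling
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

The results agree with `global_step`, `train_samples_per_second`, and `train_steps_per_second` in the `TrainOutput`.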

Looks good! Let's evaluate our adapter on the validation split of the dataset to see how well it learned:

In [None]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 455
  Batch size = 16


{'epoch': 30.0,
 'eval_acc': 0.8725274725274725,
 'eval_loss': 0.40560296177864075,
 'eval_runtime': 9.0297,
 'eval_samples_per_second': 50.389,
 'eval_steps_per_second': 3.212}

We can put our trained model into a `transformers` pipeline to be able to make new predictions conveniently:

In [None]:
from transformers import TextClassificationPipeline

classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer, device=training_args.device.index)

classifier("This is awesome!")

[{'label': 'LABEL_3', 'score': 0.7224643230438232}]
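
To see where a result like this comes from, here is a minimal sketch that turns one row of logits into a label and a score. It assumes that the pipeline applies a softmax over the class logits and names classes `LABEL_<index>` when no `id2label` mapping is configured. The helper names are hypothetical, and the logit row is copied from the test-set predictions shown earlier, not from the sentence above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_label(logits):
    """Return a dict shaped like one pipeline result (assumed LABEL_<index> naming)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": f"LABEL_{best}", "score": probs[best]}

# One row of logits taken from the prediction output earlier in this notebook
row = [-2.351655, -2.3277557, -3.5022995, 7.9692497, -0.76124746, 0.73324275, -0.26425788]
result = top_label(row)
print(result)
```

Passing an `id2label` mapping when adding the classification head would replace the generic `LABEL_<index>` names with readable ones.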

Finally, we can also extract the adapter from our model and save it separately for later reuse. Note the size difference compared to a full model!

In [None]:
model.save_adapter("./final_adapter", "sciie")

!ls -lh final_adapter

Configuration saved in ./final_adapter/adapter_config.json
Module weights saved in ./final_adapter/pytorch_adapter.bin
Configuration saved in ./final_adapter/head_config.json
Module weights saved in ./final_adapter/pytorch_model_head.bin


total 5.8M
-rw-r--r-- 1 root root  571 Dec 14 01:09 adapter_config.json
-rw-r--r-- 1 root root  477 Dec 14 01:09 head_config.json
-rw-r--r-- 1 root root 3.5M Dec 14 01:09 pytorch_adapter.bin
-rw-r--r-- 1 root root 2.3M Dec 14 01:09 pytorch_model_head.bin
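
A back-of-the-envelope estimate helps to put these file sizes into perspective. Every architecture number in the sketch below is an assumption rather than something read from the saved configuration: a hidden size of 768 and 12 layers for `roberta-base`, one bottleneck adapter per layer with a reduction factor of 16, a head with one hidden dense layer, and float32 weights.

```python
# Rough size estimate. All architecture numbers are assumptions, not values
# read from adapter_config.json or head_config.json.
hidden = 768                 # assumed hidden size of roberta-base
layers = 12                  # assumed number of Transformer layers
bottleneck = hidden // 16    # assumed reduction factor of 16
num_labels = 7
bytes_per_param = 4          # float32

# Per layer: down-projection and up-projection, each with a bias vector
adapter_params = layers * (hidden * bottleneck + bottleneck + bottleneck * hidden + hidden)
# Head: one dense layer (hidden x hidden) plus the output layer (hidden x num_labels)
head_params = (hidden * hidden + hidden) + (hidden * num_labels + num_labels)

def mib(n_params):
    """Size in MiB of n_params float32 values."""
    return n_params * bytes_per_param / 2**20

print(f"adapter: {adapter_params:,} params, about {mib(adapter_params):.1f} MiB")
print(f"head:    {head_params:,} params, about {mib(head_params):.1f} MiB")
```

If these assumptions hold, the estimates land close to the two `.bin` files in the listing above. A full `roberta-base` checkpoint, with roughly 125 million parameters, would take around 500 MB at float32.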


**Share your work!**

The next step after training is to share our adapter with the world via _AdapterHub_. [Read our guide](https://docs.adapterhub.ml/contributing.html) on how to prepare the adapter module we just saved and contribute it to the Hub!

➡️ Also continue with [the next Colab notebook](https://colab.research.google.com/github/Adapter-Hub/adapter-transformers/blob/master/notebooks/02_Adapter_Inference.ipynb) to learn how to use adapters from the Hub.

In [None]:
from google.colab import files
!zip -r /content/file.zip /content
files.download("/content/file.zip")
