# 1️⃣ Training an Adapter for a Transformer model

In this notebook, we train an adapter for a **RoBERTa** ([Liu et al., 2019](https://arxiv.org/pdf/1907.11692.pdf)) model for sequence classification on a **citation intent classification** task using [adapter-transformers](https://github.com/Adapter-Hub/adapter-transformers), the _AdapterHub_ adaptation of HuggingFace's _transformers_ library.

If you're unfamiliar with the theoretical parts of adapters or the AdapterHub framework, check out our [introductory blog post](https://adapterhub.ml/blog/2020/11/adapting-transformers-with-adapterhub/) first.

We train a **Task Adapter** for a pre-trained model here. Most of the code is identical to a full finetuning setup using HuggingFace's transformers. For comparison, have a look at the [same guide using full finetuning](https://colab.research.google.com/drive/1brXJg5Mokm8h3shxqPRnoIsRwHQoncus?usp=sharing).

For training, we use the `yxchar/citation_intent-tlm` dataset from the HuggingFace Hub. It contains text samples that are each labeled with one of six citation intent classes, split into 1,688 training, 114 validation and 139 test examples. We download the dataset via HuggingFace's [datasets](https://github.com/huggingface/datasets) library.

## Installation

First, let's install the required libraries:

In [1]:
!pip install -U adapter-transformers
!pip install datasets

Collecting adapter-transformers
  Downloading adapter_transformers-2.2.0-py3-none-any.whl (2.9 MB)
Collecting huggingface-hub>=0.0.17
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, ada

## Dataset Preprocessing

Before we start to train our adapter, we first prepare the training data. Our training dataset can be loaded via HuggingFace `datasets` using one line of code:

In [146]:
from datasets import load_dataset, DatasetDict

dataset = load_dataset("yxchar/citation_intent-tlm")
dataset.num_rows

Using custom data configuration yxchar___citation_intent-tlm-1b778e179b05f85c
Downloading and preparing dataset csv/yxchar___citation_intent-tlm to /root/.cache/huggingface/datasets/csv/yxchar___citation_intent-tlm-1b778e179b05f85c/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a...
Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/yxchar___citation_intent-tlm-1b778e179b05f85c/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a. Subsequent calls will reuse this data.

{'test': 139, 'train': 1688, 'validation': 114}

Every dataset sample has an input text and an integer class label (one of six classes, as we verify below):

In [147]:
dataset['train']['label'][0]

0

In [148]:
# from datasets import ClassLabel, Value
# new_features = dataset['train'].features.copy()
# new_features["label"] = ClassLabel(names=["0","1","2","3","4","5"])
# dataset['train'] = dataset['train'].cast(new_features)
# dataset['train'].features
# dataset.set_format("torch")

In [149]:
# train_dataset, valid_dataset= dataset["train"].train_test_split(test_size=0.2).values()
# dataset = DatasetDict({"train":train_dataset,"test":dataset["test"], "validation":valid_dataset})
# dataset.num_rows

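Now, we need to encode all dataset samples as valid inputs for our Transformer model. We first count the distinct labels in the training split, since the model configuration and the classification head both need this number. Then, since we want to train on `roberta-base`, we load the corresponding `RobertaTokenizer`. Using `dataset.map()`, we can pass the full dataset through the tokenizer in batches: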

In [150]:
import numpy as np
nLabels = len(np.unique(dataset['train']['label']))
nLabels

6

In [151]:
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

def encode_batch(batch):
  """Encodes a batch of input data using the model tokenizer."""
  return tokenizer(batch["text"], max_length=512, truncation=True, padding="max_length")

# Encode the input data
dataset = dataset.map(encode_batch, batched=True)
# The transformers model expects the target class column to be named "labels".
# rename_column() returns a new dataset, so we reassign the result.
dataset = dataset.rename_column("label", "labels")
# Transform to pytorch tensors and only output the required columns
dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])


Now we're ready to train our model...
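Before moving on, here is a minimal pure-Python sketch of what `truncation=True` and `padding="max_length"` do to a single sequence. It is not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and we assume a padding id of 1 (the `<pad>` id in the `roberta-base` vocabulary). The real tokenizer also keeps the closing special token when it truncates, which this sketch ignores.

```python
from typing import Dict, List

PAD_TOKEN_ID = 1  # assumption: <pad> id of roberta-base


def pad_or_truncate(token_ids: List[int], max_length: int) -> Dict[str, List[int]]:
    """Toy version of truncation=True combined with padding="max_length"."""
    # Truncation: keep at most max_length tokens.
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    return {
        # Padding: fill the remaining positions with the pad id.
        "input_ids": kept + [PAD_TOKEN_ID] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


# Made-up token ids, shorter and longer than max_length.
short = pad_or_truncate([0, 713, 16, 2], max_length=8)
long = pad_or_truncate(list(range(10, 30)), max_length=8)
print(short)
print(long)
```

With `max_length=512`, as in the cell above, every example occupies 512 positions no matter how short the text is. The attention mask tells the model which of those positions are padding.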

## Training

We use a pre-trained RoBERTa model from HuggingFace. We use `RobertaModelWithHeads`, a class unique to `adapter-transformers`, which allows us to add and configure prediction heads more flexibly.

In [152]:
dataset['train']['labels']

tensor([0, 1, 0,  ..., 0, 2, 1])

In [153]:
from transformers import RobertaConfig, RobertaModelWithHeads

config = RobertaConfig.from_pretrained(
    "roberta-base",
    num_labels=6,
)
model = RobertaModelWithHeads.from_pretrained(
    "roberta-base",
    config=config,
)

loading configuration file https://huggingface.co/roberta-base/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/733bade19e5f0ce98e6531021dd5180994bb2f7b8bd7e80c7968805834ba351e.35205c6cfc956461d8515139f0f8dd5d207a2f336c0c3a83b4bc8dca3518e37b
Model config RobertaConfig {
  "adapters": {
    "adapters": {},
    "config_map": {},
    "fusion_config_map": {},
    "fusions": {}
  },
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5
  },
  "layer_norm_eps": 1e-05,


**Here comes the important part!**

We add a new adapter to our model by calling `add_adapter()`. We pass only a name for our [task adapter](https://docs.adapterhub.ml/adapters.html#adapter-types) (here `'CitationIntent'`, stored in `adaptername`), so the adapter is created with the library's default configuration. Next, we add a classification head with six output labels, one per class in the dataset. It's convenient to give the prediction head the same name as the adapter. This allows us to activate both together in the next step. The `train_adapter()` method does two things:

1. It freezes all weights of the pre-trained model so only the adapter weights are updated during training.
2. It activates the adapter and the prediction head such that both are used in every forward pass.

In [154]:
adaptername = 'CitationIntent'

# Add a new adapter
model.add_adapter(adaptername)
# Add a matching classification head
model.add_classification_head(
    adaptername,
    num_labels=6,
)
# Activate the adapter
model.train_adapter(adaptername)

Adding adapter 'CitationIntent'.
Adding head 'CitationIntent' with config {'head_type': 'classification', 'num_labels': 6, 'layers': 2, 'activation_function': 'tanh', 'label2id': {'LABEL_0': 0, 'LABEL_1': 1, 'LABEL_2': 2, 'LABEL_3': 3, 'LABEL_4': 4, 'LABEL_5': 5}, 'use_pooler': False, 'bias': True}.
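
For intuition about what an adapter layer computes, the following NumPy sketch shows the forward pass of a bottleneck adapter as described in the adapter literature: a down-projection, a non-linearity, an up-projection and a residual connection. It is an illustration only, not the code `adapter-transformers` runs. The bottleneck size of 48 (hidden size divided by 16), the ReLU activation and the initialization scale are assumptions, not values read from this notebook.

```python
import numpy as np

HIDDEN_SIZE = 768     # "hidden_size" in the RoBERTa config printed above
BOTTLENECK_SIZE = 48  # assumption: hidden size divided by 16

rng = np.random.default_rng(0)

# Adapter parameters: in adapter training, weights like these are the ones
# that receive updates while the pre-trained model stays frozen.
w_down = rng.normal(scale=0.01, size=(HIDDEN_SIZE, BOTTLENECK_SIZE))
b_down = np.zeros(BOTTLENECK_SIZE)
w_up = rng.normal(scale=0.01, size=(BOTTLENECK_SIZE, HIDDEN_SIZE))
b_up = np.zeros(HIDDEN_SIZE)


def bottleneck_adapter(hidden: np.ndarray) -> np.ndarray:
    """Down-project, apply ReLU, up-project, then add the residual."""
    down = np.maximum(hidden @ w_down + b_down, 0.0)
    return hidden + (down @ w_up + b_up)


# Toy activations with shape (batch, tokens, hidden).
hidden_states = rng.normal(size=(2, 5, HIDDEN_SIZE))
adapted = bottleneck_adapter(hidden_states)
print(adapted.shape)
```

The output has the same shape as the input, which is why such a layer can be inserted into a pre-trained model without changing anything around it.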


For training, we make use of the `AdapterTrainer` class, the `adapter-transformers` counterpart of the `Trainer` class built into `transformers`. We configure the training process using a `TrainingArguments` object and define a function that calculates the evaluation accuracy. We pass both, together with the training and validation splits of our dataset, to the trainer instance.

**Note the differences in hyperparameters compared to full finetuning.** Adapter training usually requires a few more training epochs than full finetuning.

In [155]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'attention_mask', 'id', 'input_ids', 'labels', 'text'],
        num_rows: 1688
    })
    test: Dataset({
        features: ['Unnamed: 0', 'attention_mask', 'id', 'input_ids', 'labels', 'text'],
        num_rows: 139
    })
    validation: Dataset({
        features: ['Unnamed: 0', 'attention_mask', 'id', 'input_ids', 'labels', 'text'],
        num_rows: 114
    })
})

In [156]:
import torch, gc

gc.collect()
torch.cuda.empty_cache()

In [157]:
import numpy as np
from transformers import TrainingArguments, AdapterTrainer, EvalPrediction

training_args = TrainingArguments(
    learning_rate=1e-4,
    num_train_epochs=6,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    logging_steps=200,
    output_dir="./training_output",
    overwrite_output_dir=True,
    # The next line is important to ensure the dataset labels are properly passed to the model
    remove_unused_columns=False,
)

def compute_accuracy(p: EvalPrediction):
  preds = np.argmax(p.predictions, axis=1)
  return {"acc": (preds == p.label_ids).mean()}

trainer = AdapterTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=compute_accuracy,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
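
To see what `compute_accuracy` returns, we can call it on a small hand-made batch. The sketch below repeats the function so it runs on its own, uses made-up logits for four examples and six classes, and replaces `EvalPrediction` with a hypothetical namedtuple (`FakeEvalPrediction`) that only carries the two fields the function reads.

```python
from collections import namedtuple

import numpy as np

# Hypothetical stand-in for EvalPrediction with the two fields used below.
FakeEvalPrediction = namedtuple("FakeEvalPrediction", ["predictions", "label_ids"])


def compute_accuracy(p):
    # The predicted class is the index of the largest logit in each row.
    preds = np.argmax(p.predictions, axis=1)
    return {"acc": (preds == p.label_ids).mean()}


# Made-up logits: one row per example, one column per class.
logits = np.array([
    [2.0, 0.1, 0.0, 0.0, 0.0, 0.0],  # largest logit at class 0
    [0.0, 0.3, 1.5, 0.0, 0.0, 0.0],  # largest logit at class 2
    [0.0, 0.0, 0.0, 0.2, 0.0, 0.9],  # largest logit at class 5
    [0.1, 0.8, 0.0, 0.0, 0.0, 0.0],  # largest logit at class 1
])
labels = np.array([0, 2, 4, 1])

result = compute_accuracy(FakeEvalPrediction(predictions=logits, label_ids=labels))
print(result)
```

Three of the four rows have their largest logit at the labeled class, so the returned accuracy is 0.75.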


Start the training 🚀

In [158]:
# gpu_info = !nvidia-smi
# gpu_info = '\n'.join(gpu_info)
# if gpu_info.find('failed') >= 0:
#   print('Not connected to a GPU')
# else:
#   print(gpu_info)

In [159]:
trainer.train()

***** Running training *****
  Num examples = 1688
  Num Epochs = 6
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 636


| Step | Training Loss |
|------|---------------|
| 200  | 1.3224        |
| 400  | 1.0644        |
| 600  | 0.9363        |


Saving model checkpoint to ./training_output/checkpoint-500
Configuration saved in ./training_output/checkpoint-500/CitationIntent/adapter_config.json
Module weights saved in ./training_output/checkpoint-500/CitationIntent/pytorch_adapter.bin
Configuration saved in ./training_output/checkpoint-500/CitationIntent/head_config.json
Module weights saved in ./training_output/checkpoint-500/CitationIntent/pytorch_model_head.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=636, training_loss=1.0972046762142542, metrics={'train_runtime': 404.9238, 'train_samples_per_second': 25.012, 'train_steps_per_second': 1.571, 'total_flos': 2711091332284416.0, 'train_loss': 1.0972046762142542, 'epoch': 6.0})
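
The numbers in the training log can be checked by hand. The short sketch below recomputes the step count, the logged steps and the throughput from the values printed above. It assumes a single device, no gradient accumulation, and that the last, smaller batch of each epoch is kept, which is consistent with the log.

```python
import math

num_examples = 1688         # "Num examples" in the training log
batch_size = 16             # per_device_train_batch_size on a single device
num_epochs = 6
logging_steps = 200
train_runtime_s = 404.9238  # 'train_runtime' in TrainOutput

# The last, smaller batch of each epoch still counts as one optimization step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# The loss is logged every `logging_steps` optimization steps.
logged_steps = list(range(logging_steps, total_steps + 1, logging_steps))

samples_per_second = num_examples * num_epochs / train_runtime_s

print(steps_per_epoch, total_steps, logged_steps, round(samples_per_second, 3))
```

This gives 106 steps per epoch and 636 steps in total, matching `Total optimization steps = 636`, and explains why the loss table has rows for steps 200, 400 and 600.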

Looks good! Let's evaluate our adapter on the validation split of the dataset to see how well it learned:

In [160]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 114
  Batch size = 16


{'epoch': 6.0,
 'eval_acc': 0.6929824561403509,
 'eval_loss': 0.9062106013298035,
 'eval_runtime': 2.2739,
 'eval_samples_per_second': 50.135,
 'eval_steps_per_second': 3.518}
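
One way to put `eval_acc` in context is to convert it back into a count of correct predictions and compare it with uniform random guessing over six classes. This is only a rough reference point: the class distribution of the validation split is not shown in this notebook, so a majority-class baseline could be higher than the uniform one.

```python
eval_acc = 0.6929824561403509  # 'eval_acc' from trainer.evaluate()
num_eval_examples = 114        # "Num examples" in the evaluation log
num_labels = 6

# Accuracy is correct / total, so we can recover the number of correct predictions.
num_correct = round(eval_acc * num_eval_examples)

# Expected accuracy when guessing uniformly at random among the classes.
uniform_guess_acc = 1 / num_labels

print(num_correct, round(uniform_guess_acc, 3))
```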

We can put our trained model into a `transformers` pipeline to be able to make new predictions conveniently:

In [None]:
# from transformers import TextClassificationPipeline

# classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer, device=training_args.device.index)

# classifier("This is awesome!")


Finally, we can also extract the adapter from our model and save it separately for later reuse. Note the size difference compared to a full model!

In [167]:
adaptername = 'CitationIntent'

In [168]:
model.save_adapter("./final_adapter", adaptername)
!ls -lh final_adapter

Configuration saved in ./final_adapter/adapter_config.json
Module weights saved in ./final_adapter/pytorch_adapter.bin
Configuration saved in ./final_adapter/head_config.json
Module weights saved in ./final_adapter/pytorch_model_head.bin


total 5.8M
-rw-r--r-- 1 root root  580 Dec 14 00:09 adapter_config.json
-rw-r--r-- 1 root root  466 Dec 14 00:09 head_config.json
-rw-r--r-- 1 root root 3.5M Dec 14 00:09 pytorch_adapter.bin
-rw-r--r-- 1 root root 2.3M Dec 14 00:09 pytorch_model_head.bin
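
The file sizes above can be roughly reproduced with a back-of-the-envelope parameter count. The sketch below rests on several assumptions that are not stated in this notebook: one bottleneck adapter per layer with a bottleneck size of 48 (hidden size divided by 16), 12 layers for `roberta-base`, a head made of a 768→768 layer followed by a 768→6 layer (the head config reports two layers with bias), 32-bit floats, and roughly 125 million parameters for the full model.

```python
HIDDEN = 768
BOTTLENECK = 48                  # assumption: reduction factor of 16
NUM_LAYERS = 12                  # assumption: one adapter per roberta-base layer
NUM_LABELS = 6
BYTES_PER_PARAM = 4              # assumption: float32 weights
FULL_MODEL_PARAMS = 125_000_000  # approximate size of roberta-base


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# Each adapter has a down-projection and an up-projection.
adapter_params = NUM_LAYERS * (
    linear_params(HIDDEN, BOTTLENECK) + linear_params(BOTTLENECK, HIDDEN)
)
# Assumed head layout: dense layer followed by the output layer.
head_params = linear_params(HIDDEN, HIDDEN) + linear_params(HIDDEN, NUM_LABELS)

adapter_mib = adapter_params * BYTES_PER_PARAM / 2**20
head_mib = head_params * BYTES_PER_PARAM / 2**20
full_mib = FULL_MODEL_PARAMS * BYTES_PER_PARAM / 2**20

print(adapter_params, round(adapter_mib, 2))
print(head_params, round(head_mib, 2))
print(round(full_mib))
```

Under these assumptions the adapter comes to about 3.4 MiB and the head to about 2.3 MiB, close to the 3.5M and 2.3M reported by `ls -lh` (the files also contain some serialization overhead). The full model would need several hundred MiB.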


**Share your work!**

The next step after training is to share our adapter with the world via _AdapterHub_. [Read our guide](https://docs.adapterhub.ml/contributing.html) on how to prepare the adapter module we just saved and contribute it to the Hub!

➡️ Also continue with [the next Colab notebook](https://colab.research.google.com/github/Adapter-Hub/adapter-transformers/blob/master/notebooks/02_Adapter_Inference.ipynb) to learn how to use adapters from the Hub.