# 4️⃣ Zero-Shot Cross-Lingual Transfer using Adapters

Beyond AdapterFusion, which we trained in [the previous notebook](https://github.com/Adapter-Hub/adapter-transformers/blob/master/notebooks/04_Cross_Lingual_Transfer.ipynb), adapters can also be composed for zero-shot cross-lingual transfer, where a task learned in one language is applied to another language without further training. We will use the stacked adapter setup presented in **MAD-X** ([Pfeiffer et al., 2020](https://arxiv.org/pdf/2005.00052.pdf)) for this purpose.

In this example, the base model is a pre-trained multilingual **XLM-R** (`xlm-roberta-base`) ([Conneau et al., 2019](https://arxiv.org/pdf/1911.02116.pdf)) model. Additionally, two types of adapters, language adapters and task adapters, are used. Here's how the MAD-X process works in detail:

1. Train language adapters (`AdapterType.text_lang`) for the source and target language on a language modeling task. In this notebook, we won't train them ourselves but use [pre-trained language adapters from the Hub](https://adapterhub.ml/explore/text_lang/).
2. Train a task adapter (`AdapterType.text_task`) on the target task dataset. This task adapter is **stacked** upon the previously trained language adapter. During this step, only the weights of the task adapter are updated.
3. Perform zero-shot cross-lingual transfer. In this last step, we simply replace the source language adapter with the target language adapter while keeping the stacked task adapter.
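
The three steps above can be illustrated with a toy sketch in plain Python. It does not use `adapter-transformers`: the helpers `make_toy_adapter` and `run_stack` are hypothetical stand-ins, and each "adapter" only wraps a string so that the order of application is visible. The point is that step 3 swaps the language adapter while the task adapter stays in place.

```python
def make_toy_adapter(name):
    """Return a hypothetical stand-in adapter that tags its input with `name`."""
    def apply(hidden):
        return f"{name}({hidden})"
    return apply


def run_stack(adapters, names, hidden="h"):
    """Pass `hidden` through the named adapters in order (bottom of the stack first)."""
    for name in names:
        hidden = adapters[name](hidden)
    return hidden


# Step 1: language adapters exist for both languages; the task adapter is added later.
adapters = {name: make_toy_adapter(name) for name in ("en", "zh", "copa")}

# Step 2: the task adapter is trained while stacked on the source language adapter.
source_output = run_stack(adapters, ["en", "copa"])

# Step 3: the language adapter is swapped, the task adapter is kept.
target_output = run_stack(adapters, ["zh", "copa"])

print(source_output)  # copa(en(h))
print(target_output)  # copa(zh(h))
```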

Now to our concrete example: as our target task, we select **XCOPA** ([Ponti et al., 2020](https://ducdauge.github.io/files/xcopa.pdf)), a multilingual extension of the **COPA** commonsense reasoning dataset ([Roemmele et al., 2011](https://people.ict.usc.edu/~gordon/publications/AAAI-SPRING11A.PDF)). The setup is trained on the original **English** dataset and then transferred to **Chinese**.

## Installation

Besides `adapter-transformers`, we use HuggingFace's `datasets` library for loading the data. So let's install both first:

In [11]:
!pip install -U git+https://github.com/Adapter-Hub/adapter-transformers.git
!pip install -U datasets

## Dataset Preprocessing

We need the English COPA dataset for training our task adapter. It is part of the SuperGLUE benchmark and can be loaded via `datasets` using one line of code:

In [12]:
from datasets import load_dataset

dataset_en = load_dataset("super_glue", "copa")
dataset_en.num_rows

Reusing dataset super_glue (/root/.cache/huggingface/datasets/super_glue/copa/1.0.2/41d9edb3935257e1da4b7ce54cd90df0e8bb255a15e46cfe5cbc7e1c04f177de)


{'test': 500, 'train': 400, 'validation': 100}

Every dataset sample has a premise, a question and two possible answer choices:

In [13]:
dataset_en['train'].features

{'choice1': Value(dtype='string', id=None),
 'choice2': Value(dtype='string', id=None),
 'idx': Value(dtype='int32', id=None),
 'label': ClassLabel(num_classes=2, names=['choice1', 'choice2'], names_file=None, id=None),
 'premise': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None)}

In this example, we model COPA as a multiple-choice task with two choices. Thus, we encode the premise and question as well as both choices as one input to our `xlm-roberta-base` model. Using `dataset.map()`, we can pass the full dataset through the tokenizer in batches:

In [14]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def encode_batch(examples):
  """Encodes a batch of input data using the model tokenizer."""
  all_encoded = {"input_ids": [], "attention_mask": []}
  # Iterate through all examples in this batch
  for premise, question, choice1, choice2 in zip(examples["premise"], examples["question"], examples["choice1"], examples["choice2"]):
    sentences_a = [premise + " " + question for _ in range(2)]
    # Both answer choices are passed in an array according to the format needed for the multiple-choice prediction head
    sentences_b = [choice1, choice2]
    encoded = tokenizer(
        sentences_a,
        sentences_b,
        max_length=60,
        truncation=True,
        padding="max_length",
    )
    all_encoded["input_ids"].append(encoded["input_ids"])
    all_encoded["attention_mask"].append(encoded["attention_mask"])
  return all_encoded

def preprocess_dataset(dataset):
  # Encode the input data
  dataset = dataset.map(encode_batch, batched=True)
  # The transformers model expects the target class column to be named "labels"
  dataset.rename_column_("label", "labels")
  # Only output the columns required by the model
  dataset.set_format(columns=["input_ids", "attention_mask", "labels"])
  return dataset

dataset_en = preprocess_dataset(dataset_en)
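
To make the input format more concrete, here is a minimal sketch of what `encode_batch` hands to the tokenizer for a single example, with the tokenizer call itself left out. The helper `build_choice_pairs` and the sample sentences are illustrative assumptions, not part of the notebook; in COPA the `question` field holds the string `"cause"` or `"effect"`.

```python
def build_choice_pairs(premise, question, choice1, choice2):
    """Pair the shared context (premise + question) with each answer choice."""
    context = premise + " " + question
    return [(context, choice1), (context, choice2)]


# Illustrative COPA-style example (assumed, not taken from the loaded dataset).
pairs = build_choice_pairs(
    premise="The man broke his toe.",
    question="cause",
    choice1="He got a hole in his sock.",
    choice2="He dropped a hammer on his foot.",
)

# Each example yields two (context, choice) text pairs, one per answer choice.
for context, choice in pairs:
    print(repr(context), "|", repr(choice))
```

Since both pairs are encoded with `padding="max_length"` and `max_length=60`, each example ends up as two sequences of 60 token ids, which is the nested layout the multiple-choice head expects.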

## Task Adapter Training

In this section, we will train the task adapter on the English COPA dataset. We use a pre-trained XLM-R model from HuggingFace and instantiate our model using `AutoModelWithHeads`.

In [15]:
from transformers import AutoConfig, AutoModelWithHeads

config = AutoConfig.from_pretrained(
    "xlm-roberta-base",
)
model = AutoModelWithHeads.from_pretrained(
    "xlm-roberta-base",
    config=config,
)

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaModelWithHeads: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing XLMRobertaModelWithHeads from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModelWithHeads from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaModelWithHeads were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['roberta.embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for pr

Now we only need to set up the adapters. As described, we need two language adapters (which are assumed to be pre-trained in this example) and a task adapter (which will be trained in a few moments).

First, we load both the language adapters for our source language English (`"en"`) and our target language Chinese (`"zh"`) from the Hub. Then we add a new task adapter (`"copa"`) for our target task. _Note the different values for AdapterType given as the second parameter!_

Finally, we add a multiple-choice head with the same name as our task adapter on top.

In [None]:
from transformers import AdapterType, AdapterConfig

# Load the language adapters
lang_adapter_config = AdapterConfig.load("pfeiffer", reduction_factor=2)
model.load_adapter("en/wiki@ukp", AdapterType.text_lang, config=lang_adapter_config)
model.load_adapter("zh/wiki@ukp", AdapterType.text_lang, config=lang_adapter_config)

# Add a new task adapter
model.add_adapter("copa", AdapterType.text_task)

# Add a classification head for our target task
model.add_multiple_choice_head("copa", num_choices=2)

We want the task adapter to be stacked on top of the language adapter, so we have to tell our model to use this setup by calling `set_active_adapters()`.

The syntax for the adapter setup works as follows:

- a single string is interpreted as a single adapter
- a list of strings is interpreted as a __stack__ of adapters
- a _nested_ list of strings is interpreted as a __fusion__ of adapters
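
The rules above can be sketched as a small helper that turns a setup into a readable description. This is only one plausible reading, not the library's implementation: the function `summarize_setup` is hypothetical, and it assumes that an inner list with a single name behaves like a plain adapter in the stack, while an inner list with several names denotes a fusion of those adapters.

```python
def summarize_setup(setup):
    """Describe an adapter setup as a chain of layers (hypothetical helper)."""
    # A single string is a single adapter.
    if isinstance(setup, str):
        return setup
    parts = []
    for layer in setup:
        if isinstance(layer, str):
            # A plain string inside the list is one layer of the stack.
            parts.append(layer)
        elif len(layer) == 1:
            # Assumption: a one-element inner list acts like a single adapter.
            parts.append(layer[0])
        else:
            # An inner list with several names is read as a fusion.
            parts.append("fuse(" + ", ".join(layer) + ")")
    return " -> ".join(parts)


print(summarize_setup("copa"))              # copa
print(summarize_setup(["en", "copa"]))      # en -> copa
print(summarize_setup([["en"], ["copa"]]))  # en -> copa
print(summarize_setup([["a", "b"]]))        # fuse(a, b)
```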


In [17]:
# Activate the stacked setup: English language adapter first, COPA task adapter on top
adapter_setup = [
  ["en"],
  ["copa"]
]
model.set_active_adapters(adapter_setup)

Great! Now, the input will be passed through the English language adapter first and the COPA task adapter second in every forward pass.

One final step remains: using `train_adapter()`, we tell our model to train only the task adapter from here on. This call freezes the weights of the pre-trained model and of the language adapters so that they are not fine-tuned any further.

In [18]:
model.train_adapter(["copa"])
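
Conceptually, freezing amounts to deciding per parameter whether it receives gradient updates. The sketch below illustrates that idea in plain Python; the helper `select_trainable` and the parameter names are made up for illustration and do not reflect the actual naming scheme or internals of `adapter-transformers`.

```python
def select_trainable(param_names, adapter_name):
    """Map each parameter name to True if it belongs to the given adapter."""
    marker = f".adapters.{adapter_name}."
    return {name: marker in name for name in param_names}


# Hypothetical parameter names, chosen only to illustrate the idea.
param_names = [
    "encoder.layer.0.attention.weight",           # pre-trained model weight
    "encoder.layer.0.adapters.en.down.weight",    # language adapter weight
    "encoder.layer.0.adapters.copa.down.weight",  # task adapter weight
]

trainable = select_trainable(param_names, "copa")
for name, flag in trainable.items():
    print(f"{name}: {'trainable' if flag else 'frozen'}")
```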

For training, we make use of the `Trainer` class built into `transformers`. We configure the training process using a `TrainingArguments` object.

Because the English COPA dataset is split slightly differently in SuperGLUE, we train on both the train and validation splits of the dataset. Later, we will evaluate on the test split of XCOPA.

In [19]:
from transformers import TrainingArguments, Trainer
from datasets import concatenate_datasets

training_args = TrainingArguments(
    learning_rate=1e-4,
    num_train_epochs=8,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    logging_steps=100,
    output_dir="./training_output",
    overwrite_output_dir=True,
    # The next line is important to ensure the dataset labels are properly passed to the model
    remove_unused_columns=False,
)

train_dataset = concatenate_datasets([dataset_en["train"], dataset_en["validation"]])

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

Start the training 🚀 (this will take a while)

In [20]:
trainer.train()

Step,Training Loss
100,0.696961


TrainOutput(global_step=128, training_loss=0.69627445936203)
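
The reported `global_step=128` can be checked against the training configuration with a few lines of arithmetic. This sketch assumes a single device, no gradient accumulation, and that the last, smaller batch of each epoch is kept.

```python
import math

train_examples = 400 + 100  # train + validation split sizes shown earlier
batch_size = 32             # per_device_train_batch_size
epochs = 8                  # num_train_epochs

# 500 examples do not divide evenly by 32, so the last batch of an epoch is partial.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 16 128
```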

## Cross-Lingual Transfer

With the model and all adapters trained and ready, we can now move on to the cross-lingual transfer step. We will evaluate our setup on the Chinese split of the XCOPA dataset.
To do so, we first download the data and preprocess it in the same way as the English dataset:

In [28]:
dataset_zh = load_dataset("xcopa", "zh", ignore_verifications=True)
dataset_zh = preprocess_dataset(dataset_zh)

Downloading and preparing dataset xcopa/zh (download: 362.10 KiB, generated: 65.80 KiB, post-processed: Unknown size, total: 427.89 KiB) to /root/.cache/huggingface/datasets/xcopa/zh/1.0.0/5cdb49cff11d193f096083f60fa2d0be592f7ef005be61181203c1052e694a54...


Dataset xcopa downloaded and prepared to /root/.cache/huggingface/datasets/xcopa/zh/1.0.0/5cdb49cff11d193f096083f60fa2d0be592f7ef005be61181203c1052e694a54. Subsequent calls will reuse this data.


Next, let's adapt our setup to the new language. We simply replace the English language adapter with the Chinese language adapter we already loaded previously. The task adapter we just trained is kept. Again, we set this architecture using `set_active_adapters()`:

In [29]:
adapter_setup = [
  ["zh"],
  ["copa"]
]
model.set_active_adapters(adapter_setup)

Finally, let's see how well our adapter setup performs on the new language. We measure the zero-shot accuracy on the test split of the target language dataset. Evaluation is also performed using the built-in `Trainer` class.

In [30]:
import numpy as np
from transformers import EvalPrediction

def compute_accuracy(p: EvalPrediction):
  preds = np.argmax(p.predictions, axis=1)
  return {"acc": (preds == p.label_ids).mean()}

eval_trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./eval_output"),
    eval_dataset=dataset_zh["test"],
    compute_metrics=compute_accuracy,
    adapter_names=adapter_setup,
)

eval_trainer.evaluate()

{'eval_acc': 0.558, 'eval_loss': 0.6927390098571777}
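
As a reading aid for these numbers, the following sketch recomputes accuracy from per-choice scores in plain Python, mirroring what `compute_accuracy` does with `np.argmax`. The logits and labels are made-up toy values. For reference, it also prints the cross-entropy of a predictor that assigns equal probability to both choices, which is ln 2.

```python
import math


def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring choice per example and compare with the labels."""
    predictions = [row.index(max(row)) for row in logits]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return correct / len(labels)


# Toy values (assumed): four examples with two choice scores each.
toy_logits = [[0.2, 0.9], [1.5, -0.3], [0.1, 0.4], [0.7, 0.6]]
toy_labels = [1, 0, 0, 0]

print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75

# Cross-entropy of a uniform two-way prediction: -ln(0.5) = ln 2.
chance_loss = math.log(2)
print(round(chance_loss, 4))  # 0.6931
```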

You should get an overall accuracy of about 56%, which is on par with full fine-tuning on COPA only but below the state of the art, which is sequentially fine-tuned on an additional dataset before fine-tuning on COPA.

For results on different languages and a sequential finetuning setup which yields better results, make sure to check out [the MAD-X paper](https://arxiv.org/pdf/2005.00052.pdf).