<a href="https://colab.research.google.com/github/virginiakm1988/Easy-Adapter/blob/main/Parameter_efficient_fine_tuning_in_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Parameter-efficient Fine-tuning in NLP

This notebook demonstrates how to fine-tune a BERT model from the Hugging Face Transformers library using adapters. Adapters are a parameter-efficient way to adapt a pre-trained language model to a specific NLP task: the pre-trained weights stay frozen and only a small set of added parameters is trained.

The code was written by [Zih-Ching Chen](https://github.com/virginiakm1988).

If you use this code in your research, please cite the following papers:

```
@article{fu2022adapterbias,
  title={AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks},
  author={Fu, Chin-Lun and Chen, Zih-Ching and Lee, Yun-Ru and Lee, Hung-yi},
  journal={arXiv preprint arXiv:2205.00305},
  year={2022}
}

@inproceedings{chen2023exploring,
  title={Exploring efficient-tuning methods in self-supervised speech models},
  author={Chen, Zih-Ching and Fu, Chin-Lun and Liu, Chih-Ying and Li, Shang-Wen Daniel and Lee, Hung-yi},
  booktitle={2022 IEEE Spoken Language Technology Workshop (SLT)},
  pages={1120--1127},
  year={2023},
  organization={IEEE}
}
```

The example below uses BERT, but the same approach can be adapted to other pre-trained models and NLP tasks.

## Setup Instructions

Before running the code, please follow these setup instructions:

1. Install the necessary packages by running the following command: 

   ```
   ! pip install transformers datasets
   ! pip install loralib
   ```

2. Check that a compatible GPU is available by running the following command in your terminal (prefix it with `!` when running it inside a notebook cell):

   ```
   nvidia-smi
   ```


Once you have completed these setup instructions, you are ready to run the code.

In [None]:
! pip install transformers datasets
! pip install loralib
! nvidia-smi

## Define custom adapter modules
Here we implement `HoulsbyAdapter`, `ConvAdapter`, `AdapterBias`, and `LoRA`, and show how to enable BitFit.

1. Houlsby Adapter ([Parameter-Efficient Transfer Learning for NLP](http://proceedings.mlr.press/v97/houlsby19a.html))
2. ConvAdapter ([CHAPTER: Exploiting Convolutional Neural Network Adapters for Self-supervised Speech Models](https://arxiv.org/abs/2212.01282))
3. AdapterBias ([AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks](https://arxiv.org/abs/2205.00305))
4. LoRA ([LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685))
5. BitFit ([BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models](https://arxiv.org/abs/2106.10199))

BitFit can be enabled with the following setting, which additionally marks every bias term as trainable:
```
mark_only_adapter_as_trainable(model_bert,bias="all")
```
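
To make the effect of bias-only training concrete, here is a minimal sketch on a toy model. The helper `bitfit_freeze` and the toy network are hypothetical and exist only for illustration; in this notebook, `bias="all"` also keeps the adapter parameters trainable, whereas the sketch isolates the bias part.

```python
import torch
from torch import nn

def bitfit_freeze(model: nn.Module) -> None:
    """Hypothetical helper: train only parameters whose name contains 'bias'."""
    for name, param in model.named_parameters():
        param.requires_grad = "bias" in name

# Toy stand-in for a pre-trained network (not BERT)
toy = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))
bitfit_freeze(toy)

trainable = [n for n, p in toy.named_parameters() if p.requires_grad]
n_trainable = sum(p.numel() for p in toy.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in toy.parameters())
print(trainable, f"{n_trainable}/{n_total} parameters trainable")
```

Only the two bias vectors remain trainable, which is why this setting touches such a small fraction of the model.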



In [5]:
import torch
from torch import nn
import loralib as lora  # provides lora.Linear, used by the LoRA adapter below
  
## Houlsby adapter
class HoulsbyAdapter(nn.Module):
    def __init__(
            self,
            input_size,
            bottleneck = 32
        ):
        super().__init__()

        self.houlsby_adapter = nn.Sequential(
          nn.Linear(input_size, bottleneck),
          nn.GELU(),
          nn.Linear(bottleneck, input_size),
      )
    def forward(self, x):
        return self.houlsby_adapter(x)

## conv adapter
class ConvAdapter(nn.Module):
    def __init__(
            self,
            input_size,
            compress_rate = 8,
            k = 1, 
            stride = 1,
            dropout = 0.8
        ):
        super().__init__()
        def depthwise_conv(n_in, n_out, compress_rate, k, stride):
          conv = nn.Conv1d(n_in, n_out//compress_rate, k, stride = stride)
          nn.init.kaiming_normal_(conv.weight)
          return conv
        def pointwise_conv(n_in, n_out, compress_rate, k, stride):
          conv = nn.Conv1d(n_out//compress_rate,n_out, 1)
          nn.init.kaiming_normal_(conv.weight)
          return conv
        self.conv_adapter = nn.Sequential(
        depthwise_conv(input_size, input_size, compress_rate,k ,stride),
        pointwise_conv(input_size, input_size, compress_rate,k ,stride),
        nn.Dropout(p=dropout),
        nn.GELU()
      )
    def forward(self, x):
        return self.conv_adapter(x)

## adapterBias
class AdapterBias(nn.Module):
    def __init__(
            self,
            input_size,
            dropout = 0.8
        ):
        super().__init__()
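        # Shift vector shared by all tokens
        self.adapter_vector = nn.Parameter(torch.ones(input_size))
        # Linear layer that produces one scalar weight per token
        self.adapter_alpha = nn.Linear(input_size, 1)
        
    def forward(self, x):
        # Token-dependent shift: the per-token scalar broadcasts over the shared vector
        return self.adapter_vector * self.adapter_alpha(x)

## LoRA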
class LoRA(nn.Module):
    def __init__(
            self,
            input_size,
            dropout = 0.8,
            r = 16
        ):
        super().__init__()
        self.lora_adapter = lora.Linear(input_size, input_size, r)
        
    def forward(self, x):
        return self.lora_adapter(x)
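
As a rough sanity check on how small these modules are, the following sketch counts the parameters each adapter adds per layer. It assumes a hidden size of 768 (as in `bert-base`) together with the defaults used above (`bottleneck = 32`, `r = 16`). The helper names are hypothetical, and the LoRA figure counts only the two low-rank factors, not the dense weight that `lora.Linear` also holds.

```python
def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Number of parameters in a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

def houlsby_params(hidden: int, bottleneck: int = 32) -> int:
    # Down-projection followed by up-projection, both with bias
    return linear_params(hidden, bottleneck) + linear_params(bottleneck, hidden)

def adapter_bias_params(hidden: int) -> int:
    # One shared shift vector plus a linear layer with a single output
    return hidden + linear_params(hidden, 1)

def lora_factor_params(hidden: int, r: int = 16) -> int:
    # Two low-rank factors: A with shape (r, hidden) and B with shape (hidden, r)
    return 2 * r * hidden

hidden = 768  # assumed hidden size
counts = {
    "Houlsby": houlsby_params(hidden),
    "AdapterBias": adapter_bias_params(hidden),
    "LoRA (low-rank factors only)": lora_factor_params(hidden),
}
for name, count in counts.items():
    print(f"{name}: {count:,} parameters per layer")
```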

### Utils
These helpers are adapted from the `loralib` source code.

In [6]:
import torch
from typing import Dict

def mark_only_adapter_as_trainable(model: nn.Module, bias: str = 'none') -> None:
    # Train only parameters whose name contains 'adapter'; freeze everything else
    for n, p in model.named_parameters():
        p.requires_grad = 'adapter' in n
    if bias == 'none':
        return
    elif bias == 'all':
        # BitFit-style setting: additionally train every bias term
        for n, p in model.named_parameters():
            if 'bias' in n:
                p.requires_grad = True
    else:
        raise NotImplementedError


def adapter_state_dict(model: nn.Module, bias: str = 'none') -> Dict[str, torch.Tensor]:
    my_state_dict = model.state_dict()
    if bias == 'none':
        return {k: my_state_dict[k] for k in my_state_dict if 'adapter' in k}
    elif bias == 'all':
        return {k: my_state_dict[k] for k in my_state_dict if 'adapter' in k or 'bias' in k}
    else:
        raise NotImplementedError
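
Both helpers select parameters purely by name. The sketch below illustrates that kind of name filter on a handful of key names. The function `select_adapter_keys` is hypothetical, and the keys are illustrative examples of the names a wrapped BERT layer might expose, not output copied from a real model.

```python
def select_adapter_keys(keys, bias="none"):
    """Hypothetical helper: pick state-dict keys by substring match."""
    if bias == "none":
        return [k for k in keys if "adapter" in k]
    if bias == "all":
        return [k for k in keys if "adapter" in k or "bias" in k]
    raise NotImplementedError(bias)

# Illustrative key names (assumed, for demonstration only)
keys = [
    "bert.encoder.layer.0.output.dense.weight",
    "bert.encoder.layer.0.output.dense.bias",
    "bert.encoder.layer.0.output.adapter.houlsby_adapter.0.weight",
    "bert.encoder.layer.0.output.adapter.houlsby_adapter.0.bias",
]

adapter_only = select_adapter_keys(keys, bias="none")
adapter_and_bias = select_adapter_keys(keys, bias="all")
print(len(adapter_only), len(adapter_and_bias))
```

Because matching is by substring, any module you want treated as an adapter needs `adapter` somewhere in its attribute path.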

## Adding adapters to a Hugging Face model

In [7]:
from transformers import AutoModelForSequenceClassification

from torch import nn
import torch

BertLayerNorm = torch.nn.LayerNorm

## Vanilla Houlsby-style residual adapter, wrapping the original BertOutput layer
class adapted_bert_output(nn.Module):
  def __init__(self, BertOutput, config):
    super().__init__()
    self.config = config
    # Reuse the pre-trained sub-modules so their weights are kept rather than re-initialized
    self.dense = BertOutput.dense
    self.LayerNorm = BertOutput.LayerNorm
    self.dropout = BertOutput.dropout

    if config.adapter == "houlsby":
      self.adapter = HoulsbyAdapter(config.hidden_size)
    elif config.adapter == "conv_adapter":
      self.adapter = ConvAdapter(config.max_position_embeddings)
    elif config.adapter == "AdapterBias":
      self.adapter = AdapterBias(config.hidden_size)
    elif config.adapter == "lora":
      self.adapter = LoRA(config.hidden_size)
    else:
      raise NotImplementedError

  def forward(self, hidden_states, input_tensor):
    hidden_states = self.dense(hidden_states)
    if self.config.adapter is not None:
      adapter_output = self.adapter(hidden_states)
      hidden_states = self.dropout(hidden_states) + adapter_output
    else:
      hidden_states = self.dropout(hidden_states)
    hidden_states = self.LayerNorm(hidden_states + input_tensor)
  
    return hidden_states


model_bert = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model_bert.config.adapter = "houlsby"

#add adapter module in a bert model
for idx, layer in enumerate(model_bert.bert.encoder.layer):
  model_bert.bert.encoder.layer[idx].output = adapted_bert_output(model_bert.bert.encoder.layer[idx].output, model_bert.config)

#freeze parameters
mark_only_adapter_as_trainable(model_bert)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at
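
After freezing, it is worth checking how many parameters will actually be trained. Note that name-based freezing also freezes every parameter whose name lacks `adapter`, which in `BertForSequenceClassification` includes the newly initialized `classifier` head. The sketch below uses a toy model and two hypothetical helpers, `count_parameters` and `unfreeze_by_prefix`, to show one possible way to inspect the counts and re-enable the head if that is what you want.

```python
from torch import nn

def count_parameters(model: nn.Module):
    """Return (trainable, total) parameter counts."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

def unfreeze_by_prefix(model: nn.Module, prefix: str) -> None:
    """Hypothetical helper: re-enable gradients for parameters under a name prefix."""
    for name, param in model.named_parameters():
        if name.startswith(prefix):
            param.requires_grad = True

class ToyClassifier(nn.Module):
    """Toy stand-in with an encoder and a classification head (not BERT)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)
        self.classifier = nn.Linear(8, 2)

    def forward(self, x):
        return self.classifier(self.encoder(x))

toy = ToyClassifier()
for p in toy.parameters():
    p.requires_grad = False          # everything frozen
before = count_parameters(toy)

unfreeze_by_prefix(toy, "classifier")  # train the head as well
after = count_parameters(toy)
print(before, after)
```

On the real model, the same idea would be `count_parameters(model_bert)` and, optionally, `unfreeze_by_prefix(model_bert, "classifier")`.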

## Loading datasets

In [8]:
from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("imdb")

# Use the tokenizer that matches the model checkpoint loaded above
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
full_train_dataset = tokenized_datasets["train"]
full_eval_dataset = tokenized_datasets["test"]
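
The `padding="max_length", truncation=True` options give every example the same length, which matters for modules such as `ConvAdapter` that are built for a fixed sequence length. The following is a deliberately simplified sketch of that idea on a plain list of token ids. The function `pad_or_truncate` is hypothetical, ignores special tokens, and is not how the tokenizer is implemented.

```python
from typing import List, Tuple

def pad_or_truncate(
    token_ids: List[int], max_length: int, pad_id: int = 0
) -> Tuple[List[int], List[int]]:
    """Cut or pad a sequence to exactly max_length and build its attention mask."""
    ids = token_ids[:max_length]             # truncation
    attention_mask = [1] * len(ids)          # 1 marks real tokens
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, attention_mask + [0] * n_pad

short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6], max_length=4)
print(short_ids, short_mask)
print(long_ids, long_mask)
```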



## Start training

In [None]:
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(output_dir="test-trainer", per_device_train_batch_size=4)
trainer = Trainer(
    model=model_bert,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
)
trainer.train() 
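
The `Trainer` above reports only the loss. If you also want accuracy on the evaluation set, one option is to pass a `compute_metrics` function to `Trainer`. The sketch below is an assumption about how you might write it for this binary classification task, tested here on made-up logits rather than real model output.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Accuracy from (logits, labels), the pair the Trainer passes in."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}

# Made-up logits and labels, for illustration only
example_logits = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
example_labels = np.array([1, 0, 0, 0])
metrics = compute_metrics((example_logits, example_labels))
print(metrics)
```

It would be wired in as `Trainer(..., compute_metrics=compute_metrics)` and evaluated with `trainer.evaluate()`.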



## Saving adapter checkpoints


In [None]:
# Save only the adapter parameters rather than the full model
checkpoint_path = "./result"
torch.save(adapter_state_dict(model_bert), checkpoint_path)
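
To reuse a saved adapter, the checkpoint can be loaded back into a model that already holds the pre-trained weights. Here is a minimal sketch of that round trip on a toy module, assuming the checkpoint contains only adapter keys. `ToyBlock` and the temporary file path are hypothetical, and `strict=False` lets `load_state_dict` accept the missing non-adapter keys.

```python
import os
import tempfile

import torch
from torch import nn

class ToyBlock(nn.Module):
    """Toy stand-in: a frozen dense layer plus a trainable adapter."""
    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(4, 4)
        self.adapter = nn.Linear(4, 4)

    def forward(self, x):
        return self.dense(x) + self.adapter(x)

trained = ToyBlock()
adapter_weights = {k: v for k, v in trained.state_dict().items() if "adapter" in k}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "adapter.pt")
    torch.save(adapter_weights, path)                 # adapter-only checkpoint
    restored = torch.load(path, map_location="cpu")

fresh = ToyBlock()
# strict=False: keys absent from the checkpoint keep their current values
result = fresh.load_state_dict(restored, strict=False)
print("missing keys:", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)
```

For the notebook's model, the analogous call would be `model_bert.load_state_dict(torch.load(checkpoint_path), strict=False)` after the adapters have been added.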