# PEFT with DNA Language Models

This notebook demonstrates how to use parameter-efficient fine-tuning (PEFT) techniques from the PEFT library to fine-tune a DNA Language Model (DNA-LM). The fine-tuned DNA-LM is then applied to a task from the Nucleotide Transformer downstream tasks benchmark. PEFT techniques are crucial for adapting large pre-trained models to specific tasks with limited computational resources.

In [1]:
! pip install -U accelerate
! pip install -U transformers
! pip install peft
! pip install datasets

Collecting accelerate
  Downloading accelerate-0.32.1-py3-none-any.whl (314 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.w

### 1. Import relevant libraries

We'll start by importing the required libraries, including the PEFT library and other dependencies.

In [2]:
import torch
import transformers
import peft
import tqdm
import numpy as np

No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'


### 2. Load models


We'll load a pre-trained DNA Language Model, "SpeciesLM", that serves as the base for fine-tuning. This is done using the transformers library from HuggingFace.

The tokenizer and the model come from the paper "Species-aware DNA language models capture regulatory elements and their evolution". [Paper Link](https://www.biorxiv.org/content/10.1101/2023.01.26.525670v2), [Code Link](https://github.com/gagneurlab/SpeciesLM). The authors introduce a species-aware DNA language model, which is trained on more than 800 species spanning over 500 million years of evolution.

In [3]:
from transformers import AutoTokenizer, AutoModelForMaskedLM

In [4]:
tokenizer = AutoTokenizer.from_pretrained("gagneurlab/SpeciesLM", revision = "downstream_species_lm")
lm = AutoModelForMaskedLM.from_pretrained("gagneurlab/SpeciesLM", revision = "downstream_species_lm")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/444 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/28.7k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/379k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/68.0k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/62.7k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/795 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/361M [00:00<?, ?B/s]

In [None]:
lm.eval()
lm.to("cuda");

### 3. Prepare datasets

We'll load the `nucleotide_transformer_downstream_tasks` dataset, which contains 18 downstream tasks from the Nucleotide Transformer paper and provides a consistent genomics classification benchmark. The task used in this notebook, H3, is a binary classification task.

In [6]:
from datasets import load_dataset

raw_data = load_dataset("InstaDeepAI/nucleotide_transformer_downstream_tasks", "H3")

Downloading data:   0%|          | 0.00/3.50M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/391k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/13468 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1497 [00:00<?, ? examples/s]

We'll use the "H3" subset of this dataset, which contains 13,468 rows in the training data and 1,497 rows in the test data.

In [7]:
raw_data

DatasetDict({
    train: Dataset({
        features: ['sequence', 'name', 'label'],
        num_rows: 13468
    })
    test: Dataset({
        features: ['sequence', 'name', 'label'],
        num_rows: 1497
    })
})

The dataset consists of three columns: `sequence`, `name` and `label`. A row in this dataset looks like:

In [8]:
raw_data['train'][0]

{'sequence': 'TCACTTCGATTATTGAGGCAGTCTTCATTAAAGTTTATTACAATGGATATGGTATCACCAGTCTTGAACCTACAATCATCTATTTTAGGTGAGCTCGTAGGCATTATTGGAAAAGTGTTCTTTCTCTTAATAGAAGAGATTAAATACCCGATAATCACACCCAAAATTATTGTGGATGCCCAGATATCTTCTTGGTCATTGTTTTTTTTCGCTTCAATCTGTAATCTCTCTGCAAAATTTCGGGAGCCAATAGTGACAACATCGTCAATAATAAGTTTGATGGAATCGGAAAAAGATCTTAAAAATGTAAATGAGTATTTCCAAATAATGGCCAAAATGCTCTTTATATTGGAAAATAAAATAGTTGTTTCGCTCTTCGTAGTATTTAACATTTCCGTTCTTATCATTGTAAAGTCTGAGCCATATTCATATGGAAAAGTGCTTTTTAAACCTAGTTCCTCCATATTTTAGTTTTTTATCGATATTGGAAAAAAAAGAGC',
 'name': 'YBR063C_YBR063C_367930|0',
 'label': 0}

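We split our dataset into training, validation, and test sets. The validation set is a 15% hold-out from the original training split, and the original test split is kept as is.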

In [9]:
from datasets import DatasetDict

# Hold out 15% of the original training split for validation
train_valid_split = raw_data['train'].train_test_split(test_size=0.15, seed=42)

ds = DatasetDict({
    'train': train_valid_split['train'],
    'validation': train_valid_split['test'],
    'test': raw_data['test']
})

Then, we use the tokenizer and a utility function, `get_kmers`, to generate the final inputs and labels. `get_kmers` produces the overlapping 6-mers that the language model expects as input. With k=6 and stride=1, consecutive 6-mers overlap by five nucleotides, so the model sees the local context around every position in the sequence. Each sequence is also prefixed with a species token (`candida_glabrata` here), since SpeciesLM is species-aware.



In [10]:
def get_kmers(seq, k=6, stride=1):
    """Split a sequence into overlapping k-mers, stepping by `stride`."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]
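
To make the windowing concrete, here is a small worked example on a toy 8-nucleotide string (made up for illustration, not taken from the dataset). With stride 1, a sequence of length n yields n - k + 1 k-mers, and joining them with spaces gives the whitespace-separated form that is passed to the tokenizer in the next cell.

```python
def get_kmers(seq, k=6, stride=1):
    # Same windowing as the notebook's helper: every full-length window of size k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

toy_seq = "ACGTACGT"  # toy sequence, not taken from the dataset
kmers = get_kmers(toy_seq)

print(kmers)            # ['ACGTAC', 'CGTACG', 'GTACGT']
print(len(kmers))       # 8 - 6 + 1 = 3
print(" ".join(kmers))  # ACGTAC CGTACG GTACGT
```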

In [11]:
# Optional cap on the number of examples per split, for faster experiments.
# None uses the entire dataset.
dataset_limit = None

def encode_split(split, limit=None):
    """Tokenize each sequence as a species token followed by its 6-mers."""
    encoded = []
    for sequence in list(split['sequence'])[:limit]:
        text = "candida_glabrata " + " ".join(get_kmers(sequence))
        encoded.append(tokenizer(text)["input_ids"])
    return encoded

train_sequences = encode_split(ds['train'], dataset_limit)
val_sequences = encode_split(ds['validation'], dataset_limit)
test_sequences = encode_split(ds['test'], dataset_limit)

train_labels = list(ds['train']['label'])[:dataset_limit]
val_labels = list(ds['validation']['label'])[:dataset_limit]
test_labels = list(ds['test']['label'])[:dataset_limit]

In [12]:
print(len(train_sequences))
print(len(test_sequences))
print(len(val_sequences))

11447
1497
2021
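
These counts follow from the 15% hold-out. A quick check, assuming the split rounds the held-out share up to a whole example and gives the remainder to training (an assumption that reproduces the numbers printed above):

```python
import math

n_total = 13468    # rows in the original H3 training split
test_size = 0.15   # fraction passed to train_test_split

n_val = math.ceil(n_total * test_size)  # 2020.2 rounds up to 2021
n_train = n_total - n_val               # remaining examples

print(n_train, n_val)  # 11447 2021
```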


Finally, we create a Dataset object for each of our sets.

In [None]:
from datasets import Dataset

train_dataset = Dataset.from_dict({"input_ids": train_sequences, "labels": train_labels})
val_dataset = Dataset.from_dict({"input_ids": val_sequences, "labels": val_labels})
test_dataset = Dataset.from_dict({"input_ids": test_sequences, "labels": test_labels})

### 4. Train model

Now, we'll train a classifier on top of our DNA Language Model. We freeze the pre-trained weights, add a linear classification layer on top of the final hidden state of the [CLS] token, and train only this new layer on the training dataset. This serves as our baseline.

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
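
`DataCollatorWithPadding` pads every batch to the length of its longest sequence and builds the matching attention mask. The idea can be sketched in plain Python; `pad_batch` is a hypothetical helper written for illustration, and both the pad id of 0 and the right-side padding are assumptions (in practice the collator takes these from the tokenizer).

```python
def pad_batch(batch, pad_id=0):
    """Right-pad token id lists to the longest one and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Toy token ids, made up for illustration
batch = [[2, 17, 45, 3], [2, 99, 3]]
padded = pad_batch(batch)

print(padded["input_ids"])       # [[2, 17, 45, 3], [2, 99, 3, 0]]
print(padded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding per batch rather than to a global maximum keeps short batches small, which matters when sequence lengths vary.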

In [None]:
# Freeze the pre-trained model weights
for param in lm.parameters():
    param.requires_grad = False

In [None]:
import torch
from torch import nn

class DNA_LM(nn.Module):
    def __init__(self, model, num_labels):
        super(DNA_LM, self).__init__()
        self.model = model.bert
        self.in_features = model.config.hidden_size
        self.out_features = num_labels
        self.classifier = nn.Linear(self.in_features, self.out_features)

    def forward(self, input_ids, attention_mask=None, labels=None):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, output_hidden_states=True)
        sequence_output = outputs.hidden_states[-1]
        # Use the [CLS] token for classification
        cls_output = sequence_output[:, 0, :]
        logits = self.classifier(cls_output)

        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.out_features), labels.view(-1))

        return (loss, logits) if loss is not None else logits

# Number of classes for your classification task
num_labels = 2
classification_model = DNA_LM(lm, num_labels)
classification_model.to('cuda');
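
The classification head can be traced on random arrays. The NumPy sketch below mirrors the steps in `forward` (take the first token's hidden state, apply a linear layer, compute cross-entropy) under toy shapes. The random values stand in for real encoder outputs, and the hidden size of 768 is an assumption based on a BERT-base-sized model; with it, the head adds only 768 × 2 + 2 = 1,538 parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, hidden_size, num_labels = 4, 10, 768, 2

# Stand-in for the encoder's last hidden states: (batch, tokens, hidden)
sequence_output = rng.normal(size=(batch_size, seq_len, hidden_size))
labels = np.array([0, 1, 1, 0])

# Linear head: weight (num_labels, hidden) plus bias (num_labels,)
W = rng.normal(scale=0.02, size=(num_labels, hidden_size))
b = np.zeros(num_labels)

cls_output = sequence_output[:, 0, :]  # first token of every sequence
logits = cls_output @ W.T + b          # (batch, num_labels)

# Cross-entropy: mean negative log-probability of the true class
shifted = logits - logits.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(batch_size), labels].mean()

print(logits.shape, W.size + b.size)  # (4, 2) 1538
```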

In [None]:
from transformers import Trainer, TrainingArguments


# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy="epoch",
    learning_rate=0.001,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=5,
    weight_decay=0.01,
    eval_steps=1,
    logging_steps=1,
)

# Initialize Trainer
trainer = Trainer(
    model=classification_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Train the model
trainer.train()

Epoch,Training Loss,Validation Loss
1,0.4391,0.490557
2,0.4467,0.468887
3,0.4677,0.455574
4,0.534,0.451796
5,0.4478,0.451544


TrainOutput(global_step=895, training_loss=0.47818122949014163, metrics={'train_runtime': 399.5628, 'train_samples_per_second': 143.244, 'train_steps_per_second': 2.24, 'total_flos': 0.0, 'train_loss': 0.47818122949014163, 'epoch': 5.0})
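
The `global_step` and throughput in this output can be reproduced from the dataset size and batch size. A quick check, assuming a single device and no gradient accumulation (neither is set in the arguments above):

```python
import math

n_train = 11447           # training examples after the split
batch_size = 64           # per_device_train_batch_size
epochs = 5                # num_train_epochs
train_runtime = 399.5628  # seconds, from the TrainOutput above

steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch is partial
global_step = steps_per_epoch * epochs
samples_per_second = n_train * epochs / train_runtime

print(steps_per_epoch, global_step)  # 179 895
print(round(samples_per_second, 3))  # 143.244
```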

### 5. Evaluation

In [None]:
import sklearn
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report, roc_auc_score, roc_curve, auc


def sklearn_eval(test_labels, predicted_labels):
    accuracy = accuracy_score(test_labels, predicted_labels)
    print(f"Accuracy: {accuracy:.4f}")

    precision, recall, f1, _ = precision_recall_fscore_support(test_labels, predicted_labels, average='weighted')
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")

    roc_score = roc_auc_score(test_labels, predicted_labels)
    print(f"AUC ROC score: {roc_score:.4f}")

    target_names = ['Class 0', 'Class 1']
    print(classification_report(test_labels, predicted_labels, target_names=target_names))

In [None]:
# Generate predictions

predictions = trainer.predict(test_dataset)
logits = predictions.predictions
predicted_labels = logits.argmax(axis=-1)
print(predicted_labels)

[1 1 1 ... 1 1 1]


Then, we create a function to calculate the accuracy from the test and predicted labels.

In [None]:
def calculate_accuracy(true_labels, predicted_labels):
    # Convert to arrays so the comparison is elementwise even for Python lists
    true_labels = np.asarray(true_labels)
    predicted_labels = np.asarray(predicted_labels)
    assert len(true_labels) == len(predicted_labels), "Arrays must have the same length"

    return np.mean(true_labels == predicted_labels)

accuracy = calculate_accuracy(test_labels, predicted_labels)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.82


In [None]:
sklearn_eval(test_labels, predicted_labels)

Accuracy: 0.8216
Precision: 0.8219
Recall: 0.8216
F1-Score: 0.8215
AUC ROC score: 0.8211
              precision    recall  f1-score   support

     Class 0       0.83      0.80      0.81       730
     Class 1       0.82      0.84      0.83       767

    accuracy                           0.82      1497
   macro avg       0.82      0.82      0.82      1497
weighted avg       0.82      0.82      0.82      1497
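
One caveat about the AUC figure: `sklearn_eval` passes hard class predictions to `roc_auc_score` rather than scores or probabilities. With hard binary predictions the ROC curve has a single operating point, and the AUC reduces to the mean of the two per-class recalls, which is why it sits so close to the accuracy here (recalls of 0.80 and 0.84 average to about 0.82). The plain-Python sketch below checks that identity on made-up toy labels; `pairwise_auc` is a hypothetical helper that implements the rank-based definition of AUC.

```python
def pairwise_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and hard predictions, made up for illustration
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Recall of the positive class (TPR) and of the negative class (TNR)
tpr = sum(p == 1 for y, p in zip(y_true, y_pred) if y == 1) / y_true.count(1)
tnr = sum(p == 0 for y, p in zip(y_true, y_pred) if y == 0) / y_true.count(0)

auc_from_labels = pairwise_auc(y_true, y_pred)
mean_recall = (tpr + tnr) / 2
print(abs(auc_from_labels - mean_recall) < 1e-12)  # True
```

Passing the positive-class logit or probability instead of the argmax would give a threshold-independent AUC.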



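An accuracy of 0.82 leaves room for improvement. Note that only the linear classification head was trained here, on top of frozen language model weights, and that the training set is fairly small.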

### 7. Parameter Efficient Fine-Tuning Techniques

In this section, we demonstrate how to employ parameter-efficient fine-tuning (PEFT) techniques to adapt a pre-trained model for specific genomics tasks using the PEFT library.

The `LoraConfig` object is instantiated to configure the PEFT parameters (`task_type` is not set in this example):

- use_dora: Enables DoRA (Weight-Decomposed Low-Rank Adaptation), a LoRA variant that splits each adapted weight into a magnitude and a direction and applies the low-rank update to the direction.
- r: The rank of the LoRA matrices.
- lora_alpha: Scaling factor for the low-rank update, which is multiplied by lora_alpha / r.
- target_modules: Modules within the model to apply PEFT re-parameterization (query, key, value in this example).
- lora_dropout: Dropout rate applied to the LoRA layers during PEFT fine-tuning.
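
To see what `r` and `lora_alpha` control, here is a NumPy sketch of the plain LoRA update for a single linear layer: the frozen weight `W` is augmented by a low-rank product scaled by `lora_alpha / r`. The toy shapes and random values are assumptions for illustration, and the DoRA magnitude/direction decomposition and dropout are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, lora_alpha = 768, 768, 32, 32

W = rng.normal(scale=0.02, size=(d_out, d_in))  # frozen pre-trained weight
A = rng.normal(scale=0.02, size=(r, d_in))      # trainable, random init
B = np.zeros((d_out, r))                        # trainable, zero init
scaling = lora_alpha / r

x = rng.normal(size=(d_in,))

def lora_forward(x, W, A, B, scaling):
    # Base projection plus the scaled low-rank correction B @ (A @ x)
    return W @ x + scaling * (B @ (A @ x))

# With B at zero the adapted layer reproduces the frozen layer exactly
unchanged = np.allclose(lora_forward(x, W, A, B, scaling), W @ x)
print(unchanged)  # True

# Trainable parameters for this one layer versus the full weight matrix
print(A.size + B.size, W.size)  # 49152 589824
```

Raising `r` increases capacity and parameter count linearly, while `lora_alpha` only changes how strongly the learned update is weighted.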

In [None]:
# Number of classes for your classification task
num_labels = 2
classification_model = DNA_LM(lm, num_labels)
classification_model.to('cuda');

In [None]:
from peft import LoraConfig, TaskType

peft_config = LoraConfig(
    use_dora=True,
    r=32,
    lora_alpha=32,
    target_modules=["query", "key", "value"],
    lora_dropout=0.01,
)

In [None]:
from peft import get_peft_model

peft_model = get_peft_model(classification_model, peft_config)
peft_model.print_trainable_parameters()

trainable params: 1,797,120 || all params: 91,476,482 || trainable%: 1.9646
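
The printed counts can be reproduced by hand. The arithmetic below assumes a BERT-base layout (12 layers, hidden size 768, a vocabulary of 5,504 and 512 positions as in the module printout that follows, an intermediate size of 3,072, and no pooler), LoRA matrices of rank 32 on query, key and value, and one DoRA magnitude vector of length 768 per adapted layer. The intermediate size, the absent pooler, and the DoRA magnitude shape are assumptions, although together they match the reported totals.

```python
hidden, layers, vocab, positions, types, ffn = 768, 12, 5504, 512, 2, 3072
rank, num_labels = 32, 2

def linear(n_in, n_out):
    return n_in * n_out + n_out  # weights plus bias

layer_norm = 2 * hidden  # scale and shift
embeddings = (vocab + positions + types) * hidden + layer_norm
per_layer = (
    4 * linear(hidden, hidden)  # query, key, value, attention output
    + layer_norm
    + linear(hidden, ffn)       # feed-forward up-projection
    + linear(ffn, hidden)       # feed-forward down-projection
    + layer_norm
)
backbone = embeddings + layers * per_layer
classifier = linear(hidden, num_labels)

adapted = 3 * layers  # query, key, value in each layer
lora = adapted * (rank * hidden + hidden * rank)
dora_magnitude = adapted * hidden
trainable = lora + dora_magnitude
total = backbone + classifier + trainable

print(trainable)                          # 1797120
print(total)                              # 91476482
print(round(100 * trainable / total, 4))  # 1.9646
```

Under these assumptions, the 1,538 classifier parameters are counted in the total but not in the trainable figure, which suggests that only the adapter weights are updated in this configuration.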


In [None]:
peft_model.to('cuda')
peft_model

PeftModel(
  (base_model): LoraModel(
    (model): DNA_LM(
      (model): BertModel(
        (embeddings): BertEmbeddings(
          (word_embeddings): Embedding(5504, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (token_type_embeddings): Embedding(2, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (encoder): BertEncoder(
          (layer): ModuleList(
            (0-11): 12 x BertLayer(
              (attention): BertAttention(
                (self): BertSdpaSelfAttention(
                  (query): lora.Linear(
                    (base_layer): Linear(in_features=768, out_features=768, bias=True)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.01, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=768, out_features=32, bia

In [None]:
from transformers import Trainer, TrainingArguments


# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy="epoch",
    learning_rate=0.001,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=5,
    weight_decay=0.01,
    eval_steps=1,
    logging_steps=1,
)

# Initialize Trainer
trainer = Trainer(
    model=peft_model.model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Train the model
trainer.train()

Epoch,Training Loss,Validation Loss
1,0.3475,0.311613
2,0.1896,0.294376
3,0.246,0.285091
4,0.3309,0.278897
5,0.2777,0.290264


TrainOutput(global_step=895, training_loss=0.26418064777328315, metrics={'train_runtime': 1016.5685, 'train_samples_per_second': 56.302, 'train_steps_per_second': 0.88, 'total_flos': 0.0, 'train_loss': 0.26418064777328315, 'epoch': 5.0})




### 8. Evaluate PEFT Model

In [None]:
# Generate predictions

predictions = trainer.predict(test_dataset)
logits = predictions.predictions
predicted_labels = logits.argmax(axis=-1)
print(predicted_labels)

[1 1 1 ... 1 1 1]


In [None]:
accuracy = calculate_accuracy(test_labels, predicted_labels)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.89


In [None]:
sklearn_eval(test_labels, predicted_labels)

Accuracy: 0.8911
Precision: 0.8912
Recall: 0.8911
F1-Score: 0.8911
AUC ROC score: 0.8911
              precision    recall  f1-score   support

     Class 0       0.89      0.89      0.89       730
     Class 1       0.90      0.89      0.89       767

    accuracy                           0.89      1497
   macro avg       0.89      0.89      0.89      1497
weighted avg       0.89      0.89      0.89      1497



As we can see, the PEFT model outperforms the baseline with a frozen backbone (test accuracy of 0.89 versus 0.82), demonstrating the effectiveness of PEFT in adapting pre-trained models to specific tasks with limited computational resources.

With PEFT, we only train 1,797,120 parameters, which is about 1.96% of the total parameters in the model. This is a significant reduction in computational resources compared to fine-tuning all of the model's parameters.

We can improve the results by using a larger dataset, fine-tuning the model for more epochs or changing the hyperparameters (rank, learning rate, etc.).


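### 9. IA3

IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations) is another PEFT method. Instead of low-rank matrices, it learns one scaling vector per adapted layer that rescales that layer's activations, so it adds even fewer trainable parameters than LoRA.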

In [None]:
from peft import IA3Config, TaskType, get_peft_model

# No target modules are given, so PEFT's defaults for this architecture are used
config = IA3Config(
    task_type=TaskType.FEATURE_EXTRACTION,
)

# Inject the IA3 scaling vectors into the language model, then add the classification head
ia3_model = get_peft_model(lm, config)
ia3_model = DNA_LM(ia3_model, num_labels)
ia3_model.to('cuda')

DNA_LM(
  (model): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(5504, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(
                (base_layer): Linear(in_features=768, out_features=768, bias=True)
                (ia3_l): ParameterDict(  (default): Parameter containing: [torch.cuda.FloatTensor of size 768x1 (cuda:0)])
              )
              (value): Linear(
                (base_layer): Linear(in_features=768, out_features=768, bias=True)
                (ia3_l): ParameterDict(  (
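
The `ia3_l` entries in this printout are the parameters IA3 introduces: one learned vector per adapted layer that rescales the layer's output elementwise. Below is a NumPy sketch of that rescaling for a single key or value projection. Toy random values stand in for real weights, and the vector is initialised to ones (an assumption about the starting point, under which the adapted layer initially matches the frozen one).

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 768, 768

W = rng.normal(scale=0.02, size=(d_out, d_in))  # frozen projection weight
b = np.zeros(d_out)                             # frozen bias
ia3_l = np.ones(d_out)                          # learned scaling vector

x = rng.normal(size=(d_in,))

def ia3_forward(x, W, b, ia3_l):
    # Frozen linear projection followed by elementwise rescaling
    return (W @ x + b) * ia3_l

# A vector of ones leaves the frozen layer's output unchanged
unchanged = np.allclose(ia3_forward(x, W, b, ia3_l), W @ x + b)
print(unchanged)  # True

# Each adapted layer adds d_out parameters, against 2 * 32 * 768 for rank-32 LoRA
print(ia3_l.size, 2 * 32 * 768)  # 768 49152
```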

In [None]:
from transformers import Trainer, TrainingArguments


# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy="epoch",
    learning_rate=0.001,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=5,
    weight_decay=0.01,
    eval_steps=1,
    logging_steps=1,
)

# Initialize Trainer
trainer = Trainer(
    model=ia3_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Train the model
trainer.train()

Epoch,Training Loss,Validation Loss
1,0.3088,0.343893
2,0.2546,0.31708
3,0.3085,0.31182
4,0.4203,0.305007
5,0.2489,0.303707


TrainOutput(global_step=895, training_loss=0.3415597979582888, metrics={'train_runtime': 805.3498, 'train_samples_per_second': 71.068, 'train_steps_per_second': 1.111, 'total_flos': 0.0, 'train_loss': 0.3415597979582888, 'epoch': 5.0})

In [None]:
# Generate predictions

predictions = trainer.predict(test_dataset)
logits = predictions.predictions
predicted_labels = logits.argmax(axis=-1)
print(predicted_labels)

[1 1 1 ... 1 1 1]


In [None]:
sklearn_eval(test_labels, predicted_labels)

Accuracy: 0.8684
Precision: 0.8684
Recall: 0.8684
F1-Score: 0.8684
AUC ROC score: 0.8682
              precision    recall  f1-score   support

     Class 0       0.87      0.86      0.86       730
     Class 1       0.87      0.87      0.87       767

    accuracy                           0.87      1497
   macro avg       0.87      0.87      0.87      1497
weighted avg       0.87      0.87      0.87      1497

