Assignment 08 (due Nov 25, 8:30 AM):

Train the best model you can for the Genomic Benchmarks "human_enhancers_cohn" dataset. You can start from the basic CNN model in the notebook example, but it is also worth trying transformers, especially ones pretrained on DNA. Use hyperparameter optimization to find the best combination of parameters for your model, and run the final evaluation on the 'test' split of the dataset.

In [11]:
import torch
from transformers import (
    AutoTokenizer,
    BertConfig,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
)

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'



In [11]:
#!pip install einops
#!pip install accelerate

In [10]:
# run this in the terminal

#!git clone https://github.com/openai/triton.git;
#!cd triton/python;
#!pip install cmake; # build-time dependency
#!pip install -e .
#!pip uninstall triton

## Load and preprocess the dataset

In [12]:
from datasets import load_dataset

def preprocess_function(examples):
    # Tokenize the sequences
    tokenized = tokenizer(
        examples["seq"],
        truncation=True 
    )
    return tokenized

dataset = load_dataset("katarinagresova/Genomic_Benchmarks_human_enhancers_cohn")
dataset = dataset.map(preprocess_function, batched=True)

In [13]:
# Hold out 20% of the training split for validation (Dataset.train_test_split from `datasets`)
train_val = dataset['train'].train_test_split(test_size=0.2)

In [14]:
train = train_val['train']
val = train_val['test']
test = dataset['test']

In [15]:
print(len(train))
print(len(val))
print(len(test))

16674
4169
6948
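
The printed sizes are consistent with an 80/20 split of the 20,843 examples in the original 'train' split (16,674 + 4,169). A quick check of that arithmetic, assuming the held-out part is rounded up to a whole example (an assumption about the rounding, chosen because it reproduces the numbers above):

```python
import math

n_total = 16674 + 4169          # size of the original 'train' split, from the output above
test_fraction = 0.2             # test_size passed to train_test_split

# Assumption: the held-out part is rounded up, the rest goes to training.
n_val = math.ceil(n_total * test_fraction)
n_train = n_total - n_val

print(n_total, n_train, n_val)  # 20843 16674 4169
```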


## Hyperparameter optimization

In [16]:
import numpy as np
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

from evaluate import load
accuracy = load("accuracy")

def compute_metrics(eval_pred):
    outputs, labels = eval_pred
    logits, embeddings = outputs 
    # logits.shape = (batch_size, number_of_categories), embeddings.shape = (batch_size, padded_sequence_len, hidden_dim_size)
    # e.g. logits.shape = (64, 2)
    # e.g. embeddings.shape = (64, 110, 768)
    predictions = np.argmax(logits, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
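
`compute_metrics` assumes the model returns a tuple `(logits, embeddings)`. A minimal sketch of a variant that also accepts a bare logits array, which may be what a different model class returns. The helper name `logits_to_accuracy` is hypothetical, the toy numbers are made up, and accuracy is computed with NumPy instead of `evaluate` only to keep the example self-contained:

```python
import numpy as np

def logits_to_accuracy(outputs, labels):
    """Hypothetical helper: accuracy from raw model outputs and integer labels."""
    # Some models return (logits, hidden_states); others return logits only.
    logits = outputs[0] if isinstance(outputs, (tuple, list)) else outputs
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float(np.mean(predictions == np.asarray(labels)))}

# Toy batch: 4 sequences, 2 classes (made-up numbers for illustration).
logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 0.5], [1.2, 1.0]])
embeddings = np.zeros((4, 5, 8))   # stand-in for the hidden states
labels = [0, 1, 0, 0]

print(logits_to_accuracy((logits, embeddings), labels))  # {'accuracy': 0.75}
print(logits_to_accuracy(logits, labels))                # {'accuracy': 0.75}
```

Both calls give the same result because only the logits enter the prediction; the embeddings are ignored.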

In [19]:
import optuna


def objective(trial):

    learning_rate = trial.suggest_float("learning_rate", 1e-5, 5e-5)
    dropout_prob = trial.suggest_float("dropout_prob", 0.05, 0.2)

    config = BertConfig.from_pretrained("zhihan1996/DNABERT-2-117M",
        hidden_dropout_prob=dropout_prob,
        attention_probs_dropout_prob=dropout_prob,
        classifier_dropout=dropout_prob
    )
    
    model = AutoModelForSequenceClassification.from_pretrained(
        "zhihan1996/DNABERT-2-117M", 
        config=config
    ).to(DEVICE)

    training_args = TrainingArguments(
        output_dir=f"./optuna_trial_{trial.number}",
        learning_rate=learning_rate,
        num_train_epochs=2, 
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        weight_decay=0.01,
        eval_strategy="epoch",
        save_strategy="epoch",
        logging_steps=100,
        metric_for_best_model="eval_loss",
        load_best_model_at_end=True
    )
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train,
        eval_dataset=val,
        data_collator=data_collator,
        compute_metrics=compute_metrics
    )
    
    trainer.train()
    eval_results = trainer.evaluate()
    
    return eval_results["eval_accuracy"]


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=3)

[I 2024-11-18 14:34:00,255] A new study created in memory with name: no-name-2d657ab3-502a-4e83-9a74-d767a9d143bd
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at zhihan1996/DNABERT-2-117M and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.5103,0.493278,0.759655
2,0.4394,0.48013,0.770928


[I 2024-11-18 14:37:10,738] Trial 0 finished with value: 0.7709282801631087 and parameters: {'learning_rate': 1.8621576331491283e-05, 'dropout_prob': 0.08705202524443822}. Best is trial 0 with value: 0.7709282801631087.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at zhihan1996/DNABERT-2-117M and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.5143,0.498161,0.756536
2,0.4385,0.48857,0.759894


[I 2024-11-18 14:40:20,932] Trial 1 finished with value: 0.7598944591029023 and parameters: {'learning_rate': 2.8132911935268816e-05, 'dropout_prob': 0.09836394395546255}. Best is trial 0 with value: 0.7709282801631087.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at zhihan1996/DNABERT-2-117M and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.5273,0.496751,0.762533
2,0.4646,0.495299,0.761574


[I 2024-11-18 14:43:32,401] Trial 2 finished with value: 0.7615735188294555 and parameters: {'learning_rate': 3.91635115763713e-05, 'dropout_prob': 0.1589567967599395}. Best is trial 0 with value: 0.7709282801631087.


In [20]:
print("Best trial:")
print("  Value: ", study.best_trial.value)
print("  Params: ")
for key, value in study.best_trial.params.items():
    print(f"    {key}: {value}")

Best trial:
  Value:  0.7709282801631087
  Params: 
    learning_rate: 1.8621576331491283e-05
    dropout_prob: 0.08705202524443822


In [21]:
best_lr = study.best_trial.params["learning_rate"]
best_dropout = study.best_trial.params["dropout_prob"]

## Training the best classifier

In [21]:
config = BertConfig.from_pretrained("zhihan1996/DNABERT-2-117M", 
    hidden_dropout_prob=best_dropout,    
    attention_probs_dropout_prob=best_dropout,
    classifier_dropout=best_dropout
)
model = AutoModelForSequenceClassification.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True, config=config)
model = model.to(DEVICE)

training_args = TrainingArguments(
    output_dir="dnabert_genomic",
    learning_rate=best_lr,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    metric_for_best_model="eval_loss",
    load_best_model_at_end=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train,
    eval_dataset=val,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at zhihan1996/DNABERT-2-117M and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.497413,0.756296
2,0.515600,0.495065,0.751499
3,0.515600,0.489234,0.760134


TrainOutput(global_step=783, training_loss=0.49611403658929, metrics={'train_runtime': 231.5069, 'train_samples_per_second': 216.071, 'train_steps_per_second': 3.382, 'total_flos': 3759760585608600.0, 'train_loss': 0.49611403658929, 'epoch': 3.0})
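
The reported `global_step=783` follows from the split size and the batch size. The `No log` entry in the first epoch is what one would expect if the Trainer's logging interval is left at its default of 500 steps (an assumption here, since `logging_steps` is not set in this cell): the training loss is first recorded during epoch 2 and the same value is shown again for epoch 3. A small sketch of that arithmetic, assuming a single device:

```python
import math

n_train = 16674        # training examples, from the split above
batch_size = 64        # per_device_train_batch_size (single device assumed)
epochs = 3
logging_steps = 500    # assumed Trainer default; not set explicitly in the cell

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

print(steps_per_epoch)      # 261
print(total_steps)          # 783
print(first_logged_epoch)   # 2
```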

## Evaluation on the test set

In [22]:
# trainer.predict returns the raw model outputs; element 0 of `predictions` holds the logits
predictions = trainer.predict(test)
test_preds = np.argmax(predictions.predictions[0], axis=-1)

In [23]:
from sklearn.metrics import classification_report, confusion_matrix

test_labels = test['label']
print("\nClassification Report:")
print(classification_report(test_labels, test_preds))

print("\nConfusion Matrix:")
print(confusion_matrix(test_labels, test_preds))


Classification Report:
              precision    recall  f1-score   support

           0       0.78      0.73      0.75      3474
           1       0.75      0.79      0.77      3474

    accuracy                           0.76      6948
   macro avg       0.76      0.76      0.76      6948
weighted avg       0.76      0.76      0.76      6948


Confusion Matrix:
[[2545  929]
 [ 734 2740]]
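
The numbers in the classification report can be recomputed directly from the confusion matrix, which is a useful sanity check. A minimal sketch in plain Python, reading rows as true classes and columns as predicted classes (the scikit-learn convention):

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class.
cm = [[2545, 929],
      [734, 2740]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

for cls in (0, 1):
    tp = cm[cls][cls]
    precision = tp / (cm[0][cls] + cm[1][cls])   # column sum: everything predicted as cls
    recall = tp / sum(cm[cls])                   # row sum: everything truly cls
    f1 = 2 * precision * recall / (precision + recall)
    print(cls, round(precision, 2), round(recall, 2), round(f1, 2))

print(round(accuracy, 4))
```

The per-class values round to the same precision, recall, and f1-score shown in the report, and the accuracy rounds to 0.76.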
