Assignment 08 (due Nov 25, 8:30 AM):

Try to train the best possible model for the Genomic Benchmarks "human_enhancers_cohn" dataset. You can use the basic CNN model from the notebook example, but it might also be nice to try some transformers, especially ones pretrained on DNA. Use hyperparameter optimization to find the best combination of parameters for your model. Do the final evaluation on the 'test' split of the dataset.

In [2]:
import torch
from transformers import AutoTokenizer, AutoModel, BertConfig, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'



In [3]:
#!pip install einops
#!pip install accelerate



In [5]:
# run this in the terminal

#!git clone https://github.com/openai/triton.git;
#!cd triton/python;
#!pip install cmake; # build-time dependency
#!pip install -e .
#!pip uninstall triton

## Load and preprocess the dataset

In [6]:
from datasets import load_dataset

def preprocess_function(examples):
    # Tokenize the sequences
    tokenized = tokenizer(
        examples["seq"],
        truncation=True 
    )
    return tokenized

dataset = load_dataset("katarinagresova/Genomic_Benchmarks_human_enhancers_cohn")
dataset = dataset.map(preprocess_function, batched=True)


In [7]:
# Hold out 20% of the original training split for validation,
# using the Dataset.train_test_split method from the datasets library.
train_val = dataset['train'].train_test_split(test_size=0.2)

In [8]:
train = train_val['train']
val = train_val['test']
test = dataset['test']

In [9]:
print(len(train))
print(len(val))
print(len(test))

16674
4169
6948
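
The printed split sizes can be reproduced by hand. The sketch below assumes that the held-out size is rounded up to a whole example, which is consistent with the counts above; `n_total` is the sum of the two printed sizes.

```python
import math

n_total = 16674 + 4169   # examples in the original 'train' split
test_size = 0.2          # fraction passed to train_test_split

# Assumption: the held-out part is rounded up, the rest is used for training.
n_val = math.ceil(test_size * n_total)
n_train = n_total - n_val

print(n_total, n_train, n_val)
```

The split is random and no seed is fixed, so the sizes are reproducible but the membership of each part is not.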


## Hyperparameter optimization

In [13]:
import numpy as np
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

from evaluate import load
accuracy = load("accuracy")

def compute_metrics(eval_pred):
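    outputs, labels = eval_pred
    # DNABERT-2 loaded with trust_remote_code=True returns (logits, embeddings);
    # a model that returns only logits yields a single array instead.
    logits = outputs[0] if isinstance(outputs, (tuple, list)) else outputs
    # logits.shape = (num_eval_examples, number_of_categories)
    # embeddings.shape = (num_eval_examples, padded_sequence_len, hidden_dim_size)
    # e.g. number_of_categories = 2, hidden_dim_size = 768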
    predictions = np.argmax(logits, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
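
As a plain-Python illustration of what `compute_metrics` calculates, the sketch below takes the argmax of each logit row and compares it with the label. The logits and labels are made-up toy values, not model output, and the helper `argmax` is a hypothetical stand-in for `np.argmax`.

```python
def argmax(row):
    """Index of the largest value in a list (stand-in for np.argmax on one row)."""
    return max(range(len(row)), key=row.__getitem__)

# Toy values for illustration only: 4 examples, 2 classes.
toy_logits = [[2.1, -0.3], [0.2, 0.9], [-1.0, 0.4], [0.7, 0.1]]
toy_labels = [0, 1, 0, 0]

toy_preds = [argmax(row) for row in toy_logits]
toy_accuracy = sum(p == y for p, y in zip(toy_preds, toy_labels)) / len(toy_labels)

print(toy_preds, toy_accuracy)
```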


In [14]:
import optuna


def objective(trial):

    learning_rate = trial.suggest_float("learning_rate", 1e-5, 5e-5)
    dropout_prob = trial.suggest_float("dropout_prob", 0.05, 0.2)
    num_train_epochs = trial.suggest_int("num_train_epochs", 2, 10)
    weight_decay = trial.suggest_float("weight_decay", 0.0, 0.1)

    config = BertConfig.from_pretrained("zhihan1996/DNABERT-2-117M",
        hidden_dropout_prob=dropout_prob,
        attention_probs_dropout_prob=dropout_prob,
        classifier_dropout=dropout_prob
    )
    
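    # trust_remote_code=True loads DNABERT-2's own model code, as in the final training cell.
    # Without it the built-in BERT class is used and most encoder weights are randomly
    # initialized, which is what the warning in the recorded output below reports.
    model = AutoModelForSequenceClassification.from_pretrained(
        "zhihan1996/DNABERT-2-117M",
        trust_remote_code=True,
        config=config
    ).to(DEVICE)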

    training_args = TrainingArguments(
        output_dir=f"./optuna_trial_{trial.number}",
        learning_rate=learning_rate,
        num_train_epochs=num_train_epochs, 
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        weight_decay=weight_decay,
        eval_strategy="epoch",
        save_strategy="epoch",
        logging_steps=100,
        metric_for_best_model="eval_loss",
        load_best_model_at_end=True
    )
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train,
        eval_dataset=val,
        data_collator=data_collator,
        compute_metrics=compute_metrics
    )
    
    trainer.train()
    eval_results = trainer.evaluate()
    
    return eval_results["eval_accuracy"]


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)

[I 2024-11-18 16:26:31,331] A new study created in memory with name: no-name-2103e03c-f1b3-404c-9276-6bb8225d1558
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at zhihan1996/DNABERT-2-117M and are newly initialized: ['bert.embeddings.position_embeddings.weight', 'bert.encoder.layer.0.attention.self.key.bias', 'bert.encoder.layer.0.attention.self.key.weight', 'bert.encoder.layer.0.attention.self.query.bias', 'bert.encoder.layer.0.attention.self.query.weight', 'bert.encoder.layer.0.attention.self.value.bias', 'bert.encoder.layer.0.attention.self.value.weight', 'bert.encoder.layer.0.intermediate.dense.bias', 'bert.encoder.layer.0.intermediate.dense.weight', 'bert.encoder.layer.0.output.LayerNorm.bias', 'bert.encoder.layer.0.output.LayerNorm.weight', 'bert.encoder.layer.0.output.dense.bias', 'bert.encoder.layer.0.output.dense.weight', 'bert.encoder.layer.1.attention.self.key.bias', 'bert.encoder.layer.1.attention.self.key.weight', 'bert.encode

Epoch,Training Loss,Validation Loss,Accuracy
1,0.5788,0.586601,0.691293
2,0.5604,0.604268,0.69729
3,0.5463,0.620781,0.699448
4,0.5331,0.599972,0.702567
5,0.52,0.569435,0.704965
6,0.5414,0.603092,0.698729
7,0.5116,0.596222,0.701127


[I 2024-11-18 16:33:39,021] Trial 0 finished with value: 0.7049652194770928 and parameters: {'learning_rate': 1.528157137723273e-05, 'dropout_prob': 0.14482173274527047, 'num_train_epochs': 7, 'weight_decay': 0.055719781590230215}. Best is trial 0 with value: 0.7049652194770928.

Epoch,Training Loss,Validation Loss,Accuracy
1,0.5954,0.61583,0.69657
2,0.5471,0.670861,0.693931
3,0.5431,0.619254,0.69585
4,0.5211,0.613609,0.694171
5,0.4933,0.614043,0.694891
6,0.5131,0.608095,0.692492
7,0.4704,0.630222,0.692732


[I 2024-11-18 16:40:43,450] Trial 1 finished with value: 0.6924922043655553 and parameters: {'learning_rate': 3.3966302950416475e-05, 'dropout_prob': 0.1077764136289187, 'num_train_epochs': 7, 'weight_decay': 0.03299477790216977}. Best is trial 0 with value: 0.7049652194770928.

Epoch,Training Loss,Validation Loss,Accuracy
1,0.6553,0.620681,0.654833
2,0.5964,0.616502,0.680499
3,0.6126,0.627734,0.675462
4,0.6064,0.622188,0.675222
5,0.5811,0.605729,0.686735
6,0.592,0.616835,0.692492
7,0.5961,0.616527,0.69681


[I 2024-11-18 16:47:47,998] Trial 2 finished with value: 0.6867354281602303 and parameters: {'learning_rate': 3.8685170763667604e-05, 'dropout_prob': 0.14470867712333146, 'num_train_epochs': 7, 'weight_decay': 0.06460329194569946}. Best is trial 0 with value: 0.7049652194770928.

Epoch,Training Loss,Validation Loss,Accuracy
1,0.5793,0.5837,0.694651
2,0.552,0.607213,0.681219
3,0.5534,0.606174,0.704485
4,0.5358,0.621148,0.699208
5,0.5231,0.56995,0.702806
6,0.5372,0.613498,0.698489
7,0.5088,0.608615,0.69705
8,0.4723,0.620647,0.699688
9,0.4813,0.607734,0.701367


[I 2024-11-18 16:57:00,358] Trial 3 finished with value: 0.702806428400096 and parameters: {'learning_rate': 1.4904160876479474e-05, 'dropout_prob': 0.12025914229025711, 'num_train_epochs': 9, 'weight_decay': 0.058552511588541495}. Best is trial 0 with value: 0.7049652194770928.

Epoch,Training Loss,Validation Loss,Accuracy
1,0.5916,0.595545,0.698009
2,0.5561,0.637119,0.702567
3,0.54,0.637899,0.703286
4,0.5184,0.631687,0.692732
5,0.4983,0.594302,0.69729
6,0.5216,0.613038,0.693692


[I 2024-11-18 17:03:11,969] Trial 4 finished with value: 0.6972895178699928 and parameters: {'learning_rate': 4.402047304103297e-05, 'dropout_prob': 0.18346542959073153, 'num_train_epochs': 6, 'weight_decay': 0.018528559135711175}. Best is trial 0 with value: 0.7049652194770928.

Epoch,Training Loss,Validation Loss,Accuracy
1,0.5904,0.587196,0.694411
2,0.5541,0.62668,0.698729
3,0.5446,0.596496,0.702327
4,0.5262,0.619828,0.688175
5,0.5041,0.586126,0.699448
6,0.5127,0.615253,0.698249
7,0.4689,0.628427,0.69657


[I 2024-11-18 17:10:16,206] Trial 5 finished with value: 0.6994483089469897 and parameters: {'learning_rate': 3.6839563791524795e-05, 'dropout_prob': 0.1114332819768351, 'num_train_epochs': 7, 'weight_decay': 0.09902570174095965}. Best is trial 0 with value: 0.7049652194770928.

Epoch,Training Loss,Validation Loss,Accuracy
1,0.5774,0.62006,0.69561
2,0.5394,0.630487,0.699448
3,0.5421,0.627612,0.700168
4,0.526,0.619895,0.680259
5,0.4972,0.600298,0.695371
6,0.5061,0.607348,0.691773
7,0.4745,0.643738,0.685776


[I 2024-11-18 17:17:21,432] Trial 6 finished with value: 0.6953705924682178 and parameters: {'learning_rate': 3.518698563893951e-05, 'dropout_prob': 0.1059573832463852, 'num_train_epochs': 7, 'weight_decay': 0.033733861527725666}. Best is trial 0 with value: 0.7049652194770928.

Epoch,Training Loss,Validation Loss,Accuracy
1,0.5823,0.585254,0.694891
2,0.5485,0.601822,0.684577
3,0.5544,0.617373,0.703526
4,0.536,0.620312,0.702087
5,0.5235,0.57008,0.702327
6,0.5367,0.601584,0.700408
7,0.5103,0.600727,0.700888
8,0.4758,0.588832,0.698729


[I 2024-11-18 17:25:31,008] Trial 7 finished with value: 0.7023266970496522 and parameters: {'learning_rate': 1.4061721549757406e-05, 'dropout_prob': 0.12195379946133938, 'num_train_epochs': 8, 'weight_decay': 0.08024994807598362}. Best is trial 0 with value: 0.7049652194770928.

Epoch,Training Loss,Validation Loss,Accuracy
1,0.6029,0.586263,0.692012
2,0.5554,0.650858,0.694651
3,0.5595,0.611695,0.701367
4,0.5429,0.582079,0.699448
5,0.512,0.586733,0.693692
6,0.5358,0.658069,0.691293
7,0.5039,0.624673,0.69585
8,0.4618,0.644974,0.694411
9,0.4682,0.659259,0.692012
10,0.4279,0.663158,0.690094


[I 2024-11-18 17:35:40,177] Trial 8 finished with value: 0.6994483089469897 and parameters: {'learning_rate': 4.660246770283248e-05, 'dropout_prob': 0.1408463212992932, 'num_train_epochs': 10, 'weight_decay': 0.07960106070147711}. Best is trial 0 with value: 0.7049652194770928.

Epoch,Training Loss,Validation Loss,Accuracy
1,0.576,0.572113,0.703046
2,0.5485,0.625365,0.697529
3,0.5481,0.612728,0.692972
4,0.5303,0.6013,0.697529
5,0.5202,0.579528,0.69585
6,0.5236,0.616166,0.69585
7,0.4882,0.610872,0.687215
8,0.4543,0.629101,0.689614
9,0.4663,0.629985,0.689374


[I 2024-11-18 17:44:47,214] Trial 9 finished with value: 0.7030462940753178 and parameters: {'learning_rate': 2.277301313901963e-05, 'dropout_prob': 0.09834230772795935, 'num_train_epochs': 9, 'weight_decay': 0.009933333410548806}. Best is trial 0 with value: 0.7049652194770928.
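
The value reported for each trial is the accuracy of the checkpoint with the lowest validation loss, not the highest accuracy seen during the trial, because `metric_for_best_model="eval_loss"` is combined with `load_best_model_at_end=True`. The sketch below checks this against the per-epoch numbers of trial 2, copied from the log above.

```python
# (epoch, validation loss, accuracy) for trial 2, copied from the log above
trial_2 = [
    (1, 0.620681, 0.654833),
    (2, 0.616502, 0.680499),
    (3, 0.627734, 0.675462),
    (4, 0.622188, 0.675222),
    (5, 0.605729, 0.686735),
    (6, 0.616835, 0.692492),
    (7, 0.616527, 0.696810),
]

# Checkpoint that is reloaded at the end of training: lowest validation loss
best_by_loss = min(trial_2, key=lambda row: row[1])
# Checkpoint that would be picked if accuracy were the selection metric
best_by_acc = max(trial_2, key=lambda row: row[2])

print("selected by eval_loss:", best_by_loss)
print("highest accuracy:     ", best_by_acc)
```

Epoch 5 is selected, and its accuracy matches the trial value 0.6867 logged by Optuna, while epoch 7 reached 0.6968. If accuracy is the quantity the study maximizes, setting `metric_for_best_model="eval_accuracy"` in `TrainingArguments` is one way to make checkpoint selection and the objective agree.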


In [15]:
print("Best trial:")
print("  Value: ", study.best_trial.value)
print("  Params: ")
for key, value in study.best_trial.params.items():
    print(f"    {key}: {value}")

Best trial:
  Value:  0.7049652194770928
  Params: 
    learning_rate: 1.528157137723273e-05
    dropout_prob: 0.14482173274527047
    num_train_epochs: 7
    weight_decay: 0.055719781590230215
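
To see how the ten trials compare, the logged learning rates and values can be tabulated. The numbers below are rounded copies of the Optuna log lines above; the variable names are only for this sketch.

```python
# trial number -> (learning_rate, validation accuracy), rounded from the log
trials = {
    0: (1.528e-05, 0.704965),
    1: (3.397e-05, 0.692492),
    2: (3.869e-05, 0.686735),
    3: (1.490e-05, 0.702806),
    4: (4.402e-05, 0.697290),
    5: (3.684e-05, 0.699448),
    6: (3.519e-05, 0.695371),
    7: (1.406e-05, 0.702327),
    8: (4.660e-05, 0.699448),
    9: (2.277e-05, 0.703046),
}

# Trials ordered from best to worst validation accuracy
ranked = sorted(trials, key=lambda t: trials[t][1], reverse=True)
# The four trials with the smallest learning rates
lowest_lr = sorted(trials, key=lambda t: trials[t][0])[:4]
# Gap between the best and the worst trial
spread = trials[ranked[0]][1] - trials[ranked[-1]][1]

for t in ranked:
    lr, acc = trials[t]
    print(f"trial {t}: lr={lr:.3e}  accuracy={acc:.4f}")
print("top 4 trials:", ranked[:4], " lowest-lr trials:", sorted(lowest_lr))
print(f"spread: {spread:.4f}")
```

In this run the four best trials are exactly the four with the lowest learning rates. The total spread is under two percentage points on a single validation split, so the ranking should be read with some caution.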


In [16]:
best_lr = study.best_trial.params["learning_rate"]
best_dropout = study.best_trial.params["dropout_prob"]
best_num_train_epochs = study.best_trial.params["num_train_epochs"]
best_weight_decay = study.best_trial.params["weight_decay"]

## Training the best classifier

In [19]:
config = BertConfig.from_pretrained("zhihan1996/DNABERT-2-117M", 
    hidden_dropout_prob=best_dropout,    
    attention_probs_dropout_prob=best_dropout,
    classifier_dropout=best_dropout
)
model = AutoModelForSequenceClassification.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True, config=config)
model = model.to(DEVICE)

training_args = TrainingArguments(
    output_dir="dnabert_genomic",
    learning_rate=best_lr,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=best_num_train_epochs,
    weight_decay=best_weight_decay,
    eval_strategy="epoch",
    save_strategy="epoch",
    metric_for_best_model="eval_loss",
    load_best_model_at_end=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train,
    eval_dataset=val,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at zhihan1996/DNABERT-2-117M and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.538723,0.729192
2,0.551200,0.503595,0.748621
3,0.551200,0.495844,0.752459
4,0.495300,0.500101,0.751019
5,0.495300,0.489234,0.763492
6,0.472200,0.49299,0.766131
7,0.472200,0.490946,0.765411


TrainOutput(global_step=1827, training_loss=0.498861520617923, metrics={'train_runtime': 521.3957, 'train_samples_per_second': 223.857, 'train_steps_per_second': 3.504, 'total_flos': 8775818261759088.0, 'train_loss': 0.498861520617923, 'epoch': 7.0})
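
The step count and the repeated training-loss values in the table follow from the batch size and the logging interval. The sketch below assumes a single device, no gradient accumulation, and the `TrainingArguments` default of `logging_steps=500`, since the final run does not set it.

```python
import math

n_train = 16674        # size of the training split printed earlier
batch_size = 64        # per_device_train_batch_size in the final run
epochs = 7             # num_train_epochs of the best trial
logging_steps = 500    # assumed default, not set explicitly in the final run

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

# For each epoch: the step at which it ends and the most recent logging step
last_logged = []
for epoch in range(1, epochs + 1):
    end_step = epoch * steps_per_epoch
    last_logged.append((end_step // logging_steps) * logging_steps)
    print(epoch, end_step, last_logged[-1] or "No log")

print("total steps:", total_steps)
```

The first epoch ends at step 261, before anything has been logged, which gives the "No log" entry. After that the training loss column only changes when a multiple of 500 steps has passed, so pairs of epochs show the same value.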

## Evaluation on the test set

In [20]:
predictions = trainer.predict(test)
test_preds = np.argmax(predictions.predictions[0], axis=-1)

In [22]:
from sklearn.metrics import classification_report, confusion_matrix

test_labels = test['label']
print("\nClassification Report:")
print(classification_report(test_labels, test_preds))

print("\nConfusion Matrix:")
print(confusion_matrix(test_labels, test_preds))


Classification Report:
              precision    recall  f1-score   support

           0       0.78      0.72      0.75      3474
           1       0.74      0.80      0.77      3474

    accuracy                           0.76      6948
   macro avg       0.76      0.76      0.76      6948
weighted avg       0.76      0.76      0.76      6948


Confusion Matrix:
[[2495  979]
 [ 686 2788]]
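
The figures in the classification report can be recomputed from the confusion matrix alone. The sketch below uses the scikit-learn convention that rows are true labels and columns are predicted labels; `per_class` is a helper written for this check.

```python
# Confusion matrix from the output above: rows = true label, columns = prediction
cm = [[2495, 979],
      [686, 2788]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

def per_class(cm, k):
    """Precision, recall and F1 for class k."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)   # column sum
    actual_k = sum(cm[k])                     # row sum
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

report = {}
for k in range(2):
    p, r, f = per_class(cm, k)
    report[k] = (f"{p:.2f}", f"{r:.2f}", f"{f:.2f}")
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.4f}")
```

The model misses 686 of the 3474 positive examples and wrongly flags 979 of the 3474 negatives, so its errors lean towards false positives.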
