# Task 1: Fine-tune Chemical Language Model

The goal is to fine-tune a pre-trained chemical language model on a regression task using the Lipophilicity dataset. The task involves predicting the lipophilicity value for a given molecule representation (SMILES string). You will learn how to load and tokenize a dataset from HuggingFace, how to load a pre-trained language model, and finally, how to run a model in inference mode.

Your task is to complete the missing code blocks below.

In [2]:
!pip install datasets


In [3]:
!pip install --upgrade --force-reinstall transformers


In [1]:
import transformers
print(transformers.__version__)

4.49.0


In [2]:
# import dependencies
import torch
from datasets import load_dataset
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling
from torch.utils.data import DataLoader, Dataset, Subset, random_split
from sklearn.model_selection import train_test_split
import pandas as pd
from tqdm.notebook import tqdm
import random

# 1. Fine-tune a Chemical Language Model on Lipophilicity


## --- Step 1: Load Dataset ---

The dataset we are going to use is the [Lipophilicity](https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_Lipophilicity) dataset, part of the [MoleculeNet](https://pubs.rsc.org/en/content/articlelanding/2018/sc/c7sc02664a) benchmark.

Lipophilicity is a measure of how readily a substance dissolves in nonpolar solvents (such as oil) compared to polar solvents (such as water). The term is often used interchangeably with hydrophobicity.
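To make the target value more concrete, the sketch below computes a log concentration ratio between a nonpolar and a polar phase. It assumes the label is an octanol/water log distribution coefficient, which is how MoleculeNet describes this dataset. The helper name and the example concentrations are hypothetical.

```python
import math

def log_distribution_coefficient(conc_octanol, conc_water):
    """Return log10 of the octanol/water concentration ratio (hypothetical helper)."""
    if conc_octanol <= 0 or conc_water <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_octanol / conc_water)

# made-up concentrations in the same units: 100x more solute in the octanol phase
value = log_distribution_coefficient(conc_octanol=1.0, conc_water=0.01)
print(f"log ratio: {value:.2f}")
```

Under this assumption, positive labels indicate a preference for the nonpolar phase and negative labels a preference for water.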

In [3]:
# specify dataset name and model name
DATASET_PATH = "scikit-fingerprints/MoleculeNet_Lipophilicity"
MODEL_NAME = "ibm/MoLFormer-XL-both-10pct"  #MoLFormer model

In [4]:
# load the dataset from HuggingFace
dataset = load_dataset(DATASET_PATH)


In [5]:
# Explore the dataset
# For example, print the column names and display a few sample rows
# TODO: your code goes here
print(dataset.column_names)
print(len(dataset['train']['SMILES']))
print(len(dataset['train']['label']))
print(dataset['train'][0])

{'train': ['SMILES', 'label']}
4200
4200
{'SMILES': 'Cn1c(CN2CCN(CC2)c3ccc(Cl)cc3)nc4ccccc14', 'label': 3.54}


In [6]:
# define a PyTorch Dataset class for handling SMILES strings and targets

# TODO: your code goes here
class SMILESDataset(Dataset):
    """Tokenizes one SMILES string per item and attaches its regression label."""

    def __init__(self, dataset, tokenizer, max_length=512):
        self.dataset = dataset
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        row = self.dataset[idx]

        # pad/truncate every sequence to the same length so default batching works
        encoded = self.tokenizer(row['SMILES'], return_tensors='pt', padding='max_length',
                                 max_length=self.max_length, truncation=True)
        # drop the leading batch dimension of size 1 added by return_tensors='pt'
        encoded = {k: v.squeeze(0) for k, v in encoded.items()}

        return {**encoded, 'labels': torch.tensor(row['label'], dtype=torch.float32)}
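The dataset class relies on the tokenizer to pad or truncate each sequence to a fixed length and to return a matching attention mask. The following pure-Python sketch shows the idea on a list of token ids. The helper name, the pad id of 0, and the example ids are assumptions for illustration, not the MoLFormer tokenizer's actual values.

```python
def pad_to_max_length(token_ids, max_length, pad_id=0):
    """Truncate or right-pad a list of token ids and build the matching attention mask."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # 1 marks real tokens, 0 marks padding that the model should ignore
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

ids, mask = pad_to_max_length([12, 7, 7, 31, 5], max_length=8)
print(ids)
print(mask)
```

Padding every molecule to 512 tokens keeps batching simple. If the SMILES strings in the dataset are much shorter than that, a smaller maximum length or per-batch padding may reduce compute.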

## --- Step 2: Split Dataset ---

As the original dataset contains only one split (the train split), we need to split the data into training and test sets ourselves.
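A seeded index split can be sketched in plain Python as follows. This is a stand-in for illustration, not scikit-learn's `train_test_split`: the helper name is hypothetical and the resulting permutation will differ from the one scikit-learn produces for the same seed.

```python
import random

def split_indices(n_items, test_fraction=0.2, seed=42):
    """Shuffle indices reproducibly and split them into train and test lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_items * test_fraction))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(4200)
print(len(train_idx), len(test_idx))
```

Fixing the seed matters here because later sections compare two models, and that comparison is only meaningful if both are evaluated on the same held-out molecules.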

In [7]:
# tokenize the data
# load a pre-trained tokenizer from HuggingFace
# (deterministic_eval is a model option, so it is not passed to the tokenizer)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)



In [9]:
encode = tokenizer(dataset['train'][0]['SMILES'], padding='max_length', truncation=True, max_length=512, return_tensors="pt")
encode['attention_mask'].shape

len(dataset['train'])

4200

In [10]:
# split the data into training and test datasets
# TODO: your code goes here
train_indices, test_indices = train_test_split(range(len(dataset['train'])), test_size=0.2, random_state=42)
train_set = Subset(dataset['train'], train_indices)
test_set = Subset(dataset['train'], test_indices)

train_dataset = SMILESDataset(train_set, tokenizer)
test_dataset = SMILESDataset(test_set, tokenizer)

print(f"Train dataset with {len(train_dataset)} data points created.")
print(f"Test dataset with {len(test_dataset)} data points created.")

Train dataset with 3360 data points created.
Test dataset with 840 data points created.


In [11]:
# construct PyTorch data loaders for both train and test datasets
BATCH_SIZE = 16 # adjust based on memory constraints

# TODO: your code goes here
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)
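The number of batches per epoch follows directly from the dataset sizes printed above and the batch size. The small helper below is a sketch (the function name is hypothetical). It mirrors how a data loader that keeps the last incomplete batch counts its batches.

```python
import math

def batches_per_epoch(n_examples, batch_size, drop_last=False):
    """Number of batches a loader yields per pass over the data."""
    if drop_last:
        return n_examples // batch_size
    return math.ceil(n_examples / batch_size)

print(batches_per_epoch(3360, 16))  # training set
print(batches_per_epoch(840, 16))   # test set, last batch is incomplete
```

Because the last test batch holds fewer than 16 molecules, any loss averaged over the test set should be weighted by batch size rather than averaged per batch.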

## --- Step 3: Load Model ---

In [12]:
# load pre-trained model from HuggingFace
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)


In [13]:
# We need to add a regression head on the language model as we are doing a regression task.

# specify model with a regression head

class MoLFormerWithRegressionHead(nn.Module):
    # TODO: your code goes here
    def __init__(self, model):
        super(MoLFormerWithRegressionHead, self).__init__()
        self.model = model
        self.hidden_size = model.config.hidden_size
        self.regression_head = nn.Linear(self.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.model(input_ids, attention_mask)
        # cls_token = outputs.last_hidden_state[:, 0, :]
        sequence_output = outputs[0]
        cls_token = sequence_output[:, 0, :]
        outputs_head = self.regression_head(cls_token)
        return outputs_head
# instantiate the model
# model = MoLFormerWithRegressionHead(model)
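To check the shapes flowing through a first-token regression head without downloading the pre-trained weights, you can swap in a tiny stand-in encoder. In the sketch below, `TinyEncoder` and `FirstTokenRegressor` are hypothetical names, and the embedding-only encoder is an assumption made purely to keep the example small. It is not the MoLFormer architecture.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Hypothetical stand-in for a pre-trained backbone; returns a tuple like HF models do."""
    def __init__(self, vocab_size=32, hidden_size=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids, attention_mask=None):
        return (self.embed(input_ids),)  # (batch, seq_len, hidden)

class FirstTokenRegressor(nn.Module):
    """Linear regression head on the hidden state of the first token."""
    def __init__(self, encoder, hidden_size):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(input_ids, attention_mask)[0]  # (batch, seq_len, hidden)
        first_token = hidden[:, 0, :]                        # (batch, hidden)
        return self.head(first_token).squeeze(-1)            # (batch,)

torch.manual_seed(0)
toy_model = FirstTokenRegressor(TinyEncoder(), hidden_size=8)
toy_ids = torch.randint(0, 32, (4, 10))
preds = toy_model(toy_ids, torch.ones_like(toy_ids))
print(preds.shape)
```

Squeezing only the last dimension keeps the batch dimension intact even for a batch of size one, so the predictions always line up with a label tensor of shape `(batch,)`.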

In [14]:
# initialize the regression model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
regression_model = MoLFormerWithRegressionHead(model).to(device)

## --- Step 4: Training ---

In [None]:
# TODO: your code goes here
num_epochs = 20
optimizer = torch.optim.AdamW(regression_model.parameters(), lr=5e-5)
criterion = nn.MSELoss()
regression_model.train()

for epoch in range(num_epochs):
    total_loss = 0

    for i, data in enumerate(tqdm(train_loader)):
        input_ids = data['input_ids'].to(device)
        attention_mask = data['attention_mask'].to(device)
        label = data['labels'].to(device)
        optimizer.zero_grad()
        outputs = regression_model(input_ids, attention_mask)
        loss = criterion(outputs.squeeze(), label)

        loss.backward()
        optimizer.step()
        total_loss += loss.item() * label.shape[0]

    print(f"Epoch {epoch+1}, Loss: {total_loss / len(train_dataset)}")

Epoch 1, Loss: 0.9631072766724087
Epoch 2, Loss: 0.4960657344687553
Epoch 3, Loss: 0.35897222384810445
Epoch 4, Loss: 0.2856334506755783
Epoch 5, Loss: 0.23387702948280742
Epoch 6, Loss: 0.19939777336659886
Epoch 7, Loss: 0.17243759804183528
Epoch 8, Loss: 0.1549551884688082
Epoch 9, Loss: 0.1388117222204095
Epoch 10, Loss: 0.1549153261951038
Epoch 11, Loss: 0.1953611200586671
Epoch 12, Loss: 0.1239517913510402
Epoch 13, Loss: 0.3417385856487921
Epoch 14, Loss: 0.16450906481061664
Epoch 15, Loss: 0.10776137181868156
Epoch 16, Loss: 0.09705615684035279
Epoch 17, Loss: 0.08375553910043977
Epoch 18, Loss: 0.08823751479919467
Epoch 19, Loss: 0.08117713609799034
Epoch 20, Loss: 0.07791138789838269
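The training loop multiplies each batch loss by the batch size before summing and then divides by the dataset size. The sketch below, with made-up batch losses and a hypothetical helper name, shows why this differs from a plain average over batches when the last batch is smaller.

```python
def epoch_mean_loss(batch_losses, batch_sizes):
    """Per-example mean loss from per-batch mean losses."""
    total = sum(loss * size for loss, size in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)

# made-up per-batch mean losses; the last batch holds only 8 examples
losses = [0.50, 0.30, 0.90]
sizes = [16, 16, 8]

weighted = epoch_mean_loss(losses, sizes)
unweighted = sum(losses) / len(losses)
print(f"weighted: {weighted:.4f}, unweighted: {unweighted:.4f}")
```

The unweighted average lets the small final batch count as much as a full one, which here inflates the reported loss.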


## --- Step 5: Evaluation ---

In [None]:
# TODO: your code goes here
regression_model.eval()
total_loss = 0

# gradients are not needed for evaluation; disabling them saves memory and time
with torch.no_grad():
    for i, data in enumerate(tqdm(test_loader)):
        input_ids = data['input_ids'].to(device)
        attention_mask = data['attention_mask'].to(device)
        label = data['labels'].to(device)
        outputs = regression_model(input_ids, attention_mask)
        loss = criterion(outputs.squeeze(), label)
        total_loss += loss.item() * label.shape[0]

print(f"Test Loss: {total_loss / len(test_dataset)}")

Test Loss: 0.47730408452806017
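The test loss above is a mean squared error, so it is expressed in squared label units. Taking the square root gives the RMSE, which is in the same units as the lipophilicity labels and is often easier to interpret. The variable names below are illustrative.

```python
import math

test_mse = 0.47730408452806017  # test loss reported above
test_rmse = math.sqrt(test_mse)
print(f"RMSE: {test_rmse:.3f}")
```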


# 2. Add Unsupervised Fine-tuning
In this step, you will perform unsupervised fine-tuning on the training dataset. This means the model will leverage only the SMILES strings without any corresponding labels to adapt its understanding of the data distribution. By familiarizing the model with the patterns and structure of the SMILES strings, you can potentially enhance its performance on downstream supervised tasks.

For this fine-tuning, you will use the Masked Language Modeling (MLM) objective, where the model learns to predict randomly masked tokens within the input sequence. Remember to save the fine-tuned model for later use.
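As a rough picture of what the MLM objective does to one input sequence, the sketch below selects about 15% of the non-special tokens as prediction targets. Of those, roughly 80% are replaced by a mask token, 10% by a random token, and 10% are left unchanged, which is the scheme `DataCollatorForLanguageModeling` applies by default. The function name, token ids, mask id, and vocabulary size are hypothetical and not taken from the MoLFormer tokenizer.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default

def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(),
                mlm_probability=0.15, seed=0):
    """Return (masked_inputs, labels) for one sequence of token ids."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, tok in enumerate(token_ids):
        # special tokens are never selected; other tokens are selected with mlm_probability
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok                            # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels

# hypothetical ids: 0 = <bos>, 1 = <eos>, 4 = <mask>, ordinary tokens start at 5
sequence = [0] + list(range(5, 45)) + [1]
masked, targets = mask_tokens(sequence, mask_id=4, vocab_size=200, special_ids={0, 1})
print(sum(t != IGNORE_INDEX for t in targets), "tokens selected for prediction")
```

Only the selected positions contribute to the loss, so the regression labels play no role in this stage.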


In [None]:
# TODO: your code goes here
from transformers import get_scheduler

# unlabel_dataset = SMILESDataset(train_set, tokenizer, False)
# print(f"Train DataLoader with {len(unlabel_dataset)} unsupervised data points created.")

unsup_model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME,
                                                   deterministic_eval=True,
                                                   trust_remote_code=True).to(device)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                mlm=True,
                                                mlm_probability=0.15)

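# train_dataset also yields the regression target under 'labels'; the collator
# overwrites that key with the MLM targets, so the lipophilicity values are not used here
train_dataloader = DataLoader(train_dataset,
                              batch_size=BATCH_SIZE,
                              shuffle=True,
                              collate_fn=data_collator)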

optimizer = torch.optim.AdamW(unsup_model.parameters(), lr=5e-5)
num_epochs = 100
num_training_steps = num_epochs * len(train_dataloader)
scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),
    num_training_steps=num_training_steps
)

unsup_model.train()
best_loss = float('inf')
count = 0

for epoch in range(num_epochs):
    total_loss = 0

    for data in tqdm(train_dataloader):
        data = {k: v.to(device) for k, v in data.items()}
        outputs = unsup_model(**data)
        loss = outputs.loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
        total_loss += loss.item() * data['input_ids'].shape[0]

    epoch_loss = total_loss / len(train_dataset)

    if epoch_loss < best_loss:
        best_loss = epoch_loss
        count = 0
        #save model
        unsup_model.save_pretrained("./finetuned-mlm-model")
        tokenizer.save_pretrained("./finetuned-mlm-token")
        print("Model saved ...")
    else:
        count += 1

    print(f"Epoch {epoch+1}, Loss: {epoch_loss}, Count: {count}")

    if count == 10: # early stop
        print("Early stop !")
        break


Model saved ...
Epoch 1, Loss: 0.7052506881100791, Count: 0
Model saved ...
Epoch 2, Loss: 0.3895466801666078, Count: 0
Model saved ...
Epoch 3, Loss: 0.2824850717825549, Count: 0
Model saved ...
Epoch 4, Loss: 0.2389592829204741, Count: 0
Model saved ...
Epoch 5, Loss: 0.21021530865913346, Count: 0
Model saved ...
Epoch 6, Loss: 0.19202484279161408, Count: 0
Model saved ...
Epoch 7, Loss: 0.1775719823227042, Count: 0
Model saved ...
Epoch 8, Loss: 0.17052741629027185, Count: 0
Model saved ...
Epoch 9, Loss: 0.1530819710521471, Count: 0
Epoch 10, Loss: 0.15430248388577075, Count: 1
Model saved ...
Epoch 11, Loss: 0.14853285425120877, Count: 0
Model saved ...
Epoch 12, Loss: 0.1468137736742695, Count: 0
Model saved ...
Epoch 13, Loss: 0.1371008756880959, Count: 0
Model saved ...
Epoch 14, Loss: 0.1311052461997384, Count: 0
Model saved ...
Epoch 15, Loss: 0.12645029063735688, Count: 0
Model saved ...
Epoch 16, Loss: 0.12161384838234102, Count: 0
Model saved ...
Epoch 17, Loss: 0.12034004210893597, Count: 0
Epoch 18, Loss: 0.1220074801217942, Count: 1
Model saved ...
Epoch 19, Loss: 0.11344718916696452, Count: 0
Model saved ...
Epoch 20, Loss: 0.10991184885303179, Count: 0
Model saved ...
Epoch 21, Loss: 0.10961458982811087, Count: 0
Model saved ...
Epoch 22, Loss: 0.10595469804746764, Count: 0
Model saved ...
Epoch 23, Loss: 0.10471689387535056, Count: 0
Epoch 24, Loss: 0.10636556789811169, Count: 1
Model saved ...
Epoch 25, Loss: 0.09883436762860844, Count: 0
Model saved ...
Epoch 26, Loss: 0.09855499747652738, Count: 0
Model saved ...
Epoch 27, Loss: 0.09538311260708031, Count: 0
Model saved ...
Epoch 28, Loss: 0.08958107069400804, Count: 0
Model saved ...
Epoch 29, Loss: 0.08702450280210801, Count: 0
Model saved ...
Epoch 30, Loss: 0.0836933463213167, Count: 0
Epoch 31, Loss: 0.09038075544827041, Count: 1
Epoch 32, Loss: 0.08723788057631325, Count: 2
Epoch 33, Loss: 0.08809950229312692, Count: 3
Epoch 34, Loss: 0.08621945657013427, Count: 4
Epoch 35, Loss: 0.08408765439387589, Count: 5
Epoch 36, Loss: 0.08619017520963791, Count: 6
Model saved ...
Epoch 37, Loss: 0.07752411443784478, Count: 0
Epoch 38, Loss: 0.07785674076128218, Count: 1
Epoch 39, Loss: 0.07966758006119302, Count: 2
Model saved ...
Epoch 40, Loss: 0.0768554359430536, Count: 0
Model saved ...
Epoch 41, Loss: 0.0747488898313826, Count: 0
Model saved ...
Epoch 42, Loss: 0.07298139743728652, Count: 0
Model saved ...
Epoch 43, Loss: 0.07214372371589499, Count: 0
Model saved ...
Epoch 44, Loss: 0.06995169852771574, Count: 0
Epoch 45, Loss: 0.07084932450781621, Count: 1
Epoch 46, Loss: 0.07308326455808821, Count: 2
Epoch 47, Loss: 0.07232095693637218, Count: 3
Model saved ...
Epoch 48, Loss: 0.06765720159802142, Count: 0
Epoch 49, Loss: 0.07025269856232973, Count: 1
Model saved ...
Epoch 50, Loss: 0.06475259467870706, Count: 0
Model saved ...
Epoch 51, Loss: 0.06256468294277077, Count: 0
Epoch 52, Loss: 0.0665476971571999, Count: 1
Epoch 53, Loss: 0.06276472141367516, Count: 2
Epoch 54, Loss: 0.06400525086958493, Count: 3
Epoch 55, Loss: 0.06322967327987065, Count: 4
Model saved ...
Epoch 56, Loss: 0.05891522031853951, Count: 0
Epoch 57, Loss: 0.059349902830130995, Count: 1
Model saved ...
Epoch 58, Loss: 0.057029503012087104, Count: 0
Epoch 59, Loss: 0.05809982054024225, Count: 1
Model saved ...
Epoch 60, Loss: 0.056880719141502466, Count: 0
Epoch 61, Loss: 0.05766084914849628, Count: 1
Model saved ...
Epoch 62, Loss: 0.051230279701052324, Count: 0
Epoch 63, Loss: 0.0542396835439528, Count: 1
Epoch 64, Loss: 0.05151454241658073, Count: 2
Epoch 65, Loss: 0.05159369318335805, Count: 3
Model saved ...
Epoch 66, Loss: 0.04952441805064501, Count: 0
Model saved ...
Epoch 67, Loss: 0.04890002251846627, Count: 0
Epoch 68, Loss: 0.05261529708015067, Count: 1
Epoch 69, Loss: 0.05091634585987777, Count: 2
Model saved ...
Epoch 70, Loss: 0.04721695862799173, Count: 0
Epoch 71, Loss: 0.04855382706515402, Count: 1
Epoch 72, Loss: 0.04765385170162301, Count: 2
Model saved ...
Epoch 73, Loss: 0.046850172534496304, Count: 0
Epoch 74, Loss: 0.05323560819628515, Count: 1
Model saved ...
Epoch 75, Loss: 0.04674636209189582, Count: 0
Epoch 76, Loss: 0.04873262411316059, Count: 1
Epoch 77, Loss: 0.05027761689132257, Count: 2
Epoch 78, Loss: 0.04938577862256872, Count: 3
Model saved ...
Epoch 79, Loss: 0.04641051503380628, Count: 0
Model saved ...
Epoch 80, Loss: 0.045210999403414984, Count: 0
Epoch 81, Loss: 0.04887860911743094, Count: 1
Model saved ...
Epoch 82, Loss: 0.04493153191482027, Count: 0
Model saved ...
Epoch 83, Loss: 0.041368320436837774, Count: 0
Model saved ...
Epoch 84, Loss: 0.039888802374203114, Count: 0
Epoch 85, Loss: 0.04996131226597797, Count: 1
Epoch 86, Loss: 0.03992099472255047, Count: 2
Epoch 87, Loss: 0.041008573190663894, Count: 3
Epoch 88, Loss: 0.04438628633894647, Count: 4
Epoch 89, Loss: 0.04308016759764758, Count: 5
Model saved ...
Epoch 90, Loss: 0.03739472858530159, Count: 0
Epoch 91, Loss: 0.043064478392313636, Count: 1
Epoch 92, Loss: 0.04243584534574655, Count: 2
Epoch 93, Loss: 0.039767636968532485, Count: 3
Epoch 94, Loss: 0.03950590095122433, Count: 4
Epoch 95, Loss: 0.041974646359726434, Count: 5
Epoch 96, Loss: 0.04038917282817974, Count: 6
Epoch 97, Loss: 0.03859211647040432, Count: 7
Epoch 98, Loss: 0.03826844466744833, Count: 8
Epoch 99, Loss: 0.037880165968090296, Count: 9
Epoch 100, Loss: 0.040032584148658706, Count: 10
Early stop !
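The `count` variable in the loop above acts as a patience counter: it resets whenever the epoch loss improves on the best value seen so far and triggers a stop once it reaches the limit. The standalone sketch below reproduces that logic on a made-up loss sequence. The function name and the numbers are illustrative only.

```python
def early_stopping_epoch(losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never stops early."""
    best = float("inf")  # start at infinity so the first epoch always counts as an improvement
    stale = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
        if stale == patience:
            return epoch
    return None

made_up_losses = [0.9, 0.5, 0.4, 0.45, 0.41, 0.42, 0.39, 0.4]
print(early_stopping_epoch(made_up_losses, patience=3))
print(early_stopping_epoch(made_up_losses, patience=5))
```

Note that the loop above monitors the training loss. Monitoring a held-out validation loss instead would make the stopping criterion sensitive to overfitting rather than to optimization noise.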


# 3. Fine-tune for Comparison
After performing unsupervised fine-tuning on the training data, we now fine-tune the model on the regression task with the regression head. By comparing the performance of the model before and after unsupervised fine-tuning, you can evaluate how the unsupervised fine-tuning impacts the model's performance on our target task.


In [None]:
# TODO: your code goes here
# load the encoder saved after MLM fine-tuning; AutoModel returns the base model without the MLM head

finetuned_mlm_model = AutoModel.from_pretrained(
    "./finetuned-mlm-model",
    deterministic_eval=True,
    trust_remote_code=True,
)

finetune_model = MoLFormerWithRegressionHead(finetuned_mlm_model).to(device)

num_epochs = 20
optimizer = torch.optim.AdamW(finetune_model.parameters(), lr=5e-5)
criterion = nn.MSELoss()
best_loss = float('inf')  # start at infinity so the first epoch counts as an improvement
count = 0

for epoch in range(num_epochs):
    finetune_model.train()
    train_loss = 0

    for i, data in enumerate(tqdm(train_loader)):
        input_ids = data['input_ids'].to(device)
        attention_mask = data['attention_mask'].to(device)
        label = data['labels'].to(device)
        outputs = finetune_model(input_ids, attention_mask)
        loss = criterion(outputs.squeeze(), label)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * label.shape[0]

    epoch_loss = train_loss / len(train_dataset)
    print(f"Epoch {epoch+1}, Loss: {epoch_loss}")

    if epoch_loss < best_loss:
        best_loss = epoch_loss
        count = 0
    else:
        count += 1

    finetune_model.eval()
    test_loss = 0
    with torch.no_grad():
        for i, data in enumerate(tqdm(test_loader)):
            input_ids = data['input_ids'].to(device)
            attention_mask = data['attention_mask'].to(device)
            label = data['labels'].to(device)
            outputs = finetune_model(input_ids, attention_mask)
            loss = criterion(outputs.squeeze(), label)
            test_loss += loss.item() * label.shape[0]

        print(f"Test Loss: {test_loss / len(test_dataset)}")

    if count == 5: # early stop
        print("Early stop !")
        break

Epoch 1, Loss: 0.9284282751736187
Test Loss: 0.8023252027375357
Epoch 2, Loss: 0.47159650985683715
Test Loss: 0.548570317029953
Epoch 3, Loss: 0.36186805108473413
Test Loss: 0.539662682442438
Epoch 4, Loss: 0.2749469946892489
Test Loss: 0.4758058863026755
Epoch 5, Loss: 0.2328652380123025
Test Loss: 0.46420217894372484
Epoch 6, Loss: 0.203725615888834
Test Loss: 0.40931275401796613
Epoch 7, Loss: 0.17374120860227515
Test Loss: 0.465050542922247
Epoch 8, Loss: 0.15552911581028075
Test Loss: 0.4948249598344167
Epoch 9, Loss: 0.14456481931819803
Test Loss: 0.3870586579754239
Epoch 10, Loss: 0.14186382691065472
Test Loss: 0.4474286192939395
Epoch 11, Loss: 0.12563518473789806
Test Loss: 0.3786383765084403
Epoch 12, Loss: 0.11194112558982201
Test Loss: 0.4256276411669595
Epoch 13, Loss: 0.11426490863696451
Test Loss: 0.39915103429839727
Epoch 14, Loss: 0.10264719612009468
Test Loss: 0.41682599641027906
Epoch 15, Loss: 0.0993486667229306
Test Loss: 0.4177468915780385
Epoch 16, Loss: 0.09290751726144836
Test Loss: 0.3783803567999885
Epoch 17, Loss: 0.0901261595476951
Test Loss: 0.4384279129051027
Epoch 18, Loss: 0.08879849863726469
Test Loss: 0.37449254386481784
Epoch 19, Loss: 0.08125942081567787
Test Loss: 0.3843427618344625
Epoch 20, Loss: 0.08563885562831447
Test Loss: 0.3763835790611449
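To put the two runs side by side, the snippet below compares the test MSE from section 1 with the epoch-20 test MSE obtained after MLM fine-tuning. Both numbers are copied from the outputs above; the helper name is hypothetical. Treat the result as indicative only: it comes from a single run on a single split, and the test loss fluctuates noticeably from epoch to epoch.

```python
baseline_mse = 0.47730408452806017  # section 1: supervised fine-tuning only
with_mlm_mse = 0.3763835790611449   # section 3: MLM fine-tuning, then supervised (epoch 20)

def relative_reduction(before, after):
    """Fraction by which `after` is lower than `before`."""
    return (before - after) / before

reduction = relative_reduction(baseline_mse, with_mlm_mse)
print(f"Relative reduction in test MSE: {reduction:.1%}")
```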
