# Task 1: Fine-tune Chemical Language Model

The goal is to fine-tune a pre-trained chemical language model on a regression task using the Lipophilicity dataset. The task involves predicting the lipophilicity value for a given molecule representation (SMILES string). You will learn how to load and tokenize a dataset from HuggingFace, how to load a pre-trained language model, and finally, how to run a model in inference mode.

Your task is to complete the missing code blocks below.

In [1]:
!pip install datasets



In [2]:
# import dependencies
import torch
from datasets import load_dataset
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling
from torch.utils.data import DataLoader, Dataset, Subset
from sklearn.model_selection import train_test_split
import pandas as pd
from tqdm.notebook import tqdm
import random

# 1. Fine-tune a Chemical Language Model on Lipophilicity


## --- Step 1: Load Dataset ---

The dataset we are going to use is the [Lipophilicity](https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_Lipophilicity) dataset, part of the [MoleculeNet](https://pubs.rsc.org/en/content/articlelanding/2018/sc/c7sc02664a) benchmark.

Lipophilicity, also known as hydrophobicity, is a measure of how readily a substance dissolves in nonpolar solvents (such as oil) compared to polar solvents (such as water).
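
For intuition, lipophilicity is usually reported as the base-10 logarithm of the ratio between a compound's concentration in octanol and its concentration in water (the MoleculeNet paper describes these labels as octanol/water distribution coefficients). The snippet below is a minimal sketch of that relationship. The function name `log_distribution` and the concentrations are made up for illustration and are not part of the dataset or any library.

```python
import math

def log_distribution(conc_octanol: float, conc_water: float) -> float:
    """Base-10 log of the octanol/water concentration ratio."""
    if conc_octanol <= 0 or conc_water <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_octanol / conc_water)

# Equal concentrations in both phases: no preference for either solvent.
print(log_distribution(1.0, 1.0))
# A 1000-fold preference for octanol: strongly lipophilic.
print(log_distribution(1000.0, 1.0))
```

Read this way, a label such as 3.54 corresponds to a compound that is found in the octanol phase a few thousand times more often than in water.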

In [3]:
# specify dataset name and model name
DATASET_PATH = "scikit-fingerprints/MoleculeNet_Lipophilicity"
MODEL_NAME = "ibm/MoLFormer-XL-both-10pct"  #MoLFormer model

In [4]:
# load the dataset from HuggingFace
dataset = load_dataset(DATASET_PATH)



In [5]:
# Explore the dataset
print(dataset['train'].column_names)
print(dataset['train'][0])

['SMILES', 'label']
{'SMILES': 'Cn1c(CN2CCN(CC2)c3ccc(Cl)cc3)nc4ccccc14', 'label': 3.54}


In [6]:
# define a PyTorch Dataset class for handling SMILES strings and targets
class SMILESDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        smiles = item['SMILES']
        target = item['label']
        # tokenize one SMILES string, padding/truncating to a fixed length of 512 tokens
        encoding = self.tokenizer(smiles, return_tensors='pt', padding='max_length', truncation=True, max_length=512)
        # drop the leading batch dimension of size 1 so the DataLoader can stack examples
        encoding = {k: v.squeeze(0) for k, v in encoding.items()}
        return {**encoding, 'labels': torch.tensor(target, dtype=torch.float)}
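
To see what `padding='max_length'` and `truncation=True` do to each example, here is a small sketch with a toy character-level encoder. `toy_encode` is a hypothetical stand-in written for this illustration. It is not the MoLFormer tokenizer, and its ids carry no meaning.

```python
def toy_encode(smiles: str, max_length: int = 12, pad_id: int = 0):
    """Character-level stand-in for a tokenizer (hypothetical, for illustration only)."""
    ids = [ord(ch) for ch in smiles][:max_length]   # truncation to max_length
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,              # pad up to max_length
        "attention_mask": [1] * len(ids) + [0] * n_pad,   # 1 = real token, 0 = padding
    }

short = toy_encode("CCO")
long = toy_encode("Cn1c(CN2CCN(CC2)c3ccc(Cl)cc3)nc4ccccc14")

# Both encodings have the same length; only the mask tells them apart.
print(len(short["input_ids"]), sum(short["attention_mask"]))
print(len(long["input_ids"]), sum(long["attention_mask"]))
```

With `max_length=512`, every example occupies 512 positions regardless of how short the SMILES string is. That is simple to batch but may cost extra compute compared with padding each batch only to its longest sequence.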

## --- Step 2: Split Dataset ---

As the original dataset has only one split (the train split), we need to split the data into training and test sets ourselves.

In [7]:
# load the pre-trained tokenizer that matches the model from HuggingFace
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)

In [8]:
# split the data into training and test datasets
train_data, test_data = train_test_split(range(len(dataset['train'])), test_size=0.2, random_state=42)
train_set = Subset(dataset['train'], train_data)
test_set = Subset(dataset['train'], test_data)
train_dataset = SMILESDataset(train_set, tokenizer)
test_dataset = SMILESDataset(test_set, tokenizer)
print(f"Number of training examples: {len(train_dataset)}")
print(f"Number of testing examples: {len(test_dataset)}")

Number of training examples: 3360
Number of testing examples: 840


In [9]:
# construct PyTorch data loaders for both train and test datasets
BATCH_SIZE = 16 # adjust based on memory constraints
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)
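
As a quick sanity check, the split sizes and the number of batches per epoch can be reproduced by hand. This sketch assumes the two behaviours it names in the comments: that `train_test_split` rounds the test share up, and that `DataLoader` keeps the last partial batch by default.

```python
import math

n_total = 3360 + 840      # sizes printed by the split cell
batch_size = 16

# 20% test share, rounded up as scikit-learn does for a float test_size
n_test = math.ceil(n_total * 20 / 100)
n_train = n_total - n_test

# DataLoader keeps the final, smaller batch unless drop_last=True
train_batches = math.ceil(n_train / batch_size)
test_batches = math.ceil(n_test / batch_size)

print(n_train, n_test, train_batches, test_batches)
```

These counts match the totals shown in the progress bars of the training and evaluation cells (210 and 53 iterations).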

## --- Step 3: Load Model ---

In [10]:
# load pre-trained model from HuggingFace
model = AutoModel.from_pretrained(MODEL_NAME, deterministic_eval=True, trust_remote_code=True)

In [11]:
# We need to add a regression head on top of the language model, as we are doing a regression task.

# specify model with a regression head
class MoLFormerWithRegressionHead(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        # single linear layer mapping the pooled embedding to one scalar prediction
        self.regression_head = nn.Linear(base_model.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.base_model(input_ids, attention_mask=attention_mask)
        # alternative: pooled_output = torch.mean(outputs.last_hidden_state, dim=1)
        sequence_output = outputs[0]  # last hidden state: (batch, seq_len, hidden_size)
        pooled_output = sequence_output[:, 0, :]  # take the first token (used like [CLS])
        return self.regression_head(pooled_output)
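
The commented-out mean pooling in the class averages over every position, including padding. Since the inputs here are padded to a fixed length, a mean that respects the attention mask may be a more meaningful alternative to first-token pooling. The sketch below compares the two on a toy tensor. The helper names `first_token_pool` and `masked_mean_pool` are illustrative and not part of the notebook or of `transformers`.

```python
import torch

def first_token_pool(hidden, attention_mask=None):
    """Use the hidden state of the first token of each sequence."""
    return hidden[:, 0, :]

def masked_mean_pool(hidden, attention_mask):
    """Average hidden states over real tokens only, ignoring padding."""
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)   # (batch, seq, 1)
    summed = (hidden * mask).sum(dim=1)                     # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                 # avoid division by zero
    return summed / counts

# Toy batch: 2 sequences, 4 positions, hidden size 3.
# The second sequence has two padding positions at the end.
hidden = torch.arange(24, dtype=torch.float).reshape(2, 4, 3)
attention_mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])

print(first_token_pool(hidden).shape)
print(masked_mean_pool(hidden, attention_mask))
print(hidden.mean(dim=1))   # naive mean, which also averages the padded positions
```

For the second sequence the naive mean and the masked mean differ, which is the effect padding would have on the pooled embedding.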

In [12]:
# initialize the regression model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
regression_model = MoLFormerWithRegressionHead(model).to(device)

## --- Step 4: Training ---

In [13]:
# Training loop
num_epochs = 5
optimizer = torch.optim.AdamW(regression_model.parameters(), lr=5e-5)
criterion = nn.MSELoss()
regression_model.train()

for epoch in range(num_epochs):
    train_loss = 0.0

    for batch in tqdm(train_loader):
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = regression_model(input_ids, attention_mask=attention_mask)
        loss = criterion(outputs.squeeze(-1), labels)  # (batch, 1) -> (batch,), also safe for a batch of size 1

        loss.backward()
        optimizer.step()
        train_loss += loss.item() * labels.shape[0]

    print(f'Epoch {epoch + 1}, Loss: {train_loss / len(train_dataset)}')

Epoch 1, Loss: 0.9877884942860831
Epoch 2, Loss: 0.5129356546770959
Epoch 3, Loss: 0.350424035461176
Epoch 4, Loss: 0.27441548107280617
Epoch 5, Loss: 0.2287725250990618
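
The loop multiplies each batch loss by the batch size before summing, then divides by the dataset size. The toy example below (made-up error values, no model involved) sketches why: weighting by batch size recovers the mean over all examples, while a plain average of batch means gives too much weight to a smaller final batch.

```python
# Per-example squared errors for a toy dataset of 5 examples (made-up numbers).
errors = [1.0, 3.0, 2.0, 2.0, 7.0]
batch_size = 2
batches = [errors[i:i + batch_size] for i in range(0, len(errors), batch_size)]

# A mean-reduced loss returns one average per batch.
batch_means = [sum(b) / len(b) for b in batches]

weighted = sum(m * len(b) for m, b in zip(batch_means, batches)) / len(errors)
unweighted = sum(batch_means) / len(batch_means)

print(weighted)     # equals the mean over all 5 examples
print(unweighted)   # skewed by the final batch, which holds a single example
```

In this notebook the training set divides evenly into batches of 16, so both averages would agree there. The test set does not (its last batch holds 8 examples), which is where the weighting matters.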


## --- Step 5: Evaluation ---

In [14]:
# Evaluation
regression_model.eval()
total_loss = 0
with torch.no_grad():
    for batch in tqdm(test_loader):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = regression_model(input_ids, attention_mask=attention_mask)
        loss = criterion(outputs.squeeze(-1), labels)  # (batch, 1) -> (batch,), also safe for a batch of size 1
        total_loss += loss.item() * labels.shape[0]

print(f'Test Loss: {total_loss / len(test_dataset)}')

Test Loss: 0.4812660422527565
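
The test loss above is a mean squared error, so it is expressed in squared label units. Taking the square root gives the root mean squared error (RMSE), which is on the same scale as the lipophilicity labels and is the form in which regression results on MoleculeNet-style benchmarks are commonly reported. A minimal conversion, using only the value printed above:

```python
import math

test_mse = 0.4812660422527565   # "Test Loss" printed by the evaluation cell
test_rmse = math.sqrt(test_mse)
print(f"Test RMSE: {test_rmse:.3f}")
```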


# 2. Add Unsupervised Fine-tuning
In this step, you will perform unsupervised fine-tuning on the training dataset. This means the model will leverage only the SMILES strings without any corresponding labels to adapt its understanding of the data distribution. By familiarizing the model with the patterns and structure of the SMILES strings, you can potentially enhance its performance on downstream supervised tasks.

For this fine-tuning, you will use the Masked Language Modeling (MLM) objective, where the model learns to predict randomly masked tokens within the input sequence. Remember to save the fine-tuned model for later use.
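
To make the MLM objective concrete, the sketch below imitates the usual BERT-style masking in plain Python. Each ordinary token is selected with probability 0.15. A selected token is replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged otherwise. Unselected positions get the label -100, which the cross-entropy loss ignores. The function `mask_tokens` and all token ids are hypothetical; in practice a data collator from `transformers` does this on tensors.

```python
import random

def mask_tokens(token_ids, special_ids, mask_id, vocab_size,
                mlm_probability=0.15, seed=0):
    """Toy version of BERT-style masking; returns (inputs, labels)."""
    rng = random.Random(seed)
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        # special tokens (padding, start, end) are never selected
        if tok in special_ids or rng.random() >= mlm_probability:
            labels.append(-100)        # position is ignored by the loss
            continue
        labels.append(tok)             # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                      # 80%: mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)    # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels

# Hypothetical ids: 0 = padding, 1 = start, 2 = end, 3 = mask, 4..49 = ordinary tokens.
id_rng = random.Random(1)
tokens = [1] + [id_rng.randrange(4, 50) for _ in range(30)] + [2, 0, 0]
inputs, labels = mask_tokens(tokens, special_ids={0, 1, 2}, mask_id=3, vocab_size=50)

n_selected = sum(label != -100 for label in labels)
print(f"{n_selected} of {len(tokens)} positions contribute to the loss")
```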


In [20]:
# Unsupervised fine-tuning with MLM
from transformers import get_scheduler

mlm_model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME,
                                                 deterministic_eval=True,
                                                 trust_remote_code=True).to(device)
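# the collator masks 15% of the tokens and builds MLM labels (-100 at unmasked positions)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
# note: the regression 'labels' returned by SMILESDataset are overwritten by the collator's MLM labels,
# so the lipophilicity targets are not used in this step
train_mlm_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=data_collator)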

mlm_model.train()
optimizer = torch.optim.AdamW(mlm_model.parameters(), lr=1e-4)
num_epochs = 10
num_training_steps = num_epochs * len(train_mlm_loader)
scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

for epoch in range(num_epochs):
    total_loss = 0.0

    for batch in tqdm(train_mlm_loader):
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = mlm_model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss

        loss.backward()
        optimizer.step()
        scheduler.step()
        total_loss += loss.item() * input_ids.shape[0]

    print(f'Epoch {epoch + 1}, Loss: {total_loss / len(train_dataset)}')

# Save the fine-tuned model (done in a separate cell below)
# mlm_model.save_pretrained('fine_tuned_model')

Epoch 1, Loss: 0.2730441145244099
Epoch 2, Loss: 0.20019716270977542
Epoch 3, Loss: 0.17514654167351268
Epoch 4, Loss: 0.16167072728276252
Epoch 5, Loss: 0.14682077392935752
Epoch 6, Loss: 0.13777706223052172
Epoch 7, Loss: 0.1358767557445736
Epoch 8, Loss: 0.13323864059611445
Epoch 9, Loss: 0.11855115123270522
Epoch 10, Loss: 0.11542394618015914
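
The `"linear"` schedule used in the cell above decays the learning rate from its initial value to zero over all training steps (here 10 epochs of 210 batches, so 2100 steps, with no warmup). The function below is a hand-written sketch of that rule for illustration. `linear_lr` is not a library function, and the real scheduler applies the same kind of multiplier inside the optimizer.

```python
def linear_lr(step, base_lr=1e-4, num_training_steps=2100, num_warmup_steps=0):
    """Learning rate after `step` optimizer updates under linear warmup and decay."""
    if step < num_warmup_steps:
        # linear ramp from 0 up to base_lr during warmup
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)

for step in (0, 525, 1050, 2100):
    print(step, linear_lr(step))
```

Under this schedule the later epochs take much smaller steps, which is consistent with the loss flattening out in the second half of the run.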


In [27]:
# remove any checkpoint left over from a previous run before saving
!rm -rf ./fine_tuned_model

In [28]:
mlm_model.save_pretrained('fine_tuned_model')

# 3. Fine-Tune for Comparison
After performing unsupervised fine-tuning on the training data, we now fine-tune the model on the regression task with the regression head. By comparing the performance of the model before and after unsupervised fine-tuning, you can evaluate how the unsupervised fine-tuning impacts the model's performance on our target task.
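
For the comparison to be meaningful, both models should be scored in exactly the same way. One option is to factor the evaluation loop into a helper and call it for each model. The sketch below assumes batches are dictionaries with `input_ids`, `attention_mask` and `labels`, as produced by the loaders in this notebook. `evaluate_mse`, `ConstantModel` and `toy_loader` are illustrative names introduced here, and the toy loader is only there to make the snippet runnable on its own.

```python
import torch
import torch.nn as nn

def evaluate_mse(model, loader, device="cpu"):
    """Mean squared error per example over all batches in `loader`."""
    criterion = nn.MSELoss(reduction="sum")   # sum, so batches of any size combine exactly
    model.eval()
    total, count = 0.0, 0
    with torch.no_grad():
        for batch in loader:
            preds = model(batch["input_ids"].to(device),
                          attention_mask=batch["attention_mask"].to(device))
            labels = batch["labels"].to(device)
            total += criterion(preds.squeeze(-1), labels).item()
            count += labels.shape[0]
    return total / count

class ConstantModel(nn.Module):
    """Stand-in model (hypothetical) that always predicts 1.0."""
    def forward(self, input_ids, attention_mask=None):
        return torch.ones(input_ids.shape[0], 1)

# Two toy batches of different sizes with made-up labels.
toy_loader = [
    {"input_ids": torch.zeros(2, 4, dtype=torch.long),
     "attention_mask": torch.ones(2, 4, dtype=torch.long),
     "labels": torch.tensor([1.0, 3.0])},
    {"input_ids": torch.zeros(1, 4, dtype=torch.long),
     "attention_mask": torch.ones(1, 4, dtype=torch.long),
     "labels": torch.tensor([2.0])},
]
print(evaluate_mse(ConstantModel(), toy_loader))
```

In the notebook, the same helper could be called once with `regression_model` and once with the model fine-tuned after the MLM step, both on `test_loader`, so the two test losses are directly comparable.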


In [1]:
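# Fine-tune the regression model after unsupervised fine-tuning
# NOTE: this cell relies on the imports, MoLFormerWithRegressionHead, device and the
# data loaders defined in the earlier cells; run those first after a kernel restart,
# otherwise it fails with a NameError.
loaded_model = AutoModel.from_pretrained('./fine_tuned_model', trust_remote_code=True)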
sup_model = MoLFormerWithRegressionHead(loaded_model).to(device)
optimizer = torch.optim.AdamW(sup_model.parameters(), lr=5e-5)
criterion = nn.MSELoss()
num_epochs = 5
sup_model.train()

for epoch in range(num_epochs):
    train_loss = 0.0

    for batch in tqdm(train_loader):
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = sup_model(input_ids, attention_mask=attention_mask)
        loss = criterion(outputs.squeeze(-1), labels)  # (batch, 1) -> (batch,), also safe for a batch of size 1

        loss.backward()
        optimizer.step()
        train_loss += loss.item() * labels.shape[0]

    print(f'Epoch {epoch + 1}, Loss: {train_loss/len(train_dataset)}')


sup_model.eval()
total_loss = 0
with torch.no_grad():
    for batch in tqdm(test_loader):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = sup_model(input_ids, attention_mask=attention_mask)
        loss = criterion(outputs.squeeze(-1), labels)  # (batch, 1) -> (batch,), also safe for a batch of size 1
        total_loss += loss.item() * labels.shape[0]

print(f'Test Loss: {total_loss / len(test_dataset)}')

NameError: name 'AutoModel' is not defined