# Ligand ADMET and Potency (Property Prediction)

The [ADMET](https://polarishub.io/competitions/asap-discovery/antiviral-admet-2025) and [Potency](https://polarishub.io/competitions/asap-discovery/antiviral-potency-2025) Challenges of the [ASAP Discovery competition](https://polarishub.io/blog/antiviral-competition) take the form of a property prediction task. Given the SMILES (or, more precisely, the CXSMILES) of a molecule, you are asked to predict its numerical properties. This is a relatively straightforward application of ML, and this notebook will quickly get you up and running!

To begin with, choose one of the two challenges! The code looks largely the same for both.

In [1]:
CHALLENGE = "antiviral-potency-2025"

## Load the competition

Let's first load the competition from Polaris.

Make sure you are logged in! If not, simply run `polaris login` and follow the instructions. 

In [2]:
import polaris as po

competition = po.load_competition(f"asap-discovery/{CHALLENGE}")

  from .autonotebook import tqdm as notebook_tqdm


As suggested in the logs, we'll cache the dataset. Note that this is not strictly necessary, but it does speed up later steps.

In [3]:
competition.cache()

'/home/stas/.cache/polaris/datasets/aa42414a-4768-4974-bfe4-2bdb9388c0de'

Let's get the train and test set and take a look at the data structure.

In [4]:
train, test = competition.get_train_test_split()

In [5]:
train[0]

('COC[C@]1(C)C(=O)N(C2=CN=CC3=CC=CC=C23)C(=O)N1C |&1:3|',
 {'pIC50 (MERS-CoV Mpro)': 4.19, 'pIC50 (SARS-CoV-2 Mpro)': nan})
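
The trailing `|&1:3|` in the input above is what makes this a CXSMILES rather than a plain SMILES. As far as we understand the ChemAxon convention, it marks atom index 3 (the `[C@]` center) as part of an "and" enhanced-stereo group, i.e. a mixture of both configurations at that center. Some featurizers only expect the core SMILES, so here is a minimal sketch for separating the two parts. The helper name `split_cxsmiles` is ours, not part of Polaris.

```python
def split_cxsmiles(cxsmiles: str) -> tuple[str, str]:
    """Split a CXSMILES string into its core SMILES and its extension block.

    Hypothetical helper: a plain SMILES contains no spaces, so everything
    after the first space is treated as the extension block.
    """
    core, _, extension = cxsmiles.partition(" ")
    return core, extension.strip()


core, ext = split_cxsmiles("COC[C@]1(C)C(=O)N(C2=CN=CC3=CC=CC=C23)C(=O)N1C |&1:3|")
print(core)  # the core SMILES
print(ext)   # the extension block, empty for a plain SMILES
```

Dropping the extension discards the enhanced-stereo information, so whether to do so is a modeling choice rather than a required preprocessing step.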

In [6]:
test[0]

'C=CC(=O)NC1=CC=CC(N(CC2=CC=CC(Cl)=C2)C(=O)CC2=CN=CC3=CC=CC=C23)=C1'

In [7]:
train.targets

{'pIC50 (MERS-CoV Mpro)': array([4.19, 4.92, 4.73, ..., 4.22, 4.4 , 4.22]),
 'pIC50 (SARS-CoV-2 Mpro)': array([ nan, 5.29,  nan, ...,  nan, 5.06,  nan])}
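
The `nan` entries show that not every molecule has a measurement for every target. Before modeling, it can help to count how many labels each target actually has. The sketch below uses a small toy dictionary as a stand-in for `train.targets`. The values are illustrative, and `label_coverage` is a hypothetical helper.

```python
import numpy as np

# Toy stand-in for `train.targets` (illustrative values, not the real dataset).
targets = {
    "pIC50 (MERS-CoV Mpro)": np.array([4.19, 4.92, 4.73, 4.22]),
    "pIC50 (SARS-CoV-2 Mpro)": np.array([np.nan, 5.29, np.nan, 5.06]),
}


def label_coverage(targets):
    """Return the number of non-NaN labels per target."""
    return {name: int(np.sum(~np.isnan(values))) for name, values in targets.items()}


coverage = label_coverage(targets)
for name, count in coverage.items():
    print(f"{name}: {count} labelled molecules")
```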

### Raw data dump
We've decided to sacrifice the completeness of the scientific data to improve its ease of use. For those who are interested, the raw data dump from which this dataset was created is also available.

In [None]:
import fsspec
import zipfile
import io

# Read the entire file into memory
with fsspec.open("https://fs.polarishub.io/2025-01-asap-discovery/raw_data_package.zip", block_size=0) as fd:
    file_data = fd.read()

# Use BytesIO to make it seekable
with zipfile.ZipFile(io.BytesIO(file_data), "r") as zip_ref:
    zip_ref.extractall("./raw_data_package/")

In [16]:
import pandas as pd
from pathlib import Path

subdir = "admet" if CHALLENGE == "antiviral-admet-2025" else "potency"

path = Path("./raw_data_package")
path = path / subdir

csv_files = list(path.glob("*.csv"))
pd.read_csv(csv_files[0]).head(3)

Unnamed: 0,SARS-CoV-2-MPro_fluorescence-dose-response_weizmann: IC50 (µM),SARS-CoV-2-MPro_fluorescence-dose-response_weizmann: IC50 CI (Lower) (µM),SARS-CoV-2-MPro_fluorescence-dose-response_weizmann: IC50 CI (Upper) (µM),SARS-CoV-2-MPro_fluorescence-dose-response_weizmann: Hill slope,SARS-CoV-2-MPro_fluorescence-dose-response_weizmann: pIC50 (log10M),Molecule Name,CXSMILES (CDD Compatible),Batch Created Date
0,13.44,12.303,14.682,1.045,4.87,ASAP-0029418,O=C(CC1=CN=CC2=CC=CC=C12)N1CCC(C2=CC=NO2)CC1,2024-07-08
1,7.993,7.024,9.096,1.03,5.1,ASAP-0029417,O=C(CC1=CN=CC2=CC=CC=C12)N1CCC[C@H](C2=CC=CC(F...,2024-07-08
2,48.046,43.21,53.424,1.114,4.32,ASAP-0029414,O=C(CC1=CN=CC2=CC=CC=C12)N1CCCC[C@H]1CC(F)(F)F...,2024-07-08
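
The raw file reports both an IC50 in µM and a pIC50. The two are related by the standard definition pIC50 = -log10(IC50 in mol/L). As a quick sanity check, the sketch below recomputes the pIC50 for the three rows shown above. The function name is ours.

```python
import math


def ic50_um_to_pic50(ic50_um: float) -> float:
    """Convert an IC50 given in micromolar to pIC50 = -log10(IC50 in mol/L)."""
    return -math.log10(ic50_um * 1e-6)  # 1 µM = 1e-6 mol/L


# IC50 values (µM) taken from the three rows printed above.
recomputed = [round(ic50_um_to_pic50(v), 2) for v in (13.44, 7.993, 48.046)]
print(recomputed)
```

Rounded to two decimals, the results agree with the pIC50 column of the raw data (4.87, 5.1 and 4.32).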


## Build a model
Next, we'll train a simple baseline model using scikit-learn. 

You'll notice that the challenge has multiple targets.

In [8]:
train.target_cols

['pIC50 (MERS-CoV Mpro)', 'pIC50 (SARS-CoV-2 Mpro)']

An interesting idea would be to build a multi-task model to leverage shared information across tasks.

For the sake of simplicity, however, we'll build one model per target here.
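
If you do want to explore the multi-task route, one ingredient you will likely need is a loss that ignores missing labels, since the target matrix contains NaNs. The following is a minimal NumPy sketch of such a masked mean squared error. It is only an illustration of the idea, not something this notebook or Polaris provides.

```python
import numpy as np


def masked_mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean squared error computed only over entries where a label exists."""
    mask = ~np.isnan(y_true)
    # Replace differences at missing labels with 0 so they do not contribute.
    diff = np.where(mask, y_pred - y_true, 0.0)
    return float(np.sum(diff**2) / np.sum(mask))


# Two molecules, two targets; the first molecule lacks a label for target 2.
y_true = np.array([[4.0, np.nan], [5.0, 6.0]])
y_pred = np.array([[4.5, 1.0], [5.0, 5.0]])
loss = masked_mse(y_true, y_pred)
print(loss)
```

Here only three of the four entries are labelled, so the squared errors (0.25, 0 and 1) are averaged over three values.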

In [9]:
# Load ChemBERTa through the Hugging Face feature-extraction pipeline
from transformers import pipeline

pipe = pipeline("feature-extraction", model="seyonec/ChemBERTa-zinc-base-v1")

Device set to use cpu


In [None]:
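# Featurize each training molecule, keeping the embedding of the first token.
# Molecules that fail to featurize are reported and skipped.
feats = []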
for num, (x, _) in enumerate(train):
    try:
        feat = pipe(x)
        feats.append(feat[0][0])
    except Exception as e:
        print(f"Error processing input {x} at index {num}: {e}")
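
For a single input string, the feature-extraction pipeline returns a nested list laid out as `[1][num_tokens][hidden_size]`, so `feat[0][0]` keeps only the first token's embedding. Averaging over all tokens is a common alternative. The sketch below contrasts the two on a small fake output, so it runs without downloading the model. The numbers and helper names are made up for illustration.

```python
import numpy as np

# Fake pipeline output with the assumed layout [batch=1][tokens=3][hidden=4].
fake_output = [[[1.0, 0.0, 2.0, 0.0],
                [3.0, 0.0, 2.0, 4.0],
                [5.0, 3.0, 2.0, 2.0]]]


def first_token_embedding(output):
    """Embedding of the first token, as used in this notebook."""
    return np.asarray(output[0][0])


def mean_pooled_embedding(output):
    """Average of all token embeddings (an alternative pooling choice)."""
    return np.asarray(output[0]).mean(axis=0)


first = first_token_embedding(fake_output)
pooled = mean_pooled_embedding(fake_output)
print(first, pooled)
```

Which pooling works better for this task is an empirical question. Both produce one fixed-length vector per molecule, which is what the regressor needs.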

In [None]:
import numpy as np

from sklearn.ensemble import GradientBoostingRegressor

# Use ChemBERTa to featurize the SMILES strings
X_train_dl = np.array([pipe(x)[0][0] for x in train.X])
X_test_dl = np.array([pipe(x)[0][0] for x in test.X])

y_pred_dl_test = {}
y_pred_dl_train = {}

# For each of the targets...
for tgt in competition.target_cols:

    # We get the training targets
    # Note that we need to mask out NaNs since the multi-task matrix is sparse.
    y_true = train.y[tgt]
    mask = ~np.isnan(y_true)

    # We'll train a simple baseline model
    model_dl = GradientBoostingRegressor()
    model_dl.fit(X_train_dl[mask], y_true[mask])

    # And then use that to predict the targets for both train and test sets
    y_pred_dl_train[tgt] = model_dl.predict(X_train_dl)
    y_pred_dl_test[tgt] = model_dl.predict(X_test_dl)

In [1]:
import datamol as dm
import numpy as np

from sklearn.ensemble import GradientBoostingRegressor

# Prepare the input data. We'll use Datamol to compute the ECFP fingerprints for both the train and test sets.
X_train = np.array([dm.to_fp(dm.to_mol(smi)) for smi in train.X])
X_test = np.array([dm.to_fp(dm.to_mol(smi)) for smi in test.X])

y_pred_test = {}
y_pred_train = {}

# For each of the targets...
for tgt in competition.target_cols:

    # We get the training targets
    # Note that we need to mask out NaNs since the multi-task matrix is sparse.
    y_true = train.y[tgt]
    mask = ~np.isnan(y_true)

    # We'll train a simple baseline model
    model = GradientBoostingRegressor()
    model.fit(X_train[mask], y_true[mask])

    # And then use that to predict the targets for both train and test set
    y_pred_train[tgt] = model.predict(X_train)
    y_pred_test[tgt] = model.predict(X_test)


In [61]:
from evaluation import eval_potency

eval_dl = eval_potency(y_pred_dl_train, train.y)
eval_base = eval_potency(y_pred_train, train.y)
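
Keep in mind that these scores are computed on the training set itself, so they tend to look better than what a model achieves on unseen molecules. A simple way to get a more honest estimate is to hold out part of the training data. The sketch below shows the pattern on synthetic data. The random features and labels are placeholders, and a random split is only one of several possible splitting strategies.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic placeholder data standing in for features and one target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = X[:, 0] + 0.1 * rng.normal(size=200)

# Hold out 20% of the labelled data for validation.
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X_fit, y_fit)

mae_fit = mean_absolute_error(y_fit, model.predict(X_fit))
mae_val = mean_absolute_error(y_val, model.predict(X_val))
print(f"MAE on fitted data: {mae_fit:.3f}, MAE on held-out data: {mae_val:.3f}")
```

With the real data, you would apply the same split to the feature matrix and to each target after masking out the NaNs.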

In [64]:
from pprint import pprint

print("ChemBERTa-based-features model:")
pprint(dict(eval_dl))

print("\nBaseline model:")
pprint(dict(eval_base))

ChemBERTa-based-features model:
{'aggregated': {'macro_mean_absolute_error': 0.3064593839816264,
                'macro_r2': 0.8245691040584464},
 'pIC50 (MERS-CoV Mpro)': {'kendall_tau': 0.6342071144863535,
                           'mean_absolute_error': 0.3068769883569856,
                           'r2': 0.7940740078713294},
 'pIC50 (SARS-CoV-2 Mpro)': {'kendall_tau': 0.7499696539838778,
                             'mean_absolute_error': 0.30604177960626716,
                             'r2': 0.8550642002455634}}

Baseline model:
{'aggregated': {'macro_mean_absolute_error': 0.3679454856023733,
                'macro_r2': 0.7456944613851981},
 'pIC50 (MERS-CoV Mpro)': {'kendall_tau': 0.5366452225196361,
                           'mean_absolute_error': 0.3743752757756639,
                           'r2': 0.6910242335961272},
 'pIC50 (SARS-CoV-2 Mpro)': {'kendall_tau': 0.6894310979197656,
                             'mean_absolute_error': 0.36151569542908263,
                     

## Submit your predictions
Submitting your predictions to the competition is simple.

In [None]:
competition.submit_predictions(
    predictions=y_pred_dl_test,
    prediction_name="ChemBERTa-based-features",
    prediction_owner="stanislav-chekmenev",
    report_url="https://www.example.com",
    # The below metadata is optional, but recommended.
    github_url="https://github.com/polaris-hub/polaris",
    description="Adding ChemBERTa-based features to a simple Gradient Boosting model",
    tags=["tutorial", "Potency"],
    user_attributes={
        "Framework": "Scikit-learn",
        "Method": "Gradient Boosting",
        "Experiment": "ChemBERTa-based features",
    },
)
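
Before submitting, it can be worth checking the shape of the predictions dictionary locally. The sketch below assumes that a submission needs one finite prediction per test molecule for every target. `check_predictions` is a hypothetical helper of ours, not a Polaris function.

```python
import numpy as np


def check_predictions(predictions, target_cols, n_test):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    for tgt in target_cols:
        if tgt not in predictions:
            problems.append(f"missing target: {tgt}")
            continue
        values = np.asarray(predictions[tgt], dtype=float)
        if values.shape != (n_test,):
            problems.append(f"{tgt}: expected shape ({n_test},), got {values.shape}")
        elif np.isnan(values).any():
            problems.append(f"{tgt}: contains NaN predictions")
    return problems


# Toy example with two targets and three test molecules.
cols = ["pIC50 (MERS-CoV Mpro)", "pIC50 (SARS-CoV-2 Mpro)"]
good = {cols[0]: np.array([4.1, 4.5, 5.0]), cols[1]: np.array([5.2, 4.9, 5.1])}
bad = {cols[0]: np.array([4.1, np.nan, 5.0])}

print(check_predictions(good, cols, n_test=3))
print(check_predictions(bad, cols, n_test=3))
```

In the notebook, you could call it as `check_predictions(y_pred_dl_test, competition.target_cols, len(test.X))`, assuming `test.X` holds all test inputs as it does in the featurization cells.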