# Ligand ADMET and Potency (Property Prediction)

The [ADMET](https://polarishub.io/competitions/asap-discovery/antiviral-admet-2025) and [Potency](https://polarishub.io/competitions/asap-discovery/antiviral-potency-2025) Challenges of the [ASAP Discovery competition](https://polarishub.io/blog/antiviral-competition) take the shape of property prediction tasks. Given the SMILES (or, to be more precise, the CXSMILES) of a molecule, you are asked to predict its numerical properties. This is a relatively straightforward application of ML, and this notebook will quickly get you up and running!
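
To make the SMILES versus CXSMILES distinction concrete, here is a minimal sketch that separates the plain SMILES core from the trailing extension block. It assumes the extension, when present, follows the first whitespace and is wrapped in `|` characters (for example `|&1:3|`, which appears to mark an enhanced stereo group). The helper name `split_cxsmiles` is hypothetical and not part of Polaris or RDKit.

```python
def split_cxsmiles(cxsmiles: str) -> tuple[str, str]:
    """Split a CXSMILES string into (core SMILES, extension block).

    Hypothetical helper: assumes the extension, if present, follows the
    first whitespace and is wrapped in '|' characters.
    """
    core, _, extension = cxsmiles.strip().partition(" ")
    return core, extension.strip()


# Example string taken from the training data shown later in this notebook.
core, ext = split_cxsmiles("COC[C@]1(C)C(=O)N(C2=CN=CC3=CC=CC=C23)C(=O)N1C |&1:3|")
print(core)  # COC[C@]1(C)C(=O)N(C2=CN=CC3=CC=CC=C23)C(=O)N1C
print(ext)   # |&1:3|
```

Tools that only understand plain SMILES will typically see just the core, so any stereo information carried by the extension is lost unless it is handled explicitly.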

To begin with, choose one of the two challenges! The code will look the same for both. 

In [1]:
import polaris as po
from polaris.hub.client import PolarisHubClient
import csv

In [2]:
client = PolarisHubClient()
client.login()

In [3]:
CHALLENGE = "antiviral-potency-2025"  # or: "antiviral-admet-2025"

## Load the competition

Let's first load the competition from Polaris.

Make sure you are logged in! If not, simply run `polaris login` and follow the instructions. 

In [4]:
import polaris as po

competition = po.load_competition(f"asap-discovery/{CHALLENGE}")

As suggested in the logs, we'll cache the dataset. Note that this is not strictly necessary, but it does speed up later steps.

In [5]:
competition.cache()

'/home/valerij/.cache/polaris/datasets/3395bbb1-34d7-42ac-b7bd-a608427a2891'

Let's get the train and test set and take a look at the data structure.

In [6]:
train, test = competition.get_train_test_split()

In [7]:
train[10]

('CN1C(=O)N(C2=CN=CC3=CC=CC=C23)C(=O)C12CN(S(C)(=O)=O)C2',
 {'pIC50 (SARS-CoV-2 Mpro)': nan, 'pIC50 (MERS-CoV Mpro)': 4.69})

In [8]:
print(type(train))

<class 'polaris.dataset._subset.Subset'>


In [9]:
train[:1]

(array(['COC[C@]1(C)C(=O)N(C2=CN=CC3=CC=CC=C23)C(=O)N1C |&1:3|'],
       dtype=object),
 {'pIC50 (SARS-CoV-2 Mpro)': array([nan]),
  'pIC50 (MERS-CoV Mpro)': array([4.19])})

In [10]:
import pandas as pd

smiles_list_test = []

# Iterate through test (which yields SMILES only) and populate the list
for t in test:
    smiles_list_test.append(t)

# Create the DataFrame
potency_df_test = pd.DataFrame({
    'SMILES': smiles_list_test,
})

In [11]:
potency_df_test

Unnamed: 0,SMILES
0,C=CC(=O)NC1=CC=CC(N(CC2=CC=CC(Cl)=C2)C(=O)CC2=...
1,CNC(=O)CN1C[C@@]2(C(=O)N(C3=CN=CC4=CC=CC=C34)C...
2,CNC(=O)CN1C[C@]2(CCN(C3=CN=CC4=CC=C(OC[C@H](O)...
3,CNC(=O)CN1C[C@@]2(C(=O)N(C3=CN=CC4=CC=CC=C34)C...
4,CNC(=O)CN1C[C@@]2(C(=O)N(C3=CN=CC4=CC=CC=C34)C...
...,...
292,O=C(CC1=CN=CC2=CC=CC=C12)N1CCCC[C@H]1[C@H]1CCC...
293,O=C(CC1=CN=CC2=CC=CC=C12)N1CCCC[C@H]1[C@H]1CCC...
294,O=C(CC1=CN=CC2=CC=CC=C12)N1CCC[C@H]2CCCC[C@@H]...
295,COC1=CC=CC=C1[C@H]1C[C@H](C)CCN1C(=O)CC1=CN=CC...


In [12]:
import pandas as pd
# Initialize lists to hold data for DataFrame
smiles_list = []
sars_list = []
mers_list = []

# Iterate through train and populate the lists
for t in train:
    smiles_list.append(t[0])
    sars_list.append(t[1].get('pIC50 (SARS-CoV-2 Mpro)', None))
    mers_list.append(t[1].get('pIC50 (MERS-CoV Mpro)', None))

# Create the DataFrame
potency_df_sars = pd.DataFrame({
    'SMILES': smiles_list,
    'pIC50 (SARS-CoV-2 Mpro)': sars_list,
})

# Create the DataFrame
potency_df_mers = pd.DataFrame({
    'SMILES': smiles_list,
    'pIC50 (MERS-CoV Mpro)': mers_list,
})

# Display the DataFrame
print(potency_df_sars)
print(potency_df_mers)

                                                 SMILES  \
0     COC[C@]1(C)C(=O)N(C2=CN=CC3=CC=CC=C23)C(=O)N1C...   
1     C=C(CN1CCC2=C(C=C(Cl)C=C2)C1C(=O)NC1=CN=CC2=CC...   
2     CNC(=O)CN1C[C@]2(C[C@H](C)N(C3=CN=CC=C3C3CC3)C...   
3     C=C(CN1CCC2=C(C=C(Cl)C=C2)C1C(=O)NC1=CN=CC2=CC...   
4     C=C(CN1CCC2=C(C=C(Cl)C=C2)C1C(=O)NC1=CN=CC2=CC...   
...                                                 ...   
1026  CNS(=O)(=O)OCC(=O)N1CCN(CC2=CC=CC(Cl)=C2)[C@H]...   
1027  O=C(CC1=CN=CC2=CC=CC=C12)N1CC[C@@H]2CCCC[C@H]2...   
1028  CNC(=O)[C@H]1CCCN(C(=O)CC2=CN=CC3=CC=CC=C23)C1...   
1029  C[C@H]1CCCN(C(=O)CC2=CN=CC3=CC=CC=C23)[C@H]1C ...   
1030  O=C(O)C[C@H]1CCCN(C(=O)CC2=CN=CC3=CC=CC=C23)C1...   

      pIC50 (SARS-CoV-2 Mpro)  
0                         NaN  
1                        5.29  
2                         NaN  
3                        6.11  
4                        5.62  
...                       ...  
1026                     6.38  
1027                     6.09  
102

In [13]:
print(len(train))

1031


In [14]:
print(len(test))

297


### Raw data dump
We've decided to sacrifice some of the completeness of the scientific data to improve its ease of use. If you are interested, you can also access the raw data dump from which this dataset was created.

## Build a model
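Next, we'll train a simple baseline model using QSPRpred, which wraps scikit-learn-compatible estimators (here, XGBoost's `XGBRegressor`).

You'll notice that the challenge has multiple targets. We define each one as its own regression task.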
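
Because not every molecule has a measurement for every target, treating the targets separately means each model only sees the molecules labelled for its target. The sketch below illustrates that filtering on a toy stand-in for `train`. The first two records are copied from the outputs above, the third is made up, and `labelled_subset` is a hypothetical helper.

```python
import math

SARS = "pIC50 (SARS-CoV-2 Mpro)"
MERS = "pIC50 (MERS-CoV Mpro)"

# Toy stand-in for the (SMILES, targets) pairs yielded by `train`.
records = [
    ("CN1C(=O)N(C2=CN=CC3=CC=CC=C23)C(=O)C12CN(S(C)(=O)=O)C2",
     {SARS: float("nan"), MERS: 4.69}),
    ("COC[C@]1(C)C(=O)N(C2=CN=CC3=CC=CC=C23)C(=O)N1C |&1:3|",
     {SARS: float("nan"), MERS: 4.19}),
    ("CCO", {SARS: 5.0, MERS: float("nan")}),  # made-up record
]


def labelled_subset(records, target):
    """Return (SMILES, value) pairs that have a measurement for `target`."""
    subset = []
    for smiles, targets in records:
        value = targets.get(target)
        if value is not None and not math.isnan(value):
            subset.append((smiles, value))
    return subset


sars = labelled_subset(records, SARS)
mers = labelled_subset(records, MERS)
print(len(sars), len(mers))  # 1 2
```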

In [15]:
from qsprpred.data import QSPRDataset
from sklearn.impute import SimpleImputer

In [16]:
mers_props = [
    {"name": "pIC50 (MERS-CoV Mpro)", "task": "REGRESSION"},]

In [17]:
sars_props = [
    {"name": "pIC50 (SARS-CoV-2 Mpro)", "task": "REGRESSION"},]

## SARS

In [18]:
potency_df_sars

Unnamed: 0,SMILES,pIC50 (SARS-CoV-2 Mpro)
0,COC[C@]1(C)C(=O)N(C2=CN=CC3=CC=CC=C23)C(=O)N1C...,
1,C=C(CN1CCC2=C(C=C(Cl)C=C2)C1C(=O)NC1=CN=CC2=CC...,5.29
2,CNC(=O)CN1C[C@]2(C[C@H](C)N(C3=CN=CC=C3C3CC3)C...,
3,C=C(CN1CCC2=C(C=C(Cl)C=C2)C1C(=O)NC1=CN=CC2=CC...,6.11
4,C=C(CN1CCC2=C(C=C(Cl)C=C2)C1C(=O)NC1=CN=CC2=CC...,5.62
...,...,...
1026,CNS(=O)(=O)OCC(=O)N1CCN(CC2=CC=CC(Cl)=C2)[C@H]...,6.38
1027,O=C(CC1=CN=CC2=CC=CC=C12)N1CC[C@@H]2CCCC[C@H]2...,6.09
1028,CNC(=O)[C@H]1CCCN(C(=O)CC2=CN=CC3=CC=CC=C23)C1...,
1029,C[C@H]1CCCN(C(=O)CC2=CN=CC3=CC=CC=C23)[C@H]1C ...,5.06


In [19]:
dataset_sars = QSPRDataset(
    name="MultiTaskTutorialDataset",
    df=potency_df_sars,
    target_props=sars_props,
    store_dir="sars_output_adjusted/data",
    random_state=42,
)

dataset_sars.getDF()



Unnamed: 0_level_0,SMILES,pIC50 (SARS-CoV-2 Mpro),QSPRID,pIC50 (SARS-CoV-2 Mpro)_original,Split_IsTrain
QSPRID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
MultiTaskTutorialDataset_000,C=C(CN1CCc2ccc(Cl)cc2C1C(=O)Nc1cncc2ccccc12)C(...,5.29,MultiTaskTutorialDataset_000,5.29,True
MultiTaskTutorialDataset_001,C=C(CN1CCc2ccc(Cl)cc2C1C(=O)Nc1cncc2ccccc12)C(...,6.11,MultiTaskTutorialDataset_001,6.11,True
MultiTaskTutorialDataset_002,C=C(CN1CCc2ccc(Cl)cc2C1C(=O)Nc1cncc2ccccc12)C(...,5.62,MultiTaskTutorialDataset_002,5.62,True
MultiTaskTutorialDataset_003,C=C(CN(C(=O)C1CCOc2ccc(Cl)cc21)c1cncc2ccccc12)...,6.45,MultiTaskTutorialDataset_003,6.45,True
MultiTaskTutorialDataset_004,C=C(CN(C(=O)C1CCOc2ccc(Cl)cc21)c1cncc2ccccc12)...,5.56,MultiTaskTutorialDataset_004,5.56,True
...,...,...,...,...,...
MultiTaskTutorialDataset_837,O=C(Cc1cncc2ccccc12)N1CCC([C@H]2CCOC2)CC1,4.68,MultiTaskTutorialDataset_837,4.68,True
MultiTaskTutorialDataset_838,O=C(Cc1cncc2ccccc12)N1CCCC2(CC2)C1,4.41,MultiTaskTutorialDataset_838,4.41,True
MultiTaskTutorialDataset_839,CNS(=O)(=O)OCC(=O)N1CCN(Cc2cccc(Cl)c2)[C@H]2CS...,6.38,MultiTaskTutorialDataset_839,6.38,True
MultiTaskTutorialDataset_840,O=C(Cc1cncc2ccccc12)N1CC[C@@H]2CCCC[C@H]2C1,6.09,MultiTaskTutorialDataset_840,6.09,True


In [38]:
from qsprpred.data import RandomSplit, BootstrapSplit, ScaffoldSplit, ClusterSplit
from qsprpred.data.descriptors.fingerprints import MorganFP, RDKitFP
from qsprpred.data.processing.feature_filters import LowVarianceFilter
from sklearn.preprocessing import StandardScaler

# Specify a cluster-based split for creating the train (80%) and test set (20%)
rand_split = ClusterSplit(test_fraction=0.2, dataset=dataset_sars)
# rand_split = RandomSplit(test_fraction=0.15, dataset=dataset_mers)

# calculate compound features and split dataset into train and test
dataset_sars.prepareDataset(
    split=rand_split,
    feature_calculators=[MorganFP(radius=3, nBits=1024), RDKitFP(maxPath=8, nBits=256)],
    recalculate_features=True,
    feature_filters=[LowVarianceFilter(0.001)],
    feature_standardizer=StandardScaler()
)

print(f"Number of samples train set: {len(dataset_sars.y)}")
print(f"Number of samples test set: {len(dataset_sars.y_ind)}")

dataset_sars.save()

Number of samples train set: 673
Number of samples test set: 169
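
A cluster-based split keeps groups of similar molecules on the same side of the train/test boundary, which usually makes the test set a harder and more realistic check than a random split. The sketch below shows the general idea only: it assumes cluster labels have already been computed (here they are toy strings) and greedily assigns whole clusters to the test set. It is not QSPRpred's implementation, and `cluster_holdout` is a hypothetical name.

```python
import random
from collections import defaultdict


def cluster_holdout(cluster_ids, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), keeping every cluster on one side only.

    `cluster_ids[i]` is the precomputed (here: toy) cluster label of sample i.
    Clusters are visited in shuffled order and added to the test set while
    they still fit within the requested test size.
    """
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    clusters = sorted(members)
    random.Random(seed).shuffle(clusters)

    target = round(test_fraction * len(cluster_ids))
    test_idx = []
    for cid in clusters:
        if len(test_idx) + len(members[cid]) <= target:
            test_idx.extend(members[cid])

    test_set = set(test_idx)
    train_idx = [i for i in range(len(cluster_ids)) if i not in test_set]
    return train_idx, sorted(test_idx)


# Ten samples in four toy clusters.
labels = ["a", "a", "a", "a", "b", "b", "b", "c", "c", "d"]
train_idx, test_idx = cluster_holdout(labels, test_fraction=0.2)
print(train_idx, test_idx)
```

Since clusters are never split, the realised test fraction can fall slightly below the requested one.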


In [39]:
from qsprpred.models import SklearnModel
import os
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBClassifier, XGBRegressor
from qsprpred.models import CrossValAssessor, TestSetAssessor

# Create the directory that the model below uses as its base_dir
os.makedirs("class_models/XG_models_clust_sars_adjusted", exist_ok=True)

# XGBRegressor follows the scikit-learn estimator API, so we wrap it with the SklearnModel class
model_sars = SklearnModel(
    base_dir="class_models/XG_models_clust_sars_adjusted",
    alg=XGBRegressor,
    name="XGMorgan_clustsplit"
)

import numpy as np
from qsprpred.models import GridSearchOptimization, CrossValAssessor

In [40]:
from qsprpred.models import OptunaOptimization, TestSetAssessor
from sklearn.model_selection import KFold

# Note the specification of the hyperparameter types as first item in the list
search_space = {"max_depth": ["int", 1, 20],
                "gamma": ["int", 0, 20],
                "max_delta_step": ["int", 0, 20],
                "min_child_weight": ["int", 1, 20],
                "learning_rate": ["float", 0.001, 1],
                "subsample": ["float", 0.001, 1],
                "n_estimators": ["int", 10, 250],
               }

# Optuna-based hyperparameter search (not an exhaustive grid), scored with r2 via the TestSetAssessor
gridsearcher = OptunaOptimization(
    n_trials=500,
    param_grid=search_space,
    model_assessor=TestSetAssessor(scoring='r2'),
)
gridsearcher.optimize(model_sars, dataset_sars)

[I 2025-02-26 01:33:03,638] A new study created in memory with name: no-name-1689f3fc-918c-4591-b9f3-0f660d1b69ab
[I 2025-02-26 01:33:08,742] Trial 0 finished with value: 0.1426323856748366 and parameters: {'max_depth': 8, 'gamma': 19, 'max_delta_step': 15, 'min_child_weight': 12, 'learning_rate': 0.1568626218019941, 'subsample': 0.15683852581586644, 'n_estimators': 23}. Best is trial 0 with value: 0.1426323856748366.
[I 2025-02-26 01:33:09,661] Trial 1 finished with value: 0.12314740850092776 and parameters: {'max_depth': 18, 'gamma': 12, 'max_delta_step': 14, 'min_child_weight': 1, 'learning_rate': 0.9699399423098324, 'subsample': 0.8326101981596213, 'n_estimators': 61}. Best is trial 0 with value: 0.1426323856748366.
[I 2025-02-26 01:33:11,242] Trial 2 finished with value: 0.260143188084316 and parameters: {'max_depth': 4, 'gamma': 3, 'max_delta_step': 6, 'min_child_weight': 11, 'learning_rate': 0.43251307362347363, 'subsample': 0.2919379110578439, 'n_estimators': 157}. Best is tria

{'max_depth': 19,
 'gamma': 5,
 'max_delta_step': 1,
 'min_child_weight': 7,
 'learning_rate': 0.5951214841506621,
 'subsample': 0.2834638087754667,
 'n_estimators': 184}

{'max_depth': 16,
 'gamma': 2,
 'max_delta_step': 4,
 'min_child_weight': 20,
 'learning_rate': 0.1673266351389645,
 'subsample': 0.17794938759715506,
 'n_estimators': 74}
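
The search space above encodes every hyperparameter as `[type, low, high]`. As a rough illustration of what one trial draws from such a dictionary, the sketch below samples configurations uniformly at random. This is plain random search under that assumed encoding, not the sampler Optuna actually uses, and `sample_params` is a hypothetical helper.

```python
import random

# A subset of the search space defined above.
search_space = {
    "max_depth": ["int", 1, 20],
    "learning_rate": ["float", 0.001, 1],
    "n_estimators": ["int", 10, 250],
}


def sample_params(space, rng):
    """Draw one configuration from a {name: [type, low, high]} search space."""
    params = {}
    for name, (kind, low, high) in space.items():
        if kind == "int":
            params[name] = rng.randint(low, high)  # inclusive on both ends
        elif kind == "float":
            params[name] = rng.uniform(low, high)
        else:
            raise ValueError(f"unsupported parameter type: {kind}")
    return params


rng = random.Random(0)
trials = [sample_params(search_space, rng) for _ in range(5)]
for trial in trials:
    print(trial)
```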

In [42]:
# We can now assess the model performance on the training set using cross validation
from sklearn import metrics

# List the scorer names accepted by the `scoring` arguments below
print(metrics.get_scorer_names())

CrossValAssessor(
    scoring="neg_mean_absolute_error",
    split=BootstrapSplit(split=ClusterSplit(dataset_sars), n_bootstraps=100)
)(model_sars, dataset_sars)

# and on the test set
TestSetAssessor("neg_mean_absolute_error")(model_sars, dataset_sars)

# Finally, we need to fit the model on the complete dataset if we want to use it further
model_sars.fitDataset(dataset_sars)

# and save the model
_ = model_sars.save()

['accuracy', 'adjusted_mutual_info_score', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy', 'completeness_score', 'explained_variance', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'fowlkes_mallows_score', 'homogeneity_score', 'jaccard', 'jaccard_macro', 'jaccard_micro', 'jaccard_samples', 'jaccard_weighted', 'matthews_corrcoef', 'max_error', 'mutual_info_score', 'neg_brier_score', 'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error', 'neg_mean_gamma_deviance', 'neg_mean_poisson_deviance', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_median_absolute_error', 'neg_negative_likelihood_ratio', 'neg_root_mean_squared_error', 'neg_root_mean_squared_log_error', 'normalized_mutual_info_score', 'positive_likelihood_ratio', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'rand_score', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc',
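
scikit-learn scorers follow a "greater is better" convention, which is why error metrics appear with a `neg_` prefix: `neg_mean_absolute_error` is simply the mean absolute error multiplied by -1. A minimal hand-rolled version, using made-up pIC50 values, looks like this:

```python
def neg_mean_absolute_error(y_true, y_pred):
    """Negated mean absolute error, so that larger values mean better models."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return -mae


# Toy pIC50 values, made up for illustration.
y_true = [5.0, 6.0, 7.0, 4.0]
y_pred = [5.5, 6.0, 6.0, 4.5]
score = neg_mean_absolute_error(y_true, y_pred)
print(score)  # -0.5
```

A score of -0.5 therefore means the predictions are off by half a log unit on average.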

In [43]:
from rdkit import Chem
from rdkit.Chem import MolStandardize

def read_smi_file(file_path):
    smiles_list = []
    try:
        with open(file_path, 'r') as file:
            for line in file:
                # Skip comment lines and empty lines
                if line.strip() and not line.startswith('#'):
                    # SMILES strings are typically separated by whitespace or tab
                    smiles = line.split()[0]
                    smiles_list.append(smiles)
    except FileNotFoundError:
        print("File not found.")
    return smiles_list

# Example usage to read the standardized SMILES file and create a list
file_path = 'test_potency.smi'
smiles_list = read_smi_file(file_path)
print("SMILES in the file:")
print(len(smiles_list))

SMILES in the file:
297


In [44]:
well = model_sars.predictMols(smiles_list_test)

In [45]:
print(well)

[[5.2606316]
 [6.042621 ]
 [5.5060997]
 [5.7655044]
 [6.3580527]
 [6.635112 ]
 [5.8555408]
 [6.781377 ]
 [6.5181847]
 [6.5181847]
 [6.125652 ]
 [7.2176757]
 [6.455889 ]
 [7.156018 ]
 [5.78455  ]
 [6.902244 ]
 [6.653148 ]
 [6.085179 ]
 [4.744858 ]
 [6.31968  ]
 [7.1286716]
 [7.4047213]
 [6.65365  ]
 [8.3577795]
 [5.84428  ]
 [6.5421844]
 [6.1597114]
 [7.3859096]
 [5.9935613]
 [4.979324 ]
 [6.635112 ]
 [6.978636 ]
 [6.6643333]
 [6.332336 ]
 [6.65365  ]
 [6.332336 ]
 [6.615716 ]
 [6.313798 ]
 [6.635112 ]
 [6.514045 ]
 [6.332336 ]
 [6.332336 ]
 [7.0407805]
 [7.0407805]
 [7.5038323]
 [7.5038323]
 [5.9526405]
 [5.9526405]
 [6.927808 ]
 [6.927808 ]
 [6.6376767]
 [6.635112 ]
 [6.8372865]
 [6.65365  ]
 [5.7378697]
 [6.5267005]
 [6.332336 ]
 [6.313798 ]
 [6.8294764]
 [6.509946 ]
 [6.4625707]
 [6.550733 ]
 [6.550733 ]
 [6.65365  ]
 [7.6429057]
 [6.176478 ]
 [7.365069 ]
 [7.685716 ]
 [6.635112 ]
 [6.987118 ]
 [6.332336 ]
 [7.427582 ]
 [7.427582 ]
 [6.332336 ]
 [6.65365  ]
 [6.2001977]
 [6.2001977]

In [46]:
min_length = min(len(smiles_list_test), len(well))
values = [val[0] for val in well[:min_length]] 

In [47]:
print(values)

[5.2606316, 6.042621, 5.5060997, 5.7655044, 6.3580527, 6.635112, 5.8555408, 6.781377, 6.5181847, 6.5181847, 6.125652, 7.2176757, 6.455889, 7.156018, 5.78455, 6.902244, 6.653148, 6.085179, 4.744858, 6.31968, 7.1286716, 7.4047213, 6.65365, 8.3577795, 5.84428, 6.5421844, 6.1597114, 7.3859096, 5.9935613, 4.979324, 6.635112, 6.978636, 6.6643333, 6.332336, 6.65365, 6.332336, 6.615716, 6.313798, 6.635112, 6.514045, 6.332336, 6.332336, 7.0407805, 7.0407805, 7.5038323, 7.5038323, 5.9526405, 5.9526405, 6.927808, 6.927808, 6.6376767, 6.635112, 6.8372865, 6.65365, 5.7378697, 6.5267005, 6.332336, 6.313798, 6.8294764, 6.509946, 6.4625707, 6.550733, 6.550733, 6.65365, 7.6429057, 6.176478, 7.365069, 7.685716, 6.635112, 6.987118, 6.332336, 7.427582, 7.427582, 6.332336, 6.65365, 6.2001977, 6.2001977, 7.697893, 6.65365, 7.1267576, 6.663706, 6.663706, 7.1467543, 6.332336, 6.8254404, 6.8254404, 6.635112, 7.704641, 6.532812, 6.96858, 7.427582, 8.393517, 7.427582, 7.726003, 6.998111, 7.427582, 7.427582, 7.83

In [48]:
# Write to CSV file
with open("sars_prediction_preliminary.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["SMILES", "SARS"])
    for smiles, value in zip(smiles_list_test, values):
        writer.writerow([smiles, value])

print("CSV file created successfully!")

CSV file created successfully!
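
Before uploading, it can be worth checking that the file has one row per test molecule and that every prediction is a finite number. The sketch below is one way to do that with the standard library. `check_predictions` is a hypothetical helper, and the example content uses placeholder SMILES together with the first two predicted values printed earlier.

```python
import csv
import io
import math


def check_predictions(csv_text, expected_rows, value_column="SARS"):
    """Return the parsed values, raising ValueError if the file looks wrong."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if len(rows) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {len(rows)}")
    values = [float(row[value_column]) for row in rows]
    if not all(math.isfinite(v) for v in values):
        raise ValueError("predictions contain NaN or infinite values")
    return values


# Toy file content in the same layout as the CSV written above.
example = "SMILES,SARS\nCCO,5.2606316\nCCN,6.042621\n"
print(check_predictions(example, expected_rows=2))  # [5.2606316, 6.042621]
```

In the notebook you would pass the contents of `sars_prediction_preliminary.csv` and `expected_rows=len(test)`.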


The End.