# Data splitting in model validation

Model validation involves assessing the quality of the predictions on unseen data, that is, data not used during training.

Classically in statistical learning, this "prediction set", also called "validation set" or "test set" according to context, is sampled **randomly** from all data available. This is called a random data split.

See, for instance, the many options available in [scikit-learn](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection), which can sample data randomly, stratify by target values, or generate more than one data split (cross-validation).
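
To make the idea of a stratified random split concrete, here is a minimal pure-Python sketch. It is not scikit-learn's implementation; the function name `stratified_split` and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, frac_test=0.2, seed=0):
    """Return (train_idx, test_idx), sampling frac_test from each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)  # random order within the class
        n_test = round(len(idx) * frac_test)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy imbalanced dataset: 80 actives (1) and 20 inactives (0)
labels = [1] * 80 + [0] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx), "actives in the test set")
```

Because each class is sampled separately, the test set keeps the same active/inactive ratio as the full dataset, which a plain random split only achieves on average.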

The metrics obtained from such random splits might be predictive of model performance for tasks such as digit recognition, for which not much data drift is expected, that is, future data is likely to be similar to data available during training. 

This, however, is not usually the case for cheminformatics datasets, especially when we are trying to identify novel hits for a target of interest. Random test sets tend to be "easier" to predict than actual future data, which means that quality metrics calculated on them tend to overestimate true model performance. Alternative approaches are needed to account for a possible change in the chemical space of new compounds.

Several alternative splitting methods have been developed. [DeepChem](https://github.com/deepchem/deepchem/tree/master), a Python package with molecular machine learning tools, implements some of the most widely used ones.

To use DeepChem locally on a Jupyter Notebook, follow the instructions on: https://deepchem.readthedocs.io/en/latest/get_started/installation.html#jupyter-notebook. Then, install Jupyter using:

`pip install notebook`

In [1]:
import deepchem as dc

dc.__version__

Skipped loading some Pytorch utilities, missing a dependency. No module named 'torch'


This module requires PyTorch to be installed.


No normalization for AvgIpc. Feature removed!
Skipped loading some PyTorch models, missing a dependency. No module named 'torch'
No module named 'torch'
Skipped loading modules with pytorch-geometric dependency, missing a dependency. No module named 'torch'
Skipped loading modules with pytorch-lightning dependency, missing a dependency. No module named 'torch'
Skipped loading some Jax models, missing a dependency. No module named 'jax'


'2.7.2.dev'

In [2]:
# import datasets
import pandas as pd 

# SMILES and target
url = 'https://raw.githubusercontent.com/rflameiro/Python_e_Quiminformatica/main/datasets/BBBP_curated.csv'
df_smi = pd.read_csv(url, sep=";", index_col=False)
# Fingerprints and target
url = 'https://raw.githubusercontent.com/rflameiro/Python_e_Quiminformatica/main/datasets/BBBP_morganFP_1024_radius3.csv'
df_fp = pd.read_csv(url, sep=";", index_col=False)

In [3]:
df_smi.columns

Index(['std_smiles', 'p_np'], dtype='object')

In [4]:
df_fp["ids"] = df_smi["std_smiles"]

In [5]:
df_fp.head()

Unnamed: 0,morgan_bit_0,morgan_bit_1,morgan_bit_2,morgan_bit_3,morgan_bit_4,morgan_bit_5,morgan_bit_6,morgan_bit_7,morgan_bit_8,morgan_bit_9,...,morgan_bit_1016,morgan_bit_1017,morgan_bit_1018,morgan_bit_1019,morgan_bit_1020,morgan_bit_1021,morgan_bit_1022,morgan_bit_1023,target,ids
0,0,1,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,1,Cc1onc(-c2ccccc2Cl)c1C(=O)NC1C(=O)N2C1SC(C)(C)...
1,0,1,1,0,0,1,0,0,0,0,...,0,1,0,1,0,0,0,0,1,CCN1CCN(C(=O)NC(C(=O)NC2C(=O)N3C(C(=O)O)=C(CSc...
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,CN(C)C1C(=O)C(C(=O)NCN2CCCC2)C(=O)[C@@]2(O)C(=...
3,0,0,0,0,0,1,0,0,0,0,...,0,0,0,1,0,1,0,0,1,Cc1nccn1CC1CCc2c(C1=O)c1ccccc1n2C
4,0,0,0,0,0,1,0,1,0,0,...,0,1,0,1,0,0,0,0,1,COc1ccc([C@@H]2Sc3ccccc3N(CCN(C)C)C(=O)C2OC(C)...


In [6]:
print(df_smi.shape)
print(df_fp.shape)

(1934, 2)
(1934, 1026)


In [7]:
# Create DeepChem dataset object with SMILES as the ids field
X_cols = df_fp.columns.to_list()[:-2]
dataset = dc.data.NumpyDataset.from_dataframe(df_fp, 
                                              X=X_cols, 
                                              y="target", 
                                              ids="ids")

# Random split

DeepChem has the option to create a random split, which we will use to compare with other splitting methods:

In [8]:
import deepchem as dc

splitter = dc.splits.RandomSplitter()
train_random, test_random = splitter.train_test_split(dataset)

In [9]:
train_random.get_shape()

((1547, 1024), (1547, 1), (1547, 0), (1547,))

In [10]:
test_random.get_shape()

((387, 1024), (387, 1), (387, 0), (387,))

In [11]:
# You can convert them back to pandas DataFrames
# Note that the labels of the X columns are replaced by X1, X2...
pandas_random = train_random.to_dataframe()
pandas_random.shape

(1547, 1026)

In [12]:
pandas_random.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X1017,X1018,X1019,X1020,X1021,X1022,X1023,X1024,y,ids
0,0,0,0,0,0,1,0,0,0,0,...,0,1,0,1,0,0,0,0,0,CC(=O)OCC1=C(C(=O)O)N2C(=O)C(NC(=O)CC#N)C2SC1
1,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,O=C(Nc1ccccc1)OCC1(COC(=O)Nc2ccccc2)CCCC1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,1,0,0,1,CN1CCC(=C2c3c(cccc3)Sc3c2cccc3)CC1
3,0,0,0,0,0,0,0,0,0,0,...,0,1,0,1,0,0,0,0,1,O=[N+]([O-])C1=CC=NC1NCCSCc1ncccc1Br
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,CN(C)c1nc(O)c(-c2ccccc2)o1


In [13]:
# Alternatively, export .csv
# train_random.to_csv('out.csv')

# Cluster split

Two versions of this split exist:

The first requires using a clustering algorithm that takes as input a predefined number of final clusters. If you wanted, for instance, to create an approximately 80:20 split, you could use the K-means algorithm with the number of clusters = 5 and hold out one cluster. One example from the literature is [this paper](https://pubs.rsc.org/en/content/articlelanding/2019/sc/c8sc04175j), which uses:
> *K-means clustering with K = 5 on MACCS fingerprints*

The second version uses a clustering algorithm with a predefined threshold value. In this case, we can't know beforehand the final number of clusters; we can only control whether there will be more or fewer clusters by setting the cutoff value. For more details, read [this talktorial, which uses Butina clustering](https://projects.volkamerlab.org/teachopencadd/talktorials/T005_compound_clustering.html).

In DeepChem, the cluster split available is of the second type. It employs the Butina clustering algorithm from RDKit, an algorithm optimized for clustering molecular fingerprints. The method requires RDKit to be installed and takes SMILES as input.
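
The threshold-based idea can be sketched in a few lines of pure Python. This is a simplified, Butina-style greedy clustering written for illustration, not RDKit's or DeepChem's code. The fingerprints are hypothetical sets of "on" bits, and the names `tanimoto` and `butina_like_clusters` are assumptions.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def butina_like_clusters(fps, cutoff=0.5):
    """Greedy clustering: molecules within `cutoff` Tanimoto distance of a centroid join it."""
    n = len(fps)
    # Neighbor list of every molecule at the chosen distance cutoff
    neighbors = [
        [j for j in range(n) if j != i and 1 - tanimoto(fps[i], fps[j]) <= cutoff]
        for i in range(n)
    ]
    # Molecules with the most neighbors become centroids first
    order = sorted(range(n), key=lambda i: len(neighbors[i]), reverse=True)
    assigned, clusters = set(), []
    for i in order:
        if i in assigned:
            continue
        members = [i] + [j for j in neighbors[i] if j not in assigned]
        assigned.update(members)
        clusters.append(members)
    return clusters

# Hypothetical toy fingerprints: two chemical series and one singleton
fps = [
    {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6},
    {10, 11, 12, 13}, {10, 11, 12, 14},
    {20, 21},
]
clusters = butina_like_clusters(fps)
print(clusters)
```

The number of clusters is an outcome of the cutoff, not an input: a smaller cutoff yields more, tighter clusters. Whole clusters are then assigned to either the training or the test set.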

In [14]:
import deepchem as dc

butinasplitter = dc.splits.ButinaSplitter()
train_butina, test_butina = butinasplitter.train_test_split(dataset)

In [15]:
train_butina.get_shape()

((1547, 1024), (1547, 1), (1547, 0), (1547,))

In [16]:
test_butina.get_shape()

((387, 1024), (387, 1), (387, 0), (387,))

In [17]:
train_butina.to_dataframe().head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X1017,X1018,X1019,X1020,X1021,X1022,X1023,X1024,y,ids
0,0,0,0,0,0,0,0,0,0,0,...,0,1,0,1,0,0,0,0,1,CC(=O)OCC(=O)[C@@]1(O)CC[C@H]2[C@@H]3CC=C4CC(=...
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,1,0,0,0,0,C[C@H]1C[C@H]2[C@@H]3CC=C4CC(=O)C=C[C@]4(C)[C@...
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,C[C@]12C[C@H](O)[C@H]3[C@@H](CC=C4CC(=O)C=C[C@...
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,C[C@]12CC(=O)C3[C@@H](CC=C4CC(=O)C=C[C@@]43C)[...
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,1,0,0,0,0,C[C@]12C[C@H](O)[C@@]3(F)[C@@H](CC=C4CC(=O)C=C...


# Scaffold split

This method splits the data according to Bemis-Murcko scaffolds so that molecules sharing a scaffold all end up in the training set or all in the test set, never in both. Usually, compounds belonging to the rarest scaffolds are put in the test set. This method might not be able to create perfect data splits because the result depends on the particular scaffold distribution of a dataset.

For more information, see: 

https://www.blopig.com/blog/2021/06/out-of-distribution-generalisation-and-scaffold-splitting-in-molecular-property-prediction/

https://practicalcheminformatics.blogspot.com/2023/06/getting-real-with-molecular-property.html

DeepChem's implementation requires RDKit to be installed and takes SMILES as input.
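
The grouping logic can be illustrated without RDKit. The sketch below assumes the scaffold of each molecule has already been computed (the scaffold strings are hypothetical toy values) and assigns whole groups, largest first, to the training set until the requested fraction is reached. It is a simplified illustration, not DeepChem's code, and `scaffold_group_split` is an assumed name.

```python
from collections import defaultdict

def scaffold_group_split(scaffolds, frac_train=0.8):
    """Assign whole scaffold groups to train (largest groups first) or test."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    train_cutoff = frac_train * len(scaffolds)
    train_idx, test_idx = [], []
    for members in sorted(groups.values(), key=len, reverse=True):
        if len(train_idx) + len(members) > train_cutoff:
            test_idx.extend(members)  # group would overflow the training set
        else:
            train_idx.extend(members)
    return train_idx, test_idx

# Hypothetical precomputed scaffolds for 10 molecules
scaffolds = ["c1ccccc1"] * 5 + ["c1ccncc1"] * 3 + ["C1CCCCC1", "c1ccoc1"]
train_idx, test_idx = scaffold_group_split(scaffolds)
print(train_idx, test_idx)
```

In this sketch the rare scaffolds land in the test set, and no scaffold appears on both sides. Since groups are never broken up, the sizes of the groups limit which train/test ratios are reachable.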

In [18]:
import deepchem as dc

scaffoldsplitter = dc.splits.ScaffoldSplitter()
train_scaffold, test_scaffold = scaffoldsplitter.train_test_split(dataset)

In [19]:
train_scaffold.get_shape()

((1547, 1024), (1547, 1), (1547, 0), (1547,))

In [20]:
test_scaffold.get_shape()

((387, 1024), (387, 1), (387, 0), (387,))

In [21]:
train_scaffold.to_dataframe().head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X1017,X1018,X1019,X1020,X1021,X1022,X1023,X1024,y,ids
0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,C[C@H](N)Cc1ccccc1
1,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,CS(=O)(=O)c1ccc([C@@H](O)[C@@H](CO)NC(=O)C(Cl)...
2,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,Cc1c(OCC(C)N)c(C)ccc1
3,0,1,0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,CCCC(=O)Nc1cc(C(C)=O)c(OCC(O)CNC(C)C)cc1
4,1,1,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,CCN(CC)C(=O)Nc1ccc(OCC(O)CNC(C)(C)C)c(C(C)=O)c1


# Fingerprint split

Splits are based on the Tanimoto similarity between ECFP4 fingerprints. The splitter tries to divide the data so that the molecules in each subset are as different as possible from the molecules in the other subsets.

This is similar to the "Neighbor splits" approach, in which the number of neighbors of each compound is computed and the compounds with the fewest neighbors are placed in the test set. A neighbor can be defined as a compound with similarity greater than a predefined threshold (0.5-0.7) according to some similarity metric (Tanimoto/cosine/Dice) on molecular fingerprints.

Be aware that these methods are likely to create test sets that are too hard and, consequently, the calculated metrics might underestimate true model performance.
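
A minimal sketch of the neighbor-split idea described above, using hypothetical fingerprints stored as sets of "on" bits and a Tanimoto threshold of 0.6. The function names are assumptions, and this is not DeepChem's `FingerprintSplitter` algorithm.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def neighbor_counts(fps, threshold=0.6):
    """Count, for each molecule, how many others exceed the similarity threshold."""
    counts = [0] * len(fps)
    for i in range(len(fps)):
        for j in range(i + 1, len(fps)):
            if tanimoto(fps[i], fps[j]) > threshold:
                counts[i] += 1
                counts[j] += 1
    return counts

def neighbor_split(fps, frac_test=0.2, threshold=0.6):
    """Put the molecules with the fewest neighbors in the test set."""
    counts = neighbor_counts(fps, threshold)
    order = sorted(range(len(fps)), key=lambda i: counts[i])  # fewest neighbors first
    n_test = round(frac_test * len(fps))
    return sorted(order[n_test:]), sorted(order[:n_test])

# Hypothetical toy data: four analogs and one isolated compound
fps = [{1, 2, 3, 4, 5}, {1, 2, 3, 4, 6}, {1, 2, 3, 4, 7}, {1, 2, 3, 4, 8}, {20, 21, 22}]
counts = neighbor_counts(fps)
train_idx, test_idx = neighbor_split(fps)
print(counts, train_idx, test_idx)
```

The isolated compound has no neighbors in the training set, which is exactly why test sets built this way can be harder than realistic future data.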

In [22]:
import deepchem as dc

fingerprintsplitter = dc.splits.FingerprintSplitter()
train_fp, test_fp = fingerprintsplitter.train_test_split(dataset)

In [23]:
train_fp.get_shape()

((1547, 1024), (1547, 1), (1547, 0), (1547,))

In [24]:
test_fp.get_shape()

((387, 1024), (387, 1), (387, 0), (387,))

In [25]:
train_fp.to_dataframe().head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X1017,X1018,X1019,X1020,X1021,X1022,X1023,X1024,y,ids
0,0,1,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,1,Cc1onc(-c2ccccc2Cl)c1C(=O)NC1C(=O)N2C1SC(C)(C)...
1,0,0,0,0,0,1,0,1,0,0,...,0,1,0,1,0,0,0,0,1,COc1ccc([C@@H]2Sc3ccccc3N(CCN(C)C)C(=O)C2OC(C)...
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,N=C(N)NC(=O)c1nc(Cl)c(N)nc1N
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,Nc1nnc(-c2cccc(Cl)c2Cl)c(N)n1
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,CCCC(C)C


# Temporal split

Despite also being known as time-split cross-validation, this method uses a single train/test split, with newer instances being allocated to the test set. Contrast this with scikit-learn's [Time Series Split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html), which can generate K splits for time series data.

In [Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction (2013)](https://pubs.acs.org/doi/10.1021/ci400084k), Sheridan shows that using a single split that allocates either 10, 25 or 50% of the "newest" compounds to the test set can closely match the R² for future data prediction, whereas a random split tends to be too optimistic about model performance.

This implementation from DeepChem assumes that your dataset is properly ordered, with newer compounds at the bottom of the dataset. Therefore, it is just a simple index split.
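
If your data is not already ordered, it has to be sorted by date before an index split is applied. A minimal sketch, assuming each record carries a hypothetical `date` field (the records and the `temporal_split` helper are made up for illustration):

```python
from datetime import date

def temporal_split(records, frac_train=0.8):
    """Sort by date, then put the newest records in the test set."""
    ordered = sorted(records, key=lambda r: r["date"])
    n_train = int(frac_train * len(ordered))
    return ordered[:n_train], ordered[n_train:]

# Hypothetical records, deliberately out of chronological order
records = [
    {"smiles": "CCO", "date": date(2021, 5, 1)},
    {"smiles": "c1ccccc1", "date": date(2019, 1, 15)},
    {"smiles": "CC(=O)O", "date": date(2023, 3, 9)},
    {"smiles": "CCN", "date": date(2020, 7, 30)},
    {"smiles": "CCC", "date": date(2022, 11, 2)},
]
train, test = temporal_split(records)
print([r["smiles"] for r in train], [r["smiles"] for r in test])
```

After sorting, every training record is older than every test record, which is the property the index split relies on.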

In [26]:
import deepchem as dc

indexsplitter = dc.splits.IndexSplitter()
train_dataset, test_dataset = indexsplitter.train_test_split(dataset, frac_train=0.8)

In [27]:
# This should be similar to scikit-learn's train_test_split
# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, shuffle=False)

# More options

In addition to `train_test_split`, some of the other methods available in DeepChem include `train_valid_test_split`, which creates an additional validation set, and `k_fold_split`, which creates several splits for use in cross-validation (note that this might not always make sense, such as for time-split data).
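
To see why plain k-fold splitting can be questionable for time-ordered data, consider this small sketch. It builds contiguous folds over ordered indices (a generic illustration, not DeepChem's `k_fold_split`; the helper name is an assumption) and flags the folds in which the model would be trained on data newer than its test set.

```python
def contiguous_k_folds(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous folds over ordered indices."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop

# Indices 0..9 are assumed to be ordered from oldest to newest
folds = list(contiguous_k_folds(10, 5))
for fold, (train_idx, test_idx) in enumerate(folds):
    trains_on_newer = max(train_idx) > min(test_idx)
    print(fold, test_idx, "trains on newer data:", trains_on_newer)
```

Only the last fold respects the temporal order; in all the others the training set contains compounds that are newer than the test compounds.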

# SIMPD - Simulated Time Split

The temporal split has been shown to be a good predictor of future model performance. However, date information is rarely available, especially on public assays. In [SIMPD: an Algorithm for Generating Simulated Time Splits for Validating Machine Learning Approaches (2023)](https://chemrxiv.org/engage/chemrxiv/article-details/6406049e6642bf8c8f10e189), Landrum et al. describe the SIMPD algorithm, which uses a genetic algorithm approach to create an approximate temporal split for any dataset.

The authors identify that these simulated temporal splits reflect the properties expected for true temporal splits: descriptors that tend to increase over the course of a project, such as the “synthetic accessibility” score (compounds get more complex), heavy atom count and topological polar surface area (TPSA) also increase in the selected test sets relative to the training sets.

The algorithm has been open-sourced on [GitHub](https://github.com/rinikerlab/molecular_time_series), but I didn't find it too easy to implement. It seems that SIMPD might be added to [RDKit - Prefer](https://github.com/rdkit/PREFER) soon, so I'll keep it for a future Notebook.

# MUV

In [Maximum Unbiased Validation (MUV) Data Sets for Virtual Screening Based on PubChem Bioactivity Data (2009)](https://pubs.acs.org/doi/10.1021/ci8002649), Rohrer and Baumann present the MUV workflow to create unbiased datasets for virtual screening.

The workflow consists of:
- Removing compounds with a potential for unspecific bioactivity. For this, the authors implement filters for assay artifacts, frequent hitters, autofluorescence, and luciferase inhibition.
- Removing active compounds devoid of decoys (chemical space embedding filter).
- Using spatial statistics to adjust the spread of actives ($G$) and of actives and decoys ($F$), enforcing spatial randomness. Large values of $G$ indicate a high level of self-similarity among the actives, whereas small values of $F$ indicate a high degree of separation from the decoys. Therefore, the quantity $S = G - F$ can be seen as an estimate of "data clumping", and the unbiasing approach can be summarized as "adjusting the dataset to bring the value of $S$ close to zero".

I could not find the code for the MUV workflow, so let's check out the next approach, which is based on it.

# AVE

In [Most Ligand-Based Classification Benchmarks Reward Memorization Rather than Generalization (2018)](https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00403), Wallach and Heifets investigate the phenomenon of "undetected overfitting", which is a consequence of having redundant information in both training and test sets, that is, active-active and inactive-inactive similarities in both sets. Inspired by the MUV approach, the AVE redundancy measure, or **AVE bias**, was proposed to measure validation set bias. The authors state that:

> the amount of AVE bias strongly correlates with the performance of ligand-based predictive methods irrespective of the predicted property, chemical fingerprint, similarity measure, or previously applied unbiasing techniques

In essence, the AVE bias:

> describes the ability of 1-NN to solve a validation set by memorizing the training data... tests with high AVE bias are easy to solve

It is shown that even time-split datasets can show significant bias, mostly because there is an intrinsic bias in compounds that are selected for synthesis and testing. Therefore, a splitting method is proposed to minimize redundancy bias. The authors, however, caution about its limitations:

> For cases where the available data are scarce, we proposed an algorithm that minimizes the AVE bias in test cases... uses a genetic algorithm to partition the available data into training and validation subsets with reduced bias. The algorithm is a heuristic search

> we claim neither that the algorithm is optimal nor that it is guaranteed to succeed. There is a risk that new biases are introduced into the test, as with clustering and MUV unbiasing. In many cases, however, test cases can be successfully unbiased and used in evaluations

And make some final suggestions:

> we suggest that cheminformaticians measure and report biases in their tests, and we provide our code to compute AVE biases. Tests with high bias can be excluded from consideration or unbiased with the provided code. Baseline predictive performance, such as from 1-NN, should also be included in the results.

Two Python scripts are available from the paper's Supporting Information:
`analyze_AVE_bias.py` measures the bias of a dataset that was already split into training and test (.smi file, tab- or space-separated). `remove_AVE_bias.py` takes an entire dataset and creates a split with reduced bias using their algorithm.

I have updated the scripts to work with Python 3 and fixed some indentation mistakes. The updated files can be found [in this folder](https://github.com/rflameiro/Python_e_Quiminformatica/tree/main/modules). I'll be using the default options on this example, but take a look at the script for the available options.

## analyze_AVE_bias

Let's compare the random and scaffold splits. The AVE bias is the final value output by the script: (AA-AI)+(II-IA)
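
To give an intuition for what the two terms measure, here is a heavily simplified sketch of the calculation. It is not the code from `analyze_AVE_bias.py`: the fingerprints are hypothetical sets of "on" bits, the evenly spaced grid of 100 distance thresholds is an assumption, and all function names are made up for illustration.

```python
def tanimoto_distance(a, b):
    """Tanimoto distance between two sets of 'on' bits."""
    union = len(a | b)
    return 1.0 - (len(a & b) / union if union else 0.0)

def nn_closeness(test_fps, train_fps, n_thresholds=100):
    """Average, over test molecules, of the fraction of distance thresholds
    under which the molecule has at least one neighbor in train_fps."""
    thresholds = [k / n_thresholds for k in range(1, n_thresholds + 1)]
    total = 0.0
    for v in test_fps:
        nn_dist = min(tanimoto_distance(v, t) for t in train_fps)
        total += sum(nn_dist < d for d in thresholds) / len(thresholds)
    return total / len(test_fps)

def ave_bias(train_act, train_inact, test_act, test_inact):
    """Return (AA-AI, II-IA, total bias)."""
    aa_ai = nn_closeness(test_act, train_act) - nn_closeness(test_act, train_inact)
    ii_ia = nn_closeness(test_inact, train_inact) - nn_closeness(test_inact, train_act)
    return aa_ai, ii_ia, aa_ai + ii_ia

# Hypothetical toy split: each test molecule resembles only its own class
train_act, test_act = [{1, 2, 3}], [{1, 2, 4}]
train_inact, test_inact = [{10, 11, 12}], [{10, 11, 13}]
aa_ai, ii_ia, bias = ave_bias(train_act, train_inact, test_act, test_inact)
print(aa_ai, ii_ia, bias)
```

Both terms are positive when test molecules sit closer to training molecules of their own class than to those of the other class, which is the situation a nearest-neighbor model can exploit by memorization.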

In [28]:
# Convert to pandas DataFrame
train_random_df = train_random.to_dataframe()
test_random_df = test_random.to_dataframe()

train_scaffold_df = train_scaffold.to_dataframe()
test_scaffold_df = test_scaffold.to_dataframe()

In [31]:
def dataframes_to_smi(df_train, df_test, label=""):
    """
    Write the four .smi input files required by analyze_AVE_bias.py
    df_train, df_test: Pandas DataFrame with a SMILES column labeled as "ids", and the binary target column "y"
    label: will be used to name the generated .smi files. Use it to compare splits, 
        e.g., "random", "scaffold", "temporal"

    The path of each generated file is printed.
    Note: analyze_AVE_bias.py itself was modified to output its results inside the notebook, using print().
    In case you prefer to create a .txt with the results, you can modify the script by (un)commenting where indicated
    """
    subsets = {
        "active_training": df_train[df_train["y"] == 1],
        "inactive_training": df_train[df_train["y"] == 0],
        "active_test": df_test[df_test["y"] == 1],
        "inactive_test": df_test[df_test["y"] == 0],
    }

    for name, subset in subsets.items():
        path = name + "_" + label + ".smi"
        # One SMILES per line, separated by a space and a newline as in the original files
        with open(path, 'w') as file:
            file.write(' \n'.join(subset["ids"].to_list()))
        print(path)



In [32]:
# write .smi files - random
dataframes_to_smi(train_random_df, test_random_df, label="random")

active_training_random.smi
inactive_training_random.smi
active_test_random.smi
inactive_test_random.smi


Note: I modified the script to print the results on the Notebook, so the `-outFile` argument will not be used for anything, but it is still required to run the script.

In [58]:
# run script
! python analyze_AVE_bias.py -activeMolsTraining active_training_random.smi -inactiveMolsTraining inactive_training_random.smi -activeMolsTesting active_test_random.smi -inactiveMolsTesting inactive_test_random.smi -outFile ave_results_random.txt

#ActTrain = 1177 
#InactTrain = 370 
#ActTest = 299 
#InactTest = 88 
knn1 = 0.834 
lr = 0.909 
rf = 0.942 
svm = 0.921 
AA-AI = 0.221 
II-IA = 0.169 
(AA-AI)+(II-IA) = 0.390




In [34]:
# write .smi files - scaffold
dataframes_to_smi(train_scaffold_df, test_scaffold_df, label="scaffold")

active_training_scaffold.smi
inactive_training_scaffold.smi
active_test_scaffold.smi
inactive_test_scaffold.smi


In [59]:
# run script
! python analyze_AVE_bias.py -activeMolsTraining active_training_scaffold.smi -inactiveMolsTraining inactive_training_scaffold.smi -activeMolsTesting active_test_scaffold.smi -inactiveMolsTesting inactive_test_scaffold.smi -outFile ave_results_scaffold.txt

#ActTrain = 1274 
#InactTrain = 273 
#ActTest = 202 
#InactTest = 185 
knn1 = 0.712 
lr = 0.818 
rf = 0.839 
svm = 0.837 
AA-AI = 0.158 
II-IA = 0.048 
(AA-AI)+(II-IA) = 0.205




As we can see, the bias of the scaffold split (0.205) is smaller than that of the random split (0.390). Notice how the scaffold split decreases both the values of AA-AI and II-IA. 

AA-AI is a measure of how clumped the test actives are among the training actives, and II-IA, similarly, measures the clumping among the inactives. This means that while scaffold splitting is able to make test actives less similar to training actives, its main strength in this example was to decrease significantly the inactive-inactive similarities among the training-test sets.

## remove_AVE_bias

Let's see if we can get an improvement over the scaffold split.

In [36]:
def dataframe_to_smi(df, label=""):
    """
    Function to write .smi inputs to remove_AVE_bias.py
    df: Pandas DataFrame with a SMILES column labeled as "ids", and the binary target column "y"
    label: will be used to name the generated .smi files.
    """
    active_mols = df[df["y"] == 1]["ids"].to_list()
    inactive_mols = df[df["y"] == 0]["ids"].to_list()

    active_mols_path = "active_mols_" + label + ".smi"   
    with open(active_mols_path,'w+') as file:
        file.write(' \n'.join(active_mols))
    print(active_mols_path)

    inactive_mols_path = "inactive_mols_" + label + ".smi"   
    with open(inactive_mols_path,'w+') as file:
        file.write(' \n'.join(inactive_mols))
    print(inactive_mols_path)

In [38]:
df = df_smi.copy()
df.columns = ["ids", "y"]
df.head()

Unnamed: 0,ids,y
0,Cc1onc(-c2ccccc2Cl)c1C(=O)NC1C(=O)N2C1SC(C)(C)...,1
1,CCN1CCN(C(=O)NC(C(=O)NC2C(=O)N3C(C(=O)O)=C(CSc...,1
2,CN(C)C1C(=O)C(C(=O)NCN2CCCC2)C(=O)[C@@]2(O)C(=...,1
3,Cc1nccn1CC1CCc2c(C1=O)c1ccccc1n2C,1
4,COc1ccc([C@@H]2Sc3ccccc3N(CCN(C)C)C(=O)C2OC(C)...,1


In [39]:
dataframe_to_smi(df, label="remove_bias")

active_mols_remove_bias.smi
inactive_mols_remove_bias.smi


In [40]:
# remove_AVE_bias.py
! python remove_AVE_bias.py -activeMols active_mols_remove_bias.smi -inactiveMols inactive_mols_remove_bias.smi

read 1476 actives and 458 inactives
calc aa_D_ref
calc ii_D_ref
calc ai_D_ref
done
calculate objectives for the population
remove similar sets
removing 0 similar sets
population size after similarity filter:  100
select the next generation
iter= 1 fullPopObj= 0.367 topPopObj= 0.337 finalPopObj= 0.311 minObj= 99999
breed
calculate objectives for the population
remove similar sets
removing 0 similar sets
population size after similarity filter:  100
select the next generation
iter= 2 fullPopObj= 0.341 topPopObj= 0.315 finalPopObj= 0.297 minObj= 0.311
breed
calculate objectives for the population
remove similar sets
removing 0 similar sets
population size after similarity filter:  100
select the next generation
iter= 3 fullPopObj= 0.318 topPopObj= 0.291 finalPopObj= 0.262 minObj= 0.297
breed
calculate objectives for the population
remove similar sets
removing 0 similar sets
population size after similarity filter:  100
select the next generation
iter= 4 fullPopObj= 0.298 topPopObj= 0.271 



Now let's see how the bias of this AVE split compares to that of the other splits:

In [60]:
! python analyze_AVE_bias.py -activeMolsTraining actives.T.smi -inactiveMolsTraining inactives.T.smi -activeMolsTesting actives.V.smi -inactiveMolsTesting inactives.V.smi -outFile ave_results_AVE_split.txt

#ActTrain = 1174 
#InactTrain = 360 
#ActTest = 290 
#InactTest = 83 
knn1 = 0.523 
lr = 0.747 
rf = 0.738 
svm = 0.758 
AA-AI = 0.160 
II-IA = -0.158 
(AA-AI)+(II-IA) = 0.002




The overall bias is much smaller than that of the scaffold split. However, notice how the AA-AI factor is approximately the same, while II-IA has become negative. This means that test inactives are generally more similar to training actives than to training inactives, and that classification of inactives will be more challenging. This might be a problem or not, depending on your use case.

## Results

| Splitting method | Bias | # training actives | # training inactives | # test actives | # test inactives |
|---|---|---|---|---|---|
| Random split | 0.390 | 1177 | 370 | 299  | 88 |
| Scaffold split | 0.205 | 1274 | 273 | 202 | 185 |
| AVE split | 0.002 | 1174 | 360 | 290 | 83 |
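
The counts in the table can also be used to check how each method changes the dataset size and the class balance. The short sketch below only does arithmetic on the numbers reported above; the dictionary layout and helper name are assumptions.

```python
# (actives, inactives) per subset, copied from the table above
splits = {
    "Random split": {"train": (1177, 370), "test": (299, 88)},
    "Scaffold split": {"train": (1274, 273), "test": (202, 185)},
    "AVE split": {"train": (1174, 360), "test": (290, 83)},
}

def active_fraction(actives, inactives):
    """Fraction of actives in a subset."""
    return actives / (actives + inactives)

summary = {}
for name, counts in splits.items():
    n_total = sum(counts["train"]) + sum(counts["test"])
    train_frac = active_fraction(*counts["train"])
    test_frac = active_fraction(*counts["test"])
    summary[name] = (n_total, train_frac, test_frac)
    print(f"{name}: n={n_total}, train actives={train_frac:.2f}, test actives={test_frac:.2f}")
```

The totals show that the AVE split keeps 1907 of the 1934 compounds, and the fractions show that the scaffold split produces a test set with a much lower proportion of actives than its training set.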

In summary, the scaffold split is less biased than the random split, but it still has a significant bias compared to the AVE split. From the paper, bias values in the range 0.1-0.2 already indicate datasets that can be easily "solved" by all the algorithms tested (LR, SVM, RF, 1-NN), that is, achieving an AUC > 0.9. 

In our example, the bias comes mostly from the similarities among training/test actives, and this bias was not removed by the AVE split. The similarity among inactives, on the other hand, was significantly altered by the split, to the point that test inactives are now closer to training actives than to training inactives.

Since not many compounds were lost in the process, it might be a good idea to use the AVE split in this case. Expect, however, a drop in performance, since some test inactives are likely to be misclassified as actives. You can try different molecular representations to overcome this.