# Examples of Additional Use Cases in `pydebiaseddta`

This notebook demonstrates additional experimental settings for guides, predictors, and the debiased training process. For a practical yet comprehensive introduction to `pydebiaseddta`, see the notebook `quickstart.ipynb`. This notebook complements it with short examples of other experimental scenarios that a user might be interested in.

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from pydebiaseddta.guides import BoWDTA, IDDTA, RFDTA, OutDTA
from pydebiaseddta.debiasing import DebiasedDTA
from pydebiaseddta.predictors import DeepDTA, BPEDTA, LMDTA
from pydebiaseddta.utils import load_sample_dta_data, load_sample_prot_sim_matrix
from pydebiaseddta.evaluation import evaluate_predictions

train_ligands, train_proteins, train_labels = load_sample_dta_data(mini=True, split="train")
val_ligands, val_proteins, val_labels = load_sample_dta_data(mini=True, split="val")
test_ligands, test_proteins, test_labels = load_sample_dta_data(mini=True, split="test")

## Training with various guides and predictors

In [2]:
for guide in [BoWDTA, IDDTA]:
    for predictor in [DeepDTA, BPEDTA]:
        print(guide.__name__, predictor.__name__)
        debiaseddta = DebiasedDTA(guide, predictor, predictor_params={"n_epochs": 2})
        train_hist = debiaseddta.train(train_ligands,
                                       train_proteins,
                                       train_labels,
                                       val_splits = {"cold_both": [test_ligands, test_proteins, test_labels]},
                                       metrics_tracked=["mae", "mse", "r2"])
        print(train_hist)
        preds = debiaseddta.predictor_instance.predict(test_ligands, test_proteins)
        scores = evaluate_predictions(test_labels, preds, metrics=["ci", "mse", "r2", "mae", "rmse"])   

BoWDTA DeepDTA
Guide training started.
Guide training completed in 00:00:13.
Predictor training started.


100%|██████████| 2/2 [00:02<00:00,  1.47s/it]


Predictor training completed in 00:00:03.
{'train': {'mae': [11.408992, 10.997201], 'mse': [130.574375, 121.345531], 'r2': [-317.706598, -295.180791]}, 'val_splits': {'cold_both': {'mae': [11.618738, 11.209242], 'mse': [135.346634, 126.000854], 'r2': [-386.14832, -359.415457]}}}
BoWDTA BPEDTA
Guide training started.
Guide training completed in 00:00:14.
Predictor training started.


100%|██████████| 2/2 [00:01<00:00,  1.50it/s]


Predictor training completed in 00:00:01.
{'train': {'mae': [11.427759, 11.050064], 'mse': [131.003118, 122.509018], 'r2': [-318.753078, -298.020635]}, 'val_splits': {'cold_both': {'mae': [11.64573, 11.306248], 'mse': [135.972875, 128.180762], 'r2': [-387.93963, -365.650908]}}}
IDDTA DeepDTA
Guide training started.
Guide training completed in 00:00:00.
Predictor training started.


100%|██████████| 2/2 [00:01<00:00,  1.73it/s]


Predictor training completed in 00:00:01.
{'train': {'mae': [11.408992, 10.997054], 'mse': [130.574375, 121.342341], 'r2': [-317.706598, -295.173004]}, 'val_splits': {'cold_both': {'mae': [11.618738, 11.209118], 'mse': [135.346634, 125.998048], 'r2': [-386.14832, -359.40743]}}}
IDDTA BPEDTA
Guide training started.
Guide training completed in 00:00:00.
Predictor training started.


100%|██████████| 2/2 [00:01<00:00,  1.40it/s]

Predictor training completed in 00:00:02.
{'train': {'mae': [11.427759, 11.0496], 'mse': [131.003118, 122.498378], 'r2': [-318.753078, -297.994663]}, 'val_splits': {'cold_both': {'mae': [11.64573, 11.305969], 'mse': [135.972875, 128.174522], 'r2': [-387.93963, -365.633062]}}}





## Training using various non-default predictor hyperparameters

Early stopping when the tracked metric on a validation split stops improving, with a minimum number of epochs before early stopping can trigger:

In [5]:
debiaseddta = DebiasedDTA(BoWDTA, DeepDTA, predictor_params={
    "n_epochs": 100,
    "early_stopping_metric": "mae",
    "early_stopping_num_epochs": 3,
    "early_stopping_split": "val_set",
    "min_epochs": 15,
    "optimizer": "adam"})
train_hist = debiaseddta.train(train_ligands,
                                train_proteins,
                                train_labels,
                                val_splits = {"val_set": [test_ligands, test_proteins, test_labels]},
                                metrics_tracked=["mae", "mse", "r2"])
preds = debiaseddta.predictor_instance.predict(test_ligands, test_proteins)
scores = evaluate_predictions(test_labels, preds, metrics=["ci", "mse", "r2", "mae", "rmse"])   
print("MAE in val_set for last 5 epochs:", train_hist["val_splits"]["val_set"]["mae"][-5:])
print("MAE in val_set in the final model:", scores["mae"])

Guide training started.
Guide training completed in 00:00:13.
Predictor training started.


 18%|█▊        | 18/100 [00:05<00:25,  3.19it/s]

Early stopping due to no increase to mae in val_set split for 3 epochs.
No save folder provided, using the final model.
Predictor training completed in 00:00:06.
MAE in val_set for last 5 epochs: [1.726252, 0.547368, 2.167525, 3.041101, 2.48041]
MAE in val_set in the final model: 2.48041038985376
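
The patience rule configured above can be sketched in a few lines. This is a minimal illustration and not the `pydebiaseddta` implementation: the function name `early_stop_epoch`, the strict-improvement comparison on a lower-is-better metric, and the history values are all assumptions made for the example.

```python
def early_stop_epoch(metric_history, patience, min_epochs=0):
    """Return the 1-based epoch at which training would stop, or None.

    Hypothetical helper: stops once the metric has failed to improve for
    `patience` consecutive epochs, but never before `min_epochs`.
    """
    best = float("inf")
    epochs_since_best = 0
    for epoch, value in enumerate(metric_history, start=1):
        if value < best:
            # New best value: reset the patience counter.
            best = value
            epochs_since_best = 0
        else:
            epochs_since_best += 1
        if epoch >= min_epochs and epochs_since_best >= patience:
            return epoch
    return None


# Made-up validation MAE values, one per epoch.
history = [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 2.0]
print(early_stop_epoch(history, patience=3))                # 6
print(early_stop_epoch(history, patience=3, min_epochs=7))  # None
```

In the run above, the final-model MAE equals the last tracked epoch rather than the lowest value in the printed window, which is consistent with the log line "No save folder provided, using the final model."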





Early stopping based on training convergence, that is, once the tracked metric on the training split reaches a predefined threshold. This example also uses vanilla SGD instead of Adam as the optimizer:

In [12]:
debiaseddta = DebiasedDTA(IDDTA, BPEDTA, predictor_params={
    "n_epochs": 100,
    "model_folder": "./temp/",
    "early_stopping_metric": "mse",
    "early_stopping_metric_threshold": 1.6,
    "early_stopping_split": "train",
    "optimizer": "sgd",
    "learning_rate": 0.01})
train_hist = debiaseddta.train(train_ligands,
                                train_proteins,
                                train_labels,
                                val_splits = {"val_set": [test_ligands, test_proteins, test_labels]},
                                metrics_tracked=["mae", "mse", "r2"])
preds = debiaseddta.predictor_instance.predict(train_ligands, train_proteins)
scores = evaluate_predictions(train_labels, preds, metrics=["ci", "mse", "r2", "mae", "rmse"])    
print("MSE in train split for last 5 epochs:", train_hist["train"]["mse"][-5:])
print("MSE in train split in the final model:", scores["mse"])

Guide training started.
Guide training completed in 00:00:00.
Predictor training started.


  4%|▍         | 4/100 [00:02<00:55,  1.74it/s]

Early stopping training due to convergence on the train split.
Predictor training completed in 00:00:02.
MSE in train split for last 5 epochs: [125.573093, 113.278748, 95.360722, 59.67571, 1.216261]
MSE in train split in the final model: 1.2162609036893124
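
A threshold-based stopping rule of this kind can be sketched as below. The helper name `first_epoch_below` and the `<=` comparison are assumptions for illustration; the library's exact rule may differ.

```python
def first_epoch_below(metric_history, threshold):
    """Return the 1-based position at which the metric first reaches the threshold, else None."""
    for epoch, value in enumerate(metric_history, start=1):
        if value <= threshold:
            return epoch
    return None


# Train-split MSE values printed in the output above.
tail = [125.573093, 113.278748, 95.360722, 59.67571, 1.216261]
print(first_epoch_below(tail, threshold=1.6))  # 5 (position within this window)
```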





## Training using various non-default guide hyperparameters

`max_depth`, `min_samples_split`, and `min_samples_leaf` limit the complexity of the decision-tree-based guide models (`BoWDTA`, `IDDTA`, `RFDTA`). The parameters `ligand_vector_mode` and `prot_vector_mode` determine how the bag-of-words representation is converted to an embedding vector. `vocab_size` is another hyperparameter that controls the complexity of the representations these guides can use as input. `criterion` is the loss function the regressor uses, and `input_rank` further simplifies the guide's input by replacing its matrix representation with a low-rank approximation.

In [13]:
debiaseddta = DebiasedDTA(BoWDTA, BPEDTA, predictor_params={"n_epochs": 10}, guide_params={
    "max_depth": 4,
    "min_samples_split": 5,
    "min_samples_leaf": 3,
    "ligand_vector_mode": "freq",
    "prot_vector_mode": "binary",
    "vocab_size": "low",
    "criterion": "poisson",
    "input_rank": 10,
})
train_hist = debiaseddta.train(train_ligands, train_proteins, train_labels, metrics_tracked=["mae", "mse", "r2"])
preds = debiaseddta.predictor_instance.predict(test_ligands, test_proteins)
scores = evaluate_predictions(test_labels, preds, metrics=["ci", "mse", "r2", "mae", "rmse"])   

Guide training started.
Guide training completed in 00:00:08.
Predictor training started.


100%|██████████| 10/10 [00:03<00:00,  3.28it/s]

Predictor training completed in 00:00:03.
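
To make two of the guide hyperparameters described above more concrete, the sketch below shows one plausible reading of the `"freq"` and `"binary"` vector modes and of a low-rank approximation via truncated SVD. The helper names `bow_vector` and `low_rank_approximation` are hypothetical, and treating `"freq"` as relative frequency is an assumption; the actual guide implementations may differ.

```python
from collections import Counter

import numpy as np


def bow_vector(tokens, vocab, mode="freq"):
    """Turn a token sequence into a bag-of-words vector over `vocab` (hypothetical helper)."""
    counts = Counter(tokens)
    vec = np.array([counts[tok] for tok in vocab], dtype=float)
    if mode == "binary":
        # 1 if the token occurs at all, else 0.
        return (vec > 0).astype(float)
    if mode == "freq":
        # Assumption: relative frequency of each vocabulary token.
        return vec / max(len(tokens), 1)
    raise ValueError(f"unknown mode: {mode}")


def low_rank_approximation(matrix, rank):
    """Best rank-`rank` approximation in the least-squares sense, via truncated SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


vocab = ["C", "N", "O"]
print(bow_vector(["C", "C", "O"], vocab, mode="binary"))
print(bow_vector(["C", "C", "O"], vocab, mode="freq"))

# A 4x3 matrix whose third column is 2*col1 + col2, so its rank is 2.
X = np.array([[1, 0, 2], [0, 1, 1], [1, 1, 3], [2, 0, 4]], dtype=float)
print(np.allclose(low_rank_approximation(X, 2), X))            # True
print(np.linalg.matrix_rank(low_rank_approximation(X, 1)))     # 1
```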





## Training using various non-default hyperparameters for overall training

Here, `guide_error_exponent` changes the exponent used when computing the errors of the guide's predictions. `weight_temperature` is a temperature parameter that determines how far from uniform the computed importance weights will be. `weight_tempering_exponent` determines how quickly the "tempering" process proceeds: the lower this value, the faster the weights approach their final computed values. `weight_tempering_num_epochs` controls the total number of epochs over which the transition to the computed importance weights is made; this is especially relevant when early stopping is desired. `weight_prior` adds the given ratio of the maximum importance weight to all importance weights to prevent extreme sparsity. Lastly, `weight_rank_based` sets the importance weights to the percentile ranks of the errors of the training inputs, which prevents errors arising from observation noise from having an undue effect on the importance weights.

In [14]:
debiaseddta = DebiasedDTA(IDDTA,
                          BPEDTA,
                          predictor_params={"n_epochs": 4},
                          guide_error_exponent=1,
                          weight_tempering_exponent=0.5,
                          weight_tempering_num_epochs=5,
                          weight_temperature=2,
                          weight_prior=0.01,
                          weight_rank_based=True
                          )
train_hist = debiaseddta.train(train_ligands,
                               train_proteins,
                               train_labels,
                               metrics_tracked=["mae", "mse", "r2"],
                               weights_save_path="./temp/additional_experiments/importance_weights.coef",
                               predictor_save_folder="./temp/additional_experiments")
preds = debiaseddta.predictor_instance.predict(test_ligands, test_proteins)
scores = evaluate_predictions(test_labels, preds, metrics=["ci", "mse", "r2", "mae", "rmse"])   

Guide training started.
Saved importance weights to ./temp/additional_experiments/importance_weights.coef.
Guide training completed in 00:00:00.
Predictor training started.


100%|██████████| 4/4 [00:01<00:00,  2.37it/s]


Saved predictor to the folder ./temp/additional_experiments.
Predictor training completed in 00:00:03.
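
The weighting parameters used in this section can be illustrated with a small NumPy sketch. It is a sketch under stated assumptions, not the library's code: the function names, the normalization to mean 1, and the power-law tempering schedule are all hypothetical, and `weight_temperature` is left out because its exact form is not described here.

```python
import numpy as np


def importance_weights(guide_errors, error_exponent=2, rank_based=False, prior=0.0):
    """Map guide errors to importance weights (hypothetical formulation)."""
    errors = np.abs(np.asarray(guide_errors, dtype=float)) ** error_exponent
    if rank_based:
        # Replace each error by its percentile rank in (0, 1]; the largest error gets 1.0.
        ranks = errors.argsort().argsort() + 1
        errors = ranks / len(errors)
    # Add a fraction of the maximum weight to every weight to avoid extreme sparsity.
    weights = errors + prior * errors.max()
    # Assumption: normalize so that the average weight is 1.
    return weights / weights.mean()


def tempered_weights(final_weights, epoch, num_epochs, exponent=1.0):
    """Interpolate from uniform weights (epoch 0) to the final weights (epoch >= num_epochs)."""
    final_weights = np.asarray(final_weights, dtype=float)
    # With progress in [0, 1], a lower exponent moves toward the final weights faster.
    progress = min(epoch / num_epochs, 1.0) ** exponent
    return (1.0 - progress) * np.ones_like(final_weights) + progress * final_weights


errors = [0.1, 0.2, 4.0, 0.4]  # made-up guide errors with one outlier
raw = importance_weights(errors, error_exponent=1)
ranked = importance_weights(errors, error_exponent=1, rank_based=True)
print(raw)     # dominated by the outlier
print(ranked)  # depends only on the ordering of the errors
print(tempered_weights(ranked, epoch=1, num_epochs=4, exponent=0.5))
```

With rank-based weights, the outlier still receives the largest weight, but its magnitude no longer matters, only its position in the ordering.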


## Other scenarios

We can train with pre-computed, saved importance weights. In this case the importance-weight hyperparameters are ignored, except for the tempering-related ones.

In [15]:
debiaseddta = DebiasedDTA(BoWDTA,
                          BPEDTA,
                          predictor_params={"n_epochs": 4},
                          weight_tempering_exponent=0.5,
                          weight_tempering_num_epochs=5,
                          )
train_hist = debiaseddta.train(train_ligands,
                               train_proteins,
                               train_labels,
                               metrics_tracked=["mae", "mse", "r2"],
                               weights_load_path="./temp/additional_experiments/importance_weights.coef")
preds = debiaseddta.predictor_instance.predict(test_ligands, test_proteins)
scores = evaluate_predictions(test_labels, preds, metrics=["ci", "mse", "r2", "mae", "rmse"])   

Loading existing importance weights from ./temp/additional_experiments/importance_weights.coef.
Predictor training started.


100%|██████████| 4/4 [00:01<00:00,  2.45it/s]

Predictor training completed in 00:00:02.





We can also load a pre-trained predictor to obtain DTA predictions. 

In [16]:
predictor = BPEDTA.from_file("./temp/additional_experiments")
preds = predictor.predict(test_ligands, test_proteins)
scores = evaluate_predictions(test_labels, preds, metrics=["ci", "mse", "r2", "mae", "rmse"]) 
print(scores)

{'ci': 0.5454545454545454, 'mse': 83.31641735482603, 'r2': -237.32000781048507, 'mae': 9.108696908997254, 'rmse': 9.127782718427627}


We can also pass the saved predictor to a `DebiasedDTA` object to continue training it.

In [17]:
debiaseddta = DebiasedDTA(None,
                          BPEDTA,
                          predictor_params={"n_epochs": 4},
                          weight_tempering_exponent=0.5,
                          weight_tempering_num_epochs=5,
                          )
train_hist = debiaseddta.train(train_ligands,
                               train_proteins,
                               train_labels,
                               metrics_tracked=["mae", "mse", "r2"],
                               predictor_load_folder="./temp/additional_experiments")
preds = debiaseddta.predictor_instance.predict(test_ligands, test_proteins)
scores = evaluate_predictions(test_labels, preds, metrics=["ci", "mse", "r2", "mae", "rmse"])   

No guide model specified, proceeding with uniform weights.
Loading a pretrained predictor.
New hyperparameters used for the pretrained predictor, any saved hyperparamaters are ignored.
Predictor training started.


100%|██████████| 4/4 [00:01<00:00,  2.41it/s]

Predictor training completed in 00:00:02.





Using `RFDTA` for training.

In [18]:
debiaseddta = DebiasedDTA(RFDTA,
                          BPEDTA,
                          predictor_params={"n_epochs": 6},
                          guide_params={"max_depth": 3, "num_trees": 100})
train_hist = debiaseddta.train(train_ligands, train_proteins, train_labels, metrics_tracked=["mae", "mse", "r2"])
preds = debiaseddta.predictor_instance.predict(test_ligands, test_proteins)
scores = evaluate_predictions(test_labels, preds, metrics=["ci", "mse", "r2", "mae", "rmse"])   

Guide training started.
Guide training completed in 00:00:00.
Predictor training started.


100%|██████████| 6/6 [00:02<00:00,  2.71it/s]

Predictor training completed in 00:00:02.





Using `OutDTA` for training with inverse frequency.

In [2]:
debiaseddta = DebiasedDTA(OutDTA,
                          DeepDTA,
                          predictor_params={"n_epochs": 6},
                          guide_params={"df": load_sample_dta_data(mini=True)["train"], "rarity_indicator": "inv_frequency"})
train_hist = debiaseddta.train(train_ligands, train_proteins, train_labels, metrics_tracked=["mae", "mse", "r2"])
preds = debiaseddta.predictor_instance.predict(test_ligands, test_proteins)
scores = evaluate_predictions(test_labels, preds, metrics=["ci", "mse", "r2", "mae", "rmse"])   

Guide training started.
Guide training completed in 00:00:00.
Predictor training started.


100%|██████████| 6/6 [00:03<00:00,  1.61it/s]

Predictor training completed in 00:00:04.
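
As a rough intuition for an inverse-frequency rarity indicator, the sketch below scores each training pair by how rarely its ligand and protein occur in the training set. The function `inverse_frequency_scores` and the choice to sum the two inverse counts are assumptions for illustration only; how `OutDTA` actually combines ligand and protein rarity is not shown in this notebook.

```python
from collections import Counter


def inverse_frequency_scores(ligands, proteins):
    """Score each (ligand, protein) pair by the inverse frequency of its members (hypothetical)."""
    ligand_counts = Counter(ligands)
    protein_counts = Counter(proteins)
    return [
        1.0 / ligand_counts[lig] + 1.0 / protein_counts[prot]
        for lig, prot in zip(ligands, proteins)
    ]


# Toy identifiers standing in for SMILES strings and protein sequences.
ligands = ["L1", "L1", "L2"]
proteins = ["P1", "P2", "P1"]
print(inverse_frequency_scores(ligands, proteins))  # [1.0, 1.5, 1.5]
```

Pairs containing a ligand or protein that appears only once receive a higher score than pairs made of frequently seen biomolecules.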





Using `OutDTA` for training with average distance.

In [3]:
debiaseddta = DebiasedDTA(OutDTA,
                          DeepDTA,
                          predictor_params={"n_epochs": 6},
                          guide_params={
                              "df": load_sample_dta_data(mini=True)["train"],
                              "rarity_indicator": "avg_distance",
                              "prot_sim_matrix": load_sample_prot_sim_matrix(),
                              })
train_hist = debiaseddta.train(train_ligands, train_proteins, train_labels, metrics_tracked=["mae", "mse", "r2"])
preds = debiaseddta.predictor_instance.predict(test_ligands, test_proteins)
scores = evaluate_predictions(test_labels, preds, metrics=["ci", "mse", "r2", "mae", "rmse"])   

Guide training started.
Guide training completed in 00:00:00.
Predictor training started.


100%|██████████| 6/6 [00:01<00:00,  3.04it/s]

Predictor training completed in 00:00:02.





Using an early-stopped `DeepDTA` as a guide.

In [5]:
debiaseddta = DebiasedDTA(DeepDTA,
                          DeepDTA,
                          guide_params={
                              "n_epochs": 3,
                              },
                          predictor_params={"n_epochs": 6},)
train_hist = debiaseddta.train(train_ligands, train_proteins, train_labels, metrics_tracked=["mae", "mse", "r2"])
preds = debiaseddta.predictor_instance.predict(test_ligands, test_proteins)
scores = evaluate_predictions(test_labels, preds, metrics=["ci", "mse", "r2", "mae", "rmse"])   

Guide training started.
Using predictor DeepDTA as guide.


100%|██████████| 3/3 [00:01<00:00,  2.25it/s]


Guide training completed in 00:00:02.
Predictor training started.


100%|██████████| 6/6 [00:02<00:00,  2.89it/s]

Predictor training completed in 00:00:02.





Using `BoWDTA` as predictor with no guides.

In [6]:
debiaseddta = DebiasedDTA(None,
                          BoWDTA,
                          predictor_params={"max_depth": 3},
                          )
train_hist = debiaseddta.train(train_ligands, train_proteins, train_labels, metrics_tracked=["mae", "mse", "r2"])
preds = debiaseddta.predictor_instance.predict(test_ligands, test_proteins)
scores = evaluate_predictions(test_labels, preds, metrics=["ci", "mse", "r2", "mae", "rmse"]) 
print(scores)

No guide model specified, proceeding with uniform weights.
Predictor training started.
Predictor training completed in 00:00:00.
{'ci': 0.29545454545454547, 'mse': 0.5963057059340608, 'r2': -0.7056852059591168, 'mae': 0.6645587399470578, 'rmse': 0.7722083306556987}


Creating and using a custom predictor model by extending the `Predictor` class:

In [25]:
import numpy as np
from pydebiaseddta.predictors import Predictor

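class AveragePredictor(Predictor):
    """Baseline that always predicts the mean training label."""

    def __init__(self, n_epochs, **kwargs):
        # This baseline does not iterate over epochs; the value is only stored.
        self.n_epochs = n_epochs

    def train(self, train_ligands, train_proteins, train_labels, sample_weights_by_epoch, **kwargs):
        # The importance weights are ignored here; only the label mean is kept.
        self.prediction = np.array(train_labels).mean()

    def predict(self, ligands, proteins, **kwargs):
        # Return the stored mean for every ligand-protein pair.
        return np.ones(len(ligands)) * self.prediction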

train_ligands, train_proteins, train_labels = load_sample_dta_data(mini=True, split="train")
test_ligands, test_proteins, test_labels = load_sample_dta_data(mini=True, split="test")
debiaseddta = DebiasedDTA(IDDTA, AveragePredictor, predictor_params={'n_epochs': 1})
debiaseddta.train(train_ligands, train_proteins, train_labels)
preds = debiaseddta.predictor_instance.predict(test_ligands, test_proteins)
scores = evaluate_predictions(test_labels, preds, metrics=["ci", "mse", "r2", "mae", "rmse"]) 
print(scores)

Guide training started.
Guide training completed in 00:00:00.
Predictor training started.
Predictor training completed in 00:00:00.
{'ci': 0.5, 'mse': 0.3936561283845659, 'r2': -0.12602215229338376, 'mae': 0.5579790632750005, 'rmse': 0.6274202167483655}


Please see the documentation of the specific modules for further details on functionalities provided by `pydebiaseddta`, and see the associated paper for in-depth discussions regarding the implications of the choices examined here.