# Tox 24 challenge - see how deepFPlearn performs

We perform the following steps:
- load challenge data: training and test datasets
- remove duplicated SMILES with different target values
- scale the target value to the range [0, 1]
- generate 2048-bit binary molecular fingerprints for the whole set of SMILES (train and test substances) and train a dedicated autoencoder that compresses them into 256-bit vectors with fewer zeros
- use the trained autoencoder to encode the 2048-bit fingerprints of the training substances
- train a regression model on this encoded data
- use the trained autoencoder to encode the test substances, then use the regression model to predict their scaled target values
- reverse the scaling of the target values
- submit the predictions
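
The duplicate-removal step from the list above is not shown in this notebook. The following is a minimal sketch of one possible reading of it, assuming that every SMILES recorded with more than one distinct target value is dropped entirely and that exact repeats are collapsed to a single row. The helper `drop_conflicting_smiles` and the toy data are hypothetical; the column names `SMILES` and `activity` are the ones that appear in the data frame further down.

```python
import pandas as pd


def drop_conflicting_smiles(df, smiles_col="SMILES", target_col="activity"):
    """Remove every SMILES that occurs with more than one distinct target value.

    Hypothetical helper, not part of dfpl.
    """
    # number of distinct target values recorded for each structure
    n_targets = df.groupby(smiles_col)[target_col].transform("nunique")
    # keep only unambiguous structures, then collapse exact repeats to one row
    consistent = df[n_targets <= 1]
    return consistent.drop_duplicates(subset=[smiles_col]).reset_index(drop=True)


# made-up toy data: "CCO" has two conflicting values, "CC(=O)O" is an exact repeat
toy = pd.DataFrame({
    "SMILES": ["CCO", "CCO", "c1ccccc1", "CC(=O)O", "CC(=O)O"],
    "activity": [10.0, 25.0, 40.0, 5.0, 5.0],
})
cleaned = drop_conflicting_smiles(toy)
```

Other policies, such as averaging the conflicting values, would be equally easy to express; which one the challenge data actually needs is a judgment call.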

## Train the autoencoder

The autoencoder is trained on the full set of molecular structures. To get this set, first load the train and test datasets, then store all structures together in a single .csv file.

In [1]:
import pandas as pd

pd.concat([pd.read_csv('data/tox24_challenge_train.csv'),
           pd.read_csv('data/tox24_challenge_test.csv')],
          ignore_index=True).to_csv('data/tox24_challenge_smiles_all.csv', index=False)

Set the options for training the autoencoder.

In [1]:
from dfpl import options

opts = options.Options(
    inputFile='data/tox24_challenge_smiles_all.csv',
    outputDir='data/output/',
    ecModelDir='data/output/AE_encoder/',
    ecWeightsFile='',
    type='smiles',
    fpType='topological',
    fpSize=2048,
    encFPSize=256,
    verbose=2,
    trainAC=True,
    aeActivationFunction='tanh',
    aeEpochs=3000,
    aeBatchSize=52,
    aeLearningRate=0.004123771070856377,
    aeLearningRateDecay=0.05465859583974732,
    trainFNN=False,
    wabTracking=True,
)


Optionally track the training in Weights & Biases.

This requires a Weights & Biases account; the free plan is sufficient. Feel free to comment out this code cell.

In [2]:
import wandb

if opts.wabTracking:
    wandb.init(project="tox_24",
               entity="dfpl_regression",
               config=vars(opts))

wandb: Currently logged in as: mai00fti (dfpl_regression). Use `wandb login --relogin` to force relogin


Load the training data and generate fingerprints.

In [3]:
from dfpl import fingerprint as fp

df = fp.importDataFile(opts.inputFile, import_function=fp.importCSV, fp_size=opts.fpSize)

Train the autoencoder

In [7]:
from dfpl import autoencoder as ac
from dfpl import utils

utils.createDirectory(opts.outputDir)

# set opts.trainAC = False here to skip the training
if opts.trainAC:
    # train an autoencoder on the full feature matrix
    encoder = ac.train_full_ac(df, opts)

Update the options for training the regression model with compressed features.

In [18]:
opts = options.Options(
    inputFile='data/tox24_challenge_train.csv',
    outputDir='data/output/',
    ecModelDir='data/output/AE_encoder/',
    ecWeightsFile='',
    type='smiles',
    fpType='topological',
    fpSize=2048,
    encFPSize=256,
    verbose=2,
    trainFNN=True,
    compressFeatures=True,
    kFolds=5,
    testSize=0.2,
    optimizer="SGD",
    lossFunction="mse",
    epochs=5000,
    batchSize=56,
    activationFunction="tanh",
    dropout=0.15657883016344468,
    learningRate=0.017935022040821466,
    l2reg=0.009308121424156192,
    fnnType="REG",
    enableMultiLabel=False,
    wabTarget="activity",
)


In [19]:
df = fp.importDataFile(opts.inputFile, import_function=fp.importCSV, fp_size=opts.fpSize)

In [20]:
from tensorflow import keras

if opts.compressFeatures:
    # load trained model for autoencoder
    encoder = keras.models.load_model(opts.ecModelDir)

    # compress the fingerprints using the autoencoder
    df = ac.compress_fingerprints(df, encoder)
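
One way to check the claim that the compressed vectors contain fewer zeros is to compare the share of (near-)zero entries before and after compression. The sketch below uses made-up toy vectors and a hypothetical helper `zero_fraction`; it assumes that each row holds one array-like vector.

```python
import numpy as np


def zero_fraction(vectors, tol=1e-6):
    """Share of entries whose absolute value is below tol (hypothetical helper)."""
    # stack the per-row vectors into one 2D matrix
    matrix = np.vstack([np.asarray(v, dtype=float) for v in vectors])
    return float(np.mean(np.abs(matrix) < tol))


# made-up toy vectors standing in for raw and compressed fingerprints
sparse_fps = [np.array([0, 0, 1, 0, 0, 0, 0, 1]),
              np.array([0, 1, 0, 0, 0, 0, 0, 0])]
dense_codes = [np.array([0.4, -0.7]),
               np.array([0.0, 0.9])]

raw_zeros = zero_fraction(sparse_fps)          # 13 of 16 entries are zero
compressed_zeros = zero_fraction(dense_codes)  # 1 of 4 entries is zero
```

If the `fp` and `fpcompressed` columns of `df` store one vector per row, `zero_fraction(df['fp'])` and `zero_fraction(df['fpcompressed'])` would give the corresponding numbers for the real data.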

Scale the target values to [0,1]

In [21]:
df.columns

Index(['SMILES', 'activity', 'fp', 'fpcompressed'], dtype='object')

In [22]:
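from sklearn.preprocessing import MinMaxScaler

# MinMaxScaler expects a 2D array of shape (n_samples, n_features)
unscaled_target = df['activity'].to_numpy().reshape(-1, 1)

# keep the fitted scaler: it is needed later to reverse the scaling of the predictions
scaler = MinMaxScaler()
scaled_target = scaler.fit_transform(unscaled_target)

# replace the original column with the scaled values;
# assigning the flat array avoids any index alignment issues of pd.concat
df = df.drop('activity', axis=1).assign(activity=scaled_target.ravel())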
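
The last steps of the workflow need the inverse of this scaling to map predictions back to the original target range. The following sketch spells out the arithmetic with plain NumPy on made-up numbers; the function names are hypothetical. For the default range [0, 1], it is the same formula that `MinMaxScaler` applies.

```python
import numpy as np


def minmax_fit(values):
    """Return the (min, max) pair that defines the scaling."""
    values = np.asarray(values, dtype=float)
    return values.min(), values.max()


def minmax_scale(values, lo, hi):
    """Map values from [lo, hi] to [0, 1]."""
    return (np.asarray(values, dtype=float) - lo) / (hi - lo)


def minmax_unscale(scaled, lo, hi):
    """Map values from [0, 1] back to [lo, hi]."""
    return np.asarray(scaled, dtype=float) * (hi - lo) + lo


train_targets = np.array([12.0, 40.0, 96.0])   # made-up target values
lo, hi = minmax_fit(train_targets)
scaled = minmax_scale(train_targets, lo, hi)

predicted_scaled = np.array([0.25, 0.5])       # made-up model outputs
predicted = minmax_unscale(predicted_scaled, lo, hi)
```

With the fitted `scaler` from the cell above, `scaler.inverse_transform(predictions)` does the same for an array of shape `(n_samples, 1)`. The minimum and maximum must come from the training targets, since the test targets are unknown.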

Now train the regression model

In [24]:
opts.inputFile

'data/tox24_challenge_train.csv'

In [25]:
from dfpl import single_label_model as sl

if opts.trainFNN:
    sl.train_single_label_models(df=df, opts=opts)

  super(SGD, self).__init__(name, **kwargs)

Epoch 1/5000
15/15 - 0s - loss: 4.7379 - rmse: 0.8089 - mse: 0.6544 - mae: 0.6171 - val_loss: 4.1314 - val_rmse: 0.4196 - val_mse: 0.1761 - val_mae: 0.3554 - 489ms/epoch - 33ms/step
Epoch 2/5000
15/15 - 0s - loss: 3.9217 - rmse: 0.3363 - mse: 0.1131 - mae: 0.2753 - val_loss: 3.6763 - val_rmse: 0.2385 - val_mse: 0.0569 - val_mae: 0.1968 - 39ms/epoch - 3ms/step
Epoch 3/5000
15/15 - 0s - loss: 3.5263 - rmse: 0.2512 - mse: 0.0631 - mae: 0.2038 - val_loss: 3.3366 - val_rmse: 0.2491 - val_mse: 0.0621 - val_mae: 0.1901 - 41ms/epoch - 3ms/step
Epoch 4/5000
15/15 - 0s - loss: 3.1784 - rmse: 0.2236 - mse: 0.0500 - mae: 0.1846 - val_loss: 3.0031 - val_rmse: 0.2203 - val_mse: 0.0485 - val_mae: 0.1848 - 52ms/epoch - 3ms/step
Epoch 5/5000
15/15 - 0s - loss: 2.8651 - rmse: 0.2080 - mse: 0.0433 - mae: 0.1751 - val_loss: 2.7125 - val_rmse: 0.2195 - val_mse: 0.0482 - val_mae: 0.1834 - 38ms/epoch - 3ms/step
Epoch 6/5000
15/15 - 0s - loss: 2.5859 - rmse: 0.2038 - mse: 0.0415 - mae: 0.1710 - val_loss: 2.45

  super(SGD, self).__init__(name, **kwargs)


Epoch 1/5000
15/15 - 0s - loss: 4.4430 - rmse: 0.5841 - mse: 0.3412 - mae: 0.4423 - val_loss: 4.0425 - val_rmse: 0.2901 - val_mse: 0.0842 - val_mae: 0.2371 - 377ms/epoch - 25ms/step
Epoch 2/5000
15/15 - 0s - loss: 3.8774 - rmse: 0.2685 - mse: 0.0721 - mae: 0.2169 - val_loss: 3.6760 - val_rmse: 0.2545 - val_mse: 0.0648 - val_mae: 0.2128 - 32ms/epoch - 2ms/step
Epoch 3/5000
15/15 - 0s - loss: 3.5052 - rmse: 0.2268 - mse: 0.0514 - mae: 0.1864 - val_loss: 3.3199 - val_rmse: 0.2356 - val_mse: 0.0555 - val_mae: 0.1905 - 35ms/epoch - 2ms/step
Epoch 4/5000
15/15 - 0s - loss: 3.1607 - rmse: 0.2057 - mse: 0.0423 - mae: 0.1690 - val_loss: 2.9894 - val_rmse: 0.2113 - val_mse: 0.0446 - val_mae: 0.1783 - 52ms/epoch - 3ms/step
Epoch 5/5000
15/15 - 0s - loss: 2.8503 - rmse: 0.1945 - mse: 0.0378 - mae: 0.1578 - val_loss: 2.6997 - val_rmse: 0.2103 - val_mse: 0.0442 - val_mae: 0.1758 - 37ms/epoch - 2ms/step
Epoch 6/5000
15/15 - 0s - loss: 2.5744 - rmse: 0.1960 - mse: 0.0384 - mae: 0.1603 - val_loss: 2.43

  np.mean(abs_error / np.array(y_test), axis=0),
  super(SGD, self).__init__(name, **kwargs)


Epoch 1/5000
15/15 - 0s - loss: 4.4929 - rmse: 0.6208 - mse: 0.3854 - mae: 0.4704 - val_loss: 4.0764 - val_rmse: 0.3304 - val_mse: 0.1092 - val_mae: 0.2638 - 236ms/epoch - 16ms/step
Epoch 2/5000
15/15 - 0s - loss: 3.8986 - rmse: 0.2884 - mse: 0.0832 - mae: 0.2341 - val_loss: 3.6703 - val_rmse: 0.2190 - val_mse: 0.0480 - val_mae: 0.1723 - 47ms/epoch - 3ms/step
Epoch 3/5000
15/15 - 0s - loss: 3.5149 - rmse: 0.2246 - mse: 0.0505 - mae: 0.1868 - val_loss: 3.3169 - val_rmse: 0.2045 - val_mse: 0.0418 - val_mae: 0.1576 - 31ms/epoch - 2ms/step
Epoch 4/5000
15/15 - 0s - loss: 3.1731 - rmse: 0.2111 - mse: 0.0446 - mae: 0.1726 - val_loss: 2.9918 - val_rmse: 0.1924 - val_mse: 0.0370 - val_mae: 0.1500 - 42ms/epoch - 3ms/step
Epoch 5/5000
15/15 - 0s - loss: 2.8647 - rmse: 0.2071 - mse: 0.0429 - mae: 0.1720 - val_loss: 2.7013 - val_rmse: 0.1921 - val_mse: 0.0369 - val_mae: 0.1496 - 37ms/epoch - 2ms/step
Epoch 6/5000
15/15 - 0s - loss: 2.5859 - rmse: 0.2038 - mse: 0.0415 - mae: 0.1673 - val_loss: 2.43

  super(SGD, self).__init__(name, **kwargs)


Epoch 1/5000
15/15 - 0s - loss: 4.5906 - rmse: 0.6662 - mse: 0.4439 - mae: 0.5110 - val_loss: 4.1124 - val_rmse: 0.3209 - val_mse: 0.1030 - val_mae: 0.2536 - 237ms/epoch - 16ms/step
Epoch 2/5000
15/15 - 0s - loss: 3.9514 - rmse: 0.3066 - mse: 0.0940 - mae: 0.2476 - val_loss: 3.7078 - val_rmse: 0.2109 - val_mse: 0.0445 - val_mae: 0.1829 - 32ms/epoch - 2ms/step
Epoch 3/5000
15/15 - 0s - loss: 3.5585 - rmse: 0.2335 - mse: 0.0545 - mae: 0.1901 - val_loss: 3.3560 - val_rmse: 0.2084 - val_mse: 0.0434 - val_mae: 0.1726 - 32ms/epoch - 2ms/step
Epoch 4/5000
15/15 - 0s - loss: 3.2104 - rmse: 0.2145 - mse: 0.0460 - mae: 0.1766 - val_loss: 3.0235 - val_rmse: 0.1870 - val_mse: 0.0350 - val_mae: 0.1530 - 38ms/epoch - 3ms/step
Epoch 5/5000
15/15 - 0s - loss: 2.8958 - rmse: 0.2042 - mse: 0.0417 - mae: 0.1695 - val_loss: 2.7269 - val_rmse: 0.1787 - val_mse: 0.0319 - val_mae: 0.1443 - 36ms/epoch - 2ms/step
Epoch 6/5000
15/15 - 0s - loss: 2.6146 - rmse: 0.2028 - mse: 0.0411 - mae: 0.1682 - val_loss: 2.46

  super(SGD, self).__init__(name, **kwargs)


Epoch 1/5000
15/15 - 0s - loss: 4.4218 - rmse: 0.5928 - mse: 0.3515 - mae: 0.4411 - val_loss: 3.9983 - val_rmse: 0.2575 - val_mse: 0.0663 - val_mae: 0.2015 - 245ms/epoch - 16ms/step
Epoch 2/5000
15/15 - 0s - loss: 3.8553 - rmse: 0.2713 - mse: 0.0736 - mae: 0.2195 - val_loss: 3.6396 - val_rmse: 0.2217 - val_mse: 0.0491 - val_mae: 0.1833 - 26ms/epoch - 2ms/step
Epoch 3/5000
15/15 - 0s - loss: 3.4819 - rmse: 0.2185 - mse: 0.0477 - mae: 0.1771 - val_loss: 3.2948 - val_rmse: 0.2197 - val_mse: 0.0483 - val_mae: 0.1907 - 40ms/epoch - 3ms/step
Epoch 4/5000
15/15 - 0s - loss: 3.1410 - rmse: 0.1991 - mse: 0.0397 - mae: 0.1622 - val_loss: 2.9729 - val_rmse: 0.2095 - val_mse: 0.0439 - val_mae: 0.1763 - 43ms/epoch - 3ms/step
Epoch 5/5000
15/15 - 0s - loss: 2.8358 - rmse: 0.1964 - mse: 0.0386 - mae: 0.1587 - val_loss: 2.6838 - val_rmse: 0.2063 - val_mse: 0.0425 - val_mae: 0.1748 - 37ms/epoch - 2ms/step
Epoch 6/5000
15/15 - 0s - loss: 2.5599 - rmse: 0.1939 - mse: 0.0376 - mae: 0.1580 - val_loss: 2.42

INFO:tensorflow:Assets written to: data/output/activity_saved_model/assets
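
To compare the folds, the per-epoch metrics can be pulled out of log text like the output above. The following is a small sketch using only the standard library; `parse_keras_log` is a hypothetical helper and assumes the log format shown in this notebook (an `Epoch n/N` line followed by one metrics line). Metrics lines that were cut off before the timing fields are skipped rather than guessed.

```python
import re

# matches "name: number" pairs such as "val_rmse: 0.4196"
METRIC = re.compile(r"(\w+): ([0-9]*\.?[0-9]+)")
EPOCH = re.compile(r"Epoch (\d+)/\d+")


def parse_keras_log(text):
    """Return one dict per completely logged epoch (hypothetical helper)."""
    records, epoch = [], None
    for line in text.splitlines():
        line = line.strip()
        header = EPOCH.match(line)
        if header:
            epoch = int(header.group(1))
            continue
        # a complete metrics line ends with the timing fields, e.g. "3ms/step"
        if epoch is not None and line.endswith("/step"):
            metrics = {name: float(value) for name, value in METRIC.findall(line)}
            records.append({"epoch": epoch, **metrics})
        epoch = None
    return records


log = """Epoch 1/5000
15/15 - 0s - loss: 4.7379 - rmse: 0.8089 - mse: 0.6544 - mae: 0.6171 - val_loss: 4.1314 - val_rmse: 0.4196 - val_mse: 0.1761 - val_mae: 0.3554 - 489ms/epoch - 33ms/step
Epoch 2/5000
15/15 - 0s - loss: 3.9217 - rmse: 0.3363 - mse: 0.1131 - mae: 0.2753 - val_loss: 3.6763 - val_rmse: 0.2385 - val_mse: 0.0569 - val_mae: 0.1968 - 39ms/epoch - 3ms/step
Epoch 6/5000
15/15 - 0s - loss: 2.5859 - rmse: 0.2038 - mse: 0.0415 - mae: 0.1710 - val_loss: 2.45
"""
records = parse_keras_log(log)
```

Each record can then be loaded into a data frame, for example to plot `val_rmse` against `epoch` for each fold.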
