# Training of a Small Machine Learning (ML) Model 

In this part, we use machine learning (ML) models, particularly random forests (RF), to predict the cytotoxicity of our previously processed compounds. 

First, we need to import the necessary libraries.

In [2]:
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.ensemble import RandomForestRegressor
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
import copy
import pandas as pd
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize
import numpy as np

## Add Cytotoxicity Information from ToxCast to the Morphological Fingerprints from Broad Institute

* read in the file from the Part2-Notebook, containing the compounds with their respective SMILE strings and the morphological and molecular fingerprints
* Moreover, we read in the cytotoxicitiy data obatined from the ToxCast database
* To be able to join the ToxCast Compound IDs to our compounds from the broad institute, we additionally, need the compound information of the ToxCast compounds, which includes the SMILE representation of the compounds

In [3]:
dataset = pd.read_pickle('fingerprint_ds.pkl')
cpd_summary = pd.read_excel('../../data/INVITRODBv3_20181017.xls')
labels = pd.read_excel('../../data/cytotox_invitrodb_v4_1_SEPT2023.xlsx')

Now, the SMILE strings from the ToxCast compounds need to be standardized. To this end, we use almost the same function as in the Part2 notebook. However, we needed to add steps, where NaN values were dropped since the SMILES were not available for all compounds.

In [4]:
def clean_std_smiles(dataset):
    
    '''
    Clean, uncharge parents' smiles, get Inchi to standardize smiles
        Parameters : 
            dataset (data frame): data frame having a column named 'SMILES' with smiles
        Returns : 
            dataset (data frame): same data frame with one more column named 'STD_smile'
    '''
    
    dataset['Molecule'] = dataset.apply(lambda x: Chem.MolFromSmiles(x['SMILES']), axis = 1)
    dataset.dropna(subset=['Molecule'], inplace=True)
    clean_mol = [rdMolStandardize.Cleanup(mol) for mol in dataset['Molecule'].values ]
    parent_clean_mol = [rdMolStandardize.FragmentParent(mol) for mol in clean_mol]
    uncharger = rdMolStandardize.Uncharger() 
    uncharged_parent_clean_mol = [uncharger.uncharge(mol)for mol in parent_clean_mol]
    

    try : inchi = [Chem.MolToInchi(mol) for mol in uncharged_parent_clean_mol]
    except : print(f'cannot convert this smile into Inchi' )
    
    std_smile = list(map(Chem.MolFromInchi,inchi))
    dataset['STD_SMILE_MOLECULE'] = std_smile
    dataset.dropna(subset=['STD_SMILE_MOLECULE'], inplace=True)
    dataset['STD_SMILES'] = dataset.apply(lambda x: Chem.MolToSmiles(x['STD_SMILE_MOLECULE']), axis=1)
    # drop unnecessary columns
    dataset.drop(['STD_SMILE_MOLECULE', 'Molecule'], axis=1, inplace=True)
    return dataset

In [5]:
cpd_summary = clean_std_smiles(cpd_summary)

[12:47:39] SMILES Parse Error: syntax error while parsing: -
[12:47:39] SMILES Parse Error: Failed parsing SMILES '-' for input: '-'
[12:47:39] SMILES Parse Error: syntax error while parsing: -
[12:47:39] SMILES Parse Error: Failed parsing SMILES '-' for input: '-'
[12:47:39] SMILES Parse Error: syntax error while parsing: -
[12:47:39] SMILES Parse Error: Failed parsing SMILES '-' for input: '-'
[12:47:39] SMILES Parse Error: syntax error while parsing: -
[12:47:39] SMILES Parse Error: Failed parsing SMILES '-' for input: '-'
[12:47:39] SMILES Parse Error: syntax error while parsing: -
[12:47:39] SMILES Parse Error: Failed parsing SMILES '-' for input: '-'
[12:47:39] SMILES Parse Error: syntax error while parsing: -
[12:47:39] SMILES Parse Error: Failed parsing SMILES '-' for input: '-'
[12:47:39] SMILES Parse Error: syntax error while parsing: -
[12:47:39] SMILES Parse Error: Failed parsing SMILES '-' for input: '-'
[12:47:39] SMILES Parse Error: syntax error while parsing: -
[12:47:3

In [6]:
cpd_summary = cpd_summary.loc[:, ['DTXSID', 'SMILES', 'STD_SMILES']]

Now, we join the cytotoxicity information with the compound SMILES.

In [7]:
label_with_smiles = cpd_summary.join(labels.set_index('dsstox_substance_id'), on='DTXSID')
label_with_smiles.dropna(subset=['STD_SMILES', 'cytotox_median_log'])
label_with_smiles = label_with_smiles.loc[:, ['STD_SMILES', 'cytotox_median_log']]

We, drop the unnecessary columns from part 2, i.e., the euclidean distances per fingerprint.

In [8]:
dataset.drop(['euclidean_distance_morganB_fps', 'euclidean_distance_morphological_fingerprint'], axis = 1, inplace=True)

And join the cytotoxicity data with our morphological compound fingerprints on the standardized SMILES.

In [11]:
full_dataset = dataset.join(label_with_smiles.set_index('STD_SMILES'), on='STD_smile')

We drop all columns without response.

In [12]:
full_dataset.dropna(subset=['cytotox_median_log'], inplace=True)
full_dataset.reset_index(inplace=True)

In [13]:
print(f'Now, we have left {len(full_dataset)} compounds with associated cytotoxicity value.')

Now, we have left 932 compounds with associated cytotoxicity value.


## Split Data in Training and Test Set

* For ML, we always need a training and a test set that are pairwise disjoint. To this end, we split the data in two parts, where the training set will contain 80% of the samples and the test set the remaining 20%.
* Moreover, we need a response vector y, i.e., the column with the cytotoxicity information and a feature matrix X, i.e., a matrix with the morphological or molecular fingerprints or a combination thereof, for each compound.

We start with X conatining only the morphological fingerprints.

In [14]:
y = copy.deepcopy(full_dataset['cytotox_median_log'].values)
X = full_dataset.copy(deep=True)
X.drop(['cytotox_median_log', 'CPD_NAME', 'morphological_fingerprint', 'morganB_fps', 'STD_smile', 'CPD_SMILES'], axis = 1, inplace = True)

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)

## Cross Validation and Hyperparameter Tuning

In ML, we want to minimize a loss function, e.g., the mean-squared error (MSE). In particular, we want to fit the training data in order to minimize the loss on the training samples, while still having a small error on the unseen test samples. Moreover, there are hyperparameters defining the hypothesis space of the model, which can be tuned in order to minimize the estimated test error. The test error can be estimated using a so-called k-fold cross validation (CV), which is performed in the following.

For an RF, there are three hyperparameters that we want to tune in order to optimize the estimated test error: 
* depth of the trees in the forest: 10, 20, or 30
* number of trees in the forest: 100, 300, or 500
* minimum number of samples per leaf: 10, 15, or 20

In [16]:
def rf_cross_validation(X_train, y_train, max_depth_range = [10,20,30], num_tree_range = [100,300,500], min_samples_leaf_range = [10,15,20]):
    '''
        performs a 5-fold CV for a Random Forest for given X_train and y_train
        
        @param X_train: the training matrix
        @param y_train: the associated response vector
        @param max_depth_range: list containing the values that should be tested for max depth, default [10,20,30]
        @param num_tree_range: list containing the values that should be tested for the number of trees, default [100,300,500]
        @param min_samples_leaf_range: list containing the values that should be tested for the minimum number of samples per leaf, default [10,15,20]
        
        @return: a forest with the best hyperparameter according to the estimated test MSE and trained on the whole training set
    '''
    best_score = -float('inf')
    for depth in max_depth_range:
        cv_results = cross_validate(RandomForestRegressor(random_state=42, max_depth=depth), X = X_train, y=y_train, scoring='neg_mean_squared_error', verbose=4, cv=5)
        score = np.mean(cv_results['test_score'])
        if score > best_score:
            best_score = score
            best_depth = depth
            
    best_score = -float('inf')
    for n_tree in num_tree_range:
        cv_results = cross_validate(RandomForestRegressor(random_state=42, n_estimators=n_tree), X = X_train, y=y_train, scoring='neg_mean_squared_error', verbose=4, cv=5)
        score = np.mean(cv_results['test_score'])
        if score > best_score:
            best_score = score
            best_n_tree = n_tree
    
    best_score = -float('inf')
    for num_samples in min_samples_leaf_range:
        cv_results = cross_validate(RandomForestRegressor(random_state=42, min_samples_leaf=num_samples), X = X_train, y=y_train, scoring='neg_mean_squared_error', verbose=4, cv=5)
        score = np.mean(cv_results['test_score'])
        if score > best_score:
            best_score = score
            best_min_samples = num_samples
    
    rf = RandomForestRegressor(random_state=42, n_estimators=best_n_tree, max_depth=best_depth, min_samples_leaf=best_min_samples)
    rf.fit(X_train, y_train)
    return rf

Will need, approximately 90 minutes as the training process per split is quite time consuming.

In [17]:
best_estimator = rf_cross_validation(X_train, y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] END ......................................, score=-0.542 total time=  51.9s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   51.9s remaining:    0.0s


[CV] END ......................................, score=-0.570 total time=  52.6s


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.7min remaining:    0.0s


[CV] END ......................................, score=-0.415 total time=  52.3s


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  2.6min remaining:    0.0s


[CV] END ......................................, score=-0.464 total time=  51.8s
[CV] END ......................................, score=-0.580 total time=  52.0s


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  4.3min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] END ......................................, score=-0.540 total time= 1.2min


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.2min remaining:    0.0s


[CV] END ......................................, score=-0.584 total time= 1.3min


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  2.5min remaining:    0.0s


[CV] END ......................................, score=-0.432 total time= 1.2min


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  3.7min remaining:    0.0s


[CV] END ......................................, score=-0.468 total time= 1.2min
[CV] END ......................................, score=-0.583 total time= 1.1min


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  6.0min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] END ......................................, score=-0.539 total time= 1.3min


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.3min remaining:    0.0s


[CV] END ......................................, score=-0.584 total time= 1.4min


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  2.7min remaining:    0.0s


[CV] END ......................................, score=-0.435 total time= 1.3min


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  4.0min remaining:    0.0s


[CV] END ......................................, score=-0.470 total time= 1.2min
[CV] END ......................................, score=-0.584 total time= 1.2min


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  6.3min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] END ......................................, score=-0.539 total time= 1.3min


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.3min remaining:    0.0s


[CV] END ......................................, score=-0.585 total time= 1.4min


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  2.7min remaining:    0.0s


[CV] END ......................................, score=-0.435 total time= 1.3min


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  3.9min remaining:    0.0s


[CV] END ......................................, score=-0.470 total time= 1.2min
[CV] END ......................................, score=-0.583 total time= 1.2min


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  6.3min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] END ......................................, score=-0.543 total time= 3.8min


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  3.8min remaining:    0.0s


[CV] END ......................................, score=-0.575 total time= 3.9min


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  7.8min remaining:    0.0s


[CV] END ......................................, score=-0.430 total time= 3.7min


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed: 11.5min remaining:    0.0s


[CV] END ......................................, score=-0.471 total time= 3.6min
[CV] END ......................................, score=-0.586 total time= 3.6min


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed: 18.7min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] END ......................................, score=-0.545 total time= 6.4min


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  6.4min remaining:    0.0s


[CV] END ......................................, score=-0.586 total time= 6.7min


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 13.0min remaining:    0.0s


[CV] END ......................................, score=-0.430 total time= 6.2min


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed: 19.2min remaining:    0.0s


[CV] END ......................................, score=-0.472 total time= 6.1min
[CV] END ......................................, score=-0.582 total time= 6.0min


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed: 31.3min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] END ......................................, score=-0.549 total time=  38.6s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   38.6s remaining:    0.0s


[CV] END ......................................, score=-0.556 total time=  40.1s


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.3min remaining:    0.0s


[CV] END ......................................, score=-0.418 total time=  37.0s


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  1.9min remaining:    0.0s


[CV] END ......................................, score=-0.467 total time=  37.5s
[CV] END ......................................, score=-0.578 total time=  36.6s


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  3.2min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] END ......................................, score=-0.545 total time=  32.1s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   32.1s remaining:    0.0s


[CV] END ......................................, score=-0.553 total time=  33.2s


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.1min remaining:    0.0s


[CV] END ......................................, score=-0.419 total time=  30.6s


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  1.6min remaining:    0.0s


[CV] END ......................................, score=-0.475 total time=  30.7s
[CV] END ......................................, score=-0.592 total time=  29.9s


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  2.6min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] END ......................................, score=-0.540 total time=  27.5s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   27.5s remaining:    0.0s


[CV] END ......................................, score=-0.555 total time=  28.9s


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   56.5s remaining:    0.0s


[CV] END ......................................, score=-0.423 total time=  26.5s


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  1.4min remaining:    0.0s


[CV] END ......................................, score=-0.468 total time=  26.7s
[CV] END ......................................, score=-0.593 total time=  26.2s


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  2.3min finished


Now, we print the hyperparameters of the forest, to see what performed best in the CV.

In [20]:
param_dict = best_estimator.get_params()
print(f'We use a max depth of {param_dict["max_depth"]}, {param_dict["n_estimators"]} trees, and at least {param_dict["min_samples_leaf"]} samples per leaf.')

We use a max depth of 10, 300 trees, and at least 10 samples per leaf.


## Prediction on Test Set 

Now, we predict for the test data, i.e., so far unseen samples to assess the real test error.

In [21]:
y_pred  = best_estimator.predict(X_test)

And assess the MSE on our test set prediciton.

In [22]:
print(f'The RF trained on morphological fingerprints reached an MSE of {mean_squared_error(y_pred=y_pred, y_true=y_test)}')

The RF trained on morphological fingerprints reached an MSE of 0.4719982082182938


## Using Molecular (Morgan) Fingerprints

In [None]:
y = copy.deepcopy(full_dataset['cytotox_median_log'].values)
X = full_dataset.copy(deep=True)
X = X.loc[:, ['morganB_fps']].values

In [None]:
new_X = []
for i, l in enumerate(X):
    new_X.append(np.array(l[0]))
new_X = np.array(new_X)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(new_X, y, test_size=0.2, random_state=42, shuffle=True)

In [None]:
rf_morgan = RandomForestRegressor(random_state=42)

In [None]:
rf_morgan.fit(X_train, y_train)

In [None]:
y_pred_rf_morgan = rf_morgan.predict(X_test)

In [None]:
mean_squared_error(y_true=y_test, y_pred=y_pred_rf_morgan)

In [None]:
X = full_dataset.copy(deep=True)
X = X.loc[:, ['morphological_fingerprint', 'morganB_fps']]
X['appended_profile'] = X.apply(lambda x: np.concatenate([x['morphological_fingerprint'], x['morganB_fps']]), axis=1)
X.drop(['morphological_fingerprint', 'morganB_fps'], axis=1, inplace=True)

In [None]:
X = X.values

In [None]:
new_X = []
for i, l in enumerate(X):
    new_X.append(np.array(l[0]))
new_X = np.array(new_X)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(new_X, y, test_size=0.2, random_state=42, shuffle=True)

In [None]:
rf_combined_morgan_morph  = RandomForestRegressor(random_state=42)

In [None]:
rf_combined_morgan_morph.fit(X_train, y_train)

In [None]:
y_pred_combi = rf_combined_morgan_morph.predict(X_test)

In [None]:
mean_squared_error(y_true=y_test, y_pred=y_pred_combi)