# KnowTox_manuscript_SI
This notebook is part of the supporting information to <br />

<b> KnowTox: Pipeline and Case Study for Confident Prediction of Potential Toxic Effects of Compounds in Early Phases of Development </b><br />
A. Morger<sup>1</sup>, M. Mathea<sup>2</sup>, J. H. Achenbach<sup>2</sup>, A. Wolf<sup>2</sup>, R. Buesen<sup>2</sup>, K-J. Schleifer<sup>2</sup>, R. Landsiedel<sup>2</sup>, A. Volkamer<sup>1</sup><br />
<sup>1</sup>: <i>In Silico</i> Toxicology and Structural Bioinformatics, Charité Universitätsmedizin, Berlin, Germany, [volkamerlab.org](https://physiologie-ccm.charite.de/en/research_at_the_institute/volkamer_lab/) <br />
<sup>2</sup>: BASF SE, Ludwigshafen, Germany

## Content

In this notebook, we demonstrate how a conformal predictor is built, how the model is applied to make predictions for external data, and how to evaluate the internal (crossvalidation) and external predictions.

The notebook consists of three main parts. 

1. Preparation: 
    - Required Python libraries are loaded
    - Paths and parameters are defined 
2. Helper functions for 
    - Loading and formatting data
    - Conformal predictor (CP) training on ToxCast data
    - Conformal prediction on external data
    - Evaluation of the predictions (internal crossvalidation and external data). 
3. Main script: 
    - Datasets are loaded
    - Three CP models are built on ToxCast (original, normalised, normalised+balanced)
    - Models are evaluated (internal and external)

### Table of contents
1. [Preparation](#preparation) <br>
    1.1. [Import libraries and modules](#import-libraries-and-modules)<br>
    1.2. [Define data paths](#define-data-paths)<br>
    1.3. [Define parameters](#define-parameters)<br>
2. [Define helper functions](#define-helper-functions)<br>
    2.1. [Load and format data](#load-and-format)<br>
    2.2. [Train conformal predictor on ToxCast dataset](#train-conformal-predictor)<br>
    2.3. [Make conformal prediction for external dataset](#make-conformal-prediction)<br>
    2.4. [Evaluate conformal predictors and predictions](#evaluate-conformal-predictors)<br>
3. [Main script: apply helper functions to different model set-ups](#main-script)<br>
    3.1. [Load ToxCast data to train conformal predictor](#load-toxcast-data)<br>
    3.2. [Load external data to make conformal prediction](#load-external-data)<br>
    3.3. [Train and make predictions with original, normalised and normalised+balanced model and evaluate](#train-and-make)<br>

## 1. Preparation <a name="preparation"></a>
### 1.1. Import libraries and modules <a name="import-libraries-and-modules"></a>

In [1]:
# Import stdlib and 3rd party packages
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsRegressor
from nonconformist.icp import IcpClassifier
from nonconformist.nc import InverseProbabilityErrFunc, NcFactory
from nonconformist.acp import AggregatedCp, RandomSubSampler

# Import own library
from MyEqualSizeSampler import EqualSizeSampler

### 1.2. Define data paths<a name="define-data-paths"></a>
The input files are provided as part of this GitHub repository. They contain preprocessed data as described in the manuscript (and briefly in the `README`). The original data were downloaded from [[1]](https://figshare.com/articles/ToxCast_and_Tox21_Data_Spreadsheet/6062503) (ToxCast dataset) and [[2]](https://www.tandfonline.com/doi/full/10.1080/1062936X.2016.1172665) (external data).

In [2]:
# Input files
train_file = '../data/toxcast_CP_endpoints_MorganMACCS_mmpcReduced.csv'
predict_file = '../data/external_AA_endpoint_MorganMACCS_mmpcReduced.csv'

# Output files
output_directory = '../data/output/'
crossvalidation_output_file = f'{output_directory}crossvalidation'
model_pickle_output_file = f'{output_directory}model'
prediction_output_file = f'{output_directory}prediction'

### 1.3. Define parameters <a name="define-parameters"></a>
The parameters are set as described in the Data and Methods section of the manuscript. 
- To ensure reproducibility, do not change the parameters. 
- However, they can be changed if you would like to adapt the notebook for your own purposes.
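Two of the parameters in this section are small functions: the aggregation function (a median over the ICPs of an ACP) and the Mondrian condition (the class label of an instance). The following sketch illustrates both on toy values using only the standard library. It assumes that the p-values of an ACP are laid out as compound × class × model, so that `axis=2` is the model axis; the numbers are hypothetical.

```python
from statistics import median

# Hypothetical p-values for 2 compounds, 2 classes and 3 ICPs,
# laid out as [compound][class][model] (model axis assumed to be last)
p_values = [
    [[0.10, 0.30, 0.20], [0.70, 0.90, 0.80]],
    [[0.55, 0.45, 0.50], [0.05, 0.25, 0.15]],
]

# Median over the model axis, analogous to np.median(x, axis=2)
aggregated = [[median(per_model) for per_model in compound] for compound in p_values]
print(aggregated)

# A Mondrian condition maps a (features, label) pair to a category;
# here the category is simply the class label
condition = lambda instance: instance[1]
print(condition(([0.0, 1.0, 0.0], 1)))
```

After aggregation, each compound is left with one p-value per class, regardless of how many ICPs were trained.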

In [3]:
# Input related
endpoint_toxcast = '762' # Androgen receptor antagonism endpoint from ToxCast
endpoint_external = 'AA' # Androgen receptor antagonism endpoint in external dataset

fp_MorganMACCS = 'morgan_maccs' # Used to train original model
fp_mmpcReduced = 'mm_pc_reduced' # Used to train normalised and normalised+balanced model

# CP training related
test_portion = 0.2 # Ratio to divide data into test and training set
cal_portion = 0.3 # Ratio to divide training set into calibration and proper training set
num_models = 5 # Number of ICPs trained per ACP (the manuscript uses 25, which takes considerably longer to compute)
num_trees = 500 # Number of trees built in random forest

icp_classifier_condition = (lambda instance: instance[1]) # Mondrian condition in conformal prediction
aggregation_function = (lambda x: np.median(x,axis=2)) # Function to aggregate p-values in ACP
error_function = InverseProbabilityErrFunc() # Nonconformity measure

# Evaluation related
cv = 5 # Number of folds in crossvalidation

significance_level = 0.2 # Significance level used to evaluate conformal predictor/prediction

## 2. Define helper functions <a name="define-helper-functions"></a>
### 2.1. Functions to load and format data <a name="load-and-format"></a>
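The fingerprints are stored in the CSV files as bracketed strings and have to be converted to numeric vectors. A minimal sketch of this parsing step, assuming a hypothetical cell value `raw`; it mirrors the strip-and-split approach of the loader below and compares it with `json.loads`, which works here because the bracketed string happens to be valid JSON.

```python
import json

# Hypothetical fingerprint cell as it might appear in the CSV file
raw = "[0.0, 1.0, 0.0, 1.0]"

# Same idea as the loader: strip the brackets, split on commas, cast to float
by_split = [float(i) for i in raw.replace('[', '').replace(']', '').split(',')]

# Alternative: parse the bracketed string as JSON
by_json = [float(i) for i in json.loads(raw)]

print(by_split)
print(by_split == by_json)
```

In the notebook, each parsed vector is additionally wrapped in `np.asarray`, so that a column of equally long fingerprints can later be stacked into a 2D array.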

In [4]:
# Helper function to load and format molecules and descriptors from file
def load_descriptor_dataframe(dataframe_file, cols, fingerprints):
    """
    Read data from csv file 
    """
    
    dataframe = pd.read_csv(dataframe_file, usecols=cols)
    dataframe.dropna(inplace=True)
    
    # Reformat fingerprint information
    for fp in fingerprints:
        dataframe[fp] = dataframe[fp].apply(
            lambda f: np.asarray([float(i) for i in f.replace('[', '').replace(']', '').split(',')]))
    
    # Reformat casn entry
    if 'casn' in dataframe.columns:
        dataframe['casn'] = dataframe['casn'].str.replace(',', '_', regex=False)
    
    print(dataframe.shape)
    display(dataframe.head())
    
    return dataframe

In [5]:
# Helper function to generate numpy arrays from dataframe column
def format_to_numpy_array_for_ML(dataframe, column_name):
    """
    Format data in column to numpy array for machine learning 
    """
    
    column_list = dataframe[column_name].tolist()
    np_array = np.asarray(column_list)
    
    return np_array

### 2.2. Function to train (and validate) conformal predictor on ToxCast dataset <a name="train-conformal-predictor"></a>

#### Define function to train conformal predictors on ToxCast dataset
Within a 5-fold crossvalidation, one ACP is trained per fold (five in total) and collected in `acps_cv`. 
The predictions of the crossvalidation are written to `crossvalidation_output_file`.

In this function, the model parameters are defined according to the chosen model set-up:
* Original model: No normaliser model, no equal size sampling
* Normalised model: KNN Regressor normaliser model, no equal size sampling
* Normalised+balanced model: KNN Regressor normaliser model, equal size sampling
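Independent of the set-up, each ICP turns nonconformity scores into class-wise p-values. The sketch below shows the textbook (non-smoothed, non-normalised) Mondrian formula on hypothetical scores, with nonconformity taken as 1 minus the predicted probability of the label. The function name `mondrian_p_values` is illustrative, and the `nonconformist` implementation may differ in detail (e.g. smoothing or normalisation).

```python
def mondrian_p_values(cal_scores, cal_labels, test_scores):
    """
    cal_scores: nonconformity scores of calibration compounds (for their true label)
    cal_labels: true labels of the calibration compounds
    test_scores: dict mapping each candidate label to the test compound's
                 nonconformity score under that label
    """
    p_values = {}
    for label, score in test_scores.items():
        # Mondrian: compare only against calibration compounds of the same class
        same_class = [s for s, y in zip(cal_scores, cal_labels) if y == label]
        at_least_as_strange = sum(s >= score for s in same_class)
        p_values[label] = (at_least_as_strange + 1) / (len(same_class) + 1)
    return p_values

# Hypothetical calibration scores and labels
cal_scores = [0.1, 0.2, 0.4, 0.7, 0.3, 0.6, 0.8]
cal_labels = [0, 0, 0, 0, 1, 1, 1]

# Hypothetical class probabilities for one test compound
test_probability = {0: 0.75, 1: 0.25}
test_scores = {label: 1 - prob for label, prob in test_probability.items()}

p = mondrian_p_values(cal_scores, cal_labels, test_scores)
print(p)
```

Because each class is calibrated separately, the minority class gets its own p-value distribution, which is the motivation for the Mondrian condition on imbalanced data.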

In [6]:
def train_acp_cv(knowtox_model, data_X, class_y, casn, smiles, cv_folds, endpoint, fingerprint):
    """
    Train ACP within a crossvalidation and save internal predictions
    """
    
    # Split data for crossvalidation
    kf = StratifiedKFold(n_splits=cv_folds, random_state=42, shuffle=True)
    acps_cv = [] # Save all ACPs to use for further prediction
    
    # Prepare output file
    cv_outfile = open(f'{crossvalidation_output_file}_{endpoint}_{fingerprint}_{knowtox_model}.csv', 'w')
    header = f'name,{endpoint},p0,p1,prediction,casn,smiles\n'
    cv_outfile.write(header)
    
    # Define parameters for chosen CP model set-up
    if knowtox_model == 'original':
        normaliser_model = None
        acp_option = RandomSubSampler()
    elif knowtox_model == 'normalised':
        normaliser_model = KNeighborsRegressor()
        acp_option = RandomSubSampler()
    elif knowtox_model == 'normalised_balanced':
        normaliser_model = KNeighborsRegressor()
        acp_option = EqualSizeSampler()
    else:
        raise ValueError(f'Unknown model set-up: {knowtox_model}')

    print(knowtox_model, ' model:')
    
    # Fit model within crossvalidation and make prediction for respective test set
    for train_index, test_index in kf.split(data_X, class_y):
        
        # Prepare the data splits
        X_train, X_test = data_X[train_index], data_X[test_index]
        y_train, y_test = class_y[train_index], class_y[test_index]
        casn_train, casn_test = casn[train_index], casn[test_index]
        smiles_train, smiles_test = smiles[train_index], smiles[test_index]
        
        # Create and train model on training set
        forest = RandomForestClassifier(n_estimators=num_trees)
        nc = NcFactory.create_nc(forest, err_func=error_function, normalizer_model=normaliser_model)
        icp = IcpClassifier(nc, condition=icp_classifier_condition)
        acp = AggregatedCp(n_models=num_models, predictor=icp, sampler=acp_option,
                           aggregation_func=aggregation_function)
        acp.fit(X_train, y_train)
        
        # Make predictions for test set
        predictions = acp.predict(X_test)
        
        # Write p-values and further compound information for each split into output file
        long_line = ''
        for n, i in enumerate(X_test):
            newline = f'{n}, {y_test[n]}, {predictions[n][0]}, {predictions[n][1]}, {predictions[n]},'\
                        f'{casn_test[n].replace(",",".")}, {smiles_test[n]}\n'
            long_line += newline
        cv_outfile.write(long_line)
        
        # Collect ACPs for later predictions of external test data
        acps_cv.append(acp)
    
    cv_outfile.close()
    
    # Return trained ACP models
    return acps_cv

### 2.3. Function to make conformal prediction for external dataset <a name="make-conformal-prediction"></a>
#### Define function to apply the model to a data set and to calculate p-values
- First, pretrained ACP models (`acps`) are used to make predictions for an external dataset (`predict_df`). 
- Second, the predictions are saved in form of p-values for the active and inactive class, separately. 
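The aggregation across the crossvalidation ACPs amounts to a per-class mean of their p-values. A minimal sketch, assuming a hypothetical `StubACP` class that stands in for a trained ACP and returns fixed p-values:

```python
from statistics import mean

class StubACP:
    """Hypothetical stand-in for a trained ACP: returns fixed (p0, p1) per compound."""
    def __init__(self, p0, p1):
        self.p0, self.p1 = p0, p1

    def predict(self, X):
        return [(self.p0, self.p1) for _ in X]

# Three stub models instead of the five ACPs trained in crossvalidation
acps = [StubACP(0.5, 0.25), StubACP(0.25, 0.5), StubACP(0.75, 0.0)]
X = [[0.0, 1.0]]  # a single compound with a toy fingerprint

# Collect the (p0, p1) pair of every model for this compound, then average per class
per_model = [acp.predict(X)[0] for acp in acps]
p0 = mean(p[0] for p in per_model)
p1 = mean(p[1] for p in per_model)
print(p0, p1)
```

Averaging per class keeps the two p-values independent of each other, so a compound can still end up with both, one or neither label in its prediction set.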

In [7]:
def predict_acp_get_pvalues(acps, predict_df, chosen_fp, endpoint):
    """
    Make conformal predictions for external test data and save p-values to dataframe
    """
    
    smiles_dict = {}
    fingerprint_dict = {}
    prediction_dict = {}
    
    # Store necessary data from data set in dictionaries
    for idx, row in predict_df.iterrows():
        smiles_dict[row['Name']] = row['smiles']
        fingerprint_dict[row['Name']] = row[chosen_fp]
    
    # Make prediction for each entry (class-wise)
    for tmp_entry, tmp_fp in fingerprint_dict.items():
        predictions_p0 = np.array([])
        predictions_p1 = np.array([])
        
        # With each of the ACPs trained in CV
        for a in acps:
            p = a.predict(np.array(tmp_fp, ndmin=2))
            predictions_p0 = np.append(predictions_p0, p[0][0])
            predictions_p1 = np.append(predictions_p1, p[0][1])
        prediction_dict[(tmp_entry + '_p0')] = np.mean(predictions_p0)
        prediction_dict[(tmp_entry + '_p1')] = np.mean(predictions_p1)
    
    # Calculate p-values per class
    pvalues_class0 = []
    pvalues_class1 = []

    for i, r in predict_df.iterrows():
        pvalues_class0.append(prediction_dict[r['Name'] + '_p0'])
        pvalues_class1.append(prediction_dict[r['Name'] + '_p1'])  
    
    predict_df['p0'] = pvalues_class0
    predict_df['p1'] = pvalues_class1
    predict_df[chosen_fp] = predict_df[chosen_fp].apply(lambda fp: [i for i in fp])
       
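    # Note: `knowtox_model` is not a parameter; it is read from the global scope (set in the main loop)
    predict_df.to_csv(f'{prediction_output_file}_{endpoint}_{chosen_fp}_{knowtox_model}.csv')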
    
    return predict_df, fingerprint_dict, prediction_dict

### 2.4. Evaluate conformal predictors and predictions <a name="evaluate-conformal-predictors"></a>
#### Define evaluation functions to calculate validity, efficiency and accuracy
Three functions are defined to evaluate the conformal predictions with respect to validity, efficiency and accuracy. Each function returns a dictionary containing three entries, _i.e._ the values for all compounds, the active compounds and the inactive compounds.
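To make the three measures concrete, the following sketch evaluates five hypothetical compounds at a significance level of 0.2. A label enters the prediction set if its p-value is at least the significance level. The helper `prediction_set` and all values are illustrative and not part of the notebook.

```python
significance = 0.2

# Hypothetical compounds: (true label, p0, p1)
compounds = [
    (0, 0.60, 0.10),  # set {0}: single label, correct
    (0, 0.50, 0.40),  # set {0, 1}: valid but not efficient
    (1, 0.30, 0.05),  # set {0}: single label, wrong
    (1, 0.10, 0.70),  # set {1}: single label, correct
    (1, 0.05, 0.15),  # empty set: neither valid nor efficient
]

def prediction_set(p0, p1):
    """Labels whose p-value reaches the significance level."""
    return {label for label, p in ((0, p0), (1, p1)) if p >= significance}

sets = [(y, prediction_set(p0, p1)) for y, p0, p1 in compounds]

# Validity: share of prediction sets containing the true label
validity = sum(y in s for y, s in sets) / len(sets)

# Efficiency: share of prediction sets containing exactly one label
single = [(y, s) for y, s in sets if len(s) == 1]
efficiency = len(single) / len(sets)

# Accuracy: share of single-label sets whose label is the true one
accuracy = sum(y in s for y, s in single) / len(single)

print(validity, efficiency, round(accuracy, 3))
```

Note that accuracy is computed relative to the single-label predictions only, so it can be high even when efficiency is low.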
##### 2.4.1 Validity

In [8]:
def calculate_set_sizes(df, ep):
    """
    Calculate total number of compounds, class-wise and all compounds in data set
    """
    
    nof_neg = float(sum(df[ep].values == 0.0))
    nof_pos = float(sum(df[ep].values == 1.0))
    nof_all = float(nof_neg + nof_pos)
    
    return (nof_all, nof_neg, nof_pos)

In [9]:
def calculate_validity(dataframe, endpoint, significance):
    """
    Calculate ratio of valid predictions, i.e. prediction sets containing the correct label
    """
    
    # Calculate total number of compounds, class-wise and all compounds
    total, total_0, total_1 = calculate_set_sizes(dataframe, endpoint)
      
    # Calculate number of wrongly predicted compounds 
    # (correct label not in prediction set at given significance level)
    # class-wise
    error_0 = sum((dataframe[endpoint].values == 0.0) & (dataframe.p0.values < significance))
    error_1 = sum((dataframe[endpoint].values == 1.0) & (dataframe.p1.values < significance))

    # Calculate error rate, class-wise and for all compounds
    error_rate_0 = np.round(error_0 / total_0, 3)
    error_rate_1 = np.round(error_1 / total_1, 3)
    error_rate = np.round(((error_0 + error_1) / total), 3)
    
    return {'validity': (1 - error_rate), 'validity_1': (1 - error_rate_1), 'validity_0': (1 - error_rate_0)}

##### 2.4.2 Efficiency

In [10]:
def calculate_nof_one_class_predictions(df, ep, label, significance):
    """
    Calculate number of one class predictions for a specific class at a given significance level
    """
    
    # Get number of compounds that have the respective label 
    # and for which only one of the p-values fulfils the significance level
    nof = sum((df[ep].values == label) 
               & (((df.p0.values < significance) & (df.p1.values >= significance)) 
                  | ((df.p0.values >= significance) & (df.p1.values < significance))))
    
    return nof

In [11]:
def calculate_efficiency(dataframe, endpoint, significance):
    """
    Calculate ratio of efficient predictions, i.e. prediction sets containing one single label
    """
    
    # Calculate total number of compounds, class-wise and all compounds
    total, total_0, total_1 = calculate_set_sizes(dataframe, endpoint)
    
    # Calculate number of efficiently predicted compounds 
    # (only one label not in prediction set at given significance level)
    # class-wise
    efficiency_0 = calculate_nof_one_class_predictions(dataframe, endpoint, 0.0, significance)
    efficiency_1 = calculate_nof_one_class_predictions(dataframe, endpoint, 1.0, significance)
    
    # Calculate efficiency rate, class-wise and for all compounds
    efficiency_rate_0 = np.round(efficiency_0 / total_0, 3)
    efficiency_rate_1 = np.round(efficiency_1 / total_1, 3)
    efficiency_rate = np.round(((efficiency_0 + efficiency_1) / total), 3)
    
    return {'efficiency': efficiency_rate, 'efficiency_1': efficiency_rate_1, 
            'efficiency_0': efficiency_rate_0}

##### 2.4.3 Accuracy

In [12]:
def calculate_accuracy(dataframe, endpoint, significance):
    """
    Calculate ratio of accurate predictions, i.e. efficient prediction sets containing the one correct label
    """
    # Calculate number of efficiently predicted compounds 
    # (only one label not in prediction set at given significance level)
    # class-wise
    efficiency_0 = calculate_nof_one_class_predictions(dataframe, endpoint, 0.0, significance)
    efficiency_1 = calculate_nof_one_class_predictions(dataframe, endpoint, 1.0, significance)
    efficiency = efficiency_0 + efficiency_1
    
    # Calculate number of correctly and efficiently predicted compounds 
    # (only one correct label in prediction set at given significance level)
    # class-wise
    accuracy_0 = sum((dataframe[endpoint].values == 0.0) & (dataframe.p0.values >= significance) &
                     (dataframe.p1.values < significance))
    accuracy_1 = sum((dataframe[endpoint].values == 1.0) & (dataframe.p0.values < significance) & 
                     (dataframe.p1.values >= significance))
    
    # Calculate accuracy rate, class-wise and for all compounds
    accuracy_rate_0 = np.round(accuracy_0 / efficiency_0, 3)
    accuracy_rate_1 = np.round(accuracy_1 / efficiency_1, 3)
    accuracy_rate = np.round(((accuracy_0 + accuracy_1) / efficiency), 3)
    
    return {'accuracy': accuracy_rate, 'accuracy_1': accuracy_rate_1, 'accuracy_0': accuracy_rate_0}

## 3. Main script: apply helper functions to different model set-ups <a name="main-script"></a>

The functions defined above are applied to the three model set-ups
* Original model
* Normalised model
* Normalised+balanced model

### 3.1. Load ToxCast data to train a conformal predictor <a name="load-toxcast-data"></a>
The ToxCast dataset is loaded and formatted for machine learning.

In [13]:
# Define input columns and fingerprints
cols = [endpoint_toxcast, 'smiles', fp_MorganMACCS, fp_mmpcReduced, 'casn']
fingerprints = [fp_MorganMACCS, fp_mmpcReduced]

# Load dataframe
train_dataframe = load_descriptor_dataframe(train_file, cols, fingerprints)

# Generate numpy arrays as input for machine learning/conformal prediction
X_MorganMACCS = format_to_numpy_array_for_ML(train_dataframe, fp_MorganMACCS)
X_mmpcReduced = format_to_numpy_array_for_ML(train_dataframe, fp_mmpcReduced)
toxcast_y = format_to_numpy_array_for_ML(train_dataframe, endpoint_toxcast)
toxcast_casn = format_to_numpy_array_for_ML(train_dataframe, 'casn')
toxcast_smiles = format_to_numpy_array_for_ML(train_dataframe, 'smiles')

(6713, 5)


| | casn | 762 | smiles | morgan_maccs | mm_pc_reduced |
|---|---|---|---|---|---|
| 0 | 143-50-0 | 1.0 | O=C1C2(Cl)C3(Cl)C4(Cl)C(Cl)(Cl)C5(Cl)C3(Cl)C1(... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 1 | 2385-85-5 | 0.0 | ClC1(Cl)C2(Cl)C3(Cl)C4(Cl)C(Cl)(Cl)C5(Cl)C3(Cl... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 2 | 306-94-5 | 0.0 | FC1(F)C(F)(F)C(F)(F)C2(F)C(F)(F)C(F)(F)C(F)(F)... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 3 | 86508-42-1 | 0.0 | FC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 4 | 105-06-6 | 0.0 | C=Cc1ccc(C=C)cc1 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ... |


### 3.2. Load external data to make conformal prediction <a name="load-external-data"></a>
The external data is loaded and formatted for machine learning.

In [14]:
# Define input columns and fingerprints
cols=['Name', endpoint_external, 'smiles', fp_MorganMACCS, fp_mmpcReduced]
fingerprints = [fp_MorganMACCS, fp_mmpcReduced]

# Load dataframe
external_df = load_descriptor_dataframe(predict_file, cols, fingerprints)

(361, 5)


| | Name | smiles | AA | morgan_maccs | mm_pc_reduced |
|---|---|---|---|---|---|
| 0 | M1013056 | CC(O)C1(O)CCC2C3CCC4=CC(=O)CCC4(C)C3CCC21C | 1.0 | [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ... | [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ... |
| 1 | M1014521 | CC(=O)c1ccc2ccc3cccc4ccc1c2c34 | 1.0 | [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 2 | M10645 | Oc1c(Cl)c(Cl)c(O)c(Cl)c1Cl | 1.0 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 3 | M10749 | Clc1ccc(-c2cc(Cl)cc(Cl)c2)cc1 | 1.0 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 4 | M11028 | Clc1ccc(-c2ccccc2Cl)cc1 | 1.0 | [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |


### 3.3. Train and make predictions with original, normalised and normalised+balanced model and evaluate <a name="train-and-make"></a>
- Conformal predictors are fitted and evaluated on the internal test set
- Predictions are made for the external data
- The predictors/predictions are evaluated

One evaluation table per model set-up (three in total) is displayed. 

In [16]:
os.makedirs(output_directory, exist_ok=True)

# Data overview
# ToxCast
total, neg, pos = calculate_set_sizes(train_dataframe, endpoint_toxcast)
print(f'ToxCast data: {total} total, {neg} inactives, {pos} actives\n')
# External
total, neg, pos = calculate_set_sizes(external_df, endpoint_external)
print(f'External data: {total} total, {neg} inactives, {pos} actives\n')

# For three model set-ups do
for knowtox_model, toxcast_X, fingerprint in zip(['original', 'normalised', 'normalised_balanced'], 
                                         [X_MorganMACCS, X_mmpcReduced, X_mmpcReduced], 
                                         [fp_MorganMACCS, fp_mmpcReduced, fp_mmpcReduced]):
    # ToxCast
    # Train model on ToxCast data and predict p-values on training data (cross-validation)
    acps_cv_model = train_acp_cv(knowtox_model, toxcast_X, toxcast_y, toxcast_casn, toxcast_smiles,
                                 cv, endpoint_toxcast, fingerprint)
    # Load output dataframe with p-values
    toxcast_filename = f'{crossvalidation_output_file}_{endpoint_toxcast}_{fingerprint}_{knowtox_model}.csv'
    results_df_toxcast_cv = pd.read_csv(toxcast_filename)
    
    # External
    # Predict and calculate p-values for external data set
    model_predict_dataframe, model_fingerprint_dict, model_prediction_dict = \
        predict_acp_get_pvalues(acps_cv_model, external_df, fingerprint, endpoint_external)
   
    # Evaluate model internal (CV) and external data
    evaluation_dict = {'validity': [], 'validity_1': [], 'validity_0': [], 'efficiency': [],
                       'efficiency_1': [], 'efficiency_0': [], 'accuracy': [],
                       'accuracy_1': [], 'accuracy_0': []}
    
    for dataframe, endpoint in zip([results_df_toxcast_cv, model_predict_dataframe],
                                   [endpoint_toxcast, endpoint_external]):
        validity_dict = calculate_validity(dataframe, endpoint, significance_level)
        efficiency_dict = calculate_efficiency(dataframe, endpoint, significance_level)
        accuracy_dict = calculate_accuracy(dataframe, endpoint, significance_level)
        
        for d in [validity_dict, efficiency_dict, accuracy_dict]:
            for k, v in d.items():
                evaluation_dict[k].append(v)
         
    evaluation_dataframe = pd.DataFrame(data=evaluation_dict, index = ['toxcast CV', 'external set'])
    display(evaluation_dataframe)

ToxCast data: 6713.0 total, 5845.0 inactives, 868.0 actives

External data: 361.0 total, 201.0 inactives, 160.0 actives

original  model:


| | validity | validity_1 | validity_0 | efficiency | efficiency_1 | efficiency_0 | accuracy | accuracy_1 | accuracy_0 |
|---|---|---|---|---|---|---|---|---|---|
| toxcast CV | 0.808 | 0.808 | 0.809 | 0.85 | 0.866 | 0.848 | 0.775 | 0.778 | 0.774 |
| external set | 0.781 | 0.812 | 0.756 | 0.731 | 0.712 | 0.746 | 0.701 | 0.737 | 0.673 |


normalised  model:


| | validity | validity_1 | validity_0 | efficiency | efficiency_1 | efficiency_0 | accuracy | accuracy_1 | accuracy_0 |
|---|---|---|---|---|---|---|---|---|---|
| toxcast CV | 0.844 | 0.835 | 0.845 | 0.362 | 0.149 | 0.393 | 0.942 | 0.535 | 0.964 |
| external set | 0.801 | 0.756 | 0.836 | 0.305 | 0.169 | 0.413 | 0.782 | 0.556 | 0.855 |


normalised_balanced  model:


| | validity | validity_1 | validity_0 | efficiency | efficiency_1 | efficiency_0 | accuracy | accuracy_1 | accuracy_0 |
|---|---|---|---|---|---|---|---|---|---|
| toxcast CV | 0.833 | 0.838 | 0.833 | 0.539 | 0.403 | 0.559 | 0.867 | 0.76 | 0.878 |
| external set | 0.778 | 0.812 | 0.751 | 0.501 | 0.45 | 0.542 | 0.746 | 0.847 | 0.679 |


### Results
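- From the original to the normalised model, overall validity increased (ToxCast CV: 0.808 to 0.844; external set: 0.781 to 0.801), while overall efficiency dropped (ToxCast CV: 0.85 to 0.362; external set: 0.731 to 0.305). 
- From the normalised to the normalised+balanced model, overall efficiency increased again (ToxCast CV: 0.362 to 0.539; external set: 0.305 to 0.501). 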