# HERGAI: An Artificial Intelligence Tool for Structure-Based Prediction of hERG Inhibitors

This Jupyter notebook contains the Python code of **HERGAI**, a tool for predicting inhibitors of the human Ether-à-go-go-Related Gene (hERG) potassium channel. The code is introduced in our article:

**Tran-Nguyen, V.K., Randriharimanamizara, U.F., Taboureau, O. HERGAI: An Artificial Intelligence Tool for Structure-Based Prediction of hERG Inhibitors. (2025)**

You can use this code to reproduce our results: all necessary input files are provided in our **HERGAI** GitHub repository (https://github.com/vktrannguyen/HERGAI). You can also make predictions on your own data, even on a large scale. Please read our article (cited above) for more information.

To use this Jupyter notebook, you need to set up a suitable environment. We suggest creating it from the *protocol-env.yml* file provided in our **MLSF-protocol** GitHub repository (https://github.com/vktrannguyen/MLSF-protocol) and then installing a few additional packages (e.g., imbalanced-learn, which is imported as `imblearn`). Please check the **import** statements at the beginning of each code block to make sure that all required Python dependencies are installed beforehand.
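
As a quick sanity check before running the cells, the sketch below lists which of the top-level packages imported in this notebook cannot be found in the active environment. The helper name `missing_packages` and the `REQUIRED` list are illustrative and not part of HERGAI; the list was assembled from the import statements in the code blocks.

```python
import importlib.util

# Top-level import names used by the code cells of this notebook (assembled by hand).
REQUIRED = ["pandas", "numpy", "sklearn", "imblearn", "xgboost", "tensorflow"]

def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    # find_spec returns None when a top-level package is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```

Note that an import name can differ from the name used at installation time: `sklearn` is installed as scikit-learn and `imblearn` as imbalanced-learn.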

## Part 1: Training and applying base classification models (RF_BC, XGB_BC, DNN_BC)

To train and apply the following base classification models, we use, as features, the **PLEC fingerprints extracted from docking poses selected by ClassyPose**.

For more information on ClassyPose, please refer to this article: https://advanced.onlinelibrary.wiley.com/doi/full/10.1002/aisy.202400238 and our **Classy_Pose** GitHub repository (https://github.com/vktrannguyen/Classy_Pose).

### RF_BC 

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import RandomOverSampler

# Provide the paths to the csv training and test data files:
train_data = pd.read_csv("Provide the file path to training_data.csv")
test_data = pd.read_csv("Provide the file path to test_data.csv")
Train_Class = train_data['activity'].map({'Active': 1, 'Inactive': 0})  
Test_Class = test_data['activity'].map({'Active': 1, 'Inactive': 0}) 

# Provide the paths to the csv training and test features: 
d_train_csv = pd.read_csv("Provide the file path to training_PLEC_ClassyPose.csv", header=None)
d_test_csv = pd.read_csv("Provide the file path to test_PLEC_ClassyPose.csv", header=None)
train_features = np.array(d_train_csv)
test_features = np.array(d_test_csv)

# Define a function to perform controlled oversampling of actives:
def controlled_oversampling(X, y, ratio=100):
    # Count the number of actives and inactives:
    n_actives = np.sum(y == 1)
    n_inactives = np.sum(y == 0)

    # Calculate the number of actives to oversample:
    target_n_actives = min(n_actives * ratio, n_inactives)

    # Create a RandomOverSampler instance:
    ros = RandomOverSampler(sampling_strategy={1: target_n_actives}, random_state=42)

    # Resample the dataset:
    X_resampled, y_resampled = ros.fit_resample(X, y)

    return X_resampled, y_resampled

# Apply controlled oversampling to the training data:
train_features_resampled, Train_Class_resampled = controlled_oversampling(train_features, Train_Class)

# Train RF_BC on the resampled training set using optimal hyperparameters:
rf_plec = RandomForestClassifier(
    n_estimators=2600, 
    max_depth=6, 
    criterion='gini',
    max_features='sqrt', 
    n_jobs=20,
    random_state=42
)
rf_plec.fit(train_features_resampled, Train_Class_resampled)

# Test RF_BC on the test molecules:
threshold = 0.32799228309084005
prediction_test_rf_plec_prob = rf_plec.predict_proba(test_features)
prediction_test_rf_plec_class = (prediction_test_rf_plec_prob[:, 1] > threshold).astype(int)

# Get classification results on the test molecules and export to csv:
plec_result_rf = pd.DataFrame({
    "Active_Prob": prediction_test_rf_plec_prob[:, 1], 
    "Inactive_Prob": prediction_test_rf_plec_prob[:, 0], 
    "Predicted_Class": ["Active" if pred == 1 else "Inactive" for pred in prediction_test_rf_plec_class],
    "Real_Class": Test_Class.map({1: 'Active', 0: 'Inactive'})  
})
plec_result_rf.to_csv("Provide the file path to your csv output file", index=False)
# The output file RF_BC.csv is provided in our GitHub repository, in Data/Test_set/Results/Our_AI_classifiers.
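
The `controlled_oversampling` function used above caps the number of actives after resampling at `min(n_actives * ratio, n_inactives)`, so the actives are multiplied by at most `ratio` and never outnumber the inactives. The sketch below illustrates this rule with the standard library only. It is a stand-in for `RandomOverSampler`, not a replacement: the function name `oversample_actives` and the label counts are hypothetical, and the duplicated rows will not match those drawn by imblearn.

```python
import random

def oversample_actives(labels, ratio=100, seed=42):
    """Return row indices after randomly duplicating actives (label 1) up to the target count."""
    active_idx = [i for i, y in enumerate(labels) if y == 1]
    inactive_idx = [i for i, y in enumerate(labels) if y == 0]

    # Same rule as in controlled_oversampling: at most `ratio` times more actives,
    # and never more actives than inactives.
    target_n_actives = min(len(active_idx) * ratio, len(inactive_idx))

    # Draw the additional actives with replacement from the existing ones.
    rng = random.Random(seed)
    n_extra = max(0, target_n_actives - len(active_idx))
    extra = [rng.choice(active_idx) for _ in range(n_extra)]

    # Keep every original row and append the duplicated actives.
    return list(range(len(labels))) + extra

# Hypothetical training set: 3 actives and 1000 inactives.
labels = [1] * 3 + [0] * 1000
resampled = oversample_actives(labels)
n_actives_after = sum(labels[i] for i in resampled)
print("Actives after oversampling:", n_actives_after)
print("Inactives after oversampling:", len(resampled) - n_actives_after)
```

When the actives are rarer than 1 in `ratio`, the `ratio` cap applies and the resampled set stays imbalanced; otherwise the two classes end up the same size.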

### XGB_BC 

In [None]:
import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from imblearn.over_sampling import RandomOverSampler

# Provide the paths to the csv training and test data files:
train_data = pd.read_csv("Provide the file path to training_data.csv")
test_data = pd.read_csv("Provide the file path to test_data.csv")
train_data['activity'] = train_data['activity'].map({'Active': 1, 'Inactive': 0})
test_data['activity'] = test_data['activity'].map({'Active': 1, 'Inactive': 0})
Train_Class = train_data['activity'].astype('int32')
Test_Class = test_data['activity'].astype('int32')

# Provide the paths to the csv training and test features: 
train_features = pd.read_csv("Provide the file path to training_PLEC_ClassyPose.csv", header=None).astype('float32')
test_features = pd.read_csv("Provide the file path to test_PLEC_ClassyPose.csv", header=None).astype('float32')

# Define a function to perform controlled oversampling of actives:
def controlled_oversampling(X, y, ratio=100):
    # Count the number of actives and inactives:
    n_actives = np.sum(y == 1)
    n_inactives = np.sum(y == 0)

    # Calculate the number of actives to oversample:
    target_n_actives = min(n_actives * ratio, n_inactives)

    # Create a RandomOverSampler instance:
    ros = RandomOverSampler(sampling_strategy={1: target_n_actives}, random_state=42)

    # Resample the dataset:
    X_resampled, y_resampled = ros.fit_resample(X, y)

    return X_resampled, y_resampled

# Apply controlled oversampling to the training data:
train_features_resampled, Train_Class_resampled = controlled_oversampling(train_features, Train_Class)

# Train XGB_BC on the resampled training set using optimal hyperparameters:
xgb_plec = XGBClassifier(
    objective="binary:logistic",
    max_depth=4,
    reg_alpha=0.5,
    reg_lambda=1,
    n_estimators=86,
    n_jobs=20,  
    tree_method='hist',  
    grow_policy='depthwise',  
    random_state=42
)
xgb_plec.fit(train_features_resampled, Train_Class_resampled)

# Test XGB_BC on the test molecules:
threshold = 0.06986083857271219
prediction_test_xgb_plec_prob = xgb_plec.predict_proba(test_features)
prediction_test_xgb_plec_class = (prediction_test_xgb_plec_prob[:, 1] > threshold).astype(int)

# Get classification results on the test molecules and export to csv:
plec_result_xgb = pd.DataFrame({
    "Active_Prob": prediction_test_xgb_plec_prob[:, 1],
    "Inactive_Prob": prediction_test_xgb_plec_prob[:, 0],
    "Predicted_Class": prediction_test_xgb_plec_class,
    "Real_Class": Test_Class
})
plec_result_xgb.to_csv("Provide the file path to your csv output file", index=False)
# The output file XGB_BC.csv is provided in our GitHub repository, in Data/Test_set/Results/Our_AI_classifiers.

### DNN_BC 

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from imblearn.over_sampling import RandomOverSampler
import os

np.random.seed(42)
tf.random.set_seed(42)

# Set up multiple CPUs:
num_cpus = 25
os.environ["TF_NUM_INTRAOP_THREADS"] = str(num_cpus)
os.environ["TF_NUM_INTEROP_THREADS"] = str(num_cpus)

# Provide the paths to the csv training and test data files:
train_data = pd.read_csv("Provide the file path to training_data.csv")
test_data = pd.read_csv("Provide the file path to test_data.csv")
Train_Class = train_data['activity'].map({'Active': 1, 'Inactive': 0})  
Test_Class = test_data['activity'].map({'Active': 1, 'Inactive': 0})  

# Provide the paths to the csv training and test features:
d_train_csv = pd.read_csv("Provide the file path to training_PLEC_ClassyPose.csv", header=None)
d_test_csv = pd.read_csv("Provide the file path to test_PLEC_ClassyPose.csv", header=None)
train_features = d_train_csv.values
test_features = d_test_csv.values

# Define a function to perform controlled oversampling of actives:
def controlled_oversampling(X, y, ratio=100):
    # Count the number of actives and inactives
    n_actives = np.sum(y == 1)
    n_inactives = np.sum(y == 0)

    # Calculate the number of actives to oversample
    target_n_actives = min(n_actives * ratio, n_inactives)

    # Create a RandomOverSampler instance
    ros = RandomOverSampler(sampling_strategy={1: target_n_actives}, random_state=42)

    # Resample the dataset
    X_resampled, y_resampled = ros.fit_resample(X, y)

    return X_resampled, y_resampled

# Apply controlled oversampling to the training data:
train_features_resampled, Train_Class_resampled = controlled_oversampling(train_features, Train_Class)

# Train DNN_BC on the resampled training set using optimal hyperparameters:
dnn_plec = keras.Sequential()
dnn_plec.add(layers.Dense(abs(int(352.8790957431371)), activation='relu'))
dnn_plec.add(layers.BatchNormalization())
dnn_plec.add(layers.Dropout(0.3912375678578729))
dnn_plec.add(layers.Dense(abs(int(507.20805366280905)), activation='relu'))
dnn_plec.add(layers.BatchNormalization())
dnn_plec.add(layers.Dropout(0.4672821567914654))  
dnn_plec.add(layers.Dense(abs(int(242.0267251614644)), activation='relu'))  
dnn_plec.add(layers.BatchNormalization())
dnn_plec.add(layers.Dropout(0.382443636406835))  
dnn_plec.add(layers.Dense(1, activation='sigmoid')) 
dnn_plec.compile(optimizer='Adadelta', loss="binary_crossentropy", metrics=['accuracy'])
dnn_plec.fit(np.array(train_features_resampled), Train_Class_resampled, 
             epochs=10, batch_size=68, verbose=1)

# Test DNN_BC on the test molecules:
threshold = 0.14415252008378787
test_predictions = dnn_plec.predict(np.array(test_features))
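# Keras returns an array of shape (n, 1), so compare the single probability column to the threshold:
prediction_test_dnn_plec_class = ["Active" if prob > threshold else "Inactive" for prob in test_predictions[:, 0]]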

# Get classification results on the test molecules and export to csv:
plec_result_dnn = pd.DataFrame({"Active_Prob": test_predictions[:, 0],
                                "Predicted_Class": prediction_test_dnn_plec_class,
                                "Real_Class": Test_Class})
plec_result_dnn.to_csv("Provide the file path to your csv output file", index=False)
# The output file DNN_BC.csv is provided in our GitHub repository, in Data/Test_set/Results/Our_AI_classifiers.

## Part 2: Training and applying our best stacking classification model (DNN_SC) from the three base models previously trained

To train and apply our best stacking classification model (DNN_SC), we use, as features, the **adjusted scores from the three base classifiers (RF_BC, XGB_BC, DNN_BC)** for all training and test ligands.

These features are obtained as follows:

- The training set is partitioned into five folds. Each base model, with its optimal hyperparameters, is trained on four folds and predicts the active probabilities of the ligands in the remaining (validation) fold, so that every training ligand receives one prediction from each base model.

- Because the optimal decision threshold differs from one base model to another and affects the classification results, the active probabilities predicted by each base model are divided by that model's decision threshold.

The resulting ratios are the adjusted scores that we use as features for DNN_SC.
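
A minimal sketch of the adjustment is given below, using the three decision thresholds that appear in the code of this notebook. The probability value of 0.20 is hypothetical and only serves to show that the same raw probability leads to different adjusted scores, and that an adjusted score above 1 corresponds to a probability above the model's threshold.

```python
# Optimal decision thresholds of the three base models, as used in this notebook.
THRESHOLDS = {
    "RF_BC": 0.32799228309084005,
    "XGB_BC": 0.06986083857271219,
    "DNN_BC": 0.14415252008378787,
}

def adjusted_score(active_prob, threshold):
    """Divide a predicted active probability by the model's decision threshold."""
    return active_prob / threshold

# Hypothetical case: all three base models predict the same probability for one ligand.
probs = {"RF_BC": 0.20, "XGB_BC": 0.20, "DNN_BC": 0.20}
adjusted = {name: adjusted_score(probs[name], THRESHOLDS[name]) for name in THRESHOLDS}

for name, score in adjusted.items():
    predicted = "Active" if score > 1 else "Inactive"
    print(f"{name}: adjusted score = {score:.3f} -> {predicted}")
```

Dividing by the threshold places the three models on a common scale where 1 is the decision boundary, which makes their outputs comparable as input features.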

### Part 2.1: Using each base model to issue an active probability and an adjusted score for each ligand in the training set 

#### Using RF_BC 

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import RandomOverSampler

# Provide the path to the csv training data file:
train_data = pd.read_csv("Provide the file path to training_data.csv")
train_data['activity'] = train_data['activity'].map({'Active': 1, 'Inactive': 0})
Train_Class = train_data['activity'].astype('int32')

# Provide the paths to the csv training features:
train_features = pd.read_csv("Provide the file path to training_PLEC_ClassyPose.csv", header=None).astype('float32')

# Define a function to perform controlled oversampling of actives:
def controlled_oversampling(X, y, ratio=100):
    n_actives = np.sum(y == 1)
    n_inactives = np.sum(y == 0)
    target_n_actives = min(n_actives * ratio, n_inactives)
    ros = RandomOverSampler(sampling_strategy={1: target_n_actives}, random_state=42)
    X_resampled, y_resampled = ros.fit_resample(X, y)
    return X_resampled, y_resampled

# Define RF_BC with its optimal hyperparameters:
model = RandomForestClassifier(
    n_estimators=2600, 
    max_depth=6, 
    criterion='gini',
    max_features='sqrt', 
    n_jobs=20,
    random_state=42
)

# Perform 5-fold CV:
skf = StratifiedKFold(n_splits=5)
predictions = []

for fold, (train_index, val_index) in enumerate(skf.split(train_features, Train_Class), start=1):
    x_train_fold, x_val_fold = train_features.iloc[train_index], train_features.iloc[val_index]
    y_train_fold, y_val_fold = Train_Class.iloc[train_index], Train_Class.iloc[val_index]
    val_ids = train_data['SID'].iloc[val_index]  

    # Apply controlled oversampling within each fold:
    x_train_fold_resampled, y_train_fold_resampled = controlled_oversampling(x_train_fold, y_train_fold)

    # Train the model on the resampled data:
    model.fit(x_train_fold_resampled, y_train_fold_resampled)

    # Get predicted probabilities for the validation fold:
    y_scores = model.predict_proba(x_val_fold)[:, 1]  

    # Collect the results for each molecule in the validation fold:
    fold_results = pd.DataFrame({
        'SID': val_ids.values,            
        'Actual_Class': y_val_fold.values, 
        'Predicted_Probability': y_scores,  
        'Adjusted_Score': y_scores/0.32799228309084005
    })
    predictions.append(fold_results)

# Concatenate all fold results into a single DataFrame:
all_predictions = pd.concat(predictions, ignore_index=True)

# Save the predictions to a csv file:
all_predictions.to_csv("Provide the file path to your csv output file", index=False)
# The output file rf_cv_predictions.csv is provided in our GitHub repository, in Data/Training_set/Stacking_ensemble_ML.

#### Using XGB_BC 

In [None]:
import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import RandomOverSampler

# Provide the path to the csv training data file:
train_data = pd.read_csv("Provide the file path to training_data.csv")
train_data['activity'] = train_data['activity'].map({'Active': 1, 'Inactive': 0})
Train_Class = train_data['activity'].astype('int32')

# Provide the paths to the csv training features:
train_features = pd.read_csv("Provide the file path to training_PLEC_ClassyPose.csv", header=None).astype('float32')

# Define a function to perform controlled oversampling of actives:
def controlled_oversampling(X, y, ratio=100):
    n_actives = np.sum(y == 1)
    n_inactives = np.sum(y == 0)
    target_n_actives = min(n_actives * ratio, n_inactives)
    ros = RandomOverSampler(sampling_strategy={1: target_n_actives}, random_state=42)
    X_resampled, y_resampled = ros.fit_resample(X, y)
    return X_resampled, y_resampled

# Define XGB_BC with its optimal hyperparameters:
model = XGBClassifier(
    objective="binary:logistic",
    max_depth=4,
    reg_alpha=0.5,
    reg_lambda=1,
    n_estimators=86,
    n_jobs=20,  
    tree_method='hist',
    grow_policy='depthwise',
    random_state=42
)

# Perform 5-fold CV:
skf = StratifiedKFold(n_splits=5)
predictions = []  

for fold, (train_index, val_index) in enumerate(skf.split(train_features, Train_Class), start=1):
    x_train_fold, x_val_fold = train_features.iloc[train_index], train_features.iloc[val_index]
    y_train_fold, y_val_fold = Train_Class.iloc[train_index], Train_Class.iloc[val_index]
    val_ids = train_data['SID'].iloc[val_index]  

    # Apply controlled oversampling within each fold:
    x_train_fold_resampled, y_train_fold_resampled = controlled_oversampling(x_train_fold, y_train_fold)

    # Train the model on the resampled data:
    model.fit(x_train_fold_resampled, y_train_fold_resampled)

    # Get predicted probabilities for the validation fold:
    y_scores = model.predict_proba(x_val_fold)[:, 1]  

    # Collect the results for each molecule in the validation fold:
    fold_results = pd.DataFrame({
        'SID': val_ids.values,            
        'Actual_Class': y_val_fold.values, 
        'Predicted_Probability': y_scores,  
        'Adjusted_Score': y_scores/0.06986083857271219  
    })
    predictions.append(fold_results)

# Concatenate all fold results into a single DataFrame:
all_predictions = pd.concat(predictions, ignore_index=True)

# Save the predictions to a csv file:
all_predictions.to_csv("Provide the file path to your csv output file", index=False)
# The output file xgb_cv_predictions.csv is provided in our GitHub repository, in Data/Training_set/Stacking_ensemble_ML.

#### Using DNN_BC 

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, backend as K
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import RandomOverSampler
import os
import gc

np.random.seed(42)
tf.random.set_seed(42)

# Set up multiple CPUs:
num_cpus = 25
os.environ["TF_NUM_INTRAOP_THREADS"] = str(num_cpus)
os.environ["TF_NUM_INTEROP_THREADS"] = str(num_cpus)

# Provide the path to the csv training data file:
train_data = pd.read_csv("Provide the file path to training_data.csv")
Train_Class = train_data['activity'].map({'Active': 1, 'Inactive': 0})

# Provide the paths to the csv training features:
d_train_csv = pd.read_csv("Provide the file path to training_PLEC_ClassyPose.csv", header=None)
train_features = d_train_csv.values

# Define a function to perform controlled oversampling of actives:
def controlled_oversampling(X, y, ratio=100):
    n_actives = np.sum(y == 1)
    n_inactives = np.sum(y == 0)
    target_n_actives = min(n_actives * ratio, n_inactives)
    ros = RandomOverSampler(sampling_strategy={1: target_n_actives}, random_state=42)
    X_resampled, y_resampled = ros.fit_resample(X, y)
    return X_resampled, y_resampled

# Define DNN_BC with its optimal hyperparameters:
def build_model():
    model = keras.Sequential()
    model.add(layers.Dense(abs(int(352.8790957431371)), activation="relu"))
    model.add(layers.BatchNormalization())
    model.add(layers.Dropout(0.3912375678578729))
    model.add(layers.Dense(abs(int(507.20805366280905)), activation="relu"))
    model.add(layers.BatchNormalization())
    model.add(layers.Dropout(0.4672821567914654))  
    model.add(layers.Dense(abs(int(242.0267251614644)), activation="relu"))  
    model.add(layers.BatchNormalization())
    model.add(layers.Dropout(0.382443636406835))  
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer='Adadelta', loss="binary_crossentropy", metrics=['accuracy'])
    return model

# Perform 5-fold CV:
skf = StratifiedKFold(n_splits=5)
predictions = [] 

for fold, (train_index, val_index) in enumerate(skf.split(train_features, Train_Class), start=1):
    print(f"Processing fold {fold}...")
    
    # Rebuild the model so that weights are not carried over from the previous fold:
    model = build_model()
    
    x_train_fold, x_val_fold = train_features[train_index], train_features[val_index]
    y_train_fold, y_val_fold = Train_Class.values[train_index], Train_Class.values[val_index]
    val_ids = train_data['SID'].iloc[val_index] 

    # Apply controlled oversampling within each fold:
    x_train_fold_resampled, y_train_fold_resampled = controlled_oversampling(x_train_fold, y_train_fold)

    # Train the model on the resampled data:
    model.fit(x_train_fold_resampled, y_train_fold_resampled, epochs=10, batch_size=68, verbose=0)

    # Get predicted probabilities for the validation fold:
    y_scores = model.predict(x_val_fold).flatten()  

    # Collect the results for each molecule in the validation fold
    fold_results = pd.DataFrame({
        'SID': val_ids.values,            
        'Actual_Class': y_val_fold, 
        'Predicted_Probability': y_scores,  
        'Adjusted_Score': y_scores/0.14415252008378787  
    })
    predictions.append(fold_results)
    
    # Clear session and collect garbage to free memory between folds:
    K.clear_session()
    gc.collect()

# Concatenate all fold results into a single DataFrame:
all_predictions = pd.concat(predictions, ignore_index=True)

# Save the predictions to a csv file:
all_predictions.to_csv("Provide the file path to your csv output file", index=False)
# The output file dnn_cv_predictions.csv is provided in our GitHub repository, in Data/Training_set/Stacking_ensemble_ML.

### Part 2.2: Using the adjusted scores from the three base classifiers to train and apply DNN_SC
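
The code below reads a test file (test_3_base_models.csv) with the columns `SID`, `Real_Class`, `rf_Adjusted_Score`, `xgb_Adjusted_Score` and `dnn_Adjusted_Score`. If you prefer to prepare such a file yourself rather than download it, the following sketch shows one possible way to assemble it from the active probabilities issued by the three base models. It makes several assumptions: the probabilities are listed in the same molecule order for all three models, the identifiers come from the `SID` column of your test data, and the `*_Active_Prob` column names as well as the helper names are illustrative. The example values are made up.

```python
import csv
import io

# Decision thresholds of the base models, keyed by the column prefix used in Part 2.2.
THRESHOLDS = {
    "rf": 0.32799228309084005,
    "xgb": 0.06986083857271219,
    "dnn": 0.14415252008378787,
}

def build_stacking_rows(sids, real_classes, active_probs):
    """Build one row per test molecule with probabilities and adjusted scores.

    active_probs maps 'rf', 'xgb' and 'dnn' to lists of active probabilities,
    all given in the same order as `sids`.
    """
    rows = []
    for i, (sid, real_class) in enumerate(zip(sids, real_classes)):
        row = {"SID": sid, "Real_Class": real_class}
        for model, threshold in THRESHOLDS.items():
            prob = active_probs[model][i]
            row[f"{model}_Active_Prob"] = prob  # assumed column name
            row[f"{model}_Adjusted_Score"] = prob / threshold
        rows.append(row)
    return rows

def rows_to_csv(rows):
    """Serialize the rows to csv text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up example with two test molecules.
rows = build_stacking_rows(
    sids=["SID_1", "SID_2"],
    real_classes=["Active", "Inactive"],
    active_probs={"rf": [0.41, 0.05], "xgb": [0.12, 0.01], "dnn": [0.30, 0.02]},
)
print(rows_to_csv(rows))
```

`Real_Class` is kept as "Active"/"Inactive" here because the code below maps these strings to 1 and 0.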

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from imblearn.over_sampling import RandomOverSampler
import os

np.random.seed(42)
tf.random.set_seed(42)

# Set up multiple CPUs:
num_cpus = 25
os.environ["TF_NUM_INTRAOP_THREADS"] = str(num_cpus)
os.environ["TF_NUM_INTEROP_THREADS"] = str(num_cpus)

# Provide the path to the csv training data file:
train_data = pd.read_csv("Provide the file path to training_data_stackedML.csv")
# Use the training_data_stackedML.csv file so that the training molecules are listed
# in the same order as in the cross-validation prediction files read below.
# This file can be downloaded from our GitHub repository, in Data/Training_set/Stacking_ensemble_ML.

# Provide the path to the csv test data file:
test_data = pd.read_csv("Provide the file path to test_3_base_models.csv")
# The test_3_base_models.csv file contains the output of three base models for each test molecule.
# The output includes the active probabilities and the adjusted scores.
# You can prepare this file yourself from the three files RF_BC.csv, XGB_BC.csv, DNN_BC.csv obtained above.
# You can also download this file from our GitHub repository, in Data/Test_set.

# Get the activity labels of all training and test molecules:
Train_Class = train_data['Actual_Class']
Test_Class = test_data['Real_Class'].map({'Active': 1, 'Inactive': 0})

# Provide the paths to the base model prediction files on the training set:
base_model_1 = pd.read_csv("Provide the file path to rf_cv_predictions.csv")
base_model_2 = pd.read_csv("Provide the file path to xgb_cv_predictions.csv")
base_model_3 = pd.read_csv("Provide the file path to dnn_cv_predictions.csv")

# Merge predictions into a single DataFrame
ensemble_data = pd.DataFrame({
    'SID': base_model_1['SID'],
    'Actual_Class': base_model_1['Actual_Class'],
    'Base1_AdjScore': base_model_1['Adjusted_Score'],
    'Base2_AdjScore': base_model_2['Adjusted_Score'],
    'Base3_AdjScore': base_model_3['Adjusted_Score']
})

# Define training and test features for training and applying DNN_SC:
d_train_csv = ensemble_data[['Base1_AdjScore', 'Base2_AdjScore', 'Base3_AdjScore']].values
train_features = np.array(d_train_csv)
d_test_csv = test_data[['rf_Adjusted_Score', 'xgb_Adjusted_Score', 'dnn_Adjusted_Score']].values
test_features = np.array(d_test_csv)

# Define a function to perform controlled oversampling of actives:
def controlled_oversampling(X, y, ratio=100):
    n_actives = np.sum(y == 1)
    n_inactives = np.sum(y == 0)
    target_n_actives = min(n_actives * ratio, n_inactives)
    ros = RandomOverSampler(sampling_strategy={1: target_n_actives}, random_state=42)
    X_resampled, y_resampled = ros.fit_resample(X, y)
    return X_resampled, y_resampled

# Apply controlled oversampling to the training data:
train_features_resampled, Train_Class_resampled = controlled_oversampling(train_features, Train_Class)

# Train DNN_SC on the resampled training set using optimal hyperparameters:
dnn_plec = keras.Sequential()
dnn_plec.add(layers.Dense(abs(int(131.561382068786)), activation='relu'))
dnn_plec.add(layers.BatchNormalization())
dnn_plec.add(layers.Dropout(0.3294908082433152))
dnn_plec.add(layers.Dense(abs(int(227.11758746306032)), activation='relu'))
dnn_plec.add(layers.BatchNormalization())
dnn_plec.add(layers.Dropout(0.4453136608614301))  
dnn_plec.add(layers.Dense(abs(int(81.34015005515602)), activation='relu'))  
dnn_plec.add(layers.BatchNormalization())
dnn_plec.add(layers.Dropout(0.5668191107970695))  
dnn_plec.add(layers.Dense(1, activation='sigmoid')) 
dnn_plec.compile(optimizer='rmsprop', loss="binary_crossentropy", metrics=['accuracy'])
dnn_plec.fit(np.array(train_features_resampled), Train_Class_resampled, 
             epochs=10, batch_size=46, verbose=1)

# Test DNN_SC on the test molecules:
threshold = 0.20844117646625016
test_predictions = dnn_plec.predict(np.array(test_features))
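# Keras returns an array of shape (n, 1), so compare the single probability column to the threshold:
prediction_test_dnn_plec_class = ["Active" if prob > threshold else "Inactive" for prob in test_predictions[:, 0]]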

# Get classification results on the test molecules and export to csv:
plec_result_dnn = pd.DataFrame({"SID": test_data['SID'],
                                "Active_Prob": test_predictions[:, 0],
                                "Predicted_Class": prediction_test_dnn_plec_class,
                                "Real_Class": Test_Class})
plec_result_dnn.to_csv("Provide the file path to your csv output file", index=False)
# The output file DNN_SC.csv is provided in our GitHub repository, in Data/Test_set/Results/Our_AI_classifiers.