### Multilayer perceptron code

* We begin by importing the necessary training data (metabolic fluxes and biomass).
* Select the training data by setting `carbon_source` to the designated file name from the table below.

| **Carbon** |    **acetate**   |  **adenosine**  | **D-alanine** |  **fructose**  |   **fucose**  |  **fumarate** | **galactose** | **galacturonate** | **gluconate** |      **glucosamine**     |
|:----------:|:----------------:|:---------------:|:-------------:|:--------------:|:-------------:|:-------------:|:-------------:|:-----------------:|:-------------:|:------------------------:|
|  file name |        ac        |       adn       |      alaD     |       fru      |      fuc      |      fum      |      gal      |       galur       |      glcn     |            gam           |
| **Carbon** |    **glucose**   | **glucuronate** |  **glycerol** |   **lactate**  | **L-alanine** |   **malate**  |  **maltose**  |    **mannitol**   |  **mannose**  | **N-acetyl glucosamine** |
|  file name |        glc       |      glcur      |      glyc     |       lac      |      alaL     |      mal      |      malt     |        mnl        |      man      |           acgam          |
| **Carbon** | **oxaloacetate** |   **pyruvate**  |  **ribose**  | **saccharate** |  **sorbitol** | **succinate** | **thymidine** |   **trehalose**   |   **xylose**  |    **a-ketoglutarate**   |
|  file name |        oaa       |       pyr       |      rib      |      sacc      |      sbt      |      succ     |     thymd     |        tre        |      xyl      |            akg           |



In [None]:
carbon_source = "glc" # glucose condition
output_name = "glc"

**Import packages**

For reproducibility, the versions of Python and the Python packages must be fixed as below.
* python $\;\;\;\;\;\;$ :  v. 3.7
* tensorflow $\;$ :  v. 2.7.0
* SHAP    $\;\;\;\;\;\;\;$    :  v. 0.41.0
* numpy  $\;\;\;\;$     : v. 1.21.6
* pandas $\;\;\;\;$ : v. 1.1.5
* pyarrow $\;\;\;$ : v.  10.0.1
* scikit-learn $\;$ : v.  1.0.2
* keras-tuner $\;$ : v. 1.1.3
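
A quick way to confirm that the environment matches the pinned versions is sketched below. The helper name `find_version_mismatches` and the `PINNED` dictionary are illustrative and not part of the original notebook; the sketch assumes each package exposes a `__version__` attribute and that scikit-learn is imported as `sklearn`.

```python
import importlib

# Pinned versions taken from the list above (module import names as keys)
PINNED = {
    "tensorflow": "2.7.0",
    "shap": "0.41.0",
    "numpy": "1.21.6",
    "pyarrow": "10.0.1",
    "sklearn": "1.0.2",
}

def find_version_mismatches(pinned):
    """Return {module: (expected, found)} for every module whose version differs.

    `found` is None when the module cannot be imported or has no __version__.
    """
    mismatches = {}
    for module_name, expected in pinned.items():
        try:
            module = importlib.import_module(module_name)
            found = getattr(module, "__version__", None)
        except ImportError:
            found = None
        if found != expected:
            mismatches[module_name] = (expected, found)
    return mismatches
```

Calling `find_version_mismatches(PINNED)` returns an empty dictionary when every installed package matches its pinned version.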

In [None]:
# python v. 3.7.0
import shap # v. 0.41.0
import sklearn # v. 1.0.2
import pandas as pd # v. 1.1.5
import numpy as np # v. 1.21.6
import tensorflow as tf # v. 2.7.0
import keras_tuner as kt # v. 1.1.3
from sklearn import preprocessing
import warnings
import random
warnings.filterwarnings(action='ignore')

**Import data**

* The simulated flux data are imported and preprocessed to form the training data (X_train).
* We took the absolute value of each flux and filtered out reactions whose flux was constant at zero across all deletion mutants.
* The final 24 OD data from Tong (2020) and Baba (2006) were used as target data.
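
The preprocessing rule can be illustrated on a toy table. This is a minimal sketch with made-up reaction names and flux values, not the notebook's data; it follows the rule as the notebook's code applies it, where a reaction is dropped only when its flux is zero in every mutant.

```python
import pandas as pd

# Toy flux table: rows are deletion mutants, columns are reactions (hypothetical names)
toy_fluxes = pd.DataFrame(
    {
        "RXN_A": [1.0, -2.0, 0.5],    # varies across mutants -> kept
        "RXN_B": [0.0, 0.0, 0.0],     # zero in every mutant -> dropped
        "RXN_C": [-3.0, -3.0, -3.0],  # constant but non-zero -> kept by this rule
    },
    index=["mutant_1", "mutant_2", "mutant_3"],
)

# Keep every reaction that carries a non-zero flux in at least one mutant
carries_flux = (toy_fluxes != 0).any(axis=0)

# Take absolute values so that only the flux magnitude is used as a feature
toy_X = toy_fluxes.loc[:, carries_flux].abs()

print(toy_X)
```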

In [None]:
#Extracting metabolic flux data
X_data_raw  = pd.read_feather("input_data/simulated_fluxes("+carbon_source+").feather").set_index("index")

#Remove any unnecessary columns(reactions): drop reactions whose flux is zero in every mutant
X_data = pd.DataFrame(index=X_data_raw.index)
for index_col in X_data_raw.columns:
    each_column = X_data_raw.loc[:, index_col]
    #A column is constant when every value equals its first value
    is_constant = not (each_column != each_column.iloc[0]).any()

    if is_constant and each_column.iloc[0] == 0:
        continue

    #Keep the absolute flux value as the feature
    X_data[each_column.name] = abs(each_column)


#Extracting growth data for target data
growth_data = pd.read_feather("input_data/biomass_data.feather").set_index("index")
y_data_raw =  growth_data[carbon_source]
y_data = y_data_raw[y_data_raw.index.isin(X_data.index)]

### **Deep learning with MLP**
**Hyperparameter setting**
* The optimal hyperparameters were determined with the RandomSearch tuner in KerasTuner.
* Below are the lists of candidate hyperparameters.

In [None]:
random_seed       = 0 # fix the seed for reproducibility
hp_dir            = "hp_folder" # hyperparameter tuning directory
neurons           = [5, 10, 25, 50, 100, 200, 1000, 2000] # number of units in each hidden layer
optimizer_param   = ['adam', 'rmsprop', 'sgd'] # candidate optimizers
learning_rate     = [0.1, 0.05, 0.01, 0.005, 0.001, 0.0001] # candidate learning rates
kernel_constraint = [-1, 2, 3, 4] # max-norm weight constraint per layer, -1 : no constraint
dropout           = [0.3, 0.4, 0.5, 0.6] # Dropout layer rate
max_trials        = 10000
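
To put `max_trials` in context, the sketch below counts the combinations spanned by the candidate lists. It assumes the number of hidden layers ranges from 0 to 4, as in the `hp.Int('layers', 0, 4)` call in `build_model`. The count is an upper bound, because with zero hidden layers the unit, constraint and dropout choices have no effect.

```python
# Candidate lists copied from the cell above
neurons           = [5, 10, 25, 50, 100, 200, 1000, 2000]
optimizer_param   = ['adam', 'rmsprop', 'sgd']
learning_rate     = [0.1, 0.05, 0.01, 0.005, 0.001, 0.0001]
kernel_constraint = [-1, 2, 3, 4]
dropout           = [0.3, 0.4, 0.5, 0.6]
layer_counts      = range(0, 5)  # assumption: 0 to 4 hidden layers (both ends included)

# Upper bound on the number of distinct hyperparameter combinations
n_configs = (len(layer_counts) * len(neurons) * len(kernel_constraint)
             * len(dropout) * len(optimizer_param) * len(learning_rate))

print("Upper bound on combinations:", n_configs)
```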

Run the code below for hyperparameter tuning.

In [None]:
#Standardize data
X_train_scaled = sklearn.preprocessing.StandardScaler().fit_transform(X_data)
y_train = y_data

#Shuffle the data
X_train_scaled, y_train = sklearn.utils.shuffle(X_train_scaled, y_train, random_state=random_seed)

#Layer weight constraints
def kernel_constraint_func(max_norm_value):
    #-1 means no constraint; any other value is used as the max-norm limit
    if max_norm_value == -1:
        return None
    return tf.keras.constraints.max_norm(max_norm_value)

def build_model(hp):
    
    #Model construction
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Input(shape=(len(X_data.columns),)))
    for i in range(hp.Int('layers', 0,4)):
        model.add(tf.keras.layers.Dense(units=hp.Choice('units', neurons), activation='relu', kernel_constraint=kernel_constraint_func(hp.Choice("kernel",kernel_constraint))))
        model.add(tf.keras.layers.Dropout( hp.Choice('d_units',dropout)))
    model.add(tf.keras.layers.Dense(1, activation='linear'))

    #Optimizer 
    optimizer = hp.Choice('optimizer', values=optimizer_param)

    if optimizer =="adam":
        final_optimizer  = tf.optimizers.Adam(hp.Choice('learning_rate', values=learning_rate))
    elif optimizer == "sgd":
        final_optimizer = tf.optimizers.SGD(hp.Choice('learning_rate', values=learning_rate))
    elif optimizer =="rmsprop":
        final_optimizer = tf.optimizers.RMSprop(hp.Choice('learning_rate', values=learning_rate))


    # Compile model
    model.compile(
        optimizer= final_optimizer,
        loss='mse',
        metrics=['mse']
        )
    return model

#Tuning
tuner = kt.RandomSearch(build_model, objective = 'val_mse',
                        overwrite=True,
                        max_trials=max_trials,
                        executions_per_trial=3,
                        directory=hp_dir,
                        seed=random_seed)

tuner.search(X_train_scaled, y_train, epochs = 40,validation_split =0.1, verbose=0)
#tuner.search_space_summary()


#Get the optimal hyperparameters
best_hp=tuner.get_best_hyperparameters()
print("Selected hp:", best_hp[0].values)

**Model training & SHAP value calculation**

* The model architecture was constructed from the hyperparameters selected by the tuning step.

In [None]:
#Selected MLP model
model = tf.keras.models.Sequential([
            tf.keras.layers.Input(shape=(len(X_data.columns),)),
            tf.keras.layers.Dense(units=1000, activation="relu", kernel_constraint=tf.keras.constraints.max_norm(3)),
            tf.keras.layers.Dropout(rate=0.6),
            tf.keras.layers.Dense(units=1000, activation="relu", kernel_constraint=tf.keras.constraints.max_norm(3)),
            tf.keras.layers.Dropout(rate=0.6),
            tf.keras.layers.Dense(units=1000, activation="relu", kernel_constraint=tf.keras.constraints.max_norm(3)),
            tf.keras.layers.Dropout(rate=0.6),
            tf.keras.layers.Dense(units=1000, activation="relu", kernel_constraint=tf.keras.constraints.max_norm(3)),
            tf.keras.layers.Dropout(rate=0.6),
            tf.keras.layers.Dense(1, activation="linear")
        ])
model.compile(optimizer=tf.optimizers.RMSprop(learning_rate=0.005), loss="mse", metrics=["mse"])

 * Because training a deep learning MLP model is stochastic, we trained the model repeatedly with different random seeds.
 * For each training run, 10% of the data was held out as validation data for monitoring model performance.
 * SHAP values were then computed for each trained MLP model (one model per random seed).
 * For each model, the median SHAP value of each reaction across samples was taken; the mean of these medians across all models was assigned as the representative SHAP value.
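
The aggregation across seeds can be sketched with toy numbers. As implemented in this notebook, each model first contributes the per-reaction median SHAP value across samples, and these medians are then averaged across seeds. The reaction names and SHAP matrices below are made up for illustration.

```python
import numpy as np
import pandas as pd

reactions = ["RXN_A", "RXN_B"]  # hypothetical reaction names

# Toy SHAP matrices (samples x reactions), one per random seed; values are made up
toy_shap_by_seed = {
    0: np.array([[0.2, -0.1], [0.4, -0.3], [0.6, -0.2]]),
    1: np.array([[0.1, -0.4], [0.3, -0.2], [0.2, -0.6]]),
}

# Step 1: per model, take the median SHAP value of each reaction across samples
per_seed_median = pd.DataFrame(
    {seed: np.median(values, axis=0) for seed, values in toy_shap_by_seed.items()},
    index=reactions,
)

# Step 2: average the per-model medians across seeds and sort in descending order
representative_shap = per_seed_median.mean(axis=1).sort_values(ascending=False)

print(representative_shap)
```

Using one column per seed keeps the table shape explicit: rows are reactions and columns are seeds.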

In [None]:
#Set the list of random seeds for MLP training & SHAP values
seed_num_list =[0,1,2,3,4,5,6,7,8,9]
total_shap_df = pd.DataFrame(index=X_data.columns)

for seed_num in seed_num_list:
    tf.random.set_seed(seed_num)
    random.seed(seed_num)

    
    #Standardize data
    X_train_scaled = sklearn.preprocessing.StandardScaler().fit_transform(X_data)
    y_train = y_data
    #Shuffle data
    X_train_scaled, y_train = sklearn.utils.shuffle(X_train_scaled, y_train, random_state=seed_num)
    
    #Artificial Neural Network build
    with tf.device("cpu:0"):

        model = tf.keras.models.Sequential([
            tf.keras.layers.Input(shape=(len(X_data.columns),)),
            tf.keras.layers.Dense(units=1000, activation="relu", kernel_constraint=tf.keras.constraints.max_norm(3)),
            tf.keras.layers.Dropout(rate=0.6),
            tf.keras.layers.Dense(units=1000, activation="relu", kernel_constraint=tf.keras.constraints.max_norm(3)),
            tf.keras.layers.Dropout(rate=0.6),
            tf.keras.layers.Dense(units=1000, activation="relu", kernel_constraint=tf.keras.constraints.max_norm(3)),
            tf.keras.layers.Dropout(rate=0.6),
            tf.keras.layers.Dense(units=1000, activation="relu", kernel_constraint=tf.keras.constraints.max_norm(3)),
            tf.keras.layers.Dropout(rate=0.6),
            tf.keras.layers.Dense(1, activation="linear")
        ])
        
        #Compile model
        model.compile(optimizer=tf.optimizers.RMSprop(learning_rate=0.005), loss="mse", metrics=["mse"])
        
        #Train model
        model.fit(x=X_train_scaled, y=y_train, epochs=40, validation_split=0.1)


    #SHAP computation
    background = X_train_scaled # use the full training set as the background set
    explainer = shap.DeepExplainer(model, background) # build the explainer on the background set
    shap_values = explainer.shap_values(X_train_scaled) # compute SHAP values for every training sample
    shap_df = pd.DataFrame(shap_values[0], columns=X_data.columns)
    #Median SHAP value of each reaction across samples, stored under this seed's column
    #(a per-seed column name avoids the duplicate column names that repeated merges produce)
    total_shap_df[seed_num] = shap_df.median()
    
# The mean across seeds is used as the representative SHAP value of each feature
total_shap_df_mean = total_shap_df.mean(axis=1) 

total_shap_df_mean = total_shap_df_mean.sort_values(ascending=False)

Extract and filter the representative SHAP values of the trained models.

In [None]:
#Extract each reaction's SHAP value
raw_SHAP_values = total_shap_df_mean.to_frame()

#Load the list of pure metabolic reactions (used below to filter out transport and external reactions)
with open("util/memote_pure_rxns.txt", 'r') as f:
    memote_pure_rxn = f.read().strip('"').split('","')

#Separate beneficial(+) and detrimental(-) reactions based on SHAP value
SHAP_pos = raw_SHAP_values[raw_SHAP_values.iloc[:, 0] > 0]
SHAP_neg = raw_SHAP_values[raw_SHAP_values.iloc[:, 0] < 0]

#Filter out reactions with negligible SHAP value
avg_coefs_pos = SHAP_pos.iloc[:, 0].mean()
avg_coefs_neg = SHAP_neg.iloc[:, 0].mean()

final_pos_SHAPs = SHAP_pos[SHAP_pos.iloc[:,0] >=  0.1*avg_coefs_pos]
final_pos_SHAPs = final_pos_SHAPs[final_pos_SHAPs.index.isin(memote_pure_rxn) == True]
final_neg_SHAPs = SHAP_neg[abs(SHAP_neg.iloc[:,0]) >= abs(0.1*avg_coefs_neg)]
final_neg_SHAPs = final_neg_SHAPs[final_neg_SHAPs.index.isin(memote_pure_rxn) == True]

#Sort and extract to csv
filtered_SHAPs = pd.concat([final_pos_SHAPs, final_neg_SHAPs])
filtered_SHAPs = filtered_SHAPs.sort_values(ascending=False, by=0)
filtered_SHAPs.columns = ["SHAP value"] 
filtered_SHAPs.to_csv("MLP_output_data/"+output_name+"_mlp_shaps.csv")
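
The thresholding used above can be checked on a small example. This is a minimal sketch with made-up SHAP values and hypothetical reaction names; it reproduces only the 10% cutoff on each side and leaves out the filter against the pure-reaction list.

```python
import pandas as pd

# Made-up representative SHAP values for hypothetical reactions
toy_shap = pd.Series(
    {"RXN_A": 0.50, "RXN_B": 0.02, "RXN_C": 0.38,
     "RXN_D": -0.40, "RXN_E": -0.01, "RXN_F": -0.19}
)

# Separate beneficial (+) and detrimental (-) reactions
positive = toy_shap[toy_shap > 0]
negative = toy_shap[toy_shap < 0]

# Cutoff on each side: 10% of the mean magnitude of that side
pos_cutoff = 0.1 * positive.mean()
neg_cutoff = 0.1 * negative.abs().mean()

# Keep reactions whose magnitude reaches the cutoff, then sort in descending order
kept = pd.concat([
    positive[positive >= pos_cutoff],
    negative[negative.abs() >= neg_cutoff],
]).sort_values(ascending=False)

print(kept)
```

Here `RXN_B` and `RXN_E` fall below their respective cutoffs and are removed, while the remaining four reactions are retained.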