### Elastic net regression code

* We begin by importing the training data (metabolic fluxes and biomass).
* Select a carbon source by setting `carbon_source` to the corresponding file name from the table below.

| **Carbon** |    **acetate**   |  **adenosine**  | **D-alanine** |  **fructose**  |   **fucose**  |  **fumarate** | **galactose** | **galacturonate** | **gluconate** |      **glucosamine**     |
|:----------:|:----------------:|:---------------:|:-------------:|:--------------:|:-------------:|:-------------:|:-------------:|:-----------------:|:-------------:|:------------------------:|
|  file name |        ac        |       adn       |      alaD     |       fru      |      fuc      |      fum      |      gal      |       galur       |      glcn     |            gam           |
| **Carbon** |    **glucose**   | **glucuronate** |  **glycerol** |   **lactate**  | **L-alanine** |   **malate**  |  **maltose**  |    **mannitol**   |  **mannose**  | **N-acetyl glucosamine** |
|  file name |        glc       |      glcur      |      glyc     |       lac      |      alaL     |      mal      |      malt     |        mnl        |      man      |           acgam          |
| **Carbon** | **oxaloacetate** |   **pyruvate**  |  **ribose**  | **saccharate** |  **sorbitol** | **succinate** | **thymidine** |   **trehalose**   |   **xylose**  |    **a-ketoglutarate**   |
|  file name |        oaa       |       pyr       |      rib      |      sacc      |      sbt      |      succ     |     thymd     |        tre        |      xyl      |            akg           |



In [None]:
carbon_source = "glc" # glucose condition
output_name = "glc"
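
A typo in `carbon_source` only surfaces later as a missing-file error. As a minimal sketch, the table above can be transcribed into a dictionary and checked up front. `CARBON_FILE_NAMES` and `resolve_carbon_source` are hypothetical names introduced here for illustration; they are not part of the original notebook.

```python
# Mapping transcribed from the table above: carbon source -> file name
CARBON_FILE_NAMES = {
    "acetate": "ac", "adenosine": "adn", "D-alanine": "alaD", "fructose": "fru",
    "fucose": "fuc", "fumarate": "fum", "galactose": "gal", "galacturonate": "galur",
    "gluconate": "glcn", "glucosamine": "gam", "glucose": "glc", "glucuronate": "glcur",
    "glycerol": "glyc", "lactate": "lac", "L-alanine": "alaL", "malate": "mal",
    "maltose": "malt", "mannitol": "mnl", "mannose": "man",
    "N-acetyl glucosamine": "acgam", "oxaloacetate": "oaa", "pyruvate": "pyr",
    "ribose": "rib", "saccharate": "sacc", "sorbitol": "sbt", "succinate": "succ",
    "thymidine": "thymd", "trehalose": "tre", "xylose": "xyl", "a-ketoglutarate": "akg",
}

def resolve_carbon_source(name):
    """Return the file name for a carbon source; accepts either the full name or the file name."""
    if name in CARBON_FILE_NAMES:
        return CARBON_FILE_NAMES[name]
    if name in CARBON_FILE_NAMES.values():
        return name
    raise ValueError("Unknown carbon source: %r" % (name,))

# Hypothetical usage: equivalent to setting carbon_source = "glc" directly
carbon_source = resolve_carbon_source("glucose")
```

Passing an unknown name raises a `ValueError` immediately instead of failing when the feather file is read.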

**Import package**

For reproducibility, Python and the packages below must be pinned to these versions.
* python: v. 3.6.5
* H2O4GPU: v. 0.2.0
* scikit-learn: v. 0.19.1
* pandas: v. 1.1.5
* numpy: v. 1.19.5
* pyarrow: v. 6.0.1

In [None]:
# python v. 3.6.5
import sklearn # v. 0.19.1
import pandas as pd # v. 1.1.5
import numpy as np # v. 1.19.5
from h2o4gpu.solvers.elastic_net import ElasticNet # v. 0.2.0
from sklearn import preprocessing, utils # makes sklearn.preprocessing and sklearn.utils available
import h2o4gpu.util.import_data as io
import h2o4gpu.util.metrics as metrics
import warnings
import random
warnings.filterwarnings(action='ignore')
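
Because the results depend on pinned versions, it can help to report what is actually installed before training. The following is a minimal sketch, assuming each package exposes a `__version__` attribute. `PINNED` is copied from the version notes in this section, and `report_versions` is a hypothetical helper, not part of the original notebook.

```python
import importlib

# Versions taken from the notes in this section (module name -> pinned version)
PINNED = {
    "h2o4gpu": "0.2.0",
    "sklearn": "0.19.1",
    "pandas": "1.1.5",
    "numpy": "1.19.5",
    "pyarrow": "6.0.1",
}

def report_versions(pinned):
    """Return (module, pinned, found, matches) for every pinned module."""
    rows = []
    for name, wanted in pinned.items():
        try:
            module = importlib.import_module(name)
            found = getattr(module, "__version__", "unknown")
        except Exception:  # missing package or failed native import
            found = "not installed"
        rows.append((name, wanted, found, found == wanted))
    return rows

for name, wanted, found, ok in report_versions(PINNED):
    status = "OK" if ok else "MISMATCH"
    print("{:<10} pinned {:<8} found {:<14} {}".format(name, wanted, found, status))
```

A mismatch is only a warning sign; whether a different version changes the fitted coefficients has to be checked case by case.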

**Import data**

* The simulated flux data are imported and preprocessed into the training data (`X_data`, later scaled into `X_train_scaled`).
* We take the absolute value of each flux and drop reactions whose flux is zero in every deletion mutant.
* The final 24 OD data from Tong (2020) and Baba (2006) are used as the target data (`y_data`).

In [None]:
#Extracting metabolic flux data
X_data_raw  = pd.read_feather("input_data/simulated_fluxes("+carbon_source+").feather").set_index("index")

#Remove reactions whose flux is zero in every mutant, then take absolute values
all_zero = (X_data_raw == 0).all(axis=0)    # True for columns that are zero everywhere
X_data = X_data_raw.loc[:, ~all_zero].abs() # keep the remaining reactions, in the original order


#Extracting growth data for target data
growth_data = pd.read_feather("input_data/biomass_data.feather").set_index("index")
y_data_raw =  growth_data[carbon_source]
y_data = y_data_raw[y_data_raw.index.isin(X_data.index)]
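
The `isin` filter above keeps the growth values whose mutant appears in `X_data`, but it does not by itself guarantee that the two objects end up with the same rows in the same order. The sketch below, on invented toy data, shows one way to check and enforce that alignment. The mutant names and the `_toy` and `_aligned` variables are hypothetical.

```python
import pandas as pd

# Toy stand-ins for X_data and y_data_raw; mutant names are invented for illustration
X_toy = pd.DataFrame({"rxn_a": [0.1, 0.2, 0.3]}, index=["mutA", "mutB", "mutC"])
y_toy_raw = pd.Series([0.9, 0.5, 0.7], index=["mutC", "mutA", "mutD"])

# Keep only mutants that have a growth measurement, in the row order of X
X_aligned = X_toy.loc[X_toy.index.isin(y_toy_raw.index)]
# Reorder y to follow X row by row
y_aligned = y_toy_raw.reindex(X_aligned.index)

# Every feature row must now pair with the target value of the same mutant
assert X_aligned.index.equals(y_aligned.index)
print(y_aligned.to_dict())
```

If the two feather files already list the same mutants in the same order, this step changes nothing; otherwise it prevents features and targets from being paired across different mutants.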

### **Machine learning with ElasticNet regression**
Hyperparameter setting.

In [None]:
random_seed = 0 # random seed
n_alphas    = 100 # number of alphas along the regularization path
max_iter    = 10000 # maximum number of iterations (integer)
tol         = 1e-6 # tolerance for the optimization
cv_folds    = 300 # number of cross validation folds
l1_ratio    = 1e-2 # scaling between l1 and l2 penalties
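
For orientation, the elastic net objective in the scikit-learn convention, with regularization strength $\alpha$ and mixing ratio $\rho$ (`l1_ratio`), is written below. It is given as an assumption about what `l1_ratio` controls here; h2o4gpu's internal scaling and its use of the name `n_alphas` may differ.

$$
\min_{w}\; \frac{1}{2n}\lVert y - Xw\rVert_2^2 \;+\; \alpha\,\rho\,\lVert w\rVert_1 \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w\rVert_2^2
$$

Under this convention, `l1_ratio = 1e-2` gives the L1 term a weight of $0.01\,\alpha$ and the L2 term a weight of $0.495\,\alpha$, so the penalty is mostly ridge-like with a small sparsity-inducing component.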

Run the machine learning code.

In [None]:
#Standardize data
X_train_scaled = sklearn.preprocessing.StandardScaler().fit_transform(X_data)

#Shuffle the data
y_train = y_data

X_train_scaled, y_train = sklearn.utils.shuffle(X_train_scaled, y_train, random_state=random_seed)

#Train the model
enlr = ElasticNet(max_iter=max_iter,
                  n_alphas=n_alphas,
                  tol=tol,
                  n_folds=cv_folds,
                  l1_ratio=l1_ratio,
                  random_state=random_seed)

enlr.fit(X_train_scaled,y_train)
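
To make the two preprocessing steps concrete, here is a NumPy-only sketch on invented toy data. `standardize` and `shuffle_together` are hypothetical helpers that mirror conceptually what `StandardScaler` and `sklearn.utils.shuffle` do. The exact permutation scikit-learn produces for a given seed is not guaranteed to match this one.

```python
import numpy as np

def standardize(X):
    """Column-wise z-score using the population standard deviation (ddof=0)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns: avoid dividing by zero
    return (X - mean) / std

def shuffle_together(X, y, seed):
    """Apply one shared random permutation to the rows of X and the entries of y."""
    order = np.random.RandomState(seed).permutation(len(y))
    return X[order], y[order], order

# Toy data: column 0 varies, column 1 is constant and nonzero
X_toy = np.array([[1.0, 5.0], [3.0, 5.0], [5.0, 5.0]])
y_toy = np.array([0.2, 0.4, 0.6])

Z = standardize(X_toy)
Z_shuffled, y_shuffled, order = shuffle_together(Z, y_toy, seed=0)
print(Z)
print(order)
```

A constant, nonzero column survives the earlier all-zero filter but becomes all zeros after standardization, so it carries no information for the regression. Shuffling with a single permutation keeps each feature row paired with its own target value.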

Extract & filter the coefficients of the trained model.

In [None]:
#Extract each reaction's coefficient
raw_coefs_data = pd.Series(enlr.coef_, index=X_data.columns , name=  "Coefficient").to_frame()

#Load the reaction list used to filter out transport and external reactions
with open("util/memote_pure_rxns.txt", 'r') as f:
    memote_pure_rxn = f.read().strip('"').split('","')

#Separate beneficial(+) and detrimental(-) reactions based on coefficient value
coefs_pos = raw_coefs_data[raw_coefs_data.iloc[:, 0] > 0]
coefs_neg = raw_coefs_data[raw_coefs_data.iloc[:, 0] < 0]

#Filter out reactions whose coefficient magnitude is below 10% of the mean for its sign,
#then keep only the reactions listed in memote_pure_rxn
avg_coefs_pos = coefs_pos.iloc[:, 0].mean()
avg_coefs_neg = coefs_neg.iloc[:, 0].mean()

final_pos_coefs = coefs_pos[coefs_pos.iloc[:, 0] >= 0.1*avg_coefs_pos]
final_pos_coefs = final_pos_coefs[final_pos_coefs.index.isin(memote_pure_rxn)]
final_neg_coefs = coefs_neg[abs(coefs_neg.iloc[:, 0]) >= abs(0.1*avg_coefs_neg)]
final_neg_coefs = final_neg_coefs[final_neg_coefs.index.isin(memote_pure_rxn)]

#Sort and export to csv
filtered_coefs = pd.concat([final_pos_coefs, final_neg_coefs])
filtered_coefs  = filtered_coefs.sort_values(ascending=False, by="Coefficient") 
filtered_coefs.to_csv("EN_output_data/"+output_name+"_en_coefs.csv")
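
The filtering rule above can be checked on a small invented example. The sketch below restates it as a single function on a pandas Series. `filter_coefs`, the reaction names, and the coefficient values are all hypothetical and chosen only to exercise each branch of the rule.

```python
import pandas as pd

def filter_coefs(coefs, allowed, fraction=0.1):
    """Keep coefficients whose magnitude is at least `fraction` of the mean
    magnitude for their sign, restricted to the reactions in `allowed`."""
    pos = coefs[coefs > 0]
    neg = coefs[coefs < 0]
    pos = pos[pos >= fraction * pos.mean()]
    neg = neg[neg.abs() >= abs(fraction * neg.mean())]
    kept = pd.concat([pos, neg])
    kept = kept[kept.index.isin(allowed)]
    return kept.sort_values(ascending=False)

# Invented coefficients: R2 and R5 are negligible, R6 is exactly zero,
# and EX_glc is strong but not in the allowed reaction list
toy_coefs = pd.Series(
    {"R1": 1.0, "R2": 0.05, "R3": 0.45, "R4": -0.8, "R5": -0.02, "R6": 0.0, "EX_glc": 0.9},
    name="Coefficient",
)
toy_allowed = ["R1", "R2", "R3", "R4", "R5", "R6"]

toy_filtered = filter_coefs(toy_coefs, toy_allowed)
print(toy_filtered)
```

Here the positive mean is 0.6, so the positive cutoff is 0.06 and `R2` is dropped. The negative mean is -0.41, so the cutoff on magnitude is 0.041 and `R5` is dropped. `R6` has neither sign, and `EX_glc` is removed by the reaction list, leaving `R1`, `R3`, and `R4`.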