# Model Integration Workflow Demo

This notebook provides a **demonstration and case study of the model integration workflow**, showing how to connect **external models**—such as pilot-scale experimental data or rigorous process simulators—into WaterTAP.  

The workflow illustrates how surrogate modeling can bridge detailed models with **system-scale analysis**, enabling the following capabilities **within the WaterTAP and IDAES ecosystem**:  
- Technology performance screening  
- Process integration and plant-wide optimization  
- Techno-economic and market participation assessments  

**Example Case:**  
In this demonstration, we integrate the **Modular Encapsulated Two-Stage Anaerobic Biological (METAB) system**. The workflow was implemented to connect:  
- **Benchmark Simulation Model in WaterTAP**  
- **Rigorous METAB process model** developed in the study by:  

> Zhang, X., Arnold, W. A., Wright, N., Novak, P. J., & Guest, J. S. (2024). *Prioritization of early-stage research and development of a hydrogel-encapsulated anaerobic technology for distributed treatment of high strength organic wastewater.* Environmental Science & Technology, 58(44), 19651–19665.  

The model was implemented on the external modeling platform **QSD-SAN**.  

This example illustrates how external models can be systematically integrated into WaterTAP to enable **system-scale evaluation, optimization, and techno-economic analysis** of novel wastewater treatment technologies.

![Easy Five-Step Workflow: Variables --> Input Space --> Model Runs --> Surrogate --> Selection](workflow_steps.png)

## Step 1: Define Input Variables

The first step in the workflow is to **select the key operating and design variables** that will define the input space for surrogate modeling.  

**Example Case: METAB System**  
For the METAB system, the input variables of interest include:  
- **Influent flow rate (loading rate)** – the rate at which wastewater enters the system  
- **Operational temperature** – affects microbial activity and treatment efficiency  
- **Hydraulic Retention Time (HRT)** – the time wastewater remains in the digester, influencing conversion and biogas production  

**Step Tasks:**  
- Select **key operating/design variables** relevant to system performance  
- Establish **bounds for the operational range** based on pilot data or rigorous model knowledge  

These defined variables and bounds form the foundation for **input space generation** in the next step.

In [1]:
input_var_info = {
    "inf_fr": (5000, 10000),  # Influent flow rate (loading rate): (lower bound, upper bound)
    "temp": (22, 35),         # Operational temperature: (lower bound, upper bound)
    "hrt": (1, 12),           # Hydraulic retention time (HRT): (lower bound, upper bound)
}
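
A quick sanity check on the bounds can catch typos before any sampling is done. The sketch below is illustrative only: `validate_bounds` is a hypothetical helper, not part of WaterTAP or of this workflow's modules, and the dictionary simply repeats the values from the cell above.

```python
def validate_bounds(var_info):
    """Check that every variable maps to a (lower, upper) pair with lower < upper."""
    for name, bounds in var_info.items():
        if len(bounds) != 2:
            raise ValueError(f"{name}: expected (lower, upper), got {bounds!r}")
        lower, upper = bounds
        if not lower < upper:
            raise ValueError(
                f"{name}: lower bound {lower} must be below upper bound {upper}"
            )
    return True


# Same structure and values as input_var_info in the cell above.
example_var_info = {
    "inf_fr": (5000, 10000),
    "temp": (22, 35),
    "hrt": (1, 12),
}
bounds_ok = validate_bounds(example_var_info)
```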

## Step 2: Input Space Generation

Once the key input variables and their bounds are defined, the next step is to **systematically generate the input space**. This ensures that the surrogate model captures the full variability of the system.  

**Example Case: METAB System**  
For METAB, the input variables include influent flow rate, operational temperature, and HRT. The goal is to explore combinations of these variables across their feasible ranges.  

**Step Tasks:**  
- **Systematically sample operating ranges** to cover variability in the input space  
- **Methods:**  
  - Random sampling within bounds  
  - **Latin Hypercube Sampling (LHS)** – a space-filling statistical technique that divides each variable’s range into intervals and selects samples to evenly cover the multi-dimensional space  
  - Other space-filling designs as needed  

**Goal:**  
- Generate a **broad and representative dataset** for surrogate model training  
- Ensure that all relevant operating conditions are considered for accurate system-level predictions  

This input dataset will be used in **Step 3: Model Evaluations** to generate outputs for surrogate training.
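
To make the LHS idea concrete, here is a minimal standard-library sketch of the stratify-and-shuffle procedure described above. It is not the implementation behind `create_samples`; the function name `latin_hypercube` and its arguments are assumptions made for illustration.

```python
import random


def latin_hypercube(var_info, n_samples, seed=None):
    """Return n_samples dicts, one value per variable, using Latin Hypercube Sampling.

    Each variable's range is split into n_samples equal intervals, one point is
    drawn uniformly inside each interval, and the interval order is shuffled
    independently for each variable.
    """
    rng = random.Random(seed)
    columns = {}
    for name, (lower, upper) in var_info.items():
        width = (upper - lower) / n_samples
        # One uniform draw inside each of the n_samples strata.
        points = [lower + (k + rng.random()) * width for k in range(n_samples)]
        rng.shuffle(points)  # decouple the strata across variables
        columns[name] = points
    return [{name: columns[name][i] for name in columns} for i in range(n_samples)]


samples = latin_hypercube(
    {"inf_fr": (5000, 10000), "temp": (22, 35), "hrt": (1, 12)},
    n_samples=20,
    seed=0,
)
```

Because every interval of every variable is hit exactly once, 20 LHS points cover each one-dimensional range more evenly than 20 purely random draws would.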


In [12]:
from input_space_generation import create_samples
import os

path = "./results/"
input_data_csv_file = path + "input_data.csv"

# Generate samples only if the input file does not exist yet
if not os.path.exists(input_data_csv_file):
    create_samples(
        method="LHS",                   # Latin Hypercube Sampling (LHS)
        input_var_info=input_var_info,  # defined in Step 1
        sample_numbers=20,              # can be changed to any integer
        csv_file=input_data_csv_file,
    )

print("Input data space file has been generated.")

Input data space file has been generated.


## Step 3: External Model Evaluations

In this step, we **run the rigorous external models** to generate outputs for the input conditions sampled in Step 2. These outputs will form the training dataset for surrogate models.  

**Example Case: METAB System**  
For METAB, the external model is the rigorous process model implemented in QSD-SAN. For each combination of influent flow rate, operational temperature, and HRT, the model generates outputs such as:  
- **Biogas production**  
- **Effluent flow rate and quality**  


**Step Tasks:**  
- Run the rigorous model across all sampled input sets  
- Collect model outputs to form a **comprehensive dataset**  
- Ensure that the dataset **captures nonlinear system behavior** across the operational range  

This dataset will then be used in **Step 4: Surrogate Model Training**, bridging the detailed external model with the WaterTAP flowsheet for system-scale analysis.
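
The evaluation loop in the step tasks above can be sketched as follows. Everything here is an assumption for illustration: `toy_model` is an arbitrary stand-in and not the METAB model, `evaluate_samples` is a hypothetical helper, and recording failed runs rather than aborting is one possible design choice, not necessarily what `run_model` does.

```python
import csv
import io


def toy_model(inf_fr, temp, hrt):
    """Hypothetical stand-in for a rigorous simulator; NOT the METAB model."""
    # Arbitrary smooth nonlinear response, used only to produce numbers.
    biogas = inf_fr * hrt / (hrt + 2.0) * (1.0 + 0.02 * (temp - 22.0))
    return {"biogas": biogas, "effluent_fr": inf_fr}


def evaluate_samples(model, samples):
    """Run the model on every sample; record failures instead of aborting."""
    rows = []
    for sample in samples:
        try:
            outputs = model(**sample)
        except Exception as err:  # keep going if a single run fails
            rows.append({**sample, "status": f"failed: {err}"})
            continue
        rows.append({**sample, **outputs, "status": "ok"})
    return rows


samples = [
    {"inf_fr": 5000, "temp": 22, "hrt": 1},
    {"inf_fr": 7500, "temp": 30, "hrt": 6},
]
rows = evaluate_samples(toy_model, samples)

# Write the joined inputs and outputs as CSV (to a string buffer here).
fieldnames = ["inf_fr", "temp", "hrt", "biogas", "effluent_fr", "status"]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Keeping inputs and outputs in the same row makes the file directly usable as a training table in the next step.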


In [2]:
from model_evaluation import get_input_data, run_model, export_output_data
import os

path = "./results/"
input_data_csv_file = path + "input_data.csv"
output_data_csv_file = path + "output_data.csv"

# Run the external model only if the output file does not exist yet
if not os.path.exists(output_data_csv_file):
    input_data = get_input_data(filename=input_data_csv_file)
    output_data = run_model(input_data)
    export_output_data(output_data, filename=output_data_csv_file)

print("Model evaluation output file has been generated.")

Model evaluation output file has been generated.


## Step 4: Surrogate Model Training

Once the training dataset from Step 3 is available, we fit **surrogate models** to approximate the behavior of the rigorous external model. Surrogates allow fast evaluations while capturing nonlinear relationships between inputs and outputs.

**Example Case: METAB System**  
For METAB, the surrogates predict outputs such as biogas production, effluent quality, and energy consumption based on influent flow rate, operational temperature, and HRT.

---

### Surrogate Model Types

**1. Polynomial Regression**  
Fits a polynomial equation (shown here to second order) to relate inputs $x_1, x_2, \dots, x_n$ to output $y$:  

$$
y = \beta_0 + \sum_{i=1}^{n} \beta_i x_i + \sum_{i=1}^{n}\sum_{j=i}^{n} \beta_{ij} x_i x_j + \epsilon
$$

- $\beta_0$, $\beta_i$, and $\beta_{ij}$ are regression coefficients, and $\epsilon$ is the residual error  
- Higher-order terms can be added to capture stronger nonlinearity  
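
The double sum in the equation above runs over $j \ge i$, so it includes squared terms as well as cross products. A small sketch that expands an input vector into those terms makes the count explicit; `quadratic_features` and `predict` are hypothetical names used only for this illustration, not functions from the surrogate trainer.

```python
def quadratic_features(x):
    """Expand inputs into the terms of a full second-order polynomial.

    Order: constant, linear terms x_i, then products x_i * x_j for j >= i
    (so squared terms x_i**2 are included).
    """
    n = len(x)
    features = [1.0]
    features.extend(float(v) for v in x)
    for i in range(n):
        for j in range(i, n):
            features.append(float(x[i] * x[j]))
    return features


def predict(beta, x):
    """Evaluate y = sum_k beta_k * feature_k(x) for given coefficients."""
    return sum(b * f for b, f in zip(beta, quadratic_features(x)))


terms = quadratic_features([2.0, 3.0])  # [1, x1, x2, x1^2, x1*x2, x2^2]
n_terms_three_inputs = len(quadratic_features([0.0, 0.0, 0.0]))
```

With three inputs, as in the METAB case, a full second-order model has 10 coefficients per output, which is worth comparing against the number of training samples.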

**2. Kriging (Gaussian Process Regression)**  
Models the output as a Gaussian process:  

$$
y(x) = \mu + Z(x)
$$  

Covariance:  

$$
\text{Cov}[Z(x_i), Z(x_j)] = \sigma^2 R(x_i, x_j), \quad
R(x_i, x_j) = \exp\left(-\sum_{k=1}^n \theta_k |x_{ik} - x_{jk}|^2\right)
$$  

- $\mu$ is the mean, $Z(x)$ is a correlated Gaussian process  
- Captures nonlinearities and provides uncertainty estimates  
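
As a minimal sketch of the correlation function $R(x_i, x_j)$ above, the snippet below builds the correlation matrix for a few points. The `theta` values are illustrative placeholders, not fitted hyperparameters, and the function names are assumptions.

```python
import math


def gaussian_correlation(xi, xj, theta):
    """R(x_i, x_j) = exp(-sum_k theta_k * |x_ik - x_jk|**2)."""
    return math.exp(-sum(t * (a - b) ** 2 for t, a, b in zip(theta, xi, xj)))


def correlation_matrix(points, theta):
    """Symmetric matrix of pairwise correlations with ones on the diagonal."""
    return [[gaussian_correlation(p, q, theta) for q in points] for p in points]


points = [[0.0, 0.0], [1.0, 2.0], [0.1, 0.0]]
theta = [1.0, 0.5]  # illustrative values, not fitted hyperparameters
R = correlation_matrix(points, theta)
```

Nearby points get a correlation close to 1 and distant points a correlation close to 0; a larger $\theta_k$ makes the output decorrelate faster along input $k$.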

**3. Radial Basis Function (RBF)**  
Approximates outputs as weighted sums of radial functions:  

$$
y(x) = \sum_{i=1}^{N} w_i \phi(\|x - x_i\|)
$$  

- $x_i$ are the $N$ training points, $w_i$ are weights  
- $\phi(r)$ is a radial function, e.g., Gaussian: $\phi(r) = e^{-(\epsilon r)^2}$, where $\epsilon$ here is a shape parameter (not the error term of the polynomial model)  
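
A minimal sketch of Gaussian RBF interpolation, assuming the weights are found by requiring the interpolant to pass through every training point. The small linear solver, the function names, and the toy data are all illustrative choices; `shape` plays the role of the shape parameter written as $\epsilon$ in the Gaussian formula above.

```python
import math


def gaussian_rbf(r, shape):
    """phi(r) = exp(-(shape * r)**2)."""
    return math.exp(-((shape * r) ** 2))


def solve_linear(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (aug[i][n] - s) / aug[i][i]
    return w


def fit_rbf(centers, values, shape=1.0):
    """Weights w such that the interpolant passes through every training point."""
    phi = [[gaussian_rbf(math.dist(p, q), shape) for q in centers] for p in centers]
    return solve_linear(phi, values)


def predict_rbf(x, centers, weights, shape=1.0):
    """y(x) = sum_i w_i * phi(||x - x_i||)."""
    return sum(
        w * gaussian_rbf(math.dist(x, c), shape) for w, c in zip(weights, centers)
    )


centers = [[0.0], [1.0], [2.0], [3.0]]
values = [0.0, 1.0, 4.0, 9.0]  # toy data: y = x**2
weights = fit_rbf(centers, values)
y_at_training_point = predict_rbf([2.0], centers, weights)
```

By construction the interpolant reproduces the training values exactly (up to round-off), so accuracy has to be judged on points that were not used for fitting.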

**4. ALAMO (Automated Learning of Algebraic Models for Optimization)**  
Constructs simple algebraic models:  

$$
y = \sum_{i} c_i f_i(x)
$$  

- $f_i(x)$ are linear, polynomial, exponential, or rational functions  
- ALAMO selects relevant terms to balance accuracy and simplicity  
- Produces **optimization-ready models**  

---

These surrogate models are **computationally efficient** and ready for validation and selection in Step 5, enabling integration into WaterTAP for system-scale analysis.


In [1]:
import os
import surrogate_model_generator as trainer

feed_data, input_data, output_data = trainer.get_data()
output_data = trainer.outputs_selections(output_data)

all_surrogate_methods = ["poly", "kri", "rbf", "alamo"]
path = "./results/"
for method in all_surrogate_methods:
    model_file = path + method + "_surrogate.json"
    # Skip training when a saved surrogate file already exists
    if not os.path.exists(model_file):
        trainer.gen_surrogate_model(
            method=method,
            feed_data=None,
            input_data=input_data,
            output_data=output_data,
        )
    print("Surrogate model using " + method + " method has been generated")
print("All models are trained and ready for selection")

Surrogate model using poly method has been generated
Surrogate model using kri method has been generated
Surrogate model using rbf method has been generated
Surrogate model using alamo method has been generated
All models are trained and ready for selection


## Step 5: Performance Estimation & Model Selection

After training surrogate models in Step 4, the next step is to **validate their accuracy** and select the best-performing model for system-scale integration.

**Example Case: METAB System**  
For METAB, surrogate models predict outputs like biogas production, effluent quality, and energy consumption. These predictions are compared against a **withheld test dataset** (data not used in training) to evaluate model performance.

**Step Tasks:**  
- **Validate surrogate accuracy** using metrics such as:  
  - **R² (Coefficient of Determination)** – indicates how well the model explains variance in the data  
  - **RMSE (Root Mean Squared Error)** – the square root of the mean squared error (MSE), expressed in the same units as the output  
  - **Parity plots** – visualize predicted vs. actual values for each output  

- **Select the best-performing surrogate model** based on these metrics  
- The selected surrogate is then **ready to integrate** into WaterTAP for system-level analysis, optimization, and techno-economic evaluation

**Outcome:**  
A computationally efficient surrogate model that faithfully represents the external rigorous model and can be used for **plant-scale simulations and decision-making**.
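
The accuracy metrics used in this step can be computed in a few lines. This is a sketch using the standard definitions, not the code inside `performance_estimation`; in particular, treating `n_features` as the number of predictors in the adjusted R² formula is an assumption, and the numbers are made up for illustration.

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Return R^2, adjusted R^2, MAE, MSE, and RMSE for one output variable."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R^2 penalizes model size; requires n > n_features + 1.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    mse = ss_res / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return {
        "R2": r2,
        "Adjusted R2": adj_r2,
        "MAE": mae,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
    }


# Made-up numbers for illustration only (not METAB results).
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]
metrics = regression_metrics(y_true, y_pred, n_features=3)
```

With few test points, adjusted R² can fall well below R², which is one reason to compare surrogates on more than a single metric.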


In [None]:
from performance_estimation import display_performace

display_performace(method="poly", path="./results/")

| # | Predicted Variables | R^2 | Adjusted R^2 | MAE | MSE |
|---|---|---|---|---|---|
| 1 | S_su | 0.982842 | 0.953427 | 0.251895 | 0.1451095 |
| 2 | S_aa | 0.990575 | 0.974418 | 0.088737 | 0.01775902 |
| 3 | S_fa | 0.982444 | 0.952348 | 0.988848 | 2.254959 |
| 4 | S_va | 0.992831 | 0.980541 | 0.154351 | 0.05098712 |
| 5 | S_bu | 0.993203 | 0.981551 | 0.21017 | 0.09303517 |
| 6 | S_pro | 0.987178 | 0.965198 | 0.324885 | 0.2298348 |
| 7 | S_ac | 0.989913 | 0.972622 | 0.619289 | 0.878602 |
| 8 | S_h2 | 0.98669 | 0.963873 | 2e-06 | 6.255281e-12 |
| 9 | S_ch4 | 0.995764 | 0.988503 | 0.182005 | 0.06960779 |
| 10 | S_IC | 0.987474 | 0.9762 | 1.304313 | 3.302947 |


In [4]:
# TODO: from performance_estimation import display_plot
# display_plot()

