
# Pump Performance Surrogate with PySMO Surrogate Object

In this demo, we will build a surrogate model of a reverse osmosis (RO) multispeed pump performance curve using the PySMO surrogate fitting tools. The surrogate will approximate the pump's power consumption as a function of flow rate and head, enabling rapid evaluations in system simulations and seamless integration into optimization problems.

$$P= f(Q, H)$$

We will then generate the multispeed pump power curve surrogate at different operating speeds and evaluate its accuracy against the original pump performance data.

$$P = f(Q, H, N)$$

In this demo you will learn:
- How to load pump performance data and prepare it for surrogate fitting
- How to build and evaluate surrogate models using PySMO
- How to visualize surrogate performance against original data
- How to save and load surrogate models for future use
<!-- - How to adaptively refine surrogate models based on error metrics
- How to compare different surrogate fitting strategies
- How to interpret surrogate error metrics and improve model accuracy
- How to integrate surrogate models into system simulations
- Best practices for surrogate modeling in water treatment applications -->

# Surrogate Modeling Workflow
The general workflow for building a surrogate model in WaterTAP involves the following steps:
1. Load and preprocess the dataset for surrogate training and validation
   1. Load data from CSV or other formats
   2. Sample input space or perform a training/validation split via PySMO sampling tools.
2. Train the surrogate
   1. Create a trainer object with specified input and output labels, training data, and surrogate type.
   2. Configure trainer options as needed
   3. Train the surrogate model
3. Build Surrogate Object
   1. Pass surrogate expressions, labels, and bounds to the chosen surrogate (ALAMO, PySMO, Keras).
4. Save the surrogate model for future use
5. Validate the surrogate
   1. Evaluate surrogate performance using error metrics
   2. Visualize surrogate predictions against original data
6. Load saved surrogate models as needed to your flowsheet
7. Optionally, adaptively refine the surrogate by adding samples based on error metrics

Don't fret! The rest of this notebook walks through each of these steps in detail, with code snippets and explanations along the way.

<div style="background-color: #f9f9f9; border-left: 6px solid #4c78af; padding: 10px; margin: 10px 0;">
IDAES contains several surrogate modeling tools, including the IDAES Surrogates API which enables integrating ALAMO, PySMO or Keras surrogate models into IDAES flowsheets.

</div>

# Python-based Surrogate Modelling Objects (PySMO) Introduction

<p align="center">
<img src="https://idaes-pse.readthedocs.io/en/stable/_images/pysmo-logo.png" width="300">
</p>

**Python-based Surrogate Modelling Objects** ([PySMO](https://idaes-pse.readthedocs.io/en/1.5.1/surrogate/pysmo/index.html)) provides tools for generating different types of reduced order models.

## Sampling with PySMO

The PySMO package offers five sampling methods:

* Latin Hypercube Sampling (LHS)
* Full-Factorial Sampling
* Halton Sampling
* Hammersley Sampling
* Centroidal Voronoi tessellation (CVT) sampling

and two modes for data sampling:
- In `creation` mode, PySMO creates a specified number of sample points from the bounds provided by the user.
- In `selection` mode, PySMO selects a specified number of data points from a user-supplied dataset or file.
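
To make the two modes concrete, here is a minimal hand-rolled sketch using only NumPy and pandas. It does not call PySMO: `lhs_create` and `select_nearest` are hypothetical helper names, the bounds and the grid dataset are made up, and the nearest-point rule used for selection is an assumption about how a selection step could work, not a description of PySMO's internals.

```python
import numpy as np
import pandas as pd


def lhs_create(bounds, n_samples, seed=0):
    """Creation mode: draw one point per equal-width stratum of each variable."""
    rng = np.random.default_rng(seed)
    columns = {}
    for name, (lo, hi) in bounds.items():
        # One uniform draw inside each of the n_samples strata of [0, 1)
        unit = (np.arange(n_samples) + rng.random(n_samples)) / n_samples
        # Shuffle so that strata are paired at random across variables
        columns[name] = lo + rng.permutation(unit) * (hi - lo)
    return pd.DataFrame(columns)


def select_nearest(dataset, targets):
    """Selection mode: keep the dataset row closest to each target point."""
    data = dataset[list(targets.columns)].to_numpy()
    picked = []
    for point in targets.to_numpy():
        nearest = int(np.argmin(np.linalg.norm(data - point, axis=1)))
        if nearest not in picked:  # the same row may be nearest to two targets
            picked.append(nearest)
    return dataset.iloc[picked].reset_index(drop=True)


# Hypothetical normalized bounds for flow rate Q and head H
bounds = {"Q": (0.0, 1.0), "H": (0.2, 1.0)}
created = lhs_create(bounds, n_samples=8, seed=1)

# A made-up regular grid standing in for a user-supplied dataset
grid = pd.DataFrame(
    [(q, h) for q in np.linspace(0, 1, 11) for h in np.linspace(0.2, 1, 9)],
    columns=["Q", "H"],
)
selected = select_nearest(grid, created)
print(created.shape, selected.shape)
```

In creation mode only the bounds are needed, while selection mode can never return a point that is not already in the dataset, which is why it suits measured data such as a pump curve.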

## Generating surrogates with PySMO

PySMO provides tools for generating three types of surrogates:

- Polynomial surrogates
- Radial basis function (RBF) surrogates, and
- Kriging surrogates

Details about the methods may be found in the [documentation](https://idaes-pse.readthedocs.io/en/stable/explanations/modeling_extensions/surrogate/api/pysmo/).

In today's demo, we will focus on building polynomial surrogates using PySMO.

# Import Libraries and Dataset

In [None]:
# Import standard and scientific Python libraries
import os
import numpy as np
import pandas as pd

# Import Pyomo libraries
from pyomo.environ import (
    ConcreteModel,
    SolverFactory,
    value,
    Var,
    Constraint,
    Set,
    Objective,
    maximize,
)
from pyomo.common.timing import TicTocTimer

# Import IDAES libraries
# Specific imports for surrogate modeling
from idaes.core.surrogate.sampling.data_utils import split_training_validation
# Import PySMO surrogate modeling tools
from idaes.core.surrogate.pysmo_surrogate import PysmoPolyTrainer, PysmoSurrogate
# Import plotting functions for surrogate evaluation
from idaes.core.surrogate.plotting.sm_plotter import (
    surrogate_scatter2D,
    surrogate_parity,
    surrogate_residual,
)
# Import SurrogateBlock for creating surrogate models
from idaes.core.surrogate.surrogate_block import SurrogateBlock
# Import FlowsheetBlock for creating flowsheets
from idaes.core import FlowsheetBlock


## Importing Training and Validation Datasets

In this section, we read the dataset from the CSV file and prepare it for surrogate training and validation. For simplicity and to reduce training runtime, the dataset can optionally be subsampled to <span style="background-color: #555; color: white;">100 data points</span> **for training/validation**; the sampling line in the code below is commented out, so all points at the selected speed are used by default. The data is separated using an <span style="background-color: #555; color: white;">80/20 split into training and validation data</span> using the IDAES `split_training_validation()` method.

### Multispeed RO Pump Curve Data

The dataset includes pump head (ft) and power (kW) at various flow rates (GPM) for multiple pump speeds (RPM). We will focus on building surrogates for power consumption as a function of flow rate and head at a selected speed.

<p align="center">
<img src="pump_datasheet/multispeed-pump-curves.svg" width="700">
</p>

The pump performance data used in this demo is sourced from the following datasheet:

 [`week7/pump_datasheet/H10B26 DOCUMENTATION_related_to_RO_pump_multispeed_curves.pdf`](pump_datasheet/multispeed-ro-pump/H10B26%20DOCUMENTATION_related_to_RO_pump_multispeed_curves.pdf)

  *Operational note:* Flowrates below Minimum Continuous Stable Flow (MCSF) are excluded due to unstable pump operation / potential vibration or cavitation risk; analyses and model inputs should use ≥ MCSF datasets.

In [None]:
# Import Pump dataset

csv_path = os.path.join("pump_datasheet", "multispeed-ro-pump_head_power_aligned_above_mcsf.csv")

csv_data = pd.read_csv(csv_path)  # load dataset, 408 data points

# Let's take a look at the first few rows of the dataset
data = csv_data
data.head()

<!-- ## Multispeed Pump Surrogate Input and Output Variables

We are going to use the normalized values of the input and output variables for surrogate training and validation.

$$P = f(Q, H, N)$$

The input variables for the surrogate model in the dataset are:

<div align="center">

|Input Variable|Symbol|Dataset Column|
|--------------|------|--------------|
|Flow rate|$Q$|`Flowrate_norm_global`|
|Head|$H$|`Head_norm_global`|
|Speed|$N$|`RPM_norm`|

</div>

The output variable is:
- Power, $P\rightarrow$ `Power_norm_global` -->

In [None]:
# Keep only the relevant columns, ordered so that inputs come before outputs
input_vars = ["Flowrate_norm_global", "Head_norm_global", "RPM_norm"]
output_vars = ["Power_norm_global"]
data = data[input_vars + output_vars]

data.head()

## Single Speed Surrogate Training and Validation Data Preparation


For single-speed pump surrogate modeling, we will fix the speed variable, reducing the model to:

$$P = f(Q, H)$$

The input variables for the surrogate model in the dataset are:

<div align="center">

|Input Variable|Symbol|Dataset Column|
|--------------|------|--------------|
|Flow rate|$Q$|`Flowrate_norm_global`|
|Head|$H$|`Head_norm_global`|

</div>

The output variable is:
- Power, $P\rightarrow$ `Power_norm_global`

We are going to filter the dataset for a single speed (e.g., `RPM_norm` = 1.0 which corresponds to 3450 RPM) before proceeding with surrogate training.
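
The `_norm` columns are used throughout, but how they were computed is not stated in this notebook. The sketch below shows one plausible convention, dividing each column by its maximum over the whole dataset, which would be consistent with `RPM_norm` = 1.0 mapping to the top speed of 3450 RPM. Both the convention and the raw column names (`Flowrate_gpm`, `Head_ft`, `Power_kw`, `RPM`) are assumptions, and every number except 3450 RPM is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up raw values for illustration only; they are not taken from the datasheet
raw = pd.DataFrame(
    {
        "Flowrate_gpm": [40.0, 60.0, 80.0, 35.0, 50.0, 70.0],
        "Head_ft": [900.0, 820.0, 700.0, 680.0, 620.0, 530.0],
        "Power_kw": [20.0, 24.0, 27.0, 13.0, 16.0, 18.0],
        "RPM": [3450, 3450, 3450, 3000, 3000, 3000],
    }
)

# Assumed convention: divide each column by its maximum over the whole dataset
normalized = pd.DataFrame(
    {
        "Flowrate_norm_global": raw["Flowrate_gpm"] / raw["Flowrate_gpm"].max(),
        "Head_norm_global": raw["Head_ft"] / raw["Head_ft"].max(),
        "RPM_norm": raw["RPM"] / raw["RPM"].max(),
        "Power_norm_global": raw["Power_kw"] / raw["Power_kw"].max(),
    }
)

# Select the top speed with a tolerance instead of an exact float comparison
at_top_speed = np.isclose(normalized["RPM_norm"], 1.0)
single_speed = normalized[at_top_speed].reset_index(drop=True)
print(single_speed)
```

Comparing with `np.isclose` rather than `==` is a defensive choice: it still matches if the normalized speed was stored with rounding error.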

In [None]:
# Filter dataset for single-speed pump (e.g., RPM_norm = 1.0)
filtered_df = data[data['RPM_norm'] == 1.0]
data_single = filtered_df.reset_index(drop=True)
data_single.head()
print(f"Single-speed pump dataset size: {data_single.shape}")

In [None]:
# data_single = data_single.sample(n=100)  # randomly sample points for training/validation

# Separate input and output data
input_data = data_single.iloc[:, :2] # first two columns are inputs
output_data = data_single.iloc[:, -1] # last column is output

# Define labels, and split training and validation data
# note that PySMO requires that labels are passed as string lists
input_labels = list(input_data.columns)
output_labels = [output_data.name]

n_data = data_single[input_labels[0]].size
# 80/20 train/validation split using IDAES utility function
data_single_training, data_single_validation = split_training_validation(
    data_single, 0.8, seed=n_data
)  # seed=n_data for reproducibility
data_single_training.head()
print(f"Training data size: {data_single_training.shape}")
print(f"Validation data size: {data_single_validation.shape}")

# Train Surrogate with PySMO

In this section, we will build and train a polynomial surrogate model.
We will call and build the Polynomial Regression class `PysmoPolyTrainer`.

`PysmoPolyTrainer` considers three types of basis functions:

- univariate polynomials,
- second-degree bivariate polynomials, and
- user-specified basis functions.

Then, for a problem with $m$ sample points and $n$ input variables, the resulting polynomial is:

\begin{equation*}
y_{k}={\displaystyle \sum_{i=1}^{n}\beta_{i}x_{ik}^{\alpha}}+\sum_{i,j>i}^{n}\beta_{ij}x_{ik}x_{jk}+\beta_{\Phi}\Phi\left(x_{ik}\right)\qquad i,j=1,\ldots,n;i\neq j;k=1,\ldots,m;\alpha \leq 10\qquad\quad
\end{equation*}

where:
- $y_{k}$ is the output variable,
- $x_{ik}$ is the $i^{th}$ input variable,
- $\beta_{i}$, $\beta_{ij}$, and $\beta_{\Phi}$ are the coefficients of the univariate, bivariate, and user-specified basis functions, respectively,
- $\alpha$ is the polynomial order of the univariate basis functions (up to a maximum of 10),
- $\Phi\left(x_{ik}\right)$ is the user-specified basis function.
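
To see what this basis looks like in practice, the sketch below builds the univariate and bivariate terms by hand and fits the coefficients with ordinary least squares in NumPy. It is not PySMO's implementation: the helper names are hypothetical, the data are synthetic, the user-specified term $\Phi$ is left out, and the constant (intercept) column is an addition that does not appear in the equation above.

```python
import numpy as np


def poly_features(X, order, multinomials=True):
    """Columns: 1, then x_i^a for a = 1..order, then x_i * x_j for i < j."""
    X = np.asarray(X, dtype=float)
    n_inputs = X.shape[1]
    cols = [np.ones(X.shape[0])]  # intercept (an assumption, see lead-in)
    for a in range(1, order + 1):
        for i in range(n_inputs):
            cols.append(X[:, i] ** a)  # univariate terms beta_i * x_i^alpha
    if multinomials:
        for i in range(n_inputs):
            for j in range(i + 1, n_inputs):
                cols.append(X[:, i] * X[:, j])  # bivariate terms beta_ij * x_i * x_j
    return np.column_stack(cols)


def fit_poly(X, y, order, multinomials=True):
    """Least-squares estimate of the coefficient vector."""
    A = poly_features(X, order, multinomials)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def predict_poly(beta, X, order, multinomials=True):
    return poly_features(X, order, multinomials) @ beta


# Synthetic stand-in for (Q, H) -> P; these are not pump data
rng = np.random.default_rng(0)
X = rng.uniform(0.2, 1.0, size=(30, 2))
y = 0.3 + 0.5 * X[:, 0] ** 2 + 0.2 * X[:, 0] * X[:, 1]

beta = fit_poly(X, y, order=2)
y_hat = predict_poly(beta, X, order=2)
print(np.round(beta, 6))
```

With two inputs and order 2 the design matrix has six columns (intercept, $Q$, $H$, $Q^2$, $H^2$, $QH$), so the number of coefficients grows linearly with the polynomial order and quadratically with the number of inputs.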

**PySMO Configuration Options:**

In this example, allowed basis terms span a 6<sup>th</sup> order polynomial as well as a variable product, and data is internally cross-validated using 10 iterations of 80/20 splits to ensure a robust surrogate fit. Note that PySMO uses cross-validation of training data to adjust model coefficients and ensure a more accurate fit, while we separate the validation dataset pre-training in order to visualize the surrogate fits.

|Parameter|Description|Value|
|---------|-----------|-----------|
|`maximum_polynomial_order`|Sets the maximum polynomial order for the surrogate model.|`6`|
|`multinomials`|Boolean option which determines whether bivariate terms are considered in polynomial generation.|`True`|
|`training_split`|Fraction of training data to be used for regression training.|`0.8`|
|`number_of_crossvalidations`|Number of cross-validation iterations to perform during training.|`10`|
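
The idea behind `training_split` and `number_of_crossvalidations` can be sketched as repeated random 80/20 splits used to score each candidate polynomial order. The code below is a simplified NumPy illustration under that assumption; the helper names and synthetic data are hypothetical, and PySMO's actual procedure may differ in how it splits the data and selects the final model.

```python
import numpy as np


def design(X, order):
    """Intercept, univariate powers up to `order`, and pairwise products."""
    X = np.asarray(X, dtype=float)
    n = X.shape[1]
    cols = [np.ones(len(X))]
    cols += [X[:, i] ** a for a in range(1, order + 1) for i in range(n)]
    cols += [X[:, i] * X[:, j] for i in range(n) for j in range(i + 1, n)]
    return np.column_stack(cols)


def cv_error(X, y, order, training_split=0.8, n_cv=10, seed=0):
    """Mean held-out RMSE over n_cv random training/test splits."""
    rng = np.random.default_rng(seed)  # same seed gives every order the same splits
    n_train = int(round(training_split * len(y)))
    errors = []
    for _ in range(n_cv):
        idx = rng.permutation(len(y))
        train, test = idx[:n_train], idx[n_train:]
        beta, *_ = np.linalg.lstsq(design(X[train], order), y[train], rcond=None)
        resid = design(X[test], order) @ beta - y[test]
        errors.append(np.sqrt(np.mean(resid**2)))
    return float(np.mean(errors))


def pick_order(X, y, max_order=6, **kwargs):
    """Score every order from 1 to max_order and return the best one."""
    scores = {k: cv_error(X, y, k, **kwargs) for k in range(1, max_order + 1)}
    return min(scores, key=scores.get), scores


# Synthetic cubic response with a little noise; these are not pump data
rng = np.random.default_rng(0)
X = rng.uniform(0.2, 1.0, size=(60, 2))
y = X[:, 0] ** 3 + 0.5 * X[:, 1] + 1e-4 * rng.standard_normal(60)

best_order, scores = pick_order(X, y, max_order=6, training_split=0.8, n_cv=10)
print(best_order, scores)
```

Because the synthetic response is cubic, orders 1 and 2 leave a systematic error on the held-out points while order 3 and above do not, which is the kind of signal cross-validation is meant to expose.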

Finally, after training the model we save the results and model expressions to a folder which contains a serialized JSON file. Serializing the model in this fashion enables importing a previously trained set of surrogate models into external flowsheets. This feature will be used later.
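
As a rough picture of why JSON serialization makes a trained surrogate portable, the sketch below writes a toy record to disk and reads it back with the standard library. The record layout, the helper names, and all numbers are hypothetical; this is not the schema of the file that `save_to_file` produces.

```python
import json
import os
import tempfile


def save_record(record, path, overwrite=False):
    """Write the record as JSON, refusing to clobber a file unless asked."""
    if os.path.exists(path) and not overwrite:
        raise FileExistsError(path)
    with open(path, "w") as f:
        json.dump(record, f, indent=2)


def load_record(path):
    with open(path) as f:
        return json.load(f)


# Toy stand-in for a trained surrogate: labels, bounds, and coefficients
record = {
    "input_labels": ["Flowrate_norm_global", "Head_norm_global"],
    "output_labels": ["Power_norm_global"],
    "input_bounds": {
        "Flowrate_norm_global": [0.3, 1.0],  # made-up bounds
        "Head_norm_global": [0.2, 1.0],
    },
    "order": 2,
    "coefficients": [0.1, 0.4, 0.2, 0.3, 0.0, 0.05],  # made-up values
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_surrogate.json")
    save_record(record, path, overwrite=True)
    restored = load_record(path)

print(restored == record)
```

Everything needed to evaluate the model (labels, bounds, coefficients) lives in plain text, so another notebook or flowsheet can rebuild it without retraining.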

In [None]:
# capture long output (not required to use surrogate API)
from io import StringIO
import sys

stream = StringIO()
oldstdout = sys.stdout
sys.stdout = stream

# Create PySMO trainer object
trainer = PysmoPolyTrainer(
    input_labels=input_labels,
    output_labels=output_labels,
    training_dataframe=data_single_training,
)

# Set PySMO options
trainer.config.maximum_polynomial_order = 6
trainer.config.multinomials = True
trainer.config.training_split = 0.8
trainer.config.number_of_crossvalidations = 10

# Train surrogate (calls PySMO through IDAES Python wrapper)
poly_train = trainer.train_surrogate() # Surrogate expression object

# Create callable surrogate object

# Set input bounds for surrogate
xmin = data_single_training[input_labels].min()
xmax = data_single_training[input_labels].max()

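# Look up bounds by column label; integer positions on a label-indexed Series are deprecated in recent pandas
input_bounds = {label: (xmin[label], xmax[label]) for label in input_labels}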

# Call PysmoSurrogate to create surrogate object
poly_surr = PysmoSurrogate(poly_train, input_labels, output_labels, input_bounds)

# save model to JSON
model = poly_surr.save_to_file("pysmo_poly_surrogate.json", overwrite=True)

# restore normal stdout now that training output has been captured
sys.stdout = oldstdout

# display first 50 lines and last 50 lines of output
celloutput = stream.getvalue().split("\n")
for line in celloutput[:50]:
    print(line)
print(".")
print(".")
print(".")
for line in celloutput[-50:]:
    print(line)

## Visualize Surrogate Model

Once trained, surrogate models are visualized via scatter, parity, and residual plots to confirm their domain validity. Training data is plotted first to ensure the surrogates fit it correctly.

|Plot Type|Description|IDAES Function|
|---------|-----------|-----------|
|Scatter Plot|Visualizes surrogate predictions against actual data points in a 2D scatter format.|`surrogate_scatter2D`|
|Parity Plot|Compares surrogate predictions to actual data, ideally aligning along the y=x line.|`surrogate_parity`|
|Residual Plot|Shows the differences between surrogate predictions and actual data to identify patterns or biases.|`surrogate_residual`|

In [None]:
# visualize with IDAES surrogate plotting tools for training data
surrogate_scatter2D(poly_surr, data_single_training, filename="pysmo_poly_train_scatter2D.pdf")
surrogate_parity(poly_surr, data_single_training, filename="pysmo_poly_train_parity.pdf")
surrogate_residual(poly_surr, data_single_training, filename="pysmo_poly_train_residual.pdf")

## Validate Surrogate Model

Now that we have trained the surrogate model, we can evaluate its performance on the validation dataset. This involves predicting the output values using the surrogate model and comparing them to the actual values in the validation set. We will compute error metrics such as Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) to quantify the accuracy of the surrogate model. Additionally, we will visualize the surrogate predictions against the original data to assess how well the surrogate captures the underlying relationships.
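
The two error metrics named above can be computed directly from arrays of actual and predicted values. The sketch below uses a hypothetical helper, `error_metrics`, and made-up numbers in place of the validation data and surrogate predictions.

```python
import numpy as np


def error_metrics(actual, predicted):
    """Return MAE and RMSE for two equal-length arrays."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    resid = predicted - actual
    return {
        "MAE": float(np.mean(np.abs(resid))),  # average absolute error
        "RMSE": float(np.sqrt(np.mean(resid**2))),  # penalizes large errors more
    }


# Made-up normalized power values standing in for validation data
actual = [0.50, 0.60, 0.80, 1.00]
predicted = [0.52, 0.58, 0.83, 0.97]

metrics = error_metrics(actual, predicted)
print(metrics)
```

In this notebook the predictions would presumably come from a call such as `poly_surr.evaluate_surrogate(data_single_validation[input_labels])`; treat that call as an assumption and confirm it against the IDAES documentation for your installed version.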

In [None]:
# visualize with IDAES surrogate plotting tools for validation data
surrogate_scatter2D(poly_surr, data_single_validation, filename="pysmo_poly_val_scatter2D.pdf")
surrogate_parity(poly_surr, data_single_validation, filename="pysmo_poly_val_parity.pdf")
surrogate_residual(poly_surr, data_single_validation, filename="pysmo_poly_val_residual.pdf")

# Import Saved Surrogate Model into Flowsheet

# **Your Turn!**
It is your turn now! Try building surrogates for other pump speeds using the provided dataset and evaluate their performance. Experiment with different polynomial orders and basis function configurations to see how they affect the surrogate accuracy.

## **Try it yourself #1:** Build a surrogate model for all the speeds.

Build a surrogate model for all the speeds. Follow the same steps as in the demo to train and validate the surrogate model.

**Multispeed Pump Surrogate Input and Output Variables**

We are going to use the normalized values of the input and output variables for surrogate training and validation.

$$P = f(Q, H, N)$$

The input variables for the surrogate model in the dataset are:

<div align="center">

|Input Variable|Symbol|Dataset Column|
|--------------|------|--------------|
|Flow rate|$Q$|`Flowrate_norm_global`|
|Head|$H$|`Head_norm_global`|
|Speed|$N$|`RPM_norm`|

</div>

The output variable is:
- Power, $P\rightarrow$ `Power_norm_global`

In [None]:
# For the multi-speed pump dataset
# Set numpy print options for better readability
np.set_printoptions(precision=6, suppress=True)

# Separate input and output data
input_data = data.iloc[:, :3] # first three columns are inputs
output_data = data.iloc[:, -1] # last column is output

# Define labels, and split training and validation data
# note that PySMO requires that labels are passed as string lists
input_labels = list(input_data.columns)
output_labels = [output_data.name]

n_data = data[input_labels[0]].size
# 80/20 train/validation split using IDAES utility function
data_training, data_validation = split_training_validation(
    data, 0.8, seed=n_data
)  # seed=n_data for reproducibility
data_training.head()
print(f"Training data size: {data_training.shape}")
print(f"Validation data size: {data_validation.shape}")

In [None]:
# capture long output (not required to use surrogate API)
from io import StringIO
import sys

stream = StringIO()
oldstdout = sys.stdout
sys.stdout = stream

# Create PySMO trainer object
trainer = PysmoPolyTrainer(
    input_labels=input_labels,
    output_labels=output_labels,
    training_dataframe=data_training,
)

# Set PySMO options
trainer.config.maximum_polynomial_order = 6
trainer.config.multinomials = True
trainer.config.training_split = 0.8
trainer.config.number_of_crossvalidations = 10

# Train surrogate (calls PySMO through IDAES Python wrapper)
poly_train = trainer.train_surrogate() # Surrogate expression object

# Create callable surrogate object

# Set input bounds for surrogate
xmin = data_training[input_labels].min()
xmax = data_training[input_labels].max()

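# Look up bounds by column label; integer positions on a label-indexed Series are deprecated in recent pandas
input_bounds = {label: (xmin[label], xmax[label]) for label in input_labels}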

# Call PysmoSurrogate to create surrogate object
poly_surr = PysmoSurrogate(poly_train, input_labels, output_labels, input_bounds)

# save model to JSON (note: this overwrites the single-speed surrogate file saved earlier)
model = poly_surr.save_to_file("pysmo_poly_surrogate.json", overwrite=True)

# restore normal stdout now that training output has been captured
sys.stdout = oldstdout

# display first 50 lines and last 50 lines of output
celloutput = stream.getvalue().split("\n")
for line in celloutput[:50]:
    print(line)
print(".")
print(".")
print(".")
for line in celloutput[-50:]:
    print(line)