# Setting up and Facilitating SAMOO Runs
### Walkthrough of SAMOO Utility

Recall that Surrogate-Assisted Multi-Objective Optimisation (SAMOO) follows this general algorithm:
1. LHS sampling of initial training dataset and starting population
2. Complex model evaluation (hereinafter referred to as Outer Iteration)
3. Pareto dominance evaluation
     - if front converged: exit; else: continue to step 4
4. (Re)training the GPR
5. MOO with GPR replacing the complex model (hereinafter referred to as Inner Iterations)
6. Resampling for new training points (also called infills)
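
The control flow of these six steps can be sketched in plain Python. This is only an illustrative skeleton and not the utility's implementation: every function name and argument below is a hypothetical placeholder supplied by the caller, and the loop evaluates the initial samples once before cycling through Steps 4-6 and 2-3.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def samoo_loop(
    initial_points: Sequence[Point],
    run_pbm: Callable[[Sequence[Point]], List[Point]],
    has_converged: Callable[[List[Point]], bool],
    train_surrogate: Callable[[List[Point], List[Point]], object],
    optimise_on_surrogate: Callable[[object], List[Point]],
    select_infills: Callable[[List[Point]], List[Point]],
    max_cycles: int,
) -> Tuple[List[Point], List[Point]]:
    """Return all evaluated decision vectors and their PBM objectives.

    All callables are hypothetical placeholders for the real models.
    """
    all_x = list(initial_points)              # Step 1: initial samples
    all_f = list(run_pbm(all_x))              # Step 2: first outer iteration
    for _ in range(max_cycles):
        if has_converged(all_f):              # Step 3: dominance check
            break
        model = train_surrogate(all_x, all_f)         # Step 4
        candidates = optimise_on_surrogate(model)     # Step 5: inner iterations
        infills = select_infills(candidates)          # Step 6
        all_x.extend(infills)
        all_f.extend(run_pbm(infills))        # Step 2 of the next cycle
    return all_x, all_f
```

Passing the models in as callables keeps the loop independent of how the PBM and the surrogate are actually run.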

This entire process requires two models to be set up: a process-based model (PBM), which is a complex, expensive-to-evaluate model such as a numerical groundwater model, and a surrogate model, which is a fast-to-evaluate model, such as a Gaussian process regression (GPR) model, that emulates the response of the PBM at specified input parameters (i.e., decision variable values). The PBM is run in Step 2 of the algorithm, while the surrogate model is run in Step 5.

``pestpp-mou`` can run both the outer and inner iterations of the SAMOO algorithm: in the outer iteration it simply runs the PBM and records the results, and in the inner iterations it runs the MOO with the GPR surrogate model. To do this, we need to set up pestpp-mou control files for the outer and inner iterations. The previous tutorial gave a glimpse of how the `samoo utility` facilitates this. Here, we walk through setting up SAMOO runs with the utility in more detail.

The `samoo utility` implements the general SAMOO workflow. Your optimisation problem may have special considerations that need to be reflected in the control files. You may need to modify specific parts of the utility's code for your particular needs or handle those special considerations externally. Nevertheless, this tutorial walks you through the functions in the utility that your SAMOO runs need.

We need three sets of control files, along with input and template files required by pestpp-mou. For better housekeeping, we will place each set of files in separate template directories:
1. template_outer -- for the outer iteration
2. template_inner -- for the inner iteration
3. template_repo_update -- for updating the outer iteration repository

These will be generated by the utility with the help of ``pyemu``. Let us first load the files we need.



In [1]:
import os
import sys
import shutil
import pandas as pd
import numpy as np

base_d = "./demo_files/kursawe2d"
temp_d = "samoo_util_demo"
assert os.path.exists(base_d)


At this point, we only need the over-all template directory containing all the standard files needed by pestpp-mou. These files have been prepared for you in the `samoo_util_demo/template` directory (run the cell below). We will still use the 2D Kursawe problem.

In [2]:
if os.path.exists(temp_d):
    shutil.rmtree(temp_d)
shutil.copytree(os.path.join(base_d, 'kursawe2d_samoo_template'), os.path.join(temp_d, 'template'))

print("Files in template directory:")
for f in os.listdir(os.path.join(temp_d, 'template')):
    print(f"  {f}")

Files in template directory:
  dv_tpl.dat
  forward_gprun.py
  forward_pbrun.py
  kur.pst
  output.ins


Feel free to explore the contents of these files.
* `dv_tpl.dat` - this is the template file for the decision variables
* `output.ins` - this contains the instructions for reading the model output. Notice that it includes the outputs related to the standard deviation of the surrogate model predictions. Although these are irrelevant for outer iterations, we keep them in the general template so that they are passed on to the inner iteration when we generate the relevant files later. Observation names used for objectives and constraints should be the same for inner and outer iterations (this also makes it easier to combine the outputs from inner and outer iterations when plotting the results).
* `kur.pst` - this is the PEST control file. You can generate it yourself using your own scripts/workflows. Notice that the observation data section also lists the surrogate-model outputs, consistent with those in `output.ins`.

The following scripts run your models. They implement the pyworker functionality of pyemu.
* `forward_pbrun.py` - this is the script that runs the PBM. Set the standard deviation outputs to zero, and make sure they are written to the output file even though they are not relevant here.
* `forward_gprun.py` - this is the script that runs the GPR.

Notice that both scripts have a `ppw_worker` function. This is necessary to take advantage of pyworker in pyemu and to facilitate parallel runs efficiently, even on an ordinary laptop. `ppw_worker` calls the function that executes your model. Refer to the [pyemu documentation](https://pyemu.readthedocs.io/en/latest/autoapi/pyemu/index.html) for more information about the pyworker functionality.
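
To make the zero-standard-deviation convention for the PBM concrete, here is a minimal sketch of what the model-evaluation part of a PBM script could look like. It assumes the PBM is the standard Kursawe test function and that outputs are written as simple `name value` lines. The observation names (`obj_1`, `obj_1_sd`, ...) and the file layout are hypothetical, so the shipped `forward_pbrun.py` and `output.ins` may well differ.

```python
import math
from pathlib import Path


def kursawe(x):
    """Standard Kursawe test function (both objectives are minimised)."""
    f1 = sum(
        -10.0 * math.exp(-0.2 * math.sqrt(a * a + b * b))
        for a, b in zip(x[:-1], x[1:])
    )
    f2 = sum(abs(a) ** 0.8 + 5.0 * math.sin(a ** 3) for a in x)
    return f1, f2


def write_pbm_outputs(path, objectives):
    """Write each objective followed by a zero standard deviation.

    The names and the 'name value' layout are assumptions for illustration.
    """
    lines = []
    for i, value in enumerate(objectives, start=1):
        lines.append(f"obj_{i} {value:.10e}")
        # The PBM is deterministic, so its standard deviation is written as zero.
        lines.append(f"obj_{i}_sd {0.0:.10e}")
    Path(path).write_text("\n".join(lines) + "\n")
```

A call such as `write_pbm_outputs("output.dat", kursawe([0.5, -1.0]))` would then produce four lines, two of which are the zero-valued standard deviations that the shared instruction file expects to find.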

Let us now load the utility.

In [3]:
sys.path.insert(0,temp_d)
from bin.samoo_utils import SAMOO
samoo = SAMOO()

There's one more thing we need to prepare -- the initial training dataset. Latin hypercube sampling (LHS) is recommended for sampling the decision space. Many free tools are available, but for convenience, an LHS sampler is also provided.

For this tutorial, let's say we train the GPR with an initial training dataset size of 50. Let's load the LHS sampler and generate 50 sets of decision variable values. We also need to define the bounds of the decision variables and sample only within these bounds.

In [4]:
training_size = 50
bounds = np.array([[-5, 5] for i in range(2)])
from bin.LHS_sampler import generate_lhstrainingset
generate_lhstrainingset(os.path.join(temp_d, 'template'), seed=8, n_samples=training_size, n_dimensions=2, bounds=bounds)

real_name,x1,x2
gen=0_training=0,0.99695,-2.214924
gen=0_training=1,-1.333245,-4.294463
gen=0_training=2,2.063938,4.629858
gen=0_training=3,-4.153454,4.492248
gen=0_training=4,4.516793,3.299044
gen=0_training=5,2.760211,-4.475046
gen=0_training=6,0.278727,3.670483
gen=0_training=7,-1.405229,0.632555
gen=0_training=8,-2.888929,2.824867
gen=0_training=9,4.158035,1.961699
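
For readers curious about what an LHS sampler does, the following is a minimal NumPy sketch of the technique. It is not the code behind `generate_lhstrainingset` and will not reproduce the values above; the function name and signature are assumptions made for illustration.

```python
import numpy as np


def latin_hypercube(n_samples, bounds, seed=None):
    """Draw n_samples points with exactly one point per stratum per dimension."""
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)
    n_dim = bounds.shape[0]
    # One random position inside each of the n_samples equal-width strata of [0, 1).
    strata = np.arange(n_samples)[:, None]
    unit = (strata + rng.random((n_samples, n_dim))) / n_samples
    # Shuffle the strata independently in every dimension.
    for j in range(n_dim):
        unit[:, j] = rng.permutation(unit[:, j])
    lower, upper = bounds[:, 0], bounds[:, 1]
    # Scale from the unit hypercube to the decision variable bounds.
    return lower + unit * (upper - lower)


samples = latin_hypercube(50, [[-5, 5], [-5, 5]], seed=8)
print(samples.shape)
```

Because every dimension is split into as many strata as there are samples, the points cover each decision variable's range evenly, which is why LHS is a common choice for an initial surrogate training set.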


Now, let's specify some parameters for the run and pass these values to the utility.


In [5]:
samoo.nmax_inner = 20 #the number of inner iterations
samoo.nmax_outer = 1 #the number of outer iterations; we set this to 1 for now to demonstrate what happens in a single outer iteration
samoo.ppd_beta = 0.6 #discussed in Part 2
samoo.num_workers = 8 #the number of cores to utilise for parallel runs
samoo.max_infill = 50 #the target number of infill points
samoo.pop_size = 50 #the population size for pso in inner iterations
samoo.repo_size = 50 #the size of the repository for storing (probabilistically) non-dominated objective positions in inner and outer iterations
samoo.save_inner_every = 1 #save inner results every n iterations
samoo.pbmodel_command = 'python forward_pbrun.py' #the command to run the PBM
samoo.gpmodel_command = 'python forward_gprun.py' #the command to run the GPR
samoo.exe_file = '../../bin/pestpp-mou' #the path to the pestpp-mou executable relative to the location of the pst files in the template directories
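
Relative executable paths are easy to get wrong, so it can help to check where `exe_file` actually points. The sketch below assumes the template directories sit directly under `samoo_util_demo/`, as in this tutorial; the helper name is hypothetical.

```python
import os


def resolve_from_template(template_dir, relative_exe):
    """Return the normalised location that relative_exe refers to."""
    return os.path.normpath(os.path.join(template_dir, relative_exe))


resolved = resolve_from_template(
    os.path.join("samoo_util_demo", "template_outer"),
    "../../bin/pestpp-mou",
)
print(resolved)
```

Under that assumption the path resolves to a `bin` directory next to `samoo_util_demo/`. If your executable lives elsewhere, adjust `samoo.exe_file` accordingly.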

We are now ready to generate the three sets of template files.

In [6]:
os.chdir('samoo_util_demo')
samoo.prep_templates()


2025-05-24 07:04:12.831215: prepping templates 


2025-05-24 07:04:12.831215: prepping outer template 

noptmax:-1, npar_adj:2, nnz_obs:4

2025-05-24 07:04:12.895863: outer template prepped 


2025-05-24 07:04:12.895863: prepping repo update template 

noptmax:-1, npar_adj:2, nnz_obs:4

2025-05-24 07:04:12.962874: outer repo update template prepped 


2025-05-24 07:04:12.962874: prepping inner template 

noptmax:20, npar_adj:2, nnz_obs:4

2025-05-24 07:04:13.012523: inner template prepped 



Feel free to inspect the contents of the generated template directories. Now that we have generated the template directories, let's review the SAMOO algorithm steps and see how they map to specific samoo utility functions:

1. LHS sampling of initial training dataset and starting population `LHS sampler` (not in samoo utility)
2. Complex model evaluation (hereinafter referred to as Outer Iteration) `samoo.outer_sweep(oitidx)` (oitidx - outer iter index)
3. Pareto dominance evaluation `samoo.update_outer_repo(outer_dirs)` 
     - if front converged: exit; else: continue to step 4
4. (Re)training the GPR `samoo.inner_prep(inner_dirs, outer_dirs)`
5. MOO with GPR replacing the complex model (hereinafter referred to as Inner Iterations) `samoo.inner_opt(iitidx)` (iitidx - inner iter index)
6. Resampling for new training points (also called infills)  `samoo.resample(inner_dirs, outer_dirs)`

The results of every inner and outer iteration of SAMOO are saved in separate directories. This keeps everything tidy, since we will be dealing with different sets of outputs that can become confusing and overwhelming as we move along. Notice that lists of inner and outer directories are passed to several of these functions to keep track of the iteration counts.


Let's start with the first outer iteration.

In [7]:
samoo.outer_sweep(0)


2025-05-24 07:04:24.615249: outer 0 done. Output saved to outer_0 



['outer_0']

This should finish quickly as we are evaluating a very simple function. This will take time for real models.

The `0` passed to the function above is the index of this iteration, i.e., this is outer iteration 0. Observe that a new directory called `outer_0/` has been created, where you will find the outputs of the pestpp-mou run that evaluated the decision variables generated earlier by LHS. Because pestpp-mou has already performed the non-dominance evaluation of the population in this run, we can skip Step 3 for now and proceed with Step 4.

In [8]:
inner_dirs, outer_dirs = samoo.get_dirlist()
samoo.inner_prep(inner_dirs, outer_dirs)


2025-05-24 07:04:24.667728: restart population for inner iteration saved to template_inner 


2025-05-24 07:04:24.667728: surrogate model training data updated in template_inner 



Inspect the contents of `samoo_util_demo/template_inner/`. We have now generated the GPR training files, which consist of the decision variable values that were just evaluated in outer_0 and the corresponding outputs, which are the objective values.
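
To illustrate what a GPR does with such training data, here is a minimal NumPy sketch of Gaussian process regression with a fixed squared-exponential kernel and a zero-mean prior. It is not the surrogate used by `forward_gprun.py`: the kernel, its fixed hyperparameters, and the function names are assumptions, and a practical GPR would also fit the hyperparameters.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)


def gpr_fit_predict(x_train, y_train, x_new, noise=1e-8):
    """Return the GP posterior mean and standard deviation at x_new."""
    k = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    chol = np.linalg.cholesky(k)
    # alpha = K^{-1} y, computed through the Cholesky factor.
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    k_star = rbf_kernel(x_train, x_new)
    mean = k_star.T @ alpha
    v = np.linalg.solve(chol, k_star)
    var = rbf_kernel(x_new, x_new).diagonal() - (v**2).sum(axis=0)
    # Clip tiny negative values caused by round-off before taking the root.
    return mean, np.sqrt(np.clip(var, 0.0, None))
```

The predicted standard deviation is close to zero at training points and grows away from them. This is the uncertainty information carried by the standard deviation outputs listed in `output.ins`.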

Let's proceed with the first inner iteration.


In [9]:
next_inner_index = 1 if len(inner_dirs) == 0 else int(inner_dirs[-1].split("_")[1]) + 1
samoo.inner_opt(next_inner_index) #this executes pestpp-mou that calls on the GPR in the inner iterations


2025-05-24 07:05:01.416467: inner 1 done. Output saved to inner_1 



['inner_1']
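
The index arithmetic in the cell above relies on the last entry of `inner_dirs` being the most recent iteration. Whether `get_dirlist` sorts names numerically is not shown here, so a slightly more defensive helper (hypothetical, not part of the utility) could take the largest numeric suffix instead:

```python
import re


def next_index(dir_names, prefix="inner", first=1):
    """Return one more than the largest '<prefix>_<n>' suffix, or first if none."""
    pattern = re.compile(rf"^{re.escape(prefix)}_(\d+)$")
    indices = []
    for name in dir_names:
        match = pattern.match(name)
        if match:
            indices.append(int(match.group(1)))
    return max(indices) + 1 if indices else first


print(next_index(["inner_1", "inner_10", "inner_2"]))
```

This avoids the pitfall that, in plain string order, `inner_10` sorts before `inner_2`.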

Again, inspect the contents of `samoo_util_demo/template_inner/`. You should find a new directory called `inner_1/` containing the outputs of all 20 inner iterations. For larger runs, saving every iteration can take up a lot of storage. You can save the results of only some inner iterations by changing the value of `samoo.save_inner_every`.

Proceeding with Step 6, let's perform resampling to choose the infill points to be evaluated in the next outer iteration.



In [10]:
inner_dirs, outer_dirs = samoo.get_dirlist()
samoo.resample(inner_dirs, outer_dirs)


2025-05-24 07:05:01.464397: resampling 50 infill points 


2025-05-24 07:05:01.872052: infill ensemble saved to template_outer 





Now, inspect `samoo_util_demo/template_outer/` where you will find a new file called `infill.dv_pop.csv` which contains the infill decision variable values.
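
A quick sanity check on a population file is to confirm that every decision variable lies within its bounds. The sketch below uses three rows from the LHS table shown earlier as stand-in data and assumes that `infill.dv_pop.csv` has the same layout (a `real_name` index column followed by one column per decision variable). To check the real file, replace the `io.StringIO` object with its path.

```python
import io

import pandas as pd

# Stand-in data copied from the LHS output earlier in this tutorial.
sample_csv = """real_name,x1,x2
gen=0_training=0,0.99695,-2.214924
gen=0_training=1,-1.333245,-4.294463
gen=0_training=2,2.063938,4.629858
"""


def out_of_bounds(population, lower, upper):
    """Return the rows with any decision variable outside [lower, upper]."""
    outside = (population < lower) | (population > upper)
    return population[outside.any(axis=1)]


population = pd.read_csv(io.StringIO(sample_csv), index_col=0)
violations = out_of_bounds(population, lower=-5.0, upper=5.0)
print(f"{len(population)} members, {len(violations)} outside the bounds")
```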

We now repeat the process from Step 2. This time we do not skip `samoo.update_outer_repo`, since we need to perform the non-dominance evaluation across the outputs of outer_0 and outer_1.

In [11]:
# Step 2
outer_dirs = samoo.outer_sweep(1)
# Step 3
samoo.update_outer_repo(outer_dirs) 


2025-05-24 07:05:13.131881: outer 1 done. Output saved to outer_1 


2025-05-24 07:05:14.312013: outer repo update done 



Now, we have `outer_1` and inside it is another directory (`outer_repo`) containing the repository of Pareto optimal positions across all outer iterations.
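
The idea behind such a repository can be illustrated with a small non-dominance filter. This is a generic sketch that assumes all objectives are minimised. In the actual workflow pestpp-mou performs this evaluation, so the function below is for illustration only.

```python
import numpy as np


def non_dominated_mask(objectives):
    """Boolean mask of rows not dominated by any other row (minimisation)."""
    f = np.asarray(objectives, dtype=float)
    # Entry [i, j] is True when row j is no worse than row i in every objective.
    no_worse = (f[None, :, :] <= f[:, None, :]).all(axis=-1)
    # Entry [i, j] is True when row j is strictly better than row i somewhere.
    better = (f[None, :, :] < f[:, None, :]).any(axis=-1)
    # Row i is dominated if some row j satisfies both conditions.
    return ~(no_worse & better).any(axis=1)


points = [[1.0, 4.0], [2.0, 2.0], [4.0, 1.0], [3.0, 3.0]]
print(non_dominated_mask(points))
```

Here the point `[3.0, 3.0]` is dominated by `[2.0, 2.0]`, so only the other three would be kept in the repository.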

Let's continue and let's try doing one more outer iteration in one go. 

In [12]:
# Step 4
inner_dirs, outer_dirs = samoo.get_dirlist()
samoo.inner_prep(inner_dirs, outer_dirs)
# Step 5
next_inner_index = 1 if len(inner_dirs) == 0 else int(inner_dirs[-1].split("_")[1]) + 1
samoo.inner_opt(next_inner_index)
# Step 6
inner_dirs, outer_dirs = samoo.get_dirlist()
samoo.resample(inner_dirs, outer_dirs)
# Step 2
outer_dirs = samoo.outer_sweep(1)
# Step 3
samoo.update_outer_repo(outer_dirs) 


2025-05-24 07:05:14.377203: restart population for inner iteration saved to template_inner 


2025-05-24 07:05:14.412553: surrogate model training data updated in template_inner 


2025-05-24 07:05:52.400299: inner 2 done. Output saved to inner_2 


2025-05-24 07:05:52.402307: resampling 50 infill points 






2025-05-24 07:05:52.760596: infill ensemble saved to template_outer 


2025-05-24 07:06:04.315929: outer 1 done. Output saved to outer_1 


2025-05-24 07:06:05.476940: outer repo update done 



Keep observing how the directories and files change in `samoo_util_demo/`. The steps above can be cumbersome, but don't fret: we can run all of them with one command. We can also pick up from the last outer iteration, which is outer_1, by telling samoo that this is a restart run and then running Steps 4-6 followed by Steps 2-3 with the utility's `run()` function. Take note that, when restarting, the previous cycle (Steps 4-6 followed by Steps 2-3) must have been completed. Let's perform 2 more cycles of these steps.

In [13]:
samoo.restart = True
samoo.nmax_outer = 2
samoo.run()

Saving outputs to: .

2025-05-24 07:06:05.506210: starting SAMOO run 


2025-05-24 07:06:05.533760: restart population for inner iteration saved to template_inner 


2025-05-24 07:06:05.555676: surrogate model training data updated in template_inner 


2025-05-24 07:06:43.445614: inner 3 done. Output saved to inner_3 


2025-05-24 07:06:43.458017: resampling 50 infill points 






2025-05-24 07:06:43.778289: infill ensemble saved to template_outer 


2025-05-24 07:06:55.062535: outer 2 done. Output saved to outer_2 


2025-05-24 07:06:56.211664: outer repo update done 


2025-05-24 07:06:56.241564: restart population for inner iteration saved to template_inner 


2025-05-24 07:06:56.257294: surrogate model training data updated in template_inner 


2025-05-24 07:07:33.480326: inner 4 done. Output saved to inner_4 


2025-05-24 07:07:33.480326: resampling 50 infill points 






2025-05-24 07:07:33.826420: infill ensemble saved to template_outer 


2025-05-24 07:07:44.962320: outer 3 done. Output saved to outer_3 


2025-05-24 07:07:46.140410: outer repo update done 


2025-05-24 07:07:46.140410: outer 2 done 

total run time: 0:01:40.634200 



Congratulations, you just completed 4 outer iterations (including outer_0)! Remember that you can find the current Pareto optimal solutions in the `outer_repo` directory inside the directory of the last outer iteration.

Let's try to run the whole process in one go. Let's do 5 outer iterations with 20 inner iterations each and a population size of 100.

In [14]:
samoo.nmax_inner = 20 #the number of inner iterations
samoo.nmax_outer = 5 #the number of outer iterations
samoo.ppd_beta = 0.6 #discussed in Part 2
samoo.num_workers = 8 #the number of cores to utilise for parallel runs
samoo.max_infill = 100 #the target number of infill points
samoo.pop_size = 100 #the population size for pso in inner iterations
samoo.repo_size = 100 #the size of the repository for storing (probabilistically) non-dominated objective positions in inner and outer iterations
samoo.save_inner_every = 5 #save inner results every n iterations
samoo.pbmodel_command = 'python forward_pbrun.py' #the command to run the PBM
samoo.gpmodel_command = 'python forward_gprun.py' #the command to run the GPR
samoo.exe_file = '../../bin/pestpp-mou' #the path to the pestpp-mou executable relative to the location of the pst files in the template directories
samoo.restart = False
training_size = 100

bounds = np.array([[-5, 5] for i in range(2)])
generate_lhstrainingset('template', seed=8, n_samples=training_size, n_dimensions=2, bounds=bounds)

samoo.run()

Saving outputs to: .

2025-05-24 07:07:46.198133: starting SAMOO run 


2025-05-24 07:07:46.198133: prepping templates 


2025-05-24 07:07:46.198133: prepping outer template 

noptmax:-1, npar_adj:2, nnz_obs:4

2025-05-24 07:07:46.315523: outer template prepped 


2025-05-24 07:07:46.315523: prepping repo update template 

noptmax:-1, npar_adj:2, nnz_obs:4

2025-05-24 07:07:46.389413: outer repo update template prepped 


2025-05-24 07:07:46.389413: prepping inner template 

noptmax:20, npar_adj:2, nnz_obs:4

2025-05-24 07:07:46.458908: inner template prepped 


2025-05-24 07:07:57.909289: outer 0 done. Output saved to outer_0 


2025-05-24 07:07:57.921417: restart population for inner iteration saved to template_inner 


2025-05-24 07:07:57.921417: surrogate model training data updated in template_inner 


2025-05-24 07:09:02.914887: inner 1 done. Output saved to inner_1 


2025-05-24 07:09:02.918425: resampling 100 infill points 






2025-05-24 07:09:03.273146: infill ensemble saved to template_outer 


2025-05-24 07:09:14.742711: outer 1 done. Output saved to outer_1 


2025-05-24 07:09:15.958335: outer repo update done 


2025-05-24 07:09:15.988203: restart population for inner iteration saved to template_inner 


2025-05-24 07:09:16.004754: surrogate model training data updated in template_inner 


2025-05-24 07:10:18.498314: inner 2 done. Output saved to inner_2 


2025-05-24 07:10:18.502611: resampling 100 infill points 






2025-05-24 07:10:18.886158: infill ensemble saved to template_outer 


2025-05-24 07:10:30.339175: outer 2 done. Output saved to outer_2 


2025-05-24 07:10:31.492718: outer repo update done 


2025-05-24 07:10:31.510298: restart population for inner iteration saved to template_inner 


2025-05-24 07:10:31.534341: surrogate model training data updated in template_inner 


2025-05-24 07:11:34.028394: inner 3 done. Output saved to inner_3 


2025-05-24 07:11:34.036880: resampling 100 infill points 






2025-05-24 07:11:34.400499: infill ensemble saved to template_outer 


2025-05-24 07:11:45.971792: outer 3 done. Output saved to outer_3 


2025-05-24 07:11:47.133384: outer repo update done 


2025-05-24 07:11:47.154919: restart population for inner iteration saved to template_inner 


2025-05-24 07:11:47.184910: surrogate model training data updated in template_inner 


2025-05-24 07:12:52.046919: inner 4 done. Output saved to inner_4 


2025-05-24 07:12:52.052740: resampling 100 infill points 






2025-05-24 07:12:52.417913: infill ensemble saved to template_outer 


2025-05-24 07:13:03.601511: outer 4 done. Output saved to outer_4 


2025-05-24 07:13:04.749687: outer repo update done 


2025-05-24 07:13:04.780716: restart population for inner iteration saved to template_inner 


2025-05-24 07:13:04.801518: surrogate model training data updated in template_inner 


2025-05-24 07:14:10.490112: inner 5 done. Output saved to inner_5 


2025-05-24 07:14:10.498045: resampling 100 infill points 






2025-05-24 07:14:10.849213: infill ensemble saved to template_outer 


2025-05-24 07:14:22.390067: outer 5 done. Output saved to outer_5 


2025-05-24 07:14:23.571545: outer repo update done 


2025-05-24 07:14:23.571545: outer 5 done 

total run time: 0:06:37.373412 

