# Setting up a PEST interface from MODFLOW6 using the `PstFrom` class with `PyPestUtils` for advanced pilot point parameterization

In [None]:
import os
import shutil
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pyemu
import flopy

In [None]:
import sys
sys.path.append(os.path.join("..","..","pypestutils"))

In [None]:
import pypestutils as ppu

An existing MODFLOW6 model is in the directory `freyberg_mf6`.  Let's check it out:

In [None]:
org_model_ws = os.path.join('freyberg_mf6')
os.listdir(org_model_ws)

You can see that all the input array and list data for this model have been written "externally" - this is key to using the `PstFrom` class. 

Let's quickly viz the model top just to remind us of what we are dealing with:

In [None]:
id_arr = np.loadtxt(os.path.join(org_model_ws,"freyberg6.dis_idomain_layer3.txt"))
top_arr = np.loadtxt(os.path.join(org_model_ws,"freyberg6.dis_top.txt"))
top_arr[id_arr==0] = np.nan
plt.imshow(top_arr)

Now let's copy those files to a temporary location just to make sure we don't goof up those original files:

In [None]:
tmp_model_ws = "temp_pst_from_ppu"
if os.path.exists(tmp_model_ws):
    shutil.rmtree(tmp_model_ws)
shutil.copytree(org_model_ws,tmp_model_ws)
os.listdir(tmp_model_ws)

Now we need just a tiny bit of info about the spatial discretization of the model - this is needed to work out separation distances between parameters so we can build a geostatistical prior covariance matrix later.
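To make "separation distances" concrete, here is a minimal numpy-only sketch of how cell-centre coordinates and pairwise distances could be worked out from row and column spacings. The uniform 250 m spacing and the tiny 3 x 4 grid are made-up values for illustration, not the Freyberg grid, and this is not how `pyemu` does it internally.

```python
import numpy as np

# hypothetical uniform spacing (metres); in this notebook the real values
# would come from m.dis.delr.array and m.dis.delc.array
delr = np.full(4, 250.0)  # column widths, i.e. spacing along a row (x direction)
delc = np.full(3, 250.0)  # row heights, i.e. spacing along a column (y direction)

# cell-centre coordinates, with row 0 at the top (north) of the grid
x = np.cumsum(delr) - 0.5 * delr
y = delc.sum() - (np.cumsum(delc) - 0.5 * delc)
xx, yy = np.meshgrid(x, y)  # both shaped (nrow, ncol)
xy = np.column_stack([xx.ravel(), yy.ravel()])

# pairwise separation distances between all cell centres
sep = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
print(sep.shape)
```

A variogram turns each of these distances into a covariance, which is where the prior covariance matrix comes from.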

Here we will load the flopy sim and model instance just to help us define some quantities later - flopy is not required to use the `PstFrom` class.

In [None]:
sim = flopy.mf6.MFSimulation.load(sim_ws=tmp_model_ws)
m = sim.get_model("freyberg6")


Here we use the simple `SpatialReference` pyemu implements to help us spatially locate parameters

In [None]:
sr = pyemu.helpers.SpatialReference.from_namfile(
        os.path.join(tmp_model_ws, "freyberg6.nam"),
        delr=m.dis.delr.array, delc=m.dis.delc.array)
sr

Now we can instantiate a `PstFrom` class instance

In [None]:
template_ws = "freyberg6_template"
pf = pyemu.utils.PstFrom(original_d=tmp_model_ws, new_d=template_ws,
                 remove_existing=True,
                 longnames=True, spatial_reference=sr,
                 zero_based=False,start_datetime="1-1-2018")


## Observations

So now we have a `PstFrom` instance, but it's just an empty container at this point, so we need to add some PEST interface "observations" and "parameters".  Let's start with observations using MODFLOW6 heads.  These are stored in `heads.csv`:

In [None]:
df = pd.read_csv(os.path.join(tmp_model_ws,"heads.csv"),index_col=0)
df

The main entry point for adding observations is (surprise) `PstFrom.add_observations()`.  This method works on the list-type observation output file.  We need to tell it which column is the index column (can be a string if there is a header or an int if there is no header) and then which columns contain quantities we want to monitor (i.e. "observe") in the control file - in this case we want to monitor all columns except the index column:

In [None]:
hds_df = pf.add_observations("heads.csv",insfile="heads.csv.ins",index_cols="time",
                    use_cols=list(df.columns.values),prefix="hds",)
hds_df

We can see that it returned a dataframe with lots of useful info: the observation names that were formed (`obsnme`), the values that were read from `heads.csv` (`obsval`) and also some generic weights and group names.  At this point, no control file has been created; we have simply prepared to add these observations to the control file later.  

In [None]:
[f for f in os.listdir(template_ws) if f.endswith(".ins")]

Nice!  We also have a PEST-style instruction file for those obs.
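If you have never looked inside an instruction file, here is a toy sketch of the idea for a CSV with a `time` index column. In PEST instruction syntax, `pif ~` declares `~` as the marker delimiter, `l1` advances one line, `~,~` searches forward for the next comma, and `!name!` reads the following value into the observation called `name`. The helper `csv_to_ins_lines`, the column names and the observation naming scheme below are all made up for illustration; the file `PstFrom` actually writes will differ in its details.

```python
def csv_to_ins_lines(columns, index_values, prefix="hds"):
    """Build toy PEST instruction lines for a CSV with one index column."""
    # marker delimiter declaration, then skip the CSV header row
    lines = ["pif ~", "l1"]
    for t in index_values:
        # for each data row: jump past a comma, then read one value per column
        reads = " ".join(
            "~,~ !{0}_{1}_time:{2}!".format(prefix, c, t) for c in columns
        )
        lines.append("l1 " + reads)
    return lines


# hypothetical column names and output times
ins_lines = csv_to_ins_lines(["trgw1", "trgw2"], [1.0, 32.0])
for line in ins_lines:
    print(line)
```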

Now let's do the same for SFR observations:

In [None]:
df = pd.read_csv(os.path.join(tmp_model_ws, "sfr.csv"), index_col=0)
sfr_df = pf.add_observations("sfr.csv", insfile="sfr.csv.ins", index_cols="time", use_cols=list(df.columns.values))
sfr_df

Sweet as!  Now that we have some observations, let's add parameters!

## Pilot points and `PyPestUtils`

This notebook is mostly meant to demonstrate some advanced pilot point parameterization that is possible with `PyPestUtils`, so we will only focus on HK and VK pilot point parameters.  This is just to keep the example short.  In practice, please please please parameterize boundary conditions too!

In [None]:
v = pyemu.geostats.ExpVario(contribution=1.0,a=5000,bearing=0,anisotropy=5)
pp_gs = pyemu.geostats.GeoStruct(variograms=v, transform='log')

In [None]:
pp_gs.plot()
print("spatial variogram")
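For intuition about what `a`, `bearing` and `anisotropy` do, here is a small standalone sketch of an exponential variogram with geometric anisotropy. The conventions are assumptions made for this sketch: `bearing` is the direction of the major axis in degrees clockwise from north, `a` applies along that axis, distances across it are stretched by `anisotropy`, and the variogram has the form `contribution * (1 - exp(-h / a))`. Check the `pyemu.geostats` source before relying on any of these details.

```python
import math


def exp_variogram(dx, dy, contribution=1.0, a=5000.0, bearing=0.0, anisotropy=5.0):
    """Semivariance for a separation vector (dx to the east, dy to the north)."""
    b = math.radians(bearing)
    # components of the separation along and across the major axis
    along = dx * math.sin(b) + dy * math.cos(b)
    across = dx * math.cos(b) - dy * math.sin(b)
    # stretching the across-axis component makes correlation decay faster that way
    h_eff = math.hypot(along, anisotropy * across)
    return contribution * (1.0 - math.exp(-h_eff / a))


gamma_north = exp_variogram(0.0, 1000.0)  # 1000 m along the major axis
gamma_east = exp_variogram(1000.0, 0.0)   # 1000 m across the major axis
print(gamma_north, gamma_east)
```

With a bearing of 0 and an anisotropy of 5, two points 1000 m apart north-south are much more alike than two points 1000 m apart east-west, so the resulting fields are stretched north-south.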

Now let's get the idomain array to use as a zone array - this keeps us from setting up parameters in inactive model cells:

In [None]:
ib = m.dis.idomain[0].array

Find HK files for the upper and lower model layers (assuming model layer 2 is a semi-confining unit)

In [None]:
hk_arr_files = [f for f in os.listdir(tmp_model_ws) if "npf_k_" in f and f.endswith(".txt") and "layer2" not in f]
hk_arr_files

In [None]:
arr_file = "freyberg6.npf_k_layer1.txt"
tag = arr_file.split('.')[1].replace("_","-")
pf.add_parameters(filenames=arr_file,par_type="pilotpoints",
                   par_name_base=tag,pargp=tag,zone_array=ib,
                   upper_bound=10.,lower_bound=0.1,ult_ubound=100,ult_lbound=0.01,
                   pp_options={"pp_space":3},geostruct=pp_gs)
#let's also add the resulting hk array that modflow sees as observations
# so we can make easy plots later...
pf.add_observations(arr_file,prefix=tag,
                    obsgp=tag,zone_array=ib)

If you are familiar with how `PstFrom` has worked historically, we used to hand off the process of solving for the factor file (which requires solving the kriging equations for each active node) to a pure python implementation (well, with pandas and numpy).  This was ok for toy models, but hella slow for big ugly models.  If you look at the log entries above, you should see that instead, `PstFrom` successfully handed off the solve to `PyPestUtils`, which is much faster for big models.  sweet ez! 
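To show what "solving the kriging equations for a node" means, here is a minimal ordinary kriging sketch in numpy for a single node and three pilot points. It is an isotropic toy under stated assumptions (covariance `exp(-h / a)`, made-up coordinates and values, a hypothetical helper named `kriging_factors`), not the solver used by `pyemu` or `PyPestUtils`.

```python
import numpy as np


def exp_cov(h, contribution=1.0, a=5000.0):
    """Isotropic exponential covariance (assumed form for this sketch)."""
    return contribution * np.exp(-h / a)


def kriging_factors(pp_xy, node_xy, a=5000.0):
    """Ordinary kriging weights ("factors") of each pilot point for one node."""
    pp_xy = np.asarray(pp_xy, dtype=float)
    n = pp_xy.shape[0]
    d_pp = np.linalg.norm(pp_xy[:, None, :] - pp_xy[None, :, :], axis=2)
    d_node = np.linalg.norm(pp_xy - np.asarray(node_xy, dtype=float), axis=1)
    # augmented system: the extra row/column forces the weights to sum to one
    lhs = np.ones((n + 1, n + 1))
    lhs[:n, :n] = exp_cov(d_pp, a=a)
    lhs[n, n] = 0.0
    rhs = np.ones(n + 1)
    rhs[:n] = exp_cov(d_node, a=a)
    return np.linalg.solve(lhs, rhs)[:n]


# hypothetical pilot point locations (metres) and multiplier values
pp_xy = [(0.0, 0.0), (3000.0, 0.0), (0.0, 3000.0)]
pp_values = np.array([1.0, 10.0, 0.1])
factors = kriging_factors(pp_xy, node_xy=(1000.0, 1000.0))
# interpolate in log10 space, in the spirit of transform='log'
node_value = 10.0 ** np.dot(factors, np.log10(pp_values))
print(factors, node_value)
```

One small linear solve like this is needed for every active node, which is why a compiled solver pays off on big grids.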

In [None]:
tpl_files = [f for f in os.listdir(template_ws) if f.endswith(".tpl")]
tpl_files

In [None]:
with open(os.path.join(template_ws,tpl_files[0]),'r') as f:
    for _ in range(2):
        print(f.readline().strip())
        


So those might look like pretty redic parameter names, but they contain heaps of metadata to help you post process things later...

So those are your standard pilot points for HK in layer 1 - same as it ever was...

## Geostatistical hyper-parameters

For the HK layer 1 pilot points, we used a standard geostatistical structure - the ever popular exponential variogram.  But what if the properties that define that variogram were themselves uncertain?  Like what if the anisotropy ellipse varied in space across the model domain?  What does this imply?  Well, technically speaking, those variogram properties can be conceptualized as "hyper parameters" in that they influence the underlying parameters (in this case, the pilot points) in a hierarchical sense.  That is, if the bearing of the anisotropy of the variogram changes, then the resulting interpolation from the pilot points to the grid changes.  But where it gets really deep is that we need to define correlation structures for these spatially varying hyper pars, so they themselves have plausible spatial patterns...Seen that movie Inception?!

In `PyPestUtils`, we can supply the pilot-point-to-grid interpolation process with arrays of hyper-parameter values, one array for each variogram property.  The result of this hyper parameter mess is referred to as a non-stationary spatial parameterization.  buckle up...
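Here is a tiny sketch of what "one array for each variogram property" buys you. Each grid node gets its own bearing and anisotropy, so the same pilot point, at the same physical offset, ends up with a different covariance depending on which node is asking. The 2 x 2 arrays, the offset and the conventions (bearing in degrees clockwise from north, covariance `exp(-h / a)`) are assumptions for illustration only.

```python
import numpy as np


def aniso_distance(dx, dy, bearing_deg, anisotropy):
    """Effective distance for a separation (dx east, dy north); works on arrays."""
    b = np.radians(bearing_deg)
    along = dx * np.sin(b) + dy * np.cos(b)
    across = dx * np.cos(b) - dy * np.sin(b)
    return np.sqrt(along ** 2 + (anisotropy * across) ** 2)


# hypothetical hyper-parameter arrays, one value per grid node (2 x 2 grid)
bearing_arr = np.array([[0.0, 45.0], [90.0, 0.0]])
aniso_arr = np.full_like(bearing_arr, 5.0)
a = 5000.0

# pretend a pilot point sits 1000 m due east of every node
dx, dy = 1000.0, 0.0
h_eff = aniso_distance(dx, dy, bearing_arr, aniso_arr)
cov = np.exp(-h_eff / a)
print(cov)
```

The node whose local bearing points east (90 degrees) sees that pilot point as strongly correlated, while the nodes with a bearing of 0 see it as weakly correlated.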

First let's define some additional geostatistical structures:

In [None]:
value_v = pyemu.geostats.ExpVario(contribution=1, a=5000, anisotropy=5, bearing=0.0)
value_gs = pyemu.geostats.GeoStruct(variograms=value_v)
bearing_v = pyemu.geostats.ExpVario(contribution=1,a=10000,anisotropy=5,bearing=0.0)
bearing_gs = pyemu.geostats.GeoStruct(variograms=bearing_v)

In [None]:
arr_file = "freyberg6.npf_k_layer3.txt"
tag = arr_file.split('.')[1].replace("_","-")
pf.add_parameters(filenames=arr_file,par_type="pilotpoints",
                   par_name_base=tag,pargp=tag,zone_array=ib,
                   upper_bound=10.,lower_bound=0.1,ult_ubound=100,ult_lbound=0.01,
                 pp_options={"pp_space":3,"prep_hyperpars":True},geostruct=value_gs,
                 apply_order=2)
pf.add_observations(arr_file,prefix=tag,
                    obsgp=tag,zone_array=ib)

In [None]:
hyperpar_files = [f for f in os.listdir(pf.new_d) if tag in f]
hyperpar_files

When we supplied "prep_hyperpars" as `True` above, that triggered `PstFrom` to do something different.  Instead of solving for the pilot point kriging factors as before, we now have array-based files for the geostatistical hyper parameters, as well as some additional quantities we need to "apply" these hyper parameters at runtime.  This is a key difference:  when the pilot point variogram is changing for each model run, we need to re-solve for the kriging factors for each model run...

We snuck in something else too - see that `apply_order` argument?  That is how we can control the order in which files are processed by the run-time multiplier parameter function.  Since we are going to parameterize the hyper parameters and there is an implicit order between these hyper parameters and the underlying pilot points, we need to make sure the hyper parameters are processed first.  

Let's set up some hyper parameters for estimation.  We will use a constant for the anisotropy ratio, but use pilot points for the bearing:

In [None]:
afile = 'npf-k-layer3.aniso.dat'
tag = afile.split('.')[0].replace("_","-")+"-aniso"
pf.add_parameters(afile,par_type="constant",par_name_base=tag,
                  pargp=tag,lower_bound=-1.0,upper_bound=1.0,
                  apply_order=1,
                  par_style="a",transform="none",initial_value=0.0)
pf.add_observations(afile, prefix=tag, obsgp=tag)
bfile = 'npf-k-layer3.bearing.dat'
tag = bfile.split('.')[0].replace("_","-")+"-bearing"
pf.add_parameters(bfile, par_type="pilotpoints", par_name_base=tag,
                  pargp=tag, pp_space=6,lower_bound=-45,upper_bound=45,
                  par_style="a",transform="none",
                  pp_options={"try_use_ppu":True},
                  apply_order=1,geostruct=bearing_gs)
pf.add_observations(bfile, prefix=tag, obsgp=tag)                

Notice that the `apply_order` for these hyper pars is 1, so that any processing for these quantities happens before the actual underlying pilot points are processed.
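Conceptually, the ordering boils down to a sort on that integer. The sketch below uses a made-up `Step` record and a plain `sorted` call to show the idea; it is not the actual run-time machinery in `pyemu`.

```python
from collections import namedtuple

# hypothetical record of "a file to process and when to process it"
Step = namedtuple("Step", ["filename", "apply_order"])

steps = [
    Step("freyberg6.npf_k_layer3.txt", 2),  # pilot points that depend on the hyper pars
    Step("npf-k-layer3.aniso.dat", 1),      # anisotropy hyper parameter array
    Step("npf-k-layer3.bearing.dat", 1),    # bearing hyper parameter array
]

# lower apply_order values are processed first; ties keep their original order
ordered = sorted(steps, key=lambda s: s.apply_order)
processed = [s.filename for s in ordered]
print(processed)
```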

## "These go to 11" - amp'ing things up with categorization

Sometimes, the world we want to simulate might be better represented as categorical instead of continuous.  That is, rather than smoothly varying property fields, we want fields that are either a high value or a low value (please don't ask for more than 2 categories!).  In this case, depending on how you plan to assimilate data (that is, what inversion algorithm you are planning to use), we can accommodate this preference for categorical fields.  

This is pretty advanced and also dense.  There is another example notebook that describes the categorization process in detail.  Here we will just blast thru it....

Let's set up a non-stationary categorical parameterization for the VK of layer 2 (the semi-confining unit).  We can conceptualize this as a semi-confining unit that has "windows" in it that connect the two aquifers.  Where there is not a window, the VK will be very low; where there is a window, the VK will be much higher. Let's also assume the windows in the confining unit were created when a stream eroded thru it, so the shape of these windows will be higher-order: not derived from a standard geostatistical 2-point process, but rather from connected features.
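The core idea of the categorization can be sketched in a few lines of numpy: take a continuous "thresholding" array, pick a cutoff so that a target proportion of cells falls below it, and then paint each side of the cutoff with a single fill value. The quantile-based cutoff, the random array and the numbers below are stand-ins chosen for illustration; `pyemu.helpers.apply_threshold_pars` has its own implementation, which may work differently.

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for the continuous thresholding array (pilot point driven in practice)
thresh_arr = rng.normal(size=(40, 20))

proportion_cat1 = 0.75     # hypothetical target fraction of category 1 (clay)
fill = {1: 0.01, 2: 0.1}   # hypothetical VK fill values for clay and sand

# cutoff chosen so that roughly the target proportion of cells falls at or below it
threshold = np.quantile(thresh_arr, proportion_cat1)
cat_arr = np.where(thresh_arr <= threshold, 1, 2)
vk_arr = np.where(cat_arr == 1, fill[1], fill[2])

frac_cat1 = (cat_arr == 1).mean()
print(frac_cat1)
```

Because the thresholding array is spatially correlated in practice, the cells above the cutoff tend to form connected patches - the "windows".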

In what follows, we set up this complex parameterization.  We also add lots of aux observations to let us plot and viz the steps in this parameterization process.

In [None]:
arr_file = "freyberg6.npf_k33_layer2.txt"
print(arr_file)
k = int(arr_file.split(".")[1][-1]) - 1
pth_arr_file = os.path.join(pf.new_d,arr_file)
arr = np.loadtxt(pth_arr_file)
cat_dict = {1:[0.4,arr.mean()],2:[0.6,arr.mean()]}

#this is where we initialize the categorization process - it will operate on the 
# layer 2 VK array just before MODFLOW runs
thresharr,threshcsv = pyemu.helpers.setup_threshold_pars(pth_arr_file,cat_dict=cat_dict,
                                                         testing_workspace=pf.new_d,inact_arr=ib)

# the corresponding apply function
pf.pre_py_cmds.append("pyemu.helpers.apply_threshold_pars('{0}')".format(os.path.split(threshcsv)[1]))
prefix = arr_file.split('.')[1].replace("_","-")

pth_arr_file = os.path.join(pf.new_d,arr_file)
arr = np.loadtxt(pth_arr_file)

tag = arr_file.split('.')[1].replace("_","-") + "_pp"
prefix = arr_file.split('.')[1].replace("_","-")
#setup pilot points with hyper pars for the thresholding array (the array that will drive the 
# categorization process).  Notice the apply_order arg being used 
pf.add_parameters(filenames=os.path.split(thresharr)[1],par_type="pilotpoints",transform="none",
                  par_name_base=tag+"-threshpp_k:{0}".format(k),
                  pargp=tag + "-threshpp_k:{0}".format(k),
                  lower_bound=0.0,upper_bound=2.0,par_style="m",
                  pp_options={"try_use_ppu":False,"prep_hyperpars":True,"pp_space":5},
                  apply_order=2,geostruct=value_gs
                  )

tag = arr_file.split('.')[1].replace("_","-")
# a constant parameter for the anisotropy of the thresholding array
# Notice the apply_order arg being used
tfiles = [f for f in os.listdir(pf.new_d) if tag in f]
afile = [f for f in tfiles if "aniso" in f][0]
pf.add_parameters(afile,par_type="constant",par_name_base=tag+"-aniso",
                  pargp=tag+"-aniso",lower_bound=-3.0,upper_bound=3.0,
                  apply_order=1,
                  par_style="a",transform="none",initial_value=0.0)
# obs for the anisotropy field
pf.add_observations(afile, prefix=tag+"-aniso", obsgp=tag+"-aniso")

# pilot points for the bearing array of the geostructure of the thresholding array
# Notice the apply_order arg being used
bfile = [f for f in tfiles if "bearing" in f][0]
pf.add_parameters(bfile, par_type="pilotpoints", par_name_base=tag + "-bearing",
                  pargp=tag + "-bearing", pp_space=6,lower_bound=-45,upper_bound=45,
                  par_style="a",transform="none",
                  pp_options={"try_use_ppu":True},
                  apply_order=1,geostruct=bearing_gs)
# obs for the bearing array
pf.add_observations(bfile, prefix=tag + "-bearing", obsgp=tag + "-bearing")                

# list style parameters for the quantities used in the categorization process
# We will manipulate these initial values and bounds later
pf.add_parameters(filenames=os.path.split(threshcsv)[1], par_type="grid",index_cols=["threshcat"],
                  use_cols=["threshproportion","threshfill"],
                  par_name_base=[prefix+"threshproportion_k:{0}".format(k),prefix+"threshfill_k:{0}".format(k)],
                  pargp=[prefix+"threshproportion_k:{0}".format(k),prefix+"threshfill_k:{0}".format(k)],
                  lower_bound=[0.1,0.1],upper_bound=[10.0,10.0],transform="none",par_style='d')

# obs of the resulting Vk array that MODFLOW uses
pf.add_observations(arr_file,prefix=tag,
                    obsgp=tag,zone_array=ib)

# observations of the categorized array
pf.add_observations(arr_file+".threshcat.dat", prefix="tcatarr-" + prefix+"_k:{0}".format(k),
                    obsgp="tcatarr-" + prefix+"_k:{0}".format(k),zone_array=ib)

# observations of the thresholding array
pf.add_observations(arr_file + ".thresharr.dat",
                    prefix=tag+'-thresharr',
                    obsgp=tag+'-thresharr', zone_array=ib)

# observations of the results of the thresholding process
df = pd.read_csv(threshcsv.replace(".csv","_results.csv"),index_col=0)
pf.add_observations(os.path.split(threshcsv)[1].replace(".csv","_results.csv"),index_cols="threshcat",use_cols=df.columns.tolist(),prefix=prefix+"-results_k:{0}".format(k),
                    obsgp=prefix+"-results_k:{0}".format(k),ofile_sep=",")


### build the control file, pest interface files, and forward run script
At this point, we have some parameters and some observations, so we can create a control file:

In [None]:
pf.mod_sys_cmds.append("mf6")
pf.pre_py_cmds.insert(0,"import sys")
pf.pre_py_cmds.insert(1,"sys.path.append(os.path.join('..','..','..','pypestutils'))")
pst = pf.build_pst()

In [None]:
with open(os.path.join(template_ws,"forward_run.py")) as f:
    for line in f:
        print(line.rstrip())

## Setting initial parameter bounds and values

Here is some gory detail regarding defining the hyper parameters for both layer 3 HK and layer 2 VK...

In [None]:
# set the initial values and bounds for the anisotropy and bearing hyper parameters
par = pst.parameter_data

apar = par.loc[par.pname.str.contains("aniso"),:]
bpar = par.loc[par.pname.str.contains("bearing"), :]
# select names with a boolean mask - taking `.index` of the mask itself
# would return every row, regardless of layer
apar3 = apar.loc[apar.parnme.str.contains("layer3"),"parnme"]
apar2 = apar.loc[apar.parnme.str.contains("layer2"),"parnme"]
bpar3 = bpar.loc[bpar.parnme.str.contains("layer3"),"parnme"]
bpar2 = bpar.loc[bpar.parnme.str.contains("layer2"),"parnme"]

par.loc[apar3,"parval1"] = 3
par.loc[apar3,"parlbnd"] = 1
par.loc[apar3,"parubnd"] = 5

par.loc[apar2,"parval1"] = 2
par.loc[apar2,"parlbnd"] = 0
par.loc[apar2,"parubnd"] = 4

par.loc[bpar3,"parval1"] = 0
par.loc[bpar3,"parlbnd"] = -90
par.loc[bpar3,"parubnd"] = 90

par.loc[bpar2,"parval1"] = 0
par.loc[bpar2,"parlbnd"] = -90
par.loc[bpar2,"parubnd"] = 90

cat1par = par.loc[par.apply(lambda x: x.threshcat=="0" and x.usecol=="threshfill",axis=1),"parnme"]
cat2par = par.loc[par.apply(lambda x: x.threshcat == "1" and x.usecol == "threshfill", axis=1), "parnme"]
assert cat1par.shape[0] == 1
assert cat2par.shape[0] == 1

cat1parvk = [p for p in cat1par if "k:1" in p]
cat2parvk = [p for p in cat2par if "k:1" in p]
for lst in [cat2parvk,cat1parvk]:
    assert len(lst) > 0

#these are the values that will fill the two categories of VK - 
# one is low (clay) and one is high (sand - the windows)
par.loc[cat1parvk, "parval1"] = 0.01
par.loc[cat1parvk, "parubnd"] = 0.1
par.loc[cat1parvk, "parlbnd"] = 0.001
par.loc[cat1parvk, "partrans"] = "log"
par.loc[cat2parvk, "parval1"] = 0.1
par.loc[cat2parvk, "parubnd"] = 1
par.loc[cat2parvk, "parlbnd"] = 0.01
par.loc[cat2parvk, "partrans"] = "log"


cat1par = par.loc[par.apply(lambda x: x.threshcat == "0" and x.usecol == "threshproportion", axis=1), "parnme"]
cat2par = par.loc[par.apply(lambda x: x.threshcat == "1" and x.usecol == "threshproportion", axis=1), "parnme"]

assert cat1par.shape[0] == 1
assert cat2par.shape[0] == 1

#these are the proportions of clay and sand in the resulting categorical array
#really under the hood, only the first one is used, so we can fix the other.
par.loc[cat1par, "parval1"] = 0.75
par.loc[cat1par, "parubnd"] = 1.0
par.loc[cat1par, "parlbnd"] = 0.5
par.loc[cat1par,"partrans"] = "none"

# since the apply method only looks at the first proportion, we can just fix this one
par.loc[cat2par, "parval1"] = 1
par.loc[cat2par, "parubnd"] = 1
par.loc[cat2par, "parlbnd"] = 1
par.loc[cat2par,"partrans"] = "fixed"


## Generating a prior parameter ensemble, then run and viz a real

In [None]:
np.random.seed(122341)
pe = pf.draw(num_reals=100)

In [None]:
pe.to_csv(os.path.join(template_ws,"prior.csv"))
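Under the hood, drawing correlated realizations comes down to sampling from a multivariate Gaussian. Here is a bare-bones numpy sketch using a Cholesky factor of an exponential covariance matrix for ten made-up pilot points along a line. The unit variance in log10 space, the range of 5000 and the clipping to [0.1, 10] are assumptions for this sketch; `pf.draw()` handles variances, bounds and all the different parameter groups in its own way.

```python
import numpy as np

rng = np.random.default_rng(122341)

# hypothetical pilot points spaced 1000 m apart along a line
x = np.arange(0.0, 10000.0, 1000.0)
h = np.abs(x[:, None] - x[None, :])

# exponential covariance in log10 space (assumed variance 1.0, range a = 5000)
cov = 1.0 * np.exp(-h / 5000.0)
chol = np.linalg.cholesky(cov)

# correlated draws: each row of `reals` is one realization
num_reals = 100
z = rng.standard_normal((x.size, num_reals))
log10_reals = (chol @ z).T
reals = np.clip(10.0 ** log10_reals, 0.1, 10.0)  # enforce multiplier bounds
print(reals.shape)
```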

In [None]:
real = 0
pst_name = "real_{0}.pst".format(real)
pst.parameter_data.loc[pst.adj_par_names,"parval1"] = pe.loc[real,pst.adj_par_names].values

In [None]:
pst.control_data.noptmax = 0
pst.write(os.path.join(pf.new_d,pst_name))

In [None]:
pyemu.os_utils.run("pestpp-ies {0}".format(pst_name),cwd=pf.new_d)

In [None]:
pst.set_res(os.path.join(pf.new_d,pst_name.replace(".pst",".base.rei")))
res = pst.res
obs = pst.observation_data
grps = [o for o in obs.obgnme.unique() if o.startswith("npf") and "result" not in o and "aniso" not in o]
grps

In [None]:
gobs = obs.loc[obs.obgnme.isin(grps),:].copy()
gobs["i"] = gobs.i.astype(int)
gobs["j"] = gobs.j.astype(int)
gobs["k"] = gobs.obgnme.apply(lambda x: int(x.split('-')[2].replace("layer","")) - 1)

In [None]:
uk = gobs.k.unique()
uk.sort()

In [None]:
for k in uk:
    kobs = gobs.loc[gobs.k==k,:]
    ug = kobs.obgnme.unique()
    ug.sort()
    fig,axes = plt.subplots(1,4,figsize=(20,6))
    axes = np.atleast_1d(axes)
    for ax in axes:
        ax.set_frame_on(False)
        ax.set_yticks([])
        ax.set_xticks([])
    for g,ax in zip(ug,axes):
        gkobs = kobs.loc[kobs.obgnme==g,:]
        
        arr = np.zeros_like(top_arr)
        arr[gkobs.i,gkobs.j] = res.loc[gkobs.obsnme,"modelled"].values
        ax.set_aspect("equal")
        label = ""
        if "bearing" not in g and "aniso" not in g:
            arr = np.log10(arr)
            label = "$log_{10}$"
        cb = ax.imshow(arr)
        plt.colorbar(cb,ax=ax,label=label)
        ax.set_title("layer: {0} group: {1}".format(k+1,g),loc="left",fontsize=15)
        
    plt.tight_layout()
    plt.show()
    plt.close(fig)

Stunning isn't it?!  There is clearly a lot of subjectivity in defining the prior for the hyper parameters required to use these non-stationary geostats, but they do afford more opportunities to express (stochastic) expert knowledge.  To be honest, there was a lot of experimenting with this notebook to get these figures to look this way - playing with variograms and parameter initial values and bounds a lot.  You are encouraged to do the same!  Scroll back up, change things, and "restart kernel and run all" - this will help build some better intuition, promise....