## Generate batch .csv files
This code generates the .csv files that can be used for batch runs of the lake PSM. 

Each row in these .csv files represents a distinct experimental setup. The user can generate these setups programmatically in four ways: 

1. **Latin hypercube sampling of parameter ranges**
2. **All combinations of given parameter values**
3. **Prescribed individual cases**
4. **One-at-a-time parameter sensitivity tests**

To generate these outputs, the user must define a dictionary in which each key is the name of a parameter. Parameter names must match those listed in `defaults/default_dicts.py`.

Each dictionary value depends on the function used to generate the .csv file. The Latin hypercube approach requires each numeric parameter value to be a list of min / max values, in that order (e.g., `[0,5]`). The all-combinations approach takes a list of values of any length (e.g., `[0,1,2,3,4,5]`). The prescribed-cases approach also takes a list, but the lists must be the same length for all parameters. The one-at-a-time approach takes a list of values for each parameter, and the lists may differ in length.
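
As a rough guide to how large a batch file will be, the sketch below counts the rows that three of the approaches would produce from a dictionary of value lists. The helper name `expected_rows` and the small dictionary are illustrative assumptions and are not part of this notebook.

```python
import math

def expected_rows(params: dict, approach: str) -> int:
    """Hypothetical helper: number of rows a given approach would generate."""
    lengths = [len(values) for values in params.values()]
    if approach == "all_combinations":
        return math.prod(lengths)      # one row per combination of values
    if approach == "prescribed_cases":
        if len(set(lengths)) != 1:
            raise ValueError("prescribed cases need equal-length lists")
        return lengths[0]              # one row per list index
    if approach == "one_at_a_time":
        return sum(lengths)            # one row per listed value
    raise ValueError(f"unknown approach: {approach}")

# made-up example dictionary
example = {"b_area": [21000, 210000], "f": [0.01, 0.5, 0.99]}
rows_all = expected_rows(example, "all_combinations")   # 2 * 3 combinations
rows_oat = expected_rows(example, "one_at_a_time")      # 2 + 3 single-value tests
```

For Latin hypercube sampling the row count is instead set by the requested number of samples, multiplied by the number of non-numeric repeats.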

In [1]:
import os
import itertools
import numpy as np
import pandas as pd
from scipy.stats import qmc
import warnings

## Parameter dictionary
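Values should be min / max pairs for Latin hypercube sampling, or lists of values for the other approaches. 

Latin hypercube sampling applies to numeric parameters only. Non-numeric parameters (e.g., `iceflag` below) are repeated across the numeric samples according to `nonnum_repeat_type`.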

In [5]:
# --- EXAMPLE ranges for latin hypercube
# each parameter has a min, max (in that order!)
parameter_ranges = {
    "b_area": [21000, 50000000],
    "basedep": [100, 3e3],
    "f": [0.01, 0.99],
    "iceflag": [".false.", ".true."],
}

# --- EXAMPLE lists for all combinations
# each parameter has a list of values (can be any length, any order)
parameter_values = {
    "b_area": [21000, 210000, 2100000, 21000000],
    "basedep": [100, 500, 1.5e3, 3e3],
    "f": [0.01, 0.25, 0.5, 0.75, 0.99],
    "iceflag": [".false.", ".true."],
}

# --- EXAMPLE lists for individual cases
# each parameter has a list of values of the same length. 
parameter_cases = {
    "b_area": [21000, 210000, 2100000, 21000000],
    "basedep": [100, 500, 1.5e3, 3e3],
    "f": [0.01, 0.25, 0.75, 0.99],
    "iceflag": [".false.", ".true.", ".false.", ".true."],
}

# --- EXAMPLE lists for individual parameter sensitivity tests
parameter_one_at_a_time = {
    "b_area": [21000, 210000, 2100000, 21000000],
    "basedep": [100, 500, 1.5e3, 3e3],
    "f": [0.01, 0.25, 0.75, 0.99],
    "iceflag": [".false."],
}

## Sampling functions

- `latin_hypercube_sampler`: Sample N times from the min / max range of each numeric parameter. The user must also decide how to handle non-numeric parameters. Use `nonnum_repeat_type="prescribed_cases"` to repeat the LHS once for each index in the non-numeric parameter lists (which must then all be the same length). Use `nonnum_repeat_type="all_combinations"` to repeat the LHS once for every possible combination of the non-numeric parameter values. To learn more about Latin hypercube sampling, [see here](https://en.wikipedia.org/wiki/Latin_hypercube_sampling).

- `all_combinations_sampler`: Generate all combinations of parameter values.

- `prescribed_cases`: Generate experimental setups by list index: the first value of every parameter defines the first setup, the second value of every parameter defines the second setup, and so on. 

- `one_at_a_time_sensitivity`: Test the sensitivity of one parameter at a time. The first runs step through the values of parameter1 with all else set to default, the next runs step through the values of parameter2 with all else set to default, and so on.
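
To illustrate what distinguishes Latin hypercube sampling from plain random sampling, the sketch below draws a small sample with `scipy.stats.qmc.LatinHypercube` and checks that, in each dimension, every one of the N equal-width bins holds exactly one point. The sample size and the two example ranges are arbitrary choices made for illustration.

```python
import numpy as np
from scipy.stats import qmc

n_samples = 5
sampler = qmc.LatinHypercube(d=2)
unit_sample = sampler.random(n=n_samples)   # shape (5, 2), values in the unit interval

# index of the equal-width bin (0..4) that each point falls in, per dimension
bins = np.floor(unit_sample * n_samples).astype(int)

# Latin hypercube property: each bin is used exactly once in every column
stratified = all(
    sorted(bins[:, j]) == list(range(n_samples)) for j in range(bins.shape[1])
)

# scale the unit sample to example min / max ranges (one pair per column)
scaled = qmc.scale(unit_sample, [100, 0.01], [3000, 0.99])
```

Because every bin is visited once, even a small N covers the full range of each parameter, which is why this approach suits wide ranges such as `b_area`.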

In [None]:
# --- function to sample parameter space with latin hypercube
def latin_hypercube_sampler(
        parameter_ranges: dict,
        n_samples: int,
        nonnum_repeat_type: str,
        round_to: int=2,
)-> pd.DataFrame:
    '''
    Read in a dictionary of parameter ranges where each key is the parameter
    name (as listed in defaults/default_dicts.py) and each value is a 
    list (e.g., `[0,1]`). Given a number of samples, return parameter values 
    for each sample. 

    Parameters
    ----------
    parameter_ranges : dict
        dictionary with keys equal to default_dicts.py parameters 
        (must be numeric!) and values indicating a min / max range.
    n_samples : int
        number of samples to generate (each sample amounts to a given
        simulation of the PSM)
    nonnum_repeat_type : str
        ["prescribed_cases" | "all_combinations"] how to repeat the numeric 
        latin hypercube samples across the non-numeric parameters. "prescribed_cases"
        means we repeat the lhs once for each index in the non-numeric params. 
        "all_combinations" means we repeat the lhs once for every possible 
        combination of the non-numeric params. 
    round_to : int
        number of decimal places to round to. 

    Returns
    -------
    pd.DataFrame 
        dataframe where each column is a parameter and each row is a sample
    '''
    # separate numeric and non-numeric elements
    numeric_params = {key: value for key, value in parameter_ranges.items() if all(isinstance(x, (int, float)) for x in value)}
    non_numeric_params = {key: value for key, value in parameter_ranges.items() if not all(isinstance(x, (int, float)) for x in value)}

    # -----
    # make sure each value has only two elements
    for key, values in numeric_params.items():
        if len(values) != 2:
            raise ValueError(f"The key '{key}' has a length of {len(values)}, but it must be 2.")
    # -----
    # get the number of parameters
    num_parameters = len(numeric_params)

    # create a latin hypercube sampler
    sampler = qmc.LatinHypercube(d=num_parameters)

    # generate the samples in the unit hypercube
    sample = sampler.random(n=n_samples)
    
    # extract the ranges from the dictionary
    keys = list(numeric_params.keys())
    min_vals = [numeric_params[key][0] for key in keys]
    max_vals = [numeric_params[key][1] for key in keys]

    # scale the samples to the desired parameter ranges
    scaled_sample = qmc.scale(sample, min_vals, max_vals)

    # convert back to a dict
    parameter_values = {keys[i]: np.round(scaled_sample[:, i], round_to) for i in range(num_parameters)}
    # and now to a pandas dataframe
    df_num = pd.DataFrame(parameter_values)

    # --- handle the non-numeric values
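    if len(non_numeric_params) > 0:
        if nonnum_repeat_type == "prescribed_cases": 
            # one row per index (all non-numeric lists must be the same length)
            df_nonnum = pd.DataFrame(non_numeric_params)
        elif nonnum_repeat_type == "all_combinations":
            # generate all combinations of parameter values
            combinations = list(itertools.product(*non_numeric_params.values()))
            df_nonnum = pd.DataFrame(combinations, columns=non_numeric_params.keys())
        else:
            # fail early instead of hitting an undefined df_nonnum below
            raise ValueError(f"nonnum_repeat_type must be 'prescribed_cases' or 'all_combinations', not '{nonnum_repeat_type}'.")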
        # merge dfs
        # create a dummy key for cross join
        df_num['key'] = 1
        df_nonnum['key'] = 1
        # merge the DataFrames on the dummy key
        dfout = pd.merge(df_num, df_nonnum, on='key').drop('key', axis=1)
    else:
        dfout = df_num

    # return result
    return dfout


# --- function to sample all possible values
def all_combinations_sampler(
        parameter_values: dict,
)-> pd.DataFrame:
    '''
    Read in a dictionary of parameter values where each key is the parameter
    name (as listed in defaults/default_dicts.py) and each value is a 
    list of values to test. Output a pd.DataFrame that includes all 
    possible combinations of the listed parameter values.

    Parameters
    ----------
    parameter_values : dict
        dictionary with keys equal to default_dicts.py parameters 
        and values that are lists of values to test.
    
    Returns
    -------
    pd.DataFrame 
        dataframe where each column is a parameter and each row is a sample
    '''
    # check if all parameter value lengths are 2 (this might
    # indicate that the dict is for latin hypercube sampling)
    all_len_2 = all(len(value) == 2 for value in parameter_values.values())
    if all_len_2:
        warnings.warn("All lists in the dict have a length of 2. If these are min / max values, you may have meant to use the Latin hypercube sampler!", UserWarning)

    # generate all combinations of parameter values
    combinations = list(itertools.product(*parameter_values.values()))

    # create a DataFrame with parameter names as columns
    df = pd.DataFrame(combinations, columns=parameter_values.keys())
    # return result
    return df

# --- prescribed cases
def prescribed_cases(
        parameter_cases: dict,
)->pd.DataFrame:
    '''
    Create pandas dataframe where each row is an individual experimental
    setup that aligns 1:1 with the structure of the parameter_cases
    dictionary

    Parameters
    ----------
    parameter_cases : dict
        dictionary with keys equal to default_dicts.py parameters 
        (must be numeric!) and values are values to test. Each 
        parameter value array must be the same length. 
    
    Returns
    -------
    pd.DataFrame 
        dataframe where each column is a parameter and each row is a sample
    '''
    # confirm that all lists are the same length
    all_same_length = all(len(value) == len(next(iter(parameter_cases.values()))) for value in parameter_cases.values())
    if not all_same_length:
        raise ValueError("All parameter value lists must be the same length for prescribed_cases!")
    
    # convert to pandas dataframe
    dfout = pd.DataFrame(parameter_cases)

    # return result
    return dfout

# --- one at a time
def one_at_a_time_sensitivity(
        parameter_values: dict,
        default_str: str = "**default**",
)-> pd.DataFrame:
    '''
    Create pandas dataframe where each row is a test of a single value from a 
    single parameter, setting all else to default.

    Parameters
    ----------
    parameter_values : dict
        dictionary with keys equal to default_dicts.py parameters 
        and values that are lists of values to test. The lists 
        may differ in length between parameters. 
    default_str : str
        name to use if the value for the default dictionary should be used.
        CAUTION: this might be hard-coded in the helper_functions.py to 
        expect a certain value! Only change if you know what you're doing.
    
    Returns
    -------
    pd.DataFrame 
        dataframe where each column is a parameter and each row is a sample
    '''
    # initialize an empty list of rows
    rows = []
    # loop through each key and value in the parameter dict
    for key, values in parameter_values.items():
        for val in values:
            # set all rows to default
            row = {col: default_str for col in parameter_values.keys()}
            row[key] = val # over-write the default value with the given parameter value
            rows.append(row) 
    # return the dataframe
    return pd.DataFrame(rows)


# --- function to add constant values to the dict
def add_constant_parameters(
        df_batch: pd.DataFrame,
        constant_dict: dict,
)->pd.DataFrame:
    '''
    Take in the existing batch dataframe and add the constant parameter 
    values. 

    Parameters
    ----------
    df_batch : pd.DataFrame
        the pandas dataframe that is output from one of the other sample 
        functions (latin_hypercube_sampler, all_combinations_sampler,
        prescribed_cases).
    constant_dict : dict
        dictionary where keys are parameter names and each has a single 
        value that is held constant for all rows. 

    Returns
    -------
    pd.DataFrame
        the final batch .csv that gets saved
    '''
    # check that every constant has only one value
    all_len_1 = all((isinstance(value, (list, tuple)) and len(value) == 1) or not isinstance(value, (list, tuple)) for value in constant_dict.values())
    if not all_len_1:
        warnings.warn("Expected all dict parameters to have one value but at least one parameter has more. This may lead to unintended results.", UserWarning)

    # single-row dataframe of constants
    df2 = pd.DataFrame([constant_dict])

    # Create a dummy key for cross join (assign returns copies, so the
    # caller's df_batch is not modified in place)
    df_left = df_batch.assign(key=1)
    df_right = df2.assign(key=1)

    # Merge the DataFrames on the dummy key
    combined_df = pd.merge(df_left, df_right, on='key').drop('key', axis=1)
    return combined_df
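
The dummy-key merge used in these functions is a cross join. Assuming pandas 1.2 or newer, a similar result can be sketched with `how="cross"`, which avoids adding and dropping a temporary column. The two dataframes below are made-up examples, not outputs of this notebook.

```python
import pandas as pd

# made-up batch rows and a single row of constants
df_cases_demo = pd.DataFrame({"f": [0.01, 0.99], "iceflag": [".false.", ".true."]})
df_const_demo = pd.DataFrame([{"xlat": 34, "xlon": -101}])

# how="cross" pairs every row on the left with every row on the right
df_cross = pd.merge(df_cases_demo, df_const_demo, how="cross")
```

With a single row of constants, the result has the same number of rows as the batch dataframe and gains one column per constant.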


## Generate samples

### Latin hypercube example

In [4]:
df_lhs = latin_hypercube_sampler(parameter_ranges, n_samples=100, 
                                 nonnum_repeat_type="prescribed_cases")
df_lhs

| | b_area | basedep | f | iceflag |
|---|---|---|---|---|
| 0 | 45117485.14 | 2622.14 | 0.40 | .false. |
| 1 | 45117485.14 | 2622.14 | 0.40 | .true. |
| 2 | 9905732.76 | 642.87 | 0.38 | .false. |
| 3 | 9905732.76 | 642.87 | 0.38 | .true. |
| 4 | 47146314.66 | 887.18 | 0.60 | .false. |
| ... | ... | ... | ... | ... |
| 195 | 39454134.57 | 1931.72 | 0.28 | .true. |
| 196 | 17741115.10 | 2761.11 | 0.45 | .false. |
| 197 | 17741115.10 | 2761.11 | 0.45 | .true. |
| 198 | 10884363.94 | 1145.65 | 0.06 | .false. |


### All combinations example

In [5]:
df_all = all_combinations_sampler(parameter_values)
df_all

| | b_area | basedep | f | iceflag |
|---|---|---|---|---|
| 0 | 21000 | 100.0 | 0.01 | .false. |
| 1 | 21000 | 100.0 | 0.01 | .true. |
| 2 | 21000 | 100.0 | 0.25 | .false. |
| 3 | 21000 | 100.0 | 0.25 | .true. |
| 4 | 21000 | 100.0 | 0.50 | .false. |
| ... | ... | ... | ... | ... |
| 155 | 21000000 | 3000.0 | 0.50 | .true. |
| 156 | 21000000 | 3000.0 | 0.75 | .false. |
| 157 | 21000000 | 3000.0 | 0.75 | .true. |
| 158 | 21000000 | 3000.0 | 0.99 | .false. |


### Prescribed cases example

In [6]:
df_cases = prescribed_cases(parameter_cases)
df_cases

| | b_area | basedep | f | iceflag |
|---|---|---|---|---|
| 0 | 21000 | 100.0 | 0.01 | .false. |
| 1 | 210000 | 500.0 | 0.25 | .true. |
| 2 | 2100000 | 1500.0 | 0.75 | .false. |
| 3 | 21000000 | 3000.0 | 0.99 | .true. |


### One-at-a-time example

In [10]:
df_one_at_a_time = one_at_a_time_sensitivity(parameter_one_at_a_time)
df_one_at_a_time

| | b_area | basedep | f | iceflag |
|---|---|---|---|---|
| 0 | 21000 | `**default**` | `**default**` | `**default**` |
| 1 | 210000 | `**default**` | `**default**` | `**default**` |
| 2 | 2100000 | `**default**` | `**default**` | `**default**` |
| 3 | 21000000 | `**default**` | `**default**` | `**default**` |
| 4 | `**default**` | 100 | `**default**` | `**default**` |
| 5 | `**default**` | 500 | `**default**` | `**default**` |
| 6 | `**default**` | 1500.0 | `**default**` | `**default**` |
| 7 | `**default**` | 3000.0 | `**default**` | `**default**` |
| 8 | `**default**` | `**default**` | 0.01 | `**default**` |
| 9 | `**default**` | `**default**` | 0.25 | `**default**` |
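
The `**default**` placeholder marks cells that should later be filled from the default dictionary. The docstring above suggests this is handled in `helper_functions.py`, so the sketch below only illustrates the idea. The function `resolve_defaults` and the default values shown are hypothetical and are not taken from `defaults/default_dicts.py`.

```python
import pandas as pd

# hypothetical default values, made up for this illustration
defaults_demo = {"b_area": 21000, "basedep": 100, "f": 0.5}

# two made-up one-at-a-time rows
df_oat_demo = pd.DataFrame([
    {"b_area": 210000, "basedep": "**default**", "f": "**default**"},
    {"b_area": "**default**", "basedep": 500, "f": "**default**"},
])

def resolve_defaults(df, defaults, default_str="**default**"):
    """Replace placeholder cells with the default value for that column."""
    out = df.copy()
    for col in out.columns:
        out[col] = [defaults[col] if val == default_str else val for val in out[col]]
    return out

df_resolved = resolve_defaults(df_oat_demo, defaults_demo)
```

Each resolved row then differs from the defaults in exactly one parameter, which is the point of a one-at-a-time test.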


## Constant values

These are parameter values that differ from the defaults but are the same for every simulation (i.e., every row in batch.csv). Two inputs are required: "default_dict_path" and "dict_name", which tell the model which set of default values to use to fill in everything not covered in batch.csv.

In [None]:
constant_dict = {
    # --- THE ONLY TWO REQUIRED COLUMNS ---
    # [UPDATE TO PATH ON YOUR MACHINE]
    "default_dict_path": "/Users/tylerkukla/Documents/GitHub/PRYSM/psm/lake_v2/defaults",
    "dict_name": "defaults_main",
    # -------------------------------------
    # 
    # other constants
    "datafile": "CP_SLIM_modernTopo_280ppm_test_input.txt",
    "outdir": "/Users/tylerkukla/Documents/GitHub/PRYSM/psm/lake_v2",
    "xlat": 34,
    "xlon": -101,
}

In [8]:
# --- add to the dataframe
# (EXAMPLE: assume we're using the prescribed_cases dataframe for now)
df = df_cases.copy()

# add the constants
dfout = add_constant_parameters(df, constant_dict)
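
Since "default_dict_path" and "dict_name" are the two required columns, it may be worth confirming they are present before saving. The check below is a minimal sketch. The helper `missing_required` and the demo dataframes are hypothetical.

```python
import pandas as pd

REQUIRED_COLUMNS = ("default_dict_path", "dict_name")

def missing_required(df: pd.DataFrame) -> list:
    """Hypothetical check: list the required columns that are absent from df."""
    return [col for col in REQUIRED_COLUMNS if col not in df.columns]

# made-up rows: one with the required columns, one without
df_ok = pd.DataFrame([{"default_dict_path": "path/to/defaults",
                       "dict_name": "defaults_main", "xlat": 34}])
df_bad = pd.DataFrame([{"xlat": 34}])

missing_ok = missing_required(df_ok)     # nothing missing
missing_bad = missing_required(df_bad)   # both required columns missing
```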

In [None]:
# --- add unique name for each experiment
casename_root = "LHS_run_v0"
# (number the rows of dfout itself so names always match the final row count)
dfout['casename'] = casename_root + "_" + (dfout.index + 1).astype(str)
# move 'casename' to the first position
dfout = dfout[['casename'] + [col for col in dfout.columns if col != 'casename']]
dfout


| | casename | b_area | basedep | f | iceflag | default_dict_path | dict_name | datafile | outdir | xlat | xlon |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LHS_run_v0_1 | 21000 | 100.0 | 0.01 | .false. | /Users/tylerkukla/Documents/GitHub/PRYSM/psm/l... | defaults_main | CP_SLIM_lowTopo_500ppm_test_input.txt | /Users/tylerkukla/Documents/GitHub/PRYSM/psm/l... | 34 | -101 |
| 1 | LHS_run_v0_2 | 210000 | 500.0 | 0.25 | .true. | /Users/tylerkukla/Documents/GitHub/PRYSM/psm/l... | defaults_main | CP_SLIM_lowTopo_500ppm_test_input.txt | /Users/tylerkukla/Documents/GitHub/PRYSM/psm/l... | 34 | -101 |
| 2 | LHS_run_v0_3 | 2100000 | 1500.0 | 0.75 | .false. | /Users/tylerkukla/Documents/GitHub/PRYSM/psm/l... | defaults_main | CP_SLIM_lowTopo_500ppm_test_input.txt | /Users/tylerkukla/Documents/GitHub/PRYSM/psm/l... | 34 | -101 |
| 3 | LHS_run_v0_4 | 21000000 | 3000.0 | 0.99 | .true. | /Users/tylerkukla/Documents/GitHub/PRYSM/psm/l... | defaults_main | CP_SLIM_lowTopo_500ppm_test_input.txt | /Users/tylerkukla/Documents/GitHub/PRYSM/psm/l... | 34 | -101 |


In [11]:
# --- SAVE RESULT
# [CHANGE DIR TO YOUR MACHINE]
maindir = "/Users/tylerkukla/Documents/GitHub/PRYSM/psm/lake_v2"
savedir = "batch_inputs"
batch_filename = f"batch_{casename_root}.csv"

# save the dataframe as a csv
dfout.to_csv(os.path.join(maindir, savedir, batch_filename), index=False)
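
`to_csv` does not create missing folders, so the save step fails if the target directory does not exist yet. The sketch below shows one way to guard against that and to confirm the file reads back as expected. It writes to a temporary directory, and the file name and dataframe are made up for illustration.

```python
import os
import tempfile
import pandas as pd

# made-up batch dataframe
df_demo = pd.DataFrame({"casename": ["run_1", "run_2"], "f": [0.01, 0.99]})

with tempfile.TemporaryDirectory() as tmpdir:
    save_path = os.path.join(tmpdir, "batch_inputs", "batch_demo.csv")
    # create the output folder if needed before writing
    os.makedirs(os.path.dirname(save_path), exist_ok=True)
    df_demo.to_csv(save_path, index=False)
    # read the file back to check that rows and columns survived
    df_check = pd.read_csv(save_path)

same_shape = df_check.shape == df_demo.shape
```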
