# Skill profiling

This notebook orchestrates skill profiling of the analog forecast algorithm across all available options for a set of dates.

In [1]:
import os
import subprocess
from pathlib import Path
import luts
from config import data_dir, project_dir

### Goal

The goal is to compute the error for the analog forecast method and for a naive forecast method. The product here should be a table of results: errors between the forecast and "observed" values for all spatial domains, variables, etc.

### Processing strategy

We have some large data files: daily data for the northern hemisphere for each of our variables of interest. The search for analogs runs over the entire time series for the full domain as well as the subdomains, and the naive forecasting samples many of the time steps over many simulations. At ~45GB (or ~23GB for the raw, i.e. non-anomaly-based, files), it makes sense to read a dataset completely into memory once and then iterate over the possible groups. So the data files are the unit of work we iterate over; they are grouped by variable and data type (raw vs anomaly), for 8 files in total.

We will use slurm here to execute the `run_profile.py` script, which conducts the profiling for all dates specified in that script.

### Naive profiling

I believe we only need to simulate the naive forecasts for each domain and variable, not for every reference date. This assumes that the distribution of "skill" (RMSE for now) for the naive forecast is the same for every day of the year. For each spatial domain and variable, we attempt to simulate the distribution of naive forecast skill, where the naive forecast selects its analogs uniformly at random from the complete historical time series.
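
As a rough illustration of the simulation described above, here is a minimal sketch on a toy 1-D series. The function names, the number of analogs, the forecast horizon, and the number of simulations are all assumptions made for this example. The real profiling works on gridded data and may differ in its details.

```python
import math
import random


def naive_forecast_rmse(series, ref_idx, n_analogs, horizon, rng):
    """RMSE of one naive forecast built from uniformly random analog dates."""
    # Candidate analog dates: any index with a full forecast window after it,
    # excluding the reference date itself.
    candidates = [i for i in range(len(series) - horizon) if i != ref_idx]
    analogs = rng.sample(candidates, n_analogs)

    sq_errors = []
    for lead in range(1, horizon + 1):
        # Forecast for this lead time: mean of what followed each random analog.
        forecast = sum(series[a + lead] for a in analogs) / n_analogs
        observed = series[ref_idx + lead]
        sq_errors.append((forecast - observed) ** 2)
    return math.sqrt(sum(sq_errors) / horizon)


def simulate_naive_skill(series, n_sims=1000, n_analogs=5, horizon=14, seed=0):
    """Return a list of RMSE values, one per simulated naive forecast."""
    rng = random.Random(seed)
    # Last reference index that still leaves a full window of observations.
    last_ref = len(series) - horizon - 1
    return [
        naive_forecast_rmse(series, rng.randint(0, last_ref), n_analogs, horizon, rng)
        for _ in range(n_sims)
    ]


# Toy data (assumption): ten years of a noisy annual cycle.
toy_rng = random.Random(42)
toy_series = [
    math.sin(2 * math.pi * day / 365) + toy_rng.gauss(0, 0.1) for day in range(3650)
]
rmses = simulate_naive_skill(toy_series, n_sims=200)
```

Because the reference dates are drawn at random as well, the resulting list approximates a single skill distribution that is not tied to any particular day of the year, which is the assumption stated above.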

So, we can create a table of naive forecast skill for all combinations of spatial domain and variable, which can then be joined with a table of analog forecast results for useful comparisons. 
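
The join itself could look something like the sketch below. The column names (`spatial_domain`, `varname`, `reference_date`, `analog_rmse`, `naive_rmse`), the domain and variable labels, and the numbers are all placeholders invented for illustration. They are not results and not necessarily the columns the profiling script writes.

```python
def join_naive_to_analog(analog_rows, naive_rows):
    """Attach the matching naive RMSE to every analog result row."""
    # One naive value per (domain, variable) combination.
    naive_lookup = {
        (row["spatial_domain"], row["varname"]): row["naive_rmse"]
        for row in naive_rows
    }
    joined = []
    for row in analog_rows:
        key = (row["spatial_domain"], row["varname"])
        # Keep all analog columns and add the naive baseline alongside them.
        joined.append({**row, "naive_rmse": naive_lookup[key]})
    return joined


# Placeholder tables (hypothetical labels and values).
naive_rows = [
    {"spatial_domain": "domain_a", "varname": "var_x", "naive_rmse": 2.0},
    {"spatial_domain": "domain_b", "varname": "var_x", "naive_rmse": 3.0},
]
analog_rows = [
    {"spatial_domain": "domain_a", "varname": "var_x",
     "reference_date": "2004-10-11", "analog_rmse": 1.5},
    {"spatial_domain": "domain_a", "varname": "var_x",
     "reference_date": "2005-10-11", "analog_rmse": 1.0},
    {"spatial_domain": "domain_b", "varname": "var_x",
     "reference_date": "2004-10-11", "analog_rmse": 2.5},
]
joined = join_naive_to_analog(analog_rows, naive_rows)
```

Each analog row keeps its own reference date, while the naive value repeats across all dates for the same domain and variable. With the tables loaded as pandas DataFrames, `DataFrame.merge` on the two key columns would do the same job.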

We will use slurm for this because each group takes a while to process. We will call the `run_profile.py` script to run the profiling for a particular variable and data type.
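
The sbatch scripts written in this notebook pass three flags to `run_profile.py`: `--varname`, `--results_file`, and an optional `--use_anom`. The script itself is not shown here, so the following is only a hypothetical sketch of how it might read those flags with `argparse`. The help strings and the example values are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that the generated sbatch scripts pass on the command line."""
    parser = argparse.ArgumentParser(
        description="Profile forecast skill for one variable and data type."
    )
    parser.add_argument("--varname", required=True, help="variable to profile")
    parser.add_argument(
        "--results_file", required=True, help="path of the CSV to write results to"
    )
    # Present only for anomaly-based data; absent means raw data.
    parser.add_argument(
        "--use_anom", action="store_true", help="use anomaly-based data"
    )
    return parser.parse_args(argv)


# Example invocation with hypothetical values.
args = parse_args(
    ["--varname", "var_x", "--results_file", "results/var_x_anom.csv", "--use_anom"]
)
raw_args = parse_args(["--varname", "var_x", "--results_file", "results/var_x.csv"])
```

Using `store_true` for `--use_anom` matches how the sbatch script either includes the bare flag or leaves it out entirely.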

In [2]:
def write_run_profile_sbatch(
    sbatch_fp,
    sbatch_out_fp,
    varname,
    results_fp,
    use_anom,
    data_dir,
    project_dir,
    conda_init_script
):
    """Write an sbatch script that runs run_profile.py for one variable and data type."""
    # slurm directives plus environment setup for the job
    sbatch_head = (
        "#!/bin/sh\n"
        "#SBATCH --nodes=1\n"
        "#SBATCH --cpus-per-task=32\n"
        "#SBATCH --exclusive\n"
        "#SBATCH --mail-type=FAIL\n"
        "#SBATCH --mail-user=kmredilla@alaska.edu\n"
        "#SBATCH -p main\n"
        f"#SBATCH --output {sbatch_out_fp}\n"
        f"source {conda_init_script}\n"
        "conda activate analog-forecast\n"
        f"export DATA_DIR={data_dir}\n"
        f"export PYTHONPATH=$PYTHONPATH:{project_dir}\n"
    )

    # the profiling command; --use_anom is only passed for anomaly data
    py_commands = (
        f"time python {project_dir.joinpath('skill_profiling', 'run_profile.py')} "
        f"--varname {varname} "
        f"--results_file {results_fp} "
        f"{'--use_anom' if use_anom else ''}\n"
    )

    commands = sbatch_head + py_commands

    with open(sbatch_fp, "w") as f:
        f.write(commands)

Make the slurm scripts for `sbatch`ing:

In [3]:
sbatch_dir = Path("slurm")
sbatch_dir.mkdir(exist_ok=True)
results_dir = Path("results")
results_dir.mkdir(exist_ok=True)

sbatch_fps = []
results_fps = []

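# get the path to the conda init script (it is sourced in every sbatch script)
conda_init_script = os.getenv("CONDA_INIT_SCRIPT")
if conda_init_script is None:
    raise RuntimeError("The CONDA_INIT_SCRIPT environment variable is not set")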

for varname in luts.varnames_lu.keys():
    for use_anom in [True, False]:
        group_str = f"{varname}{'_anom' if use_anom else ''}"
        sbatch_fp = sbatch_dir.joinpath(f"run_profile_{group_str}.slurm").resolve()
        sbatch_out_fp = sbatch_dir.joinpath(f"run_profile_{group_str}_%j.out").resolve()
        results_fp = results_dir.joinpath(f"{group_str}.csv").resolve()
        sbatch_kwargs = {
            "sbatch_fp": sbatch_fp,
            "sbatch_out_fp": sbatch_out_fp,
            "varname": varname,
            "results_fp": results_fp,
            "use_anom": use_anom,
            "data_dir": data_dir,
            "project_dir": project_dir,
            "conda_init_script": conda_init_script
        }
        
        write_run_profile_sbatch(**sbatch_kwargs)
        sbatch_fps.append(sbatch_fp)
        results_fps.append(results_fp)

Submit the jobs:

In [5]:
for fp in sbatch_fps:
    subprocess.check_output(["sbatch", str(fp)])
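
The submission cell above discards what `sbatch` prints. If we wanted to keep track of the jobs, one option is to parse the job ID from that output, which by default is a line of the form `Submitted batch job <id>`. The helper below is a hypothetical addition, not part of the project code, and the example job ID is made up.

```python
import re


def parse_job_id(sbatch_stdout):
    """Extract the numeric job ID from sbatch's confirmation message."""
    # subprocess.check_output returns bytes unless text mode is requested.
    if isinstance(sbatch_stdout, bytes):
        sbatch_stdout = sbatch_stdout.decode()
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise ValueError(f"Unexpected sbatch output: {sbatch_stdout!r}")
    return int(match.group(1))


# Example with a made-up confirmation message.
example_id = parse_job_id(b"Submitted batch job 123456\n")
```

The collected IDs could then be passed to `squeue -j` to check on the jobs. Alternatively, `sbatch --parsable` prints the job ID without the surrounding text.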
