# 20 km WRF re-stacking pipeline

This pipeline restructures the raw 20 km WRF outputs covering Alaska and the surrounding regions (created by P. Bieniek) into more user-friendly files that can be easily imported into popular GIS software. The WRF dataset consists of hourly outputs for one reanalysis (ERA-Interim) and two GCMs (GFDL-CM3 and NCAR-CCSM4). The pipeline is designed to be executed entirely from this notebook.

This is a rather complicated SNAP data pipeline. It works on a large amount of data (~300 GB for a single model / scenario / year, so roughly 90 TB for the $2 * 95 + 2 * 35 + 35$ model / scenario / year combinations), creates a large number of final data files (>10k), and makes use of slurm, a specific directory structure / file management scheme, and asynchronous execution (i.e. re-running certain steps, running steps for only certain variables, etc.). The "Setup" step provides info on executing it.
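
The arithmetic behind that estimate can be checked quickly. This sketch takes the ~300 GB figure and the combination count from the paragraph above at face value; both are approximations, so the result is only an order-of-magnitude guide. The variable names are illustrative and not part of the pipeline codebase.

```python
# Rough data-volume estimate using the figures quoted above.
gb_per_combo = 300  # approximate size of one model / scenario / year, in GB

# number of model / scenario / year combinations, as written in the text
n_combos = 2 * 95 + 2 * 35 + 35

total_tb = n_combos * gb_per_combo / 1000  # decimal terabytes
print(f"{n_combos} combinations, about {total_tb:.1f} TB in total")
```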

# 0 - Setup

This step provides instructions for setting up and running the pipeline. 

First off, a snapshot of the structure of the target base data:

In [13]:
ls /archive/DYNDOWN/DIONE/pbieniek/ccsm/hist/hourly | head -5

[0m[38;5;27m1970[0m/
[38;5;27m1971[0m/
[38;5;27m1972[0m/
[38;5;27m1973[0m/
[38;5;27m1974[0m/


In [14]:
ls /archive/DYNDOWN/DIONE/pbieniek/ccsm/hist/hourly/ | tail -6

2003/
2004/
2005/
nohup.out
orgdata.sh*

In [18]:
ls /archive/DYNDOWN/DIONE/pbieniek/ccsm/hist/hourly/1979 | head -5

[0m[38;5;34mdailylog.out[0m*
[38;5;34mWRFDS_d01.1979-01-01_00.nc[0m*
[38;5;34mWRFDS_d01.1979-01-01_01.nc[0m*
[38;5;34mWRFDS_d01.1979-01-01_02.nc[0m*
[38;5;34mWRFDS_d01.1979-01-01_03.nc[0m*
ls: write error


This structure applies for all outputs, and exists for the following model / scenario / year combinations:

* `era/`
    * `hist/`: 1979-2015
* `gfdl/`
    * `hist/`: 1970-2006
    * `rcp85/`: 2006-2100
* `ccsm/`
    * `hist/`: 1970-2005
    * `rcp85/`: 2005-2100
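
The year ranges above can be expanded into explicit lists of years, which is the form the `years` job parameter takes later on. The sketch below is a hypothetical lookup: the dictionary name, the function name, and the layout are assumptions, the group names follow the encoding described in Section 0.2.2, and the real `luts` module may be organized differently.

```python
# Hypothetical lookup of WRF group -> inclusive (first, last) year,
# copied from the list above. Not the actual contents of `luts`.
group_year_ranges = {
    "era_interim": (1979, 2015),
    "gfdl_hist": (1970, 2006),
    "gfdl_rcp85": (2006, 2100),
    "ccsm_hist": (1970, 2005),
    "ccsm_rcp85": (2005, 2100),
}


def years_for_group(group):
    """Return every year available for a group as a list of ints."""
    start, end = group_year_ranges[group]
    return list(range(start, end + 1))


print(len(years_for_group("ccsm_hist")))
```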

## 0.1 - Pipeline execution

### Processing

The default configuration for this pipeline is to process all available data - all year / variable / model / scenario combinations possible. However, at the finest level of control, this pipeline can re-stack a single year's worth of data for a single variable / model / scenario combination.

As seen above, the input data are grouped by model and scenario names and are consistently structured - hourly WRF outputs grouped by yearly folders. Thus, processing is done at the model / scenario "group" level - more on that below.
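
Because each hourly file carries its timestamp in its name, the time coordinate can be recovered from the filename alone. The following is a minimal sketch, assuming every file follows the `WRFDS_d01.YYYY-MM-DD_HH.nc` pattern shown in the listing above; `parse_wrf_timestamp` is a hypothetical helper, not a function from the pipeline codebase.

```python
from datetime import datetime
from pathlib import Path


def parse_wrf_timestamp(fp):
    """Extract the timestamp from a name like WRFDS_d01.1979-01-01_00.nc."""
    # the name splits on "." into ["WRFDS_d01", "1979-01-01_00", "nc"]
    stamp = Path(fp).name.split(".")[1]
    return datetime.strptime(stamp, "%Y-%m-%d_%H")


ts = parse_wrf_timestamp(
    "/archive/DYNDOWN/DIONE/pbieniek/ccsm/hist/hourly/1979/WRFDS_d01.1979-01-01_03.nc"
)
print(ts.year, ts.hour)
```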

Because of the size and number of files involved, this pipeline is best run asynchronously, with memory management tasks, regular progress printouts, and reports of which files are where for each group.

### System

This pipeline is being developed on the Chinook cluster:

In [2]:
!uname -a

Linux chinook00.rcs.alaska.edu 2.6.32-754.35.1.el6.61015g0000.x86_64 #1 SMP Mon Dec 21 12:41:07 EST 2020 x86_64 x86_64 x86_64 GNU/Linux


This pipeline makes use of slurm and multiple cores / compute nodes for processing in reasonable time.

In [5]:
!sinfo -V

slurm 19.05.7


### Execution

This notebook should be executed sequentially to process the entire dataset. To process only a subset of the target dataset, for example to fix an issue or re-process failed runs, first execute all code cells in this Setup section from Section 0.2 onward.

## 0.2 - Environment

Instead of relying on environment variables, this pipeline utilizes user-supplied parameters specified in the cells of this notebook by simply assigning values to variables prior to executing any processing code cells.

### 0.2.1 - global parameters

The following variables are used throughout the pipeline and are set in the code cell below:

* `base_dir` - Full path to the directory that will contain all ancillary and intermediate files that will be kept, such as scripts for slurm / `sbatch`
* `output_dir` - Full path to the directory that will contain the final output data (will be the same as `base_dir` here but specified separately for consistency with other SNAP pipelines)
* `scratch_dir` - Full path to the scratch directory that raw WRF outputs will be copied to prior to processing them
    * This pipeline works with WRF outputs that are on a mounted file system, so they can be copied over to scratch space and removed when done. This improves IO and avoids the need to keep them in the `base_dir`.
* `slurm_email` - String containing email address to use for failed slurm notifications
* `conda_init_script` - Path to a script containing the commands that initialize the shells on the compute nodes so they can use `conda activate`. This is currently specific to Chinook. The script contains the typical commands added to `~/.bashrc` after installing conda:

In [19]:
cat ~/init_conda.sh

#!/bin/bash

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/home/kmredilla/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/home/kmredilla/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/home/kmredilla/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/home/kmredilla/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<< 



Supply the values for these parameters:

In [1]:
# User-parameters
base_dir = "/import/SNAP/wrf_data/project_data/wrf_data"
output_dir = "/import/SNAP/wrf_data/project_data/wrf_data"
scratch_dir = "/center1/DYNDOWN/kmredilla/wrf_data"
slurm_email = "kmredilla@alaska.edu"
conda_init_script = "/home/kmredilla/init_conda.sh"

### 0.2.2 - job parameters

The following arguments are required for a single job of re-stacking data for a particular variable (or variables), model, scenario, and year (or years):

* `varname`: Name of the variable. This is the lower case version of the variable name in the WRF outputs.
* `wrf_dir`: This is the directory containing the WRF files. This codebase is designed for use with hourly output, so this needs to be the `hourly/` directory if there are multiple options (e.g. `daily/`, `monthly/`, etc.).
* `group`: Encoded value specifying the WRF group being worked on, which is just a combination of the model and scenario (or just the model, in the case of ERA-Interim). One of [`era_interim`, `gfdl_hist`, `ccsm_hist`, `gfdl_rcp85`, `ccsm_rcp85`].
* `years`: A list of years to work on, specified as integers, such as `[1979, 1980]`. Omit to work on all years available for a given WRF group.

The WRF outputs of interest from different runs of model/scenario may be in separate places, but there is consistency in file structure across all groups - all `hourly` directories have annual subgroups consisting of the WRF outputs to be restacked.
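
Since the directory layout is consistent, the `group` value can in principle be derived from `wrf_dir` rather than typed by hand. This is a minimal sketch, assuming every `wrf_dir` ends in `<model>/<scenario>/hourly` and that the ERA-Interim directory is named `era`, as in the listing in Section 0; `group_from_wrf_dir` is a hypothetical helper and the ERA path in the usage line is a placeholder.

```python
from pathlib import Path


def group_from_wrf_dir(wrf_dir):
    """Derive the group name from a path ending in <model>/<scenario>/hourly."""
    wrf_dir = Path(wrf_dir)
    scenario = wrf_dir.parent.name
    model = wrf_dir.parent.parent.name
    # ERA-Interim is a reanalysis, so its group is named by model only
    if model == "era":
        return "era_interim"
    return f"{model}_{scenario}"


print(group_from_wrf_dir("/archive/DYNDOWN/DIONE/pbieniek/ccsm/hist/hourly"))
# placeholder root; only the last three path components matter
print(group_from_wrf_dir("/placeholder/root/era/hist/hourly"))
```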

## 0.3 - Global imports and filepaths

Set up all file paths used by the pipeline and import all packages used in multiple sections in the cell below.

In [2]:
import os
import time
from pathlib import Path
# codebase
import luts
import restack_20km as main

base_dir = Path("/import/SNAP/wrf_data/project_data/wrf_data")
output_dir = Path("/import/SNAP/wrf_data/project_data/wrf_data")
scratch_dir = Path("/center1/DYNDOWN/kmredilla/wrf_data")
# where raw wrf outputs will be copied on scratch
raw_scratch_dir = scratch_dir.joinpath("raw")
raw_scratch_dir.mkdir(exist_ok=True)
# where initially restacked data will be stored on scratch_space
restack_scratch_dir = scratch_dir.joinpath("restacked")
restack_scratch_dir.mkdir(exist_ok=True)

slurm_dir = scratch_dir.joinpath("slurm")
slurm_dir.mkdir(exist_ok=True)
slurm_email = "kmredilla@alaska.edu"

# this env is always defined if notebook started with anaconda-project run
project_dir = Path(os.getenv("PROJECT_DIR"))
ap_env = project_dir.joinpath("envs/default")
cp_script = project_dir.joinpath("restack_20km/mp_cp.py")
restack_script = project_dir.joinpath("restack.py")

# 1 - Re-stack data and improve the file structure

This is the main lift of the pipeline and it applies to a single WRF group, for any variables and years specified. It re-stacks the WRF outputs, which means extracting the data for all variables in a single WRF file and combining them into new files grouped by variable and year. It then assigns useful metadata and restructures the files to achieve greater usability (note - this was previously a separate step, but the storage of essentially duplicate intermediate data was not efficient).

As mentioned above, this pipeline is currently configured to run for all potential combinations of variables / years for each group. This section will demonstrate execution of all the processing steps required to re-stack one single WRF group, NCAR-CCSM4 historical, and then will proceed to string them all together for processing the remaining WRF groups.
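
To make the idea of re-stacking concrete, here is a toy sketch of the core operation: pulling one variable out of many hourly files and stacking it along a new time axis. It is an illustration only, not the code in `restack.py`. Plain dictionaries of NumPy arrays stand in for the netCDF files, and the variable names `t2` and `psfc` and the grid size are chosen purely for the example.

```python
import numpy as np

# Stand-ins for three hourly WRF files: each maps variable name -> 2D (y, x) grid.
# Real files are netCDF; dicts keep this sketch self-contained.
hourly_files = [
    {"t2": np.full((2, 3), 270.0 + h), "psfc": np.full((2, 3), 1000.0 + h)}
    for h in range(3)
]


def restack(hourly_files, varname):
    """Stack one variable from many hourly files into a (time, y, x) array."""
    return np.stack([f[varname] for f in hourly_files], axis=0)


t2 = restack(hourly_files, "t2")
print(t2.shape)
```

In the actual pipeline the result for each variable and year would be written to its own file rather than held in memory.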

## 1.1 - Copy WRF data to scratch space 

If not present on the filesystem (as is the case at the time of developing the current code), the WRF data need to be copied over.

This step will copy the annual subdirectory (or directories) containing the WRF outputs for all specified years to scratch space for efficient reading. This step utilizes `sbatch`.
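
The contents of the copy script are not shown in this notebook, so the following is only a sketch of how one year's files could be copied concurrently. It uses a thread pool from the standard library rather than multiprocessing, since copying is I/O-bound; the function names and the `nworkers` parameter are assumptions.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_one(src_fp, dst_dir):
    """Copy a single file into dst_dir unless a file of that name already exists."""
    dst_fp = Path(dst_dir) / Path(src_fp).name
    # NOTE: this simple check does not detect partially copied files
    if not dst_fp.exists():
        shutil.copy(src_fp, dst_fp)
    return dst_fp


def copy_year(src_dir, dst_dir, nworkers):
    """Copy all .nc files in src_dir to dst_dir using nworkers threads."""
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    src_fps = sorted(Path(src_dir).glob("*.nc"))
    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        return list(pool.map(lambda fp: copy_one(fp, dst_dir), src_fps))
```

Skipping files that already exist lets a failed job be re-submitted without starting over, at the cost of not catching truncated copies.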

Specify the desired job parameters in the code cell below:

In [3]:
# job parameters
wrf_dir = Path("/archive/DYNDOWN/DIONE/pbieniek/ccsm/hist/hourly")
group = "ccsm_hist"
years = [2004, 2005]

Use slurm to break up the work of copying multiple years across nodes. Specify the number of CPUs to use in the `ncpus` parameter and write the `sbatch` scripts for copying the data for each year:

In [4]:
ncpus = 10
partition = "t1small"

# if no years supplied, get all
if len(years) == 0:
    years = luts.groups[group]["years"]

sbatch_fps = []
group_dir = raw_scratch_dir.joinpath(group)
for year in years:
    # write to .slurm script
    sbatch_fp = slurm_dir.joinpath(f"cp_scratch_{group}_{year}.slurm")
    # filepath for slurm stdout
    sbatch_out_fp = slurm_dir.joinpath(f"cp_scratch_%j_{group}_{year}.out")
    src_dir = wrf_dir.joinpath(str(year))
    dst_dir = group_dir.joinpath(str(year))
    sbatch_head = main.make_sbatch_head(ncpus, slurm_email, partition, conda_init_script, ap_env)
    main.write_sbatch_copyto_scratch(sbatch_fp, sbatch_out_fp, src_dir, dst_dir, cp_script, ncpus, sbatch_head)
    sbatch_fps.append(sbatch_fp)

Ensure yearly subdirectories are present before submitting the batch jobs. 

In [13]:
main.make_yearly_scratch_dirs(group, years, raw_scratch_dir)

Make sure the directories on `$ARCHIVE` have been staged for copying over, e.g.:

```
batch_stage -r /archive/DYNDOWN/DIONE/pbieniek/ccsm/hist/hourly
```

And then call `sbatch` to submit all of the `sbatch_fps`:

In [None]:
# job_ids = [main.submit_sbatch(fp) for fp in sbatch_fps]
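
The body of `main.submit_sbatch` is not shown here. As a rough sketch of what submitting a script and capturing its job ID can look like, the example below relies on `sbatch` printing `Submitted batch job <id>` on success; `parse_job_id` and `submit` are hypothetical names, not the codebase's functions.

```python
import re
import subprocess


def parse_job_id(sbatch_stdout):
    """Pull the job ID out of sbatch's 'Submitted batch job <id>' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise ValueError(f"Unexpected sbatch output: {sbatch_stdout!r}")
    return int(match.group(1))


def submit(sbatch_fp):
    """Submit a script with sbatch and return the slurm job ID."""
    out = subprocess.check_output(["sbatch", str(sbatch_fp)], text=True)
    return parse_job_id(out)
```

Keeping the job IDs makes it possible to query or cancel the copy jobs later with other slurm commands.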

#### Check progress of copying to scratch

Run the cell below to check the progress of the copy for the current arguments:

In [5]:
_ = main.check_raw_scratch(wrf_dir, group, years, raw_scratch_dir)

0 of 17520 requestedWRF output files found in /center1/DYNDOWN/kmredilla/wrf_data/raw.
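
A progress check of this kind boils down to comparing file names between the source and scratch directories. The sketch below shows one way to do that; it is an assumption about the approach, not the implementation of `main.check_raw_scratch`, and `count_copied` is a hypothetical name. It expects `dst_dir` to be the group-level directory that holds the yearly subdirectories.

```python
from pathlib import Path


def count_copied(src_dir, dst_dir, years):
    """Return (n_found, n_expected) for .nc files across the given years."""
    n_found = 0
    n_expected = 0
    for year in years:
        src_names = {fp.name for fp in Path(src_dir, str(year)).glob("*.nc")}
        dst_names = {fp.name for fp in Path(dst_dir, str(year)).glob("*.nc")}
        n_expected += len(src_names)
        # only count files that exist in both places
        n_found += len(src_names & dst_names)
    return n_found, n_expected
```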


## 1.2 - Restack the data

Now that the WRF outputs are available on the scratch filesystem for faster access, execute the restacking script.

In [None]:
# how to ensure files aren't opened at same time by different processes? Is that bad anyway? Is there handling for that currently?

In [None]:
import subprocess


def sbatch_restack(group, year, variable, pairs_fp):
    """Restack the WRF outputs by creating and
    submitting an sbatch job for a single group,
    year, and variable.

    NOTE: pairs_fp was referenced but never defined in the original cell.
    It is assumed here to be the Path of the .pickle file passed to
    restack.py through -fp. The variable argument is not used yet.
    """

    restack_script = Path(os.getcwd()).joinpath("restack.py")

    slurm_dir = base_dir.joinpath("slurm/restack")
    slurm_dir.mkdir(exist_ok=True, parents=True)

    # slurm replaces %j with the job ID in the stdout filename
    sbatch_out_str = f"{slurm_dir}/slurm_restack_%j_{group}_{year}.out"
    head = (
        "#!/bin/sh\n"
        "#SBATCH --nodes=1\n"
        f"#SBATCH --cpus-per-task={ncpus}\n"
        "#SBATCH --account=snap\n"
        "#SBATCH --mail-type=FAIL\n"
        f"#SBATCH --mail-user={slurm_email}\n"
        "#SBATCH -p main\n"
        f"#SBATCH --output {sbatch_out_str}\n"
        'eval "$(conda shell.bash hook)"\n'
        # base conda env has anaconda-project installed
        "conda activate\n"
    )
    command = f"anaconda-project run python {restack_script} -fp {pairs_fp} -n {ncpus}\n"

    # write to .slurm script, named after the pickle file
    sbatch_fp = slurm_dir.joinpath(pairs_fp.name.replace(".pickle", ".slurm"))
    with open(sbatch_fp, "w") as f:
        f.write(head + command)

    # submit the job and capture what sbatch prints to stdout
    out = subprocess.check_output(["sbatch", str(sbatch_fp)])

    return {"subprocess_out": out, "sbatch_out_str": sbatch_out_str}

## 1.3 - Move stacked data from scratch

Open question: should this step be kept? The improve step could also be run while the data are still on scratch.

## 1.4 - Remove the WRF outputs from scratch

To free up space in a `$SCRATCH_DIR` with limited capacity, remove the raw WRF outputs whose restacking has completed.
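
No code for this step exists yet in the notebook. One cautious way to do it is sketched below, assuming the raw files live under `raw_scratch_dir/<group>/<year>` as set up in Section 1.1. `remove_raw_year` is a hypothetical helper, and it defaults to a dry run so that nothing is deleted by accident.

```python
import shutil
from pathlib import Path


def remove_raw_year(raw_scratch_dir, group, year, dry_run=True):
    """Delete raw_scratch_dir/<group>/<year>; with dry_run, only report it.

    Returns True only if the directory was actually removed.
    """
    year_dir = Path(raw_scratch_dir) / group / str(year)
    if not year_dir.is_dir():
        return False
    if dry_run:
        print(f"Would remove {year_dir}")
        return False
    shutil.rmtree(year_dir)
    return True
```

Calling it first with the default `dry_run=True` gives a chance to confirm the target path before repeating the call with `dry_run=False`.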

# 2 - Improve the data

In [65]:
group_dir.joinpath(str(year)).mkdir(exist_ok=True, parents=True) 

In [51]:
wrf_dir.parent.name

'hist'

In [48]:
test = []
test