# Objective
The purpose of this notebook is to create a data hypercube from the GIPL model outputs for multiple permafrost variables (mean annual ground temperature, talik thickness, depth of permafrost base, and depth of top of permafrost) for the Alaska (i.e., CRREL) spatial domain (Aleutians not included). We know from our exploratory data analysis (EDA) that most of the processing work concerns metadata: standardizing filenames and establishing a consistent spatial reference and NoData value for every GeoTIFF.

There are 6000 GeoTIFFs, each 2471 x 1941 pixels at 1 km spatial resolution, distributed equally across the following climate model and emissions scenario combinations:
```
├── 5Models_45
├── 5Models_85
├── GFDL_45
├── GFDL_85
├── NCAR_45
└── NCAR_85
```
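As a quick sanity check on those numbers, the sketch below works out how many files each model-scenario combination should hold. The ten-variable count comes from Section 3.1 (seven mean annual ground temperature depths plus talik thickness, permafrost base, and permafrost top). Reading the remaining factor as one file per year is an assumption, not something stated in the source.

```python
# Expected file counts implied by the dataset description.
n_files = 6000
combos = ["5Models_45", "5Models_85", "GFDL_45", "GFDL_85", "NCAR_45", "NCAR_85"]
n_variables = 10  # seven MAGT depths + talik thickness + permafrost base + permafrost top

files_per_combo = n_files // len(combos)
# Assumption: the files for one variable in one combination are one per year.
files_per_variable = files_per_combo // n_variables

print(f"{files_per_combo} files per model-scenario combination")
print(f"{files_per_variable} files per variable within each combination")
```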


## Pipeline Steps

The flow here is as follows:

 0. Set up directories and configure data fetch and output archival.
 1. Fetch data if needed, and verify all the data is in place.
 2. Extract - decompress the data if needed, and verify all files are in place.
 3. Specify output parameters: file naming convention and raster creation profile.
 4. Create a new dataset with corrected metadata and NoData values.
 5. Archive the outputs.

## 0 - Setup



In [1]:
import os
import shutil
import rasterio as rio
import re
import numpy as np
import threading
import concurrent.futures
import random
import tqdm
#from multiprocessing import Pool
from multiprocess import Pool
from pyproj.crs import CRS
from pathlib import Path
from rasterio import Affine

# general config
COPY_SOURCE = False
COPY_OUTPUTS_TO_ARCHIVE = False
os.environ["NCORES"] = "24"

# directory config
os.environ["ORIGINAL_SRC_DIR"] = "/atlas_scratch/ssmarchenko/ssmarchenko_home/2022_CRREL_proj/"
os.environ["PROJECT_DIR"] = "/atlas_scratch/cparr4/new_gipl_eda_marchenko_revision/"
os.environ["DST_DIR"] = "/atlas_scratch/cparr4/new_gipl_eda_marchenko_revision/2022_CRREL_proj/"
os.environ["EXTRACT_DIR"] = "/atlas_scratch/cparr4/new_gipl_eda_marchenko_revision/2022_CRREL_proj/extracted/"
os.environ["OUTPUT_DIR"] = "/atlas_scratch/cparr4/new_gipl_eda_marchenko_revision/crrel_gipl_outputs/"
os.environ["ARCHIVE_DIR"] = "/workspace/Shared/Tech_Projects/Arctic_EDS/project_data/rasdaman_datasets/crrel_gipl_outputs/"

# build Path objects from the environment variables and create the directories
gipl_src_path = Path(os.environ["ORIGINAL_SRC_DIR"])

gipl_dst_path = Path(os.environ["DST_DIR"])
gipl_dst_path.mkdir(parents=True, exist_ok=True)

extract_path = Path(os.environ["EXTRACT_DIR"])
extract_path.mkdir(parents=True, exist_ok=True)

output_path = Path(os.environ["OUTPUT_DIR"])
output_path.mkdir(parents=True, exist_ok=True)

archive_path = Path(os.environ["ARCHIVE_DIR"])
archive_path.mkdir(parents=True, exist_ok=True)

## 1 - Fetch

We don't need to download external data. These files are on our file system courtesy of Sergey Marchenko, and the data were initially fetched via

```shell
cp -r /atlas_scratch/ssmarchenko/ssmarchenko_home/2022_CRREL_proj /atlas_scratch/cparr4/new_gipl_eda_marchenko_revision
```
but we can include a fetch function to grab data if we need it. These source data (about 58 GB) are compressed `.zip` archives.

In [2]:
print("Executing Step 1 (Fetch Data)...\n")

fps = [x for x in gipl_src_path.rglob("*.zip")]
print(f"{len(fps)} .zip files are currently located at {gipl_src_path}.")


if COPY_SOURCE:
    new_fps = [gipl_dst_path / x.name for x in fps]
    print(f"Copying {len(new_fps)} .zip files to {gipl_dst_path}")
    for src, dst in zip(fps, new_fps):
        shutil.copy(src, dst)
else:
    print("\nNo files were copied from the source directory to the project directory.\n")
    
zip_fps = sorted([x for x in gipl_dst_path.rglob("*.zip")])
print(f"{len(zip_fps)} .zip files are currently located at {gipl_dst_path}.")

Executing Step 1 (Fetch Data)...

66 .zip files are currently located at /atlas_scratch/ssmarchenko/ssmarchenko_home/2022_CRREL_proj.

No files were copied from the source directory to the project directory.

66 .zip files are currently located at /atlas_scratch/cparr4/new_gipl_eda_marchenko_revision/2022_CRREL_proj.


## 2 - Extracting

There is quite a bit of data to extract. I recommend skipping this step while testing if possible, i.e., just use `cparr4`'s scratch directory or ask around to see if anyone has the extracted data stashed. If you do need to extract, you can run this command in a notebook cell to find and unzip all the compressed data. Or just fire it off in your local shell, minus the leading bang (`!`).

```shell
!find $DST_DIR -name '*.zip' -exec unzip -q -d $EXTRACT_DIR {} \;
```

That command will produce a flat directory of the **6000 GeoTIFFs** that are compressed within the source `.zip` files. These data are 245 GB on disk once extracted.
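If shelling out is not convenient, the same extraction can be sketched in pure Python with the standard-library `zipfile` module. This is an illustrative alternative, not the method used for this dataset, and `extract_flat` is a hypothetical helper name. Unlike `unzip -d`, which preserves any directory structure stored inside an archive, this version explicitly drops directory components, so members that share a file name would overwrite one another.

```python
import shutil
import zipfile
from pathlib import Path


def extract_flat(zip_dir, extract_dir):
    """Extract every file from every .zip under zip_dir into one flat directory."""
    extract_dir = Path(extract_dir)
    extract_dir.mkdir(parents=True, exist_ok=True)
    n_extracted = 0
    for zip_fp in sorted(Path(zip_dir).rglob("*.zip")):
        with zipfile.ZipFile(zip_fp) as zf:
            for member in zf.infolist():
                if member.is_dir():
                    continue
                # Keep only the file name so the output directory stays flat.
                target = extract_dir / Path(member.filename).name
                # Stream the member to disk instead of loading it into memory.
                with zf.open(member) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                n_extracted += 1
    return n_extracted
```

In this notebook it could be called as `extract_flat(os.environ["DST_DIR"], os.environ["EXTRACT_DIR"])`, after which the returned count can be compared against the expected 6000.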

We'll do a quick validation of the file paths to make sure all the GeoTIFFs are in place. We know from the EDA that there are some file naming inconsistencies, so we'll do a coarse check just on the basis of model-scenario combinations and not worry about variables for the moment.

In [3]:
fps = [x for x in extract_path.rglob("*.tif")]
rcp45_fps = [x for x in fps if "rcp45" in x.name.lower()]
rcp85_fps = [x for x in fps if "rcp85" in x.name.lower()]

gfdl_rcp45_fps = [x for x in rcp45_fps if "gfdl" in x.name.lower()]
ncar_rcp45_fps = [x for x in rcp45_fps if "ncar" in x.name.lower()]
fivemodel_rcp45_fps = [x for x in rcp45_fps if "5mod" in x.name.lower()]

gfdl_rcp85_fps = [x for x in rcp85_fps if "gfdl" in x.name.lower()]
ncar_rcp85_fps = [x for x in rcp85_fps if "ncar" in x.name.lower()]
fivemodel_rcp85_fps = [x for x in rcp85_fps if "5mod" in x.name.lower()]

# check equal number of geotiffs across the different models and scenarios
assert len(rcp45_fps) == len(rcp85_fps), f"Each scenario does not have the same number of files. RCP 4.5 has {len(rcp45_fps)} GeoTIFFs, but RCP 8.5 has {len(rcp85_fps)}."
assert len(gfdl_rcp85_fps) == len(ncar_rcp85_fps) == len(fivemodel_rcp85_fps), "Number of files per model unequal for RCP 8.5!"
assert len(gfdl_rcp45_fps) == len(ncar_rcp45_fps) == len(fivemodel_rcp45_fps), "Number of files per model unequal for RCP 4.5!"
assert len(gfdl_rcp45_fps) == len(ncar_rcp45_fps) == len(fivemodel_rcp45_fps) == len(gfdl_rcp85_fps) == len(ncar_rcp85_fps) == len(fivemodel_rcp85_fps)

print(f"There are {len(fps)} GeoTIFFs to process.")

There are 6000 GeoTIFFs to process.


## 3 - Specify Output Parameters
### 3.1 File Naming: CCCC (Create Consistent Cohesive Convention)

We've already segmented the 6000 GeoTIFFs by model and scenario combination. The dataset has ten variables, so we need a convention that reads something like `prefix_model_scenario_variable_year.tif`. We'll probably need regular expressions to extract the year and variable from the existing file names.

In [4]:
def get_re_year(fp):
    """Fetch a single year (YYYY) from a file name."""
    year = re.match(r'.*([1-3][0-9]{3})', fp).group(1)
    return year


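def get_re_depth(fp):
    """Fetch depth from a file name for mean annual ground temperature (magt) variables."""
    try:
        # escape the dot so that "0.5" is matched literally
        depth = re.match(r'.*(0\.5|1|2|3|4)m_', fp).group(1)
    except AttributeError:
        # no 0.5-4 m depth token matched, so fall back to the 5 m depth
        depth = "5"
    return depth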


def get_re_permafrost_var(fp):
    """Fetch the permafrost variable from a file name."""
    if "magt" in fp:
        if "surf" in fp:
            depth = "surface_"
        else:
            depth = get_re_depth(fp) + "m_"
        pf_var = f"magt{depth}degC_"
    elif "talik" in fp:
        pf_var = "talikthickness_m_"
    elif "base" in fp:
        pf_var = "permafrostbase_m_"
    elif "top" in fp:
        pf_var = "permafrosttop_m_"
    else:
        # fail loudly instead of returning an unbound variable
        raise ValueError(f"No permafrost variable found in file name: {fp}")

    return pf_var


def create_new_filename(original_fp, new_filename_prefix):
    """Build the standardized file name for an original GeoTIFF path."""
    fp_name = original_fp.name.lower()
    new_fname = f"{new_filename_prefix}{get_re_permafrost_var(fp_name)}{get_re_year(fp_name)}.tif"
    return new_fname

In [5]:
# test our parsing and generation with a few random selections
create_new_filename(random.choice(fps), "test_")

'test_permafrosttop_m_2091.tif'
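To make the target convention concrete, here is a minimal, self-contained sketch that assembles `prefix_model_scenario_variable_year.tif` names and raises an error when a name cannot be parsed. The input file names used with it are hypothetical stand-ins (the real source names vary, per the EDA), and `build_name` and `KEYWORD_TO_VARIABLE` are illustrative names rather than part of the pipeline.

```python
import re

# Keyword found in a source name -> standardized variable token.
KEYWORD_TO_VARIABLE = {
    "talik": "talikthickness_m",
    "base": "permafrostbase_m",
    "top": "permafrosttop_m",
}


def build_name(prefix, original_name):
    """Assemble prefix + variable + year from a (hypothetical) source file name."""
    name = original_name.lower()
    # Greedy match, so the last four-digit run starting with 1-3 is the year.
    year_match = re.match(r".*([1-3][0-9]{3})", name)
    if year_match is None:
        raise ValueError(f"No year found in {original_name!r}")

    if "magt" in name:
        if "surf" in name:
            variable = "magtsurface_degC"
        else:
            depth_match = re.search(r"(0\.5|[1-4])m_", name)
            # Mirror the notebook's fallback: no 0.5-4 m token means 5 m.
            depth = depth_match.group(1) if depth_match else "5"
            variable = f"magt{depth}m_degC"
    else:
        for keyword, variable in KEYWORD_TO_VARIABLE.items():
            if keyword in name:
                break
        else:
            raise ValueError(f"No known variable in {original_name!r}")

    return f"{prefix}{variable}_{year_match.group(1)}.tif"


# Hypothetical inputs, for illustration only.
print(build_name("gipl_GFDL-CM3_rcp85_", "GFDL_RCP85_magt_0.5m_2050.tif"))
print(build_name("gipl_NCAR-CCSM4_rcp45_", "NCAR_rcp45_talik_2021.tif"))
```

Raising on an unparseable name is a design choice for this sketch: with 6000 files it is easier to catch a naming surprise at this stage than to discover a mislabeled raster later.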

### 3.2 Raster Creation Profile

We know from our EDA work that there is a slight drift in the Affine transform across the different raster files. We should probably use the Affine transform that is native to the majority of the model outputs. We'll also want to make sure LZW compression and a NoData value of -9999 are set.

In [6]:
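def metadata_check(directory):
    """Collect the rasterio profile of every GeoTIFF in a directory."""
    all_meta = []
    fps = [x for x in directory.glob("*.tif")]
    read_lock = threading.Lock()

    def process(fp):
        # open with a context manager so each file handle is closed
        with rio.open(fp) as src:
            profile = src.profile
        with read_lock:
            all_meta.append(profile)

    # We map the process() function over the list of files
    with concurrent.futures.ThreadPoolExecutor(
        max_workers=int(os.getenv("NCORES"))
    ) as executor:
        # consume the iterator so that worker exceptions are raised here
        list(executor.map(process, fps))

    return all_meta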

In [7]:
meta = metadata_check(extract_path)

In [8]:
# Example metadata object
meta[0]

{'driver': 'GTiff', 'dtype': 'float32', 'nodata': -3.4028234663852886e+38, 'width': 2471, 'height': 1941, 'count': 1, 'crs': CRS.from_epsg(3338), 'transform': Affine(1000.0, 0.0, -979791.7089865601,
       0.0, -1000.0, 2375479.7509745434), 'blockxsize': 128, 'blockysize': 128, 'tiled': True, 'interleave': 'band'}

In [9]:
tx_c_values = []
for j in meta:
    tx_c_values.append(j["transform"].c)
tx_f_values = []
for j in meta:
    tx_f_values.append(j["transform"].f)

In [10]:
np.unique(tx_c_values, return_counts=True)

(array([-979791.70898656, -979791.70898656, -979791.70869206]),
 array([2000, 2000, 2000]))

In [11]:
np.unique(tx_f_values, return_counts=True)

(array([2375479.75097454, 2375479.75097454, 2375479.75138362]),
 array([2000, 2000, 2000]))

In [12]:
# rounding to three places yields consistent c and f Affine transform values
np.unique(np.round(tx_c_values,3))[0]

-979791.709

In [13]:
np.unique(np.round(tx_f_values,3))[0]

2375479.751

In [14]:
new_transform = Affine(1000.0, 0.0, np.unique(np.round(tx_c_values,3))[0],
                       0.0, -1000.0, np.unique(np.round(tx_f_values,3))[0])
new_transform

Affine(1000.0, 0.0, -979791.709,
       0.0, -1000.0, 2375479.751)
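To put the size of that drift in perspective, the sketch below measures the spread of the origin values against the 1000 m pixel size. The numbers are copied from the printed `np.unique` outputs, which show only two distinguishable values per axis at that print precision, so treat this as an approximation of the real spread rather than an exact figure.

```python
import numpy as np

# Distinguishable upper-left x (c) and y (f) origins from the printed outputs.
c_values = np.array([-979791.70898656, -979791.70869206])
f_values = np.array([2375479.75097454, 2375479.75138362])
pixel_size_m = 1000.0

# Spread (max - min) of each origin, in meters.
c_spread = c_values.max() - c_values.min()
f_spread = f_values.max() - f_values.min()

print(f"x-origin spread: {c_spread * 1000:.2f} mm")
print(f"y-origin spread: {f_spread * 1000:.2f} mm")
print(f"largest drift as a fraction of a pixel: {max(c_spread, f_spread) / pixel_size_m:.1e}")

# Rounding to three decimals (millimeter precision) collapses each axis to one value.
c_rounded = np.unique(np.round(c_values, 3))
f_rounded = np.unique(np.round(f_values, 3))
print(c_rounded, f_rounded)
```

Both spreads come out below half a millimeter, far smaller than a pixel, which is why rounding to three decimal places is enough to give every raster the same origin.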

In [15]:
height_values = []
for j in meta:
    height_values.append(j["height"])
np.unique(height_values, return_counts=True)

(array([1941]), array([6000]))

In [16]:
width_values = []
for j in meta:
    width_values.append(j["width"])
np.unique(width_values, return_counts=True)

(array([2471]), array([6000]))

In [17]:
crs = CRS(3338)

In [18]:
# we will fix this in the new raster creation profile
meta[0]["nodata"]

-3.4028234663852886e+38
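That source NoData value is the most negative finite `float32`, a common default fill value. The sketch below checks this and shows the replacement on a toy array. The 2 x 2 values are made up for illustration, and `np.where` is used here to return a new array; that is one option and not necessarily how the pipeline itself does it.

```python
import numpy as np

SRC_NODATA = -3.4028234663852886e+38  # value reported in the source metadata
DST_NODATA = -9999

# The source NoData value equals the minimum finite float32.
is_float32_min = SRC_NODATA == float(np.finfo(np.float32).min)
print(is_float32_min)

# Toy 2 x 2 band with two NoData cells (illustrative values only).
band = np.array([[SRC_NODATA, -1.5], [0.25, SRC_NODATA]], dtype=np.float32)

# Swap the source NoData value for -9999 while keeping the float32 dtype.
fixed = np.where(band == np.float32(SRC_NODATA), np.float32(DST_NODATA), band)
print(fixed)
print(fixed.dtype)
```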

In [19]:
profile = {
    "driver": "GTiff",
    "crs": crs,
    "transform": new_transform,
    "width":  meta[0]["width"],
    "height": meta[0]["height"],
    "count": 1,
    "dtype": np.float32,
    "nodata": -9999,
    "tiled": False,
    "compress": "lzw",
    "interleave": "band",
}

## 4 - Make the Dataset

In [23]:
def get_array_values(fp):
    """Read the first band of a GeoTIFF into a NumPy array."""
    with rio.open(fp) as src:
        arr_values = src.read(1)
    return arr_values


def fix_nodata_values(arr):
    """Replace the source NoData value with -9999, in place."""
    arr[arr == meta[0]["nodata"]] = -9999
    return arr


def write_new_geotiff(out_fp, arr):
    """Write a single-band GeoTIFF using the shared creation profile."""
    with rio.open(out_fp, "w", **profile) as dst:
        dst.write(arr, 1)


def run_new_geotiffs(args):
    """Read, fix, and rewrite one GeoTIFF; args is an (input, output) path pair."""
    in_fp, out_fp = args
    write_new_geotiff(out_fp, fix_nodata_values(get_array_values(in_fp)))


def make_permafrost_dataset(original_fps, new_filename_prefix):
    """Rewrite a list of GeoTIFFs in parallel under the new naming convention."""
    args = []
    for raster_src in original_fps:
        args.append((
            raster_src,
            output_path.joinpath(create_new_filename(raster_src, new_filename_prefix)),
        ))
    with Pool(int(os.getenv("NCORES"))) as pool:
        for _ in tqdm.tqdm(
            pool.imap_unordered(run_new_geotiffs, args), total=len(args)
        ):
            pass

In [22]:
make_permafrost_dataset(fivemodel_rcp45_fps, "gipl_5ModelAvg_rcp45_")

100%|█████████████████████████████████████████████████████████| 1000/1000 [01:27<00:00, 11.43it/s]


In [24]:
make_permafrost_dataset(fivemodel_rcp85_fps, "gipl_5ModelAvg_rcp85_")

100%|█████████████████████████████████████████████████████████| 1000/1000 [00:28<00:00, 35.70it/s]


In [25]:
make_permafrost_dataset(gfdl_rcp85_fps, "gipl_GFDL-CM3_rcp85_")

100%|█████████████████████████████████████████████████████████| 1000/1000 [01:23<00:00, 12.03it/s]


In [26]:
make_permafrost_dataset(gfdl_rcp45_fps, "gipl_GFDL-CM3_rcp45_")

100%|█████████████████████████████████████████████████████████| 1000/1000 [01:31<00:00, 10.92it/s]


In [29]:
make_permafrost_dataset(ncar_rcp45_fps, "gipl_NCAR-CCSM4_rcp45_")

100%|█████████████████████████████████████████████████████████| 1000/1000 [00:33<00:00, 29.57it/s]


In [28]:
make_permafrost_dataset(ncar_rcp85_fps, "gipl_NCAR-CCSM4_rcp85_")

100%|█████████████████████████████████████████████████████████| 1000/1000 [01:21<00:00, 12.30it/s]


In [30]:
output_fps = [x for x in output_path.glob("*.tif")]
assert len(output_fps) == 6000, f"Only {len(output_fps)} GeoTIFFs were created, but 6000 were expected."

## 5 - Archive the data

Stash the data in the *backed-up-Rasdaman-pot-of-SNAP-gold<sup>TM</sup>*. For the Arctic-EDS, that's here: `/workspace/Shared/Tech_Projects/Arctic_EDS/project_data/rasdaman_datasets/crrel_gipl_outputs/` (the value of `ARCHIVE_DIR`). We expect these GIPL outputs will hit the storefront in a few different ways (ARDAC, etc.), but this is a good spot for right now.
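Because the archive is the long-lived copy, it may be worth verifying file contents as well as file counts. The sketch below is one hedged way to do that with SHA-256 checksums from the standard library. `sha256sum` and `copy_and_verify` are hypothetical helpers, and checksum verification is an extra step beyond what this notebook does.

```python
import hashlib
import shutil
from pathlib import Path


def sha256sum(fp, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(fp, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src_fps, dst_dir):
    """Copy files into dst_dir and return the copies whose checksum differs."""
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    mismatches = []
    for src in src_fps:
        dst = dst_dir / Path(src).name
        shutil.copy(src, dst)
        if sha256sum(src) != sha256sum(dst):
            mismatches.append(dst)
    return mismatches
```

An empty list back from `copy_and_verify(output_fps, archive_path)` would indicate that every archived file matches its source byte for byte, at the cost of reading each file twice.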

In [31]:
if COPY_OUTPUTS_TO_ARCHIVE:
    archive_fps = [archive_path / x.name for x in output_fps]
    print(f"Copying {len(archive_fps)} files to {archive_path}...")
    for src, dst in zip(output_fps, archive_fps):
        shutil.copy(src, dst)
    assert len([x for x in archive_path.rglob("*.tif")]) == len(output_fps)
else:
    print("No files were copied from the project output directory to the archive directory.")


No files were copied from the project output directory to the archive directory.


# Pipeline Complete!