# Purpose
## Prepare the 4 km GIPL outputs for active layer thickness (ALT) and mean annual ground temperature (MAGT) for an ingest to Rasdaman.
Currently these data comprise a stack of GeoTIFFs that span different RCP scenarios, eras, and climate models. While Rasdaman can ingest the stack of GeoTIFFs directly, I'll convert it to a multidimensional netCDF to retain the variables (ALT and MAGT) as "field names" in Rasdaman. This is also good training for me, as I'm far less fluent in the netCDF + xarray part of the stack than in GeoTIFF + rasterio.

The goal is to create a completely inclusive datacube of both historical and projected data. 
It will have the following dimensions for both ALT and MAGT variables:
* model
* scenario
* era
* era start (excluded for now, dimension would be redundant with just era)
* era end (excluded for now, dimension would be redundant with just era)
* Y
* X

In [1]:
from pathlib import Path
import numpy as np
import xarray as xr
import pandas as pd
import rasterio as rio
import os

In [2]:
# set up the file paths to the data
data_dir = Path("geotiff3338/")
data_fps = sorted(data_dir.glob("*"))
data_fps[0]

PosixPath('geotiff3338/alt_cruts31_historical_era1995_1986to2005.tif')

The convention is `variable_model_scenario_era_yearRange`, although it is a little odd here: for the two historical files we are calling CRU TS 3.1 a "model" and "historical" a scenario.
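The naming convention above can be unpacked with a small helper. This is only a sketch: `parse_gipl_filename` is a hypothetical name that does not appear elsewhere in this notebook, and it assumes every file follows the five-part, underscore-separated pattern with a `startTOend` year range.

```python
from pathlib import Path


def parse_gipl_filename(fp):
    """Split a GIPL GeoTIFF name into its descriptive components.

    Hypothetical helper; assumes variable_model_scenario_era_yearRange.
    """
    stem = Path(fp).stem  # drop the directory and the .tif extension
    varname, model, scenario, era, year_range = stem.split("_")
    era_start, era_end = year_range.split("to")
    return {
        "varname": varname,
        "model": model,
        "scenario": scenario,
        "era": era[len("era"):],  # "era1995" -> "1995"
        "era start": era_start,
        "era end": era_end,
    }


info = parse_gipl_filename(
    "geotiff3338/alt_cruts31_historical_era1995_1986to2005.tif"
)
print(info)
```

Unpacking into exactly five names means a file that breaks the convention raises a `ValueError` instead of being silently mislabeled.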

In [3]:
# variables needed to describe the data
varnames = ["magt", "alt"]
scenarios = ["historical", "rcp45", "rcp85"]
models = ["cruts31", "gfdlcm3", "gisse2r", "ipslcm5alr", "mricgcm3", "ncarccsm4"]
eras = ["1995", "2025", "2050", "2075", "2095"]
era_starts = ["1986", "2011", "2036", "2061", "2086"]
era_ends = ["2005", "2040", "2065", "2090", "2100"]
units_lu = {"magt": "°C", "alt": "m"}

# integer encoding of the string labels for the netCDF coords (Rasdaman wants this)
# the codes run continuously across dimensions; if the ingest fails, try restarting from 0 in each dictionary.
era_encoding = {"1995": 0, "2025": 1, "2050": 2, "2075": 3, "2095": 4}
scenario_encoding = {"rcp45": 5, "rcp85": 6, "historical": 7}
model_encoding = {"gfdlcm3": 8, "gisse2r": 9, "cruts31": 10, "ipslcm5alr": 11, "mricgcm3": 12, "ncarccsm4": 13}
all_encoding = {**units_lu, **era_encoding, **scenario_encoding, **model_encoding}

# get x and y dimensions from a single file
with rio.open(data_fps[0]) as src:
    src_meta = src.meta.copy()
    # get x and y coordinates for axes
    y = np.array([src.xy(i, 0)[1] for i in np.arange(src.height)])
    x = np.array([src.xy(0, j)[0] for j in np.arange(src.width)])
    # get the number of pixels
    ny, nx = src.height, src.width
    

In [4]:
# creating a dictionary from all the raster files
# the directory of rasters to dict is kind of boilerplate at this point
# key is the filename, value is a subdictionary with keys for each characteristic
# we'll force -9999.0 as the no data value for good measure.
data_di = {}

for fp in data_fps:
    fn = fp.name.split(".tif")[0]
    data_di[fn] = {}
    fn_components = fn.split("_")
    data_di[fn]["varname"] = fn_components[0]
    data_di[fn]["model"] = fn_components[1]
    data_di[fn]["scenario"] = fn_components[2]
    data_di[fn]["era"] = fn_components[3][-4:]
    data_di[fn]["era start"] = fn_components[4][0:4]
    data_di[fn]["era end"] = fn_components[4][-4:]
    
    with rio.open(fp) as src:
    
        arr = src.read(1)
        arr[np.isnan(arr)] = -9999.0
        data_di[fn]["arr"] = arr

data_di['alt_gfdlcm3_rcp45_era2025_2011to2040']

{'varname': 'alt',
 'model': 'gfdlcm3',
 'scenario': 'rcp45',
 'era': '2025',
 'era start': '2011',
 'era end': '2040',
 'arr': array([[-9999., -9999., -9999., ..., -9999., -9999., -9999.],
        [-9999., -9999., -9999., ..., -9999., -9999., -9999.],
        [-9999., -9999., -9999., ..., -9999., -9999., -9999.],
        ...,
        [-9999., -9999., -9999., ..., -9999., -9999., -9999.],
        [-9999., -9999., -9999., ..., -9999., -9999., -9999.],
        [-9999., -9999., -9999., ..., -9999., -9999., -9999.]],
       dtype=float32)}

### Higher Dimensions
This is where it gets interesting. We need to define the shape of our data cube on a per-variable basis. In this instance that'll be era × model × scenario × y-coordinate × x-coordinate. That's a 5-dimensional *hypercube* for those scoring at home (space (x and y) plus time (era) plus model and scenario). There may be a simpler way to do this, but setting up arrays full of no data (e.g. -9999) is a good start and will act as a governor when it comes to pushing data, because if we exceed the indices of the array, numpy will yell at us. It is also a memory check, but that shouldn't be an issue on Apollo / Zeus.
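As a rough illustration of both points (the memory check and numpy acting as a governor), here is a minimal sketch. It assumes the 489 by 914 grid that the next cell prints, and it uses a tiny stand-in cube called `demo` (a made-up name) so the bounds check can be shown without allocating the full array.

```python
import numpy as np

NODATA = -9999.0
shape = (5, 6, 3, 489, 914)  # (era, model, scenario, y, x)

# Estimate memory before allocating: float32 uses 4 bytes per value.
n_bytes = int(np.prod(shape)) * np.dtype(np.float32).itemsize
print(f"{n_bytes / 1024**2:.1f} MiB per variable")

# A tiny stand-in cube shows the guard-rail behaviour.
demo = np.full((2, 3, 2, 4, 5), NODATA, dtype=np.float32)
try:
    demo[2, 0, 0] = 1.0  # index 2 does not exist along an axis of length 2
except IndexError as err:
    print("rejected:", err)
```

The estimate is per variable, so holding the MAGT and ALT cubes plus the place-holder in memory at once needs roughly three times that figure.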

In [5]:
# set up a multidimensional array
arr_shape = (len(eras),
             len(models),
             len(scenarios),
             ny,
             nx)

out_arr = np.full(arr_shape, -9999.0, dtype=np.float32)
print(out_arr.shape)

(5, 6, 3, 489, 914)


This place-holder array checks out: 5 possible eras, 6 possible models, 3 possible scenarios. Specifying `dtype` here is important; it should match the `dtype` of the input GeoTIFFs. We are not done initializing arrays though. The hypercube needs to get filled even where data do not exist because of invalid dimensional combinations. For example, we have no "historical-ncarccsm4" scenario-model combination GeoTIFF (because it is nonsense), but we should create an array we can push to the hypercube for those indices.

In [6]:
# set up a "null" array for invalid dimensional combos by grabbing a slice of the place-holder array
null_arr = out_arr[0, 0, 0].copy()
print(null_arr.shape) 

(489, 914)
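To get a feel for how many cells of the hypercube will end up holding only the null array, the combinations can be enumerated. This is a sketch under an assumed pairing rule inferred from the file names: CRU TS 3.1 appears only with the historical scenario and the 1995 era, and the five GCMs appear only with the RCP scenarios and the later eras. The function `is_valid` is a hypothetical helper, not part of the workflow below.

```python
from itertools import product

eras = ["1995", "2025", "2050", "2075", "2095"]
models = ["cruts31", "gfdlcm3", "gisse2r", "ipslcm5alr", "mricgcm3", "ncarccsm4"]
scenarios = ["historical", "rcp45", "rcp85"]


def is_valid(era, model, scenario):
    """Assumed rule: CRU TS 3.1 is historical-only, GCMs are projected-only."""
    historical = model == "cruts31" and scenario == "historical" and era == "1995"
    projected = model != "cruts31" and scenario != "historical" and era != "1995"
    return historical or projected


combos = list(product(eras, models, scenarios))
valid = [c for c in combos if is_valid(*c)]
print(len(combos), "cells in the cube,", len(valid), "with data per variable")
```

If the rule holds, well under half of the cells per variable carry real data, which is one reason the compressed netCDF stays small.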


Now we convert the dictionary full of raster data to a DataFrame where each row is a file and columns reflect the data and the describing characteristics. I'm not convinced this step is totally necessary, but querying a dictionary, especially a nested dictionary, is sort of fraught. The DataFrame is a bit more friendly. 

In [7]:
df = pd.DataFrame.from_dict(data_di).sort_index().T
df

| | arr | era | era end | era start | model | scenario | varname |
|---|---|---|---|---|---|---|---|
| alt_cruts31_historical_era1995_1986to2005 | [[-9999.0, -9999.0, -9999.0, -9999.0, -9999.0,... | 1995 | 2005 | 1986 | cruts31 | historical | alt |
| alt_gfdlcm3_rcp45_era2025_2011to2040 | [[-9999.0, -9999.0, -9999.0, -9999.0, -9999.0,... | 2025 | 2040 | 2011 | gfdlcm3 | rcp45 | alt |
| alt_gfdlcm3_rcp45_era2050_2036to2065 | [[-9999.0, -9999.0, -9999.0, -9999.0, -9999.0,... | 2050 | 2065 | 2036 | gfdlcm3 | rcp45 | alt |
| alt_gfdlcm3_rcp45_era2075_2061to2090 | [[-9999.0, -9999.0, -9999.0, -9999.0, -9999.0,... | 2075 | 2090 | 2061 | gfdlcm3 | rcp45 | alt |
| alt_gfdlcm3_rcp45_era2095_2086to2100 | [[-9999.0, -9999.0, -9999.0, -9999.0, -9999.0,... | 2095 | 2100 | 2086 | gfdlcm3 | rcp45 | alt |
| ... | ... | ... | ... | ... | ... | ... | ... |
| magt_ncarccsm4_rcp45_era2095_2086to2100 | [[-9999.0, -9999.0, -9999.0, -9999.0, -9999.0,... | 2095 | 2100 | 2086 | ncarccsm4 | rcp45 | magt |
| magt_ncarccsm4_rcp85_era2025_2011to2040 | [[-9999.0, -9999.0, -9999.0, -9999.0, -9999.0,... | 2025 | 2040 | 2011 | ncarccsm4 | rcp85 | magt |
| magt_ncarccsm4_rcp85_era2050_2036to2065 | [[-9999.0, -9999.0, -9999.0, -9999.0, -9999.0,... | 2050 | 2065 | 2036 | ncarccsm4 | rcp85 | magt |
| magt_ncarccsm4_rcp85_era2075_2061to2090 | [[-9999.0, -9999.0, -9999.0, -9999.0, -9999.0,... | 2075 | 2090 | 2061 | ncarccsm4 | rcp85 | magt |


This DataFrame checks out. Next a nested loop will populate a copy of the place-holder `out_arr` for each data variable (MAGT and ALT in this case). The key thing here is the ORDER. We have to be certain that we are iterating in sync with the shape of the place-holder array. We defined our output data structure so we have to stick to it. Era is the first (technically 0th) dimension, model is the second, and scenario the third.

In [8]:
out_arrs_by_var = []

for var in varnames:
    arr_to_fill = out_arr.copy()
    # iterate in the same order as the array axes: era, model, scenario
    for er, era in enumerate(eras):
        for mn, model in enumerate(models):
            for sc, scenario in enumerate(scenarios):
                query = "era == @era & scenario == @scenario & model == @model"
                try:
                    sub_arr = df[df.varname == var].query(query)["arr"].values[0]
                except IndexError:
                    # no GeoTIFF exists for this combination, so fill with no data
                    sub_arr = null_arr.copy()
                arr_to_fill[er, mn, sc] = sub_arr

    out_arrs_by_var.append(arr_to_fill)
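One way to sanity-check that the nested loop pushed data to the expected cells is to collapse the spatial axes into a boolean "was anything written here" mask. The sketch below uses a small synthetic cube (`cube` is a stand-in, not one of the arrays above) so it runs on its own. It assumes a real raster always contains at least one value that is not no data.

```python
import numpy as np

NODATA = -9999.0

# Stand-in cube: 2 eras x 3 models x 2 scenarios on a 4 x 5 grid.
cube = np.full((2, 3, 2, 4, 5), NODATA, dtype=np.float32)
cube[0, 1, 0] = 1.5   # pretend a raster was pushed to this cell
cube[1, 2, 1] = 0.25  # and to this one

# True where any pixel in the (y, x) slice differs from the no-data value.
filled = (cube != NODATA).any(axis=(-2, -1))
print(filled.shape)
print(np.argwhere(filled))  # index triples (era, model, scenario) holding data
```

Applied to the real arrays, the index triples could be compared against the positions of the labels in `eras`, `models`, and `scenarios` to confirm the axis order.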

In [9]:
varnames

['magt', 'alt']

In [10]:
magt_arr = np.array(out_arrs_by_var[0])
alt_arr = np.array(out_arrs_by_var[1])
print(magt_arr.shape, alt_arr.shape)

(5, 6, 3, 489, 914) (5, 6, 3, 489, 914)


Looks good! A 5-dimensional array for each variable: 5 possible eras, 6 possible models, 3 possible scenarios, Y, X. Now we'll create an xarray Dataset object and prescribe the dimensions. We'll use the integer encoding for the coordinate values to play nice with Rasdaman.

In [11]:
dim_names = ["era", "model", "scenario", "y", "x"]

ds = xr.Dataset(data_vars={"magt": (dim_names, magt_arr),
                           "alt": (dim_names, alt_arr)},
                coords={"era": [era_encoding[era] for era in eras],
                        "model": [model_encoding[model] for model in models],
                        "scenario": [scenario_encoding[scenario] for scenario in scenarios],
                        "y": y,
                        "x": x},
                attrs=all_encoding)

ds
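Because the coordinates are stored as integers, selecting from the cube means translating labels to codes, and reading results means translating back. A minimal sketch of both directions follows. The helpers `encode_selection` and `decode` are hypothetical names, and the dictionaries are repeated here only so the example is self-contained.

```python
era_encoding = {"1995": 0, "2025": 1, "2050": 2, "2075": 3, "2095": 4}
scenario_encoding = {"rcp45": 5, "rcp85": 6, "historical": 7}
model_encoding = {"gfdlcm3": 8, "gisse2r": 9, "cruts31": 10,
                  "ipslcm5alr": 11, "mricgcm3": 12, "ncarccsm4": 13}

encoders = {"era": era_encoding,
            "scenario": scenario_encoding,
            "model": model_encoding}
# Invert each dictionary so integer codes map back to their labels.
decoders = {dim: {code: label for label, code in enc.items()}
            for dim, enc in encoders.items()}


def encode_selection(**labels):
    """Translate label keywords into the integer codes used as coordinates."""
    return {dim: encoders[dim][label] for dim, label in labels.items()}


def decode(dim, code):
    """Translate one integer coordinate back to its string label."""
    return decoders[dim][code]


sel = encode_selection(era="1995", model="cruts31", scenario="historical")
print(sel)
print(decode("model", 13))
```

The resulting dictionary could be unpacked into a selection call such as `ds.sel(**sel)`, which avoids hard-coding the integer codes.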

This is a quick test of the historical ALT data as read straight from the original GeoTIFF and as sliced from the cube to be sure they are identical.

In [12]:
data_fps[0]

PosixPath('geotiff3338/alt_cruts31_historical_era1995_1986to2005.tif')

In [13]:
alt_hist_slice = ds.sel(era=0, model=10, scenario=7).alt

In [14]:
print(type(alt_hist_slice.data))
print(alt_hist_slice.dtype)
print(alt_hist_slice.data.shape)

<class 'numpy.ndarray'>
float32
(489, 914)


In [15]:
# read the original raster, closing the file handle when done
with rio.open(data_fps[0]) as src:
    test_arr = src.read(1)
test_arr.shape

(489, 914)

In [16]:
(alt_hist_slice.data == test_arr).all()

True

In [17]:
# specify encoding to compress
encoding = {"magt": {"zlib": True, "complevel": 9, "_FillValue": -9999.0},
            "alt": {"zlib": True, "complevel": 9, "_FillValue": -9999.0},
           }

In [18]:
ds.to_netcdf("gipl_alt_magt_4km.nc", encoding=encoding)

In [19]:
ls -lhrt

total 18M
drwxr-xr-x. 2 cparr4 snap_users 8.0K Nov 16 07:50 geotiff3338/
-rw-r--r--. 1 cparr4 snap_users  14K Nov 19 14:35 preprocess_GIPL_4km_alt_and_magt.ipynb
-rw-r--r--. 1 cparr4 snap_users  18M Nov 19 14:36 gipl_alt_magt_4km.nc
