# CRU TS 4.0 Preprocessing

Use this notebook to preprocess SNAP's 2 km CRU TS 4.0 [temperature](http://ckan.snap.uaf.edu/dataset/historical-monthly-and-derived-temperature-products-downscaled-from-cru-ts-data-via-the-delta-m) and [precipitation](http://ckan.snap.uaf.edu/dataset/historical-monthly-and-derived-precipitation-products-downscaled-from-cru-ts-data-via-the-delta-m) data for ingest into rasdaman. The code computes mean and standard deviation layers for each season over the period 1950-2009 from the monthly means provided in the CRU TS 4.0 dataset (currently the only non-decadal dataset). It then clips these new summary GeoTIFFs to the IEM extent.

## Setup

1. The CRU TS 4.0 GeoTIFFs need to be in a single folder for each variable, and these folders should all be in the same directory. The path to that directory may be stored in the `$SCRATCH_DIR` environment variable or set in the first code cell below.
2. The shapefile for the IEM domain can be found in the [geospatial-vector-veracity](https://github.com/ua-snap/geospatial-vector-veracity/blob/706c56855885165eab2c4817e8ca8a4ffb9d751a/vector_data/polygon/boundaries/iem/AIEM_domain.shp) repo. Clone this repo and either change the input path in the first code cell below or set the `$GVV_DIR` environment variable.

Averaged and clipped rasters will be written to a single output folder named after the working data directory plus the suffix `_1950-2009_<stat>_<season>_iem_domain`.


In [1]:
import os
from pathlib import Path

# set paths to datasets and geospatial-vector-veracity repo
scratch_path = os.getenv("SCRATCH_DIR") or "/atlas_scratch/kmredilla/iem-webapp"
gvv_path = os.getenv("GVV_DIR") or "/workspace/UA/kmredilla/geospatial-vector-veracity"

scratch_dir = Path(scratch_path)
gvv_dir = Path(gvv_path)

## Get the extent of the clipped domain

It is most efficient to only average over the extent of the IEM domain, since we will be clipping the final data to that polygon. Get the extent values from the domain shapefile:

In [2]:
import geopandas as gpd
import numpy as np


# get the extent of the IEM domain from its shapefile
# so that we only average over the area we intend to keep

# IEM domain
# copied from /workspace/Shared/Tech_Projects/Alaska_IEM/project_data/IEM_Domain.zip
aiem_domain_gdf = gpd.read_file(gvv_dir.joinpath("vector_data/polygon/boundaries/iem/AIEM_domain.shp"))
        
bounds = {bound: value for bound, value in zip(["wb", "sb", "eb", "nb"], aiem_domain_gdf.bounds.values[0])}

## Process

Use rasterio to do a windowed read of the rasters using the extent of the IEM domain. 

Set the `data_dir_name` variable below to be the name of the folder to work on.
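
The conversion from map bounds to a pixel window can be sketched without rasterio. This is a minimal illustration that assumes a north-up raster with square pixels. The function name `bounds_to_window` and all numeric values are hypothetical and are not taken from the CRU TS 4.0 files.

```python
import math


def bounds_to_window(bounds, x_origin, y_origin, res):
    """Convert map bounds to (row_off, col_off, height, width).

    Assumes a north-up raster whose upper-left corner is at
    (x_origin, y_origin) and whose pixels are res x res map units.
    """
    # columns increase eastward from the left edge
    col_start = math.floor((bounds["wb"] - x_origin) / res)
    col_stop = math.floor((bounds["eb"] - x_origin) / res)
    # rows increase southward from the top edge
    row_start = math.floor((y_origin - bounds["nb"]) / res)
    row_stop = math.floor((y_origin - bounds["sb"]) / res)
    return row_start, col_start, row_stop - row_start, col_stop - col_start


# toy values for illustration only
toy_bounds = {"wb": 10, "sb": 20, "eb": 50, "nb": 80}
toy_window = bounds_to_window(toy_bounds, x_origin=0, y_origin=100, res=2)
```

With these toy values the window starts at row 10, column 5 and spans 30 rows by 20 columns. In the notebook itself the same lookup is delegated to the raster's own transform through `src.index`.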

In [3]:
data_dir_name = "cru_ts40_2km_monthly_tas" 

In [6]:
import rasterio as rio
from multiprocessing import Pool

from rasterio.plot import show
from rasterio.mask import mask
from rasterio.windows import Window

from tqdm.notebook import tqdm


# setup seasons to iterate over
seasons = {
    "DJF": [12, 1, 2],
    "MAM": [3, 4, 5],
    "JJA": [6, 7, 8],
    "SON": [9, 10, 11],
}

# open a single file to get the row/column values for 
# windowed reading using the domain extent bounds
fp = list(scratch_dir.joinpath(data_dir_name).glob("*.tif"))[0]
with rio.open(fp) as src:
    row_start = src.index(bounds["wb"], bounds["nb"])[0]
    row_stop = src.index(bounds["wb"], bounds["sb"])[0]
    col_start = src.index(bounds["wb"], bounds["sb"])[1]
    col_stop = src.index(bounds["eb"], bounds["sb"])[1]
    # also save metadata for later
    meta = src.meta.copy()

# create the window object for reuse
window = Window.from_slices(slice(row_start, row_stop), slice(col_start, col_stop))

# get the window transform for the new (windowed) raster
with rio.open(fp) as src:
    win_transform = src.window_transform(window)



def read_tif(args):
    """Read a geotiff using a window"""
    fp, window = args
    with rio.open(fp) as src:
        return src.read(1, window=window)
    

def run_summary(fps, window):
    """Run the summarization of the data files. Reads the data
    into a single array and returns the per-pixel mean and
    standard deviation arrays.
    """
    arrs = []
    args = [(fp, window) for fp in fps]
    with Pool(32) as pool:
        for out in tqdm(pool.imap(read_tif, args), total=len(args)):
            arrs.append(out)
    
    arr = np.array(arrs)
    
    mean_arr = arr.mean(axis=0)
    std_arr = arr.std(axis=0)
    
    return mean_arr, std_arr


# def clip_raster():
#     """Clip a raster to clip_poly, crop to extent, 
#     and write to out_dir with same filename
#     """
#     out_image, out_transform = mask(src, geoms, invert=True)
    
#     in_fp, clip_poly, bounds, out_dir = args
#     with rio.open(in_fp) as src:
#         arr, new_transform = mask(src, clip_poly)
#         meta = src.meta.copy()
#     arr = arr[0][bounds["nb"]:bounds["sb"], bounds["wb"]:bounds["eb"]]
#     out_fp = out_dir.joinpath(in_fp.name.replace(".tif", "_iem_domain.tif"))
#     meta["transform"] = new_transform
#     with rio.open(out_fp, "w", **meta) as dst:
#         dst.write(arr, 1)
        
#     return out_fp

Window(col_off=717, row_off=84, width=1280, height=931)
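
The reduction inside `run_summary` stacks the windowed reads along a new leading axis and collapses that axis. The sketch below shows the same step on three small made-up grids. The values are hypothetical and stand in for the arrays returned by `read_tif`.

```python
import numpy as np

# three hypothetical 2x2 monthly grids standing in for windowed reads
monthly = [
    np.array([[1.0, 2.0], [3.0, 4.0]]),
    np.array([[3.0, 4.0], [5.0, 6.0]]),
    np.array([[5.0, 6.0], [7.0, 8.0]]),
]

stack = np.array(monthly)      # shape: (time, rows, cols)
mean_arr = stack.mean(axis=0)  # per-pixel mean over time
std_arr = stack.std(axis=0)    # population std (ddof=0), NumPy's default
```

Note that `std` uses `ddof=0` unless told otherwise, so the result is the population standard deviation. Passing `ddof=1` would give the sample version if that is what the summary layers should contain.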

In [14]:
src_dir = scratch_dir.joinpath(data_dir_name)

# iterate over seasons
for season, months in seasons.items():
    fps = []
    for month in months:
        # extend (not append) so fps is a flat list of paths
        # rather than a list of glob generators
        fps.extend(sorted(src_dir.glob(f"*_{month:02d}_*.tif")))
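
The loop above collects every file for a season's months. Restricting that list to the 1950-2009 period mentioned in the introduction could look like the following sketch. It assumes filenames end in `_<MM>_<YYYY>.tif`, which is inferred from the glob pattern and not confirmed by the source. The helper `select_season_files` and the sample filenames are hypothetical.

```python
import re

seasons = {
    "DJF": [12, 1, 2],
    "MAM": [3, 4, 5],
    "JJA": [6, 7, 8],
    "SON": [9, 10, 11],
}

# assumed filename ending: ..._<MM>_<YYYY>.tif
MONTH_YEAR = re.compile(r"_(\d{2})_(\d{4})\.tif$")


def select_season_files(names, months, start_year=1950, end_year=2009):
    """Return the sorted names whose month is in `months` and whose
    year falls within [start_year, end_year]."""
    selected = []
    for name in names:
        match = MONTH_YEAR.search(name)
        if match is None:
            continue  # skip names that don't follow the assumed pattern
        month, year = int(match.group(1)), int(match.group(2))
        if month in months and start_year <= year <= end_year:
            selected.append(name)
    return sorted(selected)


# hypothetical filenames for illustration
names = [
    "tas_01_1949.tif",
    "tas_01_1950.tif",
    "tas_07_1980.tif",
    "tas_12_2009.tif",
    "tas_02_2010.tif",
]
djf_files = select_season_files(names, seasons["DJF"])
```

This sketch filters each file by its own calendar year, so a December file is counted with the January and February of the same year. Grouping December with the following year's winter would need an extra step.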
    
    

