# Create Observational Baseline Datasets

This notebook creates the raw variable datasets for HR-ERA5/IMERG, LR-ERA5/IMERG, and LR-ERA5/GPCP needed to calculate the precipitation-buoyancy POD.

## Import Necessary Packages

In [1]:
import xesmf
import warnings
import numpy as np
import xarray as xr
from datetime import datetime
warnings.filterwarnings('ignore')

## User-Defined Configurations

Define the author's name and email, the directory containing the raw variable data (`FILEDIR`), and the directory where the baseline datasets will be saved (`SAVEDIR`).

In [2]:
AUTHOR  = 'Savannah L. Ferretti'      
EMAIL   = 'savannah.ferretti@uci.edu' 
FILEDIR = '/global/cfs/cdirs/m4334/sferrett/monsoon-pod/data/raw'
SAVEDIR = '/global/cfs/cdirs/m4334/sferrett/monsoon-pod/data/interim'

## Create Baseline Datasets

Our three data sources (ERA5, IMERG V06, and GPCP) have different temporal frequencies and spatial resolutions. To create consistent, analysis-ready data, we apply the `format_var()` function to each variable. It loads the `xarray.DataArray` from `FILEDIR`, then, as needed, regrids it bilinearly with [xESMF](https://xesmf.readthedocs.io/en/stable/) and resamples it in time. The `combine_vars()` function then merges the processed variables into a single dataset and stamps it with a `history` attribute.

In [3]:
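def load(filename,filedir=FILEDIR):
    # Build the path from the 'filedir' argument (not the global) so callers can override it
    filepath = f'{filedir}/{filename}'
    da = xr.open_dataarray(filepath)
    # Read the full array into memory
    return da.load()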

def regrid(da,gridtarget):
    if not isinstance(gridtarget,(xr.Dataset,xr.DataArray)):
        raise TypeError("Input 'gridtarget' must be an Xarray.Dataset or Xarray.DataArray")
    regridder = xesmf.Regridder(da,gridtarget,method='bilinear')
    return regridder(da,keep_attrs=True)
  
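def resample(da,frequency,method):
    # Note: pandas >= 2.2 deprecates the 'H' alias in favor of 'h'
    if frequency not in ['H','D']:
        raise ValueError("Frequency must be 'H' (hourly) or 'D' (daily)")
    # Floor timestamps on a new object so the caller's array is not modified in place
    da = da.assign_coords(time=da.time.dt.floor(frequency))
    # Reduce all samples that share a floored timestamp
    if method=='mean':
        return da.groupby('time').mean()
    elif method=='first':
        return da.groupby('time').first()
    elif method=='last':
        return da.groupby('time').last()
    else:
        raise ValueError("Method must be 'mean', 'first', or 'last'")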

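def format_var(filename,filedir=FILEDIR,gridtarget=None,frequency=None,method=None):
    # Resampling needs both a frequency and a method; fail loudly if only one is given
    if (frequency is None) != (method is None):
        raise ValueError("'frequency' and 'method' must be provided together")
    da = load(filename,filedir)
    if gridtarget is not None:
        da = regrid(da,gridtarget)
    if frequency is not None:
        da = resample(da,frequency,method)
    return da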
    
def combine_vars(dalist,author=AUTHOR,email=EMAIL):
    if not isinstance(dalist,list):
        raise TypeError('Input must be a list of Xarray.DataArrays')
    if not all(isinstance(da,xr.DataArray) for da in dalist):
        raise TypeError('All elements in the input list must be Xarray.DataArrays')
    ds = xr.merge(dalist)
    ds.attrs = dict(history=f'Created on {datetime.today().strftime("%Y-%m-%d")} by {author} ({email})')
    print(f'Dataset Size: {ds.nbytes*1e-9:.2f} GB')
    return ds
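
The floor-then-group logic in `resample()` can be illustrated on a small synthetic series. This is only a sketch: the 6-hourly values, the variable name `pr`, and the use of `assign_coords` are assumptions for illustration, not real data. It shows how `'mean'` and `'first'` differ when several samples fall in the same day.

```python
import numpy as np
import xarray as xr

# Synthetic 6-hourly series spanning two days (values 0..7); not real data
start = np.datetime64('2000-01-01T00', 'ns')
time = start + (6 * np.arange(8)).astype('timedelta64[h]')
da = xr.DataArray(np.arange(8.0), coords={'time': time}, dims='time', name='pr')

# Floor every timestamp to the start of its day, then reduce within each day
floored = da.assign_coords(time=da.time.dt.floor('D'))
daily_mean = floored.groupby('time').mean()    # average of the four samples per day
daily_first = floored.groupby('time').first()  # first sample of each day

print(daily_mean.values)
print(daily_first.values)
```

With `'mean'`, every sample in the window contributes to the result. With `'first'`, only the earliest sample in each window is kept, so the output is a subsample rather than an average.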

### HR-ERA5/IMERG

Our HR-ERA5/IMERG dataset features hourly data on a 0.25° x 0.25° grid. We use ERA5 variables at their native resolution and adjust IMERG V06 precipitation to match: it is regridded bilinearly onto the ERA5 grid and reduced to hourly frequency by keeping the first sample within each hour (`method='first'`).

In [6]:
ps = format_var('ERA5_surface_pressure.nc')
q  = format_var('ERA5_specific_humidity.nc')
t  = format_var('ERA5_temperature.nc')
imergpr = format_var('IMERG_precipitation_rate.nc',gridtarget=ps,frequency='H',method='first')

In [7]:
hrimerg = combine_vars([imergpr,ps,q,t])

Dataset Size: 61.81 GB


### LR-ERA5/GPCP

Our LR-ERA5/GPCP dataset features daily mean data on a 1.0° x 1.0° grid. We use GPCP precipitation at its native resolution and adjust the ERA5 variables to match, coarsening their grid and averaging them to daily means.

In [4]:
gpcppr = format_var('GPCP_precipitation_rate.nc')
ps = format_var('ERA5_surface_pressure.nc',gridtarget=gpcppr,frequency='D',method='mean')
q  = format_var('ERA5_specific_humidity.nc',gridtarget=gpcppr,frequency='D',method='mean')
t  = format_var('ERA5_temperature.nc',gridtarget=gpcppr,frequency='D',method='mean')

In [5]:
lrgpcp = combine_vars([gpcppr,ps,q,t])

Dataset Size: 0.17 GB
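
Because `combine_vars()` relies on `xr.merge`, variables whose time coordinates do not coincide can end up padded with NaNs rather than raising an error. The sketch below uses two hypothetical toy arrays (`pr` and `ps` with made-up values) and passes `join` explicitly, since the default join behavior may differ between xarray versions.

```python
import numpy as np
import xarray as xr

# Two toy variables whose time axes only partly overlap (hypothetical values)
t_long = np.array(['2000-01-01', '2000-01-02', '2000-01-03'], dtype='datetime64[ns]')
t_short = t_long[1:]
pr = xr.DataArray([1.0, 2.0, 3.0], coords={'time': t_long}, dims='time', name='pr')
ps = xr.DataArray([10.0, 20.0], coords={'time': t_short}, dims='time', name='ps')

# Outer join keeps the union of timestamps and fills the gaps with NaN
outer = xr.merge([pr, ps], join='outer')
# Inner join keeps only the timestamps shared by every variable
inner = xr.merge([pr, ps], join='inner')

n_missing = int(outer['ps'].isnull().sum())
print(outer.sizes['time'], inner.sizes['time'], n_missing)
```

Counting NaNs per variable after merging is a cheap way to notice a mismatch in coverage between sources before the dataset is saved.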


### LR-ERA5/IMERG

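We also generate a lower-resolution IMERG V06 dataset (LR-ERA5/IMERG) at the same frequency and spatial resolution as LR-ERA5/GPCP. Both the ERA5 variables and IMERG V06 precipitation are regridded onto the GPCP grid and averaged to daily means. The cell below reuses `gpcppr` as the regridding target, so the LR-ERA5/GPCP cell must be run first.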

In [9]:
ps = format_var('ERA5_surface_pressure.nc',gridtarget=gpcppr,frequency='D',method='mean')
q  = format_var('ERA5_specific_humidity.nc',gridtarget=gpcppr,frequency='D',method='mean')
t  = format_var('ERA5_temperature.nc',gridtarget=gpcppr,frequency='D',method='mean')
imergpr = format_var('IMERG_precipitation_rate.nc',gridtarget=gpcppr,frequency='D',method='mean')

In [10]:
lrimerg = combine_vars([imergpr,ps,q,t])

Dataset Size: 0.17 GB
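
Since the two low-resolution datasets are meant to share a grid and frequency, a quick coordinate comparison can confirm it. The helper `same_coords()` and the coordinate names `lat`, `lon`, and `time` are assumptions made for this sketch, and the toy datasets stand in for `lrgpcp` and `lrimerg`.

```python
import numpy as np
import xarray as xr

def same_coords(a, b, names=('lat', 'lon', 'time')):
    """Return True if a and b hold identical values for every named coordinate."""
    return all(np.array_equal(a[name].values, b[name].values) for name in names)

def toy_dataset(lats):
    # Small stand-in dataset; the coordinate values are hypothetical
    time = np.array(['2000-01-01', '2000-01-02'], dtype='datetime64[ns]')
    lons = np.array([60.5, 61.5, 62.5])
    data = np.zeros((time.size, len(lats), lons.size))
    return xr.Dataset({'pr': (('time', 'lat', 'lon'), data)},
                      coords={'time': time, 'lat': lats, 'lon': lons})

ds_a = toy_dataset(np.array([10.5, 11.5]))
ds_b = toy_dataset(np.array([10.5, 11.5]))  # same grid as ds_a
ds_c = toy_dataset(np.array([10.0, 11.0]))  # latitudes shifted by half a degree

print(same_coords(ds_a, ds_b))
print(same_coords(ds_a, ds_c))
```

In the notebook, the equivalent call would be `same_coords(lrgpcp, lrimerg)`, provided the datasets use those coordinate names.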


## Save Baseline Datasets

Save each baseline `xarray.Dataset` as a netCDF file in `SAVEDIR`.

In [11]:
def save(ds,filename,savedir=SAVEDIR):
    filepath = f'{savedir}/{filename}'
    ds.to_netcdf(filepath)

In [12]:
save(hrimerg,'HR_ERA5_IMERG_baseline.nc')
save(lrimerg,'LR_ERA5_IMERG_baseline.nc')
save(lrgpcp,'LR_ERA5_GPCP_baseline.nc')