# Download and Save Cloud Data

This notebook downloads variables needed to calculate the precipitation-buoyancy POD from multiple cloud stores. The following code obtains and preprocesses thermodynamic variables from ERA5, and precipitation from both IMERG V06 and GPCP.

## Import Necessary Packages

In [5]:
import xesmf
import gcsfs
import fsspec
import warnings
import numpy as np
import xarray as xr
import planetary_computer
from datetime import datetime
import pystac_client as pystac
warnings.filterwarnings('ignore')

## User-Defined Fields

Define user information (for data download attribution), the directory where the data should be saved, and subsetting parameters (years, months, and latitude/longitude/pressure level ranges).

In [6]:
AUTHOR    = 'Savannah L. Ferretti'      
EMAIL     = 'savannah.ferretti@uci.edu' 
SAVEDIR   = '/ocean/projects/atm200007p/sferrett/Repos/monsoon-pr/data/raw'
YEARS     = list(range(2000,2021)) # 2000 through 2020, inclusive
MONTHS    = [6,7,8]
LATRANGE  = (5.,25.) 
LONRANGE  = (60.,90.)
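LEVRANGE  = (500.,1000.)
FREQUENCY = 'H' # Target sampling frequency used by the later cells; hourly is assumed here, consistent with the printed output sizes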

## Import and Preprocess Cloud Datasets

The raw data for this analysis is accessible through publicly available cloud stores. ERA5 data, at its native hourly frequency and 0.25° x 0.25° spatial resolution, can be found on the LEAP Pangeo Data Catalog [here](https://catalog.leap.columbia.edu/feedstock/arco-era5). IMERG V06 data, provided at half-hourly frequency at 0.1° x 0.1° spatial resolution, can be accessed via Microsoft Planetary Computer [here](https://planetarycomputer.microsoft.com/dataset/gpm-imerg). GPCP data, available at daily frequency with 1.0° x 1.0° spatial resolution, is also hosted on the LEAP Pangeo Data Catalog [here](https://catalog.leap.columbia.edu/feedstock/global-precipitation-climatology-project). To efficiently handle these large datasets, the following functions lazily load all data using Xarray.

In [7]:
def get_era5():
    store = 'gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2/'
    ds    = xr.open_zarr(store,decode_times=True)  
    return ds

def get_imerg():
    store   = 'https://planetarycomputer.microsoft.com/api/stac/v1'
    catalog = pystac.Client.open(store,modifier=planetary_computer.sign_inplace)
    assets  = catalog.get_collection('gpm-imerg-hhr').assets['zarr-abfs']
    ds      = xr.open_zarr(fsspec.get_mapper(assets.href,**assets.extra_fields['xarray:storage_options']),consolidated=True)
    return ds

def get_gpcp():
    store = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/gpcp-feedstock/gpcp.zarr'
    ds    = xr.open_dataset(store,engine='zarr',chunks={})   
    return ds

In [8]:
era5data  = get_era5()
imergdata = get_imerg()
gpcpdata  = get_gpcp()

The ```preprocess()``` function preprocesses each variable using the user-defined fields specified above. It executes two functions: one that standardizes dimensions across datasets, and another that subsets the time and space dimensions and specifies pressure levels to keep (if applicable).

In [9]:
def standardize(ds):
    dimnames = {'latitude':'lat','longitude':'lon','level':'lev'}
    ds = ds.rename({oldname:newname for oldname,newname in dimnames.items() if oldname in ds.dims})
    targetdims = ['lev','time','lat','lon'] if 'lev' in ds.dims else ['time','lat','lon']
    extradims  = [dim for dim in ds.dims if dim not in targetdims]
    if extradims:
        ds = ds.drop_dims(extradims)
    for dim in targetdims:
        if dim=='time' and ds.coords[dim].dtype.kind!='M':
            ds.coords[dim] = ds.indexes[dim].to_datetimeindex()
            ds = ds.sel(time=~ds.time.to_index().duplicated(keep='first'))
        elif dim=='lon':
            ds.coords[dim] = (ds.coords[dim]+180)%360-180        
        elif dim!='time':
            ds.coords[dim] = ds.coords[dim].astype(float)
    ds = ds.sortby(targetdims).transpose(*targetdims)   
    return ds

def subset(da,years=YEARS,months=MONTHS,latrange=LATRANGE,lonrange=LONRANGE,levrange=LEVRANGE):
    da = da.sel(time=(da['time.year'].isin(years))&(da['time.month'].isin(months)))
    da = da.sel(lat=slice(*latrange),lon=slice(*lonrange))
    if 'lev' in da.dims:
        da = da.sel(lev=slice(*levrange))
    return da

def preprocess(da,years=YEARS,months=MONTHS,latrange=LATRANGE,lonrange=LONRANGE,levrange=LEVRANGE):
    da = standardize(da)
    da = subset(da,years,months,latrange,lonrange,levrange)
    return da

In [10]:
gpcp  = preprocess(gpcpdata)
imerg = preprocess(imergdata)
era5  = preprocess(era5data)
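
The longitude wrap in ```standardize()``` can be checked on a small array. The sketch below uses plain NumPy with made-up longitudes (not values read from any of the datasets) to show that ```(lon+180)%360-180``` maps a 0° to 360° axis onto -180° to 180°, and why the coordinates are sorted before slicing.

```python
import numpy as np

# Hypothetical longitudes on a 0-360 grid (illustrative values only)
lon = np.array([0., 60., 90., 180., 270., 359.75])

# Same expression as in standardize(): wrap onto [-180, 180)
wrapped = (lon + 180) % 360 - 180

# The wrapped axis is no longer monotonic, so sort it before slicing
sortedlon = np.sort(wrapped)

# Selecting 60°E to 90°E keeps both end points, like slice(60., 90.) on a sorted coordinate
inside = sortedlon[(sortedlon >= 60.) & (sortedlon <= 90.)]
print(wrapped)
print(inside)
```

Longitudes east of 180° become negative after the wrap, so the study region (60°E to 90°E) is unaffected, but the axis as a whole must be re-sorted for label-based slicing to behave as expected.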

## Extract Necessary Variables

Only four variables are needed from across these three datasets: precipitation (in mm/day) from IMERG V06 and GPCP, and surface pressure (hPa), specific humidity (kg/kg), and temperature (K) from ERA5. Convert units as necessary, and remove unrealistic values. 

In [24]:
gpcpdata  = gpcp.precip.where(gpcp.precip>=0,0)
imergdata = (imerg.precipitationCal).where(imerg.precipitationCal>=0,0)*24 # From mm/hr to mm/day
psdata    = era5.surface_pressure/100 # From Pa to hPa
qdata     = era5.specific_humidity
tdata     = era5.temperature
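
The masking and unit conversion applied to precipitation can be illustrated on a toy array. This is a minimal sketch with hypothetical rain rates (not IMERG values), using ```np.where``` as a NumPy analogue of the ```.where(condition,0)``` pattern above.

```python
import numpy as np

# Hypothetical rain rates in mm/hr: one valid, one negative fill value, one valid, one missing
rate = np.array([0.5, -9999.9, 2.0, np.nan])

# Keep values where the condition holds, otherwise substitute 0
cleaned = np.where(rate >= 0, rate, 0.)

# A rate in mm/hr sustained for 24 hours corresponds to mm/day
perday = cleaned * 24
print(perday)
```

Because a comparison with NaN evaluates to False, missing values are also set to 0 by this pattern. If missing data should stay missing, the condition would need to be adjusted.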

## Create Observational Baselines

Our three datasets have different temporal frequencies and spatial resolutions. Regridding and resampling are therefore needed to produce temporally and spatially consistent datasets for analysis. ```regrid()``` uses [xESMF](https://xesmf.readthedocs.io/en/stable/) to interpolate between rectilinear grids, and ```resample()``` changes the sampling frequency by taking the temporal mean, first, or last value within each hourly or daily interval.

In [13]:
def regrid(da,gridsource,gridtarget):
    regridder = xesmf.Regridder(gridsource,gridtarget,method='bilinear')
    return regridder(da,keep_attrs=True) # keep_attrs is passed when applying the regridder, not when constructing it

def resample(da,frequency,method):
    if frequency not in ['H','D']:
        raise ValueError("Frequency must be 'H' (hourly) or 'D' (daily)")
    if method=='mean':
        return da.resample(time=frequency).mean()
    elif method=='first':
        return da.resample(time=frequency).first()
    elif method=='last':
        return da.resample(time=frequency).last()
    else:
        raise ValueError("Method must be 'mean', 'first', or 'last'")
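
The difference between the three resampling methods can be illustrated without Xarray. The sketch below assumes a hypothetical half-hourly series with an even number of samples and groups consecutive pairs into hourly bins.

```python
import numpy as np

# Hypothetical half-hourly samples covering three hours (illustrative values)
halfhourly = np.array([1., 3., 0., 0., 4., 2.])

# Group consecutive pairs of samples into hourly bins: shape (hours, 2)
bins = halfhourly.reshape(-1, 2)

hourlymean  = bins.mean(axis=1)  # analogous to method='mean'
hourlyfirst = bins[:, 0]         # analogous to method='first'
hourlylast  = bins[:, -1]        # analogous to method='last'
print(hourlymean, hourlyfirst, hourlylast)
```

For a rate such as precipitation, the mean preserves the accumulated amount over each interval, whereas the first or last value is an instantaneous sample.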

## Preprocess Data

The ```preprocess()``` function preprocesses each variable using the user-defined fields above. It standardizes dimensions, subsets the time and space dimensions, selects the pressure levels to keep (if applicable), resamples the data to a specified sampling frequency (which can be instantaneous or a time mean), and regrids the IMERG V06 precipitation data to the same grid as ERA5. It also records the date on which each dataset was created, along with the name and email of the user who created it.

In [6]:
def standardize(da):
    dimnames = {'latitude':'lat','longitude':'lon','level':'lev'}
    da = da.rename({oldname:newname for oldname,newname in dimnames.items() if oldname in da.dims})
    dims = ['lev','time','lat','lon'] if 'lev' in da.dims else ['time','lat','lon']
    for dim in dims:
        if dim == 'time' and da.coords[dim].dtype.kind != 'M':
            da.coords[dim] = da.indexes[dim].to_datetimeindex()
        elif dim != 'time':
            da.coords[dim] = da.coords[dim].astype(float)
    da = da.sortby(dims).transpose(*dims)
    return da

def subset(da,years=YEARS,months=MONTHS,latrange=LATRANGE,lonrange=LONRANGE,levrange=LEVRANGE):
    da = da.sel(time=(da['time.year'].isin(years))&(da['time.month'].isin(months)))
    da = da.sel(lat=slice(*latrange),lon=slice(*lonrange))
    if 'lev' in da.dims:
        da = da.sel(lev=slice(*levrange))
    return da
    
def resample(da,frequency=FREQUENCY):
    da.coords['time'] = da.time.dt.floor(frequency) 
    return da.groupby('time').first()

def regrid(da,resolution,latrange=LATRANGE,lonrange=LONRANGE):
    newlats = np.arange(latrange[0],latrange[1]+resolution,resolution)
    newlons = np.arange(lonrange[0],lonrange[1]+resolution,resolution)
    da = da.interp(lat=newlats,lon=newlons,kwargs={'fill_value':'extrapolate'})
    return da

def preprocess(da,shortname,longname,units,source,years=YEARS,months=MONTHS,resolution=None,latrange=LATRANGE,lonrange=LONRANGE,levrange=LEVRANGE,frequency=FREQUENCY,author=AUTHOR,email=EMAIL):
    da = standardize(da)
    da = subset(da,years,months,latrange,lonrange,levrange)
    if xr.infer_freq(da.time) != frequency:
        da = resample(da,frequency)
    if resolution:
        da = regrid(da,resolution)
    ds = xr.Dataset(data_vars={shortname:([*da.dims],da.data)},
                    coords={dim:da.coords[dim].data for dim in da.dims})
    ds[shortname].attrs = dict(long_name=longname,units=units)
    ds.time.attrs = dict(long_name='Time')
    ds.lat.attrs  = dict(long_name='Latitude',units='°N')
    ds.lon.attrs  = dict(long_name='Longitude',units='°E')
    if 'lev' in ds.dims:
        ds.lev.attrs = dict(long_name='Pressure level',units='hPa')
    ds.attrs = dict(source=source,history=f'Created on {datetime.today().strftime("%Y-%m-%d")} by {author} ({email})')
    return ds

In [7]:
pr = preprocess(imergdata,resolution=0.25,shortname='pr',longname='Precipitation flux',units='mm/day',source='IMERG V06')
ps = preprocess(psdata,shortname='ps',longname='Surface pressure',units='hPa',source='ERA5')
q  = preprocess(qdata,shortname='q',longname='Specific humidity',units='kg/kg',source='ERA5')
t  = preprocess(tdata,shortname='t',longname='Air temperature',units='K',source='ERA5')
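
The target grid built inside ```regrid()``` can be inspected on its own. The sketch below repeats the ```np.arange``` construction for the user-defined ranges at 0.25° resolution. The ```np.linspace``` variant is an alternative shown for comparison, not part of the original workflow.

```python
import numpy as np

LATRANGE   = (5., 25.)
LONRANGE   = (60., 90.)
resolution = 0.25

# Same construction as in regrid(): extend the stop by one step so the end point is included
newlats = np.arange(LATRANGE[0], LATRANGE[1] + resolution, resolution)
newlons = np.arange(LONRANGE[0], LONRANGE[1] + resolution, resolution)

# Alternative (an assumption, not in the original code): fix the number of points explicitly
nlat    = int(round((LATRANGE[1] - LATRANGE[0]) / resolution)) + 1
altlats = np.linspace(LATRANGE[0], LATRANGE[1], nlat)

print(newlats.size, newlons.size)
```

With a step of 0.25, which is exactly representable in binary floating point, both constructions give the same grid. For steps such as 0.1, the NumPy documentation recommends ```np.linspace```, since ```np.arange``` can return an unexpected number of points.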

In [8]:
print(f'Size of pr: {pr.nbytes*1e-9:.2f} GB')
print(f'Size of ps: {ps.nbytes*1e-9:.2f} GB')
print(f'Size of q:  {q.nbytes*1e-9:.2f} GB')
print(f'Size of t:  {t.nbytes*1e-9:.2f} GB')

Size of pr: 1.82 GB
Size of ps: 1.82 GB
Size of q:  29.09 GB
Size of t:  29.09 GB
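
The reported sizes can be reproduced with simple arithmetic, which is a quick check that the subsetting behaved as intended. The sketch below is an estimate under assumptions that are not stated in the notebook: hourly samples, 32-bit floating point data, 8-byte coordinate values, a 0.25° grid that includes both end points, and 16 ERA5 pressure levels between 500 and 1000 hPa.

```python
# All counts below are assumptions used for a back-of-the-envelope estimate
nyears = 21                     # 2000 through 2020
ndays  = 30 + 31 + 31           # June, July, August
ntime  = nyears * ndays * 24    # hourly time steps
nlat   = int((25 - 5) / 0.25) + 1
nlon   = int((90 - 60) / 0.25) + 1
nlev   = 16

# Coordinates (time, lat, lon) stored as 8-byte values; data stored as 4-byte floats
coordbytes = 8 * (ntime + nlat + nlon)
size2d = 4 * ntime * nlat * nlon + coordbytes
size3d = 4 * nlev * ntime * nlat * nlon + coordbytes + 8 * nlev

print(f'2D variable: {size2d*1e-9:.2f} GB')
print(f'3D variable: {size3d*1e-9:.2f} GB')
```

Under these assumptions the estimates agree with the printed sizes for the surface fields and for the 16-level fields.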


## Save Variables

Save each variable's ```xarray.Dataset``` as a netCDF file in the user-defined save directory (```SAVEDIR```).

In [21]:
def save(ds,filename,savedir=SAVEDIR):
    filepath = f'{savedir}/{filename}'
    ds.to_netcdf(filepath)

In [23]:
save(pr,'IMERG_precipitation_flux.nc') # 22m 24s
save(ps,'ERA5_surface_pressure.nc')    # 7m 3s
save(q,'ERA5_specific_humidity.nc')    # 4h 2m 11s
save(t,'ERA5_temperature.nc')          # 3h 32m 51s
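
Given the multi-hour write times noted above, it may be worth confirming that the output directory exists before starting a save. The sketch below is one possible way to build the file path with ```pathlib```. The helper name ```buildpath``` is hypothetical, and a temporary directory stands in for ```SAVEDIR```.

```python
from pathlib import Path
import tempfile

def buildpath(filename, savedir):
    # Hypothetical helper: create the directory if needed, then join it with the filename
    savedir = Path(savedir)
    savedir.mkdir(parents=True, exist_ok=True)
    return savedir / filename

# A temporary directory stands in for SAVEDIR in this demonstration
with tempfile.TemporaryDirectory() as tmpdir:
    filepath = buildpath('ERA5_surface_pressure.nc', Path(tmpdir) / 'data' / 'raw')
    parentexists = filepath.parent.is_dir()
    print(filepath.name, parentexists)
```

The resulting path could then be passed to ```ds.to_netcdf()``` in place of the f-string used in ```save()```.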