# Download and Save Cloud Data

This notebook downloads the variables needed to calculate the precipitation-buoyancy POD from public cloud stores: thermodynamic variables from ERA5, and precipitation from IMERG V06 and GPCP.

## Import Necessary Packages

In [1]:
import gcsfs
import fsspec
import warnings
import numpy as np
import xarray as xr
import planetary_computer
from datetime import datetime
import pystac_client as pystac
warnings.filterwarnings('ignore')

## User-Defined Configurations

Define the user's name/email (for data download attribution), set the directory where the variable data will be saved, and specify subsetting parameters (i.e., years, months, and latitude/longitude/pressure level ranges).

In [2]:
AUTHOR    = 'Savannah L. Ferretti'
EMAIL     = 'savannah.ferretti@uci.edu'
SAVEDIR   = '/global/cfs/cdirs/m4334/sferrett/monsoon-pod/data/raw'
YEARS     = list(range(2000,2021)) # 2000 through 2020, inclusive
MONTHS    = [6,7,8]
LATRANGE  = (5.,25.) 
LONRANGE  = (60.,90.)
LEVRANGE  = (500.,1000.)
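
The range tuples above are later unpacked into `slice(*range)` on coordinates sorted in ascending order, so each one should be given as `(min, max)`. A minimal sketch of a sanity check follows; `check_config` is a hypothetical helper that is not part of this notebook, and the bounds it enforces are assumptions about what counts as a sensible configuration.

```python
def check_config(years, months, latrange, lonrange, levrange):
    """Raise ValueError if a subsetting parameter looks malformed (hypothetical helper)."""
    if not years or any(not isinstance(y, int) for y in years):
        raise ValueError('YEARS must be a non-empty list of integers')
    if not months or any(m not in range(1, 13) for m in months):
        raise ValueError('MONTHS must contain integers between 1 and 12')
    ranges = {'LATRANGE': latrange, 'LONRANGE': lonrange, 'LEVRANGE': levrange}
    for name, (lo, hi) in ranges.items():
        # slice(lo, hi) on an ascending coordinate selects nothing if lo > hi
        if lo >= hi:
            raise ValueError(f'{name} must be (min, max) with min < max')
    if latrange[0] < -90 or latrange[1] > 90:
        raise ValueError('LATRANGE must lie within [-90, 90]')
    return True

# The configuration used in this notebook passes the check.
ok = check_config(list(range(2000, 2021)), [6, 7, 8], (5., 25.), (60., 90.), (500., 1000.))
print(ok)  # True
```

Running a check like this before opening the cloud stores catches a reversed tuple early, rather than after a long preprocessing step returns an empty selection.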

## Import ERA5, IMERG V06, and GPCP Data

The raw data for this analysis is accessible through publicly available cloud stores. ERA5 data, at its native hourly frequency and 0.25° x 0.25° spatial resolution, can be found on the LEAP Pangeo Data Catalog [here](https://catalog.leap.columbia.edu/feedstock/arco-era5). IMERG V06 data, provided at half-hourly frequency and 0.1° x 0.1° spatial resolution, can be accessed via Microsoft Planetary Computer [here](https://planetarycomputer.microsoft.com/dataset/gpm-imerg). GPCP data, available at daily frequency and 1.0° x 1.0° spatial resolution, is also hosted on the LEAP Pangeo Data Catalog [here](https://catalog.leap.columbia.edu/feedstock/global-precipitation-climatology-project). To handle these large datasets efficiently, the following functions open the stores lazily as `xarray.Dataset` objects, so array values are only read when they are computed or saved.

In [3]:
def get_era5():
    store = 'gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2/'
    ds    = xr.open_zarr(store,decode_times=True)  
    return ds

def get_imerg():
    store   = 'https://planetarycomputer.microsoft.com/api/stac/v1'
    catalog = pystac.Client.open(store,modifier=planetary_computer.sign_inplace)
    assets  = catalog.get_collection('gpm-imerg-hhr').assets['zarr-abfs']
    ds      = xr.open_zarr(fsspec.get_mapper(assets.href,**assets.extra_fields['xarray:storage_options']),consolidated=True)
    return ds

In [4]:
era5  = get_era5()
imerg = get_imerg()

## Extract Necessary Variables

We only need four variables from these three datasets: precipitation from IMERG V06 and GPCP, and surface pressure, specific humidity, and temperature from ERA5. IMERG precipitation is converted from mm/hr to mm/day and ERA5 surface pressure from Pa to hPa. Fill values and negative precipitation rates are masked as NaN.

In [5]:
imergprdata  = imerg.precipitationCal.where(
    (imerg.precipitationCal!=-9999.9)&
    (imerg.precipitationCal>=0),
    np.nan)*24 # mm/hr to mm/day
gpcpprdata  = gpcp.precip.where(
    (gpcp.precip!=-99999.)&
    (gpcp.precip!=9.96921e+36)&
    (gpcp.precip>=0),
    np.nan)
psdata = era5.surface_pressure/100 # Pa to hPa
qdata  = era5.specific_humidity
tdata  = era5.temperature
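
The masking rule in the cell above can be illustrated without any cloud access. The following is a dependency-free sketch of the same logic on a plain list; `clean_rate` and the sample values are made up for illustration, and the fill value is the one tested for IMERG above.

```python
import math

IMERG_FILL = -9999.9  # fill value tested in the cell above

def clean_rate(values, fill=IMERG_FILL, factor=24):
    """Mask fill/negative rates as NaN, then scale mm/hr to mm/day (illustrative helper)."""
    return [v * factor if (v != fill and v >= 0) else math.nan for v in values]

raw = [0.0, 0.5, -9999.9, -0.1, 2.0]  # made-up sample rates in mm/hr
cleaned = clean_rate(raw)
print(cleaned)  # [0.0, 12.0, nan, nan, 48.0]
```

Note that the `>= 0` test already rejects a negative fill value, so the explicit comparison mostly documents intent. The large positive GPCP fill value (`9.96921e+36`) is different: it passes the `>= 0` test and does need its own check.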

## Preprocess Data

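The `preprocess()` function applies the user-defined settings above to each variable in three steps. First, `standardize()` renames and reorders dimensions, wraps longitudes to [-180°, 180°), and removes duplicate time stamps. Next, `subset()` selects the requested years, months, region, and (if applicable) pressure levels. Finally, the result is packaged as an `xarray.Dataset` with descriptive metadata.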

In [6]:
def standardize(da):
    dimnames = {'latitude':'lat','longitude':'lon','level':'lev'}
    da = da.rename({oldname:newname for oldname,newname in dimnames.items() if oldname in da.dims})
    targetdims = ['lev','time','lat','lon'] if 'lev' in da.dims else ['time','lat','lon']
    extradims  = [dim for dim in da.dims if dim not in targetdims]
    if extradims:
        # DataArray has no drop_dims() method; squeeze out any length-1 extra dimensions instead
        da = da.squeeze(extradims,drop=True)
    for dim in targetdims:
        if dim=='time':
            if da.coords[dim].dtype.kind!='M':
                da.coords[dim] = da.indexes[dim].to_datetimeindex()
            da = da.sel(time=~da.time.to_index().duplicated(keep='first'))
        elif dim=='lon':
            da.coords[dim] = (da.coords[dim]+180)%360-180        
        else:
            da.coords[dim] = da.coords[dim].astype(float)
    da = da.sortby(targetdims).transpose(*targetdims)   
    return da
    
def subset(ds,years=YEARS,months=MONTHS,latrange=LATRANGE,lonrange=LONRANGE,levrange=LEVRANGE):
    ds = ds.sel(time=(ds['time.year'].isin(years))&(ds['time.month'].isin(months)))
    ds = ds.sel(lat=slice(*latrange),lon=slice(*lonrange))
    if 'lev' in ds.dims:
        ds = ds.sel(lev=slice(*levrange))
    return ds
    
def preprocess(da,shortname,longname,units,years=YEARS,months=MONTHS,latrange=LATRANGE,lonrange=LONRANGE,levrange=LEVRANGE,author=AUTHOR,email=EMAIL):
    da = standardize(da)
    da = subset(da,years,months,latrange,lonrange,levrange)
    ds = xr.Dataset(data_vars={shortname:([*da.dims],da.data)},
                    coords={dim:da.coords[dim].data for dim in da.dims})
    ds[shortname].attrs = dict(long_name=longname,units=units)
    ds.time.attrs = dict(long_name='Time')
    ds.lat.attrs  = dict(long_name='Latitude',units='°N')
    ds.lon.attrs  = dict(long_name='Longitude',units='°E')
    if 'lev' in ds.dims:
        ds.lev.attrs = dict(long_name='Pressure level',units='hPa')
    ds.attrs = dict(history=f'Created on {datetime.today().strftime("%Y-%m-%d")} by {author} ({email})')
    print(f'{longname}: {ds.nbytes*1e-9:.2f} GB')
    return ds
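
The longitude step in `standardize()` is the least obvious part of the cell above, so here is a small sketch of what the expression `(lon+180)%360-180` does to individual values. The function name `wrap_longitude` is introduced only for this illustration.

```python
def wrap_longitude(lon):
    """Map a longitude in degrees to the half-open interval [-180, 180)."""
    return (lon + 180) % 360 - 180

wrapped = {lon: wrap_longitude(lon) for lon in (0.0, 90.0, 180.0, 270.0, 359.75)}
for lon, new in wrapped.items():
    print(lon, '->', new)
# 0.0 -> 0.0
# 90.0 -> 90.0
# 180.0 -> -180.0
# 270.0 -> -90.0
# 359.75 -> -0.25
```

Values in the eastern hemisphere are unchanged, so `LONRANGE = (60.,90.)` means the same thing before and after wrapping. Values at or above 180° become negative, which leaves the coordinate out of order on a 0° to 360° grid; this is why `standardize()` sorts the coordinates before `subset()` slices them.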

In [7]:
imergpr = preprocess(imergprdata,shortname='pr',longname='IMERG V06 precipitation rate',units='mm/day')
gpcppr  = preprocess(gpcpprdata,shortname='pr',longname='GPCP precipitation rate',units='mm/day')
ps = preprocess(psdata,shortname='ps',longname='ERA5 surface pressure',units='hPa')
q  = preprocess(qdata,shortname='q',longname='ERA5 specific humidity',units='kg/kg')
t  = preprocess(tdata,shortname='t',longname='ERA5 air temperature',units='K')

IMERG V06 precipitation rate: 22.26 GB
GPCP precipitation rate: 0.01 GB
ERA5 surface pressure: 1.82 GB
ERA5 specific humidity: 29.09 GB
ERA5 air temperature: 29.09 GB
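
The printed sizes can be cross-checked with simple arithmetic. The sketch below rests on several assumptions that are not stated in the notebook: values are stored as 4-byte floats, ERA5 grid points sit on multiples of 0.25° (so both range bounds are included), IMERG cell centers are offset by half a cell (so the box holds a whole number of cells), 16 ERA5 pressure levels lie between 500 and 1000 hPa, and every June through August from 2000 to 2020 is fully covered.

```python
def n_points(lo, hi, step, inclusive):
    """Count grid points between lo and hi at the given spacing."""
    n = round((hi - lo) / step)
    return n + 1 if inclusive else n

DAYS_JJA = 30 + 31 + 31  # June, July, August
N_YEARS = 21             # 2000 through 2020
BYTES = 4                # assumption: float32 storage

def size_gb(nlat, nlon, steps_per_day, nlev=1):
    """Estimate the size of one data variable in GB (coordinates excluded)."""
    ntime = N_YEARS * DAYS_JJA * steps_per_day
    return nlat * nlon * ntime * nlev * BYTES * 1e-9

era5_lat = n_points(5, 25, 0.25, inclusive=True)    # 81
era5_lon = n_points(60, 90, 0.25, inclusive=True)   # 121
imerg_lat = n_points(5, 25, 0.1, inclusive=False)   # 200
imerg_lon = n_points(60, 90, 0.1, inclusive=False)  # 300

imerg_gb = size_gb(imerg_lat, imerg_lon, steps_per_day=48)
ps_gb = size_gb(era5_lat, era5_lon, steps_per_day=24)
q_gb = size_gb(era5_lat, era5_lon, steps_per_day=24, nlev=16)
print(f'{imerg_gb:.2f} {ps_gb:.2f} {q_gb:.2f}')  # 22.26 1.82 29.08
```

Under these assumptions the estimates agree with the reported 22.26 GB and 1.82 GB, and come within 0.01 GB of the reported 29.09 GB. That small gap is consistent with the coordinate arrays, which `ds.nbytes` counts and this estimate leaves out.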


## Save Variables

Save each variable's `xarray.Dataset` as a netCDF file in `SAVEDIR`. The wall-clock time each save took is noted in the comment to its right; these times will vary with the machine and the saving method used.

In [8]:
def save(ds,filename,savedir=SAVEDIR):
    filepath = f'{savedir}/{filename}'
    ds.to_netcdf(filepath)

In [11]:
save(imergpr,'IMERG_precipitation_rate.nc') # 22m 53s
save(gpcppr,'GPCP_precipitation_rate.nc')   # 6s
save(ps,'ERA5_surface_pressure.nc')         # 5m 55s
save(q,'ERA5_specific_humidity.nc')         # 4h 27m 1s
save(t,'ERA5_temperature.nc')               # 2h 50m 42s
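
The `save()` function above assumes that `SAVEDIR` already exists; writing a file into a missing directory typically fails, which is costly to discover hours into a save. One possible guard, sketched below with the standard library only, is to create the directory before building the path. `build_filepath` is a hypothetical helper and is not part of the notebook.

```python
from pathlib import Path
import tempfile

def build_filepath(savedir, filename):
    """Create savedir if needed and return the full output path (hypothetical helper)."""
    savedir = Path(savedir)
    savedir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return savedir / filename

# Demonstrate with a temporary directory instead of the real SAVEDIR.
with tempfile.TemporaryDirectory() as tmp:
    path = build_filepath(Path(tmp) / 'data' / 'raw', 'ERA5_surface_pressure.nc')
    created = path.parent.is_dir()
    print(path.name, created)  # ERA5_surface_pressure.nc True
```

The returned path could then be passed to `ds.to_netcdf()` in place of the f-string used in `save()`.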