## Explanation
The following script reduces a single 6.3 GB NetCDF file to four small files (about 0.5 MB each) that are far more manageable for analysis.

## About the data
This ERA5 dataset contains 10 years of convective rain rate and convective available potential energy (CAPE) at a 31 km spatial resolution over Australia.

## About the tools
As you'll see, xarray is a powerful tool for manipulating NetCDF datasets without reading the entire dataset into memory.
Here xarray is used to find the nearest ERA5 grid point to each airport location.
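
As a rough illustration of what a nearest-grid-point lookup does, here is a plain-Python sketch (not xarray's own implementation). The grid spacing, grid extent, airport coordinates and the helper name `nearest_index` are all assumptions made for this example. Conceptually, `sel(..., method='nearest')` performs this kind of search independently along each coordinate.

```python
def nearest_index(coords, target):
    """Return the index of the coordinate value closest to target."""
    return min(range(len(coords)), key=lambda i: abs(coords[i] - target))

# Hypothetical regular grid (assumed 0.25 degree spacing and extent).
grid_lats = [-10.0 - 0.25 * i for i in range(141)]  # -10.0 down to -45.0
grid_lons = [110.0 + 0.25 * i for i in range(181)]  # 110.0 up to 155.0

# Illustrative airport position (approximate, not read from airport_table.csv).
ap_lat, ap_lon = -37.67, 144.84

# Search each coordinate axis separately for the closest grid value.
i_lat = nearest_index(grid_lats, ap_lat)
i_lon = nearest_index(grid_lons, ap_lon)

print('nearest grid point:', grid_lats[i_lat], grid_lons[i_lon])
```

The airport is then represented by the single grid cell at `(i_lat, i_lon)`, which is why only a small time series per variable needs to be kept.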

__You can't run this script here, as the data file isn't hosted on this server.__ If you want to run this script, let me know.

In [6]:
#import the Python libraries we'll need and define a function that uses pandas to read our airport catalogue

import xarray as xr #used for manipulating netcdf datasets
import numpy as np #used for arrays
import pandas #used in our case for reading csv files

def read_csv(csv_ffn, header_line):
    """
    CSV reader used for the airport table (comma delimited)
    
    Parameters:
    ===========
        csv_ffn: str
            Full filename to csv file
            
        header_line: int or None
            row number to use as the header (0 = first line of the csv), or None to use column indices
            
    Returns:
    ========
        as_dict: dict
            dictionary mapping each csv column name to a list of its values
    
    """
    df = pandas.read_csv(csv_ffn, header=header_line, skipinitialspace=True)
    as_dict = df.to_dict(orient='list')
    return as_dict

In [3]:
#define the location of our datasets
airport_csv_fn = 'airport_table.csv'
era5_ffn = '/g/data/kl02/jss548/era5-data/cape_convrain_sfc_era5.nc'
out_folder = '../preprocessed_data/'

#define dates
start_date = '20090101'
end_date = '20181231'

In [4]:
#load airport list
ap_dict = read_csv(airport_csv_fn, header_line=1)
ap_name_list = ap_dict['Name']
ap_lat_list = ap_dict['Latitude']
ap_lon_list = ap_dict['Longitude']
ap_rid_list = ap_dict['radar_id']
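
To make the shape of the airport dictionary concrete, here is a minimal sketch that builds the same column-name-to-list structure with the standard-library `csv` module instead of pandas. The two-row table is made up for illustration: the column names follow the cell above, but the coordinate and `radar_id` values are assumptions, not the contents of `airport_table.csv`. Unlike pandas, `csv` returns every value as a string, so numeric columns are converted explicitly.

```python
import csv
import io

# Made-up table; column names follow the notebook, values are illustrative only.
csv_text = """Name, Latitude, Longitude, radar_id
MEL, -37.67, 144.84, 2
SYD, -33.95, 151.18, 71
"""

# Build a dict of lists, one list per column.
reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
as_dict = {}
for row in reader:
    for column, value in row.items():
        as_dict.setdefault(column, []).append(value)

# Convert the numeric columns from strings to floats.
ap_name_list = as_dict['Name']
ap_lat_list = [float(v) for v in as_dict['Latitude']]
ap_lon_list = [float(v) for v in as_dict['Longitude']]

print(ap_name_list, ap_lat_list, ap_lon_list)
```

Because every column is a list in the same row order, index `i` refers to the same airport in each list, which is what the loop over `ap_name_list` relies on.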

In [5]:
#preprocess ERA5 data

#load netcdf data using xarray
DS = xr.open_dataset(era5_ffn)

#loop through each airport
for i, ap_name in enumerate(ap_name_list):
    
    #for each variable, use the xarray 'sel' function to pick the grid point nearest to the airport, then filter by time range
    #(method='nearest' cannot be combined with a slice, so the time selection is a separate call)
    nearest_pt = dict(longitude=ap_lon_list[i], latitude=ap_lat_list[i], method='nearest')
    era5_cape = DS.cape.sel(**nearest_pt).sel(time=slice(start_date, end_date))
    era5_crr = DS.crr.sel(**nearest_pt).sel(time=slice(start_date, end_date))
    era5_time = DS.time.sel(time=slice(start_date, end_date))
    
    #save to file using the NumPy .npz format
    save_path = out_folder + ap_name + '_era5.npz'
    np.savez(save_path, era5_cape=era5_cape.values, era5_crr=era5_crr.values, era5_time=era5_time.values)
    
    print('finished', ap_name)

finished MEL
finished SYD
finished WSYD
finished BNE


__By extracting exactly what we need for analysis, we've converted a 6.3 GB dataset into 4 x 0.5 MB files.__

This will save us lots of time when analysing the data later.
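
For the later analysis, the saved files can be read back with `np.load`. The sketch below is a self-contained round trip using small made-up arrays and a temporary directory in place of the real ERA5 output. The array values and the six-hour time axis are assumptions for illustration; the key names match those passed to `np.savez` above.

```python
import os
import tempfile

import numpy as np

# Small made-up arrays standing in for one airport's extracted series.
era5_time = np.arange('2009-01-01T00', '2009-01-01T06', dtype='datetime64[h]')
era5_cape = np.linspace(0.0, 500.0, era5_time.size)
era5_crr = np.zeros(era5_time.size)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Same naming pattern as the notebook: <airport name>_era5.npz
    save_path = os.path.join(tmp_dir, 'MEL_era5.npz')
    np.savez(save_path, era5_cape=era5_cape, era5_crr=era5_crr, era5_time=era5_time)

    # np.load returns an NpzFile; each array is read when accessed by key.
    with np.load(save_path) as data:
        cape = data['era5_cape']
        crr = data['era5_crr']
        time = data['era5_time']

print(cape.shape, crr.shape, time.dtype)
```

Each file holds only three one-dimensional arrays, so loading it is fast and needs neither xarray nor access to the original NetCDF file.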