## Explanation
This script builds a condensed radar dataset at airport locations.

### About the weather radar data
Weather radar data is sourced from project rq0 on NCI. This data will soon be published via NCI (publicly accessible by June 2020), but in the meantime you'll need to be an NCI user and request to join project rq0 to access it. The data used for this example is processed 'level 2' data, which is on a 1 x 1 km Cartesian grid rather than in the native (spherical) radar coordinates. From the level 2 datasets, the 'reflectivity at 2 km' and 'echo top heights' datasets are used. Each consists of one netCDF file per radar per day per variable, with dimensions time, x and y.
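
Because there is one file per radar per day per variable, a file path can be built from the radar ID, the date and the variable name alone. The following is a minimal sketch, assuming the directory layout used by the extraction function later in this notebook (`<root>/<variable>/<radar_id>/<year>/<radar_id>_<YYYYMMDD>_<variable>.nc`). The helper name `build_radar_path` is hypothetical and is not part of the notebook.

```python
from datetime import datetime

def build_radar_path(radar_root, variable, radar_id, target_date):
    """Build the path of one daily level 2 file (hypothetical helper)."""
    # radar IDs are zero-padded to two digits in the folder and file names
    radar_id_str = str(radar_id).zfill(2)
    date_str = target_date.strftime('%Y%m%d')
    file_name = '_'.join([radar_id_str, date_str, variable]) + '.nc'
    return '/'.join([radar_root, variable, radar_id_str, str(target_date.year), file_name])

# example: reflectivity file for radar 2 on 5 January 2018
example_path = build_radar_path('/g/data/rq0/level_2/daily_150km', 'REFLECTIVITY', 2, datetime(2018, 1, 5))
print(example_path)
```

Checking `os.path.isfile` on a path built this way is how a missing day (radar outage) can be detected before any file is opened.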

### What are we extracting?
In airport_table.csv, there is a radar identification number for each airport. For this radar, we extract the level 2 data within a search radius of the airport (in our case 5 nautical miles, or 9260 m). The maximum reflectivity within this radius is taken at each time step. A minimum threshold is also applied to remove precipitation that is unlikely to be from thunderstorms. Where valid thunderstorm reflectivity exists within the search radius, the top of the reflectivity echo (a proxy for storm top height) is also extracted. Our script also keeps track of the days on which there was no radar data.
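
The search-radius step can be illustrated on a small synthetic grid. This is only a sketch under stated assumptions: the grid, the airport position and the echo values are made up, and only the reflectivity threshold is shown (the notebook applies an echo top height threshold in the same way).

```python
import numpy as np

# synthetic stand-ins for the radar grid (all values are made up)
x = np.arange(-20000, 21000, 1000.0)  # m, 1 km spacing
y = np.arange(-20000, 21000, 1000.0)  # m, 1 km spacing
reflectivity = np.zeros((3, y.size, x.size))  # dimensions: time, y, x
reflectivity[1, 22, 23] = 55.0  # strong echo close to the airport at time step 1
reflectivity[2, 0, 0] = 60.0    # strong echo far from the airport at time step 2

ap_x, ap_y = 2000.0, 1000.0     # assumed airport position in radar coordinates (m)
search_radius = 5 * 1852.0      # 5 nautical miles in metres (9260 m)
min_reflectivity = 50.0         # dBZ

# distance of every grid point from the airport
x_grid, y_grid = np.meshgrid(x, y)
dist_grid = np.sqrt((x_grid - ap_x)**2 + (y_grid - ap_y)**2)
search_mask = dist_grid < search_radius

# zero everything outside the radius, then take the maximum per time step
masked = np.where(search_mask[np.newaxis, :, :], reflectivity, 0.0)
ref_max = masked.max(axis=(1, 2))
valid_mask = ref_max >= min_reflectivity

print(ref_max)     # per-time-step maximum within the search radius
print(valid_mask)  # time steps that pass the reflectivity threshold
```

Only time step 1 passes the threshold here: the echo at time step 2 is stronger but lies outside the search radius, so it is masked out before the maximum is taken.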

### What does this data mean?
Reflectivity provides an indicator of thunderstorm presence and intensity.

Echo top height provides additional information on the depth of a thunderstorm (which relates to its intensity).

To speed up the processing of 10 years of weather radar data, this notebook uses the multiprocessing library. 10 years of data (30 GB) can be processed in 5 minutes using 15 cores!

In [None]:
#load standard libraries
import os
import glob
import warnings
from datetime import datetime, timedelta
from multiprocessing import Pool

import xarray as xr #used for reading netcdf files
import numpy as np #used for arrays
import pandas #used for reading csv files
import tqdm #provides a nice progress bar for multiprocessing

import pyart_transform #used for coordinate calculation

In [2]:
def chunks(l, n):
    """
    Yield successive n-sized chunks from l.
    From http://stackoverflow.com/a/312464
    """
    for i in range(0, len(l), n):
        yield l[i:i + n]

def read_csv(csv_ffn, header_line):
    """
    CSV reader used for the radar locations file (comma delimited)
    
    Parameters:
    ===========
        csv_ffn: str
            Full filename to csv file
            
        header_line: int or None
            row number to use as the header (0 = first line of csv), or None to use column indices
            
    Returns:
    ========
        as_dict: dict
            csv columns as dictionary entries (column name -> list of values)
    
    """
    df = pandas.read_csv(csv_ffn, header=header_line, skipinitialspace=True)
    as_dict = df.to_dict(orient='list')
    return as_dict

def daterange(date1, date2):
    """
    Generate date list between dates
    """
    date_list = []
    for n in range(int ((date2 - date1).days)+1):
        date_list.append(date1 + timedelta(n))
    return date_list

In [3]:
#file path config
airport_csv_fn = 'airport_table.csv'
out_folder = '../preprocessed_data/'
radar_root = '/g/data/rq0/level_2/daily_150km'

#date range
start_date = '20090101'
end_date = '20181231'

#filters for data
search_radius = 9260 #m, using 5 nautical mile radius
min_reflectivity = 50 #dBZ
min_eth = 5000 #m

#build date list
date_list  = daterange(datetime.strptime(start_date, '%Y%m%d'), datetime.strptime(end_date, '%Y%m%d'))

#set number of CPU for multiprocessing
NCPU = 15

In [4]:
def extract_radar_data(radar_id, target_date, ap_lon, ap_lat, ap_name):
    
    """
    This is our core function. It processes a single day of radar data and returns the required statistics at the location of the airport
    
    Parameters:
    ===========
        radar_id: int
            radar identification number
            
        target_date: datetime
            datetime value for target date
        
        ap_lon: float
            value of airport longitude
            
        ap_lat: float
            value of airport latitude
        
        ap_name: str
            name of airport
            
    Returns:
    ========
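        result: dict, datetime or None
            dict with the time series of statistics for the target day,
            the target date if a data file is missing (radar outage),
            or None if no data passes the thresholds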
    
    """
    
    #convert radar_id and target_date to strings for building file paths
    radar_id_str = str(radar_id).zfill(2)
    target_date_str = datetime.strftime(target_date, '%Y%m%d')
    
    #build file paths and check data files exist
    var = 'ECHO_TOP_HEIGHTS'
    eth_ffn = '/'.join([radar_root, var, radar_id_str, str(target_date.year)]) + '/' + '_'.join([radar_id_str, target_date_str, var]) + '.nc'
    var = 'REFLECTIVITY'
    ref_ffn = '/'.join([radar_root, var, radar_id_str, str(target_date.year)]) + '/' + '_'.join([radar_id_str, target_date_str, var]) + '.nc'
    if not os.path.isfile(eth_ffn) or not os.path.isfile(ref_ffn):
        #here we return the target_date, which is used to keep track of days missing data due to radar outage
        return target_date
    
    #open weather radar datasets using xarray
    ds_eth = xr.open_dataset(eth_ffn)
    ds_ref = xr.open_dataset(ref_ffn)

    #find location of ap in radar cartesian coordinate space (x,y)
    radar_lat = float(ds_eth.origin_latitude)
    radar_lon = float(ds_eth.origin_longitude)
    ap_x, ap_y = pyart_transform.geographic_to_cartesian_aeqd(ap_lon, ap_lat, radar_lon, radar_lat)
    
    #using x,y dimensions, calculate distance of every grid point from ap_x and ap_y
    x_grid, y_grid = np.meshgrid(ds_eth.x, ds_eth.y)
    dist_grid = np.sqrt((x_grid-ap_x)**2 + (y_grid-ap_y)**2)
    
    #find points within the search radius distance
    search_mask = dist_grid<search_radius

    #extend mask into same coordinate space as netcdf data (repeat into a 3rd time dimension)
    search_mask_time = np.repeat(search_mask[np.newaxis, :, :], len(ds_ref.time), axis=0)
    
    #apply mask to filter reflectivity and ETH
    ds_ref_search = ds_ref.reflectivity.where(search_mask_time, other=0) #replace everything outside search radius with 0
    ref_max = np.max(ds_ref_search, axis=(1,2))
    ds_eth_search = ds_eth.echo_top_heights.where(search_mask_time, other=0) #replace everything outside search radius with 0
    eth_max = np.max(ds_eth_search, axis=(1,2))
    
    #extract radar time
    radar_time_daily = ds_ref.time.data

    #threshold by reflectivity and echo top height
    valid_mask = np.logical_and(ref_max >= min_reflectivity, eth_max >= min_eth)
    
    #check if there's any valid data and return as dictionary
    if np.any(valid_mask):
        #return arrays
        return {'time':radar_time_daily[valid_mask], 'ref':ref_max[valid_mask], 'eth':eth_max[valid_mask]}
    else:
        #no valid data, return nothing
        return None
    

In [5]:
#load airport list
ap_dict = read_csv(airport_csv_fn, header_line=1)

ap_name_list = ap_dict['Name']
ap_lat_list = ap_dict['Latitude']
ap_lon_list = ap_dict['Longitude']
ap_rid_list = ap_dict['radar_id']

In [None]:
#preprocess radar data!

#for each airport
for i, ap_name in enumerate(ap_name_list):
    
    #initialise variables to store our output data
    radar_time = np.array([],dtype='datetime64')
    radar_ref = np.array([])
    radar_eth = np.array([])
    radar_outage_dt = np.array([],dtype='datetime64')
    
    #extract data from ap csv
    radar_id = ap_rid_list[i]
    ap_lat = ap_lat_list[i]
    ap_lon = ap_lon_list[i]
    
    #build chunked list for multiprocessing
    chunked_list  = chunks(date_list, NCPU)
    
    #loop through dates using multiprocessing
    for list_slice in tqdm.tqdm(chunked_list, total=int(len(date_list)/NCPU)):
        #open multiprocessing pool
        with Pool(NCPU) as pool:
            #append additional arguments needed for core function
            args_list = [(radar_id, oneset, ap_lon, ap_lat, ap_name) for oneset in list_slice]
            #use starmap to pass multiple arguments to the core function
            result_list = pool.starmap(extract_radar_data, args_list)
            #compile results
            for result in result_list:
                if result is None:
                    #has returned no valid data
                    continue
                elif type(result) is dict:
                    #has returned some data
                    radar_time = np.append(radar_time, result['time'])
                    radar_ref = np.append(radar_ref, result['ref'])
                    radar_eth = np.append(radar_eth, result['eth'])
                else:
                    #has returned that there was a radar outage
                    radar_outage_dt = np.append(radar_outage_dt, result)
                    
    #save to file
    print('finished', ap_name)
    save_path = out_folder + ap_name + '_radar.npz'
    np.savez(save_path, radar_ref=radar_ref, radar_eth=radar_eth, radar_time=radar_time, radar_outage_dt=radar_outage_dt)

 28%|██▊       | 68/243 [30:29<1:24:52, 29.10s/it]

__By extracting exactly what we need for analysis, we've converted a ~30 GB dataset into 4 x 0.05 MB files.__

This will save us lots of time when analysing the data later.
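
As a minimal sketch of how such a file might be read back for analysis, the example below writes a small `.npz` file with made-up values to a temporary folder and then loads it again. It assumes the array names passed to `np.savez` above; the file name and all values are hypothetical.

```python
import os
import tempfile

import numpy as np

# made-up values standing in for one airport's extracted data
radar_time = np.array(['2018-01-05T04:30:00', '2018-01-05T04:36:00'], dtype='datetime64[s]')
radar_ref = np.array([52.5, 57.0])      # dBZ
radar_eth = np.array([6500.0, 9000.0])  # m
radar_outage_dt = np.array(['2018-01-06'], dtype='datetime64[D]')

with tempfile.TemporaryDirectory() as tmp_dir:
    save_path = os.path.join(tmp_dir, 'example_radar.npz')
    np.savez(save_path, radar_ref=radar_ref, radar_eth=radar_eth,
             radar_time=radar_time, radar_outage_dt=radar_outage_dt)

    # arrays are read from the archive when they are indexed by name
    with np.load(save_path) as data:
        loaded_time = data['radar_time']
        loaded_ref = data['radar_ref']
        loaded_eth = data['radar_eth']

# time of the strongest reflectivity in the example data
strongest_time = loaded_time[np.argmax(loaded_ref)]
print(strongest_time, loaded_ref.max(), loaded_eth.max())
```

If any array was stored with object dtype (which can happen when Python `datetime` objects are appended to a NumPy array), `np.load` needs `allow_pickle=True` to read it.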