# Subsetting MERRA2 data

Before TensorFlow can ingest MERRA2 aerosol data, the data must be preprocessed so that it covers only the region of interest. The following code snippets take a sample MERRA2 aerosol data file and crop it to a lat/lon box of your choosing.

In [None]:
import xarray as xr
import os
import numpy as np
import sys
from glob import glob

Here, we specify the input dataset path *in_merra_path*, the output cropped dataset path *out_path*, and the subset of variables that we want. In this case, the chosen variables represent the mass of the different aerosol species in MERRA2. These five species are: 
   * dust (DU) 
   * organic carbon (OC) 
   * black carbon (BC) 
   * sea salt (SS) 
   * sulfate (SO4). 
   
Other variables, such as the Angstrom exponent and AOD, exist in the dataset. However, we choose the variables that physical intuition suggests are the most likely predictors of the PM2.5 concentration over Houston. Physical reasoning of this kind reduces the dimensionality of the input space to the variables most likely to describe the aerosol/meteorological regime, and it removes potentially redundant variables.

In [None]:
in_merra_path = '/lcrc/group/earthscience/rjackson/MERRA2/2010/*.nc4' 
out_path = '/lcrc/group/earthscience/rjackson/MERRA2/hou_temp/'
    
# Only include the variables we want: one mass variable per aerosol species listed above.
variable_list = [
    "BCSMASS", "DUSMASS25",
    "OCSMASS", "SO4SMASS",
    "SSSMASS25"]

# Create the output directory if it does not exist yet.
os.makedirs(out_path, exist_ok=True)

Here, you specify the domain to crop the input data to using the *ax_extent* variable. The *ax_extent* variable is a 4-member list of the form [*lon_min*, *lon_max*, *lat_min*, *lat_max*]. The domain used here is a 10 degree by 10 degree box surrounding Houston.

In [None]:
ax_extent = [-100, -90, 25, 35]
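
Because the order of the four members is easy to mix up, it can help to check the extent before using it. The following is a minimal sketch, not part of the original workflow: the helper name `extent_to_slices` is hypothetical, and the bounds check assumes longitudes run from -180 to 180, which matches the negative longitudes used for Houston here.

```python
def extent_to_slices(extent):
    """Convert [lon_min, lon_max, lat_min, lat_max] into lon/lat slices.

    Hypothetical helper; assumes longitudes in [-180, 180] and
    latitudes in [-90, 90].
    """
    if len(extent) != 4:
        raise ValueError("extent must have exactly four members")
    lon_min, lon_max, lat_min, lat_max = extent
    if not (-180 <= lon_min < lon_max <= 180):
        raise ValueError("need -180 <= lon_min < lon_max <= 180")
    if not (-90 <= lat_min < lat_max <= 90):
        raise ValueError("need -90 <= lat_min < lat_max <= 90")
    # The keys match the coordinate names used when cropping with .sel().
    return {"lon": slice(lon_min, lon_max), "lat": slice(lat_min, lat_max)}


ax_extent = [-100, -90, 25, 35]
selection = extent_to_slices(ax_extent)
print(selection)
```

The returned dictionary could be passed to xarray as `.sel(**selection)`, so a swapped or out-of-range bound fails early with a clear message.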

Finally, we use *xarray* to crop the data and save the output to another series of netCDF files, one per variable. This code works on either a single file or a series of netCDF files.

In [None]:
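# Open all files matching the path as a single dataset.
inp_ds = xr.open_mfdataset(in_merra_path)
print(inp_ds)
for variable in variable_list:
    out_file = os.path.join(out_path, '%s.nc' % variable)
    # Skip variables that were already written in a previous run.
    if os.path.exists(out_file):
        continue
    print("Processing %s" % variable)
    # Both lon and lat need slice objects; a bare tuple does not select a range.
    in_ds1 = inp_ds[variable].sel(
        lon=slice(ax_extent[0], ax_extent[1]),
        lat=slice(ax_extent[2], ax_extent[3]))
    in_ds1.load()
    in_ds1.to_netcdf(out_file)
    in_ds1.close()
inp_ds.close()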
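
To check what the cropping step does without access to the MERRA2 files, the same `.sel()` call can be tried on a synthetic array. This is only a sketch: the grid below is an assumption (a regular global grid with 0.625 degree longitude and 0.5 degree latitude spacing, filled with zeros), and the variable name is borrowed from `variable_list` purely as a label.

```python
import numpy as np
import xarray as xr

# Assumed regular global grid: 0.625 degrees in longitude, 0.5 degrees in latitude.
lon = np.arange(-180.0, 180.0, 0.625)
lat = np.arange(-90.0, 90.5, 0.5)

# Synthetic stand-in for one aerosol variable (all zeros, no real data).
fake = xr.DataArray(
    np.zeros((lat.size, lon.size)),
    coords={"lat": lat, "lon": lon},
    dims=("lat", "lon"),
    name="DUSMASS25",
)

ax_extent = [-100, -90, 25, 35]
# Label-based slices include both end points when they fall on the grid.
cropped = fake.sel(
    lon=slice(ax_extent[0], ax_extent[1]),
    lat=slice(ax_extent[2], ax_extent[3]),
)
print(dict(cropped.sizes))
print(float(cropped.lon.min()), float(cropped.lon.max()))
print(float(cropped.lat.min()), float(cropped.lat.max()))
```

On this assumed grid the cropped array spans exactly the requested box, because every bound lies on a grid point. On a grid where the bounds fall between points, the slice keeps only the points inside the box.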