# Regridding CO:
    
This notebook will generate the results in the CO column of Table 1, with the exception of the last two rows, as well as Figure 8, the data needed to get the results in the first row of Table 5, and in Table 10. Make sure to replace any filepaths with the appropriate filepaths on your machine.


First, we import all relevant libraries.

In [None]:
import sys
import os
import netCDF4 as ntf
from pyhdf.SD import SD, SDC
import numpy as np
import math
from netCDF4 import Dataset 
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import cartopy.crs as ccrs
import cartopy
import xesmf as xmf
import xarray as xr
import glob

Next, we load in the raw dataset, where all the files that were downloaded are kept in one directory,
and then extract the desired data into an array.

In [None]:
raw_co = xr.open_mfdataset("/data0/rm3873/co_data/*.nc4",concat_dim='TIMERANGE',combine='nested')

In [None]:
co_data = np.array(raw_co["CO_dof_A"])

Here, we set up our parameters: the bounds of the desired region, the size of the target grid, filepaths, the latitude stride for the regridding, the selected days, the scale factor to apply to the data, and the key missing-aggression parameters.

In [None]:
binary_path = '/data0/rm3873/co_binary_regridded_v2.nc'
land_mask_path = '/data0/zzheng/GEOS-Chem-grid/land_mask.nc'

lat_st = 6
lat_end = 36.25
lat_siz = 0.25
lon_st = 68.125
lon_end = 97.8126
lon_siz = 0.3125
lat_start_idx = 0
lat_end_idx = len(co_data[0])
lon_start_idx = 0
lon_end_idx = len(co_data[0][0])
target_grid_lats = 121
target_grid_lons = 96
orig_grid_size = 0.25
SCALE_FACTOR = 1
day_start = 1
day_end = 366

ratio = int(target_grid_lats*target_grid_lons/(lat_end_idx*lon_end_idx))
missing_aggression_over = 1
missing_aggression_under = 0
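
As a quick sanity check on these parameters, the sketch below confirms that the coordinate ranges produce 121 latitudes and 96 longitudes, and that the per-cell strides used in the regridding loop tile the target grid exactly. It assumes a 31 x 31 source grid (the size used in the missing-data check), and the names target_lats, target_lons, lat_strides and lon_strides are illustrative only.

```python
import numpy as np

# Target-grid coordinates built from the same bounds and step sizes as above
target_lats = np.arange(6, 36.25, 0.25)
target_lons = np.arange(68.125, 97.8126, 0.3125)

# Assumed 31 x 31 source grid: 28 cells use one stride, the last 3 use the other
lat_strides = [4] * 28 + [3] * 3
lon_strides = [3] * 28 + [4] * 3

print(len(target_lats), sum(lat_strides))
print(len(target_lons), sum(lon_strides))
```

If either pair of numbers disagrees, the boxes written during regridding would not cover the target grid.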

Let's check the percent of missing data in the raw grid.

In [None]:
# Percentage of cells that are missing (NaN) across all days of the raw grid
np.count_nonzero(np.isnan(co_data))/co_data.size*100
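
For reference, here is a minimal sketch of the same kind of bookkeeping on a toy array, showing how the missing and present percentages relate. The array toy and the names pct_missing and pct_present are made up for illustration.

```python
import numpy as np

# Toy 2 x 2 grid with three missing cells
toy = np.array([[1.0, np.nan],
                [np.nan, np.nan]])

# The mean of a boolean mask is the fraction of True entries
pct_missing = np.isnan(toy).mean() * 100
pct_present = 100 - pct_missing
print(pct_missing, pct_present)
```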

Next, we allocate the empty target grid, initialized to missing values.

In [None]:
regridded = np.full((day_end-day_start,target_grid_lats,target_grid_lons), np.nan)

Let's set up the regridded NetCDF file with our desired dimensions. The 'co' variable will hold the regridded data.

In [None]:
land_mask = Dataset(land_mask_path,mode='r',format='NETCDF4_CLASSIC')
ncfile = Dataset(binary_path,mode='w',format='NETCDF4_CLASSIC') 
lat_dim = ncfile.createDimension('lat', target_grid_lats)     
lon_dim = ncfile.createDimension('lon', target_grid_lons)
time = ncfile.createDimension('time',day_end-day_start)

lat = ncfile.createVariable('lat', np.float32, ('lat',))
lat.units = 'degrees_north'
lat.long_name = 'latitude'
lon = ncfile.createVariable('lon', np.float32, ('lon',))
lon.units = 'degrees_east'
lon.long_name = 'longitude'
time = ncfile.createVariable('time', np.float64, ('time',))
time.units = 'days of 2015'
time.long_name = 'days_of_the_year'
# Define a 3D variable to hold the data
co = ncfile.createVariable('co',np.float64,('time','lat','lon')) # note: time is the leftmost dimension

lat[:] = np.arange(lat_st,lat_end,lat_siz)
lon[:] = np.arange(lon_st,lon_end,lon_siz)
time[:] = np.arange(day_start,day_end)

Now for the core regridding process!

Let's build a lookup table for the missing data in the current grid using a nearest-neighbor method: each missing cell gets the value of its nearest non-missing cell in the same day, measured by the L2 norm on grid indices. The fill is written to a separate array so that the original data are left unmodified.

In [None]:
filled_raw_co = co_data.copy()  # copy, so the original array is left untouched
for day in range(len(filled_raw_co)):
    print(day)
    for i in range(len(filled_raw_co[day])):
        for j in range(len(filled_raw_co[day][i])):

            # Only originally missing cells need to be filled
            if(not np.isnan(co_data[day][i][j])):
                continue

            min_dist = np.inf
            for k in range(len(co_data[day])):
                for l in range(len(co_data[day][k])):

                    # Only originally non-missing cells can act as donors
                    if(np.isnan(co_data[day][k][l])):
                        continue

                    cur_dist = np.sqrt((k-i)**2 + (l-j)**2)
                    if(cur_dist < min_dist):
                        filled_raw_co[day][i][j] = co_data[day][k][l]
                        min_dist = cur_dist
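
The nested loops above are slow for a full year of data. As a rough sketch of the same nearest-neighbor idea, the snippet below fills a single 2D day with vectorized NumPy operations. The function fill_nearest and the toy grid are hypothetical, not part of the original notebook, and ties are broken in favor of the first donor in row-major order.

```python
import numpy as np

def fill_nearest(day_grid):
    """Return a copy of a 2D grid with each NaN replaced by the nearest non-NaN value."""
    filled = day_grid.copy()
    valid = np.argwhere(~np.isnan(day_grid))    # (n_valid, 2) donor indices
    missing = np.argwhere(np.isnan(day_grid))   # (n_missing, 2) indices to fill
    if len(valid) == 0 or len(missing) == 0:
        return filled
    # Squared L2 distance between every missing cell and every donor cell
    d2 = ((missing[:, None, :] - valid[None, :, :]) ** 2).sum(axis=2)
    nearest = valid[d2.argmin(axis=1)]
    filled[missing[:, 0], missing[:, 1]] = day_grid[nearest[:, 0], nearest[:, 1]]
    return filled

toy = np.array([[1.0, np.nan, np.nan],
                [np.nan, np.nan, 5.0],
                [np.nan, np.nan, np.nan]])
filled_toy = fill_nearest(toy)
print(filled_toy)
```

Applying such a function to each day in turn would give an array playing the role of filled_raw_co, while the input stays unchanged.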

For each cell of the original grid, the algorithm considers the box of target-grid cells that the cell maps to. If the original value is missing, we take the nearest non-missing value obtained in the previous step; otherwise we take the existing value. This value serves as the mean of a normal distribution whose variance is the ratio (which we set to the ratio of the grid sizes) times the variance of the original data, and we sample from this distribution to fill the box that represents the extrapolated region. The interpolation process used for the other regriddings leaves target cells missing depending on how the number of missing source values compares to a threshold. To honor that behavior, we insert some missing values both when the original value is present (representing the number of missing values being under the threshold) and when it is missing (the number of present values being below the threshold). The missing_aggression_under and missing_aggression_over values determine what fraction of the target box is randomly selected to be missing.

For the results in the report, we simplify this process by using a variance of 0, a missing-over threshold of 1, and a missing-under threshold of 0. In other words, we simply copy the original value into every cell of the target box, even if that value is missing.
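
To make the general (non-simplified) procedure concrete, here is a minimal sketch of filling one target box. The function sample_target_box and its parameter missing_fraction are hypothetical stand-ins, with missing_fraction playing the role of missing_aggression_over or missing_aggression_under. Note that NumPy's normal sampler expects a standard deviation, so the sketch takes the square root of the variance.

```python
import numpy as np

def sample_target_box(mean, variance, shape, missing_fraction, rng):
    """Sample a target box around `mean`, then blank out a fraction of its cells."""
    # rng.normal takes a standard deviation, so convert the variance first
    box = rng.normal(loc=mean, scale=np.sqrt(variance), size=shape)
    n_missing = int(missing_fraction * box.size)
    if n_missing > 0:
        flat = box.ravel()  # view onto box, so the edits below propagate
        drop = rng.choice(box.size, size=n_missing, replace=False)
        flat[drop] = np.nan
    return box

rng = np.random.default_rng(0)
present_box = sample_target_box(2.5, 0.0, (4, 3), 0.0, rng)   # plain copy of the value
missing_box = sample_target_box(2.5, 0.0, (4, 3), 1.0, rng)   # entirely missing
partial_box = sample_target_box(2.5, 0.1, (4, 3), 0.25, rng)  # noisy, 3 cells dropped
```

With a variance of 0 and fractions of 0 and 1, the first two calls reproduce the simplified behavior used for the report.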

In [None]:
for day in range(day_start,day_end):
    print('Day: ' + str(day))
    data = co_data[day-1]
    var = ratio * np.nanvar(data)
    
    ct_lat = 0
    lat_str = 4
    i=lat_start_idx

    while(i < lat_end_idx):
        # The last three source rows map to 3 target rows each instead of 4 (28*4 + 3*3 = 121)
        if(i == 28):
            lat_str = 3
            
        j = lon_start_idx
        ct_lon = 0
        lon_str = 3
        
        while(j < lon_end_idx):
            if(j == 28):
                lon_str = 4

            # Source value to replicate into the target box
            cur_to_extrapolate = data[i][j]
            
            # Build the row/column indices of the target box covered by this source cell
            to_add = None
            cur_row = [m for m in range(ct_lat,ct_lat+lat_str)] 
            cur_col = [n for n in range(ct_lon,ct_lon+lon_str)]
            cur_box_idx = np.ix_(cur_row,cur_col)
            
            if(np.isnan(cur_to_extrapolate)):
                #to_add = np.random.normal(filled_raw_co[day-1][i][j],var,(lat_str,lon_str))
                #num_miss = int(ratio*missing_aggression_over)
                #mark_miss = np.random.choice(len(to_add)*len(to_add[0]),num_miss,replace=False)
                #to_add = to_add.flatten()
                #to_add[mark_miss] = np.nan
                #to_add = to_add.reshape((lat_str,lon_str))
                to_add = np.full((lat_str,lon_str), np.nan)
            else:
                #to_add = np.random.normal(cur_to_extrapolate,var,(lat_str,lon_str))
                #num_miss = int(ratio*missing_aggression_under)
                #mark_miss = np.random.choice(len(to_add)*len(to_add[0]),num_miss,replace=False)
                #to_add = to_add.flatten()
                #to_add[mark_miss] = np.nan
                #to_add = to_add.reshape((lat_str,lon_str))
                to_add = np.full((lat_str,lon_str), cur_to_extrapolate)
            

            # Write the box into the target grid allocated earlier
            regridded[day-1][cur_box_idx] = to_add
            ct_lon = ct_lon + lon_str
            j = j + 1


        ct_lat = ct_lat + lat_str
        i = i + 1
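
In the simplified setting, where each source value is just repeated over its target box, the same result can be sketched with np.repeat. The stride arrays below assume a 31 x 31 source grid and mirror the strides used in the loop. The names replicate_day, lat_strides and lon_strides are illustrative only.

```python
import numpy as np

# Per-source-cell strides: 4,4,...,3,3,3 in latitude and 3,3,...,4,4,4 in longitude
lat_strides = np.array([4] * 28 + [3] * 3)
lon_strides = np.array([3] * 28 + [4] * 3)

def replicate_day(day_grid):
    """Repeat each source cell over its target box; NaNs are repeated too."""
    rows = np.repeat(day_grid, lat_strides, axis=0)
    return np.repeat(rows, lon_strides, axis=1)

toy_day = np.arange(31 * 31, dtype=float).reshape(31, 31)
toy_day[0, 0] = np.nan
toy_regridded = replicate_day(toy_day)
print(toy_regridded.shape)
```

This is only equivalent to the loop when the variance is 0 and the missing thresholds are 1 and 0. The sampled variant still needs per-box handling.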

Let's read in the simulated data (already matched to the GEOS-Chem target grid) and copy each species into a dictionary keyed by species name.

In [None]:
raw_emission = xr.open_dataset("/data0/rm3873/dsi_india/daily_emission.nc").sel(lat=slice(6,36),lon=slice(68,98))
raw_gas = xr.open_dataset("/data0/rm3873/dsi_india/daily_gas_column.nc").sel(lat=slice(6,36),lon=slice(68,98))
raw_pm = xr.open_dataset("/data0/rm3873/dsi_india/daily_surface_pm25_RH50.nc").sel(lat=slice(6,36),lon=slice(68,98))
raw_met = xr.open_dataset("/data0/rm3873/dsi_india/daily_meteo.nc").sel(lat=slice(6,36),lon=slice(68,98))
raw_aod = xr.open_dataset("/data0/rm3873/dsi_india/daily_aod.nc").sel(lat=slice(6,36),lon=slice(68,98))
raw_emission["EmisDST_Natural"] = raw_emission["EmisDST1_Natural"] + raw_emission["EmisDST2_Natural"] + raw_emission["EmisDST3_Natural"] + raw_emission["EmisDST4_Natural"]
feature_ml = [
    {"PM25":[]},
    {'CO_trop':[], 'SO2_trop':[], 'NO2_trop':[], 'CH2O_trop':[], 'NH3_trop':[]},
     {'AOT_C':[], 'AOT_DUST_C':[]},
    {'T2M':[], 'PBLH':[], 'U10M':[], 'V10M':[], 'PRECTOT':[], 'RH':[]},
    {'EmisDST_Natural':[], 
                'EmisNO_Fert':[], 'EmisNO_Lightning':[], 'EmisNO_Ship':[], 'EmisNO_Soil':[]}]
sets = [raw_pm,raw_gas,raw_aod,raw_met,raw_emission]

for i in range(len(sets)):
    for spec in feature_ml[i]:
        print(spec)
        cur_set = sets[i][spec]
        
        # Materialize the (time, lat, lon) DataArray as a writable NumPy array
        feature_ml[i][spec] = np.array(cur_set)

Now we can simply apply our CO regridded missing mask onto all these datasets!

In [None]:
missing_list = np.argwhere(np.isnan(regridded))
missing_vals = [np.nan] * len(missing_list)  # np.NaN was removed in NumPy 2.0
for j in range(len(sets)):
    for spec in feature_ml[j]:
        print(spec)
        feature_ml[j][spec][tuple(np.transpose(missing_list))] = missing_vals
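
The index-list approach above can also be expressed with a boolean mask. The following is a small sketch on toy arrays. mask_source stands in for the regridded CO array and feature for one simulated species, and both names are made up for illustration.

```python
import numpy as np

# Toy (time, lat, lon) arrays with matching shapes
mask_source = np.array([[[1.0, np.nan],
                         [3.0, 4.0]]])
feature = np.array([[[10.0, 20.0],
                     [30.0, 40.0]]])

# Copy first so the unmasked feature stays available for comparison
masked_feature = feature.copy()
masked_feature[np.isnan(mask_source)] = np.nan
print(masked_feature)
```

This relies on the two arrays having identical shapes, which is why the simulated data are subset to the same target grid first.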

Finally, let's write all these datasets, now with the CO missing mask applied to them, to disk!

In [None]:
for i in range(len(feature_ml)):
    for spec in feature_ml[i]:
        fname = '/data0/rm3873/custom_regridded_no2_' + str(spec) + '.nc'
        ncfile = Dataset(fname,mode='w',format='NETCDF4_CLASSIC') 
        lat_dim = ncfile.createDimension('lat', 121)     
        lon_dim = ncfile.createDimension('lon', 96)
        time = ncfile.createDimension('time',365)

        lat = ncfile.createVariable('lat', np.float32, ('lat',))
        lat.units = 'degrees_north'
        lat.long_name = 'latitude'
        lon = ncfile.createVariable('lon', np.float32, ('lon',))
        lon.units = 'degrees_east'
        lon.long_name = 'longitude'
        time = ncfile.createVariable('time', np.float64, ('time',))
        time.units = 'days of 2015'
        time.long_name = 'days_of_the_year'
        # Define a 3D variable to hold the data
        key_var = ncfile.createVariable(spec,np.float64,('time','lat','lon'))

        lat[:] = np.arange(6,36.25,0.25)
        lon[:] = np.arange(68.125,97.8126,0.3125)
        time[:] = np.arange(1,366)
        key_var[::] = feature_ml[i][spec]
        ncfile.close()