# Machine Learning Preprocessing

Here, we preprocess the missing-regridded datasets to generate the training and testing datasets for the machine learning framework.

Import the necessary libraries.

In [1]:
import xarray as xr
import pandas as pd
import numpy as np
import gc

Set the filepath prefix of the missing-regridded file set that you generated and plan to use. Then we list the features to use, open the corresponding file for each one, and merge them into a single dataset.

In [7]:
prefix = "/data0/rm3873/aod_30_"

In [8]:
feature_ml = {
    "pm":["PM25"],
    "gas":['CO_trop', 'SO2_trop', 'NO2_trop', 'CH2O_trop', 'NH3_trop'],
    "aod":['AOT_C', 'AOT_DUST_C'],
    "met":['T2M', 'PBLH', 'U10M', 'V10M', 'PRECTOT', 'RH'],
    "emission":['EmisDST_Natural', 
                'EmisNO_Fert', 'EmisNO_Lightning', 'EmisNO_Ship', 'EmisNO_Soil'],
}
# Flatten the grouped feature names into a single list
feature_ls = [name for group in feature_ml.values() for name in group]
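
The flattening step and the file naming that follows can be sketched in isolation. This is a minimal illustration, not the notebook's own code: `feature_groups` is a shortened, hypothetical copy of the feature dictionary, and `example_prefix` is a made-up path used only to show the naming pattern.

```python
from itertools import chain

# Hypothetical, shortened copy of the feature dictionary (illustration only)
feature_groups = {
    "pm": ["PM25"],
    "aod": ["AOT_C", "AOT_DUST_C"],
    "met": ["T2M", "PBLH"],
}

# Flatten the per-group lists into one list, keeping dictionary order
flat_features = list(chain.from_iterable(feature_groups.values()))

# Hypothetical prefix; each feature maps to one file named <prefix><feature>.nc
example_prefix = "/tmp/example_"
file_names = [example_prefix + name + ".nc" for name in flat_features]
```

`chain.from_iterable` avoids the repeated list copies that `sum(list_of_lists, [])` makes, although for a handful of short lists the difference is negligible.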

In [9]:
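# One NetCDF file per feature, named <prefix><feature>.nc
fname_list = [prefix + spec + '.nc' for spec in feature_ls]
# Open each file and merge the per-feature datasets into one
ds = xr.merge([xr.open_dataset(fname) for fname in fname_list])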

We randomly select 60 days of the year as the testing data; the remaining 305 days form the training data.

In [10]:
# Draw 60 distinct days from 1..365 for testing
test_idxs = np.random.choice(range(1,366), 60, replace=False)
# All remaining days are used for training
train_idxs = [i for i in range(1,366) if i not in test_idxs]
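
Because the draw is not seeded, the train/test split differs on every run. If a repeatable split is wanted, one option is a seeded NumPy generator. The sketch below assumes a fixed seed is acceptable; `SEED`, `N_TEST_DAYS`, `days`, `test_days`, and `train_days` are hypothetical names that do not appear in the notebook.

```python
import numpy as np

SEED = 0           # hypothetical seed; any fixed integer gives a repeatable split
N_TEST_DAYS = 60   # same test-set size as in the cell above

rng = np.random.default_rng(SEED)
days = np.arange(1, 366)  # day indices 1..365

# Sample the test days without replacement, then take the complement for training
test_days = np.sort(rng.choice(days, size=N_TEST_DAYS, replace=False))
train_days = np.setdiff1d(days, test_days)
```

`np.setdiff1d` returns the sorted days that are not in `test_days`, so the two arrays are disjoint and together cover all 365 days.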

We generate the final training and testing data and write them to disk as gzip-compressed Parquet files.

In [11]:
# Select the training days, flatten to a table, and drop rows with any missing value
train = ds.sel(indexers={'time':train_idxs}).to_dataframe().reset_index().dropna()
train.to_parquet(prefix + 'missing_train_v2.gzip', compression='gzip')

# Same steps for the testing days
test = ds.sel(indexers={'time':test_idxs}).to_dataframe().reset_index().dropna()
test.to_parquet(prefix + 'missing_test_v2.gzip', compression='gzip')
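
Since `dropna()` is called without arguments, a row is discarded as soon as any one of its columns is missing. The toy table below sketches that effect; the column values are made up for illustration, and `frame`, `complete`, and `kept_fraction` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up table in the same long layout: one row per (time, lat) point
frame = pd.DataFrame({
    "time": [1, 1, 2, 2],
    "lat": [10.0, 20.0, 10.0, 20.0],
    "PM25": [12.5, np.nan, 8.1, 9.3],
    "AOT_C": [0.21, 0.30, np.nan, 0.18],
})

# Keep only rows where every column is present
complete = frame.dropna()

# Fraction of rows that survive; worth checking before training
kept_fraction = len(complete) / len(frame)
```

Here two of the four rows survive, because each dropped row is missing a different feature. On real data it can be worth printing this fraction for the training and testing frames to see how much the missing values reduce the sample.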