# Machine Learning Code

This notebook contains the rough work for a machine learning approach to imputing missing values in the LIS dataset. The ISS moves quickly and passes over our region of interest at irregular times, so LIS coverage there is patchy. It would be helpful to use a proxy dataset to infer where and when lightning occurred while the instrument was not viewing the area of interest.

We will use AIRS cloud-top pressure, which has a strong empirical connection to lightning frequency, as an input variable to train a machine learning algorithm (scikit-learn) to recognize lightning flashes. We divide the LIS flash data into training and validation sets, the first to construct the ML model and the second to test it.

- LIS data source: Non-Quality Controlled Lightning Imaging Sensor (LIS) on International Space Station (ISS) Science Data V2
- AIRS data source: Aqua/AIRS L3 Daily Standard Physical Retrieval (AIRS-only) 1 degree x 1 degree V7.0 at GES DISC
- VIIRS data source: VIIRS/SNPP Cloud Properties Level-3 daily 1x1 degree grid

## Input data processing

The first order of business is to reduce the size and number of files used in this analysis. The daily AIRS files are each 180 MB, and while the LIS files are smaller, there is one per ISS orbit, about 16 each day. Preprocessing the input data cuts down on file I/O and saves everyone a lot of time. Ideally, we'd like to use a month of data from July 2020 for both instruments.


In [2]:
# first let's define the region of interest, roughly the extreme edges of Canada
# except north edge which is limited by highest latitude ISS orbit

latrange = [42, 55]
lonrange = [-141, -52]
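
The same bounds can also be used to cut a gridded field down to the region of interest, which is one way to shrink the daily files further. The sketch below is only an illustration: the global 1-degree grid of cell centres, the random `field`, and the helper `region_mask` are all assumptions made for this example and are not read from any of the data files.

```python
import numpy as np

latrange = [42, 55]
lonrange = [-141, -52]

# Hypothetical global 1-degree grid of cell centres (an assumption for this sketch)
grid_lat = np.arange(-89.5, 90.0, 1.0)    # 180 cell centres
grid_lon = np.arange(-179.5, 180.0, 1.0)  # 360 cell centres

# Stand-in field dimensioned (longitude, latitude)
field = np.random.default_rng(0).random((grid_lon.size, grid_lat.size))

def region_mask(values, bounds):
    """Boolean mask that is True where bounds[0] <= values <= bounds[1]."""
    values = np.asarray(values)
    return (values >= bounds[0]) & (values <= bounds[1])

lat_in = region_mask(grid_lat, latrange)
lon_in = region_mask(grid_lon, lonrange)

# np.ix_ builds an open mesh so both masks are applied in one indexing step
subset = field[np.ix_(lon_in, lat_in)]
print(subset.shape)
```

Only the cells whose centres fall inside the bounds are kept, so the subset is a small fraction of the global grid.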

In [19]:
from netCDF4 import Dataset
import os
import numpy as np

processday="20200714"

# only the raw ISS LIS orbit files (ISS_LIS_*.nc) are read; the LIS-output-* and
# VIIRS-output-* files written by this notebook also end in .nc and contain the
# date string, so they must be skipped when the cell is re-run
flashlat = np.empty([0])
flashlon = np.empty([0])
flashtim = np.empty([0])
vtlat = np.empty([0])
vtlon = np.empty([0])
vtbeg = np.empty([0])
vtend = np.empty([0])
for file in os.listdir("."):
    if file.startswith("ISS_LIS") and file.endswith(".nc") and processday in file:
        print(file)
        lisdata = Dataset(file, mode='r')
        # this is to make sure the file actually has more than zero flashes
        if "lightning_flash_lat" in lisdata.variables.keys():
            flashlat = np.append(flashlat, lisdata.variables['lightning_flash_lat'][:])
            flashlon = np.append(flashlon, lisdata.variables['lightning_flash_lon'][:])
            flashtim = np.append(flashtim, lisdata.variables['lightning_flash_TAI93_time'][:])
        if "viewtime_lat" in lisdata.variables.keys():
            vtlat = np.append(vtlat, lisdata.variables['viewtime_lat'][:])
            vtlon = np.append(vtlon, lisdata.variables['viewtime_lon'][:])
            vtbeg = np.append(vtbeg, lisdata.variables['viewtime_TAI93_start'][:])
            vtend = np.append(vtend, lisdata.variables['viewtime_TAI93_end'][:])
        # release the file handle before moving on to the next orbit
        lisdata.close()

print(len(flashlat))
# now we filter down just to our region of interest (bounds are inclusive)
# a boolean mask replaces the element-by-element loop and repeated np.append calls
inregion = (
    (flashlat >= latrange[0]) & (flashlat <= latrange[1])
    & (flashlon >= lonrange[0]) & (flashlon <= lonrange[1])
)
filtlat = flashlat[inregion]
filtlon = flashlon[inregion]
filttim = flashtim[inregion]

print(filtlat)

# now let's write this to a file for easier use later
outfile = "LIS-output-" + processday + ".nc"
print(outfile)
outfilenc = Dataset(outfile, mode='w')

fl_dim = outfilenc.createDimension('flash',len(filtlat))
vt_dim = outfilenc.createDimension('viewtime',len(vtlat))
fl_lat = outfilenc.createVariable('flash_lat',np.float32,('flash',))
fl_lat.units = 'degrees north'
fl_lon = outfilenc.createVariable('flash_lon',np.float32,('flash',))
fl_lon.units = 'degrees east'
# TAI93 times for 2020 are around 8.7e8 seconds; float32 can only resolve 64 s steps
# at that magnitude, so the time variables are stored as float64
fl_tim = outfilenc.createVariable('flash_time',np.float64,('flash',))
fl_tim.units = 'TAI93 seconds'
vt_lat = outfilenc.createVariable('viewtime_lat',np.float32,('viewtime',))
vt_lat.units = 'degrees north'
vt_lon = outfilenc.createVariable('viewtime_lon',np.float32,('viewtime',))
vt_lon.units = 'degrees east'
vt_beg = outfilenc.createVariable('viewtime_begin',np.float64,('viewtime',))
vt_beg.units = 'TAI93 seconds'
vt_end = outfilenc.createVariable('viewtime_end',np.float64,('viewtime',))
vt_end.units = 'TAI93 seconds'

fl_lat[:] = filtlat
fl_lon[:] = filtlon
fl_tim[:] = filttim
vt_lat[:] = vtlat
vt_lon[:] = vtlon
vt_beg[:] = vtbeg
vt_end[:] = vtend
outfilenc.close()

# now the file LIS-output-YYYYMMDD.nc contains all of the flash lat/lon/time data over Canada for that day
# the viewtime variables hold the start and end times at which the satellite viewed each location
# (they are not filtered to the region), which might be useful for determining where lightning isn't


ISS_LIS_SC_V2.1_20200714_204918_FIN.nc
ISS_LIS_SC_V2.1_20200714_113205_FIN.nc
ISS_LIS_SC_V2.1_20200714_034744_FIN.nc
ISS_LIS_SC_V2.1_20200714_065328_FIN.nc
ISS_LIS_SC_V2.1_20200714_143749_FIN.nc
ISS_LIS_SC_V2.1_20200714_130457_FIN.nc
ISS_LIS_SC_V2.1_20200714_021451_FIN.nc
ISS_LIS_SC_V2.1_20200714_235503_FIN.nc
ISS_LIS_SC_V2.1_20200714_052036_FIN.nc
ISS_LIS_SC_V2.1_20200714_082620_FIN.nc
ISS_LIS_SC_V2.1_20200714_004159_FIN.nc
ISS_LIS_SC_V2.1_20200714_235940_FIN.nc
ISS_LIS_SC_V2.1_20200714_161042_FIN.nc
ISS_LIS_SC_V2.1_20200714_174334_FIN.nc
ISS_LIS_SC_V2.1_20200714_222211_FIN.nc
ISS_LIS_SC_V2.1_20200714_191626_FIN.nc
ISS_LIS_SC_V2.1_20200714_095913_FIN.nc
3779
[42.00299454 42.1256485  42.02429581 42.01006699 42.00909805 42.00284958
 42.22993469 42.00469971 42.00111389 42.0381012  42.21572113 42.19789505
 42.15224075 42.31695175 42.12429428 42.1749382  42.24757004 42.24080276
 42.08692932 42.22729492 42.18551254 42.26767349 42.21121216 42.21268845
 42.10673141 42.16991043 42.24951172 42.
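
One detail worth checking when writing the times out is the floating-point type. TAI93 values for July 2020 are close to 8.7e8 seconds, and a 32-bit float cannot separate values closer than 64 seconds at that magnitude. The sketch below illustrates this with an approximate timestamp; it ignores leap seconds, which is an approximation made only for this example.

```python
from datetime import datetime
import numpy as np

# Approximate TAI93 value for the processed day: seconds since 1993-01-01,
# ignoring leap seconds (an approximation made only for this illustration)
approx_tai93 = (datetime(2020, 7, 14) - datetime(1993, 1, 1)).total_seconds()

# Gap between adjacent representable values at this magnitude
step32 = float(np.spacing(np.float32(approx_tai93)))
step64 = float(np.spacing(np.float64(approx_tai93)))
print(approx_tai93, step32, step64)

# Two flashes half a second apart collapse to one value in float32
t = np.array([approx_tai93 + 10.0, approx_tai93 + 10.5])
n_distinct32 = np.unique(t.astype(np.float32)).size
n_distinct64 = np.unique(t.astype(np.float64)).size
print(n_distinct32, n_distinct64)
```

Storing flash and viewtime times as 64-bit floats avoids this loss, at the cost of a few extra bytes per value.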

In [None]:
#import h5py
#import xarray as xr

#sample_airs_file = 'AIRS.2020.07.05.L3.RetStd_IR001.v7.0.4.0.G20333031235.hdf'
#airsdata = h5py.File(sample_airs_file, 'r')
#airsdata = SD(sample_airs_file, SDC.READ)
#airsdata = xr.open_dataset(sample_airs_file)

processday="20200712"
sample_viirs_file = 'CLDPROP_D3_VIIRS_NOAA20.A2020193.011.2020336152527.nc'
vdata = Dataset(sample_viirs_file, mode='r')
vlat = vdata.variables['latitude'][:]
vlon = vdata.variables['longitude'][:]
cthgroup = vdata.groups['Cloud_Top_Height']
vcth = cthgroup.variables['Mean'][:,:]

print(vdata.ncattrs())
vtime = getattr(vdata,'time_coverage_start')
print(vtime)

#print(vlat)
#print(vlon)
#print(len(vlat))

# check that vtime contains the same date as processday

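# all input values are now in memory, so the VIIRS input file can be closed
vdata.close()

# now let's write this to a file for easier use later
# (the LIS output file was already closed in the previous cell, and closing it
# a second time raises an error, so there is no outfilenc.close() here)
outfile = "VIIRS-output-" + processday + ".nc"
print(outfile)
outfilenc = Dataset(outfile, mode='w')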
vx_dim = outfilenc.createDimension('longitude',len(vlon))
vy_dim = outfilenc.createDimension('latitude',len(vlat))
v_lat = outfilenc.createVariable('latitude',np.float32,('latitude',))
v_lat.units = 'degrees north'
v_lon = outfilenc.createVariable('longitude',np.float32,('longitude',))
v_lon.units = 'degrees east'
v_cth = outfilenc.createVariable('cloud_top_height',np.float32,('longitude','latitude',))
v_cth.units = 'm'

v_lat[:] = vlat
v_lon[:] = vlon
v_cth[:] = vcth

outfilenc.close()
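
The date check noted in the comments above could look something like the following. This is a sketch under assumptions: `filename_date`, `coverage_start_date`, and `matches_processday` are hypothetical helpers, the `AYYYYDDD` filename token is read as year plus day-of-year, and the `time_coverage_start` attribute is assumed to begin with an ISO 8601 date (this has not been verified against the file). The check seems worth having, since the token `A2020193` in the sample filename decodes to 11 July 2020, while `processday` in the cell above is `20200712`.

```python
from datetime import datetime

def filename_date(filename):
    """Decode the 'AYYYYDDD' (year + day-of-year) token from a VIIRS-style filename."""
    for token in filename.split("."):
        if len(token) == 8 and token[0] == "A" and token[1:].isdigit():
            return datetime.strptime(token[1:], "%Y%j").date()
    raise ValueError("no AYYYYDDD token found in " + filename)

def coverage_start_date(time_coverage_start):
    """Assumes the attribute starts with an ISO 8601 date such as '2020-07-11T...'."""
    return datetime.strptime(time_coverage_start[:10], "%Y-%m-%d").date()

def matches_processday(date_value, processday):
    """True when a date equals the YYYYMMDD string used as processday."""
    return date_value.strftime("%Y%m%d") == processday

sample_viirs_file = 'CLDPROP_D3_VIIRS_NOAA20.A2020193.011.2020336152527.nc'
file_date = filename_date(sample_viirs_file)
print(file_date, matches_processday(file_date, "20200712"))
```

Deriving `processday` from the file itself, rather than typing it by hand, would remove the possibility of a mismatch altogether.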



At this point, we have two sets of daily files: "LIS-output-YYYYMMDD.nc" and "VIIRS-output-YYYYMMDD.nc". I've processed two weeks of them, but we could get more as needed by processing the raw data from NASA EarthData with the code above. What we'd like is to feed this data into a machine-learning algorithm (something like `sklearn.neural_network.MLPClassifier`). Judging from some rudimentary examples, the easiest (or at least most common) way to pass the data to these routines is to convert it into a pandas DataFrame.
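
One possible layout for that DataFrame is one row per 1-degree grid cell, holding the cell coordinates, the cloud-top height, and a 0/1 label for whether any flash fell in the cell. The sketch below uses synthetic arrays in place of the file contents; the column names, the binning with `np.histogram2d`, and the (longitude, latitude) array order are assumptions for illustration. It also labels every flash-free cell as 0, which is too generous: cells that LIS never viewed should really be excluded, which is where the viewtime variables could come in.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for one day's file contents (assumptions for this sketch)
lat_centres = np.arange(42.5, 55.0, 1.0)      # 13 cell centres
lon_centres = np.arange(-140.5, -52.0, 1.0)   # 89 cell centres
cth = rng.uniform(500.0, 15000.0, size=(lon_centres.size, lat_centres.size))
flash_lat = np.array([42.3, 42.7, 50.1])
flash_lon = np.array([-100.2, -100.9, -75.5])

# Count flashes per cell; bin edges sit half a degree either side of the centres
lon_edges = np.append(lon_centres - 0.5, lon_centres[-1] + 0.5)
lat_edges = np.append(lat_centres - 0.5, lat_centres[-1] + 0.5)
counts, _, _ = np.histogram2d(flash_lon, flash_lat, bins=[lon_edges, lat_edges])

# One row per grid cell: coordinates, the predictor, and a binary label
lon2d, lat2d = np.meshgrid(lon_centres, lat_centres, indexing="ij")
df = pd.DataFrame({
    "lat": lat2d.ravel(),
    "lon": lon2d.ravel(),
    "cloud_top_height": cth.ravel(),
    "flash": (counts.ravel() > 0).astype(int),
})
print(df.head())
print(df["flash"].sum(), "of", len(df), "cells contain at least one flash")
```

Daily tables built this way can be stacked with `pd.concat` to form the full training set.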


In [4]:
import pandas as pd
from netCDF4 import Dataset

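# loop over the two weeks of processed files, 1-14 July 2020 inclusive
# (range() excludes its end value, and integer dates break across month
# boundaries, so pd.date_range is used to generate the days instead)
for day in pd.date_range("2020-07-01", "2020-07-14"):
    datestr = day.strftime("%Y%m%d")
    lisfilestr = "LIS-output-" + datestr + ".nc"
    viirsfilestr = "VIIRS-output-" + datestr + ".nc"
    lisdata = Dataset(lisfilestr, "r")
    # the variables below are numpy arrays, which can be converted/input into a pandas DataFrame
    flashlat = lisdata.variables['flash_lat'][:]
    flashlon = lisdata.variables['flash_lon'][:]
    lisdata.close()
    viirsdata = Dataset(viirsfilestr, "r")
    viirslat = viirsdata.variables['latitude'][:]
    viirsdata.close()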
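
Once the data are in a DataFrame, the training and validation step might look roughly like the following. This is a minimal sketch on synthetic data, not a tuned model: the table contents, the single `cloud_top_height` feature, the hidden layer size, and the 75/25 split are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic table (an assumption): taller clouds are made more likely to flash
n = 2000
cth = rng.uniform(500.0, 15000.0, size=n)
flash = (rng.random(n) < cth / 20000.0).astype(int)
df = pd.DataFrame({"cloud_top_height": cth, "flash": flash})

X = df[["cloud_top_height"]]
y = df["flash"]

# Hold out a quarter of the rows for validation, keeping the class balance
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale the input first; neural networks are sensitive to feature magnitude
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
)
model.fit(X_train, y_train)
val_accuracy = model.score(X_val, y_val)
print("validation accuracy:", val_accuracy)
```

With real data the flash class is likely to be much rarer than the no-flash class, so accuracy alone may be misleading and a metric such as recall on the flash class is probably worth reporting as well.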
