# MODIS Pipeline


> In this tutorial, we walk through how to download MODIS data and prepare it for downstream machine learning work.
> We will:
> 1) download the data
> 2) harmonize the data
> 3) create patches that are ready for ML consumption.

In [1]:
import autoroot
import os
from dotenv import load_dotenv
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import rasterio
import cartopy
import cartopy.crs as ccrs
import cartopy.feature as cfeature
from cartopy.mpl.gridliner import LONGITUDE_FORMATTER, LATITUDE_FORMATTER
xr.set_options(
    keep_attrs=True, 
    display_expand_data=False, 
    display_expand_coords=False, 
    display_expand_data_vars=False, 
    display_expand_indexes=False
)
np.set_printoptions(threshold=10, edgeitems=2)

import seaborn as sns
sns.reset_defaults()
sns.set_context(context="talk", font_scale=1.0)

%matplotlib inline

***

## Download

Firstly, we need to download the data.

### Save Directory

This is arguably the most important part.
We need to define where we want to save the data.

We use the `autoroot` package to automatically locate the project root directory, so paths do not have to be set by hand.


In [2]:
root_dir = autoroot.root

**Note**: The data is very large, so make sure you have adequate disk space.

In [3]:
load_dotenv()  # read variables from a local .env file, if one exists
save_dir = os.getenv("ITI_DATA_SAVEDIR")
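
Since `os.getenv` returns `None` when a variable is unset, it can help to validate the save directory before starting a large download. The sketch below is one way to do that with the standard library. The helper name `prepare_save_dir` and the 50 GB default threshold are illustrative assumptions, not part of `rs_tools`.

```python
import shutil
from pathlib import Path


def prepare_save_dir(path, min_free_gb=50.0):
    """Create `path` if needed and check that enough disk space is free.

    Hypothetical helper: `min_free_gb` is an arbitrary, illustrative threshold.
    """
    if not path:
        raise ValueError("Save directory is not set (is ITI_DATA_SAVEDIR defined?)")
    save_path = Path(path).expanduser()
    save_path.mkdir(parents=True, exist_ok=True)
    # shutil.disk_usage reports total, used, and free bytes for the filesystem
    free_gb = shutil.disk_usage(save_path).free / 1e9
    if free_gb < min_free_gb:
        raise RuntimeError(
            f"Only {free_gb:.1f} GB free at {save_path}, need {min_free_gb} GB"
        )
    return save_path
```

It could then be called as `save_dir = prepare_save_dir(os.getenv("ITI_DATA_SAVEDIR"))`, which fails early with a readable message instead of partway through a download.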

### Account

We use the NASA data registry, which hosts all of the datasets.
We access it through the [EarthAccess](https://earthaccess.readthedocs.io/en/stable/tutorials/getting-started/) API, which lets us easily download data from Python.

**Warning**: You **must** have a [NASA EarthData](https://urs.earthdata.nasa.gov) account to download the data.
Please follow the link to register for one.
There are several ways to authenticate.
We recommend logging in once and storing your credentials in your local `~/.netrc` file, or alternatively setting the `EARTHDATA_USERNAME` and `EARTHDATA_PASSWORD` variables in your `.env` file.
See these [instructions](https://earthaccess.readthedocs.io/en/stable/tutorials/getting-started/#auth) for more information.
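
Before attempting a download, it can be useful to check which of the two credential sources is actually available. The following is a minimal standard-library sketch. The function name `find_earthdata_credentials` is hypothetical, and the host name is taken from the registration link above. The returned strings are chosen to match the `"environment"` and `"netrc"` strategy names that `earthaccess.login` accepts.

```python
import netrc
import os
from pathlib import Path

EARTHDATA_HOST = "urs.earthdata.nasa.gov"


def find_earthdata_credentials(environ=None, netrc_path=None):
    """Return "environment", "netrc", or None, depending on where credentials are found."""
    environ = os.environ if environ is None else environ
    # Environment variables take priority when both are set
    if environ.get("EARTHDATA_USERNAME") and environ.get("EARTHDATA_PASSWORD"):
        return "environment"
    netrc_file = Path(netrc_path) if netrc_path else Path.home() / ".netrc"
    try:
        entry = netrc.netrc(str(netrc_file)).authenticators(EARTHDATA_HOST)
    except (OSError, netrc.NetrcParseError):
        # Missing, unreadable, or malformed netrc file
        entry = None
    return "netrc" if entry else None
```

If the function returns `None`, neither source is configured and an interactive login would be needed.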

### Config

We have a configuration file which features some of the options available for downloading data.
One can take a peek using the command below.

In [8]:
!cat $autoroot.root/config/example/download.yaml

# PERIOD
period:
  start_date: '2020-10-01'
  start_time: '00:00:00'
  end_date: '2020-10-31'
  end_time: '23:59:00'

# CLOUD MASK
cloud_mask: True
  
# PATH FOR SAVING DATA
save_dir: data

defaults:
  - _self_
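
The `period` block stores dates and times as separate strings. As a rough illustration of how they combine into a time window, the sketch below joins them into `datetime` objects. The values are copied from the file above, while the function name `period_bounds` is a hypothetical helper and not necessarily how `rs_tools` handles this internally.

```python
from datetime import datetime

# Values copied from the `period` block of download.yaml
period = {
    "start_date": "2020-10-01",
    "start_time": "00:00:00",
    "end_date": "2020-10-31",
    "end_time": "23:59:00",
}


def period_bounds(period):
    """Combine the date and time strings into (start, end) datetimes."""
    fmt = "%Y-%m-%d %H:%M:%S"
    start = datetime.strptime(f"{period['start_date']} {period['start_time']}", fmt)
    end = datetime.strptime(f"{period['end_date']} {period['end_time']}", fmt)
    if end <= start:
        raise ValueError("period end must be after period start")
    return start, end


start, end = period_bounds(period)
print(start.isoformat(), end.isoformat())
```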
  


There are also satellite-specific options that we can change.

We can view them with the command below.

In [11]:
# !cat $autoroot.root/config/example/satellite/aqua.yaml


```yaml
download:
  _target_: rs_tools._src.data.modis.downloader_aqua.download
  save_dir: ${save_dir}/aqua/
  start_date: ${period.start_date}
  start_time: ${period.start_time}
  end_date: ${period.end_date}
  end_time: ${period.end_time}
  region: "-130 -15 -90 5" # "lon_min lat_min lon_max lat_max"
```
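
The `region` option packs a bounding box into a single string, in the order given by the comment in the config. A small sketch of how such a string could be parsed and sanity-checked is shown below. The `BoundingBox` type and `parse_region` function are hypothetical names, and the checks assume longitudes in [-180, 180] with no antimeridian crossing.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    lon_min: float
    lat_min: float
    lon_max: float
    lat_max: float


def parse_region(region):
    """Parse a "lon_min lat_min lon_max lat_max" string into a BoundingBox."""
    parts = region.split()
    if len(parts) != 4:
        raise ValueError(f"expected 4 values, got {len(parts)}: {region!r}")
    box = BoundingBox(*(float(p) for p in parts))
    # Assumption: longitudes are in [-180, 180] and the box does not wrap around
    if not (-180 <= box.lon_min < box.lon_max <= 180):
        raise ValueError(f"invalid longitude range in {region!r}")
    if not (-90 <= box.lat_min < box.lat_max <= 90):
        raise ValueError(f"invalid latitude range in {region!r}")
    return box


box = parse_region("-130 -15 -90 5")
print(box)
```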

For this tutorial, we will change the save directory and the start/end dates and times.

Notice that we override options defined in the `download.yaml` file, while `satellite=aqua` selects the satellite-specific options in `aqua.yaml`.

```bash
python rs_tools \
    satellite=aqua \
    stage=download \
    save_dir="/path/to/savedir" \
    period.start_date="2020-10-01" \
    period.end_date="2020-10-02" \
    period.start_time="09:00:00" \
    period.end_time="21:00:00"
```
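
The `key=value` arguments in this command address nested configuration entries with dotted names, for example `period.start_date`. The `_target_` and `defaults` keys in the config files suggest a Hydra-style setup, although that is an inference rather than something stated here. The sketch below is not Hydra itself. It is a plain-Python illustration, with the hypothetical name `apply_overrides`, of how dotted overrides map onto a nested config.

```python
import copy


def apply_overrides(config, overrides):
    """Return a copy of `config` with dotted `key=value` overrides applied.

    Values are kept as strings; a real config system would also handle types.
    """
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        # Walk (or create) the nested dictionaries named by the dotted path
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return result


base = {"period": {"start_date": "2020-10-01", "end_date": "2020-10-31"}, "save_dir": "data"}
updated = apply_overrides(base, ["save_dir=/path/to/savedir", "period.end_date=2020-10-02"])
```

The original `base` dictionary is left untouched, which mirrors how command-line overrides change a single run without editing the YAML files.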

***

## GeoProcessing


We have an extensive set of geoprocessing steps to harmonize the data.

We can peek into the `rs_tools/config/example/satellite/aqua.yaml` configuration file to see some of the options we can modify.


In [13]:
# !cat $autoroot.root/config/example/satellite/aqua.yaml

```yaml
geoprocess:
  _target_: rs_tools._src.geoprocessing.modis.geoprocessor_modis.geoprocess
  read_path: ${read_path}/aqua/raw
  save_path: ${save_path}/aqua/geoprocessed
  satellite: aqua
```

In particular, we will focus on the `geoprocess` step within the configuration.
The most important options are the `resolution` and the `region`.
The resolution is a float or integer that is measured in km.

Below is an example of the command we use to run the geoprocessing stage.

```bash
python rs_tools \
    satellite=aqua \
    stage=geoprocess \
    read_path="/path/to/savedir/" \
    save_path="/path/to/savedir/" \
    satellite.geoprocess.resolution=5000
```

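The geoprocessed files are saved under clean, timestamped names. The example listing below uses GOES16 paths; per the configuration above, Aqua output is written to `${save_path}/aqua/geoprocessed`.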

```bash
/path/to/savedir/goes16/geoprocessed/20201001150019_goes16.nc
/path/to/savedir/goes16/geoprocessed/20201002150019_goes16.nc
```

In [7]:
import satpy
from satpy import Scene

In [5]:
from rs_tools._src.utils.io import get_list_filenames
list_of_files = get_list_filenames("/pool/usuarios/juanjohn/data/iti/aqua/raw/aqua/L1b/", ".hdf")

In [10]:
xr.open_dataset(list_of_files[0])

ValueError: did not find a match in any of xarray's currently installed IO backends ['netcdf4', 'h5netcdf', 'scipy', 'gini', 'rasterio', 'zarr']. Consider explicitly selecting one of the installed engines via the ``engine`` parameter, or installing additional IO dependencies, see:
https://docs.xarray.dev/en/stable/getting-started-guide/installing.html
https://docs.xarray.dev/en/stable/user-guide/io.html
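
A likely reason for this error is the container format: MODIS L1b `.hdf` granules are HDF4 files, and the backends listed in the message generally target NetCDF, HDF5, Zarr, and raster formats. One quick way to check what a file really is, independent of its extension, is to inspect its leading bytes. The sketch below does that with a hypothetical helper, `detect_format`, and only looks at offset 0 of the file.

```python
# Leading bytes ("magic numbers") of some common scientific file formats
SIGNATURES = {
    b"\x0e\x03\x13\x01": "HDF4",
    b"\x89HDF\r\n\x1a\n": "HDF5 / NetCDF-4",
    b"CDF\x01": "NetCDF classic",
    b"CDF\x02": "NetCDF 64-bit offset",
}


def detect_format(path):
    """Guess the container format of `path` from its first bytes."""
    with open(path, "rb") as f:
        header = f.read(8)
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return "unknown"
```

Calling `detect_format(list_of_files[0])` on a downloaded granule would show whether an HDF4-capable reader is needed.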

In [11]:
scn = Scene(
    reader="modis_l1b", 
    filenames=[list_of_files[0]], 
)

Don't know how to open the following files: {'/pool/usuarios/juanjohn/data/iti/aqua/raw/aqua/L1b/MYD021KM.A2020275.1955.061.2020276153609.hdf'}


ValueError: No supported files found
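
The file name in the message above encodes the product, acquisition time, collection, and production time of the granule. Decoding it is a quick way to confirm that a file belongs to the requested period. The sketch below assumes the usual MODIS naming layout, `PRODUCT.AYYYYDDD.HHMM.CCC.YYYYDDDHHMMSS.hdf`; the regular expression and the function name `parse_modis_filename` are illustrative and not taken from `rs_tools` or `satpy`.

```python
import re
from datetime import datetime
from pathlib import Path

MODIS_NAME = re.compile(
    r"^(?P<product>[A-Z0-9]+)\.A(?P<date>\d{7})\.(?P<time>\d{4})"
    r"\.(?P<collection>\d{3})\.(?P<production>\d{13})\.hdf$"
)


def parse_modis_filename(path):
    """Split a MODIS granule file name into its fields (assumed naming layout)."""
    name = Path(path).name
    match = MODIS_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised MODIS file name: {name!r}")
    fields = match.groupdict()
    # %j is the day of the year (001-366)
    acquired = datetime.strptime(fields["date"] + fields["time"], "%Y%j%H%M")
    produced = datetime.strptime(fields["production"], "%Y%j%H%M%S")
    return {
        "product": fields["product"],
        "collection": fields["collection"],
        "acquired": acquired,
        "produced": produced,
    }


info = parse_modis_filename("MYD021KM.A2020275.1955.061.2020276153609.hdf")
print(info["product"], info["acquired"].isoformat())
```

For this granule, day 275 of 2020 corresponds to 1 October, which falls inside the download period configured earlier.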

In [6]:
scn.load(get_modis_channel_numbers(), resolution=resolution, calibration="radiance")

NameError: name 'scn' is not defined