# Learning MODIS LST from ERA features


### Data


* `X data`. These are the ERA `.grib` files.

* `Y data`. These are the MODIS `.tif` files.


### Preprocessing


1. Convert the ERA `.grib` files

   These are on a monthly grain and quite large. It is useful to convert them into `NetCDF` files on an hourly grain, which makes IO and operations on these files much faster. This is done in `convert_grib_to_netcdf.py`. See also `01.Convert_grib_to_NetCDF.ipynb`. In this step we also convert the longitude to `long1` format so as to match the format in the MODIS data.

2. Join the ERA and MODIS data

   These are joined in both time and space to produce a collection of hourly `.pkl` files. This is done in `join_MODIS_with_ERA.py`. See also `02.Join_MODIS_with_ERA_NetCDF.ipynb` and `03.Join_MODIS_with_ERA_faiss.ipynb`.
   
3. Create a single ML dataframe

   We then join all these hourly files into a single `ML_data.pkl`, which can be easily loaded.
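
The longitude conversion in step 1 can be sketched as follows. This is a minimal illustration, not the code in `convert_grib_to_netcdf.py`: it assumes that the ERA longitudes run over [0, 360) and that the `long1` format means longitudes in [-180, 180). The function name `to_signed_longitude` is hypothetical.

```python
def to_signed_longitude(lon_deg: float) -> float:
    """Map a longitude in degrees onto the interval [-180, 180).

    Assumption: this is what the `long1` format refers to.
    """
    return (lon_deg + 180.0) % 360.0 - 180.0


# Example ERA-style longitudes in [0, 360); values are illustrative only.
era_longitudes = [0.0, 90.0, 180.0, 270.0, 359.5]
converted = [to_signed_longitude(lon) for lon in era_longitudes]
print(converted)
```

Longitudes east of 180 degrees are shifted to negative values, while those below 180 degrees are unchanged.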
   
   
   
### Processing + Viz

* Given our cleaned data, we can train our model. We use a simple tapered sequential NN. We currently take 2018 as the training set, 2019 as the validation set and 2020 as the test set. See `04.Train-NN-lite.py` and `scripts/train_and_predict_lite.py`. The training script creates a directory which holds the model, training history and predictions.

* Predictions can be visualized and compared in `05.Plot_Model.ipynb`.
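
The year-based train/validation/test split can be sketched as below. This is only an illustration of the split logic, not the code in `scripts/train_and_predict_lite.py`; the row layout and the column names `time` and `t2m` are assumptions made for the example.

```python
from datetime import datetime

# Year-based split described above: 2018 train, 2019 validation, 2020 test.
SPLIT_BY_YEAR = {2018: "train", 2019: "validation", 2020: "test"}


def split_rows_by_year(rows):
    """Group rows into train/validation/test lists using the year of row["time"]."""
    splits = {"train": [], "validation": [], "test": []}
    for row in rows:
        split_name = SPLIT_BY_YEAR.get(row["time"].year)
        if split_name is not None:  # rows from any other year are ignored
            splits[split_name].append(row)
    return splits


# Tiny made-up sample; the column names and values are hypothetical.
sample_rows = [
    {"time": datetime(2018, 1, 1, 0), "t2m": 271.3},
    {"time": datetime(2018, 6, 1, 12), "t2m": 290.1},
    {"time": datetime(2019, 3, 1, 6), "t2m": 280.4},
    {"time": datetime(2020, 9, 1, 18), "t2m": 285.0},
    {"time": datetime(2017, 9, 1, 18), "t2m": 284.2},
]
splits = split_rows_by_year(sample_rows)
print({name: len(rows) for name, rows in splits.items()})
```

Splitting on whole years keeps the three sets disjoint in time, so no hour appears in more than one set.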



### Further information 
We have 4 “types” of ERA file, corresponding to different model fields. They are:

1. `sfc_skin_unstructured_...`. 10 years of data (2010-2020), hourly grain

   aluvp, aluvd, alnip, alnid, cl, cvl, mvh, istl1, istl2, slt, sdfor, z, sd, store, isor, anor, spor, 2d, lsm, fal

2. `sfc_unstructured_...`. 3 years of data (2018, 2019, 2020), hourly grain

   sp, msl, 10u, 10v, 2t

3. `sfc_skin2_unstructured_...`. 10 years of data, only at 06:00 and 18:00

   ssrd, strd, tp

4. `ml_skin_unstructured...`. 10 years of data

   some spherical harmonic fields
  
    
We will exclusively use `sfc_unstructured_` and `sfc_skin_unstructured_` for 2018-2020.






### Questions and uncertainties

* Is it better to do a "single nearest match", where each ERA data point is joined to the nearest MODIS point, or a "group averaged match", where every MODIS point is matched to its nearest ERA point and we then average over the set of MODIS points assigned to each ERA point, subject to some cutoff tolerance (e.g. only count MODIS data points that are within 50 km of the ERA point)? We seem to train better in the latter case.

* The nearest neighbour matching employs the [faiss library](https://github.com/facebookresearch/faiss) for fast k-nearest neighbours on GPU. However, this does not allow for custom metrics (e.g. Haversine) and instead uses the (squared) L2 norm on the latitude/longitude coordinates. These are approximately equivalent under the small angle approximation, but there could be some overlooked edge cases, for example at high latitudes, where a degree of longitude spans a shorter distance than a degree of latitude.
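
Both questions above can be explored with a small brute-force sketch. It does not use faiss and is not the code in `join_MODIS_with_ERA.py`: the function names, the tuple layouts and the Earth radius of 6371 km are assumptions made for illustration. It implements a "group averaged match" with a Haversine cutoff, and compares the Haversine distance with the squared L2 distance in degrees.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, assumed for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def squared_l2_deg(lat1, lon1, lat2, lon2):
    """Squared Euclidean distance computed directly on degrees."""
    return (lat1 - lat2) ** 2 + (lon1 - lon2) ** 2


def group_averaged_match(era_points, modis_points, tolerance_km=50.0):
    """Average the MODIS values assigned to each ERA point.

    era_points: list of (lat, lon); modis_points: list of (lat, lon, lst).
    Each MODIS point goes to its nearest ERA point and is dropped if that
    distance is not below tolerance_km. Returns {era_index: mean_lst}.
    """
    sums = {}
    for lat, lon, lst in modis_points:
        distances = [haversine_km(lat, lon, e_lat, e_lon) for e_lat, e_lon in era_points]
        nearest = min(range(len(era_points)), key=distances.__getitem__)
        if distances[nearest] < tolerance_km:
            total, count = sums.get(nearest, (0.0, 0))
            sums[nearest] = (total + lst, count + 1)
    return {idx: total / count for idx, (total, count) in sums.items()}


# Made-up points; the last MODIS point is over 300 km away and is dropped.
era_points = [(50.0, 0.0), (50.0, 1.0)]
modis_points = [(50.0, 0.1, 290.0), (50.1, 0.0, 292.0), (50.0, 0.95, 300.0), (53.0, 1.0, 310.0)]
matched = group_averaged_match(era_points, modis_points)

# At 80 degrees latitude the two metrics disagree on which candidate is nearer.
query, east, north = (80.0, 0.0), (80.0, 3.0), (81.0, 0.0)
haversine_prefers_east = haversine_km(*query, *east) < haversine_km(*query, *north)
l2_prefers_east = squared_l2_deg(*query, *east) < squared_l2_deg(*query, *north)
print(matched, haversine_prefers_east, l2_prefers_east)
```

In this example the candidate 3 degrees to the east is closer on the sphere, while the L2 metric in degrees prefers the candidate 1 degree to the north. A similar disagreement can occur across the ±180 degree longitude boundary, where nearby points differ by almost 360 degrees in longitude.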
    



---


