Machine learning for land surface modeling.
python main.py --process_raw_data --join_era_modis --ML_prep --train_model --predict --evaluate
We have two primary sets of data. The first is a selection of ERA5 fields. These can be thought of as our features or inputs. The second is land surface temperature measurements from MODIS. Both sets of data are provided by ECMWF having undergone some pre-processing and re-gridding, but are generally publicly available.
The raw ERA data is divided among the following files, all on a reduced Gaussian grid:
- ERA_sfc. Monthly files between 2016-2021, hourly grain.
- ERA_skin. Monthly files between 2016-2021, hourly grain.
- ERA_skt. Monthly files between 2016-2021, hourly grain.
- climateV15. Selection of constant-in-time features - for example orography - split over multiple files.
- climateV20. As above, but more recent version.
- Monthly lakes. 12 monthly files describing how lake cover varies month-to-month.
- Salt lakes. Single file with time-constant salt lake fraction.
Daily files for MODIS Aqua day observations between 2016-2021 on an hourly grain. Resolution is 10800 pixels in longitude and 5400 pixels in latitude (60 pixels per degree). Average LST error is
For more details on the raw data, see Workflow.ipynb.
To get all this disparate data into a more manageable form, call python main.py --process_raw_data
This creates:
- Time Variable ERA fields. One file per month
- Time Constant ERA fields. One file per version (V15,V20)
The additional monthly lakes and salt lakes are unchanged by this step. The MODIS files are also untouched.
In order to use the ERA-MODIS data together to train a model, it is necessary to join the data in time and space. That is, given a collection of ERA features at time t and grid point x, what is the corresponding real world observation provided by MODIS? This is done by the call python main.py --join_era_modis.
The general method involves taking an hour of ERA data (which covers the whole globe) and an hour of MODIS data (which covers just a strip) and then using a k-nearest neighbours algorithm to find the nearest ERA grid point for every MODIS point, with the Haversine metric as the measure of 'nearness'. We filter out any matches where the Haversine distance is > 50 km, and then group by the ERA coordinates to get an average temperature value. We use RAPIDS for GPU-accelerated k-nearest neighbours search, which is built on top of FAISS. One can also use a standard non-GPU k-NN from scikit-learn.
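The matching logic described above can be sketched in plain Python. This is a minimal illustration only, not the repository's implementation: it replaces the RAPIDS or scikit-learn k-NN search with a brute-force nearest-point search, and the function names (haversine_km, join_hour), the Earth radius value and the tuple layout are assumptions made for the example.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; assumed value for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def join_hour(era_points, modis_obs, max_km=50.0):
    """Match each MODIS observation to its nearest ERA grid point.

    era_points: list of (lat, lon) tuples for one hour of ERA data
    modis_obs:  list of (lat, lon, lst) tuples for one hour of MODIS data
    Returns {era_point: mean LST of the MODIS observations matched to it}.
    """
    matched = defaultdict(list)
    for lat, lon, lst in modis_obs:
        # Brute-force 1-nearest-neighbour search (stand-in for a k-NN library call)
        nearest = min(era_points, key=lambda p: haversine_km(lat, lon, p[0], p[1]))
        # Discard matches further away than the distance threshold
        if haversine_km(lat, lon, nearest[0], nearest[1]) <= max_km:
            matched[nearest].append(lst)
    # Group by ERA coordinates and average the matched temperatures
    return {point: sum(vals) / len(vals) for point, vals in matched.items()}


# Hypothetical toy data: two ERA grid points and four MODIS observations
era = [(0.0, 0.0), (0.0, 1.0)]
modis = [(0.01, 0.02, 300.0), (-0.01, 0.03, 302.0), (0.0, 0.98, 290.0), (10.0, 10.0, 280.0)]
joined = join_hour(era, modis)
```

In this toy case the observation at (10.0, 10.0) is more than 50 km from every ERA point and is dropped. The brute-force search scales as the number of MODIS points times the number of ERA points, which is why a tree-based or GPU k-NN search is the practical choice at global scale.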
This joining process outputs monthly parquet files which hold: position, time, features, target. Below is an example of a single hour of joined ERA-MODIS data.
python main.py --ML_prep
For the purposes of ML it is useful to then modify these files via either a 'greedy' or a 'sensible' method:
- Greedy involves amalgamating all training data into a single file, all validation data into a single file, etc. We can then load this single file when training, reading into memory only the desired features (parquet is column oriented).
- Sensible involves converting our parquet files into a TFRecord format which can then be easily loaded batchwise into an ML pipeline.
For training data of ~1 year or less, we typically use Greedy. 12 months of data in .parquet format is around 3 GB, and only a subset of that is typically loaded into memory when training (i.e. we don't train over all columns).
At this stage we typically also normalize our features and reassign the V20 climate fields to be "delta fields", i.e. the correction to the V15 value. Similarly, clake_monthly_value is reassigned to clake_monthly_value - cl_v20.
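As a rough illustration of the delta-field and normalization step, the sketch below works on plain lists. Several things here are assumptions rather than facts about the repository: the helper names (to_delta, standardise), the choice of z-score normalization (the normalization scheme is not named above), the cl_v15 column name, and the decision to subtract the original cl_v20 value rather than its delta.

```python
from statistics import mean, pstdev


def to_delta(new_values, reference_values):
    """Express new_values as corrections to reference_values."""
    return [new - ref for new, ref in zip(new_values, reference_values)]


def standardise(values):
    """Z-score normalization (an assumed scheme, used here for illustration)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical values for three grid points
cl_v15 = [0.25, 0.0, 0.5]
cl_v20 = [0.5, 0.0, 0.25]
clake_monthly_value = [0.75, 0.125, 0.25]

# V20 field becomes the correction to the V15 field
cl_v20_delta = to_delta(cl_v20, cl_v15)
# Monthly lake cover becomes the correction to the (original) V20 lake cover
clake_monthly_delta = to_delta(clake_monthly_value, cl_v20)

z = standardise([1.0, 2.0, 3.0])
```

Storing corrections rather than absolute values means a grid point whose fields did not change between versions carries a zero, which makes the updated regions stand out in the inputs.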
python main.py --train_model
The model training is completely specified via the config file. Here a user can set the network structure, batch size, learning rates, loss metric, early stopping patience etc.
If the number of neurons in a hidden layer (nodes_per_layer) is not specified, the default value is Nfeatures/2. We take Adam as our standard optimiser.
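One possible reading of the nodes_per_layer default is sketched below, assuming the fallback applies when the value is left unset in the config. The function name, the n_hidden_layers key, the list-of-widths layout and the rounding down of Nfeatures/2 are all hypothetical choices for the example, not confirmed details of the config format.

```python
def resolve_nodes_per_layer(config, n_features):
    """Return hidden-layer widths, falling back to n_features / 2 when unset."""
    nodes = config.get("nodes_per_layer")
    if nodes is None:
        # Assumption: Nfeatures/2 is rounded down and shared by every hidden layer
        width = max(1, n_features // 2)
        return [width] * config.get("n_hidden_layers", 1)
    return list(nodes)


# Explicit widths are passed through unchanged
explicit = resolve_nodes_per_layer({"nodes_per_layer": [64, 32]}, n_features=50)
# Unset widths fall back to half the number of input features
defaulted = resolve_nodes_per_layer({"nodes_per_layer": None, "n_hidden_layers": 3}, n_features=51)
```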
Once the training completes, the trained model is saved to disk along with the training history (training_history.json) and a complete copy of the config file used for the training (configuration.json).
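Writing the two JSON files next to the model could look roughly like the following. The file names come from the text above; the function name, the directory layout and the example history and config contents are assumptions, and saving the model weights themselves is left out because it depends on the ML framework.

```python
import json
import tempfile
from pathlib import Path


def save_training_artifacts(model_dir, history, config):
    """Write the training history and a copy of the config into model_dir."""
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    with open(model_dir / "training_history.json", "w") as f:
        json.dump(history, f, indent=2)
    with open(model_dir / "configuration.json", "w") as f:
        json.dump(config, f, indent=2)
    return model_dir


# Hypothetical history and config, written to a temporary directory
example_history = {"loss": [0.5, 0.25], "val_loss": [0.6, 0.3]}
example_config = {"batch_size": 1024, "optimizer": "adam"}

with tempfile.TemporaryDirectory() as tmp:
    out = save_training_artifacts(Path(tmp) / "model", example_history, example_config)
    reloaded_config = json.loads((out / "configuration.json").read_text())
    reloaded_history = json.loads((out / "training_history.json").read_text())
    saved_files = sorted(p.name for p in out.iterdir())
```

Keeping a full copy of the config beside the model means a later prediction or evaluation run can be traced back to the exact settings that produced it.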
python main.py --predict
Loads a trained model and makes predictions for the data specified in the config.
Outputs latitude/longitude/time/MODIS LST/LST prediction/ERA skt to predictions.parquet in the model directory.