# Automated Machine Learning for Earth Science via AutoGluon

## Authors

- Author1 = {"name": "Xingjian Shi", "affiliation": "Amazon Web Services", "email": "xjshi@amazon.com", "orcid": ""}
- Author2 = {"name": "Wen-ming Ye", "affiliation": "Amazon Web Services", "email": "wye@amazon.com", "orcid": ""}
- Author3 = {"name": "Nick Erickson", "affiliation": "Amazon Web Services", "email": "neerick@amazon.com", "orcid": ""}
- Author4 = {"name": "Jonas Mueller", "affiliation": "Amazon Web Services", "email": "jonasmue@amazon.com", "orcid": ""}
- Author5 = {"name": "Alexander Shirkov", "affiliation": "Amazon Web Services", "email": "ashyrkou@amazon.com", "orcid": ""}
- Author6 = {"name": "Zhi Zhang", "affiliation": "Amazon Web Services", "email": "zhiz@amazon.com", "orcid": ""}
- Author7 = {"name": "Mu Li", "affiliation": "Amazon Web Services", "email": "mli@amazon.com", "orcid": ""}
- Author8 = {"name": "Alexander Smola", "affiliation": "Amazon Web Services", "email": "alex@smola.org", "orcid": ""}

## Table of Contents
* [Purpose](#purpose)
* [Forest Cover Type Classification](#forest-cover-type-classification)
    * [Train Model with One Line](#train-model-with-one-line)
    * [Evaluation and Prediction](#evaluation-and-prediction)
    * [Load the Predictor](#load-the-predictor)
    * [Feature Importance](#feature-importance)
    * [Achieve Better Performance](#achieve-better-performance)
* [Solar Radiation Prediction](#solar-radiation-prediction)
* [More Information](#more-information)

## Purpose

In this notebook, we introduce [AutoGluon](https://github.com/awslabs/autogluon) to the Earth science community. AutoGluon is an automated machine learning toolkit that enables users to solve machine learning problems with a single line of code. Many Earth science problems involve tabular datasets. With AutoGluon, you can feed in the **raw** data table and specify the `label` column, and AutoGluon will deliver a model with reasonable performance in a short period of time. You can also analyze the importance of each feature column with a single line of code. In the following, we illustrate how to use AutoGluon to build machine learning models for two Earth science problems.

## Setup

We will install AutoGluon and fix the random seed.

In [None]:
!pip install "autogluon[all]"
import random
import numpy as np
random.seed(123)
np.random.seed(123)

## Forest Cover Type Classification

In the first example, we will predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables. The dataset is downloaded from [Kaggle Forest Cover Type Prediction](https://www.kaggle.com/c/forest-cover-type-prediction). The study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. The actual forest cover type for a given 30 x 30 meter cell was determined from US Forest Service (USFS) Region 2 Resource Information System data. Independent variables were then derived from data obtained from the US Geological Survey and USFS. The data is in raw form and contains binary columns for qualitative independent variables such as wilderness area and soil type. Let's first download the dataset.

In [2]:
!wget https://deep-earth.s3.amazonaws.com/datasets/earthcube2021_demo/forest-cover-type-prediction.zip -O forest-cover-type-prediction.zip
!unzip -o forest-cover-type-prediction.zip -d forest-cover-type-prediction

--2021-04-13 09:40:12--  https://deep-earth.s3.amazonaws.com/datasets/earthcube2021_demo/forest-cover-type-prediction.zip
Resolving deep-earth.s3.amazonaws.com (deep-earth.s3.amazonaws.com)... 52.216.186.19
Connecting to deep-earth.s3.amazonaws.com (deep-earth.s3.amazonaws.com)|52.216.186.19|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26555059 (25M) [application/zip]
Saving to: ‘forest-cover-type-prediction.zip’


2021-04-13 09:40:12 (93.3 MB/s) - ‘forest-cover-type-prediction.zip’ saved [26555059/26555059]

Archive:  forest-cover-type-prediction.zip
  inflating: forest-cover-type-prediction/sampleSubmission.csv  
  inflating: forest-cover-type-prediction/sampleSubmission.csv.zip  
  inflating: forest-cover-type-prediction/test.csv  
  inflating: forest-cover-type-prediction/test.csv.zip  
  inflating: forest-cover-type-prediction/test3.csv  
  inflating: forest-cover-type-prediction/train.csv  
  inflating: forest-cover-type-prediction/train.csv.zip  


Here, we load and visualize the dataset. We split it into 75% training and 25% development data (the default proportions of `train_test_split`) so that we can report the score on the development data.

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('forest-cover-type-prediction/train.csv.zip')
df = df.drop(columns='Id')  # the row identifier carries no predictive information
train_df, dev_df = train_test_split(df, random_state=100)
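The split above relies on the defaults of `train_test_split`. As a minimal sketch, the same proportions can be written out explicitly with `test_size=0.25`. The frame below is a synthetic stand-in with made-up values and the same number of rows as the real training file (11340 + 3780 = 15120); the names `toy_df`, `toy_train`, and `toy_dev` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for train.csv: same row count, made-up values.
n_rows = 15120
toy_df = pd.DataFrame({
    "Elevation": np.arange(n_rows),
    "Cover_Type": np.tile(np.arange(1, 8), n_rows // 7),
})

# test_size=0.25 spells out the proportion that train_test_split
# falls back to when neither test_size nor train_size is given.
toy_train, toy_dev = train_test_split(toy_df, test_size=0.25, random_state=100)

print(len(toy_train), len(toy_dev))
```

Writing the proportion explicitly makes the notebook easier to read and keeps the split stable if the default ever changes.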

By visualizing the dataset, we can see that there are 54 feature columns and 1 label column called `"Cover_Type"`.

In [4]:
train_df.head(5)

Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,...,Soil_Type32,Soil_Type33,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40,Cover_Type
4138,2132,252,14,30,2,940,188,249,198,1438,...,0,0,0,0,0,0,0,0,0,4
8143,3270,95,25,134,30,2301,250,194,0,616,...,0,0,0,0,0,0,0,1,0,7
10743,2387,200,14,0,0,592,214,252,170,577,...,0,0,0,0,0,0,0,0,0,6
12932,2286,307,30,270,197,713,124,206,214,1036,...,0,0,0,0,0,0,0,0,0,6
10918,2672,221,30,134,89,2787,169,251,203,1206,...,0,0,0,0,0,0,0,0,0,3


### Train Model with One Line

Next, we train a model in AutoGluon with a single line of code. We only need to specify the label column before calling `.fit()`. Here, the label column is `Cover_Type`. AutoGluon will infer the problem type automatically. In our example, it correctly determines that this is a "multiclass" classification problem and outputs the model with the best accuracy. Internally, it also infers the type of each feature automatically.

In [5]:
import autogluon
from autogluon.tabular import TabularPredictor
predictor = TabularPredictor(label='Cover_Type', path='ag_ec2021_demo').fit(train_df)

Beginning AutoGluon training ...
AutoGluon will save models to "ag_ec2021_demo/"
AutoGluon Version:  0.1.0
Train Data Rows:    11340
Train Data Columns: 54
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == int, but few unique label-values observed).
	7 unique label values:  [4, 7, 6, 3, 2, 5, 1]
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
NumExpr defaulting to 8 threads.
Train Data Class Count: 7
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    23835.42 MB
	Train Data (Original)  Memory Usage: 4.9 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTyp

█

	0.858	 = Validation accuracy score
	28.98s	 = Training runtime
	0.3s	 = Validation runtime
Fitting model: KNeighborsUnif ...
	0.8201	 = Validation accuracy score
	0.05s	 = Training runtime
	0.1s	 = Validation runtime
Fitting model: KNeighborsDist ...
	0.828	 = Validation accuracy score
	0.05s	 = Training runtime
	0.1s	 = Validation runtime
Fitting model: RandomForestGini ...
	0.858	 = Validation accuracy score
	1.33s	 = Training runtime
	0.1s	 = Validation runtime
Fitting model: RandomForestEntr ...
	0.8616	 = Validation accuracy score
	1.72s	 = Training runtime
	0.1s	 = Validation runtime
Fitting model: ExtraTreesGini ...
	0.8668	 = Validation accuracy score
	1.02s	 = Training runtime
	0.1s	 = Validation runtime
Fitting model: ExtraTreesEntr ...
	0.8624	 = Validation accuracy score
	1.12s	 = Training runtime
	0.1s	 = Validation runtime
Fitting model: LightGBM ...
	0.8721	 = Validation accuracy score
	4.6s	 = Training runtime
	0.07s	 = Validation runtime
Fitting model: LightGBMXT ...


█

Fitting model: WeightedEnsemble_L2 ...
	0.8968	 = Validation accuracy score
	0.5s	 = Training runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 137.61s ...
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("ag_ec2021_demo/")
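The training log reports that the problem type was inferred as 'multiclass' because the label column has an integer dtype with few unique values. The sketch below illustrates that kind of reasoning with a deliberately simplified heuristic. It is not AutoGluon's actual rule: the function `guess_problem_type` and the `max_classes` threshold are hypothetical and chosen only for illustration.

```python
import pandas as pd

def guess_problem_type(labels: pd.Series, max_classes: int = 20) -> str:
    """Hypothetical, simplified heuristic (not AutoGluon's implementation)."""
    n_unique = labels.nunique()
    if n_unique == 2:
        return "binary"
    # Numeric labels with many distinct values look like a continuous target.
    if pd.api.types.is_numeric_dtype(labels) and n_unique > max_classes:
        return "regression"
    return "multiclass"

# Integer labels with 7 distinct values, as in the Cover_Type column.
cover_type = pd.Series([4, 7, 6, 3, 2, 5, 1, 4, 7])
# Float labels with many distinct values (made-up numbers).
radiation = pd.Series([1.0 + 0.5 * i for i in range(50)])

print(guess_problem_type(cover_type))
print(guess_problem_type(radiation))
```

If the inferred type is wrong for your data, the log message itself points to the fix: specify `problem_type` manually.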


We can visualize the performance of each model with `predictor.leaderboard()`. Internally, AutoGluon trains multiple tabular models and computes a weighted ensemble at the last stage.

In [6]:
predictor.leaderboard()


Unnamed: 0,model,score_val,pred_time_val,fit_time,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,WeightedEnsemble_L2,0.896825,0.369547,29.222437,0.00053,0.495989,2,True,14
1,LightGBMLarge,0.882716,0.053217,13.815028,0.053217,13.815028,1,True,13
2,XGBoost,0.87478,0.069986,32.888943,0.069986,32.888943,1,True,12
3,LightGBM,0.872134,0.072509,4.604279,0.072509,4.604279,1,True,9
4,LightGBMXT,0.871252,0.093603,3.734746,0.093603,3.734746,1,True,10
5,ExtraTreesGini,0.866843,0.102842,1.020321,0.102842,1.020321,1,True,7
6,ExtraTreesEntr,0.862434,0.102911,1.120916,0.102911,1.120916,1,True,8
7,RandomForestEntr,0.861552,0.102714,1.724213,0.102714,1.724213,1,True,6
8,NeuralNetMXNet,0.858907,0.075091,25.497201,0.075091,25.497201,1,True,1
9,CatBoost,0.858025,0.007117,12.718287,0.007117,12.718287,1,True,11


### Evaluation and Prediction

We can also evaluate the model's performance on the held-out development dataset by calling `.evaluate()`.

In [7]:
predictor.evaluate(dev_df)

Predictive performance on given data: accuracy = 0.8772486772486773


0.8772486772486773

To get the predictions, call `predictor.predict()`.

In [8]:
predictions = predictor.predict(dev_df)
predictions

7777     7
8689     5
14825    1
4925     6
10184    7
        ..
11980    5
7584     1
3479     4
8328     7
9835     2
Name: Cover_Type, Length: 3780, dtype: int64
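Accuracy is the fraction of rows whose predicted label matches the true label, so it can be recomputed directly from a prediction series. The sketch below uses two small made-up series (`y_true` and `y_pred` are hypothetical names and values) rather than the notebook's data.

```python
import pandas as pd

# Made-up true labels and predictions sharing the same row index.
y_true = pd.Series([7, 5, 1, 6, 7, 2], index=[10, 11, 12, 13, 14, 15])
y_pred = pd.Series([7, 5, 1, 3, 7, 2], index=[10, 11, 12, 13, 14, 15])

# The comparison is aligned on the index; the mean of the resulting
# booleans is the fraction of correct predictions.
accuracy = (y_true == y_pred).mean()
print(accuracy)
```

With the notebook's objects, `(predictions == dev_df['Cover_Type']).mean()` should give the same value that `.evaluate()` reports.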

For classification problems, we can also use `.predict_proba()` to get the predicted probability of each class.

In [9]:
probs = predictor.predict_proba(dev_df)
probs.head(5)

Unnamed: 0,1,2,3,4,5,6,7
7777,0.07527,0.00129,8.962022e-07,8.80957e-08,6.413592e-07,6.953858e-07,0.923438
8689,0.002839,0.00138,0.000976325,1.353729e-06,0.9784312,0.01635405,1.8e-05
14825,0.880177,0.006195,2.141967e-06,1.022144e-06,2.351248e-05,8.070939e-06,0.113593
4925,4e-06,0.000425,0.4139419,0.001343966,2.404201e-06,0.5842818,1e-06
10184,0.00851,0.000444,3.632976e-06,8.345026e-07,1.156368e-05,3.524735e-06,0.991026
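The probability table relates to the hard predictions in a simple way: the predicted label is the column with the largest probability. The sketch below rebuilds the five rows shown above as a DataFrame (the name `probs_head` is introduced here for illustration) and picks the most likely class per row.

```python
import pandas as pd

# The five rows of class probabilities displayed above.
probs_head = pd.DataFrame(
    {
        1: [0.07527, 0.002839, 0.880177, 4e-06, 0.00851],
        2: [0.00129, 0.00138, 0.006195, 0.000425, 0.000444],
        3: [8.962022e-07, 0.000976325, 2.141967e-06, 0.4139419, 3.632976e-06],
        4: [8.80957e-08, 1.353729e-06, 1.022144e-06, 0.001343966, 8.345026e-07],
        5: [6.413592e-07, 0.9784312, 2.351248e-05, 2.404201e-06, 1.156368e-05],
        6: [6.953858e-07, 0.01635405, 8.070939e-06, 0.5842818, 3.524735e-06],
        7: [0.923438, 1.8e-05, 0.113593, 1e-06, 0.991026],
    },
    index=[7777, 8689, 14825, 4925, 10184],
)

top_class = probs_head.idxmax(axis=1)  # most likely class per row
top_prob = probs_head.max(axis=1)      # its probability

print(pd.DataFrame({"top_class": top_class, "top_prob": top_prob}))
```

The resulting classes match the first five entries returned by `predictor.predict()`. Row 4925 is worth a look: classes 6 and 3 have probabilities of roughly 0.58 and 0.41, so this prediction is much less certain than the others.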


### Load the Predictor

Loading an AutoGluon model is straightforward. We can directly call `.load()`.

In [10]:
predictor_loaded = TabularPredictor.load('ag_ec2021_demo')
predictor_loaded.evaluate(dev_df)

Predictive performance on given data: accuracy = 0.8772486772486773


0.8772486772486773

### Feature Importance

AutoGluon offers a built-in method for calculating the relative importance of each feature based on [permutation-shuffling](https://scikit-learn.org/stable/modules/permutation_importance.html). In the following, we calculate the feature importance on the development set and print the 10 most important features. The `importance` column is the importance score, that is, the drop in the evaluation score when the values of that feature are shuffled. The remaining columns (`stddev`, `p_value`, `n`, `p99_high`, `p99_low`) describe the statistical significance of the score, which is needed because the calculation relies on random sampling.

In [11]:
importance = predictor.feature_importance(dev_df)
importance.head(10)

Computing feature importance via permutation shuffling for 54 features using 1000 rows with 3 shuffle sets...
	159.69s	= Expected runtime (53.23s per shuffle set)
	19.53s	= Actual runtime (Completed 3 of 3 shuffle sets)


Unnamed: 0,importance,stddev,p_value,n,p99_high,p99_low
Elevation,0.505333,0.008083,4.3e-05,3,0.551649,0.459017
Horizontal_Distance_To_Roadways,0.145,0.009165,0.000665,3,0.197517,0.092483
Horizontal_Distance_To_Fire_Points,0.109333,0.010599,0.001559,3,0.170065,0.048601
Horizontal_Distance_To_Hydrology,0.070333,0.009074,0.002751,3,0.122327,0.01834
Hillshade_Noon,0.014,0.003606,0.010701,3,0.03466,-0.00666
Vertical_Distance_To_Hydrology,0.013333,0.004509,0.018037,3,0.039172,-0.012505
Aspect,0.012667,0.002082,0.004441,3,0.024595,0.000738
Wilderness_Area1,0.010667,0.004726,0.029818,3,0.037746,-0.016413
Wilderness_Area4,0.008333,0.002309,0.012329,3,0.021566,-0.0049
Soil_Type10,0.003667,0.000577,0.004082,3,0.006975,0.000358
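The idea behind permutation shuffling can be sketched in a few lines of NumPy. The example below is not AutoGluon's implementation: it uses made-up data and a hypothetical stand-in model (`toy_model`) whose prediction depends only on the first feature. It shuffles one column at a time and records the drop in accuracy, using 3 shuffles per feature as in the log above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the label depends only on "signal"; "noise" is irrelevant.
feature_names = ["signal", "noise"]
X = rng.normal(size=(1000, 2))
y = (X[:, 0] > 0).astype(int)

def toy_model(features):
    """Stand-in for a trained predictor: thresholds the first feature."""
    return (features[:, 0] > 0).astype(int)

def accuracy(features, labels):
    return float((toy_model(features) == labels).mean())

baseline = accuracy(X, y)
n_shuffles = 3

importance = {}
for col, name in enumerate(feature_names):
    drops = []
    for _ in range(n_shuffles):
        X_perm = X.copy()
        # Shuffling one column breaks its link to the label
        # while keeping its marginal distribution.
        X_perm[:, col] = rng.permutation(X_perm[:, col])
        drops.append(baseline - accuracy(X_perm, y))
    # Mean drop and its sample standard deviation across shuffles.
    importance[name] = (float(np.mean(drops)), float(np.std(drops, ddof=1)))

print(importance)
```

Shuffling the irrelevant column leaves the accuracy unchanged, so its importance is zero, while shuffling the informative column reduces the accuracy to roughly chance level. The mean and spread over the shuffles are roughly analogous to the `importance` and `stddev` columns in the table above.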


### Achieve Better Performance

By default, AutoGluon computes a weighted ensemble of a diverse set of models. Stack ensembling usually improves performance further. To enable automated stack ensembling, specify `presets="best_quality"` when calling `.fit()`. For more details, you can also check out our provided script. The architecture is described in detail in [1], and the following figure gives a general overview.

<img src="https://deep-earth.s3.amazonaws.com/datasets/earthcube2021_demo/stacking.png" alt="screenshot" style="width: 500px;"/>

With `.fit(train_df, presets="best_quality")`, we are able to reach rank 82 out of 1692 in the competition.

<img src="https://deep-earth.s3.amazonaws.com/datasets/earthcube2021_demo/forest_cover_type.png" alt="screenshot" style="width: 500px;"/>

## Solar Radiation Prediction

In the second example, we will train a model to predict solar radiation. The original dataset is available at [Kaggle Solar Radiation Prediction](https://www.kaggle.com/dronio/SolarEnergy). The dataset contains columns such as "wind direction", "wind speed", "humidity", and "temperature". The response variable to be predicted is "Solar_radiation", which is stored in the `Radiation` column of the CSV file. The dataset contains measurements covering four months, and the task is to predict the level of solar radiation.

In [12]:
!wget https://deep-earth.s3.amazonaws.com/datasets/earthcube2021_demo/SolarPrediction.csv.zip -O SolarPrediction.csv.zip

--2021-04-13 09:42:57--  https://deep-earth.s3.amazonaws.com/datasets/earthcube2021_demo/SolarPrediction.csv.zip
Resolving deep-earth.s3.amazonaws.com (deep-earth.s3.amazonaws.com)... 52.216.241.36
Connecting to deep-earth.s3.amazonaws.com (deep-earth.s3.amazonaws.com)|52.216.241.36|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 523425 (511K) [application/zip]
Saving to: ‘SolarPrediction.csv.zip’


2021-04-13 09:42:57 (40.3 MB/s) - ‘SolarPrediction.csv.zip’ saved [523425/523425]



In [13]:
import pandas as pd
df = pd.read_csv('SolarPrediction.csv.zip')
train_df, dev_df = train_test_split(df, random_state=100)

In [14]:
train_df.head(10)

Unnamed: 0,UNIXTime,Data,Time,Radiation,Temperature,Pressure,Humidity,WindDirection(Degrees),Speed,TimeSunRise,TimeSunSet
2664,1474412104,9/20/2016 12:00:00 AM,12:55:04,1039.15,65,30.4,57,2.26,5.62,06:11:00,18:21:00
12230,1476543319,10/15/2016 12:00:00 AM,04:55:19,1.21,51,30.46,23,181.58,6.75,06:17:00,17:59:00
11706,1476704422,10/17/2016 12:00:00 AM,01:40:22,1.22,50,30.47,39,142.56,10.12,06:18:00,17:58:00
12924,1476330025,10/12/2016 12:00:00 AM,17:40:25,28.35,59,30.45,42,167.42,4.5,06:16:00,18:02:00
27507,1482367563,12/21/2016 12:00:00 AM,14:46:03,637.93,57,30.39,74,40.94,4.5,06:53:00,17:49:00
2516,1474457405,9/21/2016 12:00:00 AM,01:30:05,1.21,45,30.39,73,159.07,3.37,06:11:00,18:20:00
32227,1480723808,12/2/2016 12:00:00 AM,14:10:08,177.19,45,30.34,93,134.78,11.25,06:42:00,17:42:00
12705,1476396922,10/13/2016 12:00:00 AM,12:15:22,1008.08,65,30.46,46,71.24,5.62,06:17:00,18:01:00
14992,1475697322,10/5/2016 12:00:00 AM,09:55:22,292.44,55,30.47,101,18.7,7.87,06:14:00,18:08:00
23615,1478267417,11/4/2016 12:00:00 AM,03:50:17,1.18,44,30.42,38,176.34,7.87,06:25:00,17:47:00


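As in our previous example, we can directly train a predictor with a single `.fit()` call. The difference is that this time AutoGluon automatically determines that it is a regression problem. We also set `eval_metric='r2'` so that models are scored by the coefficient of determination.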

In [15]:
predictor = TabularPredictor(label='Radiation', eval_metric='r2', path='ag_ec2021_demo2').fit(train_df)

Beginning AutoGluon training ...
AutoGluon will save models to "ag_ec2021_demo2/"
AutoGluon Version:  0.1.0
Train Data Rows:    24514
Train Data Columns: 10
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
	Label info (max, min, mean, stddev): (1601.26, 1.11, 206.52072, 315.54334)
	If 'regression' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    20110.64 MB
	Train Data (Original)  Memory Usage: 7.88 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
	Stag

[1000]	train_set's l2: 5825	train_set's r2: 0.941343	valid_set's l2: 6881.24	valid_set's r2: 0.932405
[2000]	train_set's l2: 4818.35	train_set's r2: 0.951483	valid_set's l2: 6360.95	valid_set's r2: 0.937497
[3000]	train_set's l2: 4202.38	train_set's r2: 0.957684	valid_set's l2: 6212.24	valid_set's r2: 0.938993


	0.9393	 = Validation r2 score
	8.77s	 = Training runtime
	0.14s	 = Validation runtime
Fitting model: CatBoost ...
	0.942	 = Validation r2 score
	4.82s	 = Training runtime
	0.0s	 = Validation runtime
Fitting model: XGBoost ...
	0.9444	 = Validation r2 score
	4.15s	 = Training runtime
	0.01s	 = Validation runtime
Fitting model: NeuralNetMXNet ...
	0.9372	 = Validation r2 score
	107.26s	 = Training runtime
	0.04s	 = Validation runtime
Fitting model: NeuralNetFastAI ...


█

  warn(f'{self.__class__} conditioned on metric `{self.monitor}` which is not available. Available metrics are: {", ".join(map(str, self.learn.recorder.names[1:-1]))}')
  warn(f'{self.__class__} conditioned on metric `{self.monitor}` which is not available. Available metrics are: {", ".join(map(str, self.learn.recorder.names[1:-1]))}')


█

		[Errno 2] No such file or directory: '/tmp/tmpiiz9m03_/models/NeuralNetFastAI.pth'
Detailed Traceback:
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.6/site-packages/autogluon/tabular/trainer/abstract_trainer.py", line 911, in _train_and_save
    model = self._train_single(X, y, model, X_val, y_val, **model_fit_kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/autogluon/tabular/trainer/abstract_trainer.py", line 883, in _train_single
    model.fit(X=X, y=y, X_val=X_val, y_val=y_val, **model_fit_kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/autogluon/core/models/abstract/abstract_model.py", line 405, in fit
    self._fit(**kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/autogluon/tabular/models/fastainn/tabular_nn_fastai.py", line 244, in _fit
    model.load(self.name)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/fastai/basic_train.py", line 269, in load
    state = torch.load(source, map_location=d

We can evaluate on the development set with the same approach.

In [16]:
predictor.evaluate(dev_df)

Predictive performance on given data: r2 = 0.9544595815213469


0.9544595815213469
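The `r2` metric is the coefficient of determination, R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The following minimal sketch computes it by hand; the arrays `y_true` and `y_pred` hold made-up values, not the notebook's data.

```python
import numpy as np

# Made-up radiation values and predictions.
y_true = np.array([1.2, 28.4, 177.2, 637.9, 1039.2])
y_pred = np.array([5.0, 30.0, 160.0, 650.0, 1000.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # squared prediction errors
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared deviations from the mean
r2 = 1.0 - ss_res / ss_tot

# A model that always predicts the mean of y_true scores exactly 0.
mean_pred = np.full_like(y_true, y_true.mean())
r2_mean = 1.0 - np.sum((y_true - mean_pred) ** 2) / ss_tot

print(r2, r2_mean)
```

A value of 1 means perfect predictions and 0 means no better than predicting the mean, which puts the development score of about 0.954 into context.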

## More Information

You may check our website for more information and tutorials: https://auto.gluon.ai/. AutoGluon also supports automatically training models on text, image, and multimodal tabular data.

## References

1. Erickson, N., Mueller, J., Shirkov, A., Zhang, H., Larroy, P., Li, M., and Smola, A. AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data. 2020. https://arxiv.org/pdf/2003.06505.pdf