# Blue gold preservation with trees

By Thomas Deurloo (https://www.kaggle.com/th0m4sd) & Simon Nouwens (https://www.kaggle.com/esquire900)

In this notebook we suggest an approach to estimate future water levels of multiple waterbodies in northern and central Italy. Nine waterbodies of four distinct types (aquifers, water springs, rivers and lakes) are studied, each with different target variables. We aim for a single workflow that covers all types of waterbodies and results in easy-to-interpret models.

Hydrology can be seen as buckets of water, with inflow and outflow depending on various physical parameters (such as soil type and surface water). This means that waterbodies often behave according to cutoff points: 10 mm of rain might mostly infiltrate into the soil without reaching a river, while everything above that cutoff adds directly to the volume of the river. Tree algorithms also work with cutoff points and can therefore represent this behavior well; in our experience they work great in hydrology!

Water levels in water bodies can often be described as an interaction between weather and the physics of that specific waterbody. An accurate weather prediction with some knowledge of the waterbody would result in a great model, which is why this notebook puts a great focus on weather data, and leaves the physical properties of the waterbody to be extracted by machine learning models.

We aim for this notebook to be concise and to the point. We will therefore not extensively analyse each of the 20 target measurements or all of the input parameters, but leave the reader with documented code to dive further into points of interest.

# Executive summary

- External NASA weather data is collected surrounding the waterbody locations, and all data is cleaned and organized.
- The data is analyzed, based on which custom features are engineered in an easy-to-interpret manner. 
- Those features are used as inputs for gradient boosted tree (GBM) models. Three different models are generated for each target, predicting the water levels 14, 28 and 56 days ahead. Both the models and our preprocessing choices are optimized for the lowest Mean Absolute Error (MAE) on the validation datasets. The normalized MAE score of all models is 0.04067, implying that the models make decent predictions of the target measurements.

This pipeline results in the scores displayed below. As the different targets are hard to relate to each other, we do not compute aggregate scores over multiple datasets. The combination of this notebook and the extra downloadable scripts leaves the user with a single framework to further explore the existing datasets, or to extend the currently used locations to predict other target measurements or waterbodies.

```python
import pandas as pd
import os, json

from utils import *

df_info = pd.read_json("./sim-res-final/info.json")
print("-------- Normalized MAE scores grouped by the number of days predicted in advance")
# Normalized test MAE per prediction horizon (number of days ahead)
print(df_info.groupby(["pred_ahead"]).mae_normalized_test.describe()[['mean', 'std', '50%']])

print("\n-------- Absolute MAE scores grouped per dataset")
# Absolute test MAE per dataset
print(df_info.groupby(["dataset"]).mae_test.describe()[['mean', 'std', '50%']])
```

```text
-------- Normalized MAE scores grouped by the number of days predicted in advance
                mean       std       50%
pred_ahead                              
14          0.025650  0.021282  0.022092
28          0.043527  0.032871  0.039595
56          0.081121  0.071369  0.066210

-------- Absolute MAE scores grouped per dataset
                         mean       std       50%
dataset                                          
aquifer_auser        0.262369  0.173030  0.212645
aquifer_petrignano   0.348686  0.160172  0.321854
water_spring_amiata  0.303286  0.284234  0.310041
```

# Data gathering & cleaning

As mentioned in the introduction, waterbodies are best described as an interaction between weather and the physical system, making weather the most relevant available data to predict water levels. We collect weather data from NASA's POWER system (https://power.larc.nasa.gov/) as an external dataset. Its API offers very specific weather features for any location on earth, making it ideal for our requirements.

Based on the given names of the waterbodies, 57 locations of interest are determined. For each of them, 16 weather metrics (rain, humidity, temperature and wind speeds) are collected at a daily resolution going back to 2000. We combine this external dataset with the given datasets and manually clean up a number of inconsistencies in the data, the details of which are commented in the code. For the features, missing values are filled with quadratic interpolation. Missing values of the target variables are left empty, so that models are trained only on actual measurements and not on guessed ones. Water levels follow long-tailed distributions, as the incoming water they depend on is non-normally distributed (Watterson and Dix (2003)). We therefore only clean obviously faulty data, such as 0-values, and remove no further outliers.

# Exploratory data analysis

- show why we np.log the target variables


# Methods

## Feature engineering

Feature engineering is done by dimension reduction of the weather data. Per data type (e.g. ws10m, the wind speed at 10 m altitude, or rainfall), we multiply the data points D with a location weight matrix W and take the mean over the locations, where W is optimized together with the model hyperparameters. This approach offers several advantages over other dimension reduction techniques:
- The dimension reduction is significant. For example, the 16 NASA POWER features over 10 locations (160 features) are reduced to 16 features.
- The resulting features are still easy to interpret, as they directly represent weather conditions, such as "rain" or "wind speed".
- By optimizing W, we get a location vector that directly represents how important a certain location is for a target variable.
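The weighted location mean described above can be sketched as follows. This is a minimal illustration under assumptions, not the notebook's actual implementation: the function name `weighted_location_mean`, the location names and the weight values are all hypothetical, and W is represented here as one weight per location for a single weather metric.

```python
import pandas as pd


def weighted_location_mean(data: pd.DataFrame, weights: dict) -> pd.Series:
    """Collapse one weather metric measured at several locations into one feature.

    data    : one row per day, one column per location (the data points D)
    weights : location name -> weight (the entries of W for this metric)
    """
    w = pd.Series(weights).reindex(data.columns)
    if w.isna().any():
        raise ValueError("every location column needs a weight")
    # Scale each location column by its weight, then average across locations.
    return data.mul(w, axis=1).mean(axis=1)


# Hypothetical rainfall (mm/day) at three locations over four days.
rain = pd.DataFrame(
    {
        "loc_a": [0.0, 10.0, 4.0, 0.0],
        "loc_b": [2.0, 6.0, 0.0, 0.0],
        "loc_c": [1.0, 2.0, 8.0, 3.0],
    },
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
)
# Hypothetical weights; in the text these are tuned rather than hand-picked.
w_rain = {"loc_a": 1.0, "loc_b": 0.5, "loc_c": 0.0}

rain_feature = weighted_location_mean(rain, w_rain)
print(rain_feature)
```

A weight of 0 removes a location from the feature entirely, which is what makes the optimized weights readable as a measure of location importance.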

We do this for both the NASA POWER data and the internal dataset that is given, resulting in the following features:
- 16 NASA POWER weather features (wind speed, temperature, etc.)
- Target variables (the number of features varies per dataset)
- Internal dataset features (the number of features varies per dataset)
- Several date-related features: month (1-12, categorical), week (1-52, continuous) and day_of_week (1-7, categorical). The continuous week variable helps the model with seasonality, while the two categorical variables assist in detecting date-based artifacts, such as the ones found in target_flow_rate_galleria_alta. Adding yearly data often hurts performance (reference needed).

Finally, some historical data needs to be taken into account. We add 8 shifted rolling means for each of the previously mentioned features: a rolling mean of period 2 shifted back 1, 3 and 5 days, a rolling mean of period 5 shifted back 5, 10 and 15 days, and a rolling mean of period 20 shifted back 20 and 40 days. This gives the model plenty of history to work with without creating a huge number of features.
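The eight shifted rolling means can be sketched with pandas as below. The (period, shift) pairs come from the text; the function name, the column naming scheme and the demo data are assumptions made for illustration, and the sketch assumes one row per day so that a shift of `k` rows equals `k` days.

```python
import pandas as pd

# (rolling period, shift in days) pairs listed in the text: 3 + 3 + 2 = 8 per feature.
ROLLING_SPECS = [(2, 1), (2, 3), (2, 5), (5, 5), (5, 10), (5, 15), (20, 20), (20, 40)]


def add_shifted_rolling_means(df: pd.DataFrame, specs=ROLLING_SPECS) -> pd.DataFrame:
    """Return df with one extra column per (feature, period, shift) combination."""
    new_cols = {}
    for col in df.columns:
        for window, shift in specs:
            # Mean over `window` consecutive days, moved `shift` days back in time,
            # so each row only sees values that are at least `shift` days old.
            new_cols[f"{col}_rm{window}_s{shift}"] = (
                df[col].rolling(window).mean().shift(shift)
            )
    return pd.concat([df, pd.DataFrame(new_cols, index=df.index)], axis=1)


# Hypothetical daily series: the values 0, 1, ..., 59.
demo = pd.DataFrame(
    {"rain": [float(i) for i in range(60)]},
    index=pd.date_range("2020-01-01", periods=60, freq="D"),
)
features = add_shifted_rolling_means(demo)
print(features.columns.tolist())
```

The first rows of each new column are NaN until enough history is available; gradient boosted tree libraries generally handle such missing values natively.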

## Training
Using the engineered features, we split the data into a train, validation and test chunk in a 70-10-20 fashion, and train a GBM model. The training data is used to fit the model, the validation data for early stopping and parameter optimization, and the test data for final metric reporting. Since we only train models on data where the target variable is non-null, the exact split dates differ per dataset.
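A 70-10-20 split on the non-null target rows could look like the sketch below. It assumes the split is chronological (oldest rows for training, newest for testing), which the text suggests by mentioning split dates but does not state outright. The function name and the demo data are hypothetical.

```python
import numpy as np
import pandas as pd


def chronological_split(df: pd.DataFrame, target_col: str, train_pct=70, val_pct=10):
    """Split rows with a known target into train / validation / test, in time order."""
    usable = df.dropna(subset=[target_col]).sort_index()
    n = len(usable)
    # Integer arithmetic avoids floating point surprises such as int(0.7 * 90) == 62.
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return usable.iloc[:train_end], usable.iloc[train_end:val_end], usable.iloc[val_end:]


# Hypothetical dataset: 100 days, the target is missing for the first 10 days.
dates = pd.date_range("2020-01-01", periods=100, freq="D")
data = pd.DataFrame({"rain": np.arange(100.0), "target": np.arange(100.0)}, index=dates)
data.loc[dates[:10], "target"] = np.nan

train, val, test = chronological_split(data, "target")
print(len(train), len(val), len(test))
print(train.index.max(), val.index.min(), test.index.min())
```

Because rows with a missing target are dropped before splitting, two datasets covering the same calendar period can still end up with different split dates.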

For every target variable we generate 3 models, predicting 14, 28 and 56 days ahead. Targets are not predicted directly, but normalized into a logged percentage change, so that
`target_14 = log(target(t-14) / target(t) + 1)`. Taking the log of the target generally assists models when trying to predict non-normally distributed targets. Training multiple models is preferable over a single one with multiple targets, since predicting 14 days ahead allows using training data up to t-14 (where t is the current day), while predicting 56 days ahead only allows data up to t-56. For every model, we optimize a number of parameters, including both hyperparameters of the GBM algorithm and other variables such as the location matrix W and how the data is preprocessed. The optimization is done with Optuna (Akiba et al., 2019).
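One possible reading of the logged percentage change is sketched below, together with the inverse transform needed to turn a prediction back into a water level. This is an assumption, not the notebook's confirmed definition: the target for a horizon of `h` days is taken to be `log(1 + relative change)` between the level today and the level `h` days later, and the index convention may differ from the formula above. Function names and demo data are hypothetical.

```python
import numpy as np
import pandas as pd


def log_change_target(level: pd.Series, horizon: int) -> pd.Series:
    """log(1 + relative change) between today's level and the level `horizon` rows ahead."""
    future = level.shift(-horizon)
    relative_change = (future - level) / level
    # log1p(x) = log(1 + x), so this equals log(future / level).
    return np.log1p(relative_change)


def invert_log_change(level_today, predicted_log_change):
    """Turn a predicted log change back into an absolute level."""
    return level_today * np.exp(predicted_log_change)


# Hypothetical level that grows by 10% per day for 20 days.
level = pd.Series(
    [10.0 * 1.1 ** i for i in range(20)],
    index=pd.date_range("2020-01-01", periods=20, freq="D"),
)
target_14 = log_change_target(level, horizon=14)
recovered = invert_log_change(level.iloc[0], target_14.iloc[0])
print(target_14.head())
print(recovered, level.iloc[14])
```

Under this reading the last `horizon` rows have no target, since their future value is unknown, and the transform is only defined when today's and the future level are non-zero and share the same sign.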

# Findings

# Discussion
- Since the usage of future (weather) predictions as input for the model is restricted in this competition, the resulting models will by definition be a combination of weather predictions and physical properties of the studied system.
- Retraining models after a certain period (e.g. every week or every month) significantly increases the models' performance.
- The current measurements are likely influenced by human intervention, e.g. the opening or closing of dams. We could not find a dataset that describes these actions, so we could not account for them. If actions are taken based on the predictions of a model that is trained on this kind of biased data, there is a risk of a feedback loop in which the model makes predictions as if an intervening action will be made.

Akiba, T., S. Sano, T. Yanase, T. Ohta, and M. Koyama, 2019: Optuna: A Next-generation Hyperparameter Optimization Framework. In KDD.

Watterson, I., and M. Dix, 2003: Simulated changes due to global warming in daily precipitation means and extremes and their interpretation using the gamma distribution. J. Geophys. Res., 108, 4379, doi:10.1029/2002JD002928.

