# Training, Testing and Predicting With an LSTM Model
In the notebook `ml/deeplearning_tuning.ipynb` we performed Bayesian hyperparameter tuning to find the best hyperparameters for 3 candidate LSTM model architectures.
In this notebook we:
1. train each of the 3 LSTM architectures with its best hyperparameters
2. evaluate them on the test dataset to select the best model
3. analyze the sensitivity of the best model to its hyperparameters
4. use the best model to predict the groundwater depth for the year 2022.
## Multiple Multivariate Time Series Predictions with LSTM
The dataset is made of 478 Township-Ranges, each containing a multivariate time series of 80 features covering the years 2014 to 2021. The dataset can thus be seen as a 3-dimensional array of
$478\ \text{Township-Ranges} \times 8\ \text{time stamps} \times 80\ \text{features}$
The objective is to predict the 2022 target value of `GSE_GWE` (Ground Surface Elevation to Groundwater Water Elevation - Depth to groundwater elevation in feet below ground surface) for each Township-Range.

LSTMs are commonly used for time series and NLP because both involve sequential data in which each element depends on previous states.
The future prediction *Y(t+n+1)* depends not only on the last state *X1(t+n), ..., Y(t+n)*, nor only on the past values of the target feature *Y(t+1), ..., Y(t+n)*, but on the entire sequence of past states.

![Multi-Variate Multi Time-Series Predictions with LSTM - Training and Prediction](../doc/images/lstm_inputs_outputs.jpg)

During training and predictions:
* Township-Ranges are passed into the model one by one
* each cell in the LSTM neural network receives a Township-Range state for a specific year (the state of the Township-Range at a specific position in the series)
* each state (year) in the series is represented by a multidimensional vector of all 80 features (including the target feature Y `GSE_GWE`)

The output is the Township-Range's value of the target feature Y `GSE_GWE` for the following year. The model is trained on 2014-2020 (7 years) data to predict 2021.
During inference the last 7 years (2015-2021) of data are passed as input to predict the 2022 value.
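
The two windows described above can be sketched in plain Python. This is only an illustration: the helper names `training_window` and `inference_window` and the toy series are assumptions made for this example and are not part of the notebook's `lib` package.

```python
YEARS = list(range(2014, 2022))  # 2014..2021, i.e. 8 yearly states


def training_window(series):
    """Split an 8-year series into the first 7 years (inputs) and the last year (target)."""
    return series[:-1], series[-1]


def inference_window(series):
    """Return the most recent 7 years, used as input to predict the following year."""
    return series[1:]


# Toy series for one hypothetical Township-Range: one (year, GSE_GWE) pair per year
toy_series = [(year, 100.0 + i) for i, year in enumerate(YEARS)]

train_inputs, train_target = training_window(toy_series)
infer_inputs = inference_window(toy_series)

print([year for year, _ in train_inputs])  # input years used for training
print(train_target[0])                     # target year used for training
print([year for year, _ in infer_inputs])  # input years used to predict 2022
```

Both windows have the same length (7 years), which is what allows the trained model to be reused unchanged at inference time.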

![Multi-Variate Multi Time-Series Predictions with LSTM - Cells Inputs](../doc/images/lstm_table_to_cells.jpg)

In [33]:
import sys
sys.path.append('..')

In [34]:
import os
import pickle
import numpy as np
import pandas as pd
import random

from lib.township_range import TownshipRanges
from lib.read_data import read_and_join_output_file
from lib.deeplearning import get_train_test_datasets, get_sets_shapes, get_data_for_prediction, evaluate_forecast, combine_all_target_years, get_year_to_year_differences
from lib.viz import view_trs_side_by_side, draw_hyperparameters_distribution, draw_line_chart

import tensorflow
from sklearn import set_config
from tensorflow import keras
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, TimeDistributed, RepeatVector
from keras.optimizers import Adam

In [35]:
RANDOM_SEED = 31
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
tensorflow.random.set_seed(RANDOM_SEED)

In [36]:
print("Num GPUs Available: ", len(tensorflow.config.list_physical_devices('GPU')))

Num GPUs Available:  0


## Preparing the Dataset
### The Train-Test Split
To fit our dataset and objective, as well as the LSTM architecture, we perform the train-test split as follows:
* Training and Test sets will be split by Township-Ranges. I.e., some Township-Ranges will have all their 2014-2021 data points in the training set, some others will be in the test set.
* The model will be trained based on the 2014-2020 data for all features - including the target feature - with the training target being the 2021 value of the target feature.
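
A split by Township-Range can be sketched as follows. This is a minimal illustration and not the implementation of `get_train_test_datasets`: the function name, the placeholder IDs and the choice to round the test size up are assumptions. Rounding up is what scikit-learn does for a fractional `test_size`, and it gives 406 training and 72 test Township-Ranges for 478 items.

```python
import math
import random


def split_by_township_range(ids, test_size=0.15, seed=31):
    """Shuffle the IDs and hold out a fraction of them as the test set."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)  # round the test size up
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder IDs standing in for the 478 real Township-Range identifiers
township_ranges = [f"TR_{i:03d}" for i in range(478)]
train_ids, test_ids = split_by_township_range(township_ranges)

print(len(train_ids), len(test_ids))
```

Because the split is made on identifiers, every yearly data point of a given Township-Range ends up on the same side of the split.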

With such a method, unlike simple time series forecasting where the target feature is forecast only from its own past values, we allow the past values of other features (in our case cultivated crops, precipitation, population density, number of wells drilled) to influence the future value of the target feature.

![Train-Test Split](../doc/images/lstm-train-test-split.jpg)

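We do not create a separate validation dataset. Instead we rely on the `validation_split` argument of the Keras `fit` method, which holds out a fraction of the training data points (i.e., the Township-Ranges) for validation. Keras takes these samples from the end of the training arrays, before any shuffling, and uses the same held-out samples at every epoch, so this is a simple hold-out rather than cross-validation.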
### Data Imputation and Scaling
Missing data imputation for a Township-Range is performed only using the existing data of that Township-Range (not the data of all Township-Ranges). For example:
* a *fill forward* approach is used for many fields like crops, vegetation and soils. The percentage of land use per crop in 2014 in a Township-Range is imputed into the missing year 2015 for that particular Township-Range.
* for fields like `PCT_OF_CAPACITY` (the capacity percentage of water reservoirs), missing values in a Township-Range are filled using the min, mean, median or max value of that particular Township-Range.

Because of this, the *fit* method of our custom impute pipeline does nothing. The pipeline does not need to learn values from other Township-Ranges to impute missing values in a Township-Range. If it did, it could not impute data in the test set anyway, since we impute by Township-Range and the Township-Ranges in the test set are not seen when *fitting* the pipeline. The *transform* method simply fills the missing values in a Township-Range based on the existing values of that Township-Range. This way we can split the train and test sets by Township-Range and impute missing values without any data leakage, as the impute pipeline learns nothing from the Township-Ranges in the test set.
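
The two per-Township-Range strategies can be illustrated with a short sketch. The functions and toy values below are assumptions made for this example; the notebook's actual impute pipeline is not shown here and may differ.

```python
from statistics import mean


def forward_fill(values):
    """Replace each None with the most recent earlier non-missing value."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def fill_with_statistic(values, statistic=mean):
    """Replace each None with a statistic computed on the same series only."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)  # nothing to learn from, leave the series as is
    fill = statistic(known)
    return [fill if v is None else v for v in values]


# Toy yearly series for a single hypothetical Township-Range
crop_share = [0.30, None, 0.35, None]
reservoir_pct = [60.0, None, 80.0, 70.0]

print(forward_fill(crop_share))
print(fill_with_statistic(reservoir_pct))
print(fill_with_statistic(reservoir_pct, statistic=max))
```

Neither function needs anything from other Township-Ranges, which is why nothing has to be learned during fitting.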

We use a MinMax scaler to scale all values between 0 and 1 for the neural network, except for the vegetation, soils and crops datasets which are already scaled between 0 and 1.
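
As a reminder of what min-max scaling does, and of why predictions must later be passed through `inverse_transform` to get values in feet, here is a minimal sketch. The class `MinMaxScaler1D` and the toy depths are assumptions for illustration; it is not the scikit-learn class and not the notebook's `target_scaler`.

```python
class MinMaxScaler1D:
    """Minimal min-max scaler for a single feature."""

    def fit(self, values):
        self.min_ = min(values)
        self.span_ = max(values) - self.min_
        return self

    def transform(self, values):
        # A constant feature has zero span; map it to 0.0 to avoid dividing by zero
        if self.span_ == 0:
            return [0.0 for _ in values]
        return [(v - self.min_) / self.span_ for v in values]

    def inverse_transform(self, scaled):
        return [s * self.span_ + self.min_ for s in scaled]


toy_depths_feet = [50.0, 100.0, 250.0]
scaler = MinMaxScaler1D().fit(toy_depths_feet)

scaled = scaler.transform(toy_depths_feet)
print(scaled)

# A model output of 0.5 in scaled space converted back to feet
print(scaler.inverse_transform([0.5]))
```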

It should be noted that we do not need to do any data imputation on the training and test sets *y* target feature since it does not have any missing data point.

In [37]:
test_size=0.15
target_variable="GSE_GWE"
# Load the data from the ETL output files
X = read_and_join_output_file()
X.drop(["SHORTAGE_COUNT"], inplace=True, axis=1)
# Split the input pandas Dataframe into training and test datasets, applies the impute pipeline
# transformation and reshapes the datasets to 3D (samples, time, features) numpy arrays
X_train, X_test, y_train, y_test, impute_pipeline, impute_columns, target_scaler = get_train_test_datasets(X, target_variable=target_variable, test_size=test_size, random_seed=RANDOM_SEED, save_to_file=True)
model_predictions_df = pd.DataFrame(y_test, columns=[target_variable])
model_scores_df = pd.DataFrame(columns=["MAE", "MSE", "RMSE"])
nb_features = X_train.shape[-1]
get_sets_shapes(X_train, X_test)

| | nb_items | nb_timestamps | nb_features |
|---|---|---|---|
| training dataset | 406 | 7 | 80 |
| test dataset | 72 | 7 | 80 |


In [38]:
set_config(display="diagram")
display(impute_pipeline)

## Training Different Models
We tried 3 different LSTM models:
* A simple model made of a single *LSTM* layer and an output *Dense* layer
* A model made of an *LSTM* layer followed by *Dense* and *Dropout* layers before the output layer
* An Encoder-Decoder model made of 2 *LSTM* layers followed by *Dense* and *Dropout* layers

![LSTM Model Architectures](../doc/images/lstm_architectures.jpg)


Encoder-decoder architectures are more common for sequence-to-sequence learning, e.g., when forecasting the next 3 days (output sequence of length 3) based on the past year of data (input sequence of length 365). In our case we only predict data for 1 time step in the future. Since the output sequence is of length 1, this architecture might seem superfluous, but we tested it anyway. It was inspired by the Encoder-Decoder architecture in this article: *[CNN-LSTM-Based Models for Multiple Parallel Input and Multi-Step Forecast](https://towardsdatascience.com/cnn-lstm-based-models-for-multiple-parallel-input-and-multi-step-forecast-6fe2172f7668)*.

As such models are made for sequence-to-sequence learning and forecasting, their output differs from that of the previous models: it has the shape *[samples, forecasting sequence length, target features]*. In our case the forecasting sequence length and the number of target features are both 1.
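
The shape difference can be illustrated with NumPy. The sketch below assumes that the 2D target array has the shape *(samples, 1)*; the array contents are toy values.

```python
import numpy as np

n_samples = 4

# Toy target in the 2D layout assumed for the first two models: (samples, 1)
y_2d = np.arange(n_samples, dtype=float).reshape(n_samples, 1)

# Add a trailing axis to match the encoder-decoder output: (samples, 1, 1)
y_3d = y_2d[..., np.newaxis]

# Drop that axis again, e.g. before inverse-scaling predictions: (samples, 1)
restored = y_3d.squeeze(2)

print(y_2d.shape, y_3d.shape, restored.shape)
```

Only the layout changes in both directions; the values themselves are untouched.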
### Training Model 1 - Simple LSTM Model

In [39]:
m1_hyper_parameters = {
    "lstm_units": 160,
    "lstm_activation": "sigmoid",
    "learning_rate": 0.001,
    "validation_split": 0.1,
    "batch_size": 128,
    "epochs": 270,
}

model1 = Sequential()
model1.add(LSTM(m1_hyper_parameters["lstm_units"], activation=m1_hyper_parameters["lstm_activation"], input_shape=(7, nb_features)))
model1.add(Dense(1, activation="linear"))
model1.summary()

Model: "sequential_13"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_14 (LSTM)              (None, 160)               154240    
                                                                 
 dense_15 (Dense)            (None, 1)                 161       
                                                                 
Total params: 154,401
Trainable params: 154,401
Non-trainable params: 0
_________________________________________________________________
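
The parameter counts in this summary can be checked by hand. An LSTM layer has four gates, each with one weight per input feature, one weight per recurrent unit and one bias, for every unit. A Dense layer has one weight per input plus one bias, for every unit. The helper names below are ours, for illustration only, and are not Keras functions.

```python
def lstm_params(units, input_dim):
    """4 gates, each with input weights, recurrent weights and a bias per unit."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(units, input_dim):
    """One weight per input plus one bias, for each unit."""
    return units * input_dim + units


print(lstm_params(160, 80))   # LSTM layer of model 1: 160 units, 80 features
print(dense_params(1, 160))   # output Dense layer of model 1
print(lstm_params(160, 80) + dense_params(1, 160))
```

The same formulas reproduce the counts reported in the summaries of the other two models, e.g. 72,400 for the 100-unit LSTM layer and 246,960 for the 140-unit LSTM layer fed by a 300-unit one.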


In [40]:
model1.compile(loss="mse", optimizer=Adam(learning_rate=m1_hyper_parameters["learning_rate"]), metrics=[keras.metrics.RootMeanSquaredError()])
model1.fit(X_train, y_train,
           validation_split=m1_hyper_parameters["validation_split"],
           batch_size=m1_hyper_parameters["batch_size"],
           epochs=m1_hyper_parameters["epochs"],
           shuffle=True)
yhat = model1.predict(X_test, verbose=0)
yhat_inverse = target_scaler.inverse_transform(yhat)
model_predictions_df["model_1_prediction"] = yhat_inverse
model_scores_df.loc["model 1"] = evaluate_forecast(y_test, yhat_inverse)

Epoch 1/270
...

### Training Model 2 - LSTM + Dense Layer Model

In [41]:
m2_hyper_parameters = {
    "lstm_units": 100,
    "lstm_activation": "sigmoid",
    "dense_units": 11,
    "dense_activation": "tanh",
    "dropout": 0.1,
    "learning_rate": 0.0001,
    "validation_split": 0.1,
    "batch_size": 32,
    "epochs": 200,
}

model2 = Sequential()
model2.add(LSTM(m2_hyper_parameters["lstm_units"], activation=m2_hyper_parameters["lstm_activation"], input_shape=(7, nb_features)))
model2.add(Dense(m2_hyper_parameters["dense_units"], activation=m2_hyper_parameters["dense_activation"]))
model2.add(Dropout(m2_hyper_parameters["dropout"]))
model2.add(Dense(1, activation="linear"))
model2.summary()

Model: "sequential_14"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_15 (LSTM)              (None, 100)               72400     
                                                                 
 dense_16 (Dense)            (None, 11)                1111      
                                                                 
 dropout_2 (Dropout)         (None, 11)                0         
                                                                 
 dense_17 (Dense)            (None, 1)                 12        
                                                                 
Total params: 73,523
Trainable params: 73,523
Non-trainable params: 0
_________________________________________________________________


In [42]:
model2.compile(loss="mse", optimizer=Adam(learning_rate=m2_hyper_parameters["learning_rate"]), metrics=[keras.metrics.RootMeanSquaredError()])
model2.fit(X_train, y_train,
           validation_split=m2_hyper_parameters["validation_split"],
           batch_size=m2_hyper_parameters["batch_size"],
           epochs=m2_hyper_parameters["epochs"],
           shuffle=True)
yhat = model2.predict(X_test, verbose=0)
yhat_inverse = target_scaler.inverse_transform(yhat)
model_predictions_df["model_2_prediction"] = yhat_inverse
model_scores_df.loc["model 2"] = evaluate_forecast(y_test, yhat_inverse)

Epoch 1/200
...

### Training Model 3 - Encoder-Decoder LSTM Model

In [43]:
m3_hyper_parameters = {
    "lstm_units": 300,
    "lstm_activation": "sigmoid",
    "2nd_lstm_units": 140,
    "2nd_lstm_activation": "sigmoid",
    "dense_units": 21,
    "dense_activation": "tanh",
    "dropout": 0.2,
    "learning_rate": 0.001,
    "validation_split": 0.1,
    "batch_size": 32,
    "epochs": 200,
}


model3 = Sequential()
model3.add(LSTM(m3_hyper_parameters["lstm_units"], activation=m3_hyper_parameters["lstm_activation"], input_shape=(7, nb_features)))
model3.add(RepeatVector(1))
# The second LSTM layer uses its own activation hyperparameter
model3.add(LSTM(m3_hyper_parameters["2nd_lstm_units"], activation=m3_hyper_parameters["2nd_lstm_activation"], return_sequences=True))
model3.add(TimeDistributed(Dense(m3_hyper_parameters["dense_units"], activation=m3_hyper_parameters["dense_activation"])))
model3.add(Dropout(m3_hyper_parameters["dropout"]))
model3.add(Dense(1, activation="linear"))
model3.summary()

Model: "sequential_15"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_16 (LSTM)              (None, 300)               457200    
                                                                 
 repeat_vector_1 (RepeatVect  (None, 1, 300)           0         
 or)                                                             
                                                                 
 lstm_17 (LSTM)              (None, 1, 140)            246960    
                                                                 
 time_distributed_1 (TimeDis  (None, 1, 21)            2961      
 tributed)                                                       
                                                                 
 dropout_3 (Dropout)         (None, 1, 21)             0         
                                                                 
 dense_19 (Dense)            (None, 1, 1)            

In [44]:
y_train_3d =  y_train[..., np.newaxis]
model3.compile(loss="mse", optimizer=Adam(learning_rate=m3_hyper_parameters["learning_rate"]), metrics=[keras.metrics.RootMeanSquaredError()])
model3.fit(X_train, y_train_3d,
           validation_split=m3_hyper_parameters["validation_split"],
           batch_size=m3_hyper_parameters["batch_size"],
           epochs=m3_hyper_parameters["epochs"],
           shuffle=True)
yhat = model3.predict(X_test, verbose=0)
yhat_inverse = target_scaler.inverse_transform(yhat.squeeze(2))
model_predictions_df["model_3_prediction"] = yhat_inverse
model_scores_df.loc["model 3"] = evaluate_forecast(y_test, yhat_inverse)

Epoch 1/200
...

## Comparing the Different Models
### Comparing the Model Scores

In [45]:
model_scores_df

| | MAE | MSE | RMSE |
|---|---|---|---|
| model 1 | 23.793732 | 1208.829956 | 34.76823 |
| model 2 | 30.339775 | 1814.69873 | 42.599281 |
| model 3 | 30.205767 | 1610.084839 | 40.125862 |
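
For reference, the three scores can be computed as in the sketch below. The implementation of `evaluate_forecast` is not shown in this notebook, so the function here is an assumption about what it computes, based on the column names MAE, MSE and RMSE. The toy values are made up.

```python
import math


def evaluate(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equal-length sequences."""
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return mae, mse, math.sqrt(mse)


# Toy depths in feet: actual values and predictions
mae, mse, rmse = evaluate([100.0, 200.0, 300.0], [110.0, 190.0, 330.0])
print(mae, mse, rmse)

# RMSE is the square root of MSE, which can be checked on the model 1 scores
print(math.sqrt(1208.829956))
```

MAE and RMSE are expressed in feet, like the target, while MSE is in squared feet. RMSE penalizes large errors more than MAE does, which is why it is always at least as large.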


### Comparing the Model Predictions on the Test Dataset
Here we compare the actual 2021 values of the target variable for the Township-Ranges in the test set with the predictions made by each model from the 2014-2020 data of those Township-Ranges.

In [46]:
model_predictions_df

| | GSE_GWE | model_1_prediction | model_2_prediction | model_3_prediction |
|---|---|---|---|---|
| 0 | 33.198000 | 22.690382 | 27.054747 | 8.250940 |
| 1 | 34.795000 | 50.077850 | 51.687843 | 31.764212 |
| 2 | 161.756667 | 68.355354 | 67.346954 | 47.582420 |
| 3 | 54.423000 | 40.084999 | 40.748684 | 21.895430 |
| 4 | 80.653077 | 97.764793 | 102.518166 | 73.492050 |
| ... | ... | ... | ... | ... |
| 67 | 187.252308 | 178.960907 | 164.108978 | 167.784225 |
| 68 | 179.551290 | 153.492096 | 163.449860 | 148.738907 |
| 69 | 236.543750 | 249.778610 | 250.463730 | 243.026459 |
| 70 | 292.550000 | 274.968384 | 248.172867 | 255.672699 |


Based on the model scores, the simplest of the three LSTM models turns out to have the best scores.

However, considering all the measurements between 2014 and 2022, the `GSE_GWE` (Ground Surface Elevation to Groundwater Water Elevation - Depth to groundwater elevation in feet below ground surface) target value has a
* median of 137.09 feet (~41.8 meters)
* mean of 167.37 feet (~51.0 meters)
* min of 0.5 feet (~0.15 meters)
* max of 727.5 feet (~221.7 meters)

A mean absolute error of 23.79 feet (~7.3 meters) and a root mean square error of 34.77 feet (~10.6 meters) are fairly large relative to these values. Even the best model does not seem accurate enough to be useful.

We save the best model anyway to perform predictions and analyze the results. Refer to the notebook `/ml/deeplearning_results.ipynb` for the analysis of the 2022 prediction results.

In [47]:
model_dir = "../assets/models/"
keras_model_dir = os.path.join(model_dir, "keras_lstm_model")
os.makedirs(keras_model_dir, exist_ok=True)
# Save the Keras Model
model1.save(keras_model_dir)
# Save the data imputation pipeline and target min-max scaler
pipeline_data = {
    "impute_pipeline": impute_pipeline,
    "impute_columns": impute_columns,
    "target_scaler": target_scaler
}
with open(os.path.join(model_dir, "lstm_model_pipeline.pkl"), "wb") as file:
    pickle.dump(pipeline_data, file)

INFO:tensorflow:Assets written to: ../assets/models/keras_lstm_model\assets


## Data Sensitivity Analysis
Taking the best model, we analyze here how much the amount of historical data impacts the LSTM performance. To do so we repeatedly retrain the LSTM (with the same hyperparameters) on more and more historical data: first on the 2020 data only, then on the 2019-2020 data, and so on.

In [None]:
short_models_rmse_df = pd.DataFrame(columns=["nb_years", "rmse"])
for nb_years in range(1,8):
    print(f"train the model with {nb_years} year(s) out of 7 years of data. Please wait...")
    # Get the last nb_years from the training and test sets
    X_train_short = X_train[:,-nb_years:]
    X_test_short = X_test[:,-nb_years:]
    # Reconfigure the model input shape with the number of years
    model = Sequential()
    model.add(LSTM(m1_hyper_parameters["lstm_units"], activation=m1_hyper_parameters["lstm_activation"], input_shape=(nb_years, nb_features)))
    model.add(Dense(1, activation="linear"))
    # Train the model and make predictions
    model.compile(loss="mse", optimizer=Adam(learning_rate=m1_hyper_parameters["learning_rate"]), metrics=[keras.metrics.RootMeanSquaredError()])
    model.fit(X_train_short, y_train,
              validation_split=m1_hyper_parameters["validation_split"],
              batch_size=m1_hyper_parameters["batch_size"],
              epochs=m1_hyper_parameters["epochs"],
              shuffle=True,
              verbose=0)
    yhat = model.predict(X_test_short, verbose=0)
    yhat_inverse = target_scaler.inverse_transform(yhat)
    _, _, rmse_score = evaluate_forecast(y_test, yhat_inverse)
    short_models_rmse_df.loc[len(short_models_rmse_df)] = [nb_years, rmse_score]

train the model with 1 year(s) out of 7 years of data. Please wait...
train the model with 2 year(s) out of 7 years of data. Please wait...
train the model with 3 year(s) out of 7 years of data. Please wait...
train the model with 4 year(s) out of 7 years of data. Please wait...


In [None]:
draw_line_chart(short_models_rmse_df, x="nb_years", x_title="Number of Years", y="rmse", y_title="Root Mean Square Error", title="LSTM Model RMSE", subtitle="Based on the number of years the model was trained on.")

## Hyperparameters Sensitivity Analysis
We perform here an analysis of the best model's sensitivity to the following hyperparameters:
* the optimizer used (e.g., Adam, RMSprop, Adagrad)
* the training-validation datasets split
* the number of LSTM units
* the learning rate
* the batch size
* the number of training epochs

To perform this analysis, we trained 33,345 LSTM models, one for each possible combination (within the selected ranges of values) of those 6 hyperparameters applied to the best model architecture, and recorded for each model the Root Mean Square Error (RMSE) on the test set.
The results are stored in a CSV file available in the `../assets/tuning` folder.
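
To give an idea of how such a number of models arises, the sketch below enumerates a full grid with `itertools.product`. The exact grid is not listed in this notebook, so every value list below is an assumption: the optimizers, validation splits, learning rates and the end points of the epochs and LSTM units ranges are taken from values mentioned in this notebook, while the batch sizes and the step sizes are hypothetical and only chosen so that the total matches 33,345.

```python
from itertools import product
from math import prod

# Assumed grid, NOT the actual tuning configuration
grid = {
    "optimizer": ["Adam", "RMSprop", "Adagrad"],
    "validation_split": [0.05, 0.10, 0.15],
    "learning_rate": [0.01, 0.001, 0.0001],
    "batch_size": [16, 32, 64, 128, 256],    # hypothetical values
    "epochs": list(range(50, 310, 20)),      # 13 values: 50, 70, ..., 290
    "lstm_units": list(range(10, 200, 10)),  # 19 values: 10, 20, ..., 190
}

# Number of models to train: the product of the number of values per hyperparameter
n_combinations = prod(len(values) for values in grid.values())
print(n_combinations)

# Each combination is one tuple of 6 values, in the order of the grid keys
first_combination = dict(zip(grid, next(product(*grid.values()))))
print(first_combination)
```

The count grows multiplicatively with each hyperparameter, which is why an exhaustive search over 6 hyperparameters already requires tens of thousands of training runs.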

The below visualization displays for each hyperparameter value, the distribution of the RMSE, and the mean of RMSE (using the color), for all models trained with that hyperparameter value. This allows us to show if a specific hyperparameter tends to lead to lower or higher RMSE and to compare the distribution between two values of the same hyperparameters.
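
The per-value summary behind such a visualization amounts to grouping the recorded RMSE by hyperparameter value. The sketch below shows the idea in plain Python. The function name and the four records are made up for illustration and are not taken from `hpt_results.csv`.

```python
from collections import defaultdict
from statistics import mean


def mean_rmse_by_value(records, hyperparameter):
    """Group the RMSE of all models by the value of one hyperparameter and average it."""
    groups = defaultdict(list)
    for record in records:
        groups[record[hyperparameter]].append(record["rmse"])
    return {value: mean(scores) for value, scores in groups.items()}


# Made-up records: one per trained model
toy_results = [
    {"optimizer": "Adam", "batch_size": 32, "rmse": 40.0},
    {"optimizer": "Adam", "batch_size": 128, "rmse": 44.0},
    {"optimizer": "Adagrad", "batch_size": 32, "rmse": 150.0},
    {"optimizer": "Adagrad", "batch_size": 128, "rmse": 154.0},
]

print(mean_rmse_by_value(toy_results, "optimizer"))
print(mean_rmse_by_value(toy_results, "batch_size"))
```

In this toy example the optimizer separates the models much more strongly than the batch size does, which is the kind of contrast the visualization is meant to reveal.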

In [None]:
hpt_df = pd.read_csv(r"../assets/tuning/hpt_results.csv")
# We can discard some hyperparameter data to reduce the size of the visualization and improve readability
#hpt_df = hpt_df[hpt_df["epochs"].isin(range(50, 310, 40))]
#hpt_df = hpt_df[hpt_df["lstm_units"].isin(range(10, 200, 30))]
draw_hyperparameters_distribution(hpt_df, ["optimizer", "validation_split", "learning_rate", "batch_size", "epochs", "lstm_units"])

Looking at this visualization, we can see, with some surprise, that the hyperparameters with the biggest impact on the model performance relate less to the model architecture itself (the number of LSTM units) than to how the model is trained.
* The choice of the optimizer seems to have the largest impact on the model performance: both the mean and the distribution of the RMSE of the models trained with the `Adagrad` optimizer are clearly poor.
* The training-validation percentage split seems to have little impact. The best performance is obtained when assigning 10% of the training data to the validation set, but the results are close when assigning 15%. We can also see that when assigning 5% of the data to the validation set the distribution is much more even, i.e., more models perform worse with a small validation set.
* The bigger the learning rate, the better the model performs in terms of RMSE. The distribution of the RMSE of all models shows that with a learning rate of 0.01 most models have a low RMSE, around 40. With a learning rate of 0.001 the RMSE has a bimodal distribution, with models performing either around 40 or very poorly around 150. With a learning rate of 0.0001 the distribution, although still bimodal, is more even, with most models having an RMSE above 60.
* On the other hand, the smaller the batch size, the more models have a low RMSE.
* Although the difference is smaller between similar values (e.g., 50 and 70 epochs, or 270 and 290 epochs), we still see clearly that the bigger the number of training epochs, the more trained models have a low RMSE. With a low number of epochs, more models have an RMSE around 150.
* The number of LSTM units, which determines the number of neurons in the LSTM model, seems to have less impact on the RMSE. The distribution of the RMSE of all models does show differences between 10 and 190 LSTM units, but not as much as for the other hyperparameters. The strong bimodal distribution with a low number of LSTM units shows that in this case models either perform well or very poorly.

What is also interesting is that, as seen below, if we take the combination of all the best hyperparameters
* lstm_units: 60
* learning_rate: 0.01
* validation_split: 0.1
* batch_size: 32
* epochs: 290

we end up with an MAE of 25.62 and an RMSE of 34.96, both slightly worse than those of the best model (model 1) found previously (23.79 and 34.77 respectively). The best model is thus not necessarily obtained by combining the best individual hyperparameters.

In [None]:
m4_hyper_parameters = {
    "lstm_units": 60,
    "lstm_activation": "sigmoid",
    "learning_rate": 0.01,
    "validation_split": 0.1,
    "batch_size": 32,
    "epochs": 290,
}

model4 = Sequential()
model4.add(LSTM(m4_hyper_parameters["lstm_units"], activation=m4_hyper_parameters["lstm_activation"], input_shape=(7, nb_features)))
model4.add(Dense(1, activation="linear"))
model4.summary()

In [None]:
model4.compile(loss="mse", optimizer=Adam(learning_rate=m4_hyper_parameters["learning_rate"]), metrics=[keras.metrics.RootMeanSquaredError()])
model4.fit(X_train, y_train,
           validation_split=m4_hyper_parameters["validation_split"],
           batch_size=m4_hyper_parameters["batch_size"],
           epochs=m4_hyper_parameters["epochs"],
           shuffle=True)
yhat = model4.predict(X_test, verbose=0)
yhat_inverse = target_scaler.inverse_transform(yhat)
model_scores_df.loc["hyperparameters best combination"] = evaluate_forecast(y_test, yhat_inverse)

In [None]:
model_scores_df

## Predicting 2022
Even though the error of our best model is too large for it to be useful, we can try, as an exercise, to predict the 2022 target variable for all the Township-Ranges.

The model was trained to predict the 2021 data based on the previous 7 years of data 2014 to 2020. To predict 2022 we thus need to pass the previous 7 years of data (2015-2021). To do so:
1. We use our impute pipeline trained on the training dataset to impute values on the entire dataset and normalize the data
2. We drop the 2014 data points
3. We reshape the dataset as a 3 dimensional numpy array in the form of [all Township-Ranges, 2015-2021, 80 features]

Once we predict the 2022 values of the target variable, we extract the 2021 values from the original dataset to compare the 2021 values with the predicted 2022 values.

In [None]:
# Predict the 2022 values for all Township-Ranges for the target variable based on 2015-2021 data
X_2015_to_2021 = get_data_for_prediction(X, impute_pipeline, impute_columns)
yhat_2022 = model1.predict(X_2015_to_2021, verbose=0)
yhat_inverse_2022 = target_scaler.inverse_transform(yhat_2022)
predictions_2022_df = pd.DataFrame(yhat_inverse_2022, index=X.index.get_level_values(0).unique(), columns=[target_variable])
# Add the 2022 values of the target variable to the existing ones
all_years_df = combine_all_target_years(X, target_variable, predictions_2022_df)
all_years_df

In [None]:
township_range = TownshipRanges()
all_years_map_df = pd.merge(township_range.sjv_township_range_df, all_years_df.reset_index(), how="left", on=["TOWNSHIP_RANGE", ])
view_trs_side_by_side(all_years_map_df, feature="YEAR", value="GSE_GWE", title="San Joaquin Valley GSE_GWE with 2022 predictions")

When we compare the 2022 predictions of `GSE_GWE` with the actual 2014-2021 values, the predictions are fairly consistent: areas with high `GSE_GWE` values (i.e., a large depth to groundwater) remain high, and areas with low `GSE_GWE` values remain low. However, as long as the model follows the trend of the past years, areas with high `GSE_GWE` will remain areas of high `GSE_GWE` even with a high RMSE of 34.82 feet (10.6 meters). Comparing the `GSE_GWE` measurements of the past years with the 2022 predictions is thus only partly informative about the quality of the predictions.

We thus try to also compare the year-to-year *difference* in the `GSE_GWE` from 2014 to 2021 and between our 2022 predictions and the 2021 values.
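
The year-to-year difference is simply the change in `GSE_GWE` from one year to the next. The sketch below shows the computation for a single Township-Range. The function and the toy values are assumptions for illustration and do not reproduce the implementation of `get_year_to_year_differences`.

```python
def year_to_year_differences(values_by_year):
    """Map each pair of consecutive years to the change in value between them."""
    years = sorted(values_by_year)
    return {
        f"{previous}-{current}": values_by_year[current] - values_by_year[previous]
        for previous, current in zip(years, years[1:])
    }


# Toy GSE_GWE values in feet; the 2022 entry stands for a model prediction
toy_depths = {2020: 120.0, 2021: 125.5, 2022: 123.0}
print(year_to_year_differences(toy_depths))
```

A positive difference means the depth to groundwater increased (the water table dropped), and a negative one means it decreased.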

In [None]:
yty_difference_df = get_year_to_year_differences(X, target_variable, predictions_2022_df)
yty_difference_df

In [None]:
difference_df = pd.merge(township_range.sjv_township_range_df, pd.melt(yty_difference_df.reset_index(), id_vars=["TOWNSHIP_RANGE"], var_name="YEAR", value_name="GSE_GWE_DIFFERENCE"), how="left", on=["TOWNSHIP_RANGE", ])
view_trs_side_by_side(difference_df, feature="YEAR", value="GSE_GWE_DIFFERENCE", title="San Joaquin Valley GSE_GWE year-to-year variations from 2014 until 2022 predictions")

In [None]:
yty_difference_df.describe()

Looking at the above table, here too, the difference between the 2022 predictions and the 2021 measurements of `GSE_GWE` remains consistent with the year-to-year differences of the previous years. Despite the poor RMSE score, the year-to-year variation of `GSE_GWE` remains within an acceptable range.
## Conclusion
Using a simple LSTM neural network to make next-year predictions based on the past 7 years of data, we achieve a more accurate prediction on the test set, with an RMSE of 34.77 feet (~10.6 meters), than with more traditional supervised algorithms like XGBoost or the K-Neighbors regressor, which have an RMSE between 75 and 95 feet (~22.9 and ~29.0 meters).

The 2022 predictions appear to be within the range of acceptable values and year-to-year variations. However, if the objective is to help policymakers and water resources management agencies predict a year in advance where to focus their attention in terms of well water shortages and drilling, the level of error of the model seems too large to be useful.

## 2021 Predictions For Model Comparison and Failure Analysis
The best LSTM model above was trained on a subset of Township-Ranges with all their 2014-2021 data and tested on another subset of Township-Ranges (see the "Preparing the Dataset" section above). The other, more traditional models, however, were trained on 1 year of data to predict the next year's data, with the 2021 data held out as a test set.
To compare these models side-by-side, here we discard the 2021 data, train the model on the 2014-2020 data and use the model to predict the 2021 values. The 2021 predictions are stored in a CSV file which will be used to compare with other model 2021 predictions and the real values.

In [None]:
X_shortened = X.copy()
X_shortened.drop("2021", level=1, axis=0, inplace=True)
X_train, X_test, y_train, y_test, impute_pipeline, impute_columns, target_scaler = get_train_test_datasets(X_shortened, target_variable=target_variable, test_size=test_size, random_seed=RANDOM_SEED, save_to_file=True)
model_predictions_df = pd.DataFrame(y_test, columns=[target_variable])
nb_features = X_train.shape[-1]
get_sets_shapes(X_train, X_test)
model_predict_2021 = Sequential()
model_predict_2021.add(LSTM(m1_hyper_parameters["lstm_units"], activation=m1_hyper_parameters["lstm_activation"], input_shape=(6, nb_features)))
model_predict_2021.add(Dense(1, activation="linear"))
model_predict_2021.compile(loss="mse", optimizer=Adam(learning_rate=m1_hyper_parameters["learning_rate"]), metrics=[keras.metrics.RootMeanSquaredError()])
model_predict_2021.fit(X_train, y_train,
           validation_split=m1_hyper_parameters["validation_split"],
           batch_size=m1_hyper_parameters["batch_size"],
           epochs=m1_hyper_parameters["epochs"],
           shuffle=True)
# Predict the 2021 values for all Township-Ranges for the target variable based on 2015-2020 data
X_2015_to_2020 = get_data_for_prediction(X_shortened, impute_pipeline, impute_columns)
yhat_2021 = model_predict_2021.predict(X_2015_to_2020, verbose=0)
yhat_inverse_2021 = target_scaler.inverse_transform(yhat_2021)
y_2021_df = pd.DataFrame(X.xs("2021", level=1, axis=0)["GSE_GWE"])
predictions_2021_df = y_2021_df.merge(pd.DataFrame(yhat_inverse_2021, index=X.index.get_level_values(0).unique(), columns=["LSTM"]), how="left", left_index=True, right_index=True)
predictions_2021_df.rename(columns={"GSE_GWE": "2021_GSE_GWE"}, inplace=True)
predictions_2021_df

In [None]:
predictions_dir = "../assets/predictions/"
os.makedirs(os.path.dirname(predictions_dir), exist_ok=True)
predictions_2021_df.to_csv(os.path.join(predictions_dir, "lstm_predictions.csv"), index=True)
model_scores_df.rename(index={"model 1": "LSTM"}, inplace=True)
model_scores_df.reset_index(inplace=True)
model_scores_df.rename(columns={"index": "MODEL"},inplace=True)
model_scores_df[model_scores_df["MODEL"]=="LSTM"].to_csv(os.path.join(predictions_dir, "lstm_model_errors.csv"), index=False)