# Deep Learning LSTM Model Training and Testing
This notebook trains 3 different LSTM model architectures to predict the target variable from the historical data of all the features.

In [7]:
import sys
sys.path.append('..')

In [8]:
import os
import pickle
import numpy as np
import pandas as pd
import random
import tensorflow

from tensorflow import keras
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, TimeDistributed, RepeatVector
from keras.optimizers import Adam
from lib.read_data import read_and_join_output_file
from lib.deeplearning import get_train_test_datasets,  get_sets_shapes, evaluate_forecast

In [9]:
RANDOM_SEED = 31
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
tensorflow.random.set_seed(RANDOM_SEED)

In [10]:
print("Num GPUs Available: ", len(tensorflow.config.list_physical_devices('GPU')))

Num GPUs Available:  0


## Preparing the Dataset
### The Train-Test Split
The dataset is made of 478 Township-Ranges, each containing a multivariate time series of 81 features covering the years 2014 to 2021. The dataset can thus be seen as a 3-dimensional dataset of
$478 \text{ Township-Ranges} \times 8 \text{ time stamps} \times 81 \text{ features}$
The objective is to predict the 2022 target value of `GSE_GWE` (Ground Surface Elevation to Groundwater Water Elevation, i.e. the depth to groundwater in feet below ground surface) for each Township-Range.

LSTM neural networks can be used for time series forecasting and take inputs of the shape *[samples, time series steps, features]*. This perfectly fits our dataset.
To fit our dataset and objective, as well as the LSTM input format, we perform the train-test split as follows:
* Training and test sets are split by Township-Range, i.e. some Township-Ranges have all their 2014-2021 data points in the training set while the others are in the test set.
* The model takes the 2014-2020 data of all features, including the target feature, as input and is trained and tested on the 2021 value of the target feature.

With this method, unlike a simple time series forecast where the target feature is predicted only from its own past values, the past values of other features (in our case cultivated crops, precipitation, population density, number of wells drilled) can influence the predicted value of the target feature.
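
The split described above can be sketched as follows, using random numbers in place of the real dataset. The array layout (Township-Ranges × years × features), the position of the target in the last feature column and the helper name `split_by_township_range` are assumptions made for illustration only; the notebook's actual logic lives in `get_train_test_datasets`.

```python
import numpy as np

def split_by_township_range(data, target_idx, test_size, seed):
    """Hypothetical helper: split a (samples, years, features) array
    first by sample (Township-Range), then by time (inputs vs. target)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(data.shape[0])
    n_test = int(round(data.shape[0] * test_size))
    test_ids, train_ids = order[:n_test], order[n_test:]

    def to_xy(block):
        X = block[:, :-1, :]                          # 2014-2020, all features
        y = block[:, -1, target_idx:target_idx + 1]   # 2021, target only
        return X, y

    X_train, y_train = to_xy(data[train_ids])
    X_test, y_test = to_xy(data[test_ids])
    return X_train, X_test, y_train, y_test

# Random stand-in for 478 Township-Ranges x 8 years x 81 features
data = np.random.default_rng(0).random((478, 8, 81))
X_train, X_test, y_train, y_test = split_by_township_range(
    data, target_idx=80, test_size=0.15, seed=31)
print(X_train.shape, X_test.shape)   # (406, 7, 81) (72, 7, 81)
print(y_train.shape, y_test.shape)   # (406, 1) (72, 1)
```

Because whole Township-Ranges are assigned to one side of the split, no time series is shared between the training and test sets.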

![Train-Test Split](../doc/images/deeplearning-train-test-split.jpg)

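We do not create a separate validation dataset. Instead we rely on the `validation_split` argument of the Keras `fit` method, which holds out a fraction of the training data points (i.e., the Township-Ranges) for validation. Note that this is a single fixed hold-out rather than cross-validation: Keras takes the last samples of the training arrays, before any shuffling, and uses the same ones at every epoch.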
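
To make the behaviour of `validation_split` concrete, here is a small sketch in plain Python of how a hold-out taken from the end of the training arrays looks. The function name `holdout_indices` is hypothetical and the exact rounding is an assumption; the point is only that the validation samples are the trailing ones and do not change between epochs.

```python
import math

def holdout_indices(n_samples, validation_split):
    """Mimic a fixed hold-out: the last fraction of the samples,
    in their original order (no shuffling before the split)."""
    split_at = int(math.floor(n_samples * (1.0 - validation_split)))
    train_idx = list(range(split_at))
    val_idx = list(range(split_at, n_samples))
    return train_idx, val_idx

train_idx, val_idx = holdout_indices(n_samples=406, validation_split=0.1)
print(len(train_idx), len(val_idx))   # 365 41
print(val_idx[0], val_idx[-1])        # 365 405
```

If the training array happens to be ordered (for example geographically), the trailing samples may not be representative, so shuffling the Township-Ranges before calling `fit` is worth considering.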
### Data Imputation and Scaling
Missing data imputation for a Township-Range is performed only using the existing data of that Township-Range (not the data of all Township-Ranges). For example:
* a *fill forward* approach is used for many fields like crops, vegetation and soils. The percentage of land use per crop in 2014 in a Township-Range is imputed into the missing year 2015 for that particular Township-Range.
* for fields like `PCT_OF_CAPACITY` (the percentage of capacity of water reservoirs), missing values in a Township-Range are filled using the min, mean, median or max value of that particular Township-Range.

This approach means that the imputation *fit* method does not need to learn values from other Township-Ranges to impute missing values. Since our training and test datasets are split by Township-Range, this avoids issues when the impute pipeline fitted on the training dataset is used to impute data for Township-Ranges it has not seen before.
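
The two imputation strategies above can be sketched with pandas on a toy table. The column names `TOWNSHIP_RANGE`, `YEAR` and `CROP_PCT` and all values are made up for illustration, and this is not the notebook's actual impute pipeline.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "TOWNSHIP_RANGE": ["T01S R02E"] * 3 + ["T02S R03E"] * 3,
    "YEAR": [2014, 2015, 2016] * 2,
    "CROP_PCT": [12.0, np.nan, 15.0, np.nan, 40.0, np.nan],
    "PCT_OF_CAPACITY": [60.0, np.nan, 80.0, 30.0, 50.0, np.nan],
})
df = df.sort_values(["TOWNSHIP_RANGE", "YEAR"])

# Fill forward inside each Township-Range only: a value never leaks
# from one Township-Range into another.
df["CROP_PCT"] = df.groupby("TOWNSHIP_RANGE")["CROP_PCT"].ffill()

# Fill the remaining field with the Township-Range's own mean.
own_mean = df.groupby("TOWNSHIP_RANGE")["PCT_OF_CAPACITY"].transform("mean")
df["PCT_OF_CAPACITY"] = df["PCT_OF_CAPACITY"].fillna(own_mean)

print(df)
```

In this toy table the first `CROP_PCT` value of the second Township-Range stays missing, because there is no earlier value within that Township-Range to carry forward; a real pipeline needs a fallback for such cases.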

We use a MinMax scaler to scale all values between 0 and 1 for the neural network.
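
As a reminder of what min-max scaling does, here is a minimal NumPy sketch of the formula and its inverse. The `MinMax` class is a hypothetical stand-in written for this example, not the scaler object used in the notebook, and the values are made up.

```python
import numpy as np

class MinMax:
    """Tiny illustrative min-max scaler (hypothetical stand-in).
    Note: a constant column would give a zero range and needs
    special handling, which this sketch leaves out."""

    def fit(self, x):
        self.min_ = x.min(axis=0)
        self.range_ = x.max(axis=0) - self.min_
        return self

    def transform(self, x):
        return (x - self.min_) / self.range_

    def inverse_transform(self, x_scaled):
        return x_scaled * self.range_ + self.min_

y_train = np.array([[10.0], [110.0], [60.0]])
scaler = MinMax().fit(y_train)
scaled = scaler.transform(y_train)
print(scaled.ravel().tolist())                            # [0.0, 1.0, 0.5]
print(scaler.inverse_transform(scaled).ravel().tolist())  # [10.0, 110.0, 60.0]
```

The inverse transform is what turns the network's scaled predictions back into feet. Values outside the range seen during fitting map outside of [0, 1].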

Note that no data imputation is needed on the *y* target feature of the training and test sets since it has no missing data points.

In [11]:
test_size=0.15
target_variable="GSE_GWE"
# Load the data from the ETL output files
X = read_and_join_output_file()
# Split the input pandas Dataframe into training and test datasets, applies the impute pipeline
# transformation and reshapes the datasets to 3D (samples, time, features) numpy arrays
X_train, X_test, y_train, y_test, impute_pipeline, target_scaler = get_train_test_datasets(X, target_variable=target_variable,
    test_size=test_size, random_seed=RANDOM_SEED, save_to_file=True)
model_predictions_df = pd.DataFrame(y_test, columns=[target_variable])
model_scores_df = pd.DataFrame(columns=["mae", "mse", "rmse"])
nb_features = X_train.shape[-1]
get_sets_shapes(X_train, X_test)

| | nb_items | nb_timestamps | nb_features |
|---|---|---|---|
| training dataset | 406 | 7 | 81 |
| test dataset | 72 | 7 | 81 |


## Training Different Models
We tried 3 different LSTM models:
* A simple model made of a single *LSTM* layer and an output *Dense* layer
* A model made of an *LSTM* layer followed by *Dense* and *Dropout* layers before the output layer
* An Encoder-Decoder model made of 2 *LSTM* layers followed by *Dense* and *Dropout* layers

![LSTM Model Architectures](../doc/images/deeplearning-architectures.jpg)


Encoder-decoder architectures are more common for sequence-to-sequence learning, e.g. when forecasting the next 3 days (output sequence of length 3) based on the past year of data (input sequence of length 365). In our case we only predict 1 time step in the future. Since the output sequence has length 1, this architecture might seem superfluous, but we tested it anyway. It was inspired by the Encoder-Decoder architecture in this article: *[CNN-LSTM-Based Models for Multiple Parallel Input and Multi-Step Forecast](https://towardsdatascience.com/cnn-lstm-based-models-for-multiple-parallel-input-and-multi-step-forecast-6fe2172f7668)*.

As such models are made for sequence-to-sequence learning and forecasting, their output differs from that of the previous models: it has the shape *[samples, forecasting sequence length, target features]*. In our case the forecasting sequence length and the number of target features are both 1.
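
The shape bookkeeping this implies can be illustrated with NumPy alone. The sketch assumes the target array has shape *[samples, 1]*, and the zero-filled arrays are placeholders for real targets and predictions.

```python
import numpy as np

# Targets for models 1 and 2: [samples, target features]
y_train = np.zeros((406, 1))

# Targets for the encoder-decoder model:
# [samples, forecasting sequence length, target features]
y_train_3d = y_train[..., np.newaxis]
print(y_train_3d.shape)        # (406, 1, 1)

# A prediction of the encoder-decoder model has the same 3D layout;
# dropping the last axis brings it back to 2D for the target scaler.
yhat = np.zeros((72, 1, 1))
print(yhat.squeeze(2).shape)   # (72, 1)
```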
## Training Model 1 - Simple LSTM Model

In [12]:
m1_hyper_parameters = {
    "lstm_units": 160,
    "lstm_activation": "sigmoid",
    "learning_rate": 0.001,
    "validation_split": 0.1,
    "batch_size": 128,
    "epochs": 270,
}

model1 = Sequential()
model1.add(LSTM(m1_hyper_parameters["lstm_units"], activation=m1_hyper_parameters["lstm_activation"], input_shape=(7, nb_features)))
model1.add(Dense(1, activation="linear"))
model1.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 160)               154880    
                                                                 
 dense (Dense)               (None, 1)                 161       
                                                                 
Total params: 155,041
Trainable params: 155,041
Non-trainable params: 0
_________________________________________________________________
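
The parameter counts in the summary can be checked by hand. A standard LSTM layer has 4 gates, each with an input kernel, a recurrent kernel and a bias. The short sketch below (the helper names are ours) applies that formula to the layer sizes used here.

```python
def lstm_params(units, input_dim):
    """Number of weights in a standard LSTM layer: 4 gates, each with
    an input kernel, a recurrent kernel and a bias vector."""
    return 4 * (units * input_dim + units * units + units)

def dense_params(units, input_dim):
    """Number of weights in a fully connected layer (kernel + bias)."""
    return units * input_dim + units

print(lstm_params(units=160, input_dim=81))   # 154880
print(dense_params(units=1, input_dim=160))   # 161
```

Both values match the `Param #` column above and sum to the 155,041 total parameters.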


In [13]:
model1.compile(loss="mse", optimizer=Adam(learning_rate=m1_hyper_parameters["learning_rate"]), metrics=[keras.metrics.RootMeanSquaredError()])
model1.fit(X_train, y_train,
           validation_split=m1_hyper_parameters["validation_split"],
           batch_size=m1_hyper_parameters["batch_size"],
           epochs=m1_hyper_parameters["epochs"],
           shuffle=True)
yhat = model1.predict(X_test, verbose=0)
yhat_inverse = target_scaler.inverse_transform(yhat)
model_predictions_df["model_1_prediction"] = yhat_inverse
model_scores_df.loc["model 1"] = evaluate_forecast(y_test, yhat_inverse)

Epoch 1/270
Epoch 2/270
Epoch 3/270
Epoch 4/270
Epoch 5/270
Epoch 6/270
Epoch 7/270
Epoch 8/270
Epoch 9/270
Epoch 10/270
Epoch 11/270
Epoch 12/270
Epoch 13/270
Epoch 14/270
Epoch 15/270
Epoch 16/270
Epoch 17/270
Epoch 18/270
Epoch 19/270
Epoch 20/270
Epoch 21/270
Epoch 22/270
Epoch 23/270
Epoch 24/270
Epoch 25/270
Epoch 26/270
Epoch 27/270
Epoch 28/270
Epoch 29/270
Epoch 30/270
Epoch 31/270
Epoch 32/270
Epoch 33/270
Epoch 34/270
Epoch 35/270
Epoch 36/270
Epoch 37/270
Epoch 38/270
Epoch 39/270
Epoch 40/270
Epoch 41/270
Epoch 42/270
Epoch 43/270
Epoch 44/270
Epoch 45/270
Epoch 46/270
Epoch 47/270
Epoch 48/270
Epoch 49/270
Epoch 50/270
Epoch 51/270
Epoch 52/270
Epoch 53/270
Epoch 54/270
Epoch 55/270
Epoch 56/270
Epoch 57/270
Epoch 58/270
Epoch 59/270
Epoch 60/270
Epoch 61/270
Epoch 62/270
Epoch 63/270
Epoch 64/270
Epoch 65/270
Epoch 66/270
Epoch 67/270
Epoch 68/270
Epoch 69/270
Epoch 70/270
Epoch 71/270
Epoch 72/270
Epoch 73/270
Epoch 74/270
Epoch 75/270
Epoch 76/270
Epoch 77/270
Epoch 78

## Training Model 2 - LSTM + Dense Layer Model

In [14]:
m2_hyper_parameters = {
    "lstm_units": 100,
    "lstm_activation": "sigmoid",
    "dense_units": 11,
    "dense_activation": "tanh",
    "dropout": 0.1,
    "learning_rate": 0.0001,
    "validation_split": 0.1,
    "batch_size": 32,
    "epochs": 200,
}

model2 = Sequential()
model2.add(LSTM(m2_hyper_parameters["lstm_units"], activation=m2_hyper_parameters["lstm_activation"], input_shape=(7, nb_features)))
model2.add(Dense(m2_hyper_parameters["dense_units"], activation=m2_hyper_parameters["dense_activation"]))
model2.add(Dropout(m2_hyper_parameters["dropout"]))
model2.add(Dense(1, activation="linear"))
model2.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_1 (LSTM)               (None, 100)               72800     
                                                                 
 dense_1 (Dense)             (None, 11)                1111      
                                                                 
 dropout (Dropout)           (None, 11)                0         
                                                                 
 dense_2 (Dense)             (None, 1)                 12        
                                                                 
Total params: 73,923
Trainable params: 73,923
Non-trainable params: 0
_________________________________________________________________


In [15]:
model2.compile(loss="mse", optimizer=Adam(learning_rate=m2_hyper_parameters["learning_rate"]), metrics=[keras.metrics.RootMeanSquaredError()])
model2.fit(X_train, y_train,
           validation_split=m2_hyper_parameters["validation_split"],
           batch_size=m2_hyper_parameters["batch_size"],
           epochs=m2_hyper_parameters["epochs"],
           shuffle=True)
yhat = model2.predict(X_test, verbose=0)
yhat_inverse = target_scaler.inverse_transform(yhat)
model_predictions_df["model_2_prediction"] = yhat_inverse
model_scores_df.loc["model 2"] = evaluate_forecast(y_test, yhat_inverse)

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 20/200
Epoch 21/200
Epoch 22/200
Epoch 23/200
Epoch 24/200
Epoch 25/200
Epoch 26/200
Epoch 27/200
Epoch 28/200
Epoch 29/200
Epoch 30/200
Epoch 31/200
Epoch 32/200
Epoch 33/200
Epoch 34/200
Epoch 35/200
Epoch 36/200
Epoch 37/200
Epoch 38/200
Epoch 39/200
Epoch 40/200
Epoch 41/200
Epoch 42/200
Epoch 43/200
Epoch 44/200
Epoch 45/200
Epoch 46/200
Epoch 47/200
Epoch 48/200
Epoch 49/200
Epoch 50/200
Epoch 51/200
Epoch 52/200
Epoch 53/200
Epoch 54/200
Epoch 55/200
Epoch 56/200
Epoch 57/200
Epoch 58/200
Epoch 59/200
Epoch 60/200
Epoch 61/200
Epoch 62/200
Epoch 63/200
Epoch 64/200
Epoch 65/200
Epoch 66/200
Epoch 67/200
Epoch 68/200
Epoch 69/200
Epoch 70/200
Epoch 71/200
Epoch 72/200
Epoch 73/200
Epoch 74/200
Epoch 75/200
Epoch 76/200
Epoch 77/200
Epoch 78

## Training Model 3 - Encoder-Decoder LSTM Model

In [16]:
m3_hyper_parameters = {
    "lstm_units": 300,
    "lstm_activation": "sigmoid",
    "2nd_lstm_units": 140,
    "2nd_lstm_activation": "sigmoid",
    "dense_units": 21,
    "dense_activation": "tanh",
    "dropout": 0.2,
    "learning_rate": 0.001,
    "validation_split": 0.1,
    "batch_size": 32,
    "epochs": 200,
}


model3 = Sequential()
model3.add(LSTM(m3_hyper_parameters["lstm_units"], activation=m3_hyper_parameters["lstm_activation"], input_shape=(7, nb_features)))
model3.add(RepeatVector(1))
model3.add(LSTM(m3_hyper_parameters["2nd_lstm_units"], activation=m3_hyper_parameters["2nd_lstm_activation"], return_sequences=True))
model3.add(TimeDistributed(Dense(m3_hyper_parameters["dense_units"], activation=m3_hyper_parameters["dense_activation"])))
model3.add(Dropout(m3_hyper_parameters["dropout"]))
model3.add(Dense(1, activation="linear"))
model3.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_2 (LSTM)               (None, 300)               458400    
                                                                 
 repeat_vector (RepeatVector  (None, 1, 300)           0         
 )                                                               
                                                                 
 lstm_3 (LSTM)               (None, 1, 140)            246960    
                                                                 
 time_distributed (TimeDistr  (None, 1, 21)            2961      
 ibuted)                                                         
                                                                 
 dropout_1 (Dropout)         (None, 1, 21)             0         
                                                                 
 dense_4 (Dense)             (None, 1, 1)             

In [17]:
y_train_3d =  y_train[..., np.newaxis]
model3.compile(loss="mse", optimizer=Adam(learning_rate=m3_hyper_parameters["learning_rate"]), metrics=[keras.metrics.RootMeanSquaredError()])
model3.fit(X_train, y_train_3d,
           validation_split=m3_hyper_parameters["validation_split"],
           batch_size=m3_hyper_parameters["batch_size"],
           epochs=m3_hyper_parameters["epochs"],
           shuffle=True)
yhat = model3.predict(X_test, verbose=0)
yhat_inverse = target_scaler.inverse_transform(yhat.squeeze(2))
model_predictions_df["model_3_prediction"] = yhat_inverse
model_scores_df.loc["model 3"] = evaluate_forecast(y_test, yhat_inverse)

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 20/200
Epoch 21/200
Epoch 22/200
Epoch 23/200
Epoch 24/200
Epoch 25/200
Epoch 26/200
Epoch 27/200
Epoch 28/200
Epoch 29/200
Epoch 30/200
Epoch 31/200
Epoch 32/200
Epoch 33/200
Epoch 34/200
Epoch 35/200
Epoch 36/200
Epoch 37/200
Epoch 38/200
Epoch 39/200
Epoch 40/200
Epoch 41/200
Epoch 42/200
Epoch 43/200
Epoch 44/200
Epoch 45/200
Epoch 46/200
Epoch 47/200
Epoch 48/200
Epoch 49/200
Epoch 50/200
Epoch 51/200
Epoch 52/200
Epoch 53/200
Epoch 54/200
Epoch 55/200
Epoch 56/200
Epoch 57/200
Epoch 58/200
Epoch 59/200
Epoch 60/200
Epoch 61/200
Epoch 62/200
Epoch 63/200
Epoch 64/200
Epoch 65/200
Epoch 66/200
Epoch 67/200
Epoch 68/200
Epoch 69/200
Epoch 70/200
Epoch 71/200
Epoch 72/200
Epoch 73/200
Epoch 74/200
Epoch 75/200
Epoch 76/200
Epoch 77/200
Epoch 78

## Comparing the Different Models
### Comparing the Model Scores

In [18]:
model_scores_df

| | mae | mse | rmse |
|---|---|---|---|
| model 1 | 23.666956 | 1212.567017 | 34.821934 |
| model 2 | 29.587023 | 1729.227051 | 41.583977 |
| model 3 | 30.181566 | 1613.689209 | 40.17075 |
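
For reference, the three scores are defined as follows. This is a minimal sketch in plain Python on made-up values. The function `forecast_scores` is a hypothetical stand-in and not necessarily how `evaluate_forecast` is implemented.

```python
import math

def forecast_scores(y_true, y_pred):
    """Mean absolute error, mean squared error and root mean squared error."""
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse)}

# Made-up depths in feet, for illustration only
scores = forecast_scores(y_true=[10, 20, 30, 40], y_pred=[13, 16, 30, 45])
print(scores["mae"], scores["mse"])   # 3.0 12.5
print(round(scores["rmse"], 4))       # 3.5355
```

Since the RMSE squares the errors before averaging, it penalizes large individual misses more than the MAE does, which is why it is the larger of the two in the table.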


### Comparing the Model Predictions on the Test Dataset
Here we compare the actual 2021 values of the target variable for the Township-Ranges in the test set with the predictions each model made from the 2014-2020 data of those Township-Ranges.

In [19]:
model_predictions_df

| | GSE_GWE | model_1_prediction | model_2_prediction | model_3_prediction |
|---|---|---|---|---|
| 0 | 33.198000 | 24.226774 | 28.419720 | 8.400249 |
| 1 | 34.795000 | 50.136299 | 54.256641 | 31.891882 |
| 2 | 161.756667 | 67.704063 | 69.008156 | 48.474617 |
| 3 | 54.423000 | 41.642086 | 41.073723 | 21.912615 |
| 4 | 80.653077 | 96.102310 | 102.622726 | 73.064133 |
| ... | ... | ... | ... | ... |
| 67 | 187.252308 | 179.061325 | 164.998611 | 167.344238 |
| 68 | 179.551290 | 154.905746 | 162.471863 | 149.055557 |
| 69 | 236.543750 | 248.222504 | 249.016388 | 241.518127 |
| 70 | 292.550000 | 274.849457 | 250.053467 | 255.379242 |


Based on the model scores, the simplest of the three LSTM models achieves the best results, with the lowest MAE, MSE and RMSE.

However, considering all the measurements between 2014 and 2022, the `GSE_GWE` (Ground Surface Elevation to Groundwater Water Elevation, i.e. the depth to groundwater in feet below ground surface) target value has a
* median of 137.09 feet (~41.8 meters)
* mean of 167.37 feet (~51.0 meters)
* min value of 0.5 feet (~0.15 meters)
* max value of 727.5 feet (~221.7 meters)

A mean absolute error of 23.67 feet (7.2 meters) and a root mean square error of 34.82 feet (10.6 meters) in the prediction are fairly large relative to these values. Even the best model does not seem accurate enough to be useful.
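
To put these errors in perspective, the short calculation below expresses the scores of model 1 in meters and as a fraction of the mean target value. It only reuses the numbers quoted in this notebook, with the exact conversion factor of 0.3048 meters per foot.

```python
FEET_TO_METERS = 0.3048

mae_ft, rmse_ft = 23.666956, 34.821934   # model 1 scores
mean_ft = 167.37                          # mean GSE_GWE value

mae_share = mae_ft / mean_ft
rmse_share = rmse_ft / mean_ft
print(f"MAE:  {mae_ft * FEET_TO_METERS:.1f} m, {mae_share:.1%} of the mean")
print(f"RMSE: {rmse_ft * FEET_TO_METERS:.1f} m, {rmse_share:.1%} of the mean")
```

The typical error is thus roughly 14% of the mean depth to groundwater for the MAE and about 21% for the RMSE.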

We save the best model anyway to perform predictions and analyze the results. Refer to the notebook `/ml/deeplearning_results.ipynb` for the analysis of the 2022 predictions results.

In [23]:
model_dir = "../assets/models/"
keras_model_dir = os.path.join(model_dir, "keras_lstm_model")
os.makedirs(keras_model_dir, exist_ok=True)
# Save the Keras Model
model1.save(keras_model_dir)
# Save the data imputation pipeline and target min-max scaler
pipeline_data = {
    "impute_pipeline": impute_pipeline,
    "target_scaler": target_scaler
}
with open(os.path.join(model_dir, "lstm_model_pipeline.pkl"), "wb") as file:
    pickle.dump(pipeline_data, file)

INFO:tensorflow:Assets written to: ../assets/models/keras_lstm_model\assets
