# Predicting With the LSTM Model
The dataset is made of 478 Township-Ranges, each containing a multivariate time series of 81 features with yearly data from 2014 to 2021. It can thus be seen as a 3-dimensional dataset of
$478 \text{ Township-Ranges} \times 8 \text{ time stamps} \times 81 \text{ features}$
The objective is to predict the 2022 target value of `GSE_GWE` (Ground Surface Elevation to Groundwater Water Elevation - Depth to groundwater elevation in feet below ground surface) for each Township-Range.

![Multi-Variate Multi Time-Series Predictions with LSTM - Training and Prediction](../doc/images/lstm_inputs_outputs.jpg)

LSTMs are well suited to time series and NLP because both involve sequential data in which each step depends on previous states.
The future prediction *Y(t+n+1)* depends not only on the last state *X1(t+n) - Y(t+n)*, and not only on the past values of the feature *Y(t+1) - Y(t+n)*, but on the entire sequence of past states.
During training and predictions:
* Township-Ranges are passed into the model one by one
* each cell in the LSTM neural network receives a Township-Range state for a specific year (the state of the Township-Range at a specific position in the series)
* each state (year) in the series is represented by a multi-dimensional vector of all 81 features (including the target feature Y `GSE_GWE`)

The output is the Township-Range's value of the feature Y `GSE_GWE` for the following year. The model is trained on 2014-2020 data (7 years) to predict 2021.
During inference, the last 7 years of data (2015-2021) are passed as input to predict the 2022 value.
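
The sliding window described above can be sketched with a synthetic array. This is only a minimal illustration, assuming the data are already arranged as `[Township-Range, year, feature]`. The random values, the `TARGET_IDX` position of `GSE_GWE` and the variable names are hypothetical and are not taken from the project's `lib` code; the train/test split across Township-Ranges is left out.

```python
import numpy as np

N_TOWNSHIP_RANGES, N_YEARS, N_FEATURES = 478, 8, 81
TARGET_IDX = 0  # hypothetical position of GSE_GWE among the 81 features

rng = np.random.default_rng(31)
# Synthetic stand-in for the real dataset: years 2014..2021 along axis 1
data = rng.random((N_TOWNSHIP_RANGES, N_YEARS, N_FEATURES))

# Training: 2014-2020 (first 7 years) as input, the 2021 target value as label
X_train = data[:, :7, :]
y_train = data[:, 7, TARGET_IDX]

# Inference: 2015-2021 (last 7 years) as input to predict 2022
X_inference = data[:, 1:, :]

print(X_train.shape, y_train.shape, X_inference.shape)
# (478, 7, 81) (478,) (478, 7, 81)
```

Both inputs have the same `[478, 7, 81]` shape, which is why a model trained on the 2014-2020 window can be applied unchanged to the 2015-2021 window.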

![Multi-Variate Multi Time-Series Predictions with LSTM - Cells Inputs](../doc/images/lstm_table_to_cells.jpg)

This notebook uses the best model trained in the `/ml/deeplearning_training.ipynb` notebook to predict the 2022 value of the target variable `GSE_GWE` (depth to groundwater in feet below ground surface) for each Township-Range, and analyses the results.

Our best LSTM model was the one with the simplest architecture.

![LSTM Simple Model Architecture](../doc/images/lstm_architecture_1.jpg)

On the test set, this model had the following scores:
* a Mean Absolute Error of 23.66 feet (7.2 meters)
* a Root Mean Square Error of 34.82 feet (10.6 meters).
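
As a reminder of what these two scores measure, here is a minimal sketch of both metrics and of the feet-to-meters conversion. The `mae` and `rmse` helpers and the three depth values are made up for illustration; they are not the evaluation code or the data used in the training notebook.

```python
import numpy as np

FEET_TO_METERS = 0.3048

def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of the errors."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    """Root Mean Square Error: penalizes large errors more than MAE does."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# Made-up depths to groundwater, in feet
y_true = np.array([50.0, 120.0, 400.0])
y_pred = np.array([60.0, 100.0, 430.0])

print(mae(y_true, y_pred))   # 20.0
print(rmse(y_true, y_pred))  # larger than the MAE because of the 30-foot error

# Converting the reported scores from feet to meters
print(round(23.66 * FEET_TO_METERS, 1), round(34.82 * FEET_TO_METERS, 1))  # 7.2 10.6
```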

Note: To run this notebook you must have run
* all the EDA notebooks to have all the data locally,
* the `/ml/deeplearning_training.ipynb` notebook at least once for the model to have been saved in your environment.

In [2]:
import sys
sys.path.append('..')

In [4]:
import os
import pickle
import numpy as np
import pandas as pd
import random

import tensorflow
from tensorflow import keras
# TownshipRanges is needed further down to map the results
from lib.township_range import TownshipRanges
from lib.read_data import read_and_join_output_file
from lib.deeplearning import get_data_for_prediction, combine_all_target_years, get_year_to_year_differences
from lib.viz import view_trs_side_by_side

In [5]:
RANDOM_SEED = 31
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
tensorflow.random.set_seed(RANDOM_SEED)

In [6]:
test_size=0.15
target_variable="GSE_GWE"
# Load the data from the ETL output files
X = read_and_join_output_file()
# We load the best trained model
# Refer to the notebook /ml/deeplearning_training.ipynb
model_dir = "../assets/models/"
keras_model_dir = os.path.join(model_dir, "keras_lstm_model")
model = keras.models.load_model(keras_model_dir)
# Load the imputation pipeline and the target variable min-max scaler
with open(os.path.join(model_dir, "lstm_model_pipeline.pkl"), "rb") as file:
    pipeline_data = pickle.load(file)
impute_pipeline = pipeline_data.get("impute_pipeline")
target_scaler = pipeline_data.get("target_scaler")

## Predicting 2022
Even though the error of our best model is too large for it to be useful, we can try, as an exercise, to predict the 2022 target variable for all the Township-Ranges.

The model was trained to predict the 2021 data based on the previous 7 years of data (2014 to 2020). To predict 2022 we thus need to pass the previous 7 years of data (2015-2021). To do so:
1. We use our impute pipeline, trained on the training dataset, to impute values on the entire dataset and normalize the data
2. We drop the 2014 data points
3. We reshape the dataset as a 3-dimensional NumPy array in the form of [all Township-Ranges, 2015-2021, 81 features]

Once we have predicted the 2022 values of the target variable, we extract the 2021 values from the original dataset to compare them with the predicted 2022 values.
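
Steps 2 and 3 can be sketched on a toy DataFrame. This is an assumption about what a helper such as `get_data_for_prediction` might do, not its actual implementation. The toy Township-Ranges, the `FEATURE_A` and `FEATURE_B` columns and the random values are hypothetical, step 1 (imputation and normalization) is skipped, and the reshape assumes every Township-Range has one row per year.

```python
import numpy as np
import pandas as pd

township_ranges = ["T01N R02E", "T01N R03E", "T32S R30E"]
years = list(range(2014, 2022))
features = ["GSE_GWE", "FEATURE_A", "FEATURE_B"]  # hypothetical feature names

# Toy dataset indexed by (TOWNSHIP_RANGE, YEAR), one row per pair
index = pd.MultiIndex.from_product(
    [township_ranges, years], names=["TOWNSHIP_RANGE", "YEAR"]
)
rng = np.random.default_rng(31)
X_toy = pd.DataFrame(
    rng.random((len(index), len(features))), index=index, columns=features
)

# Step 2: drop the 2014 data points and keep rows ordered by Township-Range, then year
X_recent = X_toy[X_toy.index.get_level_values("YEAR") != 2014].sort_index()

# Step 3: reshape to [Township-Ranges, years, features]
n_township_ranges = X_recent.index.get_level_values("TOWNSHIP_RANGE").nunique()
n_years = X_recent.index.get_level_values("YEAR").nunique()
X_3d = X_recent.to_numpy().reshape(n_township_ranges, n_years, len(features))

print(X_3d.shape)  # (3, 7, 3)
```

Sorting the index before reshaping matters: the reshape only groups rows correctly if all years of a Township-Range are contiguous and in chronological order.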

In [7]:
# Predict the 2022 values for all Township-Ranges for the target variable based on 2015-2021 data
X_2015_to_2021 = get_data_for_prediction(X, impute_pipeline)
yhat_2022 = model.predict(X_2015_to_2021, verbose=0)
yhat_inverse_2022 = target_scaler.inverse_transform(yhat_2022)
predictions_2022_df = pd.DataFrame(yhat_inverse_2022, index=X.index.get_level_values(0).unique(), columns=[target_variable])
# Add the 2022 values of the target variable to the existing ones
all_years_df = combine_all_target_years(X, target_variable, predictions_2022_df)
all_years_df

TOWNSHIP_RANGE,YEAR,GSE_GWE
T01N R02E,2014,57.046154
T01N R02E,2015,56.027436
T01N R02E,2016,48.830000
T01N R02E,2017,48.007333
T01N R02E,2018,45.985000
...,...,...
T32S R30E,2018,405.450000
T32S R30E,2019,413.150000
T32S R30E,2020,404.600000
T32S R30E,2021,383.500000


In [6]:
township_range = TownshipRanges()
all_years_map_df = pd.merge(township_range.sjv_township_range_df, all_years_df.reset_index(), how="left", on=["TOWNSHIP_RANGE", ])
view_trs_side_by_side(all_years_map_df, feature="YEAR", value="GSE_GWE", title="San Joaquin Valley GSE_GWE with 2022 predictions")

When we compare the 2022 predictions of `GSE_GWE` with the actual 2014-2021 values, the predictions are fairly consistent: areas with high `GSE_GWE` values (i.e., a large depth from ground surface to groundwater) remain high, and areas with low `GSE_GWE` remain low. However, as long as the model follows the trend of past years, areas with high `GSE_GWE` will remain areas of high `GSE_GWE`, even with an RMSE of 34.82 feet (10.6 meters). Comparing past `GSE_GWE` measurements with the 2022 predictions is thus only partly informative.

We thus decide to also compare the year-to-year *difference* in `GSE_GWE` from 2014 to 2021 with the difference between our 2022 predictions and the 2021 values.
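
A year-to-year difference table of this kind can be sketched with a pivot followed by a column-wise difference. This is a hedged illustration on made-up values, not the actual implementation of `get_year_to_year_differences`. Township-Ranges `A` and `B` are hypothetical, and the sign convention (later year minus earlier year) is the one implied by the values in this notebook, so a positive difference means the groundwater got deeper.

```python
import pandas as pd

# Made-up long-format data: one GSE_GWE value per Township-Range and year
long_df = pd.DataFrame({
    "TOWNSHIP_RANGE": ["A", "A", "A", "B", "B", "B"],
    "YEAR": [2020, 2021, 2022, 2020, 2021, 2022],
    "GSE_GWE": [50.0, 55.0, 53.0, 400.0, 380.0, 390.0],
})

# One row per Township-Range, one column per year
wide = long_df.pivot(index="TOWNSHIP_RANGE", columns="YEAR", values="GSE_GWE")

# Difference between each year and the previous one; the first column is all NaN
diffs = wide.diff(axis=1).iloc[:, 1:]
diffs.columns = [
    f"{prev}_{curr}" for prev, curr in zip(wide.columns[:-1], wide.columns[1:])
]

print(diffs)
# A: 2020_2021 = 5.0, 2021_2022 = -2.0
# B: 2020_2021 = -20.0, 2021_2022 = 10.0
```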

In [None]:
yty_difference_df = get_year_to_year_differences(X, target_variable, predictions_2022_df)
yty_difference_df

TOWNSHIP_RANGE,2014_2015,2015_2016,2016_2017,2017_2018,2018_2019,2019_2020,2020_2021,2021_2022
T01N R02E,-1.018718,-7.197436,-0.822667,-2.022333,-0.203500,6.414500,0.997636,1.667852
T01N R03E,5.538801,-10.782982,-1.871021,-1.744653,0.230496,0.134899,8.257401,14.075516
T01N R04E,14.321818,-4.161818,-2.347333,3.394000,-2.202381,4.957381,-2.288810,0.263447
T01N R05E,8.593333,-3.834333,-1.543286,2.161978,1.180974,4.827487,-0.859790,-8.564791
T01N R06E,0.445625,6.296518,-8.221518,2.960375,-0.574158,11.043158,0.818000,-2.127443
...,...,...,...,...,...,...,...,...
T32S R26E,-5.275000,4.211538,-26.840110,8.766807,0.764396,11.478138,23.135897,-35.759245
T32S R27E,6.091629,9.723077,17.857692,-8.739271,-0.926316,-19.604605,32.741071,-16.047874
T32S R28E,-42.664167,51.418452,54.644048,-79.904487,20.433654,-0.816071,-17.148352,-2.887731
T32S R29E,0.114286,26.715714,2.086154,1.358846,0.155769,-13.852198,-17.951299,-4.087142


In [8]:
difference_df = pd.merge(township_range.sjv_township_range_df, pd.melt(yty_difference_df.reset_index(), id_vars=["TOWNSHIP_RANGE"], var_name="YEAR", value_name="GSE_GWE_DIFFERENCE"), how="left", on=["TOWNSHIP_RANGE", ])
view_trs_side_by_side(difference_df, feature="YEAR", value="GSE_GWE_DIFFERENCE", title="San Joaquin Valley GSE_GWE year-to-year variations from 2014 until 2022 predictions")

In [9]:
yty_difference_df.describe()

statistic,2014_2015,2015_2016,2016_2017,2017_2018,2018_2019,2019_2020,2020_2021,2021_2022
count,478.0,478.0,478.0,478.0,478.0,478.0,478.0,478.0
mean,4.168503,9.478901,-4.443996,-11.496586,8.155784,-0.856081,9.706196,4.578692
std,37.938902,43.592788,41.070322,55.686581,47.958611,61.609121,42.090169,23.9575
min,-214.5,-357.565,-418.166667,-439.9,-113.860606,-403.02,-301.165,-117.883564
25%,-6.549669,-3.354762,-11.805607,-13.710833,-8.499281,-11.23074,-2.559786,-8.303697
50%,3.651762,4.567063,-2.47957,-0.102719,0.007393,0.9625,6.161682,3.837004
75%,14.973875,16.306929,5.665875,8.030018,11.546388,13.615143,23.236474,16.413693
max,345.015,228.0,236.31,137.892,475.72,350.9,192.938286,122.175941


Here too, the difference between the 2022 predictions and the 2021 measurements of `GSE_GWE` remains consistent with the earlier year-to-year differences. The table above, showing the *mean*, *standard deviation*, *min*, *max*, etc., confirms this: the mean 2021_2022 difference (4.58 feet) lies within the range of the previous yearly means (-11.50 to 9.71 feet), and its standard deviation (23.96 feet) is lower than in any previous year.

## Conclusion
Using a simple LSTM neural network to make next-year predictions based on the past 7 years of data, we achieve a more accurate prediction on the test set, with an RMSE of 34.82 feet (10.6 meters), compared to an RMSE between 75 and 95 feet (22.8 and 28.9 meters) using other supervised algorithms like XGBoost or a K-Neighbours regressor.

The 2022 predictions appear to be within the range of acceptable values and year-to-year variations. However, if the objective is to help policy makers and water resources management agencies predict a year in advance where to focus their attention in terms of well water shortages and drilling, the level of error of this model feels too big to be useful.