# 🏞 Statistical tests & LSTM (PyTorch) 📈

(you can visit my last notebook on marketing : https://www.kaggle.com/anthokalel/basket-analysis-for-dog-food)

![](https://media.nationalgeographic.org/assets/photos/000/191/19125.jpg)

* [Introduction](#intro)
* [Exploratory Data Analysis](#section-one)
    - [Importing libraries and loading data](#section_one_1)
    - [Description of the dataset](#section_one_2)
* [Study of Multivariate Time Series](#section_two)
    - [How much do the time series influence each other ? (Granger's causality test)](#section_two_1)
    - [How are the time series correlated ? (Pearson's test)](#section_two_3)
* [LSTM  model and feature importance (PyTorch)](#section_three)
    - [LSTM model](#section_three_1)
    - [Feature importance](#section_three_2)
    - [Results for Lake Level](#section_three_3)
    - [Results for Flow Rate](#section_three_4)
* [Generalisation to other datasets](#generalization)
    - [Aquifer Auser](#auser)
    - [Aquifer Doganella](#doganella)
    - [Aquifer Luco](#luco)
    - [Aquifer Petrignano](#petrignano)
    - [River Arno](#arno)
    - [Water Spring Amiata](#amiata)
    - [Water Spring Lupa](#lupa)
    - [Water Spring Madonna di Canneto](#madonna)
* [Conclusion](#conclusion)

<a id="intro"></a>
## Introduction



Hi, welcome to my notebook !


The goal of this notebook is to develop a generic method to forecast the time series of the dataset, and then to use the trained model to find out which features matter most for its predictions.
The method is developed on the Lake Bilancino dataset and is then generalized to the other datasets.


Enjoy the read !

<a id="section-one"></a>
### I - Exploratory Data Analysis


What is the Bilancino Lake ?

Bilancino lake is an artificial lake located in the municipality of Barberino di Mugello (about 50 km from Florence). It is used to refill the Arno river during the summer months. Indeed, during the winter months, the lake is filled up and then, during the summer months, the water of the lake is poured into the Arno river.

<a id="section_one_1"></a>
#### Importing libraries and loading data

In [None]:
import numpy as np
import pandas as pd
import holoviews as hv
import plotly.express as px
import plotly.graph_objects as go
import torch
from holoviews import opts
from plotly.subplots import make_subplots
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler, StandardScaler
from statsmodels.tsa.stattools import grangercausalitytests
from torch import nn
from torch.utils.data import DataLoader, Dataset

hv.extension('bokeh')

Lake_Bilancino = pd.read_csv("../input/acea-water-prediction/Lake_Bilancino.csv")

In [None]:
Lake_Bilancino

The dataset is composed of daily time series from 03/06/2002 to 30/06/2020 :
* 5 Rainfall variables ;
* 1 Temperature variable ;
* Lake Level (target) ;
* Flow Rate (target).

We can see many NaN values at the beginning of the dataset for several variables.

I don't want to impute these values because I don't want my model to learn from made-up data. The best option is to leave them as they are, which is why the analysis below starts at row 578 (`Lake_Bilancino.iloc[578:]`).


<a id="section_one_2"></a>
#### Description of the dataset

Let's look at a statistical description of the dataset to get to know it better.

In [None]:
Lake_Bilancino.describe()

Rainfall_X indicates the quantity of rain falling in the area X, expressed in millimeters (mm).

Here are the box plots of rainfall : 

In [None]:
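def box_plot(df, id_begin = 1, id_last = 6):
    """Draw one box plot per column of df.columns[id_begin:id_last]."""
    fig = go.Figure()
    for column in df.columns[id_begin:id_last]:
        fig.add_trace(go.Box(y=df[column], name = column))
    # Style all the traces once, after they have been added
    fig.update_traces(boxpoints='outliers', jitter=0)
    return fig.show()


# Columns 1 to 5 are the five rainfall variables (the end of the slice is exclusive)
box_plot(Lake_Bilancino, id_begin = 1, id_last = 6)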

Most rainfall values are zero or close to it, but from time to time there are extreme values that can influence the model when forecasting Lake Level and Flow Rate.

The temperature indicates the temperature, expressed in °C, detected by the thermometric station Le Croci.

In [None]:
fig = px.line(Lake_Bilancino, x=Lake_Bilancino['Date'], y=Lake_Bilancino['Temperature_Le_Croci'], 
              title='Temperature_Le_Croci')
fig.show()

In [None]:
box_plot(Lake_Bilancino, id_begin = 6, id_last = 7)

The temperature is 15 °C on average. It is fairly evenly spread around this value, with extremes of about -5 °C in winter and almost 35 °C in summer.


The Lake Level indicates the level of the lake, expressed in meters (m)


In [None]:
fig = px.line(Lake_Bilancino, x=Lake_Bilancino['Date'], y=Lake_Bilancino['Lake_Level'], 
              title='Lake_Level')
fig.show()

The Lake Level reaches its minimum in the middle of autumn, after the summer, and its maximum in April and May. Its variations are roughly sinusoidal.

In [None]:
box_plot(Lake_Bilancino, id_begin = 7, id_last = 8)

Most of the values are around 250 m but there are also many values above.

The Flow Rate indicates the lake's flow rate, expressed in cubic meters per second (m³/s) :

In [None]:
fig = px.line(Lake_Bilancino, x=Lake_Bilancino['Date'], y=Lake_Bilancino['Flow_Rate'], 
              title='Flow_Rate')
fig.show()

In [None]:
box_plot(Lake_Bilancino, id_begin = 8, id_last = 9)

* Most of the values of the rainfall time series are 0. The question is : do the non-zero values have an influence on the behavior of Lake Level and Flow Rate ? Probably, because rainfall fills the lake, but we will check this statistically ;
* Temperature is fairly evenly spread around 15 °C. I think this is the result of the variation between winter and summer ;
* Lake Level is almost always close to 250 m, but there is a non-negligible number of values under 248 m ;
* There is no trend for Flow Rate : it varies greatly, with peaks from time to time.

<a id="section_two"></a>
### Study of Multivariate Time Series

Now, we are going to look more closely at the behaviour of our time series and at how they depend on and influence each other, in order to choose a suitable model.

Remember, we want to create a robust model to predict Lake Level and Flow Rate from past data. A multivariate time series analysis tells us how much the series depend on each other, and this extra information makes the forecast more robust and more precise.

We want to find the variables we will use in our future model. We don't want to use variables that are not significant for the forecast because they can skew the model.

<a id="section_two_1"></a>
#### How much do the time series influence each other ? (Granger's causality test)

In order to know how much the time series influence each other, we are going to use Granger's causality test. This test is commonly used to decide which variables a VAR model should include, that is, to find the variables between which there is a linear predictive relationship.

The null hypothesis is that the coefficients of the past values of X in the regression of Y are all zero, which means that the past values of the X time series do not help to predict the Y time series.
If the p-value obtained from the test is less than the significance level of 0.05, we can reject the null hypothesis and conclude that the past values of X carry linear information about Y.
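To make the idea behind the test concrete, here is a minimal hand-rolled sketch on synthetic data. It is not the `statsmodels` implementation used in this notebook : it only computes the F statistic that compares a regression of Y on its own past (restricted) with a regression that also includes the past of X (unrestricted). The series `x` and `y`, the coefficient 0.8 and the function names are hypothetical and chosen for the illustration.

```python
import numpy as np

def rss(design, target):
    """Residual sum of squares of an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    return float(residuals @ residuals)

def granger_f_stat(cause, effect, lag=1):
    """F statistic for 'the past of `cause` helps to predict `effect`' with `lag` lags."""
    n = len(effect)
    target = effect[lag:]
    ones = np.ones(n - lag)
    # Column k holds the series delayed by k time steps
    own_lags = np.column_stack([effect[lag - k:n - k] for k in range(1, lag + 1)])
    cause_lags = np.column_stack([cause[lag - k:n - k] for k in range(1, lag + 1)])
    restricted = np.column_stack([ones, own_lags])
    unrestricted = np.column_stack([ones, own_lags, cause_lags])
    rss_r, rss_u = rss(restricted, target), rss(unrestricted, target)
    dof = len(target) - unrestricted.shape[1]
    return ((rss_r - rss_u) / lag) / (rss_u / dof)

# Synthetic example (assumption): y follows x with a delay of one time step
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = np.zeros(n)
y[1:] = 0.8 * x[:-1] + 0.5 * rng.normal(size=n - 1)

f_xy = granger_f_stat(x, y)   # does x help to predict y ?
f_yx = granger_f_stat(y, x)   # does y help to predict x ?
print(f"F(x -> y) = {f_xy:.1f}, F(y -> x) = {f_yx:.2f}")
```

A large F statistic (hence a small p-value) in one direction only is what a green cell in the heatmap of this section stands for.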



In [None]:
def granger_causality_tests(df, maxlag = 12):

    """Check Granger causality for all pairs of time series.
    The rows are the response variables (y), the columns are the predictors (x).
    Each cell holds the minimum p-value over the lags 1..maxlag. A p-value less
    than the significance level (0.05) means that the null hypothesis
    "the past values of x do not help to predict y" can be rejected.

    df     : pandas dataframe containing the time series variables
    maxlag : largest lag (in time steps) that is tested
    """
    
    variables = df.columns
    granger_pvalue_matrix = pd.DataFrame(np.zeros((len(variables), len(variables))))

    for i, response in enumerate(variables):
        for j, predictor in enumerate(variables):
            # grangercausalitytests checks whether the 2nd column causes the 1st one
            results = grangercausalitytests(df[[response, predictor]].dropna(),
                                            maxlag = maxlag, verbose = False)
            min_p_value = np.min([results[lag][0]['ssr_chi2test'][1]
                                  for lag in range(1, maxlag + 1)])
            granger_pvalue_matrix.loc[i, j] = min_p_value

    granger_pvalue_matrix.columns = [var + '_x' for var in variables]
    granger_pvalue_matrix.index = [var + '_y' for var in variables]
    return granger_pvalue_matrix

def draw_granger_pvalue_matrix(granger_pvalue_matrix):
    fig = go.Figure(data=go.Heatmap(z = granger_pvalue_matrix, 
                                    x = granger_pvalue_matrix.columns, 
                                    y = granger_pvalue_matrix.index, 
                                    colorscale = [[0, 'green'], [0.05, 'green'],
                                                  [0.05, 'red'], [1, 'red']
                                                  ]))
    fig.update_layout(title="Minimum of p_values for a maxlag of 12 of a Granger test")
    return fig.show()

In [None]:
granger_pvalue_matrix = granger_causality_tests(Lake_Bilancino.iloc[578:, 1:].dropna())

In [None]:
draw_granger_pvalue_matrix(granger_pvalue_matrix)

Variables with the suffix "_x" are predictors and those with the suffix "_y" are responses. In green, we can see p-values < 0.05. In red, p-values >= 0.05.

How to interpret this ?

For example, the p-value for "Flow_Rate_x causes Lake_Level_y" is 8.24 x 10^-53. So we can reject the null hypothesis and conclude that Flow Rate Granger-causes Lake Level, meaning that past Flow Rate values help to predict Lake Level.

Rainfall variables rarely show such relations among themselves : past rainfall in one place does not help to predict rainfall in another place.

<a id="section_two_3"></a>
#### How are the time series correlated ? (Pearson's test)

The last test checks the correlation between the time series.

The method is to convert each time series into percentage changes from one time step to the next and to measure their linear correlation with the Pearson correlation coefficient.
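As a small illustration of this method, the sketch below computes percentage changes and their Pearson correlation with NumPy only. The three toy series and the helper names are hypothetical. The rainfall-like series shows why infinite and undefined values have to be discarded : a percentage change starting from 0 is not defined.

```python
import numpy as np

def pct_change(values):
    """Relative change between consecutive values: (v[t] - v[t-1]) / v[t-1]."""
    values = np.asarray(values, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.diff(values) / values[:-1]

def pct_change_correlation(a, b):
    """Pearson correlation of the percentage changes, ignoring inf and NaN entries."""
    pa, pb = pct_change(a), pct_change(b)
    keep = np.isfinite(pa) & np.isfinite(pb)
    return float(np.corrcoef(pa[keep], pb[keep])[0, 1])

# Hypothetical toy series
level = np.array([250.0, 251.0, 249.5, 248.0, 250.5, 252.0])
flow = 0.01 * level                                # same relative moves, different scale
rain = np.array([0.0, 4.0, 0.0, 0.0, 2.0, 1.0])    # zeros produce inf or NaN changes

corr_level_flow = pct_change_correlation(level, flow)
n_finite_rain = int(np.isfinite(pct_change(rain)).sum())
print(corr_level_flow, n_finite_rain)
```

Working on percentage changes makes two series with very different scales comparable, which is the case here for `level` and `flow`.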

In [None]:
def draw_correlation_matrix(df, method = 'pearson'):
    pct_change_list = []
    for column in df.columns[1:]:
        pct_change_list.append(df[column].pct_change().replace([np.inf, -np.inf], np.nan).dropna())
    pct_change_array = np.zeros((len(df.columns[1:]), len(df.columns[1:])))
    for i, pct_change_i in enumerate(pct_change_list):
        for j, pct_change_j in enumerate(pct_change_list):
            pct_change_array[i, j] = pct_change_i.corr(pct_change_j, method)
    fig = px.imshow(pct_change_array,
                    x=df.columns[1:],
                    y=df.columns[1:]
                   )
    fig.update_xaxes(title = "Correlation matrix", side="top")
    return fig.show()
# Pass the full dataframe : the function itself skips the first column (Date)
draw_correlation_matrix(Lake_Bilancino, method = "pearson")

* We observe a strong correlation between the rainfall variables ;
* There is also a correlation between Lake Level and Flow Rate ;
* Temperature does not seem to be correlated with the other variables.

<a id="section_three"></a>
### LSTM model and feature importance (PyTorch)

Our goal is to explain whether the exogenous variables (temperature and rainfall variables) have an influence on the forecast.

I have trained two LSTM models : one with the exogenous variables and the other without them.




Here we normalize our data to get a better fit and we split it chronologically into three parts : 
- train dataset ;
- validation dataset (to detect overfitting) ;
- test dataset (to check how the model behaves on unseen data).

In [None]:
def scale_and_split(df):
    scaler = StandardScaler()
    transformer = scaler.fit_transform(df.iloc[:, 1:])
    df_scaled = pd.DataFrame(transformer, columns = df.columns[1:])
    train, val = train_test_split(df_scaled, test_size=0.33, shuffle = False)
    val, test = train_test_split(val, test_size=0.5, shuffle = False)
    return train, val, test, scaler

def prepare_inputs_outputs(train, val, test, seq_len, inputs, output):
    X_train, y_train = train[:len(train)-seq_len][inputs], train[[output]]
    X_val, y_val = val[:len(val)-seq_len][inputs], val[[output]]
    X_test, y_test = test[:len(test)-seq_len][inputs], test[[output]]
    return X_train, y_train, X_val, y_val, X_test, y_test
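
The chronological split can be checked on a toy array. The sketch below roughly mirrors the proportions of `scale_and_split` (about 67 % / 16.5 % / 16.5 %) with plain NumPy slicing. One difference is an assumption of this sketch and not what the cell above does : the standardization statistics are computed on the training part only. The function name and the toy series are hypothetical.

```python
import numpy as np

def chronological_split(values, holdout_fraction=0.33):
    """Split an array into train / validation / test parts without shuffling."""
    n = len(values)
    n_holdout = int(np.ceil(holdout_fraction * n))   # validation + test rows
    n_test = int(np.ceil(0.5 * n_holdout))           # last half of the holdout
    train = values[: n - n_holdout]
    val = values[n - n_holdout : n - n_test]
    test = values[n - n_test :]
    return train, val, test

series = np.arange(90, dtype=float)    # hypothetical toy series
train, val, test = chronological_split(series)

# Standardize with the statistics of the training part only (assumption of this sketch)
mean, std = train.mean(), train.std()
train_s, val_s, test_s = [(part - mean) / std for part in (train, val, test)]
print(len(train), len(val), len(test))
```

Because nothing is shuffled, every validation row comes after every training row, and every test row after every validation row.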

In [None]:
parameters = {'inputs':['Lake_Level', 'Flow_Rate'],
              'outputs':['Lake_Level', 'Flow_Rate'],
              'seq_len': 30,
              'batch_size_train':32,
              'batch_size_val':8,
              'batch_size_test':8,
              'lr':0.0005,
              'epochs':20
             }

train, val, test, scaler = scale_and_split(Lake_Bilancino.iloc[578:]) 
X_train, y_train, X_val, y_val, X_test, y_test = prepare_inputs_outputs(train, val, test, 
                                                                        parameters['seq_len'], 
                                                                        parameters['inputs'], 
                                                                        parameters['outputs'][0])
fig = go.Figure()
fig.add_trace(go.Scatter(x=X_train.index, y=X_train['Lake_Level'],
                    mode='lines',
                    name='train data'))
fig.add_trace(go.Scatter(x=X_val.index, y=X_val['Lake_Level'],
                    mode='lines',
                    name='validation data'))
fig.add_trace(go.Scatter(x=X_test.index, y=X_test['Lake_Level'],
                    mode='lines', 
                    name='test data'))
fig.update_layout(title = 'Lake level split into train, validation and test datasets (normalized)')

fig.show() 

The forecast is based on the 30 previous time steps (`seq_len = 30`), that is, the 30 previous days.

In [None]:
class TimeSeries(Dataset):
    """Sliding-window dataset : seq_len consecutive rows of X predict the next row of y."""
    def __init__(self, X, y, seq_len = 30):
        self.X = torch.tensor(np.array(X), dtype=torch.float32)
        self.y = torch.tensor(np.array(y), dtype=torch.float32)
        self.seq_len = seq_len
        
    def __getitem__(self, idx):
        # Window X[idx], ..., X[idx+seq_len-1] and the target that follows it
        return self.X[idx:idx+self.seq_len], self.y[idx+self.seq_len]

    def __len__(self):
        return len(self.X) - (self.seq_len-1)
    
    
def get_dataloader(X, y, batch_size):
    # shuffle=False keeps the chronological order of the windows
    return DataLoader(TimeSeries(X, y), shuffle=False, batch_size=batch_size, drop_last = True)
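
To see which window is paired with which target, here is a NumPy-only sketch of the same indexing on a toy series. The helper `make_windows` and the toy values are hypothetical. The inputs are trimmed in the same way as in `prepare_inputs_outputs`.

```python
import numpy as np

def make_windows(inputs, targets, seq_len):
    """Pair each window inputs[i:i+seq_len] with the target that follows it."""
    n_windows = len(inputs) - (seq_len - 1)
    X = np.stack([inputs[i:i + seq_len] for i in range(n_windows)])
    y = np.array([targets[i + seq_len] for i in range(n_windows)])
    return X, y

seq_len = 3
series = np.arange(10.0)                   # hypothetical toy series 0, 1, ..., 9
inputs = series[:len(series) - seq_len]    # same trimming as prepare_inputs_outputs
X, y = make_windows(inputs, series, seq_len)

for window, target in zip(X, y):
    print(window, "->", target)
```

With this indexing, the last `seq_len - 1` values of the series (8 and 9 here) are never used as targets.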

In [None]:
if torch.cuda.is_available():  
    device = "cuda:0" 
else:  
    device = "cpu"  

<a id="section_three_1"></a>
#### LSTM model

![Visualization of a LSTM unit](https://www.researchgate.net/profile/Xiaofeng_Yuan4/publication/331421650/figure/fig2/AS:771405641695233@1560928845927/The-structure-of-the-LSTM-unit.png)

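I am using an LSTM recurrent neural network for the forecasting. It is a non-linear model that predicts the next value from past values, including values that are far back in time.

At each time step $t$, the vector $x_t$ (one value per feature) is fed into the LSTM network, so a full input sequence has dimension *timesteps x features*.
Then the LSTM computes these values, where $\sigma_g$ is the sigmoid function, $\sigma_c$ and $\sigma_h$ are hyperbolic tangents and $\circ$ is the element-wise product : 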
<math>
\begin{align}
f_t &= \sigma_g(W_{f} x_t + U_{f} h_{t-1} + b_f) \\
i_t &= \sigma_g(W_{i} x_t + U_{i} h_{t-1} + b_i) \\
o_t &= \sigma_g(W_{o} x_t + U_{o} h_{t-1} + b_o) \\
\tilde{c}_t &= \sigma_c(W_{c} x_t + U_{c} h_{t-1} + b_c) \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tilde{c}_t \\
h_t &= o_t \circ \sigma_h(c_t)
\end{align}
</math>
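
The six equations above can be written almost line for line in NumPy. This is only a sketch of a single LSTM unit with random weights, not the `nn.LSTM` layer used later in this notebook. The sizes, the weight dictionaries `W`, `U`, `b` and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U and b are dicts keyed by gate name: 'f', 'i', 'o', 'c'."""
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])       # forget gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])       # input gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])       # output gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                           # new cell state
    h_t = o_t * np.tanh(c_t)                                     # new hidden state
    return h_t, c_t

n_features, n_hidden, seq_len = 3, 4, 5      # hypothetical toy sizes
rng = np.random.default_rng(0)
W = {g: rng.normal(scale=0.1, size=(n_hidden, n_features)) for g in "fioc"}
U = {g: rng.normal(scale=0.1, size=(n_hidden, n_hidden)) for g in "fioc"}
b = {g: np.zeros(n_hidden) for g in "fioc"}

# Run the unit over one input sequence, starting from zero states
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
sequence = rng.normal(size=(seq_len, n_features))
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h)
```

The hidden state `h` obtained after the last time step is what a linear layer would turn into the forecast.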


* Batch size : 32, which means that 32 input sequences are fed into the network before each backpropagation step.
* Epochs : between 20 and 40, depending on the data.
* Loss : MSE Loss (for regression), of which the square root (RMSE) is minimized during training
* Optimizer : Adam Optimizer

Advantages : 
* Non-linear model ;
* Many parameters ;
* LSTMs are less prone to vanishing gradients than plain RNNs ;
* Can keep valuable information from earlier time steps in memory.

Drawback : 
* It is a black box model : it is difficult to see why the model made a given prediction and which features it is based on.

<a id="section_three_2"></a>
#### Feature importance

Fortunately, I have found an article on [Medium](https://towardsdatascience.com/feature-importance-with-time-series-and-recurrent-neural-network-27346d500b9c) where the author describes a simple way to explain how neural networks make their predictions, called derivative importance. 

Given $f$ the function represented by our model, $y_t$ the output, $(x_{t-1}^1, ..., x_{t-p}^m)$ the input variables where $p = 30$ is the number of time steps and $m$ the number of input features, and $W$ the weights, the model can be written in the following way : 
$y_t = f(x_{t-1}^1, ..., x_{t-p}^m, W)$

Mathematically, the derivative measures the sensitivity of the value of the function $f$ to a change in its argument $x$. 

The intuition behind derivative importance is to compute the derivative of $f$ with respect to each of the model inputs : 

<math>
\begin{bmatrix}
x_{t-p}^1 & \cdots &  x_{t-p}^m\\
\vdots & \ddots & \vdots \\
 x_{t-1}^1 & \cdots &  x_{t-1}^m 
\end{bmatrix}
</math>

So for each input sequence fed into the LSTM network, you can compute the matrix :

<math>
\begin{bmatrix}
\dfrac{\partial y_t}{\partial x_{t-p}^1} & \cdots & \dfrac{\partial y_t}{\partial x_{t-p}^m}\\
\vdots & \ddots & \vdots \\
\dfrac{\partial y_t}{\partial x_{t-1}^1} & \cdots & \dfrac{\partial y_t}{\partial x_{t-1}^m} 
\end{bmatrix}
=
\begin{bmatrix}
\dfrac{\partial f}{\partial x_{t-p}^1} & \cdots & \dfrac{\partial f}{\partial x_{t-p}^m}\\
\vdots & \ddots & \vdots \\
\dfrac{\partial f}{\partial x_{t-1}^1} & \cdots & \dfrac{\partial f}{\partial x_{t-1}^m} 
\end{bmatrix}
</math>
    
It tells us how much each input value influences the output of our model.

Then, we take the mean over the time step dimension ($p$) and again the mean over all the input sequences that have been fed into the network to get the derivative importance of each feature, which we use as feature importance.
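
The averaging described above can be sketched without a neural network. In the example below, central finite differences stand in for the automatic differentiation that PyTorch provides, and `toy_model` is a hypothetical stand-in for the trained LSTM : a function that maps a window of shape ($p$, $m$) to a scalar. The weights `w` are chosen so that the expected ranking of the features is known in advance.

```python
import numpy as np

def toy_model(window, w):
    """Hypothetical stand-in for the network f: window has shape (p, m), output is a scalar."""
    return float(np.tanh(window @ w).mean())

def derivative_importance(model, windows, eps=1e-5):
    """Mean of d(output)/d(input) over windows and time steps, then absolute value per feature."""
    grads = np.zeros_like(windows)
    for n, window in enumerate(windows):
        for t in range(window.shape[0]):
            for j in range(window.shape[1]):
                up, down = window.copy(), window.copy()
                up[t, j] += eps
                down[t, j] -= eps
                # Central finite difference approximation of the partial derivative
                grads[n, t, j] = (model(up) - model(down)) / (2 * eps)
    return np.abs(grads.mean(axis=(0, 1)))

rng = np.random.default_rng(0)
w = np.array([2.0, -0.5, 0.0])                     # feature 0 matters most, feature 2 not at all
windows = rng.normal(scale=0.3, size=(8, 30, 3))   # 8 windows, p = 30 time steps, m = 3 features

importance = derivative_importance(lambda win: toy_model(win, w), windows)
print(importance)
```

Each importance value stays attached to its feature position, so the values must not be reordered before they are matched with the feature names.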




In [None]:
class MTS_RNN(nn.Module):
    def __init__(self, input_size=8, hidden_layer_size=20, output_size=2):
        super().__init__()
        self.hidden_layer_size = hidden_layer_size
        # batch_first=True : the DataLoader yields tensors of shape (batch, seq_len, features)
        self.lstm = nn.LSTM(input_size, hidden_layer_size, batch_first=True)
        self.linear = nn.Linear(hidden_layer_size, output_size)

    def forward(self, x):
        output, _ = self.lstm(x)
        # Keep only the hidden state of the last time step of each sequence
        predictions = self.linear(output[:, -1])
        return predictions

In [None]:
def train_model(model, train_loader, val_loader, loss_function, optimizer, epochs):
    """Train the model and return it with the mean RMSE per epoch on train and validation data."""
    train_val_history = {'train': [],
                        'val': []}
    for i in range(epochs):
        train_losses = []
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()

            y_pred = model(inputs)

            # loss_function is the MSE, so its square root is the RMSE
            single_loss = torch.sqrt(loss_function(y_pred, labels))

            single_loss.backward()

            optimizer.step()
            train_losses.append(single_loss.item())

        train_val_history['train'].append(np.mean(train_losses))

        val_losses = []
        with torch.no_grad():   # no gradient is needed for validation
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(device), labels.to(device)
                y_pred = model(inputs)
                val_losses.append(torch.sqrt(loss_function(y_pred, labels)).item())

        train_val_history['val'].append(np.mean(val_losses))

    return model, train_val_history



In [None]:

def draw_training_loss(train_val_history, epochs):
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=np.arange(epochs), y=np.array(train_val_history['train']),
                        mode='lines',
                        name = 'train'))
    fig.add_trace(go.Scatter(x=np.arange(epochs), y=np.array(train_val_history['val']),
                        mode='lines',
                        name = 'validation'))
    fig.update_layout(title="Training and validation loss",
                      xaxis_title="Epochs",
                      yaxis_title="RMSE loss",)

    return fig.show()


In [None]:
def predict_and_draw(df, test_loader, output, scaler, model, y_test, loss_function, draw = False):
    """Predict on the test set and compute the derivative importance of each input feature.
    loss_function is not used here : the gradient is taken on the prediction itself."""
    test_output = []
    test_output_grad = []
    for inputs, _ in test_loader:
        inputs = inputs.to(device)
        inputs.requires_grad = True
        y_pred = model(inputs)
        test_output.append(y_pred.detach().cpu())

        # Gradient of the predictions w.r.t. the inputs, averaged over the batch
        y_pred.sum().backward()
        test_output_grad.append(torch.mean(inputs.grad, axis = 0).cpu())

    # Mean over batches and time steps : one value per input feature.
    # The values are not sorted, so that they stay aligned with the column names.
    gradient_input_mean = np.absolute(torch.mean(torch.cat(test_output_grad), axis = 0).numpy())

    # The scaler was fitted on df.columns[1:], hence the shift of one position
    indice_outputs = [i - 1 for i, x in enumerate(df.columns.isin(output)) if x]
    mean = scaler.mean_[indice_outputs]
    std = scaler.scale_[indice_outputs]

    predictions = pd.DataFrame(torch.cat(test_output).numpy(), columns = output)

    # Inverse of the standardization : x = z * std + mean
    predictions = predictions * std + mean
    seq_len = test_loader.dataset.seq_len
    y = (y_test[seq_len:] * std + mean)[:len(predictions)]

    def draw_results(predictions, y, output):
        fig = make_subplots(rows=2, cols=1, 
                           subplot_titles=("Prediction on test dataset for " + output ,
                                           "Feature importance by derivative importance for " + output))
        fig.add_trace(go.Scatter(x=np.arange(len(predictions)), y=np.array(predictions[output]),
                            mode='lines',
                            name = 'predictions'), row = 1, col = 1)
        fig.add_trace(go.Scatter(x=np.arange(len(predictions)), y=np.array(y[output]),
                            mode='lines',
                            name = 'ground truth'), row = 1, col = 1)
        
        fig.add_trace(go.Bar(x=df.columns[1:], y = gradient_input_mean, 
                            name = 'gradient of lstm w.r.t. features'), row = 2, col = 1)

        return fig.show()

    if draw:
        draw_results(predictions, y, output[0])
 
    return y, predictions
        


In [None]:
def train_predict_and_draw(parameters, df, output):
    train, val, test, scaler = scale_and_split(df) 
    X_train, y_train, X_val, y_val, X_test, y_test = prepare_inputs_outputs(train, val, test, 
                                                                            parameters['seq_len'], 
                                                                            parameters['inputs'], 
                                                                            output)

    train_loader = get_dataloader(X_train, y_train, parameters['batch_size_train'])
    val_loader = get_dataloader(X_val, y_val, parameters['batch_size_val'])
    test_loader = get_dataloader(X_test, y_test, parameters['batch_size_test'])
    model = MTS_RNN(input_size=len(parameters['inputs']), hidden_layer_size=20, output_size=1).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=parameters['lr'])
    loss_function = nn.MSELoss()

    model_LSTM, train_val_history = train_model(model, 
                                                train_loader, 
                                                val_loader,
                                                loss_function, 
                                                optimizer, 
                                                parameters['epochs'])

    draw_training_loss(train_val_history, parameters['epochs'])

    y, predictions = predict_and_draw(df, test_loader, [output], scaler, model, y_test, loss_function, 
                                      draw = True)
    return y, predictions

<a id="section_three_3"></a>
#### Result for Lake Level

In [None]:
parameters = {'inputs':Lake_Bilancino.columns[1:],
              'outputs':['Lake_Level', 'Flow_Rate'],
              'seq_len': 30,
              'batch_size_train':32,
              'batch_size_val':8,
              'batch_size_test':8,
              'lr':0.002,
              'epochs':20
             }

y, predictions = train_predict_and_draw(parameters, Lake_Bilancino.iloc[578:], parameters['outputs'][0])

**The MSE of the model is :**

In [None]:
mean_squared_error(y[:len(predictions)], predictions)

**The RMSE of the model is :**

In [None]:
np.sqrt(mean_squared_error(y[:len(predictions)], predictions))

Both metrics are low, which suggests that the model generalizes well to the test period.

Overall, the LSTM model follows the ground truth well, but locally there are some spurious peaks.

According to this model, the Flow Rate but also the rainfall in Le Croci and Cavallina have an influence on the prediction. I have checked where these places are and they are located near the lake. The other places, like San Piero or Mangona, are farther away, which is why they have less influence on the prediction of the Lake Level.

<a id="section_three_4"></a>
#### Result for Flow Rate

In [None]:
parameters = {'inputs':Lake_Bilancino.columns[1:],
              'outputs':['Lake_Level', 'Flow_Rate'],
              'seq_len': 30,
              'batch_size_train':32,
              'batch_size_val':8,
              'batch_size_test':8,
              'lr':0.0001,
              'epochs':40
             }

y, predictions = train_predict_and_draw(parameters, Lake_Bilancino.iloc[578:], parameters['outputs'][1])

**The MSE of the model is :**

In [None]:
mean_squared_error(y[:len(predictions)], predictions)

**The RMSE of the model is :**

In [None]:
np.sqrt(mean_squared_error(y[:len(predictions)], predictions))

The metrics are higher than for the Lake Level. This is because of the many peaks : the model is able to predict when they occur but does not reach the correct value. I think more data would be needed for this.

There are many peaks in the Flow Rate, and the model makes good forecasts of where these peaks appear. More generally, the model follows the test dataset well.

According to this model, the farther a place is from the lake, the less influence its rainfall has on the Flow Rate. This makes sense : rain falling on distant places takes longer to reach the lake.

<a id="generalization"></a>
### Generalization to other datasets

<a id="auser"> </a>
#### Aquifer Auser

This waterbody consists of two subsystems, called NORTH and SOUTH, where the former partly influences the behavior of the latter. Indeed, the north subsystem is a water table (or unconfined) aquifer while the south subsystem is an artesian (or confined) groundwater.

The levels of the NORTH sector are represented by the values of the SAL, PAG, CoS and DIEC wells, while the levels of the SOUTH sector by the LT2 well.



In [None]:
Aquifer_Auser = pd.read_csv('../input/acea-water-prediction/Aquifer_Auser.csv')
Aquifer_Auser = Aquifer_Auser.interpolate(method = 'linear').fillna(0)


In [None]:
parameters = {'inputs':Aquifer_Auser.columns[1:],
              'outputs':['Depth_to_Groundwater_SAL', 'Depth_to_Groundwater_CoS', 
                         'Depth_to_Groundwater_LT2'],
              'seq_len': 30,
              'batch_size_train':32,
              'batch_size_val':8,
              'batch_size_test':8,
              'lr':0.0001,
              'epochs':40
             }
y, predictions = train_predict_and_draw(parameters, Aquifer_Auser, parameters['outputs'][0])

**The MSE for this model is :**

In [None]:
mean_squared_error(y[:len(predictions)], predictions)

**The RMSE for this model is :**

In [None]:
np.sqrt(mean_squared_error(y[:len(predictions)], predictions))

We can see that the prediction fits the general behaviour of the curve well, but there are many peaks in the prediction that drive the metrics up.


In [None]:
parameters = {'inputs':Aquifer_Auser.columns[1:],
              'outputs':['Depth_to_Groundwater_SAL', 'Depth_to_Groundwater_CoS', 
                         'Depth_to_Groundwater_LT2'],
              'seq_len': 30,
              'batch_size_train':32,
              'batch_size_val':8,
              'batch_size_test':8,
              'lr':0.001,
              'epochs':30
             }
y, predictions = train_predict_and_draw(parameters, Aquifer_Auser, parameters['outputs'][1])

**The MSE of the model is :**

In [None]:
mean_squared_error(y[:len(predictions)], predictions)

**The RMSE of the model is :**

In [None]:
np.sqrt(mean_squared_error(y[:len(predictions)], predictions))

In [None]:
parameters = {'inputs':Aquifer_Auser.columns[1:],
              'outputs':['Depth_to_Groundwater_SAL', 'Depth_to_Groundwater_CoS', 
                         'Depth_to_Groundwater_LT2'],
              'seq_len': 30,
              'batch_size_train':32,
              'batch_size_val':8,
              'batch_size_test':8,
              'lr':0.0001,
              'epochs':30
             }
y, predictions = train_predict_and_draw(parameters, Aquifer_Auser, parameters['outputs'][2])

**The MSE of the model is :**

In [None]:
mean_squared_error(y[:len(predictions)], predictions)

**The RMSE of the model is :**

In [None]:
np.sqrt(mean_squared_error(y[:len(predictions)], predictions))

In these three forecasts, the depth to groundwater is strongly influenced by hydrometry and volume. Rainfall has a lesser influence on its behaviour.



<a id="doganella"></a>
#### Aquifer Doganella


The wells field Doganella is fed by two underground aquifers not fed by rivers or lakes but fed by meteoric infiltration. The upper aquifer is a water table with a thickness of about 30m. The lower aquifer is a semi-confined artesian aquifer with a thickness of 50m and is located inside lavas and tufa products. These aquifers are accessed through wells called Well 1, ..., Well 9. Approximately 80% of the drainage volumes come from the artesian aquifer. The aquifer levels are influenced by the following parameters: rainfall, humidity, subsoil, temperatures and drainage volumes.

In [None]:
Aquifer_Doganella = pd.read_csv('../input/acea-water-prediction/Aquifer_Doganella.csv')
Aquifer_Doganella = Aquifer_Doganella.interpolate(method = 'linear').fillna(0)

In [None]:
parameters = {'inputs': Aquifer_Doganella.columns[1:],
              'outputs':['Depth_to_Groundwater_Pozzo_1', 
                         'Depth_to_Groundwater_Pozzo_2', 
                         'Depth_to_Groundwater_Pozzo_3',
                         'Depth_to_Groundwater_Pozzo_4', 
                         'Depth_to_Groundwater_Pozzo_5', 
                         'Depth_to_Groundwater_Pozzo_6',
                         'Depth_to_Groundwater_Pozzo_7', 
                         'Depth_to_Groundwater_Pozzo_8', 
                         'Depth_to_Groundwater_Pozzo_9'],
              'seq_len': 30,
              'batch_size_train':32,
              'batch_size_val':8,
              'batch_size_test':8,
              'lr':0.001,
              'epochs':30
             }

for output in parameters['outputs']:
    y, predictions = train_predict_and_draw(parameters, Aquifer_Doganella, output)

**The MSE of the last model (Pozzo 9) is :**

In [None]:
mean_squared_error(y[:len(predictions)], predictions)

**The RMSE of the last model (Pozzo 9) is :**

In [None]:
np.sqrt(mean_squared_error(y[:len(predictions)], predictions))

The model does not fit the ground truth well. I think there are too many variables for it to fit. But we can see that the 9 models predicting the depth to groundwater always rely on the same variables : the temperature and the volume have a huge impact on it.

<a id="luco"> </a>
#### Aquifer Luco

The Luco wells field is fed by an underground aquifer. This aquifer is not fed by rivers or lakes but by meteoric infiltration at the extremes of the impermeable sedimentary layers. It is accessed through wells called Well 1, Well 3 and Well 4 and is influenced by the following parameters: rainfall, depth to groundwater, temperature and drainage volumes.

In [None]:
Aquifer_Luco = pd.read_csv('../input/acea-water-prediction/Aquifer_Luco.csv')
Aquifer_Luco = Aquifer_Luco.interpolate(method = 'linear').fillna(0)

In [None]:
parameters = {'inputs': Aquifer_Luco.columns[1:],
              'outputs':['Depth_to_Groundwater_Podere_Casetta'],
              'seq_len': 30,
              'batch_size_train':1,
              'batch_size_val':8,
              'batch_size_test':8,
              'lr':0.001,
              'epochs':30
             }
y, predictions = train_predict_and_draw(parameters, Aquifer_Luco, 'Depth_to_Groundwater_Podere_Casetta')

**The MSE of the model is:**

In [None]:
mean_squared_error(y[:len(predictions)], predictions)

**The RMSE of the model is:**

In [None]:
# squared=False returns the square root of the MSE, i.e. the RMSE
mean_squared_error(y[:len(predictions)], predictions, squared = False)

The prediction follows the trend of the ground truth but, again, many peaks raise the error metrics.

According to the model, the depth to groundwater again depends mainly on volume and temperature, as we saw for Aquifer Doganella.
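To illustrate why a few missed peaks weigh so heavily on MSE and RMSE, here is a small sketch on made-up numbers (not taken from the Luco data). It compares the RMSE with the mean absolute error (MAE), a metric this notebook does not otherwise use, for a forecast that follows a flat trend but misses one peak.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: every error counts in proportion to its size."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: errors are squared first, so large ones dominate."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

# Toy ground truth: a flat series with one sudden peak (made-up values)
y_true = [10.0] * 9 + [20.0]
# Toy forecast that follows the trend but misses the peak
y_pred = [10.0] * 10

print("MAE :", mae(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
```

Nine of the ten points are predicted exactly, yet the single missed peak pushes the RMSE well above the MAE. This is the kind of effect the peaks have on the metrics reported here.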

<a id="petrignano"></a>
#### Aquifer Petrignano

The wells field of the alluvial plain between Ospedalicchio di Bastia Umbra and Petrignano is fed by three underground aquifers separated by low permeability septa. The aquifer can be considered a water table groundwater and is also fed by the Chiascio river. The groundwater levels are influenced by the following parameters: rainfall, depth to groundwater, temperatures and drainage volumes, level of the Chiascio river.

In [None]:
Aquifer_Petrignano = pd.read_csv('../input/acea-water-prediction/Aquifer_Petrignano.csv')
Aquifer_Petrignano = Aquifer_Petrignano.interpolate(method = 'linear').fillna(0)

In [None]:
parameters = {'inputs': Aquifer_Petrignano.columns[1:],
              'outputs':['Depth_to_Groundwater_P24', 'Depth_to_Groundwater_P25'],
              'seq_len': 30,
              'batch_size_train':1,
              'batch_size_val':8,
              'batch_size_test':8,
              'lr':0.001,
              'epochs':30
             }
for output in parameters['outputs']:
    y, predictions = train_predict_and_draw(parameters, Aquifer_Petrignano, output)
    

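**The MSE of the model is:**

In [None]:
# y and predictions come from the last output of the loop (Depth_to_Groundwater_P25)
mean_squared_error(y[:len(predictions)], predictions)

**The RMSE of the model is:**

In [None]:
# squared=False returns the square root of the MSE, i.e. the RMSE
mean_squared_error(y[:len(predictions)], predictions, squared = False)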

The model fits well with the ground truth. The error metrics are low.

According to the model, the depth to groundwater depends mainly on the hydrometric level of the Chiascio river and also on the volume. For this prediction, temperature is a less important variable than in the previous datasets.

<a id="arno"></a>
#### River Arno

The Arno is the second largest river in peninsular Italy and the main waterway in Tuscany. It has a relatively torrential regime, due to the nature of the surrounding soils (marl and impermeable clays). The Arno is the main source of water supply for the metropolitan area of Florence-Prato-Pistoia. The availability of water for this waterbody is evaluated by checking the hydrometric level of the river at the section of Nave di Rosano.

In [None]:
River_Arno = pd.read_csv('../input/acea-water-prediction/River_Arno.csv')
River_Arno = River_Arno.interpolate(method = 'linear').fillna(0)

In [None]:
parameters = {'inputs': River_Arno.columns[1:],
              'outputs':['Hydrometry_Nave_di_Rosano'],
              'seq_len': 30,
              'batch_size_train':1,
              'batch_size_val':8,
              'batch_size_test':8,
              'lr':0.001,
              'epochs':30
             }
y, predictions = train_predict_and_draw(parameters, River_Arno, 'Hydrometry_Nave_di_Rosano')

**The MSE of the model is:**

In [None]:
mean_squared_error(y[:len(predictions)], predictions)

**The RMSE of the model is:**

In [None]:
# squared=False returns the square root of the MSE, i.e. the RMSE
mean_squared_error(y[:len(predictions)], predictions, squared = False)

The model fits very well with the ground truth. The error metrics are very low.

According to the model, the hydrometric level depends on the volume but also on some of the rainfall variables. Some rainfall variables are more important than others.

<a id="amiata"></a>
#### Water Spring Amiata

The Amiata waterbody is composed of a volcanic aquifer not fed by rivers or lakes but fed by meteoric infiltration. This aquifer is accessed through Ermicciolo, Arbure, Bugnano and Galleria Alta water springs. The levels and volumes of the four sources are influenced by the parameters: rainfall, depth to groundwater, hydrometry, temperatures and drainage volumes.

In [None]:
Water_Spring_Amiata = pd.read_csv('../input/acea-water-prediction/Water_Spring_Amiata.csv')
Water_Spring_Amiata = Water_Spring_Amiata.interpolate(method = 'linear').fillna(0)

In [None]:
parameters = {'inputs': Water_Spring_Amiata.columns[1:],
              'outputs':['Flow_Rate_Bugnano', 'Flow_Rate_Arbure', 'Flow_Rate_Ermicciolo', 
                        'Flow_Rate_Galleria_Alta'],
              'seq_len': 30,
              'batch_size_train':1,
              'batch_size_val':8,
              'batch_size_test':8,
              'lr':0.001,
              'epochs':30
             }
for output in parameters['outputs']:
    y, predictions = train_predict_and_draw(parameters, Water_Spring_Amiata, output)

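**The MSE of the model is:**

In [None]:
# y and predictions come from the last output of the loop (Flow_Rate_Galleria_Alta)
mean_squared_error(y[:len(predictions)], predictions)

**The RMSE of the model is:**

In [None]:
# squared=False returns the square root of the MSE, i.e. the RMSE
mean_squared_error(y[:len(predictions)], predictions, squared = False)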

The model does not fit the ground truth well.

According to this model, the other flow rates depend on the Galleria Alta flow rate. Temperature is also a variable that helps to predict the flow rate.

<a id="lupa"></a>
#### Water Spring Lupa

In [None]:
Water_Spring_Lupa = pd.read_csv('../input/acea-water-prediction/Water_Spring_Lupa.csv')
Water_Spring_Lupa = Water_Spring_Lupa.interpolate(method = 'linear').fillna(0)

In [None]:
parameters = {'inputs': Water_Spring_Lupa.columns[1:],
              'outputs':['Flow_Rate_Lupa'],
              'seq_len': 30,
              'batch_size_train':1,
              'batch_size_val':8,
              'batch_size_test':8,
              'lr':0.001,
              'epochs':30
             }
for output in parameters['outputs']:
    y, predictions = train_predict_and_draw(parameters, Water_Spring_Lupa, output)

**The MSE of the model is:**

In [None]:
mean_squared_error(y[:len(predictions)], predictions)

**The RMSE of the model is:**

In [None]:
# squared=False returns the square root of the MSE, i.e. the RMSE
mean_squared_error(y[:len(predictions)], predictions, squared = False)

The model fits the ground truth well. It even predicts the final linear rise of the flow rate.

According to this model, the flow rate Lupa depends only on itself and not on Terni rainfall.
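When a series seems to depend mostly on its own past, a useful sanity check is to compare the model's RMSE with a naive persistence baseline that predicts each value as the previous one. This baseline is an addition for illustration and is not part of the original notebook. The sketch below uses made-up values, and `persistence_forecast` is a hypothetical helper name.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def persistence_forecast(series):
    """Naive baseline: the forecast for step t is the observed value at step t - 1."""
    return series[:-1]

# Toy flow-rate series (made-up values, not from the dataset)
flow = [100.0, 101.0, 102.5, 103.0, 104.0, 106.0]

targets = flow[1:]                      # values to predict
baseline = persistence_forecast(flow)   # yesterday's value as today's forecast

print("Persistence RMSE:", rmse(targets, baseline))
```

If the LSTM's RMSE on the test period is not clearly below this baseline, the model has learned little beyond copying the last observation.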

<a id="madonna"></a>
#### Water Spring Madonna di Canneto

In [None]:
Water_Spring_Madonna_di_Canneto = pd.read_csv('../input/acea-water-prediction/Water_Spring_Madonna_di_Canneto.csv')
Water_Spring_Madonna_di_Canneto = Water_Spring_Madonna_di_Canneto.interpolate(method = 'linear').fillna(0)

In [None]:
parameters = {'inputs': Water_Spring_Madonna_di_Canneto.columns[1:],
              'outputs':['Flow_Rate_Madonna_di_Canneto'],
              'seq_len': 30,
              'batch_size_train':1,
              'batch_size_val':8,
              'batch_size_test':8,
              'lr':0.001,
              'epochs':30
             }
for output in parameters['outputs']:
    y, predictions = train_predict_and_draw(parameters, Water_Spring_Madonna_di_Canneto, output)

**The MSE of the model is:**

In [None]:
mean_squared_error(y[:len(predictions)], predictions)

**The RMSE of the model is:**

In [None]:
# squared=False returns the square root of the MSE, i.e. the RMSE
mean_squared_error(y[:len(predictions)], predictions, squared = False)

The model fits the ground truth very well, but there are still some peaks that raise the error metrics.

According to this model, the flow rate depends mainly on itself but also a little bit on temperature.

<a id="conclusion"> </a>
### Conclusion

The LSTM is a good model for most of these time series, although it struggled on Aquifer Doganella and Water Spring Amiata. Some of the series have peaks and trends, which can be difficult to forecast, but the model generally detects and forecasts them.
Derivative importance gives us good information about how the LSTM makes its forecasts from the inputs. This technique is very useful with neural networks, which are black boxes: without it, we do not know how the network makes its predictions.
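As a reminder of the idea behind derivative importance, here is a minimal sketch that scores each input by the absolute derivative of the output with respect to that input. It is not necessarily how this notebook computes it: the sketch uses central finite differences on `toy_model`, a hypothetical stand-in for a trained network, whereas with PyTorch the gradients would usually come from autograd. The feature names and coefficients are made up.

```python
def toy_model(x):
    """Hypothetical stand-in for a trained network: a fixed nonlinear function."""
    rainfall, temperature, volume = x
    return 0.1 * rainfall + 0.5 * temperature ** 2 - 2.0 * volume

def derivative_importance(model, x, eps=1e-6):
    """Approximate |d model / d x_i| at point x with central differences."""
    importances = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        importances.append(abs(model(up) - model(down)) / (2 * eps))
    return importances

names = ['rainfall', 'temperature', 'volume']
x = [1.0, 3.0, 2.0]  # point at which the sensitivities are evaluated

scores = derivative_importance(toy_model, x)

# Rank the inputs from most to least influential at this point
for name, score in sorted(zip(names, scores), key=lambda pair: -pair[1]):
    print(f"{name}: {score:.3f}")
```

The scores are local: they describe how sensitive the output is around one input point, so in practice they are usually averaged over many samples before being compared.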

Thanks for reading this notebook. I hope you enjoyed it and maybe learned something new about water behaviour in nature!
