# Smart Water Analytics

In this competition, Acea Group focuses on the water sector, aiming to preserve water bodies by forecasting their water level. Each dataset covers a different kind of water body with its own behavior and characteristics. The available datasets are:

1. Aquifer (Auser, Doganella, Luco, Petrignano)
2. Water Spring (Amiata, Lupa, Madonna di Canneto)
3. River (Arno)
4. Lake (Bilancino)

The models' predictive power will be evaluated with both Mean Absolute Error (MAE) and Mean Squared Error (MSE).

# AQUIFER

## 1. AUSER

### Exploratory Data Analysis

#### Importing the libraries

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

In [None]:
!pip install statsmodels --upgrade

In [None]:
auser_data= pd.read_csv("../input/acea-water-prediction/Aquifer_Auser.csv",parse_dates=True)
auser_data

Determine the shape or structure of the data frame.

In [None]:
print('Shape: ', auser_data.shape)

Different datatypes present in our dataset.

In [None]:
auser_data.info()

The Date column has data type 'object', which needs to be changed to 'datetime'.

In [None]:
auser_data['Date'] = pd.to_datetime(auser_data.Date, format = '%d/%m/%Y')

auser_data.dtypes
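
As a small illustration of why the explicit format string matters, the sketch below parses two day-first date strings. The sample values are made up for the example and are not read from the dataset.

```python
import pandas as pd

# Hypothetical sample in the same day/month/year layout as the Date column
raw = pd.Series(["05/03/1998", "06/03/1998"])

# An explicit format removes any day/month ambiguity
parsed = pd.to_datetime(raw, format="%d/%m/%Y")

print(parsed.dt.month.tolist())  # [3, 3]
print(parsed.dt.day.tolist())    # [5, 6]
```

Without the format, a string such as 05/03/1998 could be read as May 3rd instead of March 5th.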

Our Output Features/ Dependent variables are:

1. Depth_to_Groundwater_SAL
2. Depth_to_Groundwater_LT2
3. Depth_to_Groundwater_CoS

A time series is a series of data points indexed (or listed or graphed) in time order, so it is a sequence of discrete-time data. We want the 'Date' column as our index, but reading the CSV does not set it automatically, so we set it explicitly.

In [None]:
auser_data= auser_data.set_index('Date')

Here we plot the output variables of the Auser aquifer. As we can observe, a large amount of data is missing from the initial years.

In [None]:
df_auser= auser_data[["Depth_to_Groundwater_SAL","Depth_to_Groundwater_LT2","Depth_to_Groundwater_CoS"]]
sns.set(style="whitegrid")
plt.figure(figsize=(10,7))
sns.color_palette("husl", 9)
#sns.lineplot(data=df_auser)
df_auser.plot(linewidth=2, fontsize=12)

#### Handling Missing Values

In [None]:
print("The percentage of missing values in dataset")
((auser_data.isnull() | auser_data.isna()).sum() * 100 / auser_data.index.size).round(2)

After looking at the percentage of missing values in each column, we will remove the columns with more than 50% missing values because they might affect the performance of our model.

So we will remove *Depth_to_Groundwater_DIEC* and *Depth_to_Groundwater_PAG*


In [None]:
auser_data= auser_data.drop(columns=['Depth_to_Groundwater_DIEC','Depth_to_Groundwater_PAG'])
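
The 50% rule can also be applied programmatically instead of naming the columns by hand. The following is a minimal sketch on a toy frame; the column names `a` and `b` and their values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): "a" is 25% missing, "b" is 75% missing
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
})

# Percentage of missing values per column
missing_pct = toy.isna().mean() * 100

# Columns above the 50% cut-off are dropped, the rest are kept
cols_to_drop = missing_pct[missing_pct > 50].index.tolist()
reduced = toy.drop(columns=cols_to_drop)

print(cols_to_drop)  # ['b']
```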

Now we will remove the missing values from the data.

We require each row to have at least 22 non-NaN values out of the total of 27 features. This threshold keeps the largest number of records with minimal loss of data. We adjusted the threshold separately for each of the 9 waterbody datasets.

In [None]:
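# Keep rows with at least 22 non-NaN values.
# thresh alone is enough; newer pandas rejects how= combined with thresh=.
auser_data = auser_data.dropna(axis=0, thresh=22)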

We set the threshold to 22 because we want to remove rows dominated by NaN values with minimum loss of data.
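
To make the meaning of the threshold concrete, here is a small sketch on a toy frame with three columns. The data is invented, and `thresh=2` stands in for the value of 22 used on the real dataset.

```python
import numpy as np
import pandas as pd

# Rows hold 3, 2 and 1 non-NaN values respectively (made-up data)
toy = pd.DataFrame({
    "x": [1.0, np.nan, np.nan],
    "y": [2.0, 5.0, np.nan],
    "z": [3.0, 6.0, 7.0],
})

# thresh=2 keeps a row only if it has at least 2 non-NaN values
kept = toy.dropna(axis=0, thresh=2)

print(kept.index.tolist())  # [0, 1]
```

Raising the threshold removes more rows, while lowering it leaves more gaps to fill later.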

Now we will interpolate the missing data.

pandas Series and DataFrame objects provide an interpolate() function to fill in missing values, with a selection of simple and more complex interpolation methods. We are using linear interpolation, which draws a straight line between the nearest available observations on either side of a gap and fills in the missing values from this line.

In [None]:
auser_data = auser_data.interpolate(method = 'linear')

In [None]:
auser_data = auser_data.apply(lambda x: x.fillna(x.mean()),axis=0)
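
The two filling steps complement each other, which the following sketch shows on a made-up series. With its default settings, linear interpolation fills gaps between observations but leaves a gap at the very start of a column untouched, and the mean fill takes care of that remainder.

```python
import numpy as np
import pandas as pd

# Made-up series with a leading gap and an interior gap
s = pd.Series([np.nan, 10.0, np.nan, np.nan, 16.0])

# Step 1: linear interpolation fills the interior gap only
interp = s.interpolate(method="linear")

# Step 2: whatever is still missing is replaced by the column mean
filled = interp.fillna(interp.mean())

print(interp.tolist())  # [nan, 10.0, 12.0, 14.0, 16.0]
print(filled.tolist())  # [13.0, 10.0, 12.0, 14.0, 16.0]
```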

In [None]:
print("The percentage of missing values in dataset")
((auser_data.isnull() | auser_data.isna()).sum() * 100 / auser_data.index.size).round(2)

In [None]:
auser_plot= auser_data[["Depth_to_Groundwater_SAL","Depth_to_Groundwater_LT2","Depth_to_Groundwater_CoS"]]
sns.set(style="whitegrid")
plt.figure(figsize=(10,7))
sns.color_palette("husl", 9)
#sns.lineplot(data=plot, legend="full", err_style="bars")
auser_plot.plot(linewidth=1, fontsize=12)

In [None]:
fig, axs = plt.subplots(5, 2,figsize=(15,11))
axs[0, 0].plot(auser_data[["Rainfall_Monte_Serra"]])
axs[0, 0].set_title('Rainfall_Monte_Serra')
axs[0, 1].plot(auser_data[["Rainfall_Piaggione"]])
axs[0, 1].set_title('Rainfall_Piaggione')
axs[1, 0].plot(auser_data[["Rainfall_Gallicano"]])
axs[1, 0].set_title('Rainfall_Gallicano')
axs[1, 1].plot(auser_data[["Rainfall_Pontetetto"]])
axs[1, 1].set_title('Rainfall_Pontetetto')
axs[2, 0].plot(auser_data[["Rainfall_Orentano"]])
axs[2, 0].set_title('Rainfall_Orentano')
axs[2, 1].plot(auser_data[["Rainfall_Borgo_a_Mozzano"]])
axs[2, 1].set_title('Rainfall_Borgo_a_Mozzano')
axs[3, 0].plot(auser_data[["Rainfall_Calavorno"]])
axs[3, 0].set_title('Rainfall_Calavorno')
axs[3, 1].plot(auser_data[["Rainfall_Croce_Arcana"]])
axs[3, 1].set_title('Rainfall_Croce_Arcana')
axs[4, 0].plot(auser_data[["Rainfall_Tereglio_Coreglia_Antelminelli"]])
axs[4, 0].set_title('Rainfall_Tereglio_Coreglia_Antelminelli')
axs[4, 1].plot(auser_data[["Rainfall_Fabbriche_di_Vallico"]])
axs[4, 1].set_title('Rainfall_Fabbriche_di_Vallico')

for ax in axs.flat:
    ax.set(xlabel='Date', ylabel='Rainfall(mm)')

# Hide x labels and tick labels for top plots and y ticks for right plots.
for ax in axs.flat:
    ax.label_outer()

Now we will plot the correlation matrix for all the features/independent variables.

In [None]:
sns.set_theme(style="white")

# Compute the correlation matrix
corr = auser_data.corr(method="pearson")

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
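
The mask hides the upper triangle and the diagonal, so each pair of features appears only once in the heatmap. A minimal sketch of what the mask selects, using an invented 3x3 correlation matrix:

```python
import numpy as np

# Made-up symmetric correlation matrix for three features
corr = np.array([
    [1.0, 0.2, 0.5],
    [0.2, 1.0, 0.1],
    [0.5, 0.1, 1.0],
])

# True marks the cells that the heatmap will hide
mask = np.triu(np.ones_like(corr, dtype=bool))

# The cells that remain visible are the strictly lower triangle
visible = corr[~mask]

print(visible.tolist())  # [0.2, 0.5, 0.1]
```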

### Checking The Stationarity

To check the stationarity of the time series (i.e. identify whether each series is stationary or not), we perform the Augmented Dickey-Fuller (ADF) test.

For the ADF test:

1. Null Hypothesis - Series possesses a unit root and hence is not stationary.
2. Alternate Hypothesis - Series is stationary
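
For reference, the hypotheses above can be written in terms of the regression that the test fits. The form below assumes a constant-only specification, which is the default of `adfuller` in statsmodels; the lag order $p$ is chosen by the library.

$$
\Delta y_t = \alpha + \gamma\, y_{t-1} + \sum_{i=1}^{p} \delta_i\, \Delta y_{t-i} + \varepsilon_t
$$

$$
H_0:\ \gamma = 0 \quad \text{(unit root, not stationary)} \qquad H_1:\ \gamma < 0 \quad \text{(stationary)}
$$

The p-value of this test is the second element of the returned tuple, and a value of 0.05 or less leads us to reject $H_0$ and treat the series as stationary.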

In [None]:
from statsmodels.tsa.stattools import adfuller
print("AUGMENTED DICKEY FULLER TEST \n\n")
for i in range(len(auser_data.columns)):
  result = adfuller(auser_data[auser_data.columns[i]])

  if result[1] > 0.05 :
    print('{} - Series is NOT Stationary'.format(auser_data.columns[i]))
    print(" ")
  else:
    print('{} - Series is Stationary'.format(auser_data.columns[i]))
    print(" ")

As we can see from the test, some features are non-stationary, so we will now remove the non-stationarity by differencing.


In [None]:
auser_data.dropna().plot()

Time series datasets may contain trends and seasonality, which may need to be removed prior to modeling. Differencing is a popular and widely used data transform for making time series data stationary. We will now remove the non-stationarity using first-order differencing.

In [None]:
auser_data=auser_data-auser_data.shift(1)
auser_data.dropna().plot()
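
First-order differencing replaces each value with its change from the previous time step. The sketch below, on a made-up series, shows the transform, its pandas built-in equivalent, and how the original levels can be recovered from the differences.

```python
import numpy as np
import pandas as pd

# Made-up series with an upward trend
level = pd.Series([10.0, 12.0, 15.0, 19.0, 24.0])

# Value at t minus value at t-1; the first entry has no predecessor
diff_manual = level - level.shift(1)

# pandas offers the same transform as a built-in
diff_builtin = level.diff()

# Levels can be rebuilt from the first value plus the cumulative differences
restored = level.iloc[0] + diff_manual.fillna(0).cumsum()

print(diff_manual.tolist())  # [nan, 2.0, 3.0, 4.0, 5.0]
print(restored.tolist())     # [10.0, 12.0, 15.0, 19.0, 24.0]
```

Because the whole frame is differenced, the target columns are differenced too, so the model later predicts day-to-day changes in depth rather than the depth itself.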


In [None]:
auser_data.head()

We will remove the first row because differencing leaves it with only missing values (NaN).

In [None]:
auser_data = auser_data.iloc[1:]

#### Confirming Stationarity

Now we will run the ADF test on the differenced data to confirm whether all the time series are stationary.


In [None]:
from statsmodels.tsa.stattools import adfuller
print("AUGMENTED DICKEY FULLER TEST \n\n")
for i in range(len(auser_data.columns)):
  result = adfuller(auser_data[auser_data.columns[i]])

  if result[1] > 0.05 :
    print('{} - Series is NOT Stationary'.format(auser_data.columns[i]))
    print(" ")
  else:
    print('{} - Series is Stationary'.format(auser_data.columns[i]))
    print(" ")


Since all the time series are now stationary, we can proceed with the model building section.

### Building Predictive Models

#### Split Data into Training and Testing set

In [None]:
temp=auser_data
temp= temp.drop(columns=["Depth_to_Groundwater_SAL","Depth_to_Groundwater_LT2","Depth_to_Groundwater_CoS"])

df= auser_data[["Depth_to_Groundwater_SAL","Depth_to_Groundwater_LT2","Depth_to_Groundwater_CoS"]]
new= pd.merge(temp, df, left_index=True, right_index=True)

# Update the main dataframe i.e. auser_data
auser_data=new
auser_data.head()

In [None]:
auser_data.shape

Now we will split the data into X and Y matrix for further building the models.

In [None]:
X, Y = np.split(auser_data,[-3],axis=1)

In [None]:
print("Shape of X", X.shape)
print("Shape of Y", Y.shape)
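
The same split can be expressed with positional slicing, which makes it explicit that the last columns are the targets. This is a sketch on a toy frame; the column names and the name `n_targets` are made up for the example.

```python
import pandas as pd

# Toy frame with two feature columns followed by two target columns
toy = pd.DataFrame({
    "rain": [0.0, 1.2],
    "temp": [11.0, 12.5],
    "target_a": [-5.1, -5.0],
    "target_b": [-7.3, -7.2],
})

n_targets = 2  # number of trailing columns to predict

# Everything before the last n_targets columns is a feature
X_toy = toy.iloc[:, :-n_targets]
Y_toy = toy.iloc[:, -n_targets:]

print(list(X_toy.columns))  # ['rain', 'temp']
print(list(Y_toy.columns))  # ['target_a', 'target_b']
```

This only works because the target columns were moved to the end of the frame in the earlier cell.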

Now we will split the data in Training set and Testing set.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 12)

In [None]:
print("Shape of X_train", X_train.shape)
print("Shape of Y_train", Y_train.shape)
print("Shape of X_test", X_test.shape)
print("Shape of Y_test", Y_test.shape)

We will use StandardScaler for scaling the features of our dataset.

In [None]:
# Feature Scaling
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
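
The scaler is fitted on the training set only, and the test set is transformed with the training statistics. A minimal NumPy sketch of the same arithmetic on made-up numbers:

```python
import numpy as np

# Made-up single-feature data
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

# Statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Both sets are shifted and scaled with the training statistics
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The scaled training column has zero mean and unit variance, while the test value may fall outside that range, which is expected.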

In [None]:
X.describe()

#### 1. LSTM Model.

In [None]:
# Import necessary libraries and packages from Keras for building model
import tensorflow as tf
from keras.callbacks import EarlyStopping
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from keras.optimizers import Adam

In [None]:
# reshape input to be 3D [samples, timesteps, features]
X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))
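
The reshape only inserts a timesteps axis of length 1 and leaves the values unchanged. A quick sketch with a made-up array of 4 samples and 3 features:

```python
import numpy as np

# 4 samples with 3 features each (made-up numbers)
flat = np.arange(12, dtype=float).reshape(4, 3)

# Add a timesteps axis of length 1: [samples, timesteps, features]
seq = flat.reshape((flat.shape[0], 1, flat.shape[1]))

print(seq.shape)  # (4, 1, 3)
```

With a single timestep, each sample is presented to the LSTM as a sequence of length one, so the network sees every row in isolation.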


Now we will build the architecture of the LSTM RNN

In [None]:
# Initialize the Neural Network based on LSTM RNN
model = Sequential()

# Add 1st LSTM RNN layer
model.add(LSTM(units=64, return_sequences=True, input_shape=(1, 21)))

# Adding 2nd LSTM layer
model.add(LSTM(units=32, return_sequences=True))

# Adding 3rd LSTM layer
model.add(LSTM(units=16, return_sequences=False))

# Adding Dropout
model.add(Dropout(0.25))

# Output layer
model.add(Dense(units=3, activation='linear'))

# Compiling the Neural Network
model.compile(optimizer = Adam(learning_rate=0.01), loss='mean_squared_error')
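
To get a feel for the size of this network, we can count its trainable parameters by hand. The sketch assumes the default Keras LSTM layer (four gates, with bias); the helper names `lstm_params` and `dense_params` are made up for this example.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # One weight per input-output pair plus one bias per output
    return input_dim * units + units


# Layer sizes taken from the architecture above; Dropout has no parameters
layers = [
    lstm_params(21, 64),   # 1st LSTM layer, 21 input features
    lstm_params(64, 32),   # 2nd LSTM layer
    lstm_params(32, 16),   # 3rd LSTM layer
    dense_params(16, 3),   # output layer, 3 targets
]
total = sum(layers)

print(layers)  # [22016, 12416, 3136, 51]
print(total)   # 37619
```

The total can be compared against the output of `model.summary()`.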

In [None]:
history = model.fit(X_train, Y_train, shuffle=True, epochs=250, validation_split=0.2, verbose=1, batch_size=256)

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()

In [None]:
model.save_weights("Auser_M1.h5")

In [None]:
train_pred = model.predict(X_train)

In [None]:
print("Predicted on Training Data: ",train_pred)
print("Actual Train Data: ",Y_train)

Now we will predict on Test data

In [None]:
test_pred = model.predict(X_test)

In [None]:
print("Predicted on Test Data: ",test_pred)
print("Actual Test Data: ",Y_test)

#### Model Evaluation

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

train_mse = mean_squared_error(Y_train, train_pred)
print('Train MSE: %.3f' % train_mse)

test_mse = mean_squared_error(Y_test, test_pred)
print('Test MSE: %.3f' % test_mse)

train_mae = mean_absolute_error(Y_train, train_pred)
print('Train MAE: %.3f' % train_mae)

test_mae = mean_absolute_error(Y_test, test_pred)
print('Test MAE: %.3f' % test_mae)
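
Both metrics are simple averages of the prediction errors. The sketch below computes them by hand for a made-up case with two targets. With the default settings of the scikit-learn functions, the per-target scores are averaged uniformly, which gives the same number as averaging over all entries.

```python
import numpy as np

# Made-up actual and predicted values for two samples and two targets
y_true = np.array([[0.0, 1.0], [2.0, 3.0]])
y_pred = np.array([[0.5, 1.0], [2.0, 1.0]])

err = y_true - y_pred

# MSE: mean of squared errors; MAE: mean of absolute errors
mse = np.mean(err ** 2)
mae = np.mean(np.abs(err))

print(mse)  # 1.0625
print(mae)  # 0.625
```

MSE penalizes the single large error of 2.0 much more heavily than MAE does.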

Thus, the final scores for Aquifer_AUSER are MSE = 0.012 and MAE = 0.037.

## DOGANELLA

### Exploratory Data Analysis

#### Importing The Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import warnings
import gc
import missingno as mn
import datetime
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
%matplotlib inline

In [None]:
doganella = pd.read_csv('../input/acea-water-prediction/Aquifer_Doganella.csv',parse_dates=True)
doganella

Determine the shape or structure of the data frame.

In [None]:
print('Shape: ', doganella.shape)

#### Handling Missing Values

In [None]:
print("The percentage of missing values in dataset")
((doganella.isnull() | doganella.isna()).sum() * 100 / doganella.index.size).round(2)


In [None]:
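# Keep rows with at least 7 non-NaN values.
# thresh alone is enough; newer pandas rejects how= combined with thresh=.
doganella = doganella.dropna(axis=0, thresh=7)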
doganella

In [None]:
print("The percentage of missing values in dataset")
((doganella.isnull() | doganella.isna()).sum() * 100 / doganella.index.size).round(2)

In [None]:
doganella.shape

In [None]:
doganella_plot = doganella[['Depth_to_Groundwater_Pozzo_1','Depth_to_Groundwater_Pozzo_2', 'Depth_to_Groundwater_Pozzo_3', 'Depth_to_Groundwater_Pozzo_4', 'Depth_to_Groundwater_Pozzo_5', 'Depth_to_Groundwater_Pozzo_6', 'Depth_to_Groundwater_Pozzo_7', 'Depth_to_Groundwater_Pozzo_8', 'Depth_to_Groundwater_Pozzo_9']]
sns.set(style="whitegrid")
plt.figure(figsize=(35,7))
sns.color_palette("husl", 9)
sns.lineplot(data=doganella_plot)


In [None]:
doganella = doganella.set_index('Date')
doganella.index = pd.to_datetime(doganella.index)

In [None]:
doganella = doganella.interpolate(method = 'time')
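
Time-based interpolation differs from plain linear interpolation when the dates are not evenly spaced, which can happen after rows have been dropped. A small sketch on a made-up series with a three-day jump in the index:

```python
import numpy as np
import pandas as pd

# Made-up series: the gap sits one day after the first point
# and three days before the last one
idx = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05"])
s = pd.Series([0.0, np.nan, 8.0], index=idx)

# 'linear' ignores the index and treats the rows as equally spaced
by_position = s.interpolate(method="linear")

# 'time' weights the fill by the actual distance between dates
by_time = s.interpolate(method="time")

print(by_position.iloc[1])  # 4.0
print(by_time.iloc[1])      # 2.0
```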

In [None]:
doganella = doganella.apply(lambda x: x.fillna(x.mean()),axis=0)

In [None]:
print("The percentage of missing values in dataset")
((doganella.isnull() | doganella.isna()).sum() * 100 / doganella.index.size).round(2)

### Checking The Stationarity

To check the stationarity of the time series (i.e. identify whether each series is stationary or not), we perform the Augmented Dickey-Fuller (ADF) test.

For the ADF test:

1. Null Hypothesis - Series possesses a unit root and hence is not stationary.
2. Alternate Hypothesis - Series is stationary

In [None]:
from statsmodels.tsa.stattools import adfuller
print("AUGMENTED DICKEY FULLER TEST \n\n")
for i in range(len(doganella.columns)):
  result = adfuller(doganella[doganella.columns[i]])

  if result[1] > 0.05 :
    print('{} - Series is NOT Stationary'.format(doganella.columns[i]))
    print(" ")
  else:
    print('{} - Series is Stationary'.format(doganella.columns[i]))
    print(" ")

Now we will remove the non-stationarity present in some features by differencing.

In [None]:
doganella.dropna().plot()

In [None]:
doganella=doganella-doganella.shift(1)
doganella.dropna().plot()

In [None]:
doganella.head()

We will remove the first row because differencing leaves it with only missing values (NaN).

In [None]:
doganella = doganella.iloc[1:]

#### Confirming Stationarity

Now we will run the ADF test on the differenced data to confirm whether all the time series are stationary.

In [None]:
from statsmodels.tsa.stattools import adfuller
print("AUGMENTED DICKEY FULLER TEST \n\n")
for i in range(len(doganella.columns)):
  result = adfuller(doganella[doganella.columns[i]])

  if result[1] > 0.05 :
    print('{} - Series is NOT Stationary'.format(doganella.columns[i]))
    print(" ")
  else:
    print('{} - Series is Stationary'.format(doganella.columns[i]))
    print(" ")

Since all the time series are now stationary, we can proceed with the model building section.

### Building Predictive Models

#### Split Data into Training and Testing set

In [None]:
temp=doganella
temp= temp.drop(columns=['Depth_to_Groundwater_Pozzo_1','Depth_to_Groundwater_Pozzo_2', 'Depth_to_Groundwater_Pozzo_3', 'Depth_to_Groundwater_Pozzo_4', 'Depth_to_Groundwater_Pozzo_5', 'Depth_to_Groundwater_Pozzo_6', 'Depth_to_Groundwater_Pozzo_7', 'Depth_to_Groundwater_Pozzo_8', 'Depth_to_Groundwater_Pozzo_9'])

df= doganella[['Depth_to_Groundwater_Pozzo_1','Depth_to_Groundwater_Pozzo_2', 'Depth_to_Groundwater_Pozzo_3', 'Depth_to_Groundwater_Pozzo_4', 'Depth_to_Groundwater_Pozzo_5', 'Depth_to_Groundwater_Pozzo_6', 'Depth_to_Groundwater_Pozzo_7', 'Depth_to_Groundwater_Pozzo_8', 'Depth_to_Groundwater_Pozzo_9']]
new= pd.merge(temp, df, left_index=True, right_index=True)

# Update the main dataframe i.e. doganella
doganella=new
doganella.head()

In [None]:
doganella.shape

Now we will split the data into X and Y matrix for further building the models.

In [None]:
X, Y = np.split(doganella,[-9],axis=1)

In [None]:
print("Shape of X", X.shape)
print("Shape of Y", Y.shape)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 12)

In [None]:
print("Shape of X_train", X_train.shape)
print("Shape of Y_train", Y_train.shape)
print("Shape of X_test", X_test.shape)
print("Shape of Y_test", Y_test.shape)

We will use StandardScaler for scaling the features of our dataset.

In [None]:
# Feature Scaling
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
X.describe()

#### 1. LSTM Model.

In [None]:
# Import necessary libraries and packages from Keras for building model
import tensorflow as tf
from keras.callbacks import EarlyStopping
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from keras.optimizers import Adam

In [None]:
# reshape input to be 3D [samples, timesteps, features]
X_Train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
X_Test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))

In [None]:
# Initialize the Neural Network based on LSTM RNN
model = Sequential()

# Add 1st LSTM RNN layer
model.add(LSTM(units=32, return_sequences=False, input_shape=(1,12)))


# Output layer
model.add(Dense(units=9, activation='linear'))

# Compiling the Neural Network
model.compile(optimizer = Adam(learning_rate=0.01), loss='mean_squared_error')

In [None]:
history = model.fit(X_Train, Y_train, shuffle=True, epochs=50, validation_split=0.1, verbose=1, batch_size=64)

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()

In [None]:
model.save_weights("Doganella_M1.h5")

In [None]:
train_pred = model.predict(X_Train)

In [None]:
print("Predicted on Training Data: ",train_pred)
print("Actual Train Data: ",Y_train)

In [None]:
test_pred = model.predict(X_Test)

In [None]:
print("Predicted on Test Data: ",test_pred)
print("Actual Test Data: ",Y_test)

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

train_mse = mean_squared_error(Y_train, train_pred)
print('Train MSE: %.3f' % train_mse)

test_mse = mean_squared_error(Y_test, test_pred)
print('Test MSE: %.3f' % test_mse)

train_mae = mean_absolute_error(Y_train, train_pred)
print('Train MAE: %.3f' % train_mae)

test_mae = mean_absolute_error(Y_test, test_pred)
print('Test MAE: %.3f' % test_mae)

Thus, the final scores for Aquifer_DOGANELLA are MSE = 0.513 and MAE = 0.193.

## LUCO

### Exploratory Data Analysis

#### Importing The Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import warnings
import gc
import missingno as mn
import datetime
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
%matplotlib inline

In [None]:
luco = pd.read_csv('../input/acea-water-prediction/Aquifer_Luco.csv',parse_dates=True)
luco

Determine the shape or structure of the data frame.

In [None]:
print('Shape: ', luco.shape)

#### Handling Missing Values

In [None]:
print("The percentage of missing values in dataset")
((luco.isnull() | luco.isna()).sum() * 100 / luco.index.size).round(2)

In [None]:
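# Keep rows with at least 8 non-NaN values.
# thresh alone is enough; newer pandas rejects how= combined with thresh=.
luco = luco.dropna(axis=0, thresh=8)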
luco

In [None]:
print("The percentage of missing values in dataset")
((luco.isnull() | luco.isna()).sum() * 100 / luco.index.size).round(2)

In [None]:
luco.shape

In [None]:
luco_plot = luco[['Depth_to_Groundwater_Podere_Casetta']]
sns.set(style="whitegrid")
plt.figure(figsize=(35,7))
sns.color_palette("husl", 9)
#sns.lineplot(data=luco_plot)
# Plot the Luco target column (the original cell plotted the Auser frame by mistake)
luco_plot.plot(linewidth=2, fontsize=12)

In [None]:
luco = luco.set_index('Date')
luco.index = pd.to_datetime(luco.index)

In [None]:
luco = luco.interpolate(method = 'time')

In [None]:
luco = luco.apply(lambda x: x.fillna(x.mean()),axis=0)

In [None]:
print("The percentage of missing values in dataset")
((luco.isnull() | luco.isna()).sum() * 100 / luco.index.size).round(2)

### Checking The Stationarity

To check the stationarity of the time series (i.e. identify whether each series is stationary or not), we perform the Augmented Dickey-Fuller (ADF) test.

For the ADF test:

1. Null Hypothesis - Series possesses a unit root and hence is not stationary.
2. Alternate Hypothesis - Series is stationary

In [None]:
from statsmodels.tsa.stattools import adfuller
print("AUGMENTED DICKEY FULLER TEST \n\n")
for i in range(len(luco.columns)):
  result = adfuller(luco[luco.columns[i]])

  if result[1] > 0.05 :
    print('{} - Series is NOT Stationary'.format(luco.columns[i]))
    print(" ")
  else:
    print('{} - Series is Stationary'.format(luco.columns[i]))
    print(" ")


Now we will remove the non-stationarity present in some features by differencing.

In [None]:
luco.dropna().plot()

In [None]:
luco= luco-luco.shift(1)
luco.dropna().plot()

In [None]:
luco.head()

We will remove the first row because differencing leaves it with only missing values (NaN).

In [None]:
luco = luco.iloc[1:]

#### Confirming Stationarity

Now we will run the ADF test on the differenced data to confirm whether all the time series are stationary.

In [None]:
from statsmodels.tsa.stattools import adfuller
print("AUGMENTED DICKEY FULLER TEST \n\n")
for i in range(len(luco.columns)):
  result = adfuller(luco[luco.columns[i]])

  if result[1] > 0.05 :
    print('{} - Series is NOT Stationary'.format(luco.columns[i]))
    print(" ")
  else:
    print('{} - Series is Stationary'.format(luco.columns[i]))
    print(" ")


Since all the time series are now stationary, we can proceed with the model building section.

### Building Predictive Models

#### Split Data into Training and Testing set

In [None]:
temp=luco
temp= temp.drop(columns=['Depth_to_Groundwater_Podere_Casetta'])

df= luco[['Depth_to_Groundwater_Podere_Casetta']]
new= pd.merge(temp, df, left_index=True, right_index=True)

# Update the main dataframe i.e. luco
luco= new
luco.head()

In [None]:
luco.shape

Now we will split the data into features and output matrix

In [None]:
X, Y = np.split(luco,[-1],axis=1)

In [None]:
print("Shape of X", X.shape)
print("Shape of Y", Y.shape)

Now we will split the data into Training set and Testing set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 12)

In [None]:
print("Shape of X_train", X_train.shape)
print("Shape of Y_train", Y_train.shape)
print("Shape of X_test", X_test.shape)
print("Shape of Y_test", Y_test.shape)

We will use StandardScaler for scaling the features of our dataset.

In [None]:
# Feature Scaling
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
X.describe()

#### 1. LSTM Model.

In [None]:
# Import necessary libraries and packages from Keras for building model
import tensorflow as tf
from keras.callbacks import EarlyStopping
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from keras.optimizers import Adam


In [None]:
# reshape input to be 3D [samples, timesteps, features]
X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))

In [None]:
# Initialize the Neural Network based on LSTM RNN
model = Sequential()

# Add 1st LSTM RNN layer
model.add(LSTM(units=32, return_sequences=True, input_shape=(1, 20)))

# Adding 2nd LSTM layer
#model.add(LSTM(units=32, return_sequences=True))

# Adding 3rd LSTM layer
model.add(LSTM(units=16, return_sequences=False))

# Adding Dropout
model.add(Dropout(0.3))

# Output layer
model.add(Dense(units=1, activation='linear'))

# Compiling the Neural Network
model.compile(optimizer = Adam(learning_rate=0.01), loss='mean_squared_error')

In [None]:
history = model.fit(X_train, Y_train, shuffle=True, epochs=150, validation_split=0.2, verbose=1, batch_size=64)

In [None]:
# Save the model weights
model.save_weights("luco_M1.h5")

In [None]:
train_pred = model.predict(X_train)

In [None]:
print("Predicted on Training Data: ",train_pred)
print("Actual Train Data: ",Y_train)

In [None]:
test_pred = model.predict(X_test)

In [None]:
print("Predicted on Test Data: ",test_pred)
print("Actual Test Data: ",Y_test)

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

train_mse = mean_squared_error(Y_train, train_pred)
print('Train MSE: %.3f' % train_mse)

test_mse = mean_squared_error(Y_test, test_pred)
print('Test MSE: %.3f' % test_mse)

train_mae = mean_absolute_error(Y_train, train_pred)
print('Train MAE: %.3f' % train_mae)

test_mae = mean_absolute_error(Y_test, test_pred)
print('Test MAE: %.3f' % test_mae)

Thus, the final scores for Aquifer_LUCO are MSE = 0.003 and MAE = 0.018.

## PETRIGNANO

### Exploratory Data Analysis

#### Importing The Libraries


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import warnings
import gc
import missingno as mn
import datetime
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
%matplotlib inline

In [None]:
petrignano = pd.read_csv('../input/acea-water-prediction/Aquifer_Petrignano.csv',parse_dates=True)
petrignano

Determine the shape or structure of the data frame.

In [None]:
print('Shape: ', petrignano.shape)

#### Handling Missing Values

In [None]:
print("The percentage of missing values in dataset")
((petrignano.isnull() | petrignano.isna()).sum() * 100 / petrignano.index.size).round(2)

In [None]:
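# Keep rows with at least 3 non-NaN values.
# thresh alone is enough; newer pandas rejects how= combined with thresh=.
petrignano = petrignano.dropna(axis=0, thresh=3)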
petrignano

We set the threshold to 3 because we want to remove rows dominated by NaN values with minimum loss of data.

In [None]:
print("The percentage of missing values in dataset")
((petrignano.isnull() | petrignano.isna()).sum() * 100 / petrignano.index.size).round(2)

In [None]:
petrignano.shape

In [None]:
petrignano_plot = petrignano[['Depth_to_Groundwater_P24','Depth_to_Groundwater_P25']]
sns.set(style="whitegrid")
plt.figure(figsize=(35,7))
sns.color_palette("husl", 9)
sns.lineplot(data=petrignano_plot)
#df_auser.plot(linewidth=2, fontsize=12)

Further, we will set the Date column as our index.

In [None]:
petrignano = petrignano.set_index('Date')
petrignano.index = pd.to_datetime(petrignano.index)

In [None]:
petrignano = petrignano.interpolate(method = 'time')

In [None]:
petrignano = petrignano.apply(lambda x: x.fillna(x.mean()),axis=0)

In [None]:
print("The percentage of missing values in dataset")
((petrignano.isnull() | petrignano.isna()).sum() * 100 / petrignano.index.size).round(2)

### Checking The Stationarity

To check the stationarity of the time series (i.e. identify whether each series is stationary or not), we perform the Augmented Dickey-Fuller (ADF) test.

For the ADF test:

1. Null Hypothesis - Series possesses a unit root and hence is not stationary.
2. Alternate Hypothesis - Series is stationary

In [None]:
from statsmodels.tsa.stattools import adfuller
print("AUGMENTED DICKEY FULLER TEST \n\n")
for i in range(len(petrignano.columns)):
  result = adfuller(petrignano[petrignano.columns[i]])

  if result[1] > 0.05 :
    print('{} - Series is NOT Stationary'.format(petrignano.columns[i]))
    print(" ")
  else:
    print('{} - Series is Stationary'.format(petrignano.columns[i]))
    print(" ")

Now we will remove the non-stationarity by differencing.

In [None]:
petrignano.dropna().plot()

In [None]:
petrignano= petrignano-petrignano.shift(1)
petrignano.dropna().plot()

In [None]:
petrignano.head()

We will remove the first row because differencing leaves it with only missing values (NaN).

In [None]:
petrignano = petrignano.iloc[1:]

#### Confirming Stationarity

Now we will run the ADF test on the differenced data to confirm whether all the time series are stationary.

In [None]:
from statsmodels.tsa.stattools import adfuller
print("AUGMENTED DICKEY FULLER TEST \n\n")
for i in range(len(petrignano.columns)):
  result = adfuller(petrignano[petrignano.columns[i]])

  if result[1] > 0.05 :
    print('{} - Series is NOT Stationary'.format(petrignano.columns[i]))
    print(" ")
  else:
    print('{} - Series is Stationary'.format(petrignano.columns[i]))
    print(" ")

Since all the time series are now stationary, we can proceed with the model building section.

### Building Predictive Models

#### Split Data into Training and Testing set

In [None]:
temp=petrignano
temp= temp.drop(columns=['Depth_to_Groundwater_P24','Depth_to_Groundwater_P25'])

df= petrignano[['Depth_to_Groundwater_P24','Depth_to_Groundwater_P25']]
new= pd.merge(temp, df, left_index=True, right_index=True)

# Update the main dataframe i.e. petrignano
petrignano= new
petrignano.head()

In [None]:
petrignano.shape

In [None]:
X, Y = np.split(petrignano,[-2],axis=1)

In [None]:
print("Shape of X", X.shape)
print("Shape of Y", Y.shape)

Now we will split the data into Training set and Testing set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 12)

In [None]:
print("Shape of X_train", X_train.shape)
print("Shape of Y_train", Y_train.shape)
print("Shape of X_test", X_test.shape)
print("Shape of Y_test", Y_test.shape)

We will use StandardScaler for scaling the features of our dataset.

In [None]:
# Feature Scaling
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
X.describe()

#### 1. LSTM Model.

In [None]:
# Import necessary libraries and packages from Keras for building model
import tensorflow as tf
from keras.callbacks import EarlyStopping
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from keras.optimizers import Adam

In [None]:
# reshape input to be 3D [samples, timesteps, features]
X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))

In [None]:
# Initialize the Neural Network based on LSTM RNN
model = Sequential()

# Add 1st LSTM RNN layer
model.add(LSTM(units=32, return_sequences=True, input_shape=(1, 5)))

# Adding 2nd LSTM layer
model.add(LSTM(units=16, return_sequences=True))

# Adding 3rd LSTM layer
model.add(LSTM(units=8, return_sequences=False))

# Adding Dropout
model.add(Dropout(0.3))

# Output layer
model.add(Dense(units=2, activation='linear'))

# Compiling the Neural Network
model.compile(optimizer = Adam(learning_rate=0.01), loss='mean_squared_error')

In [None]:
history = model.fit(X_train, Y_train, shuffle=True, epochs=100, validation_split=0.1, verbose=1, batch_size=64)

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()

In [None]:
model.save_weights("petrignano_M1.h5")

In [None]:
train_pred = model.predict(X_train)

In [None]:
print("Predicted on Training Data: ",train_pred)
print("Actual Train Data: ",Y_train)

In [None]:
test_pred = model.predict(X_test)

In [None]:
print("Predicted on Test Data: ",test_pred)
print("Actual Test Data: ",Y_test)

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

train_mse = mean_squared_error(Y_train, train_pred)
print('Train MSE: %.3f' % train_mse)

test_mse = mean_squared_error(Y_test, test_pred)
print('Test MSE: %.3f' % test_mse)

train_mae = mean_absolute_error(Y_train, train_pred)
print('Train MAE: %.3f' % train_mae)

test_mae = mean_absolute_error(Y_test, test_pred)
print('Test MAE: %.3f' % test_mae)

Thus, the final scores for Aquifer_PETRIGNANO are MSE = 0.021 and MAE = 0.063.
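
For reference, the two metrics reported above can be written out directly. This is a small NumPy sketch with made-up numbers; the helper names `mae` and `mse` are illustrative. For targets with several columns of equal length, scikit-learn's default uniform averaging over the output columns gives the same value as the mean over all entries computed here.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error over all entries."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def mse(y_true, y_pred):
    """Mean squared error over all entries."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# Two samples, two target columns (hypothetical values)
y_true_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
y_pred_demo = np.array([[1.5, 2.0], [2.0, 4.0]])

print("MAE:", mae(y_true_demo, y_pred_demo))  # 0.375
print("MSE:", mse(y_true_demo, y_pred_demo))  # 0.3125
```

MSE penalizes the single large error (1.0) more heavily than MAE does, which is why the two metrics can rank models differently.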

In conclusion, for the water body AQUIFER, the best-fitting predictive model is the LSTM RNN. The MSE and MAE scores obtained for each water body are as follows:



In [None]:
from IPython.display import Image
import os
!ls ../input/

Image("../input/results/W1_Updated.png")

# WATER SPRING

## AMIATA

### Exploratory Data Analysis

#### Importing The Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import warnings
import gc
import missingno as mn
import datetime
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
%matplotlib inline

In [None]:
amiata = pd.read_csv('../input/acea-water-prediction/Water_Spring_Amiata.csv',parse_dates=True)
amiata

Determine the shape or structure of the data frame.

In [None]:
print('Shape: ', amiata.shape)

#### Handling Missing Values

In [None]:
print("The percentage of missing values in dataset")
((amiata.isnull() | amiata.isna()).sum() * 100 / amiata.index.size).round(2)

In [None]:
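# Keep only rows with at least 5 non-missing values
# (recent pandas versions do not allow `how` and `thresh` together)
amiata = amiata.dropna(axis=0, thresh=5)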
amiata

We set the threshold to 5, so only rows with at least 5 non-missing values are kept. This removes the mostly empty rows while losing as little data as possible.
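
To make the meaning of `thresh` concrete, here is a small sketch on a toy frame (the column names and values are made up for illustration): a row survives only if it has at least `thresh` non-missing values.

```python
import numpy as np
import pandas as pd

# Toy frame: row 0 has 3 values, row 1 has 2, row 2 has only 1
demo = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, 6.0, 7.0],
})

# Keep rows with at least 2 non-missing values
kept = demo.dropna(axis=0, thresh=2)
print(kept)
```

Rows 0 and 1 are kept and row 2 is dropped. The surviving rows may still contain NaN values, which is why the notebook interpolates afterwards.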

In [None]:
print("The percentage of missing values in dataset")
((amiata.isnull() | amiata.isna()).sum() * 100 / amiata.index.size).round(2)

In [None]:
amiata.shape

In [None]:
amiata_plot = amiata[['Flow_Rate_Bugnano', 'Flow_Rate_Arbure',
 'Flow_Rate_Ermicciolo', 'Flow_Rate_Galleria_Alta']]
sns.set(style="whitegrid")
plt.figure(figsize=(35,7))
sns.color_palette("husl", 9)
sns.lineplot(data=amiata_plot)
#df_auser.plot(linewidth=2, fontsize=12)

In [None]:
amiata = amiata.set_index('Date')
amiata.index = pd.to_datetime(amiata.index)

In [None]:
amiata = amiata.interpolate(method = 'time')

In [None]:
amiata = amiata.apply(lambda x: x.fillna(x.mean()),axis=0)

In [None]:
sns.set_theme(style="white")

# Compute the correlation matrix
corr = amiata.corr(method="pearson")

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

As the correlation plot above shows, Rainfall_Castel_del_Piano is correlated with Rainfall_Abbadia_S_Salvatore, Rainfall_S_Fiora, Rainfall_Laghetto_Verde and Rainfall_Vetta_Amiata.

So we will drop Rainfall_Castel_del_Piano to avoid redundant, potentially biasing input features and to make training smoother.

In [None]:
temp= amiata
temp= temp.drop(columns=['Rainfall_Castel_del_Piano'])
amiata= temp
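
The redundancy read off the heatmap can also be listed programmatically. The sketch below is one possible way to do it, using synthetic data; the helper `correlated_pairs` and the 0.8 cut-off are assumptions for illustration, not values taken from this analysis.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return (col_a, col_b, r) for column pairs with |Pearson r| >= threshold."""
    corr = df.corr(method="pearson")
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Synthetic example: rain_b is a noisy copy of rain_a, temp is independent
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "rain_a": base,
    "rain_b": base + rng.normal(scale=0.1, size=200),
    "temp": rng.normal(size=200),
})

pairs = correlated_pairs(demo)
print(pairs)
```

Applied to `amiata`, a call such as `correlated_pairs(amiata)` would list candidate columns to drop; which member of each pair to remove remains a judgment call.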

In [None]:
print("The percentage of missing values in dataset")
((amiata.isnull() | amiata.isna()).sum() * 100 / amiata.index.size).round(2)

### Checking The Stationarity

To check whether each time series is stationary, we perform the Augmented Dickey-Fuller test (ADF test).

For the ADF test:

1. Null Hypothesis - Series possesses a unit root and hence is not stationary.
2. Alternate Hypothesis - Series is stationary

In [None]:
from statsmodels.tsa.stattools import adfuller
print("AUGMENTED DICKEY FULLER TEST \n\n")
for i in range(len(amiata.columns)):
  result = adfuller(amiata[amiata.columns[i]])

  if result[1] > 0.05 :
    print('{} - Series is NOT Stationary'.format(amiata.columns[i]))
    print(" ")
  else:
    print('{} - Series is Stationary'.format(amiata.columns[i]))
    print(" ")

Now we will remove the non-stationarity present in some features by differencing.

In [None]:
amiata.dropna().plot()

In [None]:
amiata= amiata-amiata.shift(1)
amiata.dropna().plot()

In [None]:
amiata.head()

We will remove the first row because differencing leaves it filled with NaN values.

In [None]:
amiata = amiata.iloc[1:]
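
The differencing step above subtracts each row's predecessor from it, which is what pandas' built-in `diff()` computes. A toy sketch (the `level` column and its values are made up) shows the equivalence and why the first row has to be dropped:

```python
import pandas as pd

# Toy daily series (hypothetical values)
demo = pd.DataFrame(
    {"level": [10.0, 12.0, 11.0, 15.0]},
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
)

manual = demo - demo.shift(1)  # the form used in this notebook
builtin = demo.diff()          # equivalent built-in

# The first row has no predecessor, so it is NaN and gets dropped
differenced = builtin.iloc[1:]
print(differenced)
```

Note that differencing every column, including the targets, means the models later predict day-to-day changes rather than absolute levels.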

#### Confirming Stationarity

Now we will rerun the ADF test on the differenced data to confirm that all the time series are stationary.

In [None]:
from statsmodels.tsa.stattools import adfuller
print("AUGMENTED DICKEY FULLER TEST \n\n")
for i in range(len(amiata.columns)):
  result = adfuller(amiata[amiata.columns[i]])

  if result[1] > 0.05 :
    print('{} - Series is NOT Stationary'.format(amiata.columns[i]))
    print(" ")
  else:
    print('{} - Series is Stationary'.format(amiata.columns[i]))
    print(" ")

Since the time series are stationary, we can proceed to the model-building section.

### Building Predictive Models

#### Split Data into Training and Testing set

In [None]:
temp=amiata
temp= temp.drop(columns=['Flow_Rate_Bugnano', 'Flow_Rate_Arbure',
 'Flow_Rate_Ermicciolo', 'Flow_Rate_Galleria_Alta'])

df= amiata[['Flow_Rate_Bugnano', 'Flow_Rate_Arbure',
 'Flow_Rate_Ermicciolo', 'Flow_Rate_Galleria_Alta']]
new= pd.merge(temp, df, left_index=True, right_index=True)

# Update the main dataframe, i.e. amiata
amiata= new
amiata.head()

In [None]:
amiata.shape

In [None]:
X, Y = np.split(amiata,[-4],axis=1)

In [None]:
print("Shape of X", X.shape)
print("Shape of Y", Y.shape)

Now we will split the data into a training set and a testing set.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.1, random_state = 12)
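
Note that `train_test_split` shuffles rows by default, so the test rows above are interleaved in time with the training rows. For time-ordered data, a chronological hold-out is a common alternative. The sketch below is only an illustration of that option, not the split used for the reported scores; the helper `chronological_split` and the toy frames are assumptions.

```python
import pandas as pd

def chronological_split(X, Y, test_size=0.1):
    """Hold out the last `test_size` fraction of rows, preserving time order."""
    n_test = max(1, int(round(len(X) * test_size)))
    split = len(X) - n_test
    return X.iloc[:split], X.iloc[split:], Y.iloc[:split], Y.iloc[split:]

# Toy date-indexed frames (hypothetical values)
idx = pd.date_range("2020-01-01", periods=10, freq="D")
X_demo = pd.DataFrame({"rain": range(10)}, index=idx)
Y_demo = pd.DataFrame({"flow": range(10, 20)}, index=idx)

X_tr, X_te, Y_tr, Y_te = chronological_split(X_demo, Y_demo, test_size=0.2)
print(X_tr.index.max(), "<", X_te.index.min())
```

With this kind of split every test date lies after every training date, which is closer to how a forecast would be used in practice.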

In [None]:
print("Shape of X_train", X_train.shape)
print("Shape of Y_train", Y_train.shape)
print("Shape of X_test", X_test.shape)
print("Shape of Y_test", Y_test.shape)

We will use StandardScaler to scale the features of our dataset.

In [None]:
# Feature Scaling
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
X.describe()

#### 1. XGBoost Regression Model.

In [None]:
from sklearn.model_selection import TimeSeriesSplit
import xgboost as xgr
from sklearn.multioutput import MultiOutputRegressor

Training XGBoost with evaluation metric as MAE.

In [None]:
xgr_mae = xgr.XGBRegressor(learning_rate =0.01, n_estimators=10000, max_depth=3, eval_metric='mae', seed=12)

In [None]:
multioutputregressor = MultiOutputRegressor(xgr_mae)
xgbr_1= multioutputregressor.fit(X_train,Y_train)

In [None]:
# For Training set Prediction
train_pred = xgbr_1.predict(X_train)

In [None]:
print("Predicted on Training Data: ",train_pred)
print("Actual Train Data: ",Y_train)

In [None]:
# For Testing set Prediction
test_pred= xgbr_1.predict(X_test)

In [None]:
print("Predicted on Test Data: ",test_pred)
print("Actual Test Data: ",Y_test)

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import math

In [None]:
# Mean Absolute Error
print("Training MAE", mean_absolute_error(Y_train, train_pred))
print("Testing MAE",mean_absolute_error(Y_test, test_pred))


Thus, the final score for Water_Spring_AMIATA with XGBoost is MAE = 0.084.

#### 2. LSTM Model.

In [None]:
# Import necessary libraries and packages from Keras for building model
import tensorflow as tf
from keras.callbacks import EarlyStopping
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from keras.optimizers import Adam

In [None]:
# reshape input to be 3D [samples, timesteps, features]
X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))
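
The reshape above turns each row into a sequence of length one. A tiny NumPy sketch with a made-up array shows the resulting shape; the last two dimensions are what the `input_shape=(timesteps, features)` argument of the first LSTM layer has to match.

```python
import numpy as np

# 4 samples with 3 features each (hypothetical values)
X_demo = np.arange(12, dtype=float).reshape(4, 3)

# Add a timestep axis of length 1: [samples, timesteps, features]
X_seq = X_demo.reshape((X_demo.shape[0], 1, X_demo.shape[1]))

print(X_seq.shape)      # (4, 1, 3)
print(X_seq.shape[1:])  # (1, 3), the value to pass as input_shape
```

With a single timestep the LSTM sees no history within a sample, so each prediction depends only on that day's features.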

In [None]:
# Initialize the Neural Network based on LSTM RNN
model = Sequential()

# Add 1st LSTM RNN layer
model.add(LSTM(units=8, return_sequences=True, input_shape=(1, 10)))

# Adding 2nd LSTM layer
#model.add(LSTM(units=32, return_sequences=True))

# Adding 3rd LSTM layer
model.add(LSTM(units=16, return_sequences=False))

# Adding Dropout
model.add(Dropout(0.2))

# Output layer
model.add(Dense(units=4, activation='linear'))

# Compiling the Neural Network
model.compile(optimizer = Adam(learning_rate=0.01), loss='mean_squared_error')

In [None]:
history = model.fit(X_train, Y_train, shuffle=True, epochs=50, validation_split=0.2, verbose=1, batch_size=32)

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()

In [None]:
model.save_weights("amiata_M1.h5")

In [None]:
train_pred = model.predict(X_train)

In [None]:
print("Predicted on Training Data: ",train_pred)
print("Actual Train Data: ",Y_train)

In [None]:
test_pred = model.predict(X_test)

In [None]:
print("Predicted on Test Data: ",test_pred)
print("Actual Test Data: ",Y_test)

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

train_mse = mean_squared_error(Y_train, train_pred)
print('Train MSE: %.3f' % train_mse)

test_mse = mean_squared_error(Y_test, test_pred)
print('Test MSE: %.3f' % test_mse)

train_mae = mean_absolute_error(Y_train, train_pred)
print('Train MAE: %.3f' % train_mae)

test_mae = mean_absolute_error(Y_test, test_pred)
print('Test MAE: %.3f' % test_mae)

Thus, the final scores for Water_Spring_AMIATA with the LSTM RNN are MSE = 0.052 and MAE = 0.076.

## MADONNA_DI_CANNETO

### Exploratory Data Analysis

#### Importing The Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import warnings
import gc
import missingno as mn
import datetime
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
%matplotlib inline

In [None]:
mdc = pd.read_csv('../input/acea-water-prediction/Water_Spring_Madonna_di_Canneto.csv',parse_dates=True)
mdc

Determine the shape or structure of the data frame.

In [None]:
print('Shape: ', mdc.shape)

#### Handling Missing Values

In [None]:
print("The percentage of missing values in dataset")
((mdc.isnull() | mdc.isna()).sum() * 100 / mdc.index.size).round(2)

In [None]:
mdc_plot = mdc[['Flow_Rate_Madonna_di_Canneto']]
sns.set(style="whitegrid")
plt.figure(figsize=(35,7))
sns.color_palette("husl", 9)
sns.lineplot(data=mdc_plot)


In [None]:
mdc = mdc.dropna(subset = ["Date"])
mdc

In [None]:
mdc = mdc.set_index('Date')
mdc.index = pd.to_datetime(mdc.index)

In [None]:
mdc = mdc.interpolate(method = 'linear')

In [None]:
mdc = mdc.apply(lambda x: x.fillna(x.mean()),axis=0)

In [None]:
print("The percentage of missing values in dataset")
((mdc.isnull() | mdc.isna()).sum() * 100 / mdc.index.size).round(2)

### Checking The Stationarity

To check whether each time series is stationary, we perform the Augmented Dickey-Fuller test (ADF test).

For the ADF test:

1. Null Hypothesis - Series possesses a unit root and hence is not stationary.
2. Alternate Hypothesis - Series is stationary

In [None]:
from statsmodels.tsa.stattools import adfuller
print("AUGMENTED DICKEY FULLER TEST \n\n")
for i in range(len(mdc.columns)):
  result = adfuller(mdc[mdc.columns[i]])

  if result[1] > 0.05 :
    print('{} - Series is NOT Stationary'.format(mdc.columns[i]))
    print(" ")
  else:
    print('{} - Series is Stationary'.format(mdc.columns[i]))
    print(" ")


Since the time series are stationary, we can proceed to the model-building section.

### Building Predictive Models

#### Split Data into Training and Testing set

In [None]:
temp=mdc
temp= temp.drop(columns=['Flow_Rate_Madonna_di_Canneto'])

df= mdc[['Flow_Rate_Madonna_di_Canneto']]
new= pd.merge(temp, df, left_index=True, right_index=True)

# Update the main dataframe, i.e. mdc
mdc= new
mdc.head()

In [None]:
X, Y = np.split(mdc,[-1],axis=1)

In [None]:
print("Shape of X", X.shape)
print("Shape of Y", Y.shape)

Now we will split the data into a training set and a testing set.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.1, random_state = 12)

In [None]:
print("Shape of X_train", X_train.shape)
print("Shape of Y_train", Y_train.shape)
print("Shape of X_test", X_test.shape)
print("Shape of Y_test", Y_test.shape)

We will use StandardScaler to scale the features of our dataset.

In [None]:
# Feature Scaling
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
X.describe()

#### 1. XGBoost Regression Model.

In [None]:
from sklearn.model_selection import TimeSeriesSplit
import xgboost as xgr

Training XGBoost with evaluation metric as MAE.

In [None]:
xgr_mae = xgr.XGBRegressor(learning_rate =0.01, n_estimators=10000, max_depth=3, eval_metric='mae', seed=12)

In [None]:
xgbr_1= xgr_mae.fit(X_train,Y_train)

In [None]:
# For Training set Prediction
train_pred = xgbr_1.predict(X_train)

In [None]:
# For Testing set Prediction
test_pred=xgbr_1.predict(X_test)


In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import math

In [None]:
# Mean Absolute Error
print("Training MAE", mean_absolute_error(Y_train, train_pred))
print("Testing MAE",mean_absolute_error(Y_test, test_pred))

Thus, the final score for XGBoost on Water_Spring_MADONNA_DI_CANNETO is MAE = 18.19.

#### 2. LSTM Model.

In [None]:
# Import necessary libraries and packages from Keras for building model
import tensorflow as tf
from keras.callbacks import EarlyStopping
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from keras.optimizers import Adam

In [None]:
# reshape input to be 3D [samples, timesteps, features]
X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))

In [None]:
# Initialize the Neural Network based on LSTM RNN
model = Sequential()

# Add 1st LSTM RNN layer
model.add(LSTM(units=32, return_sequences=True, input_shape=(1, 2)))

# Adding 2nd LSTM layer
#model.add(LSTM(units=32, return_sequences=True))

# Adding 3rd LSTM layer
model.add(LSTM(units=16, return_sequences=False))

# Adding Dropout
model.add(Dropout(0.2))

# Output layer
model.add(Dense(units=1, activation='linear'))

# Compiling the Neural Network
model.compile(optimizer = Adam(learning_rate=0.01), loss='mean_absolute_error')

In [None]:
history = model.fit(X_train, Y_train, shuffle=True, epochs=100, validation_split=0.2, verbose=1, batch_size=32)

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()

In [None]:
model.save_weights("mdc_M1.h5")

In [None]:
train_pred = model.predict(X_train)

In [None]:
print("Predicted on Training Data: ",train_pred)
print("Actual Train Data: ",Y_train)

In [None]:
test_pred = model.predict(X_test)

In [None]:
print("Predicted on Test Data: ",test_pred)
print("Actual Test Data: ",Y_test)

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

train_mae = mean_absolute_error(Y_train, train_pred)
print('Train MAE: %.3f' % train_mae)

test_mae = mean_absolute_error(Y_test, test_pred)
print('Test MAE: %.3f' % test_mae)

Thus, the final score for the LSTM RNN on Water_Spring_MADONNA_DI_CANNETO is MAE = 19.40.

## LUPA

### Exploratory Data Analysis

#### Importing The Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import warnings
import gc
import missingno as mn
import datetime
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
%matplotlib inline


In [None]:
lupa = pd.read_csv('../input/acea-water-prediction/Water_Spring_Lupa.csv',parse_dates=True)
lupa

Determine the shape or structure of the data frame.

In [None]:
print('Shape: ', lupa.shape)

#### Handling Missing Values

In [None]:
print("The percentage of missing values in dataset")
((lupa.isnull() | lupa.isna()).sum() * 100 / lupa.index.size).round(2)

In [None]:
lupa_plot = lupa[['Flow_Rate_Lupa']]
sns.set(style="whitegrid")
plt.figure(figsize=(35,7))
sns.color_palette("husl", 9)
sns.lineplot(data=lupa_plot)


In [None]:
lupa = lupa.dropna(subset = ["Date"])
lupa

In [None]:
lupa = lupa.set_index('Date')
lupa.index = pd.to_datetime(lupa.index)

In [None]:
lupa = lupa.interpolate(method = 'linear')


In [None]:

lupa = lupa.apply(lambda x: x.fillna(x.mean()),axis=0)

In [None]:
print("The percentage of missing values in dataset")
((lupa.isnull() | lupa.isna()).sum() * 100 / lupa.index.size).round(2)

### Checking The Stationarity

To check whether each time series is stationary, we perform the Augmented Dickey-Fuller test (ADF test).

For the ADF test:

1. Null Hypothesis - Series possesses a unit root and hence is not stationary.
2. Alternate Hypothesis - Series is stationary

In [None]:
from statsmodels.tsa.stattools import adfuller
print("AUGMENTED DICKEY FULLER TEST \n\n")
for i in range(len(lupa.columns)):
  result = adfuller(lupa[lupa.columns[i]])

  if result[1] > 0.05 :
    print('{} - Series is NOT Stationary'.format(lupa.columns[i]))
    print(" ")
  else:
    print('{} - Series is Stationary'.format(lupa.columns[i]))
    print(" ")

In [None]:
lupa.dropna().plot()

In [None]:
lupa= lupa-lupa.shift(1)
lupa.dropna().plot()

In [None]:
lupa.head()

We will remove the first row because differencing leaves it filled with NaN values.

In [None]:
lupa = lupa.iloc[1:]

#### Confirming Stationarity

Now we will rerun the ADF test on the differenced data to confirm that all the time series are stationary.

In [None]:
from statsmodels.tsa.stattools import adfuller
print("AUGMENTED DICKEY FULLER TEST \n\n")
for i in range(len(lupa.columns)):
  result = adfuller(lupa[lupa.columns[i]])

  if result[1] > 0.05 :
    print('{} - Series is NOT Stationary'.format(lupa.columns[i]))
    print(" ")
  else:
    print('{} - Series is Stationary'.format(lupa.columns[i]))
    print(" ")

Since the time series are stationary, we can proceed to the model-building section.

### Building Predictive Models

#### Split Data into Training and Testing set

In [None]:
temp=lupa
temp= temp.drop(columns=['Flow_Rate_Lupa'])

df= lupa[['Flow_Rate_Lupa']]
new= pd.merge(temp, df, left_index=True, right_index=True)

# Update the main dataframe, i.e. lupa
lupa= new
lupa.head()

In [None]:
lupa.shape

In [None]:
X, Y = np.split(lupa,[-1],axis=1)

In [None]:
print("Shape of X", X.shape)
print("Shape of Y", Y.shape)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.1, random_state = 12)

In [None]:
print("Shape of X_train", X_train.shape)
print("Shape of Y_train", Y_train.shape)
print("Shape of X_test", X_test.shape)
print("Shape of Y_test", Y_test.shape)

We will use StandardScaler to scale the features of our dataset.

In [None]:
# Feature Scaling
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
X.describe()

#### 1. XGBoost Regression Model.

In [None]:
from sklearn.model_selection import TimeSeriesSplit
import xgboost as xgr
from sklearn.multioutput import MultiOutputRegressor


Training XGBoost with evaluation metric as MAE.

In [None]:
xgr_mae = xgr.XGBRegressor(learning_rate =0.01, n_estimators=10000, max_depth=3, eval_metric='mae', seed=12)

In [None]:
xgbr_1= xgr_mae.fit(X_train,Y_train)

In [None]:
# For Training set Prediction
train_pred = xgbr_1.predict(X_train)

In [None]:
print("Predicted on Training Data: ",train_pred)
print("Actual Train Data: ",Y_train)

In [None]:
# For Testing set Prediction
test_pred=xgbr_1.predict(X_test)

In [None]:
print("Predicted on Test Data: ",test_pred)
print("Actual Test Data: ",Y_test)

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import math

In [None]:
# Mean Absolute Error
print("Training MAE", mean_absolute_error(Y_train, train_pred))
print("Testing MAE",mean_absolute_error(Y_test, test_pred))

Thus, the final score for Water_Spring_LUPA after implementing XGBoost is MAE (Mean Absolute Error) = 0.3.

#### 2. LSTM Model.


In [None]:
# Import necessary libraries and packages from Keras for building model
import tensorflow as tf
from keras.callbacks import EarlyStopping
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from keras.optimizers import Adam

In [None]:
# reshape input to be 3D [samples, timesteps, features]
X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))

In [None]:
# Initialize the Neural Network based on LSTM RNN
model = Sequential()

# Add 1st LSTM RNN layer
model.add(LSTM(units=32, return_sequences=True, input_shape=(1, 1)))

# Adding 2nd LSTM layer
#model.add(LSTM(units=32, return_sequences=True))

# Adding 3rd LSTM layer
model.add(LSTM(units=16, return_sequences=False))

# Adding Dropout
model.add(Dropout(0.2))

# Output layer
model.add(Dense(units=1, activation='linear'))

# Compiling the Neural Network
model.compile(optimizer = Adam(learning_rate=0.01), loss='mean_absolute_error')


In [None]:
history = model.fit(X_train, Y_train, shuffle=True, epochs=100, validation_split=0.2, verbose=1, batch_size=32)

In [None]:
model.save_weights("lupa_M1.h5")

In [None]:
train_pred = model.predict(X_train)

In [None]:
print("Predicted on Training Data: ",train_pred)
print("Actual Train Data: ",Y_train)

In [None]:
test_pred = model.predict(X_test)


In [None]:
print("Predicted on Test Data: ",test_pred)
print("Actual Test Data: ",Y_test)

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

train_mae = mean_absolute_error(Y_train, train_pred)
print('Train MAE: %.3f' % train_mae)

test_mae = mean_absolute_error(Y_test, test_pred)
print('Test MAE: %.3f' % test_mae)



Thus, the final score for Water_Spring_LUPA after implementing the LSTM RNN is MAE = 0.284.

In conclusion, for the water body Water_Spring we can implement the **XGBoost** model, as it is the predictive model that performs well on all three water springs, i.e. Amiata, Madonna di Canneto and Lupa.

In [None]:
from IPython.display import Image
import os
!ls ../input/

Image("../input/results/W2.png")

# RIVER

## ARNO

### Exploratory Data Analysis

#### Importing The Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import warnings
import gc
import missingno as mn
import datetime
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
%matplotlib inline

In [None]:
arno = pd.read_csv('../input/acea-water-prediction/River_Arno.csv',parse_dates=True)
arno

Determine the shape or structure of the data frame.


In [None]:
print('Shape: ', arno.shape)

#### Handling Missing Values


In [None]:
print("The percentage of missing values in dataset")
((arno.isnull() | arno.isna()).sum() * 100 / arno.index.size).round(2)

In [None]:
arno_plot = arno[['Hydrometry_Nave_di_Rosano']]
sns.set(style="whitegrid")
plt.figure(figsize=(35,7))
sns.color_palette("husl", 9)
sns.lineplot(data=arno_plot)
#df_auser.plot(linewidth=2, fontsize=12)

In [None]:
arno = arno.set_index('Date')
arno.index = pd.to_datetime(arno.index)

In [None]:
arno = arno.interpolate(method = 'linear')

In [None]:
arno = arno.apply(lambda x: x.fillna(x.mean()),axis=0)

In [None]:
print("The percentage of missing values in dataset")
((arno.isnull() | arno.isna()).sum() * 100 / arno.index.size).round(2)

### Checking The Stationarity

To check whether each time series is stationary, we perform the Augmented Dickey-Fuller test (ADF test).

For the ADF test:

1. Null Hypothesis - Series possesses a unit root and hence is not stationary.
2. Alternate Hypothesis - Series is stationary

In [None]:

from statsmodels.tsa.stattools import adfuller
print("AUGMENTED DICKEY FULLER TEST \n\n")
for i in range(len(arno.columns)):
  result = adfuller(arno[arno.columns[i]])

  if result[1] > 0.05 :
    print('{} - Series is NOT Stationary'.format(arno.columns[i]))
    print(" ")
  else:
    print('{} - Series is Stationary'.format(arno.columns[i]))
    print(" ")


Since the time series are stationary, we can proceed to the model-building section.

### Building Predictive Models

#### Split Data into Training and Testing set

In [None]:
temp=arno
temp= temp.drop(columns=['Hydrometry_Nave_di_Rosano'])

df= arno[['Hydrometry_Nave_di_Rosano']]
new= pd.merge(temp, df, left_index=True, right_index=True)

# Update the main dataframe, i.e. arno
arno= new
arno.head()


In [None]:
arno.shape

In [None]:
X, Y = np.split(arno,[-1],axis=1)


In [None]:
print("Shape of X", X.shape)
print("Shape of Y", Y.shape)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.1, random_state = 12)

In [None]:
print("Shape of X_train", X_train.shape)
print("Shape of Y_train", Y_train.shape)
print("Shape of X_test", X_test.shape)
print("Shape of Y_test", Y_test.shape)

We will use StandardScaler to scale the features of our dataset.

In [None]:
# Feature Scaling
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
X.describe()

#### 1. XGBoost Regression Model.


In [None]:
from sklearn.model_selection import TimeSeriesSplit
import xgboost as xgr

Training XGBoost with evaluation metric as MAE.

In [None]:
xgr_mae = xgr.XGBRegressor(learning_rate =0.01, n_estimators=10000, max_depth=3, eval_metric='mae', seed=12)

In [None]:
xgbr_1= xgr_mae.fit(X_train,Y_train)

In [None]:
# For Training set Prediction
train_pred = xgbr_1.predict(X_train)

In [None]:
print("Predicted on Training Data: ",train_pred)
print("Actual Train Data: ",Y_train)

In [None]:
# For Testing set Prediction
test_pred=xgbr_1.predict(X_test)

In [None]:
print("Predicted on Test Data: ",test_pred)
print("Actual Test Data: ",Y_test)

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import math

In [None]:
# Mean Absolute Error
print("Training MAE", mean_absolute_error(Y_train, train_pred))
print("Testing MAE",mean_absolute_error(Y_test, test_pred))


#### 2. LSTM Model.

In [None]:
# Import necessary libraries and packages from Keras for building model
import tensorflow as tf
from keras.callbacks import EarlyStopping
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from keras.optimizers import Adam

In [None]:
# reshape input to be 3D [samples, timesteps, features]
X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))


In [None]:
# Initialize the Neural Network based on LSTM RNN
model = Sequential()

# Add 1st LSTM RNN layer
model.add(LSTM(units=32, return_sequences=True, input_shape=(1,15)))

# Adding 2nd LSTM layer
model.add(LSTM(units=16, return_sequences=False))

# Adding Dropout
model.add(Dropout(0.25))

# Output layer
model.add(Dense(units=1, activation='linear'))

# Compiling the Neural Network
model.compile(optimizer = Adam(learning_rate=0.01), loss='mean_absolute_error')

In [None]:
history = model.fit(X_train, Y_train, shuffle=True, epochs=50, validation_split=0.2, verbose=1, batch_size=256)

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()

In [None]:
model.save_weights("arno_M1.h5")

In [None]:
train_pred = model.predict(X_train)

In [None]:
print("Predicted on Training Data: ",train_pred)
print("Actual Train Data: ",Y_train)

In [None]:
test_pred = model.predict(X_test)


In [None]:
print("Predicted on Test Data: ",test_pred)
print("Actual Test Data: ",Y_test)

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

train_mae = mean_absolute_error(Y_train, train_pred)
print('Train MAE: %.3f' % train_mae)

test_mae = mean_absolute_error(Y_test, test_pred)
print('Test MAE: %.3f' % test_mae)

Thus, the final score for River_ARNO after implementing the LSTM RNN is MAE = 0.36.

In conclusion, for the water body RIVER, the best-fitting predictive model is the LSTM RNN. The MAE scores obtained are as follows:

In [None]:
from IPython.display import Image
import os
!ls ../input/

Image("../input/results/W3.png")

# LAKE

## BILANCINO

### Exploratory Data Analysis

#### Importing The Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import warnings
import gc
import missingno as mn
import datetime
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
%matplotlib inline

In [None]:
bilancino = pd.read_csv('../input/acea-water-prediction/Lake_Bilancino.csv',parse_dates=True)
bilancino

Determine the shape or structure of the data frame.

In [None]:
print('Shape: ', bilancino.shape)


#### Handling Missing Values

In [None]:
print("The percentage of missing values in dataset")
((bilancino.isnull() | bilancino.isna()).sum() * 100 / bilancino.index.size).round(2)

In [None]:
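# Keep only rows with at least 9 non-missing values
# (recent pandas versions do not allow `how` and `thresh` together)
bilancino = bilancino.dropna(axis=0, thresh=9)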
bilancino


In [None]:
print("The percentage of missing values in dataset")
((bilancino.isnull() | bilancino.isna()).sum() * 100 / bilancino.index.size).round(2)

In [None]:
bilancino.shape


In [None]:
bilancino_plot = bilancino[['Lake_Level','Flow_Rate']]
sns.set(style="whitegrid")
plt.figure(figsize=(35,7))
sns.color_palette("husl", 9)
sns.lineplot(data=bilancino_plot)
#df_auser.plot(linewidth=2, fontsize=12)

In [None]:
bilancino = bilancino.set_index('Date')
bilancino.index = pd.to_datetime(bilancino.index)

In [None]:
bilancino = bilancino.interpolate(method = 'time')

In [None]:
bilancino = bilancino.apply(lambda x: x.fillna(x.mean()),axis=0)

In [None]:
print("The percentage of missing values in dataset")
((bilancino.isnull() | bilancino.isna()).sum() * 100 / bilancino.index.size).round(2)

### Checking The Stationarity

To check whether each time series is stationary, we perform the Augmented Dickey-Fuller test (ADF test).

For the ADF test:

1. Null Hypothesis - Series possesses a unit root and hence is not stationary.
2. Alternate Hypothesis - Series is stationary
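
To make the unit-root idea concrete, the sketch below computes the plain (non-augmented) Dickey-Fuller t-statistic on synthetic data. It is a simplified illustration under stated assumptions, not what `adfuller` computes: `adfuller` also adds lagged differences and reports a p-value. The helper `dickey_fuller_stat`, the synthetic series, and the approximate 5% critical value of -2.86 (constant, no trend) are all assumptions made for the example.

```python
import math
import random

def dickey_fuller_stat(series):
    """t-statistic of rho in the regression dy_t = c + rho * y_{t-1} + e_t."""
    lagged = series[:-1]
    diffs = [b - a for a, b in zip(series[:-1], series[1:])]
    n = len(diffs)
    mean_x = sum(lagged) / n
    mean_y = sum(diffs) / n
    sxx = sum((x - mean_x) ** 2 for x in lagged)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lagged, diffs))
    rho = sxy / sxx
    intercept = mean_y - rho * mean_x
    rss = sum((y - intercept - rho * x) ** 2 for x, y in zip(lagged, diffs))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return rho / std_err

rng = random.Random(12)
noise = [rng.gauss(0, 1) for _ in range(500)]   # stationary: white noise

walk = []                                        # unit root: cumulative sum
level = 0.0
for shock in noise:
    level += shock
    walk.append(level)

noise_stat = dickey_fuller_stat(noise)
walk_stat = dickey_fuller_stat(walk)
CRITICAL_5PCT = -2.86   # approximate 5% critical value, constant-only case

print("white noise :", round(noise_stat, 2), "reject unit root:", noise_stat < CRITICAL_5PCT)
print("random walk :", round(walk_stat, 2), "reject unit root:", walk_stat < CRITICAL_5PCT)
```

A statistic far below the critical value is evidence against the null hypothesis of a unit root, which is why the white-noise series is classified as stationary.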


In [None]:
from statsmodels.tsa.stattools import adfuller
print("AUGMENTED DICKEY FULLER TEST \n\n")
for column in bilancino.columns:
    # adfuller returns (statistic, p-value, ...); index 1 is the p-value
    p_value = adfuller(bilancino[column])[1]

    # p-value above 0.05: the unit-root null hypothesis cannot be rejected
    if p_value > 0.05:
        print('{} - Series is NOT Stationary'.format(column))
    else:
        print('{} - Series is Stationary'.format(column))
    print(" ")

Since the time series is stationary, we can proceed to the model building section.

### Building Predictive Models

#### Split Data into Training and Testing set

In [None]:
# Move the target columns (Lake_Level, Flow_Rate) to the end of the dataframe
temp = bilancino.drop(columns=['Lake_Level','Flow_Rate'])

df = bilancino[['Lake_Level','Flow_Rate']]
new = pd.merge(temp, df, left_index=True, right_index=True)

# Update the main dataframe, i.e. bilancino
bilancino = new
bilancino.head()


In [None]:
bilancino.shape

In [None]:
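# The last two columns (Lake_Level, Flow_Rate) are the targets; the rest are features
X, Y = bilancino.iloc[:, :-2], bilancino.iloc[:, -2:]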

In [None]:
print("Shape of X", X.shape)
print("Shape of Y", Y.shape)

Now we will split the data into Training set and Testing set.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 12)

In [None]:
print("Shape of X_train", X_train.shape)
print("Shape of Y_train", Y_train.shape)
print("Shape of X_test", X_test.shape)
print("Shape of Y_test", Y_test.shape)
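
Note that `train_test_split` shuffles rows by default, so test days are interleaved with training days. For time-ordered data, a chronological hold-out is a common alternative. The sketch below shows one hypothetical way to do it (`chronological_split` is not part of this notebook or of scikit-learn) and is given only for comparison, not as the method used here.

```python
def chronological_split(rows, test_fraction=0.2):
    """Hold out the last `test_fraction` of time-ordered rows as the test set."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    cut = int(round(len(rows) * (1 - test_fraction)))
    return rows[:cut], rows[cut:]

# Ten made-up daily observations, already sorted by date.
days = list(range(10))
train_days, test_days = chronological_split(days)
print(train_days, test_days)
```

With a DataFrame sorted by date, the same cut could be applied with `.iloc[:cut]` and `.iloc[cut:]`.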

We will use StandardScaler to scale the features of our dataset.

In [None]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Fit on the training set only, then apply the same statistics to the test set
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
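
As a sketch of what the scaler does to a single feature: `StandardScaler` subtracts the training mean and divides by the training (population) standard deviation, and the test set reuses those same statistics. The helper names and the rainfall numbers below are made up for illustration.

```python
import statistics

def fit_standardizer(values):
    """Return (mean, population std) learned from the training values."""
    return statistics.fmean(values), statistics.pstdev(values)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std using previously fitted statistics."""
    return [(x - mean) / std for x in values]

train_rain = [0.0, 2.0, 4.0, 6.0, 8.0]   # made-up training feature
test_rain = [4.0, 10.0]                  # made-up test feature

mean, std = fit_standardizer(train_rain)
train_scaled = standardize(train_rain, mean, std)
test_scaled = standardize(test_rain, mean, std)   # reuses training statistics
```

After scaling, the training feature has mean 0 and standard deviation 1. The test feature generally does not, because it is transformed with the training statistics rather than its own.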

In [None]:
X.describe()

#### 1. XGBoost Regression Model.

In [None]:
from sklearn.model_selection import TimeSeriesSplit
import xgboost as xgr
from sklearn.multioutput import MultiOutputRegressor

Training XGBoost with evaluation metric as MAE.

In [None]:
xgr_mae = xgr.XGBRegressor(learning_rate =0.01, n_estimators=10000, max_depth=3, eval_metric='mae', seed=12)

In [None]:
multioutputregressor = MultiOutputRegressor(xgr_mae)
xgbr_1= multioutputregressor.fit(X_train,Y_train)

In [None]:
# For Training set Prediction
train_pred = xgbr_1.predict(X_train)

In [None]:
print("Predicted on Training Data: ",train_pred)
print("Actual Train Data: ",Y_train)

In [None]:
# For Testing set Prediction
test_pred=xgbr_1.predict(X_test)

In [None]:
print("Predicted on Test Data: ",test_pred)
print("Actual Test Data: ",Y_test)

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import math

In [None]:
# Mean Absolute Error
print("Training MAE", mean_absolute_error(Y_train, train_pred))
print("Testing MAE",mean_absolute_error(Y_test, test_pred))

Thus, the final score for Lake_BILANCINO using XGBoost is MAE=2.101
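
Because `Y` has two columns (`Lake_Level` and `Flow_Rate`), `mean_absolute_error` with its default `multioutput='uniform_average'` returns the plain average of the per-column MAEs. The sketch below reproduces that arithmetic on made-up numbers; `column_mae` is a hypothetical helper.

```python
def column_mae(actual, predicted):
    """Mean absolute error for one output column."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up values for the two targets.
lake_level_true, lake_level_pred = [250.0, 251.0, 249.0], [250.5, 250.0, 249.5]
flow_rate_true, flow_rate_pred = [2.0, 4.0, 6.0], [3.0, 4.0, 8.0]

lake_mae = column_mae(lake_level_true, lake_level_pred)
flow_mae = column_mae(flow_rate_true, flow_rate_pred)
overall_mae = (lake_mae + flow_mae) / 2   # uniform average over the outputs
```

Passing `multioutput='raw_values'` to `mean_absolute_error` returns the per-column errors instead, which can be easier to interpret when the targets are on different scales.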

#### 2. LSTM Model.


In [None]:
# Import necessary libraries and packages from Keras for building model
import tensorflow as tf
from keras.callbacks import EarlyStopping
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from keras.optimizers import Adam

In [None]:
# reshape input to be 3D [samples, timesteps, features]
X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))

In [None]:
# Initialize the Neural Network based on LSTM RNN
model = Sequential()

# Add 1st LSTM RNN layer
model.add(LSTM(units=64, return_sequences=True, input_shape=(1, 6)))

# Adding 2nd LSTM layer
model.add(LSTM(units=32, return_sequences=True))

# Adding 3rd LSTM layer
model.add(LSTM(units=16, return_sequences=False))

# Adding Dropout
model.add(Dropout(0.25))

# Output layer
model.add(Dense(units=2, activation='linear'))

# Compiling the Neural Network
model.compile(optimizer = Adam(learning_rate=0.01), loss='mean_absolute_error')

In [None]:
history = model.fit(X_train, Y_train, shuffle=True, epochs=150, validation_split=0.2, verbose=1, batch_size=256)

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()


In [None]:
model.save_weights("bilancino_M1.h5")

In [None]:
train_pred = model.predict(X_train)

In [None]:
print("Predicted on Training Data: ",train_pred)
print("Actual Train Data: ",Y_train)

In [None]:
test_pred = model.predict(X_test)

In [None]:
print("Predicted on Test Data: ",test_pred)
print("Actual Test Data: ",Y_test)

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

train_mae = mean_absolute_error(Y_train, train_pred)
print('Train MAE: %.3f' % train_mae)

test_mae = mean_absolute_error(Y_test, test_pred)
print('Test MAE: %.3f' % test_mae)



Thus, the final score for Lake_BILANCINO using the LSTM RNN is MAE=4.515

In conclusion, for the water body LAKE, the predictive model that fits best is XGBoost. The MAE scores obtained are as follows:

In [None]:
from IPython.display import Image
import os
!ls ../input/

Image("../input/results/W4_new.png")

The table below shows the models implemented on the different water bodies; the highlighted entries are the models that performed best on each water body's dataset.

In [None]:
from IPython.display import Image
import os
!ls ../input/

Image("../input/results/Final_1.png")