# Introduction

Hello everyone, and welcome to our notebook! \
In this notebook, we present the four models we worked on to predict the water level in different types of waterbodies: water springs, lakes, rivers, and aquifers. Each waterbody behaves differently, so a different model is used for each one. For each waterbody, we try several models and compare them in order to pick the best one.
Enjoy the read!





Let's first import the libraries needed.

In [None]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split, TimeSeriesSplit
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error 
from keras.layers import LSTM, Dense
from keras.models import Sequential



We're going to work on the river Arno first.

# River Arno
## *Data Preprocessing*

In [None]:
arno = pd.read_csv('../input/acea-water-prediction/River_Arno.csv')

Before diving into the different models, let's first take a closer look at our dataset in order to understand how it behaves.

In [None]:
arno.describe()

In [None]:
arno.info()

In [None]:
arno.head()

In [None]:
print("Data shape :" ,arno.shape)

As we can see, there are a lot of missing values in the dataset. Some preprocessing is necessary for the model to learn the right weights. A heat map makes the missing values easier to see.

In [None]:
sns.heatmap(arno.isnull())

The heat map confirms that many values are missing (NaN values). We first tried dropping every row that contains a missing value, but the model did not perform well enough, which was expected since it no longer had enough training examples to generalize to the rest of the data. We therefore took a different approach: instead of dropping these rows, we fill the missing values with approximate ones. The map below shows where the measurements were taken; we use it to fill most of the missing rainfall values with the rainfall recorded at nearby stations.

![](https://media.discordapp.net/attachments/791649795617325061/804851868571533332/download.png)


In [None]:
arno.iloc[3474:4569+1,-9] = arno.iloc[3474:4569+1,-4]
arno.iloc[3474:4569+1,-10] = arno.iloc[3474:4569+1,-4]
arno.iloc[3474:4569+1,-3] = arno.iloc[3474:4569+1,-4]
arno.iloc[3474:6764+1,-5] = arno.iloc[3474:6764+1,-8]
arno.iloc[3474:6764+1,-6] = arno.iloc[3474:6764+1,-8]
arno.iloc[3839:6764+1,-7] = arno.iloc[3839:6764+1,-8]
arno.iloc[4569:6764+1,-3] = arno.iloc[4569:6764+1,-8]
arno.iloc[4569:6764+1,-4] = arno.iloc[4569:6764+1,-8]
arno.iloc[4569:6764+1,-9] = arno.iloc[4569:6764+1,-8]
arno.iloc[4569:6764+1,-10] = arno.iloc[4569:6764+1,-8]
arno.iloc[6474:6764+1,-11] = arno.iloc[6474:6764+1,-13]
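The cell above copies whole row ranges between columns using hard-coded positions. As a minimal sketch of the same idea on a toy table (the station names `Rainfall_A` and `Rainfall_B` and all values are invented for illustration), `fillna` can borrow a neighbouring station's reading only where a measurement is actually missing. Note that this is a variant, not what the cell above does: `fillna` leaves existing measurements untouched, whereas the `iloc` assignments overwrite the whole range.

```python
import numpy as np
import pandas as pd

# Toy rainfall table; station names and values are invented for illustration.
toy = pd.DataFrame({
    "Rainfall_A": [0.0, np.nan, np.nan, 1.2],
    "Rainfall_B": [0.1, 2.0, 0.5, 1.0],
})

# Copy the neighbour's reading only where station A has no measurement.
toy["Rainfall_A"] = toy["Rainfall_A"].fillna(toy["Rainfall_B"])
print(toy)
```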


In [None]:
sns.heatmap(arno.isnull())

We also shifted some of the rainfall columns by a number of days, since rain that falls at some of these stations takes time to reach the river. We settled on a four-day shift because it gave the best performance.

In [None]:
# shifting the rainfall columns below by Number_days rows; other values could be explored later
# (negative shift: row t receives the value recorded at row t + Number_days)
# NN MSE for Number_days = 3 : 0.007
Number_days = 4
arno['Rainfall_Vernio'] = arno['Rainfall_Vernio'].shift(-Number_days)
arno['Rainfall_Mangona'] = arno['Rainfall_Mangona'].shift(-Number_days)
arno['Rainfall_S_Agata'] = arno['Rainfall_S_Agata'].shift(-Number_days)
arno['Rainfall_S_Piero'] = arno['Rainfall_S_Piero'].shift(-Number_days)
arno['Rainfall_Le_Croci'] = arno['Rainfall_Le_Croci'].shift(-Number_days)
arno['Rainfall_Cavallina'] = arno['Rainfall_Cavallina'].shift(-Number_days)

# dropping all rows that contain a null value
arno_1 = arno.dropna(how='any',axis=0).copy()
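Since the sign of the shift decides whether a row sees earlier or later rainfall, it may help to check the direction on a toy series first. The sketch below uses invented values and a two-row shift purely for illustration; it only shows how `Series.shift` behaves and says nothing about which direction suits the Arno data best.

```python
import pandas as pd

rain = pd.Series([10, 20, 30, 40, 50], name="rain")  # invented values

demo = pd.DataFrame({
    "rain": rain,
    "shift_minus_2": rain.shift(-2),  # row t holds the value from row t + 2
    "shift_plus_2": rain.shift(2),    # row t holds the value from row t - 2
})
print(demo)
```

In both cases the rows that fall off the end become NaN, which is why the rows containing null values are dropped right after the shift.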

In [None]:
arno_1.info()

In [None]:
sns.heatmap(arno_1.isnull())

In [None]:
# numeric_only=True skips the text 'Date' column, which cannot be correlated
sns.heatmap(arno_1.corr(numeric_only=True));

Three columns are added to the dataset: Day, Month and Year. We'll use them to understand how the hydrometry varies over the year.

In [None]:
arno_1['Day'] = arno_1['Date'].str.split('/').str[0]
arno_1['Month'] = arno_1['Date'].str.split('/').str[1]
arno_1['Year'] = arno_1['Date'].str.split('/').str[2]

In [None]:
plt.figure(1,figsize = (10,5))
plt.plot(arno_1.Hydrometry_Nave_di_Rosano)
plt.title("Hydrometry_Nave_di_Rosano")

The previous graph shows some anomalies in the hydrometry values. For instance, the hydrometry cannot be equal to 0, as it is around row 3500. To fix this, we replace these zeros with NaN values so that we can interpolate them afterwards. The interpolation uses the "pchip" method.

In [None]:
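# Assigning the result back avoids chained in-place edits, which recent pandas versions may not apply
arno_1['Hydrometry_Nave_di_Rosano'] = arno_1['Hydrometry_Nave_di_Rosano'].replace(0, np.nan)
arno_1['Hydrometry_Nave_di_Rosano'] = arno_1['Hydrometry_Nave_di_Rosano'].interpolate(method='pchip', limit_direction='forward')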
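To illustrate what the zero-to-NaN replacement followed by "pchip" interpolation does, here is a small sketch on an invented level series (the numbers are not from the Arno dataset). It assumes SciPy is installed, since pandas delegates the "pchip" method to it. PCHIP is shape-preserving, so on monotone data the filled values stay between their neighbours instead of overshooting.

```python
import numpy as np
import pandas as pd

# Invented level readings; the two zeros stand for sensor dropouts.
level = pd.Series([1.0, 1.2, 0.0, 0.0, 1.8, 2.0])

level = level.replace(0, np.nan)                                       # mark dropouts as missing
filled = level.interpolate(method="pchip", limit_direction="forward")  # fill the gaps
print(filled)
```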

In [None]:
plt.figure(1,figsize = (10,5))
plt.plot(arno_1.Hydrometry_Nave_di_Rosano)
plt.title("Hydrometry_Nave_di_Rosano (after interpolation)")

The map shown earlier helps categorize the measurement points into two groups:

# Two groups of points :
**1) Points located near Arno**
* Stia
* Camaldoli
* Consuma
* Incisa
* Montevarchi
* Laterina
* Bibbiena
* *S_Savino*

**2) Points reaching the River Arno through the River Sieve**
* Vernio
* Mangona
* Cavallina
* S_Agata
* Le_Croci
* S_Piero

**remarks :**

* (The Sieve is a river in Italy. It is a tributary of the Arno.)
* (Lago di Bilancino is made with a dam on the river Sieve)

To check whether these two groups really show up in our dataset, we'll use principal component analysis (PCA).


In [None]:
from sklearn.preprocessing import StandardScaler

# Standardizing the features
# Transposing makes each column of arno_1 a row; rows 1:-5 keep only the 14 rainfall stations
arno_T = arno_1.T
x = StandardScaler().fit_transform(arno_T.iloc[1:-5,:])

In [None]:
from sklearn.decomposition import PCA

pca = PCA()
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data = principalComponents)

In [None]:
principalDf.head()

In [None]:
eigvals = pca.explained_variance_ratio_

In [None]:
eigvals/np.sum(eigvals) * 100

In [None]:
plt.figure(figsize = (8,8))
plt.bar(list(range(14)),eigvals/np.sum(eigvals) * 100)
plt.show()

In [None]:
# percentage of variance explained by the first two axes
(eigvals[0]+eigvals[1])/np.sum(eigvals) * 100 

In [None]:
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1) 
ax.set_xlabel('Principal Component 1', fontsize = 15)
ax.set_ylabel('Principal Component 2', fontsize = 15)
ax.set_title('2 component PCA', fontsize = 20)

features = ['Rainfall_Le_Croci','Rainfall_Cavallina','Rainfall_S_Agata','Rainfall_Mangona','Rainfall_S_Piero','Rainfall_Vernio','Rainfall_Stia','Rainfall_Consuma','Rainfall_Incisa','Rainfall_Montevarchi','Rainfall_S_Savino','Rainfall_Laterina','Rainfall_Bibbiena','Rainfall_Camaldoli']

ax.scatter(principalDf.loc[:, 0]
            , principalDf.loc[:, 1]
            , s = 50)

for i, txt in enumerate(features):
    ax.annotate(txt, (principalDf.loc[i, 0], principalDf.loc[i, 1]))

ax.grid()
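Running PCA on the transposed table treats each station as one observation and each day as one feature, so stations with similar rainfall histories end up close together in the plane of the first two components. The sketch below reproduces this idea on synthetic data: the two "basins", the six stations and the noise level are all invented assumptions, not measurements.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_days = 200

# Two invented "basins": stations in the same basin share a common rain signal.
basin_1 = rng.gamma(2.0, 1.0, n_days)
basin_2 = rng.gamma(2.0, 1.0, n_days)
stations = np.vstack(
    [basin_1 + 0.05 * rng.normal(size=n_days) for _ in range(3)]
    + [basin_2 + 0.05 * rng.normal(size=n_days) for _ in range(3)]
)  # shape (6 stations, 200 days): one row per station, like the transposed table

# Standardize each day across stations, then project the stations on two axes.
scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(stations))
print(scores.round(1))
```

In this toy setting the first component takes one sign for the first three stations and the opposite sign for the last three, which is the kind of separation we look for in the scatter plot.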

The results of the principal component analysis show that we can indeed split the measurement points into two groups.
Now, we'll try different models and evaluate their performance.

# Models

In [None]:
cols = ['Year','Date', 'Rainfall_Le_Croci', 'Rainfall_Cavallina', 'Rainfall_S_Agata', 'Rainfall_Mangona', 'Rainfall_S_Piero', 'Rainfall_Vernio', 'Rainfall_Stia', 'Rainfall_Consuma', 'Rainfall_Incisa', 'Rainfall_Montevarchi', 'Rainfall_S_Savino', 'Rainfall_Laterina', 'Rainfall_Bibbiena', 'Rainfall_Camaldoli', 'Temperature_Firenze', 'Day', 'Month', 'Hydrometry_Nave_di_Rosano']
arno_1 = arno_1[cols]
arno_1.head()

# GradientBoostingRegressor

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
# Train-Test split the data
arno_model = arno_1.iloc[:, 2:].values

X, y = arno_model[:, :-1],  arno_model[:, -1]
scaler = MinMaxScaler(feature_range=(0, 1))
X = scaler.fit_transform(X)


X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.82, shuffle = True)


X_train = np.asarray(X_train).astype('float32')
y_train = np.asarray(y_train).astype('float32')

# Gradient Boost Regressor

gbr = GradientBoostingRegressor(learning_rate=0.01,n_estimators=500)
gbr_model = gbr.fit(X_train,y_train)
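In the cell above, the scaler is fitted on all rows before the split, so the minimum and maximum of the test rows influence the scaling of the training rows. A possible variant, sketched here on invented random data with hypothetical names (`X_demo`, `scaler_demo`, ...), fits the scaler on the training part only and reuses those statistics for the test part. This is a suggestion to consider, not the procedure used for the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))   # invented features
y_demo = rng.normal(size=100)        # invented target

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.82, shuffle=False)

scaler_demo = MinMaxScaler(feature_range=(0, 1))
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # min/max learned from the training rows only
X_te_scaled = scaler_demo.transform(X_te)      # test rows reuse those statistics

print(X_tr_scaled.min(), X_tr_scaled.max())
```

With this variant the scaled test values may fall slightly outside [0, 1], which is expected.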

In [None]:
y_pred = gbr_model.predict(X_test)

In [None]:
plt.figure(figsize = (10,5))
plt.plot(y_test, 'r', label='True values')
plt.plot(y_pred, label= 'Predicted values')
plt.legend()
plt.show()

In [None]:
mae = mean_absolute_error(y_test, y_pred)
print('MAE: %.3f' % mae)

mse = mean_squared_error(y_test, y_pred)
print('MSE: %.3f' % mse)
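For readers less familiar with the two metrics, here is a small worked example on invented numbers that computes them by hand with NumPy. The MAE averages the absolute errors, while the MSE averages the squared errors and therefore penalizes large misses more heavily.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])   # invented true values
y_hat = np.array([1.5, 2.0, 2.0, 4.5])    # invented predictions

errors = y_true - y_hat                    # [-0.5, 0.0, 1.0, -0.5]
mae_demo = np.mean(np.abs(errors))         # (0.5 + 0 + 1 + 0.5) / 4 = 0.5
mse_demo = np.mean(errors ** 2)            # (0.25 + 0 + 1 + 0.25) / 4 = 0.375
print(f"MAE: {mae_demo:.3f}  MSE: {mse_demo:.3f}")
```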

# Neural Network

In [None]:
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from tensorflow.keras.optimizers import Adam

model = Sequential()

model.add(Dense(X_train.shape[1], activation='relu'))
model.add(Dense(32, activation='relu'))
# model.add(Dropout(0.2))

model.add(Dense(64, activation='relu'))
# model.add(Dropout(0.2))

model.add(Dense(128, activation='relu'))
# model.add(Dropout(0.2))

model.add(Dense(512, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(1))

model.compile(optimizer=Adam(0.001), loss='mse')

In [None]:
r = model.fit(X_train, y_train,batch_size=32,epochs=30)

In [None]:
y_pred = model.predict(X_test)

In [None]:
plt.figure(figsize = (10,5))
plt.plot(y_test, 'r', label='True values')
plt.plot(y_pred, label= 'Predicted values')
plt.legend()
plt.show()

In [None]:
for i in range(15):
    plt.figure(i)
    plt.plot(X_test[:,i])
    plt.title(i)

In [None]:
mae = mean_absolute_error(y_test, y_pred)
print('MAE: %.3f' % mae)

mse = mean_squared_error(y_test, y_pred)
print('MSE: %.3f' % mse)

# Models using results of PCA

Using the results of the PCA, we're going to build our models with fewer variables and then see which one performs better.

In [None]:
# Train-Test split the data

arno_model = arno_1.iloc[:, [6,7,10,16,17,18,19]].values

X, y = arno_model[:, :-1],  arno_model[:, -1]
scaler = MinMaxScaler(feature_range=(0, 1))
X = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.82, shuffle = False)

# Gradient Boosting Regressor (default hyperparameters)

gbr = GradientBoostingRegressor()
gbr_model = gbr.fit(X_train,y_train)

In [None]:
y_pred = gbr_model.predict(X_test)

In [None]:
plt.figure(figsize = (10,5))
plt.plot(y_test, 'r', label='True values')
plt.plot(y_pred, label= 'Predicted values')
plt.legend()
plt.show()

In [None]:
mae = mean_absolute_error(y_test, y_pred)
print('MAE: %.3f' % mae)

mse = mean_squared_error(y_test, y_pred)
print('MSE: %.3f' % mse)

In [None]:
# Using walk forward validation

arno_model = arno_1.iloc[:, 2:]

model = XGBRegressor(objective='reg:squarederror', n_estimators=500)

window_size = 30000
for i in range (400, arno_model.shape[0], window_size):
    end = i + window_size
    if end > arno_model.shape[0]:
        end = arno_model.shape[0]
    
    window = arno_model.iloc[0:end ,:]
    
    # getting train and test values from window
    data = window.values
    X, y = data[:, :-1], data[:, -1]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle = False)
    
    model.fit(X_train, y_train)
    prediction = model.predict(X_test)
    print("values for the window starting at row", i)
    mae = mean_absolute_error(y_test, prediction)
    print('MAE: %.3f' % mae)
    
    mse = mean_squared_error(y_test, prediction)
    print('MSE: %.3f' % mse)
    
    plt.figure(figsize = (10,5))
    plt.plot(y_test, 'r', label='True values')
    plt.plot(prediction, label= 'Predicted values')
    plt.legend()
    plt.show()
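Walk-forward validation can also be written with `TimeSeriesSplit`, which we imported at the top of the notebook: each fold trains on an expanding prefix of the series and tests on the block that immediately follows it. The sketch below runs on invented data and uses scikit-learn's `GradientBoostingRegressor` rather than `XGBRegressor` to keep the example dependency-free; the names `X_demo`, `y_demo` and `model_demo` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))                                  # invented features
y_demo = X_demo @ np.array([1.0, -0.5, 0.2, 0.0]) + 0.1 * rng.normal(size=300)

fold_mae = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X_demo):
    # Each fold trains on an expanding prefix and tests on the block right after it.
    model_demo = GradientBoostingRegressor(n_estimators=50, random_state=0)
    model_demo.fit(X_demo[train_idx], y_demo[train_idx])
    fold_pred = model_demo.predict(X_demo[test_idx])
    fold_mae.append(mean_absolute_error(y_demo[test_idx], fold_pred))

print([round(m, 3) for m in fold_mae])
```

Because the training indices always precede the test indices, no fold is evaluated on rows that come before its training data.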

# LSTM

In [None]:
#LSTM Dataset 

def LSTM_dataset(dataset,lookback):
    # Pairs the features of row i with the target (last column) observed `lookback` rows later.
    # Rows that have no future target are dropped instead of being left as zeros.
    dataset = np.asarray(dataset, dtype='float64')
    X = dataset[:-lookback,:]
    Y = dataset[lookback:,-1]
    Y = np.reshape(Y,(np.shape(Y)[0],1))
    return np.concatenate((X,Y),axis=1)
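The helper above returns a 2D table, which suits a lookback of one day. For longer lookbacks, an LSTM expects an input of shape [samples, timesteps, features]. The following is a minimal sketch of such a windowing step; `make_windows` is a hypothetical helper that is not used elsewhere in this notebook, and the toy array is invented.

```python
import numpy as np

def make_windows(data, lookback):
    """Return X of shape (samples, lookback, features) and y of shape (samples,).

    y[i] is the last column of the row that follows window i.
    """
    data = np.asarray(data, dtype="float64")
    n_samples = len(data) - lookback
    X = np.stack([data[i:i + lookback] for i in range(n_samples)])
    y = data[lookback:, -1]
    return X, y

toy = np.arange(12, dtype="float64").reshape(6, 2)  # 6 days, 2 columns (last = target)
X_w, y_w = make_windows(toy, lookback=3)
print(X_w.shape, y_w)
```

Here each sample contains three consecutive days, and its target is the last column of the following day.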

In [None]:

arno_model = arno_1.iloc[:, 2:].values
# taking data for day n and predicting hydrometry of n+1 th day, you may look at different lookbacks, 7, 15, 30..
lookback=1

arno_model = LSTM_dataset(arno_model,lookback)

scaler = MinMaxScaler(feature_range=(0, 1))

X, y = arno_model[:, :-1],  arno_model[:, -1]
X = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, shuffle = False)

# reshape into [samples, timesteps, features]
X_train =  X_train.reshape((X_train.shape[0],1,X_train.shape[1]))
X_test =  X_test.reshape((X_test.shape[0],1,X_test.shape[1]))

print(np.shape(X_train))
#X_test = np.reshape(X_test, (4, 106, X_test.shape[1]))
#y_train = np.reshape(y_train, (4, 106))
#print (X_train.shape)
#print(y_train.shape)

model = Sequential()
model.add(LSTM(32, input_shape= np.shape(X_train)[1:], name= "lstm"))
model.add(Dense(1, name= "output"))
model.compile(loss='mean_squared_error', optimizer='adam')

#print(model.summary())
model.fit(X_train, y_train,batch_size=32, epochs=15)
prediction = model.predict(X_test)


mae = mean_absolute_error(y_test, prediction)
print('MAE: %.3f' % mae)

mse = mean_squared_error(y_test, prediction)
print('MSE: %.3f' % mse)

plt.figure(figsize = (10,5))
plt.plot(y_test, 'r', label='True values')
plt.plot(prediction, label= 'Predicted values')
plt.legend()
plt.show()

* Now that we're done with the River Arno, let's move on to the second waterbody: Lake Bilancino.

The same approach is used for the lake.

# Lake Bilancino

In [None]:
# Importing the dataset
bilancino = pd.read_csv('../input/acea-water-prediction/Lake_Bilancino.csv')

In [None]:
bilancino.head()

In [None]:
print("Data shape :" ,bilancino.shape)

In [None]:
sns.heatmap(bilancino.isnull())

In [None]:
# dropping all rows that contain a null value
bilancino1 = bilancino.dropna(how='any',axis=0).copy()

In [None]:
X,y = bilancino1.iloc[:,1:-2].values, bilancino1.iloc[:,-2:].values

In [None]:
# Filling any remaining NaN values: mean for the features, median for the targets
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer_2 = SimpleImputer(missing_values = np.nan, strategy = 'median')

imputer = imputer.fit(X)
imputer_2 = imputer_2.fit(y)

X = imputer.transform(X)
y = imputer_2.transform(y)
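Since the rows containing null values were already dropped from `bilancino1`, the imputers above should have nothing left to fill; they act as a safety net. To show what `SimpleImputer` does when gaps are present, here is a small sketch on an invented array: each NaN is replaced by the mean of the observed values in its column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Invented 4 x 2 feature table with one gap per column.
X_gaps = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 60.0],
])

mean_imp = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = mean_imp.fit_transform(X_gaps)  # column means: (1+3+5)/3 = 3 and (10+20+60)/3 = 30
print(X_filled)
```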

In [None]:
print(np.shape(X))
print(np.shape(y))

# Features to predict

In [None]:
plt.figure(1,figsize = (10,5))
plt.plot(y[:,0])
plt.title("Lake Level")
plt.figure(2,figsize = (10,5))
plt.plot(y[:,1])
plt.title("Flow rate")
plt.figure(3,figsize = (10,5))
#Example of input Data 
plt.plot(X[:,0])
plt.title("Rainfall_S_Piero")