# Streamflow Modelling Using Machine Learning Based on Discharge and Precipitation Time series (Case Study: Santa Coloma de Gramenet Hydrometric Station)

In [None]:
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (20, 10)

In [None]:
import matplotlib.pyplot as plt
image1 = plt.imread("../input/bess-river/figu002.jpg")
plt.imshow(image1)

The objective of this work is to model the streamflow discharge at the outlet of the Besós river basin.

The data schema is:

- fecha: date of each daily record
- Gramenet: daily discharge at the "Santa Coloma de Gramenet" gauging station. Units are in m3/s
- Barcelona, Barcelona_fabra and Sabadell_aero: daily rainfall at the "Barcelona", "Barcelona Fabra" and "Sabadell Aeropuerto" rain stations. Units are in mm
- Garriga, Castellar, Llica, el_Mogent, Mogoda: daily upstream flow discharge at the "La Garriga", "Castellar Valles", "Lliça de Vall", "Montornes Valles", "Santa Perpetua de Mogoda" gauging stations. Units are in m3/s

In [None]:
import matplotlib.pyplot as plt
image2 = plt.imread("../input/location-map-of-the-bess-river-basin/besos.jpg")
plt.imshow(image2)

#### Objective

This project uses data from the rain stations ("Barcelona", "Barcelona Fabra" and "Sabadell Aeropuerto"), the upstream gauging stations ("La Garriga", "Castellar Valles", "Lliça de Vall", "Montornes Valles", "Santa Perpetua de Mogoda") and the historical flow data at the "Santa Coloma de Gramenet" gauging station to model the flow values at the latter station.

For this, the CRISP-DM methodology will be used.

In [None]:
image3 = plt.imread("../input/crispdm/CRISP-DM_Process_Diagram.png")
plt.imshow(image3)

#### CRISP-DM Phase 1: Business Understanding

The objective of this problem is to model the daily flow in the "Santa Coloma de Gramenet" station from the rain stations, the upstream gauging stations and the historical flow records at this station.

##### Type of Machine Learning Problem:

This is a supervised regression problem: an algorithm is trained on historical records to predict a continuous value.

The target variable is the daily flow at the "Santa Coloma de Gramenet" gauging station.

The independent variables are the rainfall at the three available rain stations, the daily flows at the upstream gauging stations, and the historical flow at the target station.

The algorithms used to solve this problem are Multiple Linear Regression, Gradient Boosting Regression, Support Vector Regression and Random Forest Regression.

#### CRISP-DM Phase 2: Data Understanding

Let's load the libraries and metrics we need

In [None]:
!pip install hydroeval

In [None]:
import numpy as np
import pandas as pd
import os
import re
import seaborn as sns
import matplotlib.pyplot as plt
import json
%matplotlib inline
import hydroeval as he
from statsmodels.graphics import tsaplots
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from scipy import stats

Let's import the dataset

In [None]:
df = pd.read_csv("../input/rainfall-and-runoff-data-for-the-bess-river-basin/all_stations.csv", index_col='fecha')
df

Once loaded, we can plot the dataset using the Matplotlib library

In [None]:
plt.rcParams['figure.figsize'] = (20, 10)

In [None]:
# Plot each series in its own panel, labelling rainfall and discharge separately
rain_columns = ['Barcelona', 'Barcelona_fabra', 'Sabadell_aero']
plt.figure(figsize=(20, 60))
for i, column in enumerate(df.columns, start=1):
    plt.subplot(len(df.columns), 1, i)
    plt.plot(df[column].values)
    plt.xlabel('day', fontsize=15)
    plt.ylabel('Rainfall (mm)' if column in rain_columns else 'Discharge (m3/s)', fontsize=15)
    plt.title(column, y=0.5, loc='right')
    plt.tick_params(labelsize=15)
    plt.grid()
plt.show()

In [None]:
df[['Sabadell_aero', 'Gramenet']].plot()

Some descriptive analyses

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.isnull().sum()

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=1)
sns.heatmap(df.T.isna(), cmap='Blues')
ax.set_title('Missing values', fontsize=25)

# Set the tick label sizes through the axes API
ax.tick_params(axis='y', labelsize=20)
ax.tick_params(axis='x', labelsize=13)

plt.show()

In [None]:
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
f, ax = plt.subplots()
plt.xticks(rotation=90)
sns.barplot(x=missing_data.index, y=missing_data['Percent'])
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
missing_data.head()

##### Conclusions

- From 06-06-2008 onwards, the el_Mogent gauging station has many null values
- The longest common period runs from 01-01-2003 to 06-05-2008 (56.18%)
- Within this period, the Castellar gauging station has many null values
- Also within this period, every station except Llica, el_Mogent, Mogoda and Gramenet has some null values

It has been decided to:
- Delete the records from 06-06-2008 onwards
- Remove the Castellar gauging station
- Drop the rows that still contain null values in the other stations using .dropna()

#### CRISP-DM Phase 3: Data Preparation

In [None]:
#Selecting the common period
df = df.loc['2003-01-01':'2008-06-05']
df

In [None]:
#Deleting the "Castellar" gauging station
df = df.drop(columns='Castellar')
df

In [None]:
#Number of days with a flow rate of 0 m3/s at the target gauging station "Santa Coloma de Gramenet"
len(df.loc[df['Gramenet'] == 0])

In [None]:
# calculate dataset mean and standard deviation
mean = df.mean()
std = df.std()
# normalise dataset with previously calculated values
df_std = (df - mean) / std
# create violin plot
df_std = df_std.melt(var_name='Column', value_name='Normalised')
plt.figure(figsize=(12, 6))
ax = sns.violinplot(x='Column', y='Normalised', data=df_std)
_ = ax.set_xticklabels(df.keys(), rotation=90)

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
#Inspect the missing values again
fig, ax = plt.subplots(nrows=1, ncols=1)
sns.heatmap(df.T.isna(), cmap='Blues')
ax.set_title('Fig 1 - Missing Values', fontsize=18)

# Set the tick label size through the axes API
ax.tick_params(axis='y', labelsize=14)

plt.show()

In [None]:
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
f, ax = plt.subplots()
plt.xticks(rotation=90)
sns.barplot(x=missing_data.index, y=missing_data['Percent'])
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
missing_data.head()

In [None]:
#Correlation heatmap
fig, ax = plt.subplots() 
ax.set_title('Fig 2 - Correlation Matrix', fontsize=14)
ax = sns.heatmap(df.corr(), vmin=-1, vmax=1, center=0, cmap='Blues', square=True)

In [None]:
#Dataframe correlation
df.corr()

In [None]:
#Convert index to datetime
df.index = pd.to_datetime(df.index)

In [None]:
#Deleting remaining missing values
df = df.dropna()

In [None]:
df.info()

In [None]:
#Statistical results description
stats.describe(df)

In [None]:
#Lag creation
def lag_creation(df, lag_start, lag_end, columns, inplace=False, freq=1):
    if not inplace:
        df = df.copy()
    for col in columns:
        for i in range(lag_start, lag_end, freq):
            df["lag_"+str(i)+"_"+col] = df[col].shift(i)
    if not inplace:
        return df

#Encoding the cyclical properties of time
def date_features(df, inplace=False):
    if not inplace:
        df = df.copy()
    df.index = pd.to_datetime(df.index)
    df['day_sin'] = np.sin(df.index.dayofweek*(2.*np.pi/7))
    df['day_cos'] = np.cos(df.index.dayofweek*(2.*np.pi/7))
    # Encode the month (1-12), not the day of the week, on a 12-step cycle
    df['month_sin'] = np.sin(df.index.month*(2.*np.pi/12))
    df['month_cos'] = np.cos(df.index.month*(2.*np.pi/12))
    if not inplace:
        return df

#Data normalization
from sklearn.preprocessing import MinMaxScaler
def Normalize_columns(df, columns, inplace=False):
    if not inplace:
        df = df.copy()
    sc = MinMaxScaler()
    df[columns] = sc.fit_transform(df[columns])
    if not inplace:
        return df
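
To see what the lag and cyclical-encoding helpers produce, the sketch below rebuilds both steps on a small synthetic daily series. The frame `toy` and its column `flow` are made up for illustration, and the month is encoded from `index.month`, which is an assumption about the intended behaviour rather than a copy of the helper above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: 10 consecutive daily values
toy_index = pd.date_range("2003-01-01", periods=10, freq="D")
toy = pd.DataFrame({"flow": np.arange(10, dtype=float)}, index=toy_index)

# Lag features: shift(k) places the value observed k days earlier on each row
for k in (1, 2):
    toy[f"lag_{k}_flow"] = toy["flow"].shift(k)

# Cyclical encoding: map a periodic quantity onto the unit circle so that
# the last and first values of the cycle end up close to each other
toy["day_sin"] = np.sin(toy.index.dayofweek * (2.0 * np.pi / 7))
toy["day_cos"] = np.cos(toy.index.dayofweek * (2.0 * np.pi / 7))
toy["month_sin"] = np.sin(toy.index.month * (2.0 * np.pi / 12))
toy["month_cos"] = np.cos(toy.index.month * (2.0 * np.pi / 12))

# The first rows have no history, so their lags are NaN and are dropped
toy = toy.dropna()
print(toy.head())
```

Each sine/cosine pair lies on the unit circle, which is why both columns are needed to identify a position in the cycle unambiguously.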

In [None]:
#Cross-correlation
def crosscorr(datax, datay, lag=0, wrap=False):
    """ Lag-N cross correlation. 
    Shifted data filled with NaNs 
    
    Parameters
    ----------
    lag : int, default 0
    datax, datay : pandas.Series objects of equal length
    Returns
    ----------
    crosscorr : float
    """
    if wrap:
        shiftedy = datay.shift(lag)
        shiftedy.iloc[:lag] = datay.iloc[-lag:].values
        return datax.corr(shiftedy)
    else: 
        return datax.corr(datay.shift(lag))
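
As a quick sanity check of the lagged correlation idea, the following sketch applies the same `corr(shift(lag))` logic to synthetic data in which the response is known to follow the driver by exactly two steps. The function name `lagged_corr` and the series `rain_demo` and `flow_demo` are hypothetical; only the non-wrapping branch of the function above is mirrored.

```python
import numpy as np
import pandas as pd

def lagged_corr(datax, datay, lag=0):
    """Pearson correlation between datax and datay shifted forward by `lag` steps."""
    return datax.corr(datay.shift(lag))

# Synthetic example: the "flow" reproduces the "rain" signal two days later
rng = np.random.default_rng(0)
rain_demo = pd.Series(rng.random(200))
flow_demo = rain_demo.shift(2).fillna(0.0)

# Evaluate lags 0..5 and pick the one with the highest correlation
corrs = [lagged_corr(flow_demo, rain_demo, lag) for lag in range(0, 6)]
best_lag = int(np.argmax(corrs))
print(best_lag, corrs[best_lag])
```

On real rainfall and discharge data the peak is usually much less sharp, so the plots that follow are read as a guide to which lags are worth including rather than as an exact travel time.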

In [None]:
df.head(2000)

Analyze which lags are most strongly correlated with the flow

In [None]:
plt.plot(np.arange(0, 6), [crosscorr(df['Gramenet'], df['Barcelona'], lag) for lag in range(0, 6)])
plt.title('Q_Gramenet - P_Barcelona cross-correlation', fontsize=15)
plt.xlabel('Lags',fontsize=15)
plt.ylabel('Correlation coefficient',fontsize=15)
plt.tick_params(labelsize=15)
plt.grid()
plt.ioff()
plt.show();

In [None]:
plt.plot(np.arange(0, 6), [crosscorr(df['Gramenet'], df['Barcelona_fabra'],lag) for lag in range(0, 6)])
plt.title('Q_Gramenet - P_Barcelona Fabra cross-correlation', fontsize=15)
plt.xlabel('Lags',fontsize=15)
plt.ylabel('Correlation coefficient',fontsize=15)
plt.tick_params(labelsize=15)
plt.grid()
plt.ioff()
plt.show();

In [None]:
plt.plot(np.arange(0, 6), [crosscorr(df['Gramenet'], df['Sabadell_aero'], lag) for lag in range(0, 6)])
plt.title('Q_Gramenet - P_Sabadell_aero cross-correlation ', fontsize = 15)
plt.xlabel('Lags',fontsize=15)
plt.ylabel('Correlation coefficient',fontsize=15)
plt.tick_params(labelsize=15)
plt.grid()
plt.ioff()
plt.show();

In [None]:
plt.plot(np.arange(0, 6), [crosscorr(df['Gramenet'], df['Mogoda'], lag) for lag in range(0, 6)])
plt.title('Q_Gramenet - Q_Mogoda cross-correlation', fontsize = 15)
plt.xlabel('Lags',fontsize=15)
plt.ylabel('Correlation coefficient',fontsize=15)
plt.tick_params(labelsize=15)
plt.grid()
plt.ioff()
plt.show();

In [None]:
plt.plot(np.arange(0, 6), [crosscorr(df['Gramenet'], df['el_Mogent'], lag) for lag in range(0, 6)])
plt.title(' Q_Gramenet - Q_el Mogent cross-correlation', fontsize=15)
plt.xlabel('Lags',fontsize=15)
plt.ylabel('Correlation coefficient',fontsize=15)
plt.tick_params(labelsize=15)
plt.grid()
plt.ioff()
plt.show();

In [None]:
plt.plot(np.arange(0, 6), [crosscorr(df['Gramenet'], df['Garriga'], lag) for lag in range(0, 6)])
plt.title('Q_Gramenet - Q_Garriga cross-correlation', fontsize= 15)
plt.xlabel('Lags',fontsize=15)
plt.ylabel('Correlation coefficient',fontsize=15)
plt.tick_params(labelsize=15)
plt.grid()
plt.ioff()
plt.show();

In [None]:
plt.plot(np.arange(0, 6), [crosscorr(df['Gramenet'], df['Llica'], lag) for lag in range(0, 6)])
plt.title('Q_Gramenet - Q_Llica cross-correlation', fontsize=15)
plt.xlabel('Lags',fontsize=15)
plt.ylabel('Correlation coefficient',fontsize=15)
plt.tick_params(labelsize=15)
plt.grid()
plt.ioff()
plt.show();

In [None]:
# Display the autocorrelation plot of the 'Gramenet' flow time series
fig = tsaplots.plot_acf(df['Gramenet'], lags=6)
plt.title('Autocorrelation plot', fontsize= 15)
plt.xlabel('Lags',fontsize=15)
plt.ylabel('Correlation coefficient',fontsize=15)
plt.tick_params(labelsize=15)
plt.grid()
plt.ioff()
plt.show()

### Selecting data features 

First, let's model the flow discharge at the target station "Santa Coloma de Gramenet" without using the historical flow at this gauging station, and with a single lag for each of the other stations.

In [None]:
df['y'] = df['Gramenet']
freq=1

Normalize_columns(df, ['Barcelona', 'Barcelona_fabra', 'Sabadell_aero', 'Garriga', 'Llica', 'el_Mogent', 'Mogoda', 'Gramenet'], inplace=True)
lag_creation(df, 1, 2, ['Barcelona'], inplace=True, freq=freq)
lag_creation(df, 1, 2, ['Barcelona_fabra'], inplace=True, freq=freq)
lag_creation(df, 1, 2, ['Sabadell_aero'], inplace=True, freq=freq)
lag_creation(df, 1, 2, ['Garriga'], inplace=True, freq=freq)
lag_creation(df, 1, 2, ['Llica'], inplace=True, freq=freq)
lag_creation(df, 1, 2, ['el_Mogent'], inplace=True, freq=freq)
lag_creation(df, 1, 2, ['Mogoda'], inplace=True, freq=freq)
date_features(df, inplace=True)
df.dropna(inplace=True)
del df['Gramenet']

In [None]:
#data features
df.columns

In [None]:
df.head()

Split the dataset into training and test datasets

In [None]:
X = df.loc[:, df.columns!='y']
y = df['y']
train = df.iloc[:int(len(df)*0.80)]
test = df.iloc[int(len(df)*0.80):]
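
The split above is chronological: the first 80% of the rows are used for training and the last 20% for testing, with no shuffling. A minimal sketch of the same idea on a made-up frame (`demo`, `feature` and the `demo_*` names are hypothetical) is:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 100 daily rows with one feature and a target column 'y'
demo_index = pd.date_range("2003-01-01", periods=100, freq="D")
demo = pd.DataFrame(
    {"feature": np.arange(100.0), "y": 2.0 * np.arange(100.0)},
    index=demo_index,
)

# Chronological 80/20 split: every test row comes after every training row
cut = int(len(demo) * 0.80)
demo_train, demo_test = demo.iloc[:cut], demo.iloc[cut:]

# Separate the features from the target and convert to NumPy arrays
X_demo_train = demo_train.drop(columns="y").values
y_demo_train = demo_train["y"].values
X_demo_test = demo_test.drop(columns="y").values
y_demo_test = demo_test["y"].values

print(len(demo_train), len(demo_test), demo_test.index.min())
```

Keeping the order intact matters for lagged features: a random split would let the model see days that immediately surround each test day.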

In [None]:
X_train = train.dropna().drop('y', axis=1)
y_train = train['y']

X_test = test.dropna().drop('y', axis=1)
y_test = test['y']
test = test.dropna().drop('y', axis=1)

X_train = X_train.values
y_train = y_train.values

X_test = X_test.values
y_test = y_test.values

#### CRISP-DM Phases 4 & 5: Modeling and Evaluation

In [None]:
#Load the Regression Algorithms
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

### Multiple Linear Regression

In [None]:
# Training the MLR
reg = LinearRegression()
reg.fit(X_train, y_train)

Training period

In [None]:
y_train = reg.predict(X_train)

In [None]:
mse =  mean_squared_error(df.iloc[:int(len(df)*0.80)].y, y_train)
print('MSE =', mse)

In [None]:
mae = mean_absolute_error(df.iloc[:int(len(df)*0.80)].y, y_train)
print("MAE =", mae)

In [None]:
r2 = r2_score(df.iloc[:int(len(df)*0.80)].y, y_train)
print("R2 =", r2)

In [None]:
# hydroeval expects the simulations first and the observations second
ce = he.evaluator(he.nse, y_train, df.iloc[:int(len(df)*0.80)].y)
print("CE =", ce)
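
The CE value printed above is the Nash-Sutcliffe efficiency: one minus the ratio of the squared simulation errors to the squared deviations of the observations from their mean. The sketch below writes that standard definition out with NumPy, assuming `he.nse` follows it; the function name `nse_sketch` and the sample arrays are hypothetical.

```python
import numpy as np

def nse_sketch(simulated, observed):
    """Nash-Sutcliffe efficiency: 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    simulated = np.asarray(simulated, dtype=float)
    observed = np.asarray(observed, dtype=float)
    residual = np.sum((observed - simulated) ** 2)
    spread = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - residual / spread

# Made-up observations for illustration
obs_demo = np.array([1.0, 2.0, 3.0, 4.0])

# A perfect simulation scores 1; always predicting the observed mean scores 0
perfect_score = nse_sketch(obs_demo, obs_demo)
mean_score = nse_sketch(np.full_like(obs_demo, obs_demo.mean()), obs_demo)
print(perfect_score, mean_score)
```

Because the denominator uses only the observations, the metric is not symmetric in its two arguments, so the order in which simulated and observed values are passed matters. With the arguments in the right order it coincides with the R2 returned by `r2_score`.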

Test period

In [None]:
y = reg.predict(X_test)

In [None]:
test['y'] = y

In [None]:
mse= mean_squared_error(df.iloc[int(len(df)*0.80):].y, test.y)
print("MSE =", mse)

In [None]:
mae = mean_absolute_error(df.iloc[int(len(df)*0.80):].y, test.y)
print("MAE =", mae)

In [None]:
r2= r2_score(df.iloc[int(len(df)*0.80):].y, test.y)
print('R2 =',r2)

In [None]:
# hydroeval expects the simulations first and the observations second
ce = he.evaluator(he.nse, test.y, df.iloc[int(len(df)*0.80):].y)
print('CE =',ce)

In [None]:
# Hydrograph plot for both training and test periods
plt.scatter(df.iloc[int(len(df)*0.80):].index, df.iloc[int(len(df)*0.80):].y, color ='b', label= "observed")
plt.scatter(df.iloc[:int(len(df)*0.80)].index, df.iloc[:int(len(df)*0.80)].y, color ='b')
plt.plot(df.iloc[int(len(df)*0.80):].index, test['y'], 'orange', label="simulated")
plt.plot(df.iloc[:int(len(df)*0.80)].index, y_train, 'orange')
plt.axvline(df.index[int(len(df)*0.80)], linestyle='--')  # first day of the test period
plt.figtext(0.75, 0.7, "Testing period", fontsize = 20)
plt.figtext(0.35, 0.7, "Training period", fontsize = 20)
plt.title("Observed and simulated streamflow discharge", fontsize=15)
plt.xlabel('Year',fontsize=15)
plt.ylabel('Streamflow discharge (m3/s)',fontsize=15)
plt.tick_params(labelsize=15)
plt.grid()
plt.legend(fontsize="x-large")
plt.show()

In [None]:
# MLR intercept
Intercept=reg.intercept_
Intercept

In [None]:
# MLR coefficients
Coefficients=reg.coef_
Coefficients

In [None]:
# MLR coefficients plot
plt.bar([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18], # train.dropna().drop('y', axis=1).columns
        
        [10.69955338, -2.10885021, -4.17639059, 18.03800362,  7.05231142,
       30.78924603, 13.35561054,  3.66337905,  8.8844766 ,  9.32797234,
        5.88963554, -2.7686622 ,  4.1474829 , -1.5876765 , -0.22284191,
        0.08073451,  0.76746745,  0.21916047]) # reg.coef_

### Gradient Boosting Regressor

In [None]:
#GBR hyperparameters
parameters = {'n_estimators'     : [430,450,470,490],
              'max_features'     : [0.02,0.03,0.04,0.05],
              'learning_rate'    : [0.02,0.04,0.06,0.08,0.1],
              'subsample'        : [0.2,0.3,0.4,0.5],              
              'max_depth'        : [5,6,7,8],
              }
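
Before launching an exhaustive grid search it helps to know how many candidate models the grid defines. The sketch below counts them with scikit-learn's `ParameterGrid`, using a copy of the grid above under the hypothetical name `grid_example` and assuming 5-fold cross-validation.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the GBR grid, kept under a separate (hypothetical) name
grid_example = {
    'n_estimators': [430, 450, 470, 490],
    'max_features': [0.02, 0.03, 0.04, 0.05],
    'learning_rate': [0.02, 0.04, 0.06, 0.08, 0.1],
    'subsample': [0.2, 0.3, 0.4, 0.5],
    'max_depth': [5, 6, 7, 8],
}

# Number of hyperparameter combinations: 4 * 4 * 5 * 4 * 4
n_candidates = len(ParameterGrid(grid_example))

# Each candidate is fitted once per fold (assuming cv=5)
n_fits = n_candidates * 5
print(n_candidates, n_fits)
```

A count of this size explains why the search is timed, and why a randomized search over the same ranges can be a reasonable alternative when the run time becomes a constraint.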

In [None]:
#Time counter function
from datetime import datetime

def timer(start_time=None):
    """Return the start time on the first call; print the elapsed time when given a start time."""
    if not start_time:
        return datetime.now()
    thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
    tmin, tsec = divmod(temp_sec, 60)
    print('\n Time taken: %i hours %i minutes and %s seconds.' % (thour, tmin, round(tsec, 2)))

In [None]:
gbr = GradientBoostingRegressor()

In [None]:
#Grid search hyperparameters tuning
rndm_GBR = GridSearchCV(estimator=gbr, param_grid = parameters, cv = 5, n_jobs=2, verbose=3)
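
With `cv=5` and a regressor, `GridSearchCV` uses an unshuffled 5-fold split, so some folds are validated on days that precede part of their training data. One possible alternative for time-ordered data, shown here only as a sketch on synthetic arrays (`X_ts`, `y_ts` and the small grid are made up, and this is not the procedure used in this notebook), is to pass a `TimeSeriesSplit` object instead:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic, time-ordered data for illustration only
rng = np.random.default_rng(0)
X_ts = rng.random((120, 3))
y_ts = X_ts @ np.array([1.0, 2.0, 3.0]) + 0.01 * rng.standard_normal(120)

# Each split trains on an initial segment and validates on the block that follows it
tscv = TimeSeriesSplit(n_splits=5)
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [20, 40], "max_depth": [2, 3]},
    cv=tscv,
)
search.fit(X_ts, y_ts)

# Check that validation rows always come after the training rows
folds_ordered = all(tr.max() < va.min() for tr, va in tscv.split(X_ts))
print(search.best_params_, folds_ordered)
```

The trade-off is that the early folds train on fewer samples, so the scores are noisier than with a plain k-fold split.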

In [None]:
from datetime import datetime

start_time = timer(None)
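# Fit on the observed targets: y_train was overwritten with predictions in an earlier cell
rndm_GBR.fit(X_train, train['y'].values)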
timer(start_time)

In [None]:
#GBR optimal hyperparameters
rndm_GBR.best_params_

In [None]:
#Training the GBR with the optimal hyperparameters
gbr = GradientBoostingRegressor(learning_rate=0.04, max_depth= 5, max_features= 0.02, n_estimators= 430, subsample= 0.2)
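# Fit on the observed targets: y_train was overwritten with predictions in an earlier cell
gbr.fit(X_train, train['y'].values)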

Training period

In [None]:
y_train = gbr.predict(X_train)

In [None]:
mse =  mean_squared_error(df.iloc[:int(len(df)*0.80)].y, y_train)
print('MSE =', mse)

In [None]:
mae = mean_absolute_error(df.iloc[:int(len(df)*0.80)].y, y_train)
print("MAE =", mae)

In [None]:
r2 = r2_score(df.iloc[:int(len(df)*0.80)].y, y_train)
print("R2 =", r2)

In [None]:
# hydroeval expects the simulations first and the observations second
ce = he.evaluator(he.nse, y_train, df.iloc[:int(len(df)*0.80)].y)
print("CE =", ce)

Testing period

In [None]:
y = gbr.predict(X_test)

In [None]:
test['y'] = y

In [None]:
mse = mean_squared_error(df.iloc[int(len(df)*0.80):].y, test.y)
print('MSE =', mse)

In [None]:
mae = mean_absolute_error(df.iloc[int(len(df)*0.80):].y, test.y)
print('MAE =', mae)

In [None]:
r2 = r2_score(df.iloc[int(len(df)*0.80):].y, y)
print('R2 = ', r2)

In [None]:
#CE of Nash
# hydroeval expects the simulations first and the observations second
ce = he.evaluator(he.nse, test.y, df.iloc[int(len(df)*0.80):].y)
print('CE =', ce)

In [None]:
# Hydrograph plot for both training and test periods
plt.scatter(df.iloc[int(len(df)*0.80):].index, df.iloc[int(len(df)*0.80):].y, color ='b', label= "observed")
plt.scatter(df.iloc[:int(len(df)*0.80)].index, df.iloc[:int(len(df)*0.80)].y, color ='b')
plt.plot(df.iloc[int(len(df)*0.80):].index, test['y'], 'orange', label="simulated")
plt.plot(df.iloc[:int(len(df)*0.80)].index, y_train, 'orange')
plt.axvline(df.index[int(len(df)*0.80)], linestyle='--')  # first day of the test period
plt.figtext(0.75, 0.7, "Testing period", fontsize = 20)
plt.figtext(0.35, 0.7, "Training period", fontsize = 20)
plt.title("Observed and simulated streamflow discharge", fontsize=15)
plt.xlabel('Year',fontsize=15)
plt.ylabel('Streamflow discharge (m3/s)',fontsize=15)
plt.tick_params(labelsize=15)
plt.grid()
plt.legend(fontsize="x-large")
plt.show()

### Support Vector Regression

In [None]:
#SVR hyperparameters
param={
  "C"        :  [0.1,1.0,10,100,1000],
 "epsilon"   :  [0.00001,0.0001,0.001,0.01,0.1,1],
 "gamma"     :  [0.0001,0.001,0.01,0.1,1],
}

In [None]:
#Grid search hyperparameters tuning
modelsvr = SVR(kernel='rbf')
grids = GridSearchCV(modelsvr,param,cv=5,n_jobs = 2, verbose = 3)

In [None]:
from datetime import datetime

start_time = timer(None)
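# Fit on the observed targets: y_train was overwritten with predictions in an earlier cell
grids.fit(X_train, train['y'].values)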
timer(start_time)

In [None]:
#The optimal SVR hyperparameters
grids.best_params_

In [None]:
#Training the SVR with the optimal hyperparameters
modelsvr = SVR(kernel='rbf', C=100, epsilon=0.0001, gamma=0.1)
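# Fit on the observed targets: y_train was overwritten with predictions in an earlier cell
modelsvr.fit(X_train, train['y'].values)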

Training period

In [None]:
y_train = modelsvr.predict(X_train)

In [None]:
mse =  mean_squared_error(df.iloc[:int(len(df)*0.80)].y, y_train)
print('MSE =', mse)

In [None]:
mae = mean_absolute_error(df.iloc[:int(len(df)*0.80)].y, y_train)
print("MAE =", mae)

In [None]:
r2 = r2_score(df.iloc[:int(len(df)*0.80)].y, y_train)
print("R2 =", r2)

In [None]:
# hydroeval expects the simulations first and the observations second
ce = he.evaluator(he.nse, y_train, df.iloc[:int(len(df)*0.80)].y)
print("CE =", ce)

Test period

In [None]:
y = modelsvr.predict(X_test)

In [None]:
test['y'] = y

In [None]:
mse= mean_squared_error(df.iloc[int(len(df)*0.80):].y, test.y)
print("MSE =",mse)

In [None]:
mae= mean_absolute_error(df.iloc[int(len(df)*0.80):].y, test.y)
print('MAE =', mae)

In [None]:
r2= r2_score(df.iloc[int(len(df)*0.80):].y, y)
print('R2 =', r2)

In [None]:
# hydroeval expects the simulations first and the observations second
ce = he.evaluator(he.nse, test.y, df.iloc[int(len(df)*0.80):].y)
print('CE =',ce)

In [None]:
# Hydrograph plot for both training and test periods
plt.scatter(df.iloc[int(len(df)*0.80):].index, df.iloc[int(len(df)*0.80):].y, color ='b', label= "observed")
plt.scatter(df.iloc[:int(len(df)*0.80)].index, df.iloc[:int(len(df)*0.80)].y, color ='b')
plt.plot(df.iloc[int(len(df)*0.80):].index, test['y'], 'orange', label="simulated")
plt.plot(df.iloc[:int(len(df)*0.80)].index, y_train, 'orange')
plt.axvline(df.index[int(len(df)*0.80)], linestyle='--')  # first day of the test period
plt.figtext(0.75, 0.7, "Testing period", fontsize = 20)
plt.figtext(0.35, 0.7, "Training period", fontsize = 20)
plt.title("Observed and simulated streamflow discharge", fontsize=15)
plt.xlabel('Year',fontsize=15)
plt.ylabel('Streamflow discharge (m3/s)',fontsize=15)
plt.tick_params(labelsize=15)
plt.grid()
plt.legend(fontsize="x-large")
plt.show()

### Random Forest Regression

In [None]:
#RFR hyperparameters
param = {'n_estimators' : [300,330,350,380,400,430,450,480,500],
         'max_features' : [1.0, "sqrt", "log2"],  # 1.0 (all features) stands in for the removed "auto" option
        }

In [None]:
#Grid search hyperparameters tuning
model = RandomForestRegressor(random_state = 42)
grids = GridSearchCV(model,param,cv=5,n_jobs = 2, verbose = 3)

In [None]:
from datetime import datetime

start_time = timer(None)
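# Fit on the observed targets: y_train was overwritten with predictions in an earlier cell
grids.fit(X_train, train['y'].values)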
timer(start_time)

In [None]:
#The best RF hyperparameters
grids.best_params_

In [None]:
rf = RandomForestRegressor(n_estimators = 400, max_features= 'sqrt', random_state = 42)
# Train the model on training data
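# Fit on the observed targets: y_train was overwritten with predictions in an earlier cell
rf.fit(X_train, train['y'].values);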

Training period

In [None]:
y_train = rf.predict(X_train)

In [None]:
mse =  mean_squared_error(df.iloc[:int(len(df)*0.80)].y, y_train)
print('MSE =',mse)

In [None]:
mae = mean_absolute_error(df.iloc[:int(len(df)*0.80)].y, y_train)
print("MAE =", mae)

In [None]:
r2 = r2_score(df.iloc[:int(len(df)*0.80)].y, y_train)
print("R2 =", r2)

In [None]:
# hydroeval expects the simulations first and the observations second
ce = he.evaluator(he.nse, y_train, df.iloc[:int(len(df)*0.80)].y)
print("CE =", ce)

Test period

In [None]:
y = rf.predict(X_test)

In [None]:
test['y']=y

In [None]:
mse=mean_squared_error(df.iloc[int(len(df)*0.80):].y, test.y)
print('MSE =',mse)

In [None]:
mae = mean_absolute_error(df.iloc[int(len(df)*0.80):].y, test.y)
print('MAE =', mae)

In [None]:
r2= r2_score(df.iloc[int(len(df)*0.80):].y, y)
print('R2 =', r2)

In [None]:
# hydroeval expects the simulations first and the observations second
ce = he.evaluator(he.nse, test.y, df.iloc[int(len(df)*0.80):].y)
print('CE =', ce)

In [None]:
# Hydrograph plot for both training and test periods
plt.scatter(df.iloc[int(len(df)*0.80):].index, df.iloc[int(len(df)*0.80):].y, color ='b', label= "observed")
plt.scatter(df.iloc[:int(len(df)*0.80)].index, df.iloc[:int(len(df)*0.80)].y, color ='b')
plt.plot(df.iloc[int(len(df)*0.80):].index, test['y'], 'orange', label="simulated")
plt.plot(df.iloc[:int(len(df)*0.80)].index, y_train, 'orange')
plt.axvline(df.index[int(len(df)*0.80)], linestyle='--')  # first day of the test period
plt.figtext(0.75, 0.7, "Testing period", fontsize = 20)
plt.figtext(0.35, 0.7, "Training period", fontsize = 20)
plt.title("Observed and simulated streamflow discharge", fontsize=15)
plt.xlabel('Year',fontsize=15)
plt.ylabel('Streamflow discharge (m3/s)',fontsize=15)
plt.tick_params(labelsize=15)
plt.grid()
plt.legend(fontsize="x-large")
plt.show()

Now, let's model the flow discharge by also including two lags of the historical flow at the target gauging station "Santa Coloma de Gramenet".

In [None]:
# Rebuild the prepared dataset: the first experiment deleted 'Gramenet' from df
df = pd.read_csv("../input/rainfall-and-runoff-data-for-the-bess-river-basin/all_stations.csv", index_col='fecha')
df = df.loc['2003-01-01':'2008-06-05'].drop(columns='Castellar')
df.index = pd.to_datetime(df.index)
df = df.dropna()

df['y'] = df['Gramenet']
freq=1

Normalize_columns(df, ['Barcelona', 'Barcelona_fabra', 'Sabadell_aero', 'Garriga', 'Llica', 'el_Mogent', 'Mogoda', 'Gramenet'], inplace=True)
lag_creation(df, 1, 2, ['Barcelona'], inplace=True, freq=freq)
lag_creation(df, 1, 2, ['Barcelona_fabra'], inplace=True, freq=freq)
lag_creation(df, 1, 2, ['Sabadell_aero'], inplace=True, freq=freq)
lag_creation(df, 1, 2, ['Garriga'], inplace=True, freq=freq)
lag_creation(df, 1, 2, ['Llica'], inplace=True, freq=freq)
lag_creation(df, 1, 2, ['el_Mogent'], inplace=True, freq=freq)
lag_creation(df, 1, 2, ['Mogoda'], inplace=True, freq=freq)
# Two lags (1 and 2 days) of the target station's own flow
lag_creation(df, 1, 3, ['Gramenet'], inplace=True, freq=freq)
date_features(df, inplace=True)
df.dropna(inplace=True)
del df['Gramenet']

In [None]:
df.columns

In [None]:
df.head(1405)

We split the dataset into training and testing datasets

In [None]:
X = df.loc[:, df.columns!='y']
y = df['y']
train = df.iloc[:int(len(df)*0.80)]
test = df.iloc[int(len(df)*0.80):]

In [None]:
X_train = train.dropna().drop('y', axis=1)
y_train = train['y']

X_test = test.dropna().drop('y', axis=1)
y_test = test['y']
test = test.dropna().drop('y', axis=1)

X_train = X_train.values
y_train = y_train.values

X_test = X_test.values
y_test = y_test.values

Modelling and Evaluation

### Multiple Linear Regression

In [None]:
reg = LinearRegression()
reg.fit(X_train, y_train)

Training period

In [None]:
y_train = reg.predict(X_train)

In [None]:
mse = mean_squared_error(df.iloc[:int(len(df)*0.80)].y, y_train)
print('MSE =',mse)

In [None]:
mae= mean_absolute_error(df.iloc[:int(len(df)*0.80)].y, y_train)
print('MAE =', mae)

In [None]:
r2 = r2_score(df.iloc[:int(len(df)*0.80)].y, y_train)
print('R2 =', r2)

In [None]:
# hydroeval expects the simulations first and the observations second
ce = he.evaluator(he.nse, y_train, df.iloc[:int(len(df)*0.80)].y)
print('CE =', ce)

Testing period

In [None]:
y = reg.predict(X_test)

In [None]:
test['y'] = y

In [None]:
mse = mean_squared_error(df.iloc[int(len(df)*0.80):].y, test.y)
print('MSE =', mse)

In [None]:
mae = mean_absolute_error(df.iloc[int(len(df)*0.80):].y, test.y)
print('MAE =', mae)

In [None]:
r2 = r2_score(df.iloc[int(len(df)*0.80):].y, test.y)
print('R2 =', r2)

In [None]:
# hydroeval expects the simulations first and the observations second
ce = he.evaluator(he.nse, test.y, df.iloc[int(len(df)*0.80):].y)
print('CE =', ce)

In [None]:
# Hydrograph plot for both training and test periods
plt.scatter(df.iloc[int(len(df)*0.80):].index, df.iloc[int(len(df)*0.80):].y, color ='b', label= "observed")
plt.scatter(df.iloc[:int(len(df)*0.80)].index, df.iloc[:int(len(df)*0.80)].y, color ='b')
plt.plot(df.iloc[int(len(df)*0.80):].index, test['y'], 'orange', label="simulated")
plt.plot(df.iloc[:int(len(df)*0.80)].index, y_train, 'orange')
plt.axvline(df.index[int(len(df)*0.80)], linestyle='--')  # first day of the test period
plt.figtext(0.75, 0.7, "Testing period", fontsize = 20)
plt.figtext(0.35, 0.7, "Training period", fontsize = 20)
plt.title("Observed and simulated streamflow discharge", fontsize=15)
plt.xlabel('Year',fontsize=15)
plt.ylabel('Streamflow discharge (m3/s)',fontsize=15)
plt.tick_params(labelsize=15)
plt.grid()
plt.legend(fontsize="x-large")
plt.show()

In [None]:
# MLR intercept
Intercept=reg.intercept_
Intercept

In [None]:
# MLR coefficients
Coefficients=reg.coef_
Coefficients

In [None]:
# MLR coefficients plot
plt.bar([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20], # train.dropna().drop('y', axis=1).columns
        
        [9.70564837e+00, -1.41113516e+00, -5.23703491e+00,  1.97528309e+01,
        9.36606971e+00,  3.06220187e+01,  1.29015046e+01,  4.59684755e+00,
        2.07828089e+00,  1.14507831e+01, -3.77209818e+00, -7.26481703e+00,
       -1.46609291e+01, -1.56883170e+01,  4.12135658e+01,  9.35425332e-01,
       -2.34186258e-01,  3.99417170e-02,  5.94165290e-01,  2.82438609e-01]) # reg.coef_

### Gradient Boosting Regressor

In [None]:
#GBR Hyperparameters
parameters = {'n_estimators'     : [400,450,500,550,600],
              'max_features'     : [0.3,0.4,0.5,0.6],
              'learning_rate'    : [0.05,0.06,0.08,0.1],
              'subsample'        : [0.2,0.3,0.4,0.5,0.6],              
              'max_depth'        : [5,7,9,11,12],
              }

In [None]:
gbr = GradientBoostingRegressor()
#Grid search GBR hyperparameters tuning
rndm_GBR = GridSearchCV(estimator=gbr, param_grid = parameters, cv = 5, n_jobs=2, verbose=3)

In [None]:
from datetime import datetime

start_time = timer(None)
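# Fit on the observed targets: y_train was overwritten with predictions in an earlier cell
rndm_GBR.fit(X_train, train['y'].values)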
timer(start_time)

In [None]:
#Optimal GBR hyperparameters
rndm_GBR.best_params_

In [None]:
#Training the GBR with the best hyperparameters
gbr = GradientBoostingRegressor(learning_rate=0.06, max_depth= 7, max_features= 0.4, n_estimators= 500, subsample= 0.2)
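# Fit on the observed targets: y_train was overwritten with predictions in an earlier cell
gbr.fit(X_train, train['y'].values)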

Training period

In [None]:
y_train = gbr.predict(X_train)

In [None]:
mse = mean_squared_error(df.iloc[:int(len(df)*0.80)].y, y_train)
print('MSE =', mse)

In [None]:
mae= mean_absolute_error(df.iloc[:int(len(df)*0.80)].y, y_train)
print('MAE =', mae)

In [None]:
r2 = r2_score(df.iloc[:int(len(df)*0.80)].y, y_train)
print('R2 =', r2)

In [None]:
# hydroeval expects the simulations first and the observations second
ce = he.evaluator(he.nse, y_train, df.iloc[:int(len(df)*0.80)].y)
print('CE =', ce)

Testing period

In [None]:
y = gbr.predict(X_test)

In [None]:
test['y'] = y

In [None]:
mse = mean_squared_error(df.iloc[int(len(df)*0.80):].y, test.y)
print('MSE =', mse)

In [None]:
mae = mean_absolute_error(df.iloc[int(len(df)*0.80):].y, test.y)
print('MAE =', mae)

In [None]:
r2 = r2_score(df.iloc[int(len(df)*0.80):].y, test.y)
print('R2 =', r2)

In [None]:
# hydroeval expects the simulations first and the observations second
ce = he.evaluator(he.nse, test.y, df.iloc[int(len(df)*0.80):].y)
print('CE =', ce)

In [None]:
# Hydrograph plot for both training and test periods
plt.scatter(df.iloc[int(len(df)*0.80):].index, df.iloc[int(len(df)*0.80):].y, color ='b', label= "observed")
plt.scatter(df.iloc[:int(len(df)*0.80)].index, df.iloc[:int(len(df)*0.80)].y, color ='b')
plt.plot(df.iloc[int(len(df)*0.80):].index, test['y'], 'orange', label="simulated")
plt.plot(df.iloc[:int(len(df)*0.80)].index, y_train, 'orange')
plt.axvline(df.index[int(len(df)*0.80)], linestyle='--')  # first day of the test period
plt.figtext(0.75, 0.7, "Testing period", fontsize = 20)
plt.figtext(0.35, 0.7, "Training period", fontsize = 20)
plt.title("Observed and simulated streamflow discharge", fontsize=15)
plt.xlabel('Year',fontsize=15)
plt.ylabel('Streamflow discharge (m3/s)',fontsize=15)
plt.tick_params(labelsize=15)
plt.grid()
plt.legend(fontsize="x-large")
plt.show()

### Support Vector Regression

In [None]:
#SVR hyperparameters
param={
  "C"        :  [0.1,1.0,10,100,1000],
 "epsilon"   :  [0.0001,0.001,0.01,0.1,1],
 "gamma"     :  [0.0001,0.001,0.01,0.1,1],
}

In [None]:
modelsvr = SVR(kernel='rbf')
#Grid search SVR hyperparameters tuning
grids = GridSearchCV(modelsvr,param,cv=5,n_jobs = 2, verbose = 3)

In [None]:
from datetime import datetime

start_time = timer(None)
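# Fit on the observed targets: y_train was overwritten with predictions in an earlier cell
grids.fit(X_train, train['y'].values)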
timer(start_time)

In [None]:
#The optimal SVR hyperparameters
grids.best_params_

In [None]:
#Training the SVR with the best hyperparameters
modelsvr = SVR(kernel='rbf', C=1000, epsilon=0.0001, gamma=0.01)
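# Fit against the observed training discharge; y_train no longer holds the original targets
modelsvr.fit(X_train, df.iloc[:int(len(df)*0.80)].y)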

Training period

In [None]:
y_train = modelsvr.predict(X_train)

In [None]:
mse = mean_squared_error(df.iloc[:int(len(df)*0.80)].y, y_train)
print('MSE =', mse)

In [None]:
mae= mean_absolute_error(df.iloc[:int(len(df)*0.80)].y, y_train)
print('MAE =', mae)

In [None]:
r2 = r2_score(df.iloc[:int(len(df)*0.80)].y, y_train)
print('R2 =', r2)

In [None]:
# NSE is not symmetric: hydroeval's evaluator takes (simulations, observations)
ce = he.evaluator(he.nse, y_train, df.iloc[:int(len(df)*0.80)].y)
print('CE =', ce)

Testing period

In [None]:
y = modelsvr.predict(X_test)

In [None]:
test['y'] = y

In [None]:
mse = mean_squared_error(df.iloc[int(len(df)*0.80):].y, test.y)
print('MSE =',mse)

In [None]:
mae = mean_absolute_error(df.iloc[int(len(df)*0.80):].y, test.y)
print('MAE =',mae)

In [None]:
r2 = r2_score(df.iloc[int(len(df)*0.80):].y, test.y)
print('R2 =',r2)

In [None]:
# NSE is not symmetric: hydroeval's evaluator takes (simulations, observations)
ce = he.evaluator(he.nse, test.y, df.iloc[int(len(df)*0.80):].y)
print('CE =', ce)

In [None]:
# Hydrograph plot for both training and test periods
plt.scatter(df.iloc[int(len(df)*0.80):].index, df.iloc[int(len(df)*0.80):].y, color ='b', label= "observed")
plt.scatter(df.iloc[:int(len(df)*0.80)].index, df.iloc[:int(len(df)*0.80)].y, color ='b')
plt.plot(df.iloc[int(len(df)*0.80):].index, test['y'], 'orange', label="simulated")
plt.plot(df.iloc[:int(len(df)*0.80)].index, y_train, 'orange')
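# Dashed line at the train/test boundary (ymin/ymax are axes fractions, so the defaults span the full height)
plt.axvline(df.index[int(len(df)*0.80)], linestyle='--')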
plt.figtext(0.75, 0.7, "Testing period", fontsize = 20)
plt.figtext(0.35, 0.7, "Training period", fontsize = 20)
plt.title("Observed and simulated streamflow discharge", fontsize=15)
plt.xlabel('Year',fontsize=15)
plt.ylabel('Streamflow discharge (m3/s)',fontsize=15)
plt.tick_params(labelsize=15)
plt.grid()
plt.legend(fontsize="x-large")
plt.show()

### Random Forest Regression

In [None]:
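#RFR hyperparameters
# 1.0 means "use all features", which is what "auto" meant for regressors
# before it was removed from scikit-learn
param = {'n_estimators' : [300,330,350,380,400,430,450,480,500],
         'max_features' : [1.0, "sqrt", "log2"],
        }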

In [None]:
model = RandomForestRegressor(random_state = 42)
# Grid search RFR hyperparameters tuning
grids = GridSearchCV(model,param,cv=5,n_jobs = 2, verbose = 3)

In [None]:
from datetime import datetime

start_time = timer(None)
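# y_train holds the SVR predictions at this point,
# so the search is fitted against the observed discharge of the training period
grids.fit(X_train, df.iloc[:int(len(df)*0.80)].y)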
timer(start_time)

In [None]:
# The best RFR hyperparameters
grids.best_params_
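
Rather than copying the best values by hand into a new estimator, the fitted search object can be used directly: with the default `refit=True`, `best_estimator_` has already been retrained on the whole training set with the best parameters. The sketch below shows this on synthetic data. `X_demo`, `y_demo`, `search` and `rf_best` are illustrative names, and the grid is much smaller than the one above so that it runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the real training set
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(80, 4))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=80)

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    {"n_estimators": [10, 20], "max_features": ["sqrt", 1.0]},
    cv=3,
)
search.fit(X_demo, y_demo)

# Already refitted on all of X_demo with the best parameters (refit=True is the default)
rf_best = search.best_estimator_
demo_pred = rf_best.predict(X_demo[:3])

print('Best parameters:', search.best_params_)
print('First predictions:', demo_pred)
```

This keeps the trained model in step with the search results if the grid or the data change later.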

In [None]:
rf = RandomForestRegressor(n_estimators = 480, max_features= 'sqrt', random_state = 42)
# Train the model on the observed training discharge (y_train holds the SVR predictions at this point)
rf.fit(X_train, df.iloc[:int(len(df)*0.80)].y);

Training period

In [None]:
y_train =rf.predict(X_train)

In [None]:
mse = mean_squared_error(df.iloc[:int(len(df)*0.80)].y, y_train)
print('MSE =', mse)

In [None]:
mae= mean_absolute_error(df.iloc[:int(len(df)*0.80)].y, y_train)
print('MAE =', mae)

In [None]:
r2 = r2_score(df.iloc[:int(len(df)*0.80)].y, y_train)
print('R2 =', r2)

In [None]:
# NSE is not symmetric: hydroeval's evaluator takes (simulations, observations)
ce = he.evaluator(he.nse, y_train, df.iloc[:int(len(df)*0.80)].y)
print('CE =', ce)

Testing period

In [None]:
y = rf.predict(X_test)

In [None]:
test['y']=y

In [None]:
mse = mean_squared_error(df.iloc[int(len(df)*0.80):].y, test.y)
print('MSE =',mse)

In [None]:
mae = mean_absolute_error(df.iloc[int(len(df)*0.80):].y, test.y)
print('MAE =',mae)

In [None]:
r2 = r2_score(df.iloc[int(len(df)*0.80):].y, test.y)
print('R2 =', r2)

In [None]:
# NSE is not symmetric: hydroeval's evaluator takes (simulations, observations)
ce = he.evaluator(he.nse, test.y, df.iloc[int(len(df)*0.80):].y)
print('CE =', ce)

In [None]:
# Hydrograph plot for both training and test periods
plt.scatter(df.iloc[int(len(df)*0.80):].index, df.iloc[int(len(df)*0.80):].y, color ='b', label= "observed")
plt.scatter(df.iloc[:int(len(df)*0.80)].index, df.iloc[:int(len(df)*0.80)].y, color ='b')
plt.plot(df.iloc[int(len(df)*0.80):].index, test['y'], 'orange', label="simulated")
plt.plot(df.iloc[:int(len(df)*0.80)].index, y_train, 'orange')
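# Dashed line at the train/test boundary (ymin/ymax are axes fractions, so the defaults span the full height)
plt.axvline(df.index[int(len(df)*0.80)], linestyle='--')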
plt.figtext(0.75, 0.7, "Testing period", fontsize = 20)
plt.figtext(0.35, 0.7, "Training period", fontsize = 20)
plt.title("Observed and simulated streamflow discharge", fontsize=15)
plt.xlabel('Year',fontsize=15)
plt.ylabel('Streamflow discharge (m3/s)',fontsize=15)
plt.tick_params(labelsize=15)
plt.grid()
plt.legend(fontsize="x-large")
plt.show()

#### FINAL CONCLUSION

**Comparing the results shows that the SVR model outperformed the other models in predicting daily flow, both with and without the historical flow at the target station as an input. Including the previous flows at the target gauging station also clearly improves the prediction results.**
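
As a reminder of what "taking previous flows into account" means in practice, the sketch below builds lagged copies of the target column with `pandas.DataFrame.shift`. The frame `demo`, its values, the `rain` column and the `y_lag1`/`y_lag2` names are made up for illustration; only the target name `y` follows the notebook, and the number of lags actually used in the study is not implied.

```python
import pandas as pd

# Made-up daily series: rainfall (mm) and discharge at the target station (m3/s)
demo = pd.DataFrame(
    {"rain": [0.0, 5.0, 12.0, 0.0, 0.0, 3.0],
     "y": [1.0, 1.5, 4.0, 2.5, 1.8, 2.0]},
    index=pd.date_range("2000-01-01", periods=6, freq="D"),
)

# Discharge of the previous one and two days as extra predictors
for lag in (1, 2):
    demo[f"y_lag{lag}"] = demo["y"].shift(lag)

# The first rows have no history, so they are dropped
demo = demo.dropna()

print(demo)
```

Each remaining row pairs the discharge of a given day with the rainfall of that day and the discharge of the two preceding days, which is the kind of input the models with historical flow rely on.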