This competition uses nine datasets that are completely independent of one another. Each dataset represents a different kind of waterbody, and because each waterbody is different, so are its features. A water spring, for instance, has different features from a lake. This is expected and reflects the behavior and characteristics of each waterbody. The Acea Group deals with four types of waterbody: water springs (three datasets), a lake (one dataset), a river (one dataset) and aquifers (four datasets).

Let’s see how these nine waterbodies differ from each other.

Waterbody: Auser
Type: Aquifer

Description: This waterbody consists of two subsystems, called NORTH and SOUTH, where the former partly influences the behavior of the latter. Indeed, the north subsystem is a water table (or unconfined) aquifer while the south subsystem is an artesian (or confined) groundwater.

The levels of the NORTH sector are represented by the values of the SAL, PAG, CoS and DIEC wells, while the levels of the SOUTH sector by the LT2 well.

Waterbody: Petrignano
Type: Aquifer

Description: The wells field of the alluvial plain between Ospedalicchio di Bastia Umbra and Petrignano is fed by three underground aquifers separated by low permeability septa. The aquifer can be considered a water table groundwater and is also fed by the Chiascio river. The groundwater levels are influenced by the following parameters: rainfall, depth to groundwater, temperatures and drainage volumes, level of the Chiascio river.

Waterbody: Doganella
Type: Aquifer

Description: The wells field Doganella is fed by two underground aquifers not fed by rivers or lakes but fed by meteoric infiltration. The upper aquifer is a water table with a thickness of about 30m. The lower aquifer is a semi-confined artesian aquifer with a thickness of 50m and is located inside lavas and tufa products. These aquifers are accessed through wells called Well 1, ..., Well 9. Approximately 80% of the drainage volumes come from the artesian aquifer. The aquifer levels are influenced by the following parameters: rainfall, humidity, subsoil, temperatures and drainage volumes.

Waterbody: Luco
Type: Aquifer

Description: The Luco wells field is fed by an underground aquifer. This aquifer is not fed by rivers or lakes but by meteoric infiltration at the extremes of the impermeable sedimentary layers. The aquifer is accessed through wells called Well 1, Well 3 and Well 4 and is influenced by the following parameters: rainfall, depth to groundwater, temperature and drainage volumes.

Waterbody: Amiata
Type: Water spring

Description: The Amiata waterbody is composed of a volcanic aquifer not fed by rivers or lakes but fed by meteoric infiltration. This aquifer is accessed through Ermicciolo, Arbure, Bugnano and Galleria Alta water springs. The levels and volumes of the four sources are influenced by the parameters: rainfall, depth to groundwater, hydrometry, temperatures and drainage volumes.

Waterbody: Madonna di Canneto
Type: Water spring

Description: The Madonna di Canneto spring is situated at an altitude of 1010m above sea level in the Canneto valley. It does not consist of an aquifer and its source is supplied by the water catchment area of the river Melfa.

Waterbody: Lupa
Type: Water spring

Description: This water spring is located in the Rosciano Valley, on the left side of the Nera river. The waters emerge at an altitude of about 375 meters above sea level through a long draining tunnel that crosses, in its final section, lithotypes and essentially calcareous rocks. It provides drinking water to the city of Terni and the towns around it.

Waterbody: Arno
Type: River

Description: The Arno is the second largest river in peninsular Italy and the main waterway in Tuscany. It has a relatively torrential regime, due to the nature of the surrounding soils (marl and impermeable clays). The Arno is the main source of water supply for the metropolitan area of Florence-Prato-Pistoia. The availability of water for this waterbody is evaluated by checking the hydrometric level of the river at the section of Nave di Rosano.

Waterbody: Bilancino
Type: Lake

Description: Bilancino lake is an artificial lake located in the municipality of Barberino di Mugello (about 50 km from Florence). It is used to refill the Arno river during the summer months. Indeed, during the winter months, the lake is filled up and then, during the summer months, the water of the lake is poured into the Arno river.

Each waterbody has its own different features to be predicted. The table below shows the expected feature to forecast for each waterbody.

![](https://storage.cloud.google.com/kaggle-media/competitions/Acea/Screen%20Shot%202020-12-02%20at%2012.40.17%20PM.png)

It is of the utmost importance to notice that some features, such as rainfall and temperature, which are present in every dataset, do not act on the same date they are recorded. Both rainfall and temperature affect features like level, flow, depth to groundwater and hydrometry only some time after the event. For instance, rain that fell on 1st January doesn't affect these features that same day but some time later. As we don't know how many days, weeks or months later rainfall affects them, this lag is another aspect to take into consideration when analyzing the datasets.
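
One way to probe this delay is to correlate the target with the rainfall series shifted by a range of lags and see where the relationship is strongest. The sketch below does this on synthetic data; the helper name `lagged_correlations`, the column names and the 3-day lag are assumptions made for illustration, not values taken from the Acea datasets.

```python
import numpy as np
import pandas as pd


def lagged_correlations(df, feature, target, max_lag):
    """Correlate `target` with `feature` shifted forward by 0..max_lag rows."""
    return pd.Series(
        {lag: df[feature].shift(lag).corr(df[target]) for lag in range(max_lag + 1)},
        name=f"corr({feature} lagged, {target})",
    )


# Synthetic daily data: the level responds to rain that fell 3 days earlier.
rng = np.random.default_rng(0)
rain = pd.Series(rng.gamma(shape=0.5, scale=4.0, size=400))
level = 2.0 * rain.shift(3) + rng.normal(0.0, 0.5, size=400)
demo = pd.DataFrame({"rainfall": rain, "level": level})

corr_by_lag = lagged_correlations(demo, "rainfall", "level", max_lag=10)
best_lag = int(corr_by_lag.idxmax())
print(corr_by_lag.round(2))
print("strongest correlation at lag:", best_lag)
```

On real data the peak is usually broader than in this toy case, so it may be worth inspecting the whole curve rather than only its maximum.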

The pmdarima module can be used to fill in missing stationary values for temperature. It is not installed in this cell, and its imports are left commented out below.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import os
import math
import pickle
import warnings
from datetime import datetime, date
from math import sqrt

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from matplotlib import pyplot
from mpl_toolkits.mplot3d import Axes3D
from numpy import asarray
from pandas import DataFrame, concat, read_csv
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.vector_ar.var_model import VAR
#from pmdarima.arima import auto_arima
#from pmdarima.arima import ADFTest

warnings.filterwarnings('ignore')


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

![](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F6195295%2Fcca952eecc1e49c54317daf97ca2cca7%2FAcea-Input.png?generation=1606932492951317&alt=media)

# AQUIFERS

In [None]:
Aquifer_Doganella = pd.read_csv("/kaggle/input/acea-water-prediction/Aquifer_Doganella.csv")
Aquifer_Auser = pd.read_csv("/kaggle/input/acea-water-prediction/Aquifer_Auser.csv")
Aquifer_Luco = pd.read_csv("/kaggle/input/acea-water-prediction/Aquifer_Luco.csv")
Aquifer_Petrignano = pd.read_csv("/kaggle/input/acea-water-prediction/Aquifer_Petrignano.csv")
aquifers_lst = [Aquifer_Petrignano,Aquifer_Doganella,Aquifer_Auser,Aquifer_Luco]

In [None]:
class MyPreprocessor:
    '''If downsample is True, the preprocessor downsamples the data to the granularity parameter.
    granularity should be either "7D" for weekly or "M" for monthly.
    
    If get_averages is True, the processor averages each feature group across its measurement locations.'''
    
    def __init__(self,data,features,downsample=False,granularity=None,get_averages=False):
        self.df = data.copy()
        self.features = features
        self.downsample = downsample
        self.granularity = granularity
        self.get_averages = get_averages
        self.PCA = PCA

        
    def filter_year(self, start_year):
        '''Parse the Date column and keep only rows on or after start_year (any date-like value).'''
        self.df['Date'] = pd.to_datetime(self.df['Date'], format="%d/%m/%Y")
        self.df = self.df.loc[self.df['Date'] >= start_year]
        return self.df

    def fill_zero_volume(self,df):
        '''Since zero volume is highly unlikely, we replace 0s with NaN.
        We also replace the 0 values in Temperature with NaN, since many observations use 0 as the default
        where a real observation is missing. The interpolation in fill_nas() then fills these gaps from
        the neighboring days.'''
        for col in df.columns:
            # str.startswith accepts a tuple, so both column families are handled in one test
            if col.startswith(("Volume", "Temperature")):
                df[col] = df[col].replace(0, np.nan)
        return df
    
    
    def fill_nas(self,df):
        # Interpolate only the numeric columns (Date is left untouched), then back-fill any leading gaps
        num_cols = df.select_dtypes(include='number').columns
        df[num_cols] = df[num_cols].interpolate().bfill()
        return df
    
    def feature_engineer(self,df):
        
        '''Create date columns for year, month and day.'''
        #df['Date'] = df['Date'].apply(lambda x: datetime.strptime(x, "%d/%m/%Y"))
        #df = df.loc[(df['Date']>=self.start_year)]
        #y = df.iloc[0]['Date'].year
        #df['year-month'] = df['Date'].apply(lambda x: int((x.year-y)*12+x.month))
        df['year'] = df['Date'].apply(lambda x: int(x.year))
        df['month'] = df['Date'].apply(lambda x: int(x.month))
        df['day'] = df['Date'].apply(lambda x: int(x.day))
        return df
    
    def engineer_averages(self,df):
        df['rainfall-avg'] = df[[ f for f in df.columns if f.startswith('Rain')]].apply(lambda x: x.mean(), axis=1)
        #df['rainfall-std'] = df[[ f for f in df.columns if f.startswith('Rain')]].apply(lambda x: np.std(x, axis=0), axis=1)
        df['temperature-avg'] = df[[ f for f in df.columns if f.startswith('Temperature')]].apply(lambda x: x.mean(), axis=1)
        #df['temperature-std'] = df[[ f for f in df.columns if f.startswith('Temperature')]].apply(lambda x: x.mean(), axis=1)
        df['hydrometry-avg'] = df[[ f for f in df.columns if f.startswith('Hydrometry')]].apply(lambda x: x.mean(), axis=1)
        df['volume-avg'] = df[[ f for f in df.columns if f.startswith('Volume')]].apply(lambda x: x.mean(), axis=1)
        df['depth_to_groundwater-avg'] = df[[ f for f in df.columns if f.startswith('Depth')]].apply(lambda x: x.mean(), axis=1)
        df = df.set_index('Date')
        df = df.iloc[:,-9:]
        data = df.drop([f for f in df.columns if (f.startswith('Rainfall') or f.startswith('Temperature') 
                            or f.startswith('Hydrometry') or f.startswith('Volume') or f.startswith('Depth'))],axis=1)
        return data
    
    
    def main(self):
        data = self.df
        #data = self.filter_year()
        data = self.fill_zero_volume(data)
        data = self.fill_nas(data)
        data = self.feature_engineer(data)
        
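        if self.get_averages:
            data = self.engineer_averages(data)

        if self.downsample:
            # engineer_averages() moves Date into the index, so resample on whichever holds the dates
            if 'Date' in data.columns:
                return data.resample(self.granularity, on='Date').sum().reset_index(drop=False)
            return data.resample(self.granularity).sum().reset_index(drop=False)
        return data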
        
            


In [None]:
class Data_Exploration:
    
    def __init__(self,df,features,aquifer=None,PCA=False):
        self.df = df
        self.features = features
        self.nrows = len(features)
        self.aquifer = aquifer
        self.PCA = PCA
        
    def scaler(self):
        data = self.df.copy()
        scaler = StandardScaler()
        scaled_data = scaler.fit_transform(data)
        return scaled_data
    

        
    def line_plots(self):
        depths = [f for f in self.df.columns if f.startswith('Depth')]
        for f in depths:
            depth_array = self.df[f].values
            _index_array = np.array(self.df.index)
            normalized_depth = preprocessing.normalize([depth_array])
            normalized_depth = pd.Series(normalized_depth[0])

            for feature in self.features:
                feat_values = self.df['{}'.format(feature)].values
                normalized_feature = preprocessing.normalize([feat_values])
                normalized_feature = normalized_feature[0]
                normalized_feature = pd.Series(normalized_feature)
                #the_array = np.hstack((_index_array, normalized_depth,normalized_feature))
                normalized_df = pd.DataFrame(data= {'Date':self.df.index,f'{f}':normalized_depth,f'{feature}':normalized_feature})
                fig= plt.figure(figsize=(10,3))
                plt.plot(normalized_df[f'{f}'], label=f'{f}')
                plt.plot(normalized_df[f'{feature}'], label=str.capitalize(feature))
                plt.legend()
                plt.title(f'{feature} vs. {f} (Normalized)')
                plt.show()
        
    def hist_plot(self):
        print("DISTRIBUTION CHARTS=====================================")    
        for feat in self.df.columns:
            fig= plt.figure(figsize=(10,3))
            sns.distplot(self.df[feat].fillna(np.inf), color='indianred')
            plt.title(f'{str.capitalize(feat)}: ', fontsize=14)
            plt.tight_layout()
            plt.show()
            
    def hist_plots(self):
            
        f, ax = plt.subplots(nrows=self.nrows, ncols=1, figsize=(10, 35))
        for i,feat in enumerate(self.features):
            sns.distplot(self.df[feat].fillna(np.inf), ax=ax[i], color='indianred')
            ax[i].set_title(f'{str.capitalize(feat)}: ', fontsize=14)
            #ax[i].set_ylabel(ylabel=f'{str.capitalize(feat)}', fontsize=14)
        plt.tight_layout()
        plt.show()
        
    def stationarity(self):
        # Dickey-Fuller Test
        def interpret_dftest(dftest):
            dfoutput = pd.Series(dftest[0:2], index=['Test Statistic','p-value'])
            return dfoutput

        for feature in self.features:
            split = str(feature).split("_")
            feat = " ".join(split)
            print(f"{str.capitalize(feat)}:")
            print(interpret_dftest(adfuller(self.df[feature])))
            print("--------------------------------------------------------")
            
        
    def plot_corr_matrix(self):
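        df = self.df.copy()
        # set_index returns a new frame, so keep the result; Date is then excluded from the correlations
        if 'Date' in df.columns:
            df = df.set_index('Date')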
        #for col in df.columns:
        #    df[col] = df[col].abs()
        # Change values of columns to absolute values
        fig= plt.figure(figsize=(15,15))
        corrMatrix = df.corr()
        sns.heatmap(corrMatrix, annot=True)
        plt.show()
                                                     
    def plot_auto_correlation(self):
        plot_acf(self.df['depth_to_groundwater-avg'])
        plot_pacf(self.df['depth_to_groundwater-avg'])
        
    
    def _PCA_(self):
        '''3D scatter of the first two principal components against each Depth column.'''
        df = self.df.drop("Date", axis=1)
        X_reduced = PCA(n_components=2).fit_transform(df)
        # normalize() expects and returns a 2D array; take row 0 to get back to 1D
        xval = preprocessing.normalize([X_reduced[:, 0]])[0]
        yval = preprocessing.normalize([X_reduced[:, 1]])[0]
        for col in df.columns:
            if col.startswith('Depth'):
                zval = preprocessing.normalize([df[col].values])[0]
                # the original code used ax without creating it; make one 3D axis per Depth column
                fig = plt.figure(figsize=(8, 8))
                ax = fig.add_subplot(projection='3d')
                ax.scatter(xval, yval, zval, c=df[col])
                ax.set_xlabel('PCA1')
                ax.set_ylabel('PCA2')
                ax.set_zlabel(f'{col}')
                plt.show()
        

    def main(self):
        self.line_plots()
        self.hist_plots()
        self.stationarity()
        if self.PCA == True:
            self._PCA_()
        else: 
            pass
        self.plot_corr_matrix()
        

In [None]:
class Data_Exploration2:
    
    def __init__(self,df,features,aquifer=None):
        self.df = df
        self.features = features
        self.nrows = len(features)
        self.aquifer = aquifer
        
    def scaler(self):
        '''Standardize every column except Date and return the result as a DataFrame.'''
        data = self.df.copy()
        cols = [c for c in data.columns if c != 'Date']
        data[cols] = StandardScaler().fit_transform(data[cols])
        return data
    
    def lines_plot(self,df):
        '''Plot each feature against each Depth column, using data that is already scaled.'''
        depths = [f for f in df.columns if f.startswith('Depth')]
        for f in depths:
            for feature in self.features:
                fig= plt.figure(figsize=(8,4))
                plt.plot(df.index, df[f].values, label=f'{f}')
                plt.plot(df.index, df[feature].values, label=str.capitalize(feature))
                plt.legend()
                plt.title(f'{feature} vs. {f} (Standardized)')
                plt.show()
        
    def line_plots(self):
        depths = [f for f in self.df.columns if f.startswith('Depth')]
        for f in depths:
            depth_array = self.df[f].values
            _index_array = np.array(self.df.index)
            normalized_depth = preprocessing.normalize([depth_array])
            normalized_depth = pd.Series(normalized_depth[0])

            for feature in self.features:
                feat_values = self.df['{}'.format(feature)].values
                normalized_feature = preprocessing.normalize([feat_values])
                normalized_feature = normalized_feature[0]
                normalized_feature = pd.Series(normalized_feature)
                #the_array = np.hstack((_index_array, normalized_depth,normalized_feature))
                normalized_df = pd.DataFrame(data= {'Date':self.df.index,f'{f}':normalized_depth,f'{feature}':normalized_feature})
                fig= plt.figure(figsize=(8,4))
                plt.plot(normalized_df[f'{f}'], label=f'{f}')
                plt.plot(normalized_df[f'{feature}'], label=str.capitalize(feature))
                plt.legend()
                plt.title(f'{feature} vs. {f} (Normalized)')
                plt.show()
        
    def hist_plots(self):
            
        f, ax = plt.subplots(nrows=self.nrows, ncols=1, figsize=(10, 12))
        for i,feat in enumerate(self.features):
            sns.distplot(self.df[feat].fillna(np.inf), ax=ax[i], color='indianred')
            ax[i].set_title(f'{str.capitalize(feat)}: ', fontsize=14)
            ax[i].set_ylabel(ylabel=f'{str.capitalize(feat)}', fontsize=14)
        plt.tight_layout()
        plt.show()
        
    def stationarity(self):
        # Dickey-Fuller Test
        def interpret_dftest(dftest):
            dfoutput = pd.Series(dftest[0:2], index=['Test Statistic','p-value'])
            return dfoutput

        for feature in self.features:
            split = str(feature).split("_")
            feat = " ".join(split)
            print(f"{str.capitalize(feat)}:")
            print(interpret_dftest(adfuller(self.df[feature])))
            print("--------------------------------------------------------")
            
    def resampling(self, year):
        fig, ax = plt.subplots(ncols=2, nrows=4, sharex=True, figsize=(16,12))

        ax[0, 0].bar(self.df.Date, self.df['rainfall-avg'], width=5, color='dodgerblue')
        ax[0, 0].set_title('Daily Rainfall (Acc.)', fontsize=14)

        resampled_df = self.df[['Date','rainfall-avg']].resample('7D', on='Date').sum().reset_index(drop=False)
        ax[1, 0].bar(resampled_df.Date, resampled_df['rainfall-avg'], width=10, color='dodgerblue')
        ax[1, 0].set_title('Weekly Rainfall (Acc.)', fontsize=14)

        resampled_df = self.df[['Date','rainfall-avg']].resample('M', on='Date').sum().reset_index(drop=False)
        ax[2, 0].bar(resampled_df.Date, resampled_df['rainfall-avg'], width=15, color='dodgerblue')
        ax[2, 0].set_title('Monthly Rainfall (Acc.)', fontsize=14)

        resampled_df = self.df[['Date','rainfall-avg']].resample('12M', on='Date').sum().reset_index(drop=False)
        ax[3, 0].bar(resampled_df.Date, resampled_df['rainfall-avg'], width=20, color='dodgerblue')
        ax[3, 0].set_title('Annual Rainfall (Acc.)', fontsize=14)

        for i in range(4):
            ax[i, 0].set_xlim([date(year, 1, 1), date(2020, 6, 30)])

        # seaborn expects x and y as keyword arguments; temperature is averaged, not accumulated
        sns.lineplot(x=self.df.Date, y=self.df['temperature-avg'], color='dodgerblue', ax=ax[0, 1])
        ax[0, 1].set_title('Daily Temperature (Avg.)', fontsize=14)

        resampled_df = self.df[['Date','temperature-avg']].resample('7D', on='Date').mean().reset_index(drop=False)
        sns.lineplot(x=resampled_df.Date, y=resampled_df['temperature-avg'], color='dodgerblue', ax=ax[1, 1])
        ax[1, 1].set_title('Weekly Temperature (Avg.)', fontsize=14)

        resampled_df = self.df[['Date','temperature-avg']].resample('M', on='Date').mean().reset_index(drop=False)
        sns.lineplot(x=resampled_df.Date, y=resampled_df['temperature-avg'], color='dodgerblue', ax=ax[2, 1])
        ax[2, 1].set_title('Monthly Temperature (Avg.)', fontsize=14)

        resampled_df = self.df[['Date','temperature-avg']].resample('365D', on='Date').mean().reset_index(drop=False)
        sns.lineplot(x=resampled_df.Date, y=resampled_df['temperature-avg'], color='dodgerblue', ax=ax[3, 1])
        ax[3, 1].set_title('Annual Temperature (Avg.)', fontsize=14)

        for i in range(4):
            ax[i, 1].set_xlim([date(year, 1, 1), date(2020, 6, 30)])
            ax[i, 1].set_ylim([-5, 35])
        plt.show()
        
    def plot_corr_matrix(self,df):
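        # set_index returns a new frame, so keep the result; Date is then excluded from the correlations
        if 'Date' in df.columns:
            df = df.set_index('Date')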
        #for col in df.columns:
        #    df[col] = df[col].abs()
        # Change values of columns to absolute values
        fig= plt.figure(figsize=(10,10))
        corrMatrix = df.corr()
        sns.heatmap(corrMatrix, annot=True)
        plt.show()
                                                     
    def plot_auto_correlation(self):
        plot_acf(self.df['depth_to_groundwater-avg'])
        plot_pacf(self.df['depth_to_groundwater-avg'])
        

    def main(self):
        scaled_data = self.scaler()
        self.lines_plot(scaled_data)
        #self.line_plots()
        self.hist_plots()
        self.stationarity()
        self.plot_corr_matrix(scaled_data)

## DATA PROCESSING & EXPLORATORY ANALYSIS

We'll take a quick look at all the data from each column and then at how each feature correlates with each of the depth (target) features

### Aquifer Auser

In [None]:
date_time = pd.to_datetime(Aquifer_Auser.Date, format='%d/%m/%Y')
plot_cols = Aquifer_Auser.iloc[:,1:-1].columns
plot_features = Aquifer_Auser[plot_cols]
plot_features.index = date_time
_ = plot_features.plot(subplots=True,figsize=(15,20))

In [None]:
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(16,5))
sns.heatmap(Aquifer_Auser.T.isna(), cmap='Blues', ax=ax)
ax.set_title('Fields with Missing Values', fontsize=16)
# set the y tick label size through the axes rather than through individual tick objects
ax.tick_params(axis='y', labelsize=14)
plt.show()

We will keep only the Aquifer Auser data from 2011 onward.

In [None]:
df = Aquifer_Auser.copy()
df['Date'] = df['Date'].apply(lambda x: datetime.strptime(x, "%d/%m/%Y"))
df = df.loc[(df['Date']>='01/01/2011')]

In [None]:
# Check time intervals
df['Time_Interval'] = df.Date - df.Date.shift(1)

print(df[['Date', 'Time_Interval']].head())

print(f"{df['Time_Interval'].value_counts()}")
df = df.drop('Time_Interval', axis=1)

In [None]:
# create list of features
features = [f for f in df.columns if (f.startswith('Depth')==False and f !='Date')]
processor = MyPreprocessor(df,features=features,get_averages=False)
Auser_df = processor.main()

In [None]:
Auser_df.info()

In [None]:
explorer = Data_Exploration(Auser_df,features,'Aquifer Auser',PCA=False)
explorer.main()

We can see that month, temperature and volume have a fairly significant correlation with Depth to Groundwater, **while rainfall has a poor correlation**, surprisingly. This may be because rainfall is, on average, very low over the whole time period, and because its effects are not immediate but delayed. The histogram shows that the majority of days had 0-2 cm of precipitation, which contributes to the weak correlation.
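
To illustrate why a delayed, cumulative effect can hide behind a weak same-day correlation, here is a small sketch on synthetic data. It assumes, purely for illustration, that the level tracks the rainfall accumulated over the previous 30 days; the window length and the series are made up and are not derived from the Auser dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
rain = pd.Series(rng.gamma(shape=0.5, scale=4.0, size=1000), name="rainfall")

# Assumed toy response: the level follows rain accumulated over the last 30 days.
level = rain.rolling(30).sum() + rng.normal(0.0, 1.0, size=1000)

corr_daily = rain.corr(level)
corr_30d = rain.rolling(30).sum().corr(level)
print(f"daily rainfall vs level:  {corr_daily:.2f}")
print(f"30-day rainfall vs level: {corr_30d:.2f}")
```

If the real aquifer behaves in a similar way, an accumulated or lagged rainfall feature may correlate with depth far better than the raw daily values do.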

Interestingly, there is a strong correlation between year and Hydrometry, especially at POL. There are also strong correlations between Hydrometry and Depth, depending on whether they are in the North or South section.

### AQUIFER DOGANELLA

In [None]:
date_time = pd.to_datetime(Aquifer_Doganella.Date, format='%d/%m/%Y')
plot_cols = Aquifer_Doganella.iloc[:,1:-1].columns
plot_features = Aquifer_Doganella[plot_cols]
plot_features.index = date_time
_ = plot_features.plot(subplots=True,figsize=(15,20))

**We can fill in the missing temperatures from Monteporzio with values forecast by a pmdarima model.**

This aquifer's dataset is missing a lot of data, so there are many holes to fill. We will only use the most recent data (from 2019 onward) for this one.

In [None]:
df = Aquifer_Doganella.copy()
df['Date'] = df['Date'].apply(lambda x: datetime.strptime(x, "%d/%m/%Y"))
df = df.loc[(df['Date']>='01-01-2019')]
#df = df.set_index('Date')

In [None]:
df.info()

In [None]:
processor = MyPreprocessor(df,features=features,get_averages=False)
Doganella_df = processor.main()

In [None]:
# Get new features for Doganella 
features = [f for f in Doganella_df.columns if (f.startswith('Depth')==False and f !='Date')]

explorer = Data_Exploration(Doganella_df,features,'Aquifer Doganella')
explorer.main()

### AQUIFER PETRIGNANO

In [None]:
date_time = pd.to_datetime(Aquifer_Petrignano.Date, format='%d/%m/%Y')
plot_cols = Aquifer_Petrignano.iloc[:,1:-1].columns
plot_features = Aquifer_Petrignano[plot_cols]
plot_features.index = date_time
_ = plot_features.plot(subplots=True,figsize=(15,20))

We will only use data from 2009 onward for Aquifer Petrignano. Since zero volume is highly unlikely, we replace 0s with null. We also replace the 0 values in Temperature with NaN, since many observations use 0 as the default where a real observation is missing. When we then interpolate, these gaps are filled from the temperatures of the neighboring days as proxies. The same is done on the other datasets.
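
The effect of this cleaning step can be seen on a tiny, made-up series. The values and the series name below are hypothetical; the sketch only assumes that a 0 marks a missing reading.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: 0.0 marks days where the sensor value is missing.
temps = pd.Series([12.0, 0.0, 0.0, 15.0, 16.0, 0.0], name="Temperature_demo")

cleaned = temps.replace(0, np.nan)        # treat the 0 placeholders as missing
filled = cleaned.interpolate().bfill()    # linear fill, then back-fill any leading gap
print(filled.tolist())
```

Note that this approach also erases genuine 0 readings, which is a trade-off to keep in mind for temperature in particular.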

In [None]:
df = Aquifer_Petrignano.copy()
df['Date'] = df['Date'].apply(lambda x: datetime.strptime(x, "%d/%m/%Y"))
df = df.loc[(df['Date']>='01-01-2009')]

In [None]:
features = [f for f in df.columns if (f.startswith('Depth')==False and f !='Date')]
processor = MyPreprocessor(df,features=features,get_averages=False)
Petrignano_df = processor.main()

In [None]:
Petrignano_df.head()

In [None]:
# Run our data explorer on Aquifer Petrignano
explorer = Data_Exploration(Petrignano_df,features,'Aquifer Petrignano')
explorer.main()

From the correlation chart we can see that the depth readings from the two wells correlate perfectly with each other. There is also a strong correlation between Volume_C10_Petrignano and the two depth readings, as also seen in the line chart.

Interestingly, the year has a significant correlation with Volume_C10_Petrignano.  Year also has a relatively strong correlation with the Depths.  This could signify a particularly dry or wet year.

### AQUIFER LUCO

In [None]:
date_time = pd.to_datetime(Aquifer_Luco.Date, format='%d/%m/%Y')
plot_cols = Aquifer_Luco.iloc[:,1:-1].columns
plot_features = Aquifer_Luco[plot_cols]
plot_features.index = date_time
_ = plot_features.plot(subplots=True,figsize=(15,20))

# Random Forest Model

## Time series as supervised learning problem

Time series data can be phrased as supervised learning.

Given a sequence of numbers for a time series dataset, we can restructure the data to look like a supervised learning problem. We can do this by using previous time steps as input variables and use the next time step as the output variable.

For example, consider the following short series, which we can restructure by using the value at the previous time step to predict the value at the next time step:

| time | measure |
|------|---------|
| 1 | 100 |
| 2 | 110 |
| 3 | 108 |
| 4 | 115 |
| 5 | 120 |

Reorganizing the time series dataset this way, the data would look as follows:


| X | y |
|-----|-----|
| ? | 100 |
| 100 | 110 |
| 110 | 108 |
| 108 | 115 |
| 115 | 120 |
| 120 | ? |

Note that the time column is dropped and some rows of data are unusable for training a model, such as the first and the last.

This representation is called a sliding window, as the window of inputs and expected outputs is shifted forward through time to create new “samples” for a supervised learning model.

We can use the **shift()** function in Pandas to automatically create new framings of time series problems given the desired length of input and output sequences.

This would be a useful tool as it would allow us to explore different framings of a time series problem with machine learning algorithms to see which might result in better-performing models.
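
As a minimal illustration of the idea, the sketch below applies `shift()` to the toy series shown above. The column names `X` and `y` mirror that example; everything else is just scaffolding.

```python
import pandas as pd

# The toy series from the example above.
series = pd.Series([100, 110, 108, 115, 120], name="y")

# X is the value at the previous time step; the first row has no predecessor.
framed = pd.concat([series.shift(1).rename("X"), series], axis=1)
print(framed)

# Rows containing NaN cannot be used for training, so drop them.
supervised = framed.dropna()
print(supervised)
```

Shifting by larger amounts, or concatenating several shifted copies, gives wider input windows in the same way.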

The function below takes a time series, as a list or a NumPy array with one or more columns, and transforms it into a supervised learning problem with the specified number of inputs and outputs. We can use this function to prepare a time series dataset for Random Forest.

The function takes four arguments:

* data: Sequence of observations as a list or 2D NumPy array. Required.
* n_in: Number of lag observations as input (X). Values may be between [1..len(data)]. Optional. Defaults to 1.
* n_out: Number of observations as output (y). Values may be between [0..len(data)-1]. Optional. Defaults to 1.
* dropnan: Boolean whether or not to drop rows with NaN values. Optional. Defaults to True.

In [None]:
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols = list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
    # put it all together
    agg = concat(cols, axis=1)
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg.values


In *walk-forward validation*, the dataset is first split into train and test sets by selecting a cut point, e.g. all data except the last 12 months is used for training and the last 12 months is used for testing.

If we are interested in making a one-step forecast, e.g. one month, then we can evaluate the model by training on the training dataset and predicting the first step in the test dataset. We can then add the real observation from the test set to the training dataset, refit the model, then have the model predict the second step in the test dataset.

The function below performs walk-forward validation.

It takes the entire supervised learning version of the time series dataset and the number of rows to use as the test set as arguments.

It then steps through the test set, calling the random_forest_forecast() function to make a one-step forecast. An error measure is calculated and the details are returned for analysis.

In [None]:
def walk_forward_validation(data, n_test):
    predictions = list()
    # split dataset
    train, test = train_test_split(data, n_test)
    # seed history with training dataset
    history = [x for x in train]
    # step over each time-step in the test set
    for i in range(len(test)):
        # split test row into input and output columns
        testX, testy = test[i, :-1], test[i, -1]
        # fit model on history and make a prediction
        yhat = random_forest_forecast(history, testX)
        # store forecast in list of predictions
        predictions.append(yhat)
        # add actual observation to history for the next loop
        history.append(test[i])
        # summarize progress
        print('>expected=%.1f, predicted=%.1f' % (testy, yhat))
    # estimate prediction error
    error = mean_absolute_error(test[:, -1], predictions)
    return error, test[:, -1], predictions

We can use the RandomForestRegressor class to make a one-step forecast.

The random_forest_forecast() function below implements this, taking the training dataset and test input row as input, fitting a model and making a one-step prediction.

In [None]:
# fit a random forest model and make a one-step prediction
def random_forest_forecast(train, testX):
    # transform list into array
    train = asarray(train)
    # split into input and output columns
    trainX, trainy = train[:, :-1], train[:, -1]
    # fit model
    model = RandomForestRegressor(n_estimators=100)
    model.fit(trainX, trainy)
    # make a one-step prediction
    yhat = model.predict([testX])
    return yhat[0]

The final forecast function will use the RandomForestRegressor on the entire dataset to forecast the next time step.

In [None]:
def forecast(df):
    values = df.values
    train = series_to_supervised(values, n_in=1)
    # split into input and output columns
    # series_to_supervised returns a NumPy array, so slice it directly
    trainX, trainy = train[:, :-1], train[:, -1]
    # fit model
    model = RandomForestRegressor(n_estimators=100)
    model.fit(trainX, trainy)
    values = trainX
    # construct an input for a new prediction
    row = values[-1:].flatten()
    yhat = model.predict(asarray([row]))
    print('Input: %s\n, Predicted: %.3f' % (row, yhat[0]))
    return yhat

## Training models and forecasting next time step at each station of each aquifer

We will evaluate a model at each station with walk-forward validation over the most recent days (the last 50 or 100, depending on the station) to observe how the model performs. Then we will use the entire dataset for the aquifer to predict the next (future) time step. In this instance that is one day, but we could also forecast the next week by setting the downsampling parameter in the PreProcessing module.
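
The PreProcessing module is not shown in this section, so the following is only a sketch of what weekly downsampling could look like with plain pandas. The column name and values are hypothetical.

```python
import numpy as np
import pandas as pd

# two weeks of made-up daily depth readings
idx = pd.date_range('2020-01-01', periods=14, freq='D')
daily = pd.DataFrame({'Depth_to_Groundwater': np.linspace(-25.0, -26.3, 14)}, index=idx)

# average each block of 7 consecutive days into one weekly observation
weekly = daily.resample('7D').mean()
print(weekly)
```

After downsampling, one row corresponds to one week, so a one-step forecast becomes a forecast for the next week.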

### AQUIFER PETRIGNANO

In [None]:
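# imports used by the helpers and plots below
from numpy import asarray
from pandas import DataFrame, concat
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from matplotlib import pyplot
import matplotlib.pyplot as plt

# transform a time series dataset into a supervised learning dataset
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):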
    df = DataFrame(data)
    cols = list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
    # put it all together
    agg = concat(cols, axis=1)
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg.values
 
# split a univariate dataset into train/test sets
def train_test_split(data, n_test):
    return data[:-n_test, :], data[-n_test:, :]
 
# fit a random forest model and make a one-step prediction
def random_forest_forecast(train, testX):
    # transform list into array
    train = asarray(train)
    # split into input and output columns
    trainX, trainy = train[:, :-1], train[:, -1]
    # fit model
    model = RandomForestRegressor(n_estimators=100)
    model.fit(trainX, trainy)
    # make a one-step prediction
    yhat = model.predict([testX])
    return yhat[0]
 
# walk-forward validation for univariate data
def walk_forward_validation(data, n_test):
    predictions = list()
    # split dataset
    train, test = train_test_split(data, n_test)
    # seed history with training dataset
    history = [x for x in train]
    # step over each time-step in the test set
    for i in range(len(test)):
        # split test row into input and output columns
        testX, testy = test[i, :-1], test[i, -1]
        # fit model on history and make a prediction
        yhat = random_forest_forecast(history, testX)
        # store forecast in list of predictions
        predictions.append(yhat)
        # add actual observation to history for the next loop
        history.append(test[i])
        # summarize progress
        print('Day %.0f  expected=%.1f  predicted=%.1f' % (i,testy, yhat))
    # estimate prediction error
    error = mean_absolute_error(test[:, -1], predictions)
    return error, test[:, -1], predictions

# select the features, keeping the target (Depth_to_Groundwater_P25) as the last column
Petrignano_df = Petrignano_df[['year','month','day','Rainfall_Bastia_Umbra','Temperature_Bastia_Umbra',
                               'Temperature_Petrignano','Volume_C10_Petrignano','Hydrometry_Fiume_Chiascio_Petrignano',
                               'Depth_to_Groundwater_P24','Depth_to_Groundwater_P25']]

values = Petrignano_df.values
# transform the time series data into supervised learning
data = series_to_supervised(values, n_in=1)
# Evaluate
mae, y, yhat = walk_forward_validation(data, 100)
print('MAE: %.3f' % mae)
# plot expected vs predicted
fig= plt.figure(figsize=(15,8))
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
plt.title('Predicted Depth to Groundwater at P25, Aquifer Petrignano')
pyplot.show()

Walk-forward validation over the last 100 days, using a single day of lag as input, shows that the model does rather well at predicting the true value of each day, with an MAE of 0.035 meters. So our model can predict the next day with an expected error of about 3.5 centimeters! To use a lag of 7 days we can set n_in to 7 (each observation is daily), and to predict 7 days ahead we can set n_out to 7. However, the model will probably not perform as well at that horizon, so for our purposes here we will only predict one day ahead.
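
As a rough sketch of what changing n_in and n_out does, the snippet below frames a made-up dataset of 20 days and 3 variables both ways and compares the results. The helper `frame` is a compact, hypothetical restatement of series_to_supervised so that the example runs on its own.

```python
import numpy as np
import pandas as pd

def frame(data, n_in, n_out):
    # lagged copies (t-n_in ... t-1) followed by current and future copies (t ... t+n_out-1)
    df = pd.DataFrame(data)
    cols = [df.shift(i) for i in range(n_in, 0, -1)]
    cols += [df.shift(-i) for i in range(n_out)]
    return pd.concat(cols, axis=1).dropna().values

# 20 daily observations of 3 variables; the last variable plays the role of the target
data = np.arange(60, dtype=float).reshape(20, 3)

one_day = frame(data, n_in=1, n_out=1)
one_week = frame(data, n_in=7, n_out=7)
print('n_in=1, n_out=1 ->', one_day.shape)
print('n_in=7, n_out=7 ->', one_week.shape)
```

With the longer window each row carries many more input columns while fewer complete rows remain, and the last column becomes the target 6 days after the current day. This is one reason the longer horizon is harder to fit on a short history.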

### Fitting the final model to all the data for new predictions

Once a final Random Forest model configuration is chosen, a model can be finalized and used to make a prediction on new data.

This is called an **out-of-sample forecast**, e.g. predicting beyond the training dataset. This is identical to making a prediction during the evaluation of the model, as we always want to evaluate a model using the same procedure that we expect to use when the model is used to make predictions on new data.
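
To illustrate what "beyond the training dataset" means, here is a minimal sketch on made-up depth readings. It uses only the lagged value as input and fits an ordinary least squares line as a lightweight stand-in for the Random Forest, so it shows the idea rather than the method used in this notebook.

```python
import numpy as np

# hypothetical daily depth readings in meters, not taken from the Acea data
depth = np.array([-25.0, -25.1, -25.3, -25.2, -25.4, -25.5, -25.7])

# lag-only framing: the input is the value at t-1, the output is the value at t
X, y = depth[:-1], depth[1:]

# ordinary least squares line (stand-in for the Random Forest)
slope, intercept = np.polyfit(X, y, deg=1)

# the newest observation has no target yet, so it is a genuine out-of-sample input
next_depth = slope * depth[-1] + intercept
print('Last observed: %.2f, forecast for next day: %.3f' % (depth[-1], next_depth))
```

The key point is that the input row for the forecast was never paired with a target during fitting.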

The code below demonstrates fitting a final Random Forest model on all available data and making a one-step prediction beyond the end of the dataset.

In [None]:
# Reorder columns so the 'Depth' columns come last, ending with the well whose depth we want to predict
Petrignano_df = Petrignano_df[['year','month','day','Rainfall_Bastia_Umbra','Temperature_Bastia_Umbra',
                               'Temperature_Petrignano','Volume_C10_Petrignano','Hydrometry_Fiume_Chiascio_Petrignano',
                               'Depth_to_Groundwater_P24','Depth_to_Groundwater_P25']]

def forecast(df):
    values = df.values
    train = series_to_supervised(values, n_in=1)
    # split into input and output columns
    trainX, trainy = train[:, :-1], train[:, -1]
    # fit model
    model = RandomForestRegressor(n_estimators=100)
    model.fit(trainX, trainy)
    values = trainX
    # construct an input for a new prediction
    row = values[-1:].flatten()
    yhat = model.predict(asarray([row]))
    print('Input: %s\n, Predicted: %.3f' % (row, yhat[0]))
    return yhat

Depth_to_Groundwater_P25_forecast = forecast(Petrignano_df)

In [None]:
Depth_to_Groundwater_P25_forecast

### Model for Depth_to_Groundwater_P24, Petrignano

Now we use the same model to predict the depth at well P24.
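
Because the helpers always treat the last column as the target, switching wells only requires reordering the columns. A small sketch of a helper that does this is shown below. `move_target_last` and the toy column names are hypothetical and are not part of the notebook.

```python
import pandas as pd

def move_target_last(df, target):
    # keep every column, but place the chosen target at the end
    others = [c for c in df.columns if c != target]
    return df[others + [target]]

toy = pd.DataFrame({
    'Rainfall': [0.0, 1.2],
    'Depth_P24': [-25.0, -25.1],
    'Depth_P25': [-24.0, -24.2],
})
reordered = move_target_last(toy, 'Depth_P24')
print(reordered.columns.tolist())
```

The cells below do the same thing by listing the columns explicitly with the target well at the end.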

In [None]:
class RF_Regression_Model:
    def __init__(self, df, columns, predictor):
        # keep only the requested columns; the target (predictor) must be the last one
        self.df = df[columns]
        self.columns = columns
        self.predictor = predictor

    # transform a time series dataset into a supervised learning dataset
    def series_to_supervised(self, data, n_in=1, n_out=1, dropnan=True):
        df = DataFrame(data)
        cols = list()
        # input sequence (t-n, ... t-1)
        for i in range(n_in, 0, -1):
            cols.append(df.shift(i))
        # forecast sequence (t, t+1, ... t+n)
        for i in range(0, n_out):
            cols.append(df.shift(-i))
        # put it all together
        agg = concat(cols, axis=1)
        # drop rows with NaN values
        if dropnan:
            agg.dropna(inplace=True)
        return agg.values

    # split a dataset into train/test sets
    def train_test_split(self, data, n_test):
        return data[:-n_test, :], data[-n_test:, :]

    # fit a random forest model and make a one-step prediction
    def random_forest_forecast(self, train, testX):
        # transform list into array
        train = asarray(train)
        # split into input and output columns
        trainX, trainy = train[:, :-1], train[:, -1]
        # fit model
        model = RandomForestRegressor(n_estimators=100)
        model.fit(trainX, trainy)
        # make a one-step prediction
        yhat = model.predict([testX])
        return yhat[0]

    # walk-forward validation
    def walk_forward_validation(self, data, n_test):
        predictions = list()
        # split dataset
        train, test = self.train_test_split(data, n_test)
        # seed history with training dataset
        history = [x for x in train]
        # step over each time-step in the test set
        for i in range(len(test)):
            # split test row into input and output columns
            testX, testy = test[i, :-1], test[i, -1]
            # fit model on history and make a prediction
            yhat = self.random_forest_forecast(history, testX)
            # store forecast in list of predictions
            predictions.append(yhat)
            # add actual observation to history for the next loop
            history.append(test[i])
            # summarize progress
            print('Day %.0f  expected=%.1f  predicted=%.1f' % (i, testy, yhat))
        # estimate prediction error
        error = mean_absolute_error(test[:, -1], predictions)
        return error, test[:, -1], predictions

    def test_model(self):
        values = self.df.values
        # transform the time series data into supervised learning
        data = self.series_to_supervised(values)
        # evaluate on the last 30 rows
        mae, y, yhat = self.walk_forward_validation(data, 30)
        print('MAE: %.3f' % mae)
        # plot expected vs predicted
        fig = plt.figure(figsize=(15,8))
        pyplot.plot(y, label='Expected')
        pyplot.plot(yhat, label='Predicted')
        pyplot.legend()
        plt.title(f'Predicted {self.predictor}')
        pyplot.show()

    def forecast(self):
        values = self.df.values
        train = self.series_to_supervised(values, n_in=1)
        # split into input and output columns
        trainX, trainy = train[:, :-1], train[:, -1]
        # fit model
        model = RandomForestRegressor(n_estimators=100)
        model.fit(trainX, trainy)
        # construct an input for a new prediction from the last available input row
        row = trainX[-1:].flatten()
        yhat = model.predict(asarray([row]))
        print('Input: %s\n, Predicted: %.3f' % (row, yhat[0]))
        return yhat

    # run the validation, then make the final forecast
    def main(self):
        self.test_model()
        forecast = self.forecast()
        return forecast


In [None]:
# Reorder Columns to have the 'Depth' columns on the end
Petrignano_df = Petrignano_df[['year','month','day','Rainfall_Bastia_Umbra','Temperature_Bastia_Umbra',
                               'Temperature_Petrignano','Volume_C10_Petrignano','Hydrometry_Fiume_Chiascio_Petrignano',
                               'Depth_to_Groundwater_P25','Depth_to_Groundwater_P24']]
    
values = Petrignano_df.values
# transform the time series data into supervised learning
data = series_to_supervised(values, n_in=1)
# evaluate
mae, y, yhat = walk_forward_validation(data, 100)
print('MAE: %.3f' % mae)
# plot expected vs predicted
fig= plt.figure(figsize=(15,8))
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
plt.title('Predicted Depth to Groundwater at P24, Aquifer Petrignano')
pyplot.show()
    
    
Depth_to_Groundwater_P24_forecast = forecast(Petrignano_df)

In [None]:
class VAR_MODEL:
    
    def __init__(self,df):
        self.data = df
        #self.features = features
        #self.data = self.df.drop(['Date'], axis=1)
        #self.data.index = self.df.Date
        #self.data = self.data[features]
        
    def _train_fit(self):
        scaler = MinMaxScaler()
        scaled = scaler.fit_transform(self.data)
        #creating the train and validation set
        self.train = scaled[:int(0.8*(len(scaled)))]
        self.valid = scaled[int(0.8*(len(scaled))):]
        #self.train = self.data[:int(0.8*(len(self.data)))]
        #self.valid = self.data[int(0.8*(len(self.data))):]
        
        #fit the model
        self.model = VAR(endog=self.train)
        self.model_fit = self.model.fit()
        
    
    def _predict_(self):
        cols = self.data.columns

        # forecast the validation window, starting from the last k_ar training observations
        # (endog holds the training data; y is a deprecated alias for it)
        k_ar = self.model_fit.k_ar
        prediction = self.model_fit.forecast(self.model_fit.endog[-k_ar:], steps=len(self.valid))

        # convert the predictions and the scaled validation set to dataframes
        pred = pd.DataFrame(prediction, columns=cols)
        valid = pd.DataFrame(self.valid, columns=cols)

        print("Predicted values of test set")
        print(pred.head())
        print("----------------------------------")
        print("Valid (true) values of test set")
        print(valid.head())
        print("----------------------------------")

        # plot predicted vs. true values for each depth column
        for c in cols:
            if c.startswith('Depth'):
                fig = plt.figure(figsize=(10,3))
                plt.plot(pred[c], label=f'{c}_Prediction')
                plt.plot(valid[c], label=f'{c}_True')
                plt.legend()
                plt.title(f'Predicted {c} vs. True {c}')
                plt.show()

        # check rmse (in scaled units, since train and valid were MinMax-scaled)
        for c in cols:
            if c.startswith('Depth'):
                print('rmse value for', c, 'is : ', sqrt(mean_squared_error(valid[c], pred[c])))

        # make final forecast predictions for next day (fit on the unscaled data)
        model = VAR(endog=self.data)
        model_fit = model.fit()
        yhat = model_fit.forecast(model_fit.endog[-model_fit.k_ar:], steps=1)
        print("----------------------------------")
        print("Forecasted feature predictions for next time interval")
        print(yhat)
        return yhat
    
    def main(self):
        self._train_fit()
        yhat = self._predict_()
        return yhat
    


### AQUIFER DOGANELLA

### Forecasting Pozzo 7 - Doganella

In [None]:
# select the features, keeping the target (Depth_to_Groundwater_Pozzo_7) as the last column
Doganella_df = Doganella_df[['year','month','day','Rainfall_Monteporzio','Rainfall_Velletri','Volume_Pozzo_4',
                             'Volume_Pozzo_5+6','Volume_Pozzo_7','Volume_Pozzo_8','Volume_Pozzo_9','Temperature_Monteporzio','Temperature_Velletri',
                            'Depth_to_Groundwater_Pozzo_1','Depth_to_Groundwater_Pozzo_2','Depth_to_Groundwater_Pozzo_3','Depth_to_Groundwater_Pozzo_4',
                             'Depth_to_Groundwater_Pozzo_5','Depth_to_Groundwater_Pozzo_6','Depth_to_Groundwater_Pozzo_7']]

values = Doganella_df.values
# transform the time series data into supervised learning
data = series_to_supervised(values, n_in=1)
# Evaluate
mae, y, yhat = walk_forward_validation(data, 100)
print('MAE: %.3f' % mae)
# plot expected vs predicted
fig= plt.figure(figsize=(15,8))
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
plt.title('Predicted Depth to Groundwater at Pozzo 7, Aquifer Doganella')
pyplot.show()

Depth_to_Groundwater_Pozzo_7_forecast = forecast(Doganella_df)
print(Depth_to_Groundwater_Pozzo_7_forecast)

### Forecasting Pozzo 6 - Doganella

In [None]:
Doganella_df = Doganella_df[['year','month','day','Rainfall_Monteporzio','Rainfall_Velletri','Volume_Pozzo_4',
                             'Volume_Pozzo_5+6','Volume_Pozzo_7','Volume_Pozzo_8','Volume_Pozzo_9','Temperature_Monteporzio','Temperature_Velletri',
                            'Depth_to_Groundwater_Pozzo_1','Depth_to_Groundwater_Pozzo_2','Depth_to_Groundwater_Pozzo_3','Depth_to_Groundwater_Pozzo_4',
                             'Depth_to_Groundwater_Pozzo_5','Depth_to_Groundwater_Pozzo_7','Depth_to_Groundwater_Pozzo_6']]

values = Doganella_df.values
# transform the time series data into supervised learning
data = series_to_supervised(values, n_in=1)
# Evaluate
mae, y, yhat = walk_forward_validation(data, 50)
print('MAE: %.3f' % mae)
# plot expected vs predicted
fig= plt.figure(figsize=(15,8))
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
plt.title('Predicted Depth to Groundwater at Pozzo 6, Aquifer Doganella')
pyplot.show()

Depth_to_Groundwater_Pozzo_6_forecast = forecast(Doganella_df)
print(Depth_to_Groundwater_Pozzo_6_forecast)

### Forecasting Pozzo 5 - Doganella

In [None]:
Doganella_df = Doganella_df[['year','month','day','Rainfall_Monteporzio','Rainfall_Velletri','Volume_Pozzo_4',
                             'Volume_Pozzo_5+6','Volume_Pozzo_7','Volume_Pozzo_8','Volume_Pozzo_9','Temperature_Monteporzio','Temperature_Velletri',
                            'Depth_to_Groundwater_Pozzo_1','Depth_to_Groundwater_Pozzo_2','Depth_to_Groundwater_Pozzo_3','Depth_to_Groundwater_Pozzo_4',
                             'Depth_to_Groundwater_Pozzo_7','Depth_to_Groundwater_Pozzo_6','Depth_to_Groundwater_Pozzo_5']]

values = Doganella_df.values
# transform the time series data into supervised learning
data = series_to_supervised(values, n_in=1)
# Evaluate
mae, y, yhat = walk_forward_validation(data, 50)
print('MAE: %.3f' % mae)
# plot expected vs predicted
fig= plt.figure(figsize=(15,8))
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
plt.title('Predicted Depth to Groundwater at Pozzo 5, Aquifer Doganella')
pyplot.show()

Depth_to_Groundwater_Pozzo_5_forecast = forecast(Doganella_df)
print(Depth_to_Groundwater_Pozzo_5_forecast)

### Forecasting Pozzo 4 - Doganella

In [None]:
Doganella_df = Doganella_df[['year','month','day','Rainfall_Monteporzio','Rainfall_Velletri','Volume_Pozzo_4',
                             'Volume_Pozzo_5+6','Volume_Pozzo_7','Volume_Pozzo_8','Volume_Pozzo_9','Temperature_Monteporzio','Temperature_Velletri',
                            'Depth_to_Groundwater_Pozzo_1','Depth_to_Groundwater_Pozzo_2','Depth_to_Groundwater_Pozzo_3',
                             'Depth_to_Groundwater_Pozzo_7','Depth_to_Groundwater_Pozzo_6','Depth_to_Groundwater_Pozzo_5','Depth_to_Groundwater_Pozzo_4']]

values = Doganella_df.values
# transform the time series data into supervised learning
data = series_to_supervised(values, n_in=1)
# Evaluate
mae, y, yhat = walk_forward_validation(data, 50)
print('MAE: %.3f' % mae)
# plot expected vs predicted
fig= plt.figure(figsize=(15,8))
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
plt.title('Predicted Depth to Groundwater at Pozzo 4, Aquifer Doganella')
pyplot.show()

Depth_to_Groundwater_Pozzo_4_forecast = forecast(Doganella_df)
print(Depth_to_Groundwater_Pozzo_4_forecast)

### Forecasting Pozzo 3 - Doganella

In [None]:
Doganella_df = Doganella_df[['year','month','day','Rainfall_Monteporzio','Rainfall_Velletri','Volume_Pozzo_4',
                             'Volume_Pozzo_5+6','Volume_Pozzo_7','Volume_Pozzo_8','Volume_Pozzo_9','Temperature_Monteporzio','Temperature_Velletri',
                            'Depth_to_Groundwater_Pozzo_1','Depth_to_Groundwater_Pozzo_2',
                            'Depth_to_Groundwater_Pozzo_7','Depth_to_Groundwater_Pozzo_6','Depth_to_Groundwater_Pozzo_5','Depth_to_Groundwater_Pozzo_4','Depth_to_Groundwater_Pozzo_3']]

values = Doganella_df.values
# transform the time series data into supervised learning
data = series_to_supervised(values, n_in=1)
# Evaluate
mae, y, yhat = walk_forward_validation(data, 50)
print('MAE: %.3f' % mae)
# plot expected vs predicted
fig= plt.figure(figsize=(15,8))
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
plt.title('Predicted Depth to Groundwater at Pozzo 3, Aquifer Doganella')
pyplot.show()

Depth_to_Groundwater_Pozzo_3_forecast = forecast(Doganella_df)
print(Depth_to_Groundwater_Pozzo_3_forecast)

### Forecasting Pozzo 2 - Doganella

In [None]:
Doganella_df = Doganella_df[['year','month','day','Rainfall_Monteporzio','Rainfall_Velletri','Volume_Pozzo_4',
                             'Volume_Pozzo_5+6','Volume_Pozzo_7','Volume_Pozzo_8','Volume_Pozzo_9','Temperature_Monteporzio','Temperature_Velletri',
                            'Depth_to_Groundwater_Pozzo_1','Depth_to_Groundwater_Pozzo_7','Depth_to_Groundwater_Pozzo_6','Depth_to_Groundwater_Pozzo_5',
                             'Depth_to_Groundwater_Pozzo_4','Depth_to_Groundwater_Pozzo_3','Depth_to_Groundwater_Pozzo_2']]

values = Doganella_df.values
# transform the time series data into supervised learning
data = series_to_supervised(values, n_in=1)
# Evaluate
mae, y, yhat = walk_forward_validation(data, 50)
print('MAE: %.3f' % mae)
# plot expected vs predicted
fig= plt.figure(figsize=(15,8))
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
plt.title('Predicted Depth to Groundwater at Pozzo 2, Aquifer Doganella')
pyplot.show()

Depth_to_Groundwater_Pozzo_2_forecast = forecast(Doganella_df)
print(Depth_to_Groundwater_Pozzo_2_forecast)

The model on Pozzo 2 performs relatively poorly due to the sudden drop in the groundwater level. The model isn't particularly strong at anticipating these drastic changes.

### Forecasting Pozzo 1 - Doganella

In [None]:
Doganella_df = Doganella_df[['year','month','day','Rainfall_Monteporzio','Rainfall_Velletri','Volume_Pozzo_4',
                             'Volume_Pozzo_5+6','Volume_Pozzo_7','Volume_Pozzo_8','Volume_Pozzo_9','Temperature_Monteporzio','Temperature_Velletri',
                            'Depth_to_Groundwater_Pozzo_7','Depth_to_Groundwater_Pozzo_6','Depth_to_Groundwater_Pozzo_5',
                             'Depth_to_Groundwater_Pozzo_4','Depth_to_Groundwater_Pozzo_3','Depth_to_Groundwater_Pozzo_2','Depth_to_Groundwater_Pozzo_1']]

values = Doganella_df.values
# transform the time series data into supervised learning
data = series_to_supervised(values, n_in=1)
# Evaluate
mae, y, yhat = walk_forward_validation(data, 50)
print('MAE: %.3f' % mae)
# plot expected vs predicted
fig= plt.figure(figsize=(15,8))
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
plt.title('Predicted Depth to Groundwater at Pozzo 1, Aquifer Doganella')
pyplot.show()

Depth_to_Groundwater_Pozzo_1_forecast = forecast(Doganella_df)
print(Depth_to_Groundwater_Pozzo_1_forecast)

## Aquifer Auser

In [None]:
Auser_df.columns

### Forecasting Depth at DIEC, Aquifer Auser

In [None]:
Auser_df = Auser_df[['year', 'month','day','Rainfall_Gallicano', 'Rainfall_Pontetetto',
       'Rainfall_Monte_Serra', 'Rainfall_Orentano', 'Rainfall_Borgo_a_Mozzano',
       'Rainfall_Piaggione', 'Rainfall_Calavorno', 'Rainfall_Croce_Arcana',
       'Rainfall_Tereglio_Coreglia_Antelminelli','Rainfall_Fabbriche_di_Vallico',
        'Temperature_Orentano', 'Temperature_Monte_Serra',
       'Temperature_Ponte_a_Moriano', 'Temperature_Lucca_Orto_Botanico',
       'Volume_POL', 'Volume_CC1', 'Volume_CC2', 'Volume_CSA', 'Volume_CSAL',
       'Hydrometry_Monte_S_Quirico', 'Hydrometry_Piaggione',
        'Depth_to_Groundwater_LT2',
       'Depth_to_Groundwater_SAL', 'Depth_to_Groundwater_PAG',
       'Depth_to_Groundwater_CoS', 'Depth_to_Groundwater_DIEC']]
    
values = Auser_df.values
# transform the time series data into supervised learning
data = series_to_supervised(values, n_in=1)
# Evaluate
mae, y, yhat = walk_forward_validation(data, 50)
print('MAE: %.3f' % mae)
# plot expected vs predicted
fig= plt.figure(figsize=(15,8))
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
plt.title('Predicted Depth to Groundwater at DIEC, Aquifer Auser')
pyplot.show()

Depth_to_Groundwater_DIEC_forecast = forecast(Auser_df)
print(Depth_to_Groundwater_DIEC_forecast)

### Forecasting Depth at CoS, Aquifer Auser

In [None]:
Auser_df = Auser_df[['year', 'month','day','Rainfall_Gallicano', 'Rainfall_Pontetetto',
       'Rainfall_Monte_Serra', 'Rainfall_Orentano', 'Rainfall_Borgo_a_Mozzano',
       'Rainfall_Piaggione', 'Rainfall_Calavorno', 'Rainfall_Croce_Arcana',
       'Rainfall_Tereglio_Coreglia_Antelminelli','Rainfall_Fabbriche_di_Vallico',
        'Temperature_Orentano', 'Temperature_Monte_Serra',
       'Temperature_Ponte_a_Moriano', 'Temperature_Lucca_Orto_Botanico',
       'Volume_POL', 'Volume_CC1', 'Volume_CC2', 'Volume_CSA', 'Volume_CSAL',
       'Hydrometry_Monte_S_Quirico', 'Hydrometry_Piaggione',
        'Depth_to_Groundwater_LT2',
       'Depth_to_Groundwater_SAL', 'Depth_to_Groundwater_PAG',
       'Depth_to_Groundwater_DIEC','Depth_to_Groundwater_CoS']]
    
values = Auser_df.values
# transform the time series data into supervised learning
data = series_to_supervised(values, n_in=1)
# Evaluate
mae, y, yhat = walk_forward_validation(data, 50)
print('MAE: %.3f' % mae)
# plot expected vs predicted
fig= plt.figure(figsize=(15,8))
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
plt.title('Predicted Depth to Groundwater at CoS, Aquifer Auser')
pyplot.show()

Depth_to_Groundwater_CoS_forecast = forecast(Auser_df)
print(Depth_to_Groundwater_CoS_forecast)

### Forecasting Depth at PAG, Aquifer Auser

In [None]:
Auser_df = Auser_df[['year', 'month','day','Rainfall_Gallicano', 'Rainfall_Pontetetto',
       'Rainfall_Monte_Serra', 'Rainfall_Orentano', 'Rainfall_Borgo_a_Mozzano',
       'Rainfall_Piaggione', 'Rainfall_Calavorno', 'Rainfall_Croce_Arcana',
       'Rainfall_Tereglio_Coreglia_Antelminelli','Rainfall_Fabbriche_di_Vallico',
        'Temperature_Orentano', 'Temperature_Monte_Serra',
       'Temperature_Ponte_a_Moriano', 'Temperature_Lucca_Orto_Botanico',
       'Volume_POL', 'Volume_CC1', 'Volume_CC2', 'Volume_CSA', 'Volume_CSAL',
       'Hydrometry_Monte_S_Quirico', 'Hydrometry_Piaggione',
        'Depth_to_Groundwater_LT2',
       'Depth_to_Groundwater_SAL',
       'Depth_to_Groundwater_DIEC','Depth_to_Groundwater_CoS','Depth_to_Groundwater_PAG']]
    
values = Auser_df.values
# transform the time series data into supervised learning
data = series_to_supervised(values, n_in=1)
# Evaluate
mae, y, yhat = walk_forward_validation(data, 50)
print('MAE: %.3f' % mae)
# plot expected vs predicted
fig= plt.figure(figsize=(15,8))
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
plt.title('Predicted Depth to Groundwater at PAG, Aquifer Auser')
pyplot.show()

Depth_to_Groundwater_PAG_forecast = forecast(Auser_df)
print(Depth_to_Groundwater_PAG_forecast)

### Forecasting Depth at SAL, Aquifer Auser

In [None]:
Auser_df = Auser_df[['year', 'month','day','Rainfall_Gallicano', 'Rainfall_Pontetetto',
       'Rainfall_Monte_Serra', 'Rainfall_Orentano', 'Rainfall_Borgo_a_Mozzano',
       'Rainfall_Piaggione', 'Rainfall_Calavorno', 'Rainfall_Croce_Arcana',
       'Rainfall_Tereglio_Coreglia_Antelminelli','Rainfall_Fabbriche_di_Vallico',
        'Temperature_Orentano', 'Temperature_Monte_Serra',
       'Temperature_Ponte_a_Moriano', 'Temperature_Lucca_Orto_Botanico',
       'Volume_POL', 'Volume_CC1', 'Volume_CC2', 'Volume_CSA', 'Volume_CSAL',
       'Hydrometry_Monte_S_Quirico', 'Hydrometry_Piaggione',
        'Depth_to_Groundwater_LT2',
       'Depth_to_Groundwater_DIEC','Depth_to_Groundwater_CoS','Depth_to_Groundwater_PAG',
        'Depth_to_Groundwater_SAL']]
    
values = Auser_df.values
# transform the time series data into supervised learning
data = series_to_supervised(values, n_in=1)
# Evaluate
mae, y, yhat = walk_forward_validation(data, 50)
print('MAE: %.3f' % mae)
# plot expected vs predicted
fig= plt.figure(figsize=(15,8))
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
plt.title('Predicted Depth to Groundwater at SAL, Aquifer Auser')
pyplot.show()

Depth_to_Groundwater_SAL_forecast = forecast(Auser_df)
print(Depth_to_Groundwater_SAL_forecast)

### Forecasting Depth at LT2, Aquifer Auser

In [None]:
Auser_df = Auser_df[['year', 'month','day','Rainfall_Gallicano', 'Rainfall_Pontetetto',
       'Rainfall_Monte_Serra', 'Rainfall_Orentano', 'Rainfall_Borgo_a_Mozzano',
       'Rainfall_Piaggione', 'Rainfall_Calavorno', 'Rainfall_Croce_Arcana',
       'Rainfall_Tereglio_Coreglia_Antelminelli','Rainfall_Fabbriche_di_Vallico',
        'Temperature_Orentano', 'Temperature_Monte_Serra',
       'Temperature_Ponte_a_Moriano', 'Temperature_Lucca_Orto_Botanico',
       'Volume_POL', 'Volume_CC1', 'Volume_CC2', 'Volume_CSA', 'Volume_CSAL',
       'Hydrometry_Monte_S_Quirico', 'Hydrometry_Piaggione',
       'Depth_to_Groundwater_DIEC','Depth_to_Groundwater_CoS','Depth_to_Groundwater_PAG',
        'Depth_to_Groundwater_SAL','Depth_to_Groundwater_LT2']]
    
values = Auser_df.values
# transform the time series data into supervised learning
data = series_to_supervised(values, n_in=1)
# Evaluate
mae, y, yhat = walk_forward_validation(data, 50)
print('MAE: %.3f' % mae)
# plot expected vs predicted
fig= plt.figure(figsize=(15,8))
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
plt.title('Predicted Depth to Groundwater at LT2, Aquifer Auser')
pyplot.show()

Depth_to_Groundwater_LT2_forecast = forecast(Auser_df)
print(Depth_to_Groundwater_LT2_forecast)

# RIVERS

## Arno River

In [None]:
River_Arno = pd.read_csv('/kaggle/input/acea-water-prediction/River_Arno.csv')
River_Arno.info()

In [None]:
date_time = pd.to_datetime(River_Arno.Date, format='%d/%m/%Y')
plot_cols = River_Arno.iloc[:,1:-1].columns
plot_features = River_Arno[plot_cols]
plot_features.index = date_time
_ = plot_features.plot(subplots=True,figsize=(15,20))

In [None]:
class Data_Exploration_Rivers:
    
    def __init__(self,df,features,aquifer=None,PCA=False):
        self.df = df
        self.features = features
        self.nrows = len(features)
        self.aquifer = aquifer
        self.PCA = PCA
        
    def scaler(self):
        data = self.df.copy()
        scaler = StandardScaler()
        scaled_data = scaler.fit_transform(data)
        return scaled_data
    

        
    def line_plots(self):
        depths = [f for f in self.df.columns if f.startswith('Hydrometry')]
        for f in depths:
            depth_array = self.df[f].values
            _index_array = np.array(self.df.index)
            normalized_depth = preprocessing.normalize([depth_array])
            normalized_depth = pd.Series(normalized_depth[0])

            for feature in self.features:
                feat_values = self.df['{}'.format(feature)].values
                normalized_feature = preprocessing.normalize([feat_values])
                normalized_feature = normalized_feature[0]
                normalized_feature = pd.Series(normalized_feature)
                #the_array = np.hstack((_index_array, normalized_depth,normalized_feature))
                normalized_df = pd.DataFrame(data= {'Date':self.df.index,f'{f}':normalized_depth,f'{feature}':normalized_feature})
                fig= plt.figure(figsize=(10,3))
                plt.plot(normalized_df[f'{f}'], label=f'{f}')
                plt.plot(normalized_df[f'{feature}'], label=str.capitalize(feature))
                plt.legend()
                plt.title(f'{feature} vs. {f} (Normalized)')
                plt.show()
        
    def hist_plot(self):
        # One distribution chart per column. Missing values are dropped:
        # filling them with np.inf makes the histogram range non-finite.
        print("DISTRIBUTION CHARTS=====================================")    
        for feat in self.df.columns:
            fig = plt.figure(figsize=(10,3))
            sns.distplot(self.df[feat].dropna(), color='indianred')
            plt.title(f'{str.capitalize(feat)}: ', fontsize=14)
            plt.tight_layout()
            plt.show()
            
    def hist_plots(self):
        # Same charts as hist_plot, stacked in one figure, one row per feature
        f, ax = plt.subplots(nrows=self.nrows, ncols=1, figsize=(10, 35))
        for i,feat in enumerate(self.features):
            sns.distplot(self.df[feat].dropna(), ax=ax[i], color='indianred')
            ax[i].set_title(f'{str.capitalize(feat)}: ', fontsize=14)
        plt.tight_layout()
        plt.show()
        
    def stationarity(self):
        # Augmented Dickey-Fuller test: a small p-value (e.g. < 0.05) rejects
        # the unit-root null hypothesis, i.e. suggests the series is stationary
        def interpret_dftest(dftest):
            dfoutput = pd.Series(dftest[0:2], index=['Test Statistic','p-value'])
            return dfoutput

        for feature in self.features:
            feat = str(feature).replace("_", " ")
            print(f"{str.capitalize(feat)}:")
            # adfuller cannot handle missing values, so drop them first
            print(interpret_dftest(adfuller(self.df[feature].dropna())))
            print("--------------------------------------------------------")
            
        
    def plot_corr_matrix(self):
        df = self.df.copy()
        # set_index returns a new frame, so the result has to be assigned;
        # 'Date' may already have been removed by the preprocessor
        if 'Date' in df.columns:
            df = df.set_index('Date')
        fig = plt.figure(figsize=(15,15))
        corrMatrix = df.corr()
        sns.heatmap(corrMatrix, annot=True)
        plt.show()
                                                     
    def plot_auto_correlation(self):
        for col in self.df.columns:
            if col.startswith('Hydrometry'):
                plot_acf(self.df[f'{col}'])
                plot_pacf(self.df[f'{col}'])
        
    
    def _PCA_(self):
        # Project the data onto two principal components and plot them
        # against each Hydrometry column in a 3D scatter plot
        df = self.df.drop(columns="Date", errors="ignore")
        X_reduced = PCA(n_components=2).fit_transform(df)
        # Assign the arrays directly to avoid index misalignment
        df['PCA1'] = X_reduced[:, 0]
        df['PCA2'] = X_reduced[:, 1]
        for col in df.columns: 
            if col.startswith('Hydrometry'):
                # The original code used `ax` without creating it
                fig = plt.figure(figsize=(10, 8))
                ax = fig.add_subplot(111, projection='3d')
                xval = preprocessing.normalize([df['PCA1'].values])[0]
                yval = preprocessing.normalize([df['PCA2'].values])[0]
                zval = preprocessing.normalize([df[col].values])[0]
                ax.set_xlabel('PCA1')
                ax.set_ylabel('PCA2')
                ax.set_zlabel(f'{col}')
                ax.scatter(xval, yval, zval, c=df[col])
                plt.show()
        

    def main(self):
        self.line_plots()
        self.hist_plots()
        self.stationarity()
        self.plot_auto_correlation()
        if self.PCA:
            self._PCA_()
        self.plot_corr_matrix()

Temperature, a key feature, is missing for the earlier years of this dataset. We therefore pick a period in which every station has measurements for every variable (2004 to 2006), train a model on it, predict the Hydrometry for the first day of 2007, and compare the prediction with the actual value. The model can be retrained for a later date when better data is available.
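One way to choose such a period without reading it off the plots is to search for the longest run of rows in which none of the selected columns is missing. The sketch below is only an illustration: `longest_complete_window` is a hypothetical helper that is not part of this notebook, and the toy frame is made up rather than taken from `River_Arno.csv`.

```python
import numpy as np
import pandas as pd


def longest_complete_window(frame, columns):
    """Return (start, end) index labels of the longest run of rows
    in which none of `columns` is NaN, or None if there is no such row."""
    complete = frame[columns].notna().all(axis=1).to_numpy()
    best_len, best_start, run_start = 0, None, None
    for i, ok in enumerate(complete):
        if ok and run_start is None:
            run_start = i  # a new run of complete rows begins here
        if run_start is not None and (not ok or i == len(complete) - 1):
            run_end = i if ok else i - 1  # last complete row of the run
            if run_end - run_start + 1 > best_len:
                best_len, best_start = run_end - run_start + 1, run_start
            run_start = None
    if best_start is None:
        return None
    return frame.index[best_start], frame.index[best_start + best_len - 1]


# Toy data for illustration only (hypothetical column names and values)
dates = pd.date_range("2004-01-01", periods=8, freq="D")
toy = pd.DataFrame(
    {
        "Rainfall_A": [np.nan, 0.0, 1.2, 0.0, 3.1, 0.0, np.nan, 0.4],
        "Temperature_B": [np.nan, np.nan, 5.0, 6.1, 4.8, 5.5, 6.0, 6.2],
    },
    index=dates,
)
window = longest_complete_window(toy, ["Rainfall_A", "Temperature_B"])
print(window)
```

On the real data the frame would need the parsed `Date` column as its index, so that the returned labels are dates.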

In [None]:
df = River_Arno.copy()
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
# Keep 2004-2006; ISO date strings avoid any day/month ambiguity
df = df.loc[(df['Date'] >= '2004-01-01') & (df['Date'] < '2007-01-01')]

In [None]:
features = ['Rainfall_Le_Croci', 'Rainfall_Cavallina', 'Rainfall_S_Agata',
       'Rainfall_Mangona', 'Rainfall_S_Piero', 'Rainfall_Vernio',
       'Rainfall_Stia', 'Rainfall_Consuma', 'Rainfall_Incisa',
       'Rainfall_Montevarchi', 'Rainfall_S_Savino', 'Rainfall_Laterina',
       'Rainfall_Bibbiena', 'Rainfall_Camaldoli', 'Temperature_Firenze',
       'Hydrometry_Nave_di_Rosano']

In [None]:
processor = MyPreprocessor(df,features)
Arno_df = processor.main()

explorer = Data_Exploration_Rivers(Arno_df,features,'River Arno',PCA=False)
explorer.main()

In [None]:
Arno_df = Arno_df[features]

**The charts show that Hydrometry is strongly correlated with Rainfall and Temperature.**
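A river level usually responds to rain with some delay, so a same-day correlation can understate the relationship. The sketch below shows one way to check this with lagged correlations. It runs on synthetic data, and both `lagged_correlations` and the toy column names are assumptions made for illustration, not part of this notebook.

```python
import numpy as np
import pandas as pd


def lagged_correlations(frame, driver, target, max_lag=5):
    """Pearson correlation between target[t] and driver[t - lag], lag = 0..max_lag."""
    return pd.Series(
        {lag: frame[target].corr(frame[driver].shift(lag)) for lag in range(max_lag + 1)}
    )


# Synthetic example: the "river level" reacts to rain two days later
rng = np.random.default_rng(0)
rain = pd.Series(rng.gamma(shape=0.5, scale=4.0, size=400))
level = 1.0 + 0.3 * rain.shift(2).fillna(0.0) + rng.normal(0.0, 0.05, size=400)
toy = pd.DataFrame({"Rainfall_toy": rain, "Hydrometry_toy": level})

corrs = lagged_correlations(toy, "Rainfall_toy", "Hydrometry_toy")
print(corrs.round(2))
print("Lag with the strongest correlation:", corrs.idxmax())
```

Applied to `Arno_df`, the same function could be called once per rainfall station with `'Hydrometry_Nave_di_Rosano'` as the target.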

In [None]:
values = Arno_df.values
# transform the time series data into supervised learning
data = series_to_supervised(values, n_in=1)
# Evaluate
mae, y, yhat = walk_forward_validation(data, 50)
print('MAE: %.3f' % mae)
# plot expected vs predicted
fig= plt.figure(figsize=(15,8))
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
plt.title('Predicted Hydrometry of Arno River at Nave di Rosano')
pyplot.show()

Arno_River_Hydrometry_forcast = forecast(Arno_df)
print(Arno_River_Hydrometry_forcast)

In [None]:
# The training window ends on 31/12/2006, so the forecast is for 01/01/2007
print("01/01/2007")
print(River_Arno[River_Arno['Date']=='01/01/2007']['Hydrometry_Nave_di_Rosano'].values)
print(f"Prediction: {Arno_River_Hydrometry_forcast}")

We can use the model to predict later dates once we have sufficient data to train on. For now, as an example, this is a prediction for 1 January 2007.
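A single-day comparison says little about how good the model is. A common sanity check is to compare the walk-forward MAE with a naive persistence baseline that predicts each value as the previous day's value. The sketch below is a minimal illustration on made-up numbers; `persistence_mae` is a hypothetical helper, not a function defined in this notebook.

```python
import numpy as np


def persistence_mae(series, n_test):
    """MAE of the naive forecast y_hat[t] = y[t-1] over the last n_test points.

    Requires n_test < len(series).
    """
    series = np.asarray(series, dtype=float)
    actual = series[-n_test:]
    predicted = series[-n_test - 1:-1]  # each value shifted back by one step
    return float(np.mean(np.abs(actual - predicted)))


# Made-up river levels, for illustration only
toy_levels = [1.0, 1.2, 1.1, 1.5, 1.4, 1.4]
baseline = persistence_mae(toy_levels, n_test=3)
print('Persistence MAE: %.3f' % baseline)
```

If the model's MAE over the same test points is not clearly below this baseline, the features are adding little beyond yesterday's value.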

# LAKES

## Lake Bilancino

In [None]:
Lake_Bilancino = pd.read_csv('/kaggle/input/acea-water-prediction/Lake_Bilancino.csv')

date_time = pd.to_datetime(Lake_Bilancino.Date, format='%d/%m/%Y')
plot_cols = Lake_Bilancino.iloc[:,1:-1].columns
plot_features = Lake_Bilancino[plot_cols]
plot_features.index = date_time
_ = plot_features.plot(subplots=True,figsize=(15,20))

In [None]:
class Data_Exploration:
    
    def __init__(self,df=None,features=None,targets=None,aquifer=None,PCA=False):
        self.df = df
        self.features = features
        self.nrows = len(features)
        self.aquifer = aquifer
        self.PCA = PCA
        self.targets=targets
        
    def scaler(self):
        data = self.df.copy()
        scaler = StandardScaler()
        scaled_data = scaler.fit_transform(data)
        return scaled_data
    

        
    def line_plots(self):
        for target in self.targets:
            levels = [f for f in self.df.columns if f.startswith(f'{target}')]
            for f in levels:
                depth_array = self.df[f].values
                _index_array = np.array(self.df.index)
                normalized_depth = preprocessing.normalize([depth_array])
                normalized_depth = pd.Series(normalized_depth[0])

                for feature in self.features:
                    feat_values = self.df['{}'.format(feature)].values
                    normalized_feature = preprocessing.normalize([feat_values])
                    normalized_feature = normalized_feature[0]
                    normalized_feature = pd.Series(normalized_feature)
                    #the_array = np.hstack((_index_array, normalized_depth,normalized_feature))
                    normalized_df = pd.DataFrame(data= {'Date':self.df.index,f'{f}':normalized_depth,f'{feature}':normalized_feature})
                    fig= plt.figure(figsize=(10,3))
                    plt.plot(normalized_df[f'{f}'], label=f'{f}')
                    plt.plot(normalized_df[f'{feature}'], label=str.capitalize(feature))
                    plt.legend()
                    plt.title(f'{feature} vs. {f} (Normalized)')
                    plt.show()
        
    def hist_plot(self):
        # One distribution chart per column. Missing values are dropped:
        # filling them with np.inf makes the histogram range non-finite.
        print("DISTRIBUTION CHARTS=====================================")    
        for feat in self.df.columns:
            fig = plt.figure(figsize=(10,3))
            sns.distplot(self.df[feat].dropna(), color='indianred')
            plt.title(f'{str.capitalize(feat)}: ', fontsize=14)
            plt.tight_layout()
            plt.show()
            
    def hist_plots(self):
        # Same charts as hist_plot, stacked in one figure, one row per feature
        f, ax = plt.subplots(nrows=self.nrows, ncols=1, figsize=(10, 35))
        for i,feat in enumerate(self.features):
            sns.distplot(self.df[feat].dropna(), ax=ax[i], color='indianred')
            ax[i].set_title(f'{str.capitalize(feat)}: ', fontsize=14)
        plt.tight_layout()
        plt.show()
        
    def stationarity(self):
        # Augmented Dickey-Fuller test: a small p-value (e.g. < 0.05) rejects
        # the unit-root null hypothesis, i.e. suggests the series is stationary
        def interpret_dftest(dftest):
            dfoutput = pd.Series(dftest[0:2], index=['Test Statistic','p-value'])
            return dfoutput

        for feature in self.features:
            feat = str(feature).replace("_", " ")
            print(f"{str.capitalize(feat)}:")
            # adfuller cannot handle missing values, so drop them first
            print(interpret_dftest(adfuller(self.df[feature].dropna())))
            print("--------------------------------------------------------")
            
        
    def plot_corr_matrix(self):
        df = self.df.copy()
        # set_index returns a new frame, so the result has to be assigned;
        # 'Date' may already have been removed by the preprocessor
        if 'Date' in df.columns:
            df = df.set_index('Date')
        fig = plt.figure(figsize=(15,15))
        corrMatrix = df.corr()
        sns.heatmap(corrMatrix, annot=True)
        plt.show()
                                                     
    def plot_auto_correlation(self):
        # ACF/PACF for every column that starts with one of the target names
        prefixes = tuple(self.targets or ())
        for col in self.df.columns:
            if prefixes and col.startswith(prefixes):
                plot_acf(self.df[col])
                plot_pacf(self.df[col])
        
    
    def _PCA_(self):
        # Project the data onto two principal components and plot them
        # against each target column in a 3D scatter plot
        df = self.df.drop(columns="Date", errors="ignore")
        X_reduced = PCA(n_components=2).fit_transform(df)
        # Assign the arrays directly to avoid index misalignment
        df['PCA1'] = X_reduced[:, 0]
        df['PCA2'] = X_reduced[:, 1]
        prefixes = tuple(self.targets or ())
        for col in df.columns: 
            if prefixes and col.startswith(prefixes):
                # The original code used `ax` without creating it
                fig = plt.figure(figsize=(10, 8))
                ax = fig.add_subplot(111, projection='3d')
                xval = preprocessing.normalize([df['PCA1'].values])[0]
                yval = preprocessing.normalize([df['PCA2'].values])[0]
                zval = preprocessing.normalize([df[col].values])[0]
                ax.set_xlabel('PCA1')
                ax.set_ylabel('PCA2')
                ax.set_zlabel(f'{col}')
                ax.scatter(xval, yval, zval, c=df[col])
                plt.show()
        

    def main(self):
        self.line_plots()
        self.hist_plots()
        self.stationarity()
        self.plot_auto_correlation()
        if self.PCA:
            self._PCA_()
        self.plot_corr_matrix()

In [None]:
# Filter for 2004 and later
df = Lake_Bilancino.copy()
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df = df.loc[df['Date'] >= '2004-01-01']

In [None]:
features = [f for f in df.columns if (f.startswith('Depth')==False and f !='Date')]
processor = MyPreprocessor(df,features=features,get_averages=False)
Bilancino_df = processor.main()

In [None]:
features = ['Rainfall_S_Piero', 'Rainfall_Mangona', 'Rainfall_S_Agata',
       'Rainfall_Cavallina', 'Rainfall_Le_Croci', 'Temperature_Le_Croci',
       'Lake_Level', 'Flow_Rate']

# Run the data explorer on the Lake Bilancino data
explorer = Data_Exploration(df=Bilancino_df,features=features,targets=['Flow_Rate','Lake_Level'])
explorer.main()

### Observations

* Flow rates are higher in the winter months: the spikes in flow rate coincide with low temperatures.
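The seasonal pattern can also be checked numerically by averaging the flow rate per calendar month. The sketch below assumes a `month` column like the one selected from `Bilancino_df` further down; `monthly_mean` is a hypothetical helper and the values are made up for illustration.

```python
import pandas as pd


def monthly_mean(frame, column, month_column="month"):
    """Average of `column` for each calendar month."""
    return frame.groupby(month_column)[column].mean()


# Made-up values, for illustration only
toy = pd.DataFrame(
    {
        "month": [1, 1, 2, 7, 7, 8],
        "Flow_Rate": [8.0, 6.0, 5.0, 1.0, 2.0, 1.5],
    }
)
by_month = monthly_mean(toy, "Flow_Rate")
winter = by_month.loc[[1, 2]].mean()
summer = by_month.loc[[7, 8]].mean()
print(by_month)
print(f"Winter mean: {winter:.2f}, summer mean: {summer:.2f}")
```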

In [None]:
# Reorder the columns so that the forecast target (Flow_Rate) comes last
Bilancino_df = Bilancino_df[['year', 'month', 'day','Rainfall_S_Piero', 'Rainfall_Mangona', 'Rainfall_S_Agata',
       'Rainfall_Cavallina', 'Rainfall_Le_Croci', 'Temperature_Le_Croci',
       'Lake_Level', 'Flow_Rate']]


values = Bilancino_df.values
# transform the time series data into supervised learning
data = series_to_supervised(values, n_in=1)
# Evaluate
mae, y, yhat = walk_forward_validation(data, 50)
print('MAE: %.3f' % mae)
# plot expected vs predicted
fig= plt.figure(figsize=(15,8))
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
plt.title('Predicted Hydrometry (Flow Rate) of Lake Bilancino')
pyplot.show()

Lake_Bilancino_Flow_Rate_forcast = forecast(Bilancino_df)
print(Lake_Bilancino_Flow_Rate_forcast)

The model does not predict Flow Rate very well. We may have to scale our data.
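If we do scale the data, the scaling statistics should come from the training rows only, so that no information from the test period leaks into the model. The sketch below shows this with plain NumPy on made-up numbers; `standardize_with_train_stats` is a hypothetical helper, not part of this notebook. Whether scaling helps at all depends on the model: tree ensembles are largely insensitive to monotonic rescaling of the features.

```python
import numpy as np


def standardize_with_train_stats(train, test):
    """Scale both arrays with the mean and standard deviation of `train` only."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # constant columns: avoid dividing by zero
    return (train - mean) / std, (test - mean) / std


# Made-up feature rows: the second column is constant in the training part
train_rows = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]
test_rows = [[6.0, 12.0]]
train_scaled, test_scaled = standardize_with_train_stats(train_rows, test_rows)
print(train_scaled)
print(test_scaled)
```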

## Forecast Lake Level - Bilancino

In [None]:
# Reorder the columns so that the forecast target (Lake_Level) comes last
Bilancino_df = Bilancino_df[['year', 'month', 'day','Rainfall_S_Piero', 'Rainfall_Mangona', 'Rainfall_S_Agata',
       'Rainfall_Cavallina', 'Rainfall_Le_Croci', 'Temperature_Le_Croci',
       'Flow_Rate','Lake_Level']]


values = Bilancino_df.values
# transform the time series data into supervised learning
data = series_to_supervised(values, n_in=1)
# Evaluate
mae, y, yhat = walk_forward_validation(data, 50)
print('MAE: %.3f' % mae)
# plot expected vs predicted
fig= plt.figure(figsize=(15,8))
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
plt.title('Predicted Hydrometry (Lake Level) of Lake Bilancino')
pyplot.show()

Lake_Bilancino_Lake_Level_forcast = forecast(Bilancino_df)
print(Lake_Bilancino_Lake_Level_forcast)

# SPRINGS

In [None]:
Spring_Amiata = pd.read_csv('/kaggle/input/acea-water-prediction/Water_Spring_Amiata.csv')
Spring_Madonna_di_Canneto = pd.read_csv('/kaggle/input/acea-water-prediction/Water_Spring_Madonna_di_Canneto.csv')
Spring_Lupa = pd.read_csv('/kaggle/input/acea-water-prediction/Water_Spring_Lupa.csv')

In [None]:
date_time = pd.to_datetime(Spring_Amiata.Date, format='%d/%m/%Y')
plot_cols = Spring_Amiata.iloc[:,1:-1].columns
plot_features = Spring_Amiata[plot_cols]
plot_features.index = date_time
_ = plot_features.plot(subplots=True,figsize=(15,20))

In [None]:
# Filter for 2016 and later
df = Spring_Amiata.copy()
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df = df.loc[df['Date'] >= '2016-01-01']

features = [f for f in df.columns if (f.startswith('Depth')==False and f !='Date')]
processor = MyPreprocessor(df,features=features,get_averages=False)
Spring_Amiata_df = processor.main()

In [None]:
Spring_Amiata_df.info()

In [None]:
features = ['year', 'month',
       'day','Rainfall_Castel_del_Piano', 'Rainfall_Abbadia_S_Salvatore',
       'Rainfall_S_Fiora', 'Rainfall_Laghetto_Verde', 'Rainfall_Vetta_Amiata',
       'Depth_to_Groundwater_S_Fiora_8', 'Depth_to_Groundwater_S_Fiora_11bis',
       'Depth_to_Groundwater_David_Lazzaretti',
       'Temperature_Abbadia_S_Salvatore', 'Temperature_S_Fiora',
       'Temperature_Laghetto_Verde', 'Flow_Rate_Bugnano', 'Flow_Rate_Arbure',
       'Flow_Rate_Ermicciolo', 'Flow_Rate_Galleria_Alta']
    
targets = ['Flow_Rate_Bugnano', 'Flow_Rate_Arbure',
       'Flow_Rate_Ermicciolo', 'Flow_Rate_Galleria_Alta']

# Run the data explorer on the Spring Amiata data
explorer = Data_Exploration(df=Spring_Amiata_df,features=features,targets=targets)
explorer.main()

### Flow Rate Galleria Alta Forecast

In [None]:
Spring_Amiata_df = Spring_Amiata_df[features]

values = Spring_Amiata_df.values
# transform the time series data into supervised learning
data = series_to_supervised(values, n_in=1)
# Evaluate
mae, y, yhat = walk_forward_validation(data, 50)
print('MAE: %.3f' % mae)
# plot expected vs predicted
fig= plt.figure(figsize=(15,8))
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
plt.title('Predicted Flow Rate Galleria Alta, Spring Amiata')
pyplot.show()

Flow_Rate_Galleria_Alta_forcast = forecast(Spring_Amiata_df)
print(Flow_Rate_Galleria_Alta_forcast)

### Flow Rate Ermicciolo Forecast

In [None]:
Spring_Amiata_df = Spring_Amiata_df[['year', 'month',
       'day','Rainfall_Castel_del_Piano', 'Rainfall_Abbadia_S_Salvatore',
       'Rainfall_S_Fiora', 'Rainfall_Laghetto_Verde', 'Rainfall_Vetta_Amiata',
       'Depth_to_Groundwater_S_Fiora_8', 'Depth_to_Groundwater_S_Fiora_11bis',
       'Depth_to_Groundwater_David_Lazzaretti',
       'Temperature_Abbadia_S_Salvatore', 'Temperature_S_Fiora',
       'Temperature_Laghetto_Verde', 'Flow_Rate_Bugnano', 'Flow_Rate_Arbure',
       'Flow_Rate_Galleria_Alta','Flow_Rate_Ermicciolo']]

values = Spring_Amiata_df.values
# transform the time series data into supervised learning
data = series_to_supervised(values, n_in=1)
# Evaluate
mae, y, yhat = walk_forward_validation(data, 50)
print('MAE: %.3f' % mae)
# plot expected vs predicted
fig= plt.figure(figsize=(15,8))
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
plt.title('Predicted Flow Rate Ermicciolo, Spring Amiata')
pyplot.show()

Flow_Rate_Ermicciolo_forcast = forecast(Spring_Amiata_df)
print(Flow_Rate_Ermicciolo_forcast)

### Flow Rate Arbure Forecast

In [None]:
Spring_Amiata_df = Spring_Amiata_df[['year', 'month',
       'day','Rainfall_Castel_del_Piano', 'Rainfall_Abbadia_S_Salvatore',
       'Rainfall_S_Fiora', 'Rainfall_Laghetto_Verde', 'Rainfall_Vetta_Amiata',
       'Depth_to_Groundwater_S_Fiora_8', 'Depth_to_Groundwater_S_Fiora_11bis',
       'Depth_to_Groundwater_David_Lazzaretti',
       'Temperature_Abbadia_S_Salvatore', 'Temperature_S_Fiora',
       'Temperature_Laghetto_Verde', 'Flow_Rate_Bugnano',
       'Flow_Rate_Galleria_Alta','Flow_Rate_Ermicciolo', 'Flow_Rate_Arbure']]

values = Spring_Amiata_df.values
# transform the time series data into supervised learning
data = series_to_supervised(values, n_in=1)
# Evaluate
mae, y, yhat = walk_forward_validation(data, 50)
print('MAE: %.3f' % mae)
# plot expected vs predicted
fig= plt.figure(figsize=(15,8))
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
plt.title('Predicted Flow Rate Arbure, Spring Amiata')
pyplot.show()

Flow_Rate_Arbure_forcast = forecast(Spring_Amiata_df)
print(Flow_Rate_Arbure_forcast)

### Flow Rate Bugnano Forecast

In [None]:
Spring_Amiata_df = Spring_Amiata_df[['year', 'month',
       'day','Rainfall_Castel_del_Piano', 'Rainfall_Abbadia_S_Salvatore',
       'Rainfall_S_Fiora', 'Rainfall_Laghetto_Verde', 'Rainfall_Vetta_Amiata',
       'Depth_to_Groundwater_S_Fiora_8', 'Depth_to_Groundwater_S_Fiora_11bis',
       'Depth_to_Groundwater_David_Lazzaretti',
       'Temperature_Abbadia_S_Salvatore', 'Temperature_S_Fiora',
       'Temperature_Laghetto_Verde',
       'Flow_Rate_Galleria_Alta','Flow_Rate_Ermicciolo', 'Flow_Rate_Arbure','Flow_Rate_Bugnano']]

values = Spring_Amiata_df.values
# transform the time series data into supervised learning
data = series_to_supervised(values, n_in=1)
# Evaluate
mae, y, yhat = walk_forward_validation(data, 50)
print('MAE: %.3f' % mae)
# plot expected vs predicted
fig= plt.figure(figsize=(15,8))
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
plt.title('Predicted Flow Rate Bugnano, Spring Amiata')
pyplot.show()

Flow_Rate_Bugnano_forcast = forecast(Spring_Amiata_df)
print(Flow_Rate_Bugnano_forcast)

## Spring Lupa

In [None]:
Spring_Lupa

In [None]:
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(16,5))
# Rows are fields, columns are records; dark cells mark missing values
sns.heatmap(Spring_Lupa.T.isna(), cmap='Blues', ax=ax)
ax.set_title('Fields with Missing Values', fontsize=16)
# Tick.label is deprecated in Matplotlib; tick_params sets the label size directly
ax.tick_params(axis='y', labelsize=14)
plt.show()

In [None]:
Spring_Lupa = pd.read_csv('/kaggle/input/acea-water-prediction/Water_Spring_Lupa.csv')

date_time = pd.to_datetime(Spring_Lupa.Date, format='%d/%m/%Y')
plot_cols = Spring_Lupa.iloc[:,1:-1].columns
plot_features = Spring_Lupa[plot_cols]
plot_features.index = date_time
_ = plot_features.plot(subplots=True,figsize=(15,20))

In [None]:
# Parse the dates; the full history is kept (the 2016 filter is disabled)
df = Spring_Lupa.copy()
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
#df = df.loc[(df['Date']>='2016-01-01')]

features = [f for f in df.columns if (f !='Date')]
processor = MyPreprocessor(df,features=features,get_averages=False)
Spring_Lupa_df = processor.main()

In [None]:
Spring_Lupa_df.info()

In [None]:
features = ['year', 'month',
       'day','Rainfall_Terni','Flow_Rate_Lupa']
    
targets = ['Flow_Rate_Lupa']

# Run the data explorer on the Spring Lupa data
explorer = Data_Exploration(df=Spring_Lupa_df,features=features,targets=targets)
explorer.main()

In [None]:
Spring_Lupa_df = Spring_Lupa_df[features]

values = Spring_Lupa_df.values
# transform the time series data into supervised learning
data = series_to_supervised(values, n_in=1)
# Evaluate
mae, y, yhat = walk_forward_validation(data, 50)
print('MAE: %.3f' % mae)
# plot expected vs predicted
fig= plt.figure(figsize=(15,8))
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
plt.title('Predicted Flow Rate Lupa Spring')
pyplot.show()

Flow_Rate_Lupa_forcast = forecast(Spring_Lupa_df)
print(Flow_Rate_Lupa_forcast)

# Final Results

In [None]:
depths = {"Water Body": ['Aquifer Auser','Aquifer Auser','Aquifer Auser','Aquifer Auser',
                         'Aquifer Doganella','Aquifer Doganella','Aquifer Doganella','Aquifer Doganella','Aquifer Doganella','Aquifer Doganella','Aquifer Doganella',
                      'Aquifer Petrignano','Aquifer Petrignano',
                        'River Arno','Lake Bilancino','Lake Bilancino','Spring Amiata','Spring Amiata','Spring Amiata','Spring Amiata','Spring Lupa'],
          "Output" : ['Depth_to_Groundwater_PAG','Depth_to_Groundwater_CoS','Depth_to_Groundwater_SAL','Depth_to_Groundwater_LT2',
                      'Depth_to_Groundwater_Pozzo_1','Depth_to_Groundwater_Pozzo_2','Depth_to_Groundwater_Pozzo_3','Depth_to_Groundwater_Pozzo_4',
                       'Depth_to_Groundwater_Pozzo_5','Depth_to_Groundwater_Pozzo_6','Depth_to_Groundwater_Pozzo_7',
                      'Depth_to_Groundwater_P24','Depth_to_Groundwater_P25','Hydrometry_Nave_di_Rosano','Flow_Rate','Lake_Level',
                     'Flow_Rate_Galleria_Alta','Flow_Rate_Ermicciolo','Flow_Rate_Arbure','Flow_Rate_Bugnano','Flow_Rate_Lupa'],
          'Forecast': [Depth_to_Groundwater_PAG_forecast,Depth_to_Groundwater_CoS_forecast,Depth_to_Groundwater_SAL_forecast,Depth_to_Groundwater_LT2_forecast,
                        Depth_to_Groundwater_Pozzo_1_forecast,Depth_to_Groundwater_Pozzo_2_forecast,Depth_to_Groundwater_Pozzo_3_forecast,Depth_to_Groundwater_Pozzo_4_forecast,
                        Depth_to_Groundwater_Pozzo_5_forecast,Depth_to_Groundwater_Pozzo_6_forecast,Depth_to_Groundwater_Pozzo_7_forecast,
                        Depth_to_Groundwater_P24_forecast,Depth_to_Groundwater_P25_forecast,Arno_River_Hydrometry_forcast,Lake_Bilancino_Flow_Rate_forcast,Lake_Bilancino_Lake_Level_forcast,
                      Flow_Rate_Galleria_Alta_forcast,Flow_Rate_Ermicciolo_forcast,Flow_Rate_Arbure_forcast,Flow_Rate_Bugnano_forcast,Flow_Rate_Lupa_forcast]
                        }
df = pd.DataFrame(depths,columns=["Water Body","Output",'Forecast'])
df1=df.set_index(["Water Body", 'Output'])

In [None]:
df1