<center><img src="https://www.gruppo.acea.it/content/dam/acea-corporate/acea-foundation/immagini/homepage/hub-home/Acea-Corporate-Logo.png" width="200" height="200" /></center>

# <center>Acea Smart Water Analytics</center>


## About Acea:
The **Acea Group SpA** (originally an acronym for "Azienda Comunale Elettricità e Acque", that is, "Electricity and Water Municipal Utility") is a multi-utility company that manages and develops networks and services in the water, energy and environmental sectors. 

In 1991, the Municipality transformed Acea into a special Company and, on 1 January 1998, the SpA was incorporated. From 19 July 1999, Acea SpA was admitted to listing on the Italian Stock Exchange and began an intense spin-off process. In implementation of the Galli law, Acea was identified as the integrated water service provider for Ato 2 Lazio. 

In 2017, Acea used the 2018-2022 Business Plan to identify the principles and strategic objectives on which to base its growth path. In the same year, Acea tackled one of its most serious water crises. Extraordinary maintenance on the water network guaranteed service continuity for citizens, helped by a campaign raising awareness of the responsible use of the resource. Through this virtual collaboration with users, the company became a promoter of a culture of sustainability. The restyling of the logo launched Acea into the digital world.

In 2018, the start of the Business Plan confirmed a strong boost for infrastructural investments in both the water and the electricity sector: resilient technology and innovation, paying special care to sustainable development for the environment and for people.

In 2019, Acea entered the sector of gas distribution and approved the new 2019-2022 Business Plan to accelerate growth.

## The Competition:
In this competition we will focus only on the water sector to help Acea Group preserve precious waterbodies. As it is easy to imagine, a water supply company struggles with the need to forecast the water level in different types of waterbodies (water spring, lake, river, or aquifer) to handle daily consumption. 

During fall and winter waterbodies are refilled, but during spring and summer they start to drain. To help preserve the health of these waterbodies it is important to predict the most efficient water availability, in terms of level and water flow for each day of the year.

**This competition uses nine different datasets**, completely independent and not linked to each other. Each dataset can represent a different kind of waterbody. Because each waterbody is different, its related features differ as well. 

**The Acea Group deals with four different types of waterbodies:** 
* **<font color="green">AQUIFERS</font>** (for which four datasets are provided):
    * **AUSER**: This waterbody consists of two subsystems, called NORTH and SOUTH, where the former partly influences the behavior of the latter. The north subsystem is a water table (or unconfined) aquifer, while the south subsystem is an artesian (or confined) aquifer. The levels of the NORTH sector are represented by the values of the SAL, PAG, CoS and DIEC wells, while the levels of the SOUTH sector are represented by the LT2 well.
    * **DOGANELLA**: The Doganella well field is fed by two underground aquifers that are not fed by rivers or lakes but by meteoric infiltration. The upper aquifer is a water table with a thickness of about 30 m. The lower aquifer is a semi-confined artesian aquifer with a thickness of 50 m and is located inside lavas and tufa products. These aquifers are accessed through wells called Well 1, ..., Well 9. Approximately 80% of the drainage volumes come from the artesian aquifer. The aquifer levels are influenced by the following parameters: rainfall, humidity, subsoil, temperatures and drainage volumes.
    * **LUCO**: The Luco well field is fed by an underground aquifer. This aquifer is not fed by rivers or lakes but by meteoric infiltration at the extremes of the impermeable sedimentary layers. The aquifer is accessed through wells called Well 1, Well 3 and Well 4 and is influenced by the following parameters: rainfall, depth to groundwater, temperature and drainage volumes.
    * **PETRIGNANO**: The well field of the alluvial plain between Ospedalicchio di Bastia Umbra and Petrignano is fed by three underground aquifers separated by low permeability septa. The aquifer can be considered a water table groundwater and is also fed by the Chiascio river. The groundwater levels are influenced by the following parameters: rainfall, depth to groundwater, temperatures, drainage volumes and the level of the Chiascio river.
    
    
* **<font color="green">LAKE</font>** (for which a dataset is provided):
    * **BILANCINO**: Bilancino lake is an artificial lake located in the municipality of Barberino di Mugello (about 50 km from Florence). It is used to refill the Arno river during the summer months. Indeed, during the winter months, the lake is filled up and then, during the summer months, the water of the lake is poured into the Arno river.


* **<font color="green">RIVER</font>** (for which a dataset is provided):
    * **ARNO**: The Arno is the second largest river in peninsular Italy and the main waterway in Tuscany. It has a relatively torrential regime, due to the nature of the surrounding soils (marl and impermeable clays). The Arno is the main source of water supply for the metropolitan area of Florence-Prato-Pistoia. The availability of water for this waterbody is evaluated by checking the hydrometric level of the river at the section of Nave di Rosano.


* **<font color="green">WATER SPRING</font>** (for which three datasets are provided):
    * **AMIATA**: The Amiata waterbody is composed of a volcanic aquifer not fed by rivers or lakes but fed by meteoric infiltration. This aquifer is accessed through Ermicciolo, Arbure, Bugnano and Galleria Alta water springs. The levels and volumes of the four sources are influenced by the parameters: rainfall, depth to groundwater, hydrometry, temperatures and drainage volumes.
    * **LUPA**: This water spring is located in the Rosciano Valley, on the left side of the Nera river. The waters emerge at an altitude of about 375 meters above sea level through a long draining tunnel that crosses, in its final section, lithotypes and essentially calcareous rocks. It provides drinking water to the city of Terni and the towns around it.
    * **MADONNA DI CANNETO**: The Madonna di Canneto spring is situated at an altitude of 1010m above sea level in the Canneto valley. It does not consist of an aquifer and its source is supplied by the water catchment area of the river Melfa.

## Objective:
The challenge is to determine how features influence the water availability of each presented waterbody. More directly, a better understanding of volumes will help Acea ensure water availability for each time interval of the year.

The time interval is defined as day or month, depending on the available measures for each waterbody. Models should capture volumes for each waterbody (for instance, a model working on a monthly interval is expected to produce a forecast over the month). 
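
As a minimal sketch of what working on a monthly interval could look like, the snippet below aggregates a daily series into one value per calendar month. The toy series is made up, and the choice of the mean as the aggregate is an assumption for illustration; a sum may suit volumes better.

```python
import numpy as np
import pandas as pd

# Toy daily series standing in for a target column (values are made up).
index = pd.date_range("2020-01-01", "2020-03-31", freq="D")
daily = pd.Series(np.arange(len(index), dtype=float), index=index, name="target")

# Group the days by calendar month and average them.
monthly = daily.groupby(daily.index.to_period("M")).mean()
print(monthly)
```

The result is indexed by monthly periods, so each row corresponds to one month of the daily record.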

Each waterbody has its own different features to be predicted. The table below shows the expected feature to forecast for each waterbody. 

![imagen.png](attachment:imagen.png)

In [None]:
import os
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D   
import math
import seaborn as sns
import pandas as pd
pd.options.display.max_columns = None
import numpy as np
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)
class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'

In [None]:
Aq_Auser = pd.read_csv('../input/acea-water-prediction/Aquifer_Auser.csv', index_col = 'Date')
Aq_Doganella = pd.read_csv('../input/acea-water-prediction/Aquifer_Doganella.csv', index_col = 'Date')
Aq_Luco = pd.read_csv('../input/acea-water-prediction/Aquifer_Luco.csv', index_col = 'Date')
Aq_Petrignano = pd.read_csv('../input/acea-water-prediction/Aquifer_Petrignano.csv', index_col = 'Date')
Lake_Bilancino = pd.read_csv('../input/acea-water-prediction/Lake_Bilancino.csv', index_col = 'Date')
River_Arno = pd.read_csv('../input/acea-water-prediction/River_Arno.csv', index_col = 'Date')
Ws_Amiata = pd.read_csv('../input/acea-water-prediction/Water_Spring_Amiata.csv', index_col = 'Date')
Ws_Lupa = pd.read_csv('../input/acea-water-prediction/Water_Spring_Lupa.csv', index_col = 'Date')
Ws_Madonna_di_Canneto = pd.read_csv('../input/acea-water-prediction/Water_Spring_Madonna_di_Canneto.csv', index_col = 'Date')
Ws_Madonna_di_Canneto=Ws_Madonna_di_Canneto.loc[Ws_Madonna_di_Canneto.index.dropna()]

In [None]:
datasets = {"Aquifer Auser":Aq_Auser, "Aquifer Doganella":Aq_Doganella, "Aquifer Luco":Aq_Luco, "Aquifer Petrignano":Aq_Petrignano,
            "Lake Bilancino":Lake_Bilancino, "River Arno":River_Arno, "Water Spring Amiata":Ws_Amiata, "Water Spring Lupa":Ws_Lupa, 
            "Water Spring Madonna di Canneto":Ws_Madonna_di_Canneto}
for i in datasets:
    j=datasets[i]
    j.index = pd.to_datetime(j.index,format="%d/%m/%Y")

### Correction of absolute values
Before analyzing the data, we take the absolute value of some features, following this clarification from the competition hosts:

**Important - For all attendees - Clarification about the possible values of the feature variables:**

Hello everyone, in recent days many of you have asked us clarifications about the positive, negative or absolute value to be considered to determine features. Well we have just published a new version of the dataset description file that includes a new sheet that shows for each feature of waterbody the possible values that can assume that is positive or negative and if you can consider them as an absolute value. In attachment a screenshot of new sheet.

Thanks

Regards

<img src="https://storage.googleapis.com/kaggle-forum-message-attachments/1136967/17807/dataset_description.PNG" width="600" height="600" />

In [None]:
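# Name fragments of the features that are treated here as absolute values
field=['Rainfall','Depth','Volume','Flow_Rate','Lake_Level']
for i in datasets:
    j=datasets[i]
    for k in j.columns:
        # Take the absolute value of every column whose name contains one of the fragments
        if any(x in k for x in field):
            j[k]=j[k].abs()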

# 1 First look at the data

## 1.1 Description of the data
The purpose of the data description is to provide the information needed to understand and use the archived data. 
We will take a first look at the different waterbodies using three methods:
1. Standard description of the data
2. Statistical description of the data
3. Visual/graphical description of the data 

### 1.1.1 Standard description of the data
This consists of looking at the head of the data to see how it is shaped. Each dataset is composed of columns (features such as Rainfall, Temperature and Volume, plus the features to forecast discussed previously), an index made up of dates at a daily interval, and values, which fall into two groups: NaN values ("Not a Number") and float values ("floating-point numbers").
As we can see, most of the values in the head of the datasets are NaN.
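
Looking only at the head can overstate or understate how much is missing, so it may help to quantify it per column. The snippet below is a minimal sketch: `missing_share` is a hypothetical helper and `toy` is a made-up frame standing in for one of the waterbody datasets.

```python
import numpy as np
import pandas as pd

def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of NaN values per column, largest first."""
    return (df.isna().mean() * 100).sort_values(ascending=False)

# Toy frame standing in for one of the waterbody datasets (values are made up).
toy = pd.DataFrame(
    {
        "Rainfall_A": [np.nan, np.nan, 1.2, 0.0],
        "Temperature_A": [np.nan, 5.0, 6.1, 7.3],
        "Depth_to_Groundwater_X": [3.0, 3.1, 3.1, 3.2],
    },
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
)

share = missing_share(toy)
print(share)
```

The same helper could be applied to each entry of `datasets` to compare how complete the nine waterbodies are.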

In [None]:
for i in datasets:
    j=datasets[i]
    print(color.UNDERLINE + color.CYAN + color.BOLD + i.upper() + color.END)
    display(datasets[i].head(5))

### 1.1.2 Statistical description of the columns
In order to evaluate a dataset, you need to get a first contact with your data. This means you need to get an intuitive sense of how your data is distributed and what spectrum of values you have. 
It is a very important description, because it allows us to have a quick idea of the order of values that we will manipulate later.

The `describe` function summarizes the general parameters of every column in the datasets:
* **Count**: number of non-missing values in that column
* **Mean**: average of these values
* **Std**: standard deviation of these values
* **Min**: the smallest value
* **25%, 50%, 75%**: the value at the **<font color="magenta">x</font>**th percentile (50% is the median)
* **Max**: the largest value
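
To make the list above concrete, here is a small sketch of `describe` on a made-up series with one missing value. The numbers are illustrative only and do not come from the competition data.

```python
import pandas as pd

# Five entries, one of them missing.
values = pd.Series([1.0, 2.0, 3.0, 4.0, None], name="toy")

summary = values.describe()
print(summary)

# count is 4, not 5: the missing value is ignored by every statistic.
print(int(summary["count"]))
```

Because the count excludes NaN, comparing it with the number of rows is a quick way to spot sparse columns.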


In [None]:
for i in datasets:
    j=datasets[i]
    print(color.UNDERLINE + color.CYAN + color.BOLD + i.upper() + color.END)
    display(datasets[i].describe())

### 1.1.3 Visual/graphical description of the data
Visual description of the data is the graphical representation of information and data. We will focus on the representation of the data in time as a fast description of it and observe its trends, outliers, and patterns over the years.
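
Alongside the plots, gaps can also be listed numerically. The following is a minimal sketch under our own assumptions: `nan_runs` is a hypothetical helper, not part of the notebook, and the toy series is made up.

```python
import numpy as np
import pandas as pd

def nan_runs(series: pd.Series) -> pd.DataFrame:
    """List each contiguous run of NaN values with its start, end and length."""
    is_na = series.isna()
    # A new run id starts whenever the NaN flag changes from the previous row.
    run_id = is_na.astype(int).diff().ne(0).cumsum()
    rows = []
    for _, flags in is_na.groupby(run_id):
        if flags.iloc[0]:  # keep only the runs made of missing values
            rows.append((flags.index[0], flags.index[-1], len(flags)))
    return pd.DataFrame(rows, columns=["start", "end", "length"])

# Toy daily series with two gaps (values are made up).
toy = pd.Series(
    [1.0, np.nan, np.nan, 2.0, 3.0, np.nan, 4.0, 5.0],
    index=pd.date_range("2020-01-01", periods=8, freq="D"),
)

gaps = nan_runs(toy)
print(gaps)
```

Sorting the result by `length` would surface the longest gaps first, which are the ones most visible in the line plots.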

In [None]:
def lineplot (i):
    j=datasets[i]
    for k in j.columns:       
        sns.set_theme(style="whitegrid")
        fig, ax = plt.subplots(figsize=(20,2.5))
        lineplot = sns.lineplot(x=j.index,y=j[k],color = 'orangered',label='Missing Values')
        lineplot = sns.lineplot(x=j.index,y=j[k].fillna(np.inf),color = 'seagreen',label='Original')
        lineplot.set_title(label=i.upper()+'\n'+k, fontdict={'fontsize':15}, pad=16)
        lineplot.set_xlabel('Date' , fontsize=12)
        if "Rainfall" in k:
            unit= "Precipitation (mm)"
        elif "Temperature" in k:
            unit= "Temperature (°C)"
        elif "Volume" in k:
            unit= "Volume of water (m3)"
        elif "Hydrometry" in k:
            unit= "Groundwater level (m)"
        elif "Depth" in k:
            unit= "Distance from ground level (m)"
        elif "Flow" in k:
            unit= "Flow Rate (m3/s)"
        else:
            unit='Value'
        lineplot.set_ylabel(unit , fontsize=12)
        lineplot.tick_params(axis='both', labelsize=12)
        lineplot.legend(fontsize=12,loc='upper left')
        plt.xlim(j.index.min(),j.index.max())
        plt.show()

**<font color="green">1-  AQUIFER AUSER:</font>**

Auser consists of two subsystems, NORTH and SOUTH, where the former partly influences the behaviour of the latter. The levels of the NORTH sector are represented by the values of the SAL, PAG, CoS and DIEC wells, while the levels of the SOUTH sector are represented by the LT2 well. 
* **RAINFALL:** The rainfall feature refers to the precipitation (measured in millimeters) observed by different weather stations. For Auser, we have 10 precipitation variables, each for a different region. These measurements start in 2006. The scale of rainfall is similar in each graph, between 0 mm and 150 mm, with a few extraordinary rainfalls that exceed 200 mm.
* **TEMPERATURE:** Temperature data (measured in degrees Celsius) for Auser were taken from four different regions. The measurements start in 1998, but in some cases (such as Orentano) the first years appear to be filled with 0 values. The same happens in some later intervals where 0 degrees is recorded, which may indicate either a measurement error or implicitly missing data. There is an evident seasonality, visible as a sine-wave-like behaviour that usually ranges from 0 to 30 degrees, with some exceptions (such as Monte Serra) where the temperature drops to negative values.
* **VOLUME:** The volume feature refers to the volume of water (expressed in cubic meters) taken from the drinking water treatment plant. For treatment plant POL, the water taken was roughly constant until the first months of 2020, when it increased drastically. Over the analyzed period, from 2005 onwards, the CC1 and CC2 treatment plants followed a stable course of approximately 16,000 and 13,000 cubic meters per day respectively. The CSA and CSAL treatment plants started taking water in 2014; before that, all their values are filled with 0s. Afterwards, the amount of water taken was around 6,000 and 4,000 cubic meters per day respectively.
* **HYDROMETRY:** Indicates the groundwater level (expressed in meters) detected by the hydrometric station. Here we have two variables from two different stations, with values starting in 2000. These values vary between -1 and 3 m, and there is a slight seasonality with visible periods of growth and reduction. The main issue is that some data are missing for the "Piaggione" station, especially between 2009 and 2011.
* **DEPTH TO GROUNDWATER:** Indicates the groundwater level, expressed in meters below ground level, detected by the piezometer. The features to forecast are those corresponding to SAL, CoS and LT2. Each series starts in a different year. The series for the NORTH sector reach at most 10 m, while LT2 from the SOUTH sector varies between 10 and 15 m. There is also a sharp decrease for LT2, CoS and SAL, which drop to 0 after the start of 2020. This may be a reaction to the volume taken by the treatment plant POL, which had increased drastically months before.
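
Since long streaks of exact zeros may stand for missing data, one possible treatment is to turn them into NaN before modelling. The sketch below is only an illustration: `zero_streaks_to_nan` and its `min_length` threshold are our own assumptions, chosen so that an isolated 0 reading (plausible for winter temperatures) is left untouched.

```python
import numpy as np
import pandas as pd

def zero_streaks_to_nan(series: pd.Series, min_length: int = 3) -> pd.Series:
    """Replace runs of at least `min_length` consecutive exact zeros with NaN."""
    is_zero = series.eq(0)
    # Label each run of identical flags, then attach the run length to every row.
    run_id = is_zero.astype(int).diff().ne(0).cumsum()
    run_length = run_id.map(run_id.value_counts())
    return series.mask(is_zero & run_length.ge(min_length))

# Toy series: a four-day zero streak, an isolated zero and a real NaN (made up).
toy = pd.Series(
    [0.0, 0.0, 0.0, 0.0, 5.0, 6.0, 0.0, 7.0, np.nan, 8.0],
    index=pd.date_range("2020-01-01", periods=10, freq="D"),
)

cleaned = zero_streaks_to_nan(toy, min_length=3)
print(cleaned)
```

The right threshold depends on the feature: for volumes, a long streak of zeros before a plant started operating is arguably a true zero rather than a gap.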

In [None]:
lineplot("Aquifer Auser")

**<font color="green">2-  AQUIFER DOGANELLA:</font>**

The Doganella well field is fed by two underground aquifers: the upper stratum is a water table with a thickness of about 30 m, while the lower one is a semi-confined artesian aquifer with a thickness of 50 m. 
* **RAINFALL:** For Doganella, we have 2 precipitation variables, each for a different region. These measurements start in 2004. The scale of rainfall is similar in each graph, between 0 mm and 100 mm. There are missing values around 2015 and 2019 for both stations, and for a period in 2006 for Velletri.
* **TEMPERATURE:** Temperature data for Doganella were taken from two different regions, starting in 2004. The weather stations appear to have stopped working properly from 2015 to 2019, given the large number of missing values in this period. Fortunately, there is an evident seasonality with a sine-wave-like behaviour, which is an advantage when filling in missing values. Both temperatures range from 0 to 30 degrees, with some exceptions where the temperature drops to negative values.
* **VOLUME:** Doganella has eight treatment plants. We have data for all of them from roughly the final months of 2016, and the data are complete: from late 2016 to 2020 there are no missing values. There is an almost linear trend over time, with severe drops to 0 in the amount of water taken by the treatment plants, probably due to maintenance or some other process. The values are also fairly uniform across plants (about 4,000 cubic meters per day per plant).
* **DEPTH TO GROUNDWATER:** Consists of nine features, all to forecast, numbered 1 to 9. Data start in mid-2012, with important periods of missing values for each feature. It is difficult to observe a correlation between them because each one behaves differently. The maximum value they reach is approximately 110 m.
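
The remark that seasonality helps when filling missing temperatures can be sketched as follows: each missing day is replaced by the mean observed on the same day of the year in other years. This is one possible approach, not necessarily the one used later; `fill_with_climatology` is a hypothetical helper, the sine-shaped series is synthetic, and using the day of the year ignores the one-day shift that leap years introduce.

```python
import numpy as np
import pandas as pd

def fill_with_climatology(series: pd.Series) -> pd.Series:
    """Fill NaN values with the mean observed on the same day of the year."""
    day = series.index.dayofyear
    climatology = series.groupby(day).transform("mean")  # NaNs are skipped
    return series.fillna(climatology)

# Two synthetic, non-leap years of sine-shaped temperatures (made up).
index = pd.date_range("2018-01-01", "2019-12-31", freq="D")
day_of_year = index.dayofyear.to_numpy()
temperature = pd.Series(
    15 + 10 * np.sin(2 * np.pi * (day_of_year - 100) / 365), index=index
)

# Remove one month to imitate a station outage, then fill it back.
with_gap = temperature.copy()
with_gap.loc["2019-07-01":"2019-07-31"] = np.nan
filled = fill_with_climatology(with_gap)
print(filled.loc["2019-06-29":"2019-07-03"])
```

On real data the filled values would only approximate the truth, since each year deviates from the average seasonal curve.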

In [None]:
lineplot("Aquifer Doganella")

**<font color="green">3-  AQUIFER LUCO:</font>**

Luco is an underground aquifer not fed by rivers or lakes but by meteoric infiltration. 
* **RAINFALL:** For Luco, we have 10 precipitation variables, each for a different region, with different starting dates. For Simignano, Montalcinello and Sovicille we have information from 2000 and 2002 onwards, with some gaps (such as 2013 for Simignano). Their values reach at most 125 mm. Siena Poggio al Vento and Ponte Orgia both have records only since 2017, which leaves us with just 3-4 years of data; both register precipitation below 80 mm, with a few exceptions. Mensano has records since late 2015, with values below 100 mm except for one rainfall of more than 300 mm in 2018. Monticiano la Pineta has records since mid-2014, with values below 100 mm with some exceptions. Scorgiano and Pentolina both have records since 2012 and show extreme outliers of around 600 mm. Pentolina also has many missing values between 2014 and 2017. Monteroni Arbia Biena has records since 2012, with maximum values around 100 mm.
* **TEMPERATURE:** Temperature data for Luco were taken from four different regions. For Pentolina and Monteroni Arbia Biena, the measurements start in 2000, but for the other two (Siena Poggio al Vento and Mensano) the first years are filled with 0 values, and actual recording started in 2018 and late 2015 respectively. There are also intervals with streaks of 0 degrees, which may indicate either a measurement error or implicitly missing data. There is an evident seasonality with a sine-wave-like behaviour, usually ranging from -5 to 30 degrees.
* **VOLUME:** Luco has three treatment plants. We have data for all of them from roughly early 2015, complete with no missing values. They behave similarly: the water taken increases over the years but decreases from 2020. The water taken by these treatment plants is roughly 200-250 cubic meters.
* **DEPTH TO GROUNDWATER:** Consists of four features, with Podere Casetta as the feature to forecast. Podere Casetta has values measured since 2008 with some gaps (such as in 2018); it decreases until 2014, where it reaches its minimum value (around 5 m), and then starts growing back. For Pozzo 1, Pozzo 3 and Pozzo 4 we have data since late 2017, but each behaves very differently. Pozzo 1 and Pozzo 3 fluctuate between 8 and 12 m with some exceptions, while Pozzo 4 fluctuates between 0 and 20 m.

In [None]:
lineplot("Aquifer Luco")

**<font color="green">4-  AQUIFER PETRIGNANO:</font>**

Petrignano is a water table aquifer in the alluvial plain between Ospedalicchio di Bastia Umbra and Petrignano; it is also fed by the Chiascio river. 
* **RAINFALL:** For Petrignano, we have 1 precipitation variable, from Bastia Umbra. Measurements start in 2009 with no missing values over the years. Values reach at most 60 mm, with some exceptions.
* **TEMPERATURE:** Temperature data for Petrignano were taken from two different regions, Bastia Umbra and Petrignano. Both have records since 2009. For Petrignano there is a streak in 2015 where 0 degrees is recorded, which may indicate either a measurement error or implicitly missing data. There is an evident seasonality with a sine-wave-like behaviour, usually ranging from 0 to 30 degrees.
* **VOLUME:** Petrignano has one treatment plant, C10 Petrignano, whose data were recorded from approximately late 2006. The data are complete with no missing values, and the values range between 20,000 and 40,000 cubic meters, with some severe drops to 0 in the amount of water taken, especially in 2019. 
* **HYDROMETRY:** Comes from the hydrometric station Fiume Chiascio Petrignano. Its measurements start in 2009 with no missing values, except for a streak of 0 m recorded in 2015. Since it coincides with the streak in temperature, we can assume that this is missing data. Most values vary between 2 and 4 m.
* **DEPTH TO GROUNDWATER:** Consists of two features, P24 and P25, both to forecast. Both have values measured since 2006 with almost no missing values. There is a clear correlation between the two features: their behaviour over time is practically identical, oscillating between 20 and 35 m below ground level.
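
The visual impression that two depth series move together can be checked with a Pearson correlation. The sketch below uses two synthetic series that share a common signal; the column names `well_a` and `well_b` are hypothetical stand-ins, not the real P24 and P25 columns.

```python
import numpy as np
import pandas as pd

# Two synthetic depth series built from the same random walk (made up).
rng = np.random.default_rng(0)
index = pd.date_range("2020-01-01", periods=200, freq="D")
base = np.cumsum(rng.normal(size=200))
toy = pd.DataFrame(
    {
        "well_a": 25 + base,
        "well_b": 26 + base + rng.normal(scale=0.1, size=200),
    },
    index=index,
)

# Pearson correlation between the two columns, and the full matrix.
corr = toy["well_a"].corr(toy["well_b"])
matrix = toy.corr()
print(round(corr, 3))
print(matrix)
```

Correlating the levels of trending series can overstate the relationship, so repeating the check on `toy.diff()` is a stricter test.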

In [None]:
lineplot("Aquifer Petrignano")

**<font color="green">5-  LAKE BILANCINO:</font>**

Bilancino is an artificial lake in Mugello, in the province of Florence. It has a maximum depth of 31 metres and a surface area of 5 square kilometres. 
* **RAINFALL:** For Lake Bilancino, we have five precipitation variables. Measurements start in 2004 with no missing values over the years. All five behave very similarly, and their values reach at most 80-100 mm, with some exceptions.
* **TEMPERATURE:** Temperature data for Bilancino were taken from the Le Croci station. It has records since 2004 with no missing values. There is an evident seasonality with a sine-wave-like behaviour, usually ranging from 0 to 30 degrees.
* **LAKE LEVEL:** Indicates the lake level, expressed in meters (m). It is the first feature to forecast for Lake Bilancino. With records since 2002, this series does not contain missing values. There is an evident seasonality with a sine-wave-like behaviour that may be related to the temperature. Its values oscillate between 244 and 252 m.
* **FLOW RATE:** Indicates the lake's flow rate, expressed in cubic meters per second (m3/s). This is the second feature to forecast for Lake Bilancino. It has records since 2002 and almost no missing values in the whole measurement history. It does not show a clear seasonality, but it has many peaks. It oscillates between 0 and 70 cubic meters per second.

In [None]:
lineplot("Lake Bilancino")

**<font color="green">6-  RIVER ARNO:</font>**

The Arno is the second largest river in peninsular Italy and the main waterway in Tuscany. It has a relatively torrential regime, due to the nature of the surrounding soils (marl and impermeable clays).
* **RAINFALL:** For Arno, we have 14 precipitation variables, each for a different region. All of them start recording in 2004 but stop in different years. For Le Croci, Cavallina, S Agata, Mangona and S Piero we have data until 2020 with no missing values and precipitation of no more than 100 mm, with some exceptions. Vernio and Incisa have data until 2016, also with no missing data and maximum values around 100 mm. For Stia, Consuma, Montevarchi, S Savino, Laterina and Camaldoli we only have data until roughly 2008, which leaves us with only four years of recorded data for these 6 regions. They also behave similarly, with precipitation of no more than 100 mm. Finally, Bibbiena has data recorded for about 6 years with no missing values in this period; the precipitation recorded was less than 80 mm.
* **TEMPERATURE:** Temperature data for Arno consist of measurements from one region, Firenze. It has records from 2000 to mid-2017 with no missing data. The seasonality is evident, with a sine-wave-like behaviour usually ranging from 0 to 30 degrees, and higher temperatures in the summers of 2015, 2016 and 2017.
* **HYDROMETRY:** Comes from the hydrometric station Nave di Rosano. Its measurements start in 1998 with no missing values, except for a streak of 0 m recorded in the summer of 2008, which we can assume was due to some problem. Most values vary between 1 and 4 m, with some peaks, especially in winter. Hydrometry Nave di Rosano is the feature to forecast for River Arno.

In [None]:
lineplot("River Arno")

**<font color="green">7-  WATER SPRING AMIATA:</font>**

This aquifer is accessed through the Ermicciolo, Arbure, Bugnano and Galleria Alta springs. The levels and volumes of the four springs are influenced by the parameters: rainfall, depth to groundwater, hydrometry, temperatures and drainage volumes.
* **RAINFALL:** For Amiata, we have 5 precipitation variables, with different starting dates for each region. Castel del Piano has measurements since 2002, with a gap of missing values in 2004. Abbadia S Salvatore started recording in 2010, with gaps of missing values in 2015. S Fiora and Laghetto Verde started recording in 2012 and have gaps of missing values in 2015 and 2013 respectively. These four regions behave similarly, with peaks that occur at almost the same time. Finally, Vetta Amiata has data since 2014, almost no missing values, and maximum values of 80 mm.
* **TEMPERATURE:** Temperature data for Amiata consist of measurements taken from three different regions. Abbadia S Salvatore and Laghetto Verde started recording in 2010 and have gaps of missing values in 2015 and 2011 respectively, while S Fiora has records since 2000 and no missing values. These regions have lower temperatures than the previous datasets analyzed: here they usually range from -10 to less than 30 degrees.
* **DEPTH TO GROUNDWATER:** Consists of three features. We have measurements for S Fiora 8 and S Fiora 11bis since 2009, with an important gap of missing values from 2017. The two are clearly highly correlated, given their identical behaviour over the years. David Lazzaretti has measurements since late 2011 with no missing values. It is also correlated with the S Fiora depth to groundwater records, but not as strongly as they are with each other. The main difference is the scale of values: while S Fiora 8 and S Fiora 11bis have values between 36-41 m and 50-54 m respectively, David Lazzaretti values are between 290 and 315 m.
* **FLOW RATE:** Indicates the flow rate, expressed in liters per second (l/s) (or cubic meters per second (m3/s)), taken from the water spring. It consists of four series, all to forecast, with records starting in 2015. None of the four flow rates has missing values. The first two are Bugnano and Arbure: water was taken from these two springs only from late 2015, and they are clearly correlated, since their behaviour over time is very similar. Both have drops to 0 that almost coincide in 2018. The main difference is that Bugnano values range between 0 and 0.4 while Arbure values range between 0 and 3, nearly 10 times higher. Ermicciolo and Galleria Alta are the other two features; water has been taken from both since 2015, and each behaves differently over time. Galleria Alta is by far the feature with the most water taken, with values between 19 and 26 liters per second.

In [None]:
lineplot("Water Spring Amiata")

**<font color="green">8-  WATER SPRING LUPA:</font>**

It is located in the Arrone area and supplies drinking water.
* **RAINFALL:** Lupa has only one precipitation feature, with no missing values. It comes from the Terni weather station, where records were taken in a very irregular way: rainfall shows an almost linear behaviour over short periods, and until 2020 the records did not exceed 10 mm. From 2020 the values look like ordinary rainfall records.
* **FLOW RATE:** Consists of only one series representing the flow rate, expressed in liters per second (l/s) (or cubic meters per second (m3/s)), taken from water spring Lupa. It has almost no missing values and an increasing linear behaviour until 2020, when the slope changes and it starts to decrease. Such linear behaviour, in both flow rate and rainfall, is clearly irregular.
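
One way to put a number on this kind of piecewise-linear behaviour is to flag the points whose second difference is zero, meaning the slope did not change from the previous step. This is a rough sketch: `constant_slope_mask` and its tolerance are our own assumptions, and a flagged stretch is only a hint that values may have been interpolated or averaged, not proof.

```python
import numpy as np
import pandas as pd

def constant_slope_mask(series: pd.Series, tol: float = 1e-9) -> pd.Series:
    """Flag points where the slope equals the slope of the previous step."""
    second_diff = series.diff().diff()
    # The first two rows have no second difference and are never flagged.
    return second_diff.abs().le(tol)

# Toy series: a perfectly linear ramp followed by irregular values (made up).
values = list(np.arange(10, dtype=float)) + [9.5, 8.7, 9.9, 7.2, 8.8]
toy = pd.Series(values, index=pd.date_range("2020-01-01", periods=len(values), freq="D"))

flagged = constant_slope_mask(toy)
print(int(flagged.sum()), "of", len(toy), "points continue the previous slope")
```

A high share of flagged points in a rainfall or flow rate column would support the suspicion that the recorded values are not raw daily measurements.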

In [None]:
lineplot("Water Spring Lupa")

**<font color="green">9-  WATER SPRING MADONNA DI CANNETO: </font>**

The Madonna di Canneto spring is situated at an altitude of 1010m above sea level in the Canneto valley. It does not consist of an aquifer and its source is supplied by the water catchment area of the river Melfa.
* **RAINFALL:** For Madonna di Canneto, we have one precipitation variable, from Settefrati. Measurements were taken from 2012 to 2019 with no missing values. Values reach at most 100 mm, with some exceptions.
* **TEMPERATURE:** Temperature data for Madonna di Canneto was also taken at Settefrati from 2012 to 2019 with no missing values, and it shows the seasonality expected for the region. Values range from -5 to 30 degrees.
* **FLOW RATE:** Consists of only one dataset representing the flow rate, expressed in liters per second (l/s) (or cubic meters per second (mc/s)), taken from water spring Madonna di Canneto. It has many gaps of missing values, which is a serious problem, and a very distinctive behavior over time. Records start in 2015.

In [None]:
lineplot("Water Spring Madonna di Canneto")

### **Daily Rainfall Rate in a Monthly Scale**
These boxplots show the **Distribution of the Daily Rainfall Rate in a Monthly Scale**. They were made mainly to show the number of rainy days per month recorded by the different stations. We will study the correlation between these features in more depth later, but this gives a first impression of how the data behaves.
From here we can observe the different behaviors of the regions. For example, Lupa registered precipitation above 0 mm almost every day, while Petrignano has between 5 and 10 days per month with precipitation above 0 mm.
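
As a minimal sketch of the counting behind these boxplots, assuming a toy daily series (the name `toy_rain` and its values are made up for illustration and are not taken from the Acea data), the number of rainy days per month could be obtained like this:

```python
import numpy as np
import pandas as pd

# Hypothetical daily rainfall in mm for January and February 2020.
idx = pd.date_range("2020-01-01", "2020-02-29", freq="D")
toy_rain = pd.Series(0.0, index=idx, name="Rainfall_Toy")
toy_rain.iloc[[0, 4, 9, 35, 40]] = [2.5, 0.4, 11.0, 3.2, 0.8]  # five rainy days
toy_rain.iloc[20] = np.nan                                      # one missing record

# A day counts as rainy when the recorded value is strictly positive;
# missing records compare as False and are therefore not counted.
rainy_days = (toy_rain > 0).groupby(toy_rain.index.to_period("M")).sum()
print(rainy_days)
```

Each monthly count is one observation in the boxplot of the corresponding station, so a box sitting close to 30 means that rain was recorded on almost every day of the month.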

In [None]:
fig, axs = plt.subplots(3,3, figsize=(15, 20), facecolor='w', edgecolor='k')
fig.subplots_adjust(hspace = 2, wspace=1)
y=0
for i in datasets:
    n, o = divmod(y, 3)
    j=datasets[i]
    k=j.filter(regex='Rainfall')  # `'Rainfall' or 'Date'` evaluates to 'Rainfall'; the dates live in the index
    data=[]
    for l in k.columns:
        m=pd.DataFrame(k[l].where(k[l]>0).dropna())
        m[:]=1
        m=m.groupby(pd.Grouper(freq='M')).sum()
        data.append(pd.DataFrame(m,columns=[str(l)]))
    data=pd.concat(data,axis=1)
    sns.pointplot(y=data.mean(),x=data.columns,color="black",ax=axs[n,o])
    boxplot=sns.boxplot(data=data,showmeans=True,
                         meanprops={"marker":"o",
                                    "markerfacecolor":"white", 
                                    "markeredgecolor":"black",
                                    "markersize":"10"},ax=axs[n,o])
    axs[n,o].set_title(i, fontdict={'fontsize':18}, pad=16,color='GREEN')
    axs[n,o].set_xticklabels(boxplot.get_xticklabels(), rotation=90)
    legend_elements=[Line2D([0], [0],marker='o',color='w',markerfacecolor='white',markeredgecolor='black',markersize=10,label='Mean')]
    axs[n,o].legend(handles=legend_elements,loc='upper left',fontsize=14)
    axs[n,o].set_ylabel("Amount of Rainy Days", size=14)
    axs[n,o].set_ylim(0, 32)
    axs[n,o].set_xlabel("Monitoring station", size=14)
    y+=1
plt.tight_layout()
plt.show()

# 2 Missing values


## 2.1 Amount of missing values
In statistics, missing data, or missing values, occur when no data value is stored for the variable in an observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.

Missing data can occur because of nonresponse: no information is provided for one or more items or for a whole unit ("subject").

We will deal with missing data later; in this section we only look at the amount of missing values for each column in each dataset. A chart below also shows, for every dataset, the percentage of missing values of each feature.

As we can see, there is a huge amount of missing data in some datasets like Aquifer Doganella, Luco, River Arno and Water Spring Amiata, where there are features with more than 70% of missing data.

On the other hand, Aquifer Petrignano, Lake Bilancino and Water Spring Lupa have less than 20% of missing values.
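
The per-column summary can be illustrated on a tiny example. This is only a sketch: the frame `toy` and its column names are hypothetical and merely mimic the naming style of the competition files.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row dataset.
toy = pd.DataFrame({
    "Rainfall_A": [1.0, np.nan, 0.0, 2.0],
    "Temperature_A": [np.nan, np.nan, np.nan, 12.0],
    "Flow_Rate_A": [5.0, 5.1, 5.2, 5.3],
})

total = toy.isnull().sum().sort_values(ascending=False)   # missing cells per column
percent = (total / len(toy) * 100).round(2)               # share of rows that are missing
summary = pd.concat([total, percent], axis=1, keys=["Total", "Percent"])
print(summary)
```

Note that a recorded value of 0 is not counted as missing; only NaN cells are.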

In [None]:
for i in datasets:
    j=datasets[i]
    print(color.UNDERLINE + color.CYAN + color.BOLD + i.upper() + color.END)
    total = j.isnull().sum().sort_values(ascending = False)
    percent = round(j.isnull().sum().sort_values(ascending = False)/len(j)*100,2)
    print(pd.concat([total, percent], axis=1, keys=['Total','Percent']))
    print('\n'+color.YELLOW + 'x'*60 + color.END+'\n')

In [None]:
fig, axs = plt.subplots(3,3, figsize=(15, 20), facecolor='w', edgecolor='k')
fig.subplots_adjust(hspace = 3, wspace=3)
y=0
for i in datasets:
    n, o = divmod(y,3)
    j=datasets[i]
    percent = round(j.isnull().sum().sort_values(ascending = False)/len(j)*100,2)
    barplot=sns.barplot(x=percent.index,y=percent.values, ax=axs[n,o])  # keyword arguments, required by recent seaborn releases
    axs[n,o].set_xticklabels(barplot.get_xticklabels(), rotation=90)
    axs[n,o].set_title(label=i.upper(), fontdict={'fontsize':18}, pad=16)
    axs[n,o].set_xlabel('\n'+'Features' , fontsize=15)
    axs[n,o].set_ylabel('% Missing Values' , fontsize=15)
    axs[n,o].set(ylim=(0, 100))
    y+=1
plt.tight_layout()
plt.show()

## 2.2 Cumulative percentage of missing data
We will first take a peek at the data and check the time interval covered by each dataset. This will help when analyzing the cumulative percentage of missing values over the years.
As we can see, Aquifer Auser, Luco, River Arno and Water Spring Amiata are the datasets with the oldest measurements, going back more than 20 years, while Water Spring Lupa and Water Spring Madonna di Canneto have the newest, covering less than 13 years.

In [None]:
for i in datasets:
    j=datasets[i]
    print(color.UNDERLINE + color.CYAN + color.BOLD + i.upper() + color.END)
    first_date= j.index.min()
    last_date= j.index.max()
    print(first_date)
    print(last_date)
    print(color.YELLOW + 'x'*60 + color.END+'\n')

Here, as we anticipated, we will analyze graphically the different datasets based on their **Cumulative Percentage of Missing Data**. This will show us the growth of missing data throughout their years.

Analysing the graph, we can observe how the missing data grows for each waterbody. From here it is convenient to drop the years where the slope is steepest and keep the years with the most constant amount of non-missing data.
To do this, we create new datasets starting from the selected years.

**<font color="green">AQUIFER AUSER:</font>** It can be easily seen that the slope has a break in the year 2006 and before that decreased sharply.

**<font color="green">AQUIFER DOGANELLA:</font>** This dataset shows a large, nearly linear growth of missing data throughout, with only small changes in the slope. We therefore decided to cut it in 2012, where the amount of missing values reaches 40% of the data. Doganella is the dataset with the highest percentage of missing data.

**<font color="green">AQUIFER LUCO:</font>** The same happens with Aquifer Luco, so we again cut it in 2012, where missing values reach about 40% of the data.

**<font color="green">AQUIFER PETRIGNANO:</font>** Here we can observe a sharp break in the slope: from 2009 onwards the curve stays constant, indicating that there are no more missing values after that date.

**<font color="green">LAKE BILANCINO:</font>** Same as Aquifer Petrignano, but the break happens in 2004.

**<font color="green">RIVER ARNO:</font>** Here the curve keeps growing until 2004, stays constant for almost 4 years and then starts growing again more gradually. Considering this, it seemed appropriate to cut it in 2004.

**<font color="green">WATER SPRING AMIATA:</font>** This dataset shows a large, nearly linear growth of missing data throughout, with only small changes in the slope. We therefore decided to cut it in 2012, where the amount of missing values reaches almost 50% of the data.
Amiata is the dataset with the highest amount of missing data up to 2012.

**<font color="green">WATER SPRING LUPA:</font>** The amount of missing data Lupa gets is at most 5% of its total. Hence it didn't seem convenient to modify the dataset.

**<font color="green">WATER SPRING MADONNA DI CANNETO:</font>** This dataset accumulates missing data at its beginning, but considering that we only have data from 2012 onwards, we did not modify it.
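
The reasoning behind these cuts can be sketched on a toy example. The frame `toy`, its columns and the chosen `start_year` are assumptions made up for illustration; grouping is done by year here to keep the output short.

```python
import numpy as np
import pandas as pd

# Hypothetical daily dataset: column "a" is empty during 2010, column "b" is complete.
idx = pd.date_range("2010-01-01", "2012-12-31", freq="D")
toy = pd.DataFrame({"a": 1.0, "b": 2.0}, index=idx)
toy.loc[toy.index.year == 2010, "a"] = np.nan

# Each row contributes its missing cells as a percentage of all cells in the dataset,
# so the running total ends at the overall percentage of missing data.
share = toy.isnull().sum(axis=1) * 100 / toy.size
cumulative = share.groupby(share.index.year).sum().cumsum()
print(cumulative)

# The curve is flat after 2010, so the hypothetical cut keeps 2011 onwards.
start_year = 2011
trimmed = toy[toy.index.year >= start_year]
```

A flat stretch of the cumulative curve means no new missing cells are being added, which is why the cuts above are placed where the slope breaks.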

In [None]:
data=[]
for i in datasets:
    j=datasets[i]
    k=j.isnull().sum(axis=1)*100/j.size
    k=k.groupby(pd.Grouper(freq='M')).sum().cumsum()
    l=pd.Series(k,name=str(i))
    data.append(l)

data=pd.concat(data,axis=1)
fig, ax = plt.subplots(figsize=(20,7.5))
lineplot = sns.lineplot(data=data,dashes=False)
lineplot.set_title(label='Cumulative Percentage of Missing Data', fontdict={'fontsize':20}, pad=16)
lineplot.legend(fontsize=12,loc='upper left')
lineplot.set_xlabel('Date' , fontsize=15)
lineplot.set_ylabel('Percentage' , fontsize=15)
lineplot.tick_params(axis='both', labelsize=12)
plt.show()

# 3 Correlation between features

## 3.1 Correlation Matrix
In statistics, **Correlation** or **Dependence** is any statistical relationship, whether causal or not, between two random variables or bivariate data.
The most familiar measure of dependence between two quantities is the Pearson **Product-Moment correlation coefficient (PPMCC)**, or **Pearson's correlation coefficient**, commonly called simply "the correlation coefficient". 

Pearson correlation measures the existence (given by a p-value) and strength (given by the coefficient between -1 and +1) of a linear relationship between two variables. It should only be used when its underlying assumptions are satisfied. If the outcome is significant, we conclude that a correlation exists.

Mathematically, it is obtained by dividing the covariance of the two variables by the product of their standard deviations. In a simple linear regression, its square equals the coefficient of determination, which measures the quality of the least squares fit to the original data.

The population correlation coefficient <font color="magenta">ρX,Y</font> between two random variables X and Y with expected values <font color="magenta">μX, μY</font> and standard deviations <font color="magenta">σX, σY</font> is defined as
<center><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/93185aed3047ef42fa0f1b6e389a4e89a5654afa" width="350" height="350" /></center>

where **E** is the expected value operator, **cov** means covariance, and **corr** is a widely used alternative notation for the correlation coefficient. The Pearson correlation is defined only if both standard deviations are finite and positive. 

The Pearson correlation is always in the range [-1,+1].

* **Negatively Correlated**    (<font color="magenta">E[XY] < μX * μY</font>):    both variables change in opposite directions
* **Uncorrelated**    (<font color="magenta">E[XY] = μX * μY</font>):    there is no linear relation between the variables
* **Positively Correlated**    (<font color="magenta">E[XY] > μX * μY</font>):    both variables change in the same direction
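
The definition can be checked numerically. The following is a small sketch on made-up paired observations (the arrays `x` and `y` are hypothetical); it computes the coefficient from the covariance and the standard deviations and compares it with NumPy's `np.corrcoef`.

```python
import numpy as np

# Hypothetical paired observations (for example rainfall at two nearby stations).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1])

# Population covariance E[(X - mu_X)(Y - mu_Y)] divided by sigma_X * sigma_Y.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
rho = cov_xy / (x.std() * y.std())

# Equivalent form behind the sign rule: cov(X, Y) = E[XY] - mu_X * mu_Y.
cov_alt = np.mean(x * y) - x.mean() * y.mean()

print(rho, np.corrcoef(x, y)[0, 1])
```

Because the covariance equals E[XY] minus the product of the means, its sign, and therefore the sign of the coefficient, follows the three cases listed above.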

In [None]:
def correlation (i):
    j=datasets[i]
    mask = np.triu(np.ones_like(j.corr(), dtype=bool))  # np.bool was removed from NumPy; the builtin bool is equivalent
    fig, ax = plt.subplots(figsize=(15,15))
    heatmap = sns.heatmap(j.corr(), mask=mask, vmin=-1,square=True, vmax=1,annot=True, cmap='mako', ax=ax, fmt='.01f')
    heatmap.set_title(i, fontdict={'fontsize':18}, pad=15)
    sns.set(font_scale=1)

**CORRELATION:** to clean the data and remove redundant features that can distort the weights in the model, we need to look at the correlation between the features, so that later we can drop the redundant columns on well-founded criteria if we consider it necessary. To do this, we first make a Pre-Selection of features:

**<font color="green">1-  AQUIFER AUSER:</font>**

* Starting with rainfall, there is clearly a very high correlation between some stations, especially between **Rainfall Calavorno** and **Gallicano**, **Borgo a Mozzano**, **Piaggione** and **Fabbriche di Vallico**. The correlation of Calavorno and Fabbriche di Vallico with the other rainfall features is about 0.9, which means that if we drop these two and keep the others, little variance will be lost in the model.
* The same happens with **Temperature Lucca Orto Botanico** and **Monte Serra**: of these two, Temperature Lucca Orto Botanico is the more correlated with the other temperatures, so it is the candidate to drop.
* The last pair from Aquifer Auser is **Volume CSAL** and **Volume CSA**, which are correlated with each other; CSAL has more variance with respect to the other features.

In [None]:
correlation("Aquifer Auser")


**<font color="green">2-  AQUIFER DOGANELLA:</font>**

* Some of the **depth to groundwater** series are correlated, such as **pozzo2, pozzo4, pozzo6, pozzo8**, but these are the features to forecast, so we won't dismiss them.
* As for the volumes, **Volume_Pozzo_4** and **Volume_Pozzo_2** are highly correlated, so it would be suitable to drop the one with less variance, that is, Volume Pozzo 2.
* For the temperature, the correlation between **Temperature_Monteporzio** and **Temperature_Velletri** is equal to 1, so we should drop one of them. The better choice is Temperature_Velletri because it is more correlated with the other features.

In [None]:
correlation("Aquifer Doganella")

**<font color="green">3-  AQUIFER LUCO:</font>**
* Here, the main correlation is between **Rainfall_Ponte_Orgia** and both **Rainfall_Siena_Poggio_al_Vento** and **Rainfall_Monticiano_la_Pineta**. Keeping the last two features and dropping Rainfall_Ponte_Orgia is the best way to proceed.
* The volumes of pozzo 1 and pozzo 3 correlate well with **Depth_to_groundwater_Podere_Casetta**, which makes them useful for predicting this feature.

In [None]:
correlation("Aquifer Luco")

**<font color="green">4-  AQUIFER PETRIGNANO:</font>**
* Here, again, the **Depth_to_Groundwater** series for **P24** and **P25** are strongly correlated with each other, but we won't dismiss them as they are the variables we want to forecast.

In [None]:
correlation("Aquifer Petrignano")

**<font color="green">5-  LAKE BILANCINO:</font>**
* In this case, there is a very high correlation between all Rainfall features. They also have the same amount of missing values and, considering that they are 5 out of 8 total features, we will drop at most 2 of them.

In [None]:
correlation("Lake Bilancino")

**<font color="green">6-  RIVER ARNO:</font>**
* River Arno has a huge amount of missing data, which we have to take into account when dropping features. There are 2 clearly marked groups of correlated **Rainfall** features: **(Le_Croci, Cavallina, S_Agata, Mangona, S_Piero, Vernio)** and **(Stia, Consuma, Incisa, Montevarchi, S_Savino, Laterina, Bibbiena, Camaldoli)**. In the first group, only Vernio has over 40% of missing values, but it is the least correlated of that group. In the second group there are no strong correlations except between Laterina and Montevarchi, where Laterina is the candidate to drop due to its missing values.

In [None]:
correlation("River Arno")

**<font color="green">7-  WATER SPRING AMIATA:</font>**
* Observing the Rainfall, we can see a high correlation between **Rainfall_Abbadia_Salvatore, Rainfall_S_Fiora, Rainfall_Laghetto_Verde**. Of these 3, Laghetto Verde is the one with the fewest missing values, and S. Fiora is the most correlated.
* There is a high correlation between all 3 **Depth_to_Groundwater** features, so we recommend dropping at least 1. The feature with the fewest missing values in this case is Depth_to_Groundwater_S_Fiora_11bis.
* The last features to analyze from Amiata are the **Temperature** ones. Here again we have a strong correlation of 1 among all temperatures. Temperature_S_Fiora is by far the candidate to keep, with 0% of missing values, followed by Temperature_Laghetto_Verde with 52%, a huge difference in missing values.

In [None]:
correlation("Water Spring Amiata")

**<font color="green">8-  WATER SPRING LUPA:</font>**
* It has only one feature **(Rainfall)** and it is poorly correlated with the feature to forecast.

In [None]:
correlation("Water Spring Lupa")

**<font color="green">9-  WATER SPRING MADONNA DI CANNETO:</font>**
* It has two features **(Temperature and Rainfall)** that are poorly correlated with the feature to forecast, so we won't drop them.

In [None]:
correlation("Water Spring Madonna di Canneto")

## 3.2 Correlation by Threshold
Finally, we wrote a routine that filters the columns with a <ins>Threshold</ins> of **<font color="magenta">0.9</font>**. Because it walks through the columns sequentially, it cannot consider the state of all the variables holistically, so each decision to drop a variable depends on the column order and is final.

That's why we have to make a joint analysis to be sure which features to drop.

The table consists of nine columns, one per dataset, whose values are the features of that dataset with a correlation equal to or higher than the chosen Threshold.
 
Next we will see the features selected to be dropped when preprocessing the data.
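
The order dependence mentioned above can be made concrete with a small sketch. The helper `flag_correlated` and the 3x3 matrix are hypothetical, and the helper is a simplified variant of the greedy rule (it never flags the same column twice), not the exact routine of this notebook.

```python
import pandas as pd

def flag_correlated(corr, threshold=0.9):
    """Greedy pass over the lower triangle; returns the columns flagged to drop."""
    flagged = []
    cols = list(corr.columns)
    for i in range(len(cols)):
        for j in range(i):
            # A pair only counts if neither of its columns was flagged already.
            if corr.iloc[i, j] >= threshold and cols[j] not in flagged and cols[i] not in flagged:
                flagged.append(cols[i])
    return flagged

# Hypothetical correlation matrix: B is highly correlated with both A and C,
# while A and C stay just below the threshold.
corr = pd.DataFrame(
    [[1.00, 0.95, 0.85],
     [0.95, 1.00, 0.93],
     [0.85, 0.93, 1.00]],
    index=list("ABC"), columns=list("ABC"),
)

print(flag_correlated(corr))                 # column order A, B, C
reordered = corr.loc[list("BAC"), list("BAC")]
print(flag_correlated(reordered))            # column order B, A, C
```

With the columns ordered A, B, C only B is flagged, whereas the order B, A, C flags both A and C. The same data therefore leads to different drops, which is the reason for the joint analysis.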

In [None]:
threshold=0.9
flagged={}  # dataset name -> Series with the columns at or above the threshold
for key in datasets:
    Dataset=datasets[key]
    col_corr = set() # Set of all the names of flagged columns
    corr_matrix = Dataset.corr()
    column=[]  # a plain list: Series.append was removed in pandas 2.0
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if corr_matrix.iloc[i, j] >= threshold and (corr_matrix.columns[j] not in col_corr):
                colname = corr_matrix.columns[i] # getting the name of column
                col_corr.add(colname)
                if colname in Dataset.columns:
                    # the column is only recorded here; nothing is deleted from the dataset
                    column.append(colname)
    flagged[key]=pd.Series(column, dtype=object)
# concat aligns Series of different lengths, padding the shorter ones with NaN.
# Note: this name shadows the correlation() plotting function defined in 3.1.
correlation=pd.concat(flagged, axis=1)
display(correlation)

# 4 Data cleaning
**Data cleansing** or **data cleaning** is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.
This data is usually not necessary or helpful when it comes to analyzing data because it may hinder the process or provide inaccurate results.

The goal of data cleaning is to create data sets that are standardized and uniform to allow data analytics tools to easily access and find the right data for each query.

This process consists of dropping all the years selected in the **2.2 Cumulative Percentage of Missing Data** section, which we considered inaccurate records. We also drop the highly correlated features seen in **3 Correlation between features**, which we treat as redundant, in order to clean the data and stabilize the model.

Afterwards, we create new datasets with the features resampled by week: temperature, hydrometry, depth, lake level and flow rate columns are averaged, while rainfall and volume columns are summed. Each **Feature to Forecast** also gets a NEW FEATURE with the suffix `_shifted`, holding its values shifted by four weekly steps (28 days, roughly one month). The shift of 28 days is meant to use the past 28 days to predict the following week. The original columns of the features to forecast are then dropped, so that the remaining columns work as the input part of the model.

Finally, we select the **Features to Forecast**, store them in new variables prefixed with `y_`, and group them into four dictionaries, one for each category of waterbody (aquifers, water springs, river, lake).
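
The weekly resampling and the shift can be sketched on toy data. The frame `toy` and its column names are hypothetical; the calls (`resample('7D')`, `sum(min_count=1)`, `shift(-4)`) are the same ones used in the cell below.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data covering six 7-day bins.
idx = pd.date_range("2020-01-01", periods=42, freq="D")
toy = pd.DataFrame({
    "Rainfall_Toy": 1.0,                          # 1 mm every day
    "Flow_Rate_Toy": np.arange(42, dtype=float),
}, index=idx)
toy.loc[idx[7:14], "Rainfall_Toy"] = np.nan       # second week has no rainfall records

weekly = pd.DataFrame()
# Quantities that accumulate are summed; min_count=1 keeps an all-NaN week as NaN, not 0.
weekly["Rainfall_Toy"] = toy["Rainfall_Toy"].resample("7D").sum(min_count=1)
# Levels and rates are averaged over the week.
weekly["Flow_Rate_Toy"] = toy["Flow_Rate_Toy"].resample("7D").mean()

# shift(-4) places the value observed four weekly bins (28 days) later on the current row.
weekly["Flow_Rate_Toy_shifted"] = weekly["Flow_Rate_Toy"].shift(-4)
print(weekly)
```

The last four rows of the shifted column are NaN because no observation exists 28 days after them, so those rows cannot be used as training targets.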

In [None]:

#-----AQ AUSER----
Aq_Auser_1 = pd.DataFrame()
for i in Aq_Auser.columns:
    if i.startswith(('Temperature','Hydrometry','Depth')):
        Aq_Auser_1[i] = Aq_Auser[i].resample('7D').mean()
    elif i.startswith(('Rainfall','Volume')):
        Aq_Auser_1[i] = Aq_Auser[i].resample('7D').sum(min_count=1)
for i in ['Depth_to_Groundwater_LT2','Depth_to_Groundwater_SAL',
          'Depth_to_Groundwater_CoS']:
    Aq_Auser_1[i + '_shifted'] = Aq_Auser_1[i].shift(-4)
y_Aq_Auser = Aq_Auser_1[['Depth_to_Groundwater_LT2','Depth_to_Groundwater_SAL',
                    'Depth_to_Groundwater_CoS']]

#-----AQ DOGANELLA----
Aq_Doganella_1 = pd.DataFrame()
for i in Aq_Doganella.columns:
    if i.startswith(('Temperature','Hydrometry','Depth')):
        Aq_Doganella_1[i] = Aq_Doganella[i].resample('7D').mean()
    elif i.startswith(('Rainfall','Volume')):
        Aq_Doganella_1[i] = Aq_Doganella[i].resample('7D').sum(min_count=1)
for i in ['Depth_to_Groundwater_Pozzo_1','Depth_to_Groundwater_Pozzo_2','Depth_to_Groundwater_Pozzo_3',
          'Depth_to_Groundwater_Pozzo_4','Depth_to_Groundwater_Pozzo_5','Depth_to_Groundwater_Pozzo_6',
          'Depth_to_Groundwater_Pozzo_7','Depth_to_Groundwater_Pozzo_8','Depth_to_Groundwater_Pozzo_9']:
    Aq_Doganella_1[i + '_shifted'] = Aq_Doganella_1[i].shift(-4)
y_Aq_Doganella = Aq_Doganella_1[['Depth_to_Groundwater_Pozzo_1','Depth_to_Groundwater_Pozzo_2',
                              'Depth_to_Groundwater_Pozzo_3','Depth_to_Groundwater_Pozzo_4',
                              'Depth_to_Groundwater_Pozzo_5','Depth_to_Groundwater_Pozzo_6',
                              'Depth_to_Groundwater_Pozzo_7','Depth_to_Groundwater_Pozzo_8',
                              'Depth_to_Groundwater_Pozzo_9']]

#-----AQ LUCO----
Aq_Luco_1 = pd.DataFrame()
for i in Aq_Luco.columns:
    if i.startswith(('Temperature','Hydrometry','Depth')):
        Aq_Luco_1[i] = Aq_Luco[i].resample('7D').mean()
    elif i.startswith(('Rainfall','Volume')):
        Aq_Luco_1[i] = Aq_Luco[i].resample('7D').sum(min_count=1)
for i in ['Depth_to_Groundwater_Podere_Casetta']:
    Aq_Luco_1[i + '_shifted'] = Aq_Luco_1[i].shift(-4)
y_Aq_Luco = Aq_Luco_1[['Depth_to_Groundwater_Podere_Casetta']]

#-----AQ PETRIGNANO----
Aq_Petrignano_1 = pd.DataFrame()
for i in Aq_Petrignano.columns:
    if i.startswith(('Temperature','Hydrometry','Depth')):
        Aq_Petrignano_1[i] = Aq_Petrignano[i].resample('7D').mean()
    elif i.startswith(('Rainfall','Volume')):
        Aq_Petrignano_1[i] = Aq_Petrignano[i].resample('7D').sum(min_count=1)
for i in ['Depth_to_Groundwater_P24','Depth_to_Groundwater_P25']:
    Aq_Petrignano_1[i + '_shifted'] = Aq_Petrignano_1[i].shift(-4)
y_Aq_Petrignano = Aq_Petrignano_1[['Depth_to_Groundwater_P24','Depth_to_Groundwater_P25']]  # weekly frame, as for the other datasets

#-----LAKE BILANCINO----
Lake_Bilancino_1 = pd.DataFrame()
for i in Lake_Bilancino.columns:
    if i.startswith(('Temperature','Hydrometry','Depth','Lake_Level','Flow_Rate')):
        Lake_Bilancino_1[i] = Lake_Bilancino[i].resample('7D').mean()
    elif i.startswith(('Rainfall','Volume')):
        Lake_Bilancino_1[i] = Lake_Bilancino[i].resample('7D').sum(min_count=1)
for i in ['Lake_Level','Flow_Rate']:
    Lake_Bilancino_1[i + '_shifted'] = Lake_Bilancino_1[i].shift(-4)
y_Lake_Bilancino = Lake_Bilancino_1[['Lake_Level','Flow_Rate']]

#-----RIVER ARNO----
River_Arno_1 = pd.DataFrame()
for i in River_Arno.columns:
    if i.startswith(('Temperature','Hydrometry','Depth')):
        River_Arno_1[i] = River_Arno[i].resample('7D').mean()
    elif i.startswith(('Rainfall','Volume')):
        River_Arno_1[i] = River_Arno[i].resample('7D').sum(min_count=1)
for i in ['Hydrometry_Nave_di_Rosano']:
    River_Arno_1[i + '_shifted'] = River_Arno_1[i].shift(-4)
y_River_Arno = River_Arno_1[['Hydrometry_Nave_di_Rosano']]
    
#-----WS AMIATA----    
Ws_Amiata_1 = pd.DataFrame()
for i in Ws_Amiata.columns:
    if i.startswith(('Temperature','Hydrometry','Depth','Flow_Rate')):
        Ws_Amiata_1[i] = Ws_Amiata[i].resample('7D').mean()
    elif i.startswith(('Rainfall','Volume')):
        Ws_Amiata_1[i] = Ws_Amiata[i].resample('7D').sum(min_count=1)
for i in ['Flow_Rate_Bugnano','Flow_Rate_Arbure','Flow_Rate_Ermicciolo','Flow_Rate_Galleria_Alta']:
    Ws_Amiata_1[i + '_shifted'] = Ws_Amiata_1[i].shift(-4)
y_Ws_Amiata = Ws_Amiata_1[['Flow_Rate_Bugnano','Flow_Rate_Arbure','Flow_Rate_Ermicciolo','Flow_Rate_Galleria_Alta']]

#-----WS LUPA----
Ws_Lupa_1 = pd.DataFrame()
for i in Ws_Lupa.columns:
    if i.startswith(('Temperature','Hydrometry','Depth','Flow_Rate')):
        Ws_Lupa_1[i] = Ws_Lupa[i].resample('7D').mean()
    elif i.startswith(('Rainfall','Volume')):
        Ws_Lupa_1[i] = Ws_Lupa[i].resample('7D').sum(min_count=1)
for i in ['Flow_Rate_Lupa']:
    Ws_Lupa_1[i + '_shifted'] = Ws_Lupa_1[i].shift(-4)
y_Ws_Lupa = Ws_Lupa_1[['Flow_Rate_Lupa']]

#-----WS MADONNA DI CANNETO----
Ws_Madonna_di_Canneto_1 = pd.DataFrame()
for i in Ws_Madonna_di_Canneto.columns:
    if i.startswith(('Temperature','Hydrometry','Depth','Flow_Rate')):
        Ws_Madonna_di_Canneto_1[i] = Ws_Madonna_di_Canneto[i].resample('7D').mean()
    elif i.startswith(('Rainfall','Volume')):
        Ws_Madonna_di_Canneto_1[i] = Ws_Madonna_di_Canneto[i].resample('7D').sum(min_count=1)
for i in ['Flow_Rate_Madonna_di_Canneto']:
    Ws_Madonna_di_Canneto_1[i + '_shifted'] = Ws_Madonna_di_Canneto_1[i].shift(-4)
    
y_Ws_Madonna_di_Canneto = Ws_Madonna_di_Canneto_1[['Flow_Rate_Madonna_di_Canneto']]


# Features to Forecast linked by Category
y_Aq = {"Aquifer Auser":y_Aq_Auser, "Aquifer Doganella":y_Aq_Doganella, "Aquifer Luco":y_Aq_Luco,"Aquifer Petrignano":y_Aq_Petrignano}
y_Lake = {"Lake Bilancino":y_Lake_Bilancino}
y_River = {"River Arno":y_River_Arno}
y_Ws = {"Water Spring Amiata":y_Ws_Amiata, "Water Spring Lupa":y_Ws_Lupa, "Water Spring Madonna di Canneto":y_Ws_Madonna_di_Canneto}

In [None]:
Aq_Auser_1 = Aq_Auser_1[~(Aq_Auser_1.index.year < 2006)].drop(columns=['Depth_to_Groundwater_LT2','Depth_to_Groundwater_SAL',
                                                                 'Depth_to_Groundwater_CoS','Rainfall_Calavorno',
                                                                 'Rainfall_Fabbriche_di_Vallico','Temperature_Lucca_Orto_Botanico'])

Aq_Doganella_1 = Aq_Doganella_1[~(Aq_Doganella_1.index.year < 2012)].drop(columns=['Depth_to_Groundwater_Pozzo_1',
                                                                             'Depth_to_Groundwater_Pozzo_2',
                                                                             'Depth_to_Groundwater_Pozzo_3',
                                                                             'Depth_to_Groundwater_Pozzo_4',
                                                                             'Depth_to_Groundwater_Pozzo_5',
                                                                             'Depth_to_Groundwater_Pozzo_6',
                                                                             'Depth_to_Groundwater_Pozzo_7',
                                                                             'Depth_to_Groundwater_Pozzo_8',
                                                                             'Depth_to_Groundwater_Pozzo_9',
                                                                             'Temperature_Velletri'])

Aq_Luco_1 = Aq_Luco_1[~(Aq_Luco_1.index.year < 2012)].drop(columns=['Depth_to_Groundwater_Podere_Casetta',
                                                              'Rainfall_Ponte_Orgia'])

Aq_Petrignano_1 = Aq_Petrignano_1[~(Aq_Petrignano_1.index.year < 2009)].drop(columns=['Depth_to_Groundwater_P24',
                                                                                'Depth_to_Groundwater_P25'])

Lake_Bilancino_1 = Lake_Bilancino_1[~(Lake_Bilancino_1.index.year < 2004)].drop(columns=['Lake_Level','Flow_Rate',
                                                                                     'Rainfall_Cavallina',
                                                                                     'Rainfall_Le_Croci'])

River_Arno_1 = River_Arno_1[~(River_Arno_1.index.year < 2004)].drop(columns=['Hydrometry_Nave_di_Rosano',
                                                                        'Rainfall_Cavallina', 'Rainfall_Vernio',
                                                                         'Rainfall_Laterina'])

Ws_Amiata_1 = Ws_Amiata_1[~(Ws_Amiata_1.index.year < 2012)].drop(columns=['Flow_Rate_Bugnano','Flow_Rate_Arbure',
                                                                    'Flow_Rate_Ermicciolo','Flow_Rate_Galleria_Alta',
                                                                    'Rainfall_S_Fiora',
                                                                    'Depth_to_Groundwater_David_Lazzaretti',
                                                                    'Temperature_Abbadia_S_Salvatore',
                                                                    'Temperature_Laghetto_Verde'])

Ws_Lupa_1 = Ws_Lupa_1.drop(columns=['Flow_Rate_Lupa'])

Ws_Madonna_di_Canneto_1 = Ws_Madonna_di_Canneto_1.drop(columns=['Flow_Rate_Madonna_di_Canneto'])


datasets_1 = {"Aquifer Auser":Aq_Auser_1, "Aquifer Doganella":Aq_Doganella_1, "Aquifer Luco":Aq_Luco_1, "Aquifer Petrignano":Aq_Petrignano_1,
            "Lake Bilancino":Lake_Bilancino_1, "River Arno":River_Arno_1, "Water Spring Amiata":Ws_Amiata_1, "Water Spring Lupa":Ws_Lupa_1, 
            "Water Spring Madonna di Canneto":Ws_Madonna_di_Canneto_1}

# 5 Working with Outliers
First of all, a loop drops the runs of consecutive 0 values discussed earlier in **1.1.3 Visual/graphical description of the data**, where many NaN values had been filled with 0 in the <ins>Temperature</ins> and <ins>Hydrometry</ins> features (the loop applies the same rule to the flow rate columns). This cleans the data and leads to a better estimation.
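
To see what this filter keeps and what it removes, here is a small sketch on a made-up series (the values in `s` are hypothetical); the chain of calls is the same one used in the loop.

```python
import numpy as np
import pandas as pd

# Hypothetical temperature record where a sensor gap was stored as a run of zeros.
s = pd.Series([12.0, 0.0, 11.5, 0.0, 0.0, 0.0, 0.0, 10.8])

# Zeros become NaN, then each NaN is filled from a valid neighbour at most one step away.
# Whatever is still NaN sits in the interior of a long zero run and is dropped.
keep = s.replace(0, np.nan).ffill(limit=1).bfill(limit=1).notnull()
cleaned = s[keep]
print(cleaned)
```

Isolated zeros and the edges of a run survive, since a zero next to a valid reading may be a genuine measurement. When the filtered series is assigned back to a DataFrame column, the removed positions reappear as NaN because of index alignment.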

In [None]:
for i in datasets_1:
    j=datasets_1[i]
    for k in j.columns:
        if 'Temperature' in k or 'Hydrometry' in k or 'Flow' in k:
            j[k]=j[k][j[k].replace(0,np.nan).ffill(limit=1).bfill(limit=1).notnull()]
        else:
            pass
    datasets_1[i]=j

Then, we plot the features over time together with their outliers. These outliers are selected by the **Box and Whiskers Method**.
This method is widely used to detect outliers. A box-and-whisker plot summarizes a dataset very effectively by means of measures of central tendency and spread.
The box and whiskers plot was first introduced in 1970 by John Tukey, who later published on the subject in 1977.

The box plot is one of a diverse family of statistical techniques, called exploratory data analysis, used to visually identify patterns that may otherwise be hidden in a data set. It summarizes the data according to a distance index that helps reveal its properties. In this method, a rectangle (box) and two lines extending from the rectangle (whiskers) are used, and the plot is drawn from the median, the first and third quartiles, and the lowest and highest observed values.

The length of the rectangle equals the interquartile range (<font color="magenta">IQR= Q3-Q1</font>). Inner and outer fences are set at successive multiples of it. A value that lies between the inner and outer fences is called a weak outlier (smaller than <font color="magenta">Q1-1.5IQR</font> or bigger than <font color="magenta">Q3+1.5IQR</font>), and a value that lies beyond the outer fences is regarded as a strong outlier (smaller than <font color="magenta">Q1-3IQR</font> or bigger than <font color="magenta">Q3+3IQR</font>).
<center><img src="https://upload.wikimedia.org/wikipedia/commons/1/1a/Boxplot_vs_PDF.svg" width="350" height="350" /></center>
<center>Figure: Boxplot and probability density function (pdf) of a Normal N(0, σ<sup>2</sup>) population</center>
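
The fences can also be computed directly. The sketch below uses made-up values (the array `values` is hypothetical) and NumPy's default linear interpolation for the quartiles, which may differ slightly from other quartile conventions.

```python
import numpy as np

# Hypothetical weekly values with one moderate and one extreme observation.
values = np.array([10, 11, 12, 12, 13, 13, 14, 15, 21, 40], dtype=float)

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # inner fences
outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr   # outer fences

outside_inner = (values < inner_low) | (values > inner_high)
strong = (values < outer_low) | (values > outer_high)
weak = outside_inner & ~strong                            # between inner and outer fences

print("weak outliers:", values[weak])
print("strong outliers:", values[strong])
```

Here Q1 = 12 and Q3 = 14.75, so the inner fence sits at 18.875 and the outer fence at 23: the value 21 is a weak outlier and 40 is a strong one.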

In [None]:
def boxplot (i):
    j=datasets_1[i]
    if len(j.columns)>3:
        fig, axs = plt.subplots(math.ceil(len(j.columns)/4),4, figsize=(15, 20*math.ceil(len(j.columns)/4)/5), facecolor='w', edgecolor='k')
        fig.subplots_adjust(hspace = 2, wspace=.5)
        y=0
        for k in j.columns:
            l, m = divmod(y, 4)
            if "Rainfall" in k:
                unit= "Precipitation (mm/week)"
                data=j[k][j[k]>0]
            elif "Temperature" in k:
                unit= "Temperature (°C)"
                data=j[k]
            elif "Volume" in k:
                unit= "Volume of water (m3/week)"
                data=j[k]
            elif "Hydrometry" in k:
                unit= "Groundwater level (mts)"
                data=j[k]
            elif "Depth" in k:
                unit= "Distance from ground level (mts)"
                data=j[k]
            elif "Flow" in k:
                unit= "Flow Rate (m3/s)"
                data=j[k]
            elif "Lake" in k:
                unit= "Lake level (mts)"
                data=j[k]
            else:
                unit='Value'
                data=j[k]
            sns.boxplot(data=data,color='green',showmeans=True,
                                 meanprops={"marker":"o",
                                            "markerfacecolor":"white", 
                                            "markeredgecolor":"black",
                                            "markersize":"10"},ax=axs[l,m])
            axs[l,m].set_title(i.upper()+'\n'+k, fontdict={'fontsize':12}, pad=16)
            axs[l,m].set_xticklabels('', rotation=90)
            axs[l,m].set_ylabel(unit, size=10)
            y+=1
    elif len(j.columns)<=3:
        fig, axs = plt.subplots(1,len(j.columns), figsize=(15*len(j.columns)/4, 4), facecolor='w', edgecolor='k')
        fig.subplots_adjust(wspace=10)   
        y=0
        for k in j.columns:
            sns.boxplot(data=j[k],color='green',showmeans=True,
                                 meanprops={"marker":"o",
                                            "markerfacecolor":"white", 
                                            "markeredgecolor":"black",
                                            "markersize":"10"},ax=axs[y])
            if "Rainfall" in k:
                unit= "Precipitation (mm/week)"
            elif "Temperature" in k:
                unit= "Temperature (°C)"
            elif "Volume" in k:
                unit= "Volume of water (m3/week)"
            elif "Hydrometry" in k:
                unit= "Groundwater level (mts)"
            elif "Depth" in k:
                unit= "Distance from ground level (mts)"
            elif "Flow" in k:
                unit= "Flow Rate (m3/s)"
            elif "Lake" in k:
                unit= "Lake level (mts)"
            else:
                unit='Value'
            axs[y].set_title(i.upper()+'\n'+k, fontdict={'fontsize':12}, pad=16)
            axs[y].set_xticklabels('', rotation=90)
            axs[y].set_ylabel(unit, size=10)
            y+=1
    plt.tight_layout()
    plt.show()

## 5.1 Boxplot Representation 
**<font color="green">1-  AQUIFER AUSER:</font>**

* Starting with the Rainfall features, we can clearly see that they have a large number of outliers and that the IQR (interquartile range) generally lies between 0 and 50 mm. They all have a Gamma-like distribution, a common behavior for rainfall.
* For Depth to Groundwater, Pag has a Gaussian-like distribution with no outliers, while Diec has a more dispersed distribution with a number of outliers at smaller distances from the ground level.
* Temperature, as expected, has a marked Gaussian distribution, with an IQR of roughly 7-20 degrees.
* Volume features behave differently and can be divided into 3 groups: Pol, which has a Gaussian distribution with an IQR between 50,000-75,000 cubic meters per week; CC1 and CC2, which have a more dispersed distribution with outliers far from the IQR, which is around 110,000-120,000 and 78,000-90,000 cubic meters per week respectively; and CSA and CSAL, whose IQR starts at 0 and extends to 40,000 and 30,000 respectively, with no outliers.
* The Hydrometry features, Monte S Quirico and Piaggione, can take negative values and have notable outliers above the IQR, especially Monte S Quirico.
* Finally, for the features to forecast shifted by one month, Depth to Groundwater LT2 shifted and SAL shifted have a more compact IQR with some outliers below it, while CoS shifted shows Gaussian behavior with no outliers to remove.

In [None]:
boxplot("Aquifer Auser")

**<font color="green">2-  AQUIFER DOGANELLA:</font>**

* Starting with the Rainfall features, again we can clearly see that they have a large number of outliers. In this case, the IQR (interquartile range) generally lies between 0 and 40 mm. They all have a Gamma-like distribution, a common behavior for rainfall.
* Volume features behave similarly, with Pozzo 1 as the exception: its IQR lies between 0-10,000 cubic meters per week, with many outliers above it. The rest (Pozzo 2, Pozzo 3, Pozzo 4, Pozzo 5+6, Pozzo 7, Pozzo 8 and Pozzo 9) have an IQR that starts at 0 and generally extends to 25,000, keeping in mind that Pozzo 5+6 is the sum of two wells. They show no outliers.
* Temperature has a marked Gaussian distribution, with an IQR of roughly 8-19 degrees.
* Finally, for the features to forecast shifted by one month, the Depth to Groundwater of the different Pozzo wells has a roughly Gaussian distribution and no outliers, with some exceptions such as Pozzo 6 and Pozzo 9, which have outliers above the IQR.

In [None]:
boxplot("Aquifer Doganella")

**<font color="green">3-  AQUIFER LUCO:</font>**

* Starting with the Rainfall features, again we can clearly see that they have a large number of outliers. In this case, the IQR generally lies between 0 and 40 mm. They all have a Gamma-like distribution, a common behavior for rainfall. We can observe extreme cases such as Pentolina, where the precipitation measured in one week was above 2,000 mm. Outliers of this kind clearly unbalance the feature.
* For Depth to Groundwater, each of the three features has its own behavior, but their IQRs fall in a very similar range. All of them have only a few outliers.
* Temperature features have a marked Gaussian distribution, with an IQR of roughly 9-21 degrees and no outliers.
* Volume features behave similarly, resembling a Gamma distribution with an IQR near 0. Their IQR goes from 0 to 1,250 in the three cases, with no outliers.

In [None]:
boxplot("Aquifer Luco")

**<font color="green">4-  AQUIFER PETRIGNANO:</font>**

* There is only one Rainfall feature, with an IQR generally between 5 and 25 mm. It has a Gamma-like distribution with a large number of outliers.
* Temperature features have a marked Gaussian distribution, with an IQR of roughly 9-22 degrees and no outliers.
* The Volume feature resembles a Gaussian distribution with a narrow IQR between 190,000-210,000 cubic meters per week. It shows many outliers in both directions.
* Hydrometry Fiume Chiascio Petrignano behaves similarly to a Gaussian distribution, with an IQR between 2.1-2.75 mts.
* Finally, for the features to forecast shifted by one month, Depth to Groundwater P24 shifted and P25 shifted have a distribution similar to a Gaussian, with an IQR between 24-28 mts and outliers above it.

In [None]:
boxplot("Aquifer Petrignano")

**<font color="green">5-  LAKE BILANCINO:</font>**

* Rainfall features show a large number of outliers above their IQR. In this case, the IQR generally lies between 5 and 25 mm, except for Mangona with 5-50 mm. They all have a Gamma-like distribution.
* The Temperature feature has a marked Gaussian distribution, with an IQR of roughly 9-20 degrees and no outliers.
* The first feature to forecast, Lake Level shifted by one month, has a distribution similar to a Gaussian, with an IQR between 248-251 mts.
* The second feature to forecast, shifted by one month, is the Flow Rate. It has a more Gamma-like distribution with many outliers above the IQR. This IQR is very narrow, ranging between 0-4 cubic meters per second.

In [None]:
boxplot("Lake Bilancino")

**<font color="green">6-  RIVER ARNO:</font>**

* Arno has many Rainfall features, all with a very similar distribution. In this case, the IQR generally lies between 5 and 40. They all have a Gamma-like distribution and show a large number of outliers above their IQR.
* The Temperature feature has a marked Gaussian distribution, with an IQR of roughly 11-23 degrees and no outliers.
* The last feature is the one to forecast shifted by one month, in this case Hydrometry Nave di Rosano shifted. It has an IQR of 1.1-1.7 mts with many outliers, most of them above the IQR.

In [None]:
boxplot("River Arno")

**<font color="green">7-  WATER SPRING AMIATA:</font>**

* Amiata Rainfall features have a very similar distribution, with an IQR between 5 and 50. They all have a Gamma-like distribution and show a large number of outliers above their IQR.
* For Depth to Groundwater, the distributions of both S Fiora 8 and S Fiora 11bis are similar to a Gaussian and show no outliers.
* The Temperature feature has a marked Gaussian distribution, with an IQR of roughly 6-16 degrees and no outliers.
* The last features are the ones to forecast shifted by one month, in this case the Flow Rate shifted features. They have markedly different distributions with many outliers below their IQR, except Galleria Alta shifted, which has no outliers at all.

In [None]:
boxplot("Water Spring Amiata")

**<font color="green">8-  WATER SPRING LUPA:</font>**

* Lupa Rainfall has an IQR between 10 and 20, with many outliers above it. It also has a Gamma-like distribution.
* The feature to forecast shifted by one month is Flow Rate Lupa shifted. It has an IQR between 90-110 cubic meters per second, with many outliers above it.

In [None]:
boxplot("Water Spring Lupa")

**<font color="green">9-  WATER SPRING MADONNA DI CANNETO:</font>**

* The Rainfall feature has an IQR between 0 and 40, with many outliers above it. It also has a Gamma-like distribution.
* The Temperature feature has a marked Gaussian distribution, with an IQR of roughly 7-19 degrees and no outliers.
* The feature to forecast shifted by one month is Flow Rate Madonna di Canneto shifted. It has an IQR between 225-290 cubic meters per second and no outliers. Its distribution is also similar to a Gaussian.

In [None]:
boxplot("Water Spring Madonna di Canneto")

## 5.2 Timeline Representation

This is another, more visual way to represent the outliers we want to drop in order to normalize the features.

We will not go into detail or analyze the features one by one. This section only gives a closer look at the outliers to remove by marking each individual outlier along the whole time series of each feature.

Finally, a loop removes all these outliers from each feature.

In [None]:
def outlier_lineplots (i):
    j=datasets_1[i]
    for k in j.columns:
        sns.set_theme(style="whitegrid")
        fig, ax = plt.subplots(figsize=(20,2.5))
        if "Rainfall" in k:
            unit= "Precipitation (mm)"
            data=j[k][j[k]>0]
        elif "Temperature" in k:
            unit= "Temperature (°C)"
            data=j[k]
        elif "Volume" in k:
            unit= "Volume of water (m3)"
            data=j[k]
        elif "Hydrometry" in k:
            unit= "Groundwater level (mts)"
            data=j[k]
        elif "Depth" in k:
            unit= "Distance from ground level (mts)"
            data=j[k]
        elif "Flow" in k:
            unit= "Flow Rate (m3/s)"
            data=j[k]
        elif "Lake" in k:
            unit= "Lake level (mts)"
            data=j[k]
        else:
            # Fallback so that unit and data are always defined
            unit= "Value"
            data=j[k]
        sns.lineplot(x=j.index,y=j[k],color = 'orangered',label='Missing Values')
        sns.lineplot(x=j.index,y=j[k].fillna(np.inf),color = 'seagreen',label='Original')
        Q1 = data.quantile(0.25)
        Q3 = data.quantile(0.75)
        IQR = Q3 - Q1
        # Mark the strong outliers (beyond 3*IQR), the same rule used by the removal loop
        m=pd.DataFrame (j[k].where((j[k]>(Q3+3*IQR)) | (j[k]<(Q1-3*IQR))))
        sns.scatterplot(data=m,palette=["red"],s=60,legend=False,label='Extreme Outliers')
        ax.set_title(label=i.upper()+'\n'+k, fontdict={'fontsize':15}, pad=16)
        ax.set_xlabel('Date' , fontsize=12)
        ax.set_ylabel( unit, fontsize=12)
        ax.tick_params(axis='both', labelsize=12)
        plt.legend(fontsize=12,loc='upper left')
        plt.xlim(j.index.min(),j.index.max())
        plt.show()

**<font color="green">1-  AQUIFER AUSER:</font>**

In [None]:
outlier_lineplots('Aquifer Auser')

**<font color="green">2-  AQUIFER DOGANELLA:</font>**

In [None]:
outlier_lineplots('Aquifer Doganella')

**<font color="green">3-  AQUIFER LUCO:</font>**

In [None]:
outlier_lineplots('Aquifer Luco')

**<font color="green">4-  AQUIFER PETRIGNANO:</font>**

In [None]:
outlier_lineplots('Aquifer Petrignano')

**<font color="green">5-  LAKE BILANCINO:</font>**

In [None]:
outlier_lineplots('Lake Bilancino')

**<font color="green">6-  RIVER ARNO:</font>**

In [None]:
outlier_lineplots('River Arno')

**<font color="green">7-  WATER SPRING AMIATA:</font>**

In [None]:
outlier_lineplots('Water Spring Amiata')

**<font color="green">8-  WATER SPRING LUPA:</font>**

In [None]:
outlier_lineplots('Water Spring Lupa')

**<font color="green">9-  WATER SPRING MADONNA DI CANNETO:</font>**

In [None]:
outlier_lineplots('Water Spring Madonna di Canneto')

In [None]:
for i in datasets_1:
    j=datasets_1[i]
    for k in j.columns:
        if k.endswith('_shifted'):
            # Target columns are left untouched
            pass
        else:
            if 'Rainfall' in k:
                # Zero-rainfall records are excluded when estimating the quartiles
                data=j[k][j[k]>0]
            else:
                data=j[k]
            # pandas quantile skips NaN; np.percentile would return NaN on columns with missing values
            q1 = data.quantile(0.25)
            q3 = data.quantile(0.75)
            IQR = q3 - q1
            # Replace strong outliers (beyond 3*IQR from the quartiles) with NaN
            j[k]=np.where((j[k]>(q3+3*IQR)) | (j[k]<(q1-3*IQR)),np.nan,j[k])
    datasets_1[i]=j

# 6 Filling data with K-NN Imputer
For various reasons, many real world datasets contain missing values, often encoded as blanks, NaNs or other placeholders. Such datasets however are incompatible with scikit-learn estimators which assume that all values in an array are numerical, and that all have and hold meaning. A basic strategy to use incomplete datasets is to discard entire rows and/or columns containing missing values. However, this comes at the price of losing data which may be valuable (even though incomplete). A better strategy is to impute the missing values, i.e., to infer them from the known part of the data.

The **KNNImputer** class provides imputation for filling in missing values using the k-Nearest Neighbors approach. By default, a Euclidean distance metric that supports missing values is used to find the nearest neighbors. Each missing feature is imputed using values from the nearest neighbors that have a value for that feature. The features of the neighbors are averaged uniformly or weighted by the distance to each neighbor.

If a sample has more than one feature missing, then the neighbors for that sample can be different depending on the particular feature being imputed. When the number of available neighbors is less than the requested number of nearest neighbors (`n_neighbors`) and there are no defined distances to the training set, the training set average for that feature is used during imputation. If there is at least one neighbor with a defined distance, the weighted or unweighted average of the remaining neighbors will be used during imputation.

First we will estimate the best number of nearest neighbors with a cross-validation score, comparing the mean squared error (MSE) obtained with different numbers of nearest neighbors between 1 and 21 for each dataset. Then, based on the MSE, we will select the optimal number of neighbors for each dataset and proceed with the imputation.
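
Before the cross-validation, a minimal sketch of what `KNNImputer` does on a tiny array may help. The array `X` below is a made-up toy example, not Acea data, and `n_neighbors=2` is chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with two missing entries (hypothetical values)
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Same settings used later in this notebook, except for the number of neighbors
imputer = KNNImputer(n_neighbors=2, weights="uniform", metric="nan_euclidean")
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Each missing entry is replaced by the unweighted mean of that feature over the two nearest rows that have a value for it. For the first row, the two nearest donors have values 3 and 5 in the last column, so the imputed value is 4.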

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import KNNImputer
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

strategies = [str(n) for n in [1,3,5,7,11,13,15,17,19,21]]

def KNN_Imputer(i):
    j=datasets_1[i]
    if 'Aq' in i:
        features = y_Aq
    elif 'Lake' in i:
        features = y_Lake
    elif 'River' in i:
        features = y_River
    elif 'Water' in i:
        features = y_Ws
    model=RandomForestRegressor(random_state=17)
    for y in features:
        if y==i:
            train_y=features[y]
            # DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
            rows=[]
            for col in train_y:
                for s in strategies:                            
                    pipeline = Pipeline(steps=[('n', KNNImputer(n_neighbors=int(s))), ('m', model)])
                    Cv = model_selection.KFold(n_splits=10)
                    scores = cross_val_score(pipeline, 
                                             j.loc[j.index.intersection(train_y[col].dropna().index)],
                                             train_y[col].loc[train_y[col].index.intersection(j.index)].dropna(),
                                             scoring='neg_mean_squared_error',n_jobs=-1,cv=Cv)
                    scores=abs(scores)
                    rows.append({'Nearest Neighbors':s,'Mean Squared Error':np.mean(scores), 'Std':np.std(scores)})
            values=pd.DataFrame(rows,columns=['Nearest Neighbors','Mean Squared Error', 'Std'])
            # Average the scores of all target columns for each number of neighbors
            values=values.groupby(by=['Nearest Neighbors'],as_index=False).mean()
            values['Nearest Neighbors']=values['Nearest Neighbors'].astype('int64')
            values.sort_values(by='Nearest Neighbors',ascending=True,inplace=True)
            print(values)
            sns.lineplot(y= values['Mean Squared Error'],x= values['Nearest Neighbors'])
            plt.ylabel("Mean Squared Error")
            plt.xlabel("N° Neighbors")
            plt.show()

**<font color="green">1-  AQUIFER AUSER:</font>**
We can observe that the best number of nearest neighbors for Aquifer Auser imputation is **<font color="magenta">7</font>**, which gave us the smallest MSE in the cross-validation.

In [None]:
KNN_Imputer('Aquifer Auser')

**<font color="green">2-  AQUIFER DOGANELLA:</font>**
We can observe that the best number of nearest neighbors for Aquifer Doganella imputation is **<font color="magenta">19</font>**, which gave us the smallest MSE in the cross-validation.

In [None]:
KNN_Imputer('Aquifer Doganella')

**<font color="green">3-  AQUIFER LUCO:</font>**
We can observe that the best number of nearest neighbors for Aquifer Luco imputation is **<font color="magenta">11</font>**, which gave us the smallest MSE in the cross-validation.

In [None]:
KNN_Imputer('Aquifer Luco')

**<font color="green">4-  AQUIFER PETRIGNANO:</font>**
We can observe that the best number of nearest neighbors for Aquifer Petrignano imputation is **<font color="magenta">21</font>**, which gave us the smallest MSE in the cross-validation.

In [None]:
KNN_Imputer('Aquifer Petrignano')

**<font color="green">5-  LAKE BILANCINO:</font>**
We can observe that the best number of nearest neighbors for Lake Bilancino imputation is **<font color="magenta">17</font>**, which gave us the smallest MSE in the cross-validation.

In [None]:
KNN_Imputer('Lake Bilancino')

**<font color="green">6-  RIVER ARNO:</font>**
We can observe that the best number of nearest neighbors for River Arno imputation is **<font color="magenta">1</font>**, which gave us the smallest MSE in the cross-validation.

In [None]:
KNN_Imputer('River Arno')

**<font color="green">7-  WATER SPRING AMIATA:</font>**
We can observe that the best number of nearest neighbors for Water Spring Amiata imputation is **<font color="magenta">1</font>**, which gave us the smallest MSE in the cross-validation.

In [None]:
KNN_Imputer('Water Spring Amiata')

**<font color="green">8-  WATER SPRING LUPA:</font>**
We can observe that the best number of nearest neighbors for Water Spring Lupa imputation is **<font color="magenta">3</font>**, which gave us the smallest MSE in the cross-validation.

In [None]:
KNN_Imputer('Water Spring Lupa')

**<font color="green">9-  WATER SPRING MADONNA DI CANNETO:</font>**
We can observe that the best number of nearest neighbors for Water Spring Madonna di Canneto imputation is **<font color="magenta">17</font>**, which gave us the smallest MSE in the cross-validation.

In [None]:
KNN_Imputer('Water Spring Madonna di Canneto')

In [None]:

# Number of neighbors selected for each dataset in the analysis above
best_n_neighbors = {
    'River Arno': 1,
    'Water Spring Amiata': 1,
    'Water Spring Lupa': 3,
    'Aquifer Auser': 7,
    'Aquifer Luco': 11,
    'Lake Bilancino': 17,
    'Water Spring Madonna di Canneto': 17,
    'Aquifer Doganella': 19,
    'Aquifer Petrignano': 21,
}

for i in datasets_1:
    j=datasets_1[i]
    n = best_n_neighbors[i]
    imputer = KNNImputer(n_neighbors=n, weights='uniform', metric='nan_euclidean')
    j = pd.DataFrame(imputer.fit_transform(j),columns = j.columns, index=j.index)
    datasets_1[i]=j

# 7 Time Series Decomposition

Time series data can exhibit a variety of patterns, and it is often helpful to split a time series into several components, each representing an underlying pattern category.
If we assume an additive decomposition, then we can write: 
**<font color="magenta"><center>y<sub>t</sub> = S<sub>t</sub> + T<sub>t</sub> + R<sub>t</sub></center></font>**

where **<font color="magenta">y<sub>t</sub></font>** is the **data**, **<font color="magenta">S<sub>t</sub></font>** is the **seasonal component**, **<font color="magenta">T<sub>t</sub></font>** is the **trend-cycle component**, and **<font color="magenta">R<sub>t</sub></font>** is the **residual component**, all at **period** **<font color="magenta">t</font>**.

* **S<sub>t</sub> (Seasonal component at time t):** reflects seasonality (seasonal variation). A seasonal pattern exists when a time series is influenced by seasonal factors. Seasonality occurs over a fixed and known period (e.g., the quarter of the year, the month, or day of the week).
* **T<sub>t</sub> (Trend component at time t):** reflects the long-term progression of the series (secular variation). A trend exists when there is a persistent increasing or decreasing direction in the data. The trend component does not have to be linear.
* **R<sub>t</sub> (Residual component at time t):** the irregular component (or "noise") at time t, which describes random, irregular influences. It represents the residuals or remainder of the time series after the other components have been removed.

Alternatively, a multiplicative decomposition would be written as:
**<font color="magenta"><center>y<sub>t</sub> = S<sub>t</sub> × T<sub>t</sub> × R<sub>t</sub></center></font>**



The additive decomposition is the most appropriate if the magnitude of the seasonal fluctuations, or the variation around the trend-cycle, does not vary with the level of the time series. When the variation in the seasonal pattern, or the variation around the trend-cycle, appears to be proportional to the level of the time series, then a multiplicative decomposition is more appropriate. Multiplicative decompositions are common with economic time series.

An alternative to using a multiplicative decomposition is to first transform the data until the variation in the series appears to be stable over time, then use an additive decomposition. When a log transformation has been used, this is equivalent to using a multiplicative decomposition because:
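**<center><font color="magenta">y<sub>t</sub> = S<sub>t</sub> × T<sub>t</sub> × R<sub>t</sub></font> is equivalent to:
<font color="magenta">log(y<sub>t</sub>) = log(S<sub>t</sub>) + log(T<sub>t</sub>) + log(R<sub>t</sub>)</font></center>**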
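
The equivalence can be checked numerically on a synthetic series. The components below (a linear trend, a sinusoidal seasonal factor and a small multiplicative noise) are assumptions made up for this sketch and are not taken from the Acea datasets.

```python
import numpy as np

# Synthetic, strictly positive components (hypothetical values)
t = np.arange(24)
trend = 100 + 2.0 * t                              # T_t
seasonal = 1 + 0.2 * np.sin(2 * np.pi * t / 12)    # S_t, period of 12 steps
rng = np.random.default_rng(0)
remainder = np.exp(rng.normal(0, 0.01, size=t.size))  # R_t, close to 1

# Multiplicative series
y = seasonal * trend * remainder

# After a log transformation the components add up instead of multiplying
log_sum = np.log(seasonal) + np.log(trend) + np.log(remainder)
print(np.allclose(np.log(y), log_sum))
```

Note that the log transformation requires strictly positive data, so it would not apply directly to features that can be zero or negative.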

The three components are shown separately in the bottom three panels of the following plots. These components can be added together to reconstruct the data shown in the top panel. Notice that the seasonal component changes slowly over time, so that any two consecutive years have similar patterns, but years far apart may have different seasonal patterns. The residual component shown in the bottom panel is what is left over when the seasonal and trend-cycle components have been subtracted from the data.

We will use STL decomposition:
**STL** is a versatile and robust method for decomposing time series. **STL** is an acronym for **“Seasonal and Trend decomposition using Loess”**, while Loess is a method for estimating nonlinear relationships. The STL method was developed by R. B. Cleveland, Cleveland, McRae, & Terpenning (1990).

**STL** has several advantages over the classical, SEATS and X11 decomposition methods:
* Unlike SEATS and X11, **STL** will handle any type of seasonality, not only monthly and quarterly data.
* The seasonal component is allowed to change over time, and the rate of change can be controlled by the user.
* The smoothness of the trend-cycle can also be controlled by the user.

It can be robust to outliers (i.e., the user can specify a robust decomposition), so that occasional unusual observations will not affect the estimates of the trend-cycle and seasonal components. They will, however, affect the residual component. We will not activate robust decomposition because previously we dropped most outliers.

In [None]:
import statsmodels.api as sm
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

def decomposition(i):
    j=datasets_1[i]
    sns.set_style('whitegrid')
    sns.set_context('talk')
    for k in j.columns:
        print(color.UNDERLINE + color.CYAN + color.BOLD + k.upper() + color.END)
        res=sm.tsa.STL(j[k], period=52,seasonal=13).fit()
        fig, (ax1,ax2,ax3,ax4) = plt.subplots(4,1, figsize=(20,10))
        res.observed.plot(ax=ax1,title='Data')
        res.trend.plot(ax=ax2, title='Trend-Cycle')
        res.seasonal.plot(ax=ax3, title='Seasonal')
        res.resid.plot(ax=ax4, title='Resid')
        fig.tight_layout()
        plt.show()


**<font color="green">1-  AQUIFER AUSER:</font>**

* Starting with the Rainfall features, we can observe a similar wavy trend-cycle for all of them, decreasing in 2012 and increasing in 2013. They all have a similar seasonal behavior, with peaks in winter and lows in summer.
* Next, Depth to Groundwater PAG and DIEC also have a similar trend, with an increase in 2012 and a drop in 2013, followed by stabilization in the following years. Their seasonal behavior is also similar, with peaks in the last months of each year and drastic decreases in the first months.
* As for Temperature, the trend is almost linear with slight growth. The seasonality is clearly marked, decreasing in winter and increasing in summer.
* Volume features all have different trends and seasonality, although the seasonality is marked in most cases. The trend for Volume POL and CC2 is decreasing, whereas CC1, CSA and CSAL each follow their own pattern over time.
* Hydrometry features behave similarly, with the Piaggione trend flatter than that of Monte S Quirico. Again, the trend has a minimum in 2012 and starts growing in 2013. In both cases, the seasonal component reaches its maximum in summer and decreases during the rest of the year.
* Depth to Groundwater shifted features have a decreasing trend over the years. As for seasonality, LT2 and SAL are more uniform except for the last few years, where we can see negative peaks, while COS has a marked seasonality, decreasing in winter and increasing in summer.

In [None]:
decomposition('Aquifer Auser')

**<font color="green">2-  AQUIFER DOGANELLA:</font>**

* Rainfall features have a large gap in the middle years, from 2015 to 2018/19, where the trend decreases noticeably. Their seasonal behavior is similar and heterogeneous, made up of many peaks.
* Next, analyzing the volume of water taken, Pozzo 1, Pozzo 2 and Pozzo 4 have a very similar trend, almost linear until 2017, when it drops and then starts growing back. Pozzo 3, Pozzo 7 and Pozzo 9 also behave similarly to each other, in this case with a maximum in 2017 followed by a progressive decrease.
* As for Temperature, the trend is almost linear until it drops to its minimum in 2017, after which it starts growing back. The seasonality is clearly marked, decreasing in winter and increasing in summer.
* Almost every Depth to Groundwater shifted feature has a decreasing trend until 2016, when it gradually starts growing back. They don't show a clear seasonal pattern.

In [None]:
decomposition ('Aquifer Doganella')

**<font color="green">3-  AQUIFER LUCO:</font>**

* Rainfall features for Luco have an almost linear trend, with a slight decrease in 2020. They don't have a marked seasonality, but there seem to be peaks in the autumn months.
* As for Depth to Groundwater, there is a clear similarity between Pozzo 1 and Pozzo 3, whose trend grows quickly. Their seasonality is not clear, but the first months of the year show larger values. Pozzo 4, on the other hand, has an almost linear trend until 2017, when it increases to reach its maximum in 2018 and then decreases steadily. It also has a marked seasonality, with values dropping considerably in winter.
* For Temperature, the trend is almost linear with slight growth, reaching its maximum in 2020. The seasonality is clearly marked, decreasing in winter and increasing in summer.
* Volume Pozzo 1, Pozzo 3 and Pozzo 4 behave similarly: they decrease until about 2016, then increase until 2018, when they reach their maximum values, and decrease again.
* Depth to Groundwater shifted has a wavy trend, with its minimum in mid 2014 and its maximum in late 2017.

In [None]:
decomposition ('Aquifer Luco')

**<font color="green">4-  AQUIFER PETRIGNANO:</font>**

* The Rainfall feature shows a trend that increases from 2012 and then remains almost constant, with ups and downs. It does not show a clear seasonality.
* For Temperature, we again observe an almost linear trend. The seasonality is clearly marked, decreasing in winter and increasing in summer.
* The Volume taken from the treatment plant had its maximum in 2012 and then decreased until 2017, when it stabilized and became almost homogeneous. It does not show a marked seasonality.
* Hydrometry has a very uniform trend, with ups and downs but very constant over the years. It also has a clear seasonal behavior, with higher values in the winter months and lower values in the summer months.
* Depth to Groundwater shifted features start with a wavy trend until 2015, when they stabilize and become constant. Their seasonal behavior is very clear: it is uniform in most years, dropping in the first months.

In [None]:
decomposition('Aquifer Petrignano')

**<font color="green">5-  LAKE BILANCINO:</font>**

* For the Rainfall features, we can observe a similar wavy trend-cycle for all of them, decreasing in late 2011 and then increasing. They all have a similar seasonal behavior, with peaks in the autumn/winter months and lows in the spring/summer months.
* Temperature has an almost linear trend between 14 and 16 degrees Celsius. The seasonality is clearly marked, decreasing in winter and increasing in summer.
* Lake level has an almost linear trend, with an exception in early 2012, when it drops sharply. There is an evident seasonality, with heavy falls from mid summer to mid autumn.
* Flow rate shifted has a wavy trend that varies over the years. As for its seasonal behavior, it has evident peaks in the early months of each year.

In [None]:
decomposition('Lake Bilancino')

**<font color="green">6-  RIVER ARNO:</font>**
* The Rainfall features have wavy trends that are very similar to each other. Note that the minimum again appears in late 2011, as it does for many other features. The seasonality is unchanged: peaks in the autumn/winter months and drops in the spring/summer months.
* Temperature has an almost linear trend, between 16 and 18 degrees Celsius. Its seasonality is clearly marked: it decreases in winter and increases in summer.
* Hydrometry also has a wavy trend, with the same minimum in late 2011. It has a clear seasonality, with maxima in summer and drops in winter.

In [None]:
decomposition('River Arno')

**<font color="green">7-  WATER SPRING AMIATA:</font>**

* In this case, the Rainfall features have a more linear trend, with a slight decrease over time. Their seasonality shows peaks in the winter months and drops in the summer months.
* Temperature has an almost linear trend, between 16 and 18 degrees Celsius. Its seasonality is clearly marked: it decreases in winter and increases in summer.
* The depth to groundwater features have an increasing, wavy trend with no discernible seasonality.
* As for temperature, the features have an almost linear trend with slight growth. Their seasonality is clearly marked: they decrease in winter and increase in summer.
* The Flow rate shifted features also have a wavy trend, with a minimum in 2018. They do not show a clear seasonality.

In [None]:
decomposition('Water Spring Amiata')

**<font color="green">8-  WATER SPRING LUPA:</font>**
* Once again, the Rainfall trend drops in 2012 and is wavy over time, but its seasonality is clear, with maximum values in the last months of the year.
* Flow rate shifted has a linearly increasing trend with a peculiar seasonal pattern.

In [None]:
decomposition('Water Spring Lupa')

**<font color="green">9-  WATER SPRING MADONNA DI CANNETO:</font>**

* The Rainfall feature in this case has an almost linear trend with no marked seasonality.
* The temperature trend is almost linear, with increasing values in the last years. Its seasonality is clearly marked: it decreases in winter and increases in summer.
* Finally, Flow rate shifted has a linear trend until 2016, when it becomes wavy and starts decreasing noticeably. A seasonal pattern begins to take shape from 2016, with higher values in the summer months.

In [None]:
decomposition('Water Spring Madonna di Canneto')

In [None]:
for i in datasets_1:
    j=datasets_1[i]
    for k in j.columns:
        res=sm.tsa.STL(j[k], period=52,seasonal=13).fit()
        j[f"{k}_seasonal"] = res.seasonal
        j[f"{k}_trend"] = res.trend
        j[f"{k}_resid"] = res.resid
    datasets_1[i] = j

# 8 Predictions

Prediction refers to the output of an algorithm after it has been trained on a historical dataset and applied to new data to forecast a particular outcome. The algorithm generates probable values for an unknown variable for each record in the new data, allowing the model builder to identify what that value will most likely be.

In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number, for instance a multi-dimensional entry (multivariate data), it is said to have several attributes or features. Supervised learning is when the data comes with additional attributes that we want to predict.

Our case is supervised learning, specifically regression analysis. We need two kinds of regression: multi-output regression (more than one feature to predict) and single-output regression (one feature to predict):

* **Multi output regressor:** Aquifer Auser, Aquifer Petrignano, Aquifer Doganella, Water Spring Amiata, Lake Bilancino.
* **Simple output regressor:** Aquifer Luco, Water Spring Madonna di Canneto, Water Spring Lupa, River Arno.
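
The difference between the two cases shows up in the shape of the target. The following is a minimal sketch on synthetic data (the arrays and the choice of `ExtraTreesRegressor` are assumptions made for illustration, not the notebook's datasets): a single-output model predicts one value per row, while a multi-output model predicts one value per row and per target column.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic data, used only to illustrate the two target shapes.
rng = np.random.default_rng(17)
X = rng.normal(size=(200, 4))
y_single = X[:, 0] + 0.1 * rng.normal(size=200)      # one target column
y_multi = np.column_stack([X[:, 0] + X[:, 1],        # two target columns
                           X[:, 2] - X[:, 3]])

single_model = ExtraTreesRegressor(n_estimators=50, random_state=17).fit(X, y_single)
multi_model = ExtraTreesRegressor(n_estimators=50, random_state=17).fit(X, y_multi)

single_pred = single_model.predict(X[:5])   # shape (5,)
multi_pred = multi_model.predict(X[:5])     # shape (5, 2)
print(single_pred.shape, multi_pred.shape)
```

Tree ensembles such as Extra Trees accept a two-dimensional target directly, whereas estimators without native multi-output support need a wrapper such as `MultiOutputRegressor`.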

To find the best **Multioutput regressor**, we use ***scikit-learn*** with different types of estimators. First, we look for the best parameters using a combination of *RepeatedKFold* and *RandomizedSearchCV*, applied to the entire dataset over the dates for which the features to forecast are available. Once we have the best parameters, we fit each estimator on X_train and y_train, predict on X_test, and compute the mean squared error and mean absolute error between the predicted y and y_test. The train and test datasets come from a *train_test_split* with a test_size of 0.25. The estimators used were:
1. ExtraTreesRegressor
2. KNeighborsRegressor
3. MultiTaskElasticNetCV
4. RandomForestRegressor
5. AdaBoostRegressor
6. XGBRegressor

**XGBRegressor** is not part of the scikit-learn package; it comes from **XGBoost**. **XGBoost** is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way.
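
The search-then-evaluate procedure can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not the notebook's exact function: the data, the small parameter grid and `n_iter=4` are made up for speed, and the search here runs on the training split only, which is a stricter variant than searching on the entire dataset.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, RepeatedKFold, train_test_split

# Synthetic two-target regression problem (illustration only).
rng = np.random.default_rng(17)
X = rng.normal(size=(300, 5))
Y = np.column_stack([X[:, 0] + X[:, 1], 2.0 * X[:, 2]]) + 0.1 * rng.normal(size=(300, 2))

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=17)

# Repeated 5-fold cross-validation, scored with the negated MAE.
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
param_grid = {"n_estimators": [10, 30], "min_samples_leaf": [1, 3, 7]}
search = RandomizedSearchCV(
    ExtraTreesRegressor(random_state=17), param_grid, n_iter=4,
    scoring="neg_mean_absolute_error", cv=cv, random_state=17,
)
search.fit(X_train, y_train)

# best_estimator_ is already refitted on X_train with the best parameters.
best_model = search.best_estimator_
y_pred = best_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(search.best_params_, mse, mae)
```

Keeping the test split out of the search means the reported errors come from rows the tuned model has never seen.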

As for the **Simpleoutput regressor**, we decided to use ***TPOT***, a powerful Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
The **TPOTRegressor** performs an intelligent search over machine learning pipelines that can contain supervised regression models, preprocessors, feature selection techniques, and any other estimator or transformer that follows the scikit-learn API. The TPOTRegressor will also search over the hyperparameters of all objects in the pipeline. By default, **TPOTRegressor** will search over a broad range of supervised regression models, transformers, and their hyperparameters.

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import MultiTaskLassoCV
from sklearn.linear_model import MultiTaskElasticNetCV
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import RidgeCV
import xgboost as xgb
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor

from sklearn.tree import DecisionTreeRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import train_test_split

ESTIMATORS = {
    "Extra trees": {ExtraTreesRegressor(random_state=17):{'max_features' : ['sqrt','log2',1.0],
                                           'min_samples_leaf' : [1, 2, 3, 7, 11],
                                           'n_estimators': [50, 100],  'oob_score': [True, False]}},
    "K-nn": {KNeighborsRegressor():{'n_neighbors' :list(range(1, 31))}},
    "ElasticNet": {MultiTaskElasticNetCV(random_state=17):{'n_alphas':[100],"max_iter":[500],'l1_ratio': np.arange(0.3, 1, 0.3)}},
    "RandomForestRegressor": {RandomForestRegressor(random_state=17):{'bootstrap': [True, False],
                             'max_features':[1.0, 'sqrt', 'log2'],  # 1.0 (all features) replaces 'auto', removed in recent scikit-learn
                             'max_depth': [50, 100, 150, None],
                             'n_estimators': [100,500,1000,1500]}},
    "MAdaB" : {MultiOutputRegressor(AdaBoostRegressor(random_state=17)):{'estimator__n_estimators':[50,100],
                                                                  'estimator__learning_rate':[0.3,0.6,1.0]}},
    "MXGB" : {MultiOutputRegressor(xgb.XGBRegressor(random_state=17)):{'estimator__n_estimators':[50,100],
                                                                  'estimator__learning_rate':[0.3,0.6,1.0]}},
    }


In [None]:
from sklearn.base import clone
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import RandomizedSearchCV

def Multiple_Regressor(x,y):
    # Keep only the dates for which both the features and the targets are available.
    X=x.loc[x.index.intersection(y.dropna().index)]
    Y=y.loc[y.index.intersection(x.index)].dropna()

    cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
    y_mse= {}
    y_mae= {}
    y_opt= {}
    X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.25, random_state=17)

    for name, estimator in ESTIMATORS.items():
        # Each ESTIMATORS entry maps one estimator instance to its parameter grid.
        base_estimator, param_grid = next(iter(estimator.items()))
        # The hyperparameter search runs on the full (X, Y) data, so the test split is not unseen.
        search = RandomizedSearchCV(base_estimator, param_grid,
                                    n_iter=20, scoring='neg_mean_absolute_error', n_jobs=-1, cv=cv, random_state=17)
        result = search.fit(X, Y)
        print(name)
        print('Best Score: %s' % result.best_score_)
        print('Best Hyperparameters: %s' % result.best_params_)
        # Fit a fresh copy with the best parameters on the training split, then score it on the test split.
        Optimized_estimator = clone(base_estimator).set_params(**result.best_params_)
        y_opt[name] = Optimized_estimator.fit(X_train, y_train)
        y_predict = Optimized_estimator.predict(X_test)
        y_mse[name] = mean_squared_error(y_test, y_predict)
        y_mae[name] = mean_absolute_error(y_test, y_predict)
    print('\n')
    for name, value in y_mse.items():
        # mean_squared_error returns the MSE (no square root is taken).
        print('MSE for ',name,' is ',value)
    print('\n')
    for name, value in y_mae.items():
        print('MAE for ',name,' is ',value)
        
    best_estimator= min(y_mse, key=y_mse.get)
    
    if 'MAdaB' in best_estimator or "MXGB" in best_estimator:
        fig, ax = plt.subplots(figsize=(25,10))
        barplot = ax.bar(x=range(len(y_opt[best_estimator].estimators_[0].feature_importances_)), 
                              height = y_opt[best_estimator].estimators_[0].feature_importances_,
                              tick_label = x.columns)
        ax.set_ylabel('Importances',fontsize=15)
        ax.set_title('Model '+ x.keys()[0] +': Features Importances',fontsize=20)
        ax.tick_params(axis='both', labelsize=15)
        plt.xticks(rotation=90)
        fig.tight_layout()
    else:
        fig, ax = plt.subplots(figsize=(25,10))
        barplot = ax.bar(x=range(len(y_opt[best_estimator].feature_importances_)), 
                              height = y_opt[best_estimator].feature_importances_, tick_label = x.columns)
        ax.set_ylabel('Importances',fontsize=15)
        ax.set_title('Model '+ x.keys()[0] +': Features Importances',fontsize=20)
        ax.tick_params(axis='both', labelsize=15)
        plt.xticks(rotation=90)
        fig.tight_layout()
        
    prediction = y_opt[best_estimator].predict(x)
    return pd.DataFrame(data=prediction,index=x.index,columns=Y.columns)

In [None]:
from tpot import TPOTRegressor

def Simple_Regressor(x,y):
    X=x.loc[x.index.intersection(y.dropna().index)]
    Y=y.loc[y.index.intersection(x.index)].dropna()
    tpot = TPOTRegressor(generations=5, population_size=80, verbosity=1, random_state=17,
                         scoring='r2', n_jobs=-1, early_stop = 5)
    X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.25, random_state=17)
    tpot.fit(X_train, y_train)
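    # Keep only the final estimator of the winning pipeline; any earlier pipeline steps are dropped.
    exctracted_best_model = tpot.fitted_pipeline_.steps[-1][1]
    exctracted_best_model.fit(X_train, y_train)
    y_predict = exctracted_best_model.predict(X_test)
    print('\n')
    # mean_squared_error returns the MSE (no square root is taken).
    print('MSE:', mean_squared_error(y_test, y_predict))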
    print('MAE:', mean_absolute_error(y_test, y_predict))
    print('Winning pipeline:', exctracted_best_model)
    try:
        x_ax = range(len(exctracted_best_model.feature_importances_))
        height_ax = exctracted_best_model.feature_importances_
    except AttributeError as error:
        x_ax = range(len(exctracted_best_model.coef_))
        height_ax = abs(exctracted_best_model.coef_)
    fig, ax = plt.subplots(figsize=(25,10))
    barplot = ax.bar(x=x_ax, height = height_ax ,
                              tick_label = x.columns)
    ax.set_ylabel('Importances',fontsize=15)
    ax.set_title('Model '+ x.keys()[0] +': Features Importances',fontsize=20)
    ax.tick_params(axis='both', labelsize=15)
    plt.xticks(rotation=90)
    fig.tight_layout()
    
    prediction = exctracted_best_model.predict(x)
    return pd.DataFrame(data=prediction,index=x.index,columns=Y.columns)

**<font color="green">1-   AQUIFER AUSER:</font>**

Applying the multiple regressor function to Aquifer Auser, we can observe that the **Extra Trees regressor** is the most suitable for this case, with an **MSE** of **<font color="magenta">0.08</font>** and a **MAE** of **<font color="magenta">0.13</font>**. The most important features for the regression were the depth to groundwater features, both shifted and non-shifted, as well as hydrometry. Volume features had less impact, and rainfall features had almost no impact on the regression.
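
The two error figures quoted for each waterbody come from `mean_squared_error` and `mean_absolute_error`. By default `mean_squared_error` returns the mean squared error without taking a square root. The sketch below uses plain Python and made-up numbers (the helper name `error_metrics` is hypothetical) to show how MSE, RMSE and MAE relate. When errors are smaller than one unit, the MSE can fall below the MAE, whereas the RMSE can never be smaller than the MAE.

```python
import math

def error_metrics(y_true, y_pred):
    """Return MSE, RMSE and MAE for two equally long sequences."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}

# Every prediction is off by exactly 0.5.
metrics = error_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 2.5, 3.5])
print(metrics)  # {'mse': 0.25, 'rmse': 0.5, 'mae': 0.5}
```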

In [None]:
Aq_Auser_Prediction = Multiple_Regressor(datasets_1["Aquifer Auser"],y_Aq_Auser)

In [None]:
for i in Aq_Auser_Prediction.columns:
    fig, ax = plt.subplots(figsize=(25,10))
    lineplot = sns.lineplot(data=Aq_Auser_Prediction[i],color = 'orangered',label='Predicted')
    lineplot = sns.lineplot(data=y_Aq_Auser[i],color = 'seagreen',label='Original')
    lineplot.set_title(label=i.upper(), fontdict={'fontsize':25}, pad=16)
    lineplot.set_xlabel('Date' , fontsize=20)
    lineplot.set_ylabel("Distance from ground level (m)" , fontsize=20)
    lineplot.tick_params(axis='both', labelsize=15)
    lineplot.legend(fontsize=20,loc='upper left')
    plt.show()

**<font color="green">2-   AQUIFER DOGANELLA:</font>**
For Doganella we used the multiple regressor function to forecast its 9 different depth to groundwater features. The best estimator in this case was also the **Extra Trees regressor**, with an **MSE** of **<font color="magenta">0.92</font>** and a **MAE** of **<font color="magenta">0.41</font>**. In terms of feature importance, the Depth to groundwater Pozzo 1 shifted trend clearly stands apart from the rest of the features. The other depth to groundwater shifted features and the volume features also contribute, especially through their trends.
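
With several targets, how feature importances are read depends on how the estimator handles multiple outputs. A minimal sketch on synthetic data (the data and the averaging step are assumptions for illustration): a native multi-output tree ensemble exposes a single `feature_importances_` vector shared by all targets, while `MultiOutputRegressor` fits one estimator per target, each with its own vector.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, ExtraTreesRegressor
from sklearn.multioutput import MultiOutputRegressor

# Synthetic data: target 0 depends on feature 0, target 1 on feature 1, feature 2 is noise.
rng = np.random.default_rng(17)
X = rng.normal(size=(200, 3))
Y = np.column_stack([3.0 * X[:, 0], 3.0 * X[:, 1]]) + 0.05 * rng.normal(size=(200, 2))

# Native multi-output model: one importance vector for all targets together.
native = ExtraTreesRegressor(n_estimators=50, random_state=17).fit(X, Y)
native_importances = native.feature_importances_

# Wrapped model: one fitted estimator, and one importance vector, per target.
wrapped = MultiOutputRegressor(AdaBoostRegressor(random_state=17)).fit(X, Y)
per_target = np.array([est.feature_importances_ for est in wrapped.estimators_])

# Averaging across targets is one possible summary (an assumption of this sketch).
wrapped_importances = per_target.mean(axis=0)
print(native_importances, per_target, wrapped_importances)
```

Plotting only `estimators_[0].feature_importances_` describes the first target alone, so a per-target view or an average is worth considering when the targets differ.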

In [None]:
Aq_Doganella_Prediction = Multiple_Regressor(datasets_1["Aquifer Doganella"],y_Aq_Doganella)

In [None]:
for i in Aq_Doganella_Prediction.columns:
    fig, ax = plt.subplots(figsize=(25,10))
    lineplot = sns.lineplot(data=Aq_Doganella_Prediction[i],color = 'orangered',label='Predicted')
    lineplot = sns.lineplot(data=y_Aq_Doganella[i],color = 'seagreen',label='Original')
    lineplot.set_title(label=i.upper(), fontdict={'fontsize':25}, pad=16)
    lineplot.set_xlabel('Date' , fontsize=20)
    lineplot.set_ylabel("Distance from ground level (m)" , fontsize=20)
    lineplot.tick_params(axis='both', labelsize=15)
    lineplot.legend(fontsize=20,loc='upper left')
    plt.show()

**<font color="green">3-   AQUIFER LUCO:</font>**
Aquifer Luco has only one feature to forecast: Depth to groundwater Podere Casetta. The winning model was **XGBRegressor**, with an **MSE** of **<font color="magenta">0.01</font>** and a **MAE** of **<font color="magenta">0.08</font>**. In this case the most important features are not the shifted ones but, by far, the Volume Pozzo 1 trend and Volume Pozzo 3 trend features.

In [None]:
Aq_Luco_Prediction = Simple_Regressor(datasets_1["Aquifer Luco"],y_Aq_Luco)

In [None]:
for i in Aq_Luco_Prediction.columns:
    fig, ax = plt.subplots(figsize=(25,10))
    lineplot = sns.lineplot(data=Aq_Luco_Prediction[i],color = 'orangered',label='Predicted')
    lineplot = sns.lineplot(data=y_Aq_Luco[i],color = 'seagreen',label='Original')
    lineplot.set_title(label=i.upper(), fontdict={'fontsize':25}, pad=16)
    lineplot.set_xlabel('Date' , fontsize=20)
    lineplot.set_ylabel("Distance from ground level (m)" , fontsize=20)
    lineplot.tick_params(axis='both', labelsize=15)
    lineplot.legend(fontsize=20,loc='upper left')
    plt.show()

**<font color="green">4-   AQUIFER PETRIGNANO:</font>**
For Aquifer Petrignano, the best model was again the **Extra Trees Regressor**, with an **MSE** of **<font color="magenta">0.09</font>** and a **MAE** of **<font color="magenta">0.20</font>**. By far the most important features were the shifted ones and their trends.

In [None]:
Aq_Petrignano_Prediction = Multiple_Regressor(datasets_1["Aquifer Petrignano"],y_Aq_Petrignano)

In [None]:
for i in Aq_Petrignano_Prediction.columns:
    fig, ax = plt.subplots(figsize=(25,10))
    lineplot = sns.lineplot(data=Aq_Petrignano_Prediction[i],color = 'orangered',label='Predicted')
    lineplot = sns.lineplot(data=y_Aq_Petrignano[i],color = 'seagreen',label='Original')
    lineplot.set_title(label=i.upper(), fontdict={'fontsize':25}, pad=16)
    lineplot.set_xlabel('Date' , fontsize=20)
    lineplot.set_ylabel("Distance from ground level (m)" , fontsize=20)
    lineplot.tick_params(axis='both', labelsize=15)
    lineplot.legend(fontsize=20,loc='upper left')
    plt.show()

**<font color="green">5-   LAKE BILANCINO:</font>**
Lake Bilancino has two features to forecast: its Lake Level and its Flow Rate. In this case, the best estimator was the **Random Forest regressor**, with an **MSE** of **<font color="magenta">4.39</font>** and a **MAE** of **<font color="magenta">0.95</font>**. For Lake Bilancino, the feature importances are more evenly distributed. While Lake Level shifted is the most important feature, the others also have a significant impact on the model.

In [None]:
Lake_Bilancino_Prediction = Multiple_Regressor(datasets_1["Lake Bilancino"],y_Lake_Bilancino)

In [None]:
for i in Lake_Bilancino_Prediction.columns:
    if "Lake" in i:
        unit= "Lake Level (m)"
    elif "Flow" in i:  # test the current column name i, not a leftover variable k
        unit= "Flow Rate (m3/s)"
    fig, ax = plt.subplots(figsize=(25,10))
    lineplot = sns.lineplot(data=Lake_Bilancino_Prediction[i],color = 'orangered',label='Predicted')
    lineplot = sns.lineplot(data=y_Lake_Bilancino[i],color = 'seagreen',label='Original')
    lineplot.set_title(label=i.upper(), fontdict={'fontsize':25}, pad=16)
    lineplot.set_xlabel('Date' , fontsize=20)
    lineplot.set_ylabel(unit , fontsize=20)
    lineplot.tick_params(axis='both', labelsize=15)
    lineplot.legend(fontsize=20,loc='upper left')
    plt.show()

**<font color="green">6-   RIVER ARNO:</font>**
River Arno has only one feature to forecast, Hydrometry Nave di Rosano. The winning model was **LassoLarsCV**, with an **MSE** of **<font color="magenta">0.15</font>** and a **MAE** of **<font color="magenta">0.29</font>**. The features that most affected this model were Rainfall Camaldoli trend, Rainfall S Savino trend and Temperature Firenze seasonal. We can clearly see how rainfall affects the hydrometry of the river.

In [None]:
River_Arno_Prediction = Simple_Regressor(datasets_1["River Arno"],y_River_Arno)

In [None]:
for i in River_Arno_Prediction.columns:
    if "Lake" in i:
        unit= "Lake Level (m)"
    elif "Flow" in i:  # test the current column name i, not a leftover variable k
        unit= "Flow Rate (m3/s)"
    else:
        unit='Value'
    fig, ax = plt.subplots(figsize=(25,10))
    lineplot = sns.lineplot(data=River_Arno_Prediction[i],color = 'orangered',label='Predicted')
    lineplot = sns.lineplot(data=y_River_Arno[i],color = 'seagreen',label='Original')
    lineplot.set_title(label=i.upper(), fontdict={'fontsize':25}, pad=16)
    lineplot.set_xlabel('Date' , fontsize=20)
    lineplot.set_ylabel(unit , fontsize=20)
    lineplot.tick_params(axis='both', labelsize=15)
    lineplot.legend(fontsize=20,loc='upper left')
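    # Limit the x axis to this prediction's own date range rather than a leftover variable j.
    plt.xlim(River_Arno_Prediction.index.min(),River_Arno_Prediction.index.max())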
    plt.show()

**<font color="green">7-   WATER SPRING AMIATA:</font>**
Once again, the **Extra Trees regressor** is the best estimator, with an **MSE** of **<font color="magenta">0.12</font>** and a **MAE** of **<font color="magenta">0.15</font>**. Amiata has four different Flow Rate features to forecast. Several features have an impact on the model: the Flow Rate Galleria Alta shifted trend is the most important, but the Rainfall Vetta Amiata trend, Depth to groundwater S Fiora 8 trend and Depth to groundwater S Fiora 11bis trend also influence the regression model.

In [None]:
Ws_Amiata_Prediction = Multiple_Regressor(datasets_1["Water Spring Amiata"],y_Ws_Amiata)

In [None]:
for i in Ws_Amiata_Prediction.columns:
    if "Lake" in i:
        unit= "Lake Level (m)"
    elif "Flow" in i:  # test the current column name i, not a leftover variable k
        unit= "Flow Rate (m3/s)"
    else:
        unit='Value'
    fig, ax = plt.subplots(figsize=(25,10))
    lineplot = sns.lineplot(data=Ws_Amiata_Prediction[i],color = 'orangered',label='Predicted')
    lineplot = sns.lineplot(data=y_Ws_Amiata[i],color = 'seagreen',label='Original')
    lineplot.set_title(label=i.upper(), fontdict={'fontsize':25}, pad=16)
    lineplot.set_xlabel('Date' , fontsize=20)
    lineplot.set_ylabel(unit , fontsize=20)
    lineplot.tick_params(axis='both', labelsize=15)
    lineplot.legend(fontsize=20,loc='upper left')
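    # Limit the x axis to this prediction's own date range rather than a leftover variable j.
    plt.xlim(Ws_Amiata_Prediction.index.min(),Ws_Amiata_Prediction.index.max())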
    plt.show()

**<font color="green">8-   WATER SPRING LUPA:</font>**
Water Spring Lupa has only one feature to forecast, its Flow Rate. The best estimator was **XGBRegressor**, with an **MSE** of **<font color="magenta">6.21</font>** and a **MAE** of **<font color="magenta">0.71</font>**. By far the most important feature was the Flow Rate Lupa shifted trend. This flow rate behaves in a peculiar way, and few other features are available for it.

In [None]:
Ws_Lupa_Prediction = Simple_Regressor(datasets_1["Water Spring Lupa"],y_Ws_Lupa)

In [None]:
for i in Ws_Lupa_Prediction.columns:
    if "Lake" in i:
        unit= "Lake Level (m)"
    elif "Flow" in i:  # test the current column name i, not a leftover variable k
        unit= "Flow Rate (m3/s)"
    else:
        unit='Value'
    fig, ax = plt.subplots(figsize=(25,10))
    lineplot = sns.lineplot(data=Ws_Lupa_Prediction[i],color = 'orangered',label='Predicted')
    lineplot = sns.lineplot(data=y_Ws_Lupa[i],color = 'seagreen',label='Original')
    lineplot.set_title(label=i.upper(), fontdict={'fontsize':25}, pad=16)
    lineplot.set_xlabel('Date' , fontsize=20)
    lineplot.set_ylabel(unit , fontsize=20)
    lineplot.tick_params(axis='both', labelsize=15)
    lineplot.legend(fontsize=20,loc='upper left')
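    # Limit the x axis to this prediction's own date range rather than a leftover variable j.
    plt.xlim(Ws_Lupa_Prediction.index.min(),Ws_Lupa_Prediction.index.max())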
    plt.show()

**<font color="green">9-   WATER SPRING MADONNA DI CANNETO:</font>**
Finally, we have Water Spring Madonna di Canneto. With one feature to forecast (its Flow Rate), the best model was **GradientBoostingRegressor**, with an **MSE** of **<font color="magenta">377.52</font>** and a **MAE** of **<font color="magenta">14.92</font>**. The Temperature Settefrati seasonal and trend components appear to be the most important features for the model. The other features had less impact, with the Flow Rate shifted trend leading among them.

In [None]:
Ws_Madonna_di_Canneto_Prediction = Simple_Regressor(datasets_1["Water Spring Madonna di Canneto"],y_Ws_Madonna_di_Canneto)

In [None]:
for i in Ws_Madonna_di_Canneto_Prediction.columns:
    if "Lake" in i:
        unit= "Lake Level (m)"
    elif "Flow" in i:  # test the current column name i, not a leftover variable k
        unit= "Flow Rate (m3/s)"
    else:
        unit='Value'
    fig, ax = plt.subplots(figsize=(25,10))
    lineplot = sns.lineplot(data=Ws_Madonna_di_Canneto_Prediction[i],color = 'orangered',label='Predicted')
    lineplot = sns.lineplot(data=y_Ws_Madonna_di_Canneto[i],color = 'seagreen',label='Original')
    lineplot.set_title(label=i.upper(), fontdict={'fontsize':25}, pad=16)
    lineplot.set_xlabel('Date' , fontsize=20)
    lineplot.set_ylabel(unit , fontsize=20)
    lineplot.tick_params(axis='both', labelsize=15)
    lineplot.legend(fontsize=20,loc='upper left')
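    # Limit the x axis to this prediction's own date range rather than a leftover variable j.
    plt.xlim(Ws_Madonna_di_Canneto_Prediction.index.min(),Ws_Madonna_di_Canneto_Prediction.index.max())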
    plt.show()

# 9 Final conclusions
After the whole process, we can infer that different features affect the waterbodies in various ways.

For the aquifers, the past history of the features to forecast (that is, the shifted ones) and the volume taken by the water plants were the most important features for the models. The lake behaved differently, with the features having a more uniform impact on the model. For the river, we could see a direct impact of rainfall and temperature on its hydrometry. Lastly, the water springs responded differently to the features: Amiata and Lupa depended mostly on their shifted features, while Madonna di Canneto depended mostly on temperature. In most cases the trend component of the features was the most influential part of the model, while the seasonal component had less influence.

Finally, for the multi-output regressors, the Extra Trees and Random Forest regressors were the best suited under these specific parameters, so to reduce execution time we could drop the other models, such as AdaBoost and XGB, which slow down the process.