# Intro

This competition uses nine different datasets, completely independent and not linked to each other. Each dataset can represent a different kind of waterbody. The Acea Group deals with four different types of waterbodies: water springs (three datasets), lakes (one dataset), rivers (one dataset) and aquifers (four datasets). 

The goal is to create four mathematical models, one for each category of waterbody (aquifers, water springs, river, lake), to predict the amount of water in each unique waterbody for a set time interval.  
The predictive power of the models shall be evaluated with both Mean Absolute Error (MAE) and Root Mean Square Error (RMSE).
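
As a quick reminder of what the two scores measure, here is a minimal sketch of both metrics with NumPy. The helper names `mae` and `rmse` and the toy values are illustrative assumptions and are not part of the competition material.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of the errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root Mean Square Error: like MAE, but large errors weigh more."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# toy values (hypothetical): three perfect predictions and one error of 1.0
y_true = [25.0, 25.5, 26.0, 26.5]
y_pred = [25.0, 25.5, 26.0, 27.5]
print("MAE: %.2f, RMSE: %.2f" % (mae(y_true, y_pred), rmse(y_true, y_pred)))
```

Because RMSE squares the errors before averaging, a single large miss raises RMSE more than MAE, which is why it is useful to report both.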

A typical application scenario is to predict low or high levels of water availability of a waterbody in advance, in order to start remediation actions that reduce water consumption, such as communications to citizens or water rationing, as soon as possible.


### Outcome of this notebook
This notebook provides extensive data cleaning and exploratory data analysis on all the variables of the 9 waterbodies. 
It also provides a mathematical model forecasting water availability, in terms of groundwater depth, for **aquifer “Petrignano”**. The time interval is defined as **days**. Due to lack of time, neither this model nor other models were explored and applied to the other waterbodies.
This notebook contains two parts: 

1.	**Explore and Clean Data**: All individual waterbodies’ variables are analyzed in depth in order to clean them of missing or erroneous values. Historical time intervals with missing target variables or insufficient independent variables have been removed. Moreover, the variable “depth at Podere Casetta” had missing values for the latest period (2020). Since it is a target variable for “Aquifer Luco”, I preferred not to remove this time interval, in order to later create a model on the most recent data. So, I used an autoregressive model to predict the variable in 2020 based on historical values of the same variable. Similarly, for water spring “Madonna di Canneto”, the rainfall “Settefrati”, an important independent variable, had missing values for the entire 2019-2020 period. Removing this variable might have compromised the utility of the model; therefore, I used machine learning to reconstruct it from other variables. 
2.	**Machine Learning for Prediction**: The objective of this part is to produce prediction models by relying on cleaned data from the previous part and by using a standard process for machine learning on time series data. The process considers training (to cross validate and choose parameters for each of 3 different algorithms), validation (to select the best algorithm) and test data (to achieve the final scores). Because of lack of time, only one waterbody was modeled: aquifer “Petrignano”. The achieved MSE for one of the two target variables (`depth_to_groundwater_p24`) was 0.14 for a 1-day prediction, rising to 0.37 for a 30-day prediction, while the MAE went from 0.10 to 0.30 over the same prediction intervals. The scores were obtained by executing the model on *previously unseen* test data from June 2019 to June 2020. Considering that the variable ranged from 24.5 to 27.5 in the same period, the mean error scores show good prediction performance.
Future work should apply the same process to other waterbodies, so that each different kind/category of waterbody (Aquifer, Water Spring, River and Lake) has its own model, each applicable to the single waterbody (Auser, Amiata, Petrignano, Doganella, Luco, Madonna di Canneto, Lupa, Arno, Bilancino).
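
The training/validation/test process for time series relies on splitting the data chronologically, so that the model is always evaluated on dates later than the ones it was fitted on. Below is a minimal sketch of such a split. The helper `chronological_split` and the cut-off dates are hypothetical and only illustrate the idea (the notebook itself only states that the test data span June 2019 to June 2020).

```python
import pandas as pd

def chronological_split(frame, valid_start, test_start, date_col="date"):
    """Split a time-ordered frame into train / validation / test by date."""
    train = frame[frame[date_col] < valid_start]
    valid = frame[(frame[date_col] >= valid_start) & (frame[date_col] < test_start)]
    test = frame[frame[date_col] >= test_start]
    return train, valid, test

# toy daily series (hypothetical values)
toy = pd.DataFrame({
    "date": pd.date_range("2018-01-01", periods=10, freq="D"),
    "y": range(10),
})
train, valid, test = chronological_split(
    toy, pd.Timestamp("2018-01-07"), pd.Timestamp("2018-01-09")
)
print(len(train), len(valid), len(test))
```

No shuffling is involved: shuffling rows before splitting would leak future information into the training set.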


## Load data and libraries

In [None]:
#!pip install odfpy
#!pip install pystan==2.19.1.1
#!pip install fbprophet

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from datetime import datetime
from scipy import stats
%matplotlib inline


data_descr = pd.read_excel("/kaggle/input/acea-water-prediction/datasets_description.ods", engine="odf")

### There is no output information for databases "Madonna di Canneto" and "Lupa". So, let's add it manually. 
### Moreover, let's transform the "Output" into a list of lower-case target variables to avoid problems with case sensitivity later on.

# assign through .loc: chained indexing (df[col][mask] = ...) may silently write to a copy
data_descr.loc[data_descr["Database"]=="Water_Spring_Lupa", "Output"] = "Flow_Rate_Lupa"
data_descr.loc[data_descr["Database"]=="Water_Spring_Madonna_di_Canneto", "Output"] = "Flow_Rate_Madonna_di_Canneto"


data_descr["Output"] = data_descr["Output"].apply(lambda x: x.lower())
data_descr["Output"] = data_descr["Output"].apply(lambda x: x.replace(" ","").split(","))

data_descr = data_descr.reindex(columns = ['type','name']+data_descr.columns.tolist() )
data_descr['type']=""
data_descr['name']=""

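for ii, data in data_descr.iterrows(): 
    name = data["Database"]
    # write through .loc: rows yielded by iterrows() can be copies, so assigning to `data` is not guaranteed to update the frame
    if name.startswith("Water"):
        data_descr.loc[ii, "type"] = "Water_Spring"
        data_descr.loc[ii, "name"] = name.replace("Water_Spring_","").strip()
    else:
        w_type, w_name = name.strip().split("_")
        data_descr.loc[ii, "type"] = w_type
        data_descr.loc[ii, "name"] = w_name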
    
data_descr.drop(columns=["Database"],inplace=True)
data_descr

Let's check the database information more closely by translating it into a human-readable format.

In [None]:
def print_description(db,description,output):
    """
    Print each row of the database in human readable format.
    """
    print('\033[1m%s\033[0m\n'%db)
    print('%s\n'%description)
    print('Output:\n')
    for param in str(output).split(","):
        print("\u2022 %s\n"%param)

for ind, series in data_descr.iterrows():
    args = list(series)
    print_description(args[0],args[1],args[2])

Now that we have the name of the datasets, let's use the names to load the datasets into nested dictionaries in order to preserve the following data hierarchy:

* 1: Waterbodies 
    * 2: Waterbody Type (["Aquifer","Water_Spring","River","Lake"]) 
        * 3: Waterbody Name

When we load the data, let's transform "Date" columns into datetime so that it is easier to process it further
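
Before loading the real files, here is a small self-contained sketch of the two ideas above: the nested dictionary (type, then name) and the day-first date parsing. The in-memory CSV, the waterbody name `"Toy"` and the column `Rainfall_X` are made up for illustration.

```python
import io
import pandas as pd

# a tiny in-memory CSV standing in for one dataset file (hypothetical content)
raw_csv = io.StringIO("Date,Rainfall_X\n01/02/2020,1.5\n02/02/2020,0.0\n")

container = {}  # level 1: all waterbodies
frame = pd.read_csv(raw_csv)
frame.columns = [c.lower() for c in frame.columns]             # lower-case variable names
frame["date"] = pd.to_datetime(frame["date"], dayfirst=True)   # 01/02/2020 is 1 February

# level 2: waterbody type, level 3: waterbody name
container.setdefault("Aquifer", {})["Toy"] = frame

print(container["Aquifer"]["Toy"].dtypes)
```

`dict.setdefault` creates the inner dictionary the first time a waterbody type is seen and reuses it afterwards, which avoids an explicit `if key not in dict` check.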

In [None]:
# extract waterbody types, names, data, and fill the data container
df = {} # initialize data container
for ii, info in data_descr.iterrows():
    waterbody_type = info["type"]
    waterbody_name = info["name"]
    df.setdefault(waterbody_type, {})[waterbody_name] = pd.read_csv('/kaggle/input/acea-water-prediction/%s_%s.csv'%(waterbody_type,waterbody_name))
    df[waterbody_type][waterbody_name].columns = [x.lower() for x in df[waterbody_type][waterbody_name].columns] # transform variable names to lower case
    df[waterbody_type][waterbody_name]["date"] = pd.to_datetime(df[waterbody_type][waterbody_name]["date"], dayfirst=True)
    
print("Example of data structure:")    
print("df:\t\t"+str(df.keys()))   
print("df['Aquifer']:\t"+str(df["Aquifer"].keys()))   

Let's create utility functions and variables useful later on:
- a utility function `getWaterbody()` that returns the waterbody type and waterbody name that a given variable belongs to
- a utility function `getAllVariables()` that returns all variables of a given type (e.g. "temperature")
- a utility function `getIndependentVariables()` that for a given waterbody returns the independent variables including date
- a utility function `getTargetVariables()` that for a given waterbody returns the target variables including date
- a variable `global_targets` that contains all the target variable types
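
The idea behind `global_targets` is to keep only the first word of every target name, so that any column starting with one of these prefixes is recognised as a target-type variable. A minimal sketch of that logic on a toy list follows. The lists below are illustrative (the column `temperature_settefrati` in particular is a made-up name), while the real values come from `data_descr`.

```python
# toy "Output" lists, one per waterbody (illustrative subset)
outputs = [
    ["depth_to_groundwater_p24"],
    ["flow_rate_lupa"],
    ["flow_rate_madonna_di_canneto"],
]

# keep the first word of each target name, without duplicates
prefixes = sorted({name.split("_")[0] for targets in outputs for name in targets})

# columns of a toy dataset; "temperature_settefrati" is a hypothetical name
columns = ["date", "rainfall_settefrati", "temperature_settefrati",
           "flow_rate_madonna_di_canneto"]

# str.startswith accepts a tuple of prefixes
independent = [col for col in columns if not col.startswith(tuple(prefixes))]
print(prefixes, independent)
```

Note that `date` is kept among the independent variables, since it does not start with any target prefix.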

In [None]:
def getWaterbody(var_name):
    # If more waterbodies have the same variable, the first waterbody is returned
    # E.g.: getWaterbody("temperature_firenze")
    for ii,info in data_descr.iterrows():
        if var_name in df[info["type"]][info["name"]].columns:
            return(info["type"],info["name"])
    raise ValueError('There is no variable "%s".'%var_name)

def getAllVariables(var_type):
    # E.g.: getAllVariables("temperature")
    variables = pd.DataFrame(columns=["type","name","variable"])
    for ii, info in data_descr.iterrows():
        w_type = info["type"]
        w_name = info["name"]
        w_vars = list(df[w_type][w_name].columns)
        w_vars = [var for var in w_vars if var.startswith(var_type)]
        variables_ii = pd.DataFrame({"type":w_type,"name":w_name,"variable":w_vars})
        variables = pd.concat([variables,variables_ii],axis=0,sort=False)   
    variables.reset_index(drop=True,inplace=True)
    return variables

def getIndependentVariables(w_type,w_name,var_start_name=None):
    var_names = []
    for var_name in df[w_type][w_name].columns:
        if not var_name.startswith(tuple(global_targets)):
            if var_start_name:
                if var_name.startswith(var_start_name):
                    var_names.append(var_name)                    
            else:
                var_names.append(var_name)
    return var_names 

def getTargetVariables(w_type,w_name):
    var_names = []
    data_sub = data_descr[ (data_descr["type"]==w_type) & (data_descr["name"]==w_name)].copy()
    return list(data_sub["Output"])[0]

# In order to get the global targets, let's identify the variables used as targets

temp = list(data_descr["Output"].values)
global_targets = []
for tt in temp:
    if type(tt)==list:
        global_targets = global_targets + tt
    else:
        global_targets.append(tt)        
global_targets = list(set([x.split("_")[0] for x in global_targets]))

### Clean-up for dates 
- Remove missing dates
- Reorder datasets by date
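
As a small illustration of the two steps, the sketch below applies them to a toy frame with two missing dates and unordered rows. The frame and its `flow` column are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-03", None, "2020-01-01", None, "2020-01-02"]),
    "flow": [3.0, None, 1.0, None, 2.0],
})

# missing dates (NaT) count as one repeated value, so unique < total
n_unique = toy["date"].nunique(dropna=False)
n_total = len(toy)
print("%d unique over %d total date values" % (n_unique, n_total))

# drop rows without a date, then order chronologically
cleaned = toy.dropna(subset=["date"]).sort_values("date").reset_index(drop=True)
print(cleaned)
```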

In [None]:
for ii, info in data_descr.iterrows():
    w_type =  info["type"]
    w_name =  info["name"]
    n_unique = len(df[w_type][w_name]["date"].unique())
    n_tot = len(df[w_type][w_name]["date"])
    if n_unique!= n_tot:
        print("%s %s:\t%d unique over %d total date values"%(w_type,w_name,n_unique,n_tot))

Something is off with the **dates** of water spring "Madonna_di_Canneto": it has fewer unique dates than rows.

Let's remove duplicates and/or missing values

In [None]:
w_type = "Water_Spring"
w_name = "Madonna_di_Canneto"

df_sub = df[w_type][w_name]
df_sub["date"][df_sub["date"].duplicated()]
df_sub.drop(df_sub[df_sub["date"].isnull() ].index,inplace=True)


n_unique = len(df[w_type][w_name]["date"].unique())
n_tot = len(df[w_type][w_name]["date"])
print("%s %s:\t%d unique over %d total date values"%(w_type,w_name,n_unique,n_tot))

Let's sort datasets by dates

In [None]:
for ii, info in data_descr.iterrows():
    w_type = info["type"]
    w_name = info["name"]
    df[w_type][w_name].sort_values(by="date",inplace=True)
    df[w_type][w_name].reset_index(inplace=True,drop=True)

# Explore and Clean Data
So now that we have organized the datasets, let's look at the input and output variables for each waterbody.
The objective of this section is to explore each dataset and achieve the following results:
- Correct missing and faulty values
- Remove variables that are redundant or have too few samples for the prediction task

Specifically, we will see that: 

1. Some target variables not requested as an output are present in some datasets. So, let's remove them.
2. Some target variables started to be recorded only recently, while the dataset goes further into the past. Since a target variable is necessary, let's focus our attention on the most recent data, starting from the first recorded value of the target (e.g. depth to groundwater). This applies, for example, to:
    - "Depth" variables for "Aquifer": "Doganella", "Auser","Luco"
    - All "Flow" rate variables for "Water_Spring"
3. For a given waterbody, Temperature and Rainfall at *some locations* are sporadically missing, so we can take the location with the most values and fill its missing values with values from the location of the same waterbody that has the highest correlation with it, as long as that correlation is higher than 0.90. Examples: 
    - "Aquifer Doganella": Temperature "Monteporzio" "Velletri" 
    - "Aquifer Auser": Rainfall "Piaggione" "Borgo a Mozzano"
4. For a given waterbody, Temperature and Rainfall at *all locations* are sometimes missing at the same time. In this case we replace the missing part of the temperature and rainfall series with values from the most correlated temperature/rainfall sensors of other waterbodies. This applies to: 
    - "Water_Spring Madonna_di_Canneto"
    - Temperature for "Aquifer Auser"
    - Temperature for "Aquifer Petrignano"
5. When an independent variable (e.g. temperature) is fully available at one location but missing in others, let's drop the others to avoid multicollinearity, as long as the correlation between the two is higher than 0.90. 
    - "Aquifer Auser": leave only temperature "Lucca_orto_botanico"

## General Cleaning

Let's remove variables that are of the same type as the commonly used output variables but are not requested as target variables.

In [None]:
print("Target variables removed:\n")
for ii, data in data_descr.iterrows():
    w_type = data["type"]
    w_name = data["name"]
    targets = data["Output"] # list of lower-case target variable names
    columns = []
    for var_name in df[w_type][w_name].columns:
        if var_name.startswith(tuple(global_targets)) and (var_name not in targets):
            print("\t\t"+var_name)
            columns.append(var_name)
    df[w_type][w_name].drop(columns=columns,inplace=True)

Let's see the impact of missing values on the datasets. We create a function `plotMissingValues()` that helps to inspect missing values.

In [None]:
import seaborn as sns
from matplotlib.colors import LinearSegmentedColormap



def plotMissingValues(waterbody_type,waterbody_name=None, ax=None):
    # setup colors 
    myColors = ((0.0, 0.8, 0.0, 1.0),(0.8, 0.0, 0.0, 1.0)) # green and red
    cmap = LinearSegmentedColormap.from_list('Custom', myColors, len(myColors))

    if waterbody_name:
        nrows = 1
        waterbody_names = [waterbody_name]
    else:    
        waterbody_names = df[waterbody_type].keys()
        nrows = len(waterbody_names)
    
    if ax is None: # `not ax` raises an error when ax is an array of several axes
        fig, ax = plt.subplots(nrows=nrows, ncols=1,figsize=(15,5*nrows))
        
    ii = 0
    nintervals = 20
    resol = 30 # number of parts per interval (sets the resolution of each missing-value mark)  
    for waterbody_name in waterbody_names:
        axi = ax[ii] if type(ax)==np.ndarray else ax #  ensure the case of only one axes is dealt with
        data = df[waterbody_type][waterbody_name].T.isnull()
        ndata = data.shape[1]
        for irow, row in data.iterrows(): # widen each missing-value mark so that sporadic nulls remain visible
            data.loc[irow,:] = np.convolve([True]*int(ndata/nintervals/resol), row.values,"same")
        sns.heatmap(data,cmap=cmap,cbar=False,ax=axi)
        axi.set_title(waterbody_type + " " + waterbody_name)
        axi.set_yticks(range(df[waterbody_type][waterbody_name].shape[1])) # <--- set the ticks first
        axi.set_yticklabels(df[waterbody_type][waterbody_name].columns,va="top");
        x_ind = np.linspace(0,df[waterbody_type][waterbody_name].shape[0]-1,nintervals+1,dtype=int) # 20 intervals
        print()
        axi.set_xticks(x_ind)
        axi.set_xticklabels(df[waterbody_type][waterbody_name]["date"].dt.date.iloc[x_ind],rotation=45);
        axi.set_xlabel("")
        axi.grid(True,color="white")
        ii = ii +1
    
    if type(ax)==np.ndarray:
        fig.tight_layout(); #fig.suptitle(waterbody_type, fontsize=16,y=1.00);
    
plotMissingValues("Aquifer","Doganella");

We clearly see that target variables have no values for some historical dates (in the example above, "*depth_to_groundwater*" in *Aquifer Doganella* has no values before 2013). Since we cannot reliably reconstruct target variables for these dates, we will not use such dates for building prediction models. 

Therefore, for each waterbody, let's keep only the most recent data, starting from the first date on which at least one target variable is available. Let's apply this filter to all the waterbodies' data. 
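
The filter can be sketched on a toy frame as follows. The column names and values are hypothetical, and the sketch assumes that at least one target value exists (otherwise `np.argmax` would return 0 and nothing would be trimmed).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=5, freq="D"),
    "rainfall_x": [1.0, 0.0, 2.0, 0.0, 3.0],
    "depth_to_groundwater_a": [np.nan, np.nan, -25.1, np.nan, -25.3],
    "depth_to_groundwater_b": [np.nan, np.nan, np.nan, -30.0, -30.2],
})
targets = ["depth_to_groundwater_a", "depth_to_groundwater_b"]

# True for rows where at least one target is recorded
has_target = toy[targets].notna().any(axis=1)

# position of the first True value
first_idx = int(np.argmax(has_target.to_numpy()))

# keep everything from that row onwards
trimmed = toy.iloc[first_idx:].reset_index(drop=True)
print(trimmed)
```

Rows after the starting point may still contain missing targets (as in the fourth row above). Only the leading block without any target is removed.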

In [None]:
for ii, info in data_descr.iterrows():
    w_type = info["type"]
    w_name = info["name"]
    columns = list(df[w_type][w_name].columns)
    columns = [col in info["Output"] for col in columns]
    #columns = df[w_type][w_name].columns[columns] # columns of targets
    indstart= np.where((~df[w_type][w_name].iloc[:,columns].isnull()).any(axis=1))[0][0]
    df[w_type][w_name] = df[w_type][w_name].iloc[indstart:]
    df[w_type][w_name].reset_index(drop=True,inplace=True)

print("Removed historical dates with no target variables. Example of results on water springs.") 
plotMissingValues("Water_Spring")

Let's check correlations between input variables to identify possible multicollinearity.

In [None]:
def plotCorrMatrix(waterbody_type,waterbody_name,var_names=None,ax=None,**kwargs):
    df_sub = df[waterbody_type][waterbody_name].copy(deep=True)
    if var_names is None:
        # use every column except the date (column names were lower-cased at load time)
        colnames = [col for col in df_sub.columns if col != "date"]
    else:
        colnames = []
        for name in var_names:
            colnames = colnames+list(df_sub.columns[np.where(df_sub.columns.str.startswith(name))[0]])
    nvars = len(colnames)
    if nvars<2: # if not enough variables to correlate
        print("Not enough variables to print")
        return None
    if ax is None:
        plt.figure(num=None, figsize=(nvars, nvars )) #, dpi=80, facecolor='w', edgecolor='k')
        ax = plt.gca()
        
    corrMatrix = df_sub[colnames].corr(method="pearson")
    corrMatrix = corrMatrix.where(np.tril(np.ones(corrMatrix.shape),-1).astype(bool)) # np.bool was removed from recent NumPy versions
    corrMatrix.drop(corrMatrix.columns[-1],axis=1,inplace=True)
    corrMatrix.drop(corrMatrix.index[0],axis=0,inplace=True)
    sns.heatmap(corrMatrix, annot=True,ax=ax,**kwargs)
    plt.setp(ax.get_xticklabels(), rotation=70);

plotCorrMatrix("Aquifer","Auser",["temperature","rainfall"],vmin=-1, vmax=1,cmap="coolwarm");

Variables of the same type, for example `temperature`, are strongly correlated with each other. 
In the presence of multicollinearity, the predictive power diminishes. Therefore, it is good practice to remove input variables that are too strongly correlated with each other.
On the other hand, such correlated variables can be used to reconstruct each other's missing values.
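
A common way to act on this observation is to scan the upper triangle of the correlation matrix and drop one variable from each pair above a threshold. The sketch below does this on synthetic data. The column names, the random data and the 0.90 threshold applied here are assumptions for illustration, not the cleaning actually performed later in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(15, 8, size=500)  # synthetic "temperature" signal
toy = pd.DataFrame({
    "temperature_a": base,
    "temperature_b": base + rng.normal(0, 0.5, size=500),  # nearly a copy of a
    "rainfall_c": rng.gamma(1.0, 5.0, size=500),            # unrelated variable
})

corr = toy.corr().abs()
# keep only the strict upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# drop the second variable of every pair correlated above the threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.90).any()]
reduced = toy.drop(columns=to_drop)
print("dropped:", to_drop)
```

Which variable of a pair to keep is a design choice. In this notebook the preference goes to the variable with fewer missing values.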

## Temperature
Now let's look at the temperature variables for all the waterbodies. 
The main purpose is exploration and data-cleaning.

In [None]:
def plot_WaterbodyVariable(waterbody_type,variable_name,waterbody_name=None,date_interval=None,**kwargs):
    """
    date_interval: tuple of start and end dates with format: '%Y-%m-%d', e.g.: ("2018-01-29", "2020-02-13")
    """
    
    nrows = len(df[waterbody_type])
    if waterbody_name:
        nrows = 1
        
    fig, ax = plt.subplots(nrows=nrows, ncols=1,figsize=(14,nrows*5))
    
    df_type = df[waterbody_type].copy()
    if waterbody_name:
        df_type = { waterbody_name: df_type[waterbody_name] }

    for ii, waterbody_name in enumerate(df_type):
        df_sub = df_type[waterbody_name]
        colnames = list(df_sub.columns[np.where(df_sub.columns.str.startswith(variable_name))[0]])
        if len(colnames)==0: # if there is no variable, just go to next dataframe
            continue
        colnames.append("date")
        df_sub = df_sub[colnames]
        if date_interval is not None:
            date_start,date_end = date_interval
            date_start =  datetime.strptime(date_start, '%Y-%m-%d').date()
            date_end =  datetime.strptime(date_end, '%Y-%m-%d').date()
            df_sub = df_sub[(df_sub["date"].dt.date>= date_start) & (df_sub["date"].dt.date<= date_end)]
        axi = ax[ii] if type(ax)==np.ndarray else ax #  ensure the case of only one axes is dealt with
        df_sub.plot(x="date",ax=axi,grid=True,**kwargs)
        axi.set_title(waterbody_type+" "+waterbody_name); axi.set_xlabel(""); 
        #axi.set_xticks(df_sub["Date"].index)
        #axi.set_xticklabels(df_sub["Date"].dt.date,rotation=45);
        df_sub["date"].dt.date
        plt.setp(axi.get_xticklabels(), rotation=60)
        axi.legend(loc = "center left", bbox_to_anchor = (1.0, 1.0))
    fig.tight_layout()


date_start = "2019-01-01"  # required format: '%Y-%m-%d'
date_end = "2019-12-31"
plot_WaterbodyVariable("Aquifer","temperature","Petrignano",date_interval=(date_start,date_end),marker=".",linestyle="-",alpha=0.5)

In [None]:
def plot_VarDistribution(waterbody_type,waterbody_name,variable_name,bins=50,stacked=True):

    df_sub = df[waterbody_type][waterbody_name].copy()

    colnames = list(df_sub.columns[np.where(df_sub.columns.str.startswith(variable_name))[0]])
    df_sub = df_sub[colnames]

    nrows = df_sub.shape[1]
    if stacked:
        fig, axs = plt.subplots(nrows=nrows, ncols=1,figsize=(20,nrows*5),sharex=False)
    else:
        fig = plt.figure(figsize=(20,5))
        ax = plt.gca()

    ii = 0
    for name, series in df_sub.items(): # iteritems() was removed in pandas 2.0
        if stacked:
            ax = axs[ii] if type(axs)==np.ndarray else axs
        series.plot.hist(bins=bins,grid=True,alpha=0.5,ax=ax,legend=True)            
        ax.set_title(waterbody_type+" "+waterbody_name+" "+name)
        ii= ii +1
    plt.setp(ax.get_xticklabels(), rotation=60)
    fig.tight_layout()    


plot_VarDistribution("Aquifer","Luco","temperature",stacked=False)

Some **temperature** sensors look faulty because they report an abnormal number of 0 values. So let's identify them and set their 0 values to 'missing'. 

But not all the "temperature" variables need this treatment.
As a discriminator, we compare the count of the bin containing 0 with the mean count of its two adjacent bins: if the difference exceeds the margin of error (at 95% confidence), the zeros are considered abnormal.
We use 50 bins, but other references can be used as well (e.g. nbins = 1% of number of samples)
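
To make the rule concrete, here is a self-contained sketch of the same check on synthetic data. The function name `zero_bin_is_abnormal`, the guard for a zero bin lying at the edge of the histogram, and the synthetic series are assumptions added for illustration. The threshold of 20 counts mirrors the one used in the cell below.

```python
from statistics import NormalDist
import numpy as np

def zero_bin_is_abnormal(values, n_bins=50, conf_level=0.95, min_count=20):
    """Return True if the histogram bin containing 0 is suspiciously full."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]
    counts, edges = np.histogram(values, bins=n_bins)
    ind0 = int(np.digitize(0.0, edges)) - 1           # index of the bin containing 0
    if ind0 < 1 or ind0 > len(counts) - 2:            # assumption: skip if 0 has no two neighbours
        return False
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)  # about 1.96 for 95%
    diffs = np.diff(counts)                           # differences between adjacent bins
    moe = np.ceil(z * np.std(diffs) / np.sqrt(len(diffs)))  # margin of error
    expected = (counts[ind0 - 1] + counts[ind0 + 1]) / 2
    within = (expected - moe) <= counts[ind0] <= (expected + moe)
    return bool(not within and counts[ind0] >= min_count)

rng = np.random.default_rng(1)
clean = rng.normal(15, 7, size=3000)               # plausible temperatures
faulty = np.concatenate([clean, np.zeros(300)])    # sensor reporting spurious zeros
print(zero_bin_is_abnormal(faulty))
```

The spurious zeros create a spike that stands far above its neighbouring bins, which is what the margin of error is meant to detect.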

In [None]:
import scipy

conf_lev = 0.95
zscore = scipy.stats.norm.ppf( 1-(1-conf_lev)/2) 


for ii, w_type in enumerate(df):
    for jj, w_name in enumerate(df[w_type]):
        df_sub = df[w_type][w_name]#.copy()
        for name, series in df_sub.items(): # iteritems() was removed in pandas 2.0
            if name.startswith('temperature'):
                n,bins = np.histogram(series.dropna(),bins=50) #n, bins = get_hist(axs[ii])
                ind0 = np.digitize(0,bins)-1 # index of the bin containing 0
                bins_diff = np.diff(n) # difference among adjacent bins
                moe = np.ceil(zscore*np.std(bins_diff)/np.sqrt(len(bins_diff))) # margin of error
                expected = (n[ind0-1]+n[ind0+1])/2
                check_ok = ((n[ind0]>=(expected-moe)) & (n[ind0]<=(expected+moe)) ) or (n[ind0]<20)
                if not check_ok: # if the number of zero values is abnormal, transform all 0 to nan 
                    df_sub.loc[series==0.00, name] = np.nan # write through the DataFrame so the change is not lost on a copy
                    print("%s %s: removing %d values for %s as they are not in the confidence interval." %(w_type,w_name,n[ind0],name.replace("temperature_","sensor ")))
            else:
                continue

For a given waterbody, temperature values at *some locations* are sporadically missing, so we can take the location with the most values and fill its missing values with values of similar variables near the same waterbody. 

We use the function `fixMissingValues()` to fix missing values of one series, with another series. 

We use the function `getGlobalCorrList()` to get the list of correlation values between variables of the same type.
Let's quickly look at what these functions do, before using them to clean up the data.

We use the function `cleanAndGetVariables()` to find the variable with the lowest percentage of missing values in a waterbody, and we replace its missing values with values from the most correlated variable in the SAME waterbody if the correlation is above a certain threshold, in this case 0.90.
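
The "normalization and substitution" step can be sketched in a few vectorized lines: the reference series is rescaled to the mean and standard deviation of the target series, and only the gaps of the target are filled. The helper `fill_from_reference` and the toy series are hypothetical. The notebook's own implementation is `fixMissingValues()` below.

```python
import numpy as np
import pandas as pd

def fill_from_reference(target, reference):
    """Fill NaNs in `target` with `reference` rescaled to target's mean and std."""
    rescaled = (reference - reference.mean()) * target.std() / reference.std() + target.mean()
    return target.fillna(rescaled)  # aligned on the shared date index

dates = pd.date_range("2020-01-01", periods=6, freq="D")
target = pd.Series([10.0, 12.0, np.nan, 16.0, np.nan, 20.0], index=dates)
reference = pd.Series([5.0, 6.0, 7.0, 8.0, 9.0, 10.0], index=dates)

filled = fill_from_reference(target, reference)
print(filled)
```

Rescaling matters because two sensors can be strongly correlated and still differ in level and spread, for instance when they sit at different altitudes.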

In [None]:
def fixMissingValues(var1,var2):
    """
    Fixes missing values of a variable (var1) with the values taken from another variable (var2).
    Method: normalization and substitution
    Returns the number of values for variable that were fixed
    If the variables have the same name, the function does not fix anything.
    """
    
    if var1 == var2:
        return 0
        
    w_type1,w_name1 = getWaterbody(var1)
    w_type2,w_name2 = getWaterbody(var2)
    
    series1 = df[w_type1][w_name1][["date", var1]].copy()
    series2 = df[w_type2][w_name2][["date",var2]].copy()


    # compute the statistics directly on the variable columns (avoids positional [0] on a labelled Series)
    mean1 = series1[var1].mean()
    mean2 = series2[var2].mean()
    std1 = series1[var1].std()
    std2 = series2[var2].std()
    
    data = series1.merge(series2,on="date",how="left")

    jj = 0
    for ii, val in data[var1].items(): # iteritems() was removed in pandas 2.0
        if np.isnan(val):
            if (not np.isnan(data.loc[ii,var2])):
                data.loc[ii,var1] = (data[var2][ii] - mean2) * std1/std2 + mean1
                jj = jj +1
    df[w_type1][w_name1][var1] = data[var1].values
    return jj



### Show example

plt.figure(figsize=(15,5))
ax = plt.gca()
var1 = "temperature_abbadia_s_salvatore"
var2 =  "temperature_le_croci"#"temperature_laghetto_verde"

## dates used only for plotting
start_date = datetime.strptime("2015-06-01", '%Y-%m-%d').date() # datetime.strptime("2015-06-01", '%Y-%m-%d').date()
end_date = datetime.strptime("2016-03-07", '%Y-%m-%d').date()#datetime.strptime("2016-03-07", '%Y-%m-%d').date()

# plot original series not yet fixed
w_type1, w_name1 = getWaterbody(var1)
w_type2, w_name2 = getWaterbody(var2)
series1old = df[w_type1][w_name1][["date", var1 ]].copy()
series2 = df[w_type2][w_name2][["date", var2 ]].copy()
series1old = series1old[(series1old["date"].dt.date>start_date) & (series1old["date"].dt.date<end_date)]
series2 = series2[(series2["date"].dt.date>start_date) & (series2["date"].dt.date<end_date)]

ax.plot(series2["date"],series2[var2],"b",label = "original %s"%var2)
ax.plot(series1old["date"],series1old[var1],"k",label= "original %s"%var1)


# FIX MISSING VALUES
fixMissingValues(var1,var2);


# plot fixed series
series1new = df[w_type1][w_name1][["date", var1 ]].copy()
series1new = series1new[(series1new["date"].dt.date>start_date) & (series1new["date"].dt.date<end_date)]
ax.plot(series1new["date"],series1new[var1],"--k",label= "fixed missing %s"%var1,lw=2)
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.setp(plt.gca().get_xticklabels(), rotation=60);
plt.grid()

In [None]:
def getGlobalCorrList(var_type,isMirrored=True,removeDuplicates=False,date_interval=None):
    """
    Get list of correlations for one variable across different waterbodies
    isMirrored: reports the same correlation value for each of the two combination of variables
    removeDuplicates: avoids reporting cases where the same variable is used in multiple waterbodies
    """

    import itertools

    
    varList = getAllVariables(var_type)
    varList.reset_index(drop=True,inplace=True)# ensure indices are unique
    
    
    indices = list(itertools.combinations(varList.index,2))

    corrList = pd.DataFrame(columns = ['type1', 'name1', 'variable1', 'type2', 'name2', 'variable2', 'corr_value'],index=range(len(indices)))

    if date_interval is not None:
        date_start,date_end = date_interval
        date_start =  datetime.strptime(date_start, '%Y-%m-%d').date()
        date_end =  datetime.strptime(date_end, '%Y-%m-%d').date()
        
    for ii in indices:
        type1 = varList["type"][ii[0]]
        name1 = varList["name"][ii[0]]
        var1 = varList["variable"][ii[0]]
        type2 = varList["type"][ii[1]]
        name2 = varList["name"][ii[1]]
        var2 = varList["variable"][ii[1]]
        data1 = df[type1][name1][["date",var1]].copy()
        data2 = df[type2][name2][["date",var2]].copy()
        if date_interval is not None:
            data1 = data1[(data1["date"].dt.date>= date_start) & (data1["date"].dt.date<= date_end)]
            data2 = data2[(data2["date"].dt.date>= date_start) & (data2["date"].dt.date<= date_end)] 
        corrValue = data1.merge(data2,on="date",how="outer").drop(columns="date").corr().iloc[0,1] # correlate only the two variables, not the dates
        if not np.isnan(corrValue):
            corrList_ii = pd.DataFrame({'type1': type1, 'name1':name1, 'variable1':var1, 'type2':type2, 'name2':name2, 'variable2':var2, 'corr_value':corrValue},index=[corrList.shape[0]])
            corrList = pd.concat([corrList,corrList_ii],sort=False)

    if isMirrored:
        corrList2 = corrList.copy()
        corrList2.columns = [col.replace("2","0").replace("1","2").replace("0","1") for col in corrList2.columns]
        corrList = pd.concat([corrList,corrList2],sort=False)
    
    if removeDuplicates:
        corrList.drop_duplicates(subset=["variable1","variable2"], keep='first', inplace=True)
           
    corrList.reset_index(drop=True,inplace=True)
    
    return corrList

### Show example
#list1 = getGlobalCorrList("rainfall",isMirrored=True,date_interval=("2020-01-01","2020-06-30"))
#list1 = list1.loc[list1["variable1"]=="rainfall_settefrati",:]
#list1.head()

As we can see, **temperatures** for different locations of **different** waterbodies are strongly correlated. This means that we can use them interchangeably.
Now, let's look at which variables contain most missing values.

In [None]:
def getMissingValues(var_name):

    report_table = pd.DataFrame(columns=["waterbody","variable","missing","total","missing%"])
    for ii, waterbody_type in enumerate(df):
        for jj, waterbody_name in enumerate(df[waterbody_type]):
            df_sub = df[waterbody_type][waterbody_name].copy()
            ind = np.where(df_sub.columns.str.startswith(var_name))[0]
            df_sub = df_sub.iloc[:,ind]
            if len(df_sub.columns)==0:
                continue
            #df_sub.columns=df_sub.columns.str.replace(var_name+"_","")
            statsd = pd.DataFrame({"waterbody":waterbody_type+" "+waterbody_name,"variable":list(df_sub.columns),"missing":np.nan,"total":df_sub.shape[0],"missing%":np.nan})
            statsd["missing"] = df_sub.isnull().sum().to_frame().values
            statsd["missing%"] = np.round(statsd["missing"]/statsd["total"]*100,1)
            report_table = pd.concat([statsd,report_table],axis=0,sort=False)   
    report_table = report_table.reindex(columns=["waterbody","variable","missing","total","missing%"] )  
    report_table.sort_values(by=['waterbody','missing%'],ascending=True,inplace=True) # sort in ascending order for each waterbody
    report_table.reset_index(drop=True,inplace=True)
    return report_table

getMissingValues("temperature").head()

In [None]:
def cleanAndGetVariables(var_type,var_name=None,corr_threshold=0.9,verbose=False):
    """
    For each waterbody find variable with least percentage of missing values
    If no missing values, the variable is clean enough and does not need any further treatment
    Otherwise find the most correlated variable in the SAME AQUIFER (if any) and use it to replace missing values of the first variable 
    Return: 
        - list of selected variables
        - list of complete waterbodies (waterbodies that have at least one complete variable)
    """

    corrList = getGlobalCorrList(var_type,isMirrored=True,removeDuplicates=True)
    report_table = getMissingValues(var_type)

    #pdb.set_trace()
    if var_name is not None:
        report_table = report_table.loc[report_table["variable"]==var_name,:]
    subtable = report_table[report_table["missing"]==0.0]
    completeWaterbodies = list(set(subtable["waterbody"]))
    selectedVariables = list(subtable["variable"])
    iid_remaining = np.array([wb not in completeWaterbodies for _,wb in report_table["waterbody"].items()])
    report_table = report_table[iid_remaining]
    report_table.reset_index(drop=True,inplace=True)

    # for each waterbody, take the variable with least missing values and fix it with any other variable of the same type 
    fixing = [True]
    while ((report_table.shape[0]>0) & np.any(fixing)):
        report_table.reset_index(drop=True,inplace=True)
        corrList.reset_index(drop=True,inplace=True)
        variables = report_table["variable"][report_table.groupby("waterbody")["missing%"].idxmin()]
        # for each waterbody fix the variable with fewest missing values, using the next variable with the highest correlation above the threshold
        fixing = []
        for ii,var_type in variables.items():
            if verbose==True:
                missing = report_table["missing%"][report_table["variable"]==var_type].values[0]
                print("%s: %.1f%% missing"%(var_type,missing))
            fixing.append(False)
            #pdb.set_trace()
            if (var_type in corrList["variable1"].values): # if there are missing values and there is another variable of the same type
                corrmax = corrList[corrList["variable1"]==var_type]["corr_value"].max()
                if corrmax >= corr_threshold: # if the maximum correlation reaches the threshold
                    fixing[-1] = True # so we managed to fix
                    kk = corrList[corrList["variable1"]==var_type]["corr_value"].idxmax() # take the value with the highest correlation
                    var_type2 = corrList["variable2"].loc[kk]
                    nfixed = fixMissingValues(var_type,var_type2)
                    if nfixed>0:
                        print("   -> fixed %d values of %s with %s."%(nfixed,var_type,var_type2) )
                        jj = report_table.index[report_table["variable"]==var_type] # index was reset at the top of the loop, so labels equal positions
                        report_table.loc[jj,"missing"] = report_table.loc[jj,"missing"] - nfixed
                        report_table.loc[jj,"missing%"] = report_table.loc[jj,"missing"] / report_table.loc[jj,"total"] * 100
                    corrList.drop(index=kk,inplace=True) # drop this correlation value since already used to avoid using always the same variable

        # remove the waterbody from the table of waterbodies to check        
        subtable = report_table[report_table["missing"]==0.0]
        completeWaterbodies = list(set(completeWaterbodies + list(subtable["waterbody"])))
        selectedVariables = list(set(selectedVariables + list(subtable["variable"])))
        iid_remaining = np.array([wb not in completeWaterbodies for _,wb in report_table["waterbody"].items()])
        report_table = report_table[iid_remaining]
        report_table.reset_index(drop=True,inplace=True)
        
    return(completeWaterbodies,selectedVariables)

In [None]:
completeWaterbodies,selectedVariables = cleanAndGetVariables("temperature",corr_threshold=0.9,verbose=False)

Finally, for each waterbody, we fill missing data with values from other sensors (when available), then we remove the other sensors' data to avoid collinearity in the model.
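
The helper `fixMissingValues` is defined elsewhere in the notebook, so the snippet below is only a minimal sketch of the idea, assuming that both sensors are columns of the same DataFrame. The function name `fill_from_donor` and the toy column names are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_from_donor(frame, target, donor, corr_threshold=0.9):
    """Fill NaNs in `target` with `donor` values when the two columns are correlated enough.

    Hypothetical helper (not the notebook's own). Returns (filled copy, number of values filled).
    """
    out = frame.copy()
    corr = out[target].corr(out[donor])  # Pearson correlation on rows where both are present
    if np.isnan(corr) or corr < corr_threshold:
        return out, 0
    fillable = out[target].isna() & out[donor].notna()
    out.loc[fillable, target] = out.loc[fillable, donor]
    return out, int(fillable.sum())

# toy data: two temperature sensors that move together
toy = pd.DataFrame({
    "temperature_a": [10.0, np.nan, 12.0, 13.0, np.nan, 15.0],
    "temperature_b": [10.5, 11.5, 12.5, 13.5, np.nan, 15.5],
})
filled, n_filled = fill_from_donor(toy, "temperature_a", "temperature_b")
print(n_filled)
```

Rows where the donor is missing as well are left untouched, so a later interpolation step can still deal with them.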

In [None]:
plotMissingValues("Aquifer")

In [None]:
df["Aquifer"]["Auser"].drop(columns=["temperature_monte_serra","temperature_ponte_a_moriano"],inplace=True)
df["Aquifer"]["Luco"].drop(columns=["temperature_mensano","temperature_siena_poggio_al_vento","temperature_pentolina"],inplace=True)
df["Aquifer"]["Petrignano"].drop(columns=["temperature_petrignano"],inplace=True)

In [None]:
plotMissingValues("Water_Spring","Amiata")

We fill sporadic missing values with interpolated values. 

In [None]:
data = df["Water_Spring"]["Amiata"].copy()
data["temperature_laghetto_verde_interpolated"] = data["temperature_laghetto_verde"].interpolate(method='linear')
data["temperature_s_fiora_interpolated"] = data["temperature_s_fiora"].interpolate(method='linear')
data.loc[700:750,:].plot(x="date",y=["temperature_laghetto_verde_interpolated","temperature_laghetto_verde"],style=[".-",".-"],grid=True,figsize=(15,5))#,legend=False)

df["Water_Spring"]["Amiata"]["temperature_laghetto_verde"] = data["temperature_laghetto_verde_interpolated"].values
df["Water_Spring"]["Amiata"]["temperature_s_fiora"] = data["temperature_s_fiora_interpolated"].values

In [None]:
plotMissingValues("Water_Spring","Lupa")

Since this water spring does not include any temperature data, we attach the temperature from a nearby location, chosen based on the correlation of the rainfall variable in the most recent period (2020). 

In [None]:
endDate = df["Water_Spring"]["Lupa"]["date"].values[-1].astype(str)[:10]
startDate = "2020-01-01"
corrList = getGlobalCorrList("rainfall",isMirrored=True,removeDuplicates=True,date_interval=(startDate,endDate))
corrList = corrList.loc[corrList["variable1"]=="rainfall_terni",["type2","name2","variable2","corr_value"]].sort_values(by="corr_value",ascending=False)
corrList.head()

Since there is no temperature data for location "*Croce Arcana*", we take the temperature from the second most rainfall-correlated location, which is "*Bastia_Umbra*".
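
The selection rule above can be sketched as follows. This is an illustration on toy data, not the notebook's implementation: the function `pick_temperature_proxy`, the station names and the mapping `temperature_of` are all hypothetical.

```python
import pandas as pd

def pick_temperature_proxy(rain, reference, temperature_of):
    """Rank stations by rainfall correlation with `reference` and return the
    first one that has a temperature series (hypothetical helper)."""
    corr = rain.corr()[reference].drop(reference).sort_values(ascending=False)
    for station in corr.index:
        if temperature_of.get(station) is not None:
            return station, float(corr[station])
    return None, float("nan")

# toy rainfall at a reference station and three candidate stations
rain = pd.DataFrame({
    "rainfall_ref": [0, 5, 0, 2, 8, 0],
    "rainfall_a":   [0, 5, 0, 2, 8, 0.5],   # best match, but no temperature sensor
    "rainfall_b":   [0, 4, 1, 2, 6, 0],     # second best, has a temperature sensor
    "rainfall_c":   [3, 0, 2, 0, 0, 4],
})
temperature_of = {"rainfall_a": None, "rainfall_b": "temperature_b", "rainfall_c": "temperature_c"}

station, corr_value = pick_temperature_proxy(rain, "rainfall_ref", temperature_of)
print(station, round(corr_value, 3))
```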

In [None]:
#startDate = df["Water_Spring"]["Lupa"]["date"].values[0].astype(str)[:10]
data2 = df["Aquifer"]["Petrignano"][["date","temperature_bastia_umbra"]].copy()
df["Water_Spring"]["Lupa"] = df["Water_Spring"]["Lupa"].merge(data2,on="date",how="left")

In [None]:
plotMissingValues("River","Arno")

In [None]:
data = df["River"]["Arno"].copy()
ind_start = data.loc[~data["temperature_firenze"].isnull(),:].index[0]
df["River"]["Arno"] = data.loc[ind_start:,:]
data = df["River"]["Arno"].copy()
data["temperature_firenze_interpolated"] = data["temperature_firenze"].interpolate(method='linear')
data.iloc[810:860,:].plot(x="date",y=["temperature_firenze_interpolated","temperature_firenze"],style=[".-",".-"],grid=True,figsize=(15,5))#,legend=False)

df["River"]["Arno"]["temperature_firenze"] = data["temperature_firenze_interpolated"].values

In [None]:
plotMissingValues("Lake","Bilancino")

In [None]:
data = df["Lake"]["Bilancino"].copy()
data["temperature_le_croci_interpolated"] = data["temperature_le_croci"].interpolate(method='linear')
data.iloc[120:200,:].plot(x="date",y=["temperature_le_croci_interpolated","temperature_le_croci"],style=[".-",".-"],grid=True,figsize=(15,5))#,legend=False)

df["Lake"]["Bilancino"]["temperature_le_croci"] = data["temperature_le_croci_interpolated"].values

## Rainfall
Let's analyze and correct rainfall data for all waterbodies.

In [None]:
data = df["Water_Spring"]["Amiata"].copy()
variable_name = "rainfall_castel_del_piano"

fig, axs = plt.subplots(nrows=2,ncols=2,figsize=(20,12));
rolling = pd.concat([data[["date"]],data[[variable_name]].rolling(30).mean()],axis=1,sort=False)
rolling.plot(x="date",y=variable_name,ax=axs[0,0],title="30-days rolling mean",grid=True);
data[variable_name].plot.hist(bins=np.arange(0,10,0.25),grid=True,alpha=0.5,ax=axs[0,1],legend=False,title="rainfall data distribution") ;    
#data[variable_name].plot(kind='kde',ax=axs[0,1]);

# show yearly seasonality over months
groups = pd.crosstab(data["date"].dt.year, data["date"].dt.month, data[variable_name], aggfunc="mean",rownames=["Year"],colnames=["Month"])
groups.mean().plot(ax=axs[1,0],kind="bar",title="average across months",grid=True)

# create a boxplot of yearly data
groups.T.boxplot(ax=axs[1,1])
axs[1,1].set_title("boxplot of average monthly rainfall over years");

plt.tight_layout()
"""
datafull = data.copy()
datafull["year"] = datafull["date"].dt.year
datafull["day"] = datafull["date"].dt.dayofyear
datafull=datafull.pivot(columns="year",index="day",values=variable_name)
datafull = datafull.mean(0)
datafull.plot(x="index",ax=axs[2],style=["-o"],grid=True);
axs[2].set_xticks(datafull.index.astype(int))
axs[2].set_xticklabels(datafull.index.astype(int));
"""



We see that the data is sparse: the vast majority of the values are 0. 

Moreover, as expected, there is yearly seasonality, since months have different average rainfall.

Average monthly rainfall in 2020 has decreased with respect to 2019.

### Aquifer Doganella

In [None]:
plotMissingValues("Aquifer","Doganella")
plot_WaterbodyVariable("Aquifer","rainfall",waterbody_name="Doganella",date_interval=("2014-12-19","2015-02-28"))#,marker=".",linestyle="",alpha=0.5)
plot_VarDistribution("Aquifer","Doganella","rainfall",stacked=True)

For **Aquifer Doganella** rainfall data is sporadically missing. We first substitute the missing values at each location with the non-missing values of the other location (and vice versa), provided the two are sufficiently correlated. The values that are still missing are then filled (where possible) by interpolation over a short time window. 

In [None]:
cleanAndGetVariables("rainfall",var_name="rainfall_monteporzio",corr_threshold=0.8,verbose=False)
df["Aquifer"]["Doganella"].loc[df["Aquifer"]["Doganella"]["rainfall_monteporzio"]<0,"rainfall_monteporzio"]=0.
cleanAndGetVariables("rainfall",var_name="rainfall_velletri",corr_threshold=0.8,verbose=False)
df["Aquifer"]["Doganella"].loc[df["Aquifer"]["Doganella"]["rainfall_velletri"]<0,"rainfall_velletri"]=0.

On average for how many consecutive days does it rain? This info will help us with filling data through interpolation.

In [None]:
import more_itertools

# use >0 rather than !=0, so that missing values (NaN) are not counted as rainy days
rainyDays = np.where(df["Aquifer"]["Doganella"]["rainfall_velletri"]>0)[0]
# now let's group consecutive days of rain
nRainyDays = [len(list(group)) for group in more_itertools.consecutive_groups(rainyDays)]
np.median(nRainyDays)
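
As an alternative to `more_itertools`, the same run lengths can be computed with pandas alone. The sketch below uses toy data and a hypothetical helper name (`rainy_spell_lengths`), and it treats missing values as dry days.

```python
import numpy as np
import pandas as pd

def rainy_spell_lengths(rain):
    """Lengths of consecutive runs of days with rain > 0 (NaN counts as no rain)."""
    wet = rain.fillna(0).gt(0)
    # a new run starts whenever the wet/dry state changes
    run_id = wet.ne(wet.shift()).cumsum()
    return wet[wet].groupby(run_id[wet]).size().tolist()

rain = pd.Series([0, 1.2, 3.0, 0, np.nan, 0.4, 0, 2.0, 2.5, 1.0])
spells = rainy_spell_lengths(rain)
median_spell = float(np.median(spells))
print(spells, median_spell)
```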

Now, we interpolate to fill remaining missing values. 
We use **linear** interpolation because it is more suitable to approximate spiky data like rainfall.

We use a window of two days because we have just seen that the median duration of rain is 2 days.

Finally we fill the remaining missing values with zeros, since no-rain is the most probable outcome for a day.
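
To make the effect of `limit` concrete, here is a small sketch on toy values (not the notebook's data). With the default settings, `limit=2` fills at most the first two missing values of each gap, so longer gaps are only partially interpolated and the rest falls back to zero.

```python
import numpy as np
import pandas as pd

rain = pd.Series([0.0, 4.0, np.nan, 2.0, np.nan, np.nan, np.nan, np.nan, 0.0])

# limit=2 fills at most two consecutive NaNs per gap, working forward from the left edge
step1 = rain.interpolate(method="linear", limit=2)
# whatever is still missing is assumed to be a dry day
step2 = step1.fillna(0)

print(pd.DataFrame({"raw": rain, "interpolated": step1, "final": step2}))
```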

In [None]:
data = df["Aquifer"]["Doganella"].copy()

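data["rainfall_velletri_interpolated"] = data["rainfall_velletri"].interpolate(method='linear',limit=2)
data["rainfall_monteporzio_interpolated"] = data["rainfall_monteporzio"].interpolate(method='linear',limit=2)
# assign the result instead of calling fillna(inplace=True) on a column selection, which may act on a temporary copy
data["rainfall_velletri_interpolated"] = data["rainfall_velletri_interpolated"].fillna(0)
data["rainfall_monteporzio_interpolated"] = data["rainfall_monteporzio_interpolated"].fillna(0)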

data.loc[900:1052,:].plot(x="date",y=["rainfall_monteporzio_interpolated","rainfall_monteporzio"],style=[".-",".-"],grid=True,figsize=(15,5))#,legend=False)

df["Aquifer"]["Doganella"]["rainfall_velletri"] = data["rainfall_velletri_interpolated"].values
df["Aquifer"]["Doganella"]["rainfall_monteporzio"] = data["rainfall_monteporzio_interpolated"].values

### Aquifer Auser

In [None]:
plotMissingValues("Aquifer","Auser")

The only missing values are at locations *Piaggione* and *Monte Serra*.

We fix *Piaggione* using other locations' values.

In [None]:
cleanAndGetVariables("rainfall",var_name="rainfall_piaggione",corr_threshold=0.8,verbose=False)
df["Aquifer"]["Auser"].loc[df["Aquifer"]["Auser"]["rainfall_piaggione"]<0,"rainfall_piaggione"]=0.

In [None]:
data = df["Aquifer"]["Auser"].copy()
data["rainfall_monte_serra_interpolated"] = data["rainfall_monte_serra"].interpolate(method='linear',limit=2)
data["rainfall_monte_serra_interpolated"] = data["rainfall_monte_serra_interpolated"].fillna(0)
df["Aquifer"]["Auser"]["rainfall_monte_serra"] = data["rainfall_monte_serra_interpolated"].values


### Aquifer Luco

In [None]:
plotMissingValues("Aquifer","Luco")

In [None]:
fig = plt.figure(figsize=(15,5))
ax= plt.gca()
plotCorrMatrix("Aquifer","Luco",var_names="rainfall",vmin=-1, vmax=1,cmap="coolwarm",ax=ax)

From the correlation matrix we see that rainfall at some locations is highly correlated with rainfall at other locations (correlation>80%). So we can fix the missing values of one location with the non-missing values of another location.

In [None]:
cleanAndGetVariables("rainfall",var_name="rainfall_monticiano_la_pineta",corr_threshold=0.8,verbose=False)
df["Aquifer"]["Luco"].loc[df["Aquifer"]["Luco"]["rainfall_monticiano_la_pineta"]<0,"rainfall_monticiano_la_pineta"]=0.
cleanAndGetVariables("rainfall",var_name="rainfall_ponte_orgia",corr_threshold=0.8,verbose=False)
df["Aquifer"]["Luco"].loc[df["Aquifer"]["Luco"]["rainfall_ponte_orgia"]<0,"rainfall_ponte_orgia"]=0.
cleanAndGetVariables("rainfall",var_name="rainfall_scorgiano",corr_threshold=0.8,verbose=False)
df["Aquifer"]["Luco"].loc[df["Aquifer"]["Luco"]["rainfall_scorgiano"]<0,"rainfall_scorgiano"]=0.
cleanAndGetVariables("rainfall",var_name="rainfall_pentolina",corr_threshold=0.8,verbose=False)
df["Aquifer"]["Luco"].loc[df["Aquifer"]["Luco"]["rainfall_pentolina"]<0,"rainfall_pentolina"]=0.

*Montalcinello*, *Simignano* and *Sovicille* are locations with few missing values, which can still be partially reconstructed. 

So we interpolate up to 2 days (the typical duration of rainfall) for each location.
Then, for the remaining missing values, we use other locations' values, but this time we relax the correlation threshold to 75% (below the previous 80% threshold).

On the other hand, we do not spend more time on *Pentolina*, *Arbia Piena* and *Ponte Orgia*, as they miss large parts of the data.

In [None]:
data = df["Aquifer"]["Luco"].copy()
data.loc[1620:1670,:].plot(x="date",y=["rainfall_sovicille"],style=[".-"],grid=True,figsize=(15,5))#,legend=False)

#data.loc[2100:2300,:].plot(x="date",y=["rainfall_simignano_interpolated","rainfall_simignano"],style=[".-",".-"],grid=True,figsize=(15,5))#,legend=False)


In [None]:
data = df["Aquifer"]["Luco"].copy()

data["rainfall_montalcinello_interpolated"] = data["rainfall_montalcinello"].interpolate(method='linear',limit=2)
data["rainfall_simignano_interpolated"] = data["rainfall_simignano"].interpolate(method='linear',limit=2)
data["rainfall_sovicille_interpolated"] = data["rainfall_sovicille"].interpolate(method='linear',limit=2)
data["rainfall_sovicille_interpolated"] = data["rainfall_sovicille_interpolated"].fillna(0)
#data["rainfall_scorgiano_interpolated"] = data["rainfall_scorgiano"].interpolate(method='linear',limit=2)


data.loc[2100:2300,:].plot(x="date",y=["rainfall_simignano_interpolated","rainfall_simignano"],style=[".-",".-"],grid=True,figsize=(15,5))#,legend=False)

df["Aquifer"]["Luco"]["rainfall_montalcinello"] = data["rainfall_montalcinello_interpolated"].values
df["Aquifer"]["Luco"]["rainfall_simignano"] = data["rainfall_simignano_interpolated"].values
df["Aquifer"]["Luco"]["rainfall_sovicille"] = data["rainfall_sovicille_interpolated"].values
#df["Aquifer"]["Luco"]["rainfall_scorgiano"] = data["rainfall_scorgiano_interpolated"].values

In [None]:
cleanAndGetVariables("rainfall",var_name="rainfall_simignano",corr_threshold=0.75,verbose=False)
df["Aquifer"]["Luco"].loc[df["Aquifer"]["Luco"]["rainfall_simignano"]<0,"rainfall_simignano"]=0.
cleanAndGetVariables("rainfall",var_name="rainfall_montalcinello",corr_threshold=0.75,verbose=False)
df["Aquifer"]["Luco"].loc[df["Aquifer"]["Luco"]["rainfall_montalcinello"]<0,"rainfall_montalcinello"]=0.

In [None]:
plotMissingValues("Aquifer","Luco")
plot_WaterbodyVariable("Aquifer","rainfall_simignano",waterbody_name="Luco",date_interval=("2013-01-01","2014-02-28"))#,marker=".",linestyle="",alpha=0.5)


Let's remove the variables that have too few values.

In [None]:
df["Aquifer"]["Luco"].drop(columns=["rainfall_siena_poggio_al_vento","rainfall_mensano","rainfall_scorgiano"],inplace=True)

### Aquifer Petrignano

In [None]:
plotMissingValues("Aquifer","Petrignano")

The missing values are concentrated in the first part of the series, so there is not much we can do to reconstruct them at this location. We therefore drop the first part of the data, so that the remaining data includes both temperature and rainfall.

In [None]:
indstart= np.where(~df["Aquifer"]["Petrignano"]["rainfall_bastia_umbra"].isnull())[0][0]
df["Aquifer"]["Petrignano"] = df["Aquifer"]["Petrignano"].iloc[indstart:]
df["Aquifer"]["Petrignano"].reset_index(drop=True,inplace=True)

### Water Spring Amiata

In [None]:
plotMissingValues("Water_Spring","Amiata")

We fix missing values with rainfall values at other locations, as long as correlation is higher than 80%

In [None]:
cleanAndGetVariables("rainfall",var_name="rainfall_abbadia_s_salvatore",corr_threshold=0.8,verbose=False)
df["Water_Spring"]["Amiata"].loc[df["Water_Spring"]["Amiata"]["rainfall_abbadia_s_salvatore"]<0,"rainfall_abbadia_s_salvatore"]=0.
cleanAndGetVariables("rainfall",var_name="rainfall_s_fiora",corr_threshold=0.8,verbose=False)
df["Water_Spring"]["Amiata"].loc[df["Water_Spring"]["Amiata"]["rainfall_s_fiora"]<0,"rainfall_s_fiora"]=0.
cleanAndGetVariables("rainfall",var_name="rainfall_laghetto_verde",corr_threshold=0.8,verbose=False)
df["Water_Spring"]["Amiata"].loc[df["Water_Spring"]["Amiata"]["rainfall_laghetto_verde"]<0,"rainfall_laghetto_verde"]=0.

Since *Vetta Amiata* has a correlation below 80% with every other location, let's fill the remaining sporadic missing values by interpolation. 
Let's first calculate the median duration (in days) of rain, so that we use a suitable window.

In [None]:
import more_itertools

# use >0 rather than !=0, so that missing values (NaN) are not counted as rainy days
rainyDays = np.where(df["Water_Spring"]["Amiata"]["rainfall_vetta_amiata"]>0)[0]
# now let's group consecutive days of rain
nRainyDays = [len(list(group)) for group in more_itertools.consecutive_groups(rainyDays)]
data = df["Water_Spring"]["Amiata"].copy()
data["rainfall_vetta_amiata"] = data["rainfall_vetta_amiata"].interpolate(method='linear',limit=int(np.median((nRainyDays))))
data["rainfall_vetta_amiata"] = data["rainfall_vetta_amiata"].fillna(0)
df["Water_Spring"]["Amiata"]["rainfall_vetta_amiata"] = data["rainfall_vetta_amiata"].values

In [None]:
plotMissingValues("Water_Spring","Amiata")

### Water Spring Lupa

In [None]:
plotMissingValues("Water_Spring","Lupa")

In [None]:
plot_WaterbodyVariable("Water_Spring","rainfall",waterbody_name="Lupa",date_interval=("2019-06-01","2020-06-30"))#,marker=".",linestyle="",alpha=0.5)

The rainfall data appears to be aggregated in two different ways: monthly for most of the series, and daily only in 2020. 
So we aggregate the 2020 data by month to make it consistent with the previous data. Then we smooth the full dataset in order to make it less jumpy.

In [None]:
data = df["Water_Spring"]["Lupa"][["date","rainfall_terni"]].copy()

data["year"] = data["date"].dt.year
data["month"] = data["date"].dt.month
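# average only the rainfall column, so that "date" is not aggregated and keeps its name after the merge
data = data.merge(data.groupby(["year","month"])["rainfall_terni"].mean().reset_index(),on=["year","month"],how="left",suffixes=('_old', '_new'))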

plt.figure(figsize=(15,5))
ax = plt.gca()
data.loc[3500:5000,:].plot(x="date",y="rainfall_terni_old",grid=True,figsize=(15,5),style=["-k"],alpha=0.5,ax=ax)


msk = data['date'].map(lambda x: x.day) == 15
data.loc[~msk,"rainfall_terni_new"]=np.nan
data["rainfall_terni_new"] = data["rainfall_terni_new"].interpolate(method="akima")
data.loc[3500:5000,:].plot(x="date",y="rainfall_terni_new",grid=True,figsize=(15,5),style="-b",ax=ax)

df["Water_Spring"]["Lupa"]["rainfall_terni"] = data["rainfall_terni_new"].values
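
The two steps above (monthly averaging, then smoothing between mid-month anchors) can be sketched on synthetic data as follows. This is an illustration under simplifying assumptions: the data is made up, and linear interpolation is used here instead of the notebook's `method="akima"` so that the example does not depend on SciPy.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2020-01-01", "2020-03-31", freq="D")
rain = pd.Series(np.where(dates.day % 10 == 0, 9.0, 0.0), index=dates)  # synthetic daily rainfall

# 1) replace every day by the mean of its calendar month
monthly_mean = rain.groupby([rain.index.year, rain.index.month]).transform("mean")

# 2) keep the monthly value only on the 15th and interpolate between these anchors
anchors = monthly_mean.where(rain.index.day == 15)
smooth = anchors.interpolate(method="linear")

print(smooth.loc["2020-01-14":"2020-01-17"])
```

Days before the first anchor stay missing, because interpolation by default only fills forward from the first valid value.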

### Water Spring Madonna di Canneto

In [None]:
plotMissingValues("Water_Spring","Madonna_di_Canneto")

In [None]:
corrList = getGlobalCorrList("rainfall",isMirrored=True,removeDuplicates=True)
corrList = corrList.loc[corrList["variable1"]=="rainfall_settefrati",["type2","name2","variable2","corr_value"]].sort_values(by="corr_value",ascending=False)
corrList.head()

There is only one rainfall variable, and it is missing data from the beginning of 2019 onwards. These missing values would negatively impact the forecast. 

We also see that no other variable is highly correlated with it, so we need to predict its values based on rainfall from multiple other locations. 

In [None]:
# get the dates where the rainfall needs to be predicted


df["Water_Spring"]["Madonna_di_Canneto"] = df["Water_Spring"]["Madonna_di_Canneto"].loc[~df["Water_Spring"]["Madonna_di_Canneto"]["date"].isnull(),:] # remove dates where null

indices = np.where(df["Water_Spring"]["Madonna_di_Canneto"]["rainfall_settefrati"].isnull())[0]

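# convert to datetime.date, so that the bounds are comparable with the .dt.date values used in the next cells
startDate = df["Water_Spring"]["Madonna_di_Canneto"]["date"].iloc[indices[0]].date()
endDate = df["Water_Spring"]["Madonna_di_Canneto"]["date"].iloc[indices[-1]].date()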

In [None]:
# merge all rainfall data into one training dataset
data = df["Water_Spring"]["Madonna_di_Canneto"][["date","rainfall_settefrati"]].copy()
data_train = data.loc[data["date"].dt.date<startDate,:]
data_pred = data.loc[(data["date"].dt.date>=startDate) & (data["date"].dt.date<=endDate),:]

train_index = data_train.index
pred_index = data_pred.index

#select rainfall locations that can cover the missing rainfall data in Madonna di Canneto   
ndates = len(indices) # number of days with missing values
for ii,row in corrList.iterrows():
    var_name = row["variable2"]
    if row["corr_value"]>=0.25: # candidate variable has a correlation of at least 25% with the target variable
        data = df[row["type2"]][row["name2"]][["date",var_name]].copy()
        data4train = data.loc[(data["date"].dt.date<startDate),["date",var_name] ]
        data4pred = data.loc[(data["date"].dt.date<=endDate) & (data["date"].dt.date>=startDate),["date",var_name]]
        if data4pred[var_name].isnull().sum()<.05*ndates: # variable has less than 5% of missing values in the prediction date range
            if data4train[var_name].isnull().sum()<.1*ndates: # missing values in the training date range are fewer than 10% of the number of days to predict
                data_train = data_train.merge(data4train,on="date",how="left")
                data_pred = data_pred.merge(data4pred,on="date",how="outer")

data_train.index =  train_index
data_pred.index =  pred_index

# interpolate for missing values up to 3 days
data_train = data_train.interpolate(method="linear",limit=3)
data_pred = data_pred.interpolate(method="linear",limit=3)
#data_rain = data_rain.iloc[:,np.where(~data_rain.isnull().any())[0]] 

# scale dates into cyclical features in range 0-1
day = pd.DatetimeIndex(data_train["date"]).dayofyear.values
data_train["day_y"] = np.sin(2*np.pi*day/365.25)
data_train["day_x"] = np.cos(2*np.pi*day/365.25)
data_train["day_y"] = (data_train["day_y"]+1)/2
data_train["day_x"] = (data_train["day_x"]+1)/2
data_train.drop("date",axis=1,inplace=True)

day = pd.DatetimeIndex(data_pred["date"]).dayofyear.values
data_pred["day_y"] = np.sin(2*np.pi*day/365.25)
data_pred["day_x"] = np.cos(2*np.pi*day/365.25)
data_pred["day_y"] = (data_pred["day_y"]+1)/2
data_pred["day_x"] = (data_pred["day_x"]+1)/2
data_pred.drop("date",axis=1,inplace=True)

data_train.describe()
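
The cyclical encoding used above can be wrapped in a small function, which also makes its main property easy to check: days that are close in the calendar stay close in feature space, even across New Year. The function name `encode_day_of_year` is hypothetical; the formula is the one used in the cell above.

```python
import numpy as np
import pandas as pd

def encode_day_of_year(dates, period=365.25):
    """Map dates to two features in [0, 1] that wrap around smoothly at New Year."""
    idx = pd.DatetimeIndex(dates)
    angle = 2 * np.pi * idx.dayofyear.values / period
    return pd.DataFrame({
        "day_y": (np.sin(angle) + 1) / 2,
        "day_x": (np.cos(angle) + 1) / 2,
    }, index=idx)

enc = encode_day_of_year(pd.to_datetime(["2019-12-31", "2020-01-01", "2020-07-01"]))

# Euclidean distances between encodings
d_adjacent = np.linalg.norm((enc.iloc[0] - enc.iloc[1]).values)   # 31 Dec vs 1 Jan
d_half_year = np.linalg.norm((enc.iloc[0] - enc.iloc[2]).values)  # 31 Dec vs 1 Jul
print(round(d_adjacent, 4), round(d_half_year, 4))
```

A plain day-of-year number would instead place 31 December and 1 January at opposite ends of the scale.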

Let's set up the train and test sets for the machine learning job.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning, module="sklearn")

In [None]:
def selectFeatures(data,vif_thr=1.5):
    ## uses the variance inflation factor (VIF) to select the most relevant features
    from statsmodels.stats.outliers_influence import variance_inflation_factor 

    # vif_thr is the maximum VIF acceptable for an independent variable to be considered not collinear with the other variables
    # (it is taken from the function argument and not overwritten here)

    data_sel = data.copy()
    # method: repeatedly compute the VIF of every variable and remove the one with the highest VIF, until only variables with VIF<=vif_thr are left  
    while True:

        vif_data = pd.DataFrame() 
        vif_data["feature"] = data_sel.columns 

        # calculating VIF for each independent variable 
        vif_data["VIF"] = [variance_inflation_factor(data_sel.values, i) for i in range(len(data_sel.columns))] 

        vif_iimax = vif_data["VIF"].idxmax()
        if vif_data.loc[vif_iimax,"VIF"]<=vif_thr:
            break
        data_sel.drop(columns=[vif_data.loc[vif_iimax,"feature"]],inplace=True)
    return(data_sel)


def print_results(results,verbose=False):
    print('BEST PARAMS: {}\n'.format(results.best_params_))
    print('BEST SCORE: {}\n'.format(round(results.best_score_,3)))

    if verbose:
        means = results.cv_results_['mean_test_score']
        stds = results.cv_results_['std_test_score']
        for mean, std, params in zip(means, stds, results.cv_results_['params']):
            print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))
        

def evaluate_model(name, model, features, labels,pr_plot=False):
    '''
    Evaluates and prints model performance indicators:
    - prediction runtime 
    - accuracy
    - precision
    - recall
    '''
    from time import time
    from sklearn.metrics import accuracy_score, precision_score, recall_score
    from sklearn.metrics import precision_recall_curve
    start = time()
    predictions = model.predict(features)
    end = time()
    accuracy = round(accuracy_score(labels, predictions), 3)
    precision = round(precision_score(labels, predictions), 3)
    recall = round(recall_score(labels, predictions), 3)
    print('{} -- Accuracy: {} / Precision: {} / Recall: {} / Latency: {}ms'.format(name,accuracy,precision,recall,
                                                                                round((end - start)*1000, 1)))
    if pr_plot:
        probs = model.predict_proba(features)
        precision, recall, thresholds = precision_recall_curve(labels, probs[:,1])
        plt.plot(thresholds,precision[:-1],label="precision")
        plt.plot(thresholds,recall[:-1],label="recall")
        plt.xlabel("binary threshold")
        plt.legend()
        plt.grid()
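
To show what the VIF-based elimination in `selectFeatures` does, here is a NumPy-only sketch of the same loop. It uses the textbook definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the other columns. Note that this sketch fits an intercept in each auxiliary regression, so its numbers need not coincide with those returned by the `statsmodels` call above. The names `vif_table` and `drop_collinear` and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd

def vif_table(X):
    """VIF of every column, from an auxiliary least-squares regression with intercept."""
    values = X.to_numpy(dtype=float)
    vifs = []
    for j in range(values.shape[1]):
        y = values[:, j]
        others = np.delete(values, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs.append(np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return pd.Series(vifs, index=X.columns)

def drop_collinear(X, vif_thr=1.5):
    """Repeatedly drop the column with the highest VIF until all VIFs are <= vif_thr."""
    X = X.copy()
    while X.shape[1] > 1:
        vifs = vif_table(X)
        if vifs.max() <= vif_thr:
            break
        X = X.drop(columns=vifs.idxmax())
    return X

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
X = pd.DataFrame({"a": a, "b": b, "a_copy": a + 0.01 * rng.normal(size=500)})
kept = drop_collinear(X)
print(list(kept.columns))
```

Of the two nearly identical columns only one survives, while the independent column `b` is kept.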

In [None]:
features = data_train.drop('rainfall_settefrati', axis=1) # remove the label column from the dataset
labels = data_train['rainfall_settefrati'] # use only the label column

X_train, X_val, y_train, y_val = train_test_split(features,labels,test_size=0.2,shuffle=True,random_state=42) # 80% training, 20% validation

X_test = data_pred.drop('rainfall_settefrati', axis=1) # features used for final prediction


## select most relevant features based on variance inflation
colnames = [x for x in X_train.columns if x not in ["day_x","day_y"]]
X_sel = selectFeatures(X_train[colnames],vif_thr=1.5)
X_train = pd.concat([X_sel,X_train[["day_x","day_y"]]],axis=1,sort=False)

X_val = X_val[X_train.columns]
X_test = X_test[X_train.columns]

X_train.head()

Since the data has many zero values (no rain), we first predict the occurrence of rain as a binary classification problem. 
Then we predict the amount of rain only for the days on which rain is predicted. 

If we used only a regression model without the classification step, the accuracy would be much worse.

In [None]:
# select data of rain vs no rain
rain_thrs = 0.0 # threshold for rain (can be changed depending on how noisy are the sensors, here we assume no noise)

# TRAIN
colnames = [x for x in X_train.columns if x not in ["day_x","day_y"]]
X_train_bin = X_train.copy()
X_train_bin[colnames] = (X_train_bin[colnames]>rain_thrs).astype(int)
y_train_bin = (y_train>rain_thrs).astype(int)
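# VALIDATION (binarize only the rainfall columns and keep day_x/day_y continuous, as done for X_train_bin)
y_val_bin = (y_val>rain_thrs).astype(int)
X_val_bin = X_val.copy()
X_val_bin[colnames] = (X_val_bin[colnames]>rain_thrs).astype(int)
# TEST
X_test_bin = X_test.copy()
X_test_bin[colnames] = (X_test_bin[colnames]>rain_thrs).astype(int)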

We try a logistic regression classifier and a multi-layer perceptron, using cross-validation to select the best hyperparameters.

In [None]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
parameters = {
    'C': [.001,.01,.1,1,10]
}

cv = GridSearchCV(logreg,parameters,cv=5)
cv.fit(X_train_bin,y_train_bin.values.ravel()) # fit on the binarized features, consistent with X_val_bin used for evaluation below
print_results(cv)

binary_model = cv.best_estimator_
predictions = binary_model.predict(X_val_bin)
evaluate_model(type(cv.estimator).__name__, binary_model, X_val_bin, y_val_bin,pr_plot=True) # evaluate the model

In [None]:
from sklearn.neural_network import MLPClassifier


mlp = MLPClassifier(early_stopping=False)
parameters = {
    'hidden_layer_sizes': [(10,1),(30,1),(10,2),(30,2)], # note: scikit-learn reads each tuple as the number of units per hidden layer, so (10,1) is a 10-unit layer followed by a 1-unit layer
    'activation': ["relu","tanh"],
    'learning_rate': ["constant","adaptive"] # "adaptive" keeps the learning rate constant as long as the training loss keeps decreasing, otherwise it decreases it (this setting is only used by the 'sgd' solver)
}

cv = GridSearchCV(mlp,parameters,cv=5)
cv.fit(X_train_bin,y_train_bin.values.ravel()) # fit on the binarized features, consistent with X_val_bin used for evaluation below

print_results(cv)

binary_model = cv.best_estimator_
evaluate_model(type(cv.estimator).__name__, binary_model, X_val_bin, y_val_bin,pr_plot=True) # evaluate the model

We select the multi-layer perceptron with a **binary threshold of 0.45**, which has more than 70% recall and precision. 
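
To clarify how a binary threshold trades precision against recall, here is a minimal sketch on made-up labels and probabilities (they are not the notebook's results). The helper `precision_recall_at` is hypothetical.

```python
import numpy as np

def precision_recall_at(y_true, proba, threshold):
    """Precision and recall of the rule 'predict rain when proba > threshold'."""
    y_true = np.asarray(y_true).astype(bool)
    y_hat = np.asarray(proba) > threshold
    tp = np.sum(y_hat & y_true)
    precision = tp / max(np.sum(y_hat), 1)
    recall = tp / max(np.sum(y_true), 1)
    return float(precision), float(recall)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
proba = [0.9, 0.4, 0.5, 0.3, 0.6, 0.1, 0.8, 0.2]

p_low, r_low = precision_recall_at(y_true, proba, 0.45)
p_high, r_high = precision_recall_at(y_true, proba, 0.7)
print(p_low, r_low, p_high, r_high)
```

Raising the threshold makes the classifier more conservative: in this toy example precision goes up while recall goes down.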

Then, let's implement a gradient boosting regression, after scaling the data with a Box-Cox transformation with lambda=0 (equivalent to the natural logarithm), applied to the values shifted by 1 so that zeros can be transformed too.
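
Since the Box-Cox transformation with lambda=0 is the natural logarithm, applying it to `x+1` is the same as `log1p(x)`, and its inverse is `expm1`. A small NumPy sketch on made-up rainfall values:

```python
import numpy as np

rain_mm = np.array([0.0, 0.2, 1.0, 12.5, 80.0])

# Box-Cox with lambda = 0 is the natural log; applied to x + 1 it equals log1p(x)
scaled = np.log1p(rain_mm)
# inverse transform, needed to bring predictions back to the original scale
restored = np.expm1(scaled)

print(np.round(scaled, 3))
```

A dry day (0) stays 0 after scaling, so the same threshold `rain_thrs = 0.0` still separates dry from rainy days on the scaled values.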

In [None]:
# scale data
colnames = [x for x in X_train.columns if x not in ["day_x","day_y"]]

X_train_scaled = X_train.copy() 
X_val_scaled = X_val.copy()
X_test_scaled = X_test.copy()
for col in colnames:
    X_train_scaled[col] = stats.boxcox(X_train_scaled.loc[:,col].values+1,lmbda=0) 
    X_val_scaled[col] = stats.boxcox(X_val_scaled.loc[:,col].values+1,lmbda=0) 
    X_test_scaled[col] = stats.boxcox(X_test_scaled.loc[:,col].values+1,lmbda=0) 

y_train_scaled = pd.Series(stats.boxcox(y_train.values+1,0),index=y_train.index)
y_val_scaled = pd.Series(stats.boxcox(y_val.values+1,0),index=y_val.index)

# Generate datasets for regression

y_train_rain = y_train_scaled[y_train_scaled>rain_thrs]
X_train_rain = X_train_scaled.loc[y_train_rain.index,:]

y_val_rain = y_val_scaled[y_val_scaled>rain_thrs]
X_val_rain = X_val_scaled.loc[y_val_rain.index,X_train_rain.columns]

X_test_rain = X_test_scaled.copy()

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error

gbr = GradientBoostingRegressor()

parameters = {
    'n_estimators': [100,200,300],
    'max_depth': [1, 3 ,5], # depth=1 is called "decision stump"
    'learning_rate': [0.001,0.01, 0.1]
}


cv = GridSearchCV(gbr,parameters,cv=5,scoring='neg_mean_squared_error',return_train_score=True)
cv.fit(X_train_rain,y_train_rain.values.ravel())

print_results(cv)

prediction_model = cv.best_estimator_
y_pred = prediction_model.predict(X_val_rain)
print("R2 score: %.2f" % r2_score(y_val_rain, y_pred))
print("MSE : %.4f" % mean_squared_error(y_val_rain, y_pred))


#residuals = (y_pred-y_test_rain)/y_test_rain
#plt.plot(residuals.values,label="residuals")
plt.plot(y_pred,label="predictions")
plt.plot(y_val_rain.values,label="true")
plt.legend()
plt.grid()

Now it is finally time to predict. 

First let's retrain the models on the whole labelled data, since so far the validation data (20% of the labelled data) was held out and not used for training.

In [None]:
X_train = pd.concat([X_train,X_val],axis=0,sort=False)
X_train_bin = pd.concat([X_train_bin,X_val_bin],axis=0,sort=False)
y_train_bin = pd.concat([y_train_bin,y_val_bin],axis=0,sort=False)
X_train_rain = pd.concat([X_train_rain,X_val_rain],axis=0,sort=False)
y_train_rain = pd.concat([y_train_rain,y_val_rain],axis=0,sort=False)

In [None]:
binary_model.fit(X_train_bin,y_train_bin.values.ravel()) # binarized features, as used for the final prediction on X_test_bin
prediction_model.fit(X_train_rain,y_train_rain.values.ravel())
evaluate_model(type(binary_model).__name__, binary_model, X_train_bin, y_train_bin,pr_plot=True) 

In [None]:
bin_threshold = 0.45
y_pred = pd.Series(binary_model.predict_proba(X_test_bin)[:,1]>bin_threshold,index=X_test_bin.index)

X_test_rain = X_test_rain.loc[y_pred,X_train_rain.columns]

In [None]:
pred_rain = pd.Series(prediction_model.predict(X_test_rain),index=X_test_rain.index)
pred_rain = np.exp(pred_rain)-1

In [None]:
y_pred = y_pred.astype(float)
y_pred.loc[pred_rain.index] = pred_rain.values
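
The last three cells combine the classifier and the regressor into one rainfall estimate. The sketch below condenses that logic into a single function on toy inputs. The name `combine_two_stage` is hypothetical, and the final `clip` is an extra safety step that is not in the notebook.

```python
import numpy as np
import pandas as pd

def combine_two_stage(rain_proba, log_amount, threshold=0.45):
    """Final rainfall: 0 on predicted dry days, expm1(regressor output) on predicted rainy days.

    rain_proba : Series of classifier probabilities for 'rain'
    log_amount : Series of regressor outputs on the log(1 + x) scale, same index
    """
    wet = rain_proba > threshold
    rainfall = pd.Series(0.0, index=rain_proba.index)
    # back-transform only the rainy days; clip guards against negative amounts (assumption)
    rainfall[wet] = np.expm1(log_amount[wet]).clip(lower=0)
    return rainfall

proba = pd.Series([0.10, 0.50, 0.90])
log_amount = pd.Series(np.log1p([3.0, 1.0, 7.0]))
rainfall = combine_two_stage(proba, log_amount)
print(rainfall.tolist())
```

The regressor output of the first day is ignored, because the classifier considers that day dry.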


In [None]:
df["Water_Spring"]["Madonna_di_Canneto"].loc[y_pred.index,"rainfall_settefrati"] = y_pred
plot_WaterbodyVariable("Water_Spring","rainfall",waterbody_name="Madonna_di_Canneto",date_interval=("2018-01-01","2019-12-31"))#,marker=".",linestyle="",alpha=0.5)


We see that the algorithm does not predict any large peak in the last quarter of the year (as in 2019); however, it predicts rain for a large number of days.  

### River Arno

In [None]:
plotMissingValues("River","Arno")

Let's remove old data with missing values, as it is not relevant for the forecasting.

In [None]:
indices= np.where(~df["River"]["Arno"].loc[:,"rainfall_le_croci"].isnull())[0]
df["River"]["Arno"] = df["River"]["Arno"].iloc[indices,:]

Let's remove the locations where there is no recent data.

In [None]:
df["River"]["Arno"].drop(columns=["rainfall_stia","rainfall_incisa","rainfall_montevarchi","rainfall_s_savino","rainfall_laterina","rainfall_bibbiena","rainfall_camaldoli","rainfall_consuma","rainfall_vernio"],inplace=True)

### Lake Bilancino

In [None]:
plotMissingValues("Lake","Bilancino")

Let's remove old data with missing values, as it is not relevant for forecasting.

In [None]:
indices= np.where(~df["Lake"]["Bilancino"].loc[:,"rainfall_le_croci"].isnull())[0]
df["Lake"]["Bilancino"] = df["Lake"]["Bilancino"].iloc[indices,:]

## Volume
The volume is the **OUT volume**, i.e. the volume that is drawn by the pumps (and then sent to the water treatment plant).
The sign convention for volume differs between companies, so we need to make the volumes uniform by giving them all a positive sign.
We then analyze and clean the data for all the waterbodies.

In [None]:
for ii, info in data_descr.iterrows():
    w_type = info["type"]
    w_name = info["name"]
    variables = getIndependentVariables(w_type,w_name,"volume")
    for vv in variables:
        df[w_type][w_name][vv] = np.abs(df[w_type][w_name][vv]) 

### Aquifer Petrignano

In [None]:
plotMissingValues("Aquifer","Petrignano")
plot_WaterbodyVariable("Aquifer","volume",waterbody_name="Petrignano",date_interval=("2019-01-01","2019-12-31"))#,marker=".",linestyle="",alpha=0.5)
plot_VarDistribution("Aquifer","Petrignano","volume")

Aquifer Petrignano has some volume values that drop abruptly to zero, so let's deal with them by converting them to NaN.

In [None]:
variable = getIndependentVariables("Aquifer","Petrignano","volume")
series = df["Aquifer"]["Petrignano"][variable]
indwhere = np.where(series==0)[0]
series.iloc[indwhere]=np.nan
df["Aquifer"]["Petrignano"][variable] = series
plot_VarDistribution("Aquifer","Petrignano","volume")

Now let's interpolate the missing values, filling only gaps of at most 15 days; for longer gaps we do not have enough confidence to interpolate.

In [None]:
variable = getIndependentVariables("Aquifer","Petrignano","volume")
df["Aquifer"]["Petrignano"][variable] = df["Aquifer"]["Petrignano"][variable].interpolate(method='spline', order=1, axis=0, limit=15)
plot_WaterbodyVariable("Aquifer","volume",waterbody_name="Petrignano",date_interval=("2019-01-01","2019-12-31"))#,marker=".",linestyle="",alpha=0.5)
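
One caveat about the 15-day rule: pandas defines `limit` as the maximum number of consecutive NaNs to fill, so a gap longer than the limit is still filled partially. A minimal sketch of a stricter variant, which leaves long gaps completely untouched, is given below. The helper name `interpolate_short_gaps` is hypothetical, and linear interpolation is used instead of a spline to keep the example dependency-free.

```python
import numpy as np
import pandas as pd

def interpolate_short_gaps(series, max_gap, method="linear"):
    """Interpolate only runs of NaN that are at most `max_gap` samples long."""
    is_nan = series.isna()
    # Give every run of consecutive equal mask values its own id
    run_id = (is_nan != is_nan.shift()).cumsum()
    # Length of the run each sample belongs to
    run_len = is_nan.groupby(run_id).transform("size")
    filled = series.interpolate(method=method, limit_area="inside")
    # Discard interpolated values that fall inside gaps longer than max_gap
    too_long = is_nan & (run_len > max_gap)
    return filled.mask(too_long)

s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 7.0])
out = interpolate_short_gaps(s, max_gap=2)
```

In this toy series the one-day gap is filled, while the three-day gap stays missing; `s.interpolate(limit=2)` would instead fill the first two days of the long gap.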

### Aquifer Doganella

In [None]:
plotMissingValues("Aquifer","Doganella")
plot_WaterbodyVariable("Aquifer","volume",waterbody_name="Doganella",date_interval=("2019-09-01","2019-10-31"))#,marker=".",linestyle="",alpha=0.5)
plot_VarDistribution("Aquifer","Doganella","volume",stacked=True)

For "Doganella", it looks like there are periods when the wells get depleted (e.g. October 2019).
Otherwise, there are no particular observations to make.
Only sporadic missing data can be fixed; data prior to the second half of 2016 cannot be reconstructed.

In [None]:
data = df["Aquifer"]["Doganella"].copy()
for pozzoname, pozzodata in data.loc[:,data.columns.str.startswith("volume")].items():  # .iteritems() was removed in pandas 2.0
     df["Aquifer"]["Doganella"][pozzoname] = pozzodata.interpolate(method='akima',limit=21)

### Aquifer Auser

In [None]:
plotMissingValues("Aquifer","Auser")
plot_WaterbodyVariable("Aquifer","volume",waterbody_name="Auser")#,date_interval=("2013-12-15","2014-01-15"))#,marker=".",linestyle="",alpha=0.5)
plot_VarDistribution("Aquifer","Auser","volume",stacked=True)

The wells always have non-zero values, apart from two wells whose data start being recorded only in January 2014. 
Therefore, we need to convert the zero values of these two wells to "not available".

In [None]:
variable = getIndependentVariables("Aquifer","Auser","volume")
data_sel = df["Aquifer"]["Auser"][["volume_csal","volume_csa"]].copy()
data_sel.where(data_sel!=0.,inplace=True)
df["Aquifer"]["Auser"][["volume_csal","volume_csa"]] = data_sel
plot_WaterbodyVariable("Aquifer","volume",waterbody_name="Auser",date_interval=("2013-12-15","2014-01-15"))#,marker=".",linestyle="",alpha=0.5)

### Aquifer Luco

In [None]:
plotMissingValues("Aquifer","Luco")
plot_WaterbodyVariable("Aquifer","volume",waterbody_name="Luco",date_interval=("2018-11-21","2018-11-30"))#,marker=".",linestyle="",alpha=0.5)
plot_VarDistribution("Aquifer","Luco","volume",stacked=True)

Between December 2016 and February 2017, the volume of well 4 unexpectedly drops close to 0, while normally it lies between those of wells 1 and 2. However, we have no information to establish that these values are wrong.
The data seem to have a weekly pattern. Since the volume of water taken is constant throughout the week (apart from a minimum on Sunday), we might conclude that the volume is artificially controlled.

## Depth to Groundwater

This is the target variable, so let's see how it changes. First we need to take absolute values to make the measurements uniform across waterbodies.


In [None]:
for ii, info in data_descr.iterrows():
    w_type = info["type"]
    w_name = info["name"]
    variables = getTargetVariables(w_type,w_name)
    for vv in variables:
        df[w_type][w_name][vv] = np.abs(df[w_type][w_name][vv]) 

In [None]:
plotMissingValues("Aquifer","Auser")
plot_WaterbodyVariable("Aquifer","depth",waterbody_name="Auser")#,date_interval=("2020-06-01","2020-09-30"),marker=".",linestyle="",alpha=0.5)
plot_VarDistribution("Aquifer","Auser","depth",stacked=False)

In [None]:
plot_WaterbodyVariable("Aquifer","depth",waterbody_name="Auser",date_interval=("2020-06-01","2020-09-30"))#,marker=".",linestyle="",alpha=0.5)

In June 2020 there are strong daily spikes that look quite unrealistic, since the depth drops from a few meters to 0 and then abruptly goes back to a few meters. Such values look more like data errors than real measurements.
In order to clean up the data, we convert these "zero" sensor readings to "missing".

In [None]:
variable = getIndependentVariables("Aquifer","Auser","depth")
data_sel = df["Aquifer"]["Auser"][["depth_to_groundwater_lt2","depth_to_groundwater_sal","depth_to_groundwater_cos"]].copy()
data_sel.where(data_sel!=0.,inplace=True)
df["Aquifer"]["Auser"][["depth_to_groundwater_lt2","depth_to_groundwater_sal","depth_to_groundwater_cos"]] = data_sel
plot_WaterbodyVariable("Aquifer","depth",waterbody_name="Auser",date_interval=("2020-06-01","2020-09-30"))#,marker=".",linestyle="",alpha=0.5)

Missing values are not sporadic but span longer periods. So, in order to predict missing values from historical observations, we need to use established auto-regressive techniques. 

The first step is to de-seasonalize the data. So, let's get a data interval sufficiently long where there are no missing values.

In [None]:
def getLongestValidRange(data,minlen):
    """
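    Returns a list of sub-series, each containing a run of consecutive non-missing values longer than minlen.
    Note: runs are delimited by missing values, so a run that follows the last missing value is not considered.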
    """
    
    chunks = []
    series = data.copy()
    while True:
        null_ii= np.where(series.isnull())[0] # print(null_ii)
        if len(null_ii)==0:
            if len(series)>minlen:
                chunks.append(series)
            return chunks        
        
        #ii_diff = np.diff(null_ii) # difference between consecutive indices of null values
        null_ii = np.insert(null_ii, 0,-1)
        ii_diff = np.diff(null_ii) # difference between consecutive indices of null values
        
        mm = np.argmax(ii_diff) # index of the maximum value of the differences
        i_start = null_ii[mm]+1
        i_end = null_ii[mm+1]#-1 # print("start: %s, end: %s"%(i_start,i_end))
        chunk = series.iloc[i_start:i_end].copy()
        if len(chunk)>minlen:
            chunks.append(chunk)
            series.iloc[i_start:i_end+1] = np.nan
        else:
            return chunks
results = getLongestValidRange(df["Aquifer"]["Auser"]["depth_to_groundwater_lt2"].copy(),365*2)
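
As a cross-check of the idea, the same search for long runs without missing values can be sketched with a `groupby` on run identifiers. This is an alternative formulation under stated assumptions, not the notebook's function: the name `valid_runs` is hypothetical, and unlike the loop above it also considers a run located after the last missing value.

```python
import numpy as np
import pandas as pd

def valid_runs(series, minlen):
    """Return all runs of consecutive non-missing values longer than `minlen`, longest first."""
    valid = series.notna()
    run_id = (valid != valid.shift()).cumsum()  # new id whenever validity flips
    runs = [chunk for _, chunk in series[valid].groupby(run_id[valid])]
    runs = [chunk for chunk in runs if len(chunk) > minlen]
    return sorted(runs, key=len, reverse=True)

s = pd.Series([1.0, 2.0, np.nan, 3.0, 4.0, 5.0, np.nan, 6.0])
runs = valid_runs(s, minlen=1)
```

Each returned chunk keeps the original index, so it can be used to look up the matching dates in the source dataframe.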

In [None]:
nrows = len(results)
fig, axs = plt.subplots(nrows=nrows, ncols=1,figsize=(20,nrows*5),sharex=False)
for ii,result in enumerate(results):
    df["Aquifer"]["Auser"].loc[result.index,["date","depth_to_groundwater_lt2"]].plot(x="date",grid=True,ax = axs[ii])

So, these are the two chunks spanning more than two years without missing values. They can be used to check for seasonality. There is a yearly maximum in October and a yearly minimum in April. 

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose

def seasonalDecompose(observations,dates, period, plot=True,model="additive",extrapolate_trend=0,two_sided=True):
    frame = pd.DataFrame()
    if len(observations)>=2*period:
        res =  seasonal_decompose(observations, period=period,model=model,two_sided=two_sided,extrapolate_trend=extrapolate_trend)#int(period/100) )#=="freq") # returns [observed, trend, seasonal, resid]
        frame = pd.concat([dates, res.observed,res.trend,res.seasonal,res.resid],axis=1)
        if plot:
            fig, ax = plt.subplots(nrows=4, ncols=1,figsize=(15,12))
            frame.plot(x="date",y=frame.columns[1],ax=ax[0])
            frame.plot(x="date",y="trend",ax=ax[1])
            frame.plot(x="date",y="seasonal",ax=ax[2])
            frame.plot(x="date",y="resid",marker=".",linestyle="",alpha=0.5,ax=ax[3])
            plt.tight_layout()
    else:
        print("There are only %d observations, but for a period of %d we need at least %d observations" % (len(observations), period,2*period)  )
    return frame

period=365
dates = df["Aquifer"]["Auser"].loc[results[0].index,"date"]
res = seasonalDecompose(results[0], dates, period, plot=True)
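
To make clear what an additive decomposition computes, here is a minimal sketch of the classical procedure using only pandas: a centered moving average as trend, and the average detrended value per position in the cycle as seasonal component. It is a simplification that assumes an odd `period` (such as 365); `additive_decompose` is a hypothetical helper, not the statsmodels implementation.

```python
import numpy as np
import pandas as pd

def additive_decompose(x, period):
    """Split x into trend + seasonal + residual (classical additive scheme, odd period)."""
    x = pd.Series(np.asarray(x, dtype=float))
    trend = x.rolling(window=period, center=True).mean()  # NaN at both edges
    detrended = x - trend
    phase = np.arange(len(x)) % period                     # position within the cycle
    seasonal_means = detrended.groupby(phase).mean()       # NaNs are skipped
    seasonal_means = seasonal_means - seasonal_means.mean()  # center on zero
    seasonal = pd.Series(seasonal_means.loc[phase].to_numpy(), index=x.index)
    resid = x - trend - seasonal
    return trend, seasonal, resid

# Synthetic check: linear trend plus a sine with a 7-sample period
t = np.arange(70)
true_seasonal = np.sin(2 * np.pi * t / 7)
trend, seasonal, resid = additive_decompose(0.1 * t + true_seasonal, period=7)
```

On this noise-free series the seasonal component is recovered exactly and the residuals vanish, which is the behavior expected from real data only approximately.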

In [None]:
def test_stationarity(timeseries,window=12):
    """
    Function that visually tests for stationarity of the time series. 
    It also performs "Augmented Dickey-Fuller Test", which is used to assess whether or not the time-series is stationary. 
    """
    
    # Calculate rolling statistics
    rolmean = timeseries.rolling(window=window).mean()
    rolstd = timeseries.rolling(window=window).std()

    # Plot rolling statistics:
    plt.figure(figsize=(14,3))
    plt.plot(timeseries, color='blue',label='Original')
    plt.plot(rolmean, color='red', label='Rolling Mean')
    plt.plot(rolstd, color='black', label = 'Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show()
    
    # Perform Dickey-Fuller test:
    from statsmodels.tsa.stattools import adfuller
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print(dfoutput)

series = res.resid.dropna()
test_stationarity(series,window=period)

Residuals are stationary (the test statistic is lower than the critical value). This means that such data can be used to predict the future. So, let's first remove the seasonal component.

In [None]:
def getSeasonality(series, samplePeriod) :
    """
    From a long series (type pandas.Series), it returns seasonality described by the samplePeriod series (type pandas.Series).
    The index of the samplePeriod matches the index of the series.
    
    Example:
    a = pd.Series(np.array([1,2.02,3,4,69.1,21,34.21,4,23,2,32,21,63,54,45.8,53]),index=range(16))
    b =  a.loc[3:9]
    getSeasonality(a, b) 
    """
    
    max1 = samplePeriod.index[-1]
    max2 = series.index[-1]#["depth_to_groundwater_lt2"]
    min1 = samplePeriod.index[0]
    min2 = series.index[0]

    nSamples = len(samplePeriod)#len(res.seasonal.index)
    nForwardSamples = max2-max1
    nForward = int(np.ceil((nForwardSamples+nSamples)/nSamples))

    forwardSamples = np.tile(samplePeriod,nForward)[0:nForwardSamples+nSamples]

    nBackwardSamples = min1-min2
    nBackward = int(np.ceil((nBackwardSamples)/nSamples))
    backwardSamples = np.tile(samplePeriod,nBackward)[-nBackwardSamples:]

    seasonal = np.concatenate([backwardSamples,forwardSamples])
    assert len(seasonal)==len(series), "lengths are different, something is wrong"

    return seasonal

    #plt.plot(a.loc[4800:5000])#.loc[min1:max1]);
    #plt.plot(deseasonal.loc[4800:5000])

    
seasonal = getSeasonality(df["Aquifer"]["Auser"]["depth_to_groundwater_lt2"].copy(), res.seasonal[0:period])
deseasonal365 = df["Aquifer"]["Auser"]["depth_to_groundwater_lt2"] - seasonal
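
The tiling performed by `getSeasonality` can also be expressed with modular arithmetic. The sketch below is a compact equivalent under the assumption that the series has a plain integer position index; `tile_seasonal` and `start_pos` (the position where the sample period begins) are illustrative names.

```python
import numpy as np

def tile_seasonal(pattern, n_total, start_pos):
    """Repeat `pattern` over n_total samples so that pattern[0] lands on position start_pos."""
    pattern = np.asarray(pattern)
    phase = (np.arange(n_total) - start_pos) % len(pattern)
    return pattern[phase]

# A 3-sample pattern that was estimated starting at position 2 of an 8-sample series
tiled = tile_seasonal([10, 20, 30], n_total=8, start_pos=2)
```

Positions before `start_pos` receive the tail of the pattern, which matches the backward extension done in the function above.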

Now we can interpolate the **deseasonalized** data. We use the "akima" spline method, which reduces wiggles given the variability of the data. We interpolate over no more than 3 weeks, in order to keep some degree of confidence.

In [None]:
fig = plt.figure(figsize=(15,4))
ax = plt.gca()

original = pd.concat([df["Aquifer"]["Auser"]["date"], deseasonal365], axis=1)
original[deseasonal365.name] = original[deseasonal365.name].rolling(8).mean() # smoothing
interpolated = original.interpolate(method='akima')
interpolated.plot(x="date",style="k",ax=ax,alpha=0.5)
original.plot(x="date",ax=ax)

Finally, we check the stationarity of the residuals, to be sure that the deseasonalization makes sense.

In [None]:
period=365
results = getLongestValidRange(df["Aquifer"]["Auser"]["depth_to_groundwater_lt2"].copy(),period*2)
dates = df["Aquifer"]["Auser"].loc[res.seasonal.index,"date"]
res = seasonalDecompose(results[0], dates, period, plot=True)
series = res.resid.dropna()
test_stationarity(series,window=period)

The test statistic is lower than the critical value at 1%, so the residuals are stationary and we can proceed.

In [None]:
df["Aquifer"]["Auser"]["depth_to_groundwater_lt2"] = interpolated[deseasonal365.name]+seasonal
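
The reason for interpolating the deseasonalized series rather than the raw one can be illustrated with a toy example. It assumes a known seasonal component and uses linear interpolation instead of Akima to stay dependency-free; all numbers are synthetic.

```python
import numpy as np
import pandas as pd

t = np.arange(12)
seasonal = np.tile([0.0, 1.0, 0.0, -1.0], 3)   # known seasonal component (period 4)
truth = pd.Series(0.5 * t + seasonal)          # slow trend plus seasonality
observed = truth.copy()
observed.iloc[5:7] = np.nan                    # a two-sample gap

# Interpolating the raw series ignores the seasonal swing inside the gap
naive = observed.interpolate(method="linear")

# Remove seasonality, interpolate the smooth remainder, add seasonality back
deseasonal = observed - seasonal
filled = deseasonal.interpolate(method="linear") + seasonal
```

Here `filled` reproduces the true values inside the gap, while `naive` misses the seasonal peak; with real data the gain depends on how well the seasonal component is estimated.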

Now, let's move on to the next variable: `depth_to_groundwater_sal`

In [None]:
seasonal = getSeasonality(df["Aquifer"]["Auser"]["depth_to_groundwater_sal"], res.seasonal[0:period])
deseasonal365 = df["Aquifer"]["Auser"]["depth_to_groundwater_sal"] - seasonal
fig = plt.figure(figsize=(15,4))
ax = plt.gca()
original = pd.concat([df["Aquifer"]["Auser"]["date"], deseasonal365], axis=1)
original[deseasonal365.name] = original[deseasonal365.name].rolling(8).mean() # smoothing
interpolated = original.interpolate(method='akima')#,limit=32)
interpolated.plot(x="date",style="k",ax=ax,alpha=0.5)
original.plot(x="date",ax=ax)

Finally we add back the seasonal factors

In [None]:
df["Aquifer"]["Auser"]["depth_to_groundwater_sal"] = interpolated[deseasonal365.name]+seasonal 


We repeat for `depth_to_groundwater_cos`

In [None]:
period=365
df["Aquifer"]["Auser"].plot(x="date",y="depth_to_groundwater_cos")

series = df["Aquifer"]["Auser"]["depth_to_groundwater_cos"]
interpolated = series.interpolate(method='akima',limit=7) # we first need to interpolate because we do not have a chunk sufficiently long
results = getLongestValidRange(interpolated,period*2)

In [None]:
dates = df["Aquifer"]["Auser"].loc[results[0].index,"date"]
res = seasonalDecompose(results[0], dates, period, plot=True)
series = res.resid.dropna()
test_stationarity(series,window=period)

In [None]:
seasonal = getSeasonality(df["Aquifer"]["Auser"]["depth_to_groundwater_cos"], res.seasonal[0:period])
deseasonal365 = df["Aquifer"]["Auser"]["depth_to_groundwater_cos"] - seasonal
fig = plt.figure(figsize=(15,4))
ax = plt.gca()
original = pd.concat([df["Aquifer"]["Auser"]["date"], deseasonal365], axis=1)
original[deseasonal365.name] = original[deseasonal365.name].rolling(7).mean() # smoothing
interpolated = original.interpolate(method='akima')
interpolated.plot(x="date",style="k",ax=ax,alpha=0.5)
original.plot(x="date",ax=ax)

In [None]:
df["Aquifer"]["Auser"]["depth_to_groundwater_cos"] = interpolated[deseasonal365.name]+seasonal 

### Aquifer Petrignano

In [None]:
plotMissingValues("Aquifer","Petrignano")
plot_WaterbodyVariable("Aquifer","depth",waterbody_name="Petrignano")#,date_interval=("2020-06-01","2020-09-30"),marker=".",linestyle="",alpha=0.5)
plot_VarDistribution("Aquifer","Petrignano","depth",stacked=False)

Here we see that the two target variables are correlated with each other.
So, in order to fill the missing values, we use the substitution method (already used for temperature and rainfall), which consists of substituting missing values of one variable with non-missing values of the other variable (properly normalized), and vice versa. 

In [None]:
fixMissingValues("depth_to_groundwater_p24","depth_to_groundwater_p25");
fixMissingValues("depth_to_groundwater_p25","depth_to_groundwater_p24");
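
For readers who skipped the earlier definition, a possible reading of "properly normalized" substitution is sketched below: the donor variable is standardized and rescaled to the mean and standard deviation of the target. This is an assumption for illustration; the notebook's `fixMissingValues` may normalize differently, and `fill_from_correlated` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def fill_from_correlated(target, donor):
    """Fill NaNs in `target` with `donor` values rescaled to the target's mean and std."""
    z = (donor - donor.mean()) / donor.std()       # standardized donor
    estimate = z * target.std() + target.mean()    # mapped onto the target's scale
    return target.fillna(estimate)

donor = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
target = pd.Series([12.0, 14.0, np.nan, 18.0, 20.0])  # 2*donor + 10 with one gap
filled = fill_from_correlated(target, donor)
```

Where both variables are missing on the same day the value stays missing, which is why a final interpolation step is still needed.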

Finally, let's use interpolation to fill remaining missing values. 

In [None]:
series = df["Aquifer"]["Petrignano"]["depth_to_groundwater_p25"].copy()
interpolated = series.interpolate(method='akima',limit=7) # we first need to interpolate because we do not have a chunk sufficiently long
df["Aquifer"]["Petrignano"]["depth_to_groundwater_p25"] = interpolated

series = df["Aquifer"]["Petrignano"]["depth_to_groundwater_p24"].copy()
interpolated = series.interpolate(method='akima',limit=7) # we first need to interpolate because we do not have a chunk sufficiently long
df["Aquifer"]["Petrignano"]["depth_to_groundwater_p24"] = interpolated

plot_WaterbodyVariable("Aquifer","depth",waterbody_name="Petrignano")#,date_interval=("2020-06-01","2020-09-30"),marker=".",linestyle="",alpha=0.5)


### Aquifer Luco

In [None]:
plotMissingValues("Aquifer","Luco")
plot_WaterbodyVariable("Aquifer","depth",waterbody_name="Luco",date_interval=("2014-01-01","2015-01-01"))#,marker=".",linestyle="",alpha=0.5)
plot_VarDistribution("Aquifer","Luco","depth",stacked=False)

The target variable `depth_to_groundwater_podere_casetta` has missing values during the most recent period (2020). 
Since we do want to use this interval as test data, let's create **synthetic** target values for this interval by using auto-regression.

In [None]:
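try:
    from prophet import Prophet  # package name since Prophet 1.0
except ImportError:
    from fbprophet import Prophet  # older releases
##prophet requires a pandas df in the configuration below (date column named 'ds' and value column named 'y')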
#ts.columns=['ds','y']
#model = Prophet( yearly_seasonality=True) #instantiate Prophet with only yearly seasonality as our data is monthly 
#model.fit(ts) #fit the model with your dataframe


df_sub = df["Aquifer"]["Luco"][["date","depth_to_groundwater_podere_casetta"]].copy()
df_sub.columns=['ds','y']
model = Prophet( yearly_seasonality=True) #instantiate Prophet with yearly seasonality enabled
model.fit(df_sub) #fit the model with your dataframe
future = model.make_future_dataframe(periods=0)
forecast = model.predict(future)

fig1 = model.plot(forecast)

Seasonality changes over time, so we need to fix the dates of seasonality changes by visual inspection. 
This should improve the quality of the prediction considering the available values.

In [None]:
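try:
    from prophet.plot import add_changepoints_to_plot  # package name since Prophet 1.0
except ImportError:
    from fbprophet.plot import add_changepoints_to_plot  # older releases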
newchangepoints = df_sub.loc[df_sub["ds"].isin(['2017-06-01','2018-01-01']),"ds"]
changepoints = pd.concat([model.changepoints,newchangepoints],sort=False)
model = Prophet(changepoints=changepoints, yearly_seasonality=True,changepoint_prior_scale=0.8, interval_width = 0.95)
forecast = model.fit(df_sub).predict(future)
fig2 = model.plot(forecast)
a = add_changepoints_to_plot(fig2.gca(), model, forecast)

In [None]:
ax = plt.figure(figsize=(15,5)).gca()
data = df["Aquifer"]["Luco"][["date","depth_to_groundwater_podere_casetta"]].copy()
data.plot(x="date",y="depth_to_groundwater_podere_casetta",style="b.",ax=ax)
locwhere = data["depth_to_groundwater_podere_casetta"].isnull()
data.loc[locwhere,"depth_to_groundwater_podere_casetta"] = forecast.loc[locwhere,"yhat"].values
data.plot(x="date",y="depth_to_groundwater_podere_casetta",style="r-",ax=ax,grid=True)

In [None]:
df["Aquifer"]["Luco"]["depth_to_groundwater_podere_casetta"] = data["depth_to_groundwater_podere_casetta"].values

## Flow Rate

In [None]:
plotMissingValues("Water_Spring","Madonna_di_Canneto")
plot_WaterbodyVariable("Water_Spring","flow",waterbody_name="Madonna_di_Canneto")#,date_interval=("2020-06-01","2020-09-30"),marker=".",linestyle="",alpha=0.5)
plot_VarDistribution("Water_Spring","Madonna_di_Canneto","flow",stacked=False)

In [None]:
null_ii = np.where(df["Water_Spring"]["Madonna_di_Canneto"]["date"].isnull())[0]
df["Water_Spring"]["Madonna_di_Canneto"].drop(index=null_ii,inplace=True)

There are many missing values and the distribution is very skewed, with two distinct peaks. 
Moreover, there is no visible seasonality. 
There are many one-day spikes, which might indicate either data errors or real behavior (for example, the flow rate might be controlled by a pump which shuts down for one day). However, the instructors said that they provided raw data which need to be cleaned up, because they come from sensors located in the field, which are exposed to the weather and therefore subject to malfunction.

In [None]:
plot_WaterbodyVariable("Water_Spring","flow",waterbody_name="Madonna_di_Canneto")#,date_interval=("2017-06-01","2017-12-31"))

In [None]:
data = df["Water_Spring"]["Madonna_di_Canneto"].copy()
data["smoothed"] = data["flow_rate_madonna_di_canneto"]
data["smoothed"] = data["smoothed"].rolling(window=3).median() # a median of 3 days removes one-day spikes
data["interpolated"] = data["smoothed"].interpolate(method='akima')
data.loc[1000:1330,:].plot(x="date",y=["flow_rate_madonna_di_canneto","interpolated"],style=[".","-"],grid=True,figsize=(15,5))#,legend=False)

df["Water_Spring"]["Madonna_di_Canneto"]["flow_rate_madonna_di_canneto"] = data["interpolated"].values
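
The effect of the 3-day rolling median can be checked on a toy series. The sketch also shows a detail worth knowing: pandas uses a trailing window by default, so the smoothed series is delayed by one day, whereas `center=True` keeps it aligned. The numbers are synthetic.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 0.0, 5.0, 6.0, 7.0])  # one-day drop to 0 at position 3

trailing = s.rolling(window=3).median()               # window covers t-2 .. t
centered = s.rolling(window=3, center=True).median()  # window covers t-1 .. t+1
```

Both versions remove the one-day spike; the centered one places the smoothed values on the days they describe, at the cost of a missing value at each end.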

In [None]:
plot_WaterbodyVariable("Water_Spring","flow",waterbody_name="Madonna_di_Canneto")

### Water Spring Lupa

In [None]:
plotMissingValues("Water_Spring","Lupa")
plot_WaterbodyVariable("Water_Spring","flow",waterbody_name="Lupa")#,date_interval=("2020-06-01","2020-09-30"),marker=".",linestyle="",alpha=0.5)
plot_VarDistribution("Water_Spring","Lupa","flow",stacked=False)

From the time series, it seems the flow rate measure is more stable than for the other waterbodies. An abrupt spike down to 0 is visible, which is most likely a sensor malfunction.  

There are also occasional missing values, which can be easily fixed by interpolation.

In [None]:
data = df["Water_Spring"]["Lupa"].copy()
data["fixed"] = data["flow_rate_lupa"]
data.loc[data["fixed"]==0,"fixed"] = np.nan

data["interpolated"] = data["fixed"].interpolate(method='akima')
data.plot(x="date",y=["flow_rate_lupa","interpolated"],style=[".","-"],grid=True,figsize=(15,5))#,legend=False)

df["Water_Spring"]["Lupa"]["flow_rate_lupa"] = data["interpolated"].values

### Lake Bilancino

In [None]:
plotMissingValues("Lake","Bilancino")
plot_WaterbodyVariable("Lake","flow",waterbody_name="Bilancino")#,date_interval=("2020-06-01","2020-09-30"),marker=".",linestyle="",alpha=0.5)
plot_VarDistribution("Lake","Bilancino","flow",stacked=False)

There seems to be a minimum flow rate of a few liters per second; however, there are some spikes during the year which look seasonal.

In [None]:
data = df["Lake"]["Bilancino"].copy()
data["interpolated"] = data["flow_rate"]
#data["smoothed"] = data["smoothed"].rolling(window=3).median() # a median of 3 days removes one-day spikes
data["interpolated"] = data["interpolated"].interpolate(method='akima')
data.plot(x="date",y=["interpolated","flow_rate"],style=["-","-"],grid=True,figsize=(15,5))#,legend=False)

df["Lake"]["Bilancino"]["flow_rate"] = data["interpolated"].values

## Hydrometry
### River Arno

In [None]:
plotMissingValues("River","Arno")
plot_WaterbodyVariable("River","hydrometry",waterbody_name="Arno")#,date_interval=("2020-06-01","2020-09-30"),marker=".",linestyle="",alpha=0.5)
plot_VarDistribution("River","Arno","hydrometry",stacked=False)

The data follow a clear distribution (*gamma*), which suggests that there are no major outliers.  
There are some zero sensor readings to correct and some missing values to interpolate.

In [None]:
data = df["River"]["Arno"].copy()
data["fixed"] = data["hydrometry_nave_di_rosano"]
data.loc[data["fixed"]==0,"fixed"] = np.nan

data["fixed"] = data["fixed"].interpolate(method='akima',limit=5)
data.loc[3000:4500,:].plot(x="date",y=["hydrometry_nave_di_rosano","fixed"],style=[".","-"],grid=True,figsize=(15,5))#,legend=False)

Values for the second half of 2008 are unavailable, so we use the values from the same period of the previous year, as the first halves of the two years look much the same.

In [None]:
data = df["River"]["Arno"].copy()
start_date08 = datetime.strptime("2008-07-01", '%Y-%m-%d').date() # datetime.strptime("2015-06-01", '%Y-%m-%d').date()
end_date08 = datetime.strptime("2008-12-31", '%Y-%m-%d').date()#datetime.strptime("2016-03-07", '%Y-%m-%d').date()
start_date07 = datetime.strptime("2007-07-01", '%Y-%m-%d').date() # datetime.strptime("2015-06-01", '%Y-%m-%d').date()
end_date07 = datetime.strptime("2007-12-31", '%Y-%m-%d').date()#datetime.strptime("2016-03-07", '%Y-%m-%d').date()

data07 = data.loc[((data["date"].dt.date>start_date07) & (data["date"].dt.date<end_date07)),"hydrometry_nave_di_rosano"]
data.loc[((data["date"].dt.date>start_date08) & (data["date"].dt.date<end_date08)),"hydrometry_nave_di_rosano"] = data07.values
data.loc[3000:4500,:].plot(x="date",y="hydrometry_nave_di_rosano",grid=True,figsize=(15,5))

df["River"]["Arno"]["hydrometry_nave_di_rosano"] = data["hydrometry_nave_di_rosano"].values

## Lake Level
### Lake Bilancino

In [None]:
plotMissingValues("Lake","Bilancino")
plot_WaterbodyVariable("Lake","lake_level",waterbody_name="Bilancino")#,date_interval=("2020-06-01","2020-09-30"),marker=".",linestyle="",alpha=0.5)
plot_VarDistribution("Lake","Bilancino","lake_level",stacked=False)

# Machine Learning for Prediction

#### Data splitting and cross-validation
For each model we create train/validation/test splits. However, time series modeling is different from time-independent supervised learning. There are intrinsic interrelationships between the data points measured across time. This means that during training and testing, the temporal structure of the series needs to be maintained. Shuffling the data from different dates causes bias. The **rolling-window** (or walk-forward) validation is best suited for time series-based forecasting because it avoids such biases and facilitates updating the models as new data comes in.

- **training** set will be used to select the best combination of parameters for a given algorithm and set of features. Thus, different hyperparameters will be chosen for the same algorithm in order to come up with the best model for the given algorithm. Within the training-set the rolling window method is used to select the best model.
- **validation** set will be used to test models from different algorithms against each other and select a winner for each prediction time in the future. Before validating the data, the model is retrained on all the previous data (in order to reduce model variance, as described here https://stats.stackexchange.com/questions/395645/does-retraining-a-model-on-all-available-data-necessarily-yield-a-better-model).
- **test** set will be used to finally get the final score for the preselected model.


On the training set, the best combination of parameters for a given algorithm is found using **blocked cross-validation**, which consists of leaving a gap between each training fold and its corresponding validation fold, as well as between different training folds. The approach is described in this paper: https://www.sciencedirect.com/science/article/abs/pii/S0304407600000300. This procedure returns a more reliable outcome because it avoids **data leakage** between different sets and different folds. In other words, it increases the independence of folds/sets. Thus, in the training set there is a series of training and validation folds, typically 5. The training prediction accuracy for a model is computed by averaging the scores obtained from training the model on each training fold and running it on the corresponding validation fold. 
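
The fold layout can be made concrete with a small sketch that mirrors the arithmetic of the `BlockingTimeSeriesSplit` class defined later in this notebook. The generator name `blocked_folds` and the toy sizes are illustrative; the ratios were chosen so that the resulting indices are easy to verify by hand.

```python
import numpy as np

def blocked_folds(n_samples, n_splits, train_ratio, gap_ratio):
    """Yield (train_idx, val_idx) for contiguous blocks, with a gap before each validation part."""
    fold_size = n_samples // n_splits
    indices = np.arange(n_samples)
    for i in range(n_splits):
        start = i * fold_size
        stop = start + fold_size
        mid = start + int(train_ratio * fold_size)   # end of the training part
        gap = int(gap_ratio * fold_size)             # samples skipped to limit leakage
        yield indices[start:mid], indices[mid + gap:stop]

folds = list(blocked_folds(n_samples=40, n_splits=2, train_ratio=0.5, gap_ratio=0.25))
```

With 40 samples and 2 folds, the first fold trains on positions 0 to 9 and validates on 15 to 19, so positions 10 to 14 act as the gap; the second fold repeats the layout on positions 20 to 39.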


#### Independent variables

As explained by the organizer in the competition forum, in the example below the first of the two options is preferred although the second option can also be tried.

Considering the aim of predicting the target variable at (t+1): 

**Option 1 (Future observations unknown)** - the information available for predicting the target at (t+1) is:
target variable at times (t), (t-1), (t-2), etc…
other variables (rainfall etc.) at times (t), (t-1), (t-2), etc… 

**Option 2 (Future observations known)** - the information available for predicting the target at (t+1) is: 
target variable at times (t), (t-1), (t-2), etc…
other variables (rainfall etc.) at times (t+1), (t), (t-1), (t-2), etc…

In other words, although it is acceptable to use data for the other variables (temperature, rainfall etc.) at time (t+1), the business needs first and foremost to rely on current/past observations, because the objective is **pure forecasting**.
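
The difference between the two options comes down to which columns enter the feature matrix. The sketch below builds both variants for a one-step-ahead target on a toy frame; `build_features`, the column names `depth` and `rain`, and the naming scheme of the lagged columns are assumptions made for illustration.

```python
import pandas as pd

def build_features(frame, target, exog, n_lags, future_known=False):
    """Return (X, y) for predicting target at t+1 from values at t, t-1, ..., t-n_lags+1."""
    feats = {}
    for lag in range(n_lags):
        feats[f"{target}_t-{lag}"] = frame[target].shift(lag)
        feats[f"{exog}_t-{lag}"] = frame[exog].shift(lag)
    if future_known:
        # Option 2 only: the exogenous variable at t+1 is treated as known
        feats[f"{exog}_t+1"] = frame[exog].shift(-1)
    X = pd.DataFrame(feats)
    y = frame[target].shift(-1).rename(f"{target}_t+1")
    keep = X.notna().all(axis=1) & y.notna()   # drop rows made incomplete by shifting
    return X[keep], y[keep]

frame = pd.DataFrame({"depth": [1, 2, 3, 4, 5], "rain": [10, 20, 30, 40, 50]})
X1, y1 = build_features(frame, "depth", "rain", n_lags=2)                     # Option 1
X2, y2 = build_features(frame, "depth", "rain", n_lags=2, future_known=True)  # Option 2
```

Option 2 differs from Option 1 only by the extra `rain_t+1` column, which is exactly the information that would not be available in a pure forecasting setting.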


#### Target variables

In order to ensure target variables are consistent, we ensure all targets are variations rather than absolute quantities. For example, "depth to groundwater" is converted to "daily change of depth".  
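
As a small illustration, the conversion to variations and back can be written in two lines. The sketch assumes the variation is taken with respect to the value on the day the forecast is issued, as in the `as_variation` option of `build_multi_output`; the depths are made-up numbers.

```python
import numpy as np
import pandas as pd

depth = pd.Series([10.0, 10.5, 10.2, 10.8])
h = 2                                     # forecast horizon in days

variation = depth.shift(-h) - depth       # change between day t and day t+h
recovered = depth + variation             # absolute depth at day t+h
```

A model trained on `variation` therefore still yields absolute depths: the predicted change is simply added to the last observed value.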


#### Prediction intervals
The required prediction is of "multi-step" kind, meaning that multiple predictions, each for a different time interval in the future, will be made. The performance of the model will be evaluated, for each prediction interval, using MSE. For example, considering an interval of 15 days, we average the MSE for the 15 days to come up with the 15-days-interval's MSE. Then we compare all the intervals in order to verify up to which interval we can consider a forecast acceptable according to the mean MSE of each interval.
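
The interval score described above can be sketched with numpy. Rows are forecast origins and columns are steps ahead; the arrays are toy values, and the variable names are illustrative.

```python
import numpy as np

# Toy multi-step targets and predictions: 2 forecast origins, 3 steps ahead
y_true = np.array([[1.0, 2.0, 3.0],
                   [2.0, 3.0, 4.0]])
y_pred = np.array([[1.0, 1.0, 1.0],
                   [2.0, 4.0, 2.0]])

# MSE of each individual step ahead (+1 day, +2 days, +3 days)
mse_per_step = ((y_true - y_pred) ** 2).mean(axis=0)

# MSE of each interval: average of the per-step MSEs up to that step
n_steps = y_true.shape[1]
interval_mse = np.cumsum(mse_per_step) / np.arange(1, n_steps + 1)
```

Comparing the entries of `interval_mse` shows how quickly the error grows with the length of the interval, which is the criterion used to decide how far ahead a forecast remains acceptable.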

Broadly speaking, multi-step forecasts can be recursive or not. The *recursive multi-step* approach consists of a single model that uses both observations and predictions as input data. The *non-recursive* approach, on the other hand, does not use any future prediction but only current and past observations; therefore each prediction interval needs a different model.

As mentioned by the organizers, the multi-step forecast **must not** be recursive "*if you have data observations until 31.12.2020 (t) and you try to predict the “Depth_to_Groundwater” at date 30.01.2021 (t+1) you are allowed to use all data, also the Depth_to_Groundwater measured in November (t-1), October (t-2), September (t-3) etc. Please notice also that, if you are willing to predict Depth_to_Groundwater at date 28.02.2021(t+2), in order to understand how far the prediction can go, you can use exclusively the given data, this means you cannot include either your estimated “Depth_to_Groundwater” value at 30.01.2021 (t+1) or temperature/rainfall weather predictions for the month of February (t+2). We want models based exclusively on observed data, basically you cannot use inferred data to make further predictions.*"
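
A minimal sketch of the non-recursive (direct) strategy follows: one small least-squares model is fitted per horizon, and every forecast uses only the last observed value. The helper names and the toy series are assumptions for illustration; the models used later in the notebook are more elaborate.

```python
import numpy as np

def fit_direct_models(y, horizons):
    """Fit one least-squares model y[t+h] ~ a*y[t] + b per horizon h (direct strategy)."""
    y = np.asarray(y, dtype=float)
    models = {}
    for h in horizons:
        X = np.column_stack([y[:-h], np.ones(len(y) - h)])  # observed y[t] plus intercept
        coef, *_ = np.linalg.lstsq(X, y[h:], rcond=None)
        models[h] = coef
    return models

def predict_direct(models, y_last):
    """Predict every horizon from the last observation only; no prediction is fed back."""
    return {h: a * y_last + b for h, (a, b) in models.items()}

y = 0.5 ** np.arange(10)   # toy series: each value is half of the previous one
models = fit_direct_models(y, horizons=[1, 2, 3])
forecast = predict_direct(models, y_last=y[-1])
```

For this series the fitted slopes are 0.5, 0.25 and 0.125, i.e. each horizon learns its own mapping from the present value instead of chaining one-step predictions.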

In [None]:
from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV

In [None]:
class BlockingTimeSeriesSplit():
    def __init__(self, n_splits, train_ratio=0.7, gap_ratio=0.1):
        # train_ratio and gap_ratio are the fractions of each fold assigned to the training part and to the gap; the remainder is the validation part, so their sum must be < 1.0
        self.n_splits = n_splits
        assert train_ratio+gap_ratio<1
        self.train_ratio = train_ratio
        self.gap_ratio = gap_ratio

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits
    
    def split(self, X, y=None, groups=None):
        n_samples = len(X)
        k_fold_size = n_samples // self.n_splits
        indices = np.arange(n_samples) #X.index

        for i in range(self.n_splits):
            start = i * k_fold_size
            stop = start + k_fold_size
            mid = int(self.train_ratio * (stop - start)) + start
            gap = int(self.gap_ratio * (stop - start))
            yield indices[start: mid], indices[mid + gap: stop]
            
## function used to plot train/validation folds for the time series 
def plot_cv_indices(cv, X,ax, lw=20):
    """
    Create a sample plot for indices of a cross-validation object.
    https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html
    """

    # Generate the training/testing visualizations for each CV split
    for ii, (tr, tt) in enumerate(cv.split(X=X, y=None, groups=None)): # the splitter ignores y, so no global variable is needed
        # Fill in indices with the training/test groups
        indices = np.array([np.nan] * len(X))
        indices[tt] = 1
        indices[tr] = 0
    
        # Visualize the results
        cmap_cv = plt.cm.coolwarm
        ax.scatter(range(len(indices)), [ii + .5] * len(indices), c=indices, marker='_', lw=lw, cmap=cmap_cv,vmin=-.2, vmax=1.2)
        ax.set_yticklabels("")
        
    return ax



def plot_scores(searchers,ax=None,label=None):
    best_scores = np.empty(len(searchers))
    for ii,y_name in enumerate(searchers.keys()):
        best_scores[ii] = searchers[y_name].best_score_
    if ax is None:
        plt.plot(np.arange(len(searchers))+1,best_scores,marker=".",label=label)
        plt.xlabel("+days ("+"_".join(y_name.split("_")[:-1])+")" )
        plt.ylabel(searchers[y_name].refit)
        plt.grid()
        ax = plt.gca()
    else:
        ax.plot(np.arange(len(searchers))+1,best_scores,marker=".",label=label)
    ax.legend()
    return ax
    

def build_multi_output(y,pred_int,X=None,as_variation=False):
    """
    From a dataframe containing one or more output variables, it creates multiple lagged columns (pandas dataframe) to cover the desired prediction interval.
    `pred_int` is the length of the prediction interval.
    It returns the modified output y. If input matrix is also given, pred_int samples are removed from both input and output matrices in order to avoid missing values in the outputs due to lag.
    If `as_variation` is True, the lagged targets are given as differences with respect to the initial target, otherwise they are absolute.
    """

    assert isinstance(y, pd.DataFrame), "y must be of type dataframe"
    varnames = list(y.columns)
    ymulti = pd.DataFrame(index=y.index)
    
    for varname in varnames:
        for lag in np.arange(pred_int)+1:
            if as_variation:
                ymulti[varname+"_+%s"%str(lag).zfill(3)] = y[varname].shift(-lag) - y[varname]
            else:
                ymulti[varname+"_+%s"%str(lag).zfill(3)] = y[varname].shift(-lag)
    
    if X is not None:
        # remove last rows with missing values due to lags
        ymulti = ymulti.iloc[0:ymulti.shape[0]-pred_int,:]
        assert isinstance(X, pd.DataFrame), "X must be of type dataframe"
        Xmulti = X.loc[ymulti.index,:]
        return (ymulti,Xmulti)
    else:
        return ymulti

def fitMulti_GridSearchCV(X,y,gscv,X_y=None):
    """
    creates and fits a gridSearchCV object for each column of the target dataframe y.
    It returns a dictionary of fitted gridSearchCV objects, one for each column of the target dataframe y.
    If X_y is given, it attaches the relative features X_y to the input features X.
    The X_y features are selected by matching the part of their names after the last "_" with the same part of the name of the column y.
    """
    import copy
    
    searchers = {}
    mapper = {}
    for iii,(y_name,y_data) in enumerate(y.items()): # run a grid search for each target column, so that we can get a score for each optimal model (`iteritems` was removed in pandas 2.0)
        this_gscv = copy.deepcopy(gscv)
        mapper[iii] = y_name
        print("searching best model for class %s"%y_name,end="\r")
        searchers[y_name] = this_gscv
        if X_y is None:
            X_sel = X
        else:
            X_y_sel = X_y.loc[:,X_y.columns.str.endswith(y_name.split("_")[-1])]
            X_sel = pd.concat([X,X_y_sel],sort=False,axis=1)
            #print(list(X_sel.columns))
        searchers[y_name].fit(X_sel, y_data.values.ravel());
    if X_y is None:
        feature_names = list(X.columns)
    else:    
        feature_names = list(X.columns)+["_".join(fname.split("_")[:-1]) for fname in list(X_y_sel.columns)]
    return (searchers,mapper,feature_names)

## Setup
Define the future prediction interval and define where the data will be stored.

In [None]:
### parameters for splitting data

T_set = 365 # is the period for each validation and test set (in days), while the remaining data is used for training 
T_pred = 30 # maximum prediction interval in the future 

### Define where data will be stored
### Hierarchy: aquifer name > target name > prediction day

results_estimators = {} # save all the chosen estimators (one for each time)
results_RMSE = {} # store root mean squared error score
results_MAE = {} # store mean absolute error score
results_data = {} # store predicted vs true data

## Aquifer Petrignano

### Prepare data

In [None]:
wtype = "Aquifer"
wname = "Petrignano"

results_estimators[wtype+"_"+wname] = {} # save all the chosen estimators (one for each time)
results_RMSE[wtype+"_"+wname] = {} # store root mean squared error score
results_MAE[wtype+"_"+wname] = {} # store mean absolute error score
results_data[wtype+"_"+wname] = {} # store predicted vs true data

data = df[wtype][wname].copy()
data.reset_index(drop=True,inplace=True) # index starts from 0

targets = getTargetVariables(wtype,wname)
features = data.columns[~data.columns.isin(targets)]
X = data[features]
y = data[targets]
print("Input samples:")
display(X.head(3))
print("\nOutput samples:")
display(y.head(3))

### Create features and multiple targets
To improve the predictive power of the model, we do the following:
- Add the current value of the target as a feature.
- Since the target variable is autoregressive, add differences over its recent history (how the depth has changed with respect to previous days) as new features.
- Since we want to predict targets at different times in the future, create multiple targets: one for each future prediction day and each output variable.
- Express targets as differences rather than absolute values, so that each prediction is a variation with respect to the current value. For example, the target for day +5 is the depth in 5 days minus today's depth.
- Add seasonality features, since the target variables are seasonal on a yearly basis: the yearly seasonality of the target (calculated on the training set) and the target day of the year encoded cyclically (cosine/sine). These features are computed separately for each time-shifted instance of the target variable.
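
Two of the steps above, targets as variations and the cyclical day-of-year encoding, can be sketched on a toy series. The depth values below are made up for illustration, and `horizon`, `delta` and `reconstructed` are names introduced only for this example.

```python
import numpy as np
import pandas as pd

# Toy depth series (hypothetical values, for illustration only).
dates = pd.date_range("2020-01-01", periods=6, freq="D")
depth = pd.Series([-25.0, -25.1, -25.3, -25.2, -25.0, -24.9], index=dates)

# Target for horizon +2 expressed as a variation: depth(t+2) - depth(t).
horizon = 2
delta = depth.shift(-horizon) - depth  # the last `horizon` rows are NaN

# The absolute value is recovered by adding the variation back.
reconstructed = depth + delta

# Cyclical encoding of the target day of the year, rescaled to [0, 1].
doy = (dates + pd.Timedelta(days=horizon)).dayofyear.to_numpy()
day_cos = (np.cos(2 * np.pi * doy / 365.25) + 1) / 2
day_sin = (np.sin(2 * np.pi * doy / 365.25) + 1) / 2
```

The trailing rows without a future observation are the ones that `build_multi_output` drops from both the input and the target matrices.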

In [None]:
### CREATE Multiple targets
# y = days x ntargets*T_pred
# X = days x (nfeatures+ntargets)

input_vars = X.columns

### Add current targets as features
X = pd.concat([X,y],axis=1,sort=False)


### Add the change in the targets' rolling mean (with respect to `rollval` days earlier) and their rolling standard deviation as features
rollvals = [5,10,15,20,30] # rolling windows
for tt in targets:
    for rollval in rollvals:
        rollmean = X[tt].rolling(rollval,min_periods=1).mean()
        X.loc[:,tt+"_mean"+str(rollval).zfill(2)] = rollmean - rollmean.shift(rollval,fill_value=0) 
        X.loc[:,tt+"_std"+str(rollval).zfill(2)] = X[tt].rolling(rollval,min_periods=1).std().fillna(0).values  
     

### Add rolling averages of temperature, rainfall and volume
var_names = input_vars[input_vars.str.startswith(("temperature","rainfall","volume"))]
rollvals = [1,5,10,15,20,30] # rolling values
for var_name in var_names:    
    for rollval in rollvals:
        ## calculate rolling average
        X.loc[:,var_name+"_mean"+str(rollval).zfill(2)] = X.loc[:,var_name].rolling(rollval,min_periods=1).mean().values  

#### SORT COLUMNS by column names ###
X = X.reindex(sorted(X.columns), axis=1)

#####################  CREATE MULTI_OUTPUTS ##################################

# copy absolute values before it gets transformed
y_absolute = y.copy()

## Create multi-interval target predictions as variations with respect to current values
(y_multi, X_multi) = build_multi_output(y,T_pred,X,as_variation=True)




###################### ADD Seasonal values for the targets as Features ############################
imax = X_multi.shape[0] # same row count as the train/validation/test split below, so that seasonality uses training rows only
training_indices = X_multi.loc[:imax-2*T_set-1,:].index
data_seas = pd.concat([y_multi,X_multi.loc[training_indices,"date"]],sort=False,axis=1)
data_seas["dayofyear"] = data_seas["date"].dt.dayofyear
data_seas = data_seas.groupby("dayofyear").mean()

dates = X_multi[["date"]].copy()
data = X_multi[["date"]].copy()
for tt in targets:
    for dd in range(1,T_pred+1):
        tlabel = "+"+str(dd).zfill(3)
        dates[tlabel] = pd.PeriodIndex(dates['date'], freq='D') + dd#).to_timestamp()
        doy = dates[tlabel].apply(lambda x: x.dayofyear)
        data["day_cos_"+tlabel] =  (np.cos(2 * np.pi * doy.copy() /365.25)+1)/2
        data["day_sin_"+tlabel] =  (np.sin(2 * np.pi * doy.copy() /365.25)+1)/2
        data[tt+"_seas_"+tlabel] = pd.merge(doy,data_seas[tt+"_"+tlabel],how="left",left_on=tlabel,right_index=True)[tt+"_"+tlabel]

# New seasonal features
X_multi_y = data.drop(columns="date").copy()

        
###############################################################
print("shape target matrix: "+str(y_multi.shape))
print("shape input matrix: "+str(X_multi.shape))
print("shape seasonal matrix: "+str(X_multi_y.shape))

Let's create *training*, *validation* and *test* data sets as consecutive intervals. 
Then let's divide the training set into different *folds* in order to perform cross-validation.
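
As a rough illustration of the split described above, the sketch below works on plain index arrays. The series length `n_samples` is a hypothetical value, and `blocked_folds` is a simplified stand-alone version of the logic in `BlockingTimeSeriesSplit`, not a replacement for it.

```python
import numpy as np

n_samples, t_set = 2000, 365  # hypothetical series length; one year per hold-out set
idx = np.arange(n_samples)

# Consecutive intervals: oldest data for training, most recent year for testing.
train = idx[: n_samples - 2 * t_set]
val = idx[n_samples - 2 * t_set : n_samples - t_set]
test = idx[n_samples - t_set :]

def blocked_folds(n, n_splits=5, train_ratio=0.7, gap_ratio=0.05):
    """Yield (train, validation) index arrays for non-overlapping blocks."""
    fold = n // n_splits
    for i in range(n_splits):
        start, stop = i * fold, (i + 1) * fold
        mid = start + int(train_ratio * fold)
        gap = int(gap_ratio * fold)  # samples skipped between train and validation
        yield np.arange(start, mid), np.arange(mid + gap, stop)

folds = list(blocked_folds(len(train)))
for tr, va in folds:
    print(f"train {tr[0]}-{tr[-1]}, validation {va[0]}-{va[-1]}")
```

The gap keeps the validation part of each block from starting immediately after its training part, which limits leakage through autocorrelated values.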

In [None]:
imax = X_multi.shape[0] # number of usable rows (index starts from 0)

# explicit copies avoid pandas' SettingWithCopyWarning when the subsets are scaled later
X_test_multi = X_multi.loc[imax-T_set:,:].copy()
y_test_multi = y_multi.loc[imax-T_set:,:].copy()
X_test_multi_y = X_multi_y.loc[imax-T_set:,:].copy()

X_val_multi = X_multi.loc[imax-2*T_set:imax-T_set-1,:].copy()
y_val_multi = y_multi.loc[imax-2*T_set:imax-T_set-1,:].copy()
X_val_multi_y = X_multi_y.loc[imax-2*T_set:imax-T_set-1,:].copy()

X_train_multi = X_multi.loc[:imax-2*T_set-1,:].copy()
y_train_multi = y_multi.loc[:imax-2*T_set-1,:].copy()
X_train_multi_y = X_multi_y.loc[:imax-2*T_set-1,:].copy()

print("Training set from %s to %s."%(X_train_multi["date"].iloc[0].date(),X_train_multi["date"].iloc[-1].date()))
print("Validation set from %s to %s."%(X_val_multi["date"].iloc[0].date(),X_val_multi["date"].iloc[-1].date()))
print("Test set from %s to %s."%(X_test_multi["date"].iloc[0].date(),X_test_multi["date"].iloc[-1].date()))


btscv = BlockingTimeSeriesSplit(n_splits=5,train_ratio=0.7,gap_ratio=0.05)#BlockingTimeSeriesSplit(n_splits=5)
fig = plt.figure(figsize=(15, 5))
ax = fig.gca()
ax = plot_cv_indices(btscv, X_train_multi.values,ax)

xticks = np.arange(0,X_train_multi.shape[0],200) #xticks = ax.get_xticks()
ax.set_xticks(xticks)
ax.set_xticklabels(X_train_multi.iloc[xticks,np.where(X_train_multi.columns=="date")[0][0]].dt.date,rotation=60);
ax.set_title("Training set's cross-validation folds",size=14)
ax.grid()

### Scale features and targets
We do not expect scaling features and targets to bring any major improvement for decision tree or gradient boosting algorithms. 
Note that the mean squared error (MSE) computed on scaled targets cannot be compared with the MSE on unscaled targets: since scaling expands the target variable by about one order of magnitude, the MSE becomes larger in absolute terms.
On the other hand, linear models, SVMs and neural networks are expected to perform much better, in terms of convergence time and accuracy, when variables are scaled.  
To scale the variables to the **range 0-1**, we use the `MinMaxScaler` class, fitted on the training data and then applied to the validation and test sets.
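
The effect of fitting the scaler on training data only, and of scaling on the MSE, can be checked with a few lines of NumPy. This is a hand-written min-max sketch with made-up numbers, not the `MinMaxScaler` code used below; `scale` and `unscale` are illustrative helpers.

```python
import numpy as np

train = np.array([10.0, 12.0, 15.0, 20.0])  # hypothetical training values
val = np.array([9.0, 18.0, 22.0])           # hypothetical validation values

# Fit the scaling parameters on the training data only.
lo, hi = train.min(), train.max()

def scale(x):
    return (x - lo) / (hi - lo)

def unscale(z):
    return z * (hi - lo) + lo

# Validation values outside the training range land outside [0, 1].
val_scaled = scale(val)

# Errors change units with the scaling: MSE_scaled = MSE_original / (hi - lo)**2.
pred = np.array([10.0, 17.0, 21.0])
mse_original = np.mean((val - pred) ** 2)
mse_scaled = np.mean((scale(val) - scale(pred)) ** 2)
print(mse_original, mse_scaled)
```

The MSE is divided by the squared training range, so it grows when that range is below 1 and shrinks otherwise. Either way, scores are only comparable after mapping predictions back to the original units.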

In [None]:

############################## SCALE TARGET-INDEPENDENT FEATURES ########################

from sklearn.preprocessing import MinMaxScaler
scaler_X = MinMaxScaler()
whichVars = ~ ("date" == X_train_multi.columns)
X_train_multi.loc[:,whichVars] = scaler_X.fit_transform(X_train_multi.loc[:,whichVars])
X_val_multi.loc[:,whichVars] = scaler_X.transform(X_val_multi.loc[:,whichVars])
X_test_multi.loc[:,whichVars] = scaler_X.transform(X_test_multi.loc[:,whichVars])


############################## SCALE TARGET-DEPENDENT FEATURES ########################

scaler_Xy = MinMaxScaler() # MinMaxScaler is already imported above; the "date" column is not part of these features
X_train_multi_y.loc[:,:] = scaler_Xy.fit_transform(X_train_multi_y)
X_val_multi_y.loc[:,:] = scaler_Xy.transform(X_val_multi_y)
X_test_multi_y.loc[:,:] = scaler_Xy.transform(X_test_multi_y)

############################# SCALE TARGETS ###############################
scaler_y = {}
for name in y_train_multi.columns: # `iteritems` was removed in pandas 2.0; only the column name is needed here
    scaler_y[name] = MinMaxScaler()
    y_train_multi.loc[:,name] = scaler_y[name].fit_transform(y_train_multi[name].values.reshape(-1, 1))
    y_val_multi.loc[:,name] = scaler_y[name].transform(y_val_multi[name].values.reshape(-1, 1))
    y_test_multi.loc[:,name] = scaler_y[name].transform(y_test_multi[name].values.reshape(-1, 1))

In [None]:
## Select target variable
target = targets[0]

## Initializes data stores
results_estimators[wtype+"_"+wname][target] = {} # save all the chosen estimators (one for each time)
results_RMSE[wtype+"_"+wname][target] = {} # store root mean squared error score
results_MAE[wtype+"_"+wname][target] = {} # store mean absolute error score
results_data[wtype+"_"+wname][target] = {} # store predicted vs true data

### Baseline model
As a baseline, let's use a decision tree algorithm, which is fast at both training and prediction time. 
It also lets us check how important each feature is.
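
To show what the feature importances of a decision tree look like, here is a small sketch on synthetic data in which only the first of three features drives the target. The data and the names `x0`, `x1`, `x2` are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 3))                  # three synthetic features
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=500)  # only the first one matters

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
importances = tree.feature_importances_         # normalized to sum to 1
print(dict(zip(["x0", "x1", "x2"], importances.round(3))))
```

Each importance is the share of the total impurity reduction attributed to a feature, which is why the heatmaps below use a 0 to 1 color scale.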

In [None]:
############## PREPARE #######################

## select features and outputs related only to this target
others = targets[:]
others.remove(target)
sel_cols = X_train_multi.columns[~np.array([[x.startswith(other) for x in X_train_multi.columns] for other in others]).any(0)] # remove columns that have nothing to do with this target
X_temp = X_train_multi[sel_cols]
X_temp = X_temp.drop(columns="date").copy()

## select outputs related only to this target
y_temp = y_train_multi.copy()
y_temp = y_temp.loc[:,y_temp.columns.str.startswith(target)]

sel_cols = X_train_multi_y.columns[~np.array([[x.startswith(other) for x in X_train_multi_y.columns] for other in others]).any(0)] # remove columns that have nothing to do with this target
X_y_temp = X_train_multi_y[sel_cols]

################### DEFINE MODEL ################

from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(
    criterion='squared_error', # named 'mse' before scikit-learn 1.0
    splitter='random',#'best', 
    max_depth=None, 
    min_samples_split=2, 
    min_samples_leaf=1, 
    min_weight_fraction_leaf=0.0, 
    max_features=None, #('sqrt','log2')
    random_state=None, 
    max_leaf_nodes=None, 
    min_impurity_decrease=0.0, # the `min_impurity_split` argument was removed in scikit-learn 1.0
    ccp_alpha=0.0
)

## SEARCH OPTIMAL PARAMETERS
parameters = {
    'min_samples_split': (0.01,0.15,0.25),#(0.01,0.05,0.1,0.2,0.5,0.8) #Used to control over-fitting.
    'splitter':("best","random")
}

opt_scoring = 'neg_mean_squared_error' # scoring used to select optimal parameters on the refit
 
gscv = GridSearchCV(
        estimator=model,
        param_grid=parameters,
        scoring=('r2','neg_mean_squared_error'),
        n_jobs=-1,
        refit=opt_scoring,
        cv=btscv,  # change this to the splitter subject to test
        verbose=0,
        error_score='raise',#=-999,
        return_train_score=True
        )


searchersDTR, mapperDTR, feat_names = fitMulti_GridSearchCV(X_temp,y_temp,gscv,X_y_temp)


############################ EVALUATE FEATURES #######################

feat_importances = pd.DataFrame(columns=feat_names,index=1+np.arange(y_temp.shape[1]))
feat_importances.index.name="days"
for ii,y_name in enumerate(searchersDTR.keys()):
    feat_importances.iloc[ii,:] = searchersDTR[y_name].best_estimator_.feature_importances_

fig = plt.figure(figsize=(18,0.2*len(feat_names)))
ax = fig.gca()
sns.heatmap(feat_importances.T.astype(float), vmin=0, vmax=1,cmap="coolwarm",ax=ax)
ax.set_title("features importances")
ax.set_yticks(range(feat_importances.shape[1])) # <--- set the ticks first
ax.set_yticklabels(feat_importances.columns,va="top");
plt.setp(ax.get_xticklabels(), rotation=70);

### Use Gradient Boosting Regression (GBR)
Gradient boosting combines many shallow trees and usually achieves a lower error than a single decision tree.


In [None]:
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    loss='squared_error', # named 'ls' before scikit-learn 1.0
    learning_rate=0.1, 
    n_estimators=100, 
    subsample=1.0, 
    criterion='friedman_mse', 
    min_samples_split=2, 
    min_samples_leaf=1, 
    min_weight_fraction_leaf=0.0, 
    max_depth=3, 
    min_impurity_decrease=0.0, init=None, random_state=None, max_features=None, # `min_impurity_split` was removed in scikit-learn 1.0
    alpha=0.9, # tried (0.7,0.9,0.95) with no improvement 
    verbose=0, max_leaf_nodes=None, warm_start=False, validation_fraction=0.1, n_iter_no_change=None, 
    tol=0.0001, ccp_alpha=0.0)

parameters = {
    'learning_rate':(0.01,0.1,0.2), #(0.01,0.1,1.)
    'n_estimators':(30,50,70), # (5,50,100,200),
    'max_depth': (3,5,7)
}

opt_scoring = 'neg_mean_squared_error' # scoring used to select optimal parameters on the refit

gscv = GridSearchCV(
        estimator=model,
        param_grid=parameters,
        scoring=('r2','neg_mean_squared_error'),
        n_jobs=-1,
        refit=opt_scoring,
        cv=btscv,  # change this to the splitter subject to test
        verbose=0,
        error_score='raise',#=-999,
        return_train_score=True
        )


searchersGBR, mapperGBR, feat_names = fitMulti_GridSearchCV(X_temp,y_temp,gscv,X_y_temp)



############################ EVALUATE FEATURES #######################

feat_importances = pd.DataFrame(columns=feat_names,index=1+np.arange(y_temp.shape[1]))
feat_importances.index.name="days"
for ii,y_name in enumerate(searchersGBR.keys()):
    feat_importances.iloc[ii,:] = searchersGBR[y_name].best_estimator_.feature_importances_

fig = plt.figure(figsize=(18,0.2*len(feat_names)))
ax = fig.gca()
sns.heatmap(feat_importances.T.astype(float), vmin=0, vmax=1,cmap="coolwarm",ax=ax)
ax.set_title("features importances")
ax.set_yticks(range(feat_importances.shape[1])) # <--- set the ticks first
ax.set_yticklabels(feat_importances.columns,va="top");
plt.setp(ax.get_xticklabels(), rotation=70);

### Use Support Vector Machine (SVM)

In [None]:
from sklearn.svm import SVR # support vector machine regression

model = SVR(
    kernel='rbf', degree=3, 
    gamma='scale', 
    coef0=0.0, tol=0.001, 
    C=1.0, 
    epsilon=0.01, # default was 0.1, tried: (0.001,0.01,0.1),
    shrinking=True, cache_size=200, verbose=False, max_iter=-1)



## SEARCH OPTIMAL PARAMETERS
parameters = {
    'C':(1e-4,1e-3,0.01,0.1),
    'kernel':('linear','rbf')
}

opt_scoring = 'neg_mean_squared_error' # scoring used to select optimal parameters on the refit
 
gscv = GridSearchCV(
        estimator=model,
        param_grid=parameters,
        scoring=('r2','neg_mean_squared_error'),
        n_jobs=-1,
        refit=opt_scoring,
        cv=btscv,  # change this to the splitter subject to test
        verbose=0,
        error_score='raise',#=-999,
        return_train_score=True
        )

searchersSVM, mapperSVM, feat_names = fitMulti_GridSearchCV(X_temp,y_temp,gscv,X_y_temp)



### Model Selection
Among the best models found for each algorithm, let's pick the best-performing one for each prediction day, based on the MSE on the validation set.
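
The selection rule can be sketched independently of the fitted estimators. The validation scores below are invented for illustration; the actual values come from the grid searches above.

```python
import numpy as np

# Hypothetical validation MSE per model and prediction day (+1, +2, +3).
val_mse = {
    "decision tree": [0.020, 0.031, 0.045],
    "gradient boosting": [0.015, 0.033, 0.040],
    "support vector machine": [0.018, 0.029, 0.050],
}

names = list(val_mse)
scores = np.array([val_mse[n] for n in names])  # shape: (models, days)

# For each prediction day keep the model with the lowest validation error.
best = {day + 1: names[i] for day, i in enumerate(scores.argmin(axis=0))}
print(best)
```

Choosing per prediction day means the final forecaster may mix algorithms, for example one model for short horizons and another for longer ones.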

In [None]:
searchers_list = {"decision tree": searchersDTR, 
                  "gradient boosting": searchersGBR,
                  "support vector machine": searchersSVM,
                 }

######################################### COMPARE MODELS ##################################
plt.figure(figsize=(15,5))
ax = plt.gca()
for name,ss in searchers_list.items():
    ax = plot_scores(ss,label=name,ax=ax);
plt.show()

In [None]:
from sklearn.metrics import mean_squared_error

## for each prediction time, choose best estimator
MSE = {}
best_estimator = {}

X_train_multi_y.columns = [col.replace("season_","") for col in X_train_multi_y.columns]
X_val_multi_y.columns = [col.replace("season_","") for col in X_val_multi_y.columns]



#### Prepare val data
others = targets[:]
others.remove(target)
sel_cols = X_val_multi.columns[~np.array([[x.startswith(other) for x in X_val_multi.columns] for other in others]).any(0)] # remove columns that have nothing to do with this target
X_temp_val = X_val_multi[sel_cols]
X_temp_val = X_temp_val.drop(columns="date").copy()
y_temp_val = y_val_multi.copy()
y_temp_val = y_temp_val.loc[:,y_temp_val.columns.str.startswith(target)]
sel_cols = X_val_multi_y.columns[~np.array([[x.startswith(other) for x in X_val_multi_y.columns] for other in others]).any(0)] # remove columns that have nothing to do with this target
X_y_temp_val = X_val_multi_y[sel_cols]

for tt in range(1,T_pred+1):
    MSE[tt] = np.inf
    name = target + "_+"+str(tt).zfill(3)
    for ss in searchers_list.values():
        
        
        ### fit estimator on training data
        colnames = X_y_temp.columns[X_y_temp.columns.str.endswith(name.split("_")[-1])]
        X_train = pd.concat([X_temp, X_y_temp.loc[:,colnames]],sort=False,axis=1)
        y_train = y_temp.loc[:,name].values.ravel()
        ss[name].best_estimator_.fit(X_train,y_train)
        
        ### predict on validation data
        colnames = X_y_temp_val.columns[X_y_temp_val.columns.str.endswith(name.split("_")[-1])]
        X_val = pd.concat([X_temp_val, X_y_temp_val.loc[:,colnames]],sort=False,axis=1)
        y_pred = ss[name].best_estimator_.predict(X_val)
        
        ### calculate MSE as score
        y_true = y_temp_val.loc[:,name].values#.ravel()
        thisMSE = mean_squared_error(y_true, y_pred)  
          
        ### compare score with other algorithms
        if thisMSE < MSE[tt]:
            MSE[tt] = thisMSE
            best_estimator[tt] = ss[name].best_estimator_

    print("prediction on day +%d: lowest MSE = %.4f on scaled data with %s " %(tt,MSE[tt],best_estimator[tt]))

### Run best models on test set
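
Scoring on the test set requires predictions in the original units: the target scaling is undone, the predicted variation is added to the current observed value, and only then are RMSE and MAE computed. A minimal NumPy sketch with made-up numbers follows; `d_min` and `d_max` stand for the range that a min-max scaler would have learned on the training variations and are assumptions of this example.

```python
import numpy as np

current = np.array([-25.0, -25.2, -25.1])      # observed depth at time t (hypothetical)
future_true = np.array([-25.3, -25.1, -25.4])  # observed depth at t + horizon
pred_scaled = np.array([0.2, 0.6, 0.3])        # model output in scaled units

# Range of the training variations, as a min-max scaler would store it (assumed values).
d_min, d_max = -0.5, 0.5

# 1) undo the scaling, 2) add the variation to the current observed value.
pred_delta = pred_scaled * (d_max - d_min) + d_min
pred_abs = current + pred_delta

err = future_true - pred_abs
rmse = np.sqrt(np.mean(err ** 2))
mae = np.mean(np.abs(err))
print(rmse, mae)
```

Because both metrics are computed on absolute depths, they are expressed in the same units as the target and can be compared across prediction days.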

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Finally test on the most recent data
X_test_multi_y.columns = [col.replace("season_","") for col in X_test_multi_y.columns]


## Prepare
#e.g.: wtype = "Aquifer"
#e.g.: wname = "Petrignano"
#e.g.: target = "depth_to_groundwater_p24"



#### Prepare test data
others = targets[:]
others.remove(target)
sel_cols = X_test_multi.columns[~np.array([[x.startswith(other) for x in X_test_multi.columns] for other in others]).any(0)] # remove columns that have nothing to do with this target
X_temp_test = X_test_multi[sel_cols]
y_temp_test = y_test_multi.copy()
y_temp_test = y_temp_test.loc[:,y_temp_test.columns.str.startswith(target)]
sel_cols = X_test_multi_y.columns[~np.array([[x.startswith(other) for x in X_test_multi_y.columns] for other in others]).any(0)] # remove columns that have nothing to do with this target
X_y_temp_test = X_test_multi_y[sel_cols]


results_estimators[wtype+"_"+wname][target] = best_estimator # save all the chosen estimators (one for each prediction day)

for tt in range(1,T_pred+1):
    name = target + "_+"+str(tt).zfill(3) 
    
    ### refit estimator on training+validation data
    colnames = X_y_temp.columns[X_y_temp.columns.str.endswith(name.split("_")[-1])]
    X_train = pd.concat([X_temp, X_y_temp.loc[:,colnames]],sort=False,axis=1)
    X_val = pd.concat([X_temp_val, X_y_temp_val.loc[:,colnames]],sort=False,axis=1)
    
    X_in = pd.concat([X_train,X_val],axis=0,sort=False)
    y_train = y_temp.loc[:,name]
    y_val = y_temp_val.loc[:,name]
    y_in = pd.concat([y_train,y_val],axis=0,sort=False).values.ravel()
    best_estimator[tt].fit(X_in,y_in)
    
    ### predict on test data
    colnames = X_y_temp_test.columns[X_y_temp_test.columns.str.endswith(name.split("_")[-1])]
    X_test = pd.concat([X_temp_test, X_y_temp_test.loc[:,colnames]],sort=False,axis=1)
    y_pred = best_estimator[tt].predict(X_test.drop(columns="date"))
    y_true = y_temp_test.loc[:,name].values#.ravel()
    
    ### scale back to original values
    y_pred = scaler_y[name].inverse_transform(y_pred.reshape(-1,1))
    
    
    ### reconstruct absolute values from differences
    y_true = y_absolute[target].shift(-tt).loc[X_test.index]
    y_pred = y_absolute.loc[X_test.index,target].values.reshape(-1,1) + y_pred

    ### calculate RMSE and MAE (on unscaled data) as scores
    RMSE_test = np.sqrt(mean_squared_error(y_true, y_pred)) # the `squared=False` argument is deprecated in recent scikit-learn versions
    MAE_test = mean_absolute_error(y_true, y_pred)

    ### plot predicted vs true data
    plt.figure(figsize=(15,5))
    ax=plt.gca()
    result = pd.DataFrame({'date':X_test["date"].values,'true':np.ravel(y_true), 'predicted':np.ravel(y_pred)}) # np.ravel works for both Series and arrays
    result.plot(x="date",style="-o",ylabel=target,ax=ax,grid=True,
                title="%s %s - %s prediction for day +%d\ntest data scores with %s: RMSE = %.4f, MAE = %.4f"%(wtype,wname,target,tt,best_estimator[tt],RMSE_test,MAE_test))
    plt.show()
    
    ### store data
    results_RMSE[wtype+"_"+wname][target][tt] = RMSE_test
    results_MAE[wtype+"_"+wname][target][tt] = MAE_test
    results_data[wtype+"_"+wname][target][tt] = result


In [None]:
######################################### PLOT SCORES ##################################
plt.figure(figsize=(15,5))
ax = plt.gca()
ax.plot(results_RMSE[wtype+"_"+wname][target].keys(),results_RMSE[wtype+"_"+wname][target].values(),label="RMSE",lw=3,marker="o")#,style="o-")
ax.plot(results_MAE[wtype+"_"+wname][target].keys(),results_MAE[wtype+"_"+wname][target].values(),label="MAE",lw=3,marker="o")#,style="o-")
plt.grid()
plt.legend()
ax.set_xlabel("prediction days",size=14);
ax.set_title(target)