# EDA of the aquifers datasets

## Encyclopedic knowledge
Aquifers (literally "water carriers") are underground water sources. The water is stored in porous rock layers (e.g. sand) that rest on an underlying bed of low-permeability rock (e.g. clay). Sometimes a low-permeability layer also lies above the porous layer; it then forms a confined aquifer, which can carry water at considerable overpressure. Aquifers can be stacked on top of each other, with an unconfined aquifer above a confined aquifer:

https://www.canada.ca/en/environment-climate-change/services/water-overview/sources/groundwater/_jcr_content/par/img_305/image.img.gif/1506365092299.gif

The water table is the upper surface of the water that is contained in the ground. At some places, this height can be above the ground level. In such cases, the well would spill (even spout).

The height of the water table should be equal everywhere in a given aquifer under quasi-static conditions, due to gravity. From this one might think that the depth to groundwater (DTG) depends only on one additional parameter, the elevation of the land surface. However, under rapid water extraction and low flow rates of water through the porous rock, the conditions are far from equilibrium.
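
The equilibrium picture described above can be sketched in a few lines. All numbers are hypothetical and serve only as an illustration; the depth is counted positive downward here, which is an assumption and may differ from the sign convention of a particular dataset.

```python
import numpy as np

# Hypothetical numbers, for illustration only (not taken from the dataset).
water_table_elevation = 12.0                      # m above sea level, equal everywhere at equilibrium
ground_elevation = np.array([15.0, 20.0, 11.0])   # m above sea level at three imaginary wells

# Depth to groundwater, counted positive downward from the land surface (assumed convention).
depth_to_groundwater = ground_elevation - water_table_elevation

for z, d in zip(ground_elevation, depth_to_groundwater):
    state = "well spills" if d < 0 else "water below ground level"
    print(f"ground at {z:5.1f} m -> depth {d:5.1f} m: {state}")
```

In this idealized case the depth differs between wells only because of the land elevation; any additional variation points to non-equilibrium conditions.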

## water balance

https://www.canada.ca/en/environment-climate-change/services/water-overview/sources/groundwater/_jcr_content/par/img_48812/image.img.gif/1506362514201.gif

Water turnover in an aquifer can vary dramatically. Estimated water storage times range from 2 weeks to 10,000 years. Some aquifers contain water from the melted snow/ice of the last glacial period. Some contain residual prehistoric seawater.

### discharge
* The water from an aquifer is extracted by means of a well (Italian: pozzo). This extraction is metered by the water plant as a volume.
* Water in open (unconfined) aquifers may be taken up by plants, which draw water with their roots and lose it by evaporation from their leaves.
* Discharge to the surface can occur through a water spring.
* Discharge by seepage into the sea. In this case, contamination by sea salt can occur.

Discharge of water lowers the water table (the top surface of the water filling the aquifer). This leads to a decrease of the "depth to groundwater" parameter of the well.

### recharge

* The water in an aquifer is replenished by precipitation (also called meteoric water). Especially in the case of confined aquifers, the area where the rainfall seeps into the aquifer can be far away from the well used for its extraction.
* Seepage from a river or stream bed.
* Inflow from the sea (salty).

Recharge of water leads to rising of the water table. This leads to an increase of the "depth to ground" parameter of the well.
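
The balance between recharge and discharge can be turned into a rough estimate of the water table change. The sketch below assumes a simple unconfined aquifer in which the change in stored volume is spread over the aquifer area and scaled by the specific yield (the drainable fraction of the rock volume). All numbers and variable names are hypothetical.

```python
# All numbers are hypothetical and only illustrate the bookkeeping.
area_m2 = 5.0e6           # horizontal extent of the aquifer
specific_yield = 0.2      # drainable fraction of the rock volume (assumed)

recharge_m3 = 40_000.0    # infiltrated rainfall plus river seepage over one week
discharge_m3 = 65_000.0   # wells, springs and plants over the same week

# Change of stored water volume and the corresponding change of the water table height.
storage_change_m3 = recharge_m3 - discharge_m3
water_table_change_m = storage_change_m3 / (area_m2 * specific_yield)

print(f"storage change:     {storage_change_m3:+.0f} m^3")
print(f"water table change: {water_table_change_m * 100:+.1f} cm")
```

A negative result means that discharge dominated during the period and the water table dropped.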




In [None]:
!pip install networkx tigramite

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns

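import sklearn
import sklearn.linear_model  # make the submodule used for the prediction model explicitly available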
import tigramite

from tigramite import data_processing as pp
from tigramite import plotting as tp
from tigramite.pcmci import PCMCI
from tigramite.independence_tests import ParCorr, GPDC, CMIknn, CMIsymb
from tigramite.models import LinearMediation, Prediction

In [None]:
aqfs = dict()
basename = 'Aquifer_'
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        if basename in filename:
            link = os.path.join(dirname, filename)
            aqf_name = filename.split('_')[-1].split('.')[0]
            aqfs[aqf_name] = pd.read_csv(link, index_col=0, parse_dates=True,dtype=np.float32)
            #print(aqfs[aqf_name].describe())
            
        

# Auser aquifer
Info from the competition documents: this water body consists of two subsystems, called NORTH and SOUTH, where the former partly influences the behaviour of the latter.
* The levels of the NORTH sector are represented by the values of the SAL, PAG, CoS and DIEC wells, 
* the levels of the SOUTH sector by the LT2 well.

Targets: Depth_to_Groundwater_SAL, Depth_to_Groundwater_CoS, Depth_to_Groundwater_LT2



In [None]:
df = aqfs['Auser']

dt = df.index
timestamp_s = dt.map(datetime.timestamp)

day = 24*60*60
year = (365.2425)*day # number of seconds in a year

df['year_sin'] = np.sin( timestamp_s * (2 * np.pi / year))
df['year_cos'] = np.cos(timestamp_s * (2 * np.pi / year))

df['month'] = dt.month
df['week_of_year'] = dt.isocalendar().week




In [None]:
df.columns

In [None]:
feats = ['Depth_to_Groundwater_LT2', 'Depth_to_Groundwater_SAL', 'Depth_to_Groundwater_CoS']
rain_feats = ['Rainfall_Gallicano', 'Rainfall_Pontetetto', 'Rainfall_Monte_Serra',
       'Rainfall_Orentano', 'Rainfall_Borgo_a_Mozzano', 'Rainfall_Piaggione',
       'Rainfall_Calavorno', 'Rainfall_Croce_Arcana',
       'Rainfall_Tereglio_Coreglia_Antelminelli',
       'Rainfall_Fabbriche_di_Vallico']
temp_feats = ['Temperature_Orentano', 'Temperature_Monte_Serra',
       'Temperature_Ponte_a_Moriano', 'Temperature_Lucca_Orto_Botanico']
hydrometry = ['Hydrometry_Monte_S_Quirico', 'Hydrometry_Piaggione']


In [None]:
df[feats].plot()

The Depth_to_Groundwater_X features are somewhat buried in periodic ups and downs that occur every month. There are strange year-to-year jumps in the oscillations, as if the actual levels were set by the water supply management authority (Acea).

It is tempting to use the topmost water level as the water table height, since the water in the well cannot rise above the water table.
On the other hand, the topmost depth level seems to jump unnaturally at the end of every year. Instead, we will consider the abrupt changes an artifact of the water level measurements and smooth the evolution of the DTG parameter by taking the median value for every month or week.
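
Why the median rather than the mean? A minimal sketch on synthetic data (a constant level with an artificial spike every seventh day; none of the numbers come from the dataset) shows that the monthly median ignores isolated excursions, while the monthly mean is pulled towards them.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: a constant "true" level with an artificial spike every 7th day.
idx_demo = pd.date_range("2020-01-01", "2020-03-31", freq="D")
level_demo = pd.Series(-5.0, index=idx_demo)
level_demo.iloc[::7] += 0.8

# Month-start bins ("MS") are used here only to label the monthly groups.
monthly_mean_demo = level_demo.resample("MS").mean()
monthly_median_demo = level_demo.resample("MS").median()

print(pd.DataFrame({"mean": monthly_mean_demo, "median": monthly_median_demo}))
```

The median stays at the undisturbed level as long as the spikes affect fewer than half of the samples in a bin.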

### Calculate the weekly and monthly medians

In [None]:
timeslice = slice('01-01-2006',None) # check the whole time interval in the dataset
feats_diff = [feat+'_diff' for feat in feats]


monthly = df.loc[timeslice,feats].resample('1M').median()
monthly_dev = monthly[feats].diff()
monthly_dev['month'] = monthly_dev.index.month
weekly = df.loc[timeslice,feats].resample('1w').median()

weekly_dev = weekly[feats] - monthly[feats].resample('1w').ffill()
#df.loc[timeslice,['Depth_to_Groundwater_Podere_Casetta',]].plot()

# restore the lost feature week_of_year
weekly_dev['week_of_year'] = weekly_dev.index.isocalendar().week
weekly_dev['month'] = weekly_dev.index.month


In [None]:
years = slice('2006','2009')
ax = plt.axes()
df.loc[years,feats].plot(label=feats,ax=ax,alpha=0.3)
monthly.loc[years,feats].plot(ax=ax)
plt.legend(loc=[1,0])

plt.figure()
ax = plt.axes()
df.loc[timeslice,feats].plot(label=feats,ax=ax,alpha=0.3)
monthly[feats].plot(ax=ax)
plt.legend(loc=[1,0])


The monthly medians represent the depth values quite smoothly. The strange ups and downs are eliminated. In the case of the CoS and SAL data, there are quite large monthly variations, with abrupt changes in depth. The data in 2020 contain zero values simultaneously in LT2 and CoS (maybe in SAL too, although SAL appears to reach zero only half a year later). Maybe a lack of sensor maintenance due to COVID-19 effects? This period is a candidate for removal.

The monthly medians of CoS and SAL follow similar patterns, with CoS showing larger variations.
LT2 and SAL are also fairly well correlated.
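
The visual impression of correlated wells can be checked numerically with a correlation matrix of the monthly medians. The sketch below uses synthetic stand-ins for the three wells (a shared seasonal signal with different amplitudes plus noise), so the column names and numbers are placeholders rather than the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_days = 730
idx_demo = pd.date_range("2015-01-01", periods=n_days, freq="D")
common = np.sin(2 * np.pi * np.arange(n_days) / 365.25)   # shared seasonal signal

# Synthetic stand-ins for the three wells (placeholder names and amplitudes).
wells_demo = pd.DataFrame({
    "LT2": -12 + 0.3 * common + 0.05 * rng.standard_normal(n_days),
    "SAL": -5 + 0.5 * common + 0.05 * rng.standard_normal(n_days),
    "CoS": -6 + 1.0 * common + 0.05 * rng.standard_normal(n_days),
}, index=idx_demo)

monthly_medians_demo = wells_demo.resample("MS").median()
corr_demo = monthly_medians_demo.corr()        # Pearson correlation between the wells
spread_demo = monthly_medians_demo.std()       # size of the monthly variations per well

print(corr_demo.round(2))
print(spread_demo.round(2))
```

Applied to the real monthly medians, the same two calls would quantify both statements above: which wells move together, and which one varies most.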

## how does the DTG change?

In [None]:
for site in feats:
    monthly_dev.boxplot(column=[site],by='month')
    plt.ylabel('change in monthly DTG median')


# When does it rain on Auser?

In [None]:
df.loc[timeslice,rain_feats].plot(subplots=True)

The dataset shows an almost complete history of rain records since 2006. There is a one-year gap in 2009 for Piaggione.
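
Gaps like this one are easier to spot in a table than in a plot. A minimal sketch, using synthetic data with an emulated one-year gap (station names are placeholders), counts the missing daily records per station and year:

```python
import numpy as np
import pandas as pd

# Synthetic daily rainfall for two placeholder stations.
idx_demo = pd.date_range("2008-01-01", "2010-12-31", freq="D")
rain_demo = pd.DataFrame({"station_A": 1.0, "station_B": 1.0}, index=idx_demo)

# Emulate a one-year gap in the second station.
rain_demo.loc["2009-01-01":"2009-12-31", "station_B"] = np.nan

# Number of missing daily values per station and calendar year.
missing_per_year = rain_demo.isna().groupby(rain_demo.index.year).sum()
print(missing_per_year)
```

The same grouping applied to the rainfall columns of the real dataframe would list the incomplete years for each location.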

In [None]:
df.groupby('month').describe().loc[:,(rain_feats,'mean')].plot()
#plt.ylim(0,8)
plt.legend('',title='locations')
plt.ylabel('mean daily rainfall (mm)')

# What is the air temperature?

In [None]:
df.loc[timeslice,temp_feats].plot(subplots=True)
plt.figure()
df.groupby('month').describe().loc[:,(temp_feats,'mean')].plot()



The temperature records for 'Ponte_a_Moriano' are missing.

# Feature engineering ideas
* The features water extraction volume (WEV), rainfall (RFL), temperature and DTG should be linked by some relation that is stable in time.

* DTG is a state parameter (total remaining volume in the well), but in fact we want to predict its change with respect to its previous state (a differential). Its actual value should be regarded more as a conditional parameter.
* WEV is a rate parameter (volume per day).
* RFL is a rate parameter (volume per day).
* Temperature is a conditional parameter that does not represent a water quantity, but can influence how RFL increases DTG.

* The oscillations of the DTG parameter of the pozzos within each year probably occur due to the extraction/replenishing cycle of the well. The fact that the amplitude of the oscillations is more or less constant suggests that the target boundary levels of DTG are set by a hydrological authority (presumably Acea itself). The Acea representative (Louisa) claims that the POC well is in fact not used for water extraction, just for metering. In that case the origin of the DTG oscillations is unclear. It may mean that the POC well is located near some of the other pozzos that are used for extraction, and thus its levels are synchronized with theirs.
* For predictions, the monthly jumps of the DTG parameter are probably worth filtering out.
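
The distinction between state and rate parameters suggests different aggregation rules when moving to a coarser time scale. A minimal sketch on synthetic data (placeholder column names, made-up numbers): the state is summarized by a median and then differentiated, while the rate is accumulated.

```python
import numpy as np
import pandas as pd

# Four full weeks of synthetic daily data, starting on a Monday.
idx_demo = pd.date_range("2019-01-07", periods=28, freq="D")
daily_demo = pd.DataFrame({
    "dtg": np.linspace(-5.0, -4.46, 28),                         # slowly rising level (state)
    "rain": np.tile([0, 0, 10, 0, 0, 5, 0], 4).astype(float),    # mm per day (rate)
}, index=idx_demo)

weekly_demo = pd.DataFrame({
    "dtg": daily_demo["dtg"].resample("W").median(),   # state: representative level per week
    "rain": daily_demo["rain"].resample("W").sum(),    # rate: accumulated over the week
})
weekly_demo["dtg_change"] = weekly_demo["dtg"].diff()  # differential of the state variable

print(weekly_demo)
```

Whether rainfall is better summed or averaged per week is a modelling choice; the two differ only by a constant factor when every week is complete.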

# Causality discovery

For the purpose of efficient causality discovery, we will use a dedicated Python package called Tigramite. It allows one to efficiently reconstruct causal graphs from high-dimensional time series datasets and to model the obtained causal dependencies for causal mediation and prediction analyses. Causal discovery is based on linear as well as non-parametric conditional independence tests applicable to discrete or continuously-valued time series.


More info can be found in the recent conference paper: J. Runge (2020): Discovering contemporaneous and lagged causal relations in autocorrelated nonlinear time series datasets. Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence, UAI 2020, Toronto, Canada, 2019, AUAI Press, 2020. http://auai.org/uai2020/proceedings/579_main_paper.pdf

Features

*    high detection power even for large-scale time series datasets
*    flexible conditional independence test statistics adapted to continuously-valued or discrete data, and different assumptions about linear or nonlinear dependencies
*    automatic hyperparameter optimization for most tests
*    prediction class based on sklearn models including causal feature selection


## Weekly time scale

### preprocessing of the data
Resampling the data and removing the artifacts. We can filter the depth data, keeping only the values that are close to the monthly medians:

In [None]:
# selecting features according to their completeness

feats = ['Depth_to_Groundwater_LT2',
       'Depth_to_Groundwater_SAL',
       'Depth_to_Groundwater_CoS']

rain_feats = ['Rainfall_Gallicano', 'Rainfall_Pontetetto', 'Rainfall_Monte_Serra',
       'Rainfall_Orentano', 'Rainfall_Borgo_a_Mozzano', 
       'Rainfall_Calavorno', 'Rainfall_Croce_Arcana',
       'Rainfall_Tereglio_Coreglia_Antelminelli',
       'Rainfall_Fabbriche_di_Vallico']
temp_feats = ['Temperature_Orentano']
volumes = ['Volume_POL', 'Volume_CC1', 'Volume_CC2', 'Volume_CSA', 'Volume_CSAL']
hydrometries = ['Hydrometry_Monte_S_Quirico', 'Hydrometry_Piaggione']

selected_feats =[]
timeslice = slice('2007','03-2020')
for feat_list in [feats,temp_feats,rain_feats]:
    selected_feats.extend(feat_list)



In [None]:
timeslice = slice('2008','03-2020') # check this time interval in the dataset

interval = '1M' # new_sampling_interval_code

monthly = df.loc[timeslice,:].resample(interval).median().ffill() # resample whole dataset

interval = '1d' # new_sampling_interval_code
monthly_upsampled = monthly.resample(interval).interpolate('linear')

# filter the data:
max_dev = {'Depth_to_Groundwater_CoS':0.3, 
                 'Depth_to_Groundwater_SAL':0.2,
                 'Depth_to_Groundwater_LT2':0.2}
filterred = df.loc[timeslice,:].copy() # work on a copy so that df itself stays unchanged

dfdiff = df.loc[timeslice,feats] - monthly_upsampled.loc[timeslice,feats]
for feat in feats:
    dfdiff_idxs = dfdiff[feat].where(abs(dfdiff[feat]) > max_dev[feat]).dropna().index # indices of data out of maxdev bounds
    filterred.loc[dfdiff_idxs,feat] = np.nan
    filterred.loc[timeslice,feat] = filterred.loc[timeslice,feat].resample(interval).interpolate('linear')

#compare the traces 
fig,ax = plt.subplots(ncols=1,nrows=1,figsize=(10,10))
df.loc[:,feats].plot(ax=ax)
monthly.loc[:,feats].plot(ax=ax)
ax.legend(loc=(1,0))

fig,ax2 = plt.subplots(ncols=1,nrows=1,figsize=(10,10))
#filterred.loc[timeslice,feats].resample(interval).interpolate('linear').plot(ax=ax2)
filterred[feats].plot(ax=ax2)

In [None]:
# replace the non-numbers (NaN) with zeros and keep only the selected features
filterred=filterred.where(filterred.notna(), 0).loc[:,selected_feats]

## resample the filtered data into weekly intervals:

In [None]:
interval = '1w' # new_sampling_interval_code

weekly = filterred.resample(interval).mean() # resample whole dataset

fig,ax = plt.subplots(ncols=1,nrows=1,figsize=(10,10))
filterred.loc[:,feats].plot(ax=ax)
weekly.loc[:,feats].plot(ax=ax)
ax.legend(loc=(1,0))

## calculate the differential from the weekly data

In [None]:
weekly[feats] = weekly[feats].diff() # differentiate
weekly = weekly.iloc[1:,:] # throw away the first line
weekly[feats].boxplot()

## filter the differentials according to quartiles

In [None]:

quart = weekly[feats].describe().T
cutoffs = 2*(-quart[r'25%'] + quart[r'75%']).abs() 

feats_thrshld = cutoffs.to_dict()

for feat in feats:
    weekly[feat] = weekly[feat].where(abs(weekly[feat]) < feats_thrshld[feat] , np.nan).fillna(value=0)
    
weekly[feats].boxplot()
weekly[feats].plot(subplots=True,figsize=(15,12))

## add features based on datetime

In [None]:
dt = weekly.index
timestamp_s = dt.map(datetime.timestamp)

day = 24*60*60
year = (365.2425)*day # number of seconds in a year

weekly['year_sin'] = np.sin( timestamp_s * (2 * np.pi / year))
weekly['year_cos'] = np.cos(timestamp_s * (2 * np.pi / year))

## causality discovery on weekly data

In [None]:
parcorr = ParCorr(significance='analytic')
# select the data
dataset = weekly
var_names = [col_name.split('_')[0][0]+col_name.split('_')[-1][0:2] for col_name in dataset.columns] # abbreviate the names
dataframe = pp.DataFrame(data=dataset.values,var_names = var_names)
pcmci = PCMCI(
    dataframe=dataframe, 
    cond_ind_test=parcorr,
    verbosity=1)

In [None]:
correlations = pcmci.get_lagged_dependencies(tau_max=15, val_only=True)['val_matrix']

#plt.figure(figsize=(10,10))
lag_func_matrix = tp.plot_lagfuncs(val_matrix=correlations, setup_args={'var_names':var_names, 'figsize':(18,10),
                                    'x_base':2, 'y_base':1}); 
plt.show()

The rainfall features lose their effect after at most 7 weeks. Let us restrict the causality search to this number of past time steps.
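
The idea of reading a maximum useful lag from the lag functions can be reproduced with plain pandas. The sketch below builds a synthetic response that reacts to rainfall one and two steps earlier (all numbers made up) and computes the lagged correlation for lags 0 to 10.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 500
rain_demo = pd.Series(rng.gamma(shape=1.0, scale=5.0, size=n))   # synthetic weekly rainfall

# Synthetic response: reacts to the rain of 1 and 2 weeks earlier, plus noise.
response_demo = (0.6 * rain_demo.shift(1) + 0.3 * rain_demo.shift(2)
                 + rng.standard_normal(n))

# Correlation between the response and the rainfall shifted by each lag.
lag_corr = pd.Series(
    {lag: response_demo.corr(rain_demo.shift(lag)) for lag in range(0, 11)}
)
print(lag_corr.round(2))
```

Lags beyond the true range of influence show correlations close to zero, which is the criterion used above to choose the maximum lag.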

## calculate the lagged causal effect


In [None]:
tau_max = 7
pc_alpha = 0.05
pcmci.verbosity = 1

results = pcmci.run_pcmciplus(tau_min=0, tau_max=tau_max, pc_alpha=pc_alpha)

print("Graph")
print (results['graph'])
print("Adjacency MCI partial correlations")
print (results['val_matrix'].round(2))
print("Adjacency p-values")
print (results['p_matrix'].round(3))

q_matrix = pcmci.get_corrected_pvalues(p_matrix=results['p_matrix'], fdr_method='fdr_bh',
                                                  exclude_contemporaneous=False)

link_matrix = results['graph']

tp.plot_graph(
    val_matrix=results['val_matrix'],
    link_matrix=link_matrix,
    var_names=var_names,
    link_colorbar_label='cross-MCI (edges)',
    node_colorbar_label='auto-MCI (nodes)',
    ); plt.show()

On the weekly time scale, the algorithm found the following causal pathways:
* the rain feature "RAn" directly influences the depth "DSA" with a lag of 1 week,
* RPo influences DLT with a lag of 7 weeks,
* there seems to be instantaneous communication between DCo and DSA, and with DLT at a lag of 4 or 5 weeks.

# Make the predictions based on the obtained causalities information
For these predictions we will use a linear regression model, together with the data processed by the tigramite package, which reflect the causal dependency graph.
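
Conceptually, this amounts to an ordinary least-squares fit on lagged copies of the selected parent variables. The following sketch does this by hand with NumPy on synthetic data; the parent list and all variable names are hypothetical, and the code does not use tigramite.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 400
rain_x = rng.standard_normal(n)

# Synthetic target: depends on its own previous value and on rain one step back.
target_y = np.zeros(n)
for t in range(1, n):
    target_y[t] = 0.5 * target_y[t - 1] + 0.8 * rain_x[t - 1] + 0.1 * rng.standard_normal()

data_demo = pd.DataFrame({"target": target_y, "rain": rain_x})
parents_demo = [("target", 1), ("rain", 1)]    # hypothetical (variable, lag) parents

# Design matrix of lagged predictors; rows with undefined lags are dropped.
X = pd.concat({f"{v}_lag{lag}": data_demo[v].shift(lag) for v, lag in parents_demo}, axis=1)
valid = X.notna().all(axis=1)
X, y = X[valid].to_numpy(), data_demo.loc[valid, "target"].to_numpy()

split = int(0.8 * len(y))                      # chronological train/test split
A_train = np.column_stack([np.ones(split), X[:split]])
coef, *_ = np.linalg.lstsq(A_train, y[:split], rcond=None)
A_test = np.column_stack([np.ones(len(y) - split), X[split:]])
y_pred = A_test @ coef

print("intercept and coefficients:", coef.round(2))
```

The fitted coefficients recover the values used to generate the data, and the chronological split avoids training on observations that lie after the test period.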

In [None]:
T=weekly.shape[0]
N=weekly.shape[1]
pred = Prediction(dataframe=dataframe,
        cond_ind_test=ParCorr(),   #CMIknn ParCorr
        prediction_model = sklearn.linear_model.LinearRegression(),
#         prediction_model = sklearn.gaussian_process.GaussianProcessRegressor(),
        # prediction_model = sklearn.neighbors.KNeighborsRegressor(),
#    data_transform=sklearn.preprocessing.StandardScaler(),
    train_indices= range(int(0.8*T)),
    test_indices= range(int(0.8*T), T),
    verbosity=1
    )


## predict target 2 (Depth_to_Groundwater_CoS, DCo)


Now, we estimate causal predictors using get_predictors for the target variable 2 (DCo) taking into account a maximum past lag of tau_max. Note that the predictors are different for each prediction horizon. For example, at a prediction horizon of steps_ahead=1 we get the causal parents from the model plus some others:


In [None]:
target = 2
tau_max = 7
pc_alpha = 0.05
predictors = pred.get_predictors(
                  selected_targets=[target],
                  steps_ahead=1,
                  tau_max=tau_max,
                  pc_alpha=pc_alpha
                  )
link_matrix = np.zeros((N, N, tau_max+1), dtype='bool')
for j in [target]:
    for p in predictors[j]:
        link_matrix[p[0], j, abs(p[1])] = 1

# Plot time series graph
tp.plot_time_series_graph(
    figsize=(6, 3),
    val_matrix=np.ones(link_matrix.shape),
    link_matrix=link_matrix,
    var_names=var_names,
    link_colorbar_label='',
    )


In [None]:
pred.fit(target_predictors=predictors, 
                selected_targets=[target],
                    tau_max=tau_max)

predicted = pred.predict(target)
true_data = pred.get_test_array()[0]

plt.scatter(true_data, predicted,alpha=0.3)
plt.title(r"NMAE = %.2f" % (np.abs(true_data - predicted).mean()/true_data.std()))  # mean absolute error normalized by the std of the test data
plt.plot(true_data, true_data, 'k-')
plt.xlabel('True test data')
plt.ylabel('Predicted test data')

fig, ax = plt.subplots(1,figsize=(10,10))
ax.plot(true_data)
ax.plot(predicted)
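
The score shown in the scatter-plot title divides the mean absolute error by the standard deviation of the test data. The sketch below, on made-up numbers, compares this quantity with the root-mean-square error normalized in the same way; the two can differ noticeably when a few predictions miss by a large margin.

```python
import numpy as np

true_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # made-up "true" values
pred_demo = np.array([0.0, 1.0, 2.0, 3.0, 8.0])   # one large miss

err = true_demo - pred_demo
nmae = np.abs(err).mean() / true_demo.std()           # normalized mean absolute error
nrmse = np.sqrt((err ** 2).mean()) / true_demo.std()  # normalized root-mean-square error

print(f"NMAE  = {nmae:.2f}")
print(f"NRMSE = {nrmse:.2f}")
```

The squared-error variant penalizes large individual errors more strongly, so it is the more conservative choice when occasional large misses matter.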
