*First effort in Time Series Analysis and Modeling.*

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys

from sklearn.preprocessing import StandardScaler

sns.set()

In [None]:
colors = ["#4C72B0","#00A8B0", '#AFCB38', "#2C934F", "#0A014F", "#268DB0", "#333232", "#653239", "#3C5A14", "#FEA82F"]
sns.set_palette(sns.color_palette(colors))

# Challenge

Each dataset represents a different kind of waterbody. As each waterbody is different from the others, the related features are also different. For instance, the features of a water spring differ from those of a lake. These differences are expected, given the unique behavior and characteristics of each waterbody. The Acea Group deals with four different types of waterbodies: water springs, lakes, rivers and aquifers.

# Aquifers

## Auser

This waterbody consists of two subsystems, called NORTH and SOUTH, where the former partly influences the behavior of the latter. Indeed, the north subsystem is a water table (or unconfined) aquifer while the south subsystem is an artesian (or confined) groundwater. The levels of the NORTH sector are represented by the values of the SAL, PAG, CoS and DIEC wells, while the levels of the SOUTH sector by the LT2 well.

**Predict**: Depth_to_Groundwater_SAL, Depth_to_Groundwater_CoS, Depth_to_Groundwater_LT2

In [None]:
auser = pd.read_csv('Aquifer_Auser.csv', parse_dates = [0])
print(auser.shape)
auser.head()

In [None]:
auser.describe()

### Data cleaning

In [None]:
def missing_data(df):
    is_null_data = df.isnull()
    total = is_null_data.sum()
    percent = ((total/is_null_data.count())*100)
    missing_data = pd.concat([total, percent], axis=1, keys=['Total Missing', 'Percent'])
    
    return missing_data.sort_values(by = 'Percent', ascending=False)

In [None]:
missing_data(auser)

Depth_to_Groundwater_DIEC and Depth_to_Groundwater_PAG contain more than 50% missing data, and neither is one of the values to predict, so I decided to remove these columns.

In [None]:
auser.drop(['Depth_to_Groundwater_DIEC', 'Depth_to_Groundwater_PAG'], axis = 1, inplace = True)
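
The same rule (more than half of the values missing, and not a prediction target) can be written down as a small helper. This is only a sketch on a toy frame: `columns_to_drop`, its `threshold` parameter and the toy column names are my own illustration, not part of the dataset or of the cells above.

```python
import numpy as np
import pandas as pd

def columns_to_drop(df, targets, threshold=0.5):
    """Return non-target columns whose share of missing values exceeds threshold."""
    # isnull().mean() gives the fraction of missing values per column
    missing_share = df.isnull().mean()
    candidates = missing_share[missing_share > threshold].index
    return [col for col in candidates if col not in targets]

# Toy data (hypothetical column names)
toy = pd.DataFrame({
    'Depth_A': [1.0, np.nan, np.nan, np.nan],   # 75% missing, not a target
    'Depth_B': [np.nan, np.nan, np.nan, 2.0],   # 75% missing, but a target
    'Rainfall_X': [0.0, 1.0, np.nan, 2.0],      # 25% missing
})

to_drop = columns_to_drop(toy, targets=['Depth_B'])
print(to_drop)
```

Only `Depth_A` is selected here: `Depth_B` is just as sparse, but it is protected because it is listed as a target.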

In [None]:
def missing_data_rows(df):
    is_null_data = df.isnull()
    total = is_null_data.sum(axis = 1)
    percent = ((total/is_null_data.count(axis = 1))*100)
    missing_data = pd.concat([total, percent], axis=1, keys=['Total Missing', 'Percent'])
    
    return missing_data.sort_values(by = 'Percent', ascending=False)

In [None]:
missing_data_rows(auser)

Some rows are missing most of their values. I decided to delete every row in which at least 70% of the columns are empty.

In [None]:
def remove_empty_rows(df):
    # finding dataframe columns number
    df_len = df.shape[1]
    
    #define 70% size of it
    max_empty_count = df_len * 70 / 100
    
    current_size = df.shape[0]
    df = df[df.isnull().sum(axis = 1) < max_empty_count]
    new_size = df.shape[0]
    
    print('deleted_rows : {}'.format(current_size - new_size))
    return df

In [None]:
auser = remove_empty_rows(auser)

In [None]:
def fillna_monthly(df):
    #work on a copy and fill only the numeric columns, 'Date' is left untouched
    df = df.copy()
    year = df['Date'].dt.year.rename('Year')
    month = df['Date'].dt.month.rename('Month')
    num_cols = df.select_dtypes(include='number').columns
    
    #fill NaN with mean over year and month
    df[num_cols] = df[num_cols].fillna(df.groupby([year, month])[num_cols].transform('mean'))
    
    #fill remaining NaN with mean over the calendar month (across all years)
    df[num_cols] = df[num_cols].fillna(df.groupby(month)[num_cols].transform('mean'))
    
    return df

In [None]:
auser = fillna_monthly(auser)
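
To make the two filling stages easier to follow, here is a minimal sketch on a made-up series (the dates, the `Rainfall_X` name and the values are assumptions for illustration only). January 2020 has no observations at all, so the year-and-month mean cannot fill it, and the second stage falls back to the mean of all Januaries.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(['2019-01-01', '2019-01-02', '2020-01-01', '2020-01-02'])
toy = pd.DataFrame({'Date': dates, 'Rainfall_X': [2.0, 4.0, np.nan, np.nan]})

year = toy['Date'].dt.year.rename('Year')
month = toy['Date'].dt.month.rename('Month')

# Stage 1: mean of the same year and month. January 2020 is completely empty,
# so its group mean is NaN and the gaps survive this stage.
stage1 = toy['Rainfall_X'].fillna(
    toy.groupby([year, month])['Rainfall_X'].transform('mean'))

# Stage 2: mean of the same calendar month over all years (here January 2019).
stage2 = stage1.fillna(stage1.groupby(month).transform('mean'))

print(stage1.tolist())
print(stage2.tolist())
```

Note that the second stage fills a whole missing month with a long-term average, which smooths away anything unusual about that particular month.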

### EDA

In [None]:
def combine_columns(df):
    #rainfalls columns
    rain_bool = df.columns.str.contains('Rain', case=False) #boolean mask over the columns
    rain_cols = df.iloc[:, rain_bool].columns #names of the columns where the mask is True
    df['Total_Rainfalls_Mean'] = df[rain_cols].mean(axis = 1, skipna=True) 
    df.drop(rain_cols, axis = 1, inplace = True) #remove TRUE columns from dataset
    
    #temperature columns
    temperature_bool = df.columns.str.contains('Temperature', case=False) 
    temperature_cols = df.iloc[:, temperature_bool].columns
    df['Total_Temperature_Mean'] = df[temperature_cols].mean(axis = 1, skipna=True) 
    df.drop(temperature_cols, axis = 1, inplace = True)
    
    #volume columns
    volume_bool = df.columns.str.contains('Volume', case=False) 
    volume_cols = df.iloc[:, volume_bool].columns
    df['Total_Volume_Mean'] = df[volume_cols].mean(axis = 1, skipna=True) 
    df.drop(volume_cols, axis = 1, inplace = True)
    
    #hydrometry_columns
    hydrometry_bool = df.columns.str.contains('Hydrometry', case=False)
    #not all datasets have hydrometry
    if hydrometry_bool.any():
        hydrometry_cols = df.iloc[:, hydrometry_bool].columns
        df['Total_Hydrometry_Mean'] = df[hydrometry_cols].mean(axis = 1, skipna=True) 
        df.drop(hydrometry_cols, axis = 1, inplace = True)
    
    return df

In [None]:
combine_columns(auser)

The rainfall columns have similar mean values, but their standard deviations differ. Let's look at the feature distributions.

In [None]:
#feature distributions
f, axs = plt.subplots(2, 2, figsize=(15, 6))
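#sns.distplot is deprecated; histplot with a KDE overlay (seaborn >= 0.11) draws the same kind of plot
sns.histplot(auser['Total_Rainfalls_Mean'], kde=True, stat='density', ax=axs[0,0])
sns.histplot(auser['Total_Temperature_Mean'], kde=True, stat='density', ax=axs[0,1])
sns.histplot(auser['Total_Volume_Mean'], kde=True, stat='density', ax=axs[1,0])
sns.histplot(auser['Total_Hydrometry_Mean'], kde=True, stat='density', ax=axs[1,1])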
f.tight_layout()

Let's look at how the features changed over the years.

In [None]:
#relationship between date and features
f, axs = plt.subplots(2, 2, figsize=(15, 6))
sns.lineplot(data=auser, x="Date", y="Total_Rainfalls_Mean", ax=axs[0,0])
sns.lineplot(data=auser, x="Date", y="Total_Temperature_Mean", ax=axs[0,1])
sns.lineplot(data=auser, x="Date", y="Total_Volume_Mean",ax=axs[1,0])
sns.lineplot(data=auser, x="Date", y="Total_Hydrometry_Mean", ax=axs[1,1])
f.tight_layout()

Let's look at how the groundwater levels changed over the years.

In [None]:
#relationship between date and groundwaters
f, axs = plt.subplots(1, 3, figsize=(15, 3))
sns.lineplot(data=auser, x="Date", y="Depth_to_Groundwater_SAL", ax=axs[0])
sns.lineplot(data=auser, x="Date", y="Depth_to_Groundwater_CoS", ax=axs[1])
sns.lineplot(data=auser, x="Date", y="Depth_to_Groundwater_LT2", ax=axs[2])
f.tight_layout()

Let's look at how the groundwater levels relate to the features.

In [None]:
#relationship between groundwaters and features
f, axs = plt.subplots(3, 4, figsize=(15, 10))

#Depth_to_Groundwater_SAL
sns.lineplot(data=auser, x="Depth_to_Groundwater_SAL", y="Total_Rainfalls_Mean", dashes=False, ax=axs[0,0])
sns.lineplot(data=auser, x="Depth_to_Groundwater_SAL", y="Total_Temperature_Mean", dashes=False, ax=axs[0,1])
sns.lineplot(data=auser, x="Depth_to_Groundwater_SAL", y="Total_Volume_Mean", dashes=False, ax=axs[0,2])
sns.lineplot(data=auser, x="Depth_to_Groundwater_SAL", y="Total_Hydrometry_Mean", dashes=False, ax=axs[0,3])

#Depth_to_Groundwater_CoS
sns.lineplot(data=auser, x="Depth_to_Groundwater_CoS", y="Total_Rainfalls_Mean", dashes=False, ax=axs[1,0])
sns.lineplot(data=auser, x="Depth_to_Groundwater_CoS", y="Total_Temperature_Mean", dashes=False, ax=axs[1,1])
sns.lineplot(data=auser, x="Depth_to_Groundwater_CoS", y="Total_Volume_Mean", dashes=False, ax=axs[1,2])
sns.lineplot(data=auser, x="Depth_to_Groundwater_CoS", y="Total_Hydrometry_Mean", dashes=False, ax=axs[1,3])

#Depth_to_Groundwater_LT2
sns.lineplot(data=auser, x="Depth_to_Groundwater_LT2", y="Total_Rainfalls_Mean", dashes=False, ax=axs[2,0])
sns.lineplot(data=auser, x="Depth_to_Groundwater_LT2", y="Total_Temperature_Mean", dashes=False, ax=axs[2,1])
sns.lineplot(data=auser, x="Depth_to_Groundwater_LT2", y="Total_Volume_Mean", dashes=False, ax=axs[2,2])
sns.lineplot(data=auser, x="Depth_to_Groundwater_LT2", y="Total_Hydrometry_Mean", dashes=False, ax=axs[2,3])
f.tight_layout()

The year 2020 looks bad for groundwater: the levels fall sharply.

In [None]:
def feature_correlation_visual(df, columns, scaler = False):
    #columns[0] is the x axis (Date), columns[1] the feature, the rest are groundwater columns
    temp_data = pd.DataFrame(df, columns = columns).set_index([columns[0]])
    
    if scaler:
        #standardize the feature only (zero mean, unit variance)
        temp_data[columns[1]] = StandardScaler().fit_transform(temp_data[[columns[1]]])
        
    plt.figure(figsize=(20, 7))
    sns.lineplot(data=temp_data, dashes=False)
    plt.title('Correlation between {} and Groundwaters'.format(columns[1]), fontdict = {'fontsize': 16, 'verticalalignment': 'bottom'})
    #label only the plotted lines; the Date index is not a line, so it must not appear in the legend
    plt.legend(columns[1:], loc='lower left');

In [None]:
#relationship between Rainfalls and groundwaters
feature_correlation_visual(auser, ['Date', 'Total_Rainfalls_Mean', 'Depth_to_Groundwater_SAL', 'Depth_to_Groundwater_CoS','Depth_to_Groundwater_LT2'], scaler=True)

The correlation between rainfall and groundwater is visible: the series share some similar peaks. The groundwater response is delayed in time, which is expected, since rain does not reach the groundwater at once.

SAL and CoS also appear to be correlated with each other. That makes sense, as both belong to the northern sector of the aquifer, while LT2 belongs to the southern one.
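
The delay can be estimated by shifting the feature and checking at which shift it correlates best with the groundwater column. The sketch below uses synthetic data in which the level simply repeats the rain 10 rows later; `lagged_correlation` is a hypothetical helper of mine, and it assumes one row per day with no gaps in the dates.

```python
import numpy as np
import pandas as pd

def lagged_correlation(df, feature, target, max_lag=60):
    """Pearson correlation between the feature shifted by 0..max_lag rows and the target."""
    return pd.Series(
        {lag: df[feature].shift(lag).corr(df[target]) for lag in range(max_lag + 1)},
        name='correlation',
    )

# Synthetic daily data: the target repeats the feature 10 days later.
rng = np.random.default_rng(0)
rain = pd.Series(rng.gamma(shape=0.5, scale=5.0, size=400))
toy = pd.DataFrame({'rain': rain, 'level': rain.shift(10)})

corr_by_lag = lagged_correlation(toy, 'rain', 'level', max_lag=30)
best_lag = corr_by_lag.idxmax()
print(best_lag)
```

On the real data the call would look like `lagged_correlation(auser, 'Total_Rainfalls_Mean', 'Depth_to_Groundwater_SAL')`. I would not expect a single clean peak there, because the real response is spread over many days.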

In [None]:
#relationship between Temperature and groundwaters
feature_correlation_visual(auser, ['Date', 'Total_Temperature_Mean', 'Depth_to_Groundwater_SAL', 'Depth_to_Groundwater_CoS','Depth_to_Groundwater_LT2'], scaler=True)

Again, there is a correlation with a delay in time.

In [None]:
#relationship between Volume and groundwaters
feature_correlation_visual(auser, ['Date', 'Total_Volume_Mean', 'Depth_to_Groundwater_SAL', 'Depth_to_Groundwater_CoS','Depth_to_Groundwater_LT2'], scaler=True)

I am very interested in the data from 2020: what could have influenced the water so much?

In [None]:
plt.figure(figsize=(14, 4))
sns.heatmap(auser.drop(columns='Date').corr(), cbar=True, annot=True, square=True, annot_kws={'size': 12}, 
            cmap=["#4C72B0", "#708EBF", "#9CAFD1", "#C3CDE2", "#EAEAF2"])
plt.show()

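The heatmap doesn't show any strong correlation between the variables. I still think the relationship between rainfall and groundwater is stronger than this suggests, because rainfall is the only source that feeds the groundwater. The heatmap compares values from the same day, so the delay noted above would hide part of that relationship.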
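
A quick way to see why a same-day correlation can understate the link is to compare it with the correlation against accumulated rainfall. This is a sketch on synthetic data only: the 30-row window and the assumption that the level equals the accumulated rain are invented for illustration and say nothing about the real aquifer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
rain = pd.Series(rng.gamma(shape=0.5, scale=5.0, size=1000))

# Synthetic assumption: the level responds to the rain accumulated over 30 days.
level = rain.rolling(30).sum()

# Same-day correlation, which is what a plain corr() heatmap reports
daily_corr = rain.corr(level)

# Correlation after accumulating the rain over the same 30-day window
rolling_corr = rain.rolling(30).sum().corr(level)

print(round(daily_corr, 2), round(rolling_corr, 2))
```

Even though the level here depends on nothing but rain, the same-day correlation stays well below the one for the accumulated rain. Adding a column such as `auser['Total_Rainfalls_Mean'].rolling(30).sum()` before drawing the heatmap would be one way to test the idea on the real data.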

## Doganella

The Doganella wells field is fed by two underground aquifers, which are recharged not by rivers or lakes but by meteoric infiltration. The upper aquifer is a water table with a thickness of about 30m. The lower aquifer is a semi-confined artesian aquifer with a thickness of 50m and is located inside lavas and tufa products. These aquifers are accessed through wells called Well 1, ..., Well 9. Approximately 80% of the drainage volumes come from the artesian aquifer. The aquifer levels are influenced by the following parameters: rainfall, humidity, subsoil, temperatures and drainage volumes.

**Predict**: Depth_to_Groundwater_Pozzo_1, ..., Depth_to_Groundwater_Pozzo_9

In [None]:
doganella = pd.read_csv('Aquifer_Doganella.csv',  parse_dates = [0])
print(doganella.shape)
doganella.head()

In [None]:
doganella.describe()

In [None]:
missing_data(doganella)

In [None]:
missing_data_rows(doganella)

In [None]:
doganella = remove_empty_rows(doganella)

In [None]:
doganella = fillna_monthly(doganella)

In [None]:
combine_columns(doganella)

In [None]:
doganella.columns

In [None]:
#relationship between Rainfalls and groundwaters
feature_correlation_visual(doganella, ['Date', 'Total_Rainfalls_Mean', 'Depth_to_Groundwater_Pozzo_1', 'Depth_to_Groundwater_Pozzo_2',
       'Depth_to_Groundwater_Pozzo_3', 'Depth_to_Groundwater_Pozzo_4',
       'Depth_to_Groundwater_Pozzo_5', 'Depth_to_Groundwater_Pozzo_6',
       'Depth_to_Groundwater_Pozzo_7', 'Depth_to_Groundwater_Pozzo_8',
       'Depth_to_Groundwater_Pozzo_9'], scaler=False)

In [None]:
#relationship between Temperature and groundwaters
feature_correlation_visual(doganella, ['Date', 'Total_Temperature_Mean', 'Depth_to_Groundwater_Pozzo_1', 'Depth_to_Groundwater_Pozzo_2',
       'Depth_to_Groundwater_Pozzo_3', 'Depth_to_Groundwater_Pozzo_4',
       'Depth_to_Groundwater_Pozzo_5', 'Depth_to_Groundwater_Pozzo_6',
       'Depth_to_Groundwater_Pozzo_7', 'Depth_to_Groundwater_Pozzo_8',
       'Depth_to_Groundwater_Pozzo_9'], scaler=False)

In [None]:
plt.figure(figsize=(16, 6))
sns.heatmap(doganella.drop(columns='Date').corr(), cbar=True, annot=True, square=True, annot_kws={'size': 12}, 
            cmap=["#4C72B0", "#708EBF", "#9CAFD1", "#C3CDE2", "#EAEAF2"])
plt.show()

## Luco

The Luco wells field is fed by an underground aquifer. This aquifer is not fed by rivers or lakes but by meteoric infiltration at the extremes of the impermeable sedimentary layers. The aquifer is accessed through wells called Well 1, Well 3 and Well 4 and is influenced by the following parameters: rainfall, depth to groundwater, temperature and drainage volumes.

**Predict**: Depth_to_Groundwater_Podere_Casetta

In [None]:
luco = pd.read_csv('Aquifer_Luco.csv', parse_dates = [0])
print(luco.shape)
luco.head()

In [None]:
luco.describe()

In [None]:
missing_data(luco)

In [None]:
luco.drop(['Depth_to_Groundwater_Pozzo_3', 'Depth_to_Groundwater_Pozzo_4', 'Depth_to_Groundwater_Pozzo_1',
          'Rainfall_Siena_Poggio_al_Vento', 'Rainfall_Ponte_Orgia', 'Rainfall_Mensano', 'Volume_Pozzo_4',
          'Volume_Pozzo_3', 'Volume_Pozzo_1'], axis = 1, inplace = True)

In [None]:
missing_data_rows(luco)

In [None]:
luco = remove_empty_rows(luco)

In [None]:
luco = fillna_monthly(luco)

In [None]:
combine_columns(luco)
luco.drop('Total_Volume_Mean', axis = 1, inplace = True)

In [None]:
luco

In [None]:
plt.figure(figsize=(20, 7))
sns.lineplot(data=luco.set_index('Date'), dashes=False)
plt.title('Correlation between Rainfalls, Temperature and Groundwaters', fontdict = {'fontsize': 16, 'verticalalignment': 'bottom'})
plt.legend(luco.columns[1:], loc='lower left');

In [None]:
plt.figure(figsize=(6, 3))
sns.heatmap(luco.drop(columns='Date').corr(), cbar=True, annot=True, square=True, annot_kws={'size': 12}, 
            cmap=["#4C72B0", "#708EBF", "#9CAFD1", "#C3CDE2", "#EAEAF2"])
plt.show()

## Petrignano

The wells field of the alluvial plain between Ospedalicchio di Bastia Umbra and Petrignano is fed by three underground aquifers separated by low permeability septa. The aquifer can be considered a water table groundwater and is also fed by the Chiascio river. The groundwater levels are influenced by the following parameters: rainfall, depth to groundwater, temperatures and drainage volumes, level of the Chiascio river.

**Predict**: Depth_to_Groundwater_P24, Depth_to_Groundwater_P25

In [None]:
petrignano = pd.read_csv('Aquifer_Petrignano.csv', parse_dates = [0])
print(petrignano.shape)
petrignano.head()

In [None]:
petrignano.describe()

In [None]:
missing_data(petrignano)

In [None]:
missing_data_rows(petrignano)

In [None]:
petrignano = remove_empty_rows(petrignano)

In [None]:
petrignano = fillna_monthly(petrignano)

In [None]:
combine_columns(petrignano)

In [None]:
#relationship between Rainfalls and groundwaters
feature_correlation_visual(petrignano, ['Date', 'Total_Rainfalls_Mean', 'Depth_to_Groundwater_P24', 'Depth_to_Groundwater_P25'])

In [None]:
#relationship between Temperature and groundwaters
feature_correlation_visual(petrignano, ['Date', 'Total_Temperature_Mean', 'Depth_to_Groundwater_P24', 'Depth_to_Groundwater_P25'])

In [None]:
#relationship between Hydrometry and groundwaters
feature_correlation_visual(petrignano, ['Date', 'Total_Hydrometry_Mean', 'Depth_to_Groundwater_P24', 'Depth_to_Groundwater_P25'])

## Aquifers Summary

Let's have fun and look at what has been going on with rainfall and temperature in Italy over the years.

In [None]:
#all four frames use the same column names, so suffix them with the aquifer name
#to keep the merged columns unique (chained merges would repeat the '_x'/'_y' suffixes)
cols = ['Total_Rainfalls_Mean', 'Total_Temperature_Mean']
frames = {'auser': auser, 'doganella': doganella, 'luco': luco, 'petrignano': petrignano}

aquifers = auser[['Date']]
for name, frame in frames.items():
    renamed = frame[['Date'] + cols].rename(columns={c: '{}_{}'.format(c, name) for c in cols})
    aquifers = aquifers.merge(renamed, on='Date', how='left')

combine_columns(aquifers)

In [None]:
aquifers['Year'] = aquifers.loc[:, 'Date'].dt.year
temp = aquifers.groupby('Year')[['Total_Rainfalls_Mean', 'Total_Temperature_Mean']]\
                            .agg({'Total_Rainfalls_Mean': ['sum'], 'Total_Temperature_Mean':['mean', 'max']}).reset_index()

f, axs = plt.subplots(1, 3, figsize=(20, 6))
sns.barplot(x=temp['Total_Rainfalls_Mean','sum'], y=temp['Year'], orient='h', color='#00A8B0', ax=axs[0])
sns.barplot(x=temp['Total_Temperature_Mean','mean'], y=temp['Year'], orient='h', color='#00A8B0', ax=axs[1])
sns.barplot(x=temp['Total_Temperature_Mean','max'], y=temp['Year'], orient='h', color='#00A8B0', ax=axs[2]);

Rainfall in 2020 was much lower than in previous years. Temperature, however, did not increase much, and I cannot see a trend in this data.
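
One caveat I have not verified here: a yearly rainfall sum is only comparable between years if every year contains the same number of observations. The sketch below uses a made-up series that rains the same amount every day but stops in the middle of the second year, and compares the count, sum and mean per year.

```python
import pandas as pd

# Toy data: a constant 2.0 every day, ending halfway through 2020 (assumed dates)
dates = pd.date_range('2019-01-01', '2020-06-30', freq='D')
toy = pd.DataFrame({'Date': dates, 'Total_Rainfalls_Mean': 2.0})

yearly = toy.groupby(toy['Date'].dt.year)['Total_Rainfalls_Mean'].agg(['count', 'sum', 'mean'])
print(yearly)
```

The sum for the shorter year is about half of the full year, although the daily mean is identical. Checking `count` per year on `aquifers` would show whether the low 2020 total reflects drier weather or simply fewer rows.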

# Model