<a id="top"></a>
# Notebook Contents

In this notebook I share the kind of analysis I usually perform on Time Series Data. 

The notebook is divided into the following sections and subsections: 

[**1. Correlation Analysis**](#correlation)<br>
    1.1 [*Example Correlation Matrix for Water Spring Amiata*](#correlation_amiata)<br>
    1.2 [*Example Correlation Matrix for Water Spring Amiata just targets*](#correlation_amiata_targets)<br>
    1.3 [*Correlation Matrix for all Acea datasets*](#correlation_all)<br>
    1.4 [*Correlation Matrix for all Acea datasets just targets*](#correlation_all_targets)<br>
    1.5 [*What to do with it*](#use_correlation)<br>
    
    
[**2. AutoCorrelation Analysis**](#autocorrelation) <br>
    2.1 [*Stationarity Test*](#stationarity_test)<br>
    2.2 [*ACF and PACF plots stationary vs non stationary*](#acf_pacf_comparison)<br>
    2.3 [*ACF and PACF plots all Acea datasets*](#acf_pacf_all)<br>
    2.4 [*ACF and PACF plots all Acea datasets just targets*](#acf_pacf_all_targets)<br>
    2.5 [*What to do with it*](#use_acf_pacf)<br>

## **CURRENTLY UNDER CONSTRUCTION**

[**3. CrossCorrelation Analysis**](#crosscorrelation)<br> 
    3.1 [*Most crosscorrelated features/lags*](#most_cross)<br>
    3.2 [*What to do with it*](#use_cross)<br>
    
[**4. Spurious Correlations**](#spurious)<br>
    4.1 [*Revisiting Correlations*](#correlation_redo)<br>
    4.2 [*What to do with it*](#use_spurious)<br>

- **Causality** (maybe later)

<div class="row">
  <div class="column">
    <img src="https://www.gruppo.acea.it/content/dam/acea-corporate/acea-foundation/immagini/al-servizio-delle-persone/hub/acea-acqua-760x425.jpg" align="left" style="width:50%">
  </div>
  <div class="column">
    <img src="https://i.imgur.com/IxrgRGl.png" style="width:50%" align="right">
  </div>
</div>

<br>


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.options.display.max_columns = 30
import os
import re
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_pacf, plot_acf
import seaborn as sns
import tqdm
import itertools
import matplotlib
from matplotlib import pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 0})
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')

root_path = '/kaggle/input/acea-water-prediction'
data_files = [i for i in os.listdir(root_path) if re.match(r".+\.csv$", i)]  # raw string avoids an invalid-escape warning
data_files.sort()
data_names = [i.replace('.csv', '') for i in data_files]
data_files = list(map(lambda x: os.path.join(root_path, x), data_files))

waterbody_type = [re.match("water_spring|aquifer|river|lake", i.lower())[0] for i in data_names]

def get_df_basic_information(df, waterbody_type, df_name): 
    
    n_rows, n_columns = df.shape
    
    mb_size = round(df.memory_usage(deep=True).sum()/1000000., 3)
    
    print("""{0}{1}\n
          N rows: {2}\tN columns: {3}\n
          Memory Usage: {4} Mb\n\n\n""".format(color_dict[waterbody_type], df_name,
                                           n_rows, n_columns, mb_size))
    
def crosscorr(datax, datay, lag=0):
    """ Lag-N cross correlation. 
    Parameters
    ----------
    lag : int, default 0
    datax, datay : pandas.Series objects of equal length

    Returns
    ----------
    crosscorr : float
    """
    return datax.corr(datay.shift(lag))
    
def chunks_old(l, n):
    """ Yield n successive chunks from l.
    """
    newn = int(len(l) / n)
    for i in range(0, n-1):
        yield l[i*newn:i*newn+newn]
    yield l[n*newn-newn:]
    
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
    
df_dict = dict(zip(data_names, list(map(lambda x:pd.read_csv(x), data_files))))

for name in data_names:
    df_dict[name] = df_dict[name].loc[~df_dict[name]['Date'].isna()]
    df_dict[name]['Date'] = pd.to_datetime(df_dict[name]['Date'], format = "%d/%m/%Y").dt.date
    df_dict[name].sort_values('Date', ignore_index = True, inplace = True)
    
target_dict = {'Aquifer_Auser' : ['Depth_to_Groundwater_LT2', 'Depth_to_Groundwater_SAL', 'Depth_to_Groundwater_CoS'],
               'Aquifer_Doganella' : ['Depth_to_Groundwater_Pozzo_1','Depth_to_Groundwater_Pozzo_2','Depth_to_Groundwater_Pozzo_3',
                                      'Depth_to_Groundwater_Pozzo_4','Depth_to_Groundwater_Pozzo_5','Depth_to_Groundwater_Pozzo_6',
                                      'Depth_to_Groundwater_Pozzo_7','Depth_to_Groundwater_Pozzo_8','Depth_to_Groundwater_Pozzo_9'],
               'Aquifer_Luco' : ['Depth_to_Groundwater_Podere_Casetta'],
               'Aquifer_Petrignano' : ['Depth_to_Groundwater_P24','Depth_to_Groundwater_P25'],
               'Lake_Bilancino': ['Lake_Level','Flow_Rate'],
               'River_Arno': ['Hydrometry_Nave_di_Rosano'],
               'Water_Spring_Amiata': ['Flow_Rate_Bugnano','Flow_Rate_Arbure','Flow_Rate_Ermicciolo','Flow_Rate_Galleria_Alta'],
               'Water_Spring_Lupa': ['Flow_Rate_Lupa'],
               'Water_Spring_Madonna_di_Canneto': ['Flow_Rate_Madonna_di_Canneto']}

<a id="correlation"></a>

## 1. Correlation Analysis

### Correlation Definition

One of the pillars of any EDA is to look for correlation between variables, i.e. the *Pearson correlation coefficient*. The Pearson correlation coefficient between two random variables $X$ and $Y$ is defined as: 

${\displaystyle \rho _{X,Y}={\frac {\operatorname {\mathbb {E} } [(X-\mu _{X})(Y-\mu _{Y})]}{\sigma _{X}\sigma _{Y}}}}$ , further details [here](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Definition)

Actually what we use is the *Sample Pearson Correlation Coefficient*, defined as: 

${\displaystyle r_{xy}={\frac {\sum _{i=1}^{n}(x_{i}-{\bar {x}})(y_{i}-{\bar {y}})}{{\sqrt {\sum _{i=1}^{n}(x_{i}-{\bar {x}})^{2}}}{\sqrt {\sum _{i=1}^{n}(y_{i}-{\bar {y}})^{2}}}}}}$, where $\mathbf{x},\mathbf{y}$ are none other than two columns (of length $n$) of your dataframe ($\bar{x}$, $\bar{y}$ being their sample means).

This coefficient takes values in $[-1,1]$ and we could sum it up like [this](https://machinelearningmastery.com/how-to-use-correlation-to-understand-the-relationship-between-variables/#:~:text=The%20statistical%20relationship%20between%20two,the%20other%20variables'%20values%20decrease.): 

- **Positive** Correlation: both variables change in the *same direction*.
- **Neutral** Correlation: *no relationship in the change* of the variables.
- **Negative** Correlation: variables change in *opposite directions*.

Notice also that it's symmetric ($r_{xy}=r_{yx}$) and that $r_{xx}=1$.
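
As a quick sanity check, the sample formula can be written out term by term and compared with what pandas returns. This is only a minimal sketch: the toy values and the helper name `sample_pearson` are made up for illustration and are not part of the Acea data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the Acea datasets)
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.1, 3.9, 6.2, 8.1, 9.8])

def sample_pearson(x, y):
    """Sample Pearson correlation coefficient, term by term as in the formula."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / (np.sqrt((dx ** 2).sum()) * np.sqrt((dy ** 2).sum()))

r_manual = sample_pearson(x, y)
r_pandas = x.corr(y)  # pandas computes the Pearson coefficient by default

print(r_manual, r_pandas)
```

The two numbers should agree up to floating point error, which is why the rest of the notebook can rely on `DataFrame.corr()` directly.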

### Correlation on Acea datasets

Let's now focus on the Acea datasets and calculate the correlation between the columns of each dataset. 
Given that the matrix is symmetric, we will plot just the lower triangular part and skip the diagonal, since it's all 1s. 

I will show just some of the plots to avoid clogging the notebook; the others are hidden, and you can see them simply by unhiding them. 

The subsections are: 

[*Example Correlation Matrix for Water Spring Amiata*](#correlation_amiata) <br>

[*Example Correlation Matrix for Water Spring Amiata just targets*](#correlation_amiata_targets) <br>

[*Correlation Matrix for all Acea datasets*](#correlation_all) <br>

[*Correlation Matrix for all Acea datasets just targets*](#correlation_all_targets) <br>


<a id="correlation_amiata"></a>
#### Example Correlation Matrix for Water Spring Amiata

In [None]:
#Change this flag if you want to see all correlation values, not just those whose absolute value exceeds THRESHOLD

PLOT_JUST_HIGH_VALS = True
THRESHOLD = 0.7

In [None]:
colors = sns.color_palette('coolwarm', 21)
levels = np.linspace(-1, 1, 21)
cmap_plot, norm = matplotlib.colors.from_levels_and_colors(levels, colors, extend="max")

data_name = 'Water_Spring_Amiata'
df = df_dict[data_name]
n_cols = df.shape[1]

corr_matrix = round(df.drop('Date', axis = 1).corr(), 2)
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

if PLOT_JUST_HIGH_VALS:
    mask[(abs(corr_matrix) < THRESHOLD) & (mask == False)] = True
        
fig, ax = plt.subplots(1, 1, figsize = (16, 10))

sns.heatmap(corr_matrix, mask=mask, annot=True, ax = ax, 
            vmin = -1, vmax = 1, 
            cmap = cmap_plot, norm = norm, annot_kws={"size": 9})
ax.hlines(range(0, n_cols+1), *ax.get_xlim(), lw=1, linestyles = 'dashed')
ax.vlines(range(0, n_cols+1), *ax.get_ylim(), lw=1, linestyles = 'dashed')  # vertical lines span the y range
ax.xaxis.set_ticks_position('top')
ax.xaxis.label.set_size(11)
ax.tick_params(axis='both', which='major', labelsize=10)
plt.title('{}'.format(data_name.upper()))
plt.xticks(rotation=285)
fig.show()

***In this case we have rainfalls and temperatures positively correlated (probably because the areas are near each other), while we have depth to groundwater variables strongly negatively correlated with Flow_Rate_Galleria_Alta.***

<a id="correlation_amiata_targets"></a>

In [None]:
data_name = 'Water_Spring_Amiata'
df = df_dict[data_name]
n_cols = df.shape[1]

corr_matrix = round(df.drop('Date', axis = 1).corr(), 2)
mask = np.zeros_like(corr_matrix, dtype=bool)
if PLOT_JUST_HIGH_VALS:
    mask[(abs(corr_matrix) < THRESHOLD) & (mask == False)] = True
    for j in range(mask.shape[1]):
        mask[j,j] = True
    n_cols = df.shape[1]
    
target_cols = target_dict[data_name]
df_cols = corr_matrix.columns.tolist()
indices = [df_cols.index(i) for i in target_cols]
corr_matrix = corr_matrix[target_cols]
mask = mask[:, indices]

fig, ax = plt.subplots(1, 1, figsize = (16, 10))

sns.heatmap(corr_matrix, mask=mask, annot=True, ax = ax, 
            vmin = -1, vmax = 1, 
            cmap = cmap_plot, norm = norm, annot_kws={"size": 9})
ax.hlines(range(0, n_cols+1), *ax.get_xlim(), lw=1, linestyles = 'dashed')
ax.vlines(range(0, n_cols+1), *ax.get_ylim(), lw=1, linestyles = 'dashed')
ax.xaxis.set_ticks_position('top')
ax.xaxis.label.set_size(11)
ax.tick_params(axis='both', which='major', labelsize=10)
plt.title('{} just targets'.format(data_name.upper()))
plt.xticks(rotation=285)
fig.show()

Unhide to see how correlated features look over time. 

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

data_name = 'Water_Spring_Amiata'
df = df_dict[data_name]
df = df.loc[df.Date.astype(str) > '2018-01-01']
df_std = df.copy()
df_std.iloc[:, 1:] = sc.fit_transform(df.iloc[:, 1:])
df_std.iloc[:, 1:] = df_std.iloc[:, 1:].rolling(15).mean()
df = df_std.copy()

fig, axes = plt.subplots(1, 2, figsize = (20, 12))
ax = axes.ravel()
df[['Depth_to_Groundwater_S_Fiora_8', 'Date']].plot(x = 'Date', ax = ax[0], lw = 3)
df[['Flow_Rate_Galleria_Alta', 'Date']].plot(x = 'Date', ax = ax[0], lw = 1)
ax[0].set(title = 'Negative Correlation')
df[['Flow_Rate_Arbure', 'Date']].plot(x = 'Date', ax = ax[1], lw = 1)
df[['Flow_Rate_Bugnano', 'Date']].plot(x = 'Date', ax = ax[1], lw = 1)
ax[1].set(title = 'Positive Correlation')

Of course we are interested in the correlation among all variables, also between pairs of non targets ([collinearity](https://en.wikipedia.org/wiki/Multicollinearity#Definition) may be an issue for some linear-based models). 

<a id="correlation_all"></a>
Unhide the following to show correlation matrices for all datasets in Acea. 

In [None]:
colors = sns.color_palette('coolwarm', 21)
levels = np.linspace(-1, 1, 21)
cmap_plot, norm = matplotlib.colors.from_levels_and_colors(levels, colors, extend="max")

for j in range(len(data_names)):
    
    data_name = data_names[j]
    df = df_dict[data_name]
    n_cols = df.shape[1]
    
    corr_matrix = round(df.drop('Date', axis = 1).corr(), 2)
    mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
    
    if PLOT_JUST_HIGH_VALS:
        mask[(abs(corr_matrix) < THRESHOLD) & (mask == False)] = True
    
    fig, ax = plt.subplots(1, 1, figsize = (16, 10))
    
    sns.heatmap(corr_matrix, mask=mask, annot=True, ax = ax, 
                vmin = -1, vmax = 1, 
                cmap = cmap_plot, norm = norm, annot_kws={"size": 9})
    ax.hlines(range(0, n_cols+1), *ax.get_xlim(), lw=1, linestyles = 'dashed')
    ax.vlines(range(0, n_cols+1), *ax.get_ylim(), lw=1, linestyles = 'dashed')  # vertical lines span the y range
    ax.xaxis.set_ticks_position('top')
    ax.tick_params(axis='both', which='major', labelsize=10)
    ax.xaxis.label.set_size(14)
    plt.title('{}'.format(data_name.upper()))
    plt.xticks(rotation=280)
    fig.show()

<a id="correlation_all_targets"></a>
Unhide the following to show correlation matrices for all datasets in Acea, focusing on targets.

Some plots are empty: this means that no feature has an absolute correlation higher than the threshold defined above. 

In [None]:
for i, data_name in enumerate(data_names):
    
    df = df_dict[data_name]
    n_cols = df.shape[1]
    corr_matrix = round(df.drop('Date', axis = 1).corr(), 2)
    mask = np.zeros_like(corr_matrix, dtype=bool)
    if PLOT_JUST_HIGH_VALS:
        mask[(abs(corr_matrix) < THRESHOLD) & (mask == False)] = True
        for j in range(mask.shape[1]):
            mask[j,j] = True
        n_cols = df.shape[1]

    target_cols = target_dict[data_name]
    df_cols = corr_matrix.columns.tolist()
    indices = [df_cols.index(i) for i in target_cols]
    corr_matrix = corr_matrix[target_cols]
    mask = mask[:, indices]

    fig, ax = plt.subplots(1, 1, figsize = (16, 10))

    sns.heatmap(corr_matrix, mask=mask, annot=True, ax = ax, 
                vmin = -1, vmax = 1, 
                cmap = cmap_plot, norm = norm, annot_kws={"size": 9})
    ax.hlines(range(0, n_cols+1), *ax.get_xlim(), lw=1, linestyles = 'dashed')
    ax.vlines(range(0, n_cols+1), *ax.get_ylim(), lw=1, linestyles = 'dashed')
    ax.xaxis.set_ticks_position('top')
    ax.xaxis.label.set_size(11)
    ax.tick_params(axis='both', which='major', labelsize=10)
    plt.title('{} just targets'.format(data_name.upper()))
    plt.xticks(rotation=285)
    fig.show()

<a id="use_correlation"></a>
#### What to do with it

Understanding which features are correlated with the target ones may help in selecting them in a model. If collinearity is an issue (for instance in linear models), working with principal components or using some regularization term (L1, so that features are dropped) may help.
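
To go from the heatmaps to a concrete list of candidate collinear features, one option is to extract every pair whose absolute correlation exceeds a threshold. The sketch below runs on a made-up dataframe; `highly_correlated_pairs` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'b' is almost a copy of 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({'a': a,
                    'b': a + rng.normal(scale=0.05, size=200),
                    'c': rng.normal(size=200)})

def highly_correlated_pairs(df, threshold=0.7):
    """Return (feature_1, feature_2, r) for every pair with |r| >= threshold."""
    corr = df.corr()
    # keep the strict lower triangle so that each pair appears only once
    lower = corr.where(np.tril(np.ones(corr.shape, dtype=bool), k=-1))
    pairs = lower.stack().reset_index()
    pairs.columns = ['feature_1', 'feature_2', 'r']
    # NaN entries (masked cells) never pass the comparison, so they are dropped here
    return pairs.loc[pairs['r'].abs() >= threshold].reset_index(drop=True)

pairs = highly_correlated_pairs(toy, threshold=0.7)
print(pairs)
```

On a real dataset one would pass `df.drop('Date', axis = 1)` instead of `toy` and then decide, pair by pair, which of the two features to keep.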

[return to top](#top)

<a id="autocorrelation"></a>

## 2. Autocorrelation Analysis

### Autocorrelation Definition

In time series analysis we can't limit ourselves to looking for correlation between two variables, for several reasons: 

- you may not have predictor variables, just target ones

- you usually don't have accurate information about the future, which is where you want to make predictions (you may want to predict the lake level a week from now, but you don't have the actual rainfall and temperature data for that week)

- correlation compares two variables at the same timestamp; what we would like to see is whether a variable is related to itself at a different point in time, so that knowing past values can help us infer future ones


**Stationarity**

For any **linear** autocorrelation/autoregression technique (e.g. ARIMA) we need our time series to be **stationary**. 

I consider it an important concept, even if **we don't necessarily need stationarity for modeling**.

A loose definition of stationarity is the following:

> *A stationary time series is one whose statistical properties such as mean, variance, autocorrelation do not change over time ([reference](https://people.duke.edu/~rnau/411diff.htm))*. 

[Here](https://en.wikipedia.org/wiki/Stationary_process#Example_1) you can find a more precise definition (and useful examples to understand it).

Speaking in layman's terms: if our time series has some trend (f.i. grows over time) or seasonality (changes periodically, f.i. with a period of a year) we will need to remove these *components* to make the series stationary. 

Here you see an example of both: 

<div class="row">
  <div class="column">
    <img src="https://i.imgur.com/UoNcPu9.png" align="left" style="width:50%">
  </div>
  <div class="column">
    <img src="https://i.imgur.com/sHH220M.png" style="width:50%" align="right">
  </div>
</div>
<br>
Through `statsmodels.tsa.seasonal.seasonal_decompose` 
we can also split a time series into trend, seasonality and noise components.

<div class="row">
    <div class="column">
    <img src="https://i.imgur.com/0gURfWG.png" align="center" style="width:50%">
  </div>
  </div>
<br>

**Autocorrelation**

Let me be more of an engineer than a strict mathematician: consider autocorrelation as the *Pearson correlation coefficient* that we defined above, computed between a time series and a time-shifted copy of itself. 


So:

${\displaystyle \rho _{XX}(\tau )={\frac {\operatorname {E} \left[(X_{t}-\mu ){\overline {(X_{t+\tau }-\mu )}}\right]}{\sigma ^{2}}}}$, go [here](https://en.wikipedia.org/wiki/Autocorrelation#Definition_for_wide-sense_stationary_stochastic_process) for more details.
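
In pandas terms this reading is almost literal: correlate a series with a shifted copy of itself. The sketch below uses a made-up random walk, and `autocorr_at_lag` is a hypothetical helper written only to make the idea explicit.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical series: a random walk, which is strongly autocorrelated
s = pd.Series(rng.normal(size=300).cumsum())

def autocorr_at_lag(series, lag):
    """Pearson correlation between the series and itself shifted by `lag` steps."""
    return series.corr(series.shift(lag))

for lag in (1, 7, 30):
    # pandas' built-in Series.autocorr should give the same number
    print(lag, round(autocorr_at_lag(s, lag), 3), round(s.autocorr(lag), 3))
```

Note that this pairwise estimate uses the mean and variance of the overlapping segments only, so it can differ slightly from the estimator used by `statsmodels.tsa.stattools.acf`, which relies on the global mean and variance.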



<br>

**Partial Autocorrelation**

The partial autocorrelation at lag k is the correlation that results after removing the effect of any correlations due to the terms at shorter lags.


<div class="row">
  <div class="column">
    <img src="https://i.imgur.com/8kDMmZh.png" align="left" style="width:50%">
  </div>
  <div class="column">
    <img src="https://i.imgur.com/fCVUETa.png" style="width:50%" align="right">
  </div>
</div>

<br> 

You can see that, since the time series is highly autocorrelated at lag 1, this high correlation is dragged along to the other lags. By plotting the partial autocorrelation function we remove this drag and focus on the lags that are 'really' correlated (notice, however, that in this case the time series is probably not stationary but positively trending). 

# Open Question: 

In this notebook I don't perform any *AR* or *MA* modeling, but I still stress the importance of stationarity since it is a pillar concept in time series analysis. 

> My question is the following: given a non stationary time series $y_t$ and its stationary counterpart $\hat{y_t}$ (e.g. obtained through differencing), is looking at the PACF plot of $y_t$ the same as looking at the ACF plot of $\hat{y_t}$ in terms of which lags turn out to be most important? 

I know that mathematically the two are not equivalent, but I was wondering whether there are examples where the two differ. 



Subsections are: 

- [*Stationarity Test*](#stationarity_test)<br>
- [*ACF and PACF plots stationary vs non stationary*](#acf_pacf_comparison)<br>
- [*ACF and PACF plots all Acea datasets*](#acf_pacf_all)<br>
- [*ACF and PACF plots all Acea datasets just targets*](#acf_pacf_all_targets)<br>
- [*What to do with it*](#use_acf_pacf)<br>

<a id="stationarity_test"></a>

### Stationarity test

To check if our features are stationary we will run the [Augmented Dickey Fuller test](https://en.wikipedia.org/wiki/Augmented_Dickey%E2%80%93Fuller_test). 

> *The test is trying to reject the null hypothesis that a unit root exists and the data is non-stationary. If the null hypothesis is rejected, then the alternate can be considered valid (e.g., the data is stationary).* https://pythondata.com/stationary-data-tests-for-time-series-forecasting/

**A low p-value rejects the null hypothesis, i.e. we can consider the time series stationary.**

In [None]:
from statsmodels.tsa.stattools import adfuller

#### Example on Aquifer Auser dataset

Here I'll run the Augmented Dickey Fuller test on the *Aquifer Auser* dataset. 

Since there are plenty of NaNs (which must be filled to perform the test), I will consider only data from 2017 onwards and fill the NaNs with both the *mean* and the *preceding value* to see whether the test results are consistent. 



In [None]:
RECALCULATE = False # I have already calculated them 

if RECALCULATE: 
    data_name = 'Aquifer_Auser'
    df = df_dict[data_name]
    df['Year'] = pd.to_datetime(df.Date).dt.year
    df = df.loc[df.Date.astype(str) >= '2017-01-01']

    MAX_LAG = 365

    adf_list = []
    cols_to_check = [i for i in df.columns if i not in ['Date', 'Year']]

    for col in tqdm.tqdm(cols_to_check):
        adf_test_ffill = adfuller(df[col].fillna(method = 'ffill'), maxlag=MAX_LAG)
        col_mean = df[col].mean()
        adf_test_mean = adfuller(df[col].fillna(col_mean), maxlag=MAX_LAG)
        adf_list.append(pd.DataFrame({'feature': [col], 'p_value_mean': [adf_test_mean[1]], 'p_value_ffill': [adf_test_ffill[1]]}))

    adf_df = pd.concat(adf_list, axis = 0)
    adf_df.iloc[:, 1:] = round(adf_df.iloc[:, 1:], 6)
else:
    data_name = 'Aquifer_Auser'
    df = df_dict[data_name]
    df['Year'] = pd.to_datetime(df.Date).dt.year
    df = df.loc[df.Date.astype(str) >= '2017-01-01']
    adf_df = pd.read_pickle('/kaggle/input/adf-calculated/adf_test_auser.pickle')
    
display(adf_df.sort_values(['p_value_mean', 'p_value_ffill'], ignore_index = True, ascending = False))

Low p-values correspond to stationary time series. We could expect rainfalls to be stationary, or at least more stationary than temperatures, which have a clear seasonal pattern. 

In [None]:
fig, axes = plt.subplots(2, 1, figsize = (11, 7))
ax = axes.ravel()

df[['Temperature_Monte_Serra', 'Date']].set_index('Date').plot(lw = 2, linestyle = "-.",ax = ax[0], title = 'Temperature_Monte_Serra (Non stationary Time Series)')

df[['Rainfall_Monte_Serra', 'Date']].set_index('Date').plot(lw = 2,linestyle = "-.", ax = ax[1], title = 'Rainfall_Monte_Serra (stationary Time Series)', color = 'red', sharex = True)

fig.suptitle('example of Stationary vs non Stationary time series')

##### Make a Time Series stationary

The canonical way of making a time series stationary is to take finite differences of the original time series until we achieve stationarity. Basically:

$y'_{i}=y_{i}-y_{i-1}$ $\forall i \in [2, n]$

The same holds for removing *seasonality*. Let's say that our original time series has a seasonality pattern of period $m = 365$. Then: 

$y'_{i}=y_{i}-y_{i-m}$ $\forall i \in [m+1, n]$

Check [this](https://otexts.com/fpp2/stationarity.html) for further and more accurate explanations.
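
On toy data both kinds of differencing reduce to a call to `Series.diff`. The two series below are hypothetical (a pure linear trend and a pure pattern of period 7, instead of 365, to keep the example short).

```python
import numpy as np
import pandas as pd

t = np.arange(28)
trend = pd.Series(2.0 * t)  # hypothetical linear trend with slope 2
# hypothetical seasonal pattern with period m = 7, repeated 4 times
seasonal = pd.Series(np.tile([0.0, 1.0, 3.0, 1.0, 0.0, -2.0, -1.0], 4))

first_diff = trend.diff(1)        # y_i - y_{i-1}: the first value is NaN
seasonal_diff = seasonal.diff(7)  # y_i - y_{i-m}: the first m values are NaN

print(first_diff.dropna().unique())     # the trend collapses to its constant slope
print(seasonal_diff.dropna().unique())  # the seasonal pattern cancels out
```

The leading NaNs are the reason why the differenced series has to be filled or trimmed before it is passed to a test such as `adfuller`.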

Let's see an example.

In [None]:
data_name = 'Aquifer_Auser'
df = df_dict[data_name]
time_series = df[['Temperature_Monte_Serra', 'Date']].loc[df.Date.astype(str) > '2016-01-01'].set_index('Date')

time_series_stationary = time_series.diff(1)

time_series_stationary_mean = time_series_stationary.mean()
adf_test_diff = adfuller(time_series_stationary.fillna(time_series_stationary_mean), maxlag=365)[1]

print('Former P Value:\t{}\nNew P Value:\t{}'.format(0.157, round(adf_test_diff, 2)))

By taking the difference, the ADF test p-value of *Temperature_Monte_Serra* changed from 0.16 (the null hypothesis of non stationarity cannot be rejected) to 0.0 (the series can now be considered stationary). 

One could also ask why I didn't take the seasonal difference: well, that would be another way of doing it. By taking the one-lag difference we work with the day-to-day changes in temperature rather than with its level, and according to the test this suffices to make the series stationary. 

We can also see it visually:

In [None]:
fig, axes = plt.subplots(2, 1, figsize = (11, 7))
ax = axes.ravel()

time_series.plot(lw = 3, linestyle = "-.",ax = ax[0], title = 'Temperature_Monte_Serra (Non stationary Time Series)')

time_series_stationary.plot(lw = 3,linestyle = "-.", ax = ax[1], 
                            title = 'Differenced Temperature_Monte_Serra (stationary Time Series)', color = 'red', sharex = True)

fig.suptitle('example of Stationary vs non Stationary time series')

<a id="acf_pacf_auser"></a>

### ACF and PACF plots

There are different ways of calculating autocorrelation in Python: pandas handles NaNs but does not provide a partial autocorrelation function, so I will use statsmodels (filling the NaNs beforehand). 

<a id="acf_pacf_comparison"></a>

#### Stationary vs Non Stationary comparison

I will compare the acf/pacf plots between stationary vs non stationary time series.

In [None]:
data_name = 'Aquifer_Auser'
df = df_dict[data_name]
df = df.loc[(df.Date.astype(str) >= '2016-01-01') & (df.Date.astype(str) <= '2020-06-01')]
cols = ['Depth_to_Groundwater_CoS', 'Rainfall_Borgo_a_Mozzano']
colors = [(0.31883238319215684, 0.4266050511215686, 0.8598574482039216), 
          (0.810615674827451, 0.26879706171764706, 0.23542761153333333)]
fig, axes = plt.subplots(len(cols), 3, figsize = (20, 12))
ax = axes.ravel()
for enum, col in enumerate(cols):
    (df[[col, 'Date']].loc[(df.Date.astype(str) >= '2017-01-01')].set_index('Date')
    .plot(lw = 3, ax = ax[enum*3], title = col, color = colors[enum], legend = False))
    ax[enum*3].set_xlabel(' ')
    ax[enum*3].xaxis.set_ticks(['2017-01', '2018-01', '2019-01', '2020-01'])
    myFmt = matplotlib.dates.DateFormatter("%Y-%m")
    ax[enum*3].xaxis.set_major_formatter(myFmt)

    acf_stat = acf(df[col].fillna(method = 'ffill'), nlags = 365)
    pd.Series(acf_stat[1:]).plot(ax = ax[enum*3+1], color = colors[enum],
                                 title = col+' acf', linestyle = '--', alpha = None, lw = 2, ylim=(-1,1))

    pacf_stat = pacf(df[col].fillna(method = 'ffill'), nlags = 365, method = 'ols')
    pd.Series(pacf_stat[1:]).plot(ax = ax[enum*3+2], title = col+' pacf', alpha = None, lw = 2, 
                                  linestyle = '-.', ylim=(-1,1), color = colors[enum],)
plt.suptitle('Stationary vs Non Stationary acf/pacf comparison')

As you can see, the non stationary time series (*Depth_to_Groundwater_CoS*) has a stronger autocorrelation over all lags. We can nicely spot the annual seasonality peaking at lags ~180-185 and ~365. 

Instead, the stationary time series (*Rainfall_Borgo_a_Mozzano*) has neither a particular trend nor seasonality, and that is reflected in the acf and pacf plots. 

<a id="acf_pacf_all"></a>

#### ACF and PACF plots all Acea datasets

Unhide to see.

In [None]:
for data_name in data_names:
    df = df_dict[data_name]
    df = df.loc[(df.Date.astype(str) >= '2016-01-01') & (df.Date.astype(str) <= '2020-06-01')]
    cols_to_consider = [i for i in df.columns if i not in ['Date', 'Year']]
    colors = sns.color_palette("coolwarm_r", len(cols_to_consider))
    data_chunks = list(chunks(cols_to_consider, 3))
    for enum_chunk, cols in enumerate(list(data_chunks)):
        if len(cols)==3:
            fig, axes = plt.subplots(3, 3, figsize = (20, 12))
        else:
            fig, axes = plt.subplots(len(cols), 3, figsize = (20, 12))
        ax = axes.ravel()
        for enum, col in enumerate(cols):
            (df[[col, 'Date']].loc[(df.Date.astype(str) >= '2017-01-01')].set_index('Date')
            .plot(lw = 3, ax = ax[enum*3], title = col, color = colors[3*enum_chunk+enum], legend = False, sharex=True))
            ax[enum*3].set_xlabel(' ')
            ax[enum*3].xaxis.set_ticks(['2017-01', '2018-01', '2019-01', '2020-01'])
            myFmt = matplotlib.dates.DateFormatter("%Y-%m")
            ax[enum*3].xaxis.set_major_formatter(myFmt)
            try:
                acf_stat = acf(df[col].fillna(method = 'ffill'), nlags = 365)
                pd.Series(acf_stat[1:]).plot(ax = ax[enum*3+1], color = colors[3*enum_chunk+enum],
                                         title = col+' acf', linestyle = '--', alpha = None, lw = 2, ylim=(-1,1),
                                        sharex=True)
            except:
                print('Failed to Calculate acf')
            try:
                pacf_stat = pacf(df[col].fillna(method = 'ffill'), nlags = 365, method = 'ols')
                pd.Series(pacf_stat[1:]).plot(ax = ax[enum*3+2], title = col+' pacf', alpha = None, lw = 2, 
                                          linestyle = '-.', ylim=(-1,1), color = colors[3*enum_chunk+enum],
                                          sharex=True)
            except Exception:
                print('Failed to Calculate pacf')
        plt.suptitle(data_name)

<a id="acf_pacf_all_targets"></a>

#### ACF and PACF plots all Acea datasets just targets

Unhide to see.

In [None]:
for data_name in data_names:
    target_cols = target_dict[data_name]
    df = df_dict[data_name][target_cols + ['Date']]
    df = df.loc[(df.Date.astype(str) >= '2016-01-01') & (df.Date.astype(str) <= '2020-06-01')]
    cols_to_consider = [i for i in df.columns if i not in ['Date', 'Year']]
    colors = sns.color_palette("coolwarm_r", len(cols_to_consider))
    data_chunks = list(chunks(cols_to_consider, 3))
    for enum_chunk, cols in enumerate(list(data_chunks)):
        if len(cols)==3:
            fig, axes = plt.subplots(3, 3, figsize = (20, 12))
        else:
            fig, axes = plt.subplots(len(cols), 3, figsize = (20, 12))
        ax = axes.ravel()
        for enum, col in enumerate(cols):
            (df[[col, 'Date']].loc[(df.Date.astype(str) >= '2017-01-01')].set_index('Date')
            .plot(lw = 3, ax = ax[enum*3], title = col, color = colors[3*enum_chunk+enum], legend = False, sharex=True))
            ax[enum*3].set_xlabel(' ')
            ax[enum*3].xaxis.set_ticks(['2017-01', '2018-01', '2019-01', '2020-01'])
            myFmt = matplotlib.dates.DateFormatter("%Y-%m")
            ax[enum*3].xaxis.set_major_formatter(myFmt)
            try:
                acf_stat = acf(df[col].fillna(method = 'ffill'), nlags = 365)
                pd.Series(acf_stat[1:]).plot(ax = ax[enum*3+1], color = colors[3*enum_chunk+enum],
                                         title = col+' acf', linestyle = '--', alpha = None, lw = 2, ylim=(-1,1),
                                        sharex=True)
            except:
                print('Failed to Calculate acf')
            try:
                pacf_stat = pacf(df[col].fillna(method = 'ffill'), nlags = 365, method = 'ols')
                pd.Series(pacf_stat[1:]).plot(ax = ax[enum*3+2], title = col+' pacf', alpha = None, lw = 2, 
                                          linestyle = '-.', ylim=(-1,1), color = colors[3*enum_chunk+enum],
                                          sharex=True)
            except Exception:
                print('Failed to Calculate pacf')
        plt.suptitle(data_name)

<a id="use_acf_pacf"></a>

##### What to do with it

Inspecting the autocorrelation of a time series helps us find which lags may be used in a forecasting model. 
Of course there may be more automatic ways than visually inspecting our data. 

##### Other (maybe) useful plots

In [None]:
data_chunks = chunks(range(len(data_names)), 3)
chunk_len = 3

for chunk in data_chunks:
    fig, axes = plt.subplots(1, 3, figsize = (20, 12))
    fig.suptitle('Autocorrelation lag 1 for each feature and dataset')
    axes_raveled = axes.ravel()
    for k in range(len(chunk)):
        
        j = chunk[k]
        data_name = data_names[j]
        df = df_dict[data_name].sort_values('Date', ignore_index = True).drop(['Date', 'Year'], 
                                                                              axis = 1, errors = 'ignore')
        
        autocorr_dataframe = (pd.DataFrame(df.apply(lambda x: x.autocorr(), 0))
                             .reset_index().rename(columns = {'index': 'feature', 0: 'autocorrelation'})
                             .sort_values('autocorrelation', ascending = False))
        
        ax = axes_raveled[k]
    
        sns.barplot(x='autocorrelation', y='feature', data=(autocorr_dataframe), ax = ax, palette = 'jet_r')
        y_labels = autocorr_dataframe.feature.tolist()
        ax.set_yticklabels([])
        ax.set_xticklabels([])
        ax.title.set_text(data_name)
        ax.title.set_fontsize(12)
        t=0
        for p in ax.patches:
            width = p.get_width()  # bar length, i.e. the lag-1 autocorrelation
            if width < 0.01:
                ax.text(width,  # place the text at the end of the bar
                p.get_y() + p.get_height() / 2,  # vertical centre of the bar
                '{:1.4f}'.format(width),  # value with 4 decimals
                ha = 'left',   # horizontal alignment
                va = 'center')  # vertical alignment
            else:
                ax.text(width/4,  # place the text a quarter of the way along the bar
                p.get_y() + p.get_height() / 2,  # vertical centre of the bar
                '{} {:1.4f}'.format(y_labels[t], width),  # feature name and value with 4 decimals
                ha = 'left',   # horizontal alignment
                va = 'center',
                color = 'black',
                fontsize = 12)
            t+=1

[return to top](#top)

<a id="crosscorrelation"></a>

## 3. Crosscorrelation analysis

Once again, we won't delve too much into technical details here. Our aim is to interpret the crosscorrelation coefficient as the Pearson correlation coefficient between two different features at different lags. We use this definition of the crosscorrelation coefficient:

${\displaystyle \rho _{XY}(\tau )={\frac {\operatorname {K} _{XY}(\tau )}{\sigma _{X}\sigma _{Y}}}={\frac {\operatorname {E} [\left(X_{t}-\mu _{X}\right){\overline {\left(Y_{t+\tau }-\mu _{Y}\right)}}]}{\sigma _{X}\sigma _{Y}}}}$, go [here](https://en.wikipedia.org/wiki/Cross-correlation#Cross-correlation_function) for more details.
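
As a minimal sketch of this definition for real-valued series, the lagged Pearson correlation can be computed with pandas by shifting one series before correlating. The function name `lagged_corr` is hypothetical, and it pairs $X_t$ with $Y_{t+\tau}$ as in the formula; the `crosscorr` helper used in this notebook is defined elsewhere and may follow the opposite sign convention.

```python
import numpy as np
import pandas as pd


def lagged_corr(x, y, tau=0):
    """Pearson correlation between x[t] and y[t + tau] (hypothetical helper)."""
    # shift(-tau) moves y[t + tau] to position t, so it lines up with x[t]
    return x.corr(y.shift(-tau))


# Synthetic example: y is x delayed by 3 steps, i.e. y[t] = x[t - 3]
rng = np.random.default_rng(42)
x = pd.Series(rng.normal(size=300))
y = x.shift(3)

# The lag with the highest correlation should recover the delay
best_tau = max(range(0, 8), key=lambda tau: lagged_corr(x, y, tau))
print(best_tau, round(lagged_corr(x, y, best_tau), 3))
```

Rows that become `NaN` after the shift are ignored by `Series.corr`, so each lag is evaluated on a slightly shorter sample.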


<a id="most_cross"></a>

### Most crosscorrelated features/lags

As a first step, we look for crosscorrelated features at lags 1 to 29 (roughly a month, since `range(1, TOTAL_LAGS)` stops before `TOTAL_LAGS = 30`) and see the most correlated combinations. 

**N.B.** Of course, many of the time series are non-stationary and strongly autocorrelated: this will result in many lags having a high crosscorrelation value. I will address this later. 
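
To see why this happens, here is a small sketch on synthetic data (not the Acea series): two noisy series that share the same yearly cycle exceed a 0.6 threshold at every lag from 1 to 29. The sine-wave seasonality, the noise level and all variable names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
t = np.arange(3 * 365)
season = np.sin(2 * np.pi * t / 365)  # shared yearly cycle (synthetic)

# Two series with the same cycle and independent noise
x = pd.Series(season + rng.normal(scale=0.2, size=t.size))
y = pd.Series(season + rng.normal(scale=0.2, size=t.size))

threshold = 0.6
# Lagged Pearson correlation for lags 1..29
lagged = {lag: x.corr(y.shift(lag)) for lag in range(1, 30)}
n_above = sum(abs(value) > threshold for value in lagged.values())
print('{} of {} lags exceed the threshold'.format(n_above, len(lagged)))
```

Because the cycle changes slowly compared with the lags considered, shifting by a few days barely changes the correlation, so no single lag stands out.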

In [None]:
CROSS_THRESHOLD = 0.6
TOTAL_LAGS = 30

In [None]:
cross_corr_dict = {}

for data_name in data_names:
    df = df_dict[data_name]
    total_lags = range(1, TOTAL_LAGS)
    cross_corr = {}
    features = list(set(df.columns) - set(['Date', 'Year']))
    combinations = list(itertools.product(features, features))

    for j in total_lags:
        cross_corr[j] = []
        for k in tqdm.tqdm(combinations):
            cross_corr[j].append(crosscorr(df[k[0]], df[k[1]], lag = j))

    cross_corr = pd.DataFrame(cross_corr)
    cross_corr.columns = ['cross_correlation_lag_{}'.format(i) for i in range(1, TOTAL_LAGS)]

    cross_correlations = (pd.concat([pd.DataFrame(combinations).rename(columns = {0: 'first_feature', 1: 'second_feature'}),
                                     pd.DataFrame(cross_corr)], axis = 1)
                          )

    cross_correlations_melt = (pd.melt(cross_correlations, id_vars=['first_feature', 'second_feature'], 
                               value_vars=['cross_correlation_lag_{}'.format(i) for i in range(1, TOTAL_LAGS)],
                               var_name = 'lag',
                               value_name = 'cross_correlation')
                              .assign(lag=lambda x: x.lag.str.replace('cross_correlation_lag_', "")))
    
    
    def sort_features(x, y):
        
        return tuple(sorted([x,y]))
    
    cross_correlations_melt['pair_of_features'] = (cross_correlations_melt.apply(lambda x: sort_features(x.first_feature,
                                                                                                            x.second_feature), axis = 1))

    cross_correlations_melt['first_feature'] = cross_correlations_melt['pair_of_features'].apply(lambda x: x[0])
    cross_correlations_melt['second_feature'] = cross_correlations_melt['pair_of_features'].apply(lambda x: x[1])
    
    cross_correlations_melt['pair_of_features'] = (cross_correlations_melt['first_feature'].str.replace("feature_", "") + 
                                              "_"  + cross_correlations_melt['second_feature'].str.replace("feature_", "") + "_lag" +
                                                   cross_correlations_melt['lag']
                                                  ).astype(str)
    
    cross_correlations_melt = cross_correlations_melt.drop_duplicates(['pair_of_features'], ignore_index=True)
    
    cross_corr_dict[data_name] = (cross_correlations_melt
                                  .loc[(abs(cross_correlations_melt.cross_correlation) > CROSS_THRESHOLD) & 
                                      (cross_correlations_melt.first_feature!= cross_correlations_melt.second_feature)]
                                  .reset_index(drop = True))

#### CrossCorrelated Features for each dataset

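For each dataset, let's see the pairs of features (with the corresponding lag) with the 5 ***highest and lowest*** crosscorrelation values among those whose absolute value exceeds `CROSS_THRESHOLD`; when negative crosscorrelations pass the threshold, the lowest values are the most negative ones. 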

In [None]:
for data_name in data_names:
    if len(cross_corr_dict[data_name]) == 0:
        display('{} empty'.format(data_name))
    else:
        display(pd.concat([cross_corr_dict[data_name]
            .sort_values('cross_correlation', ascending = False)
            .drop('pair_of_features', axis = 1)
            .head(5),cross_corr_dict[data_name]
            .sort_values('cross_correlation', ascending = False)
            .drop('pair_of_features', axis = 1)
            .tail(5)], axis = 0, ignore_index=True
           ).assign(dataset=data_name))

<a id = "example"></a>
#### An example of crosscorrelated pair of features

In [None]:
df_cross = df_dict['Aquifer_Auser'].copy()  # copy so the lagged columns are not added to the original dataframe
cols = ['Temperature_Lucca_Orto_Botanico', 'Temperature_Monte_Serra']
df_cross['Temperature_Monte_Serra_lag2'] = df_cross['Temperature_Monte_Serra'].shift(2)

df_cross['Volume_POL_lag29'] = df_cross['Volume_POL'].shift(29)
df_cross = df_cross.loc[df_cross.Date.astype(str)>='2017-01-01'].set_index('Date')

fig, axes = plt.subplots(2, 2, figsize = (14, 10))
ax = axes.ravel()

df_cross[['Temperature_Lucca_Orto_Botanico', 'Temperature_Monte_Serra_lag2']].plot(ax = ax[0], lw = 2, 
                                                 linestyle = "-.")

(df_cross.loc[(df_cross.index.astype(str)>='2019-01-01') & (df_cross.index.astype(str)<='2019-06-01')]
            [['Temperature_Lucca_Orto_Botanico', 'Temperature_Monte_Serra_lag2']].plot(ax = ax[1], lw = 2, 
                                                 linestyle = "-."))


(df_cross[['Volume_CSA', 'Volume_POL_lag29']].plot(ax = ax[2], lw = 2, 
                                                 linestyle = "-.", sharex=True,))

(df_cross[(df_cross.index.astype(str)>='2019-01-01') & (df_cross.index.astype(str)<='2019-06-01')]
[['Volume_CSA', 'Volume_POL_lag29']].plot(ax = ax[3], lw = 2, sharex=True,
                                                 linestyle = "-."))

for j in range(4):
    ax[j].get_legend().remove()

fig.suptitle('Positive (Above) vs Negative (Below) crosscorrelation (Auser)')

ax[2].xaxis.set_ticks(['2017-01', '2018-01', '2019-01', '2020-01'])
myFmt = matplotlib.dates.DateFormatter("%Y-%m")
ax[2].xaxis.set_major_formatter(myFmt)

ax[1].legend(loc="upper right", bbox_to_anchor=(1.3,1.1))
ax[3].legend(loc="upper right", bbox_to_anchor=(1.3,1.1))

<a id="spurious"></a>
    
## 4. Spurious Relationships

I would define a spurious correlation as the presence of a linear relationship between two variables due to either pure coincidence or the presence of a certain 3rd unseen factor (see [here](https://en.wikipedia.org/wiki/Spurious_relationship) for a more in-depth definition).

This is clearly a **qualitative definition**: there's no clear mathematical definition of coincidence or of what represents a 3rd factor, yet I find it useful in some cases.


One of the best-known sources of spurious correlations is https://www.tylervigen.com/spurious-correlations, where you can find plots like the following: 

<img src="https://i.imgur.com/L7nu3gj.png" align="center" style="width:80%">


Of course, purely random correlations are not of interest here; we are more interested in spurious correlations caused by a 3rd unseen factor: 

<img src="https://i.stack.imgur.com/inojA.png" align="center" style="width:60%">
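
The effect of an unseen 3rd factor can be sketched on synthetic data: two variables driven by the same factor `z` look strongly correlated, and the correlation almost disappears once the linear effect of `z` is removed from both. Everything in the block (the data, the noise level, the helper `residual_after`) is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000
z = rng.normal(size=n)                 # unseen 3rd factor (synthetic)
x = z + rng.normal(scale=0.5, size=n)  # driven by z plus its own noise
y = z + rng.normal(scale=0.5, size=n)  # driven by z plus its own noise

# Correlation between x and y when z is ignored
raw_corr = np.corrcoef(x, y)[0, 1]


def residual_after(series, factor):
    """Remove the linear effect of `factor` from `series` via least squares."""
    slope, intercept = np.polyfit(factor, series, deg=1)
    return series - (slope * factor + intercept)


# Correlation between what is left of x and y once z is accounted for
partial_corr = np.corrcoef(residual_after(x, z), residual_after(y, z))[0, 1]
print(round(raw_corr, 3), round(partial_corr, 3))
```

In practice the 3rd factor is often not observed directly, which is why it has to be inferred, for instance from a seasonal decomposition.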


<a id = "correlation_redo"></a>

#### Revisiting Correlations

Here I go through some of the (auto/cross)correlations found above and revisit them after disclosing the omitted 3rd factor. 

##### Spurious correlation with seasonality: positive and negative examples

Let's go back to [this](#example) example, focusing on ***Temperature_Lucca_Orto_Botanico*** vs ***Temperature_Monte_Serra_lag2*** positive crosscorrelation.

In [None]:
df_cross = df_dict['Aquifer_Auser'].copy()  # copy so the lagged column is not added to the original dataframe
cols = ['Temperature_Lucca_Orto_Botanico', 'Temperature_Monte_Serra']
df_cross['Temperature_Monte_Serra_lag2'] = df_cross['Temperature_Monte_Serra'].shift(2)

df_cross = df_cross.loc[df_cross.Date.astype(str)>='2017-01-01'].set_index('Date')

fig, ax = plt.subplots(1, 1, figsize = (11, 7))

df_cross[['Temperature_Lucca_Orto_Botanico', 'Temperature_Monte_Serra_lag2']].plot(ax = ax, lw = 2, 
                                                 linestyle = "-.")


#ax.get_legend().remove()

fig.suptitle('Pearson Correlation: {}'.format(round(crosscorr(df_cross['Temperature_Lucca_Orto_Botanico'],
                                                        df_cross['Temperature_Monte_Serra_lag2']), 3)))

ax.xaxis.set_ticks(['2017-01', '2018-01', '2019-01', '2020-01'])
myFmt = matplotlib.dates.DateFormatter("%Y-%m")
ax.xaxis.set_major_formatter(myFmt)

It's pretty clear we have a seasonality pattern here. Let's check it with `statsmodels.tsa.seasonal.seasonal_decompose`.


In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose

df_cross.index = pd.to_datetime(df_cross.index)

ts_1 = seasonal_decompose(df_cross['Temperature_Lucca_Orto_Botanico'], period =365)
ts_2 = seasonal_decompose(df_cross['Temperature_Monte_Serra_lag2'], period =365)

fig, axes = plt.subplots(2, 1, figsize = (11, 7))
ax = axes.ravel()
df_dec_1 = pd.DataFrame(pd.concat([ts_1.trend, ts_1.seasonal, ts_1.resid], axis = 1))
df_dec_1 = (df_dec_1.loc[(df_dec_1.index.astype(str)>='2017-07-01') & (df_dec_1.index.astype(str)<='2020-01-01')])
(df_dec_1.plot( ax = ax[0], linewidth = 2, linestyle = "-."))
ax[0].title.set_text('Decomposition of Temperature_Lucca_Orto_Botanico time series')

df_dec_2 = pd.DataFrame(pd.concat([ts_2.trend, ts_2.seasonal, ts_2.resid], axis = 1))
df_dec_2 = (df_dec_2.loc[(df_dec_2.index.astype(str)>='2017-07-01') & (df_dec_2.index.astype(str)<='2020-01-01')])
(df_dec_2.plot( ax = ax[1], linewidth = 2, linestyle = "-.", sharex = True))
ax[1].title.set_text('Decomposition of Temperature_Monte_Serra_lag2 time series')

We can definitely see a seasonal component dictating the behaviour of both time series! Let's see how each series correlates with its own seasonal component and with the other's. 

In [None]:
from scipy.stats import pearsonr
ts1_season1 = round(pearsonr(ts_1.observed, ts_1.seasonal)[0], 3)
ts1_season2 = round(pearsonr(ts_1.observed, ts_2.seasonal)[0], 3)
ts2_season2 = round(pearsonr(ts_2.observed, ts_2.seasonal)[0], 3)
ts2_season1 = round(pearsonr(ts_2.observed, ts_1.seasonal)[0], 3)
season2_season1 = round(pearsonr(ts_2.seasonal, ts_1.seasonal)[0], 3)

<table style="width:100%" align="center">
  <tr>
    <th>Time Series 1</th>
    <th>Time Series 2</th>
    <th>Correlation</th>
  </tr>
  <tr>
    <td>Temperature_Lucca_Orto_Botanico</td>
    <td>Temperature_Lucca_Orto_Botanico Seasonal Component</td>
    <td>0.94</td>
  </tr>
  <tr>
    <td>Temperature_Lucca_Orto_Botanico</td>
    <td>Temperature_Monte_Serra_lag2 Seasonal Component</td>
    <td>0.88</td>
  </tr>
  <tr>
    <td>Temperature_Monte_Serra_lag2</td>
    <td>Temperature_Monte_Serra_lag2 Seasonal Component</td>
    <td>0.89</td>
  </tr>
  <tr>
    <td>Temperature_Monte_Serra_lag2</td>
    <td>Temperature_Lucca_Orto_Botanico Seasonal Component</td>
    <td>0.88</td>
  </tr>
  <tr>
    <td>Temperature_Lucca_Orto_Botanico Seasonal Component</td>
    <td>Temperature_Monte_Serra_lag2 Seasonal Component</td>
    <td>0.94</td>
  </tr>
</table>

As we can see there's a strong correlation with the seasonal component for each of the time series. 

Already here we could say that the omitted **$3^{rd}$** factor is **seasonality**. To check this more thoroughly, we can see how residuals+trend correlate between the two time series (this is an intuition of mine which could well be wrong: a more proper method would be to make the 2 time series stationary and compare them then). 
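
As a hedged sketch of that more proper route, seasonal differencing (subtracting the value observed one period earlier) is one common way to remove a yearly cycle before comparing two series. The example uses synthetic data; the period of 365 and all names are illustrative assumptions, not results from the Acea datasets.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
period = 365
t = np.arange(4 * period)
season = 10 * np.sin(2 * np.pi * t / period)  # shared yearly cycle (synthetic)

# Two series with the same cycle and independent noise
a = pd.Series(season + rng.normal(size=t.size))
b = pd.Series(season + rng.normal(size=t.size))

# Correlation of the raw series, dominated by the shared cycle
corr_levels = a.corr(b)

# Correlation after seasonal differencing, which cancels the cycle
corr_diff = a.diff(period).corr(b.diff(period))
print(round(corr_levels, 3), round(corr_diff, 3))
```

If a clear correlation survives the differencing on real data, that suggests the two series are related beyond their common seasonality.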

In [None]:
ts1_no_season = (ts_1.resid + ts_1.trend).dropna()
ts2_no_season = (ts_2.resid + ts_2.trend).dropna()

df_no_season = pd.concat([ts1_no_season, ts2_no_season], axis = 1)
df_no_season.columns = ['Temperature_Lucca_Orto_Botanico', 'Temperature_Monte_Serra_lag2']

fig, ax = plt.subplots(1, 1, figsize = (11, 7))

df_no_season[['Temperature_Lucca_Orto_Botanico', 'Temperature_Monte_Serra_lag2']].plot(ax = ax, lw = 2, 
                                                 linestyle = "--")


#ax.get_legend().remove()

fig.suptitle('Pearson Correlation after dropping seasonality (omitted factor): {}'.format(round(crosscorr(df_no_season['Temperature_Lucca_Orto_Botanico'],
                                                        df_no_season['Temperature_Monte_Serra_lag2']), 3)))


In [None]:
from scipy.stats import pearsonr
ts1_no_season = (ts_1.resid + ts_1.trend).dropna()
ts2_no_season = (ts_2.resid + ts_2.trend).dropna()

ts1_ts2_noseason = round(pearsonr(ts1_no_season, ts2_no_season)[0], 3)
print(ts1_ts2_noseason)

<table style="width:100%" align="center">
  <tr>
    <th>Time Series 1</th>
    <th>Time Series 2</th>
    <th>Correlation</th>
  </tr>
  <tr>
    <td>Temperature_Lucca_Orto_Botanico no Seasonal Component</td>
    <td>Temperature_Monte_Serra_lag2 no Seasonal Component</td>
    <td>0.497</td>
  </tr>
</table>

The correlation drops to 0.497 once the seasonal component is removed, definitely lower than the correlation between the original time series.

[return to top](#top)