# Notebook Contents

In this notebook I show how to use scikit-learn's [TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html) class to set up cross-validation for our models. 

The notebook is divided into 3 short sections: 

0) [CrossValidation Strategies](#crossvalidation_strategies) <br>

1) [TimeSeriesSplit](#timeseriessplit) <br>

2) [Example on Aquifer_Petrignano dataset](#example_on_aquifer_petrignano_dataset) <br>



Please look at [this](https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#sphx-glr-auto-examples-model-selection-plot-cv-indices-py) wonderful page for further theoretical explanations on CV splits. 

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.options.display.max_columns = 30
import os
import re
from colorama import Fore, Back, Style
import seaborn as sns
import plotly.express as px
import matplotlib
from matplotlib.patches import Patch
from matplotlib import pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 0})
plt.style.use('fivethirtyeight')
cmap_data = plt.cm.Paired
cmap_cv = plt.cm.coolwarm
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Function modified from https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html

def plot_cv_indices(cv, n_splits, X, y, date_col = None):
    """Create a sample plot for indices of a cross-validation object."""
    
    fig, ax = plt.subplots(1, 1, figsize = (11, 7))
    
    # Generate the training/testing visualizations for each CV split
    for ii, (tr, tt) in enumerate(cv.split(X=X, y=y)):
        # Fill in indices with the training/test groups
        indices = np.array([np.nan] * len(X))
        indices[tt] = 1
        indices[tr] = 0

        # Visualize the results
        ax.scatter(range(len(indices)), [ii + .5] * len(indices),
                   c=indices, marker='_', lw=10, cmap=cmap_cv,
                   vmin=-.2, vmax=1.2)


    # Formatting
    yticklabels = list(range(n_splits))
    
    if date_col is not None:
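        tick_locations = ax.get_xticks()
        # Tick positions are floats; cast to int before using them as positional indices.
        # The first and last ticks lie outside the data range, so they get a blank label.
        inner_ticks = [int(i) for i in tick_locations[1:-1]]
        tick_dates = [" "] + date_col.iloc[inner_ticks].astype(str).tolist() + [" "]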

        tick_locations_str = [str(int(i)) for i in tick_locations]
        new_labels = ['\n\n'.join(x) for x in zip(list(tick_locations_str), tick_dates) ]
        ax.set_xticks(tick_locations)
        ax.set_xticklabels(new_labels)
    
    # One tick per CV iteration, so the number of ticks matches len(yticklabels)
    ax.set(yticks=np.arange(n_splits) + .5, yticklabels=yticklabels,
           xlabel='Sample index', ylabel="CV iteration",
           ylim=[n_splits+0.2, -.2])
    ax.legend([Patch(color=cmap_cv(.8)), Patch(color=cmap_cv(.02))],
              ['Testing set', 'Training set'], loc=(1.02, .8))
    ax.set_title('{}'.format(type(cv).__name__), fontsize=15)
    
    

<a id="crossvalidation_strategies"></a>
# 0. CrossValidation Strategies

Here I compare different cross-validation strategies on a toy dataset. 

In [None]:
from sklearn.model_selection import KFold, ShuffleSplit, StratifiedKFold, StratifiedShuffleSplit, TimeSeriesSplit
cvs = [KFold, ShuffleSplit, StratifiedKFold, StratifiedShuffleSplit, TimeSeriesSplit]
n_points = 100
n_splits = 5
X = np.random.randn(n_points, 10)
# Three imbalanced classes covering 10%, 30% and 60% of the samples
percentiles_classes = [.1, .3, .6]
y = np.hstack([[ii] * int(n_points * perc) for ii, perc in enumerate(percentiles_classes)])

for i, cv in enumerate(cvs):
    this_cv = cv(n_splits=n_splits)
    plot_cv_indices(this_cv, n_splits, X, y, date_col=None)

<a id="timeseriessplit"></a>
# 1. TimeSeriesSplit

Of course, we don't want to use information from the future to train our models, so we opt for TimeSeriesSplit, shown here on the toy dataset defined above.
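As a quick sanity check of that claim, here is a minimal sketch on a small hypothetical array of 12 time-ordered samples (`X_demo` and the helper `trains_only_on_past` are illustrative names, not part of this notebook). It tests, fold by fold, whether every training index comes before every test index. TimeSeriesSplit should pass in all folds, while an unshuffled KFold trains on later samples in some folds.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# Hypothetical toy data: 12 time-ordered samples with a single feature
X_demo = np.arange(12).reshape(-1, 1)

def trains_only_on_past(cv, X):
    """Return, per fold, whether every training index precedes every test index."""
    return [bool(train.max() < test.min()) for train, test in cv.split(X)]

ts_ok = trains_only_on_past(TimeSeriesSplit(n_splits=3), X_demo)
kf_ok = trains_only_on_past(KFold(n_splits=3), X_demo)

print("TimeSeriesSplit trains only on the past:", ts_ok)
print("KFold trains only on the past:          ", kf_ok)
```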

In [None]:
from sklearn.model_selection import TimeSeriesSplit
n_splits = 5
tscv = TimeSeriesSplit(n_splits)

In [None]:
# Example adapted from https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html

for fold, (train_index, test_index) in enumerate(tscv.split(X)):
    print("Fold: {}".format(fold))
    print("TRAIN indices:", train_index, "\n", "TEST indices:", test_index)
    print("\n")
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]


plot_cv_indices(tscv, n_splits, X, y)


## 1.1 Ignoring old data with a fixed-size training window

In [None]:
from sklearn.model_selection import TimeSeriesSplit
n_splits = 5
# max_train_size caps each training set at the 20 most recent samples before the test block
tscv = TimeSeriesSplit(n_splits=n_splits, max_train_size=20)

In [None]:
for fold, (train_index, test_index) in enumerate(tscv.split(X)):
    print("Fold: {}".format(fold))
    print("TRAIN indices:", train_index, "\n", "TEST indices:", test_index)
    print("\n")
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]


plot_cv_indices(tscv, n_splits, X, y)
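To make the effect of `max_train_size` concrete, the sketch below compares the training-set size per fold with and without the cap. It uses a placeholder array of 100 samples, the same length as the toy dataset; `X_demo`, `expanding` and `capped` are illustrative names. Without the cap the training set should grow from fold to fold, while with `max_train_size=20` it should stay at 20 samples.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder data: only the number of samples matters for the split
X_demo = np.zeros((100, 1))

expanding = TimeSeriesSplit(n_splits=5)                  # default: expanding window
capped = TimeSeriesSplit(n_splits=5, max_train_size=20)  # fixed-size rolling window

expanding_sizes = [len(train) for train, _ in expanding.split(X_demo)]
capped_sizes = [len(train) for train, _ in capped.split(X_demo)]

print("Train sizes, expanding window:", expanding_sizes)
print("Train sizes, max_train_size=20:", capped_sizes)
```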


<a id="example_on_aquifer_petrignano_dataset"></a>
# 2. Example on Aquifer_Petrignano dataset

Let's see an example on one of the challenge datasets, Aquifer Petrignano.

In [None]:
df = pd.read_csv('/kaggle/input/acea-water-prediction/Aquifer_Petrignano.csv')
df = df.loc[~df['Date'].isna()]
df['Date'] = pd.to_datetime(df['Date'], format = "%d/%m/%Y").dt.date
df.sort_values('Date', ignore_index = True, inplace = True)
display(df.sample())

target_cols = ['Depth_to_Groundwater_P24', 'Depth_to_Groundwater_P25']
non_target_cols = list(set(df.columns) - set(target_cols + ['Date']))

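# Fill missing values with -99, then lag every feature by one row so that
# each target is paired with the previous row's observations (the first row becomes NaN)
X = df[non_target_cols].fillna(-99).shift(1)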
y = df[target_cols]

In [None]:
tscv = TimeSeriesSplit(n_splits)
plot_cv_indices(tscv, n_splits, X, y, date_col = df['Date'])
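Once the splitter is defined, it can be passed to scikit-learn's model-evaluation helpers through their `cv` argument. The sketch below is only an illustration under assumptions: it uses synthetic data instead of the Aquifer Petrignano features, and `Ridge` is an arbitrary stand-in model, not a recommendation for this challenge. On the real data, the NaN values left in `X` (for example the first row after the shift) would need to be handled before fitting.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic, time-ordered stand-in data (assumption: not the real dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

tscv_demo = TimeSeriesSplit(n_splits=5)

# Each fold fits on past samples only and scores on the block that follows
scores = cross_val_score(
    Ridge(alpha=1.0),
    X_demo,
    y_demo,
    cv=tscv_demo,
    scoring="neg_mean_absolute_error",
)

print("Negative MAE per fold:", scores)
print("Mean MAE:", -scores.mean())
```

The scorer returns negative errors because scikit-learn follows a "higher is better" convention, so the sign is flipped when reporting the mean MAE.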

I hope you find this notebook useful. If so, please upvote it. 

Tell me what you think in the comments! 