<h2 style="background-color:#e6f7ff;" align = 'center' > Tabular Playground Series July 2021 </h2>

In this notebook I perform some exploratory data analysis, and I will try to update it often.

<h4 style="background-color:#e6f7ff;" align = 'center'><i>Table of Contents</i></h4>

- [Data Description](#files)
- [First Exploration](#first_eda):
    - general info about data
    - train vs test date_time
    
- [Exploratory Data Analysis](#eda)

    - features distributions (train, train vs test)
    - correlation analysis
    - autocorrelation and cross-correlation analysis
    - cross validation strategies

**Under Construction**

- [TimeSeriesSplit and Sample_Submission](#sub)


*Versioning:*

Check Version 16 for just outputs.

In [None]:
import numpy as np
import pandas as pd
import itertools
import tqdm
import matplotlib
import matplotlib.pyplot as plt
from lightgbm import LGBMRegressor
import PIL.Image
import urllib.request
import warnings
warnings.filterwarnings("ignore")
from matplotlib.offsetbox import (TextArea, DrawingArea, OffsetImage,
                                  AnnotationBbox)
from matplotlib.patches import Patch
import seaborn as sns
import os

def crosscorr(datax, datay, lag=0):
    """ Lag-N cross correlation. 
    Parameters
    ----------
    lag : int, default 0
    datax, datay : pandas.Series objects of equal length

    Returns
    ----------
    crosscorr : float
    """
    return datax.corr(datay.shift(lag))

from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_pacf, plot_acf

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
root_path = '/kaggle/input/tabular-playground-series-jul-2021/'

<a id = "files"></a>
<h4>Data Description</h4>

In this competition you are predicting the values of air pollution measurements over time, based on basic weather information (temperature and humidity) and the input values of 5 sensors.

The three target values for you to predict are: *target_carbon_monoxide, target_benzene, and target_nitrogen_oxides*

<h4>Files</h4>

`train.csv` - the training data, including the weather data, sensor data, and values for the 3 targets

`test.csv` - the same format as train.csv, but without the target values; your task is to predict the value of each of the three targets.

`sample_submission.csv` - a sample submission file in the correct format.

<a id = "first_eda"></a>

<h4 style="background-color:#e6f7ff;" align = 'center'><i>First Exploration</i></h4>

<h6> Read Data </h6>

In [None]:
train = pd.read_csv(root_path + 'train.csv')
test = pd.read_csv(root_path + 'test.csv')

<h6> Show some info about `train` and `test` </h6>

In [None]:
train.info(), test.info()

Both datasets are small memory-wise and contain no nulls. All columns except `date_time` are `float64`.
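
Since `pd.read_csv` is called without `parse_dates`, `date_time` arrives as a string column. A minimal sketch of parsing it, on a toy frame whose values are made up for illustration (they are not taken from `train.csv`):

```python
import pandas as pd

# Toy frame standing in for train.csv; values are illustrative only.
toy = pd.DataFrame({
    "date_time": ["2010-03-10 18:00:00", "2010-03-10 19:00:00", "2010-03-10 20:00:00"],
    "deg_C": [13.1, 13.2, 12.6],
})

# Convert the string timestamps into a proper datetime column.
toy["date_time"] = pd.to_datetime(toy["date_time"])

# After parsing, datetime accessors such as .dt.hour become available.
toy["hour"] = toy["date_time"].dt.hour
print(toy.dtypes)
```

The rest of this notebook keeps `date_time` as read, so this is only a pointer for later feature engineering.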

<a id = 'train'></a>

Let's check whether there are gaps in our dates: 

In [None]:
MIN_DATETIME_TRAIN = train.date_time.min()
MAX_DATETIME_TRAIN = train.date_time.max()

assert len(train.date_time.unique()) == len(pd.date_range(MIN_DATETIME_TRAIN, MAX_DATETIME_TRAIN, freq='1H')), "There are gaps in train dates"

MIN_DATETIME_TEST = test.date_time.min()
MAX_DATETIME_TEST = test.date_time.max()

assert len(test.date_time.unique()) == len(pd.date_range(MIN_DATETIME_TEST, MAX_DATETIME_TEST, freq = '1H')), "There are gaps in test dates"

No gaps!
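
The assertions only say whether a gap exists. If one did, the missing timestamps could be listed as in this small sketch, which uses a toy hourly index with one hour deliberately removed (illustrative data, not the competition files):

```python
import pandas as pd

# Toy hourly timestamps with 02:00 deliberately missing (illustrative data).
stamps = pd.to_datetime([
    "2010-03-10 00:00:00",
    "2010-03-10 01:00:00",
    "2010-03-10 03:00:00",
])

# Full hourly grid between the first and last observed timestamp.
expected = pd.date_range(stamps.min(), stamps.max(), freq=pd.Timedelta(hours=1))

# Timestamps on the grid that never appear in the data.
missing = expected.difference(stamps)
print(missing)
```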

Let's see it graphically 

In [None]:
fig, ax = plt.subplots(1, 1, figsize = (11, 7))
cmap_cv = plt.cm.coolwarm
# Generate the training/testing visualizations for each CV split

# Fill in indices with the training/test groups
dates = pd.concat([train[['date_time']].assign(data='train'), test[['date_time']].assign(data = 'test')], axis = 0)

indices = np.array([1] * len(dates))
indices[dates.data == 'train'] = 1
indices[dates.data == 'test'] = 0

# Visualize the results
ax.scatter(range(len(train)), [.5] * len(train),
           c=indices[indices==1], marker='_', lw=15, cmap=cmap_cv,
           vmin=-.2, vmax=1.2)

ax.scatter(range(len(train), len(train)+len(test)), [1.] * len(test),
           c=indices[indices==0], marker='_', lw=15, cmap=cmap_cv,
           vmin=-.2, vmax=1.2)

date_col = dates['date_time']

if date_col is not None:
    tick_locations  = ax.get_xticks()
    for i in (tick_locations)[1:-1]:
        ax.vlines(i, 0, 2,linestyles='dotted', colors = 'grey')
    tick_dates = [" "] + date_col.iloc[list(tick_locations[1:-1])].astype(str).tolist() + [" "]
    
    tick_locations_str = [str(int(i)) for i in tick_locations]
    ax.set_xticks(tick_locations)
    ax.set_xticklabels(tick_dates, rotation = 35)
    ax.grid()
    #ax.set_yticklabels([])
    ax.set(yticks=np.arange(2) + .5, yticklabels=[],
           xlabel='date_time', ylabel="set",
           ylim=[0., 1.5])
    ax.legend([Patch(color=cmap_cv(.8)), Patch(color=cmap_cv(.02))],
              ['Training set', 'Testing set'], loc=(1.02, .8))
plt.suptitle("train.csv and test.csv date_time division", fontsize = 20, fontweight = 'bold')
plt.title("train: {}-{}\t test: {} - {}".format(MIN_DATETIME_TRAIN, MAX_DATETIME_TRAIN, MIN_DATETIME_TEST, MAX_DATETIME_TEST), fontsize = 10)

**There is one timestamp of overlap between train and test: 2011-01-01 00:00:00**

Let's check whether values correspond for that timestamp between train and test.

In [None]:
feature_cols = [i for i in train.columns if all(x not in i for x in ['target', 'date_time'])]
target_cols = ['target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides']

In [None]:
train[feature_cols + ['date_time']].merge(test[feature_cols + ['date_time']], on = 'date_time', suffixes = ('_train', '_test'))

**The feature values are exactly the same in both files, so for the first test point we may also have the target values from train.**
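
The same comparison can be done numerically rather than by eye. A minimal sketch on two toy frames that share one timestamp (`toy_train`, `toy_test` and all values are hypothetical stand-ins):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train/test that share exactly one timestamp (made-up values).
toy_train = pd.DataFrame({
    "date_time": ["2010-12-31 23:00:00", "2011-01-01 00:00:00"],
    "deg_C": [8.0, 7.5],
    "target_benzene": [4.1, 3.9],
})
toy_test = pd.DataFrame({
    "date_time": ["2011-01-01 00:00:00", "2011-01-01 01:00:00"],
    "deg_C": [7.5, 7.1],
})

# Inner merge keeps only the timestamps present in both frames.
overlap = toy_train.merge(toy_test, on="date_time", suffixes=("_train", "_test"))

# Compare the shared feature numerically.
same_features = np.allclose(overlap["deg_C_train"], overlap["deg_C_test"])
print(len(overlap), same_features)
```

When the features match, the target carried over from the train side of the merge is the candidate value for that test row.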

Let's check whether there are NaN values in our columns (we know from `.info()` that there are none):

In [None]:
train.isna().sum(axis = 0).rename('Number_of_Nans').to_frame().transpose()

<a id = "eda"></a>

<h4 style="background-color:#e6f7ff;" align = 'center'><i>Exploratory Data Analysis</i></h4>

<h6> Train Feature distributions: </h6>

In [None]:
index_col = 'date_time'
N_BINS = 20
plt.style.use('fivethirtyeight')
for feature in feature_cols[:1]:
    
    mean = round(train[feature].mean(), 2)
    median = round(train[feature].median(),2)
    st_dev = round(train[feature].std(), 2)
    
    fig,ax = plt.subplots(1, 1, figsize = (20, 6))
    (train[[index_col, feature]].sample(500).sort_values(index_col, ignore_index = True).set_index(index_col)
     .plot(lw = 2, linestyle = "-.", ax=ax, color = (0.31883238319215684, 0.4266050511215686, 0.8598574482039216)))
    fig.suptitle('{} through time'.format(feature), fontsize = 20, color ='black', fontweight = 'bold')

    Ys = ax.get_yticks()
    y_min = Ys.min()


    if y_min < 0:
        new_miny = y_min*0.93
    else:
        new_miny = y_min*1.07
    y_max = Ys.max()
    new_maxy = y_max*1.07

    Xs = ax.get_xticks()
    x_min = Xs.min()
    if x_min < 0:
        new_minx = x_min*0.8
    else:
        new_minx = x_min*1.2
    x_max = Xs.max()
    new_maxx = x_max*1.01

    ax.set_ylim(new_miny, new_maxy)
    ax.set_xlim(new_minx, new_maxx)

    ax.hlines(y = mean, xmin = x_min, xmax = x_max, colors='crimson',
              linestyles='dashdot', label='mean', alpha = 0.3, linewidth = 3)
    ax.text(x = x_max*0.9, y = train[feature].mean(), s = 'mean: {}'.format(mean))

    fig.show()

    fig,ax = plt.subplots(1, 1, figsize = (18, 6))
    
    if feature == 'deg_C':
        hot = 'https://gilmour.com/gilmour_map/images/256/hot.png'
        cold = 'https://cdn1.iconfinder.com/data/icons/winter-37/32/thermometer_cold_snow_winter_weather_forecast_temperature-128.png'
        
        # Load the image through Pillow (passing a URL to plt.imread is deprecated in matplotlib)
        arr_img = np.array(PIL.Image.open(urllib.request.urlopen(hot)))

        imagebox = OffsetImage(arr_img, zoom=0.25)
        imagebox.image.axes = ax

        ab = AnnotationBbox(imagebox, xy = [45, 0.05],
                        xybox=(30, 5),
                        frameon = False,
                        xycoords='data',
                        boxcoords="offset points",
                        pad=0.5,
                        arrowprops=dict(
                            arrowstyle="->",
                            connectionstyle="angle,angleA=0,angleB=90,rad=3")
                        )

        ax.add_artist(ab)
        
        # Load the "cold" icon through Pillow (the original opened the "hot" URL here by mistake)
        arr_img = np.array(PIL.Image.open(urllib.request.urlopen(cold)))

        imagebox = OffsetImage(arr_img, zoom=0.35)
        imagebox.image.axes = ax

        ab = AnnotationBbox(imagebox, xy = [0, 0.05],
                        xybox=(30, 5),
                        frameon = False,
                        xycoords='data',
                        boxcoords="offset points",
                        pad=0.5,
                        arrowprops=dict(
                            arrowstyle="->",
                            connectionstyle="angle,angleA=0,angleB=90,rad=3")
                        )

        ax.add_artist(ab)
    
    fig.suptitle('{} distribution'.format(feature), fontsize = 20, color ='black', fontweight = 'bold')

    percentiles_asked = [0.1, 0.25, 0.5, 0.75, 0.9]
    percentiles = train[feature].quantile(percentiles_asked).tolist()

    ax.grid()
    
    sns.histplot(data = train, x = feature, ax = ax, kde=False, bins = N_BINS, stat = 'density', 
                 alpha = 0.5, fill = True, linewidth = 3, edgecolor='black', color = 'red')
    sns.kdeplot(data = train, x = feature, ax = ax, alpha = 0.01, fill = True, 
                linewidth = 3, color = 'blue')

    Ys = ax.get_yticks()
    y_min = Ys.min()
    if y_min < 0:
        new_miny = y_min*0.93
    else:
        new_miny = y_min*1.07
        
    y_max = Ys.max()
    new_maxy = y_max*1.1
    
    Xs = ax.get_xticks()
    x_min = Xs.min()
    
    if x_min < 0:
        new_xmin = x_min*0.7
    else:
        new_xmin = x_min*1.2
        
    x_max = Xs.max()
    new_maxx = x_max*1.07
    
    ax.grid()
    
    ax.set_ylim(new_miny, new_maxy)
    ax.set_xlim(new_xmin-0.05, new_maxx)
    ax.text(new_xmin, new_maxy*0.2, "mean: {}".format(mean), size = 12, alpha = 1)
    ax.text(new_xmin, new_maxy*0.35, "median: {}".format(median), size = 12, alpha = 1)
    ax.text(new_xmin, new_maxy*0.5, "std deviation: {}".format(st_dev), size = 12, alpha = 1)
    
    
    percentiles_asked = [0.25, 0.5, 0.75]
    percentiles = train[feature].quantile(percentiles_asked).tolist()
    for m, percentile in enumerate(percentiles):
        ax.axvline(percentile, alpha = 0.5, ymin = 0, ymax = 1, linestyle = ":", color = '#FFC30B')
        ax.text(percentile-0.16, new_maxy, "{}".format(percentiles_asked[m]), size = 12, alpha = 1)
    
    fig.show()

Unhide to see all:

In [None]:
index_col = 'date_time'
N_BINS = 20
plt.style.use('fivethirtyeight')
for feature in feature_cols[1:]:
    
    mean = round(train[feature].mean(), 2)
    median = round(train[feature].median(),2)
    st_dev = round(train[feature].std(), 2)
    
    fig,ax = plt.subplots(1, 1, figsize = (20, 6))
    (train[[index_col, feature]].sample(500).sort_values(index_col, ignore_index = True).set_index(index_col)
     .plot(lw = 2, linestyle = "-.", ax=ax, color = (0.31883238319215684, 0.4266050511215686, 0.8598574482039216)))
    fig.suptitle('{} through time'.format(feature), fontsize = 20, color ='black', fontweight = 'bold')

    Ys = ax.get_yticks()
    y_min = Ys.min()


    if y_min < 0:
        new_miny = y_min*0.93
    else:
        new_miny = y_min*1.07
    y_max = Ys.max()
    new_maxy = y_max*1.07

    Xs = ax.get_xticks()
    x_min = Xs.min()
    if x_min < 0:
        new_minx = x_min*0.8
    else:
        new_minx = x_min*1.2
    x_max = Xs.max()
    new_maxx = x_max*1.01

    ax.set_ylim(new_miny, new_maxy)
    ax.set_xlim(new_minx, new_maxx)

    ax.hlines(y = mean, xmin = x_min, xmax = x_max, colors='crimson',
              linestyles='dashdot', label='mean', alpha = 0.3, linewidth = 3)
    ax.text(x = x_max*0.9, y = train[feature].mean(), s = 'mean: {}'.format(mean))

    fig.show()

    fig,ax = plt.subplots(1, 1, figsize = (18, 6))
    
    if feature == 'deg_C':
        hot = 'https://gilmour.com/gilmour_map/images/256/hot.png'
        cold = 'https://cdn1.iconfinder.com/data/icons/winter-37/32/thermometer_cold_snow_winter_weather_forecast_temperature-128.png'
        
        # Load the image through Pillow (passing a URL to plt.imread is deprecated in matplotlib)
        arr_img = np.array(PIL.Image.open(urllib.request.urlopen(hot)))

        imagebox = OffsetImage(arr_img, zoom=0.25)
        imagebox.image.axes = ax

        ab = AnnotationBbox(imagebox, xy = [45, 0.05],
                        xybox=(30, 5),
                        frameon = False,
                        xycoords='data',
                        boxcoords="offset points",
                        pad=0.5,
                        arrowprops=dict(
                            arrowstyle="->",
                            connectionstyle="angle,angleA=0,angleB=90,rad=3")
                        )

        ax.add_artist(ab)
        
        # Load the "cold" icon through Pillow (the original opened the "hot" URL here by mistake)
        arr_img = np.array(PIL.Image.open(urllib.request.urlopen(cold)))

        imagebox = OffsetImage(arr_img, zoom=0.35)
        imagebox.image.axes = ax

        ab = AnnotationBbox(imagebox, xy = [0, 0.05],
                        xybox=(30, 5),
                        frameon = False,
                        xycoords='data',
                        boxcoords="offset points",
                        pad=0.5,
                        arrowprops=dict(
                            arrowstyle="->",
                            connectionstyle="angle,angleA=0,angleB=90,rad=3")
                        )

        ax.add_artist(ab)
    
    fig.suptitle('{} distribution'.format(feature), fontsize = 20, color ='black', fontweight = 'bold')

    percentiles_asked = [0.1, 0.25, 0.5, 0.75, 0.9]
    percentiles = train[feature].quantile(percentiles_asked).tolist()

    ax.grid()
    
    sns.histplot(data = train, x = feature, ax = ax, kde=False, bins = N_BINS, stat = 'density', 
                 alpha = 0.5, fill = True, linewidth = 3, edgecolor='black', color = 'red')
    sns.kdeplot(data = train, x = feature, ax = ax, alpha = 0.01, fill = True, 
                linewidth = 3, color = 'blue')

    Ys = ax.get_yticks()
    y_min = Ys.min()
    if y_min < 0:
        new_miny = y_min*0.93
    else:
        new_miny = y_min*1.07
        
    y_max = Ys.max()
    new_maxy = y_max*1.1
    
    Xs = ax.get_xticks()
    x_min = Xs.min()
    
    if x_min < 0:
        new_xmin = x_min*0.7
    else:
        new_xmin = x_min*1.2
        
    x_max = Xs.max()
    new_maxx = x_max*1.07
    
    ax.grid()
    
    ax.set_ylim(new_miny, new_maxy)
    ax.set_xlim(new_xmin-0.05, new_maxx)
    ax.text(new_xmin, new_maxy*0.2, "mean: {}".format(mean), size = 12, alpha = 1)
    ax.text(new_xmin, new_maxy*0.35, "median: {}".format(median), size = 12, alpha = 1)
    ax.text(new_xmin, new_maxy*0.5, "std deviation: {}".format(st_dev), size = 12, alpha = 1)
    
    
    percentiles_asked = [0.25, 0.5, 0.75]
    percentiles = train[feature].quantile(percentiles_asked).tolist()
    for m, percentile in enumerate(percentiles):
        ax.axvline(percentile, alpha = 0.5, ymin = 0, ymax = 1, linestyle = ":", color = '#FFC30B')
        ax.text(percentile-0.16, new_maxy, "{}".format(percentiles_asked[m]), size = 12, alpha = 1)
    
    fig.show()

<h6> Train target distributions: </h6>

In [None]:
index_col = 'date_time'
N_BINS = 20
plt.style.use('fivethirtyeight')
for feature in target_cols:
    
    mean = round(train[feature].mean(), 2)
    median = round(train[feature].median(),2)
    st_dev = round(train[feature].std(), 2)
    
    fig,ax = plt.subplots(1, 1, figsize = (18, 6))
    (train[[index_col, feature]].sample(500).sort_values(index_col, ignore_index = True).set_index(index_col)
     .plot(lw = 2, linestyle = "-.", ax=ax, color = (0.31883238319215684, 0.4266050511215686, 0.8598574482039216)))
    fig.suptitle('{} through time'.format(feature), fontsize = 20, color ='black', fontweight = 'bold')

    Ys = ax.get_yticks()
    y_min = Ys.min()


    if y_min < 0:
        new_miny = y_min*0.93
    else:
        new_miny = y_min*1.07
    y_max = Ys.max()
    new_maxy = y_max*1.07

    Xs = ax.get_xticks()
    x_min = Xs.min()
    if x_min < 0:
        new_minx = x_min*0.8
    else:
        new_minx = x_min*1.2
    x_max = Xs.max()
    new_maxx = x_max*1.01

    ax.set_ylim(new_miny, new_maxy)
    ax.set_xlim(new_minx, new_maxx)

    ax.hlines(y = mean, xmin = x_min, xmax = x_max, colors='crimson',
              linestyles='dashdot', label='mean', alpha = 0.3, linewidth = 3)
    ax.text(x = x_max*0.9, y = train[feature].mean(), s = 'mean: {}'.format(mean))

    fig.show()

    fig,ax = plt.subplots(1, 1, figsize = (16, 6))
    
    fig.suptitle('{} distribution'.format(feature), fontsize = 20, color ='black', fontweight = 'bold')

    percentiles_asked = [0.1, 0.25, 0.5, 0.75, 0.9]
    percentiles = train[feature].quantile(percentiles_asked).tolist()

    ax.grid()
    
    sns.histplot(data = train, x = feature, ax = ax, kde=False, bins = N_BINS, stat = 'density', 
                 alpha = 0.5, fill = True, linewidth = 3, edgecolor='black', color = 'red')
    sns.kdeplot(data = train, x = feature, ax = ax, alpha = 0.01, fill = True, 
                linewidth = 3, color = 'blue')

    Ys = ax.get_yticks()
    y_min = Ys.min()
    if y_min < 0:
        new_miny = y_min*0.93
    else:
        new_miny = y_min*1.07
        
    y_max = Ys.max()
    new_maxy = y_max*1.1
    
    Xs = ax.get_xticks()
    x_min = Xs.min()
    
    if x_min < 0:
        new_xmin = x_min*0.7
    else:
        new_xmin = x_min*1.2
        
    x_max = Xs.max()
    new_maxx = x_max*1.07
    
    ax.grid()
    
    ax.set_ylim(new_miny, new_maxy)
    ax.set_xlim(new_xmin-0.05, new_maxx)
    ax.text(new_xmin, new_maxy*0.2, "mean: {}".format(mean), size = 12, alpha = 1)
    ax.text(new_xmin, new_maxy*0.35, "median: {}".format(median), size = 12, alpha = 1)
    ax.text(new_xmin, new_maxy*0.5, "std deviation: {}".format(st_dev), size = 12, alpha = 1)
    
    
    percentiles_asked = [0.25, 0.5, 0.75]
    percentiles = train[feature].quantile(percentiles_asked).tolist()
    for m, percentile in enumerate(percentiles):
        ax.axvline(percentile, alpha = 0.5, ymin = 0, ymax = 1, linestyle = ":", color = '#FFC30B')
        ax.text(percentile-0.16, new_maxy, "{}".format(percentiles_asked[m]), size = 12, alpha = 1)
    
    fig.show()

<h6> Train vs Test Feature distributions </h6>

In [None]:
for idx, feature in enumerate(feature_cols):
    
    if idx%4 == 0:
        fig, axes = plt.subplots(2, 2, figsize = (20, 14))
        ax = axes.ravel()

    sns.kdeplot(x = train[feature], 
            ax = ax[idx%4], alpha = 0.25, fill = True, label = 'train', 
            linewidth = 3, color = 'blue')

    sns.kdeplot(x = test[feature], 
            ax = ax[idx%4], alpha = 0.25, fill = True, label = 'test', 
            linewidth = 3, color = 'red')
    
    if idx%4 ==0:
        ax[idx%4].legend(fontsize = 20, loc = 'upper right')

    ax[idx%4].set_ylabel('Density', fontsize = 15)
    ax[idx%4].set_title('')
    fig.suptitle('Train vs Test Distribution comparison {}'.format(feature), 
             fontsize = 20, fontweight = 'bold')

    plt.subplots_adjust(hspace = 0.6)

<h5> Correlation Matrix </h5> 

In [None]:
corr_df = train.drop('date_time', axis = 1).copy()
corr_matrix = round(corr_df.corr(), 2)
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
colors = sns.color_palette('coolwarm', 16)
levels = np.linspace(-1, 1, 16)
cmap_plot, norm = matplotlib.colors.from_levels_and_colors(levels, colors, extend="max")

fig, axes = plt.subplots(1, 2, figsize = (18, 10), gridspec_kw={'width_ratios': [2, 1]})
ax = axes.ravel()
mask_feature = np.triu(np.ones_like(corr_matrix[feature_cols].loc[feature_cols], dtype=bool))
sns.heatmap(corr_matrix[feature_cols].loc[feature_cols], 
            mask = mask_feature,
            annot=True, ax = ax[0], cbar=False,
            cmap = cmap_plot, 
            norm = norm, annot_kws={"size": 15, "color": 'black', 'fontweight' : 'bold'})
ax[0].hlines(range(len(feature_cols)), *ax[0].get_xlim(), color = 'black')
ax[0].vlines(range(len(feature_cols)), *ax[0].get_ylim(), color = 'black')

mask_target = np.triu(np.ones_like(corr_matrix[target_cols].loc[target_cols], dtype=bool))
sns.heatmap(corr_matrix[target_cols].loc[target_cols], 
            mask = mask_target,
            annot=True, ax = ax[1], 
            cmap = cmap_plot, 
            norm = norm, annot_kws={"size": 15, "color": 'black', 'fontweight' : 'bold'})
ax[1].hlines(range(len(target_cols)), *ax[1].get_xlim(), color = 'black')
ax[1].vlines(range(len(target_cols)), *ax[1].get_ylim(), color = 'black')
ax[0].set_title('Features', fontsize = 20)
ax[1].set_title('Targets', fontsize = 20)

fig.suptitle('Correlation Matrix', 
             fontsize = 20, color = 'black', fontweight = 'bold')
fig, ax = plt.subplots(1, 1, figsize = (20, 20))
sns.heatmap(corr_matrix, mask=mask, annot=True, ax = ax, 
            cmap = cmap_plot, 
            norm = norm, annot_kws={"size": 15, "color": 'black', 'fontweight' : 'bold'})
ax.hlines(range(len(corr_matrix.columns)), *ax.get_xlim(), color = 'black')
ax.vlines(range(len(corr_matrix.columns)), *ax.get_ylim(), color = 'black')
ax.xaxis.set_ticks_position('bottom')
ax.set_title('Distinct values for each variable', fontsize = 20)
ax.tick_params(axis='both', which='major', labelsize=14)
ax.tick_params(axis='both', which='minor', labelsize=14)
ax.set_xticklabels(ax.get_xticklabels(), rotation = 'vertical', fontsize = 15, color = 'black')
ax.set_yticklabels(ax.get_yticklabels(), rotation = 35, fontsize = 15, color = 'black')
ax.xaxis.label.set_size(14)

circle_rad = 25  # This is the radius, in points
ax.plot(4.5, 9.5, 'o',
        ms=circle_rad * 2, mec='red', mfc='none', mew=4)

circle_rad = 25  # This is the radius, in points
ax.plot(4.5, 5.5, 'o',
        ms=circle_rad * 2, mec='blue', mfc='none', mew=4)

fig.suptitle('Correlation Matrix for {}'.format('train.csv'), 
             fontsize = 20, color = 'black', fontweight = 'bold')
plt.title("Circled highest and lowest correlation values", fontsize = 12)
fig.show()
    

<h5> AutoCorrelation Analysis </h5>

In [None]:
colors = [(0.31883238319215684, 0.4266050511215686, 0.8598574482039216), 
          (0.810615674827451, 0.26879706171764706, 0.23542761153333333)]

for enum, feature in enumerate(feature_cols):
    
    if enum % 2 == 0:
        fig, axes = plt.subplots(2, 3, figsize = (20, 12))
        ax = axes.ravel()
        
    if enum == 0:
        plt.suptitle("Autocorrelation Analysis for feature columns", size = 20, fontweight='bold')
    
    series = pd.concat([train[['date_time', feature]], test[['date_time', feature]]], axis = 0).drop_duplicates(ignore_index=True)
    
    series = series.set_index('date_time')
    
    series.sample(1000).sort_index().plot(lw = 3, ax = ax[enum%2*3], title = feature, color = colors[enum%2], legend = False)
    ax[0].set_xlabel('date_time')
    ax[0].set_xticks([])
    ax[3].set_xlabel('date_time')
    ax[3].set_xticks([])
    
    #myFmt = matplotlib.dates.DateFormatter("%Y-%m-%d")
    #ax[enum*3].xaxis.set_major_formatter(myFmt)

    acf_stat = acf(series.ffill(), nlags = 24)
    pd.Series(acf_stat[1:]).plot(ax = ax[enum%2*3+1], color = colors[enum%2],
                                 title = feature +' acf', linestyle = '--', alpha = None, lw = 2, ylim=(-1,1))

    pacf_stat = pacf(series.ffill(), nlags = 24, method = 'ols')
    pd.Series(pacf_stat[1:]).plot(ax = ax[enum%2*3+2], title = feature +' pacf', alpha = None, lw = 2, 
                                  linestyle = '-.', ylim=(-1,1), color = colors[enum%2],)

In [None]:
colors = [(0.31883238319215684, 0.4266050511215686, 0.8598574482039216), 
          (0.810615674827451, 0.26879706171764706, 0.23542761153333333),
          '#FFCD00']

for enum, feature in enumerate(target_cols):
    
    if enum % 3 == 0:
        fig, axes = plt.subplots(3, 3, figsize = (20, 12))
        ax = axes.ravel()
    
    plt.suptitle("Autocorrelation Analysis for target columns", size = 20, fontweight='bold')
    
    series = train.set_index('date_time')[feature].copy()
    
    series.sample(1000).sort_index().plot(lw = 3, ax = ax[enum%3*3], title = feature, color = colors[enum%3], legend = False)
    ax[0].set_xlabel('date_time')
    ax[0].set_xticks([])
    ax[3].set_xlabel('date_time')
    ax[6].set_xlabel('date_time')
    ax[3].set_xticks([])
    ax[6].set_xticks([])

    acf_stat = acf(series.ffill(), nlags = 24)
    pd.Series(acf_stat[1:]).plot(ax = ax[enum%3*3+1], color = colors[enum%3],
                                 title = feature +' acf', linestyle = '--', alpha = None, lw = 2, ylim=(-1,1))

    pacf_stat = pacf(series.ffill(), nlags = 24, method = 'ols')
    pd.Series(pacf_stat[1:]).plot(ax = ax[enum%3*3+2], title = feature +' pacf', alpha = None, lw = 2, 
                                  linestyle = '-.', ylim=(-1,1), color = colors[enum%3],)

**Personal take**: Each feature and target appears to be strongly correlated with its own value at the previous timestamp (1 hour earlier), as the partial autocorrelation plots show directly. Train and test were concatenated for the feature columns, while the target columns are only available in train.
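
To see what a strong lag-1 dependence looks like in isolation, here is a small sketch on a synthetic AR(1) series. The coefficient `phi = 0.9`, the length and the seed are arbitrary assumptions chosen for illustration; nothing here uses the competition data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulate an AR(1) process x_t = phi * x_{t-1} + noise (synthetic, not competition data).
phi = 0.9
noise = rng.normal(size=2000)
values = np.zeros(2000)
for t in range(1, 2000):
    values[t] = phi * values[t - 1] + noise[t]

series = pd.Series(values)

# Series.autocorr(lag) is the Pearson correlation between the series and its lagged copy.
lag1 = series.autocorr(lag=1)
lag24 = series.autocorr(lag=24)
print(round(lag1, 3), round(lag24, 3))
```

The lag-1 value should land close to `phi`, while the lag-24 value is much smaller. A real sensor series with a daily cycle would not decay this cleanly.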

In [None]:
feature_df = (pd.concat([train[['date_time']+ feature_cols], test[['date_time']+ feature_cols]], axis = 0)
              .drop_duplicates(ignore_index = True))

autocorr_df_features = (pd.DataFrame(feature_df.drop('date_time', axis = 1).apply(lambda x: x.autocorr(), 0))
                    .reset_index().rename(columns = {'index': 'column', 0: 'autocorrelation'})
                    .sort_values('autocorrelation', ascending = False))

autocorr_df_target = (pd.DataFrame(train[target_cols].apply(lambda x: x.autocorr(), 0))
                    .reset_index().rename(columns = {'index': 'column', 0: 'autocorrelation'})
                    .sort_values('autocorrelation', ascending = False))

autocorr_df = (pd.concat([autocorr_df_features, autocorr_df_target], axis = 0)
                    .sort_values('autocorrelation', ignore_index = True, ascending = False))
autocorr_df['autocorrelation'] = autocorr_df['autocorrelation'].round(4)

del autocorr_df_features, autocorr_df_target

fig, ax = plt.subplots(1, 2, figsize = (16, 8), gridspec_kw={'width_ratios': [2, 1]})
fig.suptitle('Autocorrelation lag 1 for each feature and target')
sns.barplot(x='autocorrelation', y='column', data=(autocorr_df), ax = ax[0], palette = 'coolwarm')
y_labels = autocorr_df.column.tolist()
ax[0].set_yticklabels([])
ax[0].set_xticklabels([])
t=0
for p in ax[0].patches:
    width = p.get_width() 
    if width < 0.01:
        ax[0].text(width,
        p.get_y() + p.get_height() / 2, 
        '{:1.4f}'.format(width),
        ha = 'left', 
        va = 'center')
    else:
        ax[0].text(width/4, 

        p.get_y() + p.get_height() / 2, 
        '{} {:1.4f}'.format(y_labels[t], width),
        ha = 'left',  
        va = 'center',
        color = 'black',
        fontsize = 12)
    t+=1
    
bbox=[-0.2, 0, 1.2, 0.9]
ax[1].axis('off')
ax[1].title.set_text('')
ccolors = plt.cm.BuPu(np.full(len(autocorr_df.columns), 0.1))

mpl_table = ax[1].table(cellText = autocorr_df.values, bbox=bbox, colLabels=autocorr_df.columns, colColours=ccolors)
mpl_table.auto_set_font_size(False)
mpl_table.auto_set_column_width(col=list(range(len(autocorr_df.columns))))
mpl_table.set_fontsize(14)
plt.subplots_adjust(hspace = 0.6)


<h5> CrossCorrelation Analysis </h5>
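
Lagged cross-correlation, as computed by the `crosscorr` helper defined at the top of the notebook, correlates one series with a shifted copy of another. The sketch below repeats that helper so it runs on its own and applies it to synthetic data in which one series is simply the other delayed by 3 steps (the series names, seed and lag range are assumptions for illustration):

```python
import numpy as np
import pandas as pd

def crosscorr(datax, datay, lag=0):
    """Same helper as defined at the top of the notebook."""
    return datax.corr(datay.shift(lag))

rng = np.random.default_rng(42)

# Synthetic driver series and a copy of it delayed by 3 steps (illustrative only).
driver = pd.Series(rng.normal(size=500))
delayed = driver.shift(3)

# Scan a few lags and keep the one with the largest absolute correlation.
scores = {lag: crosscorr(delayed, driver, lag=lag) for lag in range(0, 6)}
best_lag = max(scores, key=lambda lag: abs(scores[lag]))
print(best_lag, round(scores[best_lag], 3))
```

Note that the result depends on argument order: `crosscorr(a, b, lag)` and `crosscorr(b, a, lag)` are generally different numbers.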

In [None]:
TOTAL_LAGS = 25
total_lags = range(1, TOTAL_LAGS)
features = list(set(train.columns) - set(['date_time']))
combinations = list(itertools.product(features, features))
CROSS_THRESHOLD = 0.5

cross_corr = {}

for j in total_lags:
    cross_corr[j] = []
    for k in tqdm.tqdm(combinations):
        cross_corr[j].append(crosscorr(train[k[0]], train[k[1]], lag = j))

cross_corr = pd.DataFrame(cross_corr)
cross_corr.columns = ['cross_correlation_lag_{}'.format(i) for i in range(1, TOTAL_LAGS)]

cross_correlations = (pd.concat([pd.DataFrame(combinations).rename(columns = {0: 'first_feature', 1: 'second_feature'}),
                                 pd.DataFrame(cross_corr)], axis = 1))

cross_correlations_melt = (pd.melt(cross_correlations, id_vars=['first_feature', 'second_feature'], 
                           value_vars=['cross_correlation_lag_{}'.format(i) for i in range(1, TOTAL_LAGS)],
                           var_name = 'lag',
                           value_name = 'cross_correlation')
                          .assign(lag=lambda x: x.lag.str.replace('cross_correlation_lag_', "")))


def sort_features(x, y):
    """Return the two feature names as an alphabetically sorted tuple."""
    return tuple(sorted([x, y]))

cross_correlations_melt['pair_of_features'] = (cross_correlations_melt.apply(lambda x: sort_features(x.first_feature,
                                                                                                      x.second_feature), axis = 1))

cross_correlations_melt['first_feature'] = cross_correlations_melt['pair_of_features'].apply(lambda x: x[0])
cross_correlations_melt['second_feature'] = cross_correlations_melt['pair_of_features'].apply(lambda x: x[1])

cross_correlations_melt['pair_of_features'] = (cross_correlations_melt['first_feature'].str.replace("feature_", "") + 
                                          "__"  + cross_correlations_melt['second_feature'].str.replace("feature_", "") + "__lag" +
                                               cross_correlations_melt['lag']
                                              ).astype(str)

cross_correlations_melt = cross_correlations_melt.drop_duplicates(['pair_of_features'], ignore_index=True)

cross_correlations_melt = (cross_correlations_melt.loc[(abs(cross_correlations_melt.cross_correlation) > CROSS_THRESHOLD) & 
                                  (cross_correlations_melt.first_feature!= cross_correlations_melt.second_feature)]
                              .reset_index(drop = True))

<h6> Most positively correlated features </h6>

In [None]:
display(cross_correlations_melt.sort_values('cross_correlation', ascending = False, ignore_index = True).head(5))

<h6> Most negatively correlated features </h6>

In [None]:
display(cross_correlations_melt.sort_values('cross_correlation', ascending = True, ignore_index = True).head(5))

<h6> An example of positively and negatively cross-correlated features </h6>

In [None]:
cols = ['sensor_2', 'target_benzene']
df_cross = train.copy()
df_cross['target_benzene__lag1'] = df_cross['target_benzene'].shift(1)
df_cross['target_benzene__lag1'] = (df_cross['target_benzene__lag1'] - df_cross['target_benzene__lag1'].mean())/df_cross['target_benzene__lag1'].std()

df_cross['sensor_2'] = (df_cross['sensor_2'] - df_cross['sensor_2'].mean())/df_cross['sensor_2'].std()
df_cross['sensor_3__lag1'] = df_cross['sensor_3'].shift(1)
df_cross['sensor_3__lag1'] = (df_cross['sensor_3__lag1'] - df_cross['sensor_3__lag1'].mean())/df_cross['sensor_3__lag1'].std()

df_cross = df_cross.set_index('date_time')

fig, axes = plt.subplots(2, 1, figsize = (14, 10))
ax = axes.ravel()

df_cross[['sensor_2', 'target_benzene__lag1']].plot(ax = ax[0], lw = 2, alpha = 0.5, linestyle = "-.")


(df_cross[['sensor_2', 'sensor_3__lag1']].plot(ax = ax[1], lw = 2, 
                                                 linestyle = "-.",  alpha = 0.5, sharex=True))

fig.suptitle('Positive (Above) vs Negative (Below) crosscorrelation')

myFmt = matplotlib.dates.DateFormatter("%Y-%m")
ax[1].xaxis.set_ticks([])
ax[0].set_title('sensor_2 vs target_benzene__lag1: 0.824438 Correlation')
ax[1].set_title('sensor_2 vs sensor_3__lag1: -0.72352 Correlation')
ax[1].legend(loc="upper right", bbox_to_anchor=(1.1,1.1))

<h5> Types of CrossValidation Techniques </h5>

Let's see different `sklearn.model_selection` Split generators and choose the one most suitable for Time Series data.  

In [None]:
def plot_cv_indices(cv, n_splits, X, y, date_col = None):
    """Create a sample plot for indices of a cross-validation object."""
    
    fig, ax = plt.subplots(1, 1, figsize = (11, 7))
    
    # Generate the training/testing visualizations for each CV split
    for ii, (tr, tt) in enumerate(cv.split(X=X, y=y)):
        # Fill in indices with the training/test groups
        indices = np.array([np.nan] * len(X))
        indices[tt] = 1
        indices[tr] = 0

        # Visualize the results
        ax.scatter(range(len(indices)), [ii+1] * len(indices),
                   c=indices, marker='_', lw=10, cmap=cmap_cv,
                   vmin=-.2, vmax=1.2)


    # Formatting
    yticklabels = list(range(n_splits))
    
    if date_col is not None:
        tick_locations  = ax.get_xticks()
        tick_dates = [" "] + date_col.iloc[list(tick_locations[1:-1])].astype(str).tolist() + [" "]

        tick_locations_str = [str(int(i)) for i in tick_locations]
        new_labels = ['\n\n'.join(x) for x in zip(list(tick_locations_str), tick_dates) ]
        ax.set_xticks(tick_locations)
        ax.set_xticklabels(new_labels)
    
    ax.set(yticks=np.arange(n_splits+2), #yticklabels=yticklabels,
           xlabel='Sample index', ylabel="CV iteration",
           ylim=[n_splits + .5, 0]
          )
    #ax.set_yticklabels([""] +range(1, n_splits+1)+[])
    ax.legend([Patch(color=cmap_cv(.8)), Patch(color=cmap_cv(.02))],
              ['Testing set', 'Training set'], loc=(1.02, .8))
    ax.set_title('{}'.format(type(cv).__name__), fontsize=15)
    
from sklearn.model_selection import KFold, ShuffleSplit, StratifiedKFold, StratifiedShuffleSplit, TimeSeriesSplit
cvs = [KFold, ShuffleSplit, StratifiedKFold, StratifiedShuffleSplit, TimeSeriesSplit]
n_points = 100
n_splits = 5
X = np.random.randn(100, 10)
percentiles_classes = [.1, .3, .6]
y = np.hstack([[ii] * int(100 * perc) for ii, perc in enumerate(percentiles_classes)])

for i, cv in enumerate(cvs):
    this_cv = cv(n_splits=n_splits)
    plot_cv_indices(this_cv, n_splits, X, y, date_col=None)

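As you can see, `TimeSeriesSplit` is the most appropriate choice for time series data like ours: in every split the training indices come strictly before the validation indices, so the model is never fitted on observations from the future.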
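
The ordering can also be checked directly on the indices. A minimal sketch with ten dummy samples and three splits (both numbers, and the name `X_toy`, are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten time-ordered dummy samples; only their order matters here.
X_toy = np.arange(10).reshape(-1, 1)

splits = list(TimeSeriesSplit(n_splits=3).split(X_toy))
for fold, (train_idx, val_idx) in enumerate(splits):
    # Print which positions each fold trains on and validates on.
    print(fold, train_idx.tolist(), val_idx.tolist())

# True when every validation index comes after every training index.
ordered = all(tr.max() < va.min() for tr, va in splits)
print(ordered)
```

The training window grows from fold to fold, which mimics retraining on all history available up to a given point.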

<a id = 'sub'></a>
<h5> Example usage of Time Series Split for Sample Submission </h5>

In [None]:
sample_submission = pd.read_csv(root_path + "sample_submission.csv")
assert len(sample_submission) == len(test), 'Different number of values between sample_submission.csv and test.csv'

Let's build a training set with lag-1 columns for our features, since the autocorrelation is strongest at lag 1.
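
As a quick illustration of what `shift()` produces, here is a sketch on a four-row toy column (the values are made up):

```python
import pandas as pd

# Toy sensor readings in time order (illustrative values).
toy = pd.DataFrame({"sensor_1": [10.0, 12.0, 11.0, 13.0]})

# shift() moves values down by one row, so each row sees the previous reading.
toy["sensor_1_lag"] = toy["sensor_1"].shift()
print(toy)
```

The first row has no predecessor, so its lag value is NaN. LightGBM handles missing values natively, so that row does not need to be dropped for the model used here.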

In [None]:
train_length = len(train)-1 # drop the last train row, since its timestamp is also the first row of test

full_df = pd.concat([train[['date_time']+feature_cols].iloc[:-1], test[['date_time']+feature_cols]], axis = 0, ignore_index = True)
assert len(full_df) == train_length + len(test)

new_feature_cols=[]

for feature_col in feature_cols:
    new_feature = feature_col+"_lag"
    full_df[new_feature] = full_df[feature_col].shift()
    new_feature_cols.append(new_feature)

In [None]:
ts_fold = TimeSeriesSplit(n_splits = 5)

In [None]:
predictions = []

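# Each fold trains on an expanding window of past data. val_idx is unused here: every
# fold's model predicts the test period directly and the predictions are averaged later.
for train_idx, val_idx in tqdm.tqdm(ts_fold.split(X=full_df.iloc[:train_length], y=train.iloc[:train_length][target_cols])):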
    
    preds = []
    for target in target_cols:
        lgbm_model = LGBMRegressor(max_depth = 10)
        lgbm_model.fit(full_df.loc[train_idx, feature_cols+new_feature_cols], train.loc[train_idx, target].values)
        preds.append(lgbm_model.predict(full_df.iloc[train_length:][feature_cols+new_feature_cols]))
    
    predictions.append(preds)

In [None]:
preds = np.mean(predictions, axis = 0)

In [None]:
sample_submission[target_cols] = preds.transpose()

In [None]:
sample_submission.to_csv('sample_submission.csv', index = False)