<a id="toc"></a>
<h1 style="background-color:#fda172; font-size: 3rem; font-weight: 700; padding: 0.5rem 0 0.5rem 0;" align = 'center' > Tabular Playground Series July 2021: Holiday Feature </h1>

In this notebook, we create several time-related features. Most of them have already appeared in
other notebooks.[\[2-5\]](#time_feat_1) The addition here is a feature called `holiday`.

Note: The style of this table of contents is taken from Ref. [\[1\]](#notebook_toc).

<h2 style="background-color:#f5deb3;font-size: 2.5rem; font-weight: 700; padding: 0.5rem 0 0.5rem 0;" align = 'center'>Table of Contents</h2>

- [Import Packages](#import_package)

- [Load Data](#load_data)

- [Feature Engineering](#feature_engineering)
    - new features: hour, day, day of week, month, year, holiday
    
- [Effects of Holiday](#effects_of_holiday)
    - on [temperature and humidity](#effects_on_temp_hum)
    - on [carbon monoxide, benzene, nitrogen oxides](#effects_on_targets)
    - on [sensor 1 to 5](#effects_on_sensors)
    
- [Seasonality](#seasonality)
    - Trends for hour, day, day of week and month
    
- [Data Distribution](#data_distribution)    
    - Check for normality
    
- [Simple Models](#simple_model)    
    - Use Ridge and LGBM model with and without holiday feature
    
- [Submission](#submission)    
     - Submit using holiday feature
     
- [References](#references)



<a id="import_package"></a>
<h2 style="background-color:#f5deb3;font-size: 2.5rem; font-weight: 700; padding: 0.5rem 0 0.5rem 0;" align = 'center'>Import Packages</h2>

[\[Back to top\]](#toc)

In [None]:
import pandas as pd
import numpy as np
import sklearn as sk
import seaborn as sns
import matplotlib.pyplot as plt

from pandas.tseries.holiday import USFederalHolidayCalendar
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler

from scipy import stats
from datetime import datetime, timedelta

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

SEED = 2021

<a id="load_data"></a>
<h2 style="background-color:#f5deb3;font-size: 2.5rem; font-weight: 700; padding: 0.5rem 0 0.5rem 0;" align = 'center'>Load Data</h2>

We load the training data, check for missing values, and take a quick look at the basic statistics.

[\[Back to top\]](#toc)

In [None]:
data = pd.read_csv('/kaggle/input/tabular-playground-series-jul-2021/train.csv')
TARGETS = ["target_carbon_monoxide", "target_benzene", "target_nitrogen_oxides"]
# Check for columns with missing values
nullindata = False
for col, n_missing in data.isna().sum().items():
    if n_missing > 0:
        nullindata = True
        print("{} has {} nans.".format(col, n_missing))
if not nullindata:
    print("There is no missing data.")
        
# Neat trick from https://www.kaggle.com/marcinstasko/pca-analysis-tutorial-from-scratch?scriptVersionId=61741932
data.describe().drop('count').T\
        .style.bar(subset=['mean'])\
        .background_gradient(subset=['std'])\
        .background_gradient(subset=['50%'])\
        .background_gradient(subset=['max'])


<a id="feature_engineering"></a>
<h2 style="background-color:#f5deb3;font-size: 2.5rem; font-weight: 700; padding: 0.5rem 0 0.5rem 0;" align = 'center'>Feature Engineering</h2>

We will create new features based on the `date_time` column. 
Many notebooks have published similar features, but they do not include the `holiday` feature.
Here, a holiday is either a [standard US federal holiday](https://pandas.pydata.org/pandas-docs/version/0.17.1/timeseries.html#holidays-holiday-calendars)
or a weekend day (Saturday or Sunday). Refer to the pandas documentation for more information. 

We could also include the traffic hours during the weekdays.
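
As a rough sketch of the traffic-hours idea, one could flag weekday rush hours. The `rush_hour` column name and the two time windows below are assumptions made for illustration; they are not derived from the data.

```python
import pandas as pd

# Toy timestamps standing in for the `date_time` column
toy = pd.DataFrame({"date_time": pd.to_datetime([
    "2010-03-10 08:00",  # Wednesday morning
    "2010-03-10 13:00",  # Wednesday midday
    "2010-03-10 18:00",  # Wednesday evening
    "2010-03-13 08:00",  # Saturday morning
])})

hour = toy["date_time"].dt.hour
is_weekday = toy["date_time"].dt.dayofweek < 5

# Hypothetical rush-hour windows: hours 7 to 9 and 17 to 19 (inclusive)
in_window = hour.between(7, 9) | hour.between(17, 19)
toy["rush_hour"] = is_weekday & in_window
print(toy)
```

Whether such a flag adds anything beyond the `hour` and `dayofweek` features would need to be checked on the actual data.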

[\[Back to top\]](#toc)

In [None]:
# Most of the work goes into processing the date
# Some obvious features are whether the day is a holiday
# and the hour of the day
# NOTE: There is only a year's worth of data, hence no yearly trend
def feature_engineering(data):
    if "date_time" in data.columns:
        data['date_time'] = pd.to_datetime(data.date_time)
        data['day'] = data.date_time.map(lambda x: x.day)
        data['month'] = data.date_time.map(lambda x: x.month)
        data['year'] = data.date_time.map(lambda x: x.year)
        data['hour'] = data.date_time.map(lambda x: x.hour)
        data['date'] = data.date_time.map(lambda x: x.date())
        data['dayofweek'] = data.date_time.map(lambda x: x.dayofweek)

        # Holidays = weekends + federal holidays
        cal = USFederalHolidayCalendar()
        holidays = cal.holidays(start=data['date'].min(), end=data['date'].max()+ timedelta(days=1))
        holidays = holidays.map(lambda x: x.date())
        data['holiday'] = data.date.isin(holidays) |  (data.dayofweek >=5)

        # Reduce the relative humidity to [0, 1]
        data['relative_humidity'] = data['relative_humidity']/100.0

        # The Kelvin scale runs from 0 to infinity
        # Not really useful if we do a standard scaling
        # Maybe you could try to map [0, inf) -> (0, 1] with 1/(1+x)
        data['deg_K'] = data['deg_C'] + 273.15
    return data
data = feature_engineering(data)

Here are the holidays used. You could also try creating a custom holiday calendar.

In [None]:
# Here are the holidays used
USFederalHolidayCalendar().rules
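
A custom calendar can be sketched by subclassing pandas' `AbstractHolidayCalendar`. The class name `CustomHolidayCalendar` and the three fixed-date holidays below are hypothetical placeholders; replace them with the holidays that apply where the measurements were taken.

```python
import pandas as pd
from pandas.tseries.holiday import AbstractHolidayCalendar, Holiday

class CustomHolidayCalendar(AbstractHolidayCalendar):
    # Hypothetical rule set: fixed-date holidays only, no observance shifting
    rules = [
        Holiday("New Year's Day", month=1, day=1),
        Holiday("Labour Day", month=5, day=1),
        Holiday("Christmas Day", month=12, day=25),
    ]

custom_holidays = CustomHolidayCalendar().holidays(start="2010-01-01", end="2010-12-31")

# Same rule as in feature_engineering: custom holidays plus weekends
dates = pd.Series(pd.to_datetime([
    "2010-01-01 10:00",  # Friday, listed holiday
    "2010-01-04 10:00",  # Monday, ordinary day
    "2010-01-09 10:00",  # Saturday
]))
is_holiday = dates.dt.normalize().isin(custom_holidays) | (dates.dt.dayofweek >= 5)
print(is_holiday.tolist())
```

To use it in `feature_engineering`, the `USFederalHolidayCalendar()` instance would be swapped for the custom class.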

<a id="effects_of_holiday"></a>
<h2 style="background-color:#f5deb3;font-size: 2.5rem; font-weight: 700; padding: 0.5rem 0 0.5rem 0;" align = 'center'>Effects of Holiday</h2>

This section focuses on holiday and non-holiday seasonalities.

[\[Back to top\]](#toc)

In [None]:
def plot_trend_by(response="target_carbon_monoxide", title="Carbon Monoxide", grouping="holiday"):
    timeframe = [["hour", "Hour of Day"], ["dayofweek", "Day of Week"], ["day", "Day of Month"], ["month", "Month of Year"]]
    fig, ax = plt.subplots(4, 1, figsize=(20, 20), dpi=100, sharey=True)
    for i in range(4):
        # `kind` is a catplot argument, so it is not passed to violinplot
        ax1 = sns.violinplot(ax=ax[i], data=data, x=timeframe[i][0], y=response, hue=grouping, split=True)
        if i == 3:
            ax1.set(xlabel=timeframe[i][1], ylabel=title)
        else:
            ax1.set(ylabel=title)
    plt.suptitle(title, fontsize=24, y=1)
    plt.tight_layout()
    plt.show()

<a id="effects_on_temp_hum"></a>
<h3>Temperature and Humidity</h3>

In [None]:
titles = [r"Temperature ($^{\circ}$C)", "Relative Humidity", "Absolute humidity"]
for ii, x in enumerate(["deg_C", "relative_humidity", "absolute_humidity"]):
    plot_trend_by(response=x, title=titles[ii], grouping="holiday")

<a id="effects_on_targets"></a>
<h3>Targets: Carbon monoxide, Benzene, Nitrogen oxides</h3>

[\[Back to top\]](#toc)

In [None]:
titles = ["Carbon monoxide", "Benzene", "Nitrogen oxides"]
for ii, x in enumerate(TARGETS):
    plot_trend_by(response=x, title=titles[ii], grouping="holiday")

<a id="effects_on_sensors"></a>
<h3>Sensor 1 to 5</h3>

[\[Back to top\]](#toc)

In [None]:
for ii in range(1, 6):
    plot_trend_by(response="sensor_{}".format(ii), title="Sensor {}".format(ii), grouping="holiday")

<a id="seasonality"></a>
<h2 style="background-color:#f5deb3;font-size: 2.5rem; font-weight: 700; padding: 0.5rem 0 0.5rem 0;" align = 'center'>Seasonality</h2>

These plots are similar to those in the previous section, but focus on the mean and median of each variable.
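
The mean and median lines come from a per-group aggregation. A minimal sketch of that aggregation on made-up values (the numbers are not from the data set):

```python
import pandas as pd

# Made-up temperatures for two hours of the day
toy = pd.DataFrame({
    "hour": [0, 0, 0, 1, 1, 1],
    "deg_C": [10.0, 12.0, 20.0, 11.0, 13.0, 15.0],
})

# One row per hour, with the mean and median side by side
summary = toy.groupby("hour")["deg_C"].agg(["mean", "median"])
print(summary)
```

A mean well above the median, as for hour 0 here, points to a few large values pulling the average up.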

[\[Back to top\]](#toc)

In [None]:
def plot_trend(data, yval, title):
    fig, ax = plt.subplots(2, 2, figsize=(8,8), dpi=100)
    ax[0,0].scatter(data.hour, data[yval], s=10, alpha=0.1)
    df = data.groupby('hour')
    ax[0,0].plot(df[yval].mean(), ls="--", c='red', label="mean")
    ax[0,0].plot(df[yval].median(), ls="--", c='green', label="median")
    ax[0,0].set_title("Hour of day")
    ax[0,0].set_xlim(-1, 24)
    
    ax[0,1].scatter(data.dayofweek, data[yval], s=10, alpha=0.1)
    df = data.groupby('dayofweek')
    ax[0,1].plot(df[yval].mean(), ls="--", c='red', label="mean")
    ax[0,1].plot(df[yval].median(), ls="--", c='green', label="median")
    ax[0,1].set_title("Day of week")
    ax[0,1].set_xlim(-1, 7)

    ax[1,0].scatter(data.day, data[yval], s=10, alpha=0.1)
    df = data.groupby('day')
    ax[1,0].plot(df[yval].mean(), ls="--", c='red', label="mean")
    ax[1,0].plot(df[yval].median(), ls="--", c='green', label="median")
    ax[1,0].set_title("Day of month")
    ax[1,0].set_xlim(0, 32)

    ax[1,1].scatter(data.month, data[yval], s=10, alpha=0.1)
    df = data.groupby('month')
    ax[1,1].plot(df[yval].mean(), ls="--", c='red', label="mean")
    ax[1,1].plot(df[yval].median(), ls="--", c='green', label="median")
    ax[1,1].set_title("Month")
    ax[1,1].set_xlim(0, 13)
    ax[0,0].legend()
    plt.suptitle(title, fontsize=24, y=1)
    plt.tight_layout()
    plt.show()

In [None]:
plot_trend(data, "deg_C", r"Temperature ($^{\circ}$C)")
plot_trend(data, "relative_humidity", "Relative Humidity")
plot_trend(data, "absolute_humidity", "Absolute humidity")

# Sensors
plot_trend(data, "sensor_1", "Sensor 1")
plot_trend(data, "sensor_2", "Sensor 2")
plot_trend(data, "sensor_3", "Sensor 3")
plot_trend(data, "sensor_4", "Sensor 4")
plot_trend(data, "sensor_5", "Sensor 5")

# Targets
plot_trend(data, "target_carbon_monoxide", "Target carbon monoxide")
plot_trend(data, "target_benzene", "Target benzene")
plot_trend(data, "target_nitrogen_oxides", "Target nitrogen oxides")

<a id="data_distribution"></a>
<h2 style="background-color:#f5deb3;font-size: 2.5rem; font-weight: 700; padding: 0.5rem 0 0.5rem 0;" align = 'center'>Data Distribution</h2>

This is a standard check of whether the data is normally distributed. Other notebooks do the same thing, but we show it here
because we want a quick comparison of a LightGBM model with and without the `holiday` feature.

[\[Back to top\]](#toc)

The data is not normally distributed, except for sensor 4 (ignoring the small peak).
We might need a Box-Cox transformation when using linear regression or a neural network.
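
For reference, the Box-Cox transform maps a strictly positive value x to (x^λ - 1)/λ for λ ≠ 0 and to log(x) for λ = 0, with λ usually chosen by maximum likelihood. The sketch below applies `scipy.stats.boxcox` to a synthetic right-skewed sample (not the competition data) to show the effect on skewness.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly positive, right-skewed sample
rng = np.random.default_rng(2021)
sample = rng.lognormal(mean=0.0, sigma=0.8, size=2000)

# With no lambda given, boxcox estimates it by maximum likelihood
transformed, lam = stats.boxcox(sample)

skew_before = stats.skew(sample)
skew_after = stats.skew(transformed)
print("lambda = {:.3f}".format(lam))
print("skewness before = {:.3f}, after = {:.3f}".format(skew_before, skew_after))
```

Because the transform needs strictly positive inputs, columns containing zeros or negative values would have to be shifted first or handled with a different method.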


In [None]:
data.drop(columns=['date_time', 'day', 'month', 'dayofweek', 'hour', 'year']).hist(figsize=(20, 20), density=True, bins=50)
plt.suptitle("Raw data", fontsize=24, y=1)
plt.tight_layout()
plt.show()

In [None]:
# NOTE: yeo-johnson method complains about division by zero
pt = PowerTransformer(method='box-cox')

# NOTE: We are dropping deg_K since it carries the same information as deg_C
data_normalized = data.drop(columns=['date_time', 'day', 'month', 'dayofweek', 'hour', 'year', 'date', 'holiday'])
data_normalized = pd.DataFrame(data=pt.fit_transform(data_normalized[data_normalized.columns[:-4]]), columns=data_normalized.columns[:-4])
data_normalized['target_carbon_monoxide'] = StandardScaler().fit_transform(np.log(data['target_carbon_monoxide']).to_numpy().reshape(-1,1))
data_normalized['target_benzene'] = StandardScaler().fit_transform(np.log(1+data['target_benzene']).to_numpy().reshape(-1,1))
data_normalized['target_nitrogen_oxides'] = StandardScaler().fit_transform(np.log(data['target_nitrogen_oxides']).to_numpy().reshape(-1,1))
data_normalized.hist(figsize=(20, 20), density=True, bins=50)
plt.suptitle("After normalization", fontsize=24, y=1)
plt.tight_layout()
plt.show()


# Add back the missing columns
data_normalized[['date_time', 'day', 'month', 'dayofweek', 'hour', 'year', 'date', 'holiday']] = data[['date_time', 'day', 'month', 'dayofweek', 'hour', 'year', 'date', 'holiday']]

In [None]:
# Double-check the normality of the transformed targets
# with probability plots
fig, ax = plt.subplots(1, 3, figsize=(12, 6), dpi=100)
for ii, x in enumerate(['target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides']):
    stats.probplot(data_normalized[x], plot=ax[ii])
    ax[ii].set(title=x)
plt.tight_layout()
plt.show()

<a id="simple_model"></a>
<h2 style="background-color:#f5deb3;font-size: 2.5rem; font-weight: 700; padding: 0.5rem 0 0.5rem 0;" align = 'center'>Simple Models</h2>

Here we treat the problem as a regular regression problem: for each row of the table,
we try to predict the targets of that same row.

[\[Back to top\]](#toc)

In [None]:
from lightgbm import LGBMRegressor
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor
from sklearn.preprocessing import OneHotEncoder, PowerTransformer, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import mean_squared_log_error

In [None]:
class LogTransform(BaseEstimator, TransformerMixin):
    # Log transform of the targets; p1=True uses log(1 + x) instead of log(x)
    def __init__(self, p1=False, **kwargs):
        super(LogTransform, self).__init__(**kwargs)
        self.p1 = p1
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        if self.p1:
            return np.log(X+1)
        return np.log(X)
    
    def fit_transform(self, X, y=None):
        return self.transform(X)
    
    def inverse_transform(self, X):
        if self.p1:
            return np.exp(X)-1
        return np.exp(X) 

def get_model(estimator, num_feat, cat_feat, other_feat, with_fixed_feat=True):
    # Given an estimator and some features,
    # return a model and target transformers
    
    # Define preprocessing steps
    # Apply box-cox to numerical input features
    num_proc = make_pipeline(PowerTransformer(method='box-cox'))

    # Apply one hot encoding to categorical input feature
    months = [np.array([i for i in range(1, 13) ])]
    days = [np.array([i for i in range(1, 32) ])]
    dayofweeks = [np.array([i for i in range(0, 7) ])]
    hours = [np.array([i for i in range(0, 24) ])]

    cat_proc = make_pipeline(OneHotEncoder(sparse=False))
    if with_fixed_feat:
        fixed_feat = ['hour', 'day', 'dayofweek', 'month']
        preprocessor = make_column_transformer((num_proc, num_feat),
                                               (OneHotEncoder(sparse=False, categories=days), ['day']),
                                               (OneHotEncoder(sparse=False, categories=dayofweeks), ['dayofweek']),
                                               (OneHotEncoder(sparse=False, categories=months), ['month']),
                                               (OneHotEncoder(sparse=False, categories=hours), ['hour']),
                                               (cat_proc, cat_feat), remainder='passthrough')
    else:
        fixed_feat = []
        preprocessor = make_column_transformer((num_proc, num_feat),
                                               (cat_proc, cat_feat), remainder='passthrough')
    
    # Column transformer does not support inverse transform
    co_proc = make_pipeline(LogTransform(), StandardScaler())
    bz_proc = make_pipeline(LogTransform(p1=True), StandardScaler())
    no_proc = make_pipeline(LogTransform(), StandardScaler())

    features = num_feat + cat_feat + other_feat + fixed_feat
    X = data[features].copy()
    Y = data[TARGETS].copy()
    model = make_pipeline(preprocessor, MultiOutputRegressor(estimator))
    tgt_proc = { 'target_carbon_monoxide': co_proc,
                 'target_benzene': bz_proc,
                 'target_nitrogen_oxides': no_proc }
    return model, X, Y, tgt_proc

def transform_targets(Y, tgt_proc, fit=False, forward=True):
    # Transform the target variables
    if fit and forward:
        for x in TARGETS:
            tgt_proc[x].fit(Y[x].values.reshape(-1,1))
    
    arr = np.zeros(Y.shape)
    for ii, x in enumerate(TARGETS):
        if forward:
            arr[:,ii] = tgt_proc[x].transform(Y[x].values.reshape(-1,1)).flatten()
        else:
            arr[:,ii] = tgt_proc[x].inverse_transform(Y[x].values.reshape(-1,1)).flatten()
    y_transformed = pd.DataFrame(data=arr, columns=Y.columns)
    return y_transformed

def fit_model(model, X, Y, tgt_proc, split=True, test_size=0.3):
    # Given data, a model and target transformer, fit the model
    if split:
        X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=SEED, test_size=test_size)
        y_train_tf = transform_targets(y_train, tgt_proc, fit=True, forward=True) 
        y_test_tf = transform_targets(y_test, tgt_proc, fit=False, forward=True) 
    else:
        X_train = X
        y_train = Y
        y_train_tf = transform_targets(Y, tgt_proc, fit=True, forward=True) 
        
    model.fit(X_train, y_train_tf)

    pred_train = model.predict(X_train)
    pred_train = transform_targets(pd.DataFrame(data=pred_train, columns=TARGETS), tgt_proc, forward=False).to_numpy()
    assert len(pred_train[pred_train<0]) == 0
    loss_train = mean_squared_log_error(y_pred=pred_train, y_true=y_train)
    
    if split:
        pred_test = model.predict(X_test)
        pred_test = transform_targets(pd.DataFrame(data=pred_test, columns=TARGETS), tgt_proc, forward=False).to_numpy()
        assert len(pred_test[pred_test<0]) == 0
        loss_test = mean_squared_log_error(y_pred=pred_test, y_true=y_test)
        print("Train: {:.6f}, Test: {:.6f}".format(loss_train, loss_test))
    else:
        print("Train: {:.6f}".format(loss_train))
    return model
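
The target handling above is a log-then-standardize round trip. As a minimal sketch of the same idea, the block below uses scikit-learn's `FunctionTransformer` in place of the custom `LogTransform` class; this is an alternative for illustration and not what the notebook runs. The target values are made up.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Made-up positive target values, shaped as a single column
y = np.array([0.5, 1.0, 2.0, 4.0, 8.0]).reshape(-1, 1)

# log(1 + y) followed by standardization; both steps are invertible
tgt = make_pipeline(
    FunctionTransformer(np.log1p, inverse_func=np.expm1),
    StandardScaler(),
)

y_tf = tgt.fit_transform(y)           # what the model is trained on
y_back = tgt.inverse_transform(y_tf)  # back to the original scale
print(np.allclose(y_back, y))
```

Predictions made in the transformed space are mapped back with `inverse_transform` before computing the loss, which is the role of `forward=False` in `transform_targets`.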

Without holiday feature

In [None]:
other_feat = []
cat_feat = []
num_feat = ['deg_K', 'absolute_humidity', 'relative_humidity'] + ['sensor_{}'.format(ii) for ii in range(1, 6)]

for wff in [True, False]:
    for estimator in [Ridge(alpha=60), LGBMRegressor(random_state=SEED) ]:
        print(estimator, "with fixed feat = ", wff)
        model, X, Y, tgt_proc = get_model(estimator, num_feat, cat_feat, other_feat, with_fixed_feat=wff)
        model = fit_model(model, X, Y, tgt_proc)
        print()

With holiday feature

In [None]:
other_feat = []
cat_feat = ['holiday']
num_feat = ['deg_K', 'absolute_humidity', 'relative_humidity'] + ['sensor_{}'.format(ii) for ii in range(1, 6)]

for wff in [True, False]:
    for estimator in [Ridge(alpha=60), LGBMRegressor(random_state=SEED) ]:
        print(estimator, "with fixed feat = ", wff)
        model, X, Y, tgt_proc = get_model(estimator, num_feat, cat_feat, other_feat, with_fixed_feat=wff)
        model = fit_model(model, X, Y, tgt_proc)
        print()

**Note**: These results alone do not show whether `holiday` is useful or not. Do your own testing.
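
One way to run such a test is k-fold cross-validation with and without the feature. The sketch below uses synthetic data in which a binary `flag` column (a stand-in for `holiday`) really does affect the target, so it only illustrates the procedure and says nothing about the competition data. The helper `cv_rmse`, the Ridge `alpha`, and the number of folds are arbitrary choices for this example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: three numeric features plus a binary flag that shifts the target
rng = np.random.default_rng(2021)
n = 500
x_num = rng.normal(size=(n, 3))
flag = rng.integers(0, 2, size=n)
y = x_num @ np.array([1.0, -0.5, 0.25]) - 2.0 * flag + rng.normal(scale=0.1, size=n)

X_without = x_num
X_with = np.column_stack([x_num, flag])

cv = KFold(n_splits=5, shuffle=True, random_state=2021)

def cv_rmse(X, y):
    # Mean RMSE over the folds (the scorer returns negated values)
    scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv,
                             scoring="neg_root_mean_squared_error")
    return -scores.mean()

rmse_without = cv_rmse(X_without, y)
rmse_with = cv_rmse(X_with, y)
print("RMSE without flag: {:.4f}, with flag: {:.4f}".format(rmse_without, rmse_with))
```

For time-ordered data, a split that respects the time order may give a more honest estimate than shuffled folds.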

<a id="submission"></a>
<h2 style="background-color:#f5deb3;font-size: 2.5rem; font-weight: 700; padding: 0.5rem 0 0.5rem 0;" align = 'center'>Submission</h2>


Let's create a submission using a simple model with the `holiday` feature. The code below submits the Ridge model; the LGBM call is left commented out.

**Note**: The last entry of our training set is the first entry of our test set. So, we will replace that first entry with our training values.
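
The overlap can be confirmed by comparing timestamps. The frames `train_toy` and `test_toy` below are made-up stand-ins; with the real files, the same comparison would be run on the `date_time` columns of `train.csv` and `test.csv`.

```python
import pandas as pd

# Made-up stand-ins for the train and test `date_time` columns
train_toy = pd.DataFrame({"date_time": pd.to_datetime([
    "2010-01-01 00:00", "2010-01-01 01:00", "2010-01-01 02:00",
])})
test_toy = pd.DataFrame({"date_time": pd.to_datetime([
    "2010-01-01 02:00", "2010-01-01 03:00",
])})

# Which test timestamps also appear in the training set?
overlap = test_toy["date_time"].isin(train_toy["date_time"])
same_boundary = train_toy["date_time"].iloc[-1] == test_toy["date_time"].iloc[0]
print(overlap.sum(), same_boundary)
```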


Public score:

1. With LGBM + holiday: 0.35572
2. With Ridge + holiday: 0.30528

[\[Back to top\]](#toc)

In [None]:
def submit(estimator, with_fixed_feat=True):
    # Make prediction and save to file
    
    other_feat = []
    cat_feat = ['holiday']
    num_feat = ['deg_K', 'absolute_humidity', 'relative_humidity'] + ['sensor_{}'.format(ii) for ii in range(1, 6)]
    
    # The time features are only used when with_fixed_feat is True
    if with_fixed_feat:
        fix_feat = ['hour', 'day', 'dayofweek', 'month']
    else:
        fix_feat = []
    model, X, Y, tgt_proc = get_model(estimator, num_feat, cat_feat, other_feat, with_fixed_feat=with_fixed_feat)
    
    model = fit_model(model, X, Y, tgt_proc, split=False)

    # Keep the same column order as in get_model
    features = num_feat + cat_feat + other_feat + fix_feat
    
    submission = pd.read_csv('/kaggle/input/tabular-playground-series-jul-2021/test.csv')
    idx = submission.date_time.copy()
    submission = feature_engineering(submission).copy()[features]
    pred = model.predict(submission)
    pred = transform_targets(pd.DataFrame(data=pred, columns=TARGETS), tgt_proc, forward=False).to_numpy()
    assert len(pred[pred<0]) == 0
    
    # Note that the last entry of our training set is the first entry of our test set
    # So, we will replace that first entry with our training values
    pred[0,:] = Y.values[-1]
    
    df = pd.DataFrame(data=pred, columns=TARGETS, index=idx).reset_index(level=0)
    df.to_csv('submission.csv', index=False)
    return pred
    
# pred = submit(LGBMRegressor(random_state=SEED))
pred = submit(Ridge(alpha=60))

**Visualizing our predictions**

The baseline here is taken from [Bojan Tunguz](#baseline).

In [None]:
ref = pd.read_csv('/kaggle/input/tps-07-21-simple-linear-baseline-by-tunguz/submission_rr_1.csv')
ref = ref.drop(columns=['date_time']).to_numpy()

fig, ax = plt.subplots(1, 3, figsize=(12, 5), dpi=100)
for i in range(3):
    ax[i].plot(pred[:,i], lw=2, c='b', label="This work")
    ax[i].plot(ref[:,i], lw=2, c='r', label="Baseline", ls='--')
    ax[i].legend()
    ax[i].set_xlabel("Index")
    ax[i].set_title(TARGETS[i])
plt.suptitle("Submission", fontsize=24, y=1)
plt.tight_layout()
plt.show()

<a id="references"></a>
<h2 style="background-color:#f5deb3;font-size: 2.5rem; font-weight: 700; padding: 0.5rem 0 0.5rem 0;" align = 'center'>References</h2>

[\[Back to top\]](#toc)

<a id="notebook_toc"></a><h6>1. Tommaso Guerrini's <a href="https://www.kaggle.com/tomwarrens/tps-july-2021-full-eda">notebook</a> is well styled and well written, and its format is worth following. Check out the notebook for more EDA.</h6>
<a id="time_feat_1"></a><h6>2. Abu Bakar's <a href="https://www.kaggle.com/c/tabular-playground-series-jul-2021/discussion/250630">discussion</a> on time feature engineering. The link to his notebook in the discussion is broken.</h6>
<a id="time_feat_2"></a><h6>3. Fellipe Gomes's <a href="https://www.kaggle.com/c/tabular-playground-series-jul-2021/discussion/250074">discussion</a> on time feature engineering. In the comment section, Bojan Tunguz suggested a useful feature: using features from the previous time step, which is common in autoregressive models.</h6>
<a id="time_feat_3"></a><h6>4. Alessandro Benetti's <a href="https://www.kaggle.com/alessandrobenetti/feature-engineering-automl-with-autogluon">notebook</a> on time feature engineering and AutoML. Impressive score, which means many of us still have a lot to learn to beat the machine.</h6>
<a id="time_feat_4"></a><h6>5. Rajat.P's <a href="https://www.kaggle.com/rajatpaliwal02/tps-july-fastai-decision-tress-random-forset">notebook</a> on using time features in decision trees and random forests. I could not find the code for <i>add_datepart</i>.</h6>
<a id="baseline"></a><h6>6. Bojan Tunguz's <a href="https://www.kaggle.com/tunguz/tps-07-21-simple-linear-baseline">notebook</a>. This is the baseline used.</h6>
