# TODO:

0. Read more notebooks and posts.
1. **Reimplement the CV step**
2. Try NNs
3. Try Time2Vec time encoding with NNs. Time2Vec transforms absolute time to sin() activations (variable size) with learnable frequencies and phase shifts. This could be a more efficient way to encode time than the hand-crafted sin() and cos() transformations. Paper: https://arxiv.org/abs/1907.05321
4. Try a hand-coded loss function in PyTorch. (I haven't done that before and am not sure it is possible.)
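
To make item 3 more concrete, here is a minimal NumPy sketch of the Time2Vec forward pass only: one linear component plus `k` sine components. The function name `time2vec` and the randomly drawn `omega` and `phi` are illustrative assumptions; in a real model these would be trainable parameters inside the network, which this sketch does not attempt.

```python
import numpy as np

def time2vec(tau, omega, phi):
    """Forward pass of a Time2Vec-style encoding for scalar times.

    tau:   shape (n,)      raw time values (e.g. days since the first date)
    omega: shape (k + 1,)  frequencies (learnable in a real model)
    phi:   shape (k + 1,)  phase shifts (learnable in a real model)
    Returns shape (n, k + 1): one linear component, then k sine components.
    """
    z = np.outer(tau, omega) + phi   # affine map of time, shape (n, k + 1)
    out = np.sin(z)                  # periodic components
    out[:, 0] = z[:, 0]              # the first component stays linear
    return out

# Hypothetical parameters, drawn at random purely for illustration.
rng = np.random.default_rng(0)
k = 4
omega = rng.normal(size=k + 1)
phi = rng.normal(size=k + 1)

tau = np.arange(10, dtype=float)
emb = time2vec(tau, omega, phi)
print(emb.shape)  # (10, 5)
```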

# Version History:

V1: [baseline notebook] date processing + catboost regressor + ts split cv. CV: 27.1

V2: 

added fourier transforms for the one-hot encoded categorical features; 

added more date features including week (of year) and quarter; 

tidied up the data-loading and feature-engineering code. CV: 24.9

V3: 

added grouping and averaging by groups model; 

plotted target & prediction discrepancies. Train score: 9.7, LB: 7.1.

V4: 

added holiday encoding; 

reduced selected features for all models. Train score: 9.5, LB: 7.1.

V5:

fixed `DateProcessor` bug of incorrectly extrapolating year column;

upgraded `DateParser` to a class;

added helper functions for selecting features and splitting data by year. LB: 7.0.

**Note: for the reason behind the poor performance of V1 and V2, see Lesson #2.**

# Lessons:

1. Boosting models like LightGBM can only predict within the range of target values seen in the training data, so they do not extrapolate when there is a strong trend. From: https://www.kaggle.com/rohanrao/a-modern-time-series-tutorial#Auto-ARIMAX
2. Redundant date features: people claim that they are sometimes useful, but in my case they dominate the three categorical features once they outnumber them, and the model can no longer see how important `store`, `country`, and `product` are. Even among the date features, many turn out to be irrelevant: only four (`date_month_sin`, `date_month_cos`, `date_dayofweek_sin`, `date_dayofweek_cos`) are essential, and two more (`holiday` and `date_year`) can help given a good model.
3. Transform targets using log() before fitting a linear model. Reasons: https://www.kaggle.com/ambrosm/tpsjan22-03-linear-model/comments
4. SMAPE penalizes underestimations more than overestimations. https://www.kaggle.com/carlmcbrideellis/tps-jan-2022-a-simple-average-model-no-ml/notebook
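
Lesson 4 can be checked with a small worked example. The sketch below uses a standalone `smape` helper (an illustrative re-statement of the metric, not the notebook's own function) and compares an under-forecast and an over-forecast of the same absolute size.

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, on a 0-200 scale."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 200.0
    return np.mean(np.abs(y_true - y_pred) / denom)

actual = np.array([100.0])
under = smape(actual, actual - 20)  # predict 80:  20 / (180 / 200)
over = smape(actual, actual + 20)   # predict 120: 20 / (220 / 200)
print(round(under, 2), round(over, 2))  # 22.22 18.18
```

The absolute error is the same in both cases, but the under-forecast has the smaller denominator and therefore the larger score.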


## Other Models and Reading Materials
1. https://www.kaggle.com/andreshg/timeseries-analysis-a-complete-guide. 
2. Prophet: https://www.kaggle.com/robikscube/time-series-forecasting-with-prophet
3. CNN-LSTM hybrid: https://www.kaggle.com/dimitreoliveira/deep-learning-for-time-series-forecasting/notebook
4. LSTM as Autoencoder and then MLP: https://www.kaggle.com/dimitreoliveira/time-series-forecasting-with-lstm-autoencoders/notebook#Regular-LSTM-model.
5. https://builtin.com/data-science/time-series-forecasting-python (TS classic models)
6. https://www.kaggle.com/jagangupta/time-series-basics-exploring-traditional-ts (TS classic models)
7. https://towardsdatascience.com/an-overview-of-time-series-forecasting-models-a2fa7a358fcb (more TS)
8. Temporal CNN for ts: https://www.kaggle.com/c/tabular-playground-series-jan-2022/discussion/298344

# Packages and helpers

In [None]:
import pandas as pd
import numpy as np
import random
from matplotlib import pyplot as plt
import seaborn as sns
import os
import math

Helper functions. Thanks to: https://www.kaggle.com/vad13irt/tps-2022-baseline/

In [None]:
def seed_everything(seed=1):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    
def make_submission(ids, predictions, path="submission.csv"):
    assert len(ids) == len(predictions), f"Lengths of `ids` ({len(ids)}) and `predictions` ({len(predictions)}) aren't the same."
    assert not predictions.isnull().any(), 'Some predictions are blank!'
    ids = ids.astype(int)
    predictions = predictions.astype(float)
    df = pd.DataFrame({
        "row_id": ids,
        "num_sold": predictions,
    })
    
    df.to_csv(path, index=False)
    print('submission csv outputted!')
    return 

SEED = 24
seed_everything(SEED)

# Data loader

In [None]:
def DataLoader(path, train=True):
    df = pd.read_csv(path, parse_dates=['date'])
    ids = df.pop("row_id")
    if train:
        targets = df.pop('num_sold')
        return df, ids, targets
    return df, ids

train_path = '../input/tabular-playground-series-jan-2022/train.csv'
test_path = "../input/tabular-playground-series-jan-2022/test.csv"
holiday_path = '../input/holidays-finland-norway-sweden-20152019/Holidays_Finland_Norway_Sweden_2015-2019.csv'

We can see that there are no missing values in the original data.

In [None]:
train_df , _, targets = DataLoader(train_path, train=True)
print('Missing values in data: \n',train_df.isnull().any())
print('\nMissing values in targets: ',targets.isnull().any())

In [None]:
test_df , _= DataLoader(test_path, train=False)
print('Missing values in data: \n',test_df.isnull().any())

# Feature engineering (experimental)

**Note: not all features from this section are useful.**

## (1) Transform year, month, and day 


### Method 1: Trigonometric transforms
Thanks to: https://www.kaggle.com/c/tabular-playground-series-jan-2022/discussion/298202
- To get info on the year, just min-max normalize the year number.
- To extract any cyclical feature we desire, multiply the feature by `2*np.pi` and divide by the period of that feature (e.g. 7 for day of week, 12 for month), then apply `np.sin()` and `np.cos()`. Thus, one cyclical feature ends up having two columns.
- In total, the `DateProcessor` below creates 14 new columns by default, and effectively excludes the original `'date'` column from the model.
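
A minimal sketch of the cyclical encoding, comparing two choices of divisor for a zero-based feature such as day of week. The helper name `cyclical` is an assumption made for this example only.

```python
import numpy as np
import pandas as pd

def cyclical(values, period):
    """Map a periodic integer feature onto the unit circle."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

# One full week, Monday through Sunday.
dates = pd.Series(pd.date_range("2015-01-05", periods=7, freq="D"))
dow = dates.dt.dayofweek  # 0 = Monday, ..., 6 = Sunday

sin7, cos7 = cyclical(dow, period=7)
sin6, cos6 = cyclical(dow, period=6)

# Dividing by 6 (the maximum value) puts Sunday on the same point as Monday;
# dividing by 7 (the period) keeps all seven days distinct.
print(np.allclose([sin6[0], cos6[0]], [sin6[6], cos6[6]]))  # True
print(np.allclose([sin7[0], cos7[0]], [sin7[6], cos7[6]]))  # False
```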

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class DateProcessor(BaseEstimator, TransformerMixin):
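    def __init__(self, date_format='%d/%m/%Y', hours_secs=False):
        # Store constructor arguments under their own names so that
        # BaseEstimator.get_params() (used by repr and clone) can find them
        self.date_format = date_format
        self.hours_secs = hours_secs
        self.format = date_format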
        self.earliest_yr = None
        self.latest_yr = None
        self.columns = None
        self.time_transformations = [
            ('day_sin', lambda x: np.sin(2*np.pi*x.dt.day/31)),
            ('day_cos', lambda x: np.cos(2*np.pi*x.dt.day/31)),
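            # dayofweek runs 0..6, so the period is 7 (dividing by 6 would
            # map Monday and Sunday to the same point)
            ('dayofweek_sin', 
                lambda x: np.sin(2*np.pi*x.dt.dayofweek/7)),
            ('dayofweek_cos', 
                lambda x: np.cos(2*np.pi*x.dt.dayofweek/7)),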
            ('dayofyear_sin', 
                lambda x: np.sin(2*np.pi*x.dt.dayofyear/366)),
            ('dayofyear_cos', 
                lambda x: np.cos(2*np.pi*x.dt.dayofyear/366)),
            ('dayofyear', lambda x: x.dt.dayofyear),
            ('week_sin', 
                lambda x: np.sin(2*np.pi*x.dt.isocalendar().week.astype(int)/52)),
            ('week_cos', 
                lambda x: np.cos(2*np.pi*x.dt.isocalendar().week.astype(int)/52)),
            ('month_sin', 
                lambda x: np.sin(2*np.pi*x.dt.month/12)),
            ('month_cos', 
                lambda x: np.cos(2*np.pi*x.dt.month/12)),
            ('quarter_sin', 
                lambda x: np.sin(2*np.pi*x.dt.quarter/4)),
            ('quarter_cos',
                lambda x: np.cos(2*np.pi*x.dt.quarter/4)),
            #('year', 
            #    lambda x: (x.dt.year - x.dt.year.min()
            #              ) / (x.dt.year.max() - x.dt.year.min()))
            ('year', 
                lambda x: (x.dt.year - self.earliest_yr)
                  / (self.latest_yr - self.earliest_yr))
        ]
        if hours_secs:
            # hour runs 0..23 and minute runs 0..59, so the periods are 24 and 60
            self.time_transformations = [
                ('hour_sin', 
                lambda x: np.sin(2*np.pi*x.dt.hour/24)),
                ('hour_cos', 
                lambda x: np.cos(2*np.pi*x.dt.hour/24)),
                ('minute_sin', 
                lambda x: np.sin(2*np.pi*x.dt.minute/60)),
                ('minute_cos', 
                lambda x: np.cos(2*np.pi*x.dt.minute/60))
            ] + self.time_transformations

    def fit(self, X, y=None, **fit_params):
        # Record the year range of the training data here, so that calling
        # fit() on its own works and transform() scales other datasets
        # (e.g. the test set) with the training range
        time_column = pd.to_datetime(X['date'], format=self.format)
        self.earliest_yr = time_column.dt.year.min()
        self.latest_yr = time_column.dt.year.max()
        self.columns = self.transform(X.iloc[0:1,:]).columns
        return self

    def transform(self, X, y=None, **fit_params):
        transformed = list()
        for col in X.columns:
            if col == 'date':
                time_column = pd.to_datetime(X[col],
                                  format=self.format)
                for label, func in self.time_transformations:
                    transformed.append(func(time_column))
                    # each result is still named 'date'; append the label
                    transformed[-1].name += '_' + label
        transformed = pd.concat(transformed, axis=1)
        return transformed

    def fit_transform(self, X, y=None, **fit_params):
        return self.fit(X, y, **fit_params).transform(X)

### Method 2: Extract the raw year, month, and day numbers, and normalize the years

In [None]:
#if DATE_TFM_METHOD != 'trigonometric':

class DateParser():
    def __init__(self):
        self.earliest_yr = None
        self.latest_yr = None
    
    def parse(self, df, train_set=True):
        df["date_day"] = df["date"].dt.day
        df["date_week"] = df['date'].dt.isocalendar().week
        df["date_month"] = df["date"].dt.month
        df["date_quarter"] = df["date"].dt.quarter
        df["date_year"] = df["date"].dt.year
        df['date_dayofweek'] = df['date'].dt.dayofweek
        df['date_dayofyear'] = df['date'].dt.dayofyear
        if train_set:
            self.earliest_yr = df.date_year.min()
            self.latest_yr = df.date_year.max()
        else: 
            assert self.earliest_yr is not None, 'Please provide earliest year!'
            assert self.latest_yr is not None, 'Please provide latest year!'
        df.date_year = (df.date_year - self.earliest_yr)/(self.latest_yr - self.earliest_yr)
        return df

## (2) One-hot encode the categorical features
One-hot encoding the categorical features is optional for boosting models, which can often handle categorical features themselves, but other models may require it.

**This step is a prerequisite for the next step, i.e. the Fourier transform.**

Note: one-hot encoding sometimes does improve the performance of boosting models (even without Fourier transforming the one-hot encoded categorical features).

In [None]:
# To one-hot encode the categorical features, i.e. countries, products, and stores
# no need to encode the last categories
def OneHot(df):
    new_df = pd.DataFrame({})
    for country in ['Finland', 'Norway']:
        new_df[country] = df.country == country
        
    new_df['KaggleRama'] = df.store == 'KaggleRama'
    
    for product in ['Kaggle Mug', 'Kaggle Sticker']:
        new_df[product] = df['product'] == product
    
    return new_df

## (3) Fourier transform for the categorical features
Thanks to: https://www.kaggle.com/ambrosm/tpsjan22-03-linear-model/notebook

In [None]:
# Requires day of year column
# Prerequisite: OneHot() the training dataframe

# The three products have different seasonal patterns
def FourierTfm(df, mult=50):
    for k in range(1, mult):
        df[f'sin{k}'] = np.sin(df.date_dayofyear / 365 * 2 * math.pi * k)
        df[f'cos{k}'] = np.cos(df.date_dayofyear / 365 * 2 * math.pi * k)
        df[f'mug_sin{k}'] = df[f'sin{k}'] * df['Kaggle Mug']
        df[f'mug_cos{k}'] = df[f'cos{k}'] * df['Kaggle Mug']
        df[f'sticker_sin{k}'] = df[f'sin{k}'] * df['Kaggle Sticker']
        df[f'sticker_cos{k}'] = df[f'cos{k}'] * df['Kaggle Sticker']
    return df

## (4) Encode holiday info

Source dataset: https://www.kaggle.com/drcapa/holidays-finland-norway-sweden-20152019

The goal here is to add one column to `train_df` that indicates whether the date of the current entry is a holiday in the corresponding country.
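
As a compact illustration of that goal, the sketch below builds the same kind of flag with a single left merge on `(country, date)`. The two small frames are toy stand-ins, not rows from the real datasets; only the column names `Date` and `Country` follow the holiday file as it is read in this notebook.

```python
import pandas as pd

# Toy stand-ins for the sales rows and the holiday table.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-01", "2015-01-01", "2015-05-17", "2015-05-17"]),
    "country": ["Finland", "Norway", "Finland", "Norway"],
})
holidays = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-01", "2015-01-01", "2015-05-17"]),
    "Country": ["Finland", "Norway", "Norway"],
})

# One row per (country, date) holiday, carrying a constant flag.
holiday_flags = (
    holidays.rename(columns={"Date": "date", "Country": "country"})
    [["country", "date"]]
    .drop_duplicates()
    .assign(holiday=1.0)
)

# A left merge keeps every sales row in order; non-holidays come back as NaN.
flagged = sales.merge(holiday_flags, on=["country", "date"], how="left")
flagged["holiday"] = flagged["holiday"].fillna(0.0)
print(flagged["holiday"].tolist())  # [1.0, 1.0, 0.0, 1.0]
```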

In [None]:
def GetHoliday(holiday_path, df):
    """
    Get a boolean feature of whether the current row is a holiday sale
    """
    
    holiday = pd.read_csv(holiday_path, parse_dates=['Date'])
    fin_holiday = holiday.loc[holiday.Country == 'Finland']
    swe_holiday = holiday.loc[holiday.Country == 'Sweden']
    nor_holiday = holiday.loc[holiday.Country == 'Norway']
    df['fin holiday'] = df.date.isin(fin_holiday.Date).astype(float)
    df['swe holiday'] = df.date.isin(swe_holiday.Date).astype(float)
    df['nor holiday'] = df.date.isin(nor_holiday.Date).astype(float)
    
    df['holiday'] = np.zeros(df.shape[0]).astype(float)
    df.loc[df.country == 'Finland', 'holiday'] = df.loc[df.country == 'Finland', 'fin holiday']
    df.loc[df.country == 'Sweden', 'holiday'] = df.loc[df.country == 'Sweden', 'swe holiday']
    df.loc[df.country == 'Norway', 'holiday'] = df.loc[df.country == 'Norway', 'nor holiday']
    df.drop(['fin holiday', 'swe holiday', 'nor holiday'], axis=1, inplace=True)
    return df

## (5) Feature engineering master function

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
def Engineer(df, train_set=True, fourier_mult=0, trigonometric_dates=True, date_tfm=None):
    """
    Return a new dataframe with the engineered features
    """
    
    new_df = pd.DataFrame({})
    
    # Keep original columns in df
    kept_cols = df[['country', 'store', 'product', 'date']]
    new_df[kept_cols.columns] = kept_cols
    new_df["date"] = pd.to_datetime(new_df["date"])
    
    # Encode holiday info
    df = GetHoliday(holiday_path, df)
    new_df['holiday'] = df['holiday']
    
    # Transforming the dates (Method 1 or 2)
    if trigonometric_dates:
        if train_set:
            date_tfm = DateProcessor(date_format='%Y-%m-%d', hours_secs=False)
            new_dates = date_tfm.fit_transform(pd.DataFrame(df['date']))
        else: 
            assert date_tfm is not None, 'Please provide the DateProcessor object!'
            new_dates = date_tfm.transform(pd.DataFrame(df['date']))
        new_df[new_dates.columns] = new_dates
    # need to fix dateparser (upgrade to class, stop it from overwriting new_df)
    else: 
        if train_set: 
            date_tfm = DateParser()
        assert date_tfm is not None, 'Please provide the DateParser object!'
        new_df = date_tfm.parse(new_df, train_set=train_set)

    # One-hot encoding of cat features (no need to encode the last categories)
    tmp_df = OneHot(df)
    new_df[tmp_df.columns] = tmp_df
        
    # Seasonal variations (Fourier series)
    if fourier_mult:
        new_df = FourierTfm(new_df, mult=fourier_mult)
    
    # Return (if this is for training set, return both the engineered dataset and the date transformer)
    return new_df, date_tfm


I add an option for `load_transform_data` to apply `np.log()` to the target values.
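
One thing to keep in mind when this option is on: predictions come back in log space, so any rounding or multiplicative correction has to be applied after `np.exp()`, not before. A minimal sketch, with hypothetical helper names and a made-up prediction:

```python
import numpy as np

def finalize_after_exp(log_pred, correction=1.0):
    """Invert the log transform first, then scale and round in sales units."""
    return np.ceil(np.exp(log_pred) * correction)

def finalize_before_exp(log_pred, correction=1.0):
    """Scaling and rounding in log space, shown only for comparison."""
    return np.exp(np.ceil(log_pred * correction))

# Hypothetical model output: the log of 100.4 units sold.
log_pred = np.log(np.array([100.4]))

right = finalize_after_exp(log_pred)
wrong = finalize_before_exp(log_pred)
print(right[0])      # 101.0
print(wrong[0] > 148)  # True: ceil(4.609...) = 5, and exp(5) is about 148.4
```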

In [None]:
def load_transform_data(path=None, train=True, date_tfm = None, hyperparams=None, log_targets=False):
    """
    Load data and apply feature engineering
    """
    
    if train:
        orig_df, _, targets = DataLoader(path, train=True)
    else:
        orig_df, ids = DataLoader(path, train=False)
        
    FOURIER_MULT = hyperparams['FOURIER_MULT']
    TRIGON = hyperparams['TRIGON']
    
    df, date_tfm = Engineer(orig_df,
                           train_set=train,
                           fourier_mult=FOURIER_MULT,
                           trigonometric_dates=TRIGON,
                           date_tfm = date_tfm)
    if train:
        targets = targets.astype(np.float32)
        if log_targets:
            df["num_sold"] = np.log(targets)
        else:
            df["num_sold"] = targets
    else: 
        df['row_id'] = ids
        
    df['day_of_the_week'] = df['date'].dt.day_name()
    df['month'] = df['date'].dt.month_name()
    
        
    for col in df.columns:
        if df[col].dtype != 'O' and df[col].dtype != '<M8[ns]' and col != 'row_id':
            df[col] = df[col].astype(np.float32)
    
    return df, date_tfm

In [None]:
# Feature engineering hyperparams
engn_params = {'FOURIER_MULT': 0, 'TRIGON': True}

# Load train data & apply feature engineering
train_df, date_tfm = load_transform_data(path=train_path, train=True, hyperparams=engn_params)

In [None]:
test_df, _ = load_transform_data(path=test_path, train=False, date_tfm = date_tfm, hyperparams=engn_params)

# Modeling & training (experimental)

(2nd day of competition) I discovered that keeping all of the engineered features is a mistake. In particular, when I let the number of date features explode, none of the models behaved well. Only some of the features I produce turn out to be useful, as the models below show.

## (1) Grouping and averaging by features
Thanks to: https://www.kaggle.com/carlmcbrideellis/tps-jan-2022-a-simple-average-model-no-ml/notebook.

The features used for grouping are: country, store, product, month, and day of week. December entries are instead grouped by country, store, product, and day of year, because daily variation is bigger in December, and we'd like to capture that.
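
The idea can be sketched on a toy frame: compute the mean target per group on the training rows, then look the means up for the validation rows. The data and the reduced key list below are made up for illustration; the actual function groups on more columns.

```python
import pandas as pd

train = pd.DataFrame({
    "store": ["A", "A", "A", "B", "B", "B"],
    "day_of_the_week": ["Monday", "Monday", "Tuesday", "Monday", "Monday", "Tuesday"],
    "num_sold": [10.0, 14.0, 20.0, 30.0, 34.0, 40.0],
})
valid = pd.DataFrame({
    "store": ["A", "B", "B"],
    "day_of_the_week": ["Monday", "Tuesday", "Monday"],
})

keys = ["store", "day_of_the_week"]

# Mean of the target within each group of the training data.
group_means = (
    train.groupby(keys, as_index=False)["num_sold"].mean()
    .rename(columns={"num_sold": "pred"})
)

# The prediction for a row is simply the mean of its group.
valid = valid.merge(group_means, on=keys, how="left")
print(valid["pred"].tolist())  # [12.0, 40.0, 32.0]
```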


In [None]:
def SMAPE(y_true, y_pred):
    """
    Loss function for the competition
    """
    
    # use absolute values of both terms, as in the SMAPE definition
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 200.0
    diff = np.abs(y_true - y_pred) / denominator
    diff[denominator == 0] = 0.0
    return np.mean(diff)

def SplitByYear(df, train_start_yr, train_end_yr, test_df=pd.DataFrame()):
    """
    Split the original dataset into two subsets, one for training, one for test. 
    Make sure the date column of df is parsed when doing pd.read_csv(df)
    """
    
    # test year is one year after the end training year
    test_yr = train_end_yr + 1
    if test_yr < 2019:
        test_df = df[df.date.between(f'{test_yr}-01-01', f'{test_yr}-12-31')].copy()
    train_df = df[df.date.between(f'{train_start_yr}-01-01', f'{train_end_yr}-12-31')].copy()
    return train_df, test_df

def TrainAndValid_no_model(train_path, engn_params, train_start_yr, train_end_yr, test_path=None, log_targets=False):
    """
    Use averaging method (i.e. Model 1) to get predictions.
    """
    
    orig_train_df, date_tfm = load_transform_data(path=train_path, train=True, hyperparams=engn_params, log_targets=log_targets)
    train_df, valid_df = SplitByYear(orig_train_df, train_start_yr, train_end_yr)
    train_means = train_df.groupby(['country','store','product','month','day_of_the_week'])['num_sold'].mean().to_dict()
    train_df['pred'] = train_df.set_index(['country','store','product','month','day_of_the_week']).index.map(train_means.get)
    
    train_df_dec = train_df.query("month == 'December'").copy()
    train_means_dec = train_df_dec.groupby(['country','store','product','date_dayofyear'])['num_sold'].mean().to_dict()
    train_df_dec['pred'] = train_df_dec.set_index(['country','store','product','date_dayofyear']).index.map(train_means_dec.get)
    train_df.update(train_df_dec)
    train_df['pred'] = np.ceil(train_df['pred'])
    print('train score: ', SMAPE(train_df["num_sold"], train_df["pred"]))
    
    if len(valid_df) > 0:
        valid_df['pred'] = valid_df.set_index(['country','store','product', 'month', 'day_of_the_week']).index.map(train_means.get)
        valid_df_dec = valid_df.query("month == 'December'").copy()
        valid_df_dec['pred'] = valid_df_dec.set_index(['country','store','product','date_dayofyear']).index.map(train_means_dec.get)
        valid_df.update(valid_df_dec)
        valid_df['pred'] = np.ceil(valid_df['pred'])
        print('valid score: ', SMAPE(valid_df['num_sold'], valid_df['pred']))
    if test_path:
        test_df, date_tfm = load_transform_data(path=test_path, train=False, hyperparams=engn_params, date_tfm=date_tfm, log_targets=log_targets)
        test_df['pred'] = test_df.set_index(['country','store','product', 'month', 'day_of_the_week']).index.map(train_means.get)
        test_df_dec = test_df.query("month == 'December'").copy()
        test_df_dec['pred'] = test_df_dec.set_index(['country','store','product','date_dayofyear']).index.map(train_means_dec.get)
        test_df.update(test_df_dec)
        test_df['pred'] = np.ceil(test_df['pred'])
    else:
        test_df = None
    return train_df, valid_df, test_df

def PlotPredVsTarget(df, holiday=0, start_year = 2018, end_year = 2018, country='Finland', store='KaggleRama'):
    """
    Produce a line plot of target values and predicted values, in a given country at a given store. 
    X-axis is time, y-axis is sales volume, colors represent product types.
    """
    
    plt.style.use('fivethirtyeight')
    plt.rcParams.update({'font.size': 16})
    one_country_and_store = df.query(f"country == '{country}' & store == '{store}'").copy()
    one_country_and_store = one_country_and_store[one_country_and_store.date.between(f'{start_year}-01-01', f'{end_year}-12-31')].copy()

    if holiday:
        fig, axs = plt.subplots(1+holiday, figsize=(30*(end_year - start_year + 1), 26))
        sns.lineplot(ax=axs[0], data=one_country_and_store, x="date", y="num_sold", hue="product", linewidth = 2, linestyle='--')
        sns.lineplot(ax=axs[0], data=one_country_and_store, x="date", y="pred", hue="product", linewidth = 3.5)
        sns.scatterplot(ax=axs[1], data=one_country_and_store, x='date', y='holiday', s=100)
    else:
        fig, axs = plt.subplots(1+holiday, figsize=(30*(end_year - start_year + 1), 13))
        sns.lineplot(data=one_country_and_store, x="date", y="num_sold", hue="product", linewidth = 2, linestyle='--')
        sns.lineplot(data=one_country_and_store, x="date", y="pred", hue="product", linewidth = 3.5)
    plt.legend([],[], frameon=False)       
    return 

In [None]:
engn_params = {'FOURIER_MULT': 0, 'TRIGON': True}
train_df, valid_df, test_df = TrainAndValid_no_model(train_path, engn_params, 2015, 2017, test_path, log_targets=False)
if len(valid_df) > 0:
    PlotPredVsTarget(valid_df, 1, 2018, 2018, 'Finland', 'KaggleRama')

make_submission(test_df['row_id'], test_df['pred'], path="sub_averaging.csv")

As we can see, this simple averaging method tracks most of the targets, except for days in December, and perhaps holidays too. The next step is to get the holidays of the three countries from 2015 to 2019.

## (2) CatBoost Regressor

Based on the PyCaret experiments in https://www.kaggle.com/akmalmir/pycaret-for-starters-tps-jan-2022, I use the CatBoost regressor as the baseline model.

The trigonometric transformations of the selected date features (month & day of week) have little effect on CatBoost performance.

Finding: customers in Finland show higher enthusiasm all week.

In [None]:
from catboost import CatBoostRegressor
from sklearn.linear_model import HuberRegressor
from dateutil.easter import easter
from datetime import timedelta

def GetEasterDates(years, forward=6):
    Easter_dates = []
    for year in years:
        Easter_dates.append(easter(year))
        # also calculate the dates of the `forward` days following Easter
        for day in range(1, forward+1):
            Easter_dates.append(easter(year)+timedelta(days=day))        
    return Easter_dates

def GetCatBoost(params):
    model = CatBoostRegressor(**params)
    return model

def FeatureSelect(df, engn_params, december=False):
    """
    Selects particular features that are useful before model training
    """
    
    if engn_params['TRIGON']:
        feat_list = ['Finland', 'Norway', 'KaggleRama', 'Kaggle Mug', 'Kaggle Sticker', 'date_month_sin', 'date_month_cos', 'date_dayofweek_sin', 'date_dayofweek_cos', 'holiday', 'date_year']
        if december:
            feat_list.extend(['date_day_sin', 'date_day_cos'])
    else:
        feat_list = ['Finland', 'Norway', 'KaggleRama', 'Kaggle Mug', 'Kaggle Sticker', 'holiday', 'date_year', 'date_dayofweek', 'date_month']
        if december:
            feat_list.append('date_day')

    return df[feat_list]

def TrainAndValid(train_path, model_name, model_params, engn_params, train_start_yr, train_end_yr, log_targets=False):
    """
    Load data, apply feature engineering, do train and validation split (by year) if possible, select good features, train given model, print scores, and return trained model and dataframes
    """
    print(f'Train dates range from year {train_start_yr} to year {train_end_yr}.')
    
    CORRECT = 1.05
    
    if model_name == 'catboost':
        model1 = GetCatBoost(model_params)
        model2 = GetCatBoost(model_params)
        model3 = GetCatBoost(model_params)
    elif model_name == 'huber':
        model1 = HuberRegressor(**model_params)
        model2 = HuberRegressor(**model_params)
        model3 = HuberRegressor(**model_params)

    orig_train_df, date_tfm = load_transform_data(path=train_path, train=True, hyperparams=engn_params, log_targets=log_targets)
    train_df, valid_df = SplitByYear(orig_train_df, train_start_yr, train_end_yr)
    train_df_dec = train_df.query("month == 'December'")
    easter_dates_tr = GetEasterDates(range(train_start_yr, train_end_yr+1))
    train_df_easter = train_df.query("date == @easter_dates_tr")
    
    if len(valid_df) > 0:
        valid_df_dec = valid_df.query("month == 'December'")
        easter_dates_va = GetEasterDates(range(train_end_yr+1,train_end_yr+2))
        valid_df_easter = valid_df.query("date == @easter_dates_va")
        valid_data = FeatureSelect(valid_df, engn_params)
        valid_targets = valid_df['num_sold']
        
    train_data = FeatureSelect(train_df, engn_params)
    train_targets = train_df['num_sold']

    model1.fit(train_data, train_targets)
    train_df['pred'] = np.ceil(model1.predict(train_data)*CORRECT)
    print('train score (w/o december feature): ', SMAPE(train_targets, train_df['pred']))
    
    if len(valid_df) > 0:
        valid_df['pred'] = np.ceil(model1.predict(valid_data)*CORRECT)
        print('valid score (w/o december feature): ', SMAPE(valid_targets, valid_df['pred']))
    
    train_data = FeatureSelect(train_df_dec, engn_params, december=True)
    train_targets = train_df_dec['num_sold']

    model2.fit(train_data, train_targets)
    train_df_dec['pred'] = np.ceil(model2.predict(train_data)*CORRECT)
    train_df.update(train_df_dec)
    print('train score (with december feature): ', SMAPE(train_targets, train_df['pred']))
    
    if len(valid_df) > 0:
        valid_data = FeatureSelect(valid_df_dec, engn_params, december=True)
        valid_targets = valid_df_dec['num_sold']
        valid_df_dec['pred'] = np.ceil(model2.predict(valid_data)*CORRECT)
        valid_df.update(valid_df_dec)
        print('valid score (with december feature): ', SMAPE(valid_targets, valid_df['pred']))
        
    train_data = FeatureSelect(train_df_easter, engn_params, december=True)
    train_targets = train_df_easter['num_sold']

    model3.fit(train_data, train_targets)
    train_df_easter['pred'] = np.ceil(model3.predict(train_data)*CORRECT)
    train_df.update(train_df_easter)
    print('train score (with december + easter feature): ', SMAPE(train_targets, train_df['pred']))
    
    if len(valid_df) > 0:
        valid_data = FeatureSelect(valid_df_easter, engn_params, december=True)
        valid_targets = valid_df_easter['num_sold']
        valid_df_easter['pred'] = np.ceil(model3.predict(valid_data)*CORRECT)
        valid_df.update(valid_df_easter)
        print('valid score (with december + easter feature): ', SMAPE(valid_targets, valid_df['pred']))
        
    if log_targets:
        train_df.num_sold = np.exp(train_df.num_sold)
        train_df.pred = np.exp(train_df.pred)
        # valid_df is an empty frame when no validation year is held out
        if len(valid_df) > 0:
            valid_df.num_sold = np.exp(valid_df.num_sold)
            valid_df.pred = np.exp(valid_df.pred)
        print('transformed back to original pred values!')
    
    return model1, model2, model3, train_df, valid_df, date_tfm

def Predict(test_path, model1, model2, model3, engn_params, date_tfm, log_targets=False):
    """
    Get prediction of test data from a trained model
    """
    
    CORRECT = 1.05
    
    test_df, date_tfm = load_transform_data(path=test_path, train=False, hyperparams=engn_params, date_tfm=date_tfm, log_targets=log_targets)
    test_data = FeatureSelect(test_df, engn_params)
    test_df['pred'] = np.ceil(model1.predict(test_data)*CORRECT)
    
    test_df_dec = test_df.query("month == 'December'")
    test_data = FeatureSelect(test_df_dec, engn_params, december=True)
    test_df_dec['pred'] = np.ceil(model2.predict(test_data)*CORRECT)
    test_df.update(test_df_dec)
    
    easter_dates = GetEasterDates(range(2019, 2020))
    test_df_easter = test_df.query("date == @easter_dates")
    test_data = FeatureSelect(test_df_easter, engn_params, december=True)
    test_df_easter['pred'] = np.ceil(model3.predict(test_data)*CORRECT)
    test_df.update(test_df_easter)
    return test_df


In [None]:
model_params= {'verbose': 0, 
               'one_hot_max_size': 2,
               #'boosting_type': 'Ordered',
               'random_seed': 24,
               'learning_rate': 0.3, 
               'iterations': 4000,
               'depth': 7}
               #'l2_leaf_reg': 7, 
               #'border_count': 64}
#model_params= {'verbose': 0}
engn_params = {'FOURIER_MULT': 0, 'TRIGON': True}

model1, model2, model3, train_df, valid_df, date_tfm = TrainAndValid(train_path, 
                                         'catboost',
                                         model_params, 
                                         engn_params, 
                                         train_start_yr = 2015, 
                                         train_end_yr = 2018,
                                         log_targets=False)

if len(valid_df) > 0:
    PlotPredVsTarget(valid_df, 1, 2018, 2018, 'Finland', 'KaggleRama')
test_df = Predict(test_path, model1, model2, model3, engn_params, date_tfm, log_targets=False)
make_submission(test_df['row_id'], test_df['pred'], 'submission_catboost.csv')

Adding the `'date_year'` column to the features used in Model (1) gives a marginal improvement in performance.

Adding `'holiday'` can make the model fit better on the holiday dates!

The CatBoost model also fails to capture the huge daily variation in the 2nd half of December. Instead, it inflates the predictions for the 1st half of December while keeping the shape of the curve unchanged.

Sales tend to be volatile around Easter until the end of June too.

# Inference with CV

Despite the section title, the implementation of cross-validation is still in progress.

**With the extrapolation bug in the `DateProcessor` class fixed,** we now apply the same feature engineering to the test data. For a demo of the bug, see this same section in versions before #19.
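
Since the CV step is not implemented yet, here is one possible shape for it: an expanding-window split by calendar year, where each fold validates on one year and trains on all earlier years. This is only a sketch under that assumption; the generator name `expanding_year_splits` is hypothetical and is not part of the notebook.

```python
import pandas as pd

def expanding_year_splits(dates, first_valid_year, last_valid_year):
    """Yield (year, train_mask, valid_mask), training on all earlier years."""
    years = pd.to_datetime(dates).dt.year
    for year in range(first_valid_year, last_valid_year + 1):
        train_mask = years < year
        valid_mask = years == year
        # Skip folds that would have no training or no validation rows.
        if train_mask.any() and valid_mask.any():
            yield year, train_mask, valid_mask

# Daily dates covering the training period of the competition data.
dates = pd.Series(pd.date_range("2015-01-01", "2018-12-31", freq="D"))
folds = list(expanding_year_splits(dates, 2016, 2018))
for year, train_mask, valid_mask in folds:
    print(year, int(train_mask.sum()), int(valid_mask.sum()))
```

Each mask can be used to index an engineered dataframe, fit a model on the training rows, and average the per-fold SMAPE.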