#### What are you trying to do in this notebook?
- Importing Libraries
- Loading the data
- Preprocessing
- Hyperparameter Optimization
- Model Training

#### Why are you trying it?
When reading in our files, we can pass the 'date' column in a list in the parse_dates argument. This transforms the column's datatype from str to datetime64 and allows us to extract more information from the dates later.
Instead of using a single 'date' feature, we will split it up into three separate features: 'year', 'month' and 'day'. We can also derive some useful information from the dates, like whether it falls on a weekend, which quarter it's in, and so on. Lastly, we will add a time-step feature that counts the days that have passed since the first date in the dataset. Thank you to @siukeitin for his help with the last one.
We will be using Optuna to find optimal hyperparameter values and add regularization to combat overfitting.
As we are dealing with a time series, we can't use regular KFold cross-validation. If we did, some folds would train the model on future data and predict past data, resulting in data leakage. Instead, we need to make sure predictions are made only on folds that come after the folds used for training. Luckily, scikit-learn offers a time series cross-validator called TimeSeriesSplit.
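To see how TimeSeriesSplit keeps the validation rows after the training rows, here is a small sketch on a toy array. The ten-row array is an assumption standing in for a dataset that is already sorted by date. TimeSeriesSplit splits by row position, so the ordering matters.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy stand-in for rows sorted chronologically (assumption: 10 rows)
X = np.arange(10).reshape(-1, 1)

tss = TimeSeriesSplit(n_splits=4)
splits = list(tss.split(X))

for fold, (i_train, i_valid) in enumerate(splits):
    # Each training window grows, and validation rows always come later
    print(f'fold {fold}: train={i_train.tolist()} valid={i_valid.tolist()}')
```

In every fold the largest training index is smaller than the smallest validation index, which is exactly the property regular KFold does not guarantee.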

In [None]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("twitter_token")

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import OrdinalEncoder
from catboost import CatBoostRegressor
import optuna

In [None]:
df_train = pd.read_csv('../input/tabular-playground-series-jan-2022/train.csv', parse_dates=['date'])

In [None]:
df_test = pd.read_csv('../input/tabular-playground-series-jan-2022/test.csv', parse_dates=['date'])

In [None]:
sample_submission = pd.read_csv('../input/tabular-playground-series-jan-2022/sample_submission.csv')

In [None]:
df_train['year'] = df_train['date'].dt.year
df_train['quarter'] = df_train['date'].dt.quarter
df_train['month'] = df_train['date'].dt.month
df_train['week'] = df_train['date'].dt.isocalendar().week.astype(int)
df_train['day'] = df_train['date'].dt.day
df_train['dayofyear'] = df_train['date'].dt.dayofyear
df_train['daysinmonth'] = df_train['date'].dt.days_in_month
df_train['dayofweek'] = df_train['date'].dt.dayofweek
df_train['weekend'] = ((df_train['date'].dt.dayofweek) // 5 == 1).astype(int)

In [None]:
df_test['year'] = df_test['date'].dt.year
df_test['quarter'] = df_test['date'].dt.quarter
df_test['month'] = df_test['date'].dt.month
df_test['week'] = df_test['date'].dt.isocalendar().week.astype(int)
df_test['day'] = df_test['date'].dt.day
df_test['dayofyear'] = df_test['date'].dt.dayofyear
df_test['daysinmonth'] = df_test['date'].dt.days_in_month
df_test['dayofweek'] = df_test['date'].dt.dayofweek
df_test['weekend'] = ((df_test['date'].dt.dayofweek) // 5 == 1).astype(int)

In [None]:
t0 = np.datetime64('2015-01-01')
# Days elapsed since t0 (np.int was removed from NumPy, so use .dt.days instead)
df_train['time_step'] = (df_train['date'] - t0).dt.days
df_test['time_step'] = (df_test['date'] - t0).dt.days

features = [c for c in df_test.columns if c not in ('row_id', 'date')]
cat_features = ['country', 'store', 'product']

ordinal_encoder = OrdinalEncoder()
df_train[cat_features] = ordinal_encoder.fit_transform(df_train[cat_features])
# Reuse the encoder fitted on the training data so category codes match
df_test[cat_features] = ordinal_encoder.transform(df_test[cat_features])

In [None]:
tss = TimeSeriesSplit(n_splits=4)

m = 1 # change this if you want to train different models
seeds = 5 # set the number of seeds you want to average

seed_valid_preds = {}
seed_test_preds = []
seed_scores = []

for s in range(seeds):
    fold_valid_preds = {}
    fold_test_preds = []
    fold_scores = []
    seed_valid_ids = []

    for fold, (i_train, i_test) in enumerate(tss.split(df_train)):
        X_train = df_train.iloc[i_train]
        y_train = df_train['num_sold'].iloc[i_train]

        X_test = df_test[features]  # same feature columns the model is trained on

        X_valid = df_train.iloc[i_test]
        y_valid = df_train['num_sold'].iloc[i_test]

        fold_valid_ids = X_valid.row_id.values.tolist()
        seed_valid_ids += fold_valid_ids

        X_train = X_train[features]
        X_valid = X_valid[features]
        
        params = {'depth': 5,
                  'learning_rate': 0.02115836775592321,
                  'l2_leaf_reg': 1.6690467062981804,
                  'random_strength': 1.9718499661794475,
                  'min_data_in_leaf': 10}
                  
        model = CatBoostRegressor(**params,
                                  iterations=10000,
                                  bootstrap_type='Bayesian',
                                  boosting_type='Plain',
                                  loss_function='MAE',
                                  eval_metric='SMAPE',
                                  random_seed=s)

        model.fit(X_train,
                  y_train,
                  early_stopping_rounds=200,
                  eval_set=[(X_valid, y_valid)],
                  verbose=0)

        fold_valid_pred = model.predict(X_valid)
        fold_test_pred = model.predict(X_test)

        fold_valid_preds.update(dict(zip(fold_valid_ids, fold_valid_pred)))
        fold_test_preds.append(fold_test_pred)

        fold_score = np.mean(np.abs(fold_valid_pred - y_valid) / ((np.abs(y_valid) + np.abs(fold_valid_pred)) / 2)) * 100
        fold_scores.append(fold_score)
        print(f'Seed {s} fold {fold} SMAPE: {fold_score}')

    print(f'Seed {s} SMAPE {np.mean(fold_scores)}, std {np.std(fold_scores)}')
    
    seed_valid_pred = np.array(list(fold_valid_preds.values()))
    seed_test_pred = np.mean(np.column_stack(fold_test_preds), axis=1)
    
    seed_valid_preds.update(dict(zip(seed_valid_ids, seed_valid_pred)))
    seed_test_preds.append(seed_test_pred)
    
    seed_score = np.mean(fold_scores)
    seed_scores.append(seed_score)
    
print(f'SMAPE of {s+1} seeds: {np.mean(seed_scores)}, std {np.std(seed_scores)}')

In [None]:
# Out-of-fold predictions for later use
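# seed_valid_preds maps row_id -> prediction (each seed overwrites the previous one),
# so build the frame from its items rather than zipping over its keys
valid_preds = pd.DataFrame(list(seed_valid_preds.items()))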
valid_preds.columns = ['row_id', f'CB{m}_pred']
valid_preds.to_csv(f'CB{m}_valid_pred.csv', index=False)

In [None]:
# Test predictions for later use
test_preds = sample_submission.copy()  # leave sample_submission's own columns intact
test_preds['num_sold'] = np.mean(np.column_stack(seed_test_preds), axis=1)
test_preds.columns = ['row_id', f'CB{m}_pred']
test_preds.to_csv(f'CB{m}_test_pred.csv', index=False)

In [None]:
# Submission: average the test predictions over all seeds
submission = pd.DataFrame({'row_id': sample_submission['row_id'],
                           'num_sold': np.mean(np.column_stack(seed_test_preds), axis=1)})
submission.to_csv('submission.csv', index=False)

#### Did it work?
In this competition, our submissions will be evaluated on SMAPE (symmetric mean absolute percentage error) between forecasts and actual values.
We are going to be using CatBoost, which offers SMAPE as an evaluation metric, but we also calculate it by hand for each fold as a sanity check.
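The per-fold score in the training loop can be wrapped in a small helper. This is a minimal sketch of the same formula. The function name `smape` and the choice to count a 0/0 term as zero error are assumptions, not part of the competition definition.

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    # Assumption: when both values are zero, treat the term as zero error
    ratio = np.divide(np.abs(y_pred - y_true), denom,
                      out=np.zeros_like(denom), where=denom != 0)
    return 100 * np.mean(ratio)

print(smape([100, 200], [100, 200]))
print(smape([100], [300]))
```

A perfect forecast scores 0, and the score grows as forecasts move away from the actual values, with over- and under-predictions scaled by the average of the two magnitudes.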

Finally, because the training set is so small, we can afford to train our model on more than one random state/seed. I go into a bit more detail as to why this is useful and how much it improves results in this post.
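A small sketch of why averaging over seeds helps: the synthetic "predictions" below are an assumption (a known signal plus independent noise per seed), not output from the real model. The blending step mirrors the `np.column_stack(...).mean(axis=1)` pattern used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic target and five noisy "models", one per seed (illustrative only)
true_signal = np.linspace(100, 200, 50)
seed_test_preds = [true_signal + rng.normal(0, 5, size=50) for _ in range(5)]

# Average the per-seed predictions row by row
blended = np.mean(np.column_stack(seed_test_preds), axis=1)

single_err = np.mean([np.abs(p - true_signal).mean() for p in seed_test_preds])
blended_err = np.abs(blended - true_signal).mean()
print(f'mean single-seed MAE: {single_err:.3f}, blended MAE: {blended_err:.3f}')
```

By the triangle inequality, the absolute error of the averaged prediction can never exceed the average absolute error of the individual seeds, and it is usually noticeably smaller when the seeds' errors are not perfectly correlated.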

#### What did you not understand about this process?
Everything I needed was provided on the competition data page, and I had no problems while working on it. If anything in this notebook is unclear, please leave a comment.

#### What else do you think you can try as part of this approach?
A next step would be to look at a notebook that presents feature engineering (based on the insights of this EDA) and a linear model that makes use of those features.