# Intro

Hi all, this is my very first Kaggle Kernel, so hopefully it is useful for a lot of people.
I've been working as a DS for a long time and I finally decided to enter a Kaggle competition, even if it is a Playground one.

During this Competition I have learned a lot from many other kernels, so I will do my best to give proper credit to everyone!!

**Disclaimer:** Despite taking a lot of ideas from other kernels, I tend to reimplement them to learn what is happening in there.

This Kernel focuses only on the scripts that produced my best submission so far.

* I'm using **Feature Engine** for all the Preprocessing Steps. 
* **Scikit-Learn** Pipelines to make everything more organized.
* **Catboost** as my main Model.

If you think this helps you with your learning, please **UPVOTE**.

**COMMENT: I know I'm not using seeds for reproducibility, but for some reason the results on my machine and in the Kernel were exactly the same.**

For some reason, once I tried a new learning rate, the results in my Local Validation Scheme changed: instead of 7.01 it started giving me 7.05 with no other change. So I started tuning hyperparameters until I got 7.00. I submitted and got a new LB Score of 4.89 just by regularizing CatBoost.

# Installing/Importing Libraries

In [None]:
!pip install feature_engine

In [None]:
import holidays
import joblib
import numpy as np
import pandas as pd
from catboost import CatBoostRegressor
from feature_engine.creation import CombineWithReferenceFeature
from feature_engine.creation import CyclicalTransformer
from feature_engine.encoding import OneHotEncoder
from feature_engine.imputation import CategoricalImputer
from feature_engine.selection import DropFeatures
from sklearn.pipeline import Pipeline
from sklearn.model_selection import TimeSeriesSplit

# Utilities

These are utilities I created or reused to keep my code organized.

* I changed SMAPE implementations several times, and I think I took the final one from this [Kernel](https://www.kaggle.com/maxencefzr/tps-jan22-eda-simple-catboost). Thanks a lot!

* I also hacked the CyclicalTransformer from Feature Engine. The main reason was that if I tried to create two different cyclical variables from the same base variable, the second overwrote the first. This happens because by default the Cyclical Transformer always assigns `_sin` and `_cos` as suffixes. Thanks [Soledad Galli](https://www.kaggle.com/solegalli) for the library.

Finally, I created variables from dates using Pandas. I wrote this myself, but then I noticed a lot of people were using the same approach. So thanks to all of them.
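The overwriting problem described above can be reproduced without Feature Engine. The sketch below is only an illustration in plain pandas and NumPy: `add_cyclical` is a hypothetical helper written for this example, not part of any library, and it mimics the default `_sin` / `_cos` naming.

```python
import numpy as np
import pandas as pd


def add_cyclical(df, variable, max_value, suffix=""):
    """Hypothetical helper: add sin/cos columns for one variable."""
    out = df.copy()
    angle = 2 * np.pi * out[variable] / max_value
    out[f"{variable}_sin{suffix}"] = np.sin(angle)
    out[f"{variable}_cos{suffix}"] = np.cos(angle)
    return out


toy = pd.DataFrame({"day_of_week": range(7)})

# Without suffixes, the second call reuses the same column names,
# so the n=1 terms are silently replaced by the n=2 terms.
no_suffix = add_cyclical(add_cyclical(toy, "day_of_week", 7), "day_of_week", 7 / 2)

# With suffixes, both pairs of columns survive.
with_suffix = add_cyclical(
    add_cyclical(toy, "day_of_week", 7, suffix="_1"),
    "day_of_week", 7 / 2, suffix="_2",
)

print(list(no_suffix.columns))
print(list(with_suffix.columns))
```

This is the same idea as the `suffix` argument added in the class below: rename the generated columns right after they are created so a later transformer cannot clobber them.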

In [None]:
# SMAPE calculation used to evaluate the model.
def smape(y_true, y_pred):
    numerator = np.abs(y_pred - y_true)
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 2
    
    return np.mean(numerator / denominator)*100

# This is a modified version of the CyclicalTransformer from Feature Engine.
# The modification lets me apply several cyclical transformers to the
# same variable without overwriting the generated columns.
class CyclicalTransformerV2(CyclicalTransformer):
    def __init__(self, suffix = None, **kwargs):
        super().__init__(**kwargs)
        self.suffix = suffix
        
    def transform(self, X):
        X = super().transform(X)
        if self.suffix is not None:
            transformed_names = X.filter(regex = r'sin$|cos$').columns
            new_names = {name: name + self.suffix for name in transformed_names}
            X.rename(columns=new_names, inplace=True)
        return X


# Utility to create Date Features 
def create_date_features(df):
    df['day_of_year'] = df.date.dt.day_of_year
    df['day_of_month'] = df.date.dt.day
    df['day_of_week'] = df.date.dt.weekday
    df['month'] = df.date.dt.month
    df['quarter'] = df.date.dt.quarter
    df['year'] = df.date.dt.year
    df['period'] = df.date.dt.to_period('M')
    return df
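As a quick sanity check of the SMAPE helper above, here is a tiny worked example. The numbers are made up for illustration, and the function is repeated so the snippet runs on its own.

```python
import numpy as np


def smape(y_true, y_pred):
    numerator = np.abs(y_pred - y_true)
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 2
    return np.mean(numerator / denominator) * 100


# Made-up values, only to follow the arithmetic by hand.
y_true = np.array([100.0, 200.0])
y_pred = np.array([110.0, 180.0])

# Row 1 contributes |110 - 100| / 105, row 2 contributes |180 - 200| / 190.
score = smape(y_true, y_pred)
print(score)

# The metric is symmetric in its two arguments.
print(np.isclose(score, smape(y_pred, y_true)))
```

One thing to keep in mind: if a true value and its prediction are both exactly zero, the denominator is zero and this implementation returns `nan` for that row.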

# Importing Data

I just imported the data, making sure to parse the dates correctly.

In [None]:
model_name = 'MAE_15000_cyc2_es_cv'
df = pd.read_csv('../input/tabular-playground-series-jan-2022/train.csv', 
                                 index_col=0, parse_dates=['date'])
df = create_date_features(df)

Then I took another idea from [here](https://www.kaggle.com/maxencefzr/tps-jan22-eda-simple-catboost): using holidays. I didn't know the `holidays` library at all. Inspired by that, I added the holidays of every country to capture their effect and merged them into the DataFrame.

In [None]:

# pd.to_datetime converts the plain date objects returned by `holidays`
# so they match the dtype of the parsed `date` column used in the merge.
finland = pd.DataFrame([dict(date = date, finland_holiday = event, country= 'Finland') 
                        for date, event in holidays.Finland(years=[2015, 2016, 2017, 2018, 2019]).items()])
finland['date'] = pd.to_datetime(finland['date'])

norway = pd.DataFrame([dict(date = date, norway_holiday = event, country= 'Norway') 
                       for date, event in holidays.Norway(years=[2015, 2016, 2017, 2018, 2019]).items()])
norway['date'] = pd.to_datetime(norway['date'])

# For Sweden, skip the plain Sunday entries and strip the ", Söndag" tag.
sweden = pd.DataFrame([dict(date = date, sweden_holiday = event.replace(", Söndag", ""), country= 'Sweden') 
    for date, event in holidays.Sweden(years=[2015, 2016, 2017, 2018, 2019]).items() if event != 'Söndag'])
sweden['date'] = pd.to_datetime(sweden['date'])

df = (df.merge(finland, on = ['date', 'country'], how = 'left')
      .merge(norway, on = ['date', 'country'], how = 'left')
      .merge(sweden, on = ['date', 'country'], how = 'left'))

print('Columns Before Training:', df.columns)
df
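To show what the left merge above does to rows that are not holidays, here is a minimal sketch on toy data. The sales figures and the holiday label are invented for the example and do not come from the `holidays` library.

```python
import pandas as pd

# Toy sales rows (made-up numbers).
sales = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-01", "2015-01-01", "2015-01-02"]),
    "country": ["Finland", "Norway", "Finland"],
    "num_sold": [329, 520, 318],
})

# Toy holiday table with a hypothetical label for a single Finnish date.
finland_toy = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-01"]),
    "country": ["Finland"],
    "finland_holiday": ["New Year's Day"],
})

# A left merge keeps every sales row; rows without a match get NaN.
merged = sales.merge(finland_toy, on=["date", "country"], how="left")
print(merged)
```

Only the Finnish row on the holiday date gets a label. Every other row keeps a missing value in `finland_holiday`, which is why the preprocessing pipeline starts with a categorical imputation step.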

# Preprocessing

Here is where the modeling part starts.

* I learned a lot from AmbroSM's [Kernel](https://www.kaggle.com/ambrosm/tpsjan22-03-linear-model#Residuals-of-the-simple-model). In that kernel I found out how to create Fourier series and interactions with some of the variables.

* By using the hacked Cyclical Transformer I was able to quickly create Fourier terms for n = 1, 2 at the day level. I only used these because adding more terms started to be detrimental to the model. I'm guessing I was adding too much noise.

The Fourier terms produced by the Cyclical Transformer for $n = 1$ are:

$$var\_sin = \sin\left(\frac{2\pi}{max\_value}t\right)$$
$$var\_cos = \cos\left(\frac{2\pi}{max\_value}t\right)$$

For the $n$-th term it is enough to pass $\frac{max\_value}{n}$ as the maximum value:

$$var\_sin = \sin\left(\frac{2n\pi}{max\_value}t\right) = \sin\left(\frac{2\pi}{\frac{max\_value}{n}}t\right)$$
$$var\_cos = \cos\left(\frac{2n\pi}{max\_value}t\right) = \cos\left(\frac{2\pi}{\frac{max\_value}{n}}t\right)$$
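The identity in the formulas above can be checked numerically. This is a small NumPy sketch; `fourier_terms` is a hypothetical helper written for this check, not a Feature Engine function.

```python
import numpy as np


def fourier_terms(t, max_value, n=1):
    """Hypothetical helper: n-th order sin/cos terms for values t."""
    angle = 2 * np.pi * n * t / max_value
    return np.sin(angle), np.cos(angle)


day_of_week = np.arange(7)

# Second-order terms computed directly with n = 2 ...
sin_2, cos_2 = fourier_terms(day_of_week, max_value=7, n=2)

# ... and the same terms obtained by halving max_value with n = 1.
sin_alt, cos_alt = fourier_terms(day_of_week, max_value=7 / 2, n=1)

print(np.allclose(sin_2, sin_alt), np.allclose(cos_2, cos_alt))
```

This is why dividing `max_values` by 2 in the second transformer is enough to get the n = 2 terms.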

* Then, using `CombineWithReferenceFeature` from Feature Engine, I could quickly create interactions between the products (one-hot encoded) and the Fourier terms.

* Finally I dropped the features I was not using.

One thing that may look a bit odd is that I run the preprocessing separately from the model. This is because I'm using early stopping with my CatBoost model, so to apply the same transformations to the training and validation sets I decided to do this stage separately.
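The product interactions mentioned in the list above are just element-wise products between a Fourier term and a one-hot column. Here is a rough plain-pandas sketch of the idea; the `_x_` column naming is my own choice for this example and is not necessarily what `CombineWithReferenceFeature` generates.

```python
import numpy as np
import pandas as pd

# Three toy rows, one per product, all on the same day of the week.
toy = pd.DataFrame({
    "product_Kaggle Mug": [1, 0, 0],
    "product_Kaggle Hat": [0, 1, 0],
    "product_Kaggle Sticker": [0, 0, 1],
    "day_of_week": [1, 1, 1],
})
toy["day_of_week_sin_1"] = np.sin(2 * np.pi * toy["day_of_week"] / 7)

# Multiply the Fourier term by each one-hot product column.
products = [c for c in toy.columns if c.startswith("product_")]
for product in products:
    toy[f"day_of_week_sin_1_x_{product}"] = toy["day_of_week_sin_1"] * toy[product]

print(toy.filter(like="_x_"))
```

Each interaction column is non-zero only on the rows of its own product, so the model can learn a separate seasonal shape per product.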

In [None]:
# Fourier Terms of n = 1
cyc_1 = CyclicalTransformerV2(variables = ['day_of_year','day_of_month','day_of_week'], 
                max_values = {'day_of_year':365,'day_of_month': 30,'day_of_week': 7}, 
                suffix = '_1')
# Fourier Terms of n = 2
cyc_2 = CyclicalTransformerV2(variables = ['day_of_year','day_of_month','day_of_week'], 
                max_values = {'day_of_year':365/2,'day_of_month': 30/2,'day_of_week': 7/2}, 
                suffix = '_2')

prep = Pipeline(steps  =[
        ('cat_imputation', CategoricalImputer()),        
        ('ohe',OneHotEncoder()),
        ('cyc', cyc_1),
        ('cyc2', cyc_2),
        # ('cyc3', cyc_3),
        ('combo', CombineWithReferenceFeature(
        variables_to_combine=['day_of_year_sin_1', 'day_of_year_cos_1','day_of_month_sin_1', 
                            'day_of_month_cos_1', 'day_of_week_sin_1','day_of_week_cos_1', 
                            'day_of_year_sin_2', 'day_of_year_cos_2','day_of_month_sin_2', 
                            'day_of_month_cos_2', 'day_of_week_sin_2','day_of_week_cos_2', 
                            ], 
        reference_variables=['product_Kaggle Mug', 'product_Kaggle Hat','product_Kaggle Sticker'],
        operations=['mul'])),
        ('drop',DropFeatures(features_to_drop=['date','period'])),
        ])


    
X = prep.fit_transform(df.drop(columns=['num_sold']))
y = df.num_sold

joblib.dump(prep, f'./prep_{model_name}.joblib')

# CV Strategy

* I'm just using `TimeSeriesSplit` with four splits. I'm not 100% sure, but since we have almost the same number of records in every year (2016 is slightly different because it was a leap year), this *should* be roughly similar to training by year. Note that `TimeSeriesSplit(n_splits=4)` cuts the rows into five roughly equal blocks, so each validation fold is about one fifth of the data rather than exactly one calendar year.

I'm training CatBoost with MAE as the loss function and monitoring overfitting with SMAPE directly, since CatBoost includes it as a metric. I'm also using 1000 early stopping rounds to make training faster, since it cannot be done on GPU (SMAPE is not available as a metric on GPU).

* One thing to point out is that I'm saving every model trained in the different folds so I can blend them later, as a regularization strategy. I learned this in this [Abhishek's Video](https://www.youtube.com/watch?v=zcqgj-Udcqs), so thanks.

* Got a ~7.01 in my Validation Scheme.
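To see how the folds are laid out, here is a small sketch on 20 toy rows. It assumes the rows are already sorted by date, which `TimeSeriesSplit` relies on.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 20 toy rows standing in for a date-sorted training set.
X_toy = np.arange(20).reshape(-1, 1)

folds = TimeSeriesSplit(n_splits=4)
sizes = []
for train_idx, val_idx in folds.split(X_toy):
    # Training rows always come before validation rows.
    assert train_idx.max() < val_idx.min()
    sizes.append((len(train_idx), len(val_idx)))

print(sizes)
```

The training window grows with every fold while the validation window keeps the same size, so the last fold's model has seen the most data.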



In [None]:
folds = TimeSeriesSplit(n_splits=4)
score = []
for fold, (train_idx, val_idx) in enumerate(folds.split(X)):
    
    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]
    
    model = CatBoostRegressor(iterations=15000,
                            learning_rate=0.03,
                            bootstrap_type='Bayesian',
                            boosting_type='Plain',
                            loss_function='MAE',
                            l2_leaf_reg = 5, # Added as Regularization
                            use_best_model = True, 
                            # loss_function='Huber:delta=0.5',
                            eval_metric='SMAPE',
                            # random_seed = 123,
                            # task_type="GPU",
                            # devices='0:1'
                            )

    model.fit(X_train, y_train, 
            eval_set = (X_val, y_val),
            early_stopping_rounds = 1000
            )

    y_pred = model.predict(X_val)
    
    
    joblib.dump(model, f'./{model_name}_fold_{fold}.joblib')
    score.append(smape(y_val, y_pred))

print('Score', score)
print('Mean Score', np.mean(score))

# Inference

In [None]:
df_test = pd.read_csv('../input/tabular-playground-series-jan-2022/test.csv', index_col=0, parse_dates=['date'])
id = df_test.index
df_test = create_date_features(df_test)

df_test = (df_test.merge(finland, on = ['date', 'country'], how = 'left')
           .merge(norway, on = ['date', 'country'], how = 'left')
           .merge(sweden, on = ['date', 'country'], how = 'left'))
prep = joblib.load(f'./prep_{model_name}.joblib')

# Use transform (not fit_transform) so the test set goes through the
# preprocessing that was fitted on the training data.
X = prep.transform(df_test)

* For inference I just repeated the training process, making sure to blend the models from the different folds using different weights. I applied this because of the good results Luca Massaron obtained in a linear model with sample weights [here](https://www.kaggle.com/lucamassaron/kaggle-merchandise-eda-with-baseline-linear-model).

* I also applied rounding to the predictions, as seen [here](https://www.kaggle.com/xinyangkabuda/baseline-xgboost-lightgbm-stacking-v5-round), and I actually got a small boost in my score.
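The weighted blend and the rounding described above can be sketched on toy numbers. The fold predictions below are invented for illustration. The weights are the ones used in this kernel.

```python
import numpy as np
import pandas as pd

# Made-up predictions from four fold models for two test rows.
fold_preds = pd.DataFrame({
    0: [100.4, 250.0],
    1: [101.0, 252.0],
    2: [99.0, 249.0],
    3: [102.0, 252.0],
})

# One weight per fold; they sum to 1 so the blend stays on the same scale.
weights = np.array([0.1, 0.2, 0.3, 0.4])

# Weighted average across folds, then round to whole units sold.
blended = fold_preds.to_numpy() @ weights
rounded = np.round(blended).astype(int)

print(blended)
print(rounded)
```

Because the target is a count, rounding the blended value to the nearest integer is a cheap post-processing step.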

One thing I don't want to do is add the **GDP data** (but I feel very pressured to do it). I've seen a lot of kernels doing this, but in my opinion it just exploits a leakage. It is probably a good idea for this competition, but I'm not sure it could be applied in a real model. I'll try to hold out as long as I can before adding those features.

In [None]:
preds_dict = {}
for fold in range(4):
    pipe = joblib.load(f'./{model_name}_fold_{fold}.joblib')
    preds = pipe.predict(X)
    
    preds_dict[fold] = preds
    
final_preds = pd.DataFrame(preds_dict)#.mean(axis = 1).apply(np.round).astype("int")

# Weighted blend of the four fold models (weights sum to 1), rounded to integers.
final_preds = (
    0.1 * final_preds.iloc[:, 0]
    + 0.2 * final_preds.iloc[:, 1]
    + 0.3 * final_preds.iloc[:, 2]
    + 0.4 * final_preds.iloc[:, 3]
).apply(np.round).astype("int")
final_preds.index = id
final_preds.name = 'num_sold'
print(final_preds)

final_preds.to_csv('submission.csv')





# Conclusion

Thanks a lot to everyone here for helping me become a better data scientist. And if you like the kernel, please **UPVOTE**. It would be really nice to see that I can help others too!!

Cheers!!

# This is my Local Submission

In [None]:
import pandas as pd
df = pd.read_csv('../input/local-submission-filecsv/MAE_15000_cyc2_es_cv_l2_5_sub_weighting.csv')
df.to_csv('submission_local.csv', index = False)