# <b>5. Feature Engineering </b>

In this notebook we show:
1. Feature Generation
2. Feature Selection


In [1]:
import sys

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

# Make the project root importable so the local preprocessing package is found.
sys.path.append("../")
from preprocessing.preprocessing import preprocessing

csv_train = pd.read_csv("../dataset/original/train.csv")
csv_test = pd.read_csv("../dataset/original/x_test.csv")

# Run the usual preprocessing, then order the rows chronologically.
df = preprocessing(csv_train, csv_test)
df = df.sort_values(['Date'])

In [2]:
from utils import add_all_features

First, we run the usual preprocessing and add all the new features with the <i>add_all_features()</i> function.

In [3]:
import os
#os.chdir("../")
df, categorical_f = add_all_features(df)

6019it [00:00, 7357.17it/s]
6019it [00:00, 40801.69it/s]
100%|██████████| 43/43 [00:00<00:00, 597.15it/s]


Generate Target Encoding Feature


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
157it [00:24,  6.40it/s]


Then, to make the new features easier to inspect, we drop some of the pre-existing columns.

In [6]:
cols_to_drop = ['pack', 'size (GM)', 'brand', 'price', 'POS_exposed w-1', 'volume_on_promo w-1',
                'month', 'day', 'year', 'seasons', 'cluster', 'scope',
                'target', 'real_target', 'sales w-1']
df1 = df.drop(columns=cols_to_drop)
df1

| | Date | sku | moving_average_20 | increment | exp_ma | lag_target_25 | lag_target_50 | lag_pos1 | days_to_christmas | heavy_light | partial_sales | gte_pack | gte_brand | gte_cluster | gte_pack_brand | gte_pack_cluster | gte_brand_cluster | gte_pack_brand_cluster | week_of_the_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-12-10 | 144 | 10.497091 | 0.000000 | 10.497091 | NaN | NaN | NaN | 15 | 0 | 10.497091 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 49 |
| 1 | 2016-12-17 | 144 | 10.671473 | 0.348765 | 10.780462 | NaN | NaN | 73.0 | 8 | 0 | 21.342946 | 9.772905 | 10.948660 | 10.316708 | 10.425722 | 10.268269 | 10.599625 | 10.416904 | 50 |
| 2 | 2016-12-24 | 144 | 10.815627 | 0.258079 | 11.032383 | NaN | NaN | 45.0 | 1 | 0 | 32.446880 | 9.778908 | 10.928479 | 10.356906 | 10.441833 | 10.323733 | 10.634837 | 10.461260 | 51 |
| 3 | 2016-12-31 | 144 | 10.849430 | -0.153093 | 10.969480 | NaN | NaN | 17.0 | -6 | 0 | 43.397722 | 9.778101 | 10.898996 | 10.401520 | 10.453786 | 10.388416 | 10.662687 | 10.509111 | 52 |
| 4 | 2017-01-07 | 144 | 10.603411 | -1.331509 | 9.930225 | NaN | NaN | 2.0 | 352 | 0 | 53.017054 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6014 | 2019-05-25 | 2718 | 10.248088 | 0.039499 | 10.417519 | 10.260427 | 10.319662 | 6.0 | 214 | 1 | 1328.606741 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 21 |
| 6015 | 2019-06-01 | 2718 | 10.277276 | -0.016279 | 10.414953 | 10.223467 | 10.324859 | 0.0 | 207 | 1 | 1339.020924 | 10.339415 | 9.893133 | 9.954137 | 10.052164 | 10.062220 | 9.856015 | 10.076289 | 22 |
| 6016 | 2019-06-08 | 2718 | 10.268582 | -0.392335 | 10.112565 | 10.204518 | 10.307318 | 1.0 | 200 | 1 | 1349.042772 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 23 |
| 6017 | 2019-06-15 | 2718 | 10.244279 | -0.254066 | 9.847348 | 10.253088 | 10.400072 | 0.0 | 193 | 1 | 1358.810554 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 24 |


As the dataframe above shows, we have added:
- lag features, for example <i>lag_target_50</i>, since lagged values are useful predictors when working with time series;

- time features: since time is a fundamental characteristic of a time series, we wanted to find and exploit meaningful time patterns. An example is <i>days_to_christmas</i>, which simply counts the days from the current week to Christmas;

- a moving-average feature (<i>moving_average_20</i>), which is the moving average of the sales w-1 values;

- the <i>exp_ma</i> feature, a weighted average with exponentially decaying weights that gives more importance to the most recent values;

- difference features: a few features that capture the week-to-week variation of a given field, for example <i>increment</i>, which computes the difference between sales w-1 and sales w-2;

- Gaussian Target Encoding features: this set of features uses only the categorical features _pack_, _brand_ and _cluster_. Traditional approaches encode categorical features with One-Hot Encoding or let the model deal with them directly. With Target Encoding we instead exploit the interactions among categorical variables directly, treating the target as a random variable. We assume that the target is normally distributed, so Target Encoding can be viewed as estimating the parameters of a normal distribution. This kind of feature easily leads to overfitting if not handled properly, since for each combination of categorical values we extract the mean of the target. To prevent overfitting, we perform cross-validation and apply a regularizer that acts as a prior. The parameters of the posterior distribution are given by the following equations:
$$
\mu_{post} = \frac {
    \tau_{prior} \mu _{prior} +
    n \tau \mu_{mle}
}{
    \tau_{prior} +  
    n \tau
}
$$
where $\mu_{mle}$ is the maximum-likelihood estimate of the mean, $\tau = 1/\sigma^2_{mle}$ is the corresponding precision, and $\mu_{prior}$ is the prior mean, with precision $\tau_{prior} = 1/\sigma^2_{prior}$.
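
To make the posterior-mean formula above concrete, here is a minimal NumPy sketch. It is not the code used inside <i>add_all_features()</i>: the function name `posterior_mean` is hypothetical, $n$ is assumed to be the number of target observations available for a given category combination, and the cross-validation step is left out.

```python
import numpy as np


def posterior_mean(values, mu_prior, sigma2_prior):
    """Posterior mean of a normal mean, given observed target `values`.

    Implements mu_post = (tau_prior * mu_prior + n * tau * mu_mle)
                         / (tau_prior + n * tau).
    Hypothetical helper for illustration only.
    """
    values = np.asarray(values, dtype=float)
    n = values.size  # assumption: n = number of observations in the group
    if n == 0:
        # No observations for this category: fall back to the prior mean.
        return mu_prior

    mu_mle = values.mean()
    sigma2_mle = values.var()  # maximum-likelihood variance (ddof=0)
    if sigma2_mle == 0.0:
        # Infinite precision: the formula tends to mu_mle in this limit.
        return mu_mle

    tau = 1.0 / sigma2_mle
    tau_prior = 1.0 / sigma2_prior
    return (tau_prior * mu_prior + n * tau * mu_mle) / (tau_prior + n * tau)


# Two observations pull the estimate from the prior (9.0) toward their mean (11.0).
encoded = posterior_mean([10.0, 12.0], mu_prior=9.0, sigma2_prior=1.0)
```

The result always lies between the prior mean and the group mean: the fewer observations a category has, the more its encoding is shrunk toward the prior, which is what limits overfitting on rare categories.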
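
The time-series features from the list above (lag, difference, moving average, exponentially weighted average and days to Christmas) can be sketched with pandas on a toy series. This is only an illustration under stated assumptions: the sales values are made up, and the lag of 2, the window of 3, `alpha=0.5` and the column names `lag_2` and `moving_average_3` are arbitrary choices, not the settings used by <i>add_all_features()</i>.

```python
import pandas as pd

# Toy weekly series for a single SKU; the sales values are invented.
toy = pd.DataFrame({
    "Date": pd.date_range("2016-12-10", periods=6, freq="7D"),
    "sku": 144,
    "sales w-1": [10.0, 12.0, 11.0, 13.0, 9.0, 10.0],
})

# Group by SKU so that no feature mixes values from different products.
by_sku = toy["sales w-1"].groupby(toy["sku"])

# Lag feature: the value observed 2 weeks earlier (NaN where unavailable).
toy["lag_2"] = by_sku.shift(2)

# Difference feature: sales w-1 minus the previous week's value,
# set to 0 for the first week of each SKU.
toy["increment"] = by_sku.diff().fillna(0.0)

# Moving average over at most the last 3 weeks.
toy["moving_average_3"] = by_sku.transform(
    lambda s: s.rolling(3, min_periods=1).mean()
)

# Exponentially weighted average: recent weeks get larger weights.
toy["exp_ma"] = by_sku.transform(
    lambda s: s.ewm(alpha=0.5, adjust=False).mean()
)

# Days from each date to 25 December of the same calendar year
# (negative for dates after Christmas).
christmas = pd.to_datetime(toy["Date"].dt.year.astype(str) + "-12-25")
toy["days_to_christmas"] = (christmas - toy["Date"]).dt.days
```

Counting to Christmas of the same calendar year reproduces the pattern visible in the dataframe above, where 2016-12-31 has a value of -6 and 2017-01-07 jumps to 352.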