# Mean encoding - simple

In this notebook, we will encode static features with mean encoding. We will split the data into train and test sets, learn the mean target value per category using the train set, and then encode both the train and test sets with those learned parameters.

The advantage of this approach is that the logic is already implemented in open-source libraries such as Feature-engine.

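The drawback is that we may overfit: the mean of each category is computed over the entire train set, so the encoded value of an early row contains information from later rows. In other words, we leak future data into the past.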
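To make the idea concrete, here is a minimal sketch of mean encoding written with plain pandas. The toy data and the names `train`, `test` and `mapping` are made up for illustration and are not part of the online retail dataset; a library implementation may differ in details such as how it treats unseen categories.

```python
import pandas as pd

# Toy data (hypothetical values, not from the online retail dataset)
train = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "revenue": [10.0, 30.0, 100.0, 300.0],
})
test = pd.DataFrame({"country": ["A", "B", "A"]})

# Learn the mean target value per category from the train set only
mapping = train.groupby("country")["revenue"].mean().to_dict()

# Replace each category with its learned mean in both sets
train_encoded = train["country"].map(mapping)
test_encoded = test["country"].map(mapping)

print(mapping)
print(test_encoded.tolist())
```

Note that the test set is encoded with the means learned from the train set; nothing is re-estimated on the test data.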

We will use the online retail dataset, which we prepared in the notebook `02-create-online-retail-II-datasets.ipynb` located in the `01-Create-Datasets` folder.

In [1]:
import numpy as np
import pandas as pd
from feature_engine.encoding import MeanEncoder

## Load data

In [2]:
df = pd.read_csv("../Datasets/online_retail_dataset_countries.csv",
                parse_dates=["week"],
                index_col="week",
                )

df.head()

week,country,quantity,revenue
2009-12-06,Belgium,143,439.1
2009-12-13,Belgium,10,8.5
2009-12-20,Belgium,0,0.0
2009-12-27,Belgium,0,0.0
2010-01-03,Belgium,0,0.0


## Split into train and test

In [3]:
# Split the data at the end of June 2011:
# train up to and including 2011-06-30, test afterwards

X_train = df[df.index <= pd.to_datetime('2011-06-30')]
X_test = df[df.index > pd.to_datetime('2011-06-30')]

y_train = X_train["revenue"]
y_test = X_test["revenue"]

In [4]:
# Sanity check: date range of the train set

X_train.index.min(), X_train.index.max()

(Timestamp('2009-12-06 00:00:00'), Timestamp('2011-06-26 00:00:00'))

In [5]:
# Sanity check: date range of the test set

X_test.index.min(), X_test.index.max()

(Timestamp('2011-07-03 00:00:00'), Timestamp('2011-12-11 00:00:00'))
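The two visual checks above can also be written as assertions. The sketch below uses a small hypothetical weekly frame (`toy`) in place of the notebook's `df`, with the same cutoff date, to confirm that a split of this kind leaves no overlap and loses no rows.

```python
import pandas as pd

# Hypothetical weekly index (Sundays) standing in for the `week` index
idx = pd.date_range("2011-06-05", periods=8, freq="W-SUN")
toy = pd.DataFrame({"revenue": range(8)}, index=idx)

# Same cutoff as in the split above
cutoff = pd.Timestamp("2011-06-30")
toy_train = toy[toy.index <= cutoff]
toy_test = toy[toy.index > cutoff]

# The train set ends before the test set starts, and no row is lost
assert toy_train.index.max() < toy_test.index.min()
assert len(toy_train) + len(toy_test) == len(toy)
```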

## Encode

In [6]:
# Set up the mean encoder

enc = MeanEncoder()

In [7]:
# Find mean target value per category
# (it uses the entire train set)

enc.fit(X_train, y_train)

In [8]:
# By default, Feature-engine's encoder finds and
# encodes all categorical variables

enc.variables_

['country']

In [9]:
# The learned mappings: mean revenue per country
# in the train set

enc.encoder_dict_

{'country': {'Belgium': 511.37853658536585,
  'EIRE': 5579.161829268293,
  'France': 2872.7475609756098,
  'Germany': 3764.180012195122,
  'Spain': 919.3335365853659,
  'United Kingdom': 129124.83931707316}}

In [10]:
# Encode datasets

X_train_t = enc.transform(X_train)
X_test_t = enc.transform(X_test)

X_train_t.head()

week,country,quantity,revenue
2009-12-06,511.378537,143,439.1
2009-12-13,511.378537,10,8.5
2009-12-20,511.378537,0,0.0
2009-12-27,511.378537,0,0.0
2010-01-03,511.378537,0,0.0


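Note that Belgium was replaced by 511.38 in all rows, including the earliest weeks, even though revenue was 0 on various occasions. That value is the mean over the entire train set, so every encoded row carries information from later weeks. This may result in a "look ahead" bias.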
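One way to see the bias is to compare the single full-sample mean with the mean of the revenue that was actually observed before each week. The sketch below uses a made-up revenue series for one country; the values and the names `full_mean` and `past_mean` are illustrative assumptions, and the expanding mean is shown only as a point of comparison, not as the method used in this notebook.

```python
import pandas as pd

# Hypothetical weekly revenue for a single country (made-up values)
weeks = pd.date_range("2009-12-06", periods=5, freq="W-SUN")
revenue = pd.Series([400.0, 0.0, 0.0, 0.0, 600.0], index=weeks)

# Simple mean encoding: one value for every row, computed from all rows
full_mean = revenue.mean()

# Mean of the revenue observed strictly before each week
# (the first week has no history, so its value is NaN)
past_mean = revenue.expanding().mean().shift(1)

comparison = pd.DataFrame({
    "revenue": revenue,
    "full_mean": full_mean,
    "past_mean": past_mean,
})
print(comparison)
```

In this toy series, the first week is already encoded with a `full_mean` that includes the revenue of the last week, which would not have been known at that point in time.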