# Mean encoding - expanding window

In this notebook, we will encode a static feature (the country) with mean encoding over expanding windows. Because each encoded value is computed only from earlier observations, this implementation avoids look-ahead bias.
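To make the idea concrete before we load the real data, here is a minimal sketch on a toy series. The data frame `toy` and its values are made up for illustration and are not part of the retail dataset.

```python
import pandas as pd

# Toy weekly revenue for a single, hypothetical country.
toy = pd.DataFrame(
    {"revenue": [100.0, 200.0, 300.0, 400.0]},
    index=pd.date_range("2021-01-03", periods=4, freq="W"),
)

# Expanding mean up to and including each week: 100, 150, 200, 250.
expanding_mean = toy["revenue"].expanding().mean()

# Shift by one step so that each week only sees earlier weeks:
# NaN, 100, 150, 200.
toy["revenue_enc"] = expanding_mean.shift()

print(toy)
```

The first week has no past data, so its encoding is missing. Every other week is encoded with the mean of the weeks that came before it.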

We will use the online retail dataset, which we prepared in the notebook `02-create-online-retail-II-datasets.ipynb` located in the `01-Create-Datasets` folder.

In [1]:
import numpy as np
import pandas as pd

## Load data

In [2]:
df = pd.read_csv("../Datasets/online_retail_dataset_countries.csv",
                parse_dates=["week"],
                index_col="week",
                )

df.head()

week,country,quantity,revenue
2009-12-06,Belgium,143,439.1
2009-12-13,Belgium,10,8.5
2009-12-20,Belgium,0,0.0
2009-12-27,Belgium,0,0.0
2010-01-03,Belgium,0,0.0


## Split into train and test

In [3]:
# Split the data before and after the end of June 2011

X_train = df[df.index <= pd.to_datetime('2011-06-30')]

# The test set keeps the full history for now, because the
# expanding window needs the past data. We remove the train
# period after the encoding.
X_test = df.copy()

# the target variable
# (note that y_test spans the full history at this point)
y_train = X_train["revenue"]
y_test = X_test["revenue"]

In [4]:
# sanity check

X_train.index.min(), X_train.index.max()

(Timestamp('2009-12-06 00:00:00'), Timestamp('2011-06-26 00:00:00'))

In [5]:
# sanity check

X_test.index.min(), X_test.index.max()

(Timestamp('2009-12-06 00:00:00'), Timestamp('2011-12-11 00:00:00'))

## Encode countries

In [6]:
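# train set first

# Expanding mean of the revenue per country, shifted by one week
# within each country, so that each row only uses earlier weeks
# of the same country.
train_enc = (
    X_train
    .groupby(['country'])['revenue']
    .expanding()
    .mean()
    .groupby(level='country')
    .shift()
).reset_index()

train_enc.rename(columns = {"revenue": "country_enc"}, inplace = True)

train_enc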

,country,week,country_enc
0,Belgium,2009-12-06,
1,Belgium,2009-12-13,439.100000
2,Belgium,2009-12-20,223.800000
3,Belgium,2009-12-27,149.200000
4,Belgium,2010-01-03,111.900000
...,...,...,...
487,United Kingdom,2011-05-29,129923.850701
488,United Kingdom,2011-06-05,129810.417487
489,United Kingdom,2011-06-12,129208.338025
490,United Kingdom,2011-06-19,129708.159425
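
One detail deserves care when shifting a grouped expanding mean: a plain `.shift()` on the result moves the values down across the whole series, so the first row of each country (other than the first one) receives the last mean of the previous country. The sketch below compares this with a per-group shift on made-up data. The countries `A` and `B` and their revenues are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: two countries, three weeks each.
weeks = pd.date_range("2021-01-03", periods=3, freq="W")
toy = pd.DataFrame(
    {
        "country": ["A"] * 3 + ["B"] * 3,
        "revenue": [10.0, 20.0, 30.0, 100.0, 200.0, 300.0],
    },
    index=pd.Index(list(weeks) * 2, name="week"),
)

# Expanding mean per country, indexed by (country, week).
grouped_mean = toy.groupby("country")["revenue"].expanding().mean()

# Plain shift: the first row of "B" receives the last mean of "A".
plain = grouped_mean.shift()

# Per-group shift: the first row of every country is NaN.
per_group = grouped_mean.groupby(level="country").shift()

comparison = pd.DataFrame({"plain": plain, "per_group": per_group})
print(comparison)
```

The two columns differ only in the first row of each country after the first one, which is easy to miss when the output is truncated.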


In [7]:
# Add encoded variable to original train set

X_train_enc = X_train.reset_index().merge(train_enc, on=["week", "country"])

X_train_enc

,week,country,quantity,revenue,country_enc
0,2009-12-06,Belgium,143,439.10,
1,2009-12-13,Belgium,10,8.50,439.100000
2,2009-12-20,Belgium,0,0.00,223.800000
3,2009-12-27,Belgium,0,0.00,149.200000
4,2010-01-03,Belgium,0,0.00,111.900000
...,...,...,...,...,...
487,2011-05-29,United Kingdom,67666,121076.06,129923.850701
488,2011-06-05,United Kingdom,44422,82246.14,129810.417487
489,2011-06-12,United Kingdom,77850,169194.05,129208.338025
490,2011-06-19,United Kingdom,68207,120797.68,129708.159425


In [8]:
# Now we drop the static variable

X_train_enc = X_train_enc.drop("country", axis=1)

# Set the week as the index again
X_train_enc.set_index("week", inplace=True)

X_train_enc.head()

week,quantity,revenue,country_enc
2009-12-06,143,439.1,
2009-12-13,10,8.5,439.1
2009-12-20,0,0.0,223.8
2009-12-27,0,0.0,149.2
2010-01-03,0,0.0,111.9


In [9]:
# Now we repeat for the test set

# Find the encoding values: expanding mean per country,
# shifted by one week within each country
test_enc = (
    X_test
    .groupby(['country'])['revenue']
    .expanding()
    .mean()
    .groupby(level='country')
    .shift()
).reset_index()

test_enc.rename(columns = {"revenue": "country_enc"}, inplace = True)

# Join the encoded variable
X_test_enc = X_test.reset_index().merge(test_enc, on=["week", "country"])

# Drop the original variable
X_test_enc = X_test_enc.drop("country", axis=1)

# Set the week as the index again
X_test_enc.set_index("week", inplace=True)

# Remove data that belongs to the train set
X_test_enc = X_test_enc[X_test_enc.index > pd.to_datetime('2011-06-30')]

X_test_enc.head()

week,quantity,revenue,country_enc
2011-07-03,103,163.9,511.378537
2011-07-10,666,1022.82,507.192048
2011-07-17,13,45.6,513.330476
2011-07-24,0,0.0,507.827765
2011-07-31,1000,1407.15,501.922791


That's it!

As you can see, encoding the static feature this way requires a lot of manual work. We need to make sure that the train set contains enough data, and that we split the data correctly after the encoding.
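
One way to reduce the manual work is to wrap the steps in a small function. The sketch below is only an illustration: the function `expanding_mean_encode`, its parameters, and the toy data are hypothetical, and it assumes a named index with unique (index, group) pairs, sorted by time within each group.

```python
import pandas as pd


def expanding_mean_encode(df, group_col, target_col, enc_col):
    """Hypothetical helper: replace group_col with a lagged expanding mean."""
    enc = (
        df.groupby(group_col)[target_col]
        .expanding()
        .mean()
        .groupby(level=group_col)  # shift within each group
        .shift()
        .rename(enc_col)
        .reset_index()
    )
    index_name = df.index.name
    out = df.reset_index().merge(enc, on=[index_name, group_col], how="left")
    return out.drop(columns=group_col).set_index(index_name)


# Made-up data: two countries, four weeks each.
weeks = pd.date_range("2021-01-03", periods=4, freq="W")
toy = pd.DataFrame(
    {
        "country": ["A"] * 4 + ["B"] * 4,
        "revenue": [10.0, 20.0, 30.0, 40.0, 100.0, 200.0, 300.0, 400.0],
    },
    index=pd.Index(list(weeks) * 2, name="week"),
)

# Encode the full history once, then split by date.
encoded = expanding_mean_encode(toy, "country", "revenue", "country_enc")
split_date = weeks[1]
train = encoded[encoded.index <= split_date]
test = encoded[encoded.index > split_date]

print(train)
print(test)
```

Because the expanding window only looks backwards, encoding the full history once and splitting afterwards gives the same values as encoding the train set and the test set separately.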