# Mean target encoding

1) Split train on n-folds; (Next steps will be described for fold1 and fold2 as trainset and fold3 as validation. For other combinations everything is the same)

2) Split new train (fold1 + fold2) on n-folds one more time;

![image](../pictures/mean_encoding.png)

3) Calculate mean for each categorical feature on subfold1 and subfold2, then replace categorical value on subfold3. Then repeat the same for different combinations of folds.

![image](../pictures/mean_encoding2.png)

4) Calculate mean for each categorical feature on the whole new train (fold1 + fold2) and replace each categorical value on fold3.

![image](../pictures/mean_encoding3.png)

5) Perform 2-4 steps for each validation combination.

6) Now you have new datasets for each validation combination.

![image](../pictures/mean_encoding4.png)

7) Calculate mean for each categorical feature on the whole train (fold1 + fold2 + fold3) and replace each categorical value on test.

8) Enjoy :)

![image](../pictures/mean_encoding5.png)

Some practical advices by **Stas Semenov** and how he applied this approach on BNP Paribas Competition: https://www.youtube.com/watch?v=g335THJxkto

In [1]:
import warnings

import numpy as np
import pandas as pd


class MeanTargetEncoding:
    def __init__(self, c=10):
        self.c = c
        self.global_mean = 0
        self.features = []
        self.values = dict()

    def fit(self, data, y, features="all"):
        if features == "all":
            self.features = sorted([i for i in data.columns if data[i].dtype == "O"])
        else:
            assert all(feature in data.columns for feature in features)
            self.features = features

        self.global_mean = np.mean(y)

        f = {"y": ["size", "mean"]}

        for col in self.features:
            self.values[col] = dict()
            temp = pd.DataFrame({"y": y, col: data[col]}).groupby([col]).agg(f)

            self.values[col] = (
                    (temp["y"]["mean"] * temp["y"]["size"] + self.global_mean * self.c) /
                    (temp["y"]["size"] + self.c)
            ).to_dict()

        return self.values

    def fit_transform(self, data, y, features="all", inplace=True):

        self.fit(data, y, features)
        return self.transform(data, inplace=inplace)

    def transform(self, data, inplace=True):
        if not inplace:
            new_data = data.copy()
            new_data = self._apply_mean_encoding(new_data)
            return new_data
        return self._apply_mean_encoding(data)

    def _apply_mean_encoding(self, data):
        for col in self.values:
            if col in data.columns:
                temp = pd.DataFrame.from_dict(
                    self.values[col], orient="index").reset_index()
                temp.columns = [col, "value"]
                data = pd.merge(data, temp, how="left").fillna(self.global_mean)
                data[col] = data["value"].copy()
                del data["value"]
                data[col] = data[col].astype("float32")
            else:
                warnings.warn("Column " + col + " is missed in this dataset.")
        return data

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold

In [3]:
data = pd.read_csv("../data/telecom_churn.csv")
y = data["Churn"].astype('int8')
data.drop(["Churn"], axis=1, inplace=True)
data.head()

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls
0,KS,128,415,No,Yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1
1,OH,107,415,No,Yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1
2,NJ,137,415,No,No,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0
3,OH,84,408,Yes,No,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2
4,OK,75,415,Yes,No,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3


In [4]:
data["International plan"].value_counts()

No     3010
Yes     323
Name: International plan, dtype: int64

In [5]:
# split data on train/test
train, test, y_train, y_test = train_test_split(
    data, 
    y, 
    test_size=0.2, 
    random_state=1, 
    stratify=y
)

print(train.shape, y_train.shape, test.shape, y_test.shape)

(2666, 19) (2666,) (667, 19) (667,)


In [6]:
# 1)
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

In [7]:
def create_new_df_with_categorical_encodings(new_train, new_train_y, new_val, cols):
    mte = MeanTargetEncoding()
    new_skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    val_dfs = []
    # 2)
    for new_train_split, new_val_split in new_skf.split(new_train, new_train_y):
        # 3)
        mte.fit(
            new_train.iloc[new_train_split], 
            new_train_y.iloc[new_train_split], 
            features=cols
        )
        val_dfs.append(
            mte.transform(new_train.iloc[new_val_split], inplace=False)
        )
    # 4)
    mte.fit(new_train, new_train_y, features=cols)
    main_val = mte.transform(new_val, inplace=False)
    return val_dfs, main_val

In [8]:
%%time

new_train_dfs = []
new_val_dfs = []
main_train_dfs = []

for train_split, val_split in skf.split(train, y_train): 
    # 5)
    temp_train_dfs, temp_val_df = create_new_df_with_categorical_encodings(
        train.iloc[train_split], 
        y_train.iloc[train_split], 
        train.iloc[val_split], 
        ["International plan"]
    )
    # 6)
    new_train_dfs.append(temp_train_dfs)
    new_val_dfs.append(temp_val_df)
    # 7)
    mte = MeanTargetEncoding()
    mte.fit(train.iloc[train_split], y.iloc[train_split], ["International plan"])
    main_train_dfs.append(
        mte.transform(train.iloc[val_split], inplace=False)
    )

CPU times: user 220 ms, sys: 4 ms, total: 224 ms
Wall time: 223 ms


In [9]:
new_val_dfs[0]

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls
0,IL,151,408,0.113996,No,0,175.3,106,29.80,144.3,87,12.27,160.2,88,7.21,11.8,5,3.19,0
1,WV,72,510,0.113996,Yes,33,96.6,59,16.42,315.4,98,26.81,163.3,117,7.35,6.2,4,1.67,4
2,OR,80,415,0.113996,No,0,113.2,86,19.24,185.5,97,15.77,237.3,145,10.68,9.5,5,2.57,1
3,NY,112,415,0.113996,Yes,23,286.6,79,48.72,315.3,102,26.80,193.9,101,8.73,10.3,6,2.78,1
4,GA,86,510,0.113996,No,0,124.1,82,21.10,202.6,120,17.22,289.6,119,13.03,6.7,8,1.81,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
884,OH,29,408,0.113996,No,0,196.8,81,33.46,168.0,110,14.28,132.6,98,5.97,12.7,7,3.43,2
885,KY,75,415,0.113996,No,0,314.6,102,53.48,169.8,86,14.43,285.1,100,12.83,5.7,3,1.54,2
886,SC,100,510,0.113996,No,0,115.9,87,19.70,111.3,56,9.46,170.2,77,7.66,7.1,4,1.92,1
887,KS,132,415,0.113996,No,0,83.4,110,14.18,232.2,137,19.74,146.7,114,6.60,7.6,5,2.05,1


In [10]:
mte = MeanTargetEncoding()
mte.fit(train, y, features=["International plan"])
main_test = mte.transform(test)

In [11]:
test.head()

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls
2720,MI,158,415,No,No,0,195.9,103,33.3,89.1,95,7.57,302.2,82,13.6,10.3,3,2.78,1
451,KS,86,408,No,Yes,23,225.5,107,38.34,246.3,105,20.94,245.7,81,11.06,9.8,2,2.65,0
974,OR,21,510,No,Yes,31,135.9,90,23.1,271.0,84,23.04,179.1,89,8.06,9.5,7,2.57,6
175,NE,94,415,No,No,0,252.6,104,42.94,169.0,125,14.37,170.9,106,7.69,11.1,7,3.0,2
619,KS,110,415,Yes,No,0,293.3,79,49.86,188.5,90,16.02,266.9,91,12.01,14.5,4,3.92,0


In [12]:
main_test.head()

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls
0,MI,158,415,0.11401,No,0,195.9,103,33.3,89.1,95,7.57,302.2,82,13.6,10.3,3,2.78,1
1,KS,86,408,0.11401,Yes,23,225.5,107,38.34,246.3,105,20.94,245.7,81,11.06,9.8,2,2.65,0
2,OR,21,510,0.11401,Yes,31,135.9,90,23.1,271.0,84,23.04,179.1,89,8.06,9.5,7,2.57,6
3,NE,94,415,0.11401,No,0,252.6,104,42.94,169.0,125,14.37,170.9,106,7.69,11.1,7,3.0,2
4,KS,110,415,0.420182,No,0,293.3,79,49.86,188.5,90,16.02,266.9,91,12.01,14.5,4,3.92,0


In [13]:
# check results
new_val_dfs[2]["International plan"].value_counts()

0.114830    801
0.412301     87
Name: International plan, dtype: int64

In [14]:
temp = pd.concat([train, y_train], axis=1)

In [15]:
temp.head()

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls,Churn
672,IL,151,408,No,No,0,175.3,106,29.8,144.3,87,12.27,160.2,88,7.21,11.8,5,3.19,0,0
2465,IN,88,415,No,No,0,183.5,93,31.2,170.5,80,14.49,193.8,88,8.72,8.3,5,2.24,3,0
473,WV,72,510,No,Yes,33,96.6,59,16.42,315.4,98,26.81,163.3,117,7.35,6.2,4,1.67,4,1
2062,ME,140,415,No,No,0,159.1,104,27.05,269.8,106,22.93,220.4,116,9.92,10.3,4,2.78,1,0
2604,MD,106,415,No,No,0,208.3,89,35.41,169.4,67,14.4,102.0,90,4.59,15.9,4,4.29,3,0


In [16]:
temp.groupby(["International plan"])["Churn"].agg(["mean", "size"])

Unnamed: 0_level_0,mean,size
International plan,Unnamed: 1_level_1,Unnamed: 2_level_1
No,0.113882,2406
Yes,0.430769,260


In [17]:
y_train.mean()

0.1447861965491373

In [18]:
(
    (0.113882 * 2406 + 10*0.1447861965491373) / (10 + 2406),
    (0.430769 * 260 + 10*0.1447861965491373) / (10 + 260)
)

(0.11400991472081597, 0.4201770443166347)

In [19]:
main_test

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls
0,MI,158,415,0.114010,No,0,195.9,103,33.30,89.1,95,7.57,302.2,82,13.60,10.3,3,2.78,1
1,KS,86,408,0.114010,Yes,23,225.5,107,38.34,246.3,105,20.94,245.7,81,11.06,9.8,2,2.65,0
2,OR,21,510,0.114010,Yes,31,135.9,90,23.10,271.0,84,23.04,179.1,89,8.06,9.5,7,2.57,6
3,NE,94,415,0.114010,No,0,252.6,104,42.94,169.0,125,14.37,170.9,106,7.69,11.1,7,3.00,2
4,KS,110,415,0.420182,No,0,293.3,79,49.86,188.5,90,16.02,266.9,91,12.01,14.5,4,3.92,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
662,UT,91,510,0.114010,No,0,123.8,107,21.05,319.0,125,27.12,237.6,78,10.69,7.3,4,1.97,2
663,MS,76,408,0.114010,No,0,173.2,93,29.44,131.2,80,11.15,170.9,104,7.69,5.4,3,1.46,0
664,ID,105,408,0.114010,No,0,232.6,96,39.54,253.4,117,21.54,154.0,101,6.93,10.5,9,2.84,1
665,WI,102,408,0.114010,No,0,200.6,106,34.10,152.5,127,12.96,199.4,128,8.97,7.7,2,2.08,3


# Feature Interactions as Features

In [20]:
import xgbfir
import xgboost as xgb

In [21]:
data = pd.read_csv("../data/telecom_churn.csv")
y = data["Churn"].astype('int8')
data.drop(["Churn"], axis=1, inplace=True)

train_cols = [col for col in data.columns if data[col].dtype != 'O']

In [22]:
parameters = {
    #default
    'objective': 'reg:logistic',
    'eta': 0.1,
    'silent': 1,
    "nthread": -1,
    "random_seed": 1,
    "eval_metric": 'auc',
    
    # regularization parameters
    'max_leaves': 20,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    
    #lightgbm approach
    'tree_method': 'hist',
    'grow_policy': 'lossguide'
}

num_rounds = 10000

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
xgb_train = xgb.DMatrix(data[train_cols], y.values, feature_names=train_cols)

results = xgb.cv(
    parameters, 
    xgb_train, 
    num_rounds, 
    early_stopping_rounds=10,
    folds=skf, 
    verbose_eval=10
)

[0]	train-auc:0.832862+0.00472396	test-auc:0.807838+0.0305292
[10]	train-auc:0.920047+0.00706104	test-auc:0.875377+0.0158608
[20]	train-auc:0.9368+0.00316342	test-auc:0.880931+0.0104092
[30]	train-auc:0.954769+0.00407126	test-auc:0.882399+0.00904475


In [23]:
results.shape

(28, 4)

In [24]:
results.iloc[-1]

train-auc-mean    0.951424
train-auc-std     0.004248
test-auc-mean     0.885309
test-auc-std      0.009439
Name: 27, dtype: float64

In [25]:
model = xgb.train(parameters, xgb_train, num_boost_round=30)

In [26]:
xgbfir.saveXgbFI(
    model, 
    feature_names=train_cols, 
    OutputXlsxFile="xgbfir_importance.xlsx"
)

## Importance metrics

<img src="https://raw.githubusercontent.com/Far0n/xgbfi/master/doc/ScoresExample_small.png">

In [27]:
train.head()

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls
672,IL,151,408,No,No,0,175.3,106,29.8,144.3,87,12.27,160.2,88,7.21,11.8,5,3.19,0
2465,IN,88,415,No,No,0,183.5,93,31.2,170.5,80,14.49,193.8,88,8.72,8.3,5,2.24,3
473,WV,72,510,No,Yes,33,96.6,59,16.42,315.4,98,26.81,163.3,117,7.35,6.2,4,1.67,4
2062,ME,140,415,No,No,0,159.1,104,27.05,269.8,106,22.93,220.4,116,9.92,10.3,4,2.78,1
2604,MD,106,415,No,No,0,208.3,89,35.41,169.4,67,14.4,102.0,90,4.59,15.9,4,4.29,3


In [28]:
data["Customer service calls|Total day minutes"] = data["Total day minutes"] / data["Customer service calls"]

In [29]:
train_cols = [col for col in data.columns if data[col].dtype != 'O']

In [30]:
parameters = {
    #default
    'objective': 'reg:logistic',
    'eta': 0.1,
    'silent': 1,
    "nthread": -1,
    "random_seed": 1,
    "eval_metric": 'auc',
    
    # regularization parameters
    'max_leaves': 20,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    
    #lightgbm approach
    'tree_method': 'hist',
    'grow_policy': 'lossguide'
}

xgb_train = xgb.DMatrix(data[train_cols], y.values, feature_names=train_cols)

results = xgb.cv(
    parameters, 
    xgb_train, 
    num_rounds, 
    early_stopping_rounds=10,
    folds=skf, 
    verbose_eval=10
)

[0]	train-auc:0.833238+0.00565979	test-auc:0.80842+0.0310423
[10]	train-auc:0.911742+0.00463087	test-auc:0.868252+0.0157973
[20]	train-auc:0.934649+0.00234912	test-auc:0.875449+0.00966463
[30]	train-auc:0.955025+0.00255232	test-auc:0.876259+0.00757311
[40]	train-auc:0.968677+0.00315412	test-auc:0.873621+0.00842852


In [31]:
results.tail()

Unnamed: 0,train-auc-mean,train-auc-std,test-auc-mean,test-auc-std
33,0.959399,0.003356,0.876826,0.007256
34,0.960953,0.003613,0.876782,0.007719
35,0.962097,0.004289,0.876466,0.008278
36,0.963978,0.003827,0.877872,0.008246
37,0.964964,0.003399,0.878796,0.008406
