## Mean Encoding or Target Encoding

Mean encoding, also called target encoding, replaces each category with the mean of the target for that category. For example, suppose the variable city has the categories London, Manchester and Bristol, and we want to predict the default rate. If the default rate for London is 30%, we replace London with 0.3; if the default rate for Manchester is 20%, we replace Manchester with 0.2; and so on.
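To make the idea concrete, here is a minimal sketch in plain pandas. The toy data frame and its values are made up for illustration (they are not the course dataset), and the column names `city` and `default` are assumptions chosen to match the example above.

```python
import pandas as pd

# Hypothetical toy data: 1 = customer defaulted, 0 = did not default
df = pd.DataFrame(
    {
        "city": ["London"] * 10 + ["Manchester"] * 10,
        "default": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # London: 3 of 10 default
        + [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],  # Manchester: 2 of 10 default
    }
)

# mean of the target per category
city_means = df.groupby("city")["default"].mean()

# replace each category with its target mean
df["city_encoded"] = df["city"].map(city_means)

print(city_means)
```

With this toy data, London is replaced by 0.3 and Manchester by 0.2, as in the description above.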


## In this demo:

We will see how to perform mean (target) encoding with Category Encoders using the Titanic dataset.

For guidelines to obtain the dataset, visit **section 2** of the course.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

from category_encoders.target_encoder import TargetEncoder

In [2]:
# load dataset

data = pd.read_csv(
    "../../titanic.csv", usecols=["cabin", "sex", "embarked", "survived"]
)

data.head()

|   | survived | sex | cabin | embarked |
|---|---|---|---|---|
| 0 | 1 | female | B5 | S |
| 1 | 1 | male | C22 | S |
| 2 | 0 | female | C22 | S |
| 3 | 0 | male | C22 | S |
| 4 | 0 | female | C22 | S |


In [3]:
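# Now we extract the first letter of the cabin
# to create a simpler variable for the demo
# (note: astype(str) turns missing cabins into the string "nan",
# so they end up in their own category, "n")

data["cabin"] = data["cabin"].astype(str).str[0]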

In [4]:
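# let's fill the remaining missing values with the label "Missing"
# (after the previous step, only embarked should still contain NaN)

data.fillna("Missing", inplace=True)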

### Important

We calculate the target mean per category using the train set only, and then apply those mappings to both the train and the test set. This keeps information from the test set from leaking into the encoding.
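The procedure can be sketched by hand with pandas. The small train and test frames below are hypothetical, and filling unseen categories with the overall train mean is one common choice that I am assuming here, not something prescribed by the text above.

```python
import pandas as pd

# Hypothetical toy train and test sets
train = pd.DataFrame(
    {
        "embarked": ["S", "S", "C", "C", "Q"],
        "survived": [0, 1, 1, 1, 0],
    }
)
test = pd.DataFrame({"embarked": ["S", "C", "X"]})  # "X" never appears in train

# learn the mapping from the train set only
mapping = train.groupby("embarked")["survived"].mean()
global_mean = train["survived"].mean()

# apply the learned mapping to the test set;
# categories not seen in train fall back to the overall train mean (assumption)
test_encoded = test["embarked"].map(mapping).fillna(global_mean)

print(test_encoded.tolist())
```

Note that the test set target is never used: the test set is only transformed with values learned from the train set.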

In [5]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data[["cabin", "sex", "embarked"]],  # predictors
    data["survived"],  # target
    test_size=0.3,  # proportion of obs in test set
    random_state=0,  # seed to ensure reproducibility
)

X_train.shape, X_test.shape

((916, 3), (393, 3))

In [6]:
mean_enc = TargetEncoder(
    cols=["cabin", "sex", "embarked"],
    smoothing=10,
)
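The `smoothing` argument controls how strongly the mean of a category is pulled towards the overall target mean, which matters most for categories with few observations. The exact formula used by Category Encoders depends on the installed version, so the sketch below is a generic additive-smoothing illustration and not the library's implementation. The function `smoothed_mean` and its parameter `m` are hypothetical names introduced for this example.

```python
def smoothed_mean(cat_sum, cat_count, global_mean, m):
    """Blend a category's target mean with the global mean.

    cat_sum: sum of the target within the category
    cat_count: number of observations in the category
    global_mean: target mean over the whole train set
    m: smoothing strength (hypothetical parameter; larger = closer to global mean)
    """
    return (cat_sum + m * global_mean) / (cat_count + m)


# a rare category: 2 observations, both positive, raw mean = 1.0
rare = smoothed_mean(cat_sum=2, cat_count=2, global_mean=0.4, m=10)

# a frequent category: 200 observations, all positive, raw mean = 1.0
frequent = smoothed_mean(cat_sum=200, cat_count=200, global_mean=0.4, m=10)

print(rare, frequent)
```

In this sketch the rare category moves from 1.0 to 0.5, while the frequent category stays close to its raw mean, which is the general behaviour smoothing is meant to produce.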

In [7]:
# when fitting the transformer, we need to pass the target as well
# just like with any Scikit-learn predictor class

mean_enc.fit(X_train, y_train)

In [8]:
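# in the mapping we see the (smoothed) target mean assigned to each
# category for each of the selected variables; categories are
# identified by integer codes, and the codes -1 and -2 appear to be
# reserved for unknown and missing categories, which here receive
# the overall target mean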

mean_enc.mapping

{'cabin': cabin
  1    0.304843
  2    0.641581
  3    0.562302
  4    0.641581
  5    0.724345
  6    0.446669
  7    0.491572
  8    0.335231
  9    0.404627
 -1    0.385371
 -2    0.385371
 dtype: float64,
 'sex': sex
  1    0.728358
  2    0.187608
 -1    0.385371
 -2    0.385371
 dtype: float64,
 'embarked': embarked
  1    0.338957
  2    0.553073
  3    0.373516
  4    0.472557
 -1    0.385371
 -2    0.385371
 dtype: float64}

In [9]:
# this is the list of variables that the encoder will transform

mean_enc.cols

['cabin', 'sex', 'embarked']

In [10]:
X_train = mean_enc.transform(X_train)
X_test = mean_enc.transform(X_test)

# let's explore the result
X_train.head()

|   | cabin | sex | embarked |
|---|---|---|---|
| 501 | 0.304843 | 0.728358 | 0.338957 |
| 588 | 0.304843 | 0.728358 | 0.338957 |
| 402 | 0.304843 | 0.728358 | 0.553073 |
| 1193 | 0.304843 | 0.187608 | 0.373516 |
| 686 | 0.304843 | 0.728358 | 0.373516 |


**Note**

If the argument `cols` is left as `None`, the encoder will automatically identify all categorical variables. Is that not sweet?