In [1]:
import pandas as pd
from category_encoders import LeaveOneOutEncoder
import seaborn as sns

# Load data and get just a subset
iris = sns.load_dataset("iris")
iris = iris.sample(20, random_state=1969)

# Manual train test split to get one of each class in test
test_idxs = [137, 55, 33]
train_idxs = [i for i in iris.index if i not in test_idxs]

train = iris.loc[train_idxs]
test = iris.loc[test_idxs]

X_train = train.drop(columns=["sepal_length"])
X_test = test.drop(columns=["sepal_length"])
y_train = train["sepal_length"]
y_test = test["sepal_length"]

In this example, we want to predict a flower's `'sepal_length'`, using its `'species'` as a predictor.  This predictor is currently a category stored as a string, so we need to encode it numerically before a model can use it.

Eventually we'll leave one out encode this variable.  However, leave one out encoding is a flavor of target encoding, so let's start with target encoding.

*Note: We're going to target encode using the plain mean of the response.  `category_encoders.TargetEncoder()` uses a slightly fancier methodology than the mean, so you'll get slightly different results from it than from our manual process here.  See its documentation for the details.*

With our target encoding we'll focus on 2 pieces of information:

1. The column we want to encode (`'species'`)
2. The target (`'sepal_length'`)

With this information, we want to find the mean of the response for each category of the variable we want to encode.  So here, we find the mean `'sepal_length'` for each `'species'`.

In [2]:
cat_target = train[["species", "sepal_length"]]
cat_target.groupby("species").mean()

species,sepal_length
setosa,4.8
versicolor,6.3
virginica,6.566667


We would now use these values to replace the categories in the `'species'` column.  So everywhere the `'species'` is `'setosa'` we would put `4.8`, everywhere the `'species'` is `'versicolor'` we would put `6.3`, etc.

Our original categorical column had 3 values: `['setosa', 'versicolor', 'virginica']`.

After encoding, our column will have 3 new values: `[4.8000, 6.3000, 6.5667]`.

When we want to encode our testing data or other new observations, we will follow the same rules: 

```
* 'setosa'     --> 4.8000
* 'versicolor' --> 6.3000
* 'virginica'  --> 6.5667
```
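
The lookup rules above can be sketched in a few lines of pandas. This is only an illustration of plain mean target encoding: the `toy_train` and `toy_test` frames and their values are made up for the example (they are not the iris sample), and `category_means` is a name introduced here.

```python
import pandas as pd

# Hypothetical toy data, NOT the iris sample used in this notebook
toy_train = pd.DataFrame(
    {
        "species": ["setosa", "setosa", "versicolor", "versicolor", "virginica"],
        "sepal_length": [4.6, 5.0, 6.0, 6.6, 6.5],
    }
)
toy_test = pd.DataFrame({"species": ["virginica", "setosa", "versicolor"]})

# "Fit": learn one mean per category, using the training data only
category_means = toy_train.groupby("species")["sepal_length"].mean()

# "Transform": replace each category with the mean learned for it
toy_test_encoded = toy_test.assign(species=toy_test["species"].map(category_means))
print(toy_test_encoded)
```

With this approach, a category that never appeared in the training data would map to `NaN`, so a real pipeline needs some fallback for unseen categories.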

----

Now let's look at the bit of flavor that leave one out encoding adds to this process.  First, what stays the same: leave one out encoding uses exactly the same process for encoding testing data and new observations.  We still use the mean response for each category, computed on the training data, to encode new data.

In [3]:
print("pre encoding")
display(X_test)

print("\npost encoding")
encoder = LeaveOneOutEncoder(cols=["species"])
encoder.fit(X_train, y_train)
encoder.transform(X_test)

pre encoding


,sepal_width,petal_length,petal_width,species
137,3.1,5.5,1.8,virginica
55,2.8,4.5,1.3,versicolor
33,4.2,1.4,0.2,setosa



post encoding


,sepal_width,petal_length,petal_width,species
137,3.1,5.5,1.8,6.566667
55,2.8,4.5,1.3,6.3
33,4.2,1.4,0.2,4.8


The difference between leave one out and target encoding with the mean is how we calculate the values for the training data.  For this, let's filter to `'virginica'`.

In [4]:
train_virginica = train[train["species"] == "virginica"]
train_virginica

,sepal_length,sepal_width,petal_length,petal_width,species
119,6.0,2.2,5.0,1.5,virginica
149,5.9,3.0,5.1,1.8,virginica
148,6.2,3.4,5.4,2.3,virginica
102,7.1,3.0,5.9,2.1,virginica
111,6.4,2.7,5.3,1.9,virginica
107,7.3,2.9,6.3,1.8,virginica
128,6.4,2.8,5.6,2.1,virginica
130,7.4,2.8,6.1,1.9,virginica
115,6.4,3.2,5.3,2.3,virginica


Now we consider 1 row at a time, starting with the one labeled `119`.  To encode this row, we take the mean of the response over all the other `'virginica'` rows in the training data; we leave out `119` when calculating this number.  Below, we filter out the first row (the row labeled `119`) and then take the mean of the `'sepal_length'` column (our target).  We end up with `6.6375`, and this is the value we'll use to encode the `'species'` column for row `119`.  This process is repeated for every row in the training data.

Overall process:
1. Locate the row of interest and note the value of the category column to be encoded
2. Remove the row of interest
3. Filter to rows with the same category as the row of interest
4. Take the mean of the target column
5. Replace the category in the row of interest with this calculated mean
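
Written as a formula, the steps above amount to the following. The symbols are notation introduced here for illustration: $y_j$ is the target of training row $j$, $c_j$ is its category, and $n_{c}$ is the number of training rows in category $c$.

$$
\text{enc}_i \;=\; \frac{\sum_{j \neq i,\; c_j = c_i} y_j}{n_{c_i} - 1}
\;=\; \frac{\left(\sum_{j:\, c_j = c_i} y_j\right) - y_i}{n_{c_i} - 1}
$$

For row `119`, the nine `'virginica'` training rows have sepal lengths summing to $59.1$, and the row's own value is $6.0$, so

$$
\text{enc}_{119} = \frac{59.1 - 6.0}{9 - 1} = \frac{53.1}{8} = 6.6375
$$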

In [5]:
# Leave out the row labeled 119 (the first 'virginica' row)
leave_out_119 = train_virginica.drop(index=119)

# Find mean of target over the remaining rows
cat_target = leave_out_119[["species", "sepal_length"]]
cat_target.groupby("species").mean()

species,sepal_length
virginica,6.6375


In [6]:
print("pre encoding")
display(X_train.head())

print("\npost encoding")
encoder = LeaveOneOutEncoder(cols=["species"])
encoder.fit_transform(X_train, y_train).head()

pre encoding


,sepal_width,petal_length,petal_width,species
119,2.2,5.0,1.5,virginica
149,3.0,5.1,1.8,virginica
148,3.4,5.4,2.3,virginica
97,2.9,4.3,1.3,versicolor
102,3.0,5.9,2.1,virginica



post encoding


,sepal_width,petal_length,petal_width,species
119,2.2,5.0,1.5,6.6375
149,3.0,5.1,1.8,6.65
148,3.4,5.4,2.3,6.6125
97,2.9,4.3,1.3,6.333333
102,3.0,5.9,2.1,6.5
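
The encoded `'virginica'` values above can be reproduced by hand without looping over rows. The sketch below is a minimal illustration of the leave one out idea, not a description of how `category_encoders` is implemented internally. The sepal lengths are copied from the `'virginica'` training rows shown earlier; the names `virginica`, `category_sum`, `category_count`, and `loo` are introduced here.

```python
import pandas as pd

# sepal_length values copied from the 'virginica' training rows shown earlier
virginica = pd.DataFrame(
    {
        "species": ["virginica"] * 9,
        "sepal_length": [6.0, 5.9, 6.2, 7.1, 6.4, 7.3, 6.4, 7.4, 6.4],
    },
    index=[119, 149, 148, 102, 111, 107, 128, 130, 115],
)

# Per-row copies of each category's total and row count
grouped = virginica.groupby("species")["sepal_length"]
category_sum = grouped.transform("sum")
category_count = grouped.transform("count")

# Leave one out mean: remove the row's own target from its category total,
# then divide by the number of remaining rows in that category
loo = (category_sum - virginica["sepal_length"]) / (category_count - 1)
print(loo.round(4))
```

Subtracting each row's own target from the category total avoids recomputing a filtered mean for every row. In this sketch, a category with a single training row would produce `NaN` (zero divided by zero), so that case needs separate handling.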
