Question: Difference between TargetEncoder and LeaveOneOutEncoder #167

Closed
amueller opened this issue Feb 5, 2019 · 13 comments


amueller commented Feb 5, 2019

It's not really clear to me what the difference between TargetEncoder and LeaveOneOutEncoder is, as both encode using the target with leave-one-out. Could you clarify, and also clarify this in the docs?
Does either work for multi-class classification?


janmotl commented Feb 6, 2019

It is best to look at some references:

  1. http://dx.doi.org/10.1145/507533.507538
  2. https://pkghosh.wordpress.com/2018/06/18/leave-one-out-encoding-for-categorical-feature-variables-on-spark/

There are two differences. Assuming binary classification:

  1. TargetEncoder returns a weighted average of p(y|x) and p(y). LeaveOneOut does not calculate the average; it just returns an estimate of p(y|x).
  2. LeaveOneOut performs leave-one-out estimation of p(y|x): it excludes the current row from the estimate. TargetEncoder does not do that; it uses even the current row.
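
To make both differences concrete, here is a minimal hand-rolled sketch in plain pandas (the smoothing weight `k` and the toy data are illustrative assumptions, not the library's exact formula):

```python
import pandas as pd

# Toy binary-target data.
df = pd.DataFrame({"cat": ["a", "a", "a", "b", "b"],
                   "y":   [1, 0, 1, 1, 0]})

prior = df["y"].mean()                                # p(y)
stats = df.groupby("cat")["y"].agg(["sum", "count"])  # per-category sums and counts

# Difference 1 -- target-style encoding: a weighted average of the category
# mean p(y|x) and the prior p(y); k controls the weight of the prior.
k = 2.0
target_enc = (stats["sum"] + k * prior) / (stats["count"] + k)
print(df["cat"].map(target_enc))

# Difference 2 -- leave-one-out on the training rows: the category mean
# computed with each row's own target excluded.
loo = (df["cat"].map(stats["sum"]) - df["y"]) / (df["cat"].map(stats["count"]) - 1)
print(loo)
```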

The references and the documentation in the code should possibly be updated. Feel free to submit a pull request.


amueller commented Feb 6, 2019

Thanks for the quick reply; will do. The documentation explicitly says that TargetEncoder uses leave-one-out. So that's wrong?


janmotl commented Feb 6, 2019

The documentation for TargetEncoder is wrong (likely a copy-paste leftover from refactoring).
Proof: 'enc.transform(X)' and 'enc.transform(X, y)' give the same result in TargetEncoder. On the other hand, if we used LeaveOneOut, we would get different results.
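
A quick way to check this claim (a sketch assuming the category_encoders API discussed in this thread; the toy data is made up):

```python
import pandas as pd
import category_encoders as ce

X = pd.DataFrame({"cat": ["a", "b", "a", "b"]})
y = pd.Series([1, 0, 0, 1])

te = ce.TargetEncoder(cols=["cat"]).fit(X, y)
# TargetEncoder ignores the target passed at transform time:
print(te.transform(X).equals(te.transform(X, y)))    # True

loo = ce.LeaveOneOutEncoder(cols=["cat"]).fit(X, y)
# LeaveOneOutEncoder uses y to exclude each row's own target:
print(loo.transform(X).equals(loo.transform(X, y)))  # False
```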


amueller commented Feb 6, 2019

Wait, transform takes a y? That's not the scikit-learn API... but I guess that's a different issue.


janmotl commented Feb 6, 2019

Yes, that's an incompatibility with scikit-learn. LeaveOneOut needs 'y' in order to transform the training data correctly. Of course, the encoder could remember the training 'y', but then the trained encoder would be large even in deployment...


amueller commented Feb 6, 2019

Large meaning number of categories times number of classes, right? That doesn't seem so bad. What does transform do if you only have a single test example? Usually scikit-learn assumes that the test examples are independent, so running them through one by one should give the same result.


janmotl commented Feb 7, 2019

> Large meaning number of categories times number of classes, right?

Large in the sense that we have to remember the whole 'y', because we have to know the target value for each training sample.

Another workaround could be for 'fit()' to return the transformed training set. But I am not sure that would improve compatibility with scikit-learn.

> What does transform do if you only have a single test example? Usually scikit-learn assumes that the test examples are independent, so running them through one by one should give the same result.

The encoders adhere to this logic as well. Leave-one-out is applied only on the training data, in order to decrease the overfitting of the model when we observe just a few samples per category. Leave-one-out is not applied on the test set. First, we generally do not have the target for the test set. Second, even if we had the target, it would not decrease the amount of overfitting; it would only increase the error.
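
A small sketch of that behavior, assuming the API above: test rows are encoded from the stored per-category statistics, so one-by-one and batch transforms agree.

```python
import pandas as pd
import category_encoders as ce

X_train = pd.DataFrame({"cat": ["a", "a", "b", "b"]})
y_train = pd.Series([1, 0, 1, 1])

enc = ce.LeaveOneOutEncoder(cols=["cat"]).fit(X_train, y_train)

# Test rows are encoded independently from the stored per-category means,
# so transforming them one by one matches transforming them in a batch.
X_test = pd.DataFrame({"cat": ["a", "b"]})
batch = enc.transform(X_test)
one_by_one = pd.concat([enc.transform(X_test.iloc[[i]]) for i in range(len(X_test))])
print(batch.equals(one_by_one))  # True
```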


amueller commented Feb 7, 2019

Wait, so how do you distinguish between training and test set for transform?
You have the y for training and not for the test set?

In sklearn I think we're slowly going in the direction of allowing fit_transform to do something other than fit().transform(), where fit_transform is for transforming the training set.


janmotl commented Feb 7, 2019

> Wait, so how do you distinguish between training and test set for transform?
> You have the y for training and not for the test set?

Correct.

> In sklearn I think we're slowly going in the direction of allowing fit_transform to do something other than fit().transform(), where fit_transform is for transforming the training set.

In our case, fit_transform returns self.fit(X, y, **fit_params).transform(X, y). So it is intended for the training set.
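
In other words (a minimal sketch on made-up data):

```python
import pandas as pd
import category_encoders as ce

X = pd.DataFrame({"cat": ["a", "b", "a", "b"]})
y = pd.Series([1, 0, 1, 0])

# Both paths pass y through to transform, so both apply leave-one-out:
a = ce.LeaveOneOutEncoder(cols=["cat"]).fit_transform(X, y)
b = ce.LeaveOneOutEncoder(cols=["cat"]).fit(X, y).transform(X, y)
print(a.equals(b))  # True -- fit_transform is the training-set path
```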


bdubreu-adeo commented Jun 3, 2019

Hello!

> LeaveOneOut performs leave-one-out estimation of p(y|x): it excludes the current row from the estimate. TargetEncoder does not do that; it uses even the current row.

Well, except it doesn't exclude anything:

```python
import pandas as pd
import category_encoders as ce

liste1 = ['a', 'b', 'a', 'b', 'a', 'b']
liste2 = [1, 2, 1, 4, 1, 6]
# Build the frame from a dict: np.array([liste1, liste2]).transpose()
# would upcast the targets to strings.
df = pd.DataFrame({'category': liste1, 'target': liste2})
df
```

gives this:

```
  category  target
0        a       1
1        b       2
2        a       1
3        b       4
4        a       1
5        b       6
```

```python
# sigma is a constructor parameter, not a fit() argument:
encoder = ce.LeaveOneOutEncoder(cols=['category'], sigma=0.05, return_df=True)
encoder.fit(df['category'], df['target'])
test = encoder.transform(df['category'])
test
```

```
   category
0       1.0
1       4.0
2       1.0
3       4.0
4       1.0
5       4.0
```

Excluding rows from the calculation should give me:

1 5 1 4 1 3

instead of 1 4 1 4 1 4, which is just the mean of the target for groups a and b, not excluding any rows...


janmotl commented Jun 3, 2019

I think the documentation should be clearer about this. LeaveOneOut excludes the current row only in the fit_transform(X, y) method. When transform(X2) is used, no exclusion is performed (as the count of rows in X2 can be different from the count of rows in y... we do not have any other choice).

The idea is that the leave-one-out estimate is used only for training the downstream model, in order to decrease its overfitting. For scoring, we use estimates that are as exact as we can get.
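
For the data above, a minimal sketch of both paths (sigma omitted so the result is deterministic):

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'category': ['a', 'b', 'a', 'b', 'a', 'b'],
                   'target':   [1, 2, 1, 4, 1, 6]})

enc = ce.LeaveOneOutEncoder(cols=['category'])
# Training path: each row's own target is excluded from its category mean.
print(enc.fit_transform(df[['category']], df['target']))
# expected encodings: 1, 5, 1, 4, 1, 3
# Scoring path: no y, so the plain per-category means are returned.
print(enc.transform(df[['category']]))
# expected encodings: 1, 4, 1, 4, 1, 4
```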

If you come up with a concrete proposal for how to change the documentation, I am happy to apply it.


amueller commented Jun 3, 2019

FYI, the difference between .fit().transform() and fit_transform() is something we have discussed in sklearn but haven't reached consensus on yet. It's surprising behavior for the user because it violates one of the sklearn API contracts, but it also makes a lot of sense here, and it's hard to come up with a better API.

bdubreu-adeo commented

Thank you for your answers. By perusing the other threads I had managed to figure this out. I have no clue about the documentation. Perhaps an example of classic target encoding should be followed by a leave-one-out version, using the same example as the slides from Owen?
Anyway, thank you for your work!
