Question: Difference between TargetEncoder and LeaveOneOutEncoder #167
Comments
It is best to look at some references:
There are two differences. Assuming binary classification:

1. TargetEncoder does not exclude the current row; it blends each category's target mean with the global prior, controlled by its smoothing parameters.
2. LeaveOneOutEncoder excludes the current row's target when transforming the training data, and can optionally add Gaussian noise (controlled by sigma) to further reduce overfitting.
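The two ideas can be sketched in plain Python. This is an illustrative sketch of the concepts, not the library's exact formulas; the function names and the logistic smoothing weight are assumptions for illustration:

```python
import math
from collections import defaultdict

def smoothed_target_encode(categories, targets, smoothing=1.0, min_samples_leaf=1):
    # Target-encoding idea: blend each category's target mean with the
    # global prior; the more samples a category has, the more weight its
    # own mean receives. (Hypothetical sketch, not the library formula.)
    prior = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    out = []
    for c in categories:
        n = counts[c]
        weight = 1.0 / (1.0 + math.exp(-(n - min_samples_leaf) / smoothing))
        out.append(weight * (sums[c] / n) + (1.0 - weight) * prior)
    return out

def leave_one_out_encode(categories, targets):
    # Leave-one-out idea: each *training* row gets the mean target of its
    # category computed without the row itself; singleton categories fall
    # back to the prior.
    prior = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return [(sums[c] - t) / (counts[c] - 1) if counts[c] > 1 else prior
            for c, t in zip(categories, targets)]
```

For example, `leave_one_out_encode([1, 1, 1, 0], [1, 0, 1, 0])` gives `[0.5, 1.0, 0.5, 0.5]`: each row of category 1 is encoded with the mean of the *other* two rows' targets.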
The references and the documentation in the code should possibly be updated. Feel free to submit a pull request.
Thanks for the quick reply, will do. The documentation explicitly says that TargetEncoder uses leave-one-out. So that's wrong?
The documentation for TargetEncoder is wrong (likely because of copy-paste refactoring).
Wait, transform takes a 'y'?
Yes, that's an incompatibility with scikit-learn. LeaveOneOut needs 'y' in order to transform the training data correctly. Of course, the encoder could remember the training 'y', but then the trained encoder would be large even in deployment...
Large meaning number of categories times number of classes, right? That doesn't seem so bad. What does transform do if you only have a single test example? Usually scikit-learn assumes that the test examples are independent, and so running them through one-by-one should give the same result.
Another workaround could be that 'fit()' would return the transformed training set. But I am not sure that it would improve compatibility with scikit-learn.
The encoders adhere to this logic as well. Leave-one-out is applied only on the training data, in order to decrease the overfitting of the model when we observe just a few samples for each category. Leave-one-out is not applied on the test set. First, we generally do not have the target for the test set. Second, even if we had the target, it would not decrease the amount of overfitting, but it would still increase the error.
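The train/test split described above can be sketched as a tiny transformer. This is a hypothetical class written for illustration, not the category_encoders implementation: leave-one-out is applied only where 'y' is available (the training data), while test data gets the exact per-category means:

```python
from collections import defaultdict

class LeaveOneOutSketch:
    """Minimal sketch of the behavior described above (hypothetical)."""

    def fit(self, xs, ys):
        # Remember per-category sums and counts, plus the global prior.
        self._sum = defaultdict(float)
        self._cnt = defaultdict(int)
        for x, y in zip(xs, ys):
            self._sum[x] += y
            self._cnt[x] += 1
        self._prior = sum(ys) / len(ys)
        return self

    def fit_transform(self, xs, ys):
        # Training data: exclude the current row's target (leave-one-out).
        self.fit(xs, ys)
        out = []
        for x, y in zip(xs, ys):
            if self._cnt[x] > 1:
                out.append((self._sum[x] - y) / (self._cnt[x] - 1))
            else:
                out.append(self._prior)  # singleton category: use the prior
        return out

    def transform(self, xs):
        # Test data: exact per-category means, no exclusion; unseen
        # categories fall back to the prior.
        return [self._sum[x] / self._cnt[x] if self._cnt[x] else self._prior
                for x in xs]
```

Note that `fit_transform(X, y)` and `fit(X, y).transform(X)` deliberately give different results here, which is exactly the scikit-learn compatibility question raised in this thread.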
Wait, so how do you distinguish between training and test set in transform? In sklearn I think we're slowly going in the direction of allowing fit_transform to behave differently from fit().transform().
Correct.
In our case, fit_transform(X, y) applies leave-one-out, while transform(X) returns the plain category means.
Hello!
Well, except it doesn't exclude anything. Running the encoder on three rows of category 1 gives the same encoded value for every row, while excluding each row from the calculation should give me 5, 4 and 2.5 respectively.
I think the documentation should be more clear about it. LeaveOneOut excludes the current row only in fit_transform(). The idea is that the leave-one-out estimate is used only for training of the downstream model, in order to decrease overfitting of the downstream model. For scoring, we use as exact estimates as we can get. If you come up with a concrete proposal for how to change the documentation, I am happy to do it.
FYI the difference between
Thank you for your answers. By perusing the other threads I had managed to figure this out. I have no clue about the documentation. Perhaps one example of classic target encoding should be followed by a leave-one-out version using the same example as the slides from Owen?
It's not really clear to me what the difference between TargetEncoder and LeaveOneOutEncoder is, as both encode using the target with leave-one-out. Can you maybe explain, and clarify this in the docs?
Does either work for multi-class classification?