New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Target Encoding with Hierarchical Structure Smoothing #136
Comments
Can you write a PR? Just an idea: it could also work on domains like |
From a design perspective, you probably want a general way to specify the hierarchical relationships between levels, and then convenience functions that produce this relationship structure for common cases. That is, you'd still want to be able to use it even if your levels are random numbers with no meaning but for which you have a hierarchy. You could also use say a clustering algorithm to infer a hierarchy for variables that don't have one, or just come up with one manually. Since hierarchy means tree / forest, I'd suggest specifying the hierarchical relationships as a Series or dict mapping levels to their parents, and note that entries will not all be levels themselves: for a column of zipcodes, the parents might be counties or states or regions, which don't show up in the column. |
@jkleint I think I need examples. Are we talking about branching based on the values? E.g.: Postal addresses. Some countries (like the USA) divide to countries. But other countries (like Switzerland) divide directly into cantons... And there can be (but doesn't have to be) a bigger difference between countries than between cantons. Hence, a support for branching based on the values makes sense. Or about a simple and fixed (value-independent) hierarchy like: Or about some other example? |
Let's say I have categorical values whose names are meaningless:
But I have some external knowledge that tells me they nest into a hierarchy like so:
Then I need a way to specify which levels (and groups) go to which groups:
Hey that looks a lot like a dict (or Series):
And that's what I'm proposing to pass to Then you can have a convenience function
I'm just proposing |
That sounds reasonable. Just note that when we get 'tim@example.com' during the scoring time, we will have to look up keys for values, because 'tim' was not observed during the training time but 'example' was. Also, we may get a clash of keys. For example, we may observe 'bob@example.com' and 'admin@bob.com' during the training time. |
@jkleint @janmotl personally, I don't feel this is the correct way to do this. I think we should be treating a single column in an array or dataframe as the sole data source of categories we wish to encode. Anything beyond this, IMHO, becomes feature engineering, not categorical encoding (ignoring the fact that some people might consider ce part of fe...). I think we should have an option of splitting the string either by single characters (e.g. zip code 45243 -> [4, 5, 2, 4, 3]) or by delimiters (e.g. email bob@example.com, sep=['.'] -> ['example', 'com']). We can then also give the choice of an order of If someone truly wishes to achieve more complex encodings, they could simply create a new |
Another option: If TargetEncoder observes a column of dtype==list, it will treat the column as hierarchical with the first item being the top most level. And it would be up to the user to split (and potentially reverse) the input string (or whatever data type it was originally). @JoshuaC3 I propose to first implement CountEncoder from #137. Then you may add |
@janmotl I have started this. Yes, I see that |
I agree we don't want to be in the business of a million custom hierarchy-generating functions. It's fine to have a function that makes hierarchies from delimited strings. But do we really want that to be the interface to I would hope you agree that not everyone's hierarchies are simple string prefixes... or even strings. That creating those convenience functions is an inexhaustible task is exactly why we don't want to specify hierarchies that way: everyone's hierarchies are different. But you don't want to limit them or force them to jam them into strings (that we will just have to parse out again). I agree we don't want to try to accommodate them all individually, but we do want a general mechanism that anyone can use to specify their own. We might offer a few common ones, but you need to be able to handle any hierarchy. The general structure of a hierarchy is a forest, and the simple way to specify a forest is an edge list: a map of nodes to their parents. And the natural implementation of a map is a dict or Series. I like Jan's idea of a column of lists for hierarchies, but it requires the user to preprocess the data, it's not very efficient, and again we'd just have to parse out the edge list. Specifying hierarchies directly as mappings, and passing a parameter that gives the hierarchy for each column is clean and general, and composes nicely with convenience functions. Decomposing this problem into creating hierarchies and encoding hierarchies is a clear design win here: it's less code and will save many headaches down the road. The stronger design point is separating target coding from parsing things into hierarchies: these are totally orthogonal. You really want to have zipcode parsing functions in a Let's keep the signaling "out-of-band," the metadata separate from the data: leave the data as is, specify hierarchy as metadata. Keep separate things separate. Here's the relevant principle from The Pragmatic Programmer:
Let's separate making hierarchies from encoding them. Concretely, here's what that would look like:
And then it's really easy to apply new/different/custom hierarchies to other columns or data:
This doesn't require the user to pre-process the data, composes nicely, and is efficient. And it can handle hierarchies that don't have a nice functional form, but are just data. A general |
Ok. @jkleint has the longest comment. He implements the hierarchical processing for TargetEncoder. @JoshuaC3 implements the CountEncoder. @jkleint Be careful about deep hierarchies. E.g.: If each level is represented by a single char, then the memory requirements of the dictionary representation grows quadratically with the depth (ignoring the dictionary implementation details):
Hence, please, consider implementing:
But I am leaving it up to you. |
Hi @JoshuaC3 @jkleint I've actually been in touch with Daniele Micci-Barreca @miccibarreca, the author of the paper, and he would also like to see an implementation of the hierarchical processing. He is also very eager to offer any help with the theoretical part (he does not consider himself a python coder). |
My new handle is actually @dmiccibarreca…looking forward to collaborate with whoever picks this up. |
From section 4 of the paper sited in TargetEncoding.
In other words, if we have a single zipcode 54321 but 100 zipcodes 54322 and 100 zipcodes 54323, we could use the mean of zipcode level 4 5432X as our mean smoothing term for zipcode 54321, instead of the mean for all zipcodes XXXXX.
This would be a really nice additional piece of functionality to add as an encoder.
The text was updated successfully, but these errors were encountered: