Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Target Encoding with Hierarchical Structure Smoothing #136

Closed
JoshuaC3 opened this issue Oct 16, 2018 · 13 comments · Fixed by #366
Closed

Implement Target Encoding with Hierarchical Structure Smoothing #136

JoshuaC3 opened this issue Oct 16, 2018 · 13 comments · Fixed by #366

Comments

@JoshuaC3
Copy link
Contributor

JoshuaC3 commented Oct 16, 2018

From section 4 of the paper sited in TargetEncoding.

Instead of choosing the prior probability of the target as the null hypothesis, it is reasonable to replace it with the estimated probability at the next higher level of aggregation in the attribute hierarchy

In other words, if we have a single zipcode 54321 but 100 zipcodes 54322 and 100 zipcodes 54323, we could use the mean of zipcode level 4 5432X as our mean smoothing term for zipcode 54321, instead of the mean for all zipcodes XXXXX.

This would be a really nice additional piece of functionality to add as an encoder.

@janmotl
Copy link
Collaborator

janmotl commented Oct 16, 2018

Can you write a PR?

Just an idea: it could also work on domains like github.com or emails like name@company.com. You would just specify a list of delimiters (by default it would work char by char/digit by digit) and whether the top level is on the left (as in the zip code) or on the right (as in domain). For simplicity of the implementation, the code could treat everything as a string and just reverse the string to cope with left/right difference. Of course, the disadvantage is that string processing is slow...

@JoshuaC3
Copy link
Contributor Author

@janmotl that is a really nice idea. This does add a fair bit of complexity though.

I am a bit of a newb here and this seems a little complex so I am going to do #137 first and see how I get on. If someone wants to pick this first then please go ahead.

@jkleint
Copy link

jkleint commented Oct 23, 2018

From a design perspective, you probably want a general way to specify the hierarchical relationships between levels, and then convenience functions that produce this relationship structure for common cases. That is, you'd still want to be able to use it even if your levels are random numbers with no meaning but for which you have a hierarchy. You could also use say a clustering algorithm to infer a hierarchy for variables that don't have one, or just come up with one manually.

Since hierarchy means tree / forest, I'd suggest specifying the hierarchical relationships as a Series or dict mapping levels to their parents, and note that entries will not all be levels themselves: for a column of zipcodes, the parents might be counties or states or regions, which don't show up in the column.

@janmotl
Copy link
Collaborator

janmotl commented Oct 23, 2018

@jkleint I think I need examples.

Are we talking about branching based on the values? E.g.: Postal addresses. Some countries (like the USA) divide to countries. But other countries (like Switzerland) divide directly into cantons... And there can be (but doesn't have to be) a bigger difference between countries than between cantons. Hence, a support for branching based on the values makes sense.

Or about a simple and fixed (value-independent) hierarchy like: level1.level3.level2, which can be simply rearranged into: level1.level2.level3? An example of this could be date: 2018 28 January -> 2018 January 28.

Or about some other example?

@jkleint
Copy link

jkleint commented Oct 23, 2018

Let's say I have categorical values whose names are meaningless:

'xjs', 'kxu', 'lwj', 'vbz', 'woz'

But I have some external knowledge that tells me they nest into a hierarchy like so:

          grpA
      /    |    \
  grpB    grpC   grpD
  |  |     |     |  |
xjs kxu   lwj   vbz woz

Then I need a way to specify which levels (and groups) go to which groups:

xjs -> grpB
kxu -> grpB
lwj -> grpC
vbz -> grpD
woz -> grpD
grpB -> grpA
grpC -> grpA
grpD -> grpA

Hey that looks a lot like a dict (or Series):

mapping = {
  'xjs': 'grpB',
  'kxu': 'grpB',
  'lwj': 'grpC',
  'vbz': 'grpD',
  'woz': 'grpD',
  'grpB': 'grpA',
  'grpC': 'grpA',
  'grpD': 'grpA',
}

And that's what I'm proposing to pass to TargetEncoder: a dict or Series that tells you the hierarchy.

Then you can have a convenience function encode_email() or whatever that returns such a dict (or preferably a Series):

encode_email('bob@example.com')

{
  'bob@example.com': 'example.com',
  'example.com': '.com'
}

I'm just proposing TargetEncoder has a clean, general interface to specify hierarchies, and separately, functions to create those hierarchies for common use cases.

@janmotl
Copy link
Collaborator

janmotl commented Oct 24, 2018

That sounds reasonable. Just note that when we get 'tim@example.com' during the scoring time, we will have to look up keys for values, because 'tim' was not observed during the training time but 'example' was.

Also, we may get a clash of keys. For example, we may observe 'bob@example.com' and 'admin@bob.com' during the training time.

@JoshuaC3
Copy link
Contributor Author

@jkleint @janmotl personally, I don't feel this is the correct way to do this. I think we should be treating a single column in an array or dataframe as the sole data source of categories we wish to encode. Anything beyond this, IMHO, becomes feature engineering, not categorical encoding (ignoring the fact that some people might consider ce part of fe...).

I think we should have an option of splitting the string either by single characters (e.g. zip code 45243 -> [4, 5, 2, 4, 3]) or by delimiters (e.g. email bob@example.com, sep=['.'] -> ['example', 'com']). We can then also give the choice of an order of ascending=True of False which is the same as the Pandas API.

If someone truly wishes to achieve more complex encodings, they could simply create a new level1 column using pandas map and a dict and then add these two string columns together. I think creating convenience functions such as encode_email() would become an inexhaustible task.

@janmotl
Copy link
Collaborator

janmotl commented Oct 25, 2018

Another option: If TargetEncoder observes a column of dtype==list, it will treat the column as hierarchical with the first item being the top most level. And it would be up to the user to split (and potentially reverse) the input string (or whatever data type it was originally).

@JoshuaC3 I propose to first implement CountEncoder from #137. Then you may add ascending=None and sep=None into TargetEncoder, where ascending=None means no hierarchy. And sep=None means char by char processing. My only complain is that from ascending=True it is not evident whether the top most level is on the left or on the right.

@JoshuaC3
Copy link
Contributor Author

@janmotl I have started this.

Yes, I see that ascending is not as intuitive as I first though. Would be good to hear if there are any better key names for it. Otherwise, we can just make sure it is documented well.

@jkleint
Copy link

jkleint commented Oct 25, 2018

I agree we don't want to be in the business of a million custom hierarchy-generating functions. It's fine to have a function that makes hierarchies from delimited strings. But do we really want that to be the interface to TargetEncoder?

I would hope you agree that not everyone's hierarchies are simple string prefixes... or even strings. That creating those convenience functions is an inexhaustible task is exactly why we don't want to specify hierarchies that way: everyone's hierarchies are different. But you don't want to limit them or force them to jam them into strings (that we will just have to parse out again). I agree we don't want to try to accommodate them all individually, but we do want a general mechanism that anyone can use to specify their own. We might offer a few common ones, but you need to be able to handle any hierarchy. The general structure of a hierarchy is a forest, and the simple way to specify a forest is an edge list: a map of nodes to their parents. And the natural implementation of a map is a dict or Series.

I like Jan's idea of a column of lists for hierarchies, but it requires the user to preprocess the data, it's not very efficient, and again we'd just have to parse out the edge list. Specifying hierarchies directly as mappings, and passing a parameter that gives the hierarchy for each column is clean and general, and composes nicely with convenience functions. Decomposing this problem into creating hierarchies and encoding hierarchies is a clear design win here: it's less code and will save many headaches down the road.

The stronger design point is separating target coding from parsing things into hierarchies: these are totally orthogonal. You really want to have zipcode parsing functions in a TargetEncoder? They are completely unrelated. What if another encoder needs to handle hierarchies in the future? What if I can't or don't want to change my category names to conform to some string format? What if my categories contain the delimiter? You really want to worry about collisions like "a.b" and "a.a.b"? What if I already have my hierarchy separately, and I don't want to take the performance hit of joining a bunch of strings, or worry about how to do that at all? Whatever the format is, we're going to need to parse it back into a Series representation of an edge list internally anyway, for an efficient vectorized implementation. Let's just use the simple, direct, efficient interface directly.

Let's keep the signaling "out-of-band," the metadata separate from the data: leave the data as is, specify hierarchy as metadata. Keep separate things separate. Here's the relevant principle from The Pragmatic Programmer:

Eliminate Effects Between Unrelated Things
Design components that are self-contained, independent, and have a single, well-defined purpose.

Let's separate making hierarchies from encoding them.

Concretely, here's what that would look like:

X = pd.DataFrame({'e-mail': ['bob@example.com', 'jan@company.com', 'user@example.com']})

TargetEncoder(hierarchy={'e-mail': make_email_hierarchy(X['e-mail'])}).fit(X)
hierarchy = make_email_hierarchy(X['e-mail'])
hierarchy
{
    'bob@example.com': 'example.com',
    'jan@company.com': 'company.com', 
    'user@example.com': 'example.com',
    'example.com': '.com',
    'company.com': '.com',
}

TargetEncoder(hierarchy={'e-mail': hierarchy}).fit(X)

And then it's really easy to apply new/different/custom hierarchies to other columns or data:

TargetEncoder(hierarchy={'zipcode': make_zipcode_hierarchy(Z['zip'])}).fit(Z)

This doesn't require the user to pre-process the data, composes nicely, and is efficient. And it can handle hierarchies that don't have a nice functional form, but are just data.

A general make_delimited_hierarchy() function for parsing delimited strings is a great idea, but let's keep the TargetEncoder interface itself clean, simple, and efficient.

@janmotl
Copy link
Collaborator

janmotl commented Oct 26, 2018

Ok. @jkleint has the longest comment. He implements the hierarchical processing for TargetEncoder. @JoshuaC3 implements the CountEncoder.

@jkleint Be careful about deep hierarchies. E.g.: If each level is represented by a single char, then the memory requirements of the dictionary representation grows quadratically with the depth (ignoring the dictionary implementation details):

key    value  depth  char_len  cumsum_char_len
ba     a      2      3         3
cba    ba     3      5         8
dcba   cba    4      7         15

Hence, please, consider implementing:

  1. Definition of the hierarchies by values (this the dictionary proposal).
  2. Definition of the hierarchies by pattern (for handling simple string prefixes).

But I am leaving it up to you.

@PaulWestenthanner
Copy link
Collaborator

Hi @JoshuaC3 @jkleint
I want to follow up on this issue. Nothing has been merged to master so far, right? Is there still anyone interested in doing this?

I've actually been in touch with Daniele Micci-Barreca @miccibarreca, the author of the paper, and he would also like to see an implementation of the hierarchical processing. He is also very eager to offer any help with the theoretical part (he does not consider himself a python coder).
My two cents on the issue of how to implement the hierarchy:
I'm pretty much in favour of what @janmotl and @JoshuaC3 suggest. I think it's the simplest to just have the user pass a list of columns that are their hierarchy. This puts some work on the user as they might have to prepare columns (e.g. they have to create a zip_code_3 column and add it to their data frame) but it keeps the library focused on its core functionality. For convenience some (but not many and not domain specific) functions can be added like make_delimited_hierarchy or make_fixed_length_substring_hierarchy. I fully understand the benefits of complete decoupling that @jkleint mentions but since the functions I propose are generic I wouldn't see any issue there. In @jkleint s the problem of new values that are not in the mapping is still unsolved, right? Also arbitrary hierarchies are supported this way if the user creates columns appropriately. So in a nutshell this solution can deal with arbitrarily complex hierarchies if the user puts in the effort of defined them before calling the library but also offers simple usage if they have a standard hierarchy. I also get the point that this would be specific to target encoding and leave out other encoders that might need hierarchical features as well. But this is a problem I'd postpone to the point where we actually have that problem.

@dmiccibarreca
Copy link

My new handle is actually @dmiccibarreca…looking forward to collaborate with whoever picks this up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants