NOCATS: Categorical splits for tree-based learners #4899

Closed
wants to merge 13 commits into from


@jblackburne
Contributor

jblackburne commented Jun 25, 2015

NOCATS stands for "Near-Optimal Categorical Algorithm Technology System". (What can I say? My coworker came up with it.) It adds support for categorical features to tree-based learners (e.g., DecisionTreeRegressor or ExtraTreesClassifier).

This PR is very similar to #3346, but allows for more categories, particularly with extra-randomized trees (see below).

How it works

We've replaced the threshold attribute of each node (a float64) with a union datatype containing a float64 threshold field and a uint64 cat_split field. When splitting on non-categorical features, we use the threshold field and everything works as before.
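As a rough illustration of the union idea (not the PR's Cython code), the sketch below uses Python's `ctypes.Union` to overlay a float64 and a uint64 on the same 8 bytes. The class name `SplitValue` is hypothetical; only the field names `threshold` and `cat_split` come from the description above.

```python
import ctypes


class SplitValue(ctypes.Union):
    # Hypothetical stand-in for the node's split value.
    # Both fields start at offset 0, so they share the same 8 bytes.
    _fields_ = [
        ("threshold", ctypes.c_double),  # read for non-categorical features
        ("cat_split", ctypes.c_uint64),  # read for categorical features
    ]


value = SplitValue()
value.threshold = 1.0

# The union is no larger than a single float64.
size = ctypes.sizeof(SplitValue)

# Reading the other field reinterprets the same bits (IEEE 754 for 1.0).
raw_bits = value.cat_split
print(size, hex(raw_bits))
```

The point of the union is that a node stays the same size whether it splits on a numerical or a categorical feature; which field is meaningful depends on the feature being split.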

But when a feature is marked as categorical, the cat_split field is used instead. In a decision tree, each of its 64 bits indicates which direction a certain category goes; this implies a hard maximum of 64 categories in any feature. Which is fine, because finding the best way to split 64 categories during the tree-building step is very expensive, and the practical limit will certainly be less than 64.
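A minimal sketch of the bit-field representation, assuming that a set bit means "go left" (the description does not say which bit value maps to which direction). The helper names `encode_left_set` and `goes_left` are made up for illustration.

```python
def encode_left_set(left_categories):
    """Pack category indices (0..63) into a 64-bit integer, one bit each."""
    bits = 0
    for cat in left_categories:
        if not 0 <= cat < 64:
            raise ValueError("a 64-bit field holds at most 64 categories")
        bits |= 1 << cat
    return bits


def goes_left(cat_split, category):
    """Assumption: bit `category` set means that category goes left."""
    return bool((cat_split >> category) & 1)


# Categories 1, 4 and 5 go left; every other category goes right.
cat_split = encode_left_set({1, 4, 5})
print(cat_split, goes_left(cat_split, 4), goes_left(cat_split, 2))
```

Evaluating a sample is then a shift and a mask, which is why the 64-category ceiling falls directly out of the width of the integer.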

In an extra-randomized tree, however, the expensive process of finding the very best split is bypassed, so it would be nice to allow more categories. So for these trees we use cat_split in a completely different way: when building the tree we randomly choose a set of categories to go left, then store only the minimum information needed to regenerate that set during tree evaluation. The information that we store is a random seed (in the most significant 32 bits) and the number of draws to perform (in the next 31 bits) [Edit: We now flip a virtual coin for each category, so the number of draws is no longer necessary]. By recreating the split information as needed in each node rather than storing it explicitly, we are able to support large numbers of categories without causing the classifiers to balloon in size.
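The "store a seed, regenerate the split" idea can be sketched as follows. This uses Python's `random.Random` as a stand-in for whatever generator the PR uses internally, and the function name `left_categories_from_seed` is hypothetical; it follows the edited description (one coin flip per category).

```python
import random


def left_categories_from_seed(seed, n_categories):
    """Rebuild the set of left-going categories from nothing but the seed."""
    rng = random.Random(seed)
    # One virtual coin flip per category, always in the same order.
    return {cat for cat in range(n_categories) if rng.getrandbits(1)}


seed = 12345
at_fit_time = left_categories_from_seed(seed, 1000)
at_predict_time = left_categories_from_seed(seed, 1000)

# Same seed, same flips: the node only needs to remember the seed.
print(at_fit_time == at_predict_time, len(at_fit_time))
```

Because the set is a deterministic function of the seed, the per-node storage stays at a few bytes no matter how many categories the feature has.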

How does a tree know which type it is? We encode that information in the least significant bit of cat_split. If the LSB is 0, we treat it as a flag field; if it is 1, we treat it as a random seed and number of draws. We do not lose generality by forcing category 0 to always go right, since there is a left-right symmetry.
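Putting the two encodings together, a decoder might dispatch on the least significant bit roughly like this. It is a sketch under assumptions: set bits mean "left", the seed occupies the top 32 bits as described, and Python's `random.Random` stands in for the real generator. `pack_seed` and `left_categories` are hypothetical names.

```python
import random


def pack_seed(seed):
    """Place a 32-bit seed in the most significant bits and set the LSB flag."""
    return ((seed & 0xFFFFFFFF) << 32) | 1


def left_categories(cat_split, n_categories):
    """Decode a 64-bit cat_split into the set of categories that go left."""
    if (cat_split & 1) == 0:
        # LSB 0: plain bit field. Bit 0 is the flag itself, so category 0
        # can never be marked "left" in this mode.
        limit = min(n_categories, 64)
        return {c for c in range(limit) if (cat_split >> c) & 1}
    # LSB 1: regenerate the random split from the stored seed.
    rng = random.Random(cat_split >> 32)
    return {c for c in range(n_categories) if rng.getrandbits(1)}


bitfield_split = 0b110100          # categories 2, 4 and 5 go left
random_split = pack_seed(7)

print(left_categories(bitfield_split, 6))
print(len(left_categories(random_split, 200)))
```

In the bit-field mode the flag costs nothing extra: by the left-right symmetry mentioned above, any split that would send category 0 left can be mirrored into one that sends it right.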

One last detail: to avoid regenerating the random split for every sample during tree evaluation, we allocate a temporary buffer for each node large enough to serve as a bit field. The buffers are freed when evaluation finishes.

How to use it

The fit method of the relevant learners has a new optional parameter categorical. You can give it an array of feature indices, a boolean array of length n_features, or the strings 'None' (the default) or 'All'. Categorical feature data will be rounded to the nearest integer, and those integers will serve as the category labels. (Internally, they are mapped to range(n_categories).)
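The rounding and relabeling step might look something like the sketch below. The function name `encode_categories` is hypothetical, and assigning codes in sorted label order is an assumption; the description only says the labels are mapped to range(n_categories).

```python
def encode_categories(values):
    """Round to the nearest integer, then map labels to 0..n_categories-1."""
    rounded = [int(round(v)) for v in values]
    labels = sorted(set(rounded))          # assumption: sorted label order
    index_of = {label: i for i, label in enumerate(labels)}
    return [index_of[r] for r in rounded], labels


codes, labels = encode_categories([2.1, 6.9, 2.0, 10.2, 7.0])
print(codes)   # internal category indices
print(labels)  # original (rounded) labels, indexed by category
```

Here three distinct labels survive the rounding, so the feature has n_categories = 3 regardless of how large the raw values are.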

Comments, caveats, etc.

  1. RandomSplitter generates a random categorical split by first generating a random seed, then generating a number of draws to make. To simulate flipping a coin for each category, the number of draws should come from a Binomial distribution, but currently we use a uniform distribution. I welcome comments on how desirable it would be to change this into a Binomial draw. [Edit: RandomSplitter now sends each category left or right using a simple coin flip. This is equivalent to the Binomial draw.]
  2. When building the tree, each node generates its split using the full set of categories for the feature in question rather than the subset of categories represented by the node's samples. For the BestSplitters, this means it will take longer to find the split. For the RandomSplitter, it means there is a chance that the current subset will all be sent in the same direction. This contrasts with the non-categorical behavior, where a non-trivial split is guaranteed for non-constant features. The chance is generally small (and it's smaller if we use a Binomial draw rather than a uniform draw). I made this choice because it would introduce a lot of new complexity and storage requirements to split based on the current subset of categories. One alternative would be to have the RandomSplitter generate random splits until a non-trivial split is achieved. [Edit: This is now implemented. Random splits are generated until a non-trivial split is found, or until a maximum of 20 tries (to limit the worst case runtime). This change renders this whole bullet point essentially meaningless aside from runtime speed considerations.] Comments on this are also welcome.
  3. Categorical features are not supported for sparse inputs. This is because I did most of this work before the support for sparse inputs was added, and I am not as familiar with that part of the code. Plus, it seems that sparse inputs become less necessary when you are not using one-hot encoding.
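The re-roll workaround described in the edit to item 2 can be sketched as below. This is an illustration only: it uses Python's `random.Random` rather than the PR's generator, ignores the category-0 convention, and the name `random_nontrivial_split` is made up.

```python
import random


def random_nontrivial_split(present, n_categories, rng, max_tries=20):
    """Flip a coin for every category of the feature; retry while the
    categories actually present in the node all land on the same side."""
    for _ in range(max_tries):
        left = {c for c in range(n_categories) if rng.getrandbits(1)}
        n_left = len(present & left)
        if 0 < n_left < len(present):
            return left          # both children receive at least one category
    return None                  # gave up: only trivial splits were drawn


rng = random.Random(0)
split = random_nontrivial_split({0, 1}, n_categories=3, rng=rng)
hopeless = random_nontrivial_split({0}, n_categories=3, rng=rng)
print(split, hopeless)
```

When only one category is present no split can be non-trivial, so the loop runs out of tries and returns `None`; the cap on tries bounds the worst-case runtime in exactly that situation.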
@glouppe

Member

glouppe commented Jun 25, 2015

Awesome! I will be on vacation for the next two weeks, but I will definitely look into it at my return.

(Be patient, our review and integration process requires some time -- but don't hesitate to ping us if you see things stalling.)

@jblackburne

Contributor Author

jblackburne commented Jun 25, 2015

Ok, thanks.

Hm, it looks like there are two test errors. The first is easy; I need to use six.moves.zip instead of itertools.izip. The second is that older versions of Numpy apparently don't like union datatypes, at least the way I constructed it. Looks like I can fix it using this SO question.

@arjoly

Member

arjoly commented Jun 26, 2015

ping myself. looks awesome. I will unlock time to review this pr.

@arjoly

Member

arjoly commented Jun 26, 2015

it would be awesome if you add some tests.

@jblackburne

Contributor Author

jblackburne commented Oct 15, 2015

Fixed some bugs and addressed most of the caveats. Working on some unit tests. Code review welcome!

@amueller

Member

amueller commented Oct 15, 2015

there is a bunch of changes in master in the trees. Not sure how they relate to yours. Maybe try to rebase? Or check out the changes first?

@raghavrv

Member

raghavrv commented Nov 4, 2015

@jblackburne Could I help you with this PR? (That would involve me sending PRs to your branch "NOCATS" following reviews from our devs.) Also, you need to rebase it onto master first! :)

@jblackburne

Contributor Author

jblackburne commented Nov 4, 2015

Hi @rvraghav93. Sure, PRs would be welcome, especially unit tests. The rebase is done, but I'm waiting to push it until I've had a chance to test it a little. Give me a couple of days.

@raghavrv

Member

raghavrv commented Nov 4, 2015

Sure please take your time :)

@jblackburne jblackburne force-pushed the jblackburne:NOCATS branch from 847f442 to bfd6bb0 Nov 5, 2015
@jblackburne

Contributor Author

jblackburne commented Nov 5, 2015

Ok, rebase is done. Anyone who has cloned this will need to re-clone it since I altered history. Travis-CI fails when numpy version < 1.7; this is a known problem. Don't know why the appveyor build was canceled.

@raghavrv

Member

raghavrv commented Nov 6, 2015

We can safely ignore appveyor for the time being... Thanks for the rebase! I'll clone your fork and send a PR to your branch soon!

@raghavrv

Member

raghavrv commented Nov 6, 2015

Also I think you could squash to <= 3 commits! It will be cleaner to trace back any regressions in the future! :)
Also a minor tip (which you can choose to ignore) you could prefix the commit headers with tags ENH / FIX / MAINT and put all the squashed description inside if you feel that is necessary...

@raghavrv

Member

raghavrv commented Nov 6, 2015

Never mind about the squash... We'll do it at the end... I've cloned your repo and started working on it... Will ping you when I'm done :)

@raghavrv raghavrv referenced this pull request Nov 10, 2015
5 of 12 tasks complete
@raghavrv

Member

raghavrv commented Nov 15, 2015

Could you update your master and rebase this branch again please? ;) (since c files are removed, you might have to check them out too)

EDIT: I think rebase should do that... but I am not sure as you must have explicitly committed those c files previously...

@jblackburne jblackburne force-pushed the jblackburne:NOCATS branch from bfd6bb0 to 224949a Nov 16, 2015
@jblackburne

Contributor Author

jblackburne commented Nov 16, 2015

Here you go. Git didn't do it for me, but it was pretty easy anyway.

@raghavrv

Member

raghavrv commented Feb 17, 2016

Now that I am getting to know the tree code better, this PR looks amazing!

One comment. Is it correct not to split based on the current subset of data? Is that how R handles it?

Also could you compare your implementation with a dataset having categorical features vs the master branch for accuracy variations (by simply encoding those categorical features)?

Thanks for the PR...

@jblackburne

Contributor Author

jblackburne commented Feb 18, 2016

Not splitting on the current subset of data causes two problems.

The first is that it's not as fast (I have traded speed for algorithmic simplicity). This problem affects DecisionTree more than ExtraTree because the former must test every possible permutation of categories when fitting, and where factorials are concerned, smaller arguments are much better! But I'm hoping that it's not too bad compared to one-hot, for the values of n_categories that people will be using. This is not a problem for ExtraTree, and honestly I'm more excited about that one anyway, because it allows you to have really large n_categories.

The second problem only affects ExtraTree. There's a chance that the random permutation that is chosen will result in a trivial split (meaning that it will send all samples to one child) despite there being a variety of categories present for the chosen feature. For example, if the sample consisted of three "smoky" and two "effervescent" and zero "swirly", this would happen if the RandomSplitter randomly sent "swirly" right and the other two left (a 25% chance). Because it's not restricting itself to "smoky" and "effervescent", it doesn't know that it has selected a trivial split. This is the incorrect thing to do if you consider that the baseline (non-categorical) RandomSplitter will never make this mistake. You can see that it's more likely to happen with fewer categories represented in the current sample, so 25% is as bad as it gets. RandomSplitter currently works around this by re-rolling until it gets a nontrivial split, up to a maximum of 20 re-rolls. In the case above, this reduces the chances of a trivial split to 0.25**20, or about a part in a trillion.

TL;DR It's not incorrect (well, maybe once in a very great while). It makes DecisionTree slower than it could be for categorical features, but I think it's good enough for now.

I'm not sure how R's implementation works under the hood, unfortunately.

EDIT: Sorry, my math is wrong. It is a 50% chance in the example above, not 25%, because the trivial split can occur by sending both categories left OR right. So 20 iterations leads to a trivial split one time in a million, not one time in a trillion. Hm. I will push a new commit raising the maximum from 20 to 40, or maybe more.
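The corrected figures in the EDIT can be checked by brute force. The sketch below enumerates every left/right assignment of the three example categories, assuming an independent fair coin per category (it ignores the category-0 convention), and counts the assignments that send both present categories the same way.

```python
from itertools import product

categories = ["smoky", "effervescent", "swirly"]
present = ["smoky", "effervescent"]      # "swirly" has no samples in the node

trivial = 0
total = 0
for directions in product("LR", repeat=len(categories)):
    side = dict(zip(categories, directions))
    total += 1
    # Trivial split: every present category lands on the same side.
    if len({side[c] for c in present}) == 1:
        trivial += 1

p_trivial = trivial / total              # chance that one roll is trivial
p_after_20 = p_trivial ** 20             # all 20 rolls trivial
p_after_40 = p_trivial ** 40             # all 40 rolls trivial
print(p_trivial, p_after_20, p_after_40)
```

Four of the eight assignments are trivial, which gives the 50% per-roll chance; twenty rolls leave roughly one chance in a million, and forty rolls roughly one in a trillion.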

@jblackburne

Contributor Author

jblackburne commented Feb 18, 2016

I have done some comparisons of NOCATS to one-hot encoding using a toy dataset, and convinced myself that things were working. I'll try and put together a more in-depth study with larger train/test datasets. Stay tuned.

@raghavrv

Member

raghavrv commented Mar 1, 2016

@jblackburne Thanks a lot for the patient response!

@raghavrv

Member

raghavrv commented Mar 14, 2016

Ok so the important question from the API point of view is to ask if we are okay with the categorical parameter in fit?

@amueller @jnothman @GaelVaroquaux @glouppe @agramfort Views on the same?

@GaelVaroquaux

Member

GaelVaroquaux commented Mar 14, 2016

IMHO it should be a class parameter. As usual, the question is: how do you do cross-val with categorical variables?

@raghavrv

Member

raghavrv commented Mar 14, 2016

> a class parameter

categorical becomes data dependent... I'm not sure if we want it as a class param??

> how do you do cross-val with categorical variables.

If I am not missing something, we can pass the categorical parameter inside fit_params dict, correct?

@GaelVaroquaux

Member

GaelVaroquaux commented Mar 14, 2016

> categorical becomes data dependent... I'm not sure if we want it as a class param??

Yes, but only in the feature direction.

> how do you do cross-val with categorical variables.

> If I am not missing something, we can pass the categorical parameter inside fit_params dict correct?

Yes, but then it becomes very cumbersome to use in a larger setting.

@lesshaste

lesshaste commented Mar 18, 2016

Would it make sense to run the new code on the benchmarks from https://github.com/szilard/benchm-ml ? @GaelVaroquaux mentioned on the mailing list in relation to these benchmarks specifically that "In tree-based Not handling categorical variables as such hurts us a lot"

@jblackburne

Contributor Author

jblackburne commented Mar 18, 2016

@lesshaste: It looks like they are using decision tree-based classifiers (i.e., RandomForestClassifier and GradientBoostingClassifier) rather than extra-random tree-based classifiers. And it looks like their dataset's categorical features (airlines, origin & destination airports) probably have cardinality > 64. These two factors together mean NOCATS can't be used.

@raghavrv

Member

raghavrv commented Mar 18, 2016

@jblackburne would you be willing to give me push access to this branch? It would make it easier for me to collaborate. I'll make sure I don't force push.

And now the todo for this PR

  • Move categorical from fit to class parameter.
  • Make node based categorical splitting.
  • Benchmarking with master (one hot encoding) - Thanks jblackburne for doing this!

(PS: I'm currently in an OpenML workshop. A lot of people here seem to want this feature!)

@jph00

jph00 commented Aug 2, 2017

Sorry one question - what's the view of the core team about this general approach? I had assumed that something much simpler would be done, which is to do exactly the same thing as 1-hot encoding, but in the faster and lower memory way that you can do if you have categorical variables (i.e. just allow a single 1-vs-rest split at each leaf). I haven't seen any upside in practice of supporting more complex splits where you pick multiple levels to split on - since in practice the tree can always handle that case with multiple 1-vs-rest splits in the tree.

So what I'm trying to ask is: which approach do you guys feel is most interesting:

  1. Fast, low-memory, 1-vs-rest splits (i.e. supports same functionality as one-hot encoding)
  2. More complex multi-level splits like in this PR
  3. Or neither - just let users do integer or 1-hot coding themselves.
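To put rough numbers on the difference between options 1 and 2, the sketch below counts the distinct binary splits available for a feature with k categories. It only counts candidates; it says nothing about how this PR or any other library actually searches them. Function names are hypothetical.

```python
def one_vs_rest_splits(k):
    """Distinct splits that isolate a single category from all the others."""
    if k < 2:
        return 0
    if k == 2:
        return 1            # "A vs B" and "B vs A" are the same split
    return k


def subset_splits(k):
    """Distinct ways to divide k categories into two non-empty groups."""
    if k < 2:
        return 0
    return 2 ** (k - 1) - 1


for k in (3, 4, 10, 64):
    print(k, one_vs_rest_splits(k), subset_splits(k))
```

The one-vs-rest family grows linearly with k while the subset family grows exponentially, which is the trade-off between search cost and expressiveness that the three options are weighing.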
@jimmywan

Contributor

jimmywan commented Sep 11, 2017

> I haven't seen any upside in practice of supporting more complex splits where you pick multiple levels to split on - since in practice the tree can always handle that case with multiple 1-vs-rest splits in the tree.

Others can probably explain this better than I can, but the general idea here is that in the presence of a categorical feature with multiple values, the optimal way to split the tree may be to partition multiple values at the same time.

If you're using an integer encoding (aka LabelEncoder), your encoding may not be in the optimal ordering and it may not be possible to generate it in the optimal ordering for all cases.

If you use one-hot encoding, the entropy reduction for partitioning that single value might not be beneficial enough for the algorithm to choose that route.

A different way to say this is that currently supported approaches could theoretically reach the same conclusions, but it's very easy to concoct scenarios where it's highly unlikely to do so.

Example: let's say you had 20 different values for a particular categorical feature that have been integer encoded. In any particular part of the tree, the optimal split might be any one of the following:

  • "odd vs even"
  • "split by the midpoint"
  • "numbers divisible by 7"
  • etc.
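The "odd vs even" case can be made concrete with a small check. Assuming 20 categories integer-encoded as 0..19, the sketch below confirms that no single threshold reproduces the odd/even partition, counts how many thresholds a tree would need, and shows that a single subset (bit mask) split expresses it directly.

```python
categories = list(range(20))
target_left = {c for c in categories if c % 2 == 1}   # "odd" goes left

# Thresholds on the integer encoding can only cut the range in two pieces.
threshold_matches = [
    t for t in categories
    if {c for c in categories if c <= t} == target_left
    or {c for c in categories if c > t} == target_left
]

# Each change of side between neighbouring codes needs its own threshold.
boundaries = sum(
    1 for a, b in zip(categories, categories[1:])
    if (a in target_left) != (b in target_left)
)

# A subset split stores the whole partition in one node.
subset_mask = sum(1 << c for c in target_left)
subset_left = {c for c in categories if (subset_mask >> c) & 1}

print(threshold_matches, boundaries, subset_left == target_left)
```

With this encoding a threshold-based tree needs 19 separate cut points to isolate the odd categories, whereas the subset split does it in one node.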
@julioasotodv

julioasotodv commented Sep 24, 2017

I just wanted to complete the list that @raghavrv started:

Listing down the Cat. Variable handling methods of other packages:

  • XGBoost - dmlc/xgboost#95 (comment) - One hot encoding or Level encoding (No categorical splitting)
  • randomForest - http://stats.stackexchange.com/a/96442/58790 - The same way as this PR (sends some labels left and others right)
  • rpart - Not clear
  • gbm - Found no info
  • weka - Does not (needs one hot encoding)
  • H2O - http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm-faq/histograms_and_binning.html (Using bitsets, seems to be very efficient and accurate)
  • Spark ML - Naturally handles categorical features, but only up to the maxBins hyperparameter, given that all features are histogram binned (I still have to browse through the source code)

@scikit-learn scikit-learn deleted a comment from codecov bot Oct 20, 2017
@h-vetinari

h-vetinari commented Nov 15, 2017

Any news on the current status of this? I needed (wanted?) this feature so much I'm currently working on a local copy of this pull request, haha.

@jblackburne

Contributor Author

jblackburne commented Nov 17, 2017

@h-vetinari Only a few things remain to be done on this. It needs to be brought up to the latest changes in master, and more unit tests need to be written, as codecov has so helpfully pointed out. :) I could probably make time to do this.

And then of course it needs to be reviewed. This is challenging, since it is a fairly substantial change to a fairly hairy section of the code. See @amueller's comment above.

@julioasotodv

julioasotodv commented Nov 17, 2017

Given that I believe this is one of the most requested features in sklearn (along with surrogate splits for natural null handling in trees), there should be quite a few people willing to test and benchmark this with different datasets (myself included) :)

@js3711

js3711 commented Jan 9, 2018

I am interested in seeing this feature as well. For those that are interested, how can we help push this over the finish line? Exactly what work is left (other than rebasing)?

@jnothman

Member

jnothman commented Jan 9, 2018

It needs a code review:

  • Check that tests are understandable and adequate to test the new functionality
  • Check that the implementation does not present substantial risks to existing functionality (including memory leaks)
  • Check that the implementation is readable / maintainable and there are no obvious ways to improve that
  • Check that the API is well designed
sjonany pushed a commit to sjonany/Kaggle-Titanic that referenced this pull request Jan 14, 2018
Doesn't look like svms or even random forest in sklearn handle categorical features: scikit-learn/scikit-learn#4899. They just get converted to enums.

The SVM score improved, but random forest went down a bit. But that's probably because we now have more features for random forest, and will need to do hyperparam tuning later.

Before:
Random forest 0.822780047668
SVM 0.76217937805

After:
Random forest 0.810420497106
SVM 0.795838156849
@amueller

Member

amueller commented Mar 6, 2018

Not sure it's a good idea to add more features on top of an already big PR, so maybe that's for a follow-up, but I think it would be good to add a multi-class heuristic for efficient splits. I've read of people doing one-vs-rest with the binary algorithm.

@dipanjanS

dipanjanS commented Jul 22, 2018

Any update on the status of when this might be coming in?

azrdev added a commit to azrdev/sklearn-seco that referenced this pull request Sep 1, 2018
@adrinjalali

Member

adrinjalali commented Oct 8, 2018

Hi @jblackburne, @raghavrv,

Took me a while to go through this thread and the code. A lot has changed since two years ago, which I guess is the last commit on this branch.

You think you've got time to rebase/merge master and we take it from there?

@jnothman

Member

jnothman commented Oct 8, 2018

@adrinjalali

Member

adrinjalali commented Oct 11, 2018

(I'm really sorry about that, and that I didn't realize).

Alternatively, I can base a new PR on this one and try to address the list I gathered reading through this thread. @jblackburne what would you prefer?

@ogrisel

Member

ogrisel commented Oct 17, 2019

Closing in favor of #4899.

@ogrisel ogrisel closed this Oct 17, 2019
@adrinjalali

Member

adrinjalali commented Oct 17, 2019

You mean in favor of #12866 probably :)
