[WIP] Categorical split for decision tree #3346

Open
wants to merge 18 commits into scikit-learn:master from MatthieuBizien:categorical_split

10 participants

@MatthieuBizien

Unlike many algorithms that can only use dummy variables, decision trees can treat categorical data natively: the children of a node partition the set of categories. In some cases this can improve accuracy, and it avoids creating a large number of dummy columns. This is the default behavior of R's randomForest package.

I am currently implementing this in scikit-learn, using the Cython classes. I propose to add a categorical_features option to the decision tree classes (DecisionTreeClassifier, DecisionTreeRegressor, ExtraTreeClassifier, ExtraTreeRegressor). As in other modules of scikit-learn, this option could be None, 'all', a mask, or a list of features.
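
A minimal sketch of how this proposed option might be used; the categorical_features parameter is part of this proposal, not an existing scikit-learn API:

```python
# Hypothetical usage of the proposed categorical_features option.
from sklearn.tree import DecisionTreeClassifier

# Column 0 holds integer category codes, column 1 is numerical.
X = [[0, 5.1], [1, 3.3], [2, 4.7], [0, 2.9]]
y = [0, 1, 1, 0]

# Mark column 0 as categorical; other accepted values would be
# None, 'all', or a boolean mask such as [True, False].
clf = DecisionTreeClassifier(categorical_features=[0])
clf.fit(X, y)
```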

Each feature could have up to 32 categories, because we have to test all the combinations, i.e. 2**31 cases. This limit allows us to use a binary representation of a split. The same limit exists in R, I think for the same reason.
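
To illustrate the binary representation: with at most 32 categories, a partition fits in a single 32-bit integer, where bit c set means "category c goes to the left child". The names below are illustrative, not the actual Cython code:

```python
# Membership test against a bitmask-encoded split (illustrative).
def goes_left(category_code, split_mask):
    return (split_mask >> category_code) & 1 == 1

split_mask = 0b0101  # categories 0 and 2 go left, 1 and 3 go right
assert goes_left(0, split_mask)
assert not goes_left(1, split_mask)
assert goes_left(2, split_mask)
```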

This is a work in progress and not ready to be merged. I prefer to release it early so that I can get feedback.

@coveralls

Coverage Status

Coverage decreased (-0.04%) when pulling cc3eb48 on MatthieuBizien:categorical_split into 1775095 on scikit-learn:master.

@gallamine

Each feature could have up to 32 categories, because we have to test all the combinations, i.e. 2**31 cases.

Can you explain this part a bit more? You're testing all permutations of subsets of the data?

@MatthieuBizien

Yes, I have to test all the combinations. Because of the symmetry of the problem, we can assume without loss of generality that the first category is in the left leaf, so we have to test 2**31 cases in the worst case (i.e. 32 categories), not 2**32.

2**31 is a lot, but it is still computable, and it is the worst case, reached only when a feature has 32 categories. If the number of categories is smaller, or if the tree has already been split on this feature, the cost is lower. I assume that for most real-world cases, the number of categories will be small.
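
A sketch of the enumeration with that symmetry applied; fixing category 0 in the left leaf halves the search space (names are illustrative):

```python
# Enumerate candidate partitions of k categories as bitmasks.
# Category 0 is always assigned to the left leaf, so only
# 2**(k - 1) masks are visited instead of 2**k.
def candidate_splits(k):
    for mask in range(2 ** (k - 1)):
        yield (mask << 1) | 1  # shift and set bit 0 (category 0 left)

assert sum(1 for _ in candidate_splits(5)) == 2 ** 4  # 16 partitions
```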

We could imagine heuristics for features with many categories (it is "just" discrete optimization), but that is, I think, too soon.

@gallamine

Do you have any thoughts on how you'd handle the case when the user provides more than 32 categories? I'm thinking of my own work, where almost everything has more than 32 categories (e.g. countries or postal codes).

@MatthieuBizien

At the beginning, I think it is easier not to handle that case: raise an exception and ask the user to use dummy variables. Once this pull request is working and merged, it will be possible to start working on heuristics for finding the best split without testing all the combinations. I am not a specialist in discrete optimization, but I am sure there are efficient algorithms for that. The underlying structure will also need to change, because we will no longer be able to store a split in an int32.
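
A hypothetical guard matching that behavior; the helper name and message are made up for illustration, not PR code:

```python
import numpy as np

# Reject categorical columns with more than 32 distinct values,
# as described above (hypothetical helper).
def check_categorical_column(column, max_categories=32):
    n_categories = len(np.unique(column))
    if n_categories > max_categories:
        raise ValueError(
            "Feature has %d categories; at most %d are supported. "
            "Consider using dummy variables instead."
            % (n_categories, max_categories))
```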

@jnothman
Owner
@glouppe
Owner

The expressive power of the tree is identical whether or not these are handled specially.

Not exactly. By assuming numerical features, we assume that categorical features are ordered, which restricts the sets of candidate splits and therefore the expressive power of the tree (for a finite learning set).
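
A concrete illustration of this point, assuming hypothetical categories A, B, C encoded as 0, 1, 2: thresholding the numeric encoding can only produce the "contiguous" partitions, never {A, C} versus {B}:

```python
# Splits reachable by thresholding the (arbitrary) numeric encoding.
codes = {"A": 0, "B": 1, "C": 2}

for threshold in (0.5, 1.5):
    left = {c for c, v in codes.items() if v <= threshold}
    right = set(codes) - left
    print(sorted(left), "|", sorted(right))
# ['A'] | ['B', 'C']
# ['A', 'B'] | ['C']
# A true categorical split could also produce {'A', 'C'} | {'B'}.
```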

@glouppe
Owner

Thanks for your contribution, @MatthieuBizien!

A few comments though before you proceed further:

  • The API for this has already been subject to debate. We have never settled on something that pleases everyone. I would like to hear some core developers' opinions on the proposed API. As I understand it, the interface here is similar to what we already have for OneHotEncoder. CC: @ogrisel @larsmans @jnothman @GaelVaroquaux

  • In terms of algorithms:
    i) 2**31 is way too large. In R, they restrict the number of combinations to 2**8. If the number of categories is larger, then 2**8 combinations are sampled at random.
    ii) In binary classification or in regression, there exists an optimal linear algorithm for finding the best split (see the sketch after this list). It basically boils down to replacing the categories by their probability, using these probabilities as a new ordered feature, and applying the usual algorithm for finding the best split. You can find details about this in Section 3.6.3.2 of http://orbi.ulg.ac.be/handle/2268/170309
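
A minimal sketch of the linear algorithm from point ii), for binary classification with integer-coded categories; all names and the Gini bookkeeping are mine, not the PR's implementation, and every category is assumed to appear in the node:

```python
import numpy as np

def best_categorical_split(x_cat, y, k):
    # Replace each category by the empirical P(y == 1), sort the
    # categories by that probability, then scan only the k - 1
    # contiguous splits in the sorted order.
    means = np.array([y[x_cat == c].mean() for c in range(k)])
    order = np.argsort(means)
    best_gini, best_left, n = np.inf, None, len(y)
    for i in range(1, k):
        left = order[:i]
        mask = np.isin(x_cat, left)
        n_l = mask.sum()
        if n_l == 0 or n_l == n:
            continue
        p_l, p_r = y[mask].mean(), y[~mask].mean()
        # Weighted Gini impurity of the two children.
        gini = (n_l * 2 * p_l * (1 - p_l)
                + (n - n_l) * 2 * p_r * (1 - p_r)) / n
        if gini < best_gini:
            best_gini, best_left = float(gini), set(left.tolist())
    return best_left, best_gini

x = np.array([0, 0, 1, 1, 2, 2])
y = np.array([0, 0, 1, 1, 0, 1])
print(best_categorical_split(x, y, k=3))  # ({0}, 0.25)
```

The point of the trick is that only k - 1 splits are checked instead of 2**(k - 1): for Gini (and for regression with squared error), an optimal partition is always contiguous in the probability ordering.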

@glouppe
Owner

In terms of internal interface, this may also be the opportunity to try to factor out code from Splitters. What is your opinion on this @arjoly ?

@MatthieuBizien

@glouppe You're welcome. Thanks for your advice on the algorithms, I will use that.

@arjoly
Owner

In terms of internal interface, this may also be the opportunity to try to factor out code from Splitters. What is your opinion on this @arjoly ?

Yeah, this would be a great opportunity. This could already be done outside of this pull request.

@jnothman
Owner
@amueller
Owner
@mblondel
Owner

I'm enthusiastic about this feature. One use case is hyper-parameter optimization (as in hyperopt) over categorical hyper-parameters.

@GaelVaroquaux GaelVaroquaux changed the title from Categorical split for decision tree to [WIP] Categorical split for decision tree
@ogrisel
Owner

Note that pandas 0.15 will have a native data type for categories encoding:

http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#categoricals-in-series-dataframe

We could make the decision trees able to deal with dataframe features natively. That would be more natural for the user: no need to pass a feature mask.

However, that would require some refactoring to support lazy, per-column __array__ conversion instead of doing it globally for the whole dataframe in the check_X_y call.
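
For reference, a small example of what the categorical dtype provides: each column carries its own integer codes, usable without dummy columns. Consuming these per-column codes inside the trees is the hypothetical part:

```python
import pandas as pd

# Categorical dtype (pandas >= 0.15).
s = pd.Series(["fr", "us", "fr", "de"], dtype="category")
print(list(s.cat.categories))    # ['de', 'fr', 'us']
print(s.cat.codes.tolist())      # [1, 2, 1, 0]
```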

@ogrisel
Owner

Yes. But even then, the resulting decision surface would most likely not be the same.

Also, it would make the graphical export of a single decision tree much easier to understand. Many users are interested in the structure of the learned trees when applied to categorical data.

@pprett
Owner
Commits on Jul 5, 2014
  1. @MatthieuBizien
  2. @MatthieuBizien

    Revert "First draft of categorical split"

    MatthieuBizien authored
    This reverts commit 004ac5a.
  3. @MatthieuBizien

    Generate Cpp from Cython for _tree and _utils (BROKEN)

    MatthieuBizien authored
    We could use std::map and std::set. Compilation fails.
  4. @MatthieuBizien
  5. @MatthieuBizien
  6. @MatthieuBizien
  7. @MatthieuBizien
Commits on Jul 11, 2014
  1. @MatthieuBizien
Commits on Jul 14, 2014
  1. @MatthieuBizien
Commits on Jul 15, 2014
  1. @MatthieuBizien
Commits on Jul 17, 2014
  1. @MatthieuBizien

    Remove unnecessary TODO

    MatthieuBizien authored
  2. @MatthieuBizien
  3. @MatthieuBizien
  4. @MatthieuBizien

    Notation : switch categorical splits to partition

    MatthieuBizien authored
    changed inconsistent notations categorical_split and split_categories to
    partition
Commits on Jul 18, 2014
  1. @MatthieuBizien
  2. @MatthieuBizien
  3. @MatthieuBizien
  4. @MatthieuBizien

    Create functions _categorical_feature_split and _continuous_feature_split

    MatthieuBizien authored
    BestSplitter.node_split is now "just" 100 lines long