
NOCATS: Categorical splits for tree-based learners (ctnd.) #12866

Open · wants to merge 59 commits into base: master

Conversation
Conversation

adrinjalali (Member) commented Dec 26, 2018

This PR continues the work of #4899. For now I've merged master into the PR, made it compile, and got the tests running. There are several issues which need to be fixed; the list will be updated as I encounter them. Also, not all of these items are necessarily open: I have only collected them from the comments on the original PR, and need to make sure they're either already addressed, or address them here.

  • merge master into the PR (done)
  • sparse tests pass (done)
    • The code is supposed to be the same as the status quo implementation if categories are not passed. But right now the tests related to sparse data fail.
    • EDIT: The tests pass if we compare floats with almost_equal
  • LabelEncoder -> CategoricalEncoder (done)
    • Preprocessing is not a part of NOCATS anymore.
  • Is the maximum number of random generations 20 or 40? (done)
    • It's actually 60
  • Don't quantize features automatically (done)
  • check the category count limits for given data. (done)
  • add a benchmark
  • add tests (right now only invalid inputs are tested)
    • tree/tests done
    • ensemble/tests done
  • benchmark against master
  • add an example with plots
  • check numpy upgrade related issues (we've upgraded our numpy requirement in the meantime)
  • run some benchmarks with a simple integer coding of the features (with arbitrary ordering)
  • add cat_split to NODE_DTYPE once joblib.hash can handle it (padded struct)

Closes #4899

Future Work: this is the possible future work we already know of (i.e. outside the scope of this PR):

  • Heuristic methods to allow fast Breiman-like training for multi-class classification
  • export to graphviz
  • One-hot emulation using the NOCATS machinery
  • support sparse input
  • handle categories as their unique values instead of [0, max(feature)] (see the sketch after this list)
    • This is to be consistent with our encoders' behavior
    • moved this to future work per #12866 (comment)
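
A tiny illustration of what that last item means (hypothetical values, not code from this PR): under the [0, max(feature)] scheme, raw integer codes with gaps inflate the category count, while an encoder-style scheme counts only the unique values.

```python
import numpy as np

# A feature whose raw integer codes have gaps: {2, 5, 9}
feature = np.array([2, 5, 9, 5, 2])

n_cats_range = feature.max() + 1          # 10 categories under [0, max(feature)]
n_cats_unique = np.unique(feature).size   # 3 categories, as OrdinalEncoder would produce
print(n_cats_range, n_cats_unique)        # 10 3
```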

P.S. I moved away from the "task list" format due to its extremely buggy interface when combined with editing the post, which I'm doing extensively to make it easy for us to keep up with the status.

jblackburne and others added 20 commits Oct 6, 2016
  • …to categorical variables. Replaced the threshold attribute of SplitRecord and Node with SplitValue.
  • …hat defaults to -1 for each feature (indicating non-categorical).
  • …ediction with trees. Also introduced category caches for quick evaluation of categorical splits.
  • …causing all kinds of problems. Now safe_realloc requires the item size to be explicitly provided. Also, it can allocate arrays of pointers to any type by casting to void*.
jnothman (Member) commented Dec 26, 2018

Wow. Good on you for taking this on!

adrinjalali (Member Author) commented Dec 26, 2018

~~I assume the AppVeyor failure is unrelated to this PR.~~

adrinjalali (Member Author) commented Feb 4, 2019

Question: I introduced a BitSet in this commit: 532061f, but the object is a Python one, and it's not easy to have an array (or an array of arrays) of it.

My efforts so far have gotten me to a Stack Overflow question of mine, this disappointing question, and a solution using Boost.

I guess some options are:

  • cover exactly what we need in a separate header/cpp file and use it in our Cython code
  • use C++'s std::vector with a custom bitset written in C++
  • other Cython voodoo which I'm not aware of

@NicolasHug do you happen to have a good answer to this one?

NicolasHug (Contributor) commented Feb 4, 2019

It seems that the BitSet class is just a wrapper around a uint64, right? And you don't need to use it inside Python? In that case I would just directly create arrays of uint64 and translate the methods into pure cdef functions, pretty much like what you would do if you were writing pure C. You can make a typedef from uint64 to bitset if you want to be more explicit.
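
A minimal sketch of that suggestion (hypothetical names, not this PR's code): the bitset is a plain uint64, so C arrays of them are trivial to allocate, and the "methods" become nogil cdef functions.

```cython
from libc.stdint cimport uint64_t

ctypedef uint64_t bitset_t  # one bitset covers up to 64 categories

cdef inline void bitset_set(bitset_t* bs, int i) nogil:
    # mark category i as going to the left child
    bs[0] |= (<bitset_t>1) << i

cdef inline bint bitset_test(bitset_t bs, int i) nogil:
    # does category i go left?
    return (bs >> i) & 1
```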

I wanted to declare arrays of cdef classes as well, but it doesn't seem to be possible. In #12807 I have this class SplitInfo that I need to use from Python, and I had to create a split_info_struct that has the same attributes. I can create arrays of split_info_struct in C-mode, and when I need to manipulate such an object in Python I just wrap it into the class. It works in my case because:

  • I don't need arrays in Python, just single objects
  • It's a pure data class (no methods). If there were any methods, I think I would have to duplicate them: methods for the class, and equivalent functions for C. Which would be pretty annoying.

Hope that helps!
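
For reference, a stripped-down sketch of the struct-plus-class pattern described above (field names are simplified placeholders, not the actual #12807 code):

```cython
cdef struct split_info_struct:      # used in C arrays, no Python overhead
    double gain
    int feature_idx

cdef class SplitInfo:               # thin Python-visible mirror
    cdef public double gain
    cdef public int feature_idx

cdef SplitInfo wrap_split_info(split_info_struct s):
    # copy a single struct into the Python-usable class when needed
    cdef SplitInfo obj = SplitInfo()
    obj.gain = s.gain
    obj.feature_idx = s.feature_idx
    return obj
```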


Unrelated, but might be useful: I've found that some weird stuff happens when using 1d slices of 2d arrays. For example, having a function cdef f(int [:] 1d_slice): ... called with f(some_2d_array[index, :]) will generate strange Python interactions (it's related to the GIL, and probably also to the use of prange, so it might not affect you). My work-around was to change f's signature to cdef f(int [:, :] 2d_array, const unsigned int index): ... . Looking at the annotated HTML files (cython -a) will show you the Python interactions, and sometimes they appear in unexpected places!
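
A sketch of that work-around (hypothetical function, assuming the set-up described above): the function takes the full 2d memoryview plus a row index, so no 1d slice object is created at the call site.

```cython
cdef int row_sum(int[:, :] arr, unsigned int index) nogil:
    # operate on row `index` directly instead of receiving arr[index, :]
    cdef int j
    cdef int total = 0
    for j in range(arr.shape[1]):
        total += arr[index, j]
    return total
```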

adrinjalali (Member Author) commented Feb 25, 2019

Sprint discussion conclusions:

The new implementation in #12807, and the fact that it makes sense to have NOCATS there, deprioritize this PR.

The _splitter.pyx code can be simplified considerably, and presort can be removed from it, since presort is only used for GBC and the new GBC is much faster anyway.

h-vetinari commented Jun 17, 2019

Any update on this? :)

Also, is there an issue to track categorical features for the new (currently still experimental) implementation? It's currently a bit hard to find out what the status of categorical features is, and (possibly) to discuss how it should be tackled now that the implementation has changed. Readers of #9960 are redirected to this PR here.

adrinjalali (Member Author) commented Jun 17, 2019

@h-vetinari we'll soon work on NOCATS for HGB models; it'll hopefully be there by the next release.

amueller added the Needs work label Aug 6, 2019
NicolasHug (Contributor) commented Aug 7, 2019

@adrinjalali considering that we decided not to implement categorical support in the tree module, I think we can close this one and #4899?

adrinjalali (Member Author) commented Aug 8, 2019

We decided not to prioritize this one. I'm still planning to finish it, but first I want to clean up and simplify the splitter code, which we said we could do once HGBT was released, which it now is; I just haven't gotten to it. This PR also implements split policies which we probably won't have in HGBT (the random splitter, to be specific), and which work really well with extra trees.

amueller (Member) commented Nov 22, 2019

This paper is interesting: https://peerj.com/articles/6339/

Unfortunately it doesn't consider high-cardinality categorical variables, and it considers only a small set of datasets. But it shows that ordering categories once before building the trees might actually be a good strategy. That's interesting because it's waaay easier to implement ;)

Even if we also implement the NOCATS approach, we could provide an estimator that does the "once and for all" ordering for regression and binary classification. The authors also provide a similar heuristic for multi-class, which also sounds interesting.
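
A rough numpy sketch of that "once and for all" ordering (my illustration, not the paper's or the PR's code): order the categories by their mean target value, then treat the feature as ordinal.

```python
import numpy as np

def order_categories(codes, y):
    """Map raw category codes to their rank when sorted by mean target."""
    cats = np.unique(codes)
    means = np.array([y[codes == c].mean() for c in cats])
    order = np.argsort(means)          # categories sorted by mean target
    rank = np.empty_like(order)
    rank[order] = np.arange(len(cats))
    lookup = dict(zip(cats, rank))
    # an ordinal split on the result corresponds to a many-vs-many
    # categorical split on the original feature
    return np.array([lookup[c] for c in codes])

codes = np.array([0, 0, 1, 1, 2, 2])
y = np.array([5.0, 6.0, 1.0, 2.0, 9.0, 8.0])
print(order_categories(codes, y))      # [1 1 0 0 2 2]
```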

NicolasHug (Contributor) commented Nov 26, 2019

> Even if we also implement the NOCATS approach, we could provide an estimator that does the "once and for all" ordering for regression and binary classification.

What we plan to implement in the histogram-GBDTs is what the paper does, except that we sort categories at each split instead of at the beginning (the paper also discusses that).

CC @adrinjalali

NicolasHug (Contributor) commented Nov 26, 2019

Also, I'm super late to the party, but what is the benefit of NOCATS over one-hot-encoding the categories? As far as I understand, the strategy proposed here is equivalent to re-implementing the OHE within the tree logic. So what are the main benefits of NOCATS over OHE, apart from using less memory?

h-vetinari commented Nov 26, 2019

> Also, I'm super late to the party, but what is the benefit of NOCATS over one-hot-encoding the categories?

One-hot encoding only allows you to split off 1-vs-the-rest, whereas the optimal split for a categorical variable may be many-vs-many. For example, the optimal split at a given node may be:

{A, B, C, D, E, F, G} --> {B, C, F} vs. {A, D, E, G}

but one-hot encoding would only be able to yield one of

{A, B, C, D, E, F, G} --> {A} vs. {B, C, D, E, F, G}
{A, B, C, D, E, F, G} --> {B} vs. {A, C, D, E, F, G}
{A, B, C, D, E, F, G} --> {C} vs. {A, B, D, E, F, G}
{A, B, C, D, E, F, G} --> {D} vs. {A, B, C, E, F, G}
{A, B, C, D, E, F, G} --> {E} vs. {A, B, C, D, F, G}
{A, B, C, D, E, F, G} --> {F} vs. {A, B, C, D, E, G}
{A, B, C, D, E, F, G} --> {G} vs. {A, B, C, D, E, F}

This obviously affects the depth / number of splits that are necessary to get a similarly good result.
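
To put numbers on that (illustrative snippet, not PR code): with k categories there are 2^(k-1) − 1 distinct many-vs-many binary partitions, versus only the k one-vs-rest splits reachable via one-hot encoding.

```python
from itertools import combinations

cats = set("ABCDEFG")  # k = 7
partitions = [
    (set(left), cats - set(left))
    for r in range(1, len(cats) // 2 + 1)
    for left in combinations(sorted(cats), r)
]
# k is odd here, so no partition is counted twice (for even k the
# half-size subsets would double-count complements)
print(len(partitions))                                     # 63 == 2**6 - 1
print(sum(1 for left, _ in partitions if len(left) == 1))  # 7 one-vs-rest splits
```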

adrinjalali (Member Author) commented Nov 26, 2019

@NicolasHug this is only one benchmark, but at least on this dataset, there are benefits to using NOCATS: #12866 (comment)

amueller (Member) commented Dec 2, 2019

@NicolasHug

> What we plan to implement in the histogram-GBDTs is what the paper does, except that we sort categories at each split instead of at the beginning (the paper also discusses that).

That is the exact solution for regression and some binary cases. This PR mentions in the beginning "Heuristic methods to allow fast Breiman-like training for multi-class classification", which is basically what you're implementing.

Though I imagine you're doing this based on the unnormalized probabilities, which is one more level of indirection compared with the trees.

amueller (Member) commented Dec 2, 2019

Actually, because you're fitting a regression tree each time, the sorting may always be exact, depending on the loss. I need to think about that again and look at the formula.
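
For what it's worth, the classical result behind this (Fisher 1958; also in Breiman et al.'s CART book) says the sorting is exact for squared error: with categories ordered by mean response, the optimal binary partition is always a prefix split, so only k − 1 candidates need checking instead of 2^(k−1) − 1.

```latex
% Categories ordered so that \bar{y}_{c_{(1)}} \le \dots \le \bar{y}_{c_{(k)}}.
% For squared-error impurity, the best subset split is a prefix split:
\max_{\emptyset \ne S \subsetneq \{c_1,\dots,c_k\}} \Delta i(S)
  \;=\; \max_{1 \le j < k} \Delta i\bigl(\{c_{(1)},\dots,c_{(j)}\}\bigr)
```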
