
NOCATS: Categorical splits for tree-based learners (ctnd.) #12866

Open · wants to merge 59 commits into base: main

Conversation

@adrinjalali (Member) commented Dec 26, 2018

This PR continues the work of #4899. So far I've merged master into the PR, made it compile, and got the tests running. There are several issues that need to be fixed; I'll update this list as I encounter them. Also, not all of these items are necessarily open: I've only collected them from the comments on the original PR, and still need to check whether they're already addressed, or address them.

  • merge master into the PR (done)
  • sparse tests pass (done)
    • The code is supposed to behave the same as the status-quo implementation when categories are not passed, but right now the tests related to sparse data fail.
    • EDIT: The tests pass if we compare floats with almost_equal
  • LabelEncoder -> CategoricalEncoder (done)
    • Preprocessing is not a part of NOCATS anymore.
  • Is the maximum number of random generations 20 or 40? (done)
    • It's actually 60
  • Don't quantize features automatically (done)
  • check the category count limits for given data. (done)
  • add a benchmark
  • add tests (right now only invalid inputs are tested)
    • tree/tests done
    • ensemble/tests done
  • benchmark against master
  • add an example with plots
  • check numpy upgrade related issues (we've upgraded our numpy requirement in the meantime)
  • run some benchmarks with a simple integer coding of the features (with arbitrary ordering)
  • add cat_split to NODE_DTYPE once joblib.hash can handle it (padded struct)

Closes #4899

Future Work: these are the future work items we already know of (i.e. outside the scope of this PR):

  • Heuristic methods to allow fast Breiman-like training for multi-class classification
  • export to graphviz
  • One-hot emulation using the NOCATS machinery
  • support sparse input
  • handle categories as their unique values instead of [0, max(feature)]

P.S. I moved away from a "task list" due to the extremely buggy interface when combined with editing the post, which I'm doing extensively to make it easy for us to keep up with the status.

jblackburne and others added 20 commits February 11, 2017 12:43
…causing all kinds of problems. Now safe_realloc requires the item size to be explicitly provided. Also, it can allocate arrays of pointers to any type by casting to void*.
…to categorical variables. Replaced the threshold attribute of SplitRecord and Node with SplitValue.
…hat defaults to -1 for each feature (indicating non-categorical).
…ediction with trees. Also introduced category caches for quick evaluation of categorical splits.
@jnothman (Member)

Wow. Good on you for taking this on!

@adrinjalali (Member, Author) commented Dec 26, 2018

~~I assume the AppVeyor failure is unrelated to this PR.~~

Base automatically changed from master to main January 22, 2021 10:50
@nehargupta

Hello,
I just wanted to check in and see whether a categorical implementation of decision trees might still happen in a future iteration? My team has been checking on this periodically, as we are hoping to see it sometime. We developed an open-source package that uses feature selection in an intermediate step of our algorithm, so relying on OneHotEncoder is not ideal. I might be able to assist in some way, if needed, although I would be a new contributor to scikit-learn. Thanks :)

@SinaDBMS commented May 29, 2021

@NicolasHug

Also, I'm super late to the party, but what is the benefit of NOCATS over one-hot encoding the categories?
As far as I understand, the strategy proposed here is equivalent to re-implementing OHE within the tree logic. So what are the main benefits of NOCATS over OHE, apart from using less memory?

Another drawback of one-hot encoding arises when the categorical feature to be encoded has many possible values, which results in a large set of one-hot features. If a tree randomly picks a subset of the features for splitting, these one-hot-encoded features are then more likely to be picked than the original features.
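To make that concrete, here is a small sketch (with made-up data) of how a single high-cardinality column balloons under one-hot encoding, so that a random feature subset is dominated by columns derived from that one original feature:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up data: one categorical column with 50 distinct values, 200 samples.
cat = (np.arange(200) % 50).reshape(-1, 1)

# One-hot encoding turns the single column into 50 binary columns.
X_ohe = OneHotEncoder().fit_transform(cat)
print(X_ohe.shape)  # (200, 50)
```

With `max_features="sqrt"`, a random forest would then sample from these 50 mostly-zero binary columns alongside every other feature, which is the selection bias described above.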

@Jing25 commented Nov 18, 2021

Hi, I'm wondering whether random forests support categorical data yet?

@AndreaTrucchia

Hello, I would like to inquire about the status of this branch. My team would really benefit from it, and would be freed from falling back to R every now and then.

@adrinjalali (Member, Author)

@AndreaTrucchia have you checked HistGradientBoosting* instead?

@AndreaTrucchia

@adrinjalali I am checking it out; too bad most of my work concerns random forests. However, I think I can give it a try for studies that revolve around just the effect of different categories on the predicted label. Thanks a lot.

@adrinjalali (Member, Author)

Out of curiosity, do the preprocessing techniques we have to handle categorical variables not satisfy your needs in a Pipeline?

@AndreaTrucchia

Dear @adrinjalali, within a scikit-learn environment I tend to one-hot encode the categorical variables, with very good performance (see e.g. https://www.mdpi.com/2571-6255/5/1/30). However, in the R style of treating categorical variables (randomForest package), I can use the partialPlot function, which can rank the variables from, say, 1 ("this category enhances the classification of label A") to -1 ("this category strongly disagrees with the classification of label A").
I hope I was clear enough :)

@adrinjalali (Member, Author)

Isn't partialPlot the partial dependence plots that we have? (https://scikit-learn.org/stable/modules/partial_dependence.html#partial-dependence)

You could pass a pipeline with the OneHotEncoder included in it and get the partial dependence (I think).
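For instance, something along these lines should work (a rough sketch with synthetic data; the column indices and the forest model are placeholders):

```python
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
# Synthetic data: column 0 holds category codes 0..3, column 1 is numeric.
X = np.column_stack([rng.integers(0, 4, 300), rng.normal(size=300)])
y = (X[:, 0] >= 2).astype(int)  # label driven by the categorical column

# One-hot encode column 0 inside the pipeline, pass column 1 through.
pre = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), [0]), remainder="passthrough"
)
model = make_pipeline(pre, RandomForestClassifier(random_state=0)).fit(X, y)

# Partial dependence is computed on the *raw* column, before encoding.
pd_result = partial_dependence(model, X, features=[0], kind="average")
print(pd_result["average"])  # one averaged prediction per category code
```

Because the encoder lives inside the pipeline, the dependence is reported per original category rather than per one-hot column, which is close to what `partialPlot` gives in R.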

@NicolasHug (Member) commented Feb 24, 2022

That would probably be a different thing. Our PDP support is only defined for regressors, not classifiers: "partial dependence" as we support it is defined as the expectation of a continuous target.
EDIT: never mind, I'm wrong; it can rely on the decision_function.

@QianqianHan96

How can we use this one, if I may ask a dumb question? Is it a function in scikit-learn? Thanks a lot!

@adrinjalali (Member, Author)

@AliciaPython it's not included. You can check out this branch, compile the package, and install it locally, but I wouldn't recommend that, since it's quite outdated compared to the main branch at this point. This would require some substantial work to get in, if it happens at all, and I'm not sure it will. You're probably better off using the HistGradientBoosting* models.

@lcrmorin commented Apr 5, 2023

If a simple tree is needed, would it be a good idea to use HistGradientBoosting with max_iter=1? Would that effectively give a single tree, roughly equivalent to a plain decision tree model?

@adrinjalali (Member, Author)

@lcrmorin if you don't lose too much information by quantizing your features (which is what HistGradientBoosting does), then the two might be similar, I think.

@bmreiniger (Contributor)

@lcrmorin cc @adrinjalali
Also set the learning rate to 1, and early stopping to False, to prevent a validation set from being split out.

I think there can still be some significant differences: the first tree is fit to the pseudo-residual (the gradient of the loss function, sometimes with Hessian information too) from an initial prediction (see also the init parameter of the vanilla GBMs). The splits chosen might be the same as for an ordinary tree, though that might depend on the loss function chosen (?); and the leaf values will certainly be different.

@adam2392 (Contributor)

Any chance anyone has the link to the original issue, or a summarizing comment for why this is stalled or difficult?

My impression is that since R and other packages have a similar feature, maybe there's some friction here due to just the internals of the sklearn tree API or something?

I realize this may just not get in, but I want to see what some of the ideas were, to see whether I can implement a robust solution in scikit-tree as just a separate splitter.

Thanks!

@adrinjalali (Member, Author)

At some point in the past this PR was in pretty good shape, but I was asked to provide more benchmarks and more evidence that it is good enough. There was also the issue of trying to simplify the existing codebase so that this could be introduced more easily, and also cover the sparse case. At some point I didn't have the time to spend on this anymore (after quite some time of working on it almost every day), so it was left behind. At this point, getting back to it would cost me more than I have to spare.

At the same time, HistGradientBoosting* now natively supports categorical features, which also deprioritized this work.

So, I'm not sure.

@adam2392 (Contributor) commented Jun 23, 2023

I see. Thanks for the update @adrinjalali!

If you don't mind, I will probably incorporate this into our sklearn fork, refactor it to account for the most up-to-date Cython changes in sklearn:main, and make the API more similar to the categorical API in HistGradientBoosting*. Is that fine with you, since you're the author of this work?

It sounds like this kind of feature is okay with the maintainers for inclusion. The main bottleneck is some significant benchmarking(?) (and, I suppose, the missing-value work being carried out by Thomas). If so, I'll post back here with a link to the commit from our fork that implements this, so a PR in line with sklearn:main can be carried out more easily.

@adrinjalali (Member, Author)

I really don't like the idea of the fork as I've mentioned before, but sure, you can have it there.

@adam2392 (Contributor) commented Jun 23, 2023

> I really don't like the idea of the fork as I've mentioned before, but sure, you can have it there.

Agreed... I'm trying to pipe as many features as possible downstream to scikit-tree, but for this specific one we want to enable categorical splits for all possible tree models without diverging from sklearn's Cython code.

Therefore it has to be done at the Python BaseDecisionTree and Cython splitter level, unfortunately. I'm not a huge fan of codebases that hard-fork the sklearn Cython code. At the moment, I'll just have to eat the cost of rebasing a submodule, or of hard-forking and then figuring out how to re-align the code.

Thanks for the feedback and updates!

self.n_nodes = 0
self.bits = NULL

def _dealloc__(self):


Suggested change:
- def _dealloc__(self):
+ def __dealloc__(self):

adam2392 added a commit to neurodata/scikit-learn that referenced this pull request Jul 20, 2023

#### Reference Issues/PRs
Helps bring in fork wrt changes in
scikit-learn#12866



---------

Signed-off-by: Adam Li <adam2392@gmail.com>
Projects: Categorical (To do)