ENH Adds Categorical Support to Histogram Gradient Boosting #16909
In this example?
Are the categorical features useful for this classification task? It may be worth adding another example where the categorical features are dropped: training should be faster, but predictive performance should be worse. Dropping categorical features is another (naive) way to deal with them.
For this dataset, the categories do not matter as much. So I will be on the lookout for a nicer dataset.
indentation needs a space
Any reason to use `long`? We usually use `unsigned int`.
To match the dtype of `orig_feature_to_binned_cat`, but this will change once we no longer bin in predict.
does lightgbm also do that? I.e. bin during predict, and rely on a bitset of internally encoded features?
lightgbm does not bin during predict. It has a dynamically sized bitset that encodes the input categorical features, so it can accept a categorical feature of any cardinality.
Currently, the implementation also accepts a categorical feature of any cardinality. If the cardinality is higher than `max_bins`, then only the top `max_bins` categories are kept, ranked by frequency, and the rest are considered missing. In this way, it also handles infrequent categories. This option is more flexible, but it means I have to bin in predict, which is disappointing.
A simpler alternative would be to restrict the input to be "ints" in the range ~`[0, max_bins]`, with anything outside of that range considered missing. This would not do anything special to handle infrequent categories, but it would simplify some of the code.
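To make the two options above concrete, here is a minimal pure-Python sketch. It is not the code from this PR: the function names, the `MISSING` sentinel, and the inclusive upper bound in `restrict_to_range` are all assumptions chosen for illustration (the thread leaves the exact range approximate).

```python
from collections import Counter

# Hypothetical sentinel standing in for "treated as missing".
MISSING = -1


def fit_category_bins(values, max_bins):
    """Option 1, fit time: keep the max_bins most frequent categories.

    Returns a dict mapping each kept category to a bin index, with the
    most frequent category at index 0.
    """
    counts = Counter(values)
    kept = [cat for cat, _ in counts.most_common(max_bins)]
    return {cat: bin_idx for bin_idx, cat in enumerate(kept)}


def bin_categories(values, mapping):
    """Option 1, predict time: anything not kept at fit time is missing."""
    return [mapping.get(v, MISSING) for v in values]


def restrict_to_range(values, max_bins):
    """Option 2: accept ints in [0, max_bins] as-is, no binning step.

    The inclusive upper bound is an assumption; values outside the range
    are treated as missing.
    """
    return [
        v if isinstance(v, int) and 0 <= v <= max_bins else MISSING
        for v in values
    ]


train = ["a", "b", "a", "c", "a", "b", "d"]
mapping = fit_category_bins(train, max_bins=2)
binned = bin_categories(["a", "d", "b", "z"], mapping)
restricted = restrict_to_range([0, 3, 7, -2], max_bins=3)
```

The trade-off discussed in the thread shows up directly: option 1 needs the fitted `mapping` again at predict time, while option 2 needs no fitted state but leaves infrequent categories untouched.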
Thoughts @ogrisel ?
Personally, for a first version, I would prefer keeping things as simple as possible. As such, expecting ints in `[0, max_bins]` sounds reasonable.
I can go either way on this. I spoke to @amueller about this, who seems to prefer the current approach of "binning categories during predict".