-
Notifications
You must be signed in to change notification settings - Fork 514
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIL to infer categorical forests and generate them in C++ tests #4092
FIL to infer categorical forests and generate them in C++ tests #4092
Conversation
…osed to forest construction one
Co-authored-by: Andy Adinets <adinetz@gmail.com>
3m50s to compile with the rest 1m30s to compile alone
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved provided that the comments are addressed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really great! Love the overall approach, and a lot of the implementation details make this clear and easy to reason about. I've mostly made note of style things and just a couple places where it seems like algorithms could make things easier and less error-prone.
cpp/src/fil/fil.cu
Outdated
cat_sets_owner::cat_sets_owner(const std::vector<cat_feature_counters>& cf) | ||
{ | ||
max_matching.resize(cf.size()); | ||
std::size_t bits_size = 0; | ||
// feature ID | ||
for (std::size_t fid = 0; fid < cf.size(); ++fid) { | ||
RAFT_EXPECTS( | ||
cf[fid].max_matching >= -1, "@fid %zu: max_matching invalid (%d)", fid, cf[fid].max_matching); | ||
RAFT_EXPECTS(cf[fid].n_nodes >= 0, "@fid %zu: n_nodes invalid (%d)", fid, cf[fid].n_nodes); | ||
|
||
max_matching[fid] = cf[fid].max_matching; | ||
bits_size += | ||
categorical_sets::sizeof_mask_from_max_matching(max_matching[fid]) * cf[fid].n_nodes; | ||
|
||
RAFT_EXPECTS(bits_size <= INT_MAX, | ||
"@fid %zu: cannot store %zu categories given `int` offsets", | ||
fid, | ||
bits_size); | ||
} | ||
bits.resize(bits_size); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know we previously discussed this function in a different context, and now that I see it in situ, I think we don't even need a zip iterator, since we're constructing all of the relevant containers as we go. How about something like the following (composed in Github; excuse any formatting issues)?
cat_sets_owner::cat_sets_owner(const std::vector<cat_feature_counters>& cf) | |
{ | |
max_matching.resize(cf.size()); | |
std::size_t bits_size = 0; | |
// feature ID | |
for (std::size_t fid = 0; fid < cf.size(); ++fid) { | |
RAFT_EXPECTS( | |
cf[fid].max_matching >= -1, "@fid %zu: max_matching invalid (%d)", fid, cf[fid].max_matching); | |
RAFT_EXPECTS(cf[fid].n_nodes >= 0, "@fid %zu: n_nodes invalid (%d)", fid, cf[fid].n_nodes); | |
max_matching[fid] = cf[fid].max_matching; | |
bits_size += | |
categorical_sets::sizeof_mask_from_max_matching(max_matching[fid]) * cf[fid].n_nodes; | |
RAFT_EXPECTS(bits_size <= INT_MAX, | |
"@fid %zu: cannot store %zu categories given `int` offsets", | |
fid, | |
bits_size); | |
} | |
bits.resize(bits_size); | |
} | |
template <typename Iter> | |
cat_sets_owner(Iter cf_begin, Iter cf_end) : | |
max_matching{}, | |
bits([]() { | |
auto bit_size = std::size_t{}; | |
// Populate max_matching and compute the size of the bits vector along the way. | |
// Doing both of these together avoids the overhead of loading the cf values into cache twice. | |
std::transform(cf_begin, cf_end, std::back_inserter(max_matching), [](auto& cf_val) { | |
auto fid = max_matching.size(); // Tracking feature id for error messages | |
RAFT_EXPECTS( | |
cf_val.max_matching >= -1, "@fid %zu: max_matching invalid (%d)", fid, cf_val.max_matching); | |
RAFT_EXPECTS(cf_val.n_nodes >= 0, "@fid %zu: n_nodes invalid (%d)", fid, cf_val.n_nodes); | |
bit_size += categorical_sets::sizeof_mask_from_max_matching(cf_val.max_matching) * cf_val.n_nodes; | |
RAFT_EXPECTS(bits_size <= INT_MAX, | |
"@fid %zu: cannot store %zu categories given `int` offsets", | |
fid, | |
bits_size); | |
return cf_val.max_matching; | |
}); | |
return bit_size; | |
}()) {} |
We'd obviously want to move this to the header.
I see a few advantages to an approach like this:
- We are clearly initializing each of our member variables in the initializer list. No chance that
resize
gets called on an uninitialized vector. - We remain agnostic to the type of container we are constructing from.
- Using
std::transform
clearly communicates that we are constructingmax_matching
via a one-to-one transformation from our input elements.
The annoying bit is obviously the fact that we're computing bit_size
as a side-effect, but if we comment on that to make it clear that it's for performance reasons, I think this should be reasonably clear to future developers.
What do you think? Do we like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Skipping the indices is neat, but a couple of things are confusing.
If we don't trust this line from cppreference.com, then what do we trust when we read the code? "Before the compound statement that forms the function body of the constructor begins executing, initialization of all direct bases, virtual bases, and non-static data members is finished." AFAIK it's not an experimental feature.
Relying instead on the order of member declaration is IMHO more error-prone, since it's not specified right
next to it, let alone in the same block. It's also odd that max_matching
is changed while initializing bits
. Not a fan of sneaking statements into expressions in a UB-prone context.
In this particular case, it's actually a bug, since I happened to declare bits
first.
I'll update the code using some of your ideas, maybe it'll become clear enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you open an issue to do a follow up discussion of this point?
} | ||
|
||
// proto inner node | ||
#define NODE(...) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any particular reason we need a macro here? If we really must, could you add a comment explaining why it's necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tagged you in a thread. To repeat my impression here:
without reflection, won't we need to repeat the list of members in the struct definition and the recursive template that defines the non-defaults? E.g. one that passes the default-initialized struct through to the next instantiation of itself, while editing one field, controlled by a switch statement, which checks an enum that defines the field name? Or do we generate the modifier function that accepts such enum right near the definition, and define the other template later?
struct ProtoCategorySets { | ||
// each bit set for each feature id is in a separate vector | ||
// read each uint8_t from right to left, and the vector(s) - from left to right | ||
std::vector<std::vector<uint8_t>> bits; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::vector<std::vector<uint8_t>> bits; | |
std::vector<std::vector<uint8_t>> bits{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what changes when we use curly braces to invoke the default constructor?
// each bit set for each feature id is in a separate vector | ||
// read each uint8_t from right to left, and the vector(s) - from left to right | ||
std::vector<std::vector<uint8_t>> bits; | ||
std::vector<int> max_matching; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::vector<int> max_matching; | |
std::vector<int> max_matching{}; |
Codecov Report
@@ Coverage Diff @@
## branch-21.10 #4092 +/- ##
===============================================
Coverage ? 86.07%
===============================================
Files ? 231
Lines ? 18637
Branches ? 0
===============================================
Hits ? 16042
Misses ? 2595
Partials ? 0
Flags with carried forward coverage won't be shown. Click here to find out more. Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving, the remaining discussions for potential improvements will be moved to an issue and done in follow up PRs
Improvements done in PR, except for 2 that will be moved to issue discussion so we can merge this PR and unblock #4173
@gpucibot merge |
…dsai#4092) This doesn't include treelite import (export). That will come in rapidsai#4041 Authors: - https://github.com/levsnv Approvers: - Andy Adinets (https://github.com/canonizer) - Robert Maynard (https://github.com/robertmaynard) - Dante Gama Dessavre (https://github.com/dantegd) URL: rapidsai#4092
This doesn't include treelite import (export). That will come in #4041