optimize CPU inference with Array-Based Tree Traversal #11519
Conversation
Thank you for the inference optimization. Please unmark the "draft" status and ping me when the PR is ready for testing.
Cosmetic changes.
The next possible step would be to convert the trees into the array-based representation only once, rather than doing it for each block of data.
It sounds reasonable and will further improve performance (at the cost of increased memory consumption).
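To illustrate the trade-off being discussed, here is a rough, self-contained sketch (made-up names such as FlatTreeTop and CachedLayoutPredictor, not the actual xgboost classes): the flattened layout would be built once per tree when the predictor is set up and then reused for every block, spending extra memory to skip the per-block conversion.

```cpp
// Sketch only: one-time conversion of each tree's top levels, cached across
// data blocks, instead of re-flattening the trees for every block.
#include <cstddef>
#include <vector>

struct TreeModel {};  // stands in for the original tree representation

struct FlatTreeTop {  // flattened (array-layout) top levels of one tree
  std::vector<float> split_cond;
  std::vector<int> split_feature;
};

// Hypothetical conversion; the real routine would walk the tree's top levels.
FlatTreeTop FlattenTopLevels(TreeModel const& /*tree*/) { return {}; }

class CachedLayoutPredictor {
 public:
  explicit CachedLayoutPredictor(std::vector<TreeModel> const& trees) {
    // Pay the memory cost once: one FlatTreeTop per tree, kept alive.
    flat_tops_.reserve(trees.size());
    for (TreeModel const& tree : trees) {
      flat_tops_.push_back(FlattenTopLevels(tree));
    }
  }

  void PredictBlock(std::size_t /*block_size*/) const {
    // Each block reuses flat_tops_ directly; no per-block conversion here.
  }

 private:
  std::vector<FlatTreeTop> flat_tops_;
};

int main() {
  std::vector<TreeModel> trees(4);
  CachedLayoutPredictor predictor(trees);
  predictor.PredictBlock(64);
}
```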
Hi @trivialfis, the PR is ready for review.
cc @hcho3
Still trying to understand the code, will give it a try later. In the meantime, could you please craft some specific unit tests for the new inference algorithm?
*/
std::array<bst_node_t, kNodesCount + 1> nidx_in_tree_;

inline static bool IsLeaf(const RegTree& tree, bst_node_t nidx) {
No need for the `inline` keyword; the same goes for all following methods.
* We transform trees to the array layout for each block of data to avoid memory overheads.
* This makes the array layout inefficient for block_size == 1.
*/
const bool use_array_tree_layout = block_size > 1;
What happens if this is a small online inference call? The input size could be a few samples per call.
for (std::size_t i = 0; i < block_size; ++i) {
  bst_node_t nidx = 0;
  if constexpr (use_array_tree_layout) {
    nidx = p_nidx[i];
unused?
This PR introduces an optimization for CPU inference. For each tree, the top N levels are transformed into a compact array-based layout. This allows a branchless node-indexing rule: idx = 2 * idx + int(val < split_cond). To minimize memory overhead, the transformation from the standard tree structure to the array layout is performed on the fly for each block of data being processed. Even with these additional calculations, the improved data locality of the cache-friendly array layout speeds up inference by up to ~2x (1.4x on average).
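To make the indexing rule concrete, here is a minimal, self-contained sketch (hypothetical names such as ArrayTreeTop and TraverseTop, not the PR's actual data structures): the top levels of a tree are flattened into heap-ordered arrays, and each level is descended with the branchless update idx = 2 * idx + int(val < split_cond).

```cpp
// Minimal sketch, not the PR's implementation: the top kDepth levels of one
// tree are flattened into arrays indexed like an implicit heap (root at 1,
// children of node i at 2*i and 2*i + 1), so descending a level is branchless.
#include <iostream>
#include <vector>

constexpr int kDepth = 3;            // number of flattened levels
constexpr int kNodes = 1 << kDepth;  // 1-based storage, slot 0 unused

struct ArrayTreeTop {
  std::vector<float> split_cond = std::vector<float>(kNodes, 0.0f);
  std::vector<int> split_feature = std::vector<int>(kNodes, 0);
  // Layout convention in this sketch: the child taken when (val < split_cond)
  // is stored at 2*i + 1, the other child at 2*i.
};

// Walks the flattened top levels for one row and returns the slot reached in
// the deepest flattened level; a full implementation would then map this slot
// back to a node of the original tree and continue the normal traversal.
int TraverseTop(ArrayTreeTop const& tree, std::vector<float> const& row) {
  int idx = 1;
  for (int level = 0; level + 1 < kDepth; ++level) {
    float val = row[tree.split_feature[idx]];
    idx = 2 * idx + static_cast<int>(val < tree.split_cond[idx]);  // branchless
  }
  return idx;
}

int main() {
  ArrayTreeTop tree;  // toy tree: every node splits on feature 0 at 0.5
  for (int i = 1; i < kNodes; ++i) {
    tree.split_feature[i] = 0;
    tree.split_cond[i] = 0.5f;
  }
  std::vector<float> row = {0.3f};  // 0.3 < 0.5, so the "true" child is taken twice
  std::cout << "slot in deepest flattened level: " << TraverseTop(tree, row) << "\n";
}
```

The speedup reported above comes from the improved data locality of this kind of compact, contiguous layout compared with pointer-chasing through the standard node structure.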