Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add feature to print forest shape in FIL upon importing #3763

Merged
merged 30 commits into from
May 24, 2021

Conversation

levsnv
Copy link
Contributor

@levsnv levsnv commented Apr 18, 2021

When benchmarking forests, it's important to know what we're benchmarking, regardless of the source of the forest. E.g. are we still importing an equivalent-sized forest after updating xgboost version?
This is also specific to understanding FIL performance on a model, as opposed to generic forest statistics.
This allows to obtain a couple of numbers from FIL, both pertaining to the model and the chosen storage type, post-import size, etc.
A sample output looks like this:

Depth hist:
depth	branches	leaves	nodes
  0	     5	     0	      5
  1	    10	     0	     10
  2	    20	     0	     20
  3	    19	    21	     40
  4	    15	    23	     38
  5	    12	    18	     30
  6	     2	    22	     24
  7	     0	     4	      4
Total: branches: 83 leaves: 88 nodes: 171
Avg nodes per tree: 34.2
Leaf depth: min: 3 avg 4.6 max 7
Depth histogram fingerprint: 4752b3a7893d5e22
DENSE model size 0.01 MB

The feature is breaking because one new C parameter is added to treelite_params_t. Python API is not broken, because it sets the backward-compatible defaults.

@levsnv levsnv requested review from a team as code owners April 18, 2021 06:14
@github-actions github-actions bot added CUDA/C++ Cython / Python Cython or Python issue labels Apr 18, 2021
@levsnv levsnv requested a review from canonizer April 18, 2021 06:15
@levsnv levsnv added 3 - Ready for Review Ready for review by team breaking Breaking change CUDA / C++ CUDA issue feature request New feature or request labels Apr 18, 2021
@hcho3
Copy link
Contributor

hcho3 commented Apr 18, 2021

Should this feature be incorporated into Treelite?

@levsnv
Copy link
Contributor Author

levsnv commented Apr 19, 2021

Yes, that would be great. My understanding is treelite would only have a model dumping function, and simpler stats? In which case, the code to make leaf/branch depth histogram would need to live somewhere. Hence, I thought FIL would serve.
Also, some parameters, like post-import format and size (and potentially perf tuning parameters in the future) are specific to FIL and would need an API to dump anyway.
What do you think?

@levsnv
Copy link
Contributor Author

levsnv commented Apr 19, 2021

I also don't know whether the parameters should be part of treelite_params_t instead of naked arguments to from_treelite()

cpp/include/cuml/fil/fil.h Outdated Show resolved Hide resolved
cpp/src/fil/fil.cu Show resolved Hide resolved
int min_depth = -1, leaves_times_depth = 0, total_branches = 0,
total_leaves = 0;
// 64-bit Fowler/Noll/Vo
size_t fingerprint = 14695981039346656037l;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fingerprinting isn't necessary here.

However, if you still want to do it, I would suggest the following approach:

  1. Build the data structure containing the forest statistics.
  2. Convert it into an array of bytes by successively appending the bytes representing each integer in the data structure.
  3. Compute a standard hash function on the array. A standard cryptographic hash function is definitely preferred.

In this way, you can split this code into multiple functions, each performing a single piece of work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to do this, because when sweeping models for benchmarking, it let me quickly understand which models are different and which are the same. Same with putting the results into a spreadsheet - easier to compare runs to each other if the fingerprint is just equal.

cpp/src/fil/fil.cu Outdated Show resolved Hide resolved
cpp/src/fil/fil.cu Outdated Show resolved Hide resolved
cpp/src/fil/fil.cu Outdated Show resolved Hide resolved
cpp/src/fil/fil.cu Outdated Show resolved Hide resolved
cpp/src/fil/fil.cu Outdated Show resolved Hide resolved
cpp/test/sg/fil_test.cu Outdated Show resolved Hide resolved
python/cuml/fil/fil.pyx Show resolved Hide resolved
@dantegd dantegd added 4 - Waiting on Author Waiting for author to respond to review and removed 3 - Ready for Review Ready for review by team labels Apr 25, 2021
Co-authored-by: Andy Adinets <adinetz@gmail.com>
@levsnv
Copy link
Contributor Author

levsnv commented Apr 28, 2021

Waiting until #3800 is merged to push the conflict-resolved version

@dantegd dantegd added the 4 - Waiting on Author Waiting for author to respond to review label May 13, 2021
@levsnv levsnv requested a review from a team as a code owner May 19, 2021 04:20
@github-actions github-actions bot removed the gpuCI gpuCI issue label May 19, 2021
@ajschmidt8 ajschmidt8 removed the request for review from a team May 19, 2021 20:50
@ajschmidt8
Copy link
Member

Removing ops-codeowners from the required reviews since it doesn't seem there are any file changes that we're responsible for. Feel free to add us back if necessary.

@levsnv levsnv added 3 - Ready for Review Ready for review by team and removed 4 - Waiting on Author Waiting for author to respond to review labels May 19, 2021
v21.06 Release automation moved this from PR-WIP to PR-Needs review May 21, 2021
Copy link
Member

@dantegd dantegd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had 1 very minor comment about a header and one question/comment regarding how the API looks like from a python perspective

cpp/include/cuml/fil/fnv_hash.h Outdated Show resolved Hide resolved
python/cuml/test/test_fil.py Outdated Show resolved Hide resolved
@levsnv levsnv requested a review from dantegd May 21, 2021 21:57
@codecov-commenter
Copy link

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.06@1ea479b). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.06    #3763   +/-   ##
===============================================
  Coverage                ?   85.41%           
===============================================
  Files                   ?      227           
  Lines                   ?    17317           
  Branches                ?        0           
===============================================
  Hits                    ?    14791           
  Misses                  ?     2526           
  Partials                ?        0           
Flag Coverage Δ
dask 48.95% <0.00%> (?)
non-dask 77.35% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.


Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1ea479b...0d2a420. Read the comment docs.

v21.06 Release automation moved this from PR-Needs review to PR-Reviewer approved May 24, 2021
@dantegd
Copy link
Member

dantegd commented May 24, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 79140bb into rapidsai:branch-21.06 May 24, 2021
v21.06 Release automation moved this from PR-Reviewer approved to Done May 24, 2021
@levsnv levsnv deleted the print-model-shape branch May 24, 2021 21:47
wphicks added a commit to triton-inference-server/fil_backend that referenced this pull request May 26, 2021
Include cuML header directory that is currently nested one layer too
deep (to be fixed by rapidsai/cuml#3901)

Ignore unused variable warnings due to unused variable in fil.h
introduced by rapidsai/cuml#3763
wphicks added a commit to triton-inference-server/fil_backend that referenced this pull request May 28, 2021
* Temporary fix for upstream cuML issues

Include cuML header directory that is currently nested one layer too
deep (to be fixed by rapidsai/cuml#3901)

Ignore unused variable warnings due to unused variable in fil.h
introduced by rapidsai/cuml#3763

* Remove extra include path following upstream fix

* Update to CalVer for cuML
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this pull request Oct 9, 2023
When benchmarking forests, it's important to know what we're benchmarking, regardless of the source of the forest. E.g. are we still importing an equivalent-sized forest after updating xgboost version?
This is also specific to understanding FIL performance on a model, as opposed to generic forest statistics.
This allows to obtain a couple of numbers from FIL, both pertaining to the model and the chosen storage type, post-import size, etc.
A sample output looks like this:
```
Depth hist:
depth	branches	leaves	nodes
  0	     5	     0	      5
  1	    10	     0	     10
  2	    20	     0	     20
  3	    19	    21	     40
  4	    15	    23	     38
  5	    12	    18	     30
  6	     2	    22	     24
  7	     0	     4	      4
Total: branches: 83 leaves: 88 nodes: 171
Avg nodes per tree: 34.2
Leaf depth: min: 3 avg 4.6 max 7
Depth histogram fingerprint: 4752b3a7893d5e22
DENSE model size 0.01 MB
```

The feature is breaking because one new C parameter is added to `treelite_params_t`. Python API is not broken, because it sets the backward-compatible defaults.

Authors:
  - https://github.com/levsnv

Approvers:
  - Andy Adinets (https://github.com/canonizer)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#3763
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team breaking Breaking change CMake CUDA / C++ CUDA issue CUDA/C++ Cython / Python Cython or Python issue feature request New feature or request
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

7 participants