Add feature to print forest shape in FIL upon importing #3763

levsnv · 2021-04-18T06:14:48Z

When benchmarking forests, it's important to know what we're benchmarking, regardless of the source of the forest. For example, are we still importing an equivalent-sized forest after updating the XGBoost version?
This is also specific to understanding FIL performance on a model, as opposed to generic forest statistics.
This PR makes it possible to obtain a few numbers from FIL at import time: some pertain to the model itself, others to the chosen storage type, post-import size, etc.
A sample output looks like this:

Depth hist:
depth	branches	leaves	nodes
  0	     5	     0	      5
  1	    10	     0	     10
  2	    20	     0	     20
  3	    19	    21	     40
  4	    15	    23	     38
  5	    12	    18	     30
  6	     2	    22	     24
  7	     0	     4	      4
Total: branches: 83 leaves: 88 nodes: 171
Avg nodes per tree: 34.2
Leaf depth: min: 3 avg 4.6 max 7
Depth histogram fingerprint: 4752b3a7893d5e22
DENSE model size 0.01 MB
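
For readers who want to see how numbers like these can be derived, here is a minimal sketch of a depth histogram computed over a toy forest. It is not the FIL implementation: `Node`, `Tree`, `DepthLevel`, `ForestShape`, `visit`, and `forest_shape` are hypothetical names invented for this illustration, and the child-index node layout is an assumption that does not match FIL's or Treelite's real node types.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal node: child indices into the owning tree, -1 for none.
struct Node {
  int left;
  int right;
};
using Tree = std::vector<Node>;  // node 0 is the root

inline bool is_leaf(const Node& n) { return n.left < 0; }

// One row of the depth histogram.
struct DepthLevel {
  int branches = 0;
  int leaves = 0;
};

// Aggregate statistics over the whole forest.
struct ForestShape {
  std::vector<DepthLevel> levels;  // indexed by depth
  std::size_t num_trees = 0;
  long total_branches = 0;
  long total_leaves = 0;
  long leaves_times_depth = 0;  // sum of the depths of all leaves
  int min_leaf_depth = -1;      // -1 until the first leaf is seen
  int max_leaf_depth = -1;

  double avg_nodes_per_tree() const {
    return num_trees ? static_cast<double>(total_branches + total_leaves) / num_trees : 0.0;
  }
  double avg_leaf_depth() const {
    return total_leaves ? static_cast<double>(leaves_times_depth) / total_leaves : 0.0;
  }
};

// Depth-first walk that records each node in the row for its depth.
void visit(const Tree& tree, int node, int depth, ForestShape& shape) {
  assert(node >= 0 && static_cast<std::size_t>(node) < tree.size());
  if (shape.levels.size() <= static_cast<std::size_t>(depth)) {
    shape.levels.resize(depth + 1);
  }
  const Node& n = tree[node];
  if (is_leaf(n)) {
    ++shape.levels[depth].leaves;
    ++shape.total_leaves;
    shape.leaves_times_depth += depth;
    shape.min_leaf_depth =
      shape.min_leaf_depth < 0 ? depth : std::min(shape.min_leaf_depth, depth);
    shape.max_leaf_depth = std::max(shape.max_leaf_depth, depth);
  } else {
    ++shape.levels[depth].branches;
    ++shape.total_branches;
    visit(tree, n.left, depth + 1, shape);
    visit(tree, n.right, depth + 1, shape);
  }
}

ForestShape forest_shape(const std::vector<Tree>& forest) {
  ForestShape shape;
  shape.num_trees = forest.size();
  for (const Tree& tree : forest) {
    if (!tree.empty()) visit(tree, 0, 0, shape);
  }
  return shape;
}
```

The sample output above is consistent with this bookkeeping: 171 nodes over 5 trees gives 34.2 nodes per tree, and the leaf depths sum to 21·3 + 23·4 + 18·5 + 22·6 + 4·7 = 405, so the average leaf depth is 405 / 88 ≈ 4.6.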

This change is breaking because one new C parameter is added to treelite_params_t. The Python API is not broken, because it sets backward-compatible defaults.

hcho3 · 2021-04-18T08:39:49Z

Should this feature be incorporated into Treelite?

levsnv · 2021-04-19T06:14:46Z

Yes, that would be great. My understanding is treelite would only have a model dumping function, and simpler stats? In which case, the code to make leaf/branch depth histogram would need to live somewhere. Hence, I thought FIL would serve.
Also, some parameters, like post-import format and size (and potentially perf tuning parameters in the future) are specific to FIL and would need an API to dump anyway.
What do you think?

levsnv · 2021-04-19T06:16:01Z

I also don't know whether the parameters should be part of treelite_params_t instead of naked arguments to from_treelite()

cpp/include/cuml/fil/fil.h

cpp/src/fil/fil.cu

canonizer · 2021-04-24T01:44:29Z

cpp/src/fil/fil.cu

+  int min_depth = -1, leaves_times_depth = 0, total_branches = 0,
+      total_leaves = 0;
+  // 64-bit Fowler/Noll/Vo
+  size_t fingerprint = 14695981039346656037l;


Fingerprinting isn't necessary here.

However, if you still want to do it, I would suggest the following approach:

1. Build the data structure containing the forest statistics.
2. Convert it into an array of bytes by successively appending the bytes representing each integer in the data structure.
3. Compute a standard hash function on the array. A standard cryptographic hash function is definitely preferred.

In this way, you can split this code into multiple functions, each performing a single piece of work.
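
The three steps above can be sketched as separate functions. This is an illustration under stated assumptions, not the code that was merged: `DepthLevel`, `append_u32`, `serialize`, `fnv1a_64`, and `to_hex` are hypothetical names, and the fixed little-endian byte order is a choice made for this example. The C++ standard library has no cryptographic hash, so the sketch substitutes 64-bit FNV-1a (the diff under review uses a 64-bit FNV variant); swapping in a cryptographic hash would only change the last step.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Step 1: the statistics to fingerprint (hypothetical layout, one row per depth).
struct DepthLevel {
  std::uint32_t branches;
  std::uint32_t leaves;
};

// Step 2: append one integer in little-endian order, so the byte array
// (and therefore the fingerprint) does not depend on host endianness.
void append_u32(std::vector<std::uint8_t>& out, std::uint32_t v) {
  for (int shift = 0; shift < 32; shift += 8) {
    out.push_back(static_cast<std::uint8_t>((v >> shift) & 0xffu));
  }
}

std::vector<std::uint8_t> serialize(const std::vector<DepthLevel>& hist) {
  std::vector<std::uint8_t> bytes;
  bytes.reserve(hist.size() * 8);
  for (const DepthLevel& level : hist) {
    append_u32(bytes, level.branches);
    append_u32(bytes, level.leaves);
  }
  return bytes;
}

// Step 3: 64-bit FNV-1a over the byte array (non-cryptographic stand-in).
std::uint64_t fnv1a_64(const std::vector<std::uint8_t>& bytes) {
  std::uint64_t hash = 14695981039346656037ULL;  // FNV offset basis
  for (std::uint8_t b : bytes) {
    hash ^= b;
    hash *= 1099511628211ULL;  // FNV prime; wraps modulo 2^64
  }
  return hash;
}

// Format the fingerprint as 16 lowercase hex digits.
std::string to_hex(std::uint64_t v) {
  char buf[17];
  std::snprintf(buf, sizeof buf, "%016llx", static_cast<unsigned long long>(v));
  return std::string(buf);
}
```

With these pieces, a fingerprint is `to_hex(fnv1a_64(serialize(hist)))`. Each function can be unit-tested on its own, which is the benefit the suggestion is after.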

levsnv · 2021-04-27

I want to do this because, when sweeping models for benchmarking, it lets me quickly see which models are different and which are the same. The same goes for putting the results into a spreadsheet: runs are easier to compare to each other when the fingerprints are simply equal.

cpp/src/fil/fil.cu

cpp/test/sg/fil_test.cu

python/cuml/fil/fil.pyx

levsnv · 2021-04-28T05:07:41Z

Waiting until #3800 is merged to push the conflict-resolved version

ajschmidt8 · 2021-05-19T20:50:30Z

Removing ops-codeowners from the required reviews since it doesn't seem there are any file changes that we're responsible for. Feel free to add us back if necessary.

dantegd

I had one very minor comment about a header and one question about how the API looks from a Python perspective.

cpp/include/cuml/fil/fnv_hash.h

python/cuml/test/test_fil.py

codecov-commenter · 2021-05-22T01:50:08Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.06@1ea479b).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.06    #3763   +/-   ##
===============================================
  Coverage                ?   85.41%           
===============================================
  Files                   ?      227           
  Lines                   ?    17317           
  Branches                ?        0           
===============================================
  Hits                    ?    14791           
  Misses                  ?     2526           
  Partials                ?        0

Flag	Coverage Δ
dask	`48.95% <0.00%> (?)`
non-dask	`77.35% <0.00%> (?)`

dantegd · 2021-05-24T19:20:15Z

@gpucibot merge

* Temporary fix for upstream cuML issues: include the cuML header directory that is currently nested one layer too deep (to be fixed by rapidsai/cuml#3901), and ignore unused-variable warnings caused by an unused variable in fil.h introduced by rapidsai/cuml#3763
* Remove extra include path following upstream fix
* Update to CalVer for cuML

Authors:
- https://github.com/levsnv

Approvers:
- Andy Adinets (https://github.com/canonizer)
- Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#3763

levsnv added 3 commits April 16, 2021 23:29

try 1

1245d00

fixes

c81eb32

fixed

e8433c1

levsnv requested review from a team as code owners April 18, 2021 06:14

github-actions bot added the CUDA/C++ and Cython / Python labels Apr 18, 2021

levsnv requested a review from canonizer April 18, 2021 06:15

levsnv added the 3 - Ready for Review, breaking, CUDA / C++, and feature request labels Apr 18, 2021

raydouglass removed the CUDA/C++ label Apr 19, 2021

canonizer suggested changes Apr 24, 2021

dantegd added the 4 - Waiting on Author label and removed the 3 - Ready for Review label Apr 25, 2021

Apply suggestions from code review

2759de3

Co-authored-by: Andy Adinets <adinetz@gmail.com>

github-actions bot added the CUDA/C++ label Apr 27, 2021

levsnv added 5 commits April 27, 2021 14:05

addressed review comments

e21f0e9

refactor

3f4a5d6

style

9e5c5a4

style

8ae9e5f

Merge remote-tracking branch 'origin/refactor-cython-kwargs' into print-model-shape

fe12e6c

levsnv and others added 3 commits April 28, 2021 15:34

Merge branch 'branch-0.20' into print-model-shape

5f85f27

fix memory leaks; confusing naming

31f9f46

stop sprawling code duplication

22abfd4

dantegd added the 4 - Waiting on Author label May 13, 2021

style, copyright

d3c24f9

levsnv requested a review from a team as a code owner May 19, 2021 04:20

github-actions bot removed the gpuCI label May 19, 2021

levsnv added 2 commits May 18, 2021 22:08

addressed review comments

f5f825f

added back " MB" suffix

47d5911

ajschmidt8 removed the request for review from a team May 19, 2021 20:50

levsnv added the 3 - Ready for Review label and removed the 4 - Waiting on Author label May 19, 2021

levsnv added 2 commits May 19, 2021 16:49

fixed insufficient precision

6b41953

fix extra <<

739940b

v21.06 Release automation moved this from PR-WIP to PR-Needs review May 21, 2021

dantegd requested changes May 21, 2021

switched from forest_shape_file to compute_shape_str and shape_str

7e6302f

levsnv requested a review from dantegd May 21, 2021 21:57

levsnv added 3 commits May 21, 2021 14:59

copyright.year

dcce65a

style, documentation

2a07441

more verbose error message

0d2a420

v21.06 Release automation moved this from PR-Needs review to PR-Reviewer approved May 24, 2021

dantegd approved these changes May 24, 2021

rapids-bot bot merged commit 79140bb into rapidsai:branch-21.06 May 24, 2021

v21.06 Release automation moved this from PR-Reviewer approved to Done May 24, 2021

levsnv deleted the print-model-shape branch May 24, 2021 21:47

wphicks mentioned this pull request May 26, 2021

Temporary fix for upstream cuML issues triton-inference-server/fil_backend#64

Merged
