PERF Improve runtime for early stopping in HistGradientBoosting #26163
Conversation
A first round.
```python
# scoring is a predefined metric string
if isinstance(self.scoring, str):
    raw_predictions_small_train = raw_predictions[
        indices_small_train
    ]
```
Shouldn't we use the whole training set as in `self.scoring == "loss"`?
Yes, this is still the case: this branch is never reached when `self.scoring == "loss"`.

That being said, I can see how using `isinstance(..., str)` can lead to confusion. I updated this PR with 89c3dc0 (#26163) to use a more explicit variable name for "scorer is a predefined string".
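To illustrate the renaming being discussed, here is a hypothetical sketch (the variable names below are illustrative, not the PR's exact code) of how an explicit flag makes the intent of the check self-documenting:

```python
scoring = "accuracy"  # stand-in for self.scoring

# Before: the intent of the check is implicit.
use_subsample_scoring = isinstance(scoring, str)

# After: the name makes explicit that scoring is a predefined metric string;
# the "loss" case is handled on a separate branch and never reaches here.
scorer_is_predefined_string = isinstance(scoring, str) and scoring != "loss"
print(scorer_is_predefined_string)  # True for "accuracy"
```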
```python
if isinstance(self.scoring, str):
    raw_predictions_small_train = raw_predictions[
        indices_small_train
    ]
```
Again, shouldn't we use the whole training set as in `self.scoring == "loss"`?
```python
@contextmanager
def _patch_raw_predict(estimator, raw_predictions):
    """Context manager that patches _raw_predict to return raw_predictions."""
    orig_raw_predict = estimator._raw_predict
```
I need to digest this 😄
LGTM.

A second review should focus on `_patch_raw_predict` and the solution with the context manager.

We should also update the statement in the user guide saying that scorers in HGBT are much slower. Maybe it is better to wait for #26778.
I had missed this PR. I think it's a useful improvement, but I do not understand the call count assertion in the tests. Maybe there is a bug? The test seems correct (apart from a mistake in the inline comment), but I think it could be improved to test the actual change in this PR.
```python
)
hist.fit(X, y, sample_weight=sample_weight)

# For scorer is called three times per iteration. (2 x 3 = 6)
```
I am trying to investigate, but I don't understand this. I would have expected 3 in total: one for the baseline score call at line 614 and 2 for the score calls at the end of each iteration at line 789. There might be a bug.

EDIT: I forgot about validation. So it's two for the baseline (train + validation) and two per iteration (train + validation).
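The call-count arithmetic can be illustrated with a plain `unittest.mock.Mock` standing in for the scorer (the arguments below are placeholders, not the actual test fixtures):

```python
from unittest.mock import Mock

scorer = Mock(return_value=0.5)

# Baseline before boosting starts: one call on the training subsample,
# one on the validation set.
scorer("est", "X_train", "y_train")
scorer("est", "X_val", "y_val")

# Each boosting iteration then scores train and validation once each.
n_iterations = 2
for _ in range(n_iterations):
    scorer("est", "X_train", "y_train")
    scorer("est", "X_val", "y_val")

# 2 baseline calls + 2 calls per iteration x 2 iterations:
print(scorer.call_count)  # 6
```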
Actually this test should also check the contents of `hist.train_score_` and `hist.validation_score_`.

EDIT: a first version of this comment was wrong (I had made a local change to the code base that introduced a bug in the mock).
```diff
@@ -848,6 +850,37 @@ def test_early_stopping_on_test_set_with_warm_start():
     gb.fit(X, y)
 
 
+def test_early_stopping_with_sample_weights(monkeypatch):
```
This test is useful in itself, but it does not check that we do not call the original `hist._raw_predict` when doing the early stopping checks.

I think the test could be extended to monkeypatch with a second mocker to check that enabling early stopping does not induce extra calls to `hist._raw_predict` compared to fitting with a fixed number of iterations with early stopping disabled and with the default `scoring` parameter value.
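The suggested check could be built with a wrapping `Mock` used as a spy. Below is a generic sketch of that pattern with a dummy estimator (not scikit-learn's actual test code):

```python
from unittest.mock import Mock


class DummyEstimator:
    def _raw_predict(self, X):
        return [0.0] * len(X)


est = DummyEstimator()

# Wrap the original method so calls still work but are counted.
spy = Mock(wraps=est._raw_predict)
est._raw_predict = spy

# ... here the real test would run est.fit(...) with early stopping ...

# Assert no _raw_predict calls happened during the (omitted) fit.
assert spy.call_count == 0

# The wrapped method still behaves normally when it *is* called:
result = est._raw_predict([1, 2, 3])
assert result == [0.0, 0.0, 0.0]
assert spy.call_count == 1
```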
In edafefa, I added a mock to `_raw_predict` to make sure it is not called.

> compared to fitting with a fixed number of iterations with early stopping disabled and with the default scoring parameter value.

This is unclear to me. Assuming `warm_start=False`:

- With `scoring="loss"`, `_raw_predict` is not called.
- With `early_stopping` disabled, `_raw_predict` is not called.
- After this PR, `_raw_predict` is only called when `scoring` is a custom callable and early stopping is enabled. I added a new test in b4cc727 to assert this behavior.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Thanks for the follow-up. LGTM once the black formatting problem is fixed.
Reference Issues/PRs
Fixes #25974
What does this implement/fix? Explain your changes.
This PR reuses the `raw_predictions` for the validation and training set when `scoring` is a predefined metric. I ran this benchmark and this PR is about 2-3x faster than `main`.

(Benchmark plots: `main` vs. this PR.)