
Fix worker streams in OLS-eig executing in an unsafe order #4539

Merged
merged 6 commits into rapidsai:branch-22.04 on Feb 10, 2022

Conversation

achirkin
Contributor

The latest version of the "eig" OLS solver has a bug that produces garbage results under some conditions. When at least one worker stream is used to run operations concurrently, and the workset is sufficiently large, the memory allocation enqueued on the main stream may complete only after the worker stream has already started using that memory.

This PR adds more ordering between the main stream and the worker streams, fixing this and some other theoretically possible edge cases.
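
For illustration, here is a minimal CUDA sketch of the hazard and of the event-based ordering that prevents it. This is not the actual cuML code: `submit_worker_task`, `buf`, and `worker_kernel` are hypothetical names, and the real fix operates on the handle's main and worker streams rather than on raw streams.

```cpp
#include <cuda_runtime.h>

// Sketch: work is enqueued on the main stream, then consumed on a worker
// stream. Without explicit ordering, the worker stream may start reading
// `buf` before the main stream has finished preparing it.
void submit_worker_task(cudaStream_t main_stream, cudaStream_t worker_stream)
{
  float* buf   = nullptr;
  size_t bytes = size_t{1} << 24;  // large worksets make the race likely

  // The allocation and the fill are ordered only within the main stream.
  cudaMallocAsync(reinterpret_cast<void**>(&buf), bytes, main_stream);
  cudaMemsetAsync(buf, 0, bytes, main_stream);

  // The fix: record an event on the main stream and make the worker stream
  // wait on it, so the worker cannot run ahead of the allocation/fill.
  cudaEvent_t ready;
  cudaEventCreateWithFlags(&ready, cudaEventDisableTiming);
  cudaEventRecord(ready, main_stream);
  cudaStreamWaitEvent(worker_stream, ready, 0);

  // worker_kernel<<<grid, block, 0, worker_stream>>>(buf) is now safe.

  cudaEventDestroy(ready);
}
```

The record-then-wait pair involves no host synchronization, which is why this kind of ordering can fix the bug without serializing the streams' actual work.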

@achirkin achirkin requested a review from a team as a code owner January 31, 2022 12:42
@achirkin achirkin added the 3 - Ready for Review, bug, and non-breaking labels and removed the CUDA/C++ label Jan 31, 2022
cjnolet
cjnolet previously approved these changes Jan 31, 2022
@@ -228,11 +252,15 @@ class OlsTest : public ::testing::TestWithParam<OlsInputs<T>> {
   T intercept, intercept2, intercept3;
 };
 
-const std::vector<OlsInputs<float>> inputsf2 = {
-  {0.001f, 4, 2, 2, 0}, {0.001f, 4, 2, 2, 1}, {0.001f, 4, 2, 2, 2}};
+const std::vector<OlsInputs<float>> inputsf2 = {{hconf::NON_BLOCKING_ONE, 0.001f, 4, 2, 2, 0},
cjnolet
Member

Without the changes in this PR, are these assertions able to reliably reproduce the problem?

achirkin
Contributor Author

No, this bug seems to be very elusive. I managed to reproduce it only under some specific conditions in Python, but then it disappeared again after I made some further changes to optimize preProcessData (perhaps due to a change in the pattern of calls using the main stream / RMM resources).
Still, I hope the changes in these tests will help to find other stream-related bugs, if there are any more.
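
(For context, the test change shown above threads a stream configuration through `OlsInputs`, so every solver path runs under each handle stream setup. A rough googletest sketch of the idea; `StreamConfig` and the field names here are hypothetical stand-ins for the real `hconf::` values:)

```cpp
#include <gtest/gtest.h>

// Hypothetical stand-in for the hconf:: values in the real test.
enum class StreamConfig { Blocking, NonBlockingOne, NonBlockingMany };

struct OlsInputsSketch {
  StreamConfig hc;  // which stream pool the test handle should use
  float tol;
  int n_rows, n_cols, n_informative, algo;
};

class OlsSketchTest : public ::testing::TestWithParam<OlsInputsSketch> {
 protected:
  void SetUp() override {
    // A handle would be created here with a stream pool matching
    // GetParam().hc, so stream-ordering bugs surface in any solver path.
  }
};

TEST_P(OlsSketchTest, Fit) { SUCCEED(); /* run the solver, check results */ }

INSTANTIATE_TEST_SUITE_P(
  OlsSketch,
  OlsSketchTest,
  ::testing::Values(OlsInputsSketch{StreamConfig::NonBlockingOne, 0.001f, 4, 2, 2, 0}));
```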

@cjnolet cjnolet dismissed their stale review January 31, 2022 19:01

I meant to comment, not approve.

@caryr35 caryr35 added this to PR-WIP in v22.04 Release via automation Feb 2, 2022
@caryr35 caryr35 moved this from PR-WIP to PR-Needs review in v22.04 Release Feb 2, 2022
Contributor

@tfeher tfeher left a comment

Thanks @achirkin for fixing this problem! It looks good, I just have a few smaller comments.

cpp/src_prims/linalg/lstsq.cuh (review thread; outdated, resolved)
cpp/src_prims/linalg/lstsq.cuh (review thread; outdated, resolved)
@dantegd dantegd added the 4 - Waiting on Author label and removed the 3 - Ready for Review label Feb 3, 2022
@achirkin achirkin added the 3 - Ready for Review label and removed the 4 - Waiting on Author label Feb 3, 2022
@achirkin achirkin requested a review from tfeher February 3, 2022 09:01
Contributor

@tfeher tfeher left a comment

Thanks Artem for the update, the PR looks good to me.

v22.04 Release automation moved this from PR-Needs review to PR-Reviewer approved Feb 3, 2022
@tfeher
Contributor

tfeher commented Feb 3, 2022

If I understand correctly, this bug affects LinearRegression models with the default solver algorithm (eig). A workaround in existing releases would be to pass a Handle without a stream pool:

LinearRegression(handle=Handle(), ...)

or use a different algorithm:

LinearRegression(algorithm='qr', ...)

@achirkin would the first option (passing handle) be the preferred workaround?
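
(For concreteness, a sketch of the two workarounds; the `Handle` import path below is an assumption and varies between cuML releases.)

```python
from cuml.linear_model import LinearRegression
# Assumed import path; in some releases it is e.g. cuml.raft.common.handle.
from cuml.common import Handle

# Option 1: a plain Handle carries no stream pool, so no worker streams
# are launched and the unsafe ordering cannot occur.
ols = LinearRegression(handle=Handle())

# Option 2: avoid the affected 'eig' solver altogether.
ols_qr = LinearRegression(algorithm='qr')
```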

@achirkin
Contributor Author

achirkin commented Feb 3, 2022

@achirkin would the first option (passing handle) be the preferred workaround?

Yes, I think it would have a smaller impact on performance. Thanks for the suggestion!

Also, I should note that I could reproduce the problem only when both fit_intercept and normalize are True. By default, normalize is disabled in this model, though that does not guarantee the bug cannot pop up without normalize if the stars are properly aligned :)
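
(A sketch of the configuration under which the bug was observed, on a pre-fix release; the data sizes here are illustrative, not taken from the report.)

```python
import numpy as np
from cuml.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((1_000_000, 50), dtype=np.float32)
y = X @ rng.standard_normal(50, dtype=np.float32)

# The default solver is 'eig'; with both preprocessing options enabled and
# a large workset, the unsafe stream ordering could corrupt the results.
model = LinearRegression(fit_intercept=True, normalize=True)
model.fit(X, y)
```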

@codecov-commenter

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.04@9921c61).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.04    #4539   +/-   ##
===============================================
  Coverage                ?   85.73%           
===============================================
  Files                   ?      239           
  Lines                   ?    19585           
  Branches                ?        0           
===============================================
  Hits                    ?    16791           
  Misses                  ?     2794           
  Partials                ?        0           
Flag       Coverage Δ
dask       46.18% <0.00%> (?)
non-dask   78.73% <0.00%> (?)

Flags with carried forward coverage won't be shown.


Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9921c61...5ee810b. Read the comment docs.

@cjnolet
Member

cjnolet commented Feb 10, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit df6d7be into rapidsai:branch-22.04 Feb 10, 2022
v22.04 Release automation moved this from PR-Reviewer approved to Done Feb 10, 2022
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this pull request Oct 9, 2023
Fix worker streams in OLS-eig executing in an unsafe order (rapidsai#4539)

The latest version of the "eig" OLS solver has a bug that produces garbage results under some conditions. When at least one worker stream is used to run operations concurrently, and the workset is sufficiently large, the memory allocation enqueued on the main stream may complete only after the worker stream has already started using that memory.

This PR adds more ordering between the main stream and the worker streams, fixing this and some other theoretically possible edge cases.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: rapidsai#4539
Labels
3 - Ready for Review (Ready for review by team) · bug (Something isn't working) · CUDA/C++ · non-breaking (Non-breaking change)

5 participants