Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accelerated DAG channel lock fixes #45119

Merged
merged 2 commits into from
May 11, 2024

Conversation

jackhumphries
Copy link
Contributor

@jackhumphries jackhumphries commented May 3, 2024

Why are these changes needed?

This PR fixes two issues:

  1. When the error bit is set on a mutable object, a reader may return from ReadAcquire() without releasing the channel lock. This will cause deadlock if another reader calls ReadAcquire() on the channel.

  2. Abseil reports a deadlock on the channel lock, though this is a false positive. Note that Abseil only tracks deadlocks when compiled with debug mode (export RAY_DEBUG_BUILD=debug), so this issue will not appear otherwise.
    Specifically, the following occurs:
    (1) Abseil sees ReadAcquire() acquire the destructor lock followed by the input channel lock.
    (2) When WriteAcquire() is called on the output channel, Abseil sees that the thread already holds the input channel lock and then the thread acquires the destructor lock.
    (3) Since Abseil sees these locks acquired in the reverse order in (2), it thinks there could be a deadlock.

The deadlock check can only be disabled globally, which is undesirable, rather than for a specific lock. To resolve this issue for now, I replace absl::Mutex with std::mutex so that Abseil does not track deadlock on this lock.

A better long-term solution is to rework the synchronization model so that a lock is not held after ReadAcquire() returns.

Related issue number

N/A

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Copy link
Contributor

@stephanie-wang stephanie-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a unit test for this?

@jackhumphries jackhumphries force-pushed the lock-fix branch 2 times, most recently from 8f25ae2 to 53837bc Compare May 3, 2024 07:37
Signed-off-by: Jack Humphries <1645405+jackhumphries@users.noreply.github.com>
Signed-off-by: Jack Humphries <1645405+jackhumphries@users.noreply.github.com>
@jackhumphries
Copy link
Contributor Author

This is ready for review. Thanks Stephanie!

@jackhumphries jackhumphries changed the title Lock fix Accelerated DAG channel lock fixes May 10, 2024
@stephanie-wang stephanie-wang merged commit 094748e into ray-project:master May 11, 2024
5 checks passed
stephanie-wang added a commit that referenced this pull request May 13, 2024
Fixes test that was broken due a hidden merge conflict between #45092 and #45119.
Related issue number

Closes #45264.

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 6, 2024
This PR fixes two issues:
1. When the error bit is set on a mutable object, a reader may return
from ReadAcquire() without releasing the channel lock. This will cause
deadlock if another reader calls ReadAcquire() on the channel.

2. Abseil reports a deadlock on the channel lock, though this is a false
positive. Note that Abseil only tracks deadlocks when compiled with
debug mode (`export RAY_DEBUG_BUILD=debug`), so this issue will not
appear otherwise.
Specifically, the following occurs:
(1) Abseil sees ReadAcquire() acquire the destructor lock followed by
the _input_ channel lock.
(2) When WriteAcquire() is called on the _output_ channel, Abseil sees
that the thread already holds the _input_ channel lock and then the
thread acquires the destructor lock.
(3) Since Abseil sees these locks acquired in the reverse order in (2),
it thinks there could be a deadlock.

The deadlock check can only be disabled globally, which is undesirable,
rather than for a specific lock. To resolve this issue for now, I
replace absl::Mutex with std::mutex so that Abseil does not track
deadlock on this lock.

A better long-term solution is to rework the synchronization model so
that a lock is not held after ReadAcquire() returns.

---------

Signed-off-by: Jack Humphries <1645405+jackhumphries@users.noreply.github.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 6, 2024
Fixes test that was broken due a hidden merge conflict between ray-project#45092 and ray-project#45119.
Related issue number

Closes ray-project#45264.

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 6, 2024
This PR fixes two issues:
1. When the error bit is set on a mutable object, a reader may return
from ReadAcquire() without releasing the channel lock. This will cause
deadlock if another reader calls ReadAcquire() on the channel.

2. Abseil reports a deadlock on the channel lock, though this is a false
positive. Note that Abseil only tracks deadlocks when compiled with
debug mode (`export RAY_DEBUG_BUILD=debug`), so this issue will not
appear otherwise.
Specifically, the following occurs:
(1) Abseil sees ReadAcquire() acquire the destructor lock followed by
the _input_ channel lock.
(2) When WriteAcquire() is called on the _output_ channel, Abseil sees
that the thread already holds the _input_ channel lock and then the
thread acquires the destructor lock.
(3) Since Abseil sees these locks acquired in the reverse order in (2),
it thinks there could be a deadlock.

The deadlock check can only be disabled globally, which is undesirable,
rather than for a specific lock. To resolve this issue for now, I
replace absl::Mutex with std::mutex so that Abseil does not track
deadlock on this lock.

A better long-term solution is to rework the synchronization model so
that a lock is not held after ReadAcquire() returns.

---------

Signed-off-by: Jack Humphries <1645405+jackhumphries@users.noreply.github.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 6, 2024
Fixes test that was broken due a hidden merge conflict between ray-project#45092 and ray-project#45119.
Related issue number

Closes ray-project#45264.

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 7, 2024
This PR fixes two issues:
1. When the error bit is set on a mutable object, a reader may return
from ReadAcquire() without releasing the channel lock. This will cause
deadlock if another reader calls ReadAcquire() on the channel.

2. Abseil reports a deadlock on the channel lock, though this is a false
positive. Note that Abseil only tracks deadlocks when compiled with
debug mode (`export RAY_DEBUG_BUILD=debug`), so this issue will not
appear otherwise.
Specifically, the following occurs:
(1) Abseil sees ReadAcquire() acquire the destructor lock followed by
the _input_ channel lock.
(2) When WriteAcquire() is called on the _output_ channel, Abseil sees
that the thread already holds the _input_ channel lock and then the
thread acquires the destructor lock.
(3) Since Abseil sees these locks acquired in the reverse order in (2),
it thinks there could be a deadlock.

The deadlock check can only be disabled globally, which is undesirable,
rather than for a specific lock. To resolve this issue for now, I
replace absl::Mutex with std::mutex so that Abseil does not track
deadlock on this lock.

A better long-term solution is to rework the synchronization model so
that a lock is not held after ReadAcquire() returns.

---------

Signed-off-by: Jack Humphries <1645405+jackhumphries@users.noreply.github.com>
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 7, 2024
Fixes test that was broken due a hidden merge conflict between ray-project#45092 and ray-project#45119.
Related issue number

Closes ray-project#45264.

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
GabeChurch pushed a commit to GabeChurch/ray that referenced this pull request Jun 11, 2024
This PR fixes two issues:
1. When the error bit is set on a mutable object, a reader may return
from ReadAcquire() without releasing the channel lock. This will cause
deadlock if another reader calls ReadAcquire() on the channel.

2. Abseil reports a deadlock on the channel lock, though this is a false
positive. Note that Abseil only tracks deadlocks when compiled with
debug mode (`export RAY_DEBUG_BUILD=debug`), so this issue will not
appear otherwise.
Specifically, the following occurs:
(1) Abseil sees ReadAcquire() acquire the destructor lock followed by
the _input_ channel lock.
(2) When WriteAcquire() is called on the _output_ channel, Abseil sees
that the thread already holds the _input_ channel lock and then the
thread acquires the destructor lock.
(3) Since Abseil sees these locks acquired in the reverse order in (2),
it thinks there could be a deadlock.

The deadlock check can only be disabled globally, which is undesirable,
rather than for a specific lock. To resolve this issue for now, I
replace absl::Mutex with std::mutex so that Abseil does not track
deadlock on this lock.

A better long-term solution is to rework the synchronization model so
that a lock is not held after ReadAcquire() returns.

---------

Signed-off-by: Jack Humphries <1645405+jackhumphries@users.noreply.github.com>
Signed-off-by: gchurch <gabe1church@gmail.com>
GabeChurch pushed a commit to GabeChurch/ray that referenced this pull request Jun 11, 2024
Fixes test that was broken due a hidden merge conflict between ray-project#45092 and ray-project#45119.
Related issue number

Closes ray-project#45264.

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: gchurch <gabe1church@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants