Interruptible execution #433
Conversation
Overall I'm really excited about having this feature in RAFT and I think this is going to find use in many different RAPIDS projects. Here's my feedback so far.
rerun tests
I think this looks good. We can continue to iterate on it and fix additional issues as they may arise.
@achirkin, this is ready to merge but RAFT's CPU builds were just enabled and aren't yet running successfully (hence the CI failure here). This will be merged very soon.
Had a hiccup in permissions. LGTM.
rerun tests
rerun tests
Thanks @achirkin for this work! It looks good to me. I have just a small suggestion for improving the documentation, pre-approving.
* in code from outside of the thread. In particular, it provides an interruptible version of the
* blocking CUDA synchronization function, that allows dropping a long-running GPU work.
*
Consider copying parts of the PR description here:
Interruptible execution is facilitated using the following three functions:

```cpp
static void synchronize(rmm::cuda_stream_view stream);
static void yield();
static void cancel(std::thread::id thread_id);
```

`synchronize` and `yield` serve as cancellation points for the executing CPU thread. `cancel` allows throwing an async exception in a target CPU thread, which is observed at the nearest cancellation point. Altogether, these allow cancelling a long-running job without killing the OS process.

The key to making this work is the simple observation that the CPU spends most of its time waiting on `cudaStreamSynchronize`. By replacing that with `interruptible::synchronize`, we introduce cancellation points in all the critical places in the code. If that is not enough in some edge cases (the cancellation points are too far apart), a developer can use `yield` to ensure that a cancellation request is received sooner rather than later.
rerun tests
@gpucibot merge
### Cooperative-style interruptible C++ threads.

This proposal attempts to make the cuml experience more responsive by allowing an easier way to interrupt/cancel long-running cuml tasks. It replaces calls to `cudaStreamSynchronize` with `raft::interruptible::synchronize`, which serve as cancellation points in the algorithms. With a small extra hook on the Python side, Ctrl+C requests can now interrupt the execution (almost) immediately. At the moment, I have adapted just a few models as a proof of concept. Example:

```python
import sklearn.datasets
import cuml.svm

X, y = sklearn.datasets.fetch_olivetti_faces(return_X_y=True)
model = cuml.svm.SVC()
print("Data loaded; fitting... (try Ctrl+C now)")
try:
    model.fit(X, y)
    print("Done! Score:", model.score(X, y))
except Exception as e:
    print("Canceled!")
    print(e)
```

#### Implementation details

rapidsai/raft#433

#### Adoption costs

From the changeset in this PR you can see that I introduce two types of changes:

1. Change `cudaStreamSynchronize` to either `handle.sync_thread` or `raft::interruptible::synchronize`.
2. Wrap the Cython calls with [`cuda_interruptible`](https://github.com/rapidsai/raft/blob/36e8de5f73e9ec7e604b38a4290ac82bc35be4b7/python/raft/common/interruptible.pyx#L28) and `nogil`.

Change (1) is straightforward and can mostly be automated. Change (2) is a bit more involved. You definitely have to wrap a C++ call with `interruptibleCpp` to make `Ctrl+C` work, but that is also rather simple. The tricky part is adding `nogil`, because you have to make sure there are no Python objects within the `with nogil` block. However, `nogil` does not seem to be strictly required for the signal handler to successfully interrupt the C++ thread; it worked in my tests without `nogil` as well. Yet, I chose to add `nogil` in the code where possible, because in theory it should reduce the interrupt latency and enable more multithreading.

#### Motivation

In general, this proposal makes executing threads (and thus algos/models) more controllable. The main use cases I see:

1. Being able to Ctrl+C the running model using signal handlers.
2. Stopping the thread programmatically, e.g. we can create tests of the sort "if running for more than n seconds, stop and fail".

Resolves #4384

Authors:
- Artem M. Chirkin (https://github.com/achirkin)

Approvers:
- Corey J. Nolet (https://github.com/cjnolet)

URL: #4463
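As a concrete, hedged illustration of change (1) above: the substitution in C++ code is roughly a one-liner. The surrounding function and the include path below are hypothetical; only `raft::interruptible::synchronize` itself comes from the changeset.

```cpp
// Hypothetical cuml-style routine illustrating change (1); not taken from the PR.
#include <raft/interruptible.hpp>    // header path assumed
#include <rmm/cuda_stream_view.hpp>

void fit_impl(rmm::cuda_stream_view stream)
{
  // ... enqueue kernels on `stream` ...

  // Before: a plain CUDA call that blocks the CPU thread with no way to cancel it.
  //   cudaStreamSynchronize(stream.value());

  // After: same blocking behaviour, but the wait is now a cancellation point.
  // (The description above also mentions `handle.sync_thread` as an alternative.)
  raft::interruptible::synchronize(stream);
}
```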
### Cooperative-style interruptible C++ threads.

This proposal introduces `raft::interruptible`, providing three functions:

```cpp
static void synchronize(rmm::cuda_stream_view stream);
static void yield();
static void cancel(std::thread::id thread_id);
```

`synchronize` and `yield` serve as cancellation points for the executing CPU thread. `cancel` allows throwing an async exception in a target CPU thread, which is observed at the nearest cancellation point. Altogether, these allow cancelling a long-running job without killing the OS process.

The key to making this work is the simple observation that the CPU spends most of its time waiting on `cudaStreamSynchronize`. By replacing that with `interruptible::synchronize`, we introduce cancellation points in all the critical places in the code. If that is not enough in some edge cases (the cancellation points are too far apart), a developer can use `yield` to ensure that a cancellation request is received sooner rather than later.
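For a sense of how this API is meant to be used from C++, here is a minimal sketch. The include path is an assumption on my part, and the busy `yield()` loop merely stands in for real GPU work followed by `raft::interruptible::synchronize(stream)`; only the three functions and `raft::interrupted_exception` come from the PR itself.

```cpp
// Minimal usage sketch (header path assumed; not taken from the PR).
#include <raft/interruptible.hpp>

#include <chrono>
#include <iostream>
#include <thread>

int main()
{
  std::thread worker([] {
    try {
      // In real code: enqueue long-running kernels, then call
      // raft::interruptible::synchronize(stream) as the cancellation point.
      // Here we just spin on yield() so the effect is visible without GPU work.
      while (true) {
        raft::interruptible::yield();
      }
    } catch (raft::interrupted_exception const&) {
      std::cout << "worker stopped at a cancellation point" << std::endl;
    }
  });

  // Ask the worker to stop; the request is observed at its next yield().
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  raft::interruptible::cancel(worker.get_id());
  worker.join();
}
```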
#### Implementation

**C++**: `raft::interruptible` keeps an `std::atomic_flag` in thread-local storage in each thread, which tells whether the thread can continue executing (i.e. whether it is in a non-cancelled state). `cancel` clears this flag, and `yield` checks it and resets it to the signalled state (throwing a `raft::interrupted_exception` if necessary). `synchronize` implements a spinning lock, querying the state of the stream and `yield`ing on each iteration. I also add an overload `sync_stream` to the raft handle type, to make it easier to modify the behavior of all synchronization calls in raft and cuml.
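To make that mechanism concrete, here is a stripped-down sketch of the idea. This is not RAFT's actual code: the names are simplified, RAFT uses an `std::atomic_flag` rather than `std::atomic<bool>`, and token cleanup at thread exit as well as proper CUDA error handling are omitted.

```cpp
// Sketch only: per-thread "keep running" flag, a registry so cancel() can reach
// other threads, yield() as the cancellation point, and synchronize() as a
// spinning, interruptible replacement for cudaStreamSynchronize.
#include <atomic>
#include <map>
#include <memory>
#include <mutex>
#include <stdexcept>
#include <thread>

#include <cuda_runtime_api.h>

struct interrupted_exception : std::runtime_error {
  using std::runtime_error::runtime_error;
};

namespace {
std::mutex registry_lock;
std::map<std::thread::id, std::shared_ptr<std::atomic<bool>>> registry;

// Find or lazily create the "keep running" flag for a given thread id.
std::shared_ptr<std::atomic<bool>> get_token(std::thread::id tid)
{
  std::lock_guard<std::mutex> guard(registry_lock);
  auto& token = registry[tid];
  if (!token) { token = std::make_shared<std::atomic<bool>>(true); }
  return token;
}
}  // namespace

// Cancellation point: throw if cancellation was requested, and reset the flag
// to the "keep running" state so the thread remains usable afterwards.
void yield()
{
  thread_local auto token = get_token(std::this_thread::get_id());
  if (!token->exchange(true)) { throw interrupted_exception("thread cancelled"); }
}

// Interruptible replacement for cudaStreamSynchronize: poll the stream instead
// of blocking inside the CUDA runtime, yielding on every iteration.
void synchronize(cudaStream_t stream)
{
  while (true) {
    auto status = cudaStreamQuery(stream);
    if (status == cudaSuccess) { return; }
    if (status != cudaErrorNotReady) { throw std::runtime_error("CUDA error"); }
    yield();
  }
}

// Request cancellation of another thread; it takes effect at that thread's
// next cancellation point (yield or synchronize).
void cancel(std::thread::id thread_id) { get_token(thread_id)->store(false); }
```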
**Python**: This proposal adds a context manager `cuda_interruptible` to handle Ctrl+C requests during C++ calls (using POSIX signals). `cuda_interruptible` simply calls `raft::interruptible::cancel` on the target C++ thread.

#### Motivation
See rapidsai/cuml#4463
Resolves rapidsai/cuml#4384