[data] streaming generator integration #37736
Conversation
@@ -332,6 +335,27 @@ def process_completed_tasks(topology: Topology) -> None:
        op = active_tasks.pop(ref)
        op.notify_work_completed(ref)

    if active_streaming_gens:
        ready, _ = ray.wait(
Probably need to ray.wait() these together with the active_tasks for efficiency right?
Yup. I plan to convert everything to streaming generators; then there will be only one ray.wait.
Found that ActorPoolMapOperator still needs to use regular ObjectRefs. Unified them in one single ray.wait.
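The "one unified wait" idea can be sketched with plain Python futures. This is a hedged illustration only: `regular`, `streaming`, and the driver loop below are stand-ins for the two kinds of waitables discussed above, not Ray Data's actual internals.

```python
# Sketch: instead of waiting on two pools (regular refs and streaming
# ones) in separate calls, collect every outstanding waitable into a
# single wait. The driver loop then wakes up once per completion rather
# than polling each pool in turn.
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

with ThreadPoolExecutor() as pool:
    regular = [pool.submit(lambda x=i: x * 2) for i in range(2)]
    streaming = [pool.submit(lambda x=i: x + 100) for i in range(2)]

    # One unified wait over both kinds of work.
    pending = set(regular) | set(streaming)
    results = []
    while pending:
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        results.extend(f.result() for f in done)

print(sorted(results))  # [0, 2, 100, 101]
```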
# The generator should always yield 2 values (block and metadata) each time.
# If we get a StopIteration here, it means an error happened in the task.
# And in this case, the block_ref is the exception object.
# TODO(hchen): Ray Core should have a better interface for detecting and
@rkooo567 It looks like a bug that when `retry_exceptions=True`, `gen._generator_task_exception` is empty. So I have to use this workaround.
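The yield-pair protocol described in the quoted comment can be mimicked with plain Python generators. `task_outputs` and `read_pairs` below are hypothetical names for illustration, not Ray Data internals; the point is only how a StopIteration between a block and its metadata signals a failed task.

```python
# A healthy task yields (block, metadata) pairs; a failed one stops
# mid-pair, and the last yielded value carries the error payload.
def task_outputs(fail_mid=False):
    yield "block-0"
    yield "meta-0"
    if fail_mid:
        yield "error-object"
        return  # simulates the task dying between block and metadata
    yield "block-1"
    yield "meta-1"

def read_pairs(gen):
    pairs = []
    while True:
        try:
            block = next(gen)
        except StopIteration:
            break  # clean end of stream
        try:
            meta = next(gen)
        except StopIteration:
            # Stopped between block and metadata: `block` is the error.
            raise RuntimeError(f"task failed, payload: {block}")
        pairs.append((block, meta))
    return pairs

print(read_pairs(task_outputs()))  # two (block, meta) pairs
```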
Should we put this under a feature flag? A bit worried about regressions like the `retry_exceptions` thing.
The change is a bit too large to put under a feature flag. I think this should be fine. For correctness issues like the `retry_exceptions` param, unit tests should cover them. For perf impact, I'll double check the results of all benchmarks before merging.
(This PR seems unrelated to CI or release test infra.)
@@ -165,6 +165,7 @@ def main(
        or os.environ.get("BUILDKITE_SOURCE", "manual") == "schedule"
        or buildkite_branch.startswith("releases/")
    )
    report = 1
This will be reverted, right?
Yes, I wanted to manually run release tests and get all results. @can-anyscale said that only the master and release branches report results, so I had to temporarily change this.
BTW, is there a reason why we don't report results for other branches? It'd be great to avoid doing this manually.
All benchmark results (buildkite link). The former is this branch, and the latter is master.
Looks good to me at a high level, though I didn't do a detailed review. Consider breaking this down into a refactoring PR and the main change for streaming support.
@abstractmethod
def on_waitable_ready(self):
    """Called when the waitable is ready."""
This is called multiple times for the streaming refs, right? Should clarify that in the comment.
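A minimal sketch of what "called multiple times" could mean for a streaming waitable, assuming one invocation per available output. `StreamingTask` and its bookkeeping are illustrative assumptions, not the actual Ray Data implementation.

```python
# Sketch: on_waitable_ready may fire repeatedly for a streaming
# generator (once per ready output), so implementations must be safe
# to call multiple times.
from abc import ABC, abstractmethod

class Waitable(ABC):
    @abstractmethod
    def on_waitable_ready(self):
        """Called when the waitable is ready.

        For streaming refs this fires once per available output,
        not just once per task.
        """

class StreamingTask(Waitable):
    def __init__(self, num_outputs):
        self.remaining = num_outputs
        self.consumed = []

    def on_waitable_ready(self):
        # Consume exactly one ready output per invocation.
        self.consumed.append(f"output-{len(self.consumed)}")
        self.remaining -= 1
        return self.remaining == 0  # True once the stream is exhausted

task = StreamingTask(num_outputs=3)
while not task.on_waitable_ready():
    pass
print(task.consumed)  # ['output-0', 'output-1', 'output-2']
```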
task = _TaskState(bundle)
self._tasks[ref] = (task, actor)
self._handle_task_submitted(task)
# Note, for some reaosn, if we don't define a new variable,
Suggested change:
- # Note, for some reaosn, if we don't define a new variable,
+ # Note, for some reason, if we don't define a new variable,
@@ -63,7 +69,12 @@ def __init__(
    self._output_queue: _OutputQueue = None
    # Output metadata, added to on get_next().
    self._output_metadata: List[BlockMetadata] = []

    # All active `DataOptasks`s.
Suggested change:
- # All active `DataOptasks`s.
+ # All active `DataOpTasks`s.
def _task_done_callback():
    nonlocal task_index
    nonlocal inputs
I think an alternate pattern to bind these variables correctly is the following:

    def _task_done(task_index, inputs):
        ...

    DataOpTask(..., lambda task_index=task_index, inputs=inputs: _task_done(task_index, inputs))
Good suggestion!
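For context, the late-binding pitfall the suggestion works around can be shown in pure Python. `_task_done` is a stand-in callback here, not the actual operator code.

```python
# A plain lambda in a loop captures the *variable*, not its value, so
# every callback sees the final task_index. Binding the value via a
# default argument freezes it at definition time.
done = []

def _task_done(task_index):
    done.append(task_index)

# Late binding: every callback reads task_index after the loop ended.
late_bound = [lambda: _task_done(task_index) for task_index in range(3)]
for cb in late_bound:
    cb()
assert done == [2, 2, 2]

# Early binding via default argument: each callback keeps its own value.
done.clear()
early_bound = [
    lambda task_index=task_index: _task_done(task_index)
    for task_index in range(3)
]
for cb in early_bound:
    cb()
assert done == [0, 1, 2]
```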
While benchmarking a stress test, we found a race condition bug in the streaming generator, which is being fixed by #38258.
Force-pushed from 98bf4d6 to 8d23c94.
LGTM
""" | ||
|
||
@abstractmethod | ||
def get_waitable(self) -> Union[ray.ObjectRef, StreamingObjectRefGenerator]: |
nit: `def get_waitable(self) -> Waitable`?
# 2. This method should only take a block-processing function as input,
#    instead of a streaming generator. The logic of submitting ray tasks
#    can also be encapsulated in the base class.
Yes, this makes more sense to me. Right now `_submit_data_task` is just doing bookkeeping here; the task is already submitted when `gen` is created.
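The refactor the quoted TODO proposes (the base class owns submission and takes only a block-processing function) might be sketched as below. Every name here is hypothetical; in the real code, "submission" would launch a Ray task rather than build a local generator.

```python
# Sketch: subclasses pass a block-processing function; the base class
# encapsulates creating and tracking the (here: local) generator.
from typing import Callable, Iterator, List

class MapOperatorBase:
    def __init__(self):
        self._active_tasks: List[Iterator] = []

    def _submit_data_task(
        self, process_block: Callable[[str], str], blocks: List[str]
    ) -> Iterator:
        # Base class builds the generator itself, so _submit_data_task
        # really submits work instead of only doing bookkeeping.
        gen = (process_block(b) for b in blocks)
        self._active_tasks.append(gen)
        return gen

op = MapOperatorBase()
gen = op._submit_data_task(str.upper, ["a", "b"])
print(list(gen))  # ['A', 'B']
```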
Force-pushed from e9e8d1a to 05a0632.
The CI failure (`test_iter_batches_no_spilling_upon_prior_transformation`) is an existing issue with DatasetPipeline, but somehow this PR makes it more likely to happen. I've figured out the fix. Will merge this PR first and submit another PR to fix the issue today.
Why are these changes needed?
This PR changes the MapOperator to use streaming generators, and moves common code about task management from `TaskPoolMapOperator` and `ActorPoolMapOperator` to the base `MapOperator` class.
Note, ideally all operators should use streaming generators. But that requires a larger refactor, which will be done in follow-up PRs (see #37630).
Related issue number
#36444
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.