
[data] Ray Data is not respecting object store memory limit #42374

Closed
stephanie-wang opened this issue Jan 12, 2024 · 4 comments · Fixed by #42504
Labels: bug (Something that is supposed to be working; but isn't), data (Ray Data-related issues), P0 (Issues that should be fixed in short order)

Comments

@stephanie-wang
Contributor

stephanie-wang commented Jan 12, 2024

What happened + What you expected to happen

This script runs 10 tasks, each producing a row that exceeds the configured object store memory limit. The consumer loop sleeps after reading each row, so I expected the consumer to backpressure the Dataset execution so that only one task runs at a time. Instead, all 10 tasks get scheduled immediately.

Versions / Dependencies

3.0dev

Reproduction script

import ray
import time
import numpy as np

ctx = ray.data.DataContext.get_current()

# Each output row is ~1 MB of float64, well above the 1e5-byte limit set below.
def sleep(row):
    time.sleep(0.5)
    return {"val": np.zeros(int(1e6 / 8))}

# Cap execution at 2 CPUs and 100 KB of object store memory.
ctx.execution_options.resource_limits.cpu = 2
ctx.execution_options.resource_limits.object_store_memory = 1e5

ds = ray.data.range(20, parallelism=20)
sleep_ds = ds.map(sleep)
batch_start = time.perf_counter()

start = time.perf_counter()

i = 0
# Consume slowly (1 s per batch) so the producer should be backpressured.
for batch in sleep_ds.iter_batches(batch_size=None):
    print("blocked time", time.perf_counter() - batch_start)
    time.sleep(1)
    i += 1
    batch_start = time.perf_counter()

end = time.perf_counter()
print("Took", end - start, "expected time", 0.5 * i)
print(sleep_ds.stats())

ray.timeline("timeline.json")
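
As a rough size check (restating the numbers already in the script; the byte counts assume float64 rows at 8 bytes per element), each output row is roughly 10x larger than the configured limit:

row_bytes = int(1e6 / 8) * 8    # ~1_000_000 bytes (~1 MB) per output row
limit_bytes = 1e5               # configured object_store_memory limit (100 KB)
print(row_bytes / limit_bytes)  # -> 10.0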

Issue Severity

High: It blocks me from completing my task.

@stephanie-wang added the bug, P0, and data labels and removed the triage label on Jan 12, 2024
@stephanie-wang
Contributor Author

Seems like a regression and stability issue, so marking P0.

cc @raulchen @franklsf95

@bveeramani
Member

@stephanie-wang I ran the script with verbose logging, but I only saw at most 2 active tasks. How did you determine that all tasks got scheduled immediately?
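
For anyone re-running this, one way to surface per-operator scheduling detail is to enable verbose progress reporting; the exact knob here is an assumption and may differ across Ray versions:

import ray

ctx = ray.data.DataContext.get_current()
# Assumption: this Ray version exposes a verbose_progress flag on
# ExecutionOptions; some releases use the RAY_DATA_VERBOSE_PROGRESS
# environment variable instead.
ctx.execution_options.verbose_progress = True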

@stephanie-wang
Contributor Author

Ah, sorry, the issue description is outdated.

What actually happens is:

  • some tasks get scheduled right away due to prefetch_batches (expected)
  • later, Ray Data always schedules 2 tasks to run at a time, which exceeds the object store limit; we should only be running 1 at a time (see the sketch below for one way to measure this)
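
A minimal sketch for measuring the overlap, assuming a plain Ray actor used as a shared counter (the ConcurrencyTracker name and the wrapping of the repro's sleep() are illustrative, not part of the original report):

import ray
import time
import numpy as np

@ray.remote
class ConcurrencyTracker:
    """Shared counter that records the peak number of overlapping tasks."""
    def __init__(self):
        self.active = 0
        self.peak = 0

    def enter(self):
        self.active += 1
        self.peak = max(self.peak, self.active)

    def leave(self):
        self.active -= 1

    def peak_seen(self):
        return self.peak

tracker = ConcurrencyTracker.remote()

def sleep(row):
    # Same body as the repro's sleep(), bracketed by tracker updates.
    ray.get(tracker.enter.remote())
    time.sleep(0.5)
    ray.get(tracker.leave.remote())
    return {"val": np.zeros(int(1e6 / 8))}

ctx = ray.data.DataContext.get_current()
ctx.execution_options.resource_limits.cpu = 2
ctx.execution_options.resource_limits.object_store_memory = 1e5

ds = ray.data.range(20, parallelism=20).map(sleep)
for batch in ds.iter_batches(batch_size=None):
    time.sleep(1)

print("peak concurrent map tasks:", ray.get(tracker.peak_seen.remote()))

With correct backpressure the printed peak should be 1; the behavior described above would show 2.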

@bveeramani
Member

Got it. I think #42504 should fix the issue.
