bveeramani added the bug, P0, and data labels on Jun 26, 2024
What happened + What you expected to happen
I ran a batch inference job on spot instances. When GCP interrupted some instances, I expected Ray Data to recover, but instead my program errored with a message saying that the input objects are missing.
The issue might be caused by us removing references to input objects in the InputDataBuffer physical operator (ray/python/ray/data/_internal/execution/operators/input_data_buffer.py, lines 62 to 63 in b582905).
When we pop the object reference, the reference goes out of scope, and Ray might garbage collect the object. So, when the object is later needed to reconstruct an output, Ray isn't able to find the input object.
Versions / Dependencies
bd9dc16
Reproduction script
Difficult to reproduce.
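This is not a reproduction of the Ray failure, but the suspected mechanism can be illustrated without Ray. The sketch below is only an analogy: it uses Python's weakref module and CPython's immediate reference counting to stand in for Ray's object reference counting. Block, drain_with_pop, drain_keeping_refs, and count_alive are hypothetical names made up for this illustration and are not part of Ray.

```python
import weakref
from collections import deque


class Block:
    """Hypothetical stand-in for an input object tracked by reference count."""

    def __init__(self, data):
        self.data = data


def drain_with_pop(buffer):
    """Pop each input so the buffer drops its reference (the suspected pattern)."""
    trackers = []
    while buffer:
        block = buffer.popleft()
        trackers.append(weakref.ref(block))  # observe the object without owning it
        del block  # last strong reference is gone; CPython frees the object here
    return trackers


def drain_keeping_refs(buffer, retained):
    """Pop each input but keep a strong reference in `retained`."""
    trackers = []
    while buffer:
        block = buffer.popleft()
        retained.append(block)  # the object stays alive for later reconstruction
        trackers.append(weakref.ref(block))
        del block
    return trackers


def count_alive(trackers):
    """Count how many tracked objects still exist."""
    return sum(1 for tracker in trackers if tracker() is not None)


popped = drain_with_pop(deque(Block(i) for i in range(3)))

retained = []
kept = drain_keeping_refs(deque(Block(i) for i in range(3)), retained)

lost_count = count_alive(popped)  # inputs still available after popping
kept_count = count_alive(kept)    # inputs still available when refs are retained
```

In this analogy, none of the popped inputs survive, while all inputs survive when a reference is retained. If the same thing happens to the input object references in Ray Data, lineage reconstruction after a node loss would have nothing to rebuild from.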
Issue Severity
High: It blocks me from completing my task.