[Data] Prevent unbounded growth of _StatsActor.datasets #55925
Conversation
Code Review
This pull request introduces a crucial fix to prevent unbounded memory growth in the `_StatsActor` by implementing an eviction policy for finished datasets. The changes are well-structured and include a new `collections.deque` for efficient FIFO operations and clear eviction logic within `update_dataset`. The addition of the `test_stats_actor_datasets_eviction` unit test is excellent, as it thoroughly verifies the new functionality.
My review includes a couple of minor suggestions to improve code conciseness and readability in the eviction logic. Overall, this is a solid improvement to the stability of long-running Ray Data clusters.
python/ray/data/_internal/stats.py
Outdated
if (
    state["state"] == DatasetState.FINISHED.name
    or state["state"] == DatasetState.FAILED.name
):
    # Track finished/failed datasets in FIFO order for later eviction.
    self.finished_datasets_queue.append(dataset_tag)
# Evict the oldest finished datasets while the total exceeds max_stats.
while len(self.datasets) > self.max_stats and self.finished_datasets_queue:
    tag_to_evict = self.finished_datasets_queue.popleft()
    if tag_to_evict in self.datasets:
        del self.datasets[tag_to_evict]
    if tag_to_evict in self.dataset_metadatas:
        del self.dataset_metadatas[tag_to_evict]
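As background on the data-structure choice here: `collections.deque` supports O(1) appends and pops at both ends, which makes `popleft()` a natural fit for this FIFO eviction queue, whereas `list.pop(0)` is O(n). A quick illustration:

```python
from collections import deque

finished = deque()
finished.append("dataset_1")   # enqueue a newly finished tag: O(1)
finished.append("dataset_2")
oldest = finished.popleft()    # dequeue the oldest tag: O(1)
assert oldest == "dataset_1"
```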
ooc did you try considering using 1 OrderedDict instead of a dict + deque to limit the memory usage? Basically on each use of the dataset tag you can move the tag to the end of the ordereddict
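For reference, a minimal sketch of the single-`OrderedDict` approach this comment describes (the `OrderedStats` class and its methods are illustrative stand-ins, not the actual `_StatsActor` API):

```python
from collections import OrderedDict


class OrderedStats:
    """Illustrative single-OrderedDict alternative to dict + deque."""

    def __init__(self, max_stats: int):
        self.max_stats = max_stats
        # One ordered mapping of tag -> state; insertion/move order doubles
        # as the eviction order, replacing the separate deque.
        self.datasets: "OrderedDict[str, dict]" = OrderedDict()

    def update_dataset(self, dataset_tag: str, state: dict) -> None:
        self.datasets[dataset_tag] = state
        # Each use of a tag moves it to the end, as the comment suggests,
        # so the least recently updated tags sit at the front.
        self.datasets.move_to_end(dataset_tag)
        # Evict from the front while over the limit, but never evict a
        # dataset that is still running.
        while len(self.datasets) > self.max_stats:
            _tag, oldest_state = next(iter(self.datasets.items()))
            if oldest_state.get("state") in ("FINISHED", "FAILED"):
                self.datasets.popitem(last=False)
            else:
                break
```

One trade-off this surfaces: a still-running dataset at the front of the `OrderedDict` blocks eviction of finished datasets behind it, which may be part of why the PR keeps a dedicated queue of finished tags instead.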
What was the follow-up here?
Thanks for your contribution. Please make sure to sign the commits. As far as the change goes, I was wondering if you considered using an `OrderedDict`?
Signed-off-by: qiwenju <qiwenju@xiaomi.com>
Force-pushed from 5716b08 to f89a854
Background
Closes #43924.
The `_StatsActor` suffers from unbounded memory usage in long-running clusters. Its core metadata dictionaries, `datasets` and `dataset_metadatas`, lacked a proper garbage collection (GC) mechanism. This could lead to an Out-of-Memory (OOM) error in the `_StatsActor` when a large number of datasets are created.
Solution
This patch implements an eviction policy for `_StatsActor` based on the `max_stats` configuration to effectively limit the size of the `datasets` and `dataset_metadatas` dictionaries, preventing their unbounded growth.
The implementation details are as follows:
1. **Optimize Queue Implementation**
   The old, unused `fifo_queue` field has been removed. It is replaced by a new `collections.deque`, `finished_datasets_queue`, which serves as a more efficient FIFO queue for storing the tags of completed datasets.
2. **Implement Eviction Logic**
   - When a dataset's status is updated to `FINISHED` or `FAILED`, its tag is appended to the `finished_datasets_queue`.
   - A check is then immediately performed to see if the total number of entries in the `datasets` dictionary exceeds `max_stats`.
   - If the limit is exceeded, the oldest dataset tag is popped from the front of the `finished_datasets_queue`, and the corresponding entries are synchronously deleted from the `datasets` and `dataset_metadatas` dictionaries (a self-contained sketch of this mechanism appears after this list).
3. **Clarify Limitation Strategy**
   `max_stats` is not a strict hard limit. Since the eviction logic is only triggered when a dataset completes (`FINISHED` or `FAILED`), it is possible for the number of `RUNNING` datasets to cause the total entry count to temporarily exceed `max_stats`. This design ensures that metadata for in-progress tasks is never evicted, while still effectively preventing unbounded memory growth and OOM errors by cleaning up the oldest completed data as soon as a task finishes.
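For illustration, here is a minimal self-contained sketch of the mechanism described in steps 1–3 (the `StatsStore` class is a hypothetical stand-in for the relevant `_StatsActor` bookkeeping, not the actual implementation):

```python
from collections import deque


class StatsStore:
    """Illustrative stand-in for the _StatsActor bookkeeping described above."""

    def __init__(self, max_stats: int):
        self.max_stats = max_stats
        self.datasets: dict = {}
        self.dataset_metadatas: dict = {}
        # FIFO queue of tags whose datasets have finished or failed (step 1).
        self.finished_datasets_queue: deque = deque()

    def update_dataset(self, dataset_tag: str, state: dict) -> None:
        self.datasets[dataset_tag] = state
        if state["state"] in ("FINISHED", "FAILED"):
            self.finished_datasets_queue.append(dataset_tag)
        # Step 2: evict the oldest finished datasets while over the limit.
        # RUNNING datasets never enter the queue, so they are never evicted,
        # which is why max_stats is a soft rather than hard limit (step 3).
        while len(self.datasets) > self.max_stats and self.finished_datasets_queue:
            tag_to_evict = self.finished_datasets_queue.popleft()
            self.datasets.pop(tag_to_evict, None)
            self.dataset_metadatas.pop(tag_to_evict, None)
```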
Testing
To verify the correctness of this fix, a new unit test, `test_stats_actor_datasets_eviction`, has been added. This test sets a low `max_stats` value and asserts that the oldest finished dataset is correctly evicted when the limit is surpassed.
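A hypothetical outline of what such a test can look like, written against the illustrative `StatsStore` sketch above rather than the real actor:

```python
def test_stats_actor_datasets_eviction_sketch():
    store = StatsStore(max_stats=2)
    # Three datasets finish; the limit of 2 forces the oldest one out.
    store.update_dataset("ds_0", {"state": "FINISHED"})
    store.update_dataset("ds_1", {"state": "FINISHED"})
    store.update_dataset("ds_2", {"state": "FINISHED"})
    assert "ds_0" not in store.datasets  # oldest finished dataset evicted
    assert set(store.datasets) == {"ds_1", "ds_2"}
    # A RUNNING dataset can push the count over the limit but is never evicted.
    store.update_dataset("ds_3", {"state": "RUNNING"})
    assert "ds_3" in store.datasets
```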