[Dashboard] Add platform events module with K8s event ingestion and caching#62314
Conversation
db4d367 to
76d31f9
Compare
|
cc @sampan-s-nayak @andrewsykim for initial feedback |
There was a problem hiding this comment.
Code Review
This pull request introduces the PlatformEventsHead module to the Ray dashboard, which monitors Kubernetes events for Ray clusters, jobs, and services. The module watches for relevant K8s events in separate threads, converts them into RayEvent protobuf messages, and exposes them via a new REST API endpoint. Review feedback suggests improving the efficiency of event deduplication and eviction using OrderedDict, preventing potential race conditions when accessing resource versions across threads, and optimizing the conversion of protobuf messages to dictionaries for the API response.
16ec456 to
73e6b4e
Compare
92a3826 to
a1b7fb1
Compare
a1b7fb1 to
36786f0
Compare
a6a4152 to
59737da
Compare
59737da to
917867a
Compare
917867a to
08e857a
Compare
08e857a to
ff013ed
Compare
52dfb26 to
5436c5f
Compare
|
Thanks to some excellent points raised by @MengjinYan , I have refactored the PR to encapsulate all the K8s specific event-watching logic in k8s_provider.py while keeping platform_event_head.py lightweight for processing of the RayEvents (after receiving them from the k8s provider). This way, we can extend the setup to have any other platform_event provider have their own event-watching logic as a new provider that can be wired into the platform_event_head.py Requesting another round of review for the changes. |
cc17528 to
5787751
Compare
MengjinYan
left a comment
There was a problem hiding this comment.
Thanks for the refactoring! The change looks good! Only one minor comment that can be addressed in a followup PR.
| "Platform events will be disabled." | ||
| ) | ||
|
|
||
| def _process_event_callback(self, ray_event: RayEvent): |
There was a problem hiding this comment.
Minor: The docstring expects _process_event_callback to be called from the main asyncio loop. But the contract is not enforced from the head side. A potential fix can be to dispatch the callback directly to the main asyncio loop in the _process_event_callback itself.
It doesn't break the functionality for this PR so no need to address in this one. We can add it when integrating with the python event library
5787751 to
e60b26b
Compare
0773bb0 to
1ffc0c3
Compare
edoakes
left a comment
There was a problem hiding this comment.
Non-blocking comments, aside from using asyncio k8s client if possible to avoid spawning excessive threads. If this has already been considered, please note it in the PR description.
| # Start a dedicated, named OS thread for each target to ensure strict execution guarantees | ||
| for kind, name in targets: | ||
| t = threading.Thread( | ||
| target=self._run_k8s_watch, | ||
| args=(kind, name), |
There was a problem hiding this comment.
is there no asyncio-compatible watch API?
There was a problem hiding this comment.
I dont think so.. recently some support for asyncio was added in kubernetes-client/python#2547, but the note in the PR description says "still missing dynamic client, watch, stream etc", so asyncio watch is not yet supported in the K8s python client. cc @yliaog for confirmation
There was a problem hiding this comment.
that's correct, not right now. that will be added in the next release.
There was a problem hiding this comment.
thanks @yliaog, let's follow-up to use the new asyncio client once there's an official release
…aching Signed-off-by: Richa Banker <richabanker@google.com>
1ffc0c3 to
ff47644
Compare
ryanaoleary
left a comment
There was a problem hiding this comment.
did another pass and have no remaining comments besides what's already been mentioned - LGTM
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit ff47644. Configure here.
| try: | ||
| t.join(timeout=1.0) | ||
| except Exception as e: | ||
| logger.warning(f"Error joining thread {t.name}: {e}") |
There was a problem hiding this comment.
Thread join timeout silently loses thread cleanup guarantee
Low Severity
The join_all_threads function uses a 1-second timeout per thread but doesn't check whether the join actually succeeded. If a thread is blocked on a long network call and doesn't terminate within the timeout, cleanup() completes while the thread is still alive and may attempt to call loop.call_soon_threadsafe after the event loop is torn down, potentially raising an unhandled RuntimeError in the background thread.
Reviewed by Cursor Bugbot for commit ff47644. Configure here.
…aching (ray-project#62314) ## Description Add a new platform_events dashboard head module that introduces a modular, extensible Provider pattern for platform-specific event monitoring in Ray. Key changes: 1. `PlatformEventProvider`: An abstract base class setting a strict type contract (`Callable[[RayEvent], None]`) for all platform-specific event sources 2. KubernetesEventProvider: Encapsulates all K8s client library and watch logic including - watching Kubernetes events related to Ray custom resources (RayCluster, RayJob, RayService) - converting them to RayEvent proto 3. PlatformEventsHead: Acts as a generic, platform-agnostic in-memory cache and REST controller Coming next: 1. PlatformEventsHead will use the RayEventRecorder to publish events through the unified event framework 2. Expand KubernetesEventProvider to ingest generic K8s pod events (e.g., OOMKilled, Evicted) ## Related issues ## Additional information [Design doc](https://docs.google.com/document/d/14kRE4S0vMDKX7o8imDTh1ku6eKOcdLakCRLDuTPf2mU/edit?resourcekey=0-olOaP0W6oeRpB27WxTqMZQ&tab=t.0) Tested on a GKE cluster (with some UI changes not included in this PR) <img width="3336" height="1986" alt="image" src="https://github.com/user-attachments/assets/cd258ab4-8157-46a5-8073-31b9c2709ff1" /> Signed-off-by: Richa Banker <richabanker@google.com>
…aching (ray-project#62314) ## Description Add a new platform_events dashboard head module that introduces a modular, extensible Provider pattern for platform-specific event monitoring in Ray. Key changes: 1. `PlatformEventProvider`: An abstract base class setting a strict type contract (`Callable[[RayEvent], None]`) for all platform-specific event sources 2. KubernetesEventProvider: Encapsulates all K8s client library and watch logic including - watching Kubernetes events related to Ray custom resources (RayCluster, RayJob, RayService) - converting them to RayEvent proto 3. PlatformEventsHead: Acts as a generic, platform-agnostic in-memory cache and REST controller Coming next: 1. PlatformEventsHead will use the RayEventRecorder to publish events through the unified event framework 2. Expand KubernetesEventProvider to ingest generic K8s pod events (e.g., OOMKilled, Evicted) ## Related issues ## Additional information [Design doc](https://docs.google.com/document/d/14kRE4S0vMDKX7o8imDTh1ku6eKOcdLakCRLDuTPf2mU/edit?resourcekey=0-olOaP0W6oeRpB27WxTqMZQ&tab=t.0) Tested on a GKE cluster (with some UI changes not included in this PR) <img width="3336" height="1986" alt="image" src="https://github.com/user-attachments/assets/cd258ab4-8157-46a5-8073-31b9c2709ff1" /> Signed-off-by: Richa Banker <richabanker@google.com> Signed-off-by: anindyam1969 <amukherjee@kinetica.com>
…aching (ray-project#62314) ## Description Add a new platform_events dashboard head module that introduces a modular, extensible Provider pattern for platform-specific event monitoring in Ray. Key changes: 1. `PlatformEventProvider`: An abstract base class setting a strict type contract (`Callable[[RayEvent], None]`) for all platform-specific event sources 2. KubernetesEventProvider: Encapsulates all K8s client library and watch logic including - watching Kubernetes events related to Ray custom resources (RayCluster, RayJob, RayService) - converting them to RayEvent proto 3. PlatformEventsHead: Acts as a generic, platform-agnostic in-memory cache and REST controller Coming next: 1. PlatformEventsHead will use the RayEventRecorder to publish events through the unified event framework 2. Expand KubernetesEventProvider to ingest generic K8s pod events (e.g., OOMKilled, Evicted) ## Related issues ## Additional information [Design doc](https://docs.google.com/document/d/14kRE4S0vMDKX7o8imDTh1ku6eKOcdLakCRLDuTPf2mU/edit?resourcekey=0-olOaP0W6oeRpB27WxTqMZQ&tab=t.0) Tested on a GKE cluster (with some UI changes not included in this PR) <img width="3336" height="1986" alt="image" src="https://github.com/user-attachments/assets/cd258ab4-8157-46a5-8073-31b9c2709ff1" /> Signed-off-by: Richa Banker <richabanker@google.com>


Description
Add a new platform_events dashboard head module that introduces a modular, extensible Provider pattern for platform-specific event monitoring in Ray.
Key changes:
PlatformEventProvider: An abstract base class setting a strict type contract (Callable[[RayEvent], None]) for all platform-specific event sourcesComing next:
Related issues
Additional information
Design doc
Tested on a GKE cluster (with some UI changes not included in this PR)
