[Dashboard] Add platform events module with K8s event ingestion and caching#62314

Merged
andrewsykim merged 2 commits into ray-project:master from richabanker:k8s-events-watcher
May 9, 2026

@richabanker
Contributor

@richabanker richabanker commented Apr 3, 2026

Description

Add a new platform_events dashboard head module that introduces a modular, extensible Provider pattern for platform-specific event monitoring in Ray.

Key changes:

  1. PlatformEventProvider: An abstract base class setting a strict type contract (Callable[[RayEvent], None]) for all platform-specific event sources
  2. KubernetesEventProvider: Encapsulates all K8s client library and watch logic including
    • watching Kubernetes events related to Ray custom resources (RayCluster, RayJob, RayService)
    • converting them to RayEvent proto
  3. PlatformEventsHead: Acts as a generic, platform-agnostic in-memory cache and REST controller
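
The split described in the list above can be sketched as follows. This is a minimal illustration, not the code from the PR: the `RayEvent` dataclass is a stand-in for the real proto, and the `start`/`stop` methods, `FakeProvider`, and `EventCache` are hypothetical names chosen for the example. Only the `Callable[[RayEvent], None]` callback contract comes from the description.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RayEvent:
    """Stand-in for the RayEvent proto; these fields are illustrative only."""

    source: str
    message: str


# The type contract named in the description.
EventCallback = Callable[[RayEvent], None]


class PlatformEventProvider(ABC):
    """Hypothetical base class: a provider pushes RayEvents into a callback."""

    def __init__(self, callback: EventCallback) -> None:
        self._callback = callback

    @abstractmethod
    def start(self) -> None:
        """Begin producing events."""

    @abstractmethod
    def stop(self) -> None:
        """Stop producing events and release resources."""


class FakeProvider(PlatformEventProvider):
    """Toy provider that emits one canned event instead of watching K8s."""

    def start(self) -> None:
        self._callback(RayEvent(source="fake", message="RayCluster created"))

    def stop(self) -> None:
        pass


class EventCache:
    """Plays the role of the head module: owns the cache, knows no platform."""

    def __init__(self) -> None:
        self.events: List[RayEvent] = []

    def on_event(self, event: RayEvent) -> None:
        self.events.append(event)


cache = EventCache()
provider = FakeProvider(cache.on_event)
provider.start()
provider.stop()
```

Because the cache only sees the callback, a provider for another platform could be swapped in without touching the caching or REST side.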

Coming next:

  1. PlatformEventsHead will use the RayEventRecorder to publish events through the unified event framework
  2. Expand KubernetesEventProvider to ingest generic K8s pod events (e.g., OOMKilled, Evicted)

Related issues

Additional information

Design doc

Tested on a GKE cluster (with some UI changes not included in this PR)
image

@richabanker richabanker requested a review from a team as a code owner April 3, 2026 01:33
@richabanker
Contributor Author

cc @sampan-s-nayak @andrewsykim for initial feedback

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the PlatformEventsHead module to the Ray dashboard, which monitors Kubernetes events for Ray clusters, jobs, and services. The module watches for relevant K8s events in separate threads, converts them into RayEvent protobuf messages, and exposes them via a new REST API endpoint. Review feedback suggests improving the efficiency of event deduplication and eviction using OrderedDict, preventing potential race conditions when accessing resource versions across threads, and optimizing the conversion of protobuf messages to dictionaries for the API response.
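
The OrderedDict suggestion in the review summary can be illustrated with a small bounded cache. This is a sketch under assumptions: the class name `BoundedEventCache`, the string event IDs, and the choice to ignore duplicates rather than refresh them are all hypothetical, and the PR's actual cache may differ.

```python
from collections import OrderedDict
from typing import Any, List


class BoundedEventCache:
    """Deduplicates by event ID and evicts the oldest entry when full."""

    def __init__(self, max_size: int) -> None:
        if max_size <= 0:
            raise ValueError("max_size must be positive")
        self._max_size = max_size
        self._events: "OrderedDict[str, Any]" = OrderedDict()

    def add(self, event_id: str, event: Any) -> bool:
        """Store the event and return True, or return False for a duplicate."""
        if event_id in self._events:  # O(1) membership check
            return False
        self._events[event_id] = event
        if len(self._events) > self._max_size:
            # last=False pops the oldest inserted entry in O(1).
            self._events.popitem(last=False)
        return True

    def values(self) -> List[Any]:
        """Return cached events from oldest to newest."""
        return list(self._events.values())


demo_cache = BoundedEventCache(max_size=2)
results = [
    demo_cache.add("a", "A"),
    demo_cache.add("b", "B"),
    demo_cache.add("a", "A again"),  # duplicate, ignored
    demo_cache.add("c", "C"),  # evicts "a"
]
```

A single OrderedDict covers both needs: keys give constant-time deduplication, and insertion order gives constant-time eviction of the oldest event.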

@ray-gardener (bot) added the core, community-contribution, and kubernetes labels on Apr 3, 2026
@dancingactor
Contributor

dancingactor commented Apr 5, 2026

@richabanker richabanker force-pushed the k8s-events-watcher branch 3 times, most recently from 16ec456 to 73e6b4e Compare April 6, 2026 22:20
@richabanker richabanker force-pushed the k8s-events-watcher branch 2 times, most recently from 92a3826 to a1b7fb1 Compare April 9, 2026 22:37
@richabanker richabanker force-pushed the k8s-events-watcher branch from 52dfb26 to 5436c5f Compare May 6, 2026 22:59
@richabanker
Contributor Author

Thanks to some excellent points raised by @MengjinYan, I have refactored the PR to encapsulate all the K8s-specific event-watching logic in k8s_provider.py, keeping platform_event_head.py lightweight: it only processes the RayEvents it receives from the K8s provider. This way, any other platform can supply its own event-watching logic as a new provider and be wired into platform_event_head.py.

Requesting another round of review for the changes.
FYI @andrewsykim @ryanaoleary

@richabanker richabanker force-pushed the k8s-events-watcher branch from cc17528 to 5787751 Compare May 7, 2026 00:19
Contributor

@MengjinYan MengjinYan left a comment

Thanks for the refactoring! The change looks good! Only one minor comment that can be addressed in a followup PR.

"Platform events will be disabled."
)

def _process_event_callback(self, ray_event: RayEvent):
Contributor

Minor: The docstring expects _process_event_callback to be called from the main asyncio loop. But the contract is not enforced from the head side. A potential fix can be to dispatch the callback directly to the main asyncio loop in the _process_event_callback itself.

It doesn't break the functionality for this PR, so there's no need to address it here. We can add it when integrating with the Python event library.
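
The fix suggested in this comment could look roughly like the sketch below, where the callback itself hands the work to the event loop, so it is safe to call from any thread. The `Head` class, the string events, and the `received` flag are hypothetical stand-ins. `loop.call_soon_threadsafe` is the standard asyncio call for scheduling work from another thread.

```python
import asyncio
import threading
from typing import List


class Head:
    """Hypothetical head module; only the dispatch logic is sketched."""

    def __init__(self) -> None:
        # Assumption: the head is constructed on the event loop thread.
        self._loop = asyncio.get_running_loop()
        self.loop_thread_id = threading.get_ident()
        self.events: List[str] = []
        self.handled_on: List[int] = []
        self.received = asyncio.Event()

    def process_event_callback(self, event: str) -> None:
        """Callable from any thread; the real work runs on the loop."""
        self._loop.call_soon_threadsafe(self._handle_on_loop, event)

    def _handle_on_loop(self, event: str) -> None:
        # Always runs on the loop thread, so no lock is needed here.
        self.handled_on.append(threading.get_ident())
        self.events.append(event)
        self.received.set()


async def main() -> Head:
    head = Head()
    # Simulate a provider thread delivering one event.
    worker = threading.Thread(target=head.process_event_callback, args=("evt-1",))
    worker.start()
    await asyncio.wait_for(head.received.wait(), timeout=5.0)
    worker.join()
    return head


head = asyncio.run(main())
```

With this shape the threading contract lives in one place on the head side, and providers do not need to know which thread the head expects.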

@richabanker richabanker force-pushed the k8s-events-watcher branch from 5787751 to e60b26b Compare May 7, 2026 04:02
@richabanker richabanker force-pushed the k8s-events-watcher branch 2 times, most recently from 0773bb0 to 1ffc0c3 Compare May 7, 2026 19:53
@andrewsykim added the go label (add ONLY when ready to merge, run all tests) on May 7, 2026
Collaborator

@edoakes edoakes left a comment

Non-blocking comments, aside from using asyncio k8s client if possible to avoid spawning excessive threads. If this has already been considered, please note it in the PR description.

Comment thread python/ray/dashboard/modules/platform_events/providers/k8s_provider.py Outdated
Comment on lines +146 to +150
# Start a dedicated, named OS thread for each target to ensure strict execution guarantees
for kind, name in targets:
    t = threading.Thread(
        target=self._run_k8s_watch,
        args=(kind, name),
Collaborator

is there no asyncio-compatible watch API?

Contributor Author

I don't think so. Some asyncio support was recently added in kubernetes-client/python#2547, but the note in that PR's description says "still missing dynamic client, watch, stream etc", so asyncio watch is not yet supported in the K8s Python client. cc @yliaog for confirmation

that's correct, not right now. that will be added in the next release.

Member

thanks @yliaog, let's follow up and use the new asyncio client once there's an official release
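
For readers following the thread above, the thread-per-target pattern under discussion can be sketched like this. It is an illustration only: `run_watch` is a hypothetical stand-in for the blocking watch loop (it records its target and waits for a stop signal instead of calling Kubernetes), and the target names and thread-name format are made up for the example.

```python
import threading
from typing import List, Tuple

# Hypothetical watch targets as (kind, name) pairs.
targets: List[Tuple[str, str]] = [("RayCluster", "demo"), ("RayJob", "demo-job")]

stop_event = threading.Event()
seen: List[str] = []
seen_lock = threading.Lock()


def run_watch(kind: str, name: str) -> None:
    """Stand-in for a blocking watch loop on one target."""
    with seen_lock:
        seen.append(f"{kind}/{name}")
    # A real loop would read the watch stream until asked to stop.
    stop_event.wait()


threads: List[threading.Thread] = []
for kind, name in targets:
    t = threading.Thread(
        target=run_watch,
        args=(kind, name),
        name=f"k8s-watch-{kind}-{name}",  # named threads are easier to debug
        daemon=True,
    )
    t.start()
    threads.append(t)

# Shutdown: signal every watcher, then wait for each one to exit.
stop_event.set()
for t in threads:
    t.join(timeout=1.0)
```

The cost the reviewer points at is visible here: the number of OS threads grows with the number of targets, which an asyncio-based watch would avoid.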

…aching

Signed-off-by: Richa Banker <richabanker@google.com>
@richabanker richabanker force-pushed the k8s-events-watcher branch from 1ffc0c3 to ff47644 Compare May 8, 2026 22:23
Contributor

@ryanaoleary ryanaoleary left a comment

did another pass and have no remaining comments besides what's already been mentioned - LGTM


@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit ff47644.

try:
    t.join(timeout=1.0)
except Exception as e:
    logger.warning(f"Error joining thread {t.name}: {e}")

Thread join timeout silently loses thread cleanup guarantee

Low Severity

The join_all_threads function uses a 1-second timeout per thread but doesn't check whether the join actually succeeded. If a thread is blocked on a long network call and doesn't terminate within the timeout, cleanup() completes while the thread is still alive and may attempt to call loop.call_soon_threadsafe after the event loop is torn down, potentially raising an unhandled RuntimeError in the background thread.
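
One way to address this finding is sketched below. `Thread.join` always returns `None` and does not raise when its timeout expires, so the only way to detect a timeout is to call `is_alive()` afterwards. The function here reuses the name `join_all_threads` from the report, but its signature and return value are assumptions, not the PR's code.

```python
import threading
from typing import Iterable, List


def join_all_threads(
    threads: Iterable[threading.Thread], timeout: float = 1.0
) -> List[str]:
    """Join each thread and return the names of those still running."""
    still_alive: List[str] = []
    for t in threads:
        t.join(timeout=timeout)
        # join() gives no signal on timeout, so check explicitly.
        if t.is_alive():
            still_alive.append(t.name)
    return still_alive


# Demo: one thread exits immediately, one blocks until released.
release = threading.Event()
quick = threading.Thread(target=lambda: None, name="quick")
stuck = threading.Thread(target=release.wait, name="stuck", daemon=True)
quick.start()
stuck.start()

leftover = join_all_threads([quick, stuck], timeout=0.1)

# Clean up the demo thread.
release.set()
stuck.join()
```

A caller could log the returned names or skip tearing down shared resources, such as the event loop, while any watcher is still alive.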

@andrewsykim andrewsykim merged commit b8329de into ray-project:master May 9, 2026
6 checks passed
@richabanker richabanker deleted the k8s-events-watcher branch May 11, 2026 19:46
dancingactor pushed a commit to dancingactor/ray that referenced this pull request May 13, 2026
…aching (ray-project#62314)

## Description
Add a new platform_events dashboard head module that introduces a
modular, extensible Provider pattern for platform-specific event
monitoring in Ray.

Key changes:
1. `PlatformEventProvider`: An abstract base class setting a strict type
contract (`Callable[[RayEvent], None]`) for all platform-specific event
sources
2. KubernetesEventProvider: Encapsulates all K8s client library and
watch logic including
- watching Kubernetes events related to Ray custom resources
(RayCluster, RayJob, RayService)
    - converting them to RayEvent proto
3. PlatformEventsHead: Acts as a generic, platform-agnostic in-memory
cache and REST controller

Coming next:
1. PlatformEventsHead will use the RayEventRecorder to publish events
through the unified event framework
2. Expand KubernetesEventProvider to ingest generic K8s pod events
(e.g., OOMKilled, Evicted)

## Related issues

## Additional information
[Design
doc](https://docs.google.com/document/d/14kRE4S0vMDKX7o8imDTh1ku6eKOcdLakCRLDuTPf2mU/edit?resourcekey=0-olOaP0W6oeRpB27WxTqMZQ&tab=t.0)

Tested on a GKE cluster (with some UI changes not included in this PR)
<img width="3336" height="1986" alt="image"
src="https://github.com/user-attachments/assets/cd258ab4-8157-46a5-8073-31b9c2709ff1"
/>

Signed-off-by: Richa Banker <richabanker@google.com>
am-kinetica pushed a commit to kineticadb/ray that referenced this pull request May 14, 2026
Signed-off-by: anindyam1969 <amukherjee@kinetica.com>
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
Labels

community-contribution: Contributed by the community
core: Issues that should be addressed in Ray Core
go: add ONLY when ready to merge, run all tests
kubernetes

8 participants