[Core] Publish platform events via Ray Event Recorder#63329

Open
richabanker wants to merge 1 commit into
ray-project:masterfrom
richabanker:platform-events-ray-event-recorder

Conversation

@richabanker
Contributor

Description

Add support for publishing platform events via the Python Ray event exporter framework.

@richabanker richabanker requested review from a team, MengjinYan, dayshah and edoakes as code owners May 13, 2026 22:30
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the PlatformEventBuilder class to support infrastructure platform events, such as those from Kubernetes, and integrates it into the Ray dashboard's observability module. The changes include initializing the EventRecorder in the dashboard head and emitting events during processing callbacks. Feedback suggests allowing the EventRecorder to generate unique IDs for event updates to avoid deduplication issues, refactoring environment variable checks for efficiency, and moving imports out of the event processing hot path.

Comment thread python/ray/dashboard/modules/platform_events/platform_event_head.py
Comment thread python/ray/dashboard/modules/platform_events/platform_event_head.py Outdated
Comment on lines +134 to +142
if os.environ.get("RAY_ENABLE_PYTHON_RAY_EVENT", "False").lower() in (
    "true",
    "1",
):
    try:
        from ray._common.observability.platform_events import (
            PlatformEventBuilder,
        )
        from ray._raylet import EventRecorder
Contributor

medium

Several modules are imported inside _process_event_callback for every event. Since this callback can be invoked frequently, these imports introduce unnecessary overhead. It is recommended to move these imports to the top of the file or at least outside the hot path of the callback.
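
A minimal sketch of this suggestion, using stdlib stand-ins rather than the real Ray modules. The class name `EventHandlerSketch` and the use of `json` in place of the optional Ray imports are assumptions made for illustration; only the `RAY_ENABLE_PYTHON_RAY_EVENT` variable comes from the diff. The flag is parsed and the dependency resolved once at construction, so the per-event callback only checks a boolean.

```python
import importlib
import os


def _env_flag(name: str, default: str = "False") -> bool:
    """Parse a boolean-like environment variable once."""
    return os.environ.get(name, default).lower() in ("true", "1")


class EventHandlerSketch:
    """Hypothetical handler; `json` stands in for the optional Ray imports."""

    def __init__(self) -> None:
        # Evaluate the flag and resolve the import once, not per event.
        self._enabled = _env_flag("RAY_ENABLE_PYTHON_RAY_EVENT")
        self._encoder = importlib.import_module("json") if self._enabled else None
        self.emitted = []

    def process_event(self, event: dict) -> bool:
        # Hot path: a cheap attribute check, no imports or env lookups.
        if not self._enabled:
            return False
        self.emitted.append(self._encoder.dumps(event))
        return True
```

Resolving the flag at construction also means a change to the environment variable after startup is not picked up, which is usually the intended behavior for a process-level feature switch.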

    timestamp_ns=int(time.time() * 1e9),
)
EventRecorder.emit(cython_event)
return True

E2E test never initializes EventRecorder on worker

High Severity

The remote task emit_test_platform_event calls EventRecorder.emit() on a worker process, but EventRecorder.initialize() is never called on that worker. EventRecorder is a per-process singleton, and the only initialize() call in the entire codebase is in platform_event_head.py's run() method, which runs in the dashboard head process, a completely different process. When the worker calls emit(), the singleton _event_recorder_instance is None, so emit_batch silently drops the event and returns False. The test will then time out at the wait_for_condition check. The comment even says "explicitly initializes and emits", but the code only emits.
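
The failure mode described here can be reproduced with a small stand-in. `RecorderSketch` and `emit_from_worker` are hypothetical names, not the real `EventRecorder` API, whose `initialize()` signature is not shown in this thread. The sketch only illustrates why an emit in a process that never initialized the singleton returns False.

```python
from typing import List, Optional


class RecorderSketch:
    """Hypothetical per-process singleton mirroring the described behavior."""

    _instance: Optional["RecorderSketch"] = None

    def __init__(self) -> None:
        self.events: List[str] = []

    @classmethod
    def initialize(cls) -> None:
        # Must run once in every process that intends to emit.
        if cls._instance is None:
            cls._instance = cls()

    @classmethod
    def emit(cls, event: str) -> bool:
        # Without initialize(), the event is dropped and False is returned.
        if cls._instance is None:
            return False
        cls._instance.events.append(event)
        return True


def emit_from_worker(event: str) -> bool:
    """What a worker-side task would need to do: initialize, then emit."""
    RecorderSketch.initialize()
    return RecorderSketch.emit(event)
```

Under this model, a test that emits from a remote task would need the task body to look like `emit_from_worker`, since class-level state set in the dashboard head process is not visible to a worker process.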

Reviewed by Cursor Bugbot for commit e7be190.

@richabanker richabanker force-pushed the platform-events-ray-event-recorder branch from 8718785 to b35d418 Compare May 13, 2026 22:40
Comment thread python/ray/dashboard/modules/aggregator/tests/test_ray_platform_events.py Outdated
@richabanker richabanker force-pushed the platform-events-ray-event-recorder branch from b35d418 to bac7c35 Compare May 13, 2026 23:25
Comment thread python/ray/dashboard/modules/platform_events/platform_event_head.py
@richabanker richabanker force-pushed the platform-events-ray-event-recorder branch from bac7c35 to 271db30 Compare May 13, 2026 23:36
Comment thread python/ray/dashboard/modules/platform_events/tests/test_platform_event_head.py Outdated
@richabanker richabanker force-pushed the platform-events-ray-event-recorder branch 3 times, most recently from a8a923c to 1e59f09 Compare May 14, 2026 00:21

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 1e59f09.

)
head_node_id_hex = (
    head_node_id_bytes.decode() if head_node_id_bytes else None
)

Duplicates existing get_head_node_id utility function

Low Severity

The logic to fetch the head node ID from GCS (lines 54–61) duplicates the existing get_head_node_id() utility in ray.dashboard.modules.job.utils, which performs the identical KV lookup with the same key, namespace, and timeout, and returns the decoded hex string or None.

Reviewed by Cursor Bugbot for commit 1e59f09.

@richabanker
Contributor Author

@sampan-s-nayak could you please take a pass at this PR when you get a chance? Thanks!


  def _process_event_callback(self, ray_event: RayEvent):
-     """Callback running in the main asyncio loop to cache events."""
+     """Thread-safe entry point that dispatches event caching to the main asyncio loop."""
Contributor Author

change driven by this comment
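
For readers unfamiliar with the pattern the new docstring describes, here is a minimal sketch under stated assumptions. `EventCacheSketch` and its methods are hypothetical and do not reproduce the PR's implementation; the only real API used is asyncio's `loop.call_soon_threadsafe`, which schedules a callback on the loop from another thread.

```python
import asyncio
import threading


class EventCacheSketch:
    """Hypothetical cache whose state is only touched on the loop thread."""

    def __init__(self, loop: asyncio.AbstractEventLoop) -> None:
        self._loop = loop
        self.cached = []

    def process_event_callback(self, event: str) -> None:
        # Safe to call from any thread: schedules the work on the loop.
        self._loop.call_soon_threadsafe(self._cache_event, event)

    def _cache_event(self, event: str) -> None:
        # Runs on the loop thread, so no lock is needed around the list.
        self.cached.append(event)


async def demo() -> list:
    loop = asyncio.get_running_loop()
    cache = EventCacheSketch(loop)
    producer = threading.Thread(
        target=cache.process_event_callback, args=("evt-1",)
    )
    producer.start()
    # Join in the default executor so the loop stays free to run callbacks.
    await loop.run_in_executor(None, producer.join)
    await asyncio.sleep(0)  # yield once more before reading the cache
    return cache.cached
```

Running `asyncio.run(demo())` exercises the cross-thread path: the producer thread never mutates `cached` directly.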

@ray-gardener ray-gardener Bot added core Issues that should be addressed in Ray Core observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling community-contribution Contributed by the community labels May 14, 2026
@edoakes
Collaborator

edoakes commented May 15, 2026

@sampan-s-nayak PTAL

RAY_EXPORT_EVENT_MAX_BACKUP_COUNT = env_bool("RAY_EXPORT_EVENT_MAX_BACKUP_COUNT", 20)

# Enables emitting events through the Python EventRecorder (One-Event framework)
# to the AggregatorAgent. When enabled, dashboard modules that support it emit
Contributor

is this specific to dashboard modules? should we have a specific flag just for platform events? (I think we already had some config to enable platform events, can we reuse that?)

Contributor Author

I actually tried to keep this from your PR #61099, which I think added this env var to control whether the One-Event framework pipeline is used for Python event emission. The other platform-specific env var is RAY_DASHBOARD_INGEST_PLATFORM_EVENTS, but that's more about whether the k8s events should be ingested/collected at all.

Contributor

@sampan-s-nayak sampan-s-nayak May 20, 2026

If we have a single flag for all Python events, then in the future we won't have a way to disable a specific event type. Can we instead make the config a list of supported event types and enable Platform_events by default?
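
One way such a list-valued config could look, as a sketch only. The variable name `RAY_PYTHON_EVENT_ENABLED_TYPES`, the default value, and both helper functions are hypothetical and not defined anywhere in the PR; the sketch assumes a comma-separated, case-insensitive list.

```python
import os
from typing import FrozenSet

# Hypothetical variable name and default; neither is defined in the PR.
ENABLED_TYPES_ENV = "RAY_PYTHON_EVENT_ENABLED_TYPES"
DEFAULT_ENABLED_TYPES = "PLATFORM_EVENT"


def enabled_event_types() -> FrozenSet[str]:
    """Parse a comma-separated list of event types, normalized to upper case."""
    raw = os.environ.get(ENABLED_TYPES_ENV, DEFAULT_ENABLED_TYPES)
    return frozenset(
        part.strip().upper() for part in raw.split(",") if part.strip()
    )


def is_event_type_enabled(event_type: str) -> bool:
    """Return True if the given event type appears in the enabled list."""
    return event_type.upper() in enabled_event_types()
```

With this shape, leaving the variable unset enables platform events only, and setting it to an empty string disables every type.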

try:
    from ray._raylet import EventRecorder

    head_node_id_bytes = await self.gcs_client.async_internal_kv_get(
Contributor

we can use get_head_node_id() here instead?

Contributor Author

You mean from https://github.com/ray-project/ray/blob/master/python/ray/dashboard/modules/job/utils.py#L37? I could, but that adds a cross-module dependency from the platform_events module onto the job module, which is unrelated.

I could however add it in ray/dashboard/utils.py and make both platform_events and job modules use it, lmk if you prefer that
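
A rough sketch of what a shared helper of this kind might look like. The function name `fetch_head_node_id`, the key, namespace, and timeout constants, and the keyword-argument call shape are all assumptions made for illustration; the real values live in the existing get_head_node_id() utility and are not shown in this thread. `FakeGcsClient` is an in-memory stand-in so the example runs without a cluster.

```python
import asyncio
from typing import Optional

# Hypothetical key, namespace, and timeout; the real values are in the
# existing get_head_node_id() helper and are not shown in this thread.
HEAD_NODE_ID_KEY = b"head_node_id"
KV_NAMESPACE = b"dashboard"
KV_TIMEOUT_S = 10


async def fetch_head_node_id(gcs_client) -> Optional[str]:
    """Fetch the head node ID from the KV store and decode it, or None."""
    raw = await gcs_client.async_internal_kv_get(
        HEAD_NODE_ID_KEY, namespace=KV_NAMESPACE, timeout=KV_TIMEOUT_S
    )
    return raw.decode() if raw else None


class FakeGcsClient:
    """In-memory stand-in so the sketch runs without a cluster."""

    def __init__(self, store: dict) -> None:
        self._store = store

    async def async_internal_kv_get(self, key, namespace=None, timeout=None):
        # Ignores namespace and timeout; a real client would honor both.
        return self._store.get(key)
```

Both the job module and the platform_events module could then import the one helper from a common location, which avoids the cross-module dependency raised above.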

Contributor

yeah sg

Comment thread python/ray/dashboard/modules/platform_events/platform_event_head.py Outdated
@richabanker richabanker force-pushed the platform-events-ray-event-recorder branch from 1e59f09 to aeeb1e1 Compare May 19, 2026 22:57
Signed-off-by: Richa Banker <richabanker@google.com>
@richabanker richabanker force-pushed the platform-events-ray-event-recorder branch from 64e7752 to 07e69ed Compare May 19, 2026 22:58
Labels

community-contribution Contributed by the community core Issues that should be addressed in Ray Core observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling

3 participants