[core] Add metrics for disk and network I/O #23546
@@ -348,19 +361,25 @@ def _get_load_avg(self):
        per_cpu_load = tuple((round(x / self._cpu_counts[0], 2) for x in load))
        return load, per_cpu_load

    @staticmethod
    def _compute_speed_from_hist(hist):
        while len(hist) > 7:
why is 7 here?
Not sure, I just copied what was here already...
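For readers following the question about the window size, here is a minimal sketch of how a rate can be derived from a bounded history of cumulative counters. It assumes the `(time, (sent, recv))` sample layout shown in the diff; the body after the `while` line is not visible in this thread, so the arithmetic below is an assumption, and `MAX_HIST_LEN` and the zero-delta guard are hypothetical additions for illustration.

```python
from typing import List, Tuple

# Hypothetical name for the window size; the diff hard-codes 7.
MAX_HIST_LEN = 7

# (timestamp in seconds, cumulative counters such as (sent, recv))
Sample = Tuple[float, Tuple[float, ...]]


def compute_speed_from_hist(hist: List[Sample]) -> Tuple[float, ...]:
    """Return per-second rates over the retained window, trimming hist in place."""
    # Drop the oldest samples so at most MAX_HIST_LEN remain.
    while len(hist) > MAX_HIST_LEN:
        hist.pop(0)
    then, prev_stats = hist[0]
    now, now_stats = hist[-1]
    time_delta = now - then
    if time_delta <= 0:
        # A single sample (or a clock that did not advance) yields no rate.
        return tuple(0.0 for _ in now_stats)
    # Rate = change in each cumulative counter divided by elapsed time.
    return tuple((y - x) / time_delta for x, y in zip(prev_stats, now_stats))
```

With this shape, the constant only controls how many samples the rate is averaged over, so naming it makes the choice easier to review.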
        )
        network_speed_stats = self._compute_speed_from_hist(self._network_stats_hist)

        disk_stats = self._get_disk_stats()
There is only a _get_disk_io function?
Oops, thanks... clearly I did not actually run this :)
Can we add some tests?
Yeah let me look into that...
@@ -651,10 +660,20 @@ bool PullManager::TryPinObject(const ObjectID &object_id) {
    if (ref != nullptr) {
      pinned_objects_size_ += ref->GetSize();
      pinned_objects_[object_id] = std::move(ref);
      return false;
Looks like this changes the semantics of TryPinObject. Is it intentional?
Yeah, actually the previous semantics were flipped :)
3e21861 to bfdec73
        )
        network_speed_stats = self._compute_speed_from_hist(self._network_stats_hist)

        disk_stats = self._get_disk_io()
nit: rename to _get_disk_io_stats()?
@@ -165,6 +189,9 @@ def __init__(self, dashboard_agent):
        self._hostname = socket.gethostname()
        self._workers = set()
        self._network_stats_hist = [(0, (0.0, 0.0))]  # time, (sent, recv)
        self._disk_stats_hist = [
nit: _disk_io_stats_hist, since there are other disk-related stats there as well, like disk_usage
auto it = object_pull_requests_.find(obj_id);
RAY_CHECK(it != object_pull_requests_.end());
You can use map_find_or_die(), which gives you a better check failure message.
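To illustrate the idea behind this suggestion, here is a rough Python analogue of a "find or die" lookup: instead of a bare find followed by a generic check, the helper fails with a message that names the missing key. The function `find_or_die` below is hypothetical and is not the signature of the C++ helper mentioned in the comment.

```python
from typing import Dict, Hashable, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


def find_or_die(mapping: Dict[K, V], key: K) -> V:
    """Hypothetical analogue: return mapping[key], or fail with a message naming the key."""
    try:
        return mapping[key]
    except KeyError:
        # Include the key and the map size so the failure is easier to debug.
        raise AssertionError(
            f"Key {key!r} not found in map of size {len(mapping)}"
        ) from None
```

The benefit is in the failure path: the message identifies which lookup failed rather than only reporting that an iterator comparison was false.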
@@ -640,10 +649,20 @@ bool PullManager::TryPinObject(const ObjectID &object_id) {
    if (ref != nullptr) {
It's also useful to record num_failed_pins_
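The two TryPinObject comments in this review (the return-value semantics and the failed-pin counter) can be sketched together in a toy Python model. This is not the PullManager implementation: `PinTracker`, the `pin_object` callback, and the convention that `True` means "the object is pinned after the call" are all assumptions made for illustration, since the thread does not spell out which direction the return value was flipped.

```python
from typing import Callable, Dict, Optional


class PinTracker:
    """Toy stand-in for the pinning bookkeeping; not the actual PullManager."""

    def __init__(self, pin_object: Callable[[str], Optional[bytes]]):
        # Hypothetical callback: returns the object, or None if it cannot be pinned.
        self._pin_object = pin_object
        self.pinned_objects: Dict[str, bytes] = {}
        self.pinned_objects_size = 0
        self.num_failed_pins = 0

    def try_pin_object(self, object_id: str) -> bool:
        """Return True if the object is pinned after this call (assumed convention)."""
        if object_id in self.pinned_objects:
            return True
        ref = self._pin_object(object_id)
        if ref is None:
            # Count failures so they can be exported as a metric.
            self.num_failed_pins += 1
            return False
        self.pinned_objects_size += len(ref)
        self.pinned_objects[object_id] = ref
        return True
```

Documenting the return convention in the function comment would make a flipped return easier to catch in review.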
if (it->second.activate_time_ms > 0) {
  ray::stats::STATS_pull_manager_object_request_time_ms.Record(
      absl::GetCurrentTimeNanos() / 1e3 - it->second.activate_time_ms,
      "MemoryAvailableToPin");
nit: ActivateToPin?
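As a sketch of the measurement being discussed, the snippet below records the elapsed time between a request becoming active and its object being pinned, tagged with the suggested `ActivateToPin` label. `RequestTimer` and `now_ms` are hypothetical names, and the sketch assumes `activate_time_ms` holds milliseconds as its name suggests; converting a nanosecond clock to milliseconds means dividing by 1e6.

```python
import time
from typing import Dict, List, Optional, Tuple


def now_ms() -> float:
    """Current wall-clock time in milliseconds (nanoseconds divided by 1e6)."""
    return time.time_ns() / 1e6


class RequestTimer:
    """Hypothetical recorder for activate-to-pin latency, in milliseconds."""

    def __init__(self) -> None:
        self.activate_time_ms: Dict[str, float] = {}
        self.samples: List[Tuple[float, str]] = []

    def activate(self, object_id: str, at_ms: Optional[float] = None) -> None:
        self.activate_time_ms[object_id] = now_ms() if at_ms is None else at_ms

    def record_pin(self, object_id: str, at_ms: Optional[float] = None) -> Optional[float]:
        """Record and return the elapsed time, or None if the request was never activated."""
        start = self.activate_time_ms.get(object_id, 0.0)
        if start <= 0:
            return None
        end = now_ms() if at_ms is None else at_ms
        elapsed = end - start
        self.samples.append((elapsed, "ActivateToPin"))
        return elapsed
```

Keeping both timestamps in the same unit is what makes the subtraction meaningful, so the conversion is worth isolating in one helper.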
Why are these changes needed?
Adds some metrics useful for object-intensive workloads:
Checks
scripts/format.sh to lint the changes in this PR.