
[core] Add metrics for disk and network I/O #23546

Merged: 11 commits into ray-project:master from spill-metrics on Apr 1, 2022

Conversation

stephanie-wang
Contributor

Why are these changes needed?

Adds some metrics useful for object-intensive workloads:

  • Per raylet/object manager:
    • Number of bytes pending restore in the spill manager
    • Cumulative number of requests in the PullManager
    • Cumulative bytes pushed to and pulled from other nodes
    • Histogram of request latencies in the PullManager:
      • total lifetime of a request, from start to cancel
      • request satisfaction time, from start to object local
      • pull time, from object activation to object local
  • Per-node disk read/write speed and IOPS
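The per-node disk speed and IOPS metrics boil down to deltas between cumulative OS counters. A minimal illustrative sketch, assuming the `psutil` package (function names here are invented, not the PR's actual code):

```python
import time

def sample_disk_io():
    """Take one snapshot of cumulative disk I/O counters.

    Returns (timestamp, read_bytes, write_bytes, read_count, write_count).
    """
    import psutil  # assumed third-party dependency
    io = psutil.disk_io_counters()
    return (time.monotonic(), io.read_bytes, io.write_bytes,
            io.read_count, io.write_count)

def disk_speed(prev, curr):
    """Turn two snapshots into (read_bps, write_bps, read_iops, write_iops)."""
    dt = curr[0] - prev[0]
    if dt <= 0:
        return (0.0, 0.0, 0.0, 0.0)
    # Element-wise rate over the remaining counter fields.
    return tuple((c - p) / dt for p, c in zip(prev[1:], curr[1:]))
```

Sampling periodically and differencing is the standard approach, since the kernel only exposes cumulative totals.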

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -348,19 +361,25 @@ def _get_load_avg(self):
        per_cpu_load = tuple((round(x / self._cpu_counts[0], 2) for x in load))
        return load, per_cpu_load

    @staticmethod
    def _compute_speed_from_hist(hist):
        while len(hist) > 7:
Contributor

why is 7 here?

Contributor Author

Not sure, I just copied what was here already...
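For context, the pattern being copied keeps a short sliding window of (timestamp, (sent, recv)) samples and reports the rate across the window. A hedged reimplementation sketch (not the agent's actual code; the window length 7 is simply carried over from the existing code, not derived):

```python
def compute_speed_from_hist(hist, max_len=7):
    """hist: list of (time, (sent_bytes, recv_bytes)) samples, oldest first.

    Trims the window to max_len samples in place, then returns
    (sent_per_sec, recv_per_sec) averaged across the remaining window.
    """
    while len(hist) > max_len:
        hist.pop(0)  # drop the oldest sample
    if len(hist) < 2:
        return 0.0, 0.0
    t0, (sent0, recv0) = hist[0]
    t1, (sent1, recv1) = hist[-1]
    dt = t1 - t0
    if dt <= 0:
        return 0.0, 0.0
    return (sent1 - sent0) / dt, (recv1 - recv0) / dt
```

Averaging over a small window rather than the last pair of samples smooths out bursty counters.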

        )
        network_speed_stats = self._compute_speed_from_hist(self._network_stats_hist)

        disk_stats = self._get_disk_stats()
Contributor

There is only a _get_disk_io function?

Contributor Author

Oops, thanks... clearly I did not actually run this :)

Contributor

Can we add some tests?

Contributor Author

Yeah let me look into that...
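One way such speed computations can be unit-tested without touching real devices is with synthetic counter samples. A hypothetical sketch (helper and test names invented, not the PR's actual tests):

```python
def bytes_per_sec(prev, curr):
    """Rate between two (timestamp, cumulative_byte_count) samples."""
    dt = curr[0] - prev[0]
    return 0.0 if dt <= 0 else (curr[1] - prev[1]) / dt

def test_bytes_per_sec():
    # Synthetic samples: no real disk or network involved.
    assert bytes_per_sec((0.0, 0), (2.0, 1024)) == 512.0
    assert bytes_per_sec((5.0, 100), (5.0, 200)) == 0.0  # zero elapsed time

test_bytes_per_sec()
```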

@@ -651,10 +660,20 @@ bool PullManager::TryPinObject(const ObjectID &object_id) {
  if (ref != nullptr) {
    pinned_objects_size_ += ref->GetSize();
    pinned_objects_[object_id] = std::move(ref);
    return false;
Contributor

Looks like this changes the semantics of TryPinObject; is that intentional?

Contributor Author

Yeah, actually the previous semantics were flipped :)

        )
        network_speed_stats = self._compute_speed_from_hist(self._network_stats_hist)

        disk_stats = self._get_disk_io()
Contributor

nit: rename to _get_disk_io_stats()?

@@ -165,6 +189,9 @@ def __init__(self, dashboard_agent):
        self._hostname = socket.gethostname()
        self._workers = set()
        self._network_stats_hist = [(0, (0.0, 0.0))]  # time, (sent, recv)
        self._disk_stats_hist = [
Contributor

nit: _disk_io_stats_hist since there are other disk related stats there as well like disk_usage

Comment on lines 174 to 175
auto it = object_pull_requests_.find(obj_id);
RAY_CHECK(it != object_pull_requests_.end());
Contributor

You can use map_find_or_die(), which gives you a better check-failure message.
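The suggested helper is a Ray C++ utility; the idea, transliterated into a hedged Python sketch for illustration, is a map lookup that names the missing key instead of failing a bare check:

```python
def map_find_or_die(mapping, key):
    """Look up key, raising an error that reports the offending key
    rather than a bare assertion failure."""
    try:
        return mapping[key]
    except KeyError:
        raise AssertionError(f"Key not found in map: {key!r}") from None
```

The payoff is in crash logs: a check failure that includes the key is far easier to debug than `it != object_pull_requests_.end()` being false.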

@@ -640,10 +649,20 @@ bool PullManager::TryPinObject(const ObjectID &object_id) {
if (ref != nullptr) {
Contributor

It's also useful to record num_failed_pins_

  if (it->second.activate_time_ms > 0) {
    ray::stats::STATS_pull_manager_object_request_time_ms.Record(
        absl::GetCurrentTimeNanos() / 1e3 - it->second.activate_time_ms,
        "MemoryAvailableToPin");
Contributor

nit: ActivateToPin?
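For reference, the three request-latency phases listed in the PR description reduce to differences between event timestamps. A hedged Python sketch (field names hypothetical; the real code records each phase into a labeled histogram metric):

```python
def request_latencies_ms(start_ms, activate_ms, local_ms, cancel_ms):
    """Derive the three pull-request latency phases from event timestamps:
    total lifetime (start -> cancel), request satisfaction time
    (start -> object local), and pull time (activation -> object local)."""
    return {
        "total_lifetime": cancel_ms - start_ms,
        "satisfaction": local_ms - start_ms,
        "pull": local_ms - activate_ms,
    }
```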

@stephanie-wang stephanie-wang added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 31, 2022
@stephanie-wang stephanie-wang added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Apr 1, 2022
@stephanie-wang
Contributor Author

@scv119 @ericl or @rkooo567 could you approve as a codeowner?

@stephanie-wang stephanie-wang merged commit b43426b into ray-project:master Apr 1, 2022
@stephanie-wang stephanie-wang deleted the spill-metrics branch April 1, 2022 18:15