[RFC] Improved autoscaler log messages #12221

Status: Closed · ericl opened this issue Nov 21, 2020 · 25 comments
Labels: enhancement (Request for new feature and/or capability), fix-error-msg (This issue has a bad error message that should be improved), P1 (Issue that should be fixed within a few weeks), RFC (RFC issues)
Milestone: Serverless Autoscaling

@ericl (Contributor) commented Nov 21, 2020:

The current autoscaler output is quite difficult to interpret due to its verbosity and low-level detail. This is a proposal to clean it up by periodically emitting a summary table like the following:

======== Autoscaler status 2020-11-20 23:14:36,653 ========
Node status
------------------------------------------------------------
Healthy:
 2 p3.2xlarge (2 active)
 20 m4.4xlarge (18 active, 2 idle)

Pending:
 34.5.234.51: m4.4xlarge, launching
 34.5.234.52: m4.4xlarge, launching
 34.5.234.53: m4.4xlarge, waiting for ssh
 34.5.234.54: m4.4xlarge, waiting for ssh
 34.5.234.55: m4.4xlarge, starting ray, /tmp/ray/setup-10.log
 34.5.234.56: m4.4xlarge, setting up, /tmp/ray/setup-11.log
 34.5.234.57: m4.4xlarge, setting up, /tmp/ray/setup-12.log

Recent failures:
 172.24.25.33: m4.4xlarge, /tmp/ray/setup-8.log
 35.4.235.11: p3.2xlarge, /tmp/ray/setup-9.log

Resources
------------------------------------------------------------
Usage:
 530.0/544.0 CPU
 2.0/2.0 GPU
 0.0/2.0 AcceleratorType:V100
 0.0 GiB/1583.19 GiB memory
 0.0 GiB/471.02 GiB object_store_memory

Demands:
 {"CPU": 1}: 150 pending tasks
 [{"CPU": 4} * 5]: 5 pending placement groups
 [{"CPU": 1} * 100]: from request_resources()

Implementation details:

  • The autoscaler should periodically generate a JSON status message that includes the above information (a possible shape is sketched below).
  • We should log the above text summary of the JSON status every 10-30s.
  • Other Ray components, such as the dashboard and ray status, can also access this information.
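
A possible shape for that JSON payload (purely illustrative; the field names here are not a proposed schema):

# Illustrative sketch of the periodic status payload; not a final schema.
import json
import time

status = {
    "time": time.time(),
    "node_status": {
        "healthy": {"p3.2xlarge": {"active": 2}, "m4.4xlarge": {"active": 18, "idle": 2}},
        "pending": [
            {"ip": "34.5.234.51", "node_type": "m4.4xlarge", "state": "launching"},
            {"ip": "34.5.234.56", "node_type": "m4.4xlarge", "state": "setting-up",
             "log": "/tmp/ray/setup-11.log"},
        ],
        "recent_failures": [
            {"ip": "172.24.25.33", "node_type": "m4.4xlarge", "log": "/tmp/ray/setup-8.log"},
        ],
    },
    "resources": {
        "usage": {"CPU": [530.0, 544.0], "GPU": [2.0, 2.0]},
        "demands": [
            {"shape": {"CPU": 1}, "count": 150, "source": "tasks"},
            {"shape": [{"CPU": 4}] * 5, "count": 5, "source": "placement_groups"},
            {"shape": [{"CPU": 1}] * 100, "count": 1, "source": "request_resources"},
        ],
    },
}

# The monitor would log the human-readable summary every 10-30s and expose this
# JSON to the dashboard and `ray status`.
print(json.dumps(status, indent=2))
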
ericl added the enhancement and P1 labels on Nov 21, 2020
ericl added this to the Serverless Autoscaling milestone on Nov 21, 2020
ericl added the fix-error-msg label on Nov 21, 2020
@richardliaw (Contributor) commented:

cc @mkoh-asapp @mattearllongshot

ericl pinned this issue on Nov 21, 2020
@ijrsvt (Contributor) commented Nov 21, 2020:

@ericl Would it be useful to include something about the head_node in this output (like which one it is)?

@rkooo567 (Contributor) commented:

cc @mfitton We should definitely port this to our dashboard.

@markgoodhead (Contributor) commented:

This would be a game-changing feature for autoscaler debugging/visibility - can't wait until this is on the dashboard!

@richardliaw (Contributor) commented:

cc @maximsmol this would be a good thing to use (especially since the plan is to emit json)

@wuisawesome (Contributor) commented:

Hmm now that we're adding these to ray status, I wonder if we should also make some of this information available programmatically... Just a thought.

@richardliaw (Contributor) commented Dec 11, 2020 via email.

@edoakes (Contributor) commented Dec 21, 2020:

FYI we are currently exposing this programmatically in the dashboard (/api/cluster_status), but it currently isn't advertised anywhere.
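
For anyone who wants to poke at that endpoint before it's documented, something like the following should work, assuming the dashboard is reachable at its default port (8265); adjust the host for a remote head node:

# Read the (currently unadvertised) cluster status endpoint from the dashboard.
import requests

resp = requests.get("http://localhost:8265/api/cluster_status")
resp.raise_for_status()
print(resp.json())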

rkooo567 added the RFC label on Dec 27, 2020
@AmeerHajAli (Contributor) commented:

@ericl @wuisawesome can we close this issue?

@wuisawesome (Contributor) commented:

I'd call it 95% done, since we don't have the per-file logging done yet.

DmitriGekhtman self-assigned this on Jan 14, 2021
@DmitriGekhtman (Contributor) commented:

What remains is the per-node logging, and that task has been delegated to me.

Let me just confirm the requirement and a potential way to do it.

Requirement:
We need a log file for each node_id, and each NodeUpdater running inside of the monitor process should log to that file.

Design:
One strategy is to put the requisite logging logic in the NodeUpdater (rough sketch after this list):

 • Give the NodeUpdater a self.logger attribute.
 • Have the NodeUpdater detect, by some means, whether it's running inside the monitor process.
 • If it's not running in the monitor process, set self.logger to the module-level logger defined at the top of the file (logging.getLogger(__name__)).
 • If it is running in the monitor process, make self.logger a custom logger that doesn't belong to the standard logging hierarchy and writes to a node_id-dependent log file.
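
A minimal sketch of that second branch, where make_node_logger, the log directory, and the file naming are placeholders rather than existing Ray code:

# Rough sketch only; the helper name and file naming are assumptions for
# illustration, not the actual NodeUpdater/monitor interface.
import logging
import os

def make_node_logger(node_id: str, log_dir: str) -> logging.Logger:
    node_logger = logging.getLogger(f"ray.autoscaler.node_updater.{node_id}")
    node_logger.propagate = False  # detach from the root handlers so records stay out of monitor.log
    handler = logging.FileHandler(os.path.join(log_dir, f"node_updater_{node_id}.log"))
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    node_logger.addHandler(handler)
    node_logger.setLevel(logging.INFO)
    return node_logger

# In NodeUpdater.__init__, the branch described above would look roughly like:
#   self.logger = make_node_logger(node_id, log_dir) if in_monitor_process \
#       else logging.getLogger(__name__)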

@rkooo567
Does this sound like a remotely sane strategy?

If anyone has a better idea, let me know.

@ericl (Contributor, Author) commented Jan 15, 2021 via email.

@richardliaw (Contributor) commented:

@DmitriGekhtman

IIRC you shouldn't be using logging in NodeUpdater, but rather the custom autoscaler CLI logger. There should be an option for that logger to write stdout/stderr to a specific file upon invocation of a command (at least for non-Kubernetes environments such as Docker and standard).

This logger should be used whether or not we're in the monitor process (rather, the stdout redirect should be toggled).

@DmitriGekhtman (Contributor) commented:

@richardliaw Ah, yep you're right, it uses the CLI logger.
@ericl And yeah, need to figure out how to redirect child command output.

@DmitriGekhtman (Contributor) commented Jan 15, 2021:

cli_logger.print just prints, as far as I can tell
https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/cli_logger.py#L422

@wuisawesome (Contributor) commented:

I'm not sure how you would go about using a logger here, since IIRC we don't capture any output. My assumption is that this would be done by redirecting stdout at the process runner level. Just a thought.
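
Something along these lines, where the wrapper class and its wiring into NodeUpdater are hypothetical:

# Hypothetical sketch of redirecting output at the process-runner level: a
# subprocess-like object whose calls append each command's output to a per-node file.
import subprocess

class FileRedirectingProcessRunner:
    def __init__(self, log_path: str):
        self.log_path = log_path

    def check_call(self, cmd, **kwargs):
        with open(self.log_path, "ab") as f:
            return subprocess.check_call(cmd, stdout=f, stderr=subprocess.STDOUT, **kwargs)

    def check_output(self, cmd, **kwargs):
        output = subprocess.check_output(cmd, stderr=subprocess.STDOUT, **kwargs)
        with open(self.log_path, "ab") as f:
            f.write(output)
        return output

# e.g. hand NodeUpdater a FileRedirectingProcessRunner("/tmp/ray/setup-10.log")
# in place of the subprocess-style runner it is given today.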

@DmitriGekhtman (Contributor) commented Jan 15, 2021:

Hmm, well, right now stdout and stderr go to monitor.out and monitor.err, and monitor.log gets all log records at level >= INFO.

For the purposes of this issue, are the "autoscaler log messages" we're talking about the contents of monitor.log?

@DmitriGekhtman (Contributor) commented:

Eh, anyway, I see: the goal is to redirect all output of commands run by NodeUpdater.cmd_runner to the relevant file.

AmeerHajAli removed their assignment on Jan 24, 2021
@ericl (Contributor, Author) commented Feb 4, 2021:

FYI @clarkzinzow, this might be a good starter issue.

@DmitriGekhtman (Contributor) commented:

^ feel free to change the assignment from me to @clarkzinzow if Clark is interested
cc @AmeerHajAli

@wuisawesome (Contributor) commented:

Minor update: we should refer to failed nodes by their ID, since IP addresses can be reused when failed nodes are terminated.

@ericl (Contributor, Author) commented Feb 11, 2021 via email.

@wuisawesome (Contributor) commented:

OK, we can use IPs, but that puts us at the mercy of the k8s IP allocation policy.

@wuisawesome (Contributor) commented:

@yiranwang52 what should the /tmp/ray/setup-10.log part look like for the k8s operator?

@AmeerHajAli (Contributor) commented:

Closing this since the final piece here is covered in this issue: #13586
