[RFC] Improved autoscaler log messages #12221

Status: Closed · ericl opened this issue Nov 21, 2020 · 25 comments
Labels: enhancement (Request for new feature and/or capability), fix-error-msg (This issue has a bad error message that should be improved), P1 (Issue that should be fixed within a few weeks), RFC (RFC issues)
Milestone: Serverless Autoscaling

@ericl (Contributor) commented Nov 21, 2020:

The current autoscaler output is quite difficult to interpret due to its verbosity and low-level detail. This is a proposal to clean it up by periodically emitting a summary table like the following:

======== Autoscaler status 2020-11-20 23:14:36,653 ========
Node status
------------------------------------------------------------
Healthy:
 2 p3.2xlarge (2 active)
 20 m4.4xlarge (18 active, 2 idle)

Pending:
 34.5.234.51: m4.4xlarge, launching
 34.5.234.52: m4.4xlarge, launching
 34.5.234.53: m4.4xlarge, waiting for ssh
 34.5.234.54: m4.4xlarge, waiting for ssh
 34.5.234.55: m4.4xlarge, starting ray, /tmp/ray/setup-10.log
 34.5.234.56: m4.4xlarge, setting up, /tmp/ray/setup-11.log
 34.5.234.57: m4.4xlarge, setting up, /tmp/ray/setup-12.log

Recent failures:
 172.24.25.33: m4.4xlarge, /tmp/ray/setup-8.log
 35.4.235.11: p3.2xlarge, /tmp/ray/setup-9.log

Resources
------------------------------------------------------------
Usage:
 530.0/544.0 CPU
 2.0/2.0 GPU
 0.0/2.0 AcceleratorType:V100
 0.0 GiB/1583.19 GiB memory
 0.0 GiB/471.02 GiB object_store_memory

Demands:
 {"CPU": 1}: 150 pending tasks
 [{"CPU": 4} * 5]: 5 pending placement groups
 [{"CPU": 1} * 100]: from request_resources()

Implementation details:

  • The autoscaler should periodically generate a JSON status message that includes the above information (a possible shape is sketched below).
  • We should log the above text summary of the JSON status every 10-30s.
  • Other Ray components, such as the dashboard and ray status, can also access this information.
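
A possible shape for that JSON payload (purely illustrative; the field names here are not a proposed schema):

# Illustrative sketch of the periodic status payload; not a final schema.
import json
import time

status = {
    "time": time.time(),
    "node_status": {
        "healthy": {"p3.2xlarge": {"active": 2}, "m4.4xlarge": {"active": 18, "idle": 2}},
        "pending": [
            {"ip": "34.5.234.51", "node_type": "m4.4xlarge", "state": "launching"},
            {"ip": "34.5.234.56", "node_type": "m4.4xlarge", "state": "setting-up",
             "log": "/tmp/ray/setup-11.log"},
        ],
        "recent_failures": [
            {"ip": "172.24.25.33", "node_type": "m4.4xlarge", "log": "/tmp/ray/setup-8.log"},
        ],
    },
    "resources": {
        "usage": {"CPU": [530.0, 544.0], "GPU": [2.0, 2.0]},
        "demands": [
            {"shape": {"CPU": 1}, "count": 150, "source": "tasks"},
            {"shape": [{"CPU": 4}] * 5, "count": 5, "source": "placement_groups"},
            {"shape": [{"CPU": 1}] * 100, "count": 1, "source": "request_resources"},
        ],
    },
}

# The monitor would log the human-readable summary every 10-30s and expose this
# JSON to the dashboard and `ray status`.
print(json.dumps(status, indent=2))
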
ericl added the enhancement and P1 labels on Nov 21, 2020
ericl added this to the Serverless Autoscaling milestone on Nov 21, 2020
ericl added the fix-error-msg label on Nov 21, 2020
@richardliaw (Contributor) commented:

cc @mkoh-asapp @mattearllongshot

ericl pinned this issue on Nov 21, 2020
@ijrsvt (Contributor) commented Nov 21, 2020:

@ericl Would it be useful to include something about the head_node in this output (like which one it is)?

@rkooo567 (Contributor) commented:

cc @mfitton We should definitely port this to our dashboard.

@markgoodhead (Contributor) commented:

This would be a game-changing feature for autoscaler debugging/visibility - can't wait until this is on the dashboard!

@richardliaw (Contributor) commented:

cc @maximsmol this would be a good thing to use (especially since the plan is to emit json)

@wuisawesome (Contributor) commented:

Hmm now that we're adding these to ray status, I wonder if we should also make some of this information available programmatically... Just a thought.

@richardliaw (Contributor) commented Dec 11, 2020 via email.

@edoakes (Contributor) commented Dec 21, 2020:

FYI we are currently exposing this programmatically in the dashboard (/api/cluster_status), but it currently isn't advertised anywhere.
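
For anyone who wants to poke at that endpoint before it's documented, something like the following should work, assuming the dashboard is reachable at its default port (8265); adjust the host for a remote head node:

# Read the (currently unadvertised) cluster status endpoint from the dashboard.
import requests

resp = requests.get("http://localhost:8265/api/cluster_status")
resp.raise_for_status()
print(resp.json())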

rkooo567 added the RFC label on Dec 27, 2020
@AmeerHajAli (Contributor) commented:

@ericl @wuisawesome can we close this issue?

@wuisawesome (Contributor) commented:

I'd call it 95% done, since we don't have the per-file logging done yet.

DmitriGekhtman self-assigned this on Jan 14, 2021
@DmitriGekhtman (Contributor) commented:

What remains is the per-node logging, and that task has been delegated to me.

Let me just confirm the requirement and a potential way to do it.

Requirement:
We need a log file for each node_id, and each NodeUpdater running inside of the monitor process should log to that file.

Design:
One strategy is to put the requisite logging logic in the NodeUpdater (rough sketch after this list):

 • Give the NodeUpdater a self.logger attribute.
 • Have the NodeUpdater detect, by some means, whether it's running inside the monitor process.
 • If it's not running in the monitor process, set self.logger to the module-level logger defined at the top of the file (logging.getLogger(__name__)).
 • If it is running in the monitor process, make self.logger a custom logger that doesn't belong to the standard logging hierarchy and writes to a node_id-dependent log file.
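
A minimal sketch of that second branch, where make_node_logger, the log directory, and the file naming are placeholders rather than existing Ray code:

# Rough sketch only; the helper name and file naming are assumptions for
# illustration, not the actual NodeUpdater/monitor interface.
import logging
import os

def make_node_logger(node_id: str, log_dir: str) -> logging.Logger:
    node_logger = logging.getLogger(f"ray.autoscaler.node_updater.{node_id}")
    node_logger.propagate = False  # detach from the root handlers so records stay out of monitor.log
    handler = logging.FileHandler(os.path.join(log_dir, f"node_updater_{node_id}.log"))
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    node_logger.addHandler(handler)
    node_logger.setLevel(logging.INFO)
    return node_logger

# In NodeUpdater.__init__, the branch described above would look roughly like:
#   self.logger = make_node_logger(node_id, log_dir) if in_monitor_process \
#       else logging.getLogger(__name__)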

@rkooo567
Does this sound like a remotely sane strategy?

If anyone has a better idea, let me know.

@ericl (Contributor, Author) commented Jan 15, 2021 via email.

@richardliaw (Contributor) commented:

@DmitriGekhtman

IIRC you shouldn't be using logging in NodeUpdater, but rather the custom autoscaler CLI logger. There should be an option for that logger to write stdout/stderr to a specific file upon invocation of a command (at least for non-Kubernetes environments such as Docker and standard).

This logger should be used whether or not we're in the monitor process (rather, the stdout redirect should be toggled).

@DmitriGekhtman (Contributor) commented:

@richardliaw Ah, yep you're right, it uses the CLI logger.
@ericl And yeah, need to figure out how to redirect child command output.

@DmitriGekhtman (Contributor) commented Jan 15, 2021:

cli_logger.print just prints, as far as I can tell
https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/cli_logger.py#L422

@wuisawesome (Contributor) commented:

I'm not sure how you would go about using a logger here, since IIRC we don't capture any output. My assumption is that this would be done by redirecting stdout at the process runner level. Just a thought.
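
Something along these lines, where the wrapper class and its wiring into NodeUpdater are hypothetical:

# Hypothetical sketch of redirecting output at the process-runner level: a
# subprocess-like object whose calls append each command's output to a per-node file.
import subprocess

class FileRedirectingProcessRunner:
    def __init__(self, log_path: str):
        self.log_path = log_path

    def check_call(self, cmd, **kwargs):
        with open(self.log_path, "ab") as f:
            return subprocess.check_call(cmd, stdout=f, stderr=subprocess.STDOUT, **kwargs)

    def check_output(self, cmd, **kwargs):
        output = subprocess.check_output(cmd, stderr=subprocess.STDOUT, **kwargs)
        with open(self.log_path, "ab") as f:
            f.write(output)
        return output

# e.g. hand NodeUpdater a FileRedirectingProcessRunner("/tmp/ray/setup-10.log")
# in place of the subprocess-style runner it is given today.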

@DmitriGekhtman (Contributor) commented Jan 15, 2021:

Hmm, well, right now stdout and stderr go to monitor.out and monitor.err, and monitor.log gets all log records at level >= INFO.

For the purposes of this issue, are the "autoscaler log messages" we're talking about the contents of monitor.log?

@DmitriGekhtman (Contributor) commented:

Eh, anyway, I see: the goal is to redirect all output of commands run by NodeUpdater.cmd_runner to the relevant file.

AmeerHajAli removed their assignment on Jan 24, 2021
@ericl (Contributor, Author) commented Feb 4, 2021:

FYI @clarkzinzow, this might be a good starter issue.

@DmitriGekhtman (Contributor) commented:

^ feel free to change the assignment from me to @clarkzinzow if Clark is interested
cc @AmeerHajAli

@wuisawesome (Contributor) commented:

Minor update: we should refer to failed nodes by their ID, since IP addresses can be reused when failed nodes are terminated.

@ericl (Contributor, Author) commented Feb 11, 2021 via email.

@wuisawesome (Contributor) commented:

OK, we can use IPs, but that puts us at the mercy of the k8s IP allocation policy.

@wuisawesome (Contributor) commented:

@yiranwang52 what should the /tmp/ray/setup-10.log part look like for the k8s operator?

@AmeerHajAli (Contributor) commented:

Closing this since the final piece here is covered in this issue: #13586
