Logging

This document explains Ray's logging system and its best practices.

Driver logs

The entry point of a Ray application, the process that calls ray.init(address='auto') or ray.init(), is called the driver. Driver logs are handled in the same way as logs from a normal Python program.

Worker logs

Ray tasks and actors are executed remotely in Ray's worker processes. Ray has special support to improve the visibility of logs produced by workers.

  • By default, all stdout and stderr output from tasks and actors is redirected to the worker log files. Check out Logging directory structure below to learn how Ray's logging directory is structured.
  • By default, the stdout and stderr output that is redirected to the worker log files is also published to the driver. The driver displays logs generated by its tasks and actors on its own stdout and stderr.

Let's look at a code example to see how this works.

import ray
# Initiate a driver.
ray.init()

@ray.remote
def task():
    print("task")

ray.get(task.remote())

You should be able to see the string task printed to your driver's stdout.

When logs are printed, the process ID (pid) and the IP address of the node that executes the task or actor are printed together. Check out the output below.

(pid=45601) task

How to set up loggers

When using Ray, all tasks and actors are executed remotely in Ray's worker processes. Since Python's logging module creates a singleton logger per process, loggers should be configured on a per-task/actor basis.

Note

To stream logs to a driver, they should be flushed to stdout and stderr.

import ray
import logging
# Initiate a driver.
ray.init()

@ray.remote
class Actor:
    def __init__(self):
        # Basic config automatically configures logs to
        # be streamed to stdout and stderr.
        # Set the severity to INFO so that info logs are printed to stdout.
        logging.basicConfig(level=logging.INFO)

    def log(self, msg):
        logging.info(msg)

actor = Actor.remote()
ray.get(actor.log.remote("A log message for an actor."))

@ray.remote
def f(msg):
    logging.basicConfig(level=logging.INFO)
    logging.info(msg)

ray.get(f.remote("A log message for a task"))

(pid=95193) INFO:root:A log message for a task
(pid=95192) INFO:root:A log message for an actor.
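
If you need more control than basicConfig on the root logger, you can also configure a dedicated, named logger inside each actor or task. The following is a minimal sketch; the logger name and the format string are arbitrary choices for illustration, not Ray APIs.

import logging
import ray

# Initiate a driver.
ray.init()

@ray.remote
class Actor:
    def __init__(self):
        # Configure a dedicated, named logger for this actor process.
        # The logger name and format string are illustrative choices.
        self.logger = logging.getLogger("my_actor")
        handler = logging.StreamHandler()  # stderr, which Ray streams to the driver
        handler.setFormatter(logging.Formatter("%(name)s %(levelname)s -- %(message)s"))
        self.logger.addHandler(handler)
        self.logger.setLevel(logging.INFO)

    def log(self, msg):
        self.logger.info(msg)

actor = Actor.remote()
ray.get(actor.log.remote("A log message from a named logger."))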

How to use structured logging

The metadata of tasks or actors can be obtained through Ray's runtime context APIs. Runtime context APIs help you add metadata to your logging messages, making your logs more structured.

import ray
# Initiate a driver.
ray.init()

@ray.remote
def task():
    print(f"task_id: {ray.get_runtime_context().task_id}")

ray.get(task.remote())

(pid=47411) task_id: TaskID(a67dc375e60ddd1affffffffffffffffffffffff01000000)
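
Building on this, you can attach runtime-context metadata to every log record with a standard logging filter. The sketch below assumes the runtime context also exposes a job_id attribute, analogous to the task_id used above; the format fields are arbitrary names chosen for illustration.

import logging
import ray

class RuntimeContextFilter(logging.Filter):
    # Illustrative filter that copies Ray runtime metadata onto each record.
    def filter(self, record):
        ctx = ray.get_runtime_context()
        record.job_id = ctx.job_id    # assumed attribute of the runtime context
        record.task_id = ctx.task_id
        return True

# Initiate a driver.
ray.init()

@ray.remote
def task():
    logging.basicConfig(
        level=logging.INFO,
        format="%(levelname)s job=%(job_id)s task=%(task_id)s %(message)s",
    )
    logging.getLogger().addFilter(RuntimeContextFilter())
    logging.info("A structured log message.")

ray.get(task.remote())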

Logging directory structure

By default, Ray logs are stored in a /tmp/ray/session_*/logs directory.

Note

The default temp directory is /tmp/ray (for Linux and macOS). If you'd like to change the temp directory, you can specify it when ray start or ray.init() is called, as shown below.
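
For example (a sketch; the --temp-dir flag and the private _temp_dir argument are assumed to be available in your Ray version):

# From the command line:
ray start --head --temp-dir=/path/to/my/ray/dir

# Or from Python:
import ray
ray.init(_temp_dir="/path/to/my/ray/dir")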

Each new Ray instance creates a new session directory, identified by a session ID, under the temp directory. The latest session is symlinked to /tmp/ray/session_latest.

Here is the Ray log directory structure. Note that .out files contain logs from stdout and stderr, while .err files contain logs from stderr only. Backward compatibility of the log directory layout is not maintained. A short snippet after the list shows how to inspect these files for the current session.

  • dashboard.log: A log file of a Ray dashboard.
  • dashboard_agent.log: Every Ray node has one dashboard agent. This is a log file of the agent.
  • gcs_server.[out|err]: The GCS server is a stateless server that manages Ray cluster metadata. It runs only on the head node.
  • log_monitor.log: The log monitor is in charge of streaming logs to the driver.
  • monitor.log: Ray's cluster launcher is operated with a monitor process. It also manages the autoscaler.
  • monitor.[out|err]: Stdout and stderr of a cluster launcher.
  • plasma_store.[out|err]: Deprecated.
  • python-core-driver-[worker_id]_[pid].log: Ray drivers consist of a C++ core and a Python/Java frontend. This is the log file generated by the C++ core of a driver.
  • python-core-worker-[worker_id]_[pid].log: Ray workers consist of a C++ core and a Python/Java frontend. This is the log file generated by the C++ core of a worker.
  • raylet.[out|err]: A log file of raylets.
  • redis-shard_[shard_index].[out|err]: Redis shard log files.
  • redis.[out|err]: Redis log files.
  • worker-[worker_id]-[job_id]-[pid].[out|err]: The Python/Java part of Ray drivers and workers. All stdout and stderr from tasks and actors is streamed to these files. Note that job_id is the ID of the driver.
  • io-worker-[worker_id]-[pid].[out|err]: Since Ray 1.3, Ray creates IO workers by default to spill/restore objects to external storage. These are the log files of the IO workers.
  • runtime_env_setup-[job_id].log: Logs from installing runtime environments for a task, actor, or job. This file is only present if a runtime environment is installed.
  • runtime_env_setup-ray_client_server_[port].log: Logs from installing runtime environments for a job when connecting via Ray Client.
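
To inspect the log directory of the current session, you can list the files under the session_latest symlink; a minimal sketch, assuming the default temp directory:

from pathlib import Path

# session_latest is a symlink to the most recent session directory.
log_dir = Path("/tmp/ray/session_latest/logs")
for path in sorted(log_dir.iterdir()):
    print(path.name)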

Log rotation

Ray supports rotation of its log files. Note that not all components currently support log rotation (raylet and Python/Java worker logs are not rotated).

By default, logs are rotated when they reach 512 MB (maxBytes), and up to 5 backup files are kept (backupCount). Indexes are appended to backup file names (e.g., raylet.out.1). If you'd like to change the log rotation configuration, you can do so by setting environment variables. For example,

RAY_ROTATION_MAX_BYTES=1024 ray start --head  # Start a ray instance with maxBytes 1KB.
RAY_ROTATION_BACKUP_COUNT=1 ray start --head  # Start a ray instance with backupCount 1.

Redirecting Ray logs to stderr

By default, Ray logs are written to files under the /tmp/ray/session_*/logs directory. If you wish to redirect all internal Ray logging, as well as your own logging within tasks and actors, to the stderr of the host nodes, set the RAY_LOG_TO_STDERR=1 environment variable on the driver and on all Ray nodes. This is useful if you are using a log aggregator that needs log records to be written to stderr in order to capture them.

Redirecting logging to stderr will also cause a ({component}) prefix, e.g. (raylet), to be added to each of the log record messages.

[2022-01-24 19:42:02,978 I 1829336 1829336] (gcs_server) grpc_server.cc:103: GcsServer server started, listening on port 50009.
[2022-01-24 19:42:06,696 I 1829415 1829415] (raylet) grpc_server.cc:103: ObjectManager server started, listening on port 40545.
2022-01-24 19:42:05,087 INFO (dashboard) dashboard.py:95 -- Setup static dir for dashboard: /mnt/data/workspace/ray/python/ray/dashboard/client/build
2022-01-24 19:42:07,500 INFO (dashboard_agent) agent.py:105 -- Dashboard agent grpc address: 0.0.0.0:49228

This should make it easier to filter the stderr stream of logs down to the component of interest. Note that multi-line log records will not have this component marker at the beginning of each line.

When running a local Ray cluster, this environment variable should be set before starting the local cluster:

os.environ["RAY_LOG_TO_STDERR"] = "1"
ray.init()

When starting a local cluster via the CLI or when starting nodes in a multi-node Ray cluster, this environment variable should be set before starting up each node:

env RAY_LOG_TO_STDERR=1 ray start

If using the Ray cluster launcher, you would specify this environment variable in the Ray start commands:

head_start_ray_commands:
    - ray stop
    - env RAY_LOG_TO_STDERR=1 ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - env RAY_LOG_TO_STDERR=1 ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

When connecting to the cluster, be sure to set the environment variable before connecting:

os.environ["RAY_LOG_TO_STDERR"] = "1"
ray.init(address="auto")