The cluster configuration is defined within a YAML file that will be used by the Cluster Launcher to launch the head node, and by the Autoscaler to launch worker nodes. Once the cluster configuration is defined, you will need to use the Ray CLI <ray-cli> to perform any operations such as starting and stopping the cluster.
cluster_name <cluster-configuration-cluster-name>: str
max_workers <cluster-configuration-max-workers>: int
upscaling_speed <cluster-configuration-upscaling-speed>: float
idle_timeout_minutes <cluster-configuration-idle-timeout-minutes>: int
docker <cluster-configuration-docker>:
    docker <cluster-configuration-docker-type>
provider <cluster-configuration-provider>:
    provider <cluster-configuration-provider-type>
auth <cluster-configuration-auth>:
    auth <cluster-configuration-auth-type>
available_node_types <cluster-configuration-available-node-types>:
    node_types <cluster-configuration-node-types-type>
head_node_type <cluster-configuration-head-node-type>: str
file_mounts <cluster-configuration-file-mounts>:
    file_mounts <cluster-configuration-file-mounts-type>
cluster_synced_files <cluster-configuration-cluster-synced-files>:
    - str
rsync_exclude <cluster-configuration-rsync-exclude>:
    - str
rsync_filter <cluster-configuration-rsync-filter>:
    - str
initialization_commands <cluster-configuration-initialization-commands>:
    - str
setup_commands <cluster-configuration-setup-commands>:
    - str
head_setup_commands <cluster-configuration-head-setup-commands>:
    - str
worker_setup_commands <cluster-configuration-worker-setup-commands>:
    - str
head_start_ray_commands <cluster-configuration-head-start-ray-commands>:
    - str
worker_start_ray_commands <cluster-configuration-worker-start-ray-commands>:
    - str
image <cluster-configuration-image>: str
head_image <cluster-configuration-head-image>: str
worker_image <cluster-configuration-worker-image>: str
container_name <cluster-configuration-container-name>: str
pull_before_run <cluster-configuration-pull-before-run>: bool
run_options <cluster-configuration-run-options>:
    - str
head_run_options <cluster-configuration-head-run-options>:
    - str
worker_run_options <cluster-configuration-worker-run-options>:
    - str
disable_automatic_runtime_detection <cluster-configuration-disable-automatic-runtime-detection>: bool
disable_shm_size_detection <cluster-configuration-disable-shm-size-detection>: bool
AWS
ssh_user <cluster-configuration-ssh-user>: str
ssh_private_key <cluster-configuration-ssh-private-key>: str
Azure
ssh_user <cluster-configuration-ssh-user>: str
ssh_private_key <cluster-configuration-ssh-private-key>: str
ssh_public_key <cluster-configuration-ssh-public-key>: str
GCP
ssh_user <cluster-configuration-ssh-user>: str
ssh_private_key <cluster-configuration-ssh-private-key>: str
AWS
type <cluster-configuration-type>: str
region <cluster-configuration-region>: str
availability_zone <cluster-configuration-availability-zone>: str
cache_stopped_nodes <cluster-configuration-cache-stopped-nodes>: bool
security_group <cluster-configuration-security-group>:
    Security Group <cluster-configuration-security-group-type>
Azure
type <cluster-configuration-type>: str
location <cluster-configuration-location>: str
resource_group <cluster-configuration-resource-group>: str
subscription_id <cluster-configuration-subscription-id>: str
cache_stopped_nodes <cluster-configuration-cache-stopped-nodes>: bool
GCP
type <cluster-configuration-type>: str
region <cluster-configuration-region>: str
availability_zone <cluster-configuration-availability-zone>: str
project_id <cluster-configuration-project-id>: str
cache_stopped_nodes <cluster-configuration-cache-stopped-nodes>: bool
AWS
GroupName <cluster-configuration-group-name>: str
IpPermissions <cluster-configuration-ip-permissions>:
    - IpPermission
The available_node_types object's keys represent the names of the different node types.
Deleting a node type from available_node_types and updating with ray up <ray-up-doc> will cause the autoscaler to scale down all nodes of that type. In particular, changing the key of a node type object will result in removal of nodes corresponding to the old key; nodes with the new key name will then be created according to cluster configuration and Ray resource demands.
<node_type_1_name>:
    node_config <cluster-configuration-node-config>:
        Node config <cluster-configuration-node-config-type>
    resources <cluster-configuration-resources>:
        Resources <cluster-configuration-resources-type>
    min_workers <cluster-configuration-node-min-workers>: int
    max_workers <cluster-configuration-node-max-workers>: int
    worker_setup_commands <cluster-configuration-node-type-worker-setup-commands>:
        - str
    docker <cluster-configuration-node-docker>:
        Node Docker <cluster-configuration-node-docker-type>
<node_type_2_name>:
    ...
...
Cloud-specific configuration for nodes of a given node type.
Modifying the node_config and updating with ray up <ray-up-doc> will cause the autoscaler to scale down all existing nodes of the node type; nodes with the newly applied node_config will then be created according to cluster configuration and Ray resource demands.
AWS
A YAML object which conforms to the EC2 create_instances API in the AWS docs.
Azure
A YAML object as defined in the deployment template whose resources are defined in the Azure docs.
GCP
A YAML object as defined in the GCP docs.
image <cluster-configuration-image>: str
pull_before_run <cluster-configuration-pull-before-run>: bool
worker_run_options <cluster-configuration-worker-run-options>:
    - str
disable_automatic_runtime_detection <cluster-configuration-disable-automatic-runtime-detection>: bool
disable_shm_size_detection <cluster-configuration-disable-shm-size-detection>: bool
CPU <cluster-configuration-CPU>: int
GPU <cluster-configuration-GPU>: int
object_store_memory <cluster-configuration-object-store-memory>: int
memory <cluster-configuration-memory>: int
<custom_resource1>: int
<custom_resource2>: int
...
<path1_on_remote_machine>: str # Path 1 on local machine
<path2_on_remote_machine>: str # Path 2 on local machine
...
The name of the cluster. This is the namespace of the cluster.
- Required: Yes
- Importance: High
- Type: String
- Default: "default"
- Pattern: [a-zA-Z0-9_]+
The maximum number of workers the cluster will have at any given time.
- Required: No
- Importance: High
- Type: Integer
- Default: 2
- Minimum: 0
- Maximum: Unbounded
The number of nodes allowed to be pending as a multiple of the current number of nodes. For example, if set to 1.0, the cluster can grow in size by at most 100% at any time, so if the cluster currently has 20 nodes, at most 20 pending launches are allowed. Note that although the autoscaler will scale down to min_workers (which could be 0), it will always scale up to 5 nodes at a minimum when scaling up.
- Required: No
- Importance: Medium
- Type: Float
- Default: 1.0
- Minimum: 0.0
- Maximum: Unbounded
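As a rough worked example of the arithmetic described above, the fragment below uses illustrative values only (they are assumptions, not recommended settings), and the comments simply restate the rule as this section describes it.

```yaml
# Illustrative values only.
max_workers: 50
# Pending launches are capped at upscaling_speed times the current node count.
# With 20 nodes in the cluster: 2.0 * 20 = 40 pending launches at most.
# With very few nodes the multiple would be small, but the text above states
# that scaling up still goes to at least 5 nodes.
upscaling_speed: 2.0
```

Larger values let the cluster react faster to bursts of demand, at the cost of launching many nodes at once.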
The number of minutes that need to pass before an idle worker node is removed by the Autoscaler.
- Required: No
- Importance: Medium
- Type: Integer
- Default: 5
- Minimum: 0
- Maximum: Unbounded
Configure Ray to run in Docker containers.
- Required: No
- Importance: High
- Type: Docker <cluster-configuration-docker-type>
- Default: {}
In rare cases when Docker is not available on the system by default (e.g., bad AMI), add the following commands to initialization_commands <cluster-configuration-initialization-commands> to install it.
initialization_commands:
- curl -fsSL https://get.docker.com -o get-docker.sh
- sudo sh get-docker.sh
- sudo usermod -aG docker $USER
- sudo systemctl restart docker -f
The cloud provider-specific configuration properties.
- Required: Yes
- Importance: High
- Type: Provider <cluster-configuration-provider-type>
Authentication credentials that Ray will use to launch nodes.
- Required: Yes
- Importance: High
- Type: Auth <cluster-configuration-auth-type>
Tells the autoscaler the allowed node types and the resources they provide. Each node type is identified by a user-specified key.
- Required: No
- Importance: High
- Type: Node types <cluster-configuration-node-types-type>
- Default:
AWS
available_node_types:
    ray.head.default:
        node_config:
            InstanceType: m5.large
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 100
        resources: {"CPU": 2}
    ray.worker.default:
        node_config:
            InstanceType: m5.large
            InstanceMarketOptions:
                MarketType: spot
        resources: {"CPU": 2}
        min_workers: 0
The key for one of the node types in available_node_types <cluster-configuration-available-node-types>. This node type will be used to launch the head node.
If the field head_node_type is changed and an update is executed with ray up <ray-up-doc>, the currently running head node will be considered outdated. The user will receive a prompt asking to confirm scale-down of the outdated head node, and the cluster will restart with a new head node. Changing the node_config <cluster-configuration-node-config> of the node_type <cluster-configuration-node-types-type> with key head_node_type will also result in cluster restart after a user prompt.
- Required: Yes
- Importance: High
- Type: String
- Pattern: [a-zA-Z0-9_]+
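A minimal sketch of how head_node_type points at a key of available_node_types. The node type names head_node and cpu_worker are hypothetical and chosen to fit the pattern above; the instance type is borrowed from the AWS default shown earlier.

```yaml
available_node_types:
    head_node:            # hypothetical node type name
        node_config:
            InstanceType: m5.large
        resources: {"CPU": 2}
    cpu_worker:           # hypothetical node type name
        node_config:
            InstanceType: m5.large
        resources: {"CPU": 2}
        min_workers: 0
# Must match one of the keys under available_node_types exactly.
head_node_type: head_node
```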
The files or directories to copy to the head and worker nodes.
- Required: No
- Importance: High
- Type: File mounts <cluster-configuration-file-mounts-type>
- Default: []
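For illustration, a file_mounts mapping might look like the fragment below. Keys are paths on the remote machines and values are paths on the local machine; all four paths are placeholders, not real locations.

```yaml
file_mounts:
    # "<path on remote machine>": "<path on local machine>"
    "/path1/on/remote/machine": "/path1/on/local/machine"
    "/path2/on/remote/machine": "/path2/on/local/machine"
```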
A list of paths to the files or directories to copy from the head node to the worker nodes. The same path on the head node will be copied to the worker node. This behavior is a subset of the file_mounts behavior, so in the vast majority of cases one should just use file_mounts <cluster-configuration-file-mounts>.
- Required: No
- Importance: Low
- Type: List of String
- Default: []
A list of patterns for files to exclude when running rsync up or rsync down. The filter is applied on the source directory only.
Example for a pattern in the list: **/.git/**.
- Required: No
- Importance: Low
- Type: List of String
- Default: []
A list of patterns for files to exclude when running rsync up or rsync down. The filter is applied on the source directory and recursively through all subdirectories.
Example for a pattern in the list: .gitignore.
- Required: No
- Importance: Low
- Type: List of String
- Default: []
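Putting the two example patterns from the descriptions above into a configuration gives the following sketch.

```yaml
# Applied on the source directory only.
rsync_exclude:
    - "**/.git/**"
# Applied on the source directory and recursively through all subdirectories.
rsync_filter:
    - ".gitignore"
```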
A list of commands that will be run before the setup commands <cluster-configuration-setup-commands>. If Docker is enabled, these commands will run outside the container and before Docker is set up.
- Required: No
- Importance: Medium
- Type: List of String
- Default: []
A list of commands to run to set up nodes. These commands will always run on the head and worker nodes and will be merged with head setup commands <cluster-configuration-head-setup-commands> for head and with worker setup commands <cluster-configuration-worker-setup-commands> for workers.
- Required: No
- Importance: Medium
- Type: List of String
- Default:
AWS
# Default setup_commands:
setup_commands:
- echo 'export PATH="$HOME/anaconda3/envs/tensorflow_p36/bin:$PATH"' >> ~/.bashrc
- pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp36-cp36m-manylinux2014_x86_64.whl
- Setup commands should ideally be idempotent (i.e., can be run multiple times without changing the result); this allows Ray to safely update nodes after they have been created. You can usually make commands idempotent with small modifications, e.g. git clone foo can be rewritten as test -e foo || git clone foo, which checks if the repo is already cloned first.
- Setup commands are run sequentially but separately. For example, if you are using anaconda, you need to run conda activate env && pip install -U ray because splitting the command into two setup commands will not work.
- Ideally, you should avoid using setup_commands by creating a docker image with all the dependencies preinstalled to minimize startup time.
Tip: if you also want to run apt-get commands during setup, add the following list of commands:
setup_commands:
- sudo pkill -9 apt-get || true
- sudo pkill -9 dpkg || true
- sudo dpkg --configure -a
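The notes on idempotence and sequential execution can be combined into one sketch. The repository name foo and the conda environment name env are the placeholders used in the notes, not real names.

```yaml
setup_commands:
    # Idempotent: clones only when the target does not exist yet.
    - test -e foo || git clone foo
    # Each list entry runs separately, so the activation and the install
    # have to share a single entry.
    - conda activate env && pip install -U ray
```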
A list of commands to run to set up the head node. These commands will be merged with the general setup commands <cluster-configuration-setup-commands>.
- Required: No
- Importance: Low
- Type: List of String
- Default: []
A list of commands to run to set up the worker nodes. These commands will be merged with the general setup commands <cluster-configuration-setup-commands>.
- Required: No
- Importance: Low
- Type: List of String
- Default: []
Commands to start Ray on the head node. You don't need to change this.
- Required: No
- Importance: Low
- Type: List of String
- Default:
AWS
head_start_ray_commands:
- ray stop
- ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
Commands to start Ray on worker nodes. You don't need to change this.
- Required: No
- Importance: Low
- Type: List of String
- Default:
AWS
worker_start_ray_commands:
- ray stop
- ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
The default Docker image to pull in the head and worker nodes. This can be overridden by the head_image <cluster-configuration-head-image> and worker_image <cluster-configuration-worker-image> fields. If neither image nor (head_image <cluster-configuration-head-image> and worker_image <cluster-configuration-worker-image>) are specified, Ray will not use Docker.
- Required: Yes (If Docker is in use.)
- Importance: High
- Type: String
The Ray project provides Docker images on DockerHub. The repository includes the following images:
- rayproject/ray-ml:latest-gpu: CUDA support, includes ML dependencies.
- rayproject/ray:latest-gpu: CUDA support, no ML dependencies.
- rayproject/ray-ml:latest: No CUDA support, includes ML dependencies.
- rayproject/ray:latest: No CUDA support, no ML dependencies.
Docker image for the head node to override the default docker image <cluster-configuration-image>.
- Required: No
- Importance: Low
- Type: String
Docker image for the worker nodes to override the default docker image <cluster-configuration-image>.
- Required: No
- Importance: Low
- Type: String
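A small sketch of the override behavior described for image, head_image, and worker_image. The choice of a GPU image for the workers is an assumption made for illustration.

```yaml
docker:
    # Default image for head and worker nodes.
    image: "rayproject/ray:latest"
    # Override for worker nodes only; the head node keeps the default image.
    worker_image: "rayproject/ray:latest-gpu"
    container_name: ray_container
```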
The name to use when starting the Docker container.
- Required: Yes (If Docker is in use.)
- Importance: Low
- Type: String
- Default: ray_container
If enabled, the latest version of image will be pulled when starting Docker. If disabled, docker run will only pull the image if no cached version is present.
- Required: No
- Importance: Medium
- Type: Boolean
- Default: True
The extra options to pass to docker run.
- Required: No
- Importance: Medium
- Type: List of String
- Default: []
The extra options to pass to docker run for head node only.
- Required: No
- Importance: Low
- Type: List of String
- Default: []
The extra options to pass to docker run for worker nodes only.
- Required: No
- Importance: Low
- Type: List of String
- Default: []
If enabled, Ray will not try to use the NVIDIA Container Runtime if GPUs are present.
- Required: No
- Importance: Low
- Type: Boolean
- Default: False
If enabled, Ray will not automatically specify the size of /dev/shm for the started container, and the runtime's default value (64MiB for Docker) will be used. If --shm-size=<> is manually added to run_options, this is automatically set to True, meaning that Ray will defer to the user-provided value.
- Required: No
- Importance: Low
- Type: Boolean
- Default: False
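As a sketch of the shared-memory behavior just described, the fragment below passes an explicit --shm-size through run_options. The 2g value is a hypothetical choice, not a recommendation.

```yaml
docker:
    image: "rayproject/ray:latest"
    container_name: ray_container
    pull_before_run: True
    run_options:
        # With --shm-size supplied here, Ray defers to this value and
        # disable_shm_size_detection is treated as True.
        - --shm-size=2g
```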
The user that Ray will authenticate with when launching new nodes.
- Required: Yes
- Importance: High
- Type: String
AWS
The path to an existing private key for Ray to use. If not configured, Ray will create a new private keypair (default behavior). If configured, the key must be added to the project-wide metadata and KeyName has to be defined in the node configuration <cluster-configuration-node-config>.
- Required: No
- Importance: Low
- Type: String
Azure
The path to an existing private key for Ray to use.
- Required: Yes
- Importance: High
- Type: String
You may use ssh-keygen -t rsa -b 4096 to generate a new SSH keypair.
GCP
The path to an existing private key for Ray to use. If not configured, Ray will create a new private keypair (default behavior). If configured, the key must be added to the project-wide metadata and KeyName has to be defined in the node configuration <cluster-configuration-node-config>.
- Required: No
- Importance: Low
- Type: String
AWS
Not available.
Azure
The path to an existing public key for Ray to use.
- Required: Yes
- Importance: High
- Type: String
GCP
Not available.
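For Azure, where all three auth fields are required, a configuration could look like the sketch below. The user name ubuntu is an assumption about the VM image, and the key paths assume the default output location of ssh-keygen.

```yaml
auth:
    ssh_user: ubuntu                     # assumed login user of the VM image
    ssh_private_key: ~/.ssh/id_rsa       # assumed key location
    ssh_public_key: ~/.ssh/id_rsa.pub    # must be the matching public key
```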
AWS
The cloud service provider. For AWS, this must be set to aws.
- Required: Yes
- Importance: High
- Type: String
Azure
The cloud service provider. For Azure, this must be set to azure.
- Required: Yes
- Importance: High
- Type: String
GCP
The cloud service provider. For GCP, this must be set to gcp.
- Required: Yes
- Importance: High
- Type: String
AWS
The region to use for deployment of the Ray cluster.
- Required: Yes
- Importance: High
- Type: String
- Default: us-west-2
Azure
Not available.
GCP
The region to use for deployment of the Ray cluster.
- Required: Yes
- Importance: High
- Type: String
- Default: us-west1
AWS
A string specifying a comma-separated list of availability zone(s) that nodes may be launched in. Nodes will be launched in the first listed availability zone and will be tried in the following availability zones if launching fails.
- Required: No
- Importance: Low
- Type: String
- Default: us-west-2a,us-west-2b
Azure
Not available.
GCP
A string specifying a comma-separated list of availability zone(s) that nodes may be launched in.
- Required: No
- Importance: Low
- Type: String
- Default: us-west1-a
AWS
Not available.
Azure
The location to use for deployment of the Ray cluster.
- Required: Yes
- Importance: High
- Type: String
- Default: westus2
GCP
Not available.
AWS
Not available.
Azure
The resource group to use for deployment of the Ray cluster.
- Required: Yes
- Importance: High
- Type: String
- Default: ray-cluster
GCP
Not available.
AWS
Not available.
Azure
The subscription ID to use for deployment of the Ray cluster. If not specified, Ray will use the default from the Azure CLI.
- Required: No
- Importance: High
- Type: String
- Default: ""
GCP
Not available.
AWS
Not available.
Azure
Not available.
GCP
The globally unique project ID to use for deployment of the Ray cluster.
- Required: Yes
- Importance: Low
- Type: String
- Default: null
If enabled, nodes will be stopped when the cluster scales down. If disabled, nodes will be terminated instead. Stopped nodes launch faster than terminated nodes.
- Required: No
- Importance: Low
- Type: Boolean
- Default: True
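The AWS provider fields described above can be combined as in the following sketch. The region and availability zones are the documented defaults; turning off cache_stopped_nodes is an illustrative choice.

```yaml
provider:
    type: aws
    region: us-west-2
    # Zones are tried in the order listed.
    availability_zone: us-west-2a,us-west-2b
    # Terminate nodes on scale-down instead of stopping them.
    cache_stopped_nodes: False
```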
AWS
A security group that can be used to specify custom inbound rules.
- Required: No
- Importance: Medium
- Type: Security Group <cluster-configuration-security-group-type>
Azure
Not available.
GCP
Not available.
The name of the security group. This name must be unique within the VPC.
- Required: No
- Importance: Low
- Type: String
- Default: "ray-autoscaler-{cluster-name}"
The inbound rules associated with the security group.
- Required: No
- Importance: Medium
- Type: IpPermission
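A hedged sketch of a custom inbound rule on AWS. The entry under IpPermissions is assumed to follow the field names of the EC2 IpPermission structure; the group name, port, and CIDR range are hypothetical values chosen for illustration.

```yaml
provider:
    type: aws
    region: us-west-2
    security_group:
        GroupName: ray_client_security_group   # hypothetical name
        IpPermissions:
            # Allow inbound TCP on a single port from one address range.
            - FromPort: 10001
              ToPort: 10001
              IpProtocol: TCP
              IpRanges:
                  - CidrIp: 203.0.113.0/24      # documentation-only range
```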
The configuration to be used to launch the nodes on the cloud service provider. Among other things, this will specify the instance type to be launched.
- Required: Yes
- Importance: High
- Type: Node config <cluster-configuration-node-config-type>
The resources that a node type provides, which enables the autoscaler to automatically select the right type of nodes to launch given the resource demands of the application. The resources specified will be automatically passed to the ray start command for the node via an environment variable. If not provided, the Autoscaler can automatically detect them only for AWS/Kubernetes cloud providers. For more information, see also the resource demand scheduler.
- Required: Yes (except for AWS/K8s)
- Importance: High
- Type: Resources <cluster-configuration-resources-type>
- Default: {}
In some cases, adding special nodes without any resources may be desirable. Such nodes can be used as a driver which connects to the cluster to launch jobs. In order to manually add a node to an autoscaled cluster, the ray-cluster-name tag should be set and the ray-node-type tag should be set to unmanaged. Unmanaged nodes can be created by setting the resources to {} and the maximum workers <cluster-configuration-node-max-workers> to 0. The Autoscaler will not attempt to start, stop, or update unmanaged nodes. The user is responsible for properly setting up and cleaning up unmanaged nodes.
The minimum number of workers to maintain for this node type regardless of utilization.
- Required: No
- Importance: High
- Type: Integer
- Default: 0
- Minimum: 0
- Maximum: Unbounded
The maximum number of workers to have in the cluster for this node type regardless of utilization. This takes precedence over minimum workers <cluster-configuration-node-min-workers>. By default, the number of workers of a node type is unbounded, constrained only by the cluster-wide max_workers <cluster-configuration-max-workers>. (Prior to Ray 1.3.0, the default value for this field was 0.)
Note that for nodes of type head_node_type, the default number of max workers is 0.
- Required: No
- Importance: High
- Type: Integer
- Default: cluster-wide max_workers <cluster-configuration-max-workers>
- Minimum: 0
- Maximum: cluster-wide max_workers <cluster-configuration-max-workers>
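To illustrate how the per-type limits interact with the cluster-wide limit, here is a sketch with two hypothetical worker types. The type names, instance types, and counts are assumptions.

```yaml
# Cluster-wide cap on the number of workers.
max_workers: 10
available_node_types:
    small_worker:          # hypothetical node type name
        node_config:
            InstanceType: m5.large
        resources: {"CPU": 2}
        min_workers: 2     # kept running regardless of utilization
        max_workers: 8     # per-type cap
    spot_worker:           # hypothetical node type name
        node_config:
            InstanceType: m5.large
            InstanceMarketOptions:
                MarketType: spot
        resources: {"CPU": 2}
        # No max_workers here: bounded only by the cluster-wide value.
        min_workers: 0
```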
A list of commands to run to set up worker nodes of this type. These commands will replace the general worker setup commands <cluster-configuration-worker-setup-commands> for the node.
- Required: No
- Importance: Low
- Type: List of String
- Default: []
AWS
The number of CPUs made available by this node. If not configured, Autoscaler can automatically detect them only for AWS/Kubernetes cloud providers.
- Required: Yes (except for AWS/K8s)
- Importance: High
- Type: Integer
Azure
The number of CPUs made available by this node.
- Required: Yes
- Importance: High
- Type: Integer
GCP
The number of CPUs made available by this node.
- Required: No
- Importance: High
- Type: Integer
AWS
The number of GPUs made available by this node. If not configured, Autoscaler can automatically detect them only for AWS/Kubernetes cloud providers.
- Required: No
- Importance: Low
- Type: Integer
Azure
The number of GPUs made available by this node.
- Required: No
- Importance: High
- Type: Integer
GCP
The number of GPUs made available by this node.
- Required: No
- Importance: High
- Type: Integer
AWS
The memory in bytes allocated for Python worker heap memory on the node. If not configured, Autoscaler will automatically detect the amount of RAM on the node for AWS/Kubernetes and allocate 70% of it for the heap.
- Required: No
- Importance: Low
- Type: Integer
Azure
The memory in bytes allocated for Python worker heap memory on the node.
- Required: No
- Importance: High
- Type: Integer
GCP
The memory in bytes allocated for Python worker heap memory on the node.
- Required: No
- Importance: High
- Type: Integer
AWS
The memory in bytes allocated for the object store on the node. If not configured, Autoscaler will automatically detect the amount of RAM on the node for AWS/Kubernetes and allocate 30% of it for the object store.
- Required: No
- Importance: Low
- Type: Integer
Azure
The memory in bytes allocated for the object store on the node.
- Required: No
- Importance: High
- Type: Integer
GCP
The memory in bytes allocated for the object store on the node.
- Required: No
- Importance: High
- Type: Integer
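A sketch of a resources block that sets all four built-in fields plus one custom resource. The numbers and the custom resource name are hypothetical; both memory values are given in bytes, as described above.

```yaml
resources:
    CPU: 8
    GPU: 1
    memory: 21474836480               # 20 GiB of Python worker heap memory
    object_store_memory: 8589934592   # 8 GiB for the object store
    custom_accelerator: 2             # hypothetical custom resource
```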
A set of overrides to the top-level Docker <cluster-configuration-docker>
configuration.
- Required: No
- Importance: Low
- Type: docker <cluster-configuration-node-docker-type>
- Default: {}
AWS
../../../python/ray/autoscaler/aws/example-minimal.yaml
Azure
../../../python/ray/autoscaler/azure/example-minimal.yaml
GCP
../../../python/ray/autoscaler/gcp/example-minimal.yaml
AWS
../../../python/ray/autoscaler/aws/example-full.yaml
Azure
../../../python/ray/autoscaler/azure/example-full.yaml
GCP
../../../python/ray/autoscaler/gcp/example-full.yaml
It is possible to use TPU VMs on GCP. Currently, TPU pods (TPUs other than v2-8 and v3-8) are not supported.
Before using a config with TPUs, ensure that the TPU API is enabled for your GCP project.
GCP
../../../python/ray/autoscaler/gcp/tpu.yaml