[ray] worker_start_ray_commands are not executed for private cluster #4902

@neychev

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 16.04
  • Ray installed from (source or binary): pip
  • Ray version: 0.7.0
  • Python version: 3.6.7
  • Exact command to reproduce:

Describe the problem

I am following the private cluster setup instructions, but only the head node starts. A few interesting points:

Source code / logs

cluster_name: tesq_cluster
min_workers: 48
max_workers: 48
initial_workers: 48
provider:
    type: local
    head_ip: ip1
    worker_ips: [ip2, ip3, ip4]
auth:
    ssh_user: tesq
    ssh_private_key: /home/me/.ssh/keys/local_user
file_mounts: {}
setup_commands: []
initialization_commands: []
head_setup_commands: []
worker_setup_commands: []

head_start_ray_commands:
    - source activate py3_prod && ray stop
    - echo 'I am here' >> /home/tesq/new_file.txt
    - source activate py3_prod && ulimit -c unlimited && ray start --head --redis-port=6379
worker_start_ray_commands:
    - echo 'I am there' >> /home/tesq/new_file.txt
    - source activate py3_prod && ray stop
    - echo 'I am there' >> /home/tesq/new_file.txt
    - source activate py3_prod && ray start --redis-address=ip1:6379
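One thing that stands out in the config above is that it requests 48 workers while listing only three `worker_ips`. Whether that mismatch is related to the workers not starting is not established here; the following is only a minimal sketch of a sanity check for it. The helper name `check_worker_counts` is hypothetical, and the dict is a hand-copied subset of the YAML rather than something loaded by Ray.

```python
# Hypothetical helper, not part of Ray: compares the requested worker
# counts against the number of worker IPs listed for the local provider.
def check_worker_counts(config):
    """Return a list of warnings about worker counts vs. listed worker_ips."""
    provider = config.get("provider", {})
    available = len(provider.get("worker_ips", []))
    warnings = []
    for key in ("min_workers", "initial_workers", "max_workers"):
        requested = config.get(key)
        if requested is not None and requested > available:
            warnings.append(
                f"{key}={requested} exceeds the {available} listed worker_ips"
            )
    return warnings


# Relevant fields copied by hand from the cluster config in this issue.
config = {
    "min_workers": 48,
    "max_workers": 48,
    "initial_workers": 48,
    "provider": {
        "type": "local",
        "head_ip": "ip1",
        "worker_ips": ["ip2", "ip3", "ip4"],
    },
}

for warning in check_worker_counts(config):
    print(warning)
```

For this config the check reports all three count fields, since each asks for more workers than there are listed machines.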

After that, only the head node starts, and the file new_file.txt is created only on the head node.
Example output of ray.global_state.client_table():

{'ClientID': 'a7ce937ffcbece9b25a779fa126ba47edef27267',
  'IsInsertion': True,
  'NodeManagerAddress': 'ip1',
  'NodeManagerPort': 45759,
  'ObjectManagerPort': 34107,
  'ObjectStoreSocketName': '/tmp/ray/session_2019-05-30_15-51-46_16481/sockets/plasma_store',
  'RayletSocketName': '/tmp/ray/session_2019-05-30_15-51-46_16481/sockets/raylet',
  'Resources': {'GPU': 3.0, 'CPU': 24.0}},
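To make the "only the head node registered" observation explicit, the client table can be compared against the IPs from the config. This is a rough sketch, not a Ray utility: the function names are hypothetical, it assumes that `IsInsertion: True` marks a node that is currently registered, and it works on a hand-copied entry from the output above rather than calling Ray.

```python
# Hypothetical helpers for inspecting a client table shaped like the
# output above (a list of dicts, one per node).
def live_node_addresses(client_table):
    """Return the sorted addresses of entries flagged as insertions."""
    return sorted(
        entry["NodeManagerAddress"]
        for entry in client_table
        if entry.get("IsInsertion", False)
    )


def missing_nodes(client_table, expected_ips):
    """Return the expected IPs that have no live entry in the table."""
    live = set(live_node_addresses(client_table))
    return [ip for ip in expected_ips if ip not in live]


# Single entry copied from the issue, reduced to the fields used here.
client_table = [
    {
        "ClientID": "a7ce937ffcbece9b25a779fa126ba47edef27267",
        "IsInsertion": True,
        "NodeManagerAddress": "ip1",
        "Resources": {"GPU": 3.0, "CPU": 24.0},
    },
]

# head_ip plus worker_ips from the cluster config.
expected_ips = ["ip1", "ip2", "ip3", "ip4"]

print("registered:", live_node_addresses(client_table))
print("missing:", missing_nodes(client_table, expected_ips))
```

On a live cluster the list would presumably come from `ray.global_state.client_table()` on the head node instead of the literal used here.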

Update:
This seems very similar to issue #3190, but the files monitor.err and monitor.out are empty.

Labels: question
