Global State Available Resources Hangs on Node Removal #2875

@richardliaw

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux 16
  • Ray installed from (source or binary): Source
  • Ray version: master
  • Python version: 3.6.6
  • Exact command to reproduce:

ray.global_state.available_resources() on a small cluster

Describe the problem

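ray.global_state.available_resources() hangs after a node is removed from the cluster. In the session below, the call at In [8] never returns, while ray.global_state.client_table() at In [7] still responds and shows the removed node with an additional 'IsInsertion': False entry.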
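While the hang is being investigated, it can help to confirm it without blocking the interactive session. The following is a minimal sketch, not part of Ray: call_with_timeout is a hypothetical helper that runs any zero-argument callable in a daemon thread and gives up after a deadline. It does not cancel the stuck call; it only reports that the call did not finish in time.

```python
import threading


def call_with_timeout(fn, timeout_s):
    """Run fn() in a daemon thread and wait at most timeout_s seconds.

    Returns (True, result) if fn finished in time, (False, None) otherwise.
    An exception raised by fn is re-raised in the caller's thread.
    The worker thread is NOT cancelled on timeout; it is a daemon thread,
    so it will not keep the interpreter alive at exit.
    """
    box = {}

    def target():
        try:
            box["result"] = fn()
        except BaseException as exc:  # forward any failure to the caller
            box["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)

    if worker.is_alive():
        return False, None
    if "error" in box:
        raise box["error"]
    return True, box["result"]


# Hypothetical usage in the session from this report (requires a running
# Ray cluster, so it is left commented out here):
#
#   finished, resources = call_with_timeout(
#       ray.global_state.available_resources, 10.0)
#   if not finished:
#       print("available_resources() did not return within 10 s")
```

The 10 second deadline is an arbitrary choice for illustration; any value longer than a normal response time on the cluster would do.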

Source code / logs

...
  'NodeManagerAddress': '169.229.49.172',
  'NodeManagerPort': 38033,
  'ObjectManagerPort': 42375,
  'ObjectStoreSocketName': '/tmp/plasma_store65771381',
  'RayletSocketName': '/tmp/raylet8912751',
  'Resources': {'GPU': 0.0, 'CPU': 2.0}},
 {'ClientID': 'ed085bb78046ccbc423fb6587ab1fcbf838514da',
  'IsInsertion': True,
  'NodeManagerAddress': '169.229.49.173',
  'NodeManagerPort': 32821,
  'ObjectManagerPort': 34913,
  'ObjectStoreSocketName': '/tmp/plasma_store60872177',
  'RayletSocketName': '/tmp/raylet24528210',
  'Resources': {'GPU': 0.0, 'CPU': 2.0}}]

In [7]: The node with client ID ed085bb78046ccbc423fb6587ab1fcbf838514da has been marked dead because the monitor has missed too many heartbeats from it.
In [7]: ray.global_state.client_table()
Out[7]:
[{'ClientID': 'c9f0b409ca553018762fadba26de1dffa4338478',
  'IsInsertion': True,
  'NodeManagerAddress': '169.229.49.172',
  'NodeManagerPort': 38033,
  'ObjectManagerPort': 42375,
  'ObjectStoreSocketName': '/tmp/plasma_store65771381',
  'RayletSocketName': '/tmp/raylet8912751',
  'Resources': {'GPU': 0.0, 'CPU': 2.0}},
 {'ClientID': 'ed085bb78046ccbc423fb6587ab1fcbf838514da',
  'IsInsertion': True,
  'NodeManagerAddress': '169.229.49.173',
  'NodeManagerPort': 32821,
  'ObjectManagerPort': 34913,
  'ObjectStoreSocketName': '/tmp/plasma_store60872177',
  'RayletSocketName': '/tmp/raylet24528210',
  'Resources': {'GPU': 0.0, 'CPU': 2.0}},
 {'ClientID': 'ed085bb78046ccbc423fb6587ab1fcbf838514da',
  'IsInsertion': False,
  'NodeManagerAddress': '',
  'NodeManagerPort': 0,
  'ObjectManagerPort': 0,
  'ObjectStoreSocketName': '',
  'RayletSocketName': '',
  'Resources': {}}]

In [8]: ray.global_state.available_resources()
...
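The client table above holds two entries for client ed085bb78046ccbc423fb6587ab1fcbf838514da: the original entry with 'IsInsertion': True and a later entry with 'IsInsertion': False. Any consumer that counts every 'IsInsertion': True entry as a live node would still expect the dead node to be present. Whether that is the cause of this hang is not established in this report; the sketch below only illustrates how the entries shown here can be reduced to the set of live clients. The helper names live_clients and total_resources are hypothetical and are not Ray functions, and the result is the total of the 'Resources' fields, not the currently available resources.

```python
def live_clients(client_table):
    """Return insertion entries whose ClientID has no later removal entry.

    client_table is a list of dicts shaped like the output of
    ray.global_state.client_table() shown in this report.
    """
    removed = {
        entry["ClientID"] for entry in client_table if not entry["IsInsertion"]
    }
    live = {}
    for entry in client_table:
        if entry["IsInsertion"] and entry["ClientID"] not in removed:
            live[entry["ClientID"]] = entry
    return list(live.values())


def total_resources(client_table):
    """Sum the 'Resources' dicts of live clients only."""
    totals = {}
    for entry in live_clients(client_table):
        for name, quantity in entry["Resources"].items():
            totals[name] = totals.get(name, 0.0) + quantity
    return totals


# Abbreviated copy of the table from In [7]; only the fields used above.
example_table = [
    {"ClientID": "c9f0b409ca553018762fadba26de1dffa4338478",
     "IsInsertion": True,
     "Resources": {"GPU": 0.0, "CPU": 2.0}},
    {"ClientID": "ed085bb78046ccbc423fb6587ab1fcbf838514da",
     "IsInsertion": True,
     "Resources": {"GPU": 0.0, "CPU": 2.0}},
    {"ClientID": "ed085bb78046ccbc423fb6587ab1fcbf838514da",
     "IsInsertion": False,
     "Resources": {}},
]
```

With the table from this report, only client c9f0b409ca553018762fadba26de1dffa4338478 is treated as live, so the total is 2 CPUs and 0 GPUs.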

Labels: bug