[Core] Non Unit Instance fractional value fix #39293

Merged
merged 14 commits into from
Sep 26, 2023
Changes from 11 commits
6 changes: 5 additions & 1 deletion doc/source/ray-core/scheduling/resources.rst
@@ -193,4 +193,8 @@ The precision of the fractional resource requirement is 0.0001 so you should avo
.. tip::

Besides resource requirements, you can also specify an environment for a task or actor to run in,
which can include Python packages, local files, environment variables, and more---see :ref:`Runtime Environments <runtime-environments>` for details.
which can include Python packages, local files, environment variables, and more. See :ref:`Runtime Environments <runtime-environments>` for details.

.. note::
Collaborator

Can you move this note above the tip?

Unit resource requirements that are greater than 1,
Collaborator

People usually don't know what unit resources are. We should just list them out, like:

GPU, TPU, and neuron_cores resource requirements that are greater than 1 need to be whole numbers. For example, ``num_gpus=1.5`` is invalid.

Contributor Author

The GPU, TPU, and Neuron Core resources are listed on the next line.

need to be whole numbers. For example, ``num_gpus=1.5`` is invalid. This restriction applies to resources that the scheduler assigns using IDs such as GPU, TPU, and Neuron Core.
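To make the rule in this note concrete, here is a minimal plain-Python sketch of the check it describes. It is an illustration only, not Ray's implementation: the names `UNIT_INSTANCE_RESOURCES` and `validate_resource_request` are hypothetical, and the set contents are an assumption taken from the resources named in this thread.

```python
# Hypothetical set of resources that the scheduler assigns by ID.
# The exact names are an assumption based on the note above.
UNIT_INSTANCE_RESOURCES = {"GPU", "TPU", "neuron_cores"}


def validate_resource_request(name, quantity):
    """Raise ValueError if the request breaks the rule sketched in the note."""
    if quantity < 0:
        raise ValueError("Resource quantities may not be negative.")
    # Whole numbers are always fine; fractions are fine below 1.
    is_fractional = not float(quantity).is_integer()
    if name in UNIT_INSTANCE_RESOURCES and quantity >= 1 and is_fractional:
        raise ValueError(
            f"{name} resource quantities >1 must be whole numbers. "
            f"The specified quantity {quantity} is invalid."
        )


validate_resource_request("GPU", 0.5)     # fraction below 1: accepted
validate_resource_request("GPU", 2)       # whole number: accepted
validate_resource_request("CPU", 1.5)     # not ID-assigned: accepted
validate_resource_request("Custom", 2.7)  # custom, not ID-assigned: accepted
```

Under these assumptions, `validate_resource_request("GPU", 1.5)` raises `ValueError`, which matches the ``num_gpus=1.5`` example in the note.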
8 changes: 6 additions & 2 deletions python/ray/_raylet.pyx
@@ -640,10 +640,14 @@ cdef int prepare_resources(
if value < 0:
raise ValueError("Resource quantities may not be negative.")
if value > 0:
unit_resources = f"{RayConfig.instance().predefined_unit_instance_resources()},\
{RayConfig.instance().custom_unit_instance_resources()}"
Collaborator

Let's split the string on commas, convert the result to a set, and then check membership. Otherwise we may get a false match: e.g., if there is a unit resource called Foo_Bar, the substring check will treat Foo as a unit resource too.
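The mismatch described in this comment can be sketched in plain Python. This is an illustration, not the code that was merged: `parse_unit_resources` is a hypothetical helper, and the two config strings are made-up stand-ins for the values returned by `predefined_unit_instance_resources()` and `custom_unit_instance_resources()`.

```python
def parse_unit_resources(predefined, custom):
    """Split two comma-separated config strings into a set of names."""
    names = f"{predefined},{custom}".split(",")
    # Drop empty entries and surrounding whitespace.
    return {name.strip() for name in names if name.strip()}


# Hypothetical config values, chosen to reproduce the reviewer's example.
predefined = "GPU,TPU"
custom = "Foo_Bar"

joined = f"{predefined},{custom}"
unit_set = parse_unit_resources(predefined, custom)

substring_match = "Foo" in joined   # substring test on the joined string
set_match = "Foo" in unit_set       # exact membership test on the set

print(substring_match, set_match)
```

With these inputs the substring test reports a match for `Foo` while the set membership test does not, and `Foo_Bar` is found by both.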


if (value >= 1 and isinstance(value, float)
and not value.is_integer()):
and not value.is_integer() and str(key) in unit_resources):
raise ValueError(
"Resource quantities >1 must be whole numbers.")
"Unit instance resource (GPU, TPU, Neuron Core) quantities >1 must",
f" be whole numbers. {key} resource with value {value} is invalid.")
Collaborator

Suggested change
"Unit instance resource (GPU, TPU, Neuron Core) quantities >1 must",
f" be whole numbers. {key} resource with value {value} is invalid.")
f"{key} resource quantities >1 must",
f" be whole numbers. The specified quantity {value} is invalid.")

resource_map[0][key.encode("ascii")] = float(value)
return 0
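One detail worth checking in the error message above: as displayed, the first string literal ends with a comma, so `ValueError` would receive two separate arguments rather than one concatenated message. The sketch below shows the difference in plain Python; `key` and `value` are placeholder values chosen for illustration, and whether the comma is intended is not something this thread settles.

```python
key, value = "GPU", 1.5  # placeholder values for illustration

# With a comma between the literals, ValueError gets two arguments.
two_args = ValueError(
    f"{key} resource quantities >1 must",
    f" be whole numbers. The specified quantity {value} is invalid.",
)

# Without the comma, adjacent string literals are concatenated into one.
one_arg = ValueError(
    f"{key} resource quantities >1 must"
    f" be whole numbers. The specified quantity {value} is invalid."
)

print(len(two_args.args))  # number of arguments stored on the exception
print(str(one_arg))        # single, readable message
```

When an exception holds more than one argument, `str()` renders the tuple of arguments, so the single-argument form usually reads better in a traceback.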

4 changes: 4 additions & 0 deletions python/ray/includes/ray_config.pxd
@@ -93,6 +93,10 @@ cdef extern from "ray/common/ray_config.h" nogil:

c_bool enable_autoscaler_v2() const

c_string predefined_unit_instance_resources() const

c_string custom_unit_instance_resources() const

int64_t nums_py_gcs_reconnect_retry() const

int64_t py_gcs_connect_timeout_s() const
23 changes: 15 additions & 8 deletions python/ray/tests/test_advanced_2.py
@@ -147,7 +147,7 @@ def method(self):


def test_fractional_resources(shutdown_only):
ray.init(num_cpus=6, num_gpus=3, resources={"Custom": 1})
ray.init(num_cpus=6, num_gpus=3, resources={"Custom": 3, "TPU": 3})

@ray.remote(num_gpus=0.5)
class Foo1:
@@ -168,7 +168,7 @@ def method(self):
pass

# Create an actor that requires 0.7 of the custom resource.
f1 = Foo2._remote([], {}, resources={"Custom": 0.7})
f1 = Foo2._remote([], {}, resources={"Custom": 2.7})
ray.get(f1.method.remote())
# Make sure that we cannot create an actor that requires 0.7 of the
# custom resource. TODO(rkn): Re-enable this once ray.wait is
@@ -183,18 +183,25 @@ def method(self):

del f1, f3

# Make sure that we get exceptions if we submit tasks that require a
# fractional number of resources greater than 1.
# Non unit resources (e.g. CPU) allow fractional
# number of resources greater than 1.
@ray.remote(num_cpus=1.5, resources={"Custom": 2.5})
def test_frac_cpu():
return True

@ray.remote(num_cpus=1.5)
def test():
assert ray.get(test_frac_cpu.remote())

# Unit instance resources (GPU, TPU, neuron_core) throw exceptions
# for fractional number of resources greater than 1.
@ray.remote(num_gpus=1.5)
def test_frac_gpu():
pass

with pytest.raises(ValueError):
test.remote()
test_frac_gpu.remote()

with pytest.raises(ValueError):
Foo2._remote([], {}, resources={"Custom": 1.5})
Foo2._remote([], {}, resources={"TPU": 2.5})


def test_fractional_memory_round_down(shutdown_only):