Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Non Unit Instance fractional value fix #39293

Merged
merged 14 commits into from
Sep 26, 2023
Merged

Conversation

jonathan-anyscale
Copy link
Contributor

@jonathan-anyscale jonathan-anyscale commented Sep 6, 2023

Why are these changes needed?

There's a bug where Ray should've allow non unit instance resource to have fractional value greater than 1 (unit instance resource currently: GPU, TPU, Neuron Core). This PR is to fix that

Related issue number

Closes #37241

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jonathan-anyscale jonathan-anyscale requested a review from a team as a code owner September 6, 2023 01:23
@jonathan-anyscale jonathan-anyscale changed the title Add small note on documentation for fragmented resource [docs] Add small note on documentation for fragmented resource Sep 6, 2023
Copy link
Contributor

@jjyao jjyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@@ -194,3 +194,6 @@ The precision of the fractional resource requirement is 0.0001 so you should avo

Besides resource requirements, you can also specify an environment for a task or actor to run in,
which can include Python packages, local files, environment variables, and more---see :ref:`Runtime Environments <runtime-environments>` for details.

.. note::
Fractional input must be between 0 to 1. For any value greater than 1, `num_cpus` and `num_gpus` requires a whole number.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Fractional input must be between 0 to 1. For any value greater than 1, `num_cpus` and `num_gpus` requires a whole number.
Fractional resource requirement must be smaller than 1. For any resource requirement greater than 1, it needs to be a whole number (e.g. ``num_cpus=1.5`` is invalid).

Can you move it to be above tip?

Signed-off-by: Jonathan Nitisastro <jonathannitisastro@jonathancn-QC69NQYVVG.local.meter>
@jonathan-anyscale jonathan-anyscale changed the title [docs] Add small note on documentation for fragmented resource [Core] Non Unit Instance fractional value fix Sep 8, 2023
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
@rkooo567 rkooo567 self-assigned this Sep 11, 2023
Copy link
Contributor

@krfricke krfricke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stamp for docs

python/ray/_raylet.pyx Outdated Show resolved Hide resolved
@@ -194,3 +194,6 @@ The precision of the fractional resource requirement is 0.0001 so you should avo

Besides resource requirements, you can also specify an environment for a task or actor to run in,
which can include Python packages, local files, environment variables, and more---see :ref:`Runtime Environments <runtime-environments>` for details.

.. note::
For any unit resource (GPU, TPU, Neuron Core) requirement greater than 1, it needs to be a whole number (e.g. ``num_gpus=1.5`` is invalid).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add more details about what 's the definition of unit resource?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is the internal terminology


@ray.remote(num_cpus=1.5)
def test():
test_frac_cpu.remote()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
test_frac_cpu.remote()
assert ray.get(test_frac_cpu.remote())

and make test_frac_cpu return True (it is better having explicit assert)

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 12, 2023
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
@jonathan-anyscale jonathan-anyscale removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 13, 2023
@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 15, 2023
@rkooo567
Copy link
Contributor

lint failure

Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Copy link
Contributor

@angelinalg angelinalg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope the rewrite still makes sense. Let me know if you have questions.

doc/source/ray-core/scheduling/resources.rst Outdated Show resolved Hide resolved
doc/source/ray-core/scheduling/resources.rst Outdated Show resolved Hide resolved
doc/source/ray-core/scheduling/resources.rst Outdated Show resolved Hide resolved
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: jonathan-anyscale <144177685+jonathan-anyscale@users.noreply.github.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: jonathan-anyscale <144177685+jonathan-anyscale@users.noreply.github.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: jonathan-anyscale <144177685+jonathan-anyscale@users.noreply.github.com>
@jonathan-anyscale jonathan-anyscale removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 15, 2023
@rkooo567
Copy link
Contributor

merge conflict

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 21, 2023
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
@jonathan-anyscale jonathan-anyscale removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 21, 2023
which can include Python packages, local files, environment variables, and more---see :ref:`Runtime Environments <runtime-environments>` for details.
which can include Python packages, local files, environment variables, and more. See :ref:`Runtime Environments <runtime-environments>` for details.

.. note::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move it to be above tip?

which can include Python packages, local files, environment variables, and more. See :ref:`Runtime Environments <runtime-environments>` for details.

.. note::
Unit resource requirements that are greater than 1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

People usually don't know what unit resources are. We should just list them out like

GPU, TPU, neuron_cores resource requirements that are greater than 1, need to be whole numbers. For example, ``num_gpus=1.5`` is invalid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the GPU, TPU and neuron core are written on the next line

Comment on lines 643 to 644
unit_resources = f"{RayConfig.instance().predefined_unit_instance_resources()},\
{RayConfig.instance().custom_unit_instance_resources()}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets split by (,) and convert them to a set and then check. Otherwise, we may mismatch: e.g. we have a unit resource called Foo_Bar and you will think Foo is also unit resource.

Comment on lines 649 to 650
"Unit instance resource (GPU, TPU, Neuron Core) quantities >1 must",
f" be whole numbers. {key} resource with value {value} is invalid.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Unit instance resource (GPU, TPU, Neuron Core) quantities >1 must",
f" be whole numbers. {key} resource with value {value} is invalid.")
f"{key} resource quantities >1 must",
f" be whole numbers. The specified quantity {value} is invalid.")

Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
doc/source/ray-core/scheduling/resources.rst Outdated Show resolved Hide resolved
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: jonathan-anyscale <144177685+jonathan-anyscale@users.noreply.github.com>
@jjyao jjyao merged commit 54901bf into master Sep 26, 2023
82 of 85 checks passed
@jjyao jjyao deleted the fractional_resource_doc branch September 26, 2023 18:26
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
There's a bug where Ray should've allow non unit instance resource to have fractional value greater than 1 (unit instance resource currently: GPU, TPU, Neuron Core). This PR is to fix that

Signed-off-by: Jonathan Nitisastro <jonathannitisastro@jonathancn-QC69NQYVVG.local.meter>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: jonathan-anyscale <144177685+jonathan-anyscale@users.noreply.github.com>
Co-authored-by: Jonathan Nitisastro <jonathannitisastro@jonathancn-QC69NQYVVG.local.meter>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Victor <vctr.y.m@example.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Core] "ValueError" For "num_cpus" worker option == 1.5
5 participants