Conversation

@elliot-barn
Contributor

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

elliot-barn and others added 30 commits June 17, 2025 17:58
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
## Why are these changes needed?

Make `x-request-id` a constant.

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…#53577)

```cpp
task_spec.TaskId().Binary()
```
* `TaskId()` deserializes binary to TaskId.
* `Binary()` serializes `TaskId` to binary.

This PR:
* renames `TaskId` to `GetTaskId`
* adds `GetTaskIdBinary`
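
As a rough illustration of why a direct binary accessor helps, here is a minimal Python analog of the C++ accessors. The `TaskID` and `TaskSpec` classes below are hypothetical stand-ins, not Ray's actual types; the sketch only assumes that the spec stores the ID in binary form.

```python
class TaskID:
    """Hypothetical stand-in for a task ID type."""

    def __init__(self, raw: bytes) -> None:
        self._raw = bytes(raw)

    @classmethod
    def from_binary(cls, raw: bytes) -> "TaskID":
        # Deserialize: build an ID object from its binary form.
        return cls(raw)

    def binary(self) -> bytes:
        # Serialize: return the binary form of the ID.
        return self._raw


class TaskSpec:
    """Hypothetical stand-in for a spec that stores the ID as bytes."""

    def __init__(self, task_id_binary: bytes) -> None:
        self._task_id_binary = task_id_binary

    def get_task_id(self) -> TaskID:
        # Analog of GetTaskId(): deserializes on every call.
        return TaskID.from_binary(self._task_id_binary)

    def get_task_id_binary(self) -> bytes:
        # Analog of GetTaskIdBinary(): returns the stored bytes directly.
        return self._task_id_binary


spec = TaskSpec(b"\x01\x02\x03\x04")
round_trip = spec.get_task_id().binary()  # deserialize, then serialize again
direct = spec.get_task_id_binary()        # no intermediate object
```

Both calls yield the same bytes; the direct accessor simply skips the intermediate object.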

---------

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…53656)

This is purely refactoring with no new logic:

- Rename `GrpcStubManager` to `GrpcClientManager`
- Turn `GrpcClientManager` into an abstract class with an implementation

---------

Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
`test_runtime_env_*` large -> medium

Move a few others.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Fix the map_batches release test by plumbing the `back_to_back` option
through to the release tests.

---------

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
allowing people to check in byod scripts

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
we upgraded the machine and releases

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Saw this flake on
[postmerge](https://buildkite.com/ray-project/postmerge/builds/10702#01975572-3a26-4e0b-b27d-0869ac5830fe/177-1191).

Cleaned up the test in general:

- Remove sleep conditions.
- Remove use of direct gRPC connection to raylet.
- Remove runtime_env test that was redundant with
`test_runtime_env_env_vars.py`.

Runtime decreased by ~50% locally, from `62.98s` to `33.48s`.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
These have been deprecated/ignored for a long time and are polluting the
help string.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
… Data (#53220)

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
The option is only used for `manylinux1`. Ray no longer uses manylinux1 to build things.

---------

Signed-off-by: Gagandeep Singh <gdp.1807@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
)

Signed-off-by: hipudding <huafengchun@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Adding back the "Run on Anyscale" button after an Anyscale PR was merged to
show the button on Ray docs but not on Anyscale template previews.

Signed-off-by: Chris Zhang <chris@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…fig (#53681)

The schema of the compute config that the KubeRay service takes in is
currently a bit different from the schema of cluster compute in release
tests. This is a helper function that converts the cluster compute into
the KubeRay compute config that eventually gets sent to the KubeRay
service.
---------

Signed-off-by: kevin <kevin@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…private to _common (#53652)

Fixes #53478

---------

Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Not sure of the root cause, but it seems like a race condition between
the cancel and the attempt to access the Ray task result causes a
`RayTaskCancelled` exception.

Used the repro script in the ticket to confirm that the issue is
resolved (#53639).

---------

Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
- Add prefix `rayproject/ray` for all tags
- Authorize docker with credentials from SSM
- Mock authorize docker in unit test since it's not needed

---------

Signed-off-by: kevin <kevin@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
kevin85421 and others added 28 commits June 17, 2025 17:58
Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
I made the workload more stressful in
#53803 by fetching all of the
results concurrently. That seems to have caused Windows to time out:
https://buildkite.com/ray-project/postmerge/builds/10876#01977755-755b-485e-bd13-f0ea3e33cc36/158-818

I won't pretend to fully understand why, but reverting to the old
pattern in an attempt to fix it.

Also added an explicit wait for the dir to drain because there was an
error during cleanup caused by it.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Deflakes the test by ensuring that results come in out of order for the
`imap_unordered` tests instead of relying on random `time.sleep`s.

In the process of rewriting the tests, I discovered a bug in the
implementation. We weren't re-fetching available object refs from the
queue once we had gotten at least one, which caused the tests to hang.
Added a timeout to the `ray.wait` call so we continually check for new
object refs to add to the batch.
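
The polling pattern described above can be sketched with the standard library. This is only an analog under stated assumptions: `concurrent.futures.wait` stands in for `ray.wait`, futures stand in for object refs, and `iter_unordered` is a hypothetical helper, not the actual implementation.

```python
import queue
import threading
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def iter_unordered(ref_queue: queue.Queue, total: int, poll_timeout_s: float = 0.05):
    """Yield results in completion order, re-checking the queue for new futures."""
    pending = set()
    yielded = 0
    while yielded < total:
        # Drain everything queued since the last iteration into the batch.
        while True:
            try:
                pending.add(ref_queue.get_nowait())
            except queue.Empty:
                break
        if not pending:
            time.sleep(poll_timeout_s)
            continue
        # The timeout bounds how long we block before looking at the queue
        # again; without it, a slow future would hide newly queued ones.
        done, pending = wait(pending, timeout=poll_timeout_s, return_when=FIRST_COMPLETED)
        for fut in done:
            yielded += 1
            yield fut.result()


release_slow = threading.Event()


def slow() -> str:
    # Blocks until the consumer has received its first result.
    release_slow.wait(timeout=5)
    return "slow"


results = []
with ThreadPoolExecutor(max_workers=2) as pool:
    refs = queue.Queue()
    refs.put(pool.submit(slow))
    # The fast task is queued only after the consumer is already waiting.
    threading.Timer(0.05, lambda: refs.put(pool.submit(lambda: "fast"))).start()
    for value in iter_unordered(refs, total=2):
        results.append(value)
        release_slow.set()

print(results)  # ['fast', 'slow']
```

Because the slow task cannot finish until the first result has been consumed, the loop only makes progress if it notices the later-queued future, which is what the timeout allows.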

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
)

## Why are these changes needed?

In `CoreWorker::Exit()`, the code calls:

https://github.com/ray-project/ray/blob/c54437c42fa138580a0367f813b8c4bd9ca0b3e8/src/ray/core_worker/core_worker.cc#L1169

This tells the raylet to immediately release all resources that were
allocated to this worker. However, the worker may still have tasks
running at this point in the exit sequence.

I have added a test to reproduce the issue. However, the test reproduces
the issue only 2 out of 5 times locally. In the test:

0. Ray init with only 1 CPU.
1. Start task1 and wait for it to signal it's running
2. Submit task2 (should be queued)
3. Try to wait for task2 to start with 1-second timeout:
    - No timeout: Bug! Task2 started immediately --> cleanup and fail
    - Timeout occurs: Correct! Task2 is queued --> continue
4. Complete task1 and assert expected result
5. Wait for task2 to start (should happen immediately now)
6. Complete task2 and assert expected result

With the fix, the test always passes (no oversubscription detected).
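
The shape of the test steps can be sketched with the standard library. This is an analog, not the actual Ray test: a thread pool with one worker stands in for a cluster with 1 CPU, all names are hypothetical, and the 1-second timeout is shortened to keep the sketch fast.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

task1_started = threading.Event()
task1_release = threading.Event()
task2_started = threading.Event()


def task1() -> int:
    task1_started.set()
    task1_release.wait(timeout=5)  # hold the only slot until released
    return 1


def task2() -> int:
    task2_started.set()
    return 2


# Step 0: a single worker plays the role of a single CPU.
with ThreadPoolExecutor(max_workers=1) as pool:
    # Step 1: start task1 and wait for it to signal that it is running.
    future1 = pool.submit(task1)
    assert task1_started.wait(timeout=5)

    # Step 2: submit task2, which should be queued.
    future2 = pool.submit(task2)

    # Step 3: wait briefly for task2 to start. No timeout would mean the
    # slot was handed out twice; a timeout is the correct outcome.
    started_early = task2_started.wait(timeout=0.2)

    # Step 4: complete task1.
    task1_release.set()
    result1 = future1.result(timeout=5)

    # Steps 5 and 6: task2 can now start and complete.
    assert task2_started.wait(timeout=5)
    result2 = future2.result(timeout=5)

print(started_early, result1, result2)  # False 1 2
```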

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

The example still uses Ray 2.0. It is pretty old, and KubeRay already
has other similar examples. Remove it from the doc.

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
If we pass GPU object refs within the same actor, NCCL send/recv will
block indefinitely and the transfer is also unnecessary. This PR allows
intra-process communication to retrieve tensors directly from the
in-process actor store.

Example:

```python
small_tensor = torch.randn((1,))

# Intra-actor communication for pure GPU tensors
ref = actor.echo.remote(small_tensor)
result = actor.double.remote(ref)
assert ray.get(result) == pytest.approx(small_tensor * 2)
```

## Related issue number

Closes #51685

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…S FT guide (#53832)

Signed-off-by: Rueian <rueiancsie@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Release tests seem to be failing because of a merge conflict on master
after #53390 was merged, where a param is missing.

Signed-off-by: kevin <kevin@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
The example still uses Ray 2.2. It is pretty old, and KubeRay already
has other similar examples (ex:
https://docs.ray.io/en/latest/cluster/kubernetes/examples/mnist-training-example.html#kuberay-mnist-training-example).
Remove it from the doc.

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…ead (#53844)

https://docs.ray.io/en/master/cluster/kubernetes/examples/rayserve-llm-example.html#kuberay-rayservice-llm-example

Currently, users can use Ray Serve LLM to run LLM serving workloads.
Remove the old vLLM / RayService example guide.

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…efill disagg (#53821)

Signed-off-by: kouroshhakha <kourosh@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

## Why are these changes needed?

<!-- Please give a short summary of the change and the problem this
solves. -->

`read_text` currently treats a trailing newline as an empty line, which
can lead to unexpected results. This isn’t consistent with how standard
Python methods like `str.splitlines()` behave. Since trailing newlines
are common in text files, this PR updates the behavior to ignore them.
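
The difference is easy to see with plain Python strings. This is only an illustration of the two behaviors, not the `read_text` implementation.

```python
text = "alpha\nbeta\n"  # file contents ending with a trailing newline

# Splitting on "\n" yields a spurious empty final entry.
naive_lines = text.split("\n")

# str.splitlines() produces no entry for the trailing newline.
std_lines = text.splitlines()

print(naive_lines)  # ['alpha', 'beta', '']
print(std_lines)    # ['alpha', 'beta']
```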

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

## Why are these changes needed?

It is not safe to use the current process's cmdline to determine the
`uv run` command-line args. For example, multiprocessing spawn garbles
the command line, so our current way of detecting the `uv run` command
line doesn't work if the process that runs `ray.init` was created with
multiprocessing spawn, as is the case in some vLLM settings. The command
line might also have been modified by setproctitle.

This PR changes the extraction of the `uv run` command line to a more
robust way that doesn't involve any state in the current process, only
the uv command line of the uv shim process, and also adds a regression
test.

It also makes it possible to run `uv run -m <module>` and adds a
regression test for that as well.

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Co-authored-by: pcmoritz <pcmoritz@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

## Why are these changes needed?

<!-- Please give a short summary of the change and the problem this
solves. -->

This change standardizes null handling across internal block formats by
converting pd.NA to None. Since different formats (e.g., Pandas and
Arrow) use different null representations, this ensures consistent
semantics when converting between them. In particular, we treat pd.NA as
a null value, while preserving np.nan as a distinct floating-point NaN.

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…GPU objects (#53720)

Adds integration between the single-controller collective APIs
introduced in #53319 and the GPU objects feature prototyped in #52938.
Actor collectives created through
`ray.experimental.collective.create_collective_group` will now be
automatically used if a task declares a tensor transport other than the
default OBJECT_STORE. This also adds support for allocating the torch
tensors on the correct device (GPU for NCCL and CPU for GLOO).

See updates in test_gpu_objects.py for examples.

---------

Signed-off-by: Stephanie wang <smwang@cs.washington.edu>
Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…l instead (#53822)

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…er process (#53815)

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

## Why are these changes needed?

<!-- Please give a short summary of the change and the problem this
solves. -->

If a map transform function produces outputs with different columns
(e.g., you're reading a JSONL file where lines contain different keys),
then Ray Data errors.

This PR updates the implementation to make it more robust to variation
in column names.
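
One way to picture the problem is rows whose keys differ. The sketch below is a hypothetical helper on plain dicts and does not reflect how Ray Data handles this internally; it only shows one reasonable way to reconcile differing columns, by filling missing values with `None`.

```python
from typing import Any


def unify_rows(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Give every row the same columns, filling missing values with None."""
    # Collect column names in first-seen order so the output is deterministic.
    columns: list[str] = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    return [{col: row.get(col) for col in columns} for row in rows]


# Two parsed JSONL lines that contain different keys.
rows = [{"id": 1, "name": "a"}, {"id": 2, "score": 0.5}]
unified = unify_rows(rows)

print(unified)
# [{'id': 1, 'name': 'a', 'score': None}, {'id': 2, 'name': None, 'score': 0.5}]
```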

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
….values`. (#53514)

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
#53871

* CI error:

![image](https://github.com/user-attachments/assets/9c4d533d-9d09-4574-ae3d-42ef03c3be51)

Closes #53871

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
These are no longer flaky 🙌

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
The test_all_to_all.py is taking a really long time to finish ([35
minutes](https://buildkite.com/ray-project/postmerge/builds/10883/steps/canvas?jid=0197797e-9c25-4030-a1a6-ff89dba44f8e#0197797e-9c25-4030-a1a6-ff89dba44f8e/176-1023)).
I'm breaking this into smaller chunks so they run in parallel.

Test:

CI

Signed-off-by: can <can@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

## Why are these changes needed?

Deprecate `use_polars` flag in favor of `use_polars_sort`
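
A common way to deprecate a flag in favor of a new one is to keep the old name as a thin alias that warns. The sketch below assumes that pattern; `SortConfig` is a hypothetical stand-in, not the class where the real flags live.

```python
import warnings
from dataclasses import dataclass

_DEPRECATION_MSG = "`use_polars` is deprecated; use `use_polars_sort` instead."


@dataclass
class SortConfig:
    """Hypothetical config object holding the new flag."""

    use_polars_sort: bool = False

    @property
    def use_polars(self) -> bool:
        # The old name still works for reads, but warns and forwards.
        warnings.warn(_DEPRECATION_MSG, DeprecationWarning, stacklevel=2)
        return self.use_polars_sort

    @use_polars.setter
    def use_polars(self, value: bool) -> None:
        # Writes through the old name update the new flag.
        warnings.warn(_DEPRECATION_MSG, DeprecationWarning, stacklevel=2)
        self.use_polars_sort = value
```

Existing callers that set `config.use_polars = True` keep working while being nudged toward the new name.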

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Goutam V <goutam@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
… for data workloads (#53857)


## Why are these changes needed?

Logs a warning if the object store is given less than 50% of memory for
Data workloads, to help minimize disk spilling.
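
The check itself amounts to a threshold comparison. The sketch below is a minimal illustration with hypothetical names, and it assumes the 50% is measured against the total memory available to the node; the actual check may use a different baseline.

```python
import warnings

# Threshold from the description above; the constant name is hypothetical.
MIN_OBJECT_STORE_FRACTION = 0.5


def warn_if_object_store_small(object_store_bytes: int, total_bytes: int) -> bool:
    """Warn and return True if the object store share is below the threshold."""
    fraction = object_store_bytes / total_bytes
    if fraction < MIN_OBJECT_STORE_FRACTION:
        warnings.warn(
            f"Object store is configured with {fraction:.0%} of memory, "
            f"below the recommended {MIN_OBJECT_STORE_FRACTION:.0%}; "
            "data workloads may spill to disk.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False
```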

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Goutam V <goutam@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
@elliot-barn elliot-barn force-pushed the elliot-barn/python-dep-sets branch from 9f9150a to 1a674ac Compare June 18, 2025 00:59