[Ray CI] Error: Observed wheel commit () is not expected commit (63783...). Aborting #32156

Closed
cadedaniel opened this issue Feb 1, 2023 · 4 comments
Assignees
Labels
P1 Issue that should be fixed within a few weeks testing topics about testing

Comments

@cadedaniel
Member

Some PRs suffer from this flaky failure.

@cadedaniel cadedaniel added P1 Issue that should be fixed within a few weeks testing topics about testing labels Feb 1, 2023
@cadedaniel
Member Author

cc @rkooo567 to link his most recent failure

@krfricke
Contributor

krfricke commented Feb 1, 2023

I've seen this failure in different jobs. It's flaky; a retry usually works. It comes up in any job that calls `LINUX_WHEELS=1 ./ci/ci.sh build`, since that builds wheels and checks the commit hash.

I think it came up more often when we upgraded our manylinux build container here: 0c8b59d#diff-a2fce6a33e7a666e89ec201eb5eb823cf5c5c411ace62008781a9adc90cc0adaR461

So far I couldn't reproduce it on repro instances, likely because it doesn't come up very often...

@cadedaniel
Member Author

@can-anyscale can-anyscale self-assigned this Mar 19, 2023
jjyao pushed a commit that referenced this issue Mar 20, 2023
Speed up the wheel commit validation check by 100x. This should also alleviate, if not eliminate, the 'Observed wheel commit () is not expected' issue (#32156) that has been creeping into many of the CI/CD builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). Pipes, however, have a buffer limit that by default is on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we are looking for might not be there. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.

Signed-off-by: Cuong Nguyen <can@anyscale.com>
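
The idea in the commit message above (read only the one small file that carries the commit, instead of streaming the whole archive) can be sketched in Python. This is a minimal illustration, not the actual CI script: the function name `read_wheel_commit`, the default member path `ray/__init__.py`, and the `__commit__ = "..."` line format are all assumptions made for the example.

```python
import re
import zipfile
from typing import Optional

# Assumed line format: __commit__ = "<sha>" (single or double quotes).
_COMMIT_RE = re.compile(r"""^__commit__\s*=\s*['"]([^'"]*)['"]""", re.MULTILINE)


def read_wheel_commit(wheel_path: str, member: str = "ray/__init__.py") -> Optional[str]:
    """Return the commit recorded in one member of a wheel, or None if absent.

    Only the named member is decompressed, so the size of the rest of the
    archive does not matter. The member path is a hypothetical default.
    """
    with zipfile.ZipFile(wheel_path) as wheel:
        try:
            text = wheel.read(member).decode("utf-8")
        except KeyError:
            # ZipFile.read raises KeyError when the member is not in the archive.
            return None
    match = _COMMIT_RE.search(text)
    return match.group(1) if match else None
```

A caller would compare the returned value with the expected commit and treat `None` or an empty string as a failed check, which is roughly what passing an explicit member name to `unzip -p` achieves in a shell script.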
can-anyscale added a commit that referenced this issue Mar 20, 2023
* Fix the 'Observed wheel commit () is not expected' issue (#32156) that has been creeping into many of the CI/CD builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). Pipes, however, have a buffer limit that by default is on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we are looking for might not be there. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.

You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added two years ago by our veteran Kai (234b015) to catch stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add a sample GCE test

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Disable TEST_ATTR_REGEX_FILTERS for testing

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
xwjiang2010 pushed a commit that referenced this issue Mar 21, 2023
* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of ci/cd builds in our pipeline.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* PR 31978 replaced result_output_json and metrics_output_json with fixed values but did not update client_runner.
As a result, GCE tests that use client_runner are failing with the following error. The simple fix is to reuse the fixed global values.

> [Errno 2] No such file or directory: '/tmp/tmpr33nmui3'
> Traceback (most recent call last):
>  File "/tmp/release-HS2M44AnMX/release/ray_release/command_runner/client_runner.py", line 122, in _fetch_json
>    with open(path, "rt") as fp:
> FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpr33nmui3'

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* remove tempfile import, not used

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
can-anyscale added a commit that referenced this issue Mar 21, 2023
* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of ci/cd builds in our pipeline.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add a sample GCE test

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Disable TEST_ATTR_REGEX_FILTERS for testing

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Make tests running on staging v2

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
can-anyscale added a commit that referenced this issue Mar 21, 2023
* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of ci/cd builds in our pipeline.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add a sample GCE test

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Disable TEST_ATTR_REGEX_FILTERS for testing

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Make tests running on staging v2

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix compute configs

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
can-anyscale added a commit that referenced this issue Mar 21, 2023
* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of ci/cd builds in our pipeline.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add a sample GCE test

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Disable TEST_ATTR_REGEX_FILTERS for testing

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Make tests running on staging v2

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix compute configs

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Use n2-standard-8 machines

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
cadedaniel pushed a commit to cadedaniel/ray that referenced this issue Mar 22, 2023
Signed-off-by: Cuong Nguyen <can@anyscale.com>
cadedaniel pushed a commit to cadedaniel/ray that referenced this issue Mar 22, 2023
Signed-off-by: Cuong Nguyen <can@anyscale.com>
edoakes pushed a commit to edoakes/ray that referenced this issue Mar 22, 2023
Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
edoakes pushed a commit to edoakes/ray that referenced this issue Mar 22, 2023
Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
clarng pushed a commit to clarng/ray that referenced this issue Mar 23, 2023
Signed-off-by: Cuong Nguyen <can@anyscale.com>
clarng pushed a commit to clarng/ray that referenced this issue Mar 23, 2023
Signed-off-by: Cuong Nguyen <can@anyscale.com>
@can-anyscale
Collaborator

I used to see this once every few builds in [OSS CI build: branch], but not anymore. Closing this; if it happens again, feel free to reopen and assign it to me. Thanks.

scottsun94 pushed a commit to scottsun94/ray that referenced this issue Mar 28, 2023
Signed-off-by: Cuong Nguyen <can@anyscale.com>
scottsun94 pushed a commit to scottsun94/ray that referenced this issue Mar 28, 2023
Signed-off-by: Cuong Nguyen <can@anyscale.com>
cassidylaidlaw pushed a commit to cassidylaidlaw/ray that referenced this issue Mar 28, 2023
Signed-off-by: Cuong Nguyen <can@anyscale.com>
cassidylaidlaw pushed a commit to cassidylaidlaw/ray that referenced this issue Mar 28, 2023
Signed-off-by: Cuong Nguyen <can@anyscale.com>
can-anyscale added a commit that referenced this issue Mar 30, 2023
* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of ci/cd builds in our pipeline.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add new lines to some files

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support test definition with multiple flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Use not in to check key in dict

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 2

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 03

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove temporary logs

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Update flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Only initialize gs client on gs host

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Update image for Sematic integration (#33469)

* [RLlib] fix preprocessor test (#33719)

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)

This makes Java logging consistent with PR #31772, which seems to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.

Co-authored-by: Qing Wang <kingchin1218@126.com>

* [serve] Fix serve HA test (#33699)

#33597 changed the log statements for adding a replica to a deployment. The assert statement in test_ray_server_basic checks for the exact log statement - we need to update that assert statement.

* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)

This reverts commit cb5bb0e.

* [tune] add data to CI test dependencies (#33729)

1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement for `":octopus: Tune tests and examples (medium)"`.
2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done before (1) was merged.
3. #33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was also done before (1) was merged.

**Note:** There should probably be a better way for handling dependencies in CI tests...

* [Test] Fix test event test timeout (#33704)

* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)

This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.

* [runtime env] Close schema after loading and continue on error (#33535)

This PR fixes a few things:

* A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
* Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
    * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.


**Steps to Reproduce**
1. Save this script as `test.py`
```python
import ray

@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True

ray.init()
print(ray.get(my_fn.remote()))
```
2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3.
    a. save `:` or other invalid JSON as `bad-json.json`
    b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`
 
This PR fixes the issue and adds a new test case.
Signed-off-by: James Clark <james.clark@zapatacomputing.com>
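
The two fixes described above (close the file, and keep going when one schema file is missing or malformed) follow a common pattern that can be sketched as below. This is an illustration under assumptions, not Ray's actual loader: the function name `load_schemas` and the use of a `title` field as the dictionary key are hypothetical.

```python
import json
import logging
from typing import Dict, List

logger = logging.getLogger(__name__)


def load_schemas(paths: List[str]) -> Dict[str, dict]:
    """Load JSON schema files, skipping any that are missing or malformed."""
    schemas: Dict[str, dict] = {}
    for path in paths:
        try:
            # The context manager closes the file, so no ResourceWarning is raised.
            with open(path, "rt", encoding="utf-8") as fp:
                schema = json.load(fp)
        except (OSError, json.JSONDecodeError) as exc:
            logger.error("Skipping runtime env schema %s: %s", path, exc)
            continue  # Move on to the next file instead of using an unset value.
        # Keying by "title" is an assumption made for this sketch.
        if not isinstance(schema, dict) or "title" not in schema:
            logger.error("Skipping runtime env schema %s: no 'title' field", path)
            continue
        schemas[schema["title"]] = schema
    return schemas
```

Putting the bad paths first in the input list, as in the reproduction steps, is what exposes a missing `continue`: a loader that only worked because a good schema had been read earlier would fail in that order.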

* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)

In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata).

Previously, when submitting a job, we (1) checked whether the info for that job already exists in the internal KV, and then (2) put the new info and job ID into the internal KV.

This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.

This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value).

Also adds a unit test which fails without this change.
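
The difference between check-then-put and a single atomic put can be shown with a toy store. The sketch below is a stand-in, not Ray's internal KV: `InMemoryKV` and `submit_job` are hypothetical names, and only the `overwrite=False` behaviour (return the number of keys newly added) is modelled after the description above.

```python
import threading
from typing import Dict


class InMemoryKV:
    """Toy stand-in for an internal KV store (hypothetical, not Ray's API)."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}
        self._lock = threading.Lock()

    def put(self, key: str, value: bytes, overwrite: bool = True) -> int:
        """Store value and return the number of keys newly added (0 or 1)."""
        with self._lock:
            is_new = key not in self._data
            if is_new or overwrite:
                self._data[key] = value
            return 1 if is_new else 0


def submit_job(kv: InMemoryKV, submission_id: str, info: bytes) -> bool:
    """Claim submission_id in one step; return False if it was already taken."""
    # The existence check and the write happen under the same lock, so two
    # concurrent submissions cannot both see the key as missing.
    return kv.put(submission_id, info, overwrite=False) == 1
```

With this shape, the losing submission gets a clear `False` at the point of submission rather than failing later with an unrelated error about duplicate actor names.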

* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message.


Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
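
The retry behaviour described above can be sketched as follows. The real change is in C++ (`redis_client.cc`); this Python version only illustrates the control flow. The `Reply` class, the `send_incr` callback, and the `max_attempts` parameter are assumptions for the example, while the numeric reply types match the hiredis constants (`REDIS_REPLY_INTEGER` is 3, `REDIS_REPLY_ERROR` is 6).

```python
from dataclasses import dataclass
from typing import Callable

REDIS_REPLY_INTEGER = 3
REDIS_REPLY_ERROR = 6


@dataclass
class Reply:
    """Minimal stand-in for a Redis reply (hypothetical)."""

    type: int
    integer: int = 0
    error: str = ""


def get_next_job_id(send_incr: Callable[[], Reply], max_attempts: int = 3) -> int:
    """Return the next job ID, retrying when Redis answers with an error reply."""
    for attempt in range(1, max_attempts + 1):
        reply = send_incr()
        if reply.type == REDIS_REPLY_INTEGER:
            return reply.integer
        if reply.type == REDIS_REPLY_ERROR:
            # Surface the server's message instead of only the numeric type.
            print(f"attempt {attempt}: Redis error: {reply.error}")
            continue
        raise RuntimeError(f"Expected integer, found Redis type {reply.type}")
    raise RuntimeError(f"No integer reply after {max_attempts} attempts")
```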

* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)

Additionally fix `test_usage_test.py`.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* Deprecate RuntimeContext.get (#33734)

RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [Serve] Fix the serve.batch api doc (#33588)

Fix the example formatting in the serve batch API doc

* [infra] increase Build timeout (#33756)

Why are these changes needed?
Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g.:

https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2

* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* [Test] Fix the failing workflow test_dataset after streaming executor is enabled.  (#33736)

It looks like the workflow starts a 1-CPU cluster, and its own remote task uses 1 CPU, which blocks scheduling of dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I think that doesn't really make sense (since the user probably just writes a regular function inside).

I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is a mystery why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).

* [Test] Fix out of disk error (#33732)

Sometimes there is more than one OOD (out-of-disk) event if the test runs for more than 10 seconds. I relaxed the assert condition for that case.

* [Data] Repurpose streaming CI to bulk CI(#33478)

Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).

* [Serve] Enable serve metrics lib working in ray actor (#33717)

Make sure the serve metrics lib works inside a Ray actor without a serve context.
```python
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics


@ray.remote
class MyActor:
    def __init__(self):
        # The counter is created inside a plain Ray actor, not a Serve replica.
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        # Each request increments the counter held by the actor.
        await self.my_actor.test.remote()
        return
```

* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)

Signed-off-by: sven1977 <svenmika1977@gmail.com>

* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)

If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Change ray to 2.3.1 to work around the #ir-glorious-shape

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Revert to normal ray image

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix delete_fn

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)

Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
 - Read from an ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
 - Add some sample tests to use gce as an environment so we can run a CI and check that this diff works

- [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825

* Run lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Fix some tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add unit tests for test definition parser

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix lints

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Check that parse_test_definition throws exception on empty variations

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove the constant test definition in test.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
can-anyscale added a commit that referenced this issue Mar 30, 2023
* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping into many of the CI/CD builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit, which by default is measured in kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so the line we are looking for might not be there. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.
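The idea of reading only the one small file, instead of streaming the whole archive through a pipe, can be sketched in Python as below. This is an illustration, not the CI script itself: `read_wheel_commit` and `make_fake_wheel` are hypothetical helpers, and the member path `ray/__init__.py` is an assumption about where the `__commit__` line lives.

```python
import io
import re
import zipfile
from typing import Optional

# Matches a line such as: __commit__ = "abc123"
COMMIT_RE = re.compile(r'^__commit__\s*=\s*"([^"]*)"', re.MULTILINE)


def read_wheel_commit(wheel, member: str = "ray/__init__.py") -> Optional[str]:
    """Extract __commit__ by opening only the named member of the archive."""
    with zipfile.ZipFile(wheel) as zf:
        # Only this one small file is decompressed, not the whole wheel.
        text = zf.read(member).decode("utf-8")
    match = COMMIT_RE.search(text)
    return match.group(1) if match else None


def make_fake_wheel(commit: str) -> io.BytesIO:
    """Build a tiny in-memory archive that mimics a wheel layout."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(
            "ray/__init__.py",
            f'__version__ = "0.0.0"\n__commit__ = "{commit}"\n',
        )
        # Stand-in for the large binary payload of a real wheel.
        zf.writestr("ray/_raylet.so", b"\0" * 1024)
    buf.seek(0)
    return buf
```

Because the lookup is by member name, the size of the other files in the archive no longer affects the result.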

You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added two years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add new lines to some files

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support test definition with multiple flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Use not in to check key in dict

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 2

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 03

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove temporary logs

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Update flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Only initialize gs client on gs host

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Update image for Sematic integration (#33469)

* [RLlib] fix preprocessor test (#33719)

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)

This makes Java logging consistent with PR #31772, which appears to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.

Co-authored-by: Qing Wang <kingchin1218@126.com>

* [serve] Fix serve HA test (#33699)

#33597 changed the log statements for adding a replica to a deployment. The assert statement in test_ray_server_basic checks for the exact log statement - we need to update that assert statement.

* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)

This reverts commit cb5bb0e.

* [tune] add data to CI test dependencies (#33729)

1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `":octopus: Tune tests and examples (medium)"`.
2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"` but was done prior to merging (1).
3. #33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"` but was done prior to merging (1).

**Note:** There should probably be a better way for handling dependencies in CI tests...

* [Test] Fix test event test timeout (#33704)

* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)

This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.

* [runtime env] Close schema after loading and continue on error (#33535)

This PR fixes a few things:

* A warning from not closing the file opened with `open()`. (We treat these warnings as errors, and Ray was causing some integration tests to flake.)
* Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file cannot be decoded or the file doesn't exist.
    * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.
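The two fixes listed above can be sketched as a small loader. This is a hedged illustration, not Ray's actual schema-loading code: `load_schemas` and its arguments are hypothetical names. It shows a `with` block that closes each file and a `continue` that skips a bad path instead of reusing the schema left over from the previous iteration.

```python
import json
import logging
import os
import tempfile

logger = logging.getLogger(__name__)


def load_schemas(paths):
    """Load JSON schema files, skipping any that are missing or invalid."""
    schemas = {}
    for path in paths:
        try:
            # The context manager closes the file even if decoding fails.
            with open(path) as f:
                schema = json.load(f)
        except (OSError, json.JSONDecodeError) as e:
            logger.warning("Skipping schema %s: %s", path, e)
            # Without this, `schema` would be undefined on the first
            # iteration or silently stale on later ones.
            continue
        schemas[path] = schema
    return schemas
```

Putting the bad path first in a test exercises the case that the original test ordering happened to miss.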


**Steps to Reproduce**
1. Save this script as `test.py`
```python
import ray

@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True

ray.init()
print(ray.get(my_fn.remote()))
```
2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3.
    a. save `:` or other invalid JSON as `bad-json.json`
    b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`
 
This PR fixes the issue and adds a new test case.
Signed-off-by: James Clark <james.clark@zapatacomputing.com>

* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)

In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata).

Previously, when submitting a job, we (1) checked whether info for the job ID already existed in the internal KV, and then (2) put the new info and job ID into the internal KV.

This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.

This PR fixes the race condition by making operations (1) and (2) a single atomic step (this is already supported by internal_kv_put(... overwrite=False) -> int, which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and check the return value).

Also adds a unit test which fails without this change.

* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries on REDIS_REPLY_ERROR (reply type 6) and also prints the error message.


Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)

Additionally fix `test_usage_test.py`.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* Deprecate RuntimeContext.get (#33734)

RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [Serve] Fix the serve.batch api doc (#33588)

Fix the example formatting in the serve batch API doc

* [infra] increase Build timeout (#33756)

Why are these changes needed?
Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g.:

https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2

* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* [Test] Fix the failing workflow test_dataset after streaming executor is enabled.  (#33736)

It looks like the workflow starts a 1-CPU cluster, and its own remote task uses 1 CPU, which blocks scheduling of dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I think that doesn't really make sense (since the user probably just writes a regular function inside).

I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is a mystery why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).

* [Test] Fix out of disk error (#33732)

Sometimes there is more than one OOD (out-of-disk) event if the test runs for more than 10 seconds. I relaxed the assert condition for that case.

* [Data] Repurpose streaming CI to bulk CI(#33478)

Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).

* [Serve] Enable serve metrics lib working in ray actor (#33717)

Make sure the serve metrics lib works inside a Ray actor without a serve context.
```python
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics


@ray.remote
class MyActor:
    def __init__(self):
        # The counter is created inside a plain Ray actor, not a Serve replica.
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        # Each request increments the counter held by the actor.
        await self.my_actor.test.remote()
        return
```

* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)

Signed-off-by: sven1977 <svenmika1977@gmail.com>

* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)

If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Change ray to 2.3.1 to work around the #ir-glorious-shape

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Revert to normal ray image

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix delete_fn

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)

Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
 - Read from an ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
 - Add some sample tests to use gce as an environment so we can run a CI and check that this diff works

- [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825

* Run lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Fix some tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add unit tests for test definition parser

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix lints

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Check that parse_test_definition throws exception on empty variations

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove the constant test definition in test.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* The cluster environment name does not allow the character '.', so fix that.

Address Lonnie's comments and add more tests.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
can-anyscale added a commit that referenced this issue Mar 30, 2023
* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping into many of the CI/CD builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit, which by default is measured in kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so the line we are looking for might not be there. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.

You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added two years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add new lines to some files

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support test definition with multiple flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Use not in to check key in dict

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 2

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 03

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove temporary logs

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Update flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Only initialize gs client on gs host

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Update image for Sematic integration (#33469)

* [RLlib] fix preprocessor test (#33719)

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)

This makes Java logging consistent with PR #31772, which appears to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.

Co-authored-by: Qing Wang <kingchin1218@126.com>

* [serve] Fix serve HA test (#33699)

#33597 changed the log statements for adding a replica to a deployment. The assert statement in test_ray_server_basic checks for the exact log statement - we need to update that assert statement.

* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)

This reverts commit cb5bb0e.

* [tune] add data to CI test dependencies (#33729)

1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `":octopus: Tune tests and examples (medium)"`.
2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"` but was done prior to merging (1).
3. #33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"` but was done prior to merging (1).

**Note:** There should probably be a better way for handling dependencies in CI tests...

* [Test] Fix test event test timeout (#33704)

* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)

This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.

* [runtime env] Close schema after loading and continue on error (#33535)

This PR fixes a few things:

* A warning from not closing the file opened with `open()`. (We treat these warnings as errors, and Ray was causing some integration tests to flake.)
* Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file cannot be decoded or the file doesn't exist.
    * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.


**Steps to Reproduce**
1. Save this script as `test.py`
```python
import ray

@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True

ray.init()
print(ray.get(my_fn.remote()))
```
2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3.
    a. save `:` or other invalid JSON as `bad-json.json`
    b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`
 
This PR fixes the issue and adds a new test case.
Signed-off-by: James Clark <james.clark@zapatacomputing.com>
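
A minimal sketch of the "close the file and continue on error" pattern described above. This is not the actual Ray code: `load_schemas` and its logging are hypothetical names used only for illustration.

```python
import json
import logging

logger = logging.getLogger(__name__)


def load_schemas(paths):
    """Load JSON schemas from `paths`, skipping files that are missing or invalid.

    Hypothetical helper for illustration; not part of Ray's API.
    """
    schemas = []
    for path in paths:
        try:
            # The context manager closes the file even if json.load raises,
            # which avoids the unclosed-file warning.
            with open(path) as f:
                schema = json.load(f)
        except (OSError, json.JSONDecodeError) as exc:
            logger.warning("Skipping schema %s: %s", path, exc)
            continue  # move on to the next schema instead of failing
        schemas.append(schema)
    return schemas
```

With this shape, a bad or missing file listed first no longer prevents later schemas from loading, which is the ordering case the original test missed.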

* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)

In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata).

Previously, when submitting a job, we would (1) check whether the info for the job ID already exists in the internal KV, and then (2) put the new info and job ID into the internal KV.

This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.

This PR fixes the race condition by making operations (1) and (2) a single atomic step (this is already supported by `internal_kv_put(... overwrite=False) -> int`, which returns the number of keys newly added; this PR just updates the Jobs code to use `overwrite=False` and the return value).

Also adds a unit test which fails without this change.
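
A rough sketch of the put-if-absent idea, using an in-memory stand-in rather than Ray's internal KV. `InMemoryKV` and `submit_job` are hypothetical names; only the `overwrite=False` flag and the "number of keys newly added" return value mirror the description above.

```python
import threading


class InMemoryKV:
    """Hypothetical stand-in for a KV store with an atomic put-if-absent."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value, overwrite=True):
        """Store `value` and return the number of keys newly added (0 or 1)."""
        with self._lock:
            is_new = key not in self._data
            if is_new or overwrite:
                self._data[key] = value
            return 1 if is_new else 0


def submit_job(kv, submission_id, info):
    """Register a job; raise if `submission_id` is already taken."""
    # The existence check and the insert are one atomic operation, so two
    # concurrent submissions with the same id cannot both succeed.
    if kv.put(submission_id, info, overwrite=False) == 0:
        raise ValueError(f"Job with submission_id {submission_id!r} already exists")
    return submission_id
```

The second caller gets a clear "already exists" error at submission time instead of a later failure about duplicate named actors.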

* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message.
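
The retry behavior can be sketched in Python as follows. The real change is in the C++ `RedisClient`; here `get_next_job_id`, `send_command`, and the retry parameters are hypothetical, and the reply-type constants follow hiredis numbering (error replies are type 6, as in the check failure above).

```python
import time

# Reply type codes as used by hiredis.
REPLY_INTEGER = 3
REPLY_ERROR = 6


def get_next_job_id(send_command, max_retries=5, delay_s=0.0):
    """Call `send_command` until it returns an integer reply.

    `send_command` is a hypothetical callable returning (reply_type, value).
    Error replies are logged and retried; any other type is a hard failure.
    """
    for attempt in range(1, max_retries + 1):
        reply_type, value = send_command()
        if reply_type == REPLY_INTEGER:
            return value
        if reply_type == REPLY_ERROR:
            # Print the error message so the failure is diagnosable, then retry.
            print(f"Redis error on attempt {attempt}: {value}")
            time.sleep(delay_s)
            continue
        raise TypeError(f"Expected integer, found Redis type {reply_type}")
    raise RuntimeError(f"No integer reply after {max_retries} attempts")
```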


Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)

Additionally fix `test_usage_test.py`.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* Deprecate RuntimeContext.get (#33734)

RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [Serve] Fix the serve.batch api doc (#33588)

Fix the example formatting in the serve batch API doc

* [infra] increase Build timeout (#33756)

Why are these changes needed?
Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g.:

https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2

* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* [Test] Fix the failing workflow test_dataset after streaming executor is enabled.  (#33736)

Looks like the workflow will start 1 CPU cluster, and it has its own remote task that uses 1 CPU which blocks scheduling dataset tasks that require CPUs. There was an option to make workflow remote task to use 0 CPU, but I think that doesn't really make sense (since user probably just writes regular function inside).

I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is unclear why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).

* [Test] Fix out of disk error (#33732)

Sometimes there is more than one OOD (out-of-disk) event if the test runs for more than 10 seconds. I relaxed the assert condition for that case.

* [Data] Repurpose streaming CI to bulk CI(#33478)

Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).

* [Serve] Enable serve metrics lib working in ray actor (#33717)

Make sure the Serve metrics lib works inside a Ray actor without a Serve context.
```python
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics


@ray.remote
class MyActor:
    def __init__(self):
        # Counter created inside a plain Ray actor, outside any Serve context.
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        # Each request increments the counter owned by the actor.
        await self.my_actor.test.remote()
        return
```

* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)

Signed-off-by: sven1977 <svenmika1977@gmail.com>

* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)

If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Change ray to 2.3.1 to work around the #ir-glorious-shape

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Revert to normal ray image

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix delete_fn

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)

Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
 - Read from the ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
 - Add some sample tests to use gce as an environment so we can run a CI and check that this diff works

- [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825

* Run lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Fix some tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add unit tests for test definition parser

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix lints

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Check that parse_test_definition throws exception on empty variations

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove the constant test definition in test.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* The cluster environment name does not allow the character '.', so fix that.

Address Lonnie's comments and add more tests.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
can-anyscale added a commit that referenced this issue Apr 1, 2023
* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of ci/cd builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit that is by default on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so the content we are looking for might not be there. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.
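
As an illustration of reading only the one file of interest, here is a sketch using Python's `zipfile` module instead of the `unzip` command the CI script uses. The function names, the member path passed in, and the assumption that the commit is stored as `__commit__ = "..."` with double quotes are all hypothetical.

```python
import re
import zipfile


def read_wheel_commit(wheel_path, member):
    """Return the `__commit__` value stored in one member of a wheel, or None."""
    with zipfile.ZipFile(wheel_path) as wheel:
        # Read only the named member rather than streaming the whole archive.
        text = wheel.read(member).decode("utf-8")
    match = re.search(r'^__commit__\s*=\s*"([^"]*)"', text, flags=re.MULTILINE)
    return match.group(1) if match else None


def check_wheel_commit(wheel_path, member, expected):
    """Raise if the commit recorded in the wheel differs from `expected`."""
    observed = read_wheel_commit(wheel_path, member)
    if observed != expected:
        raise RuntimeError(
            f"Observed wheel commit ({observed}) is not expected commit ({expected})."
        )
```

Because a single small member is read, the result does not depend on how much of a large archive fits through a pipe.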

You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added 2 years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add new lines to some files

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support test definition with multiple flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Use not in to check key in dict

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 2

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 03

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove temporary logs

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Update flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Only initialize gs client on gs host

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Update image for Sematic integration (#33469)

* [RLlib] fix preprocessor test (#33719)

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)

This makes Java logging consistent with PR #31772, which appears to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.

Co-authored-by: Qing Wang <kingchin1218@126.com>

* [serve] Fix serve HA test (#33699)

* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)

This reverts commit cb5bb0e.


* [tune] add data to CI test dependencies (#33729)

1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement of `":octopus: Tune tests and examples (medium)"`.
2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"` but was done prior to merging (1).
3. #33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"` but was done prior to merging (1).

**Note:** There should probably be a better way for handling dependencies in CI tests...

* [Test] Fix test event test timeout (#33704)

* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)

This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.

* [runtime env] Close schema after loading and continue on error (#33535)

This PR fixes a few things:

* A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
* Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
    * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.

**Steps to Reproduce**
1. Save this script as `test.py`
```python
import ray

@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True

ray.init()
print(ray.get(my_fn.remote()))
```
2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3.
    a. save `:` or other invalid JSON as `bad-json.json`
    b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

This PR fixes the issue and adds a new test case.
Signed-off-by: James Clark <james.clark@zapatacomputing.com>

* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)

In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata).

Previously, when submitting a job, we would (1) check whether the info for the job ID already exists in the internal KV, and then (2) put the new info and job ID into the internal KV.

This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.

This PR fixes the race condition by making operations (1) and (2) a single atomic step (this is already supported by `internal_kv_put(... overwrite=False) -> int`, which returns the number of keys newly added; this PR just updates the Jobs code to use `overwrite=False` and the return value).

Also adds a unit test which fails without this change.

* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)

Additionally fix `test_usage_test.py`.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* Deprecate RuntimeContext.get (#33734)

RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [Serve] Fix the serve.batch api doc (#33588)

Fix the example formatting in the serve batch API doc

* [infra] increase Build timeout (#33756)

Why are these changes needed?
Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g.:

https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2

* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* [Test] Fix the failing workflow test_dataset after streaming executor is enabled.  (#33736)

Looks like the workflow will start 1 CPU cluster, and it has its own remote task that uses 1 CPU which blocks scheduling dataset tasks that require CPUs. There was an option to make workflow remote task to use 0 CPU, but I think that doesn't really make sense (since user probably just writes regular function inside).

I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is unclear why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).

* [Test] Fix out of disk error (#33732)

Sometimes there is more than one OOD (out-of-disk) event if the test runs for more than 10 seconds. I relaxed the assert condition for that case.

* [Data] Repurpose streaming CI to bulk CI(#33478)

Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).

* [Serve] Enable serve metrics lib working in ray actor (#33717)

Make sure the Serve metrics lib works inside a Ray actor without a Serve context.
```python
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics


@ray.remote
class MyActor:
    def __init__(self):
        # Counter created inside a plain Ray actor, outside any Serve context.
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        # Each request increments the counter owned by the actor.
        await self.my_actor.test.remote()
        return
```

* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)

Signed-off-by: sven1977 <svenmika1977@gmail.com>

* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)

If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Change ray to 2.3.1 to work around the #ir-glorious-shape

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Revert to normal ray image

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix delete_fn

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)

Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
 - Read from the ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
 - Add some sample tests to use gce as an environment so we can run a CI and check that this diff works

- [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825

* Run lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Fix some tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add unit tests for test definition parser

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix lints

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Check that parse_test_definition throws exception on empty variations

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove the constant test definition in test.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
can-anyscale added a commit that referenced this issue Apr 1, 2023
* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of ci/cd builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit that is by default on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so the content we are looking for might not be there. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.

You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added 2 years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add new lines to some files

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support test definition with multiple flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Use not in to check key in dict

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 2

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 03

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove temporary logs

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Update flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Only initialize gs client on gs host

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Update image for Sematic integration (#33469)

* [RLlib] fix preprocessor test (#33719)

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)

This makes Java logging consistent with PR #31772, which appears to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.

Co-authored-by: Qing Wang <kingchin1218@126.com>

* [serve] Fix serve HA test (#33699)

* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)

This reverts commit cb5bb0e.


* [tune] add data to CI test dependencies (#33729)

1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement of `":octopus: Tune tests and examples (medium)"`.
2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"` but was done prior to merging (1).
3. #33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"` but was done prior to merging (1).

**Note:** There should probably be a better way for handling dependencies in CI tests...

* [Test] Fix test event test timeout (#33704)

* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)

This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.

* [runtime env] Close schema after loading and continue on error (#33535)

This PR fixes a few things:

* A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
* Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
    * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.
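
The two fixes above can be illustrated with a minimal sketch. This is not the actual Ray code: `load_schemas` and its return shape are hypothetical, and the sketch only assumes that each schema is a JSON file on disk.

```python
import json
import logging

logger = logging.getLogger(__name__)


def load_schemas(paths):
    """Load JSON schemas, skipping files that are missing or invalid.

    Hypothetical helper for illustration; not Ray's implementation.
    """
    schemas = {}
    for path in paths:
        try:
            # The context manager closes the file even if decoding fails.
            with open(path) as f:
                schema = json.load(f)
        except (OSError, json.JSONDecodeError) as e:
            logger.warning("Skipping schema %s: %s", path, e)
            # Without this `continue`, the code below would reuse the schema
            # from the previous iteration (or fail on the first one).
            continue
        schemas[path] = schema
    return schemas
```

Because the `continue` sits inside the `except` branch, a bad file is skipped regardless of whether a valid schema was loaded before it.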

**Steps to Reproduce**
1. Save this script as `test.py`
```python
import ray

@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True

ray.init()
print(ray.get(my_fn.remote()))
```
2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3.
    a. save `:` or other invalid JSON as `bad-json.json`
    b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

This PR fixes the issue and adds a new test case.
Signed-off-by: James Clark <james.clark@zapatacomputing.com>

* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)

In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata).

Previously, when submitting a job, we would (1) check whether the info for that job ID already exists in the internal KV, and then (2) put the new info and job ID into the internal KV.

This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.

This PR fixes the race condition by performing operations (1) and (2) as a single atomic step. This is already supported by `internal_kv_put(..., overwrite=False) -> int`, which returns the number of keys newly added; this PR just updates the Jobs code to use `overwrite=False` and check the return value.

Also adds a unit test which fails without this change.
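
As a rough sketch of the idea, assuming an in-memory stand-in for the KV store: `InMemoryKV` and `submit_job` below are hypothetical names, not Ray's API. Only the `overwrite=False` semantics (return the number of keys newly added) are taken from the description above.

```python
import threading


class InMemoryKV:
    """Hypothetical stand-in for the internal KV store (not Ray's API)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value, overwrite=True):
        """Store value and return the number of keys newly added (0 or 1)."""
        with self._lock:
            if key in self._data:
                if overwrite:
                    self._data[key] = value
                return 0
            self._data[key] = value
            return 1


def submit_job(kv, submission_id, info):
    # One put with overwrite=False both checks and writes, so only one of
    # several concurrent submitters can see the key as newly added.
    if kv.put(submission_id, info, overwrite=False) == 0:
        raise ValueError(f"Job {submission_id!r} already exists.")
    return submission_id
```

With a separate "check, then put" pair, two submitters could both pass the check before either one writes; folding both into one call removes that window.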

* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message.
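
The retry behavior might look roughly like the following Python sketch. The real change is in C++ (`redis_client.cc`); `get_next_job_id`, `send_incr`, and the retry parameters here are hypothetical, and only the reply type codes (3 for integer, 6 for error) come from hiredis.

```python
import time

REDIS_REPLY_INTEGER = 3  # hiredis reply type codes
REDIS_REPLY_ERROR = 6


def get_next_job_id(send_incr, max_attempts=3, delay_s=0.0):
    """Call send_incr() until it yields an integer reply.

    send_incr is a hypothetical callable returning (reply_type, payload).
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        reply_type, payload = send_incr()
        if reply_type == REDIS_REPLY_INTEGER:
            return payload
        if reply_type != REDIS_REPLY_ERROR:
            # Any other reply type is still treated as a hard failure.
            raise TypeError(f"Expected integer, found Redis type {reply_type}")
        last_error = payload
        print(f"Redis error on attempt {attempt}: {payload}")
        time.sleep(delay_s)
    raise RuntimeError(f"JobCounter still failing: {last_error}")
```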

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)

Additionally fix `test_usage_test.py`.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* Deprecate RuntimeContext.get (#33734)

RuntimeContext.get exposes Cython IDs instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [Serve] Fix the serve.batch api doc (#33588)

Fix the example formatting in the serve batch API doc

* [infra] increase Build timeout (#33756)

Why are these changes needed?
Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g.

https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2

* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* [Test] Fix the failing workflow test_dataset after streaming executor is enabled.  (#33736)

It looks like the workflow starts a 1-CPU cluster, and it has its own remote task that uses 1 CPU, which blocks the scheduling of dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I don't think that really makes sense (since users probably just write a regular function inside).

I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is a mystery why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).

* [Test] Fix out of disk error (#33732)

Sometimes there is more than one OOD event if the test runs for more than 10 seconds. I relaxed the assert condition for that case.

* [Data] Repurpose streaming CI to bulk CI(#33478)

Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).

* [Serve] Enable serve metrics lib working in ray actor (#33717)

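Make sure the Serve metrics library (`ray.serve.metrics`) works inside a plain Ray actor, without a Serve context.
```python
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics


@ray.remote
class MyActor:
    def __init__(self):
        # Counter created inside a plain Ray actor, outside any Serve context.
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        # Calling the actor increments the counter on behalf of a Serve replica.
        await self.my_actor.test.remote()
        return
```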

* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)

Signed-off-by: sven1977 <svenmika1977@gmail.com>

* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)

If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Change ray to 2.3.1 to work around the #ir-glorious-shape

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Revert to normal ray image

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix delete_fn

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)

Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
 - Read from an ENV variable, or the storage link, to see whether this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
 - Add some sample tests that use GCE as an environment so we can run CI and check that this diff works.
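
A minimal sketch of the detection step, under stated assumptions: the function name `is_gce_host` and the environment variable `EXAMPLE_CLOUD_PROVIDER` are made up for illustration, and the only detail taken from this PR series is that GCE storage links use the `gs://` scheme.

```python
import os
from urllib.parse import urlparse


def is_gce_host(storage_url, env=None):
    """Guess whether a job targets GCE from the environment or storage link.

    EXAMPLE_CLOUD_PROVIDER is a hypothetical variable name, not Ray's.
    """
    env = os.environ if env is None else env
    if env.get("EXAMPLE_CLOUD_PROVIDER", "").lower() == "gce":
        return True
    # Fall back to the storage link: gs:// implies Google Cloud Storage.
    return urlparse(storage_url).scheme == "gs"
```

Callers could then branch once on this result to pick the file manager used for read, write, and delete, rather than checking the scheme at every call site.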

- [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825

* Run lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)

* Set up dependencies and credentials for GCE in buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Fix some tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add unit tests for test definition parser

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix lints

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Check that parse_test_definition throws exception on empty variations

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove the constant test definition in test.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* The cluster environment name does not allow the character '.', so fix that.

Address Lonnie's comments and add more tests.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
can-anyscale added a commit that referenced this issue Apr 1, 2023
* Fix the 'Observed wheel commit () is not expected' issue (#32156) that has been creeping into many of the CI/CD builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit, which by default is measured in kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we are looking for might not be there. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.

You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added 2 years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.
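
The same idea, reading only the one small archive member instead of streaming the whole wheel, can be sketched in Python with the standard `zipfile` module. This swaps in `zipfile` for the `unzip` command the CI script uses; the member name passed in and the exact `__commit__ = "..."` line format are assumptions for illustration.

```python
import io
import re
import zipfile

# Assumed line format; the source only says the check matches ^__commit__.
COMMIT_RE = re.compile(r'^__commit__\s*=\s*"([^"]*)"', re.MULTILINE)


def read_wheel_commit(wheel, member):
    """Return the commit recorded in one archive member, or "" if absent."""
    with zipfile.ZipFile(wheel) as zf:
        # Only the named member is decompressed; other entries are not read.
        text = zf.read(member).decode("utf-8")
    match = COMMIT_RE.search(text)
    return match.group(1) if match else ""
```

An empty return value corresponds to the empty `()` in the error message, so a caller can distinguish "no commit found" from "wrong commit found".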

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Set up dependencies and credentials for GCE in buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add new lines to some files

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support test definition with multiple flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Use not in to check key in dict

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 2

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 03

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove temporary logs

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Update flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Only initialize gs client on gs host

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Update image for Sematic integration (#33469)

* [RLlib] fix preprocessor test (#33719)

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)

This makes Java logging consistent with PR #31772, which seems to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.

Co-authored-by: Qing Wang <kingchin1218@126.com>

* [serve] Fix serve HA test (#33699)

* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)

<img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png">
This reverts commit cb5bb0e.

* [tune] add data to CI test dependencies (#33729)

1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement for `":octopus: Tune tests and examples (medium)"`.
2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done before (1) was merged.
3. #33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was also done before (1) was merged.

**Note:** There should probably be a better way for handling dependencies in CI tests...

* [Test] Fix test event test timeout (#33704)

* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)

This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.

* [runtime env] Close schema after loading and continue on error (#33535)

This PR fixes a few things:

* A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
* Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
    * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.

**Steps to Reproduce**
1. Save this script as `test.py`
```python
import ray

@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True

ray.init()
print(ray.get(my_fn.remote()))
```
2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3.
    a. save `:` or other invalid JSON as `bad-json.json`
    b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

This PR fixes the issue and adds a new test case.
Signed-off-by: James Clark <james.clark@zapatacomputing.com>

* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)

In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata).

Previously, when submitting a job, we would (1) check whether the info for that job ID already exists in the internal KV, and then (2) put the new info and job ID into the internal KV.

This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.

This PR fixes the race condition by performing operations (1) and (2) as a single atomic step. This is already supported by `internal_kv_put(..., overwrite=False) -> int`, which returns the number of keys newly added; this PR just updates the Jobs code to use `overwrite=False` and check the return value.

Also adds a unit test which fails without this change.

* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)

Additionally fix `test_usage_test.py`.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* Deprecate RuntimeContext.get (#33734)

RuntimeContext.get exposes Cython IDs instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [Serve] Fix the serve.batch api doc (#33588)

Fix the example formatting in the serve batch API doc

* [infra] increase Build timeout (#33756)

Why are these changes needed?
Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g.

https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2

* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* [Test] Fix the failing workflow test_dataset after streaming executor is enabled.  (#33736)

It looks like the workflow starts a 1-CPU cluster, and it has its own remote task that uses 1 CPU, which blocks the scheduling of dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I don't think that really makes sense (since users probably just write a regular function inside).

I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is a mystery why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).

* [Test] Fix out of disk error (#33732)

Sometimes there is more than one OOD event if the test runs for more than 10 seconds. I relaxed the assert condition for that case.

* [Data] Repurpose streaming CI to bulk CI(#33478)

Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).

* [Serve] Enable serve metrics lib working in ray actor (#33717)

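Make sure the Serve metrics library (`ray.serve.metrics`) works inside a plain Ray actor, without a Serve context.
```python
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics


@ray.remote
class MyActor:
    def __init__(self):
        # Counter created inside a plain Ray actor, outside any Serve context.
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        # Calling the actor increments the counter on behalf of a Serve replica.
        await self.my_actor.test.remote()
        return
```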

* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)

Signed-off-by: sven1977 <svenmika1977@gmail.com>

* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)

If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Change ray to 2.3.1 to work around the #ir-glorious-shape

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Revert to normal ray image

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix delete_fn

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)

Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
 - Read from an ENV variable, or the storage link, to see whether this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
 - Add some sample tests that use GCE as an environment so we can run CI and check that this diff works.

- [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825

* Run lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)

* Set up dependencies and credentials for GCE in buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Fix some tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add unit tests for test definition parser

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix lints

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Check that parse_test_definition throws exception on empty variations

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove the constant test definition in test.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
can-anyscale added a commit that referenced this issue Apr 1, 2023
* Fix the 'Observed wheel commit () is not expected' issue (#32156) that has been creeping into many of the CI/CD builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit, which by default is measured in kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we are looking for might not be there. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.

You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added 2 years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Set up dependencies and credentials for GCE in buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add new lines to some files

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support test definition with multiple flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Use not in to check key in dict

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 2

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 03

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove temporary logs

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Update flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Only initialize gs client on gs host

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Update image for Sematic integration (#33469)

* [RLlib] fix preprocessor test (#33719)

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)

This makes Java logging consistent with PR #31772, which seems to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.

Co-authored-by: Qing Wang <kingchin1218@126.com>

* [serve] Fix serve HA test (#33699)

* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)

<img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png">
This reverts commit cb5bb0e.

* [tune] add data to CI test dependencies (#33729)

1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement for `":octopus: Tune tests and examples (medium)"`.
2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).
3. #33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).

**Note:** There should probably be a better way to handle dependencies in CI tests...

* [Test] Fix test event test timeout (#33704)

* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)

This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.

* [runtime env] Close schema after loading and continue on error (#33535)

This PR fixes a few things:

* A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
* Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
    * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.

**Steps to Reproduce**
1. Save this script as `test.py`
```python
import ray

@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True

ray.init()
print(ray.get(my_fn.remote()))
```
2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3.
    a. save `:` or other invalid JSON as `bad-json.json`
    b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

This PR fixes the issue and adds a new test case.
Signed-off-by: James Clark <james.clark@zapatacomputing.com>
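
The "close the file and continue on error" behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `load_schemas` is a hypothetical helper, and the only assumption is that each schema path points at a JSON file.

```python
import json
import logging
from typing import Dict, List

logger = logging.getLogger(__name__)


def load_schemas(paths: List[str]) -> Dict[str, dict]:
    """Load every readable JSON schema and skip the ones that fail.

    Hypothetical helper for illustration only.
    """
    schemas: Dict[str, dict] = {}
    for path in paths:
        try:
            # The context manager closes the file even if json.load raises.
            with open(path, encoding="utf-8") as f:
                schema = json.load(f)
        except (OSError, json.JSONDecodeError) as exc:
            logger.warning("Skipping schema %s: %s", path, exc)
            continue  # move on to the next schema instead of failing
        schemas[path] = schema
    return schemas
```

With this shape, a missing file or a file containing `:` is logged and skipped, and any valid schema listed after it is still loaded.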

* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)

In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata).

Previously, when submitting a job, we (1) checked whether the info for the job already exists in the internal KV, and then (2) put the new info and job ID into the internal KV.

This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.

This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value).

Also adds a unit test which fails without this change.
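
As a rough illustration of why folding the check and the write into one call removes the race, here is a minimal sketch. `FakeKV` is a hypothetical in-memory stand-in, not Ray's internal KV; the only property borrowed from the description above is that a put with `overwrite=False` returns the number of keys newly added.

```python
import threading
from typing import Dict, List


class FakeKV:
    """Hypothetical in-memory stand-in for the internal KV store."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._lock = threading.Lock()

    def put(self, key: str, value: str, overwrite: bool = True) -> int:
        """Store value and return the number of keys newly added (0 or 1)."""
        with self._lock:
            is_new = key not in self._data
            if is_new or overwrite:
                self._data[key] = value
            return 1 if is_new else 0


def submit_job(kv: FakeKV, submission_id: str, info: str) -> bool:
    """Return True only for the caller that actually claimed submission_id."""
    # Check and write happen in one step, so two racing callers cannot both win.
    return kv.put(submission_id, info, overwrite=False) == 1


def race(n_threads: int = 8) -> List[bool]:
    """Submit the same ID from several threads and collect who succeeded."""
    kv = FakeKV()
    results: List[bool] = []
    results_lock = threading.Lock()

    def worker(i: int) -> None:
        ok = submit_job(kv, "job-1", f"info-{i}")
        with results_lock:
            results.append(ok)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `race()` should yield exactly one `True`, however the threads interleave; the losing callers can then report a friendly "already exists" error instead of failing later.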

* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)

Additionally fix `test_usage_test.py`.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* Deprecate RuntimeContext.get (#33734)

RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [Serve] Fix the serve.batch api doc (#33588)

Fix the example formatting in the serve batch API doc

* [infra] increase Build timeout (#33756)

Why are these changes needed?
Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g.:

https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2

* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* [Test] Fix the failing workflow test_dataset after streaming executor is enabled.  (#33736)

It looks like the workflow starts a 1-CPU cluster, and it has its own remote task that uses 1 CPU, which blocks scheduling of dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I think that doesn't really make sense (since the user probably just writes a regular function inside).

I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is a mystery why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).

* [Test] Fix out of disk error (#33732)

Sometimes there is more than one OOD event if the test runs for more than 10 seconds. I relaxed the assert condition in that case.

* [Data] Repurpose streaming CI to bulk CI (#33478)

Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).

* [Serve] Enable serve metrics lib working in ray actor (#33717)

Make sure ray.serve.lib works with ray.actor without a serve context.
```python
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics


@ray.remote
class MyActor:
    def __init__(self):
        # Counter created inside a plain Ray actor, outside any Serve replica.
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        await self.my_actor.test.remote()
        return
```

* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)

Signed-off-by: sven1977 <svenmika1977@gmail.com>

* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)

If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Change ray to 2.3.1 to work around the #ir-glorious-shape

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Revert to normal ray image

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix delete_fn

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)

Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
 - Read from an ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
 - Add some sample tests to use gce as an environment so we can run a CI and check that this diff works
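
The host detection described in the first bullet could look roughly like the sketch below. The function names and the `EXAMPLE_RUN_ON_GCE` variable are assumptions made for illustration; the only detail taken from this PR is that `gs://` storage links indicate a GCE host.

```python
import os
from typing import Mapping, Optional, Tuple
from urllib.parse import urlparse

# Hypothetical environment variable name, used only in this sketch.
GCE_ENV_VAR = "EXAMPLE_RUN_ON_GCE"


def is_gce_host(
    storage_link: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Decide whether to use GCE-specific logic.

    The environment variable wins; otherwise fall back to the link scheme.
    """
    env = os.environ if env is None else env
    if env.get(GCE_ENV_VAR, "") == "1":
        return True
    if storage_link:
        return urlparse(storage_link).scheme == "gs"
    return False


def split_bucket_and_key(storage_link: str) -> Tuple[str, str]:
    """Split 'gs://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(storage_link)
    return parsed.netloc, parsed.path.lstrip("/")
```

A file manager could branch once on `is_gce_host(...)` and route reads, writes, and deletes to the matching storage client.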

- [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825

* Run lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)

* Setup dependencies and credential for GCE in buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Fix some tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add unit tests for test definition parser

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix lints

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Check that parse_test_definition throws exception on empty variations

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove the constant test definition in test.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* The cluster environment name does not allow the character '.', so fix that.

Address Lonnie's comments and add more tests.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
can-anyscale added a commit that referenced this issue Apr 1, 2023
* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping into many of the CI/CD builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit that by default is on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we look for might not be there. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.

You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added 2 years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.

Signed-off-by: Cuong Nguyen <can@anyscale.com>
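
The idea of extracting only the one small file instead of streaming the whole archive can be illustrated in Python with `zipfile`. This is a sketch, not the CI script itself: the member path `pkg/__init__.py` and the helper names are hypothetical, and the real location of the `__commit__` marker inside the wheel may differ.

```python
import io
import re
import zipfile
from typing import Optional

# Hypothetical member path; the real file holding the commit marker may differ.
COMMIT_MEMBER = "pkg/__init__.py"
COMMIT_RE = re.compile(r'^__commit__\s*=\s*"([^"]*)"', re.MULTILINE)


def read_wheel_commit(wheel, member: str = COMMIT_MEMBER) -> Optional[str]:
    """Return the commit recorded in `member`, or None if it is absent."""
    with zipfile.ZipFile(wheel) as zf:
        try:
            # Only this one member is decompressed, not the whole archive.
            text = zf.read(member).decode("utf-8")
        except KeyError:  # member not present in the archive
            return None
    match = COMMIT_RE.search(text)
    return match.group(1) if match else None


def make_fake_wheel(commit: str) -> io.BytesIO:
    """Build a small in-memory archive that mimics a wheel for this sketch."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(
            COMMIT_MEMBER, f'__version__ = "0.0.0"\n__commit__ = "{commit}"\n'
        )
        zf.writestr("pkg/big_blob.bin", b"\0" * 1_000_000)
    buf.seek(0)
    return buf
```

Returning `None` for a missing member also makes it possible to report "commit marker not found" separately from "commit does not match", instead of printing an empty observed commit.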

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Setup dependencies and credential for GCE in buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add new lines to some files

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support test definition with multiple flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Use not in to check key in dict

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 2

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 03

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove temporary logs

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Update flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Only initialize gs client on gs host

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Update image for Sematic integration (#33469)

* [RLlib] fix preprocessor test (#33719)

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)

This makes Java logging consistent with PR #31772, which appears to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.

Co-authored-by: Qing Wang <kingchin1218@126.com>

* [serve] Fix serve HA test (#33699)

* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)

<img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png">
This reverts commit cb5bb0e.

* [tune] add data to CI test dependencies (#33729)

1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement for `":octopus: Tune tests and examples (medium)"`.
2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).
3. #33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).

**Note:** There should probably be a better way to handle dependencies in CI tests...

* [Test] Fix test event test timeout (#33704)

* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)

This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.

* [runtime env] Close schema after loading and continue on error (#33535)

This PR fixes a few things:

* A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
* Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
    * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.

**Steps to Reproduce**
1. Save this script as `test.py`
```python
import ray

@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True

ray.init()
print(ray.get(my_fn.remote()))
```
2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3.
    a. save `:` or other invalid JSON as `bad-json.json`
    b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

This PR fixes the issue and adds a new test case.
Signed-off-by: James Clark <james.clark@zapatacomputing.com>

* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)

In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata).

Previously, when submitting a job, we (1) checked whether the info for the job already exists in the internal KV, and then (2) put the new info and job ID into the internal KV.

This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.

This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value).

Also adds a unit test which fails without this change.

* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
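
The retry behavior can be sketched in Python as below. This is only an illustration of the control flow, not the C++ change: `Reply`, `get_next_job_id`, and the `send_incr` callable are hypothetical, while the type codes 3 (integer) and 6 (error) follow hiredis.

```python
import time
from dataclasses import dataclass
from typing import Callable, Union

REDIS_REPLY_INTEGER = 3  # hiredis reply type codes
REDIS_REPLY_ERROR = 6


@dataclass
class Reply:
    """Hypothetical, simplified view of a Redis reply."""

    type: int
    value: Union[int, str]


def get_next_job_id(
    send_incr: Callable[[], Reply],
    max_attempts: int = 5,
    delay_s: float = 0.0,
) -> int:
    """Retry on error replies; fail fast on any other unexpected type."""
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        reply = send_incr()
        if reply.type == REDIS_REPLY_INTEGER:
            return int(reply.value)
        if reply.type == REDIS_REPLY_ERROR:
            last_error = str(reply.value)
            # Surface the server's message instead of only the type code.
            print(f"attempt {attempt}: Redis error reply: {last_error}")
            time.sleep(delay_s)
            continue
        raise TypeError(f"Expected integer, found Redis type {reply.type}")
    raise RuntimeError(
        f"JobCounter still failing after {max_attempts} attempts: {last_error}"
    )
```

Bounding the number of attempts keeps a persistently failing Redis from hanging the caller, and the final exception carries the last error message.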

* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)

Additionally fix `test_usage_test.py`.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* Deprecate RuntimeContext.get (#33734)

RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [Serve] Fix the serve.batch api doc (#33588)

Fix the example formatting in the serve batch API doc

* [infra] increase Build timeout (#33756)

Why are these changes needed?
Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g.:

https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2

* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* [Test] Fix the failing workflow test_dataset after streaming executor is enabled.  (#33736)

It looks like the workflow starts a 1-CPU cluster, and it has its own remote task that uses 1 CPU, which blocks scheduling of dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I think that doesn't really make sense (since the user probably just writes a regular function inside).

I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is a mystery why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).

* [Test] Fix out of disk error (#33732)

Sometimes there is more than one OOD event if the test runs for more than 10 seconds. I relaxed the assert condition in that case.

* [Data] Repurpose streaming CI to bulk CI (#33478)

Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).

* [Serve] Enable serve metrics lib working in ray actor (#33717)

Make sure ray.serve.lib works with ray.actor without a serve context.
```python
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics


@ray.remote
class MyActor:
    def __init__(self):
        # Counter created inside a plain Ray actor, outside any Serve replica.
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        await self.my_actor.test.remote()
        return
```

* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)

Signed-off-by: sven1977 <svenmika1977@gmail.com>

* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)

If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Change ray to 2.3.1 to work around the #ir-glorious-shape

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Revert to normal ray image

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix delete_fn

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)

Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
 - Read from an ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
 - Add some sample tests to use gce as an environment so we can run a CI and check that this diff works

- [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825

* Run lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)

* Setup dependencies and credential for GCE in buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Fix some tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add unit tests for test definition parser

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix lints

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Check that parse_test_definition throws exception on empty variations

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove the constant test definition in test.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
can-anyscale added a commit that referenced this issue Apr 1, 2023
* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping into many of the CI/CD builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit that by default is on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we look for might not be there. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.

You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added 2 years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Setup dependencies and credential for GCE in buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add new lines to some files

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support test definition with multiple flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Use not in to check key in dict

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 2

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 03

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove temporary logs

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Update flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Only initialize gs client on gs host

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Update image for Sematic integration (#33469)

* [RLlib] fix preprocessor test (#33719)

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)

This makes Java logging consistent with PR #31772, which appears to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.

Co-authored-by: Qing Wang <kingchin1218@126.com>

* [serve] Fix serve HA test (#33699)

* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)

This reverts commit cb5bb0e.

* [tune] add data to CI test dependencies (#33729)

1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement for `":octopus: Tune tests and examples (medium)"`.
2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).
3. #33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).

**Note:** There should probably be a better way for handling dependencies in CI tests...

* [Test] Fix test event test timeout (#33704)

* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)

This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.

* [runtime env] Close schema after loading and continue on error (#33535)

This PR fixes a few things:

* A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
* Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
    * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.

**Steps to Reproduce**
1. Save this script as `test.py`
```python
import ray

@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True

ray.init()
print(ray.get(my_fn.remote()))
```
2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3.
    a. save `:` or other invalid JSON as `bad-json.json`
    b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

This PR fixes the issue and adds a new test case.
Signed-off-by: James Clark <james.clark@zapatacomputing.com>
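
A minimal sketch of the pattern this fix describes, not Ray's actual loader (the `load_schemas` helper and its list-of-paths argument are assumptions for illustration): each file is opened in a `with` block so it is always closed, and a missing or undecodable file is logged and skipped instead of aborting the remaining schemas.

```python
import json
import logging

logger = logging.getLogger(__name__)


def load_schemas(paths):
    """Load JSON schemas from `paths`, skipping files that cannot be read."""
    schemas = []
    for path in paths:
        try:
            # The context manager closes the file even if decoding fails.
            with open(path, encoding="utf-8") as f:
                schema = json.load(f)
        except (OSError, json.JSONDecodeError) as exc:
            logger.error("Skipping runtime env schema %s: %s", path, exc)
            continue  # move on to the next schema instead of failing
        schemas.append(schema)
    return schemas
```

Placing the bad files before a good one in a test, as in the reproduction steps above, is what exercises the `continue`.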

* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)

In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata).

Previously, when submitting a job, we would (1) check if the info for the job already exists in the internal KV, and then (2) put the new info and job ID into the internal KV.

This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.

This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value).

Also adds a unit test which fails without this change.
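
To illustrate why folding the check into the put closes the race, here is a toy in-memory store (not Ray's internal KV; `ToyKV` and `submit_job` are hypothetical names) whose `put(..., overwrite=False)` returns the number of keys newly added, mirroring the return-value convention described above. When several threads submit the same ID, only one of them gets a return value of 1.

```python
import threading


class ToyKV:
    """In-memory stand-in for a KV store with an atomic put-if-absent."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value, overwrite=True):
        """Store value; return the number of keys newly added (0 or 1)."""
        with self._lock:
            existed = key in self._data
            if existed and not overwrite:
                return 0
            self._data[key] = value
            return 0 if existed else 1


def submit_job(kv, submission_id, info):
    """Reserve the ID in a single step instead of check-then-put."""
    if kv.put(submission_id, info, overwrite=False) == 0:
        raise ValueError(f"Job with submission_id {submission_id} already exists")
    return submission_id


if __name__ == "__main__":
    kv = ToyKV()
    results = []

    def worker(n):
        try:
            submit_job(kv, "job-1", {"owner": n})
            results.append("ok")
        except ValueError:
            results.append("duplicate")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(results.count("ok"), "submission succeeded")
```

With a separate check followed by a put, two threads could both pass the check before either writes; here the existence test and the write happen under one lock, so the loser gets a clear error instead of a late failure.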

* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)

Additionally fix `test_usage_test.py`.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* Deprecate RuntimeContext.get (#33734)

RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [Serve] Fix the serve.batch api doc (#33588)

Fix the example formatting in the serve batch API doc

* [infra] increase Build timeout (#33756)

Why are these changes needed?
Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g.:

https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2

* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* [Test] Fix the failing workflow test_dataset after streaming executor is enabled.  (#33736)

It looks like the workflow starts a 1-CPU cluster, and it has its own remote task that uses 1 CPU, which blocks scheduling of dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I think that doesn't really make sense (since the user probably just writes a regular function inside).

I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is a mystery why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).

* [Test] Fix out of disk error (#33732)

Sometimes there is more than one OOD event if the test runs for more than 10 seconds. I relaxed the assert condition in that case.

* [Data] Repurpose streaming CI to bulk CI (#33478)

Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).

* [Serve] Enable serve metrics lib working in ray actor (#33717)

Make sure ray.serve.lib works with ray.actor without a Serve context.
```
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics


@ray.remote
class MyActor:
    def __init__(self):
        # Serve metric created inside a plain Ray actor (no Serve context).
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        # Each request increments the counter held by the actor.
        await self.my_actor.test.remote()
        return
```

* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)

Signed-off-by: sven1977 <svenmika1977@gmail.com>

* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)

If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Change ray to 2.3.1 to work around the #ir-glorious-shape

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Revert to normal ray image

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix delete_fn

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)

Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
 - Read from an ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
 - Add some sample tests to use gce as an environment so we can run a CI and check that this diff works

- [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825

* Run lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Fix some tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add unit tests for test definition parser

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix lints

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Check that parse_test_definition throws exception on empty variations

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove the constant test definition in test.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* The cluster environment name does not allow the character '.', so fix that.

Address Lonnie's comments and add more tests.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
can-anyscale added a commit that referenced this issue Apr 1, 2023
* Fix the 'Observed wheel commit () is not expected' issue (#32156) that has been creeping into many of the CI/CD builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit that is on the order of kilobytes by default (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we look for might not exist. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.

You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added 2 years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent's multi-tenancy is set up might or might not simplify this logic further.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add new lines to some files

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support test definition with multiple flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Use not in to check key in dict

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 2

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 03

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove temporary logs

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Update flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Only initialize gs client on gs host

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Update image for Sematic integration (#33469)

* [RLlib] fix preprocessor test (#33719)

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)

This makes Java logging consistent with PR #31772, which appears to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.

Co-authored-by: Qing Wang <kingchin1218@126.com>

* [serve] Fix serve HA test (#33699)

* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)

This reverts commit cb5bb0e.

* [tune] add data to CI test dependencies (#33729)

1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement for `":octopus: Tune tests and examples (medium)"`.
2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).
3. #33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).

**Note:** There should probably be a better way for handling dependencies in CI tests...

* [Test] Fix test event test timeout (#33704)

* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)

This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.

* [runtime env] Close schema after loading and continue on error (#33535)

This PR fixes a few things:

* A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
* Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
    * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.

**Steps to Reproduce**
1. Save this script as `test.py`
```python
import ray

@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True

ray.init()
print(ray.get(my_fn.remote()))
```
2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3.
    a. save `:` or other invalid JSON as `bad-json.json`
    b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

This PR fixes the issue and adds a new test case.
Signed-off-by: James Clark <james.clark@zapatacomputing.com>

* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)

In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata).

Previously, when submitting a job, we would (1) check if the info for the job already exists in the internal KV, and then (2) put the new info and job ID into the internal KV.

This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.

This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value).

Also adds a unit test which fails without this change.

* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)

Additionally fix `test_usage_test.py`.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* Deprecate RuntimeContext.get (#33734)

RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [Serve] Fix the serve.batch api doc (#33588)

Fix the example formatting in the serve batch API doc

* [infra] increase Build timeout (#33756)

Why are these changes needed?
Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g.:

https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2

* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* [Test] Fix the failing workflow test_dataset after streaming executor is enabled.  (#33736)

It looks like the workflow starts a 1-CPU cluster, and it has its own remote task that uses 1 CPU, which blocks scheduling of dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I think that doesn't really make sense (since the user probably just writes a regular function inside).

I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is a mystery why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).

* [Test] Fix out of disk error (#33732)

Sometimes there is more than one OOD event if the test runs for more than 10 seconds. I relaxed the assert condition in that case.

* [Data] Repurpose streaming CI to bulk CI (#33478)

Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).

* [Serve] Enable serve metrics lib working in ray actor (#33717)

Make sure ray.serve.lib works with ray.actor without a Serve context.
```
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics


@ray.remote
class MyActor:
    def __init__(self):
        # Serve metric created inside a plain Ray actor (no Serve context).
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        # Each request increments the counter held by the actor.
        await self.my_actor.test.remote()
        return
```

* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)

Signed-off-by: sven1977 <svenmika1977@gmail.com>

* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)

If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Change ray to 2.3.1 to work around the #ir-glorious-shape

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Revert to normal ray image

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix delete_fn

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)

Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
 - Read from an ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
 - Add some sample tests to use gce as an environment so we can run a CI and check that this diff works

- [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825

* Run lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Fix some tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add unit tests for test definition parser

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix lints

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Check that parse_test_definition throws exception on empty variations

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove the constant test definition in test.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
can-anyscale added a commit that referenced this issue Apr 1, 2023
* Fix the 'Observed wheel commit () is not expected' issue (#32156) that has been creeping into many of the CI/CD builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit that is on the order of kilobytes by default (https://man7.org/linux/man-pages/man7/pipe.7.html), so the line we are looking for might not be there. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.

You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added two years ago by our veteran Kai (234b015) to catch stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.
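
The idea of reading only the file that carries the commit, instead of streaming the whole archive through a pipe, can be sketched in Python as follows. This is a minimal illustration and not the script used in CI: the member name `ray/__init__.py`, the `__commit__ = "..."` line format, and the helpers `read_wheel_commit` and `build_fake_wheel` are assumptions made for the example.

```python
import io
import re
import zipfile

# Assumed layout: the wheel contains a small module with a line such as
# __commit__ = "<sha>". Both the member name and the format are assumptions.
COMMIT_MEMBER = "ray/__init__.py"
COMMIT_RE = re.compile(r'^__commit__\s*=\s*"([^"]*)"', re.MULTILINE)


def read_wheel_commit(wheel, member=COMMIT_MEMBER):
    """Return the commit recorded in one member of a wheel (a zip archive).

    Only the named member is decompressed, so the size of the rest of the
    archive does not matter. Raises ValueError unless exactly one
    __commit__ line is found.
    """
    with zipfile.ZipFile(wheel) as archive:
        text = archive.read(member).decode("utf-8")
    matches = COMMIT_RE.findall(text)
    if len(matches) != 1:
        raise ValueError(f"expected one __commit__ line, found {len(matches)}")
    return matches[0]


def build_fake_wheel(commit):
    """Build an in-memory archive standing in for a real wheel (demo only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("ray/big_binary.so", b"\0" * 1_000_000)
        archive.writestr(COMMIT_MEMBER, f'__commit__ = "{commit}"\n')
    buffer.seek(0)
    return buffer
```

Rejecting zero or multiple matches, rather than taking the first one, is one way to surface the "many files that match" surprise mentioned above as an explicit error.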

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add new lines to some files

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support test definition with multiple flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Use not in to check key in dict

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 2

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 03

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove temporary logs

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Update flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Only initialize gs client on gs host

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Update image for Sematic integration (#33469)

* [RLlib] fix preprocessor test (#33719)

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)

This makes Java logging consistent with PR #31772, which seems to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.

Co-authored-by: Qing Wang <kingchin1218@126.com>

* [serve] Fix serve HA test (#33699)

* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)

<img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png">
This reverts commit cb5bb0e.


* [tune] add data to CI test dependencies (#33729)

1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `":octopus: Tune tests and examples (medium)"`.
2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"` but was done prior to merging (1).
3. #33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"` but was done prior to merging (1).

**Note:** There should probably be a better way for handling dependencies in CI tests...

* [Test] Fix test event test timeout (#33704)

* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)

This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.

* [runtime env] Close schema after loading and continue on error (#33535)

This PR fixes a few things:

* A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
* Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
    * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.
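
The two fixes listed above (closing the file, and continuing past a bad schema) can be sketched roughly as below. This is an illustration under assumptions, not the code from the PR: the function name `load_schemas` and its list-of-paths interface are hypothetical.

```python
import json
import logging

logger = logging.getLogger(__name__)


def load_schemas(paths):
    """Load JSON schema files, skipping any that are missing or invalid."""
    schemas = []
    for path in paths:
        try:
            # The context manager closes the file even if json.load raises,
            # which avoids the unclosed-file warning.
            with open(path, encoding="utf-8") as f:
                schema = json.load(f)
        except (OSError, json.JSONDecodeError) as exc:
            logger.warning("Skipping schema %s: %s", path, exc)
            continue  # move on to the next path instead of failing
        schemas.append(schema)
    return schemas
```

Putting the bad file first in the list exercises the `continue` path regardless of ordering, which is the case the earlier test missed by chance.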

**Steps to Reproduce**
1. Save this script as `test.py`
```python
import ray

@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True

ray.init()
print(ray.get(my_fn.remote()))
```
2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3.
    a. save `:` or other invalid JSON as `bad-json.json`
    b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

This PR fixes the issue and adds a new test case.
Signed-off-by: James Clark <james.clark@zapatacomputing.com>

* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)

In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata).

Previously, when submitting a job, we (1) checked whether the info for the job already existed in the internal KV, and then (2) put the new info and job ID into the internal KV.

This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.

This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value).
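
A toy model of the atomic "put only if absent" pattern described above, as a hedged sketch: `ToyKV` is an in-memory stand-in and not Ray's internal KV, and `submit_job` is a hypothetical helper. Only the `overwrite=False` flag and the "number of keys newly added" return value are taken from the description.

```python
import threading


class ToyKV:
    """In-memory stand-in for a key-value store (not Ray's internal KV)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value, overwrite=True):
        """Store value and return the number of keys newly added (0 or 1)."""
        with self._lock:
            is_new = key not in self._data
            if is_new or overwrite:
                self._data[key] = value
            return int(is_new)


def submit_job(kv, submission_id, info):
    """Claim submission_id atomically; only one caller can succeed."""
    # A separate existence check followed by put() would leave a window in
    # which two submitters both see "not present". A single conditional
    # put closes that window.
    added = kv.put(submission_id, info, overwrite=False)
    if added == 0:
        raise ValueError(f"Job {submission_id!r} already exists")
    return submission_id
```

With this shape, the losing submitter gets a clear "already exists" error up front instead of a later failure about duplicate actor names.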

Also adds a unit test which fails without this change.

* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message.
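
The retry behaviour can be illustrated with a short Python sketch. The actual change is in C++ (`redis_client.cc`); everything here is a hypothetical stand-in, including `ErrorReply`, `get_next_job_id`, and the `max_attempts` and `delay_s` parameters.

```python
import time


class ErrorReply(Exception):
    """Stands in for a reply of type REDIS_REPLY_ERROR (hypothetical)."""


def get_next_job_id(incr, max_attempts=5, delay_s=0.0):
    """Call incr() until it returns an integer or attempts run out.

    incr is any callable that returns an int or raises ErrorReply.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return incr()
        except ErrorReply as exc:
            last_error = exc
            # Surface the error text so the failure can be diagnosed.
            print(f"attempt {attempt} failed: {exc}")
            time.sleep(delay_s)
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```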

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)

Additionally fix `test_usage_test.py`.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* Deprecate RuntimeContext.get (#33734)

RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [Serve] Fix the serve.batch api doc (#33588)

Fix the example formatting in the serve batch API doc

* [infra] increase Build timeout (#33756)

Why are these changes needed?
Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g.:

https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2

* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* [Test] Fix the failing workflow test_dataset after streaming executor is enabled.  (#33736)

Looks like the workflow will start 1 CPU cluster, and it has its own remote task that uses 1 CPU which blocks scheduling dataset tasks that require CPUs. There was an option to make workflow remote task to use 0 CPU, but I think that doesn't really make sense (since user probably just writes regular function inside).

I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is a mystery why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).

* [Test] Fix out of disk error (#33732)

Sometimes there is more than one OOD event if the test runs for more than 10 seconds. I relaxed the assert condition for that case.

* [Data] Repurpose streaming CI to bulk CI(#33478)

Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).

* [Serve] Enable serve metrics lib working in ray actor (#33717)

Make sure the ray.serve metrics lib works within a Ray actor without a Serve context.
```python
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics


@ray.remote
class MyActor:
    def __init__(self):
        # Counter created inside a plain Ray actor, outside any Serve replica.
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        # Each request increments the counter owned by the actor.
        await self.my_actor.test.remote()
        return
```

* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)

Signed-off-by: sven1977 <svenmika1977@gmail.com>

* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)

If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Change ray to 2.3.1 to work around the #ir-glorious-shape

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Revert to normal ray image

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix delete_fn

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)

Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
 - Read from an ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
 - Add some sample tests to use gce as an environment so we can run a CI and check that this diff works
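
The host detection described in the list above could look roughly like the sketch below. It is an assumption-laden illustration, not the runner's code: the variable name `EXAMPLE_RUN_ON_GCE` and the helpers `is_gce_host` and `split_bucket_path` are hypothetical; only the "ENV variable or `gs://` storage link" rule comes from the description.

```python
import os
from urllib.parse import urlparse

# Hypothetical variable name, chosen for this example only.
GCE_ENV_VAR = "EXAMPLE_RUN_ON_GCE"


def is_gce_host(storage_url, environ=None):
    """Return True if the env flag is set or the storage link is gs://."""
    environ = os.environ if environ is None else environ
    if environ.get(GCE_ENV_VAR, "") == "1":
        return True
    return urlparse(storage_url).scheme == "gs"


def split_bucket_path(storage_url):
    """Split 'gs://bucket/a/b' or 's3://bucket/a/b' into (bucket, 'a/b')."""
    parsed = urlparse(storage_url)
    return parsed.netloc, parsed.path.lstrip("/")
```

Passing `environ` explicitly keeps the check easy to unit test without touching the real process environment.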

- [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825

* Run lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Fix some tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add unit tests for test definition parser

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix lints

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Check that parse_test_definition throws exception on empty variations

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove the constant test definition in test.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* The cluster environment name does not allow the character '.', so fix that.

Address Lonnie's comments and add more tests.
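
As a small illustration of the constraint on cluster environment names, the sketch below rewrites any '.' in a name. The function name `sanitize_cluster_env_name` and the choice of '-' as the replacement are assumptions for the example; the commit message does not say which character the fix substitutes.

```python
def sanitize_cluster_env_name(name, replacement="-"):
    """Replace every '.' in name, since '.' is not an allowed character."""
    if "." in replacement:
        raise ValueError("replacement must not contain '.'")
    return name.replace(".", replacement)
```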

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
can-anyscale added a commit that referenced this issue Apr 3, 2023
* Fix the 'Observed wheel commit () is not expected' issue (#32156) that has been creeping into many of the CI/CD builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit that is on the order of kilobytes by default (https://man7.org/linux/man-pages/man7/pipe.7.html), so the line we are looking for might not be there. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.

You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added two years ago by our veteran Kai (234b015) to catch stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add new lines to some files

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support test definition with multiple flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Use not in to check key in dict

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 2

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 03

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove temporary logs

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Update flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Only initialize gs client on gs host

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Update image for Sematic integration (#33469)

* [RLlib] fix preprocessor test (#33719)

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)

This makes Java logging consistent with PR #31772, which seems to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.

Co-authored-by: Qing Wang <kingchin1218@126.com>

* [serve] Fix serve HA test (#33699)

* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)

<img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png">
This reverts commit cb5bb0e.


* [tune] add data to CI test dependencies (#33729)

1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `":octopus: Tune tests and examples (medium)"`.
2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"` but was done prior to merging (1).
3. #33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"` but was done prior to merging (1).

**Note:** There should probably be a better way for handling dependencies in CI tests...

* [Test] Fix test event test timeout (#33704)

* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)

This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.

* [runtime env] Close schema after loading and continue on error (#33535)

This PR fixes a few things:

* A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
* Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
    * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.

**Steps to Reproduce**
1. Save this script as `test.py`
```python
import ray

@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True

ray.init()
print(ray.get(my_fn.remote()))
```
2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3.
    a. save `:` or other invalid JSON as `bad-json.json`
    b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

This PR fixes the issue and adds a new test case.
Signed-off-by: James Clark <james.clark@zapatacomputing.com>

* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)

In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata).

Previously, when submitting a job, we (1) checked whether the info for the job already existed in the internal KV, and then (2) put the new info and job ID into the internal KV.

This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.

This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value).

Also adds a unit test which fails without this change.

* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)

Additionally fix `test_usage_test.py`.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* Deprecate RuntimeContext.get (#33734)

RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [Serve] Fix the serve.batch api doc (#33588)

Fix the example formatting in the serve batch API doc

* [infra] increase Build timeout (#33756)

Why are these changes needed?
Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g.:

https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2

* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* [Test] Fix the failing workflow test_dataset after streaming executor is enabled.  (#33736)

Looks like the workflow will start 1 CPU cluster, and it has its own remote task that uses 1 CPU which blocks scheduling dataset tasks that require CPUs. There was an option to make workflow remote task to use 0 CPU, but I think that doesn't really make sense (since user probably just writes regular function inside).

I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is a mystery why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).

* [Test] Fix out of disk error (#33732)

Sometimes there is more than one OOD event if the test runs for more than 10 seconds. I relaxed the assert condition for that case.

* [Data] Repurpose streaming CI to bulk CI(#33478)

Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).

* [Serve] Enable serve metrics lib working in ray actor (#33717)

Make sure the ray.serve metrics lib works within a Ray actor without a Serve context.
```python
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics


@ray.remote
class MyActor:
    def __init__(self):
        # Counter created inside a plain Ray actor, outside any Serve replica.
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        # Each request increments the counter owned by the actor.
        await self.my_actor.test.remote()
        return
```

* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)

Signed-off-by: sven1977 <svenmika1977@gmail.com>

* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)

If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Change ray to 2.3.1 to work around the #ir-glorious-shape

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Revert to normal ray image

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix delete_fn

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)

Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
 - Read from an ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
 - Add some sample tests to use gce as an environment so we can run a CI and check that this diff works

- [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825

* Run lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Fix some tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add unit tests for test definition parser

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix lints

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Check that parse_test_definition throws exception on empty variations

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove the constant test definition in test.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
can-anyscale added a commit that referenced this issue Apr 3, 2023
* Fix the 'Observed wheel commit () is not expected' issue (#32156) that has been showing up in many of the CI/CD builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit that by default is measured in kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so the content we are looking for might not be there. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.

You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added 2 years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.
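
As a rough illustration of the idea (not the actual CI script, which is shell), the sketch below reads one named member out of a wheel instead of streaming the whole archive through a pipe. The member path `ray/__init__.py`, the `__commit__ = "..."` line format, and the helper names `read_wheel_commit` and `check_wheel_commit` are assumptions made for this example.

```python
import re
import zipfile
from typing import Optional

# Assumed line format inside the wheel member: __commit__ = "<hash>"
_COMMIT_RE = re.compile(r"""^__commit__\s*=\s*["']([^"']*)["']""", re.MULTILINE)


def read_wheel_commit(wheel_path: str, member: str = "ray/__init__.py") -> Optional[str]:
    """Return the commit recorded in one member of a wheel, or None if absent."""
    with zipfile.ZipFile(wheel_path) as wheel:
        try:
            # Read only the requested member; nothing else is extracted.
            text = wheel.read(member).decode("utf-8")
        except KeyError:
            # ZipFile.read raises KeyError when the member does not exist.
            return None
    match = _COMMIT_RE.search(text)
    return match.group(1) if match else None


def check_wheel_commit(wheel_path: str, expected: str) -> None:
    """Raise if the wheel's recorded commit differs from the expected one."""
    observed = read_wheel_commit(wheel_path)
    if observed != expected:
        raise RuntimeError(
            f"Observed wheel commit ({observed}) is not expected commit ({expected})."
        )
```

Because only a small, named file is read, the result does not depend on how much of a 50MB stream a downstream reader manages to consume.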

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add new lines to some files

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support test definition with multiple flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Use not in to check key in dict

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 2

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 03

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove temporary logs

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Update flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Only initialize gs client on gs host

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Update image for Sematic integration (#33469)

* [RLlib] fix preprocessor test (#33719)

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)

This makes Java logging consistent with PR #31772, which appears to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.

Co-authored-by: Qing Wang <kingchin1218@126.com>

* [serve] Fix serve HA test (#33699)

* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)

<img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png">
This reverts commit cb5bb0e.

* [tune] add data to CI test dependencies (#33729)

1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement for `":octopus: Tune tests and examples (medium)"`.
2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).
3. #33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).

**Note:** There should probably be a better way for handling dependencies in CI tests...

* [Test] Fix test event test timeout (#33704)

* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)

This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.

* [runtime env] Close schema after loading and continue on error (#33535)

This PR fixes a few things:

* A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
* Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
    * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.
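
The two fixes listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name `load_plugin_schemas`, its list-of-paths argument, and the logging call are assumptions.

```python
import json
import logging
from typing import Dict, List

logger = logging.getLogger(__name__)


def load_plugin_schemas(schema_paths: List[str]) -> Dict[str, dict]:
    """Load JSON schema files, skipping any that are missing or malformed."""
    schemas: Dict[str, dict] = {}
    for path in schema_paths:
        try:
            # The context manager closes the file, so no unclosed-file warning is raised.
            with open(path) as f:
                schema = json.load(f)
        except (OSError, json.JSONDecodeError) as e:
            logger.error("Skipping invalid runtime env schema %s: %s", path, e)
            # Without this `continue`, the code below would reuse the schema from
            # a previous iteration, or fail if no schema had been loaded yet.
            continue
        schemas[path] = schema
    return schemas
```

The comment on `continue` reflects the point made above: a bad file that happens to follow a good one silently reuses the earlier value, which is why the omission went unnoticed in the existing test.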

**Steps to Reproduce**
1. Save this script as `test.py`
```python
import ray

@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True

ray.init()
print(ray.get(my_fn.remote()))
```
2. Run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3. Alternatively, test with invalid JSON:
    a. Save `:` or other invalid JSON as `bad-json.json`
    b. Run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

This PR fixes the issue and adds a new test case.
Signed-off-by: James Clark <james.clark@zapatacomputing.com>

* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)

In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata).

Previously, when submitting a job, we (1) checked whether the info for the job already existed in the internal KV, and then (2) put the new info and job ID into the internal KV.

This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.

This PR fixes the race condition by making operations (1) and (2) happen atomically (this is already supported by internal_kv_put(... overwrite=False) -> int, which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value).

Also adds a unit test which fails without this change.
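
The difference between check-then-put and a single put-if-absent can be sketched with a toy in-memory store. `ToyKV` and `submit_job` are hypothetical stand-ins for this illustration; only the `overwrite=False` put that returns the number of newly added keys mirrors what the description above says about `internal_kv_put`.

```python
import threading
from typing import Dict


class ToyKV:
    """In-memory stand-in for a KV store with an atomic put-if-absent."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._lock = threading.Lock()

    def put(self, key: str, value: str, overwrite: bool = True) -> int:
        """Store the value and return the number of keys newly added (0 or 1)."""
        with self._lock:
            is_new = key not in self._data
            if is_new or overwrite:
                self._data[key] = value
            return 1 if is_new else 0


def submit_job(kv: ToyKV, submission_id: str, info: str) -> None:
    """Register a job, rejecting a duplicate submission_id in a single step."""
    # The existence check and the write happen inside one put call, so two
    # racing submitters cannot both observe "key absent" and proceed.
    if kv.put(submission_id, info, overwrite=False) == 0:
        raise ValueError(f"Job with submission_id {submission_id} already exists.")
```

With this shape, the loser of the race gets a clear error at submission time rather than a later failure about duplicate actor names.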

* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message.
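
The retry behavior can be sketched as below. The real change lives in C++ (`redis_client.cc`), so this Python version is only an illustration under assumptions: `get_next_job_id`, the `send_incr` callable, and the `max_attempts` and `delay_s` parameters are hypothetical, while the reply type codes follow hiredis (3 for integer, 6 for error).

```python
import logging
import time
from typing import Callable, Tuple

logger = logging.getLogger(__name__)

# Reply type codes as defined by hiredis; 6 is the value quoted in the check failure.
REDIS_REPLY_INTEGER = 3
REDIS_REPLY_ERROR = 6


def get_next_job_id(
    send_incr: Callable[[], Tuple[int, object]],
    max_attempts: int = 5,
    delay_s: float = 0.0,
) -> int:
    """Call send_incr until it yields an integer reply, retrying on error replies."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        reply_type, payload = send_incr()
        if reply_type == REDIS_REPLY_INTEGER:
            return int(payload)
        if reply_type == REDIS_REPLY_ERROR:
            # Log the error message instead of failing a hard check, then retry.
            last_error = payload
            logger.warning("Attempt %d got Redis error reply: %s", attempt, payload)
            time.sleep(delay_s)
            continue
        raise RuntimeError(f"Expected integer, found Redis type {reply_type}")
    raise RuntimeError(f"Giving up after {max_attempts} attempts: {last_error}")
```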

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)

Additionally fix `test_usage_test.py`.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* Deprecate RuntimeContext.get (#33734)

RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [Serve] Fix the serve.batch api doc (#33588)

Fix the example formatting in the serve batch API doc

* [infra] increase Build timeout (#33756)

Why are these changes needed?
Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g.:

https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2

* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* [Test] Fix the failing workflow test_dataset after streaming executor is enabled.  (#33736)

It looks like the workflow starts a 1-CPU cluster, and it has its own remote task that uses 1 CPU, which blocks scheduling of dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I think that doesn't really make sense (since the user probably just writes a regular function inside).

I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is mysterious why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).

* [Test] Fix out of disk error (#33732)

Sometimes there is more than one OOD event if the test runs for more than 10 seconds. I relaxed the assert condition in that case.

* [Data] Repurpose streaming CI to bulk CI (#33478)

Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).

* [Serve] Enable serve metrics lib working in ray actor (#33717)

Make sure ray.serve.lib works with ray.actor without a Serve context.
```python
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics


@ray.remote
class MyActor:
    def __init__(self):
        # Counter created inside a plain Ray actor, outside any Serve replica.
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        # Each request calls into the actor, which increments its counter.
        await self.my_actor.test.remote()
        return
```

* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)

Signed-off-by: sven1977 <svenmika1977@gmail.com>

* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)

If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Change ray to 2.3.1 to work around the #ir-glorious-shape

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Revert to normal ray image

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix delete_fn

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)

Add the ability for AnyscaleJobRunner to run on a GCE host. The added logic:
 - Read from an ENV variable, or the storage link, to determine whether this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
 - Add some sample tests that use GCE as an environment so we can run CI and check that this diff works.

- [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825

* Run lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Fix some tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add unit tests for test definition parser

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix lints

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Check that parse_test_definition throws exception on empty variations

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove the constant test definition in test.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* The cluster environment name does not allow the character '.', so fix that.

Address Lonnie's comments and add more tests.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
can-anyscale added a commit that referenced this issue Apr 3, 2023
* Fix the 'Observed wheel commit () is not expected' issue (#32156) that has been showing up in many of the CI/CD builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit that by default is measured in kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so the content we are looking for might not be there. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.

You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added 2 years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add new lines to some files

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support test definition with multiple flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Use not in to check key in dict

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 2

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 03

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove temporary logs

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Update flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Only initialize gs client on gs host

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Update image for Sematic integration (#33469)

* [RLlib] fix preprocessor test (#33719)

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)

This makes Java logging consistent with PR #31772, which appears to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.

Co-authored-by: Qing Wang <kingchin1218@126.com>

* [serve] Fix serve HA test (#33699)

* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)

<img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png">
This reverts commit cb5bb0e.

* [tune] add data to CI test dependencies (#33729)

1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement for `":octopus: Tune tests and examples (medium)"`.
2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).
3. #33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).

**Note:** There should probably be a better way for handling dependencies in CI tests...

* [Test] Fix test event test timeout (#33704)

* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)

This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.

* [runtime env] Close schema after loading and continue on error (#33535)

This PR fixes a few things:

* A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
* Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
    * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.

**Steps to Reproduce**
1. Save this script as `test.py`
```python
import ray

@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True

ray.init()
print(ray.get(my_fn.remote()))
```
2. Run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3. Alternatively, test with invalid JSON:
    a. Save `:` or other invalid JSON as `bad-json.json`
    b. Run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

This PR fixes the issue and adds a new test case.
Signed-off-by: James Clark <james.clark@zapatacomputing.com>

* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)

In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata).

Previously, when submitting a job, we (1) checked whether the info for the job already existed in the internal KV, and then (2) put the new info and job ID into the internal KV.

This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.

This PR fixes the race condition by making operations (1) and (2) happen atomically (this is already supported by internal_kv_put(... overwrite=False) -> int, which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value).

Also adds a unit test which fails without this change.

* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)

Additionally fix `test_usage_test.py`.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* Deprecate RuntimeContext.get (#33734)

RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [Serve] Fix the serve.batch api doc (#33588)

Fix the example formatting in the serve batch API doc

* [infra] increase Build timeout (#33756)

Why are these changes needed?
Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g.:

https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2

* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* [Test] Fix the failing workflow test_dataset after streaming executor is enabled.  (#33736)

It looks like the workflow starts a 1-CPU cluster, and it has its own remote task that uses 1 CPU, which blocks scheduling of dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I think that doesn't really make sense (since the user probably just writes a regular function inside).

I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is mysterious why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).

* [Test] Fix out of disk error (#33732)

Sometimes there is more than one OOD event if the test runs for more than 10 seconds. I relaxed the assert condition in that case.

* [Data] Repurpose streaming CI to bulk CI (#33478)

Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).

* [Serve] Enable serve metrics lib working in ray actor (#33717)

Make sure ray.serve.lib works with ray.actor without a Serve context.
```python
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics


@ray.remote
class MyActor:
    def __init__(self):
        # Counter created inside a plain Ray actor, outside any Serve replica.
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        # Each request calls into the actor, which increments its counter.
        await self.my_actor.test.remote()
        return
```

* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)

Signed-off-by: sven1977 <svenmika1977@gmail.com>

* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)

If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Change ray to 2.3.1 to work around the #ir-glorious-shape

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Revert to normal ray image

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix delete_fn

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)

Add the ability for AnyscaleJobRunner to run on a GCE host. The added logic:
 - Read from an ENV variable, or the storage link, to determine whether this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
 - Add some sample tests that use GCE as an environment so we can run CI and check that this diff works.

- [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825

* Run lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Fix some tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add unit tests for test definition parser

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix lints

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Check that parse_test_definition throws exception on empty variations

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove the constant test definition in test.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
can-anyscale added a commit that referenced this issue Apr 3, 2023
* Fix the 'Observed wheel commit () is not expected' issue (#32156) that has been creeping into many of the CI/CD builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit that is by default on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we look for might be missing from the output. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.

Note that other surprises might still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added 2 years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation of how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.
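
The idea of asking for one exact file instead of streaming the whole archive can be sketched in Python with the standard `zipfile` module. This is only an illustration, not the CI script itself: the archive layout, the member name `pkg/__init__.py`, and the helper names are assumptions made for the example.

```python
import io
import re
import zipfile


def build_fake_wheel(commit: str) -> bytes:
    """Build a tiny in-memory zip that stands in for a wheel (hypothetical layout)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # A large unrelated member, so the commit line is a tiny part of the archive.
        zf.writestr("pkg/big_blob.bin", b"\0" * 100_000)
        zf.writestr("pkg/__init__.py", f'__commit__ = "{commit}"\n')
    return buf.getvalue()


def read_commit(wheel_bytes: bytes, member: str = "pkg/__init__.py") -> str:
    """Read only the named member and return its __commit__ value ("" if no such line)."""
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as zf:
        text = zf.read(member).decode("utf-8")
    match = re.search(r'^__commit__\s*=\s*"([^"]*)"', text, flags=re.MULTILINE)
    return match.group(1) if match else ""


wheel = build_fake_wheel("63783abc")
observed = read_commit(wheel)
print(f"observed commit: {observed}")
```

Only the small member is decompressed and searched, so the size of the rest of the archive does not matter for the check.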

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add new lines to some files

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support test definition with multiple flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Use not in to check key in dict

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 2

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 03

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove temporary logs

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Update flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Only initialize gs client on gs host

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Update image for Sematic integration (#33469)

* [RLlib] fix preprocessor test (#33719)

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)

This makes Java logging consistent with PR #31772, which appears to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.

Co-authored-by: Qing Wang <kingchin1218@126.com>

* [serve] Fix serve HA test (#33699)

* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)

This reverts commit cb5bb0e.

* [tune] add data to CI test dependencies (#33729)

1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement for `":octopus: Tune tests and examples (medium)"`.
2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).
3. #33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).

**Note:** There should probably be a better way for handling dependencies in CI tests...

* [Test] Fix test event test timeout (#33704)

* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)

This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.

* [runtime env] Close schema after loading and continue on error (#33535)

This PR fixes a few things:

* A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
* Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
    * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.
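
The two fixes listed above (closing the file and continuing past a bad schema) can be sketched roughly as follows. This is a minimal illustration under assumptions, not Ray's actual loader: `load_schemas` and the file names are hypothetical.

```python
import json
import os
import tempfile


def load_schemas(paths):
    """Load JSON schemas, skipping files that are missing or not valid JSON."""
    schemas = []
    for path in paths:
        try:
            # The context manager closes the file even if json.load() raises.
            with open(path, encoding="utf-8") as f:
                schemas.append(json.load(f))
        except (OSError, json.JSONDecodeError) as exc:
            # Report the problem and move on to the next schema instead of failing.
            print(f"Skipping schema {path!r}: {exc}")
            continue
    return schemas


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.json")
    bad = os.path.join(tmp, "bad-json.json")
    with open(good, "w", encoding="utf-8") as f:
        json.dump({"title": "demo"}, f)
    with open(bad, "w", encoding="utf-8") as f:
        f.write(":")  # invalid JSON
    # The missing and invalid files come first, so skipping them is actually exercised.
    loaded = load_schemas([os.path.join(tmp, "non-exist.json"), bad, good])
```

Placing the bad inputs before the good one mirrors the ordering problem described above, where a failure only shows up if no earlier schema has been processed.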

**Steps to Reproduce**
1. Save this script as `test.py`
```python
import ray

@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True

ray.init()
print(ray.get(my_fn.remote()))
```
2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3.
    a. save `:` or other invalid JSON as `bad-json.json`
    b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

This PR fixes the issue and adds a new test case.
Signed-off-by: James Clark <james.clark@zapatacomputing.com>

* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)

In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata).

Previously, when submitting a job, we would (1) check if the info for the job already exists in the internal KV, and then (2) put the new info and job ID into the internal KV.

This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.

This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value).

Also adds a unit test which fails without this change.
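
A minimal sketch of the put-if-absent pattern described above, using a hypothetical in-memory store rather than Ray's internal KV. `InMemoryKV`, `submit_job`, and the `"job-1"` ID are assumptions for illustration; only the `overwrite=False` flag and the "number of keys newly added" return value come from the description.

```python
import threading


class InMemoryKV:
    """Hypothetical stand-in for a KV store with an atomic put-if-absent."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value, overwrite=True) -> int:
        """Store value; return the number of keys newly added (0 or 1)."""
        with self._lock:
            is_new = key not in self._data
            if is_new or overwrite:
                self._data[key] = value
            return 1 if is_new else 0


def submit_job(kv, submission_id, info) -> bool:
    """Check and put in one step; True only for the caller that claimed the ID."""
    return kv.put(submission_id, info, overwrite=False) == 1


kv = InMemoryKV()
results = []
threads = [
    threading.Thread(
        target=lambda i=i: results.append(submit_job(kv, "job-1", {"n": i}))
    )
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"accepted submissions: {results.count(True)} of {len(results)}")
```

Because the existence check and the write happen under one lock, exactly one of the concurrent submissions wins and the rest can be rejected with a clear error.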

* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message.
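
The retry behavior can be sketched in Python as below. This is an illustration only, not the C++ change itself: `FlakyCounter`, `Reply`, the retry limit, and the simulated error text are all made up for the example, and only the value 6 for the error reply type is taken from the check failure above.

```python
import time
from dataclasses import dataclass

REPLY_INTEGER = 3  # illustrative value for an integer reply
REPLY_ERROR = 6    # matches "Redis type 6" in the check failure above


@dataclass
class Reply:
    type: int
    value: object


class FlakyCounter:
    """Hypothetical client: returns an error a fixed number of times, then increments."""

    def __init__(self, failures):
        self.failures = failures
        self.counter = 0

    def incr(self):
        if self.failures > 0:
            self.failures -= 1
            return Reply(REPLY_ERROR, "simulated transient error")
        self.counter += 1
        return Reply(REPLY_INTEGER, self.counter)


def get_next_job_id(client, max_attempts=5, delay_s=0.0):
    """Retry on error replies, logging the message; fail on any other unexpected type."""
    for attempt in range(1, max_attempts + 1):
        reply = client.incr()
        if reply.type == REPLY_INTEGER:
            return reply.value
        if reply.type == REPLY_ERROR:
            print(f"attempt {attempt}: redis error: {reply.value}")
            time.sleep(delay_s)
            continue
        raise TypeError(f"unexpected reply type {reply.type}")
    raise RuntimeError(f"no integer reply after {max_attempts} attempts")


job_id = get_next_job_id(FlakyCounter(failures=2))
```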

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)

Additionally fix `test_usage_test.py`.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* Deprecate RuntimeContext.get (#33734)

RuntimeContext.get exposes Cython IDs instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [Serve] Fix the serve.batch api doc (#33588)

Fix the example formatting in the serve batch API doc

* [infra] increase Build timeout (#33756)

Why are these changes needed?
Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g.:

https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2

* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* [Test] Fix the failing workflow test_dataset after streaming executor is enabled.  (#33736)

It looks like the workflow starts a 1-CPU cluster and has its own remote task that uses 1 CPU, which blocks scheduling of dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I think that doesn't really make sense (since the user probably just writes a regular function inside).

I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is a mystery why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).

* [Test] Fix out of disk error (#33732)

Sometimes there is more than 1 OOD event if the test runs for more than 10 seconds. I relaxed the assert condition for that case.

* [Data] Repurpose streaming CI to bulk CI(#33478)

Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).

* [Serve] Enable serve metrics lib working in ray actor (#33717)

Make sure the Serve metrics library works with a Ray actor that has no Serve context.
```python
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics


@ray.remote
class MyActor:
    def __init__(self):
        # Counter created inside a plain Ray actor, outside any Serve replica.
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        # Each request increments the counter held by the actor.
        await self.my_actor.test.remote()
        return
```

* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)

Signed-off-by: sven1977 <svenmika1977@gmail.com>

* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)

If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Change ray to 2.3.1 to work around the #ir-glorious-shape

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Revert to normal ray image

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix delete_fn

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)

Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
 - Read from an ENV variable, or the storage link, to determine whether this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
 - Add some sample tests that use GCE as an environment so we can run CI and check that this diff works.

- [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825

* Run lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Fix some tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add unit tests for test definition parser

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix lints

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Check that parse_test_definition throws exception on empty variations

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove the constant test definition in test.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* The cluster environment name does not allow the character '.', so fix that.

Address Lonnie's comments and add more tests.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
can-anyscale added a commit that referenced this issue Apr 5, 2023
* Fix the 'Observed wheel commit () is not expected' issue (#32156) that has been creeping into many of the CI/CD builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit that is by default on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we look for might be missing from the output. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.

Note that other surprises might still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added 2 years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation of how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add new lines to some files

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support test definition with multiple flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Use not in to check key in dict

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 2

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 03

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove temporary logs

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Update flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Only initialize gs client on gs host

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Update image for Sematic integration (#33469)

* [RLlib] fix preprocessor test (#33719)

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)

This makes Java logging consistent with PR #31772, which appears to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.

Co-authored-by: Qing Wang <kingchin1218@126.com>

* [serve] Fix serve HA test (#33699)

* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)

This reverts commit cb5bb0e.

* [tune] add data to CI test dependencies (#33729)

1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement for `":octopus: Tune tests and examples (medium)"`.
2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).
3. #33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).

**Note:** There should probably be a better way for handling dependencies in CI tests...

* [Test] Fix test event test timeout (#33704)

* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)

This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.

* [runtime env] Close schema after loading and continue on error (#33535)

This PR fixes a few things:

* A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
* Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
    * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.

**Steps to Reproduce**
1. Save this script as `test.py`
```python
import ray

@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True

ray.init()
print(ray.get(my_fn.remote()))
```
2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3.
    a. save `:` or other invalid JSON as `bad-json.json`
    b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

This PR fixes the issue and adds a new test case.
Signed-off-by: James Clark <james.clark@zapatacomputing.com>

* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)

In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata).

Previously, when submitting a job, we would (1) check if the info for the job already exists in the internal KV, and then (2) put the new info and job ID into the internal KV.

This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.

This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value).

Also adds a unit test which fails without this change.

* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)

Additionally fix `test_usage_test.py`.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* Deprecate RuntimeContext.get (#33734)

RuntimeContext.get exposes Cython IDs instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [Serve] Fix the serve.batch api doc (#33588)

Fix the example formatting in the serve batch API doc

* [infra] increase Build timeout (#33756)

Why are these changes needed?
Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g.:

https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2

* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* [Test] Fix the failing workflow test_dataset after streaming executor is enabled.  (#33736)

It looks like the workflow starts a 1-CPU cluster and has its own remote task that uses 1 CPU, which blocks scheduling of dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I think that doesn't really make sense (since the user probably just writes a regular function inside).

I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is a mystery why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).

* [Test] Fix out of disk error (#33732)

Sometimes there is more than 1 OOD event if the test runs for more than 10 seconds. I relaxed the assert condition for that case.

* [Data] Repurpose streaming CI to bulk CI(#33478)

Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).

* [Serve] Enable serve metrics lib working in ray actor (#33717)

Make sure the Serve metrics library works with a Ray actor that has no Serve context.
```python
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics


@ray.remote
class MyActor:
    def __init__(self):
        # Counter created inside a plain Ray actor, outside any Serve replica.
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        # Each request increments the counter held by the actor.
        await self.my_actor.test.remote()
        return
```

* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)

Signed-off-by: sven1977 <svenmika1977@gmail.com>

* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)

If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Change ray to 2.3.1 to work around the #ir-glorious-shape

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Revert to normal ray image

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix delete_fn

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)

Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
 - Read from an ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
 - Add some sample tests that use GCE as an environment so we can run CI and check that this diff works.
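
The host detection described above can be sketched roughly as follows. This is only an illustration: the environment variable name `RELEASE_TEST_USE_GCE` and the helper `is_gce_host` are hypothetical and not taken from the PR; only the `gs://` storage-link prefix comes from the commit titles.

```python
import os
from typing import Mapping, Optional

# Hypothetical variable name, used only for this sketch.
GCE_ENV_VAR = "RELEASE_TEST_USE_GCE"


def is_gce_host(
    storage_link: Optional[str], env: Mapping[str, str] = os.environ
) -> bool:
    """Return True if either the env variable or the storage link points at GCE."""
    # An explicit opt-in through the environment wins.
    if env.get(GCE_ENV_VAR, "") == "1":
        return True
    # Otherwise fall back to the storage link scheme.
    return bool(storage_link) and storage_link.startswith("gs://")


on_gcs = is_gce_host("gs://some-bucket/results", env={})
on_s3 = is_gce_host("s3://some-bucket/results", env={})
forced = is_gce_host(None, env={GCE_ENV_VAR: "1"})
```

A runner could branch on this result once and pick the matching read, write, and delete helpers.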

- [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825

* Run lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Fix some tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add unit tests for test definition parser

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix lints

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Check that parse_test_definition throws exception on empty variations

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove the constant test definition in test.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
can-anyscale added a commit that referenced this issue Apr 5, 2023
* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of ci/cd builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit, which by default is on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we look for might not exist. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.
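
As an illustration of the idea (not the actual CI script, which uses `unzip` in shell), here is a minimal Python sketch that reads a single named member from an archive instead of streaming the whole archive through a pipe. The member name `pkg/__init__.py`, the commit value, and the helper names are made up for the example.

```python
import io
import zipfile


def read_member(archive_bytes: bytes, member: str) -> str:
    """Return the text of one named member of a zip archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        # ZipFile.read() locates the member through the central directory
        # and decompresses only that entry.
        return zf.read(member).decode("utf-8")


def build_demo_archive() -> bytes:
    """Build an in-memory archive: one large filler file plus a small metadata file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("pkg/big.bin", b"\0" * 1_000_000)
        zf.writestr("pkg/__init__.py", '__commit__ = "abc123"\n')
    return buf.getvalue()


archive = build_demo_archive()
text = read_member(archive, "pkg/__init__.py")
# Keep only the lines that match ^__commit__, as the sanity check does.
commit_lines = [line for line in text.splitlines() if line.startswith("__commit__")]
```

Because only the small file is extracted, the amount of data handed to the matching step stays tiny regardless of how large the archive is.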

You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added 2 years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation of how Buildkite agent multi-tenancy is set up might or might not simplify this logic further.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add new lines to some files

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support test definition with multiple flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Use not in to check key in dict

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 2

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 03

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove temporary logs

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Update flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Only initialize gs client on gs host

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Update image for Sematic integration (#33469)

* [RLlib] fix preprocessor test (#33719)

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)

This makes Java logging consistent with PR #31772, which appears to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.

Co-authored-by: Qing Wang <kingchin1218@126.com>

* [serve] Fix serve HA test (#33699)

* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)

<img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png">
This reverts commit cb5bb0e.


* [tune] add data to CI test dependencies (#33729)

1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement for `":octopus: Tune tests and examples (medium)"`.
2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done before (1) was merged.
3. #33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was also done before (1) was merged.

**Note:** There should probably be a better way for handling dependencies in CI tests...

* [Test] Fix test event test timeout (#33704)

* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)

This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.

* [runtime env] Close schema after loading and continue on error (#33535)

This PR fixes a few things:

* A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
* Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
    * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.
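
The two fixes listed above can be sketched as follows. This is a simplified stand-in, not Ray's actual loader: `load_schemas` and the file names are hypothetical. It shows a `with` block that closes each file and a `continue` that skips a bad or missing schema instead of failing.

```python
import json
import logging
import os
import tempfile
from typing import Dict, List

logger = logging.getLogger(__name__)


def load_schemas(paths: List[str]) -> Dict[str, dict]:
    """Load every readable JSON schema, skipping missing or invalid files."""
    schemas: Dict[str, dict] = {}
    for path in paths:
        try:
            # The context manager closes the file even if decoding fails.
            with open(path) as f:
                schema = json.load(f)
        except (OSError, json.JSONDecodeError) as exc:
            logger.warning("Skipping schema %s: %s", path, exc)
            continue  # move on to the next schema instead of aborting
        schemas[path] = schema
    return schemas


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.json")
    bad = os.path.join(tmp, "bad-json.json")
    missing = os.path.join(tmp, "non-exist.json")
    with open(good, "w") as f:
        json.dump({"type": "object"}, f)
    with open(bad, "w") as f:
        f.write(":")
    # The bad and missing files come first, so a missing `continue` would show up.
    loaded = load_schemas([bad, missing, good])
```

Putting the invalid file first in the list mirrors the ordering problem mentioned above, where the old test only passed because a valid schema had been loaded earlier.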

**Steps to Reproduce**
1. Save this script as `test.py`
```python
import ray

@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True

ray.init()
print(ray.get(my_fn.remote()))
```
2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3.
    a. save `:` or other invalid JSON as `bad-json.json`
    b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

This PR fixes the issue and adds a new test case.
Signed-off-by: James Clark <james.clark@zapatacomputing.com>

* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)

In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata).

Previously, when submitting a job, we would (1) check whether the info already exists in the internal KV, and then (2) put the new info and job ID into the internal KV.

This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.

This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value).
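
A minimal sketch of the put-if-absent pattern, assuming a toy in-memory store: `InMemoryKV` and `submit_job` are hypothetical stand-ins and only mimic the "returns the number of keys newly added" contract described above. They are not the real internal KV API.

```python
import threading
from typing import Dict, List


class InMemoryKV:
    """Toy KV store whose put() checks and writes under one lock."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._lock = threading.Lock()

    def put(self, key: str, value: str, overwrite: bool = True) -> int:
        """Store value; return the number of keys newly added (0 or 1)."""
        with self._lock:
            is_new = key not in self._data
            if is_new or overwrite:
                self._data[key] = value
            return 1 if is_new else 0


def submit_job(kv: InMemoryKV, submission_id: str, info: str) -> bool:
    """Return True only for the caller that actually claimed the id."""
    return kv.put(submission_id, info, overwrite=False) == 1


kv = InMemoryKV()
results: List[bool] = []


def worker(index: int) -> None:
    results.append(submit_job(kv, "job-1", f"info-{index}"))


threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the existence check and the write happen inside the same critical section, exactly one of the concurrent submissions sees a return value of 1, and the others can be rejected with a friendly error.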

Also adds a unit test which fails without this change.

* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message.
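
The retry behaviour can be sketched in Python as below. The real change is in C++ in `redis_client.cc`; `next_job_id`, `send_incr`, and the attempt limit here are hypothetical. The reply-type constants follow hiredis, where `REDIS_REPLY_INTEGER` is 3 and `REDIS_REPLY_ERROR` is 6.

```python
import time
from typing import Callable, Tuple, Union

REPLY_INTEGER = 3  # hiredis REDIS_REPLY_INTEGER
REPLY_ERROR = 6  # hiredis REDIS_REPLY_ERROR

Reply = Tuple[int, Union[int, str]]


def next_job_id(
    send_incr: Callable[[], Reply], max_attempts: int = 3, delay_s: float = 0.0
) -> int:
    """Ask for the next job id, retrying when the reply is an error reply."""
    last_error = ""
    for _ in range(max_attempts):
        reply_type, payload = send_incr()
        if reply_type == REPLY_INTEGER:
            return int(payload)
        if reply_type == REPLY_ERROR:
            # Remember the error message so it can be reported, then retry.
            last_error = str(payload)
            time.sleep(delay_s)
            continue
        raise RuntimeError(f"unexpected reply type {reply_type}")
    raise RuntimeError(f"giving up after {max_attempts} attempts: {last_error}")


# Fake client: the first reply is an error, the second is the integer counter.
replies = iter([(REPLY_ERROR, "LOADING"), (REPLY_INTEGER, 7)])
job_id = next_job_id(lambda: next(replies))
```

Bounding the number of attempts keeps a persistent error from looping forever while still absorbing a transient one.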

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)

Additionally fix `test_usage_test.py`.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* Deprecate RuntimeContext.get (#33734)

RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [Serve] Fix the serve.batch api doc (#33588)

Fix the example formatting in the serve batch API doc

* [infra] increase Build timeout (#33756)

Why are these changes needed?
Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g.:

https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2

* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* [Test] Fix the failing workflow test_dataset after streaming executor is enabled.  (#33736)

It looks like the workflow starts a 1-CPU cluster, and its own remote task uses that 1 CPU, which blocks scheduling of dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I don't think that really makes sense (since the user probably just writes a regular function inside).

I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is unclear why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).

* [Test] Fix out of disk error (#33732)

Sometimes there is more than one OOD event if the test runs for more than 10 seconds. I relaxed the assert condition in that case.

* [Data] Repurpose streaming CI to bulk CI(#33478)

Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).

* [Serve] Enable serve metrics lib working in ray actor (#33717)

Make sure ray.serve.lib works with ray.actor without a Serve context.
```
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics


@ray.remote
class MyActor:
    def __init__(self):
        # Counter created inside a plain Ray actor, outside any Serve context.
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        # Each request increments the counter that lives in the actor.
        await self.my_actor.test.remote()
        return
```

* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)

Signed-off-by: sven1977 <svenmika1977@gmail.com>

* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)

If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Change ray to 2.3.1 to work around the #ir-glorious-shape

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Revert to normal ray image

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix delete_fn

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)

Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
 - Read from an ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
 - Add some sample tests that use GCE as an environment so we can run CI and check that this diff works.

- [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825

* Run lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Fix some tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add unit tests for test definition parser

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix lints

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Check that parse_test_definition throws exception on empty variations

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove the constant test definition in test.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* The cluster environment name does not allow the character '.', so fix that.

Address Lonnie's comments and add more tests.
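
A minimal sketch of this kind of name fix, assuming the dots are simply replaced: the helper name `sanitize_cluster_env_name` and the choice of `-` as the replacement character are assumptions for illustration, not details taken from the commit.

```python
def sanitize_cluster_env_name(name: str) -> str:
    """Replace '.' (not allowed in cluster environment names) with '-'."""
    # The replacement character is an assumption made for this sketch.
    return name.replace(".", "-")


example = sanitize_cluster_env_name("ray-2.3.1-py38")
```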

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
can-anyscale added a commit that referenced this issue Apr 5, 2023
* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of ci/cd builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit, which by default is on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we look for might not exist. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.

You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added 2 years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation of how Buildkite agent multi-tenancy is set up might or might not simplify this logic further.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add new lines to some files

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support test definition with multiple flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Use not in to check key in dict

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 2

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 03

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove temporary logs

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Update flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Only initialize gs client on gs host

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Update image for Sematic integration (#33469)

* [RLlib] fix preprocessor test (#33719)

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)

This makes Java logging consistent with PR #31772, which appears to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.

Co-authored-by: Qing Wang <kingchin1218@126.com>

* [serve] Fix serve HA test (#33699)

* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)

<img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png">
This reverts commit cb5bb0e.


* [tune] add data to CI test dependencies (#33729)

1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement for `":octopus: Tune tests and examples (medium)"`.
2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done before (1) was merged.
3. #33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was also done before (1) was merged.

**Note:** There should probably be a better way for handling dependencies in CI tests...

* [Test] Fix test event test timeout (#33704)

* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)

This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.

* [runtime env] Close schema after loading and continue on error (#33535)

This PR fixes a few things:

* A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
* Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
    * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.

**Steps to Reproduce**
1. Save this script as `test.py`
```python
import ray

@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True

ray.init()
print(ray.get(my_fn.remote()))
```
2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3.
    a. save `:` or other invalid JSON as `bad-json.json`
    b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

This PR fixes the issue and adds a new test case.
Signed-off-by: James Clark <james.clark@zapatacomputing.com>

* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)

In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata).

Previously, when submitting a job, we would (1) check whether the info already exists in the internal KV, and then (2) put the new info and job ID into the internal KV.

This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.

This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value).

Also adds a unit test which fails without this change.

* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)

Additionally fix `test_usage_test.py`.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* Deprecate RuntimeContext.get (#33734)

RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [Serve] Fix the serve.batch api doc (#33588)

Fix the example formatting in the serve batch API doc

* [infra] increase Build timeout (#33756)

Why are these changes needed?
Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g.:

https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2

* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* [Test] Fix the failing workflow test_dataset after streaming executor is enabled.  (#33736)

It looks like the workflow starts a 1-CPU cluster, and its own remote task uses that 1 CPU, which blocks scheduling of dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I don't think that really makes sense (since the user probably just writes a regular function inside).

I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is unclear why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).

* [Test] Fix out of disk error (#33732)

Sometimes there is more than one OOD event if the test runs for more than 10 seconds. I relaxed the assert condition in that case.

* [Data] Repurpose streaming CI to bulk CI(#33478)

Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).

* [Serve] Enable serve metrics lib working in ray actor (#33717)

Make sure the Serve metrics library works inside a Ray actor without a Serve context.
```
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics


@ray.remote
class MyActor:
    def __init__(self):
        # The counter is created inside a plain Ray actor, outside any Serve context.
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        await self.my_actor.test.remote()
        return
```

* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)

Signed-off-by: sven1977 <svenmika1977@gmail.com>

* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)

If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Change ray to 2.3.1 to work around the #ir-glorious-shape

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Revert to normal ray image

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix delete_fn

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)

Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
 - Read from an ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
 - Add some sample tests to use gce as an environment so we can run a CI and check that this diff works
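
A minimal sketch of the host detection described above, assuming a hypothetical environment variable name (`RUN_ON_GCE`) and treating a `gs://` storage link as the signal for GCE. Neither the variable name nor the function name comes from the PR; they are illustrative only.

```python
import os
from typing import Mapping, Optional

# Hypothetical variable name; the PR does not state which ENV variable is read.
GCE_ENV_VAR = "RUN_ON_GCE"


def is_gce_host(storage_url: Optional[str], env: Mapping[str, str] = os.environ) -> bool:
    """Return True if the env flag is set or the storage link points at GCS."""
    if env.get(GCE_ENV_VAR, "").lower() in ("1", "true"):
        return True
    # Fall back to the storage link: gs:// is taken to mean a GCE-backed run.
    return bool(storage_url) and storage_url.startswith("gs://")
```

Passing the environment as a parameter keeps the check easy to unit test without mutating `os.environ`.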

- [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825

* Run lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)

* Set up dependencies and credentials for GCE in buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Fix some tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add unit tests for test definition parser

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix lints

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Check that parse_test_definition throws exception on empty variations

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove the constant test definition in test.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
can-anyscale added a commit that referenced this issue Apr 5, 2023
* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of the CI/CD builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit, which by default is on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we are looking for might be missing. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.

You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added two years ago by our veteran Kai (234b015) to catch stale artifacts from previous builds or race conditions between builds. Further investigation of how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.
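
The single-file read described above can be sketched in Python with the standard `zipfile` module. This is not the CI script itself (which calls unzip from the shell); the member path `ray/__init__.py`, the `__commit__ = "..."` line format, and both function names are assumptions made for illustration.

```python
import re
import zipfile
from typing import Optional

# Assumed location and format of the commit marker inside the wheel.
COMMIT_FILE = "ray/__init__.py"
COMMIT_PATTERN = re.compile(r'^__commit__\s*=\s*"([^"]*)"', re.MULTILINE)


def read_wheel_commit(wheel_path: str, member: str = COMMIT_FILE) -> Optional[str]:
    """Return the commit recorded in one archive member, or None if absent."""
    with zipfile.ZipFile(wheel_path) as wheel:
        try:
            # Read only the requested member; the rest of the archive is untouched.
            text = wheel.read(member).decode("utf-8")
        except KeyError:
            return None
    match = COMMIT_PATTERN.search(text)
    return match.group(1) if match else None


def check_wheel_commit(wheel_path: str, expected: str) -> None:
    """Raise if the wheel was not built from the expected commit."""
    observed = read_wheel_commit(wheel_path)
    if observed != expected:
        raise RuntimeError(
            f"Observed wheel commit ({observed or ''}) "
            f"is not expected commit ({expected})."
        )
```

Because only one small member is decompressed, the amount of data handled no longer depends on the size of the wheel.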

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Set up dependencies and credentials for GCE in buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add new lines to some files

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support test definition with multiple flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Use not in to check key in dict

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 2

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 03

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove temporary logs

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Update flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Only initialize gs client on gs host

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Update image for Sematic integration (#33469)

* [RLlib] fix preprocessor test (#33719)

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)

To make Java logging consistent with PR #31772 which seems for lazy worker binding. Otherwise, we may print too many logs from different drivers in shell console.

Co-authored-by: Qing Wang <kingchin1218@126.com>

* [serve] Fix serve HA test (#33699)

* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)

This reverts commit cb5bb0e.

* [tune] add data to CI test dependencies (#33729)

1. #33565 introduced  `DATA_PROCESSING_TESTING=1` as a requirement to  `:octopus: Tune tests and examples (medium)"`.
2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1).
3. #33499 introduced a `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1).

**Note:** There should probably be a better way for handling dependencies in CI tests...

* [Test] Fix test event test timeout (#33704)

* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)

This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.

* [runtime env] Close schema after loading and continue on error (#33535)

This PR fixes a few things:

* A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
* Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
    * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.

**Steps to Reproduce**
1. Save this script as `test.py`
```python
import ray

@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True

ray.init()
print(ray.get(my_fn.remote()))
```
2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3.
    a. save `:` or other invalid JSON as `bad-json.json`
    b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

This PR fixes the issue and adds a new test case.
Signed-off-by: James Clark <james.clark@zapatacomputing.com>
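
A minimal sketch of the two fixes described above (closing the file, and continuing past a bad schema), under the assumption that schemas are plain JSON files listed by path. The function name `load_schemas` is hypothetical and is not the loader in Ray.

```python
import json
import logging
from typing import Dict, List

logger = logging.getLogger(__name__)


def load_schemas(paths: List[str]) -> Dict[str, dict]:
    """Load every readable JSON schema, skipping missing or invalid files."""
    schemas: Dict[str, dict] = {}
    for path in paths:
        try:
            # The context manager closes the file even if json.load raises.
            with open(path) as f:
                schema = json.load(f)
        except (OSError, json.JSONDecodeError) as exc:
            logger.error("Skipping schema %s: %s", path, exc)
            # Without this continue, a bad file would fall through to the store below.
            continue
        schemas[path] = schema
    return schemas
```

Placing a bad file first in the list is what exposes a missing `continue`, which matches the ordering issue noted for the earlier test.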

* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)

In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata).

Previously, when submitting a job, we would (1) check whether the info for the job already exists in the internal KV, and then (2) put the new info and job ID into the internal KV.

This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.

This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value).

Also adds a unit test which fails without this change.
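
The put-if-absent idea above can be sketched with an in-memory stand-in for the KV store. `InMemoryKV` and `submit_job` are hypothetical names for illustration; only the `overwrite=False` flag and the "number of keys newly added" return value come from the description.

```python
import threading
from typing import Dict


class InMemoryKV:
    """Hypothetical stand-in for the internal KV store."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._lock = threading.Lock()

    def put(self, key: str, value: str, overwrite: bool = True) -> int:
        """Store value and return the number of keys newly added (0 or 1)."""
        with self._lock:
            is_new = key not in self._data
            if is_new or overwrite:
                self._data[key] = value
            return 1 if is_new else 0


def submit_job(kv: InMemoryKV, submission_id: str, info: str) -> bool:
    """Return True only for the caller that actually claimed submission_id."""
    # One put with overwrite=False replaces the racy "exists? then put" pair.
    return kv.put(submission_id, info, overwrite=False) == 1
```

With this shape, two concurrent submissions with the same id cannot both succeed, so the loser can be rejected with a clear error instead of failing later on a duplicate actor name.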

* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message.
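
The retry behavior can be illustrated with a short Python sketch. The actual change is in the C++ `RedisClient`; the function below, its `send_incr` callback, and the attempt limit are assumptions for illustration. The numeric reply types match the values quoted in the error message (3 for integer, 6 for error).

```python
import time
from typing import Callable, Tuple

REDIS_REPLY_INTEGER = 3
REDIS_REPLY_ERROR = 6


def get_next_job_id(
    send_incr: Callable[[], Tuple[int, object]],
    max_attempts: int = 5,
    delay_s: float = 0.0,
) -> int:
    """Call send_incr until it yields an integer reply, retrying on error replies."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        reply_type, payload = send_incr()
        if reply_type == REDIS_REPLY_INTEGER:
            return int(payload)
        if reply_type == REDIS_REPLY_ERROR:
            # Surface the server's message instead of failing on the type check.
            last_error = payload
            print(f"attempt {attempt}: Redis replied with error: {payload}")
            time.sleep(delay_s)
            continue
        raise TypeError(f"Expected integer, found Redis type {reply_type}")
    raise RuntimeError(f"Giving up after {max_attempts} attempts: {last_error}")
```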

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)

Additionally fix `test_usage_test.py`.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* Deprecate RuntimeContext.get (#33734)

RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [Serve] Fix the serve.batch api doc (#33588)

Fix the example formatting in the serve batch API doc

* [infra] increase Build timeout (#33756)

Why are these changes needed?
Release tests are failing because building the cluster env times out. The timeout is currently 30 minutes, but the build can take longer, e.g.:

https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2

* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* [Test] Fix the failing workflow test_dataset after streaming executor is enabled.  (#33736)

It looks like the workflow starts a 1-CPU cluster, and it has its own remote task that uses 1 CPU, which blocks scheduling of dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I think that doesn't really make sense (since the user probably just writes a regular function inside).

I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is a mystery why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).

* [Test] Fix out of disk error (#33732)

Sometimes there is more than one OOD event if the test runs for more than 10 seconds. I relaxed the assert condition for that case.

* [Data] Repurpose streaming CI to bulk CI(#33478)

Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).

* [Serve] Enable serve metrics lib working in ray actor (#33717)

Make sure the Serve metrics library works inside a Ray actor without a Serve context.
```
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics


@ray.remote
class MyActor:
    def __init__(self):
        # The counter is created inside a plain Ray actor, outside any Serve context.
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        await self.my_actor.test.remote()
        return
```

* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)

Signed-off-by: sven1977 <svenmika1977@gmail.com>

* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)

If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Change ray to 2.3.1 to work around the #ir-glorious-shape

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Revert to normal ray image

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix delete_fn

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)

Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
 - Read from an ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
 - Add some sample tests to use gce as an environment so we can run a CI and check that this diff works

- [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825

* Run lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)

* Set up dependencies and credentials for GCE in buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Fix some tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add unit tests for test definition parser

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix lints

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Check that parse_test_definition throws exception on empty variations

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove the constant test definition in test.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* The cluster environment name does not allow the character '.', so fix that.

Address Lonnie's comments and add more tests.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
can-anyscale added a commit that referenced this issue Apr 5, 2023
* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of the CI/CD builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit, which by default is on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we are looking for might be missing. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.

You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added two years ago by our veteran Kai (234b015) to catch stale artifacts from previous builds or race conditions between builds. Further investigation of how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Set up dependencies and credentials for GCE in buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add new lines to some files

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support test definition with multiple flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Use not in to check key in dict

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 2

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 03

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove temporary logs

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Update flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Only initialize gs client on gs host

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Update image for Sematic integration (#33469)

* [RLlib] fix preprocessor test (#33719)

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)

To make Java logging consistent with PR #31772 which seems for lazy worker binding. Otherwise, we may print too many logs from different drivers in shell console.

Co-authored-by: Qing Wang <kingchin1218@126.com>

* [serve] Fix serve HA test (#33699)

* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)

This reverts commit cb5bb0e.

* [tune] add data to CI test dependencies (#33729)

1. #33565 introduced  `DATA_PROCESSING_TESTING=1` as a requirement to  `:octopus: Tune tests and examples (medium)"`.
2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1).
3. #33499 introduced a `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1).

**Note:** There should probably be a better way for handling dependencies in CI tests...

* [Test] Fix test event test timeout (#33704)

* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)

This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.

* [runtime env] Close schema after loading and continue on error (#33535)

This PR fixes a few things:

* A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
* Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
    * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.

**Steps to Reproduce**
1. Save this script as `test.py`
```python
import ray

@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True

ray.init()
print(ray.get(my_fn.remote()))
```
2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3.
    a. save `:` or other invalid JSON as `bad-json.json`
    b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

This PR fixes the issue and adds a new test case.
Signed-off-by: James Clark <james.clark@zapatacomputing.com>

* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)

In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata).

Previously, when submitting a job, we would (1) check whether the info for the job already exists in the internal KV, and then (2) put the new info and job ID into the internal KV.

This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.

This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value).

Also adds a unit test which fails without this change.

* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)

Additionally fix `test_usage_test.py`.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* Deprecate RuntimeContext.get (#33734)

RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [Serve] Fix the serve.batch api doc (#33588)

Fix the example formatting in the serve batch API doc

* [infra] increase Build timeout (#33756)

Why are these changes needed?
Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g.:

https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2

* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* [Test] Fix the failing workflow test_dataset after streaming executor is enabled.  (#33736)

It looks like the workflow starts a 1-CPU cluster, and it has its own remote task that uses 1 CPU, which blocks scheduling of dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I think that doesn't really make sense (since the user probably just writes a regular function inside).

I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is a mystery why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).

* [Test] Fix out of disk error (#33732)

Sometimes there is more than one OOD event if the test runs for more than 10 seconds. I relaxed the assert condition for that case.

* [Data] Repurpose streaming CI to bulk CI(#33478)

Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).

* [Serve] Enable serve metrics lib working in ray actor (#33717)

Make sure ray.serve.lib works with ray.actor without a Serve context.
```
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics


@ray.remote
class MyActor:
    def __init__(self):
        # Counter created inside a plain Ray actor, with no Serve context.
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        # Each request increments the counter through the actor.
        await self.my_actor.test.remote()
        return
```

* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)

Signed-off-by: sven1977 <svenmika1977@gmail.com>

* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)

If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Change ray to 2.3.1 to work around the #ir-glorious-shape

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Revert to normal ray image

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix delete_fn

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)

Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
 - Read from an ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
 - Add some sample tests that use GCE as an environment so we can run CI and check that this diff works
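
A minimal sketch of the host detection described above, under stated assumptions: the environment variable name `EXAMPLE_RUN_ON_GCE` and the function `is_gce_host` are made up for illustration, and treating a `gs://` storage link as the GCE signal is an inference from the "Support for gs://" commit rather than the confirmed logic.

```python
import os

# Hypothetical variable name, used for illustration only.
GCE_ENV_VAR = "EXAMPLE_RUN_ON_GCE"


def is_gce_host(storage_link, env=None):
    """Guess whether the runner should use the GCE code path."""
    env = os.environ if env is None else env
    # An explicit opt-in through the environment wins.
    if env.get(GCE_ENV_VAR, "").lower() in ("1", "true"):
        return True
    # Otherwise fall back to the storage scheme (assumed: gs:// means GCE).
    return storage_link.startswith("gs://")
```

Passing `env` explicitly keeps the function easy to unit test without touching the real process environment.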

- [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825

* Run lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)

* Set up dependencies and credentials for GCE in buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Fix some tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add unit tests for test definition parser

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix lints

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Check that parse_test_definition throws exception on empty variations

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove the constant test definition in test.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
can-anyscale added a commit that referenced this issue Apr 5, 2023
* Fix the 'Observed wheel commit () is not expected' issue (#32156) that has been creeping into many of the CI/CD builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit, which by default is measured in KB (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we are looking for might not be there. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.
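
The idea of reading only the one small member instead of streaming the whole archive can be sketched in Python. The actual fix uses `unzip` in a shell script; the member path `pkg/__init__.py`, the helper names, and the fake wheel below are assumptions for illustration only.

```python
import io
import re
import zipfile


def read_commit(wheel, member):
    """Return the value of __commit__ from one archive member, or None."""
    with zipfile.ZipFile(wheel) as zf:
        # Only the named member is decompressed, not the whole archive.
        text = zf.read(member).decode("utf-8")
    match = re.search(r'^__commit__\s*=\s*"([^"]*)"', text, flags=re.MULTILINE)
    return match.group(1) if match else None


def make_fake_wheel(commit):
    """Build an in-memory zip with one large member and one small one."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("pkg/big_binary.so", b"\0" * 1_000_000)
        zf.writestr("pkg/__init__.py", f'__commit__ = "{commit}"\n')
    buf.seek(0)
    return buf
```

Because the lookup targets a single small file, the amount of data flowing through any downstream pipe no longer depends on the size of the wheel.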

You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added 2 years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Set up dependencies and credentials for GCE in buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add new lines to some files

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support test definition with multiple flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Use not in to check key in dict

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 2

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 03

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove temporary logs

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Update flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Only initialize gs client on gs host

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Update image for Sematic integration (#33469)

* [RLlib] fix preprocessor test (#33719)

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)

This makes Java logging consistent with PR #31772, which appears to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.

Co-authored-by: Qing Wang <kingchin1218@126.com>

* [serve] Fix serve HA test (#33699)

* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)

This reverts commit cb5bb0e.

* [tune] add data to CI test dependencies (#33729)

1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement for `":octopus: Tune tests and examples (medium)"`.
2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).
3. #33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).

**Note:** There should probably be a better way for handling dependencies in CI tests...

* [Test] Fix test event test timeout (#33704)

* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)

This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.

* [runtime env] Close schema after loading and continue on error (#33535)

This PR fixes a few things:

* A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
* Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
    * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.
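
The two fixes listed above (closing the file and continuing past a bad schema) follow a common pattern, sketched here. `load_schemas` is a hypothetical helper and not the function changed in the PR.

```python
import json
import logging

logger = logging.getLogger(__name__)


def load_schemas(paths):
    """Load every readable, valid JSON schema; skip and log the rest."""
    schemas = []
    for path in paths:
        try:
            # The context manager closes the file even if decoding fails,
            # which avoids the unclosed-file warning.
            with open(path, encoding="utf-8") as f:
                schema = json.load(f)
        except (OSError, json.JSONDecodeError) as exc:
            logger.warning("Skipping schema %s: %s", path, exc)
            # Without this `continue`, a bad file would abort the loop
            # or reuse a stale `schema` from a previous iteration.
            continue
        schemas.append(schema)
    return schemas
```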

**Steps to Reproduce**
1. Save this script as `test.py`
```python
import ray

@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True

ray.init()
print(ray.get(my_fn.remote()))
```
2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3.
    a. save `:` or other invalid JSON as `bad-json.json`
    b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

This PR fixes the issue and adds a new test case.
Signed-off-by: James Clark <james.clark@zapatacomputing.com>

* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)

In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata).

Previously, when submitting a job, we would (1) check whether the info for that job already exists in the internal KV, and then (2) put the new info and job ID into the internal KV.

This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.

This PR fixes the race condition by making operations (1) and (2) a single atomic operation. This is already supported by internal_kv_put(... overwrite=False) -> int, which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and check the return value.

Also adds a unit test which fails without this change.

* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries on REDIS_REPLY_ERROR (type 6) and also prints out the error message.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)

Additionally fix `test_usage_test.py`.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* Deprecate RuntimeContext.get (#33734)

RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [Serve] Fix the serve.batch api doc (#33588)

Fix the example formatting in the serve batch API doc

* [infra] increase Build timeout (#33756)

Why are these changes needed?
Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g.:

https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2

* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* [Test] Fix the failing workflow test_dataset after streaming executor is enabled.  (#33736)

It looks like the workflow starts a 1-CPU cluster, and it has its own remote task that uses 1 CPU, which blocks scheduling of dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I think that doesn't really make sense (since the user probably just writes a regular function inside).

I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is a mystery why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).

* [Test] Fix out of disk error (#33732)

Sometimes there is more than one OOD event if the test runs for more than 10 seconds. I relaxed the assert condition for that case.

* [Data] Repurpose streaming CI to bulk CI(#33478)

Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).

* [Serve] Enable serve metrics lib working in ray actor (#33717)

Make sure ray.serve.lib works with ray.actor without a Serve context.
```
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics


@ray.remote
class MyActor:
    def __init__(self):
        # Counter created inside a plain Ray actor, with no Serve context.
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        # Each request increments the counter through the actor.
        await self.my_actor.test.remote()
        return
```

* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)

Signed-off-by: sven1977 <svenmika1977@gmail.com>

* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)

If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Change ray to 2.3.1 to work around the #ir-glorious-shape

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Revert to normal ray image

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix delete_fn

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)

Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
 - Read from an ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
 - Add some sample tests that use GCE as an environment so we can run CI and check that this diff works

- [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825

* Run lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)

* Set up dependencies and credentials for GCE in buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Fix some tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add unit tests for test definition parser

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix lints

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Check that parse_test_definition throws exception on empty variations

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove the constant test definition in test.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* The cluster environment name does not allow the character '.', so fix that.

Address Lonnie's comments and add more tests.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
can-anyscale added a commit that referenced this issue Apr 5, 2023
* Fix the 'Observed wheel commit () is not expected' issue (#32156) that has been creeping into many of the CI/CD builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit, which by default is measured in KB (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we are looking for might not be there. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.

You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added 2 years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Set up dependencies and credentials for GCE in buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add new lines to some files

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support test definition with multiple flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Use not in to check key in dict

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 2

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 03

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove temporary logs

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Update flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Only initialize gs client on gs host

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Update image for Sematic integration (#33469)

* [RLlib] fix preprocessor test (#33719)

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)

This makes Java logging consistent with PR #31772, which appears to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.

Co-authored-by: Qing Wang <kingchin1218@126.com>

* [serve] Fix serve HA test (#33699)

* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)

This reverts commit cb5bb0e.

* [tune] add data to CI test dependencies (#33729)

1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement for `":octopus: Tune tests and examples (medium)"`.
2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).
3. #33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).

**Note:** There should probably be a better way for handling dependencies in CI tests...

* [Test] Fix test event test timeout (#33704)

* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)

This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.

* [runtime env] Close schema after loading and continue on error (#33535)

This PR fixes a few things:

* A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
* Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
    * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.

**Steps to Reproduce**
1. Save this script as `test.py`
```python
import ray

@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True

ray.init()
print(ray.get(my_fn.remote()))
```
2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3.
    a. save `:` or other invalid JSON as `bad-json.json`
    b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

This PR fixes the issue and adds a new test case.
Signed-off-by: James Clark <james.clark@zapatacomputing.com>

* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)

In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata).

Previously, when submitting a job, we would (1) check whether the info for that job already exists in the internal KV, and then (2) put the new info and job ID into the internal KV.

This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.

This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value).

Also adds a unit test which fails without this change.
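
The difference between check-then-put and a single put-if-absent can be sketched like this. `InMemoryKV` and `submit_job` are hypothetical stand-ins written for this example; only the `overwrite=False` flag and the "number of keys newly added" return value mirror what is described above.

```python
import threading
from typing import Dict


class InMemoryKV:
    """Toy stand-in for a KV store; not Ray's internal KV."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}
        self._lock = threading.Lock()

    def put(self, key: str, value: bytes, overwrite: bool = True) -> int:
        """Store value and return the number of keys newly added (0 or 1)."""
        with self._lock:
            is_new = key not in self._data
            if is_new or overwrite:
                self._data[key] = value
            return 1 if is_new else 0


def submit_job(kv: InMemoryKV, submission_id: str, info: bytes) -> bool:
    """Return True only for the caller that actually registered the job."""
    # The existence check and the write happen in one step, so two
    # concurrent callers cannot both observe "not present".
    return kv.put(submission_id, info, overwrite=False) == 1
```

A caller that gets `False` back can report a clear "job already exists" error instead of failing later on a duplicate actor name.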

* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)

Additionally fix `test_usage_test.py`.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* Deprecate RuntimeContext.get (#33734)

RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [Serve] Fix the serve.batch api doc (#33588)

Fix the example formatting in the serve batch API doc

* [infra] increase Build timeout (#33756)

Why are these changes needed?
Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g.:

https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2

* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* [Test] Fix the failing workflow test_dataset after streaming executor is enabled.  (#33736)

Looks like the workflow starts a 1-CPU cluster, and it has its own remote task that uses 1 CPU, which blocks scheduling of dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I think that doesn't really make sense (since the user probably just writes a regular function inside).

I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is mysterious why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).

* [Test] Fix out of disk error (#33732)

Sometimes there is more than one OOD event if the test runs for more than 10 seconds. I relaxed the assert condition in that case.

* [Data] Repurpose streaming CI to bulk CI(#33478)

Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).

* [Serve] Enable serve metrics lib working in ray actor (#33717)

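Make sure ray.serve.lib works with ray.actor without a Serve context.
```python
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics  # assumed import path for the Serve metrics library


@ray.remote
class MyActor:
    def __init__(self):
        # The counter is created inside a plain Ray actor, outside any Serve context.
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        # Each request increments the counter owned by the actor.
        await self.my_actor.test.remote()
        return
```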

* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)

Signed-off-by: sven1977 <svenmika1977@gmail.com>

* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)

If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Change ray to 2.3.1 to work around the #ir-glorious-shape

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Revert to normal ray image

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix delete_fn

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)

Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
 - Read from ENV variable, or the storage link, to see if this is a GCE host. If it is, has custom logic inside job file manager and runner. Both read, write and delete are supported.
 - Add some sample tests to use gce as an environment so we can run a CI and check that this diff works

- [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825

* Run lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Fix some tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add unit tests for test definition parser

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix lints

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Check that parse_test_definition throws exception on empty variations

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove the constant test definition in test.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
can-anyscale added a commit that referenced this issue Apr 5, 2023
* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of the CI/CD builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit, which by default is on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we look for might not exist. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.

You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added two years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.
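
The idea of reading only the one file that carries the commit hash, rather than streaming the whole archive, can be sketched in Python. This is an illustration under assumptions, not the CI script itself: the member path `ray/__init__.py`, the `__commit__ = "..."` line format, and the function name `read_wheel_commit` are all hypothetical choices for the example.

```python
import re
import zipfile
from typing import Optional

# Assumed format of the line we are looking for: __commit__ = "<hash>"
COMMIT_RE = re.compile(r'^__commit__\s*=\s*"([^"]*)"', re.MULTILINE)


def read_wheel_commit(wheel_path, member: str = "ray/__init__.py") -> Optional[str]:
    """Return the __commit__ value recorded in one member of a wheel, if any.

    Raises KeyError if the wheel has no such member.
    """
    with zipfile.ZipFile(wheel_path) as wheel:
        # Only the named member is decompressed, not the whole archive.
        source = wheel.read(member).decode("utf-8")
    match = COMMIT_RE.search(source)
    return match.group(1) if match else None
```

Because only one small member is read, the check no longer depends on how a large stream is consumed, and a missing member surfaces as an explicit error rather than an empty commit string.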

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add new lines to some files

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support test definition with multiple flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Use not in to check key in dict

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 2

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Debugging 03

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove temporary logs

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Update flavors

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Only initialize gs client on gs host

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Update image for Sematic integration (#33469)

* [RLlib] fix preprocessor test (#33719)

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)

This makes Java logging consistent with PR #31772, which appears to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.

Co-authored-by: Qing Wang <kingchin1218@126.com>

* [serve] Fix serve HA test (#33699)

* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)

<img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png">
This reverts commit cb5bb0e.

* [tune] add data to CI test dependencies (#33729)

1. #33565 introduced  `DATA_PROCESSING_TESTING=1` as a requirement to  `:octopus: Tune tests and examples (medium)"`.
2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1).
3. #33499 introduced a `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1).

**Note:** There should probably be a better way for handling dependencies in CI tests...

* [Test] Fix test event test timeout (#33704)

* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)

This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.

* [runtime env] Close schema after loading and continue on error (#33535)

This PR fixes a few things:

* A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
* Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
    * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.

**Steps to Reproduce**
1. Save this script as `test.py`
```python
import ray

@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True

ray.init()
print(ray.get(my_fn.remote()))
```
2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3.
    a. save `:` or other invalid JSON as `bad-json.json`
    b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

This PR fixes the issue and adds a new test case.
Signed-off-by: James Clark <james.clark@zapatacomputing.com>

* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)

In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata).

Previously, when submitting a job, we would (1) check whether the info for the job already exists in the internal KV, and then (2) put the new info and job ID into the internal KV.

This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.

This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value).

Also adds a unit test which fails without this change.

* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)

Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)

Additionally fix `test_usage_test.py`.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* Deprecate RuntimeContext.get (#33734)

RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [Serve] Fix the serve.batch api doc (#33588)

Fix the example formatting in the serve batch API doc

* [infra] increase Build timeout (#33756)

Why are these changes needed?
Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g.:

https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2

* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* [Test] Fix the failing workflow test_dataset after streaming executor is enabled.  (#33736)

Looks like the workflow starts a 1-CPU cluster, and it has its own remote task that uses 1 CPU, which blocks scheduling of dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I think that doesn't really make sense (since the user probably just writes a regular function inside).

I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is mysterious why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).

* [Test] Fix out of disk error (#33732)

Sometimes there is more than one OOD event if the test runs for more than 10 seconds. I relaxed the assert condition in that case.

* [Data] Repurpose streaming CI to bulk CI(#33478)

Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).

* [Serve] Enable serve metrics lib working in ray actor (#33717)

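Make sure ray.serve.lib works with ray.actor without a Serve context.
```python
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics  # assumed import path for the Serve metrics library


@ray.remote
class MyActor:
    def __init__(self):
        # The counter is created inside a plain Ray actor, outside any Serve context.
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        # Each request increments the counter owned by the actor.
        await self.my_actor.test.remote()
        return
```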

* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)

Signed-off-by: sven1977 <svenmika1977@gmail.com>

* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)

If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Change ray to 2.3.1 to work around the #ir-glorious-shape

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Revert to normal ray image

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix delete_fn

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)

Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
 - Read from ENV variable, or the storage link, to see if this is a GCE host. If it is, has custom logic inside job file manager and runner. Both read, write and delete are supported.
 - Add some sample tests to use gce as an environment so we can run a CI and check that this diff works

- [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825

* Run lint

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)

* Set up dependencies and credentials for GCE in Buildkite

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add google-cloud-storage package to requirements

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Support for gs:// in anyscale job runner

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Correct adding gce tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [RLlib] APPO TF with RLModule and Learner API (#33310)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)

It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linter

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Run linters

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* -s

* Fix some tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Add unit tests for test definition parser

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Fix lints

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* @aslonnie's comments

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Check that parse_test_definition throws exception on empty variations

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Remove the constant test definition in test.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* The cluster environment name does not allow the character '.', so fix that.

Address Lonnie's comments and add more tests.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
elliottower pushed a commit to elliottower/ray that referenced this issue Apr 22, 2023
Speed up the wheel commit validation check by 100x. This will also hopefully alleviate, if not eliminate, the 'Observed wheel commit () is not expected' issue (ray-project#32156) that has been creeping through many of the CI/CD builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit, which by default is on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we look for might not exist. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: elliottower <elliot@elliottower.com>
elliottower pushed a commit to elliottower/ray that referenced this issue Apr 22, 2023
* Fix 'Observed wheel commit () is not expected' issue (ray-project#32156) that has been creeping through many of the CI/CD builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit, which by default is on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we look for might not exist. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.

You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added two years ago by our veteran Kai (ray-project@234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* PR 31978 replaced result_output_json and metrics_output_json with fixed values, but did not update client_runner.
GCE tests using client_runner are failing with the following error because of that. This is a simple fix that reuses the fixed global values.

> [Errno 2] No such file or directory: '/tmp/tmpr33nmui3'
> Traceback (most recent call last):
>  File "/tmp/release-HS2M44AnMX/release/ray_release/command_runner/client_runner.py", line 122, in _fetch_json
>    with open(path, "rt") as fp:
> FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpr33nmui3'

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* remove tempfile import, not used

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: elliottower <elliot@elliottower.com>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this issue May 4, 2023
Speed up the wheel commit validation check by 100x. This will also hopefully alleviate, if not eliminate, the 'Observed wheel commit () is not expected' issue (ray-project#32156) that has been creeping through many of the CI/CD builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit, which by default is on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we look for might not exist. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Jack He <jackhe2345@gmail.com>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this issue May 4, 2023
* Fix 'Observed wheel commit () is not expected' issue (ray-project#32156) that has been creeping through many of the CI/CD builds in our pipeline.

The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit, which by default is on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we look for might not exist. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit.

You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added two years ago by our veteran Kai (ray-project@234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Improve wheel commit validation error message

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* PR 31978 replaced result_output_json and metrics_output_json with fixed values, but did not update client_runner.
GCE tests using client_runner are failing with the following error because of that. This is a simple fix that reuses the fixed global values.

> [Errno 2] No such file or directory: '/tmp/tmpr33nmui3'
> Traceback (most recent call last):
>  File "/tmp/release-HS2M44AnMX/release/ray_release/command_runner/client_runner.py", line 122, in _fetch_json
>    with open(path, "rt") as fp:
> FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpr33nmui3'

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* remove tempfile import, not used

Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Jack He <jackhe2345@gmail.com>