[air-output] Add new console output code path (behind feature flag) #33609

xwjiang2010 · 2023-03-23T00:42:15Z

Why are these changes needed?

Add a separate code path for the new AIR console output.

Most of the logic is contained in ray/tune/experimental/output.py.
There are two main interfaces: ProgressReporter (which does periodic reporting) and ResultCallback (which does event-based reporting).
The new code path is triggered by setting the AIR_VERBOSITY env var. Once this env var is set, the old Tune verbosity is treated as silent.
To de-risk the 2.4 timeline, AIR_VERBOSITY is supported only outside Ray Client and Jupyter notebooks for now. (To be fair, our design and product exploration and guidelines are mainly for terminal output anyway.)
This PR is more about setting up the right code structure to iterate on the design than about getting the styling/wording exactly right.
One missing piece of the current design/product exploration is how developers can define their experience beyond just toggling the AIR_VERBOSITY bit. This PR doesn't touch that part, but notes it as a further TODO:

TODO (post 2.4): think about the library user interface. RLlib users and Train users probably have different needs.
For example, how would they configure the console output experience?
Right now, we have ProgressReporter, which does periodic reporting, and the ResultCallback interface, which is event-based.
Ideally they should only need to configure this in a single place instead of across several classes.
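
To make the difference between the two reporting styles concrete, here is a minimal sketch. The class names `PeriodicReporter` and `EventReporter`, their methods, and the 5-second interval are hypothetical stand-ins for illustration only; they are not the actual ProgressReporter or ResultCallback interfaces in output.py.

```python
from typing import Dict, List


class PeriodicReporter:
    """Hypothetical periodic reporter: emits at most one line per interval."""

    def __init__(self, interval_s: float = 5.0):
        self.interval_s = interval_s
        self._last_report = float("-inf")  # ensures the first call always reports
        self.lines: List[str] = []

    def maybe_report(self, trial_states: Dict[str, str], now: float) -> bool:
        # Skip unless a full interval has elapsed since the last report.
        if now - self._last_report < self.interval_s:
            return False
        self._last_report = now
        self.lines.append(f"{len(trial_states)} trial(s): {sorted(trial_states.items())}")
        return True


class EventReporter:
    """Hypothetical event-based reporter: emits one line for every result."""

    def __init__(self) -> None:
        self.lines: List[str] = []

    def on_trial_result(self, trial_id: str, result: Dict[str, float]) -> None:
        self.lines.append(f"{trial_id} reported {result}")


periodic = PeriodicReporter(interval_s=5.0)
events = EventReporter()

# Simulated clock: three results arrive at t=0s, t=1s and t=6s.
for now, loss in [(0.0, 0.9), (1.0, 0.7), (6.0, 0.5)]:
    events.on_trial_result("trial_1", {"loss": loss})
    periodic.maybe_report({"trial_1": "RUNNING"}, now=now)
```

In this sketch the event-based reporter records every result, while the periodic one skips the result at t=1s because less than one interval has passed. A user who wants to change the output currently has to understand both styles, which is the configuration problem the TODO describes.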

One idea is to have a Factory/Builder class that users can configure and that creates the right reporter to plug in, depending on:

tune vs. train
Ray Client or not
Jupyter notebook or not
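
A rough sketch of what such a Builder could look like is below. Everything here (`Library`, `OutputContext`, `Reporter`, `ReporterBuilder`, and the reporter names) is hypothetical and not part of this PR. Falling back to the legacy output for Ray Client and notebooks is an assumption based on the note that AIR_VERBOSITY is terminal-only for now.

```python
from dataclasses import dataclass
from enum import Enum


class Library(Enum):
    TUNE = "tune"
    TRAIN = "train"


@dataclass(frozen=True)
class OutputContext:
    """Hypothetical description of where the output will be rendered."""

    library: Library
    is_ray_client: bool = False
    is_notebook: bool = False


@dataclass(frozen=True)
class Reporter:
    """Hypothetical reporter handle; a real one would carry rendering logic."""

    name: str
    verbosity: int


@dataclass
class ReporterBuilder:
    """Single place for users to configure the console output experience."""

    verbosity: int = 1

    def build(self, ctx: OutputContext) -> Reporter:
        # Assumption: the new output only targets plain terminals for now,
        # so Ray Client and notebook sessions keep the legacy reporter.
        if ctx.is_ray_client or ctx.is_notebook:
            return Reporter("legacy", self.verbosity)
        if ctx.library is Library.TRAIN:
            return Reporter("train-terminal", self.verbosity)
        return Reporter("tune-terminal", self.verbosity)


builder = ReporterBuilder(verbosity=2)
tune_reporter = builder.build(OutputContext(Library.TUNE))
notebook_reporter = builder.build(OutputContext(Library.TUNE, is_notebook=True))
```

The point of the design is that library authors (RLlib, Train) would only touch the builder configuration, and the choice between periodic and event-based reporters stays an internal detail.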

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

…onsole_2

krfricke

Few nits, overall code (especially output.py) looks good so far. I'll try it out later but think we can go ahead with this!

.buildkite/pipeline.ml.yml

python/ray/tune/BUILD

python/ray/tune/logger/logger.py

krfricke · 2023-03-24T14:33:16Z

python/ray/tune/logger/logger.py

    result = result.copy()
    result.update(config=None)  # drop config from pretty print
    result.update(hist_stats=None)  # drop hist_stats from pretty print
    out = {}
    for k, v in result.items():
-        if v is not None:
+        if v is not None and (blacklisted_keys is None or k not in blacklisted_keys):


Suggested change

-        if v is not None and (blacklisted_keys is None or k not in blacklisted_keys):
+        if v is not None and k not in exclude:

Need to check whether exclude is None (in which case it's not iterable).

Also, it's not good practice to have exclude take an optional list in the function signature, and we should avoid that.

Should be fine with exclude = exclude or set() above?
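
For reference, a minimal sketch of the `exclude = exclude or set()` normalization discussed in this thread. The function name `filter_result_for_print` is hypothetical and only mirrors the filtering loop from the diff; it is not the actual function in logger.py.

```python
from typing import Any, Collection, Dict, Optional


def filter_result_for_print(
    result: Dict[str, Any], exclude: Optional[Collection[str]] = None
) -> Dict[str, Any]:
    """Drop None values and excluded keys before pretty-printing a result."""
    # Normalize once so the membership test below never sees None.
    exclude = set(exclude or ())

    result = result.copy()
    result.update(config=None)  # drop config from pretty print
    result.update(hist_stats=None)  # drop hist_stats from pretty print

    out = {}
    for k, v in result.items():
        if v is not None and k not in exclude:
            out[k] = v
    return out


sample = {"loss": 0.1, "config": {"lr": 0.01}, "done": None, "pid": 3}
kept_all = filter_result_for_print(sample)
kept_filtered = filter_result_for_print(sample, exclude=["pid"])
```

With the normalization in place, the condition inside the loop stays as short as the suggested change, and callers can pass None, a list, or a set.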

python/ray/tune/tune.py

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

krfricke

I think this is good to go. Stamping for now. Merging should be low risk (assuming other tests pass) as it is a completely separate code path hidden behind a feature flag.
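
For readers skimming the thread, the feature-flag gating described in the PR description can be sketched roughly as follows. `resolve_verbosity` is a hypothetical helper, not code from tune.py. It assumes that AIR_VERBOSITY holds an integer and that 0 is the silent level of the old Tune verbosity, and it leaves out the Ray Client and notebook restriction.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_verbosity(
    legacy_verbose: int, environ: Optional[Mapping[str, str]] = None
) -> Tuple[Optional[int], int]:
    """Return (new-output verbosity or None, effective legacy verbosity)."""
    environ = os.environ if environ is None else environ
    raw = environ.get("AIR_VERBOSITY")
    if raw is None:
        # Flag not set: the old code path runs unchanged.
        return None, legacy_verbose
    # Flag set: the new path takes over and the old output is silenced.
    # Assumption: the value parses as an int and 0 means "silent".
    return int(raw), 0


flag_off = resolve_verbosity(3, environ={})
flag_on = resolve_verbosity(3, environ={"AIR_VERBOSITY": "1"})
```

Because the old path is only changed when the env var is present, the default behavior stays the same, which matches the low-risk assessment above.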

xwjiang2010 · 2023-03-24T22:52:02Z

Thanks Kai. I will address your feedback on the screenshot as part of the dogfooding process.

… flag) (#33609)" This reverts commit bfbe8d1.

1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `":octopus: Tune tests and examples (medium)"`.
2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"` but was done prior to merging (1).
3. #33499 introduced a `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"` but was done prior to merging (1).

**Note:** There should probably be a better way for handling dependencies in CI tests...

@jianoaix

* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of ci/cd builds in our pipeline. The existing code uses pipe to read from a rather large file (>50MB). Pipe however has buffer limit which by default in term of kb (https://man7.org/linux/man-pages/man7/pipe.7.html) so what we look for might not exist. We can fix this by tell unzip the exact file we are looking for. That file is pretty small so we should not hit buffer limit. You might notice other surpises might still happen with this fix (e.g. many files that match ^__commit__). This sanity check goes back to 2 years ago by our veteran Kai (234b015) to sanity check issues with stale artifacts from previous builds or race conditions between builds. Further investigation on how builkite agent multi-tenant is setup might or might not simplify this logic further. Signed-off-by: Cuong Nguyen <can@anyscale.com> * Improve wheel commit validation error message Signed-off-by: Cuong Nguyen <can@anyscale.com> * Setup dependencies and crendential for GCE in buildkite Signed-off-by: Cuong Nguyen <can@anyscale.com> * Add google-cloud-storage package to requirements Signed-off-by: Cuong Nguyen <can@anyscale.com> * Add new lines to some files Signed-off-by: Cuong Nguyen <can@anyscale.com> * Support for gs:// in anyscale job runner Signed-off-by: Cuong Nguyen <can@anyscale.com> * Correct adding gce tests Signed-off-by: Cuong Nguyen <can@anyscale.com> * Support test definition with multiple flavors Signed-off-by: Cuong Nguyen <can@anyscale.com> * Use not in to check key in dict Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging 2 Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging 03 Signed-off-by: Cuong Nguyen <can@anyscale.com> * Remove temoprary logs Signed-off-by: Cuong Nguyen <can@anyscale.com> * -s * Update flavors Signed-off-by: Cuong Nguyen 
<can@anyscale.com> * Only initialize gs client on gs host Signed-off-by: Cuong Nguyen <can@anyscale.com> * Lint Signed-off-by: Cuong Nguyen <can@anyscale.com> * Update image for Sematic integration (#33469) * [RLlib] fix preprocessor test (#33719) Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> * [RLlib] APPO TF with RLModule and Learner API (#33310) Signed-off-by: Avnish <avnishnarayan@gmail.com> * [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665) To make Java logging consistent with PR #31772 which seems for lazy worker binding. Otherwise, we may print too many logs from different drivers in shell console. Co-authored-by: Qing Wang <kingchin1218@126.com> * [serve] Fix serve HA test (#33699) * Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731) <img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png"> This reverts commit cb5bb0e.     - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( * [tune] add data to CI test dependencies (#33729) 1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `:octopus: Tune tests and examples (medium)"`. 2. 
#33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1). 3. #33499 introduced a `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1). **Note:** There should probably be a better way for handling dependencies in CI tests... * [Test] Fix test event test timeout (#33704) * [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests becauase of LSTMs (#33728) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747) This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code. * [runtime env] Close schema after loading and continue on error (#33535) This PR fixes a few things: * A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink) * Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist. * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed. 
**Steps to Reproduce** 1. Save this script as `test.py` ```python import ray @ray.remote(runtime_env={"env_vars": {}}) def my_fn(): return True ray.init() print(ray.get(my_fn.remote())) ``` 2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py` 3. a. save `:` or other invalid JSON as `bad-json.json` b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py` This PR fixes the issue and adds a new test case. Signed-off-by: James Clark <james.clark@zapatacomputing.com> * [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259) In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata). Previously, when submitting a job, we (1) check if the info for already exists in the internal KV, and then (2) put the new info and job ID into the internal KV. This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name. This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value). Also adds a unit test which fails without this change. * Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733) Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> * Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. 
(#33561)"" (#33740) Additionally fix `test_usage_test.py`. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> * Deprecate RuntimeContext.get (#33734) RuntimeContext.get exposes Cython ids instead of strings so we should deprecate it and in favor of get_xxx_id() methods. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> * [Serve] Fix the serve.batch api doc (#33588) Fix the example formatting in the serve batch API doc * [infra] increase Build timeout (#33756) Why are these changes needed? release test failing due to timeout when building the cluster env. Currently timeout is 30 minutes, but the build could take longer, e.g https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2 * [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713) It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations. * [Test] Fix the failing workflow test_dataset after streaming executor is enabled. (#33736) Looks like the workflow will start 1 CPU cluster, and it has its own remote task that uses 1 CPU which blocks scheduling dataset tasks that require CPUs. There was an option to make workflow remote task to use 0 CPU, but I think that doesn't really make sense (since user probably just writes regular function inside). I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is mysterious why it worked before streaming executor was enabled (cc @jianoaix if you have good theory. ) * [Test] Fix out of disk error (#33732) Sometimes, there are more than 1 OOD event if test runs more than 10 seconds. 
I alleviated the assert condition in that case. * [Data] Repurpose streaming CI to bulk CI(#33478) Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now). * [Serve] Enable serve metrics lib working in ray actor (#33717) Make sure ray.serve.lib working with ray.actor without serve context. ``` @ray.remote class MyActor: def __init__(self): self.my_counter = metrics.Counter( "my_ray_actor", description=("The number of requests to this deployment."), tag_keys=("my_tag",), ) def test(self): self.my_counter.inc(tags={"my_tag": "value"}) return "hello" @serve.deployment(num_replicas=2) class Model: def __init__(self, model_name): self.my_actor = MyActor.remote() async def __call__(self, req: starlette.requests.Request): await self.my_actor.test.remote() return ``` * [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648) Signed-off-by: sven1977 <svenmika1977@gmail.com> * [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761) If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified. 
--------- Signed-off-by: amogkam <amogkamsetty@yahoo.com> * @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com> * @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com> * Run linter Signed-off-by: Cuong Nguyen <can@anyscale.com> * Run linters Signed-off-by: Cuong Nguyen <can@anyscale.com> * Change ray to 2.3.1 to work around the #ir-glorious-shape Signed-off-by: Cuong Nguyen <can@anyscale.com> * Revert to normal ray image Signed-off-by: Cuong Nguyen <can@anyscale.com> * Fix delete_fn Signed-off-by: Cuong Nguyen <can@anyscale.com> * [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772) Add the ability for AnyscaleJobRunner to run on GCE host. The added logic: - Read from ENV variable, or the storage link, to see if this is a GCE host. If it is, has custom logic inside job file manager and runner. Both read, write and delete are supported. - Add some sample tests to use gce as an environment so we can run a CI and check that this diff works - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [X] I've run `scripts/format.sh` to lint the changes in this PR. - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825 * Run lint Signed-off-by: Cuong Nguyen <can@anyscale.com> * [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). 
(#33814) * Setup dependencies and crendential for GCE in buildkite Signed-off-by: Cuong Nguyen <can@anyscale.com> * Add google-cloud-storage package to requirements Signed-off-by: Cuong Nguyen <can@anyscale.com> * Support for gs:// in anyscale job runner Signed-off-by: Cuong Nguyen <can@anyscale.com> * Correct adding gce tests Signed-off-by: Cuong Nguyen <can@anyscale.com> * [RLlib] APPO TF with RLModule and Learner API (#33310) Signed-off-by: Avnish <avnishnarayan@gmail.com> * [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713) It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations. * @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com> * Run linter Signed-off-by: Cuong Nguyen <can@anyscale.com> * Run linters Signed-off-by: Cuong Nguyen <can@anyscale.com> * -s * Fix some tests Signed-off-by: Cuong Nguyen <can@anyscale.com> * Add unit tests for test definition parser Signed-off-by: Cuong Nguyen <can@anyscale.com> * Fix lints Signed-off-by: Cuong Nguyen <can@anyscale.com> * @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com> * Check that parse_test_definition throws exception on empty variations Signed-off-by: Cuong Nguyen <can@anyscale.com> * Remove the constant test definition in test. 
Signed-off-by: Cuong Nguyen <can@anyscale.com> --------- Signed-off-by: Cuong Nguyen <can@anyscale.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> Signed-off-by: Avnish <avnishnarayan@gmail.com> Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> Signed-off-by: sven1977 <svenmika1977@gmail.com> Signed-off-by: amogkam <amogkamsetty@yahoo.com> Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com> Co-authored-by: augray <augray@users.noreply.github.com> Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com> Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com> Co-authored-by: jiafu zhang <jiafu.zhang@intel.com> Co-authored-by: Qing Wang <kingchin1218@126.com> Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com> Co-authored-by: SangBin Cho <rkooo567@gmail.com> Co-authored-by: matthewdeng <matt@anyscale.com> Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com> Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com> Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com> Co-authored-by: Sihan Wang <sihanwang41@gmail.com> Co-authored-by: clarng <clarence.wyng@gmail.com> Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com> Co-authored-by: Sven Mika <svenmika1977@gmail.com> Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>

@jianoaix

* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of ci/cd builds in our pipeline. The existing code uses pipe to read from a rather large file (>50MB). Pipe however has buffer limit which by default in term of kb (https://man7.org/linux/man-pages/man7/pipe.7.html) so what we look for might not exist. We can fix this by tell unzip the exact file we are looking for. That file is pretty small so we should not hit buffer limit. You might notice other surpises might still happen with this fix (e.g. many files that match ^__commit__). This sanity check goes back to 2 years ago by our veteran Kai (234b015) to sanity check issues with stale artifacts from previous builds or race conditions between builds. Further investigation on how builkite agent multi-tenant is setup might or might not simplify this logic further. Signed-off-by: Cuong Nguyen <can@anyscale.com> * Improve wheel commit validation error message Signed-off-by: Cuong Nguyen <can@anyscale.com> * Setup dependencies and crendential for GCE in buildkite Signed-off-by: Cuong Nguyen <can@anyscale.com> * Add google-cloud-storage package to requirements Signed-off-by: Cuong Nguyen <can@anyscale.com> * Add new lines to some files Signed-off-by: Cuong Nguyen <can@anyscale.com> * Support for gs:// in anyscale job runner Signed-off-by: Cuong Nguyen <can@anyscale.com> * Correct adding gce tests Signed-off-by: Cuong Nguyen <can@anyscale.com> * Support test definition with multiple flavors Signed-off-by: Cuong Nguyen <can@anyscale.com> * Use not in to check key in dict Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging 2 Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging 03 Signed-off-by: Cuong Nguyen <can@anyscale.com> * Remove temoprary logs Signed-off-by: Cuong Nguyen <can@anyscale.com> * -s * Update flavors Signed-off-by: Cuong Nguyen 
<can@anyscale.com> * Only initialize gs client on gs host Signed-off-by: Cuong Nguyen <can@anyscale.com> * Lint Signed-off-by: Cuong Nguyen <can@anyscale.com> * Update image for Sematic integration (#33469) * [RLlib] fix preprocessor test (#33719) Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> * [RLlib] APPO TF with RLModule and Learner API (#33310) Signed-off-by: Avnish <avnishnarayan@gmail.com> * [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665) To make Java logging consistent with PR #31772 which seems for lazy worker binding. Otherwise, we may print too many logs from different drivers in shell console. Co-authored-by: Qing Wang <kingchin1218@126.com> * [serve] Fix serve HA test (#33699) * Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731) <img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png"> This reverts commit cb5bb0e.     - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( * [tune] add data to CI test dependencies (#33729) 1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `:octopus: Tune tests and examples (medium)"`. 2. 
#33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1). 3. #33499 introduced a `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1). **Note:** There should probably be a better way for handling dependencies in CI tests... * [Test] Fix test event test timeout (#33704) * [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests becauase of LSTMs (#33728) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747) This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code. * [runtime env] Close schema after loading and continue on error (#33535) This PR fixes a few things: * A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink) * Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist. * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed. 
**Steps to Reproduce** 1. Save this script as `test.py` ```python import ray @ray.remote(runtime_env={"env_vars": {}}) def my_fn(): return True ray.init() print(ray.get(my_fn.remote())) ``` 2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py` 3. a. save `:` or other invalid JSON as `bad-json.json` b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py` This PR fixes the issue and adds a new test case. Signed-off-by: James Clark <james.clark@zapatacomputing.com> * [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259) In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata). Previously, when submitting a job, we (1) check if the info for already exists in the internal KV, and then (2) put the new info and job ID into the internal KV. This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name. This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value). Also adds a unit test which fails without this change. * Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733) Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> * Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. 
(#33561)"" (#33740) Additionally fix `test_usage_test.py`. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> * Deprecate RuntimeContext.get (#33734) RuntimeContext.get exposes Cython ids instead of strings so we should deprecate it and in favor of get_xxx_id() methods. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> * [Serve] Fix the serve.batch api doc (#33588) Fix the example formatting in the serve batch API doc * [infra] increase Build timeout (#33756) Why are these changes needed? release test failing due to timeout when building the cluster env. Currently timeout is 30 minutes, but the build could take longer, e.g https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2 * [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713) It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations. * [Test] Fix the failing workflow test_dataset after streaming executor is enabled. (#33736) Looks like the workflow will start 1 CPU cluster, and it has its own remote task that uses 1 CPU which blocks scheduling dataset tasks that require CPUs. There was an option to make workflow remote task to use 0 CPU, but I think that doesn't really make sense (since user probably just writes regular function inside). I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is mysterious why it worked before streaming executor was enabled (cc @jianoaix if you have good theory. ) * [Test] Fix out of disk error (#33732) Sometimes, there are more than 1 OOD event if test runs more than 10 seconds. 
I alleviated the assert condition in that case. * [Data] Repurpose streaming CI to bulk CI(#33478) Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now). * [Serve] Enable serve metrics lib working in ray actor (#33717) Make sure ray.serve.lib working with ray.actor without serve context. ``` @ray.remote class MyActor: def __init__(self): self.my_counter = metrics.Counter( "my_ray_actor", description=("The number of requests to this deployment."), tag_keys=("my_tag",), ) def test(self): self.my_counter.inc(tags={"my_tag": "value"}) return "hello" @serve.deployment(num_replicas=2) class Model: def __init__(self, model_name): self.my_actor = MyActor.remote() async def __call__(self, req: starlette.requests.Request): await self.my_actor.test.remote() return ``` * [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648) Signed-off-by: sven1977 <svenmika1977@gmail.com> * [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761) If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified. 
--------- Signed-off-by: amogkam <amogkamsetty@yahoo.com> * @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com> * @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com> * Run linter Signed-off-by: Cuong Nguyen <can@anyscale.com> * Run linters Signed-off-by: Cuong Nguyen <can@anyscale.com> * Change ray to 2.3.1 to work around the #ir-glorious-shape Signed-off-by: Cuong Nguyen <can@anyscale.com> * Revert to normal ray image Signed-off-by: Cuong Nguyen <can@anyscale.com> * Fix delete_fn Signed-off-by: Cuong Nguyen <can@anyscale.com> * [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772) Add the ability for AnyscaleJobRunner to run on GCE host. The added logic: - Read from ENV variable, or the storage link, to see if this is a GCE host. If it is, has custom logic inside job file manager and runner. Both read, write and delete are supported. - Add some sample tests to use gce as an environment so we can run a CI and check that this diff works - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [X] I've run `scripts/format.sh` to lint the changes in this PR. - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825 * Run lint Signed-off-by: Cuong Nguyen <can@anyscale.com> * [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). 
(#33814) * Set up dependencies and credentials for GCE in buildkite Signed-off-by: Cuong Nguyen <can@anyscale.com> * Add google-cloud-storage package to requirements Signed-off-by: Cuong Nguyen <can@anyscale.com> * Support for gs:// in anyscale job runner Signed-off-by: Cuong Nguyen <can@anyscale.com> * Correct adding gce tests Signed-off-by: Cuong Nguyen <can@anyscale.com> * [RLlib] APPO TF with RLModule and Learner API (#33310) Signed-off-by: Avnish <avnishnarayan@gmail.com> * [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713) It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations. * @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com> * Run linter Signed-off-by: Cuong Nguyen <can@anyscale.com> * Run linters Signed-off-by: Cuong Nguyen <can@anyscale.com> * -s * Fix some tests Signed-off-by: Cuong Nguyen <can@anyscale.com> * Add unit tests for test definition parser Signed-off-by: Cuong Nguyen <can@anyscale.com> * Fix lints Signed-off-by: Cuong Nguyen <can@anyscale.com> * @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com> * Check that parse_test_definition throws exception on empty variations Signed-off-by: Cuong Nguyen <can@anyscale.com> * Remove the constant test definition in test. Signed-off-by: Cuong Nguyen <can@anyscale.com> * The cluster environment name does not allow the character '.', so fix that. Address Lonnie's comments and add more tests. 
Signed-off-by: Cuong Nguyen <can@anyscale.com> --------- Signed-off-by: Cuong Nguyen <can@anyscale.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> Signed-off-by: Avnish <avnishnarayan@gmail.com> Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> Signed-off-by: sven1977 <svenmika1977@gmail.com> Signed-off-by: amogkam <amogkamsetty@yahoo.com> Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com> Co-authored-by: augray <augray@users.noreply.github.com> Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com> Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com> Co-authored-by: jiafu zhang <jiafu.zhang@intel.com> Co-authored-by: Qing Wang <kingchin1218@126.com> Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com> Co-authored-by: SangBin Cho <rkooo567@gmail.com> Co-authored-by: matthewdeng <matt@anyscale.com> Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com> Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com> Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com> Co-authored-by: Sihan Wang <sihanwang41@gmail.com> Co-authored-by: clarng <clarence.wyng@gmail.com> Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com> Co-authored-by: Sven Mika <svenmika1977@gmail.com> Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>

@jianoaix

* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping into many of the CI/CD builds in our pipeline. The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit, which by default is on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we look for might not be there. We can fix this by telling unzip the exact file we are looking for. That file is pretty small so we should not hit the buffer limit. You might notice other surprises might still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added 2 years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation on how the buildkite agent multi-tenancy is set up might or might not simplify this logic further. Signed-off-by: Cuong Nguyen <can@anyscale.com> * Improve wheel commit validation error message Signed-off-by: Cuong Nguyen <can@anyscale.com> * Set up dependencies and credentials for GCE in buildkite Signed-off-by: Cuong Nguyen <can@anyscale.com> * Add google-cloud-storage package to requirements Signed-off-by: Cuong Nguyen <can@anyscale.com> * Add new lines to some files Signed-off-by: Cuong Nguyen <can@anyscale.com> * Support for gs:// in anyscale job runner Signed-off-by: Cuong Nguyen <can@anyscale.com> * Correct adding gce tests Signed-off-by: Cuong Nguyen <can@anyscale.com> * Support test definition with multiple flavors Signed-off-by: Cuong Nguyen <can@anyscale.com> * Use not in to check key in dict Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging 2 Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging 03 Signed-off-by: Cuong Nguyen <can@anyscale.com> * Remove temporary logs Signed-off-by: Cuong Nguyen <can@anyscale.com> * -s * Update flavors Signed-off-by: Cuong Nguyen 
<can@anyscale.com> * Only initialize gs client on gs host Signed-off-by: Cuong Nguyen <can@anyscale.com> * Lint Signed-off-by: Cuong Nguyen <can@anyscale.com> * Update image for Sematic integration (#33469) * [RLlib] fix preprocessor test (#33719) Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> * [RLlib] APPO TF with RLModule and Learner API (#33310) Signed-off-by: Avnish <avnishnarayan@gmail.com> * [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665) To make Java logging consistent with PR #31772 which seems for lazy worker binding. Otherwise, we may print too many logs from different drivers in shell console. Co-authored-by: Qing Wang <kingchin1218@126.com> * [serve] Fix serve HA test (#33699) * Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731) <img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png"> This reverts commit cb5bb0e.     - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( * [tune] add data to CI test dependencies (#33729) 1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `:octopus: Tune tests and examples (medium)"`. 2. 
#33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1). 3. #33499 introduced a `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1). **Note:** There should probably be a better way for handling dependencies in CI tests... * [Test] Fix test event test timeout (#33704) * [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747) This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code. * [runtime env] Close schema after loading and continue on error (#33535) This PR fixes a few things: * A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink) * Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist. * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed. 
**Steps to Reproduce**
1. Save this script as `test.py`
```python
import ray


@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True


ray.init()
print(ray.get(my_fn.remote()))
```
2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3. a. save `:` or other invalid JSON as `bad-json.json` b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

This PR fixes the issue and adds a new test case. Signed-off-by: James Clark <james.clark@zapatacomputing.com>
* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259) In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata). Previously, when submitting a job, we (1) check if the info for the job already exists in the internal KV, and then (2) put the new info and job ID into the internal KV. This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name. This PR fixes the race condition by making operations (1) and (2) happen as a single atomic operation (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value). Also adds a unit test which fails without this change.
* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733) Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. 
(#33561)"" (#33740) Additionally fix `test_usage_test.py`. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> * Deprecate RuntimeContext.get (#33734) RuntimeContext.get exposes Cython ids instead of strings so we should deprecate it in favor of get_xxx_id() methods. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> * [Serve] Fix the serve.batch api doc (#33588) Fix the example formatting in the serve batch API doc * [infra] increase Build timeout (#33756) Why are these changes needed? Release tests are failing due to a timeout when building the cluster env. Currently the timeout is 30 minutes, but the build could take longer, e.g. https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2 * [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713) It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations. * [Test] Fix the failing workflow test_dataset after streaming executor is enabled. (#33736) Looks like the workflow starts a 1-CPU cluster, and it has its own remote task that uses 1 CPU, which blocks scheduling dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I think that doesn't really make sense (since the user probably just writes a regular function inside). I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is mysterious why it worked before streaming executor was enabled (cc @jianoaix if you have a good theory). * [Test] Fix out of disk error (#33732) Sometimes, there is more than 1 OOD event if the test runs more than 10 seconds. 
I relaxed the assert condition in that case.
* [Data] Repurpose streaming CI to bulk CI (#33478) Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).
* [Serve] Enable serve metrics lib working in ray actor (#33717) Make sure ray.serve.lib works with ray.actor without a serve context.
```
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics


@ray.remote
class MyActor:
    def __init__(self):
        # Counter created inside a plain Ray actor, outside any serve context.
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        await self.my_actor.test.remote()
        return
```
* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648) Signed-off-by: sven1977 <svenmika1977@gmail.com>
* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761) If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified. 
--------- Signed-off-by: amogkam <amogkamsetty@yahoo.com> * @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com> * @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com> * Run linter Signed-off-by: Cuong Nguyen <can@anyscale.com> * Run linters Signed-off-by: Cuong Nguyen <can@anyscale.com> * Change ray to 2.3.1 to work around the #ir-glorious-shape Signed-off-by: Cuong Nguyen <can@anyscale.com> * Revert to normal ray image Signed-off-by: Cuong Nguyen <can@anyscale.com> * Fix delete_fn Signed-off-by: Cuong Nguyen <can@anyscale.com> * [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772) Add the ability for AnyscaleJobRunner to run on GCE host. The added logic: - Read from ENV variable, or the storage link, to see if this is a GCE host. If it is, has custom logic inside job file manager and runner. Both read, write and delete are supported. - Add some sample tests to use gce as an environment so we can run a CI and check that this diff works - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [X] I've run `scripts/format.sh` to lint the changes in this PR. - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825 * Run lint Signed-off-by: Cuong Nguyen <can@anyscale.com> * [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). 
(#33814) * Set up dependencies and credentials for GCE in buildkite Signed-off-by: Cuong Nguyen <can@anyscale.com> * Add google-cloud-storage package to requirements Signed-off-by: Cuong Nguyen <can@anyscale.com> * Support for gs:// in anyscale job runner Signed-off-by: Cuong Nguyen <can@anyscale.com> * Correct adding gce tests Signed-off-by: Cuong Nguyen <can@anyscale.com> * [RLlib] APPO TF with RLModule and Learner API (#33310) Signed-off-by: Avnish <avnishnarayan@gmail.com> * [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713) It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations. * @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com> * Run linter Signed-off-by: Cuong Nguyen <can@anyscale.com> * Run linters Signed-off-by: Cuong Nguyen <can@anyscale.com> * -s * Fix some tests Signed-off-by: Cuong Nguyen <can@anyscale.com> * Add unit tests for test definition parser Signed-off-by: Cuong Nguyen <can@anyscale.com> * Fix lints Signed-off-by: Cuong Nguyen <can@anyscale.com> * @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com> * Check that parse_test_definition throws exception on empty variations Signed-off-by: Cuong Nguyen <can@anyscale.com> * Remove the constant test definition in test. 
Signed-off-by: Cuong Nguyen <can@anyscale.com> --------- Signed-off-by: Cuong Nguyen <can@anyscale.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> Signed-off-by: Avnish <avnishnarayan@gmail.com> Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> Signed-off-by: sven1977 <svenmika1977@gmail.com> Signed-off-by: amogkam <amogkamsetty@yahoo.com> Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com> Co-authored-by: augray <augray@users.noreply.github.com> Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com> Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com> Co-authored-by: jiafu zhang <jiafu.zhang@intel.com> Co-authored-by: Qing Wang <kingchin1218@126.com> Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com> Co-authored-by: SangBin Cho <rkooo567@gmail.com> Co-authored-by: matthewdeng <matt@anyscale.com> Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com> Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com> Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com> Co-authored-by: Sihan Wang <sihanwang41@gmail.com> Co-authored-by: clarng <clarence.wyng@gmail.com> Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com> Co-authored-by: Sven Mika <svenmika1977@gmail.com> Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
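
The race-condition fix described in the [Jobs] entry above (#33259) can be illustrated with a minimal sketch. This is not Ray's code: `InMemoryKV`, `submit_job`, and `race` are hypothetical names, and the in-memory store only mimics the behavior the commit message attributes to `internal_kv_put(... overwrite=False)`, namely returning the number of keys newly added. The point is that the existence check and the write happen in one atomic call, so only one of several concurrent submissions with the same id can succeed.

```python
import threading


class InMemoryKV:
    """Hypothetical stand-in for an internal KV store (not Ray's API)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value, overwrite=True):
        """Store value and return the number of keys newly added (0 or 1)."""
        with self._lock:
            if key in self._data:
                if overwrite:
                    self._data[key] = value
                return 0
            self._data[key] = value
            return 1


def submit_job(kv, submission_id, info):
    """Register a job, failing if the submission id is already taken."""
    # One atomic put-if-absent replaces the racy "check, then put" sequence.
    added = kv.put(submission_id, info, overwrite=False)
    if added == 0:
        raise ValueError(f"Job {submission_id!r} already exists")
    return submission_id


def race(kv, submission_id, n_threads=8):
    """Submit the same id from several threads; return (successes, failures)."""
    results = []

    def worker(i):
        try:
            submit_job(kv, submission_id, {"attempt": i})
            results.append(True)
        except ValueError:
            results.append(False)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results.count(True), results.count(False)
```

With the earlier two-step approach, two threads could both pass the existence check before either wrote; here the loser gets a clear error at submission time instead of a later named-actor conflict.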


@jianoaix


@jianoaix

* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of ci/cd builds in our pipeline. The existing code uses pipe to read from a rather large file (>50MB). Pipe however has buffer limit which by default in term of kb (https://man7.org/linux/man-pages/man7/pipe.7.html) so what we look for might not exist. We can fix this by tell unzip the exact file we are looking for. That file is pretty small so we should not hit buffer limit. You might notice other surpises might still happen with this fix (e.g. many files that match ^__commit__). This sanity check goes back to 2 years ago by our veteran Kai (234b015) to sanity check issues with stale artifacts from previous builds or race conditions between builds. Further investigation on how builkite agent multi-tenant is setup might or might not simplify this logic further. Signed-off-by: Cuong Nguyen <can@anyscale.com> * Improve wheel commit validation error message Signed-off-by: Cuong Nguyen <can@anyscale.com> * Setup dependencies and crendential for GCE in buildkite Signed-off-by: Cuong Nguyen <can@anyscale.com> * Add google-cloud-storage package to requirements Signed-off-by: Cuong Nguyen <can@anyscale.com> * Add new lines to some files Signed-off-by: Cuong Nguyen <can@anyscale.com> * Support for gs:// in anyscale job runner Signed-off-by: Cuong Nguyen <can@anyscale.com> * Correct adding gce tests Signed-off-by: Cuong Nguyen <can@anyscale.com> * Support test definition with multiple flavors Signed-off-by: Cuong Nguyen <can@anyscale.com> * Use not in to check key in dict Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging 2 Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging 03 Signed-off-by: Cuong Nguyen <can@anyscale.com> * Remove temoprary logs Signed-off-by: Cuong Nguyen <can@anyscale.com> * -s * Update flavors Signed-off-by: Cuong Nguyen 
<can@anyscale.com> * Only initialize gs client on gs host Signed-off-by: Cuong Nguyen <can@anyscale.com> * Lint Signed-off-by: Cuong Nguyen <can@anyscale.com> * Update image for Sematic integration (#33469) * [RLlib] fix preprocessor test (#33719) Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> * [RLlib] APPO TF with RLModule and Learner API (#33310) Signed-off-by: Avnish <avnishnarayan@gmail.com> * [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665) To make Java logging consistent with PR #31772 which seems for lazy worker binding. Otherwise, we may print too many logs from different drivers in shell console. Co-authored-by: Qing Wang <kingchin1218@126.com> * [serve] Fix serve HA test (#33699) * Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731) <img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png"> This reverts commit cb5bb0e.     - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( * [tune] add data to CI test dependencies (#33729) 1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `:octopus: Tune tests and examples (medium)"`. 2. 
#33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1). 3. #33499 introduced a `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1). **Note:** There should probably be a better way for handling dependencies in CI tests... * [Test] Fix test event test timeout (#33704) * [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests becauase of LSTMs (#33728) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747) This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code. * [runtime env] Close schema after loading and continue on error (#33535) This PR fixes a few things: * A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink) * Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist. * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed. 
**Steps to Reproduce** 1. Save this script as `test.py` ```python import ray @ray.remote(runtime_env={"env_vars": {}}) def my_fn(): return True ray.init() print(ray.get(my_fn.remote())) ``` 2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py` 3. a. save `:` or other invalid JSON as `bad-json.json` b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py` This PR fixes the issue and adds a new test case. Signed-off-by: James Clark <james.clark@zapatacomputing.com> * [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259) In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata). Previously, when submitting a job, we (1) check if the info for already exists in the internal KV, and then (2) put the new info and job ID into the internal KV. This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name. This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value). Also adds a unit test which fails without this change. * Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733) Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> * Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. 
(#33561)"" (#33740) Additionally fix `test_usage_test.py`. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> * Deprecate RuntimeContext.get (#33734) RuntimeContext.get exposes Cython ids instead of strings so we should deprecate it and in favor of get_xxx_id() methods. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> * [Serve] Fix the serve.batch api doc (#33588) Fix the example formatting in the serve batch API doc * [infra] increase Build timeout (#33756) Why are these changes needed? release test failing due to timeout when building the cluster env. Currently timeout is 30 minutes, but the build could take longer, e.g https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2 * [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713) It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations. * [Test] Fix the failing workflow test_dataset after streaming executor is enabled. (#33736) Looks like the workflow will start 1 CPU cluster, and it has its own remote task that uses 1 CPU which blocks scheduling dataset tasks that require CPUs. There was an option to make workflow remote task to use 0 CPU, but I think that doesn't really make sense (since user probably just writes regular function inside). I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is mysterious why it worked before streaming executor was enabled (cc @jianoaix if you have good theory. ) * [Test] Fix out of disk error (#33732) Sometimes, there are more than 1 OOD event if test runs more than 10 seconds. 
I relaxed the assert condition in that case.
* [Data] Repurpose streaming CI to bulk CI(#33478)
  Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).
* [Serve] Enable serve metrics lib working in ray actor (#33717)
  Make sure ray.serve.lib works with ray.actor without a serve context.
  ```
  import ray
  import starlette.requests
  from ray import serve
  from ray.serve import metrics


  @ray.remote
  class MyActor:
      def __init__(self):
          # Counter created inside a plain Ray actor, outside any Serve replica.
          self.my_counter = metrics.Counter(
              "my_ray_actor",
              description=("The number of requests to this deployment."),
              tag_keys=("my_tag",),
          )

      def test(self):
          self.my_counter.inc(tags={"my_tag": "value"})
          return "hello"


  @serve.deployment(num_replicas=2)
  class Model:
      def __init__(self, model_name):
          self.my_actor = MyActor.remote()

      async def __call__(self, req: starlette.requests.Request):
          await self.my_actor.test.remote()
          return
  ```
* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)
  Signed-off-by: sven1977 <svenmika1977@gmail.com>
* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)
  If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.
  ---------
  Signed-off-by: amogkam <amogkamsetty@yahoo.com>
* @aslonnie's comments
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* @aslonnie's comments
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linter
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linters
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Change ray to 2.3.1 to work around the #ir-glorious-shape
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Revert to normal ray image
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Fix delete_fn
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)
  Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
  - Read from the ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write and delete are all supported.
  - Add some sample tests that use gce as an environment so we can run a CI and check that this diff works
  - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
  - [X] I've run `scripts/format.sh` to lint the changes in this PR.
  - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  - Testing Strategy
    - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825
* Run lint
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)
* Setup dependencies and credential for GCE in buildkite
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Add google-cloud-storage package to requirements
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Support for gs:// in anyscale job runner
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Correct adding gce tests
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* [RLlib] APPO TF with RLModule and Learner API (#33310)
  Signed-off-by: Avnish <avnishnarayan@gmail.com>
* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)
  It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.
* @aslonnie's comments
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linter
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linters
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* -s
* Fix some tests
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Add unit tests for test definition parser
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Fix lints
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* @aslonnie's comments
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Check that parse_test_definition throws exception on empty variations
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Remove the constant test definition in test.
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* The cluster environment name does not allow the character '.', so fix that. Address Lonnie's comments and add more tests.
  Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
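For anyone reading the log without the diff open: the GCE host check from #33772 (environment variable first, storage link as a fallback) could look roughly like the sketch below. The function name `is_gce_host` and the variable name `RELEASE_TEST_ON_GCE` are made up for illustration and are not taken from the PR; only the `gs://` scheme appears in the commit log.

```python
import os
from urllib.parse import urlparse


def is_gce_host(storage_link, env=None):
    """Guess whether a job runs on a GCE host.

    Hypothetical helper: checks an illustrative env var first, then
    falls back to the scheme of the storage link.
    """
    env = os.environ if env is None else env
    # Assumed flag name, for illustration only.
    flag = env.get("RELEASE_TEST_ON_GCE", "").strip().lower()
    if flag in {"1", "true", "yes"}:
        return True
    # A gs:// storage link implies Google Cloud Storage, hence GCE.
    return urlparse(storage_link or "").scheme == "gs"
```

Passing `env` explicitly keeps the check easy to unit test without touching the real process environment.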

@jianoaix

* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping into many of the CI/CD builds in our pipeline.
  The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit that by default is on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we look for might not be there. We can fix this by telling unzip the exact file we are looking for. That file is pretty small so we should not hit the buffer limit.
  You might notice that other surprises might still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added 2 years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further.
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Improve wheel commit validation error message
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Setup dependencies and credential for GCE in buildkite
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Add google-cloud-storage package to requirements
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Add new lines to some files
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Support for gs:// in anyscale job runner
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Correct adding gce tests
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Support test definition with multiple flavors
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Use not in to check key in dict
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Debugging
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Debugging
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Debugging 2
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Debugging 03
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Remove temporary logs
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* -s
* Update flavors
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Only initialize gs client on gs host
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Lint
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Update image for Sematic integration (#33469)
* [RLlib] fix preprocessor test (#33719)
  Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
* [RLlib] APPO TF with RLModule and Learner API (#33310)
  Signed-off-by: Avnish <avnishnarayan@gmail.com>
* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)
  To make Java logging consistent with PR #31772, which seems to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console.
  Co-authored-by: Qing Wang <kingchin1218@126.com>
* [serve] Fix serve HA test (#33699)
* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)
  <img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png">
  This reverts commit cb5bb0e.
  - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
  - [ ] I've run `scripts/format.sh` to lint the changes in this PR.
  - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
  - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  - Testing Strategy
    - [ ] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(
* [tune] add data to CI test dependencies (#33729)
  1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `":octopus: Tune tests and examples (medium)"`.
  2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"` but was done prior to merging (1).
  3. #33499 introduced a `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"` but was done prior to merging (1).
  **Note:** There should probably be a better way of handling dependencies in CI tests...
* [Test] Fix test event test timeout (#33704)
* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723)
  Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726)
  Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727)
  Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728)
  Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)
  This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.
* [runtime env] Close schema after loading and continue on error (#33535)
  This PR fixes a few things:
  * A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
  * Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
  * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.

  **Steps to Reproduce**
  1. Save this script as `test.py`
  ```python
  import ray


  @ray.remote(runtime_env={"env_vars": {}})
  def my_fn():
      return True


  ray.init()
  print(ray.get(my_fn.remote()))
  ```
  2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
  3. a. save `:` or other invalid JSON as `bad-json.json`
     b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

  This PR fixes the issue and adds a new test case.
  Signed-off-by: James Clark <james.clark@zapatacomputing.com>
* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)
  In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata). Previously, when submitting a job, we (1) check if the info for the job already exists in the internal KV, and then (2) put the new info and job ID into the internal KV. This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see that the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name.
  This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value). Also adds a unit test which fails without this change.
* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)
  Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR, which is 6, and also prints out the error message.
  Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740)
  Additionally fix `test_usage_test.py`.
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* Deprecate RuntimeContext.get (#33734)
  RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods.
  Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
* [Serve] Fix the serve.batch api doc (#33588)
  Fix the example formatting in the serve batch API doc
* [infra] increase Build timeout (#33756)
  Why are these changes needed? Release test failing due to timeout when building the cluster env. Currently the timeout is 30 minutes, but the build could take longer, e.g. https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2
* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745)
  Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)
  It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.
* [Test] Fix the failing workflow test_dataset after streaming executor is enabled. (#33736)
  Looks like the workflow will start a 1 CPU cluster, and it has its own remote task that uses 1 CPU, which blocks scheduling dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPU, but I think that doesn't really make sense (since the user probably just writes a regular function inside). I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is mysterious why it worked before streaming executor was enabled (cc @jianoaix if you have a good theory.)
* [Test] Fix out of disk error (#33732)
  Sometimes, there is more than 1 OOD event if the test runs more than 10 seconds. I relaxed the assert condition in that case.
* [Data] Repurpose streaming CI to bulk CI(#33478)
  Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).
* [Serve] Enable serve metrics lib working in ray actor (#33717)
  Make sure ray.serve.lib works with ray.actor without a serve context.
  ```
  import ray
  import starlette.requests
  from ray import serve
  from ray.serve import metrics


  @ray.remote
  class MyActor:
      def __init__(self):
          # Counter created inside a plain Ray actor, outside any Serve replica.
          self.my_counter = metrics.Counter(
              "my_ray_actor",
              description=("The number of requests to this deployment."),
              tag_keys=("my_tag",),
          )

      def test(self):
          self.my_counter.inc(tags={"my_tag": "value"})
          return "hello"


  @serve.deployment(num_replicas=2)
  class Model:
      def __init__(self, model_name):
          self.my_actor = MyActor.remote()

      async def __call__(self, req: starlette.requests.Request):
          await self.my_actor.test.remote()
          return
  ```
* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648)
  Signed-off-by: sven1977 <svenmika1977@gmail.com>
* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761)
  If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.
  ---------
  Signed-off-by: amogkam <amogkamsetty@yahoo.com>
* @aslonnie's comments
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* @aslonnie's comments
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linter
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linters
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Change ray to 2.3.1 to work around the #ir-glorious-shape
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Revert to normal ray image
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Fix delete_fn
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)
  Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
  - Read from the ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write and delete are all supported.
  - Add some sample tests that use gce as an environment so we can run a CI and check that this diff works
  - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
  - [X] I've run `scripts/format.sh` to lint the changes in this PR.
  - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  - Testing Strategy
    - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825
* Run lint
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)
* Setup dependencies and credential for GCE in buildkite
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Add google-cloud-storage package to requirements
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Support for gs:// in anyscale job runner
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Correct adding gce tests
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* [RLlib] APPO TF with RLModule and Learner API (#33310)
  Signed-off-by: Avnish <avnishnarayan@gmail.com>
* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)
  It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.
* @aslonnie's comments
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linter
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linters
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* -s
* Fix some tests
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Add unit tests for test definition parser
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Fix lints
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* @aslonnie's comments
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Check that parse_test_definition throws exception on empty variations
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Remove the constant test definition in test.
  Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
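Side note for readers of the squashed log: the Jobs race fix (#33259) comes down to replacing a check-then-put pair with a single put-if-absent call whose return value says whether the key was new. Below is a minimal sketch of that pattern. `InMemoryKV`, `submit_job` and `race` are hypothetical stand-ins written for this note, not Ray's internal KV or Jobs code; only the `overwrite=False` flag and the "number of keys newly added" return value come from the commit description.

```python
import threading


class InMemoryKV:
    """Hypothetical stand-in for a KV store with put-if-absent semantics."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value, overwrite=True):
        """Store value under key; return the number of keys newly added (0 or 1)."""
        with self._lock:
            is_new = key not in self._data
            if is_new or overwrite:
                self._data[key] = value
            return 1 if is_new else 0


def submit_job(kv, submission_id, info):
    """Register a job; raise if the submission_id is already taken."""
    # One atomic step: there is no window between "check" and "put".
    if kv.put(submission_id, info, overwrite=False) == 0:
        raise ValueError(f"Job with submission_id {submission_id!r} already exists")
    return submission_id


def race(kv, submission_id, n_threads=8):
    """Submit the same id from several threads; return (successes, failures)."""
    outcomes = []
    barrier = threading.Barrier(n_threads)

    def worker(i):
        barrier.wait()  # line the threads up to maximise contention
        try:
            submit_job(kv, submission_id, {"owner": i})
            outcomes.append(True)
        except ValueError:
            outcomes.append(False)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes.count(True), outcomes.count(False)
```

With a separate "exists?" check followed by a put, two callers can both pass the check before either writes. Folding both into one call means exactly one caller sees a newly added key and every other caller gets a clear error up front.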

@jianoaix

* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of ci/cd builds in our pipeline. The existing code uses pipe to read from a rather large file (>50MB). Pipe however has buffer limit which by default in term of kb (https://man7.org/linux/man-pages/man7/pipe.7.html) so what we look for might not exist. We can fix this by tell unzip the exact file we are looking for. That file is pretty small so we should not hit buffer limit. You might notice other surpises might still happen with this fix (e.g. many files that match ^__commit__). This sanity check goes back to 2 years ago by our veteran Kai (234b015) to sanity check issues with stale artifacts from previous builds or race conditions between builds. Further investigation on how builkite agent multi-tenant is setup might or might not simplify this logic further. Signed-off-by: Cuong Nguyen <can@anyscale.com> * Improve wheel commit validation error message Signed-off-by: Cuong Nguyen <can@anyscale.com> * Setup dependencies and crendential for GCE in buildkite Signed-off-by: Cuong Nguyen <can@anyscale.com> * Add google-cloud-storage package to requirements Signed-off-by: Cuong Nguyen <can@anyscale.com> * Add new lines to some files Signed-off-by: Cuong Nguyen <can@anyscale.com> * Support for gs:// in anyscale job runner Signed-off-by: Cuong Nguyen <can@anyscale.com> * Correct adding gce tests Signed-off-by: Cuong Nguyen <can@anyscale.com> * Support test definition with multiple flavors Signed-off-by: Cuong Nguyen <can@anyscale.com> * Use not in to check key in dict Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging 2 Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging 03 Signed-off-by: Cuong Nguyen <can@anyscale.com> * Remove temoprary logs Signed-off-by: Cuong Nguyen <can@anyscale.com> * -s * Update flavors Signed-off-by: Cuong Nguyen 
<can@anyscale.com> * Only initialize gs client on gs host Signed-off-by: Cuong Nguyen <can@anyscale.com> * Lint Signed-off-by: Cuong Nguyen <can@anyscale.com> * Update image for Sematic integration (#33469) * [RLlib] fix preprocessor test (#33719) Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> * [RLlib] APPO TF with RLModule and Learner API (#33310) Signed-off-by: Avnish <avnishnarayan@gmail.com> * [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665) To make Java logging consistent with PR #31772 which seems for lazy worker binding. Otherwise, we may print too many logs from different drivers in shell console. Co-authored-by: Qing Wang <kingchin1218@126.com> * [serve] Fix serve HA test (#33699) * Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731) <img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png"> This reverts commit cb5bb0e.     - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( * [tune] add data to CI test dependencies (#33729) 1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `:octopus: Tune tests and examples (medium)"`. 2. 
#33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1). 3. #33499 introduced a `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1). **Note:** There should probably be a better way for handling dependencies in CI tests... * [Test] Fix test event test timeout (#33704) * [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests becauase of LSTMs (#33728) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747) This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code. * [runtime env] Close schema after loading and continue on error (#33535) This PR fixes a few things: * A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink) * Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist. * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed. 
**Steps to Reproduce** 1. Save this script as `test.py` ```python import ray @ray.remote(runtime_env={"env_vars": {}}) def my_fn(): return True ray.init() print(ray.get(my_fn.remote())) ``` 2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py` 3. a. save `:` or other invalid JSON as `bad-json.json` b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py` This PR fixes the issue and adds a new test case. Signed-off-by: James Clark <james.clark@zapatacomputing.com> * [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259) In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata). Previously, when submitting a job, we (1) check if the info for already exists in the internal KV, and then (2) put the new info and job ID into the internal KV. This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name. This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value). Also adds a unit test which fails without this change. * Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733) Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> * Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. 
(#33561)"" (#33740) Additionally fix `test_usage_test.py`. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> * Deprecate RuntimeContext.get (#33734) RuntimeContext.get exposes Cython ids instead of strings so we should deprecate it and in favor of get_xxx_id() methods. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> * [Serve] Fix the serve.batch api doc (#33588) Fix the example formatting in the serve batch API doc * [infra] increase Build timeout (#33756) Why are these changes needed? release test failing due to timeout when building the cluster env. Currently timeout is 30 minutes, but the build could take longer, e.g https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2 * [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713) It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations. * [Test] Fix the failing workflow test_dataset after streaming executor is enabled. (#33736) Looks like the workflow will start 1 CPU cluster, and it has its own remote task that uses 1 CPU which blocks scheduling dataset tasks that require CPUs. There was an option to make workflow remote task to use 0 CPU, but I think that doesn't really make sense (since user probably just writes regular function inside). I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is mysterious why it worked before streaming executor was enabled (cc @jianoaix if you have good theory. ) * [Test] Fix out of disk error (#33732) Sometimes, there are more than 1 OOD event if test runs more than 10 seconds. 
I alleviated the assert condition in that case. * [Data] Repurpose streaming CI to bulk CI(#33478) Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now). * [Serve] Enable serve metrics lib working in ray actor (#33717) Make sure ray.serve.lib working with ray.actor without serve context. ``` @ray.remote class MyActor: def __init__(self): self.my_counter = metrics.Counter( "my_ray_actor", description=("The number of requests to this deployment."), tag_keys=("my_tag",), ) def test(self): self.my_counter.inc(tags={"my_tag": "value"}) return "hello" @serve.deployment(num_replicas=2) class Model: def __init__(self, model_name): self.my_actor = MyActor.remote() async def __call__(self, req: starlette.requests.Request): await self.my_actor.test.remote() return ``` * [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648) Signed-off-by: sven1977 <svenmika1977@gmail.com> * [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761) If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified. 
--------- Signed-off-by: amogkam <amogkamsetty@yahoo.com>
* @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com>
* @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linter Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linters Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Change ray to 2.3.1 to work around the #ir-glorious-shape Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Revert to normal ray image Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Fix delete_fn Signed-off-by: Cuong Nguyen <can@anyscale.com>
* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772) Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
  - Read from ENV variable, or the storage link, to see if this is a GCE host. If it is, there is custom logic inside the job file manager and runner. Read, write, and delete are all supported.
  - Add some sample tests to use gce as an environment so we can run a CI and check that this diff works
  - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
  - [X] I've run `scripts/format.sh` to lint the changes in this PR.
  - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  - Testing Strategy
    - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825
* Run lint Signed-off-by: Cuong Nguyen <can@anyscale.com>
* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)
* Setup dependencies and credential for GCE in buildkite Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Add google-cloud-storage package to requirements Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Support for gs:// in anyscale job runner Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Correct adding gce tests Signed-off-by: Cuong Nguyen <can@anyscale.com>
* [RLlib] APPO TF with RLModule and Learner API (#33310) Signed-off-by: Avnish <avnishnarayan@gmail.com>
* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713) It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.
* @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linter Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linters Signed-off-by: Cuong Nguyen <can@anyscale.com>
* -s
* Fix some tests Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Add unit tests for test definition parser Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Fix lints Signed-off-by: Cuong Nguyen <can@anyscale.com>
* @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Check that parse_test_definition throws exception on empty variations Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Remove the constant test definition in test. Signed-off-by: Cuong Nguyen <can@anyscale.com>
* The cluster environment name does not allow the character '.', so fix that. Address Lonnie's comments and add more tests.
Signed-off-by: Cuong Nguyen <can@anyscale.com> --------- Signed-off-by: Cuong Nguyen <can@anyscale.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> Signed-off-by: Avnish <avnishnarayan@gmail.com> Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> Signed-off-by: sven1977 <svenmika1977@gmail.com> Signed-off-by: amogkam <amogkamsetty@yahoo.com> Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com> Co-authored-by: augray <augray@users.noreply.github.com> Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com> Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com> Co-authored-by: jiafu zhang <jiafu.zhang@intel.com> Co-authored-by: Qing Wang <kingchin1218@126.com> Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com> Co-authored-by: SangBin Cho <rkooo567@gmail.com> Co-authored-by: matthewdeng <matt@anyscale.com> Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com> Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com> Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com> Co-authored-by: Sihan Wang <sihanwang41@gmail.com> Co-authored-by: clarng <clarence.wyng@gmail.com> Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com> Co-authored-by: Sven Mika <svenmika1977@gmail.com> Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>

…ay-project#33609)
* implement a separate path for AirProgressReporter and ResultCallback. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* debug on ws. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* callback without num_samples. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* fix getattribute Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* polish rich table. some minor fixes. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* polish tune's case. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* polish train's case Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* current_best_trial Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* try running ci using the new AIR_VERBOSITY Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* minor fixes Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* fix _create_default_callback call site. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* do not hardcode basic info table height. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* separate buildkite pipeline. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* use nested table. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* lint Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* address comments Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

--------- Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> Signed-off-by: Jonathan Carter <jonathan.carter@magd.ox.ac.uk>
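The PR description says the new output path is triggered by setting the `AIR_VERBOSITY` env var, and that the old Tune verbosity is treated as silent once it is set. A minimal sketch of that gating follows. The helper names are hypothetical, and parsing the value as an integer level is an assumption; the actual logic lives in ray/tune/experimental/output.py and may differ.

```python
import os
from typing import Mapping, Optional

# Env var name taken from the PR description.
AIR_VERBOSITY_ENV = "AIR_VERBOSITY"


def get_air_verbosity(environ: Mapping[str, str] = os.environ) -> Optional[int]:
    """Return the new-output verbosity, or None if the flag is not set.

    Assumption: the value is an integer level; the real parsing may differ.
    """
    raw = environ.get(AIR_VERBOSITY_ENV)
    if raw is None:
        return None
    return int(raw)


def effective_legacy_verbosity(
    legacy_verbosity: int, environ: Mapping[str, str] = os.environ
) -> int:
    """Treat the old Tune verbosity as silent (0) once AIR_VERBOSITY is set."""
    if get_air_verbosity(environ) is not None:
        return 0
    return legacy_verbosity
```

Passing the environment in as a mapping keeps the gate easy to exercise in tests without mutating `os.environ`.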

1. ray-project#33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `":octopus: Tune tests and examples (medium)"`.
2. ray-project#33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).
3. ray-project#33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).

**Note:** There should probably be a better way of handling dependencies in CI tests...

Signed-off-by: Jonathan Carter <jonathan.carter@magd.ox.ac.uk>

…ray-project#33880) The new experimental output path (ray-project#33609) and the new experimental execution engine (ray-project#33499) currently don't work together because the new output path is calling a trial executor API directly (which does not exist in the experimental execution engine). This PR calls the correct API on the TrialRunner/TuneController so both features can be tested together. Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: Jonathan Carter <jonathan.carter@magd.ox.ac.uk>
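The incompatibility described above can be illustrated with a small sketch. Every class and method name here is hypothetical and is not Ray's actual API. The only point is that a reporter which reaches into `trial_executor` breaks on an engine that has no such attribute, while a call on the controller itself works for both.

```python
class TrialExecutor:
    """Hypothetical stand-in for the legacy trial executor."""

    def debug_string(self) -> str:
        return "2/8 CPUs"


class LegacyRunner:
    """Hypothetical legacy runner that owns a trial executor."""

    def __init__(self) -> None:
        self.trial_executor = TrialExecutor()

    def resource_string(self) -> str:
        # Delegates internally, so callers never touch the executor.
        return self.trial_executor.debug_string()


class NewController:
    """Hypothetical new engine: there is no trial_executor attribute."""

    def resource_string(self) -> str:
        return "2/8 CPUs"


def report_resources(controller) -> str:
    # Call the controller-level method rather than
    # controller.trial_executor, which the new engine does not have.
    return f"Resources: {controller.resource_string()}"
```

With this shape, the output code depends only on a method that both engines provide, so the two experimental features can be enabled together.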

@jianoaix

* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping into many of the CI/CD builds in our pipeline. The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit that by default is on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we look for might not be there. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit. You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check goes back 2 years to our veteran Kai (234b015), who added it to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further. Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Improve wheel commit validation error message Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Setup dependencies and credential for GCE in buildkite Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Add google-cloud-storage package to requirements Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Add new lines to some files Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Support for gs:// in anyscale job runner Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Correct adding gce tests Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Support test definition with multiple flavors Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Use not in to check key in dict Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Debugging Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Debugging Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Debugging 2 Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Debugging 03 Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Remove temporary logs Signed-off-by: Cuong Nguyen <can@anyscale.com>
* -s
* Update flavors Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Only initialize gs client on gs host Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Lint Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Update image for Sematic integration (#33469)
* [RLlib] fix preprocessor test (#33719) Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
* [RLlib] APPO TF with RLModule and Learner API (#33310) Signed-off-by: Avnish <avnishnarayan@gmail.com>
* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665) This makes Java logging consistent with PR #31772, which seems to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console. Co-authored-by: Qing Wang <kingchin1218@126.com>
* [serve] Fix serve HA test (#33699)
* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731) <img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png"> This reverts commit cb5bb0e.
  - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
  - [ ] I've run `scripts/format.sh` to lint the changes in this PR.
  - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
  - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  - Testing Strategy
    - [ ] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(
* [tune] add data to CI test dependencies (#33729)
  1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `":octopus: Tune tests and examples (medium)"`.
  2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).
  3. #33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).

  **Note:** There should probably be a better way of handling dependencies in CI tests...
* [Test] Fix test event test timeout (#33704)
* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747) This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.
* [runtime env] Close schema after loading and continue on error (#33535) This PR fixes a few things:
  * A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
  * Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
  * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.

  **Steps to Reproduce**
  1. Save this script as `test.py`
```python
import ray


@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True


ray.init()
print(ray.get(my_fn.remote()))
```
  2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
  3. a. save `:` or other invalid JSON as `bad-json.json` b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

  This PR fixes the issue and adds a new test case. Signed-off-by: James Clark <james.clark@zapatacomputing.com>
* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259) In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata). Previously, when submitting a job, we (1) checked if the info already exists in the internal KV, and then (2) put the new info and job ID into the internal KV. This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name. This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value). Also adds a unit test which fails without this change.
* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733) Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740) Additionally fix `test_usage_test.py`. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* Deprecate RuntimeContext.get (#33734) RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of get_xxx_id() methods. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
* [Serve] Fix the serve.batch api doc (#33588) Fix the example formatting in the serve batch API doc
* [infra] increase Build timeout (#33756) Why are these changes needed? Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g. https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2
* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713) It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.
* [Test] Fix the failing workflow test_dataset after streaming executor is enabled. (#33736) It looks like the workflow starts a 1-CPU cluster, and it has its own remote task that uses 1 CPU, which blocks scheduling of dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I think that doesn't really make sense (since the user probably just writes a regular function inside). I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is mysterious why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory.)
* [Test] Fix out of disk error (#33732) Sometimes there is more than 1 OOD event if the test runs for more than 10 seconds.
I relaxed the assert condition in that case.
* [Data] Repurpose streaming CI to bulk CI (#33478) Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).
* [Serve] Enable serve metrics lib working in ray actor (#33717) Make sure ray.serve.lib works with ray.actor without a Serve context.
```
# Imports needed to run the snippet.
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics


@ray.remote
class MyActor:
    def __init__(self):
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        await self.my_actor.test.remote()
        return
```
* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648) Signed-off-by: sven1977 <svenmika1977@gmail.com>
* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761) If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.
--------- Signed-off-by: amogkam <amogkamsetty@yahoo.com> * @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com> * @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com> * Run linter Signed-off-by: Cuong Nguyen <can@anyscale.com> * Run linters Signed-off-by: Cuong Nguyen <can@anyscale.com> * Change ray to 2.3.1 to work around the #ir-glorious-shape Signed-off-by: Cuong Nguyen <can@anyscale.com> * Revert to normal ray image Signed-off-by: Cuong Nguyen <can@anyscale.com> * Fix delete_fn Signed-off-by: Cuong Nguyen <can@anyscale.com> * [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772) Add the ability for AnyscaleJobRunner to run on GCE host. The added logic: - Read from ENV variable, or the storage link, to see if this is a GCE host. If it is, has custom logic inside job file manager and runner. Both read, write and delete are supported. - Add some sample tests to use gce as an environment so we can run a CI and check that this diff works - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [X] I've run `scripts/format.sh` to lint the changes in this PR. - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825 * Run lint Signed-off-by: Cuong Nguyen <can@anyscale.com> * [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). 
(#33814)
* Set up dependencies and credentials for GCE in buildkite Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Add google-cloud-storage package to requirements Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Support for gs:// in anyscale job runner Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Correct adding gce tests Signed-off-by: Cuong Nguyen <can@anyscale.com>
* [RLlib] APPO TF with RLModule and Learner API (#33310) Signed-off-by: Avnish <avnishnarayan@gmail.com>
* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713) It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.
* @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linter Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linters Signed-off-by: Cuong Nguyen <can@anyscale.com>
* -s
* Fix some tests Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Add unit tests for test definition parser Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Fix lints Signed-off-by: Cuong Nguyen <can@anyscale.com>
* @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Check that parse_test_definition throws exception on empty variations Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Remove the constant test definition in test. Signed-off-by: Cuong Nguyen <can@anyscale.com>
* The cluster environment name does not allow the character '.', so fix that. Address Lonnie's comments and add more tests.
Signed-off-by: Cuong Nguyen <can@anyscale.com> --------- Signed-off-by: Cuong Nguyen <can@anyscale.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> Signed-off-by: Avnish <avnishnarayan@gmail.com> Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> Signed-off-by: sven1977 <svenmika1977@gmail.com> Signed-off-by: amogkam <amogkamsetty@yahoo.com> Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com> Co-authored-by: augray <augray@users.noreply.github.com> Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com> Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com> Co-authored-by: jiafu zhang <jiafu.zhang@intel.com> Co-authored-by: Qing Wang <kingchin1218@126.com> Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com> Co-authored-by: SangBin Cho <rkooo567@gmail.com> Co-authored-by: matthewdeng <matt@anyscale.com> Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com> Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com> Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com> Co-authored-by: Sihan Wang <sihanwang41@gmail.com> Co-authored-by: clarng <clarence.wyng@gmail.com> Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com> Co-authored-by: Sven Mika <svenmika1977@gmail.com> Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>

@jianoaix

* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of the CI/CD builds in our pipeline. The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit that by default is measured in KB (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we are looking for might not be there. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit. You might notice that other surprises can still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added 2 years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further. Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Improve wheel commit validation error message Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Set up dependencies and credentials for GCE in buildkite Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Add google-cloud-storage package to requirements Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Add new lines to some files Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Support for gs:// in anyscale job runner Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Correct adding gce tests Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Support test definition with multiple flavors Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Use not in to check key in dict Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Debugging Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Debugging Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Debugging 2 Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Debugging 03 Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Remove temporary logs Signed-off-by: Cuong Nguyen <can@anyscale.com>
* -s
* Update flavors Signed-off-by: Cuong Nguyen
<can@anyscale.com> * Only initialize gs client on gs host Signed-off-by: Cuong Nguyen <can@anyscale.com> * Lint Signed-off-by: Cuong Nguyen <can@anyscale.com> * Update image for Sematic integration (#33469) * [RLlib] fix preprocessor test (#33719) Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> * [RLlib] APPO TF with RLModule and Learner API (#33310) Signed-off-by: Avnish <avnishnarayan@gmail.com> * [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665) To make Java logging consistent with PR #31772 which seems for lazy worker binding. Otherwise, we may print too many logs from different drivers in shell console. Co-authored-by: Qing Wang <kingchin1218@126.com> * [serve] Fix serve HA test (#33699) * Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731) <img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png"> This reverts commit cb5bb0e.     - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( * [tune] add data to CI test dependencies (#33729) 1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `:octopus: Tune tests and examples (medium)"`. 2. 
#33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"` but was done prior to merging (1).
3. #33499 introduced a `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"` but was done prior to merging (1).

**Note:** There should probably be a better way of handling dependencies in CI tests...
* [Test] Fix test event test timeout (#33704)
* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747) This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.
* [runtime env] Close schema after loading and continue on error (#33535) This PR fixes a few things:
  * A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
  * Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
  * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.
**Steps to Reproduce**
1. Save this script as `test.py`
```python
import ray


@ray.remote(runtime_env={"env_vars": {}})
def my_fn():
    return True


ray.init()
print(ray.get(my_fn.remote()))
```
2. Run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3. a. Save `:` or other invalid JSON as `bad-json.json`
   b. Run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

This PR fixes the issue and adds a new test case. Signed-off-by: James Clark <james.clark@zapatacomputing.com>
* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259) In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata). Previously, when submitting a job, we (1) checked whether the info for the job already existed in the internal KV, and then (2) put the new info and job ID into the internal KV. This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see that the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name. This PR fixes the race condition by making operations (1) and (2) happen in a single atomic operation (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value). Also adds a unit test which fails without this change.
* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733) Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types.
(#33561)"" (#33740) Additionally fix `test_usage_test.py`. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> * Deprecate RuntimeContext.get (#33734) RuntimeContext.get exposes Cython ids instead of strings so we should deprecate it and in favor of get_xxx_id() methods. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> * [Serve] Fix the serve.batch api doc (#33588) Fix the example formatting in the serve batch API doc * [infra] increase Build timeout (#33756) Why are these changes needed? release test failing due to timeout when building the cluster env. Currently timeout is 30 minutes, but the build could take longer, e.g https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2 * [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713) It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations. * [Test] Fix the failing workflow test_dataset after streaming executor is enabled. (#33736) Looks like the workflow will start 1 CPU cluster, and it has its own remote task that uses 1 CPU which blocks scheduling dataset tasks that require CPUs. There was an option to make workflow remote task to use 0 CPU, but I think that doesn't really make sense (since user probably just writes regular function inside). I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is mysterious why it worked before streaming executor was enabled (cc @jianoaix if you have good theory. ) * [Test] Fix out of disk error (#33732) Sometimes, there are more than 1 OOD event if test runs more than 10 seconds. 
I relaxed the assert condition in that case.
* [Data] Repurpose streaming CI to bulk CI (#33478) Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).
* [Serve] Enable serve metrics lib working in ray actor (#33717) Make sure ray.serve.lib works with ray.actor without serve context.
```
import ray
import starlette.requests
from ray import serve
from ray.serve import metrics


@ray.remote
class MyActor:
    def __init__(self):
        # Serve metric created inside a plain Ray actor (no Serve context).
        self.my_counter = metrics.Counter(
            "my_ray_actor",
            description=("The number of requests to this deployment."),
            tag_keys=("my_tag",),
        )

    def test(self):
        self.my_counter.inc(tags={"my_tag": "value"})
        return "hello"


@serve.deployment(num_replicas=2)
class Model:
    def __init__(self, model_name):
        self.my_actor = MyActor.remote()

    async def __call__(self, req: starlette.requests.Request):
        await self.my_actor.test.remote()
        return
```
* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648) Signed-off-by: sven1977 <svenmika1977@gmail.com>
* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761) If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.
--------- Signed-off-by: amogkam <amogkamsetty@yahoo.com> * @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com> * @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com> * Run linter Signed-off-by: Cuong Nguyen <can@anyscale.com> * Run linters Signed-off-by: Cuong Nguyen <can@anyscale.com> * Change ray to 2.3.1 to work around the #ir-glorious-shape Signed-off-by: Cuong Nguyen <can@anyscale.com> * Revert to normal ray image Signed-off-by: Cuong Nguyen <can@anyscale.com> * Fix delete_fn Signed-off-by: Cuong Nguyen <can@anyscale.com> * [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772) Add the ability for AnyscaleJobRunner to run on GCE host. The added logic: - Read from ENV variable, or the storage link, to see if this is a GCE host. If it is, has custom logic inside job file manager and runner. Both read, write and delete are supported. - Add some sample tests to use gce as an environment so we can run a CI and check that this diff works - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [X] I've run `scripts/format.sh` to lint the changes in this PR. - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825 * Run lint Signed-off-by: Cuong Nguyen <can@anyscale.com> * [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). 
(#33814)
* Set up dependencies and credentials for GCE in buildkite Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Add google-cloud-storage package to requirements Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Support for gs:// in anyscale job runner Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Correct adding gce tests Signed-off-by: Cuong Nguyen <can@anyscale.com>
* [RLlib] APPO TF with RLModule and Learner API (#33310) Signed-off-by: Avnish <avnishnarayan@gmail.com>
* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713) It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.
* @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linter Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linters Signed-off-by: Cuong Nguyen <can@anyscale.com>
* -s
* Fix some tests Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Add unit tests for test definition parser Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Fix lints Signed-off-by: Cuong Nguyen <can@anyscale.com>
* @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Check that parse_test_definition throws exception on empty variations Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Remove the constant test definition in test.
Signed-off-by: Cuong Nguyen <can@anyscale.com> --------- Signed-off-by: Cuong Nguyen <can@anyscale.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> Signed-off-by: Avnish <avnishnarayan@gmail.com> Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> Signed-off-by: sven1977 <svenmika1977@gmail.com> Signed-off-by: amogkam <amogkamsetty@yahoo.com> Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com> Co-authored-by: augray <augray@users.noreply.github.com> Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com> Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com> Co-authored-by: jiafu zhang <jiafu.zhang@intel.com> Co-authored-by: Qing Wang <kingchin1218@126.com> Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com> Co-authored-by: SangBin Cho <rkooo567@gmail.com> Co-authored-by: matthewdeng <matt@anyscale.com> Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com> Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com> Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com> Co-authored-by: Sihan Wang <sihanwang41@gmail.com> Co-authored-by: clarng <clarence.wyng@gmail.com> Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com> Co-authored-by: Sven Mika <svenmika1977@gmail.com> Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
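
The Jobs race-condition fix described in the commit message above (#33259) replaces a check-then-put sequence with one atomic put-if-absent. Below is a minimal sketch of that idea only, not Ray's actual Jobs code. It assumes a hypothetical in-memory `InMemoryKV` class whose `put(..., overwrite=False)` returns the number of keys newly added, mirroring what the commit message says about `internal_kv_put`. The names `InMemoryKV`, `submit_job`, and `race` are made up for illustration.

```python
import threading


class InMemoryKV:
    """Hypothetical stand-in for a KV store with put-if-absent semantics."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value, overwrite=True):
        """Store value and return the number of keys newly added (0 or 1)."""
        with self._lock:
            is_new = key not in self._data
            if is_new or overwrite:
                self._data[key] = value
            return 1 if is_new else 0


def submit_job(kv, submission_id, info):
    """Claim submission_id with one atomic put instead of check-then-put."""
    added = kv.put(submission_id, info, overwrite=False)
    if added == 0:
        # Another submission already claimed this id; fail with a clear error.
        raise ValueError(f"Job with submission_id {submission_id} already exists.")
    return submission_id


def race(kv, submission_id, n_threads=8):
    """Submit the same id from several threads; return (successes, failures)."""
    results = []

    def worker(i):
        try:
            submit_job(kv, submission_id, {"owner": i})
            results.append(True)
        except ValueError:
            results.append(False)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results.count(True), results.count(False)
```

Because the existence check and the write happen under the same lock, exactly one of the concurrent submissions can win, and the losers get a readable error rather than a later failure about duplicate named actors.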

Signed-off-by: Cuong Nguyen <can@anyscale.com> --------- Signed-off-by: Cuong Nguyen <can@anyscale.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> Signed-off-by: Avnish <avnishnarayan@gmail.com> Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> Signed-off-by: sven1977 <svenmika1977@gmail.com> Signed-off-by: amogkam <amogkamsetty@yahoo.com> Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com> Co-authored-by: augray <augray@users.noreply.github.com> Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com> Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com> Co-authored-by: jiafu zhang <jiafu.zhang@intel.com> Co-authored-by: Qing Wang <kingchin1218@126.com> Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com> Co-authored-by: SangBin Cho <rkooo567@gmail.com> Co-authored-by: matthewdeng <matt@anyscale.com> Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com> Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com> Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com> Co-authored-by: Sihan Wang <sihanwang41@gmail.com> Co-authored-by: clarng <clarence.wyng@gmail.com> Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com> Co-authored-by: Sven Mika <svenmika1977@gmail.com> Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>

@jianoaix

* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping into many of the CI/CD builds in our pipeline. The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit that is by default on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we look for might not be there. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit. Other surprises might still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added 2 years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further. Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Improve wheel commit validation error message Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Set up dependencies and credentials for GCE in Buildkite Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Add google-cloud-storage package to requirements Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Add new lines to some files Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Support for gs:// in anyscale job runner Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Correct adding gce tests Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Support test definition with multiple flavors Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Use not in to check key in dict Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Debugging Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Debugging Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Debugging 2 Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Debugging 03 Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Remove temporary logs Signed-off-by: Cuong Nguyen <can@anyscale.com>
* -s
* Update flavors Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Only initialize gs client on gs host Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Lint Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Update image for Sematic integration (#33469)
* [RLlib] fix preprocessor test (#33719) Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
* [RLlib] APPO TF with RLModule and Learner API (#33310) Signed-off-by: Avnish <avnishnarayan@gmail.com>
* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665) This makes Java logging consistent with PR #31772, which seems to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console. Co-authored-by: Qing Wang <kingchin1218@126.com>
* [serve] Fix serve HA test (#33699)
* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731) <img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png"> This reverts commit cb5bb0e.
* [tune] add data to CI test dependencies (#33729)
  1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `":octopus: Tune tests and examples (medium)"`.
  2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).
  3. #33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).

  **Note:** There should probably be a better way of handling dependencies in CI tests...
* [Test] Fix test event test timeout (#33704)
* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747) This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.
* [runtime env] Close schema after loading and continue on error (#33535) This PR fixes a few things:
  * A warning from not closing the file opened with `open()`. (We treat these warnings as errors, and Ray was causing some integration tests to blink.)
  * Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file could not be decoded or the file didn't exist.
  * There was a test for invalid JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.

  **Steps to Reproduce**
  1. Save this script as `test.py`
     ```python
     import ray

     @ray.remote(runtime_env={"env_vars": {}})
     def my_fn():
         return True

     ray.init()
     print(ray.get(my_fn.remote()))
     ```
  2. Run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
  3. a. Save `:` or other invalid JSON as `bad-json.json`
     b. Run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

  This PR fixes the issue and adds a new test case. Signed-off-by: James Clark <james.clark@zapatacomputing.com>
* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259) In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata). Previously, when submitting a job, we (1) checked whether the info already exists in the internal KV, and then (2) put the new info and job ID into the internal KV. This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see that the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name. This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int, which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value). Also adds a unit test which fails without this change.
* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733) Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries on REDIS_REPLY_ERROR, which is 6, and also prints out the error message. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)"" (#33740) Additionally fix `test_usage_test.py`. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* Deprecate RuntimeContext.get (#33734) RuntimeContext.get exposes Cython ids instead of strings, so we should deprecate it in favor of the get_xxx_id() methods. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
* [Serve] Fix the serve.batch api doc (#33588) Fix the example formatting in the serve batch API doc
* [infra] increase Build timeout (#33756) Why are these changes needed? Release tests are failing due to a timeout when building the cluster env. The timeout is currently 30 minutes, but the build can take longer, e.g. https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2
* [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713) It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.
* [Test] Fix the failing workflow test_dataset after streaming executor is enabled. (#33736) It looks like the workflow starts a 1-CPU cluster, and it has its own remote task that uses 1 CPU, which blocks scheduling of dataset tasks that require CPUs. There was an option to make the workflow remote task use 0 CPUs, but I think that doesn't really make sense (since the user probably just writes a regular function inside). I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is mysterious why it worked before the streaming executor was enabled (cc @jianoaix if you have a good theory).
* [Test] Fix out of disk error (#33732) Sometimes there is more than 1 OOD event if the test runs for more than 10 seconds. I relaxed the assert condition in that case.
* [Data] Repurpose streaming CI to bulk CI (#33478) Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now).
* [Serve] Enable serve metrics lib working in ray actor (#33717) Make sure ray.serve.lib works with ray.actor without serve context.
  ```python
  import ray
  import starlette.requests
  from ray import serve
  from ray.serve import metrics


  @ray.remote
  class MyActor:
      def __init__(self):
          # Counter created inside a plain Ray actor, outside any Serve context.
          self.my_counter = metrics.Counter(
              "my_ray_actor",
              description=("The number of requests to this deployment."),
              tag_keys=("my_tag",),
          )

      def test(self):
          self.my_counter.inc(tags={"my_tag": "value"})
          return "hello"


  @serve.deployment(num_replicas=2)
  class Model:
      def __init__(self, model_name):
          self.my_actor = MyActor.remote()

      async def __call__(self, req: starlette.requests.Request):
          await self.my_actor.test.remote()
          return
  ```
* [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648) Signed-off-by: sven1977 <svenmika1977@gmail.com>
* [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761) If the user provides a collate_fn to iter_torch_batches, the collate_fn is expected to be responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>
* @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com>
* @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linter Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linters Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Change ray to 2.3.1 to work around the #ir-glorious-shape Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Revert to normal ray image Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Fix delete_fn Signed-off-by: Cuong Nguyen <can@anyscale.com>
* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772) Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
  - Read from the ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
  - Add some sample tests that use gce as an environment so we can run CI and check that this diff works
  - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
  - [X] I've run `scripts/format.sh` to lint the changes in this PR.
  - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  - Testing Strategy
    - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825
* Run lint Signed-off-by: Cuong Nguyen <can@anyscale.com>
* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)
* Set up dependencies and credentials for GCE in Buildkite Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Add google-cloud-storage package to requirements Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Support for gs:// in anyscale job runner Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Correct adding gce tests Signed-off-by: Cuong Nguyen <can@anyscale.com>
* [RLlib] APPO TF with RLModule and Learner API (#33310) Signed-off-by: Avnish <avnishnarayan@gmail.com>
* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713) It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.
* @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linter Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linters Signed-off-by: Cuong Nguyen <can@anyscale.com>
* -s
* Fix some tests Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Add unit tests for test definition parser Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Fix lints Signed-off-by: Cuong Nguyen <can@anyscale.com>
* @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Check that parse_test_definition throws exception on empty variations Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Remove the constant test definition in test. Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
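
The Jobs fix (#33259) in the log above replaces a check-then-put sequence with a single put that refuses to overwrite. A minimal sketch of that idea follows. It is not Ray's code: `FakeKV`, `submit_job`, and `race` are hypothetical names, and `FakeKV.put` only imitates the contract the commit message describes for `internal_kv_put(... overwrite=False) -> int` (return the number of keys newly added).

```python
import threading


class FakeKV:
    """Hypothetical in-memory stand-in for the internal KV store."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value, overwrite=True):
        """Store value and return the number of keys newly added (0 or 1)."""
        with self._lock:
            if key in self._data:
                if overwrite:
                    self._data[key] = value
                return 0
            self._data[key] = value
            return 1


def submit_job(kv, submission_id, info):
    """Hypothetical submit path: the put itself decides who wins."""
    # No separate existence check, so there is no window between
    # "check" and "put" for a second submitter to slip through.
    if kv.put(submission_id, info, overwrite=False) == 0:
        raise ValueError(f"Job {submission_id!r} already exists")
    return submission_id


def race(kv, submission_id, n_threads=8):
    """Submit the same id from several threads; return (successes, failures)."""
    results = []
    barrier = threading.Barrier(n_threads)

    def worker(i):
        barrier.wait()  # release all threads at once to provoke contention
        try:
            submit_job(kv, submission_id, {"owner": i})
            results.append(True)
        except ValueError:
            results.append(False)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results.count(True), results.count(False)
```

With this shape, exactly one of the concurrent submitters succeeds and the rest get a clear "already exists" error instead of failing later on a duplicate actor name.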

@jianoaix

* The cluster environment name does not allow the character '.', so fix that. Address Lonnie's comments and add more tests.
Signed-off-by: Cuong Nguyen <can@anyscale.com> --------- Signed-off-by: Cuong Nguyen <can@anyscale.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> Signed-off-by: Avnish <avnishnarayan@gmail.com> Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> Signed-off-by: sven1977 <svenmika1977@gmail.com> Signed-off-by: amogkam <amogkamsetty@yahoo.com> Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com> Co-authored-by: augray <augray@users.noreply.github.com> Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com> Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com> Co-authored-by: jiafu zhang <jiafu.zhang@intel.com> Co-authored-by: Qing Wang <kingchin1218@126.com> Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com> Co-authored-by: SangBin Cho <rkooo567@gmail.com> Co-authored-by: matthewdeng <matt@anyscale.com> Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com> Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com> Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com> Co-authored-by: Sihan Wang <sihanwang41@gmail.com> Co-authored-by: clarng <clarence.wyng@gmail.com> Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com> Co-authored-by: Sven Mika <svenmika1977@gmail.com> Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>

@jianoaix

* Fix the 'Observed wheel commit () is not expected' issue (#32156) that has been creeping into many of the CI/CD builds in our pipeline. The existing code uses a pipe to read from a rather large file (>50MB). A pipe, however, has a buffer limit that by default is on the order of kilobytes (https://man7.org/linux/man-pages/man7/pipe.7.html), so what we are looking for might not be there. We can fix this by telling unzip the exact file we are looking for. That file is pretty small, so we should not hit the buffer limit. Note that other surprises might still happen with this fix (e.g. many files that match ^__commit__). This sanity check was added 2 years ago by our veteran Kai (234b015) to catch issues with stale artifacts from previous builds or race conditions between builds. Further investigation into how the Buildkite agent multi-tenancy is set up might or might not simplify this logic further. Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Improve wheel commit validation error message Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Setup dependencies and credential for GCE in buildkite Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Add google-cloud-storage package to requirements Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Add new lines to some files Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Support for gs:// in anyscale job runner Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Correct adding gce tests Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Support test definition with multiple flavors Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Use not in to check key in dict Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Debugging Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Debugging Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Debugging 2 Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Debugging 03 Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Remove temporary logs Signed-off-by: Cuong Nguyen <can@anyscale.com>
* -s
* Update flavors Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Only initialize gs client on gs host Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Lint Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Update image for Sematic integration (#33469)
* [RLlib] fix preprocessor test (#33719) Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
* [RLlib] APPO TF with RLModule and Learner API (#33310) Signed-off-by: Avnish <avnishnarayan@gmail.com>
* [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665)
  This makes Java logging consistent with PR #31772, which seems to be for lazy worker binding. Otherwise, we may print too many logs from different drivers in the shell console. Co-authored-by: Qing Wang <kingchin1218@126.com>
* [serve] Fix serve HA test (#33699)
* Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731)
  <img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png">
  This reverts commit cb5bb0e.
  - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
  - [ ] I've run `scripts/format.sh` to lint the changes in this PR.
  - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
  - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  - Testing Strategy
    - [ ] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(
* [tune] add data to CI test dependencies (#33729)
  1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `":octopus: Tune tests and examples (medium)"`.
  2. #33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).
  3. #33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but this was done prior to merging (1).

  **Note:** There should probably be a better way of handling dependencies in CI tests...
* [Test] Fix test event test timeout (#33704)
* [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests because of LSTMs (#33728) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747)
  This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code.
* [runtime env] Close schema after loading and continue on error (#33535)
  This PR fixes a few things:
  * A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink)
  * Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist.
  * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed.
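The loading pattern that the runtime env fix above describes can be sketched roughly as follows: each schema file is opened with a context manager so it is always closed, and the loop moves on to the next path when a file is missing or is not valid JSON. The function name `load_schemas` and the list-of-paths input are hypothetical and chosen for illustration; this is not Ray's actual implementation.

```python
import json
import logging

logger = logging.getLogger(__name__)


def load_schemas(paths):
    """Load JSON schemas from `paths`, skipping any file that fails to load.

    Hypothetical helper for illustration only; not Ray's actual code.
    """
    schemas = {}
    for path in paths:
        try:
            # The context manager closes the file even if json.load raises,
            # which avoids the unclosed-file warning.
            with open(path) as f:
                schema = json.load(f)
        except (OSError, json.JSONDecodeError) as e:
            logger.warning("Skipping runtime env schema %s: %s", path, e)
            # Skip to the next path; without `continue`, the line below would
            # run with a missing or stale `schema`.
            continue
        schemas[path] = schema
    return schemas
```

Putting the bad file first in the input list exercises the case the commit message mentions, where a failure before any valid schema has been loaded would otherwise go unnoticed.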
**Steps to Reproduce**
1. Save this script as `test.py`
   ```python
   import ray

   @ray.remote(runtime_env={"env_vars": {}})
   def my_fn():
       return True

   ray.init()
   print(ray.get(my_fn.remote()))
   ```
2. Run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py`
3. a. Save `:` or other invalid JSON as `bad-json.json`
   b. Run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py`

This PR fixes the issue and adds a new test case. Signed-off-by: James Clark <james.clark@zapatacomputing.com>
* [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259)
  In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata). Previously, when submitting a job, we (1) checked whether the info for the job already existed in the internal KV, and then (2) put the new info and job ID into the internal KV. This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see that the info didn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name. This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int, which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value). Also adds a unit test which fails without this change.
* Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)
  Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries on REDIS_REPLY_ERROR, which is type 6, and also prints out the error message. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
* Revert "Revert "[tune-telemetry] Tag searcher and scheduler types.
(#33561)"" (#33740) Additionally fix `test_usage_test.py`. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> * Deprecate RuntimeContext.get (#33734) RuntimeContext.get exposes Cython ids instead of strings so we should deprecate it and in favor of get_xxx_id() methods. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> * [Serve] Fix the serve.batch api doc (#33588) Fix the example formatting in the serve batch API doc * [infra] increase Build timeout (#33756) Why are these changes needed? release test failing due to timeout when building the cluster env. Currently timeout is 30 minutes, but the build could take longer, e.g https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2 * [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713) It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations. * [Test] Fix the failing workflow test_dataset after streaming executor is enabled. (#33736) Looks like the workflow will start 1 CPU cluster, and it has its own remote task that uses 1 CPU which blocks scheduling dataset tasks that require CPUs. There was an option to make workflow remote task to use 0 CPU, but I think that doesn't really make sense (since user probably just writes regular function inside). I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is mysterious why it worked before streaming executor was enabled (cc @jianoaix if you have good theory. ) * [Test] Fix out of disk error (#33732) Sometimes, there are more than 1 OOD event if test runs more than 10 seconds. 
I alleviated the assert condition in that case. * [Data] Repurpose streaming CI to bulk CI(#33478) Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now). * [Serve] Enable serve metrics lib working in ray actor (#33717) Make sure ray.serve.lib working with ray.actor without serve context. ``` @ray.remote class MyActor: def __init__(self): self.my_counter = metrics.Counter( "my_ray_actor", description=("The number of requests to this deployment."), tag_keys=("my_tag",), ) def test(self): self.my_counter.inc(tags={"my_tag": "value"}) return "hello" @serve.deployment(num_replicas=2) class Model: def __init__(self, model_name): self.my_actor = MyActor.remote() async def __call__(self, req: starlette.requests.Request): await self.my_actor.test.remote() return ``` * [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648) Signed-off-by: sven1977 <svenmika1977@gmail.com> * [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761) If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified. 
--------- Signed-off-by: amogkam <amogkamsetty@yahoo.com> * @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com> * @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com> * Run linter Signed-off-by: Cuong Nguyen <can@anyscale.com> * Run linters Signed-off-by: Cuong Nguyen <can@anyscale.com> * Change ray to 2.3.1 to work around the #ir-glorious-shape Signed-off-by: Cuong Nguyen <can@anyscale.com> * Revert to normal ray image Signed-off-by: Cuong Nguyen <can@anyscale.com> * Fix delete_fn Signed-off-by: Cuong Nguyen <can@anyscale.com> * [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772) Add the ability for AnyscaleJobRunner to run on GCE host. The added logic: - Read from ENV variable, or the storage link, to see if this is a GCE host. If it is, has custom logic inside job file manager and runner. Both read, write and delete are supported. - Add some sample tests to use gce as an environment so we can run a CI and check that this diff works - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [X] I've run `scripts/format.sh` to lint the changes in this PR. - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825 * Run lint Signed-off-by: Cuong Nguyen <can@anyscale.com> * [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). 
(#33814) * Setup dependencies and crendential for GCE in buildkite Signed-off-by: Cuong Nguyen <can@anyscale.com> * Add google-cloud-storage package to requirements Signed-off-by: Cuong Nguyen <can@anyscale.com> * Support for gs:// in anyscale job runner Signed-off-by: Cuong Nguyen <can@anyscale.com> * Correct adding gce tests Signed-off-by: Cuong Nguyen <can@anyscale.com> * [RLlib] APPO TF with RLModule and Learner API (#33310) Signed-off-by: Avnish <avnishnarayan@gmail.com> * [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713) It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations. * @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com> * Run linter Signed-off-by: Cuong Nguyen <can@anyscale.com> * Run linters Signed-off-by: Cuong Nguyen <can@anyscale.com> * -s * Fix some tests Signed-off-by: Cuong Nguyen <can@anyscale.com> * Add unit tests for test definition parser Signed-off-by: Cuong Nguyen <can@anyscale.com> * Fix lints Signed-off-by: Cuong Nguyen <can@anyscale.com> * @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com> * Check that parse_test_definition throws exception on empty variations Signed-off-by: Cuong Nguyen <can@anyscale.com> * Remove the constant test definition in test. 
Signed-off-by: Cuong Nguyen <can@anyscale.com> --------- Signed-off-by: Cuong Nguyen <can@anyscale.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> Signed-off-by: Avnish <avnishnarayan@gmail.com> Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> Signed-off-by: sven1977 <svenmika1977@gmail.com> Signed-off-by: amogkam <amogkamsetty@yahoo.com> Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com> Co-authored-by: augray <augray@users.noreply.github.com> Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com> Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com> Co-authored-by: jiafu zhang <jiafu.zhang@intel.com> Co-authored-by: Qing Wang <kingchin1218@126.com> Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com> Co-authored-by: SangBin Cho <rkooo567@gmail.com> Co-authored-by: matthewdeng <matt@anyscale.com> Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com> Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com> Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com> Co-authored-by: Sihan Wang <sihanwang41@gmail.com> Co-authored-by: clarng <clarence.wyng@gmail.com> Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com> Co-authored-by: Sven Mika <svenmika1977@gmail.com> Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
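The Jobs race-condition fix listed above (#33259) replaces a separate check-then-put with a single put that refuses to overwrite. A minimal sketch of that idea follows, using a hypothetical in-memory `FakeKV` class in place of Ray's internal KV store. Only the `overwrite=False` semantics and the "number of keys newly added" return value come from the commit message; the class, the `submit_job` helper, and the error type are assumptions for illustration.

```python
import threading


class FakeKV:
    """Hypothetical in-memory stand-in for a KV store (not Ray's internal KV)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value, overwrite=True):
        """Store `value` under `key`; return the number of keys newly added."""
        with self._lock:
            exists = key in self._data
            if exists and not overwrite:
                # Refuse to overwrite: nothing was added.
                return 0
            self._data[key] = value
            return 0 if exists else 1


def submit_job(kv, submission_id, info):
    """Register a job so that the existence check and the write are one step."""
    added = kv.put(submission_id, info, overwrite=False)
    if added == 0:
        # Another submission with the same ID got there first.
        raise ValueError(f"Job with submission_id {submission_id!r} already exists.")
    return submission_id
```

Because the check and the write happen under the same lock, two concurrent submissions with the same ID cannot both see "not present": exactly one put reports a newly added key and the other is rejected with a readable error.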

@jianoaix

* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of ci/cd builds in our pipeline. The existing code uses pipe to read from a rather large file (>50MB). Pipe however has buffer limit which by default in term of kb (https://man7.org/linux/man-pages/man7/pipe.7.html) so what we look for might not exist. We can fix this by tell unzip the exact file we are looking for. That file is pretty small so we should not hit buffer limit. You might notice other surpises might still happen with this fix (e.g. many files that match ^__commit__). This sanity check goes back to 2 years ago by our veteran Kai (234b015) to sanity check issues with stale artifacts from previous builds or race conditions between builds. Further investigation on how builkite agent multi-tenant is setup might or might not simplify this logic further. Signed-off-by: Cuong Nguyen <can@anyscale.com> * Improve wheel commit validation error message Signed-off-by: Cuong Nguyen <can@anyscale.com> * Setup dependencies and crendential for GCE in buildkite Signed-off-by: Cuong Nguyen <can@anyscale.com> * Add google-cloud-storage package to requirements Signed-off-by: Cuong Nguyen <can@anyscale.com> * Add new lines to some files Signed-off-by: Cuong Nguyen <can@anyscale.com> * Support for gs:// in anyscale job runner Signed-off-by: Cuong Nguyen <can@anyscale.com> * Correct adding gce tests Signed-off-by: Cuong Nguyen <can@anyscale.com> * Support test definition with multiple flavors Signed-off-by: Cuong Nguyen <can@anyscale.com> * Use not in to check key in dict Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging 2 Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging 03 Signed-off-by: Cuong Nguyen <can@anyscale.com> * Remove temoprary logs Signed-off-by: Cuong Nguyen <can@anyscale.com> * -s * Update flavors Signed-off-by: Cuong Nguyen 
<can@anyscale.com> * Only initialize gs client on gs host Signed-off-by: Cuong Nguyen <can@anyscale.com> * Lint Signed-off-by: Cuong Nguyen <can@anyscale.com> * Update image for Sematic integration (#33469) * [RLlib] fix preprocessor test (#33719) Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> * [RLlib] APPO TF with RLModule and Learner API (#33310) Signed-off-by: Avnish <avnishnarayan@gmail.com> * [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665) To make Java logging consistent with PR #31772 which seems for lazy worker binding. Otherwise, we may print too many logs from different drivers in shell console. Co-authored-by: Qing Wang <kingchin1218@126.com> * [serve] Fix serve HA test (#33699) * Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731) <img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png"> This reverts commit cb5bb0e.     - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( * [tune] add data to CI test dependencies (#33729) 1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `:octopus: Tune tests and examples (medium)"`. 2. 
#33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1). 3. #33499 introduced a `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1). **Note:** There should probably be a better way for handling dependencies in CI tests... * [Test] Fix test event test timeout (#33704) * [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests becauase of LSTMs (#33728) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747) This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code. * [runtime env] Close schema after loading and continue on error (#33535) This PR fixes a few things: * A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink) * Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist. * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed. 
**Steps to Reproduce** 1. Save this script as `test.py` ```python import ray @ray.remote(runtime_env={"env_vars": {}}) def my_fn(): return True ray.init() print(ray.get(my_fn.remote())) ``` 2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py` 3. a. save `:` or other invalid JSON as `bad-json.json` b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py` This PR fixes the issue and adds a new test case. Signed-off-by: James Clark <james.clark@zapatacomputing.com> * [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259) In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata). Previously, when submitting a job, we (1) check if the info for already exists in the internal KV, and then (2) put the new info and job ID into the internal KV. This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name. This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value). Also adds a unit test which fails without this change. * Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733) Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> * Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. 
(#33561)"" (#33740) Additionally fix `test_usage_test.py`. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> * Deprecate RuntimeContext.get (#33734) RuntimeContext.get exposes Cython ids instead of strings so we should deprecate it and in favor of get_xxx_id() methods. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> * [Serve] Fix the serve.batch api doc (#33588) Fix the example formatting in the serve batch API doc * [infra] increase Build timeout (#33756) Why are these changes needed? release test failing due to timeout when building the cluster env. Currently timeout is 30 minutes, but the build could take longer, e.g https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2 * [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713) It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations. * [Test] Fix the failing workflow test_dataset after streaming executor is enabled. (#33736) Looks like the workflow will start 1 CPU cluster, and it has its own remote task that uses 1 CPU which blocks scheduling dataset tasks that require CPUs. There was an option to make workflow remote task to use 0 CPU, but I think that doesn't really make sense (since user probably just writes regular function inside). I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is mysterious why it worked before streaming executor was enabled (cc @jianoaix if you have good theory. ) * [Test] Fix out of disk error (#33732) Sometimes, there are more than 1 OOD event if test runs more than 10 seconds. 
I alleviated the assert condition in that case. * [Data] Repurpose streaming CI to bulk CI(#33478) Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now). * [Serve] Enable serve metrics lib working in ray actor (#33717) Make sure ray.serve.lib working with ray.actor without serve context. ``` @ray.remote class MyActor: def __init__(self): self.my_counter = metrics.Counter( "my_ray_actor", description=("The number of requests to this deployment."), tag_keys=("my_tag",), ) def test(self): self.my_counter.inc(tags={"my_tag": "value"}) return "hello" @serve.deployment(num_replicas=2) class Model: def __init__(self, model_name): self.my_actor = MyActor.remote() async def __call__(self, req: starlette.requests.Request): await self.my_actor.test.remote() return ``` * [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648) Signed-off-by: sven1977 <svenmika1977@gmail.com> * [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761) If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified. 
  ---------
  Signed-off-by: amogkam <amogkamsetty@yahoo.com>
* @aslonnie's comments
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* @aslonnie's comments
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linter
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linters
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Change ray to 2.3.1 to work around the #ir-glorious-shape
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Revert to normal ray image
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Fix delete_fn
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772)
  Add the ability for AnyscaleJobRunner to run on GCE host. The added logic:
  - Read from an ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
  - Add some sample tests to use gce as an environment so we can run a CI and check that this diff works
  - [X] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
  - [X] I've run `scripts/format.sh` to lint the changes in this PR.
  - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  - Testing Strategy
    - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825
* Run lint
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)
* Setup dependencies and credential for GCE in buildkite
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Add google-cloud-storage package to requirements
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Support for gs:// in anyscale job runner
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Correct adding gce tests
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* [RLlib] APPO TF with RLModule and Learner API (#33310)
  Signed-off-by: Avnish <avnishnarayan@gmail.com>
* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713)
  It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations.
* @aslonnie's comments
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linter
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linters
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* -s
* Fix some tests
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Add unit tests for test definition parser
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Fix lints
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* @aslonnie's comments
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Check that parse_test_definition throws exception on empty variations
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Remove the constant test definition in test.
  Signed-off-by: Cuong Nguyen <can@anyscale.com>
* The cluster environment name does not allow the character '.', so fix that. Address Lonnie's comments and add more tests.
  Signed-off-by: Cuong Nguyen <can@anyscale.com>

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
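
The GCE commit in the log above says the runner reads an ENV variable, or the storage link, to decide whether it is on a GCE host. A minimal sketch of that kind of check could look like the following. The variable name `RELEASE_TEST_CLOUD`, its value `gce`, and the helper `is_gce_host` are assumptions made for illustration and are not taken from the Ray release tooling; only the `gs://` scheme comes from the commit titles.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlparse

# Hypothetical names: the real release tooling may use a different variable.
GCE_ENV_VAR = "RELEASE_TEST_CLOUD"
GCE_ENV_VALUE = "gce"


def is_gce_host(
    storage_url: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Return True if the env variable or the storage link points at GCE."""
    env = os.environ if env is None else env
    # First signal: an explicit environment variable.
    if env.get(GCE_ENV_VAR, "").lower() == GCE_ENV_VALUE:
        return True
    # Second signal: a gs:// storage link.
    return bool(storage_url) and urlparse(storage_url).scheme == "gs"
```

Passing `env` explicitly keeps the check easy to unit test; a caller that wants the real process environment can simply omit it.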

@jianoaix

* Fix 'Observed wheel commit () is not expected' issue (#32156) that has been creeping through many of ci/cd builds in our pipeline. The existing code uses pipe to read from a rather large file (>50MB). Pipe however has buffer limit which by default in term of kb (https://man7.org/linux/man-pages/man7/pipe.7.html) so what we look for might not exist. We can fix this by tell unzip the exact file we are looking for. That file is pretty small so we should not hit buffer limit. You might notice other surpises might still happen with this fix (e.g. many files that match ^__commit__). This sanity check goes back to 2 years ago by our veteran Kai (234b015) to sanity check issues with stale artifacts from previous builds or race conditions between builds. Further investigation on how builkite agent multi-tenant is setup might or might not simplify this logic further. Signed-off-by: Cuong Nguyen <can@anyscale.com> * Improve wheel commit validation error message Signed-off-by: Cuong Nguyen <can@anyscale.com> * Setup dependencies and crendential for GCE in buildkite Signed-off-by: Cuong Nguyen <can@anyscale.com> * Add google-cloud-storage package to requirements Signed-off-by: Cuong Nguyen <can@anyscale.com> * Add new lines to some files Signed-off-by: Cuong Nguyen <can@anyscale.com> * Support for gs:// in anyscale job runner Signed-off-by: Cuong Nguyen <can@anyscale.com> * Correct adding gce tests Signed-off-by: Cuong Nguyen <can@anyscale.com> * Support test definition with multiple flavors Signed-off-by: Cuong Nguyen <can@anyscale.com> * Use not in to check key in dict Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging 2 Signed-off-by: Cuong Nguyen <can@anyscale.com> * Debugging 03 Signed-off-by: Cuong Nguyen <can@anyscale.com> * Remove temoprary logs Signed-off-by: Cuong Nguyen <can@anyscale.com> * -s * Update flavors Signed-off-by: Cuong Nguyen 
<can@anyscale.com> * Only initialize gs client on gs host Signed-off-by: Cuong Nguyen <can@anyscale.com> * Lint Signed-off-by: Cuong Nguyen <can@anyscale.com> * Update image for Sematic integration (#33469) * [RLlib] fix preprocessor test (#33719) Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> * [RLlib] APPO TF with RLModule and Learner API (#33310) Signed-off-by: Avnish <avnishnarayan@gmail.com> * [Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to make Java logging consistent with Python (#33665) To make Java logging consistent with PR #31772 which seems for lazy worker binding. Otherwise, we may print too many logs from different drivers in shell console. Co-authored-by: Qing Wang <kingchin1218@126.com> * [serve] Fix serve HA test (#33699) * Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" (#33731) <img width="762" alt="Screen Shot 2023-03-26 at 7 54 30 PM" src="https://user-images.githubusercontent.com/18510752/227829626-001349f1-218e-4538-98c1-851f3dcf8a0e.png"> This reverts commit cb5bb0e.     - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( * [tune] add data to CI test dependencies (#33729) 1. #33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `:octopus: Tune tests and examples (medium)"`. 2. 
#33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1). 3. #33499 introduced a `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `:octopus: Tune tests and examples (medium)"` but was done prior to merging (1). **Note:** There should probably be a better way for handling dependencies in CI tests... * [Test] Fix test event test timeout (#33704) * [RLlib] Fixed a typo in multi-agent definition using RLModules in test_env_runner_v2::test_guess_the_number_multi_agent (#33723) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [RLlib][RLModule] Disabled rl_module in one of the subtests in test_curiosity (#33726) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [RLlib][RLModule] Disabled RLModule in Two trainer workflow example (#33727) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu tests becauase of LSTMs (#33728) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#32747) This PR adds a benchmark for operator fusion, where we're interested in the performance of operators that have been fused into a single task. This primarily tests our fusion rule and data layer code. * [runtime env] Close schema after loading and continue on error (#33535) This PR fixes a few things: * A warning from not closing the file opened with `open()`. (We have these warnings as errors and Ray was causing some integration tests to blink) * Using a custom runtime env schema with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS` would result in a failure when the JSON file is incorrectly decoded or the file doesn't exist. * There was a test for invalid decoded JSON, but by chance it ran *after* a previous schema, meaning the missing `continue` wasn't noticed. 
**Steps to Reproduce** 1. Save this script as `test.py` ```python import ray @ray.remote(runtime_env={"env_vars": {}}) def my_fn(): return True ray.init() print(ray.get(my_fn.remote())) ``` 2. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./non-exist.json python test.py` 3. a. save `:` or other invalid JSON as `bad-json.json` b. run with `RAY_RUNTIME_ENV_PLUGIN_SCHEMAS=./bad-json.json python test.py` This PR fixes the issue and adds a new test case. Signed-off-by: James Clark <james.clark@zapatacomputing.com> * [Jobs] Fix race condition on submitting multiple jobs with the same id (#33259) In the internal KV store, we store a map of Job IDs to their JobInfo (containing Ray Jobs API metadata). Previously, when submitting a job, we (1) check if the info for already exists in the internal KV, and then (2) put the new info and job ID into the internal KV. This caused a race condition when two jobs with the same submission_id were submitted within a second or so of each other. Both jobs would see the info doesn't already exist, so both would try to go ahead with the job submission. This would eventually fail with an unfriendly internal error about named actors (JobSupervisor actor) having the same name. This PR fixes the race condition by making operations (1) and (2) happen at the same time (this is already supported by internal_kv_put(... overwrite=False) -> int which returns the number of keys newly added; this PR just updates the Jobs code to use overwrite=False and the return value). Also adds a unit test which fails without this change. * Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733) Encountered check failure `redis_client.cc:73: Check failed: reply->type == REDIS_REPLY_INTEGER Expected integer, found Redis type 6 for JobCounter`. This PR retries REDIS_REPLY_ERROR which is 6 and also prints out the error message. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> * Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. 
(#33561)"" (#33740) Additionally fix `test_usage_test.py`. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> * Deprecate RuntimeContext.get (#33734) RuntimeContext.get exposes Cython ids instead of strings so we should deprecate it and in favor of get_xxx_id() methods. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> * [Serve] Fix the serve.batch api doc (#33588) Fix the example formatting in the serve batch API doc * [infra] increase Build timeout (#33756) Why are these changes needed? release test failing due to timeout when building the cluster env. Currently timeout is 30 minutes, but the build could take longer, e.g https://buildkite.com/ray-project/release-tests-branch/builds/1479#0187244b-ef66-4a39-9367-3b2eb3adc9d2 * [RLlib][RLModule] Use forward_exploration() inside the unit-test for test_log_likelihood since the action_logps are not necessary fields for exploration (#33745) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> * [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713) It seems like we didn't have a test for the caching behavior, so when we enabled streaming mode, it broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset in general for all operations. * [Test] Fix the failing workflow test_dataset after streaming executor is enabled. (#33736) Looks like the workflow will start 1 CPU cluster, and it has its own remote task that uses 1 CPU which blocks scheduling dataset tasks that require CPUs. There was an option to make workflow remote task to use 0 CPU, but I think that doesn't really make sense (since user probably just writes regular function inside). I fixed the issue by explicitly allocating 2 CPUs to the cluster. It is mysterious why it worked before streaming executor was enabled (cc @jianoaix if you have good theory. ) * [Test] Fix out of disk error (#33732) Sometimes, there are more than 1 OOD event if test runs more than 10 seconds. 
I alleviated the assert condition in that case. * [Data] Repurpose streaming CI to bulk CI(#33478) Streaming executor is enabled by default. We repurpose this streaming CI to bulk so we can get some coverage of bulk (at least for now). * [Serve] Enable serve metrics lib working in ray actor (#33717) Make sure ray.serve.lib working with ray.actor without serve context. ``` @ray.remote class MyActor: def __init__(self): self.my_counter = metrics.Counter( "my_ray_actor", description=("The number of requests to this deployment."), tag_keys=("my_tag",), ) def test(self): self.my_counter.inc(tags={"my_tag": "value"}) return "hello" @serve.deployment(num_replicas=2) class Model: def __init__(self, model_name): self.my_actor = MyActor.remote() async def __call__(self, req: starlette.requests.Request): await self.my_actor.test.remote() return ``` * [RLlib] Fix: Recovered eval worker should use eval-config's policy_mapping_fn and policy_to_train fn, not the main train workers' ones. (#33648) Signed-off-by: sven1977 <svenmika1977@gmail.com> * [Data] Don't automatically move batches to device if `collate_fn` is specified (#33761) If the user provides a collate_fn to iter_torch_batches, it is expected that the collate_fn is responsible for moving tensors to the correct device. We remove the automatic device transfer if a collate_fn is specified. 
---------
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
* @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com>
* @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linter Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linters Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Change ray to 2.3.1 to work around the #ir-glorious-shape Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Revert to normal ray image Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Fix delete_fn Signed-off-by: Cuong Nguyen <can@anyscale.com>
* [CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host (#33772) Add the ability for AnyscaleJobRunner to run on a GCE host. The added logic:
  - Read from the ENV variable, or the storage link, to see if this is a GCE host. If it is, use custom logic inside the job file manager and runner. Read, write, and delete are all supported.
  - Add some sample tests that use GCE as an environment so we can run CI and check that this diff works.
  - [X] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
  - [X] I've run `scripts/format.sh` to lint the changes in this PR.
  - [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  - Testing Strategy
    - [X] CI tests: https://buildkite.com/ray-project/release-tests-pr/builds/32825
* Run lint Signed-off-by: Cuong Nguyen <can@anyscale.com>
* [RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`). (#33814)
* Setup dependencies and credentials for GCE in buildkite Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Add google-cloud-storage package to requirements Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Support for gs:// in anyscale job runner Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Correct adding gce tests Signed-off-by: Cuong Nguyen <can@anyscale.com>
* [RLlib] APPO TF with RLModule and Learner API (#33310) Signed-off-by: Avnish <avnishnarayan@gmail.com>
* [data] [streaming] Dataset.cache() doesn't work properly for streaming executor (#33713) It seems we didn't have a test for the caching behavior, so enabling streaming mode broke caching. Previously, the cache assumption relied on the eager execution behavior of Dataset for all operations.
* @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linter Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Run linters Signed-off-by: Cuong Nguyen <can@anyscale.com>
* -s
* Fix some tests Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Add unit tests for test definition parser Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Fix lints Signed-off-by: Cuong Nguyen <can@anyscale.com>
* @aslonnie's comments Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Check that parse_test_definition throws exception on empty variations Signed-off-by: Cuong Nguyen <can@anyscale.com>
* Remove the constant test definition in test. Signed-off-by: Cuong Nguyen <can@anyscale.com>
* The cluster environment name does not allow the character '.', so fix that. Address Lonnie's comments and add more tests.
Signed-off-by: Cuong Nguyen <can@anyscale.com>
---------
Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: augray <augray@users.noreply.github.com>
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: jiafu zhang <jiafu.zhang@intel.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
Co-authored-by: clarng <clarence.wyng@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
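The last commit entry above notes that cluster environment names cannot contain the character '.'. The log does not say how the fix was implemented, so the following is only one plausible sketch: `sanitize_cluster_env_name` is a hypothetical helper, and replacing each dot with a hyphen is an assumption.

```python
def sanitize_cluster_env_name(name: str, replacement: str = "-") -> str:
    """Return `name` with every '.' replaced (assumed fix; replacement is a guess)."""
    if "." in replacement:
        raise ValueError("replacement must not contain '.'")
    return name.replace(".", replacement)


# A version string such as "2.3.1" is a typical source of dots in a name.
print(sanitize_cluster_env_name("ray-2.3.1-gce"))
```

Rejecting a replacement that itself contains a dot keeps the helper from reintroducing the forbidden character.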


Signed-off-by: Cuong Nguyen <can@anyscale.com> --------- Signed-off-by: Cuong Nguyen <can@anyscale.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> Signed-off-by: Avnish <avnishnarayan@gmail.com> Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> Signed-off-by: sven1977 <svenmika1977@gmail.com> Signed-off-by: amogkam <amogkamsetty@yahoo.com> Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com> Co-authored-by: augray <augray@users.noreply.github.com> Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com> Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com> Co-authored-by: jiafu zhang <jiafu.zhang@intel.com> Co-authored-by: Qing Wang <kingchin1218@126.com> Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com> Co-authored-by: SangBin Cho <rkooo567@gmail.com> Co-authored-by: matthewdeng <matt@anyscale.com> Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com> Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> Co-authored-by: James Clark <70290797+jamesclark-Zapata@users.noreply.github.com> Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com> Co-authored-by: Sihan Wang <sihanwang41@gmail.com> Co-authored-by: clarng <clarence.wyng@gmail.com> Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Jian Xiao <99709935+jianoaix@users.noreply.github.com> Co-authored-by: Sven Mika <svenmika1977@gmail.com> Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>

…ay-project#33609)

* implement a separate path for AirProgressReporter and ResultCallback.
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* debug on ws.
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* callback without num_samples.
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* fix getattribute
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* polish rich table. some minor fixes.
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* polish tune's case.
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* polish train's case
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* current_best_trial
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* try running ci using the new AIR_VERBOSITY
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* minor fixes
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* fix _create_default_callback call site.
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* do not hardcode basic info table height.
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* separate buildkite pipeline.
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* use nested table.
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* lint
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* address comments
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

---------

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: elliottower <elliot@elliottower.com>

1. ray-project#33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `":octopus: Tune tests and examples (medium)"`.
2. ray-project#33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but the copy was made before (1) was merged.
3. ray-project#33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but the copy was made before (1) was merged.

**Note:** There should probably be a better way to handle dependencies in CI tests...

Signed-off-by: elliottower <elliot@elliottower.com>

…ray-project#33880) The new experimental output path (ray-project#33609) and the new experimental execution engine (ray-project#33499) currently don't work together because the new output path is calling a trial executor API directly (which does not exist in the experimental execution engine). This PR calls the correct API on the TrialRunner/TuneController so both features can be tested together. Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: elliottower <elliot@elliottower.com>
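The incompatibility described above can be illustrated with a small sketch. Every class and method name here (`LegacyExecutor`, `LegacyRunner`, `NewController`, `resource_summary`, `report_resources`) is hypothetical and chosen only for illustration; none of them is taken from the Ray codebase. The point is only that a reporter which goes through a runner-level method works with both engines, while one that reaches into an executor attribute would not.

```python
class LegacyExecutor:
    """Hypothetical executor owned by the legacy runner."""

    def resource_summary(self) -> str:
        return "2/4 CPUs"


class LegacyRunner:
    """Hypothetical legacy runner that delegates to its executor."""

    def __init__(self) -> None:
        self.trial_executor = LegacyExecutor()

    def resource_summary(self) -> str:
        # Runner-level API that hides the executor from callers.
        return self.trial_executor.resource_summary()


class NewController:
    """Hypothetical new engine that has no separate executor object."""

    def resource_summary(self) -> str:
        return "3/4 CPUs"


def report_resources(runner) -> str:
    # Use only the runner-level method; do not touch runner.trial_executor,
    # which the hypothetical new engine does not provide.
    return f"Resources: {runner.resource_summary()}"


print(report_resources(LegacyRunner()))
print(report_resources(NewController()))
```

Keeping the reporter dependent on one narrow method means either engine can be swapped in without changing the output code.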

…ay-project#33609)

* implement a separate path for AirProgressReporter and ResultCallback.
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* debug on ws.
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* callback without num_samples.
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* fix getattribute
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* polish rich table. some minor fixes.
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* polish tune's case.
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* polish train's case
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* current_best_trial
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* try running ci using the new AIR_VERBOSITY
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* minor fixes
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* fix _create_default_callback call site.
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* do not hardcode basic info table height.
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* separate buildkite pipeline.
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* use nested table.
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* lint
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* address comments
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

---------

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: Jack He <jackhe2345@gmail.com>

1. ray-project#33565 introduced `DATA_PROCESSING_TESTING=1` as a requirement to `":octopus: Tune tests and examples (medium)"`.
2. ray-project#33609 introduced `":octopus: :spiral_note_pad: New output: Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but the copy was made before (1) was merged.
3. ray-project#33499 introduced `":octopus: :sunny: New execution path:Tune tests and examples (medium)"` as a copy of `":octopus: Tune tests and examples (medium)"`, but the copy was made before (1) was merged.

**Note:** There should probably be a better way to handle dependencies in CI tests...

Signed-off-by: Jack He <jackhe2345@gmail.com>

…ray-project#33880) The new experimental output path (ray-project#33609) and the new experimental execution engine (ray-project#33499) currently don't work together because the new output path is calling a trial executor API directly (which does not exist in the experimental execution engine). This PR calls the correct API on the TrialRunner/TuneController so both features can be tested together. Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: Jack He <jackhe2345@gmail.com>

xwjiang2010 added 14 commits March 20, 2023 08:42

implement a separate path for AirProgressReporter and ResultCallback.

30c7056

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

debug on ws.

407a1f8

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

Merge branch 'master' of https://github.com/ray-project/ray into console

a15538d

callback without num_samples.

f896e50

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

fix getattribute

3cce4cb

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

polish rich table. some minor fixes.

557784e

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

polish tune's case.

e5300b2

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

polish train's case

6586bd8

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

current_best_trial

748923a

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

try running ci using the new AIR_VERBOSITY

bb69ca6

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

minor fixes

6fe8bd2

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

fix _create_default_callback call site.

00631da

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

do not hardcode basic info table height.

fcbacdf

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

separate buildkite pipeline.

29fe295

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

xwjiang2010 changed the title ~~[wip] Run AIR_VERBOSITY code path through current CI.~~ [air-output] Add new console output code path (behind feature flag) Mar 23, 2023

xwjiang2010 assigned krfricke Mar 23, 2023

xwjiang2010 and others added 4 commits March 23, 2023 17:52

use nested table.

3d89cab

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

Merge branch 'ray-project:master' into console_2

dd37372

lint

8ad917e

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

Merge branch 'console_2' of https://github.com/xwjiang2010/ray into console_2

6a98110

krfricke reviewed Mar 24, 2023


xwjiang2010 and others added 2 commits March 24, 2023 08:45

address comments

5f3723a

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

Merge branch 'ray-project:master' into console_2

8bd8b47

krfricke approved these changes Mar 24, 2023


xwjiang2010 merged commit bfbe8d1 into ray-project:master Mar 24, 2023

Yard1 added a commit that referenced this pull request Mar 25, 2023

Revert "[air-output] Add new console output code path (behind feature flag) (#33609)"

f3ef7d6

This reverts commit bfbe8d1.

Yard1 mentioned this pull request Mar 25, 2023

Revert "[air-output] Add new console output code path (behind feature flag)" #33698

Closed

matthewdeng mentioned this pull request Mar 27, 2023

[tune] add data to CI test dependencies #33729

Merged

8 tasks

xwjiang2010 deleted the console_2 branch July 26, 2023 19:49



	if v is not None and (blacklisted_keys is None or k not in blacklisted_keys):
	if v is not None and k not in exclude:
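
The two conditions above differ in how they treat a missing key set: the first guards against `blacklisted_keys` being `None`, while the second assumes `exclude` is always a collection. A minimal sketch of a filter built around the second form follows; the function name `filter_items` and its signature are assumptions for illustration and are not taken from the PR. Normalizing `None` to an empty tuple once, up front, is what lets the loop condition stay short.

```python
from typing import Any, Collection, Dict, Optional


def filter_items(
    data: Dict[str, Any], exclude: Optional[Collection[str]] = None
) -> Dict[str, Any]:
    """Keep entries whose value is not None and whose key is not excluded."""
    # Normalize once so the membership test below never sees None.
    exclude = exclude if exclude is not None else ()
    result = {}
    for k, v in data.items():
        if v is not None and k not in exclude:
            result[k] = v
    return result


print(filter_items({"lr": 0.1, "done": None, "pid": 42}, exclude={"pid"}))
```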
