[ray_client]: Wait for ready and retry on ray.connect() #13376

barakmich · 2021-01-12T21:10:47Z

Why are these changes needed?

Implements wait and retry logic on ray.connect()

This takes care of the case where the server is not yet ready, but does not yet cover the case where the server drops mid-connection (still to come).
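
For readers unfamiliar with the pattern, the wait-and-retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `connect_with_retry`, `try_connect`, `max_attempts`, and `delay_s` are hypothetical names, and the sketch assumes the connection attempt raises `ConnectionError` while the server is not yet ready.

```python
import time


def connect_with_retry(try_connect, max_attempts=3, delay_s=1.0, sleep=time.sleep):
    """Call try_connect() until it succeeds or max_attempts is exhausted.

    try_connect is a hypothetical callable: it is assumed to raise
    ConnectionError while the server is not ready and to return a
    connection object once it is.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return try_connect()
        except ConnectionError as exc:
            last_error = exc
            # Only wait if another attempt will follow.
            if attempt < max_attempts:
                sleep(delay_s)
    raise ConnectionError(
        f"server not ready after {max_attempts} attempts"
    ) from last_error
```

Counting attempts rather than retries means at least one connection attempt is always made. As in the PR, this covers only the "server not yet ready" case; a connection that drops after it has been established would need separate handling.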

Related issue number

First half of #13353

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Change-Id: Ie443be60c33ab7d6da406b3dcaa57fbb7ba57dd6

Change-Id: I30f8e870bbd5f8859a9f11ae244e210f077cedd0

python/ray/util/client/worker.py

ericl

Probably should add one to the retries param

Change-Id: I43f5378322029267ddd69f518ce8206876e2129d

…13376)

* [ray_client]: wait until connection ready
  Change-Id: Ie443be60c33ab7d6da406b3dcaa57fbb7ba57dd6
* lint
  Change-Id: I30f8e870bbd5f8859a9f11ae244e210f077cedd0
* docs and retry minimum
  Change-Id: I43f5378322029267ddd69f518ce8206876e2129d

* [RLlib] Increasing reusability v0 (#8) * Set up CI with Azure Pipelines Specifically, we are setting a travis like ADO pipeline following what is already present in the .travis.yml file in the root of the repo. * Separating travis like pipeline from main pipeline * Adding Jenkings jobs equivalent * Making some improvements * Adding validation of the upstream CI * Disabling Tune and large memory tests * Changing threshold for simple reservoir sampling test * Addressing comments * Updating Azure Pipelines with travis updates * Updating Azure Pipelines with more travis updates * Updating CI with new cpp worker tests * Setting code owners * Fixing the version number generation * Making main pipeline also our release pipeline * Updating Azure Pipelines with travis updates * Fixing wheels test * Fixing codeowners * Updating Azure Pipelines with travis updates * Bumping up MACOSX_DEPLOYMENT_TARGET * Updating Azure Pipelines with travis updates * Updating Azure Pipelines with travis updates * Updating Azure Pipelines with travis updates * Disabling Serve tests * Making explicit which branches GitHubActions workflows should watch * Desabling Ray serve tests * Installing numpy explicitly * consolidating Ray test steps in one yml * Making worker set, apex and ppo a little bit more reusable for custom agents * Making Dynamic TF policy more reusable * Allow the actions dict carry user data defined for the episodes * Forcing RLlib tests to run always * Making SAC model more extensible * Adapting exploration API * Reverting the random worker index change * Making epsilon configurable * Fixing method doc * Fixing aguments check in reset_schedule * Fixing per worker epsilon greedy * Activating logs for failing test * Making original_space check more roboust * Allow normalized actions rescaling happend outside RLlib * Passing infos values from agents to callbacks * Installing node js using a task * Adding kwargs in TFModels * Fixing npm and node in mac * Fixing the num workers 
value passed * Forcing RLlib tests * Merging 0.8.5 * Running some RLlib test in custom agent * Adding echo bazelisk * Force CI * Force CI * Relaxing an installation * Using container jobs * Fixing container jobs * Change base image for container job * Install with sude * Exec with sudo * Test * Changing agent pool * Remove python selection * Fix version replacement * Fix version replacement * Trying Bazel * Installing node with sudo * Run all install as sudo * Reverting sudo -s * Fixing omitted param * install python manually * Fixing missing param * Making NVM available * Fix nvm installation * Fix copye-paste * renaming to req file * fix typo * Install JDK 8 * Install req in other jobs * Install JDK with sudo * Removing docker clean up * Install Docker * fix installation issue * Adding azure package source * Fix docker permissions * Install jq * downloading with sudo * Install llvm as root * Skiping flaky test * copy artifacts as sudo * Fix Bazel build in MacOS (#23) * Fixing mac os building issue * Bazelisk check * Increase bazel version * Fixing typos * Update hash * Include unzip * Improved compilation and convergence tests Added compilation tests that follow proper PyTest conventions. These tests use parametrized settings, and allow for multiple algorithms to be tested with a single test. I've commented out tests these two tests can replace, to show the improvement. Only about half of the algorithms have been transitioned to the new tests in interest of keeping the PR small. 
* Increasing bazel version * Increasing bazel version only mac pipelines * Printing system info in Ubuntu wheels pipeline * making docker install optional * Compilation and convergence tests for more algos Added compilation and convergence tests for Apex DQN, Apex DDPG Added convergence tests for SAC Removed old (commented out) compilation test code from `rllib.agents.dqn.tests.test_apex` * Clean up Deleted old (commented out) test code * Updated BUILD file Split tests into test_compilation and test_learning.py to work with BAZEL build files. * Updated BUILD file Fixed bug in BUILD - wrong files passed in. * BugFix: Improper imports causing test failures * BugFix: Improper imports causing test failures * Removed test_appo from BUILD file * Fixing copy-paste error * Applying some bazel fixes * Fixing installation issues * Update hash * Fixing NVM/NODE installation * Applying latest changes in travis.yml * Fixing fixture data exclusions * Disable some java tests * Adgudime/apex sac (#25) * WIP: Compilation tests work * Fixed bugs with Apex SAC continuous action spaces * Bugfix: Bad imports * Fixing PyArrow issue * Fixing guava check * Fix datetime java format * Fixing Bazel issues finding or loading conftest * Fixing pytest module loading order * Trying different approach to pickle check * Installing latest pickle5 explicitly * Fixing conftest resolution * Temporarily disabling pickle5 validation * Fixing fixture data exclusions * Fixing data files treated as src * Disable some java tests Co-authored-by: Edilmo Palencia <edilmo@gmail.com> * Fix multiple CI errors * Update hash * Fixing more build issues * Fixing more build issues * Fix pipeline cache path * More fixes * Fix cache * Fixing bazel test command * Fix bazel test * Allowing custom sumarize episodes * Adding custom metrics ops in exec plan * Apex SAC exploration should be stochastic * Leting DQN deal with rechaping for Discrete spaces * Commenting the cache Co-authored-by: SangBin Cho <rkooo567@gmail.com> 
Co-authored-by: Simon Mo <xmo@berkeley.edu> Co-authored-by: Sven Mika <sven@anyscale.io> Co-authored-by: ijrsvt <ian.rodney@gmail.com> Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu> Co-authored-by: Rüdiger Busche <rbusche@posteo.net> Co-authored-by: rbusche <rbusche@inserve.de> Co-authored-by: sven1977 <svenmika1977@gmail.com> Co-authored-by: Aditya Gudimella <aditya.gudimella@gmail.com> * Fix system info step (#29) * Fix system info step (#30) * adding testing framework (#28) * adding testing framework * install kubebuilder for testing * adding crrect hash Co-authored-by: Ali Kanso <ali.kanso@microsoft.com> * add shared mem max flag * change readme * Tuned hyperparams for ApexSAC * Bugfix for exploration config. * Allowing PPO to handle async sampling (#34) * Making ppo ParallelRollouts mode configurable * Making dqn ParallelRollouts mode configurable * Making RolloutWorker generator function public * Missing argument * Stop iteration if round robim proportion is not met * fixing wheels parsing * Improving iter union stop-iteration conditions * Fixing DDPG * Fixing MADDPG * Fix tflite compat issue (#35) * Fix tflite compat issue * Fixing iter corner case * Manual stride with elipsis * Fix unecesary stop iteration * Allow replay ops to stop if they are unhealthy (#36) * Allow the replay ops to stop if they are unhealthy * Allowing to configure dqn execution plan consistently * Making configurable concurrency mode in DQN and metric collection in Apex (#37) * Fixing concurrency op in dqn (#38) * Replaced Prioritized Experience Replay with normal Experience replay to create AsyncSAC. * Setting prioritized_replay in config now uses PrioritizedReplay correctly. * Renamed LocalAsyncReplayBuffer and AsyncReplayActor to better reflect usage * Added test with prioritized_replay set to True * Cleaned up code. 
* Fixing manual slicing (#40) * Fixing manual slicing * Handling the Box space explicitly * Including the force stop in gather_async (#41) * Including the force stop in gather_async * Fix missing bar * Fix for gather across shards * Fix for gather async extreme case * Making env-runner an explicit iterator and Local Iterator regenerable (#42) * Making env-runner an explicit iterator And also making the LocalIterator able to regenerate. * Fix multi agent test * Fix union * Making infinite sequence explicit For the sake of the parallel iterators, one that holds an infinite sequence can be called again after a stop-iteration message. In other words, a StopIteration for an infinite sequence must be seen as a "no items available" message. * Fixing gym version * Update hash * Addressing comments * Improve gathering async and by shards (#44) * Improve gathering async and by shards * Making ParallelIteratorWorker an explicit Iterator in all cases * Making ParallelIteratorWorker an explicit Iterator in all cases * Fixing inverted condition * Removing ForceStopIteration * Make seeding possible even if env cannot be seeded. 
* Fix grep versions (#46) * Fix grep versions * Spliting the stages * Using pool for all rllib * Update hash * fixing path permissions * Changing node version * Reverting some OS changes * Fixing compilation errors * More compilation errors * More compilations errors * Fix node installation * Fixing some package versions * Using right bazel version * Fix mac os version in wheels * Fix mac os version in wheels * Some minor fixes * Force the target mac os * Fix path * Disable stress test temporarily * Fixing gitignore * Fixing Sampler merge mistakes * Fixing epsilon greddy merge mistakes and requirements versions * Fix merge error * Apply changes in travis.yml * Fix several issues * Fixing more compatibility bugs * Fix more incompatibilities * More incompatibilities * Fixing more compat issues * Disable tune horovod torch tests * Fixing more tests Co-authored-by: SangBin Cho <rkooo567@gmail.com> Co-authored-by: Simon Mo <xmo@berkeley.edu> Co-authored-by: mehrdadn <mehrdadn@users.noreply.github.com> Co-authored-by: Mehrdad <noreply@github.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: Robert Nishihara <robertnishihara@gmail.com> Co-authored-by: Sven Mika <sven@anyscale.io> Co-authored-by: Ian Rodney <ian.rodney@gmail.com> Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu> Co-authored-by: Max Fitton <mfitton@berkeley.edu> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Rüdiger Busche <rbusche@posteo.net> Co-authored-by: rbusche <rbusche@inserve.de> Co-authored-by: sven1977 <svenmika1977@gmail.com> Co-authored-by: Aditya Gudimella <aditya.gudimella@gmail.com> Co-authored-by: Ali Kanso <akanso@us.ibm.com> Co-authored-by: Ali Kanso <ali.kanso@microsoft.com> * Applying travis.yml changes * Use latest pip * Update the hash * Fix rllib issues * Fix rllib issues 2 * Fix tune errors * Fix ray issues * Remove old operator * revert some rllib test deletions * revert changes on 
release folder * Revert more changes * Logging dashboard building * Use previous docker image * Use centos docker image * more logging * Comment step * hash * installing node 14 * Fix hash Co-authored-by: Gekho457 <62982571+Gekho457@users.noreply.github.com> Co-authored-by: Alan Guo <aguo@aguo.software> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: Max Fitton <maxfitton@anyscale.com> Co-authored-by: Max Fitton <max@semprehealth.com> Co-authored-by: Kai Yang <kfstorm@outlook.com> Co-authored-by: Alex Wu <itswu.alex@gmail.com> Co-authored-by: Alex Wu <alex@anyscale.com> Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com> Co-authored-by: Barak Michener <me@barakmich.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Ian Rodney <ian.rodney@gmail.com> Co-authored-by: Raoul Khouri <69156393+raoul-khour-ts@users.noreply.github.com> Co-authored-by: Simon Mo <simon.mo@hey.com> Co-authored-by: Jack Parker-Holder <jackph@robots.ox.ac.uk> Co-authored-by: Sumanth Ratna <sumanthratna@gmail.com> Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com> Co-authored-by: Alan Guo <aguo@anyscale.com> Co-authored-by: Tao Wang <wangtaothetonic@163.com> Co-authored-by: SangBin Cho <rkooo567@gmail.com> Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu> Co-authored-by: Simon Mo <xmo@berkeley.edu> Co-authored-by: mehrdadn <mehrdadn@users.noreply.github.com> Co-authored-by: Mehrdad <noreply@github.com> Co-authored-by: Robert Nishihara <robertnishihara@gmail.com> Co-authored-by: Sven Mika <sven@anyscale.io> Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu> Co-authored-by: Max Fitton <mfitton@berkeley.edu> Co-authored-by: Rüdiger Busche <rbusche@posteo.net> Co-authored-by: rbusche <rbusche@inserve.de> Co-authored-by: sven1977 <svenmika1977@gmail.com> Co-authored-by: Aditya Gudimella <aditya.gudimella@gmail.com> Co-authored-by: Ali Kanso <akanso@us.ibm.com> Co-authored-by: Ali Kanso 
<ali.kanso@microsoft.com> * Apply changes in travis.yml * Apply changes in travis.yml * Fix hash * Fix sampler * node 14 * Fix sampler 2 * Disable flaky test * Fix tune test Co-authored-by: Alex Wu <itswu.alex@gmail.com> Co-authored-by: Sven Mika <sven@anyscale.io> Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Simon Mo <simon.mo@hey.com> Co-authored-by: fyrestone <fyrestone@outlook.com> Co-authored-by: fangfengbin <869218239a@zju.edu.cn> Co-authored-by: Barak Michener <me@barakmich.com> Co-authored-by: DK.Pino <loushang.ls@antfin.com> Co-authored-by: Ameer Haj Ali <ameer@anyscale.com> Co-authored-by: Antoni Baum <antoni.baum@protonmail.com> Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com> Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu> Co-authored-by: Max Fitton <maxfitton@anyscale.com> Co-authored-by: Corey Lowman <coreylowman@users.noreply.github.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: SangBin Cho <rkooo567@gmail.com> Co-authored-by: Sumanth Ratna <sumanthratna@gmail.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: ZhuSenlin <wumuzi520@126.com> Co-authored-by: senlin.zsl <senlin.zsl@antfin.com> Co-authored-by: Michael Luo <michael.luo123456789@gmail.com> Co-authored-by: Alind Khare <alindkhare@gatech.edu> Co-authored-by: Siyuan (Ryans) Zhuang <suquark@gmail.com> Co-authored-by: Hao Zhang <zhisbug@users.noreply.github.com> Co-authored-by: architkulkarni <architkulkarni@users.noreply.github.com> Co-authored-by: Lavanya Shukla <lavanya.shukla12@gmail.com> Co-authored-by: chaokunyang <shawn.ck.yang@gmail.com> Co-authored-by: Ian Rodney <ian.rodney@gmail.com> Co-authored-by: ijrsvt <ilr@anyscale.com> Co-authored-by: 刘宝 <po.lb@antfin.com> Co-authored-by: Qing Wang <kingchin1218@126.com> Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com> Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com> Co-authored-by: Qing Wang 
<jovany.wq@antgroup.com> Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan> Co-authored-by: Alex Wu <alex@anyscale.io> Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local> Co-authored-by: root <root@ip-172-31-56-188.us-west-2.compute.internal> Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> Co-authored-by: Gabriele Oliaro <gabriele_oliaro@college.harvard.edu> Co-authored-by: Raed Shabbir <raedshabbir@gmail.com> Co-authored-by: Tao Wang <dooku.wt@antfin.com> Co-authored-by: YLJALDC <dal177@ucsd.edu> Co-authored-by: Basu Jindal <42815171+basujindal@users.noreply.github.com> Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu> Co-authored-by: dHannasch <David.A.Hannasch@gmail.com> Co-authored-by: Lingxuan Zuo <skyzlxuan@gmail.com> Co-authored-by: Philipp Moritz <pcmoritz@gmail.com> Co-authored-by: Hao Chen <chenh1024@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Alex <alex@anyscale.com> Co-authored-by: Akash Patel <17132214+acxz@users.noreply.github.com> Co-authored-by: Edwin Goh <37746563+edwinytgoh@users.noreply.github.com> Co-authored-by: Maltimore <git@maltimore.info> Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com> Co-authored-by: Micah Yong <micahtyong@gmail.com> Co-authored-by: PENG Zhenghao <pengzh@ie.cuhk.edu.hk> Co-authored-by: SameerF <sameer@blueplastic.com> Co-authored-by: Todd A. 
Anderson <drtodd13@comcast.net> Co-authored-by: Keqiu Hu <khu@linkedin.com> Co-authored-by: Daan Klijn <daanklijn0@gmail.com> Co-authored-by: dmatch01 <dmatch01@users.noreply.github.com> Co-authored-by: Gekho457 <62982571+Gekho457@users.noreply.github.com> Co-authored-by: Alan Guo <aguo@aguo.software> Co-authored-by: Max Fitton <max@semprehealth.com> Co-authored-by: Kai Yang <kfstorm@outlook.com> Co-authored-by: Raoul Khouri <69156393+raoul-khour-ts@users.noreply.github.com> Co-authored-by: Jack Parker-Holder <jackph@robots.ox.ac.uk> Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com> Co-authored-by: Alan Guo <aguo@anyscale.com> Co-authored-by: Tao Wang <wangtaothetonic@163.com> Co-authored-by: Simon Mo <xmo@berkeley.edu> Co-authored-by: mehrdadn <mehrdadn@users.noreply.github.com> Co-authored-by: Mehrdad <noreply@github.com> Co-authored-by: Robert Nishihara <robertnishihara@gmail.com> Co-authored-by: Max Fitton <mfitton@berkeley.edu> Co-authored-by: Rüdiger Busche <rbusche@posteo.net> Co-authored-by: rbusche <rbusche@inserve.de> Co-authored-by: sven1977 <svenmika1977@gmail.com> Co-authored-by: Aditya Gudimella <aditya.gudimella@gmail.com> Co-authored-by: Ali Kanso <akanso@us.ibm.com> Co-authored-by: Ali Kanso <ali.kanso@microsoft.com>

* Set up CI with Azure Pipelines Specifically, we are setting a travis like ADO pipeline following what is already present in the .travis.yml file in the root of the repo. * Separating travis like pipeline from main pipeline * Adding Jenkings jobs equivalent * Making some improvements * Adding validation of the upstream CI * Disabling Tune and large memory tests * Changing threshold for simple reservoir sampling test * Addressing comments * Updating Azure Pipelines with travis updates * Updating Azure Pipelines with more travis updates * Updating CI with new cpp worker tests * Setting code owners * Fixing the version number generation * Making main pipeline also our release pipeline * Updating Azure Pipelines with travis updates * Fixing wheels test * Fixing codeowners * Updating Azure Pipelines with travis updates * Bumping up MACOSX_DEPLOYMENT_TARGET * Updating Azure Pipelines with travis updates * Updating Azure Pipelines with travis updates * Updating Azure Pipelines with travis updates * Disabling Serve tests * Making explicit which branches GitHubActions workflows should watch * Desabling Ray serve tests * Installing numpy explicitly * consolidating Ray test steps in one yml * Syncing with upstream master 2020-07-30 (#21) * [Core] Enhance common client connection (#9367) * enhance client connection * add write buffer async * read message * add test * Bazel move more shell to native rules (#9314) Co-authored-by: Mehrdad <noreply@github.com> * [tune] Fix github readme (#9365) Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com> * Combine different severities into the same log files (#9230) * Combine different severities into the same log files Co-authored-by: Mehrdad <noreply@github.com> * [core] Pass owner address from the workers to the raylet (#9299) * Add intended worker ID to GetObjectStatus, tests * Remove TaskID owner_id * lint * Add owner address to task args * Make TaskArg a virtual class, remove multi args * Set owner address for task args 
* merge * Fix tests * Add ObjectRefs to task dependency manager, pass from task spec args * tmp * tmp * Fix * Add ownership info for task arguments * Convert WaitForDirectActorCallArgs * lint * build * update * build * java * Move code * build * Revert "Fix Google log directory again (#9063)" This reverts commit 275da2e4003b56e5c315ceae53a2e5f5ad7874c1. * Fix free * fix tests * Fix tests * build * build * fix * Change assertion to warning to fix java * [Core] Add placement group scheduler and some api in resource scheduler (#9039) * Add placement group scheduler and some api of resource scheduler. Merge fix cv hang in multithread variables race (#8984). * change the bundle id and delete unit count in bundle change vector<bundle_spec> to vector<shared_ptr<bundle_spec>> Add placement group scheduler and some api of resource scheduler. Merge fix cv hang in multithread variables race (#8984). change the bundle id and delete unit count in bundle remove CheckIfSchedulable() add comments and fix the bug in resource * fix placement group schedule * add placement group scheduler and change some api in resource scheduler * fix by the comments * fix conflict * fix lint * fix lint * fix bug in merge * fix lint Co-authored-by: Lingxuan Zuo <skyzlxuan@gmail.com> * [Core] New scheduler fixes (#9186) * . * test_args passes * . * test_basic.py::test_many_fractional_resources causes ray to hang * test_basic.py::test_many_fractional_resources causes ray to hang * . * . * useful * test_many_fractional_resources fails instead of hanging now :) * Passes test_fractional_resources * . * . * Some cleanup * git is hard * cleanup * Fixed scheduling tests * . * . 
* [Core] put small objects in memory store (#8972) * remove the put in memory store * put small objects directly in memory store * cast data type * fix another place that uses Put to spill to plasma store * fix multiple tests related to memory limits * partially fix test_metrics * remove not functioning codes * fix core_worker_test * refactor put to plasma codes * add a flag for the new feature * add flag to more places * do a warmup round for the plasma store * lint * lint again * fix warmup store * Update _raylet.pyx Co-authored-by: Eric Liang <ekhliang@gmail.com> * [autoscaler] Move command runners into separate file and clean up interface. (#9340) * cleanup * wip * fix imports * fix lint * [docs][rllib] Recommended workflow for training, saving, and testing (#9319) * [autoscaler] Allow users to disable the cluster config cache (#8117) * [autoscaler] Remove autoscaler config cache. * [autoscaler] Add flag allowing users to explicitly disable the config cache. * Update hiredis and remove Windows patches (#9289) Co-authored-by: Mehrdad <noreply@github.com> * Fix flaky test_dynres.py (#9310) * Fix gcs_table_storage testcase bug (#9393) Co-authored-by: 灵洵 <fengbin.ffb@antfin.com> * [HOTFIX] Fix compile direct_actor_transport_test on mac (#9403) * Change Python's `ObjectID` to `ObjectRef` (#9353) * [Java] Improve JNI performance when submitting and executing tasks (#9032) * Remove the RAY_CHECK in Worker::Port() (#9348) * [RLlib] Issue #9366 (DQN w/o dueling produces invalid actions). (#9386) * Fix macos compliation bug (#9391) * Fix. * [Core] Plasma RAII support (#9370) * [Serve] Merge router with HTTPProxy (#9225) * Pass run args to DockerCommandRunner (#9411) * Fix copy to workspace (#9400) * [RLlib] Tf2.x native. (#8752) * Update conda and ray wheel on GCP images (#9388) * [Core] Simplify Raylet Client (#9420) * Masking error. 
With t*valid_mask, we get the error np.inf*0 = np.inf (#9407) * [RLLib] WindowStat bug fix (#9213) * WindowStat error catching, which processes NaNs properly instead of erroring. This ought to resolve issue #7910. https://github.com/ray-project/ray/issues/7910 * [tune] handling nan values (#9381) * TRAVIS_PULL_REQUEST is false for non-PRs, not empty (#9439) Co-authored-by: Mehrdad <noreply@github.com> * [GCS] Fix the bug about raylet receiving duplicate actor creation tasks (#9422) * [Tune] Trainable documentation fix (#9448) * Allow --lru-evict to be passed into `ray start` (#8959) * GCP authentication using oauth tokens (#9279) * Bazel selects compiler flags based on compiler (#9313) Co-authored-by: Mehrdad <noreply@github.com> * [Core] Build raylet client as an independent component (#9434) * [tune] sklearn comment out (#9454) * Add ability to specify SOCKS proxy for SSH connections (#8833) * [docs] Render ActorPool documentation, etc (#9433) * [tune] Put examples under proper version control (#9427) Co-authored-by: krfricke <krfricke@users.noreply.github.com> * Fix test-multi-node (#9453) * Machine View Sorting / Grouping (#9214) * Convert NodeInfo.tsx to a functional component * Update NodeRowGroup to be a functional component * lint * Convert TotalRow to functional component. * lint * move node info over to using the sortable table head component. spacing is still a little wonky. * Factor a NoewWorkerRow class out of NodeRowGroup that will be usable when grouping / ungrouping * Compilation checkpoint, I factored the worker filtering logic out of node info into the reducer * Add sort accessors for CPU * Add sort accessors for Disk * Add sort accessors for RAM * add a table sort util for function based accessors (rather than flat attribute-based accessor) * wip refactor node info features * wip * Rendering Checkpoint. I've refactored the features and how they are called to add sorting support. 
Also reworks the way error counts and log counts are passed to the front-end to remove some ugly logic * wip * wip * wip * Finish adding sorting and grouping of machine view * lint * fix bug in filtration of logs and errors by worker from recent refactor. * Add export of Cluster Disk feature * fix some merge issues Co-authored-by: Max Fitton <max@semprehealth.com> * [RLlib] Layout of Trajectory View API (new class: Trajectory; not used yet). (#9269) * [RLlib] Issue 9402 MARWIL producing nan rewards. (#9429) * Fix gcs_pubsub_test bug(#9438) Co-authored-by: 灵洵 <fengbin.ffb@antfin.com> * change error code name of boost timer (#9417) * [tune] PyTorch CIFAR10 example (#9338) Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Kai Fricke <kai@anyscale.com> * Remove legacy C++ code (#9459) * Fix ObjectRef and ActorHandle serialization (#9462) * [Stats] metrics agent exporter (#9361) * [Core] Support GCS server port assignment. (#8962) * Add scripts symlink back (#9219) (#9475) (cherry picked from commit 77933c922d5136c5c2e2f0ac2edb4da67111d690) Co-authored-by: Simon Mo <xmo@berkeley.edu> * [tune] Issue 8821: ExperimentAnalysis doesn't expand user (#9461) * [docker] Include base-deps image in rayproject Docker Hub (#9458) * [Core] remove create_and_seal and create_and_seal_batch (#9457) * Speedups for GitHub Actions (#9343) Co-authored-by: Mehrdad <noreply@github.com> * Fix flaky test_object_manager.py (#9472) * [Java] fix redis-server binary path (#9398) * [core] Handle out-of-order actor table notifications (#9449) * Drop stale actor table notifications * build * Add num_restarts to disconnect handler * Unit test and increment num_restarts on ALIVE, not RESTARTING * Wait for pid to exit * Fix name clash on Windows (#9412) Co-authored-by: Mehrdad <noreply@github.com> * Add job configs to gcs (#9374) * Make pip install verbose (#9496) Co-authored-by: Mehrdad <noreply@github.com> * Make more tests compatible with Windows (#9303) * [tune] extend PTL template 
(GPU, typing fixes, tensorboard) (#9451) Co-authored-by: Kai Fricke <kai@anyscale.com> * [core] Replace task resubmission in raylet with ownership protocol (#9394) * Add intended worker ID to GetObjectStatus, tests * Remove TaskID owner_id * lint * Add owner address to task args * Make TaskArg a virtual class, remove multi args * Set owner address for task args * merge * Fix tests * Add ObjectRefs to task dependency manager, pass from task spec args * tmp * tmp * Fix * Add ownership info for task arguments * Convert WaitForDirectActorCallArgs * lint * build * update * build * java * Move code * build * Revert "Fix Google log directory again (#9063)" This reverts commit 275da2e4003b56e5c315ceae53a2e5f5ad7874c1. * Fix free * Regression tests - shorten timeouts in reconstruction unit tests * Remove timeout for non-actor tasks * Modify tests using ray.internal.free * Clean up future resolution code * Raylet polls the owner * todo * comment * Update src/ray/core_worker/core_worker.cc Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> * Drop stale actor table notifications * Fix bug where actor restart hangs * Revert buggy code for duplicate tasks * build * Fix errors for lru_evict and internal.free * Revert "Drop stale actor table notifications" This reverts commit 193c5d20e5577befd43f166e16c972e2f9247c91. * Revert "build" This reverts commit 5644edbac906ff6ef98feb40b6f62c9e63698c29. * Fix free test * Fixes for freed objects Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> * release gil in global state accessor (#9357) * [Java] Named java actor (#9037) * Fix clang-cl build (#9494) Co-authored-by: Mehrdad <noreply@github.com> * [GCS Actor Management] Gcs actor management broken detached actor (#9473) * [RLlib] Issue #9437 (PyTorch converts to CPU tensor, even if on GPU). (#9497) * Get rid of build shell scripts and move them to Python (#6082) * Fix broken test_raylet_info_endpoint (#9511) * Fix. 
(#9464) * [Autoscaler] Making bootstrap config part of the node provider interface (#9443) * supporting custom bootstrap config for external node providers * bootstrap config * renamed config to cluster_config * lint * remove 2 args from importer * complete move of bootstrap to node_provider * renamed provider_cls * move imports outside functions * lint * Update python/ray/autoscaler/node_provider.py Co-authored-by: Eric Liang <ekhliang@gmail.com> * final fixes * keeping lines to reduce diff * lint * lamba config * filling in -> adding for lint Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local> Co-authored-by: Eric Liang <ekhliang@gmail.com> * Fix flaky test_actor_failures::test_actor_restart (#9509) * Fix flaky test * os exit * [rllib] MAML Transform (#9463) * MAML Transform * Moved Inner Adapt to Method in Execution Plan * Cleanup Plasma Store (hash utilities) (#9524) * [Serve] Improve buffering for simple cases (#9485) * [Serve] Use pickle instead of clouldpickle (#9479) * Fix pip and Bazel interaction messing up CI (#9506) Co-authored-by: Mehrdad <noreply@github.com> * [Core] Fix Java detached error (#9526) * fix java createActor NPE bug (#9532) * [RLlib] Issue 9218: PyTorch Policy places Model on GPU even with num_gpus=0 (#9516) * [Stats] Fix metric exporter test (#9376) * Hotfix Lint for Serve (#9535) * Windows cleanup (#9508) * Remove unneeded code for Windows * Get rid of usleep() * Make platform_shims includes non-transitive Co-authored-by: Mehrdad <noreply@github.com> * [RLlib] Issue 8384: QMIX doesn't learn anything. 
(#9527) * Add placement group manager and some code in core_worker (#9120) Co-authored-by: Lingxuan Zuo <skyzlxuan@gmail.com> * [core] Add flag to enable object reconstruction during ray start (#9488) * Add flag * doc * Fix tests * Pipelining task submission to workers (#9363) * first step of pipelining * pipelining tests & default configs - added pipelining unit tests in direct_task_transport_test.cc - added an entry in ray_config_def.h, ray_config.pxi, and ray_config.pxd to configure the parameter controlling the maximum number of tasks that can be in flight to each worker - consolidated worker_to_lease_client_ and worker_to_lease_client_ hash maps in direct_task_transport.h into a single one called worker_to_lease_entry_ * post-review revisions * linting, following naming/style convention * linting * [New scheduler] Queueing refactor (#9491) * . * test_args passes * . * test_basic.py::test_many_fractional_resources causes ray to hang * test_basic.py::test_many_fractional_resources causes ray to hang * . * . * useful * test_many_fractional_resources fails instead of hanging now :) * Passes test_fractional_resources * . * . * Some cleanup * git is hard * cleanup * . * . * . * . * . * . * . * cleanup * address reviews * address reviews * more refactor * :) * travis pls * . * travis pls * . * [Serve] Add internal instruction for running benchmarks (#9531) * MADDPG learning confirmation test.
(#9538) * Fix Bazel in Docker (#9530) Co-authored-by: Mehrdad <noreply@github.com> * Fix bug that `test_multi_node.py::test_multi_driver_logging` hangs when GCS actor management is turned on (#9539) Co-authored-by: 灵洵 <fengbin.ffb@antfin.com> * [tune] Unflattened lookup for ProgressReporter (#9525) Co-authored-by: Kai Fricke <kai@anyscale.com> * Add plasma store benchmark for small objects (#9549) * [Tune] Copy default_columns in new ProgressReporter instances (#9537) * quickfix (#9552) * [tune] pin tune-sklearn (#9498) * [cli] ray memory: added redis_password (#9492) * [GCS]Fix lease worker leak bug when gcs server restarts (#9315) * add part code * fix compile bug * fix review comments * fix review comments * fix review comments * fix review comments * fix review comment * fix ut bug * fix lint error * fix review comment * fix review comments * add testcase * add testcase * fix bug * fix review comments * fix review comment * fix review comment * refine comments Co-authored-by: 灵洵 <fengbin.ffb@antfin.com> Co-authored-by: Hao Chen <chenh1024@gmail.com> * [tune] fix pbt checkpoint_freq (#9517) * Only delete old checkpoint if it is not the same as the new one * Return early if old checkpoint value coincides with new checkpoint value Co-authored-by: Kai Fricke <kai@anyscale.com> * [Core] Remove socket pair exchange in Plasma Store (#9565) * try use boost::asio for notification processing * [Metric] new cython interface for python worker metric (#9469) * Bazel fixes (#9519) * GCS client add fetch operation before subscribe (#9564) * [RLlib] Fix combination of lockstep and multiple agnts controlled by the same policy. (#9521) * Change aggregation when lockstep is activated. Modification of MultiAgentBatch.timeslices to support the combination of lockstep and multiple agents controlled by the same policy. fix ray-project/ray#9295 * Line too long. 
* [Core] Replace the Plasma eventloop with boost::asio (#9431) * Fix Java named actor bug (#9580) * Fix setup.py bug (#9581) Co-authored-by: Mehrdad <noreply@github.com> * [Serve] Serialize Query object directly (#9490) * Add dashboard dependencies to default ray installation (#9447) * Dashboard next-version API support in backend (#9345) * Fix log losses (#9559) * Close log on shutdown * Disable log buffering Co-authored-by: Mehrdad <noreply@github.com> * [docker] run Ubuntu 20.04 as base image (#9556) * Add PTL to README.rst (#9594) Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Skip uneeded steps on CI (#9582) Co-authored-by: Mehrdad <noreply@github.com> * Fix Windows CI (#9588) Co-authored-by: Mehrdad <noreply@github.com> * [serve] Rename to `Controller` (#9566) * Handle warnings in core (#9575) * [New scheduler] Fix new scheduler bug (#9467) * fix new scheduler bug * add testcase for soft resource allocation * modify RemoveNode * Ensure unique log file names across same-node raylets. (#9561) * fix tag key typo (#9606) * Rename path variable due to zsh conflict (#9610) * [doc] [minor] Make API docs easier to find. (#9604) * Issue 9568: `rllib train` framework in config gets overridden with tf. (#9572) * Use UTF-8 for encoding of python code for collision hashing (#9586) Co-authored-by: Arne Sachtler <arne.sachtler@dlr.de> Co-authored-by: simon-mo <simon.mo@hey.com> * Add bazel to the PATH in setup.py (#9590) Co-authored-by: Mehrdad <noreply@github.com> * Fix Lint in setup.py (#9618) Co-authored-by: Mehrdad <noreply@github.com> * Shellcheck comments (#9595) * [Serve] Document Metric Infrastructure (#9389) * [CI] Do not run jenkins test on GHA (#9621) * Support ray task type checking (#9574) * [Metrics] Java metric API (#9377) * [GCS] fix the fault tolerance about gcs node manager (#9380) * Shellcheck quoting (#9596) * Fix SC2006: Use $(...) notation instead of legacy backticked `...`. 
* Fix SC2016: Expressions don't expand in single quotes, use double quotes for that. * Fix SC2046: Quote this to prevent word splitting. * Fix SC2053: Quote the right-hand side of == in [[ ]] to prevent glob matching. * Fix SC2068: Double quote array expansions to avoid re-splitting elements. * Fix SC2086: Double quote to prevent globbing and word splitting. * Fix SC2102: Ranges can only match single chars (mentioned due to duplicates). * Fix SC2140: Word is of the form "A"B"C" (B indicated). Did you mean "ABC" or "A\"B\"C"? * Fix SC2145: Argument mixes string and array. Use * or separate argument. * Fix SC2209: warning: Use var=$(command) to assign output (or quote to assign string). Co-authored-by: Mehrdad <noreply@github.com> * Fix bug in Bazel version check (#9626) Co-authored-by: Mehrdad <noreply@github.com> * [Java] Avoid data copy from C++ to Java for ByteBuffer type (#9033) * Revert "Dashboard next-version API support in backend (#9345)" (#9639) This reverts commit fca1fb18f366ebff6016978cb6440dd1ed8637fe. * [Autoscaler] Command Line Interface improvements (#9322) Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * [Core] GCS Actor management on by default. (#8845) * GCS Actor management on by default. * Fix travis config. * Change condition. * Remove unnecessary CI. * [Core] Fix concurrency issues in plasma store runner (#9642) * fix window jni unhappy compiler (#9635) * Fix TestObjectTableResubscribe testcase bug (#9650) * fix named actor single process mode bug (#9652) * [core] Fix Ray service startup when logging redirection is disabled. 
(#9547) * Fix TorchDeterministic (#9241) * [RaySGD] revised existing transformer example to work with transformers>=3.0 (#9661) Co-authored-by: Kai Fricke <kai@anyscale.com> * [rllib] Fix torch TD error, IMPALA LR updates (#9477) * update * add test * lint * fix super call * speed es test up * Auto-cancel build when a new commit is pushed (#8043) Co-authored-by: Mehrdad <noreply@github.com> * Fix lint in remote-watch.py (#9668) * [Core] Remove unnecessary windows syscall in plasma store (#9602) * Remove unused windows shims (#9583) * Temporarily disable remote watcher (#9669) * Drop support for Python 3.5. (#9622) * Drop support for Python 3.5. * Update setup.py * [Core] WorkerInterface refactor (#9655) * . * . * refactor WorkerInterface * . * Basic unit test structure complete? * . * . * . * . * Fixed tests * Fixed tests * . * [core] Enable object reconstruction for retryable actor tasks (#9557) * Test actor plasma reconstruction * Allow resubmission of actor tasks * doc * Test for actor constructor * Kill PID before removing node * Kill pid before node * fix java coreworker crash (#9674) * use help proto-init-macro for streaming config (#9272) * Update release information from 0.8.6. (#9124) * [BRING BACK TO MASTER] Update release information. * [MERGE TO MASTER] Add microbenchmark result. * Update asan tests to the doc. * Refinements to the Serve documentation (#9587) Co-authored-by: Dean Wampler <dean@concurrentthought.com> * [tune] survey (#9670) * Fix ERROR logging not being printed to standard error (#9633) Co-authored-by: Mehrdad <noreply@github.com> * [Tune Docs] Logging doc fix (#9691) * [rllib] Type annotations for model classes (#9646) * [Serve] Allow multiple HTTP servers. (#9523) * Issue 9631: Tf1.14 does not have tf.config.list_physical_devices. 
(#9681) * [Serve] Fix Formatting, stale docs (#9617) * fixed simplex initialisation seeding bug (#9660) Co-authored-by: Petros Christodoulou <petrochr@amazon.com> * Switch from GitHub checkout@v2 to checkout@v1 due to bugs in checkout (#9697) Co-authored-by: Mehrdad <noreply@github.com> * Add Ray Serve to README.rst (#9688) * Shellcheck rewrites (#9597) * Fix SC2001: See if you can use ${variable//search/replace} instead. * Fix SC2010: Don't use ls | grep. Use a glob or a for loop with a condition to allow non-alphanumeric filenames. * Fix SC2012: Use find instead of ls to better handle non-alphanumeric filenames. * Fix SC2015: Note that A && B || C is not if-then-else. C may run when A is true. * Fix SC2028: echo may not expand escape sequences. Use printf. * Fix SC2034: variable appears unused. Verify use (or export if used externally). * Fix SC2035: Use ./*glob* or -- *glob* so names with dashes won't become options. * Fix SC2071: > is for string comparisons. Use -gt instead. * Fix SC2154: variable is referenced but not assigned * Fix SC2164: Use 'cd ... || exit' or 'cd ... || return' in case cd fails. * Fix SC2188: This redirection doesn't have a command. Move to its command (or use 'true' as no-op). * Fix SC2236: Use -n instead of ! -z. * Fix SC2242: Can only exit with status 0-255. Other data should be written to stdout/stderr. * Fix SC2086: Double quote to prevent globbing and word splitting. Co-authored-by: Mehrdad <noreply@github.com> * [Autoscaler] CLI Logger docs (#9690) Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Update rllib-algorithms.rst (#9640) * [tune] move jenkins tests to travis (#9609) Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Kai Fricke <kai@anyscale.com> * [RLlib] Implement DQN PyTorch distributional head. 
(#9589) * Add placement group java api (#9611) * add part code * add part code * add part code * fix code style * fix review comment * fix review comment * add part code * add part code * add part code * add part code * fix review comment * fix review comment * fix code style * fix review comment * fix lint error * fix lint error Co-authored-by: 灵洵 <fengbin.ffb@antfin.com> * [Stats] Improve Stats::Init & Add it to GCS server (#9563) * [Core] Try remove all windows compat shims (#9671) * try remove compat for arrow * remove unistd.h * remove socket compat * delete arrow windows patch * Fix a few flaky tests (#9709) Fix test_custom_resources, Remove test_pandas_parquet_serialization, Better error message for test_output.py, Potentially fix test_dynres::test_dynamic_res_creation_scheduler_consistency * [GCS]Open test_gcs_fault_tolerance testcase (#9677) * enable test_gcs_fault_tolerance * fix lint error Co-authored-by: 灵洵 <fengbin.ffb@antfin.com> * [Tests]lock vector to avoid potential flaky test (#9656) * [tune] distributed torch wrapper (#9550) * changes * add-working * checkpoint * ccleanu * fix * ok * formatting * ok * tests * some-good-stuff * fix-torch * ddp-torch * torch-test * sessions * add-small-test * fix * remove * gpu-working * update-tests * ok * try-test * formgat * ok * ok * [GCS] Fix actor task hang when its owner exits before local dependencies resolved (#8045) * Only update raylet map when autoscaler configured (#9435) * [Dashboard] New dashboard skeleton (#9099) * Fixing multiple building issues * Make wait_for_condition raise exception when timing out. (#9710) * [GCS]GCS client support multi-thread subscribe&resubscribe&unsubscribe (#9718) * Package and upload ray cross-platform jar (#9540) * Revert "Package and upload ray cross-platform jar (#9540)" (#9730) This reverts commit 881032593d3c1b9360ea641c24d50a022677a25e. 
* Only build docker wheels in LINUX_WHEELS env (#9729) * Keep build-autoscaler-images.sh alive in CI (#9720) * [core] Removes Error when Internal Config is not set (#9700) * [Cluster Launcher] Re Org the cluster launcher pages. (#9687) * [RLlib] Offline Type Annotations (#9676) * Offline Annotations * Modifications * Fixed circular dependencies * Linter fix * Python api of placement group (#9243) * Include open-ssh-client for transparency (#9693) * Fix remote-watch.py (#9625) Co-authored-by: Mehrdad <noreply@github.com> * [docker] Uses Latest Conda & Py 3.7 (#9732) * Fix broken actor failure tests. (#9737) * [Stats] fix stats shutdown crash if opencensus exporter not initialized (#9727) * Fix package and upload ray jar (#9742) * Introduce file_mounts_sync_continuously cluster option (#9544) * Separate out file_mounts contents hashing into its own separate hash. Add an option to continuously sync file_mounts from head node to worker nodes: monitor.py will re-sync file mounts whenever contents change but will only run setup_commands if the config also changes * add test and default value for file_mounts_sync_continuously * format code * Update comments * Add param to skip setup commands when only file_mounts content changed during monitor.py's update tick. Fixed so setup commands run when ray up is run and file_mounts content changes * Refactor so that runtime_hash retains previous behavior. runtime_hash is almost identical to before this PR. It is used to determine if setup_commands need to run. file_mounts_contents_hash is an additional hash of the file_mounts content that is used to detect when only file syncing has to occur.
Note: runtime_hash value will have changed from before the PR because we hash the hash of the contents of the file_mounts as a performance optimization * fix issue with hashing a hash * fix bug where trying to set contents hash when it wasn't generated * Fix lint error. Fix bug in command_runner where check_output was no longer returning the output of the command * clear out provider between tests to get rid of flakiness * reduce chance of race condition from node_launcher launching a node in the middle of an autoscaler.update call * [dist] swap mac/linux wheel build order (#9746) * [RLlib] Enhance reward clipping test; add action_clipping tests. (#9684) * [RLlib] Issue 9667 DDPG Torch bugs and enhancements. (#9680) * [Metrics]Ray java worker metric registry (#9636) * ray worker metrics gauge init * ray java metric mapping * add jni source files for gauge and tagkey * mapping all metric classes to stats object * check non-null for tags and name * lint * add symbol for native metric JNI * extern c for symbol * add tests for all metrics * Update Metric.java use metricNativePointer instead. * unify metric native stuff to one class * fix jni file * add comments for metric transform function in jni utils * move metric function to native metric file * remove unused disconnect jni * Add a metric registry for java metrics * Restore install-bazel.sh * Add some comments for metric registry * Fix thread safe problem of metrics * Fix metric tests and remove sleep code from tests * Fix comments of metrics Co-authored-by: lingxuan.zlx <skyzlxuan@gmail.com> * fix windows compile bug (#9741) Co-authored-by: 灵洵 <fengbin.ffb@antfin.com> * Run _with_interactive in Docker (#9747) * [New scheduler] First unit test for task manager (#9696) * . * . * refactor WorkerInterface * . * Basic unit test structure complete? * . * bad git >:-( * small clean up * CR * . * . * One more fixture * One more fixture * . * . * bazel-format * .
* [Stats] Basic Metrics Infrastructure (Metrics Agent + Prometheus Exporter) (#9607) * [Release] Fix release tests (#9733) * Register function race (#9346) * Revert "[dist] swap mac/linux wheel build order (#9746)" and "Fix package and upload ray jar (#9742)" (#9758) * Revert "[dist] swap mac/linux wheel build order (#9746)" This reverts commit a9340565ff46626b18fd36f22a37d0380ae18d85. * Revert "Fix package and upload ray jar (#9742)" This reverts commit c290c308fe1e496480db5c37489df619cff6168f. * Fix some Windows CI issues (#9708) Co-authored-by: Mehrdad <noreply@github.com> * Pin pytest version (#9767) * [Java] Use test groups to filter tests of different run modes (#9703) * [Java] Fix MetricTest.java due to incomplete changes from #9703 (#9770) * Fix leased worker leak bug if lease worker requests that are still waiting to be scheduled when GCS restarts (#9719) * [Stats] enable core worker stats (#9355) * [GCS]Use a separate thread in node failure detector to handle heartbeat (#9416) * use a sole thread to handle heartbeat * separate signal thread * use work to avoid exiting when task is underway * protect shared data structure to avoid deadlock * add comments * decrease io service num * minor changes * fix test * per stephanie's comments * use single io service instead of 1-size io service pool * typo * [GCS Actor Management] Fix flaky test_dead_actors. (#9715) * Fix. * Add logs. * Add an unit test. * [TUNE] Tune Docs re-organization (#9600) Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * [RLlib] Trajectory View API (preparatory cleanup and enhancements). 
(#9678) * [Core] Socket creation race condition bug fixes (#9764) * fix issues * hot fixes * test * test * Always info log * Fixed stderr logging (9765) * [Core] Custom socket name (#9766) * fix issues * hot fixes * test * test * socket name change only * Fix src/ray/core_worker/common.h deleted constructor (#9785) Co-authored-by: Mehrdad <noreply@github.com> * [Stats] Fix harvestor threads + Fix flaky stats shutdown. (#9745) * More fixes * Applying latest changes in travis.yml * Fixing fixture data exclusions * Disable some java tests * Fix some CI errors * Update hash * Fixing more build issues * Fixing more build issues * Fix pipeline cache path * More fixes * Fix bazel test command * Fix bazel test * Fix general info steps * Custom env var for docker build * Trying a different way to install bazel * Bazel fix * Updating hash Co-authored-by: Siyuan (Ryans) Zhuang <suquark@gmail.com> Co-authored-by: mehrdadn <mehrdadn@users.noreply.github.com> Co-authored-by: Mehrdad <noreply@github.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com> Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu> Co-authored-by: Alisa <wuminyan0607@gmail.com> Co-authored-by: Lingxuan Zuo <skyzlxuan@gmail.com> Co-authored-by: Alex Wu <itswu.alex@gmail.com> Co-authored-by: Zhuohan Li <zhuohan123@vip.qq.com> Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Stefan Schneider <stefan.schneider@upb.de> Co-authored-by: Patrick Ames <pdames@amazon.com> Co-authored-by: Hao Chen <chenh1024@gmail.com> Co-authored-by: fangfengbin <869218239a@zju.edu.cn> Co-authored-by: 灵洵 <fengbin.ffb@antfin.com> Co-authored-by: Tao Wang <dooku.wt@antfin.com> Co-authored-by: Kai Yang <kfstorm@outlook.com> Co-authored-by: Sven Mika <sven@anyscale.io> Co-authored-by: SangBin Cho <rkooo567@gmail.com> Co-authored-by: Simon Mo <simon.mo@hey.com> Co-authored-by: Ian Rodney <ian.rodney@gmail.com> Co-authored-by: Henk Tillman 
<henktillman@gmail.com> Co-authored-by: Tanay Wakhare <twakhare@gmail.com> Co-authored-by: Nicolaus93 <nicolo.campolongo@unimi.it> Co-authored-by: Vasily Litvinov <45396231+vnlitvinov@users.noreply.github.com> Co-authored-by: krfricke <krfricke@users.noreply.github.com> Co-authored-by: Max Fitton <maxfitton@gmail.com> Co-authored-by: Max Fitton <max@semprehealth.com> Co-authored-by: kisuke95 <2522134184@qq.com> Co-authored-by: Kai Fricke <kai@anyscale.com> Co-authored-by: Simon Mo <xmo@berkeley.edu> Co-authored-by: Michael Mui <68102089+heyitsmui@users.noreply.github.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: chaokunyang <shawn.ck.yang@gmail.com> Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu> Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local> Co-authored-by: Michael Luo <michael.luo123456789@gmail.com> Co-authored-by: Gabriele Oliaro <gabriele_oliaro@college.harvard.edu> Co-authored-by: Tom <veniat.tom@gmail.com> Co-authored-by: jerrylee.io <JerryDeKo@gmail.com> Co-authored-by: Raphael Avalos <raphael@avalos.fr> Co-authored-by: William Falcon <waf2107@columbia.edu> Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> Co-authored-by: Robert Nishihara <robertnishihara@gmail.com> Co-authored-by: Arne Sachtler <arne.sachtler@gmail.com> Co-authored-by: Arne Sachtler <arne.sachtler@dlr.de> Co-authored-by: Philipp Moritz <pcmoritz@gmail.com> Co-authored-by: ZhuSenlin <wumuzi520@126.com> Co-authored-by: Max Fitton <mfitton@berkeley.edu> Co-authored-by: Maksim Smolin <maximsmol@gmail.com> Co-authored-by: Dean Wampler <dean@polyglotprogramming.com> Co-authored-by: Dean Wampler <dean@concurrentthought.com> Co-authored-by: Bill Chambers <bill@anyscale.com> Co-authored-by: Petros Christodoulou <p.christodoulou2@gmail.com> Co-authored-by: Petros Christodoulou <petrochr@amazon.com> Co-authored-by: Justin Terry <justinkterry@gmail.com> Co-authored-by: Tao Wang <wangtaothetonic@163.com> Co-authored-by: fyrestone 
<fyrestone@outlook.com> Co-authored-by: Alan Guo <aguo@anyscale.com> Co-authored-by: bermaker <495571751@qq.com> * Sync Upstream master (#50) * [core] Pull Manager exponential backoff (#13024) * [RLlib] Issue 12789: RLlib throws the warning "The given NumPy array is not writeable" (#12793) * [release tests] test_many_tasks fix (#12984) * Add "beta" documentation for enabling object spilling manually (#13047) * [Serve] Handle Bug Fixes (#12971) * [Dashboard] Add GET /logical/actors API (#12913) * [GCS]Decouple gcs resource manager and gcs node manager (#13012) * [ray_client]: Insert decorators into the real ray module to allow for client mode (#13031) * [GCS] Delete redis gcs client and redis_xxx_accessor (#12996) * [RLlib] Fix broken unity3d_env import in example server script. (#13040) * [RLlib] TorchPolicies: Accessing "infos" dict in train_batch causes `TypeError`. (#13039) * [joblib] Fix flaky joblib test. (#13046) * [Tune]Add integer loguniform support (#12994) * Add integer quantization and loguniform support * Fix hyperopt qloguniform not being np.log'd first * Add tests, __init__ * Try to fix tests, better exceptions * Tweak docstrings * Type checks in SearchSpaceTest * Update docs * Lint, tests * Update doc/source/tune/api_docs/search_space.rst Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com> Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com> * [core][new scheduler] Move tasks from ready to dispatch to waiting on argument eviction (#13048) * Add index for tasks to dispatch * Task dependency manager interface * Unsubscribe dependencies and tests * NodeManager * Revert "Add index for tasks to dispatch" This reverts commit c6ccb9aa306e00f80d34b991055e4e83872595ea. 
* tmp * Move back to waiting if args not ready * update * Update to new form of brew cask install command * [Autoscaler] New output log format (#12772) * Fix typo RMSProp -> RMSprop (#13063) * [serve] Centralize HTTP-related logic in HTTPState (#13020) * Remove suppress output to see why wheel is not building * Refactor TaskDependencyManager, allow passing bundles of objects to ObjectManager (#13006) * New dependency manager * Switch raylet to new DependencyManager * PullManager accepts bundles * Cleanup, remove old task dependency manager * x * PullManager unit tests * lint * Unit tests * Rename * lint * test * Update src/ray/raylet/dependency_manager.cc Co-authored-by: SangBin Cho <rkooo567@gmail.com> * Update src/ray/raylet/dependency_manager.cc Co-authored-by: SangBin Cho <rkooo567@gmail.com> * x * lint Co-authored-by: SangBin Cho <rkooo567@gmail.com> * [docs] Fix args + kwargs instead of docstrings (#13068) * functools wraps * Fix typo (functoools -> functools) * Fix OS X Wheel Build - Update brew cask install (#13062) Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * speed up local mode object store get (#13052) Co-authored-by: senlin.zsl <senlin.zsl@antfin.com> * [RLlib] Execution Annotation (#13036) * [RLlib] Improved Documentation for PPO, DDPG, and SAC (#12943) * [C++ API] Added reference counting to ObjectRef (#13058) * Added reference counting to ObjectRef * Addressed the comments * [Core] Remove cuda support in plasma store (#13070) * remove cuda support in plasma store * [Core] Remote outdated external store (#13080) * remove outdated external store * [GCS] Move resource usage info to gcs resource manager (#13059) * [RLlib] JAXPolicy prep. PR #1. (#13077) * [RLlib] Preprocessor fixes (multi-discrete) and tests. (#13083) * [RLlib] BC/MARWIL/recurrent nets minor cleanups and bug fixes. 
(#13064) * [Collective][PR 3.5/6] Send/Recv calls and some initial code for communicator caching (#12935) * other collectives all work * auto-linting * mannual linting #1 * mannual linting 2 * bugfix * add send/recv point-to-point calls * add some initial code for communicator caching * auto linting * optimize imports * minor fix * fix unpassed tests * support more dtypes * rerun some distributed tests for send/recv * linting * [Serve] [Doc] Front page update (#13032) * Deprecate experimental / dynamic resources (#13019) * [docs] fix wandb url (#13094) * [Serve] Implement Graceful Shutdown (#13028) * [Serve] Use ServeHandle in HTTP proxy (#12523) * [Java] Format ray java code (#13056) * [docker] Fix restart behavior with Docker (#12898) Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: ijrsvt <ilr@anyscale.com> * Disable broken streaming tests (#13095) * [autoscaler] Make placement groups bypass max launch limit (#13089) * Serve metrics docs (#13096) * [RLlib] run_regression_tests.py: --framework flag (instead of --torch). (#13097) * [RLLib] Readme.md Documentation for Almost All Algorithms in rllib/agents (#13035) * [Doc] Fix Sphinx.add_stylesheet deprecation (#13067) * Fix streaming ci failure (#12830) * [RLlib] New Offline RL Algorithm: CQL (based on SAC) (#13118) * [Bugfix][Dashboard] Fix undefined logCount, errorCount UI crash (#13113) * [RLlib] Deflake test case: 2-step game MADDPG. (#13121) * [RLlib] Trajectory view API docs. (#12718) * Job module without submission (#13081) Co-authored-by: 刘宝 <po.lb@antfin.com> * [RLlib] JAXPolicy prep PR #2 (move get_activation_fn (backward-compatibly), minor fixes and preparations). (#13091) * [Java] Avoid failure of serializing a user-defined unserializable exception. 
(#13119) * [Tune] Update URL to fix 403 not found error in PBT tranformers test case (#13131) * [serve] Async controller (#13111) * [dashboard] Fix RAY_RAYLET_PID KeyError on Windows (#12948) * [Serve] Use a small object to track requests (#13125) * [docs][kubernetes][minor] Update K8s examples in doce (#13129) * [RLlib] Support easy `use_attention=True` flag for using the GTrXL model. (#11698) * [docs] Documentation + example for the C++ language API (#13138) * [Java] Support `wasCurrentActorRestarted` in actor task. (#13120) * Remove check. * Add test * fix lint * lint * Fix spotless lint * Address comments. * Fix lint Co-authored-by: Qing Wang <jovany.wq@antgroup.com> * [docs] Minor change to formating C++ docs. (#13151) * Deprecate setResource java api (#13117) * [docs] Small fix in C++ documentation. (#13154) * prepare for head node * move command runner interface outside _private * remove space * Eric * flake * min_workers in multi node type * fixing edge cases * eric not idle * fix target_workers to consider min_workers of node types * idle timeout * minor * minor fix * test * lint * eric v2 * eric 3 * min_workers constraint before bin packing * Update resource_demand_scheduler.py * Revert "Update resource_demand_scheduler.py" This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5. 
* reducing diff * make get_nodes_to_launch return a dict * merge * weird merge fix * auto fill instance types for AWS * Alex/Eric * Update doc/source/cluster/autoscaling.rst * merge autofill and input from user * logger.exception * make the yaml use the default autofill * docs Eric * remove test_autoscaler_yaml from windows tests * lets try changing the test a bit * return test * lets see * edward * Limit max launch concurrency * commenting frac TODO * move to resource demand scheduler * use STATUS UP TO DATE * Eric * make logger of gc freed refs debug instead of info * add cluster name to docker mount prefix directory * grrR * fix tests * moving docker directory to sdk * move the import to prevent circular dependency * smallf fix * ian * fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running * small fix * deflake test_joblib * lint * placement groups bypass * remove space * Eric * first ocmmit * lint * exmaple * documentation * hmm * file path fix * fix test * some format issue in docs * modified docs Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan> Co-authored-by: Alex Wu <alex@anyscale.io> Co-authored-by: Alex Wu <itswu.alex@gmail.com> Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local> Co-authored-by: root <root@ip-172-31-56-188.us-west-2.compute.internal> * [Serve] [Doc] Add existing web server integration ServeHandle tutorial (#13127) * [kubernetes][docs][minor] Kubernetes version warning (#13161) * [Core] Locality-aware leasing: Milestone 1 - Owned refs, pinned location (#12817) * Locality-aware leasing for owned refs (pinned locations). * LessorPicker --> LeasePolicy. * Consolidate GetBestNodeIdForTask and GetBestNodeIdForObjects. * Update comments. * Turn on locality-aware leasing feature flag by default. 
* Move local fallback logic to LeasePolicy, move feature flag check to CoreWorker constructor, add local-only lease policy. * Add lease policy consulting assertions to the direct task submitter tests. * Add lease policy tests. * LocalityLeasePolicy --> LocalityAwareLeasePolicy. * Add missing const declarations. Co-authored-by: SangBin Cho <rkooo567@gmail.com> * Add RAY_CHECK for raylet address nullptr when creating lease client. * Make the fact that LocalLeasePolicy always returns the local node more explicit. * Flatten GetLocalityData conditionals to make it more readable. * Add ReferenceCounter::GetLocalityData() unit test. * Add data-intensive microbenchmarks for single-node perf testing. * Add data-intensive microbenchmarks for simulated cluster perf testing. * Remove redundant comment. * Remove data-intensive benchmarks. * Add locality-aware leasing Python test. * Formatting changes in ray_perf.py. Co-authored-by: SangBin Cho <rkooo567@gmail.com> * Enabling the cancellation of non-actor tasks in a worker's queue (#12117) * wrote code to enable cancellation of queued non-actor tasks * minor changes * bug fixes * added comments * rev1 * linting * making ActorSchedulingQueue::CancelTaskIfFound raise a fatal error * bug fix * added two unit tests * linting * iterating through pending_normal_tasks starting from end * fixup! iterating through pending_normal_tasks starting from end * fixup! fixup! 
iterating through pending_normal_tasks starting from end * post merge fixes * added debugging instructions, pulled Accept() out of guarded loop * removed debugging instructions, linting * [Serve] Bug in Serve node memory-related resources calculation #11198 (#13061) * [Release] Update Release Process Documentation (#13123) * [Core] Remove Arrow dependencies (#13157) * remove arrow ubsan * remove arrow build depend * remove arrow buffer * [XGboost] Update Documentation (#13017) Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * [SGD] Fix Docstring for `as_trainable` (#13173) * Revert "Enabling the cancellation of non-actor tasks in a worker's queue (#12117)" (#13178) This reverts commit b4d688b4a64c595a071e8c7380b653e0bfea4ad2. * Surface object store spilling statistics in `ray memory` (#13124) * [ray_client]: Move from experimental to util (#13176) Change-Id: I9f054881f0429092d265cd6944d89804cce9d946 * Remove unused file(object_manager_integration_test.cc) (#12989) * Notify listeners after registered node stored (#13069) * [build]Update description and add some keywords (#13163) * [Collective][PR 2/6] Driver program declarative interfaces (#12874) * scaffold of the code * some scratch and options change * NCCL mostly done, supporting API#1 * interface 2.1 2.2 scratch * put code into ray and fix some importing issues * add an addtional Rendezvous class to safely meet at named actor * fix some small bugs in nccl_util * some small fix * scaffold of the code * some scratch and options change * NCCL mostly done, supporting API#1 * interface 2.1 2.2 scratch * put code into ray and fix some importing issues * add an addtional Rendezvous class to safely meet at named actor * fix some small bugs in nccl_util * some small fix * add a Backend class to make Backend string more robust * add several useful APIs * add some tests * added allreduce test * fix typos * fix several bugs found via unittests * fix and update torch test * changed back actor * rearange a bit before 
importing distributed test * add distributed test * remove scratch code * auto-linting * linting 2 * linting 2 * linting 3 * linting 4 * linting 5 * linting 6 * 2.1 2.2 * fix small bugs * minor updates * linting again * auto linting * linting 2 * final linting * Update python/ray/util/collective_utils.py Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Update python/ray/util/collective_utils.py Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Update python/ray/util/collective_utils.py Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * added actor test * lint * remove local sh * address most of richard's comments * minor update * remove the actor.option() interface to avoid changes in ray core * minor updates Co-authored-by: YLJALDC <dal177@ucsd.edu> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * [serve] Merge ActorReconciler and BackendState (#13139) * [tune] better signature check for `tune.sample_from` (#13171) * [tune] better signature check for `tune.sample_from` * Update python/ray/tune/sample.py Co-authored-by: Sumanth Ratna <sumanthratna@gmail.com> Co-authored-by: Sumanth Ratna <sumanthratna@gmail.com> * Disable atexit test on windows (#13207) * [serve] Move controller state into separate files (#13204) * Update multi_agent_independent_learning.py (#13196) pettingzoo.utils.error.DeprecatedEnv: waterworld_v0 is now depreciated, use waterworld_v2 instead * [Collective] Some necessary abstraction of collective calls before introducing stream management (#13162) * [Tune] Fix PBT Transformers Example (#13174) * [Serve] HTTPOptions for deployment modes (#13142) * [tests] Fix Autoscaler Test failure on Windows (#13211) * skip create_or_update tests * Update python/ray/tests/test_autoscaler.py Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu> Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu> * [BugFix][GCS]Fix gcs_actor_manager_test multithreading bug (#13158) * [GCS]Fix TestActorSubscribeAll bug (#13193) * [Metrics] Record per node and raylet cpu / 
mem usage (#12982) * Record per node and raylet cpu / mem usage * Add comments. * Addressed code review. * [Tune] Fix tune serve integration example (#13233) * [Redis] Note that each Redis Connect retry takes two minutes (#12183) * Slightly alter error message so it's the same in both cases. * Each retry takes about two minutes. * [Log] fix spdlog init race (#12973) * fix spdlog init race * use global logger * refine logger name and constructor * [Release] Add 1.1.0 release test logs (#13054) * Add microbenchmark to release logs * check in many_tasks stress test result * Add results of placement group stress test for 1.1.0 * Add result for test_dead_actors test and correct the name of test_many_tasks.txt * Add rllib regression test result * Add pytorch test results for rllib * remove extraneous log entries * [Core] Fix incorrect comment (#13228) * [Serialization] Fix cloudpickle (#13242) * [GCS]Fix gcs table storage `GetAll` and `GetByJobId` api bug (#13195) * Start ray client server with 'ray start' (#13217) * [GCS]Add gcs actor schedule strategy (#13156) * Publish job/worker info with Hex format instead of Binary (#13235) * [RLlib] SquashedGaussians should throw error when entropy or kl are called. (#13126) * [Serve] Rescale Serve's Long Running Test to Cluster Mode (#13247) Now that `HeadOnly` becomes the new default HTTP location, we can re-enable the long running tests to use local multi-clusters. (also fixed the controller's API to match up to date, we should have caught these, I will open issues for this.) * Update autoscaler-cluster yaml files for release tests (#13114) * [Release] Use ray-ml image for logn running test (#13267) * [RLlib] Fix missing "info_batch" arg (None) in `compute_actions` calls. 
(#13237) * [Tune] Improve error message for Session Detection (#13255) * Improve error message * log once * [Tune] Pin Tune Dependencies (#13027) Co-authored-by: Ian <ian.rodney@gmail.com> * [Dependabot] Add Dependabot (#13278) Co-authored-by: Ian <ian.rodney@gmail.com> * [docker] Pull if image is not present (#13136) * [GCS] Remove old lightweight resource usage report code path (#13192) * [Dashboard] Add GET /log_proxy API (#13165) * Fix a crash problem caused by GetActorHandle in ActorManager (#13164) * [ray_client] Add metadata to gRPC requests (#13167) * [RLlib] Preparatory PR for: Documentation on Model Building. (#13260) * [tune](deps): Bump mlflow from 1.13.0 to 1.13.1 in /python/requirements (#13286) * [tune](deps): Bump gluoncv from 0.9.0 to 0.9.1 in /python/requirements (#13287) * Remove top-level ray.connect() and ray.disconnect() APIs (#13273) * [Pull manager] Only pull once per retry period (#13245) * . * docs * cleanup * . * . * . * . Co-authored-by: Alex <alex@anyscale.com> * [Cancellation] Make Test Cancel Easier to Debug (#13243) * first commit * lint-fix * [ray_client]: first draft of documentation (#13216) * Do not give an error if both `RAY_ADDRESS` and `address` is specified on initialization (#13305) * Finalize handling of RAY_ADDRESS * lint * [serve] Clean up EndpointState interface, move checkpointing inside of EndpointState (#13215) * [RLlib] SlateQ Documentation (#13266) * [RLlib] Add more detailed Documentation on Model building API (#13261) * [tune] convert search spaces: parse spec before flattening (#12785) * Parse spec before flattening * flatten after parse * Test for ValueError if grid search is passed to search algorithms * remove empty extras streaming deps (#12933) * add the method annotation and a comment explaining what's happening (#13306) Change-Id: I848cc2f0beaed95340d9de7cca19a50c78d9da9a * Use wait_for_condition to reduce flakiness in test_queue.py::test_custom_resources (#13210) * [RLlib] Issue 13330: No TF installed 
causes crash in `ModelCatalog.get_action_shape()` (#13332) * [serve] Cleanup backend state, move checkpointing and async goal logic inside (#13298) * fix removal of task dependencies (#13333) Co-authored-by: senlin.zsl <senlin.zsl@antfin.com> * [Serve] Support Starlette streaming response (#13328) * [RLlib] Make TFModelV2 behave more like TorchModelV2: Obsolete register_variables. Unify variable dicts. (#13339) * [client] Report number of currently active clients on connect (#13326) * wip * update * update * reset worker * fix conn * fix * disable pycodestyle * Implement internal kv in ray client (#13344) * kv internal * fix * [Tune] Rename MLFlow to MLflow (#13301) * Forgot overwrite parameter in Ray client internal kv * Fix typo in Tune Docs (Checkpointing) (#13348) See issue #13299 * [Kubernetes][Docs] GPU usage (#13325) * gpu-note * gpu-note * More info * lint? * Update doc/source/cluster/kubernetes.rst Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Update doc/source/cluster/kubernetes.rst Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Update doc/source/cluster/kubernetes.rst Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Update doc/source/cluster/kubernetes.rst Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * GKE->Kubernetes Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Revert "[RLlib] Make TFModelV2 behave more like TorchModelV2: Obsolete register_variables. Unify variable dicts. (#13339)" (#13361) This reverts commit e2b2abb88b82c0c2402a338bba51e5dbd1739419. 
* [Dependabot] [CI] Re-configure Dependabot and disable duplicate builds (#13359) * [tune] buffer trainable results (#13236) * Working prototype * Pass buffer length, fix tests * Don't buffer per default * Dispatch and process save in one go, added tests * Fix tests * Pass adaptive seconds to train_buffered, stop result processing after STOP decision * Fix tests, add release test * Update tests * Added detailed logs for slow operations * Update python/ray/tune/trial_runner.py Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Apply suggestions from code review * Revert tests and go back to old tuning loop * nit Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * [Serve] Add dependency management support for driver not running in a conda env (#13269) * [RLlib] Add `__len__()` method to SampleBatch (#13371) * [Serve] Backend state unit tests (#13319) * trigger doc build for serve updates (#13373) * [Object Spilling] Long running object spilling test (#13331) * done. * formatting. * Remove unimplemented GetAll method in actor info accessor (#13362) * [Doc] Remove trailing whitespaces (#13390) * Enable Ray client server by default (#13350) * update * fix * fix test * update * [RLlib] Trajectory View API: Atari framestacking. 
(#13315) * [ray_client]: Wait for ready and retry on ray.connect() (#13376) * [ray_client]: wait until connection ready Change-Id: Ie443be60c33ab7d6da406b3dcaa57fbb7ba57dd6 * lint Change-Id: I30f8e870bbd5f8859a9f11ae244e210f077cedd0 * docs and retry minimum Change-Id: I43f5378322029267ddd69f518ce8206876e2129d * [Dashboard] Fix missing actor pid (#13229) * [ray_client]: Fix multiple attempts at checking connection (#13422) * Plumb retries update (#13411) * [Serve] [Doc] Improve batching doc (#13389) * [autoscaler/k8s] [CI] Kubernetes test ray up, exec, down (#12514) * Fix Serve release test (#13385) * Add bazel logs upload to GHA (#13251) * [tune] Fix f-string in error message (#13423) * [serve] Pull out goal management logic into AsyncGoalManager class (#13341) * Make request_resources() use internal kv instead of redis pub sub (#13410) * Remove unused handler methods (#13394) * [Tune] Pin Transitive Dependencies (#13358) * Split out the part of get_node_ip_address for which the docstring is correct (#12796) * Fix raylet::MockWorker::GetProcess crashes (#13440) Co-authored-by: 刘宝 <po.lb@antfin.com> * Revert "Enable Ray client server by default (#13350)" (#13429) This reverts commit 912d0cbbf912d5b52d6176155bdff02f504b657d. * Fix linter error (#13451) * [GCS]Add gcs resource scheduler (#13072) * [RLlib] Redo: Make TFModelV2 fully modular like TorchModelV2 (soft-deprecate register_variables, unify var names wrt torch). 
(#13363) * [Core]Fix raylet scheduling bug (#13452) * [Core]Fix raylet scheduling bug * fix lint error * fix lint error Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com> * [joblib] joblib strikes again but this time on windows (#13212) * [ray_client]: fix exceptions raised while executing on the server on behalf of the client (#13424) * [kubernetes][minor] Operator garbage collection fix (#13392) * [Core][CLI] `ray status` and `ray memory` no longer starts a new job (#13391) * Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init() * Modify ray status cli so that it doesn't start a new job via ray.init() * Remove local test file * Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init() * Modify ray status cli so that it doesn't start a new job via ray.init() * Remove local test file * Make status and error args required in commands.py#debug.status * Remove unnecessary imports * Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init() * Modify ray status cli so that it doesn't start a new job via ray.init() * Remove local test file * Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init() * Modify ray status cli so that it doesn't start a new job via ray.init() * Remove local test file * Make status and error args required in commands.py#debug.status * Remove unnecessary imports * Job 38482.1 should now pass * Resolve merge conflict * [RLlib] Deflake 2x remote & local inference tests (external env). (#13459) * [docs] Add more guideline on using ray in slurm cluster (#12819) Co-authored-by: Sumanth Ratna <sumanthratna@gmail.com> Co-authored-by: PENG Zhenghao <pengzh@ie.cuhk.edu.hk> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * [Dashboard] Fix GPU resource rendering issue (#13388) * [Release] Fix Serve release test (#13303) The Docker image we were using now uses `ray` users so we have to call sudo. 
* [serve] Properly obey SERVE_LOG_DEBUG=0 (#13460) * Fix getting runtime context dict in driver (#13417) * [xgb] re-enable xgboost_ray tests (#13416) * re-enable * fix * update xgb_ray version * [Serialization] New custom serialization API (#13291) * new serialization API with doc & test * add more notes * refine notes * doc * [Core] Ownership-based Object Directory: Consolidate location table and reference table. (#13220) * Added owned object reference before Plasma put on Create() + Seal() path. * Consolidated location table and reference table in reference counter. * Restore type in definition. * Clean up owned reference on failed Seal(). * Added RemoveOwnedObject test for reference counter. * Guard against ref going out of scope before location RPCs. * Add 'owner must have ref in scope' precondition to documentation for object location methods. * Move to separate Create() + Seal() methods for existing objects. * Clearer distinction between Create() and Seal() methods. * Make it clear that references will normally be cleaned up by reference counting. * [ray_client]: Support runtime_context as metadata (#13428) * [GCS]Remove unused class variable (#13454) * [Object Spilling] Dedup restore objects (#13470) * done. * Addressed code review. 
* [CI] Enable Dashboard tests for master (#13425) * [docker/dashboard] Fix ray dashboard (#12899) * [CI] Fix Windows Bazel Upload (#13436) * Return version info from Ray client connect, to allow for discovering version mismatches * Update ID specification doc (#13356) * [ray_client]: fix wrong reference in server_pickler (#13474) Change-Id: Ie3d219541b1875e986e72e3ae73ece145c715acf * Bump dev branch to 2.0 to avoid endless version bump toil (#13497) * wip * fix * fix * Remove an unnecessary file (#13499) * [Tests] Skip failing windows tests (#13495) * skip failing windows tests * skip more * remove * updates * [tune] fix small docs typo (#13355) Signed-off-by: Richard Liaw <rliaw@berkeley.edu> * move message to debug (#13472) * Minimal version of piping autoscaler events to driver logs (#13434) * sync write internal config in gcs (#13197) * Refactor node manager to eliminate `new_scheduler_enabled_` (#12936) * [GCS]Only publish changed field when node dead (#13364) * Only update changed field when node dead * node_id missed * [CI] Buildkite PR Environment for Simple Tests (#13130) * [GCS] Remove task info publish as nowhere uses it (#13509) * Remove task info publish as nowhere uses it * simplify right publish channel * [RLlib] Solve PyTorch/TF-eager A3C async race condition between calling model and its value function. (#13467) * [tune] placement group support (#13370) * [Serve] Allow ObjectRef for Composition (#12592) * Add Dashboard Python Test to Buildkite (#13530) * Add ability to not start Monitor when calling `ray start` (#13505) * [tune] support experiment checkpointing for grid search (#13357) * Fix typo (#13098) * Remove PYTHON_MODE that is not defined in Ray so that import * will work from other packages. (#13544) * [RLlib] MARWIL loss function test case and cleanup. (#134…

barakmich added 2 commits January 12, 2021 21:06

[ray_client]: wait until connection ready

811f873

Change-Id: Ie443be60c33ab7d6da406b3dcaa57fbb7ba57dd6

lint

32850f8

Change-Id: I30f8e870bbd5f8859a9f11ae244e210f077cedd0

barakmich assigned ericl Jan 12, 2021

ericl approved these changes Jan 12, 2021

python/ray/util/client/worker.py (review thread resolved)

ericl added the @author-action-required label Jan 12, 2021

ericl reviewed Jan 12, 2021

python/ray/util/client/worker.py (outdated review thread, resolved)

ericl requested changes Jan 12, 2021

docs and retry minimum

d5e9e01

Change-Id: I43f5378322029267ddd69f518ce8206876e2129d
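
For readers skimming the timeline, the "retry minimum" idea in this commit's title can be illustrated with a small, generic sketch. This is not the code from python/ray/util/client/worker.py: the helper name `wait_until_ready`, the `probe` callable, and the `connection_retries` and `delay_s` parameters are all hypothetical, and the sketch assumes "retry minimum" means that at least one connection attempt is made even when the caller asks for zero retries.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool],
                     connection_retries: int = 3,
                     delay_s: float = 0.0) -> int:
    """Call `probe` until it reports ready; return the number of attempts used.

    Hypothetical helper for illustration only. `probe` stands in for whatever
    readiness check a client performs against its server.
    """
    # Assumed meaning of "retry minimum": always try at least once.
    max_attempts = max(1, connection_retries)
    for attempt in range(1, max_attempts + 1):
        if probe():
            return attempt
        # Sleep between attempts, but not after the final failed one.
        if attempt < max_attempts:
            time.sleep(delay_s)
    raise ConnectionError(
        f"server not ready after {max_attempts} attempt(s)")


# Usage: a fake server that only becomes ready on the third check.
_calls = {"n": 0}


def _flaky_probe() -> bool:
    _calls["n"] += 1
    return _calls["n"] >= 3


attempts_used = wait_until_ready(_flaky_probe, connection_retries=5)
```

Clamping the attempt count, rather than rejecting a zero value, keeps a misconfigured caller from silently skipping the connection check; whether the merged change takes that approach is not shown in this page.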

barakmich removed the @author-action-required label Jan 12, 2021

barakmich mentioned this pull request Jan 13, 2021

[ray_client]: Monitor client stream errors #13386

Merged

6 tasks

ericl approved these changes Jan 13, 2021

ericl merged commit 0b22341 into ray-project:master Jan 13, 2021
