Add service deployment instructions to stable diffusion template #37645
Closed
Conversation
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
2.0.0 brings breaking changes. Here's a failed CI run: https://buildkite.com/ray-project/oss-ci-build-pr/builds/27338#01890d26-94ff-42f2-80be-c2ac0a86e8d3/447-762
Signed-off-by: can <can@anyscale.com>
…) (#37110) This PR adds metrics for the object size distribution to help users understand how objects are used in the script.
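The metric itself is internal to Ray, but the underlying idea is a bucketed size histogram. Below is a small illustrative sketch (not Ray's implementation; the bucket boundaries are made up) of how object sizes map into exponential histogram buckets:

```python
import bisect

# Hypothetical exponential bucket boundaries in bytes: 1 KiB, 1 MiB, 1 GiB, 1 TiB.
BUCKET_BOUNDARIES = [2 ** i for i in range(10, 41, 10)]

def bucket_for_size(num_bytes: int) -> int:
    """Index of the histogram bucket that an object of this size falls into."""
    return bisect.bisect_left(BUCKET_BOUNDARIES, num_bytes)

counts = [0] * (len(BUCKET_BOUNDARIES) + 1)
for size in (512, 4096, 50_000_000):  # example object sizes in bytes
    counts[bucket_for_size(size)] += 1
print(counts)  # [1, 1, 1, 0, 0]
```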
…37167) Serve has recently added streaming and WebSocket support. This change adds end-to-end examples to guide users through these features. Link to documentation: https://anyscale-ray--36961.com.readthedocs.build/en/36961/serve/tutorials/streaming.html Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
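For context, here is a minimal hedged sketch of the kind of streaming endpoint those examples walk through, using Ray Serve's FastAPI integration with a `StreamingResponse`; the deployment name and chunk contents are illustrative, not taken from the linked tutorial:

```python
from fastapi import FastAPI
from starlette.responses import StreamingResponse

from ray import serve

app = FastAPI()

@serve.deployment
@serve.ingress(app)
class Streamer:
    @app.get("/stream")
    def stream(self) -> StreamingResponse:
        def chunks():
            for i in range(5):
                yield f"token {i}\n"  # stand-in for incrementally generated output
        return StreamingResponse(chunks(), media_type="text/plain")

serve.run(Streamer.bind())
```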
Signed-off-by: sven1977 <svenmika1977@gmail.com>
This fixes an error caused by the default batch format of Ray Data changing to numpy. We need to manually specify pandas. Signed-off-by: Justin Yu <justinvyu@anyscale.com>
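A minimal sketch of the kind of change this implies (the dataset and function here are illustrative, not the example's actual code): explicitly request pandas batches instead of relying on the numpy default.

```python
import ray

ds = ray.data.range(8)

def add_one(batch):
    # batch arrives as a pandas.DataFrame because batch_format="pandas" is set
    # below; without it, the default batch format is a dict of numpy arrays.
    batch["id"] = batch["id"] + 1
    return batch

ds = ds.map_batches(add_one, batch_format="pandas")
print(ds.take(3))
```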
…syncio (#37062) (#37200) The last ref returned by a streaming generator is a sentinel ObjectRef that contains the end-of-stream error. This suppresses an error from asyncio that the exception is never retrieved (which is expected). Related issue number Closes #36956. --------- Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Currently we rely on the client to wait for all the resources to be released before shutting down the controller. If the client interrupts the process, this can cause an incomplete shutdown. In this PR we moved the shutdown logic into the event loop, triggered by a `_shutting_down` flag on the controller. Even if the client interrupts the process, the controller will continue to shut down all the resources and then kill itself. Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: Gene Der Su <e870252314@gmail.com>
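As an illustration of the pattern described (not Serve's actual controller code), here is an asyncio sketch where shutdown is driven by a `_shutting_down` flag inside the control loop, so a client disconnect cannot abort the cleanup:

```python
import asyncio

class Controller:
    def __init__(self):
        self._shutting_down = False

    async def run_control_loop(self):
        # The loop itself performs the teardown once the flag is set,
        # regardless of whether the client that requested it is still around.
        while not self._shutting_down:
            await asyncio.sleep(0.1)
        await self._release_all_resources()

    async def _release_all_resources(self):
        print("releasing resources, then exiting")  # placeholder teardown

    def graceful_shutdown(self):
        self._shutting_down = True  # the client only flips the flag

async def main():
    controller = Controller()
    loop_task = asyncio.create_task(controller.run_control_loop())
    controller.graceful_shutdown()
    await loop_task

asyncio.run(main())
```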
…#37219) `FunctionTrainable.restore_from_object` creates a temporary checkpoint directory. This directory is kept around because we don't control how the user interacts with the checkpoint - they might load it several times, or not at all. Once a new checkpoint is tracked in the status reporter, there is no need to keep the temporary object around anymore. In this PR, we add functionality to remove these temporary directories. Additionally, we adjust the number of checkpoints to keep in `pytorch_pbt_failure` to 10 to reduce disk pressure in the release test; it looks like this led to recent failures of the test. By reducing the total number of checkpoints and fixing the issue with temporary directories, we should see much less disk usage. Signed-off-by: Kai Fricke <kai@anyscale.com>
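The cleanup pattern described boils down to tracking temporary directories and deleting them once a newer checkpoint supersedes them. A hedged sketch (class and method names are illustrative, not Tune's internals):

```python
import shutil
import tempfile

class TmpCheckpointTracker:
    def __init__(self):
        self._tmp_dirs: list[str] = []

    def restore_from_object(self) -> str:
        # Restoring materializes the checkpoint into a temp directory we can't
        # delete eagerly: the user may load it several times or never.
        tmp_dir = tempfile.mkdtemp(prefix="checkpoint_tmp_")
        self._tmp_dirs.append(tmp_dir)
        return tmp_dir

    def on_new_checkpoint_tracked(self):
        # Once a new checkpoint is tracked, the old temp directories are dead
        # weight and can be removed to relieve disk pressure.
        for d in self._tmp_dirs:
            shutil.rmtree(d, ignore_errors=True)
        self._tmp_dirs.clear()
```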
…y for cluster state reporting #37132 (#37176) Why are these changes needed? The labels are declared as strings, and PG will generate (anti)affinity labels. The current implementation generates `_PG_<binary_pg_id>` as the label key. However, binary characters cannot be encoded in a string. This PR changes the PG-generated dynamic labels to `_PG_<hex_pg_id>`, which is also more readable.
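The encoding change is small; a minimal sketch (the ID bytes are made up):

```python
binary_pg_id = b"\x9f\x03\xa2"  # stand-in for a placement group ID's raw bytes

# Old scheme: "_PG_" + raw bytes -- not encodable in a string label.
# New scheme: hex-encode the ID so the label key is plain ASCII and readable.
label_key = "_PG_" + binary_pg_id.hex()
print(label_key)  # _PG_9f03a2
```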
…pdated periodically (#37121) (#37175) Why are these changes needed? It was assumed that resource updates are broadcast periodically (which isn't the case), so the idle time wasn't updated while a node stayed idle. This PR makes the raylet send the last idle time (if idle) to the GCS, and allows the GCS to calculate the duration. --------- Signed-off-by: rickyyx <rickyx@anyscale.com>
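A sketch of the reporting scheme, under the assumption that the report is a simple timestamp field (the field and function names here are hypothetical, not the raylet/GCS protobuf):

```python
import time

def build_resource_report(is_idle: bool, idle_since_s: float | None) -> dict:
    # Raylet side: include the time the node became idle, if it is idle.
    return {"idle_since_s": idle_since_s if is_idle else None}

def idle_duration_s(report: dict, now_s: float) -> float:
    # GCS side: derive the duration itself, so it keeps growing even when
    # no new report arrives while the node stays idle.
    since = report["idle_since_s"]
    return 0.0 if since is None else now_s - since

report = build_resource_report(is_idle=True, idle_since_s=time.time() - 42.0)
print(round(idle_duration_s(report, time.time())))  # ~42
```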
The following examples already use updated APIs:
* Stable Diffusion Batch Prediction with Ray AIR
* GPT-J-6B Batch Prediction with Ray AIR (LLM)

The following examples have been updated to use updated APIs:
* Training a model with distributed XGBoost
* Training a model with distributed LightGBM

I've removed batch prediction sections from the other examples, and, where appropriate, linked to the batch inference user guide. Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
#37301) * [Core] Fix the race condition where gRPC requests are handled while the core worker is not yet initialized (#37117) Why are these changes needed? There is a race condition where the gRPC server starts handling requests before the core worker is initialized. This PR fixes it by waiting for initialization before handling any gRPC request. * update
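The fix follows a standard gating pattern; here is a generic Python sketch of the idea (the actual fix lives in the core worker's C++), where every handler blocks on an initialization event:

```python
import threading

class CoreWorkerStub:
    def __init__(self):
        self._initialized = threading.Event()

    def initialize(self):
        # ... expensive setup happens here ...
        self._initialized.set()  # only now may requests be served

    def handle_request(self, request: str) -> str:
        # Handlers wait for initialization instead of racing with it.
        self._initialized.wait()
        return f"handled {request}"

worker = CoreWorkerStub()
threading.Thread(target=worker.initialize).start()
print(worker.handle_request("ping"))  # blocks until initialize() completes
```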
…ockMetadata` (#37119) (#37263) Currently, the stage execution time used in `StageStatsSummary` is the Dataset's total execution time: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/stats.py#L292 Instead, we should calculate the execution time as the maximum wall time from the stage's `BlockMetadata`, so that this output is correct in cases with multiple stages. Signed-off-by: Scott Lee <sjl@anyscale.com>
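The corrected statistic reduces to taking a maximum instead of the dataset's end-to-end time; a tiny sketch with illustrative inputs:

```python
def stage_execution_time_s(block_wall_times_s: list[float]) -> float:
    # Blocks within a stage execute in parallel, so the stage's wall time
    # is bounded by its slowest block, not the whole dataset's runtime.
    return max(block_wall_times_s, default=0.0)

print(stage_execution_time_s([1.2, 0.7, 3.4]))  # 3.4
```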
…ons in `DataIterator.iter_batches()` (#36842) (#37260) Currently, the prefetch_batches arg of Dataset.iter_batches is used to configure the number of preloaded batches on both the CPU and GPU; therefore, in the typical case where there is much more CPU than GPU, this constrains the number of batches to prefetch on the CPU. This PR adds a separate parameter, _finalize_fn, a user-defined function that is executed in a separate threadpool, which allows these steps to be parallelized. For example, this could be useful for host-to-device transfers as the last step in getting a batch; this is the default _finalize_fn used when _collate_fn is not specified. Note that when _collate_fn is provided by the user, they should also handle the host-to-device transfer themselves outside of _collate_fn in order to maximize performance. --------- Signed-off-by: Scott Lee <sjl@anyscale.com> Signed-off-by: amogkam <amogkamsetty@yahoo.com> Co-authored-by: amogkam <amogkamsetty@yahoo.com>
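A hedged sketch of the mechanism (this mirrors the idea of a finalize step running in a thread pool, not Ray Data's exact internals):

```python
from concurrent.futures import ThreadPoolExecutor

def finalize(batch):
    # Last step in producing a batch, e.g. a host-to-device transfer;
    # here we just tag the batch for illustration.
    return {"device": "gpu", "data": batch}

batches = [[1, 2], [3, 4], [5, 6]]

# Running finalize in a thread pool lets these steps overlap with other work.
with ThreadPoolExecutor(max_workers=2) as pool:
    for finalized in pool.map(finalize, batches):
        print(finalized)
```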
akshay-anyscale requested review from maxpumperla, a team, edoakes, shrekris-anyscale, sihanwang41, zcin, architkulkarni, sven1977, avnishn, ArturNiederfahrenhorst, smorad, kouroshHakha, ericl, scv119, c21, scottjlee, bveeramani, raulchen, kfstorm, fishbone, WangTaoTheTonic, wuisawesome, DmitriGekhtman, AmeerHajAli, robertnishihara, pcmoritz, justinvyu and sofianhnaide as code owners on July 21, 2023 at 16:23.
* [Docs] Fix `:book: Doctest (GPU)` (#36383)
* [Data] Implement optimizer with `Dataset.split()` (#36363)
* [serve] Bump `test_gradio_visualization` to `medium`-sized test (#36378)
* [Data][Docs] Replace `code-block:: python` with `testcode` in Data docs (#36425)
* [serve] Clear `replica_updated_event` in streaming router to avoid blocking proxy event loop (#36459)
* [release/air] Fix release test timeout for `tune_scalability_network_overhead` (#36360)
* [Data] Implement Operators for `union()` (#36242)
* [release/air] Fix `air_example_gptj_deepspeed_fine_tuning.gce` failing to pull model from a public s3 bucket (#36276)
* [RLlib] Fix `env_check` for parametric actions (with action mask) (#34790)
* Revert "[Data] Implement Operators for `union()`" (#36583)
* [RLlib] Enable `eager_tracing=True` by default (#36556)
* [Serve] Support magic attributes in `serve.batch` args (#36381)
* [Serve] Make `@serve.batch` wait before creating initial batch (#36510)
* Revert "Revert "[Data] Implement Operators for `union()`"" (#36587)
* [Tune] Remove `tune/automl` (#35557)
* [Data][Docs] Standardize `from_items` API ref (#36432)
* [RLlib] Remove `vtrace_drop_last_ts` option and add proper vf bootstrapping to IMPALA and APPO (#36013)
* [Dependencies] Remove `urllib3` dependency (#36609)
* `Block` API references (#36692)
* `workspace_template_many_model_training` release test (#36687)
* `test_serve_agent_fault_tolerance.py` (#36745)
* [Data] Enforce strict mode batch format for `DataIterator.iter_batches()` (#36686)
* [serve] Update `noop_latency.py` benchmark to use modern API (#36715)
* [AIR] [Data] Add case for `Dict[str, np.array]` batches in `DummyTrainer` read bytes calculation (#36484)
* [Doc] Add Distributed Testing Example for `pl.Trainer.test()` (#36395)
* [Data] Support partial execution in `Dataset.schema()` with new execution plan optimizer (#36740)
* [Serve] Return error string from `deploy_serve_application` task (#36744)
* [Serve] Make `max_batch_size` and `batch_wait_timeout_s` reconfigurable (#36881)
* [serve] Add Java support for power of two choices routing (`RAY_SERVE_ENABLE_NEW_ROUTING=1`) (#36865)
* `joblib` to `1.2.0` in CI (#36932)
* [Data] Remove `_convert_block_to_tabular_block` (#36943)
* [serve] Fix flaky `test_metrics.py::test_replica_metrics_fields` (#36987)
* [Serve] [Docs] Add note about `DAGDriver` redirect (#36971)
* [Test][Train] Migrate Ray Train `code-block` to `testcode` (#36483)
* [Data] Deprecate `BatchPredictor` (#36947)
* [serve] Clean up microbenchmark & don't pass raw `starlette.requests.Request` object (#37040) (#37057)
* [jobs] Fix `test_backwards_compatibility.py` by pinning `pydantic<2` (#37097) (#37101)
* [air/release][cherry-pick] Fix batch format in `dreambooth` example (#37102) (#37189)
* [AIR][Docs] Remove `BatchPredictor` from examples (#37178) (#37269)
* [Data] Calculate stage execution time in `StageStatsSummary` from `BlockMetadata` (#37119) (#37263)
* [Data] Add option for parallelizing post-collation data batch operations in `DataIterator.iter_batches()` (#36842) (#37260)
* [serve] Pin `fastapi==0.99.1` in `requirements-doc.txt` to fix API reference (#37340) (#37354)
* [Core] Fix `test_get_master_wheel_url` (#37424)
* [Core] Fix `test_get_wheel_filename` (#37433)
* [Data] Cherry-pick #37359: `example://` from docs and code snippets with S3 paths (#37428)
* [air][cherry-pick] Fix behavior of multi-node checkpointing without an external `storage_path` to hard-fail (#37543) (#37567)

Why are these changes needed?
Related issue number
Checks
* I've signed off every commit (`git commit -s`) in this PR.
* I've run `scripts/format.sh` to lint the changes in this PR.
* If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.