Add service deployment instructions to stable diffusion template #37645
Closed
Conversation
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
2.0.0 brings breaking changes. Here's a failed CI run: https://buildkite.com/ray-project/oss-ci-build-pr/builds/27338#01890d26-94ff-42f2-80be-c2ac0a86e8d3/447-762
Signed-off-by: can <can@anyscale.com>
…) (#37110) This PR adds metrics for the object size distribution to help users understand how objects are used in the script.
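The metric itself is internal to Ray, but the underlying idea is a bucketed size histogram. Below is a small illustrative sketch (not Ray's implementation; the bucket boundaries are made up) of how object sizes map into exponential histogram buckets:

```python
import bisect

# Hypothetical exponential bucket boundaries in bytes: 1 KiB, 1 MiB, 1 GiB, 1 TiB.
BUCKET_BOUNDARIES = [2 ** i for i in range(10, 41, 10)]

def bucket_for_size(num_bytes: int) -> int:
    """Index of the histogram bucket that an object of this size falls into."""
    return bisect.bisect_left(BUCKET_BOUNDARIES, num_bytes)

counts = [0] * (len(BUCKET_BOUNDARIES) + 1)
for size in (512, 4096, 50_000_000):  # example object sizes in bytes
    counts[bucket_for_size(size)] += 1
print(counts)  # [1, 1, 1, 0, 0]
```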
…37167) Serve has recently added streaming and WebSocket support. This change adds end-to-end examples to guide users through these features. Link to documentation: https://anyscale-ray--36961.com.readthedocs.build/en/36961/serve/tutorials/streaming.html Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
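For context, here is a minimal hedged sketch of the kind of streaming endpoint those examples walk through, using Ray Serve's FastAPI integration with a `StreamingResponse`; the deployment name and chunk contents are illustrative, not taken from the linked tutorial:

```python
from fastapi import FastAPI
from starlette.responses import StreamingResponse

from ray import serve

app = FastAPI()

@serve.deployment
@serve.ingress(app)
class Streamer:
    @app.get("/stream")
    def stream(self) -> StreamingResponse:
        def chunks():
            for i in range(5):
                yield f"token {i}\n"  # stand-in for incrementally generated output
        return StreamingResponse(chunks(), media_type="text/plain")

serve.run(Streamer.bind())
```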
Signed-off-by: sven1977 <svenmika1977@gmail.com>
This fixes an error caused by the default batch format of Ray Data changing to numpy. We need to manually specify pandas. Signed-off-by: Justin Yu <justinvyu@anyscale.com>
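A minimal sketch of the kind of change this implies (the dataset and function here are illustrative, not the example's actual code): explicitly request pandas batches instead of relying on the numpy default.

```python
import ray

ds = ray.data.range(8)

def add_one(batch):
    # batch arrives as a pandas.DataFrame because batch_format="pandas" is set
    # below; without it, the default batch format is a dict of numpy arrays.
    batch["id"] = batch["id"] + 1
    return batch

ds = ds.map_batches(add_one, batch_format="pandas")
print(ds.take(3))
```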
…syncio (#37062) (#37200) The last ref returned by a streaming generator is a sentinel ObjectRef that contains the end-of-stream error. This suppresses an error from asyncio that the exception is never retrieved (which is expected). Related issue number Closes #36956. --------- Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Currently we rely on the client to wait for all the resources to be released before shutting down the controller. If the client interrupts the process, this can cause an incomplete shutdown. In this PR we moved the shutdown logic into the event loop, triggered by a `_shutting_down` flag on the controller. Even if the client interrupts the process, the controller will continue to shut down all the resources and then kill itself. Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: Gene Der Su <e870252314@gmail.com>
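As an illustration of the pattern described (not Serve's actual controller code), here is an asyncio sketch where shutdown is driven by a `_shutting_down` flag inside the control loop, so a client disconnect cannot abort the cleanup:

```python
import asyncio

class Controller:
    def __init__(self):
        self._shutting_down = False

    async def run_control_loop(self):
        # The loop itself performs the teardown once the flag is set,
        # regardless of whether the client that requested it is still around.
        while not self._shutting_down:
            await asyncio.sleep(0.1)
        await self._release_all_resources()

    async def _release_all_resources(self):
        print("releasing resources, then exiting")  # placeholder teardown

    def graceful_shutdown(self):
        self._shutting_down = True  # the client only flips the flag

async def main():
    controller = Controller()
    loop_task = asyncio.create_task(controller.run_control_loop())
    controller.graceful_shutdown()
    await loop_task

asyncio.run(main())
```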
…#37219) `FunctionTrainable.restore_from_object` creates a temporary checkpoint directory. This directory is kept around because we don't control how the user interacts with the checkpoint - they might load it several times, or not at all. Once a new checkpoint is tracked in the status reporter, there is no need to keep the temporary object around anymore. In this PR, we add functionality to remove these temporary directories. Additionally, we adjust the number of checkpoints to keep in `pytorch_pbt_failure` to 10 to reduce disk pressure in the release test; it looks like this led to recent failures of the test. By reducing the total number of checkpoints and fixing the issue with temporary directories, we should see much less disk usage. Signed-off-by: Kai Fricke <kai@anyscale.com>
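The cleanup pattern described boils down to tracking temporary directories and deleting them once a newer checkpoint supersedes them. A hedged sketch (class and method names are illustrative, not Tune's internals):

```python
import shutil
import tempfile

class TmpCheckpointTracker:
    def __init__(self):
        self._tmp_dirs: list[str] = []

    def restore_from_object(self) -> str:
        # Restoring materializes the checkpoint into a temp directory we can't
        # delete eagerly: the user may load it several times or never.
        tmp_dir = tempfile.mkdtemp(prefix="checkpoint_tmp_")
        self._tmp_dirs.append(tmp_dir)
        return tmp_dir

    def on_new_checkpoint_tracked(self):
        # Once a new checkpoint is tracked, the old temp directories are dead
        # weight and can be removed to relieve disk pressure.
        for d in self._tmp_dirs:
            shutil.rmtree(d, ignore_errors=True)
        self._tmp_dirs.clear()
```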
…y for cluster state reporting #37132 (#37176) Why are these changes needed? The labels are declared as strings, and PG will generate (anti)affinity labels. The current implementation generates `_PG_<binary_pg_id>` as the label key. However, binary characters cannot be encoded in a string. This PR changes the PG-generated dynamic labels to `_PG_<hex_pg_id>`, which is also more readable.
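The encoding change is small; a minimal sketch (the ID bytes are made up):

```python
binary_pg_id = b"\x9f\x03\xa2"  # stand-in for a placement group ID's raw bytes

# Old scheme: "_PG_" + raw bytes -- not encodable in a string label.
# New scheme: hex-encode the ID so the label key is plain ASCII and readable.
label_key = "_PG_" + binary_pg_id.hex()
print(label_key)  # _PG_9f03a2
```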
…pdated periodically (#37121) (#37175) Why are these changes needed? It was assumed that resource updates are broadcast periodically (which isn't the case), so the idle time wasn't updated while a node stayed idle. This PR makes the raylet send the last idle time (if idle) to the GCS, and allows the GCS to calculate the duration. --------- Signed-off-by: rickyyx <rickyx@anyscale.com>
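A sketch of the reporting scheme, under the assumption that the report is a simple timestamp field (the field and function names here are hypothetical, not the raylet/GCS protobuf):

```python
import time

def build_resource_report(is_idle: bool, idle_since_s: float | None) -> dict:
    # Raylet side: include the time the node became idle, if it is idle.
    return {"idle_since_s": idle_since_s if is_idle else None}

def idle_duration_s(report: dict, now_s: float) -> float:
    # GCS side: derive the duration itself, so it keeps growing even when
    # no new report arrives while the node stays idle.
    since = report["idle_since_s"]
    return 0.0 if since is None else now_s - since

report = build_resource_report(is_idle=True, idle_since_s=time.time() - 42.0)
print(round(idle_duration_s(report, time.time())))  # ~42
```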
The following examples already use updated APIs:
* Stable Diffusion Batch Prediction with Ray AIR
* GPT-J-6B Batch Prediction with Ray AIR (LLM)

The following examples have been updated to use updated APIs:
* Training a model with distributed XGBoost
* Training a model with distributed LightGBM

I've removed batch prediction sections from the other examples, and, where appropriate, linked to the batch inference user guide. Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
#37301) * [Core] Fix the race condition where gRPC requests are handled while the core worker is not yet initialized (#37117) Why are these changes needed? There is a race condition where the gRPC server starts handling requests before the core worker is initialized. This PR fixes it by waiting for initialization before handling any gRPC request. * update
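The fix follows a standard gating pattern; here is a generic Python sketch of the idea (the actual fix lives in the core worker's C++), where every handler blocks on an initialization event:

```python
import threading

class CoreWorkerStub:
    def __init__(self):
        self._initialized = threading.Event()

    def initialize(self):
        # ... expensive setup happens here ...
        self._initialized.set()  # only now may requests be served

    def handle_request(self, request: str) -> str:
        # Handlers wait for initialization instead of racing with it.
        self._initialized.wait()
        return f"handled {request}"

worker = CoreWorkerStub()
threading.Thread(target=worker.initialize).start()
print(worker.handle_request("ping"))  # blocks until initialize() completes
```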
…ockMetadata` (#37119) (#37263) Currently, the stage execution time used in `StageStatsSummary` is the Dataset's total execution time: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/stats.py#L292 Instead, we should calculate the execution time as the maximum wall time from the stage's `BlockMetadata`, so that this output is correct in cases with multiple stages. Signed-off-by: Scott Lee <sjl@anyscale.com>
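The corrected statistic reduces to taking a maximum instead of the dataset's end-to-end time; a tiny sketch with illustrative inputs:

```python
def stage_execution_time_s(block_wall_times_s: list[float]) -> float:
    # Blocks within a stage execute in parallel, so the stage's wall time
    # is bounded by its slowest block, not the whole dataset's runtime.
    return max(block_wall_times_s, default=0.0)

print(stage_execution_time_s([1.2, 0.7, 3.4]))  # 3.4
```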
…ons in `DataIterator.iter_batches()` (#36842) (#37260) Currently, the prefetch_batches arg of Dataset.iter_batches is used to configure the number of preloaded batches on both the CPU and GPU; therefore, in the typical case where there is much more CPU than GPU, this constrains the number of batches to prefetch on the CPU. This PR adds a separate parameter, _finalize_fn, a user-defined function that is executed in a separate threadpool, which allows these steps to be parallelized. For example, this could be useful for host-to-device transfers as the last step in getting a batch; this is the default _finalize_fn used when _collate_fn is not specified. Note that when _collate_fn is provided by the user, they should also handle the host-to-device transfer themselves outside of _collate_fn in order to maximize performance. --------- Signed-off-by: Scott Lee <sjl@anyscale.com> Signed-off-by: amogkam <amogkamsetty@yahoo.com> Co-authored-by: amogkam <amogkamsetty@yahoo.com>
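A hedged sketch of the mechanism (this mirrors the idea of a finalize step running in a thread pool, not Ray Data's exact internals):

```python
from concurrent.futures import ThreadPoolExecutor

def finalize(batch):
    # Last step in producing a batch, e.g. a host-to-device transfer;
    # here we just tag the batch for illustration.
    return {"device": "gpu", "data": batch}

batches = [[1, 2], [3, 4], [5, 6]]

# Running finalize in a thread pool lets these steps overlap with other work.
with ThreadPoolExecutor(max_workers=2) as pool:
    for finalized in pool.map(finalize, batches):
        print(finalized)
```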
akshay-anyscale requested review from maxpumperla, a team, edoakes, shrekris-anyscale, sihanwang41, zcin, architkulkarni, sven1977, avnishn, ArturNiederfahrenhorst, smorad, kouroshHakha, ericl, scv119, c21, scottjlee, bveeramani, raulchen, kfstorm, fishbone, WangTaoTheTonic, wuisawesome, DmitriGekhtman, AmeerHajAli, robertnishihara, pcmoritz, justinvyu and sofianhnaide as code owners on July 21, 2023 at 16:23.
* [Docs] Fix `:book: Doctest (GPU)` (#36383)
* [Data] Implement optimizer with `Dataset.split()` (#36363)
* [serve] Bump `test_gradio_visualization` to `medium`-sized test (#36378)
* [Data][Docs] Replace `code-block:: python` with `testcode` in Data docs (#36425)
* [serve] Clear `replica_updated_event` in streaming router to avoid blocking proxy event loop (#36459)
* [release/air] Fix release test timeout for `tune_scalability_network_overhead` (#36360)
* [Data] Implement Operators for `union()` (#36242)
* [release/air] Fix `air_example_gptj_deepspeed_fine_tuning.gce` failing to pull model from a public s3 bucket (#36276)
* [RLlib] Fix `env_check` for parametric actions (with action mask) (#34790)
* Revert "[Data] Implement Operators for `union()`" (#36583)
* [RLlib] Enable `eager_tracing=True` by default (#36556)
* [Serve] Support magic attributes in `serve.batch` args (#36381)
* [Serve] Make `@serve.batch` wait before creating initial batch (#36510)
* Revert "Revert "[Data] Implement Operators for `union()`"" (#36587)
* [Tune] Remove `tune/automl` (#35557)
* [Data][Docs] Standardize `from_items` API ref (#36432)
* [RLlib] Remove `vtrace_drop_last_ts` option and add proper vf bootstrapping to IMPALA and APPO (#36013)
* [Dependencies] Remove `urllib3` dependency (#36609)
* `Block` API references (#36692)
* `workspace_template_many_model_training` release test (#36687)
* `test_serve_agent_fault_tolerance.py` (#36745)
* [Data] Enforce strict mode batch format for `DataIterator.iter_batches()` (#36686)
* [serve] Update `noop_latency.py` benchmark to use modern API (#36715)
* [AIR] [Data] Add case for `Dict[str, np.array]` batches in `DummyTrainer` read bytes calculation (#36484)
* [Doc] Add Distributed Testing Example for `pl.Trainer.test()` (#36395)
* [Data] Support partial execution in `Dataset.schema()` with new execution plan optimizer (#36740)
* [Serve] Return error string from `deploy_serve_application` task (#36744)
* [Serve] Make `max_batch_size` and `batch_wait_timeout_s` reconfigurable (#36881)
* [serve] Add Java support for power of two choices routing (`RAY_SERVE_ENABLE_NEW_ROUTING=1`) (#36865)
* `joblib` to `1.2.0` in CI (#36932)
* [Data] Remove `_convert_block_to_tabular_block` (#36943)
* [serve] Fix flaky `test_metrics.py::test_replica_metrics_fields` (#36987)
* [Serve] [Docs] Add note about `DAGDriver` redirect (#36971)
* [Test][Train] Migrate Ray Train `code-block` to `testcode` (#36483)
* [Data] Deprecate `BatchPredictor` (#36947)
* [serve] Clean up microbenchmark & don't pass raw `starlette.requests.Request` object (#37040) (#37057)
* [jobs] Fix `test_backwards_compatibility.py` by pinning `pydantic<2` (#37097) (#37101)
* [air/release][cherry-pick] Fix batch format in `dreambooth` example (#37102) (#37189)
* [AIR][Docs] Remove `BatchPredictor` from examples (#37178) (#37269)
* [Data] Calculate stage execution time in `StageStatsSummary` from `BlockMetadata` (#37119) (#37263)
* [Data] Add option for parallelizing post-collation data batch operations in `DataIterator.iter_batches()` (#36842) (#37260)
* [serve] Pin `fastapi==0.99.1` in `requirements-doc.txt` to fix API reference (#37340) (#37354)
* [Core] Fix `test_get_master_wheel_url` (#37424)
* [Core] Fix `test_get_wheel_filename` (#37433)
* [Data] Cherry-pick #37359: `example://` from docs and code snippets with S3 paths (#37428)
* [air][cherry-pick] Fix behavior of multi-node checkpointing without an external `storage_path` to hard-fail (#37543) (#37567)

Why are these changes needed?
Related issue number
Checks
* I've signed off every commit (`git commit -s`) in this PR.
* I've run `scripts/format.sh` to lint the changes in this PR.
* If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.