[AIR] LightningTrainer Dolly V2 FSDP Fine-tuning Example #34990

Merged

@woshiyyya woshiyyya commented May 3, 2023

Why are these changes needed?

This PR adds a flagship LightningTrainer example that shows how to fine-tune an LLM with FSDP.

Dataset: tiny_shakespeare
Cluster: g4dn.8xlarge (head) + 15 x g4dn.4xlarge (workers)
Performance:

| | dolly-v2-3b | dolly-v2-7b |
| --- | --- | --- |
| Training time | 553s | 3024s |
| Training + checkpointing time | 703s | 3583s |
| Cost | $2.96 | $16.2 |

Release Test Passed:

Rendered Doc: https://anyscale-ray--34990.com.readthedocs.build/en/34990/ray-air/examples/dolly_lightning_fsdp_finetuning.html

TODO:

  • Keep only the 7B example
  • Inference on a single T4 GPU
  • Convert this as a release test
  • Add links in toc tree and other examples.


Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
@woshiyyya woshiyyya marked this pull request as ready for review May 3, 2023 18:18
@woshiyyya woshiyyya added air train Ray Train Related Issue labels May 3, 2023
@woshiyyya woshiyyya marked this pull request as draft May 6, 2023 04:39
@woshiyyya woshiyyya marked this pull request as ready for review May 8, 2023 17:51
@woshiyyya woshiyyya assigned richardliaw and unassigned Yard1 May 8, 2023
@gjoliver gjoliver merged commit 886926c into ray-project:master May 10, 2023
1 check passed
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023
…#34990)

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
fishbone added a commit that referenced this pull request May 24, 2023
* [Overview][Serve] Add Recent Serve Applications Card (#34642)

Expected Results:

Create a Recent Serve Applications card: We have a Serve page with a table containing information about the Application name, Import Path, and Status. We will create a card based on this information and the styles of the Recent Jobs Card.
Different icons to show: https://files.slack.com/files-pri/TKC2KFWG3-F055XEKED98/screen_shot_2023-05-02_at_12.01.36_pm.png

* Clean SWR cache between each test cases (#35097)

In some cases, it's necessary to mock the useSWR function in Jest test cases. However, if we don't clear the SWR cache between different test cases, we may encounter an error where a test case unintentionally reuses data from a previous test case instead of creating new mock data.

By implementing this fix, we can ensure that our test cases are isolated and independent, and that they accurately reflect the behavior of our code under different scenarios.

* Revert "[Overview][Serve] Add Recent Serve Applications Card" (#35155)

Reverts ray-project/ray#34642

This seems to be breaking all pipeline builds, as it is failing to build the base container.

* [air/output] Fix trial status at end (more info + cut off) (#35128)

This PR ensures that the full trial status table is printed at the end of a Ray Tune run with the new output engine. Additionally, trial status data was previously always cut off; now we enforce that when `force=True`, all trial data is reported.

It also fixes a bug for showing the `more_info` field (how many more trials with a specific status are available).

Signed-off-by: Kai Fricke <kai@anyscale.com>

* [Core] Fix async actor shutdown issue when exit_actor is used (#32407)

There are 2 issues.

1. When the actor exits via sys.exit, exit_actor, or max_call=1, we didn't cancel queued tasks, so all queued tasks were still executed even after an exit API was called. This is unexpected and unintuitive behavior.
2. A segfault happened when disconnect() was called from the exit_actor API while tasks were still queued. The actor won't exit until all queued tasks have executed, but because disconnect() had already been called, the worker broke with a segfault (disconnect is not expected to be called while actor tasks are executing). This happened even with a normal (non-async) actor if there were queued tasks when exit_actor was called.

This PR fixes the issues by doing 2 things.

1. If the cpp Exit API is called, we guarantee the queued tasks won't be executed. I fixed this by returning from ExecuteTask immediately. Alternatively, we could manually clean actor_scheduling_queue, but that would require much more complicated code to produce a good error message. I am open to this approach as well.
2. Remove the disconnect call from the exit_actor API. It was written before 2020, and the comment there seems irrelevant (all tests also seem to pass, so it should be okay). I assume it was a hack, and the issue from the comment was fixed at some point.

Also, this PR adds 2 guarantees to the exit_actor APIs.

1. Once exit_actor or exit is called on an actor, no additional tasks will run on that actor. Any queued or incoming requests will fail with a clear error message.
2. When the actor is terminated via exit_actor or exit, the atexit handler is guaranteed to be called (I will add tests).
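The two guarantees can be illustrated with a toy, single-threaded model. This is only a sketch of the described semantics, not Ray's implementation: `ToyActor`, `ActorExitedError`, and the `atexit_called` flag are hypothetical names invented for the illustration.

```python
from collections import deque


class ActorExitedError(RuntimeError):
    """Hypothetical error for tasks rejected because the actor exited."""


class ToyActor:
    """Toy model of an actor with a task queue (not Ray's implementation)."""

    def __init__(self):
        self._queue = deque()
        self._exited = False
        self.atexit_called = False

    def submit(self, fn):
        """Queue a task; the returned dict later holds 'result' or 'error'."""
        slot = {}
        if self._exited:
            slot["error"] = ActorExitedError("actor has exited; task rejected")
        else:
            self._queue.append((fn, slot))
        return slot

    def exit(self):
        """Mark the actor as exited and run the exit handler exactly once."""
        if not self._exited:
            self._exited = True
            self.atexit_called = True

    def run(self):
        """Drain the queue; after exit(), fail remaining tasks instead of running them."""
        while self._queue:
            fn, slot = self._queue.popleft()
            if self._exited:
                slot["error"] = ActorExitedError("actor has exited; queued task cancelled")
                continue
            slot["result"] = fn(self)


actor = ToyActor()
first = actor.submit(lambda a: "ran")
second = actor.submit(lambda a: a.exit())        # the task that triggers the exit
third = actor.submit(lambda a: "should not run")  # queued behind the exit
actor.run()
late = actor.submit(lambda a: "too late")         # submitted after the exit
```

In this model, `third` and `late` carry an error rather than a result, which mirrors the "fail with a clear error message" guarantee.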

* [serve] Add log file path to replica details (#33640)

Add absolute file path to log files for each replica.
https://github.com/ray-project/ray/pull/33503#discussion_r1142835813

Example:
```
replicas:
- replica_id: foo_DAGDriver#jsrUNs
  state: RUNNING
  pid: 68276
  actor_name: SERVE_REPLICA::foo_DAGDriver#jsrUNs
  actor_id: 7c1c702270bb634a7cf4c24f01000000
  node_id: 568bf20e0658e89361a997fe57b896b15fcb97268f3b039e1513c6a5
  node_ip: 192.168.1.14
  start_time_s: 1679598497.387779
  log_file_path_id: /serve/deployment_foo_DAGDriver_foo_DAGDriver#jsrUNs.log
```

* [Docker] [runtime env] Bump boto3 version from 1.4.8 to 1.26.82, add pyOpenSSL and cryptography (#33273)

runtime_env working_dir S3 urls require a recent version of boto3 to read environment variables for authentication for downloading from private buckets. We currently include an outdated boto3 version in the Ray Docker images.

This PR bumps the version in the Ray Docker images to make the S3 working_dir download feature work out of the box.

The reason this is important is that users might try to use S3 URLs for runtime_env with the Ray Docker image, but it's hard to debug the failure that occurs with the outdated boto3 version (see linked issue). This is worse than not having boto3 installed, since in that case the error message is clear ("You must pip install boto3 to fetch URIs").

Related issue number
Closes #33256
Closes #34752

* [core] Make ray.get(timeout=0) to throw timeout error (#35126)

Why are these changes needed?
With telemetry tracking since Ray 2.3, we have not seen significant and recent usage of the timeout=0 behaviour.

So we will update this behaviour as documented in #28465
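As an analogy for the new semantics (a timeout of 0 means "do not wait at all" and raises if the value is not ready), here is a sketch using only the standard library's `concurrent.futures`. It does not call Ray; the `get` helper is a hypothetical stand-in.

```python
import concurrent.futures
import threading


def get(future, timeout=None):
    """Hypothetical stand-in for a blocking get; timeout=0 means 'do not wait'."""
    return future.result(timeout=timeout)


release = threading.Event()
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(release.wait)  # blocks until the event is set
    try:
        get(pending, timeout=0)  # result is not ready, so this raises
        timed_out = False
    except concurrent.futures.TimeoutError:
        timed_out = True
    release.set()  # let the task finish
    value = get(pending, timeout=5)
```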

cc vitrioil for the original PR: https://github.com/ray-project/ray/pull/30210/files
Signed-off-by: Ricky Xu <xuchen727@hotmail.com>

---------

Signed-off-by: Ricky Xu <xuchen727@hotmail.com>
Co-authored-by: vitrioil <opm249@gmail.com>
Co-authored-by: Prem <41074533+vitrioil@users.noreply.github.com>

* [core] Change worker niceness in job submission environment (#34727)

The niceness of the job supervisor should be set to 0.

Signed-off-by: vitsai <victoria@anyscale.com>

* [ci/release] Resolve dependencies with python 3.9 inside conda. (#35176)

Was resolved in a python 3.7 environment. Now resolving in a python 3.9 environment.

Also upgraded dependencies.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

* [Data] Remove "Scalable Batch Inference with Ray" from batch inference examples (#35151)

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>

BatchPredictor isn't a recommended API for batch inference anymore. "Scalable Batch Inference with Ray" uses BatchPredictor, so we're removing it until it gets updated with the recommended APIs.

* [core] Start ray syncer reconnection after a delay (#35115)

Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>

Previously, when a connection was broken, the syncer tried to reconnect immediately. Network issues usually take a while to recover, so this PR adds a 2s delay before initiating a reconnect to make the workload more reasonable.
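The idea of waiting before each reconnect attempt can be sketched as follows. This is an illustrative Python helper, not the syncer's actual code; `reconnect_with_delay`, `connect`, and `max_attempts` are assumptions made for the example, and only the 2s default comes from the description above.

```python
import time


def reconnect_with_delay(connect, delay_s=2.0, max_attempts=5, sleep=time.sleep):
    """Wait `delay_s` before every reconnect attempt instead of retrying at once.

    `connect` is a hypothetical callable that returns a connection or raises
    ConnectionError. Returns (connection, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        sleep(delay_s)  # give the network time to recover before trying
        try:
            return connect(), attempt
        except ConnectionError:
            continue
    raise ConnectionError(f"gave up after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.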

* [docker] Add netbase to base deps docker image (#35174)

This package is available in the ubuntu:focal base images but not in the CUDA base images, but may be required by downstream dependencies in our docker ML images.

Signed-off-by: Kai Fricke <kai@anyscale.com>

* [Data] Update `pipelined_training_50_gb.aws` instance type (#35150)

Anyscale recently stopped supporting i3.8xlarge instance types. As a result, the pipelined_training_50_gb.aws release test -- which uses i3.8xlarge -- has been failing.

This PR updates the instance type to m6i.16xlarge (a supported instance type).

---------

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>

* [data] Update the strict mode message to be less confusing (#35185)

* [RLlib] Activate RLModules and Learner together in docs (#35145)

Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>

* [RLlib] Add test utils for rllib contrib (#35056)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [Data] Allow fusing `MapOperator` -> `Repartition` operators (#35178)

As a followup to https://github.com/ray-project/ray/pull/34847, allow fusing `MapOperator` -> `Repartition` operators for the shuffle repartition case (we do not support fusing for split repartition, which only uses `ShuffleTaskSpec.reduce` and thus cannot call the upstream map function passed to `ShuffleTaskSpec.map`).

Signed-off-by: Scott Lee <sjl@anyscale.com>

* [core] Add object owner and copy metrics to node stats (#35119)

This PR adds the object owner and copy metrics to `GetNodeStats` RPC endpoint.

Inlined small objects are not counted as a copy because they are not stored in the object store; when they are used, they are copied inline, so there is no need to count them.

They are still counted as 1 for ownership, for correctness, because they are actually owned by the worker.

Local copies are retrieved directly from the local object manager, while owner counts require the caller to aggregate the metrics from each core worker.

* [data] Revert the dataset to datastream class rename (#35082)

After getting further feedback about confusion from some types of users, we've decided to not proceed with the Dataset -> Datastream rename for 2.5. Instead, we will retain the data structure name and just refer to it as "streaming datasets" in the copy and emphasize its streaming nature in other ways.

---------

Signed-off-by: Eric Liang <ekhliang@gmail.com>

* [AIR] LightningTrainer Dolly V2 FSDP Fine-tuning Example (#34990)

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

* [ci][core] Remove test_ray_get_timeout_zero #35196

Signed-off-by: Ricky Xu <xuchen727@hotmail.com>

* [Core]Fixing the flakey test cases caused by Redis startup failure due to port conflicts (#35127)

The test_placement_group_3 test case occasionally fails. The cause is that Redis failed to start ("Warning: Could not create server TCP listening socket ::*:49152: bind: Address already in use").

For test cases that use external Redis, we now check whether the Redis process started successfully and retry if it failed to start.
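A minimal sketch of the check-and-retry pattern follows. It is not the test fixture from this PR: `start_with_retry`, `start_server`, and `pick_port` are hypothetical names, and choosing a fresh port on each attempt is an assumption of this example.

```python
class StartupError(RuntimeError):
    """Hypothetical error raised when every start attempt failed."""


def start_with_retry(start_server, pick_port, max_attempts=5):
    """Call start_server(port); if it fails, pick another port and try again.

    Returns (server, port) from the first successful attempt.
    """
    last_error = None
    for _ in range(max_attempts):
        port = pick_port()
        try:
            return start_server(port), port
        except OSError as exc:  # e.g. "Address already in use"
            last_error = exc
    raise StartupError(
        f"server failed to start after {max_attempts} attempts"
    ) from last_error
```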

* [core][state] Push down filtering to GCS for listing/getting task from state api (#35109)

Re-Revert of #34433

* [Core] Add bundles_to_node_id info in placement_group_table (#35122)

gcs_utils.PlacementGroupTableData already contains the node_id for each bundle, but the Python ray.util.placement_group_table() interface does not expose it. This PR adds a "bundles_to_node_id" field to the result returned by ray.util.placement_group_table().

To ensure compatibility, the information is added as a new "bundles_to_node_id" field.

* Downgrade hermetic python to 3.8 (#35198)

And build dependencies with 3.7.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

* [Core] Put pg state to kv store when pg rescheduling (resubmit) (#34948)

This PR may have caused flakey failures in the test case 'test_placement_group_3', so it was rolled back.
This is a resubmitted PR

If it is confirmed that the issue was caused by this PR, then I will make the necessary modifications to address the problem

* Add runtime env metadata to jobs detail page. (#34984)

Also makes the job detail page work when accessed via the submission id in the path. This will enable future work to link to submission-only jobs.
Also fixes bug where the grafana dashboard dropdowns for Deployments and Replicas don't work until after the first request was received for that replica or deployment.

* [RLlib] Unity3D adapter: Disable env pre-checking (agent IDs not known before connection to Unity editor). (#35167)

* [docs] update batch guide link, fix tensor ref (#35171)

Direct users to the new batch inference guide. Found a broken reference while doing so.

Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>

* [docs] synced tabs in AIR getting started (#35170)

as discussed here https://docs.google.com/document/d/1fMF-Pt0gzJDhPJpmGQVUmEoUXsPwvSnHnJR058lLm8g/edit?disco=AAAAuOuL7ME&usp_dm=true

Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>

* [docs] fixing missing libs in batch-x examples (#35169)

currently going through some batch-processing related examples, noticed that we're still missing installation instructions in some of them.

Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>

* [docs] replace deprecated sklearn by scikit-learn installation (#35168)

some AIR notebooks use pip install sklearn, which is deprecated (see https://pypi.org/project/sklearn/), we fix this here.

Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>

* [RLlib] No longer return action distribution objects from RLModule's `forward_...()` methods. (#35085)

* [docs][serve] add note that Ray doesn't pickle (#35194)

Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>

* [docs] batch inference pass (#35041)

we're focusing more on explaining how batch inference works without Ray first, what the differences are, and what to know about batches and their formats to scale out your workloads.

* [docs] fix outdated tensor data ref (#35212)

Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>

* [data] [doc] Fix dataset images

* [RLlib] Replace calls to socket in learner group for getting ip address with ray (#35218)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [Data][CI] Mark `dataset_shuffle_sort_1tb` tests as unstable (#35203)

`dataset_shuffle_sort_1tb` and its chaos variant have been flaky for some time. We don't have time to fix this right now, so I'm marking this as unstable.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>

* [data] Update tagline to datasets for ML (#35228)

* [Data] Improve notebook widget display (#34359)

This PR aims to fix some outstanding issues with notebook ipywidget display.

Changes error messaging for the ipywidgets soft dependency to include explicit instructions to:
- Install/upgrade ipywidgets as appropriate
- Restart the notebook (i.e. jupyter) server, without which widgets will not be properly displayed.

I've also added a number of tests to ensure these decorators are working correctly.

Switch from using _ipython_display_ to _repr_mimebundle_ for displaying reprs of DataParallelTrainer and Datastream objects.
This change is motivated by the fact that when using _ipython_display_ to display widgets, it is the responsibility of the author of the _ipython_display_ method to identify the right repr to display depending on the display capabilities of the frontend. This introduces additional complexity, since IPython.get_ipython() is usually the way by which people detect whether they are running in a notebook, but the result of this function depends on the kernel being used and doesn't directly tell you about the display capabilities of the frontend.

Instead, a better way to do this is to provide a variety of reprs (e.g. an ipywidget, a simple text repr, etc...) and let the frontend decide which one to display. This is what the _repr_mimebundle_ function is meant to do.
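The `_repr_mimebundle_` hook described above can be sketched with a plain class. `TrainerSummary` and its fields are hypothetical and are not the classes touched by this PR; the hook name and its `include`/`exclude` parameters follow the IPython display protocol.

```python
import html


class TrainerSummary:
    """Hypothetical object that offers several representations of itself."""

    def __init__(self, name, num_workers):
        self.name = name
        self.num_workers = num_workers

    def __repr__(self):
        return f"TrainerSummary(name={self.name!r}, num_workers={self.num_workers})"

    def _repr_mimebundle_(self, include=None, exclude=None):
        """Return all available representations, keyed by MIME type.

        The frontend, not this object, decides which entry to render.
        """
        bundle = {
            "text/plain": repr(self),
            "text/html": f"<b>{html.escape(self.name)}</b>: {self.num_workers} workers",
        }
        if include:
            bundle = {k: v for k, v in bundle.items() if k in include}
        if exclude:
            bundle = {k: v for k, v in bundle.items() if k not in exclude}
        return bundle
```

A terminal frontend would fall back to `text/plain`, while a notebook can pick `text/html`, without the object having to detect where it is running.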

---------

Signed-off-by: pdmurray <peynmurray@gmail.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Co-authored-by: amogkam <amogkamsetty@yahoo.com>

* [CI/air] Fix lightning_gpu_tune_.* release test (#35193)

Temporarily fix the release test failures described in #35187. TODO: Come up with a holistic solution for metric dict flattening.



Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

* [Doc] Correctly Render the Enumerate Numbers in `convert_torch_code_to_ray_air` (#35224)

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

* [Data] Clarify `map` slow warning (#35204)

* [Release tests] Moving Ray Data bulk ingest test ownership team to Data (#35238)

* [core] Reduce self alive check from 60s to 5s. (#34992)

This PR reduces the liveness check from 60s to 5s. It also fixes a bug in the old code: if the server is restarted locally, the current raylet is incorrectly marked as unhealthy because the set is not a multiset.

One example:

- The head node restarts in place.
- When it starts, it marks the old raylet as dead.
- In the liveness check endpoint, the raylet is then marked as dead because both share the same ip+port.
- The head node then exits.

Node id should be used for this check in the future for simplicity, but correctness is fine, given that no two raylets can start at the same address (ip+port).
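One way to read the set-versus-multiset problem is sketched below. This is a simplified model with made-up event names and a made-up address, not the actual GCS data structure: two raylets register under the same ip+port, and then the old one is marked dead.

```python
from collections import Counter


def alive_with_set(events):
    """Track live addresses in a plain set (the problematic variant)."""
    alive = set()
    for kind, address in events:
        if kind == "register":
            alive.add(address)
        else:  # "dead"
            alive.discard(address)
    return alive


def alive_with_multiset(events):
    """Track live addresses with a count per address."""
    alive = Counter()
    for kind, address in events:
        if kind == "register":
            alive[address] += 1
        elif alive[address] > 0:
            alive[address] -= 1
    return {address for address, count in alive.items() if count > 0}


ADDRESS = ("10.0.0.1", 6379)  # made-up ip+port shared by old and new raylet
events = [
    ("register", ADDRESS),  # old raylet
    ("register", ADDRESS),  # new raylet after an in-place restart
    ("dead", ADDRESS),      # old raylet is marked dead
]
```

With the plain set, the address disappears even though the new raylet is still running; with counts, one live entry remains.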

* [core] Turn on ray syncer again. (#35116)

Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>

After fixing several issues:

 - https://github.com/ray-project/ray/pull/34645
 - https://github.com/ray-project/ray/pull/35115
 - https://github.com/ray-project/ray/pull/34687

Ray syncer should be ready to be turned on again.

* [Data][Docs] Fix `hf_quick_start.py` (#35240)

`hf_quick_start.py` was failing with

> ValueError: ArrowVariableShapedTensorArray only supports heterogeneous-shaped tensor collections, not arbitrarily nested ragged tensors. Got arrays: [('dtype=object', 'shape=(1,)'), ('dtype=object', 'shape=(1,)')]

This is because we're returning an object that looks like

```python
{"output": 
    [[{'generated_text': 'Complete this page to stay up to date with our latest news in aviation related news. You can also'}], 
     [{'generated_text': "for me. We could use those resources as time goes on. We'll get to it in the"}]]
}
```

from a UDF.

This PR updates the UDF so it returns object like

```python
{"output": [
    'Complete this page to stay up to date with our latest news in aviation related news. You can also', 
    "for me. We could use those resources as time goes on. We'll get to it in the"
]}
```
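The reshaping from the nested structure to the flat one can be sketched like this. `flatten_generated_text` is a hypothetical helper name and the sample strings are shortened; the actual UDF in `hf_quick_start.py` may be written differently.

```python
def flatten_generated_text(pipeline_output):
    """Turn [[{'generated_text': s}], ...] into [s, ...].

    Assumes each inner list holds exactly one generated sequence, as in the
    output shown above.
    """
    return [sequences[0]["generated_text"] for sequences in pipeline_output]


nested = [
    [{"generated_text": "Complete this page"}],
    [{"generated_text": "for me."}],
]
batch = {"output": flatten_generated_text(nested)}
```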

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>

* [core] Deflakey test advanced 9 (#35247)

Previously, a bug where pubsub caused a leak was fixed in #33311. However, the fix has a race condition that was triggered later by other code changes.

The test is flaky because there is a race between the raylet sending the node failure and the core worker exiting.

When disconnect is sent to the raylet, the raylet starts to report the worker failure, but the worker continues to run.

GCS uses the worker failure to close the connection. If the worker is still alive, it might send another request to the GCS, which leads to the FD leak.

Compared with #34883, this is a short-term fix; the goal is to make the behavior the same as in 2.3.

* [ci] Fix dask Ray client tests (#35233)

The Ray client tests for dask are broken in master:

```
ModuleNotFoundError: No module named 'dask'
```

We didn't change any logic in our CI, but it seems we never explicitly installed dask in the respective job. Maybe it was previously automatically installed by some subdependency.

This PR adds the data processing requirements to the job, thus explicitly installing dask.

Signed-off-by: Kai Fricke <kai@anyscale.com>

* [docs] auto-remove gen apis on make clean (#35210)

if you don't regularly clean API docs generated by autodoc/summary, your build output will be spammed with warnings about outdated/non-existing APIs. we make it so that

```
make clean && make develop
```

truly builds from scratch to avoid this issue.

Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>

* [RLlib] APPO+new-stack (Atari benchmark) - Preparatory PR 04 - LearnerAPI changes/tf-tracing fixes. (#34959)

* [ci] Use python 3.9 in WORKSPACE (#35255)

It seems that the Python 3.8 toolchain breaks the ARM build.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

* [Client] Optimize chunk size (#35025)

* Optimize chunk size

The time it takes to serialize a protobuf object is not linear by its size.

Signed-off-by: ZhengYu, Xu <zen-xu@outlook.com>

* resize large objects

---------

Signed-off-by: ZhengYu, Xu <zen-xu@outlook.com>
Co-authored-by: Chris Wong <cwong@anyscale.com>

* Run bisect with the correct python version (#35186)

Currently because we do not specify the python version, bisect defaults to 3.7. Some tests want to run with a specific python version, so read the python version from the test configuration for those cases.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [ci/bazel][2] bazelize all other ray_release tests (#35032)

Bazelize all other ray-release tests

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* Update README.rst (#35267)

Fixed the URL to point to the correct location

Signed-off-by: Justin Coffi <jcoffi+github@gmail.com>

* Bring back "[Core] Port GcsPublisher to Cython" (#34393) (#35179)

I spent quite a bit of time debugging the test failure in #34393 (see also #35108)

It turns out the PR slightly made the _do_importing race condition (first time call in the import thread) more likely to happen. There is already a plan / PR to get rid of it (#30895) but it is currently waiting for having a replacement mechanism that @rkooo567 is working on.

I synced with @scv119 and for the time being, we are planning to skip the offending test on Windows and once we got rid of the import thread, we can re-activate it.

* [serve] Add controller metadata (#35182)

* Return node id, node ip, actor id, actor name, worker id, log file path for controller. (Field name `controller_info` up for discussion)
Example:
```
    "controller_info": {
        "node_id": "a2ee49da74f69cb177cfca907354ea7cf669a015b4af1e0e9224500a",
        "node_ip": "192.168.0.141",
        "actor_id": "539ac33eadf7ead5375d741c01000000",
        "actor_name": "SERVE_CONTROLLER_ACTOR",
        "worker_id": "766d2f49edcfe39b422fb7c237a2084a618302dd51e9904795d7492b",
        "log_file_path": "/serve/controller_5629.log"
    },
```
* Add worker id to http proxy and replica details
* Update http proxy to use `get_component_logger_file_path()`

* [serve] Stream Serve logs across different drivers (#35070)

Add back `_filter_logs_by_job` to worker.py, and use it to disable filtering of streamed logs in `print_logs`.
This existed in worker.py before, but was removed at some point.

* [Overview][Serve] Add Recent Serve Applications Card #34642 (#35227)

Follow up with #34642

Fixed ESLint errors and test cases

* [Serve] Add status_code to http qps & latency (#35134)

Add status_code for http qps & latency stats.
This resolves the "double counting" issue caused by redirect requests.

Related issue number
Close #33686

* [train] Fix HuggingFace -> Transformers wrapping logic (#35276)

Properly pass constructor arguments through.

Signed-off-by: Matthew Deng <matt@anyscale.com>

* [core][dashboard][state] Support task logs from state API (#35101)

This PR adds support for retrieving task logs from state API with task ID directly.

At a high level:

- On task execution: it writes a magic token containing the task id and attempt number to the worker log file before and after the task runs.
- On query: it reconstructs the worker file name by querying the task backend for the worker id of a task attempt, then looks for the magic tokens in the worker file to find the beginning and end of the task's log.

The end token is used to delimit the task log in the async actor case: if async actors produce interleaved logs, one task's log will contain all logs from that task plus interleaved logs from other tasks. (We could probably return some sort of warning to let users know this.)
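The begin/end token scheme can be sketched as follows. The token format (`:task_begin:<task id>:<attempt>`) is invented for this illustration and is not the format Ray writes; the log is modelled as a list of lines instead of a worker file.

```python
def write_task_log(lines, task_id, attempt, task_output):
    """Append a task's output wrapped in begin/end marker lines."""
    lines.append(f":task_begin:{task_id}:{attempt}")
    lines.extend(task_output)
    lines.append(f":task_end:{task_id}:{attempt}")


def read_task_log(lines, task_id, attempt):
    """Return the lines between the begin and end markers of one task attempt."""
    begin = f":task_begin:{task_id}:{attempt}"
    end = f":task_end:{task_id}:{attempt}"
    collected, inside = [], False
    for line in lines:
        if line == begin:
            inside = True
        elif line == end:
            break
        elif inside:
            collected.append(line)
    return collected


worker_log = []
write_task_log(worker_log, "t1", 0, ["loading", "done"])
write_task_log(worker_log, "t2", 0, ["hello"])
```

If two tasks interleave their output, everything between one task's markers is returned, including the other task's lines, which matches the caveat about async actors above.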

* [core][dashboard] Task backend GC policy - worker update [1/3] (#34896)

This is a series of PRs that improves the GC policy for the task backend. The overall goal of the stack is to make data loss happen at task-attempt granularity: if a task attempt incurs some data loss (due to the limit on the number of task events enforced at the worker/GCS), all status changes for that task attempt are dropped, so there are no partial task attempts.
Right now, individual events (e.g. task started running) could be lost for a task attempt, which isn't great for observability.

This PR adds task-attempt-level data loss tracking on the worker side, by tracking:

- per-job profile events dropped for the timeline
- task attempts dropped

The worker sends this data loss info to the GCS.

In the subsequent PRs:

- The GCS side will be updated.
- The dashboard front-end will be updated to reflect the per-job profile event loss.

* Use state-api for job driver logs (#35235)

Update Actors page with new IA look and feel
Fix actors page IA for job actors

Follow-up PR: Use state-api for actor logs
Follow-up PR 2: Use state-api for serve replica and controller logs.

Actor list page:

* Add docs for setting up metrics for homebrew installations (#35026)

fixes #35121

* Add HTTPProxy details to Serve Dashboard UI (#35159)

Adds details about HTTPProxy to the Serve UI

* [AIR] Remove hard-deprecated and unused code (#35163)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

* [Doc] [no_early_kickoff] Revamp ray core api reference [1/n] (#34428)

We have a coding style for function docstrings but not for class docstrings. This PR proposes one that follows the same Google style guide and makes sure it works well with the autogenerated class pages.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [docs][observability] O11y refactor 1/N (#35158)

* [train] Fix HuggingFace -> Transformers wrapping logic 2 (#35284)

Signed-off-by: Matthew Deng <matt@anyscale.com>

* [Serve] Add route tags with custom metrics (#35246)

- Add route value automatically with custom metrics 
- Fix some metrics tests.

* [Serve] Add more bucket size (#35242)

Increase the bucket size. 
- Still providing current granularity with 0 - 100 ms latency.
- Providing more buckets for 100ms - 1000ms latency precision out of the box.
- Increase the bucket range to handle heavier use case.

**Note: this doubles the number of stats points for latency, which should be trivial compared with host network bandwidth.**

* [release test] [Cluster launcher] Add gcp minimal and full cluster launcher release test (#34878)

Adds a nightly release test for the example-minimal.yaml and example-full files in the cluster launcher docs for GCP.

Adds optional no-config-cache argument to test script for debugging purposes

* [docs] [data] Update use case doc links and resources (#35277)

* [Release test] Disabling empty-runtime-env tests in benchmark_worker_startup.aws #35232

This test uses some hackery to get a "default" (empty) runtime environment on Anyscale. This allows us to measure startup performance for non-Anyscale environments. We validate that the numbers are correct by asserting empty runtime env for these measurements. Since our infra team recently added cgroup to the default runtime env, the assertion now fails.

We can fix this but not a priority right now -- this PR disables the empty-runtime-env tests in this release test, so we still have metrics for the normal path.

Closes #35183

* [docs] nav fixes #34583 (#35296)

Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>

* [AIR] Distributed checkpointing (#34709)

Signed-off-by: Jun Gong <jungong@anyscale.com>

* [RLlib] RLlib contrib (#35141)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [telemetry] Add libc version to ray telemetry. (#33444)

This PR added GLIBC version to Ray telemetry. Linux distribution is not added because it needs distro pkg.

* [data] Capture the context when the dataset is first created (#35239)

* [core] Make execute_after accept chrono (#35099)

The current execute_after is verbose, and the unit of the wait time is not stated clearly in the signature. This PR fixes this by passing a chrono duration.

This PR doesn't update the existing code base to use it, but keeps it backward compatible.

* [no_early_kickoff] [data] Improve our handling of tensor returns in strict mode (#35272)

* [Serve] Add multiplex support (#34941)

Introduce the multiplex API:
- `@serve.multiplexed(num_models_per_replica=0)`
- `@serve.get_model_id`

Router, controller, and deployment_state changes will come in a follow-up PR.

Usage:

* Add support for multi-tab log viewer (#35280)

Adds log views to Actor detail page, serve replica page, http proxy page.
Adds links to job detail pages for driverless jobs
Multi-tab log viewer is expandable
Fixes side tab layouts for Job -> Actors page and Cluster -> node page.

* Fix "ImportError: sys.meta_path is None, Python is likely shutting down" (#35304)

Signed-off-by: Hao Chen <chenh1024@gmail.com>

* [data] Add GPU data ingestion nightly test (#34986)

Signed-off-by: Hao Chen <chenh1024@gmail.com>

* [core][state][dashboard][log] Fix subdirectory log getting (#35283)

The state API was not able to get logs from a subdirectory. This PR fixes that.

* [core][state][no_early_kickoff] Add "humanify" feature to StateSchema (#35059)

This PR introduces a framework to address this issue: #31876.

Essentially, we add a humanify() method to the base StateSchema class. Any subclass provides a relevant format_fn as a metadata argument to any of its fields, and the humanify() method aggregates the output from these lambdas.

This PR is meant to introduce the general framework, and any additions (new format_fn) can be added by request.
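
The idea can be sketched with plain dataclasses. This is a minimal illustration under stated assumptions, not the code from this PR: `TaskStateSketch`, its fields, and the standalone `humanify` helper are hypothetical names, and only the `format_fn` metadata key is taken from the description above.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict


def humanify(state: Any) -> Dict[str, Any]:
    """Apply each field's optional ``format_fn`` metadata to its value."""
    result = {}
    for f in fields(state):
        value = getattr(state, f.name)
        format_fn = f.metadata.get("format_fn")
        # Fields without a format_fn are passed through unchanged.
        result[f.name] = format_fn(value) if format_fn else value
    return result


@dataclass
class TaskStateSketch:  # hypothetical schema, for illustration only
    name: str
    duration_ms: int = field(
        metadata={"format_fn": lambda ms: f"{ms / 1000:.1f}s"}
    )
    memory_bytes: int = field(
        metadata={"format_fn": lambda b: f"{b / 1024 ** 2:.2f} MiB"}
    )


readable = humanify(TaskStateSketch("train", 2500, 3 * 1024 ** 2))
```

Keeping the formatter next to the field definition means a new human-readable format only touches the field that needs it.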

* [docs] clarify FAST build option, fixes #35293 (#35297)

nit changes (sub-project --> subproject)

Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>

* [AIR] Deprecate `ray.tune.logger.Logger` interface (#35162)

This PR deprecates:
1. Soft-deprecated, for removal in 2.7.
    - `ray.tune.logger.Logger`, in favor of `ray.tune.logger.LoggerCallback`
         - Also, deprecated any built-in `Logger` subclasses, including `CSVLogger`, `JsonLogger`, `TBXLogger`, etc.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

* [Core/Logging] Worker startup hook (#34738)

This PR adds the basic worker setup hook API to the runtime env, according to the design: https://docs.google.com/document/d/1ngiuAZAMnl9c4LozoTpWh37KPviDRIpmEjEI6BsNL7w/edit

This PR also exposes the exit API to Python so that we can easily fail the worker with an exception of our choice.

This is the first PR to support this feature. It allows users to add a setup method using the runtime env. Two follow-up PRs will:

- Merge the runtime env when both the job and the driver specify a runtime env.
- Allow specifying a setup hook for an individual task or actor.

* [serve] Log to files in JSON format by default (#35118)

Update the logging format to JSON, for better parsing and log search.

- User can set `SERVE_JSONIFY_LOG_MESSAGE` to jsonify the log message.
- Stream logs are not affected by this change.


controller log file

```
{"levelname": "INFO", "asctime": "2023-05-07 17:22:58,360", "component_name": "controller", "component_id": "3525674", "message": "http_state.py:129 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-9e1688af72409c6ffaf805b05c397632cfb4eb7acf0703468e8e3535' on node '9e1688af72409c6ffaf805b05c397632cfb4eb7acf0703468e8e3535' listening on '127.0.0.1:8000'"}
{"levelname": "INFO", "asctime": "2023-05-07 17:22:59,252", "component_name": "controller", "component_id": "3525674", "message": "deployment_state.py:1220 - Deploying new version of deployment app_testv2."}
{"levelname": "INFO", "asctime": "2023-05-07 17:22:59,282", "component_name": "controller", "component_id": "3525674", "message": "deployment_state.py:1459 - Adding 2 replicas to deployment app_testv2."}
{"levelname": "INFO", "asctime": "2023-05-07 17:22:59,283", "component_name": "controller", "component_id": "3525674", "message": "deployment_state.py:330 - Starting replica app_testv2#QZdGDm for deployment app_testv2."}
{"levelname": "INFO", "asctime": "2023-05-07 17:22:59,297", "component_name": "controller", "component_id": "3525674", "message": "deployment_state.py:330 - Starting replica app_testv2#jorUqs for deployment app_testv2."}
{"levelname": "INFO", "asctime": "2023-05-07 17:23:00,217", "component_name": "controller", "component_id": "3525674", "message": "deployment_state.py:1615 - Replica app_testv2#QZdGDm started successfully."}
{"levelname": "INFO", "asctime": "2023-05-07 17:23:00,217", "component_name": "controller", "component_id": "3525674", "message": "deployment_state.py:1615 - Replica app_testv2#jorUqs started successfully."}
{"levelname": "INFO", "asctime": "2023-05-07 17:23:00,270", "component_name": "controller", "component_id": "3525674", "message": "deployment_state.py:1220 - Deploying new version of deployment app2_testv2."}
{"levelname": "INFO", "asctime": "2023-05-07 17:23:00,319", "component_name": "controller", "component_id": "3525674", "message": "deployment_state.py:1459 - Adding 2 replicas to deployment app2_testv2."}
{"levelname": "INFO", "asctime": "2023-05-07 17:23:00,319", "component_name": "controller", "component_id": "3525674", "message": "deployment_state.py:330 - Starting replica app2_testv2#lIYdIP for deployment app2_testv2."}
{"levelname": "INFO", "asctime": "2023-05-07 17:23:00,332", "component_name": "controller", "component_id": "3525674", "message": "deployment_state.py:330 - Starting replica app2_testv2#FxSRtl for deployment app2_testv2."}
{"levelname": "INFO", "asctime": "2023-05-07 17:23:01,254", "component_name": "controller", "component_id": "3525674", "message": "deployment_state.py:1615 - Replica app2_testv2#lIYdIP started successfully."}
{"levelname": "INFO", "asctime": "2023-05-07 17:23:01,255", "component_name": "controller", "component_id": "3525674", "message": "deployment_state.py:1615 - Replica app2_testv2#FxSRtl started successfully."}
```

http proxy

```
{"levelname": "INFO", "asctime": "2023-05-07 17:22:59,242", "component_name": "http_proxy", "component_id": "b'172.31.5.229'", "message": "http_proxy.py:185 - Got updated endpoints: {}."}
{"levelname": "INFO", "asctime": "2023-05-07 17:22:59,253", "component_name": "http_proxy", "component_id": "b'172.31.5.229'", "message": "http_proxy.py:185 - Got updated endpoints: {'app_testv2': EndpointInfo(route='/app1', app_name='app')}."}
{"levelname": "INFO", "asctime": "2023-05-07 17:23:00,272", "component_name": "http_proxy", "component_id": "b'172.31.5.229'", "message": "http_proxy.py:185 - Got updated endpoints: {'app_testv2': EndpointInfo(route='/app1', app_name='app'), 'app2_testv2': EndpointInfo(route='/app2', app_name='app2')}."}
{"levelname": "INFO", "asctime": "2023-05-07 17:23:06,895", "component_name": "http_proxy", "component_id": "b'172.31.5.229'", "request_id": "rzqadzRWFP", "route": "/app1", "app_name": "app", "message": "http_proxy.py:435 - GET 200 4.8ms"}
{"levelname": "INFO", "asctime": "2023-05-07 17:23:08,168", "component_name": "http_proxy", "component_id": "b'172.31.5.229'", "request_id": "hYjyoiHTPJ", "route": "/app2", "app_name": "app2", "message": "http_proxy.py:435 - GET 200 4.6ms"}
{"levelname": "INFO", "asctime": "2023-05-07 17:23:32,596", "component_name": "http_proxy", "component_id": "b'172.31.5.229'", "message": "http_proxy.py:185 - Got updated endpoints: {'app_testv2': EndpointInfo(route='/app1', app_name='app'), 'app2_testv2': EndpointInfo(route='/app2', app_name='app2')}."}
{"levelname": "INFO", "asctime": "2023-05-07 17:24:02,716", "component_name": "http_proxy", "component_id": "b'172.31.5.229'", "message": "http_proxy.py:185 - Got updated endpoints: {'app_testv2': EndpointInfo(route='/app1', app_name='app'), 'app2_testv2': EndpointInfo(route='/app2', app_name='app2')}."}
{"levelname": "INFO", "asctime": "2023-05-07 17:24:35,044", "component_name": "http_proxy", "component_id": "b'172.31.5.229'", "message": "http_proxy.py:185 - Got updated endpoints: {'app_testv2': EndpointInfo(route='/app1', app_name='app'), 'app2_testv2': EndpointInfo(route='/app2', app_name='app2')}."}
```

deployment:

```
{"levelname": "INFO", "asctime": "2023-05-12 17:38:00,698", "deployment": "app2_Model", "replica": "app2_Model#XlFWYc", "request_id": "OfbLbdKgjT", "route": "/class_method", "application": "app2", "message": "replica.py:440 - Started executing request OfbLbdKgjT"}
{"levelname": "INFO", "asctime": "2023-05-12 17:38:00,699", "deployment": "app2_Model", "replica": "app2_Model#XlFWYc", "request_id": "OfbLbdKgjT", "route": "/class_method", "application": "app2", "message": "test_logging.py:200 - user log message from class method"}
{"levelname": "INFO", "asctime": "2023-05-12 17:38:00,699", "deployment": "app2_Model", "replica": "app2_Model#XlFWYc", "request_id": "OfbLbdKgjT", "route": "/class_method", "application": "app2", "message": "replica.py:537 - __CALL__ OK 0.6ms"}
```
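
For readers who want to see how log lines of this shape can be produced, here is a minimal sketch built on the standard `logging` module. It is not Serve's actual implementation: `JsonFormatterSketch` is a hypothetical name, and only the `levelname`, `asctime`, and `message` keys are copied from the samples above.

```python
import json
import logging


class JsonFormatterSketch(logging.Formatter):
    """Hypothetical formatter: emits one JSON object per log record."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "levelname": record.levelname,
            "asctime": self.formatTime(record),
            # Mirror the "file:line - message" shape seen in the samples.
            "message": f"{record.filename}:{record.lineno} - {record.getMessage()}",
        }
        return json.dumps(payload)


# Build a record by hand so the example does not depend on handler setup.
record = logging.LogRecord(
    name="sketch",
    level=logging.INFO,
    pathname="/tmp/replica.py",
    lineno=440,
    msg="Started executing request %s",
    args=("abc",),
    exc_info=None,
)
line = JsonFormatterSketch().format(record)
parsed = json.loads(line)
```

Because every line is a standalone JSON object, a log search tool can filter on keys instead of parsing free text.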

* [autoscaler v2][4/n] introducing node-provider and node-provider-config (#34983)

Why are these changes needed?
This PR is part of the stack of PRs that introduce a new node_provider for autoscaler v2.

Stack of PRs
#34976
#34977
#34979
#34983 <- this PR
#34985

This PR introduces the node provider that the instance manager can allocate instances from. Implementation-wise, it's a wrapper around the v1 node provider, node launcher, and node updater.

* [docs] fix map_batches ActorPoolStrategy ref (#35331)

Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>

* [ci/github] Track external code changes (blogs, tutorials) (#35261)

This adds a GitHub workflow to track changes to code we're using in external sources (e.g. blog posts or other repositories).

If a change to a tracked file is detected in a PR, a comment is added calling out the changed file and the URI of the external resource where it's being used. Also, the label `external-code-affected` is added.

In subsequent pushes to the PR, if the changes to tracked files change, the comment is updated.

This will enable us to update external sources better. It is very easy to miss subtle changes, especially when a lot of files are changed at the same time. As a result, some of our external blog posts or tutorials are outdated and use stale APIs.

With a comment and label, we can proactively update external sources or filter for them after a new ray release and update in batch.

Example PR + bot interaction: https://github.com/ray-project/ray/pull/35263

Signed-off-by: Kai Fricke <kai@anyscale.com>

* [Data] Improve compute validation error (#35234)

We repeat the validation code six times. I've abstracted the validation into a function to avoid code duplication.

I've also fully qualified ActorPoolStrategy in the error message, so users don't need to search how to import it.

---------

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>

* [Data] Fix inference release test (#35339)

0785e97 broke the inference release test. Since the release test is two years old, I've decided to rewrite the test altogether.

---------

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>

* [Data] Improve `Schema` representation (#35278)

The current representation doesn't make it clear that the keys represent column names. This can be especially confusing when your dataset contains one column (e.g., Schema({'text': DataType(string)}))

---------

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>

* During GCS restarts, grpc based resource broadcaster should only add ALIVE nodes during initialization (#35349)

During GCS restarts, grpc based resource broadcaster should only add ALIVE nodes during initialization. Otherwise it will keep broadcasting messages to dead nodes after restart.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [Data] Improve docstring and warning message for `from_huggingface` (#35206)

Corrects the return type hint, adds a docstring example, and logs a warning message for from_huggingface, in response to confusion reported in user feedback.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* [Data] Add `column` API to Dataset (#35241)

Adds a columns API to Dataset so that users can see the columns of the Dataset.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* [Doc] Make doc code snippet testable [2/n] (#35274)

Change code snippet from ..code-block:: to ..testcode::

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [RLlib] Remove some specs from encoders to smoothen dev experience (#34911)

Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>

* [Train] Don't repartition if xgboost-ray>=0.1.16 (#32960)

* [Train] Don't repartition if xgboost-ray>=0.1.14

Repartitioning is not necessary anymore with xgboost-ray>=0.1.16.

---------

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>

* [Serve] Multiplex API Impl (#35326)

Adds @serve.multiplexed and @serve.get_multiplexed_model_id implementation.

* [UI] Unify colors of different status for Jobs, Services, Actors (#35138)

Unify the colors of different status for Ray entities(Jobs, Actors, Serve....)
Change icon for Recent Job with status STOPPED
Remove icon for job status in job detail page and job list page
In this pull request, we remove the job status icon from the detail page because it is not compatible with the background color (for example, the PENDING status has an orange background and a blue loading icon). We keep the blue loading icon only for the RUNNING status.

cc @alanwguo

* [core] Delete disconnected node view in ray syncer when connection is broken. (#35312)

The current ray syncer doesn't take care of disconnection very well. If one raylet is disconnected due to some reason, the ray syncer won't clear its local view and will send the dead node info to any newly joined node.

This won't introduce any correctness bugs because the raylet will just reject the offending message.
It also won't have much performance impact, since only newly added nodes receive these messages.

This PR cleans up the local view table when a node is disconnected. The node gets a new snapshot when it rejoins if the disconnection was due to the network.

* [doc] [data] Update dataset intro page and fix some typos (#35361)

* [data] Fix bugs in handling of nested ndarrays (and other complex object types) (#35359)

There were a couple of bugs in our handling of complex ndarrays:

- We weren't consistently falling back to PandasBlock for object dtypes. This was due to raising different exception types, some of which were not caught at the upper layer. This PR simplifies our exception handling path, removing legacy code.
- We weren't calling create_ragged_ndarray for certain return types due to a bug in the shape mismatch detection code.

* [Train] LightningTrainer: Enable prog bar (#35350)

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

* Add "all" option for SessionName (#35303)

This opens as default in the Grafana page but for the Dashboard UI, the default is still scoped to the latest session.

* [Dashboard] Add task detail page with logs (#35328)

Add task detail page with logs
Update actor logs to use state-api actor_id filter instead of filename to fetch logs
Hide log tabs if there is only 1 tab available.
Update the driver logs in the jobs page to show explanation why logs are not available.

* [core][state][ci] Fix stress_test_state_api_scale (#35332)

We changed the logging content to include a per-task magic token. This PR updates the test logic accordingly.

* [core][state][job] Supporting job listing(getting) and logs from state API (#35124)

This PR adds better support for listing jobs (ray list jobs) and getting submission job logs (ray logs job --id)

The state API client mirrors the implementation of job endpoint and depends on job's implementations for retrieving job related info.

* [tests] fix lint and dependency issues in tests (#35373)

Signed-off-by: Matthew Deng <matt@anyscale.com>

* [ci] External code tracker: Ignore if file is not found (#35376)

When master is not merged, a file not found error can come up. This update to the script will prevent that from happening by catching errors and setting default values for the variables.

Signed-off-by: Kai Fricke <kai@anyscale.com>

* [Dashboard] Add serve controller info to the Serve page  (#35327)

* [core][dashboard] Make actor tasks'name default to <actor_repr>.<task_name>  (#35371)

We updated the dashboard to show actor's name with actor repr, and we should do the same for actor tasks.

Caveat:

Actor.__init__ currently does not show up as repr_name.__init__ because, during creation task initialization, the actor states have not been initialized yet, so the repr func should not be called yet. There is a workaround (we could modify the init task name later at rendering, but chose not to implement that in this PR), but it would be rather hacky. The major issue is that repr info for an actor is only available on the executor, not on the submitter (nor encoded in the actor handle). Given that the feature is only for the dashboard, I feel the less intrusive change in this PR is better.

* [tune/execution] 1/n Add more unittests for TuneController (#34833)

For full test coverage, we should migrate all tests in `test_ray_trial_executor` and `test_trial_runner_*` to use the new execution backend as well.

This is a WIP PR to migrate the first batch of these unittests.

Signed-off-by: Kai Fricke <kai@anyscale.com>

* [air/output] Context-aware output engine: Add docs, experimental feature docs, prepare default on (#35129)

This PR prepares to enable the new output engine by default. When activated, a hint is displayed explaining how to disable the new output engine.

This PR also adds documentation around experimental features in Ray AIR.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>

* [ci][byod/1] clean up local environment setup for release tests (#35355)

All release tests are using remote execution via Anyscale at this point. Remove the code path for local environment setup. In particular, the driver_setup is used to install packages on the buildkite host, which is no longer necessary or used.

Signed-off-by: Cuong Nguyen <can@anyscale.com>

* [AIR][Telemetry] Cluster storage configuration (#34905)

This PR adds telemetry for the storage / syncing configuration by adding 1 usage tag: `AIR_STORAGE_CONFIGURATION`

The storage configuration is set by `RunConfig(storage_path, sync_config)`. The possible configurations are:
- 'driver' = Default head node syncing if no remote path is specified
- 'local' = No synchronization at all.
- 'nfs' = Using a mounted shared network filesystem.
- ('s3', 'gs', 'hdfs', 'custom_remote_storage'): Various remote storage schemes.
- ('local_uri', 'memory'): Mostly used by internal testing by setting `storage_path`
    to `file://` or `memory://`.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

* Remove previously added debug logs (#35360)

Revert some debug logs added in #35062

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [serve] Catch all exceptions during deploy (#35307)

Be extra careful with fetching task references in the controller update loop.
- The `deploy_obj_ref` task reference should only be fetched once (for both application state and http state). 
- Also, `RayTaskError` and `RuntimeEnvSetupError` may not be the only two types of possible exceptions thrown by `ray.get()` (e.g. saw a case of `RaySystemError`) so we should catch all exceptions.

* [AIR][Telemetry] Experiment tracking integrations + callbacks  (#34904)

This PR adds telemetry for built-in experiment tracking integrations by adding 3 usage tags:
1. `AIR_SETUP_WANDB_INTEGRATION_USED` ("1" if used)
2. `AIR_SETUP_MLFLOW_INTEGRATION_USED` ("1" if used)
3. `AIR_CALLBACKS` (a JSON string representing a dict of callback name -> count)
    - The key `CustomCallback` gets a tally if the user passed in just a subclass of `Callback`
    - The key `CustomLoggerCallback` gets a tally if the user passed in a custom `LoggerCallback` (not including above)


The need for 1 and 2 is because wandb and mlflow allow the `setup_x` path, where the user calls this in their training function and logs whatever they want themselves.

These 3 can be used together to extract the total wandb/mlflow integration usage. (Ex: `setup_wandb` usage + `WandbLoggerCallback` usage. There may be some overlap, as it's technically possible to use both.)
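
A rough sketch of how such a tally could be built and serialized is shown below. It is an assumption-laden illustration, not the code from this PR: the classes are local stand-ins for the real callback classes, and `BUILT_IN` and `tally_callbacks` are hypothetical names. Only the `CustomCallback` and `CustomLoggerCallback` keys come from the description above.

```python
import json
from collections import Counter


# Local stand-ins for the real callback classes (illustration only).
class Callback:
    pass


class LoggerCallback(Callback):
    pass


class WandbLoggerCallback(LoggerCallback):  # treated as built-in here
    pass


class MyLogger(LoggerCallback):  # user-defined logger callback
    pass


class MyCallback(Callback):  # user-defined plain callback
    pass


BUILT_IN = {"WandbLoggerCallback"}  # assumed set of built-in names


def tally_callbacks(callbacks) -> str:
    """Return a JSON string mapping callback name -> count."""
    counts = Counter()
    for cb in callbacks:
        name = type(cb).__name__
        if name in BUILT_IN:
            counts[name] += 1
        elif isinstance(cb, LoggerCallback):
            counts["CustomLoggerCallback"] += 1
        else:
            counts["CustomCallback"] += 1
    return json.dumps(counts, sort_keys=True)


tag_value = tally_callbacks(
    [WandbLoggerCallback(), MyLogger(), MyCallback(), MyCallback()]
)
```

Collapsing user-defined classes into two generic keys keeps user class names out of the telemetry payload.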

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

* [Train] Change `num_boost_round` to target iterations (#33602)

Previously, we have been following the xgboost/lightgbm conventions of fitting for num_boost_round, regardless of how many iterations the model already has been fitted on. This, however, causes issues when resuming from checkpoints, especially during training/tuning, as you may end up with more trees than desired:

1. Trial has num_boost_round=100
2. Trial fits for 50 and dies
3. Trial is restored from checkpoint, model starts with 50 iterations already complete
4. Because num_boost_round=100, model is fitted for 100 iterations, giving a total of 150 iterations instead of the desired 100

Now, we will subtract the already completed iterations when resuming.

num_boost_round was already a part of **train_kwargs, we just promote it here for docstring purposes.
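
The arithmetic behind the new behavior is small enough to sketch. `remaining_boost_rounds` is a hypothetical helper name, and clamping at zero is an assumption about how an over-trained checkpoint would be handled.

```python
def remaining_boost_rounds(target_rounds: int, completed_rounds: int) -> int:
    """Rounds still to fit so the total does not exceed the target."""
    # Assumption: never return a negative number of rounds.
    return max(0, target_rounds - completed_rounds)


fresh = remaining_boost_rounds(100, 0)     # new trial: fit all 100 rounds
resumed = remaining_boost_rounds(100, 50)  # restored at 50: fit 50 more
```

With this rule, the trial from the example above ends with 100 iterations in total instead of 150.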

---------

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>

* [serve][docs] Add user guide for application builders  (#35392)

Docs follow-up for: https://github.com/ray-project/ray/pull/34584

* [air/output] Add parameter columns to status table (#35388)


Signed-off-by: Kai Fricke <kai@anyscale.com>

* Revert "Add "all" option for SessionName (#35303)" (#35403)

This reverts commit 496024df480c74e83e31363155341605aa28b0bf.

* [Serve] Mutliplexed information report impl (#35372)

- Multiplexed metrics.
- Pass multiplexed information into controller and make it available at `RunningReplicaInfo`.

* [core][state] Move state API out of experimental  (#35318)

This is part of effort to make state API no longer experimental:

- Move everything under ray/experimental/state into ray/util/state.
- Declare the state API's Python SDK to be DeveloperAPI and its CLI commands to be Stable.
- Change all imports from ray.experimental.state.api to ray.util.state:
  `from ray.util.state import list_tasks # works`
  ...
- Forward imports from ray.experimental.state to ray.util.state so that existing user code keeps working:
  `from ray.experimental.state.api import list_tasks # old way works, with warning`
- Add a warning and telemetry for the deprecated import path (ray.experimental.state).

* [data] Improve map batches error message for strict mode migration (#35368)

* [RLlib] Add missing `sampler_results` key to fetch min desired reward in RLlib release tests (#35354)

Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>

* [core] Change log error to log info when node disconnect and got detected by ray syncer. (#35415)

An error log is pushed to the driver, which is annoying because this is not actually an error and is handled by Ray internally.
This PR changes it to an info log for usability.

* Add a disconnect button to the context widgets in notebooks (#34815)

This PR adds a nice button for disconnecting (i.e. calling ray.shutdown()) when ray.init() is called in a notebook.

I also was able to dedup a bit of code that was in RayContext and ClientContext classes, moving it up to the shared parent class of both. 🎉

* [core] Serialize requests in redis store client. (#35123)

## Why are these changes needed?

This PR lets the redis client resend failed requests to improve GCS availability. Right now, if any request fails, GCS crashes.

There are several ways to do this:

1. The failure is taken care of in the application layer.
2. Or retry is supported in the redis client layer.

The first one requires reviewing all the code and ensuring that all failures are covered. Due to the complexity of the system, it might be better suited as long-term work.

In the short term, a better way is to resolve it in the redis store client layer. This requires that no concurrent operations happen. For example, suppose there are two requests to update an actor's status:

1. Alive
2. Dead

If 1 fails, 2 succeeds, and 1 is later retried and succeeds, the final status will be Alive, which is wrong.

In this PR, the requests for a given key are queued and fired one by one. This shouldn't impact performance much, given that most of the time the queue size should be 0.

A follow-up PR will enable retries in the hiredis layer for failed requests.
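
The per-key queueing described above can be sketched as follows. The real change lives in the C++ redis store client; this Python version is only an illustration, and `PerKeySerializerSketch`, `submit`, and `on_done` are hypothetical names.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List, Set


class PerKeySerializerSketch:
    """Hypothetical helper: at most one in-flight request per key."""

    def __init__(self) -> None:
        self._pending: Dict[str, Deque[Callable[[], None]]] = defaultdict(deque)
        self._in_flight: Set[str] = set()

    def submit(self, key: str, request: Callable[[], None]) -> None:
        if key in self._in_flight:
            # Another request for this key is running: queue behind it.
            self._pending[key].append(request)
        else:
            self._in_flight.add(key)
            request()

    def on_done(self, key: str) -> None:
        """Call when the in-flight request for ``key`` has completed."""
        queue = self._pending[key]
        if queue:
            queue.popleft()()  # fire the next request for this key
        else:
            self._in_flight.discard(key)


log: List[str] = []
serializer = PerKeySerializerSketch()
serializer.submit("actor-1", lambda: log.append("ALIVE"))
serializer.submit("actor-1", lambda: log.append("DEAD"))   # queued behind ALIVE
serializer.submit("actor-2", lambda: log.append("other"))  # different key, fires now
after_submit = list(log)
serializer.on_done("actor-1")  # ALIVE completed, so DEAD fires next
```

Since the second update for a key only fires after the first one completes, retrying the first can no longer overwrite a newer value.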

## Related issue number
https://github.com/ray-project/ray/issues/34014

* [CI] Increase parallelism for Train tests to 4 (#35401)

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>

* [core] Sending ReportWorkerFailure after the process died. (#35320)

## Why are these changes needed?
This fix is not fixing from the root https://github.com/ray-project/ray/pull/35247

And in many_nodes_actor_tests_v2, the file descriptor error still shows. 

This fix tries to monitor the process's liveness in some way. It also introduces a new utility function that retries a failed function up to a certain number of times.

Some tests are disabled due to the race condition in detecting node failures which will be fixed later.

* [AIR] Deprecate modules in `ray.tune.integration` (#35160)

This PR deprecates the following:
1. Soft-deprecated, for removal in 2.7.
    - `ray.tune.integration.keras`, in favor of `ray.air.integrations.keras.ReportCheckpointCallback`.
2. Hard-deprecated, for removal in 2.6.
    - Passing wandb/mlflow API keys through config instead of the kwargs of `setup_wandb`/`setup_mlflow`. (Which was scheduled to be hard-deprecated in 2.4).
3. Removed (due to > 2 minor releases of hard-deprecation).
    -  `ray.tune.integration.comet` in favor of `ray.air.integrations.comet`

This PR also updates the API refs to reflect these changes. Namely, `ray-air/api/integrations` is now its own page instead of just linking to "Tune integrations".

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

* [RLlib] DreamerV3: RLModule class, tf.keras model components, dreamer-model, world-model, actor, critic. (#35381)

* [serve] Shutdown http proxy state (#35395)

Shutdown http proxy state so that it won't run anything in its update loop once a shutdown signal is received.

* [Development] Fix unbound `BUILDKITE` variable in `install-bazel.sh` (#35181)

The contribution docs instruct the user to run install-bazel.sh to install Bazel. However, this script errors with

install-bazel.sh: line 119: BUILDKITE: unbound variable
This PR fixes the error by providing a default value for the BUILDKITE variable when it's used in the script. The change ensures that if BUILDKITE is not set, it will be treated as an empty string, and the script will not throw an error.

* Revert "Add a disconnect button to the context widgets in notebooks (#34815)" (#35426)

This reverts commit f31d70e6be5db7432e1f82c87bde2c3b410f9552.

* Revert "[core] Sending ReportWorkerFailure after the process died. (#35320)" (#35447)

This reverts commit 9cd509723494eb861c77a989cfa880ec384d31c6.

* [Serve] Http proxy & router & handle to support multiplex impl (#35399)

- Support handle.options(multiplexed_model_id="")
- Http proxy to extract model id on the fly
- Choose correct replica based on information.
- nit: Move handle metrics pusher to router.

* [serve] Remove print statement + fix lint (#35439)

* [RLlib] Make CNN encoder test larger (#35374)

Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>

* [train] Restructure `ray.train` HuggingFace modules (#35270)

- Organize HuggingFace integrations together.
   - Additional HuggingFace integrations should have a logical place to be added.
- Make it simple for users to import and use integrations.
   - Imports should not be excessively long
   - Naming should be intuitive

Signed-off-by: Matthew Deng <matt@anyscale.com>

* [serve] Fix `app_builder` doc code test (#35456)

Missing import.

* [RLlib] Add torch compile capabilities to TorchRLModule (#34640)

Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>

* [Data/Train] Fix ipython representation (#35414)

Fixes a bug where repr would fail on ipython shell

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* Add "all" option for SessionName (#35408)

This opens as default in the Grafana page but for the Dashboard UI, the default is still scoped to the latest session.

* Update version in dask on ray guide for 2.5.0 release (#35458)

As part of the 2.5.0 release, the dask on ray version in the guide needs to be updated.

---------

Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>

* [Data] Fix `read_tfrecords_benchmark` (#35152)

read_tfrecords_benchmark generates synthetic datasets that are much smaller than expected. This PR fixes the implementation so that the synthetic dataset respects the specified num_rows.

Strict mode changed the type of ray.data.range batches from list to dict. Previously, this for loop generated one row per input row. Since batch is now a dict with one key, it instead generates one row per input batch.
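
The failure mode can be reproduced without Ray. In this sketch the batches are plain Python containers rather than real Ray batches, and the column name `"id"` is an assumption made for illustration.

```python
list_batch = [0, 1, 2]           # old-style batch: a list of rows
dict_batch = {"id": [0, 1, 2]}   # strict-mode batch: column name -> values

# Iterating a list visits every row.
rows_from_list = [x for x in list_batch]

# Iterating a dict visits its keys, so a one-column batch yields one item.
rows_from_dict_bug = [x for x in dict_batch]

# Iterating the column itself restores one item per row.
rows_from_dict_fixed = [x for x in dict_batch["id"]]
```

The loop over the dict silently produces one item per batch, which is why the generated dataset was smaller than num_rows.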

---------

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>

* [core] Retry failed redis request (#35249)

## Why are these changes needed?
After https://github.com/ray-project/ray/pull/35123, requests in Ray are serialized, so we should be able to retry failed redis requests.

This PR refactors RedisContext a little by removing CallbackItem and introducing RequestContext. Inside RequestContext, a failed request is retried automatically.

If the request still fails in the end, it will just crash.

## Related issue number
https://github.com/ray-project/ray/issues/34014

* [Dask on Ray] Attempt to fix line in dask doc (#35479)

#35458 introduced an issue with the table not being displayed.

---------

Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>

* [Doc] Make doc code snippet testable [3/n] (#35407)

Change code snippet from ..code-block:: to ..testcode::

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* [install] fix installation instructions for ray[default] (#35442)

The current exception message for a missing dependency is
```
RuntimeError: The Serve CLI requires the ray[default] installation: `pip install 'ray[default']``
```

The quotes and backticks are incorrect

Fix the instructions

* [Data] Add `num_cpus` and `num_gpus` as top-level args to map functions (#35486)

Add num_cpus and num_gpus as explicit top level argument to map functions (map, map_batches, flat_map). This is not an API change as these arguments were already previously accepted via **ray_remote_args. However, they now show up on API references and IDE hints.

* [AIR] Move Constants from tune/results.py to air/constants.py (#35404)

Currently we have result-related constants stored in `ray/tune/result.py`. However, our Result object is defined in `ray/air/result.py`. When we try to import constants from `ray/tune/result.py` in `ray/air/result.py`, there is a cyclic import error:
```
import Result -> import ray.tune.result -> ray.tune.init -> import ResultGrid -> import Result
```

This PR only moves the following constants and changes the import path in all affected files accordingly. In the future, we will gradually move the remaining constants along with the module class.

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>

* [Tune] Fix hyperband scheduler raising an error for good `PENDING` trials (#35338)

- As reported [here](https://discuss.ray.io/t/trial-with-unexpected-good-status-encountered-pending/10529), the hyperband scheduler asserts that all good trials are either `RUNNING` or `PAUSED`.
- However, if a trial gets unpaused and its [status gets set to `PENDING`](https://github.com/ray-project/ray/blob/21e9d38320fd392fa34ce60b73277d367ddf5386/python/ray/tune/schedulers/hyperband.py#L276), then, while it's still waiting for actor assignment and setup and [before the trial is officially `RUNNING`](https://github.com/ray-project/ray/blob/21e9d38320fd392fa34ce60b73277d367ddf5386/python/ray/tune/execution/tune_controller.py#L716), calling `_process_bracket` enough times will raise the assertion error for the pending good trial.
- This happens if enough trials get removed (by either completing or erroring) in the in-between time, since [each call to remove will process the bracket again](https://github.com/ray-project/ray/blob/21e9d38320fd392fa34ce60b73277d367ddf5386/python/ray/tune/schedulers/hyperband.py#L292-L293).

`PENDING` trials should be a valid state for good trials. This PR also updates the tested BOHB example to use more samples and shorter trials to catch this error.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

* [CI] Fix ml_user_ray_lightning_user_test_(master|latest).aws release test (#35465)

The release tests failed due to an incompatible urllib3 version. Pin urllib3 < 1.27 to fix the ml_user_ray_lightning_user_test_(master|latest).aws release test.

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

* Make sure connector related tests do not use pre-generated policy checkpoints (#35459)

Pre-generated checkpoints are python version dependent.
Let's avoid them.

Signed-off-by: Jun Gong <jungong@anyscale.com>

* Make test_torch_predictor a medium test. (#35466)

test_torch_predictor is a suite of 35 test cases that takes roughly a minute to run.
Most of the time these tests finish fine, which is why the test is only flaky rather than consistently failing.
Make this a medium test to de-flake it.

Signed-off-by: Jun Gong <jungong@anyscale.com>

* [1/N] Streaming Generator. Cpp interfaces and implementation (#35291)

This PR introduces TaskManager interfaces to enable streaming generator.

* [tune] Track PyTorch tutorials file in our CI (#35351)

Adds the full source file of the tutorial at https://pytorch.org/tutorials/beginner/hyperparameter_tuning_tutorial.html to our CI and tracks it using our external source tracker (#35261)

Replacement for #35263

Signed-off-by: Kai Fricke <kai@anyscale.com>

* Revert "[Data] Add `num_cpus` and `num_gpus` as top-level args to map functions (#35486)" (#35504)

This reverts commit b1d424915b6c53e9dd6985025422bcd21e987bef.

* [docker] Preserve date and git commit prefix in tags. (#35474)

So the release tags are not changed when adding a cherrypick, like a
doc cherrypick, and there will be no need to disable container builds
manually.

We will retag the one that we eventually want to release at the last
moment.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

* [RLlib][RLlib contrib] add soft deprecation notices to maml and a3c (#35345)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [RLlib contrib] add contributing.md to RLlib contrib (#35346)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [RLlib contrib] Fix rllib contrib readmes (#35347)

Signed-off-by: Avnish <avnishnarayan@gmail.com>

* [RLlib] Fix IMPALA/APPO when using multi GPU setup and Multi-Agent Env (#35120)

Signed-off-by: Michael <e-mail@rocketrider.eu>
Co-authored-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>

* [Doc] Fix error in "Writing code snippets" (#35462)

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>

The "Writing code snippets" guide contains an error. To ignore the output of a code snippet, the guide suggests using +SKIP. However, this causes CI to not test the code at all. This PR revises the guide to suggest +ELLIPSIS instead.
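
The difference between the two directives can be illustrated with Python's standard `doctest` module. This is a sketch only: the snippet text, `describe`, and `count_failures` are made up for illustration and are not taken from the guide.

```python
import doctest

# Same snippet twice, differing only in the directive.
SKIP_DOC = """
>>> describe()  # doctest: +SKIP
'Dataset(num_rows=3, id=...)'
"""

ELLIPSIS_DOC = """
>>> describe()  # doctest: +ELLIPSIS
'Dataset(num_rows=3, id=...)'
"""


def count_failures(doc: str, describe) -> int:
    """Run the examples in `doc` against `describe` and return the failure count."""
    test = doctest.DocTestParser().get_doctest(
        doc, {"describe": describe}, "snippet", None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the failure report; only the counts matter here.
    return runner.run(test, out=lambda text: None).failed


def working():
    # The id changes on every call, so the exact output cannot be written down.
    return f"Dataset(num_rows=3, id={id(object())})"


def broken():
    # A regression: wrong row count and no id.
    return "Dataset(num_rows=0)"


ellipsis_ok = count_failures(ELLIPSIS_DOC, working)      # variable part is ignored
ellipsis_broken = count_failures(ELLIPSIS_DOC, broken)   # regression is caught
skip_broken = count_failures(SKIP_DOC, broken)           # regression goes unnoticed
print(ellipsis_ok, ellipsis_broken, skip_broken)
```

With +ELLIPSIS the snippet still runs and only the variable part of the output is ignored, whereas with +SKIP the snippet is never executed.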

* [Doc] Fix batch_forecasting.ipynb (#35467)

Signed-off-by: Jun Gong <jungong@anyscale.com>

* [data] Fix ragged tensor conversion with map() (#35419)

From #35143, we found a map() case that is not covered in our numpy support test cases.

* [Core] Fix the recursion error when async actor has lots of deserialization. (#35494)

When an async actor is used, we always increase the max recursion limit before we post the function to the event loop. This is because, when there are many pending async tasks, Python thinks there is a recursion due to the large number of parallel call stacks (a consequence of the fibers used to implement async actors).

When running an async task, we have 3 steps:

1. Run a deserialize function in the event loop (`ray/python/ray/_raylet.pyx`, line 866 in bfec451): `args = core_worker.run_async_func_in_event_loop(`
2. Increase the recursion limit (`ray/python/ray/_raylet.pyx`, line 831 in bfec451): `increase_recursion_limit()`
3. Run the main function in the event loop.

The problem here is that the limit is always increased "after" deserializing the object from the event loop. This means that when the deserialization happens, the recursion limit is still low, and this can cause the exception.
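
The ordering problem can be sketched in plain Python. This is a simplified model under stated assumptions, not the code in `_raylet.pyx`: `deserialize_args` is a hypothetical stand-in that just recurses, and `run_task` is a hypothetical helper that applies the two orderings.

```python
import sys


def deserialize_args(depth: int) -> int:
    """Stand-in for argument deserialization that needs `depth` nested frames."""
    return 0 if depth == 0 else 1 + deserialize_args(depth - 1)


def run_task(depth: int, new_limit: int, raise_limit_first: bool) -> int:
    """Deserialize with the recursion limit raised either before or after."""
    old_limit = sys.getrecursionlimit()
    try:
        if raise_limit_first:
            # Ordering that avoids the error: raise the limit, then deserialize.
            sys.setrecursionlimit(new_limit)
        args = deserialize_args(depth)
        # Problematic ordering: the limit is only raised once deserialization is done.
        sys.setrecursionlimit(new_limit)
        return args
    finally:
        # Restore the interpreter-wide limit so the sketch has no side effects.
        sys.setrecursionlimit(old_limit)


base_limit = sys.getrecursionlimit()
depth = base_limit + 500       # deeper than the current limit allows
new_limit = base_limit + 2000  # enough headroom for `depth` frames

try:
    run_task(depth, new_limit, raise_limit_first=False)
    late_increase_failed = False
except RecursionError:
    late_increase_failed = True

early_increase_result = run_task(depth, new_limit, raise_limit_first=True)
print(late_increase_failed, early_increase_result == depth)
```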

When we have lots of async tasks with a higher input deserialization overhead, this can happen (because we hit the max recursion error when deserializing the object, before the recursion limit is increased), which is exactly what the 1:1 async-actor calls with args async test does, where the microbenchmark failed with the recursi…
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…#34990)

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
Labels
train Ray Train Related Issue

5 participants