Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pull] master from ray-project:master #25

Closed
wants to merge 37 commits into from

Conversation

pull[bot]
Copy link

@pull pull bot commented May 31, 2023

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

bveeramani and others added 29 commits May 26, 2023 17:00
)

#35272 changed the implementation of _convert_outputs_to_ndarray_batch, but didn't update the type annotation or docstring.

---------

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
…when needed (#35475)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…nc actors (#35828)

Signed-off-by: Alan Guo <aguo@anyscale.com>
Thanks @iycheng for debugging this!

In #33976 we don't always guarantee that children of worker processes are killed -- namely if the worker dies in an unexpected way, they are not (see #26118).

This unfortunately happens in the `test_no_worker_child_process_leaks` test case since the Ray cluster is torn down when the driver exits, which can lead to worker processes being forcefully terminated [the solution in this PR is to not initialize the Ray cluster in the driver script but start a separate head node with `.add_node`].

As a result of this forceful killing, the test becomes flaky (it passes if the workers are killed through the normal termination, and it doesn't pass if the workers are killed otherwise). The flakiness is not super big right now but becomes big with some other changes.

This should eventually be fixed with #26118.
…35649)

Why are these changes needed?
This allows the autoscaler to pass cloud instance id into Ray.
When tune runs in a windows session with legacy encoding (e.g. cp-1252), the characters in the new output will result in an encoding error:

```
  File "c:\install\ray\python\ray\tune\experimental\output.py", line 700, in experiment_started
    print(
  File "C:\Miniconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-72: character maps to <undefined>
```

This PR detects a non-utf encoding and reverts to ascii characters which are fully compatible with windows. This is equivalent to the old Tune output format:

```
+------------------------------+----------+-------+
| Trial name                   | status   | loc   |
|------------------------------+----------+-------|
| LightningTrainer_47320_00000 | PENDING  |       |
+------------------------------+----------+-------+
```


The PR also adds a small fix in tune controller that came up when debugging this issue on windows.

Signed-off-by: Kai Fricke <kai@anyscale.com>
…odule sub-classes, flip inheritance order and remove c'tors. (#35659)
Make serve.batch compatible with generators

Signed-off-by: Shreyas Krishnaswamy <shrekris@anyscale.com>

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
This PR switches our doctest tooling from sphinx.ext.doctest to pytest-sphinx.

Context
We want to test all code snippets in CI.

testcode and doctest examples are tested in CI with sphinx.ext.doctest. But, we often write code-block:: python examples, and they aren't tested in CI.

Problem
sphinx.ext.doctest doesn't let you test specific files. As a result, it's hard to update examples or isolate GPU examples.

Solution
Migrate to pytest-sphinx. Two reasons:

Lets us run GPU doctests
Enables developers to quickly test their examples

---------

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
The "Custom datasource" guide isn't really a guide -- it's an example of how to implement a specific custom datasource. Accordingly, I've moved the document to the examples section.

---------

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
We previously exposed three types of datasets: simple, tensor, and tabular. to_tf only supported tabular datasets, and we validated that this was the case. Now that there's only one type of dataset (tabular), the validation code isn't necessary.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
We import urllib.parse and pathlib in multiple functions. It's code duplication, and because the packages are built-in, there's no reason to lazily import them.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Subclassing object isn't necessary in Python 3.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Jun Gong <jungong@anyscale.com>
…esults (#35858)

Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
This PR reverts #35426, adding back in a nice button for disconnecting (i.e. calling `ray.shutdown()`) when `ray.init()` is called in a notebook.

## Testing

The main issue with the previous disconnect PR is that checking for the `ipywidgets` soft dependency at run time introduces a small performance penalty which bumps some test suites up beyond their timeout upper bound, breaking them. This happens in tests unrelated to where these changes were made previously, making troubleshooting more difficult.

Here, I've introduced some optimizations to avoid this penalty wherever possible:
* In `ray.widgets.util.in_notebook`, a significant performance penalty was previously being incurred by the `try-import/except` block upon each call. Now, we just check whether `"IPython" in sys.modules` is `True` before attempting an import. This is _much_ faster as `IPython` is pre-loaded for IPython kernels. I've also added this optimization to `repr_fallback_if_colab`.
* Secondly, `ensure_notebook_deps` now does dependency checking at function definition time, rather than on each function call.
* I've also reworked the logic for detecting the current shell, since there's really only one situation where displaying widgets is okay (when the user is running Jupyterlab). In all other cases, we fall back to simple reprs.
* The changes here also solve #35490.

Here's the artifact from the CI run below loaded onto google colab:

![image](https://github.com/ray-project/ray/assets/14017872/c1852128-9dce-4744-8bbc-1ed641ed9cce)

The rendered output has no dashboard URL row (evidently this isn't available when run on colab), but the HTML repr looks okay. The only graphical issue that I see is that colab is doing something weird to the "RAY" text that is supposed to appear by the Ray logo. Here's what this looks like in Jupyterlab:

![image](https://github.com/ray-project/ray/assets/14017872/7592f5f0-46d7-4fa3-8996-86371d39f3e2)
## Why are these changes needed?

This PR is the 1st part of enabling optimizer by default (split from #34937).

- Fix inconsistent behaviors for the Read op by reusing the `ReadTask`s from `read_api.py` in `plan_read_op.py`.
- Support cache in `materialize`.

Note, some changes in this PR may not be covered in this PR's CI, as the optimizer must be enabled to cover them. But they are already verified in #34937 CI).

## Related issue number

#32596
With this PR, actor tasks logs should be retrieved with actor id by default. Users could turn on the flag RAY_ENABLE_RECORD_ACTOR_TASK_LOGGING so that actor logs could be retrieved using task id.

This is needed to prevent regression since recording task logs is currently expensive due to Ray flushing all log lines for log monitoring.
@pull pull bot requested a review from raulchen as a code owner May 31, 2023 05:30
Next step in #34393

---------

Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
@pull pull bot added the ⤵️ pull label May 31, 2023
rickyyx and others added 7 commits May 30, 2023 23:51
…n core user guide. (#35760)

We have GAed some of the State API related features (CLI) - and promoting its usage in core related user gudies.
…] ( #34896) (#35917)

This PR seems to be the root cause of stress_test_many_tasks and "1:n async-actor calls async per second" (10%).
…des. (#35884)

This PR uses autoscaler-emitted metrics for pending/active/failed nodes. The previous implementation was buggy and wasn't properly cleaned up, which displays the incorrect graph to the dashboard.

This PR keeps the backward compatibility by only changing namespace of metrics that we will use from the dashboard. I didn't touch other metrics because they are not used anywhere anyway (we can probably remove them or rethink in the future).
Reverts #35745 after fixing windows encoding issue.


Signed-off-by: Kai Fricke <kai@anyscale.com>
#35286 broke CI. These issues weren't caught earlier because the PR was out of date with master.
added unit test for ray installer. also add helper function to NodeProviderConfig.
@raulchen raulchen closed this May 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.