Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data][Docs] Revise Transforming data #36162

Merged
merged 48 commits into from
Jun 8, 2023

Conversation

bveeramani
Copy link
Member

@bveeramani bveeramani commented Jun 7, 2023

Why are these changes needed?

Transforming data is verbose. It also contains information that's irrelevant for most users (for example, the fact repartition is an all-to-all operation). This PR revises the guide for brevity and discoverability.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

bveeramani and others added 30 commits November 21, 2022 21:00
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
This reverts commit de05655.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need to cover batch_size and the notion of partitioning somehow. These are the two knobs that users most often get wrong and are both critical for performance, so it is hard to avoid these topics.

@ericl
Copy link
Contributor

ericl commented Jun 7, 2023

Looks pretty good! One comment on ensuring we address performance tuning.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jun 7, 2023
Copy link
Contributor

@amogkam amogkam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks!

but +1 on not dropping the batch size configuration and repartitioning sections. In the repartitioning section, we should explicitly warn against using this unless necessary. And instead recommend to configure the parallelism as part of the read api.

doc/source/data/transforming-data.rst Show resolved Hide resolved
doc/source/data/transforming-data.rst Outdated Show resolved Hide resolved
@bveeramani bveeramani mentioned this pull request Jun 7, 2023
9 tasks
@bveeramani
Copy link
Member Author

In the repartitioning section, we should explicitly warn against using this unless necessary. And instead recommend to configure the parallelism as part of the read api.

@amogkam Wait, why do we prefer parallelism over repartition?

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
@bveeramani bveeramani removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jun 8, 2023
@amogkam
Copy link
Contributor

amogkam commented Jun 8, 2023

@ericl can say more, but my understanding is that it essentially pushes the parallelism into the read operation. Instead of require an another repartition after.

@ericl
Copy link
Contributor

ericl commented Jun 8, 2023

Hmm I don't think we should recommend users set parallelism manually ever. The heuristic is pretty much always better than what users choose themselves.

However parallelism has one issue now where it cannot increase parallelism past the number of files. In this case, repartitioning is the only way you can actually increase the effective parallelism.

This is a bit unfortunate and I think we can fix this by auto repartitioning in these cases, but for now we probably still need to document repartitioning.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Copy link
Contributor

@amogkam amogkam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice lgtm!

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jun 8, 2023
@bveeramani bveeramani added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Jun 8, 2023
@bveeramani
Copy link
Member Author

bveeramani commented Jun 8, 2023

Failing doctest test is unrelated

______________________________ [doctest] faq.rst _______________________________
--
  | 082    gets propagated to all training workers.
  | 083
  | 084 .. testcode::
  | 085
  | 086     import ray
  | 087
  | 088     # Add this at the top of your Ray application.
  | 089     runtime_env = {"env_vars": {"NCCL_SOCKET_IFNAME": "ens5"}}
  | 090     ray.init(runtime_env=runtime_env, ignore_reinit_error=True)
  | 091
  | UNEXPECTED EXCEPTION: AttributeError("'NoneType' object has no attribute 'address_info'")
  | Traceback (most recent call last):
  | File "/opt/miniconda/lib/python3.7/site-packages/pytest_sphinx.py", line 457, in _DocTestRunner__run
  | test.globs,
  | File "<doctest faq.rst[0]>", line 5, in <module>
  | File "/ray/python/ray/_private/client_mode_hook.py", line 103, in wrapper
  | return func(*args, **kwargs)
  | File "/ray/python/ray/_private/worker.py", line 1529, in init
  | return RayContext(dict(_global_node.address_info, node_id=node_id.hex()))
  | AttributeError: 'NoneType' object has no attribute 'address_info'
  | faq.rst:91: UnexpectedException

@amogkam amogkam merged commit 9e964df into ray-project:master Jun 8, 2023
2 checks passed
@bveeramani bveeramani deleted the transforming-data branch June 8, 2023 18:55
amogkam pushed a commit to amogkam/ray that referenced this pull request Jun 8, 2023
Transforming data is verbose. It also contains information that's irrelevant for most users (for example, the fact repartition is an all-to-all operation). This PR revises the guide for brevity and discoverability.

---------

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
amogkam added a commit that referenced this pull request Jun 9, 2023
#35749
#35751
#35753
#35755
#35757
#36018
#36105
#36121
#36144
#36145
#36162
#36124

---------

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Amog Kamsetty <amogkam@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Co-authored-by: Hao Chen <chenh1024@gmail.com>
amogkam added a commit to amogkam/ray that referenced this pull request Jun 9, 2023
ray-project#35749
ray-project#35751
ray-project#35753
ray-project#35755
ray-project#35757
ray-project#36018
ray-project#36105
ray-project#36121
ray-project#36144
ray-project#36145
ray-project#36162
ray-project#36124

---------

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Amog Kamsetty <amogkam@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Co-authored-by: Hao Chen <chenh1024@gmail.com>
amogkam added a commit that referenced this pull request Jun 9, 2023
* [Data] [Docs] Ray Data doc changes for 2.5 (#36224)

#35749
#35751
#35753
#35755
#35757
#36018
#36105
#36121
#36144
#36145
#36162
#36124

---------

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Amog Kamsetty <amogkam@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Co-authored-by: Hao Chen <chenh1024@gmail.com>
maxpumperla added a commit that referenced this pull request Jun 10, 2023
* [Data] [Docs] Ray Data doc changes for 2.5 (#36224)

#35749
#35751
#35753
#35755
#35757
#36018
#36105
#36121
#36144
#36145
#36162
#36124

---------

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Amog Kamsetty <amogkam@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Co-authored-by: Hao Chen <chenh1024@gmail.com>

* [docs] relax kapa loading scheme (#36201)

Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>

* Revert "[Data] [Docs] Ray Data doc changes for 2.5 (#36224)"

This reverts commit 48a6c26.

---------

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Amog Kamsetty <amogkam@users.noreply.github.com>
Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Co-authored-by: Hao Chen <chenh1024@gmail.com>
Co-authored-by: Max Pumperla <max.pumperla@googlemail.com>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
Transforming data is verbose. It also contains information that's irrelevant for most users (for example, the fact repartition is an all-to-all operation). This PR revises the guide for brevity and discoverability.

---------

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants