[Data][Docs] Revise Transforming data #36162
This reverts commit 5984f1e.
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
This reverts commit de05655. Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
I think we still need to cover batch_size and the notion of partitioning somehow. These are the two knobs that users most often get wrong, and both are critical for performance, so it is hard to avoid these topics.
Looks pretty good! One comment on ensuring we address performance tuning.
lgtm, thanks!
but +1 on not dropping the batch size configuration and repartitioning sections. In the repartitioning section, we should explicitly warn against using it unless necessary, and instead recommend configuring the parallelism as part of the read API.
@amogkam Wait, why do we prefer |
@ericl can say more, but my understanding is that it essentially pushes the parallelism into the read operation, instead of requiring another repartition afterward.
Hmm, I don't think we should ever recommend that users set parallelism manually. The heuristic is pretty much always better than what users choose themselves. However, parallelism currently has one issue: it cannot increase parallelism past the number of files. In this case, repartitioning is the only way to actually increase the effective parallelism. This is a bit unfortunate, and I think we can fix it by auto-repartitioning in these cases, but for now we probably still need to document repartitioning.
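To make the limitation above concrete, here is a toy model in plain Python. It is only a sketch under stated assumptions, not Ray Data's implementation: `plan_read_tasks` and `repartition` are hypothetical helpers, and the file names are made up. It assumes each read task handles whole files, so the task count is capped by the file count, while splitting already-loaded rows can produce any number of blocks.

```python
from typing import List


def plan_read_tasks(files: List[str], requested_parallelism: int) -> List[List[str]]:
    """Toy model (assumption): a read task handles whole files, so tasks <= len(files)."""
    num_tasks = max(1, min(requested_parallelism, len(files)))
    tasks: List[List[str]] = [[] for _ in range(num_tasks)]
    for i, path in enumerate(files):
        # Assign files to tasks round-robin.
        tasks[i % num_tasks].append(path)
    return tasks


def repartition(rows: List[int], num_blocks: int) -> List[List[int]]:
    """Toy model (assumption): split loaded rows into num_blocks near-equal blocks."""
    base, extra = divmod(len(rows), num_blocks)
    blocks: List[List[int]] = []
    start = 0
    for i in range(num_blocks):
        # The first `extra` blocks take one additional row each.
        size = base + (1 if i < extra else 0)
        blocks.append(rows[start:start + size])
        start += size
    return blocks


files = ["part-0.csv", "part-1.csv"]  # hypothetical file names
tasks = plan_read_tasks(files, requested_parallelism=8)
print(len(tasks))  # capped by the number of files

rows = list(range(10))  # stand-in for rows loaded from those files
blocks = repartition(rows, num_blocks=8)
print(len(blocks))  # splitting rows is not limited by the file count
```

In this model, asking for 8 read tasks over 2 files still yields 2 tasks, whereas repartitioning the loaded rows reaches 8 blocks, which is why repartitioning still needs to be documented for now.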
nice lgtm!
Failing doctest test is unrelated
Transforming data is verbose. It also contains information that's irrelevant for most users (for example, the fact repartition is an all-to-all operation). This PR revises the guide for brevity and discoverability. --------- Signed-off-by: Balaji Veeramani <balaji@anyscale.com> Signed-off-by: amogkam <amogkamsetty@yahoo.com>
#35749 #35751 #35753 #35755 #35757 #36018 #36105 #36121 #36144 #36145 #36162 #36124 --------- Signed-off-by: Balaji Veeramani <balaji@anyscale.com> Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: amogkam <amogkamsetty@yahoo.com> Signed-off-by: Amog Kamsetty <amogkam@users.noreply.github.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Co-authored-by: Hao Chen <chenh1024@gmail.com>
* [Data] [Docs] Ray Data doc changes for 2.5 (#36224) #35749 #35751 #35753 #35755 #35757 #36018 #36105 #36121 #36144 #36145 #36162 #36124 --------- Signed-off-by: Balaji Veeramani <balaji@anyscale.com> Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: amogkam <amogkamsetty@yahoo.com> Signed-off-by: Amog Kamsetty <amogkam@users.noreply.github.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Co-authored-by: Hao Chen <chenh1024@gmail.com> * [docs] relax kapa loading scheme (#36201) Signed-off-by: Max Pumperla <max.pumperla@googlemail.com> * Revert "[Data] [Docs] Ray Data doc changes for 2.5 (#36224)" This reverts commit 48a6c26. --------- Signed-off-by: Balaji Veeramani <balaji@anyscale.com> Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: amogkam <amogkamsetty@yahoo.com> Signed-off-by: Amog Kamsetty <amogkam@users.noreply.github.com> Signed-off-by: Max Pumperla <max.pumperla@googlemail.com> Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com> Co-authored-by: Hao Chen <chenh1024@gmail.com> Co-authored-by: Max Pumperla <max.pumperla@googlemail.com>
Transforming data is verbose. It also contains information that's irrelevant for most users (for example, the fact repartition is an all-to-all operation). This PR revises the guide for brevity and discoverability. --------- Signed-off-by: Balaji Veeramani <balaji@anyscale.com> Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
Why are these changes needed?
Transforming data is verbose. It also contains information that's irrelevant for most users (for example, the fact repartition is an all-to-all operation). This PR revises the guide for brevity and discoverability.
Related issue number
Checks
git commit -s) in this PR.
scripts/format.sh to lint the changes in this PR.
method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.