[Data] Optimize sample_boundaries in SortTaskSpec #39581

Merged
merged 1 commit into from
Sep 12, 2023

Conversation

z4y1b2
Contributor

@z4y1b2 z4y1b2 commented Sep 12, 2023

Optimize sample_boundaries in SortTaskSpec to call numpy.quantile once to get all boundaries for each column.

This is much faster than the old implementation when num_reducers is large (e.g., 5000), because each call to numpy.quantile sorts the array, and the old implementation called it num_reducers times for each column.
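As a rough illustration of the change described above, here is a minimal sketch of the two approaches. It is not the actual SortTaskSpec code: the function names `boundaries_per_call` and `boundaries_single_call` are hypothetical, the samples are assumed to be a single numeric column, and the default (linear) interpolation of numpy.quantile is used for simplicity.

```python
import numpy as np


def boundaries_per_call(samples, num_reducers):
    """Old-style sketch (hypothetical): one numpy.quantile call per boundary."""
    # num_reducers partitions need num_reducers - 1 interior boundaries.
    qs = np.linspace(0, 1, num_reducers, endpoint=False)[1:]
    # Each call processes the whole sample array again.
    return [float(np.quantile(samples, q)) for q in qs]


def boundaries_single_call(samples, num_reducers):
    """New-style sketch (hypothetical): one numpy.quantile call for all boundaries."""
    qs = np.linspace(0, 1, num_reducers, endpoint=False)[1:]
    # Passing an array of quantiles lets numpy process the samples once.
    return np.quantile(samples, qs).tolist()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    samples = rng.integers(0, 1_000_000, size=10_000)
    old = boundaries_per_call(samples, 50)
    new = boundaries_single_call(samples, 50)
    # Both approaches should agree on the boundaries.
    print(np.allclose(old, new))
```

Both functions return the same boundaries; the second one simply avoids repeating the per-call work num_reducers times, which is where the speedup for large num_reducers would come from.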

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Optimize sample_boundaries in SortTaskSpec to call numpy.quantile once to get all boundaries for each column.

Signed-off-by: z4y1b2 <88138737+z4y1b2@users.noreply.github.com>
Contributor

@raulchen raulchen left a comment

thanks!

@raulchen raulchen merged commit 7e74297 into ray-project:master Sep 12, 2023
47 of 49 checks passed
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
Signed-off-by: z4y1b2 <88138737+z4y1b2@users.noreply.github.com>
Signed-off-by: Victor <vctr.y.m@example.com>
@z4y1b2 z4y1b2 deleted the patch-2 branch November 2, 2023 08:56
2 participants