Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add handling for nulls in dask_cudf.sorting.quantile_divisions #9171

Merged
merged 4 commits into from Sep 7, 2021

Conversation

charlesbluca
Copy link
Member

Closes #9157

Originally, dask_cudf.DataFrame.sort_values would fail if the DataFrame had enough null values that divisions contained nulls here:

divisions = sorted(
divisions.drop_duplicates().astype(dtype).values.tolist()
)

As you cannot get the values of a Series containing nulls; this PR drops the nulls from divisions before calling values, appending the correct amount to the resulting list afterwards to avoid this failure.

@charlesbluca charlesbluca added bug Something isn't working 3 - Ready for Review Ready for review by team dask-cudf non-breaking Non-breaking change labels Sep 2, 2021
@charlesbluca charlesbluca requested a review from a team as a code owner September 2, 2021 16:48
@charlesbluca charlesbluca added this to PR-WIP in v21.10 Release via automation Sep 2, 2021
@github-actions github-actions bot added the cuDF (Python) Affects Python cuDF API. label Sep 2, 2021
Comment on lines 189 to 197
divisions = (
sorted(
divisions.dropna()
.drop_duplicates()
.astype(dtype)
.values.tolist()
)
+ [None] * divisions.null_count
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use .to_arrow and sort, that way we are tampering the null ordering?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were you thinking something like

        divisions = (
            sorted(
                divisions.drop_duplicates()
                .astype(dtype)
                .to_arrow().tolist()
            )

In that case, the null ordering is still being tampered by drop_duplicates, which places nulls first. From there we can't sort the resulting list as it contains None, unless you meant using an arrow sorting method before calling tolist()?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, didn't notice the drop_duplicates call, yea drop_duplicates results in non-deterministic ordering. But .to_arrow().tolist() seems cleaner to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, in that case do you have any preference for how we handle sorting the list with nulls? Doing a quick search, this seems like a suitable solution:

# https://stackoverflow.com/questions/18411560/sort-list-while-pushing-none-values-to-the-end

sorted(..., key=lambda x: (x is None, x))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try sort_indices of pyarrow, it seems to be doing what you want to do: https://arrow.apache.org/docs/python/generated/pyarrow.compute.sort_indices.html

If that's not possible the StackOverflow approach looks fine to me.

@codecov
Copy link

codecov bot commented Sep 2, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@858944b). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head 96d085f differs from pull request most recent head 3956939. Consider uploading reports for the commit 3956939 to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.10    #9171   +/-   ##
===============================================
  Coverage                ?   10.78%           
===============================================
  Files                   ?      115           
  Lines                   ?    19113           
  Branches                ?        0           
===============================================
  Hits                    ?     2062           
  Misses                  ?    17051           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 858944b...3956939. Read the comment docs.

@galipremsagar
Copy link
Contributor

rerun tests

@galipremsagar galipremsagar moved this from PR-WIP to PR-Needs review in v21.10 Release Sep 4, 2021
Copy link
Member

@rjzamora rjzamora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a reasonable change to me. Thanks @charlesbluca !

v21.10 Release automation moved this from PR-Needs review to PR-Reviewer approved Sep 7, 2021
@quasiben
Copy link
Member

quasiben commented Sep 7, 2021

Thanks @charlesbluca !

@quasiben
Copy link
Member

quasiben commented Sep 7, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 91f8533 into rapidsai:branch-21.10 Sep 7, 2021
v21.10 Release automation moved this from PR-Reviewer approved to Done Sep 7, 2021
@charlesbluca charlesbluca deleted the fix-9157 branch July 19, 2022 14:26
@vyasr vyasr added dask Dask issue and removed dask-cudf labels Feb 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working cuDF (Python) Affects Python cuDF API. dask Dask issue non-breaking Non-breaking change
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

[BUG] sort_values sometimes fails for dask-cudf DataFrames with nulls
5 participants