Fix NaN handling in drop_list_duplicates #7662

Merged
rapids-bot merged 38 commits into rapidsai:branch-0.19 from fix_nan_drop_list_duplicates
Mar 31, 2021

Conversation

ttnghia
Contributor

@ttnghia ttnghia commented Mar 20, 2021

This PR modifies the behavior of drop_list_duplicates so that it can match both Apache Spark and Pandas behavior when dealing with NaN values in floating-point column data:

  • In Apache Spark, NaNs are treated as distinct values, so no NaN entry should be removed by drop_list_duplicates.
  • In Pandas, all NaNs are considered the same value; even -NaN is treated as equal to NaN. Thus, only one NaN entry per list is kept.

New tests have also been added to verify such desired behavior.
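
The two NaN policies above can be illustrated with a small pure-Python model. This is only a sketch of the semantics described in this PR, not the libcudf implementation: the function name `drop_duplicates` and the `nans_equal` flag are hypothetical, and the model preserves input order, which the real function may not.

```python
import math


def drop_duplicates(values, nans_equal):
    """Hypothetical model of per-list duplicate removal (not the libcudf API).

    nans_equal=False models the Spark-style policy: every NaN is kept.
    nans_equal=True models the Pandas-style policy: NaN and -NaN collapse
    into a single entry.
    """
    seen = set()      # non-NaN values already emitted
    seen_nan = False  # whether a NaN has already been emitted
    out = []
    for v in values:
        if math.isnan(v):
            if nans_equal:
                if seen_nan:
                    continue  # later NaNs are duplicates under this policy
                seen_nan = True
            out.append(v)
        elif v not in seen:
            seen.add(v)
            out.append(v)
    return out


nan = float("nan")
one_list = [1.0, nan, 1.0, -nan, nan]
spark_like = drop_duplicates(one_list, nans_equal=False)   # keeps all three NaNs
pandas_like = drop_duplicates(one_list, nans_equal=True)   # keeps a single NaN
```

With the Spark-style policy only the repeated `1.0` is dropped; with the Pandas-style policy the list shrinks to one `1.0` and one NaN.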

@ttnghia added the feature request, 3 - Ready for Review, libcudf, Spark, and non-breaking labels Mar 20, 2021
@ttnghia ttnghia requested a review from a team as a code owner March 20, 2021 03:23
@codecov

codecov bot commented Mar 20, 2021

Codecov Report

Merging #7662 (b60cd90) into branch-0.19 (7871e7a) will increase coverage by 0.42%.
The diff coverage is n/a.

❗ Current head b60cd90 differs from pull request most recent head 7812e05. Consider uploading reports for the commit 7812e05 to get more accurate results

@@               Coverage Diff               @@
##           branch-0.19    #7662      +/-   ##
===============================================
+ Coverage        81.86%   82.28%   +0.42%     
===============================================
  Files              101      101              
  Lines            16884    17066     +182     
===============================================
+ Hits             13822    14043     +221     
+ Misses            3062     3023      -39     
| Impacted Files | Coverage Δ |
|---|---|
| python/cudf/cudf/core/column/lists.py | 87.21% <0.00%> (-4.18%) ⬇️ |
| python/dask_cudf/dask_cudf/backends.py | 87.16% <0.00%> (-2.47%) ⬇️ |
| python/cudf/cudf/core/column/decimal.py | 93.84% <0.00%> (-1.03%) ⬇️ |
| python/cudf/cudf/utils/utils.py | 85.06% <0.00%> (-0.38%) ⬇️ |
| python/cudf/cudf/core/column/column.py | 87.43% <0.00%> (-0.33%) ⬇️ |
| python/cudf/cudf/core/scalar.py | 86.91% <0.00%> (-0.19%) ⬇️ |
| python/cudf/cudf/utils/ioutils.py | 78.71% <0.00%> (ø) |
| python/cudf/cudf/utils/cudautils.py | 50.38% <0.00%> (ø) |
| python/cudf/cudf/core/tools/datetimes.py | 84.44% <0.00%> (ø) |
| python/cudf/cudf/core/column/numerical.py | 95.02% <0.00%> (ø) |
| ... and 18 more | |

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7d49f75...7812e05. Read the comment docs.

@ttnghia
Contributor Author

ttnghia commented Mar 22, 2021

Rerun tests.

@ttnghia ttnghia mentioned this pull request Mar 22, 2021
@ttnghia
Contributor Author

ttnghia commented Mar 30, 2021

I'm just thinking of another solution: re-implement sort_lists in drop_list_duplicates, only for floating-point numbers, and only for ascending order. That would be simpler than a full sort_lists implementation, and would solve your concern.

Note that we still have to call cub::DeviceSegmentedRadixSort for segmented sorting, and allocating an intermediate buffer for the data is unavoidable. The current sort_lists implementation allocates a device_uvector and replaces all null entries with inf before sorting. I can do my best to avoid creating a full column, but a transform iterator cannot be used for the sort in any case.
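
The "intermediate buffer" approach described above can be modeled in a few lines of Python. This is a rough sketch under stated assumptions, not the sort_lists code: `segmented_sort_nulls_last` is a hypothetical name, nulls are represented as `None`, and lists are given as a flat value array plus offsets.

```python
import math


def segmented_sort_nulls_last(values, offsets):
    """Hypothetical model: sort each list segment ascending, nulls last.

    values  : flat list of floats, with None standing in for null
    offsets : segment boundaries; segment i is values[offsets[i]:offsets[i+1]]
    """
    # Materialize a buffer in which every null is replaced by +inf, so that a
    # plain ascending sort pushes the former nulls to the end of each segment.
    buf = [math.inf if v is None else v for v in values]
    for i in range(len(offsets) - 1):
        lo, hi = offsets[i], offsets[i + 1]
        buf[lo:hi] = sorted(buf[lo:hi])  # sort one segment in place
    return buf


# Two lists: [3.0, null, 1.0] and [2.0, 0.5]
sorted_buf = segmented_sort_nulls_last([3.0, None, 1.0, 2.0, 0.5], [0, 3, 5])
```

The buffer is needed because the sort operates on stored keys rather than on values computed on the fly, which is the constraint raised in the comment above. Note that this model cannot tell a replaced null from a genuine inf afterwards; it only illustrates the ordering trick.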

@jrhemstad
Contributor

> Note that we still have to call cub::DeviceSegmentedRadixSort for segmented sorting, and allocating an intermediate buffer for the data is unavoidable.

That's a good point. I forgot CUB's radix sort doesn't take iterators. I think your current implementation is the best we can do for now.

Contributor

@jrhemstad jrhemstad left a comment

On second pass, another thought occurred to me. Instead of having to materialize the replaced NaN column, couldn't the equality comparator just be parameterized on nan_equality to determine whether -NaN and NaN are equal?

@ttnghia
Contributor Author

ttnghia commented Mar 30, 2021

> On second pass, another thought occurred to me. Instead of having to materialize the replaced NaN column, couldn't the equality comparator just be parameterized on nan_equality to determine whether -NaN and NaN are equal?

I thought about that, but couldn't apply it. Here is the reason: After sorting, -NaNs are put at the beginning of the list while NaNs are put at the end. The unique-copy step can only compare two adjacent entries, so it never compares an entry at the start of the list with one at the end.

@jrhemstad
Contributor

> After sorting, -NaNs are put at the beginning of the list while NaNs are put at the end.

Oh, interesting. That must be an implementation detail of the CUB segmented radix sort.

Well, in that case, what you've done seems like the best we can do.
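
The ordering discussed in this exchange can be reproduced with a small Python model. It assumes the common radix-sort key transform for IEEE-754 doubles (flip all bits when the sign bit is set, otherwise flip only the sign bit); whether CUB does exactly this is an assumption here, and all names below (`radix_key`, `unique_adjacent`, `normalize_nan`) are hypothetical helpers, not cudf or CUB functions.

```python
import struct

# Quiet NaNs built from explicit bit patterns, with and without the sign bit.
NEG_NAN = struct.unpack("<d", struct.pack("<Q", 0xFFF8000000000000))[0]
POS_NAN = struct.unpack("<d", struct.pack("<Q", 0x7FF8000000000000))[0]


def radix_key(x):
    """Map a double to an unsigned key whose integer order is the sort order."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    if bits >> 63:                       # sign bit set: flip every bit
        return bits ^ 0xFFFFFFFFFFFFFFFF
    return bits ^ 0x8000000000000000     # sign bit clear: flip only the sign


def sign_bit(x):
    return struct.unpack("<Q", struct.pack("<d", x))[0] >> 63


def unique_adjacent(sorted_vals, nans_equal):
    """Drop an entry only if it equals the entry kept immediately before it."""
    out = []
    for v in sorted_vals:
        if out:
            prev = out[-1]
            both_nan = prev != prev and v != v
            if prev == v or (nans_equal and both_nan):
                continue
        out.append(v)
    return out


def normalize_nan(x):
    """Replace any NaN, whatever its sign, with the positive NaN."""
    return POS_NAN if x != x else x


data = [POS_NAN, 2.0, NEG_NAN, -1.0, POS_NAN, NEG_NAN]

# Sorting by the bit key puts -NaN first and NaN last, so a comparator that
# treats NaNs as equal still leaves two NaN entries after adjacent-unique.
by_bits = sorted(data, key=radix_key)
split_result = unique_adjacent(by_bits, nans_equal=True)

# Normalizing -NaN to NaN before sorting makes all NaNs adjacent.
normalized = sorted((normalize_nan(v) for v in data), key=radix_key)
merged_result = unique_adjacent(normalized, nans_equal=True)
```

In this model `split_result` still holds one -NaN and one NaN, while `merged_result` holds a single NaN, which is why the NaN replacement has to happen before the sort rather than inside the equality comparator.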

@jrhemstad added the 5 - Ready to Merge label and removed the 3 - Ready for Review label Mar 30, 2021
@kkraus14
Collaborator

@gpucibot merge

@ttnghia
Contributor Author

ttnghia commented Mar 30, 2021

Rerun tests.

@ttnghia
Contributor Author

ttnghia commented Mar 30, 2021

Rerun tests.

@ttnghia
Contributor Author

ttnghia commented Mar 30, 2021

Rerun tests.

@kkraus14
Collaborator

rerun tests

@ttnghia
Contributor Author

ttnghia commented Mar 31, 2021

Rerun tests.

@rapids-bot rapids-bot bot merged commit bd11dbe into rapidsai:branch-0.19 Mar 31, 2021
@ttnghia ttnghia self-assigned this Apr 25, 2021
@ttnghia ttnghia deleted the fix_nan_drop_list_duplicates branch May 3, 2021 21:52
Labels
5 - Ready to Merge: Testing and reviews complete, ready to merge
feature request: New feature or request
libcudf: Affects libcudf (C++/CUDA) code.
non-breaking: Non-breaking change
Spark: Functionality that helps Spark RAPIDS

8 participants