Fix NaN handling in drop_list_duplicates #7662

ttnghia · 2021-03-20T03:23:37Z

This PR modifies drop_list_duplicates so that it can match either Apache Spark or Pandas semantics when handling NaN values in floating-point column data:

In Apache Spark, NaNs are treated as distinct values, so no NaN entry should be removed by drop_list_duplicates.
In Pandas, all NaNs are considered the same value, and -NaN is considered the same as NaN. Thus, only one NaN entry per list will be kept.

New tests have also been added to verify such desired behavior.
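
The two behaviors described above can be illustrated with a small host-side sketch. This is not the libcudf implementation: the function `drop_duplicates` and the enum defined here are hypothetical stand-ins chosen for illustration, and the code operates on a single `std::vector<float>` rather than a lists column.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for the NaN-equality switch discussed in this PR.
enum class nan_equality { ALL_EQUAL, UNEQUAL };

// Host-side illustration only: drop duplicates from one list of floats.
// Returns the distinct non-NaN values in ascending order, followed by the kept NaNs.
std::vector<float> drop_duplicates(std::vector<float> values, nan_equality nans_equal)
{
  // Split off NaNs of either sign; ordinary comparisons are meaningless for them.
  auto first_nan = std::stable_partition(values.begin(), values.end(),
                                         [](float v) { return !std::isnan(v); });
  std::vector<float> nans(first_nan, values.end());
  values.erase(first_nan, values.end());

  // Sort and drop adjacent duplicates among the ordered values (-0.0 == 0.0 here).
  std::sort(values.begin(), values.end());
  values.erase(std::unique(values.begin(), values.end()), values.end());
  assert(std::is_sorted(values.begin(), values.end()));

  // Pandas-like: keep a single NaN. Spark-like: keep every NaN.
  if (nans_equal == nan_equality::ALL_EQUAL && nans.size() > 1) { nans.resize(1); }
  values.insert(values.end(), nans.begin(), nans.end());
  return values;
}
```

With the input `{1, NaN, 2, -NaN, 1, NaN}`, the `ALL_EQUAL` mode keeps one NaN, while the `UNEQUAL` mode keeps all three.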

codecov · 2021-03-20T06:28:50Z

Codecov Report

Merging #7662 (b60cd90) into branch-0.19 (7871e7a) will increase coverage by 0.42%.
The diff coverage is n/a.

❗ Current head b60cd90 differs from pull request most recent head 7812e05. Consider uploading reports for the commit 7812e05 to get more accurate results

@@               Coverage Diff               @@
##           branch-0.19    #7662      +/-   ##
===============================================
+ Coverage        81.86%   82.28%   +0.42%     
===============================================
  Files              101      101              
  Lines            16884    17066     +182     
===============================================
+ Hits             13822    14043     +221     
+ Misses            3062     3023      -39

Impacted Files	Coverage Δ
python/cudf/cudf/core/column/lists.py	`87.21% <0.00%> (-4.18%)`	⬇️
python/dask_cudf/dask_cudf/backends.py	`87.16% <0.00%> (-2.47%)`	⬇️
python/cudf/cudf/core/column/decimal.py	`93.84% <0.00%> (-1.03%)`	⬇️
python/cudf/cudf/utils/utils.py	`85.06% <0.00%> (-0.38%)`	⬇️
python/cudf/cudf/core/column/column.py	`87.43% <0.00%> (-0.33%)`	⬇️
python/cudf/cudf/core/scalar.py	`86.91% <0.00%> (-0.19%)`	⬇️
python/cudf/cudf/utils/ioutils.py	`78.71% <0.00%> (ø)`
python/cudf/cudf/utils/cudautils.py	`50.38% <0.00%> (ø)`
python/cudf/cudf/core/tools/datetimes.py	`84.44% <0.00%> (ø)`
python/cudf/cudf/core/column/numerical.py	`95.02% <0.00%> (ø)`
... and 18 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

ttnghia · 2021-03-22T20:47:15Z

Rerun tests.

ttnghia · 2021-03-30T03:27:44Z

I'm thinking of another solution: re-implement sort_lists inside drop_list_duplicates, only for floating-point numbers and only for ascending order. That would be simpler than a full sort_lists implementation, and would address your concern.

Note that we still have to call cub::DeviceSegmentedRadixSort for segmented sorting, so allocating an intermediate buffer for the data is unavoidable. The current sort_lists implementation allocates a device_uvector and replaces all null entries with inf before sorting. I can do my best to avoid creating a full column, but I cannot use a transform iterator for sorting in any case.
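
The buffer-then-sort approach mentioned in the comment above can be sketched on the host as follows. This is only an illustration under assumptions: the flat `values` / `valid` / `offsets` layout and the function name `segmented_sort_nulls_last` are hypothetical, `std::sort` stands in for the device segmented radix sort, and the input is assumed to contain no NaNs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Host-side illustration only. A lists column is modelled as a flat array of
// values, a per-element validity flag, and offsets marking list boundaries.
// Assumes the values contain no NaNs (std::sort requires a strict weak ordering).
std::vector<float> segmented_sort_nulls_last(std::vector<float> const& values,
                                             std::vector<bool> const& valid,
                                             std::vector<int> const& offsets)
{
  assert(values.size() == valid.size());
  assert(!offsets.empty());

  // Intermediate buffer: copy the data and overwrite nulls with +inf
  // so that they end up at the back of each ascending-sorted list.
  std::vector<float> keys(values.size());
  for (std::size_t i = 0; i < values.size(); ++i) {
    keys[i] = valid[i] ? values[i] : std::numeric_limits<float>::infinity();
  }

  // Sort each list (segment) independently.
  for (std::size_t s = 0; s + 1 < offsets.size(); ++s) {
    std::sort(keys.begin() + offsets[s], keys.begin() + offsets[s + 1]);
  }
  return keys;
}
```

The buffer is needed because the sort writes to its keys in place, which is why a read-only transform iterator over the original data would not be enough.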

cpp/src/lists/drop_list_duplicates.cu

jrhemstad · 2021-03-30T13:05:15Z

> Note that we still have to call cub::DeviceSegmentedRadixSort for segmented sorting, and allocating an intermediate buffer for the data is unavoidable.

That's a good point. I forgot CUB's radix sort doesn't take iterators. I think your current implementation is the best we can do for now.

cpp/src/lists/drop_list_duplicates.cu

jrhemstad

On second pass, another thought occurred to me. Instead of having to materialize the replaced NaN column, couldn't the equality comparator just be parameterized on nan_equality to determine whether -NaN and NaN are equal?

ttnghia · 2021-03-30T13:28:40Z

> On second pass, another thought occurred to me. Instead of having to materialize the replaced NaN column, couldn't the equality comparator just be parameterized on nan_equality to determine whether -NaN and NaN are equal?

I thought about that, but couldn't apply it. The reason: after sorting, -NaNs are placed at the beginning of the list while NaNs are placed at the end. A unique copy can only compare two adjacent entries, not two entries at opposite ends of the list.

jrhemstad · 2021-03-30T13:31:27Z

> After sorting, -NaNs are put at the beginning of the list while NaNs are put at the end.

Oh, interesting. That must be an implementation detail of the CUB segmented radix sort.

Well, in that case, what you've done seems like the best we can do.
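
The ordering discussed in this thread is consistent with the bit-flipping key transform that radix sorts commonly apply to IEEE-754 floats. The sketch below reproduces that transform on the host to show why -NaN and NaN land at opposite ends, and why normalizing -NaN to NaN before sorting lets an adjacent-only unique pass merge them. The names `radix_key` and `sort_unique` are hypothetical, and this is an illustration rather than the CUB or libcudf code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>
#include <vector>

// Map a float to an unsigned key whose integer order gives the radix-sort order:
// values with the sign bit set have all bits flipped, others only the sign bit.
std::uint32_t radix_key(float v)
{
  static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");
  std::uint32_t bits;
  std::memcpy(&bits, &v, sizeof(bits));
  std::uint32_t const mask = (bits & 0x80000000u) ? 0xFFFFFFFFu : 0x80000000u;
  return bits ^ mask;
}

// Sort by radix key, then drop adjacent duplicates, treating any two NaNs as equal.
// If normalize_negative_nan is set, the sign bit of every NaN is cleared first.
std::vector<float> sort_unique(std::vector<float> values, bool normalize_negative_nan)
{
  assert(std::numeric_limits<float>::is_iec559);
  if (normalize_negative_nan) {
    for (float& v : values) {
      if (std::isnan(v)) { v = std::fabs(v); }  // fabs clears the sign bit
    }
  }
  std::sort(values.begin(), values.end(),
            [](float a, float b) { return radix_key(a) < radix_key(b); });
  auto same = [](float a, float b) { return (std::isnan(a) && std::isnan(b)) || a == b; };
  values.erase(std::unique(values.begin(), values.end(), same), values.end());
  return values;
}
```

Without normalization, a list containing both -NaN and NaN keeps two NaN entries even though the comparator treats them as equal, because they are never adjacent after sorting.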

kkraus14 · 2021-03-30T14:16:33Z

@gpucibot merge

ttnghia · 2021-03-30T17:08:10Z

Rerun tests.

ttnghia · 2021-03-30T21:04:43Z

Rerun tests.

ttnghia · 2021-03-30T22:46:29Z

Rerun tests.

kkraus14 · 2021-03-31T00:49:11Z

rerun tests

ttnghia · 2021-03-31T02:38:55Z

Rerun tests.

ttnghia added 7 commits March 17, 2021 17:47

Add tests for drop_list_duplicates in case of input containing floating point numbers with NaN

81e0d79

Add negative NaN into the tests

e431549

Rewrite tests: split tests into smaller tests with some improvements

84b06a3

Some improvement to floating point tests with NaNs

34305f2

Add customized comparators for drop_list_duplicates, still need to update doc

fa46446

Some cleanup

452835b

Rewrite doc for element_comparator and element_comparator_fn

42535a0

ttnghia added the labels feature request, 3 - Ready for Review, libcudf, Spark, and non-breaking on Mar 20, 2021

ttnghia requested a review from a team as a code owner March 20, 2021 03:23

ttnghia requested review from rgsl888prabhu and nvdbaranec March 20, 2021 03:23

jrhemstad reviewed Mar 20, 2021

cpp/src/lists/drop_list_duplicates.cu

ttnghia added 2 commits March 22, 2021 10:31

Using type_dispatcher only for host code

2b5b8e4

Fix memory access violation bug when using reference to column_device_view

dfa1c8a

ttnghia added 2 commits March 22, 2021 15:18

Merge remote-tracking branch 'origin/branch-0.19' into fix_nan_drop_list_duplicates

f2c4d5d

Add test case when the list contains both -0.0 and 0.0

9d07cc7

ttnghia mentioned this pull request Mar 22, 2021

Adds list.unique API #7664

Merged

davidwendt reviewed Mar 23, 2021

cpp/tests/lists/drop_list_duplicates_tests.cpp

davidwendt reviewed Mar 23, 2021

cpp/src/lists/drop_list_duplicates.cu

jrhemstad approved these changes Mar 23, 2021

davidwendt reviewed Mar 23, 2021

cpp/tests/lists/drop_list_duplicates_tests.cpp

Minor cleanup

27b5beb

harrism reviewed Mar 30, 2021

cpp/src/lists/drop_list_duplicates.cu

jrhemstad reviewed Mar 30, 2021

cpp/src/lists/drop_list_duplicates.cu

jrhemstad reviewed Mar 30, 2021

cpp/src/lists/drop_list_duplicates.cu

jrhemstad reviewed Mar 30, 2021

cpp/src/lists/drop_list_duplicates.cu

jrhemstad reviewed Mar 30, 2021

ttnghia added 3 commits March 30, 2021 07:33

Replace is_null by is_null_nocheck

1406a25

Replace make_numeric_column by device_uvector

32b4393

Rewrite comments, and add a condition check for nans_equal == ALL_EQUAL before calling to `has_negative_nans`

b5af91e

jrhemstad approved these changes Mar 30, 2021

jrhemstad added the 5 - Ready to Merge label and removed the 3 - Ready for Review label on Mar 30, 2021

Merge branch 'branch-0.19' into fix_nan_drop_list_duplicates

7812e05

isVoid reviewed Mar 30, 2021

cpp/include/cudf/types.hpp

rapids-bot bot merged commit bd11dbe into rapidsai:branch-0.19 Mar 31, 2021

ttnghia mentioned this pull request Apr 20, 2021

add null order support to detail::drop_duplicates #7938

Merged

ttnghia self-assigned this Apr 25, 2021

ttnghia deleted the fix_nan_drop_list_duplicates branch May 3, 2021 21:52
