Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix NaN handling in drop_list_duplicates #7662

Merged
merged 38 commits into from
Mar 31, 2021
Merged
Show file tree
Hide file tree
Changes from 11 commits
Commits
Show all changes
38 commits
Select commit Hold shift + click to select a range
81e0d79
Add tests for drop_list_duplicates in case of input containing floati…
ttnghia Mar 17, 2021
e431549
Add negative NaN into the tests
ttnghia Mar 19, 2021
84b06a3
Rewrite tests: split tests into smaller tests with some improvements
ttnghia Mar 19, 2021
34305f2
Some improvement to floating point tests with NaNs
ttnghia Mar 19, 2021
fa46446
Add customized comparators for drop_list_duplicates, still need to up…
ttnghia Mar 20, 2021
452835b
Some cleanup
ttnghia Mar 20, 2021
42535a0
Rewrite doc for element_comparator and element_comparator_fn
ttnghia Mar 20, 2021
2b5b8e4
Using type_dispatcher only for host code
ttnghia Mar 22, 2021
dfa1c8a
Fix memory access violation bug when using reference to column_device…
ttnghia Mar 22, 2021
f2c4d5d
Merge remote-tracking branch 'origin/branch-0.19' into fix_nan_drop_l…
ttnghia Mar 22, 2021
9d07cc7
Add test case when the list contains both -0.0 and 0.0
ttnghia Mar 22, 2021
ef6d7e2
Rename constants
ttnghia Mar 23, 2021
725155b
Change has_null from template parameter to runtime parameter
ttnghia Mar 23, 2021
97815de
Merge branch 'branch-0.19' into fix_nan_drop_list_duplicates
ttnghia Mar 23, 2021
6f76f8e
Remove redundant qualifiers from class constructor
ttnghia Mar 23, 2021
c66b88f
AddAdd `nan_equality` enum to specify whether NaN elements should be …
ttnghia Mar 25, 2021
806b900
Rewrite `drop_list_duplicate`, adding `nans_equal` parameter, allowin…
ttnghia Mar 25, 2021
d96bc79
Rewrite tests for `drop_list_duplicates`
ttnghia Mar 25, 2021
0246c78
Rewrite `collect_set_aggregation`, adding `nans_equal` parameter
ttnghia Mar 25, 2021
2185d8a
Fix typo
ttnghia Mar 25, 2021
919e859
Change `nan_equality` enum names
ttnghia Mar 25, 2021
a002e62
Fix enum in unit tests for `drop_list_duplicates`
ttnghia Mar 25, 2021
26cae09
Add an option to specify NaNs are compared equal only if they have th…
ttnghia Mar 25, 2021
4356bf6
Rework `drop_list_duplicates` for the new `nan_equality` option
ttnghia Mar 25, 2021
e4cfa11
Rewrite unit tests for `drop_list_duplicates` that can test for all c…
ttnghia Mar 25, 2021
98ec9b1
Revert "Rewrite unit tests for `drop_list_duplicates` that can test f…
ttnghia Mar 25, 2021
7a9f850
Revert "Rework `drop_list_duplicates` for the new `nan_equality` option"
ttnghia Mar 25, 2021
04b120d
Revert "Add an option to specify NaNs are compared equal only if they…
ttnghia Mar 25, 2021
0973be9
Fix typo
ttnghia Mar 25, 2021
fa035a6
Avoid initialize-then-assign
ttnghia Mar 30, 2021
8520f0d
Replace `thrust::any_of` by `thrust::count_if`
ttnghia Mar 30, 2021
49bfe13
Copy column by constructor
ttnghia Mar 30, 2021
3d50d8e
Merge remote-tracking branch 'origin/branch-0.19' into fix_nan_drop_l…
ttnghia Mar 30, 2021
27b5beb
Minor cleanup
ttnghia Mar 30, 2021
1406a25
Replace `is_null` by `is_null_nocheck`
ttnghia Mar 30, 2021
32b4393
Replace `make_numeric_column` by `device_uvector`
ttnghia Mar 30, 2021
b5af91e
Rewrite comments, and add a condition check for nans_equal == ALL_EQU…
ttnghia Mar 30, 2021
7812e05
Merge branch 'branch-0.19' into fix_nan_drop_list_duplicates
ttnghia Mar 30, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
136 changes: 114 additions & 22 deletions cpp/src/lists/drop_list_duplicates.cu
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,6 @@
#include <cudf/detail/nvtx/ranges.hpp>
#include <cudf/lists/detail/sorting.hpp>
#include <cudf/lists/drop_list_duplicates.hpp>
#include <cudf/table/row_operators.cuh>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/exec_policy.hpp>
Expand All @@ -35,6 +34,101 @@ namespace lists {
namespace detail {
namespace {
using offset_type = lists_column_view::offset_type;

/**
* @brief Performs an equality comparison between two entries in a lists column
*
* For the two elements that are in the same list in the lists column, they will always be
* considered as different. If they are from the same list and their type is one of floating
* point types, this functor will return the same comparison result as
* `cudf::element_equality_comparator`. For floating point types, entries holding NaN value will
* be considered as different.
*
* @tparam has_nulls Indicates the potential for null values in the column
* @tparam Type The data type of entries
*/
template <bool has_nulls, class Type>
ttnghia marked this conversation as resolved.
Show resolved Hide resolved
class list_entry_comparator {
public:
__host__ __device__ list_entry_comparator(offset_type const* list_offsets,
column_device_view d_view,
null_equality nulls_equal)
: list_offsets(list_offsets), d_view{d_view}, nulls_equal{nulls_equal}
{
}

__device__ bool operator()(size_type i, size_type j) const noexcept
{
// Two entries are not considered for equality if they belong to different lists
if (list_offsets[i] != list_offsets[j]) { return false; }

if (has_nulls) {
bool const nullable = d_view.nullable();
bool const lhs_is_null{nullable and d_view.is_null(i)};
bool const rhs_is_null{nullable and d_view.is_null(j)};
if (lhs_is_null and rhs_is_null) {
return nulls_equal == null_equality::EQUAL;
} else if (lhs_is_null != rhs_is_null) {
return false;
}
}

// For floating point types, if both element(i) and element(j) are NaNs then this comparison
// will return `false`. This is the desired behavior for `drop_list_duplicates`.
return d_view.element<Type>(i) == d_view.element<Type>(j);
}

private:
offset_type const* list_offsets;
column_device_view d_view;
null_equality nulls_equal;
};

/**
* @brief Construct type-dispatched function object for copying indices of the list entries
* ignoring duplicates
*/
class get_unique_entries_fn {
public:
template <class Type, std::enable_if_t<not cudf::is_equality_comparable<Type, Type>()>* = nullptr>
offset_type* operator()(offset_type const*,
column_device_view&,
size_type,
offset_type*,
bool,
null_equality,
rmm::cuda_stream_view) const
{
CUDF_FAIL("Cannot operate on types that are not equally comparable.");
}

template <class Type, std::enable_if_t<cudf::is_equality_comparable<Type, Type>()>* = nullptr>
offset_type* operator()(offset_type const* list_offsets,
column_device_view& d_view,
size_type num_entries,
offset_type* output_begin,
bool has_nulls,
null_equality nulls_equal,
rmm::cuda_stream_view stream) const noexcept
{
if (has_nulls) {
list_entry_comparator<true, Type> const comp{list_offsets, d_view, nulls_equal};
return thrust::unique_copy(rmm::exec_policy(stream),
thrust::make_counting_iterator(0),
thrust::make_counting_iterator(num_entries),
output_begin,
comp);
} else {
list_entry_comparator<false, Type> const comp{list_offsets, d_view, nulls_equal};
return thrust::unique_copy(rmm::exec_policy(stream),
thrust::make_counting_iterator(0),
thrust::make_counting_iterator(num_entries),
output_begin,
comp);
}
}
};

/**
* @brief Copy list entries and entry list offsets ignoring duplicates
*
Expand All @@ -51,36 +145,34 @@ using offset_type = lists_column_view::offset_type;
* @return A pair of columns, the first one contains unique list entries and the second one
* contains their corresponding list offsets
*/
template <bool has_nulls>
std::vector<std::unique_ptr<column>> get_unique_entries_and_list_offsets(
column_view const& all_lists_entries,
column_view const& entries_list_offsets,
null_equality nulls_equal,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr)
{
// Create an intermediate table, since the comparator only work on tables
auto const device_input_table =
cudf::table_device_view::create(table_view{{all_lists_entries}}, stream);
auto const comp = row_equality_comparator<has_nulls>(
*device_input_table, *device_input_table, nulls_equal == null_equality::EQUAL);
auto const num_entries = all_lists_entries.size();
auto const d_view_entries = column_device_view::create(all_lists_entries, stream);

auto const num_entries = all_lists_entries.size();
// Allocate memory to store the indices of the unique entries
auto const unique_indices = cudf::make_numeric_column(
entries_list_offsets.type(), num_entries, mask_state::UNALLOCATED, stream);
auto const unique_indices_begin = unique_indices->mutable_view().begin<offset_type>();

auto const copy_end = thrust::unique_copy(
rmm::exec_policy(stream),
thrust::make_counting_iterator(0),
thrust::make_counting_iterator(num_entries),
unique_indices_begin,
[list_offsets = entries_list_offsets.begin<offset_type>(), comp] __device__(auto i, auto j) {
return list_offsets[i] == list_offsets[j] && comp(i, j);
});
auto const copy_end = type_dispatcher(all_lists_entries.type(),
get_unique_entries_fn{},
entries_list_offsets.begin<offset_type>(),
*d_view_entries,
num_entries,
unique_indices_begin,
all_lists_entries.has_nulls(),
nulls_equal,
stream);

// Collect unique entries and entry list offsets
// The new null_count and bitmask of the unique entries will also be generated
// by the gather function
auto const indices = cudf::detail::slice(
unique_indices->view(), 0, thrust::distance(unique_indices_begin, copy_end));
return cudf::detail::gather(table_view{{all_lists_entries, entries_list_offsets}},
Expand Down Expand Up @@ -255,12 +347,8 @@ std::unique_ptr<column> drop_list_duplicates(lists_column_view const& lists_colu
detail::generate_entry_list_offsets(all_lists_entries.size(), lists_offsets->view(), stream);

// Copy non-duplicated entries (along with their list offsets) to new arrays
auto unique_entries_and_list_offsets =
all_lists_entries.has_nulls()
? detail::get_unique_entries_and_list_offsets<true>(
all_lists_entries, entries_list_offsets->view(), nulls_equal, stream, mr)
: detail::get_unique_entries_and_list_offsets<false>(
all_lists_entries, entries_list_offsets->view(), nulls_equal, stream, mr);
auto unique_entries_and_list_offsets = detail::get_unique_entries_and_list_offsets(
all_lists_entries, entries_list_offsets->view(), nulls_equal, stream, mr);

// Generate offsets for the new lists column
detail::generate_offsets(unique_entries_and_list_offsets.front()->size(),
Expand All @@ -269,6 +357,10 @@ std::unique_ptr<column> drop_list_duplicates(lists_column_view const& lists_colu
stream);

// Construct a new lists column without duplicated entries
// Reuse the null_count and bitmask of the lists_column: those are the null information for
// the list elements (rows)
// For the entries of those lists (rows), their null_count and bitmask were generated separately
// during the step `get_unique_entries_and_list_offsets` above
return make_lists_column(lists_column.size(),
std::move(lists_offsets),
std::move(unique_entries_and_list_offsets.front()),
Expand Down
Loading