Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] collect_set on a LIST<LIST<INT32>> with all null entries returns the wrong result type #14924

Closed
jlowe opened this issue Jan 29, 2024 · 0 comments · Fixed by #15243
Closed
Assignees
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@jlowe
Copy link
Member

jlowe commented Jan 29, 2024

Describe the bug
Performing a collect_set aggregation on a lists column of lists of ints returns the wrong resulting scalar type when the input lists column contains all nulls. Instead of returning a LIST<LIST<INT32>> it returns a LIST<INT32>. When there are non-null entries in the input it properly returns LIST<LIST<INT32>>.

Steps/Code to reproduce bug
Compile and run the following program.

#include <cudf_test/debug_utilities.hpp>
#include <cudf_test/column_wrapper.hpp>
#include <cudf/aggregation.hpp>
#include <cudf/reduction.hpp>
#include <cudf/scalar/scalar.hpp>
#include <iostream>

using LCW = cudf::test::lists_column_wrapper<int>;

void debug(cudf::lists_column_view const& lists) {
  std::cout << "Lists size: " << lists.size() << " null count: " << lists.null_count() << std::endl;
  auto child = lists.child();
  int id = static_cast<int>(child.type().id());
  std::cout << "Lists child type: " << id << " size: " << child.size() << " null count: " << child.null_count() << std::endl;
  if (child.type().id() == cudf::type_id::LIST) {
    child = cudf::lists_column_view(child).child();
    id = static_cast<int>(child.type().id());
    std::cout << "Lists child child type: " << id << " size: " << child.size() << " null count: " << child.null_count() << std::endl;
  }
}

void collect_set(cudf::lists_column_view const& lists) {
  std::cout << "Before reduction:" << std::endl;
  debug(lists);
  auto agg = cudf::make_collect_set_aggregation<cudf::reduce_aggregation>(cudf::null_policy::EXCLUDE, cudf::null_equality::UNEQUAL, cudf::nan_equality::ALL_EQUAL);
  auto r = cudf::reduce(lists.parent(), *agg, cudf::data_type{cudf::type_id::LIST});
  std::cout << "After reduction:" << std::endl;
  auto rl = dynamic_cast<cudf::list_scalar*>(r.get());
  debug(rl->view());
}

int main(int argc, char** argv) {
  auto valids = cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i != 0; });
  cudf::lists_column_view lists = cudf::test::lists_column_wrapper<int32_t, int32_t>{ {LCW{{1, 2, 3}}, LCW{{1, 2, 3}}}, valids };
  std::cout << "=== With null list and non-null list ===" << std::endl;
  collect_set(lists);
  std::cout << "=== With only null list ===" << std::endl;
  lists = cudf::test::lists_column_wrapper<int32_t, int32_t>{ {LCW{{1, 2, 3}}}, valids };
  collect_set(lists);
  return 0;
}

This produces the following output:

Before reduction:
Lists size: 2 null count: 1
Lists child type: 24 size: 1 null count: 0
Lists child child type: 3 size: 3 null count: 0
After reduction:
Lists size: 0 null count: 0
Lists child type: 24 size: 0 null count: 0
Lists child child type: 3 size: 0 null count: 0
=== With only null list ===
Before reduction:
Lists size: 1 null count: 1
Lists child type: 24 size: 0 null count: 0
Lists child child type: 3 size: 0 null count: 0
After reduction:
Lists size: 0 null count: 0
Lists child type: 3 size: 0 null count: 0

Note how the result types change based on whether the input has all nulls or not:

After reduction:
Lists size: 0 null count: 0
Lists child type: 24 size: 0 null count: 0
Lists child child type: 3 size: 0 null count: 0

vs.

After reduction:
Lists size: 0 null count: 0
Lists child type: 3 size: 0 null count: 0

Expected behavior
The collect_set aggregation should consistently return the same output type for a given input type, regardless of the input data for that type. In this specific case, it should have returned LIST<LIST<INT32>> instead of LIST<INT32>.

@jlowe jlowe added bug Something isn't working Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS labels Jan 29, 2024
@ttnghia ttnghia self-assigned this Mar 2, 2024
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024
rapids-bot bot pushed a commit that referenced this issue Mar 13, 2024
…5243)

This fixes a bug in the reduction code that shows up specifically in `collect_list`/`collect_set` of lists column. In particular, the output of these reduction ops should be a list scalar holding a column that has exactly the same type structure as the input. However, when the input column contains all nulls, the output list scalar holds an empty column having wrong type structure.

Closes #14924.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Bradley Dice (https://github.com/bdice)

URL: #15243
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
Archived in project
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants