Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[QST] Should datatype equality comparisons traverse nested children? #14494

Closed
ZelboK opened this issue Nov 26, 2023 · 11 comments
Closed

[QST] Should datatype equality comparisons traverse nested children? #14494

ZelboK opened this issue Nov 26, 2023 · 11 comments
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. question Further information is requested

Comments

@ZelboK
Copy link
Contributor

ZelboK commented Nov 26, 2023

Describe the bug
I have a tendency of running mini "experiments" to better understand functionality behavior through tests. Found some unexpected behavior while writing a test that should (likely?) throw an exception.

EDIT (@wence-):

This doesn't raise because the top-level data type ids match (both the left and right columns are structs). So, question: should datatype equality traverse nested children?

END EDIT

Steps/Code to reproduce bug

TEST_F(JoinTest, AntiJoinPlayground)
{
  auto left_col0 = [] {
    column_wrapper<int32_t> child1{1, 5, 43, 3};
    column_wrapper<int32_t> child2{11, 12, 13, 4};
    column_wrapper<int32_t> child3{90, 812, 513, 54};
    return cudf::test::structs_column_wrapper{{child1, child2, child3}};
  }();
  auto right_col0 = [] {
    column_wrapper<int32_t> child1{1, 5, 43, 3};
    column_wrapper<int32_t> child2{11, 12, 13, 4};
    return cudf::test::structs_column_wrapper{{child1, child2}};
  }();
  auto left  = cudf::table_view{{left_col0}};
  auto right = cudf::table_view{{right_col0}};

  auto result      = cudf::left_anti_join(left, right);
  auto result_span = cudf::device_span<cudf::size_type const>{*result};
  auto result_col  = cudf::column_view{result_span};
  auto expected    = column_wrapper<cudf::size_type>{0, 1};
  CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result_col);
}

Expected behavior
If you run this test with left_join, it will return an SFINAE error

/home/kashimo/workspace/cudf/cpp/tests/join/join_tests.cpp:1652:63: error: no match for ‘operator*’ (operand type is ‘std::pair<std::unique_ptr<rmm::device_uvector<int> >, std::unique_ptr<rmm::device_uvector<int> > >’)
 1652 |   auto result_span = cudf::device_span<cudf::size_type const>{*result};
      |                                                               ^~~~~~~
/home/kashimo/workspace/cudf/cpp/tests/join/join_tests.cpp:1652:70: error: no matching function for call to ‘cudf::device_span<const int>::device_span(<brace-enclosed initializer list>)’
 1652 |   auto result_span = cudf::device_span<cudf::size_type const>{*result};
      |                                                                      ^
In file included from /home/kashimo/workspace/cudf/cpp/include/cudf/column/column_view.hpp:20,
                 from /home/kashimo/workspace/cudf/cpp/include/cudf/column/column.hpp:18,
                 from /home/kashimo/workspace/cudf/cpp/tests/join/join_tests.cpp:17:
/home/kashimo/workspace/cudf/cpp/include/cudf/utilities/span.hpp:316:13: note: candidate: ‘template<class OtherT, long unsigned int OtherExtent, std::enable_if_t<(((18446744073709551615 == OtherExtent) || (18446744073709551615 == cudf::dynamic_extent)) && is_convertible_v<OtherT (*)[], const int (*)[]>), void>* <anonymous> > constexpr cudf::device_span<T, Extent>::device_span(const cudf::device_span<OtherT, OtherExtent>&) [with OtherT = OtherT; long unsigned int OtherExtent = OtherExtent; std::enable_if_t<(((Extent == OtherExtent) || (Extent == cudf::dynamic_extent)) && is_convertible_v<OtherT (*)[], T (*)[]>), void>* <anonymous> = <anonymous>; T = const int; long unsigned int Extent = 18446744073709551615]’
  316 |   constexpr device_span(device_span<OtherT, OtherExtent> const& other) noexcept
      |             ^~~~~~~~~~~
/home/kashimo/workspace/cudf/cpp/include/cudf/utilities/span.hpp:316:13: note:   template argument deduction/substitution failed:
/home/kashimo/workspace/cudf/cpp/include/cudf/utilities/span.hpp:305:13: note: candidate: ‘template<class C, std::enable_if_t<(cudf::is_device_span_supported_container<C>::value && is_convertible_v<typename std::remove_pointer<decltype (thrust::raw_pointer_cast(declval<C&>().data()))>::type (*)[], const int (*)[]>), void>* <anonymous> > constexpr cudf::device_span<T, Extent>::device_span(const C&) [with C = C; std::enable_if_t<(cudf::is_device_span_supported_container<C>::value && is_convertible_v<typename std::remove_pointer<decltype (thrust::raw_pointer_cast(declval<C&>().data()))>::type (*)[], T (*)[]>)>* <anonymous> = <anonymous>; T = const int; long unsigned int Extent = 18446744073709551615]’
  305 |   constexpr device_span(C const& in) : base(thrust::raw_pointer_cast(in.data()), in.size())
      |             ^~~~~~~~~~~
/home/kashimo/workspace/cudf/cpp/include/cudf/utilities/span.hpp:305:13: note:   template argument deduction/substitution failed:
/home/kashimo/workspace/cudf/cpp/include/cudf/utilities/span.hpp:292:13: note: candidate: ‘template<class C, std::enable_if_t<(cudf::is_device_span_supported_container<C>::value && is_convertible_v<typename std::remove_pointer<decltype (thrust::raw_pointer_cast(declval<C&>().data()))>::type (*)[], const int (*)[]>), void>* <anonymous> > constexpr cudf::device_span<T, Extent>::device_span(C&) [with C = C; std::enable_if_t<(cudf::is_device_span_supported_container<C>::value && is_convertible_v<typename std::remove_pointer<decltype (thrust::raw_pointer_cast(declval<C&>().data()))>::type (*)[], T (*)[]>)>* <anonymous> = <anonymous>; T = const int; long unsigned int Extent = 18446744073709551615]’
  292 |   constexpr device_span(C& in) : base(thrust::raw_pointer_cast(in.data()), in.size())
      |             ^~~~~~~~~~~
/home/kashimo/workspace/cudf/cpp/include/cudf/utilities/span.hpp:292:13: note:   template argument deduction/substitution failed:
/home/kashimo/workspace/cudf/cpp/include/cudf/utilities/span.hpp:68:13: note: candidate: ‘constexpr cudf::detail::span_base<T, Extent, Derived>::span_base(cudf::detail::span_base<T, Extent, Derived>::pointer, cudf::detail::span_base<T, Extent, Derived>::size_type) [with T = const int; long unsigned int Extent = 18446744073709551615; Derived = cudf::device_span<const int>; cudf::detail::span_base<T, Extent, Derived>::pointer = const int*; cudf::detail::span_base<T, Extent, Derived>::size_type = long unsigned int]’
   68 |   constexpr span_base(pointer data, size_type size) : _data(data), _size(size) {}
      |             ^~~~~~~~~
/home/kashimo/workspace/cudf/cpp/include/cudf/utilities/span.hpp:279:15: note:   inherited here
  279 |   using base::base;
      |               ^~~~
/home/kashimo/workspace/cudf/cpp/include/cudf/utilities/span.hpp:279:15: note:   candidate expects 2 arguments, 1 provided
/home/kashimo/workspace/cudf/cpp/include/cudf/utilities/span.hpp:70:13: note: candidate: ‘constexpr cudf::detail::span_base<T, Extent, Derived>::span_base(const cudf::detail::span_base<T, Extent, Derived>&) [with T = const int; long unsigned int Extent = 18446744073709551615; Derived = cudf::device_span<const int>]’
   70 |   constexpr span_base(span_base const&) noexcept = default;  ///< Copy constructor
      |             ^~~~~~~~~
/home/kashimo/workspace/cudf/cpp/include/cudf/utilities/span.hpp:279:15: note:   inherited here
  279 |   using base::base;
      |               ^~~~
/home/kashimo/workspace/cudf/cpp/include/cudf/utilities/span.hpp:279:15: note:   an inherited constructor is not a candidate for initialization from an expression of the same or derived type
/home/kashimo/workspace/cudf/cpp/include/cudf/utilities/span.hpp:281:13: note: candidate: ‘constexpr cudf::device_span<T, Extent>::device_span() [with T = const int; long unsigned int Extent = 18446744073709551615]’
  281 |   constexpr device_span() noexcept : base() {}  // required to compile on centos
      |             ^~~~~~~~~~~
/home/kashimo/workspace/cudf/cpp/include/cudf/utilities/span.hpp:281:13: note:   candidate expects 0 arguments, 1 provided
/home/kashimo/workspace/cudf/cpp/include/cudf/utilities/span.hpp:277:8: note: candidate: ‘constexpr cudf::device_span<const int>::device_span(const cudf::device_span<const int>&)’
  277 | struct device_span : public cudf::detail::span_base<T, Extent, device_span<T, Extent>> {
      |        ^~~~~~~~~~~
/home/kashimo/workspace/cudf/cpp/include/cudf/utilities/span.hpp:277:8: note:   conversion of argument 1 would be ill-formed:
/home/kashimo/workspace/cudf/cpp/include/cudf/utilities/span.hpp:277:8: note: candidate: ‘constexpr cudf::device_span<const int>::device_span(cudf::device_span<const int>&&)’
/home/kashimo/workspace/cudf/cpp/include/cudf/utilities/span.hpp:277:8: note:   conversion of argument 1 would be ill-formed:
/home/kashimo/workspace/cudf/cpp/tests/join/join_tests.cpp:1653:51: error: no matching function for call to ‘cudf::column_view::column_view(<brace-enclosed initializer list>)’
 1653 |   auto result_col  = cudf::column_view{result_span};
      |                                                   ^
In file included from /home/kashimo/workspace/cudf/cpp/include/cudf/column/column.hpp:18,
                 from /home/kashimo/workspace/cudf/cpp/tests/join/join_tests.cpp:17:
/home/kashimo/workspace/cudf/cpp/include/cudf/column/column_view.hpp:420:3: note: candidate: template<class T, std::enable_if_t<(is_numeric<T>() || is_chrono<T>())>* <anonymous> > cudf::column_view::column_view(cudf::device_span<const T>)’
  420 |   column_view(device_span<T const> data)
      |   ^~~~~~~~~~~
/home/kashimo/workspace/cudf/cpp/include/cudf/column/column_view.hpp:420:3: note:   template argument deduction/substitution failed:
/home/kashimo/workspace/cudf/cpp/include/cudf/column/column_view.hpp:371:3: note: candidate: cudf::column_view::column_view(cudf::data_type, cudf::size_type, const void*, const bitmask_type*, cudf::size_type, cudf::size_type, const std::vector<cudf::column_view>&)’
  371 |   column_view(data_type type,
      |   ^~~~~~~~~~~
/home/kashimo/workspace/cudf/cpp/include/cudf/column/column_view.hpp:371:3: note:   candidate expects 7 arguments, 1 provided
/home/kashimo/workspace/cudf/cpp/include/cudf/column/column_view.hpp:331:3: note: candidate: cudf::column_view::column_view(cudf::column_view&&)’
  331 |   column_view(column_view&&)      = default;  ///< Move constructor
      |   ^~~~~~~~~~~
/home/kashimo/workspace/cudf/cpp/include/cudf/column/column_view.hpp:331:3: note:   conversion of argument 1 would be ill-formed:
/home/kashimo/workspace/cudf/cpp/include/cudf/column/column_view.hpp:330:3: note: candidate: cudf::column_view::column_view(const cudf::column_view&)’
  330 |   column_view(column_view const&) = default;  ///< Copy constructor
      |   ^~~~~~~~~~~
/home/kashimo/workspace/cudf/cpp/include/cudf/column/column_view.hpp:330:3: note:   conversion of argument 1 would be ill-formed:
/home/kashimo/workspace/cudf/cpp/include/cudf/column/column_view.hpp:315:3: note: candidate: constexpr cudf::column_view::column_view()’
  315 |   column_view() = default;
      |   ^~~~~~~~~~~
/home/kashimo/workspace/cudf/cpp/include/cudf/column/column_view.hpp:315:3: note:   candidate expects 0 arguments, 1 provided
gmake[2]: *** [tests/CMakeFiles/JOIN_TEST.dir/build.make:76: tests/CMakeFiles/JOIN_TEST.dir/join/join_tests.cpp.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:1445: tests/CMakeFiles/JOIN_TEST.dir/all] Error 2
gmake: *** [Makefile:166: all] Error 2

When I get more time I can look into this more deeply. For now just wanted to mention.

@ZelboK ZelboK added Needs Triage Need team to review and classify bug Something isn't working labels Nov 26, 2023
@wence-
Copy link
Contributor

wence- commented Nov 27, 2023

The compile-time failure with left_join is because left_join returns a std::pair of maps (device_uvector<size_type>s) that gather from the left and right tables respectively. Whereas left_anti_join only returns a single map, since the return value gathers data from the left table that doesn't exist in the right table (so there's no way of returning something that gathers from the right table). (see https://docs.rapids.ai/api/libcudf/nightly/group__column__join#ga6067d4dd5851b2c9f7ebe616e52fae24)

Since the columns and tables are type-erased, the compiler cannot complain about mismatching join keys: it cannot see that they are mismatching. Is the problem you observe that the test you wrote with left_anti_join does not complain at runtime? Or were you expecting a compile-time failure?

@bdice
Copy link
Contributor

bdice commented Nov 27, 2023

@ZelboK Can you try and see if the left_join case fails at runtime, by deleting everything after auto result = cudf::left_join(left, right); ? I think that should compile -- but I think I expect a failure at runtime due to mismatching join keys. If that's not the case, we should figure out what the result has in it for left_join. You might need [[maybe_unused]] since result isn't used.

@ZelboK
Copy link
Contributor Author

ZelboK commented Nov 27, 2023

@wence-
Yeah sorry I should have clarified better in my original post. I was expecting a run time failure. The test I wrote above just passes normally though without any runtime failures whatsoever.

@bdice

TEST_F(JoinTest, AntiJoinPlaygroundFailing)
{
  auto left_col0 = [] {
    column_wrapper<int32_t> child1{1, 5, 43, 3};
    column_wrapper<int32_t> child2{11, 12, 13, 4};
    column_wrapper<int32_t> child3{90, 812, 513, 54};
    return cudf::test::structs_column_wrapper{{child1, child2, child3}};
  }();
  auto right_col0 = [] {
    column_wrapper<int32_t> child1{1, 5, 43, 3};
    column_wrapper<int32_t> child2{11, 12, 13, 4};
    return cudf::test::structs_column_wrapper{{child1, child2}};
  }();
  auto left  = cudf::table_view{{left_col0}};
  auto right = cudf::table_view{{right_col0}};

  [[maybe_unused]] auto result = cudf::left_join(left, right);
}

Does not actually run into any runtime failures.

(If it matters) I generally run specific tests this way ./JOIN_TEST --gtest_filter=JoinTest.AntiJoinPlaygroundFailing. Lmk if I am misunderstanding anything, when I come back later tonight I should be able to help.

@wence-
Copy link
Contributor

wence- commented Nov 28, 2023

tl;dr: I think this is not raising an error by design, but I am not sure of the history of that design, or whether it is, in fact, desirable behaviour.

For the left_join case, type checking occurs in compute_hash_join in hash_join.cu:

CUDF_EXPECTS(0 != probe.num_columns(), "Hash join probe table is empty");
CUDF_EXPECTS(_build.num_columns() == probe.num_columns(),
"Mismatch in number of columns to be joined on");
CUDF_EXPECTS(_has_nulls || !cudf::has_nested_nulls(probe),
"Probe table has nulls while build table was not hashed with null check.");
if (is_trivial_join(probe, _build, join)) {
return std::pair(std::make_unique<rmm::device_uvector<size_type>>(0, stream, mr),
std::make_unique<rmm::device_uvector<size_type>>(0, stream, mr));
}
CUDF_EXPECTS(std::equal(std::cbegin(_build),
std::cend(_build),
std::cbegin(probe),
std::cend(probe),
[](auto const& b, auto const& p) { return b.type() == p.type(); }),
"Mismatch in joining column data types");
return probe_join_indices(probe, join, output_size, stream, mr);

So, if the join is "trivial" (for example either table is empty for an inner join) we immediately return a null result. Otherwise we check if the type() of the columns being joined on is equal. That's cudf::data_type::operator== which does:

constexpr bool operator==(data_type const& lhs, data_type const& rhs)
{
  // use std::tie in the future, breaks JITIFY currently
  return lhs.id() == rhs.id() && lhs.scale() == rhs.scale();
}

The scale only has any bearing on the result if using a decimal type (which we're not doing here). So two data_types compare equal if their top-level type-id is identical. I note a TODO note in the docstring of that equality comparator:

 * // TODO Define exactly what it means for two `data_type`s to be equal. e.g.,
 * are two timestamps with different resolutions equal? How about decimals with
 * different scale/precision?

In any case, the current implementation just looks at the top level type id, and so a column of List(Float) will compare equal, at the type level, to List(Int) even though Float != Int. So in your case you have a struct(int, int, int) and a struct(int, int). But if we just look at struct those look the same.

The story with the semi-joins is the same, though the check happens elsewhere. Since those joins are wrappers around cudf::detail::contains we check on line 190 of contains_table.cu:

  CUDF_EXPECTS(cudf::have_same_types(haystack, needles), "Column types mismatch");

where have_same_types does the same "top-level" type comparison.

The way the nested types are attached to columns, if you just have the data_type object, you can't distinguish different nestings since the data_type definition is:

data DataType = List | Struct | FixedWidth | ...

Rather than

data DataType a = List a | Struct a | FixedWidth | ...

The "child" datatypes are only available by traversing the children of a column with a Struct datatype (say), so you could implement type-checking equality if you had two column views:

bool has_equal_types_nested(column_view const& lhs, column_view const& rhs)
{
  return lhs.type() == rhs.type() 
    and std::equal(lhs.child_begin(), lhs.child_end(), rhs.child_begin(), rhs.child_end(), 
                   [](auto const& left, auto const& right) { return has_equal_types_nested(left, right); });
}

@wence-
Copy link
Contributor

wence- commented Nov 28, 2023

So, if the join is "trivial" (for example either table is empty for an inner join) we immediately return a null result.

Aside, this also means that mismatched types can get through and produce a result if a table is empty where, were it not, we would get a runtime exception being raised.

@ZelboK
Copy link
Contributor Author

ZelboK commented Nov 29, 2023

Thank you for the explanations! I appreciate it :)

I personally believe it would be inappropriate for me to comment on whether or not this design is problematic because I am far too new to libcudf to visualize how much effort is required, if this is problematic, among other things. I mentioned it to @bdice and figured it was worth mentioning in an issue as well. I want to make it clear, I'm not actually blocked by anything. I like to write tests to better understand functionality and encountered this case on the way.

@wence- wence- added question Further information is requested libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Nov 29, 2023
@wence- wence- changed the title [BUG] Left anti join not returning error when struct columns with different types are used. [QST] Should datatype equality comparisons traverse nested children Nov 29, 2023
@wence- wence- changed the title [QST] Should datatype equality comparisons traverse nested children [QST] Should datatype equality comparisons traverse nested children? Nov 29, 2023
@wence-
Copy link
Contributor

wence- commented Nov 29, 2023

Thanks, I've made some small edits to the issue to pose it as a question.

@bdice
Copy link
Contributor

bdice commented Nov 29, 2023

This doesn't raise because the top-level data type ids match (both the left and right columns are structs). So, question: should datatype equality traverse nested children?

Yes, this is a bug in my view. The design probably predates nested type support. Here, the docs clearly state that types must match (though in a "detail" method):

* @throw cudf::logic_error if the column data types in build table and probe table do not match.

The bug is here, because this step doesn't look at nested types:

CUDF_EXPECTS(std::equal(std::cbegin(_build),
std::cend(_build),
std::cbegin(probe),
std::cend(probe),
[](auto const& b, auto const& p) { return b.type() == p.type(); }),
"Mismatch in joining column data types");

There is a file full of utility functions (https://github.com/rapidsai/cudf/blob/branch-24.02/cpp/src/utilities/type_checks.cpp) that can be used for this purpose.

@bdice
Copy link
Contributor

bdice commented Nov 30, 2023

@ZelboK I started a PR with fixes for this behavior in #14531. Thanks for your help identifying this!

@ZelboK
Copy link
Contributor Author

ZelboK commented Nov 30, 2023

@ZelboK I started a PR with fixes for this behavior in #14531. Thanks for your help identifying this!

Wow, that was insanely quick. I'm glad I was of some help :D

rapids-bot bot pushed a commit that referenced this issue May 2, 2024
Addresses most of #14527. See also #14494.

This PR expands the use of `cudf::column_types_equal(lhs, rhs)` and adds new methods `cudf::column_scalar_types_equal`, `cudf::scalar_types_equal`, and `cudf::all_column_types_equal`.

These type check functions are now employed throughout the code base instead of raw checks like `a.type() == b.type()` because those do not correctly handle nested types.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Lawrence Mitchell (https://github.com/wence-)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #14531
@bdice
Copy link
Contributor

bdice commented May 6, 2024

I reviewed the code base and updated the function cudf::have_same_types to address this in #14531. Also, I updated the developer guide to indicate that developers should use cudf::have_same_types(lhs, rhs) instead of lhs.type() == rhs.type(): https://github.com/rapidsai/cudf/blob/branch-24.06/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md#comparing-data-types

@bdice bdice closed this as completed May 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. question Further information is requested
Projects
Archived in project
Status: No status
Development

No branches or pull requests

3 participants