Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check column type equality, handling nested types correctly. #14531

Merged
merged 39 commits into from
May 2, 2024
Merged
Show file tree
Hide file tree
Changes from 31 commits
Commits
Show all changes
39 commits
Select commit Hold shift + click to select a range
39d14c8
Check column types with a utility that handles nesting correctly.
bdice Mar 11, 2024
2e3979f
Merge remote-tracking branch 'upstream/branch-24.04' into nested-colu…
bdice Mar 14, 2024
57b3b50
Merge remote-tracking branch 'upstream/branch-24.06' into nested-colu…
bdice Mar 26, 2024
5f7f1f8
Add column_scalar_types_equal.
bdice Mar 26, 2024
d880838
Use column_scalar_types_equal.
bdice Mar 26, 2024
8b152dc
Fix bugs and update tests.
bdice Mar 26, 2024
5cf4769
Add scalar_types_equal.
bdice Mar 27, 2024
8d938bb
Fix tests.
bdice Mar 27, 2024
f7d8c41
Merge remote-tracking branch 'upstream/branch-24.06' into nested-colu…
bdice Mar 27, 2024
5430397
Add data_type_errors.
bdice Mar 27, 2024
8e02603
Fix tests.
bdice Mar 27, 2024
a1bbecb
Fix remaining type checks that should use new utilities.
bdice Mar 27, 2024
0de9f9b
Add docs section on comparing data types.
bdice Mar 27, 2024
fc7b821
Fix typo.
bdice Mar 27, 2024
55e0c9f
Merge remote-tracking branch 'upstream/branch-24.06' into nested-colu…
bdice Apr 1, 2024
faa92c8
Use cudf::is_fixed_point.
bdice Apr 1, 2024
6556266
Fix missing paren.
bdice Apr 1, 2024
6531a53
list column, not lists column
bdice Apr 4, 2024
454d413
Rename functions to be overloads of cudf::types_equal, update and exp…
bdice Apr 9, 2024
c91df10
Rename function.
bdice Apr 9, 2024
c742253
Rename to have_same_types, apply PR feedback.
bdice Apr 10, 2024
8a44004
Merge remote-tracking branch 'upstream/branch-24.06' into nested-colu…
bdice Apr 10, 2024
fce091f
Add back deprecated methods.
bdice Apr 12, 2024
cafdf6f
Fix verb/subject agreement.
bdice Apr 12, 2024
cda4e0d
Update docs for all_have_same_types.
bdice Apr 12, 2024
ec740c9
Un-deprecate column_types_equivalent.
bdice Apr 12, 2024
ec0eeff
Apply patch from Lawrence.
bdice Apr 12, 2024
5afc5e2
Use std::equal.
bdice Apr 12, 2024
18cdd52
Switch back to column_types_equivalent.
bdice Apr 12, 2024
7c05d95
Fix tests that had incorrect nested types.
bdice Apr 13, 2024
7ba2664
Merge branch 'nested-column-type-checks' of github.com:bdice/cudf int…
bdice Apr 13, 2024
9904431
Merge remote-tracking branch 'upstream/branch-24.06' into nested-colu…
bdice Apr 30, 2024
15d05e0
Fix clang-format.
bdice Apr 30, 2024
57c1f64
Move table_view overload from table_view.hpp to type_checks.hpp.
bdice May 1, 2024
25c5ad9
Mirror scalar/column call.
bdice May 1, 2024
d764f7e
Minor cleanup.
bdice May 1, 2024
e0732b9
Improve dispatching to existing overloads.
bdice May 1, 2024
fc0dada
Merge branch 'nested-column-type-checks' of github.com:bdice/cudf int…
bdice May 1, 2024
32444c5
Merge branch 'branch-24.06' into nested-column-type-checks
bdice May 1, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 11 additions & 4 deletions cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -921,13 +921,14 @@ Use the `CUDF_EXPECTS` macro to enforce runtime conditions necessary for correct
Example usage:

```c++
CUDF_EXPECTS(lhs.type() == rhs.type(), "Column type mismatch");
CUDF_EXPECTS(cudf::have_same_types(lhs, rhs), "Type mismatch", cudf::data_type_error);
```

The first argument is the conditional expression expected to resolve to `true` under normal
conditions. If the conditional evaluates to `false`, then an error has occurred and an instance of
`cudf::logic_error` is thrown. The second argument to `CUDF_EXPECTS` is a short description of the
error that has occurred and is used for the exception's `what()` message.
conditions. The second argument to `CUDF_EXPECTS` is a short description of the error that has
occurred and is used for the exception's `what()` message. If the conditional evaluates to
`false`, then an error has occurred and an instance of the exception class in the third argument
(or the default, `cudf::logic_error`) is thrown.

There are times where a particular code path, if reached, should indicate an error no matter what.
For example, often the `default` case of a `switch` statement represents an invalid alternative.
Expand Down Expand Up @@ -1026,6 +1027,12 @@ types such as numeric types and timestamps/durations, adding support for nested
Enabling an algorithm differently for different types uses either template specialization or SFINAE,
as discussed in [Specializing Type-Dispatched Code Paths](#specializing-type-dispatched-code-paths).

## Comparing Data Types

When comparing the data types of two columns or scalars, do not directly compare
`a.type() == b.type()`. Nested types such as lists of structs of integers will not be handled
properly if only the top level type is compared. Instead, use the `cudf::have_same_types` function.

# Type Dispatcher

libcudf stores data (for columns and scalars) "type erased" in `void*` device memory. This
Expand Down
9 changes: 6 additions & 3 deletions cpp/include/cudf/detail/scatter.cuh
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2020-2023, NVIDIA CORPORATION.
* Copyright (c) 2020-2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -29,7 +29,9 @@
#include <cudf/strings/detail/scatter.cuh>
#include <cudf/strings/string_view.cuh>
#include <cudf/utilities/default_stream.hpp>
#include <cudf/utilities/error.hpp>
#include <cudf/utilities/traits.hpp>
#include <cudf/utilities/type_checks.hpp>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/exec_policy.hpp>
Expand Down Expand Up @@ -212,8 +214,9 @@ struct column_scatterer_impl<dictionary32> {
// check the keys match
dictionary_column_view const source(source_in);
dictionary_column_view const target(target_in);
CUDF_EXPECTS(source.keys().type() == target.keys().type(),
"scatter dictionary keys must be the same type");
CUDF_EXPECTS(cudf::have_same_types(source.keys(), target.keys()),
"scatter dictionary keys must be the same type",
cudf::data_type_error);

// first combine keys so both dictionaries have the same set
auto target_matched = dictionary::detail::add_keys(target, source.keys(), stream, mr);
Expand Down
2 changes: 1 addition & 1 deletion cpp/include/cudf/lists/detail/scatter.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -100,7 +100,7 @@ std::unique_ptr<column> scatter_impl(rmm::device_uvector<unbound_list_view> cons
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr)
{
CUDF_EXPECTS(column_types_equal(source, target), "Mismatched column types.");
CUDF_EXPECTS(have_same_types(source, target), "Mismatched column types.");

auto const child_column_type = lists_column_view(target).child().type();

Expand Down
96 changes: 93 additions & 3 deletions cpp/include/cudf/utilities/type_checks.hpp
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2021-2023, NVIDIA CORPORATION.
* Copyright (c) 2021-2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand All @@ -16,11 +16,16 @@
#pragma once

#include <cudf/column/column_view.hpp>
#include <cudf/scalar/scalar.hpp>

#include <algorithm>

namespace cudf {

/**
* @brief Compares the type of two `column_view`s
* @brief Compare the types of two `column_view`s
*
* @deprecated Since 24.06. Use cudf::have_same_types instead.
*
* This function returns true if the type of `lhs` equals that of `rhs`.
* - For fixed point types, the scale is compared.
Expand All @@ -34,10 +39,11 @@ namespace cudf {
* @param rhs The second `column_view` to compare
* @return true if column types match
*/
bool column_types_equal(column_view const& lhs, column_view const& rhs);
[[deprecated]] bool column_types_equal(column_view const& lhs, column_view const& rhs);

/**
* @brief Compare the type IDs of two `column_view`s
*
* This function returns true if the type of `lhs` equals that of `rhs`.
* - For fixed point types, the scale is ignored.
*
Expand All @@ -47,4 +53,88 @@ bool column_types_equal(column_view const& lhs, column_view const& rhs);
*/
bool column_types_equivalent(column_view const& lhs, column_view const& rhs);

/**
* @brief Compares the type of two `column_view`s
*
* This function returns true if the type of `lhs` equals that of `rhs`.
* - For fixed point types, the scale is compared.
* - For dictionary types, the type of the keys are compared if both are
* non-empty columns.
* - For lists types, the type of child columns are compared recursively.
* - For struct types, the type of each field are compared in order.
* - For all other types, the `id` of `data_type` is compared.
*
* @param lhs The first `column_view` to compare
* @param rhs The second `column_view` to compare
* @return true if types match
*/
bool have_same_types(column_view const& lhs, column_view const& rhs);

/**
* @brief Compare the types of a `column_view` and a `scalar`
*
* This function returns true if the type of `lhs` equals that of `rhs`.
* - For fixed point types, the scale is compared.
* - For dictionary column types, the type of the keys is compared to the
* scalar type.
* - For lists types, the types of child columns are compared recursively.
* - For struct types, the types of each field are compared in order.
* - For all other types, the `id` of `data_type` is compared.
*
* @param lhs The `column_view` to compare
* @param rhs The `scalar` to compare
* @return true if types match
*/
bool have_same_types(column_view const& lhs, scalar const& rhs);

/**
* @brief Compare the types of a `scalar` and a `column_view`
*
* This function returns true if the type of `lhs` equals that of `rhs`.
* - For fixed point types, the scale is compared.
* - For dictionary column types, the type of the keys is compared to the
* scalar type.
* - For lists types, the types of child columns are compared recursively.
* - For struct types, the types of each field are compared in order.
* - For all other types, the `id` of `data_type` is compared.
*
* @param lhs The `scalar` to compare
* @param rhs The `column_view` to compare
* @return true if types match
*/
bool have_same_types(scalar const& lhs, column_view const& rhs);

/**
* @brief Compare the types of two `scalar`s
*
* This function returns true if the type of `lhs` equals that of `rhs`.
* - For fixed point types, the scale is compared.
* - For lists types, the types of child columns are compared recursively.
* - For struct types, the types of each field are compared in order.
* - For all other types, the `id` of `data_type` is compared.
*
* @param lhs The first `scalar` to compare
* @param rhs The second `scalar` to compare
* @return true if types match
*/
bool have_same_types(scalar const& lhs, scalar const& rhs);
bdice marked this conversation as resolved.
Show resolved Hide resolved

/**
* @brief Compare the types of a range of `column_view` or `scalar` objects
*
* This function returns true if all objects in the range have the same type, in the sense of cudf::have_same_types.
*
* @tparam ForwardIt Forward iterator
* @param first The first iterator
* @param last The last iterator
* @return true if all types match
*/
template <typename ForwardIt>
inline bool all_have_same_types(ForwardIt first, ForwardIt last)
{
return first == last || std::all_of(std::next(first), last, [want = *first](auto const& c) {
return cudf::have_same_types(want, c);
});
}

} // namespace cudf
11 changes: 5 additions & 6 deletions cpp/src/copying/concatenate.cu
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,8 @@
#include <cudf/table/table.hpp>
#include <cudf/table/table_device_view.cuh>
#include <cudf/utilities/default_stream.hpp>
#include <cudf/utilities/error.hpp>
#include <cudf/utilities/type_checks.hpp>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/exec_policy.hpp>
Expand Down Expand Up @@ -460,12 +462,9 @@ void traverse_children::operator()<cudf::list_view>(host_span<column_view const>
*/
void bounds_and_type_check(host_span<column_view const> cols, rmm::cuda_stream_view stream)
{
CUDF_EXPECTS(std::all_of(cols.begin(),
cols.end(),
[expected_type = cols.front().type()](auto const& c) {
return c.type() == expected_type;
}),
"Type mismatch in columns to concatenate.");
CUDF_EXPECTS(cudf::all_have_same_types(cols.begin(), cols.end()),
"Type mismatch in columns to concatenate.",
cudf::data_type_error);

// total size of all concatenated rows
size_t const total_row_count =
Expand Down
18 changes: 7 additions & 11 deletions cpp/src/copying/copy.cu
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@
#include <cudf/strings/string_view.cuh>
#include <cudf/utilities/default_stream.hpp>
#include <cudf/utilities/traits.hpp>
#include <cudf/utilities/type_checks.hpp>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>
Expand Down Expand Up @@ -361,9 +362,10 @@ std::unique_ptr<column> copy_if_else(column_view const& lhs,
CUDF_EXPECTS(boolean_mask.size() == lhs.size(),
"Boolean mask column must be the same size as lhs and rhs columns",
std::invalid_argument);
CUDF_EXPECTS(lhs.size() == rhs.size(), "Both columns must be of the size", std::invalid_argument);
CUDF_EXPECTS(
lhs.type() == rhs.type(), "Both inputs must be of the same type", cudf::data_type_error);
lhs.size() == rhs.size(), "Both columns must be of the same size", std::invalid_argument);
CUDF_EXPECTS(
cudf::have_same_types(lhs, rhs), "Both inputs must be of the same type", cudf::data_type_error);

return copy_if_else(lhs, rhs, lhs.has_nulls(), rhs.has_nulls(), boolean_mask, stream, mr);
}
Expand All @@ -377,11 +379,8 @@ std::unique_ptr<column> copy_if_else(scalar const& lhs,
CUDF_EXPECTS(boolean_mask.size() == rhs.size(),
"Boolean mask column must be the same size as rhs column",
std::invalid_argument);

auto rhs_type =
cudf::is_dictionary(rhs.type()) ? cudf::dictionary_column_view(rhs).keys_type() : rhs.type();
CUDF_EXPECTS(
lhs.type() == rhs_type, "Both inputs must be of the same type", cudf::data_type_error);
cudf::have_same_types(rhs, lhs), "Both inputs must be of the same type", cudf::data_type_error);

return copy_if_else(lhs, rhs, !lhs.is_valid(stream), rhs.has_nulls(), boolean_mask, stream, mr);
}
Expand All @@ -395,11 +394,8 @@ std::unique_ptr<column> copy_if_else(column_view const& lhs,
CUDF_EXPECTS(boolean_mask.size() == lhs.size(),
"Boolean mask column must be the same size as lhs column",
std::invalid_argument);

auto lhs_type =
cudf::is_dictionary(lhs.type()) ? cudf::dictionary_column_view(lhs).keys_type() : lhs.type();
CUDF_EXPECTS(
lhs_type == rhs.type(), "Both inputs must be of the same type", cudf::data_type_error);
cudf::have_same_types(lhs, rhs), "Both inputs must be of the same type", cudf::data_type_error);

return copy_if_else(lhs, rhs, lhs.has_nulls(), !rhs.is_valid(stream), boolean_mask, stream, mr);
}
Expand All @@ -411,7 +407,7 @@ std::unique_ptr<column> copy_if_else(scalar const& lhs,
rmm::mr::device_memory_resource* mr)
{
CUDF_EXPECTS(
lhs.type() == rhs.type(), "Both inputs must be of the same type", cudf::data_type_error);
cudf::have_same_types(lhs, rhs), "Both inputs must be of the same type", cudf::data_type_error);
return copy_if_else(
lhs, rhs, !lhs.is_valid(stream), !rhs.is_valid(stream), boolean_mask, stream, mr);
}
Expand Down
10 changes: 6 additions & 4 deletions cpp/src/copying/copy_range.cu
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,7 @@
#include <cudf/utilities/default_stream.hpp>
#include <cudf/utilities/error.hpp>
#include <cudf/utilities/traits.hpp>
#include <cudf/utilities/type_checks.hpp>

#include <rmm/cuda_stream_view.hpp>

Expand Down Expand Up @@ -146,8 +147,9 @@ std::unique_ptr<cudf::column> out_of_place_copy_range_dispatch::operator()<cudf:
// check the keys in the source and target
cudf::dictionary_column_view const dict_source(source);
cudf::dictionary_column_view const dict_target(target);
CUDF_EXPECTS(dict_source.keys().type() == dict_target.keys().type(),
"dictionary keys must be the same type");
CUDF_EXPECTS(cudf::have_same_types(dict_source.keys(), dict_target.keys()),
"dictionary keys must be the same type",
cudf::data_type_error);

// combine keys so both dictionaries have the same set
auto target_matched =
Expand Down Expand Up @@ -210,7 +212,7 @@ void copy_range_in_place(column_view const& source,
(target_begin <= target.size() - (source_end - source_begin)),
"Range is out of bounds.",
std::out_of_range);
CUDF_EXPECTS(target.type() == source.type(), "Data type mismatch.", cudf::data_type_error);
CUDF_EXPECTS(cudf::have_same_types(target, source), "Data type mismatch.", cudf::data_type_error);
CUDF_EXPECTS(target.nullable() || not source.has_nulls(),
"target should be nullable if source has null values.",
std::invalid_argument);
Expand Down Expand Up @@ -238,7 +240,7 @@ std::unique_ptr<column> copy_range(column_view const& source,
(target_begin <= target.size() - (source_end - source_begin)),
"Range is out of bounds.",
std::out_of_range);
CUDF_EXPECTS(target.type() == source.type(), "Data type mismatch.", cudf::data_type_error);
CUDF_EXPECTS(cudf::have_same_types(target, source), "Data type mismatch.", cudf::data_type_error);

return cudf::type_dispatcher<dispatch_storage_type>(
target.type(),
Expand Down
39 changes: 15 additions & 24 deletions cpp/src/copying/scatter.cu
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,8 @@
#include <cudf/structs/struct_view.hpp>
#include <cudf/table/table_device_view.cuh>
#include <cudf/utilities/default_stream.hpp>
#include <cudf/utilities/error.hpp>
#include <cudf/utilities/type_checks.hpp>

#include <rmm/cuda_stream_view.hpp>

Expand Down Expand Up @@ -111,7 +113,7 @@ struct column_scalar_scatterer_impl {
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr) const
{
CUDF_EXPECTS(source.get().type() == target.type(),
CUDF_EXPECTS(cudf::have_same_types(target, source.get()),
"scalar and column types must match",
cudf::data_type_error);

Expand Down Expand Up @@ -144,7 +146,7 @@ struct column_scalar_scatterer_impl<string_view, MapIterator> {
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr) const
{
CUDF_EXPECTS(source.get().type() == target.type(),
CUDF_EXPECTS(cudf::have_same_types(target, source.get()),
"scalar and column types must match",
cudf::data_type_error);

Expand Down Expand Up @@ -314,12 +316,7 @@ std::unique_ptr<table> scatter(table_view const& source,
CUDF_EXPECTS(scatter_map.size() <= source.num_rows(),
"Size of scatter map must be equal to or less than source rows",
std::invalid_argument);
CUDF_EXPECTS(std::equal(source.begin(),
source.end(),
target.begin(),
[](auto const& col1, auto const& col2) {
return col1.type().id() == col2.type().id();
}),
CUDF_EXPECTS(cudf::have_same_types(source, target),
"Column types do not match between source and target",
cudf::data_type_error);
CUDF_EXPECTS(not scatter_map.has_nulls(), "Scatter map contains nulls", std::invalid_argument);
Expand Down Expand Up @@ -451,14 +448,9 @@ std::unique_ptr<table> boolean_mask_scatter(table_view const& input,
"Mask must be of Boolean type",
cudf::data_type_error);
// Count valid pair of input and columns as per type at each column index i
CUDF_EXPECTS(
std::all_of(thrust::counting_iterator<size_type>(0),
thrust::counting_iterator<size_type>(target.num_columns()),
[&input, &target](auto index) {
return ((input.column(index).type().id()) == (target.column(index).type().id()));
}),
"Type mismatch in input column and target column",
cudf::data_type_error);
CUDF_EXPECTS(cudf::have_same_types(input, target),
"Type mismatch in input column and target column",
cudf::data_type_error);

if (target.num_rows() != 0) {
std::vector<std::unique_ptr<column>> out_columns(target.num_columns());
Expand Down Expand Up @@ -495,14 +487,13 @@ std::unique_ptr<table> boolean_mask_scatter(
cudf::data_type_error);

// Count valid pair of input and columns as per type at each column/scalar index i
CUDF_EXPECTS(
std::all_of(thrust::counting_iterator<size_type>(0),
thrust::counting_iterator<size_type>(target.num_columns()),
[&input, &target](auto index) {
return (input[index].get().type().id() == target.column(index).type().id());
}),
"Type mismatch in input scalar and target column",
cudf::data_type_error);
CUDF_EXPECTS(std::all_of(thrust::counting_iterator<size_type>(0),
thrust::counting_iterator<size_type>(target.num_columns()),
[&input, &target](auto index) {
return cudf::have_same_types(target.column(index), input[index].get());
}),
"Type mismatch in input scalar and target column",
cudf::data_type_error);

if (target.num_rows() != 0) {
std::vector<std::unique_ptr<column>> out_columns(target.num_columns());
Expand Down
Loading
Loading