Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix performance issue and add a new code path to cudf::detail::contains #11330

Merged
merged 23 commits into from
Jul 22, 2022

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Jul 22, 2022

The current implementation of cudf::detail::contains can process input with arbitrary nested types. However, it was reported to have severe performance issue when the input tables have many duplicate rows (#11299). In order to fix the issue, #11310 and #11325 was created.

Unfortunately, #11310 is separating semi-anti-join from cudf::detail::contains, causing duplicate implementation. On the other hand, #11325 can address the issue #11299 but semi-anti-join using it still performs worse than the previous semi-anti-join implementation.

The changes in this PR include the following:

  • Fix the performance issue reported in [BUG] performance regression after semi_anti_join refactor #11299 for the current cudf::detail::contains implementation that support nested types.
  • Add a separate code path into cudf::detail::contains such that:
    • Input without having lists column (at any nested level) will be processed by the code path that is the same as the old implementation of semi-anti-join. This is to make sure the performance of semi-anti-join will remain the same as before.
    • Input with nested lists column, or NaNs compared as unequal, will be processed by another code path that supports nested types and different NaNs behavior. This will make sure support for nested types will not be dropped.

Closes #11299.

@ttnghia ttnghia added bug Something isn't working 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Jul 22, 2022
@ttnghia ttnghia added this to PR-WIP in v22.08 Release via automation Jul 22, 2022
@ttnghia ttnghia requested a review from a team as a code owner July 22, 2022 04:39
@ttnghia ttnghia self-assigned this Jul 22, 2022
@ttnghia ttnghia requested review from cwharris, mythrocks, PointKernel, abellina, bdice and hyperbolic2346 and removed request for cwharris July 22, 2022 04:39
@ttnghia ttnghia moved this from PR-WIP to PR-Needs review in v22.08 Release Jul 22, 2022
@codecov

This comment was marked as off-topic.

abellina added a commit to abellina/cudf that referenced this pull request Jul 22, 2022
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack looks like I refreshed during an update and my comments vanished. Will start again.

Comment on lines 49 to 61
template <typename Hasher>
struct strong_index_hasher_adapter {
strong_index_hasher_adapter(Hasher const& hasher) : _hasher{hasher} {}

template <typename T>
__device__ inline auto operator()(T const idx) const noexcept
{
return _hasher(static_cast<size_type>(idx));
}

private:
Hasher _hasher;
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tl;dr you have two options that allow you to switch to a struct with public members:

  1. Define a constructor
  2. Explicitly specify the dependent type when you construct the adapter, e.g. strong_index_hasher_adapter<HasherT> h{...}.

Removing the constructor requires C++20 support for improved CTAD. CTAD in C++17 does not include the aggregate initializer that exists by default for aggregate types such as this one, so doing strong_index_hasher_adapter h{HasherT{}} will not infer the dependent type for the strong_index_hasher_adapter instance. This behavior is improved in C++20 (see the C++20 note in the implicitly-generated deduction guides section of the CTAD documentation). Here's an example that shows that supplying the template parameters explicitly always works, but deduction only works if you set the compiler flag to -std=c++20.

Comment on lines 49 to 61
template <typename Hasher>
struct strong_index_hasher_adapter {
strong_index_hasher_adapter(Hasher const& hasher) : _hasher{hasher} {}

template <typename T>
__device__ inline auto operator()(T const idx) const noexcept
{
return _hasher(static_cast<size_type>(idx));
}

private:
Hasher _hasher;
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also yes, the hasher should be made const if it's public.

cpp/src/search/contains_table.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some very minor suggestions, but this looks good to me. Thanks for putting together a solution so quickly @ttnghia and @PointKernel! Fantastic work here.

cpp/src/search/contains_table.cu Show resolved Hide resolved
cpp/src/search/contains_table.cu Outdated Show resolved Hide resolved
@vyasr
Copy link
Contributor

vyasr commented Jul 22, 2022

@ttnghia I think we can close #11310 and #11325 based on our discussions, correct?

@ttnghia
Copy link
Contributor Author

ttnghia commented Jul 22, 2022

@ttnghia I think we can close #11310 and #11325 based on our discussions, correct?

Right, I'm closing them.

@abellina
Copy link
Contributor

abellina commented Jul 22, 2022

Overall I see this PR as having very minor effect on the time for the NDS benchmark where it is 4 seconds slower (or less than 1% impact). I am comparing this patch (applied on top of the original refactor PR) abellina@6919828 vs a baseline based on commit: abellina@9034b1b, which was before the original refactor work.

One of the queries, query80 was statistically significant with a difference in the means of ~700ms worse. I have taken an nsys trace of the baseline and the patched up runs of this query and it's not clear from this trace what the differences are. It is something that we should look at more closely, but we feel is minor enough.

template <typename T,
typename U,
CUDF_ENABLE_IF(is_strong_index_type<T>() && is_strong_index_type<U>())>
__device__ constexpr auto operator()(T const lhs_index, U const rhs_index) const noexcept
Copy link
Contributor

@bdice bdice Jul 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ttnghia Can you explain what types this functor expects to get? lhs_index_type, rhs_index_type and rhs_index_type, lhs_index_type or also things like lhs_index_type, lhs_index_type, rhs_index_type, rhs_index_type, and size_type, size_type? I want to simplify this but I don't fully understand the requirements you're aiming for here because the templates are fairly broad.

Copy link
Contributor Author

@ttnghia ttnghia Jul 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(size_type, size_type) will never be instantiated, because only strong index types are used in the code. The template function will not be enabled for this case.

Only (lhs_index_type, lhs_index_type) and (lhs_index_type, rhs_index_type) will be instantiated in this file, at this time and potentially (rhs_index_type, lhs_index_type) may be instantiated anytime in the future:

  • (lhs_index_type, lhs_index_type) will be used for self-comparing the haystack table, when inserting row indices of that table into the map.
  • (lhs_index_type, rhs_index_type) will be used for checking contains using cuco::static_map::contains. The internal implementation of cuco::static_map::contains at this moment only calls (lhs_index_type, rhs_index_type), but it doesn't guarantee to not call (rhs_index_type, lhs_index_type) in the future. Nevertheless, this function will flip the order to make sure the right indices will be passed into the underlying comparator.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use three explicit operator()s, one that addresses the self-comparison lhs_index_type, lhs_index_type case, and two that address the left/right (right/left) pairings? This template logic is quite complicated. I propose:

/**
 * @brief An adapter functor to support strong index type for table row comparator that must be
 * operating on `cudf::size_type`.
 */
template <typename Comparator>
struct strong_index_comparator_adapter {
  strong_index_comparator_adapter(Comparator const& comparator) : _comparator{comparator} {}

  __device__ constexpr auto operator()(lhs_index_type const left, lhs_index_type const right) const noexcept
  {
    return _comparator(static_cast<size_type>(left), static_cast<size_type>(right);
  }

  __device__ constexpr auto operator()(lhs_index_type const left, rhs_index_type const right) const noexcept
  {
    return _comparator(static_cast<size_type>(left), static_cast<size_type>(right);
  }

  __device__ constexpr auto operator()(rhs_index_type const right, lhs_index_type const left) const noexcept
  {
    // Reverse the order to left, right.
    return this->operator()(left, right);
  }

 private:
  Comparator const _comparator;
};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

Copy link
Contributor Author

@ttnghia ttnghia Jul 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, no. I have tried that. This will cause compiler warning: operator() defined but never referred.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like a good explanation to me! I couldn’t quite piece that together myself but I think it sounds accurate.

Copy link
Contributor Author

@ttnghia ttnghia Jul 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry that is just half correct. Here we have two separate code paths:

  • A code path uses exprimental row operators (two_table_comparator) that doesn't need this adapter (here), and
  • Another code path uses the traditional cudf::row_equality for comparing rows from two tables (here). This cudf::row_equality doesn't support strong index types thus it must be wrapped in this adapter as in the link.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why there are warnings is that: this adapter is a template struct, which means the compiler creates a separate version of when it is used. In this file, it is used 3 times thus there will be 3 different versions of this template struct will be created. For each version, only one overload of operator() is called (out of 3), causing warnings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that is helpful to know. Perhaps that indicates that we should split this struct into multiple structs which only have the functions used in each case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good catch @ttnghia, thanks for the correction.

v22.08 Release automation moved this from PR-Needs review to PR-Reviewer approved Jul 22, 2022
@ttnghia
Copy link
Contributor Author

ttnghia commented Jul 22, 2022

@gpucibot merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue Spark Functionality that helps Spark RAPIDS
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

[BUG] performance regression after semi_anti_join refactor
7 participants