[BUG] Left joins on struct key producing incorrect null results for right table #13109

Closed
jlowe opened this issue Apr 10, 2023 · 0 comments · Fixed by #13120
Labels: 2 - In Progress, bug, libcudf, Spark


jlowe commented Apr 10, 2023

Describe the bug
After #12787 some RAPIDS Accelerator tests started failing for joins on struct keys. See NVIDIA/spark-rapids#8061. Examining the expected vs. actual data, there are nulls in the GPU result that are not expected, implying the row comparison is not working properly in some cases.

Oddly, an inner join on the same data seems to do the right thing.

Steps/Code to reproduce bug
I'm attaching two Parquet files, left and rightnotnull, which are the inputs to the following test. There should be 4 rows that have matches in the join results, but the GPU produces zero rows that have matches instead, so the right gather map (incorrectly) marks every row as null.

left.gz
rightnotnull.gz

#include <cstddef>
#include <iostream>

#include <cudf/io/parquet.hpp>
#include <cudf/join.hpp>
#include <cudf/table/table_view.hpp>
#include <rmm/cuda_stream_view.hpp>

int main(int argc, char** argv) {
  // Read the two attached Parquet inputs.
  auto left_table = cudf::io::read_parquet(
    cudf::io::parquet_reader_options::builder(cudf::io::source_info("left")));
  auto right_table = cudf::io::read_parquet(
    cudf::io::parquet_reader_options::builder(cudf::io::source_info("rightnotnull")));

  // Join on the first column of each table.
  auto left_keys = cudf::table_view({left_table.tbl->get_column(0)});
  auto right_keys = cudf::table_view({right_table.tbl->get_column(0)});
  auto [gather_left, gather_right] = cudf::left_join(left_keys, right_keys);

  // Print each gather map, copying one element at a time back to the host.
  std::cout << "LEFT GATHER MAP:" << std::endl;
  for (std::size_t i = 0; i < gather_left->size(); i++) {
    std::cout << i << ": " << gather_left->element(i, rmm::cuda_stream_default) << std::endl;
  }
  std::cout << "RIGHT GATHER MAP:" << std::endl;
  for (std::size_t i = 0; i < gather_right->size(); i++) {
    std::cout << i << ": " << gather_right->element(i, rmm::cuda_stream_default) << std::endl;
  }
  return 0;
}

Expected behavior
Instead of producing all null results (i.e., minint for every entry in the right gather map), the right gather map should contain four non-null matches. Change the call to an inner join to see the expected matches.
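One way to check the expectation above without reading the printed maps by eye is to count the non-null entries in a host-side copy of the right gather map. The following is only a sketch using the standard library: the names `kNoMatch`, `count_matched_rows`, and `check_match_count` are hypothetical, and it assumes the "no match" marker is the minimum 32-bit int, as the "minint" description suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Assumption: unmatched left rows appear in the right gather map as the
// minimum 32-bit int ("minint"). The constant name is hypothetical.
constexpr std::int32_t kNoMatch = std::numeric_limits<std::int32_t>::min();

// Count the entries of a host-side right gather map that point at a real right row.
std::size_t count_matched_rows(std::vector<std::int32_t> const& right_map) {
  std::size_t matched = 0;
  for (auto const idx : right_map) {
    if (idx != kNoMatch) { ++matched; }
  }
  return matched;
}

// Fails loudly (in builds with assertions enabled) if the join found an unexpected number of matches.
void check_match_count(std::vector<std::int32_t> const& right_map, std::size_t expected) {
  assert(count_matched_rows(right_map) == expected);
}
```

For the attached data, the expected count would be 4, whereas the buggy left join yields 0.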

@jlowe added the bug, Needs Triage, libcudf, and Spark labels Apr 10, 2023
@GregoryKimball added the 2 - In Progress label and removed the Needs Triage label Apr 10, 2023
@ttnghia self-assigned this Apr 10, 2023
rapids-bot bot pushed a commit that referenced this issue Apr 13, 2023
This is very similar to #11284, which fixes a bug when only one input table has nulls while the other doesn't. This is due to the new experimental hasher producing different hash values depending on an input flag `has_nulls`. In order to properly use it, `has_nulls` must be computed by checking all the possible input tables, or set to a constant value (`true`).

Closes:
 * #13109

Authors:
  - Nghia Truong (https://github.com/ttnghia)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Divye Gala (https://github.com/divyegala)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #13120
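
The failure mode described in the commit message can be illustrated with a toy model. This is a sketch only and not libcudf's hasher: `toy_hash`, `any_null`, and `count_probe_hits` are hypothetical names, and the hash mixing is an arbitrary FNV-style choice. It assumes only what the message states: the hash value depends on a `has_nulls` flag, so both tables must be hashed with the same flag value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

using Key = std::optional<std::int32_t>;  // std::nullopt stands in for a null key

// Hypothetical toy hasher, NOT libcudf's: when has_nulls is true it first mixes in a
// validity tag, so a valid key hashes differently under the two flag values.
std::uint64_t toy_hash(Key const& key, bool has_nulls) {
  constexpr std::uint64_t prime = 1099511628211ull;
  std::uint64_t h = 1469598103934665603ull;
  if (has_nulls) {
    h = (h ^ (key.has_value() ? 1u : 0u)) * prime;
    if (!key.has_value()) { return h; }
  }
  assert(key.has_value());  // the null-unaware path must never see a null
  return (h ^ static_cast<std::uint32_t>(*key)) * prime;
}

// True if any key in the table is null.
bool any_null(std::vector<Key> const& keys) {
  for (auto const& k : keys) {
    if (!k.has_value()) { return true; }
  }
  return false;
}

// Count left keys that find an equal right key with an equal hash (nulls never match here).
std::size_t count_probe_hits(std::vector<Key> const& left, bool left_flag,
                             std::vector<Key> const& right, bool right_flag) {
  std::size_t hits = 0;
  for (auto const& l : left) {
    if (!l.has_value()) { continue; }
    for (auto const& r : right) {
      if (r.has_value() && *l == *r && toy_hash(l, left_flag) == toy_hash(r, right_flag)) {
        ++hits;
        break;
      }
    }
  }
  return hits;
}
```

In this toy, with left keys {1, null, 3} and right keys {1, 3}, computing the flag per table (true for left, false for right) gives no hits even though two keys are equal. Computing it over both tables, or passing a constant `true`, restores both hits.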