Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Join APIs that return gathermaps #7454

Merged
merged 183 commits into from Mar 30, 2021

Conversation

shwina
Copy link
Contributor

@shwina shwina commented Feb 25, 2021

Closes #6480

C++ changes

TL;DR

  • Adds join APIs that accept join keys and return gathermaps
  • Return type is a unique_ptr<rmm::device_uvector<size_type>>> (rather than a unique_ptr<column>), to accommodate join results that can be larger than INT32_MAX rows
  • Simplifies previous join APIs to not accept arguments relating to "common columns" -- instead, those APIs always return all the columns from the LHS/RHS. Users wanting finer control can use the gathermap-based APIs

The problem

The work in this PR was motivated by the need for simpler join APIs that give the user more flexibility in how they want to construct the result of a join. To explain the current problem, consider the inner_join API:

std::unique_ptr<cudf::table> inner_join(
  cudf::table_view const& left,
  cudf::table_view const& right,
  std::vector<cudf::size_type> const& left_on,
  std::vector<cudf::size_type> const& right_on,
  std::vector<std::pair<cudf::size_type, cudf::size_type>> const& columns_in_common,
  null_equality compare_nulls         = null_equality::EQUAL,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

In addition to the left and right tables (and corresponding keys), the API also accepts a columns_in_common argument. This is argument specifies pairs of columns from the LHS and RHS respectively, for which only a single column should appear in the result. That single column appears on the "left" side of the result. This makes the API somewhat complicated as well as inflexible.

There is a "lower-level" join API that gives more control on which side the "common" columns should go, by providing an additional common_columns_output_side argument:

  std::pair<std::unique_ptr<cudf::table>, std::unique_ptr<cudf::table>> inner_join(
    cudf::table_view const& probe,
    std::vector<size_type> const& probe_on,
    std::vector<std::pair<cudf::size_type, cudf::size_type>> const& columns_in_common,
    common_columns_output_side common_columns_output_side = common_columns_output_side::PROBE,
    null_equality compare_nulls                           = null_equality::EQUAL,
    rmm::cuda_stream_view stream                          = rmm::cuda_stream_default,
    rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()) const;

But even that offers only limited flexibility: for example, it doesn't allow the user to specify an arbitrary ordering of result columns, or omit columns altogether from the result.

Proposed API

The proposed API in this PR is:

std::pair<std::unique_ptr<rmm::device_uvector<size_type>>,
          std::unique_ptr<rmm::device_uvector<size_type>>>
inner_join(cudf::table_view const& left_keys,
           cudf::table_view const& right_keys,
           null_equality compare_nulls         = null_equality::EQUAL,
           rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

Note:

  • Rather than requiring the full left and right tables of the join, this API only needs the key columns from the left and right tables.
  • Rather than constructing the result of the join, this API returns the gathermaps which can be used to construct it.
  • For outer join, non-matches are represented by out-of-bound values in the gathermap. In conjunction with the out_of_bounds_policy::NULLIFY argument to gather, this will produce nulls in the appropriate locations of the result table.
  • The API returns a std::unique_ptr<rmm::device_uvector>> rather than just rmm::device_uvector because of a Cython limitation that prevents wrapping functions whose return types do not provide a nullary (default) constructor.
  • The use of rmm::device_uvector allows the API to return results of size > INT32_MAX, which can occur easily in outer joins.

Python changes

TL;DR

  • Add Cython bindings for the new C++ APIs
  • Rework join internals to interface with the new Cython APIs

Changes/Improvements

_Indexer

One major change introduced in the join internals is the use of a new type _Indexer to represent a key column.

Previously, join keys were represented by a numeric offset. This was for two reasons:

  • A join key could be either an index column or a data column, and the only way to refer to it unambiguously was by its offset -- a DataFrame can have an index column and a data column with the same name.
  • The C++ API required numeric offsets for the left_on and right_on arguments

_Indexer provides a more convenient way to construct and represent join keys by allowing one to refer unambiguosly to an index or data column of a Frame:

    # >>> df
    #    a
    # b
    # 4  1
    # 5  2
    # 6  3
    # >>> _Indexer("a", column=True).get(df)  # returns column "a" of df
    # >>> _Indexer("b", index=True).get(df)  # returns index level "b" of df

Casting logic

Some of the casting logic has been simplified since we no longer need to post-process (cast) the result returned by libcudf. Previously, we were accounting for "right" joins in our casting functions. But, since a right join is implemented in terms of a left join with the operands reversed, it turns out we never really needed to handle right joins separately. I have removed that and it simplifies casting logic further.

Others

  • Renamed casting_logic.py to _join_helpers.py and included other join utilities there.
  • Added a subclass of Merge for handling semi/anti joins
  • Added a assert_join_results_equal helper to compare join results between Pandas and cuDF. libcudf can return join results with arbitrary row ordering, and we weren't accounting for that in some of our tests previously. I'm a bit surprised we never ran into any test failures :)

@kkraus14
Copy link
Collaborator

@harrism can you review again when you get a chance? Then we'll merge!

v0.19 Release automation moved this from PR-Needs review to PR-Reviewer approved Mar 30, 2021
@harrism
Copy link
Member

harrism commented Mar 30, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 563edfa into rapidsai:branch-0.19 Mar 30, 2021
v0.19 Release automation moved this from PR-Reviewer approved to Done Mar 30, 2021
rapids-bot bot pushed a commit that referenced this pull request Mar 30, 2021
Adds Java bindings for the libcudf join APIs that return gather maps.  Depends upon #7454.

Authors:
  - Jason Lowe (@jlowe)

Approvers:
  - Robert (Bobby) Evans (@revans2)

URL: #7751
rapids-bot bot pushed a commit that referenced this pull request Apr 1, 2021
Adds support for equijoin on structs.

This PR is leveraging the [struct PR](#7422) and the [rewrite for join API](#7454). It enables equijoin on structs by flattening the struct for the hash calculation.

closes #7543

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Ashwin Srinath (https://github.com/shwina)
  - Karthikeyan (https://github.com/karthikeyann)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Alessandro Bellina (https://github.com/abellina)
  - Devavret Makkar (https://github.com/devavret)
  - David Wendt (https://github.com/davidwendt)
  - Liangcai Li (https://github.com/firestarman)
  - Paul Taylor (https://github.com/trxcllnt)
  - Kumar Aatish (https://github.com/kaatish)
  - Jason Lowe (https://github.com/jlowe)
  - Dillon Cullinan (https://github.com/dillon-cullinan)
  - Raza Jafri (https://github.com/razajafri)
  - https://github.com/rwlee
  - Michael Wang (https://github.com/isVoid)
  - Dante Gama Dessavre (https://github.com/dantegd)
  - Keith Kraus (https://github.com/kkraus14)
  - Robert Maynard (https://github.com/robertmaynard)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - https://github.com/ChrisJar
  - AJ Schmidt (https://github.com/ajschmidt8)
  - https://github.com/nvdbaranec
  - Nghia Truong (https://github.com/ttnghia)
  - https://github.com/chenrui17
  - Conor Hoekstra (https://github.com/codereport)
  - Mike Wendt (https://github.com/mike-wendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Devavret Makkar (https://github.com/devavret)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #7720
rapids-bot bot pushed a commit that referenced this pull request Oct 1, 2021
The `method` parameter to `DataFrame.join` and `DataFrame.merge` [isn't used internally](https://github.com/rapidsai/cudf/blob/e2098e56f0cb209b1d916ce617c04533444a056a/python/cudf/cudf/core/join/join.py#L90) after changes in #7454. This PR updates the docstrings and adds deprecation notices via `FutureWarning` as discussed in #9347.

The parameter is now deprecated in the public API. I removed all internal uses of the `method` parameter.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

URL: #9291
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Breaking change cuDF (Python) Affects Python cuDF API. improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code.
Projects
No open projects
v0.19 Release
  
Done
Development

Successfully merging this pull request may close these issues.

[FEA] Add a join API that returns gather maps rather than a table
9 participants