Skip to content

feat: An equivalent of the polars coalesce option in joins #11336

Open
@mwiebusch78

Description

@mwiebusch78

Is your feature request related to a problem?

I find that joins tend to be more 'wordy' in ibis than they are in other APIs like polars or pyspark and I think it would be quite simple to change that.

What is the motivation behind your request?

It is a very common pattern to use left joins to add information to a table that's currently "under construction". Here is an example:

product_attributes = ... # columns product_id, title, description, size, color, ...
df = (
    order_items
    .filter(...)
    .group_by(...)
    .agg(...)
)
filtered_product_attributes = product_attributes.filter(...)
df = (
    df
    .join(product_attributes, ['product_id'], how='left')
    .select(
        df,
        filtered_product_attributes['title'],
        filtered_product_attributes['description'],
        filtered_product_attributes['size'],
        ...
    )
)

My two main issues with this are that

  1. To avoid ending up with a spurious product_id_right column I have to do a projection, which means I have to explicitly list all the columns of filtered_product_attributes that I want to keep.
  2. Both the left and the right table must be assigned to a python variable so that I can do the projection. This means I have to break the method chaining to do a join which (I find) makes the code less readable. It also means that I can't do the filtering on product_attributes in-place in the join because I need to refer to the filtered table again in the projection.

Describe the solution you'd like

I would like the code to look like this:

product_attributes = ... # columns product_id, title, description, size, color, ...
df = (
    order_items
    .filter(...)
    .group_by(...)
    .agg(...)
    .join(
        product_attributes.filter(...),
        ['product_id'],
        how='left',
        coalesce=True,
    )
)

where the effect of the coalesce option is to remove the _right version of the product_id column that we joined on. (The name is borrowed from the polars API.) This makes the projection unnecessary, which means I can filter product_attributes in-place and do not have to explicitly list the columns I want to keep. Both the polars and the pyspark APIs can automatically remove the duplicates of the columns the you joined on, and their joins are more concise and readable as a result.

What version of ibis are you running?

10.5.0

What backend(s) are you using, if any?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Labels

    featureFeatures or general enhancements

    Type

    No type

    Projects

    Status

    backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions