Skip to content

feat: better api for creating aggregates over scalars #11288

Open
@NickCrews

Description

@NickCrews

Is your feature request related to a problem?

In SQL, you can create aggregates over scalars. However, in ibis, the ir.Scalar Values don't have any of the aggregation functions:

import ibis

ibis.options.interactive = True

# aggregating over constants works
sql = """
WITH t as (
    SELECT 1 AS a, 5 AS b UNION ALL
    SELECT 1 AS a, 7 AS b UNION ALL
    SELECT 2 AS a, 3 AS b
)
SELECT AVG(b) AS avg_b, AVG(NULL) as avg_null, AVG(43) AS avg_const
FROM t
GROUP BY a
"""
from_sql = ibis.duckdb.connect().sql(sql)
print(from_sql.execute())
# works

# You don't even need to perform the aggregation over a table/group/etc
sql = """
SELECT AVG(NULL) as avg_null, AVG(43) AS avg_const
"""
from_sql = ibis.duckdb.connect().sql(sql)
print(from_sql.execute())
# works


t = ibis.memtable({"a": [1, 1, 2], "b": [5, 7, 3]})
t.group_by("a").aggregate(
    avg_b=t.b.mean(),
    avg_null=ibis.null(float).mean(),  # AttributeError: 'FloatingScalar' object has no attribute 'mean'
    avg_const=ibis.literal(43).mean(),  # AttributeError: 'IntegerScalar' object has no attribute 'mean'
)

What is the motivation behind your request?

I have some dataclass like

@dataclass
class Address:
    street1: ir.StringValue
    street2: ir.StringValue
    city: ir.StringValue
    ...

which I use as an abstraction layer to smooth over the differences in schemas that my vaiours tables have. Then I have some factory method, where the street2 field is optional. If not given, it infers NULL:

def make_address(t: ibis.Table):
    if "street2" is not in t.columns
        street2 = ibis.null(str)
    else:
        street2 = t.street2
    ....
    return Address(...)

Then I want to be able to use these as

def canonicalize_addresses(t: ibis.Table):
    address = make_address(t)
    quality = rank_quality(address)
    aggs = {
         street1: address.street1.first(order_by=quality),
         street2: address.street2.first(order_by=quality),
         ....,
     },
     return t.group_by("address_id").agg(**aggs)

In other words, once I get the Address dataclass back, I want to be able to treat these as values from the relation. This abstraction works great except for when I have to perform aggregations like in the above.

Describe the solution you'd like

Several options

  1. Don't change anything. I will need to add in an extra .select() somehwere in all using code so that the ibis.null() is turned from a ir.Scalar to to a ir.Column.
  2. Add a as_column() method to Scalars, Columns, and Tables, similar to the existing .as_scalar() and .as_table() methods. Then I will have to explicitly call this function.
  3. Add aggregation methods to ir.Scalar objects. IDK if we should make it so that ibis.literal(42).mean().execute() gives an error (similar to current behavior) or perhaps we should do what SQL does, and have this result in 42.

What version of ibis are you running?

main

What backend(s) are you using, if any?

duckdb, but I think this should affect all SQL backends the same. I think duckdbs handling of SELECT AVG(42) should be the same in all backends?

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Labels

    featureFeatures or general enhancements

    Type

    No type

    Projects

    Status

    backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions