
Pushdown compare with NaN broken #3958

@a10y


We implement our binary compare operation using the arrow_ord kernels. These use the f32/f64 total_cmp ordering, which sorts a positive-sign NaN above every other value, including +infinity, and treats NaN as equal to itself.

This causes inconsistent behavior when pushing down a comparison operation against NaN to a file column that contains NaN.

Directly evaluating the following comparison results in [false, false, true, true]:

PrimitiveArray([1.0, 2.0, NaN, NaN]) >= f32::NAN
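To make the ordering concrete, here is a minimal std-only sketch of the same comparison. It does not go through the arrow_ord kernels; `ge_total_order`, `ge_ieee`, and `positive_nan` are hypothetical helpers written for this illustration. The NaN sign bit is cleared explicitly because `total_cmp` sorts negative-sign NaNs below every other value.

```rust
use std::cmp::Ordering;

/// Evaluates `value >= rhs` per element under the `total_cmp` total order.
fn ge_total_order(values: &[f32], rhs: f32) -> Vec<bool> {
    values
        .iter()
        .map(|v| v.total_cmp(&rhs) != Ordering::Less)
        .collect()
}

/// Evaluates `value >= rhs` per element with the IEEE 754 `>=` operator,
/// under which any comparison involving NaN is false.
fn ge_ieee(values: &[f32], rhs: f32) -> Vec<bool> {
    values.iter().map(|v| *v >= rhs).collect()
}

/// A NaN with the sign bit cleared, so that `total_cmp` sorts it above +infinity.
fn positive_nan() -> f32 {
    f32::NAN.copysign(1.0)
}
```

With `[1.0, 2.0, NaN, NaN]` and a NaN right-hand side, `ge_total_order` yields the result quoted above, while `ge_ieee` yields all false.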

However, this is not the case when the filter is pushed down into a scan, because our min/max stats do not consider NaN. The pruning expression that gets pushed down is

$.max < f32::NAN --> 2.0 < f32::NAN = true

Under total_cmp this evaluates to true, so all of the rows get pruned and the result becomes [false, false, false, false].
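The mismatch can be reproduced with a small std-only sketch. This is not the actual scan code: `max_ignoring_nan`, `can_prune_ge`, and `scan_ge` are hypothetical names, and the pruning rule is only the `$.max < rhs` check described above, assumed to be evaluated with `total_cmp`.

```rust
use std::cmp::Ordering;

/// Maximum of the values with NaNs skipped, mirroring stats that ignore NaN.
/// Returns `None` when there are no non-NaN values.
fn max_ignoring_nan(values: &[f32]) -> Option<f32> {
    values
        .iter()
        .copied()
        .filter(|v| !v.is_nan())
        .max_by(|a, b| a.total_cmp(b))
}

/// Pruning rule for the filter `col >= rhs`: skip the rows when `max < rhs`
/// in the total order.
fn can_prune_ge(max: Option<f32>, rhs: f32) -> bool {
    match max {
        Some(m) => m.total_cmp(&rhs) == Ordering::Less,
        None => false, // no stat available: keep the rows
    }
}

/// Result of `col >= rhs` when the scan consults the max stat first.
fn scan_ge(values: &[f32], rhs: f32) -> Vec<bool> {
    if can_prune_ge(max_ignoring_nan(values), rhs) {
        return vec![false; values.len()];
    }
    values
        .iter()
        .map(|v| v.total_cmp(&rhs) != Ordering::Less)
        .collect()
}
```

For `[1.0, 2.0, NaN, NaN]` the stat is `Some(2.0)`, the pruning check passes against a positive-sign NaN, and `scan_ge` returns all false even though two rows match under direct evaluation.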

The stats exclude NaN in compute_min_max:

fn compute_min_max<'a, T>(iter: impl Iterator<Item = &'a T>, dtype: &DType) -> Option<MinMaxResult>
where
    T: Into<ScalarValue> + NativePType,
{
    // `total_compare` function provides a total ordering (even for NaN values).
    // However, we exclude NaNs from min max as they're not useful for any purpose where min/max would be used

We need to either

  1. Make the stats contain NaN, and preserve the total_cmp ordering in pushdown
  2. Have a fallback compare kernel for float arrays with NaNCount > 0 that masks out the NaNs to false
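A rough sketch of what each option could look like, using only std and hypothetical names (`max_total_order`, `prunable_ge`, `ge_masking_nan`); none of this is existing code. For option 2 the issue does not say how a NaN right-hand side should behave, so this sketch simply forces every NaN element to false.

```rust
use std::cmp::Ordering;

/// Option 1: a max stat that keeps NaN, using the same `total_cmp` order
/// as the compare kernel.
fn max_total_order(values: &[f32]) -> Option<f32> {
    values.iter().copied().max_by(|a, b| a.total_cmp(b))
}

/// Pruning rule for `col >= rhs`: prune when `max < rhs` in the total order.
fn prunable_ge(max: Option<f32>, rhs: f32) -> bool {
    matches!(max, Some(m) if m.total_cmp(&rhs) == Ordering::Less)
}

/// Option 2: fallback `>=` that forces every NaN element to false.
fn ge_masking_nan(values: &[f32], rhs: f32) -> Vec<bool> {
    values
        .iter()
        .map(|v| !v.is_nan() && v.total_cmp(&rhs) != Ordering::Less)
        .collect()
}
```

With option 1 the max stat for `[1.0, 2.0, NaN, NaN]` is NaN, so the rows are no longer pruned and the pushed-down result matches direct evaluation. With option 2 direct evaluation returns all false, which matches what the NaN-free stats already produce.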
