Skip to content

Reproducibility Issue: RAND(seed) in Window Functions is Non-Deterministic on Impala Backend #11309

Open
@94929

Description

@94929

Problem Description

When using a fixed integer seed with RAND() on the Impala backend, the results are not reproducible when the RAND(seed) call is used within the ORDER BY clause of a window function (e.g., ROW_NUMBER()). This makes it impossible to implement reproducible partitioned sampling, which is a critical feature for ML data preparation.
While a simple SELECT *, RAND(42) query is deterministic, its behavior changes when combined with an OVER() clause, leading to non-deterministic results on each execution.

Environment

Backend: Impala (version details can be added if known)
Ibis Version: (Please add your Ibis version, e.g., 8.0.0)
Python Version: 3.10.12

Minimal Reproducible Example

The following code isolates the issue. It creates a partitioned table, then attempts to rank rows within each partition using a reproducible random order. This fails on the Impala backend.

import ibis
import pandas as pd
import pytest

# --- Connection Details ---
# Assuming a connection `conn` to Impala is established.

temp_table_name = "ibis_repro_failure_test"
seed = 42

try:
    # 1. Create a partitioned test table
    data = [{'day': '2025-01-01', 'hour': '00', 'user_id': i} for i in range(10)]
    data += [{'day': '2025-01-01', 'hour': '12', 'user_id': i + 10} for i in range(10)]
    test_df = pd.DataFrame(data)
    
    conn.create_table(temp_table_name, test_df, overwrite=True)
    table = conn.table(temp_table_name)

    # 2. Define the partitioned sampling logic using the same method as in our codebase
    # This simulates how our sampling function constructs the query.
    def get_partitioned_sample(target_table, conn, seed_val):
        rand_col_name = "_rand_col"

        # This part accurately reflects our attempt to use RAND(seed) via conn.sql()
        compiled_sql_for_table = conn.compile(target_table, compile_kwargs={'literal_binds': True})
        with_rand_table = target_table.sql(
            f"SELECT *, RAND({seed_val}) AS {rand_col_name} FROM ({compiled_sql_for_table}) t"
        )
        
        # This is the core logic that fails: using RAND(seed) inside a window function
        partitioning_logic = (
            with_rand_table.mutate(
                _row_num=ibis.row_number().over(
                    ibis.window(
                        group_by=['day', 'hour'], 
                        order_by=with_rand_table[rand_col_name]
                    )
                )
            )
            .select('day', 'hour', 'user_id', rand_col_name)
            .order_by('day', 'hour', 'user_id')
        )
        return partitioning_logic.execute()

    # 3. Execute the logic twice
    print("--- First execution ---")
    df1 = get_partitioned_sample(table, conn, seed)
    print(df1)
    
    print("\n--- Second execution ---")
    df2 = get_partitioned_sample(table, conn, seed)
    print(df2)
    
    # 4. Compare the results
    pd.testing.assert_frame_equal(df1, df2)

except AssertionError:
    print("\n[FAILED] Reproducibility test failed: The two DataFrames are not identical.")
except Exception as e:
    print(f"\n[ERROR] An unexpected error occurred: {e}")
finally:
    if conn:
        conn.drop_table(temp_table_name, force=True)
        conn.disconnect()

Expected Behavior

The two DataFrames, df1 and df2, should be completely identical. The RAND(seed) expression, even when used inside a window function, should produce a deterministic ordering, resulting in a reproducible sample.

Actual Behavior

The test fails. The user_id and _rand_col columns are different between df1 and df2, proving that the ordering produced by ROW_NUMBER() OVER (ORDER BY RAND(seed)) is non-deterministic across executions.

Investigation and Findings

  1. A simple SELECT *, RAND(42) FROM table query is reproducible on our Impala instance.
  2. However, when this RAND(42) column is used as the ORDER BY key within a window function, the entire operation becomes non-deterministic.
  3. This suggests the issue is specific to the interaction between RAND(seed) and the OVER() clause in Impala's query execution engine.

Question

Is there a more idiomatic or "Ibis-native" way to handle this non-deterministic behavior of RAND(seed) within window functions on Impala?

Thanks in advance.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    Status

    backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions