Reproducibility Issue: RAND(seed) in Window Functions is Non-Deterministic on Impala Backend

### Problem Description
When using a fixed integer seed with RAND() on the Impala backend, the results are not reproducible when the RAND(seed) call is used within the ORDER BY clause of a window function (e.g., ROW_NUMBER()). This makes it impossible to implement reproducible partitioned sampling, which is a critical feature for ML data preparation.
While a simple SELECT *, RAND(42) query is deterministic, its behavior changes when combined with an OVER() clause, leading to non-deterministic results on each execution.

### Environment
Backend: Impala (version details can be added if known)
Ibis Version: (Please add your Ibis version, e.g., 8.0.0)
Python Version: 3.10.12

### Minimal Reproducible Example
The following code isolates the issue. It creates a partitioned table, then attempts to rank rows within each partition using a reproducible random order. This fails on the Impala backend.

```python
import ibis
import pandas as pd
import pytest

# --- Connection Details ---
# Assuming a connection `conn` to Impala is established.

temp_table_name = "ibis_repro_failure_test"
seed = 42

try:
    # 1. Create a partitioned test table
    data = [{'day': '2025-01-01', 'hour': '00', 'user_id': i} for i in range(10)]
    data += [{'day': '2025-01-01', 'hour': '12', 'user_id': i + 10} for i in range(10)]
    test_df = pd.DataFrame(data)
    
    conn.create_table(temp_table_name, test_df, overwrite=True)
    table = conn.table(temp_table_name)

    # 2. Define the partitioned sampling logic using the same method as in our codebase
    # This simulates how our sampling function constructs the query.
    def get_partitioned_sample(target_table, conn, seed_val):
        rand_col_name = "_rand_col"

        # This part accurately reflects our attempt to use RAND(seed) via conn.sql()
        compiled_sql_for_table = conn.compile(target_table, compile_kwargs={'literal_binds': True})
        with_rand_table = target_table.sql(
            f"SELECT *, RAND({seed_val}) AS {rand_col_name} FROM ({compiled_sql_for_table}) t"
        )
        
        # This is the core logic that fails: using RAND(seed) inside a window function
        partitioning_logic = (
            with_rand_table.mutate(
                _row_num=ibis.row_number().over(
                    ibis.window(
                        group_by=['day', 'hour'], 
                        order_by=with_rand_table[rand_col_name]
                    )
                )
            )
            .select('day', 'hour', 'user_id', rand_col_name)
            .order_by('day', 'hour', 'user_id')
        )
        return partitioning_logic.execute()

    # 3. Execute the logic twice
    print("--- First execution ---")
    df1 = get_partitioned_sample(table, conn, seed)
    print(df1)
    
    print("\n--- Second execution ---")
    df2 = get_partitioned_sample(table, conn, seed)
    print(df2)
    
    # 4. Compare the results
    pd.testing.assert_frame_equal(df1, df2)

except AssertionError:
    print("\n[FAILED] Reproducibility test failed: The two DataFrames are not identical.")
except Exception as e:
    print(f"\n[ERROR] An unexpected error occurred: {e}")
finally:
    if conn:
        conn.drop_table(temp_table_name, force=True)
        conn.disconnect()
```

### Expected Behavior
The two DataFrames, df1 and df2, should be completely identical. The RAND(seed) expression, even when used inside a window function, should produce a deterministic ordering, resulting in a reproducible sample.

### Actual Behavior
The test fails. The user_id and _rand_col columns are different between df1 and df2, proving that the ordering produced by ROW_NUMBER() OVER (ORDER BY RAND(seed)) is non-deterministic across executions.

### Investigation and Findings
1. A simple SELECT *, RAND(42) FROM table query is reproducible on our Impala instance.
2. However, when this RAND(42) column is used as the ORDER BY key within a window function, the entire operation becomes non-deterministic.
3. This suggests the issue is specific to the interaction between RAND(seed) and the OVER() clause in Impala's query execution engine.

### Question
Is there a more idiomatic or "Ibis-native" way to handle this non-deterministic behavior of RAND(seed) within window functions on Impala?

Thanks in advance.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Reproducibility Issue: RAND(seed) in Window Functions is Non-Deterministic on Impala Backend #11309

Problem Description

Environment

Minimal Reproducible Example

Expected Behavior

Actual Behavior

Investigation and Findings

Question

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Reproducibility Issue: RAND(seed) in Window Functions is Non-Deterministic on Impala Backend #11309

Description

Problem Description

Environment

Minimal Reproducible Example

Expected Behavior

Actual Behavior

Investigation and Findings

Question

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions