Description
Problem Description
When using a fixed integer seed with RAND() on the Impala backend, the results are not reproducible when the RAND(seed) call is used within the ORDER BY clause of a window function (e.g., ROW_NUMBER()). This makes it impossible to implement reproducible partitioned sampling, which is a critical feature for ML data preparation.
While a simple SELECT *, RAND(42) query is deterministic, its behavior changes when combined with an OVER() clause, leading to non-deterministic results on each execution.
Environment
Backend: Impala (version details can be added if known)
Ibis Version: (Please add your Ibis version, e.g., 8.0.0)
Python Version: 3.10.12
Minimal Reproducible Example
The following code isolates the issue. It creates a partitioned table, then attempts to rank rows within each partition using a reproducible random order. This fails on the Impala backend.
import ibis
import pandas as pd
import pytest
# --- Connection Details ---
# Assuming a connection `conn` to Impala is established.
temp_table_name = "ibis_repro_failure_test"
seed = 42
try:
# 1. Create a partitioned test table
data = [{'day': '2025-01-01', 'hour': '00', 'user_id': i} for i in range(10)]
data += [{'day': '2025-01-01', 'hour': '12', 'user_id': i + 10} for i in range(10)]
test_df = pd.DataFrame(data)
conn.create_table(temp_table_name, test_df, overwrite=True)
table = conn.table(temp_table_name)
# 2. Define the partitioned sampling logic using the same method as in our codebase
# This simulates how our sampling function constructs the query.
def get_partitioned_sample(target_table, conn, seed_val):
rand_col_name = "_rand_col"
# This part accurately reflects our attempt to use RAND(seed) via conn.sql()
compiled_sql_for_table = conn.compile(target_table, compile_kwargs={'literal_binds': True})
with_rand_table = target_table.sql(
f"SELECT *, RAND({seed_val}) AS {rand_col_name} FROM ({compiled_sql_for_table}) t"
)
# This is the core logic that fails: using RAND(seed) inside a window function
partitioning_logic = (
with_rand_table.mutate(
_row_num=ibis.row_number().over(
ibis.window(
group_by=['day', 'hour'],
order_by=with_rand_table[rand_col_name]
)
)
)
.select('day', 'hour', 'user_id', rand_col_name)
.order_by('day', 'hour', 'user_id')
)
return partitioning_logic.execute()
# 3. Execute the logic twice
print("--- First execution ---")
df1 = get_partitioned_sample(table, conn, seed)
print(df1)
print("\n--- Second execution ---")
df2 = get_partitioned_sample(table, conn, seed)
print(df2)
# 4. Compare the results
pd.testing.assert_frame_equal(df1, df2)
except AssertionError:
print("\n[FAILED] Reproducibility test failed: The two DataFrames are not identical.")
except Exception as e:
print(f"\n[ERROR] An unexpected error occurred: {e}")
finally:
if conn:
conn.drop_table(temp_table_name, force=True)
conn.disconnect()
Expected Behavior
The two DataFrames, df1 and df2, should be completely identical. The RAND(seed) expression, even when used inside a window function, should produce a deterministic ordering, resulting in a reproducible sample.
Actual Behavior
The test fails. The user_id and _rand_col columns are different between df1 and df2, proving that the ordering produced by ROW_NUMBER() OVER (ORDER BY RAND(seed)) is non-deterministic across executions.
Investigation and Findings
- A simple SELECT *, RAND(42) FROM table query is reproducible on our Impala instance.
- However, when this RAND(42) column is used as the ORDER BY key within a window function, the entire operation becomes non-deterministic.
- This suggests the issue is specific to the interaction between RAND(seed) and the OVER() clause in Impala's query execution engine.
Question
Is there a more idiomatic or "Ibis-native" way to handle this non-deterministic behavior of RAND(seed) within window functions on Impala?
Thanks in advance.
Metadata
Metadata
Assignees
Labels
Type
Projects
Status