[Bug]: Inconsistent and Incorrect Aggregations #4922
Hi @aboucher-r7, thank you for reporting the issue and taking the time to provide reproduction steps.
Updating the reproduction steps with what I did:
@aboucher-r7 we have tried to reproduce, but have not been able to, for the double precision type. For integer values, though, the results I got were the same for both queries with the SUM aggregate, using both regular Postgres tables and hypertables. Might it be possible for you to provide some additional information to help determine whether this is expected behavior or an actual bug?
Thank you!
Using the two example queries above with integers, the correct value is 1000243, obtained by running query 2; example results from query 1 were 402612, 442536, 387521, 454780, and 443951. How would I go about testing regular PG tables in a distributed fashion? Editing to add, in case it's relevant: I've been using a Python script to generate random data, and all of the data points within each chunk have the same timestamp. Also, I haven't yet tried this on a clean database; I can do so tomorrow, which might show that our database is in some kind of error state.
@aboucher-r7 please do get back to us after retrying on a clean database, thank you! If you are able to reproduce it there, could you also share the Python script you are using? It would be very helpful.
Please disregard that part of my previous comment: I did not mean for you to try creating a distributed hypertable; I was rather asking whether you would also notice the difference between the two queries after inserting the same data into, and then querying, a regular PG table. It would also be useful if you could share the EXPLAIN VERBOSE output for both of the queries. Thank you!
@konskov I was able to reproduce on a clean install this morning. I also tested using different timestamps for each data point and still had the issue. Interestingly, I cannot reproduce the issue with the steps you provided in your earlier comment. The Python script is as follows:

```python
import datetime
import io
import random
import time

import psycopg2

conn_details = {
    # Insert connection details here
}

days_ago = 25
num_days = 20
num_hours = 20
num_per_hour = 5000
max_int = 1
num_queries = 10

minute = 60
hour = minute * 60
day = hour * 24
start_time = int(time.time()) - day * days_ago

# Build one tab-separated buffer of (timestamp, random 0/1) rows.
data = ''
for i in range(num_days):
    for j in range(num_hours):
        date = start_time + day * i + hour * j
        formatted_date = datetime.datetime.fromtimestamp(date).strftime('%Y-%m-%dT%H:%M:%S.%fZ')
        for _ in range(num_per_hour):
            data += f'{formatted_date}\t{random.randint(0, max_int)}\n'

print(f'{datetime.datetime.now()} - Inserting')
conn = None
cur = None
while True:
    try:
        conn = psycopg2.connect(**conn_details)
        cur = conn.cursor()
        cur.execute('''DROP TABLE IF EXISTS example_table_name;''')
        cur.execute('''create table example_table_name
        (
            time timestamp not null,
            value int
        );''')
        cur.execute('''SELECT create_distributed_hypertable('example_table_name'::regclass, 'time');''')
        cur.execute('''SELECT set_chunk_time_interval('example_table_name'::regclass, INTERVAL '1h');''')
        cur.execute('''SELECT add_retention_policy('example_table_name', INTERVAL '30d');''')
        cur.copy_from(io.StringIO(data), 'example_table_name', columns=['time', 'value'])
        conn.commit()
        cur.close()
        conn.close()
        break
    except Exception as e:
        print(repr(e))
        if cur is not None:
            cur.close()
        if conn is not None:
            conn.close()

# Reconnect and run the same SUM repeatedly to expose the inconsistency.
conn = psycopg2.connect(**conn_details)
cur = conn.cursor()
print(f'{datetime.datetime.now()} - Insert complete')
for _ in range(num_queries):
    cur.execute('''SELECT SUM(value) FROM example_table_name;''')
    print(cur.fetchone())
conn.commit()
cur.close()
conn.close()
```

Query plan for query 1:
And for query 2:
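As a sanity check on the script above (not part of the original thread): it generates num_days * num_hours * num_per_hour = 2,000,000 rows, and with max_int = 1 each value is 0 or 1 with equal probability, so the true SUM(value) should land close to 1,000,000, which is consistent with the reporter's correct value of 1000243. A minimal standalone sketch:

```python
import random

# Parameters copied from the reporter's script.
num_days, num_hours, num_per_hour, max_int = 20, 20, 5000, 1

# Seeded here only so this sketch is reproducible; the original
# script uses an unseeded RNG.
random.seed(0)
total_rows = num_days * num_hours * num_per_hour
values = [random.randint(0, max_int) for _ in range(total_rows)]
true_sum = sum(values)

# 2,000,000 rows of 0/1 values, so the true SUM(value) should be
# close to 1,000,000; the incorrect query-1 results (roughly
# 380k-460k) are far outside any plausible sampling noise.
print(total_rows, true_sum)
```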
@aboucher-r7 thank you for providing the script, I was able to reproduce the issue on cloud. It seems the issue is the row-by-row fetcher; switching to the cursor fetcher avoids it.
@konskov that does indeed fix our issue, thank you. Does that have any performance implications, or should it be enabled across the board?
The cursor fetcher could potentially be slower, because it does not allow the execution of remote parallel plans. That is something to keep in mind.
We have seen a similar problem of inconsistent aggregations. This only occurs with
@robfranolic yes, we are looking into fixing this issue. The row-by-row fetcher that was implicated as causing the issue is actually being replaced by another implementation that uses the COPY protocol. We are trying to determine whether the new implementation solves the issue.
From some more investigation, it seems the fetcher is not the issue. The issue seems to be in partial aggregation. Here's a test with part of a query that gets executed on a data node. First, execute the query without partials:

```
data_node_1=# select sum(value) FROM public.example_table_name WHERE _timescaledb_internal.chunks_in(public.example_table_name.*, ARRAY[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101]);
  sum
--------
 249765
(1 row)
```

Then execute the same query with partialize_agg:

```
data_node_1=# select _timescaledb_internal.finalize_agg('sum(integer)', null, null, null, partial, cast('1' as int8)) from (SELECT _timescaledb_internal.partialize_agg(sum(value)) partial FROM public.example_table_name WHERE _timescaledb_internal.chunks_in(public.example_table_name.*, ARRAY[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101])) a;
 finalize_agg
--------------
       109798
(1 row)
```

(The above query actually returns a different result every time it is executed.) Now turn off parallel queries:

```
data_node_1=# set max_parallel_workers=0;
SET
data_node_1=# select _timescaledb_internal.finalize_agg('sum(integer)', null, null, null, partial, cast('1' as int8)) from (SELECT _timescaledb_internal.partialize_agg(sum(value)) partial FROM public.example_table_name WHERE _timescaledb_internal.chunks_in(public.example_table_name.*, ARRAY[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101])) a;
 finalize_agg
--------------
       249765
(1 row)
```
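To make the failure mode above concrete, here is a minimal, self-contained sketch of two-phase (partialize/finalize) aggregation in Python. It illustrates the general technique only; the `partialize`/`finalize` functions, the worker slicing, and the dropped partial state are all hypothetical, not TimescaleDB's actual transition/combine machinery.

```python
import random

random.seed(42)
values = [random.randint(0, 1) for _ in range(100_000)]
correct = sum(values)

def partialize(rows):
    # Transition phase: each worker reduces its rows to a partial state.
    # For SUM the state is just a running total.
    return sum(rows)

def finalize(partials):
    # Combine phase: merge the per-worker partial states.
    return sum(partials)

# Correct parallel plan: workers receive disjoint, complete slices,
# so combining the partials reproduces the full aggregate.
n_workers = 4
slices = [values[i::n_workers] for i in range(n_workers)]
partials = [partialize(s) for s in slices]
assert finalize(partials) == correct

# Failure mode consistent with the report: if a partial state is lost
# before the combine step, the result silently undercounts, and which
# state is lost can change from run to run.
broken = finalize(partials[1:])
assert broken < correct
```

This matches the observed behavior: with parallelism off (one worker, one partial) the result is correct; with parallel workers the partial result is both too small and non-deterministic.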
Thanks. As I understand it,
Previous PR timescale#4307 marked `partialize_agg` and `finalize_agg` as parallel safe, but this change leads to incorrect results in some cases. Those functions are supposed to work in parallel, but it seems that is not the case; the root cause, and how to properly use them in parallel queries, is not yet evident, so we decided to revert this change and provide correct results to users. Fixes timescale#4922
What type of bug is this?
Incorrect result
What subsystems and features are affected?
Multi-node
What happened?
Queries utilising partial aggregation return incorrect and inconsistent results.
Example:
Query 1:
```sql
SELECT sum(value) FROM table;
```
Query 2:
```sql
SELECT SUM(value) FROM (SELECT value FROM table ORDER BY 1) a;
```
Query 1 will produce incorrect results, whereas Query 2 will bypass partial aggregation on the data nodes and return the correct result.
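The difference between the two query shapes can be sketched in plain Python (the node layout and values here are made up for illustration, not TimescaleDB internals): Query 1's plan combines per-data-node partial sums on the access node, while Query 2's subquery forces raw rows to be pulled and aggregated centrally. With correct partials, the two plans must agree.

```python
# Hypothetical four-data-node layout; node names and row values are
# illustrative only.
nodes = {f"data_node_{i}": [(i * 7 + k) % 3 for k in range(1000)]
         for i in range(1, 5)}

# Query 1 shape: each data node returns a partial SUM and the
# access node combines the partials.
q1 = sum(sum(rows) for rows in nodes.values())

# Query 2 shape: the subquery pulls raw rows to the access node,
# which then computes the aggregate itself.
pulled_rows = [v for rows in nodes.values() for v in rows]
q2 = sum(pulled_rows)

# With correct partial aggregation the two plans agree; the bug
# report shows them diverging.
assert q1 == q2
```

Query 2 is therefore a way to sidestep the bug at the cost of shipping every row to the access node.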
TimescaleDB version affected
2.8.1
PostgreSQL version used
14.5
What operating system did you use?
Ubuntu 22.04.1 x64
What installation method did you use?
Docker
What platform did you run on?
Amazon Web Services (AWS)
Relevant log output and stack trace
No response
How can we reproduce the bug?
Running timescaledb-ha pg14.5-ts2.8.1-p0 on 1 access node with 4 data nodes using EKS, with all nodes in the same availability zone. Data of the form
```
(1664496000, 'a', '1efa7394-5540-410d-a468-b5f204d66ff9', 'b', 1)
```
I had to insert 2M rows across 400 chunks before I saw this bug appear. Note that I have seen this issue occur on multiple tables now, all of which have the same partitioning and retention policies, but different schemas.
AVG is also affected; the issue is not limited to SUM.
Editing to add: this also happens when the value is an int, and the string columns are unnecessary.