
Could not find pathkey item to sort #2232

Open
kvc0 opened this issue Aug 20, 2020 · 34 comments · Fixed by #2868

Comments

@kvc0

kvc0 commented Aug 20, 2020

Relevant system information:

  • OS: Ubuntu 18.04 LTS
  • PostgreSQL 12.2
  • TimescaleDB 1.7.0
  • Installation method: apt install

Describe the bug
When querying a hypertable and grouping by the hypertable's time column, the width of the time bucket determines whether the PostgreSQL planner returns the error Could not find pathkey item to sort.
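
For intuition, time_bucket truncates each timestamp down to the start of a fixed-width bucket. A minimal Python sketch of that arithmetic (the function name, millisecond units, and zero origin are illustrative assumptions, not TimescaleDB's actual implementation or its default origin):

```python
def time_bucket_ms(width_ms: int, ts_ms: int, origin_ms: int = 0) -> int:
    """Truncate a millisecond timestamp down to the start of its bucket.

    Mirrors the shape of time_bucket('83022ms', time): subtract the
    offset of the timestamp within its bucket.
    """
    return ts_ms - (ts_ms - origin_ms) % width_ms

# 166045 ms falls in the bucket starting at 2 * 83022 = 166044 ms.
assert time_bucket_ms(83022, 166045) == 166044
```

Note the bug reported below is in the planner, not in this arithmetic, which is the same for any bucket width.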

To Reproduce
It is not known whether this is reproducible in other environments.

Example failure, note the time_bucket value:

postgres=# explain select
 time_bucket('83022ms',time),
 avg(seconds)
from metrics
where "time" > '2020-08-19T17:34:37.496Z' AND time < '2020-08-19T23:34:37.497Z'
  and cell_name = 'a particular cell'
group by 1;
ERROR:  could not find pathkey item to sort

Example success; note the time_bucket value:

postgres=# explain select
 time_bucket('83021ms',time),
 avg(seconds)
from metrics
where "time" > '2020-08-19T17:34:37.496Z' AND time < '2020-08-19T23:34:37.497Z'
  and cell_name = 'a particular cell'
group by 1;
                                                                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=127900.99..134242.67 rows=43909 width=16)
   Group Key: (time_bucket('00:01:23.021'::interval, _hyper_20_1182123_chunk."time"))
   ->  Gather Merge  (cost=127900.99..133364.48 rows=43910 width=40)
         Workers Planned: 5
         ->  Partial GroupAggregate  (cost=126900.91..127076.55 rows=8782 width=40)
               Group Key: (time_bucket('00:01:23.021'::interval, _hyper_20_1182123_chunk."time"))
               ->  Sort  (cost=126900.91..126922.86 rows=8782 width=16)
                     Sort Key: (time_bucket('00:01:23.021'::interval, _hyper_20_1182123_chunk."time"))
                     ->  Result  (cost=0.43..126325.67 rows=8782 width=16)
                           ->  Parallel Index Scan Backward using _hyper_20_1182123_chunk_this_index_name_is_long_enough_to_trunca on _hyper_20_1182123_chunk  (cost=0.43..126215.90 rows=8782 width=16)
                                 Index Cond: (("time" > '2020-08-19 10:34:37.496-07'::timestamp with time zone) AND ("time" < '2020-08-19 16:34:37.497-07'::timestamp with time zone))
                                 Filter: (cell_name = 'some particular cell'::text)
 JIT:
   Functions: 15
   Options: Inlining false, Optimization false, Expressions true, Deforming true
(15 rows)

Expected behavior
For any valid width of time_bucket, the planner can produce a query plan. Alternatively, date_trunc can work with both minute and hour.

Actual behavior
Any bucketing of the hypertable's time column wider than 83021 milliseconds causes the query planner to error.

Additional context
Certain selections from the table over this time range trigger the error while others do not.

  • The hypertable is simple; there is no physical partitioning, continuous aggregate or compression.
  • There are no additional indices beyond the one TimescaleDB creates on the time column when you create_hypertable.
  • All columns are either text or double precision.
  • There are 43 chunks, each 1 day wide with approximately 7 million rows each.
  • There are no null values in the time column (per the hypertable constraint) and there are no null values in the aggregated numeric column.
  • The query matches 46128 rows before grouping, no null values in either column.
@svenklemm
Member

Do you still get this with the latest version? If so, could you provide a minimal test script that reproduces the behaviour in a fresh database?

@kvc0
Author

kvc0 commented Sep 1, 2020

I've been unable to produce a repro script. I do not know what manner of data poisons the planner like this.

There are no null values or anything particularly interesting in the data, and the query fails long before any tuples are considered, during a simple EXPLAIN.

@kvc0
Author

kvc0 commented Nov 10, 2020

postgres=# EXPLAIN SELECT
  time_bucket('1m', time),
  count(total_measurement)
FROM metrics
WHERE
  time between  '2020-11-10T12:58Z' and '2020-11-10T18:43:01Z'
  AND item = '15660'
GROUP BY 1
;

ERROR:  could not find pathkey item to sort

This error vacillates around the time endpoints: finding the second at which the error occurs moves the error by minutes in one direction or the other.

Of particular note: the error happens when using explicit timestamps in the filter. If you replace them with the equivalent time between now() - interval '12 hours' and now(), the EXPLAIN completes.

@emadct

emadct commented Dec 19, 2020

Is there any fix or workaround for this issue? We see it often in our monitoring (Grafana). When some time passes, for example two hours, it works again; we are using time_bucket and grouping based on it. I also checked just now: when I remove the other conditions, it works!

Not working:

SELECT time_bucket('60s',"time") AS "time", count(ids) AS "Items"
FROM tbl WHERE "time" BETWEEN '2020-12-18T23:16:40.053Z' AND '2020-12-19T05:16:40.053Z'
AND respcode != 0
GROUP BY 1
ORDER BY 1;

ERROR:  could not find pathkey item to sort

Working:

SELECT time_bucket('60s',"time") AS "time", count(ids) AS "Items"
FROM tbl WHERE "time" BETWEEN '2020-12-18T23:16:40.053Z' AND '2020-12-19T05:16:40.053Z'
GROUP BY 1
ORDER BY 1;

Data...

@kvc0
Author

kvc0 commented Dec 19, 2020

As a workaround, instead of $__timeFilter("time"), try time between $__timeFrom()::timestamptz and $__timeTo()::timestamptz.

@emadct

emadct commented Dec 20, 2020

For a workaround instead of $__timeFilter(“time”) try time between $__timeFrom()::timestamptz and $__timeTo()::timestamptz.

Thank you very much. By the way, if more information is required to fix the issue, I can try my best to reproduce it using a SQL file.

@edsinclair

I just ran into this issue as well when attempting to order with a join:

EXPLAIN SELECT time_bucket('10 seconds', "metrics"."ts") AS ten_seconds, 
  SUM("metrics"."column1") AS column1, 
  SUM("metrics"."column2") AS column2, 
  SUM("normalized_sensor_readings"."column3") AS column3 
FROM "metrics" 
INNER JOIN "normalized_sensor_readings" ON "metrics"."facility" = "normalized_sensor_readings"."facility" AND "normalized_sensor_readings"."ts" = "metrics"."ts" 
WHERE "metrics"."facility" = 'facility1' AND "metrics"."ts" BETWEEN '2020-12-01 19:21:20' AND '2020-12-01 19:42:40' GROUP BY ten_seconds
ORDER BY ten_seconds;
ERROR:  could not find pathkey item to sort

I tried the workaround suggested above, but (perhaps obviously) it produces the wrong aggregates even though it does not throw an error:

EXPLAIN SELECT time_bucket('10 seconds', "metrics"."ts") AS ten_seconds,
  SUM("metrics"."column1") AS column1,
  SUM("metrics"."column2") AS column2,
  SUM("normalized_sensor_readings"."column3") AS column3 
FROM "metrics" 
INNER JOIN "normalized_sensor_readings" ON "metrics"."facility" = "normalized_sensor_readings"."facility" AND "normalized_sensor_readings"."ts" BETWEEN '2020-12-01 19:21:20' AND '2020-12-01 19:42:40' 
WHERE "metrics"."facility" = 'facility1' AND "metrics"."ts" BETWEEN '2020-12-01 19:21:20' AND '2020-12-01 19:42:40' GROUP BY ten_seconds
ORDER BY ten_seconds;

GroupAggregate  (cost=0.83..672.61 rows=130 width=24)
   Group Key: (time_bucket('00:00:10'::interval, metrics.ts))
   ->  Nested Loop  (cost=0.83..449.99 rows=17680 width=24)
         ->  Custom Scan (ChunkAppend) on metrics  (cost=0.42..87.84 rows=130 width=24)
               Order: time_bucket('00:00:10'::interval, metrics.ts)
               ->  Index Scan using "17945_16306_metrics_pkey" on _hyper_32_17945_chunk  (cost=0.42..87.84 rows=130 width=24)
                     Index Cond: ((facility = 'facility1'::facility_name) AND (ts >= '2020-12-01 19:21:20'::timestamp without time zone) AND (ts <= '2020-12-01 19:42:40'::timestamp without time zone))
         ->  Materialize  (cost=0.42..97.29 rows=136 width=8)
               ->  Append  (cost=0.42..96.61 rows=136 width=8)
                     ->  Index Scan using "17942_16303_normalized_sensor_readings_tmp_pkey" on _hyper_27_17942_chunk  (cost=0.42..95.93 rows=136 width=8)
                           Index Cond: ((facility = 'facility1'::facility_name) AND (ts >= '2020-12-01 19:21:20'::timestamp without time zone) AND (ts <= '2020-12-01 19:42:40'::timestamp without time zone))
(11 rows)

In this particular case I'm not using Grafana and the result set isn't very big, so I'm able to work around the issue by dropping the ORDER BY and sorting outside of TimescaleDB:

EXPLAIN SELECT time_bucket('10 seconds', "metrics"."ts") AS ten_seconds, 
  SUM("metrics"."column1") AS column1,
  SUM("metrics"."column2") AS column2,
  SUM("normalized_sensor_readings"."column3") AS column3
FROM "metrics"
INNER JOIN "normalized_sensor_readings" ON "metrics"."facility" = "normalized_sensor_readings"."facility" AND "normalized_sensor_readings"."ts" = "metrics"."ts" 
WHERE "metrics"."facility" = 'facility1' AND "metrics"."ts" BETWEEN '2020-12-01 19:21:20' AND '2020-12-01 19:42:40'
GROUP BY ten_seconds;

 HashAggregate  (cost=184.41..186.03 rows=130 width=24)
   Group Key: time_bucket('00:00:10'::interval, _hyper_32_17945_chunk.ts)
   ->  Hash Join  (cost=90.53..182.78 rows=130 width=24)
         Hash Cond: (_hyper_27_17942_chunk.ts = _hyper_32_17945_chunk.ts)
         ->  Append  (cost=0.42..91.99 rows=130 width=16)
               ->  Index Scan using "17942_16303_normalized_sensor_readings_tmp_pkey" on _hyper_27_17942_chunk  (cost=0.42..91.34 rows=130 width=16)
                     Index Cond: ((facility = 'facility1'::facility_name) AND (ts >= '2020-12-01 19:21:20'::timestamp without time zone) AND (ts <= '2020-12-01 19:42:40'::timestamp without time zone))
         ->  Hash  (cost=88.49..88.49 rows=130 width=24)
               ->  Append  (cost=0.42..88.49 rows=130 width=24)
                     ->  Index Scan using "17945_16306_metrics_pkey" on _hyper_32_17945_chunk  (cost=0.42..87.84 rows=130 width=24)
                           Index Cond: ((facility = 'facility1'::facility_name) AND (ts >= '2020-12-01 19:21:20'::timestamp without time zone) AND (ts <= '2020-12-01 19:42:40'::timestamp without time zone))
(11 rows)

@WarriorOfWire thanks for reporting and suggesting the workaround above.

@kvc0
Author

kvc0 commented Jan 4, 2021

@edsinclair Are you using timestamp with time zone for metrics.ts? Try your original query but use

WHERE [...] AND "metrics"."ts" BETWEEN '2020-12-01T19:21:20'::timestamptz AND '2020-12-01T19:42:40'::timestamptz

@feikesteenbergen
Member

feikesteenbergen commented Jan 19, 2021

Getting similar issues on Timescale 2.0.0, for example:

CREATE TABLE bugs (
    observed timestamptz not null,
    metric bigint not null
);

SELECT create_hypertable('bugs', 'observed', chunk_time_interval => '1 week'::interval);

INSERT INTO bugs VALUES ('2021-01-12T14:00', 4);
INSERT INTO bugs VALUES ('2021-01-19T14:01', 4);
INSERT INTO bugs VALUES ('2021-01-19T14:05', 6);

And then running the following query:

SELECT
    time_bucket_gapfill('60 seconds', observed) as observed
FROM
    bugs
WHERE
    observed >= '2021-01-18T14:05'
    AND observed <= now()
GROUP BY
    1
ORDER BY
    1 DESC;

Results in:

ERROR:  XX000: could not find pathkey item to sort
LOCATION:  prepare_sort_from_pathkeys, createplan.c:5896

PostgreSQL 12.5 with TimescaleDB 2.0.0

@pmwkaa
Contributor

pmwkaa commented Jan 19, 2021

@feikesteenbergen

Got this result with PG 12.5 and current master running the timescaledb test suite (no error), but I was able to reproduce the error using psql.

+CREATE TABLE bugs (
+    observed timestamptz not null,
+    metric bigint not null
+);
+SELECT create_hypertable('bugs', 'observed', chunk_time_interval => '1 week'::interval);
+ create_hypertable 
+-------------------
+ (2,public,bugs,t)
+(1 row)
+
+INSERT INTO bugs VALUES ('2021-01-12T14:00', 4);
+INSERT INTO bugs VALUES ('2021-01-19T14:01', 4);
+INSERT INTO bugs VALUES ('2021-01-19T14:05', 6);
+SELECT
+    time_bucket_gapfill('60 seconds', observed) as observed
+FROM
+    bugs
+WHERE
+    observed >= '2021-01-18T14:05'
+    AND observed <= now()
+GROUP BY
+    1
+ORDER BY
+    1 DESC;
+ observed 
+----------
+(0 rows)
+

@erimatnor erimatnor self-assigned this Jan 22, 2021
erimatnor added a commit to erimatnor/timescaledb that referenced this issue Jan 27, 2021
This change fixes a bug with gapfill that led to certain query plans
failing with the error "could not find pathkey item to sort". This was
caused by a corruption in the query plan that resulted from removing
the last two arguments to the `time_bucket_gapfill` function
expression during query planning. Since the function's expression was
modified via a reference to the original expression in the query plan,
it also affected the target list. The error occurred because the
planner couldn't match the function (including arguments) with the
corresponding equivalence member (which still included the two removed
arguments).

The reason the two arguments were removed, originally, was to avoid
passing them on to `time_bucket`, which is called internally by
`time_bucket_gapfill`. However, the last two arguments aren't passed on
anyway, so it isn't necessary to modify the original argument list.

Fixes timescale#2232
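
The aliasing bug described in the commit message can be illustrated outside of PostgreSQL. A minimal Python sketch (all names hypothetical; this only mirrors the shape of the problem, not TimescaleDB's actual plan structures):

```python
# The planner held two views of the bucketing expression: the target
# list referenced the same node that was modified in place, while the
# equivalence member kept its own copy with all four arguments.
func_args = ["bucket_width", "time", "start", "finish"]
target_list_entry = func_args            # alias, not a copy
equivalence_member = list(func_args)     # independent copy

# Dropping the last two arguments from the shared node also mutates
# the target list...
del func_args[2:]
assert target_list_entry == ["bucket_width", "time"]

# ...so it no longer matches the equivalence member, which is the
# mismatch behind "could not find pathkey item to sort".
assert target_list_entry != equivalence_member
```

The fix amounts to not mutating the shared node (the arguments were never forwarded anyway), so both views stay consistent.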
erimatnor added a commit that referenced this issue Jan 27, 2021
svenklemm pushed a commit that referenced this issue Jan 27, 2021
@morganchristiansson

For me, GROUP BY SMALLINT type columns didn't work.
Casting to ::INTEGER fixed it.

@jnidzwetzki jnidzwetzki reopened this Sep 19, 2023
@jnidzwetzki
Contributor

Hello @morganchristiansson,

Thanks for letting us know about the problem. Which version of TimescaleDB are you using? Additionally, it would be great if you could provide the steps to reproduce the error. This includes the SQL queries required to create the affected table, insert some test data, and the query that failed to execute. This information will help us to analyze the problem.

@morganchristiansson

Hmm, my query has evolved and been simplified; casting to ::INTEGER is no longer needed and the error is not reproducing.

If it happens again I will provide steps to reproduce, or possibly I can dig through the history...


Dear Author,

This issue has been automatically marked as stale due to lack of activity. With only the issue description that is currently provided, we do not have enough information to take action. If you have or find the answers we would need, please reach out. Otherwise, this issue will be closed in 30 days.

Thank you!

@LuxCore

LuxCore commented Dec 14, 2023

Hi!

Postgres 15, TimescaleDB 2.12 - 2.13
The table was created with SMALLINT fields, converted to a hypertable, populated (about 100 million records), and compressed.

The query contains the following parts:

...
FROM hypertable AS ht
	INNER JOIN postgrestable AS pt ON pt.id = ht.id
WHERE ht.ts >= TIMESTAMP '2023-10-01' AND ht.ts < TIMESTAMP '2023-11-15'
GROUP BY ...

The period shown gives SQL Error [XX000]: ERROR: could not find pathkey item to sort. Shortening the period to a few days (for example WHERE ht.ts >= TIMESTAMP '2023-11-10' AND ht.ts < TIMESTAMP '2023-11-15') gives a successful result.

After recreating the table with INTEGER fields instead of SMALLINT, the query works for any period.

This error also occurs when the JOIN and/or GROUP BY contains a field with NULL values. As a workaround I used COALESCE(field, 0).
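
As a sanity check of that workaround's semantics: COALESCE substitutes a sentinel for NULL, so NULL group keys collapse into the sentinel's group. A throwaway Python sketch (the helper name and the sentinel 0 are illustrative assumptions):

```python
def coalesce(*args):
    """Return the first argument that is not None, like SQL COALESCE."""
    return next((a for a in args if a is not None), None)

# Grouping on COALESCE(field, 0) folds NULL keys into the 0 group,
# which sidesteps the planner error reported above.
keys = [coalesce(v, 0) for v in [1, None, 2, None]]
assert keys == [1, 0, 2, 0]
```

Note the sentinel must not collide with a real key value, or those rows would be merged into the wrong group.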

@antekresic
Contributor

Hey,

Any chance we can get more details in order to reproduce this issue?
This really isn't a lot to go on. A full query with redacted column names works as well, if you are concerned about that.

Thanks.

@kvc0
Author

kvc0 commented Dec 14, 2023

Hey,

Any chance we can get more details in order to reproduce this issue? This really isn't a lot to go on. A full query with redacted column names works as well, if you are concerned about that.

Thanks.

For some examples, you can check out the original post, this post, this post, or this post. There’s quite a bit of information in here:

  • It happens at query planner time (reproduces with explain).
  • It happens with timescale versions as early as 1.7.
  • It happens with simple queries and complex queries.
  • Random adjustments to the query sometimes mitigate the error for a time.
  • It seems resistant to reproduction via a small reproducer script. It hasn't been withheld for lack of effort.

FYI, the original issue here is not limited to gapfill queries and this issue should not have been closed with that rationale #2232 (comment)

@antekresic
Contributor

antekresic commented Dec 14, 2023

@kvc0 thanks for the in-depth explanation of the history of the issue.

I do have a suggestion for you (or anybody else) when you come across this issue, try doing:
set timescaledb.enable_chunkwise_aggregation = off;
and running the explain again.

If that fixes the issue, it helps narrow down the root cause.

Thanks!

@kvc0
Author

kvc0 commented Dec 14, 2023

That’s a great help to try! When I was still maintaining a large timescale database restarting the service would make the error go away for a while. So having an online setting adjustment that helps narrow it is super helpful!

Are there any other helpful online settings to try? I don’t have an affected database currently ($last_job) but I’m sure others would be happy to try. And I will too if my small NAS timescaledb hits this issue ;-)

@antekresic
Contributor

That's about it for now. Hopefully we get new leads so we can get to the bottom of this.

@nguiard

nguiard commented Dec 20, 2023

I just ran into this bug, and I thought I had a reproduction for you but it disappeared on me...

In the hope that this can be helpful: during the window when the bug appeared, I was able to make it happen or not by changing one line in my query. I'll give the full query (because I was unable to drill down to a minimal example), even though I doubt most of the columns are relevant.

This triggers the bug:

select
    m.sender,
    coalesce(sum(m.recipients), 0) tot_rcpt,
    coalesce(sum(m.recipients) filter (where m.accepted), 0) acc_rcpt,
    coalesce(sum(m.recipients) filter (where not m.accepted), 0) rej_rcpt,
    to_char(coalesce(sum(m.recipients) filter (where not m.accepted) / (sum(m.recipients)::float + 0.001)), '0.00') rej_ratio,
    coalesce(count(m.*), 0) tot_msg,
    coalesce(count(m.*) filter (where m.accepted), 0) acc_msg,
    coalesce(count(m.*) filter (where not m.accepted), 0) rej_msg,
    count(c.*) conns,
    count(distinct(c.host)) distinct_hosts
from
    messages m
    join connections c on (c.icid = m.icid)
where
    not m.sender_is_list
    and m.sender_dom = 'xxxxxx'
    and m.time > '2023-12-15'
group by
    m.sender

But removing the line

    count(distinct(c.host)) distinct_hosts

(and the "," above) made the bug disappear. For a bit of context, messages and connections are hypertables.

I was able to go back and forth between the two queries, and reliably trigger the bug or not. So it was not a fluke at the time I tested it.

However, I also had an (apparently wrong) hypothesis that it would be related to something else: I'm working on a log parsing script that populates the db with data, and it appeared to me that it seemed to happen only when I parsed a single day of the testing logs I had... So I mistakenly went for recreating the db a few times, and the bug disappeared in both conditions (single day of logs or multiple days), and now I seem unable to experiment and reproduce it.

I also unfortunately was unable to try the suggested set timescaledb.enable_chunkwise_aggregation = off;, so no idea if it would have made the bug disappear.

@nguiard

nguiard commented Dec 22, 2023

It happened to me again! I was able to narrow down working / non-working queries a bit more, and to try set timescaledb.enable_chunkwise_aggregation = off;. PG16 and TimescaleDB 2.13 here.

  • This triggers the bug:
select
    count(c.*),
    count(distinct(c.host)) hosts
from
    messages m
    join connections c on (c.icid = m.icid)
where
    m.time > now() - interval '1 hours'
group by
    m.sender
;
  • Removing the line
    count(distinct(c.host)) hosts

effectively makes the bug disappear.

  • Selecting from connections only also works fine:
select
    count(c.*),
    count(distinct(c.host)) hosts
from
    connections c
where
    c.time > now() - interval '1 hours'
;
  • Perhaps most interestingly, doing a left join instead of a join also makes the bug disappear:
select
    count(c.*),
    count(distinct(c.host)) hosts
from
    messages m
    left join connections c on (c.icid = m.icid)
where
    m.time > now() - interval '1 hours'
group by
    m.sender
;
  • in my case at least, applying set timescaledb.enable_chunkwise_aggregation = off; did not fix the issue

I have to leave, but for now the issue has not disappeared by itself after ingesting a bit more data. I have not tried recreating the db yet; that will probably make it disappear, I guess. If you have more ideas on things to try, don't hesitate.

For a little more context, in case it helps: I'm ingesting logs and putting them in the db. But as I'm "backfilling" a few days / weeks of logs, my program applies TimescaleDB compression "manually" after each day's log file is ingested, to save space. There are probably a few rows being inserted into already-compressed chunks... No idea if that is in any way related...

@jnidzwetzki
Contributor

Hi @nguiard,

Thanks for the detailed information. Is it possible to share the definitions of the hypertables as well? Do you use compressed hypertables, or are all chunks uncompressed? When the issue happens the next time, could you share the generated query plan (this can be done by executing EXPLAIN SELECT ...)? This would help us get a better understanding of the problem.

@nguiard

nguiard commented Dec 22, 2023

Hi @jnidzwetzki, thanks a lot for your answer. It seems I have the issue happening right now :)

  • First thing to note: as mentioned earlier by @kvc0, the bug happens at planning time, so running EXPLAIN ... or EXPLAIN ANALYZE ... behaves the same as running the query itself (confirmed, I just tried it as I'm writing this).

I can share the related table definitions (with a tiny bit of redaction):

create extension plpgsql_check;
create extension pg_stat_statements;
create extension timescaledb;

-----------------------------------------------
--
create table connections (
    time            timestamptz     not null,
    icid            int8            not null,
    ip              inet            not null,
    host            varchar(512)    not null,
    country         varchar(512)    ,
    accepted        boolean         not null default false,
    relay           boolean         not null default false,
    is_exchange     boolean         generated always as
        (ip << '10.xx.xx/24' or ip << '10.xx.xx/24') stored
);
select create_hypertable('connections', by_range('time', interval '1 day'));
alter table connections set (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'host',
    timescaledb.compress_orderby = 'icid desc'
);
select add_compression_policy('connections', interval '3 days');
select add_retention_policy('connections', interval '1 year');

create index on connections (icid, time desc);
create index on connections (ip, time desc);
create index on connections (host, time desc);
create index on connections (country, time desc);
create index on connections (accepted, time desc);
create index on connections (relay, time desc);
create index on connections (is_exchange, time desc);

-----------------------------------------------
--
create table messages (
    time            timestamptz     not null,
    mid             int8            not null,
    icid            int8            not null, 
    sender          varchar(512)    ,
    sender_dom      varchar(256)    generated always as
        (split_part(sender, '@', 2)) stored,
    sender_is_list  boolean         generated always as
        (right(sender, 3) = 'xxx'
        or right(sender, 3) = 'yyy') stored,
    accepted        boolean         default false,
    aborted         boolean         default false,
    avg_fl_entropy  float4          ,
    recipients      int2            ,
    subject         varchar(1024)   ,
    spf_helo        varchar(512)    ,
    spf_helo_st     varchar(16)     ,
    spf_mailfrom    varchar(512)    ,
    spf_mailfrom_st varchar(16)
);
select create_hypertable('messages', by_range('time', interval '1 day'));
alter table messages set (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'sender',
    timescaledb.compress_orderby = 'mid desc'
);
select add_compression_policy('messages', interval '3 days');
select add_retention_policy('messages', interval '1 year');

create index on messages (icid, time desc);
create index on messages (mid, time desc);
create index on messages (sender, time desc);
create index on messages (sender_dom, time desc);
create index on messages (sender_is_list, time desc);
create index on messages (accepted, time desc);
create index on messages (aborted, time desc);
create index on messages (recipients, time desc);
create index on messages (avg_fl_entropy, time desc);
create index on messages (subject, time desc);
create index on messages (sender, subject, time desc);
create index on messages (sender, accepted, time desc);
create index on messages (spf_helo, time desc);
create index on messages (spf_helo_st, time desc);
create index on messages (spf_mailfrom, time desc);
create index on messages (spf_mailfrom_st, time desc);

-- I also have a msg_recipients table but I highly doubt it's relevant, since the failing queries don't use it.
-- And a few view...

When I'm "backfilling" (reading, for example, 30 days of logs) I run this function after each log file (corresponding to 1 day of logs) is parsed and loaded into the db (the files are parsed sequentially, in ascending time order, working up to the current day):

create or replace function backfill_maintenance() returns void
language plpgsql as $$
begin
    perform compress_chunk(i, if_not_compressed => true)
        from show_chunks('connections', older_than => interval '3 days') i;

    perform compress_chunk(i, if_not_compressed => true)
        from show_chunks('messages', older_than => interval '3 days') i;

    perform compress_chunk(i, if_not_compressed => true)
        from show_chunks('msg_recipients', older_than => interval '3 days') i;
end
$$;

This is so:

  • I don't have a huge processing spike for compression at the end of the backfill
  • I don't risk using too much diskspace

However, as the log files do not stop exactly at midnight, I clearly do have some inserts happening on already-compressed chunks. I have no idea if this is related to the issue, but at least you have a bit of context.
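Whether those late inserts really landed in already-compressed chunks can be checked directly; a minimal sketch, using the standard timescaledb_information.chunks view:

```sql
-- list chunks of both hypertables with their compression state and time range
select hypertable_name, chunk_name, is_compressed, range_start, range_end
from timescaledb_information.chunks
where hypertable_name in ('messages', 'connections')
order by hypertable_name, range_start;
```

If the chunk covering the tail of a backfilled day shows is_compressed = true, the late log lines for that day were indeed inserted into a compressed chunk.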

It seems I have the issue right now on the live db. Switching to a left join still fixes it (an acceptable workaround in my case), and going back to a join makes the bug reappear.

Tell me if you want me to try something or have any further ideas! :)

(edit: I had an unclosed parenthesis in the messages table, fixed)

@jnidzwetzki
Contributor

Hi @nguiard,

Thanks for all the information. A colleague of mine made me aware that this problem could be triggered by a new optimization in PostgreSQL 16. Could you disable the setting enable_presorted_aggregate (SET enable_presorted_aggregate = off; - see the PostgreSQL documentation) and check whether this fixes the problem in your environment?
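As a sketch, the suggested check could look like this in a psql session, using the failing query quoted earlier (the SET is session-local and can be undone with RESET):

```sql
set enable_presorted_aggregate = off;

-- re-run the failing query; plain EXPLAIN is enough,
-- since the error is raised at planning time
explain
select
    count(c.*),
    count(distinct(c.host)) hosts
from
    messages m
    join connections c on (c.icid = m.icid)
where
    m.time > now() - interval '1 hours'
group by
    m.sender
;

reset enable_presorted_aggregate;
```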

@nguiard

nguiard commented Dec 22, 2023

Hey @jnidzwetzki, thanks for the suggestion. I just tried it, and it did not solve the issue (still the same behaviour: ERROR: could not find pathkey item to sort).

@jnidzwetzki
Contributor

Hi @nguiard,

Thanks for getting back. I tried to reproduce the behavior with PG 16 in my local environment, but I have not been able to trigger the error so far. Could you perform the following steps when the error happens again:

  1. Since the error message can be triggered in a few functions, could you execute set log_error_verbosity = VERBOSE; and run the query again? You should then see the name of the function that raises the error. This would help us a lot in understanding the problem better.

  2. Could you try to get a query plan for the query? This can be done by prefixing the query with EXPLAIN.
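Putting the two steps together, a session sketch using the failing query quoted earlier (note that log_error_verbosity is a superuser-only setting, and that plain EXPLAIN already triggers the error, since it is raised during planning):

```sql
set log_error_verbosity = VERBOSE;  -- server log now records file, function, and line for each error

explain
select
    count(c.*),
    count(distinct(c.host)) hosts
from
    messages m
    join connections c on (c.icid = m.icid)
where
    m.time > now() - interval '1 hours'
group by
    m.sender
;
```

The LOCATION field of the resulting server log entry then names the C function that raised "could not find pathkey item to sort", which is the information asked for in step 1.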
