[Bug]: Segfault after second ANALYZE with PG 14.5/TimescaleDB 2.8.1 #4857

Closed
kyrias opened this issue Oct 19, 2022 · 13 comments · Fixed by #5054

kyrias commented Oct 19, 2022

What type of bug is this?

Crash

What subsystems and features are affected?

Other

What happened?

When I execute ANALYZE twice from my application using SQLx, the second execution consistently causes the server process to segfault.

Strangely, this does not happen when repeatedly executing ANALYZE in psql.

TimescaleDB version affected

2.8.1

PostgreSQL version used

14.5

What operating system did you use?

Arch Linux x86_64

What installation method did you use?

Other

What platform did you run on?

Other

Relevant log output and stack trace

Oct 19 12:40:45 beryllium postgres[57717]: 2022-10-19 12:40:45.770 CEST [57717] LOG:  server process (PID 58663) was terminated by signal 11: Segmentation fault
Oct 19 12:40:45 beryllium postgres[57717]: 2022-10-19 12:40:45.770 CEST [57717] DETAIL:  Failed process was running: ANALYZE
Oct 19 12:40:45 beryllium postgres[57717]: 2022-10-19 12:40:45.770 CEST [57717] LOG:  terminating any other active server processes
Oct 19 12:40:45 beryllium postgres[57717]: 2022-10-19 12:40:45.771 CEST [57717] LOG:  all server processes terminated; reinitializing
Oct 19 12:40:45 beryllium postgres[58711]: 2022-10-19 12:40:45.785 CEST [58711] LOG:  database system was interrupted; last known up at 2022-10-19 12:40:40 CEST
Oct 19 12:40:45 beryllium postgres[58711]: 2022-10-19 12:40:45.785 CEST [58711] LOG:  database system was not properly shut down; automatic recovery in progress
Oct 19 12:40:45 beryllium postgres[58711]: 2022-10-19 12:40:45.785 CEST [58711] LOG:  redo starts at 0/C80EC4F8
Oct 19 12:40:45 beryllium postgres[58711]: 2022-10-19 12:40:45.797 CEST [58711] LOG:  redo done at 0/C853DF80 system usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s
Oct 19 12:40:45 beryllium postgres[57717]: 2022-10-19 12:40:45.803 CEST [57717] LOG:  database system is ready to accept connections
Oct 19 12:40:45 beryllium postgres[58717]: 2022-10-19 12:40:45.803 CEST [58717] LOG:  TimescaleDB background worker launcher connected to shared catalogs


#0  RangeVarGetRelidExtended (relation=0x0, lockmode=1, flags=0, callback=0x0, callback_arg=0x0) at catalog/namespace.c:246
#1  0x000055f41b74e6af in expand_vacuum_rel (options=66, vrel=0x55f41cd93710) at commands/vacuum.c:778
#2  vacuum (relations=0x55f41cda8b78, params=0x7ffed1ac1ed0, bstrategy=<optimized out>, isTopLevel=<optimized out>) at commands/vacuum.c:392
#3  0x000055f41b74f1c8 in ExecVacuum (pstate=0x55f41cd86120, vacstmt=vacstmt@entry=0x55f41cd8e6f0, isTopLevel=isTopLevel@entry=true) at commands/vacuum.c:268
#4  0x00007f896bfc900c in process_vacuum (args=0x7ffed1ac1ff0) at /usr/src/debug/timescaledb/src/process_utility.c:950
#5  0x00007f896bfc7f64 in process_ddl_command_start (args=0x7ffed1ac1ff0) at /usr/src/debug/timescaledb/src/process_utility.c:4156
#6  timescaledb_ddl_command_start (pstmt=0x55f41cd8e660, query_string=<optimized out>, readonly_tree=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, 
    params=0x0, queryEnv=<optimized out>, dest=<optimized out>, completion_tag=<optimized out>) at /usr/src/debug/timescaledb/src/process_utility.c:4399
#7  0x000055f41b900684 in ProcessUtility (qc=0x7ffed1ac2460, dest=0x55f41bd32720 <donothingDR.lto_priv.0>, queryEnv=<optimized out>, params=<optimized out>, 
    context=PROCESS_UTILITY_TOPLEVEL, readOnlyTree=<optimized out>, queryString=<optimized out>, pstmt=0x55f41cd8e660) at tcop/utility.c:523
#8  PortalRunUtility (portal=portal@entry=0x55f41cd09518, pstmt=pstmt@entry=0x55f41cd8e660, isTopLevel=isTopLevel@entry=true, 
    setHoldSnapshot=setHoldSnapshot@entry=false, dest=0x55f41bd32720 <donothingDR.lto_priv.0>, qc=0x7ffed1ac2460) at tcop/pquery.c:1155
#9  0x000055f41b900807 in PortalRunMulti (portal=portal@entry=0x55f41cd09518, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, 
    dest=0x55f41bd32720 <donothingDR.lto_priv.0>, dest@entry=0x55f41cc9be90, altdest=0x55f41bd32720 <donothingDR.lto_priv.0>, altdest@entry=0x55f41cc9be90, 
    qc=qc@entry=0x7ffed1ac2460) at tcop/pquery.c:1319
#10 0x000055f41b900d53 in PortalRun (portal=0x55f41cd09518, count=9223372036854775807, isTopLevel=<optimized out>, run_once=<optimized out>, 
    dest=0x55f41cc9be90, altdest=0x55f41cc9be90, qc=0x7ffed1ac2460) at tcop/pquery.c:788
#11 0x000055f41b90279e in exec_execute_message (max_rows=<optimized out>, portal_name=0x55f41cc9ba48 "") at tcop/postgres.c:2187
#12 PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at tcop/postgres.c:4557
#13 0x000055f41b8751b0 in BackendRun (port=0x55f41ccc7320) at postmaster/postmaster.c:4530
#14 BackendStartup (port=0x55f41ccc7320) at postmaster/postmaster.c:4252
#15 ServerLoop () at postmaster/postmaster.c:1745
#16 0x000055f41b875fb2 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at postmaster/postmaster.c:1417
#17 0x000055f41b58e14b in main (argc=3, argv=0x55f41cc95240) at main/main.c:209

How can we reproduce the bug?

I've only been able to reproduce it so far through SQLx in our proprietary application, not through `psql`.

If required, I could try to write a separate reproducing application using SQLx that I could publish, but literally all it does is execute `ANALYZE` twice over the same PostgreSQL connection, even against a database with no tables in it.
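
For reference, a minimal sketch of what such a reproducer might look like (this is an assumption about main.rs, not the actual contents of foo.tar.gz; it assumes sqlx with the runtime-tokio and postgres features and a locally reachable instance):

```rust
use sqlx::{Connection, PgConnection};

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    // Hypothetical connection URL; adjust to the instance under test.
    let mut conn =
        PgConnection::connect("postgres://postgres@localhost:5432/postgres").await?;

    // sqlx::query() uses the extended query protocol (parse/bind/execute),
    // which is what shows up as the prepared statement "sqlx_s_1" in the server logs.
    sqlx::query("ANALYZE").execute(&mut conn).await?;

    // The second execution over the same connection is the one that crashes the backend.
    sqlx::query("ANALYZE").execute(&mut conn).await?;

    Ok(())
}
```
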
kyrias added the bug label Oct 19, 2022
@fabriziomello (Contributor)

@kyrias you can try logging all commands executed by your SQLx tool by setting log_min_duration_statement=0 in your Postgres configuration. Then you can look in the logs to see all the statements that SQLx sends to the Postgres backend.


kyrias commented Oct 20, 2022

Oct 20 09:57:50 beryllium postgres[10119]: 2022-10-20 09:57:50.276 CEST [10119] LOG:  duration: 2.190 ms  parse sqlx_s_1: ANALYZE
Oct 20 09:57:50 beryllium postgres[10119]: 2022-10-20 09:57:50.276 CEST [10119] LOG:  duration: 0.004 ms  bind sqlx_s_1: ANALYZE
Oct 20 09:57:50 beryllium postgres[10119]: 2022-10-20 09:57:50.328 CEST [10119] LOG:  duration: 52.682 ms  execute sqlx_s_1: ANALYZE
Oct 20 09:57:50 beryllium postgres[10119]: 2022-10-20 09:57:50.329 CEST [10119] LOG:  duration: 0.005 ms  bind sqlx_s_1: ANALYZE
Oct 20 09:57:50 beryllium postgres[9977]: 2022-10-20 09:57:50.599 CEST [9977] LOG:  server process (PID 10119) was terminated by signal 11: Segmentation fault
Oct 20 09:57:50 beryllium postgres[9977]: 2022-10-20 09:57:50.599 CEST [9977] DETAIL:  Failed process was running: ANALYZE
Oct 20 09:57:50 beryllium postgres[9977]: 2022-10-20 09:57:50.599 CEST [9977] LOG:  terminating any other active server processes
Oct 20 09:57:50 beryllium postgres[9977]: 2022-10-20 09:57:50.599 CEST [9977] LOG:  all server processes terminated; reinitializing
Oct 20 09:57:50 beryllium postgres[10133]: 2022-10-20 09:57:50.610 CEST [10133] LOG:  database system was interrupted; last known up at 2022-10-20 09:57:27 CEST
Oct 20 09:57:50 beryllium postgres[10133]: 2022-10-20 09:57:50.610 CEST [10133] LOG:  database system was not properly shut down; automatic recovery in progress
Oct 20 09:57:50 beryllium postgres[10133]: 2022-10-20 09:57:50.610 CEST [10133] LOG:  redo starts at 0/CCAFC0C0
Oct 20 09:57:50 beryllium postgres[10133]: 2022-10-20 09:57:50.611 CEST [10133] LOG:  redo done at 0/CCB2FF70 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
Oct 20 09:57:50 beryllium postgres[9977]: 2022-10-20 09:57:50.614 CEST [9977] LOG:  database system is ready to accept connections
Oct 20 09:57:50 beryllium postgres[10139]: 2022-10-20 09:57:50.615 CEST [10139] LOG:  TimescaleDB background worker launcher connected to shared catalogs

@fabriziomello (Contributor)

@kyrias thanks for all the information. I know what's happening based on your info and stack trace, but I can't reproduce it yet. Can you share a `bt full` stack trace? Another thing that would help is if you could share your database schema (i.e. `pg_dump -s`).


kyrias commented Oct 21, 2022

The same thing happens with an entirely empty database with only the extension created.

I can reproduce it with the following:

podman run --name timescaledb -p 5432:5432 -e POSTGRES_HOST_AUTH_METHOD=trust docker.io/timescale/timescaledb-ha:pg14-latest
tar -xf foo.tar.gz
cd foo
cargo run

foo.tar.gz


kyrias commented Nov 2, 2022

@fabriziomello Were you able to reproduce it with the above instructions?

@svenklemm (Member)

Looks like you are using the extended query protocol, which psql doesn't support at the moment.


sb230132 commented Dec 5, 2022

Hi @kyrias,
I tried the same steps you provided:

% cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.08s
     Running `target/debug/foo`

I enabled log_statement and see these logs:

2022-12-05 15:30:29.007 IST [9911] [txn id: 0] LOG:  connection received: host=::1 port=54913
2022-12-05 15:30:29.007 IST [9911] [txn id: 0] LOG:  connection authorized: user=postgres database=postgres
2022-12-05 15:30:29.009 IST [9911] [txn id: 0] LOG:  execute sqlx_s_1: ANALYZE
2022-12-05 15:30:29.098 IST [9911] [txn id: 0] LOG:  execute sqlx_s_1: ANALYZE

With the timescaledb extension installed, I see a crash when I run `cargo run` more than once:

2022-12-05 15:35:20.717 IST [9990] [txn id: 0] LOG:  execute sqlx_s_1: ANALYZE
2022-12-05 15:35:20.824 IST [9990] [txn id: 0] LOG:  execute sqlx_s_1: ANALYZE
2022-12-05 15:35:30.422 IST [9994] [txn id: 0] LOG:  connection received: host=::1 port=54934
2022-12-05 15:35:30.422 IST [9994] [txn id: 0] LOG:  connection authorized: user=postgres database=postgres
2022-12-05 15:35:30.433 IST [9994] [txn id: 0] LOG:  execute sqlx_s_1: ANALYZE
2022-12-05 15:35:30.537 IST [9994] [txn id: 0] LOG:  execute sqlx_s_1: ANALYZE
2022-12-05 15:35:30.622 IST [9972] [txn id: 0] LOG:  server process (PID 9994) was terminated by signal 11: Segmentation fault: 11
2022-12-05 15:35:30.622 IST [9972] [txn id: 0] DETAIL:  Failed process was running: ANALYZE
2022-12-05 15:35:30.622 IST [9972] [txn id: 0] LOG:  terminating any other active server processes
2022-12-05 15:35:30.623 IST [9972] [txn id: 0] LOG:  all server processes terminated; reinitializing
2022-12-05 15:35:30.626 IST [9995] [txn id: 0] LOG:  database system was interrupted; last known up at 2022-12-05 15:35:05 IST
2022-12-05 15:35:30.652 IST [9995] [txn id: 0] LOG:  database system was not properly shut down; automatic recovery in progress
2022-12-05 15:35:30.652 IST [9995] [txn id: 0] LOG:  redo starts at 0/1794FA8
2022-12-05 15:35:30.665 IST [9995] [txn id: 0] LOG:  invalid record length at 0/1A2E788: wanted 24, got 0
2022-12-05 15:35:30.665 IST [9995] [txn id: 0] LOG:  redo done at 0/1A2E700 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.01 s
2022-12-05 15:35:30.673 IST [9972] [txn id: 0] LOG:  database system is ready to accept connections
2022-12-05 15:35:30.674 IST [10001] [txn id: 0] LOG:  TimescaleDB background worker launcher connected to shared catalogs

Thanks for the reproducible steps. I will analyse this further and try to find out what the issue is.


sb230132 commented Dec 5, 2022

Here is the stack trace from the core file:

[New LWP 903769]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: postgres postgres 127.0.0.1(45590) ANALYZE'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000056459f6589c3 in vacuum_open_relation ()
(gdb) bt
#0  0x000056459f6589c3 in vacuum_open_relation ()
#1  0x000056459f5f89b9 in analyze_rel ()
#2  0x000056459f65a04f in vacuum ()
#3  0x000056459f65a5b4 in ExecVacuum ()
#4  0x00007fb37149d8ea in process_vacuum () from /usr/local/postgresql-14.6/lib/timescaledb-2.9.0-dev.so
#5  0x00007fb37149d0a1 in timescaledb_ddl_command_start () from /usr/local/postgresql-14.6/lib/timescaledb-2.9.0-dev.so
#6  0x000056459f7d82b1 in PortalRunUtility ()
#7  0x000056459f7d83bf in PortalRunMulti ()
#8  0x000056459f7d8901 in PortalRun ()
#9  0x000056459f7d606d in PostgresMain ()
#10 0x000056459f7575e6 in ServerLoop ()
#11 0x000056459f758459 in PostmasterMain ()
#12 0x000056459f4e7220 in main ()


sb230132 commented Dec 5, 2022

Hi @kyrias,
I suspect that "ANALYZE" is being executed in parallel.

I would like to understand more about the main.rs program.
Does the program spawn multiple threads and execute them in parallel?
What do the terms "async" and "await" in the program do?

If so, can we try a synchronous version of the same program?


kyrias commented Dec 5, 2022

There is no visible parallelism in this program. The database library used is an async library which allows for tasks to run concurrently, but there is also no concurrency in the main function of this program. It connects to the database, executes one query and waits for the result, and then executes another query and waits for the result.


sb230132 commented Dec 5, 2022

OK. Unfortunately I could not reproduce this with the psql utility, which would be of much help for me in debugging and finding the root cause.
If you can share the same program with a sleep between the two execute() API calls, it would help me put the connection's backend under a debugger and check.


kyrias commented Dec 5, 2022

> OK. Unfortunately I could not reproduce this with the psql utility, which would be of much help for me in debugging and finding the root cause.

That is mentioned in the issue description, and as svenklemm pointed out, it appears to be due to psql not supporting the extended query protocol.

> If you can share the same program with a sleep between the two execute() API calls, it would help me put the connection's backend under a debugger and check.

You can add `tokio::time::sleep(std::time::Duration::from_secs(5)).await;` wherever you wish, or in this specific case even `std::thread::sleep`.
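
For example, sketched against the hypothetical reproducer above (the exact placement in main.rs doesn't matter):

```rust
// First ANALYZE over the connection.
sqlx::query("ANALYZE").execute(&mut conn).await?;

// Pause so a debugger can be attached to the backend process
// before the second execution (the one that segfaults).
tokio::time::sleep(std::time::Duration::from_secs(5)).await;

// Second ANALYZE over the same connection.
sqlx::query("ANALYZE").execute(&mut conn).await?;
```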


sb230132 commented Dec 6, 2022

Here is a very simple test case that can be used to reproduce the issue via psql:

do
$$
declare 
 o int;
begin
 for c in 1..2 loop
  ANALYZE;
 end loop;
end;
$$;

SQL statements in plpgsql have an implicit prepared statement created for them, which I guess uses the extended query protocol.

sb230132 self-assigned this Dec 6, 2022
sb230132 added a commit that referenced this issue Dec 12, 2022
The issue occurs in extended query protocol mode only, where every query goes through the PREPARE and EXECUTE phases. The first time ANALYZE is executed, the relations to be vacuumed are extracted and saved in a list. This list is referenced from the parsetree node. Once execution of ANALYZE is complete, the list is cleaned up; however, the reference to it in the parsetree node is not. When ANALYZE is executed a second time, a segfault happens because we access an invalid memory location.

Fixed the issue by restoring the actual value in the parsetree node once ANALYZE completes its execution.

Fixes #4857
svenklemm pushed a commit that referenced this issue Dec 15, 2022