[Bug]: Segfault after second ANALYZE with PG 14.5/TimescaleDB 2.8.1 #4857

Closed
kyrias opened this issue Oct 19, 2022 · 13 comments · Fixed by #5054

kyrias commented Oct 19, 2022

What type of bug is this?

Crash

What subsystems and features are affected?

Other

What happened?

When I execute ANALYZE twice from my application using SQLx, the second execution consistently causes the server process to segfault.

Strangely, this does not happen when repeatedly executing ANALYZE in psql.

TimescaleDB version affected

2.8.1

PostgreSQL version used

14.5

What operating system did you use?

Arch Linux x86_64

What installation method did you use?

Other

What platform did you run on?

Other

Relevant log output and stack trace

Oct 19 12:40:45 beryllium postgres[57717]: 2022-10-19 12:40:45.770 CEST [57717] LOG:  server process (PID 58663) was terminated by signal 11: Segmentation fault
Oct 19 12:40:45 beryllium postgres[57717]: 2022-10-19 12:40:45.770 CEST [57717] DETAIL:  Failed process was running: ANALYZE
Oct 19 12:40:45 beryllium postgres[57717]: 2022-10-19 12:40:45.770 CEST [57717] LOG:  terminating any other active server processes
Oct 19 12:40:45 beryllium postgres[57717]: 2022-10-19 12:40:45.771 CEST [57717] LOG:  all server processes terminated; reinitializing
Oct 19 12:40:45 beryllium postgres[58711]: 2022-10-19 12:40:45.785 CEST [58711] LOG:  database system was interrupted; last known up at 2022-10-19 12:40:40 CEST
Oct 19 12:40:45 beryllium postgres[58711]: 2022-10-19 12:40:45.785 CEST [58711] LOG:  database system was not properly shut down; automatic recovery in progress
Oct 19 12:40:45 beryllium postgres[58711]: 2022-10-19 12:40:45.785 CEST [58711] LOG:  redo starts at 0/C80EC4F8
Oct 19 12:40:45 beryllium postgres[58711]: 2022-10-19 12:40:45.797 CEST [58711] LOG:  redo done at 0/C853DF80 system usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s
Oct 19 12:40:45 beryllium postgres[57717]: 2022-10-19 12:40:45.803 CEST [57717] LOG:  database system is ready to accept connections
Oct 19 12:40:45 beryllium postgres[58717]: 2022-10-19 12:40:45.803 CEST [58717] LOG:  TimescaleDB background worker launcher connected to shared catalogs


#0  RangeVarGetRelidExtended (relation=0x0, lockmode=1, flags=0, callback=0x0, callback_arg=0x0) at catalog/namespace.c:246
#1  0x000055f41b74e6af in expand_vacuum_rel (options=66, vrel=0x55f41cd93710) at commands/vacuum.c:778
#2  vacuum (relations=0x55f41cda8b78, params=0x7ffed1ac1ed0, bstrategy=<optimized out>, isTopLevel=<optimized out>) at commands/vacuum.c:392
#3  0x000055f41b74f1c8 in ExecVacuum (pstate=0x55f41cd86120, vacstmt=vacstmt@entry=0x55f41cd8e6f0, isTopLevel=isTopLevel@entry=true) at commands/vacuum.c:268
#4  0x00007f896bfc900c in process_vacuum (args=0x7ffed1ac1ff0) at /usr/src/debug/timescaledb/src/process_utility.c:950
#5  0x00007f896bfc7f64 in process_ddl_command_start (args=0x7ffed1ac1ff0) at /usr/src/debug/timescaledb/src/process_utility.c:4156
#6  timescaledb_ddl_command_start (pstmt=0x55f41cd8e660, query_string=<optimized out>, readonly_tree=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, 
    params=0x0, queryEnv=<optimized out>, dest=<optimized out>, completion_tag=<optimized out>) at /usr/src/debug/timescaledb/src/process_utility.c:4399
#7  0x000055f41b900684 in ProcessUtility (qc=0x7ffed1ac2460, dest=0x55f41bd32720 <donothingDR.lto_priv.0>, queryEnv=<optimized out>, params=<optimized out>, 
    context=PROCESS_UTILITY_TOPLEVEL, readOnlyTree=<optimized out>, queryString=<optimized out>, pstmt=0x55f41cd8e660) at tcop/utility.c:523
#8  PortalRunUtility (portal=portal@entry=0x55f41cd09518, pstmt=pstmt@entry=0x55f41cd8e660, isTopLevel=isTopLevel@entry=true, 
    setHoldSnapshot=setHoldSnapshot@entry=false, dest=0x55f41bd32720 <donothingDR.lto_priv.0>, qc=0x7ffed1ac2460) at tcop/pquery.c:1155
#9  0x000055f41b900807 in PortalRunMulti (portal=portal@entry=0x55f41cd09518, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, 
    dest=0x55f41bd32720 <donothingDR.lto_priv.0>, dest@entry=0x55f41cc9be90, altdest=0x55f41bd32720 <donothingDR.lto_priv.0>, altdest@entry=0x55f41cc9be90, 
    qc=qc@entry=0x7ffed1ac2460) at tcop/pquery.c:1319
#10 0x000055f41b900d53 in PortalRun (portal=0x55f41cd09518, count=9223372036854775807, isTopLevel=<optimized out>, run_once=<optimized out>, 
    dest=0x55f41cc9be90, altdest=0x55f41cc9be90, qc=0x7ffed1ac2460) at tcop/pquery.c:788
#11 0x000055f41b90279e in exec_execute_message (max_rows=<optimized out>, portal_name=0x55f41cc9ba48 "") at tcop/postgres.c:2187
#12 PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at tcop/postgres.c:4557
#13 0x000055f41b8751b0 in BackendRun (port=0x55f41ccc7320) at postmaster/postmaster.c:4530
#14 BackendStartup (port=0x55f41ccc7320) at postmaster/postmaster.c:4252
#15 ServerLoop () at postmaster/postmaster.c:1745
#16 0x000055f41b875fb2 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at postmaster/postmaster.c:1417
#17 0x000055f41b58e14b in main (argc=3, argv=0x55f41cc95240) at main/main.c:209

How can we reproduce the bug?

I've only been able to reproduce it so far through SQLx in our proprietary application, not through `psql`.

If required, I could try to write a separate reproducing application using SQLx that I could publish, but literally all it does is execute `ANALYZE` twice over the same PostgreSQL connection, even against a database with no tables in it.
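
For reference, a minimal sketch of what such a reproducer might look like (this is an assumption about main.rs, not the actual contents of foo.tar.gz; it assumes sqlx with the runtime-tokio and postgres features and a locally reachable instance):

```rust
use sqlx::{Connection, PgConnection};

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    // Hypothetical connection URL; adjust to the instance under test.
    let mut conn =
        PgConnection::connect("postgres://postgres@localhost:5432/postgres").await?;

    // sqlx::query() uses the extended query protocol (parse/bind/execute),
    // which is what shows up as the prepared statement "sqlx_s_1" in the server logs.
    sqlx::query("ANALYZE").execute(&mut conn).await?;

    // The second execution over the same connection is the one that crashes the backend.
    sqlx::query("ANALYZE").execute(&mut conn).await?;

    Ok(())
}
```
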
kyrias added the bug label Oct 19, 2022
@fabriziomello (Contributor)

@kyrias you can try logging all commands executed by your SQLx tool by setting log_min_duration_statement=0 in your Postgres configuration. Then you can look in the logs to see all the statements that SQLx sends to the Postgres backend.


kyrias commented Oct 20, 2022

Oct 20 09:57:50 beryllium postgres[10119]: 2022-10-20 09:57:50.276 CEST [10119] LOG:  duration: 2.190 ms  parse sqlx_s_1: ANALYZE
Oct 20 09:57:50 beryllium postgres[10119]: 2022-10-20 09:57:50.276 CEST [10119] LOG:  duration: 0.004 ms  bind sqlx_s_1: ANALYZE
Oct 20 09:57:50 beryllium postgres[10119]: 2022-10-20 09:57:50.328 CEST [10119] LOG:  duration: 52.682 ms  execute sqlx_s_1: ANALYZE
Oct 20 09:57:50 beryllium postgres[10119]: 2022-10-20 09:57:50.329 CEST [10119] LOG:  duration: 0.005 ms  bind sqlx_s_1: ANALYZE
Oct 20 09:57:50 beryllium postgres[9977]: 2022-10-20 09:57:50.599 CEST [9977] LOG:  server process (PID 10119) was terminated by signal 11: Segmentation fault
Oct 20 09:57:50 beryllium postgres[9977]: 2022-10-20 09:57:50.599 CEST [9977] DETAIL:  Failed process was running: ANALYZE
Oct 20 09:57:50 beryllium postgres[9977]: 2022-10-20 09:57:50.599 CEST [9977] LOG:  terminating any other active server processes
Oct 20 09:57:50 beryllium postgres[9977]: 2022-10-20 09:57:50.599 CEST [9977] LOG:  all server processes terminated; reinitializing
Oct 20 09:57:50 beryllium postgres[10133]: 2022-10-20 09:57:50.610 CEST [10133] LOG:  database system was interrupted; last known up at 2022-10-20 09:57:27 CEST
Oct 20 09:57:50 beryllium postgres[10133]: 2022-10-20 09:57:50.610 CEST [10133] LOG:  database system was not properly shut down; automatic recovery in progress
Oct 20 09:57:50 beryllium postgres[10133]: 2022-10-20 09:57:50.610 CEST [10133] LOG:  redo starts at 0/CCAFC0C0
Oct 20 09:57:50 beryllium postgres[10133]: 2022-10-20 09:57:50.611 CEST [10133] LOG:  redo done at 0/CCB2FF70 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
Oct 20 09:57:50 beryllium postgres[9977]: 2022-10-20 09:57:50.614 CEST [9977] LOG:  database system is ready to accept connections
Oct 20 09:57:50 beryllium postgres[10139]: 2022-10-20 09:57:50.615 CEST [10139] LOG:  TimescaleDB background worker launcher connected to shared catalogs

@fabriziomello (Contributor)

@kyrias thanks for all the information. I know what's happening based on your info and stack trace, but I can't reproduce it yet. Can you share a `bt full` stack trace? Another thing that would help is if you could share your database schema (i.e. `pg_dump -s`).


kyrias commented Oct 21, 2022

The same thing happens with an entirely empty database with only the extension created.

I can reproduce it with the following:

podman run --name timescaledb -p 5432:5432 -e POSTGRES_HOST_AUTH_METHOD=trust docker.io/timescale/timescaledb-ha:pg14-latest
tar -xf foo.tar.gz
cd foo
cargo run

foo.tar.gz


kyrias commented Nov 2, 2022

@fabriziomello Were you able to reproduce it with the above instructions?

@svenklemm (Member)

Looks like you are using the extended query protocol, which psql doesn't support at the moment.


sb230132 commented Dec 5, 2022

Hi @kyrias,
I tried the same steps you provided:

% cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.08s
     Running `target/debug/foo`

I enabled log_statement and see these logs:

2022-12-05 15:30:29.007 IST [9911] [txn id: 0] LOG:  connection received: host=::1 port=54913
2022-12-05 15:30:29.007 IST [9911] [txn id: 0] LOG:  connection authorized: user=postgres database=postgres
2022-12-05 15:30:29.009 IST [9911] [txn id: 0] LOG:  execute sqlx_s_1: ANALYZE
2022-12-05 15:30:29.098 IST [9911] [txn id: 0] LOG:  execute sqlx_s_1: ANALYZE

With the timescaledb extension installed, I see a crash when I run `cargo run` more than once:

2022-12-05 15:35:20.717 IST [9990] [txn id: 0] LOG:  execute sqlx_s_1: ANALYZE
2022-12-05 15:35:20.824 IST [9990] [txn id: 0] LOG:  execute sqlx_s_1: ANALYZE
2022-12-05 15:35:30.422 IST [9994] [txn id: 0] LOG:  connection received: host=::1 port=54934
2022-12-05 15:35:30.422 IST [9994] [txn id: 0] LOG:  connection authorized: user=postgres database=postgres
2022-12-05 15:35:30.433 IST [9994] [txn id: 0] LOG:  execute sqlx_s_1: ANALYZE
2022-12-05 15:35:30.537 IST [9994] [txn id: 0] LOG:  execute sqlx_s_1: ANALYZE
2022-12-05 15:35:30.622 IST [9972] [txn id: 0] LOG:  server process (PID 9994) was terminated by signal 11: Segmentation fault: 11
2022-12-05 15:35:30.622 IST [9972] [txn id: 0] DETAIL:  Failed process was running: ANALYZE
2022-12-05 15:35:30.622 IST [9972] [txn id: 0] LOG:  terminating any other active server processes
2022-12-05 15:35:30.623 IST [9972] [txn id: 0] LOG:  all server processes terminated; reinitializing
2022-12-05 15:35:30.626 IST [9995] [txn id: 0] LOG:  database system was interrupted; last known up at 2022-12-05 15:35:05 IST
2022-12-05 15:35:30.652 IST [9995] [txn id: 0] LOG:  database system was not properly shut down; automatic recovery in progress
2022-12-05 15:35:30.652 IST [9995] [txn id: 0] LOG:  redo starts at 0/1794FA8
2022-12-05 15:35:30.665 IST [9995] [txn id: 0] LOG:  invalid record length at 0/1A2E788: wanted 24, got 0
2022-12-05 15:35:30.665 IST [9995] [txn id: 0] LOG:  redo done at 0/1A2E700 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.01 s
2022-12-05 15:35:30.673 IST [9972] [txn id: 0] LOG:  database system is ready to accept connections
2022-12-05 15:35:30.674 IST [10001] [txn id: 0] LOG:  TimescaleDB background worker launcher connected to shared catalogs

Thanks for the reproducible steps. I will analyse this further and try to find out what the issue is.


sb230132 commented Dec 5, 2022

Here is the stack trace from the core file:

[New LWP 903769]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: postgres postgres 127.0.0.1(45590) ANALYZE'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000056459f6589c3 in vacuum_open_relation ()
(gdb) bt
#0  0x000056459f6589c3 in vacuum_open_relation ()
#1  0x000056459f5f89b9 in analyze_rel ()
#2  0x000056459f65a04f in vacuum ()
#3  0x000056459f65a5b4 in ExecVacuum ()
#4  0x00007fb37149d8ea in process_vacuum () from /usr/local/postgresql-14.6/lib/timescaledb-2.9.0-dev.so
#5  0x00007fb37149d0a1 in timescaledb_ddl_command_start () from /usr/local/postgresql-14.6/lib/timescaledb-2.9.0-dev.so
#6  0x000056459f7d82b1 in PortalRunUtility ()
#7  0x000056459f7d83bf in PortalRunMulti ()
#8  0x000056459f7d8901 in PortalRun ()
#9  0x000056459f7d606d in PostgresMain ()
#10 0x000056459f7575e6 in ServerLoop ()
#11 0x000056459f758459 in PostmasterMain ()
#12 0x000056459f4e7220 in main ()


sb230132 commented Dec 5, 2022

Hi @kyrias,
I suspect that "ANALYZE" is being executed in parallel.

I would like to understand more about the main.rs program.
Does the program spawn multiple threads and execute them in parallel?
What do the terms "async" and "await" in the program do?

If so, can we try a synchronous version of the same program?


kyrias commented Dec 5, 2022

There is no visible parallelism in this program. The database library used is an async library which allows for tasks to run concurrently, but there is also no concurrency in the main function of this program. It connects to the database, executes one query and waits for the result, and then executes another query and waits for the result.


sb230132 commented Dec 5, 2022

OK. Unfortunately I could not reproduce this with the psql utility, which would be of much help for me in debugging and finding the root cause.
If you can share the same program with a sleep between the two execute() API calls, it would help me put the connection's backend under a debugger and check.


kyrias commented Dec 5, 2022

> OK. Unfortunately I could not reproduce this with the psql utility, which would be of much help for me in debugging and finding the root cause.

That is mentioned in the issue description, and as svenklemm pointed out, it appears to be due to psql not supporting the extended query protocol.

> If you can share the same program with a sleep between the two execute() API calls, it would help me put the connection's backend under a debugger and check.

You can add `tokio::time::sleep(std::time::Duration::from_secs(5)).await;` wherever you wish, or in this specific case even `std::thread::sleep`.
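
For example, sketched against the hypothetical reproducer above (the exact placement in main.rs doesn't matter):

```rust
// First ANALYZE over the connection.
sqlx::query("ANALYZE").execute(&mut conn).await?;

// Pause so a debugger can be attached to the backend process
// before the second execution (the one that segfaults).
tokio::time::sleep(std::time::Duration::from_secs(5)).await;

// Second ANALYZE over the same connection.
sqlx::query("ANALYZE").execute(&mut conn).await?;
```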


sb230132 commented Dec 6, 2022

Here is a very simple test case that can be used to reproduce the issue via psql:

do
$$
declare 
 o int;
begin
 for c in 1..2 loop
  ANALYZE;
 end loop;
end;
$$;

SQL statements in plpgsql have an implicit prepared statement created for them, which I guess uses the extended query protocol.

sb230132 self-assigned this Dec 6, 2022
sb230132 added a commit that referenced this issue Dec 12, 2022
The issue occurs in extended query protocol mode only, where every query goes through the PREPARE and EXECUTE phases. The first time ANALYZE is executed, the relations to be vacuumed are extracted and saved in a list. This list is referenced from the parsetree node. Once execution of ANALYZE is complete, the list is cleaned up; however, the reference to it in the parsetree node is not. When ANALYZE is executed a second time, a segfault happens because we access an invalid memory location.

Fixed the issue by restoring the actual value in the parsetree node once ANALYZE completes its execution.

Fixes #4857
svenklemm pushed a commit that referenced this issue Dec 15, 2022