Regresscheck tests hanging on PG15 #4838

Closed
sb230132 opened this issue Oct 17, 2022 · 3 comments · Fixed by #4954
sb230132 (Contributor) commented Oct 17, 2022

What type of bug is this?

Unexpected error

What subsystems and features are affected?

Data node

What happened?

TimescaleDB, when compiled against PG15, causes several regression tests to hang forever.

TimescaleDB version affected

2.9.0

PostgreSQL version used

15.0

What operating system did you use?

Mac OS X

What installation method did you use?

Source

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

No response

How can we reproduce the bug?

make installcheck TESTS="data_node_bootstrap dist_hypertable-15 bgw_custom"
@sb230132 sb230132 added the bug label Oct 17, 2022
@lkshminarayanan lkshminarayanan self-assigned this Nov 7, 2022
lkshminarayanan (Contributor) commented Nov 9, 2022

The test cases data_node_bootstrap and dist_hypertable-15 fail due to an issue in the delete_data_node function call.

delete_data_node gets stuck when the drop_database option is enabled.

Simple steps to reproduce the issue on PG15:

SELECT node_name, database, node_created, database_created, extension_created
    FROM add_data_node('bootstrap_test', host => 'localhost', database => 'bootstrap_test', bootstrap => true);
SELECT * FROM delete_data_node('bootstrap_test', drop_database => true);

The DROP DATABASE command sent by the drop_data_node_database function gets stuck, causing the hang.

The server logs keep emitting these lines while the query is stuck:

2022-11-09 15:16:52.514 IST [2621641] LOG:  still waiting for backend with PID 2618447 to accept ProcSignalBarrier
2022-11-09 15:16:52.514 IST [2621641] STATEMENT:  DROP DATABASE bootstrap_test
2022-11-09 15:16:57.517 IST [2621641] LOG:  still waiting for backend with PID 2618447 to accept ProcSignalBarrier
2022-11-09 15:16:57.517 IST [2621641] STATEMENT:  DROP DATABASE bootstrap_test
...

The query succeeds if the drop_database option is not enabled.

lkshminarayanan (Contributor) commented

The other test, bgw_custom, doesn't fail locally on my machine, and @sb230132 mentioned that it fails randomly in CI and is not reproducible locally. I'll check whether I can dig out the relevant failure logs from CI.

lkshminarayanan (Contributor) commented

The hang is caused by a new ProcSignalBarrier mechanism introduced in PG15.

The backend executing DROP DATABASE waits for the backend executing SELECT * FROM delete_data_node('bootstrap_test', drop_database => true); to accept the process signal barrier it just emitted. But the delete_data_node backend cannot process that signal, because it is itself blocked waiting for the DROP DATABASE query to complete. The two backends thus end up in a deadlock.

The PostgreSQL changes that introduced this behavior are postgres/postgres@4eb217631 and postgres/postgres@e2f65f42555.
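For context, the colliding pattern looks roughly like this (an illustrative sketch, not the actual dropdb() or delete_data_node source; EmitProcSignalBarrier, WaitForProcSignalBarrier, PROCSIGNAL_BARRIER_SMGRRELEASE, and ProcessProcSignalBarrier are real PostgreSQL backend symbols, but the surrounding code is simplified):

/* Backend A, inside PG15's DROP DATABASE path: emit a barrier and
 * block until every other backend has absorbed it and closed its
 * file handles for the dropped database. */
WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));

/* Backend B, inside delete_data_node: a blocking libpq-style call
 * that waits for the remote DROP DATABASE to return, e.g.: */
PGresult *res = PQexec(conn, "DROP DATABASE bootstrap_test");

/* Backend B never reaches CHECK_FOR_INTERRUPTS() while PQexec()
 * blocks, so ProcessProcSignalBarrier() never runs, backend A's
 * barrier is never absorbed, and the two backends deadlock. */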

lkshminarayanan added a commit that referenced this issue Nov 17, 2022
PG15 introduced a ProcSignalBarrier mechanism in the drop database
implementation to force all backends to close their file handles for
dropped tables. The backend executing the drop database command emits a
new process signal barrier and waits for the other backends to accept
it. But the backend executing the delete_data_node function cannot
process that signal, as it is stuck waiting for the drop database query
to return. Thus the two backends end up waiting for each other, causing
a deadlock.

Fixed it by using the async API to execute the drop database command
from delete_data_node instead of the blocking remote_connection_cmdf_ok
call.

Fixes #4838
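In outline, the fix replaces a blocking call with an asynchronous send plus an interrupt-aware wait loop. Below is a minimal libpq-level sketch of that idea; pg_conn, drop_cmd, and dbname are hypothetical locals, and this is not the actual TimescaleDB patch, which goes through its internal async remote-connection API:

/* Before (simplified): a blocking call, named in the commit message,
 * that cannot service interrupts while the remote command runs. */
remote_connection_cmdf_ok(conn, "DROP DATABASE %s", dbname);

/* After, in spirit: send the command without blocking, then wait in a
 * loop that stays responsive to interrupts. */
if (!PQsendQuery(pg_conn, drop_cmd))
    elog(ERROR, "failed to send DROP DATABASE: %s", PQerrorMessage(pg_conn));

while (PQisBusy(pg_conn))
{
    /* Sleep until the socket is readable or the latch is set. */
    int rc = WaitLatchOrSocket(MyLatch,
                               WL_LATCH_SET | WL_SOCKET_READABLE | WL_EXIT_ON_PM_DEATH,
                               PQsocket(pg_conn), -1L, PG_WAIT_EXTENSION);

    if (rc & WL_LATCH_SET)
        ResetLatch(MyLatch);

    /* The crucial difference: interrupts, including the
     * ProcSignalBarrier emitted by the remote DROP DATABASE,
     * are processed here instead of being stuck behind libpq. */
    CHECK_FOR_INTERRUPTS();

    if ((rc & WL_SOCKET_READABLE) && !PQconsumeInput(pg_conn))
        elog(ERROR, "connection lost: %s", PQerrorMessage(pg_conn));
}

PGresult *result = PQgetResult(pg_conn);

Because the waiting backend now passes through CHECK_FOR_INTERRUPTS() on every wakeup, it can absorb the barrier and the DROP DATABASE on the other side can complete.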
SachinSetiya pushed a commit that referenced this issue Nov 28, 2022