Regresscheck tests hanging on PG15 #4838

Closed
sb230132 opened this issue Oct 17, 2022 · 3 comments · Fixed by #4954
sb230132 (Contributor) commented Oct 17, 2022

What type of bug is this?

Unexpected error

What subsystems and features are affected?

Data node

What happened?

TimescaleDB, when compiled against PG15, causes several regression tests to hang forever.

TimescaleDB version affected

2.9.0

PostgreSQL version used

15.0

What operating system did you use?

Mac OS X

What installation method did you use?

Source

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

No response

How can we reproduce the bug?

make installcheck TESTS="data_node_bootstrap dist_hypertable-15 bgw_custom"
@sb230132 sb230132 added the bug label Oct 17, 2022
@lkshminarayanan lkshminarayanan self-assigned this Nov 7, 2022
lkshminarayanan (Contributor) commented Nov 9, 2022

The test cases data_node_bootstrap and dist_hypertable-15 fail due to an issue in the delete_data_node function call.

delete_data_node gets stuck when the drop_database option is enabled.

Simple steps to reproduce the issue on PG15:

SELECT node_name, database, node_created, database_created, extension_created
    FROM add_data_node('bootstrap_test', host => 'localhost', database => 'bootstrap_test', bootstrap => true);
SELECT * FROM delete_data_node('bootstrap_test', drop_database => true);

The DROP DATABASE command sent by the drop_data_node_database function gets stuck, causing the hang.

The server logs keep emitting these lines while the query is stuck:

2022-11-09 15:16:52.514 IST [2621641] LOG:  still waiting for backend with PID 2618447 to accept ProcSignalBarrier
2022-11-09 15:16:52.514 IST [2621641] STATEMENT:  DROP DATABASE bootstrap_test
2022-11-09 15:16:57.517 IST [2621641] LOG:  still waiting for backend with PID 2618447 to accept ProcSignalBarrier
2022-11-09 15:16:57.517 IST [2621641] STATEMENT:  DROP DATABASE bootstrap_test
...

The query succeeds if the drop_database option is not enabled.

lkshminarayanan (Contributor) commented

The other test, bgw_custom, doesn't fail locally on my machine, and @sb230132 mentioned that it fails randomly in CI and is not reproducible locally. I'll check whether I can dig out the relevant failure logs from CI.

lkshminarayanan (Contributor) commented

The hang is caused by a new ProcSignalBarrier mechanism introduced in PG15.

The backend executing DROP DATABASE waits for the backend executing SELECT * FROM delete_data_node('bootstrap_test', drop_database => true); to accept the process signal barrier it just emitted. But the delete_data_node backend cannot process that signal, because it is itself blocked waiting for the DROP DATABASE query to complete. The two backends thus end up in a deadlock.

The PostgreSQL changes that introduced this behavior are postgres/postgres@4eb217631 and postgres/postgres@e2f65f42555.
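For context, the colliding pattern looks roughly like this (an illustrative sketch, not the actual dropdb() or delete_data_node source; EmitProcSignalBarrier, WaitForProcSignalBarrier, PROCSIGNAL_BARRIER_SMGRRELEASE, and ProcessProcSignalBarrier are real PostgreSQL backend symbols, but the surrounding code is simplified):

/* Backend A, inside PG15's DROP DATABASE path: emit a barrier and
 * block until every other backend has absorbed it and closed its
 * file handles for the dropped database. */
WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));

/* Backend B, inside delete_data_node: a blocking libpq-style call
 * that waits for the remote DROP DATABASE to return, e.g.: */
PGresult *res = PQexec(conn, "DROP DATABASE bootstrap_test");

/* Backend B never reaches CHECK_FOR_INTERRUPTS() while PQexec()
 * blocks, so ProcessProcSignalBarrier() never runs, backend A's
 * barrier is never absorbed, and the two backends deadlock. */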

lkshminarayanan added a commit that referenced this issue Nov 17, 2022
PG15 introduced a ProcSignalBarrier mechanism in the drop database
implementation to force all backends to close their file handles for
dropped tables. The backend executing the drop database command emits a
new process signal barrier and waits for the other backends to accept
it. But the backend executing the delete_data_node function cannot
process that signal, as it is stuck waiting for the drop database query
to return. Thus the two backends end up waiting for each other, causing
a deadlock.

Fixed it by using the async API to execute the drop database command
from delete_data_node instead of the blocking remote_connection_cmdf_ok
call.

Fixes #4838
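In outline, the fix replaces a blocking call with an asynchronous send plus an interrupt-aware wait loop. Below is a minimal libpq-level sketch of that idea; pg_conn, drop_cmd, and dbname are hypothetical locals, and this is not the actual TimescaleDB patch, which goes through its internal async remote-connection API:

/* Before (simplified): a blocking call, named in the commit message,
 * that cannot service interrupts while the remote command runs. */
remote_connection_cmdf_ok(conn, "DROP DATABASE %s", dbname);

/* After, in spirit: send the command without blocking, then wait in a
 * loop that stays responsive to interrupts. */
if (!PQsendQuery(pg_conn, drop_cmd))
    elog(ERROR, "failed to send DROP DATABASE: %s", PQerrorMessage(pg_conn));

while (PQisBusy(pg_conn))
{
    /* Sleep until the socket is readable or the latch is set. */
    int rc = WaitLatchOrSocket(MyLatch,
                               WL_LATCH_SET | WL_SOCKET_READABLE | WL_EXIT_ON_PM_DEATH,
                               PQsocket(pg_conn), -1L, PG_WAIT_EXTENSION);

    if (rc & WL_LATCH_SET)
        ResetLatch(MyLatch);

    /* The crucial difference: interrupts, including the
     * ProcSignalBarrier emitted by the remote DROP DATABASE,
     * are processed here instead of being stuck behind libpq. */
    CHECK_FOR_INTERRUPTS();

    if ((rc & WL_SOCKET_READABLE) && !PQconsumeInput(pg_conn))
        elog(ERROR, "connection lost: %s", PQerrorMessage(pg_conn));
}

PGresult *result = PQgetResult(pg_conn);

Because the waiting backend now passes through CHECK_FOR_INTERRUPTS() on every wakeup, it can absorb the barrier and the DROP DATABASE on the other side can complete.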
SachinSetiya pushed a commit that referenced this issue Nov 28, 2022