Test dist_move_chunk hangs in PG15 #4972
The dist_move_chunk causes the CI to hang when compiled and run with PG15 as explained in timescale#4972.
The dist_move_chunk causes the CI to hang when compiled and run with PG15 as explained in timescale#4972. Also fixed schema permission issues in data_node and dist_param tests.
This is now reproducible locally, and the issue still occurs in 15.1.
This is a deadlock between the worker process and another process running the [...]. With respect to the logs mentioned in the bug description, all the [...]. The TL;DR: [...]
The wait done by process [...]. The exact place where the wait happens is libpqwalreceiver.c#L696. Relevant stack trace of this process:
The place where the process executing [...]. Relevant stack trace of this wait:
Excellent analysis. As discussed, this looks like a PG internal issue. Maybe this needs to be discussed in the community as well.
I tried reproducing the hang with vanilla postgres by creating a subscription and dropping the database, but I was never able to get the timing right. Still, code inspection alone confirms that this deadlock is possible.
The following SQL statements are equivalent to what dist_move_chunk and another test case execute to get into a hang. Run the following on the publisher node:
Run the following on the subscriber node:
Now execute the following two queries in parallel on the subscriber node:
and
The [...]
Is this related?
Looks like this is fixed by #5058.
Hi @erimatnor, I ran the test locally with the changes from #5058 (commit hash 845b78096c). The test still hangs for me. The problem occurs only when the [...]. Relevant log that keeps repeating in postmaster:
commit_hash and pg_stat_activity output:
I still think this is not a Timescale issue but rather a PostgreSQL upstream issue. There is a deadlock between a worker process that is trying to create a replication slot and another process that is executing [...]
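For anyone trying to observe this kind of hang on their own instance, a query along the following lines can show which backends are waiting and on what. This is only a diagnostic sketch using the standard `pg_stat_activity` view; it is not the output or the query from the comment above, and the choice of columns and the 80-character truncation are assumptions made for readability.

```sql
-- Diagnostic sketch (not from the original thread): list backends that
-- currently report a wait event, together with the database they are
-- connected to and the statement they are running.
SELECT pid,
       datname,
       backend_type,
       state,
       wait_event_type,
       wait_event,
       left(query, 80) AS query   -- truncated only to keep the output narrow
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
ORDER BY backend_start;
```

Idle client backends also report a wait event, so the rows of interest are the ones whose `wait_event` stays the same across repeated runs while the test is stuck.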
I was able to reproduce the issue in vanilla postgres with the following queries:
This confirms that the issue is not TimescaleDB specific. I have sent a bug report to upstream with all the details. With respect to TimescaleDB, this is not a serious issue, as the bug will occur only if the two data nodes are running in the same postgres instance, which is the case in the [...]
This is now fixed in postgres/postgres@6c6d6ba and will probably be included in PG15.2.
When run in a parallel group, the dist_move_chunk test can get into a deadlock with another test running a 'DROP DATABASE' command. So, mark it as a solo test to disallow it from running in a parallel group. Closes timescale#4972
What type of bug is this?
Other
What subsystems and features are affected?
Data node, Distributed hypertable
What happened?
When compiled and run in PG15, the dist_move_chunk test case hangs due to a deadlock between the copy_chunk function and a DROP DATABASE command run in parallel. This is not locally reproducible but hangs in GitHub CI.
TimescaleDB version affected
2.9.0
PostgreSQL version used
15.0
What operating system did you use?
Ubuntu 22.04
What installation method did you use?
Source
What platform did you run on?
Not applicable
Relevant log output and stack trace
How can we reproduce the bug?
Reproducible in GitHub CI.