Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recompress deadlock when when running with queries #3846

Closed
mkindahl opened this issue Nov 23, 2021 · 2 comments · Fixed by #3845
Closed

Recompress deadlock when when running with queries #3846

mkindahl opened this issue Nov 23, 2021 · 2 comments · Fixed by #3845

Comments

@mkindahl
Copy link
Contributor

Calling recompress_chunk can create a deadlock with a query because of different lock order.

To reproduce

  1. Create a table hyper with data and compress it. Do not add a compression policy.
  2. Insert some extra rows.
  3. Start two sessions S1 and S2
  4. Attach a debugger to session S1 on tsl_recompress_chunk_wrapper.
  5. Run SELECT recompress_chunks(show_chunks('hyper')) in S1.
  6. Step in tsl_recompress_chunk_wrapper until you reach the line with tsl_compress_chunk_wrapper.
  7. Run SELECT count(*) FROM hyper in S1. This will block since it is waiting to get locks acquired by S1.
  8. Continue the execution in the debugger.
  9. A deadlock will be detected in one of the session, which will be killed and let the other session proceed. It is typically the recompress_chunks session, but not always.

Error produced

mats=# select recompress_chunk(show_chunks('hyper'));
ERROR:  deadlock detected
DETAIL:  Process 282425 waits for AccessExclusiveLock on relation 171357 of database 16384; blocked by process 282898.
Process 282898 waits for AccessShareLock on relation 171361 of database 16384; blocked by process 282425.
HINT:  See server log for query details.

Additional information

In this case, we have:

mats=# select 171357::regclass, 171361::regclass;
                 regclass                  |                         regclass                         
-------------------------------------------+----------------------------------------------------------
 _timescaledb_internal._hyper_16_148_chunk | _timescaledb_internal._hyper_16_148_chunk_hyper_time_idx
(1 row)

Locks taken when stopping debugger after step 7 are:

mats=# select pid, relation, relation::regclass, mode, granted from pg_locks where locktype = 'relation' order by pid;
  pid   | relation |                         relation                         |           mode           | granted 
--------+----------+----------------------------------------------------------+--------------------------+---------
 282425 |   171362 | _timescaledb_internal._compressed_hypertable_17          | ShareUpdateExclusiveLock | t
 282425 |   171357 | _timescaledb_internal._hyper_16_148_chunk                | AccessShareLock          | t
 282425 |   171357 | _timescaledb_internal._hyper_16_148_chunk                | ShareUpdateExclusiveLock | t
 282425 |   171357 | _timescaledb_internal._hyper_16_148_chunk                | ShareLock                | t
 282425 |   171357 | _timescaledb_internal._hyper_16_148_chunk                | ExclusiveLock            | t
 282425 |   179574 | 179574                                                   | ShareLock                | t
 282425 |   166790 | _timescaledb_catalog.chunk                               | RowExclusiveLock         | t
 282425 |   166963 | _timescaledb_catalog.hypertable_compression              | AccessShareLock          | t
 282425 |   166788 | _timescaledb_catalog.chunk_id_seq                        | RowExclusiveLock         | t
 282425 |   171362 | _timescaledb_internal._compressed_hypertable_17          | AccessShareLock          | t
 282425 |   171352 | hyper                                                    | AccessShareLock          | t
 282425 |   179571 | 179571                                                   | ShareUpdateExclusiveLock | t
 282425 |   179571 | 179571                                                   | ShareLock                | t
 282425 |   179571 | 179571                                                   | ExclusiveLock            | t
 282425 |   179571 | 179571                                                   | AccessExclusiveLock      | t
 282425 |   171372 | pg_toast.pg_toast_171369                                 | AccessExclusiveLock      | t
 282425 |   179576 | 179576                                                   | AccessExclusiveLock      | t
 282425 |   171361 | _timescaledb_internal._hyper_16_148_chunk_hyper_time_idx | AccessShareLock          | t
 282425 |   171361 | _timescaledb_internal._hyper_16_148_chunk_hyper_time_idx | AccessExclusiveLock      | t
 282425 |   171374 | pg_toast.pg_toast_171369_index                           | AccessExclusiveLock      | t
 282425 |   171369 | _timescaledb_internal.compress_hyper_17_149_chunk        | ExclusiveLock            | t
 282425 |   171369 | _timescaledb_internal.compress_hyper_17_149_chunk        | AccessExclusiveLock      | t
 282898 |   171352 | hyper                                                    | AccessShareLock          | t
 282898 |   171357 | _timescaledb_internal._hyper_16_148_chunk                | AccessShareLock          | t
 282898 |   171356 | hyper_time_idx                                           | AccessShareLock          | t
 282898 |   171361 | _timescaledb_internal._hyper_16_148_chunk_hyper_time_idx | AccessShareLock          | f
 342170 |    12141 | pg_locks                                                 | AccessShareLock          | t
(27 rows)
@mkindahl mkindahl self-assigned this Nov 23, 2021
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Nov 23, 2021
When executing `recompress_chunk` and a query at the same time, a
deadlock can be generated because the chunk relation and the chunk
index are locked in different orders. This commit adds an isolation
test that trigger a deadlock between `recompress_chunk` and a query on
a table with some uncompressed rows and then fixes the deadlock risk by
conservatively taking the lock on the compressed chunk when the
recompress procedure starts running (since it will eventually truncate
it).

Fixes timescale#3846
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Nov 24, 2021
When executing `recompress_chunk` and a query at the same time, a
deadlock can be generated because the chunk relation and the chunk
index are locked in different orders. In particular, when
`recompress_chunk` is executing, it will first decompress the chunk and
as part of that lock the uncompressed chunk index in `AccessExclusive`
mode and when trying to compress the chunk again it will try to lock
the uncompressed chunk in `AccessExclusive` as part of truncating it.

To avoid the deadlock, this commit skips rebuilding the uncompressed
chunk index when decompressing the chunk since it will not change when
incorporating new rows into the compressed chunk.

Fixes timescale#3846
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Nov 24, 2021
When executing `recompress_chunk` and a query at the same time, a
deadlock can be generated because the chunk relation and the chunk
index are locked in different orders. In particular, when
`recompress_chunk` is executing, it will first decompress the chunk and
as part of that lock the uncompressed chunk index in `AccessExclusive`
mode and when trying to compress the chunk again it will try to lock
the uncompressed chunk in `AccessExclusive` as part of truncating it.

To avoid the deadlock, this commit skips rebuilding the uncompressed
chunk index when decompressing the chunk since it will not change when
incorporating new rows into the compressed chunk.

Fixes timescale#3846
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Nov 25, 2021
When executing `recompress_chunk` and a query at the same time, a
deadlock can be generated because the chunk relation and the chunk
index are locked in different orders. In particular, when
`recompress_chunk` is executing, it will first decompress the chunk and
as part of that lock the uncompressed chunk index in `AccessExclusive`
mode and when trying to compress the chunk again it will try to lock
the uncompressed chunk in `AccessExclusive` as part of truncating it.

To avoid the deadlock, this commit skips rebuilding the uncompressed
chunk index when decompressing the chunk since it will not change when
incorporating new rows into the compressed chunk.

Fixes timescale#3846
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Nov 25, 2021
When executing `recompress_chunk` and a query at the same time, a
deadlock can be generated because the chunk relation and the chunk
index are locked in different orders. In particular, when
`recompress_chunk` is executing, it will first decompress the chunk and
as part of that lock the uncompressed chunk index in `AccessExclusive`
mode and when trying to compress the chunk again it will try to lock
the uncompressed chunk in `AccessExclusive` as part of truncating it.

To avoid the deadlock, this commit skips rebuilding the uncompressed
chunk index when decompressing the chunk during recompression since it
will not change when incorporating new rows into the compressed chunk.
The index is still rebuilt when just decompressing a chunk.

Fixes timescale#3846
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Nov 29, 2021
When executing recompress chunk policy concurrently with queries query, a
deadlock can be generated because the chunk relation and the chunk
index or the uncompressed chunk or the compressed chunk are locked in
different orders. In particular, when recompress chunk policy is
executing, it will first decompress the chunk and as part of that lock
the compressed chunk in AccessExclusive mode when dropping it and when
trying to compress the chunk again it will try to lock the uncompressed
chunk in AccessExclusive as part of truncating it.

To avoid the deadlock, this commit updates the recompress policy to do
the compression and the decompression steps in separate transactions,
which will avoid the deadlock since each phase locks indexes and
compressed/uncompressed chunks in the same order.

Partial-Bug: timescale#3846
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Nov 30, 2021
When executing recompress chunk policy concurrently with queries query, a
deadlock can be generated because the chunk relation and the chunk
index or the uncompressed chunk or the compressed chunk are locked in
different orders. In particular, when recompress chunk policy is
executing, it will first decompress the chunk and as part of that lock
the compressed chunk in `AccessExclusive` mode when dropping it and when
trying to compress the chunk again it will try to lock the uncompressed
chunk in `AccessExclusive` mode as part of truncating it.

To avoid the deadlock, this commit updates the recompress policy to do
the compression and the decompression steps in separate transactions,
which will avoid the deadlock since each phase (decompress and compress
chunk) locks indexes and compressed/uncompressed chunks in the same
order.

Note that this fixes the policy only, and not the `recompress_chunk`
function, which still is prone to deadlocks.

Partial-Bug: timescale#3846
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Nov 30, 2021
When executing recompress chunk policy concurrently with queries query, a
deadlock can be generated because the chunk relation and the chunk
index or the uncompressed chunk or the compressed chunk are locked in
different orders. In particular, when recompress chunk policy is
executing, it will first decompress the chunk and as part of that lock
the compressed chunk in `AccessExclusive` mode when dropping it and when
trying to compress the chunk again it will try to lock the uncompressed
chunk in `AccessExclusive` mode as part of truncating it.

To avoid the deadlock, this commit updates the recompress policy to do
the compression and the decompression steps in separate transactions,
which will avoid the deadlock since each phase (decompress and compress
chunk) locks indexes and compressed/uncompressed chunks in the same
order.

Note that this fixes the policy only, and not the `recompress_chunk`
function, which still is prone to deadlocks.

Partial-Bug: timescale#3846
mkindahl added a commit that referenced this issue Nov 30, 2021
When executing recompress chunk policy concurrently with queries query, a
deadlock can be generated because the chunk relation and the chunk
index or the uncompressed chunk or the compressed chunk are locked in
different orders. In particular, when recompress chunk policy is
executing, it will first decompress the chunk and as part of that lock
the compressed chunk in `AccessExclusive` mode when dropping it and when
trying to compress the chunk again it will try to lock the uncompressed
chunk in `AccessExclusive` mode as part of truncating it.

To avoid the deadlock, this commit updates the recompress policy to do
the compression and the decompression steps in separate transactions,
which will avoid the deadlock since each phase (decompress and compress
chunk) locks indexes and compressed/uncompressed chunks in the same
order.

Note that this fixes the policy only, and not the `recompress_chunk`
function, which still is prone to deadlocks.

Partial-Bug: #3846
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Nov 30, 2021
When executing recompress chunk policy concurrently with queries query, a
deadlock can be generated because the chunk relation and the chunk
index or the uncompressed chunk or the compressed chunk are locked in
different orders. In particular, when recompress chunk policy is
executing, it will first decompress the chunk and as part of that lock
the compressed chunk in `AccessExclusive` mode when dropping it and when
trying to compress the chunk again it will try to lock the uncompressed
chunk in `AccessExclusive` mode as part of truncating it.

To avoid the deadlock, this commit updates the recompress policy to do
the compression and the decompression steps in separate transactions,
which will avoid the deadlock since each phase (decompress and compress
chunk) locks indexes and compressed/uncompressed chunks in the same
order.

Note that this fixes the policy only, and not the `recompress_chunk`
function, which still is prone to deadlocks.

Partial-Bug: timescale#3846
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Dec 2, 2021
After the synchronizing lock is released and the transaction is
committed, both sessions are free to execute independently. This means
that the query can actually start running before the recompress step
has completed, which means that the order for completion is
non-deterministic.

We fix this by adding a marker so that the query is not reported as
completed until the recompress has finished execution.

Part-Of: timescale#3846
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Dec 2, 2021
After the synchronizing lock is released and the transaction is
committed, both sessions are free to execute independently. This means
that the query can actually start running before the recompress step
has completed, which means that the order for completion is
non-deterministic.

We fix this by adding a marker so that the query is not reported as
completed until the recompress has finished execution. Since markers in
isolation tests is a recent thing, we only run the test for PostgreSQL
versions with markers added.

Part-Of: timescale#3846
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Dec 2, 2021
After the synchronizing lock is released and the transaction is
committed, both sessions are free to execute independently. This means
that the query can actually start running before the recompress step
has completed, which means that the order for completion is
non-deterministic.

We fix this by adding a marker so that the query is not reported as
completed until the recompress has finished execution. Since markers in
isolation tests is a recent thing, we only run the test for PostgreSQL
versions with markers added.

Part-Of: timescale#3846
duncan-tsdb pushed a commit that referenced this issue Dec 2, 2021
When executing recompress chunk policy concurrently with queries query, a
deadlock can be generated because the chunk relation and the chunk
index or the uncompressed chunk or the compressed chunk are locked in
different orders. In particular, when recompress chunk policy is
executing, it will first decompress the chunk and as part of that lock
the compressed chunk in `AccessExclusive` mode when dropping it and when
trying to compress the chunk again it will try to lock the uncompressed
chunk in `AccessExclusive` mode as part of truncating it.

To avoid the deadlock, this commit updates the recompress policy to do
the compression and the decompression steps in separate transactions,
which will avoid the deadlock since each phase (decompress and compress
chunk) locks indexes and compressed/uncompressed chunks in the same
order.

Note that this fixes the policy only, and not the `recompress_chunk`
function, which still is prone to deadlocks.

Partial-Bug: #3846
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Dec 3, 2021
After the synchronizing lock is released and the transaction is
committed, both sessions are free to execute independently. This means
that the query can actually start running before the recompress step
has completed, which means that the order for completion is
non-deterministic.

We fix this by adding a marker so that the query is not reported as
completed until the recompress has finished execution. Since markers in
isolation tests is a recent thing, we only run the test for PostgreSQL
versions with markers added.

Part-Of: timescale#3846
mkindahl added a commit that referenced this issue Dec 3, 2021
After the synchronizing lock is released and the transaction is
committed, both sessions are free to execute independently. This means
that the query can actually start running before the recompress step
has completed, which means that the order for completion is
non-deterministic.

We fix this by adding a marker so that the query is not reported as
completed until the recompress has finished execution. Since markers in
isolation tests is a recent thing, we only run the test for PostgreSQL
versions with markers added.

Part-Of: #3846
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Dec 6, 2021
When executing `recompress_chunk` and a query at the same time, a
deadlock can be generated because the chunk relation and the chunk
index and the compressed and uncompressd chunks are locked in different
orders. In particular, when `recompress_chunk` is executing, it will
first decompress the chunk and as part of that lock the uncompressed
chunk index in AccessExclusive mode and when trying to compress the
chunk again it will try to lock the uncompressed chunk in
AccessExclusive as part of truncating it.

Note that `decompress_chunk` and `compress_chunk` lock the relations in
the same order and the issue arises because the procedures are combined
inth a single transaction.

To avoid the deadlock, this commit rewrites the `recompress_chunk` to
be a procedure and adds a commit between the decompression and
compression. Committing the transaction after the decompress will allow
reads and inserts to proceed by working on the uncompressed chunk, and
the compression part of the procedure will take the necessary locks in
strict order, thereby avoiding a deadlock.

In addition, the isolation test is rewritten so that instead of adding
a waitpoint in the PL/SQL function, we implement the isolation test by
taking a lock on the compressed table after the decompression.

Fixes timescale#3846
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Dec 6, 2021
When executing `recompress_chunk` and a query at the same time, a
deadlock can be generated because the chunk relation and the chunk
index and the compressed and uncompressd chunks are locked in different
orders. In particular, when `recompress_chunk` is executing, it will
first decompress the chunk and as part of that lock the uncompressed
chunk index in AccessExclusive mode and when trying to compress the
chunk again it will try to lock the uncompressed chunk in
AccessExclusive as part of truncating it.

Note that `decompress_chunk` and `compress_chunk` lock the relations in
the same order and the issue arises because the procedures are combined
inth a single transaction.

To avoid the deadlock, this commit rewrites the `recompress_chunk` to
be a procedure and adds a commit between the decompression and
compression. Committing the transaction after the decompress will allow
reads and inserts to proceed by working on the uncompressed chunk, and
the compression part of the procedure will take the necessary locks in
strict order, thereby avoiding a deadlock.

In addition, the isolation test is rewritten so that instead of adding
a waitpoint in the PL/SQL function, we implement the isolation test by
taking a lock on the compressed table after the decompression.

Fixes timescale#3846
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Dec 6, 2021
When executing `recompress_chunk` and a query at the same time, a
deadlock can be generated because the chunk relation and the chunk
index and the compressed and uncompressd chunks are locked in different
orders. In particular, when `recompress_chunk` is executing, it will
first decompress the chunk and as part of that lock the uncompressed
chunk index in AccessExclusive mode and when trying to compress the
chunk again it will try to lock the uncompressed chunk in
AccessExclusive as part of truncating it.

Note that `decompress_chunk` and `compress_chunk` lock the relations in
the same order and the issue arises because the procedures are combined
inth a single transaction.

To avoid the deadlock, this commit rewrites the `recompress_chunk` to
be a procedure and adds a commit between the decompression and
compression. Committing the transaction after the decompress will allow
reads and inserts to proceed by working on the uncompressed chunk, and
the compression part of the procedure will take the necessary locks in
strict order, thereby avoiding a deadlock.

In addition, the isolation test is rewritten so that instead of adding
a waitpoint in the PL/SQL function, we implement the isolation test by
taking a lock on the compressed table after the decompression.

Fixes timescale#3846
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Dec 6, 2021
When executing `recompress_chunk` and a query at the same time, a
deadlock can be generated because the chunk relation and the chunk
index and the compressed and uncompressd chunks are locked in different
orders. In particular, when `recompress_chunk` is executing, it will
first decompress the chunk and as part of that lock the uncompressed
chunk index in AccessExclusive mode and when trying to compress the
chunk again it will try to lock the uncompressed chunk in
AccessExclusive as part of truncating it.

Note that `decompress_chunk` and `compress_chunk` lock the relations in
the same order and the issue arises because the procedures are combined
inth a single transaction.

To avoid the deadlock, this commit rewrites the `recompress_chunk` to
be a procedure and adds a commit between the decompression and
compression. Committing the transaction after the decompress will allow
reads and inserts to proceed by working on the uncompressed chunk, and
the compression part of the procedure will take the necessary locks in
strict order, thereby avoiding a deadlock.

In addition, the isolation test is rewritten so that instead of adding
a waitpoint in the PL/SQL function, we implement the isolation test by
taking a lock on the compressed table after the decompression.

Fixes timescale#3846
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Dec 7, 2021
When executing `recompress_chunk` and a query at the same time, a
deadlock can be generated because the chunk relation and the chunk
index and the compressed and uncompressd chunks are locked in different
orders. In particular, when `recompress_chunk` is executing, it will
first decompress the chunk and as part of that lock the uncompressed
chunk index in AccessExclusive mode and when trying to compress the
chunk again it will try to lock the uncompressed chunk in
AccessExclusive as part of truncating it.

Note that `decompress_chunk` and `compress_chunk` lock the relations in
the same order and the issue arises because the procedures are combined
inth a single transaction.

To avoid the deadlock, this commit rewrites the `recompress_chunk` to
be a procedure and adds a commit between the decompression and
compression. Committing the transaction after the decompress will allow
reads and inserts to proceed by working on the uncompressed chunk, and
the compression part of the procedure will take the necessary locks in
strict order, thereby avoiding a deadlock.

In addition, the isolation test is rewritten so that instead of adding
a waitpoint in the PL/SQL function, we implement the isolation test by
taking a lock on the compressed table after the decompression.

Fixes timescale#3846
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Dec 7, 2021
When executing `recompress_chunk` and a query at the same time, a
deadlock can be generated because the chunk relation and the chunk
index and the compressed and uncompressd chunks are locked in different
orders. In particular, when `recompress_chunk` is executing, it will
first decompress the chunk and as part of that lock the uncompressed
chunk index in AccessExclusive mode and when trying to compress the
chunk again it will try to lock the uncompressed chunk in
AccessExclusive as part of truncating it.

Note that `decompress_chunk` and `compress_chunk` lock the relations in
the same order and the issue arises because the procedures are combined
inth a single transaction.

To avoid the deadlock, this commit rewrites the `recompress_chunk` to
be a procedure and adds a commit between the decompression and
compression. Committing the transaction after the decompress will allow
reads and inserts to proceed by working on the uncompressed chunk, and
the compression part of the procedure will take the necessary locks in
strict order, thereby avoiding a deadlock.

In addition, the isolation test is rewritten so that instead of adding
a waitpoint in the PL/SQL function, we implement the isolation test by
taking a lock on the compressed table after the decompression.

Fixes timescale#3846
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Dec 9, 2021
When executing `recompress_chunk` and a query at the same time, a
deadlock can be generated because the chunk relation and the chunk
index and the compressed and uncompressd chunks are locked in different
orders. In particular, when `recompress_chunk` is executing, it will
first decompress the chunk and as part of that lock the uncompressed
chunk index in AccessExclusive mode and when trying to compress the
chunk again it will try to lock the uncompressed chunk in
AccessExclusive as part of truncating it.

Note that `decompress_chunk` and `compress_chunk` lock the relations in
the same order and the issue arises because the procedures are combined
inth a single transaction.

To avoid the deadlock, this commit rewrites the `recompress_chunk` to
be a procedure and adds a commit between the decompression and
compression. Committing the transaction after the decompress will allow
reads and inserts to proceed by working on the uncompressed chunk, and
the compression part of the procedure will take the necessary locks in
strict order, thereby avoiding a deadlock.

In addition, the isolation test is rewritten so that instead of adding
a waitpoint in the PL/SQL function, we implement the isolation test by
taking a lock on the compressed table after the decompression.

Fixes timescale#3846
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Dec 9, 2021
When executing `recompress_chunk` and a query at the same time, a
deadlock can be generated because the chunk relation and the chunk
index and the compressed and uncompressd chunks are locked in different
orders. In particular, when `recompress_chunk` is executing, it will
first decompress the chunk and as part of that lock the uncompressed
chunk index in AccessExclusive mode and when trying to compress the
chunk again it will try to lock the uncompressed chunk in
AccessExclusive as part of truncating it.

Note that `decompress_chunk` and `compress_chunk` lock the relations in
the same order and the issue arises because the procedures are combined
inth a single transaction.

To avoid the deadlock, this commit rewrites the `recompress_chunk` to
be a procedure and adds a commit between the decompression and
compression. Committing the transaction after the decompress will allow
reads and inserts to proceed by working on the uncompressed chunk, and
the compression part of the procedure will take the necessary locks in
strict order, thereby avoiding a deadlock.

In addition, the isolation test is rewritten so that instead of adding
a waitpoint in the PL/SQL function, we implement the isolation test by
taking a lock on the compressed table after the decompression.

Fixes timescale#3846
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Dec 9, 2021
When executing `recompress_chunk` and a query at the same time, a
deadlock can be generated because the chunk relation and the chunk
index and the compressed and uncompressd chunks are locked in different
orders. In particular, when `recompress_chunk` is executing, it will
first decompress the chunk and as part of that lock the uncompressed
chunk index in AccessExclusive mode and when trying to compress the
chunk again it will try to lock the uncompressed chunk in
AccessExclusive as part of truncating it.

Note that `decompress_chunk` and `compress_chunk` lock the relations in
the same order and the issue arises because the procedures are combined
inth a single transaction.

To avoid the deadlock, this commit rewrites the `recompress_chunk` to
be a procedure and adds a commit between the decompression and
compression. Committing the transaction after the decompress will allow
reads and inserts to proceed by working on the uncompressed chunk, and
the compression part of the procedure will take the necessary locks in
strict order, thereby avoiding a deadlock.

In addition, the isolation test is rewritten so that instead of adding
a waitpoint in the PL/SQL function, we implement the isolation test by
taking a lock on the compressed table after the decompression.

Fixes timescale#3846
mkindahl added a commit that referenced this issue Dec 9, 2021
When executing `recompress_chunk` and a query at the same time, a
deadlock can be generated because the chunk relation and the chunk
index and the compressed and uncompressd chunks are locked in different
orders. In particular, when `recompress_chunk` is executing, it will
first decompress the chunk and as part of that lock the uncompressed
chunk index in AccessExclusive mode and when trying to compress the
chunk again it will try to lock the uncompressed chunk in
AccessExclusive as part of truncating it.

Note that `decompress_chunk` and `compress_chunk` lock the relations in
the same order and the issue arises because the procedures are combined
inth a single transaction.

To avoid the deadlock, this commit rewrites the `recompress_chunk` to
be a procedure and adds a commit between the decompression and
compression. Committing the transaction after the decompress will allow
reads and inserts to proceed by working on the uncompressed chunk, and
the compression part of the procedure will take the necessary locks in
strict order, thereby avoiding a deadlock.

In addition, the isolation test is rewritten so that instead of adding
a waitpoint in the PL/SQL function, we implement the isolation test by
taking a lock on the compressed table after the decompression.

Fixes #3846
@sickel
Copy link

sickel commented Jan 19, 2022

I see the same behaviour using timescaledb 2.5.1 on postgres 14.1. using centos7, postgresql from the pgdb-repo, timescaledb from timescaledb. I am upgrading our soh database with timescaledb, so at the moment I have set up a timescaledbenabled table where I am compressing the old data at the same time as new data are coming in.

=> select compress_chunk(i,true) from show_chunks('nagios_t',older_than=> interval '1 years') i;
NOTICE: chunk "_hyper_2_100_chunk" is already compressed
(...)
ERROR: deadlock detected
DETAIL: Process 20358 waits for AccessExclusiveLock on relation 1410875 of database 18416; blocked by process 40226.
Process 40226 waits for AccessShareLock on relation 1410561 of database 18416; blocked by process 20358.
HINT: See server log for query details.

in the log, I get

$ grep -A 10 1410875 /var/log/postgresql.log
Jan 19 14:21:14 nagios postgres[20358]: [49-2] 2022-01-19 14:21:14.950 CET [20358] DETAIL: Process 20358 waits for AccessExclusiveLock on relation 1410875 of database 18416; blocked by process 40226.
Jan 19 14:21:14 nagios postgres[20358]: [49-3] #011Process 40226 waits for AccessShareLock on relation 1410561 of database 18416; blocked by process 20358.
Jan 19 14:21:14 nagios postgres[20358]: [49-4] #011Process 20358: select compress_chunk(i,true) from show_chunks('nagios_t',older_than=> interval '1 years') i;
Jan 19 14:21:14 nagios postgres[20358]: [49-5] #011Process 40226: insert into nagios_t select * from nagios where checktime > (select max(checktime) from nagios_t);

The query on the last line is copying data from the current working table to the new one.

As far as I can see, the relations in the deadlock are two different tsdb chunks:

nagios=> select oid,relname from pg_class where oid in (1410875,1410561);
oid | relname
---------+--------------------
1410561 | _hyper_2_120_chunk
1410875 | _hyper_2_129_chunk

@gayyappan
Copy link
Contributor

gayyappan commented Jan 27, 2022

@sickel select compress_chunk(i,true) from show_chunks('nagios_t',older_than=> interval '1 years') i;
will lock the chunks since you are compressing them. The locks will be released only when the transaction completes.
So this is not the best way to compress chunks if you are inserting/querying these chunks at the same time.
You are acquiring locks that might block other operations on these chunks.

The right way to do this is to compress 1 chunk at a time, commit and then compress the next chunk.

svenklemm pushed a commit that referenced this issue Feb 8, 2022
After the synchronizing lock is released and the transaction is
committed, both sessions are free to execute independently. This means
that the query can actually start running before the recompress step
has completed, which means that the order for completion is
non-deterministic.

We fix this by adding a marker so that the query is not reported as
completed until the recompress has finished execution. Since markers in
isolation tests is a recent thing, we only run the test for PostgreSQL
versions with markers added.

Part-Of: #3846
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants