Recompress deadlock when when running with queries #3846

mkindahl · 2021-11-23T12:59:03Z

Calling recompress_chunk can create a deadlock with a query because of different lock order.

To reproduce

Create a table hyper with data and compress it. Do not add a compression policy.
Insert some extra rows.
Start two sessions S1 and S2
Attach a debugger to session S1 on tsl_recompress_chunk_wrapper.
Run SELECT recompress_chunks(show_chunks('hyper')) in S1.
Step in tsl_recompress_chunk_wrapper until you reach the line with tsl_compress_chunk_wrapper.
Run SELECT count(*) FROM hyper in S1. This will block since it is waiting to get locks acquired by S1.
Continue the execution in the debugger.
A deadlock will be detected in one of the session, which will be killed and let the other session proceed. It is typically the recompress_chunks session, but not always.

Error produced

mats=# select recompress_chunk(show_chunks('hyper'));
ERROR:  deadlock detected
DETAIL:  Process 282425 waits for AccessExclusiveLock on relation 171357 of database 16384; blocked by process 282898.
Process 282898 waits for AccessShareLock on relation 171361 of database 16384; blocked by process 282425.
HINT:  See server log for query details.

Additional information

In this case, we have:

mats=# select 171357::regclass, 171361::regclass;
                 regclass                  |                         regclass                         
-------------------------------------------+----------------------------------------------------------
 _timescaledb_internal._hyper_16_148_chunk | _timescaledb_internal._hyper_16_148_chunk_hyper_time_idx
(1 row)

Locks taken when stopping debugger after step 7 are:

mats=# select pid, relation, relation::regclass, mode, granted from pg_locks where locktype = 'relation' order by pid;
  pid   | relation |                         relation                         |           mode           | granted 
--------+----------+----------------------------------------------------------+--------------------------+---------
 282425 |   171362 | _timescaledb_internal._compressed_hypertable_17          | ShareUpdateExclusiveLock | t
 282425 |   171357 | _timescaledb_internal._hyper_16_148_chunk                | AccessShareLock          | t
 282425 |   171357 | _timescaledb_internal._hyper_16_148_chunk                | ShareUpdateExclusiveLock | t
 282425 |   171357 | _timescaledb_internal._hyper_16_148_chunk                | ShareLock                | t
 282425 |   171357 | _timescaledb_internal._hyper_16_148_chunk                | ExclusiveLock            | t
 282425 |   179574 | 179574                                                   | ShareLock                | t
 282425 |   166790 | _timescaledb_catalog.chunk                               | RowExclusiveLock         | t
 282425 |   166963 | _timescaledb_catalog.hypertable_compression              | AccessShareLock          | t
 282425 |   166788 | _timescaledb_catalog.chunk_id_seq                        | RowExclusiveLock         | t
 282425 |   171362 | _timescaledb_internal._compressed_hypertable_17          | AccessShareLock          | t
 282425 |   171352 | hyper                                                    | AccessShareLock          | t
 282425 |   179571 | 179571                                                   | ShareUpdateExclusiveLock | t
 282425 |   179571 | 179571                                                   | ShareLock                | t
 282425 |   179571 | 179571                                                   | ExclusiveLock            | t
 282425 |   179571 | 179571                                                   | AccessExclusiveLock      | t
 282425 |   171372 | pg_toast.pg_toast_171369                                 | AccessExclusiveLock      | t
 282425 |   179576 | 179576                                                   | AccessExclusiveLock      | t
 282425 |   171361 | _timescaledb_internal._hyper_16_148_chunk_hyper_time_idx | AccessShareLock          | t
 282425 |   171361 | _timescaledb_internal._hyper_16_148_chunk_hyper_time_idx | AccessExclusiveLock      | t
 282425 |   171374 | pg_toast.pg_toast_171369_index                           | AccessExclusiveLock      | t
 282425 |   171369 | _timescaledb_internal.compress_hyper_17_149_chunk        | ExclusiveLock            | t
 282425 |   171369 | _timescaledb_internal.compress_hyper_17_149_chunk        | AccessExclusiveLock      | t
 282898 |   171352 | hyper                                                    | AccessShareLock          | t
 282898 |   171357 | _timescaledb_internal._hyper_16_148_chunk                | AccessShareLock          | t
 282898 |   171356 | hyper_time_idx                                           | AccessShareLock          | t
 282898 |   171361 | _timescaledb_internal._hyper_16_148_chunk_hyper_time_idx | AccessShareLock          | f
 342170 |    12141 | pg_locks                                                 | AccessShareLock          | t
(27 rows)

The text was updated successfully, but these errors were encountered:

When executing `recompress_chunk` and a query at the same time, a deadlock can be generated because the chunk relation and the chunk index are locked in different orders. This commit adds an isolation test that trigger a deadlock between `recompress_chunk` and a query on a table with some uncompressed rows and then fixes the deadlock risk by conservatively taking the lock on the compressed chunk when the recompress procedure starts running (since it will eventually truncate it). Fixes timescale#3846

When executing `recompress_chunk` and a query at the same time, a deadlock can be generated because the chunk relation and the chunk index are locked in different orders. In particular, when `recompress_chunk` is executing, it will first decompress the chunk and as part of that lock the uncompressed chunk index in `AccessExclusive` mode and when trying to compress the chunk again it will try to lock the uncompressed chunk in `AccessExclusive` as part of truncating it. To avoid the deadlock, this commit skips rebuilding the uncompressed chunk index when decompressing the chunk since it will not change when incorporating new rows into the compressed chunk. Fixes timescale#3846

When executing `recompress_chunk` and a query at the same time, a deadlock can be generated because the chunk relation and the chunk index are locked in different orders. In particular, when `recompress_chunk` is executing, it will first decompress the chunk and as part of that lock the uncompressed chunk index in `AccessExclusive` mode and when trying to compress the chunk again it will try to lock the uncompressed chunk in `AccessExclusive` as part of truncating it. To avoid the deadlock, this commit skips rebuilding the uncompressed chunk index when decompressing the chunk during recompression since it will not change when incorporating new rows into the compressed chunk. The index is still rebuilt when just decompressing a chunk. Fixes timescale#3846

When executing recompress chunk policy concurrently with queries query, a deadlock can be generated because the chunk relation and the chunk index or the uncompressed chunk or the compressed chunk are locked in different orders. In particular, when recompress chunk policy is executing, it will first decompress the chunk and as part of that lock the compressed chunk in AccessExclusive mode when dropping it and when trying to compress the chunk again it will try to lock the uncompressed chunk in AccessExclusive as part of truncating it. To avoid the deadlock, this commit updates the recompress policy to do the compression and the decompression steps in separate transactions, which will avoid the deadlock since each phase locks indexes and compressed/uncompressed chunks in the same order. Partial-Bug: timescale#3846

When executing recompress chunk policy concurrently with queries query, a deadlock can be generated because the chunk relation and the chunk index or the uncompressed chunk or the compressed chunk are locked in different orders. In particular, when recompress chunk policy is executing, it will first decompress the chunk and as part of that lock the compressed chunk in `AccessExclusive` mode when dropping it and when trying to compress the chunk again it will try to lock the uncompressed chunk in `AccessExclusive` mode as part of truncating it. To avoid the deadlock, this commit updates the recompress policy to do the compression and the decompression steps in separate transactions, which will avoid the deadlock since each phase (decompress and compress chunk) locks indexes and compressed/uncompressed chunks in the same order. Note that this fixes the policy only, and not the `recompress_chunk` function, which still is prone to deadlocks. Partial-Bug: timescale#3846

When executing recompress chunk policy concurrently with queries query, a deadlock can be generated because the chunk relation and the chunk index or the uncompressed chunk or the compressed chunk are locked in different orders. In particular, when recompress chunk policy is executing, it will first decompress the chunk and as part of that lock the compressed chunk in `AccessExclusive` mode when dropping it and when trying to compress the chunk again it will try to lock the uncompressed chunk in `AccessExclusive` mode as part of truncating it. To avoid the deadlock, this commit updates the recompress policy to do the compression and the decompression steps in separate transactions, which will avoid the deadlock since each phase (decompress and compress chunk) locks indexes and compressed/uncompressed chunks in the same order. Note that this fixes the policy only, and not the `recompress_chunk` function, which still is prone to deadlocks. Partial-Bug: #3846

When executing recompress chunk policy concurrently with queries query, a deadlock can be generated because the chunk relation and the chunk index or the uncompressed chunk or the compressed chunk are locked in different orders. In particular, when recompress chunk policy is executing, it will first decompress the chunk and as part of that lock the compressed chunk in `AccessExclusive` mode when dropping it and when trying to compress the chunk again it will try to lock the uncompressed chunk in `AccessExclusive` mode as part of truncating it. To avoid the deadlock, this commit updates the recompress policy to do the compression and the decompression steps in separate transactions, which will avoid the deadlock since each phase (decompress and compress chunk) locks indexes and compressed/uncompressed chunks in the same order. Note that this fixes the policy only, and not the `recompress_chunk` function, which still is prone to deadlocks. Partial-Bug: timescale#3846

After the synchronizing lock is released and the transaction is committed, both sessions are free to execute independently. This means that the query can actually start running before the recompress step has completed, which means that the order for completion is non-deterministic. We fix this by adding a marker so that the query is not reported as completed until the recompress has finished execution. Part-Of: timescale#3846

After the synchronizing lock is released and the transaction is committed, both sessions are free to execute independently. This means that the query can actually start running before the recompress step has completed, which means that the order for completion is non-deterministic. We fix this by adding a marker so that the query is not reported as completed until the recompress has finished execution. Since markers in isolation tests is a recent thing, we only run the test for PostgreSQL versions with markers added. Part-Of: timescale#3846

When executing recompress chunk policy concurrently with queries query, a deadlock can be generated because the chunk relation and the chunk index or the uncompressed chunk or the compressed chunk are locked in different orders. In particular, when recompress chunk policy is executing, it will first decompress the chunk and as part of that lock the compressed chunk in `AccessExclusive` mode when dropping it and when trying to compress the chunk again it will try to lock the uncompressed chunk in `AccessExclusive` mode as part of truncating it. To avoid the deadlock, this commit updates the recompress policy to do the compression and the decompression steps in separate transactions, which will avoid the deadlock since each phase (decompress and compress chunk) locks indexes and compressed/uncompressed chunks in the same order. Note that this fixes the policy only, and not the `recompress_chunk` function, which still is prone to deadlocks. Partial-Bug: #3846

After the synchronizing lock is released and the transaction is committed, both sessions are free to execute independently. This means that the query can actually start running before the recompress step has completed, which means that the order for completion is non-deterministic. We fix this by adding a marker so that the query is not reported as completed until the recompress has finished execution. Since markers in isolation tests is a recent thing, we only run the test for PostgreSQL versions with markers added. Part-Of: timescale#3846

After the synchronizing lock is released and the transaction is committed, both sessions are free to execute independently. This means that the query can actually start running before the recompress step has completed, which means that the order for completion is non-deterministic. We fix this by adding a marker so that the query is not reported as completed until the recompress has finished execution. Since markers in isolation tests is a recent thing, we only run the test for PostgreSQL versions with markers added. Part-Of: #3846

When executing `recompress_chunk` and a query at the same time, a deadlock can be generated because the chunk relation and the chunk index and the compressed and uncompressd chunks are locked in different orders. In particular, when `recompress_chunk` is executing, it will first decompress the chunk and as part of that lock the uncompressed chunk index in AccessExclusive mode and when trying to compress the chunk again it will try to lock the uncompressed chunk in AccessExclusive as part of truncating it. Note that `decompress_chunk` and `compress_chunk` lock the relations in the same order and the issue arises because the procedures are combined inth a single transaction. To avoid the deadlock, this commit rewrites the `recompress_chunk` to be a procedure and adds a commit between the decompression and compression. Committing the transaction after the decompress will allow reads and inserts to proceed by working on the uncompressed chunk, and the compression part of the procedure will take the necessary locks in strict order, thereby avoiding a deadlock. In addition, the isolation test is rewritten so that instead of adding a waitpoint in the PL/SQL function, we implement the isolation test by taking a lock on the compressed table after the decompression. Fixes timescale#3846

When executing `recompress_chunk` and a query at the same time, a deadlock can be generated because the chunk relation and the chunk index and the compressed and uncompressd chunks are locked in different orders. In particular, when `recompress_chunk` is executing, it will first decompress the chunk and as part of that lock the uncompressed chunk index in AccessExclusive mode and when trying to compress the chunk again it will try to lock the uncompressed chunk in AccessExclusive as part of truncating it. Note that `decompress_chunk` and `compress_chunk` lock the relations in the same order and the issue arises because the procedures are combined inth a single transaction. To avoid the deadlock, this commit rewrites the `recompress_chunk` to be a procedure and adds a commit between the decompression and compression. Committing the transaction after the decompress will allow reads and inserts to proceed by working on the uncompressed chunk, and the compression part of the procedure will take the necessary locks in strict order, thereby avoiding a deadlock. In addition, the isolation test is rewritten so that instead of adding a waitpoint in the PL/SQL function, we implement the isolation test by taking a lock on the compressed table after the decompression. Fixes #3846

sickel · 2022-01-19T13:52:16Z

I see the same behaviour using timescaledb 2.5.1 on postgres 14.1. using centos7, postgresql from the pgdb-repo, timescaledb from timescaledb. I am upgrading our soh database with timescaledb, so at the moment I have set up a timescaledbenabled table where I am compressing the old data at the same time as new data are coming in.

=> select compress_chunk(i,true) from show_chunks('nagios_t',older_than=> interval '1 years') i;
NOTICE: chunk "_hyper_2_100_chunk" is already compressed
(...)
ERROR: deadlock detected
DETAIL: Process 20358 waits for AccessExclusiveLock on relation 1410875 of database 18416; blocked by process 40226.
Process 40226 waits for AccessShareLock on relation 1410561 of database 18416; blocked by process 20358.
HINT: See server log for query details.

in the log, I get

$ grep -A 10 1410875 /var/log/postgresql.log
Jan 19 14:21:14 nagios postgres[20358]: [49-2] 2022-01-19 14:21:14.950 CET [20358] DETAIL: Process 20358 waits for AccessExclusiveLock on relation 1410875 of database 18416; blocked by process 40226.
Jan 19 14:21:14 nagios postgres[20358]: [49-3] #011Process 40226 waits for AccessShareLock on relation 1410561 of database 18416; blocked by process 20358.
Jan 19 14:21:14 nagios postgres[20358]: [49-4] #011Process 20358: select compress_chunk(i,true) from show_chunks('nagios_t',older_than=> interval '1 years') i;
Jan 19 14:21:14 nagios postgres[20358]: [49-5] #011Process 40226: insert into nagios_t select * from nagios where checktime > (select max(checktime) from nagios_t);

The query on the last line is copying data from the current working table to the new one.

As far as I can see, the relations in the deadlock are two different tsdb chunks:

nagios=> select oid,relname from pg_class where oid in (1410875,1410561);
oid | relname
---------+--------------------
1410561 | _hyper_2_120_chunk
1410875 | _hyper_2_129_chunk

gayyappan · 2022-01-27T15:39:48Z

@sickel select compress_chunk(i,true) from show_chunks('nagios_t',older_than=> interval '1 years') i;
will lock the chunks since you are compressing them. The locks will be released only when the transaction completes.
So this is not the best way to compress chunks if you are inserting/querying these chunks at the same time.
You are acquiring locks that might block other operations on these chunks.

The right way to do this is to compress 1 chunk at a time, commit and then compress the next chunk.

After the synchronizing lock is released and the transaction is committed, both sessions are free to execute independently. This means that the query can actually start running before the recompress step has completed, which means that the order for completion is non-deterministic. We fix this by adding a marker so that the query is not reported as completed until the recompress has finished execution. Since markers in isolation tests is a recent thing, we only run the test for PostgreSQL versions with markers added. Part-Of: #3846

mkindahl mentioned this issue Nov 23, 2021

Eliminate deadlock for recompress_chunk #3845

Merged

mkindahl added bug compression locking labels Nov 23, 2021

mkindahl self-assigned this Nov 23, 2021

mkindahl mentioned this issue Nov 23, 2021

Compression deadlock detected while reading hyper table #2565

Closed

mkindahl mentioned this issue Nov 29, 2021

Eliminate deadlock in recompress chunk policy #3864

Merged

duncan-tsdb mentioned this issue Nov 30, 2021

Release 2.5.1 #3872

Merged

mkindahl mentioned this issue Dec 2, 2021

Fix race condition in deadlock_recompress_chunk #3880

Merged

duncan-tsdb mentioned this issue Dec 2, 2021

Draft2 release 2.5.1 #3861

Merged

mkindahl closed this as completed in #3845 Dec 9, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Recompress deadlock when when running with queries #3846

Recompress deadlock when when running with queries #3846

mkindahl commented Nov 23, 2021

sickel commented Jan 19, 2022

gayyappan commented Jan 27, 2022 •

edited

Recompress deadlock when when running with queries #3846

Recompress deadlock when when running with queries #3846

Comments

mkindahl commented Nov 23, 2021

To reproduce

Error produced

Additional information

sickel commented Jan 19, 2022

gayyappan commented Jan 27, 2022 • edited

gayyappan commented Jan 27, 2022 •

edited