
Compression job running after inserting data fails #3656

Closed · hardikm10 opened this issue Oct 6, 2021 · 3 comments

Labels: bgw (The background worker subsystem, including the scheduler), bug, compression

@hardikm10 commented:

Relevant system information:
OS: Docker (timescale/timescaledb:2.4.2-pg13)
PostgreSQL version (output of postgres --version): 13.4
TimescaleDB version (output of \dx in psql): 2.4.2
Installation method: using Docker

Describe the bug
Calling run_job(<compression_job_id>) fails with the error "ERROR: cannot fetch toast data without an active snapshot"
if rows were inserted into already-compressed chunks.

To Reproduce

drop table sensor_data;

create table sensor_data(
time timestamptz not null,
sensor_id integer not null,
cpu double precision null,
temperature double precision null );

select from create_hypertable('sensor_data','time');

INSERT INTO sensor_data
SELECT
time + (INTERVAL '1 minute' * random()) AS time,
sensor_id,
random() AS cpu,
random()* 100 AS temperature
FROM
generate_series(now() - INTERVAL '1 months', now() - INTERVAL '1 week', INTERVAL '10 minute') AS g1(time),
generate_series(1, 100, 1 ) AS g2(sensor_id)
ORDER BY
time;

\d+ sensor_data

alter table sensor_data set (timescaledb.compress, timescaledb.compress_orderby = 'time DESC');
select add_compression_policy('sensor_data','1 minute'::INTERVAL);

-- Wait for all the chunks to be compressed.
-- Use the query below to confirm:
SELECT c.chunk_name, c.range_start, c.range_end, c.is_compressed, pg_size_pretty(d.total_bytes)
FROM public.chunks_detailed_size('public.sensor_data') d,
     timescaledb_information.chunks c
WHERE d.chunk_name = c.chunk_name
ORDER BY c.range_start;

-- Insert into the first chunk, which is already compressed:
INSERT INTO sensor_data
SELECT
    time + (INTERVAL '1 minute' * random()) AS time,
    sensor_id,
    random() AS cpu,
    random() * 100 AS temperature
FROM
    generate_series('2021-09-02 00:00:00+00'::timestamptz, '2021-09-08 00:00:00+00'::timestamptz, INTERVAL '10 minute') AS g1(time),
    generate_series(1, 100, 1) AS g2(sensor_id)
ORDER BY
    time;

-- Run the compression job manually (assuming 1000 is the job id returned by add_compression_policy above):
CALL run_job(1000);

postgres=# call run_job(1000);
ERROR:  cannot fetch toast data without an active snapshot

Expected behavior
The job should run and compress the chunks.

Actual behavior
The job fails to run if called manually using 'CALL run_job'.

mkindahl changed the title from "Calling a compression job results in "ERROR: cannot fetch toast data without an active snapshot"" to "Compression job running after inserting data fails" Oct 6, 2021
mkindahl self-assigned this Oct 6, 2021
mkindahl added the bgw, bug, and compression labels Oct 6, 2021
@mkindahl (Contributor) commented Oct 6, 2021:

The error is reported from inside init_toast_snapshot when GetOldestSnapshot() returns no snapshot, which means that there is no snapshot at all (see the paraphrased check after the list below). This function is called from two locations, toast_delete_datum and toast_fetch_datum_slice, which are called when deleting or scanning tuples in a table that has a TOAST table (which a compressed chunk has).

  • toast_delete_datum is called from toast_tuple_cleanup, which is ultimately called from heap_toast_insert_or_update.
  • toast_fetch_datum_slice is called from table_relation_fetch_toast_slice, which is called when detoasting a tuple in the TOAST table for the compressed table.
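
For reference, the check that raises the error looks roughly like this (paraphrased from src/backend/access/common/toast_internals.c in PostgreSQL 13; the comment is mine, not verbatim from the source):

	void
	init_toast_snapshot(Snapshot toast_snapshot)
	{
		Snapshot	snapshot = GetOldestSnapshot();

		/* NULL here means no active or registered snapshot exists at all */
		if (snapshot == NULL)
			elog(ERROR, "cannot fetch toast data without an active snapshot");

		InitToastSnapshot(*toast_snapshot, snapshot->lsn, snapshot->whenTaken);
	}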

Inside the procedure policy_compression_execute (and policy_recompression_execute) we have the following sequence:

	if (ActiveSnapshotSet())
		PopActiveSnapshot();
	/* process each chunk in a new transaction */
	int total_chunks = list_length(chunkid_lst), num_chunks = 0;
	foreach (lc, chunkid_lst)
	{
		CommitTransactionCommand();
		StartTransactionCommand();
                       .
                       .
                       .
	}

So, we pop the active snapshot, potentially leaving no snapshot at all, and then commit the transaction and start a new one. I wonder if we should have a PushActiveSnapshot(GetTransactionSnapshot()) after starting the new transaction, since we can potentially have a call to heap_toast_insert_or_update or table_relation_fetch_toast_slice as part of the recompress policy execution.
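
A minimal sketch of that suggestion, applied to the loop quoted above (the per-chunk compression work is elided; the comments are mine):

	if (ActiveSnapshotSet())
		PopActiveSnapshot();
	/* process each chunk in a new transaction */
	int total_chunks = list_length(chunkid_lst), num_chunks = 0;
	foreach (lc, chunkid_lst)
	{
		CommitTransactionCommand();
		StartTransactionCommand();
		/* Give any TOAST reads and writes in this transaction a snapshot */
		PushActiveSnapshot(GetTransactionSnapshot());

		/* ... compress or recompress the chunk ... */

		/* Pop before the commit at the top of the next iteration */
		PopActiveSnapshot();
	}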

@mkindahl (Contributor) commented Oct 6, 2021:

Attaching a debugger to init_toast_snapshot and running the reproduction gives the following stack:

#0  init_toast_snapshot (toast_snapshot=toast_snapshot@entry=0x7fff5ad2fb90) at toast_internals.c:625
#1  0x000056276c56b482 in heap_fetch_toast_slice (toastrel=0x7f0fa0095b60, valueid=118360, attrsize=3388, sliceoffset=0, slicelength=<optimized out>, result=0x56276e991b30) at heaptoast.c:688
#2  0x000056276c511069 in table_relation_fetch_toast_slice (result=0x56276e991b30, slicelength=3388, sliceoffset=0, attrsize=3388, valueid=<optimized out>, toastrel=0x7f0fa0095b60) at ../../../../src/include/access/tableam.h:1672
#3  toast_fetch_datum (attr=<optimized out>) at detoast.c:369
#4  0x000056276c5116fd in detoast_attr (attr=0x7f0fa131afa0) at detoast.c:123
#5  0x000056276c9d78b5 in pg_detoast_datum (datum=<optimized out>) at fmgr.c:1729
#6  0x00007f0fa013ebad in populate_per_compressed_columns_from_data (compressed_is_nulls=0x56276e98c300, compressed_datums=0x56276e98c038, num_cols=8, per_compressed_cols=0x56276e5d1f78)
    at /home/mats/repos/timescaledb-2.0/tsl/src/compression/compression.c:1201
#7  decompress_chunk (in_table=<optimized out>, out_table=118325) at /home/mats/repos/timescaledb-2.0/tsl/src/compression/compression.c:1093
#8  0x00007f0fa01422ea in decompress_chunk_impl (uncompressed_hypertable_relid=118320, uncompressed_chunk_relid=uncompressed_chunk_relid@entry=118325, if_compressed=if_compressed@entry=false)
    at /home/mats/repos/timescaledb-2.0/tsl/src/compression/compress_utils.c:337
#9  0x00007f0fa0142e7e in tsl_recompress_chunk_wrapper (uncompressed_chunk=uncompressed_chunk@entry=0x56276e5d2300) at /home/mats/repos/timescaledb-2.0/tsl/src/compression/compress_utils.c:576
#10 0x00007f0fa01357bb in policy_compression_execute (job_id=1040, config=<optimized out>) at /home/mats/repos/timescaledb-2.0/tsl/src/bgw_policy/job.c:610
#11 0x00007f0fa0133091 in policy_compression_proc (fcinfo=0x7fff5ad30070) at /home/mats/repos/timescaledb-2.0/tsl/src/bgw_policy/compression_api.c:149
#12 0x000056276c69db40 in ExecuteCallStmt (stmt=stmt@entry=0x56276e9208f8, params=<optimized out>, atomic=atomic@entry=false, dest=dest@entry=0x56276cc866e0 <donothingDR>) at functioncmds.c:2232
#13 0x00007f0fa013601d in job_execute_procedure (funcexpr=0x56276e9fc9c8) at /home/mats/repos/timescaledb-2.0/tsl/src/bgw_policy/job.c:751
#14 job_execute (job=<optimized out>) at /home/mats/repos/timescaledb-2.0/tsl/src/bgw_policy/job.c:832
#15 0x00007f0fa0136a7f in job_run (fcinfo=<optimized out>) at /home/mats/repos/timescaledb-2.0/tsl/src/bgw_policy/job_api.c:305
#16 0x000056276c69db40 in ExecuteCallStmt (stmt=stmt@entry=0x56276e4ad810, params=params@entry=0x0, atomic=atomic@entry=false, dest=dest@entry=0x56276e4add18) at functioncmds.c:2232
#17 0x000056276c8a1465 in standard_ProcessUtility (pstmt=pstmt@entry=0x56276e4ad8d8, queryString=queryString@entry=0x56276e4acc90 "CALL run_job(1040);", context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=params@entry=0x0, 
    queryEnv=0x0, dest=0x56276e4add18, qc=0x7fff5ad31340) at ../../../src/include/nodes/nodes.h:594
#18 0x00007f0faa0615c6 in loader_process_utility_hook (pstmt=0x56276e4ad8d8, query_string=0x56276e4acc90 "CALL run_job(1040);", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=<optimized out>, dest=<optimized out>, 
    completion_tag=0x7fff5ad31340) at /home/mats/repos/timescaledb-2.0/src/loader/loader.c:522
#19 0x00007f0fa0207463 in prev_ProcessUtility (args=args@entry=0x7fff5ad31030) at /home/mats/repos/timescaledb-2.0/src/process_utility.c:90
#20 0x00007f0fa020862e in timescaledb_ddl_command_start (pstmt=0x56276e4ad8d8, query_string=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=<optimized out>, dest=0x56276e4add18, completion_tag=0x7fff5ad31340)
    at /home/mats/repos/timescaledb-2.0/src/process_utility.c:4100
#21 0x000056276c89f95a in PortalRunUtility (portal=portal@entry=0x56276e5168b0, pstmt=pstmt@entry=0x56276e4ad8d8, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=0x56276e4add18, qc=0x7fff5ad31340)
    at pquery.c:1145
#22 0x000056276c89faa4 in PortalRunMulti (portal=portal@entry=0x56276e5168b0, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x56276e4add18, altdest=altdest@entry=0x56276e4add18, 
    qc=qc@entry=0x7fff5ad31340) at pquery.c:1301
#23 0x000056276c8a003e in PortalRun (portal=portal@entry=0x56276e5168b0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x56276e4add18, 
    altdest=altdest@entry=0x56276e4add18, qc=0x7fff5ad31340) at pquery.c:786
#24 0x000056276c89b6b9 in exec_simple_query (query_string=0x56276e4acc90 "CALL run_job(1040);") at postgres.c:1239
#25 0x000056276c89cf7f in PostgresMain (argc=<optimized out>, argv=argv@entry=0x56276e4da840, dbname=<optimized out>, username=<optimized out>) at postgres.c:4339
#26 0x000056276c80bd8f in BackendRun (port=0x56276e4d3fd0, port=0x56276e4d3fd0) at postmaster.c:4526
#27 BackendStartup (port=0x56276e4d3fd0) at postmaster.c:4210
#28 ServerLoop () at postmaster.c:1739
#29 0x000056276c80cd54 in PostmasterMain (argc=1, argv=<optimized out>) at postmaster.c:1412
#30 0x000056276c506e90 in main (argc=1, argv=0x56276e4a67e0) at main.c:210

mkindahl added a commit to mkindahl/timescaledb that referenced this issue Oct 6, 2021
When compression is running, `init_toast_snapshot` requires an active
snapshot to work. This function is, among other locations, called from
`heap_prepare_insert` and `heap_update` when it is necessary to toast
the tuple. Starting a new transaction without a snapshot will fail with
the error "cannot fetch toast data without an active snapshot".

This commit avoids the problem by pushing an active snapshot after
starting a transaction in the compress and recompress policy routines.

Fixes timescale#3656
mkindahl assigned gayyappan and unassigned mkindahl Oct 6, 2021
gayyappan added a commit to gayyappan/timescaledb that referenced this issue Oct 12, 2021
This PR removes the C code that executes the compression
policy. Instead we use a PL/pgSQL procedure to execute
the policy.

PG13.4 and PG12.8 introduced some changes
that require PortalContexts while executing transactions.
The compression policy procedure compresses chunks in
multiple transactions. We have seen some issues with snapshots
and portal management in the policy code (due to the
PG13.4 code changes). The SPI API has transaction-portal management
code, but the compression policy code does not use SPI
interfaces. It is fairly easy to convert the policy into
a PL/pgSQL procedure (which calls SPI) rather than replicating
portal management code in C to manage multiple transactions in the
compression policy.

This PR also disallows decompress_chunk, compress_chunk and
recompress_chunk in txn read only mode.

Fixes timescale#3656
@hardikm10 (Author) commented:
Adding here for completeness: a segfault also occurs on CALL run_job in some cases.

For example, after some drop_chunk calls, a reinsert into a compressed chunk, and logging out and back in again, the test service was segfaulting (the order does not appear to matter). Here's the backtrace:

(gdb) bt
#0  0x000055c198311fa6 in GetActiveSnapshot () at utils/time/./build/../src/backend/utils/time/snapmgr.c:845
#1  InitializeParallelDSM (pcxt=0x55c199cff2a8) at access/transam/./build/../src/backend/access/transam/parallel.c:220
#2  0x000055c1982f2e87 in _bt_begin_parallel (request=1, isconcurrent=<optimized out>, buildstate=0x7ffc70db5950) at access/nbtree/./build/../src/backend/access/nbtree/nbtsort.c:1545
#3  _bt_spools_heapscan (indexInfo=0x55c199cff198, buildstate=0x7ffc70db5950, index=..., heap=0x7fbeac74fcf8) at access/nbtree/./build/../src/backend/access/nbtree/nbtsort.c:394
#4  btbuild (heap=0x7fbeac74fcf8, index=0x7fbeac753498, indexInfo=0x55c199cff198) at access/nbtree/./build/../src/backend/access/nbtree/nbtsort.c:326
#5  0x000055c198356995 in index_build (heapRelation=0x7fbeac74fcf8, indexRelation=0x7fbeac753498, indexInfo=0x55c199cff198, isreindex=<optimized out>, parallel=<optimized out>) at catalog/./build/../src/backend/catalog/index.c:2962
#6  0x000055c198357397 in reindex_index (indexId=41860, skip_constraint_checks=<optimized out>, persistence=<optimized out>, options=0) at catalog/./build/../src/backend/catalog/index.c:3587
#7  0x000055c198357835 in reindex_relation (relid=relid@entry=41856, flags=flags@entry=0, options=options@entry=0) at catalog/./build/../src/backend/catalog/index.c:3810
#8  0x00007fbeac8586b9 in decompress_chunk (out_table=41856, in_table=<optimized out>) at /build/timescaledb/tsl/src/compression/compression.c:1097
#9  decompress_chunk_impl (uncompressed_hypertable_relid=41851, uncompressed_chunk_relid=41856, if_compressed=<optimized out>) at /build/timescaledb/tsl/src/compression/compress_utils.c:337
#10 0x00007fbeac882d9e in tsl_recompress_chunk_wrapper.isra.0 (uncompressed_chunk=0x55c199cfebb8) at /build/timescaledb/tsl/src/compression/compress_utils.c:576
#11 0x00007fbeac84a9a9 in policy_compression_execute (config=<optimized out>, job_id=1001) at /build/timescaledb/tsl/src/bgw_policy/job.c:610
#12 policy_compression_proc (fcinfo=<optimized out>) at /build/timescaledb/tsl/src/bgw_policy/compression_api.c:149
#13 policy_compression_proc (fcinfo=<optimized out>) at /build/timescaledb/tsl/src/bgw_policy/compression_api.c:142
#14 0x000055c198403a36 in ExecuteCallStmt (stmt=stmt@entry=0x55c199d49b28, params=<optimized out>, atomic=atomic@entry=false, dest=dest@entry=0x55c1989d25a0 <donothingDR.lto_priv.0>)
    at commands/./build/../src/backend/commands/functioncmds.c:2232
#15 0x00007fbeac84937b in job_execute_procedure (funcexpr=0x55c199d4a518) at /build/timescaledb/tsl/src/bgw_policy/job.c:751
#16 job_execute (job=<optimized out>) at /build/timescaledb/tsl/src/bgw_policy/job.c:832
#17 0x00007fbeac84957f in job_run (fcinfo=<optimized out>) at /build/timescaledb/tsl/src/bgw_policy/job_api.c:305
#18 0x000055c198403a36 in ExecuteCallStmt (stmt=<optimized out>, params=<optimized out>, atomic=<optimized out>, dest=<optimized out>) at commands/./build/../src/backend/commands/functioncmds.c:2232
#19 0x000055c1986010e6 in standard_ProcessUtility (pstmt=0x55c199c3f088, queryString=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x55c199c3f530, qc=0x7ffc70db7600)
    at tcop/./build/../src/backend/tcop/utility.c:817
#20 0x00007fbecf1c2be6 in pgss_ProcessUtility (pstmt=0x55c199c3f088, queryString=0x55c199c3e4b8 "call run_job(1001);", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x55c199c3f530, qc=0x7ffc70db7600)
    at ./build/../contrib/pg_stat_statements/pg_stat_statements.c:1156
#21 0x00007fbeac925ce7 in ?? () from /usr/lib/postgresql/13/lib/plugins/pgextwlist.so
#22 0x00007fbeac8df737 in prev_ProcessUtility (args=<optimized out>) at /build/timescaledb/src/process_utility.c:90
#23 0x00007fbeac8ec78e in timescaledb_ddl_command_start (pstmt=0x55c199c3f088, query_string=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=<optimized out>, dest=0x55c199c3f530, completion_tag=0x7ffc70db7600)
    at /build/timescaledb/src/process_utility.c:4078
#24 0x000055c198601ca3 in ProcessUtility (qc=0x7ffc70db7600, dest=0x55c199c3f530, queryEnv=<optimized out>, params=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, queryString=<optimized out>, pstmt=0x55c199c3f088)
    at tcop/./build/../src/backend/tcop/utility.c:520
#25 PortalRunUtility (portal=portal@entry=0x55c199ce36b8, pstmt=pstmt@entry=0x55c199c3f088, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=0x55c199c3f530, qc=0x7ffc70db7600)
    at tcop/./build/../src/backend/tcop/pquery.c:1145
#26 0x000055c198601e25 in PortalRunMulti (portal=portal@entry=0x55c199ce36b8, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x55c199c3f530, altdest=altdest@entry=0x55c199c3f530,
    qc=qc@entry=0x7ffc70db7600) at tcop/./build/../src/backend/tcop/pquery.c:1308
#27 0x000055c19860232e in PortalRun (portal=0x55c199ce36b8, count=9223372036854775807, isTopLevel=<optimized out>, run_once=<optimized out>, dest=0x55c199c3f530, altdest=0x55c199c3f530, qc=0x7ffc70db7600)
    at tcop/./build/../src/backend/tcop/pquery.c:786
#28 0x000055c1985faff3 in exec_simple_query (query_string=0x55c199c3e4b8 "call run_job(1001);") at tcop/./build/../src/backend/tcop/postgres.c:1239
#29 0x000055c1985fd727 in PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at tcop/./build/../src/backend/tcop/postgres.c:4339
#30 0x000055c19857beb8 in BackendRun (port=0x55c199c892c0) at postmaster/./build/../src/backend/postmaster/postmaster.c:4526
#31 BackendStartup (port=0x55c199c892c0) at postmaster/./build/../src/backend/postmaster/postmaster.c:4210
#32 ServerLoop () at postmaster/./build/../src/backend/postmaster/postmaster.c:1739
#33 0x000055c19857cdb4 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at postmaster/./build/../src/backend/postmaster/postmaster.c:1412
#34 0x000055c19827e989 in main (argc=17, argv=0x55c199c37030) at main/./build/../src/backend/main/main.c:210
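
The crash site is consistent with the same root cause: InitializeParallelDSM asks for the active snapshot when the parallel btree build starts, and GetActiveSnapshot dereferences the active-snapshot stack with only an assertion guarding against NULL (roughly, from PostgreSQL's src/backend/utils/time/snapmgr.c):

	Snapshot
	GetActiveSnapshot(void)
	{
		/* Assertion is compiled out in production builds */
		Assert(ActiveSnapshot != NULL);

		return ActiveSnapshot->as_snap;
	}

With no active snapshot, ActiveSnapshot is NULL, so the dereference produces the segfault seen in frame #0 above.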
