
Compression job running after inserting data fails #3656

Closed · hardikm10 opened this issue Oct 6, 2021 · 3 comments

Labels: bgw (The background worker subsystem, including the scheduler), bug, compression

@hardikm10 commented:

Relevant system information:
OS: Docker (timescale/timescaledb:2.4.2-pg13)
PostgreSQL version (output of postgres --version): 13.4
TimescaleDB version (output of \dx in psql): 2.4.2
Installation method: using Docker

Describe the bug
Calling run_job(<compression_job_id>) fails with the error "ERROR: cannot fetch toast data without an active snapshot"
if rows were inserted into already-compressed chunks.

To Reproduce

drop table sensor_data;

create table sensor_data(
time timestamptz not null,
sensor_id integer not null,
cpu double precision null,
temperature double precision null );

select from create_hypertable('sensor_data','time');

INSERT INTO sensor_data
SELECT
time + (INTERVAL '1 minute' * random()) AS time,
sensor_id,
random() AS cpu,
random()* 100 AS temperature
FROM
generate_series(now() - INTERVAL '1 months', now() - INTERVAL '1 week', INTERVAL '10 minute') AS g1(time),
generate_series(1, 100, 1 ) AS g2(sensor_id)
ORDER BY
time;

\d+ sensor_data

alter table sensor_data set (timescaledb.compress, timescaledb.compress_orderby = 'time DESC');
select add_compression_policy('sensor_data','1 minute'::INTERVAL);

-- Wait for all the chunks to be compressed.
-- Use the query below to confirm:
SELECT c.chunk_name, c.range_start, c.range_end, c.is_compressed, pg_size_pretty(d.total_bytes)
FROM public.chunks_detailed_size('public.sensor_data') d,
     timescaledb_information.chunks c
WHERE d.chunk_name = c.chunk_name
ORDER BY c.range_start;

-- Insert into the first chunk, which is already compressed:
INSERT INTO sensor_data
SELECT
    time + (INTERVAL '1 minute' * random()) AS time,
    sensor_id,
    random() AS cpu,
    random() * 100 AS temperature
FROM
    generate_series('2021-09-02 00:00:00+00'::timestamptz, '2021-09-08 00:00:00+00'::timestamptz, INTERVAL '10 minute') AS g1(time),
    generate_series(1, 100, 1) AS g2(sensor_id)
ORDER BY
    time;

-- Run the compression job manually (assuming 1000 is the job id returned by add_compression_policy above):
CALL run_job(1000);

postgres=# call run_job(1000);
ERROR:  cannot fetch toast data without an active snapshot

Expected behavior
The job should run and compress the chunks.

Actual behavior
The job fails to run if called manually using 'CALL run_job'.

mkindahl changed the title from "Calling a compression job results in "ERROR: cannot fetch toast data without an active snapshot"" to "Compression job running after inserting data fails" Oct 6, 2021
mkindahl self-assigned this Oct 6, 2021
mkindahl added the bgw, bug, and compression labels Oct 6, 2021
@mkindahl (Contributor) commented Oct 6, 2021:

The error is reported from inside init_toast_snapshot when GetOldestSnapshot() returns no snapshot, which means that there is no snapshot at all (see the paraphrased check after the list below). This function is called from two locations, toast_delete_datum and toast_fetch_datum_slice, which are called when deleting or scanning tuples in a table that has a TOAST table (which a compressed chunk has).

  • toast_delete_datum is called from toast_tuple_cleanup, which is ultimately called from heap_toast_insert_or_update.
  • toast_fetch_datum_slice is called from table_relation_fetch_toast_slice, which is called when detoasting a tuple in the TOAST table for the compressed table.
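
For reference, the check that raises the error looks roughly like this (paraphrased from src/backend/access/common/toast_internals.c in PostgreSQL 13; the comment is mine, not verbatim from the source):

	void
	init_toast_snapshot(Snapshot toast_snapshot)
	{
		Snapshot	snapshot = GetOldestSnapshot();

		/* NULL here means no active or registered snapshot exists at all */
		if (snapshot == NULL)
			elog(ERROR, "cannot fetch toast data without an active snapshot");

		InitToastSnapshot(*toast_snapshot, snapshot->lsn, snapshot->whenTaken);
	}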

Inside the procedure policy_compression_execute (and policy_recompression_execute) we have the following sequence:

	if (ActiveSnapshotSet())
		PopActiveSnapshot();
	/* process each chunk in a new transaction */
	int total_chunks = list_length(chunkid_lst), num_chunks = 0;
	foreach (lc, chunkid_lst)
	{
		CommitTransactionCommand();
		StartTransactionCommand();
                       .
                       .
                       .
	}

So, we pop the active snapshot, potentially leaving no snapshot at all, and then commit the transaction and start a new one. I wonder if we should have a PushActiveSnapshot(GetTransactionSnapshot()) after starting the new transaction, since we can potentially have a call to heap_toast_insert_or_update or table_relation_fetch_toast_slice as part of the recompress policy execution.
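
A minimal sketch of that suggestion, applied to the loop quoted above (the per-chunk compression work is elided; the comments are mine):

	if (ActiveSnapshotSet())
		PopActiveSnapshot();
	/* process each chunk in a new transaction */
	int total_chunks = list_length(chunkid_lst), num_chunks = 0;
	foreach (lc, chunkid_lst)
	{
		CommitTransactionCommand();
		StartTransactionCommand();
		/* Give any TOAST reads and writes in this transaction a snapshot */
		PushActiveSnapshot(GetTransactionSnapshot());

		/* ... compress or recompress the chunk ... */

		/* Pop before the commit at the top of the next iteration */
		PopActiveSnapshot();
	}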

@mkindahl (Contributor) commented Oct 6, 2021:

Attaching a debugger to init_toast_snapshot and running the reproduction gives the following stack:

#0  init_toast_snapshot (toast_snapshot=toast_snapshot@entry=0x7fff5ad2fb90) at toast_internals.c:625
#1  0x000056276c56b482 in heap_fetch_toast_slice (toastrel=0x7f0fa0095b60, valueid=118360, attrsize=3388, sliceoffset=0, slicelength=<optimized out>, result=0x56276e991b30) at heaptoast.c:688
#2  0x000056276c511069 in table_relation_fetch_toast_slice (result=0x56276e991b30, slicelength=3388, sliceoffset=0, attrsize=3388, valueid=<optimized out>, toastrel=0x7f0fa0095b60) at ../../../../src/include/access/tableam.h:1672
#3  toast_fetch_datum (attr=<optimized out>) at detoast.c:369
#4  0x000056276c5116fd in detoast_attr (attr=0x7f0fa131afa0) at detoast.c:123
#5  0x000056276c9d78b5 in pg_detoast_datum (datum=<optimized out>) at fmgr.c:1729
#6  0x00007f0fa013ebad in populate_per_compressed_columns_from_data (compressed_is_nulls=0x56276e98c300, compressed_datums=0x56276e98c038, num_cols=8, per_compressed_cols=0x56276e5d1f78)
    at /home/mats/repos/timescaledb-2.0/tsl/src/compression/compression.c:1201
#7  decompress_chunk (in_table=<optimized out>, out_table=118325) at /home/mats/repos/timescaledb-2.0/tsl/src/compression/compression.c:1093
#8  0x00007f0fa01422ea in decompress_chunk_impl (uncompressed_hypertable_relid=118320, uncompressed_chunk_relid=uncompressed_chunk_relid@entry=118325, if_compressed=if_compressed@entry=false)
    at /home/mats/repos/timescaledb-2.0/tsl/src/compression/compress_utils.c:337
#9  0x00007f0fa0142e7e in tsl_recompress_chunk_wrapper (uncompressed_chunk=uncompressed_chunk@entry=0x56276e5d2300) at /home/mats/repos/timescaledb-2.0/tsl/src/compression/compress_utils.c:576
#10 0x00007f0fa01357bb in policy_compression_execute (job_id=1040, config=<optimized out>) at /home/mats/repos/timescaledb-2.0/tsl/src/bgw_policy/job.c:610
#11 0x00007f0fa0133091 in policy_compression_proc (fcinfo=0x7fff5ad30070) at /home/mats/repos/timescaledb-2.0/tsl/src/bgw_policy/compression_api.c:149
#12 0x000056276c69db40 in ExecuteCallStmt (stmt=stmt@entry=0x56276e9208f8, params=<optimized out>, atomic=atomic@entry=false, dest=dest@entry=0x56276cc866e0 <donothingDR>) at functioncmds.c:2232
#13 0x00007f0fa013601d in job_execute_procedure (funcexpr=0x56276e9fc9c8) at /home/mats/repos/timescaledb-2.0/tsl/src/bgw_policy/job.c:751
#14 job_execute (job=<optimized out>) at /home/mats/repos/timescaledb-2.0/tsl/src/bgw_policy/job.c:832
#15 0x00007f0fa0136a7f in job_run (fcinfo=<optimized out>) at /home/mats/repos/timescaledb-2.0/tsl/src/bgw_policy/job_api.c:305
#16 0x000056276c69db40 in ExecuteCallStmt (stmt=stmt@entry=0x56276e4ad810, params=params@entry=0x0, atomic=atomic@entry=false, dest=dest@entry=0x56276e4add18) at functioncmds.c:2232
#17 0x000056276c8a1465 in standard_ProcessUtility (pstmt=pstmt@entry=0x56276e4ad8d8, queryString=queryString@entry=0x56276e4acc90 "CALL run_job(1040);", context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=params@entry=0x0, 
    queryEnv=0x0, dest=0x56276e4add18, qc=0x7fff5ad31340) at ../../../src/include/nodes/nodes.h:594
#18 0x00007f0faa0615c6 in loader_process_utility_hook (pstmt=0x56276e4ad8d8, query_string=0x56276e4acc90 "CALL run_job(1040);", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=<optimized out>, dest=<optimized out>, 
    completion_tag=0x7fff5ad31340) at /home/mats/repos/timescaledb-2.0/src/loader/loader.c:522
#19 0x00007f0fa0207463 in prev_ProcessUtility (args=args@entry=0x7fff5ad31030) at /home/mats/repos/timescaledb-2.0/src/process_utility.c:90
#20 0x00007f0fa020862e in timescaledb_ddl_command_start (pstmt=0x56276e4ad8d8, query_string=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=<optimized out>, dest=0x56276e4add18, completion_tag=0x7fff5ad31340)
    at /home/mats/repos/timescaledb-2.0/src/process_utility.c:4100
#21 0x000056276c89f95a in PortalRunUtility (portal=portal@entry=0x56276e5168b0, pstmt=pstmt@entry=0x56276e4ad8d8, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=0x56276e4add18, qc=0x7fff5ad31340)
    at pquery.c:1145
#22 0x000056276c89faa4 in PortalRunMulti (portal=portal@entry=0x56276e5168b0, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x56276e4add18, altdest=altdest@entry=0x56276e4add18, 
    qc=qc@entry=0x7fff5ad31340) at pquery.c:1301
#23 0x000056276c8a003e in PortalRun (portal=portal@entry=0x56276e5168b0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x56276e4add18, 
    altdest=altdest@entry=0x56276e4add18, qc=0x7fff5ad31340) at pquery.c:786
#24 0x000056276c89b6b9 in exec_simple_query (query_string=0x56276e4acc90 "CALL run_job(1040);") at postgres.c:1239
#25 0x000056276c89cf7f in PostgresMain (argc=<optimized out>, argv=argv@entry=0x56276e4da840, dbname=<optimized out>, username=<optimized out>) at postgres.c:4339
#26 0x000056276c80bd8f in BackendRun (port=0x56276e4d3fd0, port=0x56276e4d3fd0) at postmaster.c:4526
#27 BackendStartup (port=0x56276e4d3fd0) at postmaster.c:4210
#28 ServerLoop () at postmaster.c:1739
#29 0x000056276c80cd54 in PostmasterMain (argc=1, argv=<optimized out>) at postmaster.c:1412
#30 0x000056276c506e90 in main (argc=1, argv=0x56276e4a67e0) at main.c:210

mkindahl added a commit to mkindahl/timescaledb that referenced this issue Oct 6, 2021
When compression is running, `init_toast_snapshot` requires an active
snapshot to work. This function is, among other locations, called from
`heap_prepare_insert` and `heap_update` when it is necessary to toast
the tuple. Starting a new transaction without a snapshot will fail with
the error "cannot fetch toast data without an active snapshot".

This commit avoids the problem by pushing an active snapshot after
starting a transaction in the compress and recompress policy routines.

Fixes timescale#3656
mkindahl assigned gayyappan and unassigned mkindahl Oct 6, 2021
gayyappan added a commit to gayyappan/timescaledb that referenced this issue Oct 12, 2021
This PR removes the C code that executes the compression
policy. Instead we use a PL/pgSQL procedure to execute
the policy.

PG13.4 and PG12.8 introduced some changes
that require PortalContexts while executing transactions.
The compression policy procedure compresses chunks in
multiple transactions. We have seen some issues with snapshots
and portal management in the policy code (due to the
PG13.4 code changes). The SPI API has transaction-portal management
code, but the compression policy code does not use SPI
interfaces. It is fairly easy to convert the policy into
a PL/pgSQL procedure (which calls SPI) rather than replicating
portal management code in C to manage multiple transactions in the
compression policy.

This PR also disallows decompress_chunk, compress_chunk and
recompress_chunk in txn read only mode.

Fixes timescale#3656
@hardikm10 (Author) commented:
Adding here for completeness: a segfault also occurs on CALL run_job in some cases.

For example, after some drop_chunk calls, a reinsert into a compressed chunk, and logging out and back in again, the test service was segfaulting (the order does not appear to matter). Here's the backtrace:

(gdb) bt
#0  0x000055c198311fa6 in GetActiveSnapshot () at utils/time/./build/../src/backend/utils/time/snapmgr.c:845
#1  InitializeParallelDSM (pcxt=0x55c199cff2a8) at access/transam/./build/../src/backend/access/transam/parallel.c:220
#2  0x000055c1982f2e87 in _bt_begin_parallel (request=1, isconcurrent=<optimized out>, buildstate=0x7ffc70db5950) at access/nbtree/./build/../src/backend/access/nbtree/nbtsort.c:1545
#3  _bt_spools_heapscan (indexInfo=0x55c199cff198, buildstate=0x7ffc70db5950, index=..., heap=0x7fbeac74fcf8) at access/nbtree/./build/../src/backend/access/nbtree/nbtsort.c:394
#4  btbuild (heap=0x7fbeac74fcf8, index=0x7fbeac753498, indexInfo=0x55c199cff198) at access/nbtree/./build/../src/backend/access/nbtree/nbtsort.c:326
#5  0x000055c198356995 in index_build (heapRelation=0x7fbeac74fcf8, indexRelation=0x7fbeac753498, indexInfo=0x55c199cff198, isreindex=<optimized out>, parallel=<optimized out>) at catalog/./build/../src/backend/catalog/index.c:2962
#6  0x000055c198357397 in reindex_index (indexId=41860, skip_constraint_checks=<optimized out>, persistence=<optimized out>, options=0) at catalog/./build/../src/backend/catalog/index.c:3587
#7  0x000055c198357835 in reindex_relation (relid=relid@entry=41856, flags=flags@entry=0, options=options@entry=0) at catalog/./build/../src/backend/catalog/index.c:3810
#8  0x00007fbeac8586b9 in decompress_chunk (out_table=41856, in_table=<optimized out>) at /build/timescaledb/tsl/src/compression/compression.c:1097
#9  decompress_chunk_impl (uncompressed_hypertable_relid=41851, uncompressed_chunk_relid=41856, if_compressed=<optimized out>) at /build/timescaledb/tsl/src/compression/compress_utils.c:337
#10 0x00007fbeac882d9e in tsl_recompress_chunk_wrapper.isra.0 (uncompressed_chunk=0x55c199cfebb8) at /build/timescaledb/tsl/src/compression/compress_utils.c:576
#11 0x00007fbeac84a9a9 in policy_compression_execute (config=<optimized out>, job_id=1001) at /build/timescaledb/tsl/src/bgw_policy/job.c:610
#12 policy_compression_proc (fcinfo=<optimized out>) at /build/timescaledb/tsl/src/bgw_policy/compression_api.c:149
#13 policy_compression_proc (fcinfo=<optimized out>) at /build/timescaledb/tsl/src/bgw_policy/compression_api.c:142
#14 0x000055c198403a36 in ExecuteCallStmt (stmt=stmt@entry=0x55c199d49b28, params=<optimized out>, atomic=atomic@entry=false, dest=dest@entry=0x55c1989d25a0 <donothingDR.lto_priv.0>)
    at commands/./build/../src/backend/commands/functioncmds.c:2232
#15 0x00007fbeac84937b in job_execute_procedure (funcexpr=0x55c199d4a518) at /build/timescaledb/tsl/src/bgw_policy/job.c:751
#16 job_execute (job=<optimized out>) at /build/timescaledb/tsl/src/bgw_policy/job.c:832
#17 0x00007fbeac84957f in job_run (fcinfo=<optimized out>) at /build/timescaledb/tsl/src/bgw_policy/job_api.c:305
#18 0x000055c198403a36 in ExecuteCallStmt (stmt=<optimized out>, params=<optimized out>, atomic=<optimized out>, dest=<optimized out>) at commands/./build/../src/backend/commands/functioncmds.c:2232
#19 0x000055c1986010e6 in standard_ProcessUtility (pstmt=0x55c199c3f088, queryString=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x55c199c3f530, qc=0x7ffc70db7600)
    at tcop/./build/../src/backend/tcop/utility.c:817
#20 0x00007fbecf1c2be6 in pgss_ProcessUtility (pstmt=0x55c199c3f088, queryString=0x55c199c3e4b8 "call run_job(1001);", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x55c199c3f530, qc=0x7ffc70db7600)
    at ./build/../contrib/pg_stat_statements/pg_stat_statements.c:1156
#21 0x00007fbeac925ce7 in ?? () from /usr/lib/postgresql/13/lib/plugins/pgextwlist.so
#22 0x00007fbeac8df737 in prev_ProcessUtility (args=<optimized out>) at /build/timescaledb/src/process_utility.c:90
#23 0x00007fbeac8ec78e in timescaledb_ddl_command_start (pstmt=0x55c199c3f088, query_string=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=<optimized out>, dest=0x55c199c3f530, completion_tag=0x7ffc70db7600)
    at /build/timescaledb/src/process_utility.c:4078
#24 0x000055c198601ca3 in ProcessUtility (qc=0x7ffc70db7600, dest=0x55c199c3f530, queryEnv=<optimized out>, params=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, queryString=<optimized out>, pstmt=0x55c199c3f088)
    at tcop/./build/../src/backend/tcop/utility.c:520
#25 PortalRunUtility (portal=portal@entry=0x55c199ce36b8, pstmt=pstmt@entry=0x55c199c3f088, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=0x55c199c3f530, qc=0x7ffc70db7600)
    at tcop/./build/../src/backend/tcop/pquery.c:1145
#26 0x000055c198601e25 in PortalRunMulti (portal=portal@entry=0x55c199ce36b8, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x55c199c3f530, altdest=altdest@entry=0x55c199c3f530,
    qc=qc@entry=0x7ffc70db7600) at tcop/./build/../src/backend/tcop/pquery.c:1308
#27 0x000055c19860232e in PortalRun (portal=0x55c199ce36b8, count=9223372036854775807, isTopLevel=<optimized out>, run_once=<optimized out>, dest=0x55c199c3f530, altdest=0x55c199c3f530, qc=0x7ffc70db7600)
    at tcop/./build/../src/backend/tcop/pquery.c:786
#28 0x000055c1985faff3 in exec_simple_query (query_string=0x55c199c3e4b8 "call run_job(1001);") at tcop/./build/../src/backend/tcop/postgres.c:1239
#29 0x000055c1985fd727 in PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at tcop/./build/../src/backend/tcop/postgres.c:4339
#30 0x000055c19857beb8 in BackendRun (port=0x55c199c892c0) at postmaster/./build/../src/backend/postmaster/postmaster.c:4526
#31 BackendStartup (port=0x55c199c892c0) at postmaster/./build/../src/backend/postmaster/postmaster.c:4210
#32 ServerLoop () at postmaster/./build/../src/backend/postmaster/postmaster.c:1739
#33 0x000055c19857cdb4 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at postmaster/./build/../src/backend/postmaster/postmaster.c:1412
#34 0x000055c19827e989 in main (argc=17, argv=0x55c199c37030) at main/./build/../src/backend/main/main.c:210
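
The crash site is consistent with the same root cause: InitializeParallelDSM asks for the active snapshot when the parallel btree build starts, and GetActiveSnapshot dereferences the active-snapshot stack with only an assertion guarding against NULL (roughly, from PostgreSQL's src/backend/utils/time/snapmgr.c):

	Snapshot
	GetActiveSnapshot(void)
	{
		/* Assertion is compiled out in production builds */
		Assert(ActiveSnapshot != NULL);

		return ActiveSnapshot->as_snap;
	}

With no active snapshot, ActiveSnapshot is NULL, so the dereference produces the segfault seen in frame #0 above.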
