Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix subtransaction resource owner #5668

Merged
merged 1 commit into from
May 11, 2023

Conversation

mkindahl
Copy link
Contributor

@mkindahl mkindahl commented May 9, 2023

When executing a subtransaction using BeginInternalSubTransaction the memory context switches from the current context to CurTransactionContext and when the transaction is aborted or committed using ReleaseCurrentSubTransaction or RollbackAndReleaseCurrentSubTransaction respectively, it will not restore to the previous memory context or resource owner but rather use TopTransactionContext. Because of this, both the memory context and the resource owner will be wrong when executing calculate_next_start_on_failure, which causes run_job to generate an error when used with the telemetry job.

This commit fixes this by saving both the resource owner and the memory context before starting the internal subtransaction and restoring it after finishing the internal subtransaction.

Since the ts_bgw_job_run_and_set_next_start was incorrectly reading the wrong result from the telemetry job, this commit fixes this as well. Note that ts_bgw_job_run_and_set_next_start is only used when running the telemetry job, so it does not cause issues for other jobs.

@mkindahl mkindahl force-pushed the bad-subtrans-context branch 2 times, most recently from 4c735f6 to 87b7218 Compare May 10, 2023 07:26
@codecov
Copy link

codecov bot commented May 10, 2023

Codecov Report

Merging #5668 (8157026) into main (abb6762) will increase coverage by 0.02%.
The diff coverage is 80.00%.

❗ Current head 8157026 differs from pull request most recent head 1448cf4. Consider uploading reports for the commit 1448cf4 to get more accurate results

@@            Coverage Diff             @@
##             main    #5668      +/-   ##
==========================================
+ Coverage   90.97%   90.99%   +0.02%     
==========================================
  Files         230      230              
  Lines       54569    54568       -1     
==========================================
+ Hits        49642    49653      +11     
+ Misses       4927     4915      -12     
Impacted Files Coverage Δ
src/bgw/job_stat.c 92.69% <71.42%> (-0.21%) ⬇️
src/bgw/job.c 93.31% <100.00%> (ø)

... and 7 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@mkindahl mkindahl marked this pull request as ready for review May 10, 2023 07:46
@github-actions
Copy link

@fabriziomello, @nikkhils: please review this pull request.

Powered by pull-review

@mkindahl
Copy link
Contributor Author

Note that this can trigger an assertion inside calculate_next_start_on_failure but for a release build, it just generates an error.

@mkindahl
Copy link
Contributor Author

It is difficult to create a good test. It was tested manually:

mats@fury:~/work/timescale/timescaledb+telemetry/build-14$ psql
Expanded display is used automatically.
Null display is "[NULL]".
psql (14.6)
Type "help" for help.

mats=# SELECT id AS job_id FROM _timescaledb_config.bgw_job
 WHERE proc_schema = '_timescaledb_internal'
   AND proc_name = 'policy_telemetry' \gset
mats=# CALL run_job(:job_id);
NOTICE:  the "timescaledb" extension is up-to-date
CALL

src/bgw/job.c Show resolved Hide resolved
src/bgw/job.c Show resolved Hide resolved
@@ -346,7 +347,8 @@ calculate_next_start_on_failure(TimestampTz finish_time, int consecutive_failure
consecutive_failures);
Assert(consecutive_failures > 0 && multiplier < 63);

MemoryContext oldctx;
MemoryContext oldctx = CurrentMemoryContext;
ResourceOwner oldowner = CurrentResourceOwner;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be possible to place it inside the PG_TRY - so that its guarded by the try?

I think in some odd cases BeginInternalSubTransaction may error out...possibly leaving it in inconsistent state

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... maybe. Will investigate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it can error out and it is better to move the call to BeginInternalSubTransaction() inside the PG_TRY.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see CurrentResourceOwner changed anywhere - I'm just wondering how does that change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is changed inside BeginInternalSubTransaction, ReleaseCurrentSubTransaction, and RollbackAndReleaseCurrentSubTransaction.

tsl/test/expected/bgw_telemetry.out Show resolved Hide resolved
@mkindahl mkindahl force-pushed the bad-subtrans-context branch 2 times, most recently from e2ec356 to 02ebf97 Compare May 10, 2023 15:41
@mkindahl mkindahl requested a review from kgyrtkirk May 10, 2023 19:21

if (mark)
ts_bgw_job_stat_mark_end(job, had_error ? JOB_FAILURE : JOB_SUCCESS);
ts_bgw_job_stat_mark_end(job, result ? JOB_SUCCESS : JOB_FAILURE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this very puzzling...there is no contract for what implementations of job_main_func supposed to live up to....

...and now this commit is just negating that...seems like ts_telemetry_main is the only implementation ? 🤦

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but don't want to make larger changes than necessary since things have a tendency to break. Fine with refactoring code as a separate PR though.

@@ -346,7 +347,8 @@ calculate_next_start_on_failure(TimestampTz finish_time, int consecutive_failure
consecutive_failures);
Assert(consecutive_failures > 0 && multiplier < 63);

MemoryContext oldctx;
MemoryContext oldctx = CurrentMemoryContext;
ResourceOwner oldowner = CurrentResourceOwner;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see CurrentResourceOwner changed anywhere - I'm just wondering how does that change?

src/bgw/job_stat.c Show resolved Hide resolved
ErrorData *errdata = CopyErrorData();
ereport(LOG,
(errcode(ERRCODE_INTERNAL_ERROR),
errmsg("could not calculate next start on failure: resetting value"),
errdetail("Error: %s.", errdata->message)));
FlushErrorState();
RollbackAndReleaseCurrentSubTransaction();
}
PG_END_TRY();
Assert(CurrentMemoryContext == oldctx);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't it make sense to move

	MemoryContextSwitchTo(oldctx);
		CurrentResourceOwner = oldowner;

from both try/catch branches to here - instead of this assert ?

it would make the CopyErrorData use a different mem context; but I think that doesn't make much difference

Copy link
Contributor Author

@mkindahl mkindahl May 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is following the pattern used in PostgreSQL to handle similar situations. Trying to "optimize" the code here runs the risk of introducing subtle errors. The comment on CopyErrorData says:

/*
 * CopyErrorData --- obtain a copy of the topmost error stack entry
 *
 * This is only for use in error handler code.  The data is copied into the
 * current memory context, so callers should always switch away from
 * ErrorContext first; otherwise it will be lost when FlushErrorState is done.
 */

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes ; I know - but the code uses the results right away in the next line to create an error message...so it doesn't escape

it could be left as-is

src/bgw/job_stat.c Show resolved Hide resolved
When executing a subtransaction using `BeginInternalSubTransaction` the
memory context switches from the current context to
`CurTransactionContext` and when the transaction is aborted or
committed using `ReleaseCurrentSubTransaction` or
`RollbackAndReleaseCurrentSubTransaction` respectively, it will not
restore to the previous memory context or resource owner but rather use
`TopTransactionContext`. Because of this, both the memory context and
the resource owner will be wrong when executing
`calculate_next_start_on_failure`, which causes `run_job` to generate
an error when used with the telemetry job.

This commit fixes this by saving both the resource owner and the memory
context before starting the internal subtransaction and restoring it
after finishing the internal subtransaction.

Since the `ts_bgw_job_run_and_set_next_start` was incorrectly reading
the wrong result from the telemetry job, this commit fixes this as
well. Note that `ts_bgw_job_run_and_set_next_start` is only used when
running the telemetry job, so it does not cause issues for other jobs.
@mkindahl mkindahl enabled auto-merge (rebase) May 11, 2023 11:55
@mkindahl mkindahl merged commit 656daf4 into timescale:main May 11, 2023
@mkindahl mkindahl deleted the bad-subtrans-context branch May 11, 2023 12:16
kgyrtkirk added a commit to kgyrtkirk/timescaledb that referenced this pull request May 12, 2023
This release includes these noteworthy features:
* compressed hypertable enhancements:
  * UPDATE/DELETE support
  * ON CONFLICT DO UPDATE
* Join support for hierarchical Continougs Aggregates
* performance improvements

**Features**
* timescale#5212 Allow pushdown of reference table joins
* timescale#5221 Improve Realtime Continuous Aggregate performance
* timescale#5252 Improve unique constraint support on compressed hypertables
* timescale#5339 Support UPDATE/DELETE on compressed hypertables
* timescale#5344 Enable JOINS for Hierarchical Continuous Aggregates
* timescale#5361 Add parallel support for partialize_agg()
* timescale#5417 Refactor and optimize distributed COPY
* timescale#5454 Add support for ON CONFLICT DO UPDATE for compressed hypertables
* timescale#5547 Skip Ordered Append when only 1 child node is present
* timescale#5510 Propagate vacuum/analyze to compressed chunks
* timescale#5584 Reduce decompression during constraint checking
* timescale#5530 Optimize compressed chunk resorting

**Bugfixes**
* timescale#5396 Fix SEGMENTBY columns predicates to be pushed down
* timescale#5427 Handle user-defined FDW options properly
* timescale#5442 Decompression may have lost DEFAULT values
* timescale#5459 Fix issue creating dimensional constraints
* timescale#5570 Improve interpolate error message on datatype mismatch
* timescale#5573 Fix unique constraint on compressed tables
* timescale#5615 Add permission checks to run_job()
* timescale#5614 Enable run_job() for telemetry job
* timescale#5578 Fix on-insert decompression after schema changes
* timescale#5613 Quote username identifier appropriately
* timescale#5525 Fix tablespace for compressed hypertable and corresponding toast
* timescale#5642 Fix ALTER TABLE SET with normal tables
* timescale#5666 Reduce memory usage for distributed analyze
* timescale#5668 Fix subtransaction resource owner

**Thanks**
* @kovetskiy and @DZDomi for reporting peformance regression in Realtime Continuous Aggregates
* @ollz272 for reporting an issue with interpolate error messages
kgyrtkirk added a commit to kgyrtkirk/timescaledb that referenced this pull request May 17, 2023
This release contains new features and bug fixes since the 2.10.3 release.
We deem it moderate priority for upgrading.

This release includes these noteworthy features:
* Support for DML operations on compressed chunks:
  * UPDATE/DELETE support
  * Support for unique constraints on compressed chunks
  * Support for `ON CONFLICT DO UPDATE`
  * Support for `ON CONFLICT DO NOTHING`
* Join support for hierarchical Continuous Aggregates

**Features**
* timescale#5212 Allow pushdown of reference table joins
* timescale#5221 Improve Realtime Continuous Aggregate performance
* timescale#5252 Improve unique constraint support on compressed hypertables
* timescale#5339 Support UPDATE/DELETE on compressed hypertables
* timescale#5344 Enable JOINS for Hierarchical Continuous Aggregates
* timescale#5361 Add parallel support for partialize_agg()
* timescale#5417 Refactor and optimize distributed COPY
* timescale#5454 Add support for ON CONFLICT DO UPDATE for compressed hypertables
* timescale#5547 Skip Ordered Append when only 1 child node is present
* timescale#5510 Propagate vacuum/analyze to compressed chunks
* timescale#5584 Reduce decompression during constraint checking
* timescale#5530 Optimize compressed chunk resorting
* timescale#5639 Support sending telemetry event reports

**Bugfixes**
* timescale#5396 Fix SEGMENTBY columns predicates to be pushed down
* timescale#5427 Handle user-defined FDW options properly
* timescale#5442 Decompression may have lost DEFAULT values
* timescale#5459 Fix issue creating dimensional constraints
* timescale#5570 Improve interpolate error message on datatype mismatch
* timescale#5573 Fix unique constraint on compressed tables
* timescale#5615 Add permission checks to run_job()
* timescale#5614 Enable run_job() for telemetry job
* timescale#5578 Fix on-insert decompression after schema changes
* timescale#5613 Quote username identifier appropriately
* timescale#5525 Fix tablespace for compressed hypertable and corresponding toast
* timescale#5642 Fix ALTER TABLE SET with normal tables
* timescale#5666 Reduce memory usage for distributed analyze
* timescale#5668 Fix subtransaction resource owner

**Thanks**
* @kovetskiy and @DZDomi for reporting peformance regression in Realtime Continuous Aggregates
* @ollz272 for reporting an issue with interpolate error messages
@kgyrtkirk kgyrtkirk mentioned this pull request May 17, 2023
kgyrtkirk added a commit to kgyrtkirk/timescaledb that referenced this pull request May 19, 2023
This release contains new features and bug fixes since the 2.10.3 release.
We deem it moderate priority for upgrading.

This release includes these noteworthy features:
* Support for DML operations on compressed chunks:
  * UPDATE/DELETE support
  * Support for unique constraints on compressed chunks
  * Support for `ON CONFLICT DO UPDATE`
  * Support for `ON CONFLICT DO NOTHING`
* Join support for hierarchical Continuous Aggregates

**Features**
* timescale#5212 Allow pushdown of reference table joins
* timescale#5221 Improve Realtime Continuous Aggregate performance
* timescale#5252 Improve unique constraint support on compressed hypertables
* timescale#5339 Support UPDATE/DELETE on compressed hypertables
* timescale#5344 Enable JOINS for Hierarchical Continuous Aggregates
* timescale#5361 Add parallel support for partialize_agg()
* timescale#5417 Refactor and optimize distributed COPY
* timescale#5454 Add support for ON CONFLICT DO UPDATE for compressed hypertables
* timescale#5547 Skip Ordered Append when only 1 child node is present
* timescale#5510 Propagate vacuum/analyze to compressed chunks
* timescale#5584 Reduce decompression during constraint checking
* timescale#5530 Optimize compressed chunk resorting
* timescale#5639 Support sending telemetry event reports

**Bugfixes**
* timescale#5396 Fix SEGMENTBY columns predicates to be pushed down
* timescale#5427 Handle user-defined FDW options properly
* timescale#5442 Decompression may have lost DEFAULT values
* timescale#5459 Fix issue creating dimensional constraints
* timescale#5570 Improve interpolate error message on datatype mismatch
* timescale#5573 Fix unique constraint on compressed tables
* timescale#5615 Add permission checks to run_job()
* timescale#5614 Enable run_job() for telemetry job
* timescale#5578 Fix on-insert decompression after schema changes
* timescale#5613 Quote username identifier appropriately
* timescale#5525 Fix tablespace for compressed hypertable and corresponding toast
* timescale#5642 Fix ALTER TABLE SET with normal tables
* timescale#5666 Reduce memory usage for distributed analyze
* timescale#5668 Fix subtransaction resource owner

**Thanks**
* @kovetskiy and @DZDomi for reporting peformance regression in Realtime Continuous Aggregates
* @ollz272 for reporting an issue with interpolate error messages
kgyrtkirk added a commit that referenced this pull request May 19, 2023
This release contains new features and bug fixes since the 2.10.3 release.
We deem it moderate priority for upgrading.

This release includes these noteworthy features:
* Support for DML operations on compressed chunks:
  * UPDATE/DELETE support
  * Support for unique constraints on compressed chunks
  * Support for `ON CONFLICT DO UPDATE`
  * Support for `ON CONFLICT DO NOTHING`
* Join support for hierarchical Continuous Aggregates

**Features**
* #5212 Allow pushdown of reference table joins
* #5221 Improve Realtime Continuous Aggregate performance
* #5252 Improve unique constraint support on compressed hypertables
* #5339 Support UPDATE/DELETE on compressed hypertables
* #5344 Enable JOINS for Hierarchical Continuous Aggregates
* #5361 Add parallel support for partialize_agg()
* #5417 Refactor and optimize distributed COPY
* #5454 Add support for ON CONFLICT DO UPDATE for compressed hypertables
* #5547 Skip Ordered Append when only 1 child node is present
* #5510 Propagate vacuum/analyze to compressed chunks
* #5584 Reduce decompression during constraint checking
* #5530 Optimize compressed chunk resorting
* #5639 Support sending telemetry event reports

**Bugfixes**
* #5396 Fix SEGMENTBY columns predicates to be pushed down
* #5427 Handle user-defined FDW options properly
* #5442 Decompression may have lost DEFAULT values
* #5459 Fix issue creating dimensional constraints
* #5570 Improve interpolate error message on datatype mismatch
* #5573 Fix unique constraint on compressed tables
* #5615 Add permission checks to run_job()
* #5614 Enable run_job() for telemetry job
* #5578 Fix on-insert decompression after schema changes
* #5613 Quote username identifier appropriately
* #5525 Fix tablespace for compressed hypertable and corresponding toast
* #5642 Fix ALTER TABLE SET with normal tables
* #5666 Reduce memory usage for distributed analyze
* #5668 Fix subtransaction resource owner

**Thanks**
* @kovetskiy and @DZDomi for reporting peformance regression in Realtime Continuous Aggregates
* @ollz272 for reporting an issue with interpolate error messages
kgyrtkirk added a commit to kgyrtkirk/timescaledb that referenced this pull request May 19, 2023
This release contains new features and bug fixes since the 2.10.3 release.
We deem it moderate priority for upgrading.

This release includes these noteworthy features:
* Support for DML operations on compressed chunks:
  * UPDATE/DELETE support
  * Support for unique constraints on compressed chunks
  * Support for `ON CONFLICT DO UPDATE`
  * Support for `ON CONFLICT DO NOTHING`
* Join support for hierarchical Continuous Aggregates

**Features**
* timescale#5212 Allow pushdown of reference table joins
* timescale#5221 Improve Realtime Continuous Aggregate performance
* timescale#5252 Improve unique constraint support on compressed hypertables
* timescale#5339 Support UPDATE/DELETE on compressed hypertables
* timescale#5344 Enable JOINS for Hierarchical Continuous Aggregates
* timescale#5361 Add parallel support for partialize_agg()
* timescale#5417 Refactor and optimize distributed COPY
* timescale#5454 Add support for ON CONFLICT DO UPDATE for compressed hypertables
* timescale#5547 Skip Ordered Append when only 1 child node is present
* timescale#5510 Propagate vacuum/analyze to compressed chunks
* timescale#5584 Reduce decompression during constraint checking
* timescale#5530 Optimize compressed chunk resorting
* timescale#5639 Support sending telemetry event reports

**Bugfixes**
* timescale#5396 Fix SEGMENTBY columns predicates to be pushed down
* timescale#5427 Handle user-defined FDW options properly
* timescale#5442 Decompression may have lost DEFAULT values
* timescale#5459 Fix issue creating dimensional constraints
* timescale#5570 Improve interpolate error message on datatype mismatch
* timescale#5573 Fix unique constraint on compressed tables
* timescale#5615 Add permission checks to run_job()
* timescale#5614 Enable run_job() for telemetry job
* timescale#5578 Fix on-insert decompression after schema changes
* timescale#5613 Quote username identifier appropriately
* timescale#5525 Fix tablespace for compressed hypertable and corresponding toast
* timescale#5642 Fix ALTER TABLE SET with normal tables
* timescale#5666 Reduce memory usage for distributed analyze
* timescale#5668 Fix subtransaction resource owner

**Thanks**
* @kovetskiy and @DZDomi for reporting peformance regression in Realtime Continuous Aggregates
* @ollz272 for reporting an issue with interpolate error messages
kgyrtkirk added a commit to kgyrtkirk/timescaledb that referenced this pull request May 19, 2023
This release contains new features and bug fixes since the 2.10.3 release.
We deem it moderate priority for upgrading.

This release includes these noteworthy features:
* Support for DML operations on compressed chunks:
  * UPDATE/DELETE support
  * Support for unique constraints on compressed chunks
  * Support for `ON CONFLICT DO UPDATE`
  * Support for `ON CONFLICT DO NOTHING`
* Join support for hierarchical Continuous Aggregates

**Features**
* timescale#5212 Allow pushdown of reference table joins
* timescale#5221 Improve Realtime Continuous Aggregate performance
* timescale#5252 Improve unique constraint support on compressed hypertables
* timescale#5339 Support UPDATE/DELETE on compressed hypertables
* timescale#5344 Enable JOINS for Hierarchical Continuous Aggregates
* timescale#5361 Add parallel support for partialize_agg()
* timescale#5417 Refactor and optimize distributed COPY
* timescale#5454 Add support for ON CONFLICT DO UPDATE for compressed hypertables
* timescale#5547 Skip Ordered Append when only 1 child node is present
* timescale#5510 Propagate vacuum/analyze to compressed chunks
* timescale#5584 Reduce decompression during constraint checking
* timescale#5530 Optimize compressed chunk resorting
* timescale#5639 Support sending telemetry event reports

**Bugfixes**
* timescale#5396 Fix SEGMENTBY columns predicates to be pushed down
* timescale#5427 Handle user-defined FDW options properly
* timescale#5442 Decompression may have lost DEFAULT values
* timescale#5459 Fix issue creating dimensional constraints
* timescale#5570 Improve interpolate error message on datatype mismatch
* timescale#5573 Fix unique constraint on compressed tables
* timescale#5615 Add permission checks to run_job()
* timescale#5614 Enable run_job() for telemetry job
* timescale#5578 Fix on-insert decompression after schema changes
* timescale#5613 Quote username identifier appropriately
* timescale#5525 Fix tablespace for compressed hypertable and corresponding toast
* timescale#5642 Fix ALTER TABLE SET with normal tables
* timescale#5666 Reduce memory usage for distributed analyze
* timescale#5668 Fix subtransaction resource owner

**Thanks**
* @kovetskiy and @DZDomi for reporting peformance regression in Realtime Continuous Aggregates
* @ollz272 for reporting an issue with interpolate error messages
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants