
Fix tuple concurrently deleted error with multiple continuous aggregates #2051

Merged: 1 commit merged into timescale:master from cagg_multi on Jul 4, 2020

Conversation

@gayyappan (Contributor) commented Jul 1, 2020

When we have multiple continuous aggregates defined on
the same hypertable, they could try to delete the hypertable
invalidation logs concurrently. Resolve this by serializing
invalidation log deletes by hypertable id.

Fixes #1940
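
In outline, each refresh now takes an exclusive tuple lock on the invalidation threshold row for the raw hypertable before touching the shared invalidation log. A rough sketch of that ordering, with hypothetical helper names (neither function below is the exact one in this diff):

	/* Sketch only; both helpers are hypothetical placeholders. */
	static void
	refresh_process_invalidations(int32 raw_hypertable_id)
	{
		/* Block until we hold an exclusive tuple lock on the invalidation
		 * threshold row for this raw hypertable. Concurrent REFRESHes of other
		 * continuous aggregates on the same hypertable queue up here instead of
		 * racing to delete the same invalidation log rows, which previously
		 * failed with "tuple concurrently deleted". */
		lock_invalidation_threshold_hypertable_row(raw_hypertable_id);

		/* Safe to read and delete the hypertable invalidation log entries now;
		 * the tuple lock is held until the end of this transaction. */
		process_hypertable_invalidation_log(raw_hypertable_id);
	}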

codecov bot commented Jul 1, 2020

Codecov Report

Merging #2051 into master will increase coverage by 0.12%.
The diff coverage is 86.66%.


@@            Coverage Diff             @@
##           master    #2051      +/-   ##
==========================================
+ Coverage   90.08%   90.20%   +0.12%     
==========================================
  Files         211      211              
  Lines       33679    33644      -35     
==========================================
+ Hits        30340    30349       +9     
+ Misses       3339     3295      -44     
Impacted Files Coverage Δ
tsl/src/continuous_aggs/materialize.c 92.90% <86.66%> (-0.18%) ⬇️
src/loader/bgw_message_queue.c 84.51% <0.00%> (-2.59%) ⬇️
src/import/planner.c 70.30% <0.00%> (+11.12%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -793,6 +806,11 @@ materialization_invalidation_log_get_range(int32 materialization_id, Oid type, i
return invalidation_range;
}

/* We already locked the materialization_invalidation_log table in
* ShareRowExclusiveLock mode for the Session in the first txn.
Contributor:
where do we take a session lock for the materialization_invalidation_log? The only ones I see we take, and the only ones mentioned in the comment on line 156 are: the materialization table, the partial view and the raw hypertable...

Contributor Author:
Fixed the comment. It should be a session lock on the materialization table.

Contributor:
What is a session lock? I guess it is confusing because it sounds like the lock is held for the duration of the session. Is it even possible to hold locks across multiple transactions unless it is an advisory lock? The documentation says: "Once acquired, a lock is normally held until the end of the transaction."

@gayyappan (Contributor Author) Jul 3, 2020:
The session lock is held for the duration of the materialization, which spans the transactions.
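
For illustration, a minimal sketch of a session-level lock in PostgreSQL terms: a lock owned by the session rather than the current transaction, so it survives the transaction boundaries inside a multi-transaction materialization. This shows the mechanism only and is not the exact code in materialize.c; the function name is illustrative.

	#include "postgres.h"
	#include "storage/lmgr.h" /* LockRelationIdForSession() */
	#include "utils/rel.h"    /* Relation, LockRelId */

	static void
	lock_materialization_table_for_session(Relation materialization_table)
	{
		LockRelId relid = materialization_table->rd_lockInfo.lockRelId;

		/* Held until explicitly released with UnlockRelationIdForSession(),
		 * even though individual transactions commit in between. */
		LockRelationIdForSession(&relid, ShareRowExclusiveLock);
	}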

@gayyappan changed the title from "(WIP) github issue 1940" to "Fix tuple concurrently deleted error with multiple continuous aggregates" on Jul 2, 2020
@cevian (Contributor) left a comment:
Looks ok as long as the catalog scan does not release the row lock

.enabled = true,
} };
/* acquire lock explicitly.
* the catalog scanning code , releases it at end of the call
Contributor:
Is the row lock also released by the catalog scanning code, or not?

Contributor Author:
Tuple locks are released at the end of the transaction.

@gayyappan gayyappan force-pushed the cagg_multi branch 2 times, most recently from 4dfd050 to 3b5a760 on July 2, 2020 20:56
@gayyappan gayyappan marked this pull request as ready for review July 2, 2020 20:58
@gayyappan gayyappan requested a review from a team as a code owner July 2, 2020 20:58
@gayyappan gayyappan requested review from mkindahl, k-rus, WireBaron, erimatnor and svenklemm and removed request for a team July 2, 2020 20:58
@erimatnor (Contributor) left a comment:
I didn't understand the following sentence in the commit message: "Resolve this be serializing
invalidation log deletes by hypertable id." You might want to take a look at that.

I also think we should always lock instead of locking conditionally based on whether there is more than one continuous aggregate. I imagine a second aggregate can be concurrently created and refreshed, which means we end up in the same situation even though it is not visible to the first job.

You should also take a look at how you are handling locking on the invalidation threshold table. I believe there is an issue there since you are keeping the row-exclusive lock on the table until the end of the TX (in addition to the tuple lock). This seems to negate the purpose of the tuple lock and is much more heavy-handed.

{
if (tinfo->lockresult != TM_Ok)
{
ereport(ERROR,
@erimatnor (Contributor) Jul 3, 2020:
Not sure we should throw a generic error for all lock results here. Let's say we get TM_Updated (the row was updated by another tx); this doesn't seem unlikely and is not necessarily "unexpected". Perhaps we should have a less jarring error for that case at least?

Contributor Author:
Anything other than TM_Ok is not valid here. Knowing the specific failure doesn't seem to be helpful since we have to abort the transaction anyway.

Contributor:
But the user might see this error and an update doesn't seem like it would be "unexpected". Maybe not common, but not unexpected either.
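
For illustration, a sketch of the kind of distinction being suggested, assuming the lockresult field from the diff above; the helper name, error codes, and wording are illustrative, not the PR's actual code:

	#include "postgres.h"
	#include "access/tableam.h" /* TM_Result, TM_Ok, TM_Updated, TM_Deleted */

	/* Hypothetical helper: give a friendlier error when the row was merely
	 * updated or deleted by a concurrent transaction, and reserve the
	 * "unexpected" wording for everything else. */
	static void
	report_tuple_lock_result(TM_Result lockresult)
	{
		switch (lockresult)
		{
			case TM_Ok:
				break; /* lock acquired, nothing to report */
			case TM_Updated:
			case TM_Deleted:
				ereport(ERROR,
						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
						 errmsg("invalidation threshold row was changed by a concurrent refresh"),
						 errhint("Retry the refresh.")));
				break;
			default:
				ereport(ERROR,
						(errcode(ERRCODE_INTERNAL_ERROR),
						 errmsg("unexpected tuple lock status %d", (int) lockresult)));
				break;
		}
	}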

Int32GetDatum(raw_hypertable_id));

/* lock table in RowExclusive mode and the row with AccessExclusive */
ScannerCtx scanctx = { .table = catalog_get_table_id(catalog,
Contributor:
mixed declaration and code. I suggest you move this above the ScanKeyInit. It should be safe to initialize the scankey after this declaration even if the scankey is referenced by the scanctx.

invalidation_threshold_table_relation =
table_open(catalog_get_table_id(catalog, CONTINUOUS_AGGS_INVALIDATION_THRESHOLD),
RowExclusiveLock);
ts_scanner_scan(&scanctx);
Contributor:
I suggest you check the return value here (num tuples found) and raise an error in case of not 1.

caggs = ts_continuous_aggs_find_by_raw_table_id(cagg_data.raw_hypertable_id);
if (list_length(caggs) > 1)
Contributor:
Isn't it safer to always lock here? Could there be a case where cagg_2 is created in another transaction but is not yet visible to this transaction? If cagg_2 is immediately refreshed, I imagine that you won't lock in that case, although you still have concurrent refreshes.

/* REFRESH cagg_1; REFRESH cagg_2 can try to read+delete the
* invalidation logs for the same hypertable.
* If both REFRESH jobs try to delete the
* same set of rows, we run into "tuple concurrently deleted error".
Contributor:
I guess we could in theory handle this without locking. A concurrently deleted tuple might not strictly be an error since it doesn't really matter which job deleted the tuple. If the job that experiences the error aimed to delete the tuple itself, then I guess it is fine for that job to proceed as if it actually succeeded.

This can, of course, be left as a future optimization. But I still figured it is worth pointing this out. Maybe leave a comment.

table_open(catalog_get_table_id(catalog, CONTINUOUS_AGGS_INVALIDATION_THRESHOLD),
RowExclusiveLock);
ts_scanner_scan(&scanctx);
table_close(invalidation_threshold_table_relation, NoLock); /*release lock at end
@erimatnor (Contributor) Jul 3, 2020:
I don't understand why we keep the table lock here (in RowExclusiveLock mode). That seems to negate the purpose of taking the tuple lock. I guess RowExclusiveLock doesn't conflict with itself, but does it block other stuff that we shouldn't block?

Or, am I missing something here?

Contributor Author:
Changed to AccessShare based on our discussion.

Contributor:
Lock still held here. I thought that the table lock was going to be released here since tuple lock is held and that is the actual serialization point.

Contributor Author:
I misunderstood that; modified so that we only have the tuple lock.

*/
invalidation_threshold_table_relation =
table_open(catalog_get_table_id(catalog, CONTINUOUS_AGGS_INVALIDATION_THRESHOLD),
RowExclusiveLock);
Contributor:
Why is this lock RowExclusiveLock? Are we modifying this table in addition to taking the tuple lock? I suggest adding a comment on why this is the appropriate lock mode since it is non-obvious.

@k-rus (Contributor) left a comment:
I am not sure it is safe not to lock when there is only one cagg. Also, will the test be deterministic?

* If both REFRESH jobs try to delete the
* same set of rows, we run into "tuple concurrently deleted error".
* (github issue 1940)
* Prevent this by serializing on the raw hypertable row
Contributor:
Suggested change:
- * Prevent this by serializing on the raw hypertable row
+ * Prevent this by serializing on the raw hypertable row.

return SCAN_DONE;
}

/* lock row corresponding to hypertable id in
Contributor:
Suggested change:
- /* lock row corresponding to hypertable id in
+ /* Lock row corresponding to hypertable id in


/* lock row corresponding to hypertable id in
* continuous_aggs_invalidation_threshold table in AccessExclusive mode
* block till lock is acquired.
Contributor:
I don't follow the comment. Specifically, block what?

Contributor:
Is it?:

Suggested change:
- * block till lock is acquired.
+ * block continuous_aggs_invalidation_threshold table till lock is acquired.

Contributor Author:
The process is waiting on the lock till it is acquired.

Contributor:
Thank you for clarifying. I suggest fixing the grammar so it is easier to read later. I guess a comma or an "and" is missing.

caggs = ts_continuous_aggs_find_by_raw_table_id(cagg_data.raw_hypertable_id);
if (list_length(caggs) > 1)
Contributor:
Do I understand correctly that the lock is only obtained when more than one cagg is defined, so there is a risk of concurrency? What about the corner case where a second cagg is created after materialization of the first one has passed this line? Can there be a race condition if materializing the first one is slow or interrupted?
Or is this condition there to deal with a deadlock?

Contributor Author:
@k-rus @erimatnor Yes, the corner case is possible. But I am not fully convinced that we should deal with the error for that corner case. The locks have a concurrency cost, so acquiring a lock only when it is necessary to do so seems like a good trade-off to me. (There is an additional optimization possible here: a first-time cagg does not have to deal with any invalidations since it has to fully materialize all the data. But we don't optimize for that case, so this code might actually not be exercised in that scenario.)

Contributor:
Do we know how big this cost is and do we throw a reasonable error when we hit the corner case? If not, I think we should cover the corner-case too and care about "correctness" prior to optimizing. We can always optimize further at a later time.

Contributor:
I would like to add: which PG processes will be affected by this cost? Will inserting into or querying the hypertable be affected by this concurrency overhead?

/* acquire lock explicitly.
* the catalog scanning code , releases it at end of the call
Contributor:
Since they are complete sentences:

Suggested change:
- /* acquire lock explicitly.
-  * the catalog scanning code , releases it at end of the call
+ /* Acquire lock explicitly.
+  * The catalog scanning code , releases it at end of the call.

ScannerCtx scanctx = { .table = catalog_get_table_id(catalog,
CONTINUOUS_AGGS_INVALIDATION_THRESHOLD),
Contributor:
Would it be more efficient, and still correct, to call catalog_get_table_id(catalog, CONTINUOUS_AGGS_INVALIDATION_THRESHOLD) once, store the result in a variable, and use it in both places, here and below?
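
A sketch of that refactor as a fragment of the function shown in the diff; the variable name invalidation_threshold_table_oid is illustrative, and the elided ScannerCtx fields are the ones already shown above:

	/* Fragment only: look the catalog table id up once and reuse it. */
	Oid invalidation_threshold_table_oid =
		catalog_get_table_id(catalog, CONTINUOUS_AGGS_INVALIDATION_THRESHOLD);

	ScannerCtx scanctx = {
		.table = invalidation_threshold_table_oid,
		/* ... remaining fields as in the diff above ... */
	};

	Relation invalidation_threshold_table_relation =
		table_open(invalidation_threshold_table_oid,
				   RowExclusiveLock /* lock mode as in the surrounding code */);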

Int32GetDatum(raw_hypertable_id));

/* lock table in RowExclusive mode and the row with AccessExclusive */
ScannerCtx scanctx = { .table = catalog_get_table_id(catalog,
Contributor:
This declaration is placed after calling ScanKeyInit. Would it be better to do all declarations before the call?

Contributor Author:
IIRC we said that it is okay to mix declarations and code. We are on C99.

Contributor:
More of a suggestion. But FWIW, we haven't used mixed-declaration-and-code in most of our other code. Not the appropriate place to discuss decisions. But this is suggested by two reviewers and seems like an easy fix.

Contributor Author:
I think the struct initialization is more readable here. scanctx depends on scankey. That's why there is mixed declaration in this block

Contributor:
get_lowest_invalidated_time_for_hypertable contains similar code with variable declaration first:

	ScanKeyData scankey[1];
	ScannerCtx scanctx;

	ScanKeyInit(&scankey[0],
				Anum_continuous_aggs_invalidation_threshold_pkey_hypertable_id,
				BTEqualStrategyNumber,
				F_INT4EQ,
				Int32GetDatum(ts_hypertable_relid_to_id(hypertable_relid)));
	scanctx = (ScannerCtx){
		.table = catalog_get_table_id(catalog, CONTINUOUS_AGGS_INVALIDATION_THRESHOLD),
		.index = catalog_get_index(catalog,
								   CONTINUOUS_AGGS_INVALIDATION_THRESHOLD,
								   CONTINUOUS_AGGS_INVALIDATION_THRESHOLD_PKEY),
		.nkeys = 1,
		.scankey = scankey,
		.tuple_found = &invalidation_tuple_found,
		.filter = NULL,
		.data = &min_val,
		.lockmode = AccessShareLock,
		.scandirection = ForwardScanDirection,
		.result_mctx = NULL,
	};

I don't have a strong opinion, but there might be a check that fails on declarations after code. However, I might be mistaken.

Contributor Author:
Thanks for the pointer.

table_open(catalog_get_table_id(catalog, CONTINUOUS_AGGS_INVALIDATION_THRESHOLD),
RowExclusiveLock);
ts_scanner_scan(&scanctx);
table_close(invalidation_threshold_table_relation, NoLock); /*release lock at end
Contributor:
I don't follow why a lock needs to be obtained if the entire table is locked until the end of the transaction.

Contributor:
I think I understand now that the tuple is locked for the session, which is longer than the transaction, right? Where is the lock on the tuple released?

Contributor Author:
Tuple locks are released at the end of the transaction.

@k-rus (Contributor) Jul 3, 2020:
So is it really necessary to obtain the tuple lock if the table is already locked for the same duration?

Contributor Author:
Yes, since a concurrent transaction will not conflict on the table lock. The tuple lock helps serialize the transactions. That raises the question of whether the tuple lock alone is sufficient: I believe you need a table lock before you acquire the tuple lock. (I am going to change the table lock from RowExclusive to AccessShare.)
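
For reference, the lock lifetimes discussed in this thread, sketched against the pattern in the diff (scanctx is assumed to be initialized as in the hunks above, with the scanner's tuple lock enabled):

	/* Weak table lock: only prevents the catalog table from being dropped or
	 * rewritten underneath us; it does not conflict with other refresh jobs. */
	Relation invalidation_threshold_table_relation =
		table_open(catalog_get_table_id(catalog, CONTINUOUS_AGGS_INVALIDATION_THRESHOLD),
				   AccessShareLock);

	/* The scan takes an exclusive tuple lock on the row for this raw
	 * hypertable id; that row is the actual serialization point between
	 * concurrent REFRESH jobs, and the lock is held until transaction end. */
	ts_scanner_scan(&scanctx);

	/* NoLock: keep the table lock until the end of the transaction as well. */
	table_close(invalidation_threshold_table_relation, NoLock);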

Comment on lines +168 to +174
step UnlockInvRow: ROLLBACK;
R1: LOG: new materialization range not found for public.ts_continuous_test (time column time): not enough new data past completion threshold of 30 as of 29
R1: LOG: materializing continuous aggregate public.continuous_view_1: processing invalidations, no new range
step Refresh1: <... completed>
R2: LOG: new materialization range not found for public.ts_continuous_test (time column time): not enough new data past completion threshold of 30 as of 29
R2: LOG: materializing continuous aggregate public.continuous_view_2: processing invalidations, no new range
step Refresh2: <... completed>
Contributor:
Is it deterministic that Refresh1 will be executed first after the lock is released and then Refresh2?

Contributor Author:
No, it cannot be guaranteed. But if Refresh1 starts before Refresh2, it will acquire the released lock first. The sessions are generally started in the order in which they appear in the spec file.

Contributor:
Does this test reproduce the bug before the fix?

Contributor Author:
All this test proves is that the refresh gets serialized on the hypertable row lock. This is true for all the concurrency tests. They do not necessarily "catch" all bugs in the code.

@k-rus (Contributor) commented Jul 3, 2020:

@gayyappan Is this PR related to any public issue?

svenklemm added a commit to svenklemm/timescaledb that referenced this pull request Jul 3, 2020
This maintenance release contains bugfixes since the 1.7.1 release. We deem it medium
priority for upgrading.

**Features**
* timescale#1877 Add support for fast pruning of inlined functions

**Bugfixes**
* timescale#1908 Fix drop_chunks with unique constraints when cascade_to_materializations is false
* timescale#1915 Check for database in extension_current_state
* timescale#1918 Unify chunk index creation
* timescale#1932 Change compression locking order
* timescale#1938 Fix gapfill locf treat_null_as_missing
* timescale#1982 Check for disabled telemetry earlier
* timescale#1984 Fix compression bit array left shift count
* timescale#1997 Add checks for read-only transactions
* timescale#2002 Reset restoring gucs rather than explicitly setting 'off'
* timescale#2028 Fix locking in drop_chunks
* timescale#2031 Enable compression for tables with compound foreign key
* timescale#2039 Fix segfault in create_trigger_handler
* timescale#2043 Fix segfault in cagg_update_view_definition
* timescale#2046 Use index tablespace during chunk creation
* timescale#2047 Better handling of chunk insert state destruction
* timescale#2049 Fix handling of PlaceHolderVar in DecompressChunk
* timescale#2051 Fix tuple concurrently deleted error with multiple continuous aggregates

**Thanks**
* @akamensky for reporting an issue with telemetry and an issue with drop_chunks
* @darko408 for reporting an issue with decompression
* @Dmitri191 for reporting an issue with failing background workers
* @eduardotsj for reporting an issue with indexes not inheriting tablespace settings
* @FourSeventy for reporting an issue with multiple continuous aggregates
* @fvannee for contributing optimizations for pruning inlined functions
* @jflambert for reporting an issue with failing telemetry jobs
* @nbouscal for reporting an issue with compression jobs locking referenced tables
* @Nicolai6120 for reporting an issue with locf and treat_null_as_missing
* @NomAnor for reporting an issue with expression index with table references
* @Olernov for contributing a fix for compressing tables with compound foreign keys
* @werjo for reporting an issue with drop_chunks and unique constraints
svenklemm added a commit to svenklemm/timescaledb that referenced this pull request Jul 3, 2020
@gayyappan gayyappan force-pushed the cagg_multi branch 2 times, most recently from bca00aa to 9d5d807 on July 3, 2020 15:45
@gayyappan gayyappan requested review from erimatnor and k-rus July 3, 2020 15:54
@erimatnor (Contributor) left a comment:
Approving, but I suggest you go through the nits and try to address them.

{
ereport(ERROR,
(errcode(ERRCODE_INTERNAL_ERROR),
errmsg("unexpected return value %d in lock_invalidation_threhold_hypertable_row",
Contributor:
Using the function name in error messages is redundant as elog already tracks this and the function and line number can be shown if the right log configuration is set. I think it is better to formulate a good error message to describe the error, e.g.: "duplicate catalog entry for hypertable "%s"". As a last resort, use the __func__ macro as that also handles function name changes.
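
For illustration, the error above rewritten along those lines (message wording is illustrative; raw_hypertable_id is the value used in the surrounding scan):

	/* Describe the condition instead of embedding the function name;
	 * ereport() already records file, line and function. */
	ereport(ERROR,
			(errcode(ERRCODE_INTERNAL_ERROR),
			 errmsg("duplicate invalidation threshold entry for hypertable %d",
					raw_hypertable_id)));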

table_open(catalog_get_table_id(catalog, CONTINUOUS_AGGS_INVALIDATION_THRESHOLD),
AccessShareLock);
retcnt = ts_scanner_scan(&scanctx);
if (retcnt > 1)
Contributor:
Is anything but 1 expected here? Isn't 0 (or less) also an error?

Contributor Author:
It could be 0 if there are no entries in the table yet for that hypertable.

table_open(catalog_get_table_id(catalog, CONTINUOUS_AGGS_INVALIDATION_THRESHOLD),
RowExclusiveLock);
ts_scanner_scan(&scanctx);
table_close(invalidation_threshold_table_relation, NoLock); /*release lock at end
Contributor:
Lock still held here. I thought that the table lock was going to be released here since tuple lock is held and that is the actual serialization point.

@erimatnor erimatnor added the bug label Jul 3, 2020
@erimatnor erimatnor added this to the 1.7.2 milestone Jul 3, 2020
@gayyappan gayyappan force-pushed the cagg_multi branch 2 times, most recently from 7786b8d to b6bb989 on July 3, 2020 21:53
When we have multiple continuous aggregates defined on
the same hypertable, they could try to delete the hypertable
invalidation logs concurrently. Resolve this by serializing
invalidation log deletes by raw hypertable id.

Fixes timescale#1940
@gayyappan gayyappan merged commit 43edbf8 into timescale:master Jul 4, 2020
@svenklemm svenklemm mentioned this pull request Jul 4, 2020
svenklemm added a commit to svenklemm/timescaledb that referenced this pull request Jul 4, 2020
svenklemm added a commit that referenced this pull request Jul 4, 2020
svenklemm added a commit that referenced this pull request Jul 4, 2020
Successfully merging this pull request may close these issues.

Multiple continuous aggregations for same hypertable crash with: ERROR: tuple concurrently deleted