postgres server crash on dropping large hypertable #1486

fvannee · 2019-10-14T13:17:06Z

Relevant system information:

OS: CentOs 7.6
PostgreSQL version (output of postgres --version): 11.2
TimescaleDB version (output of \dx in psql): 1.4.2
Installation method: source

Describe the bug
We ran into a Postgres server crash while dropping a large (4TB, +- 600 chunks) TimescaleDb table. Unfortunately it looks like a race condition, so I can't reproduce it. However, I have the core dump. I'll also explain what happened.
Connection 1 executed in a transaction (a, b and d were large hypertables, c was a regular table):

DROP TABLE a CASCADE;
DROP TABLE b CASCADE;
DROP TABLE c CASCADE;
DROP TABLE d CASCADE;

At the same time, other processes were just running and querying the database, but none of them were querying the tables that were being dropped. One of these other, unrelated, processes crashed. This process was inserting data into a completely different hypertable.

Part of the backtrace can be found below. It's going into infinite recursion, eventually leading to a stackoverflow, so the full backtrace is a bit too long. You can see from the pasted part where it starts recursing though.

Please let me know in case you need any additional info from the core dump.

#132251 0x00000000007abbcd in ReceiveSharedInvalidMessages (invalFunction=invalFunction@entry=0x91d210 <LocalExecuteInvalidationMessage>, resetFunction=resetFunction@entry=0x91d180 <InvalidateSystemCaches>) at sinval.c:91
#132252 0x000000000091d50f in AcceptInvalidationMessages () at inval.c:683
#132253 0x00000000007af409 in LockRelationOid (relid=2615, lockmode=1) at lmgr.c:133
#132254 0x00000000004c57ed in relation_open (lockmode=1, relationId=2615) at heapam.c:1132
#132255 heap_open (relationId=2615, lockmode=lockmode@entry=1) at heapam.c:1298
#132256 0x000000000091a577 in SearchCatCacheMiss (cache=cache@entry=0x1e16b00, nkeys=nkeys@entry=1, hashValue=hashValue@entry=2512250167, hashIndex=hashIndex@entry=3, v1=v1@entry=140229501255778, v2=v2@entry=0, v3=0, v4=0) at catcache.c:1375
#132257 0x000000000091b2f6 in SearchCatCacheInternal (v4=0, v3=0, v2=0, v1=140229501255778, nkeys=1, cache=<optimized out>) at catcache.c:1317
#132258 SearchCatCache (cache=<optimized out>, v1=140229501255778, v1@entry=0, v2=v2@entry=0, v3=v3@entry=0, v4=v4@entry=0) at catcache.c:1171
#132259 0x0000000000932763 in SearchSysCache (key4=key4@entry=0, key3=key3@entry=0, key2=key2@entry=0, key1=key1@entry=0, cacheId=cacheId@entry=35) at syscache.c:1108
#132260 GetSysCacheOid (cacheId=cacheId@entry=35, key1=key1@entry=140229501255778, key2=key2@entry=0, key3=key3@entry=0, key4=key4@entry=0) at syscache.c:1230
#132261 0x000000000053b3df in get_namespace_oid (nspname=nspname@entry=0x7f89b99c0062 "_timescaledb_cache", missing_ok=missing_ok@entry=true) at namespace.c:3025
#132262 0x00007f89b99a2436 in proxy_table_exists () at /home/florisvannee/workspace/timescaledb/src/extension_utils.c:163
#132263 extension_current_state () at /home/florisvannee/workspace/timescaledb/src/extension_utils.c:163
#132264 extension_update_state () at /home/florisvannee/workspace/timescaledb/src/extension.c:140
#132265 ts_extension_invalidate (relid=relid@entry=93255422) at /home/florisvannee/workspace/timescaledb/src/extension.c:205
#132266 0x00007f89b998f20f in cache_invalidate_callback (arg=<optimized out>, relid=93255422) at /home/florisvannee/workspace/timescaledb/src/cache_invalidate.c:67
#132267 0x000000000091d33a in LocalExecuteInvalidationMessage (msg=0x7ffe1bfcbe40) at inval.c:595
#132268 0x00000000007abbcd in ReceiveSharedInvalidMessages (invalFunction=invalFunction@entry=0x91d210 <LocalExecuteInvalidationMessage>, resetFunction=resetFunction@entry=0x91d180 <InvalidateSystemCaches>) at sinval.c:91
#132269 0x000000000091d50f in AcceptInvalidationMessages () at inval.c:683
#132270 0x00000000007af409 in LockRelationOid (relid=2615, lockmode=1) at lmgr.c:133
#132271 0x00000000004c57ed in relation_open (lockmode=1, relationId=2615) at heapam.c:1132
#132272 heap_open (relationId=2615, lockmode=lockmode@entry=1) at heapam.c:1298
#132273 0x000000000091a577 in SearchCatCacheMiss (cache=cache@entry=0x1e16b00, nkeys=nkeys@entry=1, hashValue=hashValue@entry=2512250167, hashIndex=hashIndex@entry=3, v1=v1@entry=140229501255778, v2=v2@entry=0, v3=0, v4=0) at catcache.c:1375
#132274 0x000000000091b2f6 in SearchCatCacheInternal (v4=0, v3=0, v2=0, v1=140229501255778, nkeys=1, cache=<optimized out>) at catcache.c:1317
#132275 SearchCatCache (cache=<optimized out>, v1=140229501255778, v1@entry=0, v2=v2@entry=0, v3=v3@entry=0, v4=v4@entry=0) at catcache.c:1171
#132276 0x0000000000932763 in SearchSysCache (key4=key4@entry=0, key3=key3@entry=0, key2=key2@entry=0, key1=key1@entry=0, cacheId=cacheId@entry=35) at syscache.c:1108
#132277 GetSysCacheOid (cacheId=cacheId@entry=35, key1=key1@entry=140229501255778, key2=key2@entry=0, key3=key3@entry=0, key4=key4@entry=0) at syscache.c:1230
#132278 0x000000000053b3df in get_namespace_oid (nspname=nspname@entry=0x7f89b99c0062 "_timescaledb_cache", missing_ok=missing_ok@entry=true) at namespace.c:3025
#132279 0x00007f89b99a2436 in proxy_table_exists () at /home/florisvannee/workspace/timescaledb/src/extension_utils.c:163
#132280 extension_current_state () at /home/florisvannee/workspace/timescaledb/src/extension_utils.c:163
#132281 extension_update_state () at /home/florisvannee/workspace/timescaledb/src/extension.c:140
#132282 ts_extension_invalidate (relid=relid@entry=93255422) at /home/florisvannee/workspace/timescaledb/src/extension.c:205
#132283 0x00007f89b998f20f in cache_invalidate_callback (arg=<optimized out>, relid=93255422) at /home/florisvannee/workspace/timescaledb/src/cache_invalidate.c:67
#132284 0x000000000091d33a in LocalExecuteInvalidationMessage (msg=0x7ffe1bfcc210) at inval.c:595
#132285 0x00000000007abc5d in ReceiveSharedInvalidMessages (invalFunction=invalFunction@entry=0x91d210 <LocalExecuteInvalidationMessage>, resetFunction=resetFunction@entry=0x91d180 <InvalidateSystemCaches>) at sinval.c:121
#132286 0x000000000091d50f in AcceptInvalidationMessages () at inval.c:683
#132287 0x00000000007af409 in LockRelationOid (relid=1247, lockmode=1) at lmgr.c:133
#132288 0x00000000004c57ed in relation_open (lockmode=1, relationId=1247) at heapam.c:1132
#132289 heap_open (relationId=1247, lockmode=lockmode@entry=1) at heapam.c:1298
#132290 0x000000000091a577 in SearchCatCacheMiss (cache=cache@entry=0x1e26480, nkeys=nkeys@entry=1, hashValue=hashValue@entry=459879649, hashIndex=hashIndex@entry=33, v1=v1@entry=705, v2=v2@entry=0, v3=0, v4=0) at catcache.c:1375
#132291 0x000000000091b43d in SearchCatCacheInternal (nkeys=1, v2=0, v3=0, v4=0, v1=705, cache=0x1e26480) at catcache.c:1317
#132292 SearchCatCache1 (cache=0x1e26480, v1=705) at catcache.c:1185
#132293 0x00000000009326a0 in SearchSysCache1 (cacheId=cacheId@entry=74, key1=<optimized out>) at syscache.c:1119
#132294 0x0000000000921d3f in get_typcollation (typid=<optimized out>) at lsyscache.c:2827
#132295 0x000000000059e9af in variable_paramref_hook (pstate=<optimized out>, pref=0x1e91b70) at parse_param.c:171
#132296 0x000000000059680c in transformParamRef (pref=0x1e91b70, pstate=0x1e91ee0) at parse_expr.c:820
#132297 transformExprRecurse (pstate=pstate@entry=0x1e91ee0, expr=<optimized out>) at parse_expr.c:181
#132298 0x0000000000596e2b in transformFuncCall (fn=0x1e91c60, pstate=0x1e91ee0) at parse_expr.c:1454
#132299 transformExprRecurse (pstate=pstate@entry=0x1e91ee0, expr=expr@entry=0x1e91c60) at parse_expr.c:266
#132300 0x00000000005990e4 in transformExpr (pstate=pstate@entry=0x1e91ee0, expr=expr@entry=0x1e91c60, exprKind=exprKind@entry=EXPR_KIND_SELECT_TARGET) at parse_expr.c:156
#132301 0x00000000005a6d7e in transformTargetEntry (resjunk=false, colname=0x0, exprKind=<optimized out>, expr=0x0, node=<optimized out>, pstate=0x1e91ee0) at parse_target.c:103
#132302 transformTargetList (pstate=pstate@entry=0x1e91ee0, targetlist=<optimized out>, exprKind=exprKind@entry=EXPR_KIND_SELECT_TARGET) at parse_target.c:191
#132303 0x00000000005640fd in transformSelectStmt (stmt=0x1e91d50, pstate=0x1e91ee0) at analyze.c:1243
#132304 transformStmt (pstate=pstate@entry=0x1e91ee0, parseTree=parseTree@entry=0x1e91d50) at analyze.c:301
#132305 0x000000000056622a in transformOptionalSelectInto (parseTree=0x1e91d50, pstate=0x1e91ee0) at analyze.c:246
#132306 transformTopLevelStmt (parseTree=0x1e91e60, pstate=0x1e91ee0) at analyze.c:196
#132307 parse_analyze_varparams (parseTree=parseTree@entry=0x1e91e60, sourceText=sourceText@entry=0x1d84531 "SELECT streams._import_base_price_update($1, $2, $3);", paramTypes=paramTypes@entry=0x7ffe1bfcc858, numParams=numParams@entry=0x7ffe1bfcc844) at analyze.c:146
#132308 0x00000000007cd56f in exec_parse_message (numParams=<optimized out>, paramTypes=<optimized out>, stmt_name=0x1d84530 "", query_string=0x1d84531 "SELECT streams._import_base_price_update($1, $2, $3);") at postgres.c:1384
#132309 PostgresMain (argc=<optimized out>, argv=argv@entry=0x1ddff08, dbname=<optimized out>, username=<optimized out>) at postgres.c:4213
#132310 0x000000000074296b in BackendRun (port=0x1dd2020) at postmaster.c:4361
#132311 BackendStartup (port=0x1dd2020) at postmaster.c:4033
#132312 ServerLoop () at postmaster.c:1706
#132313 0x00000000007458c4 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1d7f1c0) at postmaster.c:1379
#132314 0x00000000004845b6 in main (argc=3, argv=0x1d7f1c0) at main.c:228

The text was updated successfully, but these errors were encountered:

gayyappan · 2019-10-17T14:16:39Z

@fvannee Are there any relationships between the tables that are being dropped (like foreign keys, continuous aggregates )?

fvannee · 2019-10-17T14:17:59Z

No, nothing. They're completely independent.
The 'CASCADE' part of the drop cascades to a couple of SQL functions.

This patch prevents recursing into the cache invalidation code. Since this code's result is dependent on the transaction snapshot, processing it multiple times recursively won't change the result. Fixes timescale#1486

cevian · 2019-10-24T02:14:41Z

Hi @fvannee, Thank you for the report. I wasn't able to reproduce the issue, but looking at the backtrace I think #1493 fixes it. Do you, by any chance, have any way of testing this fix?

fvannee · 2019-10-24T06:52:54Z

I'm afraid not. It seems very difficult to reproduce. I took a look at the PR though and I think it'll indeed fix the issue. I have one further question though: do you have an idea why it keeps on receiving the invalidation messages? It seems to be processing different invalidation messages on every iteration - so apparently it's flooded with invalidation messages. Judging by the depth of the backtrace, there must be at least 6000? That's even much more than the total number of chunks that were dropped.

cevian · 2019-10-24T14:09:12Z

So on commit, it should receive a message for 2 x #chunks (one for the table drop, one for the removal of the chunks from the hypertable parent table) + # of chunk indexex + any functions due to cascade + other stuff I am missing. With 600 chunks I can see this adding up. But, the weird thing is that if the message queue gets filled up (size of queue=4096), Postgres send a 0 - meaning invalidate all caches. So the system must be clearing the cache just fast enough to avoid this but filling up it's call stack instead. That's my read of the situation anyway.

fvannee · 2019-10-24T15:17:52Z

Ah indeed, makes sense. Thanks for looking into it!

This patch prevents recursing into the cache invalidation code. Since this code's result is dependent on the transaction snapshot, processing it multiple times recursively won't change the result. Fixes #1486

bboule assigned cevian Oct 14, 2019

bboule added the bug label Oct 14, 2019

bboule modified the milestone: 1.5.0 Oct 17, 2019

bboule added the investigate label Oct 17, 2019

bboule added this to the 1.5.1 milestone Oct 17, 2019

cevian mentioned this issue Oct 24, 2019

Prevent recursion in invalidation processing #1493

Merged

cevian closed this as completed in #1493 Oct 25, 2019

cevian modified the milestones: 1.5.1, 1.5.0 Oct 28, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

postgres server crash on dropping large hypertable #1486

postgres server crash on dropping large hypertable #1486

fvannee commented Oct 14, 2019

gayyappan commented Oct 17, 2019

fvannee commented Oct 17, 2019 •

edited

cevian commented Oct 24, 2019

fvannee commented Oct 24, 2019

cevian commented Oct 24, 2019

fvannee commented Oct 24, 2019

postgres server crash on dropping large hypertable #1486

postgres server crash on dropping large hypertable #1486

Comments

fvannee commented Oct 14, 2019

gayyappan commented Oct 17, 2019

fvannee commented Oct 17, 2019 • edited

cevian commented Oct 24, 2019

fvannee commented Oct 24, 2019

cevian commented Oct 24, 2019

fvannee commented Oct 24, 2019

fvannee commented Oct 17, 2019 •

edited