Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Segmentation Fault when VACUUM FULL pg_class #3924

Closed
amalek215 opened this issue Dec 13, 2021 · 8 comments · Fixed by #4193
Closed

[Bug]: Segmentation Fault when VACUUM FULL pg_class #3924

amalek215 opened this issue Dec 13, 2021 · 8 comments · Fixed by #4193
Assignees
Labels
2.5.1 bgw The background worker subsystem, including the scheduler bug pg14 RPM segfault Segmentation fault Team: Core Database

Comments

@amalek215
Copy link

amalek215 commented Dec 13, 2021

What type of bug is this?

Crash

What subsystems and features are affected?

Background worker

What happened?

VACUUM FULL pg_class while another process is modifying table will cause a variety errors ranging from segmentation fault, to pg_class corruption.

Very similar to #3188 except this occurs after the extension is installed

TimescaleDB version affected

2.5.1

PostgreSQL version used

14.1

What operating system did you use?

Red Hat Enterprise Linux 7.9

What installation method did you use?

RPM

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

Logs for most recent replication of error

2021-12-13 11:17:29.258 EST [24508] CONTEXT:  PL/pgSQL function inline_code_block line 12 at RAISE
2021-12-13 11:18:53.208 EST [26184] ERROR:  cache lookup failed for relation 155205
2021-12-13 11:18:53.208 EST [26184] STATEMENT:  VACUUM FULL pg_class ;
2021-12-13 11:18:56.981 EST [24508] ERROR:  deadlock detected
2021-12-13 11:18:56.981 EST [24508] DETAIL:  Process 24508 waits for RowExclusiveLock on relation 1259 of database 16384; blocked by process 26184.
        Process 26184 waits for ShareLock on transaction 25128; blocked by process 24508.
        Process 24508: CREATE TABLE a24382 (time timestamp default now(), i BIGINT, i2 BIGINT, t TEXT, t2 TEXT);
        Process 26184: VACUUM FULL pg_class ;
2021-12-13 11:18:56.981 EST [24508] HINT:  See server log for query details.
2021-12-13 11:18:56.981 EST [24508] STATEMENT:  CREATE TABLE a24382 (time timestamp default now(), i BIGINT, i2 BIGINT, t TEXT, t2 TEXT);
2021-12-13 11:18:57.092 EST [26184] ERROR:  duplicate key value violates unique constraint "pg_class_relname_nsp_index"
2021-12-13 11:18:57.092 EST [26184] DETAIL:  Key (relname, relnamespace)=(pg_class, 11) already exists.
2021-12-13 11:18:57.092 EST [26184] STATEMENT:  VACUUM FULL pg_class ;
2021-12-13 11:19:01.754 EST [26184] ERROR:  cache lookup failed for relation 171287
2021-12-13 11:19:01.754 EST [26184] STATEMENT:  VACUUM FULL pg_class ;
2021-12-13 11:19:04.592 EST [24508] ERROR:  deadlock detected
2021-12-13 11:19:04.592 EST [24508] DETAIL:  Process 24508 waits for RowExclusiveLock on relation 1259 of database 16384; blocked by process 26184.
        Process 26184 waits for ShareLock on transaction 26880; blocked by process 24508.
        Process 24508: CREATE TABLE a26127 (time timestamp default now(), i BIGINT, i2 BIGINT, t TEXT, t2 TEXT);
        Process 26184: VACUUM FULL pg_class ;
2021-12-13 11:19:04.592 EST [24508] HINT:  See server log for query details.
2021-12-13 11:19:04.592 EST [24508] STATEMENT:  CREATE TABLE a26127 (time timestamp default now(), i BIGINT, i2 BIGINT, t TEXT, t2 TEXT);
2021-12-13 11:19:06.454 EST [24071] LOG:  server process (PID 26184) was terminated by signal 11: Segmentation fault
2021-12-13 11:19:06.454 EST [24071] DETAIL:  Failed process was running: VACUUM FULL pg_class ;
2021-12-13 11:19:06.454 EST [24071] LOG:  terminating any other active server processes
2021-12-13 11:19:06.455 EST [29849] FATAL:  the database system is in recovery mode

Below is what was being run in psql shell:
(not the session creating the tables)

$ /usr/pgsql-14/bin/psql -d testdb
psql (14.1)
Type "help" for help.

testdb=# VACUUM FULL pg_class ;
VACUUM
testdb=# SELECT  datname, pg_size_pretty( pg_database_size(datname) ) FROM pg_database;
  datname  | pg_size_pretty
-----------+----------------
 postgres  | 8769 kB
 testdb    | 315 MB
 template1 | 8617 kB
 template0 | 8617 kB
(4 rows)

testdb=# VACUUM FULL pg_class ;
ERROR:  cache lookup failed for relation 155205
testdb=# VACUUM FULL pg_class ;
ERROR:  duplicate key value violates unique constraint "pg_class_relname_nsp_index"
DETAIL:  Key (relname, relnamespace)=(pg_class, 11) already exists.
testdb=# VACUUM FULL pg_class ;
ERROR:  cache lookup failed for relation 171287
testdb=# VACUUM FULL pg_class ;
VACUUM
testdb=# VACUUM FULL pg_class ;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

How can we reproduce the bug?

  1. Create database and install TimescaleDB extension
  2. In a psql session connected to that database run:
testdb=# SELECT format('CREATE TABLE a%s (time timestamp default now(), i BIGINT, i2 BIGINT, t TEXT, t2 TEXT);',i) FROM generate_series(1, 200000) i
\gexec

While above is running connect via another psql session connect to same database, wait until pg_class is over 20MB then run VACUUM FULL pg_class several times.

@amalek215 amalek215 added the bug label Dec 13, 2021
@NunoFilipeSantos NunoFilipeSantos added 2.5.1 bgw The background worker subsystem, including the scheduler pg14 RPM labels Dec 13, 2021
@mfundul
Copy link
Contributor

mfundul commented Dec 14, 2021

Potential fix: #3875 (comment)

@nikkhils
Copy link
Contributor

@mfundul's potential fix seems to remove the crashes/errors.

@nikkhils nikkhils self-assigned this Dec 16, 2021
@nikkhils
Copy link
Contributor

The reproduction steps mentioned above are always able to reproduce the issue. However, we need to spend time to understand the actual sequence of events which is causing the bug to surface up.

It also appears that it's somehow being triggered via parallel worker processes in the cache_invalidate_callback callback which seems to be doing a lot of cached catalog accesses WHILE the cache is being invalidated. This is, in general, non-standard behavior in cache clearing callbacks.

One option is to NOT to use the cache via get_relname_relid and get_namespace_oid calls but to directly use systable_beginscan and systable_getnext calls to get the required information.

@nikkhils
Copy link
Contributor

#3875 has a version of fix. But that is not what finally needs to be done. Linking here just in case.

@NunoFilipeSantos NunoFilipeSantos added the segfault Segmentation fault label Dec 20, 2021
@afiskon afiskon removed their assignment Jan 10, 2022
@naav
Copy link

naav commented Mar 7, 2022

I just lost a 26 TB database to this.

@naav
Copy link

naav commented Mar 7, 2022

FTR: it would appear that reindex system on the affected database has recovered most, if not all, the corrupted data and therefore avoided my immediate dismissal for gross incompetence. For the time being at least.

@konskov konskov self-assigned this Mar 15, 2022
@NunoFilipeSantos
Copy link
Contributor

We're actively looking into this issue.

@erimatnor erimatnor self-assigned this Mar 21, 2022
@NunoFilipeSantos NunoFilipeSantos added this to the TimescaleDB 2.6.1 milestone Mar 21, 2022
erimatnor added a commit to erimatnor/timescaledb that referenced this issue Mar 24, 2022
Fix a crash that could corrupt indexes when running `VACUUM FULL
pg_class`.

The crash happens when caches are queried/updated within a cache
invalidation function, which could lead to corruption and recursive
cache invalidations.

To solve the issue, make sure the relcache invalidation callback is
simple and never invokes the relcache or syscache directly or
indirectly.

Some background: The extension is preloaded and thus have planner
hooks installed irrespective of whether the extension is actually
installed or not in the current database. However, the hooks need to
be disabled as long as the extension is not installed. To avoid always
having to dynamically check for the presence of the extension, the
state is cached in the session.

However, the cached state needs to be updated if the extension changes
(altered/dropped/created). Therefore, the relcache invalidation
callback mechanism is (ab)used in TimescaleDB to signal updates to the
extension state across all active backends.

The signaling is implemented by installing a dummy table as part of
the extension and any invalidation on the relid for that table signals
a change in the extension state. However, as of this change, the
actual state is no longer determined in the callback itself, since it
requires use of the relcache and causes the bad behavior. Therefore,
the only thing that remains in the callback after this change is to
reset the extension state.

The actual state is instead resolved on-demand, but can still be
cached when the extension is in the installed state and the dummy
table is present with a known relid. However, if the extension is not
installed, the extension state can no longer be cached as there is no
way to signal other backends that the state should be reset when they
don't know the dummy table's relid, and cannot resolve it from within
the callback itself.

Fixes timescale#3924
erimatnor added a commit to erimatnor/timescaledb that referenced this issue Mar 24, 2022
Fix a crash that could corrupt indexes when running `VACUUM FULL
pg_class`.

The crash happens when caches are queried/updated within a cache
invalidation function, which could lead to corruption and recursive
cache invalidations.

To solve the issue, make sure the relcache invalidation callback is
simple and never invokes the relcache or syscache directly or
indirectly.

Some background: The extension is preloaded and thus have planner
hooks installed irrespective of whether the extension is actually
installed or not in the current database. However, the hooks need to
be disabled as long as the extension is not installed. To avoid always
having to dynamically check for the presence of the extension, the
state is cached in the session.

However, the cached state needs to be updated if the extension changes
(altered/dropped/created). Therefore, the relcache invalidation
callback mechanism is (ab)used in TimescaleDB to signal updates to the
extension state across all active backends.

The signaling is implemented by installing a dummy table as part of
the extension and any invalidation on the relid for that table signals
a change in the extension state. However, as of this change, the
actual state is no longer determined in the callback itself, since it
requires use of the relcache and causes the bad behavior. Therefore,
the only thing that remains in the callback after this change is to
reset the extension state.

The actual state is instead resolved on-demand, but can still be
cached when the extension is in the installed state and the dummy
table is present with a known relid. However, if the extension is not
installed, the extension state can no longer be cached as there is no
way to signal other backends that the state should be reset when they
don't know the dummy table's relid, and cannot resolve it from within
the callback itself.

Fixes timescale#3924
erimatnor added a commit to erimatnor/timescaledb that referenced this issue Mar 24, 2022
Fix a crash that could corrupt indexes when running `VACUUM FULL
pg_class`.

The crash happens when caches are queried/updated within a cache
invalidation function, which could lead to corruption and recursive
cache invalidations.

To solve the issue, make sure the relcache invalidation callback is
simple and never invokes the relcache or syscache directly or
indirectly.

Some background: The extension is preloaded and thus have planner
hooks installed irrespective of whether the extension is actually
installed or not in the current database. However, the hooks need to
be disabled as long as the extension is not installed. To avoid always
having to dynamically check for the presence of the extension, the
state is cached in the session.

However, the cached state needs to be updated if the extension changes
(altered/dropped/created). Therefore, the relcache invalidation
callback mechanism is (ab)used in TimescaleDB to signal updates to the
extension state across all active backends.

The signaling is implemented by installing a dummy table as part of
the extension and any invalidation on the relid for that table signals
a change in the extension state. However, as of this change, the
actual state is no longer determined in the callback itself, since it
requires use of the relcache and causes the bad behavior. Therefore,
the only thing that remains in the callback after this change is to
reset the extension state.

The actual state is instead resolved on-demand, but can still be
cached when the extension is in the installed state and the dummy
table is present with a known relid. However, if the extension is not
installed, the extension state can no longer be cached as there is no
way to signal other backends that the state should be reset when they
don't know the dummy table's relid, and cannot resolve it from within
the callback itself.

Fixes timescale#3924
erimatnor added a commit to erimatnor/timescaledb that referenced this issue Mar 24, 2022
Fix a crash that could corrupt indexes when running `VACUUM FULL
pg_class`.

The crash happens when caches are queried/updated within a cache
invalidation function, which could lead to corruption and recursive
cache invalidations.

To solve the issue, make sure the relcache invalidation callback is
simple and never invokes the relcache or syscache directly or
indirectly.

Some background: The extension is preloaded and thus have planner
hooks installed irrespective of whether the extension is actually
installed or not in the current database. However, the hooks need to
be disabled as long as the extension is not installed. To avoid always
having to dynamically check for the presence of the extension, the
state is cached in the session.

However, the cached state needs to be updated if the extension changes
(altered/dropped/created). Therefore, the relcache invalidation
callback mechanism is (ab)used in TimescaleDB to signal updates to the
extension state across all active backends.

The signaling is implemented by installing a dummy table as part of
the extension and any invalidation on the relid for that table signals
a change in the extension state. However, as of this change, the
actual state is no longer determined in the callback itself, since it
requires use of the relcache and causes the bad behavior. Therefore,
the only thing that remains in the callback after this change is to
reset the extension state.

The actual state is instead resolved on-demand, but can still be
cached when the extension is in the installed state and the dummy
table is present with a known relid. However, if the extension is not
installed, the extension state can no longer be cached as there is no
way to signal other backends that the state should be reset when they
don't know the dummy table's relid, and cannot resolve it from within
the callback itself.

Fixes timescale#3924
erimatnor added a commit to erimatnor/timescaledb that referenced this issue Mar 24, 2022
Fix a crash that could corrupt indexes when running `VACUUM FULL
pg_class`.

The crash happens when caches are queried/updated within a cache
invalidation function, which could lead to corruption and recursive
cache invalidations.

To solve the issue, make sure the relcache invalidation callback is
simple and never invokes the relcache or syscache directly or
indirectly.

Some background: The extension is preloaded and thus have planner
hooks installed irrespective of whether the extension is actually
installed or not in the current database. However, the hooks need to
be disabled as long as the extension is not installed. To avoid always
having to dynamically check for the presence of the extension, the
state is cached in the session.

However, the cached state needs to be updated if the extension changes
(altered/dropped/created). Therefore, the relcache invalidation
callback mechanism is (ab)used in TimescaleDB to signal updates to the
extension state across all active backends.

The signaling is implemented by installing a dummy table as part of
the extension and any invalidation on the relid for that table signals
a change in the extension state. However, as of this change, the
actual state is no longer determined in the callback itself, since it
requires use of the relcache and causes the bad behavior. Therefore,
the only thing that remains in the callback after this change is to
reset the extension state.

The actual state is instead resolved on-demand, but can still be
cached when the extension is in the installed state and the dummy
table is present with a known relid. However, if the extension is not
installed, the extension state can no longer be cached as there is no
way to signal other backends that the state should be reset when they
don't know the dummy table's relid, and cannot resolve it from within
the callback itself.

Fixes timescale#3924
erimatnor added a commit to erimatnor/timescaledb that referenced this issue Mar 24, 2022
Fix a crash that could corrupt indexes when running `VACUUM FULL
pg_class`.

The crash happens when caches are queried/updated within a cache
invalidation function, which can lead to corruption and recursive
cache invalidations.

To solve the issue, make sure the relcache invalidation callback is
simple and never invokes the relcache or syscache directly or
indirectly.

Some background: The extension is preloaded and thus have planner
hooks installed irrespective of whether the extension is actually
installed or not in the current database. However, the hooks need to
be disabled as long as the extension is not installed. To avoid always
having to dynamically check for the presence of the extension, the
state is cached in the session.

However, the cached state needs to be updated if the extension changes
(altered/dropped/created). Therefore, the relcache invalidation
callback mechanism is (ab)used in TimescaleDB to signal updates to the
extension state across all active backends.

The signaling is implemented by installing a dummy table as part of
the extension and any invalidation on the relid for that table signals
a change in the extension state. However, as of this change, the
actual state is no longer determined in the callback itself, since it
requires use of the relcache and causes the bad behavior. Therefore,
the only thing that remains in the callback after this change is to
reset the extension state.

The actual state is instead resolved on-demand, but can still be
cached when the extension is in the installed state and the dummy
table is present with a known relid. However, if the extension is not
installed, the extension state can no longer be cached as there is no
way to signal other backends that the state should be reset when they
don't know the dummy table's relid, and cannot resolve it from within
the callback itself.

Fixes timescale#3924
erimatnor added a commit to erimatnor/timescaledb that referenced this issue Mar 24, 2022
Fix a crash that could corrupt indexes when running VACUUM FULL
pg_class.

The crash happens when caches are queried/updated within a cache
invalidation function, which can lead to corruption and recursive
cache invalidations.

To solve the issue, make sure the relcache invalidation callback is
simple and never invokes the relcache or syscache directly or
indirectly.

Some background: The extension is preloaded and thus have planner
hooks installed irrespective of whether the extension is actually
installed or not in the current database. However, the hooks need to
be disabled as long as the extension is not installed. To avoid always
having to dynamically check for the presence of the extension, the
state is cached in the session.

However, the cached state needs to be updated if the extension changes
(altered/dropped/created). Therefore, the relcache invalidation
callback mechanism is (ab)used in TimescaleDB to signal updates to the
extension state across all active backends.

The signaling is implemented by installing a dummy table as part of
the extension and any invalidation on the relid for that table signals
a change in the extension state. However, as of this change, the
actual state is no longer determined in the callback itself, since it
requires use of the relcache and causes the bad behavior. Therefore,
the only thing that remains in the callback after this change is to
reset the extension state.

The actual state is instead resolved on-demand, but can still be
cached when the extension is in the installed state and the dummy
table is present with a known relid. However, if the extension is not
installed, the extension state can no longer be cached as there is no
way to signal other backends that the state should be reset when they
don't know the dummy table's relid, and cannot resolve it from within
the callback itself.

Fixes timescale#3924
@konskov konskov removed their assignment Mar 24, 2022
erimatnor added a commit to erimatnor/timescaledb that referenced this issue Mar 25, 2022
Fix a crash that could corrupt indexes when running VACUUM FULL
pg_class.

The crash happens when caches are queried/updated within a cache
invalidation function, which can lead to corruption and recursive
cache invalidations.

To solve the issue, make sure the relcache invalidation callback is
simple and never invokes the relcache or syscache directly or
indirectly.

Some background: The extension is preloaded and thus have planner
hooks installed irrespective of whether the extension is actually
installed or not in the current database. However, the hooks need to
be disabled as long as the extension is not installed. To avoid always
having to dynamically check for the presence of the extension, the
state is cached in the session.

However, the cached state needs to be updated if the extension changes
(altered/dropped/created). Therefore, the relcache invalidation
callback mechanism is (ab)used in TimescaleDB to signal updates to the
extension state across all active backends.

The signaling is implemented by installing a dummy table as part of
the extension and any invalidation on the relid for that table signals
a change in the extension state. However, as of this change, the
actual state is no longer determined in the callback itself, since it
requires use of the relcache and causes the bad behavior. Therefore,
the only thing that remains in the callback after this change is to
reset the extension state.

The actual state is instead resolved on-demand, but can still be
cached when the extension is in the installed state and the dummy
table is present with a known relid. However, if the extension is not
installed, the extension state can no longer be cached as there is no
way to signal other backends that the state should be reset when they
don't know the dummy table's relid, and cannot resolve it from within
the callback itself.

Fixes timescale#3924
erimatnor added a commit to erimatnor/timescaledb that referenced this issue Mar 25, 2022
Fix a crash that could corrupt indexes when running VACUUM FULL
pg_class.

The crash happens when caches are queried/updated within a cache
invalidation function, which can lead to corruption and recursive
cache invalidations.

To solve the issue, make sure the relcache invalidation callback is
simple and never invokes the relcache or syscache directly or
indirectly.

Some background: The extension is preloaded and thus have planner
hooks installed irrespective of whether the extension is actually
installed or not in the current database. However, the hooks need to
be disabled as long as the extension is not installed. To avoid always
having to dynamically check for the presence of the extension, the
state is cached in the session.

However, the cached state needs to be updated if the extension changes
(altered/dropped/created). Therefore, the relcache invalidation
callback mechanism is (ab)used in TimescaleDB to signal updates to the
extension state across all active backends.

The signaling is implemented by installing a dummy table as part of
the extension and any invalidation on the relid for that table signals
a change in the extension state. However, as of this change, the
actual state is no longer determined in the callback itself, since it
requires use of the relcache and causes the bad behavior. Therefore,
the only thing that remains in the callback after this change is to
reset the extension state.

The actual state is instead resolved on-demand, but can still be
cached when the extension is in the installed state and the dummy
table is present with a known relid. However, if the extension is not
installed, the extension state can no longer be cached as there is no
way to signal other backends that the state should be reset when they
don't know the dummy table's relid, and cannot resolve it from within
the callback itself.

Fixes timescale#3924
erimatnor added a commit to erimatnor/timescaledb that referenced this issue Mar 25, 2022
Fix a crash that could corrupt indexes when running VACUUM FULL
pg_class.

The crash happens when caches are queried/updated within a cache
invalidation function, which can lead to corruption and recursive
cache invalidations.

To solve the issue, make sure the relcache invalidation callback is
simple and never invokes the relcache or syscache directly or
indirectly.

Some background: The extension is preloaded and thus have planner
hooks installed irrespective of whether the extension is actually
installed or not in the current database. However, the hooks need to
be disabled as long as the extension is not installed. To avoid always
having to dynamically check for the presence of the extension, the
state is cached in the session.

However, the cached state needs to be updated if the extension changes
(altered/dropped/created). Therefore, the relcache invalidation
callback mechanism is (ab)used in TimescaleDB to signal updates to the
extension state across all active backends.

The signaling is implemented by installing a dummy table as part of
the extension and any invalidation on the relid for that table signals
a change in the extension state. However, as of this change, the
actual state is no longer determined in the callback itself, since it
requires use of the relcache and causes the bad behavior. Therefore,
the only thing that remains in the callback after this change is to
reset the extension state.

The actual state is instead resolved on-demand, but can still be
cached when the extension is in the installed state and the dummy
table is present with a known relid. However, if the extension is not
installed, the extension state can no longer be cached as there is no
way to signal other backends that the state should be reset when they
don't know the dummy table's relid, and cannot resolve it from within
the callback itself.

Fixes timescale#3924
erimatnor added a commit to erimatnor/timescaledb that referenced this issue Mar 30, 2022
Fix a crash that could corrupt indexes when running VACUUM FULL
pg_class.

The crash happens when caches are queried/updated within a cache
invalidation function, which can lead to corruption and recursive
cache invalidations.

To solve the issue, make sure the relcache invalidation callback is
simple and never invokes the relcache or syscache directly or
indirectly.

Some background: The extension is preloaded and thus have planner
hooks installed irrespective of whether the extension is actually
installed or not in the current database. However, the hooks need to
be disabled as long as the extension is not installed. To avoid always
having to dynamically check for the presence of the extension, the
state is cached in the session.

However, the cached state needs to be updated if the extension changes
(altered/dropped/created). Therefore, the relcache invalidation
callback mechanism is (ab)used in TimescaleDB to signal updates to the
extension state across all active backends.

The signaling is implemented by installing a dummy table as part of
the extension and any invalidation on the relid for that table signals
a change in the extension state. However, as of this change, the
actual state is no longer determined in the callback itself, since it
requires use of the relcache and causes the bad behavior. Therefore,
the only thing that remains in the callback after this change is to
reset the extension state.

The actual state is instead resolved on-demand, but can still be
cached when the extension is in the installed state and the dummy
table is present with a known relid. However, if the extension is not
installed, the extension state can no longer be cached as there is no
way to signal other backends that the state should be reset when they
don't know the dummy table's relid, and cannot resolve it from within
the callback itself.

Fixes timescale#3924
@jerryxwu jerryxwu reopened this Mar 30, 2022
@jerryxwu
Copy link

Apologize, I accidentally close the issue.

erimatnor added a commit to erimatnor/timescaledb that referenced this issue Mar 31, 2022
Fix a crash that could corrupt indexes when running VACUUM FULL
pg_class.

The crash happens when caches are queried/updated within a cache
invalidation function, which can lead to corruption and recursive
cache invalidations.

To solve the issue, make sure the relcache invalidation callback is
simple and never invokes the relcache or syscache directly or
indirectly.

Some background: The extension is preloaded and thus have planner
hooks installed irrespective of whether the extension is actually
installed or not in the current database. However, the hooks need to
be disabled as long as the extension is not installed. To avoid always
having to dynamically check for the presence of the extension, the
state is cached in the session.

However, the cached state needs to be updated if the extension changes
(altered/dropped/created). Therefore, the relcache invalidation
callback mechanism is (ab)used in TimescaleDB to signal updates to the
extension state across all active backends.

The signaling is implemented by installing a dummy table as part of
the extension and any invalidation on the relid for that table signals
a change in the extension state. However, as of this change, the
actual state is no longer determined in the callback itself, since it
requires use of the relcache and causes the bad behavior. Therefore,
the only thing that remains in the callback after this change is to
reset the extension state.

The actual state is instead resolved on-demand, but can still be
cached when the extension is in the installed state and the dummy
table is present with a known relid. However, if the extension is not
installed, the extension state can no longer be cached as there is no
way to signal other backends that the state should be reset when they
don't know the dummy table's relid, and cannot resolve it from within
the callback itself.

Fixes timescale#3924
erimatnor added a commit that referenced this issue Apr 1, 2022
Fix a crash that could corrupt indexes when running VACUUM FULL
pg_class.

The crash happens when caches are queried/updated within a cache
invalidation function, which can lead to corruption and recursive
cache invalidations.

To solve the issue, make sure the relcache invalidation callback is
simple and never invokes the relcache or syscache directly or
indirectly.

Some background: The extension is preloaded and thus have planner
hooks installed irrespective of whether the extension is actually
installed or not in the current database. However, the hooks need to
be disabled as long as the extension is not installed. To avoid always
having to dynamically check for the presence of the extension, the
state is cached in the session.

However, the cached state needs to be updated if the extension changes
(altered/dropped/created). Therefore, the relcache invalidation
callback mechanism is (ab)used in TimescaleDB to signal updates to the
extension state across all active backends.

The signaling is implemented by installing a dummy table as part of
the extension and any invalidation on the relid for that table signals
a change in the extension state. However, as of this change, the
actual state is no longer determined in the callback itself, since it
requires use of the relcache and causes the bad behavior. Therefore,
the only thing that remains in the callback after this change is to
reset the extension state.

The actual state is instead resolved on-demand, but can still be
cached when the extension is in the installed state and the dummy
table is present with a known relid. However, if the extension is not
installed, the extension state can no longer be cached as there is no
way to signal other backends that the state should be reset when they
don't know the dummy table's relid, and cannot resolve it from within
the callback itself.

Fixes #3924
RafiaSabih pushed a commit to RafiaSabih/timescaledb that referenced this issue Apr 5, 2022
Fix a crash that could corrupt indexes when running VACUUM FULL
pg_class.

The crash happens when caches are queried/updated within a cache
invalidation function, which can lead to corruption and recursive
cache invalidations.

To solve the issue, make sure the relcache invalidation callback is
simple and never invokes the relcache or syscache directly or
indirectly.

Some background: The extension is preloaded and thus have planner
hooks installed irrespective of whether the extension is actually
installed or not in the current database. However, the hooks need to
be disabled as long as the extension is not installed. To avoid always
having to dynamically check for the presence of the extension, the
state is cached in the session.

However, the cached state needs to be updated if the extension changes
(altered/dropped/created). Therefore, the relcache invalidation
callback mechanism is (ab)used in TimescaleDB to signal updates to the
extension state across all active backends.

The signaling is implemented by installing a dummy table as part of
the extension and any invalidation on the relid for that table signals
a change in the extension state. However, as of this change, the
actual state is no longer determined in the callback itself, since it
requires use of the relcache and causes the bad behavior. Therefore,
the only thing that remains in the callback after this change is to
reset the extension state.

The actual state is instead resolved on-demand, but can still be
cached when the extension is in the installed state and the dummy
table is present with a known relid. However, if the extension is not
installed, the extension state can no longer be cached as there is no
way to signal other backends that the state should be reset when they
don't know the dummy table's relid, and cannot resolve it from within
the callback itself.

Fixes timescale#3924
RafiaSabih pushed a commit to RafiaSabih/timescaledb that referenced this issue Apr 8, 2022
Fix a crash that could corrupt indexes when running VACUUM FULL
pg_class.

The crash happens when caches are queried/updated within a cache
invalidation function, which can lead to corruption and recursive
cache invalidations.

To solve the issue, make sure the relcache invalidation callback is
simple and never invokes the relcache or syscache directly or
indirectly.

Some background: The extension is preloaded and thus have planner
hooks installed irrespective of whether the extension is actually
installed or not in the current database. However, the hooks need to
be disabled as long as the extension is not installed. To avoid always
having to dynamically check for the presence of the extension, the
state is cached in the session.

However, the cached state needs to be updated if the extension changes
(altered/dropped/created). Therefore, the relcache invalidation
callback mechanism is (ab)used in TimescaleDB to signal updates to the
extension state across all active backends.

The signaling is implemented by installing a dummy table as part of
the extension and any invalidation on the relid for that table signals
a change in the extension state. However, as of this change, the
actual state is no longer determined in the callback itself, since it
requires use of the relcache and causes the bad behavior. Therefore,
the only thing that remains in the callback after this change is to
reset the extension state.

The actual state is instead resolved on-demand, but can still be
cached when the extension is in the installed state and the dummy
table is present with a known relid. However, if the extension is not
installed, the extension state can no longer be cached as there is no
way to signal other backends that the state should be reset when they
don't know the dummy table's relid, and cannot resolve it from within
the callback itself.

Fixes timescale#3924
mkindahl pushed a commit to RafiaSabih/timescaledb that referenced this issue Apr 8, 2022
Fix a crash that could corrupt indexes when running VACUUM FULL
pg_class.

The crash happens when caches are queried/updated within a cache
invalidation function, which can lead to corruption and recursive
cache invalidations.

To solve the issue, make sure the relcache invalidation callback is
simple and never invokes the relcache or syscache directly or
indirectly.

Some background: The extension is preloaded and thus have planner
hooks installed irrespective of whether the extension is actually
installed or not in the current database. However, the hooks need to
be disabled as long as the extension is not installed. To avoid always
having to dynamically check for the presence of the extension, the
state is cached in the session.

However, the cached state needs to be updated if the extension changes
(altered/dropped/created). Therefore, the relcache invalidation
callback mechanism is (ab)used in TimescaleDB to signal updates to the
extension state across all active backends.

The signaling is implemented by installing a dummy table as part of
the extension and any invalidation on the relid for that table signals
a change in the extension state. However, as of this change, the
actual state is no longer determined in the callback itself, since it
requires use of the relcache and causes the bad behavior. Therefore,
the only thing that remains in the callback after this change is to
reset the extension state.

The actual state is instead resolved on-demand, but can still be
cached when the extension is in the installed state and the dummy
table is present with a known relid. However, if the extension is not
installed, the extension state can no longer be cached as there is no
way to signal other backends that the state should be reset when they
don't know the dummy table's relid, and cannot resolve it from within
the callback itself.

Fixes timescale#3924
svenklemm pushed a commit that referenced this issue Apr 11, 2022
Fix a crash that could corrupt indexes when running VACUUM FULL
pg_class.

The crash happens when caches are queried/updated within a cache
invalidation function, which can lead to corruption and recursive
cache invalidations.

To solve the issue, make sure the relcache invalidation callback is
simple and never invokes the relcache or syscache directly or
indirectly.

Some background: The extension is preloaded and thus have planner
hooks installed irrespective of whether the extension is actually
installed or not in the current database. However, the hooks need to
be disabled as long as the extension is not installed. To avoid always
having to dynamically check for the presence of the extension, the
state is cached in the session.

However, the cached state needs to be updated if the extension changes
(altered/dropped/created). Therefore, the relcache invalidation
callback mechanism is (ab)used in TimescaleDB to signal updates to the
extension state across all active backends.

The signaling is implemented by installing a dummy table as part of
the extension and any invalidation on the relid for that table signals
a change in the extension state. However, as of this change, the
actual state is no longer determined in the callback itself, since it
requires use of the relcache and causes the bad behavior. Therefore,
the only thing that remains in the callback after this change is to
reset the extension state.

The actual state is instead resolved on-demand, but can still be
cached when the extension is in the installed state and the dummy
table is present with a known relid. However, if the extension is not
installed, the extension state can no longer be cached as there is no
way to signal other backends that the state should be reset when they
don't know the dummy table's relid, and cannot resolve it from within
the callback itself.

Fixes #3924
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment