New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug]: Segmentation Fault when VACUUM FULL pg_class #3924
Comments
Potential fix: #3875 (comment) |
@mfundul's potential fix seems to remove the crashes/errors. |
The reproduction steps mentioned above are always able to reproduce the issue. However, we need to spend time to understand the actual sequence of events which is causing the bug to surface up. It also appears that it's somehow being triggered via parallel worker processes in the One option is to NOT to use the cache via |
#3875 has a version of fix. But that is not what finally needs to be done. Linking here just in case. |
I just lost a 26 TB database to this. |
FTR: it would appear that |
We're actively looking into this issue. |
Fix a crash that could corrupt indexes when running `VACUUM FULL pg_class`. The crash happens when caches are queried/updated within a cache invalidation function, which could lead to corruption and recursive cache invalidations. To solve the issue, make sure the relcache invalidation callback is simple and never invokes the relcache or syscache directly or indirectly. Some background: The extension is preloaded and thus have planner hooks installed irrespective of whether the extension is actually installed or not in the current database. However, the hooks need to be disabled as long as the extension is not installed. To avoid always having to dynamically check for the presence of the extension, the state is cached in the session. However, the cached state needs to be updated if the extension changes (altered/dropped/created). Therefore, the relcache invalidation callback mechanism is (ab)used in TimescaleDB to signal updates to the extension state across all active backends. The signaling is implemented by installing a dummy table as part of the extension and any invalidation on the relid for that table signals a change in the extension state. However, as of this change, the actual state is no longer determined in the callback itself, since it requires use of the relcache and causes the bad behavior. Therefore, the only thing that remains in the callback after this change is to reset the extension state. The actual state is instead resolved on-demand, but can still be cached when the extension is in the installed state and the dummy table is present with a known relid. However, if the extension is not installed, the extension state can no longer be cached as there is no way to signal other backends that the state should be reset when they don't know the dummy table's relid, and cannot resolve it from within the callback itself. Fixes timescale#3924
Fix a crash that could corrupt indexes when running `VACUUM FULL pg_class`. The crash happens when caches are queried/updated within a cache invalidation function, which could lead to corruption and recursive cache invalidations. To solve the issue, make sure the relcache invalidation callback is simple and never invokes the relcache or syscache directly or indirectly. Some background: The extension is preloaded and thus have planner hooks installed irrespective of whether the extension is actually installed or not in the current database. However, the hooks need to be disabled as long as the extension is not installed. To avoid always having to dynamically check for the presence of the extension, the state is cached in the session. However, the cached state needs to be updated if the extension changes (altered/dropped/created). Therefore, the relcache invalidation callback mechanism is (ab)used in TimescaleDB to signal updates to the extension state across all active backends. The signaling is implemented by installing a dummy table as part of the extension and any invalidation on the relid for that table signals a change in the extension state. However, as of this change, the actual state is no longer determined in the callback itself, since it requires use of the relcache and causes the bad behavior. Therefore, the only thing that remains in the callback after this change is to reset the extension state. The actual state is instead resolved on-demand, but can still be cached when the extension is in the installed state and the dummy table is present with a known relid. However, if the extension is not installed, the extension state can no longer be cached as there is no way to signal other backends that the state should be reset when they don't know the dummy table's relid, and cannot resolve it from within the callback itself. Fixes timescale#3924
Fix a crash that could corrupt indexes when running `VACUUM FULL pg_class`. The crash happens when caches are queried/updated within a cache invalidation function, which could lead to corruption and recursive cache invalidations. To solve the issue, make sure the relcache invalidation callback is simple and never invokes the relcache or syscache directly or indirectly. Some background: The extension is preloaded and thus have planner hooks installed irrespective of whether the extension is actually installed or not in the current database. However, the hooks need to be disabled as long as the extension is not installed. To avoid always having to dynamically check for the presence of the extension, the state is cached in the session. However, the cached state needs to be updated if the extension changes (altered/dropped/created). Therefore, the relcache invalidation callback mechanism is (ab)used in TimescaleDB to signal updates to the extension state across all active backends. The signaling is implemented by installing a dummy table as part of the extension and any invalidation on the relid for that table signals a change in the extension state. However, as of this change, the actual state is no longer determined in the callback itself, since it requires use of the relcache and causes the bad behavior. Therefore, the only thing that remains in the callback after this change is to reset the extension state. The actual state is instead resolved on-demand, but can still be cached when the extension is in the installed state and the dummy table is present with a known relid. However, if the extension is not installed, the extension state can no longer be cached as there is no way to signal other backends that the state should be reset when they don't know the dummy table's relid, and cannot resolve it from within the callback itself. Fixes timescale#3924
Fix a crash that could corrupt indexes when running `VACUUM FULL pg_class`. The crash happens when caches are queried/updated within a cache invalidation function, which could lead to corruption and recursive cache invalidations. To solve the issue, make sure the relcache invalidation callback is simple and never invokes the relcache or syscache directly or indirectly. Some background: The extension is preloaded and thus have planner hooks installed irrespective of whether the extension is actually installed or not in the current database. However, the hooks need to be disabled as long as the extension is not installed. To avoid always having to dynamically check for the presence of the extension, the state is cached in the session. However, the cached state needs to be updated if the extension changes (altered/dropped/created). Therefore, the relcache invalidation callback mechanism is (ab)used in TimescaleDB to signal updates to the extension state across all active backends. The signaling is implemented by installing a dummy table as part of the extension and any invalidation on the relid for that table signals a change in the extension state. However, as of this change, the actual state is no longer determined in the callback itself, since it requires use of the relcache and causes the bad behavior. Therefore, the only thing that remains in the callback after this change is to reset the extension state. The actual state is instead resolved on-demand, but can still be cached when the extension is in the installed state and the dummy table is present with a known relid. However, if the extension is not installed, the extension state can no longer be cached as there is no way to signal other backends that the state should be reset when they don't know the dummy table's relid, and cannot resolve it from within the callback itself. Fixes timescale#3924
Fix a crash that could corrupt indexes when running `VACUUM FULL pg_class`. The crash happens when caches are queried/updated within a cache invalidation function, which could lead to corruption and recursive cache invalidations. To solve the issue, make sure the relcache invalidation callback is simple and never invokes the relcache or syscache directly or indirectly. Some background: The extension is preloaded and thus have planner hooks installed irrespective of whether the extension is actually installed or not in the current database. However, the hooks need to be disabled as long as the extension is not installed. To avoid always having to dynamically check for the presence of the extension, the state is cached in the session. However, the cached state needs to be updated if the extension changes (altered/dropped/created). Therefore, the relcache invalidation callback mechanism is (ab)used in TimescaleDB to signal updates to the extension state across all active backends. The signaling is implemented by installing a dummy table as part of the extension and any invalidation on the relid for that table signals a change in the extension state. However, as of this change, the actual state is no longer determined in the callback itself, since it requires use of the relcache and causes the bad behavior. Therefore, the only thing that remains in the callback after this change is to reset the extension state. The actual state is instead resolved on-demand, but can still be cached when the extension is in the installed state and the dummy table is present with a known relid. However, if the extension is not installed, the extension state can no longer be cached as there is no way to signal other backends that the state should be reset when they don't know the dummy table's relid, and cannot resolve it from within the callback itself. Fixes timescale#3924
Fix a crash that could corrupt indexes when running `VACUUM FULL pg_class`. The crash happens when caches are queried/updated within a cache invalidation function, which can lead to corruption and recursive cache invalidations. To solve the issue, make sure the relcache invalidation callback is simple and never invokes the relcache or syscache directly or indirectly. Some background: The extension is preloaded and thus have planner hooks installed irrespective of whether the extension is actually installed or not in the current database. However, the hooks need to be disabled as long as the extension is not installed. To avoid always having to dynamically check for the presence of the extension, the state is cached in the session. However, the cached state needs to be updated if the extension changes (altered/dropped/created). Therefore, the relcache invalidation callback mechanism is (ab)used in TimescaleDB to signal updates to the extension state across all active backends. The signaling is implemented by installing a dummy table as part of the extension and any invalidation on the relid for that table signals a change in the extension state. However, as of this change, the actual state is no longer determined in the callback itself, since it requires use of the relcache and causes the bad behavior. Therefore, the only thing that remains in the callback after this change is to reset the extension state. The actual state is instead resolved on-demand, but can still be cached when the extension is in the installed state and the dummy table is present with a known relid. However, if the extension is not installed, the extension state can no longer be cached as there is no way to signal other backends that the state should be reset when they don't know the dummy table's relid, and cannot resolve it from within the callback itself. Fixes timescale#3924
Fix a crash that could corrupt indexes when running VACUUM FULL pg_class. The crash happens when caches are queried/updated within a cache invalidation function, which can lead to corruption and recursive cache invalidations. To solve the issue, make sure the relcache invalidation callback is simple and never invokes the relcache or syscache directly or indirectly. Some background: The extension is preloaded and thus have planner hooks installed irrespective of whether the extension is actually installed or not in the current database. However, the hooks need to be disabled as long as the extension is not installed. To avoid always having to dynamically check for the presence of the extension, the state is cached in the session. However, the cached state needs to be updated if the extension changes (altered/dropped/created). Therefore, the relcache invalidation callback mechanism is (ab)used in TimescaleDB to signal updates to the extension state across all active backends. The signaling is implemented by installing a dummy table as part of the extension and any invalidation on the relid for that table signals a change in the extension state. However, as of this change, the actual state is no longer determined in the callback itself, since it requires use of the relcache and causes the bad behavior. Therefore, the only thing that remains in the callback after this change is to reset the extension state. The actual state is instead resolved on-demand, but can still be cached when the extension is in the installed state and the dummy table is present with a known relid. However, if the extension is not installed, the extension state can no longer be cached as there is no way to signal other backends that the state should be reset when they don't know the dummy table's relid, and cannot resolve it from within the callback itself. Fixes timescale#3924
Fix a crash that could corrupt indexes when running VACUUM FULL pg_class. The crash happens when caches are queried/updated within a cache invalidation function, which can lead to corruption and recursive cache invalidations. To solve the issue, make sure the relcache invalidation callback is simple and never invokes the relcache or syscache directly or indirectly. Some background: The extension is preloaded and thus have planner hooks installed irrespective of whether the extension is actually installed or not in the current database. However, the hooks need to be disabled as long as the extension is not installed. To avoid always having to dynamically check for the presence of the extension, the state is cached in the session. However, the cached state needs to be updated if the extension changes (altered/dropped/created). Therefore, the relcache invalidation callback mechanism is (ab)used in TimescaleDB to signal updates to the extension state across all active backends. The signaling is implemented by installing a dummy table as part of the extension and any invalidation on the relid for that table signals a change in the extension state. However, as of this change, the actual state is no longer determined in the callback itself, since it requires use of the relcache and causes the bad behavior. Therefore, the only thing that remains in the callback after this change is to reset the extension state. The actual state is instead resolved on-demand, but can still be cached when the extension is in the installed state and the dummy table is present with a known relid. However, if the extension is not installed, the extension state can no longer be cached as there is no way to signal other backends that the state should be reset when they don't know the dummy table's relid, and cannot resolve it from within the callback itself. Fixes timescale#3924
Fix a crash that could corrupt indexes when running VACUUM FULL pg_class. The crash happens when caches are queried/updated within a cache invalidation function, which can lead to corruption and recursive cache invalidations. To solve the issue, make sure the relcache invalidation callback is simple and never invokes the relcache or syscache directly or indirectly. Some background: The extension is preloaded and thus have planner hooks installed irrespective of whether the extension is actually installed or not in the current database. However, the hooks need to be disabled as long as the extension is not installed. To avoid always having to dynamically check for the presence of the extension, the state is cached in the session. However, the cached state needs to be updated if the extension changes (altered/dropped/created). Therefore, the relcache invalidation callback mechanism is (ab)used in TimescaleDB to signal updates to the extension state across all active backends. The signaling is implemented by installing a dummy table as part of the extension and any invalidation on the relid for that table signals a change in the extension state. However, as of this change, the actual state is no longer determined in the callback itself, since it requires use of the relcache and causes the bad behavior. Therefore, the only thing that remains in the callback after this change is to reset the extension state. The actual state is instead resolved on-demand, but can still be cached when the extension is in the installed state and the dummy table is present with a known relid. However, if the extension is not installed, the extension state can no longer be cached as there is no way to signal other backends that the state should be reset when they don't know the dummy table's relid, and cannot resolve it from within the callback itself. Fixes timescale#3924
Fix a crash that could corrupt indexes when running VACUUM FULL pg_class. The crash happens when caches are queried/updated within a cache invalidation function, which can lead to corruption and recursive cache invalidations. To solve the issue, make sure the relcache invalidation callback is simple and never invokes the relcache or syscache directly or indirectly. Some background: The extension is preloaded and thus have planner hooks installed irrespective of whether the extension is actually installed or not in the current database. However, the hooks need to be disabled as long as the extension is not installed. To avoid always having to dynamically check for the presence of the extension, the state is cached in the session. However, the cached state needs to be updated if the extension changes (altered/dropped/created). Therefore, the relcache invalidation callback mechanism is (ab)used in TimescaleDB to signal updates to the extension state across all active backends. The signaling is implemented by installing a dummy table as part of the extension and any invalidation on the relid for that table signals a change in the extension state. However, as of this change, the actual state is no longer determined in the callback itself, since it requires use of the relcache and causes the bad behavior. Therefore, the only thing that remains in the callback after this change is to reset the extension state. The actual state is instead resolved on-demand, but can still be cached when the extension is in the installed state and the dummy table is present with a known relid. However, if the extension is not installed, the extension state can no longer be cached as there is no way to signal other backends that the state should be reset when they don't know the dummy table's relid, and cannot resolve it from within the callback itself. Fixes timescale#3924
Fix a crash that could corrupt indexes when running VACUUM FULL pg_class. The crash happens when caches are queried/updated within a cache invalidation function, which can lead to corruption and recursive cache invalidations. To solve the issue, make sure the relcache invalidation callback is simple and never invokes the relcache or syscache directly or indirectly. Some background: The extension is preloaded and thus have planner hooks installed irrespective of whether the extension is actually installed or not in the current database. However, the hooks need to be disabled as long as the extension is not installed. To avoid always having to dynamically check for the presence of the extension, the state is cached in the session. However, the cached state needs to be updated if the extension changes (altered/dropped/created). Therefore, the relcache invalidation callback mechanism is (ab)used in TimescaleDB to signal updates to the extension state across all active backends. The signaling is implemented by installing a dummy table as part of the extension and any invalidation on the relid for that table signals a change in the extension state. However, as of this change, the actual state is no longer determined in the callback itself, since it requires use of the relcache and causes the bad behavior. Therefore, the only thing that remains in the callback after this change is to reset the extension state. The actual state is instead resolved on-demand, but can still be cached when the extension is in the installed state and the dummy table is present with a known relid. However, if the extension is not installed, the extension state can no longer be cached as there is no way to signal other backends that the state should be reset when they don't know the dummy table's relid, and cannot resolve it from within the callback itself. Fixes timescale#3924
Apologize, I accidentally close the issue. |
Fix a crash that could corrupt indexes when running VACUUM FULL pg_class. The crash happens when caches are queried/updated within a cache invalidation function, which can lead to corruption and recursive cache invalidations. To solve the issue, make sure the relcache invalidation callback is simple and never invokes the relcache or syscache directly or indirectly. Some background: The extension is preloaded and thus have planner hooks installed irrespective of whether the extension is actually installed or not in the current database. However, the hooks need to be disabled as long as the extension is not installed. To avoid always having to dynamically check for the presence of the extension, the state is cached in the session. However, the cached state needs to be updated if the extension changes (altered/dropped/created). Therefore, the relcache invalidation callback mechanism is (ab)used in TimescaleDB to signal updates to the extension state across all active backends. The signaling is implemented by installing a dummy table as part of the extension and any invalidation on the relid for that table signals a change in the extension state. However, as of this change, the actual state is no longer determined in the callback itself, since it requires use of the relcache and causes the bad behavior. Therefore, the only thing that remains in the callback after this change is to reset the extension state. The actual state is instead resolved on-demand, but can still be cached when the extension is in the installed state and the dummy table is present with a known relid. However, if the extension is not installed, the extension state can no longer be cached as there is no way to signal other backends that the state should be reset when they don't know the dummy table's relid, and cannot resolve it from within the callback itself. Fixes timescale#3924
Fix a crash that could corrupt indexes when running VACUUM FULL pg_class. The crash happens when caches are queried/updated within a cache invalidation function, which can lead to corruption and recursive cache invalidations. To solve the issue, make sure the relcache invalidation callback is simple and never invokes the relcache or syscache directly or indirectly. Some background: The extension is preloaded and thus have planner hooks installed irrespective of whether the extension is actually installed or not in the current database. However, the hooks need to be disabled as long as the extension is not installed. To avoid always having to dynamically check for the presence of the extension, the state is cached in the session. However, the cached state needs to be updated if the extension changes (altered/dropped/created). Therefore, the relcache invalidation callback mechanism is (ab)used in TimescaleDB to signal updates to the extension state across all active backends. The signaling is implemented by installing a dummy table as part of the extension and any invalidation on the relid for that table signals a change in the extension state. However, as of this change, the actual state is no longer determined in the callback itself, since it requires use of the relcache and causes the bad behavior. Therefore, the only thing that remains in the callback after this change is to reset the extension state. The actual state is instead resolved on-demand, but can still be cached when the extension is in the installed state and the dummy table is present with a known relid. However, if the extension is not installed, the extension state can no longer be cached as there is no way to signal other backends that the state should be reset when they don't know the dummy table's relid, and cannot resolve it from within the callback itself. Fixes #3924
Fix a crash that could corrupt indexes when running VACUUM FULL pg_class. The crash happens when caches are queried/updated within a cache invalidation function, which can lead to corruption and recursive cache invalidations. To solve the issue, make sure the relcache invalidation callback is simple and never invokes the relcache or syscache directly or indirectly. Some background: The extension is preloaded and thus have planner hooks installed irrespective of whether the extension is actually installed or not in the current database. However, the hooks need to be disabled as long as the extension is not installed. To avoid always having to dynamically check for the presence of the extension, the state is cached in the session. However, the cached state needs to be updated if the extension changes (altered/dropped/created). Therefore, the relcache invalidation callback mechanism is (ab)used in TimescaleDB to signal updates to the extension state across all active backends. The signaling is implemented by installing a dummy table as part of the extension and any invalidation on the relid for that table signals a change in the extension state. However, as of this change, the actual state is no longer determined in the callback itself, since it requires use of the relcache and causes the bad behavior. Therefore, the only thing that remains in the callback after this change is to reset the extension state. The actual state is instead resolved on-demand, but can still be cached when the extension is in the installed state and the dummy table is present with a known relid. However, if the extension is not installed, the extension state can no longer be cached as there is no way to signal other backends that the state should be reset when they don't know the dummy table's relid, and cannot resolve it from within the callback itself. Fixes timescale#3924
Fix a crash that could corrupt indexes when running VACUUM FULL pg_class. The crash happens when caches are queried/updated within a cache invalidation function, which can lead to corruption and recursive cache invalidations. To solve the issue, make sure the relcache invalidation callback is simple and never invokes the relcache or syscache directly or indirectly. Some background: The extension is preloaded and thus have planner hooks installed irrespective of whether the extension is actually installed or not in the current database. However, the hooks need to be disabled as long as the extension is not installed. To avoid always having to dynamically check for the presence of the extension, the state is cached in the session. However, the cached state needs to be updated if the extension changes (altered/dropped/created). Therefore, the relcache invalidation callback mechanism is (ab)used in TimescaleDB to signal updates to the extension state across all active backends. The signaling is implemented by installing a dummy table as part of the extension and any invalidation on the relid for that table signals a change in the extension state. However, as of this change, the actual state is no longer determined in the callback itself, since it requires use of the relcache and causes the bad behavior. Therefore, the only thing that remains in the callback after this change is to reset the extension state. The actual state is instead resolved on-demand, but can still be cached when the extension is in the installed state and the dummy table is present with a known relid. However, if the extension is not installed, the extension state can no longer be cached as there is no way to signal other backends that the state should be reset when they don't know the dummy table's relid, and cannot resolve it from within the callback itself. Fixes timescale#3924
Fix a crash that could corrupt indexes when running VACUUM FULL pg_class. The crash happens when caches are queried/updated within a cache invalidation function, which can lead to corruption and recursive cache invalidations. To solve the issue, make sure the relcache invalidation callback is simple and never invokes the relcache or syscache directly or indirectly. Some background: The extension is preloaded and thus have planner hooks installed irrespective of whether the extension is actually installed or not in the current database. However, the hooks need to be disabled as long as the extension is not installed. To avoid always having to dynamically check for the presence of the extension, the state is cached in the session. However, the cached state needs to be updated if the extension changes (altered/dropped/created). Therefore, the relcache invalidation callback mechanism is (ab)used in TimescaleDB to signal updates to the extension state across all active backends. The signaling is implemented by installing a dummy table as part of the extension and any invalidation on the relid for that table signals a change in the extension state. However, as of this change, the actual state is no longer determined in the callback itself, since it requires use of the relcache and causes the bad behavior. Therefore, the only thing that remains in the callback after this change is to reset the extension state. The actual state is instead resolved on-demand, but can still be cached when the extension is in the installed state and the dummy table is present with a known relid. However, if the extension is not installed, the extension state can no longer be cached as there is no way to signal other backends that the state should be reset when they don't know the dummy table's relid, and cannot resolve it from within the callback itself. Fixes timescale#3924
Fix a crash that could corrupt indexes when running VACUUM FULL pg_class. The crash happens when caches are queried/updated within a cache invalidation function, which can lead to corruption and recursive cache invalidations. To solve the issue, make sure the relcache invalidation callback is simple and never invokes the relcache or syscache directly or indirectly. Some background: The extension is preloaded and thus have planner hooks installed irrespective of whether the extension is actually installed or not in the current database. However, the hooks need to be disabled as long as the extension is not installed. To avoid always having to dynamically check for the presence of the extension, the state is cached in the session. However, the cached state needs to be updated if the extension changes (altered/dropped/created). Therefore, the relcache invalidation callback mechanism is (ab)used in TimescaleDB to signal updates to the extension state across all active backends. The signaling is implemented by installing a dummy table as part of the extension and any invalidation on the relid for that table signals a change in the extension state. However, as of this change, the actual state is no longer determined in the callback itself, since it requires use of the relcache and causes the bad behavior. Therefore, the only thing that remains in the callback after this change is to reset the extension state. The actual state is instead resolved on-demand, but can still be cached when the extension is in the installed state and the dummy table is present with a known relid. However, if the extension is not installed, the extension state can no longer be cached as there is no way to signal other backends that the state should be reset when they don't know the dummy table's relid, and cannot resolve it from within the callback itself. Fixes #3924
What type of bug is this?
Crash
What subsystems and features are affected?
Background worker
What happened?
VACUUM FULL pg_class while another process is modifying table will cause a variety errors ranging from segmentation fault, to pg_class corruption.
Very similar to #3188 except this occurs after the extension is installed
TimescaleDB version affected
2.5.1
PostgreSQL version used
14.1
What operating system did you use?
Red Hat Enterprise Linux 7.9
What installation method did you use?
RPM
What platform did you run on?
On prem/Self-hosted
Relevant log output and stack trace
How can we reproduce the bug?
While above is running connect via another psql session connect to same database, wait until pg_class is over 20MB then run
VACUUM FULL pg_class
several times.The text was updated successfully, but these errors were encountered: