
Upgrading from PG 11 to 12 does not work using pg_upgrade #1844

Closed
pehlert opened this issue Apr 22, 2020 · 12 comments · Fixed by #1871

@pehlert

pehlert commented Apr 22, 2020

Hi all,

I tried to upgrade from PG 11 to PG 12.

What I did was:

  • upgrade extension to 1.7.0 on PG11
  • shutdown database
  • Initialize new PG12 cluster
  • Run pg_upgrade -O "-c timescaledb.restoring='on'" (data dirs are set using ENV vars)

What it returns:

[..] (some output omitted, just CREATE statements like the ones below)
pg_restore: creating FUNCTION "public.time_bucket_gapfill(interval, timestamp without time zone, timestamp without time zone, timestamp without time zone)"
pg_restore: creating FUNCTION "public.time_bucket_gapfill(interval, timestamp with time zone, timestamp with time zone, timestamp with time zone)"
pg_restore: creating FUNCTION "public.timescaledb_post_restore()"
pg_restore: creating FUNCTION "public.timescaledb_pre_restore()"
pg_restore: creating FUNCTION "public.update_current_machine_status()"
pg_restore: creating FUNCTION "public.update_current_order_stats()"
pg_restore: creating AGGREGATE "_timescaledb_internal.finalize_agg("text", "name", "name", "name"[], "bytea", "anyelement")"
pg_restore: creating AGGREGATE "public.first("anyelement", "any")"
pg_restore: creating AGGREGATE "public.histogram(double precision, double precision, double precision, integer)"
pg_restore: creating AGGREGATE "public.last("anyelement", "any")"
pg_restore: creating TABLE "_timescaledb_cache.cache_inval_bgw_job"
pg_restore: creating TABLE "_timescaledb_cache.cache_inval_extension"
pg_restore: while PROCESSING TOC:
pg_restore: from TOC entry 234; 1259 17367 TABLE cache_inval_extension mydb
pg_restore: error: could not execute query: ERROR:  OID lookup failed for table index "hypertable_pkey"
Command was:
-- For binary upgrade, must preserve pg_type oid
SELECT pg_catalog.binary_upgrade_set_next_pg_type_oid('17369'::pg_catalog.oid);


-- For binary upgrade, must preserve pg_type array oid
SELECT pg_catalog.binary_upgrade_set_next_array_pg_type_oid('17368'::pg_catalog.oid);


-- For binary upgrade, must preserve pg_class oids
SELECT pg_catalog.binary_upgrade_set_next_heap_pg_class_oid('17367'::pg_catalog.oid);

CREATE TABLE "_timescaledb_cache"."cache_inval_extension" (
);

-- For binary upgrade, set heap's relfrozenxid and relminmxid
UPDATE pg_catalog.pg_class
SET relfrozenxid = '572', relminmxid = '1'
WHERE oid = '"_timescaledb_cache"."cache_inval_extension"'::pg_catalog.regclass;

-- For binary upgrade, handle extension membership the hard way
ALTER EXTENSION "timescaledb" ADD TABLE "_timescaledb_cache"."cache_inval_extension";
@trnubo

trnubo commented Apr 23, 2020

I've hit basically the same error doing a pg_upgrade from PG10 to PG11 with TimescaleDB 1.6.1:

pg_restore: creating TABLE "_timescaledb_cache.cache_inval_bgw_job"
pg_restore: creating TABLE "_timescaledb_cache.cache_inval_extension"
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 224; 1259 53878 TABLE cache_inval_extension mydb
pg_restore: [archiver (db)] could not execute query: ERROR:  OID lookup failed for table index "hypertable_pkey"
    Command was: 
-- For binary upgrade, must preserve pg_type oid
SELECT pg_catalog.binary_upgrade_set_next_pg_type_oid('53880'::pg_catalog.oid);


-- For binary upgrade, must preserve pg_type array oid
SELECT pg_catalog.binary_upgrade_set_next_array_pg_type_oid('53879'::pg_catalog.oid);


-- For binary upgrade, must preserve pg_class oids
SELECT pg_catalog.binary_upgrade_set_next_heap_pg_class_oid('53878'::pg_catalog.oid);

CREATE TABLE "_timescaledb_cache"."cache_inval_extension" (
);

-- For binary upgrade, set heap's relfrozenxid and relminmxid
UPDATE pg_catalog.pg_class
SET relfrozenxid = '600033552', relminmxid = '1'
WHERE oid = '"_timescaledb_cache"."cache_inval_extension"'::pg_catalog.regclass;

-- For binary upgrade, handle extension membership the hard way
ALTER EXTENSION "timescaledb" ADD TABLE "_timescaledb_cache"."cache_inval_extension";

@trnubo

trnubo commented Apr 23, 2020

I managed to work past my issue by running SELECT timescaledb_pre_restore(); before the pg_upgrade and SELECT timescaledb_post_restore(); after. I'm not sure why this worked; I used -O "-c timescaledb.restoring='on'" during the failed upgrade, and there isn't any direct mention of using those restore functions during a PG upgrade.
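
Roughly, the sequence was something like this (the database name mydb and the PGBINOLD/PGBINNEW/PGDATAOLD/PGDATANEW variables are placeholders for illustration, not copied from my setup):

# On the old cluster, before shutting it down:
"$PGBINOLD/psql" -d mydb -c "SELECT timescaledb_pre_restore();"
"$PGBINOLD/pg_ctl" -D "$PGDATAOLD" stop

# Run the upgrade, still passing the restoring flag to the new server:
"$PGBINNEW/pg_upgrade" -O "-c timescaledb.restoring='on'"

# On the new cluster, after starting it again:
"$PGBINNEW/pg_ctl" -D "$PGDATANEW" -l upgrade.log start
"$PGBINNEW/psql" -d mydb -c "SELECT timescaledb_post_restore();"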

@pehlert
Author

pehlert commented Apr 23, 2020

I had the same suspicion that the timescaledb.restoring flag might be broken somehow. With the way our dev environments are set up, a manual dump and restore would be much more difficult to achieve. I hope we can get this fixed.

@mfreed
Member

mfreed commented Apr 23, 2020

Folks, thanks for this back-and-forth. We are investigating, but keep the insights and observations coming!

@mkindahl mkindahl added this to the 1.7.1 milestone Apr 27, 2020
@mkindahl
Contributor

Looking through the code and trying multiple ways to reproduce it, I still do not see why the upgrade as described above fails. The procedure I followed is the one described above, but I am repeating it below with details about what I did so that others can check.

Summary

To run an upgrade, it is necessary to

  • Set timescaledb.restoring when starting the server, which is done by passing -O "-c timescaledb.restoring=on" to pg_upgrade
  • Add timescaledb to shared_preload_libraries in the configuration file. At a minimum for the new server, but this should typically be done for both servers. (A sketch of both follows this list.)
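
A minimal sketch of those two pieces (values are examples only, matching the reproduction steps further down):

# In postgresql.conf of at least the new cluster:
shared_preload_libraries = 'timescaledb'

# Upgrade invocation, passing the setting to the new server via -O:
$PGBINNEW/pg_upgrade -O "-c timescaledb.restoring=on"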

The procedures timescaledb_pre_restore and timescaledb_post_restore set (and clear) the variable timescaledb.restoring and also stop (and start) the background worker threads. Apart from that, they do not do anything special. The only explanation for why adding them above removes the error would be that the timescaledb.restoring variable for some reason is not set during the execution of pg_upgrade above.

The binary pg_upgrade does, as part of one of its steps, copy the schema from the old server to the new server. If timescaledb.restoring is not set to on on the new server, then the error message above will be printed when the cache invalidation is done. This happens as part of executing the first statement on the new server. The actual error message is printed from inside ts_catalog_table_info_init when trying to create all the TimescaleDB tables, which is done as part of the cache invalidation procedure and only happens the first time the server is started with the extension.

I managed to get the error message (ERROR: OID lookup failed for table index "hypertable_pkey", to be exact) when running without timescaledb.restoring set to on, but only then. Adding the option removed the message. Investigating the code, I do not see any other path that could trigger this error message by directly or indirectly calling this function.

Reproduction steps

What I did was:

Set up the PG11 server:

  1. Create a database with version PG 11.6 (this is what I had locally and I have no reason to believe that 11.7 would be any different).
  2. Add timescaledb to shared_preload_libraries in the configuration file.
  3. Start the server.
  4. Populate it with the NYC data from Tutorial: Hello Timescale!
  5. Add a continuous aggregate for good measure (note that the background workers were involved above).
  6. Stop the server.

Create a PG12 server:

  1. Create a server with version PG 12.1 (this is what I had locally, no reason that 12.2 would behave differently).
  2. Add timescaledb to shared_preload_libraries in the configuration file.

Do the upgrade:

  1. Set environment variables (see below).
  2. Run $PGBINNEW/pg_upgrade -O "-c timescaledb.restoring=on"

Environment variables set are:

# Binary directories with local install
oldroot=/usr/local/pgsql/11.6
newroot=/usr/local/pgsql/12.1

# Configure directories and other information for old and new server
export PGBINOLD=$oldroot/bin
export PGBINNEW=$newroot/bin
export PGDATAOLD=$HOME/temp/postgres/data-node-11
export PGDATANEW=$HOME/temp/postgres/data-node-12
export PGUSER=postgres
export PGPORTOLD=`get_port $PGDATAOLD/postgresql.conf`
export PGPORTNEW=`get_port $PGDATANEW/postgresql.conf`

(Function get_port is a utility function that extracts the port from the configuration file.)
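
For completeness, a hypothetical stand-in for get_port (the real helper is not shown here):

# Extract the port setting from a postgresql.conf file, defaulting to 5432 if unset.
get_port () {
    local port
    port=$(sed -n 's/^[[:space:]]*port[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | tail -n 1)
    echo "${port:-5432}"
}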

@pehlert
Author

pehlert commented Apr 29, 2020

Thank you so much @mkindahl for the thorough investigation. I wish I could add more to this, but at least I verified in pg_upgrade's pg_upgrade_server.log that the servers are seemingly started correctly with the right flag set:

command: "/usr/lib/postgresql/12/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "/var/lib/postgresql/data/12" -o "-p 50432 -b -c synchronous_commit=off -c fsync=off -c full_page_writes=off -c timescaledb.restoring='on' -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/var/lib/postgresql'" start >> "pg_upgrade_server.log" 2>&1
waiting for server to start....2020-04-22 16:21:34.765 UTC [1549] LOG:  starting PostgreSQL 12.2 (Debian 12.2-2.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
2020-04-22 16:21:34.765 UTC [1549] LOG:  listening on Unix socket "/var/lib/postgresql/.s.PGSQL.50432"
2020-04-22 16:21:34.778 UTC [1550] LOG:  database system was shut down at 2020-04-22 16:19:56 UTC
2020-04-22 16:21:34.783 UTC [1549] LOG:  database system is ready to accept connections
2020-04-22 16:21:34.796 UTC [1555] LOG:  TimescaleDB background worker launcher connected to shared catalogs
 done
server started


command: "/usr/lib/postgresql/12/bin/pg_ctl" -w -D "/var/lib/postgresql/data/12" -o "-c timescaledb.restoring='on'" -m smart stop >> "pg_upgrade_server.log" 2>&1
2020-04-22 16:21:35.206 UTC [1549] LOG:  received smart shutdown request
2020-04-22 16:21:35.206 UTC [1555] LOG:  terminating TimescaleDB background worker launcher due to administrator command
waiting for server to shut down...2020-04-22 16:21:35.206 UTC [1555] FATAL:  terminating connection due to administrator command
.2020-04-22 16:21:35.208 UTC [1549] LOG:  background worker "TimescaleDB Background Worker Launcher" (PID 1555) exited with exit code 1
2020-04-22 16:21:35.209 UTC [1549] LOG:  background worker "logical replication launcher" (PID 1556) exited with exit code 1
2020-04-22 16:21:35.209 UTC [1551] LOG:  shutting down
2020-04-22 16:21:35.217 UTC [1549] LOG:  database system is shut down
 done
server stopped

Let me know if there is anything else I can check.

@pehlert
Author

pehlert commented Apr 29, 2020

One more thing I tried without luck was to set timescaledb.restoring = 'on' in the actual postgresql.conf file for the PG 12 server.

@mkindahl mkindahl self-assigned this May 4, 2020
@mkindahl
Contributor

mkindahl commented May 5, 2020

@pehlert Can you turn on statement logging (log_statement = 'all'), run a pg_restore, and check if there are any lines like this in the restore log:

2020-05-05 09:18:29.533 CEST [3454] STATEMENT:  ALTER DATABASE "defaultdb" SET "search_path" TO '$user', 'public', 'topology', 'tiger';
        ALTER DATABASE "defaultdb" SET "timescaledb.restoring" TO 'off';
        
        -- For binary upgrade, set datfrozenxid and datminmxid.
        UPDATE pg_catalog.pg_database
        SET datfrozenxid = '548', datminmxid = '1'
        WHERE datname = 'defaultdb';

After digging through the code, it seems like database-specific settings are dumped with pg_dump and will take precedence over both command-line settings and configuration file settings.

If timescaledb.restoring is for some reason set for the database (not sure why this happens in my case, but it could be from a client connection or something that is done internally), it will take effect rather than your command-line version, because it is applied after the command-line options are processed.
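
For reference, one way to check whether such a per-database setting exists on the old cluster before running pg_upgrade (mydb is a placeholder; psql's \drds shows the same information):

psql -d mydb -c \
  "SELECT d.datname, s.setconfig
     FROM pg_db_role_setting AS s
     LEFT JOIN pg_database AS d ON d.oid = s.setdatabase;"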

@pehlert
Author

pehlert commented May 5, 2020

Hi @mkindahl, again thank you so much for taking care of this and for the brilliant investigation.

Your assumption is spot on; this is the relevant part of the log:

2020-05-05 21:21:58.338 UTC [340] FATAL:  database "template0" is not currently accepting connections
2020-05-05 21:21:58.340 UTC [313] LOG:  background worker "TimescaleDB Background Worker Scheduler" (PID 340) exited with exit code 1
2020-05-05 21:21:58.341 UTC [339] LOG:  statement: SELECT pg_catalog.pg_is_in_recovery()
2020-05-05 21:21:58.342 UTC [341] LOG:  statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-05-05 21:21:58.342 UTC [341] LOG:  statement: SET statement_timeout = 0;

2020-05-05 21:21:58.342 UTC [341] LOG:  statement: SET lock_timeout = 0;

2020-05-05 21:21:58.342 UTC [341] LOG:  statement: SET idle_in_transaction_session_timeout = 0;

2020-05-05 21:21:58.342 UTC [341] LOG:  statement: SET client_encoding = 'UTF8';

2020-05-05 21:21:58.342 UTC [341] LOG:  statement: SET standard_conforming_strings = on;

2020-05-05 21:21:58.342 UTC [341] LOG:  statement: SELECT pg_catalog.set_config('search_path', '', false);

2020-05-05 21:21:58.344 UTC [341] LOG:  statement: SET check_function_bodies = false;

2020-05-05 21:21:58.344 UTC [341] LOG:  statement: SET xmloption = content;

2020-05-05 21:21:58.345 UTC [341] LOG:  statement: SET client_min_messages = warning;

2020-05-05 21:21:58.345 UTC [341] LOG:  statement: SET row_security = off;

2020-05-05 21:21:58.345 UTC [341] LOG:  statement:

2020-05-05 21:21:58.345 UTC [341] LOG:  statement: ALTER DATABASE "mydb" SET "timescaledb.restoring" TO 'off';

        -- For binary upgrade, set datfrozenxid and datminmxid.
        UPDATE pg_catalog.pg_database
        SET datfrozenxid = '562', datminmxid = '1'
        WHERE datname = 'mydb';



2020-05-05 21:21:58.348 UTC [341] LOG:  statement: SELECT pg_catalog.pg_is_in_recovery()

I can also confirm that an ALTER DATABASE mydb RESET "timescaledb.restoring" on the old database fixes the issue! A bit cumbersome to automate, but that workaround will do it for us.
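
Concretely, the reset we run against the old cluster before pg_upgrade looks like this (mydb is a placeholder; repeat for each affected database):

psql -d mydb -c 'ALTER DATABASE mydb RESET "timescaledb.restoring";'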

@mkindahl
Contributor

mkindahl commented May 6, 2020

@pehlert Thank you for testing this, it confirmed my suspicion and we have a patch that fixes this specific case of the issue.

However, given how option processing works, relying on timescaledb.restoring being set correctly in all cases is not a very solid approach, so we need some other means of detecting that we are restoring a database, one that does not require the user to set the variable.

mkindahl added a commit to mkindahl/timescaledb that referenced this issue May 6, 2020
If a binary upgrade is in progress (when using `pg_upgrade`) the
per-database setting of `timescaledb.restoring` can be included in the
dump, which causes `pg_upgrade` to fail.

This commit fixes this by checking the global variable
`IsBinaryUpgrade` and not refreshing cache if we are in the middle of
doing a binary upgrade.

Fixes timescale#1844
mkindahl added a commit to mkindahl/timescaledb that referenced this issue May 6, 2020
mkindahl added a commit to mkindahl/timescaledb that referenced this issue May 6, 2020
@pehlert
Author

pehlert commented May 6, 2020

Let me know if I can help with testing this. Unfortunately, as these are quite old datasets (I think we started with TSDB 1.3), I cannot pinpoint what could have set the parameter on the database in the first place.

mkindahl added a commit that referenced this issue May 7, 2020
@mkindahl
Contributor

mkindahl commented May 7, 2020

Let me know if I can help with testing this. Unfortunately, as these are quite old datasets (I think we started with TSDB 1.3), I cannot pinpoint what could have set the parameter on the database in the first place.

Precisely the reason to remove the option.

It is not feasible to expect anybody to track down rogue settings of timescaledb.restoring when restoring, so, IMHO, the option should be removed and we should make sure that pg_upgrade and pg_restore work without it.

I created issue #1878 to track this.

svenklemm pushed a commit that referenced this issue May 13, 2020
@mkindahl mkindahl added the pg-upgrade label (Issues related to upgrading the Postgres version) Oct 25, 2023