Add script to test recovery using restore points #3024

pmwkaa · 2021-03-12T12:15:00Z

Test PostgreSQL restore-point recovery for a single-node and a multi-node cluster, and make sure that the single-node and multi-node results match after recovery.
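
The comparison step in that description could be sketched roughly as follows. This is only an illustration: the function name `compare_results` and the idea of dumping each cluster's query output to a file are assumptions, not taken from the script in this PR.

```shell
# Hypothetical helper: compare query output dumped from the restored
# single-node instance with output dumped from the restored multi-node
# cluster. Returns 0 when both files are identical, 1 otherwise.
compare_results() {
    single_node_out=$1
    multi_node_out=$2
    if diff -u "$single_node_out" "$multi_node_out"; then
        echo "results match"
    else
        echo "results differ" >&2
        return 1
    fi
}
```

The two input files could be produced by running the same query with `psql -At -c "<query>"` against each restored instance.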

codecov · 2021-03-12T12:28:33Z

Codecov Report

Merging #3024 (28dc896) into master (0bc3f0b) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #3024   +/-   ##
=======================================
  Coverage   90.28%   90.28%           
=======================================
  Files         213      213           
  Lines       35004    35004           
=======================================
  Hits        31602    31602           
  Misses       3402     3402

svenklemm

Shouldn't this be exercised somewhere in CI?

erimatnor · 2021-03-15T08:39:27Z

scripts/test_restore_points.sh

+	NAME=$1
+	PORT=$2
+	RESTORE_POINT_NAME=$3
+	pg_ctl init -D "${STORAGE_DIR}/${NAME}"


I would make the script support PG tools in a non-standard path by prefixing like this:

Suggested change

pg_ctl init -D "${STORAGE_DIR}/${NAME}"

${PG_BIN}/pg_ctl init -D "${STORAGE_DIR}/${NAME}"

Then allow overriding PG_BIN above. This applies to all psql calls below as well.

I guess I can do this, but I believe we don't do that for our other scripts. Personally, I would prefer to change PATH to point to the proper PG directory before running this script, if someone needs to change the version.
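
One way to reconcile the two positions is to make the prefix optional. The sketch below is an assumption about how that could look, not what the script does: `pg_tool` is a hypothetical helper, and `PG_BIN` is only honored when it is set, so the default behavior stays a plain PATH lookup.

```shell
# Hypothetical helper: build the command name for a PostgreSQL tool.
# If PG_BIN is set, the tool is taken from that directory; otherwise the
# bare name is returned so that the normal PATH lookup applies.
pg_tool() {
    if [ -n "${PG_BIN:-}" ]; then
        printf '%s/%s\n' "${PG_BIN%/}" "$1"
    else
        printf '%s\n' "$1"
    fi
}

# Example call site (not executed here):
#   "$(pg_tool pg_ctl)" init -D "${STORAGE_DIR}/${NAME}"
```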

erimatnor · 2021-03-15T08:44:26Z

scripts/test_restore_points.sh

+('2018-07-01 09:11', 90, 2.7),
+('2018-07-01 08:01', 29, 1.5);
+
+SELECT pg_create_restore_point('${TEST_RESTORE_POINT_3}');


Would be good to also insert data after the restore point, to check that it is not included in the restored instance.

I'll do it, but that's also checked when restoring from the restore points 1 and 2
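
A minimal sketch of that suggestion, assuming the script feeds SQL to psql from a here-document as the excerpt above suggests. The function name `emit_restore_point_sql`, the table name `conditions`, and the inserted row are placeholders, not taken from the script.

```shell
# Hypothetical sketch: emit SQL that creates a restore point and then
# writes one more row. The table name "conditions" and the row values
# are placeholders, not taken from the script.
emit_restore_point_sql() {
    rp_name=$1
    cat <<EOF
SELECT pg_create_restore_point('${rp_name}');
-- Inserted after the restore point: must be absent after recovery.
INSERT INTO conditions VALUES ('2018-07-02 10:00', 1, 0.5);
EOF
}
```

After recovering to the named restore point, a query for the late row should return nothing, while the rows inserted before the restore point should all be present.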

pmwkaa · 2021-03-17T14:49:01Z

@erimatnor @svenklemm I've added a cron test here to check that it actually works. I'll remove it before merging.

.github/workflows/cron-tests.yaml

svenklemm · 2021-03-18T02:40:28Z

.github/workflows/cron-tests.yaml

+    strategy:
+      fail-fast: false
+    env:
+      PG_VERSION: 12.6


Do we want a matrix here, to test PG11 and PG13 too?

k-rus

I can see that the regression test succeeded. However, I see some errors in the log that I don't feel good about:

2021-03-17 14:43:27.520 GMT [110] FATAL:  database "sn" does not exist

cp: cannot stat '/testdir/wal/sn/00000002.history': No such file or directory

cp: cannot stat '/testdir/wal/sn/00000001.history': No such file or directory

Not critical:

sh: locale: not found
2021-03-17 14:43:25.331 UTC [67] WARNING:  no usable system locales were found

pmwkaa · 2021-03-18T09:05:57Z

I believe they are related to the way PostgreSQL copies files using the archive/restore command. And, to my understanding, database "sn" does not exist was related to data node bootstrapping; otherwise it would fail.
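
For context, the PostgreSQL documentation notes that restore_command is also called for files that are not in the archive (timeline `.history` files are a typical case) and that it must return nonzero for them, which is not an error condition. The `cp: cannot stat` lines are simply cp's own stderr in that situation. A sketch of a quieter variant, assuming a hypothetical wrapper function `restore_wal_file` that is not part of this PR:

```shell
# Hypothetical wrapper for use from restore_command, for example
#   restore_command = '/path/to/restore_wal.sh /testdir/wal/sn %f %p'
# (the wrapper script path is an assumption). It returns nonzero
# without printing anything when the requested file, such as a
# timeline .history file, is not in the archive.
restore_wal_file() {
    archive_dir=$1
    file_name=$2
    target_path=$3
    [ -f "${archive_dir}/${file_name}" ] || return 1
    cp "${archive_dir}/${file_name}" "$target_path"
}
```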

k-rus · 2021-03-18T09:24:17Z

And, to my understanding, database "sn" does not exist was related to data node bootstrapping; otherwise it would fail.

@pmwkaa The error is FATAL. How can sn be related to the data node bootstrapping?

k-rus · 2021-03-18T09:27:10Z

@pmwkaa I tried to run docker-run-restore-points-test.sh locally on my Mac, but it fails on postgresql.conf:

waiting for server to start....2021-03-18 08:46:13.429 GMT [92] LOG:  unrecognized configuration parameter "restore_command" in file "/testdir/storage/sn/postgresql.conf" line 10
2021-03-18 08:46:13.430 GMT [92] LOG:  unrecognized configuration parameter "recovery_target_name" in file "/testdir/storage/sn/postgresql.conf" line 11
2021-03-18 08:46:13.430 GMT [92] LOG:  unrecognized configuration parameter "recovery_target_action" in file "/testdir/storage/sn/postgresql.conf" line 12
2021-03-18 08:46:13.430 GMT [92] FATAL:  configuration file "/testdir/storage/sn/postgresql.conf" contains errors

What am I doing wrong? Since it runs inside a container, I am not sure why it shouldn't just work.

The entire log:

./scripts/docker-run-restore-points-test.sh
Image "rp_test:latest" already exists.

Run 'docker run -d --name some-timescaledb -p 5432:5432 rp_test:latest' to launch
59caf9e1b6350eb66f16de43f49df7a7b19bcfb925d1f3d617a0a77ec36add3f
**** Testing ****
timescaledb-rp: mkdir /testdir
timescaledb-rp: chown postgres:postgres /testdir
timescaledb-rp: (cd /testdir; su postgres sh -c /mnt/scripts/test_restore_points.sh)
Running single node tests
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /testdir/storage/sn ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default timezone ... UTC
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... sh: locale: not found
2021-03-18 08:46:12.551 UTC [64] WARNING:  no usable system locales were found
ok
syncing data to disk ... ok

Success. You can now start the database server using:

    /usr/local/bin/pg_ctl -D /testdir/storage/sn -l logfile start


WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
waiting for server to start....2021-03-18 08:46:13.429 GMT [92] LOG:  unrecognized configuration parameter "restore_command" in file "/testdir/storage/sn/postgresql.conf" line 10
2021-03-18 08:46:13.430 GMT [92] LOG:  unrecognized configuration parameter "recovery_target_name" in file "/testdir/storage/sn/postgresql.conf" line 11
2021-03-18 08:46:13.430 GMT [92] LOG:  unrecognized configuration parameter "recovery_target_action" in file "/testdir/storage/sn/postgresql.conf" line 12
2021-03-18 08:46:13.430 GMT [92] FATAL:  configuration file "/testdir/storage/sn/postgresql.conf" contains errors
 stopped waiting
pg_ctl: could not start server
Examine the log output.
timescaledb-rp
Exit status is 1

pmwkaa · 2021-03-18T09:33:40Z

PG11 uses a different configuration file for recovery, which is not supported by this script.
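
The difference is that PostgreSQL 11 and earlier read recovery settings from `recovery.conf` in the data directory, while PostgreSQL 12 and later read them from `postgresql.conf` and start targeted recovery when a `recovery.signal` file is present. A sketch of handling both, under stated assumptions: `write_recovery_config` is a hypothetical helper, the `cp`-based restore_command mirrors the paths seen in the logs, and `promote` as the target action is a guess at what the script uses.

```shell
# Hypothetical helper: write recovery settings in the place the given
# PostgreSQL major version expects them.
#   $1 data directory, $2 PG major version, $3 WAL archive directory,
#   $4 restore point name
write_recovery_config() {
    data_dir=$1
    pg_major=$2
    wal_dir=$3
    rp_name=$4
    if [ "$pg_major" -ge 12 ]; then
        # PG12+: settings live in postgresql.conf; an empty
        # recovery.signal file requests targeted recovery.
        conf="${data_dir}/postgresql.conf"
        : > "${data_dir}/recovery.signal"
    else
        # PG11 and earlier: settings live in recovery.conf.
        conf="${data_dir}/recovery.conf"
    fi
    cat >> "$conf" <<EOF
restore_command = 'cp ${wal_dir}/%f %p'
recovery_target_name = '${rp_name}'
recovery_target_action = 'promote'
EOF
}
```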

pmwkaa · 2021-03-18T09:40:49Z

I guess I can add support for PG11, if people think it is necessary

k-rus · 2021-03-18T10:01:05Z

I guess I can add support for PG11, if people think it is necessary

At least it should check and return a meaningful error if the local installation is not the correct version of PG.

I think it is good to test PG11 too, since PG11 is still supported and users should be able to restore.
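
A version guard along those lines might look like the sketch below. The helper names are hypothetical, and the accepted set (11, 12, 13) is an assumption based on the versions mentioned in this thread.

```shell
# Hypothetical helper: extract the major version from a string such as
# the output of `pg_config --version` ("PostgreSQL 12.6" -> 12).
pg_major_from_string() {
    printf '%s\n' "$1" | sed -E 's/^PostgreSQL ([0-9]+).*$/\1/'
}

# Hypothetical guard: fail with a readable message when the version is
# outside the assumed supported set.
require_supported_pg() {
    major=$(pg_major_from_string "$1")
    case "$major" in
        11|12|13)
            return 0
            ;;
        *)
            echo "unsupported PostgreSQL version: $1 (expected 11, 12 or 13)" >&2
            return 1
            ;;
    esac
}
```

A script could call `require_supported_pg "$(pg_config --version)"` near the top and exit early on failure.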

pmwkaa · 2021-03-18T11:53:04Z

@k-rus Found the cause of the FATAL message: it is triggered by pg_isready, which always requires a database name. I've switched it to use postgres.
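
A startup wait loop with an explicit database name could be sketched like this. It is an illustration only: `wait_for_pg`, `PG_ISREADY` and `WAIT_INTERVAL` are hypothetical names, and the actual loop in the script may differ.

```shell
# Hypothetical sketch: poll until the server on the given port accepts
# connections, always probing the "postgres" database so that no
# "database ... does not exist" message is logged.
#   $1 port, $2 maximum number of attempts (default 30)
wait_for_pg() {
    port=$1
    tries=${2:-30}
    isready=${PG_ISREADY:-pg_isready}   # overridable, e.g. for testing
    i=0
    while [ "$i" -lt "$tries" ]; do
        if "$isready" -q -d postgres -p "$port"; then
            return 0
        fi
        i=$((i + 1))
        sleep "${WAIT_INTERVAL:-1}"
    done
    echo "server on port ${port} did not become ready" >&2
    return 1
}
```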

@svenklemm @erimatnor Added support for PG11 and enabled additional debug output during the execution for better readability

erimatnor

Approving, although I left some comments about potential to improve code reuse and reduce the number of times we build the same Docker image in tests.

erimatnor · 2021-03-19T08:44:26Z

scripts/docker-run-restore-points-test.sh

+}
+
+docker rm -f timescaledb-rp 2>/dev/null || true
+IMAGE_NAME=rp_test TAG_NAME=latest bash ${SCRIPT_DIR}/docker-build.sh


I think we are now building this Docker image over-and-over again in different workflows, which seems inefficient. We should probably see if we can build it once, cache, and reuse across different tests.

If it is difficult to do, we can do it later but just noting the duplicate work here.

But we build them against different PG versions, so I'm not sure what we can do here.

erimatnor · 2021-03-19T08:45:24Z

scripts/docker-run-restore-points-test.sh

+set -e
+set -o pipefail
+
+SCRIPT_DIR=$(dirname $0)


The beginning of this script seems to be copy-pasting things from other scripts. I guess there's an opportunity to ensure better code-reuse here.

Not sure what you meant here, but I guess we can say the same about other scripts in the directory. The only thing that was reused is the PostgreSQL server startup wait logic. Maybe we should address this as a cumulative task that includes the other scripts?

k-rus

LGTM
See also my comments

k-rus · 2021-03-18T15:34:02Z

.github/workflows/cron-tests.yaml

+
+  backup_and_restore:
+    name: Backup and restore
+    runs-on: ubuntu-18.04


Because of PR #3029, should the test run on 20.04?

Suggested change

runs-on: ubuntu-18.04

runs-on: ubuntu-20.04

k-rus · 2021-03-18T15:36:06Z

scripts/docker-run-restore-points-test.sh

+    # function
+    status="$?"
+    set +e # do not exit immediately on failure in cleanup handler
+    # docker rm -vf timescaledb-valgrind 2>/dev/null


This line seems to be leftover. Can you remove it?

pmwkaa requested a review from a team as a code owner March 12, 2021 12:15

pmwkaa requested review from erimatnor, k-rus and svenklemm and removed request for a team March 12, 2021 12:15

svenklemm reviewed Mar 15, 2021

erimatnor reviewed Mar 15, 2021

pmwkaa force-pushed the test_restore_point branch 4 times, most recently from 9680b6d to d40e389 March 17, 2021 12:05

pmwkaa requested review from erimatnor and svenklemm March 17, 2021 12:16

pmwkaa force-pushed the test_restore_point branch 9 times, most recently from 2ea1ee4 to 8874a85 March 17, 2021 14:41

svenklemm reviewed Mar 18, 2021

svenklemm reviewed Mar 18, 2021

pmwkaa force-pushed the test_restore_point branch from 8874a85 to a84423c March 18, 2021 08:48

k-rus reviewed Mar 18, 2021

pmwkaa force-pushed the test_restore_point branch from a84423c to 19f02b7 March 18, 2021 09:04

k-rus assigned pmwkaa Mar 18, 2021

pmwkaa force-pushed the test_restore_point branch from 19f02b7 to 24a7d88 March 18, 2021 11:48

pmwkaa requested review from k-rus and svenklemm March 18, 2021 11:53

erimatnor approved these changes Mar 19, 2021

k-rus approved these changes Mar 19, 2021

pmwkaa force-pushed the test_restore_point branch from 24a7d88 to 6db4f2b March 19, 2021 11:37

svenklemm approved these changes Mar 19, 2021

Add script to test recovery using restore points

28dc896

pmwkaa force-pushed the test_restore_point branch from 6db4f2b to 28dc896 March 19, 2021 12:14

pmwkaa merged commit c09f7e4 into timescale:master Mar 19, 2021

k-rus mentioned this pull request Mar 22, 2021

[Update] Document how to use distributed restore point on multinode timescale/docs.timescale.com-content#685

Closed
