Add script to test recovery using restore points #3024

Merged
merged 1 commit into from Mar 19, 2021

pmwkaa
Contributor

@pmwkaa pmwkaa commented Mar 12, 2021

Test PostgreSQL restore point recovery for single node and multi node cluster, make sure that single and multi node results match after recovery.
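The final check described here, that single-node and multi-node results match after recovery, could be sketched as a plain diff of query output captured from both setups. This is a minimal sketch and not the script from this PR; the function name `compare_results` and its file arguments are hypothetical.

```shell
# Hypothetical helper: succeed only if the two result files are identical.
# Each file is assumed to hold query output captured from one setup after
# recovering to the same restore point.
compare_results() {
    sn_file=$1
    mn_file=$2
    if diff -u "$sn_file" "$mn_file"; then
        echo "results match"
    else
        echo "results differ" >&2
        return 1
    fi
}
```

A nonzero return lets a caller running under `set -e` stop at the first mismatch, with the unified diff already printed for inspection.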

@pmwkaa pmwkaa requested a review from a team as a code owner March 12, 2021 12:15
@pmwkaa pmwkaa requested review from erimatnor, k-rus and svenklemm and removed request for a team March 12, 2021 12:15

codecov bot commented Mar 12, 2021

Codecov Report

Merging #3024 (28dc896) into master (0bc3f0b) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #3024   +/-   ##
=======================================
  Coverage   90.28%   90.28%           
=======================================
  Files         213      213           
  Lines       35004    35004           
=======================================
  Hits        31602    31602           
  Misses       3402     3402           


Member

@svenklemm svenklemm left a comment

Shouldn't this be exercised somewhere in CI?

NAME=$1
PORT=$2
RESTORE_POINT_NAME=$3
pg_ctl init -D "${STORAGE_DIR}/${NAME}"
Contributor

I would make the script support PG tools in a non-standard path by prefixing like this:

Suggested change
- pg_ctl init -D "${STORAGE_DIR}/${NAME}"
+ ${PG_BIN}/pg_ctl init -D "${STORAGE_DIR}/${NAME}"

Then allow overriding PG_BIN above. This applies to all psql calls below as well.

Contributor Author

I guess I can do this, but I believe we don't do that for our other scripts. Personally, if someone needs to change the version, I would prefer to change PATH to point to the proper PG directory before running this script.
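The two positions in this thread are compatible: a script can honor an optional `PG_BIN` override and otherwise rely on `PATH`. A minimal sketch, assuming a hypothetical helper named `pg_tool_path` that is not part of this PR:

```shell
# Hypothetical helper: print the path to use for a PostgreSQL tool.
# When PG_BIN is set, the tool is taken from that directory; otherwise the
# bare tool name is printed so that PATH lookup applies.
pg_tool_path() {
    tool=$1
    if [ -n "${PG_BIN:-}" ]; then
        # Strip one trailing slash so PG_BIN=/opt/pg/bin/ also works.
        echo "${PG_BIN%/}/${tool}"
    else
        echo "${tool}"
    fi
}

# Example call (not executed here):
#   "$(pg_tool_path pg_ctl)" init -D "${STORAGE_DIR}/${NAME}"
```

With `PG_BIN` unset, behavior is unchanged from calling the tools directly, so existing invocations that adjust `PATH` keep working.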

('2018-07-01 09:11', 90, 2.7),
('2018-07-01 08:01', 29, 1.5);

SELECT pg_create_restore_point('${TEST_RESTORE_POINT_3}');
Contributor

It would be good to also insert data after the restore point to verify that it is not included in the restored instance.

Contributor Author

I'll do it, but that is also checked when restoring from restore points 1 and 2.
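The reviewer's suggestion amounts to writing one more row after `pg_create_restore_point` and confirming it is gone after recovery. A minimal sketch, assuming a hypothetical helper `emit_restore_point_sql`; the table name is an assumption, and the row shape (timestamp, integer, float) only mirrors the INSERT visible in the diff hunk above.

```shell
# Hypothetical helper: print SQL that creates a named restore point and then
# inserts one extra row. The table name and the inserted values are
# illustrative assumptions, not taken from the PR's script.
emit_restore_point_sql() {
    restore_point=$1
    table=$2
    cat <<EOF
SELECT pg_create_restore_point('${restore_point}');
-- Written after the restore point: must be absent once recovery stops there.
INSERT INTO ${table} VALUES ('2018-07-02 00:00', 1, 0.1);
EOF
}

# Example call (not executed here), piping the SQL to psql on stdin:
#   emit_restore_point_sql rp3 test_table | psql -p "$PORT"
```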

@pmwkaa pmwkaa force-pushed the test_restore_point branch 4 times, most recently from 9680b6d to d40e389 Compare March 17, 2021 12:05
@pmwkaa pmwkaa force-pushed the test_restore_point branch 9 times, most recently from 2ea1ee4 to 8874a85 Compare March 17, 2021 14:41
Contributor Author

pmwkaa commented Mar 17, 2021

@erimatnor @svenklemm I've added a cron test here to see that it actually works. I'll remove it before merging.

strategy:
  fail-fast: false
env:
  PG_VERSION: 12.6
Member

Do we want a matrix here to test PG11 and PG13 too?

Contributor

@k-rus k-rus left a comment

I can see that the regression test succeeded. However, I see some errors in the log that I am not comfortable with:

2021-03-17 14:43:27.520 GMT [110] FATAL:  database "sn" does not exist

cp: cannot stat '/testdir/wal/sn/00000002.history': No such file or directory

cp: cannot stat '/testdir/wal/sn/00000001.history': No such file or directory

Not critical:

sh: locale: not found
2021-03-17 14:43:25.331 UTC [67] WARNING:  no usable system locales were found

Contributor Author

pmwkaa commented Mar 18, 2021

I believe they are related to the way PostgreSQL copies files using the archive/restore commands. And the database "sn" does not exist error was, to my understanding, related to data node bootstrapping; otherwise it would fail.
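For context, the PostgreSQL documentation notes that `restore_command` will be asked for files that are not present in the archive and must return a nonzero status when that happens; requests for timeline history files are one such case. If that is what happened here, the `cp: cannot stat` lines are just `cp` reporting the expected miss on stderr. A minimal sketch, assuming a hypothetical wrapper `restore_wal_file` used in place of a bare `cp`:

```shell
# Hypothetical wrapper body for a restore_command script, configured e.g. as
#   restore_command = '/path/to/wrapper.sh /testdir/wal/sn %f %p'
# where %f is the requested file name and %p the destination path.
# The wrapper script itself would end with: restore_wal_file "$@"
restore_wal_file() {
    archive_dir=$1
    file_name=$2
    dest_path=$3
    if [ ! -f "${archive_dir}/${file_name}" ]; then
        # Missing file: report failure quietly with a nonzero status.
        return 1
    fi
    cp "${archive_dir}/${file_name}" "${dest_path}"
}
```

This keeps the exit-status contract while leaving real copy failures visible in the log.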

Contributor

k-rus commented Mar 18, 2021

And the database "sn" does not exist error was, to my understanding, related to data node bootstrapping; otherwise it would fail

@pmwkaa The error is FATAL. How can sn be related to the data node bootstrapping?

Contributor

k-rus commented Mar 18, 2021

@pmwkaa I tried to run docker-run-restore-points-test.sh locally on my Mac, but it fails on postgresql.conf:

waiting for server to start....2021-03-18 08:46:13.429 GMT [92] LOG:  unrecognized configuration parameter "restore_command" in file "/testdir/storage/sn/postgresql.conf" line 10
2021-03-18 08:46:13.430 GMT [92] LOG:  unrecognized configuration parameter "recovery_target_name" in file "/testdir/storage/sn/postgresql.conf" line 11
2021-03-18 08:46:13.430 GMT [92] LOG:  unrecognized configuration parameter "recovery_target_action" in file "/testdir/storage/sn/postgresql.conf" line 12
2021-03-18 08:46:13.430 GMT [92] FATAL:  configuration file "/testdir/storage/sn/postgresql.conf" contains errors

What am I doing wrong? Since it runs inside a container, I am not sure why it shouldn't just work.

The entire log:

./scripts/docker-run-restore-points-test.sh
Image "rp_test:latest" already exists.

Run 'docker run -d --name some-timescaledb -p 5432:5432 rp_test:latest' to launch
59caf9e1b6350eb66f16de43f49df7a7b19bcfb925d1f3d617a0a77ec36add3f
**** Testing ****
timescaledb-rp: mkdir /testdir
timescaledb-rp: chown postgres:postgres /testdir
timescaledb-rp: (cd /testdir; su postgres sh -c /mnt/scripts/test_restore_points.sh)
Running single node tests
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /testdir/storage/sn ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default timezone ... UTC
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... sh: locale: not found
2021-03-18 08:46:12.551 UTC [64] WARNING:  no usable system locales were found
ok
syncing data to disk ... ok

Success. You can now start the database server using:

    /usr/local/bin/pg_ctl -D /testdir/storage/sn -l logfile start


WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
waiting for server to start....2021-03-18 08:46:13.429 GMT [92] LOG:  unrecognized configuration parameter "restore_command" in file "/testdir/storage/sn/postgresql.conf" line 10
2021-03-18 08:46:13.430 GMT [92] LOG:  unrecognized configuration parameter "recovery_target_name" in file "/testdir/storage/sn/postgresql.conf" line 11
2021-03-18 08:46:13.430 GMT [92] LOG:  unrecognized configuration parameter "recovery_target_action" in file "/testdir/storage/sn/postgresql.conf" line 12
2021-03-18 08:46:13.430 GMT [92] FATAL:  configuration file "/testdir/storage/sn/postgresql.conf" contains errors
 stopped waiting
pg_ctl: could not start server
Examine the log output.
timescaledb-rp
Exit status is 1

Contributor Author

pmwkaa commented Mar 18, 2021

PG11 uses a different configuration file for recovery, which this script does not support.
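The difference is that PostgreSQL 12 moved the recovery settings into `postgresql.conf` and uses a `recovery.signal` file to request targeted recovery, while PG11 and earlier read them from `recovery.conf` in the data directory. A minimal sketch of handling both, assuming a hypothetical helper `write_recovery_settings`; the `cp`-based `restore_command` and the `promote` action are illustrative assumptions, not values taken from the PR's script.

```shell
# Hypothetical helper: write recovery settings where the given PostgreSQL
# major version expects them.
write_recovery_settings() {
    data_dir=$1
    pg_major=$2
    wal_dir=$3
    restore_point=$4
    if [ "$pg_major" -ge 12 ]; then
        conf="${data_dir}/postgresql.conf"
        # PG12+ enters targeted recovery when recovery.signal exists.
        : > "${data_dir}/recovery.signal"
    else
        # PG11 and earlier read recovery settings from recovery.conf.
        conf="${data_dir}/recovery.conf"
    fi
    cat >> "$conf" <<EOF
restore_command = 'cp ${wal_dir}/%f %p'
recovery_target_name = '${restore_point}'
recovery_target_action = 'promote'
EOF
}
```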

Contributor Author

pmwkaa commented Mar 18, 2021

I guess I can add support for PG11, if people think it is necessary

Contributor

k-rus commented Mar 18, 2021

I guess I can add support for PG11, if people think it is necessary

At least it should check and return a meaningful error if the local installation is not the correct version of PG.

I think it is good to test PG11 too, since PG11 is still supported and users should be able to restore.
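An early version check of the kind requested here could look like the following. This is a minimal sketch: the helper names are hypothetical, and the accepted set of major versions (11, 12, 13) is an assumption drawn from the versions mentioned in this thread.

```shell
# Hypothetical helper: extract the major version from the output of
# `pg_ctl --version`, e.g. "pg_ctl (PostgreSQL) 12.6" -> 12.
# It takes the first run of digits in the string.
pg_major_version() {
    echo "$1" | sed -n 's/^[^0-9]*\([0-9][0-9]*\).*/\1/p'
}

# Hypothetical check: fail early with a clear message on other versions.
require_pg_major() {
    major=$(pg_major_version "$1")
    case "$major" in
        11|12|13) return 0 ;;
        *) echo "unsupported PostgreSQL version: $1" >&2; return 1 ;;
    esac
}

# Example call (not executed here):
#   require_pg_major "$(pg_ctl --version)" || exit 1
```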

Contributor Author

pmwkaa commented Mar 18, 2021

@k-rus Found the cause of the fatal message: it is triggered by pg_isready, which always requires a database name. I've switched it to use postgres.

@svenklemm @erimatnor Added support for PG11 and enabled additional debug output during execution for better readability.
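The pg_isready fix can be illustrated with a small startup wait loop. This is a minimal sketch rather than the loop from the PR: `wait_until_ready` is a hypothetical name, and the probe is passed in as a command so the loop itself does not depend on PostgreSQL being installed.

```shell
# Hypothetical helper: retry a readiness probe until it succeeds or the
# attempts run out. The probe is given as a command plus arguments.
wait_until_ready() {
    attempts=$1
    shift
    while [ "$attempts" -gt 0 ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempts=$((attempts - 1))
        sleep 1
    done
    return 1
}

# Example call (not executed here); -d names a database that exists on a
# freshly initialized cluster, so the probe does not log a FATAL error:
#   wait_until_ready 30 pg_isready -d postgres -p "$PORT"
```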

@pmwkaa pmwkaa requested review from k-rus and svenklemm March 18, 2021 11:53
Contributor

@erimatnor erimatnor left a comment

Approving, although I left some comments about potential to improve code reuse and reduce the number of times we build the same Docker image in tests.

}

docker rm -f timescaledb-rp 2>/dev/null || true
IMAGE_NAME=rp_test TAG_NAME=latest bash ${SCRIPT_DIR}/docker-build.sh
Contributor

I think we are now building this Docker image over and over again in different workflows, which seems inefficient. We should probably see if we can build it once, cache it, and reuse it across different tests.

If it is difficult to do, we can do it later but just noting the duplicate work here.

Contributor Author

But we build them against different PG versions, so I'm not sure what we can do here.

set -e
set -o pipefail

SCRIPT_DIR=$(dirname $0)
Contributor

The beginning of this script seems to be copy-pasted from other scripts. I guess there's an opportunity for better code reuse here.

Contributor Author

@pmwkaa pmwkaa Mar 19, 2021

I'm not sure what you meant here, but I guess the same could be said of the other scripts in the directory; the only thing reused here is the PostgreSQL server startup wait logic. Maybe we should address this as a cumulative task that includes the other scripts?

Contributor

@k-rus k-rus left a comment

LGTM
See also my comments


backup_and_restore:
  name: Backup and restore
  runs-on: ubuntu-18.04
Contributor

Due to PR #3029 should the test run against 20.04?

Suggested change
- runs-on: ubuntu-18.04
+ runs-on: ubuntu-20.04

# function
status="$?"
set +e # do not exit immediately on failure in cleanup handler
# docker rm -vf timescaledb-valgrind 2>/dev/null
Contributor

This line seems to be leftover. Can you remove it?
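With the commented-out valgrind line dropped, a cleanup handler along these lines would remain. This is a minimal sketch under assumptions: the function name `cleanup` is hypothetical, the rest of the real handler is not visible in the hunk, and the container name is taken from the `docker rm -f timescaledb-rp` call quoted earlier in this review.

```shell
# Hypothetical cleanup handler: remember the exit status, relax errexit for
# the cleanup steps, then exit with the original status.
cleanup() {
    status="$?"
    set +e # do not exit immediately on failure in cleanup handler
    docker rm -vf timescaledb-rp >/dev/null 2>&1
    exit "$status"
}

# Register the handler in the script with:
#   trap cleanup EXIT
```

Capturing `$?` on the first line matters, because any later command in the handler would overwrite the status the script was exiting with.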

4 participants