Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should detect when archive_command is not set right.. #50

Closed
diranged opened this issue Jan 5, 2018 · 4 comments
Closed

Should detect when archive_command is not set right.. #50

diranged opened this issue Jan 5, 2018 · 4 comments

Comments

@diranged
Copy link

diranged commented Jan 5, 2018

We are doing some development of new puppet code for managing databases, and in my testing I hadn't yet set archive_command to anything on our dbmaster. Obviously we need to do that .. but if you miss it, and you try to execute a wal-g backup-push, it just hangs near the end:

/pg_subtrans
/pg_tblspc
/pg_twophase
/pg_xlog
/postgresql.auto.conf
/postgresql.conf
Finished writing part 1.
Starting part 2 ...
/global/pg_control
Finished writing part 2.
<hang is here>

Digging into the Postgres logs, we saw:

2018-01-05 08:00:02.232 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,3,"SELECT",2018-01-05 08:00:02 UTC,3/1674,0,LOG,00000,"duration: 124.497 ms  execute <unnamed>: SELECT case when pg_is_in_recovery() then '' else (pg_xlogfile_name_offset(lsn)).file_name end, lsn::text, pg_is_in_recovery() FROM pg_start_backup($1, true, false) lsn","parameters: $1 = '2018-01-05 08:00:02.106987155 +0000 UTC'",,,,,,,,""
2018-01-05 08:01:02.103 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,4,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (60 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""
2018-01-05 08:02:02.171 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,5,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (120 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""
2018-01-05 08:04:02.305 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,6,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (240 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""
2018-01-05 08:08:02.570 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,7,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (480 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""
2018-01-05 08:16:03.100 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,8,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (960 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""
2018-01-05 08:32:04.156 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,9,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (1920 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""
2018-01-05 09:04:06.275 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,10,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (3840 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""
2018-01-05 10:08:10.511 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,11,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (7680 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""
2018-01-05 12:16:18.992 UTC,"postgres","postgres",119597,"[local]",5a4f3082.1d32d,12,"SELECT",2018-01-05 08:00:02 UTC,3/1676,0,WARNING,01000,"pg_stop_backup still waiting for all required WAL segments to be archived (15360 seconds elapsed)",,"Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.",,,,,,,""

Adding archive_command = "/bin/true" and HUPing postgres solved the issue. However, it seems that wal-g should have some way to detect when its in this hung state and get out of it with a useful error.

@x4m
Copy link
Collaborator

x4m commented Jan 5, 2018

Matt, thanks for reporting this!
Testing nonempty archive_command does not seem like a proper solution. If you don't have WALs WAL-G should inform the user that backup is inconsistent.
Checking %wal-g% in archive_command is neither a solution: user can do some scripting like archive_command = 'my_fancy_script.sh'.

During backup-push, we can check whether we have at least some WALs related to this backup and report a problem. But, when we are doing the backup on replica it is possible to have 0 WALs issued during backup... I need to think a bit about this.

@diranged
Copy link
Author

diranged commented Jan 5, 2018

I totally agree that specifically checking archive_command = .*wal-g.* is wrong. I guess I just meant that a simple check to make sure it’s set at all might help. It would of course be better to check if a real backup can even possibly be taken.

@x4m
Copy link
Collaborator

x4m commented Jan 6, 2018

There are some options.

  1. If archive_mode is off, we do not need archive_command. But, WAL-G ignores xlog\wal folder, thus, we should not allow backups, they will be inconsistent. Either way, include WALs folder into backup.
  2. If archive_mode is not off, we can check archive_command. I've reproduced hanging WAL-G.

I think that we should add checks somewhere here. https://github.com/wal-g/wal-g/blob/master/connect.go#L24

If you have some spare time, I'd be happy if you could hack a pull request.

Or I can implement this checks, probably somewhere next week.

x4m added a commit that referenced this issue Jan 17, 2018
…ly this is printed as warning. Will consider panic in future
@x4m
Copy link
Collaborator

x4m commented Sep 6, 2018

We have hints for the user now, but we cannot protect the user from shooting in the foot completely, if they do not read WAL-G output.
If you will have some better ideas please feel free to ping me.

@x4m x4m closed this as completed Sep 6, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants