Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

operator.backup: update wal_prefetch() so that failed downloads are deleted #144

Merged
merged 1 commit into from Oct 26, 2014
Merged

operator.backup: update wal_prefetch() so that failed downloads are deleted #144

merged 1 commit into from Oct 26, 2014

Conversation

hblanks
Copy link

@hblanks hblanks commented Oct 22, 2014

Resolves the issue in v0.8c1 detailed at:

https://groups.google.com/forum/#!topic/wal-e/8wwiDkrNLXQ

wherein (standby_mode = off) PostgreSQL recoveries were failing when
--prefetch > 0. Specifically, in cases where do_lzop_get() failed to
fetch an archive file (as would be the case when trying to fetch the
non-existent WAL file following the latest one in the archive) prefetch
was creating an empty archive file and then allowing AtomicDownload to
link that file into place. This causes PostgreSQL's recovery to fail.
PostgreSQL then shuts down and can't be started up until you remove
recovery.conf.

Sorry to not include a test for this; I've had difficulties getting tox up and
running in my environment. My best guess would be to add a test to
test_blackbox.py along the lines of test_wal_fetch_non_existent(). I'm
not certain, however.

…eleted

Resolves the issue detailed at:

    https://groups.google.com/forum/#!topic/wal-e/8wwiDkrNLXQ

wherein (standby_mode = off) PostgreSQL recoveries were failing when
--prefetch > 0. Specifically, in cases where do_lzop_get() failed to
fetch an archive file (as would be the case when trying to fetch the
non-existent WAL file following the latest one in the archive) prefetch
was creating an empty archive file and then allowing AtomicDownload to
link that file into place. This causes PostgreSQL's recovery to fail.
PostgreSQL then shuts down and can't be started up until you remove
recovery.conf.
@hblanks
Copy link
Author

hblanks commented Oct 22, 2014

Logs from a successful restore, where standby_mode = off and --prefetch had the default of > 0.

2014-10-22 20:57:04 UTC LOG:  last completed transaction was at log time 2014-10-16 05:30:02.567065+00
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-fetch".
        STRUCTURED: time=2014-10-22T20:57:06.626173-00 pid=20711
wal_e.operator.backup INFO     MSG: begin wal restore
        STRUCTURED: time=2014-10-22T20:57:07.099129-00 pid=20711 action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F4.
lzo prefix=PREFIX/ seg=0000000300000042000000F4 state=begin
wal_e.blobstore.s3.s3_util INFO     MSG: completed download and decompression
        DETAIL: Downloaded and decompressed "s3://BUCKET/PREFIX/wal_005/0000000300000042000000F4.lzo" to "pg_xlog/RECOVERYXLOG"
        STRUCTURED: time=2014-10-22T20:57:09.523464-00 pid=20711
wal_e.operator.backup INFO     MSG: complete wal restore
        STRUCTURED: time=2014-10-22T20:57:09.524837-00 pid=20711 action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F4.
lzo prefix=PREFIX/ seg=0000000300000042000000F4 state=complete
2014-10-22 20:57:09 UTC LOG:  restored log file "0000000300000042000000F4" from archive
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-fetch".
        STRUCTURED: time=2014-10-22T20:57:10.198009-00 pid=20823
wal_e.operator.backup INFO     MSG: begin wal restore
        STRUCTURED: time=2014-10-22T20:57:10.357207-00 pid=20823 action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/00000004.history.lzo pref
ix=PREFIX/ seg=00000004.history state=begin
lzop: <stdin>: not a lzop file
wal_e.blobstore.s3.s3_util WARNING  MSG: could no longer locate object while performing wal restore
        DETAIL: The absolute URI that could not be located is s3://BUCKET/PREFIX/wal_005/00000004.history.lzo.
        HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration.
        STRUCTURED: time=2014-10-22T20:57:10.521368-00 pid=20823
wal_e.operator.backup INFO     MSG: complete wal restore
        STRUCTURED: time=2014-10-22T20:57:10.523909-00 pid=20823 action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/00000004.history.lzo pref
ix=PREFIX/ seg=00000004.history state=complete
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-fetch".
        STRUCTURED: time=2014-10-22T20:57:11.198135-00 pid=20844
wal_e.operator.backup INFO     MSG: begin wal restore
        STRUCTURED: time=2014-10-22T20:57:11.348992-00 pid=20844 action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/00000005.history.lzo pref
ix=PREFIX/ seg=00000005.history state=begin
lzop: <stdin>: not a lzop file
wal_e.blobstore.s3.s3_util WARNING  MSG: could no longer locate object while performing wal restore
        DETAIL: The absolute URI that could not be located is s3://BUCKET/PREFIX/wal_005/00000005.history.lzo.
        HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration.
        STRUCTURED: time=2014-10-22T20:57:11.492573-00 pid=20844
wal_e.operator.backup INFO     MSG: complete wal restore
        STRUCTURED: time=2014-10-22T20:57:11.494196-00 pid=20844 action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/00000005.history.lzo pref
ix=PREFIX/ seg=00000005.history state=complete
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-fetch".
        STRUCTURED: time=2014-10-22T20:57:12.205295-00 pid=20853

@fdr fdr mentioned this pull request Oct 26, 2014
@fdr fdr merged commit c34c993 into wal-e:master Oct 26, 2014
@fdr
Copy link
Member

fdr commented Oct 26, 2014

Yay Merged.

@evtuhovich
Copy link

Hi, the same problem with wal-fetch, when standby_mode = off

I use upstream wal-e, but do not understand, how to use prefetch - there is no documentation and i do not understand differences between fetch and prefetch

@hblanks
Copy link
Author

hblanks commented Mar 6, 2015

@evtuhovich prefetch is a clever, and expedient, feature, and it exists because of how PostgreSQL fetches WAL archives.

Basically, PostgreSQL runs the restore_command for one WAL file after another. This works, but it means that you will wait to download WAL archive 2 until after you've downloaded (and loaded?) archive 1.

You thus will delay your download of all the WAL archives, in the worst case, by the slowest download.

Hence, prefetch, which actually daemonizes a new process when the restore_command is executed, and that process goes out and tries to fetch future WAL files so that they'll be ready when it's time to restore them. (And of course, there's bookkeeping involved to make sure this process doesn't conflict with any other prefetch processes outstanding...)

@evtuhovich
Copy link

@hblanks, thanx for answer, i'll try to use it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants