operator.backup: update wal_prefetch() so that failed downloads are deleted #144

hblanks · 2014-10-22T21:56:52Z

Resolves the issue in v0.8c1 detailed at:

https://groups.google.com/forum/#!topic/wal-e/8wwiDkrNLXQ

wherein (standby_mode = off) PostgreSQL recoveries were failing when
--prefetch > 0. Specifically, in cases where do_lzop_get() failed to
fetch an archive file (as would be the case when trying to fetch the
non-existent WAL file following the latest one in the archive) prefetch
was creating an empty archive file and then allowing AtomicDownload to
link that file into place. This causes PostgreSQL's recovery to fail.
PostgreSQL then shuts down and can't be started up until you remove
recovery.conf.

Sorry to not include a test for this; I've had difficulties getting tox up and
running in my environment. My best guess would be to add a test to
test_blackbox.py along the lines of test_wal_fetch_non_existent(). I'm
not certain, however.

…eleted Resolves the issue detailed at: https://groups.google.com/forum/#!topic/wal-e/8wwiDkrNLXQ wherein (standby_mode = off) PostgreSQL recoveries were failing when --prefetch > 0. Specifically, in cases where do_lzop_get() failed to fetch an archive file (as would be the case when trying to fetch the non-existent WAL file following the latest one in the archive) prefetch was creating an empty archive file and then allowing AtomicDownload to link that file into place. This causes PostgreSQL's recovery to fail. PostgreSQL then shuts down and can't be started up until you remove recovery.conf.

hblanks · 2014-10-22T22:01:52Z

Logs from a successful restore, where standby_mode = off and --prefetch had the default of > 0.

2014-10-22 20:57:04 UTC LOG:  last completed transaction was at log time 2014-10-16 05:30:02.567065+00
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-fetch".
        STRUCTURED: time=2014-10-22T20:57:06.626173-00 pid=20711
wal_e.operator.backup INFO     MSG: begin wal restore
        STRUCTURED: time=2014-10-22T20:57:07.099129-00 pid=20711 action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F4.
lzo prefix=PREFIX/ seg=0000000300000042000000F4 state=begin
wal_e.blobstore.s3.s3_util INFO     MSG: completed download and decompression
        DETAIL: Downloaded and decompressed "s3://BUCKET/PREFIX/wal_005/0000000300000042000000F4.lzo" to "pg_xlog/RECOVERYXLOG"
        STRUCTURED: time=2014-10-22T20:57:09.523464-00 pid=20711
wal_e.operator.backup INFO     MSG: complete wal restore
        STRUCTURED: time=2014-10-22T20:57:09.524837-00 pid=20711 action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/0000000300000042000000F4.
lzo prefix=PREFIX/ seg=0000000300000042000000F4 state=complete
2014-10-22 20:57:09 UTC LOG:  restored log file "0000000300000042000000F4" from archive
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-fetch".
        STRUCTURED: time=2014-10-22T20:57:10.198009-00 pid=20823
wal_e.operator.backup INFO     MSG: begin wal restore
        STRUCTURED: time=2014-10-22T20:57:10.357207-00 pid=20823 action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/00000004.history.lzo pref
ix=PREFIX/ seg=00000004.history state=begin
lzop: <stdin>: not a lzop file
wal_e.blobstore.s3.s3_util WARNING  MSG: could no longer locate object while performing wal restore
        DETAIL: The absolute URI that could not be located is s3://BUCKET/PREFIX/wal_005/00000004.history.lzo.
        HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration.
        STRUCTURED: time=2014-10-22T20:57:10.521368-00 pid=20823
wal_e.operator.backup INFO     MSG: complete wal restore
        STRUCTURED: time=2014-10-22T20:57:10.523909-00 pid=20823 action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/00000004.history.lzo pref
ix=PREFIX/ seg=00000004.history state=complete
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-fetch".
        STRUCTURED: time=2014-10-22T20:57:11.198135-00 pid=20844
wal_e.operator.backup INFO     MSG: begin wal restore
        STRUCTURED: time=2014-10-22T20:57:11.348992-00 pid=20844 action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/00000005.history.lzo pref
ix=PREFIX/ seg=00000005.history state=begin
lzop: <stdin>: not a lzop file
wal_e.blobstore.s3.s3_util WARNING  MSG: could no longer locate object while performing wal restore
        DETAIL: The absolute URI that could not be located is s3://BUCKET/PREFIX/wal_005/00000005.history.lzo.
        HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration.
        STRUCTURED: time=2014-10-22T20:57:11.492573-00 pid=20844
wal_e.operator.backup INFO     MSG: complete wal restore
        STRUCTURED: time=2014-10-22T20:57:11.494196-00 pid=20844 action=wal-fetch key=s3://BUCKET/PREFIX/wal_005/00000005.history.lzo pref
ix=PREFIX/ seg=00000005.history state=complete
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-fetch".
        STRUCTURED: time=2014-10-22T20:57:12.205295-00 pid=20853

fdr · 2014-10-26T02:52:40Z

Yay Merged.

evtuhovich · 2015-03-05T16:07:54Z

Hi, the same problem with wal-fetch, when standby_mode = off

I use upstream wal-e, but do not understand, how to use prefetch - there is no documentation and i do not understand differences between fetch and prefetch

hblanks · 2015-03-06T04:39:56Z

@evtuhovich prefetch is a clever, and expedient, feature, and it exists because of how PostgreSQL fetches WAL archives.

Basically, PostgreSQL runs the restore_command for one WAL file after another. This works, but it means that you will wait to download WAL archive 2 until after you've downloaded (and loaded?) archive 1.

You thus will delay your download of all the WAL archives, in the worst case, by the slowest download.

Hence, prefetch, which actually daemonizes a new process when the restore_command is executed, and that process goes out and tries to fetch future WAL files so that they'll be ready when it's time to restore them. (And of course, there's bookkeeping involved to make sure this process doesn't conflict with any other prefetch processes outstanding...)

evtuhovich · 2015-03-06T08:57:27Z

@hblanks, thanx for answer, i'll try to use it

fdr mentioned this pull request Oct 26, 2014

Add SWIFT_AUTH_VERSION #146

Closed

fdr merged commit c34c993 into wal-e:master Oct 26, 2014

This was referenced Mar 6, 2015

Return non-zero exit code and do not create empty file for wal-fetch #168

Closed

Delete zero length wal-log #169

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

operator.backup: update wal_prefetch() so that failed downloads are deleted #144

operator.backup: update wal_prefetch() so that failed downloads are deleted #144

hblanks commented Oct 22, 2014

hblanks commented Oct 22, 2014

fdr commented Oct 26, 2014

evtuhovich commented Mar 5, 2015

hblanks commented Mar 6, 2015

evtuhovich commented Mar 6, 2015

operator.backup: update wal_prefetch() so that failed downloads are deleted #144

operator.backup: update wal_prefetch() so that failed downloads are deleted #144

Conversation

hblanks commented Oct 22, 2014

hblanks commented Oct 22, 2014

fdr commented Oct 26, 2014

evtuhovich commented Mar 5, 2015

hblanks commented Mar 6, 2015

evtuhovich commented Mar 6, 2015