
DecryptAndDecompressTar: lz4 decompress failed. Is archive encrypted?: DecompressLz4: lz4 write failed: unexpected EOF #849

Closed
bimcas opened this issue Jan 15, 2021 · 7 comments · Fixed by #1135


@bimcas

bimcas commented Jan 15, 2021

Hi,

My DB server:
CentOS 7.7
PostgreSQL 12.5

I'm using WAL-G v0.2.19 (Latest). Backups and WALs go to S3 storage.

It's interesting that during restoration processes I'm getting some error messages:

INFO: 2021/01/15 11:10:36.969496 LATEST backup is: 'base_00000006000000000000000D'
INFO: 2021/01/15 11:10:37.756309 Finished decompression of part_003.tar.lz4
INFO: 2021/01/15 11:10:37.756328 Finished extraction of part_003.tar.lz4
INFO: 2021/01/15 11:13:46.004024 Finished decompression of part_001.tar.lz4
ERROR: 2021/01/15 11:13:46.004063 part_001.tar.lz4 DecryptAndDecompressTar: lz4 decompress failed. Is archive encrypted?: DecompressLz4: lz4 write failed: unexpected EOF
INFO: 2021/01/15 11:13:46.004113 Finished extraction of part_001.tar.lz4
ERROR: 2021/01/15 11:13:46.004121 Extraction error in part_001.tar.lz4: extractOne: Interpret failed: Interpret: copy failed: unexpected EOF
INFO: 2021/01/15 11:16:04.996247 Finished decompression of part_001.tar.lz4
INFO: 2021/01/15 11:16:04.996269 Finished extraction of part_001.tar.lz4
INFO: 2021/01/15 11:16:05.003757 Finished decompression of pg_control.tar.lz4
INFO: 2021/01/15 11:16:05.003781 Finished extraction of pg_control.tar.lz4
INFO: 2021/01/15 11:16:05.003791 
Backup extraction complete.

Sometimes the restore is OK, but sometimes it takes very long and finally crashes.

I create backups in a regular way with backup-push (no encryption or other special elements).

The creation of backups and uploads to S3 storage is always normal and successful. Problems occur only with the restore process.

What does it mean "DecryptAndDecompressTar: lz4 decompress failed. Is archive encrypted?: DecompressLz4: lz4 write failed: unexpected EOF"?

Why does it happen?

@x4m
Collaborator

x4m commented Jan 16, 2021

Hi!
An unexpected EOF means a broken file download. WAL-G retries files a few times with lower parallelism. But if the same file fails again and again, WAL-G will give up.

@bimcas
Author

bimcas commented Jan 18, 2021

Hi

Hmm ... that's strange. I tried manually uploading/downloading archives (like part_001.tar.lz4 or similar) to/from S3 storage. Everything works perfectly. The connection is stable and the speed is good enough, but the restoration from backup fails with these error messages:

DecryptAndDecompressTar: lz4 decompress failed. Is archive encrypted ?: DecompressLz4: lz4 write failed: unexpected EOF
extractOne: Interpret failed: Interpret: copy failed: unexpected EOF
Failed to fetch backup: failed to extract files: part_001.tar.lz4

Also, I tried different S3 storages (different servers), but the restoration from backup almost always fails.
There are no error messages while creating backups and uploading them to S3 storage. **Problems occur only with the restoration from backup.**
I'm rejecting the idea that something is wrong with S3 storage connectivity, stability, or file downloads. Manual downloads are fast, always successful, and the files downloaded from S3 aren't corrupted.

Maybe something wrong could be:

  • with backup-push. There are no error messages, but maybe the backup files aren't actually valid, so backup-fetch later can't decompress and extract them. I use: envdir "${ENV_DIR}" wal-g backup-push "${PGDATA}"
  • with backup-fetch. Some kind of problem with LZ4 decompression; maybe a library or package is missing. How does backup-fetch work? Does it download the archive to the node first and only then extract it? I use: envdir "${ENV_DIR}" backup-fetch "${PGDATA}" LATEST

Right now I'm using wal-e. I'd like to switch to wal-g, but these issues are stopping me. How can this problem be solved? What would you recommend?

@x4m
Collaborator

x4m commented Jan 19, 2021

Decompression occurs on the fly right after receiving bytes from the network. No extra libs are necessary.
It's strange that you encounter an error with exactly the first file of a backup. Maybe there is some kind of privileges issue?
You can enable debug log by setting WALG_LOG_LEVEL=DEBUG.

@lalinsky

lalinsky commented Mar 3, 2021

I get the same error. I'm using Backblaze's S3 API and the first two parts are always failing to decompress.

@troyanov

troyanov commented May 6, 2021

Might be a duplicate of #449, which contains an interesting finding: #449 (comment)

I also have the same problem running v0.2.19, except that I use brotli:

INFO: 2021/05/06 21:53:42.195928 Finished decompression of part_001.tar.br
ERROR: 2021/05/06 21:53:42.196046 part_001.tar.br DecryptAndDecompressTar: br decompress failed. Is archive encrypted?: unexpected EOF
INFO: 2021/05/06 21:53:42.197454 Finished extraction of part_001.tar.br
ERROR: 2021/05/06 21:53:42.197563 Extraction error in part_001.tar.br: extractOne: Interpret failed: Interpret: copy failed: unexpected EOF

I'm using a Digital Ocean droplet with an attached volume (https://docs.digitalocean.com/products/volumes/).
The interesting thing is that the restore fails when I target the attached volume, but goes smoothly if I restore to /tmp.

On the Volumes product page they mention that they use Ceph. Could Ceph be the source of this problem?

Disk write speed differs, almost a 2x difference for me:

> dd if=/dev/zero of=/mnt/data_db_01_master/tmp/testfile bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.82629 s, 281 MB/s

> dd if=/dev/zero of=/root/testfile bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.14976 s, 499 MB/s

I tried WALG_DOWNLOAD_CONCURRENCY=1 and results are the same.

Maybe tcpdump will be helpful:

When the restore fails, an RST is sent by the Object Storage:

[tcpdump screenshot, 2021-05-07 03:19:29]

and when it succeeds:

[tcpdump screenshot, 2021-05-07 03:21:14]

It looks like the connection is being closed by the Object Storage?

Another interesting finding: a TCP Zero Window is sent by the client, and maybe that's why the RST from the DO S3-compatible storage follows.

[tcpdump screenshot showing the client's TCP Zero Window]

Normal disk (I should mention that part_001.tar.br is only 9.9 MB):

INFO: 2021/05/07 21:28:55.432923 LATEST backup is: 'base_00000001000001C700000011_D_00000001000001C200000059'
INFO: 2021/05/07 21:28:55.596283 Delta from base_00000001000001C200000059_D_00000001000001BD00000092 at LSN 1c259000028
INFO: 2021/05/07 21:28:55.660371 Delta from base_00000001000001BD00000092_D_00000001000001B8000000B6 at LSN 1bd92000028
INFO: 2021/05/07 21:28:55.743961 Delta from base_00000001000001B8000000B6_D_00000001000001B3000000E9 at LSN 1b8b6000028
INFO: 2021/05/07 21:28:55.805089 Delta from base_00000001000001B3000000E9 at LSN 1b3e9000028
INFO: 2021/05/07 21:29:08.465294 Finished extraction of part_001.tar.br
INFO: 2021/05/07 21:29:08.465639 Finished decompression of part_001.tar.br
INFO: 2021/05/07 21:29:12.200141 Finished decompression of part_002.tar.br
INFO: 2021/05/07 21:29:12.200174 Finished extraction of part_002.tar.br
INFO: 2021/05/07 21:29:12.218050 Finished extraction of part_004.tar.br
INFO: 2021/05/07 21:29:12.218177 Finished decompression of part_004.tar.br
INFO: 2021/05/07 21:29:12.238381 Finished extraction of pg_control.tar.br
INFO: 2021/05/07 21:29:12.238399

Ceph disk

INFO: 2021/05/07 21:29:30.196952 LATEST backup is: 'base_00000001000001C700000011_D_00000001000001C200000059'
INFO: 2021/05/07 21:29:30.359867 Delta from base_00000001000001C200000059_D_00000001000001BD00000092 at LSN 1c259000028
INFO: 2021/05/07 21:29:30.447245 Delta from base_00000001000001BD00000092_D_00000001000001B8000000B6 at LSN 1bd92000028
INFO: 2021/05/07 21:29:30.534594 Delta from base_00000001000001B8000000B6_D_00000001000001B3000000E9 at LSN 1b8b6000028
INFO: 2021/05/07 21:29:30.613034 Delta from base_00000001000001B3000000E9 at LSN 1b3e9000028
INFO: 2021/05/07 21:30:12.745096 Finished decompression of part_001.tar.br
ERROR: 2021/05/07 21:30:12.745133 part_001.tar.br DecryptAndDecompressTar: br decompress failed. Is archive encrypted?: unexpected EOF
INFO: 2021/05/07 21:30:12.746438 Finished extraction of part_001.tar.br
ERROR: 2021/05/07 21:30:12.746463 Extraction error in part_001.tar.br: extractOne: Interpret failed: Interpret: copy failed: unexpected EOF
INFO: 2021/05/07 21:30:28.290893 Finished decompression of part_002.tar.br
INFO: 2021/05/07 21:30:28.290924 Finished extraction of part_002.tar.br
INFO: 2021/05/07 21:30:28.341604 Finished extraction of part_004.tar.br
INFO: 2021/05/07 21:30:28.341634 Finished decompression of part_004.tar.br
INFO: 2021/05/07 21:32:25.764028 Finished decompression of part_001.tar.br
ERROR: 2021/05/07 21:32:25.764073 part_001.tar.br DecryptAndDecompressTar: br decompress failed. Is archive encrypted?: unexpected EOF
INFO: 2021/05/07 21:32:25.766421 Finished extraction of part_001.tar.br
ERROR: 2021/05/07 21:32:25.766474 Extraction error in part_001.tar.br: extractOne: Interpret failed: Interpret: copy failed: unexpected EOF
ERROR: 2021/05/07 21:32:25.766517 Failed to fetch backup: failed to extract files:
part_001.tar.br

I have a feeling it might be something about the stream that is piped all the way from the S3 client to the decompressor, how buffers behave on Ceph, and how the decompressed files are written to disk.

I wonder whether saving to a temporary file, instead of processing the stream right away, would change the situation.

@troyanov

troyanov commented May 9, 2021

A bit more information:

If I generate some write load on the disk with fio:

fio --filename=/mnt/volume_lon1_01/fio.test --direct=1 --rw=randwrite --ioengine=libaio --bs=4K --iodepth=128 --name=rand-w --runtime=120s --size=10000000000

then backup-fetch fails, after a rather long wait for part_001.tar.br:

wal-g backup-fetch bak LATEST --config /var/lib/postgresql/.walg.json
INFO: 2021/05/09 08:46:20.822339 LATEST backup is: 'base_00000001000001D00000003E'
INFO: 2021/05/09 08:47:50.093221 Finished decompression of part_001.tar.br
ERROR: 2021/05/09 08:47:50.093262 part_001.tar.br DecryptAndDecompressTar: br decompress failed. Is archive encrypted?: unexpected EOF
INFO: 2021/05/09 08:47:50.093979 Finished extraction of part_001.tar.br
ERROR: 2021/05/09 08:47:50.093998 Extraction error in part_001.tar.br: extractOne: Interpret failed: Interpret: copy failed: unexpected EOF

Without the write-heavy operation I can still restore to /mnt/volume_lon1_01, but even a couple of apps writing log files to that disk affect the process...

But if I do the same on another disk (not an attached block-storage device), everything works:

wal-g backup-fetch /tmp/bak LATEST --config /var/lib/postgresql/.walg.json
INFO: 2021/05/09 08:50:20.756232 LATEST backup is: 'base_00000001000001D00000003E'
INFO: 2021/05/09 08:50:37.397772 Finished extraction of part_001.tar.br
INFO: 2021/05/09 08:50:37.397971 Finished decompression of part_001.tar.br
INFO: 2021/05/09 08:50:43.933277 Finished decompression of part_002.tar.br
INFO: 2021/05/09 08:50:43.933305 Finished extraction of part_002.tar.br
INFO: 2021/05/09 08:50:43.964077 Finished decompression of part_004.tar.br
INFO: 2021/05/09 08:50:43.964106 Finished extraction of part_004.tar.br
INFO: 2021/05/09 08:50:43.997018 Finished decompression of pg_control.tar.br
INFO: 2021/05/09 08:50:43.997035 Finished extraction of pg_control.tar.br
INFO: 2021/05/09 08:50:43.997143
Backup extraction complete.
root@debug-lab:/mnt/volume_lon1_01# df
Filesystem     1K-blocks     Used Available Use% Mounted on
udev             4071096        0   4071096   0% /dev
tmpfs             816800      644    816156   1% /run
/dev/vda1       25226960 13098068  12112508  52% /
tmpfs            4083992        0   4083992   0% /dev/shm
tmpfs               5120        0      5120   0% /run/lock
tmpfs            4083992        0   4083992   0% /sys/fs/cgroup
/dev/sda       103900088 29595500  69045324  31% /mnt/volume_lon1_01
/dev/vda15        106858     6170    100688   6% /boot/efi
tmpfs             816796        0    816796   0% /run/user/0

But at the same time I can use https://github.com/s3tools/s3cmd and download the file without any issues.

@x4m, I'm wondering if this could make a difference?

> Decompression occurs on the fly right after receiving bytes from the network.

Maybe under load the buffer size changes and the number of bytes fetched isn't enough to decompress the contents? What if we downloaded the contents to a temp file first (https://golang.org/pkg/io/ioutil/#example_TempFile) and then tried to decompress/extract?

@troyanov

troyanov commented May 9, 2021

I did a quick test and simply dumped the readCloser stream into a file.

wal-g/internal/extract.go, lines 102 to 103 at commit 0261341:

func DecryptAndDecompressTar(writer io.Writer, readerMaker ReaderMaker, crypter crypto.Crypter) error {
	readCloser, err := readerMaker.Reader()

The files are all there and were created without any problems (no EOF in 10 tries).
