IO corruption on 2.86 #2609

Closed
nickb937 opened this issue Jan 23, 2022 · 14 comments

@nickb937

Describe the bug

Applying a 'vacuum' to a sqlite database file is enough to corrupt it. Comparing the pread64() and pwrite64() syscalls between the same command run on ext4 and seaweedfs shows differences, particularly in the last byte of a 1024-byte read.

System Setup

/usr/local/bin/weed master -mdir=/home/weedfs/master -volumePreallocate -volumeSizeLimitMB=30000 -ip 192.168.1.3 -peers 192.168.1.254:9333,192.168.1.253:9333,192.168.1.3:9333
/usr/local/bin/weed volume -dir=/seaweedfs -mserver=192.168.1.253:9333,192.168.1.254:9333,192.168.1.3:9333 -port=8089 -port.grpc=18089 -ip=192.168.1.10 -max=30000 -index=leveldb -dataCenter=shed
weed mount -filer=192.168.1.10:8888 -dir=/mnt

OS: archlinux
sqlite: v3.37.0

Filer store configuration (filer.toml):

[mysql2]  # or memsql, tidb
enabled = true
createTable = """
  CREATE TABLE IF NOT EXISTS `%s` (
    dirhash BIGINT,
    name VARCHAR(1000) BINARY,
    directory TEXT BINARY,
    meta LONGBLOB,
    PRIMARY KEY (dirhash, name)
  ) DEFAULT CHARSET=utf8;
"""
hostname = "192.168.1.3"
port = 3306
username = "weedfiler"
password = "weedfsfiler"
database = "weedfs"              # create or use an existing database
connection_max_idle = 2
connection_max_open = 100
connection_max_lifetime_seconds = 0
interpolateParams = false
# if insert/upsert failing, you can disable upsert or update query syntax to match your RDBMS syntax:
enableUpsert = true
upsertQuery = """INSERT INTO `%s` (dirhash,name,directory,meta) VALUES(?,?,?,?) ON DUPLICATE KEY UPDATE meta = VALUES(meta)"""

Expected behavior

To replicate:

  1. Download this sample SQLite database and unpack it:
    https://www.sqlitetutorial.net/wp-content/uploads/2018/03/chinook.zip
    unzip chinook.zip

On weedfs:

$ sqlite3 chinook.db 'vacuum'
$ sqlite3 chinook.db '.tables'
Error: database disk image is malformed

On ext4:

$ sqlite3 chinook.db 'vacuum'
$ sqlite3 chinook.db '.tables'
albums          employees       invoices        playlists     
artists         genres          media_types     tracks        
customers       invoice_items   playlist_track

If you unpack chinook.db from the zip file on both a non-weedfs and a weedfs filesystem, run the vacuum under strace on each, and compare the syscall output, you'll see the byte variations:

strace -e pread64,pwrite64,open,fsync,fdatasync -v -xx -s 8192 -- sqlite3 chinook.db 'vacuum' 2>/root/strace-ext4.txt 
strace -e pread64,pwrite64,open,fsync,fdatasync -v -xx -s 8192 -- sqlite3 chinook.db 'vacuum' 2>/root/strace-weedfs.txt
diff -u /root/strace-ext4.txt  /root/strace-weedfs.txt

If you perform the operation twice, i.e. unzip two copies and vacuum each, the corruption is shown to be deterministic: cmp reports no byte differences between them:

unzip chinook.zip
cp chinook.db chinook2.db
sqlite3 chinook.db 'vacuum'
sqlite3 chinook2.db 'vacuum'
cmp -l chinook.db chinook2.db   

I would expect that comparing chinook.db after a vacuum on ext4 against one after a vacuum on weedfs would show no byte differences either.

@chrislusf
Collaborator

Thanks for the detailed report with clean reproducible steps!

I verified it works well with the latest dev version: https://github.com/chrislusf/seaweedfs/releases/tag/dev

$ wget https://www.sqlitetutorial.net/wp-content/uploads/2018/03/chinook.zip
--2022-01-23 13:03:04--  https://www.sqlitetutorial.net/wp-content/uploads/2018/03/chinook.zip
Resolving www.sqlitetutorial.net (www.sqlitetutorial.net)... 151.139.128.11
Connecting to www.sqlitetutorial.net (www.sqlitetutorial.net)|151.139.128.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 305596 (298K) [application/zip]
Saving to: ‘chinook.zip’

chinook.zip                   100%[===============================================>] 298.43K   357KB/s    in 0.8s

2022-01-23 13:03:06 (357 KB/s) - ‘chinook.zip’ saved [305596/305596]

fio$ ls
chinook.zip	sample.db
fio$ unzip chinook.zip
Archive:  chinook.zip
  inflating: chinook.db
fio$ ls
chinook.db	chinook.zip	sample.db
fio$ ls  -al
total 2349
drwx------  1 chrislu  staff       0 Jan 22 05:21 .
drwxr-xr-x  1 chrislu  staff       0 Jan 22 20:34 ..
-rw-r--r--  1 chrislu  staff  884736 Nov 29  2015 chinook.db
-rw-r--r--  1 chrislu  staff  305596 Mar 11  2018 chinook.zip
-rw-r--r--  1 chrislu  staff   12288 Jan 23 13:02 sample.db
fio$ sqlite3 chinook.db 'vacuum'
fio$ sqlite3 chinook.db '.tables'
albums          employees       invoices        playlists
artists         genres          media_types     tracks
customers       invoice_items   playlist_track
fio$ ls -al
total 2273
drwx------  1 chrislu  staff       0 Jan 22 05:21 .
drwxr-xr-x  1 chrislu  staff       0 Jan 22 20:34 ..
-rw-r--r--  1 chrislu  staff  845824 Jan 23 13:03 chinook.db
-rw-r--r--  1 chrislu  staff  305596 Mar 11  2018 chinook.zip
-rw-r--r--  1 chrislu  staff   12288 Jan 23 13:02 sample.db

@nickb937
Author

I've previously experienced a similar issue, as reported in #3384, but in my experience that happens because Plex will periodically vacuum its SQLite DB.

I tried this sequence with v3.18, to keep it simple with the default configuration:

$ wget https://github.com/seaweedfs/seaweedfs/releases/download/3.18/linux_amd64_full.tar.gz
$ tar -xzf linux_amd64_full.tar.gz
$ ./weed server -dir ./weed-data
$ ./weed filer
$ sudo ./weed mount -dir=/mnt

$ cd /mnt
$ wget https://www.sqlitetutorial.net/wp-content/uploads/2018/03/chinook.zip
$ unzip chinook.zip 
$ sqlite3 chinook.db 'vacuum'
$ sqlite3 chinook.db '.tables'
Error: database disk image is malformed

I can't see how your test would have worked. Did you definitely run it on a SeaweedFS mount?

@chrislusf
Collaborator

@nickb937 it works well with the dev version. Please help confirm.

@nickb937
Author

Alright, I just rebuilt directly from master rather than from the downloaded tar files:

$ git clone https://github.com/seaweedfs/seaweedfs.git
$ cd seaweedfs ; make
$ ~/go/bin/weed server -dir /home/nickb/weed-data 
$ ~/go/bin/weed filer
# /home/nickb/go/bin/weed mount -dir=/mnt

nickb@pc:/mnt$ unzip chinook.zip 
Archive:  chinook.zip
  inflating: chinook.db              
nickb@pc:/mnt$ sqlite3 chinook.db 'vacuum'
nickb@pc:/mnt$ sqlite3 chinook.db '.tables'
Error: database disk image is malformed
nickb@pc:/mnt$ 

@chrislusf
Collaborator

chrislusf commented Aug 1, 2022

I noticed the problem is only on Linux. My mac is fine.

The sqlite3 version on my linux is:

3.31.1 2020-01-27 19:55:54 3bfa9cc97da10598521b342961df8f5f68c7388fa117345eeb516eaa837balt1

On mac is:

3.37.0 2021-12-09 01:34:53 9ff244ce0739f8ee52a3e9671adb4ee54c83c640b02e3f9d185fd2f9a179aapl

Not sure whether the sqlite3 version is the problem.

@nickb937
Author

nickb937 commented Aug 1, 2022

I have compared the contents of the pread64/pwrite64 calls between sqlite running on seaweedfs and on a different filesystem. From the application's perspective they're exactly the same.

I am now trying to produce a small Python script that reproduces the problem by copying the IO sequence that sqlite performs, while watching the IO patterns inside seaweedfs.

The only thing I've noticed is that append mode on seaweedfs behaves differently from a local Linux filesystem: seek operations take no effect on seaweedfs in append mode, whereas they do on ext4. However, sqlite doesn't use this mode, and the Python docs note that some filesystems behave this way.
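
A minimal sketch of that comparison (assuming /mnt is the SeaweedFS mount and /tmp is a local filesystem; the paths are illustrative):

```python
import os

def append_mode_check(path):
    # Create a file with known contents, then reopen it in append mode.
    with open(path, "wb") as f:
        f.write(b"0123456789")
    with open(path, "ab+") as f:
        f.seek(2)
        head = f.read(4)   # POSIX: reads honour the seek position...
        f.write(b"XX")     # ...but writes always land at end-of-file
    with open(path, "rb") as f:
        return head, f.read()

print(append_mode_check("/tmp/append-test"))  # local filesystem
print(append_mode_check("/mnt/append-test"))  # SeaweedFS mount
```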

@danlsgiga
Copy link
Contributor

FWIW, `Error: database disk image is malformed` is exactly the error I see in the Plex logs, as part of #3384.

@nickb937
Author

nickb937 commented Aug 1, 2022

The SQLite VACUUM does the following sequence (sketched in code after the list):

  • read chinook.db into memory on FD3
  • write a journal file to disk on FD4
  • re-write a compacted representation of the database on FD3
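
In rough Python terms the pattern is as follows (a sketch of the IO pattern only; the descriptors, sizes and the "compacted" content are illustrative, not taken from the trace):

```python
import os

db = os.open("chinook.db", os.O_RDWR)                                   # "FD3"
journal = os.open("chinook.db-journal", os.O_RDWR | os.O_CREAT, 0o644)  # "FD4"

original = os.pread(db, os.fstat(db).st_size, 0)   # 1. read the database into memory
os.pwrite(journal, original, 0)                    # 2. persist a rollback journal
os.fdatasync(journal)

compacted = original[:len(original) // 2]          # stand-in for the vacuumed image
for off in range(0, len(compacted), 1024):         # 3. rewrite compacted pages on FD3
    os.pwrite(db, compacted[off:off + 1024], off)
os.fdatasync(db)
os.ftruncate(db, len(compacted))                   # shrink the file to its new size

os.close(journal)
os.close(db)
```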

Starting from fresh:

unzip chinook.zip
strace -f  sqlite3 chinook.db 'vacuum' >v2.strace 2>&1

We can see this sequence marking the beginning of the third stage, re-writing to FD3 (around line 4081):

fdatasync(4)                            = 0
openat(AT_FDCWD, "/home/nickb", O_RDONLY|O_CLOEXEC) = 5
fdatasync(5)                            = 0
close(5)                                = 0
pwrite64(4, "\331\325\5\371 \241c\327\0\0\3:", 12, 0) = 12
fdatasync(4)                            = 0
pwrite64(3, "SQLite format 3\0\4\0\1\1\0@  \0\0\0\32\0\0\3:"..., 1024, 0) = 1024
pwrite64(3, "\5\0\0\0\v\3\301\0\0\0\0-\3\373\3\366\3\361\3\353\3\345\3\337\3\331\3\323\3\315\3\307"..., 1024, 1024) = 1024
pwrite64(3, "\r\0\0\0\n\3k\0\3\364\3\343\3\325\3\310\3\273\3\254\3\235\3\216\3z\3k\0\0\0\0"..., 1024, 2048) = 1024
pwrite64(3, "\5\0\0\0\7\3\330\0\0\0\09\3\373\3\366\3\360\3\352\3\344\3\336\3\330\0\0\0\0\0\0"..., 1024, 3072) = 1024
pwrite64(3, "\5\0\0\0\6\3\342\0\0\0\0@\3\373\3\366\3\361\3\354\3\347\3\342\0\0\0\0\0\0\0\0"..., 1024, 4096) = 1024

What I've noticed is that none of these pwrite64() calls are modifying the file, despite it being in mode O_RDWR:

unzip chinook.zip
cp chinook.db chinook.db.orig
sqlite3 chinook.db 'vacuum'

cmp chinook.db chinook.db.orig 
cmp: EOF on chinook.db after byte 845824, in line 8651

Whereas if you perform the same calls on some other filesystem:

$ cmp chinook.db chinook.db.orig 
chinook.db chinook.db.orig differ: byte 28, line 1

Here's a Python script that replicates the problem. Critical to reproducing the issue are the os.fdatasync() calls, the os.ftruncate() call, and the fact that the file actually needs truncating. If you comment out either of the fdatasync() calls, or the ftruncate(), the test will not fail.

import os
import random

FILESIZE = 4096
FILE = "/mnt/file"

if os.path.exists(FILE):
    os.unlink(FILE)

random.seed(2)
# buff1 must be larger than buff2 so that ftruncate() has to do work
buff1 = bytearray(random.randbytes(FILESIZE + 1024))
buff2 = bytearray(random.randbytes(FILESIZE))
assert buff1 != buff2

fd = os.open(FILE, os.O_RDWR | os.O_CREAT | os.O_NOFOLLOW | os.O_CLOEXEC)

offset = 0
while offset < len(buff1):
    os.pwrite(fd, buff1[offset:offset+1024], offset)
    offset += 1024
os.fdatasync(fd)

offset = 0
while offset < len(buff2):
    os.pwrite(fd, buff2[offset:offset+1024], offset)
    offset += 1024
os.fdatasync(fd)
os.ftruncate(fd, FILESIZE)
os.close(fd)

fd = os.open(FILE, os.O_RDONLY)
data = os.read(fd, len(buff2))
os.close(fd)
assert data == buff2

@nickb937
Author

nickb937 commented Aug 4, 2022

First write:

I0804 18:58:01.159751 dirty_pages_chunked.go:44 1 memory AddPage [4096, 5120)
I0804 18:58:01.159776 weedfs_file_sync.go:102 doFlush /file fh 0
I0804 18:58:01.159784 upload_pipeline.go:149  uploaderCount 0 ++> 1
I0804 18:58:01.165124 filehandle.go:95 /file existing 0 chunks adds 1 more
I0804 18:58:01.165140 dirty_pages_chunked.go:89 /file saveToStorage 1,05ea52ca24 [0,5120)
I0804 18:58:01.165148 upload_pipeline.go:171  uploaderCount 1 --> 0
I0804 18:58:01.165160 upload_pipeline.go:39 Free sealed chunk:  finished uploading chunk 0
I0804 18:58:01.165168 slot_pool.go:60 --> 0
I0804 18:58:01.165181 weedfs_file_sync.go:151 /file set chunks: 1
I0804 18:58:01.165190 weedfs_file_sync.go:153 /file chunks 0: 1,05ea52ca24 [0,5120)

The second write creates a new chunk, 3,060d51e12c, but the file should presumably consist of only one chunk:

I0804 18:58:01.165938 weedfs_file_sync.go:102 doFlush /file fh 0
I0804 18:58:01.165945 upload_pipeline.go:149  uploaderCount 0 ++> 1
I0804 18:58:01.166903 filehandle.go:95 /file existing 1 chunks adds 1 more
I0804 18:58:01.166913 dirty_pages_chunked.go:89 /file saveToStorage 3,060d51e12c [0,4096)
I0804 18:58:01.166919 upload_pipeline.go:171  uploaderCount 1 --> 0
I0804 18:58:01.166925 upload_pipeline.go:39 Free sealed chunk:  finished uploading chunk 0
I0804 18:58:01.166934 slot_pool.go:60 --> 0
I0804 18:58:01.166937 weedfs_file_sync.go:151 /file set chunks: 2
I0804 18:58:01.166949 weedfs_file_sync.go:153 /file chunks 0: 1,05ea52ca24 [0,5120)
I0804 18:58:01.166957 weedfs_file_sync.go:153 /file chunks 1: 3,060d51e12c [0,4096)

os.ftruncate() truncates chunk 1,05ea52ca24 (the original data) rather than chunk 3,060d51e12c (the new data).

I0804 18:58:01.167376 weedfs_attr.go:53 /file setattr set size=4096 chunks=2
I0804 18:58:01.167395 weedfs_attr.go:65 truncated chunk 1,05ea52ca24 from 5120 to 4096

Looks like AddChunks in filehandle.go isn't accounting for the write-offset of the incoming chunk:

I0804 18:58:01.165938 weedfs_file_sync.go:102 doFlush /file fh 0
I0804 18:58:01.165945 upload_pipeline.go:149  uploaderCount 0 ++> 1
I0804 18:58:01.166903 filehandle.go:95 /file existing 1 chunks adds 1 more
I0804 18:58:01.166913 dirty_pages_chunked.go:89 /file saveToStorage 3,060d51e12c [0,4096)
I0804 18:58:01.166919 upload_pipeline.go:171  uploaderCount 1 --> 0
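
To illustrate the effect (a toy model of the chunk list, not SeaweedFS code): a mounted file resolves to a list of chunk writes where later chunks overlay earlier ones, so a truncate has to clamp every chunk that extends past the new size, not just the first one:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    file_id: str
    offset: int  # offset of this chunk within the file
    size: int    # number of bytes the chunk covers

def truncate(chunks, new_size):
    # Clamp every chunk reaching past new_size; drop chunks wholly beyond it.
    kept = []
    for c in chunks:
        if c.offset >= new_size:
            continue
        end = min(c.offset + c.size, new_size)
        kept.append(Chunk(c.file_id, c.offset, end - c.offset))
    return kept

# Mirrors the log above: the first flush covered [0,5120), the second [0,4096).
chunks = [Chunk("1,05ea52ca24", 0, 5120), Chunk("3,060d51e12c", 0, 4096)]
print(truncate(chunks, 4096))
# Per the logs above, the truncate was instead applied only to the original
# chunk 1,05ea52ca24, without accounting for the newer chunk's write offset.
```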

nickb937 added a commit to nickb937/seaweedfs that referenced this issue Aug 6, 2022
Fixes: seaweedfs#3384
Fixes: seaweedfs#2609

Before:

Chunk `3,577c823253` is dropped when truncate sets file length to 4096:

```
I0806 19:42:26.236780 weedfs_file_sync.go:151 /file set chunks: 2
I0806 19:42:26.236790 weedfs_file_sync.go:153 /file chunks 0: 3,577c823253 [0,5120)
I0806 19:42:26.236798 weedfs_file_sync.go:153 /file chunks 1: 6,58be291f0e [0,4096)
I0806 19:42:26.237150 weedfs_attr.go:53 /file setattr set size=4096 filersize=5120 chunks=2
I0806 19:42:26.237171 weedfs_attr.go:66 truncated chunk 3,577c823253 from 5120 to 4096
I0806 19:42:26.237411 weedfs_file_sync.go:102 doFlush /file fh 0
I0806 19:42:26.237430 weedfs_file_sync.go:151 /file set chunks: 1
I0806 19:42:26.237439 weedfs_file_sync.go:153 /file chunks 0: 3,577c823253 [0,4096)
```

After, the secondary chunk is retained:

```
I0806 19:40:51.755588 weedfs_file_sync.go:151 /file set chunks: 2
I0806 19:40:51.755598 weedfs_file_sync.go:153 /file chunks 0: 6,54a83ba23d [0,5120)
I0806 19:40:51.755605 weedfs_file_sync.go:153 /file chunks 1: 7,556020bbeb [0,4096)
I0806 19:40:51.756049 weedfs_attr.go:53 /file setattr set size=4096 chunks=2
I0806 19:40:51.756066 weedfs_attr.go:65 truncated chunk 6,54a83ba23d from 5120 to 4096
I0806 19:40:51.756077 weedfs_attr.go:65 truncated chunk 7,556020bbeb from 4096 to 4096
I0806 19:40:51.756109 weedfs_file_sync.go:102 doFlush /file fh 0
I0806 19:40:51.756127 weedfs_file_sync.go:151 /file set chunks: 2
I0806 19:40:51.756137 weedfs_file_sync.go:153 /file chunks 0: 6,54a83ba23d [0,4096)
I0806 19:40:51.756145 weedfs_file_sync.go:153 /file chunks 1: 7,556020bbeb [0,4096)
```
chrislusf added a commit that referenced this issue Aug 6, 2022
@sstent

sstent commented Aug 11, 2022

Test passed for me with the latest build from master:

version 30GB 3.20 986daec linux arm

@shapechecker

Still seems to break databases running with PRAGMA journal_mode=WAL (e.g. Sonarr):

[v3.0.8.1507] code = Corrupt (11), message = System.Data.SQLite.SQLiteException (0x800007EF): database disk image is malformed
database disk image is malformed
  at System.Data.SQLite.SQLite3.Reset (System.Data.SQLite.SQLiteStatement stmt) [0x00088] in <cf516e4846354910b3d60749c894b1bf>:0 
  at System.Data.SQLite.SQLite3.Step (System.Data.SQLite.SQLiteStatement stmt) [0x0006e] in <cf516e4846354910b3d60749c894b1bf>:0 
  at System.Data.SQLite.SQLiteDataReader.NextResult () [0x00174] in <cf516e4846354910b3d60749c894b1bf>:0 
  at System.Data.SQLite.SQLiteDataReader..ctor (System.Data.SQLite.SQLiteCommand cmd, System.Data.CommandBehavior behave) [0x0008e] in <cf516e4846354910b3d60749c894b1bf>:0 
  at (wrapper remoting-invoke-with-check) System.Data.SQLite.SQLiteDataReader..ctor(System.Data.SQLite.SQLiteCommand,System.Data.CommandBehavior)
  at System.Data.SQLite.SQLiteCommand.ExecuteReader (System.Data.CommandBehavior behavior) [0x0000c] in <cf516e4846354910b3d60749c894b1bf>:0 
  at System.Data.SQLite.SQLiteCommand.ExecuteNonQuery (System.Data.CommandBehavior behavior) [0x00006] in <cf516e4846354910b3d60749c894b1bf>:0 
  at System.Data.SQLite.SQLiteCommand.ExecuteNonQuery () [0x00006] in <cf516e4846354910b3d60749c894b1bf>:0 
  at Marr.Data.DataMapper.ExecuteNonQuery (System.String sql) [0x0002a] in M:\BuildAgent\work\63739567f01dbcc2\src\Marr.Data\DataMapper.cs:204 
  at NzbDrone.Core.Messaging.Commands.CommandRepository.OrphanStarted () [0x00064] in M:\BuildAgent\work\63739567f01dbcc2\src\NzbDrone.Core\Messaging\Commands\CommandRepository.cs:46 
  at NzbDrone.Core.Messaging.Commands.CommandQueueManager.Handle (NzbDrone.Core.Lifecycle.ApplicationStartedEvent message) [0x00010] in M:\BuildAgent\work\63739567f01dbcc2\src\NzbDrone.Core\Messaging\Commands\CommandQueueManager.cs:260 
  at NzbDrone.Core.Messaging.Events.EventAggregator.PublishEvent[TEvent] (TEvent event) [0x0011a] in M:\BuildAgent\work\63739567f01dbcc2\src\NzbDrone.Core\Messaging\Events\EventAggregator.cs:92 

@danlsgiga
Contributor

I just had a chance to test it with Plex and it's the same `database disk image is malformed` error. :(

@nickb937
Author

nickb937 commented Sep 8, 2022

With the 3.25 release, Sonarr has been running for a few days without issues for me.

@danlsgiga
Contributor

Wooot! Awesome, just deployed 3.26 here and Plex is running stable! Thanks tons for fixing this!!!! 👏🏻
