FileStorage: fix rare data corruption when using restore after multiple undos #395

perrinjerome · 2024-01-31T14:43:22Z

The case of a history like this:

T0 initialize an object in state 0
T1 modifies object to state 1
T2 modifies object to state 2
T3 undo T2 and T1, bringing back to state 0
T4 modified object to state 3
T5 undo T4, bringing back object to state 0

was not correct after restore: the state is 1 instead of the expected 0.

This is because T3 contains two data records:

an undo of T2, with a backpointer to the data of state 1
an undo of T1, with a backpointer to the data of state 0

When restoring T5 (the undo of T4), the transaction is made of one data record, with a backpointer that is copied from the backpointer from T3, but this uses backpointer from the first record for this object, which is incorrect, in such a case where there is more than one backpointer for the same oid, we need to iterate in all data record to find the oldest version.

The case of a history like this: - T0 initialize an object in state 0 - T1 modifies object to state 1 - T2 modifies object to state 2 - T3 undo T2 and T1, bringing back to state 0 - T4 modified object to state 3 - T5 undo T4, bringing back object to state 0 was not correct after `restore`: the state is 1 instead of the expected 0. This is because T3 contains two data records: - an undo of T2, with a backpointer to the data of state 1 - an undo of T1, with a backpointer to the data of state 0 When restoring T5 (the undo of T4), the transaction is made of one data record, with a backpointer that is copied from the backpointer from T3, but this uses backpointer from the first record for this object, which is incorrect, in such a case where there is more than one backpointer for the same oid, we need to iterate in all data record to find the oldest version.

perrinjerome · 2024-01-31T14:47:20Z

For reference, the file format is described in

ZODB/src/ZODB/FileStorage/format.py

Lines 15 to 84 in 6f88781

    
           #  File-based ZODB storage 
        
           # 
        
           # Files are arranged as follows. 
        
           # 
        
           #   - The first 4 bytes are a file identifier. 
        
           # 
        
           #   - The rest of the file consists of a sequence of transaction 
        
           #     "records". 
        
           # 
        
           # A transaction record consists of: 
        
           # 
        
           #   - 8-byte transaction id, which is also a time stamp. 
        
           # 
        
           #   - 8-byte transaction record length - 8. 
        
           # 
        
           #   - 1-byte status code 
        
           #     ' '  (a blank) completed transaction that hasn't been packed 
        
           #     'p'  completed transaction that has been packed 
        
           #     'c'  checkpoint -- a transaction in progress, at the end of the file; 
        
           #          it's been thru vote() but not finish(); if finish() completes 
        
           #          normally, it will be overwritten with a blank; if finish() dies 
        
           #          (e.g., out of disk space), cleanup code will try to truncate 
        
           #          the file to chop off this incomplete transaction 
        
           #     'u'  uncertain; no longer used; was previously used to record something 
        
           #          about non-transactional undo 
        
           # 
        
           #   - 2-byte length of user name 
        
           # 
        
           #   - 2-byte length of description 
        
           # 
        
           #   - 2-byte length of extension attributes 
        
           # 
        
           #   -   user name 
        
           # 
        
           #   -   description 
        
           # 
        
           #   -   extension attributes 
        
           # 
        
           #   * A sequence of data records 
        
           # 
        
           #   - 8-byte redundant transaction length -8 
        
           # 
        
           # A data record consists of 
        
           # 
        
           #   - 8-byte oid. 
        
           # 
        
           #   - 8-byte tid, which matches the transaction id in the transaction record. 
        
           # 
        
           #   - 8-byte previous-record file-position. 
        
           # 
        
           #   - 8-byte beginning of transaction record file position. 
        
           # 
        
           #   - 2-bytes with zero values. (Was version length.) 
        
           # 
        
           #   - 8-byte data length 
        
           # 
        
           #   ? data 
        
           #     (data length > 0) 
        
           # 
        
           #   ? 8-byte position of data record containing data 
        
           #     (data length == 0) 
        
           # 
        
           # Note that the lengths and positions are all big-endian. 
        
           # Also, the object ids time stamps are big-endian, so comparisons 
        
           # are meaningful. 
        
           # 
        
           # Backpointers 
        
           # 
        
           #   When we undo a record, we don't copy (or delete) 
        
           #   data.  Instead, we write records with back pointers.

src/ZODB/tests/testFileStorage.py

src/ZODB/FileStorage/FileStorage.py

d-maurer · 2024-02-01T10:17:48Z

src/ZODB/FileStorage/FileStorage.py

+                    # looking at other records from this transaction,
+                    # if there is multiple undo records, we use the
+                    # oldest.
+                    backpointer = pos


Why are we sure that the "backpointer" found at the highest position is the right one? I.e. why have older undo records a higher position?

With your change, the code no longer matches the introductory source comment. The source comment should get updated.

Hello everyone. I had the same concern and actually this issue is not only about undo, but also how FileStorage handles and/or should handle a situation when one transaction contains two data records with the same oid. What loadBefore should return for such case - data from the first data record or from the second? Current behaviour is to return the data from the second data record - the one that was last to be .store()' ed. Proof:

create database with multiple data entries in the same transaction for oid=1

---- 8< ---- (1.txn)

user "author" description "T1" extension "" obj 0000000000000001 3 null:00 AAA obj 0000000000000001 3 null:00 BBB

$ python -m zodbtools.zodb commit 1.fs 0000000000000000 <1.txn 03f63ee48fa55c00

$ fs1 dump 1.fs Trans #00000 tid=03f63ee48fa55c00 time=2024-02-01 10:44:33.667016 offset=35 status=' ' user='author' description='T1' data #00000 oid=0000000000000001 size=3 class=?.? data #00001 oid=0000000000000001 size=3 class=?.?

try to read that oid=1

---- 8< ---- (catobj.py)

from __future__ import print_function from ZODB.FileStorage import FileStorage from ZODB.utils import p64, u64 stor = FileStorage('1.fs', read_only=True) head = stor.lastTransaction() head_ = p64(u64(head)+1) oid = p64(1) x = stor.loadBefore(oid, head_) print(x)

$ python catobj.py ('BBB', '\x03\xf6>\xe4\x8f\xa5\\\x00', None)

I'm not sure 100% but probably this is kind of expected behaviour.

So taking this into account for consistency it should be also the latest undo record that is taking the effect.

And for generality we should also consider cases like:

data record 1: undo to T2

data record 2: undo to T1

data record 3: store data3

because here it should be data3 to be returned by loadBefore, not data from T1.

(undo record can be, and probably should be, perceived as regular store with instruction to copy data from particular previous data record)

With the above point of view I suggest to amend the patch with the following:

--- a/src/ZODB/FileStorage/FileStorage.py +++ b/src/ZODB/FileStorage/FileStorage.py @@ -663,6 +663,10 @@ def _data_find(self, tpos, oid, data): # Else oid's data record contains the data, and the file offset of # oid's data record is returned. This data record should contain # a pickle identical to the 'data' argument. + # When looking for oid's data record we scan all data records in + # the transaction till the end looking for the latest record with oid, + # even if there are multiple such records. This matches load behaviour + # which uses the data record created by last store. # Unclear: If the length of the stored data doesn't match len(data), # an exception is raised. If the lengths match but the data isn't @@ -674,31 +678,36 @@ def _data_find(self, tpos, oid, data): self._file.read(ul + dl + el) tend = tpos + tl pos = self._file.tell() - backpointer = None + data_hdr = None + data_pos = 0 + # scan all data records in this transaction looking for the latest + # record with our oid while pos < tend: h = self._read_data_header(pos) if h.oid == oid: - # Make sure this looks like the right data record - if h.plen == 0: - # This is also a backpointer, remember it and keep - # looking at other records from this transaction, - # if there is multiple undo records, we use the - # oldest. - backpointer = pos - else: - if h.plen != len(data): - # The expected data doesn't match what's in the - # backpointer. Something is wrong. - logger.error("Mismatch between data and" - " backpointer at %d", pos) - return 0 - _data = self._file.read(h.plen) - if data != _data: - return 0 - return pos + data_hdr = h + data_pos = pos pos += h.recordlen() self._file.seek(pos) - return backpointer or 0 + if data_hdr is None: + return 0 + + # return position of found data record, but make sure this looks like + # the right data record to return. + if data_hdr.plen == 0: + # This is also a backpointer, Gotta trust it. + return data_pos + else: + if data_hdr.plen != len(data): + # The expected data doesn't match what's in the + # backpointer. Something is wrong. + logger.error("Mismatch between data and" + " backpointer at %d", pos) + return 0 + _data = self._file.read(data_hdr.plen) + if data != _data: + return 0 + return data_pos def restore(self, oid, serial, data, version, prev_txn, transaction): # A lot like store() but without all the consistency checks. This

Thank you, I like this patch better, it's more logical. I have integrated this in the pull request.

When I wrote that patch, the reason to take the latest data record was to match the behavior of _txn_undo_write, it iterates while pos < tend: and updates the index.

Thanks, Jérome.

And I see about your original observations.

d-maurer · 2024-02-01T10:25:52Z

I very much prefer PRs from the main repository (rather than from forks): for difficult code, it is important to see the modification in a larger context and not only the diff. To support this, I have cloned the main repository and do not like to add additional clones (or "remote"s) for reviews.

navytux · 2024-02-01T12:05:34Z

Hello Dieter. I understand your rationale, and on my side try to follow your advice after you expressed the same reasoning to me. But just for the reference, if you add the following into ZODB/.git/config

[remote "origin"]
        url = https://github.com/zopefoundation/ZODB.git
        fetch = +refs/heads/*:refs/remotes/origin/*
        fetch = +refs/pull/*/head:refs/remotes/origin/pr/*          <-- this line

you will be able to check out code of all pull requests without adding any remote or something. For example with

hereby PR it would be git fetch origin; git checkout origin/pr/395.

Co-authored-by: Kirill Smelkov <kirr@nexedi.com>

d-maurer · 2024-02-01T13:53:24Z

Jérome Perrin wrote at 2024-2-1 05:40 -0800:

... I can make another pull request to change all these to use skip, unless someone here thinks it's bad ?

`unittest` used to have problems when `dubug` was called for a skipped method. With `zope.testrunner` `debug` is called with the "-D" option. When `zope.testrunner`'s `-D` option is used, it is expected that it runs until the first failure and there enters the debugger (allowing the tester to analyse the failure context). Due to the above problem, the first encountered `skip` was considered a failure and entered the debugger. I filed a corresponding Python bug report but I am not sure whether a fix is in all supported Python versions. Should the problem still exist, I would not massively use `skip`.

perrinjerome · 2024-02-01T14:51:20Z

Ah yes, nowadays zope.testrunner's -D no longer enter debugger on skipped tests ( was fixed in zopefoundation/zope.testrunner#141 ) , but there is still something that teardown is not always executed ( zopefoundation/zope.testrunner#142 (comment) ), so it seems still safer to just keep using the "pass" pattern.

…txn_prev if restore is called with an existing txn_prev that does not contain the oid, the "8-byte redundant transaction length -8" and the beginning of the next transaction was read as a data record. The condition to stop was incorrect as discussed in [1], in practice this error would only happens when restore is passed a bad txn_prev so this should not be a problem in normal conditions because we are supposed to find the data record for that oid. [1] zopefoundation#395 (comment)

d-maurer · 2024-02-14T18:13:05Z

Jérome Perrin wrote at 2024-2-14 07:23 -0800:

... @perrinjerome commented on this pull request. > @@ -672,28 +672,33 @@ def _data_find(self, tpos, oid, data): tid, tl, status, ul, dl, el = unpack(TRANS_HDR, h) status = as_text(status) self._file.read(ul + dl + el) - tend = tpos + tl + 8 + tend = tpos + tl I tried to write a test calling restore with a txn that did not contain the oid, it triggers the problem, but it does not really represent a normal scenario so I'm not 100% sure what to do.

I think, we do not need a test to fix the `+ 8` bug. I know that I asked about the "why remove `+ 8`" but our discussion convinced me that it is the right thing: According to "format.py", `tl` contains "transaction record length - 8" and a transaction record ends with a sequence of "data record"s followed by "8-byte redundant transaction length -8". This means that the last data record ends at `tpos + tl`, not `tpos + tl + 8`.

navytux · 2024-02-16T16:25:43Z

@perrinjerome, thanks for adding the test and robustifying handling of records reader.

@d-maurer, thanks also for confirming you are ok with the fix and we all now share common understanding.

I think that tests are added not because the code is wrong, but to preserve it to stay working correctly. So in general whenever we fix something, we try to add a test for the fixed problem so that we know for sure it won't return after follow-up changes.

Sometimes it might be hard to do a test and the difficulties outweight usefulness of having tests. But in this case we already have tests done as part of #397 so unless there is a strong reason not to take them, let's just add both the fix and the test to make sure that quality is improved and stays ok.

Anyway, besides the tests, am I right that we all three agree that this patch is ok now? Or did I missed something and there is still something else to be improved?

Kirill

d-maurer · 2024-02-16T17:08:10Z

Kirill Smelkov wrote at 2024-2-16 08:25 -0800:

... @d-maurer, thanks also for confirming you are ok with the fix and we all now share common understanding. I think that tests are added not because the code is wrong, but to preserve it to stay working correctly. So in general whenever we fix something, we try to add a test for the fixed problem so that we know for sure it won't return after follow-up changes.

In this particular case, it is very unlikely that someone will readd the `+ 8` again and change the loop logic otherwise. But, I am not against a test if someone has already spent time on it.

navytux · 2024-02-16T17:29:22Z

Dieter, thanks for feedback. What you say is reasonable.

perrinjerome · 2024-02-27T06:37:28Z

I feel that we agreed on merging this after merging #397 , I "requested review" from github interface, if you agree please approve the review and I'll apply the changes.

perrinjerome force-pushed the fix/restore_undo_to_multiple_undo branch from e68c5f3 to 44d36b0 Compare January 31, 2024 14:44

perrinjerome changed the title ~~FileStorage: fix data corruption when using restore after multiple undos~~ FileStorage: fix rare data corruption when using restore after multiple undos Jan 31, 2024

perrinjerome marked this pull request as ready for review January 31, 2024 14:47

icemac reviewed Feb 1, 2024

View reviewed changes

src/ZODB/tests/testFileStorage.py Show resolved Hide resolved

icemac requested review from d-maurer and navytux February 1, 2024 07:46

icemac added the enhancement label Feb 1, 2024

d-maurer reviewed Feb 1, 2024

View reviewed changes

alternate implementation from review

37db547

Co-authored-by: Kirill Smelkov <kirr@nexedi.com>

Merge branch 'master' into fix/restore_undo_to_multiple_undo

ae6bdd0

perrinjerome mentioned this pull request Feb 14, 2024

Fix behavior of FileStorage.restore when provided wrong prev_txn #397

Draft

perrinjerome requested a review from d-maurer February 27, 2024 06:36

d-maurer approved these changes Feb 27, 2024

View reviewed changes

navytux approved these changes Mar 4, 2024

View reviewed changes

perrinjerome merged commit 8160568 into zopefoundation:master Mar 4, 2024
20 checks passed

perrinjerome deleted the fix/restore_undo_to_multiple_undo branch March 4, 2024 12:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FileStorage: fix rare data corruption when using restore after multiple undos #395

FileStorage: fix rare data corruption when using restore after multiple undos #395

perrinjerome commented Jan 31, 2024 •

edited

Loading

perrinjerome commented Jan 31, 2024

d-maurer Feb 1, 2024

navytux Feb 1, 2024

navytux Feb 1, 2024

navytux Feb 1, 2024

perrinjerome Feb 1, 2024

navytux Feb 14, 2024

navytux Feb 14, 2024

d-maurer commented Feb 1, 2024

navytux commented Feb 1, 2024

d-maurer commented Feb 1, 2024 via email

perrinjerome commented Feb 1, 2024

d-maurer commented Feb 14, 2024 via email

navytux commented Feb 16, 2024

d-maurer commented Feb 16, 2024 via email

navytux commented Feb 16, 2024

perrinjerome commented Feb 27, 2024

FileStorage: fix rare data corruption when using restore after multiple undos #395

FileStorage: fix rare data corruption when using restore after multiple undos #395

Conversation

perrinjerome commented Jan 31, 2024 • edited Loading

perrinjerome commented Jan 31, 2024

d-maurer Feb 1, 2024

Choose a reason for hiding this comment

navytux Feb 1, 2024

Choose a reason for hiding this comment

create database with multiple data entries in the same transaction for oid=1

try to read that oid=1

navytux Feb 1, 2024

Choose a reason for hiding this comment

navytux Feb 1, 2024

Choose a reason for hiding this comment

perrinjerome Feb 1, 2024

Choose a reason for hiding this comment

navytux Feb 14, 2024

Choose a reason for hiding this comment

navytux Feb 14, 2024

Choose a reason for hiding this comment

d-maurer commented Feb 1, 2024

navytux commented Feb 1, 2024

d-maurer commented Feb 1, 2024 via email

perrinjerome commented Feb 1, 2024

d-maurer commented Feb 14, 2024 via email

navytux commented Feb 16, 2024

d-maurer commented Feb 16, 2024 via email

navytux commented Feb 16, 2024

perrinjerome commented Feb 27, 2024

perrinjerome commented Jan 31, 2024 •

edited

Loading