Jepsen: multiple conflicting appends sometimes succeed #2765

Closed
snaury opened this issue Mar 14, 2024 · 5 comments
snaury commented Mar 14, 2024

I rewrote the Jepsen test to merge multiple micro-ops into a single query (an UPSERT from a SELECT with LEFT JOIN and GROUP BY) and started getting inconsistent read failures. It appears that multiple transactions claim the same list index and successfully overwrite it, even though only one of them should have succeeded.
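As a rough illustration (a hypothetical sketch, not the actual jepsen.ydb code), the rewrite boils down to folding several micro-ops into one OperationBatch of the shape visible in the histories below:

```python
# Hypothetical model of merging micro-ops like [:append k v] into a
# single batch executed as one query (names are illustrative).

def batch_ops(ops):
    """Fold ("append", key, value) / ("r", key, None) micro-ops into one batch."""
    batch = {"ops": list(ops), "reads": [], "writes": [], "readmap": {}}
    for kind, key, value in ops:
        if kind == "append":
            batch["writes"].append((key, value))
        elif kind == "r":
            batch["reads"].append(key)
    return batch

# mirrors the first history entry below
batch = batch_ops([("append", 192, 630), ("append", 3, 52)])
assert batch["writes"] == [(192, 630), (3, 52)]
```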

A good example of such failure:

  • We append to key 192
  • One transaction appends 630, another appends 637, and both succeed
  • Other transactions see the list ending in 630, but then it changes to 637

History has this info on two appends:

# append 630
{:index 300389, :time 240680658687, :type :ok, :process 1547, :f :txn,
    :value [
        [:append 192 630]
        [:append 3 52]],
    :modified-txn [
        [:batch #{192 3} #jepsen.ydb.append.OperationBatch{:ops [[:append 192 630] [:append 3 52]], :reads [], :writes [[192 630] [3 52]], :readmap {}}]],
    :debug-info [
        [
            {:tablet 72075186226226475, :gen 25, :read-version "v1710410837721/18446744073709551615", :op "read", :node 50003, :read-id 2, :snapshot-repeatable true, :lock-tx-id 844430958736899}
            {:tablet 72075186226227048, :gen 2, :read-version "v1710410837721/18446744073709551615", :op "read", :node 50008, :read-id 1, :snapshot-repeatable true, :lock-tx-id 844430958736899}]
        [
            {:tablet 72075186226227048, :node 50008, :gen 2, :op "kqp-tx", :tx-id 281481004377038, :step 1710410838273, :version "v1710410838273/281481004377038"}
            {:tablet 72075186226226475, :node 50003, :gen 25, :op "kqp-tx", :tx-id 281481004377038, :step 1710410838273, :version "v1710410838273/281481004377038"}]]}

# append 637
{:index 300437, :time 240689511468, :type :ok, :process 1159, :f :txn,
    :value [
        [:append 192 637]
        [:append 196 45]
        [:append 196 46]],
    :modified-txn [
        [:batch #{192 196} #jepsen.ydb.append.OperationBatch{:ops [[:append 192 637] [:append 196 45]], :reads [], :writes [[192 637] [196 45]], :readmap {}}]
        [:commit nil [:append 196 46]]],
    :debug-info [
        [
            {:tablet 72075186226227048, :gen 2, :read-version "v1710410838260/18446744073709551615", :op "read", :node 50008, :read-id 1, :snapshot-repeatable true, :lock-tx-id 562955991686003}
            {:tablet 72075186226227048, :gen 2, :read-version "v1710410838260/18446744073709551615", :op "read", :node 50008, :read-id 1, :snapshot-repeatable true, :lock-tx-id 562955991686003}]
        [
            {:tablet 72075186226227048, :gen 2, :op "kqp-tx", :node 50008, :snapshot "v1710410838260/18446744073709551615", :tx-id 562955991686035, :snapshot-repeatable true, :lock-tx-id 562955991686035}
            {:tablet 72075186226227048, :gen 2, :read-version "v1710410838260/18446744073709551615", :op "read", :node 50008, :read-id 0, :snapshot-repeatable true, :lock-tx-id 562955991686035}
            {:tablet 72075186226227048, :node 50008, :gen 2, :op "kqp-tx", :tx-id 562955991686134, :version "v1710410838283/18446744073709551615"}]]}

What I could infer from this:

  • The tx with :append 192 630 committed first at v1710410838273/281481004377038; it was a pretty straightforward batch: a single query with a separate commit, which applied deferred effects directly at commit time.
  • The tx with :append 192 637 was part of a batch query (without commit); it performed reads at v1710410838260/18446744073709551615 with :lock-tx-id 562955991686003, so deferred effects should have been buffered in memory.
  • Then the second tx performed [:append 196 46] with a fused commit. Since the append reads from the table, it needed to flush deferred effects first, and you can see this happened with :tx-id 562955991686035 and :lock-tx-id 562955991686035. Notice the LockTxId is different (not 562955991686003) and matches the TxId. This means AcquireLocksTxId was 0, which can only happen when the transaction has absolutely no locks in its list.
  • Since lock 562955991686003 (apparently) wasn't in memory, we committed without checking it, and the write of value 637 successfully committed to the same index as 630.
  • All cases I've seen so far had a :read-id that was not zero, so apparently the read was restarted for some reason; but the current debug-info is only passed on success.
  • Currently debug-info doesn't log any info about acquired and validated locks, so it's unclear what the read actually returned (for example, whether the lock was successfully taken, or whether it was broken from the start).
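A minimal model of the inferred failure (all names are illustrative; this is not YDB's actual datashard API): a commit only validates the locks it is explicitly given, so once a lock drops out of the transaction's list, nothing aborts the conflicting write:

```python
class Datashard:
    """Toy datashard: rows plus a set of locks broken by other commits."""

    def __init__(self):
        self.rows = {}
        self.broken_locks = set()

    def break_lock(self, lock_id):
        # another transaction's commit invalidates this read lock
        self.broken_locks.add(lock_id)

    def commit(self, writes, validate_locks):
        # only the locks explicitly listed by the coordinator are checked;
        # an empty list means the write goes through unconditionally
        if any(lock in self.broken_locks for lock in validate_locks):
            return "aborted"
        self.rows.update(writes)
        return "committed"

ds = Datashard()
ds.break_lock(562955991686003)  # broken when value 630 committed

# correct flow: the lock is listed, so the conflicting write aborts
assert ds.commit({192: 637}, [562955991686003]) == "aborted"
# observed flow: the lock list was lost, so nothing is validated
assert ds.commit({192: 637}, []) == "committed"
```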

I started seeing this after switching to the complex query, so the JOIN and GROUP BY might be involved somehow.


snaury commented Mar 14, 2024

@ulya-sidorina @ssmike @spuchin please take a look


snaury commented Mar 14, 2024

One more case, with additional info about locks:

key 259
values: 263, 267

{:index 404617, :time 228748470226, :type :ok, :process 102, :f :txn,
    :value [
        [:append 259 263]],
    :modified-txn [
        [:commit nil [:batch #{259} #jepsen.ydb.append.OperationBatch{:ops [[:append 259 263]], :reads [], :writes [[259 263]], :readmap {}}]]],
    :debug-info [
        [
            {:tablet 72075186226312326, :acquired-locks [{:lock-id 844431139706297, :counter 135}], :gen 167, :read-version "v1710434960553/18446744073709551615", :op "read", :node 50003, :read-id 7, :snapshot-repeatable true, :lock-tx-id 844431139706297}
            {:tablet 72075186226312326, :gen 167, :op "kqp-tx", :node 50003, :locks-op "commit", :tx-id 844431139707274, :version "v1710434961384/18446744073709551615", :validate-locks [{:lock-id 844431139706297, :counter 135}]}]]}

{:index 404673, :time 228758805235, :type :ok, :process 25, :f :txn,
    :value [
        [:append 259 267]
        [:append 260 3]],
    :modified-txn [
        [:batch #{259 260} #jepsen.ydb.append.OperationBatch{:ops [[:append 259 267] [:append 260 3]], :reads [], :writes [[259 267] [260 3]], :readmap {}}]],
    :debug-info [
        [
            {:tablet 72075186226312326, :gen 167, :read-version "v1710434960564/18446744073709551615", :op "read", :node 50003, :read-id 7, :snapshot-repeatable true, :lock-tx-id 844431139706471}]
        [
            {:tablet 72075186226312326, :node 50003, :gen 167, :op "kqp-tx", :tx-id 844431139707697, :version "v1710434961392/0"}]]}

What's notable here is that the read for keys 259/260 was acquiring lock 844431139706471, but the result didn't report any acquired locks for some reason (maybe it wasn't the first read result, or something similar), and the commit happily performed writes without checking any locks, since nobody asked it to.
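A sketch of what I assume the coordinator (kqp) side does, with hypothetical names: the locks validated at commit are just the union of :acquired-locks reported in read results, so a read result that omits them silently yields an empty validation set:

```python
# Illustrative model: kqp collects locks only from what read results
# report; a result without :acquired-locks contributes nothing.

def collect_validate_locks(read_results):
    locks = []
    for result in read_results:
        locks.extend(result.get("acquired-locks", []))
    return locks

# first history entry above: the read reported its lock, so the commit
# carries it in :validate-locks
assert collect_validate_locks(
    [{"acquired-locks": [{"lock-id": 844431139706297, "counter": 135}]}]
) == [{"lock-id": 844431139706297, "counter": 135}]

# second entry: the read result carried no :acquired-locks at all, so
# the commit validates nothing and the write goes through unchecked
assert collect_validate_locks([{"lock-tx-id": 844431139706471}]) == []
```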


snaury commented Mar 14, 2024

I think I might have an idea of what happened. The read probably had keys 260 and 259 (in that order); the read result for key 260 was empty, but when reading key 259 we got a page fault. Since we don't want to repeat the work performed so far, and don't want to send empty results either, we continued in a secondary local db transaction without sending the first result. But since we acquire locks in the first transaction, we were supposed to send their list, and while migrating to the secondary transaction those acquired locks are lost and never sent again in the result. So kqp doesn't know about them and never asks the datashard to check them again.
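The hypothesis above can be sketched as a toy read iterator (hypothetical names, a model of the suspected bug rather than the real code): a page fault mid-read continues in a secondary local-db transaction, but the acquired-locks list collected in the first transaction is not carried over:

```python
# Toy read iterator: page-faulting on a key restarts work in a
# secondary transaction; the BUG is that acquired locks are dropped.

def run_read(keys, faulting_key, lock_id):
    acquired_locks = [lock_id]  # locks taken in the first local tx
    rows = []
    pending = list(keys)
    while pending:
        key = pending[0]
        if key == faulting_key:
            # page fault: continue in a secondary transaction without
            # resending the (empty) results produced so far...
            faulting_key = None
            # ...BUG: but the acquired-locks list is not migrated
            acquired_locks = []
            continue
        rows.append(key)
        pending.pop(0)
    # only one result is ever sent, and it carries no locks
    return {"rows": rows, "acquired-locks": acquired_locks}

result = run_read([260, 259], faulting_key=259, lock_id=844431139706471)
assert result["acquired-locks"] == []  # kqp never learns about the lock
```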

snaury self-assigned this Mar 14, 2024

snaury commented Mar 14, 2024

Kinda confirmed with more debug info (and even better query batching):

key: 354
values: 20, 16

{:index 552317, :time 315279330363, :type :ok, :process 269, :f :txn,
    :value [
        [:append 354 20]],
    :modified-txn [
        [:batch nil [[:append 354 20]]]],
    :debug-info [
        [
            {:tablet 72075186226372915, :acquired-locks [{:lock-id 562956296446503, :counter 14}], :gen 1, :read-version "v1710451670871/18446744073709551615", :op "read", :finished true, :node 50006, :keys 0, :seqno 1, :read-id 1, :ranges 1, :snapshot-repeatable true, :lock-tx-id 562956296446503}]
        [
            {:tablet 72075186226372915, :gen 1, :op "kqp-tx", :node 50006, :locks-op "commit", :tx-id 562956296446568, :version "v1710451670891/18446744073709551615", :validate-locks [{:lock-id 562956296446503, :counter 14}]}]]}

{:index 552354, :time 315285857518, :type :ok, :process 264, :f :txn,
    :value [
        [:append 354 16]
        [:append 351 731]
        [:append 354 17]],
    :modified-txn [
        [:batch nil [[:append 354 16] [:append 351 731] [:append 354 17]]]],
    :debug-info [
        [
            {:tablet 72075186226372915, :gen 1, :read-version "v1710451670861/18446744073709551615", :op "read", :node 50006, :keys 0, :seqno 1, :read-id 1, :ranges 2, :snapshot-repeatable true, :lock-tx-id 844431262076086, :read-rows 80}
            {:tablet 72075186226372915, :gen 1, :read-version "v1710451670861/18446744073709551615", :op "read", :finished true, :node 50006, :keys 0, :seqno 2, :read-id 1, :ranges 2, :snapshot-repeatable true, :lock-tx-id 844431262076086, :read-rows 116}]
        [
            {:tablet 72075186226372915, :node 50006, :gen 1, :op "kqp-tx", :tx-id 844431262076188, :version "v1710451670892/0"}]]}

It is clearly visible that the second tx performed a read which produced two results, and the first one with :seqno 1 doesn't have :acquired-locks. Read iterators are currently supposed to take locks for all queries (see :ranges 2) on the first iteration.


snaury commented Mar 14, 2024

This must have been broken since ebcd04f (main, 24.1), which wasn't merged into the current production stable versions (23.3 and 23.4).

snaury closed this as completed in e91747e on Mar 15, 2024