Replication failure under high load #4040

Closed
sergepetrenko opened this issue Mar 12, 2019 · 7 comments
Labels: bug, customer, replication

@sergepetrenko
Collaborator

Tarantool version: 1.10.2-138-g5c0a36c (probably all 1.10.2 and 2.1)

Bug description:
We have a configuration with 1 master and 1 replica. When the replica is processing
large selects (a select from a space with 70M entries), its applier may fail with
any of the three errors below:

message: Invalid MsgPack - packet body
message: Invalid MsgPack - packet header
message: Missing mandatory field 'tuple' in request

Replication stops after the error and can only be resumed by resetting
box.cfg.replication to '' and then back to the original config.
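
For reference, the reset looks roughly like this (a minimal sketch; the saved value stands in for whatever the original configuration was):

```
-- Detach the applier, then reconnect to the master.
local orig_replication = box.cfg.replication
box.cfg{replication = ''}                -- stop replication
box.cfg{replication = orig_replication}  -- resume with the original config
```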

Master logs at the moment of failure:

2019-03-11 16:18:18.201 [242647] relay/[::1]:49549/101/main I> 0.2M rows processed
2019-03-11 16:18:23.719 [242647] main/41415/main txn.c:297 W> too long WAL write: 67492 rows at LSN 714017467: 0.675 sec
2019-03-11 16:18:24.360 [242647] main/41415/main txn.c:297 W> too long WAL write: 67506 rows at LSN 714084987: 0.627 sec
2019-03-11 16:18:28.033 [242647] main/41419/main txn.c:297 W> too long WAL write: 67633 rows at LSN 714170429: 0.629 sec
2019-03-11 16:18:28.140 [242647] relay/[::1]:49549/101/main coio.cc:331 !> SystemError timed out: Connection timed out
2019-03-11 16:18:28.140 [242647] relay/[::1]:49549/101/main C> exiting the relay loop

Replica logs:

2019-03-11 16:18:24.095 [1565171] main/28584/applier/ W> too long WAL write: 1 rows at LSN 713876302: 3.188 sec
2019-03-11 16:18:28.443 [1565171] main/28584/applier/ W> too long WAL write: 1 rows at LSN 713878358: 2.154 sec
2019-03-11 16:18:28.443 [1565171] main/28584/applier/ W> too long WAL write: 1 rows at LSN 713878359: 2.151 sec
2019-03-11 16:18:28.681 [1565171] main/29085/applierw/ sio.cc:303 !> SystemError writev(2), called on fd 16, aka [::1]:49549: Connection reset by peer
2019-03-11 16:18:28.779 [1565171] main/28584/applier/ I> can't read row
2019-03-11 16:18:28.779 [1565171] main/28584/applier/ xrow.c:554 E> ER_MISSING_REQUEST_FIELD: Missing mandatory field 'tuple' in request

The problem looks similar to #3718 and #3910; both the master's and the replica's configs have
force_recovery set to true.
Steps to reproduce: happens occasionally, no reproducer yet.

@sergepetrenko added the bug and replication labels Mar 12, 2019
@sergepetrenko self-assigned this Mar 12, 2019
@locker
Member

locker commented Mar 12, 2019

Looks like, due to a long WAL write on the replica, it didn't send an ACK to the master in time (the timeout is configured by box.cfg.replication_timeout). As a result, the connection was closed by the master.

See the code:

  1. Here's where the replica sends ACKs to the master. Note that it only sends one when applier->writer_cond is signaled (a Lua sketch of this wait/signal pattern follows the list).
    fiber_cond_wait_timeout(&applier->writer_cond,
  2. The applier->writer_cond is signaled upon applying a row on the replica.
    fiber_cond_signal(&applier->writer_cond);
  3. Judging by the replica log, it looks like the applier was stalled due to a long WAL write:
2019-03-11 16:18:24.095 [1565171] main/28584/applier/ W> too long WAL write: 1 rows at LSN 713876302: 3.188 sec
2019-03-11 16:18:28.443 [1565171] main/28584/applier/ W> too long WAL write: 1 rows at LSN 713878358: 2.154 sec
2019-03-11 16:18:28.443 [1565171] main/28584/applier/ W> too long WAL write: 1 rows at LSN 713878359: 2.151 sec
2019-03-11 16:18:28.681 [1565171] main/29085/applierw/ sio.cc:303 !> SystemError writev(2), called on fd 16, aka [::1]:49549: Connection reset by peer
2019-03-11 16:18:28.779 [1565171] main/28584/applier/ I> can't read row
2019-03-11 16:18:28.779 [1565171] main/28584/applier/ xrow.c:554 E> ER_MISSING_REQUEST_FIELD: Missing mandatory field 'tuple' in request
  4. As a result, the master didn't receive an ACK in time and closed the connection:
    coio_read_xrow_timeout_xc(&io, &ibuf, &xrow,
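
A minimal Lua analogue of that wait/signal pattern, built on the public fiber.cond API rather than the server internals (a sketch; names and timings are made up): the ACK sender wakes up only when the applying fiber signals the condition variable, so a stalled applier means no ACKs are sent.

```
local fiber = require('fiber')

local writer_cond = fiber.cond()
local replication_timeout = 1  -- stands in for box.cfg.replication_timeout

-- "ACK sender": wakes up when writer_cond is signaled, or gives up after
-- the timeout without sending anything.
fiber.create(function()
    for _ = 1, 5 do
        if writer_cond:wait(replication_timeout) then
            print('row applied -> would send an ACK to the master here')
        else
            print('no row applied within the timeout -> no ACK sent')
        end
    end
end)

-- "applier": signals the condition after applying each row. If this fiber
-- stalls (e.g. on a long WAL write), the sender above stays silent.
fiber.create(function()
    for _ = 1, 3 do
        fiber.sleep(0.1)       -- pretend to apply a row
        writer_cond:signal()
    end
end)
```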

Why the connection wasn't reestablished, though, is a mystery to me. It looks like a bug somewhere in coio/sio causing the replica to read and try to decode an incomplete row sent by the master right before closing the connection.

Thus there are two problems to be fixed here. The first one is the above-mentioned bug in sio; fixing it will ameliorate the problem by making the error recoverable. The other problem is too-long WAL writes causing the replication timeout. I don't think it's a replication issue: the problem is that we write WALs to the system page cache without syncing them to disk; once there are too many dirty pages, the OS starts throttling writers until enough dirty pages have been flushed to disk. We've run into this problem before, e.g. see #3603 (comment). A workaround would be setting the Linux sysctl vm.dirty_bytes to a small value, say 10 MB. I think we should sync WALs explicitly, similarly to how we do it for SNAP files, see

dir->sync_interval = SNAP_SYNC_INTERVAL;

sync_file_range(log->fd, sync_from, sync_len,

@kostja
Contributor

kostja commented Mar 12, 2019

An event loop stall may happen on a huge select as well, or when writing a huge tuple to the WAL, where compression may take a while. A workaround for long event loop cycles is setting replication_timeout to a higher value. Besides, the applier is being moved out of the tx thread, so event loop stalls in the tx thread will not affect it in the future. So I don't think we should fix the problem with the replication disconnect: a user can change the replication_timeout setting. But we do need to find this bug in sio.
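
For reference, that workaround is a one-line config change on each instance (a sketch; the value is purely illustrative and should exceed the longest expected stall):

```
-- Tolerate event loop stalls of up to ~10 seconds before the peer
-- considers the connection dead.
box.cfg{replication_timeout = 10}
```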

@locker
Member

locker commented Mar 12, 2019

An event loop stall may happen on a huge select as well, or when writing a huge tuple to the WAL, where compression may take a while.

Stalling the event loop for several seconds with a CPU-intensive workload smells like a bug in the application code, which is pretty easy to fix - just add fiber.yield() where appropriate. As for WAL compression - for it to stall the event loop for that long, we would have to insert hundreds of megabytes per second, which is hardly possible.
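
For illustration, such a fix is typically just a periodic yield inside the long loop (a sketch; the space name and batch size are made up):

```
-- Break a long scan into chunks so other fibers (applier, relay ACKs, ...)
-- keep getting scheduled.
local fiber = require('fiber')

local processed = 0
for _, tuple in box.space.test:pairs() do
    -- ... CPU-heavy per-tuple work goes here ...
    processed = processed + 1
    if processed % 1000 == 0 then
        fiber.yield()
    end
end
```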

Too-long WAL writes caused by dirty page throttling are a completely different beast, though - they may happen even if the application is designed perfectly well and the CPU is never overloaded. In fact, I have to set vm.dirty_bytes for write-intensive performance testing to avoid latency spikes. I think it's unfair to hand this problem over to users rather than fixing it in the server code, similarly to how we fixed #1359.

@kyukhin added this to the 1.10.3 milestone Mar 14, 2019
sergepetrenko added a commit that referenced this issue Mar 15, 2019
Add row hex printing to the log on verbose level. This will be useful
when investigating errors related to the arrival of invalid msgpack
packets.

Related to #4040
@kyukhin modified the milestones: 1.10.3, 1.10.4 Apr 1, 2019
sergepetrenko added a commit that referenced this issue Apr 1, 2019
Add row hex printing to the log on verbose level. This will be useful
when investigating errors related to the arrival of invalid msgpack
packets.

Here's some example output:
```
2019-04-01 16:04:51.629 [31304] iproto V> Corrupted row is: 3D A5 41 55 99 18 A8 08 C2 40 2B BA 50 12 F6 C6 0B 27 66 1C
2019-04-01 16:04:51.629 [31304] iproto xrow.c:133 E> ER_INVALID_MSGPACK: Invalid MsgPack - packet header
```

Related to #4040
@kyukhin modified the milestones: 1.10.4, 1.10.5 Sep 26, 2019
@kyukhin modified the milestones: 1.10.5, 1.10.6 Jan 13, 2020
GeorgyKirichenko pushed a commit that referenced this issue Feb 11, 2020
Simultaneous usage of one coio object from two or more fibers could lead
to undefined behavior, as coio routines replace the awaiting fiber
(a data member) and stop the watcher regardless of whether there are
any other users of the coio object. Such behavior could lead to
the applier invalid stream issue #4040.
The proposal is to disable reuse of an active coio by returning a fake
EINPROGRESS error.

Part of #980
@kyukhin modified the milestones: 1.10.6, 1.10.7 Apr 20, 2020
@kyukhin modified the milestones: 1.10.7, 1.10.8 Jun 22, 2020
@kyukhin removed this from the 1.10.8 milestone Oct 23, 2020
@kyukhin added this to the wishlist milestone Oct 23, 2020
@sergepetrenko
Collaborator Author

sergepetrenko commented Nov 19, 2020

A similar bug is seen on replica join. When the master is under high load, it may write lots of transactions while sending the replica its snapshot. By the time the replica has fetched the snapshot, the master will have a 'buffer' of already-written transactions, which it'll send to the replica as fast as possible (its own WAL won't limit its speed). The replica, however, has to write these transactions to its WAL and will fall behind more and more.
Under this intensive load the replica may receive similar errors: ER_INVALID_MSGPACK.

So, now we're one step closer to a reproducer.
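
A sketch for watching the replica fall further behind during such a join (field names follow box.info.replication as of the 1.10/2.x series; the polling interval is arbitrary):

```
-- Periodically log each upstream's status and lag so the growing gap
-- behind the master is visible while the replica catches up.
local fiber = require('fiber')
local log = require('log')

fiber.create(function()
    while true do
        for id, r in pairs(box.info.replication) do
            if r.upstream ~= nil then
                log.info('peer %d: status=%s lag=%.3f s',
                         id, r.upstream.status, r.upstream.lag or 0)
            end
        end
        fiber.sleep(1)
    end
end)
```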

@kyukhin
Contributor

kyukhin commented Apr 12, 2021

5SP for investigation to estimate complexity.

@kyukhin kyukhin removed the tmp label Apr 30, 2021
@sergepetrenko
Collaborator Author

Here's a reproducer for the issue, which we have thanks to @rosik and @droidroot1995:

To run it, start tarantool and issue

```
box.cfg{listen=3301}
box.schema.user.grant('guest', 'super')
```

Then run the script below on another tarantool instance:

```
#!/usr/bin/env tarantool

local log = require('log')
local json = require('json')
local fiber = require('fiber')
local socket = require('socket')
local pickle = require('pickle')
local msgpack = require('msgpack')

local iproto_code = {
    [0x00] = "iproto_ok", iproto_ok = 0x00,
    [0x01] = "iproto_select", iproto_select = 0x01,
    [0x02] = "iproto_insert", iproto_insert = 0x02,
    [0x03] = "iproto_replace", iproto_replace = 0x03,
    [0x04] = "iproto_update", iproto_update = 0x04,
    [0x05] = "iproto_delete", iproto_delete = 0x05,
    [0x06] = "iproto_call_16", iproto_call_16 = 0x06,
    [0x07] = "iproto_auth", iproto_auth = 0x07,
    [0x08] = "iproto_eval", iproto_eval = 0x08,
    [0x09] = "iproto_upsert", iproto_upsert = 0x09,
    [0x0a] = "iproto_call", iproto_call = 0x0a,
    [0x0b] = "iproto_execute", iproto_execute = 0x0b,
    [0x0c] = "iproto_nop", iproto_nop = 0x0c,
    [0x0d] = "iproto_type_stat_max", iproto_type_stat_max = 0x0d,
    [0x40] = "iproto_ping", iproto_ping = 0x40,
    [0x41] = "iproto_join", iproto_join = 0x41,
    [0x42] = "iproto_subscribe", iproto_subscribe = 0x42,
    [0x43] = "iproto_request_vote", iproto_request_vote = 0x43,
}

local IPROTO = {
    REQUEST_TYPE = 0x00,
    SERVER_VERSION = 0x06,
    INSTANCE_UUID = 0x24,
    CLUSTER_UUID = 0x25,
    VCLOCK = 0x26,
}

-- Encode n as a 5-byte big-endian MsgPack uint32 (0xCE prefix); used as
-- the packet length prefix.
local function mpenc_uint32(n)
    return pickle.pack('bN', 0xCE, n)
end

-- Connect to the master and discard the 128-byte iproto greeting.
local s = socket.tcp_connect('127.0.0.1', 3301)
s:read(128)

-- Send a raw iproto request: the 5-byte length prefix, then header, then body.
function _G.req(code, body)
    local header = msgpack.encode({
        [IPROTO.REQUEST_TYPE] = assert(iproto_code[code]),
    })

    body = msgpack.encode(body)
    s:write(mpenc_uint32(#header + #body))
    s:write(header)
    s:write(body)
end

-- Read one iproto response; returns false once the peer closes the connection.
function _G.recv()
    local ok, buf = pcall(s.read, s, 5, 2)
    if not ok or buf == nil or buf == '' then
        log.debug('Peer closed')
        return false
    end

    local size, pos = msgpack.decode(buf)
    local ok, _tail = pcall(s.read, s, pos-1 + size - #buf)
    if not ok or _tail == nil or _tail == '' then
        log.info('Peer closed')
        return false
    else
        buf = buf .. _tail
    end

    local header, pos = msgpack.decode(buf, pos)
    local body = nil
    if pos < #buf then
        body = msgpack.decode(buf, pos)
    end

    local code = header[0x00]

    -- log.info('< %s %s', json.encode(header), json.encode(body))
    return true
end

-- Use a regular net.box connection for setup: register a fake replica in
-- _cluster and create a test space.
c = require('net.box').connect('localhost:3301')
c:eval([[
    if not box.space._cluster:get(2) then
        box.space._cluster:replace({2, package.loaded.uuid.str()})
    end
    box.schema.create_space('test', {if_not_exists = true})
    box.space.test:create_index('pk', {if_not_exists = true})
]])

local vclock = c:eval([[return box.info.vclock]])


local instance_uuid = c:eval([[ return box.space._cluster:get(2)[2] ]])
local replicaset_uuid = c:eval([[return box.info.cluster.uuid]])

_G.s = s
-- Issue a raw SUBSCRIBE request, pretending to be the instance registered
-- in _cluster with id 2.
function _G.subscribe()
    return _G.req('iproto_subscribe', {
        [IPROTO.VCLOCK] = setmetatable(vclock, {__serialize = 'map'}),
        [IPROTO.SERVER_VERSION] = 0x20702,
        [IPROTO.INSTANCE_UUID] = instance_uuid,
        [IPROTO.CLUSTER_UUID] = replicaset_uuid,
    })
end

subscribe()
recv()

-- Generate ~6 MiB of replication traffic that this fake replica will not
-- read in time.
c:eval([[
    for i = 1, 100 do
        s = package.loaded.digest.urandom(30):hex() -- 60 bytes
        s = string.rep(s, 1024) -- 60 KiB
        box.space.test:replace({1, s})
    end
]])

recv()

-- Stall for longer than replication_timeout without reading anything,
-- then try to drain whatever the master managed to send.
fiber.sleep(5)

while recv() do
end

require('console').start()
```

Once the script is run, you should see

```
LuajitError: test1.lua:76: msgpack.decode: invalid MsgPack
fatal error, exiting the event loop
```

in ~5 seconds.

And the master's log should contain

```
2021-06-23 16:05:05.133 [84870] relay/[::ffff:127.0.0.1]:57794/101/main coio.cc:340 !> SystemError timed out: Operation timed out
2021-06-23 16:05:05.133 [84870] relay/[::ffff:127.0.0.1]:57794/101/main C> exiting the relay loop
```

It's now possible to figure out what's happening:
once the replica subscribes, the master starts feeding it its WAL.
When the replica stalls (say, there's a fiber which doesn't yield control to others), it doesn't read the data from the master's socket. Once the master writes more data than the system-configured tcp_wmem allows, its socket stops being ready for writing, so the master's relay yields (in the middle of writing some xrow to the socket). The socket doesn't become writeable until the timeout fires (because the replica doesn't read anything from the master). Once the timeout fires, the relay is cancelled, meaning it doesn't get a chance to finish writing the row, and the error (TimedOut in this case) is sent to the replica right after the half-written row. This leaves the replica unable to parse the row, hence the invalid MsgPack error.
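
The MsgPack part of the failure is easy to demonstrate in isolation (a toy sketch, unrelated to the server code): decoding a buffer that ends in the middle of an encoded value fails, which is exactly what the applier runs into on a half-written row.

```
-- Toy illustration: a MsgPack buffer cut off mid-value cannot be decoded,
-- just like a row the relay stopped writing halfway through.
local msgpack = require('msgpack')

local row = msgpack.encode({[0x00] = 2, [0x20] = {1, 'some tuple payload'}})
local truncated = row:sub(1, #row - 5)   -- pretend the write was interrupted

local ok, err = pcall(msgpack.decode, truncated)
print(ok, err)  -- ok == false: the truncated buffer is invalid MsgPack
```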

@kostja
Contributor

kostja commented Jun 23, 2021

Nice!

sergepetrenko added a commit that referenced this issue Jul 8, 2021
Every error that happens while the master processes a join or subscribe
request is sent to the replica for better diagnostics.

This could lead to the following situation with the TimedOut error:
it could be written on top of a half-written row and make the replica stop
replication with an ER_INVALID_MSGPACK error. The error is unrecoverable,
and the only way to resume replication after it happens is to reset
box.cfg.replication.

Here's what happened:

1) The replica is under heavy load, meaning its event loop is occupied by
   some fiber not yielding control to others.

2) The applier and other fibers aren't scheduled while the event loop is
   blocked. This means the applier doesn't send heartbeat messages to the
   master and doesn't read any data coming from the master.

3) The master's unread data piles up: first in the replica's receive buffer,
   then in the master's send buffer.

4) Once the master's send buffer is full, the corresponding socket stops
   being writeable and the relay yields, waiting for the socket to become
   writeable again. The send buffer might contain a partially written
   row by now.

5) The replication timeout fires on the master, because it hasn't heard from
   the replica for a while. An exception is raised, and the exception is
   pushed to the replica's socket. Now two situations are possible:

  a) the socket becomes writeable by the time the exception is raised.
     In this case the exception is written to the buffer right after
     a partially written row. Once the replica receives the half-written
     row with the exception logged on top, it errors with
     ER_INVALID_MSGPACK. Replication is broken.

  b) the socket still isn't writeable (the most probable scenario).
     The exception isn't written to the socket and the connection is
     closed. The replica eventually receives a partially-written row and
     retries the connection to the master normally.

In order to prevent case a) from happening, let's not push TimedOut
errors to the socket at all. They're the only errors that could be
raised while a row is being written, i.e. the only errors that could
lead to the situation described in 5a.

Closes #4040