Replication failure under high load #4040

Closed
sergepetrenko opened this issue Mar 12, 2019 · 7 comments
Labels: bug, customer, replication

@sergepetrenko
Collaborator

Tarantool version: 1.10.2-138-g5c0a36c (probably all 1.10.2 and 2.1)

Bug description:
We have a configuration with 1 master and 1 replica. When the replica is processing
large selects (a select from a space with 70M entries), its applier may fail with
any of the three errors below:

message: Invalid MsgPack - packet body
message: Invalid MsgPack - packet header
message: Missing mandatory field 'tuple' in request

Replication stops after the error and can only be resumed by resetting
box.cfg.replication to '' and then back to the original config.
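
For reference, the reset looks roughly like this (a minimal sketch; the saved value stands in for whatever the original configuration was):

```
-- Detach the applier, then reconnect to the master.
local orig_replication = box.cfg.replication
box.cfg{replication = ''}                -- stop replication
box.cfg{replication = orig_replication}  -- resume with the original config
```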

Master logs at the moment of failure:

2019-03-11 16:18:18.201 [242647] relay/[::1]:49549/101/main I> 0.2M rows processed
2019-03-11 16:18:23.719 [242647] main/41415/main txn.c:297 W> too long WAL write: 67492 rows at LSN 714017467: 0.675 sec
2019-03-11 16:18:24.360 [242647] main/41415/main txn.c:297 W> too long WAL write: 67506 rows at LSN 714084987: 0.627 sec
2019-03-11 16:18:28.033 [242647] main/41419/main txn.c:297 W> too long WAL write: 67633 rows at LSN 714170429: 0.629 sec
2019-03-11 16:18:28.140 [242647] relay/[::1]:49549/101/main coio.cc:331 !> SystemError timed out: Connection timed out
2019-03-11 16:18:28.140 [242647] relay/[::1]:49549/101/main C> exiting the relay loop

Replica logs:

2019-03-11 16:18:24.095 [1565171] main/28584/applier/ W> too long WAL write: 1 rows at LSN 713876302: 3.188 sec
2019-03-11 16:18:28.443 [1565171] main/28584/applier/ W> too long WAL write: 1 rows at LSN 713878358: 2.154 sec
2019-03-11 16:18:28.443 [1565171] main/28584/applier/ W> too long WAL write: 1 rows at LSN 713878359: 2.151 sec
2019-03-11 16:18:28.681 [1565171] main/29085/applierw/ sio.cc:303 !> SystemError writev(2), called on fd 16, aka [::1]:49549: Connection reset by peer
2019-03-11 16:18:28.779 [1565171] main/28584/applier/ I> can't read row
2019-03-11 16:18:28.779 [1565171] main/28584/applier/ xrow.c:554 E> ER_MISSING_REQUEST_FIELD: Missing mandatory field 'tuple' in request

The problem looks similar to #3718 and #3910; both the master's and the replica's configs have
force_recovery set to true.
Steps to reproduce: happens occasionally, no reproducer yet.

@sergepetrenko added the bug and replication labels Mar 12, 2019
@sergepetrenko self-assigned this Mar 12, 2019
@locker
Member

locker commented Mar 12, 2019

Looks like, due to a long WAL write on the replica, it didn't send an ACK to the master in time (the timeout is configured by box.cfg.replication_timeout). As a result, the connection was closed by the master.

See the code:

  1. Here's where the replica sends ACKs to the master. Note that it only sends one when applier->writer_cond is signaled (a Lua sketch of this wait/signal pattern follows the list).
    fiber_cond_wait_timeout(&applier->writer_cond,
  2. The applier->writer_cond is signaled upon applying a row on the replica.
    fiber_cond_signal(&applier->writer_cond);
  3. Judging by the replica log, it looks like the applier was stalled due to a long WAL write:
2019-03-11 16:18:24.095 [1565171] main/28584/applier/ W> too long WAL write: 1 rows at LSN 713876302: 3.188 sec
2019-03-11 16:18:28.443 [1565171] main/28584/applier/ W> too long WAL write: 1 rows at LSN 713878358: 2.154 sec
2019-03-11 16:18:28.443 [1565171] main/28584/applier/ W> too long WAL write: 1 rows at LSN 713878359: 2.151 sec
2019-03-11 16:18:28.681 [1565171] main/29085/applierw/ sio.cc:303 !> SystemError writev(2), called on fd 16, aka [::1]:49549: Connection reset by peer
2019-03-11 16:18:28.779 [1565171] main/28584/applier/ I> can't read row
2019-03-11 16:18:28.779 [1565171] main/28584/applier/ xrow.c:554 E> ER_MISSING_REQUEST_FIELD: Missing mandatory field 'tuple' in request
  4. As a result, the master didn't receive an ACK in time and closed the connection:
    coio_read_xrow_timeout_xc(&io, &ibuf, &xrow,
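
A minimal Lua analogue of that wait/signal pattern, built on the public fiber.cond API rather than the server internals (a sketch; names and timings are made up): the ACK sender wakes up only when the applying fiber signals the condition variable, so a stalled applier means no ACKs are sent.

```
local fiber = require('fiber')

local writer_cond = fiber.cond()
local replication_timeout = 1  -- stands in for box.cfg.replication_timeout

-- "ACK sender": wakes up when writer_cond is signaled, or gives up after
-- the timeout without sending anything.
fiber.create(function()
    for _ = 1, 5 do
        if writer_cond:wait(replication_timeout) then
            print('row applied -> would send an ACK to the master here')
        else
            print('no row applied within the timeout -> no ACK sent')
        end
    end
end)

-- "applier": signals the condition after applying each row. If this fiber
-- stalls (e.g. on a long WAL write), the sender above stays silent.
fiber.create(function()
    for _ = 1, 3 do
        fiber.sleep(0.1)       -- pretend to apply a row
        writer_cond:signal()
    end
end)
```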

Why the connection wasn't reestablished, though, is a mystery to me. It looks like a bug somewhere in coio/sio causing the replica to read and try to decode an incomplete row sent by the master right before closing the connection.

Thus there are two problems to be fixed here. The first one is the above-mentioned bug in sio; fixing it will ameliorate the problem by making the error recoverable. The other problem is too-long WAL writes causing the replication timeout. I don't think it's a replication issue: the problem is that we write WALs to the system page cache without syncing them to disk; once there are too many dirty pages, the OS starts throttling writers until enough dirty pages have been flushed to disk. We've run into this problem before, e.g. see #3603 (comment). A workaround would be setting the Linux sysctl vm.dirty_bytes to a small value, say 10 MB. I think we should sync WALs explicitly, similarly to how we do it for SNAP files, see

dir->sync_interval = SNAP_SYNC_INTERVAL;

sync_file_range(log->fd, sync_from, sync_len,

@kostja
Contributor

kostja commented Mar 12, 2019

An event loop stall may happen on a huge select as well, or when writing a huge tuple to the WAL, where compression may take a while. A workaround for long event loop cycles is setting replication_timeout to a higher value. Besides, the applier is being moved out of the tx thread, so event loop stalls in the tx thread will not affect it in the future. So I don't think we should fix the problem with the replication disconnect: a user can change the replication_timeout setting. But we do need to find this bug in sio.
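
For reference, that workaround is a one-line config change on each instance (a sketch; the value is purely illustrative and should exceed the longest expected stall):

```
-- Tolerate event loop stalls of up to ~10 seconds before the peer
-- considers the connection dead.
box.cfg{replication_timeout = 10}
```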

@locker
Member

locker commented Mar 12, 2019

An event loop stall may happen on a huge select as well, or when writing a huge tuple to the WAL, where compression may take a while.

Stalling the event loop for several seconds with a CPU-intensive workload smells like a bug in the application code, which is pretty easy to fix - just add fiber.yield() where appropriate. As for WAL compression - for it to stall the event loop for that long, we would have to insert hundreds of megabytes per second, which is hardly possible.
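
For illustration, such a fix is typically just a periodic yield inside the long loop (a sketch; the space name and batch size are made up):

```
-- Break a long scan into chunks so other fibers (applier, relay ACKs, ...)
-- keep getting scheduled.
local fiber = require('fiber')

local processed = 0
for _, tuple in box.space.test:pairs() do
    -- ... CPU-heavy per-tuple work goes here ...
    processed = processed + 1
    if processed % 1000 == 0 then
        fiber.yield()
    end
end
```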

Too-long WAL writes caused by dirty page throttling are a completely different beast, though - they may happen even if the application is designed perfectly well and the CPU is never overloaded. In fact, I have to set vm.dirty_bytes for write-intensive performance testing to avoid latency spikes. I think it's unfair to hand this problem over to users rather than fixing it in the server code, similarly to how we fixed #1359.

@kyukhin added this to the 1.10.3 milestone Mar 14, 2019
sergepetrenko added a commit that referenced this issue Mar 15, 2019
Add row hex printing to the log on verbose level. This will be useful
when investigating errors related to the arrival of invalid msgpack
packets.

Related to #4040
@kyukhin modified the milestones: 1.10.3, 1.10.4 Apr 1, 2019
sergepetrenko added a commit that referenced this issue Apr 1, 2019
Add row hex printing to the log on verbose level. This will be useful
when investigating errors related to the arrival of invalid msgpack
packets.

Here's some example output:
```
2019-04-01 16:04:51.629 [31304] iproto V> Corrupted row is: 3D A5 41 55 99 18 A8 08 C2 40 2B BA 50 12 F6 C6 0B 27 66 1C
2019-04-01 16:04:51.629 [31304] iproto xrow.c:133 E> ER_INVALID_MSGPACK: Invalid MsgPack - packet header
```

Related to #4040
@kyukhin modified the milestones: 1.10.4, 1.10.5 Sep 26, 2019
@kyukhin modified the milestones: 1.10.5, 1.10.6 Jan 13, 2020
GeorgyKirichenko pushed a commit that referenced this issue Feb 11, 2020
Simultaneous usage of one coio object from two or more fibers could lead
to undefined behavior, as coio routines replace the awaiting fiber
(a data member) and stop the watcher regardless of whether there are
any other users of the coio object. Such behavior could lead to
the applier invalid stream issue #4040.
The proposal is to disable reuse of an active coio by returning a fake
EINPROGRESS error.

Part of #980
@kyukhin modified the milestones: 1.10.6, 1.10.7 Apr 20, 2020
@kyukhin modified the milestones: 1.10.7, 1.10.8 Jun 22, 2020
@kyukhin removed this from the 1.10.8 milestone Oct 23, 2020
@kyukhin added this to the wishlist milestone Oct 23, 2020
@sergepetrenko
Collaborator Author

sergepetrenko commented Nov 19, 2020

A similar bug is seen on replica join. When the master is under high load, it may write lots of transactions while sending the replica its snapshot. By the time the replica has fetched the snapshot, the master will have a 'buffer' of already-written transactions, which it'll send to the replica as fast as possible (its own WAL won't limit its speed). The replica, however, has to write these transactions to its WAL and will fall behind more and more.
Under this intensive load the replica may receive similar errors: ER_INVALID_MSGPACK.

So, now we're one step closer to a reproducer.
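
A sketch for watching the replica fall further behind during such a join (field names follow box.info.replication as of the 1.10/2.x series; the polling interval is arbitrary):

```
-- Periodically log each upstream's status and lag so the growing gap
-- behind the master is visible while the replica catches up.
local fiber = require('fiber')
local log = require('log')

fiber.create(function()
    while true do
        for id, r in pairs(box.info.replication) do
            if r.upstream ~= nil then
                log.info('peer %d: status=%s lag=%.3f s',
                         id, r.upstream.status, r.upstream.lag or 0)
            end
        end
        fiber.sleep(1)
    end
end)
```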

@kyukhin
Contributor

kyukhin commented Apr 12, 2021

5SP for investigation to estimate complexity.

@kyukhin kyukhin removed the tmp label Apr 30, 2021
@sergepetrenko
Collaborator Author

Here's a reproducer for the issue, which we have thanks to @rosik and @droidroot1995:

To run it, start tarantool and issue

```
box.cfg{listen=3301}
box.schema.user.grant('guest', 'super')
```

Then run the script below on another tarantool instance:

```
#!/usr/bin/env tarantool

local log = require('log')
local json = require('json')
local fiber = require('fiber')
local socket = require('socket')
local pickle = require('pickle')
local msgpack = require('msgpack')

local iproto_code = {
    [0x00] = "iproto_ok", iproto_ok = 0x00,
    [0x01] = "iproto_select", iproto_select = 0x01,
    [0x02] = "iproto_insert", iproto_insert = 0x02,
    [0x03] = "iproto_replace", iproto_replace = 0x03,
    [0x04] = "iproto_update", iproto_update = 0x04,
    [0x05] = "iproto_delete", iproto_delete = 0x05,
    [0x06] = "iproto_call_16", iproto_call_16 = 0x06,
    [0x07] = "iproto_auth", iproto_auth = 0x07,
    [0x08] = "iproto_eval", iproto_eval = 0x08,
    [0x09] = "iproto_upsert", iproto_upsert = 0x09,
    [0x0a] = "iproto_call", iproto_call = 0x0a,
    [0x0b] = "iproto_execute", iproto_execute = 0x0b,
    [0x0c] = "iproto_nop", iproto_nop = 0x0c,
    [0x0d] = "iproto_type_stat_max", iproto_type_stat_max = 0x0d,
    [0x40] = "iproto_ping", iproto_ping = 0x40,
    [0x41] = "iproto_join", iproto_join = 0x41,
    [0x42] = "iproto_subscribe", iproto_subscribe = 0x42,
    [0x43] = "iproto_request_vote", iproto_request_vote = 0x43,
}

local IPROTO = {
    REQUEST_TYPE = 0x00,
    SERVER_VERSION = 0x06,
    INSTANCE_UUID = 0x24,
    CLUSTER_UUID = 0x25,
    VCLOCK = 0x26,
}

-- Encode n as a 5-byte big-endian MsgPack uint32 (0xCE prefix); used as
-- the packet length prefix.
local function mpenc_uint32(n)
    return pickle.pack('bN', 0xCE, n)
end

-- Connect to the master and discard the 128-byte iproto greeting.
local s = socket.tcp_connect('127.0.0.1', 3301)
s:read(128)

-- Send a raw iproto request: the 5-byte length prefix, then header, then body.
function _G.req(code, body)
    local header = msgpack.encode({
        [IPROTO.REQUEST_TYPE] = assert(iproto_code[code]),
    })

    body = msgpack.encode(body)
    s:write(mpenc_uint32(#header + #body))
    s:write(header)
    s:write(body)
end

-- Read one iproto response; returns false once the peer closes the connection.
function _G.recv()
    local ok, buf = pcall(s.read, s, 5, 2)
    if not ok or buf == nil or buf == '' then
        log.debug('Peer closed')
        return false
    end

    local size, pos = msgpack.decode(buf)
    local ok, _tail = pcall(s.read, s, pos-1 + size - #buf)
    if not ok or _tail == nil or _tail == '' then
        log.info('Peer closed')
        return false
    else
        buf = buf .. _tail
    end

    local header, pos = msgpack.decode(buf, pos)
    local body = nil
    if pos < #buf then
        body = msgpack.decode(buf, pos)
    end

    local code = header[0x00]

    -- log.info('< %s %s', json.encode(header), json.encode(body))
    return true
end

-- Use a regular net.box connection for setup: register a fake replica in
-- _cluster and create a test space.
c = require('net.box').connect('localhost:3301')
c:eval([[
    if not box.space._cluster:get(2) then
        box.space._cluster:replace({2, package.loaded.uuid.str()})
    end
    box.schema.create_space('test', {if_not_exists = true})
    box.space.test:create_index('pk', {if_not_exists = true})
]])

local vclock = c:eval([[return box.info.vclock]])


local instance_uuid = c:eval([[ return box.space._cluster:get(2)[2] ]])
local replicaset_uuid = c:eval([[return box.info.cluster.uuid]])

_G.s = s
-- Issue a raw SUBSCRIBE request, pretending to be the instance registered
-- in _cluster with id 2.
function _G.subscribe()
    return _G.req('iproto_subscribe', {
        [IPROTO.VCLOCK] = setmetatable(vclock, {__serialize = 'map'}),
        [IPROTO.SERVER_VERSION] = 0x20702,
        [IPROTO.INSTANCE_UUID] = instance_uuid,
        [IPROTO.CLUSTER_UUID] = replicaset_uuid,
    })
end

subscribe()
recv()

-- Generate ~6 MiB of replication traffic that this fake replica will not
-- read in time.
c:eval([[
    for i = 1, 100 do
        s = package.loaded.digest.urandom(30):hex() -- 60 bytes
        s = string.rep(s, 1024) -- 60 KiB
        box.space.test:replace({1, s})
    end
]])

recv()

-- Stall for longer than replication_timeout without reading anything,
-- then try to drain whatever the master managed to send.
fiber.sleep(5)

while recv() do
end

require('console').start()
```

Once the script is run, you should see

```
LuajitError: test1.lua:76: msgpack.decode: invalid MsgPack
fatal error, exiting the event loop
```

in ~5 seconds.

And the master's log should contain

```
2021-06-23 16:05:05.133 [84870] relay/[::ffff:127.0.0.1]:57794/101/main coio.cc:340 !> SystemError timed out: Operation timed out
2021-06-23 16:05:05.133 [84870] relay/[::ffff:127.0.0.1]:57794/101/main C> exiting the relay loop
```

It's now possible to figure out what's happening:
once the replica subscribes, the master starts feeding it its WAL.
When the replica stalls (say, there's a fiber which doesn't yield control to others), it doesn't read the data from the master's socket. Once the master writes more data than the system-configured tcp_wmem allows, its socket stops being ready for writing, so the master's relay yields (in the middle of writing some xrow to the socket). The socket doesn't become writeable until the timeout fires (because the replica doesn't read anything from the master). Once the timeout fires, the relay is cancelled, meaning it doesn't get a chance to finish writing the row, and the error (TimedOut in this case) is sent to the replica right after the half-written row. This leaves the replica unable to parse the row, hence the invalid MsgPack error.
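
The MsgPack part of the failure is easy to demonstrate in isolation (a toy sketch, unrelated to the server code): decoding a buffer that ends in the middle of an encoded value fails, which is exactly what the applier runs into on a half-written row.

```
-- Toy illustration: a MsgPack buffer cut off mid-value cannot be decoded,
-- just like a row the relay stopped writing halfway through.
local msgpack = require('msgpack')

local row = msgpack.encode({[0x00] = 2, [0x20] = {1, 'some tuple payload'}})
local truncated = row:sub(1, #row - 5)   -- pretend the write was interrupted

local ok, err = pcall(msgpack.decode, truncated)
print(ok, err)  -- ok == false: the truncated buffer is invalid MsgPack
```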

@kostja
Contributor

kostja commented Jun 23, 2021

Nice!

sergepetrenko added a commit that referenced this issue Jul 8, 2021
Every error that happens while the master processes a join or subscribe
request is sent to the replica for better diagnostics.

This could lead to the following situation with the TimedOut error:
it could be written on top of a half-written row and make the replica stop
replication with an ER_INVALID_MSGPACK error. The error is unrecoverable,
and the only way to resume replication after it happens is to reset
box.cfg.replication.

Here's what happened:

1) The replica is under heavy load, meaning its event loop is occupied by
   some fiber not yielding control to others.

2) The applier and other fibers aren't scheduled while the event loop is
   blocked. This means the applier doesn't send heartbeat messages to the
   master and doesn't read any data coming from the master.

3) The master's unread data piles up: first in the replica's receive buffer,
   then in the master's send buffer.

4) Once the master's send buffer is full, the corresponding socket stops
   being writeable and the relay yields, waiting for the socket to become
   writeable again. The send buffer might contain a partially written
   row by now.

5) The replication timeout fires on the master, because it hasn't heard from
   the replica for a while. An exception is raised, and the exception is
   pushed to the replica's socket. Now two situations are possible:

  a) the socket becomes writeable by the time the exception is raised.
     In this case the exception is written to the buffer right after
     a partially written row. Once the replica receives the half-written
     row with the exception logged on top, it errors with
     ER_INVALID_MSGPACK. Replication is broken.

  b) the socket still isn't writeable (the most probable scenario).
     The exception isn't written to the socket and the connection is
     closed. The replica eventually receives a partially-written row and
     retries the connection to the master normally.

In order to prevent case a) from happening, let's not push TimedOut
errors to the socket at all. They're the only errors that could be
raised while a row is being written, i.e. the only errors that could
lead to the situation described in 5a.

Closes #4040