Replication failure under high load #4040
Looks like due to a long WAL write on the replica, it didn't send an ACK to the master in time (configured by replication_timeout; see the code at commit 352e1f7, lines 339 and 1257). Why the connection wasn't reestablished, though, is a mystery to me. It looks like a bug somewhere in coio/sio causing the replica to read and try to decode an incomplete row sent by the master right before it closed the connection. Thus there are two problems here to be fixed. The first one is the above-mentioned bug in sio; fixing it will ameliorate the problem by making the error recoverable. The other problem is too-long WAL writes causing the replication timeout. I don't think it's a replication issue: the problem is that we write WALs to the system page cache without syncing them to disk; once there are too many dirty pages, the OS starts throttling writers until enough dirty pages have been flushed to disk. We've run into this problem before, e.g. see #3603 (comment). A workaround would be setting …
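For illustration, the timeout mechanics described above can be observed from the replica's console (a sketch; it assumes the master is listed under id 1 in box.info.replication):

```lua
-- Inspect the applier's connection to the master: status should be
-- 'follow'; idle is the time since the last message was received.
local upstream = box.info.replication[1].upstream
print(upstream.status, upstream.idle, upstream.lag)
```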
The event loop stall may happen on a huge select as well, or when writing a huge tuple to the WAL, where compression may take a while. A workaround for long event loop cycles is setting replication_timeout to a higher value. Besides, the applier is being moved out of the tx thread, so event loop stalls in the tx thread will not affect it in the future. So I don't think we should fix the problem with the replication disconnect: a user can change the replication_timeout setting. But we do need to find this bug in sio.
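For reference, the workaround mentioned above is a dynamic box.cfg change (a minimal sketch; the value of 10 seconds is an arbitrary example, the default is 1):

```lua
-- Raise the replication timeout on every instance of the replica set
-- so that long event loop stalls don't break the connection.
box.cfg{replication_timeout = 10}
```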
Stalling the event loop for several seconds with a CPU-intensive workload smells like a bug in the application code, which is pretty easy to fix: just add yields to the offending fiber. Too-long WAL writes caused by dirty page throttling are a completely different beast, though: they may happen even if the application is designed perfectly well and the CPU is never overloaded. In fact, I have to set …
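A minimal sketch of that application-side fix, assuming the stall comes from a long scan over a large space (box.space.test and the batch size of 1000 are arbitrary examples):

```lua
local fiber = require('fiber')

-- Break a CPU-heavy scan into batches and yield between them so the
-- applier and other fibers keep getting scheduled.
local processed = 0
for _, tuple in box.space.test:pairs() do
    -- ... process the tuple ...
    processed = processed + 1
    if processed % 1000 == 0 then
        fiber.yield()
    end
end
```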
Add row hex printing to log on verbose level. This would be useful during investigation of errors related to invalid msgpack packet arrival. Here's some example output:

```
2019-04-01 16:04:51.629 [31304] iproto V> Corrupted row is: 3D A5 41 55 99 18 A8 08 C2 40 2B BA 50 12 F6 C6 0B 27 66 1C
2019-04-01 16:04:51.629 [31304] iproto xrow.c:133 E> ER_INVALID_MSGPACK: Invalid MsgPack - packet header
```

Related to #4040
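As a side note, the verbose level mentioned in the commit message corresponds to log_level 6 in box.cfg (a sketch; 5 is info, 6 is verbose):

```lua
-- Enable verbose logging so the 'Corrupted row is: ...' hex dump
-- actually appears in the log.
box.cfg{log_level = 6}
```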
Simultaneous usage of one coio object from two or more fibers could lead to undefined behavior, as the coio routines replace the awaiting fiber (a data member) and stop the watcher without checking whether there are any other users of the coio object. Such behavior could lead to the applier invalid stream issue (#4040). The proposal is to disable reuse of an active coio by returning a fake EINPROGRESS error. Part of #980
A similar bug is seen on replica join. In case the master is under high load, it may write lots of transactions while sending the replica its snapshot. In this case, by the time the replica fetches the snapshot, the master will have a 'buffer' of already-written transactions, which it'll send to the replica as fast as possible (its own WAL won't limit its speed). The replica, however, has to write these transactions to its WAL and will fall behind more and more. So, now we're one step closer to a reproducer.
5 SP for investigation to estimate complexity.
Here's a reproducer for the issue, which we have thanks to @rosik and @droidroot1995. In order to run it, first start a master instance:

```lua
box.cfg{listen = 3301}
box.schema.user.grant('guest', 'super')
```

Then run the script below on another tarantool instance:

```lua
#!/usr/bin/env tarantool
local log = require('log')
local json = require('json')
local fiber = require('fiber')
local socket = require('socket')
local pickle = require('pickle')
local msgpack = require('msgpack')
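-- IPROTO request type codes: maps both code -> name and name -> code.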
local iproto_code = {
[0x00] = "iproto_ok", iproto_ok = 0x00,
[0x01] = "iproto_select", iproto_select = 0x01,
[0x02] = "iproto_insert", iproto_insert = 0x02,
[0x03] = "iproto_replace", iproto_replace = 0x03,
[0x04] = "iproto_update", iproto_update = 0x04,
[0x05] = "iproto_delete", iproto_delete = 0x05,
[0x06] = "iproto_call_16", iproto_call_16 = 0x06,
[0x07] = "iproto_auth", iproto_auth = 0x07,
[0x08] = "iproto_eval", iproto_eval = 0x08,
[0x09] = "iproto_upsert", iproto_upsert = 0x09,
[0x0a] = "iproto_call", iproto_call = 0x0a,
[0x0b] = "iproto_execute", iproto_execute = 0x0b,
[0x0c] = "iproto_nop", iproto_nop = 0x0c,
[0x0d] = "iproto_type_stat_max", iproto_type_stat_max = 0x0d,
[0x40] = "iproto_ping", iproto_ping = 0x40,
[0x41] = "iproto_join", iproto_join = 0x41,
[0x42] = "iproto_subscribe", iproto_subscribe = 0x42,
[0x43] = "iproto_request_vote", iproto_request_vote = 0x43,
}
local IPROTO = {
REQUEST_TYPE = 0x00,
SERVER_VERSION = 0x06,
INSTANCE_UUID = 0x24,
CLUSTER_UUID = 0x25,
VCLOCK = 0x26,
}
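-- Encode n as a MsgPack uint32: the 0xCE type tag followed by 4 big-endian bytes.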
local function mpenc_uint32(n)
return pickle.pack('bN', 0xCE, n)
end
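-- Connect to the master and skip the 128-byte IPROTO greeting.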
local s = socket.tcp_connect('127.0.0.1', 3301)
s:read(128)
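-- Send a raw IPROTO request: 5-byte length prefix, then header and body maps.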
function _G.req(code, body)
local header = msgpack.encode({
[IPROTO.REQUEST_TYPE] = assert(iproto_code[code]),
})
body = msgpack.encode(body)
s:write(mpenc_uint32(#header + #body))
s:write(header)
s:write(body)
end
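-- Read one IPROTO response; returns false once the peer closes the connection.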
function _G.recv()
local ok, buf = pcall(s.read, s, 5, 2)
if not ok or buf == nil or buf == '' then
log.debug('Peer closed')
return false
end
local size, pos = msgpack.decode(buf)
local ok, _tail = pcall(s.read, s, pos-1 + size - #buf)
if not ok or _tail == nil or _tail == '' then
log.info('Peer closed')
return false
else
buf = buf .. _tail
end
local header, pos = msgpack.decode(buf, pos)
local body = nil
if pos <= #buf then -- note: <=, the body may start at the last byte
body = msgpack.decode(buf, pos)
end
local code = header[0x00]
-- log.info('< %s %s', json.encode(header), json.encode(body))
return true
end
c = require('net.box').connect('localhost:3301')
c:eval([[
if not box.space._cluster:get(2) then
box.space._cluster:replace({2, package.loaded.uuid.str()})
end
box.schema.create_space('test', {if_not_exists = true})
box.space.test:create_index('pk', {if_not_exists = true})
]])
local vclock = c:eval([[return box.info.vclock]])
local instance_uuid = c:eval([[ return box.space._cluster:get(2)[2] ]])
local replicaset_uuid = c:eval([[return box.info.cluster.uuid]])
_G.s = s
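-- Subscribe to the master's replication stream, posing as replica id 2.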
function _G.subscribe()
return _G.req('iproto_subscribe', {
[IPROTO.VCLOCK] = setmetatable(vclock, {__serialize = 'map'}),
[IPROTO.SERVER_VERSION] = 0x20702,
[IPROTO.INSTANCE_UUID] = instance_uuid,
[IPROTO.CLUSTER_UUID] = replicaset_uuid,
})
end
subscribe()
recv()
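-- Make the master generate ~6 MiB of replication traffic (100 x 60 KiB rows)
-- while this fake replica isn't reading it.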
c:eval([[
for i = 1, 100 do
s = package.loaded.digest.urandom(30):hex() -- 60 bytes
s = string.rep(s, 1024) -- 60 KiB
box.space.test:replace({1, s})
end
]])
recv()
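-- Stop reading long enough for the master's replication timeout to fire
-- while its send buffer holds a partially written row.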
fiber.sleep(5)
while recv() do
end
require('console').start()
```

Once the script is run, you should see
in ~5 seconds. And the master's log should contain
It's now possible to figure out what's happening:
Nice!
Every error that happens while the master processes a join or subscribe request is sent to the replica for better diagnostics. This could lead to the following situation with the TimedOut error: it could be written on top of a half-written row and make the replica stop replication with an ER_INVALID_MSGPACK error. The error is unrecoverable, and the only way to resume replication after it happens is to reset box.cfg.replication. Here's what happened:

1) The replica is under heavy load, meaning its event loop is occupied by some fiber not yielding control to others.
2) The applier and other fibers aren't scheduled while the event loop is blocked. This means the applier doesn't send heartbeat messages to the master and doesn't read any data coming from the master.
3) The unread master's data piles up: first in the replica's receive buffer, then in the master's send buffer.
4) Once the master's send buffer is full, the corresponding socket stops being writeable and the relay yields, waiting for the socket to become writeable again. The send buffer might contain a partially written row by now.
5) A replication timeout happens on the master, because it hasn't heard from the replica for a while. An exception is raised, and the exception is pushed to the replica's socket. Now two situations are possible:
a) The socket becomes writeable by the time the exception is raised. In this case the exception is logged to the buffer right after a partially written row. Once the replica receives the half-written row with an exception logged on top, it errors with ER_INVALID_MSGPACK. Replication is broken.
b) The socket still isn't writeable (the most probable scenario). The exception isn't logged to the socket and the connection is closed. The replica eventually receives a partially written row and retries the connection to the master normally.

In order to prevent case a) from happening, let's not push TimedOut errors to the socket at all. They're the only errors that could be raised while a row is being written, i.e. the only errors that could lead to the situation described in 5a.

Closes #4040
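For reference, the reset mentioned above can be done from the replica's console (a minimal sketch; it assumes the original box.cfg.replication value is still valid):

```lua
-- Recover a replica stuck with ER_INVALID_MSGPACK: drop the broken
-- applier connection, then restore the original upstream list.
local orig = box.cfg.replication
box.cfg{replication = ''}
box.cfg{replication = orig}
```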
Tarantool version: 1.10.2-138-g5c0a36c (probably all 1.10.2 and 2.1)
Bug description:
We have a configuration with 1 master and 1 replica. When the replica is processing large selects (a select from a space with 70M entries), its applier may fail with any of the three errors below:
Replication is stopped after the error and can be resumed only by resetting box.cfg.replication to '' and back to the original config.

Master logs at the moment of failure:
Replica logs:
The problem looks similar to #3718 and #3910; both the master's and the replica's configs have force_recovery set to true.

Steps to reproduce: happens occasionally, no reproducer yet.