Relax error verbosity in calls and internal services #596
base: master
Conversation
Force-pushed from 26626f8 to a5202b7
Force-pushed from a5202b7 to ab14a41
Thank you for the patch! It's really cool that vshard will stop being cluttered with logs.
vshard/replicaset.lua
Outdated
    return false, nil, lerror.make(err)
else
    if replica.last_error ~= nil then
        log.info("Calling '%s' on '%s' succeeded", func, replica)
AFAIU, we want to display `Calling '%s' on '%s' succeeded` if the previous attempt to call function "A" on the master failed, but the current call of "A" succeeds. But if we run the following test:
logging_group.test_replica_call_not_spams_same_error = function(g)
    vtest.storage_stop(g.replica_1_a)
    -- Discovery is disabled so that it doesn't make any calls and trigger logs.
    local new_cfg = table.deepcopy(global_cfg)
    new_cfg.discovery_mode = 'off'
    local router = vtest.router_new(g, 'router', new_cfg)
    -- May be flaky, because socket may not be dead, when making calls.
    local msg = "Exception during calling 'assert'"
    router:assert_log_exactly_once(msg, {timeout = 0.5, on_yield = function()
        local _, rs = next(ivshard.router.internal.static_router.replicasets)
        rs:callrw('assert', {'a'}, {timeout = 0.1})
    end})
    vtest.storage_start(g.replica_1_a, global_cfg)
    msg = "Calling 'echo' on 'replica_1_a"
    router:assert_log_exactly_once(msg, {timeout = 0.5, on_yield = function()
        local _, rs = next(ivshard.router.internal.static_router.replicasets)
        rs:callrw('echo', nil, {timeout = 0.1})
    end})
    vtest.drop_instance(g, router)
end
we get `Calling 'echo' on '...' succeeded`, even though the `echo` function has never failed. I think it can be a bit misleading. Maybe we should change the `replica.last_error ~= nil` condition?
Now we don't print that `Succeeded ...` message anymore; instead we always know how many errors happened.
vshard/error.lua
Outdated
if (err1 and not err2) or (not err1 and err2) then
    return false
end
return err1.code == err2.code and err1.message == err2.message and
In #10506 we faced the problem that identical errors may have different messages, e.g. `Connection refused` / `No such file or directory` / `Broken pipe`. In these cases the `errors_are_same` function returns false while we expect true. I hope we cannot hit that in vshard, but I do occasionally see these kinds of errors. Maybe my patch #11727 fixes that for vshard too; otherwise we should implement more complex error checks.
I discovered one test where such errors may appear - `router-luatest/router_2_2_test.lua`. In this test, `unexpected eof while reading` SSL errors appear.
Logs:
2025-09-25 10:38:00.731 [24143] iproto ssl.c:359 E> SSL_read(16336), called on fd 47, aka unix/:(socket), peer of unix/:(socket): unexpected eof while reading {"type":"SSLError","trace":[{"file":"./src/lib/core/ssl.c","line":359}]}
2025-09-25 10:38:00.732 [24143] main/112/main/vshard.replicaset replicaset.lua:306 I> disconnected from unix/:/tmp/t/001_router-luatest/replica_1_a.iproto
2025-09-25 10:38:00.732 [24143] main/112/main/vshard.replicaset replicaset.lua:306 I> disconnected from unix/:/tmp/t/001_router-luatest/replica_2_a.iproto
2025-09-25 10:38:00.732 [24143] main/176/lua/vshard.replicaset replicaset.lua:1493 I> Old replicaset and replica objects are outdated.
2025-09-25 10:38:00.732 [24143] main/112/main/vshard.storage init.lua:3078 I> Wakeup the rebalancer
2025-09-25 10:38:00.733 [24143] main/112/main/vshard.storage init.lua:3078 I> Wakeup the rebalancer
2025-09-25 10:38:00.746 [24143] iproto ssl.c:359 E> SSL_read(16310), called on fd 50, aka unix/:(socket), peer of unix/:(socket): unexpected eof while reading {"type":"SSLError","trace":[{"file":"./src/lib/core/ssl.c","line":359}]}
2025-09-25 10:38:00.747 [24143] main/170/unix/:/tmp/t/001_router-luatest/replica_1_a.iproto (net.box) net_box.lua:442 E> fiber is cancelled {"type":"FiberIsCancelled","trace":[{"file":"builtin/box/net_box.lua","line":442}]}
2025-09-25 10:38:00.747 [24143] main/134/unix/:/tmp/t/001_router-luatest/replica_2_a.iproto (net.box) net_box.lua:442 E> fiber is cancelled {"type":"FiberIsCancelled","trace":[{"file":"builtin/box/net_box.lua","line":442}]}
2025-09-25 10:38:00.755 [24143] iproto ssl.c:359 E> SSL_read(16336), called on fd 48, aka unix/:(socket), peer of unix/:(socket): unexpected eof while reading {"type":"SSLError","trace":[{"file":"./src/lib/core/ssl.c","line":359}]}
2025-09-25 10:38:00.755 [24143] iproto ssl.c:359 E> SSL_read(16310), called on fd 52, aka unix/:(socket), peer of unix/:(socket): unexpected eof while reading {"type":"SSLError","trace":[{"file":"./src/lib/core/ssl.c","line":359}]}
These errors are not related to the router calling functions on the storages. But there is a non-zero probability that such errors can occur, for example, with `map_callrw` / `map_callro`, etc.
Agreed, let's not care about the message of the error and instead use only `type` and `code` in order to distinguish them. Parsing of the messages could be added in the future, but it's not essential now.
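A minimal sketch of such a comparison, assuming the error objects expose `type` and `code` fields like the unpacked vshard errors do (the helper name here is just for illustration):

-- Compare two errors ignoring the message: identical failures may come
-- with different texts (Connection refused / Broken pipe / ...).
local function errors_are_same(err1, err2)
    if err1 == nil and err2 == nil then
        return true
    end
    if err1 == nil or err2 == nil then
        return false
    end
    return err1.type == err2.type and err1.code == err2.code
end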
test/unit-luatest/error_test.lua
Outdated
t.assert_not(verror.are_same(e1, nil))
t.assert_not(verror.are_same(nil, e2))
t.assert_not(verror.are_same(e2, e3))
t.assert(verror.are_same(e1, e2))
Maybe we should add a test which checks that `Connection refused` / `No such file or directory` / `Broken pipe` errors are treated as duplicates? Each of these errors is essentially the same, but may have a different message.
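A hypothetical case sketching that check (the table-shaped errors and the shared code value are made up for illustration; the group variable is whatever the test file uses):

g.test_are_same_ignores_message = function()
    -- Both errors share the type and code here on purpose; only the
    -- human-readable text differs, as it does for the errors above.
    local e1 = {type = 'SystemError', code = 32, message = 'Connection refused'}
    local e2 = {type = 'SystemError', code = 32, message = 'Broken pipe'}
    t.assert(verror.are_same(e1, e2))
end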
We don't have that func (and consequently, the test) anymore
test/luatest_helpers/server.lua
Outdated
local errno = require('errno')
local log = require('log')
local yaml = require('yaml')
local uri = require('uri')
Nit: not needed
Yep, fixed
Thanks for the patchset!
Good that you are addressing this. The spam issue was obvious. I am surprised nobody complained about it until recently.
With the approach chosen in the code it looks to me like we are only masking the issue though. Relying solely on error object comparison to suppress logs feels like a temporary workaround for very specific cases, not a generic fix. For instance, we lose visibility on how many failures actually happened, because suppressed logs aren’t counted. We will also still face spam if the messages keep oscillating due to something minor changing in them - like a user’s function name, a timestamp, or other details. These exceptions can even be directly triggered by users on the storage. We can’t be sure they won’t differ even when we call the same function repeatedly.
A more robust solution would be rate limiting per log source. This way the system wouldn’t rely on the message contents at all, and we’d still know how many logs were suppressed.
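A rough sketch of what I mean, with invented names and intervals (not a proposal of the exact API): one limiter per log source that lets a message through at most once per interval and remembers how many were dropped.

local fiber = require('fiber')

local function limiter_new(interval)
    return {interval = interval, deadline = 0, suppressed = 0}
end

-- Returns whether the caller may log now and how many messages from this
-- source were suppressed since the last allowed one, so the caller can
-- append "(N similar messages suppressed)" to the logged line.
local function limiter_try(lim)
    local now = fiber.clock()
    if now >= lim.deadline then
        lim.deadline = now + lim.interval
        local dropped = lim.suppressed
        lim.suppressed = 0
        return true, dropped
    end
    lim.suppressed = lim.suppressed + 1
    return false, 0
end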
Regarding info logs, if I understood correctly, some of them are simply dropped. This feels risky, because we could lose visibility of what’s happening in the cluster - or whether anything is happening at all. I’d suggest applying aggregation or rate limiting to them too, with message formats adapted accordingly. For example, if an info log says “recovered 10 buckets”, it doesn’t make sense to rate limit that as-is. Instead, it should be reworked to accumulate stats (“N buckets recovered since last report”) and be logged not more often than every X seconds. This way we keep visibility while avoiding spam.
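For the "recovered 10 buckets" example, the aggregation could look roughly like this (the report period and names are invented):

local fiber = require('fiber')
local log = require('log')

local REPORT_INTERVAL = 60
local stats = {recovered = 0, last_report = fiber.clock()}

-- Accumulate per-iteration counters and report them at most once per
-- REPORT_INTERVAL seconds instead of logging every single iteration.
local function on_buckets_recovered(count)
    stats.recovered = stats.recovered + count
    local now = fiber.clock()
    if stats.recovered > 0 and now - stats.last_report >= REPORT_INTERVAL then
        log.info('%d buckets recovered since last report', stats.recovered)
        stats.recovered = 0
        stats.last_report = now
    end
end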
Force-pushed from cc0729d to e34b425
Very good. The new approach looks much more reliable. Good work 💪.
This commit introduces a module which can (and will) be used for deduplication of log messages. It does that according to the entries {type = <string>, code = <int>} passed to it. Using only the code is not enough for deduplicating error objects, since different types of errors may have the same codes. The module itself doesn't log the message, since then all logs would have the same line and file and a lot of information would be lost. So the module just says whether an entry can be printed. Note that for the `Suppressed ...` message to appear, `flush()` must be called. For that there is a background service which wakes up every consts.LOG_RATELIMIT_INTERVAL and flushes all saved limiters. Part of tarantool#578 Part of tarantool#583 Part of tarantool#588 NO_DOC=internal
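Roughly, usage of the module would look like this (the require path and the exact `can_log`/`flush` signatures are my guesses based on the names visible in the diff hunks below, not the final API):

local log = require('log')
local lratelimit = require('vshard.log_ratelimit') -- path is a guess

local limiter = lratelimit.create{name = 'replicaset.replica_1_a'}

-- Log an error only if the same {type, code} entry was not seen recently.
local function log_once(err)
    if lratelimit.can_log(limiter, {type = err.type, code = err.code}) then
        log.error('Request failed: %s', err)
    end
end

-- A background fiber wakes up every consts.LOG_RATELIMIT_INTERVAL and calls
-- flush() on each saved limiter, which emits the 'Suppressed ...' summary
-- for the entries that were rate limited.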
Users worry about the disk space which logs might consume in case of an instance failure, so we should relax the verbosity of the logs. `replica_call` in the replicaset module logged the error every time a request failed, which produced a lot of logs when an instance is down. From now on errors with the same type and code won't be logged again for 30 seconds. Part of tarantool#588 NO_DOC=bugfix
The discovery service used to emit a warning on every iteration while a replicaset was unreachable, flooding the logs with the same message. From now on this is no longer the case: identical errors are not printed repeatedly. Part of tarantool#588 NO_DOC=bugfix
The recovery, gc and rebalancer services used to emit an error on every iteration while an instance was unreachable, flooding the logs with the same message. From now on this is no longer the case: identical errors are not printed repeatedly. A new test file is introduced, since the tests are slow and I'd like them to run in parallel with the others. Part of tarantool#588 NO_DOC=bugfix
The failover and master search services used to emit a warning on every iteration while a replicaset was unreachable, flooding the logs with the same message. From now on this is no longer the case: identical errors are not printed repeatedly. The collect idle connections service doesn't need such functionality, since it doesn't log any errors. Closes tarantool#588 NO_DOC=bugfix
Force-pushed from e34b425 to ba33668
if not data or not data.limiter_name then
    return
end
if data.limiter == lratelimit.internal.limiters[data.limiter_name] then
`lratelimit.internal.limiters[data.limiter_name]` must be `lratelimit.get(name)` - it is used too often.
On reconfiguration of the storage/router it may happen that a service is disabled (e.g. master search), but its log ratelimiter continues to exist forever. The other problem is that the ratelimiter of a deleted replica suffers from the same issue. This commit fixes that by properly deleting the ratelimiters from the global table. Follow-up tarantool#588 NO_DOC=bugfix
Force-pushed from ba33668 to 04ce57b
-- Do not make users pay for what they do not use - when the search is
-- disabled for all replicasets, there should not be any fiber.
log.info('Master auto search is disabled')
for _, replicaset in pairs(router.replicasets) do
This was useless, because the loop goes over the newly created replicasets, which don't have the `master_search` service at all. We don't need that `worker:remove_service` at all.
else
    -- Do not forget to apply the upgrades, when M is changed.
    return M
end
Imagine we create a new version where something is changed. Then there will be stuff applied + we will have to remove the `return` so the functions all get updated. But then we do a rollback to an older version, hoping that the returned functions also get rolled back. And yet they won't, because in this older version we just do `return M`. So the rollback to an older version becomes impossible. I think this `else` condition can be just dropped entirely, no?
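For context, a simplified sketch of the reload pattern in question (the global name and version constant are placeholders); without the early return, the function definitions below always get re-assigned, so a reload to an older version restores the older code too:

local MODULE_INTERNALS = '__module_vshard_example' -- placeholder name
local MODULE_VERSION = 1

local M = rawget(_G, MODULE_INTERNALS)
if not M then
    -- First load: create the internals table from scratch.
    M = {module_version = MODULE_VERSION}
    rawset(_G, MODULE_INTERNALS, M)
elseif M.module_version ~= MODULE_VERSION then
    -- Apply upgrades from the previously loaded version here.
    M.module_version = MODULE_VERSION
end
-- No 'return M' in the else branch: fall through and re-assign the module
-- functions below on every reload.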
-- name = <string>,
-- }
--
local function ratelimit_can_log(limiter, new_entry)
Sorry, didn't write this before, but `can` looks like a "read-only" thing while it actually activates the limit. Perhaps a name like `touch_log()` or `try()` would make it look more like an action, hm?
local function ratelimit_drop(name)
    M.limiters[name] = nil
end
Do you think it would be a good idea to make the `limiters` table have weak refs? So the drop becomes not needed, and the limiters will be auto-deleted by GC when their owners are also deleted? I am afraid that in a language like Lua, where any line can throw an exception, it is too easy to miss this "drop" call.
Or the limiters could add themselves to this table when they have some logs to write. And would be popped out of there when their log queues become empty. Most of the time, I imagine, they would be empty anyway. Then an explicit drop also wouldn't be needed.
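For reference, the weak-ref variant is essentially a one-liner (the create body here is just a guess at the shape of the real one):

-- Values are weak: once nobody else references a limiter object, the GC
-- drops it from this table, so no explicit ratelimit_drop() is needed.
M.limiters = setmetatable({}, {__mode = 'v'})

local function ratelimit_create(opts)
    local limiter = {name = opts.name, entries = {}}
    M.limiters[opts.name] = limiter
    return limiter
end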
local name = 'discovery'
local service = lservice_info.new()
Why did you drop the service name?
local name = 'rebalancer'
local service = lservice_info.new()
Ditto. I see the name is dropped for some services and isn't for the others. What is the reason?
if ratelimit == lratelimit.internal.limiters[name] then
    lratelimit.drop(name)
end
I think this hack won't be needed if you follow up on my comment about limiters deleting themselves automatically.
local msg = "Error during recovery of bucket" | ||
g.replica_1_a:wait_log_exactly_once(msg, {timeout = 1, on_yield = function() | ||
ivshard.storage.recovery_wakeup() | ||
end}) |
Please, not a whole second just for this 😭. Can you make the periods smaller, so the timeout can also be reduced to like 0.1?
Same in the other super long places.
g.replica_1_a:exec(function(bids)
    ivconst.RECOVERY_GET_STAT_TIMEOUT = _G.old_timeout
    _G.old_timeout = nil
    for _, bid in ipairs(bids) do
        box.space._bucket:replace{bid, ivconst.BUCKET.GARBAGE}
        box.space._bucket:delete{bid}
    end
end, {bids})
Hm. But shouldn't you re-create these buckets? Otherwise the test leaves the cluster with just 27 buckets. And the next step does the same, so only 24 remain. I guess the tests still work, but this looks fragile.
data.limiter_name = name .. '.' .. replica.id
data.limiter = lratelimit.create{name = data.limiter_name}
Why do you need `limiter_name` stored in `data`? From what I see, it is passed once to `lratelimit.create()` and is never used again.
    end
end

local function service_drop_ratelimit_if_needed(service)
This entire commit, I guess, should disappear if you consider making the limiters auto-dropping.
Closes #588