Relax error verbosity in calls and internal services #596
base: master
Conversation
Force-pushed from 26626f8 to a5202b7
Force-pushed from a5202b7 to ab14a41
Thank you for the patch! It's really cool that vshard will stop being cluttered with logs.
vshard/replicaset.lua
Outdated
    return false, nil, lerror.make(err)
else
    if replica.last_error ~= nil then
        log.info("Calling '%s' on '%s' succeeded", func, replica)
AFAIU, we want to display `Calling '%s' on '%s' succeeded` if the previous attempt to call function "A" on the master failed, but the current call of "A" succeeds. But if we run the following test:
logging_group.test_replica_call_not_spams_same_error = function(g)
    vtest.storage_stop(g.replica_1_a)
    -- Discovery is disabled so that it doesn't make any calls and trigger logs.
    local new_cfg = table.deepcopy(global_cfg)
    new_cfg.discovery_mode = 'off'
    local router = vtest.router_new(g, 'router', new_cfg)
    -- May be flaky, because socket may not be dead, when making calls.
    local msg = "Exception during calling 'assert'"
    router:assert_log_exactly_once(msg, {timeout = 0.5, on_yield = function()
        local _, rs = next(ivshard.router.internal.static_router.replicasets)
        rs:callrw('assert', {'a'}, {timeout = 0.1})
    end})
    vtest.storage_start(g.replica_1_a, global_cfg)
    msg = "Calling 'echo' on 'replica_1_a"
    router:assert_log_exactly_once(msg, {timeout = 0.5, on_yield = function()
        local _, rs = next(ivshard.router.internal.static_router.replicasets)
        rs:callrw('echo', nil, {timeout = 0.1})
    end})
    vtest.drop_instance(g, router)
end
we get `Calling 'echo' on '...' succeeded`, even though the `echo` function has never failed. I think it can be a bit misleading. Maybe we should change the `replica.last_error ~= nil` condition?
Now we don't print that `Succeeded ...` message anymore; instead we always know how many errors happened.
vshard/error.lua
Outdated
if (err1 and not err2) or (not err1 and err2) then
    return false
end
return err1.code == err2.code and err1.message == err2.message and
In #10506 we faced the problem that identical errors may have different messages, e.g. `Connection refused` / `No such file or directory` / `Broken pipe`. In these cases the `errors_are_same` function returns false while we expect true. I hope we cannot hit that in vshard, but I do occasionally see these kinds of errors. Maybe my patch #11727 fixes that for vshard too; otherwise we should implement more complex error checks.
I discovered one test where such errors may appear - `router-luatest/router_2_2_test.lua`. In this test, `unexpected eof while reading` SSL errors appear.
Logs:
2025-09-25 10:38:00.731 [24143] iproto ssl.c:359 E> SSL_read(16336), called on fd 47, aka unix/:(socket), peer of unix/:(socket): unexpected eof while reading {"type":"SSLError","trace":[{"file":"./src/lib/core/ssl.c","line":359}]}
2025-09-25 10:38:00.732 [24143] main/112/main/vshard.replicaset replicaset.lua:306 I> disconnected from unix/:/tmp/t/001_router-luatest/replica_1_a.iproto
2025-09-25 10:38:00.732 [24143] main/112/main/vshard.replicaset replicaset.lua:306 I> disconnected from unix/:/tmp/t/001_router-luatest/replica_2_a.iproto
2025-09-25 10:38:00.732 [24143] main/176/lua/vshard.replicaset replicaset.lua:1493 I> Old replicaset and replica objects are outdated.
2025-09-25 10:38:00.732 [24143] main/112/main/vshard.storage init.lua:3078 I> Wakeup the rebalancer
2025-09-25 10:38:00.733 [24143] main/112/main/vshard.storage init.lua:3078 I> Wakeup the rebalancer
2025-09-25 10:38:00.746 [24143] iproto ssl.c:359 E> SSL_read(16310), called on fd 50, aka unix/:(socket), peer of unix/:(socket): unexpected eof while reading {"type":"SSLError","trace":[{"file":"./src/lib/core/ssl.c","line":359}]}
2025-09-25 10:38:00.747 [24143] main/170/unix/:/tmp/t/001_router-luatest/replica_1_a.iproto (net.box) net_box.lua:442 E> fiber is cancelled {"type":"FiberIsCancelled","trace":[{"file":"builtin/box/net_box.lua","line":442}]}
2025-09-25 10:38:00.747 [24143] main/134/unix/:/tmp/t/001_router-luatest/replica_2_a.iproto (net.box) net_box.lua:442 E> fiber is cancelled {"type":"FiberIsCancelled","trace":[{"file":"builtin/box/net_box.lua","line":442}]}
2025-09-25 10:38:00.755 [24143] iproto ssl.c:359 E> SSL_read(16336), called on fd 48, aka unix/:(socket), peer of unix/:(socket): unexpected eof while reading {"type":"SSLError","trace":[{"file":"./src/lib/core/ssl.c","line":359}]}
2025-09-25 10:38:00.755 [24143] iproto ssl.c:359 E> SSL_read(16310), called on fd 52, aka unix/:(socket), peer of unix/:(socket): unexpected eof while reading {"type":"SSLError","trace":[{"file":"./src/lib/core/ssl.c","line":359}]}
These errors are not related to the router calling functions on the storages. But there is a non-zero probability that such errors can occur, for example, with `map_callrw` / `map_callro`, etc.
Agreed, let's not care about the message of the error and instead use only `type` and `code` in order to distinguish them. Parsing of the messages could be added in the future, but it's not essential now.
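A minimal sketch of such a comparison, assuming the error objects expose `type` and `code` fields like the unpacked vshard errors do (the helper name here is just for illustration):

-- Compare two errors ignoring the message: identical failures may come
-- with different texts (Connection refused / Broken pipe / ...).
local function errors_are_same(err1, err2)
    if err1 == nil and err2 == nil then
        return true
    end
    if err1 == nil or err2 == nil then
        return false
    end
    return err1.type == err2.type and err1.code == err2.code
end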
test/unit-luatest/error_test.lua
Outdated
t.assert_not(verror.are_same(e1, nil))
t.assert_not(verror.are_same(nil, e2))
t.assert_not(verror.are_same(e2, e3))
t.assert(verror.are_same(e1, e2))
Maybe we should add a test which checks that `Connection refused` / `No such file or directory` / `Broken pipe` errors are treated as duplicates? Each of these errors is essentially the same, but may have a different message.
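A hypothetical case sketching that check (the table-shaped errors and the shared code value are made up for illustration; the group variable is whatever the test file uses):

g.test_are_same_ignores_message = function()
    -- Both errors share the type and code here on purpose; only the
    -- human-readable text differs, as it does for the errors above.
    local e1 = {type = 'SystemError', code = 32, message = 'Connection refused'}
    local e2 = {type = 'SystemError', code = 32, message = 'Broken pipe'}
    t.assert(verror.are_same(e1, e2))
end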
We don't have that func (and consequently, the test) anymore
test/luatest_helpers/server.lua
Outdated
local errno = require('errno')
local log = require('log')
local yaml = require('yaml')
local uri = require('uri')
Nit: not needed
Yep, fixed
Thanks for the patchset!
Good that you are addressing this. The spam issue was obvious. I am surprised nobody complained about it until recently.
With the approach chosen in the code it looks to me like we are only masking the issue though. Relying solely on error object comparison to suppress logs feels like a temporary workaround for very specific cases, not a generic fix. For instance, we lose visibility on how many failures actually happened, because suppressed logs aren’t counted. We will also still face spam if the messages keep oscillating due to something minor changing in them - like a user’s function name, a timestamp, or other details. These exceptions can even be directly triggered by users on the storage. We can’t be sure they won’t differ even when we call the same function repeatedly.
A more robust solution would be rate limiting per log source. This way the system wouldn’t rely on the message contents at all, and we’d still know how many logs were suppressed.
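A rough sketch of what I mean, with invented names and intervals (not a proposal of the exact API): one limiter per log source that lets a message through at most once per interval and remembers how many were dropped.

local fiber = require('fiber')

local function limiter_new(interval)
    return {interval = interval, deadline = 0, suppressed = 0}
end

-- Returns whether the caller may log now and how many messages from this
-- source were suppressed since the last allowed one, so the caller can
-- append "(N similar messages suppressed)" to the logged line.
local function limiter_try(lim)
    local now = fiber.clock()
    if now >= lim.deadline then
        lim.deadline = now + lim.interval
        local dropped = lim.suppressed
        lim.suppressed = 0
        return true, dropped
    end
    lim.suppressed = lim.suppressed + 1
    return false, 0
end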
Regarding info logs, if I understood correctly, some of them are simply dropped. This feels risky, because we could lose visibility of what’s happening in the cluster - or whether anything is happening at all. I’d suggest applying aggregation or rate limiting to them too, with message formats adapted accordingly. For example, if an info log says “recovered 10 buckets”, it doesn’t make sense to rate limit that as-is. Instead, it should be reworked to accumulate stats (“N buckets recovered since last report”) and be logged not more often than every X seconds. This way we keep visibility while avoiding spam.
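For the "recovered 10 buckets" example, the aggregation could look roughly like this (the report period and names are invented):

local fiber = require('fiber')
local log = require('log')

local REPORT_INTERVAL = 60
local stats = {recovered = 0, last_report = fiber.clock()}

-- Accumulate per-iteration counters and report them at most once per
-- REPORT_INTERVAL seconds instead of logging every single iteration.
local function on_buckets_recovered(count)
    stats.recovered = stats.recovered + count
    local now = fiber.clock()
    if stats.recovered > 0 and now - stats.last_report >= REPORT_INTERVAL then
        log.info('%d buckets recovered since last report', stats.recovered)
        stats.recovered = 0
        stats.last_report = now
    end
end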
Force-pushed from cc0729d to e34b425
Very good. The new approach looks much more reliable. Good work 💪.
This commit introduces a module which can (and will) be used for deduplication of log messages. It does that according to the entries {type = <string>, code = <int>} passed to it. Using only the code is not enough for deduplicating error objects, since different types of errors may have the same codes. The module itself doesn't log the message, since then all logs would have the same line and file and a lot of information would be lost. So the module just says whether an entry can be printed. Note that for the `Suppressed ...` message to appear, `flush()` must be called. For that there is a background service which wakes up every consts.LOG_RATELIMIT_INTERVAL and flushes all saved limiters. Part of tarantool#578 Part of tarantool#583 Part of tarantool#588 NO_DOC=internal
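Roughly, usage of the module would look like this (the require path and the exact `can_log`/`flush` signatures are my guesses based on the names visible in the diff hunks below, not the final API):

local log = require('log')
local lratelimit = require('vshard.log_ratelimit') -- path is a guess

local limiter = lratelimit.create{name = 'replicaset.replica_1_a'}

-- Log an error only if the same {type, code} entry was not seen recently.
local function log_once(err)
    if lratelimit.can_log(limiter, {type = err.type, code = err.code}) then
        log.error('Request failed: %s', err)
    end
end

-- A background fiber wakes up every consts.LOG_RATELIMIT_INTERVAL and calls
-- flush() on each saved limiter, which emits the 'Suppressed ...' summary
-- for the entries that were rate limited.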
Users worry about the disk space which logs might consume in case of an instance failure, so we should relax the verbosity of the logs. `replica_call` in the replicaset module logged the error every time a request failed, which produced a lot of logs when an instance is down. From now on errors with the same type and code won't be logged again for 30 seconds. Part of tarantool#588 NO_DOC=bugfix
The discovery service used to emit a warning on every iteration while a replicaset was unreachable, flooding the logs with the same message. From now on this is no longer the case: identical errors are not printed repeatedly. Part of tarantool#588 NO_DOC=bugfix
The recovery, gc and rebalancer services used to emit an error on every iteration while an instance was unreachable, flooding the logs with the same message. From now on this is no longer the case: identical errors are not printed repeatedly. A new test file is introduced, since the tests are slow and I'd like them to run in parallel with the others. Part of tarantool#588 NO_DOC=bugfix
The failover and master search services used to emit a warning on every iteration while a replicaset was unreachable, flooding the logs with the same message. From now on this is no longer the case: identical errors are not printed repeatedly. The collect idle connections service doesn't need such functionality, since it doesn't log any errors. Closes tarantool#588 NO_DOC=bugfix
Force-pushed from e34b425 to ba33668
if not data or not data.limiter_name then
    return
end
if data.limiter == lratelimit.internal.limiters[data.limiter_name] then
`lratelimit.internal.limiters[data.limiter_name]` must be `lratelimit.get(name)` - it is used too often.
On reconfiguration of the storage/router it may happen that a service is disabled (e.g. master search), but its log ratelimiter continues to exist forever. The other problem is that the ratelimiter of a deleted replica suffers from the same issue. This commit fixes that by properly deleting the ratelimiters from the global table. Follow-up tarantool#588 NO_DOC=bugfix
Force-pushed from ba33668 to 04ce57b
-- Do not make users pay for what they do not use - when the search is
-- disabled for all replicasets, there should not be any fiber.
log.info('Master auto search is disabled')
for _, replicaset in pairs(router.replicasets) do
This was useless, because the loop goes over the newly created replicasets, which don't have the `master_search` service at all. We don't need that `worker:remove_service` at all.
else
    -- Do not forget to apply the upgrades, when M is changed.
    return M
end
Imagine we create a new version where something is changed. Then there will be stuff applied + we will have to remove the `return` so the functions all get updated. But then we do a rollback to an older version, hoping that the returned functions also get rolled back. And yet they won't, because in this older version we just do `return M`. So the rollback to an older version becomes impossible. I think this `else` condition can be just dropped entirely, no?
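For context, a simplified sketch of the reload pattern in question (the global name and version constant are placeholders); without the early return, the function definitions below always get re-assigned, so a reload to an older version restores the older code too:

local MODULE_INTERNALS = '__module_vshard_example' -- placeholder name
local MODULE_VERSION = 1

local M = rawget(_G, MODULE_INTERNALS)
if not M then
    -- First load: create the internals table from scratch.
    M = {module_version = MODULE_VERSION}
    rawset(_G, MODULE_INTERNALS, M)
elseif M.module_version ~= MODULE_VERSION then
    -- Apply upgrades from the previously loaded version here.
    M.module_version = MODULE_VERSION
end
-- No 'return M' in the else branch: fall through and re-assign the module
-- functions below on every reload.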
-- name = <string>,
-- }
--
local function ratelimit_can_log(limiter, new_entry)
Sorry, didn't write this before, but `can` looks like a "read-only" thing while it actually activates the limit. Perhaps a name like `touch_log()` or `try()` would make it look more like an action, hm?
local function ratelimit_drop(name)
    M.limiters[name] = nil
end
Do you think it would be a good idea to make the `limiters` table have weak refs? So the drop becomes not needed, and the limiters will be auto-deleted by GC when their owners are also deleted? I am afraid that in a language like Lua, where any line can throw an exception, it is too easy to miss this "drop" call.
Or the limiters could add themselves to this table when they have some logs to write. And would be popped out of there when their log queues become empty. Most of the time, I imagine, they would be empty anyway. Then an explicit drop also wouldn't be needed.
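For reference, the weak-ref variant is essentially a one-liner (the create body here is just a guess at the shape of the real one):

-- Values are weak: once nobody else references a limiter object, the GC
-- drops it from this table, so no explicit ratelimit_drop() is needed.
M.limiters = setmetatable({}, {__mode = 'v'})

local function ratelimit_create(opts)
    local limiter = {name = opts.name, entries = {}}
    M.limiters[opts.name] = limiter
    return limiter
end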
local name = 'discovery'
local service = lservice_info.new()
Why did you drop the service name?
local name = 'rebalancer'
local service = lservice_info.new()
Ditto. I see the name is dropped for some services and isn't for the others. What is the reason?
if ratelimit == lratelimit.internal.limiters[name] then
    lratelimit.drop(name)
end
I think this hack won't be needed if you follow up on my comment about limiters deleting themselves automatically.
local msg = "Error during recovery of bucket" | ||
g.replica_1_a:wait_log_exactly_once(msg, {timeout = 1, on_yield = function() | ||
ivshard.storage.recovery_wakeup() | ||
end}) |
Please, not a whole second just for this 😭. Can you make the periods smaller, so the timeout can also be reduced to like 0.1?
Same in the other super long places.
g.replica_1_a:exec(function(bids)
    ivconst.RECOVERY_GET_STAT_TIMEOUT = _G.old_timeout
    _G.old_timeout = nil
    for _, bid in ipairs(bids) do
        box.space._bucket:replace{bid, ivconst.BUCKET.GARBAGE}
        box.space._bucket:delete{bid}
    end
end, {bids})
Hm. But shouldn't you re-create these buckets? Otherwise the test leaves the cluster with just 27 buckets. And the next step does the same, so only 24 remain. I guess the tests still work, but this looks fragile.
data.limiter_name = name .. '.' .. replica.id
data.limiter = lratelimit.create{name = data.limiter_name}
Why do you need `limiter_name` stored in `data`? From what I see, it is passed once to `lratelimit.create()` and is never used again.
    end
end

local function service_drop_ratelimit_if_needed(service)
This entire commit, I guess, should disappear if you consider making the limiters auto-dropping.
Closes #588