Conversation

Collaborator

@Serpentian Serpentian commented Sep 17, 2025

Closes #588

@Serpentian Serpentian changed the title Remove doubled error logs in services and calls Relax error verbosity in calls in services Sep 17, 2025
@Serpentian Serpentian force-pushed the gh-remove-doubled-log-errors branch 6 times, most recently from 26626f8 to a5202b7 Compare September 18, 2025 10:16
@Serpentian Serpentian changed the title Relax error verbosity in calls in services Relax error verbosity in calls and internal services Sep 18, 2025
@Serpentian Serpentian force-pushed the gh-remove-doubled-log-errors branch from a5202b7 to ab14a41 Compare September 18, 2025 10:25
@Serpentian Serpentian marked this pull request as ready for review September 18, 2025 10:25
Contributor

@mrForza mrForza left a comment


Thank you for the patch! It's really cool that vshard will stop being cluttered with logs.

return false, nil, lerror.make(err)
else
if replica.last_error ~= nil then
log.info("Calling '%s' on '%s' succeeded", func, replica)
Contributor


AFAIU, we want to display Calling '%s' on '%s' succeeded if the previous attempt to call the "A" function on the master failed, but the current call of "A" ends successfully. But if we run the next test:

logging_group.test_replica_call_not_spams_same_error = function(g)
    vtest.storage_stop(g.replica_1_a)
    -- Discovery is disabled so that it doesn't make any calls and trigger logs.
    local new_cfg = table.deepcopy(global_cfg)
    new_cfg.discovery_mode = 'off'
    local router = vtest.router_new(g, 'router', new_cfg)

    -- May be flaky, because the socket may not be dead when making calls.
    local msg = "Exception during calling 'assert'"
    router:assert_log_exactly_once(msg, {timeout = 0.5, on_yield = function()
        local _, rs = next(ivshard.router.internal.static_router.replicasets)
        rs:callrw('assert', {'a'}, {timeout = 0.1})
    end})

    vtest.storage_start(g.replica_1_a, global_cfg)
    msg = "Calling 'echo' on 'replica_1_a"
    router:assert_log_exactly_once(msg, {timeout = 0.5, on_yield = function()
        local _, rs = next(ivshard.router.internal.static_router.replicasets)
        rs:callrw('echo', nil, {timeout = 0.1})
    end})

    vtest.drop_instance(g, router)
end

we get Calling 'echo' on '...' succeeded while the echo function has never failed. I think it can be a bit misleading. Maybe we should change the replica.last_error ~= nil condition?

Collaborator Author


Now we don't print that 'Succeeded ...' message anymore; instead we always know how many errors happened.

vshard/error.lua Outdated
if (err1 and not err2) or (not err1 and err2) then
return false
end
return err1.code == err2.code and err1.message == err2.message and
Contributor


In #10506 we faced the problem that identical errors may have different messages, e.g. Connection refused / No such file or directory / Broken pipe. In these cases the function errors_are_same returns false (we expect true).

I hope that in vshard we cannot run into that, but I occasionally see these kinds of errors. Maybe my patch #11727 fixes that for vshard too. Otherwise we should implement more complex error checkers.

Contributor


I discovered one test where such errors may appear - router-luatest/router_2_2_test.lua. In this test, unexpected eof while reading SSL errors appear.

Logs:

2025-09-25 10:38:00.731 [24143] iproto ssl.c:359 E> SSL_read(16336), called on fd 47, aka unix/:(socket), peer of unix/:(socket): unexpected eof while reading {"type":"SSLError","trace":[{"file":"./src/lib/core/ssl.c","line":359}]}
2025-09-25 10:38:00.732 [24143] main/112/main/vshard.replicaset replicaset.lua:306 I> disconnected from unix/:/tmp/t/001_router-luatest/replica_1_a.iproto
2025-09-25 10:38:00.732 [24143] main/112/main/vshard.replicaset replicaset.lua:306 I> disconnected from unix/:/tmp/t/001_router-luatest/replica_2_a.iproto
2025-09-25 10:38:00.732 [24143] main/176/lua/vshard.replicaset replicaset.lua:1493 I> Old replicaset and replica objects are outdated.
2025-09-25 10:38:00.732 [24143] main/112/main/vshard.storage init.lua:3078 I> Wakeup the rebalancer
2025-09-25 10:38:00.733 [24143] main/112/main/vshard.storage init.lua:3078 I> Wakeup the rebalancer
2025-09-25 10:38:00.746 [24143] iproto ssl.c:359 E> SSL_read(16310), called on fd 50, aka unix/:(socket), peer of unix/:(socket): unexpected eof while reading {"type":"SSLError","trace":[{"file":"./src/lib/core/ssl.c","line":359}]}
2025-09-25 10:38:00.747 [24143] main/170/unix/:/tmp/t/001_router-luatest/replica_1_a.iproto (net.box) net_box.lua:442 E> fiber is cancelled {"type":"FiberIsCancelled","trace":[{"file":"builtin/box/net_box.lua","line":442}]}
2025-09-25 10:38:00.747 [24143] main/134/unix/:/tmp/t/001_router-luatest/replica_2_a.iproto (net.box) net_box.lua:442 E> fiber is cancelled {"type":"FiberIsCancelled","trace":[{"file":"builtin/box/net_box.lua","line":442}]}
2025-09-25 10:38:00.755 [24143] iproto ssl.c:359 E> SSL_read(16336), called on fd 48, aka unix/:(socket), peer of unix/:(socket): unexpected eof while reading {"type":"SSLError","trace":[{"file":"./src/lib/core/ssl.c","line":359}]}
2025-09-25 10:38:00.755 [24143] iproto ssl.c:359 E> SSL_read(16310), called on fd 52, aka unix/:(socket), peer of unix/:(socket): unexpected eof while reading {"type":"SSLError","trace":[{"file":"./src/lib/core/ssl.c","line":359}]}

These errors are not related to the router calling functions on the storages. But there is a non-zero probability that such errors can occur, for example, with map_callrw / map_callro, etc.

Collaborator Author


Agreed, let's not care about the error message and instead use only the type and code to distinguish errors. We can add parsing of the messages in the future, but for now it's not essential.
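
A minimal sketch of that idea (the helper name and the fields used are illustrative, not the exact vshard implementation):

local function errors_are_same(err1, err2)
    -- Illustrative only: treat two error objects as "the same" when their
    -- type and code match, ignoring the human-readable message.
    if err1 == nil or err2 == nil then
        return false
    end
    return err1.type == err2.type and err1.code == err2.code
end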

t.assert_not(verror.are_same(e1, nil))
t.assert_not(verror.are_same(nil, e2))
t.assert_not(verror.are_same(e2, e3))
t.assert(verror.are_same(e1, e2))
Contributor

@mrForza mrForza Sep 23, 2025


Maybe we should add a test which checks deduplication of Connection refused / No such file or directory / Broken pipe errors? These errors are effectively identical, but may have different messages.

Collaborator Author


We don't have that function (and consequently, that test) anymore.

local errno = require('errno')
local log = require('log')
local yaml = require('yaml')
local uri = require('uri')
Contributor


Nit: not needed

Collaborator Author


Yep, fixed

@mrForza mrForza assigned Serpentian and unassigned mrForza Sep 25, 2025
Collaborator

@Gerold103 Gerold103 left a comment


Thanks for the patchset!

Good that you are addressing this. The spam issue was obvious. I am surprised nobody complained about it until recently.

With the approach chosen in the code it looks to me like we are only masking the issue though. Relying solely on error object comparison to suppress logs feels like a temporary workaround for very specific cases, not a generic fix. For instance, we lose visibility on how many failures actually happened, because suppressed logs aren’t counted. We will also still face spam if the messages keep oscillating due to something minor changing in them - like a user’s function name, a timestamp, or other details. These exceptions can even be directly triggered by users on the storage. We can’t be sure they won’t differ even when we call the same function repeatedly.

A more robust solution would be rate limiting per log source. This way the system wouldn’t rely on the message contents at all, and we’d still know how many logs were suppressed.

Regarding info logs, if I understood correctly, some of them are simply dropped. This feels risky, because we could lose visibility of what’s happening in the cluster - or whether anything is happening at all. I’d suggest applying aggregation or rate limiting to them too, with message formats adapted accordingly. For example, if an info log says “recovered 10 buckets”, it doesn’t make sense to rate limit that as-is. Instead, it should be reworked to accumulate stats (“N buckets recovered since last report”) and be logged not more often than every X seconds. This way we keep visibility while avoiding spam.
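
To illustrate the aggregation idea with a rough sketch (the names and the interval value are invented for the example, assuming Tarantool's fiber and log modules):

local fiber = require('fiber')
local log = require('log')

-- Illustrative only: accumulate a counter and report it at most once per
-- interval. REPORT_INTERVAL and on_bucket_recovered() are hypothetical
-- names, not vshard API.
local REPORT_INTERVAL = 60
local recovered_since_report = 0
local last_report = fiber.clock()

local function on_bucket_recovered()
    recovered_since_report = recovered_since_report + 1
    if fiber.clock() - last_report >= REPORT_INTERVAL then
        log.info('%d buckets recovered since last report',
                 recovered_since_report)
        recovered_since_report = 0
        last_report = fiber.clock()
    end
end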

@Serpentian Serpentian force-pushed the gh-remove-doubled-log-errors branch 10 times, most recently from cc0729d to e34b425 Compare October 13, 2025 16:49
@Serpentian Serpentian assigned mrForza and unassigned Serpentian Oct 13, 2025
Collaborator

@Gerold103 Gerold103 left a comment


Very good. The new approach looks much more reliable. Good work 💪.

This commit introduces a module which can (and will) be used for
deduplication of log messages. It does that according to the entries
{type = <string>, code = <int>} passed to it. Using only the code is not
enough for deduplicating error objects, since different types of errors
may have the same codes.

The module itself doesn't log the message, since then all logs would have
the same line and file and a lot of information would be lost. So the
module just says whether an entry can be printed.

Note that for the `Suppressed ...` message to appear, `flush()` must be
called. For that there is a background service, which wakes up every
consts.LOG_RATELIMIT_INTERVAL and flushes all saved limiters.

Part of tarantool#578
Part of tarantool#583
Part of tarantool#588

NO_DOC=internal
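
A rough sketch of how such a limiter could be used at a call site. The module path and the method names here are guessed from this discussion (lratelimit.create is mentioned below; the rest is assumed), so this is not the real vshard API:

local log = require('log')
-- Hypothetical module path, only for illustration.
local lratelimit = require('vshard.log_ratelimit')

local limiter = lratelimit.create{name = 'replicaset.replica_call'}

local function log_call_error(err)
    -- The limiter decides by {type, code} whether an identical error was
    -- reported recently; the actual log call stays at the call site, so
    -- the line and file in the log remain meaningful. Exposing can_log as
    -- a method is an assumption.
    if limiter:can_log({type = err.type, code = err.code}) then
        log.error('Exception during call: %s', err)
    end
end

-- A background service is expected to flush all saved limiters every
-- consts.LOG_RATELIMIT_INTERVAL so that the 'Suppressed ...' summary with
-- the number of skipped duplicates gets printed.
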
Users worry about the disk space that logs might consume in case of an
instance failure, so we should relax the verbosity of the logs.

`replica_call` in the replicaset module logged the error every time a
request failed, which produced a lot of logs when an instance is down.
From now on, errors with the same type and code are not logged again for
30 seconds.

Part of tarantool#588

NO_DOC=bugfix
The discovery service used to emit a warning on every iteration while a
replicaset was unreachable, flooding the logs with the same message. From
now on this won't be the case: identical errors are not printed anymore.

Part of tarantool#588

NO_DOC=bugfix
The recovery, gc and rebalancer services used to emit an error on every
iteration while an instance was unreachable, flooding the logs with the
same message. From now on this won't be the case: identical errors are
not printed anymore.

A new test file is introduced, since the tests are slow and I'd like them
to run in parallel with the others.

Part of tarantool#588

NO_DOC=bugfix
The failover and master search services used to emit a warning on every
iteration while a replicaset was unreachable, flooding the logs with the
same message. From now on this won't be the case: identical errors are
not printed anymore.

The collect idle connections service doesn't need such functionality,
since it doesn't log any errors.

Closes tarantool#588

NO_DOC=bugfix
@Serpentian Serpentian force-pushed the gh-remove-doubled-log-errors branch from e34b425 to ba33668 Compare October 16, 2025 18:17
if not data or not data.limiter_name then
return
end
if data.limiter == lratelimit.internal.limiters[data.limiter_name] then
Collaborator Author


lratelimit.internal.limiters[data.limiter_name] must be lratelimit.get(name); it is used too often.

On reconfiguration of the storage/router it may happen that a service is
disabled (e.g. master search), but its log ratelimiter continues to exist
forever. Another problem is that the ratelimiter of a deleted replica
suffers from the same issue.

This commit fixes that by properly deleting the ratelimiters from the
global table.

Follow-up tarantool#588

NO_DOC=bugfix
@Serpentian Serpentian force-pushed the gh-remove-doubled-log-errors branch from ba33668 to 04ce57b Compare October 16, 2025 18:31
@Serpentian Serpentian requested a review from Gerold103 October 16, 2025 18:43
-- Do not make users pay for what they do not use - when the search is
-- disabled for all replicasets, there should not be any fiber.
log.info('Master auto search is disabled')
for _, replicaset in pairs(router.replicasets) do
Collaborator Author


This was useless, because the loop goes over the newly created replicasets, which don't have the master_search service at all. We don't need that worker:remove_service at all.

Comment on lines +26 to +29
else
-- Do not forget to apply the upgrades, when M is changed.
return M
end
Collaborator


Imagine we create a new version where something is changed. Then the upgrade stuff will be applied, plus we will have to remove the return so that all the functions get updated. But then we do a rollback to an older version, hoping that the returned functions also get rolled back. And yet they won't, because in this older version we just do return M. So the rollback to an older version becomes impossible. I think this else condition can be just dropped entirely, no?
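
For illustration, a minimal sketch of the guard without the early return (the module and function names are invented, assuming the usual Tarantool hot-reload pattern of keeping state in a global table):

-- The state table survives module reloads, while the functions below are
-- always redefined, so both an upgrade and a rollback pick up the code of
-- the version that is currently loaded.
local MODULE_NAME = '__module_example'
local M = rawget(_G, MODULE_NAME)
if M == nil then
    -- First load: create the persistent state table.
    M = {counter = 0}
    rawset(_G, MODULE_NAME, M)
end
-- No `else return M` here: fall through and (re)define all functions.

local function bump()
    M.counter = M.counter + 1
    return M.counter
end

return {bump = bump}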

-- name = <string>,
-- }
--
local function ratelimit_can_log(limiter, new_entry)
Collaborator


Sorry, I didn't write this before, but `can` looks like a "read-only" thing while it actually activates the limit. Perhaps a name like touch_log() or try() would make it look more like an action, hm?

Comment on lines +158 to +160
local function ratelimit_drop(name)
M.limiters[name] = nil
end
Collaborator


Do you think it would be a good idea to make the limiters table have weak refs? So the drop becomes not needed, and the limiters will be auto-deleted by GC when their owners are also deleted? I am afraid that in a language like Lua, where any line can throw an exception, it is too easy to miss this "drop" call.

Or the limiters could add themselves to this table when they have some logs to write. And would be popped out of there when their log queues become empty. Most of the time, I imagine, they would be empty anyway. Then an explicit drop also wouldn't be needed.
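
For reference, a weak-valued table in Lua would look roughly like this (a sketch with made-up names, not the vshard code):

-- Values in a weak-valued table don't keep limiters alive: once the owner
-- drops its last strong reference, the entry disappears on the next GC
-- cycle, so no explicit drop() call is needed.
local limiters = setmetatable({}, {__mode = 'v'})

local function get_or_create(name)
    local limiter = limiters[name]
    if limiter == nil then
        limiter = {name = name, entries = {}}
        limiters[name] = limiter
    end
    return limiter
end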

Comment on lines +419 to +420
local name = 'discovery'
local service = lservice_info.new()
Collaborator


Why did you drop the service name?

Comment on lines +2922 to +2923
local name = 'rebalancer'
local service = lservice_info.new()
Collaborator


Ditto. I see the name is dropped for some services and isn't for the others. What is the reason?

Comment on lines +1112 to +1114
if ratelimit == lratelimit.internal.limiters[name] then
lratelimit.drop(name)
end
Collaborator


I think this hack won't be needed if you follow on my comment about limiters deleting themselves automatically.

Comment on lines +62 to +65
local msg = "Error during recovery of bucket"
g.replica_1_a:wait_log_exactly_once(msg, {timeout = 1, on_yield = function()
ivshard.storage.recovery_wakeup()
end})
Collaborator


Please, not a whole second just for this 😭. Can you make the periods smaller, so the timeout can also be reduced to like 0.1?

Same in the other super long places.

Comment on lines +68 to +75
g.replica_1_a:exec(function(bids)
ivconst.RECOVERY_GET_STAT_TIMEOUT = _G.old_timeout
_G.old_timeout = nil
for _, bid in ipairs(bids) do
box.space._bucket:replace{bid, ivconst.BUCKET.GARBAGE}
box.space._bucket:delete{bid}
end
end, {bids})
Collaborator


Hm. But shouldn't you re-create these buckets? Otherwise the test leaves the cluster with just 27 buckets. And the next step does the same, so only 24 remain. I guess the tests still work, but this looks fragile.

Comment on lines +1786 to +1787
data.limiter_name = name .. '.' .. replica.id
data.limiter = lratelimit.create{name = data.limiter_name}
Collaborator


Why do you need limiter_name stored in data? From what I see, it is passed to lratelimit.create() just once and never used again.

end
end

local function service_drop_ratelimit_if_needed(service)
Collaborator


This entire commit, I guess, should disappear if you consider making the limiters auto-dropping.



Development

Successfully merging this pull request may close these issues.

Internal services should not spam the same errors in the logs on every iteration
