DNS_Mgr: Fixes around timeouts and IO loop behavior #3273

Merged
merged 2 commits into from Sep 5, 2023

@awelzel awelzel commented Sep 4, 2023

On Slack we were discussing the reproducer below, which causes memory growth due to unfulfilled when statements related to DNS_Mgr, lookup_addr(), ...

The two commits in this PR fix the state growth.

The most problematic scenario is triggered when 20 (MAX_PENDING_REQUESTS) DNS requests are pending but none of them ever receives a response (so there is no future FD activity). Because timeout handling was essentially non-existent, any future lookup_addr() calls would never be resolved.
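
To make the stall concrete, here is a minimal toy model in C++. It is not Zeek's DNS_Mgr: the class `ToyResolver`, the helper `UnresolvedAfter`, and the 5-second timeout are hypothetical names and values chosen for illustration. Only the cap of 20 in-flight requests is taken from the description above. The sketch assumes that a slot is freed only when a timeout is actually processed, and that the server never answers.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Toy model only. Mirrors the MAX_PENDING_REQUESTS value named in the PR text.
constexpr std::size_t kMaxPending = 20;

struct Request {
    int id;
    double deadline;  // absolute time at which the request should give up
};

class ToyResolver {
public:
    explicit ToyResolver(double timeout_secs) : timeout_(timeout_secs) {}

    // Queue a lookup; it is issued as soon as an in-flight slot is free.
    void Lookup(int id, double now) {
        queued_.push_back(id);
        IssueQueued(now);
    }

    // Drop every in-flight request whose deadline has passed, then refill
    // the freed slots from the queue. If nobody calls this, nothing frees a slot.
    void ExpireTimeouts(double now) {
        std::vector<Request> still_pending;
        for (const Request& r : in_flight_) {
            if (r.deadline <= now)
                ++timed_out_;
            else
                still_pending.push_back(r);
        }
        in_flight_.swap(still_pending);
        IssueQueued(now);
    }

    std::size_t InFlight() const { return in_flight_.size(); }
    std::size_t Queued() const { return queued_.size(); }
    std::size_t TimedOut() const { return timed_out_; }

private:
    void IssueQueued(double now) {
        while (!queued_.empty() && in_flight_.size() < kMaxPending) {
            in_flight_.push_back({queued_.front(), now + timeout_});
            queued_.pop_front();
        }
    }

    double timeout_;
    std::deque<int> queued_;
    std::vector<Request> in_flight_;
    std::size_t timed_out_ = 0;
};

// Issue n lookups at t=0 against a server that never answers and return how
// many are still unresolved after `seconds`. With run_timeouts == false the
// expiry step is never driven; otherwise it runs once per simulated second.
std::size_t UnresolvedAfter(std::size_t n, bool run_timeouts, double seconds) {
    ToyResolver r(5.0);  // assumed 5-second timeout
    for (std::size_t i = 0; i < n; ++i)
        r.Lookup(static_cast<int>(i), 0.0);
    if (run_timeouts)
        for (double t = 1.0; t <= seconds; t += 1.0)
            r.ExpireTimeouts(t);
    return r.InFlight() + r.Queued();
}
```

In this model, `UnresolvedAfter(100, false, 30.0)` leaves all 100 lookups unresolved (20 in flight, 80 queued), while driving the expiry step drains them in batches of 20.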

@timwoj, I think this is 6.0.1 material, or at least 6.0.2 if that ship has sailed. Unfortunately, I'm not sure how best to test this automatically without significant effort.


I think I have a better reproducer, because it is possibly a bit more realistic (DNS server restarting). It seems there's more going on.

  • Replace 192.168.0.1 above with 10.0.0.1. This results in NXDOMAIN responses every 5 seconds on my system; in between, the cache is used. No memory growth.
  • Stop the local resolver: systemctl stop systemd-resolved
  • Wait until the cache timeout expires, then observe memory growth, even after the 5-second DNS timeout should have kicked in.
  • Restart the resolver. Expected: memory growth stops because DNS is available again. Actual: it doesn't recover and memory keeps growing.

Sketch of script for reproduction:

redef exit_only_after_terminate=T;

global s = F;

global x: table[addr] of string;

event stats_tick() {
        local stats = get_proc_stats();
        print "stats", network_time(), stats$mem, get_dns_stats();
        schedule 1sec { stats_tick() };  # default DNS timeout
}

function f(a: addr) {
        when [a] ( local name = lookup_addr(a) )
                {
                x[a] = cat_sep(" ", "", network_time(), a, name);
                }
}

event tick() {
        f(10.0.0.1);
        schedule 10msec { tick() };
}

event zeek_init()
        {
        event tick();
        event stats_tick();
        }

Not sure, must have been some sort of left-over, but wasn't really
effective due to Process() not being implemented.

DNS_Mgr has a GetNextTimeout() implementation that may return 0.0. When
that is the case, its IO source is enqueued as ready with an fd of -1.
This in turn results in Process() being called instead of ProcessFd()
in RunState.cc.

Ensure timeout behavior is properly handled by actually forwarding
timeout indications to c-ares via DNS_Mgr::Process(). This results
in pending DNS queries for which a timeout happened actually timing
out (when there's no other connectivity).
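
The dispatch behavior described in this commit message can be sketched with a small standalone model. This is not Zeek's code: the `IOSource` interface, `ReadySource`, `RunOnce`, and `ToyDnsSource` below are hypothetical, with method names borrowed from the text above and signatures assumed. The `forward_timeouts` flag is an invented switch that contrasts a no-op Process() with one that expires pending queries. In c-ares, calling `ares_process_fd()` with `ARES_SOCKET_BAD` for both descriptors is a documented way to process timeouts without socket activity; whether the merged change uses exactly that call is not stated here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical I/O source interface; signatures are assumptions.
class IOSource {
public:
    virtual ~IOSource() = default;
    // 0.0 means "ready right now"; a negative value means "no timeout pending".
    virtual double GetNextTimeout() = 0;
    virtual void Process() = 0;          // wakeup caused by a timeout, no fd
    virtual void ProcessFd(int fd) = 0;  // wakeup caused by fd activity
};

struct ReadySource {
    IOSource* src;
    int fd;  // -1 when the source became ready through its timeout
};

// One loop iteration: sources reporting a 0.0 timeout are enqueued with
// fd -1, then every ready entry is dispatched to Process() or ProcessFd().
void RunOnce(const std::vector<IOSource*>& sources,
             const std::vector<ReadySource>& fd_ready) {
    std::vector<ReadySource> ready = fd_ready;
    for (IOSource* s : sources)
        if (s->GetNextTimeout() == 0.0)
            ready.push_back({s, -1});

    for (const ReadySource& r : ready) {
        if (r.fd == -1)
            r.src->Process();
        else
            r.src->ProcessFd(r.fd);
    }
}

class ToyDnsSource : public IOSource {
public:
    ToyDnsSource(int pending, bool forward_timeouts)
        : pending_(pending), forward_timeouts_(forward_timeouts) {}

    double GetNextTimeout() override { return pending_ > 0 ? 0.0 : -1.0; }

    void Process() override {
        calls_.push_back("Process");
        if (!forward_timeouts_)
            return;  // no-op variant: the timeout indication is lost
        // A real resolver would hand the timeout to its DNS library here.
        // The toy simply expires every pending query.
        timed_out_ += pending_;
        pending_ = 0;
    }

    void ProcessFd(int fd) override {
        (void)fd;  // unused in the toy
        calls_.push_back("ProcessFd");
        if (pending_ > 0)
            --pending_;  // pretend one response arrived
    }

    int Pending() const { return pending_; }
    int TimedOut() const { return timed_out_; }
    const std::vector<std::string>& Calls() const { return calls_; }

private:
    int pending_;
    bool forward_timeouts_;
    int timed_out_ = 0;
    std::vector<std::string> calls_;
};
```

With `forward_timeouts` set to false, a `RunOnce` pass over a source holding 20 pending queries calls Process() but leaves all 20 pending; with it set to true, the same pass expires them.
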
@awelzel awelzel mentioned this pull request Sep 5, 2023
15 tasks

@timwoj timwoj left a comment

I agree this is a good idea for 6.0.1.

@awelzel awelzel merged commit 1441b83 into master Sep 5, 2023
29 of 31 checks passed
@awelzel awelzel deleted the topic/awelzel/dns-mgr-fixes branch September 5, 2023 17:57
@zeek-bot
zeek-bot commented Sep 6, 2023

This pull request has been mentioned on Zeek. There might be relevant details there:

https://community.zeek.org/t/zeek-is-consuming-100-ram-memory/7128/4
