imrelp FD leak #4005

Closed · nerdquadrat opened this issue Dec 2, 2019 · 9 comments · Fixed by #4240

@nerdquadrat

Expected behavior

Open file descriptors are recycled when no longer in use, and the overall number of FDs roughly matches the number of currently open file handles plus (network) sockets.

Actual behavior

Open descriptors of the rsyslogd process rise by roughly 1440/h (with two load balancers probing the open ports every 5 seconds), while the actual number of connections (418) and open files (89) stays roughly the same. Most lines of the lsof output look like this:
rsyslogd 15288 root 7529u sock 0,7 0t0 139507485 protocol: TCP

Steps to reproduce the behavior

  1. configure imrelp for incoming messages
  2. connect to the port and close the connection; the descriptor leak of the TCP connection appears to be independent of whether the RELP protocol is correctly initialized or the connection is simply closed (as our balancers currently do)

Environment

  • rsyslog version: rsyslogd 8.1911.0 (aka 2019.11)
  • platform: x86_64-redhat-linux-gnu
  • librelp version: 1.4.0-1.el6
  • simplified config:

    ruleset( name="output" ) {
        action(
            name="a_outlog"
            type="omfile"
            File="/tmp/out.log"
        )
        stop
    }

    module(load="imrelp")
    input(type="imrelp" port="2510" ruleset="output")

Connecting with netcat 100 times will then permanently leak 100 descriptors:
for i in {1..100} ; do nc -z 127.0.0.1 2510 ; done

Doing the same with the imtcp module does NOT change the number of used descriptors (only once after starting, as expected, since unused descriptors are recycled by imtcp).
Since it seems irrelevant to the FD leak whether a RELP protocol handshake actually happened, I filed this issue under the rsyslog project rather than librelp.
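
A minimal sketch for watching the leak grow (standard Linux tools; assuming a single rsyslogd instance, so pgrep -o picks the right process):

#!/bin/bash
# Count the anonymous "protocol: TCP" socket entries held by rsyslogd.
# The leaked descriptors show up in lsof with TYPE "sock" and no peer address.
PID=$(pgrep -o rsyslogd)
lsof -p "$PID" 2>/dev/null | awk '$5 == "sock" && /protocol: TCP/' | wc -l

Running this before and after the netcat loop above should show the count rise by roughly the number of connections made.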

@rgerhards rgerhards added this to the v8.2001 milestone Dec 2, 2019
@falon

falon commented Jan 8, 2020

We have this problem too. We connect to the port and close the connection for rudimentary monitoring... and the descriptors grow indefinitely until rsyslog stops working.

@rgerhards rgerhards modified the milestones: v8.2001, v8.2002 Jan 12, 2020
@rgerhards rgerhards modified the milestones: v8.2002, v8.2004 Feb 28, 2020
@nerdquadrat
Author

Since this has been postponed for a few months now and I know of others who have faced this issue, here is the workaround currently working for us (in a slightly different form, running as a cron job multiple times a day):

[ $(/usr/sbin/lsof -p $(/usr/bin/pgrep rsyslogd) | /usr/bin/grep TCP | /usr/bin/egrep -v "ESTABLISHED|LISTEN" | /usr/bin/wc -l) -gt 3000 ] && /usr/sbin/rsyslogd -N1 && service rsyslog restart

Note: lsof could show those TCP handles in CLOSE_WAIT state with -iTCP -s "TCP:^LISTEN" -s "TCP:^ESTABLISHED", but then leaked handles like the following would not be counted:

rsyslogd 24831 root 1978u sock 0,7 0t0 279860656 protocol: TCP

With two load balancers checking the port and a 32k open-files limit, this check restarts rsyslog twice a day.
Because the cron job checks the number of leaked file handles before restarting, the restarts will stop on their own as soon as a fixed version is in place, so this workaround would also help with future FD leaks if they happen.
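
For reference, the same workaround written out as a standalone cron script (the threshold and service name are examples to adapt to the local FD limit, not part of the original one-liner):

#!/bin/bash
# Restart rsyslog only when the number of leaked TCP handles exceeds a threshold.
THRESHOLD=3000

leaked=$(/usr/sbin/lsof -p "$(/usr/bin/pgrep rsyslogd)" \
    | /usr/bin/grep TCP \
    | /usr/bin/egrep -v "ESTABLISHED|LISTEN" \
    | /usr/bin/wc -l)

if [ "$leaked" -gt "$THRESHOLD" ]; then
    # Validate the configuration first so a broken config never causes a restart loop.
    /usr/sbin/rsyslogd -N1 && service rsyslog restart
fi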

@StrongestNumber9
Contributor

We ran into this problem as well and can reproduce it reliably with the latest rsyslog (8.2002.0 at the time of writing):

Expected behavior

FDs are released when RELP connections are forcibly terminated.

Actual behavior

FDs get left open:

# ls -l /proc/11/fd|head -10
total 0
lr-x------. 1 root root 64 Mar 30 16:40 0 -> /dev/urandom
lrwx------. 1 root root 64 Mar 30 16:40 1 -> socket:[166569908]
lrwx------. 1 root root 64 Mar 30 16:40 10 -> socket:[166569914]
lrwx------. 1 root root 64 Mar 30 16:40 100 -> socket:[166573526]
lrwx------. 1 root root 64 Mar 30 16:40 101 -> socket:[166573527]
lrwx------. 1 root root 64 Mar 30 16:40 102 -> socket:[166573528]
lrwx------. 1 root root 64 Mar 30 16:40 103 -> socket:[166573529]
lrwx------. 1 root root 64 Mar 30 16:40 104 -> socket:[166573530]
lrwx------. 1 root root 64 Mar 30 16:40 105 -> socket:[166573531]
# ls -l /proc/11/fd|wc -l
996

Output from reproduce.sh

rsyslogd: version 8.2002.0, config validation run (level 9), master config /spool.conf
rsyslogd: End of config validation run. Bye.
Waiting for threads to finish
Threads finished, sleeping 10s before counting open filehandles..
Open files at the end: 989

Steps to reproduce the behavior

spool.conf

$LocalHostName spool
$AbortOnUncleanConfig on
$PreserveFQDN on

global(
    workDirectory="/workdir"
    maxMessageSize="256k"
)

main_queue(queue.type="Direct")
module(load="imrelp")
input(
    type="imrelp"
    name="imrelp"
    port="601"
    ruleset="spool"
    MaxDataSize="256k"
)

ruleset(name="spool" queue.type="direct") {
}

# Just so rsyslog doesn't whine that we do not have outputs
ruleset(name="noop" queue.type="direct") {
    action(
        type="omfile"
        name="omfile"
        file="/spool/spool.log"
    )
}

reproduce.sh

#!/bin/bash
$RSYSLOGD -f /spool.conf -N9
$RSYSLOGD -f /spool.conf -i /workdir/rsyslogd.pid

sleep 2;
# How many tcpfloods we run at the same time
for ((i=1;i<=10;i++)); do
    (
        # How many times tcpflood runs in each threads
        for ((y=1;y<=100;y++)); do 
            (
                $TCPFLOOD -T relp-plain -t 127.0.0.1 -p 601 -s -m 1000000
            ) &
            # Give it time to actually connect
            sleep 2;
            kill -9 $!;
        done;
    ) >/dev/null 2>&1 &
done;
echo "Waiting for threads to end";
wait;
echo "Open files at the end: $(ls -l "/proc/$(cat /workdir/rsyslogd.pid)/fd" | wc -l)";

Environment

# cat /etc/centos-release
CentOS Linux release 7.7.1908 (Core)

# rpm -qa "rsyslog*"
rsyslog-relp-8.2002.0-1.el7.x86_64
rsyslog-8.2002.0-1.el7.x86_64

@alorbach
Member

@StrongestNumber9 I will try to create a test based on your configuration.

@alorbach
Member

alorbach commented Apr 3, 2020

@StrongestNumber9: I have successfully built a test based on this config and could reproduce an unexpected problem, which I will investigate next week.

alorbach added a commit to alorbach/librelp that referenced this issue Apr 8, 2020

In the iRet handler of relpSessSendResponse, the sendbuf
was freed if iRet returned a failure.

However, if the error RELP_RET_IO_ERR happened in relpSendqAddBuf,
sendbuf was already assigned to a relpSendqe_t. As a result,
sendbuf was double-freed in relpSendqDestruct.

see also
rsyslog/rsyslog#4184
rsyslog/rsyslog#4005

closes rsyslog#183

alorbach added a commit to alorbach/rsyslog-1 that referenced this issue Apr 9, 2020

For a successful test, the PR
rsyslog/librelp#184
for issue
rsyslog/librelp#183
is needed. Otherwise the test will trigger a double free inside
librelp, which results in a segfault in some cases.

closes: rsyslog#4184
closes: rsyslog#4005
@timtofan

timtofan commented Apr 27, 2020

We have this problem too. The leak happens both on log producers using omrelp and on log receivers using imrelp. Within a day after a restart, the imrelp process exhausts all available FDs and, being unable to open new connections, stops receiving messages.

It appeared after updating from 8.39 to 8.2002.
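
A quick way to see how close the process is to exhausting its descriptors (a generic sketch using /proc, not from the original report):

#!/bin/bash
# Compare the current FD count of rsyslogd against its soft "Max open files" limit.
PID=$(pgrep -o rsyslogd)
used=$(ls /proc/"$PID"/fd | wc -l)
limit=$(awk '/Max open files/ {print $4}' /proc/"$PID"/limits)
echo "rsyslogd ($PID): $used of $limit file descriptors in use"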

@rgerhards
Member

@alorbach: I think this is fixed with the new librelp versions, right?

@alorbach
Member

The leak itself is not fixed yet as it is difficult to fix. The crash that could happen with broken sessions was fixed.

I am working on the testbench plumbing which is needed to create a session-break test. Then we can run the test directly in the librelp testbench. After that I want to address the leak.

@rgerhards rgerhards added the bug label Apr 28, 2020
@rgerhards rgerhards removed this from the v8.2004 milestone Apr 28, 2020
@lock

lock bot commented May 6, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 6, 2020