imrelp FD leak #4005

Closed · nerdquadrat opened this issue Dec 2, 2019 · 9 comments · Fixed by #4240

@nerdquadrat

Expected behavior

Open file descriptors are recycled when no longer in use, and the overall number of FDs roughly matches the number of currently open file handles plus (network) sockets.

Actual behavior

Open descriptors of the rsyslogd process rise by roughly 1440/h (with two load balancers probing the open ports every 5 seconds), while the actual number of connections (418) and open files (89) stays roughly the same. Most lines of the lsof output look like this:
rsyslogd 15288 root 7529u sock 0,7 0t0 139507485 protocol: TCP

Steps to reproduce the behavior

  1. configure imrelp for incoming messages
  2. connect to the port and close the connection; the descriptor leak of the TCP connection appears to be independent of whether the RELP protocol is correctly initialized or the connection is simply closed (as our balancers currently do)

Environment

  • rsyslog version: rsyslogd 8.1911.0 (aka 2019.11)
  • platform: x86_64-redhat-linux-gnu
  • librelp version: 1.4.0-1.el6
  • simplified config:

    ruleset( name="output" ) {
        action(
            name="a_outlog"
            type="omfile"
            File="/tmp/out.log"
        )
        stop
    }

    module(load="imrelp")
    input(type="imrelp" port="2510" ruleset="output")

Connecting with netcat 100 times will then permanently leak 100 descriptors:
for i in {1..100} ; do nc -z 127.0.0.1 2510 ; done

Doing the same with the imtcp module does NOT change the number of used descriptors (only once after starting, as expected, since unused descriptors are recycled by imtcp).
Since it seems irrelevant to the FD leak whether a RELP protocol handshake actually happened, I filed this issue under the rsyslog project rather than librelp.
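
A minimal sketch for watching the leak grow (standard Linux tools; assuming a single rsyslogd instance, so pgrep -o picks the right process):

#!/bin/bash
# Count the anonymous "protocol: TCP" socket entries held by rsyslogd.
# The leaked descriptors show up in lsof with TYPE "sock" and no peer address.
PID=$(pgrep -o rsyslogd)
lsof -p "$PID" 2>/dev/null | awk '$5 == "sock" && /protocol: TCP/' | wc -l

Running this before and after the netcat loop above should show the count rise by roughly the number of connections made.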

@rgerhards rgerhards added this to the v8.2001 milestone Dec 2, 2019
@falon

falon commented Jan 8, 2020

We have this problem too. We connect to the port and close the connection for rudimentary monitoring... and the descriptors grow indefinitely until rsyslog stops working.

@rgerhards rgerhards modified the milestones: v8.2001, v8.2002 Jan 12, 2020
@rgerhards rgerhards modified the milestones: v8.2002, v8.2004 Feb 28, 2020
@nerdquadrat
Author

Since this has been postponed for a few months now and I know of others who have faced this issue, here is the workaround currently working for us (in a slightly different form, running as a cron job multiple times a day):

[ $(/usr/sbin/lsof -p $(/usr/bin/pgrep rsyslogd) | /usr/bin/grep TCP | /usr/bin/egrep -v "ESTABLISHED|LISTEN" | /usr/bin/wc -l) -gt 3000 ] && /usr/sbin/rsyslogd -N1 && service rsyslog restart

Note: lsof could show those TCP handles in CLOSE_WAIT state with -iTCP -s "TCP:^LISTEN" -s "TCP:^ESTABLISHED", but then leaked handles like the following would not be counted:

rsyslogd 24831 root 1978u sock 0,7 0t0 279860656 protocol: TCP

With two load balancers checking the port and a 32k open-files limit, this check restarts rsyslog twice a day.
Because the cron job checks the number of leaked file handles before restarting, the restarts will stop on their own as soon as a fixed version is in place, so this workaround would also help with future FD leaks if they happen.
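
For reference, the same workaround written out as a standalone cron script (the threshold and service name are examples to adapt to the local FD limit, not part of the original one-liner):

#!/bin/bash
# Restart rsyslog only when the number of leaked TCP handles exceeds a threshold.
THRESHOLD=3000

leaked=$(/usr/sbin/lsof -p "$(/usr/bin/pgrep rsyslogd)" \
    | /usr/bin/grep TCP \
    | /usr/bin/egrep -v "ESTABLISHED|LISTEN" \
    | /usr/bin/wc -l)

if [ "$leaked" -gt "$THRESHOLD" ]; then
    # Validate the configuration first so a broken config never causes a restart loop.
    /usr/sbin/rsyslogd -N1 && service rsyslog restart
fi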

@StrongestNumber9
Contributor

We ran into this problem as well and can reproduce it reliably with the latest rsyslog (8.2002.0 at the time of writing):

Expected behavior

FDs are released when RELP connections are forcibly terminated.

Actual behavior

FDs get left open:

# ls -l /proc/11/fd|head -10
total 0
lr-x------. 1 root root 64 Mar 30 16:40 0 -> /dev/urandom
lrwx------. 1 root root 64 Mar 30 16:40 1 -> socket:[166569908]
lrwx------. 1 root root 64 Mar 30 16:40 10 -> socket:[166569914]
lrwx------. 1 root root 64 Mar 30 16:40 100 -> socket:[166573526]
lrwx------. 1 root root 64 Mar 30 16:40 101 -> socket:[166573527]
lrwx------. 1 root root 64 Mar 30 16:40 102 -> socket:[166573528]
lrwx------. 1 root root 64 Mar 30 16:40 103 -> socket:[166573529]
lrwx------. 1 root root 64 Mar 30 16:40 104 -> socket:[166573530]
lrwx------. 1 root root 64 Mar 30 16:40 105 -> socket:[166573531]
# ls -l /proc/11/fd|wc -l
996

Output from reproduce.sh

rsyslogd: version 8.2002.0, config validation run (level 9), master config /spool.conf
rsyslogd: End of config validation run. Bye.
Waiting for threads to finish
Threads finished, sleeping 10s before counting open filehandles..
Open files at the end: 989

Steps to reproduce the behavior

spool.conf

$LocalHostName spool
$AbortOnUncleanConfig on
$PreserveFQDN on

global(
    workDirectory="/workdir"
    maxMessageSize="256k"
)

main_queue(queue.type="Direct")
module(load="imrelp")
input(
    type="imrelp"
    name="imrelp"
    port="601"
    ruleset="spool"
    MaxDataSize="256k"
)

ruleset(name="spool" queue.type="direct") {
}

# Just so rsyslog doesn't whine that we do not have outputs
ruleset(name="noop" queue.type="direct") {
    action(
        type="omfile"
        name="omfile"
        file="/spool/spool.log"
    )
}

reproduce.sh

#!/bin/bash
$RSYSLOGD -f /spool.conf -N9
$RSYSLOGD -f /spool.conf -i /workdir/rsyslogd.pid

sleep 2;
# How many tcpfloods we run at the same time
for ((i=1;i<=10;i++)); do
    (
        # How many times tcpflood runs in each threads
        for ((y=1;y<=100;y++)); do 
            (
                $TCPFLOOD -T relp-plain -t 127.0.0.1 -p 601 -s -m 1000000
            ) &
            # Give it time to actually connect
            sleep 2;
            kill -9 $!;
        done;
    ) >/dev/null 2>&1 &
done;
echo "Waiting for threads to end";
wait;
echo "Open files at the end: $(ls -l "/proc/$(cat /workdir/rsyslogd.pid)/fd" | wc -l)";

Environment

# cat /etc/centos-release
CentOS Linux release 7.7.1908 (Core)

# rpm -qa "rsyslog*"
rsyslog-relp-8.2002.0-1.el7.x86_64
rsyslog-8.2002.0-1.el7.x86_64

@alorbach
Member

@StrongestNumber9 I will try to create a test based on your configuration.

@alorbach
Member

alorbach commented Apr 3, 2020

@StrongestNumber9: I have successfully built a test based on this config and could reproduce an unexpected problem, which I will investigate next week.

alorbach added a commit to alorbach/librelp that referenced this issue Apr 8, 2020

In the iRet handler of relpSessSendResponse, the sendbuf
was freed if iRet returned a failure.

However, if the error RELP_RET_IO_ERR happened in relpSendqAddBuf,
sendbuf was already assigned to a relpSendqe_t. As a result,
sendbuf was double-freed in relpSendqDestruct.

see also
rsyslog/rsyslog#4184
rsyslog/rsyslog#4005

closes rsyslog#183

alorbach added a commit to alorbach/rsyslog-1 that referenced this issue Apr 9, 2020

For a successful test, the PR
rsyslog/librelp#184
for issue
rsyslog/librelp#183
is needed. Otherwise the test will trigger a double free inside
librelp, which results in a segfault in some cases.

closes: rsyslog#4184
closes: rsyslog#4005
@timtofan

timtofan commented Apr 27, 2020

We have this problem too. The leak happens both on log producers using omrelp and on log receivers using imrelp. Within a day after a restart, the imrelp process exhausts all available FDs and, being unable to open new connections, stops receiving messages.

It appeared after updating from 8.39 to 8.2002.
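
A quick way to see how close the process is to exhausting its descriptors (a generic sketch using /proc, not from the original report):

#!/bin/bash
# Compare the current FD count of rsyslogd against its soft "Max open files" limit.
PID=$(pgrep -o rsyslogd)
used=$(ls /proc/"$PID"/fd | wc -l)
limit=$(awk '/Max open files/ {print $4}' /proc/"$PID"/limits)
echo "rsyslogd ($PID): $used of $limit file descriptors in use"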

@rgerhards
Member

@alorbach: I think this is fixed with the new librelp versions, right?

@alorbach
Member

The leak itself is not fixed yet as it is difficult to fix. The crash that could happen with broken sessions was fixed.

I am working on the testbench plumbing which is needed to create a session-break test. Then we can run the test directly in the librelp testbench. After that I want to address the leak.

@rgerhards rgerhards added the bug label Apr 28, 2020
@rgerhards rgerhards removed this from the v8.2004 milestone Apr 28, 2020
@lock

lock bot commented May 6, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 6, 2020