rsyslogd stops responding to process signals after an extended period where omkafka cannot reach its brokers #4111
Comments
Kafka has a queue for the messages it is trying to deliver. When that queue fills up, rsyslog can no longer hand messages to the kafka sender, so the kafka action queue in rsyslog starts filling up. Once that queue fills up and rsyslog is unable to move messages from the main queue to the kafka queue, the main queue starts filling up. Once the main queue fills up, no further messages can be accepted (per spec, if syslog cannot process a message, it blocks acceptance of new messages).
If you enable impstats, you will see the rsyslog queues. I don't know how to find out info about the kafka queues.
David Lang
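(For reference, impstats can be loaded with something like the following minimal sketch; the interval and log path here are illustrative, not taken from this thread:)

```
# Load the statistics module; emit queue/action counters every 60 seconds.
module(load="impstats"
       interval="60"
       severity="7"
       # keep stats out of the normal syslog stream...
       log.syslog="off"
       # ...and write them to a dedicated file instead (path is illustrative)
       log.file="/var/log/rsyslog-stats.log")
```

The resulting counters include `size` and `full` for each queue, which is enough to confirm whether the kafka action queue and the main queue are backing up as described above.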
Interesting, thanks. I thought that setting the discard values on the kafka queue would prevent backpressure to the main queue.
Also, even if the main queue is full and no longer accepting messages, should that make rsyslog unable to respond to HUP and KILL signals?
Depending on how you have things configured, rsyslog may be trying to save the contents of the queues when you send the kill signal.
As for HUP, that just closes output files; it doesn't clear anything from the queue, so how do you know it didn't happen?
David Lang
On Fri, 10 Jan 2020, Mike Sphar wrote:
Interesting, thanks. I thought that by setting the discard values on the kafka queue that would prevent backpressure to the main queue.
Try setting the discard values lower.
David Lang
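(Discard behavior on an action queue is controlled by `queue.discardMark` and `queue.discardSeverity`. A sketch of what "setting the discard values lower" might look like; the numbers and the queue size are illustrative:)

```
action(type="omkafka"
       topic="syslog"                  # placeholder topic name
       queue.type="LinkedList"
       queue.size="50000"
       # start discarding once 40000 messages are queued
       queue.discardMark="40000"
       # once the mark is reached, drop messages of severity 6 (info)
       # and numerically higher (i.e. less important, such as debug)
       queue.discardSeverity="6")
```

Lowering `queue.discardMark` relative to `queue.size` makes discarding kick in earlier, which should reduce how far backpressure propagates toward the main queue.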
That's how we noticed there was a problem at all. Rsyslog was still writing to the original file handle long after the file was rotated. In other words, it was still writing to e.g. /var/log/syslog-20191210 weeks after the original /var/log/syslog file was rotated (logrotate has a post-rotate command to HUP the rsyslogd process), and it continued to do so even after I manually sent it HUP signals once we discovered the problem. This led to /var filling up, which is how we noticed the problem. The only thing that got it to actually start writing to /var/log/syslog again was to kill -9 it and restart.
That sounds like a problem that needs more investigation. Can you enable debug logging (possibly debug on demand) and see if you can get a log showing the problem?
David Lang
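(As a command sketch: per the rsyslog debug documentation, debug-on-demand is armed via environment variables and then toggled at runtime with SIGUSR1. The paths below are illustrative:)

```
# Arm debug-on-demand: rsyslogd runs normally, producing no debug
# output until it receives SIGUSR1.
export RSYSLOG_DEBUG="DebugOnDemand"
export RSYSLOG_DEBUGLOG="/var/log/rsyslog-debug.log"   # illustrative path
rsyslogd -n    # run in the foreground, or set the env vars in your service unit

# Later, once the problem manifests, toggle debug logging on (and off
# again with a second signal):
kill -USR1 "$(cat /var/run/rsyslogd.pid)"
```

This keeps overhead low during the weeks of normal running this issue apparently requires, while still allowing a debug log to be captured once rsyslogd gets stuck.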
Thanks, I'll see if I can, but I think they may have fixed the kafka connectivity issues by now, and based on when the rotated files got stuck vs. when rsyslogd had been started, it appears to have taken possibly a couple of weeks of running in that state before it got stuck. I'll see if I can find a host I can break again and leave running for a while, though.
I haven't had an environment to intentionally reproduce this in, but we've also noticed a similar condition happening sporadically on a seemingly random handful of hosts that don't have any kafka connectivity issues. Pretty much the same symptoms as described in issue #4230.
This might be a weird one and hard to reproduce, but we're seeing an issue where, with an omkafka output whose kafka targets are not reachable, rsyslogd eventually (some number of days later) stops responding to HUP signals and so continues writing to rotated logs. It doesn't respond to a normal stop signal either, requiring a kill -9 to end the process and restart it.
The backstory: after rolling out a change to use omkafka, we discovered much later that a few of our hosts unexpectedly couldn't talk to the kafka cluster for network config reasons. Weeks after the initial change was rolled out, some hosts started running out of disk space, which we traced to rsyslog still writing logs to days-old rotated log files. The post-rotate command to HUP the rsyslogd service was not working.
Obviously we're fixing the connectivity problem, but this seems like there might be an issue somewhere that can lead to this degraded condition.
Expected behavior
rsyslogd responds to process signals even if the omkafka target is unavailable for a long time
Actual behavior
When the omkafka targets are not reachable for a significant amount of time, possibly many days, rsyslog stops responding to process signals, even though it still seems to be writing logs locally.
Steps to reproduce the behavior
This is our module config
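(The actual config was not captured in this extract. For illustration only, a representative omkafka setup looks roughly like this; every broker, topic, and queue value below is a placeholder, not the reporter's configuration:)

```
module(load="omkafka")

action(type="omkafka"
       broker=["kafka-1.example.com:9092", "kafka-2.example.com:9092"]
       topic="syslog"
       queue.type="LinkedList"
       queue.size="50000"
       queue.discardMark="40000"
       queue.discardSeverity="6"
       # bound how long shutdown may block on an unreachable broker
       queue.timeoutShutdown="1000"
       # retry the action indefinitely instead of suspending it permanently
       action.resumeRetryCount="-1")
```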
Environment