rsyslogd stops responding to process signals some after extended period where omkafka cannot reach its brokers #4111

mikesphar · 2020-01-10T21:00:07Z

This might be a weird one and hard to reproduce, but we're seeing an issue where with an omkafka output, if the kafka targets are not reachable, some number of days later rsyslogd eventually stops responding to HUP signals so continues writing to rotated logs. It doesn't respond to a normal stop signal either, requiring a kill -9 to end the process and restart it.

The backstory is after rolling out a change to use omkafka, we've discovered much later that a few of our hosts unexpectedly couldn't talk to the kafka cluster for network config reasons. Weeks after the initial change was rolled out, some hosts started running out of disk space, which we traced to rsyslog still writing logs to days-old rotated log files. The post-rotate command to HUP the rsyslogd service was not working.

Obviously we're fixing the connectivity problem but this seems like there might be an issue somewhere that can lead to this degraded condition.

Expected behavior

rsyslogd responds to process signals even if the omkafka target is unavailable for a long time

Actual behavior

When the omkafka targets are not reachable, some significant amount of time later, possibly many days, rsyslog stops responding to process signals, even though it seems to still be writing logs locally.

Steps to reproduce the behavior

This is our module config

module(load="omkafka")
module(load="mmrm1stspace")


template(name="json_kafka_message" type="list" option.jsonf="on") {
  property(outname="time" name="timegenerated" dateFormat="rfc3339" format="jsonf" droplastlf="on")
  property(outname="rsyslog_time_generated" name="timegenerated" dateFormat="rfc3339" format="jsonf" droplastlf="on")
  property(outname="rsyslog_time_processed" name="$now-utc" dateFormat="rfc3339" format="jsonf" droplastlf="on")
  property(outname="host" name="hostname" format="jsonf" droplastlf="on")
  property(outname="severity" name="syslogseverity" caseConversion="upper" format="jsonf" datatype="number" droplastlf="on")
  property(outname="facility" name="syslogfacility" format="jsonf" datatype="number" droplastlf="on")
  property(outname="ident" name="programname" format="jsonf" droplastlf="on")
  property(outname="pid" name="procid" format="jsonf" droplastlf="on")
  property(outname="received_syslog_message" name="msg" format="jsonf" droplastlf="on")
}

ruleset(name="KafkaStructuredPipeline"
        queue.type="LinkedList"
        queue.size="100000"
        queue.discardMark="100000"
        queue.discardSeverity="0"
) {
  action(type="mmrm1stspace")
  action(name="v1_firehose_raw" type="omkafka"
    topic="v1_firehose_raw"
    Partitions.Auto="on"
    broker=[
      "BROKER_HOSTNAME_1",
      "BROKER_HOSTNAME_2",
      "BROKER_HOSTNAME_3"
    ]

    confParam=[
      "compression.codec=lz4",
      "compression.level=6",
      "queue.buffering.max.kbytes=1048576",
      "queue.buffering.max.messages=10000",
      "queue.buffering.max.ms=1000",
      "request.required.acks=all",
      "security.protocol=plaintext",
      "socket.keepalive.enable=true",
      "socket.timeout.ms=10000",
      "statistics.interval.ms=60000"
    ]
    template="json_kafka_message"
  )
}

if ($fromhost-ip == '127.0.0.1') then {
  call KafkaStructuredPipeline
}

Environment

rsyslog version: 8.1908.0
platform: Debian GNU/Linux 9 (stretch)"

The text was updated successfully, but these errors were encountered:

davidelang · 2020-01-10T22:24:01Z

kafka has a queue for the messages it's trying to deliver, when that queue fills up, rsyslog is not able to deliver to the kafka sender and the queue in rsyslog starts filling up, once that queue fills up and rsyslog is unable to move messages from the main queue to the kafka queue, the main queue starts filling up. Once the main queue fills up, no further messages can be accepted (per spec, if syslog cannot process a message, it blocks acceptance of the message) if you enable impstats, you will see the rsyslog queues. I don't know how to find out info about the kafka queues. David Lang

mikesphar · 2020-01-11T02:08:47Z

Interesting, thanks. I thought that by setting the discard values on the kafka queue that would prevent backpressure to the main queue.

mikesphar · 2020-01-11T02:09:49Z

Also even if the main queue is filled up and no longer accepting messages, should that make rsyslog unable to respond to HUP and KILL signals?

davidelang · 2020-01-11T02:33:26Z

depending on you you have things configured, rsyslog may be trying to save the contents of the queues when you send the kill signal as for HUP, that just closes output files, it doesn't clear anything from the queue, so how do you know it didn't happen. David Lang

davidelang · 2020-01-11T02:33:46Z

On Fri, 10 Jan 2020, Mike Sphar wrote: Interesting, thanks. I thought that by setting the discard values on the kafka queue that would prevent backpressure to the main queue.

try setting the discard values lower. David Lang

mikesphar · 2020-01-11T03:11:06Z

as for HUP, that just closes output files, it doesn't clear anything from the
queue, so how do you know it didn't happen.

That's how we noticed there was a problem at all. Rsyslog was still writing to the original file handle long after the file was rotated. In other words it was still writing to i.e. /var/log/syslog-20191210 weeks later after the original /var/log/syslog file was rotated (logrotate has a post-rotate command to HUP the rsyslogd process), and it continued to do so even after I manually sent it HUP signals after we discovered the problem. This led to /var filling up which was how we noticed the problem. The only thing that got it to actually start writing to /var/log/syslog again was to kill -9 it and restart.

davidelang · 2020-01-11T03:17:03Z

that sounds like a problem that needs more investigation, can you enable debug logging (possibly debug on demand) and see if you can get a log showing the problem? David Lang On Fri, 10 Jan 2020, Mike Sphar wrote:

…

Date: Fri, 10 Jan 2020 19:11:06 -0800 From: Mike Sphar ***@***.***> Reply-To: rsyslog/rsyslog ***@***.***> To: rsyslog/rsyslog ***@***.***> Cc: David Lang ***@***.***>, Comment ***@***.***> Subject: Re: [rsyslog/rsyslog] rsyslogd stops responding to process signals some after extended period where omkafka cannot reach its brokers (#4111) > as for HUP, that just closes output files, it doesn't clear anything from the queue, so how do you know it didn't happen. That's how we noticed there was a problem at all. Rsyslog was still writing to the original file handle long after the file was rotated. In other words it was still writing to i.e. /var/log/syslog-20191210 weeks later after the original /var/log/syslog file was rotated (logrotate has a post-rotate command to HUP the rsyslogd process), and it continued to do so even after I manually sent it HUP signals after we discovered the problem. This led to /var filling up which was how we noticed the problem. The only thing that got it to actually start writing to /var/log/syslog again was to kill -9 it and restart.

mikesphar · 2020-01-11T04:18:32Z

Thanks, I'll see if I can, but I think they may have fixed the kafka connectivity issues by now and based on when the rotated files got stuck vs when rsyslogd had been started it appeared to take possibly a couple of weeks of running in that state before it got stuck. I'll see if I can find a host I can break it again on and leave running for a while though.

mikesphar · 2020-04-06T15:55:05Z

I haven't had an environment to intentionally reproduce this in, but we've also noticed this similar condition happening sporadically to a seemingly random handful of hosts that don't have any kafka connectivity issues. Pretty much the same symptoms described in Issue #4230

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rsyslogd stops responding to process signals some after extended period where omkafka cannot reach its brokers #4111

rsyslogd stops responding to process signals some after extended period where omkafka cannot reach its brokers #4111

mikesphar commented Jan 10, 2020

davidelang commented Jan 10, 2020 via email

mikesphar commented Jan 11, 2020

mikesphar commented Jan 11, 2020

davidelang commented Jan 11, 2020 via email

davidelang commented Jan 11, 2020 via email

mikesphar commented Jan 11, 2020

davidelang commented Jan 11, 2020 via email

mikesphar commented Jan 11, 2020

mikesphar commented Apr 6, 2020

rsyslogd stops responding to process signals some after extended period where omkafka cannot reach its brokers #4111

rsyslogd stops responding to process signals some after extended period where omkafka cannot reach its brokers #4111

Comments

mikesphar commented Jan 10, 2020

Expected behavior

Actual behavior

Steps to reproduce the behavior

Environment

davidelang commented Jan 10, 2020 via email

mikesphar commented Jan 11, 2020

mikesphar commented Jan 11, 2020

davidelang commented Jan 11, 2020 via email

davidelang commented Jan 11, 2020 via email

mikesphar commented Jan 11, 2020

davidelang commented Jan 11, 2020 via email

mikesphar commented Jan 11, 2020

mikesphar commented Apr 6, 2020