
rsyslogd stops responding to process signals after an extended period where omkafka cannot reach its brokers #4111

Open
mikesphar opened this issue Jan 10, 2020 · 9 comments

Comments

@mikesphar

This might be a weird one and hard to reproduce, but we're seeing an issue where, with an omkafka output, if the kafka targets are not reachable, then some number of days later rsyslogd eventually stops responding to HUP signals and so continues writing to rotated logs. It doesn't respond to a normal stop signal either, requiring a kill -9 to end the process and restart it.

The backstory: after rolling out a change to use omkafka, we discovered much later that a few of our hosts unexpectedly couldn't talk to the kafka cluster for network config reasons. Weeks after the initial change was rolled out, some hosts started running out of disk space, which we traced to rsyslog still writing logs to days-old rotated log files. The post-rotate command to HUP the rsyslogd service was not working.

Obviously we're fixing the connectivity problem, but this seems like there might be an issue somewhere that can lead to this degraded condition.

Expected behavior

rsyslogd responds to process signals even if the omkafka target is unavailable for a long time

Actual behavior

When the omkafka targets are not reachable, then after some significant amount of time (possibly many days) rsyslog stops responding to process signals, even though it still seems to be writing logs locally.

Steps to reproduce the behavior

This is our module config

module(load="omkafka")
module(load="mmrm1stspace")


template(name="json_kafka_message" type="list" option.jsonf="on") {
  property(outname="time" name="timegenerated" dateFormat="rfc3339" format="jsonf" droplastlf="on")
  property(outname="rsyslog_time_generated" name="timegenerated" dateFormat="rfc3339" format="jsonf" droplastlf="on")
  property(outname="rsyslog_time_processed" name="$now-utc" dateFormat="rfc3339" format="jsonf" droplastlf="on")
  property(outname="host" name="hostname" format="jsonf" droplastlf="on")
  property(outname="severity" name="syslogseverity" caseConversion="upper" format="jsonf" datatype="number" droplastlf="on")
  property(outname="facility" name="syslogfacility" format="jsonf" datatype="number" droplastlf="on")
  property(outname="ident" name="programname" format="jsonf" droplastlf="on")
  property(outname="pid" name="procid" format="jsonf" droplastlf="on")
  property(outname="received_syslog_message" name="msg" format="jsonf" droplastlf="on")
}

ruleset(name="KafkaStructuredPipeline"
        queue.type="LinkedList"
        queue.size="100000"
        queue.discardMark="100000"
        queue.discardSeverity="0"
) {
  action(type="mmrm1stspace")
  action(name="v1_firehose_raw" type="omkafka"
    topic="v1_firehose_raw"
    Partitions.Auto="on"
    broker=[
      "BROKER_HOSTNAME_1",
      "BROKER_HOSTNAME_2",
      "BROKER_HOSTNAME_3"
    ]

    confParam=[
      "compression.codec=lz4",
      "compression.level=6",
      "queue.buffering.max.kbytes=1048576",
      "queue.buffering.max.messages=10000",
      "queue.buffering.max.ms=1000",
      "request.required.acks=all",
      "security.protocol=plaintext",
      "socket.keepalive.enable=true",
      "socket.timeout.ms=10000",
      "statistics.interval.ms=60000"
    ]
    template="json_kafka_message"
  )
}

if ($fromhost-ip == '127.0.0.1') then {
  call KafkaStructuredPipeline
}

Environment

  • rsyslog version: 8.1908.0
  • platform: Debian GNU/Linux 9 (stretch)
@davidelang
Contributor

davidelang commented Jan 10, 2020 via email

@mikesphar
Author

Interesting, thanks. I thought that setting the discard values on the kafka queue would prevent backpressure to the main queue.
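
For illustration, one way to keep a stalled kafka connection from back-pressuring the ruleset queue is to give the omkafka action its own asynchronous action queue and bound its enqueue timeout. The following is only a minimal sketch with illustrative values (the confParam and Partitions.Auto settings from the original config are omitted for brevity); it is not a tested fix for this report:

# Sketch only: dedicated action queue for omkafka, with illustrative values.
# A full action queue then discards instead of blocking the ruleset queue.
action(name="v1_firehose_raw" type="omkafka"
  topic="v1_firehose_raw"
  broker=["BROKER_HOSTNAME_1", "BROKER_HOSTNAME_2", "BROKER_HOSTNAME_3"]
  template="json_kafka_message"
  # dedicated asynchronous queue for this action only
  queue.type="LinkedList"
  queue.size="100000"
  # start discarding low-importance messages before the queue is completely full
  queue.discardMark="80000"
  queue.discardSeverity="4"
  # do not wait for queue space when full; discard immediately
  queue.timeoutEnqueue="0"
  # keep retrying the broker connection indefinitely, every 30 seconds
  action.resumeRetryCount="-1"
  action.resumeInterval="30"
)

With queue.timeoutEnqueue at 0, a full action queue drops messages rather than making the enqueueing side wait, so the ruleset queue can keep draining even while the brokers are unreachable.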

@mikesphar
Author

Also, even if the main queue is filled up and no longer accepting messages, should that make rsyslog unable to respond to HUP and KILL signals?
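
For context, what happens when the main queue fills up is governed by its enqueue timeout: by default, callers wait for space rather than discarding. A minimal sketch of bounding that behavior via the main_queue() object, with illustrative values only (not a recommendation for this specific setup):

# Sketch only: when the main queue is full, discard immediately instead of
# making the enqueueing side wait for space. Values are illustrative.
main_queue(
  queue.size="100000"
  queue.timeoutEnqueue="0"
)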

@davidelang
Contributor

davidelang commented Jan 11, 2020 via email

@davidelang
Contributor

davidelang commented Jan 11, 2020 via email

@mikesphar
Author

> as for HUP, that just closes output files, it doesn't clear anything from the queue, so how do you know it didn't happen.

That's how we noticed there was a problem at all. Rsyslog was still writing to the original file handle long after the file was rotated. In other words, it was still writing to e.g. /var/log/syslog-20191210 weeks after the original /var/log/syslog file was rotated (logrotate has a post-rotate command to HUP the rsyslogd process), and it continued to do so even after I manually sent it HUP signals once we discovered the problem. This led to /var filling up, which was how we noticed the problem. The only thing that got it to actually start writing to /var/log/syslog again was to kill -9 it and restart.

@davidelang
Contributor

davidelang commented Jan 11, 2020 via email

@mikesphar
Author

Thanks, I'll see if I can, but I think they may have fixed the kafka connectivity issues by now. Based on when the rotated files got stuck vs. when rsyslogd had been started, it appeared to take possibly a couple of weeks of running in that state before it got stuck. I'll see if I can find a host I can break again and leave running for a while, though.

@mikesphar
Author

I haven't had an environment in which to intentionally reproduce this, but we've also noticed this same condition happening sporadically on a seemingly random handful of hosts that don't have any kafka connectivity issues. Pretty much the same symptoms as described in issue #4230.
