
Fix issue with elasticsearch buffers overflowing in logs-forwarder #1738

Merged 2 commits on Apr 8, 2020

Conversation

@smlx (Member) commented on Mar 24, 2020:

The main change that seems to increase performance is switching from the deprecated secure_forward plugin to the built-in forward plugin, which supports TLS.

The advantages of the forward plugin are:

  • it's maintained.
  • support for fluentd workers so that input performance can be scaled easily.
  • creates a new connection per chunk forward event.

On the last point: the old secure_forward plugin held onto its connection, which meant that sessions were "sticky" to a pod and effectively all work was being sent to a single forwarder pod. With the forward plugin, the load balancer can do its job of spreading work across all logs-forwarder pods.
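As a rough illustration of the new input side, a multi-worker forward source with TLS might look like the sketch below. The port, certificate paths, and worker count here are assumptions for illustration, not the PR's actual values.

```
# Hypothetical sketch: built-in forward input with TLS and fluentd workers.
<system>
  workers 4
</system>
<source>
  @type forward
  port 24224
  <transport tls>
    cert_path /fluentd/ssl/server.crt
    private_key_path /fluentd/ssl/server.key
  </transport>
</source>
```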

The other changes include:

  • flush threads are now set to 2 (x4 workers = 8)
  • giving the output an @id so it is identifiable in logs
  • not using a timekey, so chunk flushing is only size dependent
  • storing any tag/time in the same chunk to avoid lots of little chunks
  • increasing the flush interval and chunk size
  • dropping chunks on buffer overflow
  • removing the redundant index_name field in the elasticsearch output
  • making the scheme configurable via environment variable (default http)
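Taken together, the buffer-related changes above could be sketched as a forward output like the following. The host, chunk size, and @id value are hypothetical placeholders; only the buffer settings mirror what the PR describes.

```
# Hypothetical sketch of the tuned output, per the change list above.
<match **>
  @type forward
  @id logs_forward_output            # an @id makes this output identifiable in logs
  <server>
    host logs-forwarder.example.com  # assumed endpoint
    port 24224
  </server>
  <buffer>                           # no tag/time chunk keys: events share chunks
    flush_mode interval
    flush_interval 60s
    flush_thread_count 2             # per worker; x4 workers = 8 total
    chunk_limit_size 8MB             # assumed value
    overflow_action drop_oldest_chunk
  </buffer>
</match>
```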

Checklist

  • Affected Issues have been mentioned in the Closing issues section
  • Documentation has been written/updated
  • PR title is ready for changelog and subsystem label(s) applied

logs-forwarder was not able to write to Elasticsearch fast enough to keep up with the volume of logs. This change seeks to improve the situation.

Closing issues

n/a

@smlx smlx requested a review from Schnitzel March 24, 2020 09:13
On services/logs-forwarder/.lagoon.multi.yml (outdated):
```
flush_mode interval
flush_interval 60s
flush_thread_count 16
overflow_action drop_oldest_chunk
```
Contributor commented:

My understanding of overflow_action block, as described at https://docs.fluentd.org/configuration/buffer-section:

> block processing of input plugin to emit events into that buffer

Basically it means that fluentd tells the inputs (which are fluentds again in logs-collector and the external logs-forwarder) not to send data anymore, so it will fill the buffers of those fluentds as well. I think that's a bit safer than just dropping the oldest chunks?

Member (Author) commented:

Well, I guess this is a philosophical debate 😄

I think that the chunks should be dropped at the point that the bottleneck is occurring because that makes it easy to identify the point in the log pipeline that the issue is located: it's the only fluentd with a full buffer and a log full of dropped chunk warnings.

If you block instead, the entire pipeline backs up, which tends to obfuscate the location of the issue because buffers are full everywhere and you may see symptoms at points in the pipeline several hops removed from the actual bottleneck. You will still end up dropping logs eventually; it just takes slightly longer and happens further back in the pipeline.
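The drop-at-the-bottleneck approach argued for above boils down to a buffer section like this sketch (the size limit is an assumed value, not from the PR):

```
# Hypothetical sketch: shed load locally instead of backpressuring upstream.
<buffer>
  total_limit_size 512MB             # assumed buffer cap
  # 'overflow_action block' would instead pause upstream emitters, filling
  # their buffers too; dropping here keeps the bottleneck visible at this hop.
  overflow_action drop_oldest_chunk
</buffer>
```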

```diff
@@ -259,19 +261,17 @@ objects:
     ssl_version TLSv1_2
     request_timeout 60s
     slow_flush_log_threshold 90s
-    <buffer tag,time>
+    <buffer>
```
Contributor commented:

Mhh, interesting. My understanding of https://docs.fluentd.org/output/elasticsearch#index_name-optional is that you need to set up the buffer with time if you use time in the index name. But maybe because we use record_modifier on top already, this is not necessary?

Member (Author) commented:

Yes that is my understanding. We don't use index_name, we just use target_index_key.

The filter that adds index_name is also unconditional so we don't need the index_name fallback.
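The target_index_key approach described above could be sketched roughly as follows; the match pattern and field name are illustrative, though index_name is the field discussed in this thread:

```
# Hypothetical sketch: the index is read per-record from a field that
# record_modifier added earlier, so the buffer needs no time chunk key.
<match **>
  @type elasticsearch
  target_index_key index_name   # index taken from this field in each record
  <buffer>                      # plain buffer: no tag/time chunk keys
    flush_interval 60s
  </buffer>
</match>
```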

@smlx smlx force-pushed the logs-forwarder-buffer-tweaks branch from 4b72541 to 7857e43 Compare April 1, 2020 13:37
@smlx smlx requested a review from Schnitzel April 1, 2020 13:41
@tobybellwood tobybellwood added the 3-logging-reporting Logging & Reporting subsystem label Apr 5, 2020
@Schnitzel Schnitzel merged commit 396680e into master Apr 8, 2020
@Schnitzel Schnitzel added this to the v1.4.1 milestone Apr 8, 2020
@smlx smlx deleted the logs-forwarder-buffer-tweaks branch April 30, 2020 12:52