
Fix issue with elasticsearch buffers overflowing in logs-forwarder #1738

Merged 2 commits on Apr 8, 2020

Conversation

@smlx (Member) commented on Mar 24, 2020:

The main change that seems to increase performance is switching from the deprecated secure_forward plugin to the built-in forward plugin, which supports TLS.

The advantages of the forward plugin are:

  • it's maintained.
  • support for fluentd workers so that input performance can be scaled easily.
  • creates a new connection per chunk forward event.

On the last point: the old secure_forward plugin held onto its connection, which meant that sessions were "sticky" to a pod and effectively all work was being sent to a single forwarder pod. With the forward plugin, the load balancer can do its job of spreading work across all logs-forwarder pods.
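As a rough illustration of the new input side, a multi-worker forward source with TLS might look like the sketch below. The port, certificate paths, and worker count here are assumptions for illustration, not the PR's actual values.

```
# Hypothetical sketch: built-in forward input with TLS and fluentd workers.
<system>
  workers 4
</system>
<source>
  @type forward
  port 24224
  <transport tls>
    cert_path /fluentd/ssl/server.crt
    private_key_path /fluentd/ssl/server.key
  </transport>
</source>
```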

The other changes include:

  • flush threads are now set to 2 (x4 workers = 8)
  • giving the output an @id so it is identifiable in logs
  • not using a timekey, so chunk flushing is only size dependent
  • storing any tag/time in the same chunk to avoid lots of little chunks
  • increasing the flush interval and chunk size
  • dropping chunks on buffer overflow
  • removing the redundant index_name field in the elasticsearch output
  • making the scheme configurable via environment variable (default http)
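Taken together, the buffer-related changes above could be sketched as a forward output like the following. The host, chunk size, and @id value are hypothetical placeholders; only the buffer settings mirror what the PR describes.

```
# Hypothetical sketch of the tuned output, per the change list above.
<match **>
  @type forward
  @id logs_forward_output            # an @id makes this output identifiable in logs
  <server>
    host logs-forwarder.example.com  # assumed endpoint
    port 24224
  </server>
  <buffer>                           # no tag/time chunk keys: events share chunks
    flush_mode interval
    flush_interval 60s
    flush_thread_count 2             # per worker; x4 workers = 8 total
    chunk_limit_size 8MB             # assumed value
    overflow_action drop_oldest_chunk
  </buffer>
</match>
```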

Checklist

  • Affected Issues have been mentioned in the Closing issues section
  • Documentation has been written/updated
  • PR title is ready for changelog and subsystem label(s) applied

logs-forwarder was not able to write to Elasticsearch fast enough to keep up with the volume of logs. This change seeks to improve the situation.

Closing issues

n/a

@smlx smlx requested a review from Schnitzel March 24, 2020 09:13
On services/logs-forwarder/.lagoon.multi.yml (outdated):
```
flush_mode interval
flush_interval 60s
flush_thread_count 16
overflow_action drop_oldest_chunk
```
Contributor commented:

My understanding of overflow_action block, as described at https://docs.fluentd.org/configuration/buffer-section:

> block processing of input plugin to emit events into that buffer

Basically it means that fluentd tells the inputs (which are fluentds again in logs-collector and the external logs-forwarder) not to send data anymore, so it will fill the buffers of those fluentds as well. I think that's a bit safer than just dropping the oldest chunks?

Member (Author) commented:

Well, I guess this is a philosophical debate 😄

I think that the chunks should be dropped at the point that the bottleneck is occurring because that makes it easy to identify the point in the log pipeline that the issue is located: it's the only fluentd with a full buffer and a log full of dropped chunk warnings.

If you block instead, the entire pipeline backs up, which tends to obfuscate the location of the issue because buffers are full everywhere and you may see symptoms at points in the pipeline several hops removed from the actual bottleneck. You will still end up dropping logs eventually; it just takes slightly longer and happens further back in the pipeline.
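The drop-at-the-bottleneck approach argued for above boils down to a buffer section like this sketch (the size limit is an assumed value, not from the PR):

```
# Hypothetical sketch: shed load locally instead of backpressuring upstream.
<buffer>
  total_limit_size 512MB             # assumed buffer cap
  # 'overflow_action block' would instead pause upstream emitters, filling
  # their buffers too; dropping here keeps the bottleneck visible at this hop.
  overflow_action drop_oldest_chunk
</buffer>
```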

```diff
@@ -259,19 +261,17 @@ objects:
     ssl_version TLSv1_2
     request_timeout 60s
     slow_flush_log_threshold 90s
-    <buffer tag,time>
+    <buffer>
```
Contributor commented:

Mhh, interesting. My understanding of https://docs.fluentd.org/output/elasticsearch#index_name-optional is that you need to set up the buffer with time if you use time in the index name. But maybe because we use record_modifier on top already, this is not necessary?

Member (Author) commented:

Yes that is my understanding. We don't use index_name, we just use target_index_key.

The filter that adds index_name is also unconditional so we don't need the index_name fallback.
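The target_index_key approach described above could be sketched roughly as follows; the match pattern and field name are illustrative, though index_name is the field discussed in this thread:

```
# Hypothetical sketch: the index is read per-record from a field that
# record_modifier added earlier, so the buffer needs no time chunk key.
<match **>
  @type elasticsearch
  target_index_key index_name   # index taken from this field in each record
  <buffer>                      # plain buffer: no tag/time chunk keys
    flush_interval 60s
  </buffer>
</match>
```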

@smlx smlx force-pushed the logs-forwarder-buffer-tweaks branch from 4b72541 to 7857e43 Compare April 1, 2020 13:37
@smlx smlx requested a review from Schnitzel April 1, 2020 13:41
@tobybellwood tobybellwood added the 3-logging-reporting Logging & Reporting subsystem label Apr 5, 2020
@Schnitzel Schnitzel merged commit 396680e into master Apr 8, 2020
@Schnitzel Schnitzel added this to the v1.4.1 milestone Apr 8, 2020
@smlx smlx deleted the logs-forwarder-buffer-tweaks branch April 30, 2020 12:52