Fix issue with elasticsearch buffers overflowing in logs-forwarder #1738
Conversation
flush_mode interval
flush_interval 60s
flush_thread_count 16
overflow_action drop_oldest_chunk
My understanding of having `overflow_action block` is described here: https://docs.fluentd.org/configuration/buffer-section

> block processing of input plugin to emit events into that buffer

Basically it means that fluentd tells the inputs (which are fluentds again in logs-collector and the external logs-forwarder) not to send data anymore, so it will fill the buffers of those fluentds as well. I think that's a bit safer than just dropping the oldest chunks?
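For illustration, the buffer section from this diff with the `block` behaviour instead would look roughly like this (a sketch, not the actual template):

```
<buffer>
  flush_mode interval
  flush_interval 60s
  flush_thread_count 16
  # block: stop accepting events from inputs when the buffer is full,
  # applying backpressure upstream instead of discarding data
  overflow_action block
</buffer>
```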
Well, I guess this is a philosophical debate 😄

I think the chunks should be dropped at the point where the bottleneck is occurring, because that makes it easy to identify where in the log pipeline the issue is located: it's the only fluentd with a full buffer and a log full of dropped-chunk warnings.

If you block instead, the entire pipeline backs up, which tends to obfuscate the location of the issue because buffers are full everywhere and you may see issues at points in the pipeline several hops removed from the actual bottleneck. You will still end up dropping logs eventually; it just takes slightly longer and will be further back in the pipeline.
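Either way, buffer metrics make the bottleneck visible. One option (not part of this change) is fluentd's built-in `monitor_agent` input, which exposes per-plugin buffer statistics over HTTP on its default port:

```
<source>
  @type monitor_agent
  bind 0.0.0.0
  port 24220
</source>
```

`curl http://localhost:24220/api/plugins.json` then reports queued buffer size and retry counts for each plugin, so the fluentd with the growing buffer stands out.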
@@ -259,19 +261,17 @@ objects:
      ssl_version TLSv1_2
      request_timeout 60s
      slow_flush_log_threshold 90s
-     <buffer tag,time>
+     <buffer>
Mhh interesting. My understanding of https://docs.fluentd.org/output/elasticsearch#index_name-optional is that you need to set up the buffer with `time` if you use time in the index name. But maybe because we already use the `record_modifier` on top, this is not necessary?
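For context, the pattern in the docs that requires `time` as a buffer chunk key is roughly this (values are the documentation's example, not ours):

```
<match app.**>
  @type elasticsearch
  # time placeholders in index_name are extracted from the chunk,
  # so "time" must be one of the buffer's chunk keys
  index_name fluentd.${tag}.%Y%m%d
  <buffer tag, time>
    timekey 1d
  </buffer>
</match>
```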
Yes, that is my understanding. We don't use `index_name`, we just use `target_index_key`. The filter that adds `index_name` is also unconditional, so we don't need the `index_name` fallback.
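A sketch of the `target_index_key` approach being described (the field name and the `record_modifier` expression are illustrative assumptions, not taken from the actual template):

```
# a record_modifier filter stamps the destination index onto every record,
# so the output needs no time chunk key at all
<filter app.**>
  @type record_modifier
  <record>
    # hypothetical field name; expression syntax is illustrative
    viaq_index_name app-write
  </record>
</filter>

<match app.**>
  @type elasticsearch
  # read the index from the record instead of computing it via index_name
  target_index_key viaq_index_name
</match>
```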
The main change that seems to increase performance is switching from the deprecated secure_forward plugin to the built-in forward plugin, which supports TLS. The advantages of the forward plugin are:

* it's maintained.
* support for fluentd workers so that input performance can be scaled easily.
* creates a new connection per chunk forward event.

On the last point: the old secure_forward plugin held onto its connection, which meant that sessions were "sticky" to a pod and effectively all work was being sent to a single forwarder pod. With the forward plugin, the load balancer can do its job of spreading work across all logs-forwarder pods.

The other changes include:

* flush threads are now set to 2 (x4 workers = 8)
* giving the output an @id so it is identifiable in logs
* not using a timekey, so chunk flushing is _only_ size dependent
* storing any tag/time in the same chunk to avoid lots of little chunks
* increasing the flush interval and chunk size
* dropping chunks on buffer overflow
* removing the redundant index_name field in the elasticsearch output
* making the scheme configurable via an environment variable (default http)
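The forward-input side of this change could be sketched roughly as follows (ports and certificate paths are assumptions, not taken from the actual template):

```
<system>
  # 4 workers; each output flushes with flush_thread_count 2, so 8 flush threads total
  workers 4
</system>

<source>
  # built-in forward input with TLS, replacing the deprecated secure_forward
  @type forward
  port 24224
  <transport tls>
    cert_path /etc/fluentd/tls/tls.crt
    private_key_path /etc/fluentd/tls/tls.key
  </transport>
</source>
```

Because the forward output opens a new connection per chunk, the service load balancer in front of these workers can spread chunks across all logs-forwarder pods instead of pinning a session to one.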
Checklist

`logs-forwarder` was not able to write to Elasticsearch fast enough to keep up with the volume of logs. This change seeks to improve the situation.

Closing issues

n/a