Logging: AWS Elasticsearch: Cannot get new connection from pool #21744

Closed
nlowe opened this issue Jul 24, 2019 · 8 comments
Labels: area/logging, internal, kind/bug, priority/0

nlowe commented Jul 24, 2019

What kind of request is this (question/bug/enhancement/feature request):
Bug

Steps to reproduce (least amount of steps as possible):

  1. Create an Elasticsearch Domain in AWS
  2. Configure Rancher Cluster Logging to ship to the Elasticsearch endpoint

Result:

After some period of time, new logs stop showing up in Elasticsearch. Checking the logs of the fluentd container shows the following error:

2019-07-24 06:58:22 +0000 [warn]: #0     failed to flush the buffer. retry_time=9 next_retry_seconds=2019-07-24 07:02:13 +0000 chunk="58e67ce51dc61232b625376ad93f6e7a" error_class=Elasticsearch::Transport::Transport::Error error="Cannot get new connection from pool."
  2019-07-24 06:58:22 +0000 [warn]: #0     /var/lib/gems/2.3.0/gems/elasticsearch-transport-6.1.0/lib/elasticsearch/transport/transport/base.rb:258:in `perform_request'
  2019-07-24 06:58:22 +0000 [warn]: #0     /var/lib/gems/2.3.0/gems/elasticsearch-transport-6.1.0/lib/elasticsearch/transport/transport/http/faraday.rb:20:in `perform_request'
  2019-07-24 06:58:22 +0000 [warn]: #0     /var/lib/gems/2.3.0/gems/elasticsearch-transport-6.1.0/lib/elasticsearch/transport/client.rb:131:in `perform_request'
  2019-07-24 06:58:22 +0000 [warn]: #0     /var/lib/gems/2.3.0/gems/elasticsearch-api-6.1.0/lib/elasticsearch/api/actions/bulk.rb:95:in `bulk'
  2019-07-24 06:58:22 +0000 [warn]: #0     /var/lib/gems/2.3.0/gems/fluent-plugin-elasticsearch-2.12.5/lib/fluent/plugin/out_elasticsearch.rb:639:in `send_bulk'
  2019-07-24 06:58:22 +0000 [warn]: #0     /var/lib/gems/2.3.0/gems/fluent-plugin-elasticsearch-2.12.5/lib/fluent/plugin/out_elasticsearch.rb:541:in `block in write'
  2019-07-24 06:58:22 +0000 [warn]: #0     /var/lib/gems/2.3.0/gems/fluent-plugin-elasticsearch-2.12.5/lib/fluent/plugin/out_elasticsearch.rb:540:in `each'
  2019-07-24 06:58:22 +0000 [warn]: #0     /var/lib/gems/2.3.0/gems/fluent-plugin-elasticsearch-2.12.5/lib/fluent/plugin/out_elasticsearch.rb:540:in `write'
  2019-07-24 06:58:22 +0000 [warn]: #0     /var/lib/gems/2.3.0/gems/fluentd-1.3.3/lib/fluent/plugin/output.rb:1123:in `try_flush'
  2019-07-24 06:58:22 +0000 [warn]: #0     /var/lib/gems/2.3.0/gems/fluentd-1.3.3/lib/fluent/plugin/output.rb:1423:in `flush_thread_run'
  2019-07-24 06:58:22 +0000 [warn]: #0     /var/lib/gems/2.3.0/gems/fluentd-1.3.3/lib/fluent/plugin/output.rb:452:in `block (2 levels) in start'
  2019-07-24 06:58:22 +0000 [warn]: #0     /var/lib/gems/2.3.0/gems/fluentd-1.3.3/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2019-07-24 06:58:22 +0000 [warn]: #0     failed to flush the buffer. retry_time=10 next_retry_seconds=2019-07-24 07:07:31 +0000 chunk="58e67cdcd7d1406de13fe55a26fe6cad" error_class=Elasticsearch::Transport::Transport::Error error="Cannot get new connection from pool."

Other details that may be helpful:

Looking at atomita/fluent-plugin-aws-elasticsearch-service#15 (comment), this appears to be related to how connection reloading works in the Ruby Elasticsearch client and a specific incompatibility with AWS: when the client reloads (sniffs) the node list, AWS Elasticsearch does not return directly reachable node addresses, so the connection pool is left empty.

Using:

kubectl delete po -n cattle-logging --selector=app=fluentd

to restart the fluentd pods corrected the issue and backfilled the missing logs.

I think this can be worked around by editing the logging config file in the UI directly to contain:

<match *>
  @type elasticsearch
  include_tag_key true
  hosts "https://***.***.es.amazonaws.com:443"
  logstash_prefix "***"
  logstash_format true
  logstash_dateformat "%Y-%m-%d"
  type_name "container_log"
  ssl_verify true
  ssl_version "TLSv1_2"
  
  # Workarounds for AWS
  reconnect_on_error true
  reload_connections false
  reload_on_failure false
</match>

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): v2.2.5
  • Installation option (single install/HA): HA

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Imported (EKS)
  • Machine type (cloud/VM/metal) and specifications (CPU/memory): Cloud (t3.large)
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.0", GitCommit:"ddf47ac13c1a9483ea035a79cd7c10005ff21a6d", GitTreeState:"clean", BuildDate:"2018-12-03T21:04:45Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.8-eks-7c34c0", GitCommit:"7c34c0d2f2d0f11f397d55a46945193a0e22d8f3", GitTreeState:"clean", BuildDate:"2019-03-01T22:49:39Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

Tejeev commented Apr 17, 2020

These settings should be built into our defaults. They are already the defaults in fluentd-kubernetes-daemonset: https://github.com/fluent/fluentd-kubernetes-daemonset/blob/85135da0b9ebc708f5ea5084ef13499f342063eb/templates/conf/fluent.conf.erb#L31-L33
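
For reference, these are the three flags in question, paraphrasing the defaults the linked template ships (the template actually reads them from environment variables, which is omitted here):

reload_connections false
reconnect_on_error true
reload_on_failure true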

@sevenval-admins

It would be great if this change to the defaults could be included in the next Rancher 2.3 release. Is that planned?


izaac commented May 12, 2020

Reproduced in v2.4.3. I will validate the fix in:

  • master
  • v2.4-head
  • v2.3-head when available.

  1. Deployed an AWS ES Domain. (The domain should have enough EBS disk capacity for a long run.)
  2. Set up cluster and project logging with the flush interval set to 1 second. (The cluster should have enough disk space for a long run.)
  3. In a project, deployed a web workload or any other docker container to generate logs (a minimal example appears at the end of this comment) and started monitoring the logs in Kibana in the ES Domain (Kibana is deployed automatically to analyze the logs).

After a couple of hours I hit the error below in the fluentd containers in the cluster's System project.

  2020-05-12 22:38:31 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2020-05-12 22:38:32 +0000 chunk="5a57b1f58d5a3d8212649b26b4e415d2" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"search-rancherlogging-xus3urf5docq7rhbj5wwgfmgxe.us-east-2.es.amazonaws.com\", :port=>443, :scheme=>\"https\", :user=>\"izaacesmaster\", :password=>\"obfuscated\"}): Cannot get new connection from pool."
  2020-05-12 22:38:31 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluent-plugin-elasticsearch-3.8.0/lib/fluent/plugin/out_elasticsearch.rb:886:in `rescue in send_bulk'
  2020-05-12 22:38:31 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluent-plugin-elasticsearch-3.8.0/lib/fluent/plugin/out_elasticsearch.rb:848:in `send_bulk'
  2020-05-12 22:38:31 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluent-plugin-elasticsearch-3.8.0/lib/fluent/plugin/out_elasticsearch.rb:710:in `block in write'
  2020-05-12 22:38:31 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluent-plugin-elasticsearch-3.8.0/lib/fluent/plugin/out_elasticsearch.rb:709:in `each'
  2020-05-12 22:38:31 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluent-plugin-elasticsearch-3.8.0/lib/fluent/plugin/out_elasticsearch.rb:709:in `write'
  2020-05-12 22:38:31 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-1.8.0/lib/fluent/plugin/output.rb:1133:in `try_flush'
  2020-05-12 22:38:31 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-1.8.0/lib/fluent/plugin/output.rb:1439:in `flush_thread_run'
  2020-05-12 22:38:31 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-1.8.0/lib/fluent/plugin/output.rb:461:in `block (2 levels) in start'
  2020-05-12 22:38:31 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-1.8.0/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'

[Screenshot attached: 2020-05-12 3:44 PM]
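
For step 3, any container that just writes to stdout is enough; a minimal sketch (image, namespace, and message are only illustrative):

# Hypothetical log generator pod for the soak test
kubectl -n default run log-generator --image=busybox --restart=Never -- \
  /bin/sh -c 'while true; do echo "test log line $(date)"; sleep 1; done'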


izaac commented May 13, 2020

Rancher v2.4-head (05/12/20) commit id: 65b5205
Rancher master-head (05/12/20) commit id: dd102af

I left the log generator running for 12 hours, following the same steps I used to reproduce the issue in v2.4.3, and saw no errors. I was not able to reproduce the problem with these versions, so the fix is validated.

The cluster.conf and project.conf keys in the rancher-logging-fluentd secret had the values:

reload_connections false
reconnect_on_error true
reload_on_failure true
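
A quick way to confirm these values ended up in the generated config (assuming the rancher-logging-fluentd secret and the cattle-logging namespace mentioned earlier in this thread; key names may differ between Rancher versions):

# Dump the generated cluster config and check the connection-handling flags
kubectl -n cattle-logging get secret rancher-logging-fluentd \
  -o jsonpath='{.data.cluster\.conf}' | base64 -d \
  | grep -E 'reload_connections|reconnect_on_error|reload_on_failure'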


toddexp commented May 13, 2020

Our team has been overriding these settings for Elasticsearch logging since Rancher 2.2; we have bare-metal clusters. We found that reload_connections false and reload_on_failure true are important settings and would make good defaults. However, reconnect_on_error true was giving us problems: it resulted in duplicate log messages being sent to Elastic. Whenever there was a connection error, Fluentd would reconnect, and on reconnection it would re-catalog all of the logs for the index and resend them. We leave reconnect_on_error unset and Elastic / Fluentd works fine for us.
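
A sketch of that combination, showing only the three flags (the rest of the elasticsearch match block stays unchanged):

reload_connections false
reload_on_failure true
# reconnect_on_error deliberately left unset; enabling it caused duplicate log messages in this setup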


loganhz commented May 13, 2020

Cannot reproduce the log duplication issue in v2.4-dev.

I changed the NodePort of my ES svc to make it unreachable from fluentd, then changed it back after 5 minutes. In Kibana, I can see the logs after fluentd retries, and there is no duplication.
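
One way to simulate that break-and-restore is something like the following (service name and port value are placeholders):

# Move the ES service to a different NodePort, then revert it after 5 minutes
kubectl patch svc my-es-svc --type='json' \
  -p='[{"op":"replace","path":"/spec/ports/0/nodePort","value":32001}]'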

reconnect_on_error true is recommended by the plugin to deal with the original issue. https://github.com/uken/fluent-plugin-elasticsearch#stopped-to-send-events-on-k8s-why

Filed an issue to make it configurable in Rancher.
#27072

For the duplication issues, I think we should leverage the elasticsearch_genid filter. Filed another issue for it.
#27073
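
A sketch of what that could look like in the fluentd config, following the pattern from the plugin README (the filter/match patterns and the _hash key name are placeholders to adapt to the generated Rancher config):

<filter **>
  # Generate a deterministic hash per record so retried chunks don't create duplicate documents
  @type elasticsearch_genid
  hash_id_key _hash
</filter>

<match **>
  @type elasticsearch
  # Use the hash as the document _id and drop the helper field before indexing
  id_key _hash
  remove_keys _hash
  # ... existing connection/index settings unchanged ...
</match>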


toddexp commented May 14, 2020

Our Elastic Stack is external to the cluster; I am not sure if that has something to do with it. There are probably a number of other factors, and like I said, we haven't played with these settings since Rancher 2.2. Maybe even the Elastic version could impact this as well.

Regardless, it does sound like the elasticsearch_genid suggestion is probably the correct way to handle the duplicate-data issue. Thanks for looking into this.


izaac commented May 14, 2020

Closing this issue, as the reported connection-reloading problem was fixed and validated.

I will focus on #27003 to do the validation in 2.3-head

Duplication issues/feature requests were opened by logan.
