Logging: AWS Elasticsearch: Cannot get new connection from pool #21744

Closed
nlowe opened this issue Jul 24, 2019 · 8 comments
Labels: area/logging, internal, kind/bug, priority/0

nlowe commented Jul 24, 2019

What kind of request is this (question/bug/enhancement/feature request):
Bug

Steps to reproduce (least amount of steps as possible):

  1. Create an Elasticsearch Domain in AWS
  2. Configure Rancher Cluster Logging to ship to the Elasticsearch endpoint

Result:

After some period of time, new logs stop showing up in Elasticsearch. Checking the logs of the fluentd container shows the following error:

2019-07-24 06:58:22 +0000 [warn]: #0     failed to flush the buffer. retry_time=9 next_retry_seconds=2019-07-24 07:02:13 +0000 chunk="58e67ce51dc61232b625376ad93f6e7a" error_class=Elasticsearch::Transport::Transport::Error error="Cannot get new connection from pool."
  2019-07-24 06:58:22 +0000 [warn]: #0     /var/lib/gems/2.3.0/gems/elasticsearch-transport-6.1.0/lib/elasticsearch/transport/transport/base.rb:258:in `perform_request'
  2019-07-24 06:58:22 +0000 [warn]: #0     /var/lib/gems/2.3.0/gems/elasticsearch-transport-6.1.0/lib/elasticsearch/transport/transport/http/faraday.rb:20:in `perform_request'
  2019-07-24 06:58:22 +0000 [warn]: #0     /var/lib/gems/2.3.0/gems/elasticsearch-transport-6.1.0/lib/elasticsearch/transport/client.rb:131:in `perform_request'
  2019-07-24 06:58:22 +0000 [warn]: #0     /var/lib/gems/2.3.0/gems/elasticsearch-api-6.1.0/lib/elasticsearch/api/actions/bulk.rb:95:in `bulk'
  2019-07-24 06:58:22 +0000 [warn]: #0     /var/lib/gems/2.3.0/gems/fluent-plugin-elasticsearch-2.12.5/lib/fluent/plugin/out_elasticsearch.rb:639:in `send_bulk'
  2019-07-24 06:58:22 +0000 [warn]: #0     /var/lib/gems/2.3.0/gems/fluent-plugin-elasticsearch-2.12.5/lib/fluent/plugin/out_elasticsearch.rb:541:in `block in write'
  2019-07-24 06:58:22 +0000 [warn]: #0     /var/lib/gems/2.3.0/gems/fluent-plugin-elasticsearch-2.12.5/lib/fluent/plugin/out_elasticsearch.rb:540:in `each'
  2019-07-24 06:58:22 +0000 [warn]: #0     /var/lib/gems/2.3.0/gems/fluent-plugin-elasticsearch-2.12.5/lib/fluent/plugin/out_elasticsearch.rb:540:in `write'
  2019-07-24 06:58:22 +0000 [warn]: #0     /var/lib/gems/2.3.0/gems/fluentd-1.3.3/lib/fluent/plugin/output.rb:1123:in `try_flush'
  2019-07-24 06:58:22 +0000 [warn]: #0     /var/lib/gems/2.3.0/gems/fluentd-1.3.3/lib/fluent/plugin/output.rb:1423:in `flush_thread_run'
  2019-07-24 06:58:22 +0000 [warn]: #0     /var/lib/gems/2.3.0/gems/fluentd-1.3.3/lib/fluent/plugin/output.rb:452:in `block (2 levels) in start'
  2019-07-24 06:58:22 +0000 [warn]: #0     /var/lib/gems/2.3.0/gems/fluentd-1.3.3/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2019-07-24 06:58:22 +0000 [warn]: #0     failed to flush the buffer. retry_time=10 next_retry_seconds=2019-07-24 07:07:31 +0000 chunk="58e67cdcd7d1406de13fe55a26fe6cad" error_class=Elasticsearch::Transport::Transport::Error error="Cannot get new connection from pool."

Other details that may be helpful:

Looking at atomita/fluent-plugin-aws-elasticsearch-service#15 (comment), this appears to be related to how connection reloading works in the Ruby Elasticsearch client and a specific incompatibility with AWS: when the client reloads (sniffs) the node list, AWS Elasticsearch does not return directly reachable node addresses, so the connection pool is left empty.

Using:

kubectl delete po -n cattle-logging --selector=app=fluentd

to restart the fluentd pods corrected the issue and backfilled the missing logs.

I think this can be worked around by editing the logging config file in the UI directly to contain:

<match *>
  @type elasticsearch
  include_tag_key true
  hosts "https://***.***.es.amazonaws.com:443"
  logstash_prefix "***"
  logstash_format true
  logstash_dateformat "%Y-%m-%d"
  type_name "container_log"
  ssl_verify true
  ssl_version "TLSv1_2"
  
  # Workarounds for AWS
  reconnect_on_error true
  reload_connections false
  reload_on_failure false
</match>

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): v2.2.5
  • Installation option (single install/HA): HA

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Imported (EKS)
  • Machine type (cloud/VM/metal) and specifications (CPU/memory): Cloud (t3.large)
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.0", GitCommit:"ddf47ac13c1a9483ea035a79cd7c10005ff21a6d", GitTreeState:"clean", BuildDate:"2018-12-03T21:04:45Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.8-eks-7c34c0", GitCommit:"7c34c0d2f2d0f11f397d55a46945193a0e22d8f3", GitTreeState:"clean", BuildDate:"2019-03-01T22:49:39Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

Tejeev commented Apr 17, 2020

These settings should be built into our defaults. They are already the defaults in fluentd-kubernetes-daemonset: https://github.com/fluent/fluentd-kubernetes-daemonset/blob/85135da0b9ebc708f5ea5084ef13499f342063eb/templates/conf/fluent.conf.erb#L31-L33
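
For reference, these are the three flags in question, paraphrasing the defaults the linked template ships (the template actually reads them from environment variables, which is omitted here):

reload_connections false
reconnect_on_error true
reload_on_failure true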

@sevenval-admins

It would be great if this change to the defaults could be included in the next Rancher 2.3 release. Is that planned?


izaac commented May 12, 2020

Reproduced in v2.4.3. I will validate the fix in:

  • master
  • v2.4-head
  • v2.3-head when available.

  1. Deployed an AWS ES Domain. (The domain should have enough EBS disk capacity for a long run.)
  2. Set up cluster and project logging with the flush interval set to 1 second. (The cluster should have enough disk space for a long run.)
  3. In a project, deployed a web workload or any other docker container to generate logs (a minimal example appears at the end of this comment) and started monitoring the logs in Kibana in the ES Domain (Kibana is deployed automatically to analyze the logs).

After a couple of hours I hit the error below in the fluentd containers in the cluster's System project.

  2020-05-12 22:38:31 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2020-05-12 22:38:32 +0000 chunk="5a57b1f58d5a3d8212649b26b4e415d2" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"search-rancherlogging-xus3urf5docq7rhbj5wwgfmgxe.us-east-2.es.amazonaws.com\", :port=>443, :scheme=>\"https\", :user=>\"izaacesmaster\", :password=>\"obfuscated\"}): Cannot get new connection from pool."
  2020-05-12 22:38:31 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluent-plugin-elasticsearch-3.8.0/lib/fluent/plugin/out_elasticsearch.rb:886:in `rescue in send_bulk'
  2020-05-12 22:38:31 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluent-plugin-elasticsearch-3.8.0/lib/fluent/plugin/out_elasticsearch.rb:848:in `send_bulk'
  2020-05-12 22:38:31 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluent-plugin-elasticsearch-3.8.0/lib/fluent/plugin/out_elasticsearch.rb:710:in `block in write'
  2020-05-12 22:38:31 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluent-plugin-elasticsearch-3.8.0/lib/fluent/plugin/out_elasticsearch.rb:709:in `each'
  2020-05-12 22:38:31 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluent-plugin-elasticsearch-3.8.0/lib/fluent/plugin/out_elasticsearch.rb:709:in `write'
  2020-05-12 22:38:31 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-1.8.0/lib/fluent/plugin/output.rb:1133:in `try_flush'
  2020-05-12 22:38:31 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-1.8.0/lib/fluent/plugin/output.rb:1439:in `flush_thread_run'
  2020-05-12 22:38:31 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-1.8.0/lib/fluent/plugin/output.rb:461:in `block (2 levels) in start'
  2020-05-12 22:38:31 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-1.8.0/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'

[Screenshot attached: 2020-05-12 3:44 PM]
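
For step 3, any container that just writes to stdout is enough; a minimal sketch (image, namespace, and message are only illustrative):

# Hypothetical log generator pod for the soak test
kubectl -n default run log-generator --image=busybox --restart=Never -- \
  /bin/sh -c 'while true; do echo "test log line $(date)"; sleep 1; done'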


izaac commented May 13, 2020

Rancher v2.4-head (05/12/20) commit id: 65b5205
Rancher master-head (05/12/20) commit id: dd102af

I left the log generator running for 12 hours, following the same steps I used to reproduce the issue in v2.4.3, and saw no errors. I was not able to reproduce the problem with these versions, so the fix is validated.

The cluster.conf and project.conf keys in the rancher-logging-fluentd secret had the values:

reload_connections false
reconnect_on_error true
reload_on_failure true
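
A quick way to confirm these values ended up in the generated config (assuming the rancher-logging-fluentd secret and the cattle-logging namespace mentioned earlier in this thread; key names may differ between Rancher versions):

# Dump the generated cluster config and check the connection-handling flags
kubectl -n cattle-logging get secret rancher-logging-fluentd \
  -o jsonpath='{.data.cluster\.conf}' | base64 -d \
  | grep -E 'reload_connections|reconnect_on_error|reload_on_failure'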


toddexp commented May 13, 2020

Our team has been overriding these settings for Elasticsearch logging since Rancher 2.2; we have bare-metal clusters. We found that reload_connections false and reload_on_failure true are important settings and would make good defaults. However, reconnect_on_error true was giving us problems: it resulted in duplicate log messages being sent to Elastic. Whenever there was a connection error, Fluentd would reconnect, and on reconnection it would re-catalog all of the logs for the index and resend them. We leave reconnect_on_error unset and Elastic / Fluentd works fine for us.
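
A sketch of that combination, showing only the three flags (the rest of the elasticsearch match block stays unchanged):

reload_connections false
reload_on_failure true
# reconnect_on_error deliberately left unset; enabling it caused duplicate log messages in this setup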


loganhz commented May 13, 2020

Cannot reproduce the log duplication issue in v2.4-dev.

I changed the NodePort of my ES svc to make it unreachable from fluentd, then changed it back after 5 minutes. In Kibana, I can see the logs after fluentd retries, and there is no duplication.
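
One way to simulate that break-and-restore is something like the following (service name and port value are placeholders):

# Move the ES service to a different NodePort, then revert it after 5 minutes
kubectl patch svc my-es-svc --type='json' \
  -p='[{"op":"replace","path":"/spec/ports/0/nodePort","value":32001}]'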

reconnect_on_error true is recommended by the plugin to deal with the original issue. https://github.com/uken/fluent-plugin-elasticsearch#stopped-to-send-events-on-k8s-why

Filed an issue to make it configurable in Rancher.
#27072

For the duplication issues, I think we should leverage the elasticsearch_genid filter. Filed another issue for it.
#27073
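
A sketch of what that could look like in the fluentd config, following the pattern from the plugin README (the filter/match patterns and the _hash key name are placeholders to adapt to the generated Rancher config):

<filter **>
  # Generate a deterministic hash per record so retried chunks don't create duplicate documents
  @type elasticsearch_genid
  hash_id_key _hash
</filter>

<match **>
  @type elasticsearch
  # Use the hash as the document _id and drop the helper field before indexing
  id_key _hash
  remove_keys _hash
  # ... existing connection/index settings unchanged ...
</match>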


toddexp commented May 14, 2020

Our Elastic Stack is external to the cluster; I am not sure if that has something to do with it. There are probably a number of other factors, and like I said, we haven't played with these settings since Rancher 2.2. Maybe even the Elastic version could impact this as well.

Regardless, it does sound like the elasticsearch_genid suggestion is probably the correct way to handle the duplicate-data issue. Thanks for looking into this.


izaac commented May 14, 2020

Closing this issue, as the reported connection-reloading problem was fixed and validated.

I will focus on #27003 to do the validation in 2.3-head

Duplication issues/feature requests were opened by logan.
