thanos-query crashes with "concurrent map iteration and map write" #1272

Closed
bjakubski opened this issue Jun 24, 2019 · 7 comments

@bjakubski

Thanos, Prometheus and Golang version used

thanos, version 0.5.0 (branch: HEAD, revision: 72820b3)
build user: circleci@eeac5eb36061
build date: 20190606-10:53:12
go version: go1.12.5

What happened

In one of the k8s clusters where we run thanos-query, it crashes every couple of minutes with "fatal error: concurrent map iteration and map write" or "fatal error: concurrent map writes".

What you expected to happen

No crash :-)

How to reproduce it (as minimally and precisely as possible):

I've no idea. I didn't manage to find anything that triggers it. The same problem was observed in 0.4.0. I'm not sure about 0.3.0.
Thanos runs in a GKE cluster on GCP, and query is deployed via our own Helm chart. The crashing containers run:

  thanos query
      --log.level=debug
      --query.replica-label=prometheus_replica
      --grpc-server-tls-cert=/etc/certs/tls.crt
      --grpc-server-tls-key=/etc/certs/tls.key
      --store=dnssrv+_grpc._tcp.thanos-sidecars-prometheus.monitoring.svc
      --selector-label=location="REDACTED"
      --selector-label=stack="REDACTED"
      --selector-label=REDACTED

The same deployment (differing only in selector-label values) crashes less often in another GKE cluster and almost never in a third one, while receiving similar (very low) traffic via gRPC.

Those query instances serve as gRPC endpoints for a global thanos-query (which runs in another, "observability" cluster and does not crash) and return recent data (older data is served from the bucket). They sit behind a GCP load balancer (using HTTP/2 between the LB and Thanos in GKE).

Full logs to relevant components

Example after-crash dump is here: https://gist.github.com/bjakubski/18a98f6f1fc2922e5056df3106fe1477

@GiedriusS
Member

Seems like something related to the UI and the execution of templates. Do you use any particular functionality of it? I wonder how to reproduce it.

@lx223
Contributor

lx223 commented Jun 26, 2019

Hi, I am interested in helping. Can someone assign this to me pls?

@bjakubski
Author

As far as I know traffic received by those crashing instances consists of:

  • queries passed from other (global) thanos-query instances. They go through the GCP load balancer.
  • health checks performed by GKE (k8s) and the GCP load balancer: the HTTP port is checked by accessing the /graph URL, and gRPC is checked via GCP's "http2" health check (whose exact behaviour I do not know).

Other than that, the Thanos UI is practically unused.

@lx223
Contributor

lx223 commented Jun 26, 2019

I believe this issue is caused by concurrent map writes in ui.go, where the web prefix is rendered into the HTML templates. I'm a bit surprised that it was actually triggered with such low traffic. Here is the PR: #1280. PTAL
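For readers unfamiliar with this class of crash: Go maps are not safe for concurrent writes, so a map shared between request handlers and modified per request can abort the whole process. Below is a minimal sketch of one common way to avoid that, by building a fresh `template.FuncMap` for every request. It is not the Thanos code and not necessarily what #1280 does; `newFuncs`, `render`, and the `pathPrefix` template function are hypothetical names chosen for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"html/template"
	"sync"
)

// newFuncs returns a fresh FuncMap for a single request. Nothing is shared
// between requests, so concurrent requests never write to the same map.
// Hypothetical helper; not taken from the Thanos source.
func newFuncs(prefix string) template.FuncMap {
	return template.FuncMap{
		"pathPrefix": func() string { return prefix },
	}
}

// render executes a tiny template using a per-request FuncMap.
func render(prefix string) (string, error) {
	t, err := template.New("page").Funcs(newFuncs(prefix)).Parse("prefix={{ pathPrefix }}")
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, nil); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Simulate many concurrent requests, each with its own prefix.
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if _, err := render(fmt.Sprintf("/p%d", i)); err != nil {
				panic(err)
			}
		}(i)
	}
	wg.Wait()
	fmt.Println("all renders finished")
}
```

If the map really has to be shared, guarding it with a `sync.RWMutex` is another option; the per-request map simply removes the shared state altogether.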

@povilasv
Member

Fix was merged to master, please try the new version.

@povilasv
Member

@lx223 thanks for the fix!

@bjakubski
Author

I quickly deployed a version with the fix and have seen no crashes for one hour (usually there were a couple of them in a similar time range in this cluster).
Thanks @lx223, much appreciated!

lx223 added a commit to lx223/thanos that referenced this issue Jun 27, 2019