How to configure correctly for HA Thanos #520

Closed
Alexvianet opened this issue Sep 13, 2018 · 24 comments

@Alexvianet

Alexvianet commented Sep 13, 2018

I can't follow the examples in the documentation. Here is what I have:

1) Prometheus: 2 nodes in different zones, each with a Thanos sidecar
2) Grafana: 2 nodes in different zones
3) HAProxy: 2 nodes in different zones, load balancing Prometheus, Grafana, ...
4) Thanos store: 1 node
5) Thanos query: 2 nodes in different zones
6) Thanos compact: 1 node
7) An S3 bucket as object storage

prometheus, version 2.3.2 (branch: HEAD, revision: 71af5e29e815795e9dd14742ee7725682fa14b7b)
build user: root@5258e0bd9cc1
build date: 20180712-14:02:52
go version: go1.10.3

thanos, version 0.1.0-rc.2 (branch: HEAD, revision: 53e4d69)
build user: root@c7199d758b5e
build date: 20180705-12:54:50
go version: go1.10.3

What happened
level=debug ts=2018-09-13T14:17:22.530346817Z caller=cluster.go:278 component=cluster msg="refresh cluster done, peers joined" peers=127.0.0.1:10900 before=5 after=1
What you expected to happen
I need to understand the best practices for configuring Thanos on this kind of infrastructure.
I would like a real example of a multi-node cluster running all Thanos components.

Are Thanos metrics automatically available from query?
I only get the thanos_cluster_members metric when I add the thanos_query HTTP address to the targets in the Prometheus config.
How to reproduce it (as minimally and precisely as possible):

All S3 configuration is set with export ...

./prometheus --storage.tsdb.no-lockfile --storage.tsdb.retention=1h

./thanos query --query.replica-label replica --log.level=debug --cluster.peers="127.0.0.1:10900"

./thanos sidecar --cluster.peers="thanos_query:10900"

./thanos store --tsdb.path=./store --cluster.peers="thanos_query:10900"

./thanos compact --data-dir=./data

Environment:
CentOS Linux release 7.5.1804 (Core)
Linux 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

@bwplotka
Member

Can you provide a simple diagram? I am a bit confused about where, for example, HAProxy sits in your setup. (:

@bwplotka
Member

bwplotka commented Sep 14, 2018

The key here is to avoid gossip (since it will be removed soon) - and just configure your thanos-queriers (both of them) with

--store=<thanos-sidecar>:<grpc-port>
--store=<thanos-sidecar2>:<grpc-port>
--store=<thanos-store>:<grpc-port>

And point Grafana to the thanos-query endpoint (behind HAProxy, I guess, if you want).

And that's it (: Gossip seems to be overkill here.
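
A minimal sketch of the static setup described above, assuming two sidecars and one store node. The hostnames and ports are hypothetical placeholders; `--store` and `--query.replica-label` are the flags already used in this thread. The script only prints the querier command instead of starting Thanos, so it can be checked before use.

```shell
#!/bin/sh
# Sketch only: prints the querier command instead of running it.
# Hostnames and ports below are hypothetical placeholders.

# Turn a list of StoreAPI addresses into repeated --store flags.
store_flags() {
  flags=""
  for addr in "$@"; do
    flags="$flags --store=$addr"
  done
  # Drop the leading space left by the loop.
  echo "${flags# }"
}

# Two sidecars (one per Prometheus) plus the store node.
STORES="sidecar-a.example:19091 sidecar-b.example:19091 store.example:19093"

# The same command would be used on both querier nodes.
# Word splitting of $STORES is intended here.
echo "thanos query --query.replica-label replica $(store_flags $STORES)"
```

Because every querier gets the same static list, both instances see the same stores and can sit behind the load balancer interchangeably.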

@Alexvianet
Author

Looks like this one:
[attached image: 11]

PS: "The key here is to avoid gossip (since it will be removed soon)" <--- in a new version of Thanos?

@bwplotka
Member

Yup, see this: #493

So essentially static configuration is what you want. In the future there will be DNS-based discovery as well as file SD, which will make it more flexible.

@Alexvianet
Author

For testing, I started one node of each component.
I configured Thanos like this:

    exec thanos sidecar \
    --log.level="debug" \
    --prometheus.url="https://prometheus.net" \
    --http-address="0.0.0.0:19191" \
    --grpc-address="0.0.0.0:19091"  \
    --cluster.address="0.0.0.0:19391"   \
    --cluster.gossip-interval="5s"  \
    --cluster.pushpull-interval="5s" \
    --cluster.refresh-interval="1m0s" \
    --tsdb.path="/var/vcap/store/prometheus2" \
    --reloader.config-envsubst-file="/var/vcap/jobs/prometheus2/config/prometheus.yml"

    exec thanos store \
    --tsdb.path="/var/vcap/store/thanos/store" \
    --log.level="debug" \
    --http-address="0.0.0.0:19193" \
    --grpc-address="0.0.0.0:19093"  \
    --cluster.address="0.0.0.0:19891"   \
    --cluster.gossip-interval="5s"  \
    --cluster.pushpull-interval="5s" \
    --cluster.refresh-interval="1m0s" \
    --index-cache-size="1GB" \
    --chunk-pool-size="2GB"

    exec thanos query \
    --log.level="debug" \
    --http-address="0.0.0.0:19192" \
    --grpc-address="0.0.0.0:19092"  \
    --cluster.address="0.0.0.0:19591" \
    --cluster.peers="0.0.0.0:19591"  \
    --cluster.gossip-interval="5s" \
    --cluster.pushpull-interval="5s" \
    --cluster.refresh-interval="1m0s" \
    --query.timeout=2m  --query.replica-label=thanos_query_replica  \
    --query.max-concurrent="20" \
    --store=<thanos-sidecar>:19091 \
    --store=<thanos-store>:19093

query logs:

level=info ts=2018-09-17T07:09:43.334407506Z caller=flags.go:53 msg="StoreAPI address that will be propagated through gossip" address=<thanos-query>:19092
level=info ts=2018-09-17T07:09:43.337044628Z caller=flags.go:68 msg="QueryAPI address that will be propagated through gossip" address=<thanos-query>:19192
level=info ts=2018-09-17T07:09:43.342363255Z caller=query.go:256 msg="starting query node"
level=info ts=2018-09-17T07:09:43.346142795Z caller=query.go:230 msg="Listening for query and metrics" address=0.0.0.0:19192
level=info ts=2018-09-17T07:09:43.346198456Z caller=query.go:248 component=query msg="Listening for StoreAPI gRPC" address=0.0.0.0:19092
level=info ts=2018-09-17T07:09:43.347591644Z caller=storeset.go:226 component=storeset msg="adding new store to query storeset" address=<thanos-sidecar>:19091
level=info ts=2018-09-17T07:09:43.347663034Z caller=storeset.go:226 component=storeset msg="adding new store to query storeset" address=<thanos-store>:19093
level=warn ts=2018-09-17T07:09:53.348700243Z caller=cluster.go:300 component=cluster NumMembers=1 msg="I appear to be alone in the cluster"
level=warn ts=2018-09-17T07:10:03.348689325Z caller=cluster.go:300 component=cluster NumMembers=1 msg="I appear to be alone in the cluster"
level=warn ts=2018-09-17T07:10:13.348686534Z caller=cluster.go:300 component=cluster NumMembers=1 msg="I appear to be alone in the cluster"

and no query results in the web UI

@bwplotka
Member

Remove all peer flags configuration. Let's not duplicate discovery mechanisms.
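
One way to apply this to an existing flag list is sketched below. It assumes that "peer flags" means every flag starting with `--cluster.` and that each flag is written as a single `--name=value` word; the helper name `strip_cluster_flags` is hypothetical. It prints the remaining flags rather than launching anything.

```shell
#!/bin/sh
# Sketch only: drops every argument starting with --cluster. and prints the rest.
# Assumes each flag is passed as one --name=value word.
strip_cluster_flags() {
  out=""
  for arg in "$@"; do
    case "$arg" in
      --cluster.*) ;;            # gossip/peer setting: skip it
      *) out="$out $arg" ;;      # anything else is kept in order
    esac
  done
  # Drop the leading space left by the loop.
  echo "${out# }"
}

# Example with a subset of the query flags posted earlier in this thread.
strip_cluster_flags \
  --http-address=0.0.0.0:19192 \
  --grpc-address=0.0.0.0:19092 \
  --cluster.address=0.0.0.0:19591 \
  --cluster.peers=0.0.0.0:19591 \
  --store='<thanos-sidecar>:19091' \
  --store='<thanos-store>:19093'
```

What remains is the static `--store` list plus the listen addresses, so store discovery happens through one mechanism only.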

level=info ts=2018-09-17T07:09:43.347591644Z caller=storeset.go:226 component=storeset msg="adding new store to query storeset" address=<thanos-sidecar>:19091
level=info ts=2018-09-17T07:09:43.347663034Z caller=storeset.go:226 component=storeset msg="adding new store to query storeset" address=<thanos-store>:19093

This indicates that query has access (: So now the question is: do you have the metric anywhere? (: It is worth checking the Prometheus UI (where the sidecar is) to see whether the metrics are there, checking the sidecar logs, and making sure you have the correct time range.
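
A rough way to run that check from a shell is to send the same instant query to Prometheus and to the querier and compare the answers. This sketch assumes both expose the Prometheus-compatible `/api/v1/query` HTTP endpoint on the addresses posted in this thread; the helper name `instant_query_url` is hypothetical, and it handles plain metric names only (no URL-encoding).

```shell
#!/bin/sh
# Sketch only. Builds the instant-query URL of the Prometheus-compatible
# HTTP API for a plain metric name (no URL-encoding is done here).
instant_query_url() {
  base="${1%/}"   # drop one trailing slash, if any
  metric="$2"
  echo "$base/api/v1/query?query=$metric"
}

# Print the two URLs to compare; fetch each one with: curl -s "<url>"
instant_query_url "https://prometheus.net" up
instant_query_url "http://<thanos-query>:19192" up
```

If Prometheus returns series and the querier does not, the problem is between the querier and the sidecar; if neither returns anything, the metric or the time range is the more likely cause.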

@Alexvianet
Author

Thanks, it works now.

@thesaadarshad

Hi @Alexvianet, may I ask a question about your Thanos implementation?

@Alexvianet
Author

all ok thanks

@Alexvianet
Author

Alexvianet commented Nov 22, 2019 via email

@thesaadarshad

My stack is somewhat similar to yours:

  1. Prometheus 2 nodes different zone with Thanos sidecar
  2. Deployed as Docker container on Query Node
  3. Thanos store 1 node
  4. Thanos query 1 node (at the moment)
  5. Thanos compact 1 node
  6. S3 bucket as an object storage

My question is about the third item, the Store node. Would you suggest deploying the Store node separately on a different node, and how would you make it highly available?

@Alexvianet
Author

Alexvianet commented Nov 23, 2019 via email

@Alexvianet
Author

Alexvianet commented Nov 23, 2019 via email

@thesaadarshad

Makes sense, but why would you want to query Prometheus directly? The low-retention data stored in Prometheus is queryable via Querier, which automatically talks to the sidecar and the store at the same time. Connecting Grafana directly to Querier also works.

image

@Alexvianet
Author

Alexvianet commented Nov 23, 2019 via email

@thesaadarshad

Makes sense. Help me clear up one more point of confusion.
Did you deploy Store independently on a separate node? I'm still having trouble deploying it correctly.
tl;dr: how did you connect the querier to the Store API to retrieve old data?
🙏

@Alexvianet
Author

Alexvianet commented Nov 23, 2019 via email

@thesaadarshad

But which nodes store data in the S3 bucket? Surely not every instance would be uploading data into the bucket?

@Alexvianet
Author

Alexvianet commented Nov 23, 2019 via email

@thesaadarshad

But remember, Thanos Store and Thanos Querier are on different nodes, and nowhere in Querier do we define where Thanos Store is?
Pardon my ignorance, but I'm a bit confused.

@Alexvianet
Author

Alexvianet commented Nov 24, 2019 via email

@thesaadarshad

So how do they connect, then? Can you share your Thanos store init params?

@Alexvianet
Author

Alexvianet commented Nov 24, 2019 via email

@Sunil777

Hi,

Can anyone please share the complete setup details here... It is totally confusing.

Prometheus 2 nodes different zone with Thanos sidecar ..... (Sidecar running on same Prometheus host ?)

Deployed as Docker container on Query Node ........(Is this single node?)

Thanos store 1 node ........ (Is this single node?)

Thanos query 1 node (at the moment)......... (what is the difference between "Thanos query 1 node" and "Deployed as Docker container on Query Node"?)

Thanos compact 1 node .............(Is this single node?)
S3 bucket as an object storage
