Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

receive: Got errors log "get appender: TSDB not ready;" when reciver cluster started with log "TSDB is now ready" #4533

Closed
imneov opened this issue Aug 8, 2021 · 3 comments
Labels

Comments

@imneov
Copy link

imneov commented Aug 8, 2021

Thanos, Prometheus and Golang version used:

Thanos: thanos-0.21.1
Object Storage Provider: N/A

What happened:

When starting new reciver cluster, we got error "get appender: TSDB not ready;"


Aug 05 14:02:20 Node001 systemd[1]: Started Thanos receiver component.
Aug 05 14:02:21 Node001 thanos[10247]: level=warn ts=2021-08-05T06:02:21.014100798Z caller=receive.go:118 component=receive msg="setting up receive"
Aug 05 14:02:21 Node001 thanos[10247]: level=info ts=2021-08-05T06:02:21.014140078Z caller=options.go:24 component=receive protocol=HTTP msg="disabled TLS, key and cert must be set to enable"
Aug 05 14:02:21 Node001 thanos[10247]: level=info ts=2021-08-05T06:02:21.014189229Z caller=receive.go:159 component=receive msg="no supported bucket was configured, uploads will be disabled"
Aug 05 14:02:21 Node001 thanos[10247]: level=info ts=2021-08-05T06:02:21.014216469Z caller=receive.go:532 component=receive msg="no existing storage found, no data migration attempted"
Aug 05 14:02:21 Node001 thanos[10247]: level=info ts=2021-08-05T06:02:21.014662136Z caller=receive.go:519 component=receive msg="starting receiver"
Aug 05 14:02:21 Node001 thanos[10247]: level=info ts=2021-08-05T06:02:21.014722196Z caller=receive.go:309 component=receive msg="the hashring initialized with config watcher."
Aug 05 14:02:21 Node001 thanos[10247]: level=info ts=2021-08-05T06:02:21.014772517Z caller=handler.go:220 component=receive component=receive-handler msg="Start listening for connections" address=0.0.0.0:19100
Aug 05 14:02:21 Node001 thanos[10247]: level=info ts=2021-08-05T06:02:21.014795538Z caller=options.go:24 component=receive protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
Aug 05 14:02:21 Node001 thanos[10247]: level=info ts=2021-08-05T06:02:21.014742447Z caller=intrumentation.go:60 component=receive msg="changing probe status" status=healthy
Aug 05 14:02:21 Node001 thanos[10247]: level=info ts=2021-08-05T06:02:21.014861488Z caller=http.go:63 component=receive service=http/server component=receive msg="listening for requests and metrics" address=0.0.0.0:8080
Aug 05 14:02:21 Node001 thanos[10247]: level=warn ts=2021-08-05T06:02:21.014914229Z caller=intrumentation.go:54 component=receive msg="changing probe status" status=not-ready reason="hashring has changed; server is not ready to receive web requests"
Aug 05 14:02:21 Node001 thanos[10247]: level=info ts=2021-08-05T06:02:21.01494315Z caller=handler.go:247 component=receive component=receive-handler msg="Serving plain HTTP" address=0.0.0.0:19100
Aug 05 14:02:21 Node001 thanos[10247]: level=info ts=2021-08-05T06:02:21.01495473Z caller=receive.go:352 component=receive msg="hashring has changed; server is not ready to receive web requests"
Aug 05 14:02:21 Node001 thanos[10247]: level=info ts=2021-08-05T06:02:21.0149986Z caller=receive.go:262 component=receive msg="updating storage"
Aug 05 14:02:21 Node001 thanos[10247]: level=info ts=2021-08-05T06:02:21.01493759Z caller=tls_config.go:191 component=receive service=http/server component=receive msg="TLS is disabled." http2=false
Aug 05 14:02:21 Node001 thanos[10247]: level=info ts=2021-08-05T06:02:21.015134542Z caller=intrumentation.go:48 component=receive msg="changing probe status" status=ready
Aug 05 14:02:21 Node001 thanos[10247]: level=info ts=2021-08-05T06:02:21.015148433Z caller=receive.go:275 component=receive msg="storage started, and server is ready to receive web requests"
Aug 05 14:02:21 Node001 thanos[10247]: level=info ts=2021-08-05T06:02:21.015504898Z caller=receive.go:426 component=receive msg="listening for StoreAPI and WritableStoreAPI gRPC" address=0.0.0.0:10980
Aug 05 14:02:21 Node001 thanos[10247]: level=info ts=2021-08-05T06:02:21.015538278Z caller=grpc.go:123 component=receive service=gRPC/server component=receive msg="listening for serving gRPC" address=0.0.0.0:10980
Aug 05 14:21:33 Node001 thanos[10247]: level=info ts=2021-08-05T06:21:33.786245938Z caller=multitsdb.go:274 component=receive component=multi-tsdb tenant=default-tenant msg="opening TSDB"
Aug 05 14:21:33 Node001 thanos[10247]: level=info ts=2021-08-05T06:21:33.792983213Z caller=head.go:741 component=receive component=multi-tsdb tenant=default-tenant msg="Replaying on-disk memory mappable chunks if any"
Aug 05 14:21:33 Node001 thanos[10247]: level=info ts=2021-08-05T06:21:33.793020664Z caller=head.go:755 component=receive component=multi-tsdb tenant=default-tenant msg="On-disk memory mappable chunks replay completed" duration=2.32µs
Aug 05 14:21:33 Node001 thanos[10247]: level=info ts=2021-08-05T06:21:33.793033274Z caller=head.go:761 component=receive component=multi-tsdb tenant=default-tenant msg="Replaying WAL, this may take a while"
Aug 05 14:21:33 Node001 thanos[10247]: level=info ts=2021-08-05T06:21:33.793251217Z caller=head.go:813 component=receive component=multi-tsdb tenant=default-tenant msg="WAL segment loaded" segment=0 maxSegment=0
Aug 05 14:21:33 Node001 thanos[10247]: level=info ts=2021-08-05T06:21:33.793296048Z caller=head.go:818 component=receive component=multi-tsdb tenant=default-tenant msg="WAL replay completed" checkpoint_replay_duration=18.221µs wal_replay_duration=231.553µs total_replay_duration=274.134µs
Aug 05 14:21:33 Node001 thanos[10247]: level=info ts=2021-08-05T06:21:33.794535955Z caller=multitsdb.go:303 component=receive component=multi-tsdb tenant=default-tenant msg="TSDB is now ready"
Aug 05 14:21:33 Node001 thanos[10247]: level=error ts=2021-08-05T06:21:33.795268136Z caller=handler.go:352 component=receive component=receive-handler err="3 errors: replicate write request for endpoint 222.1.1.29:10980: quorum not reached: 2 errors: store locally for endpoint 222.1.1.29:10980: get appender: TSDB not ready; forwarding request to endpoint 222.1.1.43:10980: rpc error: code = Internal desc = store locally for endpoint 222.1.1.43:10980: get appender: TSDB not ready; replicate write request for endpoint 222.1.1.43:10980: quorum not reached: forwarding request to endpoint 222.1.1.43:10980: rpc error: code = Internal desc = store locally for endpoint 222.1.1.43:10980: get appender: TSDB not ready; replicate write request for endpoint 222.1.1.28:10980: quorum not reached: store locally for endpoint 222.1.1.29:10980: get appender: TSDB not ready" msg="internal server error"
Aug 05 15:41:19 Node001 thanos[10247]: level=error ts=2021-08-05T07:41:19.658732091Z caller=handler.go:352 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
Aug 05 15:41:19 Node001 thanos[10247]: level=error ts=2021-08-05T07:41:19.658733821Z caller=handler.go:352 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
Aug 05 15:41:19 Node001 thanos[10247]: level=error ts=2021-08-05T07:41:19.793138636Z caller=handler.go:352 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
Aug 05 15:41:19 Node001 thanos[10247]: level=error ts=2021-08-05T07:41:19.795700862Z caller=handler.go:352 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"

The start script (thanos-receiver.service):

[Unit]
Description=Thanos receiver component

[Service]
ExecStart=/opt/thanos-0.21.1.linux-arm64/thanos receive \
--tsdb.path "/data/prometheus" \
--tsdb.retention 90d \
--grpc-address 0.0.0.0:10980 \
--http-address 0.0.0.0:8080 \
--receive.replication-factor 2 \
--label "receive_replica=\"2\"" \
--label "receive_cluster=\"nm\"" \
--receive.local-endpoint 2.1.1.28:10980 \
--receive.hashrings-file /etc/thanos/hashring.json \
--remote-write.address 0.0.0.0:19100

[Install]
WantedBy=multi-user.target

The hashring.json:

[
  {
    "endpoints": [
      "222.1.1.28:10980",
      "222.1.1.29:10980",
      "222.1.1.43:10980"
    ]
  }
]
imneov added a commit to imneov/thanos that referenced this issue Aug 8, 2021
…write request

- if not's use block mode, we will got error (thanos-io#4533)
@bwplotka
Copy link
Member

Thanks for fix and issue.

I agree the log lines are confusing, but the problem is that TSDB is never fully ready. In a sense that there can be new tenant anytime and this will create new TSDB for it. In some way in the current implementation it's expected to fail and wait for client to repeat the write.

What else we should expect here if we need to support dynamic creation of TSDBs (appendables) per tenant?

cc @squat @brancz @kakkoyun

@stale
Copy link

stale bot commented Oct 30, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Oct 30, 2021
@stale
Copy link

stale bot commented Jan 9, 2022

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Jan 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants