Shipper doesn't upload to azure storage but no indication of errors. Azure storage container is empty after more than 2 hours #4287

Closed
max77p opened this issue May 29, 2021 · 4 comments · Fixed by #4392

max77p commented May 29, 2021

Thanos, Prometheus and Golang version used
Thanos: 0.18.0-scratch-r5 (helm chart)
Prometheus: v2.21.0

What happened

I have a sidecar with the Azure storage account config set correctly. There are no errors, yet the bucket is still empty; I left it overnight and it stayed empty. I tried different min and max block durations, from 1m up to the recommended 2h, but nothing gets uploaded. This is an OpenShift cluster, but no connectivity errors show up; if I use an incorrect endpoint in the objectstore YAML, connectivity errors do appear. Is there a way to test the connection?
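For what it's worth, one way to test the connection independently of the sidecar (a sketch; it assumes the thanos binary is available somewhere with network access to the storage account, and that the same config is saved locally as objstore.yml, a placeholder name) is to list the bucket with the CLI:

thanos tools bucket ls --objstore.config-file=objstore.yml

If the credentials or endpoint are wrong, this should fail immediately; if the container is simply empty, it should succeed without listing any blocks.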

What you expected to happen

Files uploaded to container in storage account.

How to reproduce it (as minimally and precisely as possible):

Run a basic Prometheus setup with the image above, using a Kubernetes scrape config and the sidecar image from the Helm chart.

Full logs to relevant components

level=debug ts=2021-05-29T18:01:50.723917646Z caller=main.go:64 msg="maxprocs: Leaving GOMAXPROCS=[16]: CPU quota undefined"
level=info ts=2021-05-29T18:01:50.724334768Z caller=options.go:23 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=info ts=2021-05-29T18:01:50.724714687Z caller=factory.go:46 msg="loading bucket configuration"
level=debug ts=2021-05-29T18:01:50.724872096Z caller=azure.go:69 msg="creating new Azure bucket connection" component=sidecar
level=debug ts=2021-05-29T18:01:50.844682481Z caller=azure.go:88 msg="Getting connection to existing Azure blob container" container=non-prod
level=error ts=2021-05-29T18:01:50.848772195Z caller=sidecar.go:250 err="WAL dir is not accessible. Is this dir a TSDB directory? If yes it is shared with TSDB?: stat /prometheus/wal: no such file or directory"
level=info ts=2021-05-29T18:01:50.848814798Z caller=sidecar.go:291 msg="starting sidecar"
level=info ts=2021-05-29T18:01:50.8488649Z caller=reloader.go:183 component=reloader msg="nothing to be watched"
level=info ts=2021-05-29T18:01:50.848919503Z caller=intrumentation.go:60 msg="changing probe status" status=healthy
level=info ts=2021-05-29T18:01:50.848966506Z caller=http.go:58 service=http/server component=sidecar msg="listening for requests and metrics" address=0.0.0.0:10902
level=info ts=2021-05-29T18:01:50.849151515Z caller=intrumentation.go:48 msg="changing probe status" status=ready
level=info ts=2021-05-29T18:01:50.849275522Z caller=grpc.go:116 service=gRPC/server component=sidecar msg="listening for serving gRPC" address=0.0.0.0:10901
level=warn ts=2021-05-29T18:01:50.871877608Z caller=sidecar.go:303 msg="failed to get Prometheus flags. Is Prometheus running? Retrying" err="got non-200 response code: 503, response: Service Unavailable"
level=warn ts=2021-05-29T18:01:52.849562656Z caller=sidecar.go:303 msg="failed to get Prometheus flags. Is Prometheus running? Retrying" err="got non-200 response code: 503, response: Service Unavailable"
level=warn ts=2021-05-29T18:01:54.849533173Z caller=sidecar.go:303 msg="failed to get Prometheus flags. Is Prometheus running? Retrying" err="got non-200 response code: 503, response: Service Unavailable"
level=warn ts=2021-05-29T18:01:56.84973711Z caller=sidecar.go:303 msg="failed to get Prometheus flags. Is Prometheus running? Retrying" err="got non-200 response code: 503, response: Service Unavailable"
level=warn ts=2021-05-29T18:01:58.84963398Z caller=sidecar.go:303 msg="failed to get Prometheus flags. Is Prometheus running? Retrying" err="got non-200 response code: 503, response: Service Unavailable"
level=warn ts=2021-05-29T18:02:00.849727661Z caller=sidecar.go:303 msg="failed to get Prometheus flags. Is Prometheus running? Retrying" err="got non-200 response code: 503, response: Service Unavailable"
level=warn ts=2021-05-29T18:02:02.849766938Z caller=sidecar.go:303 msg="failed to get Prometheus flags. Is Prometheus running? Retrying" err="got non-200 response code: 503, response: Service Unavailable"
level=warn ts=2021-05-29T18:02:04.849539633Z caller=sidecar.go:303 msg="failed to get Prometheus flags. Is Prometheus running? Retrying" err="got non-200 response code: 503, response: Service Unavailable"
level=warn ts=2021-05-29T18:02:06.84985911Z caller=sidecar.go:303 msg="failed to get Prometheus flags. Is Prometheus running? Retrying" err="got non-200 response code: 503, response: Service Unavailable"
level=warn ts=2021-05-29T18:02:08.849565398Z caller=sidecar.go:303 msg="failed to get Prometheus flags. Is Prometheus running? Retrying" err="got non-200 response code: 503, response: Service Unavailable"
level=warn ts=2021-05-29T18:02:10.849825311Z caller=sidecar.go:303 msg="failed to get Prometheus flags. Is Prometheus running? Retrying" err="got non-200 response code: 503, response: Service Unavailable"
level=warn ts=2021-05-29T18:02:12.849653223Z caller=sidecar.go:303 msg="failed to get Prometheus flags. Is Prometheus running? Retrying" err="got non-200 response code: 503, response: Service Unavailable"
level=warn ts=2021-05-29T18:02:14.84992105Z caller=sidecar.go:303 msg="failed to get Prometheus flags. Is Prometheus running? Retrying" err="got non-200 response code: 503, response: Service Unavailable"
level=warn ts=2021-05-29T18:02:16.919524933Z caller=sidecar.go:303 msg="failed to get Prometheus flags. Is Prometheus running? Retrying" err="got non-200 response code: 503, response: Service Unavailable"
level=info ts=2021-05-29T18:02:18.852384063Z caller=sidecar.go:155 msg="successfully loaded prometheus external labels" external_labels="{monitor=\"dev002-infrastructure\", replica=\"A\"}"
level=info ts=2021-05-29T18:02:18.852451866Z caller=intrumentation.go:48 msg="changing probe status" status=ready

Anything else:

Object store config is passed in as a secret to the sidecar.

The secret

apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore-config
  namespace: monitoring
type: Opaque
stringData:
  objstore.yml: |-
    type: AZURE
    config:
      storage_account: "prometheusnonprod"
      storage_account_key: "<account key here>"
      container: "non-prod"
      endpoint: "blob.core.windows.net"
      max_retries: 0
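
For reference, an equivalent way to create this secret directly from the config file (a sketch; the file name objstore.yml and the monitoring namespace match the manifest below):

kubectl create secret generic thanos-objstore-config \
  --namespace monitoring \
  --from-file=objstore.yml=objstore.yml

The sidecar then receives the whole file through the OBJSTORE_CONFIG environment variable, as shown in the container spec below.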

Sidecar reference

    spec:
      serviceAccountName: prometheus
      containers:
        - name: prometheus
          args:
            - --config.file=/etc/prometheus/prometheus.yml
            - --web.listen-address=:9090
            - --storage.tsdb.retention=1y
            - --storage.tsdb.path=/prometheus
            - --storage.tsdb.min-block-duration=2h
            - --storage.tsdb.max-block-duration=2h   
          image: <link hidden for security reasons but using image mentioned in ticket>
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 9090
              name: prometheus
              protocol: TCP
          livenessProbe:
            failureThreshold: 3
            initialDelaySeconds: 300
            periodSeconds: 10
            successThreshold: 1
            tcpSocket:
              port: 9090
            timeoutSeconds: 3
          readinessProbe:
            failureThreshold: 3
            initialDelaySeconds: 10
            periodSeconds: 10
            successThreshold: 1
            tcpSocket:
              port: 9090
            timeoutSeconds: 2
          resources:
            requests:
              cpu: 1000m
              memory: 1500Mi
            limits:
              cpu: 1000m
              memory: 4000Mi
          volumeMounts:
            - mountPath: /etc/prometheus
              name: config-volume
            - mountPath: /prometheus
              name: data-volume
            - mountPath: /etc/prometheus-rules
              name: rules-volume
        - name: thanos-sidecar
          image: <using image 0.18.0-scratch-r5>
          imagePullPolicy: IfNotPresent
          args:
            - sidecar
            - --prometheus.url=http://localhost:9090
            - --grpc-address=0.0.0.0:10901
            - --http-address=0.0.0.0:10902
            - --tsdb.path=/prometheus/
            - --log.level=debug
            - --objstore.config=$(OBJSTORE_CONFIG)
            - --shipper.upload-compacted
          env:
            - name: OBJSTORE_CONFIG
              valueFrom:
                secretKeyRef:
                  name: thanos-objstore-config
                  key: objstore.yml
          ports:
            - name: grpc
              containerPort: 10901
              protocol: TCP
            - name: http
              containerPort: 10902
              protocol: TCP
          volumeMounts:
            - mountPath: /prometheus
              name: thanos-db
      restartPolicy: Always
      imagePullSecrets:
        - name: docker-registry
      volumes:
        - name: data-volume
          persistentVolumeClaim:
            claimName: prometheus-data-volume-claim
        - name: thanos-db
          persistentVolumeClaim:
            claimName: prometheus-thanos-data-volume-claim       
        - name: rules-volume
          configMap:
            defaultMode: 420
            name: prometheus-rules
        - name: config-volume
          configMap:
            name: prometheus-scrapes

Environment:

Openshift kubernetes cluster
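
Given the "WAL dir is not accessible" error in the log above, two quick checks are worth running before suspecting Azure itself (a sketch; the pod name is a placeholder, and it assumes a shell and these tools are available in the containers, which a scratch-based sidecar image may not have; port-forwarding 10902 and curling from outside works as well):

kubectl exec -n monitoring <prometheus-pod> -c thanos-sidecar -- ls /prometheus
kubectl exec -n monitoring <prometheus-pod> -c thanos-sidecar -- wget -qO- http://localhost:10902/metrics | grep thanos_shipper

If the first command does not show a wal directory and ULID-named block directories, the sidecar is not looking at the same volume Prometheus writes to, and the shipper has nothing to upload regardless of the object storage config.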

@GiedriusS
Member

@yeya24 I think you've specified the wrong issue in your PR, reopening this

@GiedriusS GiedriusS reopened this Jul 1, 2021
@stale

stale bot commented Sep 3, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Sep 3, 2021
@stale

stale bot commented Sep 19, 2021

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Sep 19, 2021
@Nafalgar

Nafalgar commented May 9, 2022

@max77p did you ever find a solution to this?
Would it be possible to re-open this issue @GiedriusS? I am currently experiencing the same thing; I have the following logs in the sidecar:

level=info ts=2022-05-09T14:21:47.212629765Z caller=options.go:27 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=info ts=2022-05-09T14:21:47.213035665Z caller=factory.go:49 msg="loading bucket configuration"
level=info ts=2022-05-09T14:21:47.390504364Z caller=sidecar.go:326 msg="starting sidecar"
level=info ts=2022-05-09T14:21:47.390772364Z caller=reloader.go:197 component=reloader msg="nothing to be watched"
level=info ts=2022-05-09T14:21:47.390893264Z caller=intrumentation.go:60 msg="changing probe status" status=healthy
level=info ts=2022-05-09T14:21:47.390924164Z caller=http.go:63 service=http/server component=sidecar msg="listening for requests and metrics" address=:10902
ts=2022-05-09T14:21:47.391673965Z caller=log.go:168 service=http/server component=sidecar level=info msg="TLS is disabled." http2=false
level=info ts=2022-05-09T14:21:47.392009365Z caller=intrumentation.go:48 msg="changing probe status" status=ready
level=info ts=2022-05-09T14:21:47.392065365Z caller=grpc.go:127 service=gRPC/server component=sidecar msg="listening for serving gRPC" address=:10901
level=info ts=2022-05-09T14:21:47.394762966Z caller=sidecar.go:166 msg="successfully loaded prometheus version"
level=info ts=2022-05-09T14:21:47.406204273Z caller=sidecar.go:188 msg="successfully loaded prometheus external labels" external_labels="{prometheus=\"monitoring/prometheus-kube-prometheus-prometheus\", prometheus_replica=\"prometheus-prometheus-kube-prometheus-prometheus-0\"}"
level=info ts=2022-05-09T14:21:47.406249673Z caller=intrumentation.go:48 msg="changing probe status" status=ready

However, after waiting for over 2 hours (default offload time), there is still no data stored on Azure and no blob has been created in the container.

The setup is a kube-prometheus-stack deployment in an AKS cluster. I can query the Prometheus Thanos data source just fine, e.g. from Grafana. Changing the values in the Thanos YAML file results in an error/timeout, so it is clearly reading the YAML correctly and connecting.

thanos.yaml file:

type: AZURE
config:
  storage_account: '<account name>'
  storage_account_key: '<key value>'
  container: 'thanos-storage'
  endpoint: 'blob.core.windows.net'
  max_retries: 0
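
Since connectivity looks fine here, one thing worth checking (a sketch; the pod name is taken from the external labels in the log above, the namespace is assumed to be monitoring, and the thanos_shipper_* metric names are what the shipper exposes on its HTTP port) is whether the shipper is attempting any uploads at all:

kubectl -n monitoring port-forward pod/prometheus-prometheus-kube-prometheus-prometheus-0 10902:10902
curl -s http://localhost:10902/metrics | grep -E 'thanos_shipper_(uploads|upload_failures)_total'

Zero uploads with zero failures would point at the shipper never seeing completed 2h blocks under --tsdb.path rather than at the Azure side.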
