
Vector randomly stops shipping certain k8s logs #12014

Open
danthegoodman1 opened this issue Mar 29, 2022 · 67 comments
Labels
source: kubernetes_logs - Anything `kubernetes_logs` source related
type: bug - A code related bug.

Comments

@danthegoodman1

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

After a while, our Vector daemonset randomly stops shipping logs for certain services (other pods keep shipping logs).

Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: vector
  namespace: observability
  labels:
    app.kubernetes.io/name: vector
    app.kubernetes.io/instance: vector
    app.kubernetes.io/component: Agent
    app.kubernetes.io/version: "0.20.0-distroless-libc"
data:
  agent.yaml: |
    data_dir: /vector-data-dir
    api:
      enabled: true
      address: 127.0.0.1:8686
      playground: false
    sources:
      kubernetes_logs:
        type: kubernetes_logs
      host_metrics:
        filesystem:
          devices:
            excludes: [binfmt_misc]
          filesystems:
            excludes: [binfmt_misc]
          mountPoints:
            excludes: ["*/proc/sys/fs/binfmt_misc"]
        type: host_metrics
      internal_metrics:
        type: internal_metrics
    transforms:
      setlevel:
        type: remap
        inputs: [kubernetes_logs]
        source: |-
          .temp = parse_json!(.message)
          if !exists(parse_json!(.message).level) {
            .level = "other"
          } else {
            .level = .temp.level
          }
          if exists(.temp.uri) {
            .uri = .temp.uri
          }
          if exists(.temp.msg) {
            .msg = .temp.msg
          }
          if exists(.temp.lat) {
            .lat = .temp.lat
            .lon = .temp.lon
          }
          del(.temp)
    sinks:
      prom_exporter:
        type: prometheus_exporter
        inputs: [host_metrics, internal_metrics]
        address: 0.0.0.0:9090
      # stdout:
      #   type: console
      #   inputs: [setlevel]
      #   encoding:
      #     codec: json
      loki:
        type: loki
        inputs:
          - "setlevel"
        endpoint: https://logs-prod-us-central1.grafana.net
        compression: gzip
        # remove_label_fields: true
        encoding:
          codec: json
        auth:
          password: ${LOKI_PASSWORD}
          user: "${LOKI_USERNAME}"
          strategy: basic
        labels:
          namespace: "{{ kubernetes.pod_namespace }}"
          pod: "{{ kubernetes.pod_name }}"
          level: "{{ level }}"
          app_label: "{{ kubernetes.pod_label.app }}"
          node: "{{ kubernetes.pod_node_name }}"
          pod_owner: "{{ kubernetes.pod_owner }}"
          cluster: ${CLUSTER_NAME}


Version

timberio/vector:0.20.0-distroless-libc

Debug Output

Working on getting relevant debug output; it currently emits a lot of TRACE logs :P

Example Data

{"func":"github.com/xxx/xxx/crdb.ConnectToDB","level":"debug","msg":"Connected to CRDB","time":"2022-03-29T17:43:21.015Z"}

Additional Context

The only logs from Vector:

2022-03-29T17:35:57.183220Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] has been rate limited 10 times.
2022-03-29T17:35:57.183241Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Mapping failed with event. error="function call error for \"parse_json\" at (8:29): unable to parse json: trailing characters at line 1 column 5" internal_log_rate_secs=30
2022-03-29T17:35:57.200043Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] is being rate limited.
2022-03-29T17:36:32.201827Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] has been rate limited 8 times.
2022-03-29T17:36:32.201877Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Mapping failed with event. error="function call error for \"parse_json\" at (8:29): unable to parse json: trailing characters at line 1 column 5" internal_log_rate_secs=30
2022-03-29T17:36:32.437566Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] is being rate limited.

References

Similar to an issue I've had in the past: #8634

danthegoodman1 added the type: bug label Mar 29, 2022
@danthegoodman1
Author

I will add that restarting Vector seems to solve the issue, but that's not a sustainable solution.

@danthegoodman1
Author

It's also hard to get debug logs because this is not something I can reproduce quickly; it's usually something we notice after 3-10 days, when we try to look at logs and there are none.

@jszwedko
Member

Hi @danthegoodman1 !

It seems like some of the incoming events are not JSON. I might suggest augmenting your remap program to log the event that fails to parse as JSON like this:

.temp, err  = parse_json(.message)
if err != null {
  log("failed to parse .message as JSON: " + .message)
} else {
  // message successfully parsed as JSON
}

@jszwedko
Member

As an aside, you are currently parsing .message as JSON twice. I'd recommend replacing:

if !exists(parse_json!(.message).level) {

with:

if !exists(.temp.level) {

@danthegoodman1
Author

Thanks for the quick response @jszwedko. Very few of our logs are not JSON (some panics, some startup ones), but the vast majority are JSON logs, and those are the ones not coming through.

Will add the suggestions in the meantime

@jszwedko
Member

@danthegoodman1 ah, I see. The log messages you shared above just indicate JSON parsing failures. Are there more error logs?

@danthegoodman1
Author

No, that is what is so strange @jszwedko, makes it real hard to debug 🙃

@danthegoodman1
Author

Also getting this; trying to work through why this is happening:

 error[E630]: fallible argument
   ┌─ :3:7
   │
 3 │   log("failed to parse .message as JSON: " + .message)
   │       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   │       │
   │       this expression can fail
   │       handle the error before passing it in as an argument
   │
   = see documentation about error handling at https://errors.vrl.dev/#handling
   = see language documentation at https://vrl.dev

Will update with the solution when I find it

@danthegoodman1
Author

Updated to

log("failed to parse .message as JSON: " + .message ?? "no mesage!")

@danthegoodman1
Author

Here are some updated logs; it seems like the logs that fail to parse as JSON come from pods we don't control in EKS :P

2022-03-29T18:27:59.052628Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: level=debug ts=2022-03-29T18:27:49.577Z caller=cluster.go:329 component=cluster memberlist=\"2022/03/29 18:27:49 [DEBUG] memberlist: Initiating push/pull sync with: 01FYC8DZXMP6ENGHGKM4PBK61Q 10.0.167.205:9094\\n\"" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:27:59.052677Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
2022-03-29T18:28:12.427827Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] has been rate limited 1 times.
2022-03-29T18:28:12.427844Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:02 http: TLS handshake error from 10.0.167.22:11247: EOF" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:12.427861Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
2022-03-29T18:28:17.165804Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] has been rate limited 2 times.
2022-03-29T18:28:17.165830Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:17 http: TLS handshake error from 10.0.167.22:60721: EOF" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:19.164409Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: level=debug ts=2022-03-29T18:28:15.586Z caller=cluster.go:329 component=cluster memberlist=\"2022/03/29 18:28:15 [DEBUG] memberlist: Stream connection from=10.0.167.205:34340\\n\"" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:19.164463Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
2022-03-29T18:28:27.218994Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] has been rate limited 1 times.
2022-03-29T18:28:27.219014Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:22 http: TLS handshake error from 10.0.167.22:27699: EOF" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:37.171126Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:27 http: TLS handshake error from 10.0.167.22:34561: EOF" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:37.171168Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
2022-03-29T18:28:38.940324Z  INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}:file_server: vector::internal_events::file::source: Found new file to watch. file=/var/log/pods/eth-scanner_eth-scanner-27476308-hbml9_05116106-4a27-49a2-ba6e-c811190e304a/eth-scanner/0.log
2022-03-29T18:28:39.221000Z  INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}:file_server: vector::internal_events::file::source: Stopped watching file. file=/var/log/pods/eth-scanner_eth-scanner-27476306-gkzmk_0a02fc81-f83a-4d3a-925b-37febe1dab62/eth-scanner/0.log
2022-03-29T18:28:47.167084Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] has been rate limited 2 times.
2022-03-29T18:28:47.167106Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:47 http: TLS handshake error from 10.0.167.22:48077: EOF" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:47.234267Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
2022-03-29T18:28:52.426194Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] has been rate limited 1 times.
2022-03-29T18:28:52.426218Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:52 http: TLS handshake error from 10.0.167.22:62639: EOF" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:59.187233Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: level=debug ts=2022-03-29T18:28:49.579Z caller=cluster.go:329 component=cluster memberlist=\"2022/03/29 18:28:49 [DEBUG] memberlist: Initiating push/pull sync with: xxx 10.0.167.205:9094\\n\"" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:59.187428Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.

jszwedko added the source: kubernetes_logs label Mar 29, 2022
@danthegoodman1
Author

This has just started happening again, with no error or warning logs. Anything else I should be looking for or trying? We can't be dropping our production logs.

@danthegoodman1
Author

@jszwedko one thing I am noticing is that we don't get the logs about vector finding new files. This is easier to reproduce on things that come and go relatively quickly like cron jobs (we leave the completed pod around so logs can be collected)

@Hades32

Hades32 commented Mar 30, 2022

One thing to add: we checked the mounted /var/log and the files with the missing logs are there; it seems like the event for the new files doesn't get triggered. We're running on AWS EKS and /proc/sys/fs/inotify/max_user_watches is 524288.

@wtchangdm
Contributor

@danthegoodman1 If new log files are not being "found" in time, maybe try reducing glob_minimum_cooldown_ms?

It defaults to 60000 ms in the kubernetes_logs source, but 1000 ms in the file source.
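
For reference, a sketch of where that option goes (the 5000 ms value is only an example, not a recommendation):

sources:
  kubernetes_logs:
    type: kubernetes_logs
    # scan for new/rotated log files more often than the 60000 ms default
    glob_minimum_cooldown_ms: 5000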

@danthegoodman1
Author

@wtchangdm will try, but it happens for long-running pods too, where it just stops collecting logs for those pods. Those files definitely stay around longer than 60s.

@wtchangdm
Contributor

@danthegoodman1 BTW, there is an issue I am also tracking: #8616. Not sure if it's related, though.

@danthegoodman1
Author

Interesting, but we don't get any warnings or errors like that issue does.

@jszwedko
Member

@danthegoodman1 could you try with the latest release, 0.21.1? The kubernetes_logs source was rewritten there to use the community-supported kube-rs crate to interact with Kubernetes, which we expect may have fixed a few issues.

Though, in this case, it does seem like the issue is with file discovery rather than metadata enrichment.

@danthegoodman1
Author

@jszwedko I will say that so far we haven't noticed any issues since we dropped the glob cooldown to 8000, but we also surely have not been in our logs as much.

Will update now and see how it works

@danthegoodman1
Author

@jszwedko This went very not well, was printing this extremely quickly:

2022-04-25T18:40:35.870916Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "pods is forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"pods\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
2022-04-25T18:40:35.871396Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "namespaces is forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"namespaces\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
2022-04-25T18:40:35.872182Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "pods is forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"pods\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
2022-04-25T18:40:35.873367Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "namespaces is forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"namespaces\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
2022-04-25T18:40:35.874306Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "pods is forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"pods\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))

@danthegoodman1
Author

danthegoodman1 commented Apr 25, 2022

Ah ok I needed to update the cluster role tags too :) still getting it...

@spencergilbert
Contributor

@jszwedko This went very not well, was printing this extremely quickly:

2022-04-25T18:40:35.870916Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "pods is forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"pods\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
2022-04-25T18:40:35.871396Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "namespaces is forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"namespaces\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
2022-04-25T18:40:35.872182Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "pods is forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"pods\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
2022-04-25T18:40:35.873367Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "namespaces is forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"namespaces\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
2022-04-25T18:40:35.874306Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "pods is forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"pods\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))

The ClusterRole needs to allow for the list verb as well now, see here
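
For reference, a sketch of the ClusterRole rules implied by the 403s above (the name is illustrative; your chart's role may need additional resources or verbs):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: vector   # illustrative name
rules:
  - apiGroups: [""]
    # the resources named in the Forbidden errors above
    resources: ["pods", "namespaces"]
    # the 0.21+ kube-rs based source needs list in addition to watch
    verbs: ["list", "watch"]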

@danthegoodman1
Author

yep, just made that change :)

@danthegoodman1
Author

After sending ~50 logs/s I can see the following errors show up, then logs stop shipping. It prints a ton of these error logs.

It is able to ship them for about 30 seconds before it stops sending logs.

2022-04-26T13:21:29.462788Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { fields: Object({"file": Bytes(b"/var/log/pods/solana-scanner_solanascanner-1_81d00edf-3189-4a43-867f-3ef46c17c9cd/solanascanner/0.log"), "message": Bytes(b"{\"log\":\"{\\\"func\\\":\\\"main.StoreScannerSlotOffset\\\",\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"Writing slot 131412285 for consumer solanascanner-1\\\",\\\"time\\\":\\\"2022-04-26T13:21:29.450Z\\\"}\\n\",\"stream\":\"stderr\",\"time\":\"2022-04-26T13:21:29.45062706Z\"}"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(2022-04-26T13:21:29.462766592Z)}), metadata: EventMetadata { datadog_api_key: None, splunk_hec_token: None, finalizers: EventFinalizers([]), schema_definition: Definition { collection: Collection { known: {}, unknown: None }, meaning: {}, optional: {} } } }) error_code="annotation_failed" error_type="reader_failed" stage="processing"
2022-04-26T13:21:29.497575Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { fields: Object({"file": Bytes(b"/var/log/pods/solana-scanner_solanascanner-1_81d00edf-3189-4a43-867f-3ef46c17c9cd/solanascanner/0.log"), "message": Bytes(b"{\"log\":\"{\\\"func\\\":\\\"main.StoreScannerSlotOffset\\\",\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"Wrote offset in 35.128858ms\\\",\\\"time\\\":\\\"2022-04-26T13:21:29.485Z\\\"}\\n\",\"stream\":\"stderr\",\"time\":\"2022-04-26T13:21:29.485822529Z\"}"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(2022-04-26T13:21:29.497555692Z)}), metadata: EventMetadata { datadog_api_key: None, splunk_hec_token: None, finalizers: EventFinalizers([]), schema_definition: Definition { collection: Collection { known: {}, unknown: None }, meaning: {}, optional: {} } } }) error_code="annotation_failed" error_type="reader_failed" stage="processing"
2022-04-26T13:21:29.505901Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { fields: Object({"file": Bytes(b"/var/log/pods/solana-scanner_solanascanner-1_81d00edf-3189-4a43-867f-3ef46c17c9cd/solanascanner/0.log"), "message": Bytes(b"{\"log\":\"{\\\"func\\\":\\\"main.ScanBlock\\\",\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"Got slots in 167.388115ms got slots [131412285], len 1\\\",\\\"time\\\":\\\"2022-04-26T13:21:29.502Z\\\"}\\n\",\"stream\":\"stderr\",\"time\":\"2022-04-26T13:21:29.502718488Z\"}"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(2022-04-26T13:21:29.505885264Z)}), metadata: EventMetadata { datadog_api_key: None, splunk_hec_token: None, finalizers: EventFinalizers([]), schema_definition: Definition { collection: Collection { known: {}, unknown: None }, meaning: {}, optional: {} } } }) error_code="annotation_failed" error_type="reader_failed" stage="processing"
2022-04-26T13:21:29.505972Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { fields: Object({"file": Bytes(b"/var/log/pods/solana-scanner_solanascanner-1_81d00edf-3189-4a43-867f-3ef46c17c9cd/solanascanner/0.log"), "message": Bytes(b"{\"log\":\"{\\\"func\\\":\\\"main.ScanBlock\\\",\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"Got same top slot 131412285, sleeping for 250ms\\\",\\\"time\\\":\\\"2022-04-26T13:21:29.502Z\\\"}\\n\",\"stream\":\"stderr\",\"time\":\"2022-04-26T13:21:29.502731568Z\"}"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(2022-04-26T13:21:29.505966965Z)}), metadata: EventMetadata { datadog_api_key: None, splunk_hec_token: None, finalizers: EventFinalizers([]), schema_definition: Definition { collection: Collection { known: {}, unknown: None }, meaning: {}, optional: {} } } }) error_code="annotation_failed" error_type="reader_failed" stage="processing"
2022-04-26T13:21:29.576577Z  INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}:file_server: vector::internal_events::file::source: Stopped watching file. file=/var/log/pods/solana-scanner_solanascanner-1_81d00edf-3189-4a43-867f-3ef46c17c9cd/solanascanner/0.log

@danthegoodman1
Author

It does not affect watching other pods, and CPU never exceeds ~20 mCPU and MEM never exceeds 33 MB for the vector pods

@danthegoodman1
Author

Seems related to #10122 (comment)

@spencergilbert
Contributor

Seems related to #10122 (comment)

Possibly - the other answer would be that solanascanner was deleted in the cluster before all of its log lines were processed. When that happens, delay_deletion_ms comes into play, as it determines how long we hold onto old metadata before dumping it ourselves. Increasing that value will increase memory usage (however much it takes to hold onto the resource metadata), but allows more time to enrich logs from deleted pods.
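
For reference, a sketch of where that setting goes, assuming the kubernetes_logs source from the config at the top of this issue (the 300000 ms value is only an example):

sources:
  kubernetes_logs:
    type: kubernetes_logs
    # keep cached pod metadata for longer after the pod is deleted
    # (default is 60000 ms) so late log lines can still be enriched
    delay_deletion_ms: 300000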

@danthegoodman1
Author

It seems reducing the glob time to 3s is working at this rate. I also reduced the number of logs printed to reduce the frequency of log rotation

@imcheck

imcheck commented Jun 7, 2022

Hi, I faced the same issue with the kubernetes_logs source. My guess is that the Vector agent never releases the file descriptors when some log files are deleted.
My workaround is to restart the agents and aggregators.

@imcheck

imcheck commented Jun 7, 2022

Does anyone know the answer to this problem? Or is there an option in the kubernetes_logs source like remove_after_secs in the file source?
I think this is related to #8101.

@Genry1

Genry1 commented Jun 28, 2022

Hi, I've encountered a similar issue (vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata.), and the root cause seemed to be the pod name, which was the same as that of a previous pod that had been recreated.
It looks like the old pod is deleted from the metadata cache when the timeout hits, but because it has the same name as the newly created pod, the new one no longer has its metadata in the cache and annotation fails with that error.
I think some retry for pod metadata retrieval is necessary if it's missing from the cache.

@kitos9112

Similar sentiment after reading the comments from @imcheck and @Genry1 - in my case it happens with a bunch of StatefulSets every time there's a rolling upgrade, and of course the pod name is reused. It might be related to the lock on the file descriptor opened by the vector-agent container at the host level. I cannot think of any workaround other than deleting the agent pod on the node whenever I spot this occurrence.

@lpugoy

lpugoy commented Jul 7, 2022

We also see this issue if a pod is recreated in the same node with the same name. It's pretty much a show stopper for us.

@jszwedko
Member

jszwedko commented Jul 7, 2022

Thanks for the additional context @kitos9112 @Genry1 and @lpugoy ! I split this off to a separate issue since it seems to be different from the original report here. See: #13467

@igor-nikiforov

igor-nikiforov commented Aug 16, 2022

@jszwedko We also hit this issue after we started rolling out Vector to production. We found that the safest version to use with the kubernetes_logs source is currently 0.19.3. If we upgrade to 0.2x.x we immediately start hitting two issues.

With the newer versions (0.20.x) the only working solution right now is to restart the Vector pods once these issues start happening - that helps in both cases. Both issues make Vector completely unusable in production environments with kubernetes_logs, since there is a big risk of losing logs. I really hope the root cause of these issues will be identified and fixed in upcoming releases.

@zamazan4ik
Contributor

@jszwedko are there any updates regarding this issue? Stopping receiving logs from k8s is kind of a critical issue for us.

@danthegoodman1
Author

@zamazan4ik fluentbit is a good alternative

@zamazan4ik
Contributor

I know a lot of log forwarders (fluentbit, rsyslog, whatever else). But I am especially interested in Vector :)

@jszwedko
Member

Thanks for all of the thoughts everyone. I can appreciate the severity of these issues and believe this area (the kubernetes_logs source) is something we'll be digging into again soon.

There seem to be a mix of issues reported in the comments above. I think they can be categorized into:

  • kubernetes_logs source seems to fail to annotate when pod names are reused #13467
    • this issue is fairly well understood at this point, but not easily remedied
  • Vector just stops watching new pod logs after a small period of time. It seems random and does not affect all pods, but it usually takes around 2-10 minutes to start happening.

    • From: Vector randomly stops shipping certain k8s logs #12014 (comment)
    • I don't see anything in particular that jumps out from 0.19.3 -> 0.20.X that seems related to the kubernetes_logs source. One thing you could try to do there, @igor-nikiforov, is bisect down the nightly versions of Vector to identify the first one that manifests the issue. For v0.19.3 to v0.20.0, this would be from 2021-12-28 to 2022-02-10. If we had a narrower set of commits to look at, something might jump out.
  • Issues with Vector not picking up files because they are rotated too quickly. For this I'd recommend trying to tweak glob_minimum_cooldown_ms by dropping it down so that Vector scans for new files more frequently and max_read_bytes by increasing it so that Vector reads more from each file before continuing on to the next file.
  • Logs stop shipping and many annotation errors are seen. This could be due to kubernetes_logs source seems to fail to annotate when pod names are reused #13467 or the cached pod metadata being deleted before Vector finishes reading the associated files. For the latter you can try increasing delay_deletion_ms to a higher value than 60000 (60s) so that Vector holds on to the metadata longer. This can exacerbate kubernetes_logs source seems to fail to annotate when pod names are reused #13467 though. Notably the annotation issues shouldn't stop Vector from reading the logs, they will just be lacking the Kubernetes metadata (like pod namespace, etc.).

If you don't think your issue is represented in the above set, please leave an additional comment!

@danthegoodman1
Author

@jszwedko we have our glob cooldown set to 2 seconds and have still observed it. Ultimately we have to move to something that doesn't drop logs, because we depend on logs to know when errors occur.

I can't imagine that k8s is not a massive part of Vector's user base. We aren't logging very quickly either, maybe 10-20 logs/s.

What would we need to do to get more urgency behind improving the Kubernetes experience? I truly want to use Vector but can't.

@CharlieC3

CharlieC3 commented Nov 11, 2022

I'm in the same boat as @danthegoodman1. We're currently using Vector in most of our K8s clusters and I love what it brings to the table, but the lack of attention and priority to this specific issue is concerning.
What I've noticed is this bug seems to affect:

  1. Similarly named pods, like those created by StatefulSets, the most
  2. Pods running on a large node with more than a couple dozen neighboring pods, despite the Vector agent having more than enough resources

We're likely going to switch over to something else like fluent-bit for log collection until this issue is resolved.

@CharlieC3

CharlieC3 commented Dec 1, 2022

I was able to find some settings that are now giving me a much more reliable deployment on Kubernetes after doing the following:

  1. Update Vector to version 0.25.X. It doesn't seem to be mentioned in the release notes, but release 0.25.0 contains a very impactful fix for the kubernetes_logs source to help with annotations (relevant issue: kubernetes_logs source seems to fail to annotate when pod names are reused #13467)
  2. Add these settings to your kubernetes_logs source config:
    max_line_bytes: 16777216
    max_read_bytes: 16777216
    glob_minimum_cooldown_ms: 1000
    

The combination of these two changes has led to zero dropped/missing logs over the past couple of weeks. Previously I was using lower max_line_bytes and max_read_bytes values, as I've seen recommended above and elsewhere (still much higher than the defaults), but after so many attempts with little to no change I stopped increasing them. After revisiting it and trying something wildly large, to my surprise it worked!

I was still receiving annotation failures though. After deep diving into the commit history I saw that a seemingly unannounced workaround for #13467 was in place in the latest version, and since upgrading I haven't seen any annotation issues.

@zamazan4ik
Contributor

Wow, that's good to know. I hope to see comments from Vector dev team about the possible fixes for the issues.

@danthegoodman1
Author

I was able to find some settings that are now giving me a much more reliable deployment on Kubernetes after doing the following:

  1. Update Vector to version 0.25.X. It doesn't seem to be mentioned in the release notes, but release 0.25.0 contains a very impactful fix for the kubernetes_logs source to help with annotations (relevant issue: kubernetes_logs source seems to fail to annotate when pod names are reused #13467)
  2. Add these settings to your kubernetes_logs source config:
    max_line_bytes: 16777216
    max_read_bytes: 16777216
    glob_minimum_cooldown_ms: 1000
    

The combination of these two changes has led to zero dropped/missing logs over the past couple of weeks. Previously I was using lower max_line_bytes and max_read_bytes values, as I've seen recommended above and elsewhere (still much higher than the defaults), but after so many attempts with little to no change I stopped increasing them. After revisiting it and trying something wildly large, to my surprise it worked!

I was still receiving annotation failures though. After deep diving into the commit history I saw that a seemingly unannounced workaround for #13467 was in place in the latest version, and since upgrading I haven't seen any annotation issues.

Great! For us, the glob cooldown drop really did it.

@dm3ch

dm3ch commented Jan 17, 2023

Just realised that glob_minimum_cooldown_ms: 1000 could dramatically increase CPU resource usage.

In my case (145 nodes in the cluster), where I haven't yet hit the described issue (I copied this config snippet from another cluster where the issue existed), removing the glob_minimum_cooldown_ms: 1000 option allowed me to decrease total CPU usage (summed across all Vector pods) from 117 to 2 CPUs (avg per pod 0.4 CPU -> 0.01 CPU).

I am still checking that removing this parameter hasn't led to missing logs (because such a big change in resource consumption seems wrong to me).

@CharlieC3

Just realised that glob_minimum_cooldown_ms: 1000 could dramatically increase CPU resource usage.

@dm3ch I noticed this on my setup too, but in my scenario I think the "increased" CPU usage is just a result of the service actually processing logs :) Removing this setting results in very low CPU usage for me as well, but I was also receiving only about 10% of the logs I should have been.
With glob_minimum_cooldown_ms: 1000 and the higher CPU usage, I find that my resource usage more closely reflects the resource and capacity planning expectations laid out here, so I feel it's working as expected with this setting in my situation. Though yours may be different from mine.

@dm3ch

dm3ch commented Jan 17, 2023

@CharlieC3 Which version of the vector are you using?

@CharlieC3

@dm3ch I haven't needed to upgrade since my last comment, so I'm still running Docker image timberio/vector:0.25.1-distroless-libc.

@k24dizzle

Sharing my results before and after adding:

max_line_bytes: 16777216
max_read_bytes: 16777216
glob_minimum_cooldown_ms: 1000

to my kubernetes_logs configuration

[Screenshot: bytes ingested over time, before and after the change]

Instead of ingesting bytes in 1-minute bursts, it now seems to funnel the logs at a steady pace.

I will defer to people more knowledgeable to interpret why this happens.

@sumeet-zuora

sumeet-zuora commented Jan 30, 2024

Still seeing these errors with the above config changes:

2024-01-30T07:39:33.600554Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { inner: Inner { fields: Object({KeyString("file"): Bytes(b"/var/log/pods/revenue-reporting_pinot-standby-server-22_cc651b75-805d-4120-b5f2-98dd1cf89b11/server/0.log"), KeyString("message"): Bytes(b"2024-01-29T17:38:10.001557068Z stdout F     \"ID\" : \"2247389355\","), KeyString("source_type"): Bytes(b"kubernetes_logs")}), size_cache: AtomicCell { value: None }, json_encoded_size_cache: AtomicCell { value: Some(NonZeroJsonSize(JsonSize(225))) } }, metadata: EventMetadata { value: Object({}), secrets: {}, finalizers: EventFinalizers([]), source_id: None, source_type: None, upstream_id: None, schema_definition: Definition { event_kind: Kind { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), undefined: Some(()), array: Some(Collection { known: {}, unknown: Unknown(Infinite(Infinite { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), array: Some(()), object: Some(()) })) }), object: Some(Collection { known: {}, unknown: Unknown(Infinite(Infinite { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), array: Some(()), object: Some(()) })) }) }, metadata_kind: Kind { bytes: None, integer: None, float: None, boolean: None, timestamp: None, regex: None, null: None, undefined: None, array: None, object: Some(Collection { known: {}, unknown: Unknown(Infinite(Infinite { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), array: Some(()), object: Some(()) })) }) }, meaning: {}, log_namespaces: {Vector, Legacy} }, dropped_fields: {}, datadog_origin_metadata: None } }) error_code="annotation_failed" error_type="reader_failed" stage="processing" internal_log_rate_limit=true
2024-01-30T07:39:33.600617Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes_logs: Internal log [Failed to annotate event with pod metadata.] is being suppressed to avoid flooding.
2024-01-30T07:39:43.693441Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes_logs: Internal log [Failed to annotate event with pod metadata.] has been suppressed 19999 times.
2024-01-30T07:39:43.693455Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { inner: Inner { fields: Object({KeyString("file"): Bytes(b"/var/log/pods/revenue-reporting_pinot-standby-server-22_cc651b75-805d-4120-b5f2-98dd1cf89b11/server/0.log"), KeyString("message"): Bytes(b"2024-01-29T17:38:10.097262663Z stdout F \t... 7 more"), KeyString("source_type"): Bytes(b"kubernetes_logs")}), size_cache: AtomicCell { value: None }, json_encoded_size_cache: AtomicCell { value: Some(NonZeroJsonSize(JsonSize(212))) } }, metadata: EventMetadata { value: Object({}), secrets: {}, finalizers: EventFinalizers([]), source_id: None, source_type: None, upstream_id: None, schema_definition: Definition { event_kind: Kind { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), undefined: Some(()), array: Some(Collection { known: {}, unknown: Unknown(Infinite(Infinite { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), array: Some(()), object: Some(()) })) }), object: Some(Collection { known: {}, unknown: Unknown(Infinite(Infinite { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), array: Some(()), object: Some(()) })) }) }, metadata_kind: Kind { bytes: None, integer: None, float: None, boolean: None, timestamp: None, regex: None, null: None, undefined: None, array: None, object: Some(Collection { known: {}, unknown: Unknown(Infinite(Infinite { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), array: Some(()), object: Some(()) })) }) }, meaning: {}, log_namespaces: {Vector, Legacy} }, dropped_fields: {}, datadog_origin_metadata: None } }) error_code="annotation_failed" error_type="reader_failed" stage="processing" internal_log_rate_limit=true
2024-01-30T07:39:43.693518Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes_logs: Internal log [Failed to annotate event with pod metadata.] is being suppressed to avoid flooding.
    kubernetes_logs:
      type: kubernetes_logs
      timezone: local
      max_line_bytes: 16777216
      max_read_bytes: 16777216
      glob_minimum_cooldown_ms: 1000 

Any workaround would really help... we are using 0.35.0.

@danthegoodman1
Author

danthegoodman1 commented Feb 1, 2024

I can appreciate the severity of these issues and believe this area (the kubernetes_logs source) is something we'll be digging into again soon.

2 years ago... Doesn't feel remotely appreciated. I'm wondering if the Datadog acq. is to blame.

@danthegoodman1
Author

@sumeet-zuora seems like you have some other error that's causing logs to get caught up

@danthegoodman1
Author

This is still an issue
