Degradation of PVCs #19820
Unanswered
srstrickland asked this question in Q&A
Replies: 2 comments 7 replies
-
This has happened again. Here is the current pod situation:
And a smoking gun... on every one of the highlighted pods (and none of the others!), there are multiple
Every single one of the others looks something like this:
It seems possibly related to #19155
5 replies
-
@jszwedko this seems pretty similar to the issues we're facing as well. Is there any plan to make
2 replies
-
Hi!
We have been using PVC-backed data buffers, mostly because we were unsure about taking on the operational burden of keeping Kafka happy. We have since decided the operational flexibility is worth the effort, so we will be redesigning our system accordingly, but recently I've run into a more pressing problem with the PVC solution: it appears to become unstable over time.
Unfortunately, I don't have a ton of information here... there is nothing notable in the logs, and all I really see is CPU utilization going through the roof and discarded events being reported. If all I do is delete the pod, it comes back (with the same PVC) and continues its high CPU utilization and discarded-event errors. We're using HPA, so all it takes is one pod throwing off the average to trigger an HPA scale-out. Over time, it happens to more and more pods. My only recourse is to delete the pod and its corresponding PVC.
Recently, this happened to 6 of 20 pods (we only had 20 pods, up from the usual 8, because multiple pods started exhibiting this behavior). Here's what I saw:
I cleaned up the 6 indicated via this script:
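(The script itself didn't survive the paste. A minimal sketch of what such a cleanup could look like follows; the `vector-` pod names and the `data-<pod>` PVC naming convention are assumptions for illustration, not taken from the original post.)

```shell
#!/usr/bin/env bash
# Hypothetical cleanup: delete an affected pod together with its PVC so
# the pod comes back with a fresh volume. All names are illustrative.
set -eu

pvc_for_pod() {
  # Assumed StatefulSet convention: PVC name is "<claim-template>-<pod>".
  echo "data-$1"
}

cleanup_pod() {
  local pod=$1
  kubectl delete pod "$pod" --wait=false
  kubectl delete pvc "$(pvc_for_pod "$pod")"
}

# Usage: ./cleanup.sh vector-3 vector-7 ...
for pod in "$@"; do
  cleanup_pod "$pod"
done
```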
The chaos stopped, and HPA scaled back down with 8 pods hovering around 800m CPU.
The PVC data has to be the culprit here... I just don't know how to diagnose this issue. I've saved the
/vector-data-dir
from one of the instances, but I'm not sure how to go about troubleshooting it. Before we get to the Kafka-based solution (which we're working on), I'm wondering whether it's worth applying some workaround:
rm -rf /vector-data-dir/*
on "clean" shutdown, and possibly doing regular rolling restarts. Any guidance here would be great!
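(For the shutdown-wipe idea, a minimal sketch of where that could be wired in, assuming a standard Vector container spec; the container name and path are assumptions. Note this discards any events still sitting in the buffer, so it only makes sense if at-least-once delivery from the buffer isn't required:)

```yaml
# Hypothetical preStop hook: wipe the buffer directory as the pod shuts
# down, so a restarted pod starts with a clean data dir.
containers:
  - name: vector
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "rm -rf /vector-data-dir/*"]
```

One caveat: preStop hooks run before the container receives SIGTERM, so Vector would still be running while the directory is wiped. An init container that clears the directory on startup may be a safer place for the same workaround.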