sink(loki): entry out of order #6651
Comments
I'm not a maintainer or developer of Vector, but I can share my own experience. IMO this is not a Vector issue but a Loki shortcoming (grafana/loki#1544). It's just not ready for production... not now. I've tried myriad configuration variants with several log collectors (promtail, fluent-bit, vector) with no luck. And it can't work in theory, because some log sources are out of order in themselves (i.e. all HTTP access logs) and there is concurrent log emission for the same labels from different hosts (i.e. k8s-based labels). No combination of stream labels or other hacks can fix this. This is confirmed by the Grafana/Loki team itself; the same issues are raised for Promtail regularly. IMO Vector is simply the best log collector for use with Loki because of:
The Loki-native collector Promtail gave me far worse results. But even all of this can't 'fix' the design flaw... as usual. So I'd suggest waiting for a Loki fix and not trying to 'fix' (complicate) the log collectors.
Hi, this might be the issue you are running into: many Vector pods racing with each other (#3057 (comment)). This could be mitigated by adding the pod's ID to the label set, but personally I don't think this is a good solution because the pod's ID isn't a meaningful value to include in the label set.
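For reference, a minimal sketch of that mitigation (not an endorsement), assuming the pod name is exposed to the container as a POD_NAME environment variable via the Kubernetes downward API; the component names and label values here are placeholders:

sinks:
  loki_out:
    type: loki
    inputs:
      - k8s_parse_remap            # hypothetical upstream transform
    endpoint: 'http://<loki_remote_url>:3100'
    encoding:
      codec: json
    labels:
      job: k8s-logs
      # One stream per pod: avoids cross-pod races at the cost of
      # the label-cardinality concern discussed in this thread.
      pod: "${POD_NAME}"

This keeps each Vector instance writing to its own stream, but as noted, it inflates stream cardinality.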
This is exactly what I mentioned above. Actually, it is bad practice for Loki because of the high cardinality of stream labels. AND with the recommended (carefully crafted) set of labels, such concurrency (racing) is normal in a multi-node environment. So... Loki's recommended approach just doesn't work in real workloads. Plus we have out-of-order sources (access logs, highly concurrent loggers in middleware) that can't (and won't) be 'fixed'. So stream labels can't help at all. Another ugly workaround is to buffer logs on a Vector aggregator and then push them to Loki... Fortunately, there is noticeable progress on the issue from the Loki side.
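A rough sketch of that aggregator workaround, assuming a two-tier deployment where agents forward to a single aggregator over Vector's vector source/sink pair; the addresses and component names are placeholders and option names differ between Vector versions, so treat this as an outline only:

# agent (DaemonSet) configuration
sinks:
  to_aggregator:
    type: vector
    inputs:
      - k8s_logs                   # hypothetical local source
    address: 'vector-aggregator.logging.svc:9000'

# aggregator configuration (separate config file)
sources:
  from_agents:
    type: vector
    address: '0.0.0.0:9000'
sinks:
  loki_out:
    type: loki
    inputs:
      - from_agents
    endpoint: 'http://<loki_remote_url>:3100'
    encoding:
      codec: json
    labels:
      job: aggregated-logs

Funneling all streams through one writer removes the multi-writer race, but the aggregator becomes a single ordering point and does not fix sources that are out of order at the origin.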
I'm not sure it's just Loki that is at fault here. I've been using fluent-bit with Loki for a year and I've had no problems at all. Same configuration (running as a DaemonSet). Admittedly, I am using the Helm chart provided directly by Grafana: https://github.com/grafana/helm-charts/tree/main/charts/fluent-bit
It is a Loki design decision, not a fault. That design is based on the assumption that logs within one stream will be ordered by timestamp, which is logical. On the other hand, Loki recommends (practically, requires) low-cardinality labels for streams, and that is logical too. That's fine, but in the real world things like latency, outages, and load exist, and they break Loki's current design. We can't... unfortunately. Even middleware apps (Java/Log4j2) can generate logs that are out of order within 1 millisecond. HAProxy access logs are all out of order (by design). So we can either drop out-of-order logs (a workaround our system is very sensitive to) OR generate a synthetic timestamp (with Vector or Logtail) and lose the sequence during analysis. It is very frustrating that the incoming proxy log ends up last in the request processing sequence. With ELK or Graylog there is no such problem.
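If anyone wants to try the synthetic-timestamp route with Vector, a minimal sketch of a remap transform that overwrites the event timestamp at processing time while keeping the original for later analysis; the component names are placeholders:

transforms:
  force_ingest_timestamp:
    type: remap
    inputs:
      - haproxy_access_logs        # hypothetical upstream source
    source: |
      # Preserve the original timestamp as a regular field, then
      # replace .timestamp with the time Vector processed the event.
      .original_timestamp = .timestamp
      .timestamp = now()

This makes the entries Vector emits acceptable to Loki's ordering check, at the price described above: the real request sequence has to be reconstructed from .original_timestamp.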
Another user ran into this in discord even though it seems like they shouldn't: https://discord.com/channels/742820443487993987/746070591097798688/821406012123775047

sinks:
  nginx_output_loki:
    type: loki
    inputs:
      - nginx_parse_remap
    endpoint: 'http://<loki_remote_url>:3100'
    encoding:
      codec: json
    healthcheck: true
    labels:
      job: msp-nginx-access-log
      host: "${HOSTNAME}"
      remote_user: '{{ remote_user }}'
      request_uri: '{{ request_uri }}'
      status: '{{ status }}'
Here they are using vector 0.12.1, which should include the fix to limit concurrency to 1 for this sink, and the config includes a unique label (HOSTNAME) per vector instance so I would expect the streams to be partitioned by vector instance. They did note that they only hit this issue when restarting Loki, during which it is unavailable for a short period. It is possible that vector isn't replaying the batched events in the right order or there is some other interaction with Loki during a restart event that might cause this.
There appears to be an issue where Vector needs multiple loops just to start, as the very first request doesn't get out-of-order protection (due to it being in-memory). This is the case even if you have re-order fixing.
Perhaps Vector could query for an appropriate last-seen timestamp before starting the processing loop in the loki sink?
Noting a user in discord ran into this again: https://discord.com/channels/742820443487993987/746070591097798688/872540348231073803. In their case, it did seem related to retries not being ordered.
Regarding retries, they can become unordered. The reason for that is the ordering of the retry-loop layer and the concurrency-management layer, where the retry layer sits above the concurrency layer. So when a request failed, it exited the concurrency layer and stopped being counted against the limit, and while the retry layer was figuring out what to do, another request could be pushed through in front of the retrying request. By pure chance this is being fixed by #8615, which enforces ordering in a layer above the retry layer. I'll push a draft with a test for this, which will wait for #8615 to be merged. But I'm still investigating whether that explains everything reported here.
That's true, but I wasn't able to reproduce
but it does fail sending the event and
@splitice did you mean an error or does Vector really crash? If so, could you provide debug output with such a crash?
Good idea, that would help handle this case; I'll open a separate issue.
Hi,
I've read #3057, but it looks like it was closed, and I am still seeing this issue.
Vector Version
Vector Configuration File
Debug Output
Loki
Vector
Expected Behavior
It should send logs to loki without error.
Actual Behavior
It doesn't send logs to loki.
Example Data
The full configuration can be found here: https://github.com/uhthomas/automata/tree/048ca6f97e2f9c15ab336bba3ffe07ae70fb6422/clusters/desire/vector
Additional Context
Per the full config mentioned earlier, Vector is running as a DaemonSet in Kubernetes. It's using many of the values from https://github.com/timberio/vector/blob/master/distribution/kubernetes/vector-agent/resources.yaml
References