sink(loki): entry out of order #6651
Comments
I'm not a maintainer or developer of Vector, but I can share my own experience. IMO this is not a Vector issue but a Loki shortcoming (grafana/loki#1544). It's just not ready for production... not now. I've tried myriad configuration variants with several log collectors (promtail, fluent-bit, vector) with no luck. And it can't work in theory, because some log sources are out of order in themselves (i.e. all HTTP access logs) and there is concurrent log emission for the same labels from different hosts (i.e. k8s-based labels). No combination of stream labels or other hacks can fix this. This is confirmed by the Grafana/Loki team itself; the same issues are raised for Promtail regularly. IMO Vector is simply the best log collector for use with Loki because of:
The Loki-native collector Promtail gave me far worse results. But even all of this can't 'fix' the design flaw... as usual. So I'd suggest waiting for a Loki fix and not trying to 'fix' (complicate) the log collectors.
Hi, this might be the issue you are running into: many Vector pods racing with each other (#3057 (comment)). This could be mitigated by adding the pod's ID to the label set, but personally I don't think this is a good solution because the pod's ID isn't a meaningful value to include in the label set.
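For reference, a minimal sketch of that mitigation (not an endorsement), assuming the pod name is exposed to the container as a POD_NAME environment variable via the Kubernetes downward API; the component names and label values here are placeholders:

sinks:
  loki_out:
    type: loki
    inputs:
      - k8s_parse_remap            # hypothetical upstream transform
    endpoint: 'http://<loki_remote_url>:3100'
    encoding:
      codec: json
    labels:
      job: k8s-logs
      # One stream per pod: avoids cross-pod races at the cost of
      # the label-cardinality concern discussed in this thread.
      pod: "${POD_NAME}"

This keeps each Vector instance writing to its own stream, but as noted, it inflates stream cardinality.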
This is exactly what I mentioned above. Actually, it is bad practice for Loki because of the high cardinality of stream labels. AND with the recommended (carefully crafted) set of labels, such concurrency (racing) is normal in a multi-node environment. So... Loki's recommended approach just doesn't work in real workloads. Plus we have out-of-order sources (access logs, highly concurrent loggers in middleware) that can't (and won't) be 'fixed'. So stream labels can't help at all. Another ugly workaround is to buffer logs on a Vector aggregator and then push them to Loki... Fortunately, there is noticeable progress on the issue from the Loki side.
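A rough sketch of that aggregator workaround, assuming a two-tier deployment where agents forward to a single aggregator over Vector's vector source/sink pair; the addresses and component names are placeholders and option names differ between Vector versions, so treat this as an outline only:

# agent (DaemonSet) configuration
sinks:
  to_aggregator:
    type: vector
    inputs:
      - k8s_logs                   # hypothetical local source
    address: 'vector-aggregator.logging.svc:9000'

# aggregator configuration (separate config file)
sources:
  from_agents:
    type: vector
    address: '0.0.0.0:9000'
sinks:
  loki_out:
    type: loki
    inputs:
      - from_agents
    endpoint: 'http://<loki_remote_url>:3100'
    encoding:
      codec: json
    labels:
      job: aggregated-logs

Funneling all streams through one writer removes the multi-writer race, but the aggregator becomes a single ordering point and does not fix sources that are out of order at the origin.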
I'm not sure it's just Loki that is at fault here. I've been using fluent-bit with Loki for a year and I've had no problems at all. Same configuration (running as a DaemonSet). Admittedly, I am using the Helm chart provided directly by Grafana: https://github.com/grafana/helm-charts/tree/main/charts/fluent-bit
It is a Loki design decision, not a fault. That design is based on the assumption that logs within one stream will be ordered by timestamp, which is logical. On the other hand, Loki recommends (practically, requires) low-cardinality labels for streams, and that is logical too. That's fine, but in the real world things like latency, outages, and load exist, and they break Loki's current design. We can't... unfortunately. Even middleware apps (Java/Log4j2) can generate logs that are out of order within 1 millisecond. HAProxy access logs are all out of order (by design). So we can either drop out-of-order logs (a workaround our system is very sensitive to) OR generate a synthetic timestamp (with Vector or Logtail) and lose the sequence during analysis. It is very frustrating that the incoming proxy log ends up last in the request processing sequence. With ELK or Graylog there is no such problem.
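If anyone wants to try the synthetic-timestamp route with Vector, a minimal sketch of a remap transform that overwrites the event timestamp at processing time while keeping the original for later analysis; the component names are placeholders:

transforms:
  force_ingest_timestamp:
    type: remap
    inputs:
      - haproxy_access_logs        # hypothetical upstream source
    source: |
      # Preserve the original timestamp as a regular field, then
      # replace .timestamp with the time Vector processed the event.
      .original_timestamp = .timestamp
      .timestamp = now()

This makes the entries Vector emits acceptable to Loki's ordering check, at the price described above: the real request sequence has to be reconstructed from .original_timestamp.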
Another user ran into this in discord even though it seems like they shouldn't: https://discord.com/channels/742820443487993987/746070591097798688/821406012123775047

sinks:
  nginx_output_loki:
    type: loki
    inputs:
      - nginx_parse_remap
    endpoint: 'http://<loki_remote_url>:3100'
    encoding:
      codec: json
    healthcheck: true
    labels:
      job: msp-nginx-access-log
      host: "${HOSTNAME}"
      remote_user: '{{ remote_user }}'
      request_uri: '{{ request_uri }}'
      status: '{{ status }}'
Here they are using vector 0.12.1, which should include the fix to limit concurrency to 1 for this sink, and the config includes a unique label (HOSTNAME) per vector instance so I would expect the streams to be partitioned by vector instance. They did note that they only hit this issue when restarting Loki, during which it is unavailable for a short period. It is possible that vector isn't replaying the batched events in the right order or there is some other interaction with Loki during a restart event that might cause this.
There appears to be an issue where Vector needs multiple loops just to start, as the very first request doesn't get out-of-order protection (due to it being in-memory). This is the case even if you have re-order fixing.
Perhaps Vector could query for an appropriate last-seen timestamp before starting the processing loop in the loki sink?
Noting a user in discord ran into this again: https://discord.com/channels/742820443487993987/746070591097798688/872540348231073803. In their case, it did seem related to retries not being ordered.
Regarding retries, they can become unordered. The reason for that is the ordering of the retry-loop layer and the concurrency-management layer, where the retry layer sits above the concurrency layer. So when a request failed, it exited the concurrency layer and stopped being counted against the limit, and while the retry layer was figuring out what to do, another request could be pushed through in front of the retrying request. By pure chance this is being fixed by #8615, which enforces ordering in a layer above the retry layer. I'll push a draft with a test for this, which will wait for #8615 to be merged. But I'm still investigating whether that explains everything reported here.
That's true, but I wasn't able to reproduce
but it does fail sending the event and
@splitice did you mean an error or does Vector really crash? If so, could you provide debug output with such a crash?
Good idea, that would help handle this case; I'll open a separate issue.
Hi,
I've read #3057, but it looks like it was closed, and I am still seeing this issue.
Vector Version
Vector Configuration File
Debug Output
Loki
Vector
Expected Behavior
It should send logs to loki without error.
Actual Behavior
It doesn't send logs to loki.
Example Data
The full configuration can be found here: https://github.com/uhthomas/automata/tree/048ca6f97e2f9c15ab336bba3ffe07ae70fb6422/clusters/desire/vector
Additional Context
Per the full config mentioned earlier, Vector is running as a DaemonSet in Kubernetes. It's using many of the values from https://github.com/timberio/vector/blob/master/distribution/kubernetes/vector-agent/resources.yaml
References