
Exporter fails to recover from log after failover #4350

Closed
menski opened this issue Apr 22, 2020 · 7 comments · Fixed by #4374
Assignees
Labels
kind/bug Categorizes an issue or PR as a bug scope/broker Marks an issue or PR to appear in the broker section of the changelog

Comments

menski commented Apr 22, 2020

Describe the bug

A three-node cluster with a single partition and replication factor 3 experienced multiple failovers. It ended up in a state where broker 3 was leader for the partition but the exporter failed to recover, so no further events were exported.

To Reproduce

  • Not sure how it happened

Expected behavior

The exporter should be able to recover. In the rare case that this is not possible, the broker should step down so that another broker with a working exporter can hopefully take over.

Log/Stacktrace

Full Stacktrace

2020-04-22 10:53:45.115 [Broker-2-Exporter-1] [Broker-2-zb-fs-workers-0] ERROR io.zeebe.util.actor - Uncaught exception in 'Broker-2-Exporter-1' in phase 'STARTED'. Continuing with next job.
java.lang.IllegalStateException: Invalid address to read from 984020
	at io.zeebe.logstreams.impl.log.LogStreamReaderImpl.executeReadMethod(LogStreamReaderImpl.java:201) ~[zeebe-logstreams-0.23.0.jar:0.23.0]
	at io.zeebe.logstreams.impl.log.LogStreamReaderImpl.readBlockIntoBuffer(LogStreamReaderImpl.java:192) ~[zeebe-logstreams-0.23.0.jar:0.23.0]
	at io.zeebe.logstreams.impl.log.LogStreamReaderImpl.readNextAddress(LogStreamReaderImpl.java:225) ~[zeebe-logstreams-0.23.0.jar:0.23.0]
	at io.zeebe.logstreams.impl.log.LogStreamReaderImpl.readNextEvent(LogStreamReaderImpl.java:243) ~[zeebe-logstreams-0.23.0.jar:0.23.0]
	at io.zeebe.logstreams.impl.log.LogStreamReaderImpl.next(LogStreamReaderImpl.java:176) ~[zeebe-logstreams-0.23.0.jar:0.23.0]
	at io.zeebe.logstreams.impl.log.LogStreamReaderImpl.next(LogStreamReaderImpl.java:20) ~[zeebe-logstreams-0.23.0.jar:0.23.0]
	at io.zeebe.broker.exporter.stream.ExporterDirector.readNextEvent(ExporterDirector.java:242) ~[zeebe-broker-0.23.0.jar:0.23.0]
	at io.zeebe.util.sched.ActorJob.invoke(ActorJob.java:73) ~[zeebe-util-0.23.0.jar:0.23.0]
	at io.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-0.23.0.jar:0.23.0]
	at io.zeebe.util.sched.ActorTask.execute(ActorTask.java:115) [zeebe-util-0.23.0.jar:0.23.0]
	at io.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:107) [zeebe-util-0.23.0.jar:0.23.0]
	at io.zeebe.util.sched.ActorThread.doWork(ActorThread.java:91) [zeebe-util-0.23.0.jar:0.23.0]
	at io.zeebe.util.sched.ActorThread.run(ActorThread.java:195) [zeebe-util-0.23.0.jar:0.23.0]

Full Log File: https://drive.google.com/open?id=1UVq2qRfxTg18EiyFM7Xp3X10_6CDQyjx
Data folder: https://drive.google.com/open?id=1jfY2Kbs80PTSoFDt7d4ktEa4xbWsNyvo

Environment:

  • OS: Linux on k8s (Camunda Cloud)
  • Zeebe Version: 0.23.0
Configuration:

{
  "network" : {
    "host" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local",
    "portOffset" : 0,
    "maxMessageSize" : "4MB",
    "advertisedHost" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local",
    "commandApi" : {
      "host" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local",
      "port" : 26501,
      "advertisedHost" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local",
      "advertisedPort" : 26501,
      "advertisedAddress" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local:26501",
      "address" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local:26501"
    },
    "internalApi" : {
      "host" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local",
      "port" : 26502,
      "advertisedHost" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local",
      "advertisedPort" : 26502,
      "advertisedAddress" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local:26502",
      "address" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local:26502"
    },
    "monitoringApi" : {
      "host" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local",
      "port" : 9600,
      "advertisedHost" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local",
      "advertisedPort" : 9600,
      "advertisedAddress" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local:9600",
      "address" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local:9600"
    },
    "maxMessageSizeInBytes" : 4194304
  },
  "cluster" : {
    "initialContactPoints" : [ "zeebe-0.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local:26502", "zeebe-1.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local:26502", "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local:26502" ],
    "partitionIds" : [ 1 ],
    "nodeId" : 2,
    "partitionsCount" : 1,
    "replicationFactor" : 3,
    "clusterSize" : 3,
    "clusterName" : "88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe",
    "gossipFailureTimeout" : 10000,
    "gossipInterval" : 250,
    "gossipProbeInterval" : 1000
  },
  "threads" : {
    "cpuThreadCount" : 1,
    "ioThreadCount" : 2
  },
  "data" : {
    "directories" : [ "/usr/local/zeebe/data" ],
    "logSegmentSize" : "512MB",
    "snapshotPeriod" : "PT5M",
    "logIndexDensity" : 100,
    "logSegmentSizeInBytes" : 536870912
  },
  "exporters" : {
    "elasticsearch" : {
      "jarPath" : null,
      "className" : "io.zeebe.exporter.ElasticsearchExporter",
      "args" : {
        "url" : "http://elasticsearch:9200",
        "bulk" : {
          "delay" : 5,
          "size" : 1000
        },
        "index" : {
          "prefix" : "zeebe-record",
          "createTemplate" : true,
          "command" : false,
          "event" : true,
          "rejection" : false,
          "deployment" : true,
          "incident" : true,
          "job" : true,
          "message" : false,
          "messageSubscription" : false,
          "raft" : false,
          "workflowInstance" : true,
          "workflowInstanceSubscription" : false
        }
      },
      "external" : false
    }
  },
  "gateway" : {
    "network" : {
      "host" : "0.0.0.0",
      "port" : 26500,
      "minKeepAliveInterval" : "PT30S"
    },
    "cluster" : {
      "contactPoint" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local:26502",
      "requestTimeout" : "PT15S",
      "clusterName" : "zeebe-cluster",
      "memberId" : "gateway",
      "host" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local",
      "port" : 26502
    },
    "threads" : {
      "managementThreads" : 1
    },
    "monitoring" : {
      "enabled" : true,
      "host" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local",
      "port" : 9600
    },
    "security" : {
      "enabled" : false,
      "certificateChainPath" : null,
      "privateKeyPath" : null
    },
    "enable" : true
  },
  "backpressure" : {
    "enabled" : true,
    "algorithm" : "VEGAS"
  },
  "stepTimeout" : "PT5M"
}

@menski menski added the kind/bug Categorizes an issue or PR as a bug label Apr 22, 2020
menski commented Apr 22, 2020

Saw this also on another cluster, log:
zeebe.log.gz

menski commented Apr 22, 2020

On another cluster with 3 brokers and 4 partitions (replication: 3)
zeebe-0.logs.gz
zeebe-2.logs.gz

menski commented Apr 22, 2020

Also on a 6 broker 8 partition (replication: 3) cluster:
zeebe-0.log.gz
zeebe-1.log.gz
zeebe-2.log.gz
zeebe-4.log.gz

@npepinpe
Member

Right off the bat, mega ultra bug: the first index of your log (as posted above) is 1033737, which is way higher than 984020; the two are almost 50k entries apart.

Now we know that Atomix isn't great with indexes (and in fact we plan on fixing that this quarter): essentially it does not write the indexes into the log, it just "counts". My guess is that it miscounts and returns a bad index, since we store the position and fetch the index from it. I'll dig deeper into it.
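[Editor's note] To illustrate the counting hazard described above, here is a minimal, purely illustrative Java sketch (not Atomix code; all names are hypothetical). If the first readable index moves forward on compaction while a reader still holds an address computed against the old base, the read lands outside the valid range, which matches the "Invalid address to read from 984020" failure:

```java
// Illustrative only: why index-by-counting is fragile across compaction.
// Hypothetical names; this is not the Atomix/Zeebe implementation.
final class CountingReader {
    private long firstIndex;      // first index still present in the log
    private final long lastIndex; // last written index

    CountingReader(long firstIndex, long lastIndex) {
        this.firstIndex = firstIndex;
        this.lastIndex = lastIndex;
    }

    // Compaction advances the first readable index.
    void compactUpTo(long index) {
        firstIndex = Math.max(firstIndex, index);
    }

    // A read is only valid within [firstIndex, lastIndex].
    boolean canRead(long index) {
        return index >= firstIndex && index <= lastIndex;
    }
}
```

With the numbers from the log above: after compacting up to 1033737, an attempt to read the stale address 984020 is invalid.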

@npepinpe npepinpe self-assigned this Apr 22, 2020
@npepinpe npepinpe added Status: In Progress scope/broker Marks an issue or PR to appear in the broker section of the changelog labels Apr 22, 2020
npepinpe commented Apr 22, 2020

So the log looks fine, and to me this doesn't look like an issue with failover per se. What I can see is:

2020-04-22 10:53:43.936 [] [raft-server-2-raft-partition-partition-1] DEBUG io.zeebe.broker.clustering.atomix.ZeebeRaftStateMachine - Compacting up to index 1033737
...
2020-04-22 10:53:45.115 [Broker-2-Exporter-1] [Broker-2-zb-fs-workers-0] ERROR io.zeebe.util.actor - Uncaught exception in 'Broker-2-Exporter-1' in phase 'STARTED'. Continuing with next job. java.lang.IllegalStateException: Invalid address to read from 984020

So it looks like we compacted while the reader was still behind. How is this possible? Shouldn't we only compact after we've exported (meaning the reader would be ahead)?

/cc @deepthidevaki since you fixed issues with readers thread safety recently
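[Editor's note] The fix merged in #4374 goes in this direction: the compactable index must be bounded by the positions the exporters have already confirmed. A minimal Java sketch of that invariant, with hypothetical names (this is not Zeebe's actual API):

```java
import java.util.Map;

// Hypothetical sketch of the invariant behind #4374: the log may only be
// compacted up to the smallest position that every exporter and the stream
// processor have already confirmed, so no reader is ever left behind.
final class CompactionBound {

    static long safeCompactionPosition(Map<String, Long> exporterPositions,
                                       long processedPosition) {
        long min = processedPosition;
        for (long exported : exporterPositions.values()) {
            min = Math.min(min, exported);
        }
        return min;
    }
}
```

With the values from this issue, a lagging exporter at 984020 would cap compaction there, instead of compaction running ahead to 1033737.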

@npepinpe npepinpe assigned deepthidevaki and unassigned npepinpe Apr 22, 2020
menski commented Apr 22, 2020

BTW, I think the only interesting setting that might be related is the snapshot period, which I set to 5 minutes ("snapshotPeriod" : "PT5M").

menski commented Apr 23, 2020

zeebe-bors bot added a commit that referenced this issue Apr 24, 2020
4374: fix(broker): use exporter positions to calculate compactable index in snapshots r=npepinpe a=npepinpe

## Description

- use concrete entry supplier in AtomixSnapshotStorageTest instead of mocks
- use ZeebeIndexMapping and RaftLogReader directly to supply the correct Atomix entry for snapshot metadata
- do not create a snapshot if a snapshot with that index already exists
- add context map to stackdriver logs
- update exported position in RecordingExporter
- ensure log density is 1 in QA tests that require snapshotting with low synthetic loads
- include partition in ZeebeRaftStateMachine log context

## Related issues

closes #4350 


Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
@zeebe-bors zeebe-bors bot closed this as completed in 47d5a81 Apr 28, 2020