
Exporter fails to recover from log after failover #4350

Closed
menski opened this issue Apr 22, 2020 · 7 comments · Fixed by #4374
Assignees
Labels
kind/bug Categorizes an issue or PR as a bug scope/broker Marks an issue or PR to appear in the broker section of the changelog

Comments

menski commented Apr 22, 2020

Describe the bug

A three-node cluster with a single partition and replication factor 3 experienced multiple failovers. It ended up in a state where broker 3 was leader for the partition but the exporter failed to recover, so no further events were exported.

To Reproduce

  • Not sure how it happened

Expected behavior

The exporter should be able to recover. In the rare case that this is not possible, the broker should step down so that another broker with a working exporter can hopefully take over.

Log/Stacktrace

Full Stacktrace

2020-04-22 10:53:45.115 [Broker-2-Exporter-1] [Broker-2-zb-fs-workers-0] ERROR io.zeebe.util.actor - Uncaught exception in 'Broker-2-Exporter-1' in phase 'STARTED'. Continuing with next job.
java.lang.IllegalStateException: Invalid address to read from 984020
	at io.zeebe.logstreams.impl.log.LogStreamReaderImpl.executeReadMethod(LogStreamReaderImpl.java:201) ~[zeebe-logstreams-0.23.0.jar:0.23.0]
	at io.zeebe.logstreams.impl.log.LogStreamReaderImpl.readBlockIntoBuffer(LogStreamReaderImpl.java:192) ~[zeebe-logstreams-0.23.0.jar:0.23.0]
	at io.zeebe.logstreams.impl.log.LogStreamReaderImpl.readNextAddress(LogStreamReaderImpl.java:225) ~[zeebe-logstreams-0.23.0.jar:0.23.0]
	at io.zeebe.logstreams.impl.log.LogStreamReaderImpl.readNextEvent(LogStreamReaderImpl.java:243) ~[zeebe-logstreams-0.23.0.jar:0.23.0]
	at io.zeebe.logstreams.impl.log.LogStreamReaderImpl.next(LogStreamReaderImpl.java:176) ~[zeebe-logstreams-0.23.0.jar:0.23.0]
	at io.zeebe.logstreams.impl.log.LogStreamReaderImpl.next(LogStreamReaderImpl.java:20) ~[zeebe-logstreams-0.23.0.jar:0.23.0]
	at io.zeebe.broker.exporter.stream.ExporterDirector.readNextEvent(ExporterDirector.java:242) ~[zeebe-broker-0.23.0.jar:0.23.0]
	at io.zeebe.util.sched.ActorJob.invoke(ActorJob.java:73) ~[zeebe-util-0.23.0.jar:0.23.0]
	at io.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-0.23.0.jar:0.23.0]
	at io.zeebe.util.sched.ActorTask.execute(ActorTask.java:115) [zeebe-util-0.23.0.jar:0.23.0]
	at io.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:107) [zeebe-util-0.23.0.jar:0.23.0]
	at io.zeebe.util.sched.ActorThread.doWork(ActorThread.java:91) [zeebe-util-0.23.0.jar:0.23.0]
	at io.zeebe.util.sched.ActorThread.run(ActorThread.java:195) [zeebe-util-0.23.0.jar:0.23.0]

Full Log File: https://drive.google.com/open?id=1UVq2qRfxTg18EiyFM7Xp3X10_6CDQyjx
Data folder: https://drive.google.com/open?id=1jfY2Kbs80PTSoFDt7d4ktEa4xbWsNyvo

Environment:

  • OS: Linux on k8s (Camunda Cloud)
  • Zeebe Version: 0.23.0
Configuration:

{
  "network" : {
    "host" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local",
    "portOffset" : 0,
    "maxMessageSize" : "4MB",
    "advertisedHost" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local",
    "commandApi" : {
      "host" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local",
      "port" : 26501,
      "advertisedHost" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local",
      "advertisedPort" : 26501,
      "advertisedAddress" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local:26501",
      "address" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local:26501"
    },
    "internalApi" : {
      "host" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local",
      "port" : 26502,
      "advertisedHost" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local",
      "advertisedPort" : 26502,
      "advertisedAddress" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local:26502",
      "address" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local:26502"
    },
    "monitoringApi" : {
      "host" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local",
      "port" : 9600,
      "advertisedHost" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local",
      "advertisedPort" : 9600,
      "advertisedAddress" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local:9600",
      "address" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local:9600"
    },
    "maxMessageSizeInBytes" : 4194304
  },
  "cluster" : {
    "initialContactPoints" : [ "zeebe-0.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local:26502", "zeebe-1.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local:26502", "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local:26502" ],
    "partitionIds" : [ 1 ],
    "nodeId" : 2,
    "partitionsCount" : 1,
    "replicationFactor" : 3,
    "clusterSize" : 3,
    "clusterName" : "88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe",
    "gossipFailureTimeout" : 10000,
    "gossipInterval" : 250,
    "gossipProbeInterval" : 1000
  },
  "threads" : {
    "cpuThreadCount" : 1,
    "ioThreadCount" : 2
  },
  "data" : {
    "directories" : [ "/usr/local/zeebe/data" ],
    "logSegmentSize" : "512MB",
    "snapshotPeriod" : "PT5M",
    "logIndexDensity" : 100,
    "logSegmentSizeInBytes" : 536870912
  },
  "exporters" : {
    "elasticsearch" : {
      "jarPath" : null,
      "className" : "io.zeebe.exporter.ElasticsearchExporter",
      "args" : {
        "url" : "http://elasticsearch:9200",
        "bulk" : {
          "delay" : 5,
          "size" : 1000
        },
        "index" : {
          "prefix" : "zeebe-record",
          "createTemplate" : true,
          "command" : false,
          "event" : true,
          "rejection" : false,
          "deployment" : true,
          "incident" : true,
          "job" : true,
          "message" : false,
          "messageSubscription" : false,
          "raft" : false,
          "workflowInstance" : true,
          "workflowInstanceSubscription" : false
        }
      },
      "external" : false
    }
  },
  "gateway" : {
    "network" : {
      "host" : "0.0.0.0",
      "port" : 26500,
      "minKeepAliveInterval" : "PT30S"
    },
    "cluster" : {
      "contactPoint" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local:26502",
      "requestTimeout" : "PT15S",
      "clusterName" : "zeebe-cluster",
      "memberId" : "gateway",
      "host" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local",
      "port" : 26502
    },
    "threads" : {
      "managementThreads" : 1
    },
    "monitoring" : {
      "enabled" : true,
      "host" : "zeebe-2.zeebe-broker-service.88ef03b9-92f0-4284-bc93-ffa6c3213da5-zeebe.svc.cluster.local",
      "port" : 9600
    },
    "security" : {
      "enabled" : false,
      "certificateChainPath" : null,
      "privateKeyPath" : null
    },
    "enable" : true
  },
  "backpressure" : {
    "enabled" : true,
    "algorithm" : "VEGAS"
  },
  "stepTimeout" : "PT5M"
}

@menski menski added the kind/bug Categorizes an issue or PR as a bug label Apr 22, 2020
menski commented Apr 22, 2020

Saw this also on another cluster, log:
zeebe.log.gz

menski commented Apr 22, 2020

On another cluster with 3 brokers and 4 partitions (replication: 3)
zeebe-0.logs.gz
zeebe-2.logs.gz

menski commented Apr 22, 2020

Also on a 6 broker 8 partition (replication: 3) cluster:
zeebe-0.log.gz
zeebe-1.log.gz
zeebe-2.log.gz
zeebe-4.log.gz

@npepinpe
Member

Right off the bat, mega ultra bug: the first index of your log (as posted above) is 1033737, which is way higher than 984020; the two are almost 50k entries apart.

Now we know that Atomix isn't great with indexes (and in fact we plan on fixing that this quarter): essentially it does not write the indexes into the log, it just "counts". My guess is that it miscounts and returns a bad index, since we store the position and fetch the index from it. I'll dig deeper into it.
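[Editor's note] To illustrate the counting hazard described above, here is a minimal, purely illustrative Java sketch (not Atomix code; all names are hypothetical). If the first readable index moves forward on compaction while a reader still holds an address computed against the old base, the read lands outside the valid range, which matches the "Invalid address to read from 984020" failure:

```java
// Illustrative only: why index-by-counting is fragile across compaction.
// Hypothetical names; this is not the Atomix/Zeebe implementation.
final class CountingReader {
    private long firstIndex;      // first index still present in the log
    private final long lastIndex; // last written index

    CountingReader(long firstIndex, long lastIndex) {
        this.firstIndex = firstIndex;
        this.lastIndex = lastIndex;
    }

    // Compaction advances the first readable index.
    void compactUpTo(long index) {
        firstIndex = Math.max(firstIndex, index);
    }

    // A read is only valid within [firstIndex, lastIndex].
    boolean canRead(long index) {
        return index >= firstIndex && index <= lastIndex;
    }
}
```

With the numbers from the log above: after compacting up to 1033737, an attempt to read the stale address 984020 is invalid.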

@npepinpe npepinpe self-assigned this Apr 22, 2020
@npepinpe npepinpe added Status: In Progress scope/broker Marks an issue or PR to appear in the broker section of the changelog labels Apr 22, 2020
npepinpe commented Apr 22, 2020

So the log looks fine, and to me this doesn't look like an issue with failover per se. What I can see is:

2020-04-22 10:53:43.936 [] [raft-server-2-raft-partition-partition-1] DEBUG io.zeebe.broker.clustering.atomix.ZeebeRaftStateMachine - Compacting up to index 1033737
...
2020-04-22 10:53:45.115 [Broker-2-Exporter-1] [Broker-2-zb-fs-workers-0] ERROR io.zeebe.util.actor - Uncaught exception in 'Broker-2-Exporter-1' in phase 'STARTED'. Continuing with next job. java.lang.IllegalStateException: Invalid address to read from 984020

So it looks like we compacted while the reader was still behind. How is this possible? Shouldn't we only compact after we've exported (meaning the reader would be ahead)?

/cc @deepthidevaki since you fixed issues with readers thread safety recently
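[Editor's note] The fix merged in #4374 goes in this direction: the compactable index must be bounded by the positions the exporters have already confirmed. A minimal Java sketch of that invariant, with hypothetical names (this is not Zeebe's actual API):

```java
import java.util.Map;

// Hypothetical sketch of the invariant behind #4374: the log may only be
// compacted up to the smallest position that every exporter and the stream
// processor have already confirmed, so no reader is ever left behind.
final class CompactionBound {

    static long safeCompactionPosition(Map<String, Long> exporterPositions,
                                       long processedPosition) {
        long min = processedPosition;
        for (long exported : exporterPositions.values()) {
            min = Math.min(min, exported);
        }
        return min;
    }
}
```

With the values from this issue, a lagging exporter at 984020 would cap compaction there, instead of compaction running ahead to 1033737.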

@npepinpe npepinpe assigned deepthidevaki and unassigned npepinpe Apr 22, 2020
menski commented Apr 22, 2020

BTW, I think the only interesting setting that might be related is the snapshot period, which I set to 5 minutes ("snapshotPeriod" : "PT5M").

menski commented Apr 23, 2020

zeebe-bors bot added a commit that referenced this issue Apr 24, 2020
4374: fix(broker): use exporter positions to calculate compactable index in snapshots r=npepinpe a=npepinpe

## Description

- use concrete entry supplier in AtomixSnapshotStorageTest instead of mocks
- use ZeebeIndexMapping and RaftLogReader directly to supply the correct Atomix entry for snapshot metadata
- do not create a snapshot if a snapshot with that index already exists
- add context map to stackdriver logs
- update exported position in RecordingExporter
- ensure log density is 1 in QA tests that require snapshotting with low synthetic loads
- include partition in ZeebeRaftStateMachine log context

## Related issues

closes #4350 


Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
@zeebe-bors zeebe-bors bot closed this as completed in 47d5a81 Apr 28, 2020