Recovering or Preventing Out of Disk Space #4441

Closed
deepthidevaki opened this issue May 4, 2020 · 1 comment · Fixed by #4782
Assignees
Labels
kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. scope/broker Marks an issue or PR to appear in the broker section of the changelog

Comments

@deepthidevaki
Contributor

Description

There are several reasons why a broker can run out of disk space.

  1. An exporter is not making progress or is slow. Partitions then cannot compact the log in a timely manner, which can lead to out of disk space.
  2. Partitions cannot compact the logs due to:
    • Temporary I/O errors
    • Bugs in the compaction module
  3. Partitions cannot take snapshots (if leader) or receive snapshots (if follower), and so they cannot compact, due to:
    • Temporary I/O errors
    • Network issues
    • Frequent failovers, so that leaders are not up long enough to take a snapshot

In all of the above cases, no automatic recovery is possible. But it is possible that the faults that lead to OOD are resolved eventually: the exporter comes back and starts exporting again, temporary I/O errors are gone, the buggy compaction module is fixed and the broker is upgraded. However, if the broker is already OOD, we cannot take a new snapshot because there is no space for it. So the partitions cannot compact, and the brokers can never recover from OOD. Since we cannot prevent the faults that cause OOD (such as a slow exporter), we focus on preventing the broker from reaching a non-recoverable state. Towards that, we have the following proposals:

  1. The broker rejects client requests if disk space usage > threshold. This threshold should guarantee that there is enough space left for a snapshot.
  2. A follower rejects append requests from the leader when disk space usage > threshold.

The above two features would prevent the broker from becoming fully OOD. Once the fault is fixed, partitions can take a snapshot and compact.
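As a rough sketch of how these two watermark checks could be wired up (the class and field names below are invented for illustration and are not Zeebe's actual API), the decision logic might look like this:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Hypothetical sketch of the two proposed watermark checks (not Zeebe's actual API).
 * Client requests are rejected above a lower "command" watermark; follower appends
 * are rejected only above a higher "replication" watermark, so there is still room
 * left to take or receive a snapshot and compact.
 */
final class DiskSpaceWatermarks {
  private final Path dataDirectory;
  private final double commandWatermark;     // e.g. 0.97 of the disk used
  private final double replicationWatermark; // e.g. 0.99 of the disk used

  DiskSpaceWatermarks(Path dataDirectory, double commandWatermark, double replicationWatermark) {
    this.dataDirectory = dataDirectory;
    this.commandWatermark = commandWatermark;
    this.replicationWatermark = replicationWatermark;
  }

  private double usedFraction() {
    try {
      final FileStore store = Files.getFileStore(dataDirectory);
      return 1.0 - (double) store.getUsableSpace() / store.getTotalSpace();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Proposal 1: reject new client requests once the command watermark is exceeded. */
  boolean shouldRejectClientRequests() {
    return usedFraction() >= commandWatermark;
  }

  /** Proposal 2: reject append requests from the leader once the replication watermark is exceeded. */
  boolean shouldRejectAppends() {
    return usedFraction() >= replicationWatermark;
  }
}
```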

In addition to the above two:
3. Add health checks for compaction and snapshotting.
4. Modify the snapshot frequency. Instead of periodic snapshots, use other heuristics such as the size of the log or the number of events newly processed/exported since the last snapshot (see the sketch after this list). This would ensure that once the fault is fixed, we can immediately take a snapshot and compact, without waiting for the next snapshot period.
5. The gateway redirects all requests to other partitions if only one of the partitions is affected by OOD.
6. In some cases a step-down will be helpful, as other brokers are not OOD and can take over the leadership to ensure progress. However, in most cases it is not useful, since brokers have similar configurations and are compacting at the same time. Stepping down immediately when disk space usage > threshold is not a good idea if other brokers are also OOD; it would lead to back-to-back failovers. Since no broker would stay leader long enough to export or to take a snapshot, they could never compact and recover from OOD. Investigate what a good step-down strategy is.
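For point 4, a snapshot trigger based on such heuristics could be expressed roughly as below. The names and thresholds are invented for illustration and are not Zeebe's actual snapshot logic.

```java
/**
 * Illustrative-only snapshot trigger (not Zeebe's actual implementation).
 * Instead of waiting for a fixed period, a snapshot is taken as soon as enough
 * new work has accumulated since the last snapshot, so a broker can compact
 * immediately once a stalled exporter or compaction fault is resolved.
 */
final class SnapshotTrigger {
  private final long minProcessedRecords;
  private final long minExportedRecords;
  private final long minLogGrowthBytes;

  SnapshotTrigger(long minProcessedRecords, long minExportedRecords, long minLogGrowthBytes) {
    this.minProcessedRecords = minProcessedRecords;
    this.minExportedRecords = minExportedRecords;
    this.minLogGrowthBytes = minLogGrowthBytes;
  }

  boolean shouldTakeSnapshot(
      long processedSinceLastSnapshot,
      long exportedSinceLastSnapshot,
      long logGrowthBytesSinceLastSnapshot) {
    // Any of the heuristics mentioned above can trigger a snapshot: enough newly
    // processed records, enough newly exported records, or enough log growth.
    return processedSinceLastSnapshot >= minProcessedRecords
        || exportedSinceLastSnapshot >= minExportedRecords
        || logGrowthBytesSinceLastSnapshot >= minLogGrowthBytes;
  }
}
```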

@deepthidevaki deepthidevaki added kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. scope/broker Marks an issue or PR to appear in the broker section of the changelog Priority: High labels May 4, 2020
@deepthidevaki deepthidevaki self-assigned this May 6, 2020
@deepthidevaki
Contributor Author

I did a spike to prevent/recover from OOD due to a slow exporter.

Assumptions: OOD means free space <= threshold. There is always enough space for a snapshot.

The leader rejects external commands when free space < X.
External commands are:

  • client requests
  • commands from other partitions
    • Deployment
    • Message subscription etc.

A follower rejects append requests from the leader when free space < Y.

If the leader is OOD, at some point it can take a snapshot, i.e. when the exporter is back.
Since followers have Y free space, they can also receive the snapshot and compact.
X and Y should be configured such that there is enough space for taking and receiving a snapshot.

Questions:
What are good values for X and Y? A fixed value, or should they be based on some other heuristic?
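One purely illustrative way to derive X and Y (an assumption, not taken from Zeebe) is to base both on an estimate of the snapshot size plus a safety margin, keeping the leader threshold X above the follower threshold Y so that followers can still replicate events the leader has already accepted:

```java
/**
 * Purely illustrative derivation of the free-space thresholds X and Y discussed
 * above (hypothetical, not Zeebe's configuration). Both must leave room for a
 * snapshot; the leader stops accepting new client work earlier (higher free-space
 * threshold X) than a follower stops accepting replication (lower threshold Y).
 */
final class FreeSpaceThresholds {
  /** Y: a follower must still be able to receive and store a full snapshot. */
  static long followerThresholdY(long expectedSnapshotSizeBytes, long safetyMarginBytes) {
    return expectedSnapshotSizeBytes + safetyMarginBytes;
  }

  /** X: the leader additionally reserves room for events that are already in flight. */
  static long leaderThresholdX(long followerThresholdY, long inFlightLogBudgetBytes) {
    return followerThresholdY + inFlightLogBudgetBytes;
  }
}
```

For example, with an expected snapshot of 1 GiB, a 512 MiB margin and a 512 MiB in-flight budget, Y would be about 1.5 GiB and X about 2 GiB of required free space.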

Problematic Case

Both followers are OOD (free space < Y) and reject event replication. The only way for them to recover is to get a snapshot and compact.
Problem: The leader cannot take a snapshot because the event at that position cannot be committed. The followers then cannot compact, and we end up in a non-recoverable state.

In what cases are both followers OOD?

  • The followers didn't receive snapshots, so they could not compact.
  • The followers are also followers for other partitions, which the leader broker does not replicate; those other partitions cannot compact for various reasons (not exported, no snapshots).

The chance of such a case is probably low, but if it happens we are in a non-recoverable state.

Verified - If the leader is not OOD and the followers are OOD, the followers cannot recover, because the leader cannot take a snapshot without committing.

Verified - It can recover if we resize at least one follower's disk when it goes OOD, restart the node and wait until the leader takes a snapshot. To resize a PVC -> kubectl edit pvc <data-zeebe>

Main issue:
For a snapshot to be committed on the leader, a majority of the followers has to be healthy. Ideally, snapshotting on the leader should not depend on the followers.

/cc @Zelldon

zeebe-bors bot added a commit that referenced this issue Jul 8, 2020
4782: chore(broker): handle out of disk space in broker r=deepthidevaki a=deepthidevaki

## Description

* expose configuration parameters for minimum free disk space
* When broker free disk space available is less than configured min free disk,
  - reject client requests
  - reject commands from other partitions
  - pause stream processor

## Related issues

closes #4441 

#

4832: feat(clients/java): add user-agent to java client requests r=MiguelPires a=MiguelPires

## Description

Add user-agent header with client and version information to gRPC and OAuth requests.

## Related issues

closes #4265 

#

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
Co-authored-by: Miguel Pires <miguel.pires@camunda.com>
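As a rough mental model of the merged change (the class and method names below are invented and are not the actual implementation), the reaction to crossing the configured minimum free disk space might look like this:

```java
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Invented sketch (not the actual Zeebe implementation) of reacting to the
 * configured minimum free disk space: below the limit, new client requests and
 * inter-partition commands are rejected and the stream processor is paused;
 * once enough space is free again, processing resumes.
 */
final class DiskSpaceUsageListener {
  private final long minFreeDiskSpaceBytes;
  private final AtomicBoolean paused = new AtomicBoolean(false);

  DiskSpaceUsageListener(long minFreeDiskSpaceBytes) {
    this.minFreeDiskSpaceBytes = minFreeDiskSpaceBytes;
  }

  void onFreeDiskSpaceChanged(long freeBytes) {
    if (freeBytes < minFreeDiskSpaceBytes) {
      if (paused.compareAndSet(false, true)) {
        rejectNewWork();        // client requests and commands from other partitions
        pauseStreamProcessor(); // stop processing new commands from the log
      }
    } else if (paused.compareAndSet(true, false)) {
      acceptNewWork();
      resumeStreamProcessor();
    }
  }

  private void rejectNewWork() { /* flip a flag checked by the request handlers */ }
  private void acceptNewWork() { /* clear the flag */ }
  private void pauseStreamProcessor() { /* pause the partition's stream processor */ }
  private void resumeStreamProcessor() { /* resume processing */ }
}
```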
@zeebe-bors zeebe-bors bot closed this as completed in 9a32d29 Jul 9, 2020