Recovering or Preventing Out of Disk Space #4441

Closed
deepthidevaki opened this issue May 4, 2020 · 1 comment · Fixed by #4782
Assignees
Labels
kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. scope/broker Marks an issue or PR to appear in the broker section of the changelog

Comments

@deepthidevaki
Contributor

Description

There are several reasons why a broker can run out of disk space.

  1. An exporter is not making progress or is slow. Partitions then cannot compact the log in a timely manner, which can lead to out of disk space.
  2. Partitions cannot compact the logs due to:
    • Temporary I/O errors
    • Bugs in the compaction module
  3. Partitions cannot take snapshots (if leader) or receive snapshots (if follower), and so they cannot compact, due to:
    • Temporary I/O errors
    • Network issues
    • Frequent failovers, so that leaders are not up long enough to take a snapshot

In all of the above cases, no automatic recovery is possible. But it is possible that the faults that lead to OOD are resolved eventually: the exporter comes back and starts exporting again, temporary I/O errors are gone, the buggy compaction module is fixed and the broker is upgraded. However, if the broker is already OOD, we cannot take a new snapshot because there is no space for it. So the partitions cannot compact, and the brokers can never recover from OOD. Since we cannot prevent the faults that cause OOD (such as a slow exporter), we focus on preventing the broker from reaching a non-recoverable state. Towards that, we have the following proposals:

  1. The broker rejects client requests if disk space usage > threshold. This threshold should guarantee that there is enough space left for a snapshot.
  2. A follower rejects append requests from the leader when disk space usage > threshold.

The above two features would prevent the broker from becoming fully OOD. Once the fault is fixed, partitions can take a snapshot and compact.
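As a rough sketch of how these two watermark checks could be wired up (the class and field names below are invented for illustration and are not Zeebe's actual API), the decision logic might look like this:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Hypothetical sketch of the two proposed watermark checks (not Zeebe's actual API).
 * Client requests are rejected above a lower "command" watermark; follower appends
 * are rejected only above a higher "replication" watermark, so there is still room
 * left to take or receive a snapshot and compact.
 */
final class DiskSpaceWatermarks {
  private final Path dataDirectory;
  private final double commandWatermark;     // e.g. 0.97 of the disk used
  private final double replicationWatermark; // e.g. 0.99 of the disk used

  DiskSpaceWatermarks(Path dataDirectory, double commandWatermark, double replicationWatermark) {
    this.dataDirectory = dataDirectory;
    this.commandWatermark = commandWatermark;
    this.replicationWatermark = replicationWatermark;
  }

  private double usedFraction() {
    try {
      final FileStore store = Files.getFileStore(dataDirectory);
      return 1.0 - (double) store.getUsableSpace() / store.getTotalSpace();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Proposal 1: reject new client requests once the command watermark is exceeded. */
  boolean shouldRejectClientRequests() {
    return usedFraction() >= commandWatermark;
  }

  /** Proposal 2: reject append requests from the leader once the replication watermark is exceeded. */
  boolean shouldRejectAppends() {
    return usedFraction() >= replicationWatermark;
  }
}
```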

In addition to the above two:
3. Add health checks for compaction and snapshotting.
4. Modify the snapshot frequency. Instead of periodic snapshots, use other heuristics such as the size of the log or the number of events newly processed/exported since the last snapshot (see the sketch after this list). This would ensure that once the fault is fixed, we can immediately take a snapshot and compact, without waiting for the next snapshot period.
5. The gateway redirects all requests to other partitions if only one of the partitions is affected by OOD.
6. In some cases a step-down will be helpful, as other brokers are not OOD and can take over the leadership to ensure progress. However, in most cases it is not useful, since brokers have similar configurations and are compacting at the same time. Stepping down immediately when disk space usage > threshold is not a good idea if other brokers are also OOD; it would lead to back-to-back failovers. Since no broker would stay leader long enough to export or to take a snapshot, they could never compact and recover from OOD. Investigate what a good step-down strategy is.
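For point 4, a snapshot trigger based on such heuristics could be expressed roughly as below. The names and thresholds are invented for illustration and are not Zeebe's actual snapshot logic.

```java
/**
 * Illustrative-only snapshot trigger (not Zeebe's actual implementation).
 * Instead of waiting for a fixed period, a snapshot is taken as soon as enough
 * new work has accumulated since the last snapshot, so a broker can compact
 * immediately once a stalled exporter or compaction fault is resolved.
 */
final class SnapshotTrigger {
  private final long minProcessedRecords;
  private final long minExportedRecords;
  private final long minLogGrowthBytes;

  SnapshotTrigger(long minProcessedRecords, long minExportedRecords, long minLogGrowthBytes) {
    this.minProcessedRecords = minProcessedRecords;
    this.minExportedRecords = minExportedRecords;
    this.minLogGrowthBytes = minLogGrowthBytes;
  }

  boolean shouldTakeSnapshot(
      long processedSinceLastSnapshot,
      long exportedSinceLastSnapshot,
      long logGrowthBytesSinceLastSnapshot) {
    // Any of the heuristics mentioned above can trigger a snapshot: enough newly
    // processed records, enough newly exported records, or enough log growth.
    return processedSinceLastSnapshot >= minProcessedRecords
        || exportedSinceLastSnapshot >= minExportedRecords
        || logGrowthBytesSinceLastSnapshot >= minLogGrowthBytes;
  }
}
```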

@deepthidevaki deepthidevaki added kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. scope/broker Marks an issue or PR to appear in the broker section of the changelog Priority: High labels May 4, 2020
@deepthidevaki deepthidevaki self-assigned this May 6, 2020
@deepthidevaki
Contributor Author

I did a spike to prevent/recover from OOD due to a slow exporter.

Assumptions: OOD means free space <= threshold. There is always enough space for a snapshot.

The leader rejects external commands when free space < X.
External commands are:

  • client requests
  • commands from other partitions
    • Deployment
    • Message subscription etc.

A follower rejects append requests from the leader when free space < Y.

If the leader is OOD, at some point it can take a snapshot, i.e. when the exporter is back.
Since followers have Y free space, they can also receive the snapshot and compact.
X and Y should be configured such that there is enough space for taking and receiving a snapshot.

Questions:
What are good values for X and Y? A fixed value, or should they be based on some other heuristic?
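One purely illustrative way to derive X and Y (an assumption, not taken from Zeebe) is to base both on an estimate of the snapshot size plus a safety margin, keeping the leader threshold X above the follower threshold Y so that followers can still replicate events the leader has already accepted:

```java
/**
 * Purely illustrative derivation of the free-space thresholds X and Y discussed
 * above (hypothetical, not Zeebe's configuration). Both must leave room for a
 * snapshot; the leader stops accepting new client work earlier (higher free-space
 * threshold X) than a follower stops accepting replication (lower threshold Y).
 */
final class FreeSpaceThresholds {
  /** Y: a follower must still be able to receive and store a full snapshot. */
  static long followerThresholdY(long expectedSnapshotSizeBytes, long safetyMarginBytes) {
    return expectedSnapshotSizeBytes + safetyMarginBytes;
  }

  /** X: the leader additionally reserves room for events that are already in flight. */
  static long leaderThresholdX(long followerThresholdY, long inFlightLogBudgetBytes) {
    return followerThresholdY + inFlightLogBudgetBytes;
  }
}
```

For example, with an expected snapshot of 1 GiB, a 512 MiB margin and a 512 MiB in-flight budget, Y would be about 1.5 GiB and X about 2 GiB of required free space.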

Problematic Case

Both followers are OOD (free space < Y) and reject event replication. The only way for them to recover is to get a snapshot and compact.
Problem: The leader cannot take a snapshot because the event at that position cannot be committed. The followers then cannot compact, and we end up in a non-recoverable state.

In what cases are both followers OOD?

  • The followers didn't receive snapshots, so they could not compact.
  • The followers are also followers for other partitions, which the leader broker does not replicate; those other partitions cannot compact for various reasons (not exported, no snapshots).

The chance of such a case is probably low, but if it happens we are in a non-recoverable state.

Verified - If the leader is not OOD and the followers are OOD, the followers cannot recover, because the leader cannot take a snapshot without committing.

Verified - It can recover if we resize at least one follower's disk when it goes OOD, restart the node and wait until the leader takes a snapshot. To resize a PVC -> kubectl edit pvc <data-zeebe>

Main issue:
For a snapshot to be committed on the leader, a majority of the followers has to be healthy. Ideally, snapshotting on the leader should not depend on the followers.

/cc @Zelldon

zeebe-bors bot added a commit that referenced this issue Jul 8, 2020
4782: chore(broker): handle out of disk space in broker r=deepthidevaki a=deepthidevaki

## Description

* expose configuration parameters for minimum free disk space
* When broker free disk space available is less than configured min free disk,
  - reject client requests
  - reject commands from other partitions
  - pause stream processor

## Related issues

closes #4441 

#

4832: feat(clients/java): add user-agent to java client requests r=MiguelPires a=MiguelPires

## Description

Add user-agent header with client and version information to gRPC and OAuth requests.

## Related issues

closes #4265 

#

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
Co-authored-by: Miguel Pires <miguel.pires@camunda.com>
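As a rough mental model of the merged change (the class and method names below are invented and are not the actual implementation), the reaction to crossing the configured minimum free disk space might look like this:

```java
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Invented sketch (not the actual Zeebe implementation) of reacting to the
 * configured minimum free disk space: below the limit, new client requests and
 * inter-partition commands are rejected and the stream processor is paused;
 * once enough space is free again, processing resumes.
 */
final class DiskSpaceUsageListener {
  private final long minFreeDiskSpaceBytes;
  private final AtomicBoolean paused = new AtomicBoolean(false);

  DiskSpaceUsageListener(long minFreeDiskSpaceBytes) {
    this.minFreeDiskSpaceBytes = minFreeDiskSpaceBytes;
  }

  void onFreeDiskSpaceChanged(long freeBytes) {
    if (freeBytes < minFreeDiskSpaceBytes) {
      if (paused.compareAndSet(false, true)) {
        rejectNewWork();        // client requests and commands from other partitions
        pauseStreamProcessor(); // stop processing new commands from the log
      }
    } else if (paused.compareAndSet(true, false)) {
      acceptNewWork();
      resumeStreamProcessor();
    }
  }

  private void rejectNewWork() { /* flip a flag checked by the request handlers */ }
  private void acceptNewWork() { /* clear the flag */ }
  private void pauseStreamProcessor() { /* pause the partition's stream processor */ }
  private void resumeStreamProcessor() { /* resume processing */ }
}
```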
@zeebe-bors zeebe-bors bot closed this as completed in 9a32d29 Jul 9, 2020