fix(logstorage): validate replicas < node count to prevent ILM stall [release-v1.38]#4556

Merged
rene-dekker merged 4 commits into tigera:release-v1.38 from
tianfeng92:fix/logstorage-validate-replicas-node-count-release-v1.38
Mar 16, 2026

@tianfeng92 tianfeng92 commented Mar 16, 2026

Summary

Cherry-picks the following fixes from the master branch:

  • Validate replicas < node count: Elasticsearch ILM can stall when the number of replicas equals or exceeds the available data node count. This adds validation in the LogStorage initializing controller to return a degraded status when replicas >= nodeCount, preventing silent ILM failures.
  • Warn when node count exceeds replicas by only 1: Adds a warning condition when nodeCount == replicas + 1, since this leaves no headroom for node failures.
  • Return error as last argument per Go convention: Refactors the validation helper to follow Go's convention of returning error as the last return value.
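
The three fixes above can be sketched together as one helper. This is a minimal illustration, not the operator's actual code: the function name validateReplicasForNodeCount comes from the commit message, but the integer parameters, the message wording, and the use of the string return value as a warning are assumptions made for the example.

```go
package main

import "fmt"

// validateReplicasForNodeCount is a hypothetical reconstruction of the helper
// named in this PR. The parameter types and message text are assumptions.
// It returns (warning, error), with error last per Go convention.
func validateReplicasForNodeCount(replicas, nodeCount int) (string, error) {
	// replicas >= nodeCount: some replica shards can never be allocated,
	// so ILM can stall. The caller would report a degraded status.
	if replicas >= nodeCount {
		return "", fmt.Errorf(
			"indices.replicas (%d) must be less than nodes.count (%d); set replicas to 0 for single-node deployments",
			replicas, nodeCount)
	}
	// nodeCount == replicas + 1: valid, but losing one node leaves a replica unallocated.
	if nodeCount == replicas+1 {
		return fmt.Sprintf(
			"nodes.count (%d) exceeds indices.replicas (%d) by only 1; there is no headroom for node failures",
			nodeCount, replicas), nil
	}
	return "", nil
}

func main() {
	cases := []struct{ replicas, nodes int }{{1, 1}, {1, 2}, {1, 3}}
	for _, c := range cases {
		warning, err := validateReplicasForNodeCount(c.replicas, c.nodes)
		fmt.Printf("replicas=%d nodes=%d warning=%q err=%v\n", c.replicas, c.nodes, warning, err)
	}
}
```

The three cases in main correspond to the three verification items in the test plan: degraded, warning, and normal operation.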

Test plan

  • Unit tests pass (make ut UT_DIR=./pkg/controller/logstorage)
  • Verify degraded status is set when replicas >= nodeCount
  • Verify warning is logged when nodeCount == replicas + 1
  • Verify normal operation when nodeCount > replicas + 1

🤖 Generated with Claude Code

Release Note

Add validation for logstorage node count and replicas setting.

tianfeng92 and others added 4 commits March 16, 2026 14:32
On single-node ES clusters with replicas: 1, replica shards can never
be allocated. This causes the ILM warm phase migrate action to wait
indefinitely for shard copies to become active, blocking progression
to the delete phase and causing indices to accumulate beyond retention.
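
The allocation arithmetic behind this stall can be sketched as follows. Elasticsearch does not place two copies of the same shard on one node, so each primary can have at most nodes - 1 allocated replicas. The function name unassignedReplicasPerShard is hypothetical and only illustrates the reasoning; it is not part of the operator or of Elasticsearch.

```go
package main

import "fmt"

// unassignedReplicasPerShard is a hypothetical helper for illustration only.
// It returns how many replica copies of a single shard can never be allocated,
// given that each copy of a shard must live on a different data node.
func unassignedReplicasPerShard(replicas, nodes int) int {
	// One node holds the primary; at most nodes-1 nodes remain for replicas.
	assignable := nodes - 1
	if assignable < 0 {
		assignable = 0
	}
	if replicas <= assignable {
		return 0
	}
	return replicas - assignable
}

func main() {
	// Single-node cluster with replicas: 1, the case described in this commit.
	fmt.Println(unassignedReplicasPerShard(1, 1))
	// Two nodes with replicas: 1, where every copy can be allocated.
	fmt.Println(unassignedReplicasPerShard(1, 2))
}
```

Any non-zero result means some shard copies never become active, which is the condition the ILM migrate action waits on.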

Add validation in the LogStorage initializer that rejects configurations
where indices.replicas >= nodes.count, with a clear error message
guiding users to set replicas to 0 for single-node deployments.

Fixes ST1008 staticcheck violation by swapping return order from
(error, string) to (string, error) in validateLogStorage and
validateReplicasForNodeCount.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@rene-dekker rene-dekker merged commit 15e70f5 into tigera:release-v1.38 Mar 16, 2026
2 checks passed