
refactor(agent): replace custom retry logic with Kubernetes backoff strategy#446

Merged
skevetter merged 2 commits into main from remove-custom-retry-replace-with-k8s
Feb 5, 2026

Conversation

@skevetter (Owner) commented Feb 5, 2026

Signed-off-by: Samuel K <skevetter@pm.me>

Summary by CodeRabbit

  • Refactor
    • Updated the agent injection retry mechanism to use Kubernetes-standard retry strategies and backoff patterns, maintaining existing behavior while improving alignment with Kubernetes client libraries.

…trategy

Signed-off-by: Samuel K <skevetter@pm.me>

coderabbitai Bot commented Feb 5, 2026

📝 Walkthrough

Removes the entire custom retry mechanism from binary.go, including types and functions like RetryConfig and RetryWithDeadline. Replaces the retry logic in inject.go with Kubernetes standard libraries (wait.Backoff and client-go retry), migrating from deadline-based retry semantics to backoff-based retry strategy while maintaining agent injection functionality.

Changes

Cohort / File(s) — Summary

Retry mechanism removal — pkg/agent/binary.go
Removes all retry-related types (RetryConfig, RetryFunc), the public function RetryWithDeadline, and the helper methods (checkPreConditions, handleRetry, applyDefaults, checkDeadline, checkContextCancelled, calculateSleep, sleepWithContext), eliminating ~124 lines of custom retry infrastructure.

Retry mechanism migration — pkg/agent/inject.go
Replaces the custom RetryWithDeadline loop with a Kubernetes wait.Backoff configuration and the client-go retry.OnError mechanism. Introduces imports from k8s.io/apimachinery/pkg/util/wait and k8s.io/client-go/util/retry. Retains retry functionality with an incremental attempt counter and per-attempt logging, now built on standard Kubernetes libraries instead of custom deadline-based logic.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title accurately describes the main change (removing custom retry logic and replacing it with a Kubernetes backoff strategy), as evidenced by the deletions in binary.go and modifications in inject.go.


Signed-off-by: Samuel K <skevetter@pm.me>
@skevetter skevetter marked this pull request as ready for review February 5, 2026 21:34

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Comment thread pkg/agent/inject.go
Comment on lines +191 to +215
backoff := wait.Backoff{
	Steps:    30,
	Duration: 10 * time.Second,
	Factor:   1.5,
	Jitter:   0.1,
	Cap:      60 * time.Second,
}

attempt := 0
return retry.OnError(backoff, func(err error) bool {
	if opts.Ctx.Err() != nil {
		return false
	}
	opts.Log.Debugf("retrying attempt %d: %v", attempt, err)
	return true
}, func() error {
	attempt++
	return injectAgent(&injectContext{
		attempt: attempt,
		opts:    opts,
		bm:      bm,
		vc:      vc,
		metrics: metrics,
	})
})

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, examine the file to understand the full context
head -n 30 pkg/agent/inject.go | tail -n +15

Repository: skevetter/devpod

Length of output: 488


🏁 Script executed:

# Check the imports to confirm which retry package is used
grep -n "import" pkg/agent/inject.go | head -n 20

Repository: skevetter/devpod

Length of output: 70


🏁 Script executed:

# Get the exact lines in question with more context
sed -n '180,220p' pkg/agent/inject.go

Repository: skevetter/devpod

Length of output: 1037


🏁 Script executed:

# Search for InjectOptions struct definition to understand Timeout field
rg "type InjectOptions" -A 10

Repository: skevetter/devpod

Length of output: 1150


🌐 Web query:

client-go v0.35.0 retry.OnError context handling deadline

💡 Result:

In client-go v0.35.0, the k8s.io/client-go/util/retry package does not provide retry.OnError (it only exposes conflict-focused helpers like RetryOnConflict). (pkg.go.dev)

So there is no context/deadline handling for retry.OnError in client-go v0.35.0, because that function isn’t part of that module/version.

If your goal is “retry but stop when ctx is canceled / its deadline expires”, use wait.ExponentialBackoffWithContext(ctx, backoff, condition), which explicitly returns when the context is canceled or hits its deadline. (pkg.go.dev)


🏁 Script executed:

# Check go.mod for the client-go version
grep "client-go" go.mod

Repository: skevetter/devpod

Length of output: 141


🏁 Script executed:

# Search for the retry.OnError function definition
rg "func OnError" --type go

Repository: skevetter/devpod

Length of output: 42


🏁 Script executed:

# Check if wait.ExponentialBackoffWithContext is available
rg "ExponentialBackoffWithContext" --type go

Repository: skevetter/devpod

Length of output: 42


🏁 Script executed:

# Look for any custom retry definitions in the codebase
rg "package retry" --type go

Repository: skevetter/devpod

Length of output: 42


🏁 Script executed:

# Check the full imports section of inject.go
sed -n '1,40p' pkg/agent/inject.go

Repository: skevetter/devpod

Length of output: 995


🏁 Script executed:

# Check if the code compiles or if there are any build errors documented
find . -name "go.sum" -exec grep "client-go" {} \;

Repository: skevetter/devpod

Length of output: 401


🏁 Script executed:

# Try to find if OnError is defined anywhere in client-go imports or locally
rg "OnError" --type go

Repository: skevetter/devpod

Length of output: 597


🏁 Script executed:

# Check the k8s.io client-go package documentation for the actual exports
# First, see if there's any reference to what retry functions are available
rg "retry\." --type go | head -20

Repository: skevetter/devpod

Length of output: 456


🌐 Web query:

k8s.io/client-go/util/retry OnError function signature v0.35.0

💡 Result:

In k8s.io/client-go/util/retry v0.35.0, OnError has this signature:

func OnError(backoff wait.Backoff, errorFunc func(error) bool, fn func() error) error

(Some docs/name variants call the 2nd arg retriable, but the type/signature is the same.) [1][2]


🌐 Web query:

k8s.io/apimachinery/pkg/util/wait ExponentialBackoff context support

💡 Result:

wait.ExponentialBackoff does not take a context.Context (it uses ConditionFunc func() (bool, error)), so it can’t be directly canceled via context. (pkg.go.dev)

For context cancellation/deadlines, use wait.ExponentialBackoffWithContext:

  • Signature: func ExponentialBackoffWithContext(ctx context.Context, backoff Backoff, condition ConditionWithContextFunc) error (pkg.go.dev)
  • It returns early if ctx is canceled / hits deadline, if the condition returns an error, or when attempts are exhausted (ErrWaitTimeout). (pkg.go.dev)

Example:

err := wait.ExponentialBackoffWithContext(ctx, wait.Backoff{
    Duration: 100 * time.Millisecond,
    Factor:   2.0,
    Jitter:   0.1,
    Steps:    6,
    Cap:      2 * time.Second,
}, func(ctx context.Context) (bool, error) {
    // check condition; honor ctx if you do I/O
    return done, err
})



🏁 Script executed:

# Check InjectOptions struct and how opts is used
sed -n '30,90p' pkg/agent/inject.go

Repository: skevetter/devpod

Length of output: 2159


🏁 Script executed:

# Verify that InjectOptions is passed by pointer in injectAgent
rg "func injectAgent" -A 5

Repository: skevetter/devpod

Length of output: 323


🏁 Script executed:

# Check if there are any other references to opts fields in the retry block that might be affected
sed -n '191,215p' pkg/agent/inject.go

Repository: skevetter/devpod

Length of output: 558


Overall retry can exceed InjectOptions.Timeout.

Timeout is documented as the maximum duration to wait for injection completion, but the retry loop runs for a fixed 30 steps regardless of elapsed time. The opts.Ctx.Err() check in the callback is insufficient because context expiration is only checked between retries, not during the backoff delays themselves. With the current backoff configuration (~27 minutes total sleep), injection can run far longer than the documented timeout (default 5 minutes). Bind retries to a timeout-aware context so that the overall time is actually capped by the timeout:

🛠️ Suggested fix
+	retryCtx := opts.Ctx
+	if opts.Timeout > 0 {
+		var cancel context.CancelFunc
+		retryCtx, cancel = context.WithTimeout(opts.Ctx, opts.Timeout)
+		defer cancel()
+	}
+	retryOpts := *opts
+	retryOpts.Ctx = retryCtx
+
 	backoff := wait.Backoff{
 		Steps:    30,
 		Duration: 10 * time.Second,
 		Factor:   1.5,
 		Jitter:   0.1,
 		Cap:      60 * time.Second,
 	}

 	attempt := 0
 	return retry.OnError(backoff, func(err error) bool {
-		if opts.Ctx.Err() != nil {
+		if retryCtx.Err() != nil {
 			return false
 		}
 		opts.Log.Debugf("retrying attempt %d: %v", attempt, err)
 		return true
 	}, func() error {
 		attempt++
 		return injectAgent(&injectContext{
 			attempt: attempt,
-			opts:    opts,
+			opts:    &retryOpts,
 			bm:      bm,
 			vc:      vc,
 			metrics: metrics,
 		})
 	})
🤖 Prompt for AI Agents
In `@pkg/agent/inject.go` around lines 191 - 215, The retry loop can exceed
InjectOptions.Timeout because backoff sleeps aren't context-aware; create a
timeout-aware child context (ctx, cancel := context.WithTimeout(opts.Ctx,
opts.Timeout) and defer cancel()) and use a context-aware backoff call instead
of the current retry.OnError pattern: replace the retry.OnError block with a
wait.ExponentialBackoffWithContext (or equivalent context-aware backoff) that
uses the new ctx and invokes injectAgent(&injectContext{attempt: attempt, opts:
optsWithCtx, bm: bm, vc: vc, metrics: metrics}) where optsWithCtx is opts with
Ctx set to the new ctx; also ensure the retry predicate and any opts.Ctx.Err()
checks use this ctx so the overall operation is bounded by
InjectOptions.Timeout.

@skevetter skevetter merged commit 714cb94 into main Feb 5, 2026
39 checks passed
@skevetter skevetter deleted the remove-custom-retry-replace-with-k8s branch February 5, 2026 21:45