Skip to content

fix: harden retry logic - typed error checking and injectable RetryConfig#66

Merged
Kavirubc merged 3 commits intomainfrom
fix/retry-improvements
Feb 17, 2026
Merged

fix: harden retry logic - typed error checking and injectable RetryConfig#66
Kavirubc merged 3 commits intomainfrom
fix/retry-improvements

Conversation

@Kavirubc
Copy link
Copy Markdown
Contributor

@Kavirubc Kavirubc commented Feb 17, 2026

Summary

Follow-up to #63 addressing two open items flagged during review.

  • Typed error checking in isRetryableError: replaces strings.Contains matching with errors.As(*googleapi.Error) for HTTP status codes and status.FromError + gRPC codes for RPC errors. Eliminates false positives (e.g. the word "Internal" accidentally matching unrelated error messages).
  • Injectable RetryConfig on LLMClient and Embedder: both structs now hold a retryConfig field initialised to DefaultRetryConfig() in their constructors. generateWithRetry and Embed use the struct field instead of calling DefaultRetryConfig() inline, allowing integration-level tests to inject fast configs without waiting on real wall-clock delays.
  • Updated retry_test.go: all test cases now construct real *googleapi.Error and status.New(codes.X).Err() typed errors so the tests exercise the actual code path end-to-end.

Test plan

  • All 19 existing retry unit tests pass (go test ./internal/integrations/gemini/...)
  • Build passes with no new warnings (go build ./...)
  • isRetryableError correctly retries HTTP 429/5xx and gRPC ResourceExhausted/Unavailable/Internal
  • isRetryableError does not retry HTTP 400/403/404 or generic errors
  • LLMClient and Embedder can be constructed with a custom RetryConfig for fast test execution

Summary by CodeRabbit

  • Refactor
    • Improved error handling in Gemini integration with typed error inspection for more reliable retry detection
    • Enhanced retry logic to properly recognize rate-limit and transient errors across REST and gRPC protocols
  • New Features
    • Made retry settings configurable within Gemini embed and LLM clients
  • Tests
    • Updated tests to use structured REST/gRPC error types and expanded retry-path coverage

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Feb 17, 2026

No actionable comments were generated in the recent review. 🎉


📝 Walkthrough

Walkthrough

Refactors Gemini integration to add configurable retry fields to Embedder and LLMClient, initializes them with DefaultRetryConfig, and replaces string-based retry detection with typed error inspection for REST (googleapi.Error) and gRPC (status codes). Tests updated to use structured errors.

Changes

Cohort / File(s) Summary
Retry Configuration Integration
internal/integrations/gemini/embedder.go, internal/integrations/gemini/llm.go
Added internal retryConfig fields to Embedder and LLMClient, initialized via DefaultRetryConfig() in constructors; runtime retry logic now uses these fields instead of local defaults.
Retry Error Handling Refactor
internal/integrations/gemini/retry.go
Rewrote isRetryableError() and related retry flow to use typed inspections: handle REST errors via *googleapi.Error (429 or 5xx) and gRPC via status.Code (ResourceExhausted, Unavailable, Internal); removed message-fragment matching and clarified comments.
Tests Updated for Typed Errors
internal/integrations/gemini/retry_test.go
Replaced string-based error test cases with *googleapi.Error and grpc/status errors, added necessary imports, and updated wrapped/embedded error scenarios to validate typed retry behavior.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Gemini as LLMClient/Embedder
    participant Retry as withRetry
    participant External as Google API / gRPC

    Client->>Gemini: Request (embed/generate)
    Gemini->>Retry: invoke with retryConfig
    Retry->>External: call API
    alt success
        External-->>Retry: response
        Retry-->>Gemini: return result
        Gemini-->>Client: final response
    else retryable error (429/5xx or gRPC codes)
        External-->>Retry: retryable error
        Retry-->>Retry: backoff + jitter (respect context)
        Retry->>External: retry call
    else non-retryable error
        External-->>Retry: non-retryable error
        Retry-->>Gemini: propagate error
        Gemini-->>Client: error
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

🐰 In burrows of code I hop and write,
Typed errors gleam in morning light,
Retries now follow tuned command,
Backoff, jitter — steady stand,
Gemini hums, robust and bright. 🌿

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the two main changes: typed error checking and injectable RetryConfig, which are the core objectives of this PR.
Merge Conflict Detection ✅ Passed ✅ No merge conflicts detected when merging into main

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/retry-improvements

Comment @coderabbitai help to get the list of available commands and usage tips.

Replace fragile strings.Contains matching with errors.As(*googleapi.Error)
for HTTP status codes and gRPC status.FromError for RPC codes. This prevents
false positives (e.g. "Internal" matching unrelated messages) and makes
retry logic correct by construction.

Update retry_test.go to use real *googleapi.Error and gRPC status instances
instead of plain errors.New strings, so tests exercise the actual code path.

Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>
Store retryConfig as a field on both structs, set to DefaultRetryConfig()
in their constructors. generateWithRetry and Embed now use the struct field
instead of calling DefaultRetryConfig() inline, allowing tests to inject
fast configs (e.g. 1ms delays) without waiting real wall-clock time.

Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>
@Kavirubc Kavirubc force-pushed the fix/retry-improvements branch from 9a23f0f to b4a71a6 Compare February 17, 2026 04:31
@gh-simili-bot
Copy link
Copy Markdown
Contributor

Simili Triage Report

Note

Quality Score: 9.2/10 (Excellent)
The issue could be improved. See suggestions below.

Classification

Category Value
Labels
Quality Improvements
  • Test plan could be more detailed
  • Expand the test plan to explicitly mention if new tests were added for the RetryConfig injection mechanism and the new error type handling, or if existing tests were modified to cover these aspects.
Similar Threads
Similarity Thread Status
79% #50 [Reliability]: Implement Exponential Backoff fo...
77% #50 [Reliability]: Implement Exponential Backoff fo... Open
76% #33 Similar Issue

Generated by Simili Bot

@gh-simili-bot gh-simili-bot added bug Something isn't working enhancement New feature or request refactor labels Feb 17, 2026
@Kavirubc Kavirubc self-assigned this Feb 17, 2026
@Kavirubc Kavirubc requested a review from Copilot February 17, 2026 04:32
@Kavirubc Kavirubc moved this from Todo to Done in simili-bot-v1-release Feb 17, 2026
@Kavirubc Kavirubc linked an issue Feb 17, 2026 that may be closed by this pull request
@gh-simili-bot
Copy link
Copy Markdown
Contributor

Simili Triage Report

Note

Quality Score: 9.5/10 (Excellent)
The issue is well-described.

Classification

Category Value
Labels
Quality Improvements
  • Could optionally include small 'before' and 'after' code snippets for key changes to further illustrate implementation details, although the textual description is already very clear.
Similar Threads
Similarity Thread Status
80% #50 [Reliability]: Implement Exponential Backoff fo...
79% #50 [Reliability]: Implement Exponential Backoff fo... Open

Generated by Simili Bot

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens Gemini retry behavior by switching transient-error detection from string matching to typed error inspection and by moving retry configuration onto client structs so it can be overridden.

Changes:

  • Updated isRetryableError to use errors.As(*googleapi.Error) for HTTP codes and gRPC status codes for RPC errors.
  • Added retryConfig fields to LLMClient and Embedder, initialized from DefaultRetryConfig(), and used by generateWithRetry/Embed.
  • Updated retry unit tests to use typed *googleapi.Error and status.New(...).Err() errors.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
internal/integrations/gemini/retry.go Replaces string matching with typed REST/gRPC error classification for retryability.
internal/integrations/gemini/retry_test.go Updates tests to exercise typed error paths (googleapi + gRPC status).
internal/integrations/gemini/llm.go Stores retry config on LLMClient and uses it in generateWithRetry.
internal/integrations/gemini/embedder.go Stores retry config on Embedder and uses it in Embed.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/integrations/gemini/retry.go Outdated
Comment on lines 59 to 65
// gRPC transport: generative-ai-go can return gRPC status errors.
if st, ok := status.FromError(err); ok {
switch st.Code() {
case codes.ResourceExhausted, codes.Unavailable, codes.Internal:
return true
}
}
Copy link

Copilot AI Feb 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

status.FromError(err) does not unwrap %w-wrapped errors, so retryable gRPC status errors can be missed once they’ve been wrapped upstream (e.g., fmt.Errorf("...: %w", err) in Embed). Consider using errors.As with an interface like interface{ GRPCStatus() *status.Status } (or iteratively unwrapping) so gRPC codes are detected through wrappers.

Copilot uses AI. Check for mistakes.
Comment on lines 105 to 122
// NewLLMClient creates a new Gemini LLM client.
func NewLLMClient(apiKey, model string) (*LLMClient, error) {
ctx := context.Background()
client, err := genai.NewClient(ctx, option.WithAPIKey(apiKey))
if err != nil {
return nil, fmt.Errorf("failed to create Gemini client: %w", err)
}

if model == "" {
model = "gemini-2.5-flash" // Fast and cost-effective
}

return &LLMClient{
client: client,
model: model,
client: client,
model: model,
retryConfig: DefaultRetryConfig(),
}, nil
}
Copy link

Copilot AI Feb 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says RetryConfig is injectable, but retryConfig is unexported and NewLLMClient doesn’t provide any way for callers outside the gemini package to set it. To make this truly injectable, consider adding a constructor overload/option (e.g., NewLLMClientWithRetryConfig or functional options) or an exported setter.

Copilot uses AI. Check for mistakes.
Comment on lines 17 to 41
@@ -33,8 +34,9 @@ func NewEmbedder(apiKey, model string) (*Embedder, error) {
}

return &Embedder{
client: client,
model: model,
client: client,
model: model,
retryConfig: DefaultRetryConfig(),
}, nil
}
Copy link

Copilot AI Feb 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says RetryConfig is injectable, but retryConfig is unexported and NewEmbedder doesn’t provide any way for callers outside the gemini package to set it. Consider adding a constructor overload/option or an exported setter so integration tests (or other packages) can inject a fast retry config without relying on package-internal access.

Copilot uses AI. Check for mistakes.
Comment on lines 25 to +36
{"nil error", nil, false},
{"rate limit 429", errors.New("googleapi: Error 429: Resource exhausted"), true},
{"ResourceExhausted gRPC", errors.New("rpc error: code = ResourceExhausted"), true},
{"server error 500", errors.New("googleapi: Error 500: Internal Server Error"), true},
{"bad gateway 502", errors.New("HTTP 502: Bad Gateway"), true},
{"unavailable 503", errors.New("googleapi: Error 503: Service Unavailable"), true},
{"gateway timeout 504", errors.New("HTTP 504: Gateway Timeout"), true},
{"Unavailable keyword", errors.New("rpc error: code = Unavailable"), true},
{"Internal keyword", errors.New("rpc error: code = Internal"), true},
{"client error 400", errors.New("googleapi: Error 400: Bad Request"), false},
{"forbidden 403", errors.New("googleapi: Error 403: Forbidden"), false},
{"not found 404", errors.New("googleapi: Error 404: Not Found"), false},
{"rate limit 429", &googleapi.Error{Code: 429, Message: "Resource exhausted"}, true},
{"ResourceExhausted gRPC", status.New(codes.ResourceExhausted, "resource exhausted").Err(), true},
{"server error 500", &googleapi.Error{Code: 500, Message: "Internal Server Error"}, true},
{"bad gateway 502", &googleapi.Error{Code: 502, Message: "Bad Gateway"}, true},
{"unavailable 503", &googleapi.Error{Code: 503, Message: "Service Unavailable"}, true},
{"gateway timeout 504", &googleapi.Error{Code: 504, Message: "Gateway Timeout"}, true},
{"Unavailable gRPC", status.New(codes.Unavailable, "service unavailable").Err(), true},
{"Internal gRPC", status.New(codes.Internal, "internal error").Err(), true},
{"client error 400", &googleapi.Error{Code: 400, Message: "Bad Request"}, false},
{"forbidden 403", &googleapi.Error{Code: 403, Message: "Forbidden"}, false},
{"not found 404", &googleapi.Error{Code: 404, Message: "Not Found"}, false},
Copy link

Copilot AI Feb 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There’s no test case that covers a retryable gRPC status error being wrapped (e.g., fmt.Errorf("wrap: %w", status.New(...).Err())). Since production code may wrap errors, adding such a case would ensure isRetryableError continues to work end-to-end with wrapped gRPC errors.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
internal/integrations/gemini/llm.go (1)

105-122: Consider exposing a way to customize RetryConfig for external callers.

The retryConfig field is unexported, which allows same-package tests to modify it directly. However, if external integrations need custom retry behavior (e.g., shorter delays for integration tests), they currently have no API to do so.

Consider adding either:

  • A functional option pattern: NewLLMClient(apiKey, model, WithRetryConfig(cfg))
  • A setter method: SetRetryConfig(cfg RetryConfig)

This is optional if external customization isn't a current requirement.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/integrations/gemini/llm.go` around lines 105 - 122, The NewLLMClient
currently hardcodes retryConfig via DefaultRetryConfig and does not expose a way
for callers to customize it; add a public API to set custom retry behavior by
either implementing a functional option (e.g., add a WithRetryConfig(retryConfig
RetryConfig) option consumed by NewLLMClient(...opts ...LLMOption)) or by adding
a setter on LLMClient (e.g., SetRetryConfig(cfg RetryConfig)); ensure the
implementation updates the LLMClient.retryConfig field and that NewLLMClient
still defaults to DefaultRetryConfig when no option or setter is used so
existing behavior remains unchanged.
internal/integrations/gemini/retry.go (1)

53-67: Consider adding codes.DeadlineExceeded to retryable gRPC codes.

The current implementation handles ResourceExhausted, Unavailable, and Internal. However, DeadlineExceeded is also a transient error that can occur during temporary network slowdowns and is commonly retried.

💡 Optional: Add DeadlineExceeded to retryable codes
 	if st, ok := status.FromError(err); ok {
 		switch st.Code() {
-		case codes.ResourceExhausted, codes.Unavailable, codes.Internal:
+		case codes.ResourceExhausted, codes.Unavailable, codes.Internal, codes.DeadlineExceeded:
 			return true
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/integrations/gemini/retry.go` around lines 53 - 67, The retry logic
currently treats gRPC status codes ResourceExhausted, Unavailable, and Internal
as retryable; add codes.DeadlineExceeded to the switch in the
status.FromError(err) block so DeadlineExceeded is also considered transient and
returns true (update the switch that checks st.Code() alongside cases
codes.ResourceExhausted, codes.Unavailable, codes.Internal to include
codes.DeadlineExceeded).
internal/integrations/gemini/retry_test.go (1)

19-49: Consider adding test cases for wrapped errors.

The current tests cover direct googleapi.Error and gRPC status errors, but errors.As is designed to unwrap nested errors. Adding a test case for wrapped errors would confirm that isRetryableError correctly handles errors wrapped via fmt.Errorf("...: %w", err).

💡 Optional: Add wrapped error test case
{"wrapped 429", fmt.Errorf("API call failed: %w", &googleapi.Error{Code: 429, Message: "Rate limited"}), true},
{"wrapped gRPC", fmt.Errorf("RPC failed: %w", status.New(codes.Unavailable, "unavailable").Err()), true},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/integrations/gemini/retry_test.go` around lines 19 - 49, Test suite
TestIsRetryableError lacks cases where retryable errors are wrapped; update the
table of tests in TestIsRetryableError to include wrapped variants using
fmt.Errorf with %w that wrap a googleapi.Error (e.g. wrapped 429) and a gRPC
status error (e.g. wrapped Unavailable) so that isRetryableError's use of
errors.As is exercised for nested errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@internal/integrations/gemini/llm.go`:
- Around line 105-122: The NewLLMClient currently hardcodes retryConfig via
DefaultRetryConfig and does not expose a way for callers to customize it; add a
public API to set custom retry behavior by either implementing a functional
option (e.g., add a WithRetryConfig(retryConfig RetryConfig) option consumed by
NewLLMClient(...opts ...LLMOption)) or by adding a setter on LLMClient (e.g.,
SetRetryConfig(cfg RetryConfig)); ensure the implementation updates the
LLMClient.retryConfig field and that NewLLMClient still defaults to
DefaultRetryConfig when no option or setter is used so existing behavior remains
unchanged.

In `@internal/integrations/gemini/retry_test.go`:
- Around line 19-49: Test suite TestIsRetryableError lacks cases where retryable
errors are wrapped; update the table of tests in TestIsRetryableError to include
wrapped variants using fmt.Errorf with %w that wrap a googleapi.Error (e.g.
wrapped 429) and a gRPC status error (e.g. wrapped Unavailable) so that
isRetryableError's use of errors.As is exercised for nested errors.

In `@internal/integrations/gemini/retry.go`:
- Around line 53-67: The retry logic currently treats gRPC status codes
ResourceExhausted, Unavailable, and Internal as retryable; add
codes.DeadlineExceeded to the switch in the status.FromError(err) block so
DeadlineExceeded is also considered transient and returns true (update the
switch that checks st.Code() alongside cases codes.ResourceExhausted,
codes.Unavailable, codes.Internal to include codes.DeadlineExceeded).

status.FromError does not unwrap %w-wrapped errors, so retryable gRPC
status errors (e.g. ResourceExhausted) could be missed when wrapped by
an upstream fmt.Errorf call. Switch to errors.As with the GRPCStatus()
interface, which correctly traverses the error chain.

Add two new test cases: wrapped retryable (ResourceExhausted) and
wrapped non-retryable (NotFound) to cover this code path.

Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>
@gh-simili-bot
Copy link
Copy Markdown
Contributor

Simili Triage Report

Note

Quality Score: 9.5/10 (Excellent)
The issue could be improved. See suggestions below.

Classification

Category Value
Labels
Quality Improvements
  • Specific examples of the 'false positives' in isRetryableError are not provided.
  • Include a brief, concrete example of an error message that would have been incorrectly identified as retryable before this fix.
Similar Threads
Similarity Thread Status
80% #50 [Reliability]: Implement Exponential Backoff fo...
80% #50 [Reliability]: Implement Exponential Backoff fo... Open

Generated by Simili Bot

@Kavirubc Kavirubc merged commit e40a68c into main Feb 17, 2026
6 checks passed
@Kavirubc Kavirubc deleted the fix/retry-improvements branch February 17, 2026 04:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working enhancement New feature or request fix refactor testing

Projects

Development

Successfully merging this pull request may close these issues.

[Reliability]: Implement Exponential Backoff for Gemini API

3 participants