Skip to content

fix: improve retry logic, error metadata, and container lifecycle#17

Merged
tirthpatell merged 12 commits intomainfrom
fix/retry-and-container-lifecycle
Mar 13, 2026
Merged

fix: improve retry logic, error metadata, and container lifecycle#17
tirthpatell merged 12 commits intomainfrom
fix/retry-and-container-lifecycle

Conversation

@tirthpatell
Copy link
Copy Markdown
Owner

Summary

This PR improves the retry/error handling infrastructure and container lifecycle management in the Threads API client:

  • Parse is_transient and error_subcode from Meta API error responses — all typed errors (APIError, ValidationError, AuthenticationError, etc.) now carry IsTransient, HTTPStatusCode, and ErrorSubcode fields on the shared BaseError, enabling richer error inspection by callers.
  • Fix retry decisions to use HTTP status code instead of API error code — previously, isRetryableError compared apiErr.Code (the Meta error code, e.g. 2) against the 500-599 range, which never matched. Now retry logic checks baseErr.HTTPStatusCode and the IsTransient flag, so 5xx responses and transient errors are correctly retried.
  • Remove dead shouldRetryStatus code pathexecuteRequest already returns an error for HTTP >= 400, so the status-code check in Do() was unreachable. Retry for 429/5xx is now handled via isRetryableError on the error path.
  • Add IsTransientError() public helper — follows the existing IsAuthenticationError/IsRateLimitError pattern, letting callers check if an error is transient without type-switching.
  • Parallelize child container readiness checks in carousel creation — child containers are now polled concurrently via goroutines instead of sequentially, reducing carousel creation latency proportionally to the number of children.
  • Respect context cancellation in waitForContainerReady — replaced bare time.Sleep with select on ctx.Done() / time.After, so timeouts and cancellation propagate correctly.
  • Extract setErrorMetadata helper and simplify switch cases — DRYs up the repeated extractBaseError + field-setting pattern, merges identical 401/403 and 5xx/default switch arms, and removes the redundant InProgress/Published case in waitForContainerReady.

Test plan

  • TestCreateErrorFromResponseParsesIsTransient — verifies is_transient is parsed from JSON
  • TestCreateErrorFromResponseParsesErrorSubcode — verifies error_subcode is parsed
  • TestIsRetryableErrorWithTransientAPIError — validates retry for transient, 5xx, and non-retryable errors
  • TestIsTransientErrorHelper — validates the public IsTransientError() helper
  • TestErrorIsTransient / TestErrorSubcode — field-level checks on error types
  • TestWaitForContainerReadyRespectsContext — confirms context timeout is honored
  • All existing tests pass (go test ./...)

Parse error_subcode from Meta API error responses in both
createErrorFromResponse and handleAPIError. Add IsTransientError()
helper function for checking transient errors from any error type.
shouldRetryStatus in Do() was unreachable because executeRequest
already returns an error for HTTP status >= 400, so the status code
at that point was always < 400. The retry logic for 429/5xx errors
is correctly handled via isRetryableError on the error path.
- Extract setErrorMetadata() to DRY up repeated extractBaseError pattern
- Collapse 401/403 into single switch case
- Fold redundant 500-504 case into default (both create APIError)
- Remove redundant InProgress/Published case in waitForContainerReady
@tirthpatell tirthpatell self-assigned this Mar 13, 2026
@greptile-apps
Copy link
Copy Markdown

greptile-apps bot commented Mar 13, 2026

Greptile Summary

This PR fixes a set of closely related reliability bugs in the Threads API client — incorrect retry decisions, missing error metadata, a nil-pointer on 429 responses, and unresponsive context cancellation during container polling — and introduces parallel child-container readiness checks to reduce carousel creation latency.

Key changes:

  • isRetryableError previously compared the Meta API error code (e.g. 2) against the 500–599 range, which never matched; it now correctly checks BaseError.HTTPStatusCode, so 5xx responses are actually retried.
  • The dead shouldRetryStatus code path (unreachable because executeRequest always returns an error for HTTP ≥ 400) has been removed.
  • is_transient and error_subcode are now parsed from the Meta error JSON in both createErrorFromResponse and handleAPIError and stored on BaseError via the new setErrorMetadata helper.
  • handleAPIError in client_utils.go no longer panics on a 429 that arrives without X-RateLimit-* headers (resp.RateLimit nil-check added).
  • waitForContainerReady replaces bare time.Sleep with a select on ctx.Done() / time.After, so context timeouts and cancellations propagate correctly.
  • Child container polling in CreateCarouselPost is now concurrent; sibling goroutines are cancelled eagerly on the first failure, and the error-reporting logic correctly skips context.Canceled/context.DeadlineExceeded side-effects to surface the actual root-cause error.
  • NetworkError gains an Unwrap() method and NewNetworkErrorWithCause, allowing errors.Is/errors.As to inspect the original cause (e.g. context.Canceled).
  • IsTransientError is added as a public helper using errors.As for each concrete type, consistent with all other IsXxx helpers.

Confidence Score: 4/5

  • This PR is safe to merge; all previously flagged issues have been addressed and no new critical bugs were found.
  • The core bug fixes (retry using HTTPStatusCode, nil-guard on RateLimit, context cancellation in poll loop, parallel carousel children with proper cancellation and error surfacing) are all implemented correctly. The new IsTransientError helper correctly uses errors.As. The internal extractBaseError helper uses a type switch rather than errors.As, but this is only called on unwrapped errors at construction time and inside isRetryableError where errors from executeRequest are never wrapped, so it is safe in practice. Test coverage is solid and the updated TestWaitForContainerReadyRespectsContext now genuinely exercises the ctx.Done() select branch via a local httptest.Server. Score is 4 rather than 5 because extractBaseError's type-switch pattern could silently fail for future callers who pass wrapped errors to isRetryableError if the function is ever broadened in scope.
  • No files require special attention; posts_create.go and http_client.go carry the most behavioural change and have been verified correct.

Important Files Changed

Filename Overview
errors.go Adds IsTransient, HTTPStatusCode, and ErrorSubcode fields to BaseError; introduces extractBaseError, setErrorMetadata helpers; adds NetworkError.Unwrap()/NewNetworkErrorWithCause; implements IsTransientError using errors.As for wrapped-error support. All changes are correct and consistent with existing patterns.
http_client.go Removes dead shouldRetryStatus code path; fixes isRetryableError to use HTTPStatusCode instead of the Meta API code for 5xx detection; merges 401/403 switch arms; propagates cause in wrapNetworkError. One minor note: UpdateFromHeaders is never called for error responses (pre-existing gap — for 429s MarkRateLimited is called inside createErrorFromResponse, so rate-limiting still works).
client_utils.go Adds is_transient and error_subcode parsing; fixes potential nil-pointer panic when a 429 arrives without rate-limit headers by guarding resp.RateLimit != nil; DRYs error construction via setErrorMetadata. All changes are correct.
posts_create.go Parallelises child-container readiness checks; propagates cancelChildren() eagerly on first failure; filters context.Canceled/context.DeadlineExceeded sibling side-effects to surface the real failure; adds ctx.Done() select to waitForContainerReady. Logic is sound for the common path.
client_test.go New tests cover is_transient/error_subcode parsing, corrected isRetryableError logic, IsTransientError helper including wrapped-error case, and a properly-wired TestWaitForContainerReadyRespectsContext using a local httptest.Server that genuinely exercises the ctx.Done() select branch.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[HTTPClient.Do] --> B{isRetryableError}
    B -->|RateLimitError| C[retry with backoff]
    B -->|NetworkError| D{Temporary or IsTransient?}
    D -->|yes| C
    D -->|no| E[return error immediately]
    B -->|other typed error| F{extractBaseError}
    F --> G{IsTransient?}
    G -->|yes| C
    G -->|no| H{HTTPStatusCode 5xx?}
    H -->|yes| C
    H -->|no| E
    C --> I{maxRetries exhausted?}
    I -->|no| A
    I -->|yes| J[return wrapped lastErr]

    K[createErrorFromResponse] --> L{HTTP status}
    L -->|401 or 403| M[AuthenticationError]
    L -->|429| N[RateLimitError + MarkRateLimited]
    L -->|400 or 422| O[ValidationError]
    L -->|default| P[APIError]
    M --> Q[setErrorMetadata]
    N --> Q
    O --> Q
    P --> Q
    Q --> R[IsTransient, HTTPStatusCode, ErrorSubcode set on BaseError]

    S[CreateCarouselPost] --> T[spawn N goroutines via childCtx]
    T --> U[waitForContainerReady with ctx.Done select]
    U --> V[buffered results channel]
    V --> W[collect all N results, cancelChildren on first error]
    W --> X{any non-ctx-cancel error?}
    X -->|yes| Y[return lowest-index real error]
    X -->|no| Z{any ctx-cancel error?}
    Z -->|yes| AA[return first cancel error]
    Z -->|no| AB[proceed to carousel container creation]
Loading

Last reviewed commit: 18cf6c4

Comment thread errors.go
Comment thread posts_create.go
Comment thread client_test.go
Comment thread client_utils.go Outdated
- IsTransientError now uses errors.As to support wrapped errors,
  consistent with all other IsXxx helpers
- Add context.WithCancel in carousel parallel polling to prevent
  goroutine leaks on early return
- Guard resp.RateLimit nil dereference on 429 in handleAPIError
- Rewrite context cancellation test with httptest server returning
  IN_PROGRESS so the select branch is actually exercised
- Add wrapped error test case for IsTransientError
Comment thread posts_create.go
Collect all child container results before reporting, so the
lowest-indexed failure is always surfaced first — matching the
behavior of the prior sequential implementation.
Comment thread posts_create.go Outdated
Call cancelChildren() inside the collection loop when an error is
detected, so remaining goroutines stop polling at their next
ctx.Done() check instead of running to exhaustion.
Comment thread posts_create.go Outdated
When one child container fails and cancelChildren() fires, siblings
receive context.Canceled. The error reporting loop now skips these
cancellation artifacts and surfaces the real root-cause failure first.
@tirthpatell
Copy link
Copy Markdown
Owner Author

@greptileai

- isRetryableError now checks netErr.IsTransient alongside Temporary,
  so a non-temporary but transient NetworkError is correctly retried
- Carousel polling calls cancelChildren() as soon as any failure is
  received, stopping siblings immediately instead of at function return
wrapNetworkError now stores the original error as Cause, so
errors.Is(err, context.Canceled) works through NetworkError wrappers.
This fixes the carousel sibling-error filter which relies on
errors.Is to distinguish real failures from cancellation side-effects.
@tirthpatell tirthpatell merged commit 5280c7f into main Mar 13, 2026
9 checks passed
@tirthpatell tirthpatell deleted the fix/retry-and-container-lifecycle branch March 13, 2026 01:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant