Skip to content

fix: retry tarball upload on transient SSL and connection errors#279

Merged
KAJdev merged 1 commit intomainfrom
zeke/fix-ssl-upload-retry
Mar 19, 2026
Merged

fix: retry tarball upload on transient SSL and connection errors#279
KAJdev merged 1 commit intomainfrom
zeke/fix-ssl-upload-retry

Conversation

@KAJdev
Copy link
Contributor

@KAJdev KAJdev commented Mar 18, 2026

Tarball uploads to Cloudflare R2 presigned URLs use a single requests.put() call with no timeout, no retry, and no Content-Length header. Transient SSL errors (SSLV3_ALERT_BAD_RECORD_MAC), connection resets, and timeouts cause immediate deploy failure.

Changes:

  • _upload_tarball() helper retries up to 3 times with exponential backoff (2s base, 30s max) on SSLError, ConnectionError, and Timeout
  • Explicit 600s (10 min) timeout per attempt
  • Explicit Content-Length header for large PUT uploads
  • SSL certificate verification errors (CERTIFICATE_VERIFY_FAILED) are detected and surfaced with actionable guidance instead of retried
  • Deploy command catches SSLError and prints the message cleanly instead of a raw traceback
  • Upload retry constants in constants.py for tunability

added new tests covering retry on SSL/connection/timeout errors, cert verification detection, exhausted retries, header correctness, and timeout configuration.

Fixes AE-2549

@KAJdev KAJdev requested review from deanq and jhcipar March 18, 2026 22:51
Copy link
Contributor

@runpod-Henrik runpod-Henrik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1. Core fix — correct

Retry logic, cert-vs-transient SSL distinction, Content-Length header, and the explicit no-auth-header comment for presigned URLs are all right. Test coverage is solid across the important paths.

2. Issue: users see nothing during retry delays

Retry warnings go to log.warning(), not console.print(). A user who hits a transient SSL error waits 2–30s per delay with no terminal output — the CLI looks frozen. This is the most likely scenario to generate a support ticket ("flash deploy hung").

Quick fix:

log.warning(
    "Upload attempt %d/%d failed: %s. Retrying in %.1fs...",
    attempt + 1, UPLOAD_MAX_RETRIES, last_exc, delay,
)
console.print(
    f"[yellow]Upload attempt {attempt + 1}/{UPLOAD_MAX_RETRIES} failed "
    f"({last_exc}). Retrying in {delay:.1f}s...[/yellow]"
)

Nits

  • resp.close() is skipped if raise_for_status() throws — wrap in try/finally
  • After exhausted retries the re-raised exception hits _handle_deploy_error with no context that retries were attempted — a "Upload failed after 3 attempts" prefix on last_exc would help

Verdict: PASS WITH NITS — one ask before merge: surface the retry status to the console so users know the CLI is working, not hung.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

Copy link
Contributor

@jhcipar jhcipar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retry warnings go to log.warning(), not console.print(). A user who hits a transient SSL error waits 2–30s per delay with no terminal output — the CLI looks frozen. This is the most likely scenario to generate a support ticket ("flash deploy hung").

I'm tempted to say this would be a nice to have, as long as the output isn't too chatty
It'd be nice to show some progress. but p0 is getting the fix out the door I think

@KAJdev
Copy link
Contributor Author

KAJdev commented Mar 19, 2026

Retry warnings go to log.warning(), not console.print(). A user who hits a transient SSL error waits 2–30s per delay with no terminal output — the CLI looks frozen. This is the most likely scenario to generate a support ticket ("flash deploy hung").

I'm tempted to say this would be a nice to have, as long as the output isn't too chatty It'd be nice to show some progress. but p0 is getting the fix out the door I think

we have the persistent loading status so I think its fine.

@KAJdev KAJdev merged commit 564f51e into main Mar 19, 2026
4 checks passed
@KAJdev KAJdev deleted the zeke/fix-ssl-upload-retry branch March 19, 2026 01:39
@promptless
Copy link

promptless bot commented Mar 23, 2026

📝 Documentation updates detected!

New suggestion: Add SSL certificate verification troubleshooting for Flash deploy


Tip: Attach PDFs in Slack messages to Promptless—it can even extract images from them 📎

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants