fix: retry tarball upload on transient SSL and connection errors#279
fix: retry tarball upload on transient SSL and connection errors#279
Conversation
runpod-Henrik
left a comment
There was a problem hiding this comment.
1. Core fix — correct
Retry logic, cert-vs-transient SSL distinction, Content-Length header, and the explicit no-auth-header comment for presigned URLs are all right. Test coverage is solid across the important paths.
2. Issue: users see nothing during retry delays
Retry warnings go to log.warning(), not console.print(). A user who hits a transient SSL error waits 2–30s per delay with no terminal output — the CLI looks frozen. This is the most likely scenario to generate a support ticket ("flash deploy hung").
Quick fix:
log.warning(
"Upload attempt %d/%d failed: %s. Retrying in %.1fs...",
attempt + 1, UPLOAD_MAX_RETRIES, last_exc, delay,
)
console.print(
f"[yellow]Upload attempt {attempt + 1}/{UPLOAD_MAX_RETRIES} failed "
f"({last_exc}). Retrying in {delay:.1f}s...[/yellow]"
)Nits
resp.close()is skipped ifraise_for_status()throws — wrap intry/finally- After exhausted retries the re-raised exception hits
_handle_deploy_errorwith no context that retries were attempted — a "Upload failed after 3 attempts" prefix onlast_excwould help
Verdict: PASS WITH NITS — one ask before merge: surface the retry status to the console so users know the CLI is working, not hung.
🤖 Reviewed by Henrik's AI-Powered Bug Finder
jhcipar
left a comment
There was a problem hiding this comment.
Retry warnings go to log.warning(), not console.print(). A user who hits a transient SSL error waits 2–30s per delay with no terminal output — the CLI looks frozen. This is the most likely scenario to generate a support ticket ("flash deploy hung").
I'm tempted to say this would be a nice to have, as long as the output isn't too chatty
It'd be nice to show some progress. but p0 is getting the fix out the door I think
we have the persistent loading status so I think its fine. |
|
📝 Documentation updates detected! New suggestion: Add SSL certificate verification troubleshooting for Flash deploy Tip: Attach PDFs in Slack messages to Promptless—it can even extract images from them 📎 |
Tarball uploads to Cloudflare R2 presigned URLs use a single
requests.put()call with no timeout, no retry, and noContent-Lengthheader. Transient SSL errors (SSLV3_ALERT_BAD_RECORD_MAC), connection resets, and timeouts cause immediate deploy failure.Changes:
_upload_tarball()helper retries up to 3 times with exponential backoff (2s base, 30s max) onSSLError,ConnectionError, andTimeoutContent-Lengthheader for large PUT uploadsCERTIFICATE_VERIFY_FAILED) are detected and surfaced with actionable guidance instead of retriedSSLErrorand prints the message cleanly instead of a raw tracebackconstants.pyfor tunabilityadded new tests covering retry on SSL/connection/timeout errors, cert verification detection, exhausted retries, header correctness, and timeout configuration.
Fixes AE-2549