Invert SCP fallback to deny-list and increase install timeout to 3 minutes#10533
Conversation
- Increase INSTALL_TIMEOUT from 60s to 180s to accommodate slow hosts - Add --connect-timeout 15 to curl in install script so it fails fast on unreachable CDN hosts instead of blocking for the full TCP timeout - Refactor install_binary() SCP fallback from allow-list (only exit code 3) to deny-list (skip only exit code 2 = unsupported arch/OS). All other failures now trigger SCP fallback, covering DNS failures, SSL errors, connection refused, and other network issues (~65% of install errors) - Also trigger SCP fallback on timeout (previously returned TimedOut) - Add should_skip_scp_fallback() helper with unit tests - Add setup_tests for INSTALL_TIMEOUT value and curl connect-timeout Co-Authored-By: Oz <oz-agent@warp.dev>
|
I'm starting a first review of this pull request. You can view the conversation on Warp. I completed the review and no human review was requested for this pull request. Comment Powered by Oz |
Co-Authored-By: Oz <oz-agent@warp.dev>
There was a problem hiding this comment.
Overview
This PR broadens SSH remote-server install recovery so non-successful install-script exits and script timeouts fall back to the existing local-download-plus-SCP install path, adds a curl connect timeout, and raises the direct install timeout.
Concerns
- The direct install timeout was raised to 180s, but the existing SCP fallback timeout remains 120s even though the fallback is documented as the longer budget and is now invoked for more failure modes. Slow SSH uploads can still fail prematurely.
Verdict
Found: 0 critical, 1 important, 0 suggestions
Request changes
Comment /oz-review on this pull request to retrigger a review (up to 3 times on the same pull request).
Powered by Oz
…dotdev/warp into orchestrator/scp-fallback-inversion
| fn should_skip_scp_fallback(exit_code: i32) -> bool { | ||
| match exit_code { | ||
| // Unsupported arch/OS — SCP won't change the architecture | ||
| 2 => true, |
There was a problem hiding this comment.
Is unsupported arch / OS the only error where we should skip fallback?
I thought you mentioned things like the permission errors and missing tar will also cause SCP install to fail too. I am not sure exit_code alone is sufficient here
There was a problem hiding this comment.
Yeah there are cases where falling back to SCP will also fail but I still think it's better to be overly permissive to fallback to SCP than under permissive.
The tradeoff is slightly more latency for a session where we'd fail to install anyway, but that seems better than the other way around and we can always add more cases where early exit and don't do SCP fallback.
Fwiw: VSCode does the same. If the remote box doesn't have tar installed it will still fallback to SCP and then fail again.
…nutes (#10533) ## Description I used orchestration to analye 1k remote server installation errors from telemetry. **~65% of all failures** are caused by the remote host being unable to reach the CDN — but the install succeeds when we bypass the remote's network by downloading on the client and uploading via SCP. Today, SCP fallback only triggers for one specific case (exit code 3 = no curl/wget). The other 650+ errors per 1,000 installs fail unnecessarily. **Breakdown of errors this PR fixes:** | Error | Count | % | Curl Exit Code | |-------|------:|--:|----------------| | Script timeout | 477 | 47.7% | n/a (our 60s hard cap) | | DNS resolution failure | 111 | 11.1% | 6 | | Connection refused / unreachable | 32 | 3.2% | 7 | | SSL/TLS errors | 24 | 2.4% | 35, 60, 77 | | Connection reset by peer | 12 | 1.2% | 56 | | Failure writing output | 7 | 0.7% | 23 | | HTTP 403 (corporate proxy) | 5 | 0.5% | 22 | | **Total fixed** | **668** | **~65%** | | All of these errors share the same root cause: the remote host can't download from `app.warp.dev`. The SCP fallback path (which already exists for the no-curl/wget case) downloads on the client machine and uploads via the existing SSH tunnel, completely bypassing the remote's network. This PR extends that fallback to cover all network failures. **Solution:** Invert the SCP fallback logic from an allowlist to a deny-list: - **Before:** Only exit code 3 triggers SCP fallback. Everything else fails. - **After:** ALL script failures trigger SCP fallback, EXCEPT exit code 2 (unsupported arch/OS) where SCP wouldn't help. Additional changes: - **Timeout increased** from 60s to 180s (3 minutes) — VS Code doesn't hard-cap at all; our 60s was too aggressive - **`--connect-timeout 15`** added to curl so hung connections fail fast, leaving budget for SCP fallback - **Timeout now triggers SCP fallback** instead of returning `Error::TimedOut` ## Testing I had the agent test by creating docker containers that would demonstrate the issues and then verify the changes fixed the issue. ### Live Demo: DNS Failure on Remote Host Built Warp from this branch in a cloud environment and SSH'd into a Docker container with broken DNS to demonstrate the SCP fallback end-to-end. **Setup:** - Docker container: Ubuntu 22.04 with SSH server, DNS broken (`nameserver 192.0.2.1` in `/etc/resolv.conf`) so curl cannot resolve any hosts - Warp built from this branch (`cargo build --bin dev`) - SSH extension install mode set to `always_install` - Connected via `ssh -o StrictHostKeyChecking=no root@localhost -p 2201` **Result: SCP fallback triggered correctly.** Warp logs show the full sequence: ``` [INFO] Installing remote server binary to ~/.warp-dev/remote-server/oz-dev-0.1.0 [INFO] Install script failed (exit 6), falling back to SCP upload [INFO] Downloading tarball locally from https://staging.warp.dev/download/cli?package=tar&os=linux&arch=x86_64&channel=dev&version=0.1.0 ``` 1. Remote install script ran on the container — curl could not resolve CDN hostname (DNS broken) 2. Script exited with code 6 (DNS resolution failure) 3. **SCP fallback triggered automatically** — `"Install script failed (exit 6), falling back to SCP upload"` 4. Warp attempted the local download + SCP upload path **Before vs After:** - **Before (allowlist):** Only exit code 3 (no curl/wget) triggered SCP fallback. Exit code 6 (DNS failure) would show a raw error to the user with no recovery path. - **After (deny-list):** Exit code 6 now correctly triggers SCP fallback. All network-related failures are automatically retried via SCP. ## Agent Mode - [x] Warp Agent Mode - This PR was created via Warp AI Agent Mode Co-Authored-By: Oz <oz-agent@warp.dev> CHANGELOG-IMPROVEMENT: Improved SSH extension install reliability — network failures now automatically retry via SCP fallback --------- Co-authored-by: Oz <oz-agent@warp.dev>
…nutes (#10533) ## Description I used orchestration to analye 1k remote server installation errors from telemetry. **~65% of all failures** are caused by the remote host being unable to reach the CDN — but the install succeeds when we bypass the remote's network by downloading on the client and uploading via SCP. Today, SCP fallback only triggers for one specific case (exit code 3 = no curl/wget). The other 650+ errors per 1,000 installs fail unnecessarily. **Breakdown of errors this PR fixes:** | Error | Count | % | Curl Exit Code | |-------|------:|--:|----------------| | Script timeout | 477 | 47.7% | n/a (our 60s hard cap) | | DNS resolution failure | 111 | 11.1% | 6 | | Connection refused / unreachable | 32 | 3.2% | 7 | | SSL/TLS errors | 24 | 2.4% | 35, 60, 77 | | Connection reset by peer | 12 | 1.2% | 56 | | Failure writing output | 7 | 0.7% | 23 | | HTTP 403 (corporate proxy) | 5 | 0.5% | 22 | | **Total fixed** | **668** | **~65%** | | All of these errors share the same root cause: the remote host can't download from `app.warp.dev`. The SCP fallback path (which already exists for the no-curl/wget case) downloads on the client machine and uploads via the existing SSH tunnel, completely bypassing the remote's network. This PR extends that fallback to cover all network failures. **Solution:** Invert the SCP fallback logic from an allowlist to a deny-list: - **Before:** Only exit code 3 triggers SCP fallback. Everything else fails. - **After:** ALL script failures trigger SCP fallback, EXCEPT exit code 2 (unsupported arch/OS) where SCP wouldn't help. Additional changes: - **Timeout increased** from 60s to 180s (3 minutes) — VS Code doesn't hard-cap at all; our 60s was too aggressive - **`--connect-timeout 15`** added to curl so hung connections fail fast, leaving budget for SCP fallback - **Timeout now triggers SCP fallback** instead of returning `Error::TimedOut` ## Testing I had the agent test by creating docker containers that would demonstrate the issues and then verify the changes fixed the issue. ### Live Demo: DNS Failure on Remote Host Built Warp from this branch in a cloud environment and SSH'd into a Docker container with broken DNS to demonstrate the SCP fallback end-to-end. **Setup:** - Docker container: Ubuntu 22.04 with SSH server, DNS broken (`nameserver 192.0.2.1` in `/etc/resolv.conf`) so curl cannot resolve any hosts - Warp built from this branch (`cargo build --bin dev`) - SSH extension install mode set to `always_install` - Connected via `ssh -o StrictHostKeyChecking=no root@localhost -p 2201` **Result: SCP fallback triggered correctly.** Warp logs show the full sequence: ``` [INFO] Installing remote server binary to ~/.warp-dev/remote-server/oz-dev-0.1.0 [INFO] Install script failed (exit 6), falling back to SCP upload [INFO] Downloading tarball locally from https://staging.warp.dev/download/cli?package=tar&os=linux&arch=x86_64&channel=dev&version=0.1.0 ``` 1. Remote install script ran on the container — curl could not resolve CDN hostname (DNS broken) 2. Script exited with code 6 (DNS resolution failure) 3. **SCP fallback triggered automatically** — `"Install script failed (exit 6), falling back to SCP upload"` 4. Warp attempted the local download + SCP upload path **Before vs After:** - **Before (allowlist):** Only exit code 3 (no curl/wget) triggered SCP fallback. Exit code 6 (DNS failure) would show a raw error to the user with no recovery path. - **After (deny-list):** Exit code 6 now correctly triggers SCP fallback. All network-related failures are automatically retried via SCP. ## Agent Mode - [x] Warp Agent Mode - This PR was created via Warp AI Agent Mode Co-Authored-By: Oz <oz-agent@warp.dev> CHANGELOG-IMPROVEMENT: Improved SSH extension install reliability — network failures now automatically retry via SCP fallback --------- Co-authored-by: Oz <oz-agent@warp.dev>
…nutes (#10533) ## Description I used orchestration to analye 1k remote server installation errors from telemetry. **~65% of all failures** are caused by the remote host being unable to reach the CDN — but the install succeeds when we bypass the remote's network by downloading on the client and uploading via SCP. Today, SCP fallback only triggers for one specific case (exit code 3 = no curl/wget). The other 650+ errors per 1,000 installs fail unnecessarily. **Breakdown of errors this PR fixes:** | Error | Count | % | Curl Exit Code | |-------|------:|--:|----------------| | Script timeout | 477 | 47.7% | n/a (our 60s hard cap) | | DNS resolution failure | 111 | 11.1% | 6 | | Connection refused / unreachable | 32 | 3.2% | 7 | | SSL/TLS errors | 24 | 2.4% | 35, 60, 77 | | Connection reset by peer | 12 | 1.2% | 56 | | Failure writing output | 7 | 0.7% | 23 | | HTTP 403 (corporate proxy) | 5 | 0.5% | 22 | | **Total fixed** | **668** | **~65%** | | All of these errors share the same root cause: the remote host can't download from `app.warp.dev`. The SCP fallback path (which already exists for the no-curl/wget case) downloads on the client machine and uploads via the existing SSH tunnel, completely bypassing the remote's network. This PR extends that fallback to cover all network failures. **Solution:** Invert the SCP fallback logic from an allowlist to a deny-list: - **Before:** Only exit code 3 triggers SCP fallback. Everything else fails. - **After:** ALL script failures trigger SCP fallback, EXCEPT exit code 2 (unsupported arch/OS) where SCP wouldn't help. Additional changes: - **Timeout increased** from 60s to 180s (3 minutes) — VS Code doesn't hard-cap at all; our 60s was too aggressive - **`--connect-timeout 15`** added to curl so hung connections fail fast, leaving budget for SCP fallback - **Timeout now triggers SCP fallback** instead of returning `Error::TimedOut` ## Testing I had the agent test by creating docker containers that would demonstrate the issues and then verify the changes fixed the issue. ### Live Demo: DNS Failure on Remote Host Built Warp from this branch in a cloud environment and SSH'd into a Docker container with broken DNS to demonstrate the SCP fallback end-to-end. **Setup:** - Docker container: Ubuntu 22.04 with SSH server, DNS broken (`nameserver 192.0.2.1` in `/etc/resolv.conf`) so curl cannot resolve any hosts - Warp built from this branch (`cargo build --bin dev`) - SSH extension install mode set to `always_install` - Connected via `ssh -o StrictHostKeyChecking=no root@localhost -p 2201` **Result: SCP fallback triggered correctly.** Warp logs show the full sequence: ``` [INFO] Installing remote server binary to ~/.warp-dev/remote-server/oz-dev-0.1.0 [INFO] Install script failed (exit 6), falling back to SCP upload [INFO] Downloading tarball locally from https://staging.warp.dev/download/cli?package=tar&os=linux&arch=x86_64&channel=dev&version=0.1.0 ``` 1. Remote install script ran on the container — curl could not resolve CDN hostname (DNS broken) 2. Script exited with code 6 (DNS resolution failure) 3. **SCP fallback triggered automatically** — `"Install script failed (exit 6), falling back to SCP upload"` 4. Warp attempted the local download + SCP upload path **Before vs After:** - **Before (allowlist):** Only exit code 3 (no curl/wget) triggered SCP fallback. Exit code 6 (DNS failure) would show a raw error to the user with no recovery path. - **After (deny-list):** Exit code 6 now correctly triggers SCP fallback. All network-related failures are automatically retried via SCP. ## Agent Mode - [x] Warp Agent Mode - This PR was created via Warp AI Agent Mode Co-Authored-By: Oz <oz-agent@warp.dev> CHANGELOG-IMPROVEMENT: Improved SSH extension install reliability — network failures now automatically retry via SCP fallback --------- Co-authored-by: Oz <oz-agent@warp.dev>
Description
I used orchestration to analye 1k remote server installation errors from telemetry.
~65% of all failures are caused by the remote host being unable to reach the CDN — but the install succeeds when we bypass the remote's network by downloading on the client and uploading via SCP. Today, SCP fallback only triggers for one specific case (exit code 3 = no curl/wget). The other 650+ errors per 1,000 installs fail unnecessarily.
Breakdown of errors this PR fixes:
All of these errors share the same root cause: the remote host can't download from
app.warp.dev. The SCP fallback path (which already exists for the no-curl/wget case) downloads on the client machine and uploads via the existing SSH tunnel, completely bypassing the remote's network. This PR extends that fallback to cover all network failures.Solution: Invert the SCP fallback logic from an allowlist to a deny-list:
Additional changes:
--connect-timeout 15added to curl so hung connections fail fast, leaving budget for SCP fallbackError::TimedOutTesting
I had the agent test by creating docker containers that would demonstrate the issues and then verify the changes fixed the issue.
Live Demo: DNS Failure on Remote Host
Built Warp from this branch in a cloud environment and SSH'd into a Docker container with broken DNS to demonstrate the SCP fallback end-to-end.
Setup:
nameserver 192.0.2.1in/etc/resolv.conf) so curl cannot resolve any hostscargo build --bin dev)always_installssh -o StrictHostKeyChecking=no root@localhost -p 2201Result: SCP fallback triggered correctly. Warp logs show the full sequence:
"Install script failed (exit 6), falling back to SCP upload"Before vs After:
Agent Mode
Co-Authored-By: Oz oz-agent@warp.dev
CHANGELOG-IMPROVEMENT: Improved SSH extension install reliability — network failures now automatically retry via SCP fallback