`railway environment new --duplicate` leaves an orphaned empty environment when duplication exceeds the hardcoded 30s client timeout

### Summary

`railway environment new <name> --duplicate <source>` is **not atomic**. It performs duplication as several separate GraphQL requests:

1. Create a brand-new **empty** environment (`environmentCreate` with `source_id: None`).
2. Fetch the source environment's config and service instances.
3. Apply that config to the new environment (the step that actually copies services + volumes).

All of these share a **single, hardcoded 30s HTTP timeout** in the GraphQL client. When step 3 takes longer than 30s, the CLI aborts with `operation timed out` — **but the empty environment created in step 1 is left behind** with zero services and no volume.

That orphan ("husk") is unusable as-is: re-running `--duplicate` either skips (idempotency checks see the name) or fails with `an environment with that name already exists`, yet the environment has none of the duplicated services. Recovery requires manually deleting it and retrying, and the retry races the same 30s timeout.

**There is no way to configure or extend this timeout** — no flag, no env var, no config setting.

### Environment

| | |
|---|---|
| CLI version | `4.65.0` (also reproduced on `4.64.0`) |
| Install | `npm install -g @railway/cli` (GitHub Actions `ubuntu-latest`, and macOS) |
| Backend | `https://backboard.railway.com/graphql/v2` |
| Source env | 3 services (web, worker, Redis) + one ~1 GB Postgres volume |

### Steps to reproduce

```bash
# Source env should have a few services and a non-trivial Postgres volume.
railway environment new pr-test --duplicate staging \
  --service-config <SERVICE_ID> 'source.branch' 'some-branch'
```

Run repeatedly. Most invocations finish in a few seconds; intermittently the config-apply step exceeds 30s and the command fails.

### Expected behavior

`--duplicate` either fully succeeds or fully fails. A timeout (or any failure) after the empty environment is created should **roll back / delete** the partially-created environment, not leave it orphaned. Long-running duplications should also be able to complete (configurable timeout, or server-side atomic duplication).

### Actual behavior

The command exits non-zero with `operation timed out`, and an **empty environment (0 services, no volume) is left behind** in the project.

### Evidence (real run)

Command (GitHub Actions, CLI 4.65.0):

```bash
railway environment new "pr-843" --duplicate staging \
  --service-config <WEB_SERVICE_ID>    'source.branch' "doc-audit-phase-1" \
  --service-config <WORKER_SERVICE_ID> 'source.branch' "doc-audit-phase-1"
```

<details>
<summary>CLI output (timestamps from CI log; note the 30.1s gap between the second and third lines)</summary>

```
2026-05-29T14:37:55.78Z  > Environment name pr-843
2026-05-29T14:37:55.78Z  > Duplicate from staging
2026-05-29T14:38:25.88Z  Failed to fetch: error sending request for url (https://backboard.railway.com/graphql/v2)
2026-05-29T14:38:25.88Z  Caused by:
2026-05-29T14:38:25.88Z      0: error sending request for url (https://backboard.railway.com/graphql/v2)
2026-05-29T14:38:25.88Z      1: operation timed out
##[error]Process completed with exit code 1.
```

</details>

<details>
<summary>Resulting backend state (queried via GraphQL right after)</summary>

- Environment `pr-843` **exists**, `createdAt: 2026-05-29T14:38:15Z` (step 1's empty-env create succeeded ~20s in; the timeout fired at the 30s mark on a later request).
- `pr-843` has **0 service instances** and **no volume instance**.
- A sibling run on the **same CLI version** duplicated the same `staging` source **successfully in ~6s** (environment, volume, and Redis all materialized), so this points to backend config-apply latency against the 30s cap, not a malformed request or a client-version regression.

</details>

### Root cause (source references, `v4.65.0`)

**Hardcoded, non-configurable timeout** — [`src/client.rs`](https://github.com/railwayapp/cli/blob/v4.65.0/src/client.rs), `build_client()`:

```rust
fn build_client(headers: HeaderMap) -> Client {
    Client::builder()
        .danger_accept_invalid_certs(matches!(Configs::get_environment_id(), Environment::Dev))
        .user_agent(consts::get_user_agent())
        .default_headers(headers)
        .timeout(Duration::from_secs(30))   // hardcoded; no env-var / flag / config override
        .build()
        .unwrap()
}
```

(Set in #636, "bump gql client timeout to 30s", 2025-06-27; used by `post_graphql()` for every GraphQL call.)

**Non-atomic duplicate** — [`src/commands/environment/new.rs`](https://github.com/railwayapp/cli/blob/v4.65.0/src/commands/environment/new.rs), `new_environment()`:

```rust
// Step 1: Create a new empty environment (no sourceEnvironmentId)
let vars = mutations::environment_create::Variables {
    project_id: project.id.clone(),
    name,
    source_id: None,                 // backend's atomic-duplicate path is NOT used
    apply_changes_in_background: None,
};
let response = post_graphql::<mutations::EnvironmentCreate, _>(...).await?;  // empty env now EXISTS
let env_id = response.environment_create.id.clone();

if let Some(ref source_env_id) = duplicate_id {
    let source_config = fetch_environment_config(...).await?.config;         // request 2 (30s cap)
    let source_config = prepare_config_for_duplication(source_config);
    let source_instances = get_environment_instances(...).await?;           // request 3 (30s cap)
    let merged_config = merge_configs(source_config, override_config);
    if !config::is_empty(&merged_config) {
        apply_environment_config(&client, &configs, &env_id, merged_config).await?;  // request 4 (30s cap) — copies services + volume
    }
}
// On timeout/error here, the env created in Step 1 is never cleaned up → husk.
```

The `EnvironmentCreate` mutation already accepts a `source_id` (and `apply_changes_in_background`), i.e. the backend supports an atomic duplicate. The CLI passes `source_id: None` and reimplements duplication client-side across multiple round-trips, which is what creates the partial-failure window.
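
The partial-failure window described above could also be closed client-side by pairing the create call with a compensating delete. The following is a minimal, synchronous sketch of that pattern, not the CLI's actual code. `create_with_rollback` and its three closure parameters are hypothetical names standing in for the `environmentCreate` call, the config fetch/apply steps, and an environment-delete call; the real functions are async and would need `.await`.

```rust
/// Runs `create`, then `populate`. If `populate` fails, calls `delete` on the
/// freshly created resource and returns the original error.
///
/// All three closures are hypothetical stand-ins (assumptions, not CLI APIs):
/// - `create`:   creates the empty environment and returns its id
/// - `populate`: fetches the source config and applies it to the new environment
/// - `delete`:   deletes the environment with the given id
fn create_with_rollback<Id, E>(
    create: impl FnOnce() -> Result<Id, E>,
    populate: impl FnOnce(&Id) -> Result<(), E>,
    delete: impl FnOnce(&Id) -> Result<(), E>,
) -> Result<Id, E> {
    // If creation itself fails there is nothing to clean up.
    let id = create()?;

    match populate(&id) {
        Ok(()) => Ok(id),
        Err(err) => {
            // Best-effort cleanup: a failed delete must not mask the root cause,
            // so its result is deliberately ignored here.
            let _ = delete(&id);
            Err(err)
        }
    }
}
```

Cleanup in this sketch is best-effort only: if the delete call itself times out, the husk still remains, which is one reason a single server-side duplicate operation would be the more robust option.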

### Suggested fixes (in priority order)

1. **Atomicity / cleanup (the real fix):** roll back or delete the empty environment if any subsequent step fails, **or** use the backend's atomic `environmentCreate(sourceEnvironmentId: …)` path so duplication is one server-side operation. A longer timeout alone still orphans environments when the copy fails partway.
2. **Configurable timeout (stopgap):** support `RAILWAY_HTTP_TIMEOUT` (env var) and/or a `--timeout` flag. 30s is too short to duplicate an environment with a multi-service config and a ~1 GB volume, and there's currently no escape hatch.
3. **Backend latency:** investigate why config-apply on a 3-service + ~1 GB-volume environment intermittently exceeds 30s when it used to complete in single-digit seconds.
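
For fix 2, one possible shape for the override is sketched below. It assumes the `RAILWAY_HTTP_TIMEOUT` variable proposed above holds a whole number of seconds; that variable does not exist in the CLI today, and `resolve_timeout` and `timeout_from_env` are hypothetical helper names. Invalid or zero values fall back to the current 30s default rather than failing the command.

```rust
use std::time::Duration;

/// Default matching the value currently hardcoded in `build_client()`.
const DEFAULT_TIMEOUT: Duration = Duration::from_secs(30);

/// Parses an optional override given in whole seconds.
/// Falls back to the default when the value is absent, non-numeric, or zero.
fn resolve_timeout(raw: Option<&str>) -> Duration {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&secs| secs > 0)
        .map(Duration::from_secs)
        .unwrap_or(DEFAULT_TIMEOUT)
}

/// Reads the hypothetical `RAILWAY_HTTP_TIMEOUT` variable from the process
/// environment and resolves it to a timeout.
fn timeout_from_env() -> Duration {
    resolve_timeout(std::env::var("RAILWAY_HTTP_TIMEOUT").ok().as_deref())
}
```

In `build_client()`, the result could then replace the literal, for example `.timeout(timeout_from_env())`. Keeping the parsing in a pure function (`resolve_timeout`) makes the fallback behavior easy to unit-test without touching process state.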

