railway environment new --duplicate leaves an orphaned empty environment when duplication exceeds the hardcoded 30s client timeout #923

@freemarmoset

Summary

railway environment new <name> --duplicate <source> is not atomic. It performs duplication as several separate GraphQL requests:

  1. Create a brand-new empty environment (environmentCreate with source_id: None).
  2. Fetch the source environment's config.
  3. Apply that config to the new environment (the step that actually copies services + volumes).

All of these share a single, hardcoded 30s HTTP timeout in the GraphQL client. When step 3 takes longer than 30s, the CLI aborts with "operation timed out", but the empty environment created in step 1 is left behind with zero services and no volume.

That orphan ("husk") stays broken until it is manually deleted: re-running --duplicate either skips (idempotency checks see the name) or errors with "an environment with that name already exists", yet the environment has none of the duplicated services. Recovery means deleting it and retrying, and the retry races the same 30s timeout.

There is no way to configure or extend this timeout — no flag, no env var, no config setting.

Environment

CLI version: 4.65.0 (also reproduced on 4.64.0)
Install: npm install -g @railway/cli (GitHub Actions ubuntu-latest, and macOS)
Backend: https://backboard.railway.com/graphql/v2
Source env: 3 services (web, worker, Redis) + one ~1 GB Postgres volume

Steps to reproduce

# Source env should have a few services and a non-trivial Postgres volume.
railway environment new pr-test --duplicate staging \
  --service-config <SERVICE_ID> 'source.branch' 'some-branch'

Run repeatedly. Most invocations finish in a few seconds; intermittently the config-apply step exceeds 30s and the command fails.

Expected behavior

--duplicate either fully succeeds or fully fails. A timeout (or any failure) after the empty environment is created should roll back / delete the partially-created environment, not leave it orphaned. Long-running duplications should also be able to complete (configurable timeout, or server-side atomic duplication).

Actual behavior

The command exits non-zero with operation timed out, and an empty environment (0 services, no volume) is left behind in the project.

Evidence (real run)

Command (GitHub Actions, CLI 4.65.0):

railway environment new "pr-843" --duplicate staging \
  --service-config <WEB_SERVICE_ID>    'source.branch' "doc-audit-phase-1" \
  --service-config <WORKER_SERVICE_ID> 'source.branch' "doc-audit-phase-1"

CLI output (timestamps from CI log; note the 30.1s gap, just past the 30s timeout):

2026-05-29T14:37:55.78Z  > Environment name pr-843
2026-05-29T14:37:55.78Z  > Duplicate from staging
2026-05-29T14:38:25.88Z  Failed to fetch: error sending request for url (https://backboard.railway.com/graphql/v2)
2026-05-29T14:38:25.88Z  Caused by:
2026-05-29T14:38:25.88Z      0: error sending request for url (https://backboard.railway.com/graphql/v2)
2026-05-29T14:38:25.88Z      1: operation timed out
##[error]Process completed with exit code 1.

Resulting backend state (queried via GraphQL right after):

  • Environment pr-843 exists, createdAt: 2026-05-29T14:38:15Z (step 1's empty-env create succeeded ~20s in; the timeout fired at the 30s mark on a later request).
  • pr-843 has 0 service instances and no volume instance.
  • A sibling run on the same CLI version duplicated the same staging source successfully in ~6s (env + volume + Redis all materialized within 6s) — so this is purely backend config-apply latency vs. the 30s cap, not a malformed request or a client-version regression.
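
The timing claims above can be checked with a little arithmetic. The sketch below is only an illustration: the helper names are made up for this issue, and it assumes the two-digit fractional seconds shown in the CI log. With the logged values it gives a 30.10s gap between the first log line and the failure, and about 19.2s between the first log line and the environment's createdAt.

```rust
/// Parse a UTC time of day such as "14:37:55.78" into centiseconds since midnight.
/// Assumes exactly two fractional digits, as in the CI log quoted above.
fn centis_since_midnight(t: &str) -> Option<u64> {
    let mut parts = t.split(':');
    let hours: u64 = parts.next()?.parse().ok()?;
    let minutes: u64 = parts.next()?.parse().ok()?;
    let (secs, frac) = parts.next()?.split_once('.')?;
    if frac.len() != 2 {
        return None;
    }
    let secs: u64 = secs.parse().ok()?;
    let frac: u64 = frac.parse().ok()?;
    Some(((hours * 60 + minutes) * 60 + secs) * 100 + frac)
}

/// Elapsed centiseconds between two times on the same day.
/// Returns None if either value fails to parse or `end` is before `start`.
fn gap_centis(start: &str, end: &str) -> Option<u64> {
    centis_since_midnight(end)?.checked_sub(centis_since_midnight(start)?)
}
```

Calling gap_centis("14:37:55.78", "14:38:25.88") yields 3010 centiseconds, i.e. 30.10s.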

Root cause (source references, v4.65.0)

Hardcoded, non-configurable timeout: src/client.rs, build_client():

fn build_client(headers: HeaderMap) -> Client {
    Client::builder()
        .danger_accept_invalid_certs(matches!(Configs::get_environment_id(), Environment::Dev))
        .user_agent(consts::get_user_agent())
        .default_headers(headers)
        .timeout(Duration::from_secs(30))   // hardcoded; no env-var / flag / config override
        .build()
        .unwrap()
}

(Set in #636, "bump gql client timeout to 30s", 2025-06-27; used by post_graphql() for every GraphQL call.)

Non-atomic duplicate: src/commands/environment/new.rs, new_environment():

// Step 1: Create a new empty environment (no sourceEnvironmentId)
let vars = mutations::environment_create::Variables {
    project_id: project.id.clone(),
    name,
    source_id: None,                 // backend's atomic-duplicate path is NOT used
    apply_changes_in_background: None,
};
let response = post_graphql::<mutations::EnvironmentCreate, _>(...).await?;  // empty env now EXISTS
let env_id = response.environment_create.id.clone();

if let Some(ref source_env_id) = duplicate_id {
    let source_config = fetch_environment_config(...).await?.config;         // request 2 (30s cap)
    let source_config = prepare_config_for_duplication(source_config);
    let source_instances = get_environment_instances(...).await?;           // request 3 (30s cap)
    let merged_config = merge_configs(source_config, override_config);
    if !config::is_empty(&merged_config) {
        apply_environment_config(&client, &configs, &env_id, merged_config).await?;  // request 4 (30s cap) — copies services + volume
    }
}
// On timeout/error here, the env created in Step 1 is never cleaned up → husk.

The EnvironmentCreate mutation already accepts a source_id (and apply_changes_in_background), i.e. the backend supports an atomic duplicate. The CLI passes source_id: None and reimplements duplication client-side across multiple round-trips, which is what creates the partial-failure window.
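
To make the partial-failure window concrete, here is a minimal sketch of what client-side cleanup could look like. It is not the CLI's code: the Backend trait, its method names, and FakeBackend are hypothetical stand-ins for the GraphQL calls, written synchronously with std only so the control flow is easy to follow.

```rust
/// Hypothetical stand-in for the GraphQL calls the CLI makes; not the real client API.
trait Backend {
    fn create_empty_environment(&mut self, name: &str) -> Result<String, String>;
    fn apply_config(&mut self, env_id: &str) -> Result<(), String>;
    fn delete_environment(&mut self, env_id: &str) -> Result<(), String>;
}

/// Create the environment, then apply the source config. If the apply step
/// fails (for example with a timeout), delete the empty environment before
/// returning the original error, so no husk is left behind.
fn duplicate_with_cleanup<B: Backend>(backend: &mut B, name: &str) -> Result<String, String> {
    let env_id = backend.create_empty_environment(name)?;
    if let Err(apply_err) = backend.apply_config(&env_id) {
        // Best-effort rollback: surface a cleanup failure alongside the original error.
        return match backend.delete_environment(&env_id) {
            Ok(()) => Err(apply_err),
            Err(del_err) => Err(format!("{apply_err}; cleanup also failed: {del_err}")),
        };
    }
    Ok(env_id)
}

/// In-memory fake used to exercise the sketch.
struct FakeBackend {
    environments: Vec<String>,
    apply_should_time_out: bool,
}

impl Backend for FakeBackend {
    fn create_empty_environment(&mut self, name: &str) -> Result<String, String> {
        self.environments.push(name.to_string());
        Ok(name.to_string())
    }

    fn apply_config(&mut self, _env_id: &str) -> Result<(), String> {
        if self.apply_should_time_out {
            Err("operation timed out".to_string())
        } else {
            Ok(())
        }
    }

    fn delete_environment(&mut self, env_id: &str) -> Result<(), String> {
        self.environments.retain(|e| e.as_str() != env_id);
        Ok(())
    }
}
```

With apply_should_time_out set to true, duplicate_with_cleanup returns the timeout error and the fake's environment list ends up empty. One caveat for a real implementation: a client-side timeout does not necessarily mean the backend has stopped applying the config, so a rollback may need to account for work still in flight.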

Suggested fixes (in priority order)

  1. Atomicity / cleanup (the real fix): roll back or delete the empty environment if any subsequent step fails, or use the backend's atomic environmentCreate(sourceEnvironmentId: …) path so duplication is one server-side operation. A longer timeout alone still orphans environments when the copy fails partway.
  2. Configurable timeout (stopgap): support RAILWAY_HTTP_TIMEOUT (env var) and/or a --timeout flag. 30s is too short to duplicate an environment with a multi-service config and a ~1 GB volume, and there's currently no escape hatch.
  3. Backend latency: investigate why config-apply on a 3-service + ~1 GB-volume environment intermittently exceeds 30s when it used to complete in single-digit seconds.
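
For fix 2, a rough sketch of how an override could be parsed. RAILWAY_HTTP_TIMEOUT is only the name proposed in this issue, not an existing variable, and the function names are hypothetical. The fallback of 30s mirrors the value currently hardcoded in build_client().

```rust
use std::time::Duration;

/// Default that matches the value currently hardcoded in build_client().
const DEFAULT_TIMEOUT_SECS: u64 = 30;

/// Turn the raw value of the proposed (not yet existing) RAILWAY_HTTP_TIMEOUT
/// variable into a Duration. Missing, non-numeric, or zero values fall back
/// to the 30s default.
fn timeout_from_raw(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|v| v.trim().parse::<u64>().ok())
        .filter(|&s| s > 0)
        .unwrap_or(DEFAULT_TIMEOUT_SECS);
    Duration::from_secs(secs)
}

/// Read the override from the process environment.
fn http_timeout() -> Duration {
    timeout_from_raw(std::env::var("RAILWAY_HTTP_TIMEOUT").ok().as_deref())
}
```

build_client() could then pass http_timeout() to .timeout(...) in place of Duration::from_secs(30). Splitting the parsing from the environment lookup keeps the fallback rules testable without touching process state.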
