Summary
railway environment new <name> --duplicate <source> is not atomic. It performs duplication as several separate GraphQL requests:
1. Create a brand-new empty environment (environmentCreate with source_id: None).
2. Fetch the source environment's config.
3. Apply that config to the new environment (the step that actually copies services + volumes).
Every one of these requests uses the same hardcoded 30s HTTP timeout in the GraphQL client. When step 3 takes longer than 30s, the CLI aborts with "operation timed out", but the empty environment created in step 1 is left behind with zero services and no volume.
That orphan ("husk") is permanently broken: re-running --duplicate either skips (idempotency checks see the name) or fails with "an environment with that name already exists", yet the env has none of the duplicated services. Recovery requires manually deleting it and retrying, and the retry races the same 30s timeout.
There is no way to configure or extend this timeout: no flag, no env var, no config setting.
Environment
| | |
|---|---|
| CLI version | 4.65.0 (also reproduced on 4.64.0) |
| Install | npm install -g @railway/cli (GitHub Actions ubuntu-latest, and macOS) |
| Backend | https://backboard.railway.com/graphql/v2 |
| Source env | 3 services (web, worker, Redis) + one ~1 GB Postgres volume |
Steps to reproduce
# Source env should have a few services and a non-trivial Postgres volume.
railway environment new pr-test --duplicate staging \
  --service-config <SERVICE_ID> 'source.branch' 'some-branch'
Run repeatedly. Most invocations finish in a few seconds; intermittently the config-apply step exceeds 30s and the command fails.
Expected behavior
--duplicate either fully succeeds or fully fails. A timeout (or any failure) after the empty environment is created should roll back / delete the partially-created environment, not leave it orphaned. Long-running duplications should also be able to complete (configurable timeout, or server-side atomic duplication).
Actual behavior
The command exits non-zero with operation timed out, and an empty environment (0 services, no volume) is left behind in the project.
Evidence (real run)
Command (GitHub Actions, CLI 4.65.0):
railway environment new "pr-843" --duplicate staging \
  --service-config <WEB_SERVICE_ID> 'source.branch' "doc-audit-phase-1" \
  --service-config <WORKER_SERVICE_ID> 'source.branch' "doc-audit-phase-1"
CLI output (timestamps from CI log; note the 30.1s gap between 14:37:55.78Z and 14:38:25.88Z):
2026-05-29T14:37:55.78Z > Environment name pr-843
2026-05-29T14:37:55.78Z > Duplicate from staging
2026-05-29T14:38:25.88Z Failed to fetch: error sending request for url (https://backboard.railway.com/graphql/v2)
2026-05-29T14:38:25.88Z Caused by:
2026-05-29T14:38:25.88Z 0: error sending request for url (https://backboard.railway.com/graphql/v2)
2026-05-29T14:38:25.88Z 1: operation timed out
##[error]Process completed with exit code 1.
Resulting backend state (queried via GraphQL right after)
- Environment pr-843 exists, createdAt: 2026-05-29T14:38:15Z (step 1's empty-env create succeeded ~20s in; the timeout fired at the 30s mark on a later request).
- pr-843 has 0 service instances and no volume instance.
- A sibling run on the same CLI version duplicated the same staging source successfully in ~6s (env + volume + Redis all materialized within 6s), so this is purely backend config-apply latency vs. the 30s cap, not a malformed request or a client-version regression.
Root cause (source references, v4.65.0)
Hardcoded, non-configurable timeout — src/client.rs, build_client():
fn build_client(headers: HeaderMap) -> Client {
    Client::builder()
        .danger_accept_invalid_certs(matches!(Configs::get_environment_id(), Environment::Dev))
        .user_agent(consts::get_user_agent())
        .default_headers(headers)
        .timeout(Duration::from_secs(30)) // hardcoded; no env-var / flag / config override
        .build()
        .unwrap()
}
(Set in #636, "bump gql client timeout to 30s", 2025-06-27; used by post_graphql() for every GraphQL call.)
Non-atomic duplicate — src/commands/environment/new.rs, new_environment():
// Step 1: Create a new empty environment (no sourceEnvironmentId)
let vars = mutations::environment_create::Variables {
    project_id: project.id.clone(),
    name,
    source_id: None, // backend's atomic-duplicate path is NOT used
    apply_changes_in_background: None,
};
let response = post_graphql::<mutations::EnvironmentCreate, _>(...).await?; // empty env now EXISTS
let env_id = response.environment_create.id.clone();
if let Some(ref source_env_id) = duplicate_id {
    let source_config = fetch_environment_config(...).await?.config; // request 2 (30s cap)
    let source_config = prepare_config_for_duplication(source_config);
    let source_instances = get_environment_instances(...).await?; // request 3 (30s cap)
    let merged_config = merge_configs(source_config, override_config);
    if !config::is_empty(&merged_config) {
        // request 4 (30s cap): copies services + volume
        apply_environment_config(&client, &configs, &env_id, merged_config).await?;
    }
}
// On timeout/error here, the env created in Step 1 is never cleaned up → husk.
The EnvironmentCreate mutation already accepts a source_id (and apply_changes_in_background), i.e. the backend supports an atomic duplicate. The CLI passes source_id: None and reimplements duplication client-side across multiple round-trips, which is what creates the partial-failure window.
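To make the partial-failure window concrete, here is a minimal sketch of the cleanup-on-failure control flow, under stated assumptions. The Backend trait, duplicate_with_cleanup, and FakeBackend are hypothetical names invented for illustration; they are not the CLI's real types, and the real calls are async GraphQL requests rather than the synchronous stand-ins used here.

```rust
/// Hypothetical stand-in for the backend calls involved in a duplicate.
/// These are NOT the CLI's real functions; they only model the control flow.
trait Backend {
    fn create_empty_env(&mut self, name: &str) -> Result<String, String>;
    fn apply_config(&mut self, env_id: &str) -> Result<(), String>;
    fn delete_env(&mut self, env_id: &str) -> Result<(), String>;
}

/// Create the environment, then apply the duplicated config.
/// If the apply step fails, delete the new environment before returning
/// the error, so no empty environment is left behind.
fn duplicate_with_cleanup<B: Backend>(backend: &mut B, name: &str) -> Result<String, String> {
    let env_id = backend.create_empty_env(name)?;
    if let Err(apply_err) = backend.apply_config(&env_id) {
        // Best-effort rollback; surface both errors if cleanup also fails.
        return match backend.delete_env(&env_id) {
            Ok(()) => Err(apply_err),
            Err(del_err) => Err(format!(
                "{}; cleanup of {} also failed: {}",
                apply_err, env_id, del_err
            )),
        };
    }
    Ok(env_id)
}

/// In-memory fake used only to exercise the flow above.
struct FakeBackend {
    envs: Vec<String>,
    fail_apply: bool,
}

impl Backend for FakeBackend {
    fn create_empty_env(&mut self, name: &str) -> Result<String, String> {
        self.envs.push(name.to_string());
        Ok(name.to_string()) // the fake uses the name as the id
    }

    fn apply_config(&mut self, _env_id: &str) -> Result<(), String> {
        if self.fail_apply {
            Err("operation timed out".to_string())
        } else {
            Ok(())
        }
    }

    fn delete_env(&mut self, env_id: &str) -> Result<(), String> {
        self.envs.retain(|e| e.as_str() != env_id);
        Ok(())
    }
}
```

With fail_apply set to true, duplicate_with_cleanup returns the timeout error and envs ends up empty. One caveat: a client-side timeout does not tell the CLI whether the server-side apply is still running, so real cleanup code would have to decide whether deleting at that point is safe.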
Suggested fixes (in priority order)
1. Atomicity / cleanup (the real fix): roll back or delete the empty environment if any subsequent step fails, or use the backend's atomic environmentCreate(sourceEnvironmentId: …) path so duplication is one server-side operation. A longer timeout alone still orphans environments when the copy fails partway.
2. Configurable timeout (stopgap): support RAILWAY_HTTP_TIMEOUT (env var) and/or a --timeout flag. 30s is too short to duplicate an environment with a multi-service config and a ~1 GB volume, and there's currently no escape hatch.
3. Backend latency: investigate why config-apply on a 3-service + ~1 GB-volume environment intermittently exceeds 30s when it used to complete in single-digit seconds.
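The configurable-timeout stopgap could look roughly like the sketch below. It assumes the RAILWAY_HTTP_TIMEOUT variable proposed above holds a whole number of seconds; the variable does not exist in the CLI today, and timeout_from_raw and http_timeout are hypothetical helper names.

```rust
use std::time::Duration;

/// Matches the 30s value currently hardcoded in build_client().
const DEFAULT_TIMEOUT_SECS: u64 = 30;

/// Turn an optional raw override into a Duration.
/// Missing, empty, non-numeric, or zero values fall back to the default.
fn timeout_from_raw(raw: Option<&str>) -> Duration {
    let secs = raw
        .map(str::trim)
        .and_then(|s| s.parse::<u64>().ok())
        .filter(|&s| s > 0)
        .unwrap_or(DEFAULT_TIMEOUT_SECS);
    Duration::from_secs(secs)
}

/// Read the proposed (hypothetical) RAILWAY_HTTP_TIMEOUT env var.
fn http_timeout() -> Duration {
    let raw = std::env::var("RAILWAY_HTTP_TIMEOUT").ok();
    timeout_from_raw(raw.as_deref())
}
```

In build_client(), the hardcoded call would then become .timeout(http_timeout()). Falling back to 30s on bad input keeps today's behavior for anyone who does not set the variable.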