
fix(drift): prevent false config drift on unchanged GPU endpoints #319

Merged
deanq merged 4 commits into runpod:main from CircuitSerein:fix/false-drift-instance-ids-and-template-env
Apr 17, 2026


@CircuitSerein
Contributor

Problem

Two bugs caused every subsequent flash run to trigger a new RunPod release (cold start + billing event), even when the user's config hadn't changed at all.

Bug 1 — instanceIds=[] vs None false drift

The RunPod API returns instanceIds=[] for GPU endpoints that have no instance restriction. Locally the user never sets instanceIds, so it stays None. model_dump(exclude_none=True) drops None but keeps [], producing a different config_hash and a spurious update() call on every run.

Symptom: endpoint receives a new release on every invocation despite zero config changes.
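
A minimal sketch of the mismatch, using only the standard library. The helper `dump_excluding_none` is a stand-in for `model_dump(exclude_none=True)`, and the JSON-plus-SHA-256 hashing and the `gpuCount` field are assumptions for illustration, not the project's actual `config_hash` implementation:

```python
import hashlib
import json


def dump_excluding_none(config: dict) -> dict:
    """Stand-in for model_dump(exclude_none=True): drop None, keep everything else."""
    return {k: v for k, v in config.items() if v is not None}


def naive_hash(config: dict) -> str:
    """Hash the dumped config with no instanceIds normalization (assumed scheme)."""
    payload = json.dumps(dump_excluding_none(config), sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


# Local config: the user never set instanceIds, so it stays None.
local = {"name": "demo", "gpuCount": 1, "instanceIds": None}
# Same settings as echoed back by the API, where instanceIds is [].
remote = {"name": "demo", "gpuCount": 1, "instanceIds": []}

# None is dropped but [] survives, so the two payloads hash differently.
hashes_differ = naive_hash(local) != naive_hash(remote)
print(hashes_differ)
```

The two dictionaries describe the same endpoint, yet the surviving `"instanceIds": []` key is enough to change the hash.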

Bug 2 — RUNPOD_API_KEY oscillating add/remove

_create_new_template() always passed env=[] explicitly to PodTemplate even when self.env was None. This put "env" into Pydantic's model_fields_set, which made update() evaluate has_explicit_template_env = True every time, setting env_needs_update = True.

For QB endpoints that don't make remote calls, _inject_runtime_template_vars() would not inject RUNPOD_API_KEY — so update_template was called with an empty env, removing the key. The next run added it back. Alternating add/remove on every run, visible in the RunPod web UI as repeated releases.
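
The distinction between "field left at its default" and "field passed explicitly with the default value" can be sketched without Pydantic. `TemplateSketch` below is a hypothetical stand-in that records the names of explicitly passed fields, loosely mimicking `model_fields_set`; it is not the real `PodTemplate`:

```python
class TemplateSketch:
    """Hypothetical stand-in: tracks which fields the caller passed explicitly."""

    def __init__(self, **kwargs):
        self.name = kwargs.get("name")
        self.env = kwargs.get("env", [])  # default used when env is not passed
        self.fields_set = set(kwargs)     # analogous to model_fields_set


def create_template_old(name, env):
    # Old behavior as described above: env is always passed, even when unset.
    return TemplateSketch(name=name, env=list((env or {}).items()))


old_style = create_template_old("demo", None)
default_style = TemplateSketch(name="demo")

# Both templates carry the same env value, but only one marks it as explicit.
print(old_style.env == default_style.env)
print("env" in old_style.fields_set, "env" in default_style.fields_set)
```

Any later logic that keys off the set of explicit fields will treat these two templates differently even though their values are identical.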

Fix

serverless.py, config_hash: normalize instanceIds=[] to absent, matching the treatment of None:

if not config_dict.get("instanceIds"):
    config_dict.pop("instanceIds", None)
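
To see the effect of that normalization end to end, here is a self-contained sketch. Only the two-line `if not ... pop` step comes from this PR; the surrounding `config_hash_sketch` function, the hashing scheme, and the `instance-a` value are assumptions:

```python
import hashlib
import json


def config_hash_sketch(config: dict) -> str:
    """Assumed hashing scheme with the instanceIds normalization applied."""
    config_dict = {k: v for k, v in config.items() if v is not None}
    # Normalization from the fix: None, [] and a missing key all end up absent.
    if not config_dict.get("instanceIds"):
        config_dict.pop("instanceIds", None)
    payload = json.dumps(config_dict, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


base = {"name": "demo"}
h_absent = config_hash_sketch(base)
h_none = config_hash_sketch({**base, "instanceIds": None})
h_empty = config_hash_sketch({**base, "instanceIds": []})
h_pinned = config_hash_sketch({**base, "instanceIds": ["instance-a"]})
```

The first three hashes are equal, while `h_pinned` differs, so a real instance restriction is still detected as drift.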

serverless.py + serverless_cpu.py, _create_new_template(): only pass env= when self.env is truthy, keeping "env" out of model_fields_set for the default no-env case:

kwargs: dict = {"name": self.resource_id, "imageName": self.imageName}
if self.env:
    kwargs["env"] = KeyValuePair.from_dict(self.env)
return PodTemplate(**kwargs)

Tests

Regression tests added in TestInstanceIdsFalseDrift and TestCreateNewTemplateEnvFieldSet:

  • instanceIds=None and instanceIds=[] produce identical hash
  • non-empty instanceIds still changes hash (real drift still caught)
  • _create_new_template() with env=None does not put "env" in model_fields_set (GPU and CPU endpoints)
  • _create_new_template() with env populated correctly sets the field

Test run

2640 passed, 2 skipped, 1 xfailed, 4 warnings

@deanq
Member

deanq commented Apr 16, 2026

Thanks, @CircuitSerein. Why did your changes rewrite the entire files? Look at what's removed and added. Are you using an editor that completely reformats files (tabs to spaces)? We cannot accept a PR that does this. I can't even tell what changed. Please reformat it back to what it was originally and apply only the code change you intended.

Two bugs caused every subsequent run to trigger a new RunPod release
(cold start), even when the user's config hadn't changed.

Bug 1 — instanceIds=[] vs None false drift:
The RunPod API returns instanceIds=[] for GPU endpoints that have no
instance restriction. Locally the user never sets instanceIds, so it
stays None. exclude_none=True drops None but keeps [], producing a
different config_hash and a spurious update on every run.

Fix: normalize instanceIds=[] to absent in config_hash, matching the
same treatment applied to None.

Bug 2 — RUNPOD_API_KEY oscillating add/remove:
_create_new_template() always passed env=[] explicitly to PodTemplate
even when self.env was None. This put 'env' in Pydantic's
model_fields_set, causing update() to set env_needs_update=True on
every run. For QB endpoints that don't make remote calls this toggled
RUNPOD_API_KEY in and out of the template on alternate runs.

Fix: only pass env= to PodTemplate when self.env is truthy, keeping
'env' out of model_fields_set for the default (no-env) case.

Regression tests added for both bugs.
@CircuitSerein force-pushed the fix/false-drift-instance-ids-and-template-env branch from eafe09b to 6740ef4 on April 16, 2026 07:19
@CircuitSerein
Contributor Author

Sorry about that, it was a CRLF problem with Windows and WSL. Recommitted clean changes.

Contributor

@runpod-Henrik runpod-Henrik left a comment


Two real bugs, both fixes correct. One observation worth a follow-up, one test gap.


Bug 1 — instanceIds normalization (serverless.py:436)

Correct. model_dump(exclude_none=True) drops None but keeps [] — without this fix the serialized JSON diverges and triggers a spurious release on every run. if not config_dict.get("instanceIds") handles both cases cleanly.

Confirmed non-issue for CPU: CpuServerlessEndpoint.config_hash uses an explicit include=cpu_fields allowlist that doesn't include instanceIds, so the field never enters the CPU hash. No fix needed there.
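
A rough sketch of why an allowlist-based hash is immune to this bug. The contents of `CPU_FIELDS`, the function name, and the hashing scheme are hypothetical; only the idea of an explicit include list that omits instanceIds comes from the review above:

```python
import hashlib
import json

# Hypothetical allowlist; the real cpu_fields contents are not shown in this PR.
CPU_FIELDS = {"name", "imageName", "workersMax"}


def cpu_config_hash_sketch(config: dict) -> str:
    """Hash only allowlisted fields, so instanceIds can never influence the result."""
    included = {
        k: v for k, v in config.items() if k in CPU_FIELDS and v is not None
    }
    payload = json.dumps(included, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


base = {"name": "demo", "imageName": "test:latest"}
cpu_none = cpu_config_hash_sketch({**base, "instanceIds": None})
cpu_empty = cpu_config_hash_sketch({**base, "instanceIds": []})
cpu_pinned = cpu_config_hash_sketch({**base, "instanceIds": ["instance-a"]})
```

All three hashes match because the field is filtered out before serialization, regardless of its value.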


Bug 2 — _create_new_template() (serverless.py:622, serverless_cpu.py:162)

Correct for the described symptom. The old env=KeyValuePair.from_dict(self.env or {}) always put "env" into model_fields_set, which made has_explicit_template_env True on every update call → env_needs_update = True → _inject_runtime_template_vars() ran → for QB endpoints without remote calls, RUNPOD_API_KEY was alternately added and removed. The fix breaks that chain.

Observation — env={} inconsistency between creation and update paths.

_create_new_template uses if self.env:, which treats env={} identically to env=None. _configure_existing_template (serverless.py:633) uses if self.env is not None:, which does not. The result: a user who sets env={} gets different template state depending on whether it's a new deploy or an update — new deploy leaves env unset, update explicitly clears it to []. Unlikely to hit in practice, but the inconsistency is a quiet trap if this area is touched later. Suggest either aligning _configure_existing_template to match, or adding a comment in _create_new_template explaining the intentional choice.
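
The difference between the two guards is easiest to see side by side. The helper names below are hypothetical and only build a kwargs dict; they do not call the real PodTemplate:

```python
def build_kwargs_truthy(env):
    """Guard style 'if self.env:': None and {} are both treated as unset."""
    kwargs = {"name": "demo"}
    if env:
        kwargs["env"] = dict(env)
    return kwargs


def build_kwargs_not_none(env):
    """Guard style 'if self.env is not None:': {} counts as explicitly set."""
    kwargs = {"name": "demo"}
    if env is not None:
        kwargs["env"] = dict(env)
    return kwargs


cases = [None, {}, {"MY_VAR": "value"}]
truthy_sets_env = ["env" in build_kwargs_truthy(e) for e in cases]
not_none_sets_env = ["env" in build_kwargs_not_none(e) for e in cases]

print(truthy_sets_env)    # [False, False, True]
print(not_none_sets_env)  # [False, True, True]
```

The two guards disagree only on `env={}`, which is exactly the case the observation calls out.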


Update() mechanics

Traced through the env_needs_update guard at line 1087: after the fix, new_config.env is None → "env" not in template_fields_set → has_explicit_template_env = False, env_changed = False → env_needs_update = False → _inject_runtime_template_vars() not called, RUNPOD_API_KEY not touched. The logic chain is clean.
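
The trace can be approximated as a small predicate. This is a sketch under assumptions: the variable names come from the review, but combining the two flags with `or` and the function signature are guesses, since the actual guard code is not quoted in this thread:

```python
def env_needs_update_sketch(template_fields_set: set, env_changed: bool) -> bool:
    """Assumed shape of the guard: update when env is explicit or has changed."""
    has_explicit_template_env = "env" in template_fields_set
    return has_explicit_template_env or env_changed


# Before the fix: "env" was always in the fields set, so the guard always fired.
before_fix = env_needs_update_sketch({"name", "imageName", "env"}, env_changed=False)

# After the fix: with env unset, "env" stays out of the fields set.
after_fix = env_needs_update_sketch({"name", "imageName"}, env_changed=False)

print(before_fix, after_fix)
```

Under these assumptions the guard fires before the fix and stays quiet after it, matching the chain described above.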


Tests

TestInstanceIdsFalseDrift covers all three relevant cases (None, absent, empty list) and confirms non-empty still changes the hash. TestCreateNewTemplateEnvFieldSet covers both GPU and CPU endpoints in set and unset state. Test run is clean (2640/0).

Gap: no test for env={} in _create_new_template. Given the if self.env: subtlety, a test pinning env={} → "env" not in template.model_fields_set would document the intended behavior and guard against a later "fix" to is not None.

Contributor

Copilot AI left a comment


Pull request overview

Fixes false serverless config drift that was triggering unnecessary RunPod releases on subsequent flash run executions, even when the user’s configuration hadn’t changed.

Changes:

  • Normalize instanceIds=[] to be treated the same as “absent”/None when computing config_hash.
  • Adjust _create_new_template() to avoid marking env as explicitly set when env is unset by the user.
  • Add regression tests for instanceIds hashing and template env field-set behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/runpod_flash/core/resources/serverless.py | Normalizes instanceIds for hashing; changes template creation to conditionally include env. |
| src/runpod_flash/core/resources/serverless_cpu.py | Mirrors the template env-handling change for CPU endpoints. |
| tests/unit/resources/test_serverless.py | Adds regression tests for instanceIds hash normalization and template env field-set behavior. |


Comment thread src/runpod_flash/core/resources/serverless_cpu.py Outdated
Comment on lines +3438 to +3467
    def test_create_new_template_env_not_in_fields_set_when_env_none(self):
        """When self.env is None, 'env' must NOT appear in template.model_fields_set."""
        resource = ServerlessEndpoint(name="test", imageName="test:latest")
        assert resource.env is None

        template = resource._create_new_template()

        assert "env" not in template.model_fields_set

    def test_create_new_template_env_in_fields_set_when_env_set(self):
        """When self.env is populated, 'env' MUST appear in template.model_fields_set."""
        resource = ServerlessEndpoint(
            name="test",
            imageName="test:latest",
            env={"MY_VAR": "value"},
        )

        template = resource._create_new_template()

        assert "env" in template.model_fields_set
        assert any(kv.key == "MY_VAR" for kv in template.env)

    def test_create_new_template_env_not_in_fields_set_cpu_endpoint(self):
        """CpuServerlessEndpoint: same fix applies, env=None must not set 'env' field."""
        resource = CpuServerlessEndpoint(name="test", imageName="test:latest")
        assert resource.env is None

        template = resource._create_new_template()

        assert "env" not in template.model_fields_set
Comment thread src/runpod_flash/core/resources/serverless.py Outdated
Use 'if self.env is not None' instead of 'if self.env' in
_create_new_template() so that an explicitly-set empty env ({})
is still represented in template.model_fields_set, keeping it
distinct from the default env=None case.

Add regression test for env={} behavior per reviewer feedback.
@CircuitSerein force-pushed the fix/false-drift-instance-ids-and-template-env branch from b69d9ab to f2d0080 on April 16, 2026 19:37
@deanq deanq merged commit f5884f2 into runpod:main Apr 17, 2026
4 checks passed
4 participants