fix(drift): prevent false config drift on unchanged GPU endpoints#319
Conversation
|
Thanks, @CircuitSerein . Why did your changes rewrite the entire files? Look at what's removed and added. Are you using an editor that completely reformats files (tabs to spaces)? We cannot accept PR that does this. I can't even tell what changed. Please reformat it back to what it was originally and apply the actual code change you intended. |
Two bugs caused every subsequent run to trigger a new RunPod release (cold start), even when the user's config hadn't changed. Bug 1 — instanceIds=[] vs None false drift: The RunPod API returns instanceIds=[] for GPU endpoints that have no instance restriction. Locally the user never sets instanceIds, so it stays None. exclude_none=True drops None but keeps [], producing a different config_hash and a spurious update on every run. Fix: normalize instanceIds=[] to absent in config_hash, matching the same treatment applied to None. Bug 2 — RUNPOD_API_KEY oscillating add/remove: _create_new_template() always passed env=[] explicitly to PodTemplate even when self.env was None. This put 'env' in Pydantic's model_fields_set, causing update() to set env_needs_update=True on every run. For QB endpoints that don't make remote calls this toggled RUNPOD_API_KEY in and out of the template on alternate runs. Fix: only pass env= to PodTemplate when self.env is truthy, keeping 'env' out of model_fields_set for the default (no-env) case. Regression tests added for both bugs.
eafe09b to
6740ef4
Compare
|
Sorry about that - CRLF problem with Windows and WSL. Recommited clean changes |
runpod-Henrik
left a comment
There was a problem hiding this comment.
Two real bugs, both fixes correct. One observation worth a follow-up, one test gap.
Bug 1 — instanceIds normalization (serverless.py:436)
Correct. model_dump(exclude_none=True) drops None but keeps [] — without this fix the serialized JSON diverges and triggers a spurious release on every run. if not config_dict.get("instanceIds") handles both cases cleanly.
Confirmed non-issue for CPU: CpuServerlessEndpoint.config_hash uses an explicit include=cpu_fields allowlist that doesn't include instanceIds, so the field never enters the CPU hash. No fix needed there.
Bug 2 — _create_new_template() (serverless.py:622, serverless_cpu.py:162)
Correct for the described symptom. The old env=KeyValuePair.from_dict(self.env or {}) always put "env" into model_fields_set, which made has_explicit_template_env True on every update call → env_needs_update = True → _inject_runtime_template_vars() ran → for QB endpoints without remote calls, RUNPOD_API_KEY was alternately added and removed. The fix breaks that chain.
Observation — env={} inconsistency between creation and update paths.
_create_new_template uses if self.env:, which treats env={} identically to env=None. _configure_existing_template (serverless.py:633) uses if self.env is not None:, which does not. The result: a user who sets env={} gets different template state depending on whether it's a new deploy or an update — new deploy leaves env unset, update explicitly clears it to []. Unlikely to hit in practice, but the inconsistency is a quiet trap if this area is touched later. Suggest either aligning _configure_existing_template to match, or adding a comment in _create_new_template explaining the intentional choice.
Update() mechanics
Traced through the env_needs_update guard at line 1087: after the fix, new_config.env is None → "env" not in template_fields_set → has_explicit_template_env = False, env_changed = False → env_needs_update = False → _inject_runtime_template_vars() not called, RUNPOD_API_KEY not touched. The logic chain is clean.
Tests
TestInstanceIdsFalseDrift covers all three relevant cases (None, absent, empty list) and confirms non-empty still changes the hash. TestCreateNewTemplateEnvFieldSet covers both GPU and CPU endpoints in set and unset state. Test run is clean (2640/0).
Gap: no test for env={} in _create_new_template. Given the if self.env: subtlety, a test pinning env={} → "env" not in template.model_fields_set would document the intended behavior and guard against a later "fix" to is not None.
There was a problem hiding this comment.
Pull request overview
Fixes false serverless config drift that was triggering unnecessary RunPod releases on subsequent flash run executions, even when the user’s configuration hadn’t changed.
Changes:
- Normalize
instanceIds=[]to be treated the same as “absent”/Nonewhen computingconfig_hash. - Adjust
_create_new_template()to avoid markingenvas explicitly set whenenvis unset by the user. - Add regression tests for
instanceIdshashing and templateenvfield-set behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/runpod_flash/core/resources/serverless.py |
Normalizes instanceIds for hashing; changes template creation to conditionally include env. |
src/runpod_flash/core/resources/serverless_cpu.py |
Mirrors the template env-handling change for CPU endpoints. |
tests/unit/resources/test_serverless.py |
Adds regression tests for instanceIds hash normalization and template env field-set behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def test_create_new_template_env_not_in_fields_set_when_env_none(self): | ||
| """When self.env is None, 'env' must NOT appear in template.model_fields_set.""" | ||
| resource = ServerlessEndpoint(name="test", imageName="test:latest") | ||
| assert resource.env is None | ||
|
|
||
| template = resource._create_new_template() | ||
|
|
||
| assert "env" not in template.model_fields_set | ||
|
|
||
| def test_create_new_template_env_in_fields_set_when_env_set(self): | ||
| """When self.env is populated, 'env' MUST appear in template.model_fields_set.""" | ||
| resource = ServerlessEndpoint( | ||
| name="test", | ||
| imageName="test:latest", | ||
| env={"MY_VAR": "value"}, | ||
| ) | ||
|
|
||
| template = resource._create_new_template() | ||
|
|
||
| assert "env" in template.model_fields_set | ||
| assert any(kv.key == "MY_VAR" for kv in template.env) | ||
|
|
||
| def test_create_new_template_env_not_in_fields_set_cpu_endpoint(self): | ||
| """CpuServerlessEndpoint: same fix applies — env=None must not set 'env' field.""" | ||
| resource = CpuServerlessEndpoint(name="test", imageName="test:latest") | ||
| assert resource.env is None | ||
|
|
||
| template = resource._create_new_template() | ||
|
|
||
| assert "env" not in template.model_fields_set |
Use 'if self.env is not None' instead of 'if self.env' in
_create_new_template() so that an explicitly-set empty env ({})
is still represented in template.model_fields_set, keeping it
distinct from the default env=None case.
Add regression test for env={} behavior per reviewer feedback.
b69d9ab to
f2d0080
Compare
Problem
Two bugs caused every subsequent
flash runto trigger a new RunPod release (cold start + billing event), even when the user's config hadn't changed at all.Bug 1 —
instanceIds=[]vsNonefalse driftThe RunPod API returns
instanceIds=[]for GPU endpoints that have no instance restriction. Locally the user never setsinstanceIds, so it staysNone.model_dump(exclude_none=True)dropsNonebut keeps[], producing a differentconfig_hashand a spuriousupdate()call on every run.Symptom: endpoint receives a new release on every invocation despite zero config changes.
Bug 2 —
RUNPOD_API_KEYoscillating add/remove_create_new_template()always passedenv=[]explicitly toPodTemplateeven whenself.envwasNone. This put"env"into Pydantic'smodel_fields_set, which madeupdate()evaluatehas_explicit_template_env = Trueevery time, settingenv_needs_update = True.For QB endpoints that don't make remote calls,
_inject_runtime_template_vars()would not injectRUNPOD_API_KEY— soupdate_templatewas called with an empty env, removing the key. The next run added it back. Alternating add/remove on every run, visible in the RunPod web UI as repeated releases.Fix
serverless.py—config_hash: normalizeinstanceIds=[]to absent, matching the treatment ofNone:serverless.py+serverless_cpu.py—_create_new_template(): only passenv=whenself.envis truthy, keeping"env"out ofmodel_fields_setfor the default no-env case:Tests
Regression tests added in
TestInstanceIdsFalseDriftandTestCreateNewTemplateEnvFieldSet:instanceIds=NoneandinstanceIds=[]produce identical hashinstanceIdsstill changes hash (real drift still caught)_create_new_template()withenv=Nonedoes not put"env"inmodel_fields_set(GPU and CPU endpoints)_create_new_template()withenvpopulated correctly sets the fieldTest run