fix(config): handle corrupted config.toml in credential functions#481
fix(config): handle corrupted config.toml in credential functions#481
Conversation
There was a problem hiding this comment.
Pull request overview
This PR hardens credential config handling so corrupted ~/.runpod/config.toml won’t crash callers (notably when get_credentials() is invoked at import time), aligning behavior with existing check_credentials() error handling.
Changes:
- Catch
(TypeError, ValueError)aroundtoml.load()inget_credentials()and returnNoneon parse/type failures. - Treat corrupted config as empty in
set_credentials()whenoverwrite=False, preventing crashes during existence checks. - Add unit tests covering corrupted TOML and type errors for
get_credentials(), plus corrupted TOML behavior forset_credentials().
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
runpod/cli/groups/config/functions.py |
Adds defensive exception handling for corrupted TOML in credential read/check paths. |
tests/test_cli/test_cli_groups/test_config_functions.py |
Adds tests for corrupted TOML/type errors and set_credentials() overwrite/non-overwrite behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
runpod-Henrik
left a comment
There was a problem hiding this comment.
1. Core fix — correct
get_credentials() now returns None on (TypeError, ValueError) instead of crashing. This matches the existing pattern in check_credentials() and unblocks the import-time crash in runpod/__init__.py:81. The set_credentials() non-overwrite path gets the same treatment — a corrupted file is treated as empty rather than raising.
2. Bug: test_set_credentials_corrupted_toml_allows_overwrite doesn't test the changed code
mock_toml_load.side_effect = ValueError("Invalid TOML")
functions.set_credentials("NEW_KEY", overwrite=True)With overwrite=True, the if not overwrite: block is skipped entirely — toml.load is never called. The side_effect is never triggered. The test would pass with or without this PR. The test that actually validates the changed code is test_set_credentials_corrupted_toml_no_overwrite. The overwrite=True test is redundant for this change and should either be removed or repurposed to verify write behavior.
3. Issue: silent failure gives no signal to the user
except (TypeError, ValueError):
return Noneget_credentials() returning None on a corrupted file is the right behavior for the import-time crash, but it's silent. Downstream code in runpod/__init__.py sets _credentials = None with no indication that the config was present but unreadable. A user gets auth failures with no explanation. A warnings.warn() or sys.stderr message noting the file is corrupted would make this debuggable.
Nit
check_credentials() already had the same (TypeError, ValueError) pattern — worth a comment in get_credentials() cross-referencing this so future readers understand the intentional symmetry.
Verdict: PASS WITH NITS
Fix is correct. Item 2 (misleading test) and item 3 (silent failure) are worth a follow-up.
🤖 Reviewed by Henrik's AI-Powered Bug Finder
35df61a to
89973d0
Compare
runpod-Henrik
left a comment
There was a problem hiding this comment.
Delta review — since 2026-03-21 01:16
1. Issue still open — misleading test_set_credentials_corrupted_toml_allows_overwrite
The comment was updated to read: # overwrite=True skips the toml.load check entirely — which correctly describes the situation but doesn't fix it. The side_effect = ValueError(...) is still never triggered. The test passes with or without this PR, and the added comment confirms that. Either repurpose this test to verify write behavior after the corrupted-read path, or remove it. As-is it adds false coverage.
2. Issue still open — silent failure on corrupted config
get_credentials() still returns None silently on a corrupted file with no log, warning, or stderr message. A user gets unexplained auth failures. warnings.warn() or a single logging.warning() would make this debuggable.
3. Nit still open — no cross-reference comment
get_credentials() and check_credentials() share the same (TypeError, ValueError) catch pattern. A comment in get_credentials() noting the intentional symmetry with check_credentials() would help future readers.
4. New (e2e): flash undeploy --all --force has unbounded blast radius
subprocess.run(["flash", "undeploy", "--all", "--force"], ...)This undeploys every flash-managed endpoint in the account, not just the ones provisioned by this test run. If any other endpoint is live in the same API key's account (e.g. a developer's personal deployment, another CI run), it gets torn down. The original loop over eps.items() was scoped to exactly what was provisioned. Recommend scoping to a prefix (doc_test_/runpod_e2e_*) or tracking endpoint IDs at provision time and undeploying only those.
5. New (e2e): hardware_config_key normalization may silently group mismatched tests
normalized = {
"gpuIds": hw.get("endpointConfig", {}).get("gpuIds", ""),
"dockerArgs": hw.get("templateConfig", {}).get("dockerArgs", ""),
}Any two tests that share gpuIds and dockerArgs but differ in other templateConfig or endpointConfig fields (env vars, image, scalerConfig) will be mapped to the same endpoint. If one test requires an env var the other doesn't, the shared endpoint will have whichever config was provisioned first. The original json.dumps(hw, sort_keys=True) over the full hardware config was safe; this reduced key loses fields that could legitimately differ.
6. New (e2e): concurrent gather loses per-test granularity in CI output
The move from @pytest.mark.parametrize to a single test_mock_worker_jobs that gathers all cases is a valid performance win, but a failure now reports as one test failure with concatenated messages rather than individual named test failures. pytest-xdist or keeping parametrize with async could give both concurrency and per-case reporting — though for the current test count this is a trade-off, not a blocker.
7. New (e2e): cold start test hides diagnostics with DEVNULL
stdout=asyncio.subprocess.DEVNULL,
stderr=asyncio.subprocess.DEVNULL,When the cold start test fails (the most likely failure mode in CI), there's no process output to explain why. The PIPE → DEVNULL change means CI gets only the timing assertion failure, not the server error. Consider capturing to a temp file and attaching to the failure message.
Nits
- The
if proc.returncode is None:guard before SIGINT intest_cold_start.pyis a correct fix. - The CLAUDE.md path change (removing the absolute local path) is the right call.
Verdict
NEEDS WORK
Items 1 and 2 from the prior review are still open. Item 4 (flash undeploy --all) is a new correctness/blast-radius issue worth fixing before merge. Items 5 and 7 are meaningful but lower priority. Item 3 and item 6 are nits.
🤖 Reviewed by Henrik's AI-Powered Bug Finder
|
📝 Documentation updates detected! New suggestion: Add troubleshooting for corrupted credentials file Tip: Send Promptless a meeting transcript in Slack to generate doc updates 📝 |
Prerequisite. This is a stacked PR. The legacy e2e tests were blocking due to constant failures. Once that merges, this PR will look simpler. |
runpod-Henrik
left a comment
There was a problem hiding this comment.
Delta review — since 2026-03-23T13:29:51Z
Fixed since last review
Item 7 (cold-start DEVNULL) — Addressed. test_cold_start.py now captures subprocess output to a tempfile and includes the last 50 lines in the failure message.
Still open
Item 1 — misleading test_set_credentials_corrupted_toml_allows_overwrite
The comment now reads # overwrite=True skips the toml.load check entirely, which correctly describes the situation, but the test still passes regardless of whether this PR exists. The side_effect = ValueError is never triggered. This test adds false coverage signal; the line that actually matters is in test_set_credentials_corrupted_toml_no_overwrite. Remove or repurpose allows_overwrite.
Item 2 — silent failure on corrupted config
get_credentials() still returns None with no warning, log, or stderr message when the file exists but is unreadable. A user with a corrupted ~/.runpod/config.toml gets unexplained auth failures. A single warnings.warn("~/.runpod/config.toml is corrupted — ignoring. Re-run flash login to restore credentials.", stacklevel=2) would make this actionable.
Item 4 — flash undeploy --all --force blast radius
conftest.py:55 still tears down every Flash endpoint in the account. Unchanged from the prior review.
New finding — verify_local_runpod reverts to fragile string check
if "runpod-python" not in runpod.__file__:PR #479 fixed this to use Path(__file__).resolve().is_relative_to(_REPO_ROOT). This PR's conftest.py predates that fix and uses the old string-match approach, which breaks if the repo is cloned under any path not containing the string runpod-python. Since this PR lists #479 as a prerequisite, it should pick up that fix before or at merge.
Prior items closed or downgraded
| Item | Status |
|---|---|
| 7 — cold-start DEVNULL | Fixed |
3 — no cross-ref comment in get_credentials |
Nit, unchanged |
5 — hardware_config_key may group mismatched tests |
Question, unchanged |
| 6 — concurrent gather loses per-test granularity | Nit, unchanged |
Verdict
NEEDS WORK — Items 1, 2, and 4 are still open. Item 4 (blast radius) is shared with #479. Item 1 (dead test) and Item 2 (silent failure) are specific to this PR and straightforward to fix.
🤖 Reviewed by Henrik's AI-Powered Bug Finder
…credentials get_credentials() called toml.load() without exception handling, causing a raw TOMLDecodeError to crash any code that imports the runpod package when ~/.runpod/config.toml is malformed. Return None on parse failure, matching the existing pattern in check_credentials().
Address review feedback: - Patch os.path.exists via module path for consistency with existing tests - Patch os.makedirs and Path.touch in set_credentials tests for hermeticity
89973d0 to
00c6105
Compare
Fails with 403 on the Runpod GraphQL API.
flash undeploy looks up endpoints in .runpod/resources.pkl which does not exist in CI. Use the Runpod GraphQL API to query endpoints by name and delete them directly.
The RUNPOD_API_KEY in CI lacks GraphQL API access, causing 403 on cleanup. Catch the error so test results are not masked by teardown failures.
Summary
get_credentials()now catches(TypeError, ValueError)whentoml.load()fails on a corrupted~/.runpod/config.toml, returningNoneinstead of crashing. This matches the existing pattern incheck_credentials().set_credentials()(non-overwrite path) gets the same treatment -- a corrupted file is treated as empty rather than crashing.Root cause:
get_credentials()is called at import time inrunpod/__init__.py:81. A malformed config.toml causes aTOMLDecodeError(subclass ofValueError) that crashes any downstream package (e.g. flash) before their error handling can run.Fixes: AE-2576
Related: AE-2485 (fixed Flash's own read path; this fixes the dependency's read path)
Test plan
test_get_credentials_corrupted_toml-- ValueError returns Nonetest_get_credentials_type_error-- TypeError returns Nonetest_set_credentials_corrupted_toml_allows_overwrite-- overwrite=True ignores corrupt filetest_set_credentials_corrupted_toml_no_overwrite-- overwrite=False treats corrupt as empty