Skip to content

fix(config): handle corrupted config.toml in credential functions#481

Merged
deanq merged 6 commits intomainfrom
deanq/ae-2576-bug-corrupted-credentials-file-crash
Mar 24, 2026
Merged

fix(config): handle corrupted config.toml in credential functions#481
deanq merged 6 commits intomainfrom
deanq/ae-2576-bug-corrupted-credentials-file-crash

Conversation

@deanq
Copy link
Member

@deanq deanq commented Mar 20, 2026

Prerequisite: #479

Summary

  • get_credentials() now catches (TypeError, ValueError) when toml.load() fails on a corrupted ~/.runpod/config.toml, returning None instead of crashing. This matches the existing pattern in check_credentials().
  • set_credentials() (non-overwrite path) gets the same treatment -- a corrupted file is treated as empty rather than crashing.

Root cause: get_credentials() is called at import time in runpod/__init__.py:81. A malformed config.toml causes a TOMLDecodeError (subclass of ValueError) that crashes any downstream package (e.g. flash) before their error handling can run.

Fixes: AE-2576
Related: AE-2485 (fixed Flash's own read path; this fixes the dependency's read path)

Test plan

  • test_get_credentials_corrupted_toml -- ValueError returns None
  • test_get_credentials_type_error -- TypeError returns None
  • test_set_credentials_corrupted_toml_allows_overwrite -- overwrite=True ignores corrupt file
  • test_set_credentials_corrupted_toml_no_overwrite -- overwrite=False treats corrupt as empty
  • Existing tests unchanged and passing

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens credential config handling so corrupted ~/.runpod/config.toml won’t crash callers (notably when get_credentials() is invoked at import time), aligning behavior with existing check_credentials() error handling.

Changes:

  • Catch (TypeError, ValueError) around toml.load() in get_credentials() and return None on parse/type failures.
  • Treat corrupted config as empty in set_credentials() when overwrite=False, preventing crashes during existence checks.
  • Add unit tests covering corrupted TOML and type errors for get_credentials(), plus corrupted TOML behavior for set_credentials().

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
runpod/cli/groups/config/functions.py Adds defensive exception handling for corrupted TOML in credential read/check paths.
tests/test_cli/test_cli_groups/test_config_functions.py Adds tests for corrupted TOML/type errors and set_credentials() overwrite/non-overwrite behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

@runpod-Henrik runpod-Henrik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1. Core fix — correct

get_credentials() now returns None on (TypeError, ValueError) instead of crashing. This matches the existing pattern in check_credentials() and unblocks the import-time crash in runpod/__init__.py:81. The set_credentials() non-overwrite path gets the same treatment — a corrupted file is treated as empty rather than raising.

2. Bug: test_set_credentials_corrupted_toml_allows_overwrite doesn't test the changed code

mock_toml_load.side_effect = ValueError("Invalid TOML")
functions.set_credentials("NEW_KEY", overwrite=True)

With overwrite=True, the if not overwrite: block is skipped entirely — toml.load is never called. The side_effect is never triggered. The test would pass with or without this PR. The test that actually validates the changed code is test_set_credentials_corrupted_toml_no_overwrite. The overwrite=True test is redundant for this change and should either be removed or repurposed to verify write behavior.

3. Issue: silent failure gives no signal to the user

except (TypeError, ValueError):
    return None

get_credentials() returning None on a corrupted file is the right behavior for the import-time crash, but it's silent. Downstream code in runpod/__init__.py sets _credentials = None with no indication that the config was present but unreadable. A user gets auth failures with no explanation. A warnings.warn() or sys.stderr message noting the file is corrupted would make this debuggable.

Nit

check_credentials() already had the same (TypeError, ValueError) pattern — worth a comment in get_credentials() cross-referencing this so future readers understand the intentional symmetry.


Verdict: PASS WITH NITS

Fix is correct. Item 2 (misleading test) and item 3 (silent failure) are worth a follow-up.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

@deanq deanq force-pushed the deanq/ae-2576-bug-corrupted-credentials-file-crash branch from 35df61a to 89973d0 Compare March 21, 2026 04:40
Copy link

@runpod-Henrik runpod-Henrik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delta review — since 2026-03-21 01:16

1. Issue still open — misleading test_set_credentials_corrupted_toml_allows_overwrite

The comment was updated to read: # overwrite=True skips the toml.load check entirely — which correctly describes the situation but doesn't fix it. The side_effect = ValueError(...) is still never triggered. The test passes with or without this PR, and the added comment confirms that. Either repurpose this test to verify write behavior after the corrupted-read path, or remove it. As-is it adds false coverage.

2. Issue still open — silent failure on corrupted config

get_credentials() still returns None silently on a corrupted file with no log, warning, or stderr message. A user gets unexplained auth failures. warnings.warn() or a single logging.warning() would make this debuggable.

3. Nit still open — no cross-reference comment

get_credentials() and check_credentials() share the same (TypeError, ValueError) catch pattern. A comment in get_credentials() noting the intentional symmetry with check_credentials() would help future readers.


4. New (e2e): flash undeploy --all --force has unbounded blast radius

subprocess.run(["flash", "undeploy", "--all", "--force"], ...)

This undeploys every flash-managed endpoint in the account, not just the ones provisioned by this test run. If any other endpoint is live in the same API key's account (e.g. a developer's personal deployment, another CI run), it gets torn down. The original loop over eps.items() was scoped to exactly what was provisioned. Recommend scoping to a prefix (doc_test_/runpod_e2e_*) or tracking endpoint IDs at provision time and undeploying only those.

5. New (e2e): hardware_config_key normalization may silently group mismatched tests

normalized = {
    "gpuIds": hw.get("endpointConfig", {}).get("gpuIds", ""),
    "dockerArgs": hw.get("templateConfig", {}).get("dockerArgs", ""),
}

Any two tests that share gpuIds and dockerArgs but differ in other templateConfig or endpointConfig fields (env vars, image, scalerConfig) will be mapped to the same endpoint. If one test requires an env var the other doesn't, the shared endpoint will have whichever config was provisioned first. The original json.dumps(hw, sort_keys=True) over the full hardware config was safe; this reduced key loses fields that could legitimately differ.

6. New (e2e): concurrent gather loses per-test granularity in CI output

The move from @pytest.mark.parametrize to a single test_mock_worker_jobs that gathers all cases is a valid performance win, but a failure now reports as one test failure with concatenated messages rather than individual named test failures. pytest-xdist or keeping parametrize with async could give both concurrency and per-case reporting — though for the current test count this is a trade-off, not a blocker.

7. New (e2e): cold start test hides diagnostics with DEVNULL

stdout=asyncio.subprocess.DEVNULL,
stderr=asyncio.subprocess.DEVNULL,

When the cold start test fails (the most likely failure mode in CI), there's no process output to explain why. The PIPEDEVNULL change means CI gets only the timing assertion failure, not the server error. Consider capturing to a temp file and attaching to the failure message.

Nits

  • The if proc.returncode is None: guard before SIGINT in test_cold_start.py is a correct fix.
  • The CLAUDE.md path change (removing the absolute local path) is the right call.

Verdict

NEEDS WORK

Items 1 and 2 from the prior review are still open. Item 4 (flash undeploy --all) is a new correctness/blast-radius issue worth fixing before merge. Items 5 and 7 are meaningful but lower priority. Item 3 and item 6 are nits.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

@promptless
Copy link

promptless bot commented Mar 23, 2026

📝 Documentation updates detected!

New suggestion: Add troubleshooting for corrupted credentials file


Tip: Send Promptless a meeting transcript in Slack to generate doc updates 📝

Copy link

@KAJdev KAJdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a bunch of flash integration tests and docs, which also seems to be duplicated with #479? I'm not sure this belongs in runpod-python

@deanq
Copy link
Member Author

deanq commented Mar 23, 2026

I see a bunch of flash integration tests and docs, which also seems to be duplicated with #479? I'm not sure this belongs in runpod-python

Prerequisite. This is a stacked PR. The legacy e2e tests were blocking due to constant failures. Once that merges, this PR will look simpler.

Copy link

@runpod-Henrik runpod-Henrik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delta review — since 2026-03-23T13:29:51Z

Fixed since last review

Item 7 (cold-start DEVNULL) — Addressed. test_cold_start.py now captures subprocess output to a tempfile and includes the last 50 lines in the failure message.


Still open

Item 1 — misleading test_set_credentials_corrupted_toml_allows_overwrite

The comment now reads # overwrite=True skips the toml.load check entirely, which correctly describes the situation, but the test still passes regardless of whether this PR exists. The side_effect = ValueError is never triggered. This test adds false coverage signal; the line that actually matters is in test_set_credentials_corrupted_toml_no_overwrite. Remove or repurpose allows_overwrite.

Item 2 — silent failure on corrupted config

get_credentials() still returns None with no warning, log, or stderr message when the file exists but is unreadable. A user with a corrupted ~/.runpod/config.toml gets unexplained auth failures. A single warnings.warn("~/.runpod/config.toml is corrupted — ignoring. Re-run flash login to restore credentials.", stacklevel=2) would make this actionable.

Item 4 — flash undeploy --all --force blast radius

conftest.py:55 still tears down every Flash endpoint in the account. Unchanged from the prior review.


New finding — verify_local_runpod reverts to fragile string check

if "runpod-python" not in runpod.__file__:

PR #479 fixed this to use Path(__file__).resolve().is_relative_to(_REPO_ROOT). This PR's conftest.py predates that fix and uses the old string-match approach, which breaks if the repo is cloned under any path not containing the string runpod-python. Since this PR lists #479 as a prerequisite, it should pick up that fix before or at merge.


Prior items closed or downgraded

Item Status
7 — cold-start DEVNULL Fixed
3 — no cross-ref comment in get_credentials Nit, unchanged
5 — hardware_config_key may group mismatched tests Question, unchanged
6 — concurrent gather loses per-test granularity Nit, unchanged

Verdict

NEEDS WORK — Items 1, 2, and 4 are still open. Item 4 (blast radius) is shared with #479. Item 1 (dead test) and Item 2 (silent failure) are specific to this PR and straightforward to fix.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

deanq added 2 commits March 23, 2026 16:16
…credentials

get_credentials() called toml.load() without exception handling, causing
a raw TOMLDecodeError to crash any code that imports the runpod package
when ~/.runpod/config.toml is malformed. Return None on parse failure,
matching the existing pattern in check_credentials().
Address review feedback:
- Patch os.path.exists via module path for consistency with existing tests
- Patch os.makedirs and Path.touch in set_credentials tests for hermeticity
@deanq deanq force-pushed the deanq/ae-2576-bug-corrupted-credentials-file-crash branch from 89973d0 to 00c6105 Compare March 23, 2026 23:16
deanq added 4 commits March 23, 2026 16:54
Fails with 403 on the Runpod GraphQL API.
flash undeploy looks up endpoints in .runpod/resources.pkl which does
not exist in CI. Use the Runpod GraphQL API to query endpoints by name
and delete them directly.
The RUNPOD_API_KEY in CI lacks GraphQL API access, causing 403 on
cleanup. Catch the error so test results are not masked by teardown
failures.
@deanq deanq merged commit 9894894 into main Mar 24, 2026
7 checks passed
@deanq deanq deleted the deanq/ae-2576-bug-corrupted-credentials-file-crash branch March 24, 2026 02:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants