fix(config): handle corrupted config.toml in credential functions by deanq · Pull Request #481 · runpod/runpod-python

deanq · 2026-03-20T23:50:49Z

Prerequisite: #479

Summary

get_credentials() now catches (TypeError, ValueError) when toml.load() fails on a corrupted ~/.runpod/config.toml, returning None instead of crashing. This matches the existing pattern in check_credentials().
set_credentials() (non-overwrite path) gets the same treatment -- a corrupted file is treated as empty rather than crashing.

Root cause: get_credentials() is called at import time in runpod/__init__.py:81. A malformed config.toml causes a TOMLDecodeError (subclass of ValueError) that crashes any downstream package (e.g. flash) before their error handling can run.

Fixes: AE-2576
Related: AE-2485 (fixed Flash's own read path; this fixes the dependency's read path)

Test plan

test_get_credentials_corrupted_toml -- ValueError returns None
test_get_credentials_type_error -- TypeError returns None
test_set_credentials_corrupted_toml_allows_overwrite -- overwrite=True ignores corrupt file
test_set_credentials_corrupted_toml_no_overwrite -- overwrite=False treats corrupt as empty
Existing tests unchanged and passing

Copilot

Pull request overview

This PR hardens credential config handling so corrupted ~/.runpod/config.toml won’t crash callers (notably when get_credentials() is invoked at import time), aligning behavior with existing check_credentials() error handling.

Changes:

Catch (TypeError, ValueError) around toml.load() in get_credentials() and return None on parse/type failures.
Treat corrupted config as empty in set_credentials() when overwrite=False, preventing crashes during existence checks.
Add unit tests covering corrupted TOML and type errors for get_credentials(), plus corrupted TOML behavior for set_credentials().

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
`runpod/cli/groups/config/functions.py`	Adds defensive exception handling for corrupted TOML in credential read/check paths.
`tests/test_cli/test_cli_groups/test_config_functions.py`	Adds tests for corrupted TOML/type errors and `set_credentials()` overwrite/non-overwrite behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

tests/test_cli/test_cli_groups/test_config_functions.py

runpod-Henrik

1. Core fix — correct

get_credentials() now returns None on (TypeError, ValueError) instead of crashing. This matches the existing pattern in check_credentials() and unblocks the import-time crash in runpod/__init__.py:81. The set_credentials() non-overwrite path gets the same treatment — a corrupted file is treated as empty rather than raising.

2. Bug: `test_set_credentials_corrupted_toml_allows_overwrite` doesn't test the changed code

mock_toml_load.side_effect = ValueError("Invalid TOML")
functions.set_credentials("NEW_KEY", overwrite=True)

With overwrite=True, the if not overwrite: block is skipped entirely — toml.load is never called. The side_effect is never triggered. The test would pass with or without this PR. The test that actually validates the changed code is test_set_credentials_corrupted_toml_no_overwrite. The overwrite=True test is redundant for this change and should either be removed or repurposed to verify write behavior.

3. Issue: silent failure gives no signal to the user

except (TypeError, ValueError):
    return None

get_credentials() returning None on a corrupted file is the right behavior for the import-time crash, but it's silent. Downstream code in runpod/__init__.py sets _credentials = None with no indication that the config was present but unreadable. A user gets auth failures with no explanation. A warnings.warn() or sys.stderr message noting the file is corrupted would make this debuggable.

Nit

check_credentials() already had the same (TypeError, ValueError) pattern — worth a comment in get_credentials() cross-referencing this so future readers understand the intentional symmetry.

Verdict: PASS WITH NITS

Fix is correct. Item 2 (misleading test) and item 3 (silent failure) are worth a follow-up.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

runpod-Henrik

Delta review — since 2026-03-21 01:16

1. Issue still open — misleading `test_set_credentials_corrupted_toml_allows_overwrite`

The comment was updated to read: # overwrite=True skips the toml.load check entirely — which correctly describes the situation but doesn't fix it. The side_effect = ValueError(...) is still never triggered. The test passes with or without this PR, and the added comment confirms that. Either repurpose this test to verify write behavior after the corrupted-read path, or remove it. As-is it adds false coverage.

2. Issue still open — silent failure on corrupted config

get_credentials() still returns None silently on a corrupted file with no log, warning, or stderr message. A user gets unexplained auth failures. warnings.warn() or a single logging.warning() would make this debuggable.

3. Nit still open — no cross-reference comment

get_credentials() and check_credentials() share the same (TypeError, ValueError) catch pattern. A comment in get_credentials() noting the intentional symmetry with check_credentials() would help future readers.

4. New (e2e): `flash undeploy --all --force` has unbounded blast radius

subprocess.run(["flash", "undeploy", "--all", "--force"], ...)

This undeploys every flash-managed endpoint in the account, not just the ones provisioned by this test run. If any other endpoint is live in the same API key's account (e.g. a developer's personal deployment, another CI run), it gets torn down. The original loop over eps.items() was scoped to exactly what was provisioned. Recommend scoping to a prefix (doc_test_/runpod_e2e_*) or tracking endpoint IDs at provision time and undeploying only those.

5. New (e2e): `hardware_config_key` normalization may silently group mismatched tests

normalized = {
    "gpuIds": hw.get("endpointConfig", {}).get("gpuIds", ""),
    "dockerArgs": hw.get("templateConfig", {}).get("dockerArgs", ""),
}

Any two tests that share gpuIds and dockerArgs but differ in other templateConfig or endpointConfig fields (env vars, image, scalerConfig) will be mapped to the same endpoint. If one test requires an env var the other doesn't, the shared endpoint will have whichever config was provisioned first. The original json.dumps(hw, sort_keys=True) over the full hardware config was safe; this reduced key loses fields that could legitimately differ.

6. New (e2e): concurrent gather loses per-test granularity in CI output

The move from @pytest.mark.parametrize to a single test_mock_worker_jobs that gathers all cases is a valid performance win, but a failure now reports as one test failure with concatenated messages rather than individual named test failures. pytest-xdist or keeping parametrize with async could give both concurrency and per-case reporting — though for the current test count this is a trade-off, not a blocker.

7. New (e2e): cold start test hides diagnostics with DEVNULL

stdout=asyncio.subprocess.DEVNULL,
stderr=asyncio.subprocess.DEVNULL,

When the cold start test fails (the most likely failure mode in CI), there's no process output to explain why. The PIPE → DEVNULL change means CI gets only the timing assertion failure, not the server error. Consider capturing to a temp file and attaching to the failure message.

Nits

The if proc.returncode is None: guard before SIGINT in test_cold_start.py is a correct fix.
The CLAUDE.md path change (removing the absolute local path) is the right call.

Verdict

NEEDS WORK

Items 1 and 2 from the prior review are still open. Item 4 (flash undeploy --all) is a new correctness/blast-radius issue worth fixing before merge. Items 5 and 7 are meaningful but lower priority. Item 3 and item 6 are nits.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

promptless · 2026-03-23T18:05:28Z

📝 Documentation updates detected!

New suggestion: Add troubleshooting for corrupted credentials file

Tip: Send Promptless a meeting transcript in Slack to generate doc updates 📝

KAJdev

I see a bunch of flash integration tests and docs, which also seems to be duplicated with #479? I'm not sure this belongs in runpod-python

deanq · 2026-03-23T21:09:48Z

I see a bunch of flash integration tests and docs, which also seems to be duplicated with #479? I'm not sure this belongs in runpod-python

Prerequisite. This is a stacked PR. The legacy e2e tests were blocking due to constant failures. Once that merges, this PR will look simpler.

runpod-Henrik

Delta review — since 2026-03-23T13:29:51Z

Fixed since last review

Item 7 (cold-start DEVNULL) — Addressed. test_cold_start.py now captures subprocess output to a tempfile and includes the last 50 lines in the failure message.

Still open

Item 1 — misleading test_set_credentials_corrupted_toml_allows_overwrite

The comment now reads # overwrite=True skips the toml.load check entirely, which correctly describes the situation, but the test still passes regardless of whether this PR exists. The side_effect = ValueError is never triggered. This test adds false coverage signal; the line that actually matters is in test_set_credentials_corrupted_toml_no_overwrite. Remove or repurpose allows_overwrite.

Item 2 — silent failure on corrupted config

get_credentials() still returns None with no warning, log, or stderr message when the file exists but is unreadable. A user with a corrupted ~/.runpod/config.toml gets unexplained auth failures. A single warnings.warn("~/.runpod/config.toml is corrupted — ignoring. Re-run flash login to restore credentials.", stacklevel=2) would make this actionable.

Item 4 — flash undeploy --all --force blast radius

conftest.py:55 still tears down every Flash endpoint in the account. Unchanged from the prior review.

New finding — `verify_local_runpod` reverts to fragile string check

if "runpod-python" not in runpod.__file__:

PR #479 fixed this to use Path(__file__).resolve().is_relative_to(_REPO_ROOT). This PR's conftest.py predates that fix and uses the old string-match approach, which breaks if the repo is cloned under any path not containing the string runpod-python. Since this PR lists #479 as a prerequisite, it should pick up that fix before or at merge.

Prior items closed or downgraded

Item	Status
7 — cold-start DEVNULL	Fixed
3 — no cross-ref comment in `get_credentials`	Nit, unchanged
5 — `hardware_config_key` may group mismatched tests	Question, unchanged
6 — concurrent gather loses per-test granularity	Nit, unchanged

Verdict

NEEDS WORK — Items 1, 2, and 4 are still open. Item 4 (blast radius) is shared with #479. Item 1 (dead test) and Item 2 (silent failure) are specific to this PR and straightforward to fix.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

…credentials get_credentials() called toml.load() without exception handling, causing a raw TOMLDecodeError to crash any code that imports the runpod package when ~/.runpod/config.toml is malformed. Return None on parse failure, matching the existing pattern in check_credentials().

Address review feedback: - Patch os.path.exists via module path for consistency with existing tests - Patch os.makedirs and Path.touch in set_credentials tests for hermeticity

Fails with 403 on the Runpod GraphQL API.

flash undeploy looks up endpoints in .runpod/resources.pkl which does not exist in CI. Use the Runpod GraphQL API to query endpoints by name and delete them directly.

The RUNPOD_API_KEY in CI lacks GraphQL API access, causing 403 on cleanup. Catch the error so test results are not masked by teardown failures.

deanq requested a review from Copilot March 20, 2026 23:51

Copilot started reviewing on behalf of deanq March 20, 2026 23:52 View session

Copilot AI reviewed Mar 20, 2026

View reviewed changes

tests/test_cli/test_cli_groups/test_config_functions.py Outdated Show resolved Hide resolved

tests/test_cli/test_cli_groups/test_config_functions.py Show resolved Hide resolved

runpod-Henrik reviewed Mar 21, 2026

View reviewed changes

deanq force-pushed the deanq/ae-2576-bug-corrupted-credentials-file-crash branch from 35df61a to 89973d0 Compare March 21, 2026 04:40

deanq mentioned this pull request Mar 21, 2026

fix(cli): catch corrupted credentials file at import time runpod/flash#282

Closed

4 tasks

runpod-Henrik reviewed Mar 23, 2026

View reviewed changes

KAJdev requested changes Mar 23, 2026

View reviewed changes

runpod-Henrik reviewed Mar 23, 2026

View reviewed changes

deanq added 2 commits March 23, 2026 16:16

fix(test): use module-scoped patches and mock filesystem calls

00c6105

Address review feedback: - Patch os.path.exists via module path for consistency with existing tests - Patch os.makedirs and Path.touch in set_credentials tests for hermeticity

deanq force-pushed the deanq/ae-2576-bug-corrupted-credentials-file-crash branch from 89973d0 to 00c6105 Compare March 23, 2026 23:16

deanq requested review from KAJdev, jhcipar and runpod-Henrik March 23, 2026 23:16

deanq added 4 commits March 23, 2026 16:54

chore: remove broken cleanup-endpoints workflow

281d103

Fails with 403 on the Runpod GraphQL API.

fix(e2e): replace flash undeploy with direct GraphQL endpoint deletion

336c18f

flash undeploy looks up endpoints in .runpod/resources.pkl which does not exist in CI. Use the Runpod GraphQL API to query endpoints by name and delete them directly.

fix(e2e): make endpoint cleanup non-fatal

4c6ec06

The RUNPOD_API_KEY in CI lacks GraphQL API access, causing 403 on cleanup. Catch the error so test results are not masked by teardown failures.

fix(e2e): use flash undeploy --all --force for cleanup

1c2537c

KAJdev approved these changes Mar 24, 2026

View reviewed changes

deanq merged commit 9894894 into main Mar 24, 2026
7 checks passed

deanq deleted the deanq/ae-2576-bug-corrupted-credentials-file-crash branch March 24, 2026 02:53

github-actions bot mentioned this pull request Mar 24, 2026

chore(main): release 1.8.2 #485

Merged

promptless bot mentioned this pull request Mar 24, 2026

docs: Add troubleshooting for corrupted credentials file runpod/docs#591

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(config): handle corrupted config.toml in credential functions#481

fix(config): handle corrupted config.toml in credential functions#481
deanq merged 6 commits intomainfrom
deanq/ae-2576-bug-corrupted-credentials-file-crash

deanq commented Mar 20, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

runpod-Henrik left a comment

Uh oh!

runpod-Henrik left a comment

Uh oh!

promptless bot commented Mar 23, 2026

Uh oh!

KAJdev left a comment •

edited

Loading

Uh oh!

deanq commented Mar 23, 2026

Uh oh!

runpod-Henrik left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

deanq commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

runpod-Henrik left a comment

Choose a reason for hiding this comment

1. Core fix — correct

2. Bug: test_set_credentials_corrupted_toml_allows_overwrite doesn't test the changed code

3. Issue: silent failure gives no signal to the user

Nit

Uh oh!

runpod-Henrik left a comment

Choose a reason for hiding this comment

1. Issue still open — misleading test_set_credentials_corrupted_toml_allows_overwrite

2. Issue still open — silent failure on corrupted config

3. Nit still open — no cross-reference comment

4. New (e2e): flash undeploy --all --force has unbounded blast radius

5. New (e2e): hardware_config_key normalization may silently group mismatched tests

6. New (e2e): concurrent gather loses per-test granularity in CI output

7. New (e2e): cold start test hides diagnostics with DEVNULL

Nits

Verdict

Uh oh!

promptless bot commented Mar 23, 2026

Uh oh!

KAJdev left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

deanq commented Mar 23, 2026

Uh oh!

runpod-Henrik left a comment

Choose a reason for hiding this comment

Fixed since last review

Still open

New finding — verify_local_runpod reverts to fragile string check

Prior items closed or downgraded

Verdict

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

deanq commented Mar 20, 2026 •

edited

Loading

2. Bug: `test_set_credentials_corrupted_toml_allows_overwrite` doesn't test the changed code

1. Issue still open — misleading `test_set_credentials_corrupted_toml_allows_overwrite`

4. New (e2e): `flash undeploy --all --force` has unbounded blast radius

5. New (e2e): `hardware_config_key` normalization may silently group mismatched tests

KAJdev left a comment •

edited

Loading

New finding — `verify_local_runpod` reverts to fragile string check