refactor(agentkit): engineering-standards uplift (logging / exceptions / timeouts) by tiantt · Pull Request #104 · volcengine/agentkit-sdk-python

tiantt · 2026-07-01T02:40:20Z

What

Raises the core HTTP / credential / SDK inner-loop to one consistent engineering standard across three dimensions, plus the underlying transient-retry fix.

Foundations (new leaves): agentkit/errors.py (single stdlib-only root AgentKitError), utils/http_defaults.py (one source of truth for timeout/retry env+defaults), utils/redact.py + a RedactionFilter on every log handler.
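As an illustration of the `http_defaults` idea, here is a minimal sketch of an env-driven, clamped timeout reader. Only the `AGENTKIT_HTTP_TIMEOUT` variable and the `http_timeout()` name come from this PR. The default value, the bounds, the `env` parameter and the fall-back-to-default behavior are assumptions made for the example, not the module's actual values.

```python
import math
import os
from typing import Mapping, Optional

# Assumed values for illustration only; the real defaults/bounds may differ.
DEFAULT_TIMEOUT = 30.0
MIN_TIMEOUT = 1.0
MAX_TIMEOUT = 300.0


def http_timeout(env: Optional[Mapping[str, str]] = None) -> float:
    """Read the HTTP timeout from the environment and clamp it to safe bounds."""
    env = os.environ if env is None else env
    raw = env.get("AGENTKIT_HTTP_TIMEOUT")
    if raw is None:
        return DEFAULT_TIMEOUT
    try:
        value = float(raw)
    except ValueError:
        # Unparseable input falls back to the default rather than raising.
        return DEFAULT_TIMEOUT
    if not math.isfinite(value):
        # "nan" / "inf" parse as floats but are not usable timeouts.
        return DEFAULT_TIMEOUT
    return min(max(value, MIN_TIMEOUT), MAX_TIMEOUT)
```

Accepting an explicit `env` mapping keeps the function testable offline without patching `os.environ`.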

Logging: get_logger(__name__) now actually used in the scoped modules (previously 0 callers); library stays silent-by-default.
Exceptions: transport errors wrapped into NetworkError; typed ApiError (with error_code); from e chaining; raise e→raise; runtime-validation asserts→raises.
Secret leaks closed: API key (identity/auth), full response (request.py, cr.py debug, ve_sign.check_error). print(1)/print(2) removed. request() now re-raises instead of silently returning None.
Timeouts: base_service_client reads http_timeout()/http_retries() instead of hardcoded 30/30; cli_mount urlopen bounded.
Transient retry (base commit): OpenAPI calls get a bounded timeout + conservative retry (connection errors + 429/503 only, honors Retry-After, never retries read-timeouts — safe for non-idempotent Create*).
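The retry policy described above can be sketched as two small pure functions. This is not the SDK's code: `ConnectError`, `ReadTimeout`, `should_retry` and `backoff_delay` are hypothetical names, the backoff constants are assumed, and only the delta-seconds form of `Retry-After` is handled.

```python
from typing import Optional

# Statuses the PR description names as retryable.
RETRYABLE_STATUS = {429, 503}


class ConnectError(Exception):
    """Hypothetical stand-in: the request never reached the server."""


class ReadTimeout(Exception):
    """Hypothetical stand-in: the request was sent but no response arrived."""


def should_retry(status: Optional[int] = None,
                 error: Optional[Exception] = None) -> bool:
    """Retry only when the server cannot have acted on the request."""
    if error is not None:
        # A read timeout may mean a Create* call already succeeded server-side,
        # so it is never retried; only connection failures are.
        return isinstance(error, ConnectError)
    return status in RETRYABLE_STATUS


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 0.5, cap: float = 8.0) -> float:
    """Seconds to wait before the next attempt (attempt is 0-based)."""
    if retry_after is not None:
        try:
            # Honor a delta-seconds Retry-After header, bounded by the cap.
            return float(min(max(int(retry_after), 0), cap))
        except ValueError:
            pass  # e.g. an HTTP-date; fall through to exponential backoff
    return min(base * (2 ** attempt), cap)
```

Keeping the decision separate from the transport makes the "never retry read-timeouts" rule easy to pin in an offline unit test.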

Tests

+22 offline unit tests (hierarchy, http_defaults clamping, RedactionFilter, _signed_request/_invoke_api error paths). Backward-compatible: except AuthError still catches the same set; no code did except AgentKitError before.
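For readers unfamiliar with the pattern, the hierarchy and the backward-compatibility claim can be sketched as follows. The class names and the `error_code` attribute come from this PR. The direct parentage of `NetworkError` and `ApiError`, the constructor signature, and the `call_backend` helper are assumptions for illustration.

```python
from typing import Any, Callable, Optional


class AgentKitError(Exception):
    """Single root: `except AgentKitError` catches any AgentKit failure."""


class AuthError(AgentKitError):
    """Existing `except AuthError` handlers keep catching the same errors."""


class NetworkError(AgentKitError):
    """Transport-level failure (assumed here to sit directly under the root)."""


class ApiError(AgentKitError):
    """Backend failure carrying an optional machine-readable error code."""

    def __init__(self, message: str, error_code: Optional[str] = None) -> None:
        super().__init__(message)
        self.error_code = error_code


def call_backend(send: Callable[[], Any]) -> Any:
    """Hypothetical HTTP-owning layer: wrap transport errors, keep the cause."""
    try:
        return send()
    except OSError as e:
        # `from e` preserves the original exception as __cause__.
        raise NetworkError(f"transport failure: {e}") from e
```

Because the new root is only added above the existing classes, handlers written against `AuthError` are unaffected, while new code can opt into the broader `AgentKitError` catch.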

Stack

Base of the stack. Followed by the test PR (test/agentkit-apps-pipeline-ut) then the fix PR (fix/agentkit-latent-bugs).

🤖 Generated with Claude Code

…/exception/timeout standard

Establishes one standard for each of three engineering dimensions on the inner-loop HTTP/credential/SDK layer, and applies it. Good building blocks already existed but were unused (get_logger had 0 callers; redact()/mask() were auth-only).

Foundations (new, dependency-light leaves):
- agentkit/errors.py: single stdlib-only root AgentKitError. toolkit.errors.AgentKitError and auth.errors.AuthError both reparent onto it, so a caller can `except agentkit.errors.AgentKitError` to catch any AgentKit failure while agentkit.auth stays free of any agentkit.toolkit import.
- agentkit/utils/http_defaults.py: single source of truth for timeout/retry defaults + env vars (AGENTKIT_HTTP_TIMEOUT/RETRIES/STREAM_TIMEOUT), clamped.
- agentkit/utils/redact.py: redact()/mask() moved here; auth/_redact.py is now a re-export shim. RedactionFilter installed on every log handler.

Logging: swap logging.getLogger()/idioms for get_logger(__name__) in the scoped core modules; library stays silent-by-default.

Exceptions: wrap transport errors into NetworkError at the HTTP-owning layer; typed ApiError (with optional error_code) for backend failures; always chain with `from e`; `raise e` -> bare `raise`; runtime-validation asserts -> raises.

Secret-leak fixes: drop raw response/secret interpolation from identity/auth (API key), utils/request, cr.py debug log, ve_sign check_error. request() now re-raises instead of silently returning None on failure.

Timeouts: base_service_client reads http_timeout()/http_retries() instead of hardcoded 30/30; cli_mount urlopen bounded via http_timeout().

Tests: +22 offline tests (hierarchy, http_defaults clamping, RedactionFilter, _signed_request/_invoke_api error paths). Full suite: only the 1 pre-existing failure on main (test_cli_add_harness env leak) remains; 0 new failures.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
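A handler-level redaction filter of the kind this commit describes might look like the sketch below. `RedactionFilter` and `redact()` are names from the commit, but their real signatures and matching rules are not shown in this PR. The regular expression, the `***` mask and the demo logger name are assumptions.

```python
import io
import logging
import re

# Assumed pattern: "<api_key|secret|token> <=|:> <value>"; the real rules may differ.
_SECRET_RE = re.compile(r"(?i)\b(api[_-]?key|secret|token)\b(\s*[=:]\s*)(\S+)")


def redact(text: str) -> str:
    """Replace the value part of key=value style secrets with a mask."""
    return _SECRET_RE.sub(lambda m: f"{m.group(1)}{m.group(2)}***", text)


class RedactionFilter(logging.Filter):
    """Scrub secrets from the fully formatted message before it is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so secrets passed as lazy %-args are also covered,
        # then clear args so the handler does not re-apply them.
        record.msg = redact(record.getMessage())
        record.args = None
        return True


# Usage: attach the filter to the handler, not only to the logger.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(RedactionFilter())

logger = logging.getLogger("agentkit.redaction_demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info("calling backend with api_key=%s", "abc123")
```

Installing the filter on the handler means records propagated from child loggers pass through it as well, which a logger-level filter would not guarantee.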

Targets the two highest-risk under-tested areas found by a coverage audit: the deployable app server surfaces (apps/, ~9.9% -> 76.3%) and the deploy pipeline logic (toolkit runners/executors/strategies). Overall line coverage 45.9% -> 50.9%. +258 tests, all offline (no network/docker/uvicorn).

apps/ (Starlette TestClient + asyncio.run, telemetry singletons spied):
- simple_app: health/readiness/liveness + /ping wiring + ping() arg guard
- simple_app_handlers: InvokeHandler signature dispatch + response matrix, _convert_to_sse, AsyncTaskHandler, PingHandler
- a2a_app / mcp_app: decorator guards, ping endpoint, env-detect routes/tools
- agent_server: AgentKitAgentLoader + ctor guards; telemetry latency gating and error_type formatting; ASGI middleware header-exclusion + finish gating

pipeline (fakes injected via the existing DI seams):
- hybrid_strategy: _should_push_to_cr / _validate_cr_image_url error mapping, build/deploy config-updates assembly, invoke/status/destroy passthrough
- base_executor: _classify_error / _handle_exception (pins the ErrorCode mapping introduced by the engineering-standards work), preflight handling
- init_executor: render context, template/name validation, project scaffold
- runners/base + ve_agentkit: pure payload/detection helpers, _needs_runtime_update env diffing, deploy guard/dispatch, destroy/status mapping

Pins (NOT fixes; this branch is tests-only) three latent source bugs to their current behavior with # NOTE comments, for a follow-up fix:
- mcp_app tool()/agent_as_a_tool() wrappers: on a tool exception the finally passes func_result=result while result is unbound -> UnboundLocalError masks the real error and skips telemetry (all 4 wrapper branches).
- simple_app_handlers PingHandler._format_ping_status: self.func.__name__ raises AttributeError when no ping func is registered.
- simple_app_handlers AsyncTaskHandler.handle: signature omits request, violating the BaseHandler.handle(request) contract.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
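The first pinned bug (an unbound `result` referenced in a `finally`) is easy to reproduce in isolation. The sketch below is a reduced model, not the `mcp_app` source: `record`, `pinned_wrapper`, `guarded_wrapper` and `failing_tool` are hypothetical names, and the guarded variant is only one possible shape for the follow-up fix.

```python
from typing import Any, Callable, List

telemetry_calls: List[Any] = []


def record(func_result: Any) -> None:
    """Stand-in for the telemetry hook."""
    telemetry_calls.append(func_result)


def pinned_wrapper(func: Callable[..., Any]) -> Callable[..., Any]:
    """Models the current behavior: `result` is unbound if `func` raises."""
    def wrapped(*args: Any, **kwargs: Any) -> Any:
        try:
            result = func(*args, **kwargs)
            return result
        finally:
            # If func raised, evaluating `result` raises UnboundLocalError,
            # which replaces the original exception and skips record().
            record(func_result=result)
    return wrapped


def guarded_wrapper(func: Callable[..., Any]) -> Callable[..., Any]:
    """One possible fix: bind `result` before entering the try block."""
    def wrapped(*args: Any, **kwargs: Any) -> Any:
        result = None
        try:
            result = func(*args, **kwargs)
            return result
        finally:
            record(func_result=result)
    return wrapped


def failing_tool() -> None:
    raise ValueError("tool failed")
```

With `pinned_wrapper`, the caller sees `UnboundLocalError` (the `ValueError` survives only as `__context__`). With `guarded_wrapper`, the `ValueError` propagates and telemetry is recorded with `None`.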

…st surface)

The prior batch raised the coverage number but left the product's core orchestration ~0%. This batch tests that spine directly with proper fakes, asserting behavior (error codes, response shapes, collaborator calls), lifting the critical functions from ~0% to tested:

  ve_pipeline.build (build->push->deploy spine)        1.9% -> 100%
  ve_pipeline._upload_to_tos (bucket-ownership gate)   0%   -> 70%
  ve_agentkit._create_new_runtime                      3%   -> 91%
  ve_agentkit._update_existing_runtime                 1.8% -> 88%
  ve_agentkit._wait_for_runtime_status                 9.5% -> 100%
  runners/base._http_post_invoke (invoke transport)    1%   -> 91%
  builders/local_docker.build                          0.7% -> 86%
  agent_server_app._invoke_compat (/invoke)            0%   -> 89%

Overall coverage 45.9% -> 53.7%. +81 tests, all offline (no network/docker/uvicorn; polling loops patched). No source modified.

Pins three more latent source bugs to current behavior (# NOTE, for follow-up):
- ve_pipeline._upload_to_tos: the documented ValueError for an unrendered {{template}} bucket name is caught by the broad `except Exception` tail and re-raised as a generic Exception, so callers can never catch ValueError.
- ve_agentkit._update_existing_runtime: client_token=generate_client_token() is passed to UpdateRuntimeRequest, which has no such field and extra="ignore", so the idempotency token is silently dropped (Create keeps it, Update sends none).
- builders/local_docker.build: BuildResult.build_logs is typed List[str] but the raw str from DockerManager.build_image is assigned directly on both the success and failure paths.

Also hardens test_ve_pipeline_upload_tos's AccountDisable assertion to be cloud-provider-independent (console host varies volcengine vs byteplus and provider global state leaks across the full suite).

Remaining gap: _execute_build's poll loop is still thin (mocked out to isolate the build spine); a follow-up target.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
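The `_upload_to_tos` pin describes a common pattern: a specific exception raised inside a `try` is swallowed by a trailing broad handler and re-raised as a plain `Exception`. A reduced model follows. The function names, messages and return value are hypothetical, and the second variant is only one way a follow-up might preserve the documented type.

```python
def upload_pinned(bucket: str) -> str:
    """Models the pinned behavior: ValueError is downgraded to Exception."""
    try:
        if "{{" in bucket:
            raise ValueError(f"unrendered template in bucket name: {bucket}")
        return f"uploaded to {bucket}"
    except Exception as e:
        # The broad tail also catches the ValueError raised just above.
        raise Exception(f"upload failed: {e}")


def upload_preserving(bucket: str) -> str:
    """One possible fix: let the documented ValueError pass through."""
    try:
        if "{{" in bucket:
            raise ValueError(f"unrendered template in bucket name: {bucket}")
        return f"uploaded to {bucket}"
    except ValueError:
        raise  # keep the documented exception type for callers
    except Exception as e:
        raise Exception(f"upload failed: {e}") from e
```

A caller writing `except ValueError` around `upload_pinned` never matches, because the exception that escapes has type `Exception` exactly.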

…rface + core logic) (#105)

cuericlee

/lgtm

Overall: LGTM ✅

This is a well-structured refactoring PR. Key strengths:

Category | Grade
-- | --
assert → raise | ✅ Eliminates -O-dependent behavior across the SDK
Exception hierarchy | ✅ AgentKitError root enables uniform error handling
Redact extraction | ✅ Clean module relocation with backward-compatible shim
HTTP defaults consolidation | ✅ DRY improvement
Logging fixes | ✅ exc_info, lazy formatting, and critical raise bugfix in request.py
Test coverage | ✅ New dedicated test files; latent-bug-pinning pattern
Backward compatibility | ✅ All public APIs preserved

One note: The raise addition in agentkit/utils/request.py (Theme 6) changes the function's contract from "may return None on error" to "always raises on error." This is the right fix, but verify that all call sites handle the exception rather than relying on a None-check pattern.
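To make the contract change concrete, here is a hedged sketch of what a call-site migration could look like. The `request` function below is a simplified stand-in, not the code in agentkit/utils/request.py. `load_profile`, the `send` callable and the `{}` fallback are hypothetical.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger("agentkit.request_demo")
logger.addHandler(logging.NullHandler())  # stay silent unless the app configures logging


def request(send: Callable[[], Any]) -> Any:
    """New contract: log, then re-raise. Never returns None on failure."""
    try:
        return send()
    except Exception:
        logger.exception("request failed")
        raise  # bare raise keeps the original traceback


def load_profile(send: Callable[[], Dict[str, str]]) -> Dict[str, str]:
    """Call site migrated from a None-check to explicit exception handling.

    Old pattern (no longer reachable on failure):
        resp = request(send)
        if resp is None:
            return {}
    """
    try:
        return request(send)
    except OSError:
        # Only transport-style failures fall back; other errors propagate.
        return {}
```

Call sites that previously tested for `None` need a decision per site: handle the exception as above, or let it propagate to a layer that can report it.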

tiantt and others added 3 commits July 1, 2026 10:47

tiantt force-pushed the feat/agentkit-engineering-standards branch from fedc9c8 to ad13eaf Compare July 1, 2026 02:48

test(agentkit): unit tests for apps entrypoints + deploy-pipeline (su…

e369ea0

…rface + core logic) (#105)

cuericlee approved these changes Jul 1, 2026

cuericlee merged commit 5fb9f81 into main Jul 1, 2026
4 checks passed

cuericlee merged 4 commits into main from feat/agentkit-engineering-standards
