Skip to content

fix: return 503 instead of 500 when upstream registry is unavailable#4352

Open
peppescg wants to merge 3 commits intomainfrom
issues/4351
Open

fix: return 503 instead of 500 when upstream registry is unavailable#4352
peppescg wants to merge 3 commits intomainfrom
issues/4351

Conversation

@peppescg
Copy link
Contributor

Summary

When the configured registry API URL is wrong or the upstream registry returns a non-auth error (e.g. 404, timeout, connection refused), the GET registry endpoints returned a generic 500 Internal Server Error with plain text "Failed to get registry". This made it impossible for clients (like ToolHive Studio) to distinguish a misconfigured URL from a real server bug.

This PR introduces a RegistryUnavailableError type and returns 503 Service Unavailable with a structured JSON response containing "code": "registry_unavailable" and the original upstream error message, following the same pattern as the existing registry_auth_required 503 response.

Fixes #4351

Type of change

  • Bug fix
  • New feature
  • Refactoring (no behavior change)
  • Dependency update
  • Documentation
  • Other (describe):

Test plan

  • Unit tests (task test)
  • E2E tests (task test-e2e)
  • Linting (task lint-fix)
  • Manual testing (describe below)

Manual testing with curl:

  1. Set wrong API URL via PUT → 200
  2. GET /api/v1beta/registry503 with {"code":"registry_unavailable","message":"upstream registry at ... is unavailable: registry API returned status 404 ..."}
  3. All three GET endpoints (/registry, /registry/default, /registry/default/servers) return the same 503 JSON
  4. Set correct URL → GET returns registry_auth_required (correct behavior)

Changes

File Change
pkg/registry/errors.go New RegistryUnavailableError type with Error(), Unwrap()
pkg/registry/errors_test.go Unit tests for error type (Error, Unwrap, errors.As)
pkg/registry/provider_api.go Wrap non-auth upstream errors with RegistryUnavailableError in GetRegistry() and NewAPIRegistryProvider
pkg/api/v1/registry.go Add writeRegistryUnavailableError() helper; check for RegistryUnavailableError in getCurrentProvider, listRegistries, getRegistry, listServers
pkg/api/v1/registry_test.go Integration test: mock 404 server → verify 503 JSON on all three GET endpoints

Does this introduce a user-facing change?

Yes. Registry GET endpoints now return HTTP 503 with a structured JSON body {"code": "registry_unavailable", "message": "..."} instead of HTTP 500 with plain text when the upstream registry is unreachable or misconfigured. Clients can use the code field to distinguish this from registry_auth_required.

Special notes for reviewers

The RegistryUnavailableError follows the same pattern as the existing connectivityError type used in the PUT handler. The 503 JSON response reuses the registryAuthErrorResponse struct since it has the same shape (code + message).

When the configured registry API URL is wrong or the upstream registry
returns a non-auth error (404, timeout, connection refused), the GET
endpoints now return HTTP 503 with a structured JSON response containing
code "registry_unavailable" and the original error message, instead of
a generic 500 "Failed to get registry".

This follows the same pattern as the existing "registry_auth_required"
503 response, allowing clients to distinguish between auth issues and
upstream availability issues.

Fixes #4351

Made-with: Cursor
@github-actions github-actions bot added the size/S Small PR: 100-299 lines changed label Mar 24, 2026
@peppescg
Copy link
Contributor Author

Tested with CURL and desktop app

➜  ~ curl -s -X PUT http://localhost:8181/api/v1beta/registry/default \
  -H 'Content-Type: application/json' \
  -d '{"api_url": "https://toolhive-registry.stacklok.dev/", "allow_private_ip": true}'
{"type":"api"}
➜  ~ curl -s -w '\nHTTP Status: %{http_code}\n' http://localhost:8181/api/v1beta/registry

{"code":"registry_unavailable","message":"upstream registry at https://toolhive-registry.stacklok.dev/ is unavailable: registry API returned status 404 for https://toolhive-registry.stacklok.dev/v0.1/servers?limit=100\u0026version=latest: 404 page not found\n"}

HTTP Status: 503
Screenshot 2026-03-24 at 18 11 46

@peppescg peppescg self-assigned this Mar 24, 2026
@codecov
Copy link

codecov bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 45.94595% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.96%. Comparing base (d288822) to head (663d8f0).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pkg/api/v1/registry.go 37.93% 17 Missing and 1 partial ⚠️
pkg/registry/provider_api.go 0.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4352      +/-   ##
==========================================
+ Coverage   68.84%   68.96%   +0.12%     
==========================================
  Files         479      480       +1     
  Lines       48505    48547      +42     
==========================================
+ Hits        33392    33481      +89     
+ Misses      12365    12284      -81     
- Partials     2748     2782      +34     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…tter

Address golangci-lint revive and gci formatting issues:
- Rename RegistryUnavailableError → UnavailableError (registry.RegistryUnavailableError stutters)
- Fix struct field alignment formatting (gci)

Made-with: Cursor
@github-actions github-actions bot added size/S Small PR: 100-299 lines changed and removed size/S Small PR: 100-299 lines changed labels Mar 24, 2026
@peppescg
Copy link
Contributor Author

@claude succinct review

@claude
Copy link
Contributor

claude bot commented Mar 24, 2026

Claude finished @peppescg's task in 57s —— View job


Code Review

Overall: Clean, well-structured implementation. A few minor issues worth addressing.


Issues

1. Misleading struct reuse (pkg/api/v1/registry.go:57)

writeRegistryUnavailableError uses registryAuthErrorResponse for a non-auth error. The struct works (same shape), but the name is confusing — a future reader will wonder why an "auth error response" is used for an unavailable error. Consider renaming to registryErrorResponse or extracting a separate type.

Fix this →

2. Duplicate error types for the same concept (pkg/api/v1/registry.go:180 vs pkg/registry/errors.go:11)

connectivityError (in the API layer, used by PUT) and UnavailableError (in the registry package, used by GET) represent overlapping concepts. Not a blocker, but worth a follow-up to unify them.

3. Missing Swagger @Failure 503 annotations

The three GET handlers (listRegistries, getRegistry, listServers) now return 503 for UnavailableError, but the Swagger annotations only document @Success 200. Clients relying on generated docs won't see this. Example for listRegistries:

// @Failure  503  {object}  registryAuthErrorResponse  "Registry unavailable"

4. Raw upstream error exposed in response body

writeRegistryUnavailableError passes unavailableErr.Error() directly as the JSON message. This is useful for debugging (as shown in your curl test), but it leaks upstream URLs and raw error text to API consumers. This is probably acceptable for a local serve mode, but worth a conscious decision.


Minor observation

The slog.Error("upstream registry unavailable", ...) calls appear before writeRegistryUnavailableError in getCurrentProvider, listRegistries, getRegistry, and listServers. Since getCurrentProvider returns early, there's no double-logging — but the log is repeated in each call site unnecessarily when the error originates from getCurrentProvider. The callers that check ok won't reach the logging block, so this is fine as-is.


Summary: Functionally correct. The main actionable items are (1) rename registryAuthErrorResponse to something generic and (2) add Swagger 503 annotations.

The struct is now used for both registry_auth_required and
registry_unavailable 503 responses, so the name should be generic.

Made-with: Cursor
@github-actions github-actions bot added size/S Small PR: 100-299 lines changed and removed size/S Small PR: 100-299 lines changed labels Mar 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/S Small PR: 100-299 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Registry API returns 500 instead of 503 when upstream registry URL is misconfigured

2 participants