Graceful fallback on ephemeral cache failure #2387

Open
dkosowski87 wants to merge 1 commit into main from add-caching-resilience

@dkosowski87
Contributor

What does this PR do?

During a production Dragonfly outage, workflow inference endpoints hard-failed with HTTP 500s because get_workflow_specification could not reach the ephemeral Redis/Dragonfly cache. With zero healthy cache endpoints, every request failed before falling back to the Roboflow API — even though the API itself was available. This change adds resilience so cache unavailability degrades gracefully instead of taking down workflow routes.

When ephemeral cache reads fail due to Redis connection or timeout errors, the function now logs a warning and continues to fetch the workflow definition directly from the Roboflow API. Cache write failures after a successful API fetch are also handled as best-effort: the specification is still returned to the caller. Redis errors are caught inside the cache layer so they are not misreported as Roboflow API connection failures — redis.exceptions.ConnectionError subclasses built-in ConnectionError, which @wrap_roboflow_api_errors would otherwise map to RoboflowAPIConnectionError. No HTTP endpoint or caller changes are required; both /infer/workflows/... and describe_interface paths benefit automatically.
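The read-fallback and best-effort-write flow described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `CacheBackendError`, `InMemoryCache`, `get_specification`, and the two helper names are hypothetical stand-ins (the real helpers live in inference/core/roboflow_api.py and catch Redis errors), and the API call is replaced by a plain callable.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


class CacheBackendError(Exception):
    """Hypothetical stand-in for a Redis connection or timeout error."""


class CacheUnavailableError(Exception):
    """Raised by the cache layer when the ephemeral cache cannot be reached."""


class InMemoryCache:
    """Hypothetical cache; pass broken=True to simulate an outage."""

    def __init__(self, broken: bool = False) -> None:
        self.broken = broken
        self.store: dict = {}

    def get(self, key: str) -> Optional[dict]:
        if self.broken:
            raise CacheBackendError("cache unreachable")
        return self.store.get(key)

    def set(self, key: str, value: dict) -> None:
        if self.broken:
            raise CacheBackendError("cache unreachable")
        self.store[key] = value


def _try_retrieve_from_cache(cache: InMemoryCache, key: str) -> Optional[dict]:
    # Translate backend errors inside the cache layer so they never
    # reach the API error-mapping code.
    try:
        return cache.get(key)
    except CacheBackendError as error:
        raise CacheUnavailableError(str(error)) from error


def _try_cache(cache: InMemoryCache, key: str, value: dict) -> None:
    # Writes are best-effort: a failure is logged and swallowed.
    try:
        cache.set(key, value)
    except CacheBackendError as error:
        logger.warning("Could not cache %s: %s", key, error)


def get_specification(
    cache: InMemoryCache, key: str, fetch_from_api: Callable[[], dict]
) -> dict:
    try:
        cached = _try_retrieve_from_cache(cache, key)
    except CacheUnavailableError as error:
        logger.warning("Cache unavailable, falling back to API: %s", error)
        cached = None
    if cached is not None:
        return cached
    specification = fetch_from_api()
    _try_cache(cache, key, specification)
    return specification
```

With `InMemoryCache(broken=True)`, `get_specification` still returns whatever `fetch_from_api` produces; with a healthy cache, a second call for the same key is served from the cache without calling the API again.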

Main elements:

  • CacheUnavailableError exception in inference/core/exceptions.py
  • Ephemeral cache read/write resilience in get_workflow_specification (inference/core/roboflow_api.py)
  • Private helpers: _try_retrieve_*, _try_cache_*, _raise_cache_unavailable_error
  • Unit tests in tests/inference/unit_tests/core/test_roboflow_api.py

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Other:

Testing

Unit tests

  • Cache read failure (RedisConnectionError) falls back to Roboflow API and returns the fetched specification
  • Cache write failure (RedisTimeoutError) after a successful API fetch still returns the specification
  • Regression note: Redis connection errors must not bubble up as RoboflowAPIConnectionError via @wrap_roboflow_api_errors
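To illustrate the regression note, here is a minimal sketch of how a cache error could be misreported by an error-mapping decorator. It assumes, as the description states, that the cache error subclasses built-in `ConnectionError`. All names are hypothetical: `wrap_api_errors` is a simplified analogue of `@wrap_roboflow_api_errors`, `ApiConnectionError` stands in for `RoboflowAPIConnectionError`, and `CacheConnectionError` stands in for the Redis error.

```python
import functools


class ApiConnectionError(Exception):
    """Hypothetical stand-in for RoboflowAPIConnectionError."""


class CacheConnectionError(ConnectionError):
    """Hypothetical cache error assumed to subclass built-in ConnectionError."""


def wrap_api_errors(function):
    """Simplified, hypothetical analogue of @wrap_roboflow_api_errors."""

    @functools.wraps(function)
    def wrapper(*args, **kwargs):
        try:
            return function(*args, **kwargs)
        except ConnectionError as error:
            # Any ConnectionError subclass is reported as an API failure.
            raise ApiConnectionError("API unreachable") from error

    return wrapper


def broken_cache_get(key):
    # Simulates a cache outage.
    raise CacheConnectionError("cache unreachable")


@wrap_api_errors
def fetch_without_guard(key):
    # The cache error escapes and is remapped by the decorator.
    cached = broken_cache_get(key)
    return cached or {"source": "api"}


@wrap_api_errors
def fetch_with_guard(key):
    # The cache error is handled locally, so the decorator never sees it.
    try:
        cached = broken_cache_get(key)
    except CacheConnectionError:
        cached = None
    return cached or {"source": "api"}
```

In this sketch, `fetch_without_guard("k")` raises `ApiConnectionError` even though no API call failed, while `fetch_with_guard("k")` returns the API result. A regression test along these lines would assert that the guarded path never raises the API connection error.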

Integration tests

  • None for this PR.

Other

uv run pytest tests/inference/unit_tests/core/test_roboflow_api.py::test_get_workflow_specification_falls_back_to_api_when_ephemeral_cache_get_fails tests/inference/unit_tests/core/test_roboflow_api.py::test_get_workflow_specification_returns_when_ephemeral_cache_set_fails -q

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code where necessary, particularly in hard-to-understand areas
  • My changes generate no new warnings or errors
  • I have updated the documentation accordingly (if applicable)

…n retrieval

This commit introduces a new exception, CacheUnavailableError, to handle cases where the ephemeral cache (e.g., Redis/Dragonfly) is unreachable. The get_workflow_specification function is updated to fall back to the Roboflow API when the cache is unavailable, improving error handling and robustness. Additionally, helper functions for retrieving and caching workflow specifications are added, ensuring that cache failures are logged and managed gracefully.

dkosowski87 changed the title from "Add CacheUnavailableError exception and enhance workflow specificatio…" to "Graceful fallback on ephemeral cache failure" on May 29, 2026