Graceful fallback on ephemeral cache failure #2387
Open
dkosowski87 wants to merge 1 commit into
…n retrieval This commit introduces a new exception, CacheUnavailableError, to handle cases where the ephemeral cache (e.g., Redis/Dragonfly) is unreachable. The get_workflow_specification function is updated to fall back to the Roboflow API when the cache is unavailable, improving error handling and robustness. Additionally, helper functions for retrieving and caching workflow specifications are added, ensuring that cache failures are logged and managed gracefully.
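The fallback described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: only the name CacheUnavailableError comes from the description, while get_specification_with_fallback, guarded, and the cache_get / cache_set / fetch_from_api callables are hypothetical stand-ins. Built-in ConnectionError and TimeoutError stand in for whatever errors the Redis client actually raises.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


class CacheUnavailableError(Exception):
    """Signals that the ephemeral cache could not be reached (name taken from the PR)."""


def guarded(operation: Callable) -> Callable:
    """Hypothetical helper: translate connection/timeout errors raised by a
    cache call into CacheUnavailableError, so they never leak out of the
    cache layer under their original type."""

    def wrapper(*args):
        try:
            return operation(*args)
        except (ConnectionError, TimeoutError) as error:
            raise CacheUnavailableError(str(error)) from error

    return wrapper


def get_specification_with_fallback(
    key: str,
    cache_get: Callable[[str], Optional[dict]],
    cache_set: Callable[[str, dict], None],
    fetch_from_api: Callable[[str], dict],
) -> dict:
    """Hypothetical stand-in for get_workflow_specification."""
    # Cache read: a failure is logged and treated like a cache miss.
    try:
        cached = cache_get(key)
    except CacheUnavailableError as error:
        logger.warning("Ephemeral cache read failed, falling back to API: %s", error)
        cached = None
    if cached is not None:
        return cached

    # Source of truth: errors from the API itself still propagate to the caller.
    specification = fetch_from_api(key)

    # Cache write is best-effort: the fetched specification is returned either way.
    try:
        cache_set(key, specification)
    except CacheUnavailableError as error:
        logger.warning("Ephemeral cache write failed, result not cached: %s", error)
    return specification
```

Translating errors at the cache boundary means an outer decorator that maps generic connection errors to API failures only ever sees errors that really came from the API call.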
What does this PR do?
During a production Dragonfly outage, workflow inference endpoints hard-failed with HTTP 500s because get_workflow_specification could not reach the ephemeral Redis/Dragonfly cache. With zero healthy cache endpoints, every request failed before falling back to the Roboflow API, even though the API itself was available. This change adds resilience so cache unavailability degrades gracefully instead of taking down workflow routes.

When ephemeral cache reads fail due to Redis connection or timeout errors, the function now logs a warning and continues to fetch the workflow definition directly from the Roboflow API. Cache write failures after a successful API fetch are also handled as best-effort: the specification is still returned to the caller. Redis errors are caught inside the cache layer so they are not misreported as Roboflow API connection failures: redis.exceptions.ConnectionError subclasses built-in ConnectionError, which @wrap_roboflow_api_errors would otherwise map to RoboflowAPIConnectionError. No HTTP endpoint or caller changes are required; both /infer/workflows/... and describe_interface paths benefit automatically.

Main elements:
- CacheUnavailableError exception in inference/core/exceptions.py
- get_workflow_specification (inference/core/roboflow_api.py)
- _try_retrieve_*, _try_cache_*, _raise_cache_unavailable_error
- tests/inference/unit_tests/core/test_roboflow_api.py
Type of Change
Testing
Unit tests
- Cache read failure (RedisConnectionError) falls back to Roboflow API and returns the fetched specification
- Cache write failure (RedisTimeoutError) after a successful API fetch still returns the specification
- Redis errors are not misreported as RoboflowAPIConnectionError via @wrap_roboflow_api_errors
Integration tests
Other
Checklist