Replace custom cache with CacheControl library#671
Draft
altendky wants to merge 1 commit intopython-jsonschema:mainfrom
Draft
Replace custom cache with CacheControl library#671altendky wants to merge 1 commit intopython-jsonschema:mainfrom
altendky wants to merge 1 commit intopython-jsonschema:mainfrom
Conversation
- Add CacheControl[filecache]>=0.14,<0.15 dependency - Rewrite cachedownloader.py to use CacheControl's FileCache backend - Add --log-level CLI option for debug logging (shows cache hits/misses) - Remove dead code (url_to_cache_filename, filename parameters) - Update tests to assert on cache directory existence - Add pytest warning filter for CacheControl's CallbackFileWrapper temp file leak Fixes python-jsonschema#668
f2717dc to
90339f3
Compare
Author
|
example with $ alias test_command="uv run check-jsonschema -qqq --log-level DEBUG --schemafile https://www.schemastore.org/openapi-arazzo-1.X.json tests/example-files/hooks/negative/github-workflows/empty.json || true" && rm -rf /home/altendky/.cache/check_jsonschema/{schemas,refs} && test_command && echo \n==========\n && test_command && echo \n========== sleeping\n && sleep 605 && test_command
check_jsonschema.cachedownloader [DEBUG]: using cache dir: /home/altendky/.cache/check_jsonschema/schemas
cachecontrol.controller [DEBUG]: Looking up "https://www.schemastore.org/openapi-arazzo-1.X.json" in the cache
cachecontrol.controller [DEBUG]: No cache entry available
cachecontrol.controller [DEBUG]: No cache entry available
urllib3.connectionpool [DEBUG]: Starting new HTTPS connection (1): www.schemastore.org:443
urllib3.connectionpool [DEBUG]: https://www.schemastore.org:443 "GET /openapi-arazzo-1.X.json HTTP/1.1" 200 299
cachecontrol.controller [DEBUG]: Updating cache with response from "https://www.schemastore.org/openapi-arazzo-1.X.json"
cachecontrol.controller [DEBUG]: etag object cached for 1209600 seconds
cachecontrol.controller [DEBUG]: Caching due to etag
filelock [DEBUG]: Attempting to acquire lock 138845482657792 on /home/altendky/.cache/check_jsonschema/schemas/9/9/3/b/8/993b86c39537faaeed8b07a44ee3324c3752c56c307652474a439b7b.lock
filelock [DEBUG]: Lock 138845482657792 acquired on /home/altendky/.cache/check_jsonschema/schemas/9/9/3/b/8/993b86c39537faaeed8b07a44ee3324c3752c56c307652474a439b7b.lock
filelock [DEBUG]: Attempting to release lock 138845482657792 on /home/altendky/.cache/check_jsonschema/schemas/9/9/3/b/8/993b86c39537faaeed8b07a44ee3324c3752c56c307652474a439b7b.lock
filelock [DEBUG]: Lock 138845482657792 released on /home/altendky/.cache/check_jsonschema/schemas/9/9/3/b/8/993b86c39537faaeed8b07a44ee3324c3752c56c307652474a439b7b.lock
check_jsonschema.cachedownloader [DEBUG]: using cache dir: /home/altendky/.cache/check_jsonschema/refs
cachecontrol.controller [DEBUG]: Looking up "https://spec.openapis.org/arazzo/1.0/schema/2025-10-15" in the cache
cachecontrol.controller [DEBUG]: No cache entry available
cachecontrol.controller [DEBUG]: No cache entry available
urllib3.connectionpool [DEBUG]: Starting new HTTPS connection (1): spec.openapis.org:443
urllib3.connectionpool [DEBUG]: https://spec.openapis.org:443 "GET /arazzo/1.0/schema/2025-10-15 HTTP/1.1" 200 3644
cachecontrol.controller [DEBUG]: Updating cache with response from "https://spec.openapis.org/arazzo/1.0/schema/2025-10-15"
cachecontrol.controller [DEBUG]: etag object cached for 1209600 seconds
cachecontrol.controller [DEBUG]: Caching due to etag
filelock [DEBUG]: Attempting to acquire lock 138845489461424 on /home/altendky/.cache/check_jsonschema/refs/8/a/4/e/8/8a4e82229907727e5b9829d2f2cd8ffa2423a1f8c91bffc3b8432991.lock
filelock [DEBUG]: Lock 138845489461424 acquired on /home/altendky/.cache/check_jsonschema/refs/8/a/4/e/8/8a4e82229907727e5b9829d2f2cd8ffa2423a1f8c91bffc3b8432991.lock
filelock [DEBUG]: Attempting to release lock 138845489461424 on /home/altendky/.cache/check_jsonschema/refs/8/a/4/e/8/8a4e82229907727e5b9829d2f2cd8ffa2423a1f8c91bffc3b8432991.lock
filelock [DEBUG]: Lock 138845489461424 released on /home/altendky/.cache/check_jsonschema/refs/8/a/4/e/8/8a4e82229907727e5b9829d2f2cd8ffa2423a1f8c91bffc3b8432991.lock
==========
check_jsonschema.cachedownloader [DEBUG]: using cache dir: /home/altendky/.cache/check_jsonschema/schemas
cachecontrol.controller [DEBUG]: Looking up "https://www.schemastore.org/openapi-arazzo-1.X.json" in the cache
cachecontrol.controller [DEBUG]: Current age based on date: 1
cachecontrol.controller [DEBUG]: Freshness lifetime from max-age: 600
cachecontrol.controller [DEBUG]: The response is "fresh", returning cached response
cachecontrol.controller [DEBUG]: 600 > 1
check_jsonschema.cachedownloader [DEBUG]: using cache dir: /home/altendky/.cache/check_jsonschema/refs
cachecontrol.controller [DEBUG]: Looking up "https://spec.openapis.org/arazzo/1.0/schema/2025-10-15" in the cache
cachecontrol.controller [DEBUG]: Current age based on date: 1
cachecontrol.controller [DEBUG]: Freshness lifetime from max-age: 600
cachecontrol.controller [DEBUG]: The response is "fresh", returning cached response
cachecontrol.controller [DEBUG]: 600 > 1
========== sleeping
check_jsonschema.cachedownloader [DEBUG]: using cache dir: /home/altendky/.cache/check_jsonschema/schemas
cachecontrol.controller [DEBUG]: Looking up "https://www.schemastore.org/openapi-arazzo-1.X.json" in the cache
cachecontrol.controller [DEBUG]: Current age based on date: 606
cachecontrol.controller [DEBUG]: Freshness lifetime from max-age: 600
urllib3.connectionpool [DEBUG]: Starting new HTTPS connection (1): www.schemastore.org:443
urllib3.connectionpool [DEBUG]: https://www.schemastore.org:443 "GET /openapi-arazzo-1.X.json HTTP/1.1" 304 0
filelock [DEBUG]: Attempting to acquire lock 138335991799648 on /home/altendky/.cache/check_jsonschema/schemas/9/9/3/b/8/993b86c39537faaeed8b07a44ee3324c3752c56c307652474a439b7b.lock
filelock [DEBUG]: Lock 138335991799648 acquired on /home/altendky/.cache/check_jsonschema/schemas/9/9/3/b/8/993b86c39537faaeed8b07a44ee3324c3752c56c307652474a439b7b.lock
filelock [DEBUG]: Attempting to release lock 138335991799648 on /home/altendky/.cache/check_jsonschema/schemas/9/9/3/b/8/993b86c39537faaeed8b07a44ee3324c3752c56c307652474a439b7b.lock
filelock [DEBUG]: Lock 138335991799648 released on /home/altendky/.cache/check_jsonschema/schemas/9/9/3/b/8/993b86c39537faaeed8b07a44ee3324c3752c56c307652474a439b7b.lock
check_jsonschema.cachedownloader [DEBUG]: using cache dir: /home/altendky/.cache/check_jsonschema/refs
cachecontrol.controller [DEBUG]: Looking up "https://spec.openapis.org/arazzo/1.0/schema/2025-10-15" in the cache
cachecontrol.controller [DEBUG]: Current age based on date: 606
cachecontrol.controller [DEBUG]: Freshness lifetime from max-age: 600
urllib3.connectionpool [DEBUG]: Starting new HTTPS connection (1): spec.openapis.org:443
urllib3.connectionpool [DEBUG]: https://spec.openapis.org:443 "GET /arazzo/1.0/schema/2025-10-15 HTTP/1.1" 304 0
filelock [DEBUG]: Attempting to acquire lock 138336008516416 on /home/altendky/.cache/check_jsonschema/refs/8/a/4/e/8/8a4e82229907727e5b9829d2f2cd8ffa2423a1f8c91bffc3b8432991.lock
filelock [DEBUG]: Lock 138336008516416 acquired on /home/altendky/.cache/check_jsonschema/refs/8/a/4/e/8/8a4e82229907727e5b9829d2f2cd8ffa2423a1f8c91bffc3b8432991.lock
filelock [DEBUG]: Attempting to release lock 138336008516416 on /home/altendky/.cache/check_jsonschema/refs/8/a/4/e/8/8a4e82229907727e5b9829d2f2cd8ffa2423a1f8c91bffc3b8432991.lock
filelock [DEBUG]: Lock 138336008516416 released on /home/altendky/.cache/check_jsonschema/refs/8/a/4/e/8/8a4e82229907727e5b9829d2f2cd8ffa2423a1f8c91bffc3b8432991.lock |
Author
|
note the general points in the 'draft for' list that we have already brought up. @sirosen, whenever you get a chance to look this over i would be curious about your initial feelings. if it looks like a good starting point we can work through the draft for items and anything else you have concerns or preferences about. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Draft For:
>=0.14,<0.15)Summary
Replace the custom mtime-based HTTP cache in
cachedownloader.pywith CacheControl, a well-maintained PSF library that implements full HTTP caching semantics.Fixes #668
Changes
CacheControl[filecache]>=0.14,<0.15dependency — provides transparent HTTP caching withETag,Last-Modified, andCache-Controlheader supportcachedownloader.py— remove custom mtime-based caching logic (~100 lines removed), delegate to CacheControl'sFileCachebackend--log-levelCLI option — enables debug logging to observe cache behavior (e.g.,--log-level debugshows cache hits/misses)Behavior Changes
If-None-Match/ 304)Cache-Controlheadermax-age,no-cache, etc.)Last-ModifiedIf-Modified-Since/ 304Testing Cache Behavior
The Arazzo schema is a good test case because:
max-age=600) for quick iteration$reftohttps://spec.openapis.org/arazzo/1.0/schema/2025-10-15schemas/andrefs/cache directories in useImplementation Plan (plan.md)
Plan: Replace Custom Cache with CacheControl
Fixes #668
Problem
The current cache implementation in
cachedownloader.pyuses only theLast-Modifiedheader for cache validation. When a server omits that header(but provides
ETagorCache-Control),_lastmod_from_response()returns0.0and_cache_hit()always returnsTrue, so the cache never invalidates.Solution
Replace the hand-rolled mtime-based cache logic with the
CacheControl library (v0.14, maintained
under the PSF). CacheControl wraps a
requests.Sessionand implements full HTTPcaching semantics —
ETag/If-None-Match,Last-Modified/If-Modified-Since,and
Cache-Controlheaders — transparently. On-disk persistence uses itsFileCachebackend (backed byfilelockfor concurrency safety).Behavior Changes
If-None-Match/ 304)Cache-Controlheadermax-age,no-cache,no-store)Last-ModifiedIf-Modified-Since/ 304--no-cacheflag_atomic_write(tempfile + shutil.copy)filelock(viaFileCache)Servers with no caching headers
When a server provides no caching headers at all (no
ETag, noLast-Modified,no
Cache-Control, noExpires), the current code caches the response foreverdue to the
_lastmod_from_responsereturning0.0bug. After this change, suchresponses will not be cached, following correct HTTP semantics. In practice,
nearly all real-world schema endpoints provide at least one caching header.
CacheControl has built-in heuristic support for overriding this behavior if
needed. The
ExpiresAfterheuristic (bundled, not custom) can apply a TTL to allresponses. For a conditional variant that only applies to responses lacking
server-provided caching headers, the documented
BaseHeuristicextensibilitypoint allows a small subclass (~10 lines). Neither is a hack — heuristics are a
first-class CacheControl feature designed for exactly this purpose, and
RFC 7234 explicitly permits caching
systems to use heuristics for responses that lack caching headers.
Open question: clean up old cache files?
CacheControl uses the same cache directories (
check_jsonschema/schemas/,check_jsonschema/refs/) but a different file layout (5-level SHA-224 tree vs.flat SHA-256 files). Old cache files will sit alongside the new CacheControl
tree, ignored but taking up space.
Options:
delete the cache directory if they want to reclaim space.
_build_session(), detect and remove filesmatching the old naming pattern (64-char hex + optional extension, no
subdirectories). Risk: could delete user files if they happen to match.
cache directory. Non-invasive but adds noise.
Proceeding with option 1 (do nothing) unless revisited.
1. Dependency Changes
File:
pyproject.toml(line 39-46)Add
CacheControl[filecache]to runtime dependencies:The
[filecache]extra pulls infilelock. CacheControl also depends onmsgpack(for response serialization) andrequests(already a dependency).2. Rewrite
src/check_jsonschema/cachedownloader.pyRemove
_LASTMOD_FMT_lastmod_from_response()_cache_hit()_atomic_write()filelockfor concurrencyPreserve existing imports
from __future__ import annotations(line 1) stays.hashlibstays (used byurl_to_cache_filename).platformstays (used by_base_cache_dir).Remove imports no longer needed
calendar,shutil,tempfile,time— verify each is truly unused afterthe rewrite.
Add imports (top-level, third-party group)
These go in the third-party import group alongside
import requests, followingthe project's PEP 8 / isort convention (no lazy imports in production code).
Keep unchanged
_base_cache_dir()_resolve_cache_dir()url_to_cache_filename()FailedDownloadErrorModify
_get_request()(lines 57-74)session: requests.Sessionas the first parameter.session.get(file_url)instead ofrequests.get(file_url, stream=True).stream=True— CacheControl needs the full response to cache it, andschemas are small.
response_okcallback — CacheControl doesnot provide retry logic.
Retry / CacheControl interaction: When a server returns a 200 with corrupt
body and caching headers (e.g., ETag), CacheControl caches the response before
our
response_okcallback runs. On a naive retry,session.get()would servethe cached corrupt response and every retry would fail identically. To avoid
this, retry attempts (
_attempt > 0) passCache-Control: no-cachein therequest headers. This is standard HTTP (RFC 7234 §5.2.1.4) — it tells
CacheControl to revalidate with the server rather than serve from its local
cache. CacheControl implements this natively; no custom logic is needed.
Rewrite
CacheDownloader(lines 123-180)__init__: store config only. Do not build the session eagerly —test_default_cache_dircreatesCacheDownloaderwith fake/relative cachepaths purely to assert
_cache_dirresolution and never makes HTTP requests.Eager
os.makedirs+FileCache()in__init__would create realdirectories using those fake paths.
_sessionproperty on first access (i.e.,first
open()call). This also avoids creating sessions that are never used(e.g., if the schema turns out to be local after all).
_download()entirely — CacheControl manages on-disk cache files.open()— always yieldsio.BytesIO(response.content). Nobranching needed: when cache is enabled,
self._sessionis aCacheControl-wrapped session (handles caching transparently); when disabled,
it is a plain
Session.bind()— unchanged API.CacheControl imports go at the top of the module alongside
requests,following the project convention (no lazy/deferred imports in production code):
The
filenameparameter inopen()andbind()is kept for API compatibilitybut ignored — CacheControl manages its own filenames (SHA-224 hash of normalized
URL in a 5-level directory tree).
BoundCacheDownloader.__init__still computesself._filename = filename or url_to_cache_filename(file_url)(line 193). This is now dead code — the valueis passed through to
CacheDownloader.open()which ignores it. Kept for nowsince
test_default_filename_from_uriasserts oncd._filename. Can be cleanedup in a follow-up.
Modify
BoundCacheDownloader._validate_response()(lines 206-213)Skip validation on cache hits. CacheControl sets
response.from_cache = Trueonevery response served from cache (either a direct hit or after a 304
revalidation). This preserves the fix for
#453
(validation callback must not run on cache hits).
3. Changes to Callers
No changes to
readers.pyorresolver.py. Both useCacheDownloaderviabind()andopen(), whose signatures are unchanged.readers.py:80—CacheDownloader("schemas", disable_cache=...).bind(url, validation_callback=...)resolver.py:53—CacheDownloader("refs", disable_cache=...)Cache subdirectory names (
"schemas","refs") are preserved.Update
cli/main_command.py:96: The--schemafilehelp text says schemaswill be
"downloaded and cached locally based on mtime."Replace with genericwording (e.g.,
"downloaded and cached locally") since caching is no longermtime-based.
4. Test Changes
Test strategy: trust CacheControl, test integration only
CacheControl's
FileCachewrite mechanism relies on aCallbackFileWrapperthat monitors when a response stream is fully consumed. The
responsesmocklibrary creates
BytesIObodies that never report.closed = Trueafterexhaustion, so the cache-write callback never fires. CacheControl never
populates its FileCache when responses come from
responsesmocks.This means tests cannot verify CacheControl's caching behavior (cache hits,
ETag revalidation, max-age expiry) using
responses. Rather than switching testinfrastructure (e.g., adding a real HTTP server), the test strategy is:
project with its own test suite.
--no-cacheflag,session construction, error handling.
response.from_cache, orlen(responses.calls)staying constant across requests.responseslibrary compatibilityThe
responseslibrary (v0.26.0) patchesHTTPAdapter.sendat the class level.CacheControl's
CacheControlAdaptercallssuper().send(), which hits theresponsesmock. HTTP request/response flow works normally — only the cache-writeside effect is broken. Tests that use
responsesfor mocking HTTP (retry tests,validation tests, error tests) work fine.
tests/conftest.py— Fixture changesUpdate
patch_cache_dir(line 67): Works as-is — it monkeypatches_base_cache_dir, which is still called by_resolve_cache_dir, which is stillcalled by
CacheDownloader.__init__. CacheControl'sFileCachewill use thepatched temp directory.
Remove the following fixtures — they either inject raw cache files (no longer
valid format) or predict cache file paths using
url_to_cache_filename(SHA-256), which CacheControl does not use:
url2cachepathinject_cached_downloadinject_cached_refget_download_cache_locget_ref_cache_locKeep
downloads_cache_dir/refs_cache_dir(lines 87, 113) andkeep
cache_dir(line 63) — still useful for asserting cache directoryexistence when needed.
tests/unit/test_cachedownloader.py— Specific changesUpdate the import block (lines 10-16): Remove
_cache_hitand_lastmod_from_response— both symbols are deleted fromcachedownloader.py.Tests to remove:
test_cache_hit_by_mtimetest_cachedownloader_handles_bad_lastmod_header_lastmod_from_responseremovedtest_lastmod_from_header_uses_gmtimetest_cachedownloader_cached_file_downloadmethod, which is removedtest_cachedownloader_validation_is_not_invoked_on_hitinject_cached_download(removed) andresponse.from_cache(not testable withresponsesmocks)Tests to simplify (remove
get_download_cache_locdependency and file-existence assertions — these tests are about retry behavior, not caching):
test_cachedownloader_on_success(line 132):get_download_cache_locfixture usage andf.exists()assertions.cd.open()returns correct content (b"{}").test_cachedownloader_succeeds_after_few_errors(line 160):get_download_cache_locfixture usage andf.exists()assertions.cd.open()returns correct content after retries.test_cachedownloader_fails_after_many_errors(line 194):get_download_cache_locfixture usage andf.exists()assertion.FailedDownloadErroris raised.test_cachedownloader_retries_on_bad_data(line 225):get_download_cache_locfixture usage andf.exists()assertions.cd.open()returns correct content after retrying past bad data.test_cachedownloader_using_alternate_target_dir(line 149):url2cachepath/ path-based assertions.cd.open()returns correct content.Tests to keep as-is:
test_default_filename_from_uriurl_to_cache_filenameunchangedtest_default_cache_dir_base_cache_dirunchanged. Works as-is thanks to lazy session building —CacheDownloader("downloads")only stores_cache_dir, noos.makedirsorFileCacheuntilopen()is called.New tests to add (integration-focused):
test_disable_cache_uses_plain_sessiondisable_cache=True, verify_build_sessionreturns a plainrequests.Session(not wrapped by CacheControl).test_enable_cache_uses_cachecontrol_sessiondisable_cache=Falseand cache dir is valid, verify_build_sessionreturns a CacheControl-wrapped session (check forCacheControlAdapteron the session).test_cache_dir_none_uses_plain_session_resolve_cache_dirreturnsNone(e.g., Windows with no env vars), verify_build_sessionreturns a plain session.Aspirational cache-behavior tests (require a real HTTP server or alternative
to
responses; not included in this PR but tracked for follow-up):test_etag_cache_revalidationETag, noLast-Modified. Second request sendsIf-None-Match, gets 304.test_cache_control_max_ageCache-Control: max-age=3600. Second request served from cache.test_last_modified_revalidationLast-Modified. Second request sendsIf-Modified-Since.test_no_caching_headers_not_cachedtests/acceptance/test_remote_ref_resolution.pytest_remote_ref_resolution_cache_control(line 68):get_ref_cache_locfixture dependency and file-path assertions.disable_cache=False: verify the command succeeds (exit code 0).Cache behavior is trusted to CacheControl.
disable_cache=True: verify the command succeeds (exit code 0).test_remote_ref_resolution_loads_from_cache(line 105):inject_cached_refandget_ref_cache_locfixture dependencies.used) cannot be replicated without direct FileCache manipulation. Remove
this test. Cache-hit behavior is trusted to CacheControl.
test_remote_ref_resolution_callout_count_is_scale_free_in_instancefiles(line 251):
lru_cacheon_get_validatorand the in-memoryResourceCache— not by the file cache.CacheControl's file cache only affects cross-run persistence.
tests/acceptance/test_nonjson_schema_handling.pytest_can_load_remote_yaml_schema_ref_from_cache(line 140):inject_cached_reffixture dependency. This test's premise (injectgood cache, serve bad HTTP) cannot be replicated. Remove this test.
Cache-hit behavior is trusted to CacheControl.
5. Pytest Warning Filter
CacheControl's
FileCacheusesfilelockfor concurrency safety. When connectionerrors occur during tests (e.g.,
test_cachedownloader_fails_after_many_errors),filelock may leave temporary files open, triggering a
ResourceWarningthat pytestconverts to
PytestUnraisableExceptionWarning.This is not a bug in our code — it's a side effect of how filelock handles error
conditions. Add a filter to
pyproject.tomlto ignore this specific warning:TODO: Investigate whether this warning can be avoided rather than suppressed.
Investigation results (2026-03-27):
CallbackFileWrapperclass infilewrapper.py, not filelock.CallbackFileWrapper.__init__creates aNamedTemporaryFile(line 37), whichis only closed in
_close()when the response is fully read (line 105-106).requests.ConnectionErroroccurs, the temp file is never closed because_close()is never called.problem.
CallbackFileWrapperclass lacks a__del__method or context managerprotocol that would ensure cleanup on exceptions.
CallbackFileWrapperimplement__del__to closeself.__bufif not alreadyclosed, or use a weak reference callback to ensure cleanup.
6. Execution Order
CacheControl[filecache]dependency topyproject.tomlcachedownloader.pyconftest.pyfixturestest_cachedownloader.pyresponseslibrary compatibility issues