[doc] conf: enumerate llms_txt_exclude notebooks from git, not a live rglob #64486
ronny-anyscale wants to merge 2 commits
llms_txt_exclude was built from a live rglob("*.ipynb") over doc/source, so its
value depended on which notebooks existed at conf-eval time. sphinx-collections
pulls ~67 example notebooks into _collections/ during the build; they are
gitignored and get captured in the doc build cache. The clean cache-producer
tree (Buildkite, pre-pull) sees only the 59 tracked notebooks, but the
cache-restored consumer tree (Read the Docs) sees 59 + the 67 pulled ones.
Because llms_txt_exclude is an env-affecting Sphinx config value, that
difference makes Sphinx discard the restored doctree and re-read all ~3,967
docs -- so the incremental cache (DOC-1047) restores but delivers no speedup.
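
To make the mismatch concrete, here is a minimal sketch (not the actual conf.py) that builds the exclude list with a live rglob over two throwaway trees: one holding only a "tracked" notebook, the other also holding a notebook under `_collections/`. The helper names `exclude_from_rglob` and `make_tree` and the file names are hypothetical, chosen only for illustration.

```python
import pathlib
import tempfile


def exclude_from_rglob(source_dir: pathlib.Path) -> list[str]:
    """Build docname-style entries from whatever notebooks exist on disk."""
    return sorted(
        p.relative_to(source_dir).with_suffix("").as_posix()
        for p in source_dir.rglob("*.ipynb")
    )


def make_tree(root: pathlib.Path, notebooks: list[str]) -> pathlib.Path:
    """Create empty notebook files under root (hypothetical layout)."""
    for rel in notebooks:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    return root


with tempfile.TemporaryDirectory() as clean, tempfile.TemporaryDirectory() as restored:
    tracked = ["data/examples/foo.ipynb"]
    # Stands in for a gitignored notebook pulled in during an earlier build.
    pulled = ["_collections/bar.ipynb"]

    producer_value = exclude_from_rglob(make_tree(pathlib.Path(clean), tracked))
    consumer_value = exclude_from_rglob(
        make_tree(pathlib.Path(restored), tracked + pulled)
    )

print(producer_value)
print(consumer_value)
print(producer_value == consumer_value)
```

The two lists differ only because of what happens to be on disk, which is the kind of environment-dependent config value described above.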
Enumerate the notebooks from the git-tracked set instead (identical across
environments for a given commit), falling back to the filesystem scan for a
non-git checkout. This matches what the clean producer and the published master
build already compute (their conf.py also runs before the collections pull), so
llms-full.txt is unchanged there.
First cut for DOC-1344 (doctree-cache config portability); the env-var options
(version_match, sidebars) and the collections function-ref hash remain and are
tracked there.
Related to DOC-1047
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ronny Roland <ronny.roland@anyscale.com>
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 575cc93.
     llms_txt_exclude += sorted(
    -    p.relative_to(_conf_dir).with_suffix("").as_posix()
    -    for p in _conf_dir.rglob("*.ipynb")
    +    pathlib.Path(nb).with_suffix("").as_posix() for nb in _notebooks
Wrong docname prefix from git
Medium Severity
git ls-files returns paths from the repository root (for example doc/source/...), but the shared conversion only strips .ipynb and never re-bases them under _conf_dir. Sphinx llms_txt_exclude entries must be docnames relative to the source directory, so fnmatch will not exclude tracked notebooks on normal git checkouts.
git ls-files here runs with cwd=_conf_dir (i.e. doc/source), so it returns paths relative to that directory, not the repository root — e.g. data/examples/foo.ipynb, not doc/source/data/examples/foo.ipynb. After stripping the .ipynb suffix those are already the source-relative docnames llms_txt_exclude matches against, so tracked notebooks are excluded correctly on a normal checkout.
Verified against the tracked set — the resulting docnames are _templates/template, data/examples/..., etc., with no doc/source/ prefix. So no change needed here.
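
A small sketch of the conversion being discussed, assuming the paths arrive as plain strings. The helper name `to_docname` is hypothetical, and `PurePosixPath` is used here instead of the PR's `pathlib.Path` only to keep the example platform-independent. It shows that cwd-relative paths become source-relative docnames directly, while repository-root paths would keep a `doc/source/` prefix.

```python
import pathlib


def to_docname(notebook_path: str) -> str:
    """Strip the .ipynb suffix and normalise to a POSIX-style docname."""
    return pathlib.PurePosixPath(notebook_path).with_suffix("").as_posix()


# Paths as `git ls-files` reports them when run from the source directory.
cwd_relative = ["_templates/template.ipynb", "data/examples/foo.ipynb"]
# What the paths would look like if they were repository-root relative.
root_relative = ["doc/source/data/examples/foo.ipynb"]

print([to_docname(p) for p in cwd_relative])
print([to_docname(p) for p in root_relative])
```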
Code Review
This pull request updates doc/source/conf.py to retrieve the list of notebook files from the git-tracked set using git ls-files instead of a live filesystem scan, preventing Sphinx from discarding the restored doctree cache due to environment differences. Feedback suggests improving the robustness of this implementation by using the -z flag with git ls-files to safely handle special characters in filenames, and catching general exceptions to prevent build crashes on decoding errors.
    try:
        _notebooks = subprocess.run(
            ["git", "ls-files", "*.ipynb"],
            cwd=_conf_dir,
            capture_output=True,
            text=True,
            check=True,
        ).stdout.splitlines()
    except (OSError, subprocess.CalledProcessError):
- Handle special characters in filenames: `git ls-files` might quote filenames that contain special or non-ASCII characters. This can cause `pathlib.Path(nb)` to fail to parse the path correctly. Using the `-z` option outputs paths verbatim, separated by NUL bytes (`\0`), which is the standard and most robust way to parse file paths from Git in Python.
- Prevent build crashes on decoding errors: When `text=True` is passed to `subprocess.run`, Python decodes the output using the default system encoding. If the output contains invalid bytes for that encoding, it raises a `UnicodeDecodeError` (a subclass of `ValueError`). Since the current `except` block only catches `(OSError, subprocess.CalledProcessError)`, a decoding error would crash the entire Sphinx build. Catching `Exception` ensures that any failure in the Git command safely falls back to the filesystem scan.
    -try:
    -    _notebooks = subprocess.run(
    -        ["git", "ls-files", "*.ipynb"],
    -        cwd=_conf_dir,
    -        capture_output=True,
    -        text=True,
    -        check=True,
    -    ).stdout.splitlines()
    -except (OSError, subprocess.CalledProcessError):
    +try:
    +    _git_out = subprocess.run(
    +        ["git", "ls-files", "-z", "*.ipynb"],
    +        cwd=_conf_dir,
    +        capture_output=True,
    +        text=True,
    +        check=True,
    +    ).stdout
    +    _notebooks = [nb for nb in _git_out.split("\x00") if nb]
    +except Exception:
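
For reference, a self-contained sketch of the suggested approach, under stated assumptions: the function names `parse_nul_separated` and `tracked_notebooks` are hypothetical, and the fallback body is a guess at what a filesystem scan could look like, since the snippet above does not show it.

```python
import pathlib
import subprocess


def parse_nul_separated(output: str) -> list[str]:
    """Split `git ls-files -z` output into paths, dropping empty items."""
    return [path for path in output.split("\x00") if path]


def tracked_notebooks(source_dir: pathlib.Path) -> list[str]:
    """Return notebook paths relative to source_dir, preferring the git-tracked set."""
    try:
        result = subprocess.run(
            ["git", "ls-files", "-z", "*.ipynb"],
            cwd=source_dir,
            capture_output=True,
            text=True,
            check=True,
        )
        return parse_nul_separated(result.stdout)
    except Exception:
        # Broad on purpose: a missing git binary, a non-git checkout, or a
        # decode error all fall back to scanning the filesystem (assumed shape).
        return sorted(
            p.relative_to(source_dir).as_posix()
            for p in source_dir.rglob("*.ipynb")
        )


# A NUL-separated sample, including a filename with a space.
sample = "data/examples/foo.ipynb\x00notes/my notebook.ipynb\x00"
print(parse_nul_separated(sample))
```

Splitting on NUL keeps each filename intact whatever characters it contains, and the trailing separator produces no empty entry because empty items are filtered out.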
Address review (@gemini-code-assist): parse `git ls-files -z` (NUL-separated, unquoted) instead of splitlines(), so notebook filenames with spaces, newlines, or non-ASCII characters are handled correctly; and widen the except so a decode error (or any git failure) falls back to the filesystem scan instead of crashing the Sphinx build.

Bugbot flagged a repo-root docname prefix; not applicable -- `git ls-files` runs with cwd=_conf_dir and returns source-relative paths (verified: `_templates/template`, `data/examples/...`, no `doc/source/` prefix).

Related to DOC-1344
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ronny Roland <ronny.roland@anyscale.com>


Description
Part of making the incremental RtD doctree cache actually deliver its speedup (DOC-1047). #64414 (producer) and #64482 (consumer reaches the cache) are merged, but RtD still re-reads all ~3,967 docs on every preview: Sphinx restores the cache, then discards the doctree because `conf.py` evaluates to a different config than the cache was built with. Full audit in DOC-1344; this is the first cut: the single option Sphinx names as the doctree-rebuild trigger.

Root cause
`llms_txt_exclude` was built from a live `rglob("*.ipynb")` over `doc/source`, so its value depends on which notebooks exist when `conf.py` runs:
- Producer (`make html`, clean tree): `conf.py` runs before sphinx-collections pulls, so it sees only the 59 git-tracked notebooks.
- Consumer (`make rtd`, cache-restored tree): the cache captured the ~67 gitignored notebooks sphinx-collections pulled into `_collections/…`, so `conf.py` sees 59 + 67.

`llms_txt_exclude` is an env-affecting config value, so that difference makes Sphinx invalidate the whole restored doctree → `[config changed ('llms_txt_exclude')] 3967 added` (verified on build 4166140, PR #64482).

Fix
Enumerate notebooks from the git-tracked set (`git ls-files "*.ipynb"`), which is identical across environments for a given commit, with a filesystem-scan fallback for non-git checkouts. This equals what the clean producer and the published `master` build already compute (their `conf.py` also runs before the pull), so `llms-full.txt` is unchanged there.

Verified
- An untracked `.ipynb` grows the old `rglob` (126→127) but leaves the git-tracked result stable (59→59).
- On a clean tree, `rglob == git ls-files`; the local 126-vs-59 gap is exactly the 67 gitignored `_collections` pulled notebooks that cause the producer↔consumer mismatch.

Related issues
First cut for DOC-1344 (doctree-cache config portability); unblocks DOC-1047.

Additional information
Only fully verifiable on a live RtD PR build: the payoff is `N changed` instead of `3967 added` on this PR's preview. This fixes only the env-rebuild trigger; the `html_*` env-var diffs (version_match, sidebars) still force HTML re-output (much cheaper: no re-read / notebook execution) and the `collections` function-ref hash remains. Both are enumerated in DOC-1344.

🤖 Generated with Claude Code