Skip to content

[doc] conf: enumerate llms_txt_exclude notebooks from git, not a live rglob#64486

Open
ronny-anyscale wants to merge 2 commits into
masterfrom
fix-llms-txt-exclude-deterministic
Open

[doc] conf: enumerate llms_txt_exclude notebooks from git, not a live rglob#64486
ronny-anyscale wants to merge 2 commits into
masterfrom
fix-llms-txt-exclude-deterministic

Conversation

@ronny-anyscale

Copy link
Copy Markdown
Contributor

Description

Part of making the incremental RtD doctree cache actually deliver its speedup (DOC-1047). #64414 (producer) and #64482 (consumer reaches the cache) are merged, but RtD still re-reads all ~3,967 docs on every preview: Sphinx restores the cache, then discards the doctree because conf.py evaluates to a different config than the cache was built with. Full audit in DOC-1344; this is the first cut — the single option Sphinx names as the doctree-rebuild trigger.

Root cause

llms_txt_exclude was built from a live rglob("*.ipynb") over doc/source, so its value depends on which notebooks exist when conf.py runs:

  • Producer (Buildkite make html, clean tree): conf.py runs before sphinx-collections pulls, so it sees only the 59 git-tracked notebooks.
  • Consumer (RtD make rtd, cache-restored tree): the cache captured the ~67 gitignored notebooks sphinx-collections pulled into _collections/…, so conf.py sees 59 + 67.

llms_txt_exclude is an env-affecting config value, so that difference makes Sphinx invalidate the whole restored doctree → [config changed ('llms_txt_exclude')] 3967 added (verified on build 4166140, PR #64482).

Fix

Enumerate notebooks from the git-tracked set (git ls-files "*.ipynb") — identical across environments for a given commit — with a filesystem-scan fallback for non-git checkouts. This equals what the clean producer and the published master build already compute (their conf.py also runs before the pull), so llms-full.txt is unchanged there.

Verified

  • Determinism: adding an untracked .ipynb grows the old rglob (126→127) but leaves the git-tracked result stable (59→59).
  • Correctness: on a clean checkout rglob == git ls-files; the local 126-vs-59 gap is exactly the 67 gitignored _collections pulled notebooks that cause the producer↔consumer mismatch.

Related issues

First cut for DOC-1344 (doctree-cache config portability); unblocks DOC-1047.

Additional information

Only fully verifiable on a live RtD PR build — the payoff is N changed instead of 3967 added on this PR's preview. This fixes only the env-rebuild trigger; the html_* env-var diffs (version_match, sidebars) still force HTML re-output (much cheaper — no re-read / notebook execution) and the collections function-ref hash remains — both enumerated in DOC-1344.

🤖 Generated with Claude Code

… rglob

llms_txt_exclude was built from a live rglob("*.ipynb") over doc/source, so its
value depended on which notebooks existed at conf-eval time. sphinx-collections
pulls ~67 example notebooks into _collections/ during the build; they are
gitignored and get captured in the doc build cache. The clean cache-producer
tree (Buildkite, pre-pull) sees only the 59 tracked notebooks, but the
cache-restored consumer tree (Read the Docs) sees 59 + the 67 pulled ones.
Because llms_txt_exclude is an env-affecting Sphinx config value, that
difference makes Sphinx discard the restored doctree and re-read all ~3,967
docs -- so the incremental cache (DOC-1047) restores but delivers no speedup.

Enumerate the notebooks from the git-tracked set instead (identical across
environments for a given commit), falling back to the filesystem scan for a
non-git checkout. This matches what the clean producer and the published master
build already compute (their conf.py also runs before the collections pull), so
llms-full.txt is unchanged there.

First cut for DOC-1344 (doctree-cache config portability); the env-var options
(version_match, sidebars) and the collections function-ref hash remain and are
tracked there.

Related to DOC-1047

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ronny Roland <ronny.roland@anyscale.com>
@ronny-anyscale ronny-anyscale requested review from a team as code owners July 1, 2026 22:18

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 575cc93. Configure here.

Comment thread doc/source/conf.py
llms_txt_exclude += sorted(
p.relative_to(_conf_dir).with_suffix("").as_posix()
for p in _conf_dir.rglob("*.ipynb")
pathlib.Path(nb).with_suffix("").as_posix() for nb in _notebooks

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong docname prefix from git

Medium Severity

git ls-files returns paths from the repository root (for example doc/source/...), but the shared conversion only strips .ipynb and never re-bases them under _conf_dir. Sphinx llms_txt_exclude entries must be docnames relative to the source directory, so fnmatch will not exclude tracked notebooks on normal git checkouts.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 575cc93. Configure here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

git ls-files here runs with cwd=_conf_dir (i.e. doc/source), so it returns paths relative to that directory, not the repository root — e.g. data/examples/foo.ipynb, not doc/source/data/examples/foo.ipynb. After stripping the .ipynb suffix those are already the source-relative docnames llms_txt_exclude matches against, so tracked notebooks are excluded correctly on a normal checkout.

Verified against the tracked set — the resulting docnames are _templates/template, data/examples/..., etc., with no doc/source/ prefix. So no change needed here.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates doc/source/conf.py to retrieve the list of notebook files from the git-tracked set using git ls-files instead of a live filesystem scan, preventing Sphinx from discarding the restored doctree cache due to environment differences. Feedback suggests improving the robustness of this implementation by using the -z flag with git ls-files to safely handle special characters in filenames, and catching general exceptions to prevent build crashes on decoding errors.

Comment thread doc/source/conf.py Outdated
Comment on lines +134 to +142
try:
_notebooks = subprocess.run(
["git", "ls-files", "*.ipynb"],
cwd=_conf_dir,
capture_output=True,
text=True,
check=True,
).stdout.splitlines()
except (OSError, subprocess.CalledProcessError):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

  1. Handle special characters in filenames: Git ls-files might quote filenames with special characters or non-ASCII characters. This can cause pathlib.Path(nb) to fail to parse the path correctly. Using the -z option outputs paths verbatim, separated by NUL bytes (\0), which is the standard and most robust way to parse file paths from Git in Python.
  2. Prevent build crashes on decoding errors: When text=True is passed to subprocess.run, Python decodes the output using the default system encoding. If the output contains invalid bytes for that encoding, it raises a UnicodeDecodeError (a subclass of ValueError). Since the current except block only catches (OSError, subprocess.CalledProcessError), a decoding error would crash the entire Sphinx build. Catching Exception ensures that any failure in the Git command safely falls back to the filesystem scan.
Suggested change
try:
_notebooks = subprocess.run(
["git", "ls-files", "*.ipynb"],
cwd=_conf_dir,
capture_output=True,
text=True,
check=True,
).stdout.splitlines()
except (OSError, subprocess.CalledProcessError):
try:
_git_out = subprocess.run(
["git", "ls-files", "-z", "*.ipynb"],
cwd=_conf_dir,
capture_output=True,
text=True,
check=True,
).stdout
_notebooks = [nb for nb in _git_out.split("\x00") if nb]
except Exception:

Address review (@gemini-code-assist): parse `git ls-files -z` (NUL-separated,
unquoted) instead of splitlines(), so notebook filenames with spaces, newlines,
or non-ASCII characters are handled correctly; and widen the except so a decode
error (or any git failure) falls back to the filesystem scan instead of crashing
the Sphinx build.

Bugbot flagged a repo-root docname prefix; not applicable -- `git ls-files` runs
with cwd=_conf_dir and returns source-relative paths (verified: `_templates/
template`, `data/examples/...`, no `doc/source/` prefix).

Related to DOC-1344

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ronny Roland <ronny.roland@anyscale.com>
@ray-gardener ray-gardener Bot added docs An issue or change related to documentation core Issues that should be addressed in Ray Core labels Jul 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core docs An issue or change related to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant