Add checks for broken docs urls #6448

Open
carlosabadia wants to merge 2 commits into main from carlos/docs-links-ci

Conversation

@carlosabadia
Contributor

No description provided.

@carlosabadia carlosabadia requested review from a team and Alek99 as code owners May 4, 2026 11:36
@carlosabadia carlosabadia added the documentation Improvements or additions to documentation label May 4, 2026
@codspeed-hq

codspeed-hq Bot commented May 4, 2026

Merging this PR will not alter performance

✅ 17 untouched benchmarks
⏩ 2 skipped benchmarks [1]


Comparing carlos/docs-links-ci (873b592) with main (70ab07e)

Footnotes

  1. 2 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them in CodSpeed to remove them from the performance reports.

@greptile-apps
Contributor

greptile-apps Bot commented May 4, 2026

Greptile Summary

This PR adds a new GitHub Actions workflow and Python script that validate /docs/* Markdown links against the Reflex app's generated sitemap.xml, catching broken URLs and underscore-in-path violations before they reach production. The implementation is well-structured, correctly strips fragments and query strings before the underscore check, and ships good test coverage, including the fragment-underscore false-positive regression case from the prior review.

  • The LINK_RE regex only handles double-quoted Markdown link titles (\"...\"), not the single-quoted ('...') or parenthesised ((...)) forms. Links like [text](/docs/foo 'My Title') would have the title text absorbed into raw, causing every such link to report a spurious "not found in sitemap" error.

Confidence Score: 4/5

Safe to merge after addressing the single-quoted title regex gap; otherwise the tool works correctly.

One P1 logic issue: single-quoted Markdown link titles are not stripped from the captured URL, causing false-positive "not found in sitemap" errors. All other logic (fragment/query stripping for the underscore check, sitemap prefix normalization, skip-dirs) is correct and well-tested.
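The fragment/query stripping mentioned above can be illustrated with a minimal sketch. This is not the PR's code: the helper names `path_without_fragment_or_query` and `has_underscore_in_path` are hypothetical, and the sketch only assumes that the underscore rule applies to the URL path, not to `#fragment` or `?query` parts.

```python
from urllib.parse import urlparse


def path_without_fragment_or_query(raw_url: str) -> str:
    """Return only the path component (hypothetical helper, not from the PR)."""
    # urlparse separates '?query' and '#fragment' from the path for us.
    return urlparse(raw_url).path


def has_underscore_in_path(raw_url: str) -> bool:
    """Flag underscores in the path while ignoring the fragment and query."""
    return "_" in path_without_fragment_or_query(raw_url)


if __name__ == "__main__":
    for url in (
        "/docs/api_reference",                     # underscore in the path
        "/docs/getting-started/intro#my_section",  # underscore only in the fragment
        "/docs/library/forms?tab=a_b",             # underscore only in the query
    ):
        print(url, "->", has_underscore_in_path(url))
```

Checking the raw link text instead of the parsed path is what would produce the fragment-underscore false positive that the review refers to.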

docs/app/scripts/check_doc_links.py — specifically the LINK_RE constant on line 25.

Important Files Changed

| Filename | Overview |
| --- | --- |
| .github/workflows/check_doc_links.yml | New CI workflow that builds the Reflex frontend to generate sitemap.xml, then runs the link-checker script; triggers on docs/**/*.md, the script, and this file itself. |
| docs/app/scripts/check_doc_links.py | New script scanning .md files for /docs/* links and validating them against sitemap.xml; correctly strips fragment/query before underscore check, handles both /docs-prefixed and non-prefixed sitemaps. |
| docs/app/tests/test_doc_links.py | Comprehensive unit tests covering valid links, missing links, underscore detection, fragment handling, skip-dirs, and both sitemap prefix styles; includes the fragment-underscore false-positive regression test. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[GitHub Actions Trigger\npull_request / push to main\nwith docs path filter] --> B[Checkout & Setup Build Env\npython 3.14 + uv sync]
    B --> C[uv run reflex export\n--frontend-only --no-zip\nGenerates .web/public/sitemap.xml]
    C --> D[uv run python\nscripts/check_doc_links.py]
    D --> E[load_sitemap_paths\nParse sitemap.xml → set of normalized paths]
    D --> F[iter_md_files\nrglob *.md, skip SKIP_DIRS]
    F --> G[iter_md_links\nMatch LINK_RE on each line]
    G --> H{For each raw URL}
    H --> I{Underscore in path_only?}
    I -- Yes --> J[Append underscore error]
    I -- No --> K{sitemap_key in valid_paths?}
    J --> K
    K -- No --> L[Append not-found error]
    K -- Yes --> M[OK]
    L --> N{Any errors?}
    J --> N
    M --> N
    N -- Yes --> O[Print errors to stderr\nExit 1 → CI fails]
    N -- No --> P[Print success\nExit 0]
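
The sitemap-lookup branch of the flowchart above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script from the PR: `normalize`, `strip_docs_prefix`, `sitemap_paths`, and `link_is_known` are hypothetical names, the sample sitemap and `example.com` URLs are made up, and the sketch assumes the standard sitemaps.org XML namespace.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Standard sitemap namespace (assumed to be what the generated sitemap.xml uses).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def normalize(path: str) -> str:
    """Drop trailing slashes so '/a/' and '/a' compare equal; keep root as '/'."""
    return path.rstrip("/") or "/"


def strip_docs_prefix(path: str) -> str:
    """Remove a leading '/docs' so prefixed and non-prefixed sitemaps agree."""
    if path == "/docs":
        return "/"
    if path.startswith("/docs/"):
        return path[len("/docs"):]
    return path


def sitemap_paths(xml_text: str) -> set[str]:
    """Collect the normalized path of every <loc> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    paths = set()
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if loc.text:
            paths.add(strip_docs_prefix(normalize(urlparse(loc.text.strip()).path)))
    return paths


def link_is_known(raw_link: str, valid_paths: set[str]) -> bool:
    """Look up a Markdown link by path only, ignoring its fragment and query."""
    key = strip_docs_prefix(normalize(urlparse(raw_link).path))
    return key in valid_paths


SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/getting-started/</loc></url>
  <url><loc>https://example.com/library/forms</loc></url>
</urlset>"""

if __name__ == "__main__":
    valid = sitemap_paths(SAMPLE_SITEMAP)
    for link in ("/docs/getting-started#install", "/docs/library/forms/", "/docs/missing"):
        print(link, "->", "OK" if link_is_known(link, valid) else "not found in sitemap")
```

Applying the same normalization to both sitemap entries and Markdown links means that trailing slashes and the presence or absence of the /docs prefix do not cause spurious mismatches.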

Reviews (2): Last reviewed commit: "updates"

Comment thread docs/app/scripts/check_doc_links.py Outdated
Comment thread docs/app/tests/test_doc_links.py
@masenf
Collaborator

masenf commented May 4, 2026

@greptile-apps re-review

from pathlib import Path
from urllib.parse import urlparse

LINK_RE = re.compile(r"\]\(\s*(/docs(?=[/)#?\s])[^)]*?)(?:\s+\"[^\"]*\")?\s*\)")
Contributor

P1: The regex captures single-quoted Markdown link titles into raw. For a link like [text](/docs/foo 'My Title'), the optional title group (?:\s+"[^"]*")? requires double quotes, so it won't strip the '...' text. Instead, [^)]*? absorbs the trailing space and title, making raw = "/docs/foo 'My Title'". The subsequent sitemap lookup then evaluates _strip_docs_prefix(_normalize("/docs/foo 'My Title'")) → "/foo 'My Title'", which is never in the sitemap, producing a spurious "not found" error for every single-quoted-title link in the docs.

Suggested change
- LINK_RE = re.compile(r"\]\(\s*(/docs(?=[/)#?\s])[^)]*?)(?:\s+\"[^\"]*\")?\s*\)")
+ LINK_RE = re.compile(r"\]\(\s*(/docs(?=[/)#?\s])[^)]*?)(?:\s+(?:\"[^\"]*\"|'[^']*'|\([^)]*\)))?\s*\)")
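
To see the difference concretely, the two patterns from the suggested change can be run side by side. The patterns are copied from the comment above; the names `ORIGINAL_RE`, `SUGGESTED_RE`, `captured`, and the sample links are only for this illustration and are not part of the PR.

```python
import re

# Pattern from the diff under review: only double-quoted titles are stripped.
ORIGINAL_RE = re.compile(r"\]\(\s*(/docs(?=[/)#?\s])[^)]*?)(?:\s+\"[^\"]*\")?\s*\)")
# Pattern from the suggested change: also strips '...' and (...) titles.
SUGGESTED_RE = re.compile(
    r"\]\(\s*(/docs(?=[/)#?\s])[^)]*?)(?:\s+(?:\"[^\"]*\"|'[^']*'|\([^)]*\)))?\s*\)"
)

# Illustrative Markdown links covering the three title styles.
SAMPLES = [
    '[text](/docs/foo "My Title")',
    "[text](/docs/foo 'My Title')",
    "[text](/docs/foo (My Title))",
]


def captured(pattern: re.Pattern, line: str):
    """Return the URL captured by group 1, or None when the line has no match."""
    match = pattern.search(line)
    return match.group(1) if match else None


if __name__ == "__main__":
    for line in SAMPLES:
        print(line)
        print("  original :", repr(captured(ORIGINAL_RE, line)))
        print("  suggested:", repr(captured(SUGGESTED_RE, line)))
```

With the original pattern, the single-quoted sample yields a capture that still contains the title text, which is the value that later fails the sitemap lookup.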

Labels

documentation Improvements or additions to documentation

2 participants