sftrkr/sitepulse

sitepulse

sitepulse is a Rust-based CLI tool for technical SEO, sitemap health checks, and AI agent readiness audits.

It discovers URLs from a sitemap.xml, checks each page's HTTP status, response time, redirect state, final URL, and optional metadata, then produces terminal, CSV, JSON, and HTML reports. It also includes an --agent-ready audit inspired by emerging agent-web standards such as llms.txt, AI crawler rules, discovery headers, protocol discovery, structured data, DNS-AID, and agentic commerce signals.

The project is designed for WordPress, WooCommerce, e-commerce, publisher, and SaaS websites that need to detect broken links, 404/500 errors, redirect issues, slow pages, metadata gaps, and whether the site is ready for AI agents and crawlers.

Status

The first working version has been implemented.

Current features:

  • sitepulse check <SITEMAP_URL> command
  • Standard sitemap parsing
  • Sitemap index support
  • Gzip sitemap support (.xml.gz)
  • Maximum sitemap index depth: 2
  • Extract URLs from <loc>...</loc> entries
  • Deduplicate repeated URLs
  • HTTP status code reporting
  • Response time measurement
  • Redirect following
  • Final URL reporting
  • Timeout support
  • Custom User-Agent support
  • Concurrency support
  • Option to show only errors
  • Retry support for network errors and 5xx responses
  • GET/HEAD check method selection
  • Optional title, meta description, and canonical URL extraction
  • Same-host filtering option
  • Optional robots.txt filtering
  • Initial agent readiness audit (--agent-ready)
  • CI-friendly agent readiness score threshold
  • Maximum URL limit option
  • CSV export
  • JSON export
  • HTML report export
  • CI-friendly non-zero exit option
  • Summary report
  • Top 10 slowest URLs

Default User-Agent:

sitepulse/0.1 (+https://example.local)
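
The `<loc>` extraction and deduplication features listed above can be illustrated with a small std-only sketch. This is not the code in sitemap.rs: the function names `extract_locs` and `dedupe` are hypothetical, and the naive string scan ignores XML namespaces, CDATA, and entity decoding, which a real XML parser would handle.

```rust
use std::collections::HashSet;

/// Collect the trimmed text of every `<loc>...</loc>` element, in document order.
/// Assumption: plain string scanning is enough for illustration purposes.
fn extract_locs(xml: &str) -> Vec<String> {
    let mut urls = Vec::new();
    let mut rest = xml;
    while let Some(start) = rest.find("<loc>") {
        let after_open = &rest[start + "<loc>".len()..];
        match after_open.find("</loc>") {
            Some(end) => {
                let url = after_open[..end].trim();
                if !url.is_empty() {
                    urls.push(url.to_string());
                }
                rest = &after_open[end + "</loc>".len()..];
            }
            // Unterminated element: stop scanning instead of guessing.
            None => break,
        }
    }
    urls
}

/// Drop repeated URLs while keeping first-seen order.
fn dedupe(urls: Vec<String>) -> Vec<String> {
    let mut seen = HashSet::new();
    urls.into_iter().filter(|u| seen.insert(u.clone())).collect()
}
```

Calling `dedupe(extract_locs(xml))` on a sitemap that lists the same page twice yields that page once, in the position where it first appeared.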

Installation

Requirements:

  • Rust stable
  • Cargo

Build the project:

cargo build

Build a release binary:

cargo build --release

Generated binary:

./target/release/sitepulse

Usage

Basic usage:

cargo run -- check https://example.com/sitemap.xml

Using the compiled binary:

sitepulse check https://example.com/sitemap.xml

CLI options

sitepulse check <SITEMAP_URL> [OPTIONS]

Options:

| Option | Description | Default |
| --- | --- | --- |
| --config <FILE> | Load check options from a JSON config file | None |
| --concurrency <N> | Number of concurrent HTTP checks | 10 |
| --timeout <SECONDS> | Request timeout in seconds | 10 |
| --user-agent <VALUE> | Custom User-Agent for all HTTP requests | sitepulse/0.1 (+https://example.local) |
| --method <METHOD> | HTTP method for URL checks: get or head | get |
| --analyze-meta | Extract page title, meta description, and canonical URL. Uses GET even with --method=head | Disabled |
| --only-errors | Show only network errors and 4xx/5xx responses | Disabled |
| --export <FILE> | Write results to a CSV file | None |
| --export-json <FILE> | Write results to a JSON file | None |
| --export-html <FILE> | Write an HTML report | None |
| --fail-on-errors | Exit with code 2 if any 4xx, 5xx, timeout, or network error is found | Disabled |
| --retries <N> | Retry failed URL checks and 5xx responses | 0 |
| --sitemap-retries <N> | Retry sitemap downloads before failing | 2 |
| --max-urls <N> | Limit how many discovered URLs are checked | None |
| --same-host-only | Only check URLs whose host matches the sitemap URL host | Disabled |
| --respect-robots | Filter out URLs disallowed by robots.txt | Disabled |
| --agent-ready | Run an agent readiness audit for the sitemap host | Disabled |
| --agent-ready-export-json <FILE> | Write agent readiness results to a JSON file | None |
| --agent-ready-export-html <FILE> | Write agent readiness results to an HTML file | None |
| --agent-ready-fail-under <PERCENT> | Exit with code 3 if agent readiness score is below the threshold | None |
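
The two CI-oriented options above map run results to process exit codes. A minimal sketch of that mapping follows. The codes 2 and 3 come from the table; everything else is an assumption for illustration: the type names `RunOutcome` and `ExitPolicy`, the exit code 0 for a clean run, and in particular the choice to report URL errors before the score threshold when both conditions apply.

```rust
/// Exit codes 2 and 3 are taken from the options table; 0 for success is an assumption.
const EXIT_OK: i32 = 0;
const EXIT_URL_ERRORS: i32 = 2;
const EXIT_AGENT_SCORE: i32 = 3;

/// Hypothetical summary of a finished run.
struct RunOutcome {
    /// Count of 4xx, 5xx, timeout, and network-error results.
    failed_urls: usize,
    /// Agent readiness score in percent, present only when the audit ran.
    agent_score: Option<f64>,
}

/// Hypothetical view of the two CI flags.
struct ExitPolicy {
    fail_on_errors: bool,
    agent_ready_fail_under: Option<f64>,
}

/// Assumption: URL errors take precedence over the score threshold.
fn exit_code(outcome: &RunOutcome, policy: &ExitPolicy) -> i32 {
    if policy.fail_on_errors && outcome.failed_urls > 0 {
        return EXIT_URL_ERRORS;
    }
    if let (Some(score), Some(threshold)) = (outcome.agent_score, policy.agent_ready_fail_under) {
        if score < threshold {
            return EXIT_AGENT_SCORE;
        }
    }
    EXIT_OK
}
```

In a CI pipeline, distinct codes let a job tell "broken pages" apart from "readiness score too low" without parsing the report.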

Examples:

cargo run -- check https://example.com/sitemap.xml --concurrency 20
cargo run -- check https://example.com/sitemap.xml --timeout 15
cargo run -- check https://example.com/sitemap.xml --method head
cargo run -- check https://example.com/sitemap.xml --analyze-meta
cargo run -- check https://example.com/sitemap.xml --only-errors
cargo run -- check https://example.com/sitemap.xml --export report.csv
cargo run -- check https://example.com/sitemap.xml --retries 2
cargo run -- check https://example.com/sitemap.xml --max-urls 100
cargo run -- check https://example.com/sitemap.xml --same-host-only
cargo run -- check https://example.com/sitemap.xml --respect-robots
cargo run -- check https://example.com/sitemap.xml --agent-ready
cargo run -- check https://example.com/sitemap.xml --sitemap-retries 3
cargo run -- check https://example.com/sitemap.xml \
  --agent-ready \
  --agent-ready-export-json agent-ready.json \
  --agent-ready-export-html agent-ready.html \
  --agent-ready-fail-under 80

Multiple options can be used together:

cargo run -- check https://example.com/sitemap.xml \
  --concurrency 20 \
  --timeout 10 \
  --method head \
  --analyze-meta \
  --retries 2 \
  --sitemap-retries 3 \
  --max-urls 1000 \
  --same-host-only \
  --respect-robots \
  --only-errors \
  --export report.csv \
  --export-json report.json \
  --export-html report.html \
  --agent-ready \
  --agent-ready-export-json agent-ready.json \
  --agent-ready-export-html agent-ready.html

Example terminal output

Checking sitemap: https://example.com/sitemap.xml
Concurrency: 20
Timeout: 10s
User-Agent: sitepulse/0.1 (+https://example.local)
Method: HEAD
Analyze meta: yes
Retries: 2
Sitemap retries: 2

Discovered URLs: 1240

STATUS      TIME ATTEMPTS  METHOD  REDIRECT    ERROR URL
------------------------------------------------------------------------------------------
200        184ms        1     HEAD        no       no https://example.com/
301         96ms        1     HEAD       yes       no https://example.com/old -> https://example.com/new
404        121ms        1     HEAD        no       no https://example.com/missing-page
500        430ms        3     HEAD        no       no https://example.com/broken

Summary:
Total: 1240
2xx: 1190
3xx: 22
4xx: 20
5xx: 4
Errors: 4
Average response time: 218ms

Slowest URLs:
1. 3820ms https://example.com/category/electronics
2. 2910ms https://example.com/product/example
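
The summary and slowest-URL sections of the output above could be derived from the per-URL results roughly as follows. This is a sketch, not the code in report.rs. The names `CheckResult`, `Summary`, `summarize`, and `slowest` are hypothetical. It also assumes that "Errors" counts requests with no HTTP status, and that failed requests are included in the average response time.

```rust
/// One checked URL. `status` is None when no HTTP response arrived.
struct CheckResult {
    url: String,
    status: Option<u16>,
    time_ms: u64,
}

#[derive(Debug, Default, PartialEq)]
struct Summary {
    total: usize,
    s2xx: usize,
    s3xx: usize,
    s4xx: usize,
    s5xx: usize,
    errors: usize,
    avg_time_ms: u64,
}

/// Bucket results by status class and compute the mean response time.
fn summarize(results: &[CheckResult]) -> Summary {
    let mut s = Summary { total: results.len(), ..Default::default() };
    let mut total_ms: u64 = 0;
    for r in results {
        total_ms += r.time_ms;
        match r.status {
            Some(200..=299) => s.s2xx += 1,
            Some(300..=399) => s.s3xx += 1,
            Some(400..=499) => s.s4xx += 1,
            Some(500..=599) => s.s5xx += 1,
            // No status, or a code outside 200-599, is counted as an error here.
            _ => s.errors += 1,
        }
    }
    if !results.is_empty() {
        s.avg_time_ms = total_ms / results.len() as u64;
    }
    s
}

/// Return up to `n` results ordered from slowest to fastest.
fn slowest(results: &[CheckResult], n: usize) -> Vec<&CheckResult> {
    let mut sorted: Vec<&CheckResult> = results.iter().collect();
    sorted.sort_by(|a, b| b.time_ms.cmp(&a.time_ms));
    sorted.truncate(n);
    sorted
}
```

With this bucketing the five counters always add up to the total, as they do in the example output (1190 + 22 + 20 + 4 + 4 = 1240).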

Export

Export to CSV:

cargo run -- check https://example.com/sitemap.xml --export report.csv

Export to JSON:

cargo run -- check https://example.com/sitemap.xml --export-json report.json

Export to HTML:

cargo run -- check https://example.com/sitemap.xml --export-html report.html

CSV, JSON, and HTML result fields include:

  • url
  • status
  • time_ms
  • redirected
  • final_url
  • error
  • attempts
  • method
  • title
  • meta_description
  • canonical_url
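
To make the field list concrete, here is a sketch of one result row and its CSV serialization. The field names follow the list above, but the Rust types, the empty-string encoding of missing values, and the `true`/`false` rendering of `redirected` are assumptions; the actual export.rs may differ. Quoting follows the common RFC 4180 convention.

```rust
/// Field names follow the list above; the types are assumptions.
struct ResultRow {
    url: String,
    status: Option<u16>,
    time_ms: u64,
    redirected: bool,
    final_url: String,
    error: Option<String>,
    attempts: u32,
    method: String,
    title: Option<String>,
    meta_description: Option<String>,
    canonical_url: Option<String>,
}

const CSV_HEADER: &str =
    "url,status,time_ms,redirected,final_url,error,attempts,method,title,meta_description,canonical_url";

/// Quote a field when it contains a comma, a double quote, or a line break.
fn csv_field(value: &str) -> String {
    if value.contains(|c: char| matches!(c, ',' | '"' | '\n' | '\r')) {
        format!("\"{}\"", value.replace('"', "\"\""))
    } else {
        value.to_string()
    }
}

/// Serialize one row in header order; missing optional values become empty fields.
fn to_csv_line(row: &ResultRow) -> String {
    let opt = |v: &Option<String>| v.as_deref().map(csv_field).unwrap_or_default();
    [
        csv_field(&row.url),
        row.status.map(|s| s.to_string()).unwrap_or_default(),
        row.time_ms.to_string(),
        row.redirected.to_string(),
        csv_field(&row.final_url),
        opt(&row.error),
        row.attempts.to_string(),
        csv_field(&row.method),
        opt(&row.title),
        opt(&row.meta_description),
        opt(&row.canonical_url),
    ]
    .join(",")
}
```

Escaping matters most for `title` and `meta_description`, which routinely contain commas and quotes once --analyze-meta is enabled.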

Project structure

src/
  main.rs      # Application entry point
  cli.rs       # CLI arguments and command definitions
  sitemap.rs   # Sitemap download, parsing, and discovery
  checker.rs   # URL HTTP checks
  report.rs    # Terminal output and summary report
  export.rs    # CSV, JSON, and HTML export
  models.rs    # Shared data models

examples/
  sitemap.xml  # Example sitemap for testing

Configuration file

--config accepts a JSON file with check options. Example:

{
  "concurrency": 5,
  "timeout": 15,
  "method": "head",
  "analyze_meta": true,
  "same_host_only": true,
  "respect_robots": true,
  "agent_ready": true,
  "agent_ready_fail_under": 70
}

Command-line options are parsed first, then config values are applied. For repeated audits, keep shared defaults in a config file and pass target-specific values such as the sitemap URL on the command line.

Development

Format code:

cargo fmt

Run compile checks:

cargo check

Run tests:

cargo test

Roadmap

Completed:

  • Project skeleton

  • Cargo.toml

  • CLI command

  • Sitemap download

  • URL parsing

  • HTTP checks

  • Concurrency

  • Timeout

  • Custom User-Agent support

  • --only-errors

  • Retry support

  • Sitemap download retry support

  • GET/HEAD check method selection

  • Optional title, meta description, and canonical URL extraction

  • Same-host filtering option

  • Optional robots.txt filtering

  • Initial agent readiness audit (--agent-ready)

  • CI-friendly agent readiness score threshold

  • Maximum URL limit option

  • CSV export

  • JSON export

  • HTML report export

  • CI-friendly --fail-on-errors option

  • Sitemap index support

  • Gzip sitemap support

  • Slow URL list

  • README

  • Integration tests with a local HTTP server

  • Expanded agent readiness audit (--agent-ready)

    • Discoverability checks: robots.txt, sitemap directives, Link headers, DNS-AID
    • Content accessibility checks: llms.txt, llms-full.txt, Markdown negotiation
    • Bot access control checks: AI bot rules, allow/block detection, Content Signals, Web Bot Auth
    • Protocol discovery checks: MCP, Agent Skills, WebMCP, A2A, API catalog, OAuth, auth.md
    • Page intelligence checks: title, meta description, canonical URL, OpenGraph, JSON-LD, semantic HTML
    • Commerce readiness checks: x402, MPP, UCP, ACP
    • Scoring/reporting: score, PASS/WARN/FAIL checklist, JSON/HTML exports

Potential next improvements:

  • Add GitHub release workflow for tagged binary releases
  • Publish GitHub release notes and binaries for v0.1.0
  • Add packaged install instructions (cargo install, Homebrew, or prebuilt binaries)
  • Add configuration file support for repeated audits
  • Add SARIF/JUnit-style CI export
  • Add rate limiting and per-host politeness controls
  • Add richer structured data validation for JSON-LD schema types

Notes

  • HTTP errors do not crash the program; they are reported per URL.
  • If the sitemap cannot be downloaded or the XML is invalid, the program returns a clear error.
  • Redirects are followed and the final URL is recorded.
  • Duplicate URLs are deduplicated.
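
The first note, combined with the retry rule from the options table (network errors and 5xx responses are retried), suggests a per-URL record that stores failures instead of propagating them. The sketch below shows one way to express that. It is not the code in checker.rs: `Attempt`, `Checked`, and `check_with_retries` are hypothetical names, the HTTP call is abstracted as a closure, and no delay or backoff between attempts is modeled.

```rust
/// Outcome of a single attempt: an HTTP status, or a network-level error message.
type Attempt = Result<u16, String>;

/// Final per-URL record; errors are stored rather than propagated.
#[derive(Debug, PartialEq)]
struct Checked {
    status: Option<u16>,
    error: Option<String>,
    attempts: u32,
}

/// Run `attempt` once, then up to `retries` more times while it keeps
/// returning a network error or a 5xx status.
fn check_with_retries<F>(retries: u32, mut attempt: F) -> Checked
where
    F: FnMut() -> Attempt,
{
    let mut attempts = 0;
    loop {
        attempts += 1;
        let outcome = attempt();
        let retryable = match &outcome {
            Ok(status) => (500..=599).contains(status),
            Err(_) => true,
        };
        // Stop on a non-retryable outcome or once the retry budget is spent.
        if !retryable || attempts > retries {
            return match outcome {
                Ok(status) => Checked { status: Some(status), error: None, attempts },
                Err(e) => Checked { status: None, error: Some(e), attempts },
            };
        }
    }
}
```

With `retries` set to 2, a URL that keeps answering 500 is recorded with 3 attempts, which matches the `/broken` row in the example terminal output; a 404 is recorded after a single attempt.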

License

This project is licensed under the MIT License. See LICENSE for details.

Security

Please see SECURITY.md for vulnerability reporting guidelines.

Changelog

Please see CHANGELOG.md for release history.
