New PR for plugin implementation #307

Devon-White · 2025-06-05T09:00:22Z

docusaurus-plugin-llms-txt

A Docusaurus plugin that converts your site's rendered HTML pages to Markdown and generates an llms.txt index file for AI/LLM consumption. It makes your documentation accessible to Large Language Models while keeping the markdown files human-readable.

Features

🔄 HTML to Markdown Conversion: Automatically converts your Docusaurus HTML pages to clean markdown files
📝 llms.txt Generation: Creates a comprehensive index file with links to all your documentation
🗂️ Hierarchical Organization: Intelligently organizes documents into categories with configurable depth
⚡ Smart Caching: Efficient caching system for fast incremental builds
🎯 Content Filtering: Flexible filtering by content type (docs, blog, pages) and custom patterns
🔧 Route Rules: Advanced configuration for specific routes with custom processing
💻 CLI Commands: Standalone CLI for generation and cleanup operations
🎨 Customizable Content Extraction: Configurable CSS selectors for precise content extraction
📊 Table Processing: Intelligent table handling with rehype processing
🔗 Link Management: Smart internal link processing with relative/absolute path options

Installation

npm install @signalwire/docusaurus-plugin-llms-txt

Quick Start

Basic Setup

Add the plugin to your docusaurus.config.js:

module.exports = {
  plugins: [
    '@signalwire/docusaurus-plugin-llms-txt'
  ]
};

Basic Configuration

module.exports = {
  plugins: [
    [
      '@signalwire/docusaurus-plugin-llms-txt',
      {
        siteTitle: 'My Documentation',
        siteDescription: 'Comprehensive guide to our platform',
        depth: 2,
        content: {
          includeBlog: true,
          includePages: true
        }
      }
    ]
  ]
};

After building your site (npm run build), you'll find:

llms.txt in your build output directory
Individual markdown files for each page (if enabled)

Configuration Options

Main Plugin Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `siteTitle` | `string` | Site config title | Title for the llms.txt header. Can be an empty string. |
| `siteDescription` | `string` | `undefined` | Description for the llms.txt header. Can be an empty string. |
| `depth` | `1\|2\|3\|4\|5` | `1` | Categorization depth for document hierarchy. See Depth Configuration. |
| `enableDescriptions` | `boolean` | `true` | Include descriptions in llms.txt links |
| `optionalLinks` | `OptionalLink[]` | `[]` | Additional external links. See Optional Links. |
| `includeOrder` | `string[]` | `[]` | Global category ordering (glob patterns). See Include Order. |
| `runOnPostBuild` | `boolean` | `true` | Whether to run during postBuild phase |
| `onRouteError` | `'ignore'\|'log'\|'warn'\|'throw'` | `'warn'` | Route processing failure handling. See Error Handling. |
| `logLevel` | `0\|1\|2\|3` | `1` | Logging verbosity. Range: 0-3. See Logging Levels. |

Content Options (`content`)

Note: All content options are optional. If you don't specify a content object, all options use their defaults.

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `enableMarkdownFiles` | `boolean` | `true` | Generate individual markdown files |
| `relativePaths` | `boolean` | `true` | Use relative paths in links. See Path Configuration. |
| `includeBlog` | `boolean` | `false` | Include blog posts |
| `includePages` | `boolean` | `false` | Include pages |
| `includeDocs` | `boolean` | `true` | Include documentation |
| `excludeRoutes` | `string[]` | `[]` | Glob patterns to exclude. See Route Exclusion. |
| `contentSelectors` | `string[]` | Default selectors | CSS selectors for content extraction. See Content Selectors. |
| `routeRules` | `RouteRule[]` | `[]` | Route-specific processing rules. See Route Rules. |
| `remarkStringify` | `object` | `{}` | Markdown formatting options. See `remarkStringify` Options. |
| `remarkGfm` | `boolean\|object` | `true` | GitHub Flavored Markdown options. See `remarkGfm` Options. |
| `rehypeProcessTables` | `boolean` | `true` | Process HTML tables for markdown conversion |

Detailed Configuration

Depth Configuration

The depth option sets how many leading path segments of a route form its category, which controls how deep the hierarchy in llms.txt goes.

How it works:

depth: 1: /api/users → api category
depth: 2: /api/users/create → api/users category
depth: 3: /api/users/create/advanced → api/users/create category

// Shallow categorization - good for simple sites
{ depth: 1 }  // /docs/guide/advanced → "docs" category

// Deep categorization - good for complex sites
{ depth: 3 }  // /docs/guide/advanced → "docs/guide/advanced" category

Use cases:

depth: 1 - Simple sites with few top-level sections
depth: 2 - Most documentation sites with clear section/subsection structure
depth: 3+ - Complex sites with deep hierarchies or very specific organization needs
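
The mapping from route to category can be sketched in a few lines. This is only an illustration of the examples above, not the plugin's actual code; `categoryForRoute` is a hypothetical helper name, and it assumes the category is simply the first `depth` path segments.

```javascript
// Hypothetical helper (not part of the plugin API): derive a category key
// from a route, assuming the category is the first `depth` path segments.
function categoryForRoute(route, depth) {
  // Drop the empty strings produced by leading or trailing slashes.
  const segments = route.split('/').filter(Boolean);
  return segments.slice(0, depth).join('/');
}

console.log(categoryForRoute('/api/users', 1));           // api
console.log(categoryForRoute('/api/users/create', 2));    // api/users
console.log(categoryForRoute('/docs/guide/advanced', 3)); // docs/guide/advanced
```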

Optional Links

Add external or additional links to your llms.txt in a separate "Optional" section.

Structure:

{
  title: string;        // REQUIRED: Display text for the link
  url: string;          // REQUIRED: Full URL (can be external or internal)
  description?: string; // Optional description (respects enableDescriptions)
}

Required fields:

title - Display text for the link
url - The URL to link to

Example:

{
  optionalLinks: [
    {
      title: 'Live API Status',          // REQUIRED
      url: 'https://status.myapi.com',   // REQUIRED
      description: 'Real-time system status and uptime monitoring'
    },
    {
      title: 'Community Discord',        // REQUIRED
      url: 'https://discord.gg/myapi'   // REQUIRED
      // description is optional
    },
    {
      title: 'Legacy v1 Docs',          // REQUIRED
      url: '/legacy/v1/',               // REQUIRED
      description: 'Documentation for API version 1.x (deprecated)'
    }
  ]
}

Output:

## Optional
- [Live API Status](https://status.myapi.com): Real-time system status and uptime monitoring
- [Community Discord](https://discord.gg/myapi)
- [Legacy v1 Docs](/legacy/v1/): Documentation for API version 1.x (deprecated)
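
To make the relationship between the config and the output concrete, here is a rough sketch of how such a section could be rendered. `renderOptionalLinks` is a hypothetical function written for this example; the plugin's real implementation may differ.

```javascript
// Hypothetical renderer: turns optionalLinks entries into the "Optional" section.
// Descriptions are emitted only when enableDescriptions is true and one is present.
function renderOptionalLinks(links, enableDescriptions = true) {
  if (links.length === 0) return '';
  const lines = links.map(({ title, url, description }) => {
    const suffix = enableDescriptions && description ? `: ${description}` : '';
    return `- [${title}](${url})${suffix}`;
  });
  return ['## Optional', ...lines].join('\n');
}
```

Calling it with `enableDescriptions` set to `false` would produce the same list without the text after each colon.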

Include Order

Controls the order in which categories appear in your llms.txt using glob patterns. Categories matching earlier patterns appear first.

{
  includeOrder: [
    '/getting-started/**',  // Getting started content first
    '/api/auth/**',         // Authentication second
    '/api/reference/**',    // API reference third
    '/guides/**',           // Guides fourth
    '/examples/**'          // Examples last
  ]
}

Pattern matching rules:

Use /** for matching entire directory trees
Use * for single-level wildcards
Patterns are matched against the full route path
More specific patterns should come before general ones
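
The ordering rule can be illustrated with a small sketch. The glob matcher below is a deliberately minimal stand-in written for this example (it only understands `**` and `*`); the plugin presumably relies on a full glob library, and both function names are hypothetical.

```javascript
// Minimal glob matcher for illustration only: ** spans path segments, * stays within one.
function globToRegExp(glob) {
  const source = glob
    .replace(/[.+?^${}()|[\]\\]/g, '\\$&') // escape regex metacharacters (not *)
    .replace(/\*\*/g, '\u0000')            // protect ** before handling single *
    .replace(/\*/g, '[^/]*')               // * matches within a single segment
    .replace(/\u0000/g, '.*');             // ** matches across segments
  return new RegExp(`^${source}$`);
}

// Routes are ranked by the index of the first pattern they match;
// routes matching no pattern keep their relative order at the end.
function sortByIncludeOrder(routes, includeOrder) {
  const matchers = includeOrder.map(globToRegExp);
  const rank = (route) => {
    const index = matchers.findIndex((re) => re.test(route));
    return index === -1 ? matchers.length : index;
  };
  return [...routes].sort((a, b) => rank(a) - rank(b)); // sort is stable
}
```

Because the first matching pattern decides the rank, a general pattern placed early would capture routes that a later, more specific pattern was meant for, which is why specific patterns should come first.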

Error Handling

The onRouteError option controls what happens when individual pages fail to process. Valid values: 'ignore', 'log', 'warn', 'throw'.

'ignore': Skip failed routes silently
'log': Log failures but continue (no console output in normal mode)
'warn': Show warnings for failures but continue (recommended)
'throw': Stop entire build on first failure

// For production builds - continue despite individual failures
{ onRouteError: 'warn' }

// For development - fail fast to catch issues
{ onRouteError: 'throw' }

// For large sites with some problematic pages
{ onRouteError: 'ignore' }
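
As a mental model, the four modes behave roughly like the dispatcher below. This is a sketch, not the plugin's code: `handleRouteError` is a hypothetical name, and mapping 'log' to `logger.debug` is an assumption made to mirror "no console output in normal mode".

```javascript
// Hypothetical dispatcher for a failed route, keyed on the onRouteError value.
function handleRouteError(mode, route, error, logger = console) {
  const message = `Failed to process ${route}: ${error.message}`;
  switch (mode) {
    case 'ignore':
      return;                      // skip silently
    case 'log':
      logger.debug(message);       // assumption: recorded at a low-visibility level
      return;
    case 'warn':
      logger.warn(message);        // visible warning, build continues
      return;
    case 'throw':
      throw new Error(message);    // abort the build on the first failure
    default:
      throw new TypeError(`Unknown onRouteError value: ${mode}`);
  }
}
```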

Logging Levels

The logLevel option controls verbosity of console output. Range: 0-3 (integer).

0 (Quiet): Only errors and final success/failure
1 (Normal): Errors, warnings, and completion messages (default)
2 (Verbose): Above + processing info and statistics
3 (Debug): Above + detailed debug information

// Production builds - minimal output
{ logLevel: 0 }

// Development - maximum detail
{ logLevel: 3 }

// CI/CD - standard output
{ logLevel: 1 }

Path Configuration

The relativePaths option controls link format in both llms.txt and markdown files:

true: ./getting-started/index.md, ../api/reference.md
false: https://mysite.com/getting-started/, https://mysite.com/api/reference/

When to use relative paths:

Documentation distributed as files
Offline usage requirements
Maximum portability

When to use absolute paths:

Web-only consumption
Integration with web-based tools
Cleaner link appearance
When you want full URLs with your site domain

Route Exclusion

Use glob patterns to exclude specific routes or route patterns:

{
  content: {
    excludeRoutes: [
      '/admin/**',           // Exclude all admin pages
      '/api/internal/**',    // Exclude internal API docs
      '**/_category_/**',    // Exclude Docusaurus category pages
      '/legacy/**',          // Exclude legacy documentation
      '**/*.xml',           // Exclude XML files
      '**/*.json',          // Exclude JSON files
      '/search',            // Exclude specific page
      '/404'                // Exclude error pages
    ]
  }
}

Common exclusion patterns:

**/_category_/** - Docusaurus auto-generated category pages
/tags/** - Blog tag pages
/archive/** - Archived content
**/*.xml - Sitemap and RSS files
**/internal/** - Internal documentation

Content Selectors

CSS selectors used to extract main content from HTML pages. The plugin tries each selector in order until it finds content.

Default selectors (used when contentSelectors is not specified):

[
  '.theme-doc-markdown',     // Docusaurus main content area
  'main .container .col',    // Bootstrap-style layout
  'main .theme-doc-wrapper', // Docusaurus wrapper
  'article',                 // Semantic article element
  'main .container',         // Broader container
  'main'                     // Fallback to main element
]

How it works:

The plugin tries each selector in the array order
First selector that finds content is used
If no selectors find content, the page is skipped with a warning
Empty array [] will use the default selectors above
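
The fallback behaviour described above can be sketched as follows. The `query` parameter is a stand-in for whatever DOM lookup the plugin performs, and `extractContent` is a hypothetical name used only for this illustration.

```javascript
const DEFAULT_SELECTORS = [
  '.theme-doc-markdown',
  'main .container .col',
  'main .theme-doc-wrapper',
  'article',
  'main .container',
  'main'
];

// Try each selector in order and return the first one that yields content.
// `query(selector)` is assumed to return the matched content or a falsy value.
function extractContent(selectors, query) {
  const candidates = selectors && selectors.length > 0 ? selectors : DEFAULT_SELECTORS;
  for (const selector of candidates) {
    const content = query(selector);
    if (content) return { selector, content };
  }
  return null; // caller would skip the page and emit a warning
}
```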

Custom selectors for different themes:

{
  content: {
    contentSelectors: [
      '.custom-content-area',    // Your custom theme
      '.docs-wrapper .content',  // Custom documentation wrapper
      '#main-content',           // ID-based selector
      '.prose',                  // Tailwind typography
      'main'                     // Fallback
    ]
  }
}

Important notes:

More specific selectors should come first
Always include a fallback selector (like 'main' or 'article')
Selectors are standard CSS selectors (class, ID, element, attribute, etc.)
Use browser dev tools to inspect your page structure if content isn't being extracted

Debugging content extraction:

Set logLevel: 3 to see which selector is being used for each page
Use browser dev tools to inspect your page's HTML structure
Test selectors in browser console with document.querySelector('your-selector')
Add more specific selectors before general ones

Markdown Processing

`remarkStringify` Options

Controls how the HTML→Markdown conversion formats the output. These options are passed directly to the remark-stringify library.

{
  content: {
    remarkStringify: {
      bullet: '-',              // Use - for bullet points (vs *)
      emphasis: '_',            // Use _ for italic (vs *)
      strong: '**',             // Use ** for bold
      fence: '`',               // Use ` for code fences (vs ~)
      fences: true,             // Use fenced code blocks
      listItemIndent: 'one',    // Indent list items with 1 space
      rule: '-',                // Use - for horizontal rules
      ruleRepetition: 3,        // Repeat rule character 3 times
      ruleSpaces: false         // No spaces around rules
    }
  }
}

📖 For complete option reference, see: remark-stringify options

`remarkGfm` Options

Controls table processing, strikethrough text, task lists, and other GitHub-style markdown features. These options are passed directly to the remark-gfm library.

// Enable with defaults
{ content: { remarkGfm: true } }

// Disable entirely
{ content: { remarkGfm: false } }

// Custom options
{
  content: {
    remarkGfm: {
      singleTilde: false,      // Require ~~ for strikethrough (not ~)
      tableCellPadding: true,  // Add padding around table cells
      tablePipeAlign: true,    // Align table pipes
      stringLength: stringWidth // Custom string length function (here, stringWidth imported from the string-width package)
    }
  }
}

Default values when remarkGfm: true:
The plugin applies these defaults automatically:

{
  singleTilde: false,        // Require ~~ for strikethrough
  tableCellPadding: true,    // Add padding around table cells  
  tablePipeAlign: true,      // Align table pipes
  stringLength: stringWidth  // Use string-width for accurate measurements
}

You can override any of these by providing an object instead of true:

{
  content: {
    remarkGfm: {
      tableCellPadding: false,  // Override: no padding around cells
      // Other defaults will still apply
    }
  }
}
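
The override behaviour amounts to a shallow merge over the defaults. The sketch below assumes exactly that; `resolveRemarkGfm` is a hypothetical name, and the `stringLength` default is replaced by a plain length function so the example runs without the string-width package.

```javascript
// Defaults as listed above. The real default for stringLength is stringWidth;
// a simple placeholder is used here to keep the sketch dependency-free.
const GFM_DEFAULTS = {
  singleTilde: false,
  tableCellPadding: true,
  tablePipeAlign: true,
  stringLength: (text) => text.length
};

// false disables GFM, true (or omitted) uses the defaults,
// and an object overrides individual defaults.
function resolveRemarkGfm(option) {
  if (option === false) return null;
  if (option === true || option === undefined) return { ...GFM_DEFAULTS };
  return { ...GFM_DEFAULTS, ...option };
}
```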

📖 For complete option reference, see: remark-gfm options

`rehypeProcessTables` (Plugin Option)

Type: boolean | Default: true

This plugin-specific option controls whether HTML tables are converted to markdown tables. When enabled, the plugin processes HTML tables to produce clean markdown tables. When disabled, tables are left as raw HTML in the markdown output.

Route Rules

Route rules customize processing per route. A rule can override the categorization depth, category name, content selectors, and subcategory ordering for the routes that match its glob pattern.

Basic Route Rule Structure

{
  route: string;              // REQUIRED: Glob pattern to match
  depth?: 1 | 2 | 3 | 4 | 5; // Override categorization depth (range: 1-5)
  categoryName?: string;      // Override category display name
  contentSelectors?: string[]; // Override content extraction selectors
  includeOrder?: string[];    // Override subcategory ordering
}

Required fields:

route - The glob pattern to match routes against

Optional fields (all have validation constraints):

depth - Must be integer 1-5
categoryName - Any string
contentSelectors - Array of CSS selector strings
includeOrder - Array of glob pattern strings

Route Pattern Matching

Route rules use glob patterns to match routes:

{
  content: {
    routeRules: [
      {
        route: '/api/**',              // Matches all API routes
        depth: 3
      },
      {
        route: '/guides/getting-started', // Exact match
        categoryName: 'Quick Start'
      },
      {
        route: '/blog/2024/**',        // Year-specific blog posts
        categoryName: '2024 Posts'
      },
      {
        route: '**/advanced',          // Any route ending in /advanced
        depth: 4
      }
    ]
  }
}

Rule Priority

When multiple rules match the same route, the most specific rule wins:

{
  content: {
    routeRules: [
      {
        route: '/api/**',              // Less specific
        depth: 2,
        categoryName: 'API'
      },
      {
        route: '/api/reference/**',    // More specific - this wins
        depth: 4,
        categoryName: 'API Reference'
      }
    ]
  }
}
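
How "most specific" is measured is not spelled out here, so the sketch below makes a loud assumption: among matching rules, the one with the longest pattern wins. The matcher only handles exact routes and patterns ending in `/**`, and both function names are hypothetical.

```javascript
// Simplified matcher for illustration: supports exact routes and a trailing '/**'.
function matchesRoute(pattern, route) {
  if (pattern.endsWith('/**')) {
    return route.startsWith(pattern.slice(0, -2)); // '/api/**' -> prefix '/api/'
  }
  return pattern === route;
}

// Assumption: specificity is approximated by pattern length.
function pickRule(rules, route) {
  return rules
    .filter((rule) => matchesRoute(rule.route, route))
    .sort((a, b) => b.route.length - a.route.length)[0];
}
```

With the two rules above, a route such as /api/reference/users would resolve to the 'API Reference' rule, while /api/users would fall back to the 'API' rule.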

Advanced Examples

API Documentation with Custom Structure:

{
  content: {
    routeRules: [
      {
        route: '/api',                    // REQUIRED field
        categoryName: 'API Overview',
        depth: 1                         // Must be 1-5
      },
      {
        route: '/api/reference/**',       // REQUIRED field
        depth: 4,                        // Must be 1-5
        categoryName: 'API Reference',
        contentSelectors: ['.openapi-content', '.api-docs']
      },
      {
        route: '/api/guides/**',          // REQUIRED field
        categoryName: 'API Guides',
        includeOrder: [
          '/api/guides/quickstart',
          '/api/guides/authentication/**',
          '/api/guides/rate-limiting/**'
        ]
      }
    ]
  }
}

Multi-Language Documentation:

{
  content: {
    routeRules: [
      {
        route: '/en/**',
        categoryName: 'English Documentation',
        includeOrder: ['/en/getting-started/**', '/en/api/**']
      },
      {
        route: '/es/**',
        categoryName: 'Documentación en Español',
        includeOrder: ['/es/comenzar/**', '/es/api/**']
      }
    ]
  }
}

CLI Commands

The plugin provides CLI commands for standalone operation and cleanup:

Generate Command

npx docusaurus llms-txt [siteDir]

Generates llms.txt and markdown files using cached routes from a previous build.

Arguments:

siteDir (optional) - Path to your Docusaurus site directory. Defaults to current working directory.

Prerequisites:
You must run npm run build first to create the route cache.

Examples:

# Generate in current directory
npx docusaurus llms-txt

# Generate for specific site directory
npx docusaurus llms-txt ./my-docs-site

# Generate using absolute path
npx docusaurus llms-txt /home/user/projects/docs

How it works:

Loads cached route information from the previous build
Validates that configuration hasn't changed significantly
Processes routes using the current plugin configuration
Generates output files in the build directory

When to use:

You want to regenerate llms.txt without rebuilding the entire site
You're in a CI/CD pipeline and want separate build steps
You have runOnPostBuild: false configured
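
For the last case, a configuration along these lines would turn off the automatic post-build run so that generation happens only when you invoke the generate command. This is a sketch that assumes the plugin is registered under the package name from the Installation section.

```javascript
// docusaurus.config.js (fragment)
module.exports = {
  plugins: [
    [
      '@signalwire/docusaurus-plugin-llms-txt',
      {
        runOnPostBuild: false // skip generation during `npm run build`
      }
    ]
  ]
};
```

With this in place, you would run npm run build first and then npx docusaurus llms-txt as a separate step.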

Clean Command

npx docusaurus llms-txt-clean [siteDir] [options]

Removes all generated markdown files and llms.txt using cached file information.

Arguments:

siteDir (optional) - Path to your Docusaurus site directory. Defaults to current working directory.

Options:

--clear-cache - Also clear the plugin cache directory

Examples:

# Clean generated files but keep cache
npx docusaurus llms-txt-clean

# Clean everything including cache
npx docusaurus llms-txt-clean --clear-cache

# Clean specific directory
npx docusaurus llms-txt-clean ./docs

# Clean with cache clearing for specific directory
npx docusaurus llms-txt-clean ./docs --clear-cache

What gets cleaned:

All individual markdown files generated by the plugin
The llms.txt index file
Cache entries for markdown files (updates cache to reflect removal)
With --clear-cache: The entire plugin cache directory

Safe operation:

Only removes files that were generated by the plugin (tracked in cache)
Will not remove files that weren't created by the plugin
Updates cache to maintain consistency
Provides detailed logging of what was removed

When to use:

Switching between enableMarkdownFiles: true/false
Cleaning up before changing major configuration
Debugging generation issues
Preparing for a clean rebuild

Advanced Configuration Examples

Multi-Language Support

{
  content: {
    routeRules: [
      {
        route: '/en/**',
        categoryName: 'English Documentation',
        includeOrder: ['/en/getting-started/**', '/en/api/**']
      },
      {
        route: '/es/**',
        categoryName: 'Documentación en Español'
      }
    ]
  }
}

API Documentation Focus

{
  siteTitle: 'API Documentation',
  depth: 3,
  content: {
    includeBlog: false,
    includePages: true,
    contentSelectors: [
      '.api-docs-content',
      '.theme-doc-markdown',
      'main'
    ],
    routeRules: [
      {
        route: '/api/reference/**',
        depth: 4,
        categoryName: 'API Reference',
        contentSelectors: ['.openapi-content', '.api-section']
      },
      {
        route: '/api/guides/**',
        categoryName: 'API Guides',
        includeOrder: [
          '/api/guides/quickstart',
          '/api/guides/authentication/**',
          '/api/guides/pagination/**'
        ]
      }
    ],
    excludeRoutes: [
      '/api/legacy/**',
      '**/_category_/**'
    ]
  },
  optionalLinks: [
    {
      title: 'OpenAPI Spec',
      url: 'https://api.example.com/openapi.json',
      description: 'Machine-readable API specification'
    }
  ]
}

Blog-Heavy Site

{
  siteTitle: 'Tech Blog & Tutorials',
  depth: 2,
  content: {
    includeBlog: true,
    includePages: true,
    includeDocs: true,
    routeRules: [
      {
        route: '/blog/**',
        categoryName: 'Blog Posts',
        includeOrder: ['/blog/2024/**', '/blog/2023/**']
      },
      {
        route: '/tutorials/**',
        depth: 3,
        categoryName: 'Tutorials'
      }
    ]
  }
}

Custom Content Extraction

{
  content: {
    contentSelectors: [
      '.custom-content-area',
      '.documentation-body',
      'main .content',
      'article'
    ],
    remarkStringify: {
      bullet: '-',
      emphasis: '_',
      strong: '**',
      listItemIndent: 'one'
    },
    remarkGfm: {
      tablePipeAlign: false,
      singleTilde: false
    }
  }
}

Understanding the Output

llms.txt Structure

The generated llms.txt follows this structure:

# Site Title

> Site description (if provided)

## Category Name
- [Document Title](link): description
- [Another Document](link): description

### Subcategory
- [Subcategory Document](link): description

## Optional
- [External Link](url): description
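
The layout above can be reproduced with a short renderer. This is an illustrative sketch only: the input shape (`name`, `docs`, `subcategories`) and the function names are assumptions made for the example, not the plugin's internal data model.

```javascript
// Render one category; nested subcategories get one more '#' per level.
function renderCategory(category, level = 2) {
  const lines = [`${'#'.repeat(level)} ${category.name}`];
  for (const doc of category.docs ?? []) {
    const suffix = doc.description ? `: ${doc.description}` : '';
    lines.push(`- [${doc.title}](${doc.link})${suffix}`);
  }
  for (const sub of category.subcategories ?? []) {
    lines.push('', ...renderCategory(sub, level + 1));
  }
  return lines;
}

// Assemble the full file: title, optional description blockquote, then categories.
function renderLlmsTxt({ title, description, categories }) {
  const lines = [`# ${title}`];
  if (description) lines.push('', `> ${description}`);
  for (const category of categories) {
    lines.push('', ...renderCategory(category));
  }
  return lines.join('\n') + '\n';
}
```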

Markdown Files

When enableMarkdownFiles is true, individual markdown files are created for each page:

Files are saved with clean markdown content
Internal links are processed based on relativePaths setting
Tables are processed for better markdown compatibility
Content is extracted using configured selectors

Troubleshooting

Common Issues

"No cached routes found"

Run npm run build first to generate the cache
Ensure the plugin is properly configured in docusaurus.config.js

Empty or minimal content

Check your contentSelectors configuration
Verify that your page markup matches the default selectors
Use logLevel: 3 for debug output

Route processing failures

Set onRouteError: 'ignore' to skip problematic routes
Use logLevel: 2 to see which routes are failing
Check excludeRoutes to filter out problematic paths

Debug Configuration

{
  logLevel: 3,           // Maximum verbosity
  onRouteError: 'warn',  // Log route errors but continue
  content: {
    // Your config here
  }
}

Performance Optimization

For large sites:

{
  content: {
    excludeRoutes: [
      '**/_category_/**',    // Exclude category pages
      '/archive/**',         // Exclude archived content
      '**/*.xml',           // Exclude XML files
      '**/*.json'           // Exclude JSON files
    ]
  },
  logLevel: 1  // Reduce logging overhead
}

Caching

The plugin uses intelligent caching to speed up subsequent builds:

Cache is stored in .docusaurus/docusaurus-plugin-llms-txt/
Routes are cached with content hashes for change detection
Configuration changes automatically invalidate relevant cache entries
Use llms-txt-clean --clear-cache to reset the cache
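
Hash-based change detection generally works like the sketch below. The cache shape (a Map from route to hash), the SHA-256 algorithm, and the function names are all assumptions made for illustration; the plugin's actual cache format is not described here.

```javascript
const { createHash } = require('node:crypto');

// Hash the page content so changes can be detected without storing the content itself.
function hashContent(html) {
  return createHash('sha256').update(html).digest('hex');
}

// Returns true when the route is new or its content changed, and records the new hash.
// `cache` is a hypothetical Map from route path to the last seen content hash.
function needsReprocessing(cache, route, html) {
  const hash = hashContent(html);
  if (cache.get(route) === hash) return false;
  cache.set(route, hash);
  return true;
}
```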

License

MIT

netlify · 2025-06-05T09:00:26Z

✅ Deploy Preview for signalwire-docs ready!

Name	Link
🔨 Latest commit	`5376a5e`
🔍 Latest deploy log	https://app.netlify.com/projects/signalwire-docs/deploys/68415ca82f816600084b26cd
😎 Deploy Preview	https://deploy-preview-307--signalwire-docs.netlify.app


New PR for plugin implementation

5376a5e

Devon-White requested review from briankwest, hey-august and niravcodes as code owners June 5, 2025 09:00

niravcodes approved these changes Jun 5, 2025


Devon-White merged commit 2e1809c into main Jun 5, 2025
5 checks passed

Devon-White deleted the Devon/llms-txt branch June 5, 2025 14:59
