A Rust port of Mozilla's Readability.js for extracting readable content from web pages.
Legible analyzes HTML documents and extracts the main article content, stripping away navigation, ads, sidebars, and other non-content elements to produce clean, readable output.
Add to your `Cargo.toml`:

```toml
[dependencies]
legible = "0.4"
```

```rust
use legible::parse;

let html = r#"
<html>
<head><title>My Article</title></head>
<body>
<nav>Navigation</nav>
<article>
<h1>Article Title</h1>
<p>This is the main content of the article...</p>
</article>
<footer>Footer</footer>
</body>
</html>
"#;

match parse(html, Some("https://example.com"), None) {
    Ok(article) => {
        println!("Title: {}", article.title);
        println!("Content: {}", article.content);
        println!("Text: {}", article.text_content);
    }
    Err(e) => eprintln!("Error: {}", e),
}
```

Before running the full extraction, you can check whether a document is likely to contain readable content:
```rust
use legible::is_probably_readerable;

if is_probably_readerable(html, None) {
    // Document appears to have extractable content
}
```

To avoid parsing the HTML twice when you check readability before extracting, use `Document` to parse the HTML once and reuse it for both operations:
```rust
use legible::Document;

let doc = Document::new(html);
if doc.is_probably_readerable(None) {
    match doc.parse(Some("https://example.com"), None) {
        Ok(article) => println!("Title: {}", article.title),
        Err(e) => eprintln!("Error: {}", e),
    }
}
```

`is_probably_readerable` borrows the document (a read-only check), while `parse` consumes it (the extraction algorithm mutates the DOM).
The `Article` struct contains:

| Field | Type | Description |
|---|---|---|
| `title` | `String` | The article title |
| `content` | `String` | The article content as HTML |
| `text_content` | `String` | The article content as plain text |
| `byline` | `Option<String>` | The author byline |
| `excerpt` | `Option<String>` | A short excerpt from the article |
| `site_name` | `Option<String>` | The site name |
| `published_time` | `Option<String>` | The published time |
| `dir` | `Option<String>` | Text direction (`ltr` or `rtl`) |
| `lang` | `Option<String>` | Document language |
| `length` | `usize` | Length of the text content |
Use the `Options` builder to customize parsing behavior:

```rust
use legible::{parse, Options};

let options = Options::new()
    .char_threshold(250)    // Minimum article length (default: 500)
    .keep_classes(true)     // Preserve CSS classes in output
    .disable_json_ld(true); // Skip JSON-LD metadata extraction

let article = parse(html, Some(url), Some(options));
```

| Option | Default | Description |
|---|---|---|
| `max_elems_to_parse` | `0` | Maximum elements to parse (`0` = unlimited) |
| `nb_top_candidates` | `5` | Number of top candidates to consider |
| `char_threshold` | `500` | Minimum article character length |
| `keep_classes` | `false` | Preserve CSS classes in output |
| `classes_to_preserve` | `["page"]` | Specific classes to keep |
| `disable_json_ld` | `false` | Skip JSON-LD metadata extraction |
| `allowed_video_regex` | - | Custom regex for allowed video embeds |
| `link_density_modifier` | `0.0` | Adjust link density threshold |
| `debug` | `false` | Enable debug logging |
The extracted HTML content is unsanitized and may contain malicious scripts or other dangerous content from the source document. Before rendering this HTML in a browser or any other context where scripts could execute, sanitize it with a library like ammonia:

```rust
use legible::parse;

let article = parse(html, Some(url), None)?;

// Sanitize before rendering
let safe_html = ammonia::clean(&article.content);
```

Legible implements the same algorithm as Readability.js:
1. **Document Preparation** - Removes scripts, normalizes markup, fixes lazy-loaded images
2. **Metadata Extraction** - Extracts title, byline, and other metadata from JSON-LD, OpenGraph tags, and meta elements
3. **Content Scoring** - Scores DOM nodes based on tag type, text density, and class/id patterns
4. **Candidate Selection** - Identifies the highest-scoring content container
5. **Content Cleaning** - Removes low-scoring elements, empty containers, and non-content markup
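The link-density idea behind the scoring and cleaning steps can be sketched as follows. This is an illustrative simplification, not the crate's actual scoring code: the real algorithm also weighs tag types, text density, and class/id patterns.

```rust
// Fraction of a node's text that sits inside links.
// Nav bars and link farms score near 1.0; article prose near 0.0.
fn link_density(text_len: usize, link_text_len: usize) -> f64 {
    if text_len == 0 {
        return 0.0;
    }
    link_text_len as f64 / text_len as f64
}

// A node's content score is scaled down as its link density rises.
fn adjusted_score(base_score: f64, text_len: usize, link_text_len: usize) -> f64 {
    base_score * (1.0 - link_density(text_len, link_text_len))
}

fn main() {
    // A nav-like node: almost all of its text is link text.
    println!("{:.2}", adjusted_score(10.0, 100, 90)); // low score
    // An article-like node: very little link text.
    println!("{:.2}", adjusted_score(10.0, 100, 5)); // high score
}
```

The `link_density_modifier` option above shifts where this threshold effectively sits, letting you tune how aggressively link-heavy blocks are discarded.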
The library is tested against Mozilla's official Readability.js test suite.
Apache-2.0