A Rust port of Mozilla's Readability.js for extracting readable content from web pages.
Legible analyzes HTML documents and extracts the main article content, stripping away navigation, ads, sidebars, and other non-content elements to produce clean, readable output.
Add to your `Cargo.toml`:

```toml
[dependencies]
legible = "0.4"
```

```rust
use legible::parse;

let html = r#"
<html>
<head><title>My Article</title></head>
<body>
<nav>Navigation</nav>
<article>
<h1>Article Title</h1>
<p>This is the main content of the article...</p>
</article>
<footer>Footer</footer>
</body>
</html>
"#;

match parse(html, Some("https://example.com"), None) {
    Ok(article) => {
        println!("Title: {}", article.title);
        println!("Content: {}", article.content);
        println!("Text: {}", article.text_content);
    }
    Err(e) => eprintln!("Error: {}", e),
}
```

Before running the full extraction, you can check whether a document is likely to contain readable content:
```rust
use legible::is_probably_readerable;

if is_probably_readerable(html, None) {
    // Document appears to have extractable content
}
```

To avoid parsing the HTML twice when you check readability before extracting, use `Document` to parse the HTML once and reuse it for both operations:
```rust
use legible::Document;

let doc = Document::new(html);
if doc.is_probably_readerable(None) {
    match doc.parse(Some("https://example.com"), None) {
        Ok(article) => println!("Title: {}", article.title),
        Err(e) => eprintln!("Error: {}", e),
    }
}
```

`is_probably_readerable` borrows the document (a read-only check), while `parse` consumes it (the extraction algorithm mutates the DOM).
The `Article` struct contains:

| Field | Type | Description |
|---|---|---|
| `title` | `String` | The article title |
| `content` | `String` | The article content as HTML |
| `text_content` | `String` | The article content as plain text |
| `byline` | `Option<String>` | The author byline |
| `excerpt` | `Option<String>` | A short excerpt from the article |
| `site_name` | `Option<String>` | The site name |
| `published_time` | `Option<String>` | The published time |
| `dir` | `Option<String>` | Text direction (`ltr` or `rtl`) |
| `lang` | `Option<String>` | Document language |
| `length` | `usize` | Length of the text content |
Use the `Options` builder to customize parsing behavior:

```rust
use legible::{parse, Options};

let options = Options::new()
    .char_threshold(250)    // Minimum article length (default: 500)
    .keep_classes(true)     // Preserve CSS classes in output
    .disable_json_ld(true); // Skip JSON-LD metadata extraction

let article = parse(html, Some(url), Some(options));
```

| Option | Default | Description |
|---|---|---|
| `max_elems_to_parse` | `0` | Maximum elements to parse (`0` = unlimited) |
| `nb_top_candidates` | `5` | Number of top candidates to consider |
| `char_threshold` | `500` | Minimum article character length |
| `keep_classes` | `false` | Preserve CSS classes in output |
| `classes_to_preserve` | `["page"]` | Specific classes to keep |
| `disable_json_ld` | `false` | Skip JSON-LD metadata extraction |
| `allowed_video_regex` | - | Custom regex for allowed video embeds |
| `link_density_modifier` | `0.0` | Adjust link density threshold |
| `debug` | `false` | Enable debug logging |
The extracted HTML content is unsanitized and may contain malicious scripts or other dangerous content from the source document. Before rendering this HTML in a browser or any other context where scripts could execute, sanitize it with a library like ammonia:

```rust
use legible::parse;

let article = parse(html, Some(url), None)?;

// Sanitize before rendering
let safe_html = ammonia::clean(&article.content);
```

Legible implements the same algorithm as Readability.js:
1. **Document Preparation** - Removes scripts, normalizes markup, fixes lazy-loaded images
2. **Metadata Extraction** - Extracts title, byline, and other metadata from JSON-LD, OpenGraph tags, and meta elements
3. **Content Scoring** - Scores DOM nodes based on tag type, text density, and class/id patterns
4. **Candidate Selection** - Identifies the highest-scoring content container
5. **Content Cleaning** - Removes low-scoring elements, empty containers, and non-content markup
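The link-density idea behind the scoring and cleaning steps can be sketched as follows. This is an illustrative simplification, not the crate's actual scoring code: the real algorithm also weighs tag types, text density, and class/id patterns.

```rust
// Fraction of a node's text that sits inside links.
// Nav bars and link farms score near 1.0; article prose near 0.0.
fn link_density(text_len: usize, link_text_len: usize) -> f64 {
    if text_len == 0 {
        return 0.0;
    }
    link_text_len as f64 / text_len as f64
}

// A node's content score is scaled down as its link density rises.
fn adjusted_score(base_score: f64, text_len: usize, link_text_len: usize) -> f64 {
    base_score * (1.0 - link_density(text_len, link_text_len))
}

fn main() {
    // A nav-like node: almost all of its text is link text.
    println!("{:.2}", adjusted_score(10.0, 100, 90)); // low score
    // An article-like node: very little link text.
    println!("{:.2}", adjusted_score(10.0, 100, 5)); // high score
}
```

The `link_density_modifier` option above shifts where this threshold effectively sits, letting you tune how aggressively link-heavy blocks are discarded.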
The library is tested against Mozilla's official Readability.js test suite.
Apache-2.0