
Marklift

URL → Clean Markdown — Fetch a webpage, extract the main content, and convert it to LLM-friendly Markdown. Built for agents and pipelines.

  • Fetches HTTP(S) URLs with configurable timeout and headers
  • Source types: website, twitter (via Nitter), and reddit — inferred from the URL when not specified. The Medium adapter has been removed for now.
  • Extracts article content with Mozilla Readability (or raw body)
  • Converts to Markdown with Turndown and custom rules
  • Optimizes for agents: normalizes spacing, dedupes links, strips tracking params, optional chunking
  • Typed API and CLI
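The link cleanup mentioned above (deduplication plus tracking-param stripping) can be sketched with Node's `URL` API. This is an illustration of the idea, not Marklift's actual implementation, and the list of stripped parameters is an assumption:

```typescript
// Sketch: strip common tracking params and dedupe links.
// The exact parameter list Marklift strips is an assumption here.
const TRACKING_PARAMS = ["utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"];

function cleanLink(href: string): string {
  const url = new URL(href);
  for (const p of TRACKING_PARAMS) url.searchParams.delete(p);
  return url.toString();
}

function dedupeLinks(hrefs: string[]): string[] {
  // Cleaning first means tracked and untracked copies collapse to one entry.
  return [...new Set(hrefs.map(cleanLink))].sort();
}
```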

Requirements: Node.js 18+


The Web Is Not LLM-Ready — Raw HTML is noisy, heavy with tracking junk, inconsistent across sites, and expensive in tokens.

Install

npm install marklift

Usage

Programmatic

import { urlToMarkdown } from "marklift";

// source is inferred from URL when omitted (twitter/x.com → twitter, reddit → reddit, else website)
const result = await urlToMarkdown("https://example.com/article", {
  timeout: 10_000,
});
const tweet = await urlToMarkdown("https://x.com/user/status/123"); // uses twitter adapter

console.log(result.title);
console.log(result.markdown);
console.log(result.wordCount, result.sections.length, result.links.length);

CLI

# Install globally to get the `marklift` command
npm install -g marklift

# Convert a URL to Markdown (prints to stdout). Source is inferred from URL.
marklift https://example.com
marklift https://x.com/user/status/123   # uses twitter adapter
marklift https://reddit.com/r/...         # uses reddit adapter

# Output full result as JSON
marklift https://example.com --json

# Options
marklift https://example.com --timeout 15000
marklift https://example.com --chunk-size 2000
marklift https://example.com --source website   # override inferred source

CLI options:

| Option | Description |
| --- | --- |
| `--source <website\|twitter\|reddit>` | Source adapter (default: inferred from URL). Override when needed. |
| `--timeout <ms>` | Request timeout in milliseconds (default: 15000) |
| `--chunk-size <n>` | Split markdown into chunks of ~n characters |
| `--json` | Output full result as JSON instead of markdown |

API

urlToMarkdown(url, options?)

Converts a URL to clean Markdown. Returns a Promise<MarkdownResult>.

Options:

| Option | Type | Description |
| --- | --- | --- |
| `source` | `"website" \| "twitter" \| "reddit"` | Source adapter. Default: inferred from URL (`twitter.com`/`x.com`/Nitter → twitter, `reddit.com` → reddit, else website). Override to force a specific adapter. |
| `timeout` | `number` | Request timeout in ms (default: 15000) |
| `headers` | `Record<string, string>` | Custom HTTP headers (e.g. `User-Agent`) |
| `chunkSize` | `number` | If set, `result.chunks` will contain token-safe chunks |
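The chunking behavior can be approximated as a split on paragraph boundaries. This simplified sketch is not the library's actual algorithm — Marklift's chunker also avoids splitting inside code blocks and tables, which this version does not:

```typescript
// Sketch: split markdown into ~chunkSize-character chunks on paragraph
// boundaries. Marklift's real chunker also keeps code blocks and tables
// intact; this simplified version does not.
function chunkMarkdown(markdown: string, chunkSize: number): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const para of markdown.split(/\n\n+/)) {
    // Start a new chunk when appending this paragraph would exceed the budget.
    if (current && current.length + para.length + 2 > chunkSize) {
      chunks.push(current);
      current = para;
    } else {
      current = current ? current + "\n\n" + para : para;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```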

Result (MarkdownResult):

  • url — Original URL
  • title — Page title
  • description — Meta description (if present)
  • markdown — Full markdown with source-specific frontmatter (see below) + body
  • sections — { heading, content }[], split by heading (stable order)
  • links — Deduplicated links, sorted (tracking params stripped)
  • wordCount — Approximate word count
  • contentHash — SHA-256 of optimized markdown (stability checks)
  • metadata? — Structured metadata (OG, canonical, author, publishedAt, image, language)
  • chunks? — When chunkSize is set: { content, index, total }[] (no split inside code blocks or tables)
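The contentHash field can be recomputed for stability checks. Assuming it is the hex-encoded SHA-256 of the optimized markdown string (UTF-8), which matches the description above, a change check might look like:

```typescript
import { createHash } from "node:crypto";

// Recompute the hash of a markdown string to detect content changes
// between fetches. Assumes contentHash is hex-encoded SHA-256 of the
// UTF-8 markdown, as described above.
function markdownHash(markdown: string): string {
  return createHash("sha256").update(markdown, "utf8").digest("hex");
}

function hasChanged(markdown: string, previousHash: string): boolean {
  return markdownHash(markdown) !== previousHash;
}
```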

urlToMarkdownStream(url, options?)

Async generator that yields MarkdownChunk (meta, sections, links) as they are produced. Useful for streaming into an LLM or pipeline.

Markdown format (per source)

Each adapter outputs markdown with a frontmatter block (delimited by `---` lines) followed by the body.

Website (and reddit) — format type: website. Medium is not currently supported.

---
source: https://example.com/article
canonical: https://example.com/article
title: Example Article Title
description: Short meta description
author: John Doe
published_at: 2025-01-12
language: en
content_hash: <sha256>
word_count: 1243
---
# Title

Body content…

Twitter:

---
platform: twitter
source: https://twitter.com/username/status/1234567890
tweet_id: 1234567890
author:
  name: Author Name
published_at: 2025-01-10T18:22:00Z
language: en
content_hash: <sha256>
---
Body content…
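Consumers can separate the frontmatter block from the body with simple string handling. A minimal sketch, assuming the document starts with a `---`-delimited block as in the examples above:

```typescript
// Split a Marklift markdown result into its frontmatter block and body.
// Assumes the document starts with a "---" line, as in the examples above.
function splitFrontmatter(markdown: string): { frontmatter: string; body: string } {
  const match = markdown.match(/^---\n([\s\S]*?)\n---\n?/);
  if (!match) return { frontmatter: "", body: markdown };
  return { frontmatter: match[1], body: markdown.slice(match[0].length) };
}
```

The frontmatter is YAML-shaped, so it can be fed to any YAML parser if structured access to fields like `content_hash` is needed.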

Errors

  • InvalidUrlError — Invalid or non-HTTP(S) URL
  • FetchError — Network error, timeout, or non-2xx response
  • ParseError — Readability or parsing failure
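The kind of check behind InvalidUrlError can be sketched as follows — an illustration of the validation rule, not Marklift's actual code:

```typescript
// Sketch of the validation that would raise InvalidUrlError:
// the input must parse as a URL and use an HTTP(S) scheme.
function isHttpUrl(input: string): boolean {
  try {
    const url = new URL(input);
    return url.protocol === "http:" || url.protocol === "https:";
  } catch {
    // new URL() throws on strings that are not URLs at all.
    return false;
  }
}
```

In pipelines, catching FetchError separately from ParseError lets you retry transient network failures while skipping pages that simply cannot be parsed.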

Production note: The website and reddit adapters use a browser-like User-Agent by default, so requests from servers and datacenters receive full HTML. The Twitter adapter keeps the Marklift User-Agent so that Nitter works. Override either via the headers option if needed.


Example

import { urlToMarkdown, urlToMarkdownStream } from "marklift";

// One-shot (source inferred from URL)
const result = await urlToMarkdown("https://blog.example.com/post", {
  timeout: 10_000,
  chunkSize: 2000,
});
console.log(result.title, result.wordCount);
if (result.chunks) {
  for (const chunk of result.chunks) {
    // Send chunk to LLM, etc.
  }
}

// Streaming
for await (const chunk of urlToMarkdownStream(
  "https://blog.example.com/post"
)) {
  process.stdout.write(chunk.content);
}

Testing

npm test          # unit + E2E (E2E needs network)
npm run test:unit # unit only (no network)
npm run test:e2e  # E2E with real URLs only

Set SKIP_E2E=1 to skip E2E tests (e.g. in CI without network).


Contributing

Contributions are welcome. See CONTRIBUTING.md for setup, code style, and how to submit changes.


License

MIT
