
@yawlabs/fetch-mcp

A comprehensive HTTP fetch MCP server for AI assistants. Bring-your-own client: runs as a stdio MCP server so any MCP-compatible client (Claude Code, Claude Desktop, Cursor, mcph, …) can fetch web content safely.

Add to mcp.hosting

One click adds this to your mcp.hosting account so it syncs to every MCP client you use. Or install manually below.

What it gives the model

Tool What it does
http_get / http_head / http_options Bare HTTP requests with headers, auth, timeout, size cap, retry
http_post / http_put / http_patch / http_delete Write-method HTTP with JSON or raw body
fetch_html_to_markdown GET a page and convert to clean markdown (3–8× smaller than raw HTML)
fetch_html_to_text GET a page and convert to plain text with block structure preserved
fetch_reader Reader-mode extraction — isolates the article body and returns title + markdown
fetch_meta Extract <head> metadata: title, description, OpenGraph, Twitter cards, JSON-LD, feeds, icons
fetch_links Extract every outbound link, resolved to absolute URLs, classified internal/external
fetch_sitemap Parse sitemap.xml (including gzipped and sitemap-index chaining)
fetch_feed Parse an RSS 2.0 or Atom 1.0 feed into entries
fetch_robots Parse a site's robots.txt, return the verdict for a given path & user-agent

Safety

SSRF protection is on by default. The server refuses requests to:

  • Loopback (127.0.0.0/8, ::1)
  • RFC1918 private ranges (10/8, 172.16/12, 192.168/16)
  • Link-local (169.254/16, fe80::/10) — including the cloud metadata endpoint 169.254.169.254
  • CGNAT (100.64/10)
  • Unique-local IPv6 (fc00::/7)
  • Multicast / broadcast
  • IPv4-mapped IPv6 (::ffff:0:0/96) re-checked against the IPv4 rules
  • Non-http/https schemes (file://, gopher://, javascript:, …)
  • Hostname localhost and any *.localhost

DNS is resolved once per redirect hop, every returned address is checked, and the verified IP is pinned into the HTTP dispatcher so the subsequent TCP connection dials that exact address — closing the DNS-rebinding TOCTOU window. Authorization headers are stripped on cross-origin redirects. A 302 to http://127.0.0.1 through a public host gets caught. Set allow_private_hosts: true per-request when you really do need internal access (e.g. development).
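
Conceptually, the per-address check works like the sketch below. This is illustrative only, not the server's actual code, and it covers just the IPv4 half of the rules above; the shipped implementation also handles IPv6, IPv4-mapped addresses, and the hostname rules.

// Sketch of an IPv4 deny-list check (illustrative, not the shipped implementation)
function isBlockedIPv4(ip: string): boolean {
  const [a, b] = ip.split(".").map(Number);
  if (a === 127) return true;                         // loopback 127.0.0.0/8
  if (a === 10) return true;                          // RFC1918 10/8
  if (a === 172 && b >= 16 && b <= 31) return true;   // RFC1918 172.16/12
  if (a === 192 && b === 168) return true;            // RFC1918 192.168/16
  if (a === 169 && b === 254) return true;            // link-local, incl. 169.254.169.254
  if (a === 100 && b >= 64 && b <= 127) return true;  // CGNAT 100.64/10
  if (a >= 224) return true;                          // multicast / reserved / broadcast
  return false;
}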

Install & run

# One-off
npx -y @yawlabs/fetch-mcp

# Or globally
npm i -g @yawlabs/fetch-mcp
fetch-mcp

Requires Node ≥20.

Configure in Claude Code / Claude Desktop

Add to your client's MCP config (usually claude_desktop_config.json or ~/.claude.json):

{
  "mcpServers": {
    "fetch": {
      "command": "npx",
      "args": ["-y", "@yawlabs/fetch-mcp"]
    }
  }
}

Or via mcph:

mcph add fetch

Tool reference

http_get, http_post, http_put, http_patch, http_delete, http_head, http_options

Common parameters:

Field Type Default Meaning
url string Absolute URL
headers object Custom request headers
timeout_ms int 10000 Request timeout
max_bytes int 5242880 (5 MiB) Truncate body if larger
max_redirects int 5 Redirect hops allowed
retries int 0 Retry count on 408/425/429/5xx with backoff (honors Retry-After)
user_agent string @yawlabs/fetch-mcp/<v> User-Agent override
basic_auth {username,password} Injects Authorization: Basic …
bearer_token string Injects Authorization: Bearer …
allow_private_hosts bool false Bypass SSRF block
decode_text bool auto When unset, auto-detects by response Content-Type (text for text/*, JSON, XML, JS, form-urlencoded; binary otherwise). Set explicitly true to force text decoding, false to force base64 in body_base64.

Body-capable tools (POST/PUT/PATCH/DELETE) also take:

Field Type Meaning
body string Raw request body
body_json any Structured body — encoded as JSON, Content-Type: application/json set automatically
content_type string Overrides Content-Type
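
As an illustration, a JSON POST with a bearer token could be invoked with arguments along these lines (endpoint and token are placeholders; all fields come from the tables above):

{
  "url": "https://api.example.com/v1/items",
  "bearer_token": "<token>",
  "body_json": { "name": "example", "tags": ["docs"] },
  "timeout_ms": 15000,
  "retries": 2
}

With retries: 2, a 408/425/429/5xx response is retried with backoff, honoring Retry-After as noted above.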

Response shape:

{
  ok: boolean;
  status: number;
  statusText: string;
  url: string;              // final URL after redirects
  headers: Record<string,string>;
  body_text?: string;
  body_base64?: string;     // when decode_text=false
  json?: unknown;           // auto-parsed when response is application/json
  truncated?: boolean;      // set when max_bytes hit
  redirects?: string[];     // chain of intermediate URLs
  duration_ms: number;
  error?: string;
}

fetch_html_to_markdown

GET the URL, strip scripts/styles/iframes/svg/canvas plus <nav>, <footer>, <aside>, convert to atx-headed markdown with fenced code blocks and dash bullets. Intended for feeding pages into an LLM without blowing the context budget.
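
A minimal call only needs the URL. The example below also passes size and time limits on the assumption that the fetch_* tools accept the same shared parameters as the http_* tools (names taken from the table above; check the tool schema if in doubt):

{
  "url": "https://example.com/blog/some-post",
  "max_bytes": 2097152,
  "timeout_ms": 10000
}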

fetch_html_to_text

Same fetch, but emits plain text with block-level structure preserved as newlines. Useful when the model doesn't need markdown formatting.

fetch_reader

Isolates the main article body using, in order: <article>, <main>, itemprop="articleBody", common CMS class names (post-content, entry-content, etc.), then <body> as fallback. Returns:

{
  url: string;          // final URL after redirects
  title?: string;       // og:title, then <title>, then <h1>
  byline?: string;      // meta[name=author] / article:author
  wordCount: number;
  markdown: string;     // main content converted to markdown
}

fetch_meta

GET a URL and return its head metadata without downloading the full body (caps at 2 MiB by default):

{
  url: string;
  title?: string;
  description?: string;
  canonical?: string;
  language?: string;
  robots?: string;
  og:      Record<string, string>;       // first value per key (og:title, og:image, og:type, ...)
  twitter: Record<string, string>;       // first value per key
  article: Record<string, string>;       // first value per key
  ogAll:      Record<string, string[]>;  // only keys that appear > once (e.g. multiple og:image)
  twitterAll: Record<string, string[]>;
  articleAll: Record<string, string[]>;
  icons:   Array<{ rel: string; href: string; sizes?: string }>;
  feeds:   Array<{ href: string; title?: string; type?: string }>;   // RSS/Atom
  jsonLd:  unknown[];                    // parsed application/ld+json blocks
}

fetch_links

GET a page and return every <a href> with text, resolved to absolute URLs. Respects <base href>. Skips #, javascript:, mailto:, tel:, data:, file:. Each link is classified internal or external vs. the page host. Optional filter/dedupe/limit.
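
An illustrative call using those options; the option names below mirror the description above, so consult the tool's schema for their exact spelling and semantics:

{
  "url": "https://example.com/docs/",
  "filter": "guide",
  "dedupe": true,
  "limit": 100
}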

fetch_sitemap

Fetch a sitemap.xml or sitemap-index and return the URL list:

{
  sitemaps: string[];       // indexes followed, in order
  urlCount: number;
  truncated: boolean;       // hit max_urls
  urls: Array<{
    loc: string;
    lastmod?: string;
    changefreq?: string;
    priority?: number;
  }>;
}

Gzipped .xml.gz sitemaps are auto-decompressed. max_depth controls how many levels of sitemap-index to follow (default 1). Setting max_depth: 0 on a sitemap-index returns the index's childSitemaps list without fetching any child (useful to discover structure cheaply). Partial failures — one child sitemap 500s while others succeed — are returned under warnings rather than aborting the whole call.
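
For example, to cheaply map a large site you might first call (illustrative values):

{
  "url": "https://example.com/sitemap.xml",
  "max_depth": 0
}

On a sitemap-index this returns the childSitemaps list without fetching any child; a follow-up call with max_depth: 1 and a max_urls cap then pulls only the URLs you actually need.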

fetch_feed

Parse an RSS 2.0 or Atom 1.0 feed:

{
  kind: "rss" | "atom" | "unknown";
  title?: string;
  description?: string;
  link?: string;
  updated?: string;
  entryCount: number;
  truncated: boolean;       // hit limit
  entries: Array<{
    title?: string;
    link?: string;
    id?: string;
    published?: string;
    updated?: string;
    author?: string;
    summary?: string;
    content?: string;
    categories?: string[];
  }>;
}
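
An illustrative call; the truncated flag above implies an entry cap, named limit here as an assumption:

{
  "url": "https://example.com/feed.xml",
  "limit": 20
}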

fetch_robots

Fetches <origin>/robots.txt, parses it, and returns:

{
  robotsUrl: string;
  status: number;
  userAgent: string;
  path: string;
  allowed: boolean;
  matchedRule: string | null;  // the longest-match Allow/Disallow that decided it
  crawlDelay: number | null;   // from the matched group
  sitemaps: string[];          // top-level Sitemap: declarations
  rawRobotsText: string;       // first 512KB
}

The parser follows Google's rules: the longest matching rule wins, * matches any sequence of characters, $ anchors the end of the path, and a specific user-agent group beats the * group when the UA matches (specificity is compared by the length of the actually-matched agent token, not by the group's first listed agent). Allow beats Disallow on equal-length ties.
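
As a worked example, suppose the * group of a site's robots.txt contains Disallow: /private/ and Allow: /private/public-page$. Checking the path /private/public-page resolves to the Allow rule, because its pattern is longer than the Disallow pattern. The verdict would look roughly like this (trimmed; exact matchedRule formatting may differ):

{
  allowed: true,
  matchedRule: "Allow: /private/public-page$",
  crawlDelay: null
}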

Development

npm install
npm run build
npm test         # vitest
npm run lint     # biome
npm run typecheck

Tests spin up a local loopback HTTP server on 127.0.0.1:0 to exercise the real request/response path — no mocking of HTTP. SSRF tests verify that the default-deny still applies to that local server unless the request opts into allow_private_hosts.

License

MIT © Yaw Labs
