MCP Data Fetch Server is a secure, sandboxed server that fetches web content and extracts data via the Model Context Protocol (MCP), without executing JavaScript.
- Features
- Installation & Quick Start
- Command‑Line Options
- Integration with LM Studio
- MCP API Overview
- Available Tools
- Security Features
## Features

- Secure web page fetching – strips scripts, iframes and cookie banners; no JavaScript execution.
- Rich data extraction – retrieve links, metadata, Open Graph/Twitter cards, and downloadable resources.
- Safe file downloads – size limits, filename sanitisation, and path‑traversal protection within a sandboxed cache.
- Built‑in caching – optional cache directory reduces repeated network calls.
- Prompt‑injection detection – validates URLs and fetched content for malicious instructions.
## Installation & Quick Start

```bash
# Clone the repository (or copy the MCPDataFetchServer.1 folder)
git clone https://github.com/undici77/MCPDataFetchServer.git
cd MCPDataFetchServer

# Make the startup script executable
chmod +x run.sh

# Run the server, pointing to a sandboxed working directory
./run.sh -d /path/to/working/directory
```

📌 Three‑step overview

1️⃣ The script creates a virtual environment and installs dependencies.
2️⃣ It prepares a cache folder (`.fetch_cache`) inside the project root.
3️⃣ `main.py` launches the MCP server, listening on stdin/stdout for JSON‑RPC requests.
## Command‑Line Options

| Option | Description |
|---|---|
| `-d, --working-dir` | Path to the sandboxed working directory where all file operations are confined (default: `~/.mcp_datafetch`). |
| `-c, --cache-dir` | Name of the cache subdirectory relative to the working directory (default: `cache`). |
| `-h, --help` | Show help message and exit. |
## Integration with LM Studio

Add an entry to your `mcp.json` configuration so that LM Studio can launch the server automatically.
```json
{
  "mcpServers": {
    "datafetch": {
      "command": "/absolute/path/to/MCPDataFetchServer.1/run.sh",
      "args": [
        "-d",
        "/absolute/path/to/working/directory"
      ],
      "env": { "WORKING_DIR": "." }
    }
  }
}
```

📌 Tip: Ensure `run.sh` is executable (`chmod +x …`) and that the virtual environment can install the required Python packages on first launch.
## MCP API Overview

All communication follows JSON‑RPC 2.0 over stdin/stdout.
Request:
```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "initialize",
  "params": {}
}
```

The response contains the protocol version, server capabilities and basic metadata (e.g., name = `mcp-datafetch-server`, version = `2.1.0`).
Request:
```json
{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "tools/list",
  "params": {}
}
```

Response: `{ "tools": [ …tool definitions… ] }`. Each definition includes `name`, `description` and an input schema (JSON Schema).
Generic request shape (replace `<tool_name>` and `arguments` as needed):

```json
{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "tools/call",
  "params": {
    "name": "<tool_name>",
    "arguments": { … }
  }
}
```

The server validates the request against the tool’s schema, executes the operation, and returns a `ToolResult` containing one or more content blocks.
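The request/response cycle above can be exercised from any client language. As a minimal sketch – the helper name and the newline-delimited framing are assumptions of this example, not part of the server's documented API – a request line can be built like this in Python:

```python
import json

def jsonrpc_request(req_id, method, params=None):
    """Serialise a JSON-RPC 2.0 request as one newline-terminated line
    (newline-delimited stdio framing is an assumption of this sketch)."""
    message = {"jsonrpc": "2.0", "id": req_id, "method": method,
               "params": params or {}}
    return json.dumps(message) + "\n"

# Example: a tools/call request built programmatically.
line = jsonrpc_request(3, "tools/call", {
    "name": "fetch_webpage",
    "arguments": {"url": "https://example.com/article"},
})
```

Writing the returned line to the server process's stdin (e.g. via `subprocess.Popen`) and reading one line back would complete the round trip.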
## Available Tools

### fetch_webpage

Securely fetches a web page and returns clean content in the requested format.
| Name | Type | Required | Description |
|---|---|---|---|
| `url` | string | ✅ (no default) | URL to fetch (http/https only). |
| `format` | string | ❌ (default: `markdown`) | Output format – one of `markdown`, `text`, or `html`. |
| `include_links` | boolean | ❌ (default: `true`) | Whether to append an extracted links list. |
| `include_images` | boolean | ❌ (default: `false`) | Whether to list image URLs in the output. |
| `remove_banners` | boolean | ❌ (default: `true`) | Attempt to strip cookie banners & pop‑ups. |
Example
```json
{
  "jsonrpc": "2.0",
  "id": 10,
  "method": "tools/call",
  "params": {
    "name": "fetch_webpage",
    "arguments": {
      "url": "https://example.com/article",
      "format": "markdown",
      "include_links": true,
      "include_images": false,
      "remove_banners": true
    }
  }
}
```

Note: The tool sanitises HTML, removes scripts/iframes, and checks for prompt‑injection patterns before returning content.
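The sanitisation step can be illustrated with the Python standard library. The sketch below is not the server's actual implementation – it only demonstrates the idea of dropping `<script>`/`<iframe>` subtrees and stripping `on*` event-handler attributes:

```python
from html.parser import HTMLParser

BLOCKED = {"script", "iframe"}  # subtrees removed entirely

class Sanitizer(HTMLParser):
    """Drops <script>/<iframe> subtrees and on* attributes (sketch only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # >0 while inside a blocked subtree

    def handle_starttag(self, tag, attrs):
        if tag in BLOCKED:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        # Drop event-handler attributes such as onclick, onload, …
        safe = [(k, v) for k, v in attrs if not k.lower().startswith("on")]
        rendered = "".join(
            f' {k}="{v}"' if v is not None else f" {k}" for k, v in safe
        )
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in BLOCKED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

def sanitize(html):
    parser = Sanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

A real pipeline would also handle iframes nested in scripts, style attributes and banner heuristics; this shows only the structural part.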
### extract_links

Extracts and categorises all hyperlinks from a page.
| Name | Type | Required | Description |
|---|---|---|---|
| `url` | string | ✅ (no default) | URL of the page to analyse. |
| `filter` | string | ❌ (default: `all`) | Return only `all`, `internal`, `external`, or `resources`. |
Example
```json
{
  "jsonrpc": "2.0",
  "id": 11,
  "method": "tools/call",
  "params": {
    "name": "extract_links",
    "arguments": {
      "url": "https://example.com/blog",
      "filter": "internal"
    }
  }
}
```

Note: Links are classified as internal (same domain) or external; resource links (images, PDFs…) can be filtered with `resources`.
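The classification rule can be sketched with `urllib.parse`. The function and the extension list below are illustrative assumptions, not the server's code (its real resource-extension list is presumably longer):

```python
from urllib.parse import urljoin, urlparse

# Illustrative resource extensions only.
RESOURCE_EXTS = (".pdf", ".png", ".jpg", ".jpeg", ".gif", ".zip")

def classify_link(page_url, link):
    """Classify a hyperlink the way the filter parameter suggests."""
    absolute = urljoin(page_url, link)          # resolve relative links
    parsed = urlparse(absolute)
    if parsed.path.lower().endswith(RESOURCE_EXTS):
        return "resources"
    if parsed.netloc == urlparse(page_url).netloc:
        return "internal"
    return "external"
```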
### download_file

Safely downloads a remote file into the sandboxed cache directory.
| Name | Type | Required | Description |
|---|---|---|---|
| `url` | string | ✅ (no default) | Direct URL to the file. |
| `filename` | string | ❌ (auto‑generated) | Desired filename; will be sanitised and forced into the cache directory. |
Example
```json
{
  "jsonrpc": "2.0",
  "id": 12,
  "method": "tools/call",
  "params": {
    "name": "download_file",
    "arguments": {
      "url": "https://example.com/files/report.pdf",
      "filename": "report_latest.pdf"
    }
  }
}
```

Note: The server enforces a 100 MB download limit, validates the URL against blocked domains/extensions, and returns the relative path inside the working directory for cross‑agent access.
### get_page_metadata

Extracts structured metadata (title, description, Open Graph, Twitter Cards) from a web page.
| Name | Type | Required | Description |
|---|---|---|---|
| `url` | string | ✅ (no default) | URL of the page to inspect. |
Example
```json
{
  "jsonrpc": "2.0",
  "id": 13,
  "method": "tools/call",
  "params": {
    "name": "get_page_metadata",
    "arguments": { "url": "https://example.com/product/42" }
  }
}
```

Note: The tool returns a formatted text block with title, description, keywords, Open Graph properties and Twitter Card fields.
### check_url

Performs a lightweight HEAD request to report status code, headers and size without downloading the body.
| Name | Type | Required | Description |
|---|---|---|---|
| `url` | string | ✅ (no default) | URL to probe. |
Example
```json
{
  "jsonrpc": "2.0",
  "id": 14,
  "method": "tools/call",
  "params": {
    "name": "check_url",
    "arguments": { "url": "https://example.com/resource.zip" }
  }
}
```

Note: The response includes the final URL after redirects, a concise status summary (e.g., ✅ OK), and the Content‑Type and Content‑Length headers.
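The shape of that summary can be sketched as follows; the function name and exact format string are assumptions for illustration, not the server's real output:

```python
def head_summary(final_url, status, headers):
    """Render a check_url-style one-line summary (illustrative format)."""
    mark = "✅ OK" if 200 <= status < 400 else "❌ Error"
    ctype = headers.get("Content-Type", "unknown")
    clen = headers.get("Content-Length", "unknown")
    return f"{mark} ({status}) {final_url} | {ctype}, {clen} bytes"

# Example with hypothetical HEAD-response values:
summary = head_summary("https://example.com/resource.zip", 200,
                       {"Content-Type": "application/zip",
                        "Content-Length": "1048576"})
```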
## Security Features

- Path‑traversal protection – all file operations are confined to the sandboxed working directory.
- Prompt‑injection detection in URLs, fetched HTML and generated content.
- Blocked domains & extensions (localhost, private IP ranges, executable/script files).
- Content‑size limits – max 50 MB for page fetches, max 100 MB for file downloads.
- HTML sanitisation – removes `<script>`, `<iframe>`, event handlers and other risky elements before processing.
- Cookie/banner handling – optional removal of consent banners and pop‑ups during fetch.
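Prompt‑injection detection is typically pattern-based; the two regexes below are illustrative examples only, not the server's actual rule set:

```python
import re

# Illustrative patterns only – real detection covers many more cases.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.IGNORECASE),
    re.compile(r"disregard\s+the\s+system\s+prompt", re.IGNORECASE),
]

def looks_like_injection(text):
    """Return True if any suspicious instruction pattern appears in text."""
    return any(pattern.search(text) for pattern in INJECTION_PATTERNS)
```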
© 2025 Undici77 – All rights reserved.