A concurrent web crawler built around one idea: different compute profiles need different concurrency models.
Most crawlers treat concurrency as an afterthought. Weave treats it as the architecture.
A single-threaded crawler is slow. A naive async crawler breaks on CPU-heavy parsing. A threaded crawler hits the GIL and stalls on content extraction. Most implementations pick one model and suffer the tradeoffs silently.
Weave doesn't. It maps each stage of crawling to the concurrency primitive that actually fits:
| Stage | Bottleneck | Model | Why |
|---|---|---|---|
| Fetching URLs | Network I/O | asyncio + aiohttp | Await hundreds of requests concurrently without threads |
| Parsing HTML | CPU | multiprocessing | GIL blocks threads here; separate processes bypass it entirely |
| Frontier management | Coordination | threading | Shared state across producers/consumers, Lock on visited set |
This isn't theoretical. Each decision came from profiling. Threading the parser gave no speedup. ProcessPoolExecutor cut parse time by ~60%.
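A minimal sketch of how the parse stage can stay executor-agnostic, so the same code can be profiled on threads and on processes. It is not Weave's actual parser: `LinkCollector`, `extract_links_from_html`, `parse_batch`, and `make_executor` are hypothetical names, and the stdlib `html.parser` stands in for BeautifulSoup/lxml.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags (stdlib stand-in for a real parser)."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links_from_html(html: str) -> list[str]:
    """CPU-bound parse step. Module-level so a process pool can pickle it."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def parse_batch(pages: list[str], executor: Executor) -> list[list[str]]:
    """Parse every page on the given executor, preserving input order."""
    return list(executor.map(extract_links_from_html, pages))


def make_executor(kind: str, workers: int) -> Executor:
    """Pick the pool type by name: 'thread' or 'process'."""
    if kind == "process":
        return ProcessPoolExecutor(max_workers=workers)
    if kind == "thread":
        return ThreadPoolExecutor(max_workers=workers)
    raise ValueError(f"unknown executor kind: {kind!r}")
```

Because `parse_batch` only depends on the `Executor` interface, timing it with `make_executor("thread", n)` and `make_executor("process", n)` is a one-word change. When using the process variant, create the pool under an `if __name__ == "__main__":` guard, which is required on platforms that spawn worker processes.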
Seed URLs
↓
[Frontier Queue] asyncio.Queue — bounded, provides backpressure
↓
[Async Fetcher Pool] aiohttp + asyncio.Semaphore — capped concurrent connections
↓
[Raw HTML Queue] multiprocessing.Queue — crosses process boundary
↓
[Parser Workers] ProcessPoolExecutor — true parallelism, GIL bypassed
↓
[Result Store] SQLite / Redis — crawled content + metadata
Bounded queues at every boundary. Fast fetchers can't overwhelm slow parsers. Slow parsers don't starve the frontier.
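The backpressure idea can be sketched with two bounded `asyncio.Queue` instances and a semaphore. This is an illustration under simplifying assumptions, not Weave's pipeline: the fetch and parse steps are simulated with `asyncio.sleep`, everything runs in one process, and all function names are hypothetical.

```python
import asyncio


async def fetch(url: str, limit: asyncio.Semaphore) -> str:
    """Stand-in for an aiohttp request; the semaphore caps in-flight fetches."""
    async with limit:
        await asyncio.sleep(0.001)  # simulated network latency
        return f"<html>{url}</html>"


async def fetcher(frontier, raw_html, limit, peak) -> None:
    while True:
        url = await frontier.get()
        html = await fetch(url, limit)
        await raw_html.put(html)  # suspends here while the parser side is full
        peak["depth"] = max(peak["depth"], raw_html.qsize())
        frontier.task_done()


async def slow_parser(raw_html, parsed) -> None:
    while True:
        html = await raw_html.get()
        await asyncio.sleep(0.005)  # simulated CPU-heavy parse
        parsed.append(html)
        raw_html.task_done()


async def run_pipeline(urls, queue_size: int = 2, fetchers: int = 4):
    """Returns (parsed pages, deepest raw-HTML queue depth observed)."""
    frontier = asyncio.Queue(maxsize=queue_size)
    raw_html = asyncio.Queue(maxsize=queue_size)
    limit = asyncio.Semaphore(fetchers)
    parsed: list[str] = []
    peak = {"depth": 0}

    workers = [
        asyncio.create_task(fetcher(frontier, raw_html, limit, peak))
        for _ in range(fetchers)
    ]
    workers.append(asyncio.create_task(slow_parser(raw_html, parsed)))

    for url in urls:
        await frontier.put(url)  # seeding also waits when the frontier is full

    await frontier.join()  # every URL fetched and handed to the parser queue
    await raw_html.join()  # every page parsed
    for worker in workers:
        worker.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return parsed, peak["depth"]
```

However fast the fetchers run, the raw-HTML queue never holds more than `queue_size` pages: a fetcher that finishes early simply waits in `put()` until the parser catches up.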
Concurrency
- Async fetcher pool with per-domain connection limits via `asyncio.Semaphore`
- `ProcessPoolExecutor` + `ThreadPoolExecutor` hybrid: right tool for each stage
- Thread-safe URL deduplication with `threading.Lock`
- Backpressure via bounded queues between every layer (`asyncio.Queue` in-process, `multiprocessing.Queue` across the process boundary)
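Two of the features above are small enough to sketch directly: lock-guarded deduplication and per-domain semaphores. The class names `VisitedSet` and `DomainLimiter` are hypothetical and chosen for illustration; Weave's own types may look different.

```python
import asyncio
import threading
from collections import defaultdict
from urllib.parse import urlsplit


class VisitedSet:
    """URL dedup that is safe to call from several threads at once."""

    def __init__(self) -> None:
        self._seen: set[str] = set()
        self._lock = threading.Lock()

    def add_if_new(self, url: str) -> bool:
        """Return True exactly once per URL, even under contention."""
        with self._lock:  # check and insert must happen as one atomic step
            if url in self._seen:
                return False
            self._seen.add(url)
            return True

    def __len__(self) -> int:
        with self._lock:
            return len(self._seen)


class DomainLimiter:
    """Hands out one asyncio.Semaphore per domain, created on first use."""

    def __init__(self, per_domain: int) -> None:
        self._semaphores = defaultdict(lambda: asyncio.Semaphore(per_domain))

    def for_url(self, url: str) -> asyncio.Semaphore:
        return self._semaphores[urlsplit(url).netloc]
```

A fetcher would wrap each request in `async with limiter.for_url(url):` so that one slow host cannot consume every connection slot, and would call `visited.add_if_new(link)` before enqueueing a discovered link.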
Crawl control
- Configurable depth, domain scope, and concurrency limits
- Per-domain politeness delays and `robots.txt` compliance
- BFS / DFS / priority queue frontier strategies
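As a rough illustration of the politeness and `robots.txt` items, the stdlib `urllib.robotparser` already covers rule parsing and `Crawl-delay`. The `USER_AGENT` value, `load_rules`, and `PolitenessGate` below are assumptions made for this sketch, and the rules are parsed from a string rather than fetched over the network.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "weave-example"  # hypothetical user agent string


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


class PolitenessGate:
    """Tracks, per domain, the earliest time the next request may start."""

    def __init__(self, default_delay: float = 1.0) -> None:
        self._default_delay = default_delay
        self._next_allowed: dict[str, float] = {}

    def seconds_to_wait(self, url: str, rules: RobotFileParser, now=None) -> float:
        """Reserve the next slot for this URL's domain and return the wait."""
        now = time.monotonic() if now is None else now
        domain = urlsplit(url).netloc
        # Prefer the site's Crawl-delay; fall back to the configured default.
        delay = rules.crawl_delay(USER_AGENT) or self._default_delay
        wait = max(0.0, self._next_allowed.get(domain, now) - now)
        self._next_allowed[domain] = now + wait + delay
        return wait
```

A fetcher would first check `rules.can_fetch(USER_AGENT, url)`, skip the URL if that returns `False`, and otherwise `await asyncio.sleep(gate.seconds_to_wait(url, rules))` before issuing the request.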
Observability
- Live crawl dashboard via `rich`: pages/sec, error rate, queue depths
- Per-domain stats exported to JSON
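The per-domain JSON export could look something like the sketch below. The field names (`pages`, `errors`, `bytes_fetched`) and the `CrawlStats` class are illustrative assumptions, not Weave's actual schema.

```python
import json
from collections import defaultdict
from dataclasses import asdict, dataclass
from urllib.parse import urlsplit


@dataclass
class DomainStats:
    pages: int = 0
    errors: int = 0
    bytes_fetched: int = 0


class CrawlStats:
    """Accumulates counters keyed by domain and serializes them to JSON."""

    def __init__(self) -> None:
        self._by_domain: dict[str, DomainStats] = defaultdict(DomainStats)

    def record(self, url: str, ok: bool, size: int = 0) -> None:
        stats = self._by_domain[urlsplit(url).netloc]
        if ok:
            stats.pages += 1
            stats.bytes_fetched += size
        else:
            stats.errors += 1

    def to_json(self) -> str:
        payload = {domain: asdict(s) for domain, s in self._by_domain.items()}
        return json.dumps(payload, indent=2, sort_keys=True)
```

The same counters can feed a live dashboard; only the rendering differs.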
MCP Server

Weave exposes its knowledge base as an MCP server, so you can query everything it's crawled directly through Claude:
- `crawl_url(url, depth)`: trigger a crawl
- `search_crawled(query)`: semantic search over crawled content
- `get_page_summary(url)`: summarize a crawled page
- `extract_links(url)`: outbound link graph for any crawled URL
git clone https://github.com/sagnikc395/weave
cd weave
pip install -e .
# Crawl a site
weave crawl https://example.com --depth 3 --concurrency 50
# Start the MCP server
weave serve --port 8000

I wanted to understand Python concurrency at the level where you make deliberate tradeoffs, not just "use asyncio for everything." Weave is the project that forced those decisions concretely.
The GIL isn't just a trivia answer here. It's the reason the parser layer is a separate process pool.
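- `aiohttp`: async HTTP client
- `BeautifulSoup` / `lxml`: HTML parsing
- `concurrent.futures`: unified executor interface
- `Redis`: frontier queue in distributed mode
- `SQLite` / `Postgres` + `pgvector`: result store + semantic search (`pgvector` is a Postgres extension, so semantic search needs the Postgres backend)
- `rich` / `textual`: live dashboard
- `mcp`: MCP server for Claude integration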
The deployment challenge with Weave
Weave is trickier to deploy than your other projects because it's inherently a long-running process, not a request/response API. You can't just push to Railway and get a URL. The crawler needs to keep running, the MCP server needs to stay alive, and the result store needs to persist.
Recommended approach: Split it into two deployable pieces
Piece 1: the MCP server + result store (always on)

This is what you actually deploy. It serves the crawled knowledge base and exposes the MCP endpoints. Stateless-ish, easy to host.
- Railway or Render for the FastAPI/MCP server
- Railway Postgres for the result store
- This gives you a live URL to put on your resume
Piece 2: the crawler itself (run on demand)

Don't try to deploy the crawler as a persistent service; it's expensive and unnecessary for a portfolio project. Instead:
- Run it locally to seed the database with crawled content
- Or expose a `/crawl` endpoint that triggers a background task via `asyncio` when hit
- That way the deployed server always has data to show
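The on-demand option can be sketched with plain `asyncio`, leaving out the web framework wiring. `CrawlJobs` and `fake_crawl` are hypothetical names; a real `/crawl` handler would call `jobs.start(...)` and return the job id immediately instead of waiting for the crawl.

```python
import asyncio
import uuid


async def fake_crawl(url: str, depth: int) -> int:
    """Stand-in for the real crawler; returns a pretend page count."""
    await asyncio.sleep(0.01)
    return depth + 1


class CrawlJobs:
    """Starts crawls as background tasks and reports their status."""

    def __init__(self, crawl) -> None:
        self._crawl = crawl  # async callable: (url, depth) -> result
        # Holding the Task objects here keeps strong references to them;
        # the event loop itself only keeps weak references to tasks.
        self._tasks: dict[str, asyncio.Task] = {}

    def start(self, url: str, depth: int) -> str:
        """Schedule a crawl on the running event loop and return its job id."""
        job_id = uuid.uuid4().hex
        self._tasks[job_id] = asyncio.create_task(self._crawl(url, depth))
        return job_id

    def status(self, job_id: str) -> str:
        task = self._tasks[job_id]
        if not task.done():
            return "running"
        if task.cancelled() or task.exception() is not None:
            return "failed"
        return "done"

    async def wait(self, job_id: str):
        return await self._tasks[job_id]
```

`start` must be called from code already running inside the event loop, which is the case for an async request handler. Background tasks die with the server process, so this suits a demo but not crawls that must survive a restart.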
Concrete stack:
Local machine Railway
────────────── ──────────────────────────
weave crawl ───────▶ Postgres (crawled content)
↑
FastAPI + MCP server
↑
Claude / any MCP client
Step by step:
- Build the crawler and MCP server locally first
- Dockerize both: one `Dockerfile` for the API, one for the worker
- Provision Postgres on Railway and run migrations
- Deploy the MCP/API server to Railway
- Run the crawler locally pointed at the Railway Postgres to seed it with real data
- Put the Railway URL in your resume
What to crawl to make the demo impressive
Don't crawl random sites. Crawl something with a coherent knowledge domain so the semantic search actually works well:
- Hacker News (crawl top posts + comments)
- A specific documentation site (FastAPI docs, Temporal docs)
- A curated list of engineering blogs
Then your demo becomes: "I crawled the Temporal.io docs and can query them through Claude" — that's a concrete, impressive demo in 10 seconds.
The resume line:
Deployed MCP server on Railway backed by Postgres — exposes crawled web content as a Claude-queryable knowledge base via semantic search
The crawler itself not being "deployed" doesn't matter — nobody expects a portfolio crawler to be running 24/7. What matters is the live MCP server with real data behind it.