A concurrent web crawler CLI that crawls a website and exports a JSON report of every page it visits.
## Features

- Async concurrent crawling via `aiohttp` and `asyncio` (sketched below)
- HTML parsing with `BeautifulSoup4`, extracting headings, paragraphs, links, and images (sketched below)
- URL normalization to avoid visiting the same page twice (sketched below)
- Stays within the target domain (checked in the normalization sketch below)
- Configurable concurrency and page limits
- JSON report output
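The concurrency limit could be enforced with an `asyncio.Semaphore` around each request. This is a minimal sketch, not the actual code in `main.py`; the `fetch_page` and `fetch_all` helpers are hypothetical names:

```python
import asyncio

import aiohttp

async def fetch_page(session: aiohttp.ClientSession,
                     semaphore: asyncio.Semaphore, url: str) -> str | None:
    # The semaphore caps how many requests are in flight at once.
    async with semaphore:
        async with session.get(url) as response:
            # Only successful HTML responses are worth parsing.
            if response.status != 200:
                return None
            if "text/html" not in response.headers.get("content-type", ""):
                return None
            return await response.text()

async def fetch_all(urls: list[str], max_concurrency: int) -> list[str | None]:
    semaphore = asyncio.Semaphore(max_concurrency)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(fetch_page(session, semaphore, url) for url in urls)
        )
```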
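Field extraction with `BeautifulSoup4` might look roughly like this; the record keys mirror the report format shown under Output, but `extract_page_data` itself is a hypothetical helper:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

def extract_page_data(html: str, page_url: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")

    # First heading and first paragraph, if the page has them.
    h1 = soup.find("h1")
    p = soup.find("p")

    # Resolve relative hrefs and srcs against the page URL.
    links = [urljoin(page_url, a["href"]) for a in soup.find_all("a", href=True)]
    images = [urljoin(page_url, img["src"]) for img in soup.find_all("img", src=True)]

    return {
        "url": page_url,
        "heading": h1.get_text(strip=True) if h1 else "",
        "first_paragraph": p.get_text(strip=True) if p else "",
        "outgoing_links": links,
        "image_urls": images,
    }
```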
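Normalization is what keeps `https://Example.com/about/` and `http://example.com/about#team` from being counted as two pages. One plausible rule set, shown here only as a sketch (lowercase the host, drop scheme and fragment, strip a trailing slash); the real rules in `main.py` may differ:

```python
from urllib.parse import urlparse

def normalize_url(url: str) -> str:
    # Collapse near-duplicate URLs to a single key for the visited set.
    parts = urlparse(url)
    return f"{parts.netloc.lower()}{parts.path.rstrip('/')}"

def same_domain(url: str, root: str) -> bool:
    # "Stays within the target domain": compare hosts only.
    return urlparse(url).netloc.lower() == urlparse(root).netloc.lower()

# Both normalize to "example.com/about":
assert normalize_url("https://Example.com/about/") == \
       normalize_url("http://example.com/about#team")
```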
## Requirements

- Python 3.10+
- `uv`
## Installation

```bash
uv venv
source .venv/bin/activate
uv sync
```

## Usage

```bash
uv run main.py <url> <max_concurrency> <max_pages>
```

### Arguments
| Argument | Description |
|---|---|
| `url` | The starting URL to crawl |
| `max_concurrency` | Maximum number of simultaneous requests |
| `max_pages` | Maximum number of pages to crawl |
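As a sketch of how these three positional arguments might be consumed (assuming `main.py` reads `sys.argv` directly; the real parsing may differ):

```python
import sys

def parse_args(argv: list[str]) -> tuple[str, int, int]:
    # Expect exactly: main.py <url> <max_concurrency> <max_pages>
    if len(argv) != 4:
        print("usage: main.py <url> <max_concurrency> <max_pages>", file=sys.stderr)
        raise SystemExit(1)
    return argv[1], int(argv[2]), int(argv[3])
```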
### Example

```bash
uv run main.py https://example.com 5 50
```

This will crawl up to 50 pages of example.com with 5 concurrent requests and write the results to `report.json`.
## Output

After crawling, `report.json` is created in the current directory. It contains a sorted list of page records:

```json
[
{
"url": "https://example.com/about",
"heading": "About Us",
"first_paragraph": "We build things.",
"outgoing_links": ["https://example.com/", "https://example.com/contact"],
"image_urls": ["https://example.com/logo.png"]
}
]
```

## Tests

```bash
uv run -m unittest
```