Web Crawler

A concurrent web crawler CLI that crawls a website and exports a JSON report of every page it visits.

Features

  • Async concurrent crawling via aiohttp and asyncio
  • HTML parsing with BeautifulSoup4 — extracts headings, paragraphs, links, and images
  • URL normalization to avoid visiting the same page twice (see the sketch after this list)
  • Stays within the target domain
  • Configurable concurrency and page limits
  • JSON report output
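
The crawler's own helpers are not reproduced in this README, but the URL-normalization idea can be sketched with the standard library alone (normalize_url is a hypothetical name, not necessarily the project's):

from urllib.parse import urlparse

def normalize_url(url: str) -> str:
    # Lowercase the host, ignore the scheme, and strip a trailing slash so that
    # https://Example.com/about/ and http://example.com/about collapse to the
    # same key in the "already visited" set.
    parsed = urlparse(url)
    return f"{parsed.netloc.lower()}{parsed.path.rstrip('/')}"

visited = set()
for link in ["https://Example.com/about/", "http://example.com/about"]:
    key = normalize_url(link)
    if key not in visited:
        visited.add(key)

print(visited)  # {'example.com/about'}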

Requirements

  • Python 3.10+
  • uv

Setup

uv venv
source .venv/bin/activate
uv sync

Usage

uv run main.py <url> <max_concurrency> <max_pages>

Arguments

Argument         Description
url              The starting URL to crawl
max_concurrency  Maximum number of simultaneous requests
max_pages        Maximum number of pages to crawl
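
main.py itself is not shown here; the following is only a sketch of how a cap like max_concurrency is commonly enforced with asyncio and aiohttp (a shared asyncio.Semaphore), not the project's actual code:

import asyncio
import aiohttp

async def fetch(session: aiohttp.ClientSession, sem: asyncio.Semaphore, url: str) -> str:
    # The semaphore limits how many requests are in flight at once.
    async with sem:
        async with session.get(url) as resp:
            return await resp.text()

async def main() -> None:
    sem = asyncio.Semaphore(5)            # e.g. max_concurrency = 5
    urls = ["https://example.com/"] * 3   # placeholder URLs
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, sem, u) for u in urls))
    print(f"fetched {len(pages)} pages")

asyncio.run(main())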

Example

uv run main.py https://example.com 5 50

This will crawl up to 50 pages of example.com with 5 concurrent requests and write the results to report.json.

Output

After crawling, report.json is created in the current directory. It contains a sorted list of page records:

[
  {
    "url": "https://example.com/about",
    "heading": "About Us",
    "first_paragraph": "We build things.",
    "outgoing_links": ["https://example.com/", "https://example.com/contact"],
    "image_urls": ["https://example.com/logo.png"]
  }
]
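
Because the report is plain JSON, it is easy to post-process with the standard library; for example, to list each page with its outgoing-link count (field names taken from the sample above):

import json

with open("report.json", encoding="utf-8") as f:
    pages = json.load(f)

for page in pages:
    print(page["url"], "->", len(page["outgoing_links"]), "outgoing links")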

Running Tests

uv run -m unittest
