This is a web crawler that generates summaries of website content. It was developed as a guided project on boot.dev.
This project requires Python 3.9 or later. You can use uv or pip to install dependencies.
This project was created using a virtual environment generated with uv. If you already have uv installed, navigate to the project root and run:

```sh
uv sync
```
If not, the project dependencies can be installed using pip:

- Create a virtual environment:

  ```sh
  python3 -m venv .venv
  ```

- Activate it:

  ```sh
  source .venv/bin/activate
  ```

- Upgrade pip and install dependencies:

  ```sh
  pip install --upgrade pip
  pip install -r requirements.txt
  ```
This is a command line program. The syntax for execution is:

```sh
python main.py URL CONCURRENCY MAX_PAGES
```

All arguments are required:
- URL: The URL that the crawler will start from. The crawler downloads the webpage at that URL, identifies any anchor (`<a>`) elements, and collects their href URLs. It then visits those URLs, but only the ones that are actually hosted on the same domain as the original URL.
- CONCURRENCY: The maximum number of simultaneous requests the crawler will make. Larger numbers make the program run faster, but numbers that are too large may disrupt the target of the crawl. Many targets also respond to disruptive tools by banning and/or reporting the source IP address. For their sake and yours, keep this number low. (I have never run the tool with a concurrency higher than 5.)
- MAX_PAGES: The maximum number of pages to retrieve. For large, expansive sites, or sites with a theoretically infinite number of dynamically generated URLs, this prevents your crawler from pulling in more data than it can handle or spending more time than you are willing on the process.
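The CONCURRENCY limit can be thought of as a semaphore bounding the number of in-flight requests. A minimal sketch of that idea using asyncio (the project's actual concurrency mechanism may differ; `fetch_page` and the URL list are placeholders, with `asyncio.sleep` standing in for a real HTTP request):

```python
import asyncio

async def fetch_page(url: str, semaphore: asyncio.Semaphore) -> str:
    # Acquire the semaphore so at most CONCURRENCY fetches run at once.
    async with semaphore:
        await asyncio.sleep(0.01)  # stand-in for a real HTTP request
        return f"<html>contents of {url}</html>"

async def crawl(urls: list[str], concurrency: int) -> list[str]:
    # One shared semaphore caps simultaneous requests across all tasks.
    semaphore = asyncio.Semaphore(concurrency)
    tasks = [fetch_page(url, semaphore) for url in urls]
    return await asyncio.gather(*tasks)

pages = asyncio.run(crawl([f"https://example.com/{i}" for i in range(10)], 3))
```

Each task waits its turn on the semaphore, so raising CONCURRENCY trades politeness toward the target site for speed.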
The crawler examines the user-provided webpage, follows anchor links, and generates a summary of all of the crawled pages. The summary is stored in a local file, report.json. The CONCURRENCY and MAX_PAGES arguments control how many simultaneous requests the crawler makes and how many total pages it retrieves before terminating.
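The same-domain rule described above can be sketched with the standard library's `urllib.parse`. This is an illustration, not the project's actual code; the function name and example HTML hrefs are made up:

```python
from urllib.parse import urljoin, urlparse

def same_domain_links(base_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against base_url and keep only same-domain results."""
    base_host = urlparse(base_url).netloc
    links = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolves relative paths too
        if urlparse(absolute).netloc == base_host:
            links.append(absolute)
    return links

links = same_domain_links(
    "https://example.com/blog/",
    ["/about", "post-1", "https://example.com/contact", "https://other.org/x"],
)
# -> ['https://example.com/about', 'https://example.com/blog/post-1',
#     'https://example.com/contact']
```

Comparing `netloc` after resolving each href to an absolute URL handles both relative links and absolute links to other hosts.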
This project is licensed under the MIT License - see the LICENSE file for details.