This is a web crawler that generates summaries of website content. It was developed as a guided project on boot.dev.
This project requires Python 3.9 or later. You can use uv or pip to install dependencies.
This project was created using a virtual environment generated with uv. If you already have uv installed, navigate to the project root and run:

```sh
uv sync
```
If not, the project dependencies can be installed using pip:

- Create a virtual environment:

  ```sh
  python3 -m venv .venv
  ```

- Activate it:

  ```sh
  source .venv/bin/activate
  ```

- Upgrade pip and install dependencies:

  ```sh
  pip install --upgrade pip
  pip install -r requirements.txt
  ```
This is a command line program. The syntax for execution is:

```sh
python main.py URL CONCURRENCY MAX_PAGES
```

All arguments are required:
- URL: The URL that the crawler will start from. The crawler downloads the webpage at that URL, identifies any anchor (`<a>`) elements, and collects their href URLs. It then visits those URLs, but only the ones that are actually hosted on the same domain as the original URL.
- CONCURRENCY: The maximum number of simultaneous requests the crawler will make. Larger numbers make the program run faster, but numbers that are too large may disrupt the target of the crawl. Many targets also respond to disruptive tools by banning and/or reporting the source IP address. For their sake and yours, keep this number low. (I have never run the tool with a concurrency higher than 5.)
- MAX_PAGES: The maximum number of pages to retrieve. For large, expansive sites, or sites with a theoretically infinite number of dynamically generated URLs, this prevents your crawler from pulling in more data than it can handle or spending more time than you are willing on the process.
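The CONCURRENCY limit can be thought of as a semaphore bounding the number of in-flight requests. A minimal sketch of that idea using asyncio (the project's actual concurrency mechanism may differ; `fetch_page` and the URL list are placeholders, with `asyncio.sleep` standing in for a real HTTP request):

```python
import asyncio

async def fetch_page(url: str, semaphore: asyncio.Semaphore) -> str:
    # Acquire the semaphore so at most CONCURRENCY fetches run at once.
    async with semaphore:
        await asyncio.sleep(0.01)  # stand-in for a real HTTP request
        return f"<html>contents of {url}</html>"

async def crawl(urls: list[str], concurrency: int) -> list[str]:
    # One shared semaphore caps simultaneous requests across all tasks.
    semaphore = asyncio.Semaphore(concurrency)
    tasks = [fetch_page(url, semaphore) for url in urls]
    return await asyncio.gather(*tasks)

pages = asyncio.run(crawl([f"https://example.com/{i}" for i in range(10)], 3))
```

Each task waits its turn on the semaphore, so raising CONCURRENCY trades politeness toward the target site for speed.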
The crawler examines the user-provided webpage, follows anchor links, and generates a summary of all of the crawled pages. The summary is stored in a local file, report.json. The CONCURRENCY and MAX_PAGES arguments control how many simultaneous requests the crawler makes and how many total pages it retrieves before terminating.
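The same-domain rule described above can be sketched with the standard library's `urllib.parse`. This is an illustration, not the project's actual code; the function name and example HTML hrefs are made up:

```python
from urllib.parse import urljoin, urlparse

def same_domain_links(base_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against base_url and keep only same-domain results."""
    base_host = urlparse(base_url).netloc
    links = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolves relative paths too
        if urlparse(absolute).netloc == base_host:
            links.append(absolute)
    return links

links = same_domain_links(
    "https://example.com/blog/",
    ["/about", "post-1", "https://example.com/contact", "https://other.org/x"],
)
# -> ['https://example.com/about', 'https://example.com/blog/post-1',
#     'https://example.com/contact']
```

Comparing `netloc` after resolving each href to an absolute URL handles both relative links and absolute links to other hosts.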
This project is licensed under the MIT License - see the LICENSE file for details.