
Python Webscraper

This is a webcrawler that generates summaries of website content. This crawler was developed as a guided project on boot.dev.

Requirements

This project requires Python 3.9 or later. Dependencies can be installed with uv or pip.

Installation

This project was created using a virtual environment generated with uv. If you already have uv installed, navigate to the project root and run:

uv sync

If not, the project dependencies can be installed using pip:

  1. Create a virtual environment: python3 -m venv .venv
  2. Activate it: source .venv/bin/activate
  3. Upgrade pip: pip install --upgrade pip
  4. Install dependencies: pip install -r requirements.txt

Usage

This is a command line program. The syntax for execution is:

python main.py URL CONCURRENCY MAX_PAGES

All three arguments are required:

  • URL: This specifies the URL that the crawler will work on. The crawler will download the webpage at that URL, identify any anchor (<a>) elements, and create a list of their href URLs. It will then visit those URLs, but only the ones hosted on the same domain as the original URL.

  • CONCURRENCY: This specifies the maximum number of simultaneous requests the crawler will make. Larger numbers make the program run faster, but numbers that are too large may disrupt the target of the crawl. Many targets also respond to disruptive tools by banning and/or reporting the source IP address. For their sake and yours, keep this number low. (I have never run the tool with a concurrency higher than 5.)

  • MAX_PAGES: This specifies the maximum number of pages to retrieve. For large, expansive sites, or sites with a theoretically infinite number of dynamically generated URLs, this prevents your crawler from pulling in more data than it can handle or spending more time than you are willing on the process.
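For example, a cautious crawl might look like this (the site, concurrency of 3, and page cap of 50 are placeholder values, not recommendations from the project):

```bash
# Crawl example.com with at most 3 simultaneous requests, stopping after 50 pages
python main.py https://example.com 3 50
```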

Features

This is a guided project from the website boot.dev. The project is a webcrawler: it examines a user-provided webpage, follows anchor links, and generates a summary of all of the crawled pages. The summary will be stored in a local file: report.json.

The user can control concurrency (how many simultaneous web requests the crawler makes) and max_pages (how many total requests the crawler makes before terminating).
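The interplay of these two limits can be sketched as follows. This is a minimal illustration, not the project's actual code: the URL list and the fetch stub are hypothetical, with an asyncio.Semaphore standing in for the concurrency cap and a list slice standing in for the page cap.

```python
import asyncio


async def crawl(urls, concurrency, max_pages):
    """Fetch up to max_pages URLs, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def fetch(url):
        async with semaphore:  # blocks while `concurrency` fetches are in flight
            # A real crawler would issue an HTTP request and parse anchors here;
            # this stub just records the visit.
            await asyncio.sleep(0)
            return f"summary of {url}"

    targets = urls[:max_pages]  # enforce the page cap
    summaries = await asyncio.gather(*(fetch(u) for u in targets))
    return dict(zip(targets, summaries))


pages = asyncio.run(
    crawl([f"https://example.com/{i}" for i in range(10)], concurrency=3, max_pages=5)
)
print(len(pages))  # 5: the cap, not the full list of 10
```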

License

This project is licensed under the MIT License - see the LICENSE file for details.
