WebCrawler is a multi-threaded web crawler and search engine. It downloads and indexes web pages and supports both Boolean and vectorial search over the crawled content. The project includes a Flask web interface for configuring crawl parameters, monitoring progress, and searching the indexed data with pagination support.
- Multi-threaded web crawling
- Configurable crawl parameters (start URL, output directory, max pages, thread count)
- Boolean and vectorial search over crawled content
- Flask web interface for configuration and search
- Index persistence (direct and inverted indexes)
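
The two search modes listed above can be illustrated with a minimal sketch over a toy inverted index. This is not WebCrawler's actual index layout or API, just an assumed illustration: Boolean search intersects posting lists, while vectorial search ranks documents by TF-IDF weight.

```python
# Illustrative sketch only: a toy inverted index with Boolean (AND)
# and vectorial (TF-IDF) search. WebCrawler's real structures may differ.
import math
from collections import Counter, defaultdict

docs = {
    0: "python web crawler threads",
    1: "flask web interface search",
    2: "boolean search over index",
}

# Inverted index: term -> {doc_id: term frequency}.
inverted = defaultdict(dict)
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        inverted[term][doc_id] = tf

def boolean_and(*terms):
    """Return the set of documents containing every query term."""
    sets = [set(inverted.get(t, {})) for t in terms]
    return set.intersection(*sets) if sets else set()

def vectorial(query):
    """Rank documents by a TF-IDF score against the query terms."""
    n = len(docs)
    scores = defaultdict(float)
    for term in query.split():
        postings = inverted.get(term, {})
        if not postings:
            continue
        idf = math.log(n / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(boolean_and("web", "search"))   # {1}
print(vectorial("web search")[0][0])  # 1 (doc 1 matches both terms)
```

Here the Boolean query returns only documents containing all terms, while the vectorial query scores every partial match, which is what makes ranked results and pagination meaningful.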
- Clone the repository:

  ```shell
  git clone git@github.com:strings1/WebCrawler.git
  cd WebCrawler/WebCrawler
  ```
- Create a virtual environment (recommended):

  ```shell
  python3 -m venv venv
  source venv/bin/activate
  ```
- Install the requirements:

  ```shell
  pip install -r requirements.txt
  ```
To start the Flask web interface, run:
```shell
python -m webcrawler_project.web.app
```
The app will start in debug mode and be accessible at http://127.0.0.1:5000/.
- Open your browser and go to http://127.0.0.1:5000/
- Configure the crawler parameters (start URL, output directory, max pages, thread count)
- Start the crawl and monitor progress
- Use the search interface to query the indexed data
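
The crawl driven by these parameters (start URL, max pages, thread count) can be sketched as a thread pool draining a shared URL frontier. This is a hypothetical outline, not WebCrawler's implementation; the network fetch is stubbed out so the example runs offline, whereas the real crawler downloads pages and extracts their links.

```python
# Hypothetical multi-threaded crawl loop: worker threads share a URL
# queue and a visited set guarded by a lock. fetch() is a stub standing
# in for a real HTTP download + link extraction.
import threading
from queue import Queue, Empty

def fetch(url):
    # Stub: pretend every page links to two child pages.
    return [f"{url}/child{i}" for i in range(2)]

def crawl(start_url, max_pages=10, thread_count=4):
    frontier = Queue()
    frontier.put(start_url)
    seen = set()
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = frontier.get(timeout=0.2)
            except Empty:
                return  # frontier drained: this worker is done
            with lock:
                # Skip duplicates and stop claiming URLs past the page budget.
                if url in seen or len(seen) >= max_pages:
                    continue
                seen.add(url)
            for link in fetch(url):
                frontier.put(link)

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen

pages = crawl("http://example.com", max_pages=5, thread_count=2)
print(len(pages))  # 5
```

Holding the lock across the duplicate check and the insertion keeps the page budget exact even when several workers pop URLs concurrently.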
- `webcrawler_project/` - main source code
- `webcrawler_project/web/app.py` - Flask web application
- `requirements.txt` - Python dependencies
- `tests/` - unit tests
- `docs/` - Sphinx documentation
MIT License
Developed by Darie Alexandru