WebCrawler is a multi-threaded web crawler and search engine. It downloads and indexes web pages and supports both Boolean and vectorial search over the crawled content. The project includes a Flask web interface for configuring crawl parameters, monitoring progress, and searching the indexed data with pagination support.
- Multi-threaded web crawling
- Configurable crawl parameters (start URL, output directory, max pages, thread count)
- Boolean and vectorial search over crawled content
- Flask web interface for configuration and search
- Index persistence (direct and inverted indexes)
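
The two search modes listed above can be illustrated with a minimal sketch over a toy inverted index. This is not WebCrawler's actual index layout or API, just an assumed illustration: Boolean search intersects posting lists, while vectorial search ranks documents by TF-IDF weight.

```python
# Illustrative sketch only: a toy inverted index with Boolean (AND)
# and vectorial (TF-IDF) search. WebCrawler's real structures may differ.
import math
from collections import Counter, defaultdict

docs = {
    0: "python web crawler threads",
    1: "flask web interface search",
    2: "boolean search over index",
}

# Inverted index: term -> {doc_id: term frequency}.
inverted = defaultdict(dict)
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        inverted[term][doc_id] = tf

def boolean_and(*terms):
    """Return the set of documents containing every query term."""
    sets = [set(inverted.get(t, {})) for t in terms]
    return set.intersection(*sets) if sets else set()

def vectorial(query):
    """Rank documents by a TF-IDF score against the query terms."""
    n = len(docs)
    scores = defaultdict(float)
    for term in query.split():
        postings = inverted.get(term, {})
        if not postings:
            continue
        idf = math.log(n / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(boolean_and("web", "search"))   # {1}
print(vectorial("web search")[0][0])  # 1 (doc 1 matches both terms)
```

Here the Boolean query returns only documents containing all terms, while the vectorial query scores every partial match, which is what makes ranked results and pagination meaningful.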
- Clone the repository:

  ```shell
  git clone git@github.com:strings1/WebCrawler.git
  cd WebCrawler/WebCrawler
  ```
- Create a virtual environment (recommended):

  ```shell
  python3 -m venv venv
  source venv/bin/activate
  ```
- Install the requirements:

  ```shell
  pip install -r requirements.txt
  ```
To start the Flask web interface, run:
```shell
python -m webcrawler_project.web.app
```
The app will start in debug mode and be accessible at http://127.0.0.1:5000/.
- Open your browser and go to http://127.0.0.1:5000/
- Configure the crawler parameters (start URL, output directory, max pages, thread count)
- Start the crawl and monitor progress
- Use the search interface to query the indexed data
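
The crawl driven by these parameters (start URL, max pages, thread count) can be sketched as a thread pool draining a shared URL frontier. This is a hypothetical outline, not WebCrawler's implementation; the network fetch is stubbed out so the example runs offline, whereas the real crawler downloads pages and extracts their links.

```python
# Hypothetical multi-threaded crawl loop: worker threads share a URL
# queue and a visited set guarded by a lock. fetch() is a stub standing
# in for a real HTTP download + link extraction.
import threading
from queue import Queue, Empty

def fetch(url):
    # Stub: pretend every page links to two child pages.
    return [f"{url}/child{i}" for i in range(2)]

def crawl(start_url, max_pages=10, thread_count=4):
    frontier = Queue()
    frontier.put(start_url)
    seen = set()
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = frontier.get(timeout=0.2)
            except Empty:
                return  # frontier drained: this worker is done
            with lock:
                # Skip duplicates and stop claiming URLs past the page budget.
                if url in seen or len(seen) >= max_pages:
                    continue
                seen.add(url)
            for link in fetch(url):
                frontier.put(link)

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen

pages = crawl("http://example.com", max_pages=5, thread_count=2)
print(len(pages))  # 5
```

Holding the lock across the duplicate check and the insertion keeps the page budget exact even when several workers pop URLs concurrently.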
- `webcrawler_project/` - main source code
- `webcrawler_project/web/app.py` - Flask web application
- `requirements.txt` - Python dependencies
- `tests/` - unit tests
- `docs/` - Sphinx documentation
MIT License
Developed by Darie Alexandru