A web crawler that searches the web for URLs. Currently it serves no purpose other than collecting the different URLs it finds across the web.
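For context, the core technique here is link extraction: download a page and pull out every URL it references. Below is a minimal standard-library sketch of that idea; it is not the project's actual code, and WebCrawler.py may do this quite differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from every <a href="..."> tag on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.add(urljoin(self.base_url, value))

def extract_links(url):
    """Download one page and return the set of URLs it links to."""
    with urlopen(url, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    parser = LinkExtractor(url)
    parser.feed(html)
    return parser.links
```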
- Python >= 3
- pipenv
OR
- Docker Compose
Navigate to the project root and run the following command:
ROOT_URL=insert_root_url_here docker-compose up --build --scale web-crawler=2
This starts the web-crawler-scheduler and two web-crawler instances. The scale value can be any number, but take care not to overload the target server.
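Compose passes the root URL to the scheduler through the ROOT_URL environment variable, so the scheduler presumably picks it up with something like the following. This is a sketch only; the actual main.py may read it differently.

```python
import os
import sys

# Hypothetical: Compose injects the seed URL as an environment variable.
root_url = os.environ.get("ROOT_URL")
if not root_url:
    sys.exit("ROOT_URL environment variable is not set")
```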
OR
Open two separate terminals and navigate to the web-crawler-scheduler and web-crawler directories respectively. In both, first run:
pipenv install
Then, in the web-crawler-scheduler directory, run
pipenv run python main.py insert_root_url_here
and in the web-crawler directory run
pipenv run python main.py
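Note that the scheduler command above takes the root URL as a command-line argument rather than an environment variable; main.py presumably reads it along these lines (again a hypothetical sketch, not the actual code):

```python
import sys

# Hypothetical: the seed URL arrives as the first command-line argument.
if len(sys.argv) < 2:
    sys.exit("usage: python main.py <root_url>")
root_url = sys.argv[1]
```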
All URLs the crawler finds are stored in /web-crawler-scheduler/data/data.txt.
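Since the output is plain text with one URL per line, persistence boils down to appending deduplicated URLs, roughly like the sketch below. The names here (DATA_FILE, _seen, save_url) are made up for illustration and are not taken from the project.

```python
DATA_FILE = "data/data.txt"  # relative to the web-crawler-scheduler directory

_seen = set()  # in-memory index of URLs already written to disk

def save_url(url):
    """Append a URL to the data file, one per line, skipping duplicates."""
    if url in _seen:
        return
    _seen.add(url)
    with open(DATA_FILE, "a", encoding="utf-8") as f:
        f.write(url + "\n")
```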
Please note that this project was created just to practice network programming with Python. If you choose to test this app, be careful not to overload the servers you are targeting: don't start too many crawlers at once, and don't remove the time.sleep(1) call that slows down the loop in the WebCrawler.py file. I'm not liable for any misuse of this application.
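For reference, the time.sleep(1) mentioned above implies a crawl loop shaped roughly like the sketch below. This is an assumed structure that reuses the extract_links helper from the first sketch; the real loop in WebCrawler.py differs in its details.

```python
import time
from collections import deque

def crawl(root_url, max_pages=100):
    """Hypothetical breadth-first crawl, fetching at most one page per second."""
    queue = deque([root_url])
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            # extract_links is the helper from the first sketch above.
            for link in extract_links(url):
                if link not in visited:
                    queue.append(link)
        except OSError:
            pass  # unreachable or malformed page; skip it
        time.sleep(1)  # the politeness delay the warning above refers to
    return visited
```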