A simple website crawler that collects all URLs, meta tags and <H1> headings from your website.
Open main.py and set the init_url variable to your start URL.
Adjust the use_pause variable so that you do not overload your web server.
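As a rough illustration of these two settings, the sketch below assumes that use_pause holds a delay in seconds; the real type and value in main.py may differ. The names crawl_politely and fetch are hypothetical and not taken from the project.

```python
import time

# Assumed values for illustration only; main.py defines the real ones.
init_url = "https://example.com/"
use_pause = 0.5  # assumption: seconds to wait between requests


def crawl_politely(urls, fetch, pause=use_pause):
    """Call fetch(url) for each URL, sleeping `pause` seconds between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0 and pause > 0:
            time.sleep(pause)  # give the server a break between requests
        results.append(fetch(url))
    return results
```

A larger pause makes the crawl slower but puts less load on the server.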
The crawler does not follow redirects (see allow_redirects=False).
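The allow_redirects=False setting is a parameter of the requests library. A minimal sketch of how such a request might look is shown here, assuming the crawler fetches pages with requests.get; the helper name fetch_without_redirects, the timeout value and the injectable get argument are hypothetical.

```python
import requests

# HTTP status codes that point to another location.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def fetch_without_redirects(url, get=requests.get, timeout=10):
    """Return (response, None) for a normal reply, or (None, location) for a redirect."""
    response = get(url, allow_redirects=False, timeout=timeout)
    if response.status_code in REDIRECT_CODES:
        # The redirect target is reported but deliberately not fetched.
        return None, response.headers.get("Location")
    return response, None
```

With this approach a redirecting URL shows up as a 3xx status instead of silently resolving to its target page.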
It ignores React/JavaScript links if the website uses them.
Written in Python, using BeautifulSoup. The report is saved to a CSV file.
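The following sketch shows one way the extraction and the CSV report could work with BeautifulSoup. It is not the project's actual code: the function names, the CSV columns and the rule of skipping javascript: pseudo-links and in-page anchors are all assumptions made for illustration.

```python
import csv
import io
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_page_info(html, page_url):
    """Collect links, named meta tags and the first <h1> from one HTML page."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        href = a["href"].strip()
        # Assumption: skip script pseudo-links and in-page anchors.
        if href.lower().startswith("javascript:") or href.startswith("#"):
            continue
        links.append(urljoin(page_url, href))  # make relative links absolute
    meta = {
        tag["name"].lower(): tag.get("content", "")
        for tag in soup.find_all("meta", attrs={"name": True})
    }
    h1 = soup.find("h1")
    return {
        "url": page_url,
        "h1": h1.get_text(strip=True) if h1 else "",
        "description": meta.get("description", ""),
        "keywords": meta.get("keywords", ""),
        "links": links,
    }


def write_report(rows, out_file):
    """Write one CSV line per crawled page (hypothetical column layout)."""
    fields = ["url", "h1", "description", "keywords"]
    writer = csv.DictWriter(out_file, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    sample = '<h1>Hello</h1><a href="/about">About</a>'
    buffer = io.StringIO()
    write_report([extract_page_info(sample, "https://example.com/")], buffer)
    print(buffer.getvalue())
```

Because only the static HTML is parsed, links that a page builds at runtime with JavaScript never appear in the result.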
https://github.com/sergeymusenko/simple-crawler/tree/main
Installation:
pip install bs4