Traverse site and represent it as a graph.
One script, given a URL, does a website crawl with depth up to N, and writes some data to a JSON file. Other scripts help to explore graph based on the saved data.
Please, note: this is only a demo (or prototype), not a production-ready software!
- Python >= 3.5
- Requests library
$ pip install -r requirements.txt
Or:
$ python setup.py develop
Python virtual environment is strongly recommended.
$ python tests/test_graph.py
First, traverse a site, e.g.
$ python sitewalker/traverse.py -d4 http://python.org/
- Sitemap named "sitemap.json" will be saved to a JSON file. You may choose another file name via -o command line parameter
- -d4 means maximal depth of 4 levels.
To see all available options, type:
$ python sitewalker/traverse.py --help
- To see extra logging, set 'DEBUG_SCRAPER' environment variable to 1.
- To see detailed logging, set 'DEBUG_SCRAPER' to 2.
To calculate the graph diameter:
$ python sitewalker/diameter.py sitemap.json
Sample output:
Length: 2
/
-->/psf-landing/
-->/psf/volunteer
Length shows the longest distance, in graph edges, between two nodes. Next goes the list of the path nodes.
This program is not capable of working with dynamic sites, such as Single Page Applications. Futher, it makes some naive assumptions:
- URLs from the same domain belong to the same site, which is not always true.
- Different domains mean different sites. So, weather.yandex.ru and tv.yandex.ru are recognized as different sites.
- .WWW prefixes in domain names are not important. Thus, www.python.org is the same entry as python.org. It works in most cases.
- Parameter and query parts of an URL do not determine page addresses. E.g. http://example.com/ and http://example.com/?foo=bar are treated as the same page. There are few web frameworks that generate different pages for different queries.
- Add more methods for exploring site maps, including visualization.
- Add configuration file.
- Make site traversal asynchronous.
- Replace recursive traversal function with Producer/Consumer design.
- Add proxy support.
- Allow unlimited depth.
- Add width limit (maybe, based on graph diameter).
- Use NetworkX library instead of custom graph.