Traverse site and represent it as a graph
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information.


Traverse site and represent it as a graph.

One script, given a URL, does a website crawl with depth up to N, and writes some data to a JSON file. Other scripts help to explore graph based on the saved data.

Please, note: this is only a demo (or prototype), not a production-ready software!


  • Python >= 3.5
  • Requests library


$ pip install -r requirements.txt


$ python develop

Python virtual environment is strongly recommended.


$ python tests/


Site scrapping

First, traverse a site, e.g.

$ python sitewalker/ -d4
  • Sitemap named "sitemap.json" will be saved to a JSON file. You may choose another file name via -o command line parameter
  • -d4 means maximal depth of 4 levels.

To see all available options, type:

$ python sitewalker/ --help


  • To see extra logging, set 'DEBUG_SCRAPER' environment variable to 1.
  • To see detailed logging, set 'DEBUG_SCRAPER' to 2.


To calculate the graph diameter:

$ python sitewalker/ sitemap.json

Sample output:

Length: 2

Length shows the longest distance, in graph edges, between two nodes. Next goes the list of the path nodes.


This program is not capable of working with dynamic sites, such as Single Page Applications. Futher, it makes some naive assumptions:

  • URLs from the same domain belong to the same site, which is not always true.
  • Different domains mean different sites. So, and are recognized as different sites.
  • .WWW prefixes in domain names are not important. Thus, is the same entry as It works in most cases.
  • Parameter and query parts of an URL do not determine page addresses. E.g. and are treated as the same page. There are few web frameworks that generate different pages for different queries.


  1. Add more methods for exploring site maps, including visualization.
  2. Add configuration file.
  3. Make site traversal asynchronous.
  4. Replace recursive traversal function with Producer/Consumer design.
  5. Add proxy support.
  6. Allow unlimited depth.
  7. Add width limit (maybe, based on graph diameter).
  8. Use NetworkX library instead of custom graph.