sitewalker

Traverse a site and represent it as a graph.

One script, given a URL, crawls a website to a depth of up to N and writes the collected data to a JSON file. Other scripts help to explore the graph based on the saved data.

Please note: this is only a demo (or prototype), not production-ready software!

Requirements

Python >= 3.5
Requests library

Installation

$ pip install -r requirements.txt

Or:

$ python setup.py develop

Using a Python virtual environment is strongly recommended.

Testing

$ python tests/test_graph.py

Running

Site scraping

First, traverse a site, e.g.

$ python sitewalker/traverse.py -d4 http://python.org/

The sitemap will be saved to a JSON file named "sitemap.json". You may choose another file name via the -o command-line parameter.
-d4 sets the maximum depth to 4 levels.
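
As a rough illustration of what a depth limit means, here is a minimal sketch of a depth-limited crawl. It is not the code in traverse.py: it uses a queue instead of recursion, the names crawl, fetch_links and FAKE_SITE are hypothetical, and an in-memory dictionary stands in for real HTTP requests. The layout of the resulting dictionary is also an assumption and may differ from the actual sitemap.json format.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def crawl(start: str, max_depth: int,
          fetch_links: Callable[[str], Iterable[str]]) -> Dict[str, List[str]]:
    """Breadth-first crawl; returns {page: [pages it links to]}."""
    graph: Dict[str, List[str]] = {}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if page in graph:
            continue  # already handled
        if depth >= max_depth:
            graph[page] = []  # record the page, but do not follow its links
            continue
        links = list(fetch_links(page))
        graph[page] = links
        for link in links:
            if link not in graph:
                queue.append((link, depth + 1))
    return graph


# Hypothetical in-memory "site" used instead of real network access.
FAKE_SITE = {
    "/": ["/about/", "/docs/"],
    "/about/": ["/"],
    "/docs/": ["/docs/tutorial/"],
    "/docs/tutorial/": ["/docs/tutorial/part2/"],
}

sitemap = crawl("/", 2, lambda page: FAKE_SITE.get(page, []))
```

With a depth of 2, "/docs/tutorial/" is recorded but not expanded, so "/docs/tutorial/part2/" never appears in the result.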

To see all available options, type:

$ python sitewalker/traverse.py --help

Logging

To see extra logging, set the 'DEBUG_SCRAPER' environment variable to 1.
To see detailed logging, set 'DEBUG_SCRAPER' to 2.
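
For readers curious how such a switch can be wired up, the sketch below maps the variable to standard logging levels. The mapping (1 to INFO, 2 to DEBUG, anything else to WARNING) and the function name level_from_env are assumptions made for illustration; the project may configure logging differently.

```python
import logging
import os


def level_from_env(environ=os.environ) -> int:
    """Translate DEBUG_SCRAPER into a logging level (assumed mapping)."""
    value = environ.get("DEBUG_SCRAPER", "0")
    if value == "2":
        return logging.DEBUG    # detailed logging
    if value == "1":
        return logging.INFO     # extra logging
    return logging.WARNING      # default: warnings and errors only


logging.basicConfig(level=level_from_env())
```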

Exploring

To calculate the graph diameter:

$ python sitewalker/diameter.py sitemap.json

Sample output:

Length: 2
/
-->/psf-landing/
-->/psf/volunteer

Length is the graph diameter: the longest of the shortest distances, in graph edges, between any two nodes. The lines after it list the nodes along that path.
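
One common way to compute such a diameter is to run a breadth-first search from every node and keep the longest shortest path found. The sketch below does that on a small hand-made graph. It is not taken from diameter.py: the function name longest_shortest_path, the example edges and the choice to treat links as directed are all assumptions.

```python
from collections import deque
from typing import Dict, List


def longest_shortest_path(graph: Dict[str, List[str]]) -> List[str]:
    """Return the nodes of a longest shortest path in a directed graph."""
    best: List[str] = []
    for source in graph:
        parent = {source: None}
        queue = deque([source])
        last = source
        while queue:
            node = queue.popleft()
            last = node  # BFS dequeues nodes in non-decreasing distance order
            for neighbour in graph.get(node, []):
                if neighbour not in parent:
                    parent[neighbour] = node
                    queue.append(neighbour)
        # Walk back from the farthest node to the source.
        path = []
        node = last
        while node is not None:
            path.append(node)
            node = parent[node]
        path.reverse()
        if len(path) > len(best):
            best = path
    return best


# Made-up link structure, chosen only to resemble the sample output.
graph = {
    "/": ["/psf-landing/"],
    "/psf-landing/": ["/", "/psf/volunteer"],
    "/psf/volunteer": ["/"],
}
path = longest_shortest_path(graph)
print("Length:", len(path) - 1)
print("\n-->".join(path))
```

Running one search per node is simple but costs time proportional to nodes times edges, which is acceptable for small site maps.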

Caveats

This program cannot handle dynamic sites, such as Single Page Applications. Further, it makes some naive assumptions:

URLs from the same domain belong to the same site, which is not always true.
Different domains mean different sites. So, weather.yandex.ru and tv.yandex.ru are recognized as different sites.
A "www." prefix in a domain name is not important. Thus, www.python.org is the same entry as python.org. This works in most cases.
The parameter and query parts of a URL do not determine page addresses. E.g. http://example.com/ and http://example.com/?foo=bar are treated as the same page. There are few web frameworks that generate different pages for different queries.
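
The assumptions above can be expressed as a small URL normalization step. This is only a sketch built on the standard urllib.parse module; the helper names site_of, same_site and page_key are hypothetical and are not taken from the project's code. Dropping the fragment part is an extra assumption not stated in the list.

```python
from urllib.parse import urlparse


def site_of(url: str) -> str:
    """Host name in lower case, without a leading "www."."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def same_site(a: str, b: str) -> bool:
    """Two URLs belong to the same site if their normalized hosts match."""
    return site_of(a) == site_of(b)


def page_key(url: str) -> str:
    """Page identity: site plus path; params and query are ignored.

    The fragment is dropped as well (an assumption of this sketch).
    """
    return site_of(url) + (urlparse(url).path or "/")
```

For example, same_site("http://www.python.org/", "http://python.org/about/") is true, while weather.yandex.ru and tv.yandex.ru compare as different sites.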

TODOs

Add more methods for exploring site maps, including visualization.
Add configuration file.
Make site traversal asynchronous.
Replace recursive traversal function with Producer/Consumer design.
Add proxy support.
Allow unlimited depth.
Add width limit (maybe, based on graph diameter).
Use NetworkX library instead of custom graph.
