Spyder - A simple spider written in Python

When called on a URL, it spiders that page and any links it finds, up to the specified depth.
When it's done, it prints a list of the resources it found.
Currently, it looks for these resources:

images   -  any images found on the page (e.g. <img src="THIS"/>)
styles   -  any external stylesheets found on the page.  CSS included via '@import' is currently only supported inside a style tag!
            (e.g. <link rel="stylesheet" href="THIS"/>  OR  <style>@import url('THIS');</style>)
scripts  -  any external scripts found on the page (e.g. <script src="THIS">)
links    -  any URLs found on the page.  'Fragments' are discarded (e.g. <a href="THIS#this-is-a-fragment">)
emails   -  any email addresses found on the page (e.g. <a href="mailto:THIS">)
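
As a rough illustration of the resource types listed above, here is a minimal sketch built on html.parser.HTMLParser. It is not Spyder's actual code: the ResourceParser class, its resources dict, and the IMPORT_RE pattern are hypothetical names made up for this example, and it assumes stylesheets are referenced through the standard href attribute of a link tag.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern for '@import' rules inside a <style> tag.
IMPORT_RE = re.compile(r"""@import\s+(?:url\()?\s*['"]?([^'")\s;]+)""")


class ResourceParser(HTMLParser):
    """Illustrative collector for the five resource types (not Spyder's own class)."""

    def __init__(self):
        super().__init__()
        self.resources = {"images": [], "styles": [], "scripts": [],
                          "links": [], "emails": []}
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attribute values may be None
        if tag == "img" and attrs.get("src"):
            self.resources["images"].append(attrs["src"])
        elif tag == "script" and attrs.get("src"):
            self.resources["scripts"].append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.resources["styles"].append(attrs["href"])
        elif tag == "a" and attrs.get("href"):
            href = attrs["href"]
            if href.lower().startswith("mailto:"):
                self.resources["emails"].append(href[len("mailto:"):])
            else:
                # Discard the fragment; skip links that were only a fragment.
                target = href.split("#", 1)[0]
                if target:
                    self.resources["links"].append(target)
        elif tag == "style":
            self._in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self._in_style = False

    def handle_data(self, data):
        # Only look for '@import' while inside a <style> tag.
        if self._in_style:
            self.resources["styles"].extend(IMPORT_RE.findall(data))
```

Feeding a page's HTML to an instance with feed() fills the resources dict; relative URLs are left as found and would still need resolving against the page URL.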

Internally, it uses html.parser.HTMLParser to parse pages, and urllib.request and urllib.parse to make requests and parse URLs.
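
To show what the URL-parsing side can look like, here is a small sketch using urllib.parse only. The helper names normalize_link and same_host are hypothetical and are not taken from Spyder.py; the idea is simply to resolve a link against the page it was found on, drop the fragment, and compare hosts.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def normalize_link(base_url, href):
    """Resolve href against base_url and discard any '#fragment' (hypothetical helper)."""
    absolute = urljoin(base_url, href)
    url, _fragment = urldefrag(absolute)
    return url


def same_host(url_a, url_b):
    """Return True if both URLs point at the same hostname (hypothetical helper)."""
    return urlsplit(url_a).hostname == urlsplit(url_b).hostname
```

For instance, normalize_link("http://www.example.com/a/index.html", "../b.html#top") resolves the relative path and strips the fragment.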

Usage: Spyder.py -u http://www.example.com

Options:
  -h, --help            show this help message and exit
  -u URL, --url=URL     The url to start spidering from.
  -d, --debug           Print debugging information (very verbose).
  -l LEVEL, --level=LEVEL
                        Specify the maximum recursion depth.  The default
                        maximum depth is 5.
  -H SPAN_HOSTS, --span-hosts=SPAN_HOSTS
                        Enable spanning across hosts when spidering. The
                        default is to limit spidering to one domain.
  -F FILTER_HOSTS, --filter-hosts=FILTER_HOSTS
                        When finished, filter the list of resources printed
                        to the target domain. The default is to print ALL
                        resources found.
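
The way the level, span-hosts and filter-hosts options could interact is sketched below. This is an assumption-laden illustration, not the spider's real traversal: crawl, filter_to_host and the get_links callback are hypothetical names, the start page is counted as depth 0, and off-host links are simply skipped when host spanning is off.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl(start_url, get_links, max_depth=5, span_hosts=False):
    """Breadth-first walk from start_url, following links up to max_depth.

    get_links is a caller-supplied function (hypothetical) that returns the
    absolute URLs found on a page.
    """
    start_host = urlsplit(start_url).hostname
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # recorded, but its links are not followed
        for link in get_links(url):
            if link in seen:
                continue
            if not span_hosts and urlsplit(link).hostname != start_host:
                continue  # stay on the starting host by default
            seen.add(link)
            queue.append((link, depth + 1))
    return seen


def filter_to_host(urls, target_url):
    """Keep only the URLs on the same host as target_url (sorted for stable output)."""
    host = urlsplit(target_url).hostname
    return sorted(u for u in urls if urlsplit(u).hostname == host)
```

Passing the link lookup in as a function keeps the traversal separate from fetching and parsing, which also makes it easy to try out against a small in-memory link map.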


The original reason I made this was to do some URL discovery for website benchmarking.
An example script that does this, 'www-benchmark.py', is included.  It uses ApacheBench (ab) for the benchmarking.
Eventually I'll experiment with 'siege' for benchmarking and server stress-testing.


NOTE: The spider can currently throw exceptions in some cases (mainly around character encoding, but there are probably other bugs too).
      Getting *working* character encoding detection is a goal; it only partly works so far.  Help in this area would be appreciated!
      Filtering the results by domain is also almost working.
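
One possible approach to the encoding problem, offered only as a sketch and not as what Spyder currently does: trust the charset from the Content-Type header when it is usable, fall back to UTF-8, and as a last resort decode with replacement characters so the spider never raises on a bad page. The function names are hypothetical. With urllib.request, the header value is available from the response's headers.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Extract the charset parameter from a Content-Type header value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_page(raw, content_type=None, fallback="utf-8"):
    """Decode page bytes without raising (hypothetical helper)."""
    for encoding in (charset_from_content_type(content_type), fallback):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset: try the next candidate
    # Last resort: keep going, marking undecodable bytes with U+FFFD.
    return raw.decode(fallback, errors="replace")
```

This does not inspect <meta charset> tags or guess from the bytes themselves, so it only covers pages whose server sends a correct header or that are valid UTF-8.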