bcoles edited this page Jan 5, 2011 · 52 revisions

Suggestions, feature requests and discussion go here.


Direction

Things have been going well. WhatWeb 0.4.5 is a good, stable tool and has earned community recognition.

We've been tearing webpages apart and fingerprinting them piece by piece. We've built plugins for many web applications, client side libraries and HTML elements, but now we have a few important issues to consider regarding WhatWeb's direction.

Design Philosophy

  • Always use an intuitive interface. Never force a user to choose an option when a default is better. The following command must always work: ./whatweb slashdot.org
  • Never take choices away from the user. Each automatic decision should be the default for a configurable option. Example: following redirects.
  • Avoid premature over-engineering. Do not implement core code to handle types of information that few plugins currently return. Allow plugins to return the information in generic formats such as :string instead. Wait until many plugins are returning the same type of information, such as operating system, file paths, versions, or modules, before considering how to solve this problem in the core. Premature over-engineering is the type of error that kills a project.
  • When a solution to a problem is inelegant, do not implement it in WhatWeb. Instead, continue to meditate on the problem for as long as required. If you need a fast solution, hack up your own version of WhatWeb and do not introduce the patch into the core. I have done this many times.
  • WhatWeb must grow horizontally and vertically together. WhatWeb must be good at solving one type of problem before entering a new area. For example, WhatWeb must be competent at identifying a system before it starts becoming good at identifying versions of systems. If WhatWeb is known to be patchy in its coverage, this could kill the project. This is the rationale behind not implementing security checks yet.
  • Breaking backwards compatibility is OK.

Multi-App plugins

We're at a fork in the road on this one. On one side, we can fingerprint each application individually and write a plugin for each one. On the other, we can incorporate many different applications of the same type into one plugin, for example all third-party JavaScript libraries.

bcoles: This isn't really a design issue so much. I'm in favor of categorizing plugins rather than combining multiple applications into a single plugin. I'd rather see output Google-Analytics[713526426] than Third-Party-Library[Google-Analytics[713526426]]

Exceptions:

An exception would be fingerprinting generic web apps, for example admin panels or web backdoors. These are applications that can only be fingerprinted generically using subtle clues, such as "/admin/", "/login/" or "?cmd=" in the URL. Such a clue doesn't necessarily mean that an admin panel or backdoor is present, but it's a good indication.
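
A minimal sketch of this kind of generic clue matching is shown below. The clue names and the generic_clues helper are hypothetical and are not part of WhatWeb's plugin API; only the three URL patterns come from the text above.

```ruby
# Hypothetical clue table. The patterns come from the examples above;
# the names are made up for illustration.
GENERIC_CLUES = {
  'Admin-Panel' => %r{/admin/}i,
  'Login-Page'  => %r{/login/}i,
  'Backdoor'    => /\?cmd=/i
}.freeze

# Returns the name of every clue whose pattern appears in the URL.
# A hit is only an indication, not proof that the application is present.
def generic_clues(url)
  GENERIC_CLUES.select { |_name, pattern| url =~ pattern }.keys
end

puts generic_clues('http://example.com/admin/index.php').inspect
puts generic_clues('http://example.com/shell.php?cmd=id').inspect
```

Keeping all of the clues in one table is what makes a single multi-pattern plugin reasonable here, since none of the patterns identifies a specific product.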

It is also acceptable to write plugins which return different models for hardware. It is not feasible to write a different plugin for every model.

Output becomes a wall of text

We now have numerous plugins which return a file path from the source of an HTML element. For example,

  • Meta-Refresh
  • Meta-Author
  • Meta-Generator
  • Redirect-Location
  • Frame
  • RSS-Feed
  • Mailto
  • Title
  • Script
  • Shortcut Icon

These types of plugins are great for plugin development, data mining or noticing patterns across networks.

The problem is the WhatWeb output becomes a massive wall of text, even in --log-brief mode.

One way around this is by putting these types of plugins in a "plugin development" category and allowing the user to enable/disable certain categories.

Alternatively you could suppress these plugins from the --log-brief mode but still write to the log file.

For now most of these plugins are in the "plugins-disabled" directory.

Another alternative solution is a new output format combined with plugin categories.


Categories

Should plugins be categorized? If so, should they be layered?

Aung Khant: It would be great if WhatWeb supported scanning by categories in the future.

  • Server
  • Language
  • Program
  • Third Party Library

or

  • HTML Elements
  • Program
  • Vendor
  • Server
  • Development
  • Config/Log files

or

  • HTTP Server. Apache, Nginx
  • Language. PHP, ASP, ASP.NET, ColdFusion
  • Framework. Cake, Zend, Ruby on Rails (can you tell this from the language and CMS?)
  • CMS/Blog. WordPress, Joomla, Drupal
  • JS Library. Scriptaculous, Prototype, jQuery, Google Analytics
  • Hardware devices. Xerox Printers, Cisco routers, D-link cameras
  • Common. Title, Subdomains, Uncommon-headers, X-Powered-By, Mailto
  • Hashes. Header-hash, footer-hash

I (Andrew) like the above categories best, but the list is far from complete. The first categories break down nicely into an OSI-like set of layers. The 'hardware devices' category should be considered as covering all layers from the server to the JS library. The common category defines plugins that apply to all types of websites, not necessarily plugins that are commonly found. The hashes are kept separate from the common plugins because hashes are primarily used to discover common content after a scan, and a user may wish to disable these.

Here is a set of categories from builtwith.com:

  • Ads
  • Analytics
  • Blog
  • CDN
  • CMS
  • DocInfo
  • Ecommerce
  • Encoding (utf-8, big5)
  • Feeds (feed types and feed providers)
  • Framework (includes languages and frameworks)
  • JS (javascript libraries, not including analytics)
  • Media (Media provider such as youtube)
  • Server
  • Software (operating systems)
  • Widgets

Perhaps a way for us to define better categories is to discuss what is wrong with the builtwith approach. Some problems are:

  • Encoding should be a plugin value, not a plugin
  • Ecommerce has a lot of CMSs
  • Blogs and CMSs have crossover, such as WordPress

Some notes: the Analytics category could be included in JS, but it's better for it to have its own category.

How should categories for plugins be defined?

  • option 1) define the category within the plugin's file
  • option 2) define the category by the directory it is within
  • option 3) a list of tags within the plugin file

Option 2 works well for multiple categories when used with symlinks.
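
As a rough sketch of option 2, the category can be read straight from the directory name. The plugins/<category>/<plugin>.rb layout, the example paths and the categories_by_plugin helper are all assumptions for illustration, not WhatWeb's actual layout or code.

```ruby
# Assumed layout: plugins/<category>/<plugin>.rb, with a symlink in every
# extra category directory that a plugin belongs to.
def categories_by_plugin(paths)
  index = Hash.new { |hash, key| hash[key] = [] }
  paths.each do |path|
    category = File.basename(File.dirname(path))
    plugin   = File.basename(path, '.rb')
    index[plugin] << category unless index[plugin].include?(category)
  end
  index
end

# Illustrative paths; a real run might use Dir.glob('plugins/*/*.rb').
paths = [
  'plugins/cms/wordpress.rb',
  'plugins/blog/wordpress.rb', # would be a symlink to the cms copy
  'plugins/server/apache.rb'
]
index = categories_by_plugin(paths)
puts index['wordpress'].inspect
puts index['apache'].inspect
```

Because the symlinked copy shows up as an ordinary path, a plugin picks up a second category without any change to the plugin file itself.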

Categorization Trial

bcoles: I've trialled categorizing the plugins to determine the easiest way to implement directory based categorization. I know I've dropped a few plugins in the wrong directories. If you want to meticulously re-categorize 500+ plugins then be my guest. Here's what I came up with: http://whatweb.net/plugins-categorized.zip

Issues I faced while categorizing:

  • About 50 plugins (~10%) are in ./plugins/misc category

  • Should proxy servers go under ./plugins/misc or ./plugins/server? Should HTTP and Proxy servers be in separate directories?

  • ./plugins/vendor could perhaps be split into ./plugins/hosting and ./plugins/third-party, however ./plugins/client-side libraries are often third-party resources as well.

  • ./plugins/web-app is rather vague as a category. About 400 (~60%) plugins are in this category. Splitting this category results in duplication of content which is probably best solved with symlinks?

  • Do we need sub categories? Do we need more categories? Is that over specializing (I think it would be)?

  • Indecision - For example, where does WebDAV belong?


Command Line Options

Suppress 404s

Users can just grep the output for 200, or use grep -v 404.

Follow frames

Many websites still use frames on intro pages. A --follow-frames option would allow WhatWeb to grab these URLs instead of being stuck trying to fingerprint an HTML frameset.

Should frames be followed by default? Should off-site frames be ignored, or should that be a configurable option?

Andrew: This can be configured with --follow-frames off,on(on-site only),always. Is using on for on-site a bad choice? The alternative is onsite instead of on.

--follow-frames never,frame-only,iframe-only,same-site,same-domain,always (default: same-domain)

Follow Redirect

--follow-redirect never,http-only,meta-only,same-site,same-domain,always (default: always)
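
The host-based values shared by both options could be decided along these lines. This is only a sketch: the follow? and naive_domain helpers are hypothetical, "same-site" is assumed to mean an identical hostname, and "same-domain" is assumed to mean the same last two hostname labels. The frame-only, iframe-only, http-only and meta-only values are left out because they depend on how the link was found rather than on where it points.

```ruby
require 'uri'

# Naive "domain": the last two labels of the host. This is an assumption
# for illustration and is wrong for suffixes such as .co.uk.
def naive_domain(host)
  host.downcase.split('.').last(2).join('.')
end

# policy is one of: never, same-site, same-domain, always.
# Both URLs are expected to be absolute.
def follow?(policy, from_url, to_url)
  from = URI.parse(from_url).host
  to   = URI.parse(to_url).host
  case policy
  when 'never'       then false
  when 'always'      then true
  when 'same-site'   then from.downcase == to.downcase
  when 'same-domain' then naive_domain(from) == naive_domain(to)
  else raise ArgumentError, "unknown policy: #{policy}"
  end
end

puts follow?('same-site', 'http://www.example.com/', 'http://cdn.example.com/frame.html')
puts follow?('same-domain', 'http://www.example.com/', 'http://cdn.example.com/frame.html')
```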


How to scan websites that need authentication?

Types of authentication to potentially support:

  • HTTP Basic Authentication
  • HTTP Digest Authentication
  • URL parameter with session token
  • HTTP Cookies
  • SSL Certificate Support
  • HTTP Forms with passwords

Curl supports these and it might make sense for WhatWeb to copy curl's command line syntax.

One method, not necessarily a good one, is to load WhatWeb with username and password combinations which it will try whenever it discovers a password prompt.

bcoles: Using HTTP authorization for WWW-Authenticate would be nice for fingerprinting devices with default credentials. For example:

  • About 12,383 ShodanHQ results for WWW-Authenticate: Basic realm="Default: admin/1234"

admin:1234@target.com usage would be ideal. Alternatively --http-user and --http-pass could be used.
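
A small sketch of how the admin:1234@target.com form could be turned into an HTTP Basic Authorization header follows. The basic_auth_from_target helper is hypothetical, and the sketch assumes targets without a scheme should be treated as http.

```ruby
require 'uri'

# Splits credentials out of a target such as admin:1234@target.com and builds
# the matching HTTP Basic Authorization header value.
# Assumption: a target without a scheme is treated as http.
# Percent-encoded characters in the credentials are not decoded here.
def basic_auth_from_target(target)
  target = "http://#{target}" unless target =~ %r{\A[a-z][a-z0-9+.-]*://}i
  uri = URI.parse(target)
  return [uri.host, nil] if uri.user.nil?

  credentials = "#{uri.user}:#{uri.password}"
  # pack('m0') is Base64 without line breaks.
  [uri.host, "Basic #{[credentials].pack('m0')}"]
end

host, header = basic_auth_from_target('admin:1234@target.com')
puts host
puts "Authorization: #{header}"
```

Separate --http-user and --http-pass options would feed the same credentials string, so both styles could share this step.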


Perhaps escape [ ] brackets in output? (Solved)

For example, <title>SMC[231] Console</title> would currently return Title[SMC[231] Console], but perhaps the [ ] should be escaped with backslashes or URL encoded: Title[SMC%5B231%5D Console]

Andrew: The solution is to use the URL encoded version of the square brackets and all characters with an ASCII code of 1 to 31. This fixes output problems too.
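
The escaping described here can be sketched in a few lines. The escape_plugin_value name is hypothetical, and the sketch assumes a literal % is left untouched because the note above does not mention it.

```ruby
# Percent-encodes square brackets and ASCII control characters 1 to 31.
# Assumption: a literal "%" is left as-is, since the note above does not
# say whether it should be encoded too.
def escape_plugin_value(value)
  value.gsub(/[\[\]\x01-\x1f]/) { |char| format('%%%02X', char.ord) }
end

puts "Title[#{escape_plugin_value('SMC[231] Console')}]"
# prints: Title[SMC%5B231%5D Console]
```

Leaving % unencoded keeps the output readable, but it means an escaped value cannot always be decoded back unambiguously.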


Should :text=> matches be case insensitive? (Solved)

bcoles: Case insensitive matches are currently done with regex. Regex matches are slower than text matches, however making text matches case insensitive would require changing many of the text matches to regex. Text matches are case sensitive for this reason.


POST data

Aung Khant: Some frameworks issue a unique error response when we send an invalid POST request.

:url_post=>'/', :post_data=>'null=null'

bcoles: POST can be achieved with custom Ruby, but POST request support would be worth adding. Support for OPTIONS requests may also be useful, for example for WebDAV.
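
In custom Ruby, requests like these can be built with the standard Net::HTTP library. The sketch below only constructs the requests and does not send them; the build_probe helper is hypothetical and is not how WhatWeb handles :url_post or :post_data.

```ruby
require 'net/http'

# Builds (but does not send) a probe request. The null=null body mirrors the
# example above; the helper itself is made up for illustration.
def build_probe(http_method, path, body = nil)
  request =
    case http_method
    when :post    then Net::HTTP::Post.new(path)
    when :options then Net::HTTP::Options.new(path)
    else raise ArgumentError, "unsupported method: #{http_method}"
    end
  if body
    request.body = body
    request['Content-Type'] = 'application/x-www-form-urlencoded'
  end
  request
end

post    = build_probe(:post, '/', 'null=null')
options = build_probe(:options, '/')
puts "#{post.method} #{post.path} #{post.body}"
puts "#{options.method} #{options.path}"
# To send one: Net::HTTP.start(host, 80) { |http| http.request(post) }
```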

Should WhatWeb exploit vulnerabilities to test them?

Andrew: No. Not yet at least. I want good coverage of plugins to identify systems first, including aggressive plugins to detect exact version numbers. Plugins that test for vulnerabilities, if or when they are introduced, should be at a different aggression level, maybe 5.

How should WhatWeb save/store webpages?

Add an option to save HTML files and headers, with an optional folder. How should the files be saved?

Option 1 (hostnames backwards by TLD, IPs forwards by octet). For login.yahoo.com and 208.51.4.1 you get:

  • com/yahoo/login/head and download/com/yahoo/login/body
  • 208/51/4/1/head and 208/51/4/1/body

Option 2 (URL, dots become -, every special character not allowed in a filename is converted to something?):

  • login-yahoo-com_index-html.head
  • login-yahoo-com_index-html.body

Option 3 (MD5 hash of the URL, this is kind of brutal):

  • 9e107d9d372bb6826bd81d3542a419d6.head
  • 9e107d9d372bb6826bd81d3542a419d6.body

Option 4 (URL encode every special character after the hostname. Should dots remain dots?):

  • login.yahoo.com%2findex.html.head
  • login.yahoo.com%2findex.html.body

Thoughts:

  • Large sets: split the hostnames across directories (option 1)
  • Small sets: one directory for all hosts (keep the dots)
  • URL encode every special character for the path
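
Options 1 and 4 above could be sketched as follows. Both helper names are hypothetical, and the set of characters left unencoded in option 4 (letters, digits, dot, hyphen, underscore) is an assumption, since the question of what to do with dots is still open.

```ruby
require 'uri'

# Option 1: hostname labels reversed (TLD first), IPv4 octets kept in order.
def directory_for_host(host)
  labels = host.split('.')
  labels = labels.reverse unless host =~ /\A\d{1,3}(\.\d{1,3}){3}\z/
  labels.join('/')
end

# Option 4: keep the hostname as-is and percent-encode every character after
# it that is not a letter, digit, dot, hyphen or underscore (assumed set).
def flat_name_for_url(url)
  uri = URI.parse(url)
  rest = uri.request_uri.gsub(/[^A-Za-z0-9._-]/) do |char|
    char.bytes.map { |byte| format('%%%02x', byte) }.join
  end
  uri.host + rest
end

puts directory_for_host('login.yahoo.com') + '/head'
puts directory_for_host('208.51.4.1') + '/body'
puts flat_name_for_url('http://login.yahoo.com/index.html') + '.head'
```

The two helpers could be combined for large scans, using option 1 for the directory and the encoded path for the file name.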

There should also be options for saving to databases such as GridFS.