bcoles edited this page Apr 7, 2011 · 52 revisions

Suggestions, feature requests and discussion go here.


Direction

Things have been going well. WhatWeb 0.4.5 is a good, stable tool and has earned community recognition.

We've been tearing webpages apart and fingerprinting them piece by piece. We've built plugins for many web applications, client side libraries and HTML elements, but now we have a few important issues to consider regarding WhatWeb's direction.

Design Philosophy

  • Always use an intuitive interface. Never force a user to choose an option when a default is better. The following command must always work: ./whatweb slashdot.org
  • Never take choices away from the user. Each automatic decision should be a default for a configurable option. Example: following redirects.
  • Avoid premature over-engineering. Do not implement core code to handle types of information that few plugins currently return. Allow plugins to return the information in generic formats such as :string instead. Wait until many plugins are returning the same type of information, such as operating system, file paths, versions, or modules, before considering how to solve this problem in the core. Premature over-engineering is the type of error that kills a project.
  • When a solution to a problem is inelegant, do not implement it in WhatWeb. Instead, continue to meditate on the problem for as long as required. If you need a fast solution, hack up your own version of WhatWeb and do not introduce the patch into the core. I have done this many times.
  • WhatWeb must grow horizontally and vertically together. WhatWeb must be good at solving one type of problem before entering a new area. For example, WhatWeb must be competent at identifying a system before it starts becoming good at identifying versions of systems. If WhatWeb is known to be patchy in its coverage, this could kill the project. This is the rationale behind not implementing security checks yet. It also fits the Unix philosophy of doing one thing and doing it really well.
  • Breaking backwards compatibility is OK.

Multi-App plugins

We're at a fork in the road on this one. On one side, we can fingerprint each application individually and write a plugin for each one. On the other, we can incorporate many different applications of the same type into one plugin, for example all third-party JavaScript libraries.

bcoles: This isn't really a design issue so much. I'm in favor of categorizing plugins rather than combining multiple applications into a single plugin. I'd rather see output Google-Analytics[713526426] than Third-Party-Library[Google-Analytics[713526426]]

Exceptions:

An exception would be fingerprinting generic web apps, such as admin panels or web backdoors: applications that can only be fingerprinted generically using subtle clues, such as "/admin/", "/login/" or "?cmd=" in the URL. A match doesn't necessarily mean that an admin panel or backdoor is present, but it's a good indication.

It is also acceptable to write plugins which return different models of hardware. It is not feasible to write a different plugin for every model.
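
As a rough illustration of the generic URL clues mentioned above, the following Ruby sketch checks a URL against the three example patterns. The constant name, method name and labels are hypothetical and are not part of WhatWeb's plugin API.

```ruby
# Hypothetical clue table: the patterns come from the examples in the text,
# the labels are illustrative only.
GENERIC_CLUES = {
  'Admin-Panel'  => %r{/admin/}i,
  'Login-Page'   => %r{/login/}i,
  'Web-Backdoor' => /\?cmd=/i
}.freeze

# Return the labels of every clue found in the URL. A hit is only an
# indication, not proof, that such an application is present.
def generic_clues(url)
  GENERIC_CLUES.select { |_label, pattern| url =~ pattern }.keys
end

puts generic_clues('http://example.com/admin/index.php').inspect
puts generic_clues('http://example.com/shell.php?cmd=id').inspect
```

Keeping all the clues in one table is what makes a single multi-match plugin reasonable here, since none of the patterns identifies a specific product.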

Output becomes a wall of text

We now have numerous plugins which return a file path from the source of an HTML element. For example,

  • Meta-Refresh
  • Meta-Author
  • Meta-Generator
  • Redirect-Location
  • Frame
  • RSS-Feed
  • Mailto
  • Title
  • Script
  • Shortcut Icon

These types of plugins are great for plugin development, data mining or noticing patterns across networks.

The problem is the WhatWeb output becomes a massive wall of text, even in --log-brief mode.

One way around this is by putting these types of plugins in a "plugin development" category and allowing the user to enable/disable certain categories.

For now most of these plugins are in the "plugins-disabled" directory.

One solution is a new output format combined with plugin categories (see below).


Categories

Should plugins be categorized? If so, should they be layered?

Aung Khant: It would be great if WhatWeb supported scanning by categories in the future.

  • Server
  • Language
  • Program
  • Third Party Library

or

  • HTML Elements
  • Program
  • Vendor
  • Server
  • Development
  • Config/Log files

or

  • HTTP Server. Apache, Nginx
  • Language. PHP, ASP, ASP.NET, ColdFusion
  • Framework. Cake, Zend, Ruby on Rails (can you tell this from the language and CMS?)
  • CMS/Blog. WordPress, Joomla, Drupal
  • JS Library. Scriptaculous, Prototype, jQuery, Google Analytics
  • Hardware devices. Xerox Printers, Cisco routers, D-link cameras
  • Common. Title, Subdomains, Uncommon-headers, X-Powered-By, Mailto
  • Hashes. Header-hash, footer-hash

I (Andrew) like the above categories best, but the list is far from complete. The first categories break down nicely into an OSI-like set of layers. The 'hardware devices' category should be considered as covering all layers from the server to the JS library. The common category defines plugins that are common to all types of websites, not necessarily commonly found plugins. The hashes are kept separate from the common plugins because hashes are primarily used to discover common content after a scan, and a user may wish to disable these.

Here is a set of categories from builtwith.com:

  • Ads
  • Analytics
  • Blog
  • CDN
  • CMS
  • DocInfo
  • Ecommerce
  • Encoding (utf-8, big5)
  • Feeds (feed types and feed providers)
  • Framework (includes languages and frameworks)
  • JS (javascript libraries, not including analytics)
  • Media (Media provider such as youtube)
  • Server
  • Software (operating systems)
  • Widgets

Perhaps a way for us to define better categories is to discuss what is wrong with the builtwith approach. Some problems are:

  • Encoding should be a plugin value, not a plugin
  • Ecommerce contains a lot of CMSs
  • Blogs and CMSs overlap, for example WordPress

A note: the Analytics category could be included in JS, but it's better for it to have its own category.

How should categories for plugins be defined?

  • option 1) define the category within the plugin's file
  • option 2) define the category by the directory it is within
  • option 3) a list of tags within the plugin file

Option 2 works well for multiple categories when used with symlinks.
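
A minimal sketch of option 2, assuming plugin files live at paths of the form plugins/<category>/<plugin>.rb. The method name and the example paths are hypothetical; a plugin that is symlinked into several category directories simply shows up once per directory.

```ruby
# Derive plugin categories from the directory layout (option 2).
# Assumption: each path looks like "plugins/<category>/<plugin>.rb".
def categories_by_plugin(paths)
  result = Hash.new { |hash, name| hash[name] = [] }
  paths.each do |path|
    parts = path.split('/')
    next if parts.length < 2
    category = parts[-2]
    plugin   = File.basename(parts[-1], '.rb')
    result[plugin] << category unless result[plugin].include?(category)
  end
  result
end

paths = [
  'plugins/cms/wordpress.rb',
  'plugins/blog/wordpress.rb', # imagined symlink to the file above
  'plugins/server/apache.rb'
]
categories_by_plugin(paths).each do |plugin, categories|
  puts "#{plugin}: #{categories.join(', ')}"
end
```

In a real tree the path list could come from something like Dir.glob('plugins/*/*.rb'); the hard-coded array just keeps the example self-contained.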

Categorization Trial

bcoles: I've trialled categorizing the plugins to determine the easiest way to implement directory based categorization. I know I've dropped a few plugins in the wrong directories. If you want to meticulously re-categorize 500+ plugins then be my guest. Here's what I came up with : http://whatweb.net/plugins-categorized.zip

Categories:

client-side
framework
hardware
host
language
misc
server
vendor
web-app

Issues I faced while categorizing :

  • About 50 plugins (~10%) are in ./plugins/misc category
  • Should proxy servers go under ./plugins/misc or ./plugins/server? Should HTTP and Proxy servers be in separate directories?
  • ./plugins/vendor could perhaps be split into ./plugins/hosting and ./plugins/third-party, however ./plugins/client-side libraries are often third-party resources as well.
  • ./plugins/web-app is rather vague as a category. About 400 (~60%) plugins are in this category. Splitting this category results in duplication of content which is probably best solved with symlinks?
  • Do we need sub categories? Do we need more categories? Is that over specializing (I think it would be)?
  • Indecision - For example, where does WebDAV belong?

Command Line Options

Suppress 404s

Users can just pipe the output through grep 200 or grep -v 404.

Follow frames

Many websites still use frames on intro pages. A --follow-frames option would allow WhatWeb to grab these URLs instead of being stuck trying to fingerprint an HTML frameset.

Should frames be followed by default? Should following off-site frames be ignored or be a configurable option?

Andrew: this could be configured with --follow-frames off,on(on-site only),always

Is using on for on-site a bad choice? The alternative is onsite instead of on.

--follow-frames never,frame-only,iframe-only,same-site,same-domain,always (default: same-domain)

bcoles: It should function like the --follow-redirect option; that is:

--follow-frames=WHEN    Control when to follow frames. WHEN may be `never',
                        `frame-only', `iframe-only', `same-site', `same-domain'
                        or `always'. Default: never

I'm undecided on whether never or same-site is the best default.
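
The WHEN values above could be evaluated along these lines. This is only a sketch: the method names are hypothetical, and the meanings of same-site (identical host) and same-domain (same last two host labels) are assumptions, since the text does not define them.

```ruby
require 'uri'

# Assumption: "same-domain" is approximated by comparing the last two host
# labels, which is wrong for suffixes such as .co.uk.
def base_domain(host)
  host.split('.').last(2).join('.')
end

# Decide whether a frame should be followed.
# policy    - one of the WHEN values proposed above
# page_url  - URL of the page containing the frame
# frame_url - absolute URL taken from the frame's src attribute
# tag       - 'frame' or 'iframe'
def follow_frame?(policy, page_url, frame_url, tag)
  page_host  = URI.parse(page_url).host.to_s.downcase
  frame_host = URI.parse(frame_url).host.to_s.downcase
  case policy
  when 'never'       then false
  when 'always'      then true
  when 'frame-only'  then tag == 'frame'
  when 'iframe-only' then tag == 'iframe'
  when 'same-site'   then page_host == frame_host
  when 'same-domain' then base_domain(page_host) == base_domain(frame_host)
  else raise ArgumentError, "unknown policy: #{policy}"
  end
end

puts follow_frame?('same-domain', 'http://www.example.com/', 'http://cdn.example.com/intro.html', 'frame')
```

Whichever default is chosen, keeping the decision in one predicate means the default is just the initial value of the option, in line with the design philosophy.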


How to scan websites that need authentication?

Types of authentication to potentially support:

  • HTTP Basic Authentication
  • HTTP Digest Authentication
  • URL parameter with session token
  • HTTP Cookies
  • SSL Certificate Support
  • HTTP Forms with passwords

Curl supports these and it might make sense for WhatWeb to copy curl's command line syntax.
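
For the two simplest cases, HTTP Basic Authentication and cookies, a request could be prepared with Ruby's standard Net::HTTP library roughly as follows. The method name and keyword arguments are hypothetical, and the request is only built, not sent.

```ruby
require 'net/http'
require 'uri'

# Build (but do not send) a GET request carrying optional Basic credentials
# and an optional Cookie header. Parameter names are hypothetical.
def authenticated_request(url, user: nil, password: nil, cookie: nil)
  request = Net::HTTP::Get.new(URI.parse(url))
  request.basic_auth(user, password) if user
  request['Cookie'] = cookie if cookie
  request
end

req = authenticated_request('http://192.0.2.1/',
                            user: 'admin', password: 'admin',
                            cookie: 'session=abc123')
puts req['Authorization']
puts req['Cookie']
```

This corresponds to what curl does with its -u user:password and -b cookie options, which is the syntax the paragraph above suggests copying.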

One method, not necessarily a good one, is to load WhatWeb with username and password combinations which it will try whenever it discovers a password prompt.

Using HTTP authorization would be nice for fingerprinting devices with default credentials. This belongs in aggression level 5 which has not yet been implemented.


POST data

Aung Khant: Some frameworks issue a unique error response when we send an invalid POST request

:url_post=>'/', :post_data=>'null=null'

bcoles: POST can be achieved with custom Ruby, but POST request support would be worth adding. Support for OPTIONS requests may also be useful, for example for WebDAV.
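
As a sketch of what the core might do with the proposed :url_post and :post_data keys, the following builds POST and OPTIONS requests with Net::HTTP without sending them. The method names are hypothetical.

```ruby
require 'net/http'
require 'uri'

# Build a POST request from the two proposed plugin keys (not sent here).
def build_post(base_url, url_post, post_data)
  request = Net::HTTP::Post.new(URI.join(base_url, url_post))
  request.body = post_data
  request['Content-Type'] = 'application/x-www-form-urlencoded'
  request
end

# Build an OPTIONS request, for example to inspect the Allow header
# for WebDAV methods once a response is received.
def build_options(base_url)
  Net::HTTP::Options.new(URI.parse(base_url))
end

post = build_post('http://example.com/', '/', 'null=null')
puts "#{post.method} #{post.path} #{post.body}"
puts build_options('http://example.com/').method
```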


Should WhatWeb exploit vulnerabilities to test them?

Andrew: No. Not yet at least. I want good coverage of plugins to identify systems first including aggressive plugins to detect exact version numbers. Plugins that test for vulnerabilities if or when introduced should be at a different aggression level, maybe 5.


Returning Data

According to the WhatWeb design philosophy: avoid premature over-engineering. Do not implement core code to handle types of information that few plugins currently return.

The following are candidates as data-types for plugins to return (such as :version, :string, :firmware, etc) as it may be useful to separate them from results in :string=> :

  • :hostname=>
    • Internal host name - not widely used
  • :ip=>
    • Used for internal IP addresses and the IP plugin - not widely used
  • :mac=>
    • MAC address - not widely used
  • :year=>
    • The age of an installation can often be roughly determined by the year(s) in copyright messages. Several plugins report the year.
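
To illustrate the :year=> candidate, here is one possible way to pull years out of copyright messages. The regular expression and names are assumptions and will miss unusual notice formats.

```ruby
# Match a copyright marker and up to 40 following characters on the same
# line, stopping at the next tag. This pattern is an assumption.
COPYRIGHT_NOTICE = /(?:copyright|&copy;|\(c\))[^<\n]{0,40}/i

# Return the sorted, de-duplicated four-digit years found in such notices.
def copyright_years(html)
  html.scan(COPYRIGHT_NOTICE)
      .flat_map { |notice| notice.scan(/\b(?:19|20)\d{2}\b/) }
      .uniq
      .sort
end

puts copyright_years('<p>Copyright 2003-2009 Example Corp</p>').inspect
```

Returning every year rather than only the latest one keeps the raw data available; deciding what the years say about the age of an installation can then happen after the scan.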

How should WhatWeb save/store webpages?

Add an option to save HTML files and headers, with an optional folder. How should the files be named?

option 1 (hostnames backwards by TLD, IP addresses forwards by octet): for login.yahoo.com and 208.51.4.1 you get com/yahoo/login/head and download/com/yahoo/login/body, 208/51/4/1/head and 208/51/4/1/body

option 2 (url, dots become -, every special char not allowed in a filename is converted to something? ) login-yahoo-com_index-html.head login-yahoo-com_index-html.body

option 3 (md5 hash of url, this is kind of brutal) 9e107d9d372bb6826bd81d3542a419d6.head 9e107d9d372bb6826bd81d3542a419d6.body

option 4 (URL encode every special character after the hostname. should dots remain dots?) login.yahoo.com%2findex.html.head login.yahoo.com%2findex.html.body

Thoughts:

  • Large sets: split the hostnames across directories (option 1).
  • Small sets: one directory for all hosts (keep the dots) and URL encode every special character in the path.

There should also be options for saving to databases such as GridFS.
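
The naming options 1, 3 and 4 can be compared with a short Ruby sketch. The method names are hypothetical; note that Ruby's URI.encode_www_form_component emits upper-case hex (%2F rather than %2f) and leaves dots untouched.

```ruby
require 'digest/md5'
require 'uri'

# Option 1: hostnames reversed by label, IP addresses forwards by octet.
def path_by_host(host)
  labels = host.split('.')
  labels.reverse! unless host =~ /\A\d{1,3}(\.\d{1,3}){3}\z/
  labels.join('/')
end

# Option 3: MD5 hash of the full URL.
def name_by_md5(url)
  Digest::MD5.hexdigest(url)
end

# Option 4: keep the hostname as-is and URL-encode the path.
def name_by_encoding(host, path)
  host + URI.encode_www_form_component(path)
end

puts path_by_host('login.yahoo.com') + '/head'
puts path_by_host('208.51.4.1') + '/body'
puts name_by_md5('http://login.yahoo.com/index.html') + '.head'
puts name_by_encoding('login.yahoo.com', '/index.html') + '.body'
```

Options 1 and 4 are reversible, so the original host can be recovered from the file name; the MD5 name in option 3 is not, which is presumably why it is described as brutal.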


Custom Plugins

This feature should provide a gentle introduction to custom usage of WhatWeb and eventually lead into plugin writing.

Aims of the feature :

Reduce barrier to entry for custom searching with WhatWeb and remove the need for anyone to write this :

printf 'GET / HTTP/1.0\r\nHost: whatweb.net\r\n\r\n' | netcat whatweb.net 80 | grep -Eo "<title>([^<]+)<\/title>"

For example:

$ ./whatweb --custom-plugin "{:version=>/<title>([^<]+)<\/title>/i,:regexp_offset=>0}" whatweb.net

This option allows WhatWeb to act as a powerful, threaded, grep-powered platform for HTTP(S).

Unfortunately, the --custom-plugin option needs to be escaped and in some cases, such as :regexp=>//, needs to be double-escaped because it is parsed directly from the command line. This results in a complicated and unintuitive command line argument.

Splitting each match method up into its own command line argument would help reduce the complexity :

option 1 --custom-plugin-text, --custom-plugin-regex

option 2 --find-text, --find-regex, --find-md5

option 3 --match-text, --match-regex, --match-md5

option 4 --grep-text, --grep-regex, --grep-md5
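
Whichever naming is chosen, each flag would map onto one simple match type. The sketch below shows the three match types applied to a response body; the method and keyword names are hypothetical and do not reflect how WhatWeb's core evaluates matches.

```ruby
require 'digest/md5'

# Apply whichever match types were supplied to a response body and return
# the results keyed by type. Keyword names are hypothetical.
def custom_match(body, text: nil, regex: nil, md5: nil)
  results = {}
  results[:text]  = body.include?(text) unless text.nil?
  results[:regex] = body.scan(regex).flatten unless regex.nil?
  results[:md5]   = (Digest::MD5.hexdigest(body) == md5.downcase) unless md5.nil?
  results
end

body = '<html><head><title>WhatWeb</title></head></html>'
p custom_match(body, text: '<title>', regex: /<title>([^<]+)<\/title>/i)
```

Because each value arrives as its own argument, the user only has to quote a plain string or a single regular expression, which avoids the double-escaping problem described above.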


Extract Injection Points

Andre Gironda: I would love to see WhatWeb identify candidate insertion points for testing, especially marking insertion points that are user-controllable HTML element attributes

bcoles: Any suggestions on how the results for insertion candidates should be formatted?

Andre Gironda: ProxMon and Casaba Watcher tools do it right - they are open-source

bcoles: This could be achieved with a plugin. Something like :

  • GET params: split base_uri by ? then &
    • Extract params from /base_uri[^'"]+\?([^=]+)=([^&]+)/
  • POST params: look for <form> and <input> and extract the id
  • Elements: grep for the GET param values and extract the relevant HTML element type
    • Will most likely result in false positives unless non-default GET parameter values are sent
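
The three steps above could be prototyped outside WhatWeb roughly as follows. This is a naive, regex-based sketch with hypothetical method names; a robust version would need a proper HTML parser, and the element step is only meaningful when a distinctive, non-default parameter value has been sent.

```ruby
require 'uri'

# GET parameters: parse the query string and keep the parameter names.
def get_params(url)
  query = URI.parse(url).query
  query ? URI.decode_www_form(query).map(&:first) : []
end

# POST parameters: naive scan for the name or id attribute of <input> tags.
def input_names(html)
  html.scan(/<input\b[^>]*?\b(?:name|id)=["']([^"']+)["']/i).flatten.uniq
end

# Elements: list tags with a quoted attribute value that echoes the given
# GET parameter value.
def reflecting_tags(html, value)
  pattern = /<(\w+)\b[^>]*=["'][^"'>]*#{Regexp.escape(value)}[^"'>]*["'][^>]*>/i
  html.scan(pattern).flatten.uniq
end

html = '<a href="/search?q=zzcanary">x</a>' \
       '<form><input type="text" name="q" value="zzcanary"></form>'
p get_params('http://example.com/search?q=zzcanary&page=2')
p input_names(html)
p reflecting_tags(html, 'zzcanary')
```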