Cheerio's version of node-soupselect.
JavaScript
tag: v0.1.0

deps
lib
testdata
tests
.gitmodules
README.md
example.js
package.json
test.js

README.md

node-soupselect

A port of Simon Willison's soupselect for use with node.js and node-htmlparser.

I wanted a friendly way to scrape HTML using node.js. I tried jsdom, prompted by this article, but unfortunately jsdom takes a strict view of lax HTML, which makes it unusable for scraping the kind of tag soup found in real-world web pages. Luckily, htmlparser is more forgiving.

A complete example, including fetching the HTML:

var select = require('soupselect').select,
    htmlparser = require('htmlparser'),
    http = require('http');

// fetch some HTML...
var host = 'www.reddit.com';
var request = http.request({ host: host, port: 80, path: '/', method: 'GET' });

request.on('response', function (response) {
    response.setEncoding('utf8');

    // accumulate the body as chunks arrive
    var body = '';
    response.on('data', function (chunk) {
        body += chunk;
    });

    response.on('end', function () {

        // now we have the whole body, parse it and select the nodes we want...
        var handler = new htmlparser.DefaultHandler(function (err, dom) {
            if (err) {
                console.error('Error: ' + err);
            } else {

                // soupselect happening here...
                var titles = select(dom, 'a.title');

                console.log('Top stories from reddit');
                titles.forEach(function (title) {
                    // first child is the link text; attribs holds the tag's attributes
                    console.log('- ' + title.children[0].raw + ' [' + title.attribs.href + ']\n');
                });
            }
        });

        var parser = new htmlparser.Parser(handler);
        parser.parseComplete(body);
    });
});

// report network failures instead of crashing on an unhandled 'error' event
request.on('error', function (err) {
    console.error('Request failed: ' + err.message);
});
request.end();
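
The example reads `title.children[0].raw` and `title.attribs.href` from each matched node. As a rough, network-free sketch of what such a node looks like, the snippet below builds one by hand in the shape node-htmlparser's DefaultHandler is documented to produce. The sample URL and link text are made up, and `formatStory` is a hypothetical helper, not part of soupselect; it only adds a guard for anchors that have no text child.

```javascript
// Hand-written node in the shape node-htmlparser's DefaultHandler emits for
// <a class="title" href="http://example.com/story">An example story</a>.
// The URL and text are invented for illustration.
var sampleTitle = {
    type: 'tag',
    name: 'a',
    raw: 'a class="title" href="http://example.com/story"',
    data: 'a class="title" href="http://example.com/story"',
    attribs: { 'class': 'title', href: 'http://example.com/story' },
    children: [
        { type: 'text', raw: 'An example story', data: 'An example story' }
    ]
};

// Hypothetical helper (not part of soupselect): formats one matched node the
// way the example does, but tolerates nodes without a text child or href.
function formatStory(node) {
    var first = node.children && node.children[0];
    var text = first && first.type === 'text' ? first.raw : '';
    var href = (node.attribs && node.attribs.href) || '';
    return '- ' + text + ' [' + href + ']';
}

console.log(formatStory(sampleTitle));
// prints "- An example story [http://example.com/story]"
```

In a real run, the nodes passed to a helper like this would come from `select(dom, 'a.title')` rather than a literal.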

Notes:

  • Requires node-htmlparser > 1.6.2 & node.js 2+
  • Calls to select are synchronous. Making them asynchronous isn't worth it IMO, given the use case