Emilio G.C edited this page Oct 10, 2017 · 177 revisions

Osmosis is a utility for easily extracting data from HTML or XML documents.

Command reference

These are all of the "commands" that are available for chaining in an Osmosis instance.
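As a rough illustration of chaining, the sketch below strings together several of the commands documented on this page. The `buildScraper` helper, the URL, and the selectors are hypothetical; the function takes the Osmosis module as an argument so that only the chain itself is shown.

```javascript
// Sketch only: `buildScraper`, the URL, and the selectors are made up for
// illustration. Pass in the module, e.g. buildScraper(require('osmosis')).
function buildScraper(osmosis) {
    return osmosis
        .get('http://example.com/articles')   // make an HTTP request
        .find('div.article')                   // find matching elements
        .set({
            title: 'a.title',                  // text of a.title
            url: 'a.title@href'                // href attribute of a.title
        })
        .data(function (article) {             // called with the data object
            console.log(article);
        })
        .log(console.log)                      // receive log messages
        .error(console.error)                  // receive error messages
        .done(function () {                    // parsing has finished
            console.log('finished');
        });
}
```

Each command returns something that the next command can be chained onto, so the order of the calls is the order in which a page is processed.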

click

( selector )

Click on nodes found by selector

contains

( string )

Discard any nodes whose contents do not match string

config

( opts )
( key, val )

Set HTTP options and configure Osmosis

data

( callback( data ) )

Calls callback with the current data object

( null )

Empty the data object

( object )

Add or replace each key in the data object with the corresponding value from object
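The three call forms of data can be modelled in plain JavaScript. This is a hypothetical sketch of the behaviour described above, not Osmosis source code; `applyData` is an illustrative name.

```javascript
// Hypothetical model of the three call forms of `data`; not Osmosis internals.
function applyData(current, arg) {
    if (typeof arg === 'function') {
        arg(current);                        // ( callback ): receives the data object
        return current;
    }
    if (arg === null) {
        return {};                           // ( null ): empty the data object
    }
    return Object.assign({}, current, arg);  // ( object ): add or replace keys
}
```

Under this model, passing `{ b: 3, c: 4 }` when the data object is `{ a: 1, b: 2 }` keeps `a`, replaces `b`, and adds `c`.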

debug

( callback( msg ) )

Call callback when any debug messages are received

delay

( seconds )

Delay the start of the next promise by seconds (a float or an int)

do

( osmosis..., osmosis... )

Call each Osmosis instance with the current context. This will always continue, even if an instance fails.

doc

Reset the current context to the Document

dom

( callback )

Create a DOM object from the current context.

The callback will be called with 3 arguments (window, data, and next). The next([context], [data]) function must be called at least once.

done

( callback )

Calls callback when parsing has completely finished

error

( callback( msg ) )

Call callback when any error messages are received

failure/fail

( selector )

Discard any nodes that match selector

filter/success

( selector )

Discard any nodes that do not match selector

find

( selector )

Find elements based on selector anywhere within the current document

follow

( [selector] )

Follow URLs found via selector. If selector isn't provided, follow will search the current element text or common URL attributes (href, src, etc.).

Examples:

.follow()
.follow('@href')
.follow('a')
.follow('a@href')
.follow('span.outlink')
.follow('input.cloneURL@value')
.follow('link[type="application/rss+xml"]@href')

get / post

( url , [data] , [opts] )

Make an HTTP request

url - A string containing a URL, which can be relative to the current context.

data (optional) - An object containing GET query parameters or POST request data.

opts (optional) - An object containing HTTP request options.


Note: needle URL-encodes query parameter values, so pass them unencoded. Values that are already URL-encoded will be encoded a second time.
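The effect of encoding a value twice can be seen with plain JavaScript. The snippet uses the built-in `encodeURIComponent` as a stand-in for whatever encoding needle applies, and the parameter value is hypothetical.

```javascript
// Illustration only: encodeURIComponent stands in for needle's encoding.
var raw = 'new york';                                // hypothetical parameter value
var encodedOnce = encodeURIComponent(raw);           // 'new%20york'
var encodedTwice = encodeURIComponent(encodedOnce);  // 'new%2520york'

// A server decoding once gets the original back only in the first case.
var serverSeesOnce = decodeURIComponent(encodedOnce);    // 'new york'
var serverSeesTwice = decodeURIComponent(encodedTwice);  // 'new%20york'
```

The `%` produced by the first pass is itself encoded as `%25` in the second, which is why pre-encoded values arrive at the server garbled.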

log

( callback( msg ) )

Call callback when any log messages are received

login

( user , pass , [success] , [fail] )

Submit a login form.

Arguments:

user - A string containing a username, email address, etc.

pass - A password string

success (optional) - A selector string determining if the login attempt succeeded

fail (optional) - A selector string determining if the login attempt failed


How it works

login finds the first form containing input[type="password"] and uses that input as the password field. It will use the preceding <input> element as the user field.
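The field lookup described above can be sketched as follows. This is a hypothetical model, not Osmosis internals: each form is represented as an array of `{ tag, type, name }` objects in document order, and `findLoginFields` is an illustrative name.

```javascript
// Hypothetical model: each form is an array of { tag, type, name } objects in
// document order. This mirrors the description above, not Osmosis internals.
function findLoginFields(forms) {
    for (var i = 0; i < forms.length; i++) {
        var fields = forms[i];
        for (var j = 0; j < fields.length; j++) {
            if (fields[j].tag === 'input' && fields[j].type === 'password') {
                // Walk backwards to the nearest preceding <input>.
                for (var k = j - 1; k >= 0; k--) {
                    if (fields[k].tag === 'input') {
                        return { user: fields[k].name, pass: fields[j].name };
                    }
                }
                return { user: null, pass: fields[j].name };
            }
        }
    }
    return null; // no form contains a password input
}
```

In this model, an unrelated input sitting directly before the password field would be treated as the user field, so it is worth checking the form markup when a login does not behave as expected.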

match

( [selector], RegExp )

Discard any nodes whose contents do not match RegExp

page / paginate

( selector , [limit] )

Paginate the previous request limit times based on selector.

selector:

selector (String) - A selector string for either:

  • an element with the next page URL in its inner text or in an attribute that commonly contains a URL (href, src, etc.)
  • an element whose name and value attributes will respectively be added or replaced in the next page query.

selector (Object) - An object where each key is a query parameter name and each value is either a selector string or an increment amount (+1, -1, etc.).

limit:

limit (Number) - Total number of "next page" requests to make.

limit (String) - A selector string for an element containing the total number of requests to make.


.paginate('a.nextPage') // go to `a.nextPage` `@href`
.paginate('link[rel="next"]@href') // go to `link` `@href`
.paginate('input[name="page"]') // update `page` parameter of the next query

// adds 20 to the `startIndex` query parameter
// sets `page` query parameter to `a.nextPage` content
// stops after 15 requests are made
.paginate({ startIndex: +20,  page: 'a.nextPage' }, 15)
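How the object form of selector might translate into the next page's query can be sketched in plain JavaScript. `nextPageQuery` and `lookup` are hypothetical names; `lookup` stands in for evaluating a selector string against the current page, and none of this is taken from the Osmosis source.

```javascript
// Hypothetical helper modelling the object form of `selector`.
// `lookup` stands in for evaluating a selector against the current page.
function nextPageQuery(query, spec, lookup) {
    var next = Object.assign({}, query);
    Object.keys(spec).forEach(function (key) {
        var rule = spec[key];
        if (typeof rule === 'number') {
            // Increment amount: add it to the current value (missing counts as 0).
            next[key] = (Number(query[key]) || 0) + rule;
        } else {
            // Selector string: use whatever the selector yields on this page.
            next[key] = lookup(rule);
        }
    });
    return next;
}
```

Note that in JavaScript `+20` is simply the number 20, so the leading plus sign in the example above is there for readability; a decrement is written as a negative number.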

pause / resume / stop

Pause, resume or stop an osmosis instance.

parse

( string )

Parse an HTML or XML string

Arguments:

string - A string or buffer containing the HTML/XML data

set

( name , [selector] )

Set name to the value of selector. If selector is omitted, name is set to the text of the current element.

( object )

Set each key to the value found by its selector.


.set('title') // set 'title' to current element text
.set('title', 'a.title') // set 'title' to text of 'a.title'
.set({ 
    title:  'a.title',
    description: 'p.description',
    url: 'a.permalink @href',
    images: ['img @src'],
    comments: [
        osmosis
        .follow('a.comments')
        .find('div.comment')
        .set({
            'author': '.author',
            'content': 'p.content',
            'date': '.date'
        })
    ]
});

submit

( selector , [data] )

Submit a form

Arguments:

selector - A selector for the <form> element or submit button.

data (optional) - An object where each key and value represents a form input name and value

then

( callback( context, data, [next], [done] ) )

Calls callback with the context of the current element.

context:

The context argument is the current context at that point in the command chain. If the previous command was get, post, follow, or parse, then the context will be a Document. If the previous command was find, then the context will be one of the Elements that were found.

data:

The data argument contains values set via osmosis.set. This object can be modified in any way.

next:

The next argument is a function that will call the next command. It takes two arguments: context and data.

done:

The done argument is a function to call once the callback will no longer call next. It is only required when the callback calls next asynchronously, however many times it does so.

Note: If the callback accepts done as an argument, it must always call done, even if next was never called.

Functions

The callback will have these functions bound to its this value:

  • this.request(method, url, [data], callback([err], context), [opts])
  • this.log(msg)
  • this.debug(msg)
  • this.error(msg)

Examples:

Example 1: find every ul > li and pass it to the next command

osmosis
...
.then(function(context, data, next) {
    var items = context.find('ul > li');
    items.forEach(function(item) {
        next(item, data);
    })
})

Example 2: set data.url to the current page URL

osmosis
...
.then(function(context, data, next) {
    data.url = context.doc().request.url;
    next(context, data);
})

Example 3: only continue if lastname != undefined

osmosis
...
.then(function(context, data, next) {
    if (data.lastname != undefined)
        next(context, data)
})

Example 4: using the done function

osmosis
...
.then(function(context, data, next, done) {
    if (db.connected == false) {
        this.error('database disconnected');
        done();
        return;
    }
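    var pending = data.someArray.length;
    if (pending === 0) {
        done();
        return;
    }
    data.someArray.forEach(function(obj) {
        db.save(obj, function() {
            next(context, data);
            // saves may finish out of order, so count completions
            if (--pending === 0)
                done();
        })
    })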
})