Emilio G.C edited this page Oct 10, 2017 · 177 revisions

Osmosis is a utility for easily extracting data from HTML or XML documents.

Command reference

These are all of the "commands" that are available for chaining in an Osmosis instance.

click

( selector )

Click on nodes found by selector

contains

( string )

Discard any nodes whose contents do not match string

config

( opts )
( key, val )

Set HTTP options and configure Osmosis

data

( callback( data ) )

Calls callback with the current data object

( null )

Empty the data object

( object )

Add or replace each key in the data object with the corresponding value from object
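
As an illustration of the callback form, the sketch below builds a small collector that could be passed to data. The makeCollector name is hypothetical, and copying each object before storing it is a choice made for this sketch, not something Osmosis is known to require.

```javascript
// Hypothetical helper: gathers every data object handed to the callback.
function makeCollector() {
    var results = [];
    return {
        results: results,
        // Pass this function to .data(); it receives the current data object.
        callback: function (data) {
            // Store a shallow copy so later changes to `data` do not alter saved rows.
            results.push(Object.assign({}, data));
        }
    };
}
```

On an existing chain this would be attached as .data(collector.callback), with collector.results read once parsing has finished.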

debug

( callback( msg ) )

Call callback when any debug messages are received

delay

( seconds )

Delay the start of the next promise by seconds (a float or an integer)

do

( osmosis..., osmosis... )

Call each Osmosis instance with the current context. This will always continue, even if an instance fails.

doc

Reset the current context to the Document

dom

( callback )

Create a DOM object from the current context.

The callback will be called with three arguments (window, data, and next). The next([context], [data]) function must be called at least once.

done

( callback )

Calls callback when parsing has completely finished

error

( callback( msg ) )

Call callback when any error messages are received

failure/fail

( selector )

Discard any nodes that match selector

filter/success

( selector )

Discard any nodes that do not match selector

find

( selector )

Find elements based on selector anywhere within the current document

follow

( [selector] )

Follow URLs found via selector. If selector isn't provided, follow searches the current element's text or common URL attributes (href, src, etc.).

Examples:

.follow()
.follow('@href')
.follow('a')
.follow('a@href')
.follow('span.outlink')
.follow('input.cloneURL@value')
.follow('link[type="application/rss+xml"]@href')

get / post

( url , [data] , [opts] )

Make an HTTP request

url - A string containing a URL, which can be relative to the current context.

data (optional) - An object containing GET query parameters or POST request data.

opts (optional) - An object containing HTTP request options.


Note: Query parameter values will be URL-encoded by needle, so pass raw values that have not already been URL-encoded.
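
To show why pre-encoded values cause trouble, the sketch below encodes the same string once and then a second time. It uses the built-in encodeURIComponent as a stand-in for the encoding step; needle's exact output may differ, and the sample string is made up.

```javascript
// Stand-in for the encoding step; needle's exact output may differ.
var raw = 'coffee & tea';
var encodedOnce = encodeURIComponent(raw);          // a single encoding pass
var encodedTwice = encodeURIComponent(encodedOnce); // result when the value was pre-encoded

console.log(encodedOnce);  // coffee%20%26%20tea
console.log(encodedTwice); // coffee%2520%2526%2520tea
```

The second form would reach the server as the literal text "coffee%20%26%20tea", which is why the data object should carry the raw string.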

log

( callback( msg ) )

Call callback when any log messages are received

login

( user , pass , [success] , [fail] )

Submit a login form.

Arguments:

user - A string containing a username, email address, etc.

pass - A password string

success (optional) - A selector string determining if the login attempt succeeded

fail (optional) - A selector string determining if the login attempt failed


How it works

login finds the first form containing input[type="password"] and uses that input as the password field. It will use the preceding <input> element as the user field.
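
The field selection described above can be modelled in a few lines. This is a simplified sketch and not the library's code: forms are represented as plain arrays of hypothetical { type, name } descriptors in document order, and pickLoginFields is an invented name.

```javascript
// Simplified model, not the library's code: each form is a list of
// { type, name } input descriptors in document order.
function pickLoginFields(forms) {
    for (var i = 0; i < forms.length; i++) {
        var inputs = forms[i];
        for (var j = 0; j < inputs.length; j++) {
            if (inputs[j].type === 'password') {
                return {
                    // The input just before the password field, if there is one.
                    user: j > 0 ? inputs[j - 1] : null,
                    pass: inputs[j]
                };
            }
        }
    }
    return null; // no form contains a password input
}
```

In this model a hidden input placed directly before the password field would be picked as the user field. Whether Osmosis behaves the same way is not stated here, so it is worth checking the form markup before relying on login.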

match

( [selector], RegExp )

Discard any nodes whose contents do not match RegExp

page / paginate

( selector , [limit] )

Paginate the previous request limit times based on selector.

selector:

selector (String) - A selector string for either:

  • an element with the next page URL in its inner text or in an attribute that commonly contains a URL (href, src, etc.)
  • an element whose name and value attributes will respectively be added or replaced in the next page query.

selector (Object) - An object where each key is a query parameter name and each value is either a selector string or an increment amount (+1, -1, etc.).

limit:

limit (Number) - Total number of "next page" requests to make.

limit (String) - A selector string for an element containing the total number of requests to make.


.paginate('a.nextPage') // go to `a.nextPage` `@href`
.paginate('link[rel="next"]@href') // go to `link` `@href`
.paginate('input[name="page"]') // update `page` parameter of the next query

// adds 20 to the `startIndex` query parameter
// sets `page` query parameter to `a.nextPage` content
// stops after 15 requests are made
.paginate({ startIndex: +20,  page: 'a.nextPage' }, 15)
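
The object form can be read as a rule for building the next page's query. The sketch below models that rule under stated assumptions: nextQuery and textOf are hypothetical names, a missing parameter is treated as 0 before incrementing, and the real implementation inside Osmosis may differ.

```javascript
// Model of the object form: numbers are increments, strings are selectors.
// `textOf` is a hypothetical lookup returning the text found by a selector.
function nextQuery(query, spec, textOf) {
    var result = Object.assign({}, query); // leave the previous query untouched
    Object.keys(spec).forEach(function (name) {
        var rule = spec[name];
        if (typeof rule === 'number') {
            // Increment amount: add it to the current value (assumed 0 if absent).
            var current = Number(result[name]) || 0;
            result[name] = current + rule;
        } else {
            // Selector string: take the new value from the page.
            result[name] = textOf(rule);
        }
    });
    return result;
}
```

Note that +20 is simply the number 20 in JavaScript, so in this model the distinction between an increment and a selector rests on the value's type.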

pause / resume / stop

Pause, resume or stop an osmosis instance.

parse

( string )

Parse an HTML or XML string

Arguments:

string - A string or buffer containing the HTML/XML data

set

( name , selector )

Set name to the value of selector

( object )

Set each key to the value found by its corresponding selector.


.set('title') // set 'title' to current element text
.set('title', 'a.title') // set 'title' to text of 'a.title'
.set({ 
    title:  'a.title',
    description: 'p.description',
    url: 'a.permalink @href',
    images: ['img @src'],
    comments: [
        osmosis
        .follow('a.comments')
        .find('div.comment')
        .set({
            'author': '.author',
            'content': 'p.content',
            'date': '.date'
        })
    ]
});
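
For context, set usually sits between a find and a data call. The sketch below shows one way to arrange that chain using only commands from this page. The URL, the selectors, and the buildListingScraper name are all hypothetical, and the Osmosis module is passed in as an argument so the block carries no hard dependency.

```javascript
// `osmosis` is passed in (for example, the result of require('osmosis')).
// The URL and selectors are placeholders, not a real site.
function buildListingScraper(osmosis, onItem) {
    return osmosis
        .get('https://example.com/articles') // load the listing page
        .find('div.article')                 // one context per article
        .set({
            title: 'a.title',                // text of a.title
            url: 'a.permalink @href'         // href attribute of a.permalink
        })
        .data(onItem);                       // hand each data object to the caller
}
```

Assuming the package is installed, this could be invoked as buildListingScraper(require('osmosis'), console.log).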

submit

( selector , [data] )

Submit a form

Arguments:

selector - A selector for the <form> element or submit button.

data (optional) - An object where each key and value represents a form input name and value

then

( callback( context, data, [next], [done] ) )

Calls callback with the context of the current element.

context:

The context argument is the current context at that point in the command chain. If the previous command was get, post, follow, or parse, then the context will be a Document. If the previous command was find, then the context will be one of the Elements that were found.

data:

The data argument contains values set via osmosis.set. This object can be modified in any way.

next:

The next argument is a function that will call the next command. It takes two arguments: context and data.

done:

The done argument is a function to call once the callback will make no further calls to next. It is only required if the callback calls next asynchronously.

Note: If the callback accepts done as an argument, it must always call done, even if next was never called.

Functions

The callback will have these functions bound to its this value:

  • this.request(method, url, [data], callback([err], context), [opts])
  • this.log(msg)
  • this.debug(msg)
  • this.error(msg)
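
As a short illustration of the bound functions, the callback below reports through this.log and this.error. The requireTitle name and the title field are hypothetical. It is written as a regular function rather than an arrow function, because an arrow function would not receive the bound this value.

```javascript
// Hypothetical .then() callback: drops items without a title and logs the rest.
function requireTitle(context, data, next) {
    if (!data.title) {
        this.error('missing title, skipping');
        return; // next is not called, so this item goes no further
    }
    this.log('keeping ' + data.title);
    next(context, data);
}
```

It would be attached with .then(requireTitle). Since the callback does not accept done, it has no obligation to call it.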

Examples:

Example 1: find every ul > li and pass it to the next command

osmosis
...
.then(function(context, data, next) {
    var items = context.find('ul > li');
    items.forEach(function(item) {
        next(item, data);
    })
})

Example 2: set data.url to the current page URL

osmosis
...
.then(function(context, data, next) {
    data.url = context.doc().request.url;
    next(context, data);
})

Example 3: only continue if lastname != undefined

osmosis
...
.then(function(context, data, next) {
    if (data.lastname != undefined)
        next(context, data)
})

Example 4: using the done function

osmosis
...
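.then(function(context, data, next, done) {
    if (db.connected == false) {
        this.error('database disconnected');
        done();
        return;
    }
    // Count outstanding saves: callbacks may finish in any order
    var pending = data.someArray.length;
    if (pending === 0) {
        // Nothing to save, but done must still be called
        done();
        return;
    }
    data.someArray.forEach(function(obj) {
        db.save(obj, function() {
            next(context, data);
            if (--pending === 0)
                done();
        })
    })
})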