Home
Osmosis is a utility for easily extracting data from HTML or XML documents.
These are all of the "commands" that are available for chaining in an Osmosis instance.
- click
- config
- contains
- data
- debug
- delay
- do
- doc
- dom
- done
- failure
- filter
- find
- follow
- get/post
- login
- match
- paginate
- parse
- set
- submit
- then
Click on nodes found by selector
Discard any nodes whose contents do not match string
Set HTTP options and configure Osmosis
Calls `callback` with the current data object
Empty the data object
Add or replace each `key` in the data object with a new value
Call `callback` when any debug messages are received
Delay starting the next promise for `seconds` (float or int)
Call each Osmosis instance with the current context. This will always continue, even if an instance fails.
Reset the current context to the Document
Create a DOM object from the current context. The `callback` will be called with 3 arguments (`window`, `data`, and `next`). The `next([context], [data])` function must be called at least once.
Calls `callback` when parsing has completely finished
Call `callback` when any error messages are received
Discard any nodes that match `selector`
Discard any nodes that do not match `selector`
Find elements based on `selector` anywhere within the current document
Follow URLs found via `selector`. If `selector` isn't provided, `follow` will search the current element text or common URL attributes (href, src, etc).
.follow()
.follow('@href')
.follow('a')
.follow('a@href')
.follow('span.outlink')
.follow('input.cloneURL@value')
.follow('link[type="application/rss+xml"]@href')
Make an HTTP request
url - A string containing a URL, which can be relative to the current context.
data (optional) - An object containing GET query parameters or POST request data.
opts (optional) - An object containing HTTP request options.
Note: Query parameter values will be urlencoded by needle so make sure that your parameter values are not urlencoded.
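The two points above (a relative `url`, and parameter values that should not be encoded in advance) can be illustrated without Osmosis itself. The sketch below uses the standard `URL` and `URLSearchParams` classes as a stand-in for the resolving and encoding that happen during a request; the page URL and the `q` parameter are made-up values, and the exact encoding produced by needle may differ in detail.

```javascript
// Hypothetical URL of the page that is the current context.
const currentPage = 'http://example.com/articles/index.html';

// A relative URL is resolved against the current page.
const resolved = new URL('search', currentPage).href;

// Raw parameter values are encoded exactly once.
const encodedOnce = new URLSearchParams({ q: 'coffee & tea' }).toString();

// Encoding a value by hand first means it gets encoded a second time,
// so the server would receive the literal text 'coffee%20%26%20tea'.
const preEncoded = { q: encodeURIComponent('coffee & tea') };
const encodedTwice = new URLSearchParams(preEncoded).toString();

console.log(resolved);
console.log(encodedOnce);
console.log(encodedTwice);
```

Decoding `encodedTwice` once gives back the percent-encoded text rather than the original value, which is why the parameter values should be passed in their raw form.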
Call `callback` when any log messages are received
Submit a login form.
user - A string containing a username, email address, etc.
pass - A password string
success (optional) - A selector string determining if the login attempt succeeded
fail (optional) - A selector string determining if the login attempt failed
`login` finds the first form containing `input[type="password"]` and uses that input as the password field. It will use the preceding `<input>` element as the user field.
Discard any nodes whose contents do not match `RegExp`
Paginate the previous request `limit` times based on `selector`.
selector (String) - A selector string for either:
- an element with the next page URL in its inner text or in an attribute that commonly contains a URL (href, src, etc.)
- an element whose `name` and `value` attributes will respectively be added or replaced in the next page query.
selector (Object) - An object where each `key` is a query parameter name and each `value` is either a selector string or an increment amount (+1, -1, etc.).
limit (Number) - Total number of "next page" requests to make.
limit (String) - A selector string for an element containing the total number of requests to make.
.paginate('a.nextPage') // go to `a.nextPage` `@href`
.paginate('link[rel="next"]@href') // go to `link` `@href`
.paginate('input[name="page"]') // update `page` parameter of the next query

// adds 20 to the `startIndex` query parameter
// sets `page` query parameter to `a.nextPage` content
// stops after 15 requests are made
.paginate({ startIndex: +20, page: 'a.nextPage' }, 15)
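To make the increment form of the object selector concrete, here is a minimal sketch of how a numeric increment could be applied to the query string of the previous request. `nextPageUrl` is a hypothetical helper written for this illustration, not an Osmosis function, and treating a missing parameter as 0 is an assumption.

```javascript
// Hypothetical helper, not part of Osmosis: add each increment to the
// matching query parameter of `currentUrl` and return the new URL.
function nextPageUrl(currentUrl, increments) {
  const url = new URL(currentUrl);
  for (const [name, amount] of Object.entries(increments)) {
    // Assumption: a parameter that is missing or not numeric starts at 0.
    const current = Number(url.searchParams.get(name)) || 0;
    url.searchParams.set(name, String(current + amount));
  }
  return url.href;
}

// Follow the "next page" three times, mirroring a limit of 3.
let url = 'http://example.com/search?q=osmosis&startIndex=0';
const visited = [];
const limit = 3;
for (let i = 0; i < limit; i++) {
  url = nextPageUrl(url, { startIndex: +20 });
  visited.push(url);
}

console.log(visited);
```

Each pass changes only the named parameter and leaves the rest of the query untouched, which matches the description of values being "added or replaced in the next page query".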
Pause, resume or stop an osmosis instance.
Parse an HTML or XML string
string - A string or buffer containing the HTML/XML data
Set `name` to the value of `selector`
Set each `key` to the value of each `val` selector.
.set('title') // set 'title' to current element text
.set('title', 'a.title') // set 'title' to text of 'a.title'
.set({
    title: 'a.title',
    description: 'p.description',
    url: 'a.permalink @href',
    images: ['img @src'],
    comments: [
        osmosis
        .follow('a.comments')
        .find('div.comment')
        .set({
            'author': '.author',
            'content': 'p.content',
            'date': '.date'
        })
    ]
});
Submit a form
selector - A selector for the `<form>` element or `submit` button.
data (optional) - An object where each `key` and `value` represents a form input name and value
Calls `callback` with the context of the current element.
The `context` argument is the current context at that point in the command chain. If the previous command was `get`, `post`, `follow`, or `parse` then the context will be a Document. If the previous command was `find` then the current context will be one of the Elements that was found.
The `data` argument contains values set via `osmosis.set`. This object can be modified in any way.
The `next` argument is a function that will call the next command. It takes two arguments: context and data.
The `done` argument is a function to call when `then` will no longer call `next`. This is only required if `then` calls `next` asynchronously any number of times.
Note: If the callback accepts `done` as an argument, it must always call `done`, even if `next` was never called.
The callback will have these functions bound to its `this` value:
- this.request(method, url, [data], callback([err], context), [opts])
- this.log(msg)
- this.debug(msg)
- this.error(msg)
Example 1: find every `ul > li` and pass it to the next command
osmosis
...
.then(function(context, data, next) {
    var items = context.find('ul > li');
    items.forEach(function(item) {
        next(item, data);
    })
})
Example 2: set `data.url` to the current page URL
osmosis
...
.then(function(context, data, next) {
    data.url = context.doc().request.url;
    next(context, data);
})
Example 3: only continue if lastname != undefined
osmosis
...
.then(function(context, data, next) {
    if (data.lastname != undefined)
        next(context, data)
})
Example 4: using the `done` function
osmosis
...
.then(function(context, data, next, done) {
    if (db.connected == false) {
        this.error('database disconnected');
        done();
        return;
    }
    data.someArray.forEach(function(obj, index) {
        db.save(obj, function() {
            next(context, data);
            if (index == data.someArray.length-1)
                done();
        })
    })
})
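Example 4 calls `done` from the callback of the last array index. If the saves can finish out of order, one possible variation is to count completed saves instead. The sketch below runs without Osmosis: `db`, `pending`, and `saveAll` are hypothetical stand-ins, and `next` and `done` are plain functions that only record that they were called.

```javascript
// Hypothetical stand-in for a database client: save() queues its callback
// so the example can complete the saves in any order.
const pending = [];
const db = {
  save(obj, callback) {
    pending.push(callback);
  },
};

const calls = []; // records the order of next() and done() calls

// Same argument list as a `then` callback, but done() is driven by a counter.
function saveAll(context, data, next, done) {
  let remaining = data.someArray.length;
  if (remaining === 0) {
    done(); // nothing to save, but done must still be called
    return;
  }
  data.someArray.forEach(function (obj) {
    db.save(obj, function () {
      next(context, data);
      remaining -= 1;
      if (remaining === 0) done();
    });
  });
}

saveAll(
  {},
  { someArray: ['a', 'b', 'c'] },
  function next() { calls.push('next'); },
  function done() { calls.push('done'); }
);

// Complete the saves in reverse order to simulate out-of-order callbacks.
while (pending.length > 0) pending.pop()();

console.log(calls.join(','));
```

With a counter, `done` is called only after every save has reported back, regardless of the order in which the callbacks arrive.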