A fluent interface for scraping web content in Node with the PhantomJS headless browser. Its most notable feature is that, similar to waitFor.js, it can wait until content is available on a page, which comes in handy when scraping content generated by AJAX requests.
Install via npm with:
    npm install fluent-phantom
Note that this package depends on the PhantomJS bridge for Node, which assumes that you have already installed PhantomJS.
The most likely scenario for usage – scraping content dynamically generated by the client using a CSS selector – is outlined in CoffeeScript below. For more information, read the rest of this document.
    phantom = require 'fluent-phantom'

    phantom.create()
        .select('#headlines h3')
        .with().members('innerText')
        .from('http://example.com/news')
        .and().process((elements) ->
            # Headline stands for your own model class
            for element in elements
                new Headline(element.innerText).save()
        )
        .until(5000) # Timeout after 5 seconds
        .otherwise(->
            throw Error "Headlines were never loaded"
        )
        .execute()
The core of this package is two classes: a Request class that wraps the PhantomJS bridge and a fluent Builder to simplify its use.
- Connection strategies
- Builder
- Request
- Setters and getters
- Debugging
- Events
By default, a new Phantom object is created for each request, but this can be wasteful. In v1.1.0, alternative connection strategies were introduced.
Those strategies are:
- NewPhantom: Creates a new Phantom for each request. [default]
- NewPhantomAndPort: Creates a new Phantom for each request with a new port. This ensures no EADDRINUSE conflicts, but could create an unbounded number of phantoms and so is discouraged. If used, consider automatically closing connections when finished using Request.closeWhenFinished(true).
- RecycledPhantom: Recycles the same Phantom connection for each request. This is the simplest way to avoid an EADDRINUSE bug if making many requests.
- RoundRobin(poolSize, [minimum], [queueDepth]): Accepts poolSize as the number of PhantomJS objects in the pool, minimum as the number of PhantomJS objects to initialize beforehand, and queueDepth as the number of enqueued functions to allow to wait for a ready connection.
- RandomPool(poolSize, [queueDepth]): Accepts poolSize and queueDepth as constructor arguments with the same meaning as in RoundRobin and selects a connection at random.
A package-level method, fluent-phantom.setConnectionStrategy(strategy), can be invoked with a new instance of one of the connection strategies above to change the strategy.
    fluent = require 'fluent-phantom'
    fluent.setConnectionStrategy new fluent.ConnectionStrategy.RoundRobin(10, 2)
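To illustrate the idea behind a pooled strategy, here is a minimal sketch of round-robin selection in plain JavaScript. It is not the package's implementation: `RoundRobinPool` and `next()` are hypothetical names, strings stand in for PhantomJS connections, and the minimum and queueDepth options are left out.

```javascript
// Hypothetical sketch of round-robin selection; not fluent-phantom's source.
class RoundRobinPool {
  constructor(connections) {
    if (connections.length === 0) throw new Error('pool must not be empty');
    this.connections = connections;
    this.cursor = 0;
  }

  // Return the next connection in order, wrapping to the start after the last.
  next() {
    const connection = this.connections[this.cursor];
    this.cursor = (this.cursor + 1) % this.connections.length;
    return connection;
  }
}

// Strings stand in for PhantomJS connections.
const pool = new RoundRobinPool(['phantom-a', 'phantom-b', 'phantom-c']);
const picks = [pool.next(), pool.next(), pool.next(), pool.next()];
console.log(picks); // the fourth pick wraps around to 'phantom-a'
```

Cycling through a fixed pool keeps the number of PhantomJS processes bounded while still spreading requests across them.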
Builders can be created with new, but using the package-level create() helper is encouraged, as in the following example:

    phantom = require 'fluent-phantom'
    builder = phantom.create()
Synonyms: url()
Sets the request URL.
    phantom.create().from('http://example.com/')
The easiest way to scrape content is to call select() with a CSS selector and let the library generate your scraping code for you. This will automatically wait until the number of elements found with this selector is greater than zero by immediately invoking when(). Optionally, the members of nodes in the result set can be limited using properties().
Here is an example of extracting all h3 elements in a container with the id 'headlines':

    phantom.create()
        .select('#headlines h3')
        .from('http://example.com/news')
        .and().process((elements) -> console.log "Handle results here", elements)
        .with().members('innerHTML', 'className', 'id')
        .until(5000) # Timeout after 5 seconds
        .otherwise(-> console.error "The selector was never satisfied; handle errors here")
        .execute()
Synonyms: extract()
See also: when(), properties(), handle()
When select is invoked with a string, it is assumed to be a CSS selector describing elements to be scraped. When used in this way, select will immediately invoke when to automatically wait for the selector to be satisfied. If a number is passed as the second argument, it is treated as the minimum number of matching elements required and is passed on to when.
Note that WebKit's document.querySelectorAll() is used to interpret selectors.
If an array or map of strings is provided, it is assumed that multiple selections are desired, and an array or map of the corresponding results will be returned. Index ordering of arrays will be preserved.
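The shape of the results for each input form can be sketched in plain JavaScript. This is only an illustration under assumptions: `selectAll` is a hypothetical helper, and `query` is a stand-in for document.querySelectorAll that reads from a fake page object.

```javascript
// Hypothetical sketch; `query` stands in for document.querySelectorAll.
function selectAll(selectors, query) {
  // A single selector yields a single result set.
  if (typeof selectors === 'string') return query(selectors);
  // An array yields an array of result sets in the same order.
  if (Array.isArray(selectors)) return selectors.map((s) => query(s));
  // A map yields a map of result sets under the same keys.
  const results = {};
  for (const key of Object.keys(selectors)) {
    results[key] = query(selectors[key]);
  }
  return results;
}

// A fake page: selector -> matched text.
const fakePage = { h1: ['Title'], h3: ['First', 'Second'] };
const query = (selector) => fakePage[selector] || [];

console.log(selectAll(['h3', 'h1'], query));
console.log(selectAll({ title: 'h1', headlines: 'h3' }, query));
```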
Synonyms: process(), receive()
Specify a callback function to process results. Because scraping performed by select happens inside the PhantomJS VM, the results must be serialized to JSON and processed by a function in the Node.js VM where this module is being used. This is where you would publish results to a message queue or persist them to a data store.
The callback will receive two arguments. The first will be an array of results, even if there is only one result in the set. The second will be a reference to the page object.
Builder.properties(property: string, [property2: string], [...]) or Builder.properties(properties: array)
Synonyms: members()
By default, elements in the result set are stripped down to a handful of properties: id, children, className, innerHTML, and a few others. This list can be expanded or contracted by invoking the properties method with a list of property names as strings. They should be valid DOM properties.
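Conceptually, stripping an element down to a property list amounts to copying only the named members into a plain object that can be serialized to JSON. The sketch below shows that idea in plain JavaScript; `pickProperties` is a hypothetical helper, not part of the package, and a plain object stands in for a DOM element.

```javascript
// Hypothetical sketch of reducing an element to a list of named properties.
function pickProperties(element, properties) {
  const result = {};
  for (const name of properties) {
    result[name] = element[name];
  }
  return result;
}

// A plain object stands in for a DOM element here.
const element = { id: 'lead', className: 'headline', innerText: 'Hello', offsetTop: 120 };
const stripped = pickProperties(element, ['id', 'innerText']);
console.log(JSON.stringify(stripped));
```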
Sometimes, a simple CSS selector is not enough to describe the content you want to scrape. The Builder provides several alternatives that allow greater expression.
Synonyms: extract()
The select method can also accept a function as its first argument. This function will run in the context of the page being scraped within the PhantomJS VM, and its return value will be passed as the first argument to handle for further processing. The second argument to handle will be a reference to the page object.
Because the function body is effectively serialized and deserialized, any closed-over variables will be lost. To allow for some dynamism, a JSON-serializable argument, such as a CSS selector or a record identifier, may be provided. This argument will be the first and only argument passed to the extractor function when it is invoked.
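The loss of closed-over variables can be reproduced in plain JavaScript without PhantomJS. The sketch below assumes that crossing the VM boundary behaves like turning a function into source text and rebuilding it elsewhere; `sendToOtherVm` and `makeExtractors` are hypothetical names used only for this demonstration.

```javascript
// Hypothetical stand-in for a bridge: only the function's source text
// crosses over, and the function is rebuilt without its original scope.
function sendToOtherVm(fn) {
  const source = fn.toString();
  return new Function('return (' + source + ')')();
}

function makeExtractors() {
  const selector = '#headlines h3'; // closed-over local variable
  return {
    // Relies on the closure, so it breaks once rebuilt.
    closure: function () { return 'query: ' + selector; },
    // Receives everything it needs through its argument.
    withArgument: function (arg) { return 'query: ' + arg.selector; },
  };
}

const extractors = makeExtractors();

let closureError = null;
try {
  sendToOtherVm(extractors.closure)();
} catch (err) {
  closureError = err; // a ReferenceError: `selector` no longer exists
}

const viaArgument = sendToOtherVm(extractors.withArgument)({ selector: '#headlines h3' });
console.log(closureError.name, '|', viaArgument);
```

Passing the selector as an argument works because the argument is plain data, which survives serialization where the enclosing scope does not.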
Note that using select with a function rather than a CSS selector overrides the use of properties, and does not automatically cause the Request to wait until content is ready. Also note that references in the PhantomJS VM, including those to DOM elements, do not serialize as you might expect. For best results, be sure to return values, not references.
Because DOM elements are references within the PhantomJS VM, in the example below, you won't receive a long list of DOM elements as you might expect:
    phantom.create()
        .select(->
            document.querySelectorAll '#headlines h3'
        )
        # ...
To work around this, return values instead of references, like so:
    phantom.create()
        .select(->
            # Returns an array of objects of the form {id: ..., text: ...}
            for elem in document.querySelectorAll '#headlines h3'
                id: elem.id
                text: elem.innerText
        )
        # ...
See also: select(function)
Assign an extractor function and a handler in one method call. The result is the same as calling select(extractor, argument) and handle(handler).
This method is named evaluate because it wraps page.evaluate in the PhantomJS bridge.
Synonyms: invoke()
For those who already know what they're doing with PhantomJS and only want the convenience of using this module to wait for content to be available, this method accepts a function that receives a PhantomJS page object as its only argument for use with, for instance, page.evaluate. This method is mutually exclusive with select and evaluate.
The two forms of when cause any actions specified by handle, evaluate, or run to be suspended until a sentry condition is satisfied or the timeout period set with timeout has elapsed. The author's original motivation for using PhantomJS over HTMLUnit or Beautiful Soup was to scrape content generated client-side via a long-polling mechanism. The when method is ideal for this as it delays scraping until the content to be scraped is available.
When a sentry condition has been set, the Request will test for it immediately and every ~250ms afterwards until it has been satisfied or the test times out. At each check, the Request will emit the 'checking' event.
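The polling behaviour described above can be sketched in plain JavaScript. This is an illustration rather than the package's source: `poll`, its option names, and the injectable `now` and `schedule` hooks are hypothetical, and the 250 ms default simply mirrors the interval mentioned in the text.

```javascript
const { EventEmitter } = require('events');

// Hypothetical sketch of sentry polling; not fluent-phantom's source.
// Tests `sentry` immediately, then every `interval` ms, until it returns
// true or `timeout` ms have elapsed. Emits 'checking' before each test.
function poll(sentry, options, done) {
  const {
    timeout,
    interval = 250,
    emitter = new EventEmitter(),
    now = Date.now,
    schedule = setTimeout,
  } = options;
  const start = now();

  function check() {
    emitter.emit('checking');
    if (sentry()) return done(null);
    if (now() - start >= timeout) return done(new Error('Timed out'));
    schedule(check, interval);
  }

  check();
  return emitter;
}

// Demonstration with a fake clock so the run is instant and repeatable:
// the "content" becomes ready 600 ms after the start.
let clock = 0;
const fakeSchedule = (fn, ms) => { clock += ms; fn(); };
const emitter = new EventEmitter();
let checks = 0;
emitter.on('checking', () => { checks += 1; });

let outcome;
poll(() => clock >= 600, {
  timeout: 5000, emitter, now: () => clock, schedule: fakeSchedule,
}, (err) => { outcome = err ? 'timed out' : 'ready'; });

console.log(outcome, 'after', checks, 'checks at', clock, 'ms');
```

With the default hooks the same function uses real timers; the fake clock is only there to make the schedule of checks (0, 250, 500, 750 ms) easy to see.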
Causes execution to be delayed until the document has at least one element satisfying the provided CSS selector using document.querySelectorAll(). If a minimum is provided, execution will be delayed until the minimum has been reached.
This form of when is automatically invoked when using select with a CSS selector.
Causes execution to be delayed until the sentry function provided returns true. The function will be invoked in the context of the page within the PhantomJS runtime and, as described in the documentation for using select with functions, any closed-over scope will be lost. A second, JSON-serializable parameter may be provided to be passed as the first argument to the sentry function when invoked.
If multiple parameters are needed, pass them in one object, as is done in this adaptation from the source for Builder.when():

    argument =
        minimum: minimum
        query: condition
    callback = (args) -> document.querySelectorAll(args.query).length >= args.minimum
    builder.when(callback, argument)
Builders provide support for setting bounds on the time to wait for content to become ready.
Set a callback to be invoked when the request times out while waiting for content to become available. The callback will receive no arguments.
Synonyms: for(), until()
Set the duration to wait before timing out. Argument should be a number of milliseconds as an integer, a string that can be parsed by Duration (e.g. "5s"), or a Duration object.
Allow a request to wait forever for content.
Causes the request to test its condition once, but only once.
For expressiveness, several meaningless terms exist in the builder grammar. They are:
- and()
- then()
- of()
- with()
These terms provide a more fluent feel. For instance, compare the following two examples:
    # Vanilla
    phantom.create()
        .select('div.feature a')
        .process(-> ...)

    # Slightly more expressive
    phantom.create()
        .select('div.feature a')
        .and().then().process(-> ...)
The difference is small, but available if desired.
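One common way to build such filler terms is to have each one return the builder unchanged. The plain JavaScript sketch below shows that pattern; `SketchBuilder` is a hypothetical class written for this illustration and is not the package's Builder.

```javascript
// Hypothetical sketch, not fluent-phantom's Builder.
class SketchBuilder {
  constructor() { this.settings = {}; }

  // Meaningful terms record something and return the builder for chaining.
  select(selector) { this.settings.selector = selector; return this; }
  process(handler) { this.settings.handler = handler; return this; }

  // Filler terms change nothing; they only return the builder.
  and() { return this; }
  then() { return this; }
  of() { return this; }
  with() { return this; }
}

const handler = () => {};
const plain = new SketchBuilder().select('div.feature a').process(handler);
const fluent = new SketchBuilder().select('div.feature a').and().then().process(handler);

// Both chains end up with identical settings.
console.log(plain.settings.selector === fluent.settings.selector);
console.log(plain.settings.handler === fluent.settings.handler);
```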
Once a builder has received all input to create a request, you can build it using build() or immediately execute (and return) it with execute().
Builds a Request object to your specifications.
Builds and immediately executes a new Request object. The Request is returned.
Documentation coming soon. In the meantime, view the annotated source in its raw form or prettied up by docco.