Lua scripting support #117

Closed
kmike opened this Issue Oct 13, 2014 · 3 comments

kmike commented Oct 13, 2014

This is going to be long. I think everybody agrees we want some browser automation features in Splash. Examples: go to a different page, wait for an element, wait for N seconds, click a link, submit a form, customize what to return, process HTTP requests somehow.

There are two ways to add browser automation support to Splash:

  1. Create an out-of-band communication channel. When Splash starts rendering it could return a token/url-to-connect to the client; the client can open another connection and send commands to the browser. This approach is similar to Selenium WebDriver.
  2. Add scripting support: in addition to other options, Splash could accept a script with a "rendering scenario" similar to http://phantomjs.org/ scripts.

1. Out-of-band channel

The advantage of (1) is that the client can use any language/software for the automation. All libraries are available, as well as data which is not accessible from Splash (e.g. a local database or a filesystem). It also doesn't require sandboxing. This is the most flexible way to do automation, and maybe the most future-proof.

But I don't like it as a default. It is bad for latency; writing client code can become challenging; a "wait 10ms on a webpage" command is hard because of latency; an out-of-band channel makes Splash stateful (so load balancers must be aware of that feature); disconnected clients may cause Splash to use extra resources; and it is bad for batch processing. This approach looks heavyweight: it is more powerful, but it makes Splash less scalable and makes client code more complex.

2. Scripting

For (2) there are two main questions: which language to use and how to make it secure.

2.1. Python

The first obvious scripting language candidate is Python.

The problem with Python is that sandboxing is very hard (impossible?) if the script is executed in the same interpreter as the rest of Splash. See e.g. https://mail.python.org/pipermail/python-dev/2013-November/130132.html.

PyPy has a working sandbox (http://pypy.readthedocs.org/en/latest/sandbox.html); to implement it they start a separate process which sends serialized requests and waits for a reply from a host process instead of making system calls. I haven't tried it. PyPy's sandbox looks like a PITA to install, integration with CPython is "experimental and unpolished", and it can't run any code that is implemented as a CPython C extension (so no lxml and no twisted? it seems there is e.g. no datetime.datetime). Also, I'm not sure how we can make IPC work: we need to pass some in-memory objects to the sandboxed process.

In-process Python sandboxing is broken, and out-of-process Python sandboxing looks hard to use.
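To illustrate why in-process sandboxing is fragile, here is a minimal sketch of a well-known escape. It is not taken from the linked python-dev thread, and `naive_sandbox_eval` is a hypothetical helper. Even with builtins removed, an expression can walk from a literal to `object` and enumerate every class loaded in the interpreter.

```python
def naive_sandbox_eval(expression):
    """Naive "sandbox": evaluate an expression with no builtins available."""
    return eval(expression, {"__builtins__": {}}, {})


# Direct access to builtins is blocked, as intended...
try:
    naive_sandbox_eval("open('/etc/passwd')")
    blocked = False
except NameError:
    blocked = True

# ...but attribute access on a literal still reaches `object` and, through it,
# every class currently loaded in the host interpreter.
leaked_classes = naive_sandbox_eval("().__class__.__base__.__subclasses__()")
leaked_names = {cls.__name__ for cls in leaked_classes}
```

From that list a hostile script can usually find a class that leads back to the file system or to the import machinery, which is why stripping names from the namespace is not enough on its own.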

2.2. JavaScript

The second obvious candidate is JavaScript.

We already have a JS engine in QWebKit, and we already allow executing JS code in page context; maybe it can be used for automation? There are problems with js-in-qwebkit:

a) JS code is executed in page context, so providing browser automation methods is not secure (a webpage can call them).
b) Variables get cleared after each page reload - it'll make programming hard. We can provide a persistent storage, but there will be gotchas with closures, etc.
c) We need to load the page to start working with it, so nothing can be done between page loads or before the first page is loaded.

These problems can be fixed by executing JS in an unrelated QWebView, but it looks like an ugly hack. Also, with QWebKit we can only fire a script and get a result, this is limiting.

Another option for JS integration is https://code.google.com/p/pyv8/. The docs are mostly absent, but I think it could work.

We can even try to implement the PhantomJS API. An advantage of doing this could be compatibility with existing software, though it is a hard battle because "mostly compatible with PhantomJS" means "broken".

I'm not a big fan of JS scripting / PhantomJS API. JavaScript lacks features that can make code compact. Check an example from casperjs library (which is a wrapper to make PhantomJS more usable):

var casper = require('casper').create();

casper.start('http://casperjs.org/', function() {
    this.echo(this.getTitle());
});

casper.thenOpen('http://phantomjs.org', function() {
    this.echo(this.getTitle());
});

casper.run();

Callbacks are not easy to read and follow, there are many gotchas with them. The example is simple because only a few callbacks are involved; real casperjs code can get messy. In Python with a proper library we can write something like

page = yield pasper.open('http://casperjs.org')
print(page.get_title())
page = yield pasper.open('http://phantomjs.org')
print(page.get_title())

I'm aware of an upcoming "yield" in JS, but not sure if it is available/supported in pyv8 and if it is already usable. It seems yield is not supported by PhantomJS and CasperJS APIs, but I can be wrong.
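The generator style in the Python snippet above can be sketched with the standard library alone. This is only an illustration: `FakePage`, `open_page` and `run` are hypothetical names, and no real page is loaded. The script yields a request, and a driver fulfils it and resumes the script with the result.

```python
class FakePage:
    """Hypothetical stand-in for a loaded page; not a real library object."""

    def __init__(self, url):
        self.url = url

    def get_title(self):
        return "Title of " + self.url


def open_page(url):
    """Describe a request; the driver decides how to fulfil it."""
    return ("open", url)


def script():
    titles = []
    page = yield open_page("http://casperjs.org")
    titles.append(page.get_title())
    page = yield open_page("http://phantomjs.org")
    titles.append(page.get_title())
    return titles


def run(generator):
    """Drive the generator: answer each yielded request, then resume it."""
    result = None
    try:
        while True:
            command, url = generator.send(result)
            if command != "open":
                raise ValueError("unknown command: %r" % (command,))
            # A real driver would wait for the page load here.
            result = FakePage(url)
    except StopIteration as stop:
        return stop.value


titles = run(script())
```

The script reads top to bottom with no nested callbacks, while the driver stays free to do other work between the yield and the resume.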

An undeniable advantage of JS is that everyone knows this language.

Adding JavaScript scripting to Splash looks doable, and it will work OK, but the tools are not easy to use, PyV8 is hard to set up, we'll have 2 different JS interpreters, the API won't be too nice because of the language, and the best we can eventually get after a lot of work is a subpar / slightly broken implementation of the PhantomJS / CasperJS APIs.

2.3. Lua

So, what I propose is to add Lua scripting support to Splash.

The biggest disadvantage is that Lua is much less known than Python or JavaScript, but other than that it looks like the best option. As a personal opinion: I've worked with Lua for a couple of months. It has its own warts, and I like Lua less than Python as a language, but still it was good; overall it felt like JavaScript done right.

To evaluate JS in page context users will have to write JS inside Lua scripts. I don't think that's bad. Writing JavaScript inside JavaScript isn't better for our use case. Check how PhantomJS does it:

var webPage = require('webpage');
var page = webPage.create();

page.open('http://m.bing.com', function(status) {
  var title = page.evaluate(function() {
    return document.title;
  });
  console.log(title);
  phantom.exit();
});

The argument of page.evaluate is a function. It looks like standard JS closure creation, but it isn't: the function passed is executed in a totally different environment, so it is not a closure. You can't use variables from the outer scope as expected, but you can use variables from the page context. What looks like standard JS code is not standard JS code. IMHO using the same language is not an advantage here. With Lua the separation will be clear:

page = splash.open("http://m.bing.com")
title = page.evaluate([[
    return document.title;
]])
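The "not a closure" point can be shown with a small Python analogy. It is a sketch only: `evaluate_in_page` is a hypothetical helper, and the "page" is just a dictionary of names. Code passed as a string runs against the page's names and cannot see the caller's variables, which is the separation the string-based Lua form makes visible.

```python
def evaluate_in_page(source, page_context):
    """Evaluate `source` with only the page's names in scope (hypothetical helper)."""
    return eval(source, dict(page_context))


# Names that exist "inside the page".
page_context = {"document_title": "Bing"}

# A variable in the calling script; the page never sees it.
outer_variable = "script-side value"

title = evaluate_in_page("document_title", page_context)

try:
    evaluate_in_page("outer_variable", page_context)
    outer_visible = True
except NameError:
    outer_visible = False
```

Because the snippet is a string, nobody expects `outer_variable` to be in scope; with a function literal, as in PhantomJS, the same failure is surprising.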

There is a great two-way Python-Lua binding written by one of the Cython authors: https://github.com/scoder/lupa. It looks very well maintained: this weekend I found a bug in it, and the bug was fixed in a day. Installing it is just

# OS X
brew install lua
pip install lupa

or

# Debian
apt-get install liblua5.2-dev
pip install lupa

Lua has coroutine support. Coroutines can do the same as Python's yield with one twist: you can yield from anywhere. They are more like greenlets (but without crazy VM hacks). What it means is that we can make code like this:

page = splash.open("http://casperjs.com")
splash.wait(1000)
return page.render()

integrate with the event loop properly by using Lua's coroutine.yield() in splash.open and splash.wait. The Lupa wrapper supports coroutines fully, so it is possible to make the code above run properly with Splash's event loop.
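As a rough sketch of the scheduling idea, here is a tiny event loop on a virtual clock, written with Python generators rather than Lua coroutines. Everything in it (`wait`, `script`, `run_all`) is hypothetical and unrelated to the real Splash or Lupa APIs. Each script suspends at a wait, and the loop resumes whichever script is due next, so a long wait in one script does not hold up the others.

```python
import heapq
import itertools


def wait(ms):
    """Request to be resumed after `ms` milliseconds of virtual time."""
    return ("wait", ms)


def script(name, delay_ms, log):
    log.append((name, "start"))
    yield wait(delay_ms)  # suspends here; the loop runs other scripts meanwhile
    log.append((name, "resumed"))
    return name


def run_all(scripts):
    """Run generator-based scripts to completion on a virtual clock."""
    clock_ms = 0
    order = itertools.count()  # tie-breaker so heapq never compares generators
    ready = [(0, next(order), gen) for gen in scripts]
    heapq.heapify(ready)
    finished = []
    while ready:
        wake_at, _, gen = heapq.heappop(ready)
        clock_ms = max(clock_ms, wake_at)
        try:
            _command, ms = next(gen)
        except StopIteration as stop:
            finished.append((clock_ms, stop.value))
            continue
        heapq.heappush(ready, (clock_ms + ms, next(order), gen))
    return finished


log = []
finished = run_all([script("slow", 1000, log), script("fast", 10, log)])
```

One difference from Lua is worth noting: a Python generator must yield at every level of the call stack, whereas a Lua coroutine can yield from inside a nested call, which is what lets splash.wait(1000) look like an ordinary blocking call in the script.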

Greenlet criticism from https://glyph.twistedmatrix.com/2014/02/unyielding.html applies here, but it may be OK. Scripts are meant to be small, we're essentially yielding to the code outside Lua (so from the point of view of a script it is not that different from a blocking call), and we may also provide an explicit callback interface. There may be problems e.g. when the same event fires while processing an event handler, but I'm not sure using explicit callbacks or explicit yields makes this any better.

Sandboxing is generally hard, but with Lua it should be easier than in most other languages. There are some docs about that: http://lua-users.org/wiki/SandBoxes. I think it is OK to start with something very restricted. CPU/memory limiting is a related issue, but we haven't solved it in other Splash parts either. I believe there is a way to implement an instruction count limit for Lua, or maybe we can use some watchdog.
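The instruction-budget idea can be sketched in Python as an analogy. This is not how Splash does it, and `run_with_line_budget` and `BudgetExceeded` are hypothetical names. A trace hook counts executed lines and aborts the script once the budget is spent. In Lua the corresponding mechanism would presumably be a count hook installed with debug.sethook.

```python
import sys


class BudgetExceeded(Exception):
    """Raised when a script runs more lines than it was allowed."""


def run_with_line_budget(func, max_lines):
    """Call func(), aborting it once more than max_lines lines have executed."""
    executed = 0

    def tracer(frame, event, arg):
        nonlocal executed
        if event == "line":
            executed += 1
            if executed > max_lines:
                raise BudgetExceeded("line budget of %d exceeded" % max_lines)
        return tracer  # keep tracing inside this frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        return func()
    finally:
        sys.settrace(previous)


def spin():
    """A runaway script: loops forever unless something stops it."""
    counter = 0
    while True:
        counter += 1
        counter %= 1000


try:
    run_with_line_budget(spin, max_lines=1000)
    aborted = False
except BudgetExceeded:
    aborted = True
```

A budget like this limits CPU use only; memory limits and wall-clock timeouts would need separate handling, for example an external watchdog.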

Lua enables the nicest programmer API and looks easiest to integrate; if everyone is fine with Lua I'll create a proof of concept soon.

@kmike kmike added the enhancement label Oct 13, 2014

dangra commented Oct 13, 2014

You are selling it so well that 👍's are imminent. I tried Lua in the past with certain window managers; I never got too used to it, but it was very fast to get small things done as far as I remember. Sorry, I have nothing to add here; it looks promising and well drafted.

cyberplant commented Oct 16, 2014

I've worked with Lua on embedded devices and it was nice, so I give my 👍.

Also, I like the name of that programming language ;-)

kmike commented Oct 16, 2014

@dangra @cyberplant that's great you also think Lua is fine!

Current status: as a proof of concept I've integrated Lua coroutines with the Twisted/Qt event loop and exposed some of the Splash API to Lua scripts. So far only the splash:go(url) and splash:wait(time_ms) commands are working (in a proper async way), the example script is hardcoded (no HTTP API), and the code is bad (I'm cleaning it up before committing). But overall it seems to work well; I haven't faced a roadblock yet.

Scripting API could work like this, but it is not set in stone:

-- "main" function is called by Splash
-- developers can create other helper functions outside "main" if needed
function main(splash)
    -- splash.args allows script to access parameters sent to server;
    -- an alternative would be to use templating language to create Lua scripts and 
    -- hardcode values in script itself, but parameters look cleaner.
    local url = splash.args.url 
    local wait = splash.args.wait

    -- This blocks until the page is loaded. If there is an issue with page loading,
    -- a Lua error is raised; it can be caught by pcall and handled in the script
    -- if needed. Unhandled errors are returned to the client as HTTP errors.
    -- An alternative for splash:go would be to return status text without
    -- raising an error; I don't know which is better.
    splash:go{url=url}  

    if wait > 0 then
        -- this also blocks; splash continues doing other things
        -- and resumes script after a timeout
        splash:wait(wait)  
    end

    splash:stop_loading()

    local title = splash:evaluate([[
        function getTitle(){ return document.title; }
        getTitle();
    ]])

    local html = splash:html()
    local screenshot = splash:png{width=300, height=200}

    -- the result of 'main' function is encoded to JSON and returned as a response.
    return {html=html, title=title, png=screenshot}
end
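How the returned table becomes a JSON response is not specified above. One detail any implementation has to settle is binary data such as the PNG screenshot, which JSON cannot carry directly. The following is a minimal sketch under the assumption that binary values are base64-encoded; `encode_result` is a hypothetical helper, not part of Splash.

```python
import base64
import json


def encode_result(result):
    """Encode a script result as JSON, base64-encoding any binary values."""

    def to_jsonable(value):
        if isinstance(value, bytes):
            # Assumption: binary data (e.g. a PNG) travels as base64 text.
            return base64.b64encode(value).decode("ascii")
        if isinstance(value, dict):
            return {key: to_jsonable(item) for key, item in value.items()}
        if isinstance(value, (list, tuple)):
            return [to_jsonable(item) for item in value]
        return value

    return json.dumps(to_jsonable(result), sort_keys=True)


fake_png = b"\x89PNG\r\n\x1a\n"  # only the PNG signature, for illustration
body = encode_result({"html": "<html></html>", "title": "Example", "png": fake_png})
decoded = json.loads(body)
```

A client would then reverse the step with base64.b64decode(decoded["png"]) to recover the image bytes.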

There are many possible improvements: callback interfaces for event subscription, splash:return(data) for returning from inside callbacks, returning of "raw" data (not json-encoded); obviously lots of browser controlling functions should be exposed (click, scroll, reload/back/forward, etc.)

I think the first milestone would be to allow users to imitate existing endpoints using Lua scripts; with it we'll get hundreds of functional tests for free as we can run existing tests against Lua-powered endpoints.

PR to track is #118; maybe I'll commit more changes later today.
