Consult :ref:`install-docs` to get Splash up and running.
Splash is controlled via an HTTP API. For all endpoints below, parameters
may be sent either as GET arguments or encoded as JSON and
POSTed with a Content-Type: application/json
header.
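As a rough illustration of the two ways of passing parameters, the Python sketch below builds the same render.html request once as a GET URL and once as a JSON POST. The address http://localhost:8050 is taken from the curl examples in this document; the helper names are made up for this example, and nothing is sent until the objects are handed to urllib.request.urlopen().

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

SPLASH_URL = "http://localhost:8050"  # assumed address, as in the curl examples


def build_get_url(endpoint, **params):
    """Return a URL that passes the parameters as GET arguments."""
    return "{}/{}?{}".format(SPLASH_URL, endpoint, urlencode(params))


def build_json_post(endpoint, **params):
    """Return a Request that passes the same parameters as a JSON body."""
    body = json.dumps(params).encode("utf-8")
    return Request(
        "{}/{}".format(SPLASH_URL, endpoint),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


get_url = build_get_url("render.html", url="http://example.com", wait=0.5)
post_request = build_json_post("render.html", url="http://example.com", wait=0.5)
# Either object can be passed to urllib.request.urlopen() when Splash is running.
```

The JSON form avoids URL length limits and is the only form that supports structured values such as the headers argument.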
The most versatile endpoint that provides all Splash features is :ref:`execute` (WARNING: it is still experimental). Other endpoints may be easier to use in specific cases - for example, :ref:`render.png` returns a screenshot in PNG format that can be used as img src without any further processing, and :ref:`render.json` is convenient if you don't need to interact with a page.
The following endpoints are supported:
Return the HTML of the javascript-rendered page.
Arguments:
- url : string : required
- The url to render (required)
- baseurl : string : optional
The base url to render the page with.
If given, base HTML content will be fetched from the URL given in the url argument and rendered using this as the base url.
- timeout : float : optional
A timeout (in seconds) for the render (defaults to 30).
By default, the maximum allowed value for the timeout is 60 seconds. To override it, start Splash with the
--max-timeout
command line option. For example, here Splash is configured to allow timeouts up to 2 minutes:
$ python -m splash.server --max-timeout 120
- wait : float : optional
- Time (in seconds) to wait for updates after page is loaded (defaults to 0). Increase this value if you expect pages to contain setInterval/setTimeout javascript calls, because with wait=0 callbacks of setInterval/setTimeout won't be executed. Non-zero :ref:`wait <arg-wait>` is also required for PNG rendering when doing full-page rendering (see :ref:`render_all <arg-render-all>`).
- proxy : string : optional
- Proxy profile name. See :ref:`Proxy Profiles`.
- js : string : optional
- Javascript profile name. See :ref:`Javascript Profiles`.
- js_source : string : optional
- JavaScript code to be executed in page context. See :ref:`execute javascript`.
- filters : string : optional
- Comma-separated list of request filter names. See Request Filters.
- allowed_domains : string : optional
- Comma-separated list of allowed domain names. If present, Splash loads content only from the listed domains and their subdomains; nothing is loaded from any other domain.
- viewport : string : optional
View width and height (in pixels) of the browser viewport to render the web page. Format is "<width>x<height>", e.g. 800x600. Default value is 1024x768.
'viewport' parameter is more important for PNG rendering; it is supported for all rendering endpoints because javascript code execution can depend on viewport size.
For backward compatibility reasons, it also accepts 'full' as a value;
viewport=full
is semantically equivalent to render_all=1
(see :ref:`render_all <arg-render-all>`).
- images : integer : optional
Whether to download images. Possible values are 1 (download images) and 0 (don't download images). Default is 1.
Note that cached images may be displayed even if this parameter is 0. You can also use Request Filters to strip unwanted content based on URL.
- headers : JSON array or object : optional
HTTP headers to set for the first outgoing request.
This option is only supported for application/json POST requests. The value can be either a JSON array of (header_name, header_value) pairs or a JSON object with header names as keys and header values as values.
The "User-Agent" header is special: it is used for all outgoing requests, unlike other headers.
Curl example:
curl 'http://localhost:8050/render.html?url=http://domain.com/page-with-javascript.html&timeout=10&wait=0.5'
The result is always encoded to utf-8. Always decode HTML data returned by render.html endpoint from utf-8 even if there are tags like
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
in the result.
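The headers argument and the decoding rule can be sketched in Python as follows. This is only an illustration: the helper names and the sample header values are invented for this example, and the sample response is a hand-written byte string, not real Splash output.

```python
import json


def render_html_payload(url, headers=None, **options):
    """Build the JSON body for an application/json POST to render.html.

    `headers` may be a dict (JSON object form) or a list of
    (name, value) pairs (JSON array form).
    """
    payload = dict(options, url=url)
    if headers is not None:
        payload["headers"] = headers
    return json.dumps(payload)


def decode_render_html(body_bytes):
    """Decode a render.html response body; the result is always UTF-8."""
    return body_bytes.decode("utf-8")


payload = render_html_payload(
    "http://example.com",
    headers=[("Accept-Language", "en"), ("User-Agent", "my-crawler")],
    wait=0.5,
)

# The <meta> tag claims iso-8859-1, but the bytes are still UTF-8.
sample_body = (
    '<meta http-equiv="Content-Type" '
    'content="text/html; charset=iso-8859-1">caf\u00e9'
).encode("utf-8")
html = decode_render_html(sample_body)
```

Decoding sample_body as iso-8859-1 instead would turn the final character into two garbage characters, which is why the meta tag should be ignored.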
Return an image (in PNG format) of the javascript-rendered page.
Arguments:
Same as render.html plus the following ones:
- width : integer : optional
- Resize the rendered image to the given width (in pixels) keeping the aspect ratio.
- height : integer : optional
- Crop the rendered image to the given height (in pixels). Often used in conjunction with the width argument to generate fixed-size thumbnails.
- render_all : int : optional
Possible values are 1 and 0. When render_all=1, extend the viewport to include the whole webpage (possibly very tall) before rendering. Default is render_all=0.
Note
render_all=1 requires non-zero :ref:`wait <arg-wait>` parameter. This is an unfortunate restriction, but it seems that this is the only way to make rendering work reliably with render_all=1.
- scale_method : string : optional
Possible values are raster (default) and vector. If scale_method=raster, the rescaling operation performed via the :ref:`width <arg-width>` parameter is pixel-wise. If scale_method=vector, rescaling is done element-wise during rendering.
Note
Vector-based rescaling is more performant and results in crisper fonts and sharper element boundaries; however, there may be rendering issues, so use it with caution.
Curl examples:
# render with timeout
curl 'http://localhost:8050/render.png?url=http://domain.com/page-with-javascript.html&timeout=10'

# 320x240 thumbnail
curl 'http://localhost:8050/render.png?url=http://domain.com/page-with-javascript.html&width=320&height=240'
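The same requests can be assembled programmatically. The Python sketch below builds render.png URLs and checks on the client side that render_all=1 is never sent without a non-zero wait. The function name and its keyword defaults are assumptions made for this example, not part of Splash.

```python
from urllib.parse import urlencode


def render_png_url(page_url, width=None, height=None, render_all=0, wait=0.0,
                   splash_url="http://localhost:8050"):
    """Build a render.png URL, enforcing the render_all/wait restriction."""
    if render_all and not wait > 0:
        raise ValueError("render_all=1 requires a non-zero wait")
    params = {"url": page_url}
    if wait > 0:
        params["wait"] = wait
    if width is not None:
        params["width"] = width
    if height is not None:
        params["height"] = height
    if render_all:
        params["render_all"] = 1
    return splash_url + "/render.png?" + urlencode(params)


# 320x240 thumbnail, as in the second curl example.
thumbnail_url = render_png_url("http://example.com", width=320, height=240)

# Full-page screenshot; wait must be non-zero here.
full_page_url = render_png_url("http://example.com", render_all=1, wait=0.5)
```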
Return information about Splash interaction with a website in HAR format. It includes information about requests made, responses received, timings, headers, etc.
You can use an online HAR viewer to visualize the information returned from this endpoint; it will look very similar to the "Network" tabs in the Firefox and Chrome developer tools.
Currently this endpoint doesn't expose raw request and response contents; only meta-information like headers and timings is available.
Arguments for this endpoint are the same as for render.html.
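Since the output follows the HAR format, it can be post-processed with a few lines of Python. The sketch below lists the status, URL and total time of every entry. The sample document is a hand-written stand-in that uses only standard HAR fields (log.entries, request.url, response.status, time); it is not real Splash output.

```python
def summarize_har(har):
    """Return (status, url, time_ms) for every entry in a HAR document."""
    return [
        (entry["response"]["status"], entry["request"]["url"], entry["time"])
        for entry in har["log"]["entries"]
    ]


# Hand-written stand-in for a parsed render.har response.
sample_har = {
    "log": {
        "entries": [
            {
                "time": 120,
                "request": {"method": "GET", "url": "http://example.com/"},
                "response": {"status": 200},
            },
            {
                "time": 35,
                "request": {"method": "GET", "url": "http://example.com/logo.png"},
                "response": {"status": 404},
            },
        ]
    }
}

summary = summarize_har(sample_har)
failed = [url for status, url, _ in summary if status >= 400]
```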
Return a json-encoded dictionary with information about javascript-rendered webpage. It can include HTML, PNG and other information, based on arguments passed.
Arguments:
Same as render.png plus the following ones:
- html : integer : optional
- Whether to include HTML in output. Possible values are 1 (include) and 0 (exclude). Default is 0.
- png : integer : optional
- Whether to include PNG in output. Possible values are 1 (include) and 0 (exclude). Default is 0.
- iframes : integer : optional
- Whether to include information about child frames in output. Possible values are 1 (include) and 0 (exclude). Default is 0.
- script : integer : optional
- Whether to include the result of the executed javascript final statement in output (see :ref:`execute javascript`). Possible values are 1 (include) and 0 (exclude). Default is 0.
- console : integer : optional
- Whether to include the executed javascript console messages in output. Possible values are 1 (include) and 0 (exclude). Default is 0.
- history : integer : optional
Whether to include the history of requests/responses for webpage main frame. Possible values are 1 (include) and 0 (exclude). Default is 0.
Use it to get HTTP status codes and headers. Only information about "main" requests/responses is returned (i.e. information about related resources like images and AJAX queries is not returned). To get information about all requests and responses use :ref:`'har' <arg-har>` argument.
- har : integer : optional
- Whether to include HAR in output. Possible values are 1 (include) and 0 (exclude). Default is 0. If this option is ON the result will contain the same data as render.har provides under 'har' key.
By default, the URL, requested URL, page title and frame geometry are returned:
{ "url": "http://crawlera.com/", "geometry": [0, 0, 640, 480], "requestedUrl": "http://crawlera.com/", "title": "Crawlera" }
Add 'html=1' to request to add HTML to the result:
{ "url": "http://crawlera.com/", "geometry": [0, 0, 640, 480], "requestedUrl": "http://crawlera.com/", "html": "<!DOCTYPE html><!--[if IE 8]>....", "title": "Crawlera" }
Add 'png=1' to request to add base64-encoded PNG screenshot to the result:
{ "url": "http://crawlera.com/", "geometry": [0, 0, 640, 480], "requestedUrl": "http://crawlera.com/", "png": "iVBORw0KGgoAAAAN...", "title": "Crawlera" }
Setting both 'html=1' and 'png=1' returns the HTML and a screenshot at the same time, which guarantees that the screenshot matches the HTML.
Add 'iframes=1' to the request to get information about iframes:
{ "geometry": [0, 0, 640, 480], "frameName": "", "title": "Scrapinghub | Autoscraping", "url": "http://scrapinghub.com/autoscraping.html", "childFrames": [ { "title": "Tutorial: Scrapinghub's autoscraping tool - YouTube", "url": "", "geometry": [235, 502, 497, 310], "frameName": "<!--framePath //<!--frame0-->-->", "requestedUrl": "http://www.youtube.com/embed/lSJvVqDLOOs?version=3&rel=1&fs=1&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent", "childFrames": [] } ], "requestedUrl": "http://scrapinghub.com/autoscraping.html" }
Note that iframes can be nested.
Pass both 'html=1' and 'iframes=1' to get HTML for all iframes as well as for the main page:
{ "geometry": [0, 0, 640, 480], "frameName": "", "html": "<!DOCTYPE html...", "title": "Scrapinghub | Autoscraping", "url": "http://scrapinghub.com/autoscraping.html", "childFrames": [ { "title": "Tutorial: Scrapinghub's autoscraping tool - YouTube", "url": "", "html": "<!DOCTYPE html>...", "geometry": [235, 502, 497, 310], "frameName": "<!--framePath //<!--frame0-->-->", "requestedUrl": "http://www.youtube.com/embed/lSJvVqDLOOs?version=3&rel=1&fs=1&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent", "childFrames": [] } ], "requestedUrl": "http://scrapinghub.com/autoscraping.html" }
Unlike 'html=1', 'png=1' does not affect data in childFrames.
When executing JavaScript code (see :ref:`execute javascript`) add the parameter 'script=1' to the request to include the code output in the result:
{ "url": "http://crawlera.com/", "geometry": [0, 0, 640, 480], "requestedUrl": "http://crawlera.com/", "title": "Crawlera", "script": "result of script..." }
The JavaScript code supports the console.log() function to log messages. Add 'console=1' to the request to include the console output in the result:
{ "url": "http://crawlera.com/", "geometry": [0, 0, 640, 480], "requestedUrl": "http://crawlera.com/", "title": "Crawlera", "script": "result of script...", "console": ["first log message", "second log message", ...] }
Curl examples:
# full information
curl 'http://localhost:8050/render.json?url=http://domain.com/page-with-iframes.html&png=1&html=1&iframes=1'

# HTML and meta information of page itself and all its iframes
curl 'http://localhost:8050/render.json?url=http://domain.com/page-with-iframes.html&html=1&iframes=1'

# only meta information (like page/iframes titles and urls)
curl 'http://localhost:8050/render.json?url=http://domain.com/page-with-iframes.html&iframes=1'

# render html and 320x240 thumbnail at once; do not return info about iframes
curl 'http://localhost:8050/render.json?url=http://domain.com/page-with-iframes.html&html=1&png=1&width=320&height=240'

# Render page and execute simple Javascript function, display the js output
curl -X POST -H 'content-type: application/javascript' \
    -d 'function getAd(x){ return x; } getAd("abc");' \
    'http://localhost:8050/render.json?url=http://domain.com&script=1'

# Render page and execute simple Javascript function, display the js output and the console output
curl -X POST -H 'content-type: application/javascript' \
    -d 'function getAd(x){ return x; }; console.log("some log"); console.log("another log"); getAd("abc");' \
    'http://localhost:8050/render.json?url=http://domain.com&script=1&console=1'
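On the client side, a render.json result is an ordinary JSON document. The Python sketch below shows one way to walk nested childFrames and to decode the base64-encoded screenshot. The sample response is a hand-written stand-in shaped like the examples above (with example.com URLs and a PNG that consists of the 8-byte PNG signature only), and the helper names are made up for this example.

```python
import base64
import json

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Hand-written stand-in for a response to png=1&iframes=1.
sample_response = json.dumps({
    "url": "http://example.com/",
    "requestedUrl": "http://example.com/",
    "title": "Example",
    "geometry": [0, 0, 640, 480],
    "png": base64.b64encode(PNG_SIGNATURE).decode("ascii"),
    "childFrames": [
        {
            "url": "",
            "requestedUrl": "http://example.com/embedded.html",
            "title": "Embedded",
            "geometry": [235, 502, 497, 310],
            "childFrames": [],
        }
    ],
})


def iter_frames(frame):
    """Yield the frame itself, then all nested child frames, depth-first."""
    yield frame
    for child in frame.get("childFrames", []):
        yield from iter_frames(child)


def decode_png(result):
    """Return the screenshot as raw bytes, or None if png=1 was not requested."""
    png = result.get("png")
    return base64.b64decode(png) if png is not None else None


result = json.loads(sample_response)
frame_urls = [frame["requestedUrl"] for frame in iter_frames(result)]
png_bytes = decode_png(result)
```

Because 'png=1' does not affect childFrames, decode_png is only meaningful for the top-level result.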
Warning
This endpoint is experimental. API could change in future releases.
Execute a custom rendering script and return a result.
:ref:`render.html`, :ref:`render.png`, :ref:`render.har` and :ref:`render.json` endpoints cover many common use cases, but sometimes they are not enough. This endpoint lets you write custom :ref:`Splash Scripts <scripting-tutorial>`.
Arguments:
- lua_source : string : required
- Browser automation script. See :ref:`scripting-tutorial` for more info.
- timeout : float : optional
- Same as :ref:`'timeout' <arg-timeout>` argument for render.html.
- allowed_domains : string : optional
- Same as :ref:`'allowed_domains' <arg-allowed-domains>` argument for render.html.
- proxy : string : optional
- Same as :ref:`'proxy' <arg-proxy>` argument for render.html.
- filters : string : optional
- Same as :ref:`'filters' <arg-filters>` argument for render.html.
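A request to this endpoint can be sketched in Python as below. Treat it as an assumption-laden example: the Lua body is a placeholder in the style of the scripting tutorial (which calls are available depends on your Splash version), the address http://localhost:8050 comes from the curl examples elsewhere in this document, and the helper name is invented here.

```python
import json
from urllib.request import Request

# Placeholder script; see the scripting tutorial for what your version supports.
LUA_SOURCE = """
function main(splash)
  splash:go("http://example.com")
  splash:wait(0.5)
  return splash:html()
end
"""


def build_execute_request(lua_source, timeout=None,
                          splash_url="http://localhost:8050"):
    """Build a JSON POST for the execute endpoint (nothing is sent here)."""
    params = {"lua_source": lua_source}
    if timeout is not None:
        params["timeout"] = timeout
    return Request(
        splash_url + "/execute",
        data=json.dumps(params).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


execute_request = build_execute_request(LUA_SOURCE, timeout=10)
# urllib.request.urlopen(execute_request) would run the script on a live Splash.
```

Sending the script in a JSON body avoids having to URL-encode multi-line Lua code.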
Splash supports executing JavaScript code within the context of the page. The JavaScript code is executed after the page has finished loading (including any delay defined by 'wait') but before the page is rendered. This allows the JavaScript code to modify the page being rendered.
To execute JavaScript code use :ref:`js_source <arg-js-source>` parameter. It should contain JavaScript code to be executed.
Note that browsers and proxies limit the amount of data that can be sent using GET,
so it is a good idea to use a content-type: application/json
POST request.
Curl example:
# Render page and modify its title dynamically
curl -X POST -H 'content-type: application/json' \
    -d '{"js_source": "document.title=\"My Title\";", "url": "http://example.com"}' \
    'http://localhost:8050/render.html'
Another way to do it is to use a POST request with the content-type set to 'application/javascript'. The body of the request should contain the code to be executed.
Curl example:
# Render page and modify its title dynamically
curl -X POST -H 'content-type: application/javascript' \
    -d 'document.title="My Title";' \
    'http://localhost:8050/render.html?url=http://domain.com'
To get the result of a javascript function executed within page context use render.json endpoint with :ref:`script <arg-script>` = 1 parameter.
In :ref:`Splash-as-a-proxy <splash as a proxy>` mode use X-Splash-js-source
header instead of a POST request.
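For reference, the application/javascript variant looks like this when built with Python's standard library. It is a sketch only: the helper name is hypothetical, the address is the one used in the curl examples, and the request is constructed but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_js_post(endpoint, js_code, **params):
    """POST raw JavaScript as the body; other arguments go in the query string."""
    return Request(
        "http://localhost:8050/{}?{}".format(endpoint, urlencode(params)),
        data=js_code.encode("utf-8"),
        headers={"Content-Type": "application/javascript"},
        method="POST",
    )


# script=1 asks render.json to include the value of the final statement.
js_request = build_js_post(
    "render.json",
    'function getAd(x){ return x; } getAd("abc");',
    url="http://example.com",
    script=1,
)
```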
Splash supports "javascript profiles" that allow preloading javascript files. Javascript files defined in a profile are executed after the page is loaded and before any javascript code defined in the request.
The preloaded files can be used in the user's POST'ed code.
To enable javascript profiles support, run splash server with the
--js-profiles-path=<path to a folder with js profiles>
option:
python -m splash.server --js-profiles-path=/etc/splash/js-profiles
Note
See also: :ref:`splash and docker`.
Then create a directory with the name of the profile and place inside it the javascript files to load (note they must be utf-8 encoded). The files are loaded in the order they appear in the filesystem. Directory example:
/etc/splash/js-profiles/
    mywebsite/
        lib1.js
To apply this javascript profile add the parameter
js=mywebsite
to the request:
curl -X POST -H 'content-type: application/javascript' \
    -d 'myfunc("Hello");' \
    'http://localhost:8050/render.html?js=mywebsite&url=http://domain.com'
Note that this example assumes that myfunc is a javascript function defined in lib1.js.
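Setting up such a profile can also be scripted. The Python sketch below creates a profile directory and writes its files as utf-8. It uses a temporary directory instead of /etc/splash/js-profiles so that it can be run without special permissions; the helper name and the body of myfunc are invented for this example.

```python
import tempfile
from pathlib import Path


def create_js_profile(profiles_root, name, files):
    """Create <profiles_root>/<name>/ and write each JS file as UTF-8."""
    profile_dir = Path(profiles_root) / name
    profile_dir.mkdir(parents=True, exist_ok=True)
    for filename, source in files.items():
        (profile_dir / filename).write_text(source, encoding="utf-8")
    return profile_dir


# Stand-in for the folder passed to --js-profiles-path.
profiles_root = tempfile.mkdtemp()
profile = create_js_profile(
    profiles_root,
    "mywebsite",
    {"lib1.js": "function myfunc(msg) { return msg; }\n"},
)
```

Splash would then be started with --js-profiles-path pointing at profiles_root, and requests would pass js=mywebsite.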
If Splash is started with --js-cross-domain-access
option
python -m splash.server --js-cross-domain-access
then javascript code is allowed to access the content of iframes loaded from a security origin different from that of the original page (browsers usually disallow that). This feature is useful for scraping, e.g. to extract the html of an iframe page. An example of its usage:
curl -X POST -H 'content-type: application/javascript' \
    -d 'function getContents(){ var f = document.getElementById("external"); return f.contentDocument.getElementsByTagName("body")[0].innerHTML; }; getContents();' \
    'http://localhost:8050/render.html?url=http://domain.com'
The javascript function 'getContents' will look for an iframe with the id 'external' and extract its html contents.
Note that allowing cross-origin javascript calls is a potential security issue, since secret information (e.g. cookies) may be exposed when this support is enabled; also, some websites don't load when cross-domain security is disabled, so this feature is OFF by default.
Splash supports filtering requests based on Adblock Plus rules. You can use filters from EasyList to remove ads and tracking codes (and thus speed up page loading), and/or write filters manually to block some of the requests (e.g. to prevent rendering of images, mp3 files, custom fonts, etc.).
To activate request filtering support start splash with --filters-path
option:
python -m splash.server --filters-path=/etc/splash/filters
Note
See also: :ref:`splash and docker`.
The folder --filters-path
points to should contain .txt
files with
filter rules in Adblock Plus format. You may download easylist.txt
from EasyList and put it there, or create .txt
files with your own rules.
For example, let's create a filter that will prevent custom fonts
in ttf
and woff
formats from loading (due to qt bugs they may cause
splash to segfault on Mac OS X):
! put this to a /etc/splash/filters/nofonts.txt file
! comments start with an exclamation mark
.ttf|
.woff|
To use this filter in a request add filters=nofonts
parameter
to the query:
curl 'http://localhost:8050/render.png?url=http://domain.com/page-with-fonts.html&filters=nofonts'
You can apply several filters; separate them by comma:
curl 'http://localhost:8050/render.png?url=http://domain.com/page-with-fonts.html&filters=nofonts,easylist'
If default.txt
file is present in --filters-path
folder it is
used by default when filters
argument is not specified. Pass
filters=none
if you don't want default filters to be applied.
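The selection rules for the filters argument can be summarized in a small Python function. This only restates the behaviour described above and is not Splash's own code; the function name is hypothetical, and `available` stands for the names of the .txt files found in the --filters-path folder.

```python
def resolve_filters(filters_arg, available):
    """Return the list of filter names that apply to a request.

    filters_arg -- value of the `filters` argument, or None if it is absent
    available   -- names of the .txt files in --filters-path (without ".txt")
    """
    if filters_arg is None:
        # No argument: fall back to default.txt when it exists.
        return ["default"] if "default" in available else []
    if filters_arg == "none":
        # Explicitly disable the default filters.
        return []
    # Comma-separated list of filter names.
    return [name for name in filters_arg.split(",") if name]


available = {"default", "nofonts", "easylist"}
```

For example, resolve_filters("nofonts,easylist", available) selects the two named filters and skips the default one.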
To learn about Adblock Plus filter syntax, consult the Adblock Plus filter documentation.
Splash doesn't support the full Adblock Plus filter syntax; there are some limitations:
- element hiding rules are not supported; filters can prevent network requests from happening, but they can't hide parts of an already loaded page;
- only the domain option is supported.
Unsupported rules are silently discarded.
Note
If you want to stop downloading images check :ref:`'images' <arg-images>` parameter. It doesn't require URL-based filters to work, and it can filter images that are hard to detect using URL-based patterns.
Warning
It is very important to have pyre2 library installed if you are going to use filters with a large number of rules (this is the case for files downloaded from EasyList).
Without the pyre2 library, Splash (via adblockparser) relies on the re module from the stdlib, which can be more than 1000 times slower than re2. With a large number of rules and no re2, it may be faster to download files than to discard them. With re2, matching becomes very fast.
Make sure you are not using re2==0.2.20 installed from PyPI (it is broken); use the latest version from github.
Splash supports "proxy profiles" that allow setting proxy handling rules
per request using the proxy
parameter.
To enable proxy profiles support, run splash server with
--proxy-profiles-path=<path to a folder with proxy profiles>
option:
python -m splash.server --proxy-profiles-path=/etc/splash/proxy-profiles
Note
If you run Splash using Docker, check :ref:`docker-folder-sharing`.
Then create an INI file with "proxy profile" config inside the
specified folder, e.g. /etc/splash/proxy-profiles/mywebsite.ini
.
Example contents of this file:
[proxy]

; required
host=proxy.crawlera.com
port=8010

; optional, default is no auth
username=username
password=password

[rules]
; optional, default ".*"
whitelist=
    .*mywebsite\.com.*

; optional, default is no blacklist
blacklist=
    .*\.js.*
    .*\.css.*
    .*\.png
whitelist and blacklist are newline-separated lists of regexes.
If a URL matches one of the whitelist patterns and none of the blacklist
patterns, the proxy specified in the [proxy]
section is used;
no proxy is used otherwise.
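The whitelist/blacklist rule can be illustrated with a short Python sketch that parses a profile and decides whether a URL would go through the proxy. It is not Splash's implementation: the function names are made up, the profile text is a trimmed copy of the example, and matching patterns from the start of the URL with re.match is an assumption.

```python
import configparser
import re

# Trimmed copy of the example profile (authentication omitted).
PROFILE_TEXT = r"""
[proxy]
host=proxy.crawlera.com
port=8010

[rules]
whitelist=
    .*mywebsite\.com.*
blacklist=
    .*\.js.*
    .*\.css.*
    .*\.png
"""


def load_rules(text):
    """Parse the [rules] section into (whitelist, blacklist) regex lists."""
    parser = configparser.RawConfigParser()
    parser.read_string(text)

    def patterns(option, default):
        if not parser.has_option("rules", option):
            return default
        lines = parser.get("rules", option).splitlines()
        return [line.strip() for line in lines if line.strip()]

    # Defaults follow the example's comments: whitelist ".*", no blacklist.
    return patterns("whitelist", [".*"]), patterns("blacklist", [])


def should_use_proxy(url, whitelist, blacklist):
    """True if the URL matches a whitelist pattern and no blacklist pattern."""
    # Assumption: patterns are matched from the start of the URL.
    whitelisted = any(re.match(p, url) for p in whitelist)
    blacklisted = any(re.match(p, url) for p in blacklist)
    return whitelisted and not blacklisted


whitelist, blacklist = load_rules(PROFILE_TEXT)
```

With these rules, pages on mywebsite.com go through the proxy, while their .js, .css and .png resources are fetched directly.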
Then, to apply proxy rules according to this profile,
add proxy=mywebsite
parameter to request:
curl 'http://localhost:8050/render.html?url=http://mywebsite.com/page-with-javascript.html&proxy=mywebsite'
If default.ini
profile is present, it will be used when proxy
argument is not specified. If you have default.ini
profile
but don't want to apply it pass none
as proxy
value.
Splash supports working as an HTTP proxy. In this mode all HTTP requests received will be proxied, and the response will be rendered based on the following HTTP headers:
- X-Splash-render : string : required
- The render mode to use, valid modes are: html, png and json. These modes have the same behavior as the endpoints: render.html, render.png and render.json respectively.
- X-Splash-js-source : string
- Allows executing custom javascript code in page context. See :ref:`execute javascript`.
- X-Splash-js : string
- Same as :ref:`'js' <arg-js>` argument for render.html. See :ref:`Javascript Profiles`.
- X-Splash-timeout : string
- Same as :ref:`'timeout' <arg-timeout>` argument for render.html.
- X-Splash-wait : string
- Same as :ref:`'wait' <arg-wait>` argument for render.html.
- X-Splash-proxy : string
- Same as :ref:`'proxy' <arg-proxy>` argument for render.html.
- X-Splash-filters : string
- Same as :ref:`'filters' <arg-filters>` argument for render.html.
- X-Splash-allowed-domains : string
- Same as :ref:`'allowed_domains' <arg-allowed-domains>` argument for render.html.
- X-Splash-viewport : string
- Same as :ref:`'viewport' <arg-viewport>` argument for render.html.
- X-Splash-images : string
- Same as :ref:`'images' <arg-images>` argument for render.html.
- X-Splash-width : string
- Same as :ref:`'width' <arg-width>` argument for render.png.
- X-Splash-height : string
- Same as :ref:`'height' <arg-height>` argument for render.png.
- X-Splash-render-all : string
- Same as :ref:`'render_all' <arg-render-all>` argument for render.png.
- X-Splash-scale-method : string
- Same as :ref:`'scale_method' <arg-scale-method>` argument for render.png.
- X-Splash-html : string
- Same as :ref:`'html' <arg-html>` argument for render.json.
- X-Splash-png : string
- Same as :ref:`'png' <arg-png>` argument for render.json.
- X-Splash-iframes : string
- Same as :ref:`'iframes' <arg-iframes>` argument for render.json.
- X-Splash-script : string
- Same as :ref:`'script' <arg-script>` argument for render.json.
- X-Splash-console : string
- Same as :ref:`'console' <arg-console>` argument for render.json.
- X-Splash-history : string
- Same as :ref:`'history' <arg-history>` argument for render.json.
- X-Splash-har : string
- Same as :ref:`'har' <arg-har>` argument for render.json.
Note
Proxying of HTTPS requests is not supported.
Curl examples:
# Display json stats
curl -x localhost:8051 -H 'X-Splash-render: json' \
    http://www.domain.com

# Get the html page and screenshot
curl -x localhost:8051 \
    -H "X-Splash-render: json" \
    -H "X-Splash-html: 1" \
    -H "X-Splash-png: 1" \
    http://www.mywebsite.com

# Execute JS and return output
curl -x localhost:8051 \
    -H 'X-Splash-render: json' \
    -H 'X-Splash-script: 1' \
    -H 'X-Splash-js-source: function test(x){ return x; } test("abc");' \
    http://www.domain.com

# Send POST request to site and save screenshot of results
curl -X POST -d '{"key":"val"}' -x localhost:8051 -o screenshot.png \
    -H 'X-Splash-render: png' \
    http://www.domain.com
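The header names listed above follow a regular pattern: the endpoint argument name prefixed with X-Splash-, with underscores replaced by hyphens. The Python sketch below relies on that observation to turn keyword arguments into proxy-mode headers. The helper is hypothetical and only builds a dictionary; the headers would then be attached to a request sent through the proxy on port 8051.

```python
def splash_proxy_headers(render, **args):
    """Map endpoint-style arguments to X-Splash-* proxy headers."""
    if render not in ("html", "png", "json"):
        raise ValueError("render must be one of: html, png, json")
    headers = {"X-Splash-render": render}
    for name, value in args.items():
        # e.g. render_all -> X-Splash-render-all
        headers["X-Splash-" + name.replace("_", "-")] = str(value)
    return headers


# Equivalent of the "Get the html page and screenshot" curl example.
headers = splash_proxy_headers("json", html=1, png=1)

# Arguments containing underscores are converted to hyphenated header names.
full_page = splash_proxy_headers("png", render_all=1, wait=0.5)
```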
Splash proxy mode is enabled by default; it uses port 8051. To change the port
use --proxy-portnum
option:
python -m splash.server --proxy-portnum=8888
To disable Splash proxy mode run splash server with --disable-proxy
option:
python -m splash.server --disable-proxy