Creating and viewing WARC web archives in Servo
Servo can use web archives for repeatable performance testing against real-world web content, in particular sites with a large amount of third-party ad content. Web archives can be replayed through an HTTP proxy that has no live access to the internet, which provides a stable platform for performance benchmarking.
This note describes how to record and replay web archives, using Servo as the web engine.
The pywb tools support recording and playback of web archives. They can be run using Python 2 or 3, and can be installed using pip (inside a virtualenv if preferred):
pip install git+https://github.com/ikreymer/pywb.git
Docs are at https://pywb.readthedocs.io.
Creating a web archive
To create a web archive, first initialize a collection (here named archives):
wb-manager init archives
Then start the http server, giving it access to the live internet, and asking it to record and index an archive:
wayback --live --record --autoindex
You can now browse the web, and pages will be recorded in your archives. For example (from the servo build directory):
./mach run -r http://localhost:8080/archives/record/https://nytimes.com/
Once enough of the page is visible, you can quit servo and the wayback server. This should have created an archive file:
ls collections/archives/archive
rec-TIMESTAMP-MACHINE.warc.gz
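When several recordings accumulate, it can help to pick out the most recent one. The following is a minimal sketch, assuming the collections/archives/archive layout shown above; the function name latest_warc and its arguments are hypothetical and not part of pywb.

```shell
#!/bin/sh
# Hypothetical helper (not part of pywb): print the most recently
# modified WARC file in a collection, or nothing if there is none.
# Usage: latest_warc [COLLECTION] [ROOT]
latest_warc() {
    coll="${1:-archives}"      # collection name, as given to wb-manager init
    root="${2:-collections}"   # directory created by wb-manager init
    dir="$root/$coll/archive"
    # Expand the glob into the function's positional parameters.
    set -- "$dir"/*.warc.gz
    # If nothing matched, the pattern is left unexpanded and does not exist.
    [ -e "$1" ] || return 0
    # ls -t sorts by modification time, newest first.
    ls -t "$@" | head -n 1
}
```

Running latest_warc from the directory where the collection was initialized should print the path of the newest recording.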
Replaying a web archive
To replay a web archive already in your collection, first start the wayback server:
wayback
Then view the archived content:
./mach run -r http://localhost:8080/archives/https://nytimes.com/
To replay a web archive recorded elsewhere, first add it to your collection:
wb-manager add archives some-other-archive.warc.gz
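The record and replay URLs used above follow one pattern: the server address, then the collection name, then (for recording only) record, then the original URL. A small sketch of that pattern follows; the function name pywb_url and the PYWB_BASE variable are hypothetical conveniences, and the default address is the localhost:8080 used in the examples above.

```shell
#!/bin/sh
# Hypothetical helper: build the URL to hand to Servo for recording
# or replaying a page through pywb's URL-rewriting mode.
# Usage: pywb_url record|replay COLLECTION TARGET_URL
pywb_url() {
    mode="$1"
    coll="$2"
    target="$3"
    # PYWB_BASE is an assumed override; the default matches the examples above.
    base="${PYWB_BASE:-http://localhost:8080}"
    case "$mode" in
        record) printf '%s/%s/record/%s\n' "$base" "$coll" "$target" ;;
        replay) printf '%s/%s/%s\n' "$base" "$coll" "$target" ;;
        *)
            echo "usage: pywb_url record|replay COLLECTION TARGET_URL" >&2
            return 1
            ;;
    esac
}
```

For example, ./mach run -r "$(pywb_url replay archives https://nytimes.com/)" would be equivalent to the replay command shown earlier.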
Replaying a web archive as an http proxy
Replaying archives through rewritten URLs, as above, is simple, but has some problems:
It relies on URL rewriting to add the archive prefix to each URL.
Any URLs that are not rewritten will be fetched from the live internet, so not all content is the same between runs.
Since URLs are rewritten to be under localhost, they are all considered same-origin by Servo, so they are all executed in the same content thread, losing much of the benefit of concurrency.
pywb also supports replaying via an HTTP proxy, which removes the need for URL rewriting, since all content is delivered via the proxy.
wayback --proxy archives
Now, as well as serving URL-rewritten content on localhost, the wayback server also acts as an HTTP proxy, serving the original content without any URL rewriting. Unfortunately, a few steps are needed before Servo can view this content:
Servo does not have support for HTTP proxies, so needs to be proxified. On Linux this can be done with the proxychains command (installed on Debian-based systems by apt-get install proxychains).
Any https content is served using a certificate signed with a key stored in proxy-certs/pywb-ca.pem. This certificate must be added as a root certificate for Servo.
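To run Servo with proxychains, first create a proxychains.conf file in the current directory:
[ProxyList]
http 127.0.0.1 8080
Then run Servo under proxychains, passing the pywb certificate: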
proxychains ./mach run -r --certificate-path proxy-certs/pywb-ca.pem https://nytimes.com/
This will view the archived content without any URL rewriting.
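The proxychains configuration can also be generated by a script, which is convenient when benchmarks run from a clean checkout. This is a sketch only: it assumes proxychains picks up a proxychains.conf file from the current directory, and the function name write_proxychains_conf and its arguments are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: write a minimal proxychains.conf that sends
# traffic to the pywb proxy.
# Usage: write_proxychains_conf [DIR] [HOST] [PORT]
write_proxychains_conf() {
    dir="${1:-.}"            # where to write the file (assumed: current directory)
    host="${2:-127.0.0.1}"   # address of the wayback server
    port="${3:-8080}"        # port of the wayback server
    # One [ProxyList] section with a single http proxy entry.
    printf '[ProxyList]\nhttp %s %s\n' "$host" "$port" > "$dir/proxychains.conf"
}
```

Calling write_proxychains_conf with no arguments from the Servo build directory should produce the same two-line configuration described above.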
Unfortunately, there is a gotcha. Plain http content can be proxied in two ways, using CONNECT or using GET: proxychains uses CONNECT, but pywb only supports GET. Fortunately, https content is always proxied using CONNECT, which pywb does support, so this technique works for https content (which these days is most of the web).
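To make the difference between the two proxying styles concrete, the sketch below prints the first request line a client would send to an HTTP proxy in each case: a GET carrying the full URL, or a CONNECT naming only the host and port. It is an illustration of the HTTP/1.1 request forms rather than anything taken from pywb or proxychains; the function name proxy_request_line is hypothetical, and it does not handle IPv6 literals.

```shell
#!/bin/sh
# Hypothetical illustration: print the request line sent to an HTTP proxy.
# Usage: proxy_request_line get|connect URL
proxy_request_line() {
    style="$1"
    url="$2"
    scheme="${url%%://*}"   # e.g. "https"
    rest="${url#*://}"      # e.g. "nytimes.com/"
    host="${rest%%/*}"      # e.g. "nytimes.com"
    case "$style" in
        get)
            # GET-style proxying: the proxy sees the whole URL.
            printf 'GET %s HTTP/1.1\n' "$url"
            ;;
        connect)
            # CONNECT-style proxying: the proxy sees only host:port
            # and then relays bytes. Add the default port if none is given.
            case "$host" in
                *:*) ;;
                *)
                    if [ "$scheme" = https ]; then
                        host="$host:443"
                    else
                        host="$host:80"
                    fi
                    ;;
            esac
            printf 'CONNECT %s HTTP/1.1\n' "$host"
            ;;
        *)
            return 1
            ;;
    esac
}
```

With plain http content, proxychains would send the CONNECT form where pywb expects the GET form, which is the mismatch described above.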
At some point, Servo may get native support for HTTP proxies, at which point this should become a non-issue, but for now we're stuck only being able to test https content.
Recording a web archive as an http proxy
To record an archive while running as an http proxy:
wayback --proxy archives --live --proxy-record --autoindex
Then run Servo with an http proxy as before.