Latest commit e56fbc0 Jan 5, 2016

HTML parsers benchmark

Simple HTML DOM parser benchmark.





  • BeautifulSoup 3
  • BeautifulSoup 4
  • html5lib
  • Google Go







Install the OS dependencies: python-virtualenv, Erlang, PyPy, a C compiler, and the libxml2 dev packages.

sudo apt-get install ...
    libxml2-dev libxslt1-dev build-essential  # common
    python-virtualenv python-lxml             # python
    erlang-base                               # erlang
    pypy                                      # python PyPy
    nodejs npm                                # NodeJS
    cabal-install libicu-dev                  # Haskell
    php5-cli php5-tidy                        # PHP
    golang                                    # Go
    ruby1.9.1 ruby1.9.1-dev rubygems1.9.1     # Ruby
    maven2 default-jdk                        # Java
    mono-runtime mono-dmcs                    # Mono

Then run (it will prepare virtual environments, fetch dependencies, compile sources, etc.)


In case of errors, I recommend also installing cython and python-dev, then retrying.

To prepare only some of the platforms, define the PLATFORMS environment variable:

PLATFORMS="pypy python" ./


Just run

./ <number of parser iterations>


For example:

./ 5000

To run tests only for some of the platforms, define the PLATFORMS environment variable:

PLATFORMS="pypy python" ./ 5000

To run a series of tests, use a snippet like

for C in 10 50 100 400 600 1000; do ./ $C | tee output_$C.txt; done


To convert the results to a CSV file, use

./ 5000 | ./

or something like

./ 5000 | tee output.txt
./ < output.txt

or, for a series,

for C in 10 50 100 400 600 1000; do ./ < output_$C.txt > results-$C.csv; done

There is also an R script that can build some pretty graphs: stats/main.r.

How to add my %platformname% to the benchmark set?

Create a directory named %platformname%:

mkdir %platformname%

Create and scripts:

  • - called every time the benchmark starts. It must use the print_header() and timeit() functions from to format the output for each test. It must accept 2 arguments (the HTML file path and the number of iterations) and pass them unchanged to the benchmark scripts.
  • - called only once, before running any benchmarks. It can download dependencies, compile sources, etc.

Create your benchmark scripts. Requirements:

  • Must accept 2 arguments: the path to an HTML file and the number of iterations
  • Must read the HTML file once, then perform "number of iterations" parse cycles
  • Must print the parser-loop runtime in seconds, calculated like start = time(); do_n_iterations(N); print time() - start
  • On each iteration, must build a full DOM tree in memory
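The requirements above can be sketched as a minimal, stdlib-only Python benchmark script. This is a hypothetical illustration: the Node tree and html.parser-based TreeBuilder here stand in for whatever parser your platform actually benchmarks, and the command-line interface is assumed from the requirements list.

```python
# Hypothetical benchmark script sketch: builds a full tree in memory on
# every iteration and prints the parser-loop runtime in seconds.
import sys
import time
from html.parser import HTMLParser


class Node:
    """Minimal DOM node: a tag name plus a list of children."""
    def __init__(self, tag):
        self.tag = tag
        self.children = []  # Node instances and text strings


class TreeBuilder(HTMLParser):
    """Builds a full tree for each parse, as the requirements demand."""
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def main(path, iterations):
    with open(path, encoding="utf-8") as f:
        html = f.read()              # read the HTML file once
    start = time.time()
    for _ in range(iterations):      # "number of iterations" parse cycles
        parse(html)
    print(time.time() - start)       # parser-loop runtime in seconds


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], int(sys.argv[2]))
```

A real benchmark script would replace parse() with a call into the platform's parser; only the shape (two arguments, one file read, N full-tree parses, one elapsed-seconds line on stdout) matters for the harness.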

Add %platformname% to the platforms.txt file.

How to add new HTML to benchmark?

Just create an HTML file named page_<some_page_name>.html.