Hext — Extract Data from HTML

Hext is a domain-specific language for extracting structured data from HTML documents.

See https://hext.thomastrapp.com for documentation, build instructions and a live demo.

The Hext project is released under the terms of the Apache License v2.0.

Example

Suppose you want to extract all hyperlinks from a web page. A hyperlink is an anchor element <a> with an attribute called href and text that visitors can click. The following Hext snippet produces one dictionary for every matched element. Each dictionary contains the keys link and title, which hold the href attribute and the text content of the matched <a>, respectively.

# Extract links and their text
<a href:link @text:title />
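
To make the shape of that result concrete, here is a rough analogue in plain Python. It does not use Hext or libhext, and it is not how Hext works internally; it only imitates the snippet's output (one dictionary with the keys link and title per <a> that has an href) using the standard library's html.parser. The names LinkCollector and extract_links are made up for this sketch, and collapsing whitespace in the title is an assumption rather than documented Hext behavior.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects one dict per <a href="..."> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._open = None  # dict for the <a> currently being read, if any

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        # Only anchors that carry an href value are captured.
        if tag == "a" and href is not None:
            self._open = {"link": href, "title": ""}

    def handle_data(self, data):
        # Text of nested elements is appended to the enclosing anchor's title.
        if self._open is not None:
            self._open["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            # Assumption: collapse runs of whitespace in the captured text.
            self._open["title"] = " ".join(self._open["title"].split())
            self.results.append(self._open)
            self._open = None


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.results


page = (
    '<p><a href="/a">First</a> and '
    '<a href="/b">Second <b>link</b></a>'
    '<a name="x">no href</a></p>'
)
print(extract_links(page))
```

The anchor without an href is skipped, which mirrors the idea that the snippet only matches elements carrying every attribute it names.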

Visit Hext's project page to learn more about Hext. For examples that use the libhext C++ library, check out /libhext/examples and the main page of the Doxygen documentation.

Quick Install via Pip

You can install the htmlext command-line utility and the Python bindings through pip:

pip install hext
htmlext --version

Available for all flavors of Linux (x86_64) and Mac OS X ≥ 10.11 (x86_64). Visit https://pypi.org/project/hext/.

Hext for Node via NPM

Hext for Node is available on npm:

npm install hext
node -e 'require("hext")' && echo "hext loaded successfully"

The npm package does not include the htmlext command-line utility. Visit https://www.npmjs.com/package/hext.

Compatibility

The npm package is compatible with:

  • Node v8, v10, v11
  • Linux (GLIBC ≥2.14, basically any distribution built after the year 2012)
  • Mac OS X (10.11 El Capitan or later)
  • x86_64 only

Components of this Project

  • htmlext: Command line utility that applies Hext snippets to an HTML document and produces JSON.
  • libhext: C++ library that contains a Hext parser but also allows for customization.
  • libhext-test: Unit tests for libhext.
  • Hext bindings: Bindings for scripting languages. There are extensions for Node.js, Python, Ruby and PHP that are able to parse Hext and extract values from HTML.

Project layout

├── build             Build directory for htmlext
├── cmake             CMake modules used by the project
├── htmlext           Source for the htmlext command line tool
├── libhext           The libhext project
│   ├── bindings      Hext bindings for scripting languages
│   ├── build         Build directory for libhext
│   ├── doc           Doxygen documentation for libhext
│   ├── examples      Examples making use of libhext
│   ├── include       Public libhext API
│   ├── ragel         Ragel input files
│   ├── scripts       Helper scripts for libhext
│   ├── src           libhext implementation files
│   └── test          The libhext-test project
│       ├── build     Build directory for libhext-test
│       └── src       Source for libhext-test
├── man               Htmlext man page
├── scripts           Scripts for building and testing releases
├── syntaxhl          Syntax highlighters for Vim and ACE
└── test              Blackbox tests for htmlext

Dependencies for development

  • Ragel generates the state machine that is used to parse Hext
  • The unit tests for libhext are written with Google Test
  • libhext's public API documentation is generated by Doxygen
  • libhext's scripting language bindings are generated by Swig

Tests

There are unit tests for libhext and blackbox tests for Hext as a language, whose main purpose is to detect unwanted changes in syntax or behavior.
The libhext-test project is located in /libhext/test and depends on Google Test. Nothing fancy: just build the project and run the executable libhext-test. See the Google Test documentation for how to write test cases.
The blackbox tests are located in /test. There you'll find a shell script called blackbox.sh. This script applies Hext snippets to HTML documents and compares the result to a third file that contains the expected output. For example, there is a test case icase-quoted-regex that consists of three files: icase-quoted-regex.hext, icase-quoted-regex.html, and icase-quoted-regex.expected. To run this test case you would do the following:

$ ./blackbox.sh case/icase-quoted-regex.hext

blackbox.sh will then look for the corresponding .html and .expected files of the same name in the directory of icase-quoted-regex.hext. Then it will invoke htmlext with the given Hext snippet and HTML document and compare the result to icase-quoted-regex.expected. To run all blackbox tests in succession:

$ ./blackbox.sh case/*.hext
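
The per-case procedure described above (find the companion files next to the .hext file, run the snippet against the HTML, compare with the expected output) can be sketched in Python as follows. This is not a port of blackbox.sh: companion_files and check_case are hypothetical names, the run callable stands in for invoking htmlext (its real command-line arguments are deliberately left out), and the exact string comparison is an assumption about how results are compared.

```python
import tempfile
from pathlib import Path


def companion_files(hext_path):
    """Return the .html and .expected paths that sit next to a .hext file."""
    hext_path = Path(hext_path)
    return hext_path.with_suffix(".html"), hext_path.with_suffix(".expected")


def check_case(hext_path, run):
    """Run one test case.

    `run` is a hypothetical callable(hext_path, html_path) -> str that
    stands in for applying the snippet with htmlext.
    """
    html_path, expected_path = companion_files(hext_path)
    actual = run(Path(hext_path), html_path)
    # Assumption: a case passes when the output matches the file exactly.
    return actual == expected_path.read_text()


# Demo with placeholder files and a fake runner instead of the real binary.
with tempfile.TemporaryDirectory() as tmp:
    case = Path(tmp) / "demo.hext"
    case.write_text("<a href:link @text:title />\n")
    case.with_suffix(".html").write_text('<a href="/x">x</a>\n')
    case.with_suffix(".expected").write_text("placeholder output\n")
    passed = check_case(case, lambda hext, html: "placeholder output\n")

print(passed)
```

Passing the runner in as an argument keeps the sketch independent of how htmlext is located or invoked.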

By default blackbox.sh will look for the htmlext binary in $PATH. Failing that, it looks for the binary in the default build directory. You can tell blackbox.sh which command to use by setting HTMLEXT. For example, to run all tests through valgrind you'd run the following:

$ HTMLEXT="valgrind -q ../build/htmlext" ./blackbox.sh case/*.hext
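
The lookup order for the htmlext command can be illustrated with a small Python sketch. It is only a model of the behavior described above, not the script's actual code: resolve_htmlext is a hypothetical name, and the fallback path ../build/htmlext is an assumption taken from the valgrind example.

```python
import os
import shlex
import shutil
from pathlib import Path


def resolve_htmlext(env=None, which=shutil.which,
                    fallback=Path("../build/htmlext")):
    """Return the command to run as a list of arguments, or None."""
    env = os.environ if env is None else env

    # 1. An explicit HTMLEXT setting wins; it may contain a wrapper
    #    command such as valgrind plus its options.
    override = env.get("HTMLEXT")
    if override:
        return shlex.split(override)

    # 2. Otherwise look for the binary in $PATH.
    found = which("htmlext")
    if found:
        return [found]

    # 3. Failing that, try the build directory (assumed location).
    if fallback.exists():
        return [str(fallback)]
    return None


print(resolve_htmlext(env={"HTMLEXT": "valgrind -q ../build/htmlext"}))
```

Splitting the override with shlex.split keeps a wrapper command and its flags as separate arguments, which is what a quoted value like the valgrind one above relies on.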

Acknowledgements

  • Gumbo: An HTML5 parsing library in pure C99
    Gumbo is used as the HTML parser behind hext::Html. It's fast, easy to integrate and even fixes invalid HTML.
  • Ragel: Ragel State Machine Compiler
    The state machine that is used to parse Hext snippets is generated by Ragel. You can find the definition of this machine in /libhext/ragel/hext-machine.rl.
  • RapidJSON: A fast JSON parser/generator for C++
    RapidJSON powers the JSON output of the htmlext command line utility.
  • jq: A lightweight and flexible command-line JSON processor
    An indispensable tool when dealing with JSON in the shell. Piping the output of htmlext into jq lets you do all sorts of crazy things.
  • Ace: A Code Editor for the Web
    Used as the code editor in the "Try Hext in your Browser!" section and as a highlighter for all code examples. The highlighting rules for Hext are included in this project in /syntaxhl/ace. Also, there's a script in /libhext/scripts/syntax-hl-ace that uses Ace to transform a code snippet into highlighted HTML.
  • Boost.Beast: HTTP and WebSocket built on Boost.Asio in C++11
    The WebSocket server behind the "Try Hext in your Browser!" section is built with Beast. See github.com/thomastrapp/hext-on-websockets for more.

Licensing

All source code of the Hext project is released under the Apache License v2.0, with the sole exception of /libhext/doc/doxygen/resources/bootstrap.min.css, which was authored by a third party and is included in this project under the terms of the MIT license. See /libhext/doc/doxygen/resources/bootstrap.min.css.LICENSE for the full license text. Visit Bootstrap on GitHub.

About me

I am a freelancing software developer living in Munich. Visit thomastrapp.com for my email address and let me know what you think about Hext!