purehtml

purehtml:

is a reusable, HTML-only parser for developing crawlers, converters, or simple clients that consume HTML.
aims to follow the latest WHATWG HTML Living Standard.
is intentionally small ("human-scale") so that a single person can understand and maintain it without requiring a team.
can be used in either SAX or DOM mode. In SAX mode, the application works with the data as it arrives and releases it right away (low processing and memory footprint). In DOM mode, it waits until the full tree has been constructed and can then navigate the tree as many times as needed.
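
The difference between the two modes can be sketched in C. The event model and all names below (struct event, emit_events, count_text_bytes, build_tree and so on) are hypothetical stand-ins chosen for illustration; they are not purehtml's actual API, and a fixed event list takes the place of a real tokenizer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical event model for illustration; NOT purehtml's actual API. */
enum ev_type { EV_OPEN, EV_TEXT, EV_CLOSE };

struct event {
	enum ev_type type;
	const char *data;	/* tag name for OPEN/CLOSE, content for TEXT */
};

typedef void (*event_cb)(const struct event *, void *);

/* Events for <ul><li>one</li><li>two</li></ul>. */
static const struct event sample[] = {
	{ EV_OPEN, "ul" },
	{ EV_OPEN, "li" }, { EV_TEXT, "one" }, { EV_CLOSE, "li" },
	{ EV_OPEN, "li" }, { EV_TEXT, "two" }, { EV_CLOSE, "li" },
	{ EV_CLOSE, "ul" }
};
#define SAMPLE_LEN (sizeof(sample) / sizeof(sample[0]))

/* Stand-in for a tokenizer: replays a fixed event list to the callback. */
static void
emit_events(const struct event *evs, size_t n, event_cb cb, void *arg)
{
	for (size_t i = 0; i < n; i++)
		cb(&evs[i], arg);
}

/* SAX style: act on each event as it arrives; only a counter is kept. */
static void
count_text_bytes(const struct event *ev, void *arg)
{
	size_t *total = arg;

	if (ev->type == EV_TEXT)
		*total += strlen(ev->data);
}

/* DOM style: build a tree first, then walk it as often as needed. */
struct node {
	const char *name;	/* NULL for text nodes */
	const char *text;	/* NULL for element nodes */
	struct node *parent, *first_child, *last_child, *next;
};

struct builder { struct node *root, *cur; };

static struct node *
node_new(struct node *parent, const char *name, const char *text)
{
	struct node *n = calloc(1, sizeof(*n));

	assert(n != NULL);
	n->name = name;
	n->text = text;
	n->parent = parent;
	if (parent != NULL) {
		if (parent->last_child != NULL)
			parent->last_child->next = n;
		else
			parent->first_child = n;
		parent->last_child = n;
	}
	return n;
}

/* Assumes balanced events and a single root element. */
static void
build_tree(const struct event *ev, void *arg)
{
	struct builder *b = arg;

	if (ev->type == EV_OPEN) {
		b->cur = node_new(b->cur, ev->data, NULL);
		if (b->root == NULL)
			b->root = b->cur;
	} else if (ev->type == EV_TEXT && b->cur != NULL) {
		node_new(b->cur, NULL, ev->data);
	} else if (ev->type == EV_CLOSE && b->cur != NULL) {
		b->cur = b->cur->parent;
	}
}

/* Counts elements called name in n, its later siblings and descendants. */
static size_t
count_elems(const struct node *n, const char *name)
{
	size_t c = 0;

	for (; n != NULL; n = n->next) {
		if (n->name != NULL && strcmp(n->name, name) == 0)
			c++;
		c += count_elems(n->first_child, name);
	}
	return c;
}

static void
free_tree(struct node *n)
{
	while (n != NULL) {
		struct node *next = n->next;

		free_tree(n->first_child);
		free(n);
		n = next;
	}
}
```

In this sketch, passing count_text_bytes to emit_events never stores more than a running total, whereas passing build_tree keeps every node in memory so that count_elems (or any other traversal) can be run repeatedly before free_tree releases the tree.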

The parser can be used to create other programs that, for example:

Programmatically read data from an HTML page.
Convert HTML to XML.
Convert HTML to text/gemini.
Convert HTML to Markdown.
Indent HTML.
Minify HTML.
Validate HTML.
Make an RSS or Atom feed directly from HTML content.
View HTML in a special way for increased accessibility (e.g. for blind users).
Allow geeks to ultimately customize the Web user experience.

The parser intentionally does not support:

JavaScript: HTML is still somewhat "human-scale", but JavaScript is not. Supporting JavaScript would mean that a team would be needed to maintain the parser, or even to understand its code. There are other projects that require a team. Here, the purpose is to support "human-scale" development.
CSS: Parsing CSS is far simpler than JavaScript, and CSS support could be added. However, it is better for maintainability and for human-scale development if purehtml is used to extract the CSS content, which is then fed to a separate parser. Most use cases envisioned for purehtml do not need CSS, so bundling a CSS parser would only bloat purehtml for them.
Full DOM: Supporting the full DOM as specified in the WHATWG DOM Living Standard is not necessary because there is no JavaScript support. It is nevertheless very easy to use the parser in either SAX or DOM style.
JSON-LD: JSON-LD content can be extracted using purehtml and then fed to a separate parser. Microdata, on the other hand, does not need a separate parser: because Microdata is plain HTML, purehtml can handle it directly.
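
The "extract, then hand over to a separate parser" approach described for CSS can be sketched as a small SAX-style handler. The event model and the names struct event, struct css_collector and collect_css are assumptions made up for this illustration, not purehtml's actual API. A JSON-LD collector would presumably look similar, with an additional check on the script element's type attribute.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical event model for illustration; NOT purehtml's actual API. */
enum ev_type { EV_OPEN, EV_TEXT, EV_CLOSE };

struct event {
	enum ev_type type;
	const char *data;	/* tag name for OPEN/CLOSE, content for TEXT */
};

/* Collects the raw text found inside <style> elements. */
struct css_collector {
	int in_style;		/* nonzero while inside <style> */
	char buf[256];		/* fixed size keeps the sketch short */
	size_t len;		/* bytes used in buf, excluding the NUL */
};

/* Call once per event; initialize the collector to all zeroes first. */
static void
collect_css(struct css_collector *c, const struct event *ev)
{
	size_t n;

	assert(c != NULL && ev != NULL);
	switch (ev->type) {
	case EV_OPEN:
		if (strcmp(ev->data, "style") == 0)
			c->in_style = 1;
		break;
	case EV_CLOSE:
		if (strcmp(ev->data, "style") == 0)
			c->in_style = 0;
		break;
	case EV_TEXT:
		if (!c->in_style)
			break;
		n = strlen(ev->data);
		/* Drop text that would overflow; real code would grow the buffer. */
		if (n < sizeof(c->buf) - c->len) {
			memcpy(c->buf + c->len, ev->data, n + 1);
			c->len += n;
		}
		break;
	}
}
```

After the last event, buf holds only the stylesheet text and can be passed to whichever CSS parser the application prefers, so the HTML parser itself stays free of CSS handling.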

About the intentional lack of JavaScript support

Some Web sites are broken because they require JavaScript. JavaScript should be used only for providing optional interactivity enhancements. Since such sites are essentially broken (or at least should not be seen as offering programmatically readable HTML), purehtml will also fail with them.

However, being able to disable JavaScript completely:

greatly decreases interaction latency and speeds up page loading (which is important, see Response Times: The 3 Important Limits);
facilitates human-scale development;
hugely increases privacy;
hugely decreases the theoretical risk of security vulnerabilities because there is thousands of times less attack surface to worry about;
removes asynchronous page load behavior, which is highly annoying and decreases usability;
moves control back from Web developers to application developers (e.g. developers who use purehtml to build customized special-purpose clients) and to end users (because they can choose the application based on how they want to view the content).

Dependencies

None.

The parser intentionally requires no libraries or infrastructure beyond a reasonably POSIX-compliant environment and a C compiler, so that using it is as simple as possible for the envisioned use cases.

Compile

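# Build and install the library (here with the home directory, ~, as the target)
./configure ~
make
make install

# Point pkg-config at the installed library before building the examples
export PKG_CONFIG_PATH=~/lib/pkgconfig

# Build and install the dumptree example
cd examples/dumptree
./configure ~
make
make install

cd ../..

# Build and install the webgem example
cd examples/webgem
./configure ~
make
make install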
