Lexentity: A context-aware, medium-neutral entity maker
by Sean Coates
Let's face it--this sentence is much "uglier" than the one below it.
Let’s face it–this sentence is much “prettier” than the one above it.
Lexentity is a simple piece of software that takes HTML as input and outputs a context-aware, medium-neutral representation of that HTML, with apostrophes, quotes, em dashes, ellipses, accents, etc., replaced with their respective numeric XML/Unicode entities.
Context is important. It is especially important when considering a piece of HTML like this:
<p>…and here's the example code:</p> <pre><code>echo "watermelon!\n";</code></pre>
Contextually, you'd want `here's` to become `here’s`, but you certainly don't want the code to read `echo “watermelon!\n”;`.
A fancy/smart/curly apostrophe is appropriate in the prose, but curly quotes in the code are likely to cause a parse error.
Lexentity understands its context and acts appropriately by means of lexical analysis, turning tokens back into text, rather than through a mostly naive and overly complicated regular expression.
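To make the idea concrete, here is a minimal Python sketch of context-aware replacement. It is not Lexentity's actual code: the names `convert`, `curl_text`, and `LITERAL_TAGS` are hypothetical, the tokenizer is deliberately tiny, and the quote heuristics are assumptions chosen for illustration. It splits the input into tag tokens and text tokens, tracks whether it is inside `<pre>` or `<code>`, and only touches quotes in ordinary prose.

```python
import re

# Hypothetical sketch, not Lexentity's real implementation.
# A token is a complete tag, a run of text, or a stray "<".
TOKEN_RE = re.compile(r"<[^>]*>|[^<]+|<")
TAG_NAME_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)")
LITERAL_TAGS = {"pre", "code"}  # text inside these tags is left untouched


def curl_text(text):
    """Replace straight quotes in a run of prose with numeric entities."""
    # An apostrophe between two word characters: here's -> here&#8217;s
    text = re.sub(r"(?<=\w)'(?=\w)", "&#8217;", text)
    # A double quote at the start of the run, or after whitespace or an
    # opening bracket, is assumed to be an opening quote.
    text = re.sub(r'(?:^|(?<=[\s(\[]))"', "&#8220;", text)
    # Any double quote still left is assumed to be a closing quote.
    return text.replace('"', "&#8221;")


def convert(html):
    """Curl quotes in prose while passing tags and literal text through."""
    out = []
    literal_depth = 0  # greater than zero while inside <pre> or <code>
    for token in TOKEN_RE.findall(html):
        if token.startswith("<"):
            match = TAG_NAME_RE.match(token)
            if match and match.group(2).lower() in LITERAL_TAGS:
                if match.group(1):  # closing tag
                    literal_depth = max(0, literal_depth - 1)
                else:               # opening tag
                    literal_depth += 1
            out.append(token)
        elif literal_depth:
            out.append(token)       # code: keep straight quotes as they are
        else:
            out.append(curl_text(token))
    return "".join(out)


sample = r"""<p>…and here's the example code:</p> <pre><code>echo "watermelon!\n";</code></pre>"""
print(convert(sample))
```

Run against the sample, the apostrophe in the paragraph becomes `&#8217;` while the `echo` line keeps its straight quotes. A real lexer would also need to handle comments, attributes containing `>`, and quotes that sit directly against inline tags, all of which this sketch ignores.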
My friend and colleague Jon Gibbins said it best in [this piece on his blog](http://dotjay.co.uk/2006/sep/named-html-entities-in-rss). In modern systems, you can't count on your HTML to always be represented as HTML. It's often (poorly) embedded in RSS or other HTML-like media, as XML.
Therefore, it is important to avoid HTML-specific entities like `&hellip;`, and instead use their Unicode code point to form numeric entities such as `&#8230;`. This ensures proper display on any terminal that can properly render Unicode XML, and avoids missing entity errors.
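As a rough illustration of that conversion, the following Python sketch rewrites named HTML entities and literal non-ASCII characters as decimal numeric entities. It assumes Python's standard `html.entities.name2codepoint` table as the source of entity names; the function name `to_numeric_entities` and the choice to leave the XML-predefined entities alone are assumptions made for this example, not a description of how Lexentity works internally.

```python
import re
from html.entities import name2codepoint

# Entities that XML itself defines; these are safe to leave by name.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
NAMED_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_entities(text):
    """Hypothetical helper: make entities safe for XML consumers."""

    def replace_named(match):
        name = match.group(1)
        # Leave XML's own entities and unknown names untouched.
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    # Step 1: named HTML entities, e.g. &hellip; -> &#8230;
    text = NAMED_ENTITY_RE.sub(replace_named, text)
    # Step 2: literal non-ASCII characters, e.g. the character U+2026 -> &#8230;
    return "".join(c if ord(c) < 128 else "&#%d;" % ord(c) for c in text)


print(to_numeric_entities("Wait&hellip; caf\u00e9 &amp; tea"))
```

The output contains only ASCII and numeric references, so it should survive being embedded in RSS or any other XML format that does not know HTML's entity names.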
Try a demo at http://files.seancoates.com/lexentity/.