Lexentity: A context-aware, medium-neutral entity maker

by Sean Coates

Let's face it--this sentence is much "uglier" than the one below it.
Let’s face it–this sentence is much “prettier” than the one above it.

Lexentity is a simple piece of software that takes HTML as input and outputs a context-aware, medium-neutral representation of that HTML, with apostrophes, quotes, emdashes, ellipses, accents, etc., replaced with their respective numeric XML/Unicode entities.
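
As a rough illustration of the kind of substitution described above, here is a minimal Python sketch that applies a few typographic rules to plain text, with no HTML awareness yet. The function name `smarten` and the specific rules (a quote that starts the text or follows whitespace is an opening quote, `--` is an em dash) are assumptions made for this example, not Lexentity's actual implementation.

```python
import re

# Numeric entities for the characters this sketch produces.
LSQUO, RSQUO = "&#8216;", "&#8217;"   # single curly quotes
LDQUO, RDQUO = "&#8220;", "&#8221;"   # double curly quotes
MDASH, HELLIP = "&#8212;", "&#8230;"  # em dash, ellipsis


def smarten(text):
    """Replace plain ASCII punctuation with numeric entities (hypothetical rules)."""
    text = text.replace("...", HELLIP)
    text = text.replace("--", MDASH)
    # A quote at the start of the text or after whitespace opens; any other closes.
    text = re.sub(r'(?<!\S)"', LDQUO, text)
    text = text.replace('"', RDQUO)
    text = re.sub(r"(?<!\S)'", LSQUO, text)
    text = text.replace("'", RSQUO)
    return text


print(smarten('Let\'s face it--this sentence is much "uglier" than the one below it.'))
# Let&#8217;s face it&#8212;this sentence is much &#8220;uglier&#8221; than the one below it.
```

These rules are deliberately naive. For instance, a quote directly after a dash is treated as a closing quote, which is one reason real input calls for something more careful.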


Context is important. It is especially important when considering a piece of HTML like this:

<p>…and here's the example code:</p>
<pre><code>echo "watermelon!\n";</code></pre>

Contextually, you'd want here's to become here’s, but you certainly don't want the code to read echo “watermelon!\n”;.

A curly (also called fancy or smart) apostrophe is appropriate in the prose, but curly quotes in the code are likely to cause a parse error.

Lexentity understands its context and acts appropriately by means of lexical analysis, turning tokens into text, rather than through a mostly-naive and overly-complicated regular expression.
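
The idea of acting on tokens rather than on the raw string can be sketched in a few lines of Python. This is not Lexentity's lexer. It is a simplified stand-in that splits the input into tag and text tokens, tracks whether it is currently inside an element whose contents should stay verbatim, and only transforms text outside such elements. The names `tokenize`, `transform_outside_verbatim`, `curl_apostrophes` and the `VERBATIM_TAGS` list are all hypothetical.

```python
import re

# Elements whose contents are left untouched (an assumed list for this sketch).
VERBATIM_TAGS = {"pre", "code", "script", "style"}


def tokenize(html):
    """Yield ("tag", text) or ("text", text) tokens in document order.

    Simplification: a ">" inside an attribute value or comment ends the tag early.
    """
    pos = 0
    while pos < len(html):
        if html[pos] == "<":
            end = html.find(">", pos)
            end = len(html) if end == -1 else end + 1
            yield "tag", html[pos:end]
        else:
            end = html.find("<", pos)
            end = len(html) if end == -1 else end
            yield "text", html[pos:end]
        pos = end


def tag_name(tag):
    """Return the lowercased element name of a tag token, or None."""
    match = re.match(r"</?\s*([A-Za-z][A-Za-z0-9]*)", tag)
    return match.group(1).lower() if match else None


def curl_apostrophes(text):
    """Stand-in transform: an apostrophe between word characters becomes &#8217;."""
    return re.sub(r"(?<=\w)'(?=\w)", "&#8217;", text)


def transform_outside_verbatim(html, transform=curl_apostrophes):
    """Apply `transform` to text tokens only, skipping text inside VERBATIM_TAGS."""
    depth = 0  # number of currently open verbatim elements
    out = []
    for kind, value in tokenize(html):
        if kind == "tag":
            if tag_name(value) in VERBATIM_TAGS:
                if value.startswith("</"):
                    depth = max(0, depth - 1)
                elif not value.endswith("/>"):
                    depth += 1
            out.append(value)
        else:
            out.append(value if depth else transform(value))
    return "".join(out)


sample = "<p>and here's the example code:</p>\n<pre><code>echo \"it's\";</code></pre>"
print(transform_outside_verbatim(sample))
# <p>and here&#8217;s the example code:</p>
# <pre><code>echo "it's";</code></pre>
```

The apostrophe in the paragraph is converted, while the one inside the code sample is left alone, because the decision depends on the tokens seen so far and not on the characters around the apostrophe.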


My friend and colleague Jon Gibbins said it best in [this piece on his blog](http://dotjay.co.uk/2006/sep/named-html-entities-in-rss). In modern systems, you can't count on your HTML to always be represented as HTML. It's often (poorly) embedded in RSS or other HTML-like media, as XML.

Therefore, it is important to avoid HTML-specific named entities such as &hellip;, and instead use each character's Unicode code point to form a numeric entity such as &#8230;. This ensures proper display on any terminal that can properly render Unicode XML, and avoids missing-entity errors.
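
To make the named-versus-numeric distinction concrete, the sketch below uses Python's standard `html.entities.name2codepoint` table to rewrite named entities as numeric ones, and shows how a numeric entity can be built directly from a character's code point. The function names `named_to_numeric` and `escape_non_ascii` are made up for this example. The choice to leave the five XML-predefined entities untouched is an assumption about what a consumer would want, not a description of Lexentity's behaviour.

```python
import re
from html.entities import name2codepoint

# Entities predefined by XML itself; these are safe to leave as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def named_to_numeric(markup):
    """Rewrite HTML named entities (e.g. &hellip;) as numeric ones (e.g. &#8230;)."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML-safe or unknown names untouched
        return f"&#{name2codepoint[name]};"
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, markup)


def escape_non_ascii(text):
    """Write every non-ASCII character as a numeric entity built from its code point."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


print(named_to_numeric("wait&hellip; &amp; see"))  # wait&#8230; &amp; see
print(escape_non_ascii("here\u2019s"))             # here&#8217;s
```

Either form of output is plain ASCII and relies only on numeric character references, which any XML parser understands without an HTML DTD.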


Try a demo at http://files.seancoates.com/lexentity/.