Automatically convert character references in HTMLParser #57842
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
assignee = 'https://github.com/ezio-melotti' closed_at = <Date 2013-11-23.18:17:29.997> created_at = <Date 2011-12-19.06:55:56.952> labels = ['type-feature', 'library'] title = 'Automatically convert character references in HTMLParser' updated_at = <Date 2013-11-23.18:17:29.995> user = 'https://github.com/ezio-melotti'
activity = <Date 2013-11-23.18:17:29.995> actor = 'ezio.melotti' assignee = 'ezio.melotti' closed = True closed_date = <Date 2013-11-23.18:17:29.997> closer = 'ezio.melotti' components = ['Library (Lib)'] creation = <Date 2011-12-19.06:55:56.952> creator = 'ezio.melotti' dependencies = ['2927', '11113'] files = ['32729', '32803'] hgrepos =  issue_num = 13633 keywords = ['patch'] message_count = 8.0 messages = ['149822', '154036', '188223', '203520', '203836', '204041', '204065', '204068'] nosy_count = 5.0 nosy_names = ['ezio.melotti', 'eric.araujo', 'r.david.murray', 'python-dev', 'serhiy.storchaka'] pr_nums =  priority = 'normal' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue13633' versions = ['Python 3.4']
The docs for handle_charref and handle_entityref say:
HTMLParser.handle_entityref(name)
    This method is called to process a general entity reference of the form "&name;" where name is an general entity reference. It is intended to be overridden by a derived class; the base class implementation does nothing.
The doc doesn't mention hex references, like "&#x3E;", and apparently they are passed to handle_charref without the '&#' but with the leading 'x':
    >>> from HTMLParser import HTMLParser
    >>> class MyParser(HTMLParser):
    ...     def handle_charref(self, data):
    ...         print data
    ...
    >>> MyParser().feed('&gt; &#62; &#x3E;')
    62
    x3E
I've seen code in the wild doing unichr(int(data)) in handle_charref (once the authors figured out that '62' is passed), which then fails when a hex entity is found. Passing 'x3E' doesn't seem too useful, because the user first has to check whether there's a leading 'x'. If there is, they have to remove it, convert the hex string to int, and finally use unichr() to get the char; otherwise they just convert to int and use unichr().
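The conversion described above can be sketched as a small helper. This is only an illustration, not code from the issue or the library: the name charref_to_char is hypothetical, and it uses Python 3's chr() where the Python 2 code discussed here would use unichr().

```python
def charref_to_char(data):
    """Convert the string passed to handle_charref into a character.

    Hypothetical helper: `data` is the reference without '&#' and ';',
    e.g. '62' (decimal) or 'x3E' (hexadecimal).
    """
    if data[:1] in ('x', 'X'):
        # Hex reference: drop the leading 'x' and parse the rest as base 16.
        codepoint = int(data[1:], 16)
    else:
        # Decimal reference.
        codepoint = int(data, 10)
    return chr(codepoint)


dec_result = charref_to_char('62')
hex_result = charref_to_char('x3E')
```

Both calls yield the same character, which is the point: the caller should not have to care which notation the document used.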
There are 3 possible solutions:
The first solution alone doesn't solve much, but the doc should be clearer regardless of the decision we take.
This behavior is now documented, but the situation could still be improved. Adding a new method that receives the converted entity seems a good way to handle this. The parser can call both, and users can pick either one.
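A minimal sketch of the "new method" idea, written as a subclass rather than a change to HTMLParser itself. The class name ConvertingParser and the hook handle_converted_ref are hypothetical and not part of the library. The sketch assumes Python 3.4 or later, where convert_charrefs=False keeps the parser calling handle_charref and handle_entityref.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class ConvertingParser(HTMLParser):
    """Hypothetical subclass that forwards references as characters."""

    def __init__(self):
        # Keep the old per-reference callbacks active.
        super().__init__(convert_charrefs=False)
        self.converted = []

    def handle_converted_ref(self, char):
        # Hypothetical new hook: receives the already-converted character.
        self.converted.append(char)

    def handle_charref(self, name):
        # name is e.g. '62' or 'x3E'
        if name[:1] in ('x', 'X'):
            codepoint = int(name[1:], 16)
        else:
            codepoint = int(name)
        self.handle_converted_ref(chr(codepoint))

    def handle_entityref(self, name):
        # name is e.g. 'gt'; unknown names are ignored in this sketch.
        if name in name2codepoint:
            self.handle_converted_ref(chr(name2codepoint[name]))


parser = ConvertingParser()
parser.feed('&gt; &#62; &#x3E;')
parser.close()
```

Users who want the raw strings can still override the old methods; users who only want the character override the new hook.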
One problem with the current methods (handle_charref and handle_entityref) is that they don't do any processing on the entity and let invalid character references like � or &#iamnotanentity; go through.
There are at least 3 changes that should be made in order to follow the HTML5 standard [0]:
Now, 1) can be done for both the old and new method, but for 2) and 3) the situation is a bit more complicated. The best thing is probably to keep sending them unchanged to the old methods, and implement the correct behavior for the new method only.
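As one illustration of HTML5-style handling, the standard library's html.unescape() (available since Python 3.4) applies the HTML5 rules for character references. The sketch below only shows that function's behavior on a few inputs. It is not a statement of how the parser methods discussed here behave, and the variable names are arbitrary.

```python
import html

# NUL is not a valid character reference; it is replaced with U+FFFD.
nul = html.unescape('&#0;')

# Text that is not a well-formed reference is left untouched.
bogus = html.unescape('&#iamnotanentity;')

# Well-formed decimal and hex references are converted.
ok = html.unescape('&#62; &#x3E;')
```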