HTMLParser lacking a few features to reconstruct input exactly #70197

jasons · 2016-01-04T17:35:38Z

BPO	26009
Nosy	@ezio-melotti
Files	test2.py test1.html

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2016-01-04.17:35:38.424>
labels = ['type-feature']
title = 'HTMLParser lacking a few features to reconstruct input exactly'
updated_at = <Date 2016-01-08.18:26:03.687>
user = 'https://bugs.python.org/jasons'

bugs.python.org fields:

activity = <Date 2016-01-08.18:26:03.687>
actor = 'terry.reedy'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = []
creation = <Date 2016-01-04.17:35:38.424>
creator = 'jason_s'
dependencies = []
files = ['41496', '41497']
hgrepos = []
issue_num = 26009
keywords = []
message_count = 4.0
messages = ['257472', '257473', '257475', '257770']
nosy_count = 2.0
nosy_names = ['ezio.melotti', 'jason_s']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'test needed'
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue26009'
versions = ['Python 3.6']

jasons · 2016-01-04T17:35:38Z

The HTMLParser class (https://docs.python.org/2/library/htmlparser.html) is lacking a few features to reconstruct input exactly. For the most part it can do this, but I found two items where it falls short (there may be others):

1. There is a get_starttag_text() method but no get_endtag_text() method, which is necessary if the end tag is not in canonical form, e.g. when the tag name is capitalized or the tag contains spaces, as in </ goober >.
2. The effect of the parse_bogus_comment() internal method is to call handle_comment(), so content like <! I AM BOGUS > cannot be distinguished by subclasses of HTMLParser from actual comments.
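
The two gaps can be seen with a small event logger. This is only an illustrative sketch: the EventLogger class and the sample input are made up for this example and are not part of the attached test2.py.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: record the callbacks HTMLParser makes."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as written.
        self.events.append(("start", tag, self.get_starttag_text()))

    def handle_endtag(self, tag):
        # Only the lowercased tag name is available for end tags.
        self.events.append(("end", tag))

    def handle_comment(self, data):
        # Called for real comments and for bogus ones alike.
        self.events.append(("comment", data))


logger = EventLogger()
logger.feed("<P CLASS=x>text</P ><!-- real --><! I AM BOGUS >")
logger.close()
for event in logger.events:
    print(event)
```

The start tag event carries the original text, the end tag event carries only the normalized name, and both the real and the bogus comment arrive through handle_comment().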

Suggested changes:

1. Add a get_endtag_text() method to return the exact end tag text.
2. Change parse_bogus_comment() to call self.handle_bogus_comment(), and define self.handle_bogus_comment() to call self.handle_comment(). This way it is backwards-compatible with existing behavior, but subclasses can redefine self.handle_bogus_comment() to do what they want.
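
A rough sketch of how the two suggestions could behave, written as a subclass rather than a patch to the standard library. It makes several assumptions: the class names ExactHTMLParser and Echo are hypothetical, the internal methods parse_endtag() and parse_bogus_comment(i, report=1) and the rawdata attribute are not public API and may change between Python versions, and an end tag is assumed to end at the first ">" after it starts.

```python
from html.parser import HTMLParser


class ExactHTMLParser(HTMLParser):
    """Hypothetical sketch of the two proposed hooks on top of the stock parser."""

    _endtag_text = None

    def get_endtag_text(self):
        """Return the most recently seen end tag as written in the input."""
        return self._endtag_text

    def handle_bogus_comment(self, data):
        # Default keeps the existing behavior: treat it as a comment.
        self.handle_comment(data)

    # Overrides of internal methods (assumed, not guaranteed, to stay stable).

    def parse_endtag(self, i):
        # Remember the raw text up to the first '>' before handlers run.
        gt = self.rawdata.find(">", i)
        if gt != -1:
            self._endtag_text = self.rawdata[i:gt + 1]
        return super().parse_endtag(i)

    def parse_bogus_comment(self, i, report=1):
        # Let the stock method locate the end, but report via the new hook.
        end = super().parse_bogus_comment(i, report=0)
        if report and end != -1:
            self.handle_bogus_comment(self.rawdata[i + 2:end - 1])
        return end


class Echo(ExactHTMLParser):
    """Re-emit the input, using the hooks to preserve the original text."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(self.get_endtag_text())

    def handle_data(self, data):
        self.out.append(data)

    def handle_comment(self, data):
        self.out.append("<!--" + data + "-->")

    def handle_bogus_comment(self, data):
        self.out.append("<!" + data + ">")


source = "<P>hello</P ><!-- real --><! I AM BOGUS >"
echo = Echo()
echo.feed(source)
echo.close()
print("".join(echo.out) == source)
```

A subclass that does not override handle_bogus_comment() keeps receiving bogus comments through handle_comment(), which is the backwards-compatible default described above.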

jasons · 2016-01-04T17:36:45Z

sample file attached containing VerbatimParser

jasons · 2016-01-04T17:45:00Z

sample file test1.html attached.

When running test2.py on it, the output is identical except for two things:

test1.html contains <!DAMMIT HTML PUBLIC CRAP>
test1b.html contains the same text re-emitted as an ordinary comment

test1.html contains end tags that are capitalized or have spaces, e.g. </ goober >
test1b.html contains end tags that are canonicalized to lowercase and without spaces, e.g. </goober>

ezio-melotti · 2016-01-08T17:46:11Z

What is your use case?
Also note that new features can only go on 3.6.

jasons mannequin added the type-bug label (An unexpected behavior, bug, or error) Jan 4, 2016

terryjreedy added the type-feature label (A feature request or enhancement) and removed the type-bug label (An unexpected behavior, bug, or error) Jan 8, 2016

ezio-melotti transferred this issue from another repository Apr 10, 2022

iritkatriel added the stdlib Python modules in the Lib dir label Nov 27, 2023
