HTMLParser lacking a few features to reconstruct input exactly #70197

Open
jasons mannequin opened this issue Jan 4, 2016 · 4 comments
Labels
stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)

Comments

jasons mannequin commented Jan 4, 2016

BPO 26009
Nosy @ezio-melotti
Files
  • test2.py
  • test1.html
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2016-01-04.17:35:38.424>
    labels = ['type-feature']
    title = 'HTMLParser lacking a few features to reconstruct input exactly'
    updated_at = <Date 2016-01-08.18:26:03.687>
    user = 'https://bugs.python.org/jasons'

    bugs.python.org fields:

    activity = <Date 2016-01-08.18:26:03.687>
    actor = 'terry.reedy'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = []
    creation = <Date 2016-01-04.17:35:38.424>
    creator = 'jason_s'
    dependencies = []
    files = ['41496', '41497']
    hgrepos = []
    issue_num = 26009
    keywords = []
    message_count = 4.0
    messages = ['257472', '257473', '257475', '257770']
    nosy_count = 2.0
    nosy_names = ['ezio.melotti', 'jason_s']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'test needed'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue26009'
    versions = ['Python 3.6']

    jasons mannequin commented Jan 4, 2016

    The HTMLParser class (https://docs.python.org/2/library/htmlparser.html) is lacking a few features to reconstruct input exactly. For the most part it can do this, but I found two items where it falls short (there may be others):

    • There is a get_starttag_text() method but no get_endtag_text() method. The latter is needed when the end tag is not in canonical form, e.g. </P> or </ P > instead of </p>.

    • The internal parse_bogus_comment() method calls handle_comment(), so subclasses of HTMLParser cannot distinguish content like <! I AM BOGUS > from an actual comment such as <!-- I AM BOGUS -->.

    Suggested changes:

    • Add a get_endtag_text() method that returns the exact end tag text.
    • Change parse_bogus_comment() to call self.handle_bogus_comment(), and define self.handle_bogus_comment() to call self.handle_comment(). This keeps existing behavior backwards-compatible, while subclasses can override self.handle_bogus_comment() to do what they want.
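
Until something like the suggested methods exists, both pieces of information can probably be recovered from a subclass using only the documented getpos() method and a copy of the source text. The following is a minimal sketch, not the proposed API: the names SourceAwareParser, _raw_token and scan are hypothetical, and it assumes the whole document is passed to feed() in a single call.

```python
from html.parser import HTMLParser


class SourceAwareParser(HTMLParser):
    """Hypothetical helper: recovers raw end tags and flags bogus comments.

    It relies only on the documented getpos() method plus a copy of the
    full source text, not on any new HTMLParser API.
    """

    def __init__(self, source):
        super().__init__(convert_charrefs=True)
        self.source = source
        # Absolute index at which each line starts (getpos() lines are 1-based).
        self.line_starts = [0]
        for index, char in enumerate(source):
            if char == "\n":
                self.line_starts.append(index + 1)
        self.end_tags = []   # raw end tag text, e.g. "</P>"
        self.comments = []   # (kind, data) with kind "comment" or "bogus"

    def _raw_token(self):
        """Return the source text from the current token up to the next '>'."""
        lineno, offset = self.getpos()
        start = self.line_starts[lineno - 1] + offset
        end = self.source.find(">", start)
        return self.source[start:] if end == -1 else self.source[start:end + 1]

    def handle_endtag(self, tag):
        raw = self._raw_token()
        # Skip the synthetic end tag reported for self-closing tags like <br/>.
        if raw.startswith("</"):
            self.end_tags.append(raw)

    def handle_comment(self, data):
        # Real comments start with "<!--"; anything else reached this
        # callback through the bogus-comment path.
        kind = "comment" if self._raw_token().startswith("<!--") else "bogus"
        self.comments.append((kind, data))


def scan(source):
    """Parse the whole source at once and return (end_tags, comments)."""
    parser = SourceAwareParser(source)
    parser.feed(source)
    parser.close()
    return parser.end_tags, parser.comments


end_tags, comments = scan("<p>hi</P><!-- real --><! I AM BOGUS >")
print(end_tags)
print(comments)
```

The example input sticks to end tags like </P>; the spaced form </ P > is left out because its tokenization may vary by Python version. A built-in get_endtag_text() and handle_bogus_comment() would still be preferable, since this workaround breaks down when input is fed in chunks.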

    jasons mannequin added the type-bug (An unexpected behavior, bug, or error) label Jan 4, 2016

    jasons mannequin commented Jan 4, 2016

    Sample file attached, containing VerbatimParser.

    jasons mannequin commented Jan 4, 2016

    Sample file test1.html attached.

    When running test2.py on it, the output is identical except for two things:

    test1.html contains <!DAMMIT HTML PUBLIC CRAP>
    test1b.html contains <!--DAMMIT HTML PUBLIC CRAP-->

    test1.html contains end tags that are capitalized e.g. </P> or have spaces </ goober >
    test1b.html contains end tags that are canonicalized to lowercase and without spaces e.g. </p> and </goober>
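
The two differences above can be reproduced with a much smaller rebuilder than the attached test2.py. The sketch below is not that file: NaiveRebuilder and rebuild are hypothetical names, and it only handles tags, text and comments (character references, declarations and processing instructions are ignored).

```python
from html.parser import HTMLParser


class NaiveRebuilder(HTMLParser):
    """Hypothetical stand-in for a verbatim parser built on today's callbacks."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Start tags can be emitted exactly as written.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags: emit the raw text once, without a separate end tag.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Only the lowercased name is available, so the raw spelling is lost.
        self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_comment(self, data):
        # Bogus comments arrive here too and come back out as real comments.
        self.parts.append("<!--%s-->" % data)


def rebuild(source):
    """Parse the source and join the re-emitted pieces into one string."""
    parser = NaiveRebuilder()
    parser.feed(source)
    parser.close()
    return "".join(parser.parts)


print(rebuild("<P>x</P><! note >"))
```

The start tag keeps its original capitalization because get_starttag_text() exists, while the end tag and the bogus comment are both normalized.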

    @ezio-melotti
    Member

    What is your use case?
    Also note that new features can only go on 3.6.

    terryjreedy added the type-feature (A feature request or enhancement) label and removed the type-bug (An unexpected behavior, bug, or error) label Jan 8, 2016
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @iritkatriel iritkatriel added the stdlib Python modules in the Lib dir label Nov 27, 2023