sgmllib fail to parse html containing <!- .... -> #54244

Closed
halfjuice mannequin opened this issue Oct 6, 2010 · 6 comments
Labels
stdlib (Python modules in the Lib dir), type-bug (An unexpected behavior, bug, or error)

Comments

@halfjuice
Mannequin

halfjuice mannequin commented Oct 6, 2010

BPO 10035
Nosy @birkenfeld

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2010-10-06.07:16:28.814>
created_at = <Date 2010-10-06.04:28:01.272>
labels = ['type-bug', 'library']
title = 'sgmllib fail to parse html containing <!- .... ->'
updated_at = <Date 2010-10-06.07:16:28.813>
user = 'https://bugs.python.org/halfjuice'

bugs.python.org fields:

activity = <Date 2010-10-06.07:16:28.813>
actor = 'georg.brandl'
assignee = 'none'
closed = True
closed_date = <Date 2010-10-06.07:16:28.814>
closer = 'georg.brandl'
components = ['Library (Lib)']
creation = <Date 2010-10-06.04:28:01.272>
creator = 'halfjuice'
dependencies = []
files = []
hgrepos = []
issue_num = 10035
keywords = []
message_count = 6.0
messages = ['118048', '118049', '118052', '118053', '118054', '118055']
nosy_count = 2.0
nosy_names = ['georg.brandl', 'halfjuice']
pr_nums = []
priority = 'normal'
resolution = 'works for me'
stage = None
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue10035'
versions = ['Python 2.6']

@halfjuice
Mannequin Author

halfjuice mannequin commented Oct 6, 2010

When parsing HTML that contains the following tag:
... <!- ie6 doesn't allow empty div. -> ...
SGMLParser stops parsing everything after it, without any warning. When the tag is removed, everything works fine.

Looking into sgmllib.py, I found the statement below:

    if rawdata.startswith("<!", i):
        # This is some sort of declaration; in "HTML as
        # deployed," this should only be the document type
        # declaration ("<!DOCTYPE html...>").

I think that's why something goes wrong here.

@halfjuice halfjuice mannequin added the stdlib (Python modules in the Lib dir) and type-bug (An unexpected behavior, bug, or error) labels Oct 6, 2010
@birkenfeld
Member

Are you sure you got the comment syntax right? e.g.
<!-- ie6 doesn't allow empty div. -->

SGMLParser should handle that.

@halfjuice
Mannequin Author

halfjuice mannequin commented Oct 6, 2010

Well, <!-- ... --> is OK since it's a comment. <!- ... -> is probably an IE hack. See http://www.google.com/dictionary?langpair=en|zh-CN&q=vague&hl=en&aq=f

@birkenfeld
Member

Is that URL really what you wanted to show me?

Also, I'm not intimate with all of SGML's syntax, but ISTM that what you show here is invalid SGML, and as such SGMLParser is not required to parse it.

@halfjuice
Mannequin Author

halfjuice mannequin commented Oct 6, 2010

Sorry, the URL on the page is sort of broken. The URL contains the "<!- ... ->" stuff.

I think you're right, the <!- is probably just a mistake which is not in the SGML standard. But I'm wondering if the SGMLParser can SKIP such an invalid statement? My browser does this.
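
One way to get the skipping behaviour without changing the parser is to strip such markers before the markup is fed in. The following is a minimal sketch, not part of sgmllib: the name `strip_pseudo_comments` and the regular expression are assumptions made for illustration. The pattern only targets `<!-` that is not followed by a second hyphen, so well-formed `<!-- ... -->` comments are left alone.

```python
import re

# Hypothetical pattern: "<!-" not followed by another "-", then the shortest
# run of characters (newlines included) up to the next "->".
_PSEUDO_COMMENT = re.compile(r"<!-(?!-).*?->", re.DOTALL)


def strip_pseudo_comments(markup):
    """Return markup with "<!- ... ->" pseudo-comments removed."""
    return _PSEUDO_COMMENT.sub("", markup)


sample = "<div><!- ie6 doesn't allow empty div. -></div><p>after</p>"
cleaned = strip_pseudo_comments(sample)
```

On Python 2 the cleaned string could then be passed to `SGMLParser.feed()`. Note that an unterminated `<!-` marker would be matched up to the next `->` anywhere later in the input, so this is only suitable for pages where the markers are known to be closed.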

@birkenfeld
Member

The browser needs to be very liberal in what it accepts, since nobody wants their page view to break because of such a technicality. This is different for a tool like SGMLParser.

In light of this, and because sgmllib is removed anyway in Python 3, I'm closing this.
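
For code that has moved to Python 3, where sgmllib no longer exists, the same input can be tried against `html.parser.HTMLParser` from the standard library. The sketch below is an illustration only: the `Collector` class and `parse_events` helper are hypothetical names, and the expectation that the `<!- ... ->` marker is reported through `handle_comment` is based on the parser's lenient handling of malformed declarations and should be checked on the Python version in use.

```python
from html.parser import HTMLParser


class Collector(HTMLParser):
    """Hypothetical helper that records parser callbacks as (kind, value) tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_comment(self, data):
        self.events.append(("comment", data))


def parse_events(markup):
    """Feed markup to a fresh Collector and return the recorded events."""
    parser = Collector()
    parser.feed(markup)
    parser.close()
    return parser.events


events = parse_events("<div><!- ie6 doesn't allow empty div. -></div><p>after</p>")
```

Inspecting `events` shows whether the `<p>` element and its text after the marker are still reported, which is the behaviour the reporter was asking for.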

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022