HTMLParser.py - more robust SCRIPT tag parsing #37799
Comments
http://www.ebay.com contains a script element of the form <SCRIPT> which is not enclosed in "<!-- ... -->" comments, and the parser does not handle it correctly. The changes in the attached patch are: interesting_cdata is now a dict keyed by start tag, and HTMLParser.set_cdata_mode takes an extra argument, the tag.
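A minimal sketch of the approach described above (illustrative only, not the actual patch; the exact patterns and names in the real HTMLParser.py may differ):

import re

# one "interesting" pattern per CDATA element, keyed by its start tag;
# each pattern only stops at the matching end tag instead of at any "</"
interesting_cdata = dict(
    (tag, re.compile(r'</\s*%s' % tag, re.IGNORECASE))
    for tag in ('script', 'style'))

def set_cdata_mode(self, tag):
    # the extra 'tag' argument selects the pattern for the element
    # that switched the parser into CDATA mode
    self.interesting = interesting_cdata[tag.lower()]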
Found regression test, used it, found error, fixed it.
From python-dev, John Paulson wrote:
Jerry Williams wrote: <SCRIPT language="JavaScript">. That could be a problem, since this is commonly used. See:
You will get a sequence of: whereas before the sequence was:
Removed older version of the patch.
Adding test suite patch from bpo-674449; closed that as a duplicate.
Here is an additional test case. I have a super simple HTML "minifier".
========
$ cat test.html
'foo <sc'+'ript>'
========
The explosion is:
========
$ ./minify.py test.html
Warning: malformed start tag
'foo
Traceback (most recent call last):
File "./minify.py", line 84, in <module>
m.feed(f.read())
File "/usr/local/lib/python2.5/HTMLParser.py", line 108, in feed
self.goahead(0)
File "/usr/local/lib/python2.5/HTMLParser.py", line 148, in goahead
k = self.parse_starttag(i)
File "/usr/local/lib/python2.5/HTMLParser.py", line 226, in parse_starttag
endpos = self.check_for_whole_start_tag(i)
File "/usr/local/lib/python2.5/HTMLParser.py", line 302, in
check_for_whole_start_tag
raise AssertionError("we should not get here!")
AssertionError: we should not get here! ======== |
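The minifier script itself is not shown above; a hypothetical stand-in (every handler here is a guess) that drives HTMLParser over the same input would look roughly like:

# hypothetical stand-in for the minify.py mentioned above; the real
# script is not included, so these handlers are only illustrative
import sys
from HTMLParser import HTMLParser

class Minifier(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.out = []
    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
    def handle_endtag(self, tag):
        self.out.append('</%s>' % tag)
    def handle_data(self, data):
        # collapse runs of whitespace, which is all a "minifier" needs here
        self.out.append(' '.join(data.split()))

m = Minifier()
f = open(sys.argv[1])
m.feed(f.read())    # blows up on the '<sc'+'ript>' input above
f.close()
print ''.join(m.out)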
Now that BeautifulSoup uses HTMLParser, more people are seeing these errors.
A simple workaround for BeautifulSoup (version 3, with its markupMassage hook) is the following wrapper:

import re
import copy
from BeautifulSoup import BeautifulSoup

def bs(input):
    # strip the '"+"' string-concatenation trick from the markup before parsing
    pattern = re.compile('\"\+\"')
    match = lambda x: ""
    massage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
    massage.extend([(pattern, match)])
    return BeautifulSoup(input, markupMassage=massage)
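For example (hypothetical usage; 'ebay.html' stands for whatever markup was fetched), the wrapper simply replaces the normal constructor call:

page = open('ebay.html').read()   # hypothetical input file
soup = bs(page)                   # used in place of BeautifulSoup(page)
print soup.title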
HTMLParser.py fails when there is a "<" inside a script element.
$ tar tvzf lt-in-script-example.tgz | cut -c24-
796 2010-09-30 16:52 h2t.py
23678 2010-09-30 16:39 t.html
Here's what happens:
$ python h2t.py t.html /tmp/t.txt
HTMLParser: /home/yotam/src/wog/HTMLParser.bug/HTMLParser.py
Traceback (most recent call last):
File "h2t.py", line 31, in <module>
text = html2text(f_html.read())
File "h2t.py", line 23, in html2text
te = TextExtractor(html)
File "h2t.py", line 15, in __init__
self.feed(html)
File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 108, in feed
self.goahead(0)
File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 148, in goahead
k = self.parse_starttag(i)
File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 229, in parse_starttag
endpos = self.check_for_whole_start_tag(i)
File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 304, in check_for_whole_start_tag
self.error("malformed start tag")
File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 396, column 332
I have a suggested patch.
-- yotam
The attached suggested patch fixes the problems shown in msg117762.
Would it be reasonable to add knowledge to html.parser to make it recognize script elements as CDATA and handle them correctly (that is, let "<" pass)?
Suggested fix for the attached cases:
If you provide some tests augmenting the currently existing tests, that would help.
This is a small patch for bpo-9577, which is actually not related to this bug.
And this patch fixes both bugs in a more elegant way.
Thanks for the patch; however, it would be better if you could get a clone of the CPython repo and make a patch against it. You can check http://docs.python.org/devguide/ for more information.
The number of problems produced by this bug can be greatly reduced by adding a relatively small check to the parser. Currently, <script> and <style> tags call set_cdata_mode(), which sets self.interesting to HTMLParser.interesting_cdata. This is bad because it searches for ANY closing tag, rather than a closing tag which matches the opening tag.
Alexander's fix solved about half the problem, but it didn't handle ending tags as text. I've fixed this and added some tests. This is my first patch, so if there's a better way that I could be submitting this, input would be appreciated.
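A rough sketch of the kind of regression test this implies (not the test from the attached patch; the input string and the joined-content assertion are assumptions, and handle_data may be called several times):

import unittest
from HTMLParser import HTMLParser   # html.parser.HTMLParser on Python 3

class Collector(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

class ScriptCdataTest(unittest.TestCase):
    def test_lt_and_end_tag_text_inside_script(self):
        content = 'if (a < b) { document.write("</scr" + "ipt>"); }'
        p = Collector()
        p.feed('<script>%s</script>' % content)
        # only the joined content is checked; the chunking is an
        # implementation detail
        self.assertEqual(''.join(p.chunks), content)

if __name__ == '__main__':
    unittest.main()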
I left a review about your patch on Rietveld, including a description of what I think is going on there (the patch lacks some context and it's not easy to figure out how everything works there).
>>> from HTMLParser import HTMLParser as HP
>>> class MyHP(HP):
... def handle_data(self, data): print 'data: %r' % data
...
>>> myhp = MyHP()
# without the patch:
>>> myhp.feed('<script>foobar</script>')
data: 'foobar' # this looks ok
>>> myhp.feed('<script><p>foo</p></script>')
data: '<p>foo' # where's the </p>?
>>> myhp.feed('<script><p>foo</p><span>bar</span></script>')
data: '<p>foo' # some tags missing, 2 chunks received
data: 'bar'
>>> myhp.feed("<script><p>foo</p> '</scr'+'ipt>' <span>bar</span></script>")
data: '<p>foo'
data: " '"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/HTMLParser.py", line 108, in feed
self.goahead(0)
File "/usr/lib/python2.7/HTMLParser.py", line 150, in goahead
k = self.parse_endtag(i)
File "/usr/lib/python2.7/HTMLParser.py", line 317, in parse_endtag
self.error("bad end tag: %r" % (rawdata[i:j],))
File "/usr/lib/python2.7/HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: "</scr'+'ipt>", at line 1, column 247
# with the patch:
>>> myhp.feed('<script>foobar</script>')
data: 'foobar' # ok
>>> myhp.feed('<script><p>foo</p></script>')
data: '<p>foo' # all the content is there, but why 2 chunks?
data: '</p>'
>>> myhp.feed('<script><p>foo</p><span>bar</span></script>')
data: '<p>foo' # same as previous
data: '</p>'
data: '<span>bar'
data: '</span>'
>>> myhp.feed("<script><p>foo</p> '</scr'+'ipt>' <span>bar</span></script>")
data: '<p>foo' # same
data: '</p>'
data: " '"
data: "</scr'+'ipt>"
data: "' <span>bar"
data: '</span>'
So my question is: is it normal that the data is passed to handle_data in chunks?
Ezio wrote:
>>> myhp.feed('<script><p>foo</p></script>')
data: '<p>foo' # where's the </p>?
http://www.w3.org/TR/html4/types#type-cdata says that the content of such elements ends at the first "</" occurrence, so I think the example is invalid (the "<" should be escaped) and that HTMLParser is not buggy.
It's not buggy, but it is also not helpful. This kind of thing is what we introduced the 'strict' parameter for, and indeed I believe we've fixed some of these cases thereby. So any additional fixes should go into non-strict mode in Python 3.
On the other hand, the HTML5 spec clearly dictates otherwise: http://www.w3.org/TR/html5/syntax.html#cdata-rcdata-restrictions
Additionally, no browsers (except perhaps when they are in quirks mode) currently obey the HTML4 variant of the rule. This is due in large part to the need to include strings such as "</scr" + "ipt>" within a script tag itself. This behavior can be observed firsthand by loading this snippet in a browser:
<script><span></span>This should not be visible.</script>
Yes, but we don't claim to support HTML5 yet. The best way to support HTML5 is probably a topic for python-dev.
There's also no claim in the docs or the source that HTMLParser specifically adheres to HTML4, either. Ideally, the parser should strive for parity with the functionality of major web browsers, as they are the de facto standard for HTML parser behavior. All of the browsers on my machine, for instance, will even parse the following snippet with the behavior described in the HTML5 spec:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
Even in pre-HTML5 browsers, this is the way that HTML gets parsed. For the heck of it, I downloaded an old copy of Firefox 2.0 and ran the above snippet. The behavior is consistent. While I would otherwise agree that keeping to the HTML4 spec is the right thing to do, this is a quirk of the spec that is not only ignored by browsers (as can be seen in FX2) and changed in a future version of the spec, but is causing problems for a good number of developers. It could be argued that the patch is a far more elegant solution for Beautiful Soup developers than the workaround in msg88864.
I thought HTML4 conformance was documented somewhere, but I could be wrong. HTML5, from what I understand (I haven't read the spec), is explicitly or implicitly following "what browsers really do" exactly because nobody conformed to HTML4, so arguing that "a later spec changed the rules" isn't really relevant in this case :) We made the change the way we did (strict option) out of backward compatibility concerns, so I still think this topic needs to be discussed on python-dev. I think the argument that Python should handle what most browsers handle is a strong one (I myself would have been in favor of just making this stuff work, absent backward compatibility concerns). The question in my mind is what's the best way to get there from here?
IIRC we have been following what browsers do in other cases already. Note that changing the way things are parsed is generally not backward-compatible, but you might argue that the new behavior is useful enough to break some hackish code that was trying to work around the limitations of HTMLParser.
HTML5 being a spec that builds on HTML 4.01 and real-world ways to deal with non-compliant input, I don't object to fixes that follow the HTML5 spec. Regarding backward compatibility, we can break it if we decide that the behavior we're changing was a bug. I think it's far more useful to support BeautifulSoup than to retain a non-useful behavior forever.
Unless someone else has picked it up, BeautifulSoup is no longer an issue, since its author has abandoned it. That doesn't change the fact that IMO it would be nice for our library to handle input generously.
I also think this is a bug that should be fixed. Not being able to parse real-world HTML is a nuisance. I agree with Ezio's review comments about the custom regex.
It sounds like the early consensus on python-dev is that html5 support is a good thing. I'm happy with that. I presume that means the 'strict' keyword in 3.x becomes strict-per-html5, and possibly useless :)
As I said somewhere else, the only use case I can think of where the 'strict' flag is useful is validation, but AFAIK even in "strict mode" it's possible to parse non-valid documents, so I agree it's pretty useless. Moving to HTML5 and offering something able to parse real-world HTML seems the way to go IMHO.
Yeah... But wait another 8 years until these guys decide that there are enough tests and other cool stuff.
Which guys are you talking about?
Seeing as everyone seems pretty satisfied with the 2.7 version, I'd be happy to put together a patch for 3 as well. To confirm, though, this fix is NOT going behind the strict parameter, correct?
Attached a new patch with a few more tests and minor refactoring.
I think it's internal. While it's not explicitly mentioned in the source, the method is not documented and I don't think people subclass it. All it does is change the regex used to parse the data, and if someone needs to change this, it's probably easier to just change the regex.
New changeset 0a5eb57d5876 by Ezio Melotti in branch '2.7':
New changeset a6f2244b251f by Ezio Melotti in branch '3.2':
New changeset b40752e227fa by Ezio Melotti in branch 'default':
Fixed, thanks to everyone who contributed to this over the years!