Description
Bug report
Originally, the HTML format was not strictly defined. It resembled SGML and XML, but was much looser, and `HTMLParser` tried its best to parse anything that looked like HTML. Since HTML5, however, the specification defines the parsing rules for all HTML documents, whether or not they are syntactically correct. It is important to follow these rules for security reasons.
The current `HTMLParser` mainly follows the HTML5 specification, but there are a number of differences:
- `--!>` should end the comment.
- `-- >` should not end the comment.
- `<!-->` and `<!--->` should be abnormally ended empty comments.
- `] ]>` and `]] >` should not end the CDATA section.
- CDATA handling should depend on the current node. This is important, because the ending conditions are different for the CDATA section and the bogus comment (`]]>` and `>`).
- Whitespace should not be accepted between `</` and the tag name. E.g. `</ script>` should not end the script section.
- Vertical tabulation (`\v`) and non-ASCII whitespace should not be recognized as whitespace. The only whitespace characters are `\t`, `\n`, `\r`, `\f` and the space.
- Null character (U+0000) should not end the tag name.
- Null character (U+0000), surrogate characters and many other special characters should be replaced by `\ufffd`. I think we can leave this, because it is easy to do in pre-processing or post-processing, and these characters usually do not cause issues in Python.
- End tags can have attributes and slashes after the tag name, so the tag does not necessarily end at the first `>`. E.g. `</script/foo=">"/>`.
- Case-insensitive matching should only transform ASCII letters. E.g. `</script>` does not match `</ſcript>`, and `LINK` does not match `LINK` (the last letter is U+212A).
- There may be multiple slashes and whitespace characters between the last attribute and the closing `>` in both start and end tags. E.g. `<a foo=bar/ //>`.
- There should only be one `=` separator between attribute name and value. E.g. `<a foo==bar>` should have attribute "foo" with value "=bar".
- No whitespace should be accepted between the `=` separator and the attribute name or value. E.g. `<a foo =bar>` should have two attributes, "foo" and "=bar", both with value None; `<a foo= bar>` should have two attributes: "foo" with value "" and "bar" with value None.
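The whitespace and case-matching items can be illustrated without touching the parser itself. The sketch below is only an illustration: `HTML_WHITESPACE`, `ascii_lower` and `is_html_whitespace` are hypothetical names, not part of `html.parser`. It contrasts Python's Unicode-aware `str.lower()` and `str.isspace()` with ASCII-only equivalents.

```python
import string

# The five characters HTML5 treats as whitespace:
# tab, line feed, carriage return, form feed, space.
HTML_WHITESPACE = "\t\n\r\f "

# Translation table that maps only A-Z to a-z (hypothetical helper).
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lower(name):
    """Lowercase ASCII letters only; every other character is left as-is."""
    return name.translate(_ASCII_LOWER)


def is_html_whitespace(ch):
    """True only for a single character from HTML_WHITESPACE."""
    return len(ch) == 1 and ch in HTML_WHITESPACE


kelvin_link = "LIN\u212a"      # last letter is KELVIN SIGN, not ASCII "K"
long_s_script = "\u017fcript"  # first letter is LATIN SMALL LETTER LONG S

# Unicode-aware case mapping folds these onto the ASCII tag names...
print(kelvin_link.lower() == "link")          # True
print(long_s_script.upper() == "SCRIPT")      # True

# ...while ASCII-only lowercasing keeps them distinct.
print(ascii_lower(kelvin_link) == "link")     # False
print(ascii_lower("SCRIPT") == "script")      # True

# str.isspace() accepts characters that are not HTML whitespace.
print("\v".isspace(), is_html_whitespace("\v"))          # True False
print("\u00a0".isspace(), is_html_whitespace("\u00a0"))  # True False
```

A parser that relies on `str.lower()`, `str.isspace()` or `\s` in a `str` regular expression would therefore accept more input than the specification allows.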
This can cause security issues for some programs. If a program uses `HTMLParser` to check HTML input for dangerous code, it can miss some of that code. For example, `<!----!><script>...</script><!---->` is parsed by browsers as a script block surrounded by two comments, but the current `HTMLParser` parses it as a single comment.
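One way to observe this is a small event collector, sketched below. `EventCollector`, `collect_events` and `sees_script` are hypothetical helper names used only for illustration. The output for the tricky input depends on the Python version in use, so none is shown here: a parser with the behavior described above reports a single comment, while a parser that follows the HTML5 rules reports a `script` start tag between two comments.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records (kind, value) pairs for comments and tags."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))


def collect_events(html):
    """Parse html and return the list of recorded events."""
    parser = EventCollector()
    parser.feed(html)
    parser.close()
    return parser.events


def sees_script(html):
    """Naive check of the kind a sanitizer might do: was a <script> tag reported?"""
    return ("starttag", "script") in collect_events(html)


tricky = "<!----!><script>...</script><!---->"
for event in collect_events(tricky):
    print(event)
print("script seen:", sees_script(tricky))
```

If `sees_script(tricky)` returns `False`, a check built this way would let the script through to a browser.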