
HTMLParser differences from the HTML5 specification #135661

Open
@serhiy-storchaka


Bug report

Originally, the HTML format was not strictly defined. It resembled SGML and XML, but was much looser, and HTMLParser did its best to parse anything that looked like HTML. The HTML5 specification changed this: it defines the parsing rules for every HTML document, whether it is syntactically correct or not. Following these rules is important for security reasons.

The current HTMLParser mainly follows the HTML5 specification, but there are a number of differences:

  1. --!> should end the comment.
  2. -- > should not end the comment.
  3. <!--> and <!---> should be abruptly closed empty comments.
  4. ] ]> and ]] > should not end the CDATA section.
  5. CDATA handling should depend on the current node. This is important because the ending conditions differ for a CDATA section (]]>) and a bogus comment (>).
  6. Whitespace should not be accepted between </ and the tag name. E.g. </ script> should not end the script section.
  7. Vertical tabulation (\v) and non-ASCII whitespace should not be recognized as whitespace. The only whitespace characters are \t, \n, \r, \f and the space.
  8. Null character (U+0000) should not end the tag name.
  9. The null character (U+0000), surrogate characters and many other special characters should be replaced by U+FFFD. I think we can leave this as is, because it is easy to do in pre-processing or post-processing, and these characters usually do not cause issues in Python.
  10. An end tag can have attributes and slashes after the tag name, so it does not necessarily end at the first >. E.g. </script/foo=">"/>.
  11. Case-insensitive matching should only transform ASCII letters. E.g. </script> does not match </ſcript> (the first letter is U+017F), and LINK does not match LINK (the last letter is U+212A).
  12. There may be multiple slashes and whitespace characters between the last attribute and the closing > in both start and end tags. E.g. <a foo=bar/ //>.
  13. There should only be one = separator between the attribute name and value. E.g. <a foo==bar> should have attribute "foo" with value "=bar".
  14. No whitespace should be acceptable between the = separator and attribute name and value. E.g. <a foo =bar> should have two attributes "foo" and "=bar", both with value None; <a foo= bar> should have two attributes: "foo" with value "" and "bar" with value None.
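
The character-level rules in items 7, 9 and 11 can be illustrated without the parser itself. The following is only a sketch: the helper names (`is_html_whitespace`, `ascii_lower`, `replace_nul_and_surrogates`) are hypothetical and not part of `html.parser`, and the replacement helper covers only NUL and lone surrogates, not the "many other special characters" mentioned in item 9.

```python
import re

# The five characters treated as whitespace in item 7.
HTML_WHITESPACE = "\t\n\r\f "

# Translation table that lowercases A-Z only and leaves everything else alone.
_ASCII_LOWER = {ord(c): ord(c.lower()) for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"}

# NUL and lone surrogates: a subset of the characters from item 9.
_NUL_OR_SURROGATE = re.compile(r"[\x00\ud800-\udfff]")


def is_html_whitespace(ch):
    """Return True only for a single tab, LF, CR, FF or space character."""
    return len(ch) == 1 and ch in HTML_WHITESPACE


def ascii_lower(s):
    """Lowercase ASCII letters only, unlike str.lower()."""
    return s.translate(_ASCII_LOWER)


def replace_nul_and_surrogates(s):
    """Pre-processing step: map NUL and lone surrogates to U+FFFD."""
    return _NUL_OR_SURROGATE.sub("\ufffd", s)


# str.upper() maps U+017F to "S", so a check based on it confuses the names.
print("\u017fcript".upper() == "SCRIPT")            # True
# The ASCII-only variant keeps U+212A distinct from "k".
print(ascii_lower("LIN\u212a") == "link")           # False
# str.isspace() accepts \v, the HTML whitespace set does not.
print("\v".isspace(), is_html_whitespace("\v"))     # True False
```

Comparing tag names with `str.lower()` or `str.upper()` is what makes the look-alike names from item 11 match, which is why the sketch uses an explicit A-Z table instead.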

These differences can cause security issues. A program that uses HTMLParser to check HTML input for dangerous code can miss code that a browser would execute. For example, "<!----!><script>...</script><!---->" is parsed by browsers as a script block surrounded by two comments, but the current HTMLParser parses it as a single comment.
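
To see how a given Python build handles this input, the parser callbacks can be recorded and inspected. This is a minimal sketch: `EventRecorder`, `record_events` and `contains_script` are hypothetical names introduced for illustration, and the result for the sample depends on whether the interpreter in use still has the behavior described above, so no output is shown for it.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper that records parser callbacks as tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_data(self, data):
        self.events.append(("data", data))


def record_events(markup):
    """Feed the markup to a fresh parser and return the recorded events."""
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events


def contains_script(markup):
    """Naive check of the kind described above: was a <script> start tag seen?"""
    return any(e[0] == "start" and e[1] == "script" for e in record_events(markup))


sample = "<!----!><script>alert(1)</script><!---->"
for event in record_events(sample):
    print(event)
print("script detected:", contains_script(sample))
```

If the events show a single comment and no start tag, a filter built on `contains_script` would let through a script that a browser runs.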

Labels: 3.13 (bugs and security fixes), 3.14 (bugs and security fixes), 3.15 (new features, bugs and security fixes), stdlib (Python modules in the Lib dir), type-bug (An unexpected behavior, bug, or error)
