Bug on regexp of HTMLParser #51560
Comments
Hi all, I'm using BeautifulSoup to parse an HTML page and find it refused to
attrfind = re.compile(
r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')
Note that the Chinese character (also any other non-English
BTW: It seems something like : <script> cannot be parsed. :-/
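For illustration, here is a minimal sketch that reproduces the reported behaviour with the regex quoted above. The sample text ` title=中文` is an assumption chosen for the demo; it is not taken from the reporter's page.

```python
import re

# The attrfind pattern as quoted in the report (ASCII-only unquoted values).
attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')

# Hypothetical attribute text with an unquoted non-ASCII value.
sample = ' title=中文'

m = attrfind.match(sample)
name, value = m.group(1), m.group(3)

# The unquoted-value class cannot consume the Chinese characters, so the
# value comes back empty and the match stops right after the "=" sign.
print(repr(name), repr(value), m.end(), len(sample))
```

The match itself succeeds, but it ends before the attribute value, which is what leaves the rest of the tag unparseable for the caller.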
Re: the BTW -- < and > should be entity-escaped when used in attribute But the example you showed is not an attribute in a tag, but rather text But your suggestion for the regexp seems correct to me, if the non-ASCII
re: Yes. In fact, the BTW is a different problem with respect to this
The attached patch changes the regex to allow non-ASCII letters in attribute values (using \w with the re.UNICODE flag instead of [a-zA-Z0-9_]). Using [^\>\\s] (or even [^\> ]) might be OK too, since that's what browsers seem to use (e.g. Firefox and Chrome show "テ<ス
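A rough sketch of the \w idea follows. The exact pattern in the attached patch is not shown in this thread, so the character class below is an assumption: it is the quoted regex with `a-zA-Z0-9_` swapped for `\w`. The name `attrfind_unicode` is hypothetical.

```python
import re

# Assumed variant of the quoted pattern: \w replaces a-zA-Z0-9_ in the
# unquoted-value class, and re.UNICODE makes \w match non-ASCII letters.
# (On Python 3 str patterns are Unicode-aware by default; the flag is kept
# here to mirror the description in the comment above.)
attrfind_unicode = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-\w./,:;+*%?!&$\(\)#=~@]*))?',
    re.UNICODE)

m = attrfind_unicode.match(' title=中文')
print(m.group(1), m.group(3))
```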
The HTML 4.01 specification says [0]:
The HTML 5 draft says [1]:
So maybe [^\>\\s] is a little too permissive here.
Here's a patch that matches unquoted attribute values according to the HTML5 specification. The regex uses \s even though this includes the \v char that, according to the HTML5 spec, shouldn't be included. I left it there for simplicity and backward compatibility, and also because it's a rather obscure corner case.
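As a hedged illustration of what "according to HTML5" can look like: the HTML5 syntax for unquoted attribute values excludes whitespace, `"`, `'`, `=`, `<`, `>` and the backtick. The pattern below is a sketch built on that rule, not necessarily the committed regex, and `attrfind_html5` and `unquoted_value` are hypothetical names.

```python
import re

# Assumed pattern: an unquoted value may contain anything except whitespace,
# quotes, "=", "<", ">" and the backtick.
attrfind_html5 = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^\s"\'=<>`]*))?')


def unquoted_value(text):
    """Return the attribute value matched at the start of text."""
    return attrfind_html5.match(text).group(3)


print(unquoted_value(' title=中文'))   # non-ASCII letters are accepted
print(unquoted_value(' href=a<b'))     # the value stops at the "<"

# The corner case mentioned above: Python's \s also matches \v (vertical tab).
print(re.match(r'\s', '\v') is not None)
```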
New changeset 7d4dea76c476 by Ezio Melotti in branch '2.7':
With 3.2 the situation is more complicated because there is a strict and a non-strict mode.
and the tolerant mode uses:
This means that the strict mode doesn't allow valid non-ASCII chars, and that the tolerant mode is a little too permissive. The attached patch changes the strict regex to be more permissive and leaves the tolerant regex unchanged. The differences between the two are now so small that the tolerant version could be removed, except that re.search is used instead of re.match when the tolerant regex is used.
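The re.search versus re.match distinction can be seen with a small sketch. The pattern `attr` below is a simplified, hypothetical stand-in, not either of the stdlib regexes.

```python
import re

# Simplified stand-in for an attribute regex (hypothetical, not the stdlib one).
attr = re.compile(r'([a-zA-Z_][-.:\w]*)\s*=\s*([^\s"\'=<>`]+)')

text = '?? title=中文'  # junk characters before the attribute

anchored = attr.match(text)   # must match at position 0, so the junk blocks it
scanned = attr.search(text)   # scans forward to the first place that matches

print(anchored)
print(scanned.group(1), scanned.group(2))
```

With identical patterns, a search-based loop skips over junk that a match-based loop would treat as an error, which is why the two modes can still behave differently.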
The goal of tolerant mode is to accept anything a typical browser would accept. I suspect that means the tolerant regex should stay, but I don't remember the details. As for the strict mode... as far as I know the current module follows 4.01, not 5. I'm not sure what should be done about that.
I don't see many use cases for the strict mode. It is not strict enough to be used for validation, and while parsing HTML I can't think of any other case where I would want an exception raised (always as long as what is parsed by the tolerant mode is a superset of what is parsed by the strict mode). If the parser is still able to parse what it was parsing before, I wouldn't worry too much about backward compatibility, because I can't imagine a valid use case where people would want the parser to fail (maybe someone else can?).
I think the stdlib should comply with HTML 4.01, and in the future HTML 5. (FTR, I don’t think XHTML is useful, and deny that XHTML-compatible HTML exists. See http://bugs.python.org/issue11567#msg131509 :)
I would agree if the HTMLParser was compliant with the HTML 4.01 specs, but since it's more permissive and uses its own heuristic to determine what should be parsed and what shouldn't, I think it's better to use already existing heuristics (either the HTML5 ones or the ones used by the browsers).
Okay, sounds good.
We need not base changes to html/parser.py on the HTML5 spec, but should rather make changes based on the requirements of parsers that may rely on this library. For example, the tolerant mode was brought in by bpo-1486713 for practical reasons and was seen as useful for parsers. I don't know how common leaving out quotes for attributes is, but I think it can become really confusing to parsers (custom parsers). If we had not supported unquoted attributes, I think it would still be okay not to support them unless presented with a very concrete bug (for example, if the HTML 4.01 spec allowed them, which I see it does not). The patch which added support for non-ASCII characters is fine.
So is the bpo-7311-3.diff patch fine? It changes the strict regex to match the 2.7 one, and leaves the tolerant one unchanged (even if the two regexes are now really close).
Sounds fine to me.
Just that it allows unquoted attrs for Unicode too. My previous suggestion was not to allow unquoted attribute values, but as the change is already made in 2.7 and the discussion pointed out a portion of the 4.01 spec which allows unquoted attrs for ASCII, it seems fine. html/parser.py will be a bit more permissive than what the spec says.
That is fine.
On 3.2 the patch changes only the range of chars matched by the regex when the attribute value doesn't have quotes and strict=True. The parser already allowed unquoted attribute values even before the patch (in both strict and tolerant mode), but used an explicit list of allowed chars that was limited to the ASCII range.
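A quick way to check the end result from user code is sketched below. It assumes a current Python 3, where the `strict` argument no longer exists; `AttrCollector` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collect (name, value) attribute pairs from every start tag."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)


collector = AttrCollector()
# One unquoted non-ASCII value and one ordinary quoted value.
collector.feed('<a title=中文 href="x">link</a>')
collector.close()
print(collector.attrs)
```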
New changeset 225400cb6e84 by Ezio Melotti in branch '3.2':
New changeset a1dea7cde58f by Ezio Melotti in branch 'default':