-
Notifications
You must be signed in to change notification settings - Fork 106
[MRG+2] Detect encoding when specified as a separate attribute in <meta> #42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG+2] Detect encoding when specified as a separate attribute in <meta> #42
Conversation
Current coverage is
|
|
Looks like TravisCI has some race condition Here's what I got from tox after adding the test but before fixing |
|
Among 6524 sites, these give different results before and after the patch.
After thoughts, I think both regexes just look in a But what should be done about duplicate encodings such as in http://corporate.vattenfall.com/? |
|
Any interest in reviewing this? |
w3lib/encoding.py
Outdated
|
|
||
| # regexp for parsing HTTP meta tags | ||
| _TEMPLATE = r'''%s\s*=\s*["']?\s*%s\s*["']?''' | ||
| _SKIP_ATTRS = r'''(?:\s+[\w-]+=(?:'[^']*'|"[^"]*"|[^'"\s]+))*?''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @Digenis,
Sorry for a review taking so long!
Could you please reformat and comment this regex to make it easier to read?
|
I've been thinking about this
We can, with a single regex, My concern with this is duplicate meta tags like my example above |
|
I'm fine with The regex starts getting hard to read so +1 to @kmike 's suggestion to make it more digestable (I guess @kmike meant verbose mode?) As for the 2 It's true there should be only one. (cf. HTML5, section 4.2 Document metadata:
but, well, humans...) In that specific case though, response headers have |
|
Done. |
|
LGTM. Are you able to rebase to please codecov? |
d32cf27 to
acdb821
Compare
|
rebased and squashed |
|
Thanks @Digenis ! |
|
Can't trigger a build for the latest commit |
|
Can you try adding something to the release notes about this? hopefully, it'll trigger a new build |
w3lib/encoding.py
Outdated
| _TEMPLATE = r'''%s\s*=\s*["']?\s*%s\s*["']?''' | ||
| _SKIP_ATTRS = r'''(?x)(?:\s+ | ||
| [\w-]+ # Attribute name | ||
| \s*=\s* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it makes sense to support HTML specs more closely? I.e. other characters (numbers, underscores, etc.) in attribute names, as well as attributes without values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
I wrote [\w-] while looking the possible attribute names of meta
but we are supposed to also deal with human errors.
I'll patch it.
544f0da to
8982b75
Compare
|
regex updated |
w3lib/encoding.py
Outdated
| # regexp for parsing HTTP meta tags | ||
| _TEMPLATE = r'''%s\s*=\s*["']?\s*%s\s*["']?''' | ||
| _SKIP_ATTRS = r'''(?x)(?:\s+ | ||
| [^=<>/\s"']+ # Attribute name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For completeness we can also exclude non-printable characters from HTML attribute name regex.
8982b75 to
f7f48f8
Compare
|
updated |
|
@kmike , all good for you? |
|
@redapple yep! |
If the charset attribute in
<meta>is preceded by other attributesthe regex fails to match it.
<meta stuff=stuff charset='utf8'>To solve this, I added another non-greedy subregex to clean irrelevant attributes.
This is more like html5's meta charset made a bit more generic
so I think it shouldn't even need a separate regex and should be in
_CONTENT2_RE.Would this made it the encoding detection too tolerant towards bad html?
I will perform some tests on a bigger dataset tomorrow
because I can't rely on the automated tests
so don't merge yet.