Conversation

@Digenis
Member

@Digenis Digenis commented Feb 9, 2016

If the charset attribute in <meta> is preceded by other attributes,
the regex fails to match it, e.g. <meta stuff=stuff charset='utf8'>.

To solve this, I added another non-greedy subregex that skips the irrelevant attributes.
This makes the pattern more like html5's meta charset, just a bit more generic,
so I think it shouldn't even need a separate regex and could live in _CONTENT2_RE.
Would this make the encoding detection too tolerant towards bad html?

I will perform some tests on a bigger dataset tomorrow,
because I can't rely on the automated tests alone,
so don't merge yet.
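For illustration, the skip-attributes idea can be sketched like this (the _SKIP_ATTRS pattern is taken from this PR's diff; the surrounding charset regex is a simplified stand-in, not w3lib's exact code):

```python
import re

# _SKIP_ATTRS is the non-greedy subpattern added in this PR;
# the enclosing charset regex below is a simplified stand-in.
_SKIP_ATTRS = r"""(?:\s+[\w-]+=(?:'[^']*'|"[^"]*"|[^'"\s]+))*?"""
_CHARSET_RE = re.compile(
    r"""<meta%s\s+charset\s*=\s*["']?\s*([\w-]+)""" % _SKIP_ATTRS,
    re.I,
)

# This tag used to slip through because charset is not the first attribute:
m = _CHARSET_RE.search("<meta stuff=stuff charset='utf8'>")
```

Without the `%s` skip, `\s+charset` would have to appear immediately after `<meta`, which is exactly the failing case above.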

@codecov-io

Current coverage is 88.59%

Merging #42 into master will increase coverage by +0.03% as of 56b667e


@Digenis Digenis closed this Feb 9, 2016
@Digenis Digenis deleted the detect_encoding_from_separate_charset_attr branch February 9, 2016 16:17
@Digenis Digenis restored the detect_encoding_from_separate_charset_attr branch February 9, 2016 16:23
@Digenis
Member Author

Digenis commented Feb 9, 2016

Looks like TravisCI has some race condition:
it just clones without limiting to the commit of the push event.
Both builds are for the second commit.

Here's what I got from tox after adding the test but before the patch:

GLOB sdist-make: /home/digenis/src/w3lib/setup.py
py27 inst-nodeps: /home/digenis/src/w3lib/.tox/dist/w3lib-1.13.0.zip
py27 installed: coverage==4.0.3,py==1.4.31,pytest==2.8.7,pytest-cov==2.2.1,six==1.10.0,w3lib==1.13.0,wheel==0.26.0
py27 runtests: PYTHONHASHSEED='1580087328'
py27 runtests: commands[0] | py.test --cov=w3lib --cov-report= w3lib tests
============================= test session starts ==============================
platform linux2 -- Python 2.7.11, pytest-2.8.7, py-1.4.31, pluggy-0.3.1
rootdir: /home/digenis/src/w3lib, inifile: 
plugins: cov-2.2.1
collected 81 items 

tests/test_encoding.py .FF...............
tests/test_form.py ...
tests/test_html.py ................................................
tests/test_http.py ...
tests/test_url.py .........

=================================== FAILURES ===================================
____________ RequestEncodingTests.test_html_body_declared_encoding _____________

self = <tests.test_encoding.RequestEncodingTests testMethod=test_html_body_declared_encoding>

    def test_html_body_declared_encoding(self):
        for fragment in self.utf8_fragments:
            encoding = html_body_declared_encoding(fragment)
>           self.assertEqual(encoding, 'utf-8', fragment)
E           AssertionError: <meta http-equiv="Content-Type" content="text/html" charset="utf-8">

tests/test_encoding.py:50: AssertionError
________ RequestEncodingTests.test_html_body_declared_encoding_unicode _________

self = <tests.test_encoding.RequestEncodingTests testMethod=test_html_body_declared_encoding_unicode>

    def test_html_body_declared_encoding_unicode(self):
        # html_body_declared_encoding should work when unicode body is passed
        self.assertEqual(None, html_body_declared_encoding(u"something else"))

        for fragment in self.utf8_fragments:
            encoding = html_body_declared_encoding(fragment.decode('utf8'))
>           self.assertEqual(encoding, 'utf-8', fragment)
E           AssertionError: <meta http-equiv="Content-Type" content="text/html" charset="utf-8">

tests/test_encoding.py:66: AssertionError
===================== 2 failed, 79 passed in 0.18 seconds ======================
ERROR: InvocationError: '/home/digenis/src/w3lib/.tox/py27/bin/py.test --cov=w3lib --cov-report= w3lib tests'
pypy create: /home/digenis/src/w3lib/.tox/pypy
ERROR: InterpreterNotFound: pypy
py33 create: /home/digenis/src/w3lib/.tox/py33
ERROR: InterpreterNotFound: python3.3
py34 create: /home/digenis/src/w3lib/.tox/py34
ERROR: InterpreterNotFound: python3.4
___________________________________ summary ____________________________________
ERROR:   py27: commands failed
ERROR:   pypy: InterpreterNotFound: pypy
ERROR:   py33: InterpreterNotFound: python3.3
ERROR:   py34: InterpreterNotFound: python3.4

@Digenis Digenis reopened this Feb 9, 2016
@Digenis
Member Author

Digenis commented Feb 12, 2016

Among 6524 sites, these give different results before and after the patch.

| master | patch | url | comment |
| --- | --- | --- | --- |
| utf-8 | cp1252 | http://corporate.vattenfall.com/ | 2 meta tags with different charsets, for both patterns (buggy site) |
| | cp1251 | http://www.akcent.bg/site/home.php | pattern is almost _CONTENT_RE but with different attr ordering |
| | utf-8 | http://www.euroinvestor.com/rss | pattern is almost _CONTENT_RE but the tag has an extra id attr |
| | utf-8 | http://www.idg.bg/ | pattern is almost _CONTENT2_RE but charset is preceded by other attrs |

On second thought, I think both regexes just look in a <meta> tag for the charset, either as a separate attribute or embedded in the content attribute, while ignoring other attributes.
So they could probably be merged into a single, shorter regex.

But what should be done about duplicate encodings such as in http://corporate.vattenfall.com/?
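As a toy illustration of that summarization (this is not the PR's code, and a real merged regex would still need to anchor on the <meta> tag itself):

```python
import re

# Toy merged pattern: finds the charset whether it is a standalone
# attribute or embedded inside the content attribute.
_ANY_CHARSET_RE = re.compile(r"""charset\s*=\s*["']?\s*([\w-]+)""", re.I)

found = [
    _ANY_CHARSET_RE.search(html).group(1)
    for html in (
        '<meta charset="utf-8">',
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">',
    )
]
```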

@Digenis
Member Author

Digenis commented Mar 10, 2016

Any interest in reviewing this?
(Bump)


# regexp for parsing HTTP meta tags
_TEMPLATE = r'''%s\s*=\s*["']?\s*%s\s*["']?'''
_SKIP_ATTRS = r'''(?:\s+[\w-]+=(?:'[^']*'|"[^"]*"|[^'"\s]+))*?'''
Member

Hey @Digenis,
Sorry for a review taking so long!
Could you please reformat and comment this regex to make it easier to read?

@Digenis
Member Author

Digenis commented Mar 25, 2016

I've been thinking about this,
and I believe the result is almost equivalent
to a more tolerant version of the html5 meta encoding tag.

_CONTENT_RE and _CONTENT2_RE can be merged into a single regex
without the need for _SKIP_ATTRS.
(This is far more than a bugfix; although it does fix the bug,
it would better be called a refactoring.)

We can, with a single regex,
extract from a meta tag
either the charset attribute
or the charset part of the content attribute.

My concern with this is duplicate meta tags like in my example above,
and possibly meta tags used for other purposes.
This will take another round of testing,
so should I start over from the beginning,
or would you prefer the _SKIP_ATTRS solution?

@redapple
Contributor

I'm fine with the _SKIP_ATTRS solution.
I'd just suggest adding tests on whitespace and single quotes, e.g.

--- a/tests/test_encoding.py
+++ b/tests/test_encoding.py
@@ -9,10 +9,13 @@ class RequestEncodingTests(unittest.TestCase):
         b"""<meta http-equiv="content-type" content="text/html;charset=UTF-8" />""",
         b"""\n<meta http-equiv="Content-Type"\ncontent="text/html; charset=utf-8">""",
         b"""<meta http-equiv="Content-Type" content="text/html" charset="utf-8">""",
+        b"""<meta http-equiv="Content-Type" content="text/html" charset\n='utf-8'>""",
+        b"""<meta http-equiv="Content-Type" content="text/html" charset\t=    "utf-8">""",
         b"""<meta content="text/html; charset=utf-8"\n http-equiv='Content-Type'>""",
         b""" bad html still supported < meta http-equiv='Content-Type'\n content="text/html; charset=utf-8">""",
         # html5 meta charset
         b"""<meta charset="utf-8">""",
+        b"""<meta charset\n\n=\n\t"utf-8">""",
         # xml encoding
         b"""<?xml version="1.0" encoding="utf-8"?>""",
     ]

The regex is starting to get hard to read, so +1 to @kmike's suggestion to make it more digestible (I guess @kmike meant verbose mode?)

As for the 2-<meta> case, e.g. in http://corporate.vattenfall.com/, I would go for the first encoding found; that's what my Chrome (Version 49.0.2623.110) does.

It's true there should be only one. (cf. HTML5, section 4.2 Document metadata:

In addition, due to a number of restrictions on meta elements, there can only be one meta-based character encoding declaration per document.

but, well, humans...)

In that specific case though, the response headers have Content-Type: text/html; charset=utf-8, so that takes precedence: if I read the encoding sniffing algorithm correctly, step 4 (using transport-layer info) comes before prescanning the byte stream to determine its encoding.
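That precedence can be sketched as follows (a toy version of the sniffing order, not w3lib's actual implementation):

```python
import re

# Toy sketch of the sniffing precedence: transport-layer info
# (the Content-Type header) is consulted before prescanning the body.
_CHARSET_RE = re.compile(r"""charset\s*=\s*["']?\s*([\w-]+)""", re.I)

def declared_encoding(content_type_header, body):
    for source in (content_type_header, body):
        m = _CHARSET_RE.search(source)
        if m:
            return m.group(1).lower()
    return None

# The header's utf-8 wins over the body's cp1252 declaration:
enc = declared_encoding('text/html; charset=utf-8',
                        '<meta charset="cp1252">')
```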

@redapple
Contributor

redapple commented Apr 4, 2016

@Digenis , do you have updates regarding comments from @kmike and me?
If we can include your change in upcoming w3lib 1.14 (which we need to wrap up before scrapy 1.1), that'd be awesome. Thanks!

@redapple redapple added this to the v1.14 milestone Apr 4, 2016
@Digenis
Member Author

Digenis commented Apr 4, 2016

Done.
I think codecov is just confused
because master differs a lot from base.

@redapple
Contributor

redapple commented Apr 4, 2016

LGTM. Are you able to rebase to please codecov?

@redapple redapple changed the title Detect encoding when specified as a separate attribute in <meta> [MRG+1] Detect encoding when specified as a separate attribute in <meta> Apr 4, 2016
@Digenis Digenis force-pushed the detect_encoding_from_separate_charset_attr branch from d32cf27 to acdb821 Compare April 4, 2016 15:46
@Digenis
Member Author

Digenis commented Apr 4, 2016

rebased and squashed

@redapple
Contributor

redapple commented Apr 4, 2016

Thanks @Digenis !

@Digenis Digenis closed this Apr 4, 2016
@Digenis Digenis reopened this Apr 4, 2016
@Digenis
Member Author

Digenis commented Apr 4, 2016

Can't trigger a build for the latest commit

@redapple
Contributor

redapple commented Apr 4, 2016

Can you try adding something to the release notes about this? Hopefully it'll trigger a new build.

_TEMPLATE = r'''%s\s*=\s*["']?\s*%s\s*["']?'''
_SKIP_ATTRS = r'''(?x)(?:\s+
[\w-]+ # Attribute name
\s*=\s*
Member

Do you think it makes sense to follow the HTML spec more closely? I.e. support other characters (numbers, underscores, etc.) in attribute names, as well as attributes without values?

Member Author

Yes.
I wrote [\w-] while looking at the possible attribute names of <meta>,
but we are supposed to also handle human errors.
I'll patch it.
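One way to sketch both suggestions together, a broader attribute-name class plus an optional value so that valueless attributes are also skipped (illustrative only, not the exact patch):

```python
import re

# Illustrative _SKIP_ATTRS variant (not the exact patch): broader
# attribute-name class, and the value is optional so that valueless
# attributes are skipped too.
_SKIP_ATTRS = (
    r"""(?:\s+[^=<>/\s"']+"""                           # attribute name
    r"""(?:\s*=\s*(?:'[^']*'|"[^"]*"|[^'"\s]+))?)*?"""  # optional value
)
_CHARSET_RE = re.compile(
    r"""<meta%s\s+charset\s*=\s*["']?\s*([\w-]+)""" % _SKIP_ATTRS,
    re.I,
)

# Skips both a valueless attribute and a quoted one:
m = _CHARSET_RE.search('<meta nofollow data-x="1" charset=utf8>')
```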

@Digenis Digenis force-pushed the detect_encoding_from_separate_charset_attr branch from 544f0da to 8982b75 Compare April 5, 2016 09:11
@Digenis
Member Author

Digenis commented Apr 5, 2016

regex updated

@kmike kmike changed the title [MRG+1] Detect encoding when specified as a separate attribute in <meta> [MRG+2] Detect encoding when specified as a separate attribute in <meta> Apr 6, 2016
# regexp for parsing HTTP meta tags
_TEMPLATE = r'''%s\s*=\s*["']?\s*%s\s*["']?'''
_SKIP_ATTRS = r'''(?x)(?:\s+
[^=<>/\s"']+ # Attribute name
Member

For completeness we can also exclude non-printable characters from HTML attribute name regex.
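For instance, excluding ASCII control characters (and DEL) as well could look like this; this is an assumption about the final form, not necessarily the merged code:

```python
import re

# Hypothetical stricter attribute-name class: also excludes ASCII
# control characters and DEL, per the HTML spec's restrictions on
# characters allowed in attribute names.
ATTR_NAME_RE = re.compile(r"""[^=<>/\s"'\x00-\x1f\x7f]+""")

ok = ATTR_NAME_RE.fullmatch('data-foo') is not None
bad = ATTR_NAME_RE.fullmatch('a\x01b') is not None
```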

@Digenis Digenis force-pushed the detect_encoding_from_separate_charset_attr branch from 8982b75 to f7f48f8 Compare April 6, 2016 07:01
@Digenis
Member Author

Digenis commented Apr 6, 2016

updated

@redapple
Contributor

redapple commented Apr 6, 2016

@kmike , all good for you?

@kmike
Member

kmike commented Apr 6, 2016

@redapple yep!

@redapple redapple merged commit cc6d7df into scrapy:master Apr 6, 2016
@Digenis Digenis deleted the detect_encoding_from_separate_charset_attr branch April 6, 2016 16:47