Conversation

@Digenis
Member

@Digenis Digenis commented Feb 9, 2016

If the charset attribute in <meta> is preceded by other attributes,
the regex fails to match it, e.g. <meta stuff=stuff charset='utf8'>.

To solve this, I added another non-greedy subregex that skips the irrelevant attributes.
This makes the pattern more like html5's meta charset, just a bit more generic,
so I think it shouldn't even need a separate regex and could live in _CONTENT2_RE.
Would this make the encoding detection too tolerant towards bad html?

I will perform some tests on a bigger dataset tomorrow,
because I can't rely on the automated tests alone,
so don't merge yet.
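For illustration, the skip-attributes idea can be sketched like this (the _SKIP_ATTRS pattern is taken from this PR's diff; the surrounding charset regex is a simplified stand-in, not w3lib's exact code):

```python
import re

# _SKIP_ATTRS is the non-greedy subpattern added in this PR;
# the enclosing charset regex below is a simplified stand-in.
_SKIP_ATTRS = r"""(?:\s+[\w-]+=(?:'[^']*'|"[^"]*"|[^'"\s]+))*?"""
_CHARSET_RE = re.compile(
    r"""<meta%s\s+charset\s*=\s*["']?\s*([\w-]+)""" % _SKIP_ATTRS,
    re.I,
)

# This tag used to slip through because charset is not the first attribute:
m = _CHARSET_RE.search("<meta stuff=stuff charset='utf8'>")
```

Without the `%s` skip, `\s+charset` would have to appear immediately after `<meta`, which is exactly the failing case above.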

@codecov-io

Current coverage is 88.59%

Merging #42 into master will increase coverage by +0.03% as of 56b667e


@Digenis Digenis closed this Feb 9, 2016
@Digenis Digenis deleted the detect_encoding_from_separate_charset_attr branch February 9, 2016 16:17
@Digenis Digenis restored the detect_encoding_from_separate_charset_attr branch February 9, 2016 16:23
@Digenis
Member Author

Digenis commented Feb 9, 2016

Looks like TravisCI has some race condition:
it just clones without limiting to the commit of the push event.
Both builds are for the second commit.

Here's what I got from tox after adding the test but before the patch:

GLOB sdist-make: /home/digenis/src/w3lib/setup.py
py27 inst-nodeps: /home/digenis/src/w3lib/.tox/dist/w3lib-1.13.0.zip
py27 installed: coverage==4.0.3,py==1.4.31,pytest==2.8.7,pytest-cov==2.2.1,six==1.10.0,w3lib==1.13.0,wheel==0.26.0
py27 runtests: PYTHONHASHSEED='1580087328'
py27 runtests: commands[0] | py.test --cov=w3lib --cov-report= w3lib tests
============================= test session starts ==============================
platform linux2 -- Python 2.7.11, pytest-2.8.7, py-1.4.31, pluggy-0.3.1
rootdir: /home/digenis/src/w3lib, inifile: 
plugins: cov-2.2.1
collected 81 items 

tests/test_encoding.py .FF...............
tests/test_form.py ...
tests/test_html.py ................................................
tests/test_http.py ...
tests/test_url.py .........

=================================== FAILURES ===================================
____________ RequestEncodingTests.test_html_body_declared_encoding _____________

self = <tests.test_encoding.RequestEncodingTests testMethod=test_html_body_declared_encoding>

    def test_html_body_declared_encoding(self):
        for fragment in self.utf8_fragments:
            encoding = html_body_declared_encoding(fragment)
>           self.assertEqual(encoding, 'utf-8', fragment)
E           AssertionError: <meta http-equiv="Content-Type" content="text/html" charset="utf-8">

tests/test_encoding.py:50: AssertionError
________ RequestEncodingTests.test_html_body_declared_encoding_unicode _________

self = <tests.test_encoding.RequestEncodingTests testMethod=test_html_body_declared_encoding_unicode>

    def test_html_body_declared_encoding_unicode(self):
        # html_body_declared_encoding should work when unicode body is passed
        self.assertEqual(None, html_body_declared_encoding(u"something else"))

        for fragment in self.utf8_fragments:
            encoding = html_body_declared_encoding(fragment.decode('utf8'))
>           self.assertEqual(encoding, 'utf-8', fragment)
E           AssertionError: <meta http-equiv="Content-Type" content="text/html" charset="utf-8">

tests/test_encoding.py:66: AssertionError
===================== 2 failed, 79 passed in 0.18 seconds ======================
ERROR: InvocationError: '/home/digenis/src/w3lib/.tox/py27/bin/py.test --cov=w3lib --cov-report= w3lib tests'
pypy create: /home/digenis/src/w3lib/.tox/pypy
ERROR: InterpreterNotFound: pypy
py33 create: /home/digenis/src/w3lib/.tox/py33
ERROR: InterpreterNotFound: python3.3
py34 create: /home/digenis/src/w3lib/.tox/py34
ERROR: InterpreterNotFound: python3.4
___________________________________ summary ____________________________________
ERROR:   py27: commands failed
ERROR:   pypy: InterpreterNotFound: pypy
ERROR:   py33: InterpreterNotFound: python3.3
ERROR:   py34: InterpreterNotFound: python3.4

@Digenis Digenis reopened this Feb 9, 2016
@Digenis
Member Author

Digenis commented Feb 12, 2016

Among 6524 sites, these give different results before and after the patch.

| master | patch | url | comment |
| --- | --- | --- | --- |
| utf-8 | cp1252 | http://corporate.vattenfall.com/ | 2 meta tags with different charsets, for both patterns (buggy site) |
| | cp1251 | http://www.akcent.bg/site/home.php | pattern is almost _CONTENT_RE but with different attr ordering |
| | utf-8 | http://www.euroinvestor.com/rss | pattern is almost _CONTENT_RE but the tag has an extra id attr |
| | utf-8 | http://www.idg.bg/ | pattern is almost _CONTENT2_RE but charset is preceded by other attrs |

On second thought, I think both regexes just look in a <meta> tag for the charset, either as a separate attribute or embedded in the content attribute, while ignoring other attributes.
So they could probably be merged into a single, shorter regex.

But what should be done about duplicate encodings such as in http://corporate.vattenfall.com/?
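As a toy illustration of that summarization (this is not the PR's code, and a real merged regex would still need to anchor on the <meta> tag itself):

```python
import re

# Toy merged pattern: finds the charset whether it is a standalone
# attribute or embedded inside the content attribute.
_ANY_CHARSET_RE = re.compile(r"""charset\s*=\s*["']?\s*([\w-]+)""", re.I)

found = [
    _ANY_CHARSET_RE.search(html).group(1)
    for html in (
        '<meta charset="utf-8">',
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">',
    )
]
```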

@Digenis
Member Author

Digenis commented Mar 10, 2016

Any interest in reviewing this?
(Bump)


# regexp for parsing HTTP meta tags
_TEMPLATE = r'''%s\s*=\s*["']?\s*%s\s*["']?'''
_SKIP_ATTRS = r'''(?:\s+[\w-]+=(?:'[^']*'|"[^"]*"|[^'"\s]+))*?'''
Member

Hey @Digenis,
Sorry for a review taking so long!
Could you please reformat and comment this regex to make it easier to read?

@Digenis
Member Author

Digenis commented Mar 25, 2016

I've been thinking about this,
and I believe the result is almost equivalent
to a more tolerant version of the html5 meta encoding tag.

_CONTENT_RE and _CONTENT2_RE can be merged into a single regex
without the need for _SKIP_ATTRS.
(This is far more than a bugfix; although it does fix the bug,
it would better be called a refactoring.)

We can, with a single regex,
extract from a meta tag
either the charset attribute
or the charset part of the content attribute.

My concern with this is duplicate meta tags like in my example above,
and possibly meta tags used for other purposes.
This will take another round of testing,
so should I start over from the beginning,
or would you prefer the _SKIP_ATTRS solution?

@redapple
Contributor

I'm fine with the _SKIP_ATTRS solution.
I'd just suggest adding tests on whitespace and single quotes, e.g.

--- a/tests/test_encoding.py
+++ b/tests/test_encoding.py
@@ -9,10 +9,13 @@ class RequestEncodingTests(unittest.TestCase):
         b"""<meta http-equiv="content-type" content="text/html;charset=UTF-8" />""",
         b"""\n<meta http-equiv="Content-Type"\ncontent="text/html; charset=utf-8">""",
         b"""<meta http-equiv="Content-Type" content="text/html" charset="utf-8">""",
+        b"""<meta http-equiv="Content-Type" content="text/html" charset\n='utf-8'>""",
+        b"""<meta http-equiv="Content-Type" content="text/html" charset\t=    "utf-8">""",
         b"""<meta content="text/html; charset=utf-8"\n http-equiv='Content-Type'>""",
         b""" bad html still supported < meta http-equiv='Content-Type'\n content="text/html; charset=utf-8">""",
         # html5 meta charset
         b"""<meta charset="utf-8">""",
+        b"""<meta charset\n\n=\n\t"utf-8">""",
         # xml encoding
         b"""<?xml version="1.0" encoding="utf-8"?>""",
     ]

The regex is starting to get hard to read, so +1 to @kmike's suggestion to make it more digestible (I guess @kmike meant verbose mode?)

As for the 2-<meta> case, e.g. in http://corporate.vattenfall.com/, I would go for the first encoding found; that's what my Chrome (Version 49.0.2623.110) does.

It's true there should be only one. (cf. HTML5, section 4.2 Document metadata:

In addition, due to a number of restrictions on meta elements, there can only be one meta-based character encoding declaration per document.

but, well, humans...)

In that specific case though, the response headers have Content-Type: text/html; charset=utf-8, so that takes precedence: if I read the encoding sniffing algorithm correctly, step 4 (using transport-layer info) comes before prescanning the byte stream to determine its encoding.
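That precedence can be sketched as follows (a toy version of the sniffing order, not w3lib's actual implementation):

```python
import re

# Toy sketch of the sniffing precedence: transport-layer info
# (the Content-Type header) is consulted before prescanning the body.
_CHARSET_RE = re.compile(r"""charset\s*=\s*["']?\s*([\w-]+)""", re.I)

def declared_encoding(content_type_header, body):
    for source in (content_type_header, body):
        m = _CHARSET_RE.search(source)
        if m:
            return m.group(1).lower()
    return None

# The header's utf-8 wins over the body's cp1252 declaration:
enc = declared_encoding('text/html; charset=utf-8',
                        '<meta charset="cp1252">')
```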

@redapple
Contributor

redapple commented Apr 4, 2016

@Digenis , do you have updates regarding comments from @kmike and me?
If we can include your change in upcoming w3lib 1.14 (which we need to wrap up before scrapy 1.1), that'd be awesome. Thanks!

@redapple redapple added this to the v1.14 milestone Apr 4, 2016
@Digenis
Member Author

Digenis commented Apr 4, 2016

Done.
I think codecov is just confused
because master differs a lot from base.

@redapple
Contributor

redapple commented Apr 4, 2016

LGTM. Are you able to rebase to please codecov?

@redapple redapple changed the title Detect encoding when specified as a separate attribute in <meta> [MRG+1] Detect encoding when specified as a separate attribute in <meta> Apr 4, 2016
@Digenis Digenis force-pushed the detect_encoding_from_separate_charset_attr branch from d32cf27 to acdb821 Compare April 4, 2016 15:46
@Digenis
Member Author

Digenis commented Apr 4, 2016

rebased and squashed

@redapple
Contributor

redapple commented Apr 4, 2016

Thanks @Digenis !

@Digenis Digenis closed this Apr 4, 2016
@Digenis Digenis reopened this Apr 4, 2016
@Digenis
Member Author

Digenis commented Apr 4, 2016

Can't trigger a build for the latest commit

@redapple
Contributor

redapple commented Apr 4, 2016

Can you try adding something to the release notes about this? Hopefully it'll trigger a new build.

_TEMPLATE = r'''%s\s*=\s*["']?\s*%s\s*["']?'''
_SKIP_ATTRS = r'''(?x)(?:\s+
[\w-]+ # Attribute name
\s*=\s*
Member

Do you think it makes sense to follow the HTML spec more closely? I.e. support other characters (numbers, underscores, etc.) in attribute names, as well as attributes without values?

Member Author

Yes.
I wrote [\w-] while looking at the possible attribute names of <meta>,
but we are supposed to also handle human errors.
I'll patch it.
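One way to sketch both suggestions together, a broader attribute-name class plus an optional value so that valueless attributes are also skipped (illustrative only, not the exact patch):

```python
import re

# Illustrative _SKIP_ATTRS variant (not the exact patch): broader
# attribute-name class, and the value is optional so that valueless
# attributes are skipped too.
_SKIP_ATTRS = (
    r"""(?:\s+[^=<>/\s"']+"""                           # attribute name
    r"""(?:\s*=\s*(?:'[^']*'|"[^"]*"|[^'"\s]+))?)*?"""  # optional value
)
_CHARSET_RE = re.compile(
    r"""<meta%s\s+charset\s*=\s*["']?\s*([\w-]+)""" % _SKIP_ATTRS,
    re.I,
)

# Skips both a valueless attribute and a quoted one:
m = _CHARSET_RE.search('<meta nofollow data-x="1" charset=utf8>')
```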

@Digenis Digenis force-pushed the detect_encoding_from_separate_charset_attr branch from 544f0da to 8982b75 Compare April 5, 2016 09:11
@Digenis
Member Author

Digenis commented Apr 5, 2016

regex updated

@kmike kmike changed the title [MRG+1] Detect encoding when specified as a separate attribute in <meta> [MRG+2] Detect encoding when specified as a separate attribute in <meta> Apr 6, 2016
# regexp for parsing HTTP meta tags
_TEMPLATE = r'''%s\s*=\s*["']?\s*%s\s*["']?'''
_SKIP_ATTRS = r'''(?x)(?:\s+
[^=<>/\s"']+ # Attribute name
Member

For completeness we can also exclude non-printable characters from HTML attribute name regex.
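For instance, excluding ASCII control characters (and DEL) as well could look like this; this is an assumption about the final form, not necessarily the merged code:

```python
import re

# Hypothetical stricter attribute-name class: also excludes ASCII
# control characters and DEL, per the HTML spec's restrictions on
# characters allowed in attribute names.
ATTR_NAME_RE = re.compile(r"""[^=<>/\s"'\x00-\x1f\x7f]+""")

ok = ATTR_NAME_RE.fullmatch('data-foo') is not None
bad = ATTR_NAME_RE.fullmatch('a\x01b') is not None
```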

@Digenis Digenis force-pushed the detect_encoding_from_separate_charset_attr branch from 8982b75 to f7f48f8 Compare April 6, 2016 07:01
@Digenis
Member Author

Digenis commented Apr 6, 2016

updated

@redapple
Contributor

redapple commented Apr 6, 2016

@kmike , all good for you?

@kmike
Member

kmike commented Apr 6, 2016

@redapple yep!

@redapple redapple merged commit cc6d7df into scrapy:master Apr 6, 2016
@Digenis Digenis deleted the detect_encoding_from_separate_charset_attr branch April 6, 2016 16:47