[MRG+1] data URI parser. #71

ArturGaspar · 2016-08-12T06:08:25Z

No description provided.

codecov-io · 2016-08-12T06:10:03Z

Current coverage is 94.71% (diff: 95.34%)

Merging #71 into master will increase coverage by 0.62%

@@             master        #71   diff @@
==========================================
  Files             7          7          
  Lines           406        454    +48   
  Methods           0          0          
  Messages          0          0          
  Branches         84         93     +9   
==========================================
+ Hits            382        430    +48   
  Misses           16         16          
  Partials          8          8

Powered by Codecov. Last update 03c28d2...5ec787b

redapple · 2016-08-12T14:00:21Z

I wonder if this shouldn't be in w3lib.encoding

ArturGaspar · 2016-08-12T14:25:08Z

I don't see why encoding.

I think it could belong in w3lib.url.

redapple · 2016-08-12T14:39:33Z

Right, w3lib.url looks good for this

redapple · 2016-08-16T12:52:14Z

LGTM. Thanks @ArturGaspar !
cc @kmike , @eliasdorneles

ArturGaspar · 2016-08-16T13:15:28Z

Only fixed an invalid URL used in the tests above.

eliasdorneles · 2016-08-16T20:40:10Z

w3lib/url.py

+            raise ValueError("invalid data URI")
+        data = base64.b64decode(data)
+
+    return media_type, media_type_params, data


I think it would be better to return a dictionary instead, to ease future maintenance and also to make the result immediately understood by the occasional user.

@eliasdorneles Good suggestion. I have used a namedtuple like the standard library URL parsing functions.
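For readers following the thread, here is a minimal sketch of what a namedtuple result could look like. The type and field names are taken from the `ParseDataURIResult` repr quoted later in this conversation; the sample values are illustrative only, and this is not the code from the patch.

```python
from collections import namedtuple

# Field names follow the repr quoted later in the thread.
ParseDataURIResult = namedtuple(
    "ParseDataURIResult", ["media_type", "media_type_parameters", "data"]
)

# Illustrative values only.
result = ParseDataURIResult("text/plain", {"charset": "US-ASCII"}, b"A brief note")

# Fields are readable by name, which addresses the readability concern...
print(result.media_type)

# ...while the result still unpacks like the original 3-tuple.
media_type, params, data = result
```

This keeps existing callers that unpack three values working while giving occasional users self-describing attribute access.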

kmike · 2016-08-25T08:39:38Z

tests/test_url.py

+
+    def test_non_ascii_uri(self):
+        with self.assertRaises(UnicodeEncodeError):
+            parse_data_uri(u"data:,é")


I don't think raising UnicodeEncodeError is a good way to handle this in Python 3. We settled on unicode URLs by default for Python 3, so they shouldn't raise errors. I tried it in Firefox: it escaped this URL to data:,%C3%A9 and then processed it. The w3lib.url.safe_url_string function escapes the URL the same way. I'm not 100% sure, but it looks like we should use safe_url_string here instead of encoding to ascii, so that data: URLs are handled the same way a browser handles them. What do you think?

Makes sense.
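The escaping behaviour described above can be approximated with the Python 3 standard library alone. This is a sketch, not w3lib's `safe_url_string`: the helper name `escape_uri` and the chosen set of safe characters are assumptions made for illustration.

```python
from urllib.parse import quote

def escape_uri(uri):
    """Hypothetical helper: percent-encode unsafe characters as UTF-8.

    URI delimiters and '%' are listed as safe, so existing escape
    sequences are left untouched.
    """
    return quote(uri, safe=":/?#[]@!$&'()*+,;=%")

escaped = escape_uri("data:,é")
print(escaped)
```

After this step the URI is plain ASCII, so encoding it to bytes no longer raises UnicodeEncodeError.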

redapple · 2016-09-08T08:28:13Z

@kmike , are you ok with the latest changes?

redapple · 2016-09-13T08:30:56Z

@kmike ping :)

redapple · 2016-10-24T18:04:52Z

tests/test_url.py

+        with self.assertRaises(ValueError):
+            parse_data_uri("http://example.com/")
+
+


Wikipedia states:

Data URIs encoded in Base64 may contain whitespace for human readability.

@ArturGaspar , can you add tests for multiline content?

RFC2397 has an example:

<IMG SRC="data:image/gif;base64,R0lGODdhMAAwAPAAAAAAAP///ywAAAAAMAAw
AAAC8IyPqcvt3wCcDkiLc7C0qwyGHhSWpjQu5yqmCYsapyuvUUlvONmOZtfzgFz
ByTB10QgxOR0TqBQejhRNzOfkVJ+5YiUqrXF5Y5lKh/DeuNcP5yLWGsEbtLiOSp
a/TPg7JpJHxyendzWTBfX0cxOnKPjgBzi4diinWGdkF8kjdfnycQZXZeYGejmJl
ZeGl9i2icVqaNVailT6F5iJ90m6mvuTS4OK05M0vDk0Q4XUtwvKOzrcd3iq9uis
F81M1OIcR7lEewwcLp7tuNNkM3uNna3F2JQFo97Vriy/Xl4/f1cf5VWzXyym7PH
hhx4dbgYKAAA7" ALT="Larry">

redapple · 2016-10-24T18:08:15Z

w3lib/url.py

+        uri = safe_url_string(uri).encode('ascii')
+
+    scheme, uri = uri.split(b':', 1)
+    if scheme != b'data':


It looks like this does not handle scheme case-insensitivity correctly:

Python 2 test:

>>> parse_data_uri("data:,A%20brief%20note")
ParseDataURIResult(media_type='text/plain', media_type_parameters={'charset': 'US-ASCII'}, data='A brief note')
>>> parse_data_uri(b"DATA:,A%20brief%20note")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "w3lib/url.py", line 327, in parse_data_uri
    raise ValueError("not a data URI")
ValueError: not a data URI
>>> parse_data_uri(u"DATA:,A%20brief%20note")
ParseDataURIResult(media_type='text/plain', media_type_parameters={'charset': 'US-ASCII'}, data='A brief note')
>>>
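One way to make the scheme comparison case-insensitive is to lowercase the scheme before comparing it. The sketch below assumes the URI has already been converted to bytes; `is_data_scheme` is a hypothetical helper, not a function from w3lib.

```python
def is_data_scheme(uri):
    """Hypothetical helper: True if the bytes URI uses the data scheme."""
    # partition() never raises; sep is empty when there is no colon.
    scheme, sep, _rest = uri.partition(b":")
    # URI schemes are case-insensitive, so compare the lowercased form.
    return bool(sep) and scheme.lower() == b"data"

print(is_data_scheme(b"DATA:,A%20brief%20note"))
print(is_data_scheme(b"http://example.com/"))
```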

redapple · 2016-11-04T11:17:10Z

LGTM, @ArturGaspar. Thanks!

ArturGaspar · 2016-11-04T11:24:08Z

tests/test_url.py

+        self.assertEqual(result.media_type, "text/plain")
+        self.assertEqual(result.data, b"Hello, world.")
+
+        result = parse_data_uri("data:text/plain;base64,SGVsb G8sIH\n  "


This URI, with actual spaces rather than escape sequences, is not valid, so I don't think it is a good idea to test for it. The test passes only because the parsing happens to be implemented in a way that handles it as well.

RFC 3986 appendix C sort of allows for spaces, but as I understand it, handling them is the responsibility of the program that extracts the URI, not of a library that does not deal with the source of the URI directly. Also, only whitespace in the base64 part is handled, not whitespace elsewhere in the URI.

Firefox seems to allow spaces anywhere after the scheme. Chrome seems to allow spaces only in the base64 part.

If the goal is to be compatible with browsers, then perhaps this URI should be considered valid?

IMHO, the implementation should support whitespace in the base64-encoded data part.
I don't think we need to support whitespace elsewhere (so that would mean following what Chrome does, I believe).

That's what it already does, then.
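The behaviour agreed on here, tolerating whitespace only inside the base64 payload, can be sketched as follows. `decode_base64_lenient` is a hypothetical helper written for illustration; the patch itself may achieve the same effect differently.

```python
import base64

def decode_base64_lenient(data):
    """Hypothetical helper: drop ASCII whitespace, then base64-decode."""
    # bytes.split() with no argument splits on any run of ASCII whitespace.
    compact = b"".join(data.split())
    return base64.b64decode(compact)

decoded = decode_base64_lenient(b"SGVsb G8sIH\n  dvcm\n  xkLg==")
print(decoded)
```

Whitespace in the media type or parameters is deliberately not handled here, matching the scope discussed above.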

redapple · 2016-12-06T14:29:54Z

@eliasdorneles , @kmike , what do you think of the patch now?

kmike · 2016-12-12T12:02:05Z

w3lib/url.py

+
+    scheme, uri = uri.split(b':', 1)
+    if scheme.lower() != b'data':
+        raise ValueError("not a data URI")


What do you think about raising the same ValueError if there is no scheme at all (currently it will fail earlier, when unpacking split result)?

kmike · 2016-12-12T12:11:28Z

w3lib/url.py

+    is_base64, data = uri.split(b',', 1)
+    if is_base64:
+        if is_base64 != b";base64":
+            raise ValueError("invalid data URI")


It'd be nice to have an error message different from the error message of other ValueError, and also to raise ValueError if uri.split returns only a single result
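The two review comments above ask for a ValueError, with distinguishable messages, whenever a split does not yield two parts. A minimal Python 3 sketch of that idea is shown below; the function name `split_data_uri` and the exact error messages are assumptions, and media type parsing is left out for brevity.

```python
import base64
from urllib.parse import unquote_to_bytes

def split_data_uri(uri):
    """Sketch: return (header, data) from a bytes data URI."""
    # partition() returns an empty separator instead of failing to unpack.
    scheme, colon, rest = uri.partition(b":")
    if not colon or scheme.lower() != b"data":
        raise ValueError("not a data URI")

    header, comma, data = rest.partition(b",")
    if not comma:
        # A different message from the scheme error above.
        raise ValueError("invalid data URI: missing comma")

    data = unquote_to_bytes(data)
    marker = b";base64"
    if header.endswith(marker):
        header = header[:-len(marker)]
        data = base64.b64decode(data)
    return header, data

print(split_data_uri(b"data:,A%20brief%20note"))
```

Using `partition` rather than `split(..., 1)` means a missing separator is detected explicitly instead of surfacing as a tuple-unpacking error.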

kmike · 2017-02-08T13:50:54Z

Ok, sorry for the delay. Let's merge it.

kmike · 2017-02-08T13:51:02Z

Thanks @ArturGaspar!

data URI parser.

8d7a972

Formatting data_uri.py.

b375452

redapple mentioned this pull request Aug 12, 2016

[WIP] data URI download handler. scrapy/scrapy#2175

Closed

Move parse_data_uri() to url module.

7485ed2

redapple changed the title from "data URI parser." to "[MRG+1] data URI parser." Aug 16, 2016

ArturGaspar added 2 commits August 16, 2016 10:07

Percent-encode spaces in test URI.

10f6524

Fix test.

6778ea0

redapple mentioned this pull request Aug 16, 2016

Is it any plan to support downloading "data:image/jpeg;base64" scrapy/scrapy#2156

Closed

eliasdorneles reviewed Aug 16, 2016

ArturGaspar added 2 commits August 16, 2016 21:25

Use named tuple in parse_data_uri.

d6dc7dd

Handle bytes in Python 3 in parse_data_uri().

69d9594

kmike reviewed Aug 25, 2016

Allow Unicode URIs in parse_data_uri.

ee2f1a9

redapple reviewed Oct 24, 2016

ArturGaspar added 2 commits November 4, 2016 09:04

Whitespace test for parse_data_uri.

5b767a9

Case insensitive scheme in parse_data_uri.

fcb8f09

redapple approved these changes Nov 4, 2016

ArturGaspar commented Nov 4, 2016

kmike reviewed Dec 12, 2016

ArturGaspar added 2 commits January 13, 2017 12:05

Better exceptions in parse_data_uri().

84a3a4b

Tests for exceptions in parse_data_uri().

5ec787b

kmike merged commit f46b4c4 into scrapy:master Feb 8, 2017
