[MRG+1] data URI parser. #71
Current coverage is 94.71% (diff: 95.34%)
@@             master       #71   diff @@
==========================================
  Files             7         7
  Lines           406       454    +48
  Methods           0         0
  Messages          0         0
  Branches         84        93     +9
==========================================
+ Hits            382       430    +48
  Misses           16        16
  Partials          8         8
I wonder if this shouldn't be in
I don't see why I think it could belong in
Right,
LGTM. Thanks @ArturGaspar!
Only fixed an invalid URL used in the tests above.
w3lib/url.py
Outdated
        raise ValueError("invalid data URI")
    data = base64.b64decode(data)

return media_type, media_type_params, data
I think it would be better to return a dictionary instead, to ease future maintenance and also to make the result immediately understood by the occasional user.
@eliasdorneles Good suggestion. I have used a namedtuple like the standard library URL parsing functions.
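For readers unfamiliar with the pattern, here is a minimal sketch of a namedtuple result. The field names are taken from the `ParseDataURIResult` repr quoted later in this thread; the construction itself is an illustration, not the code from the patch.

```python
from collections import namedtuple
from urllib.parse import urlparse

# Field names follow the result repr quoted in this thread; the definition
# here is an illustrative assumption, not the patch's source.
ParseDataURIResult = namedtuple(
    "ParseDataURIResult", ["media_type", "media_type_parameters", "data"]
)

result = ParseDataURIResult("text/plain", {"charset": "US-ASCII"}, b"A brief note")

# Attribute access makes call sites self-describing.
print(result.media_type)

# Tuple unpacking keeps working for callers that prefer positional results.
media_type, params, data = result

# The standard library's urlparse() returns a namedtuple in the same spirit.
print(urlparse("http://example.com/path").netloc)
```

A namedtuple keeps the positional return value working while adding named access, which addresses the readability concern without switching to a dictionary.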
tests/test_url.py
Outdated
def test_non_ascii_uri(self):
    with self.assertRaises(UnicodeEncodeError):
        parse_data_uri(u"data:,é")
I think raising UnicodeEncodeError is not a good way to handle it in Python 3. We settled on unicode URLs by default for Python 3, so they shouldn't raise errors. I tried it in Firefox: it escaped this URL to data:,%C3%A9 and then processed it. The w3lib.url.safe_url_string function escapes the URL the same way. I'm not 100% sure about that, but it looks like we should use safe_url_string here instead of encoding to ascii. This way data: URLs will be handled the same way a browser handles them. What do you think?
Makes sense.
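The escaping described above can be approximated with the standard library alone. This is a rough sketch, not `safe_url_string` itself: `escape_non_ascii` is a hypothetical helper, and the set of characters left unescaped is an assumption chosen for illustration.

```python
from urllib.parse import quote, unquote_to_bytes


def escape_non_ascii(uri):
    """Hypothetical stand-in for the escaping step (not w3lib's code).

    Non-ASCII characters are percent-encoded as UTF-8; '%' and the URI
    delimiters are left alone so existing escapes are not double-encoded.
    """
    return quote(uri, safe="%:/?#[]@!$&'()*+,;=~")


escaped = escape_non_ascii("data:,é")
print(escaped)  # data:,%C3%A9

# The escaped form is pure ASCII, so encoding no longer raises.
raw = escaped.encode("ascii")

# Unescaping the payload recovers the UTF-8 bytes of the original character.
payload = unquote_to_bytes(raw.split(b",", 1)[1])
print(payload.decode("utf-8"))
```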
@kmike, are you ok with the latest changes?
@kmike ping :)
with self.assertRaises(ValueError):
    parse_data_uri("http://example.com/")
Wikipedia states:
Data URIs encoded in Base64 may contain whitespace for human readability.
@ArturGaspar , can you add tests for multiline content?
<IMG
SRC="data:image/gif;base64,R0lGODdhMAAwAPAAAAAAAP///ywAAAAAMAAw
AAAC8IyPqcvt3wCcDkiLc7C0qwyGHhSWpjQu5yqmCYsapyuvUUlvONmOZtfzgFz
ByTB10QgxOR0TqBQejhRNzOfkVJ+5YiUqrXF5Y5lKh/DeuNcP5yLWGsEbtLiOSp
a/TPg7JpJHxyendzWTBfX0cxOnKPjgBzi4diinWGdkF8kjdfnycQZXZeYGejmJl
ZeGl9i2icVqaNVailT6F5iJ90m6mvuTS4OK05M0vDk0Q4XUtwvKOzrcd3iq9uis
F81M1OIcR7lEewwcLp7tuNNkM3uNna3F2JQFo97Vriy/Xl4/f1cf5VWzXyym7PH
hhx4dbgYKAAA7"
ALT="Larry">
w3lib/url.py
Outdated
uri = safe_url_string(uri).encode('ascii')

scheme, uri = uri.split(b':', 1)
if scheme != b'data':
It looks like this does not handle scheme case-insensitivity correctly:
Python 2 test:
>>> parse_data_uri("data:,A%20brief%20note")
ParseDataURIResult(media_type='text/plain', media_type_parameters={'charset': 'US-ASCII'}, data='A brief note')
>>> parse_data_uri(b"DATA:,A%20brief%20note")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "w3lib/url.py", line 327, in parse_data_uri
raise ValueError("not a data URI")
ValueError: not a data URI
>>> parse_data_uri(u"DATA:,A%20brief%20note")
ParseDataURIResult(media_type='text/plain', media_type_parameters={'charset': 'US-ASCII'}, data='A brief note')
>>>
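URI schemes are case-insensitive, so the comparison has to normalise case for both text and bytes input. A minimal sketch of such a check follows; `has_data_scheme` is a hypothetical helper written for this illustration and is not part of w3lib.

```python
def has_data_scheme(uri):
    """Hypothetical helper: True if `uri` uses the data scheme, in any case.

    Accepts str or bytes; str input is assumed to be ASCII already.
    """
    if isinstance(uri, str):
        uri = uri.encode("ascii")
    # partition() never raises, even when there is no ':' in the input.
    scheme, colon, _rest = uri.partition(b":")
    return bool(colon) and scheme.lower() == b"data"


print(has_data_scheme(b"DATA:,A%20brief%20note"))
print(has_data_scheme("data:,A%20brief%20note"))
print(has_data_scheme("http://example.com/"))
```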
LGTM, @ArturGaspar. Thanks!
self.assertEqual(result.media_type, "text/plain")
self.assertEqual(result.data, b"Hello, world.")

result = parse_data_uri("data:text/plain;base64,SGVsb G8sIH\n "
This URI, with actual spaces rather than escape sequences, is not valid, and I think it is not a good idea to test for it. The test passes only because parsing happens to be implemented in a way that handles it as well.
RFC 3986 appendix C sort of allows for spaces, but as I understand it, handling them is the responsibility of the program that extracts the URI, not of a library that does not deal with the source of the URI directly. Also, only whitespace in the base64 part is handled, not whitespace elsewhere in the URI.
Firefox seems to allow spaces anywhere after the scheme. Chrome seems to allow spaces only on the base64 part.
If the goal is to be compatible with browsers, then perhaps this URI should be considered valid?
IMHO, the implementation should support whitespace in the base64-encoded data part.
I don't think we need to support whitespace elsewhere (so that would mean following what Chrome does, I believe).
That's what it already does, then.
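To make the agreed behaviour concrete, here is a small sketch that tolerates whitespace only inside the base64 payload. `decode_base64_payload` is a hypothetical helper, and removing whitespace explicitly before a strict decode is one possible approach, not necessarily what the patch does.

```python
import base64


def decode_base64_payload(payload):
    """Hypothetical helper: drop ASCII whitespace, then decode strictly.

    bytes.split() with no argument splits on runs of ASCII whitespace, so
    joining the pieces removes spaces, tabs and newlines. validate=True
    makes b64decode reject any other character outside the alphabet.
    """
    compact = b"".join(payload.split())
    return base64.b64decode(compact, validate=True)


# A payload broken across lines for readability, as in the Wikipedia example.
print(decode_base64_payload(b"SGVsb G8sIH\n   dvcmxkLg=="))  # b'Hello, world.'
```

Note that `base64.b64decode` without `validate=True` silently discards characters outside the base64 alphabet, which is one way a parser can end up accepting whitespace without explicitly handling it.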
@eliasdorneles, @kmike, what do you think of the patch now?
scheme, uri = uri.split(b':', 1)
if scheme.lower() != b'data':
    raise ValueError("not a data URI")
What do you think about raising the same ValueError if there is no scheme at all (currently it will fail earlier, when unpacking the split result)?
is_base64, data = uri.split(b',', 1)
if is_base64:
    if is_base64 != b";base64":
        raise ValueError("invalid data URI")
It'd be nice to have an error message that differs from the other ValueError messages, and also to raise ValueError if uri.split returns only a single result.
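One way to address both review points is to use `partition`, which always returns three parts, instead of unpacking the result of `split`. The sketch below is an assumption-laden illustration: `split_data_uri` and its error messages are hypothetical, and it simplifies the header handling compared with the code under review.

```python
def split_data_uri(uri):
    """Hypothetical helper: split a bytes data URI into
    (header, is_base64, payload), with a distinct ValueError per failure.
    """
    scheme, colon, rest = uri.partition(b":")
    if not colon:
        # split(b':', 1) would fail on unpacking here instead.
        raise ValueError("not a data URI: no scheme separator")
    if scheme.lower() != b"data":
        raise ValueError("not a data URI: unexpected scheme")

    header, comma, payload = rest.partition(b",")
    if not comma:
        raise ValueError("invalid data URI: missing ',' before the data")

    # Simplification: treat a trailing ';base64' on the header as the flag.
    is_base64 = header.endswith(b";base64")
    if is_base64:
        header = header[: -len(b";base64")]
    return header, is_base64, payload


print(split_data_uri(b"data:text/plain;base64,SGVsbG8="))
```

Because every failure path raises `ValueError` with its own message, callers can catch one exception type while still seeing which check failed.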
Ok, sorry for the delay. Let's merge it.
Thanks @ArturGaspar!
No description provided.