Skip to content

Conversation

@ArturGaspar
Copy link
Contributor

No description provided.

@codecov-io
Copy link

codecov-io commented Aug 12, 2016

Current coverage is 94.71% (diff: 95.34%)

Merging #71 into master will increase coverage by 0.62%

@@             master        #71   diff @@
==========================================
  Files             7          7          
  Lines           406        454    +48   
  Methods           0          0          
  Messages          0          0          
  Branches         84         93     +9   
==========================================
+ Hits            382        430    +48   
  Misses           16         16          
  Partials          8          8          

Powered by Codecov. Last update 03c28d2...5ec787b

@redapple
Copy link
Contributor

I wonder if this shouldn't be in w3lib.encoding

@ArturGaspar
Copy link
Contributor Author

I don't see why encoding.

I think it could belong in w3lib.url.

@redapple
Copy link
Contributor

Right, w3lib.url looks good for this

@redapple
Copy link
Contributor

LGTM. Thanks @ArturGaspar !
cc @kmike , @eliasdorneles

@redapple redapple changed the title data URI parser. [MRG+1] data URI parser. Aug 16, 2016
@ArturGaspar
Copy link
Contributor Author

Only fixed an invalid URL used in the tests above.

w3lib/url.py Outdated
raise ValueError("invalid data URI")
data = base64.b64decode(data)

return media_type, media_type_params, data
Copy link
Member

@eliasdorneles eliasdorneles Aug 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to return a dictionary instead, to ease future maintenance and also to make the result immediately understood by the occasional user.

Copy link
Contributor Author

@ArturGaspar ArturGaspar Aug 17, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eliasdorneles Good suggestion. I have used a namedtuple like the standard library URL parsing functions.


def test_non_ascii_uri(self):
with self.assertRaises(UnicodeEncodeError):
parse_data_uri(u"data:,é")
Copy link
Member

@kmike kmike Aug 25, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think raising UnicodeEncodeError is not a good way to handle it in Python 3. We settled on unicode URLs by default for Python 3, they shouldn't raise errors. I tried it in Firefox - it escaped this URL to data:,%C3%A9 and then processed it. w3lib.url.safe_url_string function escaped URL the same way. I'm not 100% sure about that, but it looks like we should use safe_url_string here instead of encoding to ascii. This way data: urls will be handled the same way as a browser handles them. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.

@redapple
Copy link
Contributor

redapple commented Sep 8, 2016

@kmike , are you ok with the latest changes?

@redapple
Copy link
Contributor

@kmike ping :)

with self.assertRaises(ValueError):
parse_data_uri("http://example.com/")


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wikipedia states:

Data URIs encoded in Base64 may contain whitespace for human readability.

@ArturGaspar , can you add tests for multiline content?

RFC2397 has an example:

 <IMG
   SRC="data:image/gif;base64,R0lGODdhMAAwAPAAAAAAAP///ywAAAAAMAAw
   AAAC8IyPqcvt3wCcDkiLc7C0qwyGHhSWpjQu5yqmCYsapyuvUUlvONmOZtfzgFz
   ByTB10QgxOR0TqBQejhRNzOfkVJ+5YiUqrXF5Y5lKh/DeuNcP5yLWGsEbtLiOSp
   a/TPg7JpJHxyendzWTBfX0cxOnKPjgBzi4diinWGdkF8kjdfnycQZXZeYGejmJl
   ZeGl9i2icVqaNVailT6F5iJ90m6mvuTS4OK05M0vDk0Q4XUtwvKOzrcd3iq9uis
   F81M1OIcR7lEewwcLp7tuNNkM3uNna3F2JQFo97Vriy/Xl4/f1cf5VWzXyym7PH
   hhx4dbgYKAAA7"
   ALT="Larry">

w3lib/url.py Outdated
uri = safe_url_string(uri).encode('ascii')

scheme, uri = uri.split(b':', 1)
if scheme != b'data':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this does not handle scheme case-insensitivity correctly:

Python 2 test:

>>> parse_data_uri("data:,A%20brief%20note")
ParseDataURIResult(media_type='text/plain', media_type_parameters={'charset': 'US-ASCII'}, data='A brief note')

>>> parse_data_uri(b"DATA:,A%20brief%20note")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "w3lib/url.py", line 327, in parse_data_uri
    raise ValueError("not a data URI")
ValueError: not a data URI

>>> parse_data_uri(u"DATA:,A%20brief%20note")
ParseDataURIResult(media_type='text/plain', media_type_parameters={'charset': 'US-ASCII'}, data='A brief note')
>>> 

@redapple
Copy link
Contributor

redapple commented Nov 4, 2016

LGTM, @ArturGaspar. Thanks!

self.assertEqual(result.media_type, "text/plain")
self.assertEqual(result.data, b"Hello, world.")

result = parse_data_uri("data:text/plain;base64,SGVsb G8sIH\n "
Copy link
Contributor Author

@ArturGaspar ArturGaspar Nov 4, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This URI with actual spaces rather than escape sequences is not valid, I think it is not a good idea to test for it. The test passes because parsing just so happens to be implemented in a way that handles it as well.

RFC 3986 appendix C sort of allows for spaces, but per my understanding it would be the responsibility of the program that extracts the URI to handle it, not of a library that is not dealing with the source of the URI directly. Also only whitespace in the base64 part is handled, not in any part of the URI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Firefox seems to allow spaces anywhere after the scheme. Chrome seems to allow spaces only on the base64 part.

If the goal is to be compatible with browsers, then perhaps this URI should be considered valid?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imho, the implementation should support whitespace in base64 encoded data part.
I don't think we need to support whitespace elsewhere. (so that would mean following what Chrome does, I believe)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what it already does, then.

@redapple
Copy link
Contributor

redapple commented Dec 6, 2016

@eliasdorneles , @kmike , what do you think of the patch now?


scheme, uri = uri.split(b':', 1)
if scheme.lower() != b'data':
raise ValueError("not a data URI")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about raising the same ValueError if there is no scheme at all (currently it will fail earlier, when unpacking split result)?

is_base64, data = uri.split(b',', 1)
if is_base64:
if is_base64 != b";base64":
raise ValueError("invalid data URI")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to have an error message different from the error message of other ValueError, and also to raise ValueError if uri.split returns only a single result

@kmike kmike merged commit f46b4c4 into scrapy:master Feb 8, 2017
@kmike
Copy link
Member

kmike commented Feb 8, 2017

Ok, sorry for the delay. Let's merge it.

@kmike
Copy link
Member

kmike commented Feb 8, 2017

Thanks @ArturGaspar!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants