Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exporters shouldn't encode unicode values to bytes by default #1080

Closed
kmike opened this issue Mar 17, 2015 · 5 comments
Closed

Exporters shouldn't encode unicode values to bytes by default #1080

kmike opened this issue Mar 17, 2015 · 5 comments

Comments

@kmike
Copy link
Member

@kmike kmike commented Mar 17, 2015

I think it makes more sense to send unicode to serializer libraries - encoding shouldn't be handled per-field.

  • JsonLinesItemExporter and JsonItemExporter shouldn't encode values to bytes because JSON is a text format. The encoding is always utf8. In Python 3.x stdlib json module doesn't even support byte keys/values, and docs say that the result of json.dump is always str (aka unicode in 2.x). By the way, it makes no sense to inherit JsonItemExporter from JsonLinesItemExporter. JsonItemExporter even overrides all JsonLinesItemExporter methods.
  • XmlItemExporter shouldn't encode values to bytes because since Python 2.7.6 xml.sax.saxutils wants unicode, and in Python 3.x it requires unicode. Currently this is worked around by decoding the encoded value. The problem is that before Python 2.7.6 xml.sax.saxutils required bytes; IMHO it is better to workaround this instead of encoding/decoding data multiple times.
  • For CsvItemExporter it makes sense to encode values because in Python 2.x csv module doesn't support unicode. In Python 3.x csv supports unicode, so encoding can be seen as a temporary workaround of stdlib issues.
  • For PickleItemExporter and MarshalItemExporter encoding unicode values is bad because instead of original values users will get bytes when the result is unpickled/unmarshalled. It is not convenient, and it is an information loss (it is not clear what encoding to use to decode the value).
  • I'm not sure about PprintItemExporter. Its behaviour should depend on 'file' argument - if it can handle unicode the output should be unicode, if it can't - the output should be bytes.
  • PythonItemExporter assumes that serialization library can't handle unicode and mentions json, msgpack and binc. JSON can't handle bytes; msgpack can handle both bytes and unicode; I'm not sure binc is of any use currently. I think leaving unicode values as-is is a better behaviour for PythonItemExporter.
@eliasdorneles
Copy link
Member

@eliasdorneles eliasdorneles commented Sep 16, 2015

I agree with this, I think we should favor unicode over encoded bytes (specially for JSON output).

About JSON output -- wouldn't be a backward compatibility issue to change that?
It's a bit of a breaking change, so I'm not sure how to handle this.

I'm working on porting the current exporters to Python 3 and for now I think I'll keep the current behavior, unless we make a decision on this.

What do you think?

@kmike
Copy link
Member Author

@kmike kmike commented Sep 20, 2015

About JSON output -- wouldn't be a backward compatibility issue to change that?
It's a bit of a breaking change, so I'm not sure how to handle this.

I think for JSON it should be fine. For me in Python 2.x json tries to decode bytes data before encoding it again, and it fails if encodings don't match:

>>> import json
>>> data = u'привет'.encode('cp1251')
>>> json.dumps({'x': data})
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-7-223ffea81bb4> in <module>()
----> 1 json.dumps({'x': data})

/usr/local/Cellar/python/2.7.8_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.pyc in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, encoding, default, sort_keys, **kw)
    241         cls is None and indent is None and separators is None and
    242         encoding == 'utf-8' and default is None and not sort_keys and not kw):
--> 243         return _default_encoder.encode(obj)
    244     if cls is None:
    245         cls = JSONEncoder

/usr/local/Cellar/python/2.7.8_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.pyc in encode(self, o)
    205         # exceptions aren't as detailed.  The list call should be roughly
    206         # equivalent to the PySequence_Fast that ''.join() would do.
--> 207         chunks = self.iterencode(o, _one_shot=True)
    208         if not isinstance(chunks, (list, tuple)):
    209             chunks = list(chunks)

/usr/local/Cellar/python/2.7.8_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.pyc in iterencode(self, o, _one_shot)
    268                 self.key_separator, self.item_separator, self.sort_keys,
    269                 self.skipkeys, _one_shot)
--> 270         return _iterencode(o, 0)
    271 
    272 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,

UnicodeDecodeError: 'utf8' codec can't decode byte 0xef in position 0: invalid continuation byte

so it seems that by using unicode for json we'll be fixing a bug.

@eliasdorneles
Copy link
Member

@eliasdorneles eliasdorneles commented Feb 2, 2016

@kmike I think we can close this now, right?

@kmike
Copy link
Member Author

@kmike kmike commented Feb 2, 2016

Yes, thanks for fixing it!

@kmike kmike closed this Feb 2, 2016
@eliasdorneles
Copy link
Member

@eliasdorneles eliasdorneles commented Feb 2, 2016

Thank you for making it easier for us. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants