[MRG+1] Added option to turn off ensure_ascii for JSON exporters #2034
Conversation
For some reason I just can't get the tests to pass on Py3, so annoying...
Current coverage is 83.37%
Fixed it finally )
Any feedback for it? Would really love to finally have this in my scraper )
@dracony , thanks for this.
def test_unicode_utf8(self):
    self.ie = JsonLinesItemExporter(self.output, ensure_ascii=False)
    i1 = TestItem(name=u'Test\u2603')
    self.assertExportResult(i1, b'{"name": "Test\xe2\x98\x83"}\n')
nitpick: I find it a bit easier to read a test against u'Test\u2603'.encode('utf8')
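For the record, the two spellings of the expected bytes are interchangeable; a quick check (the variable names here are just for illustration):

```python
# u'Test\u2603' (snowman, U+2603) encodes to the UTF-8 bytes
# \xe2\x98\x83, so both spellings of the expected output match.
expected_literal = b'{"name": "Test\xe2\x98\x83"}\n'
expected_encoded = u'{"name": "Test\u2603"}\n'.encode('utf8')
print(expected_literal == expected_encoded)  # → True
```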
And here is the FEED_EXPORT_ENCODING setting =)
Ah damn it, the CSV tests fail on Py3 because StringIO is behaving differently =(
@redapple Yay, I made it work.
@@ -33,9 +34,9 @@ def _configure(self, options, dont_fail=False):
    If dont_fail is set, it won't raise an exception on unexpected options
    (useful for using with keyword arguments in subclasses constructors)
    """
    self.encoding=options.pop('encoding', None)
Code style
@robsonpeixoto Thanks, missed those. I amended the commit to pep8-ize it =)
Any other feedback on it? Would really be great to have it merged (I work with quite a lot of cp-1251 encoded stuff)
self.include_headers_line = include_headers_line
file = file if six.PY2 else io.TextIOWrapper(file, line_buffering=True)
self.csv_writer = csv.writer(file, **kwargs)
self.stream = six.StringIO()
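The wrapping in the snippet above can be demonstrated in isolation. A minimal sketch (not the exporter itself), assuming the storage hands the exporter a binary stream:

```python
import csv
import io

# On Python 3, csv.writer expects a text stream, but feed storages
# provide a binary file object; TextIOWrapper bridges the two.
# line_buffering=True flushes the wrapper on every newline, so each
# row reaches the underlying byte stream without an explicit flush().
raw = io.BytesIO()
text = io.TextIOWrapper(raw, encoding='utf-8', line_buffering=True)
writer = csv.writer(text)
writer.writerow(['name', u'Test\u2603'])
print(raw.getvalue())  # → b'name,Test\xe2\x98\x83\r\n'
```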
duplicated on line 189 below
Yup, I'll remove that
The duplication is still there.
If you know of any other way to make it work on py2 and 3 I'd be happy to implement it. Although I think the current one is best.
self.file.write(to_bytes(self.encoder.encode(itemdict) + '\n'))
data = self.encoder.encode(itemdict) + '\n'
if six.PY3:
    data = to_unicode(data)
why is this needed for Python 3?
I tried with a very simple test and it seems json encoder already outputs Py3-'str'
$ python
Python 3.5.1+ (default, Mar 30 2016, 22:46:26)
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> json.encoder.JSONEncoder(ensure_ascii=False).encode({u"é": 3})
'{"é": 3}'
>>> type(json.encoder.JSONEncoder(ensure_ascii=False).encode({u"é": 3}))
<class 'str'>
>>> json.encoder.JSONEncoder(ensure_ascii=True).encode({u"é": 3})
'{"\\u00e9": 3}'
>>> type(json.encoder.JSONEncoder(ensure_ascii=True).encode({u"é": 3}))
<class 'str'>
>>> from scrapy.utils.python import to_unicode
>>> type(to_unicode(json.encoder.JSONEncoder(ensure_ascii=True).encode({u"é": 3})))
<class 'str'>
>>> type(to_unicode(json.encoder.JSONEncoder(ensure_ascii=False).encode({u"é": 3})))
<class 'str'>
am I missing something?
@redapple True, it seems I overcomplicated it a bit when I was trying to make the test pass. I removed those lines and the tests still pass 👍
@redapple @robsonpeixoto Ping =)
def _build_row(self, values):
    for s in values:
        try:
            yield to_native_str(s)
            yield to_bytes(s) if six.PY2 else s
hm, why is this needed?
unless I'm missing something, to_native_str() is equivalent.
also, perhaps you want to pass self.encoding here too?
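For readers unfamiliar with these helpers: roughly speaking, on Python 3 to_native_str() returns str unchanged and decodes bytes, which is why it matches the plain-`s` branch of the diff above. A simplified stand-in (not Scrapy's actual implementation):

```python
def to_bytes(text, encoding='utf-8', errors='strict'):
    # simplified stand-in for scrapy.utils.python.to_bytes
    if isinstance(text, bytes):
        return text
    return text.encode(encoding, errors)

def to_native_str(text, encoding='utf-8', errors='strict'):
    # on Python 3 the "native" string type is str: bytes are decoded,
    # str passes through -- so to_native_str(s) and plain s agree
    # whenever s is already a str
    if isinstance(text, str):
        return text
    return text.decode(encoding, errors)

print(to_native_str(u'Test\u2603'))  # → Test☃
```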
Hey @dracony -- sorry for the delay in reviewing, I've made some comments above. The main thing that isn't clear to me is the changes in the CSV exporter -- is line buffering really needed?
Also, if you could revise the style once more, there is some minor stuff there (try running
@eliasdorneles I gave it one more try and made the CSV writer work without changing encoding line by line (although in my defence I got that approach from the csv manual page). As for the
@dracony Please see my other comment about passing
@eliasdorneles I changed the ensure_ascii to kwargs as you suggested. Also I found a solution for why
@dracony hey, thanks for digging! 👍
@eliasdorneles It's not the file that isn't being flushed, it's that TextIOBuffer
Yup, you're right. :)
Yay =)
FEED_EXPORT_ENCODING
--------------------

The encoding to be used for the feed. Defaults to UTF-8.
This is not correct for JSON(lines), the default encoding is \uXXXX escape sequences.
A special case indeed, but worth a mention I believe.
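The special case is easy to reproduce with the stdlib encoder, which escapes non-ASCII characters by default:

```python
import json

item = {'name': u'Test\u2603'}
print(json.dumps(item))                      # {"name": "Test\u2603"}  (ASCII-safe escapes)
print(json.dumps(item, ensure_ascii=False))  # {"name": "Test☃"}  (raw UTF-8 text)
```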
How would I phrase it though? Wouldn't it be confusing to say it's "\uXXXX escape sequences"?
Thanks @redapple, missed that on the review.
You can say "safe numeric encoding (\uXXXX sequences)".
Not sure. The default value is None, which means the backward-compatible default behavior (current master, before the merge), which is:
- \uXXXX escape sequences for JSON
- UTF-8 for XML and CSV
(if I'm not mistaken)
And users can change to UTF-8 for JSON instead of \uXXXX with FEED_EXPORT_ENCODING = 'utf-8'.
If set, and different from "utf-8", it changes behavior for all exporters.
I don't have good phrasing right now. Maybe @eliasdorneles has ideas?
Concretely, here is my suggestion:
FEED_EXPORT_ENCODING
--------------------
Default: ``None``
The encoding to be used for the feed.
If unset or set to ``None`` (default) it uses UTF-8 for everything except JSON output,
which uses safe numeric encoding (``\uXXXX`` sequences) for historic reasons.
Use ``utf-8`` if you want UTF-8 for JSON too.
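In project terms, the suggested behavior comes down to a one-line settings change; a hypothetical settings.py fragment (the value shown is the opt-in discussed above):

```python
# settings.py -- opt in to UTF-8 output for JSON feeds as well
FEED_EXPORT_ENCODING = 'utf-8'
```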
Let it be so =) I updated the PR with your suggestion
Use ``utf-8`` if you want UTF-8 for JSON too.

.. setting:: FEED_EXPORT_ENCODING
Sorry, I had missed this: it should be .. setting:: FEED_EXPORT_FIELDS here and .. setting:: FEED_EXPORT_ENCODING earlier
Ah I see. Fixed that
@dracony , thanks a lot!
See #1965