[MRG+1] Added option to turn off ensure_ascii for JSON exporters #2034
Conversation
a7cfec6 to ba86319
For some reason I just can't get the tests to pass on Py3, so annoying...
Current coverage is 83.37%
Fixed it finally )
Any feedback on it? Would really love to finally have this in my scraper )
@dracony, thanks for this.
def test_unicode_utf8(self):
    self.ie = JsonLinesItemExporter(self.output, ensure_ascii = False)
    i1 = TestItem(name=u'Test\u2603')
    self.assertExportResult(i1, b'{"name": "Test\xe2\x98\x83"}\n')
redapple
Jun 7, 2016
Contributor
nitpick: I find it a bit easier to read a test against u'Test\u2603'.encode('utf8')
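That is, roughly (a sketch of the suggested assertion; the expected bytes are identical):

    self.assertExportResult(i1, u'{"name": "Test\u2603"}\n'.encode('utf8'))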
self.first_item = True

def _configure(self, options, dont_fail=False):
    self.ensure_ascii = options.pop('ensure_ascii', True)
dracony
Jun 7, 2016
Author
Contributor
I added a check for ensure_ascii in the options, so you would not have to subclass the exporter, just pass the option to it.
A setting would probably be a better option though. OK, I'll add that one too.
As for the encoding parameter: the JSON encoder accepts separate ensure_ascii and encoding parameters. Would you like them to be separate like that? Or turn off ensure_ascii automatically if the encoding is specified?
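For illustration, using the option directly (a minimal sketch that assumes the _configure() change shown in the diff above; the feed-exporter integration and the setting came later):

    import scrapy
    from scrapy.exporters import JsonLinesItemExporter

    class TestItem(scrapy.Item):
        name = scrapy.Field()

    with open('items.jl', 'wb') as f:
        # ensure_ascii=False is picked up by _configure() and ends up on the
        # JSON encoder, so non-ASCII text is written as UTF-8 characters
        # instead of \uXXXX escapes
        exporter = JsonLinesItemExporter(f, ensure_ascii=False)
        exporter.start_exporting()
        exporter.export_item(TestItem(name=u'Test\u2603'))
        exporter.finish_exporting()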
redapple
Jun 7, 2016
Contributor
Sorry, I meant that you'd need to pass the option when the feedexporter initializes exporters.
Or maybe I'm missing some bit?
I'm still new to the exporters code
dracony
Jun 7, 2016
Author
Contributor
Yup, I got that now. I'm using a custom feedexporter, so I could pass the ensure_ascii option myself.
I'll fix it later today =)
redapple
Jun 7, 2016
Contributor
Regarding the feed encoding setting, I believe we could use None as the default feed encoding:
- for JSON, that would mean ensure_ascii=True, i.e. \uXXXX escape sequences
- for XML and CSV, that would mean UTF-8

If the feed encoding is set to UTF-8 explicitly:
- for JSON export, that would mean ensure_ascii=False and UTF-8 encoding
- for XML and CSV, that would not change from the default None setting

For other encodings, data is written with the supplied encoding, with warnings or exceptions for failing item encoding writes.
What do you think?
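A rough sketch of the mapping being proposed (the helper name and return shape are purely illustrative, not code from the PR):

    def json_exporter_behaviour(feed_encoding):
        """Map the proposed feed-encoding setting to JSON exporter behaviour."""
        if feed_encoding is None:
            # backward-compatible default: ASCII-safe \uXXXX escape sequences
            return {'ensure_ascii': True, 'output_encoding': 'utf-8'}
        # explicit encoding (e.g. 'utf-8'): write real characters and
        # encode the output stream with the requested codec
        return {'ensure_ascii': False, 'output_encoding': feed_encoding}

    json_exporter_behaviour(None)     # {'ensure_ascii': True, 'output_encoding': 'utf-8'}
    json_exporter_behaviour('utf-8')  # {'ensure_ascii': False, 'output_encoding': 'utf-8'}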
dracony
Jun 7, 2016
Author
Contributor
Sounds good
And here is the FEED_EXPORT_ENCODING setting =)
a0efb36 to 2a1bf5b
Ah damn it, the CSV tests fail on Py3 because StringIO behaves differently =(
793bd18 to c5c34ee
@redapple Yay, I made it work.
@@ -33,9 +34,9 @@ def _configure(self, options, dont_fail=False):
    If dont_fail is set, it won't raise an exception on unexpected options
    (useful for using with keyword arguments in subclasses constructors)
    """
    self.encoding=options.pop('encoding', None)
robsonpeixoto
Jun 8, 2016
Code style
@robsonpeixoto Thanks, missed those. I amended the commit to pep8-ize it =)
Any other feedback on it? Would really be great to have it merged (I work with quite a lot of cp-1251 encoded stuff).
self.include_headers_line = include_headers_line
file = file if six.PY2 else io.TextIOWrapper(file, line_buffering=True)
self.csv_writer = csv.writer(file, **kwargs)
self.stream = six.StringIO()
redapple
Jun 10, 2016
Contributor
duplicated on line 189 below
dracony
Jun 10, 2016
Author
Contributor
Yup, I'll remove that
eliasdorneles
Jul 4, 2016
Member
The duplication is still there.
def _writerow(self, row):
    self.csv_writer.writerow(row)
    if six.PY2:
        data = self.stream.getvalue()
redapple
Jun 10, 2016
Contributor
why are these lines below needed?
dracony
Jun 10, 2016
Author
Contributor
Python 3 can make use of the TextIOWrapper to write in a particular encoding. To write CSV files in a specified encoding in Python 2 you need to rely on StringIO, as actually shown in the example in the docs: https://docs.python.org/2/library/csv.html
So basically in Python 2 the csv writer will write a UTF-8 string to a StringIO buffer, and then the result will be re-encoded to the specified encoding and written to the file.
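A hedged Python 2 sketch of that pattern, modeled on the UnicodeWriter recipe in the stdlib csv docs (the class and attribute names here are illustrative, not the PR's code):

    # -*- coding: utf-8 -*-
    import csv
    import io

    class EncodedCSVWriter(object):
        """Python 2: buffer csv output as UTF-8, then re-encode to the target encoding."""

        def __init__(self, fileobj, encoding='cp1251'):
            self.queue = io.BytesIO()              # csv module writes UTF-8 bytes here first
            self.writer = csv.writer(self.queue)
            self.stream = fileobj                  # the real (binary) output file
            self.encoding = encoding

        def writerow(self, row):
            # row is assumed to be a list of unicode strings
            self.writer.writerow([s.encode('utf-8') for s in row])
            data = self.queue.getvalue().decode('utf-8')   # buffered UTF-8 -> unicode
            self.stream.write(data.encode(self.encoding))  # re-encode and write out
            self.queue.seek(0)
            self.queue.truncate()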
dracony
Jun 13, 2016
Author
Contributor
Any blockers with this PR? Really want it to get merged, so I don't need to use my own fork.
If you know of any other way to make it work on Py2 and 3, I'd be happy to implement it. Although I think the current one is best.
self.file.write(to_bytes(self.encoder.encode(itemdict) + '\n'))
data = self.encoder.encode(itemdict) + '\n'
if six.PY3:
    data = to_unicode(data)
redapple
Jun 14, 2016
Contributor
why is this needed for Python 3?
I tried with a very simple test and it seems the json encoder already outputs a Py3 'str':
$ python
Python 3.5.1+ (default, Mar 30 2016, 22:46:26)
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> json.encoder.JSONEncoder(ensure_ascii=False).encode({u"é": 3})
'{"é": 3}'
>>> type(json.encoder.JSONEncoder(ensure_ascii=False).encode({u"é": 3}))
<class 'str'>
>>> json.encoder.JSONEncoder(ensure_ascii=True).encode({u"é": 3})
'{"\\u00e9": 3}'
>>> type(json.encoder.JSONEncoder(ensure_ascii=True).encode({u"é": 3}))
<class 'str'>
>>> from scrapy.utils.python import to_unicode
>>> type(to_unicode(json.encoder.JSONEncoder(ensure_ascii=True).encode({u"é": 3})))
<class 'str'>
>>> type(to_unicode(json.encoder.JSONEncoder(ensure_ascii=False).encode({u"é": 3})))
<class 'str'>
am I missing something?
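For reference, the write path under discussion would then reduce to something like this (a sketch without the to_unicode round-trip, assuming to_bytes is given the configured encoding):

    def export_item(self, item):
        itemdict = dict(self._get_serialized_fields(item))
        data = self.encoder.encode(itemdict) + '\n'
        # encode the JSON text once, right before writing to the binary file;
        # to_bytes passes already-encoded bytes through unchanged on Python 2
        self.file.write(to_bytes(data, self.encoding))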
@redapple @robsonpeixoto Ping =)
def _build_row(self, values):
    for s in values:
        try:
            yield to_native_str(s)
            yield to_bytes(s) if six.PY2 else s
eliasdorneles
Jul 4, 2016
Member
hm, why is this needed? Unless I'm missing something, to_native_str() is equivalent.
eliasdorneles
Jul 4, 2016
Member
also, perhaps you want to pass self.encoding here too?
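A sketch of what that could look like, keeping to_native_str() and threading the exporter's encoding through (not necessarily the exact final code):

    def _build_row(self, values):
        for s in values:
            try:
                # to_native_str() accepts an encoding, so the exporter's
                # configured encoding can be applied here as well
                yield to_native_str(s, self.encoding)
            except TypeError:
                # non-string values (numbers, None, ...) pass through untouched
                yield s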
self._writerow(row)

def _writerow(self, row):
eliasdorneles
Jul 4, 2016
Member
Can you clarify why this is needed? It looks unrelated to this PR, what am I missing?
@@ -84,19 +85,22 @@ class JsonLinesItemExporter(BaseItemExporter):
    def __init__(self, file, **kwargs):
        self._configure(kwargs, dont_fail=True)
        self.file = file
        self.encoder = ScrapyJSONEncoder(**kwargs)
        ensure_ascii = not self.encoding
eliasdorneles
Jul 4, 2016
Member
It would be better to do this as:
kwargs.setdefault('ensure_ascii', not self.encoding)
and removing the explicit ensure_ascii from the ScrapyJSONEncoder call.
This way it won't break subclasses that are already passing ensure_ascii in kwargs.
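Concretely, the suggestion would look roughly like this for JsonLinesItemExporter (a sketch; the same shape applies to JsonItemExporter below):

    def __init__(self, file, **kwargs):
        self._configure(kwargs, dont_fail=True)
        self.file = file
        # default follows the configured encoding, but an explicit
        # ensure_ascii passed by a caller or subclass still wins
        kwargs.setdefault('ensure_ascii', not self.encoding)
        self.encoder = ScrapyJSONEncoder(**kwargs)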
class JsonItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs, dont_fail=True)
        self.file = file
        self.encoder = ScrapyJSONEncoder(**kwargs)
        ensure_ascii = not self.encoding
        self.encoder = ScrapyJSONEncoder(ensure_ascii = ensure_ascii, **kwargs)
eliasdorneles
Jul 4, 2016
Member
The same comment about ensure_ascii above applies here.
self.stream = six.StringIO()
self.file = file
self.stream = six.StringIO() if six.PY2 else io.TextIOWrapper(file,
    line_buffering=True, encoding = self.encoding)
eliasdorneles
Jul 4, 2016
Member
why is it enabling line_buffering here? is this really needed?
Hey @dracony -- sorry for the delay in reviewing, I've made some comments above. The main thing that's not clear to me is the changes in the CSV exporter -- is line buffering really needed?
Also, if you could revise the style once more, there is some minor stuff there (try running
@eliasdorneles I gave it one more try and made the CSV writer work without changing the encoding line by line (although in my defence I got that approach from the csv manual page). As for the
@dracony Please see my other comment about passing
@eliasdorneles I changed the ensure_ascii to kwargs as you suggested. Also I found a solution for why
@dracony hey, thanks for digging!
@eliasdorneles It's not the file that isn't being flushed, it's that TextIOBuffer
Yup, you're right. :)
Yay =)
FEED_EXPORT_ENCODING
--------------------

The encoding to be used for the feed. Defaults to UTF-8.
redapple
Jul 12, 2016
Contributor
This is not correct for JSON(lines); the default encoding is \uXXXX escape sequences. A special case indeed, but worth a mention I believe.
dracony
Jul 12, 2016
Author
Contributor
How would I phrase it though? Wouldn't it be confusing to say it's "\uXXXX escape sequences"?
eliasdorneles
Jul 12, 2016
Member
Thanks @redapple, I missed that in the review. You can say "safe numeric encoding (\uXXXX sequences)".
redapple
Jul 12, 2016
Contributor
Not sure. The default value is None, which means the backward-compatible default behavior (current master, before the merge), which is:
- \uXXXX escape sequences for JSON
- UTF-8 for XML and CSV
(if I'm not mistaken)
And users can change to UTF-8 for JSON instead of \uXXXX with FEED_EXPORT_ENCODING = 'utf-8'.
If set, and different from "utf-8", it changes behavior for all exporters.
I don't have good phrasing right now. Maybe @eliasdorneles has ideas?
eliasdorneles
Jul 12, 2016
Member
Concretely, here is my suggestion:
FEED_EXPORT_ENCODING
--------------------
Default: ``None``
The encoding to be used for the feed.
If unset or set to ``None`` (default) it uses UTF-8 for everything except JSON output,
which uses safe numeric encoding (``\uXXXX`` sequences) for historic reasons.
Use ``utf-8`` if you want UTF-8 for JSON too.
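As a usage note, enabling this in a project is a single line in settings.py (a hedged example; leaving the setting unset/None keeps the historic behaviour):

    # settings.py
    FEED_EXPORT_ENCODING = 'utf-8'  # JSON feeds get real UTF-8 characters instead of \uXXXX escapes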
dracony
Jul 12, 2016
Author
Contributor
Let it be so =) I updated the PR with your suggestion.
Use ``utf-8`` if you want UTF-8 for JSON too.

.. setting:: FEED_EXPORT_ENCODING
redapple
Jul 12, 2016
Contributor
Sorry, I had missed this: it should be .. setting:: FEED_EXPORT_FIELDS here and .. setting:: FEED_EXPORT_ENCODING earlier.
Ah I see. Fixed that.
@dracony, thanks a lot!
See #1965