
[MRG+1] Added option to turn off ensure_ascii for JSON exporters #2034

Merged: 1 commit, Jul 12, 2016

Conversation

@dracony (Contributor) commented Jun 6, 2016

See #1965

@redapple redapple changed the title Added option to turn off ensure_ascii for JSON exporters. See #1965 Added option to turn off ensure_ascii for JSON exporters Jun 6, 2016
@dracony dracony force-pushed the dracony:master branch 2 times, most recently from a7cfec6 to ba86319 Jun 6, 2016
@dracony (Contributor, Author) commented Jun 6, 2016

For some reason I just can't get the tests to pass on Py3, so annoying...
And I don't have a local install of it atm, so I guess I'll be torturing Travis for a while. :)

@codecov-io commented Jun 6, 2016

Current coverage is 83.37%

Merging #2034 into master will increase coverage by 0.03%

Powered by Codecov. Last updated by 8a22a74...33a39b3

@dracony (Contributor, Author) commented Jun 6, 2016

Fixed it finally. :)

@dracony (Contributor, Author) commented Jun 7, 2016

Any feedback for it? Would really love to finally have this in my scraper. :)

@redapple (Contributor) commented Jun 7, 2016

@dracony , thanks for this.
How will people use this? Via their own JSON exporter subclass?
If yes, how about introducing a setting for this? It could be simpler.
Also, @mgachhui was suggesting having an encoding parameter, not only default UTF-8
(and applying this to XML exports too)

    def test_unicode_utf8(self):
        self.ie = JsonLinesItemExporter(self.output, ensure_ascii=False)
        i1 = TestItem(name=u'Test\u2603')
        self.assertExportResult(i1, b'{"name": "Test\xe2\x98\x83"}\n')

@redapple (Contributor), Jun 7, 2016:

nitpick: I find it a bit easier to read a test against u'Test\u2603'.encode('utf8')
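For context, the ensure_ascii behavior the test above exercises can be reproduced with the stdlib json module alone (a standalone sketch, not the PR's test code):

```python
import json

snowman = "Test\u2603"
escaped = json.dumps({"name": snowman})                  # ensure_ascii=True is the default
raw = json.dumps({"name": snowman}, ensure_ascii=False)  # keep non-ASCII text as-is

assert escaped == '{"name": "Test\\u2603"}'
assert raw.encode("utf8") == b'{"name": "Test\xe2\x98\x83"}'
```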

        self.first_item = True

    def _configure(self, options, dont_fail=False):
        self.ensure_ascii = options.pop('ensure_ascii', True)

@dracony (Author, Contributor), Jun 7, 2016:

I added a check for ensure_ascii from the options, so you would not have to subclass the exporter; you can just pass the option to it.

A setting would probably be a better option though. OK, I'll add that one too.

As for the encoding parameter: the JSON encoder accepts separate ensure_ascii and encoding parameters. Would you like them to be separate like that, or should ensure_ascii be turned off automatically when an encoding is specified?

@redapple (Contributor), Jun 7, 2016:

Sorry, I meant that you'd need to pass the option when the feed exporter initializes exporters.
Or maybe I'm missing some bit?
I'm still new to the exporters code.

@dracony (Author, Contributor), Jun 7, 2016:

Yup, I got that now. I'm using a custom feed exporter, so I could pass the ensure_ascii option myself.
I'll fix it later today =)

@redapple (Contributor), Jun 7, 2016:

Regarding the feed encoding setting,
I believe we could use None as the default feed encoding:

  • for JSON, that would mean ensure_ascii=True, i.e. \uXXXX escape sequences
  • for XML and CSV, that would mean UTF-8

If the feed encoding is set to UTF-8 explicitly:

  • for JSON export, that would mean ensure_ascii=False and UTF-8 encoding
  • for XML and CSV, that would not change anything from the default None setting

For other encodings, data is written with the supplied encoding, with warnings or exceptions for failing item encoding writes.
What do you think?
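The default resolution being proposed above could be sketched roughly as follows (the function names are illustrative, not from the PR):

```python
# Illustrative sketch (not PR code) of the proposed defaults:
# feed encoding None -> JSON keeps ensure_ascii=True (\uXXXX escapes);
# any explicit encoding -> ensure_ascii=False, text written in that encoding.
def json_ensure_ascii(feed_export_encoding):
    return feed_export_encoding is None

def xml_csv_encoding(feed_export_encoding):
    # XML/CSV fall back to UTF-8 when no feed encoding is set
    return feed_export_encoding or 'utf-8'

print(json_ensure_ascii(None), xml_csv_encoding(None))        # True utf-8
print(json_ensure_ascii('utf-8'), xml_csv_encoding('utf-8'))  # False utf-8
```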

@dracony (Author, Contributor), Jun 7, 2016:

Sounds good

@dracony dracony force-pushed the dracony:master branch from ba86319 to 44b936d Jun 8, 2016
@dracony (Contributor, Author) commented Jun 8, 2016

And here is the FEED_EXPORT_ENCODING setting =)
Took me a while; I so hate all that encoding stuff. :)

@dracony dracony force-pushed the dracony:master branch 4 times, most recently from a0efb36 to 2a1bf5b Jun 8, 2016
@dracony (Contributor, Author) commented Jun 8, 2016

Ah damn it, the CSV tests fail on Py3 because StringIO behaves differently =(
Any idea how to fix this?

@dracony dracony force-pushed the dracony:master branch 2 times, most recently from 793bd18 to c5c34ee Jun 8, 2016
@dracony (Contributor, Author) commented Jun 8, 2016

@redapple Yay, I made it work.
Like it more now?

@@ -33,9 +34,9 @@ def _configure(self, options, dont_fail=False):
         If dont_fail is set, it won't raise an exception on unexpected options
         (useful for using with keyword arguments in subclasses constructors)
         """
+        self.encoding = options.pop('encoding', None)


@dracony dracony force-pushed the dracony:master branch from c5c34ee to a888cb1 Jun 8, 2016
@dracony (Contributor, Author) commented Jun 8, 2016

@robsonpeixoto Thanks, missed those. I amended the commit to pep8-ize it =)

@dracony dracony force-pushed the dracony:master branch from a888cb1 to d4715f4 Jun 8, 2016
@dracony (Contributor, Author) commented Jun 10, 2016

Any other feedback on it? Would really be great to have it merged (I work with quite a lot of cp-1251 encoded stuff)

        self.include_headers_line = include_headers_line
        file = file if six.PY2 else io.TextIOWrapper(file, line_buffering=True)
        self.csv_writer = csv.writer(file, **kwargs)
        self.stream = six.StringIO()

@redapple (Contributor), Jun 10, 2016:

duplicated on line 189 below

@dracony (Author, Contributor), Jun 10, 2016:

Yup, I'll remove that

@eliasdorneles (Member), Jul 4, 2016:

The duplication is still there.

    def _writerow(self, row):
        self.csv_writer.writerow(row)
        if six.PY2:
            data = self.stream.getvalue()

@redapple (Contributor), Jun 10, 2016:

why are these lines below needed?

@dracony (Author, Contributor), Jun 10, 2016:

Python 3 can make use of TextIOWrapper to write in a particular encoding. To write CSV files in a specified encoding in Python 2, you need to rely on StringIO, as shown in the example in the docs: https://docs.python.org/2/library/csv.html

So basically in Python 2 the csv writer will write a UTF-8 string to a StringIO buffer, and then the result will be re-encoded to the specified encoding and written to the file.
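The pattern described above can be approximated in a self-contained (Python 3) sketch; the buffer and file names here are illustrative, not the PR's code:

```python
import csv
import io

# Sketch of the described approach: the csv writer targets an in-memory
# text buffer; each row is then re-encoded and written to the real
# (binary) feed file, and the buffer is reset for the next row.
buffer = io.StringIO()
writer = csv.writer(buffer)
feed_file = io.BytesIO()  # stands in for the exporter's output file

writer.writerow(['Test\u2603'])
feed_file.write(buffer.getvalue().encode('utf-8'))  # re-encode, then write
buffer.seek(0)
buffer.truncate(0)  # reset the buffer for the next row

assert feed_file.getvalue() == b'Test\xe2\x98\x83\r\n'
```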

@dracony (Author, Contributor), Jun 13, 2016:

Any blockers with this PR? Really want it to get merged, so I don't need to use my own fork.

@dracony dracony force-pushed the dracony:master branch from 5a62680 to d00b70f Jun 10, 2016
@dracony (Contributor, Author) commented Jun 10, 2016

If you know of any other way to make it work on Py2 and 3, I'd be happy to implement it. Although I think the current one is best.

-        self.file.write(to_bytes(self.encoder.encode(itemdict) + '\n'))
+        data = self.encoder.encode(itemdict) + '\n'
+        if six.PY3:
+            data = to_unicode(data)

@redapple (Contributor), Jun 14, 2016:

Why is this needed for Python 3?
I tried with a very simple test and it seems the json encoder already outputs Py3 str:

$ python
Python 3.5.1+ (default, Mar 30 2016, 22:46:26) 
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> json.encoder.JSONEncoder(ensure_ascii=False).encode({u"é": 3})
'{"é": 3}'
>>> type(json.encoder.JSONEncoder(ensure_ascii=False).encode({u"é": 3}))
<class 'str'>
>>> json.encoder.JSONEncoder(ensure_ascii=True).encode({u"é": 3})
'{"\\u00e9": 3}'
>>> type(json.encoder.JSONEncoder(ensure_ascii=True).encode({u"é": 3}))
<class 'str'>
>>> from scrapy.utils.python import to_unicode
>>> type(to_unicode(json.encoder.JSONEncoder(ensure_ascii=True).encode({u"é": 3})))
<class 'str'>
>>> type(to_unicode(json.encoder.JSONEncoder(ensure_ascii=False).encode({u"é": 3})))
<class 'str'>

am I missing something?

@dracony (Contributor, Author) commented Jun 20, 2016

@redapple @robsonpeixoto Ping =)
Can this get merged now?


    def _build_row(self, values):
        for s in values:
            try:
-                yield to_native_str(s)
+                yield to_bytes(s) if six.PY2 else s

@eliasdorneles (Member), Jul 4, 2016:

hm, why is this needed?
unless I'm missing something, to_native_str() is equivalent.

@eliasdorneles (Member), Jul 4, 2016:

also, perhaps you want to pass self.encoding here too?


        self._writerow(row)

    def _writerow(self, row):

@eliasdorneles (Member), Jul 4, 2016:

Can you clarify why this is needed?
It looks unrelated to this PR; what am I missing?

@@ -84,19 +85,22 @@ class JsonLinesItemExporter(BaseItemExporter):
     def __init__(self, file, **kwargs):
         self._configure(kwargs, dont_fail=True)
         self.file = file
-        self.encoder = ScrapyJSONEncoder(**kwargs)
+        ensure_ascii = not self.encoding

@eliasdorneles (Member), Jul 4, 2016:

It would be better to do this as:

kwargs.setdefault('ensure_ascii', not self.encoding)

and remove the explicit ensure_ascii from the ScrapyJSONEncoder call.

This way it won't break subclasses that are already passing ensure_ascii in kwargs.
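A minimal sketch of the setdefault pattern being suggested (the helper name is hypothetical, used only to make the behavior testable in isolation):

```python
# Hypothetical helper showing the kwargs.setdefault pattern: a caller
# that explicitly passes ensure_ascii keeps its value; otherwise the
# default is derived from whether an encoding was configured.
def encoder_kwargs(encoding=None, **kwargs):
    kwargs.setdefault('ensure_ascii', not encoding)
    return kwargs

print(encoder_kwargs())                                     # {'ensure_ascii': True}
print(encoder_kwargs(encoding='utf-8'))                     # {'ensure_ascii': False}
print(encoder_kwargs(encoding='utf-8', ensure_ascii=True))  # {'ensure_ascii': True}
```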



class JsonItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs, dont_fail=True)
        self.file = file
-        self.encoder = ScrapyJSONEncoder(**kwargs)
+        ensure_ascii = not self.encoding
+        self.encoder = ScrapyJSONEncoder(ensure_ascii=ensure_ascii, **kwargs)

@eliasdorneles (Member), Jul 4, 2016:

The same comment about ensure_ascii above applies here.

-        self.stream = six.StringIO()
         self.file = file
+        self.stream = six.StringIO() if six.PY2 else io.TextIOWrapper(
+            file, line_buffering=True, encoding=self.encoding)

@eliasdorneles (Member), Jul 4, 2016:

why is it enabling line_buffering here? is this really needed?

@eliasdorneles (Member) commented Jul 4, 2016

Hey @dracony -- sorry for the delay in reviewing, I've made some comments above.

The main thing that's not clear to me is the changes in the CSV exporter: is line buffering really needed?
It would be better to have it in a second PR (fewer decisions to be made to merge this one :) ); it might deserve its own setting, since flushing at every line can be slower.

@eliasdorneles (Member) commented Jul 4, 2016

Also, if you could revise the style once more, there is some minor stuff there (try running flake8 scrapy/exporters.py).
Not a big deal, but it makes it nicer for the reviewer's brain to parse. :)
Thanks!

@dracony dracony force-pushed the dracony:master branch from c41921e to fe9be67 Jul 5, 2016
@dracony (Contributor, Author) commented Jul 5, 2016

@eliasdorneles
The change #2034 (diff) was indeed irrelevant and changes nothing. It was a side effect of me fighting with the CSV writer.

I gave it one more try and made the CSV writer work without re-encoding line by line (although in my defence, I got that approach from the csv manual page).

As for line_buffering being on, it does seem to be required, since turning it off broke the test entirely. I would be happy to skip the TextIOWrapper altogether, but the only way to do that would be reopening the file with the encoding parameter specified. And considering it's the file object that is passed to the exporter, closing and reopening the file seems more like a hack.

@eliasdorneles (Member) commented Jul 6, 2016

@dracony
Using TextIOWrapper is totally fine; the current PY3 code is already using it.
I still think it's weird that line_buffering is required for PY3; it looks like we'll have different behavior in PY2. Which test failed?

Please see my other comment about passing the ensure_ascii parameter: it's best to use the kwargs.setdefault approach to maintain compatibility with subclasses.
Aside from that little nitpicking, the PR looks good to me. Nice work! 👍

@dracony dracony force-pushed the dracony:master branch from fe9be67 to df4190b Jul 11, 2016
@dracony (Contributor, Author) commented Jul 11, 2016

@eliasdorneles I changed ensure_ascii to use kwargs as you suggested. I also found out why disabling line_buffering was failing tests: the solution was to also enable write_through, otherwise the buffer was not flushed to disk.
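The buffering behavior described here can be demonstrated in isolation with the stdlib (the in-memory streams stand in for the real feed file):

```python
import io

# Without write_through (or line_buffering), data written through a
# TextIOWrapper can sit in its internal buffer instead of reaching the
# underlying file until flush() is called.
buffered = io.BytesIO()
wrapper = io.TextIOWrapper(buffered, encoding='utf-8')
wrapper.write('row\n')
assert buffered.getvalue() == b''  # still sitting in the wrapper's buffer

flushed = io.BytesIO()
wrapper2 = io.TextIOWrapper(flushed, encoding='utf-8', write_through=True)
wrapper2.write('row\n')
assert flushed.getvalue() == b'row\n'  # reached the underlying stream
```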

@eliasdorneles (Member) commented Jul 11, 2016

@dracony hey, thanks for digging! 👍
Hm, that kinda makes me suspicious of the tests, like they may be trying to read without a guarantee that the file is flushed.
I don't think we should need to force a flush (we don't seem to need it for the PY2 code)... I'll have a look to confirm, thank you!

@dracony (Contributor, Author) commented Jul 11, 2016

@eliasdorneles It's not the file that isn't being flushed, it's the TextIOWrapper.

@eliasdorneles (Member) commented Jul 11, 2016

Yup, you're right. :)
Alright, this looks good now!
Thank you, @dracony !

@eliasdorneles eliasdorneles changed the title Added option to turn off ensure_ascii for JSON exporters [MRG+1] Added option to turn off ensure_ascii for JSON exporters Jul 11, 2016
@dracony (Contributor, Author) commented Jul 11, 2016

Yay =)


@redapple redapple added this to the v1.2 milestone Jul 12, 2016
FEED_EXPORT_ENCODING
--------------------

The encoding to be used for the feed. Defaults to UTF-8.

@redapple (Contributor), Jul 12, 2016:

This is not correct for JSON(lines): the default is \uXXXX escape sequences, not UTF-8.
A special case indeed, but worth a mention I believe.

@dracony (Author, Contributor), Jul 12, 2016:

How would I phrase it though? Wouldn't it be confusing to say it's "\uXXXX escape sequences"?


@eliasdorneles (Member), Jul 12, 2016:

Thanks @redapple, missed that on the review.

You can say "safe numeric encoding (\uXXXX sequences)".

@redapple (Contributor), Jul 12, 2016:

Not sure. The default value is None, which means the backward-compatible default behavior (current master, before the merge), which is:

  • \uXXXX escape sequences for JSON
  • UTF-8 for XML and CSV (if I'm not mistaken)

And users can change to UTF-8 for JSON instead of \uXXXX with FEED_EXPORT_ENCODING = 'utf-8'

If set, and different from "utf-8", it changes behavior for all exporters.

I don't have good phrasing right now. Maybe @eliasdorneles has ideas?

@eliasdorneles (Member), Jul 12, 2016:

Concretely, here is my suggestion:

FEED_EXPORT_ENCODING
--------------------

Default: ``None``

The encoding to be used for the feed.

If unset or set to ``None`` (default) it uses UTF-8 for everything except JSON output,
which uses safe numeric encoding (``\uXXXX`` sequences) for historic reasons.

Use ``utf-8`` if you want UTF-8 for JSON too.
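For readers landing here later, opting in would be a one-line change in a project's settings.py (this assumes the setting lands as proposed in this PR):

```python
# settings.py: opt in to raw UTF-8 in JSON feeds instead of \uXXXX escapes
FEED_EXPORT_ENCODING = 'utf-8'
```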

@dracony (Author, Contributor), Jul 12, 2016:

Let it be so =) I updated the PR with your suggestion


@dracony dracony force-pushed the dracony:master branch from df4190b to 40c78e7 Jul 12, 2016

Use ``utf-8`` if you want UTF-8 for JSON too.

.. setting:: FEED_EXPORT_ENCODING

@redapple (Contributor), Jul 12, 2016:

Sorry, I had missed this:
it should be .. setting:: FEED_EXPORT_FIELDS here and .. setting:: FEED_EXPORT_ENCODING earlier

@dracony dracony force-pushed the dracony:master branch from 40c78e7 to 33a39b3 Jul 12, 2016
@dracony
Copy link
Contributor Author

@dracony dracony commented Jul 12, 2016

Ah I see. Fixed that

@redapple redapple merged commit c3109da into scrapy:master Jul 12, 2016
2 checks passed:
codecov/patch: 100% of diff hit (target 100%)
continuous-integration/travis-ci/pr: The Travis CI build passed
@redapple (Contributor) commented Jul 12, 2016

@dracony , thanks a lot!
It took a while to review it properly, sorry about that.

5 participants