
[MRG+1] Added option to turn off ensure_ascii for JSON exporters #2034

Merged
merged 1 commit into scrapy:master on Jul 12, 2016

Conversation

dracony
Contributor

@dracony dracony commented Jun 6, 2016

See #1965
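For context, a minimal sketch of what the option changes, using the stdlib json module directly (not Scrapy's exporter API):

```python
import json

item = {"name": "Test\u2603"}  # contains U+2603 SNOWMAN

# default behavior: non-ASCII characters become \uXXXX escape sequences
print(json.dumps(item))                      # {"name": "Test\u2603"}

# with ensure_ascii=False the raw character is kept, so the output can
# be written as UTF-8 (or another encoding) afterwards
print(json.dumps(item, ensure_ascii=False))  # {"name": "Test☃"}
```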

@redapple redapple changed the title Added option to turn off ensure_ascii for JSON exporters. See #1965 Added option to turn off ensure_ascii for JSON exporters Jun 6, 2016
@dracony dracony force-pushed the master branch 2 times, most recently from a7cfec6 to ba86319 Compare June 6, 2016 14:18
@dracony
Contributor Author

dracony commented Jun 6, 2016

For some reason I just can't get the tests to pass on Py3, so annoying...
And I don't have a local install of Py3 at the moment, so I guess I'll be torturing Travis for a while )

@codecov-io

codecov-io commented Jun 6, 2016

Current coverage is 83.37%

Merging #2034 into master will increase coverage by 0.03%

Powered by Codecov. Last updated by 8a22a74...33a39b3

@dracony
Contributor Author

dracony commented Jun 6, 2016

Finally fixed it )

@dracony
Contributor Author

dracony commented Jun 7, 2016

Any feedback on it? I'd really love to finally have this in my scraper )

@redapple
Contributor

redapple commented Jun 7, 2016

@dracony, thanks for this.
How will people use this? Via their own JSON exporter subclass?
If so, how about introducing a setting for it instead? It could be simpler.
Also, @mgachhui was suggesting an encoding parameter, not only the default UTF-8
(and applying this to XML exports too).

def test_unicode_utf8(self):
    self.ie = JsonLinesItemExporter(self.output, ensure_ascii=False)
    i1 = TestItem(name=u'Test\u2603')
    self.assertExportResult(i1, b'{"name": "Test\xe2\x98\x83"}\n')
Contributor

nitpick: I find it a bit easier to read a test against u'Test\u2603'.encode('utf8')
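As a tiny illustration of the reviewer's suggestion (a readability preference only, not a behavior change), the expected bytes can be derived from the same unicode literal:

```python
# derive the expected bytes from the unicode literal instead of
# hand-writing the UTF-8 byte sequence
expected = u'Test\u2603'.encode('utf8')
assert expected == b'Test\xe2\x98\x83'
```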

@dracony
Contributor Author

dracony commented Jun 8, 2016

And here is the FEED_EXPORT_ENCODING setting =)
It took me a while; I really hate all that encoding stuff )

@dracony dracony force-pushed the master branch 4 times, most recently from a0efb36 to 2a1bf5b Compare June 8, 2016 02:58
@dracony
Contributor Author

dracony commented Jun 8, 2016

Ah damn it, the CSV tests fail on Py3 because StringIO behaves differently =(
Any idea how to fix this?

@dracony dracony force-pushed the master branch 2 times, most recently from 793bd18 to c5c34ee Compare June 8, 2016 15:24
@dracony
Contributor Author

dracony commented Jun 8, 2016

@redapple Yay, I made it work.
Like it more now?

@@ -33,9 +34,9 @@ def _configure(self, options, dont_fail=False):
If dont_fail is set, it won't raise an exception on unexpected options
(useful for using with keyword arguments in subclasses constructors)
"""
self.encoding=options.pop('encoding', None)


Code style

@dracony
Contributor Author

dracony commented Jun 8, 2016

@robsonpeixoto Thanks, missed those. I amended the commit to pep8-ize it =)

@dracony
Contributor Author

dracony commented Jun 10, 2016

Any other feedback on it? It would really be great to have it merged (I work with quite a lot of cp1251-encoded data).

self.include_headers_line = include_headers_line
file = file if six.PY2 else io.TextIOWrapper(file, line_buffering=True)
self.csv_writer = csv.writer(file, **kwargs)
self.stream = six.StringIO()
Contributor

duplicated on line 189 below

Contributor Author

Yup, I'll remove that

Member

The duplication is still there.

@dracony
Contributor Author

dracony commented Jun 10, 2016

If you know of any other way to make it work on both py2 and py3, I'd be happy to implement it, although I think the current approach is best.

self.file.write(to_bytes(self.encoder.encode(itemdict) + '\n'))
data = self.encoder.encode(itemdict) + '\n'
if six.PY3:
data = to_unicode(data)
Contributor

Why is this needed for Python 3?
I tried with a very simple test and it seems the json encoder already outputs Py3 'str':

$ python
Python 3.5.1+ (default, Mar 30 2016, 22:46:26) 
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> json.encoder.JSONEncoder(ensure_ascii=False).encode({u"é": 3})
'{"é": 3}'
>>> type(json.encoder.JSONEncoder(ensure_ascii=False).encode({u"é": 3}))
<class 'str'>
>>> json.encoder.JSONEncoder(ensure_ascii=True).encode({u"é": 3})
'{"\\u00e9": 3}'
>>> type(json.encoder.JSONEncoder(ensure_ascii=True).encode({u"é": 3}))
<class 'str'>
>>> from scrapy.utils.python import to_unicode
>>> type(to_unicode(json.encoder.JSONEncoder(ensure_ascii=True).encode({u"é": 3})))
<class 'str'>
>>> type(to_unicode(json.encoder.JSONEncoder(ensure_ascii=False).encode({u"é": 3})))
<class 'str'>

am I missing something?

@dracony
Contributor Author

dracony commented Jun 15, 2016

@redapple True, it seems I overcomplicated it a bit when I was trying to make the test pass. I removed those lines and the tests still pass 👍

@dracony
Contributor Author

dracony commented Jun 20, 2016

@redapple @robsonpeixoto Ping =)
Can this get merged now?


def _build_row(self, values):
for s in values:
try:
yield to_native_str(s)
yield to_bytes(s) if six.PY2 else s
Member

Hm, why is this needed?
Unless I'm missing something, to_native_str() is equivalent.

Member

also, perhaps you want to pass self.encoding here too?

@eliasdorneles
Member

Hey @dracony -- sorry for the delay in reviewing, I've made some comments above.

The main thing that's not clear to me is the changes in the CSV exporter: is line buffering really needed?
It would be better to have that in a second PR (fewer decisions to make to merge this one :) ), and it might deserve its own setting, since flushing at every line can be slower.

@eliasdorneles
Member

Also, if you could revise the style once more, there is some minor stuff there (try running flake8 scrapy/exporters.py).
Not a big deal, but it makes the code nicer for the reviewer's brain to parse. :)
Thanks!

@dracony
Contributor Author

dracony commented Jul 5, 2016

@eliasdorneles
The change #2034 (diff) was indeed irrelevant and changes nothing. It was a side effect of me fighting with the CSV writer.

I gave it one more try and made the CSV writer work without changing the encoding line by line (although in my defence, I got that approach from the csv manual page).

As for line_buffering being on, it does seem to be required, since turning it off broke the test entirely. I would be happy to skip the TextIOWrapper altogether, but the only way to do that would be to reopen the file with the encoding parameter specified. And considering it's the file object that is passed to the exporter, closing and reopening the file seems more like a hack.

@eliasdorneles
Member

@dracony
Using TextIOWrapper is totally fine; the current PY3 code is already using it.
I still find it weird that line_buffering is required for PY3; it looks like we'll have different behavior in PY2. Which test failed?

Please see my other comment about passing ensure_ascii parameter -- it's best to use the kwargs.setdefault approach to maintain compatibility with subclasses.
Aside that little nitpicking, the PR looks good to me, nice work! 👍
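A self-contained sketch of the kwargs.setdefault approach mentioned above; the class here is hypothetical and only illustrates the pattern, it is not Scrapy's actual exporter:

```python
import io
import json

class JsonLinesExporterSketch:
    """Hypothetical exporter illustrating the kwargs.setdefault pattern."""

    def __init__(self, file, encoding=None, **kwargs):
        self.file = file
        self.encoding = encoding
        # only provide a default: subclasses (or callers) that pass
        # ensure_ascii explicitly are not overridden
        kwargs.setdefault('ensure_ascii', not encoding)
        self.encoder = json.JSONEncoder(**kwargs)

    def export_item(self, itemdict):
        data = self.encoder.encode(itemdict) + '\n'
        self.file.write(data.encode(self.encoding or 'utf-8'))

buf = io.BytesIO()
JsonLinesExporterSketch(buf, encoding='utf-8').export_item({'name': 'Test\u2603'})
assert buf.getvalue() == b'{"name": "Test\xe2\x98\x83"}\n'
```

With no encoding configured, the default falls back to ensure_ascii=True, which preserves the backward-compatible \uXXXX-escaped output.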

@dracony
Contributor Author

dracony commented Jul 11, 2016

@eliasdorneles I changed ensure_ascii to kwargs as you suggested. I also found out why line_buffering=False was failing the tests: the fix was to also enable write_through, otherwise the buffer was not being flushed to disk.
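A small Python 3 sketch of the effect described (stdlib only; the in-memory streams here are illustrative, not Scrapy's feed storage):

```python
import csv
import io

raw = io.BytesIO()
# wrap the binary stream so the csv writer can emit text in the target
# encoding; write_through=True pushes every write straight through to
# the underlying buffer instead of keeping it in TextIOWrapper's buffer
wrapper = io.TextIOWrapper(raw, encoding='cp1251', write_through=True)
csv.writer(wrapper).writerow(['имя', 'значение'])
assert raw.getvalue() == 'имя,значение\r\n'.encode('cp1251')
```

Without write_through (or line_buffering), the row could sit in the wrapper's internal buffer until it is flushed, which is consistent with the test failures described above.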

@eliasdorneles
Member

@dracony hey, thanks for digging! 👍
Hm, that kinda makes me suspicious of the tests, as if they may be trying to read without any guarantee that the file is flushed.
I don't think we should need to force a flush (we don't seem to need it for the PY2 code)... I'll have a look to confirm, thank you!

@dracony
Contributor Author

dracony commented Jul 11, 2016

@eliasdorneles It's not the file that isn't being flushed, it's the TextIOWrapper's buffer.

@eliasdorneles
Member

Yup, you're right. :)
Alright, this looks good now!
Thank you, @dracony !

@eliasdorneles eliasdorneles changed the title Added option to turn off ensure_ascii for JSON exporters [MRG+1] Added option to turn off ensure_ascii for JSON exporters Jul 11, 2016
@dracony
Contributor Author

dracony commented Jul 11, 2016

Yay =)


@redapple redapple added this to the v1.2 milestone Jul 12, 2016
FEED_EXPORT_ENCODING
--------------------

The encoding to be used for the feed. Defaults to UTF-8.
Contributor

This is not correct for JSON (lines): the default there is \uXXXX escape sequences, not UTF-8.
A special case indeed, but worth a mention, I believe.

Contributor Author

How would I phrase it, though? Wouldn't it be confusing to say it's "\uXXXX escape sequences"?


Member

Thanks @redapple, missed that on the review.

You can say "safe numeric encoding (\uXXXX sequences)".

Contributor

Not sure. The default value is None, which means the backward-compatible default behavior (current master, before the merge), which is:

  • \uXXXX escape sequences for JSON
  • UTF-8 for XML and CSV
    (if I'm not mistaken)

And users can change to UTF-8 for JSON instead of \uXXXX with FEED_EXPORT_ENCODING = 'utf-8'

If set, and different from "utf-8", it changes behavior for all exporters.

I don't have good phrasing right now. Maybe @eliasdorneles has ideas?

Member

Concretely, here is my suggestion:

FEED_EXPORT_ENCODING
--------------------

Default: ``None``

The encoding to be used for the feed.

If unset or set to ``None`` (default) it uses UTF-8 for everything except JSON output,
which uses safe numeric encoding (``\uXXXX`` sequences) for historic reasons.

Use ``utf-8`` if you want UTF-8 for JSON too.
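For reference, opting in would then be a one-line project setting (assuming the setting ships as described in this PR):

```python
# settings.py: opt in to raw UTF-8 output for JSON feeds too
FEED_EXPORT_ENCODING = 'utf-8'
```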

Contributor Author

Let it be so =) I updated the PR with your suggestion



Use ``utf-8`` if you want UTF-8 for JSON too.

.. setting:: FEED_EXPORT_ENCODING
Contributor

Sorry, I had missed this:
it should be .. setting:: FEED_EXPORT_FIELDS here and .. setting:: FEED_EXPORT_ENCODING earlier

@dracony
Contributor Author

dracony commented Jul 12, 2016

Ah, I see. Fixed that.

@redapple redapple merged commit c3109da into scrapy:master Jul 12, 2016
@redapple
Contributor

redapple commented Jul 12, 2016

@dracony , thanks a lot!
It took a while to review it properly, sorry about that.
