Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Feed exports: beautify JSON and XML #2456

Merged
merged 6 commits into from May 12, 2017

Conversation

@elacuesta
Copy link
Member

@elacuesta elacuesta commented Dec 19, 2016

Hello folks, these are my two cents to fix #1327.
I look forward to reading your comments 馃槃

@codecov-io
Copy link

@codecov-io codecov-io commented Dec 19, 2016

Codecov Report

Merging #2456 into master will increase coverage by 0.08%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2456      +/-   ##
==========================================
+ Coverage   84.66%   84.75%   +0.08%     
==========================================
  Files         162      162              
  Lines        9122     9155      +33     
  Branches     1353     1358       +5     
==========================================
+ Hits         7723     7759      +36     
- Misses       1141     1145       +4     
+ Partials      258      251       -7
Impacted Files Coverage 螖
scrapy/settings/default_settings.py 98.64% <100%> (酶) 猬嗭笍
scrapy/exporters.py 100% <100%> (酶) 猬嗭笍
scrapy/extensions/feedexport.py 72.95% <100%> (+0.42%) 猬嗭笍
scrapy/commands/parse.py 21.15% <0%> (酶) 猬嗭笍
scrapy/shell.py 68.61% <0%> (酶) 猬嗭笍
scrapy/pipelines/files.py 71.18% <0%> (酶) 猬嗭笍
scrapy/utils/misc.py 95% <0%> (酶) 猬嗭笍
scrapy/utils/datatypes.py 60.21% <0%> (酶) 猬嗭笍
scrapy/core/downloader/handlers/http11.py 92.12% <0%> (+1.28%) 猬嗭笍

Continue to review full report at Codecov.

Legend - Click here to learn more
螖 = absolute <relative> (impact), 酶 = not affected, ? = missing data
Powered by Codecov. Last update 7fc11c1...3a0a86e. Read the comment docs.

@elacuesta elacuesta force-pushed the elacuesta:feed_export_beautify branch from a9bc763 to 67c34a3 Jan 2, 2017
@redapple
Copy link
Contributor

@redapple redapple commented Jan 6, 2017

I like the feature and implementation.
XML out definitely gets easier to read.
However, I'm not sure about having prettified-JSON as default. On the other hand, everyone uses jq these days right?
And it would be weird to prettify XML by default, and not JSON.
I'll have to sleep on it.

@redapple
Copy link
Contributor

@redapple redapple commented Jan 6, 2017

@elacuesta , I believe it also needs tests for non-default (None, 2, etc.) widths.

@elacuesta elacuesta force-pushed the elacuesta:feed_export_beautify branch 2 times, most recently from e77de90 to 40d191e Jan 6, 2017
@elacuesta
Copy link
Member Author

@elacuesta elacuesta commented Jan 6, 2017

Hey @redapple, thanks for your comments. I added a few tests with distinct width values.
Regarding the default, I set it to 4 following @kmike's comment and my own desire, but I think it would be just fine to leave it optional.

@elacuesta elacuesta force-pushed the elacuesta:feed_export_beautify branch from 40d191e to 53d11af Jan 6, 2017
@elacuesta elacuesta force-pushed the elacuesta:feed_export_beautify branch from 53d11af to 8b07582 Feb 8, 2017
@elacuesta
Copy link
Member Author

@elacuesta elacuesta commented Feb 8, 2017

Alright, I set FEED_EXPORT_INDENT_WIDTH=None to maintain backwards compatibility.

Ping @redapple, @kmike: I'm sorry to bother you guys, but do you have any further comments on this? Thanks!

@elacuesta elacuesta force-pushed the elacuesta:feed_export_beautify branch from 8b07582 to bce5557 Feb 8, 2017
@redapple
Copy link
Contributor

@redapple redapple commented Feb 8, 2017

Looks good to me @elacuesta

@redapple redapple changed the title Feed exports: beautify JSON and XML [MRG+1] Feed exports: beautify JSON and XML Feb 8, 2017
@redapple redapple added this to the v1.4 milestone Feb 8, 2017
.. attribute:: indent_width

Amount of spaces used to indent the output on each level.
Defaults to ``None``, which disables indentation.

This comment has been minimized.

@kmike

kmike Feb 22, 2017
Member

This argument controls newlines, not only indent spaces. I wonder if we should follow the same approach as stdlib json:

If indent is a non-negative integer, then JSON array elements and object members will be pretty-printed with that indent level. An indent level of 0, or negative, will only insert newlines. None (the default) selects the most compact representation.

This comment has been minimized.

@elacuesta

elacuesta Feb 23, 2017
Author Member

Updated, now the code follows that behaviour for both JSON and XML 馃槃

@@ -128,33 +129,52 @@ def __init__(self, file, **kwargs):
self.encoding = 'utf-8'
self.xg = XMLGenerator(file, encoding=self.encoding)

def _beautify_newline(self):
if self.indent_width:

This comment has been minimized.

@kmike

kmike Feb 22, 2017
Member

It seems the meaning on indent_width is different for json and xml: zero value means no newlines for XML, but for json new lines are still inserted, i.e. None and 0 are different for json.

This comment has been minimized.

@elacuesta

elacuesta Feb 23, 2017
Author Member

Actually, this current implementation is consistent for JSON and XML in the sense that None==0, because of this line: https://github.com/scrapy/scrapy/pull/2456/files#diff-b40c5755a5dda2b02aaac6bf4c8e2ff8R175.
However, I think you have a point and it would be even better to enforce the difference between None and 0. I'm working on it, I'll update this PR shortly 馃憤

This comment has been minimized.

@elacuesta

elacuesta Feb 23, 2017
Author Member

The link is not pointing to the previous commit 馃槙
Anyway, the line was self.indent = settings.getint('FEED_EXPORT_INDENT') or None, so self.indent ended up being None if settings['FEED_EXPORT_INDENT'] was 0

@@ -99,7 +100,7 @@ def __init__(self, file, **kwargs):
self._configure(kwargs, dont_fail=True)
self.file = file
kwargs.setdefault('ensure_ascii', not self.encoding)
self.encoder = ScrapyJSONEncoder(**kwargs)
self.encoder = ScrapyJSONEncoder(indent=self.indent_width, **kwargs)

This comment has been minimized.

@kmike

kmike Feb 22, 2017
Member

This looks backwards incompatible for people who passed indent argument for JsonItemExporter - I think it'll now raise a TypeError exception got multiple values for keyword argument 'indent'. For backwards compatibility it'd be great to handle this case; indent in kwargs should take precedence over indent_width.

This comment has been minimized.

@elacuesta

elacuesta Feb 22, 2017
Author Member

Nice! I think something like the following would work, I'll update it shortly.

indent = kwargs.pop('indent', self.indent_width)
self.encoder = ScrapyJSONEncoder(indent=indent, **kwargs)

This comment has been minimized.

@kmike

kmike Feb 22, 2017
Member

you can also handle it the same way as ensure_ascii

This comment has been minimized.

@elacuesta

elacuesta Feb 22, 2017
Author Member

That works too. It was so close and I didn't see it! 馃槢

@elacuesta elacuesta force-pushed the elacuesta:feed_export_beautify branch 2 times, most recently from bd16ade to bc33c39 Feb 23, 2017
"bar"
]
}
]""",

This comment has been minimized.

@kmike

kmike Feb 27, 2017
Member

hm, is this correct that here there is an empty line in the beginning on the file instead of a newline at the end, and with indent=0 there are new lines both in the beginning and in the end? Wouldn't it be better not to have an empty line in the beginning and always have a new line in the end (unless indent is None)?

This comment has been minimized.

@elacuesta

elacuesta Mar 1, 2017
Author Member

That newline is there only to make the tests more pretty, it does not end up in the final expected result (note the strip call in https://github.com/scrapy/scrapy/pull/2456/files#diff-0f004a6e06393cc5e29012b83956eed2R632).
Up until now (before your comment), the square brackets in the JSON export had their own line (even with indent=0), this PR didn't modify that. But now that you mentioned it, I amended the latest commit to make it fully compliant with the difference between indent=None and indent=<integer>. The only newline I don't know how to remove when indent=None is the one after the XML declaration tag, but I don't think that's a problem.

@elacuesta elacuesta force-pushed the elacuesta:feed_export_beautify branch from bc33c39 to c674966 Mar 1, 2017
@redapple
Copy link
Contributor

@redapple redapple commented Mar 6, 2017

Looking at the new tests, I think this change reverts #1950 .

@elacuesta
Copy link
Member Author

@elacuesta elacuesta commented Mar 6, 2017

You're right Paul. I personally like the square brackets having their own line even if indent=None, I made this latest change as per Mikhail's suggestion of following stdlib JSONEncoder's behaviour.
I think diverting from the standard in some very particular cases (like this one, or the newline after the XML declaration tag) is not that big of a deal, I can make the brackets have their own line again if you guys agree. Thanks for your time 馃榾

@kmike
Copy link
Member

@kmike kmike commented Mar 6, 2017

Previous way of exporting to JSON kind of makes sense; it seems it becomes unsupported after this PR: no indentation inside items, but each item is on a new line. It is not only a question about brackets. It can be seen as a backwards-incompatible change: wc -l will stop working as a way to count exported items in json output.

@kmike
Copy link
Member

@kmike kmike commented Mar 6, 2017

What do you think about making indent option work only inside individual items in case of json, not for the whole output? That'd be a bit weird though..

@elacuesta
Copy link
Member Author

@elacuesta elacuesta commented Mar 6, 2017

I think we all agree that the current JSON output (one item per line) is just fine. However it's not consistent with the current XML output (all items in one line). What would you say if I modified this PR to make both outputs consistent in the following manner:

  • indent = None: "most compact representation", i.e. all items in the same line (same as the current behaviour for XML)
  • indent <= 0: each item on it's own line but no "internal" indent, square brackets on their own line for JSON (same as the current behaviour for JSON)
  • indent > 0: indent with the specified number

That would be a little deviation from the "standard" in the sense that indent=0 would not imply newlines everywhere, but I believe it would be worth in this case. Thoughts?

@kmike
Copy link
Member

@kmike kmike commented Mar 6, 2017

It sounds fine to me. Alternatively, we can support string constants for indent:

  • "compact" (all items in the same line),
  • "lines" (the way scrapy currently works)
  • numeric indent >= 0 (the same as in this PR)

Though 'indent' argument name could be confusing with these value names

@kmike
Copy link
Member

@kmike kmike commented Mar 9, 2017

I think the problem is that there are in fact two different settings, cramped into one setting:

  1. how to format the result;
  2. how many spaces to use for indentation.

Strings + integers is a way to introduce a single option instead of two options: string values activate (1), while int value activates (2). With two options there'd be 3 string values for the first option, and an int value would be active only when mode is "indent items". So that's where weirdness comes from - instead of explicit 3 possible string values + a separate option for indentation there is a single option with 2 string values which also accepts int.

For me negative and None values instead of strings feels like we're using negative and None values to choose (1) mode; it doesn't seem right - first, we won't be able to add other modes (it is not extendable), and second, why use magic number constants instead of readable string constants?

For me strings + int sounds ok, but maybe two explicit options is better.

@redapple
Copy link
Contributor

@redapple redapple commented Mar 14, 2017

I've re-read the discussion this morning.
I now wonder if the compact mode for XML (current behavior, with all items on the same line) makes much sense for users. I do see the point of "lines" output though.
For XML output, I think having the "lines" mode by default (like for JSON) would be fine (but it's backward incompatible).
With that assumption, i.e. that all items on same line not being useful in practice, I would go for a setting saying "lines" (default) or "prettified/beautified" (e.g. FEED_EXPORT_PRETTY=True/False) and an indent int as another setting (FEED_EXPORT_INDENT).

@kmike
Copy link
Member

@kmike kmike commented Mar 16, 2017

The difference between XML and JSON is that new lines are allowed inside XML tags, but in JSON keys or values, where new lines must be escaped. So you can guarantee "one item per line" for JSON, but not for XML. Another difference is that in XML whitespaces and new lines between tags are significant - you can access them in a parser, they affect parse result. In JSON indentation and new lines don't matter, parse result is the same. Just saying :) I'm not an user of XML exports myself.

@redapple
Copy link
Contributor

@redapple redapple commented Mar 16, 2017

new lines are allowed inside XML tags

Ok. This is already true with the current "all items on one line" XML (which is not really the case then).

Another difference is that in XML whitespaces and new lines between tags are significant

Ok. But I'm not sure it is a big problem for Scrapy's item exports and their consumers. I don't expect users/parsers of items.xml files to care about or choke on newlines between <item></item> tags. Again, this is debatable as with any format breaking change. Thoughts?

@redapple
Copy link
Contributor

@redapple redapple commented Apr 12, 2017

@kmike @elacuesta ,
what do you think of what I commented earlier?
The gist being:

  • change default XML output from "no newlines between <items> or <item> elements" to "one newline after <items>, and one newline between consecutive <item>: that way, XML output is similar to current default JSON output with opening and closing brackets on their own line, and each JSON item on its own line too ; that would be the new "compact" default
  • having a "set pretty items on" setting, that would be OFF by default (and maybe ON in the default project template)
  • having a "indent value when prettifying" setting to accomodate user's taste for pretty output
@kmike
Copy link
Member

@kmike kmike commented Apr 12, 2017

Sounds good to me @redapple!

@redapple
Copy link
Contributor

@redapple redapple commented Apr 13, 2017

Alright, I gave it a shot: elacuesta#2
Looking forward to your feedback @elacuesta and @kmike .

@elacuesta
Copy link
Member Author

@elacuesta elacuesta commented Apr 14, 2017

Hello fellows, thank you for revisiting this PR, I'm sorry I didn't have the time to comment before.

I thought the whole point of separating the beautifying into two settings was to allow for multiple modes of indentation, beyond on/off. If we are not doing that anymore (it seems to me like on/off it's all we need) I'm not sure why we can't stick with the original implementation (following the standard json behavior).

Don't get me wrong, I really like Paul's approach. I think it's definitely the way to go in the sense that it easily allows the introduction of new modes, I just think that we might not need those modes and the code could be even simpler :-D

@redapple
Copy link
Contributor

@redapple redapple commented Apr 14, 2017

@elacuesta ,
now that I check your earlier commits, and after sleeping on it a bit, it does indeed look like using indent>0 for prettifying is sufficient. And simpler is better here: no real need for 2 settings. @kmike's reference to "indent" meaning for stdlib json does make a lot of sense:

If indent is a non-negative integer, then JSON array elements and object members will be pretty-printed with that indent level. An indent level of 0, or negative, will only insert newlines. None (the default) selects the most compact representation.

But at the same time, I think this PR should still preserve current formatting of JSON with opening and trailing square brackets on their own lines and each item on their own line too.
For XML, I'll stick to my comment earlier that I would prefer to change the default XML serialization to have individual <item> elements starting on a new line.

I think I can summarize my updated view as:

  • one setting is sufficient
  • opening/closing brackets for JSON (resp. XML declaration and root <items> element and closing tag for XML output) should still be on their own lines: this means a change for XML output vs. current behavior;
  • serialized items always start on a new line: this is a change for XML output vs. current implementation;
  • this new "indent" setting affect only individual items and has the same meaning as stdlib json ;
  • by default, items are serialized in the current compact form (yet, each starting on a new line, as stated above):
    this makes the default serialization for JSON and XML human digestible in a text editor (you can roughly count the items, except when there are newlines in XML data): this is a change for XML output vs. current implementation;

How does that sound?
Sorry for the noise and changing of mind.
And thanks @elacuesta for being persistent!

@elacuesta
Copy link
Member Author

@elacuesta elacuesta commented Apr 14, 2017

By all means, thanks to you :-)
I agree completely. I believe this PR, up until the third commit (c7bb2fa), covers all of that. Maybe we could just remove the fourth and last commit (a630839), what do you guys think?

@redapple
Copy link
Contributor

@redapple redapple commented Apr 14, 2017

I see that c7bb2fa#diff-0f004a6e06393cc5e29012b83956eed2R538 still has the "all on 1 line" XML

@elacuesta
Copy link
Member Author

@elacuesta elacuesta commented Apr 14, 2017

Right, but that's only with FEED_EXPORT_INDENT=None, which is there for consistency with stdlib json ("None (...) selects the most compact representation"). The default value is FEED_EXPORT_INDENT=0, which causes each item to go on it's own line.

@redapple
Copy link
Contributor

@redapple redapple commented Apr 14, 2017

Alright, so FEED_EXPORT_INDENT=None also serializes to JSON as [{},{},{}] without newlines?

@elacuesta
Copy link
Member Author

@elacuesta elacuesta commented Apr 14, 2017

@redapple
Copy link
Contributor

@redapple redapple commented Apr 18, 2017

Thanks @elacuesta .
I've tested https://github.com/scrapy/scrapy/compare/c7bb2fa8ce2633d92a7ec2840f84b174a5494428?expand=1 locally and it works as I would expect, for None, 0 or >0.
I think the only non-obvious thing for me when reviewing the change are the diffs in test_export_no_items_store_empty: you could even leave it unchanged with the default FEED_EXPORT_INDENT=0. You seem to cover enough for None with test_export_indentation.
Other than that, yeah, c7bb2fa looks good to me.

@redapple
Copy link
Contributor

@redapple redapple commented Apr 24, 2017

@kmike , what's your opinion on using only 1 setting as @elacuesta comments in #2456 (comment) ?

@elacuesta elacuesta force-pushed the elacuesta:feed_export_beautify branch from a630839 to c7bb2fa May 4, 2017
@elacuesta
Copy link
Member Author

@elacuesta elacuesta commented May 4, 2017

Thanks for your comments @redapple. I removed the 4th commit and left the branch at c7bb2fa.
About the changes in test_export_no_items_store_empty, if I remove the FEED_EXPORT_INDENT=None I would need to add a \n to the xml line, so there would still be a change in there. I think I prefer to explicitely set FEED_EXPORT_INDENT=None so we don't rely on defaults and know exactly what to expect, what do you think? Thanks again!

settings = {'FEED_FORMAT': row['format'], 'FEED_EXPORT_INDENT': row['indent']}
data = yield self.exported_data(items, settings)
print(row['format'], row['indent'])
self.assertEqual(row['expected'].strip(), data)

This comment has been minimized.

@kmike

kmike May 5, 2017
Member

Could you please write this test without .strip(), so that we're explicit about newlines before and after items?

This comment has been minimized.

@elacuesta

elacuesta May 9, 2017
Author Member

Sure 馃憤


* ``indent=None`` selects the most compact representation,
all items in the same line with no indentation
* ``indent<=0`` each item on it's own line, no indentation

This comment has been minimized.

* ``indent=None`` selects the most compact representation,
all items in the same line with no indentation
* ``indent<=0`` each item on it's own line, no indentation
* ``indent>0`` each item on it's own line, indentated with the provided numeric value

This comment has been minimized.

@kmike

kmike May 10, 2017
Member

indented

``None`` selects the most compact representation

Currently used by :class:`~scrapy.exporters.JsonItemExporter`
and :class:`~scrapy.exporters.XmlItemExporter`

This comment has been minimized.

@kmike

kmike May 10, 2017
Member

I think that'd be nice to mention something like "i.e. when you're exporting to .json or .xml".

This comment has been minimized.

@elacuesta

elacuesta May 10, 2017
Author Member

That was the spirit of the following comment ("Currently used by..."), perhaps I could emphasize that only those implement it for now.

This comment has been minimized.

@kmike

kmike May 11, 2017
Member

I mean that for user it may be unclear what these JsonItemExporter and XmlItemExporter mean - user almost never instantiates them directly, for user these classes are implementation details of how json or xml exports are implemented. User doesn't have to know that these classes are used when scrapy crawl foo -o result.json is called - that's why I suggested to mention that the option affects the format of -o result.json and -o resut.xml exports, not only some classes user may be not aware of.

This comment has been minimized.

@redapple

redapple May 12, 2017
Contributor

@elacuesta , @kmike , I took the liberty to push a commit with your proposal @kmike : 3a0a86e

This comment has been minimized.

@elacuesta

elacuesta May 12, 2017
Author Member

Thanks Paul. That's true, I'm sorry I missed it before.

This comment has been minimized.

@redapple

redapple May 12, 2017
Contributor

Hey @elacuesta , don't be sorry! Your work is great

@kmike
Copy link
Member

@kmike kmike commented May 10, 2017

Looks good to me, other than a few minor comments!

elacuesta and others added 2 commits May 10, 2017
@redapple redapple merged commit dfe6d3d into scrapy:master May 12, 2017
3 checks passed
3 checks passed
codecov/patch 100% of diff hit (target 84.66%)
Details
codecov/project 84.75% (+0.08%) compared to 7fc11c1
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@elacuesta elacuesta deleted the elacuesta:feed_export_beautify branch Jun 12, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

4 participants