Generated xml looks ugly #1327

Closed
Andrej730 opened this issue Jun 30, 2015 · 10 comments

Comments

@Andrej730 Andrej730 commented Jun 30, 2015

Whenever I try to export items as XML, I get something like:

<?xml version="1.0" encoding="utf8"?>
<news><item><name>0</name><content>00000</content></item><item><name>1</name><content>11111</content></item><item><name>2</name><content>22222</content></item></news>

Does Scrapy already provide a way to "beautify" it into something like this (like the XML I found in all the docs examples)?

<?xml version="1.0" encoding="UTF-8"?>
<news>
   <item>
      <name>0</name>
      <content>00000</content>
   </item>
   <item>
      <name>1</name>
      <content>11111</content>
   </item>
   <item>
      <name>2</name>
      <content>22222</content>
   </item>
</news>

@yiakwy yiakwy commented Jun 30, 2015

Use a jQuery-like tool called PyQuery; you can use Pq("name") to get a collection of tags, just like using jQuery in a JavaScript runtime.

Author

@Andrej730 Andrej730 commented Jul 4, 2015

What I really needed was a way to make the exported XML look clean and readable. I thought this was a common case and that Scrapy might already have an implementation for it.
Anyway, I found a solution to this problem: to beautify the XML, I used the xml.dom.minidom parse() function to parse the exported file into a DOM object, and then saved the result of that object's toprettyxml() method.
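
For readers who want to try the minidom approach described above, here is a minimal sketch. The helper name prettify_xml and the sample string are made up for illustration; parseString() is used instead of parse() only so the example runs without a file on disk. Note that this builds the whole document in memory, so it is best suited to small exports.

```python
import xml.dom.minidom


def prettify_xml(raw_xml: str, indent: str = "   ") -> str:
    """Parse an XML string into a DOM and re-serialize it with indentation.

    Hypothetical helper for illustration; for a file on disk,
    xml.dom.minidom.parse(path) returns the same kind of DOM object.
    """
    dom = xml.dom.minidom.parseString(raw_xml)
    return dom.toprettyxml(indent=indent)


# Sample input modeled on the single-line export shown in the issue.
raw = "<news><item><name>0</name><content>00000</content></item></news>"
pretty = prettify_xml(raw)
print(pretty)
```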

Member

@kmike kmike commented Jul 5, 2015

I think that'd be a nice option to generate human-readable XML, and we can make it default; PRs are welcome :)

Contributor

@nramirezuy nramirezuy commented Jul 7, 2015

Before you start creating PRs, you must take into account that generated XML files can be big, so creating a DOM isn't an option. 😄

Contributor

@barraponto barraponto commented Jul 7, 2015

We use xml.sax.saxutils.XMLGenerator, but it seems like lxml.etree.tostring has a pretty_print keyword argument that indents properly. And since we already depend on lxml, maybe we can leverage that.

http://lxml.de/tutorial.html#the-element-class
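
A small sketch of the pretty_print option mentioned above, assuming lxml is installed. The element names and values mirror the sample from the issue; this only illustrates lxml.etree.tostring and is not how Scrapy's exporter is implemented. Like the minidom approach, it needs the whole tree in memory.

```python
from lxml import etree

# Build a small tree shaped like the export in the issue.
root = etree.Element("news")
for i in range(3):
    item = etree.SubElement(root, "item")
    etree.SubElement(item, "name").text = str(i)
    etree.SubElement(item, "content").text = str(i) * 5

# pretty_print=True makes lxml add newlines and indentation between elements.
pretty = etree.tostring(
    root, pretty_print=True, xml_declaration=True, encoding="UTF-8"
)
print(pretty.decode("utf-8"))
```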


@yiakwy yiakwy commented Jul 10, 2015

@Andrej730 the most effective way is not just to convert the XML to a DOM model in memory but to add a jQuery-style wrapper on top of it. That is what jQuery is for: by invoking it we can manipulate the DOM efficiently in memory. A DOM tree parser is essentially a way to analyze files made of a hierarchical tree of tags. I recommend that you consider it seriously.


@yiakwy yiakwy commented Jul 10, 2015

@nramirezuy Another method is to create a JavaScript runtime and have client code send tasks to that runtime for processing. Phantom or the V8 engine would help here.

Contributor

@nramirezuy nramirezuy commented Jul 17, 2015

@yiakwy I like your enthusiasm but using Phantom or v8 to generate XML is a little bit too much 😄

I found this XMLIndentGenerator

Contributor

@barraponto barraponto commented Aug 4, 2015

As mentioned, you can just use lxml. Here's a Feed Exporter exporting a tidy XML: https://gist.github.com/413fa084152d6845cc3d

Member

@kmike kmike commented Sep 1, 2015

@barraponto it'd be nice to have a solution which doesn't require building the whole DOM tree, as @nramirezuy suggested.
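
One possible shape for a solution that avoids building a DOM is to subclass xml.sax.saxutils.XMLGenerator and write indentation as elements are streamed out. The sketch below is an assumption-laden illustration: the class name IndentingXMLGenerator is hypothetical, it is not the XMLIndentGenerator linked earlier in the thread, and it assumes elements contain either text or child elements, not both mixed together.

```python
import io
from xml.sax.saxutils import XMLGenerator


class IndentingXMLGenerator(XMLGenerator):
    """Hypothetical XMLGenerator subclass that indents output while streaming."""

    def __init__(self, out, encoding="utf-8", indent="   "):
        super().__init__(out, encoding)
        self._indent = indent
        self._depth = 0
        self._last_was_start = False  # True right after a start tag

    def _newline(self):
        # ignorableWhitespace() writes the whitespace through unescaped.
        self.ignorableWhitespace("\n" + self._indent * self._depth)

    def startElement(self, name, attrs):
        if self._depth > 0:  # the root follows the XML declaration directly
            self._newline()
        super().startElement(name, attrs)
        self._depth += 1
        self._last_was_start = True

    def endElement(self, name):
        self._depth -= 1
        if not self._last_was_start:
            # The element had child elements: close it on its own line.
            self._newline()
        super().endElement(name)
        self._last_was_start = False


# Usage: emit items one at a time; only the current depth is kept in memory.
out = io.StringIO()
gen = IndentingXMLGenerator(out)
gen.startDocument()
gen.startElement("news", {})
for i in range(2):
    gen.startElement("item", {})
    for field, value in (("name", str(i)), ("content", str(i) * 5)):
        gen.startElement(field, {})
        gen.characters(value)
        gen.endElement(field)
    gen.endElement("item")
gen.endElement("news")
gen.endDocument()
result = out.getvalue()
print(result)
```

Because the generator only tracks the nesting depth and one flag, memory use stays constant regardless of how many items are exported.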
