Lossy Conversion from XML to Hash #1

Closed
trans opened this Issue · 11 comments

6 participants

@trans

So the only return result is a Hash? That's very limited. A Hash can't encode all the aspects of an XML document.

I started a project kind of like this some time ago called Cherry. But with it I just created a common API for working on each underlying back-end via adapters. Considering what you are doing here, in retrospect I probably should have kept my own internal representation of the XML and used the high-level API to manipulate that model, then converted it back into any other representation (LibXML, Nokogiri, REXML). That seems like a good idea for Multi-XML too.

@sferik
Owner

Why specifically is a Hash an insufficient representation of an XML document? It seems to me to be sufficient for many use cases. Maybe you could offer an example where a Hash is a poor choice.

PS: Whatever happened to the Cherry source code? I'd be interested to take a look at it.

@trans

How do you represent processing instructions <? ... ?>? And, though perhaps not as important, comments <!-- --> get lost. There is also the whole question of how to differentiate body text from attributes. Hashes work well in general, but it's kind of a one-way street, and to get fully consistent results the hash can be rather verbose. Looking over multi-xml's specs, I'm not sure one would be certain what to expect. For instance:

<user name="tom"/>

and

<user><name>tom</name></user>

These look like they would return the same result. So how would one know the difference?
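To make the concern concrete, here is a minimal sketch of a naive converter that puts attributes and child elements into the same key space. It is built on REXML and is not multi_xml's actual implementation; `naive_hash` and `naive_parse` are hypothetical names used only for illustration.

```ruby
require "rexml/document"

# Hypothetical naive converter (an assumption, not multi_xml's code):
# attributes and child elements are merged into one flat key space.
def naive_hash(element)
  children = element.elements.to_a
  # A leaf with no attributes collapses to its text content.
  return element.text.to_s if children.empty? && element.attributes.empty?

  hash = {}
  element.attributes.each { |name, value| hash[name] = value }
  children.each { |child| hash[child.name] = naive_hash(child) }
  hash
end

def naive_parse(xml)
  root = REXML::Document.new(xml).root
  { root.name => naive_hash(root) }
end

a = naive_parse('<user name="tom"/>')
b = naive_parse('<user><name>tom</name></user>')
puts a == b
```

Under this scheme both documents come out as the same hash, so the caller cannot tell which form the source used.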

@sferik
Owner

Technically, comments and processing instructions are not part of the XML document proper; they're metadata, which I believe is acceptable to drop during parsing. That said, if processing instructions are critical to your application, I would accept a patch to wrap the response in an envelope that includes processing instructions as headers. However, I don't see why this would necessitate returning something other than a Hash. For example:

doc = MultiXml.parse(xml, :symbolize_keys => true)
doc[:headers][:processing_instructions] # e.g. {:xml_stylesheet => "href=\"/style.xsl\" type=\"text/xsl\""}
doc[:body] # the parsed document

While I'm aware that processing instructions can appear anywhere in the document, all of the uses I've seen appear in the prolog, before the root node.
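A rough sketch of what such an envelope could look like, written against REXML rather than multi_xml. `parse_with_envelope` is a hypothetical helper, the body conversion is a placeholder that keeps only the root element's text, and only top-level processing instructions are collected.

```ruby
require "rexml/document"

# Hypothetical helper (not part of multi_xml): wraps a parsed body in an
# envelope that also carries the top-level processing instructions.
def parse_with_envelope(xml)
  doc = REXML::Document.new(xml)

  # Collect instructions that are direct children of the document,
  # keyed by target with dashes turned into underscores.
  instructions = doc.children.grep(REXML::Instruction).each_with_object({}) do |instruction, headers|
    key = instruction.target.tr("-", "_").to_sym
    headers[key] = instruction.content.to_s.strip
  end

  {
    headers: { processing_instructions: instructions },
    # Placeholder body (assumption): only the root element's text is kept.
    body: { doc.root.name.to_sym => doc.root.text.to_s }
  }
end

xml = '<?xml-stylesheet href="/style.xsl" type="text/xsl"?><user>tom</user>'
doc = parse_with_envelope(xml)
puts doc[:headers][:processing_instructions].inspect
puts doc[:body].inspect
```

The return value is still a plain Hash, which is the point of the proposal; instructions that appear inside the root element would be missed by this sketch.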

To your second point, I would argue that the distinction between attributes and child nodes is syntactic or stylistic, not semantic. In my mind, both of your examples parse as "a user whose name is tom." Do you have a different interpretation?

The easiest way to persuade me is by pointing to a real-world example of a document or API that uses the same attribute and child node name to describe two distinct properties. For example:

<user name="tom"><name>tom</name></user>

where the value of the attribute "name" means something different from contents of the node "name". In such a case, it would make sense for the values to be different:

<user name="tom"><name>bob</name></user>

but I would argue that this document, while perfectly valid XML, is nonsensical and would never appear in the wild (but I'd be happy to be proven wrong—there are a lot of crazy documents in the world).

I hope you understand, I'm trying to be pragmatic, solving for the most common use cases first. I'm happy to solve for edge cases when they arise, but, in my experience, edge cases often turn out to be theoretical barriers to progress, as opposed to actual constraints.

@paulwalker

  <xml>
  <param name="name_a">foo</param>
  <param name="name_b">bar</param>
  </xml>

The name attribute values aren't present in the hash at all.

@paulwalker

Attributes and their values are not optional in xml parsing.

@juggy

I just bumped into the problem @paulwalker mentioned here. When you have a node with an attribute but no inner nodes (i.e., content only), the attribute is not present in the hash.

This is causing some problems right now. Do you have a workaround?

@sferik
Owner

I agree this is a bug and don't have a simple workaround. Would you be able to write a failing spec? That seems like a good start.

@juggy

I've got a fix I created for a project, but it breaks the current way of handling attributes. Instead of adding the attributes to the node itself, I changed it so they are added to the parent node in the following format: node_name@attr_name

It also solves the problem that first created this ticket: knowing whether a node is an attribute or not.

It is a one line fix within the lib2xmlparser. It could be implemented as a different parsing mode, so multi_xml would stay compatible. I'll commit it and send it over.

EDIT: you can have a look here juggy@1d4fd5d#L0L41
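For readers who cannot open the commit, the following is one possible reading of the node_name@attr_name idea, sketched with REXML. It is an assumption based only on the description above, not the code from the linked commit; `hoisting_hash` is a hypothetical name, and the sketch ignores root attributes and repeated sibling names.

```ruby
require "rexml/document"

# Sketch only (assumption): each child's attributes are stored on the
# parent hash under "child_name@attr_name", next to the child's value.
def hoisting_hash(element)
  children = element.elements.to_a
  # A leaf keeps only its text; its attributes live on the parent.
  return element.text.to_s if children.empty?

  children.each_with_object({}) do |child, hash|
    hash[child.name] = hoisting_hash(child)
    child.attributes.each do |name, value|
      hash["#{child.name}@#{name}"] = value
    end
  end
end

doc = REXML::Document.new('<user><id type="integer">7</id><name>tom</name></user>')
result = { doc.root.name => hoisting_hash(doc.root) }
puts result.inspect
```

Here "id" still maps to the plain string "7", so existing lookups keep working, while "id@type" carries the attribute that would otherwise be dropped.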

@ginjo

Running into the same problem here. Real-world XML pulled from a FileMaker database. The generated hash drops the DISPLAY attribute entirely.

<VALUELIST NAME="Employee Unique ID">
  <VALUE DISPLAY="281 Abel">281</VALUE>
  <VALUE DISPLAY="254 Adam">254</VALUE>
  <VALUE DISPLAY="182 Adriane">182</VALUE>
  <VALUE DISPLAY="213 Alma">213</VALUE>
  <VALUE DISPLAY="183 Amanda">183</VALUE>
</VALUELIST>

@sferik
Owner

Could you please write a failing spec for this case?

@rubiii

hey guys,

everyone could write a failing spec for this, but i don't think that's the problem here. as @trans pointed out, the current hash structure can probably represent most xml data, but it fails in more complex situations. you identified two cases in which multi_xml fails to return a proper result and i know that similar projects (crack for example) have the same problems.

years ago i came across a project that seems to solve those problems by introducing a little more structure to the hash. it's called cobra vs mongoose (brilliant name!) and here's an example:

xml = '<alice id="1"><bob id="2">charlie</bob><bob id="3">david</bob></alice>'
CobraVsMongoose.xml_to_hash(xml)
# => { "alice" => { "@id" => "1", "bob" => [ { "@id" => "2", "$" => "charlie" }, { "@id" => "3", "$" => "david" } ] } }

a structure like this is certainly less convenient, but i just wanted to point out that these problems can be solved.
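as a rough illustration of how little code that convention needs, here is a sketch built on REXML that uses "@" keys for attributes, "$" for text, and arrays for repeated siblings. it is not the CobraVsMongoose implementation, and `structured_hash` is a hypothetical name.

```ruby
require "rexml/document"

# Sketch only (assumption, not CobraVsMongoose's code):
# "@name" keys hold attributes, "$" holds text, repeated children become arrays.
def structured_hash(element)
  hash = {}
  element.attributes.each { |name, value| hash["@#{name}"] = value }

  # Join the element's own text nodes, ignoring surrounding whitespace.
  text = element.texts.map(&:value).join.strip
  hash["$"] = text unless text.empty?

  element.elements.each do |child|
    value = structured_hash(child)
    if hash.key?(child.name)
      # Second occurrence of a name: promote the existing entry to an array.
      hash[child.name] = [hash[child.name]] unless hash[child.name].is_a?(Array)
      hash[child.name] << value
    else
      hash[child.name] = value
    end
  end
  hash
end

xml = '<alice id="1"><bob id="2">charlie</bob><bob id="3">david</bob></alice>'
root = REXML::Document.new(xml).root
result = { root.name => structured_hash(root) }
puts result.inspect
```

for the document above, this yields the same hash as the CobraVsMongoose output shown earlier, with attributes and text kept apart.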

cheers,
daniel

@sferik closed this