
SAX parser with escaped characters #1406

Closed
summera opened this issue Jan 7, 2016 · 6 comments


summera commented Jan 7, 2016

I am consuming XML data via a Net::HTTP stream and writing the stream to a file as it comes in. The response body has the following form:

<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <soap:Body>
    <RetrieveImagesResponse xmlns="NWMLS:EverNet">
      <RetrieveImagesResult>
        &lt;?xml version=&quot;1.0&quot; encoding=&quot;utf-8&quot;?&gt;
        &lt;Results xmlns=&quot;NWMLS:EverNet:ImageData:1.0&quot;&gt;
          &lt;Images ListingNumber=&quot;874586&quot;&gt;
            &lt;Image&gt;
              &lt;ImageId&gt;
                75624114
              &lt;/ImageId&gt;
              &lt;ImageOrder&gt;
                0
              &lt;/ImageOrder&gt;
              &lt;UploadDt&gt;
                2015-12-02T23:36:06.437
              &lt;/UploadDt&gt;
              &lt;BLOB&gt;
                [Base64 Encoded Image]
              &lt;/BLOB&gt;
            &lt;/Image&gt;
          &lt;/Images&gt;
        &lt;/Results&gt;
      </RetrieveImagesResult>
    </RetrieveImagesResponse>
  </soap:Body>
</soap:Envelope>

You'll notice that the inner nodes are HTML-escaped and that there are two XML declarations. To read back the file, I am using Nokogiri::HTML::SAX::Parser, since the XML SAX parser did not like that there were two XML declarations.

When I attempt to read this back out with the SAX parser, it does not decode the inner nodes. It treats the inner nodes (starting with &lt;?xml version=&quot;1.0&quot; encoding=&quot;utf-8&quot;?&gt; and ending with &lt;/Results&gt;) as one big text value for the RetrieveImagesResult node.

My current solution is to CGI.unescapeHTML each chunk that comes in through the Net::HTTP stream before writing it to the file. However, this is not reliable: as far as I know, there is no guarantee that a chunk ends on an entity boundary, e.g. the first chunk could end with &lt and the next chunk begin with the trailing ;.
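For what it's worth, that boundary problem can be worked around by buffering any trailing partial entity and prepending it to the next chunk before unescaping. A minimal sketch using only the stdlib CGI module (the class name and the buffering window are my own, not anything Nokogiri provides):

```ruby
require 'cgi'

# Unescapes HTML entities across chunk boundaries by holding back a
# trailing partial entity (e.g. a chunk ending in "&lt" with the ";"
# still to come) until the next chunk arrives.
class StreamingUnescaper
  # Entities produced by XML escaping ("&lt;", "&quot;", numeric
  # references, etc.) comfortably fit in this window.
  MAX_ENTITY = 10

  def initialize
    @carry = ''
  end

  # Returns the unescaped text that is safe to emit for this chunk.
  def feed(chunk)
    data = @carry + chunk
    amp = data.rindex('&')
    if amp && data.index(';', amp).nil? && data.length - amp < MAX_ENTITY
      @carry = data[amp..-1]   # hold back the possible partial entity
      data = data[0...amp]
    else
      @carry = ''
    end
    CGI.unescapeHTML(data)
  end

  # Flush whatever is left once the stream ends.
  def finish
    rest = CGI.unescapeHTML(@carry)
    @carry = ''
    rest
  end
end
```

With this, `feed('a &l')` returns `"a "` and the following `feed('t; b')` returns `"< b"`, so an entity split across two chunks is decoded correctly.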

Any idea how to handle this reliably, and why this may be happening? Another thing of note: if I grab the decoded inner nodes and call xml = Nokogiri::XML(raw_response), it does parse correctly.

Thanks!

@flavorjones
Member

Hi,

Thanks for asking this question.

You say above:

It thinks that the inner nodes (starting with <?xml version="1.0" encoding="utf-8"?> and ending with </Results>) is one big text value for the RetrieveImagesResult node.

To be clear, this is exactly what your document is: the RetrieveImagesResult element contains a string which happens to be escaped markup.

My advice is to grab that string from the parsed document, unescape it, and then parse it as a new document. Good luck!


summera commented Jan 11, 2016

@flavorjones thanks for the response. This is actually what I am currently doing, but it is not the most memory-efficient.

Is it possible to treat the escaped markup just like unescaped markup during SAX parsing?

The problem with unescaping the string from the parsed document all at once is that this string is quite large, containing a lot of binary data. My main objective is to avoid having the whole string in memory at any point in time.

@flavorjones
Member

No, unfortunately; this is a limitation of the XML representation you're dealing with. You may want to experiment with unescaping the original document and seeing if that works.

Alternatively, you could take the escaped string presented to your SAX handler and stream that into the push parser to be a bit more memory-efficient.


summera commented Jan 11, 2016

Ahh. Darn!
Does Nokogiri determine the token boundaries that trigger the start_element/end_element events on a Nokogiri::XML::SAX::Document, or is this buried inside libxml2 somewhere?

@flavorjones
Member

The tokenization of XML/HTML is buried inside libxml, unfortunately.

Again, I would solve this problem by doing two passes through the document:

  1. First, parse the "envelope" and extract the escaped XML contents (likely into a temporary file if the content is large). You can use the SAX parser for this; it sounds like you already have code to do so, and it is more memory-efficient.
  2. Second, unescape and parse the temporary file (the envelope's contents).

Any reason why this isn't a reasonable solution for your case?


summera commented Jan 12, 2016

Ok. Thanks for the help! I am doing something very similar, but instead of doing step 1, I unescape the whole thing, including the envelope, into a temporary file, then read the temp file with the SAX parser to do my processing of the binary data.

It definitely works, but I was looking for a way to either avoid unescaping or to do my unescaping while streaming to the temp file. As far as I know, if I need to unescape the envelope's contents, I will need to have all of the contents in memory at once. Is this correct?
