
SAX parser with escaped characters #1406

Closed
summera opened this issue Jan 7, 2016 · 6 comments


summera commented Jan 7, 2016

I am consuming XML data via a Net::HTTP stream and writing the stream to a file as it comes in. The response body has the following form:

<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <soap:Body>
    <RetrieveImagesResponse xmlns="NWMLS:EverNet">
      <RetrieveImagesResult>
        &lt;?xml version=&quot;1.0&quot; encoding=&quot;utf-8&quot;?&gt;
        &lt;Results xmlns=&quot;NWMLS:EverNet:ImageData:1.0&quot;&gt;
          &lt;Images ListingNumber=&quot;874586&quot;&gt;
            &lt;Image&gt;
              &lt;ImageId&gt;
                75624114
              &lt;/ImageId&gt;
              &lt;ImageOrder&gt;
                0
              &lt;/ImageOrder&gt;
              &lt;UploadDt&gt;
                2015-12-02T23:36:06.437
              &lt;/UploadDt&gt;
              &lt;BLOB&gt;
                [Base64 Encoded Image]
              &lt;/BLOB&gt;
            &lt;/Image&gt;
          &lt;/Images&gt;
        &lt;/Results&gt;
      </RetrieveImagesResult>
    </RetrieveImagesResponse>
  </soap:Body>
</soap:Envelope>

You'll notice that the inner nodes are HTML-escaped and that there are two XML declarations. To read back the file, I am using Nokogiri::HTML::SAX::Parser, since the XML SAX parser did not like that there were two XML declarations.

When I attempt to read this back out with the SAX parser, it does not decode the inner nodes. It treats the inner nodes (starting with &lt;?xml version=&quot;1.0&quot; encoding=&quot;utf-8&quot;?&gt; and ending with &lt;/Results&gt;) as one big text value for the RetrieveImagesResult node.

My current solution is to CGI.unescapeHTML each chunk that comes in through the Net::HTTP stream before writing it to the file. However, this is not reliable: as far as I know, there is no guarantee that a chunk ends on an entity boundary, e.g. the first chunk could end with &lt and the next chunk begin with the trailing ;.
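For what it's worth, that boundary problem can be worked around by buffering any trailing partial entity and prepending it to the next chunk before unescaping. A minimal sketch using only the stdlib CGI module (the class name and the buffering window are my own, not anything Nokogiri provides):

```ruby
require 'cgi'

# Unescapes HTML entities across chunk boundaries by holding back a
# trailing partial entity (e.g. a chunk ending in "&lt" with the ";"
# still to come) until the next chunk arrives.
class StreamingUnescaper
  # Entities produced by XML escaping ("&lt;", "&quot;", numeric
  # references, etc.) comfortably fit in this window.
  MAX_ENTITY = 10

  def initialize
    @carry = ''
  end

  # Returns the unescaped text that is safe to emit for this chunk.
  def feed(chunk)
    data = @carry + chunk
    amp = data.rindex('&')
    if amp && data.index(';', amp).nil? && data.length - amp < MAX_ENTITY
      @carry = data[amp..-1]   # hold back the possible partial entity
      data = data[0...amp]
    else
      @carry = ''
    end
    CGI.unescapeHTML(data)
  end

  # Flush whatever is left once the stream ends.
  def finish
    rest = CGI.unescapeHTML(@carry)
    @carry = ''
    rest
  end
end
```

With this, `feed('a &l')` returns `"a "` and the following `feed('t; b')` returns `"< b"`, so an entity split across two chunks is decoded correctly.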

Any idea how to handle this reliably, and why this may be happening? Another thing of note: if I grab the decoded inner nodes and call xml = Nokogiri::XML(raw_response), it does parse correctly.

Thanks!

@flavorjones
Member

Hi,

Thanks for asking this question.

You say above:

It thinks that the inner nodes (starting with <?xml version="1.0" encoding="utf-8"?> and ending with </Results>) is one big text value for the RetrieveImagesResult node.

To be clear, this is exactly what your document is: the RetrieveImagesResult element contains a string which happens to be escaped markup.

My advice is to grab that string from the parsed document, unescape it, and then parse it as a new document. Good luck!


summera commented Jan 11, 2016

@flavorjones thanks for the response. This is actually what I am currently doing, but it is not the most memory-efficient.

Is it possible to treat the escaped markup just like unescaped markup during SAX parsing?

The problem with unescaping the string from the parsed document all at once is that this string is quite large, containing a lot of binary data. My main objective is to avoid having the whole string in memory at any point in time.

@flavorjones
Member

No, unfortunately; this is a limitation of the XML representation you're dealing with. You may want to experiment with unescaping the original document and seeing if that works.

Alternatively, you could take the escaped string presented to your SAX handler and stream that into the push parser to be a bit more memory-efficient.


summera commented Jan 11, 2016

Ahh. Darn!
Does Nokogiri determine the token boundaries that trigger the start_element/end_element events on a Nokogiri::XML::SAX::Document, or is this buried inside libxml2 somewhere?

@flavorjones
Member

The tokenization of XML/HTML is buried inside libxml, unfortunately.

Again, I would solve this problem by doing two passes through the document:

  1. First, parse the "envelope" and extract the escaped XML contents (likely into a temporary file if the content is large). You can use the SAX parser for this; it sounds like you already have code to do so, and it is more memory-efficient.
  2. Second, unescape and parse the temporary file (the envelope's contents).

Any reason why this isn't a reasonable solution for your case?


summera commented Jan 12, 2016

Ok. Thanks for the help! I am doing something very similar, but instead of doing step 1, I unescape the whole thing, including the envelope, into a temporary file, then read the temp file with the SAX parser to do my processing of the binary data.

It definitely works, but I was looking for a way to either avoid unescaping or to do my unescaping while streaming to the temp file. As far as I know, if I need to unescape the envelope's contents, I will need to have all of the contents in memory at once. Is this correct?
