SAX parser with escaped characters #1406
Comments
Hi, thanks for asking this question. You say above that the SAX parser treats the escaped inner markup as one big text value for the `RetrieveImagesResult` node. To be clear, this is exactly what your document is: that node's content is a single escaped string rather than child elements. My advice is to grab that string from the parsed document, unescape it, and then parse it as a new document. Good luck!
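If it helps, here is a minimal sketch of that two-pass approach. The file name, the `RetrieveImagesResult` XPath, and the absence of namespaces are assumptions based on the issue description, and whether the explicit `CGI.unescapeHTML` is needed depends on how the payload was escaped:

```ruby
require 'nokogiri'
require 'cgi'

# First pass: parse the outer envelope and pull out the payload text.
outer   = Nokogiri::XML(File.read("response.xml"))
payload = outer.at_xpath("//RetrieveImagesResult").text

# Nokogiri decodes predefined entities when returning text, but if the
# payload is still entity-escaped at this point, decode it explicitly.
payload = CGI.unescapeHTML(payload) if payload.include?("&lt;")

# Second pass: parse the recovered markup as its own document.
inner = Nokogiri::XML(payload)
```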
@flavorjones thanks for the response. This is actually what I am currently doing, but it is not the most memory efficient. Is it possible to treat the escaped markup just like unescaped markup during SAX parsing? The problem with unescaping the string from the parsed document all at once is that this string is quite large, containing a lot of binary data. My main objective is to avoid having the whole string in memory at any point in time.
No, unfortunately, this is a limitation of the XML representation you're dealing with. You may want to experiment with unescaping the original document and seeing if that works. Alternatively, you could take the escaped string presented to your SAX handler and stream that into the push parser to be a bit more memory-efficient.
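A rough sketch of that alternative might look like the following. It assumes the element name from the issue and that the `characters` callback receives the payload already entity-decoded, in pieces; it is an illustration of the idea, not code from this thread:

```ruby
require 'nokogiri'

# Handler for the inner (formerly escaped) markup.
class InnerHandler < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    # react to the inner elements here
  end
end

# Outer handler: pipe the text of the escaped node into a push parser
# piece by piece instead of accumulating the whole string in memory.
class OuterHandler < Nokogiri::XML::SAX::Document
  def initialize(inner_parser)
    @inner_parser = inner_parser
    @inside_result = false
  end

  def start_element(name, attrs = [])
    @inside_result = true if name == "RetrieveImagesResult"
  end

  def end_element(name)
    return unless name == "RetrieveImagesResult"
    @inside_result = false
    @inner_parser.finish
  end

  def characters(string)
    @inner_parser << string if @inside_result
  end
end

inner_parser = Nokogiri::XML::SAX::PushParser.new(InnerHandler.new)
Nokogiri::XML::SAX::Parser.new(OuterHandler.new(inner_parser)).parse(File.open("response.xml"))
```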
Ahh. Darn!
The tokenization of XML/HTML is buried inside libxml, unfortunately. Again, how I would solve the problem is by doing two passes through the document: first parse the outer envelope and pull out the escaped string, then unescape it and parse it as a new document. Any reason why this isn't a reasonable solution for your case?
Ok, thanks for the help! I am doing something very similar. It definitely works, but I was looking for a way to either avoid unescaping altogether or to do my unescaping while streaming to a temp file. As far as I know, if I need to unescape the envelope's contents, I will need to have all of the contents in memory at once. Is this correct?
I am in a situation where I am consuming XML data via a Net::HTTP stream. First, I am writing this stream to a file as it comes in. The response body is an envelope whose inner nodes are HTML-encoded, so it contains two XML declarations: one for the envelope and one inside the escaped payload.

In order to read the file back, I am using `Nokogiri::HTML::SAX::Parser` (the XML SAX parser did not like that there were two XML declarations). When I attempt to read the file with the SAX parser, it does not decode the inner nodes. It thinks that the inner nodes (starting with `<?xml version="1.0" encoding="utf-8"?>` and ending with `</Results>`) are one big text value for the `RetrieveImagesResult` node.

My current solution is to `CGI.unescapeHTML` each chunk that comes in through the Net::HTTP stream when writing to the file. However, this is not reliable since, as far as I know, there is no guarantee that a chunk obeys the boundaries of the escaped characters, i.e. it's possible for one chunk to end with `&lt` and the next chunk to begin with the trailing `;`.

Any idea how to handle this reliably, and why this may be happening? Another thing of note: if I grab the inner decoded nodes and call `xml = Nokogiri::XML(raw_response)`, it parses correctly. Thanks!
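For reference, one way to make per-chunk unescaping safe against split entities is to hold back any trailing, unterminated entity reference and prepend it to the next chunk. This is only a sketch of that idea; the class and the streaming loop are hypothetical, not code from this thread:

```ruby
require 'cgi'

# Carries over a trailing, incomplete entity (an "&" with no closing ";")
# so CGI.unescapeHTML never sees a reference split across two chunks.
class ChunkUnescaper
  def initialize
    @carry = ""
  end

  def unescape(chunk)
    data = @carry + chunk
    amp  = data.rindex("&")
    if amp && data.index(";", amp).nil?
      # The last "&" has no ";" yet: keep it (and what follows) for the next chunk.
      @carry = data[amp..-1]
      data   = data[0...amp]
    else
      @carry = ""
    end
    CGI.unescapeHTML(data)
  end

  # Call once at the end of the stream to flush anything still held back.
  def flush
    CGI.unescapeHTML(@carry)
  end
end

# Hypothetical usage while streaming a Net::HTTP response body to a file:
# unescaper = ChunkUnescaper.new
# File.open("response.xml", "w") do |f|
#   response.read_body { |chunk| f.write(unescaper.unescape(chunk)) }
#   f.write(unescaper.flush)
# end
```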