Problem with tab-character #242
Edit: Tab and won't be changed to |
During loading, pugixml performs attribute value normalization as described here: https://www.w3.org/TR/xml/#AVNormalize that replaces tab characters with a single space. However, this doesn't really preclude escaping tab characters during output. At first glance it feels like this is an omission - pugixml does escape both forms of newlines in attribute values. I'll need to take a closer look at this but it might be that this warrants a change in behavior so that you can save tab characters in attributes.
Brief survey of other major XML libraries:
It looks like libraries that don't escape tab character in attribute values make it impossible (AFAICT) to save this character into an attribute value, since trying to use |
This change modifies the table entries for ctx_special_attr to treat the TAB character as special, which makes the output code escape it. Before this change, trying to use TAB in an attribute value would output it verbatim; during subsequent parsing, pugixml - and other compliant parsers - would apply attribute-value normalization, turning the TAB into a space and losing the original value. Using &#9; fixes this; if an input document has &#9; in an attribute value, that gets unescaped into \t during parsing and escaped back into &#9; during output, which means we can now roundtrip values like this. Fixes #242.
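A minimal sketch of that roundtrip, for illustration only. The helpers escape_attr and parse_attr are hypothetical and are not part of pugixml; they only model the behavior described above, and CRLF handling is simplified (each literal CR or LF becomes one space).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not pugixml's code): escape characters that would be
// altered or misread inside a double-quoted attribute value.
std::string escape_attr(const std::string& value)
{
    std::string out;
    for (char c : value)
    {
        switch (c)
        {
        case '&':  out += "&amp;";  break;
        case '<':  out += "&lt;";   break;
        case '"':  out += "&quot;"; break;
        case '\t': out += "&#9;";   break; // the case this issue is about
        case '\n': out += "&#10;";  break;
        case '\r': out += "&#13;";  break;
        default:   out += c;
        }
    }
    return out;
}

// Hypothetical model of a parser reading raw attribute text: literal
// whitespace is normalized to a space, character references are decoded.
std::string parse_attr(const std::string& raw)
{
    std::string out;
    for (std::size_t i = 0; i < raw.size(); ++i)
    {
        char c = raw[i];
        if (c == '\t' || c == '\n' || c == '\r')
        {
            out += ' '; // attribute-value normalization
            continue;
        }
        std::size_t end = (c == '&') ? raw.find(';', i) : std::string::npos;
        if (end == std::string::npos)
        {
            out += c; // ordinary character, or a stray '&'
            continue;
        }
        std::string name = raw.substr(i + 1, end - i - 1);
        if (name == "amp") out += '&';
        else if (name == "lt") out += '<';
        else if (name == "quot") out += '"';
        else if (name == "#9") out += '\t';
        else if (name == "#10") out += '\n';
        else if (name == "#13") out += '\r';
        else out += raw.substr(i, end - i + 1); // unknown reference: keep as-is
        i = end; // skip past the ';'
    }
    return out;
}
```

With these helpers, parse_attr("a\tb") loses the tab, while parse_attr(escape_attr("a\tb")) returns the original value.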
Hello. Could you explain why pcdata that contains only tab ('\t') and/or space (' ') characters is skipped? Building the following sample against different versions of pugixml suggests that this was intended from the beginning.
The '.NET System.Xml' library referenced earlier in this thread does not skip such data.
The sample XML data was taken from this file. Thank you. P.S. Back to the Python script: even its authors added an extra space in another method (writeExtension) when joining the data into a string. That was probably done in case the authors of the XML document reformat the data and remove pcdata that consists only of whitespace.
@TheVice By default pcdata that's whitespace-exclusive is not included into the resulting document for performance reasons. If you want to include it, you should pass |
Thanks @zeux, that's what I need.
When an attribute value string contains a tab character (0x09), during saving it stays as is and won't be changed to "&#9;". On loading, however, it will be read as a space character (0x20).
I found that the 0 in the first line of the chartypex_table is responsible for that behaviour.
static const unsigned char chartypex_table[256] =
{
3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 2, 3, 3, 2, 3, 3, // 0-15
Is this intentional? The comments for enum chartypex_t hint in that direction.
Changing the '0' to '2' fixes the problem:
3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, // 0-15
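To illustrate why that single entry matters, here is a rough sketch of a table-driven lookup. The names special_pcdata, special_attr, make_table and needs_attr_escape are hypothetical, and the entries beyond the first 16 are assumptions; only the first row mirrors the values quoted above.

```cpp
#include <array>

// Hypothetical flags: bit 1 = escape in text content, bit 2 = escape in
// attribute values. A value of 3 means both, 0 means neither.
enum : unsigned char
{
    special_pcdata = 1,
    special_attr = 2
};

// Build a 256-entry table; the flag selects the old (0) or new (2) TAB entry.
std::array<unsigned char, 256> make_table(bool tab_special_in_attr)
{
    std::array<unsigned char, 256> t{}; // every entry starts at 0
    for (int c = 0; c < 32; ++c)
        t[c] = special_pcdata | special_attr; // control characters: 3
    t['\t'] = tab_special_in_attr ? special_attr : 0;
    t['\n'] = special_attr;
    t['\r'] = special_attr;
    // Assumed entries outside the quoted first row.
    t['&'] = t['<'] = t['>'] = special_pcdata | special_attr;
    t['"'] = special_attr;
    return t;
}

// The writer would consult the table once per character.
bool needs_attr_escape(const std::array<unsigned char, 256>& t, char c)
{
    return (t[static_cast<unsigned char>(c)] & special_attr) != 0;
}
```

With make_table(false) a tab is written verbatim in attribute values; with make_table(true) it is flagged for escaping, which corresponds to changing the '0' to '2'.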