Problem with tab-character #242

Closed
joachim99 opened this issue Nov 19, 2018 · 6 comments

@joachim99

When an attribute value string contains a tab character (0x09), it is written out as is during saving and is not changed to "&#9;". On loading, however, it is read back as a space character (0x20).
I found that the 0 in the first line of the chartypex_table is responsible for that behaviour.
static const unsigned char chartypex_table[256] =
{
3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 2, 3, 3, 2, 3, 3, // 0-15

Is this intentional? The comments for enum chartypex_t hint in that direction.
Changing the '0' to '2' fixes the problem:
3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, // 0-15
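
To illustrate why changing that single entry has this effect, here is a minimal sketch of a bitmask lookup table of this kind. The flag names, and the reading that bit 1 marks characters needing escaping in PCDATA while bit 2 marks characters needing escaping in attribute values, are assumptions inferred from the values quoted above; this is not a copy of pugixml's code.

```cpp
// Hypothetical flag names; the bit assignment is an assumption for illustration.
enum : unsigned char {
    kSpecialPcdata = 1, // character must be escaped in text content
    kSpecialAttr   = 2  // character must be escaped in attribute values
};

// Only the first row (characters 0-15) is reproduced here, with the proposed
// change applied: the entry for TAB (index 9) is 2 instead of 0.
static const unsigned char kCharFlags[16] = {
    3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3
};

// True if character c (0-15) would be escaped when written in an attribute value.
inline bool needs_escape_in_attr(unsigned char c) {
    return c < 16 && (kCharFlags[c] & kSpecialAttr) != 0;
}

// True if character c (0-15) would be escaped when written as text content.
inline bool needs_escape_in_pcdata(unsigned char c) {
    return c < 16 && (kCharFlags[c] & kSpecialPcdata) != 0;
}
```

Under these assumptions, an entry of 0 for TAB means "never escape", which matches the behaviour described above, while 2 makes TAB behave like LF and CR: escaped in attribute values, left alone in text content.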

@joachim99
Author

Edit: the tab character won't be changed to &#9;

@zeux
Owner

zeux commented Nov 20, 2018

During loading, pugixml performs attribute value normalization as described here: https://www.w3.org/TR/xml/#AVNormalize that replaces tab characters with a single space.

However, this doesn't really preclude escaping tab characters during output. At first glance it feels like this is an omission - pugixml does escape both forms of newlines in attribute values. I'll need to take a closer look at this but it might be that this warrants a change in behavior so that you can save tab characters in attributes.

@zeux
Owner

zeux commented Nov 20, 2018

Brief survey of other major XML libraries:

  • .NET System.Xml XmlDocument doesn't escape tab character
  • Python ElementTree doesn't escape tab character
  • Python lxml escapes tab character using &#9;
  • JavaScript DOM API escapes tab character using &#9;

It looks like libraries that don't escape tab character in attribute values make it impossible (AFAICT) to save this character into an attribute value, since trying to use &#9; will just escape the ampersand. I'm going to fix this, which should also make it so that attributes with &#9; roundtrip properly.

zeux added a commit that referenced this issue Nov 20, 2018
This change modifies the table entries for ctx_special_attr to treat TAB
character as special, which makes the output code escape it.

Before this change, trying to use TAB in an attribute value would output
it verbatim; during subsequent parsing, pugixml - and other compliant
parsers - would apply attribute-value normalization, turning the TAB
into a space and losing the original value.

Using &#9; fixes this; if an input document has &#9; in an attribute
value, that gets unescaped into \t during parsing and escaped back into
&#9; during output, which means we can now roundtrip values like this.

Fixes #242.
@zeux zeux closed this as completed in aac75cd Nov 20, 2018
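
The roundtrip described in the commit message can be sketched without any library. The two helpers below are hypothetical and heavily simplified (only a handful of references are handled, and CR LF pairs are not collapsed); they are not pugixml's implementation, only an illustration of why escaping TAB on output preserves it across attribute-value normalization.

```cpp
#include <cstddef>
#include <string>

// Escape an attribute value for output. Sketch only: handles &, <, " and the
// three whitespace characters that attribute-value normalization would alter.
std::string escape_attr(const std::string& value) {
    std::string out;
    for (char ch : value) {
        switch (ch) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '"':  out += "&quot;"; break;
            case '\t': out += "&#9;";   break;
            case '\n': out += "&#10;";  break;
            case '\r': out += "&#13;";  break;
            default:   out += ch;       break;
        }
    }
    return out;
}

// Read an attribute value back: literal TAB/LF/CR become a space
// (normalization), while references are expanded and kept as they are.
std::string unescape_attr(const std::string& raw) {
    std::string out;
    for (std::size_t i = 0; i < raw.size(); ++i) {
        const char ch = raw[i];
        if (ch == '\t' || ch == '\n' || ch == '\r') { out += ' '; continue; }
        if (ch != '&') { out += ch; continue; }

        const std::size_t end = raw.find(';', i);
        if (end == std::string::npos) { out += ch; continue; }

        const std::string ref = raw.substr(i + 1, end - i - 1);
        if (ref == "amp")       out += '&';
        else if (ref == "lt")   out += '<';
        else if (ref == "quot") out += '"';
        else if (ref == "#9")   out += '\t';
        else if (ref == "#10")  out += '\n';
        else if (ref == "#13")  out += '\r';
        else { out += ch; continue; } // unknown reference: keep '&' literally
        i = end; // skip past the reference that was just expanded
    }
    return out;
}
```

With these helpers, unescape_attr(escape_attr(v)) returns v even when v contains a tab, whereas writing the tab verbatim and reading it back yields a space.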
@TheVice

TheVice commented Jan 21, 2019

Hello.

Could you explain why PCDATA that contains only tabs ('\t') and/or spaces (' ') is skipped?
It is probably 'eaten' ;) by the PUGI__SKIPWS macro.

From building the following sample against different versions of pugixml, it looks like this was intended from the beginning.


//#include <vector>//PUGIXML_VERSION_0_1
#include <pugixml.hpp>

#include <string>
#include <cstdlib>
#include <iostream>

static const char* xml_data = "<proto><ptype>EGLBoolean</ptype> <name>eglBindAPI</name></proto>";

void walk(const pugi::xml_node& node, std::string& t)
{
	for (const auto& i : node)
	{
		t += i.value();
		walk(i, t);
	}
}

int main(int argc, char** argv)
{
	(void)argc;
	(void)argv;
	//
#if PUGIXML_VERSION_0_1
	std::string tmp(xml_data);
	tmp.resize(tmp.size());
	pugi::xml_parser parser(tmp.front());
	const auto a = parser.parse(&tmp.front());
	auto doc = parser.document();
#else
	pugi::xml_document doc;
#if PUGIXML_VERSION < 141
	const auto result = doc.load(xml_data);
#else
	const auto result = doc.load_string(xml_data);
#endif

#endif
	//if (pugi::status_ok != result.status)
	//{
	//	return EXIT_FAILURE;
	//}

	const std::string expected_data("EGLBoolean eglBindAPI");
	//
	std::string returned_data;
	walk(doc, returned_data);
	//
	std::cout << "'" << returned_data << "'" << std::endl;
	std::cout << "Is returned data equal to expected? "<< std::endl;
	std::cout << ((expected_data == returned_data) ? "Yes." : "No.") << std::endl;
	//
	return EXIT_SUCCESS;
}

The '.NET System.Xml' library mentioned earlier in this thread does not skip such data.
Please have a look at this C++/CLI sample:


using namespace System;
using namespace System::IO;
using namespace System::Xml;

int main(array<System::String ^> ^args)
{
	const auto xml_data = gcnew String(L"<proto><ptype>EGLBoolean</ptype> <name>eglBindAPI</name></proto>");

	const auto stringReader = gcnew StringReader(xml_data);
	const auto xr = XmlReader::Create(stringReader);

	const auto expected_data = gcnew String("EGLBoolean eglBindAPI");
	auto returned_data = gcnew String(L"");

	while (xr->Read())
	{
		returned_data += xr->Value;
	}

	Console::WriteLine(L"'{0}'", returned_data);
	Console::WriteLine(L"Is returned data equal to expected?");
	Console::WriteLine(L"{0}", ((expected_data == returned_data) ? L"Yes." : L"No."));

	return 0;
}

The sample XML data was taken from this file.
I noticed this pugixml behaviour while comparing its output with that of a Python script, which also does not skip such data.
The XML handling in that Python script is done with xml.dom.minidom.

Thank you.

P.S.
Reading such data is not critical for me. So if adding the ability to read it would cause a performance regression in the library, the current behaviour (skipping via PUGI__SKIPWS) should be kept.

Back to the Python script: even its authors add an extra space in another method (writeExtension) when they join the strings. That was probably done in case the authors of the XML document reformat the data and remove PCDATA that contains only whitespace.

@zeux
Owner

zeux commented Jan 22, 2019

@TheVice By default, pcdata that consists only of whitespace is not included in the resulting document for performance reasons. If you want to include it, you should pass pugi::parse_default | pugi::parse_ws_pcdata as flags to load_string.

@TheVice

TheVice commented Jan 22, 2019

Thanks @zeux that's what I need.
