New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rapidxml namespace strip patch #295

merged 2 commits into from Mar 21, 2017


None yet
2 participants

jennybc commented Mar 21, 2017

Applies the patch mentioned on sourceforge for rapidxml to nuke the namespace prefixes for element names and attributes. As usual, this arises in creative spreadsheets written by 3rd party tools. However, Excel and the Apache POI-using R packages can read these files and, therefore, so should readxl.

The other solution would be to use a rapidxml fork that handles namespaces properly. That seems like a much bigger change and, more importantly, would require C++11. So far readxl has done without that, so I assume we should continue?

I've probably used the new flag in more places than is strictly necessary, but it's hard to anticipate where the novel prefixes will show up. I'm still trying to get more problematic specimens for testing, but it's hard. It's usually niche BI tools that do this and people aren't able to share the sheets.

I reached out to the author of the patch, to say "thanks" at the very least.

Patch rapidxml to strip namespace prefixes from element names and att…

Using code from a patch by Dan Posluns found here and included below:

Index: rapidxml.hpp
>     //! Parse flag instructing the parser to strip any XML namespace identifiers from element names and attributes.
>     //! Turning this flag on will remove the namespace prefix and colon from all tags and attributes.
>     //! By default, XML namespace identifiers are left as part of the names of tags and attributes.
>     //! This flag does not cause the parser to modify source text.
>     //! Can be combined with other flags by use of | operator.
>     //! <br><br>
>     //! See xml_document::parse() function.
>     const int parse_strip_xml_namespaces = 0x1000;
> 		// Detect element XML namespace prefix character
> 		struct element_namespace_prefix_pred
> 		{
> 			static unsigned char test(Ch ch)
> 			{
>                 return ch != ':' && internal::lookup_tables<0>::lookup_node_name[static_cast<unsigned char>(ch)];
> 			}
> 		};
> 		// Detect attribute XML namespace prefix character
> 		struct attribute_namespace_prefix_pred
> 		{
> 			static unsigned char test(Ch ch)
> 			{
>                 return ch != ':' && internal::lookup_tables<0>::lookup_attribute_name[static_cast<unsigned char>(ch)];
> 			}
> 		};
> 		//
> 			if (Flags & parse_strip_xml_namespaces)
> 			{
> 				Ch *saved_name = name;
> 				skip<element_namespace_prefix_pred, Flags>(name);
> 				if (name++ == text)
> 					name = saved_name;
> 			}
> 				if (Flags & parse_strip_xml_namespaces)
> 				{
> 					Ch *saved_name = name;
> 					skip<attribute_namespace_prefix_pred, Flags>(name);
> 					if (name++ == text)
> 						name = saved_name;
> 				}

@jennybc jennybc requested a review from hadley Mar 21, 2017


hadley approved these changes Mar 21, 2017

This seems pretty minimal. Looks good!

@jennybc jennybc merged commit da6f8be into tidyverse:master Mar 21, 2017

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed

@jennybc jennybc deleted the jennybc:rapidxml-namespace-strip-patch branch Mar 21, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment