XML Canonicalization: as an option, allow parsing eol's into the document model (as PCDATA holding those) #602

Closed
omascia opened this issue Dec 11, 2023 · 2 comments

Comments


omascia commented Dec 11, 2023

I need to implement XML Canonicalization (for validating and building XML Signatures), but I cannot easily reuse pugixml for the basic steps: the parser seems to always remove eols, except within identified PCDATA content when parse_ws_pcdata is set.

<doc>
<sub>
text
</sub>
</doc>

I can't find a way to keep the eol / whitespace between <doc> and <sub>, for instance, although parse_ws_pcdata does keep the eols within <sub></sub>.
When traversing the tree produced by such an eol-preserving parse, I would expect to see a PCDATA node as the first child of <doc>, before its sibling element <sub>. That would let me reduce whatever eols are in there to a single one, which is one of the transformation steps I need to apply before outputting the canonicalized text.
To avoid any unwanted behavior when outputting canonicalized content, I can write my own output code based on a tree traversal, but with the eols already lost at parse time, that gets me nowhere.
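
The eol adjustment described above (reducing a run of eols to a single one) could be sketched roughly as follows. This is only a minimal illustration: collapse_eol_runs is a hypothetical helper, not part of pugixml, and it assumes line endings have already been normalized to '\n', which pugixml's parse_eol flag (included in parse_default) is documented to do.

```cpp
#include <string>

// Hypothetical helper (not part of pugixml): collapse every run of
// consecutive line breaks in `text` into a single '\n'.
// Assumption: line endings are already normalized to '\n'.
std::string collapse_eol_runs(const std::string& text)
{
    std::string out;
    out.reserve(text.size());

    bool previous_was_eol = false;
    for (char c : text)
    {
        const bool is_eol = (c == '\n');

        // Skip a line break that directly follows another line break.
        if (is_eol && previous_was_eol)
            continue;

        out.push_back(c);
        previous_was_eol = is_eol;
    }

    return out;
}
```

Given a PCDATA node from the tree, the result could presumably be written back with something like node.set_value(collapse_eol_runs(node.value()).c_str()). Other whitespace (spaces, tabs) is left untouched by this sketch.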

omascia changed the title from "As an option, allow parsing all eol into the document model (as PCDATA holding those)" to "XML Canonicalization: as an option, allow parsing eol's into the document model (as PCDATA holding those)" on Dec 11, 2023
Owner

zeux commented Dec 11, 2023

Maybe I'm missing something, but parse_ws_pcdata is exactly what you need here.

Quick example:

#include "pugixml.hpp"

#include <stdio.h>

void display(pugi::xml_node node, int depth = 0)
{
        for (int i = 0; i < depth; ++i)
                printf(".");

        if (node.type() == pugi::node_element)
        {
                printf("element %s", node.name());
        }
        else if (node.type() == pugi::node_pcdata)
        {
                printf("pcdata ");

                for (const char* str = node.value(); *str; ++str)
                        printf("0x%02X ", (unsigned char)*str);
        }

        printf("\n");

        for (pugi::xml_node child : node.children())
                display(child, depth + 1);
}

int main()
{
        const char* xml = R"(
<doc>
<sub>
text
</sub>
</doc>
)";

        pugi::xml_document doc;
        pugi::xml_parse_result res = doc.load_string(xml, pugi::parse_default | pugi::parse_ws_pcdata);

        printf("%s\n", res.description());
        display(doc);
}

Output:

No error

.element doc
..pcdata 0x0A 
..element sub
...pcdata 0x0A 0x74 0x65 0x78 0x74 0x0A 
..pcdata 0x0A 

As you can see, doc has three children; two of them are PCDATA nodes holding a single newline.

Author

omascia commented Dec 11, 2023

You're right! :)
I was misled by my real test, which had PIs too. It looks like no PCDATA is kept between a PI and an element, or between successive PIs. I will put together a more complete, real-life sample and build from there. In any case, I overlooked a side issue: I will not be able to control whitespace such as eols around closing tags. The whole idea of relying on a DOM parser to modify the content (namespace updates and attribute sorting) and then output the required canonicalized form is probably wrong.
I think this issue should be closed.
Thank you Arseny.

zeux closed this as not planned on Dec 12, 2023