XML Canonicalization: as an option, allow parsing eol's into the document model (as PCDATA holding those) #602

omascia · 2023-12-11T12:23:04Z

I need to implement XML Canonicalization (for validating and building XML Signatures), but I cannot easily reuse pugixml for the basic steps because the parser seems to always drop eols, except inside content it identifies as PCDATA when parse_ws_pcdata is set.

<doc>
<sub>
text
</sub>
</doc>

I can't find a way to keep the eol / whitespace between <doc> and <sub>, for instance, although parse_ws_pcdata does keep the eols within <sub></sub>.
When traversing the tree produced by such an eol-preserving parse, I would expect to see a PCDATA node as the first child of <doc>, before its sibling element <sub>. That would let me collapse whatever eols are in there down to a single one, which is one of the transformation steps I need to apply before writing the canonicalized text.
To avoid any unwanted behavior when writing the canonicalized content, I can write my own output code on top of the tree traversal, but with the eols already lost at parse time that gets me nowhere.
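
For illustration, the "keep only one eol" step described above could look roughly like the sketch below. It assumes the value of a whitespace PCDATA node has already been copied into a std::string; collapse_eols is a hypothetical helper name, not part of pugixml, and treating '\r' as an eol character is also an assumption.

```cpp
#include <string>

// Hypothetical helper (not part of pugixml): collapses every run of
// end-of-line characters ('\r' and '\n') into a single '\n' and leaves
// all other characters untouched.
std::string collapse_eols(const std::string& value)
{
    std::string result;
    result.reserve(value.size());

    bool in_eol_run = false; // true while we are inside a run of eols
    for (char ch : value)
    {
        if (ch == '\n' || ch == '\r')
        {
            // Emit one '\n' for the first eol of a run, skip the rest.
            if (!in_eol_run)
                result.push_back('\n');
            in_eol_run = true;
        }
        else
        {
            result.push_back(ch);
            in_eol_run = false;
        }
    }

    return result;
}
```

With pugixml, the input would presumably come from node.value() on a node_pcdata node, and the result could be written back with node.set_value(result.c_str()).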


zeux · 2023-12-11T16:51:40Z

Maybe I'm missing something, but parse_ws_pcdata is exactly what you need here.

Quick example:

#include "pugixml.hpp"

#include <stdio.h>

void display(pugi::xml_node node, int depth = 0)
{
        for (int i = 0; i < depth; ++i)
                printf(".");

        if (node.type() == pugi::node_element)
        {
                printf("element %s", node.name());
        }
        else if (node.type() == pugi::node_pcdata)
        {
                printf("pcdata ");

                for (const char* str = node.value(); *str; ++str)
                        printf("0x%02X ", (unsigned char)*str);
        }

        printf("\n");

        for (pugi::xml_node child : node.children())
                display(child, depth + 1);
}

int main()
{
        const char* xml = R"(
<doc>
<sub>
text
</sub>
</doc>
)";

        pugi::xml_document doc;
        pugi::xml_parse_result res = doc.load_string(xml, pugi::parse_default | pugi::parse_ws_pcdata);

        printf("%s\n", res.description());
        display(doc);
}

Output:

No error

.element doc
..pcdata 0x0A 
..element sub
...pcdata 0x0A 0x74 0x65 0x78 0x74 0x0A 
..pcdata 0x0A

As you can see, doc has three children; two of them are PCDATA nodes holding a single newline.

omascia · 2023-12-11T22:44:54Z

You're right! :)
I was misled by my real test, which also contained PIs. It looks like no PCDATA is kept between a PI and an element, or between successive PIs. I will put together a more complete, real-life sample and build from there. In any case, I overlooked a side issue: I will not be able to control whitespace such as eols around closing tags. The whole idea of relying on a DOM parser to modify the content (namespace updates and attribute sorting) and then output the required canonicalized form is probably wrong.
I think this issue should be closed.
Thank you Arseny.

omascia added the enhancement label Dec 11, 2023

omascia changed the title ~~As an option, allow parsing all eol into the document model (as PCDATA holding those)~~ XML Canonicalization: as an option, allow parsing eol's into the document model (as PCDATA holding those) Dec 11, 2023

zeux closed this as not planned Dec 12, 2023
