Eric Perrino edited this page Jun 30, 2017 · 2 revisions

XML Parsing and serialisation


Current situation

OM:

As you have probably seen, there are two entirely different styles for XML parsing and serialization in the current code, which should be corrected. There are also a lot of dependencies on simpleSAMLphp itself.

There is the code in SAML2_Assertion, SAML2_Response, etc., which is what I wrote first. These classes are very limited in what they allow you to do with the final XML.

Later I needed classes for processing XML metadata, and created the classes under SAML2_XML. At that time I had started to see the problems with the original code, and I wanted code that basically allowed me to load metadata from several sources and write it back out again without losing information.

Losing information during parsing and serialisation

[You lose information because] the SAML2_Assertion class (and others there) are not complete. They cannot represent stuff that hasn't been added. (For example the assertion with attributes with different NameFormats).

[During parsing and serialisation namespaces and comments can be ignored. The XML] should however have the same meaning when parsed again. (Possibly by a different parser.) Consider a proxy where you just grab the AttributeStatement from one assertion and insert it into a new assertion.

For example, being able to write XML back out without losing information makes a metadata aggregator trivial to implement. There is also the copying of attributes in a proxy, which I think can be useful. After all, some attributes are transferred as complex data structures (eduPersonTargetedID), and that isn't handled by the current Assertion class.
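
As a rough illustration of the proxy case, the sketch below copies an AttributeStatement between two DOM documents with `DOMDocument::importNode()`, so a structured value such as the NameID inside eduPersonTargetedID survives without the code having to understand it. The sample assertion, the IDs and the variable names are made up for the example; this is not how the current classes work.

```php
<?php
// Hypothetical input: an assertion received by a proxy, trimmed to the parts that matter here.
$incomingXml = <<<XML
<saml:Assertion xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion" ID="_in" Version="2.0">
  <saml:AttributeStatement>
    <saml:Attribute Name="urn:oid:1.3.6.1.4.1.5923.1.1.1.10"
                    NameFormat="urn:oasis:names:tc:SAML:2.0:attrname-format:uri">
      <saml:AttributeValue>
        <saml:NameID Format="urn:oasis:names:tc:SAML:2.0:nameid-format:persistent">abc123</saml:NameID>
      </saml:AttributeValue>
    </saml:Attribute>
  </saml:AttributeStatement>
</saml:Assertion>
XML;

const NS_SAML = 'urn:oasis:names:tc:SAML:2.0:assertion';

$incoming = new DOMDocument();
$incoming->loadXML($incomingXml);

// The new assertion the proxy is building (only the root element, for brevity).
$outgoing = new DOMDocument();
$assertion = $outgoing->createElementNS(NS_SAML, 'saml:Assertion');
$assertion->setAttribute('ID', '_out');
$assertion->setAttribute('Version', '2.0');
$outgoing->appendChild($assertion);

// Deep-copy the statement into the new document without interpreting its contents.
$statement = $incoming->getElementsByTagNameNS(NS_SAML, 'AttributeStatement')->item(0);
$assertion->appendChild($outgoing->importNode($statement, true));

echo $outgoing->saveXML();
```

The nested NameID element and the NameFormat attribute are carried over as-is, which is exactly the information a fixed PHP model of the assertion would have to represent explicitly.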

Discussion

OM:

Regarding parsing and generating XML -- I have been experimenting a bit with alternate methods for building the classes. I uploaded the experiment to GitHub today:

https://github.com/simplesamlphp-test/saml-xml-test

Basically, there are two experiments. One, with classes in SimpleSAML/XML, is a different way of implementing what you can currently find in lib/SAML2/XML. The major difference is that I tried to abstract away all the common code for parsing and serializing the different parts, which leads to almost no code in the classes for the various XML elements. Doing something like that would make it relatively easy to create a complete SAML 2.0 XML library.

Now, keep in mind that these were just some experiments I made. I wanted to look at alternative ways to work with XML in PHP.

BB:

I'm afraid this would be a step back for us as it has fewer static guarantees.

OM:

Hmmm... That's true -- you don't get discoverability by tools and IDEs in that case. It is however easy to implement runtime checks. For example, this will fail because the entityID isn't set:

$ed = new \SimpleSAML\XML\md\EntityDescriptor();
$ed->toXML(new DOMDocument());

Making it fail due to the wrong class being used wouldn't be hard either. For example, it could be made to reject this:

$ed = new \SimpleSAML\XML\md\EntityDescriptor();
$ed->entityID = 'https://whatever.../';
$ed->SPSSODescriptor[] = new \SimpleSAML\XML\md\IDPSSODescriptor();
$ed->toXML(new DOMDocument());
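
To make the idea of runtime checks concrete, here is a minimal sketch of what such checks could look like inside `toXML()`. The classes below are simplified stand-ins written for this example, not the classes from the experiment; the exception type and messages are assumptions.

```php
<?php
// Hypothetical stand-ins for the experiment's classes; names and checks are illustrative only.
class SPSSODescriptor {}
class IDPSSODescriptor {}

class EntityDescriptor
{
    public $entityID = null;
    public $SPSSODescriptor = [];

    public function toXML(DOMDocument $doc): DOMElement
    {
        // Required attribute: refuse to serialize without it.
        if (!is_string($this->entityID) || $this->entityID === '') {
            throw new RuntimeException('EntityDescriptor: entityID is not set.');
        }
        // Every entry in the array must be of the expected class.
        foreach ($this->SPSSODescriptor as $i => $child) {
            if (!$child instanceof SPSSODescriptor) {
                $actual = is_object($child) ? get_class($child) : gettype($child);
                throw new RuntimeException("SPSSODescriptor[$i] has the wrong type: $actual");
            }
        }
        // Child serialization is omitted to keep the sketch short.
        $e = $doc->createElementNS('urn:oasis:names:tc:SAML:2.0:metadata', 'md:EntityDescriptor');
        $e->setAttribute('entityID', $this->entityID);
        return $e;
    }
}

$ed = new EntityDescriptor();
$ed->entityID = 'https://idp.example.org/';
$ed->SPSSODescriptor[] = new IDPSSODescriptor();
try {
    $ed->toXML(new DOMDocument());
} catch (RuntimeException $e) {
    echo $e->getMessage(), "\n";
}
```

The errors only surface when `toXML()` runs, which is the trade-off being discussed: the checks are cheap to write, but tools and IDEs cannot see them ahead of time.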

But I take it that you want getters and setters with parameter validation? I am not totally against them. I'm only tired of writing boilerplate code. For example, the SPSSODescriptor attribute (which is an array) would probably require something like the following set of functions:

  • getSPSSODescriptor($index)
  • appendSPSSODescriptor($spd)
  • removeSPSSODescriptor($spd)
  • insertSPSSODescriptor($index, $spd)

The alternative is to use two functions to get and set the entire array of SPSSODescriptor elements. This makes simple manipulation more difficult, but it reduces the amount of code required. But in that case you won't get static checks for the contents of the arrays.
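
For a sense of how much boilerplate the four accessors listed above imply for a single array-valued child, here is one possible sketch. The class bodies are hypothetical and only the method names come from the list; the type declarations are what would give the static guarantees being asked for.

```php
<?php
// Hypothetical classes; only the method names are taken from the discussion.
class SPSSODescriptor {}

class EntityDescriptor
{
    /** @var SPSSODescriptor[] */
    private $spSSODescriptors = [];

    public function getSPSSODescriptor(int $index): SPSSODescriptor
    {
        if (!isset($this->spSSODescriptors[$index])) {
            throw new OutOfRangeException("No SPSSODescriptor at index $index.");
        }
        return $this->spSSODescriptors[$index];
    }

    public function appendSPSSODescriptor(SPSSODescriptor $spd): void
    {
        $this->spSSODescriptors[] = $spd;
    }

    public function insertSPSSODescriptor(int $index, SPSSODescriptor $spd): void
    {
        if ($index < 0 || $index > count($this->spSSODescriptors)) {
            throw new OutOfRangeException("Cannot insert at index $index.");
        }
        // array_splice() with length 0 inserts and renumbers the list.
        array_splice($this->spSSODescriptors, $index, 0, [$spd]);
    }

    public function removeSPSSODescriptor(SPSSODescriptor $spd): void
    {
        // Strict search so that only this exact object instance is removed.
        $index = array_search($spd, $this->spSSODescriptors, true);
        if ($index !== false) {
            array_splice($this->spSSODescriptors, $index, 1);
        }
    }
}
```

Multiply this by every repeated child element in the SAML metadata schema and the amount of hand-written code becomes clear.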

The other experiment, located in SimpleSAML/OOXML, tries a different method for loading and storing XML elements, where the XML elements are no longer tied to a DOM document. Basically, that allows us to treat XML elements the same way as other PHP objects, wrt. moving, copying, etc.

The second also contains an experiment with an alternative XML serializer because I was disappointed with the performance of the DOM API.

BB:

I would caution against implementing a custom XML serializer, maybe you could look at: http://www.php.net/manual/en/book.xmlwriter.php
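
For reference, a short sketch of what serializing with the XMLWriter extension looks like. The entity ID and the trimmed-down element structure are invented for the example (a real SPSSODescriptor needs child elements), and this says nothing about how XMLWriter compares to the DOM API in speed.

```php
<?php
$w = new XMLWriter();
$w->openMemory();

// Declares xmlns:md on the root element.
$w->startElementNs('md', 'EntityDescriptor', 'urn:oasis:names:tc:SAML:2.0:metadata');
// Attribute values are escaped by XMLWriter itself.
$w->writeAttribute('entityID', 'https://sp.example.org/?a=1&b=2');

// Passing null as the namespace reuses the prefix without redeclaring it.
$w->startElementNs('md', 'SPSSODescriptor', null);
$w->writeAttribute('protocolSupportEnumeration', 'urn:oasis:names:tc:SAML:2.0:protocol');
$w->endElement();

$w->endElement();
$xml = $w->outputMemory();

echo $xml, "\n";
```

The caller still has to keep track of which prefixes are in scope, but escaping and well-formed nesting are handled by the library.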

OM:

Actually, serializing XML is really trivial as long as you require the application to not misbehave (and use invalid characters in element or attribute names). Just remember to call htmlspecialchars() on all text content and attribute values, and the rest is just trivial string concatenation.
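
A minimal sketch of the string-concatenation approach described above, assuming the caller supplies valid element and attribute names and valid UTF-8 text. The function name and the array-based element representation are invented for this example and are not taken from the experiment.

```php
<?php
// Hypothetical helper: serializes a [name, attributes, children] structure to XML text.
// Names are trusted; only text content and attribute values are escaped.
function serializeElement(string $name, array $attributes, array $children): string
{
    $xml = '<' . $name;
    foreach ($attributes as $attrName => $value) {
        // ENT_QUOTES: quotes must be escaped inside attribute values.
        $xml .= ' ' . $attrName . '="' . htmlspecialchars($value, ENT_QUOTES | ENT_XML1, 'UTF-8') . '"';
    }
    if ($children === []) {
        return $xml . '/>';
    }
    $xml .= '>';
    foreach ($children as $child) {
        // Strings are text content; arrays are nested [name, attributes, children] triples.
        $xml .= is_string($child)
            ? htmlspecialchars($child, ENT_NOQUOTES | ENT_XML1, 'UTF-8')
            : serializeElement($child[0], $child[1], $child[2]);
    }
    return $xml . '</' . $name . '>';
}

echo serializeElement(
    'md:EntityDescriptor',
    ['xmlns:md' => 'urn:oasis:names:tc:SAML:2.0:metadata', 'entityID' => 'https://sp.example.org/?a=1&b=2'],
    [['md:Organization', [], ['R&D <unit>']]]
), "\n";
```

Namespace declarations are just attributes here, so the application decides where each prefix is declared. Characters that are not allowed in XML at all (most control characters) are not handled by this sketch.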

However, as I said, it was an experiment. It grew out of frustration with the DOM API, its slowness and its quirks. (For example, adding a proper xsi:type attribute is always difficult. Things have to be done in the correct order and with an ugly hack, so that one gets the prefix used in the attribute value to be included in the final XML.)
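
To illustrate the xsi:type problem: the DOM has no way of knowing that "xs" inside the attribute value is a namespace prefix, so nothing declares it automatically. One common workaround, shown below, is to force the declaration in by hand through the reserved xmlns namespace. This is only a sketch of the general issue; it may not be the exact hack referred to above.

```php
<?php
const NS_SAML  = 'urn:oasis:names:tc:SAML:2.0:assertion';
const NS_XSI   = 'http://www.w3.org/2001/XMLSchema-instance';
const NS_XS    = 'http://www.w3.org/2001/XMLSchema';
const NS_XMLNS = 'http://www.w3.org/2000/xmlns/';

$doc = new DOMDocument();
$value = $doc->createElementNS(NS_SAML, 'saml:AttributeValue');
$doc->appendChild($value);

// "xs" only appears inside an attribute value, so declare the prefix explicitly.
$value->setAttributeNS(NS_XMLNS, 'xmlns:xs', NS_XS);
// The xsi prefix is declared automatically because the attribute itself is namespaced.
$value->setAttributeNS(NS_XSI, 'xsi:type', 'xs:string');
$value->appendChild($doc->createTextNode('some value'));

$xml = $doc->saveXML($value);
echo $xml, "\n";
```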

BB:

I actually spent a while researching alternatives last year and proposed something more meta. That is to get a data-binding library (like pibx) to understand the SAML XSDs and generate the data model and the marshalling/unmarshalling code.
This would make interop with something like MDUI or XACML much easier. And it would make it easier to introduce advanced features like JIT marshalling/unmarshalling. But unfortunately so far the advantages have not justified the budget required.

OM:

My worry about this kind of data binding layer is that sooner or later you want to do something that the code generator can't generate code for, and then you are left with two possibilities -- either extending the generator, or removing the generator.

For example -- how hard would it be to make it process XML signatures, so that you have the information you need for signature validation after the XML has been loaded into classes?

BB:

Well, if your model contains no extensions then it really isn't that hard, as long as you have feature parity with the XSD. The difficulty comes with the extensions. What Corto does (and it is sensible IMHO) is that when it marshalls the XML to an array, it also preserves the XML that the array was marshalled from and uses that in signature validation. So even if our model doesn't contain everything the original XML contained, it doesn't have to. Of course, for a general purpose library we should take care to invalidate this XML if changes are made to the model.
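
The idea of keeping the source XML next to the model, and invalidating it on change, could be sketched roughly as follows. This is a hypothetical illustration written for this page; it is not Corto's actual code or API, and the class and method names are made up.

```php
<?php
// Hypothetical sketch; not Corto's actual code or API.
class ParsedMessage
{
    private $data;
    private $originalXml;

    public function __construct(array $data, string $originalXml)
    {
        $this->data = $data;
        $this->originalXml = $originalXml;
    }

    public function get(string $key)
    {
        return $this->data[$key] ?? null;
    }

    public function set(string $key, $value): void
    {
        $this->data[$key] = $value;
        // The stored XML no longer matches the model, so it must not be used for validation.
        $this->originalXml = null;
    }

    /** Returns the exact bytes the model was parsed from, or null once the model has changed. */
    public function getOriginalXml(): ?string
    {
        return $this->originalXml;
    }
}

$xml = '<saml:Issuer xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion">https://idp.example.org/</saml:Issuer>';
$msg = new ParsedMessage(['Issuer' => 'https://idp.example.org/'], $xml);

var_dump($msg->getOriginalXml() === $xml); // signature validation would use this string
$msg->set('Issuer', 'https://proxy.example.org/');
var_dump($msg->getOriginalXml());          // no longer available after a change
```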

OM:

The problem is that signatures require you to keep almost everything from the original XML. Namespace prefix declarations (possibly multiple for the same namespace), comments, whitespace, and so on. Just about the only thing you can discard is attribute order... (I really dislike XML signatures. It is far simpler when you just have a blob of data encapsulated in a signature.)

[Won't C14N step during XML signature creation solve all this?] One would wish, but no. C14N barely touches namespaces and namespace prefixes. The only thing it can do is to try to clean up unused namespace prefixes. (Which in turn is a problem because not every prefix is "visibly used" (e.g. "xs" in xsi:type="xs:string"). Thus the PrefixList attribute.)

It doesn't remove any whitespace, but may convert some of it (i.e. line breaks). Comments can be removed depending on the canonicalization method and the reference URI, so you cannot assume that they can be removed before you have examined the signature element.
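
These points can be checked with `DOMNode::C14N()`. The sketch below runs exclusive canonicalization on a made-up attribute element three ways: plain, with "xs" in the inclusive prefix list, and with comments retained. The sample XML is invented for the example.

```php
<?php
$xml = <<<XML
<saml:Attribute xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                Name="mail">
  <!-- a comment -->
  <saml:AttributeValue xsi:type="xs:string">user@example.org</saml:AttributeValue>
</saml:Attribute>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$root = $doc->documentElement;

// Exclusive C14N without comments: "xs" is not visibly used, so its declaration is dropped.
$plain = $root->C14N(true, false);

// Same, but with "xs" in the inclusive prefix list, so the declaration is kept.
$withPrefixList = $root->C14N(true, false, null, ['xs']);

// Same as the first call, but using the "with comments" variant.
$withComments = $root->C14N(true, true);

var_dump(strpos($plain, 'xmlns:xs=') !== false);          // is the xs declaration present?
var_dump(strpos($withPrefixList, 'xmlns:xs=') !== false);
var_dump(strpos($withComments, '<!--') !== false);
```

In the first result, `xsi:type="xs:string"` refers to a prefix that is no longer declared, which is the problem the PrefixList attribute exists to solve. The indentation between the elements is kept in all three results.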

What we currently do in the SAML2_XML_*-classes is to do partial signature validation when parsing the XML. That way we can grab what we need during parsing, and store it for later.