Work around a (suspected) regression in libxml2 2.9.1 -> 2.9.2 #2
Merged
In libxml 2.9.1 -> 2.9.2, a change was made to fix the handling of URIs with rootless paths ([initial report](https://bugzilla.gnome.org/show_bug.cgi?id=731063), [initial fix](https://git.gnome.org/browse/libxml2/commit/?id=8eb55d782a2b9afacc7938694891cc6fad7b42a5), [follow-up patch](https://git.gnome.org/browse/libxml2/commit/uri.c?id=beb7281055dbf0ed4d041022a67c6c5cfd126f25)).

However, there still seems to be a bug in how relative URIs are resolved when XML catalogs are processed. For example, when the Debian `w3c-dtd-xhtml` package is installed to provide a local cached copy of the XHTML DTD, libxml correctly follows references to other catalog files as long as they are given as absolute `file://` URLs. But once it reaches `file:///usr/share/xml/xhtml/schema/dtd/1.0/catalog.xml`, further references use relative URIs like `uri="xhtml1-strict.dtd"`. libxml resolves this to `file:/usr/share/xml/xhtml/schema/dtd/1.0/xhtml1-strict.dtd`, with only a single leading slash. This URL is subsequently used as a file system path, fails during a `stat()` call, and makes PHP return an error message like `failed to load external entity "file:/usr/share/xml/xhtml/schema/dtd/1.0/xhtml1-strict.dtd"`.

An earlier report of this problem can be found on the [libxml mailing list](https://mail.gnome.org/archives/xml/2014-December/msg00000.html). A second mail to this list is currently awaiting moderation and will hopefully become available in the [November 2016 archive](https://mail.gnome.org/archives/xml/2016-November/date.html#00000).

As a workaround (and to make us somewhat independent of the exact libxml version installed on the underlying OS), we're including the four DTD and entity definition files relevant for `XHTML 1.0 Strict` in this repository and using a custom libxml entity loader in PHP userland to load these files. Note that this change effectively cuts out XML catalog support in libxml when used from PHP.
It may cause unexpected side effects, as it seems that only one external entity loader can be registered at a time.

Another approach we tried was to come up with a custom XML catalog like

```
<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN" "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
    <public publicId="-//W3C//DTD XHTML 1.0 Strict//EN" uri="xhtml1-strict.dtd" />
    <public publicId="-//W3C//ENTITIES Latin 1 for XHTML//EN" uri="xhtml-lat1.ent" />
    <public publicId="-//W3C//ENTITIES Symbols for XHTML//EN" uri="xhtml-symbol.ent" />
    <public publicId="-//W3C//ENTITIES Special for XHTML//EN" uri="xhtml-special.ent" />
</catalog>
```

When the `XML_CATALOG_FILES` environment variable points to this file, file resolution works because no `file://` based URLs are ever introduced during the resolution process.

The difficulty, however, is setting this environment variable early enough: the setting seems to be read during libxml initialization while PHP starts up, so a simple `putenv()` from PHP comes too late. When running PHP as an Apache module, the environment variable has to be passed when starting Apache (`/etc/apache2/envvars`, for example), not from within virtual host definitions (`SetEnv`).

This makes the necessary setting hard to install and maintain, and it affects catalog processing at a lower (system-wide) level, which is why we abandoned this approach.
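For illustration, the userland entity loader workaround described above might look roughly like the following. This is a minimal sketch and not the code from this pull request: the function name `resolveBundledXhtmlEntity` and the `xhtml10` directory are hypothetical, and the four file names are taken from the catalog example. It relies on PHP's `libxml_set_external_entity_loader()`, whose callback receives the public ID, the system ID and a context array.

```php
<?php
// Sketch only: maps the four XHTML 1.0 Strict public identifiers to files
// shipped with the project, so that libxml never consults the system catalog.
// The directory name "xhtml10" is an assumption made for this example.
function resolveBundledXhtmlEntity(?string $publicId, ?string $systemId, array $context): ?string
{
    $bundled = [
        '-//W3C//DTD XHTML 1.0 Strict//EN'       => 'xhtml1-strict.dtd',
        '-//W3C//ENTITIES Latin 1 for XHTML//EN' => 'xhtml-lat1.ent',
        '-//W3C//ENTITIES Symbols for XHTML//EN' => 'xhtml-symbol.ent',
        '-//W3C//ENTITIES Special for XHTML//EN' => 'xhtml-special.ent',
    ];

    $file = $bundled[$publicId ?? ''] ?? null;
    if ($file === null) {
        // Unknown entity: returning null tells libxml it cannot be resolved.
        return null;
    }

    // Return a plain file system path, avoiding file: URLs altogether.
    return __DIR__ . '/xhtml10/' . $file;
}

// Registering the loader replaces any loader that was registered before.
libxml_set_external_entity_loader('resolveBundledXhtmlEntity');
```

Because this sketch returns `null` for everything it does not know, any other external entity will fail to load while it is registered, which is one form the side effects mentioned above can take.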
Found the issue in libxml, message here: https://mail.gnome.org/archives/xml/2016-November/msg00012.html
For the record, this has been fixed upstream in libxml2 2.9.5, as per https://git.gnome.org/browse/libxml2/commit/?id=3daee3f159a1f962278e6f92572b7749b2b2babb
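Should one want to apply the workaround only to the versions named in this thread, a check along these lines could be used. It is a sketch under assumptions: the helper name `libxmlNeedsCatalogWorkaround` is hypothetical, the range (2.9.2 up to but not including 2.9.5) is simply what the title and the comment above state, and distribution packages may carry backported patches that a version comparison cannot detect.

```php
<?php
// Sketch only: true for libxml2 versions from 2.9.2 (where the regression
// is suspected to have appeared) up to, but not including, 2.9.5 (where the
// fix landed upstream). LIBXML_DOTTED_VERSION is the libxml version PHP
// was built against, e.g. "2.9.4".
function libxmlNeedsCatalogWorkaround(string $version = LIBXML_DOTTED_VERSION): bool
{
    return version_compare($version, '2.9.2', '>=')
        && version_compare($version, '2.9.5', '<');
}
```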