Work around a (suspected) regression in libxml2 2.9.1 -> 2.9.2 #2

Merged: mpdude merged 1 commit into master from workaround-libxml_2_9_2-regression on Nov 24, 2016

mpdude commented Nov 24, 2016

In libxml 2.9.1 -> 2.9.2, a change was made to fix the handling of URIs with rootless paths ([initial report](https://bugzilla.gnome.org/show_bug.cgi?id=731063), [initial fix](https://git.gnome.org/browse/libxml2/commit/?id=8eb55d782a2b9afacc7938694891cc6fad7b42a5), [follow-up patch](https://git.gnome.org/browse/libxml2/commit/uri.c?id=beb7281055dbf0ed4d041022a67c6c5cfd126f25)).

However, there still seems to be a bug in how relative URIs are resolved when XML catalogs are processed. In particular, when installing (for example) the Debian `w3c-dtd-xhtml` package to provide a local cached copy of the XHTML DTD, libxml will correctly follow references to other catalog files as long as they are provided as absolute `file://` URLs.

But once it reaches `file:///usr/share/xml/xhtml/schema/dtd/1.0/catalog.xml`, additional references use relative URIs like `uri="xhtml1-strict.dtd"`. libxml resolves this to `file:/usr/share/xml/xhtml/schema/dtd/1.0/xhtml1-strict.dtd`, with only a single slash after the scheme. This URL is subsequently used as a file system path and fails during a `stat()` call, which makes PHP return an error message like `failed to load external entity "file:/usr/share/xml/xhtml/schema/dtd/1.0/xhtml1-strict.dtd"`.
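
To make the difference between the two URI forms concrete, here is a small sketch of a helper that maps both `file:///…` and the single-slash `file:/…` form to the same file system path. The function name `fileUriToPath` is hypothetical and not part of this PR or of libxml; it only illustrates what a consumer would need to do to tolerate both spellings.

```php
<?php
// Hypothetical helper (not part of this PR): turn a file: URI into a local path.
// Returns null for anything that is not a local file: URI.
function fileUriToPath(string $uri): ?string
{
    // Form with an empty authority: file:///abs/path
    if (strncmp($uri, 'file:///', 8) === 0) {
        return substr($uri, 7); // keep the leading slash of the path
    }

    // Single-slash form as produced by the catalog resolution: file:/abs/path
    if (strncmp($uri, 'file:/', 6) === 0 && strncmp($uri, 'file://', 7) !== 0) {
        return substr($uri, 5);
    }

    // Relative references, other schemes, or file://host/... are left alone.
    return null;
}
```

Both spellings of the DTD location above map to `/usr/share/xml/xhtml/schema/dtd/1.0/xhtml1-strict.dtd` with this helper, whereas passing the raw `file:/…` string to a file system call fails as described.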

An earlier report of this problem can be found on the [libxml mailing list](https://mail.gnome.org/archives/xml/2014-December/msg00000.html). A second mail to this list is currently awaiting moderation and will hopefully become available in the [November 2016 archive](https://mail.gnome.org/archives/xml/2016-November/date.html#00000).

As a workaround (and to make us somewhat independent of the exact libxml version installed on the underlying OS), we're including the four DTD and entity definition files relevant for `XHTML 1.0 Strict` in this repository and using a custom libxml entity loader in PHP userland to load these files.

Note that this change effectively cuts out XML catalog support in libxml when used from PHP. It may cause unexpected side effects, as it seems only one external entity loader can be registered at a time.
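
A minimal sketch of such a userland loader, using PHP's `libxml_set_external_entity_loader()`, might look like the following. The `dtd/` directory, the constant names and the function name `resolveBundledEntity` are assumptions made for illustration; the actual code in this PR may be organized differently.

```php
<?php
// Assumption: the four bundled files live in a dtd/ directory next to this script.
const DTD_DIR = __DIR__ . '/dtd';

// Public identifiers relevant for XHTML 1.0 Strict, mapped to the bundled files.
const XHTML_STRICT_FILES = [
    '-//W3C//DTD XHTML 1.0 Strict//EN'       => 'xhtml1-strict.dtd',
    '-//W3C//ENTITIES Latin 1 for XHTML//EN' => 'xhtml-lat1.ent',
    '-//W3C//ENTITIES Symbols for XHTML//EN' => 'xhtml-symbol.ent',
    '-//W3C//ENTITIES Special for XHTML//EN' => 'xhtml-special.ent',
];

// Hypothetical resolver: returns a local path for known public IDs, null otherwise.
function resolveBundledEntity(?string $publicId, ?string $systemId, array $context): ?string
{
    if ($publicId !== null && isset(XHTML_STRICT_FILES[$publicId])) {
        return DTD_DIR . '/' . XHTML_STRICT_FILES[$publicId];
    }

    // Assumption for this sketch: everything else is left unresolved.
    return null;
}

// Registering a loader replaces whatever loader was active before.
libxml_set_external_entity_loader('resolveBundledEntity');
```

Because the loader answers from a fixed map, resolution no longer depends on the system catalogs or on the installed libxml version. Returning `null` for unknown identifiers is a design choice of this sketch: it leaves every other external entity unresolved, which is one way the side effects mentioned above can show up.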

## Alternative approach using `XML_CATALOG_FILES`

Another approach we tried was a custom XML catalog like this:

```xml
<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN" "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
    <public publicId="-//W3C//DTD XHTML 1.0 Strict//EN" uri="xhtml1-strict.dtd" />
    <public publicId="-//W3C//ENTITIES Latin 1 for XHTML//EN" uri="xhtml-lat1.ent" />
    <public publicId="-//W3C//ENTITIES Symbols for XHTML//EN" uri="xhtml-symbol.ent" />
    <public publicId="-//W3C//ENTITIES Special for XHTML//EN" uri="xhtml-special.ent" />
</catalog>
```

When pointing to this file using the `XML_CATALOG_FILES` environment variable, file resolution works because no `file://`-based URLs are ever introduced during the resolution process.

The difficulty, however, is to set this environment variable early enough: It seems the setting is read during libxml initialization while PHP starts up, so a simple `putenv()` from PHP comes too late. When running PHP as an Apache module, the environment variable has to be passed when starting Apache (`/etc/apache2/envvars`, for example), not from within virtual host definitions (`SetEnv`).

This makes it hard to install/maintain the necessary setting and affects catalog processing at a lower (system-wide) level, which is why we abandoned it.

mpdude merged commit 44949f7 into master on Nov 24, 2016

mpdude deleted the workaround-libxml_2_9_2-regression branch on November 24, 2016, 13:42

mpdude commented Nov 24, 2016

Found the issue in libxml, message here: https://mail.gnome.org/archives/xml/2016-November/msg00012.html

mpdude commented Oct 7, 2017

For the record, this has been fixed upstream in libxml2 2.9.5, as per https://git.gnome.org/browse/libxml2/commit/?id=3daee3f159a1f962278e6f92572b7749b2b2babb
