This repository has been archived by the owner on Jan 8, 2020. It is now read-only.

Zend\Dom\Query and special UTF-8 characters #7618

Closed
mtrippodi opened this issue Aug 26, 2015 · 3 comments

@mtrippodi

use Zend\Dom\Query;
use Zend\Debug\Debug;

$html = '<div><h1>ßüöä</h1></div>';
$dom = new Query($html);
$nodes = $dom->execute('h1');
Debug::dump($nodes->current()->nodeValue);

...will result in something like:

�üöä

$html = '<div><h1>ßüöä</h1></div>';
$dom = new Query(utf8_decode($html));
$nodes = $dom->execute('h1');
Debug::dump($nodes->current()->nodeValue);

... will solve the problem and result in correct rendering.

For convenience I extended Zend\Dom\Query:

<?php

namespace MyNamespace\Dom;

use Zend\Dom\Query as ZF2Query;

class Query extends ZF2Query
{

    /**
     * Set document to query. If the encoding is UTF-8, decode it to ISO-8859-1 first.
     *
     * @param  string $document
     * @param  null|string $encoding Document encoding
     * @return Query
     */
    public function setDocument($document, $encoding = null)
    {
        if (0 === strlen($document)) {
            return $this;
        }

        $_encoding = empty($encoding) ? $this->getEncoding() : $encoding;
        if ($_encoding == 'UTF-8') {
            $document = utf8_decode($document);
        }

        return parent::setDocument($document, $encoding);
    }
}

Now I wonder whether this could be implemented in Zend\Dom\Query itself. Or am I missing something, and there is a better solution?
Thanks
m.

@mtrippodi

OK, forget my first "solution". It's bad because e.g. ...

$html = '<div><h1>€</h1></div>';
$dom = new Query(utf8_decode($html));
$nodes = $dom->execute('h1');
Debug::dump($nodes->current()->nodeValue); 

...will result in:

?

This is because all utf8_decode() does is convert a UTF-8 string to ISO-8859-1. That is no good, because UTF-8 can represent many more characters than ISO-8859-1, and any character outside that range (such as €) comes out as a question mark. See this comment at PHP Man.
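To make the loss concrete, here is a minimal sketch. It assumes the mbstring extension is available and uses mb_convert_encoding() as a stand-in for utf8_decode(), which is deprecated as of PHP 8.2. The helper name toLatin1() is made up for this illustration, and the file is assumed to be saved as UTF-8.

```php
<?php
// Sketch only: toLatin1() is a hypothetical helper, not part of Zend\Dom.
// It mirrors what utf8_decode() does: re-encode UTF-8 as ISO-8859-1.
function toLatin1(string $utf8): string
{
    // Characters with no ISO-8859-1 equivalent are replaced by mbstring's
    // substitute character, which is "?" unless it has been reconfigured.
    return mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
}

echo bin2hex(toLatin1('ßüöä')), "\n"; // each character fits in one byte
echo toLatin1('€'), "\n";             // no Latin-1 equivalent, so it is replaced
```

The first string survives because every character in it exists in ISO-8859-1; the euro sign does not, so the original character cannot be recovered afterwards.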

The real problem is that DOMDocument::loadHTML() by default treats the source string as ISO-8859-1. Unfortunately, the only way to change this behavior is to declare the encoding in the HTML head at the beginning of the source string. This comment at PHP Man still seems to apply, even though it is 10 years old and UTF-8 is so common nowadays!
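As a rough illustration of declaring the encoding in the head, the sketch below uses plain DOMDocument (so it runs without the framework) and wraps the fragment in a document whose head carries a meta Content-Type declaration. The helper h1Text() is hypothetical. The meta declaration is only one way to pass the hint; the class further down uses an `<?xml encoding="…">` prefix instead.

```php
<?php
// Sketch only: h1Text() is a hypothetical helper, not part of Zend\Dom.
function h1Text(string $fragment): ?string
{
    // Declare the encoding in the head, ahead of the UTF-8 content.
    $html = '<html><head>'
          . '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
          . '</head><body>' . $fragment . '</body></html>';

    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Return the text of the first <h1>, or null if there is none.
    $h1 = $doc->getElementsByTagName('h1')->item(0);
    return $h1 === null ? null : $h1->nodeValue;
}

var_dump(h1Text('<div><h1>ßüöä€</h1></div>') === 'ßüöä€');
```

Because the input is never re-encoded, characters such as € that have no ISO-8859-1 equivalent come through intact.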

So, based on this comment I again extended Zend\Dom\Query as follows:

<?php

namespace MyNamespace\Dom;

use Zend\Dom\Query as ZF2Query;

class Query extends ZF2Query
{

    /**
     * Set document to query
     *
     * @param  string $document
     * @param  null|string $encoding Document encoding
     * @return Query
     */
    public function setDocument($document, $encoding = null)
    {
        if (0 === strlen($document)) {
            return $this;
        }

        $prepend = '';
        $_encoding = empty($encoding) ? $this->getEncoding() : $encoding;
        if (!empty($_encoding) && strtolower($_encoding) != 'iso-8859-1') {
            $prepend = sprintf('<?xml encoding="%s">', $_encoding);
        }

        // breaking XML declaration to make syntax highlighting work
        if ('<' . '?xml' == substr(trim($document), 0, 5)) {
            if (preg_match('/<html[^>]*xmlns="([^"]+)"[^>]*>/i', $document, $matches)) {
                $this->xpathNamespaces[] = $matches[1];
                return $this->setDocumentXhtml($prepend . $document, $encoding);
            }
            return $this->setDocumentXml($document, $encoding);
        }
        if (strstr($document, 'DTD XHTML')) {
            return $this->setDocumentXhtml($prepend . $document, $encoding);
        }
        return $this->setDocumentHtml($prepend . $document, $encoding);
    }
}

Still, two questions remain:

  • Is this the best solution?
  • Should a solution be implemented in Zend\Dom\Query?

@croensch

AFAIK, if no header is present, the passed encoding is used; if the header is present, the passed encoding is ignored. So if your documents are always in ISO-8859-1, why not just try setDocument() as it is?

@GeeH

GeeH commented Jun 28, 2016

This issue has been moved from the zendframework repository as part of the bug migration program as outlined here - http://framework.zend.com/blog/2016-04-11-issue-closures.html
New issue can be found at: zendframework/zend-dom#10

@GeeH closed this as completed Jun 28, 2016