A PHP library to use XPath expressions to get nodes of a DOMElement node set.
The HTMLElement class inherits the methods and properties of the DOMElement class.
- Will convert HTML entities to UTF-8 characters.
- Will remove invalid tags.
- Will not repair HTML.
- Will not format HTML.
Reduce the repetitive lines of code needed to query a DOMElement
with an XPath expression.
DOMDocument::loadHTML() will add a <body> tag as the root node, but this doesn't affect XPath expressions.
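As a quick illustration of the note above, here is a minimal sketch using only the built-in DOM extension (not this library). The sample markup and variable names are illustrative assumptions. It shows the wrapper elements that loadHTML() adds and that a `//p` query still finds the paragraph.

```php
<?php
// Load an HTML fragment without a DOCTYPE; libxml wraps it in <html><body>.
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML('<p>Example</p>', LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

// Serialize the whole document to see the added wrapper elements.
$serialized = trim($doc->saveHTML());
echo $serialized, PHP_EOL; // <html><body><p>Example</p></body></html>

// A descendant query is unaffected by the wrapper.
$xpath = new DOMXPath($doc);
$paragraphs = $xpath->query('//p');
$count = $paragraphs->length;
$parentName = $paragraphs->item(0)->parentNode->nodeName;
echo $count, ' ', $parentName, PHP_EOL; // 1 body
```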
git clone --depth=1 https://github.com/stemar/html-element.git
wget -O phpunit https://phar.phpunit.de/phpunit-8.phar
php phpunit HTMLElementTest.php
curl -sS https://getcomposer.org/installer | php
php composer.phar require phpunit/phpunit
vendor/bin/phpunit HTMLElementTest.php
Require this library in your code.
require_once __DIR__.'/HTMLElement.php';
$html = "<p>Example</p>";
$html_element = new HTMLElement($html);
$elements = $html_element->elements('//p');
Or add a namespace at the top of HTMLElement.php and use it in your code with autoload.
- Convert HTML to UTF-8.
- Create a DOMDocument instance with the HTML.
- Make an HTML object from the DOMDocument instance.
- Create a DOMXPath instance with the HTML DOMDocument object.
- Query the DOMXPath instance with an XPath expression to get a DOMNodeList collection.
- Iterate over the DOMNodeList collection to reach each DOMNode object.
- Check if each DOMNode object is a DOMElement object.
- Get the nodeValue of each DOMElement object to get the content in the node.
No innerHTML property is available to a DOMElement object.
Try in a php -a console:
$html = '</z><p>Ø&Uuml; <a href="#">Link</a></p><p>Second para</p>';
$doc = new \DOMDocument();
libxml_use_internal_errors(TRUE);
$doc->loadHTML($html, LIBXML_HTML_NODEFDTD);
libxml_clear_errors();
$xpath = new \DOMXPath($doc);
$nodeList = $xpath->query('//p');
if ($nodeList->length) {
    foreach ($nodeList as $node) {
        // Only element nodes carry markup worth printing
        if ($node->nodeType === XML_ELEMENT_NODE) {
            echo $node->nodeValue, PHP_EOL;
            echo $node->ownerDocument->saveHTML($node), PHP_EOL; // like outerHTML
        }
    }
}
Result:
Ã�Ü Link
<p>Ã�Ü <a href="#">Link</a></p>
Second para
<p>Second para</p>
- Invalid tag </z> is correctly removed.
- The UTF-8 character Ø gets incorrectly double-encoded to Ã<0x98>.
- The HTML entity &Uuml; gets correctly decoded to Ü.
- The loop through the DOMNodeList collection has to be performed every time you query with an XPath expression.
- The flow of classes you have to code through to get a DOMElement is:
  - DOMDocument => DOMXPath => DOMNodeList => DOMNode => DOMElement
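The double-encoding noted above comes from libxml treating the input bytes as ISO-8859-1 when no charset is declared. One common workaround with plain DOMDocument is to turn non-ASCII characters into numeric entities before loading. The sketch below assumes the mbstring extension is available and is not necessarily how HTMLElement does its own conversion.

```php
<?php
// Sample markup with a raw UTF-8 character (Ø) and a named entity (&Uuml;).
$html = '<p>Ø&Uuml; <a href="#">Link</a></p>';

// Conversion map [start, end, offset, mask]: every code point above ASCII.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$encoded = mb_encode_numericentity($html, $convmap, 'UTF-8');

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($encoded, LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

// The DOM returns UTF-8 strings, so both characters come back intact.
$xpath = new DOMXPath($doc);
$text = $xpath->query('//p')->item(0)->nodeValue;
echo $text, PHP_EOL; // ØÜ Link
```

Existing entities such as &Uuml; are left alone because the ampersand is plain ASCII, so only the raw multibyte characters are rewritten.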
- Instantiate an HTMLElement from HTML.
- Get DOMElement nodes from this instance.
- Get the nodeValue of each DOMElement object to get the content in the node.
Try in a php -a console:
require 'HTMLElement.php';
$html = '</z><p>Ø&Uuml; <a href="#">Link</a></p><p>Second para</p>';
$html_element = new HTMLElement($html);
$elements = $html_element->elements('//p');
foreach ($elements as $element) {
    echo $element->nodeValue, PHP_EOL;
    echo $html_element->outerHTML($element), PHP_EOL;
    echo $html_element->innerHTML($element), PHP_EOL;
}
var_export(HTMLElement::new($html)->xpath('//p'));
Result:
ØÜ Link
<p>ØÜ <a href="#">Link</a></p>
ØÜ <a href="#">Link</a>
Second para
<p>Second para</p>
Second para
array (
  0 => '<p>ØÜ <a href="#">Link</a></p>',
  1 => '<p>Second para</p>',
)
You can get the <a> child nodes of the <p> node by using the second argument ($contextnode) in HTMLElement::nodes().
Try in a php -a console:
require 'HTMLElement.php';
$html = '</z><p>Ø&Uuml; <a href="#">Link</a></p><p>Second para</p>';
$html_element = new HTMLElement($html);
// Get the first <p> node
$contextnode = $html_element->elements('//p')[0];
// Get the <a> nodes only under this <p> context node
$elements = $html_element->elements('//a', $contextnode);
foreach ($elements as $element) {
    echo $element->nodeValue, PHP_EOL;
    echo $element->getAttribute('href'), PHP_EOL;
}
$p = HTMLElement::new($html)->xpath('//p')[0];
echo HTMLElement::new($p)->xpath('//a')[0], PHP_EOL;
Result:
Link
#
<a href="#">Link</a>
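For comparison, here is a minimal sketch of how a context node behaves with the built-in DOMXPath class alone. The sample markup and variable names are illustrative assumptions, and the sketch does not describe how HTMLElement handles the expression internally. With plain DOMXPath, an expression starting with `//` is still evaluated from the document root, while `.//` is evaluated relative to the context node.

```php
<?php
// Two paragraphs, each with its own link.
$html = '<p>First <a href="#one">One</a></p><p>Second <a href="#two">Two</a></p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html, LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$firstP = $xpath->query('//p')->item(0);

// Absolute path: searches the whole document despite the context node.
$absolute = $xpath->query('//a', $firstP);
// Relative path: searches only the descendants of the context node.
$relative = $xpath->query('.//a', $firstP);

echo $absolute->length, PHP_EOL; // 2
echo $relative->length, PHP_EOL; // 1
echo $relative->item(0)->getAttribute('href'), PHP_EOL; // #one
```

In the single-link example above both forms return the same node, so the difference only shows up once the document contains matching nodes outside the context node.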