HTMLElement.php

A PHP library that uses XPath expressions to get nodes from a DOMElement node set.

The HTMLElement class inherits the methods and properties of the DOMElement class.

Will convert HTML entities to UTF-8 characters.
Will remove invalid tags.
Will not repair HTML.
Will not format HTML.

Goal

Reduce the repetitive lines of code needed to query a DOMElement with an XPath expression.

DOMDocument::loadHTML() wraps the markup in generated <html> and <body> elements, but this doesn't affect descendant XPath expressions such as //p.
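
To check which wrapper elements the parser generates and confirm that a descendant query is unaffected, a minimal sketch using only PHP's standard DOM extension could look like this (the variable names are illustrative and not part of this library). Try in a php -a console:

```php
// Sketch: inspect what DOMDocument::loadHTML() wraps around an HTML fragment.
$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML('<p>Example</p>', LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

// The fragment ends up nested inside generated wrapper elements.
$rootName = $doc->documentElement->nodeName;
$parentName = $doc->getElementsByTagName('p')->item(0)->parentNode->nodeName;
echo $rootName, ' > ', $parentName, PHP_EOL;

// A descendant query such as //p finds the paragraph regardless of the wrappers.
$xpath = new DOMXPath($doc);
$count = $xpath->query('//p')->length;
echo $count, PHP_EOL;
```

An absolute expression such as /p would not match here, because the paragraph is no longer the document root.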

Installation

git clone --depth=1 https://github.com/stemar/html-element.git

Test

One local file

wget -O phpunit https://phar.phpunit.de/phpunit-8.phar
php phpunit HTMLElementTest.php

With Composer

curl -sS https://getcomposer.org/installer | php
php composer.phar require phpunit/phpunit
vendor/bin/phpunit HTMLElementTest.php

Usage

Require this library in your code.

require_once __DIR__.'/HTMLElement.php';
$html = "<p>Example</p>";
$html_element = new HTMLElement($html);
$elements = $html_element->elements('//p');

Or add a namespace at the top of HTMLElement.php and use it in your code with autoload.

Regular way

Convert HTML to UTF-8.
Create a DOMDocument instance with the HTML.
Make an HTML object from the DOMDocument instance.
Create a DOMXPath instance with the HTML DOMDocument object.
Query the DOMXPath instance with an XPath expression to get a DOMNodeList collection.
Iterate over the DOMNodeList collection to reach each DOMNode object.
Check if each DOMNode object is a DOMElement object.
Get the nodeValue of each DOMElement object to get the content in the node.

No innerHTML property is available to a DOMElement object.
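
Because DOMElement has no innerHTML property, one common workaround is to serialize each child node and join the results. The sketch below assumes only the standard DOM extension; inner_html() is a hypothetical helper name, and this is not necessarily how HTMLElement::innerHTML() is implemented.

```php
// Hypothetical helper (not part of the DOM extension): emulate innerHTML
// by serializing each child node of the element and joining the results.
function inner_html(DOMElement $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML('<p>Plain <a href="#">Link</a></p>', LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

$p = $doc->getElementsByTagName('p')->item(0);
echo inner_html($p), PHP_EOL;
```

The example input is plain ASCII on purpose, so the output does not depend on how the parser handles character encodings.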

Parsing the DOMElement nodes

Try in a php -a console:

$html = '</z><p>Ø&Uuml; <a href="#">Link</a></p><p>Second para</p>';
$doc = new \DOMDocument();
libxml_use_internal_errors(TRUE);
$doc->loadHTML($html, LIBXML_HTML_NODEFDTD);
libxml_clear_errors();
$xpath = new \DOMXPath($doc);
$nodeList = $xpath->query('//p');
foreach ($nodeList as $node) {
    // Keep only element nodes
    if ($node->nodeType === XML_ELEMENT_NODE) {
        echo $node->nodeValue, PHP_EOL;
        echo $node->ownerDocument->saveHTML($node), PHP_EOL; // like outerHTML
    }
}

Result:

Ã�Ü Link
<p>Ã�Ü <a href="#">Link</a></p>
Second para
<p>Second para</p>

Observations

Invalid tag </z> is correctly removed.
The UTF-8 character Ø gets incorrectly double-encoded to Ã<0x98>.
The HTML entity &Uuml; gets correctly decoded to Ü.
The loop through the DOMNodeList collection has to be performed every time you query with an XPath expression.
The flow of classes you have to code through to get a DOMElement is:
- DOMDocument => DOMXPath => DOMNodeList => DOMNode => DOMElement
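
Both observations can be addressed in plain PHP with a small wrapper. The sketch below is an assumption-laden illustration, not this library's code: find_elements() is a hypothetical name, and it relies on the mbstring extension to turn non-ASCII characters into numeric entities before parsing, so the result does not depend on how the parser guesses the input encoding.

```php
// Hypothetical helper: wraps the DOMDocument => DOMXPath => DOMNodeList loop
// and returns only the DOMElement matches as a plain array.
function find_elements(string $html, string $expression): array
{
    // Turn every non-ASCII character into a numeric entity (requires mbstring).
    $encoded = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($encoded, LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    $nodeList = (new DOMXPath($doc))->query($expression);
    if ($nodeList === false) {
        return []; // the XPath expression was invalid
    }

    $elements = [];
    foreach ($nodeList as $node) {
        if ($node instanceof DOMElement) {
            $elements[] = $node;
        }
    }
    return $elements;
}

$html = '</z><p>Ø&Uuml; <a href="#">Link</a></p><p>Second para</p>';
foreach (find_elements($html, '//p') as $element) {
    echo $element->nodeValue, PHP_EOL;
}
```

Returning an array instead of a DOMNodeList lets callers use array functions and index access directly on the matches.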

New way

Instantiate an HTMLElement from HTML.
Get DOMElement nodes from this instance.
Get the nodeValue of each DOMElement object to get the content in the node.

Parsing the HTMLElement nodes

Try in a php -a console:

require 'HTMLElement.php';
$html = '</z><p>Ø&Uuml; <a href="#">Link</a></p><p>Second para</p>';
$html_element = new HTMLElement($html);
$elements = $html_element->elements('//p');
foreach ($elements as $element) {
    echo $element->nodeValue, PHP_EOL;
    echo $html_element->outerHTML($element), PHP_EOL;
    echo $html_element->innerHTML($element), PHP_EOL;
}
var_export(HTMLElement::new($html)->xpath('//p'));

Result:

ØÜ Link
<p>ØÜ <a href="#">Link</a></p>
ØÜ <a href="#">Link</a>
Second para
<p>Second para</p>
Second para
array (
  0 => '<p>ØÜ <a href="#">Link</a></p>',
  1 => '<p>Second para</p>',
)

Parsing child nodes

You can get the <a> child nodes of the <p> node by passing a context node as the second argument ($contextnode) of HTMLElement::elements().

Try in a php -a console:

require 'HTMLElement.php';
$html = '</z><p>Ø&Uuml; <a href="#">Link</a></p><p>Second para</p>';
$html_element = new HTMLElement($html);
// Get the first <p> node
$contextnode = $html_element->elements('//p')[0];
// Get the <a> nodes only under this <p> context node
$elements = $html_element->elements('//a', $contextnode);
foreach ($elements as $element) {
    echo $element->nodeValue, PHP_EOL;
    echo $element->getAttribute('href'), PHP_EOL;
}
$p = HTMLElement::new($html)->xpath('//p')[0];
echo HTMLElement::new($p)->xpath('//a')[0], PHP_EOL;

Result:

Link
#
<a href="#">Link</a>
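
When working with a context node in plain DOMXPath, note that an expression starting with // is absolute and searches the whole document, while one starting with .// is relative to the context node. The sketch below shows the difference using only the standard DOM extension; whether HTMLElement::elements() passes the expression through unchanged is an assumption that is not verified here.

```php
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML(
    '<p><a href="#one">One</a></p><p><a href="#two">Two</a></p>',
    LIBXML_HTML_NODEFDTD
);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$firstP = $xpath->query('//p')->item(0);

// '//a' is an absolute path: it searches the whole document,
// even though a context node is passed.
$absolute = $xpath->query('//a', $firstP)->length;

// './/a' is relative: it searches only the descendants of the context node.
$relative = $xpath->query('.//a', $firstP)->length;

echo $absolute, ' ', $relative, PHP_EOL;
```

With a single <a> in the document, as in the example above, both forms return the same node, so the difference only shows up once several paragraphs contain links.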
