stemar/html-element

HTMLElement.php

A PHP library to use XPath expressions to get nodes of a DOMElement node set.

The HTMLElement class inherits the methods and properties of the DOMElement class.

  • Will convert HTML entities to UTF-8 characters.
  • Will remove invalid tags.
  • Will not repair HTML.
  • Will not format HTML.

Goal

Reduce the repetitive lines of code needed to query a DOMElement with an XPath expression.

DOMDocument::loadHTML() wraps an HTML fragment in <html> and <body> tags, but this doesn't affect XPath expressions such as //p.
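A small sketch of the point above, using only PHP's standard DOM extension rather than this library: the parser supplies the missing wrapper markup for a bare fragment, yet an expression like //p still finds the paragraph. The variable names are illustrative only.

```php
<?php
// Parse a bare fragment; libxml errors are collected instead of printed.
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML('<p>Example</p>', LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

// The parser has added a <body> element that was not in the input.
$body_count = $doc->getElementsByTagName('body')->length;

// '//p' matches at any depth, so the added wrappers do not matter.
$xpath = new DOMXPath($doc);
$p_count = $xpath->query('//p')->length;

echo $body_count, ' ', $p_count, PHP_EOL;
```

An expression anchored at the root, such as /p, would be affected by the wrappers; the // prefix is what makes the query independent of them.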

Installation

git clone --depth=1 https://github.com/stemar/html-element.git

Test

One local file

wget -O phpunit https://phar.phpunit.de/phpunit-8.phar
php phpunit HTMLElementTest.php

With Composer

curl -sS https://getcomposer.org/installer | php
php composer.phar require phpunit/phpunit
vendor/bin/phpunit HTMLElementTest.php

Usage

Require this library in your code.

require_once __DIR__.'/HTMLElement.php';
$html = "<p>Example</p>";
$html_element = new HTMLElement($html);
$elements = $html_element->elements('//p');

Or add a namespace at the top of HTMLElement.php and use it in your code with autoload.

Regular way

  1. Convert the HTML to UTF-8.
  2. Create a DOMDocument instance with the HTML.
  3. Make an HTML object from the DOMDocument instance.
  4. Create a DOMXPath instance with the HTML DOMDocument object.
  5. Query the DOMXPath instance with an XPath expression to get a DOMNodeList collection.
  6. Iterate over the DOMNodeList collection to reach each DOMNode object.
  7. Check whether each DOMNode object is a DOMElement object.
  8. Read the nodeValue of each DOMElement object to get the content of the node.

No innerHTML property is available to a DOMElement object.
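Because DOMElement has no innerHTML property, a common workaround is to serialize each child node with DOMDocument::saveHTML() and concatenate the results. The sketch below is a minimal illustration of that idea. inner_html is a hypothetical helper name, and this is not necessarily how this library implements its own innerHTML() method.

```php
<?php
// Hypothetical helper (an assumption for illustration, not this library's code):
// approximate innerHTML by serializing every child node of $node.
function inner_html(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML($child) serializes one node, including its own tags.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML('<p>See <a href="#">Link</a></p>', LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

$p = $doc->getElementsByTagName('p')->item(0);
$inner = inner_html($p);
echo $inner, PHP_EOL;
```

Calling saveHTML($p) on the paragraph itself would include the surrounding <p> tags, which is the outerHTML equivalent.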

Parsing the DOMElement nodes

Try in a php -a console:

$html = '</z><p>Ø&Uuml; <a href="#">Link</a></p><p>Second para</p>';
$doc = new \DOMDocument();
libxml_use_internal_errors(TRUE);
$doc->loadHTML($html, LIBXML_HTML_NODEFDTD);
libxml_clear_errors();
$xpath = new \DOMXPath($doc);
$nodeList = $xpath->query('//p');
if ($nodeList->length) {
    foreach ($nodeList as $i => $node) {
        if ($node->nodeType === XML_ELEMENT_NODE) {
            echo $node->nodeValue, PHP_EOL;
            echo $node->ownerDocument->saveHTML($node), PHP_EOL; // like outerHTML
        }
    }
}

Result:

Ã�Ü Link
<p>Ã�Ü <a href="#">Link</a></p>
Second para
<p>Second para</p>

Observations

  • Invalid tag </z> is correctly removed.
  • The UTF-8 character Ø gets incorrectly double-encoded to Ã<0x98>, because the parser treats input that declares no charset as ISO-8859-1 rather than UTF-8.
  • The HTML entity &Uuml; gets correctly encoded to Ü.
  • The loop through the DOMNodeList collection has to be performed every time you query with an XPath expression.
  • The flow of classes you have to code through to get a DOMElement is:
    • DOMDocument => DOMXPath => DOMNodeList => DOMNode => DOMElement
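One possible way to avoid the double-encoding seen above with plain DOMDocument is to turn every non-ASCII character into a numeric entity before parsing, so the parser never has to guess the byte encoding. This is a sketch under two assumptions: the mbstring extension is loaded and the input string is UTF-8. It is not necessarily what HTMLElement does internally.

```php
<?php
// Assumption: this file and the $html string are UTF-8, and mbstring is available.
$html = '<p>Ø&Uuml; <a href="#">Link</a></p>';

// Map every code point from U+0080 upward to a numeric entity (Ø becomes &#216;).
// ASCII characters, including the existing &Uuml; entity, are left untouched.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$ascii_html = mb_encode_numericentity($html, $convmap, 'UTF-8');

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($ascii_html, LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

// DOM strings are returned as UTF-8, so the entities come back as real characters.
$xpath = new DOMXPath($doc);
$text = $xpath->query('//p')->item(0)->nodeValue;
echo $text, PHP_EOL;
```

The input handed to the parser is pure ASCII, so the result no longer depends on which default encoding the parser assumes.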

New way

  1. Instantiate an HTMLElement from HTML.
  2. Get DOMElement nodes from this instance.
  3. Get the nodeValue of each DOMElement object to get the content in the node.

Parsing the HTMLElement nodes

Try in a php -a console:

require 'HTMLElement.php';
$html = '</z><p>Ø&Uuml; <a href="#">Link</a></p><p>Second para</p>';
$html_element = new HTMLElement($html);
$elements = $html_element->elements('//p');
foreach ($elements as $element) {
    echo $element->nodeValue, PHP_EOL;
    echo $html_element->outerHTML($element), PHP_EOL;
    echo $html_element->innerHTML($element), PHP_EOL;
}
var_export(HTMLElement::new($html)->xpath('//p'));

Result:

ØÜ Link
<p>ØÜ <a href="#">Link</a></p>
ØÜ <a href="#">Link</a>
Second para
<p>Second para</p>
Second para
array (
  0 => '<p>ØÜ <a href="#">Link</a></p>',
  1 => '<p>Second para</p>',
)

Parsing child nodes

You can get the <a> child nodes of the <p> node by passing the second argument ($contextnode) to HTMLElement::elements().

Try in a php -a console:

require 'HTMLElement.php';
$html = '</z><p>Ø&Uuml; <a href="#">Link</a></p><p>Second para</p>';
$html_element = new HTMLElement($html);
// Get the first <p> node
$contextnode = $html_element->elements('//p')[0];
// Get the <a> nodes only under this <p> context node
$elements = $html_element->elements('//a', $contextnode);
foreach ($elements as $element) {
    echo $element->nodeValue, PHP_EOL;
    echo $element->getAttribute('href'), PHP_EOL;
}
$p = HTMLElement::new($html)->xpath('//p')[0];
echo HTMLElement::new($p)->xpath('//a')[0], PHP_EOL;

Result:

Link
#
<a href="#">Link</a>
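For comparison, plain DOMXPath also accepts a context node as the second argument to query(). The sketch below shows that a relative expression (.//a) is what restricts the search to that node, while an absolute expression (//a) still starts at the document root. This describes DOMXPath only. How HTMLElement::elements() treats an absolute expression with a context node is not covered here.

```php
<?php
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML(
    '<p><a href="#one">One</a></p><p><a href="#two">Two</a></p>',
    LIBXML_HTML_NODEFDTD
);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$second_p = $xpath->query('//p')->item(1);

// './/a' is evaluated relative to the context node.
$relative = $xpath->query('.//a', $second_p);
// '//a' is absolute: it searches the whole document despite the context node.
$absolute = $xpath->query('//a', $second_p);

echo $relative->length, ' ', $relative->item(0)->getAttribute('href'), PHP_EOL;
echo $absolute->length, PHP_EOL;
```

With a document that contains links in more than one paragraph, the difference between the two expressions becomes visible in the node counts.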
