PHP Warning: DOMNode::cloneNode(): ID u1 already defined in .../src/QueryPath/DOMQuery.php on line 3176 #168

Open
x-yuri opened this issue Jun 8, 2015 · 13 comments

@x-yuri x-yuri commented Jun 8, 2015

#!/usr/bin/env php
<?php
require 'vendor/autoload.php';
$qp = htmlqp('
<!doctype html>
<html>
<body>
<ul id="u1">
    <li>
    <li>
</ul>
</body>
</html>
', '#u1')->children();

$ php --version
PHP 5.6.9 (cli) (built: May 15 2015 10:24:33) 
Copyright (c) 1997-2015 The PHP Group
Zend Engine v2.6.0, Copyright (c) 1998-2015 Zend Technologies
$ php -i
...
libxml Version => 2.9.2
...

What am I doing wrong?

UPD: With php-5.4.10 and libxml-2.7.8 (Debian squeeze), and with php-5.6.7 and libxml-2.9.2 (Debian jessie), it doesn't trigger warnings. The warning is supposedly triggered here. I can't find --without-valid in the debian directory of the source package, so it must have nothing to do with this particular configure option. What else should I check?

Owner

@technosophos technosophos commented Jun 8, 2015

Whoa. That looks like a bug in libxml (or PHP's usage of it). Calling children() should not clone any nodes at all.

The problem is that a DOM element can't have a duplicate of an existing ID attribute. Something internally is cloning the ul with the ID attribute, and it's causing the failure you saw.
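
For readers who want to check their own build, here is a minimal sketch of the situation described above, using only PHP's bundled DOM extension rather than QueryPath. It assumes that deep-cloning an element that carries an id inside the same document is enough to exercise the same libxml code path; whether a warning is actually printed depends on the PHP and libxml builds in use, so no output is claimed here.

```php
<?php
// Sketch only: plain DOM, no QueryPath. The markup mirrors the report above.
$doc = new DOMDocument();
$doc->loadHTML('<!doctype html><html><body><ul id="u1"><li></li><li></li></ul></body></html>');

// Pick the <ul id="u1"> element.
$ul = $doc->getElementsByTagName('ul')->item(0);

// Deep-copy it. The copy carries the same id="u1" attribute as the original,
// which is the duplicate that the warning in this issue complains about
// on affected builds.
$copy = $ul->cloneNode(true);

// Show what was copied; the clone is detached from the tree at this point.
var_dump($copy->getAttribute('id'));
var_dump($copy->childNodes->length);
```

If the warning shows up with this script as well, the problem is independent of QueryPath on that machine.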

Author

@x-yuri x-yuri commented Jun 8, 2015

Here's the backtrace:

#0  QueryPath\DOMQuery->cloneAll() called at [/home/yuri/_/2/vendor/querypath/querypath/src/QueryPath/DOMQuery.php:3195]
#1  QueryPath\DOMQuery->__clone() called at [/home/yuri/_/2/vendor/querypath/querypath/src/QueryPath/DOMQuery.php:3151]
#2  QueryPath\DOMQuery->inst(SplObjectStorage Object (), , Array ([ignore_parser_warnings] => 1,[convert_to_encoding] => ISO-8859-1,[convert_from_encoding] => auto,[use_parser] => html,[parser_flags] => ,[omit_xml_declaration] => ,[replace_entities] => ,[exception_level] => 771,[escape_xhtml_js_css_sections] => /* \1 */)) called at [/home/yuri/_/2/vendor/querypath/querypath/src/QueryPath/DOMQuery.php:2075]
#3  QueryPath\DOMQuery->children() called at [/home/yuri/_/2/2.php:14]

And yet it clones. Moreover, on those two debian boxes, it executes cloneAll as well, but triggers no warnings.

UPD: More precisely, it executes cloneNode on one element, but triggers no warnings. One possible explanation would be that PHP started displaying this warning.

Owner

@technosophos technosophos commented Jun 8, 2015

Hmm. Yes, I see. I'll need to look at this. I'm having a hard time seeing
how one version of PHP could choke on this code, while others are fine.


Author

@x-yuri x-yuri commented Jun 8, 2015

I seem to have found the culprit. The package was built for jessie before this commit happened. So this has to do with libxml after all.

Owner

@technosophos technosophos commented Jun 8, 2015

Ah! Good find!

On Mon, Jun 8, 2015 at 4:49 PM, x-yuri notifications@github.com wrote:

I seem to have found the culprit
https://git.gnome.org/browse/libxml2/commit/valid.c?id=a16eb968075a82ec33b2c1e77db8909a35b44620.
The package was built for jessie before this commit happened.



Author

@x-yuri x-yuri commented Jun 9, 2015

And the only workaround I can think of right now is removing id before doing anything else:

htmlqp('...', '#u1')->removeAttr('id')->children();

Unless the children have ids that is :)

Author

@x-yuri x-yuri commented Jun 9, 2015

In which case the best I could think of is this:

function fix_children($el) {
    foreach ((new DOMXPath($el->document()))->query('.//*[@id]') as $_el) {
        $_el->removeAttribute('id');
    }
    return $el;
}
fix_children(htmlqp('...', '#u1')->removeAttr('id'))->children();

Owner

@technosophos technosophos commented Jun 9, 2015

Unfortunately, that's probably what you'll have to do. I guess you could simply rename the attribute from id to something else. Only id is treated as special by the libxml library.
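
A rough sketch of the renaming idea, written against plain DOM so it can be tried without QueryPath. The helper name rename_id_attributes and the replacement attribute name data-orig-id are made up for illustration; neither comes from QueryPath or libxml. With QueryPath, the document could presumably be obtained via $qp->document(), as in the fix_children() snippet earlier in this thread.

```php
<?php
// Hypothetical helper: move every id attribute to another attribute name,
// so that no element in the document carries an attribute named "id".
function rename_id_attributes(DOMDocument $doc, $newName = 'data-orig-id')
{
    $xpath = new DOMXPath($doc);
    $count = 0;
    // The XPath result is a snapshot, so changing attributes while looping is safe.
    foreach ($xpath->query('//*[@id]') as $el) {
        $el->setAttribute($newName, $el->getAttribute('id'));
        $el->removeAttribute('id');
        $count++;
    }
    return $count;
}

// Small demonstration on a throwaway document.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><ul id="u1"><li id="first"></li><li></li></ul></body></html>');
$renamed = rename_id_attributes($doc);
$ul = $doc->getElementsByTagName('ul')->item(0);
var_dump($renamed, $ul->getAttribute('data-orig-id'));
```

The obvious cost is that selectors such as #u1 stop matching once the attribute is renamed, so the element has to be selected first (or matched by the new attribute instead).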

Author

@x-yuri x-yuri commented Aug 12, 2015

An even better workaround would probably be:

#!/usr/bin/env php
<?php
require 'vendor/autoload.php';

function set_error_handler_block($block, $error_handler) {
    $prv_error_handler = set_error_handler(function() use (&$prv_error_handler, $error_handler) {
        return call_user_func_array($error_handler, array_merge([$prv_error_handler], func_get_args()));
    });
    try {
        return call_user_func($block);
    } finally {
        restore_error_handler();
    }
}

set_error_handler_block(function() {
    $qp = htmlqp(
        '<!doctype html>
        <html>
        <body>
        <ul id="u1">
            <li>
            <li>
        </ul>
        </body>
        </html>
        ', '#u1')->children();
}, function($prv_error_handler, $errno, $errstr, $errfile, $errline) {
    # printf("error: %u %s\n", $errno, $errstr);
    if ($errno == E_WARNING
    && preg_match('/^DOM.+?::.+?\(\): ID .*? already defined/', $errstr))
        return;   # ignore error
    return $prv_error_handler
        ? call_user_func_array($prv_error_handler, array_slice(func_get_args(), 1))
        : FALSE;
});

One might need to tailor the regexp to one's needs, though.


@marcimat marcimat commented Sep 11, 2015

Same problem here, also with $qp->top():

  • PHP Version 5.6.4-4ubuntu6.2 (from an up-to-date Ubuntu 15.04)
  • libxml Version 2.9.2

[edit] A fix was committed to libxml yesterday: https://bugzilla.gnome.org/show_bug.cgi?id=737840#c9

Owner

@technosophos technosophos commented Oct 10, 2015

FWIW, if you use the HTML5 parser, you will not hit this error, since that uses a native PHP parser.


@muka muka commented Oct 23, 2015

Hi, this happens to me too. How may we use the HTML5 parser?

UPDATE: Ok, sorted out

composer update querypath/QueryPath dev-master
$crawler = \QueryPath::withHTML5($raw);

Thank you
Luca


@logbon72 logbon72 commented Dec 16, 2015

libxml_use_internal_errors(true);

Suppressed the error message in my case.
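
A slightly fuller sketch of that approach, assuming the goal is to keep the messages available for inspection rather than discard them silently. It uses plain DOM instead of QueryPath, and whether any error is collected at all depends on the libxml build. The previous setting is restored afterwards so the switch does not leak into unrelated code.

```php
<?php
// Collect libxml errors internally instead of letting PHP raise them as warnings.
// The return value is the previous setting, kept here so it can be restored.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML('<!doctype html><html><body><ul id="u1"><li></li><li></li></ul></body></html>');
$copy = $doc->getElementsByTagName('ul')->item(0)->cloneNode(true);

// Inspect whatever libxml reported while parsing and cloning (possibly nothing).
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("libxml [%d] line %d: %s\n", $error->code, $error->line, trim($error->message));
}

// Empty the buffer and put the old setting back.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```

Note that this hides every libxml message raised in between, not only the "ID ... already defined" one, so it is worth keeping the guarded region small.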
