Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DomCrawler] filterXPath limited usage since 2.3.14 #11503

Closed
dol opened this issue Jul 29, 2014 · 11 comments
Closed

[DomCrawler] filterXPath limited usage since 2.3.14 #11503

dol opened this issue Jul 29, 2014 · 11 comments

Comments

@dol
Copy link

dol commented Jul 29, 2014

I used the DomCrawler::filterXPath method to do some filtering. Since this update 80438c2, the filterXPath method can't handle a simple queries like 'child::div'.
The error is the same as in #10986 .

Warning: DOMXPath::query(): Invalid expression in /vendor/symfony/symfony/src/Symfony/Component/DomCrawler/Crawler.php on line 832

Most of the listed XPath axes have a problem with this change.

The main cause to this problem is, that the method 'relativize' will append 'self::' to every expression, that's not matched in here.

As a result to this, the XPath expression is invalid.
E.g:
Input

child::div

relativized Xpath

self::child::div

The W3C spec don't allow Xpath expressions like the relativized one.

@fabpot
Copy link
Member

fabpot commented Jul 29, 2014

ping @stof

@xabbuh
Copy link
Member

xabbuh commented Aug 2, 2014

The issue isn't limited to the child:: axis, but also affects nearly all other possible axes defined in the W3C spec.

fabpot added a commit that referenced this issue Aug 3, 2014
…erXPath() (xabbuh)

This PR was merged into the 2.3 branch.

Discussion
----------

[Component][DomCrawler] fix axes handling in Crawler::filterXPath()

| Q             | A
| ------------- | ---
| Bug fix?      | yes
| New feature?  | no
| BC breaks?    | no
| Deprecations? | no
| Tests pass?   | yes
| Fixed tickets | #11503
| License       | MIT
| Doc PR        |

Due to some limitations in the ``relativize()`` method, it was not possible to use XPath axes other than ``descendant`` or ``descendant-or-self`` in the ``filterXPath()`` method of the ``Crawler`` class. This commit adds support for the ``ancestor``, ``ancestor-or-self``, ``attribute``, ``child``, ``following``, ``following-sibling``, ``parent``, ``preceding``, ``preceding-sibling`` and ``self`` axes.

The only axis missing after this is the ``namespace`` axis. Filtering for namespace nodes returns ``DOMNameSpaceNode`` instances which can't be passed to the ``add()`` method.

Commits
-------

8dc322b fix axes handling in Crawler::filterXPath()
@fabpot fabpot closed this as completed Aug 3, 2014
@dol
Copy link
Author

dol commented Aug 5, 2014

@fabpot + @xabbuh Thank you for the fix of the axis.

Can someone explain the idea behind the relativize of the XPath expressions?
IMHO is nearly impossible to get this right due to the complexity of XPath. I tested some simple XPath expressions like 'string' (string) or count(*) (function) and the relativize routine generates an invalid XPath expression.

To get more test cases, I extracted the XPath expression from this test suite. I removed XSLT specific XPath expressions. My test case file can be found at this gist.

All the XPath expression in 01_xpath_test_suite.txt are valid when used with DOMXPath::query.
If I evaluate the same expressions via the Crawler::filterXPath it breaks in 572 of 1305 cases.
The list of breaking expressions can be found in the same gist.

A good example of the complexity is the XPath expression //ancestor::* that is converted to descendant-or-self::ancestor::*. In this case, we have a similar problem as reported in this issue. The fix of this edge case might be very complex.

With the extended test case file it's IMHO possible to fix all broken cases. But I'm not convinced if the complexity to relativize an XPath is worth the benefit.

@fabpot fabpot reopened this Aug 5, 2014
@xabbuh
Copy link
Member

xabbuh commented Aug 7, 2014

@stof Can you explain the reasons that made it necessary to introduce the relativize() method?

@stof
Copy link
Member

stof commented Aug 7, 2014

@xabbuh the issue is that filterXPath mixes 2 responsibilities:

  • filtering the current collection
  • finding children of elements in the collection

And depending of how you write the XPath, it actually mixes both. For instance, when using a CSS query h1 (which generates a XPAth query descendant-or-self::h1), it finds all h1 tags in the current collection or in any child.

the previous API was written this way because it was copying nodes to a separate document with a special root tag. So the collection elements were actually children of the root were the XPath was applied. This means that if you query for the descendant::h1, it needs to match elements of the collection, not descendants of these elements. This also means that ancestor was never working in Symfony (there was no ancestor previously), which is why it has not been taken into account in the migration. The "fix" done in #11548 is not a fix but a BC break. It changes the behavior of the method.

The way to avoid the need for relativize would be to change the API of the Crawler to have 2 different methods:

  • a method filtering the collection for real (i.e. all elements of the new collections are elements of the previous collection too, not elements deeper in the DOM).
  • a separate method for finding child elements matching a query.

If you look at jQuery, this is the equivalent of $.fn.filter() and $.fn.find() respectively. The current behavior of filterXPath has no equivalent in jQuery.

@xabbuh
Copy link
Member

xabbuh commented Aug 7, 2014

@stof We could remove the ancestor axes support again. This would also apply to the previous* and following* axis, wouldn't it?

But, the question is then, how you would handle invalid values passed to filterXPath()? Passing them to relativize() is rather confusing since the error message you get is really meaningless and confusing. Throwing an exception if the given XPath expression is supported, may be an option. But that would be a BC break, wouldn't it?

@stof
Copy link
Member

stof commented Aug 7, 2014

@xabbuh yes. and for other axes, we need to relativize them to match the behavior of the previous implementation.

for other axes, we need to turn the query in a query matching nothing, to be BC with the way it worked in previous version (using ancestor::* was not throwing an exception in 2.3.0 but matching nothing). This is already what we do for /div (and we should do the same for child::div which is strictly equivalent but was missed in the update)

@xabbuh
Copy link
Member

xabbuh commented Aug 7, 2014

@stof Obviously, I do miss something, but I don't see why the other axes have to be relativized? Do you have an example in mind (we should at least add one or more tests for them to avoid future regressions).

@stof
Copy link
Member

stof commented Aug 7, 2014

@xabbuh read again how the behavior of XPath is working. filterXPath considers that its XPath is applied on a virtual parent of the elements of the collection, not on the collection elements themselves (yeah, it is a weird API, but we are stuck with BC here)

@xabbuh
Copy link
Member

xabbuh commented Aug 9, 2014

@stof Thanks, after having a look how this was handled in old Symfony versions, I understood what you mean. #11624 should fix the behavior in a BC way.

stof added a commit that referenced this issue Aug 19, 2014
This PR was merged into the 2.3 branch.

Discussion
----------

[DomCrawler] fix the axes handling in a bc way

| Q             | A
| ------------- | ---
| Bug fix?      | yes
| New feature?  | no
| BC breaks?    | no
| Deprecations? | no
| Tests pass?   | yes
| Fixed tickets | #11503
| License       | MIT
| Doc PR        |

The previous fix in #11548 for handling XPath axes was not backward compatible. In previous Symfony versions the Crawler handled nodes by holding a "fake root node". This must be taken into account when evaluating (relativizing) XPath expressions.

Commits
-------

d26040f [DomCrawler] fix the axes handling in a bc way
@fabpot fabpot closed this as completed Dec 24, 2014
@dol
Copy link
Author

dol commented Dec 24, 2014

@stof Thank you for the BC fix. 🎁

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants