improved explanations, clarified blog post as source, added link for XPath string functions in the spec
eliasdorneles authored and dangra committed Aug 8, 2014
1 parent 037f6ab commit 65c8f05d76cea0137e1b38a33ff51eb809d8b9a2
Showing with 24 additions and 24 deletions.
  1. +24 −24 docs/topics/selectors.rst
@@ -412,55 +412,55 @@ Some XPath tips
---------------

Here are some tips that you may find useful when using XPath
with Scrapy selectors. If you are not very familiar with XPath yet,
with Scrapy selectors, based on `this post from ScrapingHub's blog`_.
If you are not very familiar with XPath yet,
you may want to take a look first at this `XPath tutorial`_.

You can also find `more XPath tips like these here.`_

.. _`XPath tutorial`: http://www.zvon.org/comp/r/tut-XPath_1.html
.. _`more XPath tips like these here.`: http://blog.scrapinghub.com/2014/07/17/xpath-tips-from-the-web-scraping-trenches/
.. _`this post from ScrapingHub's blog`: http://blog.scrapinghub.com/2014/07/17/xpath-tips-from-the-web-scraping-trenches/


Using text nodes in a condition
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When you need to use the text content as an argument to an XPath string function
like ``contains`` or ``substring-after``, avoid using ``.//text()`` and use
just ``.`` instead.
When you need to use the text content as an argument to an `XPath string function`_,
avoid using ``.//text()`` and use just ``.`` instead.

This is because ``.//text()`` selects several text nodes (a node-set),
which, when converted to a string, yields only the text of the first node. The ``.`` notation
selects the whole element, whose string value is its full text,
including the text of its descendants.
This is because the expression ``.//text()`` yields a collection of text nodes -- a *node-set*.
When a node-set is converted to a string, which happens when it is passed as an argument to
a string function like ``contains()`` or ``starts-with()``, the result is the text of the first node only.

Example::

>>> from scrapy import Selector
>>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')

Using the ``.//text()`` node-set::
Converting a *node-set* to string::

>>> sel.xpath("//a[contains(.//text(), 'Next Page')]").extract()
[]
>>> sel.xpath('//a//text()').extract() # take a peek at the node-set
[u'Click here to go to the ', u'Next Page']
>>> sel.xpath("string(//a[1]//text())").extract() # convert it to string
[u'Click here to go to the ']

Using the node::
A *node* converted to a string, however, combines its own text with the text of all its descendants::

>>> sel.xpath("//a[contains(., 'Next Page')]").extract()
>>> sel.xpath("//a[1]").extract() # select the first node
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> sel.xpath("string(//a[1])").extract() # convert it to string
[u'Click here to go to the Next Page']

You can verify this difference between the conversion to string from a node or
a *node-set* using the ``string()`` XPath function.

Converting a *node-set* to string::
So, using the ``.//text()`` node-set won't select anything in this case::

>>> sel.xpath("string(//a[1]//text())").extract()
[u'Click here to go to the ']
>>> sel.xpath("//a[contains(.//text(), 'Next Page')]").extract()
[]

Converting a *node* to string::
But using ``.`` to refer to the node itself works::

>>> sel.xpath("string(//a[1])").extract()
[u'Click here to go to the Next Page']
>>> sel.xpath("//a[contains(., 'Next Page')]").extract()
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']

.. _`XPath string function`: http://www.w3.org/TR/xpath/#section-String-Functions
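
The conversion rule described above can also be modelled without Scrapy. The following is only a minimal sketch using Python's standard ``xml.etree.ElementTree``; the helper names ``string_of_node`` and ``string_of_node_set`` are hypothetical and merely mimic how XPath 1.0 converts a node or a node-set to a string. It is not how Scrapy or lxml evaluate expressions internally.

```python
import xml.etree.ElementTree as ET

# Same markup as in the selector example.
a = ET.fromstring(
    '<a href="#">Click here to go to the <strong>Next Page</strong></a>'
)


def string_of_node(element):
    """Hypothetical helper: string value of an element, i.e. its own
    text concatenated with the text of all its descendants."""
    return "".join(element.itertext())


def string_of_node_set(nodes):
    """Hypothetical helper: string value of a node-set, i.e. the string
    value of its first node only (empty string for an empty set)."""
    return nodes[0] if nodes else ""


# Roughly what .//text() selects: one string per text node.
text_nodes = list(a.itertext())

# Analogue of contains(.//text(), 'Next Page'): only the first text node is seen.
contains_via_node_set = "Next Page" in string_of_node_set(text_nodes)

# Analogue of contains(., 'Next Page'): the whole element text is seen.
contains_via_node = "Next Page" in string_of_node(a)

print(contains_via_node_set)  # False
print(contains_via_node)      # True
```

The first check fails for the same reason the ``.//text()`` condition selects nothing: only ``'Click here to go to the '`` takes part in the comparison.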

Beware the difference between //node[1] and (//node)[1]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
