adding some xpath tips to selectors docs

Aug 8, 2014
@@ -407,6 +407,129 @@ inside another ``itemscope``.
.. _regular expressions:
.. _set manipulation:

Some XPath tips

Here are some tips that you may find useful when using XPath
with Scrapy selectors. If you are not very familiar with XPath yet,
you may want to take a look at this `XPath tutorial`_ first.

You can also find `more XPath tips like these here.`_

.. _`XPath tutorial`:
.. _`more XPath tips like these here.`:

Using text nodes in a condition

When you need to use the text content as an argument to an XPath string function
like ``contains`` or ``substring-after``, avoid using ``.//text()`` and use
just ``.`` instead.

This is because the expression ``.//text()`` selects several text nodes (a *node-set*).
When a node-set is converted to a string, which is what happens when it is passed
to a string function, only its first node is used. The ``.`` notation
selects the element itself, and its string value is the whole element text,
including the text of its descendants.

Example::


>>> from scrapy import Selector
>>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')

Using the ``.//text()`` node-set::

>>> sel.xpath("//a[contains(.//text(), 'Next Page')]").extract()
[]

Using the node::

>>> sel.xpath("//a[contains(., 'Next Page')]").extract()
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']

You can see the difference between converting a *node* and a *node-set* to a
string by using the XPath ``string()`` function.

Converting a *node-set* to string::

>>> sel.xpath("string(//a[1]//text())").extract()
[u'Click here to go to the ']

Converting a *node* to string::

>>> sel.xpath("string(//a[1])").extract()
[u'Click here to go to the Next Page']
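
The conversion rule itself can be mimicked in plain Python. The sketch below is only an illustration: it does not use Scrapy, and it assumes that ``itertext()`` from the standard library's ``xml.etree.ElementTree`` is a fair stand-in for the text nodes that ``.//text()`` selects. The variable names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Same markup as in the example above.
a = ET.fromstring('<a href="#">Click here to go to the <strong>Next Page</strong></a>')

# All descendant text nodes in document order, roughly what .//text() selects.
text_nodes = list(a.itertext())

# XPath 1.0 converts a node-set to a string by using its first node only.
node_set_as_string = text_nodes[0] if text_nodes else ""

# The string value of an element is the concatenation of all its descendant text.
element_as_string = "".join(text_nodes)

print("Next Page" in node_set_as_string)  # False: the first text node lacks it
print("Next Page" in element_as_string)   # True: the whole element text has it
```

This is why ``contains(.//text(), 'Next Page')`` finds nothing while ``contains(., 'Next Page')`` matches.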

Beware the difference between //node[1] and (//node)[1]

``//node[1]`` selects all the nodes occurring first under their respective parents.

``(//node)[1]`` selects all the nodes in the document, and then gets only the first of them.

Example::


>>> from scrapy import Selector
>>> sel = Selector(text="""
...     <ul class="list">
...         <li>1</li>
...         <li>2</li>
...         <li>3</li>
...     </ul>
...     <ul class="list">
...         <li>4</li>
...         <li>5</li>
...         <li>6</li>
...     </ul>""")
>>> xp = lambda x: sel.xpath(x).extract()

This gets every ``<li>`` element that is the first ``<li>`` under its parent, whatever that parent is::

>>> xp("//li[1]")
[u'<li>1</li>', u'<li>4</li>']

And this gets the first ``<li>`` element in the whole document::

>>> xp("(//li)[1]")
[u'<li>1</li>']

This gets the first ``<li>`` element under each ``<ul>`` parent::

>>> xp("//ul/li[1]")
[u'<li>1</li>', u'<li>4</li>']

And this gets the first ``<li>`` element under a ``<ul>`` parent in the whole document::

>>> xp("(//ul/li)[1]")
[u'<li>1</li>']
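
The same distinction can be reproduced without Scrapy, as a rough sketch using the standard library's ``xml.etree.ElementTree``. Note the assumptions: ElementTree supports only a subset of XPath and has no parenthesised expressions, so ``(//li)[1]`` is emulated here by indexing the Python list of results, and the lists are wrapped in a ``<div>`` because ElementTree needs a single root element.

```python
import xml.etree.ElementTree as ET

# Same two lists as above, wrapped in a <div> to get a single root element.
root = ET.fromstring(
    "<div>"
    "<ul class='list'><li>1</li><li>2</li><li>3</li></ul>"
    "<ul class='list'><li>4</li><li>5</li><li>6</li></ul>"
    "</div>"
)

# Like //li[1]: the [1] predicate is evaluated separately under each parent.
first_under_each_parent = [li.text for li in root.findall(".//li[1]")]

# Like (//li)[1]: collect every <li> first, then take the first of them all.
first_in_document = root.findall(".//li")[0].text

print(first_under_each_parent)  # ['1', '4']
print(first_in_document)        # 1
```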

When querying by class, consider using CSS

Because an element can contain multiple CSS classes, the XPath way to select elements
by class is the rather verbose::

*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]

If you use ``@class='someclass'`` you may end up missing elements that have
other classes as well, and if you just use ``contains(@class, 'someclass')`` to make up
for that you may end up with more elements than you want, because it also matches
elements whose class name merely contains the string ``someclass``.
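
To see why each of these predicates behaves the way it does, here is a small sketch that mimics them with plain Python string operations. It is not how XPath is evaluated internally; the helper names and the sample ``class`` values are hypothetical, and ``normalize_space`` is only an approximation of the XPath function of the same name.

```python
def normalize_space(value):
    # Rough stand-in for XPath normalize-space(): trim and collapse whitespace.
    return " ".join(value.split())


def matches_exact(class_attr, name):
    # Mimics @class='someclass'
    return class_attr == name


def matches_substring(class_attr, name):
    # Mimics contains(@class, 'someclass')
    return name in class_attr


def matches_token(class_attr, name):
    # Mimics contains(concat(' ', normalize-space(@class), ' '), ' someclass ')
    return f" {name} " in f" {normalize_space(class_attr)} "


# Hypothetical class attribute values, checked against the class "shout".
samples = ["shout", "hero shout", "shouting"]

exact = [matches_exact(s, "shout") for s in samples]
substring = [matches_substring(s, "shout") for s in samples]
token = [matches_token(s, "shout") for s in samples]

print(exact)      # [True, False, False]: misses "hero shout"
print(substring)  # [True, True, True]: wrongly matches "shouting"
print(token)      # [True, True, False]: matches whole class names only
```

Padding both sides with spaces is what turns a substring test into a whole-word test, which is the point of the verbose expression.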

As it turns out, Scrapy selectors allow you to chain selectors, so most of the time
you can just select by class using CSS and then switch to XPath when needed::

>>> from scrapy import Selector
>>> sel = Selector(text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>')
>>> sel.css('.shout').xpath('./time/@datetime').extract()
[u'2014-07-23 19:00']

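This is cleaner than using the verbose XPath trick shown above. Just remember
to start the XPath expressions that follow with ``.``, so that they are evaluated
relative to the elements selected by the CSS query rather than the whole document.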

.. _topics-selectors-ref:

Built-in Selectors reference

