adding some xpath tips to selectors docs

eliasdorneles authored and dangra committed Aug 8, 2014
1 parent f8d366a commit 2d103e08e31ffcce0a8ab8e3997bf3b1e8bddadc
Showing with 123 additions and 0 deletions.
  1. +123 −0 docs/topics/selectors.rst
@@ -407,6 +407,129 @@ inside another ``itemscope``.
.. _regular expressions: http://www.exslt.org/regexp/index.html
.. _set manipulation: http://www.exslt.org/set/index.html


Some XPath tips
---------------

Here are some tips that you may find useful when using XPath
with Scrapy selectors. If you are not very familiar with XPath yet,
you may want to take a look at this `XPath tutorial`_ first.

You can also find `more XPath tips like these here.`_

.. _`XPath tutorial`: http://www.zvon.org/comp/r/tut-XPath_1.html
.. _`more XPath tips like these here.`: http://blog.scrapinghub.com/2014/07/17/xpath-tips-from-the-web-scraping-trenches/


Using text nodes in a condition
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When you need to use the text content as an argument to an XPath string function
like ``contains`` or ``substring-after``, avoid using ``.//text()`` and use
just ``.`` instead.

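This is because the expression ``.//text()`` yields a collection of text nodes
(a *node-set*), and when a node-set is converted to a string, which happens when
it is passed as an argument to a string function like ``contains()``, the result
is the text of the first node only. The ``.`` notation, by contrast, selects the
element itself, whose string value is all of its text, including that of its
descendants.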

Example::

    >>> from scrapy import Selector
    >>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')

Using the ``.//text()`` node-set::

    >>> sel.xpath("//a[contains(.//text(), 'Next Page')]").extract()
    []

Using the node::

    >>> sel.xpath("//a[contains(., 'Next Page')]").extract()
    [u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']

You can verify the difference between converting a *node* and converting a
*node-set* to a string by using the ``string()`` XPath function.

Converting a *node-set* to string::

    >>> sel.xpath("string(//a[1]//text())").extract()
    [u'Click here to go to the ']

Converting a *node* to string::

    >>> sel.xpath("string(//a[1])").extract()
    [u'Click here to go to the Next Page']


Beware the difference between //node[1] and (//node)[1]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``//node[1]`` selects all the nodes occurring first under their respective parents.

``(//node)[1]`` selects all the nodes in the document, and then gets only the first of them.

Example::

    >>> from scrapy import Selector
    >>> sel = Selector(text="""
    ...     <ul class="list">
    ...         <li>1</li>
    ...         <li>2</li>
    ...         <li>3</li>
    ...     </ul>
    ...     <ul class="list">
    ...         <li>4</li>
    ...         <li>5</li>
    ...         <li>6</li>
    ...     </ul>""")
    >>> xp = lambda x: sel.xpath(x).extract()

This gets the first ``<li>`` element under each parent, whatever that parent is::

    >>> xp("//li[1]")
    [u'<li>1</li>', u'<li>4</li>']

And this gets the first ``<li>`` element in the whole document::

    >>> xp("(//li)[1]")
    [u'<li>1</li>']

This gets the first ``<li>`` element under each ``<ul>`` parent::

    >>> xp("//ul/li[1]")
    [u'<li>1</li>', u'<li>4</li>']

And this gets the first ``<li>`` element under a ``<ul>`` parent in the whole document::

    >>> xp("(//ul/li)[1]")
    [u'<li>1</li>']
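
The per-parent behaviour of a positional predicate can also be checked with the standard library alone. This is a minimal sketch assuming ``xml.etree.ElementTree``, which supports only a subset of XPath and has no parenthesised form such as ``(//li)[1]``, so indexing the Python list stands in for it. The ``<body>`` wrapper is added here only because ElementTree needs a single root element:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""
<body>
  <ul class="list"><li>1</li><li>2</li><li>3</li></ul>
  <ul class="list"><li>4</li><li>5</li><li>6</li></ul>
</body>""")

# Comparable to //li[1]: the position test is applied under each parent,
# so one <li> per <ul> is returned.
first_per_parent = [li.text for li in root.findall(".//li[1]")]

# Comparable to (//li)[1]: gather every <li> in document order first,
# then take the first item of that single combined list.
first_overall = root.findall(".//li")[0].text

print(first_per_parent)
print(first_overall)
```

The first query keeps one item per list, while the second keeps a single item for the whole document, matching the two Scrapy results shown above.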

When querying by class, consider using CSS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Because an element can contain multiple CSS classes, the XPath way to select elements
by class is the rather verbose::

    *[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]

If you use ``@class='someclass'`` you may end up missing elements that have
other classes as well, and if you just use ``contains(@class, 'someclass')`` to
make up for that you may end up with more elements than you want, because it
also matches any different class name that contains the string ``someclass``.
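
To make the three predicates concrete, here is a plain-Python rendering of what each one computes on the value of a ``class`` attribute. The helper names are hypothetical and exist only for this illustration; they are not part of Scrapy or of XPath, and ``str.split()`` only approximates ``normalize-space()`` (it treats a few more characters as whitespace):

```python
def equals_class(class_attr, name):
    # Mirrors @class='name': the whole attribute value must match exactly.
    return class_attr == name


def contains_class(class_attr, name):
    # Mirrors contains(@class, 'name'): a plain substring test.
    return name in class_attr


def has_class_token(class_attr, name):
    # Mirrors contains(concat(' ', normalize-space(@class), ' '), ' name '):
    # collapse runs of whitespace, pad both ends with a space, then look
    # for the name surrounded by spaces.
    normalized = " ".join(class_attr.split())
    return f" {name} " in f" {normalized} "


for attr in ["shout", "hero shout", "shoutbox", "  hero\tshout "]:
    print(
        repr(attr),
        equals_class(attr, "shout"),
        contains_class(attr, "shout"),
        has_class_token(attr, "shout"),
    )
```

Only the padded token test accepts ``hero shout`` while rejecting ``shoutbox``, which is what the verbose XPath expression is designed to achieve.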

As it turns out, Scrapy selectors allow you to chain selectors, so most of the time
you can just select by class using CSS and then switch to XPath when needed::

    >>> from scrapy import Selector
    >>> sel = Selector(text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>')
    >>> sel.css('.shout').xpath('./time/@datetime').extract()
    [u'2014-07-23 19:00']

This is cleaner than using the verbose XPath trick shown above. Just remember
to start the XPath expressions that follow with ``.``, so that they are
evaluated relative to the elements selected by CSS.


.. _topics-selectors-ref:

Built-in Selectors reference
