Register EXSLT namespaces by default (resolves #470) #472

Merged
merged 7 commits into from Jan 16, 2014
@@ -240,6 +240,161 @@ XPath specification.

.. _Location Paths: http://www.w3.org/TR/xpath#location-paths

Using EXSLT extensions
----------------------

Being built atop `lxml`_, Scrapy selectors also support some `EXSLT`_ extensions
and come with these pre-registered namespaces to use in XPath expressions:


======  ====================================  ==========================
prefix  namespace                             usage
======  ====================================  ==========================
regexp  http://exslt.org/regular-expressions  `regular expressions`_
set     http://exslt.org/sets                 `set manipulation`_
str     http://exslt.org/strings              `string manipulations`_
math    http://exslt.org/math                 `mathematical operations`_
======  ====================================  ==========================

Regular expressions
~~~~~~~~~~~~~~~~~~~

The ``test()`` function, for example, can prove quite useful when XPath's
``starts-with()`` or ``contains()`` are not sufficient.

Example selecting links in list items whose "class" attribute ends with a digit::

>>> doc = """
... <div>
... <ul>
... <li class="item-0"><a href="link1.html">first item</a></li>
... <li class="item-1"><a href="link2.html">second item</a></li>
... <li class="item-inactive"><a href="link3.html">third item</a></li>
... <li class="item-1"><a href="link4.html">fourth item</a></li>
... <li class="item-0"><a href="link5.html">fifth item</a></li>
... </ul>
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> sel.xpath('//li//@href').extract()
[u'link1.html', u'link2.html', u'link3.html', u'link4.html', u'link5.html']
>>> sel.xpath('//li[regexp:test(@class, "item-\d$")]//@href').extract()
[u'link1.html', u'link2.html', u'link4.html', u'link5.html']
>>>



Set operations
~~~~~~~~~~~~~~

These can be handy for excluding parts of a document tree before
extracting text elements, for example.

Example extracting midrodata (sample content taken from http://schema.org/Product)

@dangra (Member) commented on Jan 14, 2014:

s/midrodata/microdata/?

@redapple (Author, Contributor) commented on Jan 14, 2014:

indeed ;)

with groups of itemscopes and corresponding itemprops::

>>> doc = """
... <div itemscope itemtype="http://schema.org/Product">
... <span itemprop="name">Kenmore White 17" Microwave</span>
... <img src="kenmore-microwave-17in.jpg" alt='Kenmore 17" Microwave' />
... <div itemprop="aggregateRating"
... itemscope itemtype="http://schema.org/AggregateRating">
... Rated <span itemprop="ratingValue">3.5</span>/5
... based on <span itemprop="reviewCount">11</span> customer reviews
... </div>
...
... <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
... <span itemprop="price">$55.00</span>
... <link itemprop="availability" href="http://schema.org/InStock" />In stock
... </div>
...
... Product description:
... <span itemprop="description">0.7 cubic feet countertop microwave.
... Has six preset cooking categories and convenience features like
... Add-A-Minute and Child Lock.</span>
...
... Customer reviews:
...
... <div itemprop="review" itemscope itemtype="http://schema.org/Review">
... <span itemprop="name">Not a happy camper</span> -
... by <span itemprop="author">Ellie</span>,
... <meta itemprop="datePublished" content="2011-04-01">April 1, 2011
... <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
... <meta itemprop="worstRating" content = "1">
... <span itemprop="ratingValue">1</span>/
... <span itemprop="bestRating">5</span>stars
... </div>
... <span itemprop="description">The lamp burned out and now I have to replace
... it. </span>
... </div>
...
... <div itemprop="review" itemscope itemtype="http://schema.org/Review">
... <span itemprop="name">Value purchase</span> -
... by <span itemprop="author">Lucas</span>,
... <meta itemprop="datePublished" content="2011-03-25">March 25, 2011
... <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
... <meta itemprop="worstRating" content = "1"/>
... <span itemprop="ratingValue">4</span>/
... <span itemprop="bestRating">5</span>stars
... </div>
... <span itemprop="description">Great microwave for the price. It is small and
... fits in my apartment.</span>
... </div>
... ...
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> for scope in sel.xpath('//div[@itemscope]'):
... print "current scope:", scope.xpath('@itemtype').extract()
... props = scope.xpath('''
... set:difference(./descendant::*/@itemprop,
... .//*[@itemscope]/*/@itemprop)''')
... print " properties:", props.extract()
... print
...
current scope: [u'http://schema.org/Product']
properties: [u'name', u'aggregateRating', u'offers', u'description', u'review', u'review']

current scope: [u'http://schema.org/AggregateRating']
properties: [u'ratingValue', u'reviewCount']

current scope: [u'http://schema.org/Offer']
properties: [u'price', u'availability']

current scope: [u'http://schema.org/Review']
properties: [u'name', u'author', u'datePublished', u'reviewRating', u'description']

current scope: [u'http://schema.org/Rating']
properties: [u'worstRating', u'ratingValue', u'bestRating']

current scope: [u'http://schema.org/Review']
properties: [u'name', u'author', u'datePublished', u'reviewRating', u'description']

current scope: [u'http://schema.org/Rating']
properties: [u'worstRating', u'ratingValue', u'bestRating']

>>>

Here we first iterate over ``itemscope`` elements, and for each one,
we look for all ``itemprop`` attributes below it and exclude those carried by
elements directly inside another, nested ``itemscope``.
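
The intent of that ``set:difference()`` call can also be sketched in plain Python. The snippet below is not EXSLT and does not use Scrapy: it assumes a small, well-formed XML fragment (so ``itemscope`` is written as ``itemscope=""``) parsed with the standard library's ``xml.etree.ElementTree``, and ``own_itemprops`` is a hypothetical helper name. It also differs slightly from the XPath above, because it skips everything inside a nested scope rather than only the direct children of that scope.

```python
import xml.etree.ElementTree as ET

# Trimmed, XML-compliant variant of the Product sample (an assumption for this sketch).
doc = """
<div itemscope="" itemtype="http://schema.org/Product">
  <span itemprop="name">Kenmore White 17" Microwave</span>
  <div itemprop="offers" itemscope="" itemtype="http://schema.org/Offer">
    <span itemprop="price">$55.00</span>
  </div>
</div>
""".strip()


def own_itemprops(scope):
    """Collect itemprop values that belong to `scope`, ignoring nested scopes."""
    props = []
    for child in scope:
        if "itemprop" in child.attrib:
            props.append(child.get("itemprop"))
        # Only descend into children that do not open a new itemscope.
        if "itemscope" not in child.attrib:
            props.extend(own_itemprops(child))
    return props


root = ET.fromstring(doc)
product_props = own_itemprops(root)             # properties of the Product scope
offer_props = own_itemprops(root.find("div"))   # properties of the nested Offer scope
```

Here ``product_props`` holds ``name`` and ``offers`` while ``price`` is attributed to the nested Offer scope only, which mirrors the grouping shown in the output above.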

Maths
~~~~~

Not that useful in practice, but you never know.

String manipulation
~~~~~~~~~~~~~~~~~~~

In practice, Python's string manipulation outside XPath is usually more
powerful.
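
As a rough illustration of that point, splitting a string into tokens (what ``str:tokenize`` does inside XPath) takes a couple of lines of Python once the value has been extracted. The ``raw`` value below is made up for this sketch and merely stands in for a string returned by ``.extract()``.

```python
import re

# Hypothetical value standing in for one string returned by .extract()
raw = u"  red, green;blue  "

# Split on commas or semicolons, then trim whitespace and drop empty pieces.
tokens = [part.strip() for part in re.split(r"[,;]", raw) if part.strip()]
```

The same list comprehension is a natural place to lowercase, normalize, or filter the tokens, which would be awkward to express in XPath 1.0.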

.. _EXSLT: http://www.exslt.org/
.. _regular expressions: http://www.exslt.org/regexp/index.html
.. _set manipulation: http://www.exslt.org/set/index.html
.. _mathematical operations: http://www.exslt.org/math/index.html
.. _string manipulations: http://www.exslt.org/str/index.html

.. _topics-selectors-ref:

@@ -46,6 +46,45 @@ class Selector(object_ref):
'__weakref__', '_parser', '_csstranslator', '_tostring_method']

_default_type = None
_default_namespaces = {
"regexp": "http://exslt.org/regular-expressions",

# supported in libxslt:
# set:difference
# set:has-same-node
# set:intersection
# set:leading
# set:trailing
"set": "http://exslt.org/sets",

# supported in libxslt:
# math:abs()
# math:acos()
# math:asin()
# math:atan()
# math:atan2()
# math:constant()
# math:cos()
# math:exp()
# math:highest()
# math:log()
# math:lowest()
# math:max()
# math:min()
# math:power()
# math:random()
# math:sin()
# math:sqrt()
# math:tan()
"math": "http://exslt.org/math",

# supported in libxslt:
# str:align
# str:concat
# str:padding
# str:tokenize
"str": "http://exslt.org/strings",
}

def __init__(self, response=None, text=None, type=None, namespaces=None,
_root=None, _expr=None):
@@ -61,7 +100,9 @@ def __init__(self, response=None, text=None, type=None, namespaces=None,
             _root = LxmlDocument(response, self._parser)
 
         self.response = response
-        self.namespaces = namespaces
+        self.namespaces = dict(self._default_namespaces)
+        if namespaces is not None:
+            self.namespaces.update(namespaces)
         self._root = _root
         self._expr = _expr
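
The effect of this hunk can be shown in isolation. The class below is a hypothetical stand-in named ``MiniSelector``, not Scrapy's ``Selector``, and the ``atom`` and ``urn:example:override`` values are invented for the example. It only reproduces the merge logic: defaults are copied per instance, and namespaces passed by the caller are added on top and win on conflicting prefixes.

```python
class MiniSelector(object):
    """Hypothetical stand-in that mimics only the namespace handling."""

    _default_namespaces = {
        "regexp": "http://exslt.org/regular-expressions",
        "set": "http://exslt.org/sets",
        "math": "http://exslt.org/math",
        "str": "http://exslt.org/strings",
    }

    def __init__(self, namespaces=None):
        # Copy first, so instance-level changes never leak into the class defaults.
        self.namespaces = dict(self._default_namespaces)
        if namespaces is not None:
            # Caller-supplied prefixes are added and override defaults on conflict.
            self.namespaces.update(namespaces)


plain = MiniSelector()
custom = MiniSelector(namespaces={
    "atom": "http://www.w3.org/2005/Atom",
    "set": "urn:example:override",  # made-up URI, to show the override
})
```

With this logic, ``plain.namespaces`` contains just the four EXSLT prefixes, while ``custom.namespaces`` additionally has ``atom`` and maps ``set`` to the overriding URI; the class-level dictionary is left untouched.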

@@ -333,3 +333,107 @@ def test_xmlxpathselector(self):
self.assertEqual(xs.select("//div").extract(),
[u'<div><img src="a.jpg"><p>Hello</p></img></div>'])
self.assertRaises(RuntimeError, xs.css, 'div')



class ExsltTestCase(unittest.TestCase):

sscls = Selector

def test_regexp(self):
"""EXSLT regular expression tests"""
body = """
<p><input name='a' value='1'/><input name='b' value='2'/></p>
<div class="links">
<a href="/first.html">first link</a>
<a href="/second.html">second link</a>
<a href="http://www.bayes.co.uk/xml/index.xml?/xml/utils/rechecker.xml">EXSLT match example</a>
</div>
"""
response = TextResponse(url="http://example.com", body=body)
sel = self.sscls(response)

# regexp:test()
self.assertEqual(sel.xpath('//input[regexp:test(@name, "[A-Z]+", "i")]').extract(),
[x.extract() for x in sel.xpath('//input[regexp:test(@name, "[A-Z]+", "i")]')])
self.assertEqual([x.extract() for x in sel.xpath('//a[regexp:test(@href, "\.html$")]/text()')],
[u'first link', u'second link'])
self.assertEqual([x.extract() for x in sel.xpath('//a[regexp:test(@href, "first")]/text()')],
[u'first link'])
self.assertEqual([x.extract() for x in sel.xpath('//a[regexp:test(@href, "second")]/text()')],
[u'second link'])

# regexp:match() is rather special: it returns a node-set of <match> nodes
#[u'<match>http://www.bayes.co.uk/xml/index.xml?/xml/utils/rechecker.xml</match>',
#u'<match>http</match>',
#u'<match>www.bayes.co.uk</match>',
#u'<match></match>',
#u'<match>/xml/index.xml?/xml/utils/rechecker.xml</match>']
self.assertEqual(sel.xpath(''
'regexp:match(//a[regexp:test(@href, "\.xml$")]/@href,'
'"(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)")/text()').extract(),
[u'http://www.bayes.co.uk/xml/index.xml?/xml/utils/rechecker.xml',
u'http',
u'www.bayes.co.uk',
u'',
u'/xml/index.xml?/xml/utils/rechecker.xml'])

# regexp:replace()
self.assertEqual(sel.xpath('regexp:replace(//a[regexp:test(@href, "\.xml$")]/@href,'
'"(\w+)://(.+)(\.xml)", "","https://\\2.html")').extract(),
[u'https://www.bayes.co.uk/xml/index.xml?/xml/utils/rechecker.html'])

def test_set(self):
"""EXSLT set manipulation tests"""
# microdata example from http://schema.org/Event
body="""
<div itemscope itemtype="http://schema.org/Event">
<a itemprop="url" href="nba-miami-philidelphia-game3.html">
NBA Eastern Conference First Round Playoff Tickets:
<span itemprop="name"> Miami Heat at Philadelphia 76ers - Game 3 (Home Game 1) </span>
</a>
<meta itemprop="startDate" content="2016-04-21T20:00">
Thu, 04/21/16
8:00 p.m.
<div itemprop="location" itemscope itemtype="http://schema.org/Place">
<a itemprop="url" href="wells-fargo-center.html">
Wells Fargo Center
</a>
<div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="addressLocality">Philadelphia</span>,
<span itemprop="addressRegion">PA</span>
</div>
</div>
<div itemprop="offers" itemscope itemtype="http://schema.org/AggregateOffer">
Priced from: <span itemprop="lowPrice">$35</span>
<span itemprop="offerCount">1938</span> tickets left
</div>
</div>
"""
response = TextResponse(url="http://example.com", body=body)
sel = self.sscls(response)

self.assertEqual(
sel.xpath('''//div[@itemtype="http://schema.org/Event"]
//@itemprop''').extract(),
[u'url',
u'name',
u'startDate',
u'location',
u'url',
u'address',
u'addressLocality',
u'addressRegion',
u'offers',
u'lowPrice',
u'offerCount']
)
self.assertEqual(sel.xpath('''
set:difference(//div[@itemtype="http://schema.org/Event"]
//@itemprop,
//div[@itemtype="http://schema.org/Event"]
//*[@itemscope]/*/@itemprop)''').extract(),
[u'url', u'name', u'startDate', u'location', u'offers'])