Disable smart strings in lxml XPath evaluations #535
+1, a good idea. Even if somebody used this feature previously, the values of the fancy attributes can be obtained by other means, so I think we can live without an option in the Selector constructor, and the implementation is fine as-is. |
Can you do things like this?
>>> Selector(text='<root><a>A</a><b>B</b></root>').xpath('//a').xpath('../b').extract()
[u'<b>B</b>'] |
@nramirezuy , yes this works. The "smart" strings thing is for string results, such as attribute values and text nodes. |
I wonder if it can be used to pass this testcase that we disabled on the libxml2->lxml migration: https://github.com/scrapy/scrapy/blob/master/scrapy/tests/test_selector.py#L260
def test_nested_select_on_text_nodes(self):
    # FIXME: does not work with lxml backend [upstream]
    r = self.sscls(text=u'<div><b>Options:</b>opt1</div><div><b>Other</b>opt2</div>')
    x1 = r.xpath("//div/descendant::text()")
    x2 = x1.xpath("./preceding-sibling::b[contains(text(), 'Options')]")
    self.assertEquals(x2.extract(), [u'<b>Options:</b>'])
test_nested_select_on_text_nodes.skip = "Text nodes lost parent node reference in lxml" |
The test case still fails. With
paul@wheezy:~$ python
Python 2.7.3 (default, Jan 2 2013, 13:56:14)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml.html
>>>
>>> doc = """<div><b>Options:</b>opt1</div><div><b>Other</b>opt2</div>"""
>>> root = lxml.html.fromstring(doc)
>>>
>>> root.xpath("//div/descendant::text()")
['Options:', 'opt1', 'Other', 'opt2']
>>> map(lambda e: e.xpath("./preceding-sibling::b[contains(text(), 'Options')]"), root.xpath("//div/descendant::text()"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <lambda>
AttributeError: '_ElementStringResult' object has no attribute 'xpath'
>>>
>>> map(lxml.html.tostring, map(lambda e: e.getparent(), root.xpath("//div/descendant::text()")))
['<b>Options:</b>opt1', '<b>Options:</b>opt1', '<b>Other</b>opt2', '<b>Other</b>opt2']
>>> map(lxml.html.tostring, map(lambda e: e.getparent(), root.xpath("//div/descendant::text()", smart_strings=False)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <lambda>
AttributeError: 'str' object has no attribute 'getparent'
>>>
>>> map(lxml.html.tostring, map(lambda e: e.getparent(), root.xpath("//div/descendant::text()")))
['<b>Options:</b>opt1', '<b>Options:</b>opt1', '<b>Other</b>opt2', '<b>Other</b>opt2']
>>> |
There is no harm in merging it, but it looks like smart strings have no effect on selectors, as the session below shows. More important is to check other uses of lxml in FormRequest, Sitemap and LxmlLinkExtractor.
>>> from scrapy.selector import Selector
>>> sel = Selector(text='<html><body><span>Hey</span></body></html>')
>>> oo = sel.xpath('//span/text()')[0]
>>> oo
<Selector xpath='//span/text()' data=u'Hey'>
>>> oo._root
'Hey'
>>> oo._root.getparent()
<Element span at 0x30592d0>
>>> oo.extract()
u'Hey'
>>> oo.extract().getparent()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-9-236a9fa1a312> in <module>()
----> 1 oo.extract().getparent()
AttributeError: 'unicode' object has no attribute 'getparent' |
@dangra do you want smart strings turned off in FormRequest, Sitemap and LxmlLinkExtractor in the same PR? |
Well, as this PR doesn't change too much in Selectors, I was thinking of focusing it on disabling "smart_strings" everywhere in Scrapy. But I'm cool if you prefer to submit multiple PRs. What do you think about adding a testcase to check that |
@dangra , I may have missed something but I didn't see any lxml related |
@redapple: you're right, I confused the scope of smart_strings. It's not mergeable as-is; do you mind rebasing on top of master? No need to squash. Thanks. |
Rebased. Dunno why the Travis build failed |
I triggered a rebuild in Travis console. |
any(map(lambda e: hasattr(e._root, 'getparent'), li_text)),
False)
div_class = x.xpath('//div/@class')
self.assertIs(
kmike (Member), Jan 20, 2014
I think it is better to use assertTrue/assertFalse here:
self.assertFalse(any(hasattr(e._root, 'getparent') for e in div_class))
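The suggested assertion style can be sketched as a small standalone test case. This is only an illustration under stated assumptions: it queries lxml directly instead of going through Scrapy's Selector (so there is no `_root` attribute involved), and the class name, sample markup and test names are hypothetical, not taken from the PR.

```python
import unittest

from lxml import etree


class SmartStringsDisabledTest(unittest.TestCase):
    """Hypothetical test: string results should be plain when smart strings are off."""

    def setUp(self):
        # Made-up sample markup with text nodes and an attribute.
        self.root = etree.fromstring(
            '<div class="menu"><ul><li>one</li><li>two</li></ul></div>')

    def test_text_nodes_have_no_getparent(self):
        li_text = self.root.xpath('//li/text()', smart_strings=False)
        # Guard against a vacuous pass on an empty result list.
        self.assertEqual(li_text, ['one', 'two'])
        self.assertFalse(any(hasattr(e, 'getparent') for e in li_text))

    def test_attribute_values_have_no_getparent(self):
        div_class = self.root.xpath('//div/@class', smart_strings=False)
        self.assertEqual(div_class, ['menu'])
        self.assertFalse(any(hasattr(e, 'getparent') for e in div_class))
```

Compared with `assertIs(any(map(lambda ...)), False)`, the generator expression inside `assertFalse` states the intent in one line and avoids the lambda.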
LGTM |
Disable smart strings in lxml XPath evaluations
lxml XPath string results are "smart" by default. They have a getparent() method to know their parent element (http://lxml.de/xpathxslt.html#xpath-return-values). This functionality is not used in Scrapy selectors.
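As a rough sketch of what the commit message describes, the snippet below compares default lxml string results with those from a compiled `etree.XPath` created with `smart_strings=False`. It does not show how Scrapy wires the flag in; the markup is borrowed from the disabled test case discussed in the conversation, and the variable names are illustrative.

```python
from lxml import etree

# Markup borrowed from the disabled test case (first div only).
root = etree.fromstring('<div><b>Options:</b>opt1</div>')

# Default evaluation: string results remember where they came from.
smart = root.xpath('//text()')
parent_tags = [s.getparent().tag for s in smart]

# Compiled expression with smart strings disabled: plain strings only.
find_text = etree.XPath('//text()', smart_strings=False)
plain = find_text(root)

print(list(smart), parent_tags)
print(list(plain), [hasattr(p, 'getparent') for p in plain])
```

For the tail text `opt1`, getparent() returns the `b` element it follows rather than the enclosing `div`, which matches the interpreter session earlier in the thread.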