Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable smart strings in lxml XPath evaluations #535

Merged
merged 4 commits into from Jan 22, 2014

Conversation

@redapple
Copy link
Contributor

@redapple redapple commented Jan 15, 2014

lxml XPath string results are "smart" by default. They have a getparent() method to know their parent Element (http://lxml.de/xpathxslt.html#xpath-return-values).

There are certain cases where the smart string behaviour is undesirable. For example, it means that the tree will be kept alive by the string, which may have a considerable memory impact in the case that the string value is the only thing in the tree that is actually of interest. For these cases, you can deactivate the parental relationship using the keyword argument smart_strings.

This functionality is not used in Scrapy selectors.

@kmike
Copy link
Member

@kmike kmike commented Jan 15, 2014

+1, a good idea. Even if somebody else used this feature previously, values of the fancy attributes can be obtained using other means, so I think we can live without an option in Selector constructor, and the implementation is fine as-is.

@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Jan 15, 2014

Can you do things like this?

>>> Selector(text='<root><a>A</a><b>B</b></root>').xpath('//a').xpath('../b').extract()
[u'<b>B</b>']
@redapple
Copy link
Contributor Author

@redapple redapple commented Jan 15, 2014

@nramirezuy , yes this works. The "smart" strings thing is for string results, such as attribute values and text nodes.

@dangra
Copy link
Member

@dangra dangra commented Jan 16, 2014

is smart_strings flag new?

I wonder if it can be used to pass this testcase that we disabled on libxml2->lxml migration: https://github.com/scrapy/scrapy/blob/master/scrapy/tests/test_selector.py#L260

     def test_nested_select_on_text_nodes(self):
        # FIXME: does not work with lxml backend [upstream]
        r = self.sscls(text=u'<div><b>Options:</b>opt1</div><div><b>Other</b>opt2</div>')
        x1 = r.xpath("//div/descendant::text()")
        x2 = x1.xpath("./preceding-sibling::b[contains(text(), 'Options')]")
        self.assertEquals(x2.extract(), [u'<b>Options:</b>'])
    test_nested_select_on_text_nodes.skip = "Text nodes lost parent node reference in lxml"
@redapple
Copy link
Contributor Author

@redapple redapple commented Jan 16, 2014

The test case still fails.

With smart_strings=True you have .getparent() available, but not .xpath()
With smart_strings=False you get regular strings back

paul@wheezy:~$ python
Python 2.7.3 (default, Jan  2 2013, 13:56:14) 
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml.html
>>>
>>> doc = """<div><b>Options:</b>opt1</div><div><b>Other</b>opt2</div>"""
>>> root = lxml.html.fromstring(doc)
>>>
>>> root.xpath("//div/descendant::text()")
['Options:', 'opt1', 'Other', 'opt2']
>>> map(lambda e: e.xpath("./preceding-sibling::b[contains(text(), 'Options')]"), root.xpath("//div/descendant::text()"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <lambda>
AttributeError: '_ElementStringResult' object has no attribute 'xpath'
>>>
>>> map(lxml.html.tostring, map(lambda e: e.getparent(), root.xpath("//div/descendant::text()")))
['<b>Options:</b>opt1', '<b>Options:</b>opt1', '<b>Other</b>opt2', '<b>Other</b>opt2']
>>> map(lxml.html.tostring, map(lambda e: e.getparent(), root.xpath("//div/descendant::text()", smart_strings=False)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <lambda>
AttributeError: 'str' object has no attribute 'getparent'
>>> 
>>> map(lxml.html.tostring, map(lambda e: e.getparent(), root.xpath("//div/descendant::text()")))
['<b>Options:</b>opt1', '<b>Options:</b>opt1', '<b>Other</b>opt2', '<b>Other</b>opt2']
>>> 
@dangra
Copy link
Member

@dangra dangra commented Jan 16, 2014

There is no harm on merging it, but looks like smart strings haven't an effect on selectors because Selector.extract() method always returns python unicode (no getparent())

More important is to check other uses of lxml in FormRequest, Sitemap and LxmlLinkExtractor

>>> from scrapy.selector import Selector
>>> sel = Selector(text='<html><body><span>Hey</span></body></html>')
>>> oo = sel.xpath('//span/text()')[0]
>>> oo
<Selector xpath='//span/text()' data=u'Hey'>

>>> oo._root
'Hey'

>>> oo._root.getparent()
<Element span at 0x30592d0>

>>> oo.extract()
u'Hey'

>>> oo.extract().getparent()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-9-236a9fa1a312> in <module>()
----> 1 oo.extract().getparent()

AttributeError: 'unicode' object has no attribute 'getparent'
@redapple
Copy link
Contributor Author

@redapple redapple commented Jan 16, 2014

@dangra do you want this smart strings off in FormRequest, Sitemap and LxmlLinkExtractor in the same PR?

@dangra
Copy link
Member

@dangra dangra commented Jan 16, 2014

Well, as this PR doesn't change too much in Selectors, I was thinking on focusing it on disabling "smart_strings" everywhere in Scrapy. But I'm cool if you prefer to submit multiple PRs.

what do you think about adding a testcase to check that Selector.extract() doesn't return smart strings?

@redapple
Copy link
Contributor Author

@redapple redapple commented Jan 16, 2014

@dangra , I may have missed something but I didn't see any lxml related .xpath() call for text nodes or attributes in FormRequest, LxmlParserLinkExtractor or Sitemap.

@dangra
Copy link
Member

@dangra dangra commented Jan 17, 2014

@redapple: you right, I confused the scope of smart_strings.

It's not mergeable as-is, do you mind rebasing on top of master, no need to squash. thx

@redapple
Copy link
Contributor Author

@redapple redapple commented Jan 20, 2014

Rebased. Dunno why the Travis build failed

@dangra
Copy link
Member

@dangra dangra commented Jan 20, 2014

I triggered a rebuild in Travis console.

any(map(lambda e: hasattr(e._root, 'getparent'), li_text)),
False)
div_class = x.xpath('//div/@class')
self.assertIs(

This comment has been minimized.

@kmike

kmike Jan 20, 2014
Member

I think it is better to use assertTrue/assertFalse here:
self.assertFalse(any(hasattr(e._root, 'getparent') for e in div_class))

@kmike
Copy link
Member

@kmike kmike commented Jan 22, 2014

LGTM

pablohoffman added a commit that referenced this pull request Jan 22, 2014
Disable smart strings in lxml XPath evaluations
@pablohoffman pablohoffman merged commit 2d60f86 into scrapy:master Jan 22, 2014
1 check passed
1 check passed
default The Travis CI build passed
Details
@redapple redapple deleted the redapple:xpath-smartstrings branch Jul 8, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants
You can’t perform that action at this time.