Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Register EXSLT namespaces by default (resolves #470) #472

Merged
merged 7 commits into from Jan 16, 2014

Conversation

@redapple
Copy link
Contributor

commented Nov 24, 2013

lxml supports interesting EXSLT extensions to XPath 1.0 (via libxslt).
This PR register some of the corresponding namespaces by default in the Selector class.

The most useful IMO are:

  • set manipulation, using prefix set: (namespace http://exslt.org/sets)
  • regular expressions, and especially regexp:test() when one needs to use regex inside a XPath; using prefix regexp: (namespace http://exslt.org/regular-expressions)

One could add re: as an alias prefix for regular expressions.
(Regex support is in fact NOT in libxslt but is implemented in lxml using XPath extensions in Python)

Only regex tests added for now.

@redapple

This comment has been minimized.

Copy link
Contributor Author

commented Nov 25, 2013

Not exactly related but lxml tools/ contains an interesting snippet to have Python's builtins and string methods available in XPath through a "py:" prefix
https://github.com/lxml/lxml/blob/master/tools/xpathgrep.py#L191

@@ -61,7 +101,9 @@ def __init__(self, response=None, text=None, type=None, namespaces=None,
_root = LxmlDocument(response, self._parser)

self.response = response
self.namespaces = namespaces
ns = copy.copy(self._default_namespaces)

This comment has been minimized.

Copy link
@pablohoffman

pablohoffman Nov 25, 2013

Member

How about dict(self._default_namespaces) or self._default_namespaces.copy()?. Both would achieve the same without requiring the copy import.

I think the ns variable is also unnecessary, consider this:

self.namespaces = self._default_namespaces.copy()
self.namespaces.update(namespaces or {})

Finally, how about optimizing the common case to avoid duplicating the _default_namespaces dict on every Selector instantiatation? (which are a lot)

if namespaces is None:
    self.namespaces = self._default_namespaces
else:
    self.namespaces = self._default_namespaces.copy()
    self.namespaces.update(namespaces)

I'm really unsure about the impacts of this dict duplication, I would do some profiling to confirm it's worthwhile.

And there's the concern about further modifications to .namespaces if it's a public attribute.

This comment has been minimized.

Copy link
@dangra

dangra Nov 25, 2013

Member

And there's the concern about further modifications to .namespaces if it's a public attribute.

I think this is enough reason not to assign self._default_namespaces to self.namespaces :)

any post namespace registered post selector creation will propagate to future selectors

@pablohoffman

This comment has been minimized.

Copy link
Member

commented Nov 25, 2013

I left some comments.

I like the motivation of this change. Could you add a couple real/useful examples to the documentation to finish convincing myself and also to serve as documentation?

@redapple

This comment has been minimized.

Copy link
Contributor Author

commented Jan 14, 2014

@pablohoffman how does it look to you?

These can be handy for excluding parts of a document tree before
extracting text elements for example.

Example extracting midrodata (sample content taken from http://schema.org/Product)

This comment has been minimized.

Copy link
@dangra

dangra Jan 14, 2014

Member

s/midrodata/microdata/?

This comment has been minimized.

Copy link
@redapple

redapple Jan 14, 2014

Author Contributor

indeed ;)

@redapple

This comment has been minimized.

Copy link
Contributor Author

commented Jan 14, 2014

would example of math and string functions be helpful?

@dangra

This comment has been minimized.

Copy link
Member

commented Jan 14, 2014

would example of math and string functions be helpful?

at least it will look more useful, I don't see the point of "strings" namespace at the moment. :)

@redapple

This comment has been minimized.

Copy link
Contributor Author

commented Jan 14, 2014

I can remove the string namespace for now

@redapple

This comment has been minimized.

Copy link
Contributor Author

commented Jan 15, 2014

The only string function I could maybe see an interest in is str:concat(node-set) but that should be pretty rare. str:tokenize() is not available for XPath in lxml, only for XSLT. str:padding() and str:align() would be even more rarely used.

Regarding math functions, math:highest() and math:lowest() could be useful
http://www.exslt.org/math/functions/highest/index.html
http://www.exslt.org/math/functions/lowest/index.html

@pablohoffman

This comment has been minimized.

Copy link
Member

commented Jan 15, 2014

I'm a bit against the defensive programming approach here. If only the regex & set namespaces are actually useful in practice, why add the others?. Not to mention that "string" kinda breaks the Python zen ("There should be one-- and preferably only one --obvious way to do it.")

Btw, is "regex" the official namespace?. Otherwise I'd use "re" and drop "regex" (in favor of succinctness).

"py" prefix looks kinda handy but I worry about the performance implications (also a concern for regex that needs to callback to python). Should we add a warning that this feature may slow down selectors substantially and that it should be use with care?. Of course, what'll happen in practice is that 99% of developers won't give a damn (... and that's sad).

@redapple

This comment has been minimized.

Copy link
Contributor Author

commented Jan 15, 2014

I agree http://exslt.org/strings namespace doesn't add much. "math" could have its uses but let's say the use cases are not obvious right now, and could be added at a later stage. Let's drop these 2 namespaces.

"regexp" is the prefix used in http://www.exslt.org/regexp/index.html but we could use another one (or have prefix aliases). In parslepy is use "re". Although re:match() "returns a node set of match elements" so that may confuse devs used to Python's re.match()

lxml implementation of EXSLT regular expression in Python (https://github.com/lxml/lxml/blob/master/src/lxml/extensions.pxi#L455) surely has performance implications but regexes sometimes are the only way around limitations of XPath 1.0

@dangra

This comment has been minimized.

Copy link
Member

commented Jan 16, 2014

@redapple is it ready to merge?

@dangra

This comment has been minimized.

Copy link
Member

commented Jan 16, 2014

it looks good to me. /cc @pablohoffman are you ok to merge now?

@redapple

This comment has been minimized.

Copy link
Contributor Author

commented Jan 16, 2014

@dangra, yes, that's all I have for now :)

pablohoffman added a commit that referenced this pull request Jan 16, 2014
Merge pull request #472 from redapple/exslt
Register EXSLT namespaces by default (resolves #470)

@pablohoffman pablohoffman merged commit 71ada54 into scrapy:master Jan 16, 2014

1 check passed

default The Travis CI build passed
Details
@Digenis

This comment has been minimized.

Copy link
Member

commented Mar 19, 2015

How about an example in the doc that shows how to register the rest of the exslt extensions?
It can't take more than 2 lines, just a sentence "To enable the rest of the exslt extensions supported by lxml pass namespaces={'str': ...} to Selector()"

@redapple redapple deleted the redapple:exslt branch Jul 8, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.