[MRG] Add LxmlLinkExtractor class similar to SgmlLinkExtractor (#528) #559
Conversation
The last change makes the thing a bit more readable. But the …
This question on StackOverflow makes me wonder about …
@dangra @kmike @nramirezuy @pablohoffman @darkrho, any thoughts so far? The extractor could even be renamed to …
```python
deny_extensions=None):

tag_func = lambda x: x in tags
attr_func = lambda x: x in attrs
```
kmike
Feb 1, 2014
Member
I think it is better to make sets from `tags` and `attrs` before creating the lambdas. And maybe we can write `tag_func = tags.__contains__` then, not sure about it. Micro-optimizing ftw.
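A minimal sketch of that suggestion, using the names from the snippet above:

```python
# Convert once to sets so membership tests are O(1),
# then bind the sets' __contains__ methods directly.
tags, attrs = set(tags), set(attrs)
tag_func = tags.__contains__    # equivalent to lambda x: x in tags
attr_func = attrs.__contains__  # equivalent to lambda x: x in attrs
```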
```python
allowed &= not url_has_any_extension(parsed_url, self.deny_extensions)
if allowed and self.canonicalize:
    link.url = canonicalize_url(parsed_url)
return allowed
```
kmike
Feb 1, 2014
Member
I see that you've just copied this code, but it seems `if allowed` could be added to every `if` to avoid extra computations. Or it can be rewritten like:

```python
if self.allow_res and not _matches(link.url, self.allow_res):
    return False
if self.deny_res and _matches(link.url, self.deny_res):
    return False
# ...
```

Also, modifying `link.url` in the `_link_allowed` method is bad.
Feel free to ignore this all, because it is not relevant for this PR :)
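For illustration, the early-return style applied to the whole method might look like this (just a sketch; the helpers `_is_valid_url`, `_matches`, `url_is_from_any_domain` and `url_has_any_extension` are assumed to be the ones already used in Scrapy's link-extractor code):

```python
from urlparse import urlparse  # Python 2, matching Scrapy of the time

def _link_allowed(self, link):
    if not _is_valid_url(link.url):
        return False
    if self.allow_res and not _matches(link.url, self.allow_res):
        return False
    if self.deny_res and _matches(link.url, self.deny_res):
        return False
    if self.allow_domains and not url_is_from_any_domain(link.url, self.allow_domains):
        return False
    if self.deny_domains and url_is_from_any_domain(link.url, self.deny_domains):
        return False
    if self.deny_extensions and url_has_any_extension(urlparse(link.url), self.deny_extensions):
        return False
    # canonicalization is deliberately left out: mutating link.url
    # inside a predicate method is the side effect flagged above
    return True
```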
rmax
Feb 2, 2014
Contributor
IMO, that method needs to be revamped to allow easy extension and modification. But that is outside the focus of this PR.
To make it clear: I haven't really looked at this PR, just noticed a couple of suboptimal chunks of code.
Just like the …
```diff
-        for e, a, l, p in html.iterlinks():
+    def _extract_links(self, selector, response_url, response_encoding):
+        # hacky way to get the underlying lxml parsed document
+        for e, a, l, p in selector._root.iterlinks():
```
rmax
Feb 2, 2014
Contributor
I have a spider which uses `tags="div"` and `attrs="data-url"` in the link extractor. I think it won't work here, given that `.iterlinks` returns only a subset of predefined tags: http://lxml.de/lxmlhtml.html
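The limitation is easy to demonstrate with lxml alone (a quick sketch):

```python
import lxml.html

doc = lxml.html.fromstring(
    '<div data-url="get?id=1"><a href="item.html">Item 1</a></div>')
# Only the <a href> is reported; the custom data-url attribute is not
# among lxml's predefined link attributes, so .iterlinks() skips it.
for el, attr, link, pos in doc.iterlinks():
    print('%s %s %s' % (el.tag, attr, link))
# -> a href item.html
```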
redapple
Feb 2, 2014
Author
Contributor
Interesting, @darkrho. Do you mind making up an example HTML snippet illustrating this?
We could rewrite a Scrapy-specific `.iterlinks`, the common case not being that complicated:

```python
for el in self.iter():
    attribs = el.attrib
    tag = _nons(el.tag)
    if tag != 'object':
        for attrib in link_attrs:
            if attrib in attribs:
                yield (el, attrib, attribs[attrib], 0)
```

(https://github.com/lxml/lxml/blob/master/src/lxml/html/__init__.py#L363)
rmax
Feb 2, 2014
Contributor
@redapple Here is a simplified example:

```python
In [1]: from scrapy.http import HtmlResponse

In [2]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

In [3]: response = HtmlResponse('http://example.com', body="""
<html><body>
<div id="item1" data-url="get?id=1"><a href="#">Item 1</a></div>
<div id="item2" data-url="get?id=2"><a href="#">Item 2</a></div>
</body></html>
""")

In [4]: lx = SgmlLinkExtractor(tags='div', attrs='data-url')

In [5]: lx.extract_links(response)
Out[5]:
[Link(url='http://example.com/get?id=1', text=u'Item 1', fragment='', nofollow=False),
 Link(url='http://example.com/get?id=2', text=u'Item 2', fragment='', nofollow=False)]
```
dangra
Feb 3, 2014
Member
If `lxml.html.iterlinks` is not good for us, what do you think about basing the new link extractor on top of Scrapy selectors? Additionally, it avoids the switch of the default parser, now `lxml.etree.HTMLParser`, to `lxml.html.HTMLParser`.
redapple
Feb 3, 2014
Author
Contributor
Who wants to give a shot at a `Selector`-based `.extract_links()`?
I personally think it's easier to implement using `lxml.etree.Element.iter()` and the element's `attrib` property than using `.xpath()` or `.css()` on `Selector` objects (it's probably more efficient, too).
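For what it's worth, a minimal sketch of the `.iter()`-based approach (the function name and the `scan_tags`/`scan_attrs` parameters are made up for the example):

```python
from lxml import etree

def iter_links(document, scan_tags, scan_attrs):
    """Yield (element, attribute, attribute value) for interesting
    attributes on interesting tags."""
    # Passing etree.Element as the tag filter yields elements only,
    # skipping comments and processing instructions.
    for el in document.iter(etree.Element):
        if el.tag not in scan_tags:
            continue
        attribs = el.attrib
        for attrib in scan_attrs:
            if attrib in attribs:
                yield (el, attrib, attribs[attrib])
```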
```diff
@@ -16,7 +16,7 @@
 __all__ = ['Selector', 'SelectorList']

 _ctgroup = {
-    'html': {'_parser': etree.HTMLParser,
+    'html': {'_parser': html.HTMLParser,
```
dangra
Feb 3, 2014
Member
I may be paranoid, but shouldn't we compare the parsing performance before and after making this change?
This is going to affect every parsed page, even when link extractors aren't used.
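Something along these lines would answer the question (a rough sketch; `body.html` stands in for any saved sample page):

```python
import timeit

setup_tmpl = """
from lxml import etree, html
body = open('body.html', 'rb').read()
parser = %s(recover=True, encoding='utf8')
"""

stmt = "etree.fromstring(body, parser=parser)"

for cls in ('etree.HTMLParser', 'html.HTMLParser'):
    elapsed = timeit.timeit(stmt, setup=setup_tmpl % cls, number=200)
    print('%s: %.3fs for 200 parses' % (cls, elapsed))
```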
redapple
Feb 3, 2014
Author
Contributor
`lxml.html.HTMLParser` is specifically meant for HTML, and the main difference seems to be the helpers for links. I don't think it would make that much of a difference (but I have no proof).
But if we decide that `.iterlinks` doesn't fit our needs in the end, we may not need it at all.
And @dangra, since you mention parsing performance, I'd be thrilled to have some performance tests in Scrapy, if only to validate/invalidate my beloved compiled XPath expressions :)
See also: #331
@kmike, @darkrho, @nramirezuy, @pablohoffman: there's probably room for cleanup, but this should contain all the fixes from your comments.
```diff
@@ -2,7 +2,7 @@
 XPath selectors based on lxml
 """

-from lxml import etree
+from lxml import etree, html
```
dangra
Jun 20, 2014
Member
Unneeded import of `lxml.html`.
```python
# hacky way to get the underlying lxml parsed document
for el, attr, attr_val in self._iter_links(selector._root):
    if self.scan_tag(el.tag):
        if self.scan_attr(attr):
```
dangra
Jun 20, 2014
Member
Looks like there is no need for nesting two `if`s; it can be reduced to `if self.scan_tag(el.tag) and self.scan_attr(attr)`.
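Applied to the loop quoted above, the condensed form would read:

```python
for el, attr, attr_val in self._iter_links(selector._root):
    if self.scan_tag(el.tag) and self.scan_attr(attr):
        pass  # build the Link from el and attr_val, as before
```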
+1 to update docs and merge.
I can still squash, of course.
LGTM
I had to change the Selector HTML parser from `lxml.etree.HTMLParser` to `lxml.html.HTMLParser` to get those helpful `.make_links_absolute()` and `.iterlinks()` methods. There's still margin for some factorisation and cleanup (as much of the code is bluntly copied from the `SgmlLinkExtractor` implementation of #528).
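For the record, the difference between the two parsers is easy to see from Python (a sketch):

```python
from lxml import etree, html

body = '<html><body><a href="/page">link</a></body></html>'

el_etree = etree.fromstring(body, parser=etree.HTMLParser())
el_html = etree.fromstring(body, parser=html.HTMLParser())

print(hasattr(el_etree, 'iterlinks'))  # False: plain lxml.etree._Element
print(hasattr(el_html, 'iterlinks'))   # True: lxml.html.HtmlElement

el_html.make_links_absolute('http://example.com')
print(html.tostring(el_html))  # href is now http://example.com/page
```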