
[MRG] Add LxmlLinkExtractor class similar to SgmlLinkExtractor (#528) #559

Merged
merged 2 commits into from Jun 25, 2014

Conversation

@redapple
Contributor

@redapple redapple commented Jan 23, 2014

I had to change the Selector HTML parser from lxml.etree.HTMLParser to lxml.html.HTMLParser to get the helpful .make_links_absolute() and .iterlinks() methods.

There's still room for some factorization and cleanup (much of the code is bluntly copied from SgmlLinkExtractor).

Implementation of #528.

@redapple
Contributor Author

@redapple redapple commented Jan 26, 2014

The last change makes the thing a bit more readable, but the self.base_url = None line in SgmlLinkExtractor.__init__() feels a bit awkward.

@redapple
Contributor Author

@redapple redapple commented Jan 27, 2014

This question on StackOverflow makes me wonder about rel="nofollow" handling: https://stackoverflow.com/questions/21392222/scrapy-honor-rel-nofollow

@redapple
Contributor Author

@redapple redapple commented Feb 1, 2014

@dangra @kmike @nramirezuy @pablohoffman @darkrho , any thoughts so far?

The extractor could even be renamed to SelectorLinkExtractor but users may expect that it takes a Selector parameter somehow...

deny_extensions=None):

tag_func = lambda x: x in tags
attr_func = lambda x: x in attrs


@kmike

kmike Feb 1, 2014
Member

I think it is better to make sets from tags and attrs before creating the lambdas. And maybe we can then write tag_func = tags.__contains__, not sure about it. Micro-optimizing ftw.
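A minimal sketch of what this suggestion could look like (make_predicates is an illustrative helper, not scrapy code; the default tags and attrs mirror the extractor's constructor arguments):

```python
def make_predicates(tags=('a', 'area'), attrs=('href',)):
    """Build membership predicates once, up front."""
    tag_set = frozenset(tags)    # O(1) lookups instead of scanning a sequence
    attr_set = frozenset(attrs)
    # The sets' bound __contains__ methods replace the lambdas entirely,
    # avoiding a Python-level function call frame per check.
    return tag_set.__contains__, attr_set.__contains__

tag_func, attr_func = make_predicates()
```

Whether the bound-method form measurably beats a lambda over a set is exactly the kind of micro-optimization worth profiling rather than assuming.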

allowed &= not url_has_any_extension(parsed_url, self.deny_extensions)
if allowed and self.canonicalize:
link.url = canonicalize_url(parsed_url)
return allowed


@kmike

kmike Feb 1, 2014
Member

I see that you've just copied this code, but it seems `if allowed` could be added to every condition to avoid extra computations. Or it can be rewritten like

if self.allow_res and not _matches(link.url, self.allow_res):
    return False
if self.deny_res and _matches(link.url, self.deny_res):
    return False
# ...

Also, modifying link.url in _link_allowed method is bad.
Feel free to ignore this all because it is not relevant for this PR :)
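The early-return rewrite proposed above could be fleshed out roughly as follows. This is a standalone sketch: Link, _matches, and link_allowed are simplified stand-ins for the scrapy internals, not the actual implementation.

```python
import re
from collections import namedtuple

# Simplified stand-in for scrapy.link.Link
Link = namedtuple('Link', 'url')

def _matches(url, regexes):
    """Return True if the URL matches any of the compiled regexes."""
    return any(r.search(url) for r in regexes)

def link_allowed(link, allow_res=(), deny_res=()):
    """Early-return version of the filtering logic: bail out at the
    first failing check instead of threading an `allowed` flag
    through every condition."""
    if allow_res and not _matches(link.url, allow_res):
        return False
    if deny_res and _matches(link.url, deny_res):
        return False
    # ... further checks (allowed domains, denied extensions) would follow
    return True
```

Keeping the method side-effect free, with canonicalization done by the caller, also addresses the point about not modifying link.url inside the check.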


@rmax

rmax Feb 2, 2014
Contributor

IMO, that method needs to be revamped to allow easy extension and modification. But that is out of the focus of this PR.

@kmike
Member

@kmike kmike commented Feb 1, 2014

to make it clear: I haven't really looked at this PR, just noticed a couple of suboptimal chunks of code.

@rmax
Contributor

@rmax rmax commented Feb 2, 2014

The extractor could even be renamed to SelectorLinkExtractor but users may expect that it takes a Selector parameter somehow...

Just like the Selector class, I think this class can be named LinkExtractor and be importable from scrapy.contrib.linkextractors. This class is going to deprecate the sgml lx, right?

-        for e, a, l, p in html.iterlinks():
+    def _extract_links(self, selector, response_url, response_encoding):
+        # hacky way to get the underlying lxml parsed document
+        for e, a, l, p in selector._root.iterlinks():


@rmax

rmax Feb 2, 2014
Contributor

I have a spider which uses tags="div" and attrs="data-url" in the link extractor. I think it won't work here, given that .iterlinks returns links only for a predefined set of tags: http://lxml.de/lxmlhtml.html


@redapple

redapple Feb 2, 2014
Author Contributor

Interesting, @darkrho. Do you mind making up an example HTML snippet illustrating this?
We could write a scrapy-specific .iterlinks; the common case is not that complicated:

        for el in self.iter():
            attribs = el.attrib
            tag = _nons(el.tag)
            if tag != 'object':
                for attrib in link_attrs:
                    if attrib in attribs:
                        yield (el, attrib, attribs[attrib], 0)

(https://github.com/lxml/lxml/blob/master/src/lxml/html/__init__.py#L363)
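A scrapy-specific iterator along those lines might look like this sketch. It is written against xml.etree.ElementTree purely for illustration (lxml elements expose the same .iter() and .attrib API); iter_links and its defaults are hypothetical names, not the eventual _iter_links implementation.

```python
from xml.etree import ElementTree

def iter_links(root, tags=('a', 'area'), attrs=('href',)):
    """Walk every element and yield (element, attribute, value) for any
    tag/attribute combination -- unlike lxml's iterlinks(), which only
    inspects a fixed set of link-carrying tags and attributes."""
    tag_set = frozenset(tags)
    attr_set = frozenset(attrs)
    for el in root.iter():
        if el.tag in tag_set:
            for attr in attr_set:
                if attr in el.attrib:
                    yield el, attr, el.attrib[attr]

root = ElementTree.fromstring(
    '<html><body>'
    '<div id="item1" data-url="get?id=1"><a href="#">Item 1</a></div>'
    '<div id="item2" data-url="get?id=2"><a href="#">Item 2</a></div>'
    '</body></html>'
)
found = [(el.tag, attr, val)
         for el, attr, val in iter_links(root, tags=('div',), attrs=('data-url',))]
# found == [('div', 'data-url', 'get?id=1'), ('div', 'data-url', 'get?id=2')]
```

This would cover the tags="div" / attrs="data-url" case from the comment above, which lxml's built-in iterlinks() misses.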


@rmax

rmax Feb 2, 2014
Contributor

@redapple Here is a simplified example:

In [1]: from scrapy.http import HtmlResponse

In [2]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

In [3]: response = HtmlResponse('http://example.com', body="""
<html><body>
<div id="item1" data-url="get?id=1"><a href="#">Item 1</a></div>
<div id="item2" data-url="get?id=2"><a href="#">Item 2</a></div>
</body></html>
""")

In [4]: lx = SgmlLinkExtractor(tags='div', attrs='data-url')

In [5]: lx.extract_links(response)
Out[5]: 
[Link(url='http://example.com/get?id=1', text=u'Item 1', fragment='', nofollow=False),
 Link(url='http://example.com/get?id=2', text=u'Item 2', fragment='', nofollow=False)]


@dangra

dangra Feb 3, 2014
Member

If lxml.html.iterlinks is not good for us, what do you think about basing the new link extractor on top of scrapy selectors? Additionally, it avoids switching the default parser from lxml.etree.HTMLParser to lxml.html.HTMLParser.


@redapple

redapple Feb 3, 2014
Author Contributor

who wants to give a shot at a Selector-based .extract_links()?

I personally think it's easier to implement using lxml.etree.Element.iter() and the element's attrib property than using .xpath() or .css() on Selector objects (it's probably more efficient too).
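The two approaches being weighed can be contrasted in a small sketch. ElementTree's limited findall() stands in here for Selector.xpath() (which uses full lxml XPath); this is illustrative only, not a benchmark.

```python
from xml.etree import ElementTree

doc = ElementTree.fromstring(
    '<html><body><p><a href="/a">A</a></p><a href="/b">B</a></body></html>'
)

# Approach 1: plain element traversal, as suggested above -- a single
# pass with direct access to .attrib, no expression parsing involved.
iter_hrefs = [el.attrib['href'] for el in doc.iter('a')]

# Approach 2: an XPath-style query, standing in for what a
# Selector-based .extract_links() would do via .xpath() or .css().
xpath_hrefs = [el.attrib['href'] for el in doc.findall('.//a[@href]')]

# Both yield the links in document order: ['/a', '/b']
```

Whether traversal actually beats compiled XPath in practice is the kind of question the performance tests mentioned below would settle.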

@@ -16,7 +16,7 @@
 __all__ = ['Selector', 'SelectorList']
 
 _ctgroup = {
-    'html': {'_parser': etree.HTMLParser,
+    'html': {'_parser': html.HTMLParser,


@dangra

dangra Feb 3, 2014
Member

I may be paranoid, but shouldn't we compare parsing performance before/after making this change?
This is going to affect every parsed page, even when link extractors are not used.


@redapple

redapple Feb 3, 2014
Author Contributor

lxml.html.HTMLParser is specifically meant for HTML, and the main difference seems to be the link helpers. I don't think it would make much of a difference (but I have no proof).

But if we say that .iterlinks doesn't fit our needs in the end, we may not need it at all.

And @dangra as you mention parsing performance, I'd be thrilled to have some performance tests in scrapy, if only to validate/invalidate my beloved compiled XPath expressions :)

@nramirezuy
Contributor

@nramirezuy nramirezuy commented Feb 3, 2014

@redapple These are my thoughts about the subject: #578

@kmike
Member

@kmike kmike commented Apr 15, 2014

See also: #331

@redapple
Contributor Author

@redapple redapple commented Jun 20, 2014

@kmike , @darkrho , @nramirezuy , @pablohoffman , probably room for cleanup but this should contain all fixes from your comments.
Question: go that route? or add another extractor based on Selector?

@@ -2,7 +2,7 @@
 XPath selectors based on lxml
 """
 
-from lxml import etree
+from lxml import etree, html


@dangra

dangra Jun 20, 2014
Member

unneeded import of lxml.html

        # hacky way to get the underlying lxml parsed document
        for el, attr, attr_val in self._iter_links(selector._root):
            if self.scan_tag(el.tag):
                if self.scan_attr(attr):


@dangra

dangra Jun 20, 2014
Member

looks like there is no need for nesting two "if"s; it can be reduced to if self.scan_tag(el.tag) and self.scan_attr(attr)

@dangra
Member

@dangra dangra commented Jun 20, 2014

+1 to update docs and merge.

@redapple
Contributor Author

@redapple redapple commented Jun 20, 2014

I can still squash, of course.

@kmike
Member

@kmike kmike commented Jun 22, 2014

LGTM

@redapple redapple changed the title [WIP] Add LxmlLinkExtractor class similar to SgmlLinkExtractor (#528) [MRG] Add LxmlLinkExtractor class similar to SgmlLinkExtractor (#528) Jun 23, 2014
@dangra dangra merged commit 90e6914 into scrapy:master Jun 25, 2014
1 check passed
continuous-integration/travis-ci The Travis CI build passed
Details
@redapple redapple deleted the redapple:lxmlextractor branch Jul 8, 2016
5 participants