Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Promote LxmlLinkExtractor as LxmlExtractor #761

Merged
merged 3 commits into from Jun 25, 2014
Merged

Conversation

@dangra
Copy link
Member

@dangra dangra commented Jun 25, 2014

on top of the amazing work of @redapple in #559, this PR includes an alias to LxmlLinkExtractor importable from scrapy.contrib.linkextractors.LinkExtractor and update docs.

@dangra dangra mentioned this pull request Jun 25, 2014
@dangra
Copy link
Member Author

@dangra dangra commented Jun 25, 2014

I avoided the route of improving how CrawlSpider interacts with Rules and LinkExtractors, also avoided the idea of moving the promoted linkextractor to a short import path. I still think that ideas are important but too much to get integrated in time for 0.24.

from scrapy.contrib.linkextractors import LinkExtractor


.. module:: scrapy.contrib.linkextractors.lxmlhtml

This comment has been minimized.

@kmike

kmike Jun 25, 2014
Member

It seems this directive will be overwritten by the next directive, so the docs would be for scrapy.contrib.linkextractors.sgml.LxmlLinkExtractor.


.. class:: LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None)

LxmlLinkExtractor provides the same API as :class:`SgmlLinkExtractor`,

This comment has been minimized.

@kmike

kmike Jun 25, 2014
Member

I'm not fond of starting LxmlLinkExtractor docs with a reference to (supposely inferior, and maybe pending deprecation) SgmlLinkExtractor. Maybe something like this?

LxmlLinkExtractor is the recommended link extractor with handy filtering options. It is implemented using lxml's robust HTMLParser.

@@ -36,16 +36,88 @@ Built-in link extractors reference
All available link extractors classes bundled with Scrapy are provided in the
:mod:`scrapy.contrib.linkextractors` module.

If you don't know what link extractor to choose, just use the default which is
the same than LxmlLinkExtractor (see below)::

This comment has been minimized.

@kmike

kmike Jun 25, 2014
Member

the same as

@kmike
Copy link
Member

@kmike kmike commented Jun 25, 2014

Other than some minor issues (see comments) it looks good.

@dangra
Copy link
Member Author

@dangra dangra commented Jun 25, 2014

thanks, I think I addressed all your comments and added a warning on SGML usage.

dangra added a commit that referenced this pull request Jun 25, 2014
Promote LxmlLinkExtractor as LxmlExtractor
@dangra dangra merged commit 2ad8db6 into scrapy:master Jun 25, 2014
1 check was pending
1 check was pending
continuous-integration/travis-ci The Travis CI build is in progress
Details
@dangra dangra deleted the dangra:lxmlextractor branch Jun 25, 2014
@kmike
Copy link
Member

@kmike kmike commented Jun 25, 2014

Are there some commits missing?

@dangra
Copy link
Member Author

@dangra dangra commented Jun 25, 2014

yep :(

@dangra
Copy link
Member Author

@dangra dangra commented Jun 25, 2014

I pushed the missing commit as 436c1c8 to master branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants